Category Archives: Go

Go 1.1 performance improvements, part 2

This is the second in a three part series exploring the performance improvements in the recent Go 1.1 release.

In part 1 I explored the improvements on amd64 platforms, as well as general improvements available to all via runtime and compiler frontend improvements.

In this article I will focus on the performance of Go 1.1 on 386 machines. The results in this article are taken from linux-386-d5666bad617d-vs-e570c2daeaca.txt.

Go 1 benchmarks on linux/386

When it comes to performance, the 8g compiler is at a disadvantage. The small number of general purpose registers available in the 386 programming model, and the weird restrictions on their use place a heavy burden on the compiler and optimiser. However that did not stop Rémy Oudompheng making several significant contributions to 8g during the 1.1 cycle.

Firstly the odd 387 floating point model was deprecated (it’s still there if you are running very old hardware with the GO386=387 switch) in favor of SSE2 instructions.

Secondly, Rémy put significant effort into porting code generation improvements from 6g into 8g (and 5g, the arm compiler). Where possible code was moved into the compiler frontend, gc, including introducing a framework to rewrite division as simpler shift and multiply operations.

linux-386-baseline

In general the results for linux/386 on this host show improvements that are as good, or in some cases, better than linux/amd64. Unlike linux/amd64, there is no slowdown in the Gzip or Gob benchmarks.

The two small regressions, BinaryTree17 and Fannkuch11, are assumed to be attributable to the garbage collector becoming more precise. This involves some additional bookkeeping to track the size and type of objects allocated on the heap, which shows up in these benchmarks.

net/http benchmarks

The improvements in the net package previously demonstrated in the linux/amd64 article carry over to linux/386. The improvements in the ClientServer benchmarks are not as marked as its amd64 cousin, but nonetheless show a significant improvement overall due to the tighter integration between the runtime and net package.

linux-386-net-http

Runtime microbenchmarks

Like the amd64 benchmarks in part 1, the runtime microbenchmarks show a mixture of results. Some low level operations got a bit slower, while other operations, like map have improved significantly.

linux-386-microbenchmarks

The final two benchmarks, which appear truncated, are actually so large they do not fit on the screen. The improvement is mostly due to this change which introduced a faster low level Equals operation for the strings, bytes and runtime packages. The results speak for themselves.

benchmark                                  old MB/s   new MB/s  speedup
BenchmarkCompareStringBigUnaligned         29.08      1145.48   39.39x
BenchmarkCompareStringBig                  29.09      1253.48   43.09x

Conclusion

Although 8g is not the leading compiler of the gc suite, Ken Thompson himself has said that there are essentially no free registers available on 386, linux/386 shows that it easily meets the 30-40% performance improvement claim. In some benchmarks, compared to Go 1.0, linux/386 beats linux/amd64.

Additionally, due to reductions in memory usage, all the compilers now use around half as much memory when compiling, and as a direct consequence, compile up to 30% faster than their 1.0 predecessors.

I encourage you to review the benchmark data in the autobench repository and if you are able, submit your own results.

In the final article in this series I will investigate the performance improvement Go 1.1 brings to arm platforms. I assure you, I’ve saved the best til last.

Update: thanks to @ajstarks who provided me with higher quality benchviz images.

Go 1.1 performance improvements

This is the first in a series of articles analysing the performance improvements in the Go 1.1 release.

It has been reported (here, and here) that performance improvements of 30-40% are available simply by recompiling your code under Go 1.1. For linux/amd64 this holds true for a wide spectrum of benchmarks. For platforms like linux/386 and linux/arm the results are even more impressive, but I’m putting the cart before the horse.

A note about gccgo. This series focuses on the contributions that the improvements to the gc series of compilers (5g, 6g and 8g) have made to Go 1.1′s performance. gccgo benefits indirectly from these improvements as it shares the same runtime and standard library, but is not the focus of this benchmarking series.

Go 1.1 features several improvements in the compilers, runtime and standard library that are directly attributable for the resulting improvements in program speed. Specifically

  • Code generation improvements across all three gc compilers, including better register allocation, reduction in redundant indirect loads, and reduced code size.
  • Improvements to inlining, including inlining of some builtin function calls and compiler generated stub methods when dealing with interface conversions.
  • Reduction in stack usage, which reduces pressure on stack size, leading to fewer stack splits.
  • Introduction of a parallel garbage collector. The collector remains mark and sweep, but the phases can now utillise all CPUs.
  • More precise garbage collection, which reduces the size of the heap, leading to lower GC pause times.
  • A new runtime scheduler which can make better decisions when scheduling goroutines.
  • Tighter integration of the scheduler with the net package, leading to significantly decreased packet processing latencies and higher throughput.
  • Parts of the runtime and standard library have been rewritten in assembly to take advantage of specific bulk move or crypto instructions.

Introducing autobench

Few things irk me more than unsubstantiated, unrepeatable benchmarks. As this series is going to throw out a lot of numbers, and draw some strong conclusions, it was important for me to provide a way for people to verify my results on their machines.

To this end I have built a simple make based harness which can be run on any platform that Go supports to compare the performance of a set of synthetic benchmarks against Go 1.0 and Go 1.1. While the project is still being developed, it has generated a lot of useful data which is captured in the repository. You can find the project on Github.

https://github.com/davecheney/autobench

I am indebted to Go community members who submitted benchmark data from their machines allowing me to make informed conclusions about the relative performance of Go 1.1.

If you are interested in participating in autobench there will be a branch which tracks the performance of Go 1.1 against tip opening soon.

A picture speaks a thousand words

To better visualise the benchmark results, AJ Starks has produced a wonderful tool, benchviz which turns the dry text based output of misc/benchcmp into rather nice graphs. You can read all about benchviz on AJ’s blog.

http://mindchunk.blogspot.com.au/2013/05/visualizing-go-benchmarks-with-benchviz.html

Following a tradition set by the misc/benchcmp tool, improvements, be they a reduction in run time, or an increase in throughput, are shown as bars extending towards the right. Regressions, fall back to the left.

Go 1 benchmarks on linux/amd64

The remainder of this post will focus on linux/amd64 performance. The 6g compiler is considered to be the flagship of the gc compiler suite. In addition to code generation improvements in the front and back ends, performance critical parts of the standard library and runtime have been rewritten in assembly to take advantage of SSE2 instructions.

The data for the remainder of this article is taken from the results file linux-amd64-d5666bad617d-vs-e570c2daeaca.txt.

bm0

 

The go1 benchmark suite, while being a synthetic benchmark, attempts to capture some real world usages of the main packages in the standard library. In general the results support the hypothesis of a broad 30-40% improvement. Looking at the results submitted to the autobench repository it is clear that GobEncode and Gzip have regressed and issues 5165 and 5166 have been raised, respectively  In the latter case, the switch to 64 bit ints is assumed to be at least partially to blame.

net/http benchmarks

This set of benchmarks are extracted from the net/http package and demonstrated the work that Brad Fitzpatrick and Dmitry Vyukov, and many others, have put into net and net/http packages.

bm2

 

Of note in this benchmark set are the improvements in ReadRequest benchmarks, which attempt to benchmark the decoding a HTTP request. The improvements in the ClientServerParallel benchmarks are not currently available across all amd64 platforms, as some of them have no support for the new runtime integration with the net package. Finishing support for the remaining BSD and Windows platforms is a focus for the 1.2 cycle.

Runtime microbenchmarks

The final set of benchmarks presented here are extracted from the runtime package.

bm1

 

The runtime benchmarks represent micro benchmarks of very low level parts of the runtime package.

The obvious regression is the first Append benchmark. While in wall time, the benchmark has increased from 36 ns/op to 100 ns/op, this shows that for some append use cases there has been a regression. This may have already been addressed in tip by CL 9360043.

The big wins in the runtime benchmarks are the amazing new map code by khr which addresses issue 3886, the reduction in overhead of channel operations (thanks to Dmitry’s new scheduler), improvements in operations involving complex128 operations, and speedups in hash and memmove operations which were rewritten in 64bit assembly.

Conclusion

For linux/amd64 on modern 64 bit Intel CPUs, the 6g compiler and runtime can generate significantly faster code. Other amd64 platforms share similar speedups, although the specific improvements vary. I encourage you to review the benchmark data in the autobench repository and if you are able, submit your own results.

In subsequent articles I will investigate the performance improvement Go 1.1 brings to 386 and arm platforms.

Update: thanks to @ajstarks who provided me with higher quality benchviz images.

Go 1.1 tarballs for linux/arm

For the time poor ARM fans in the room, I’ve updated my tarball distributions to Go 1.1. These tarballs are built using the same misc/dist tool that makes the official builds on the golang.org download page.

You can find the link at the Unofficial ARM tarballs for Go item at the top of this page. Please address any bug reports or comments to me directly.

There are also a number of other ways to obtain Go 1.1 appearing on the horizon. For example, if you are using Debian Sid, Go 1.1 is available now. This version has been imported into Ubuntu Saucy (which will become 13.10), although at this time it remains in the proposed channel.

Rest assured I will not be shy in announcing when Go 1.1 has wider availability in Ubuntu.

Go and Juju at Canonical slides posted

This month I had the privilege of presenting a talk at the GoSF meetup to 120 keen Gophers.

I was absolutely blown away by the Iron.io/HeavyBit offices. It was a fantastic presentation space with a professional sound and video crew to stream the meetup straight to G+.

The slides are available on my GitHub account, but a more convenient way to consume them is Gary Burd’s fantastic talks.godoc.org site.

http://talks.godoc.org/github.com/davecheney/gosf/5nines.slide#1

If you’re interested in finding out more about the Juju project itself, you can find us on the project page, https://launchpad.net/juju-core/ or #juju / #juju-dev on IRC.

Curious Channels

Channels are a signature feature of the Go programming language. Channels provide a powerful way to reason about the flow of data from one goroutine to another without the use of locks or critical sections.

Today I want to talk about two important properties of channels that make them useful for controlling not just data flow within your program, but the flow of control as well.

A closed channel never blocks

The first property I want to talk about is a closed channel. Once a channel has been closed, you cannot send a value on this channel, but you can still receive from the channel.

package main

import "fmt"

func main() {
        ch := make(chan bool, 2)
        ch <- true
        ch <- true
        close(ch)

        for i := 0; i < cap(ch) +1 ; i++ {
                v, ok := <- ch
                fmt.Println(v, ok)
        }
}

In this example we create a channel with a buffer of two, fill the buffer, then close it.

true true
true true
false false

Running the program shows we retrieve the first two values we sent on the channel, then on our third attempt the channel gives us the values of false and false. The first false is the zero value for that channel’s type, which is false, as the channel is of type chan bool. The second indicates the open state of the channel, which is now false, indicating the channel is closed. The channel will continue to report these values infinitely. As an experiment, alter this example to receive from the channel 100 times.

Being able to detect if your channel is closed is a useful property, it is used in the range over channel idiom to exit the loop once a channel has been drained.

package main

import "fmt"

func main() {
        ch := make(chan bool, 2)
        ch <- true
        ch <- true
        close(ch)

        for v := range ch {
                fmt.Println(v) // called twice
        }
}

but really comes into its own when combined with select. Let’s start with this example

package main

import (
        "fmt"
        "sync"
        "time"
)

func main() {
        finish := make(chan bool)
        var done sync.WaitGroup
        done.Add(1)
        go func() {
                select {
                case <-time.After(1 * time.Hour):
                case <-finish:
                }
                done.Done()
        }()
        t0 := time.Now()
        finish <- true // send the close signal
        done.Wait()    // wait for the goroutine to stop
        fmt.Printf("Waited %v for goroutine to stop\n", time.Since(t0))
}

Running the program, on my system, gives a low wait duration, hence it is clear that the goroutine does not wait the full hour before calling done.Done()

Waited 129.607us for goroutine to stop

But there are a few problems with this program. The first is the finish channel is not buffered, so the send to finish may block if the receiver forgot to add finish to their select statement. You could solve that problem by wrapping the send in a select block to make it non blocking, or making the finish channel buffered. However what if you had many goroutines listening on the finish channel, you would need to track this and remember to send the correct number of times to the finish channel. This might get tricky if you aren’t in control of creating these goroutines; they may be being created in another part of your program, perhaps in response to incoming requests over the network.

A nice solution to this problem is to leverage the property that a closed channel is always ready to receive. Using this property we can rewrite the program, now including 100 goroutines, without having to keep track of the number of goroutines spawned, or correctly size the finish channel

package main

import (
        "fmt"
        "sync"
        "time"
)

func main() {
        const n = 100
        finish := make(chan bool)
        var done sync.WaitGroup
        for i := 0; i < n; i++ { 
                done.Add(1)
                go func() {
                        select {
                        case <-time.After(1 * time.Hour):
                        case <-finish:
                        }
                        done.Done()
                }()
        }
        t0 := time.Now()
        close(finish)    // closing finish makes it ready to receive
        done.Wait()      // wait for all goroutines to stop
        fmt.Printf("Waited %v for %d goroutines to stop\n", time.Since(t0), n)
}

On my system, this returns

Waited 231.385us for 100 goroutines to stop

So what is going on here? As soon as the finish channel is closed, it becomes ready to receive. As all the goroutines are waiting to receive either from their time.After channel, or finish, the select statement is now complete and the goroutines exits after calling done.Done() to deincrement the WaitGroup counter. This powerful idiom allows you to use a channel to send a signal to an unknown number of goroutines, without having to know anything about them, or worrying about deadlock.

Before moving on to the next topic, I want to mention a final simplification that is preferred by many Go programmers. If you look at the sample program above, you’ll note that we never send a value on the finish channel, and the receiver always discards any value received. Because of this it is quite common to see the program written like this:

package main

import (
        "fmt"
        "sync"
        "time"
)

func main() {
        finish := make(chan struct{})
        var done sync.WaitGroup
        done.Add(1)
        go func() {
                select {
                case <-time.After(1 * time.Hour):
                case <-finish:
                }
                done.Done()
        }()
        t0 := time.Now()
        close(finish)
        done.Wait()
        fmt.Printf("Waited %v for goroutine to stop\n", time.Since(t0))
}

As the behaviour of the close(finish) relies on signalling the close of the channel, not the value sent or received, declaring finish to be of type chan struct{} says that the channel contains no value; we’re only interested in its closed property.

A nil channel always blocks

The second property I want to talk about is polar opposite of the closed channel property. A nil channel; a channel value that has not been initalised, or has been set to nil will always block. For example

package main

func main() {
        var ch chan bool
        ch <- true // blocks forever
}

will deadlock as ch is nil and will never be ready to send. The same is true for receiving

package main

func main() {
        var ch chan bool
        <- ch // blocks forever
}

This might not seem important, but is a useful property when you want to use the closed channel idiom to wait for multiple channels to close. For example

// WaitMany waits for a and b to close.
func WaitMany(a, b chan bool) {
        var aclosed, bclosed bool
        for !aclosed || !bclosed {
                select {
                case <-a:
                        aclosed = true
                case <-b:
                        bclosed = true
                }
        }
}

WaitMany() looks like a good way to wait for channels a and b to close, but it has a problem. Let’s say that channel a is closed first, then it will always be ready to receive. Because bclosed is still false the program can enter an infinite loop, preventing the channel b from ever being closed.

A safe way to solve the problem is to leverage the blocking properties of a nil channel and rewrite the program like this

package main

import (
        "fmt"
        "time"
)

func WaitMany(a, b chan bool) {
        for a != nil || b != nil {
                select {
                case <-a:
                        a = nil 
                case <-b:
                        b = nil
                }
        }
}

func main() {
        a, b := make(chan bool), make(chan bool)
        t0 := time.Now()
        go func() {
                close(a)
                close(b)
        }()
        WaitMany(a, b)
        fmt.Printf("waited %v for WaitMany\n", time.Since(t0))
}

In the rewritten WaitMany() we nil the reference to a or b once they have received a value. When a nil channel is part of a select statement, it is effectively ignored, so niling a removes it from selection, leaving only b which blocks until it is closed, exiting the loop without spinning.

Running this on my system gives

waited 54.912us for WaitMany

In conclusion, the simple properties of closed and nil channels are powerful building blocks that can be used to create highly concurrent programs that are simple to reason about.

What is the zero value, and why is it useful ?

Let’s start with the Go language spec on the zero value.

When memory is allocated to store a value, either through a declaration or a call of make or new, and no explicit initialization is provided, the memory is given a default initialization. Each element of such a value is set to the zero value for its type: false for booleans, 0 for integers, 0.0 for floats, "" for strings, and nil for pointers, functions, interfaces, slices, channels, and maps. This initialization is done recursively, so for instance each element of an array of structs will have its fields zeroed if no value is specified.

This property of always setting a value to a known default is important for safety and correctness of your program, but can also make your Go programs simpler and more compact. This is what Go programmers talk about when they say “give your structs a useful zero value”.

Here is an example using sync.Mutex, which is designed to be usable without explicit initialization. The sync.Mutex contains two unexported integer fields. Thanks to the zero value those fields will be set to will be set to 0 whenever a sync.Mutex is declared.

package main

import "sync"

type MyInt struct {
        mu sync.Mutex
        val int
}

func main() {
        var i MyInt

        // i.mu is usable without explicit initialisation.
        i.mu.Lock()      
        i.val++
        i.mu.Unlock()
}

Another example of a type with a useful zero value is bytes.Buffer. You can decare a bytes.Buffer and start Reading or Writeing without explicit initialisation. Note that io.Copy takes an io.Reader as its second argument so we need to pass a pointer to b.

package main

import "bytes"
import "io"
import "os"

func main() {
        var b bytes.Buffer
        b.Write([]byte("Hello world"))
        io.Copy(os.Stdout, &b)
}

A useful property of slices is their zero value is nil. This means you don’t need to explicitly make a slice, you can just declare it.

package main

import "fmt"
import "strings"

func main() {
        // s := make([]string, 0)
        // s := []string{}
        var s []string

        s = append(s, "Hello")
        s = append(s, "world")
        fmt.Println(strings.Join(s, " "))
}

Note: var s []string is similar to the two commented lines above it, but not identical. It is possible to detect the difference between a slice value that is nil and a slice value that has zero length. The following code will output false.

package main

import "fmt"
import "reflect"

func main() {
        var s1 = []string{}
        var s2 []string
        fmt.Println(reflect.DeepEqual(s1, s2))
}

A surprising, but useful, property of nil pointers is you can call methods on types that have a nil value. This can be used to provide default values simply.

package main

import "fmt"

type Config struct {
        path string
}

func (c *Config) Path() string {
        if c == nil {
                return "/usr/home"
        }
        return c.path
}

func main() {
        var c1 *Config
        var c2 = &Config{
                path: "/export",
        }
        fmt.Println(c1.Path(), c2.Path())
}

With thanks to Jan MerclDoug LandauerStefan Nilsson, and Roger Peppe from the wonderful Go+ community for their feedback and suggestions.

Go, the language for emulators

So, I hear you like emulators. It turns out that Go is a great language for writing retro-computing emulators. Here are the ones that I have tried so far:

trs80 by Lawrence Kesteloot

I really liked this one because it avoids the quagmire of OpenGL or SDL dependencies and runs in your web browser. I had a little trouble getting it going so if you run into problems remember to execute the trs80 command in the source directory itself. If you’ve used go get github.com/lkesteloot/trs80 then it will be $GOPATH/src/github.com/lkesteloot/trs80.

trs80

GoSpeccy by Andrea Fazzi

GoSpeccy was the first emulator written in Go that I am aware of, Andrea has been quietly hacking away well before Go hit 1.0. I’ve even been able to get GoSpeccy running on a Raspberry Pi, X forwarded back to my laptop. Here is a screenshot running the Fire104b intro by Andrew Gerrand

GoSpeccy on a Raspberry Pi

Fergulator by Scott Ferguson

Like GoSpeccy, Fergulator shows the power of Go as a language for writing complex emulators, and the power of go get to handle packages with complex dependencies. Here are the two commands that took me from having no NES emulation on my laptop, to full NES emulation on my laptop.

lucky(~) sudo apt-get install libsdl1.2-dev libsdl-gfx1.2-dev libsdl-image1.2-dev libglew1.6-dev libxrandr-dev
lucky(~) % go get github.com/scottferg/Fergulator

Fergulator

sms by Andrea Fazzi

What’s this? Another emulator for Andrea Fazzi ? Why, yes it is. Again, super easy to install with go get -v github.com/remogatto/sms. Sadly there are no sample roms included with sms due to copyright restrictions, so no screenshot. Update: Andrea has included an open source ROM so we can have a screenshot.

sms

Update: Several Gophers from the wonderful Go+ community commented that there are still more emulators that I haven’t mentioned.

Testing Go on the Raspberry Pi running FreeBSD

This afternoon Oleksandr Tymoshenko posted an update on the state of FreeBSD on ARMv6 devices. The takeaway for Raspberry Pi fans is things are working out nicely. A few days ago a usable image was published allowing me to do some serious testing of the Go freebsd/arm port.

So, what works? Pretty much everything

[root@raspberry-pi ~]# go run src/hello.go
Hello, 世界

For the moment cgo and hardware floating point is disabled. I disabled cgo support early in testing after some segfaults, but it shouldn’t be too hard to fix. The dist tool is currently failing to auto detect1 support for any floating point hardware.

[root@raspberry-pi ~]# go tool dist env
GOROOT="/root/go"
GOBIN="/root/go/bin"
GOARCH="arm"
GOOS="freebsd"
GOHOSTARCH="arm"
GOHOSTOS="freebsd"
GOTOOLDIR="/root/go/pkg/tool/freebsd_arm"
GOCHAR="5"
GOARM="5"

This could be because the auto detection is broken on freebsd/arm, but possibly this kernel image does not enable the floating point unit. I’ll update this post when I’ve done some more testing.

At the moment performance is not great, even by Pi standards. The SDCard runs in 1bit 25mhz mode, and I believe the caches are disabled or set to write though. The image has been stable for me, allowing me to compile Go, and various ports required by the build scripts.

[root@raspberry-pi ~/go/test/bench/go1]# go test -bench=.
testing: warning: no tests to run
PASS
BenchmarkBinaryTree17    1        166473841000 ns/op
BenchmarkFannkuch11      1        83260837000 ns/op
BenchmarkGobDecode       5         518688800 ns/op           1.48 MB/s
BenchmarkGobEncode      10         225905200 ns/op           3.40 MB/s
BenchmarkGzip            1        16926476000 ns/op          1.15 MB/s
BenchmarkGunzip          1        2849252000 ns/op           6.81 MB/s
BenchmarkJSONEncode      1        3149797000 ns/op           0.62 MB/s
BenchmarkJSONDecode      1        6253162000 ns/op           0.31 MB/s
BenchmarkMandelbrot200   1        20880387000 ns/op
BenchmarkParse          10         250097600 ns/op           0.23 MB/s
BenchmarkRevcomp         5         279384200 ns/op           9.10 MB/s
BenchmarkTemplate        1        7347360000 ns/op           0.26 MB/s
ok      _/root/go/test/bench/go1        380.408s

If you are interested in experimenting with FreeBSD on your Pi, or testing Go on freebsd/arm, please get in touch with me.

Update: As of 6th Jan, 2013, benchmarks and IO have improved.

BenchmarkGobDecode             5         482796600 ns/op           1.59 MB/s
BenchmarkGobEncode            10         226637900 ns/op           3.39 MB/s
BenchmarkGzip          1        15986424000 ns/op          1.21 MB/s
BenchmarkGunzip        1        2553481000 ns/op           7.60 MB/s
BenchmarkJSONEncode            1        2967743000 ns/op           0.65 MB/s
BenchmarkJSONDecode            1        6014558000 ns/op           0.32 MB/s
BenchmarkMandelbrot200         1        19312855000 ns/op
BenchmarkParse        10         238778300 ns/op           0.24 MB/s
BenchmarkRevcomp               5         307852000 ns/op           8.26 MB/s
BenchmarkTemplate              1        6767514000 ns/op           0.29 MB/s

1. Did you know that Go automatically detects the floating point capabilities of the machine it is built on ?

The Go Programming Language (2009)

This month, Go turns three, and this is the video that started it all.

As a testiment to skills of Pike, Thompson and Griesemer, the ideals presented in 2009 have survived virtually unaltered into the Go 1.0 release earlier this year. Rewatching this video recently I was reminded that when changes were required they were almost always to the standard library, which underwent constant revision (aided by the gofix tool) until the 1.0 code freeze. From my notes, the important language level changes were

  • By early 2010, semicolons had been removed from the written syntax (although still present implicitly inside the compiler).
  • The semantics of non-blocking send and recieve and channel closure were explored and altered a number of times before arriving at their final form.
  • Maps dropped the confusing map[key] = nil, false deletion form, replaced with more regular delete(map, key), although some bemoaned the addition of another language builtin.
  • The constant literal syntax has been improved to make it clearer when constructing large constant literal forms.
  • Lastly, the builtin error type replaced the original os.Error interface type.

 

Notes on exploring the compiler flags in the Go compiler suite

I’ve been doing some work improving the code generation of the 5g compiler, which is the Go compiler for arm. These notes also apply to the 6g and 8g compilers for amd64 and 386 respectively.

For this discussion we’ll use a very simple package.

package addr

func addr(s[]int) *int {
        return &s[2]
}

To see the assembly produced by compiling this package we use the -S flag. -S can be passed directly to the compiler with go tool 5g -S addr.go, but it is simpler (and more portable) to use the -gcflags flag on the go tool itself.

% go build -gcflags=-S addr.go
# command-line-arguments

--- prog list "addr" ---
0000 (/home/dfc/src/addr.go:3) TEXT addr+0(SB),$0-16
0001 (/home/dfc/src/addr.go:4) MOVW $s+0(FP),R0
0002 (/home/dfc/src/addr.go:4) MOVW 4(R0),R1
0003 (/home/dfc/src/addr.go:4) CMP $2,R1,
0004 (/home/dfc/src/addr.go:4) BHI ,6(APC)
0005 (/home/dfc/src/addr.go:4) BL ,runtime.panicindex+0(SB)
0006 (/home/dfc/src/addr.go:4) MOVW 0(R0),R0
0007 (/home/dfc/src/addr.go:4) ADD $8,R0
0008 (/home/dfc/src/addr.go:4) MOVW R0,.noname+12(FP)
0009 (/home/dfc/src/addr.go:4) RET ,

This is quite a lot of code for a one line function. One of the reasons for this is s is a slice, whose length is not known at compile time, so the compiler must insert a bounds check. We can tell the compiler to not emit bounds checks with the -B flag.

% go build -gcflags=-SB addr.go
# command-line-arguments

--- prog list "addr" ---
0000 (/home/dfc/src/addr.go:3) TEXT     addr+0(SB),$0-16
0001 (/home/dfc/src/addr.go:4) MOVW     $s+0(FP),R0
0002 (/home/dfc/src/addr.go:4) MOVW     0(R0),R0
0003 (/home/dfc/src/addr.go:4) ADD      $8,R0
0004 (/home/dfc/src/addr.go:4) MOVW     R0,.noname+12(FP)
0005 (/home/dfc/src/addr.go:4) RET      ,

It is important to note that -B is an unsupported flag. The goal of Go is a safe language, one where array subscripts are bounds checked when they are not provably safe. Go already elides bounds checks when you use range loops, and future compilers will improve this. It is also important to note that none of the builders test -B so it might even generate incorrect code. In summary, when the compiler improves, -B will go away, so don’t get too attached.

One other interesting flag is -N, which will disable the optimisation pass in the compiler

% go build -gcflags=-SN addr.go
# command-line-arguments

--- prog list "addr" ---
0000 (/home/dfc/src/addr.go:3) TEXT     addr+0(SB),$0-16
0001 (/home/dfc/src/addr.go:4) MOVW     $s+0(FP),R0
0002 (/home/dfc/src/addr.go:4) MOVW     R0,R0
0003 (/home/dfc/src/addr.go:4) MOVW     4(R0),R1
0004 (/home/dfc/src/addr.go:4) CMP      $2,R1,
0005 (/home/dfc/src/addr.go:4) BHI      ,8(APC)
0006 (/home/dfc/src/addr.go:4) BL       ,runtime.panicindex+0(SB)
0007 (/home/dfc/src/addr.go:4) UNDEF    ,
0008 (/home/dfc/src/addr.go:4) MOVW     0(R0),R0
0009 (/home/dfc/src/addr.go:4) ADD      $8,R0
0010 (/home/dfc/src/addr.go:4) MOVW     R0,.noname+12(FP)
0011 (/home/dfc/src/addr.go:4) RET      ,
0012 (/home/dfc/src/addr.go:5) BL       ,runtime.throwreturn+0(SB)
0013 (/home/dfc/src/addr.go:5) RET      ,

I think the only thing that is useful about this example is, it’s good thing the optimiser is on by default because there are some strange things going on here, for example line 0002, and the unreachable branch at line 0012.

The last thing to talk about is the output of 5g is not the final code that is executed. Aside from the usual work of a linker, 5l does several transformations on the code which are important to understand.

func addr(s[]int) *int {
   10c00:       e59a1000        ldr     r1, [sl]
   10c04:       e15d0001        cmp     sp, r1
   10c08:       33a01004        movcc   r1, #4
   10c0c:       33a02010        movcc   r2, #16
   10c10:       31a0300e        movcc   r3, lr
   10c14:       3b00668c        blcc    2a64c 
   10c18:       e52de004        push    {lr}            ; (str lr, [sp, #-4]!)
        return &s[2]
   10c1c:       e28d0008        add     r0, sp, #8
   10c20:       e5901004        ldr     r1, [r0, #4]
   10c24:       e3510002        cmp     r1, #2
   10c28:       8a000000        bhi     10c30 
   10c2c:       eb0035d5        bl      1e388 
   10c30:       e5900000        ldr     r0, [r0]
   10c34:       e2800008        add     r0, r0, #8
   10c38:       e58d0014        str     r0, [sp, #20]
   10c3c:       e49df004        pop     {pc}            ; (ldr pc, [sp], #4)

Here we use objdump -dS to dump the addr function as it is compiled into the executable. The first six instructions, starting at 10c00, are the function preamble that deals with segmented stacks which is inserted automatically by the 5l.

Taking it further

There are several other compiler flags which are useful when debugging or optimising your Go code.

  • -g will output the steps a the compiler is a taking at a very low level. The discussion of the output format is outside the scope of this article. Personally I find it easier to add a warn statement which will tell me the source line the compiler was working on at the time.
  • -l will disable inlining (but still retain other compiler optimisations). This is very useful if you are investigating small methods, but can’t find them in objdump.
  • -m is mainly a frontend switch and outputs details about escape analysis and inlining choices.