The Go Programming Language (2009)

This month, Go turns three, and this is the video that started it all for me.

As a testiment to skills of Pike, Thompson and Griesemer, the ideals presented in 2009 have survived virtually unaltered into the Go 1.0 release earlier this year. Rewatching this video recently I was reminded that when changes were required they were almost always to the standard library, which underwent constant revision (aided by the gofix tool) until the 1.0 code freeze. From my notes, the important language level changes were

  • By early 2010, semicolons had been removed from the written syntax (although still present implicitly inside the compiler).
  • The semantics of non-blocking send and recieve and channel closure were explored and altered a number of times before arriving at their final form.
  • Maps dropped the confusing map[key] = nil, false deletion form, replaced with more regular delete(map, key), although some bemoaned the addition of another language builtin.
  • The constant literal syntax has been improved to make it clearer when constructing large constant literal forms.
  • Lastly, the builtin error type replaced the original os.Error interface type.

Notes on exploring the compiler flags in the Go compiler suite

I’ve been doing some work improving the code generation of the 5g compiler, which is the Go compiler for arm. These notes also apply to the 6g and 8g compilers for amd64 and 386 respectively.

For this discussion we’ll use a very simple package.

package addr

func addr(s[]int) *int {
return &s[2]
}

To see the assembly produced by compiling this package we use the -S flag. -S can be passed directly to the compiler with go tool 5g -S addr.go, but it is simpler (and more portable) to use the -gcflags flag on the go tool itself.

% go build -gcflags=-S addr.go 
# command-line-arguments
--- prog list "addr" ---
0000 (/home/dfc/src/addr.go:3) TEXT addr+0(SB),$0-16
0001 (/home/dfc/src/addr.go:4) MOVW $s+0(FP),R0
0002 (/home/dfc/src/addr.go:4) MOVW 4(R0),R1
0003 (/home/dfc/src/addr.go:4) CMP $2,R1,
0004 (/home/dfc/src/addr.go:4) BHI ,6(APC)
0005 (/home/dfc/src/addr.go:4) BL ,runtime.panicindex+0(SB)
0006 (/home/dfc/src/addr.go:4) MOVW 0(R0),R0
0007 (/home/dfc/src/addr.go:4) ADD $8,R0
0008 (/home/dfc/src/addr.go:4) MOVW R0,.noname+12(FP)
0009 (/home/dfc/src/addr.go:4) RET ,

This is quite a lot of code for a one line function. One of the reasons for this is s is a slice, whose length is not known at compile time, so the compiler must insert a bounds check. We can tell the compiler to not emit bounds checks with the -B flag.

% go build -gcflags=-SB addr.go
# command-line-arguments
--- prog list "addr" ---
0000 (/home/dfc/src/addr.go:3) TEXT addr+0(SB),$0-16
0001 (/home/dfc/src/addr.go:4) MOVW $s+0(FP),R0
0002 (/home/dfc/src/addr.go:4) MOVW 0(R0),R0
0003 (/home/dfc/src/addr.go:4) ADD $8,R0
0004 (/home/dfc/src/addr.go:4) MOVW R0,.noname+12(FP)
0005 (/home/dfc/src/addr.go:4) RET ,

It is important to note that -B is an unsupported flag. The goal of Go is a safe language, one where array subscripts are bounds checked when they are not provably safe. Go already elides bounds checks when you use range loops, and future compilers will improve this. It is also important to note that none of the builders test -B so it might even generate incorrect code. In summary, when the compiler improves, -B will go away, so don’t get too attached.

One other interesting flag is -N, which will disable the optimisation pass in the compiler

% go build -gcflags=-SN addr.go
# command-line-arguments
--- prog list "addr" ---
0000 (/home/dfc/src/addr.go:3) TEXT addr+0(SB),$0-16
0001 (/home/dfc/src/addr.go:4) MOVW $s+0(FP),R0
0002 (/home/dfc/src/addr.go:4) MOVW R0,R0
0003 (/home/dfc/src/addr.go:4) MOVW 4(R0),R1
0004 (/home/dfc/src/addr.go:4) CMP $2,R1,
0005 (/home/dfc/src/addr.go:4) BHI ,8(APC)
0006 (/home/dfc/src/addr.go:4) BL ,runtime.panicindex+0(SB)
0007 (/home/dfc/src/addr.go:4) UNDEF ,
0008 (/home/dfc/src/addr.go:4) MOVW 0(R0),R0
0009 (/home/dfc/src/addr.go:4) ADD $8,R0
0010 (/home/dfc/src/addr.go:4) MOVW R0,.noname+12(FP)
0011 (/home/dfc/src/addr.go:4) RET ,
0012 (/home/dfc/src/addr.go:5) BL ,runtime.throwreturn+0(SB)
0013 (/home/dfc/src/addr.go:5) RET ,

I think the only thing that is useful about this example is, it’s good thing the optimiser is on by default because there are some strange things going on here, for example line 0002, and the unreachable branch at line 0012.

The last thing to talk about is the output of 5g is not the final code that is executed. Aside from the usual work of a linker, 5l does several transformations on the code which are important to understand.

func addr(s[]int) *int {
10c00: e59a1000 ldr r1, [sl]
10c04: e15d0001 cmp sp, r1
10c08: 33a01004 movcc r1, #4
10c0c: 33a02010 movcc r2, #16
10c10: 31a0300e movcc r3, lr
10c14: 3b00668c blcc 2a64c
10c18: e52de004 push {lr} ; (str lr, [sp, #-4]!)
return &s[2]
10c1c: e28d0008 add r0, sp, #8
10c20: e5901004 ldr r1, [r0, #4]
10c24: e3510002 cmp r1, #2
10c28: 8a000000 bhi 10c30
10c2c: eb0035d5 bl 1e388
10c30: e5900000 ldr r0, [r0]
10c34: e2800008 add r0, r0, #8
10c38: e58d0014 str r0, [sp, #20]
10c3c: e49df004 pop {pc} ; (ldr pc, [sp], #4)

Here we use objdump -dS to dump the addr function as it is compiled into the executable. The first six instructions, starting at 10c00, are the function preamble that deals with segmented stacks which is inserted automatically by the 5l.

Taking it further

There are several other compiler flags which are useful when debugging or optimising your Go code.

  • -g will output the steps a the compiler is a taking at a very low level. The discussion of the output format is outside the scope of this article. Personally I find it easier to add a warn statement which will tell me the source line the compiler was working on at the time.
  • -l will disable inlining (but still retain other compiler optimisations). This is very useful if you are investigating small methods, but can’t find them in objdump.
  • -m is mainly a frontend switch and outputs details about escape analysis and inlining choices.

Mikio Hara’s ipv4 package

Yesterday Mikio Hara committed a new package for ipv4 handling to the go.net repository. I wanted to recognise Mikio’s work for two reasons.

  1. The level of detail and control this package offers is, in my opinion, unmatched by any other language. If you’re using C or C++, then you have the raw power of setsockopt(2) and ioctl(2) available, but you also have no guide or safety rail when using them. Mikio’s package provides exquisite control over ipv4 minutia without sacrificing the safety that Go provides.
  2. The package is a fantastic example of using conditional compilation and embedding to build a very low level package that works across all the supported Go platforms. This package compiles cleanly on Linux, *BSD, OS X and Windows without resorting to the lowest common denominator or using a single ifdef. If you are looking for examples on how to structure your code using build tags, or looking for ways to work within the Go1 API, you should study this package.

Check it out for yourselves, go.pkgdoc.org/code.google.com/p/go.net/ipv4.

How the Go language improves expressiveness without sacrificing runtime performance

This week there was a discussion on the golang-nuts mailing list about an idiomatic way to update a slice of structs. For example, consider this struct representing a set of counters.

type E struct {
A, B, C, D int
}

var e = make([]E, 1000)

Updating these counters may take the form

for i := range e {
e[i].A += 1
e[i].B += 2
e[i].C += 3
e[i].D += 4]
}

Which is good idiomatic Go code. It’s pretty fast too

BenchmarkManual   500000              4642 ns/op

However there is a problem with this example. Each access the ith element of e requires the compiler to insert an array bounds checks. You can avoid 3 of these checks by referencing the ith element once per iteration.

for i := range e {
v := &e[i]
v.A += 1
v.B += 2
v.C += 3
v.D += 4
}

By reducing the number of subscript checks, the code now runs considerably faster.

BenchmarkUnroll  1000000              2824 ns/op

If you are coding a tight loop, clearly this is the more efficient method, but it comes with a cost to readability, as well as a few gotchas. Someone else reading the code might be tempted to move the creation of v into the for declaration, or wonder why the address of e[i] is being taken. Both of these changes would cause an incorrect result. Obviously tests are there to catch this sort of thing, but I propose there is a better way to write this code, one that doesn’t sacrifice performance, and expresses the intent of the author more clearly.

func (e *E) update(a, b, c, d int) {
e.A += a
e.B += b
e.C += c
e.D += d
}

for i := range e {
e[i].update(1, 2, 3, 4)
}

Because E is a named type, we can create an update() method on it. Because update is declared on with a receiver of *E, the compiler automatically inserts the (&e[i]).update() for us. Most importantly because of the simplicity of the update() method itself, the inliner can roll it up into the body of the calling loop, negating the method call cost. The result is very similar to the hand unrolled version.

BenchmarkUpdate   500000              2996 ns/op

In conclusion, as Cliff Click observed, there are Lies, Damn Lies and Microbenchmarks. This post is an example of a least one of the three. My intent in writing was not to spark a language war about who can update a struct the fastest, but instead argue that Go lets you write your code in a more expressive manner without having to trade off performance.

You can find the source code for the benchmarks presented in this below.

package b

import "testing"

// SIZE=1000 results (core i5 late 2011 mac mini, 10.7.3)
// % go test -v -run='XXX' -bench='.'
// PASS
// BenchmarkUpdate 500000 2996 ns/op
// BenchmarkManual 500000 4642 ns/op
// BenchmarkUnroll 1000000 2824 ns/op

type E struct {
A, B, C, D int
}

func (e *E) update(a, b, c, d int) {
e.A += a
e.B += b
e.C += c
e.D += d
}

var SIZE = 1000 // needed to make a valid testable package

func TestNothing(t *testing.T) {}

func assert(e []E, b testing.B) {
for _, v := range e {
if v.A != b.N || v.B != b.N2 || v.C != b.N3 || v.D != b.N4 {
b.Errorf("Expected: %d, %d, %d, %d; actual: %d, %d, %d, %d",
b.N, b.N2, b.N3, b.N*4, v.A, v.B, v.C, v.D)
}
}
}

func BenchmarkUpdate(b *testing.B) {
var e = make([]E, SIZE)
for j := 0; j < b.N; j++ {
for i := range e {
e[i].update(1, 2, 3, 4)
}
}
b.StopTimer()
assert(e, b)
}

func BenchmarkManual(b *testing.B) {
var e = make([]E, SIZE)
for j := 0; j < b.N; j++ {
for i := range e {
e[i].A += 1
e[i].B += 2
e[i].C += 3
e[i].D += 4
}
}
b.StopTimer()
assert(e, b)
}

func BenchmarkUnroll(b *testing.B) {
var e = make([]E, SIZE)
for j := 0; j < b.N; j++ {
for i := range e {
v := &e[i]
v.A += 1
v.B += 2
v.C += 3
v.D += 4
}
}
b.StopTimer()
assert(e, b)
}