How the Go language improves expressiveness without sacrificing runtime performance

This week there was a discussion on the golang-nuts mailing list about an idiomatic way to update a slice of structs. For example, consider this struct representing a set of counters.

type E struct {
A, B, C, D int
}

var e = make([]E, 1000)

Updating these counters may take the form

for i := range e {
e[i].A += 1
e[i].B += 2
e[i].C += 3
e[i].D += 4]
}

Which is good idiomatic Go code. It’s pretty fast too

BenchmarkManual   500000              4642 ns/op

However there is a problem with this example. Each access the ith element of e requires the compiler to insert an array bounds checks. You can avoid 3 of these checks by referencing the ith element once per iteration.

for i := range e {
v := &e[i]
v.A += 1
v.B += 2
v.C += 3
v.D += 4
}

By reducing the number of subscript checks, the code now runs considerably faster.

BenchmarkUnroll  1000000              2824 ns/op

If you are coding a tight loop, clearly this is the more efficient method, but it comes with a cost to readability, as well as a few gotchas. Someone else reading the code might be tempted to move the creation of v into the for declaration, or wonder why the address of e[i] is being taken. Both of these changes would cause an incorrect result. Obviously tests are there to catch this sort of thing, but I propose there is a better way to write this code, one that doesn’t sacrifice performance, and expresses the intent of the author more clearly.

func (e *E) update(a, b, c, d int) {
e.A += a
e.B += b
e.C += c
e.D += d
}

for i := range e {
e[i].update(1, 2, 3, 4)
}

Because E is a named type, we can create an update() method on it. Because update is declared on with a receiver of *E, the compiler automatically inserts the (&e[i]).update() for us. Most importantly because of the simplicity of the update() method itself, the inliner can roll it up into the body of the calling loop, negating the method call cost. The result is very similar to the hand unrolled version.

BenchmarkUpdate   500000              2996 ns/op

In conclusion, as Cliff Click observed, there are Lies, Damn Lies and Microbenchmarks. This post is an example of a least one of the three. My intent in writing was not to spark a language war about who can update a struct the fastest, but instead argue that Go lets you write your code in a more expressive manner without having to trade off performance.

You can find the source code for the benchmarks presented in this below.

package b

import "testing"

// SIZE=1000 results (core i5 late 2011 mac mini, 10.7.3)
// % go test -v -run='XXX' -bench='.'
// PASS
// BenchmarkUpdate 500000 2996 ns/op
// BenchmarkManual 500000 4642 ns/op
// BenchmarkUnroll 1000000 2824 ns/op

type E struct {
A, B, C, D int
}

func (e *E) update(a, b, c, d int) {
e.A += a
e.B += b
e.C += c
e.D += d
}

var SIZE = 1000 // needed to make a valid testable package

func TestNothing(t *testing.T) {}

func assert(e []E, b testing.B) {
for _, v := range e {
if v.A != b.N || v.B != b.N2 || v.C != b.N3 || v.D != b.N4 {
b.Errorf("Expected: %d, %d, %d, %d; actual: %d, %d, %d, %d",
b.N, b.N2, b.N3, b.N*4, v.A, v.B, v.C, v.D)
}
}
}

func BenchmarkUpdate(b *testing.B) {
var e = make([]E, SIZE)
for j := 0; j < b.N; j++ {
for i := range e {
e[i].update(1, 2, 3, 4)
}
}
b.StopTimer()
assert(e, b)
}

func BenchmarkManual(b *testing.B) {
var e = make([]E, SIZE)
for j := 0; j < b.N; j++ {
for i := range e {
e[i].A += 1
e[i].B += 2
e[i].C += 3
e[i].D += 4
}
}
b.StopTimer()
assert(e, b)
}

func BenchmarkUnroll(b *testing.B) {
var e = make([]E, SIZE)
for j := 0; j < b.N; j++ {
for i := range e {
v := &e[i]
v.A += 1
v.B += 2
v.C += 3
v.D += 4
}
}
b.StopTimer()
assert(e, b)
}

11 thoughts on “How the Go language improves expressiveness without sacrificing runtime performance

  1. pmylund

    Thanks for the well-written article!I was curious what the results for a slice of *B would be, and it turns out it’s actually faster:func BenchmarkPtr(b *testing.B) { var e = make([]*E, SIZE) for i := range e { e[i] = &E{} } for j := 0; j < b.N; j++ { for _, v := range e { v.A += 1 v.B += 2 v.C += 3 v.D += 4 } }}BenchmarkUpdate 1000000 2469 ns/opBenchmarkManual 500000 4244 ns/opBenchmarkUnroll 1000000 2372 ns/opBenchmarkPtr 1000000 1956 ns/op

  2. Wouter

    That’s interesting. I get the following results on an old IBM ThinkPad:PASSb.BenchmarkUpdate 100000 16495 ns/opb.BenchmarkManual 100000 18789 ns/opb.BenchmarkUnroll 200000 9713 ns/op… so Update is more in the ballpark of Manual than Unroll.BTW: I’m using the 8c/8l binaries for compilation and linking on GNU/Debian.

  3. anon

    Using the _, v := range avoids bounds checking completely so you expect that to be faster again than either of the adsress-taking approaches. It will be nice when the optimizer can recognize repeated access/write to the same element and avoid additional bounds checks when not necessary.

  4. Dave Cheney

    @Wouter – thank you for posting the 8g results. For comparison (in delta, not wall time) here is the same test on an ancient Arm5 system.% go test -v -run=’XXX’ -bench=’.’ PASSBenchmarkUpdate 50000 48785 ns/opBenchmarkManual 20000 81782 ns/opBenchmarkUnroll 50000 41537 ns/opok b 8.104sEven though both 8g and 5g are 32bit architectures, 5g has more registers to work with. ken has previously noted on the mailing list that 8g has so few registers that there are almost none left over for general purpose use. This might explain why the inlined variant does not exhibit much of an improvement.

  5. Dave Cheney

    @anon, If you are suggesting this variant for _, v := range e { v.A += 1 v.B += 2 v.C += 3 v.D += 4 }You’ll find it won’t give you the correct result as v is a copy of e[i], at the end of the loop, e will be unchanged.

  6. Dave Cheney

    @pmylund – thank you for benchmarking the Ptr case, however if I can offer this counter example.func BenchmarkPtrUpdate(b *testing.B) { var e = make([]*E, SIZE) for i := range e { e[i] = &E{} } for j := 0; j < b.N; j++ { for _, v := range e { v.update(1, 2, 3, 4) } }}Which shows this technique is just as valid for slices of pointers.% go test -v -run=’XXX’ -bench=’.’ PASSBenchmarkUpdate 500000 3064 ns/opBenchmarkManual 500000 5143 ns/opBenchmarkUnroll 1000000 2927 ns/opBenchmarkPtr 1000000 2358 ns/opBenchmarkPtrUpdate 1000000 2368 ns/opHere is an updated gist including the two additional benchmarks, noting that in both ptr cases the elements of the slice escape to the heap which could add GC pressure and hurt cache locality.https://gist.github.com/1806550

  7. anon

    Sorry Dave I should have been clear. No I didn’t mean the for _, v := range e {…} where e is []E, but the BenchmarkPtr given by pmylund where e is []*E. There you get the advantage of no additional range checking and in place change to the elements of E through indirection.

  8. Dave Cheney

    @anon – that was my mistake for not clarifying, it’s interesting that the Ptr based approach is faster again (even considering the benchmark includes the cost of setting up the ptr slice), but also comforting that BenchmarkPtr and BenchmarkPtrUpdate share similar performance qualities with their non ptr cousins. There is no appreciable performance cost in using a more readable helper function vs. manually inlining.

Comments are closed.