Monthly Archives: May 2013

Go 1.1 performance improvements, part 3

This is the final article in the series exploring the performance improvements available in the recent Go 1.1 release. You can also read part 1 and part 2 to get the back story for amd64 and 386.

This article focuses on the performance of arm platforms. Go 1.1 was an important release as it raised arm to a level on par with amd64 and 386 and introduced support for additional operating systems. Some highlights that Go 1.1 brings to arm are:

  • Support for cgo.
  • Addition of experimental support for freebsd/arm and netbsd/arm.
  • Better code generation, including a now partially working peephole optimiser, better register allocator, and many small improvements to reduce code size.
  • Support for ARMv6 hosts, including the Raspberry Pi.
  • The GOARM variable is now optional, and automatically chooses its value based on the host Go is compiled on.
  • The memory allocator is now significantly faster due to the elimination of many 64 bit instructions which were previously emulated at a high cost.
  • A significantly faster software division/modulo facility.

These changes would not have been possible without the efforts of Shenghou Ma, Rémy Oudompheng and Daniel Morsing, who made enormous contributions to the compiler and runtime during the Go 1.1 development cycle.

Again, a huge debt of thanks is owed to Anthony Starks who helped prepare the benchmark data and images for this article.

Go 1 benchmarks on linux/arm

Since its release Go has supported more than one flavor of arm architecture. Presented here are benchmarks from a wide array of hosts, arranged from top left to bottom right in the grids below, to give a representative sample of the performance of Go 1.1 programs on arm hosts.

As always the results presented here are available in the autobench repository. The thumbnails are clickable for a full resolution view.

Hey, the images don’t work on my iSteve! Yup, it looks like iOS devices have a limit for the size of images they will load inside a web page, and these images are on the sadface side of that limit. If you click on the broken image, you’ll find the images will load fine in a separate page. Sorry for the inconvenience.

baseline-grid
The speedup in the BinaryTree17 benchmark, and to a lesser extent Fannkuch11, is influenced by the performance of the heap allocator. Part of heap allocation involves updating statistics stored in 64 bit quantities, which flow into runtime.MemStats. During the 1.1 cycle, some quick work on the part of the Atom symbol removed many of these 64 bit operations, which shows up as decreased run time in these benchmarks.
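
For reference, here is a minimal sketch showing the runtime.MemStats fields in question. TotalAlloc and Mallocs are uint64 counters maintained by the allocator, so on a 32 bit platform like arm every update once required emulated 64 bit arithmetic.

package main

import (
	"fmt"
	"runtime"
)

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	// TotalAlloc and Mallocs are 64 bit counters updated on every
	// heap allocation; these are the statistics referred to above.
	fmt.Println("allocated", ms.TotalAlloc, "bytes across", ms.Mallocs, "objects")
}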

net/http

Across all the samples, net/http benchmarks have benefited from the new poller implementation as well as the pure Go improvements to the net/http package through the work of Brad Fitzpatrick and Jeff Allen.
net-grid

runtime

The results of the runtime benchmarks mirror those from amd64 and 386. The general trend is towards improvement, and in some cases, a large improvement, in areas like map operations.
runtime-grid
The improvements to the Append set of benchmarks show the benefit of a change committed by Rob Pike which avoids a call to runtime.memmove when appending small amounts of data to a []byte.
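
As a rough illustration, a benchmark in this style repeatedly appends a small payload onto a reused []byte. The real benchmarks live in the runtime package, so the name and sizes below are only an approximation.

package bench

import "testing"

// BenchmarkAppendSmall approximates the pattern measured by the
// runtime Append benchmarks: appending a handful of bytes to a
// freshly truncated slice on every iteration.
func BenchmarkAppendSmall(b *testing.B) {
	payload := []byte("hello")
	buf := make([]byte, 0, 64)
	for i := 0; i < b.N; i++ {
		buf = append(buf[:0], payload...)
	}
}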

The common theme across all the samples is the regression in some channel operations. This may be attributable to the high cost of performing atomic operations on arm platforms. Currently all atomic operations are implemented by the runtime package, but in the future they may be handled directly in the compiler which could reduce their overhead.
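
The kind of operation in question is the uncontended send and receive fast path. A sketch of such a benchmark, assuming names of my own choosing (the actual channel benchmarks are in the runtime package), might look like this:

package bench

import "testing"

// BenchmarkChanUncontended sends and receives on a buffered channel
// with no contention. On arm the cost is dominated by the atomic
// operations guarding the channel's internal state.
func BenchmarkChanUncontended(b *testing.B) {
	ch := make(chan int, 1)
	for i := 0; i < b.N; i++ {
		ch <- i
		<-ch
	}
}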

The CompareString benchmarks show a smaller improvement than other platforms because CL 8056043 has not yet been backported to arm.

Conclusion

With the addition of cgo support, throughput improvements in the net package, and improvements to code generation and the garbage collector, Go 1.1 represents a significant milestone for writing Go programs targeting arm.

To wrap up this series of articles it is clear that Go 1.1 delivers on its promise of a general 30-40% improvement across all three supported architectures. If we consider the relative improvements across compilers, while 6g remains the flagship compiler and benefits from the fastest underlying hardware, 8g and 5g show a greater improvement relative to the Go 1.0 release of last year.

But wait, there is more

If you’ve enjoyed this series of posts and want to follow the progress of Go 1.2 I’ll soon be opening a branch of autobench which will track Go 1.1 vs tip (1.2). I’ll post and tweet the location when it is ready.

Since the Go 1.2 change window was opened on May 14th, the allocator and garbage collector have already received improvements from Dmitry Vyukov and the Atom symbol aimed at further reducing the cost of GC, and Carl Shapiro has started work on precise collection of stack allocated values.

Also proposed for Go 1.2 are a better memory allocator and a change to the scheduler giving it the ability to preempt long running goroutines, which is aimed at reducing GC latency.
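
To see why preemption matters, consider a goroutine that spins in a loop containing no function calls. Under the current cooperative scheduler such a goroutine can only be descheduled at function call boundaries, so it can hold a processor indefinitely and delay a garbage collection. A contrived illustration, assuming GOMAXPROCS is 1:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	runtime.GOMAXPROCS(1)
	go func() {
		// No function calls inside the loop, so the cooperative
		// scheduler never gets a chance to preempt this goroutine.
		for i := 0; ; i++ {
			_ = i * i
		}
	}()
	runtime.Gosched() // hand the only processor to the spinning goroutine
	// Without preemption the spinning goroutine never yields,
	// so control never returns here.
	fmt.Println("unreachable without preemption")
}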

Finally, Go 1.2 has a release timetable. So while we can’t really say what will or will not make it into 1.2, we can say that it should be done by the end of 2013.

Go 1.1 performance improvements, part 2

This is the second in a three part series exploring the performance improvements in the recent Go 1.1 release.

In part 1 I explored the improvements on amd64 platforms, as well as general improvements available to all via runtime and compiler frontend improvements.

In this article I will focus on the performance of Go 1.1 on 386 machines. The results in this article are taken from linux-386-d5666bad617d-vs-e570c2daeaca.txt.

Go 1 benchmarks on linux/386

When it comes to performance, the 8g compiler is at a disadvantage. The small number of general purpose registers available in the 386 programming model, and the weird restrictions on their use place a heavy burden on the compiler and optimiser. However that did not stop Rémy Oudompheng making several significant contributions to 8g during the 1.1 cycle.

Firstly, the odd 387 floating point model was deprecated in favor of SSE2 instructions (it’s still there if you are running very old hardware, via the GO386=387 switch).

Secondly, Rémy put significant effort into porting code generation improvements from 6g into 8g (and 5g, the arm compiler). Where possible code was moved into the compiler frontend, gc, including introducing a framework to rewrite division as simpler shift and multiply operations.
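
As a rough illustration of that rewrite, here is hand-written Go standing in for what the compiler emits, using the well known magic constant for unsigned division by ten: a constant division can be replaced by a multiply and a shift.

package main

import "fmt"

// div10 computes x / 10 without a division instruction by
// multiplying by a precomputed reciprocal and shifting. This is the
// same style of strength reduction the frontend applies to division
// by constants.
func div10(x uint32) uint32 {
	return uint32((uint64(x) * 0xCCCCCCCD) >> 35)
}

func main() {
	for _, x := range []uint32{0, 9, 10, 99, 4294967295} {
		fmt.Println(x, div10(x), x/10) // the last two columns always agree
	}
}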

linux-386-baseline

In general the results for linux/386 on this host show improvements that are as good, or in some cases, better than linux/amd64. Unlike linux/amd64, there is no slowdown in the Gzip or Gob benchmarks.

The two small regressions, BinaryTree17 and Fannkuch11, are assumed to be attributable to the garbage collector becoming more precise. This involves some additional bookkeeping to track the size and type of objects allocated on the heap, which shows up in these benchmarks.

net/http benchmarks

The improvements in the net package previously demonstrated in the linux/amd64 article carry over to linux/386. The improvements in the ClientServer benchmarks are not as marked as their amd64 cousins, but nonetheless show a significant improvement overall due to the tighter integration between the runtime and net package.

linux-386-net-http

Runtime microbenchmarks

Like the amd64 benchmarks in part 1, the runtime microbenchmarks show a mixture of results. Some low level operations got a bit slower, while other operations, like those on maps, have improved significantly.

linux-386-microbenchmarks

The final two benchmarks, which appear truncated, are actually so large they do not fit on the screen. The improvement is mostly due to this change which introduced a faster low level Equals operation for the strings, bytes and runtime packages. The results speak for themselves.

benchmark                                  old MB/s   new MB/s  speedup
BenchmarkCompareStringBigUnaligned         29.08      1145.48   39.39x
BenchmarkCompareStringBig                  29.09      1253.48   43.09x
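
A sketch of the kind of comparison these benchmarks exercise is below. The real CompareString benchmarks live in the runtime package, so the name and sizes here are only illustrative.

package bench

import (
	"strings"
	"testing"
)

// BenchmarkCompareBigStrings compares two large, equal strings held
// in separate allocations, so every iteration walks the full length
// of the data through the runtime's low level equality routine.
func BenchmarkCompareBigStrings(b *testing.B) {
	s1 := strings.Repeat("a", 1<<20)
	s2 := strings.Repeat("a", 1<<20)
	b.SetBytes(int64(len(s1)))
	for i := 0; i < b.N; i++ {
		if s1 != s2 {
			b.Fatal("strings should be equal")
		}
	}
}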

Conclusion

Although 8g is not the leading compiler of the gc suite, and Ken Thompson himself has said that there are essentially no free registers available on 386, linux/386 easily meets the 30-40% performance improvement claim. In some benchmarks, compared to Go 1.0, linux/386 beats linux/amd64.

Additionally, due to reductions in memory usage, all the compilers now use around half as much memory when compiling, and as a direct consequence, compile up to 30% faster than their 1.0 predecessors.

I encourage you to review the benchmark data in the autobench repository and if you are able, submit your own results.

In the final article in this series I will investigate the performance improvement Go 1.1 brings to arm platforms. I assure you, I’ve saved the best til last.

Update: thanks to @ajstarks who provided me with higher quality benchviz images.

Go 1.1 performance improvements

This is the first in a series of articles analysing the performance improvements in the Go 1.1 release.

It has been reported (here, and here) that performance improvements of 30-40% are available simply by recompiling your code under Go 1.1. For linux/amd64 this holds true for a wide spectrum of benchmarks. For platforms like linux/386 and linux/arm the results are even more impressive, but I’m putting the cart before the horse.

A note about gccgo. This series focuses on the contributions that the improvements to the gc series of compilers (5g, 6g and 8g) have made to Go 1.1’s performance. gccgo benefits indirectly from these improvements as it shares the same runtime and standard library, but is not the focus of this benchmarking series.

Go 1.1 features several improvements in the compilers, runtime and standard library that are directly responsible for the resulting improvements in program speed. Specifically:

  • Code generation improvements across all three gc compilers, including better register allocation, reduction in redundant indirect loads, and reduced code size.
  • Improvements to inlining, including inlining of some builtin function calls and compiler generated stub methods when dealing with interface conversions (see the sketch after this list).
  • Reduction in stack usage, which reduces pressure on stack size, leading to fewer stack splits.
  • Introduction of a parallel garbage collector. The collector remains mark and sweep, but the phases can now utilise all CPUs.
  • More precise garbage collection, which reduces the size of the heap, leading to lower GC pause times.
  • A new runtime scheduler which can make better decisions when scheduling goroutines.
  • Tighter integration of the scheduler with the net package, leading to significantly decreased packet processing latencies and higher throughput.
  • Parts of the runtime and standard library have been rewritten in assembly to take advantage of specific bulk move or crypto instructions.
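
For the inlining item above, you can ask the compiler to report its decisions with the -m flag. A minimal sketch follows; the file and function names are made up for illustration, and the exact diagnostic output varies between releases.

// inline.go: inspect inlining decisions with: go build -gcflags=-m inline.go
package main

import "fmt"

// add is tiny and makes no calls of its own, so the inliner can
// substitute its body directly at each call site.
func add(a, b int) int { return a + b }

func main() {
	fmt.Println(add(1, 2))
}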

Introducing autobench

Few things irk me more than unsubstantiated, unrepeatable benchmarks. As this series is going to throw out a lot of numbers, and draw some strong conclusions, it was important for me to provide a way for people to verify my results on their machines.

To this end I have built a simple make based harness which can be run on any platform that Go supports to compare the performance of a set of synthetic benchmarks against Go 1.0 and Go 1.1. While the project is still being developed, it has generated a lot of useful data which is captured in the repository. You can find the project on Github.

https://github.com/davecheney/autobench

I am indebted to Go community members who submitted benchmark data from their machines allowing me to make informed conclusions about the relative performance of Go 1.1.

If you are interested in participating in autobench there will be a branch which tracks the performance of Go 1.1 against tip opening soon.

A picture speaks a thousand words

To better visualise the benchmark results, AJ Starks has produced a wonderful tool, benchviz which turns the dry text based output of misc/benchcmp into rather nice graphs. You can read all about benchviz on AJ’s blog.

http://mindchunk.blogspot.com.au/2013/05/visualizing-go-benchmarks-with-benchviz.html

Following a tradition set by the misc/benchcmp tool, improvements, be they a reduction in run time or an increase in throughput, are shown as bars extending towards the right. Regressions fall back to the left.

Go 1 benchmarks on linux/amd64

The remainder of this post will focus on linux/amd64 performance. The 6g compiler is considered to be the flagship of the gc compiler suite. In addition to code generation improvements in the front and back ends, performance critical parts of the standard library and runtime have been rewritten in assembly to take advantage of SSE2 instructions.

The data for the remainder of this article is taken from the results file linux-amd64-d5666bad617d-vs-e570c2daeaca.txt.

bm0


The go1 benchmark suite, while being a synthetic benchmark, attempts to capture some real world usages of the main packages in the standard library. In general the results support the hypothesis of a broad 30-40% improvement. Looking at the results submitted to the autobench repository it is clear that GobEncode and Gzip have regressed, and issues 5165 and 5166 have been raised respectively. In the latter case, the switch to 64 bit ints is assumed to be at least partially to blame.

net/http benchmarks

This set of benchmarks is extracted from the net/http package and demonstrates the work that Brad Fitzpatrick, Dmitry Vyukov and many others have put into the net and net/http packages.

bm2


Of note in this benchmark set are the improvements in the ReadRequest benchmarks, which measure the decoding of an HTTP request. The improvements in the ClientServerParallel benchmarks are not currently available across all amd64 platforms, as some of them have no support for the new runtime integration with the net package. Finishing support for the remaining BSD and Windows platforms is a focus for the 1.2 cycle.
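
For context, the ReadRequest benchmarks exercise net/http’s request parser; a minimal example of the API they measure:

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	raw := "GET /index.html HTTP/1.1\r\nHost: example.com\r\nAccept: */*\r\n\r\n"
	// http.ReadRequest parses an incoming request from a buffered reader,
	// which is the operation the ReadRequest benchmarks time.
	req, err := http.ReadRequest(bufio.NewReader(strings.NewReader(raw)))
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.Host, req.URL.Path)
}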

Runtime microbenchmarks

The final set of benchmarks presented here are extracted from the runtime package.

bm1


The runtime benchmarks represent micro benchmarks of very low level parts of the runtime package.

The obvious regression is the first Append benchmark. In wall time the benchmark has gone from 36 ns/op to 100 ns/op, showing that for some append use cases there has been a regression. This may have already been addressed in tip by CL 9360043.

The big wins in the runtime benchmarks are the amazing new map code by khr which addresses issue 3886, the reduction in overhead of channel operations (thanks to Dmitry’s new scheduler), improvements in complex128 operations, and speedups in hash and memmove operations, which were rewritten in 64 bit assembly.
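
As one example, the map benchmarks measure operations like repeated lookups of existing keys; a rough sketch, with illustrative names and sizes of my own choosing:

package bench

import "testing"

// BenchmarkMapLookup repeatedly looks up keys that are present in a
// small map, the kind of operation that benefits most from the new
// map implementation.
func BenchmarkMapLookup(b *testing.B) {
	m := make(map[int]int, 1000)
	for i := 0; i < 1000; i++ {
		m[i] = i
	}
	b.ResetTimer()
	var sum int
	for i := 0; i < b.N; i++ {
		sum += m[i%1000]
	}
	_ = sum
}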

Conclusion

For linux/amd64 on modern 64 bit Intel CPUs, the 6g compiler and runtime can generate significantly faster code. Other amd64 platforms share similar speedups, although the specific improvements vary. I encourage you to review the benchmark data in the autobench repository and if you are able, submit your own results.

In subsequent articles I will investigate the performance improvement Go 1.1 brings to 386 and arm platforms.

Update: thanks to @ajstarks who provided me with higher quality benchviz images.

Go 1.1 tarballs for linux/arm

For the time poor ARM fans in the room, I’ve updated my tarball distributions to Go 1.1. These tarballs are built using the same misc/dist tool that makes the official builds on the golang.org download page.

You can find the link at the Unofficial ARM tarballs for Go item at the top of this page. Please address any bug reports or comments to me directly.

There are also a number of other ways to obtain Go 1.1 appearing on the horizon. For example, if you are using Debian Sid, Go 1.1 is available now. This version has been imported into Ubuntu Saucy (which will become 13.10), although at this time it remains in the proposed channel.

Rest assured I will not be shy in announcing when Go 1.1 has wider availability in Ubuntu.

Go and Juju at Canonical slides posted

This month I had the privilege of presenting a talk at the GoSF meetup to 120 keen Gophers.

I was absolutely blown away by the Iron.io/HeavyBit offices. It was a fantastic presentation space with a professional sound and video crew to stream the meetup straight to G+.

The slides are available on my GitHub account, but a more convenient way to consume them is Gary Burd’s fantastic talks.godoc.org site.

http://talks.godoc.org/github.com/davecheney/gosf/5nines.slide#1

If you’re interested in finding out more about the Juju project itself, you can find us on the project page, https://launchpad.net/juju-core/, or in the #juju-dev channel on IRC.