You don’t need to set GOROOT, really

Introduction

This is a short post to explain why it is not necessary to set $GOROOT when compiling or using Go.

TL;DR

In general [1] it is not necessary to set the $GOROOT environment variable when compiling or using Go 1.0 or later. In fact, setting $GOROOT can lead to hard-to-debug problems if you have multiple versions of Go present on your computer.

You still need to set $GOPATH. Since Go 1.0 setting $GOPATH has been highly recommended, and with the release of Go 1.1, it is considered mandatory.

Why isn’t GOROOT required anymore ?

You’re still reading ? Excellent. Now for some history.

The history of the GO* environment variables

Go old timers may remember when not only $GOROOT, but $GOOS and $GOARCH were required environment variables. These were required because the Makefile based build system used lots of includes which used $GOROOT as their base path.

By the time the go tool was introduced, prior to Go 1.0, $GOOS and $GOARCH were optional as the build scripts were able to detect the host’s operating system and cpu architecture. With the release of Go 1.0, and the introduction of the cmd/dist bootstrap build tool, $GOOS and $GOARCH became truly optional. They are now only used when cross compiling.

Go 1.0 also introduced $GOPATH based workspaces. If you’ve read this far, you probably know what a $GOPATH workspace is. But in case you don’t, this is documented on the golang.org website, and in this screencast.

Okay, so I don’t need $GOOS or $GOARCH, but what about $GOROOT ?

$GOROOT has always been defined as a pointer to the root of your Go installation. In the old Makefile based build system, it was used as the base path for including other Makefiles, and since Go 1.0 it is used by the go tool to find the compiler (stored in $GOROOT/pkg/tool/$GOOS_$GOARCH) and the standard library (also in $GOROOT/pkg/$GOOS_$GOARCH). If you are a Java user, $GOROOT is similar in effect to $JAVA_HOME.
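The two directory layouts mentioned above can be sketched in a few lines of Go. This is only an illustration of how the paths are composed; the helper names toolDir and pkgDir are made up for this example, and /usr/lib/go is just a sample root.

```go
package main

import (
	"fmt"
	"path/filepath"
	"runtime"
)

// toolDir returns the directory where, in the layout described above, the
// go tool looks for the compiler and linker for the given target.
// (Hypothetical helper, not part of the go tool.)
func toolDir(goroot, goos, goarch string) string {
	return filepath.Join(goroot, "pkg", "tool", goos+"_"+goarch)
}

// pkgDir returns the directory holding the compiled standard library.
// (Hypothetical helper, not part of the go tool.)
func pkgDir(goroot, goos, goarch string) string {
	return filepath.Join(goroot, "pkg", goos+"_"+goarch)
}

func main() {
	// /usr/lib/go is only a sample root used for illustration.
	fmt.Println(toolDir("/usr/lib/go", runtime.GOOS, runtime.GOARCH))
	fmt.Println(pkgDir("/usr/lib/go", runtime.GOOS, runtime.GOARCH))
}
```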

When you compile Go from source, the value of $GOROOT is automatically discovered [2] (it is one directory up from the all.bash script) and then embedded into the go tool built from that source tree. You can see this when you run go env:

% echo $GOROOT

% go env GOROOT
/home/dfc/go

The binary distributions you download from the golang.org website, or install from your operating system distribution, also have the correct $GOROOT value embedded into the go tool binary. Here is an example from an Ubuntu 12.04 system which ships with Go 1.0.

% dpkg -l golang-{go,src} | grep ^ii
ii  golang-go        2:1-5        Go programming language compiler
ii  golang-src       2:1-5        Go programming language compiler - source files
% which go
/usr/bin/go
% go env GOROOT
/usr/lib/go

You can see that the go tool is installed in /usr/bin/go and $GOROOT is embedded as /usr/lib/go.

So, why shouldn’t I set $GOROOT anymore ?

You should not set $GOROOT because the correct value is already embedded in the go tool.

Setting $GOROOT will override the value stored in the go tool which could lead to the go tool from one version of Go pointing to the compiler and standard library from another version.
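The precedence described above can be modelled as a small sketch. It is not the go tool's own code: effectiveGOROOT and conflicts are hypothetical helpers, and /usr/local/go stands in for whatever value was embedded when your go tool was built.

```go
package main

import (
	"fmt"
	"os"
)

// effectiveGOROOT models the rule described above: a non-empty environment
// value overrides the value embedded at build time. (Hypothetical helper.)
func effectiveGOROOT(env, embedded string) string {
	if env != "" {
		return env
	}
	return embedded
}

// conflicts reports whether the environment points somewhere other than
// the embedded root, which is the situation that mixes Go versions.
func conflicts(env, embedded string) bool {
	return env != "" && env != embedded
}

func main() {
	// Assumption: the embedded value is /usr/local/go; yours may differ.
	const embedded = "/usr/local/go"
	env := os.Getenv("GOROOT")
	fmt.Println("effective GOROOT:", effectiveGOROOT(env, embedded))
	if conflicts(env, embedded) {
		fmt.Println("warning: $GOROOT overrides the embedded value")
	}
}
```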

There are only two cases where you may have to set the $GOROOT environment variable. These are both described in the installation page on the golang.org website. For completeness I will recap them here.

  • You are a Linux, FreeBSD or OS X user using the zip or tarball binary downloads from the golang.org website. These binaries have a $GOROOT value of /usr/local/go and recommend you unpack them into that location. If you choose not to do this, then you must set $GOROOT to the location you chose.
  • You are a Windows user using the zip binary download from the golang.org website. These binaries have a $GOROOT value of C:\Go. If you place Go somewhere else on your system then you must set $GOROOT to the location you chose.

Super nerdy bonus detail

This post has explained how $GOROOT is automatically discovered when compiling from source. I’ve also shown that the build scripts can detect when that value doesn’t match the path that all.bash is invoked from. So, how do operating system distributions set $GOROOT when they normally compile Go in a temporary build directory or chroot? The answer is the $GOROOT_FINAL value, which is used to override the $GOROOT location stored in the go tool.

For example, the Debian/Ubuntu build process will supply a value for $GOROOT_FINAL of /usr/lib/go. This frees it to leave $GOROOT unset, making the build process happy. After the build, the build process will install the go tool in /usr/bin, and the compilers, sources and packages in /usr/lib/go.


  1. There are a few cases where it is required if using the binary distributions of Go, which are described in this post.
  2. That is, if you haven’t set $GOROOT, although the build system will detect when the parent directory of all.bash does not match $GOROOT.

Writing table driven tests in Go

This article is intended as a short introduction to the mechanics and syntax of writing a table driven test in Go. Supporting this article is a small repository, https://github.com/davecheney/fib, which contains all the code mentioned below.

Introduction

I enjoy writing table driven tests in Go. While not unique to the language, table driven tests leverage several features, composite literals and anonymous structs, to allow you to write related tests in a compact form.

As an example, please consider the case of testing this overused function

package fib

// Fib returns the nth number in the Fibonacci series.
func Fib(n int) int {
        if n < 2 {
                return n
        }
        return Fib(n-1) + Fib(n-2)
}

The table structure

At the heart of all table driven tests is the table itself, which provides the inputs and expected results of the function under test. In most cases the table is a slice of anonymous structs, which allows the table to be written in a compact form.

var fibTests = []struct {
        n        int // input
        expected int // expected result
}{
        {1, 1},
        {2, 1},
        {3, 2},
        {4, 3},
        {5, 5},
        {6, 8},
        {7, 13},
}

If you wished, you could give the struct a name, in which case the table definition would look something like this

type fibTest struct {
        n        int
        expected int
}

var fibTests = []fibTest{
        {1, 1}, {2, 1}, {3, 2}, {4, 3}, {5, 5}, {6, 8}, {7, 13},
}

Hooking it up

Now that we have the table of inputs and results defined, we need to write a driver function to iterate through the inputs and compare the results to their expected value. Rather than one Test function per set of values, we can use the range clause to loop over each test case.

func TestFib(t *testing.T) {
        for _, tt := range fibTests {
                actual := Fib(tt.n)
                if actual != tt.expected {
                        t.Errorf("Fib(%d): expected %d, actual %d", tt.n, tt.expected, actual)
                }
        }
}

In this example we range over all the fibTests defined above, assigning their value in turn to tt. We then call Fib passing in the value of tt.n and compare the result, stored in actual, with the value of tt.expected.

The use of the names actual and expected shows my JUnit heritage; others may prefer names like want and got. You should choose something that works for you and gives a clear meaning in your test code.

The use of t.Errorf instead of t.Fatalf is a personal preference. As Fib is a pure function it is safe to continue the loop after a failure. I find this generally reduces test whack-a-mole by returning all the failures at once.
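To make the difference concrete outside of the testing package, here is a small sketch that collects every failing case instead of stopping at the first one, which is roughly the behaviour t.Errorf gives you. The check helper and the deliberately broken offByOne function are inventions for this example.

```go
package main

import "fmt"

// Fib returns the nth number in the Fibonacci series.
func Fib(n int) int {
	if n < 2 {
		return n
	}
	return Fib(n-1) + Fib(n-2)
}

type fibTest struct {
	n        int // input
	expected int // expected result
}

// check runs every case and returns one message per failure. Returning on
// the first mismatch instead would mirror t.Fatalf. (Hypothetical helper.)
func check(f func(int) int, tests []fibTest) []string {
	var failures []string
	for _, tt := range tests {
		if actual := f(tt.n); actual != tt.expected {
			failures = append(failures,
				fmt.Sprintf("Fib(%d): expected %d, actual %d", tt.n, tt.expected, actual))
		}
	}
	return failures
}

func main() {
	tests := []fibTest{{1, 1}, {2, 1}, {3, 2}, {4, 3}}

	// offByOne is deliberately wrong so that every case fails.
	offByOne := func(n int) int { return Fib(n) + 1 }

	fmt.Println("failures for Fib:", len(check(Fib, tests)))
	for _, msg := range check(offByOne, tests) {
		fmt.Println(msg)
	}
}
```

Because all the failures are reported in one run, a single fix to the function under test can be verified against the whole table at once.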

Conclusion

In my introduction I said that table driven tests are one of my favorite parts of the Go language. They allow you to write unit tests in a concise fashion, hopefully leading to greater test coverage at a lower line count. If done correctly, adding additional test cases is as simple as adding a new element to the test table.

This is certainly not the only way that tests could be written in Go, nor the only way to write table driven tests. The Go standard library contains many examples of this form of testing which are worth studying. In particular I suggest the tests for the math and time packages are an excellent starting point.

At the other end of the spectrum is this table driven test of the Juju status command which defines its own language of helpers to populate the table structure. Although this elevates the table driven test to ninja levels, it still contains the same components and concepts, and right at the bottom of the file you’ll find a simple function driving each test.

How Go uses Go to build itself

This post is based on a talk I gave to the Sydney Go users group in mid April 2013 describing the Go build process.


Frequently on the mailing list or IRC channel there are requests for documentation on the details of the Go compiler, runtime and internals. Currently the canonical source of documentation about Go’s internals is the source, which I encourage everyone to read. Having said that, the Go build process has been stable since the Go 1.0 release, so documenting it here will probably remain relevant for some time.

This post walks through the nine steps of the Go build process, starting with the source and ending with a fully tested Go installation. For simplicity, all paths mentioned are relative to the root of the source checkout, $GOROOT/src.

For background you should also read Installing Go from source on the golang.org website.

Step 1. all.bash

% cd $GOROOT/src
% ./all.bash

The first step is a bit anticlimactic as all.bash just calls two other shell scripts: make.bash and run.bash. If you’re using Windows or Plan 9 the process is the same, but the scripts end in .bat or .rc respectively. For the rest of this post, please substitute the extension appropriate for your operating system.

Step 2. make.bash

. ./make.bash --no-banner

make.bash is sourced from all.bash so that calls to exit will terminate the build process properly. make.bash has three main jobs. The first is to validate that the environment Go is being compiled in is sane. The sanity checks have been built up over the last few years and generally try to avoid building with known broken tools, or in environments where the build will fail.

Step 3. cmd/dist

gcc -O2 -Wall -Werror -ggdb -o cmd/dist/dist -Icmd/dist cmd/dist/*.c

Once the sanity checks are complete, make.bash compiles cmd/dist. cmd/dist replaces the Makefile based system which existed before Go 1 and manages the small amounts of code generation in pkg/runtime. cmd/dist is a C program which allows it to leverage the system C compiler and headers to handle most of the host platform detection issues.

cmd/dist always detects your host’s operating system and architecture, $GOHOSTOS and $GOHOSTARCH. These may differ from any value of $GOOS and $GOARCH you may have set if you are cross compiling. In fact, the Go build process is always building a cross compiler, but in most cases the host and target platform are the same.

Next, make.bash invokes cmd/dist with the bootstrap argument which compiles the supporting libraries, lib9, libbio and libmach, used by the compiler suite, then the compilers themselves. These tools are also written in C and are compiled by the system C compiler.

echo "# Building compilers and Go bootstrap tool for host, $GOHOSTOS/$GOHOSTARCH."
buildall="-a"
if [ "$1" = "--no-clean" ]; then
 buildall=""
fi
./cmd/dist/dist bootstrap $buildall -v # builds go_bootstrap
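The relationship between the host and target platforms can be sketched in Go. This is a simplified model, not what cmd/dist does internally: the target helper is hypothetical, and runtime.GOOS and runtime.GOARCH stand in for $GOHOSTOS and $GOHOSTARCH.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// target returns the platform to build for: the host platform unless an
// override is supplied, as when cross compiling. (Hypothetical helper.)
func target(hostOS, hostArch, envOS, envArch string) (goos, goarch string) {
	goos, goarch = hostOS, hostArch
	if envOS != "" {
		goos = envOS
	}
	if envArch != "" {
		goarch = envArch
	}
	return goos, goarch
}

func main() {
	// Assumption: the platform this program was compiled for is treated
	// as the host for the purposes of the illustration.
	hostOS, hostArch := runtime.GOOS, runtime.GOARCH
	goos, goarch := target(hostOS, hostArch, os.Getenv("GOOS"), os.Getenv("GOARCH"))
	fmt.Printf("host %s/%s, target %s/%s\n", hostOS, hostArch, goos, goarch)
}
```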

Using the compiler suite, cmd/dist then compiles a version of the go tool, go_bootstrap. The go_bootstrap tool is not the full go tool, for example pkg/net is stubbed out which avoids a dependency on cgo. The list of directories containing packages or libraries to be compiled, and their dependencies, is encoded in the cmd/dist tool itself, so great care is taken to avoid introducing new build dependencies for cmd/go.

Step 4. go_bootstrap

Now that go_bootstrap is built, the final stage of make.bash is to use go_bootstrap to compile the complete Go standard library, including a replacement version of the full go tool.

echo "# Building packages and commands for $GOOS/$GOARCH."
"$GOTOOLDIR"/go_bootstrap install -gcflags "$GO_GCFLAGS" \
    -ldflags "$GO_LDFLAGS" -v std

Step 5. run.bash

Now that make.bash is complete, execution falls back to all.bash, which invokes run.bash. run.bash’s job is to compile and test the standard library, the runtime, and the language test suite.

bash run.bash --no-rebuild

The --no-rebuild flag is used because make.bash and run.bash can both invoke go install -a std, so to avoid duplicating the previous effort, --no-rebuild skips the second go install.

# allow all.bash to avoid double-build of everything
rebuild=true
if [ "$1" = "--no-rebuild" ]; then
 shift
else
 echo '# Building packages and commands.'
 time go install -a -v std
 echo
fi

Step 6. go test -a std

echo '# Testing packages.'
time go test std -short -timeout=$(expr 120 \* $timeout_scale)s
echo

Next, run.bash runs the unit tests for all the packages in the standard library, which are written using the testing package. Because code in $GOPATH and $GOROOT lives in the same namespace, we cannot use go test ... as this would also test every package in $GOPATH, so an alias, std, was created to address the packages in the standard library. Because some tests take a long time, or consume a lot of memory, some tests filter themselves with the -short flag.

Step 7. runtime and cgo tests

The next section of run.bash runs a set of tests for platforms that support cgo, runs a few benchmarks, and compiles miscellaneous programs that ship with the Go distribution. Over time this list of miscellaneous programs has grown as it was found that when they were not included in the build process, they would inevitably break silently.

Step 8. go run test

(xcd ../test
unset GOMAXPROCS
time go run run.go
) || exit $?

The penultimate stage of run.bash invokes the compiler and runtime tests in the test folder directly under $GOROOT. These are tests of the low level details of the compiler and runtime itself. While the tests exercise the specification of the language, the test/bugs and test/fixedbugs sub directories capture unique tests for issues which have been found and fixed. The test driver for all these tests is $GOROOT/test/run.go which is a small Go program that runs each .go file inside the test directory. Some .go files contain directives on the first line which instruct run.go to expect, for example, the program to fail, or to emit a certain output sequence.
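As a rough illustration of the first-line directive idea, the sketch below pulls the leading word out of a comment on the first line of a source file. It is a simplification and not the logic of run.go itself; the directive helper is hypothetical, and the action words run and errorcheck are used here only as examples.

```go
package main

import (
	"fmt"
	"strings"
)

// directive returns the first word of a // comment on the first line of
// src, or "" if the first line is not a comment. (Hypothetical helper.)
func directive(src string) string {
	first := src
	if i := strings.Index(src, "\n"); i >= 0 {
		first = src[:i]
	}
	if !strings.HasPrefix(first, "//") {
		return ""
	}
	fields := strings.Fields(strings.TrimPrefix(first, "//"))
	if len(fields) == 0 {
		return ""
	}
	return fields[0]
}

func main() {
	// Example inputs written for this sketch, not files from the test directory.
	fmt.Println(directive("// run\n\npackage main\n"))
	fmt.Println(directive("// errorcheck\n\npackage main\n"))
}
```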

Step 9. go tool api

echo '# Checking API compatibility.'
go tool api -c $GOROOT/api/go1.txt,$GOROOT/api/go1.1.txt \
    -next $GOROOT/api/next.txt -except $GOROOT/api/except.txt

The final step of run.bash is to invoke the api tool. The api tool’s job is to enforce the Go 1 contract: the exported symbols, constants, functions, variables, types and methods that made up the Go 1 API when it shipped in 2012. For Go 1 they are spelled out in api/go1.txt, and for Go 1.1, in api/go1.1.txt. An additional file, api/next.txt, identifies the symbols that make up the additions to the standard library and runtime since Go 1.1. Once Go 1.2 ships, this file will become the contract for Go 1.2, and there will be a new next.txt. There is also a small file, except.txt, which contains exceptions to the Go 1 contract which have been approved. Additions to this file are not expected to be made lightly.
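The core idea of the check, that nothing in the contract may disappear unless it is an approved exception, can be sketched as a set comparison. This is not how the api tool is implemented; the removed helper and the sample entries are made up for illustration.

```go
package main

import "fmt"

// removed returns the contract entries that are missing from current, in
// contract order, ignoring entries listed in except. (Hypothetical helper.)
func removed(contract, current, except []string) []string {
	have := make(map[string]bool, len(current))
	for _, s := range current {
		have[s] = true
	}
	allowed := make(map[string]bool, len(except))
	for _, s := range except {
		allowed[s] = true
	}
	var missing []string
	for _, s := range contract {
		if !have[s] && !allowed[s] {
			missing = append(missing, s)
		}
	}
	return missing
}

func main() {
	// Illustrative entries only; they are not copied from the real api files.
	contract := []string{"pkg fib, func Fib(int) int", "pkg fib, func Old(int) int"}
	current := []string{"pkg fib, func Fib(int) int", "pkg fib, func New(int) int"}
	for _, s := range removed(contract, current, nil) {
		fmt.Println("missing from API:", s)
	}
}
```

In this model, an entry present in current but not in contract, such as the New function above, is what would be recorded in next.txt.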

Additional tips and tricks

You’ve probably figured out that make.bash is useful for building Go without running the tests, and likewise, run.bash is useful for building and testing the Go runtime. This distinction is also useful as the former can be used when cross compiling Go, and the latter is useful if you are working on the standard library.

Update: Thanks to Russ Cox and Andrew Gerrand for their feedback and suggestions.

Why is a Goroutine’s stack infinite ?

Occasionally new Gophers stumble across a curious property of the Go language related to the amount of stack available to a Goroutine. This typically arises due to the programmer inadvertently creating an infinitely recursive function call. To illustrate this, consider the following (slightly contrived) example.

package main

import "fmt"

type S struct {
        a, b int
}

// String implements the fmt.Stringer interface
func (s *S) String() string {
        return fmt.Sprintf("%s", s) // Sprintf will call s.String()
}

func main() {
        s := &S{a: 1, b: 2}
        fmt.Println(s)
}

Were you to run this program, and I do not suggest that you do, you’d find that your machine would start to swap heavily, and will probably become unresponsive unless you’re quick to hit ^C before things become unsalvageable. Because I know the first thing everyone will do is try to run this program in the playground, I’ve saved you the bother.
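For contrast, here is a sketch of a String method that is safe to run. It formats the fields directly rather than handing s back to Sprintf, so Sprintf never calls String again. The output format is an arbitrary choice made for this example.

```go
package main

import "fmt"

type S struct {
	a, b int
}

// String implements the fmt.Stringer interface without recursing: only the
// int fields are passed to Sprintf, never s itself.
func (s *S) String() string {
	return fmt.Sprintf("S{a: %d, b: %d}", s.a, s.b)
}

func main() {
	s := &S{a: 1, b: 2}
	fmt.Println(s)
}
```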

Most programmers have run into problems with infinite recursion before, and while it is fatal to their program, it isn’t usually fatal to their machine. So, why are Go programs different ?

One of the key features of Goroutines is their cost; they are cheap to create in terms of initial memory footprint (as opposed to the 1 to 8 megabytes with a traditional POSIX thread) and their stack grows and shrinks as necessary. This allows a Goroutine to start with a single 4096 byte stack which grows and shrinks as needed without the risk of ever running out.

To implement this the linker (5l, 6l, 8l) inserts a small preamble at the start of each function [1], which checks to see if the amount of stack required for the function is below the amount currently available. If not, a call is made to runtime⋅morestack, which allocates a new stack page [2], copies the arguments from the caller, then returns control to the original function which can now execute safely. When that function exits, the process is undone, its return arguments are copied back to the stack frame of the caller and the unneeded stack space released.

By this process the stack is effectively infinite, and assuming that you’re not continually straddling the boundary between two stacks, colloquially known as stack splitting, is very cheap.
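A bounded recursion shows the benefit of a growable stack without the danger of the earlier example. The sketch below recurses 100,000 frames deep, far more than a 4096 byte stack could hold, and relies on the runtime growing the stack on demand as described above. The depth function and the chosen depth are arbitrary picks for illustration.

```go
package main

import "fmt"

// depth recurses n times and returns how deep it went. Go does not
// eliminate this call, so each level needs its own stack frame.
func depth(n int) int {
	if n == 0 {
		return 0
	}
	return 1 + depth(n-1)
}

func main() {
	// 100000 is an arbitrary depth: deep enough to outgrow the initial
	// stack, small enough to stay well within physical memory.
	fmt.Println(depth(100000))
}
```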

There is however one detail I have withheld until now, which links the accidental use of a recursive function to a serious case of memory exhaustion for your operating system, and that is, when new stack pages are needed, they are allocated from the heap.

As your infinite function continues to call itself, new stack pages are allocated from the heap, permitting the function to continue to call itself over and over again. Fairly quickly the size of the heap will exceed the amount of free physical memory in your machine, at which point swapping will soon make your machine unusable.

The size of the heap available to Go programs depends on a lot of things, including the architecture of your CPU and your operating system, but it generally represents an amount of memory that exceeds the physical memory of your machine, so your machine is likely to swap heavily before your program ever exhausts its heap.

In Go 1.1 there was a strong desire to increase the maximum size of the heap for both 32 bit and 64 bit platforms, and this has exacerbated the problem to some extent, i.e., it is unlikely that you will have 128Gb [3] of physical memory in your system.

As a final comment, there are several open issues (link, link) regarding this problem, but a solution that does not exact a performance penalty on properly written programs has yet to be found.

Notes
  1. This also applies to methods, but as methods are implemented as functions where the first argument is the method receiver, there is no practical difference when discussing how segmented stacks work in Go.
  2. Using the word page does not imply that only fixed, 4096 byte, allocations are possible, if necessary runtime⋅morestack will allocate a larger amount, probably rounded to a page boundary.
  3. 64 bit Windows platforms only permit a 32Gb heap due to a late change in the Go 1.1 release cycle.

Go 1.1 performance improvements, part 3

This is the final article in the series exploring the performance improvements available in the recent Go 1.1 release. You can also read part 1 and part 2 to get the back story for amd64 and 386.

This article focuses on the performance of arm platforms. Go 1.1 was an important release as it raised arm to a level on par with amd64 and 386 and introduced support for additional operating systems. Some highlights that Go 1.1 brings to arm are:

  • Support for cgo.
  • Addition of experimental support for freebsd/arm and netbsd/arm.
  • Better code generation, including a now partially working peephole optimiser, better register allocator, and many small improvements to reduce code size.
  • Support for ARMv6 hosts, including the Raspberry Pi.
  • The GOARM variable is now optional, and automatically chooses its value based on the host Go is compiled on.
  • The memory allocator is now significantly faster due to elimination of many 64 bit instructions which were previously emulated at a high cost.
  • A significantly faster software division/modulo facility.

These changes would not have been possible without the efforts of Shenghou Ma, Rémy Oudompheng and Daniel Morsing who made enormous contributions to the compiler and runtime during the Go 1.1 development cycle.

Again, a huge debt of thanks is owed to Anthony Starks who helped prepare the benchmark data and images for this article.

Go 1 benchmarks on linux/arm

Since its release Go has supported more than one flavor of arm architecture. Presented here are benchmarks from a wide array of hosts to give a representative sample of the performance of Go 1.1 programs on arm hosts. From top left to bottom right

As always the results presented here are available in the autobench repository. The thumbnails are clickable for a full resolution view.

Hey, the images don’t work on my iSteve! Yup, it looks like iOS devices have a limit for the size of images they will load inside a web page, and these images are on the sadface side of that limit. If you click on the broken image, you’ll find the images will load fine in a separate page. Sorry for the inconvenience.

[Image: baseline-grid]

The speedup in BinaryTree17, and to a lesser extent Fannkuch11, benchmarks is influenced by the performance of the heap allocator. Part of heap allocation involves updating statistics stored in 64 bit quantities, which flow into runtime.MemStats. During the 1.1 cycle, some quick work on the part of the Atom symbol removed many of these 64 bit operations, which shows as decreased run time in these benchmarks.

net/http

Across all the samples, net/http benchmarks have benefited from the new poller implementation as well as the pure Go improvements to the net/http package through the work of Brad Fitzpatrick and Jeff Allen.

[Image: net-grid]

runtime

The results of the runtime benchmarks mirror those from amd64 and 386. The general trend is towards improvement, and in some cases, a large improvement, in areas like map operations.

[Image: runtime-grid]

The improvements to the Append set of benchmarks show the benefit of a change committed by Rob Pike which avoids a call to runtime.memmove when appending small amounts of data to a []byte.

The common theme across all the samples is the regression in some channel operations. This may be attributable to the high cost of performing atomic operations on arm platforms. Currently all atomic operations are implemented by the runtime package, but in the future they may be handled directly in the compiler which could reduce their overhead.

The CompareString benchmarks show a smaller improvement than other platforms because CL 8056043 has not yet been backported to arm.

Conclusion

With the addition of cgo support, throughput improvements in the net package, and improvements to code generation and the garbage collector, Go 1.1 represents a significant milestone for writing Go programs targeting arm.

To wrap up this series of articles it is clear that Go 1.1 delivers on its promise of a general 30-40% improvement across all three supported architectures. If we consider the relative improvements across compilers, while 6g remains the flagship compiler and benefits from the fastest underlying hardware, 8g and 5g show a greater improvement relative to the Go 1.0 release of last year.

But wait, there is more

If you’ve enjoyed this series of posts and want to follow the progress of Go 1.2 I’ll soon be opening a branch of autobench which will track Go 1.1 vs tip (1.2). I’ll post and tweet the location when it is ready.

Since the Go 1.2 change window was opened on May 14th, the allocator and garbage collector have already received improvements from Dmitry Vyukov and the Atom symbol aimed at further reducing the cost of GC, and Carl Shapiro has started work on precise collection of stack allocated values.

Also for Go 1.2 are proposals for a better memory allocator, and a change to the scheduler to give it the ability to preempt long running goroutines, which is aimed at reducing GC latency.

Finally, Go 1.2 has a release timetable. So while we can’t really say what will or will not make it into 1.2, we can say that it should be done by the end of 2013.

Go 1.1 performance improvements, part 2

This is the second in a three part series exploring the performance improvements in the recent Go 1.1 release.

In part 1 I explored the improvements on amd64 platforms, as well as general improvements available to all via runtime and compiler frontend improvements.

In this article I will focus on the performance of Go 1.1 on 386 machines. The results in this article are taken from linux-386-d5666bad617d-vs-e570c2daeaca.txt.

Go 1 benchmarks on linux/386

When it comes to performance, the 8g compiler is at a disadvantage. The small number of general purpose registers available in the 386 programming model, and the weird restrictions on their use place a heavy burden on the compiler and optimiser. However that did not stop Rémy Oudompheng making several significant contributions to 8g during the 1.1 cycle.

Firstly the odd 387 floating point model was deprecated (it’s still there if you are running very old hardware with the GO386=387 switch) in favor of SSE2 instructions.

Secondly, Rémy put significant effort into porting code generation improvements from 6g into 8g (and 5g, the arm compiler). Where possible code was moved into the compiler frontend, gc, including introducing a framework to rewrite division as simpler shift and multiply operations.
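To give a feel for what rewriting division as shift and multiply means, here is a sketch of two well known identities for unsigned 32 bit values. It demonstrates the arithmetic only; it is not taken from the compiler, and the function names are made up for this example.

```go
package main

import "fmt"

// div8 divides by a power of two with a single shift.
func div8(x uint32) uint32 {
	return x >> 3
}

// div3 divides by three using a multiply and a shift. The constant
// 0xAAAAAAAB is (2^33 + 1) / 3, so the 64 bit product shifted right by 33
// equals x / 3 for every 32 bit unsigned x.
func div3(x uint32) uint32 {
	return uint32((uint64(x) * 0xAAAAAAAB) >> 33)
}

func main() {
	for _, x := range []uint32{0, 7, 8, 100, 4294967295} {
		fmt.Println(x, div8(x) == x/8, div3(x) == x/3)
	}
}
```

The multiply and shift forms matter most on processors where a hardware divide is slow or, as on some arm hosts, has to be emulated in software.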

[Image: linux-386-baseline]

In general the results for linux/386 on this host show improvements that are as good, or in some cases, better than linux/amd64. Unlike linux/amd64, there is no slowdown in the Gzip or Gob benchmarks.

The two small regressions, BinaryTree17 and Fannkuch11, are assumed to be attributable to the garbage collector becoming more precise. This involves some additional bookkeeping to track the size and type of objects allocated on the heap, which shows up in these benchmarks.

net/http benchmarks

The improvements in the net package previously demonstrated in the linux/amd64 article carry over to linux/386. The improvements in the ClientServer benchmarks are not as marked as those of their amd64 cousins, but nonetheless show a significant improvement overall due to the tighter integration between the runtime and net package.

[Image: linux-386-net-http]

Runtime microbenchmarks

Like the amd64 benchmarks in part 1, the runtime microbenchmarks show a mixture of results. Some low level operations got a bit slower, while other operations, like map operations, have improved significantly.

[Image: linux-386-microbenchmarks]

The final two benchmarks, which appear truncated, are actually so large they do not fit on the screen. The improvement is mostly due to this change which introduced a faster low level Equals operation for the strings, bytes and runtime packages. The results speak for themselves.

benchmark                                  old MB/s   new MB/s  speedup
BenchmarkCompareStringBigUnaligned         29.08      1145.48   39.39x
BenchmarkCompareStringBig                  29.09      1253.48   43.09x
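The speedup column is simply the new throughput divided by the old. A short sketch, using the figures from the table above, reproduces it; the speedup helper is written for this example and is not part of misc/benchcmp.

```go
package main

import "fmt"

// speedup returns how many times faster the new throughput is than the old.
func speedup(oldMBs, newMBs float64) float64 {
	return newMBs / oldMBs
}

func main() {
	// Throughput figures (MB/s) copied from the table above.
	fmt.Printf("BenchmarkCompareStringBigUnaligned %.2fx\n", speedup(29.08, 1145.48))
	fmt.Printf("BenchmarkCompareStringBig          %.2fx\n", speedup(29.09, 1253.48))
}
```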

Conclusion

Although 8g is not the leading compiler of the gc suite (Ken Thompson himself has said that there are essentially no free registers available on 386), linux/386 shows that it easily meets the 30-40% performance improvement claim. In some benchmarks, compared to Go 1.0, linux/386 beats linux/amd64.

Additionally, due to reductions in memory usage, all the compilers now use around half as much memory when compiling, and as a direct consequence, compile up to 30% faster than their 1.0 predecessors.

I encourage you to review the benchmark data in the autobench repository and if you are able, submit your own results.

In the final article in this series I will investigate the performance improvement Go 1.1 brings to arm platforms. I assure you, I’ve saved the best til last.

Update: thanks to @ajstarks who provided me with higher quality benchviz images.

Go 1.1 performance improvements

This is the first in a series of articles analysing the performance improvements in the Go 1.1 release.

It has been reported (here, and here) that performance improvements of 30-40% are available simply by recompiling your code under Go 1.1. For linux/amd64 this holds true for a wide spectrum of benchmarks. For platforms like linux/386 and linux/arm the results are even more impressive, but I’m putting the cart before the horse.

A note about gccgo. This series focuses on the contributions that the improvements to the gc series of compilers (5g, 6g and 8g) have made to Go 1.1’s performance. gccgo benefits indirectly from these improvements as it shares the same runtime and standard library, but is not the focus of this benchmarking series.

Go 1.1 features several improvements in the compilers, runtime and standard library that are directly responsible for the resulting improvements in program speed. Specifically:

  • Code generation improvements across all three gc compilers, including better register allocation, reduction in redundant indirect loads, and reduced code size.
  • Improvements to inlining, including inlining of some builtin function calls and compiler generated stub methods when dealing with interface conversions.
  • Reduction in stack usage, which reduces pressure on stack size, leading to fewer stack splits.
  • Introduction of a parallel garbage collector. The collector remains mark and sweep, but the phases can now utilise all CPUs.
  • More precise garbage collection, which reduces the size of the heap, leading to lower GC pause times.
  • A new runtime scheduler which can make better decisions when scheduling goroutines.
  • Tighter integration of the scheduler with the net package, leading to significantly decreased packet processing latencies and higher throughput.
  • Parts of the runtime and standard library have been rewritten in assembly to take advantage of specific bulk move or crypto instructions.

Introducing autobench

Few things irk me more than unsubstantiated, unrepeatable benchmarks. As this series is going to throw out a lot of numbers, and draw some strong conclusions, it was important for me to provide a way for people to verify my results on their machines.

To this end I have built a simple make based harness which can be run on any platform that Go supports to compare the performance of a set of synthetic benchmarks against Go 1.0 and Go 1.1. While the project is still being developed, it has generated a lot of useful data which is captured in the repository. You can find the project on GitHub.

https://github.com/davecheney/autobench

I am indebted to Go community members who submitted benchmark data from their machines allowing me to make informed conclusions about the relative performance of Go 1.1.

If you are interested in participating in autobench there will be a branch which tracks the performance of Go 1.1 against tip opening soon.

A picture speaks a thousand words

To better visualise the benchmark results, AJ Starks has produced a wonderful tool, benchviz, which turns the dry text based output of misc/benchcmp into rather nice graphs. You can read all about benchviz on AJ’s blog.

http://mindchunk.blogspot.com.au/2013/05/visualizing-go-benchmarks-with-benchviz.html

Following a tradition set by the misc/benchcmp tool, improvements, be they a reduction in run time or an increase in throughput, are shown as bars extending towards the right. Regressions fall back to the left.

Go 1 benchmarks on linux/amd64

The remainder of this post will focus on linux/amd64 performance. The 6g compiler is considered to be the flagship of the gc compiler suite. In addition to code generation improvements in the front and back ends, performance critical parts of the standard library and runtime have been rewritten in assembly to take advantage of SSE2 instructions.

The data for the remainder of this article is taken from the results file linux-amd64-d5666bad617d-vs-e570c2daeaca.txt.

bm0

 

The go1 benchmark suite, while being a synthetic benchmark, attempts to capture some real world usages of the main packages in the standard library. In general the results support the hypothesis of a broad 30-40% improvement. Looking at the results submitted to the autobench repository, it is clear that GobEncode and Gzip have regressed, and issues 5165 and 5166 have been raised, respectively. In the latter case, the switch to 64 bit ints is assumed to be at least partially to blame.

net/http benchmarks

This set of benchmarks is extracted from the net/http package and demonstrates the work that Brad Fitzpatrick, Dmitry Vyukov, and many others have put into the net and net/http packages.

bm2

 

Of note in this benchmark set are the improvements in the ReadRequest benchmarks, which attempt to benchmark the decoding of an HTTP request. The improvements in the ClientServerParallel benchmarks are not currently available across all amd64 platforms, as some of them have no support for the new runtime integration with the net package. Finishing support for the remaining BSD and Windows platforms is a focus for the 1.2 cycle.

Runtime microbenchmarks

The final set of benchmarks presented here are extracted from the runtime package.

bm1

 

The runtime benchmarks represent micro benchmarks of very low level parts of the runtime package.

The obvious regression is the first Append benchmark. In wall time the benchmark has increased from 36 ns/op to 100 ns/op, which shows that for some append use cases there has been a regression. This may have already been addressed in tip by CL 9360043.
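To put that figure in the same terms as the graphs, here is a small sketch of the arithmetic. It assumes the usual percentage change definition, (new - old) / old; the percentChange helper is my own name for illustration, not something taken from misc/benchcmp.

```go
package main

import "fmt"

// percentChange returns the relative change from oldNs to newNs as a
// percentage. For a ns/op measurement a positive result is a regression.
func percentChange(oldNs, newNs float64) float64 {
	return (newNs - oldNs) / oldNs * 100
}

func main() {
	// Figures quoted for the first Append benchmark.
	fmt.Printf("%+.2f%%\n", percentChange(36, 100)) // +177.78%
}
```

A 64 ns difference is tiny in absolute terms, but relative to the original 36 ns it is large, which is why the bar stands out.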

The big wins in the runtime benchmarks are the amazing new map code by khr which addresses issue 3886, the reduction in overhead of channel operations (thanks to Dmitry’s new scheduler), improvements in complex128 operations, and speedups in hash and memmove operations which were rewritten in 64 bit assembly.

Conclusion

For linux/amd64 on modern 64 bit Intel CPUs, the 6g compiler and runtime can generate significantly faster code. Other amd64 platforms share similar speedups, although the specific improvements vary. I encourage you to review the benchmark data in the autobench repository and if you are able, submit your own results.

In subsequent articles I will investigate the performance improvement Go 1.1 brings to 386 and arm platforms.

Update: thanks to @ajstarks who provided me with higher quality benchviz images.

Go 1.1 tarballs for linux/arm

For the time poor ARM fans in the room, I’ve updated my tarball distributions to Go 1.1. These tarballs are built using the same misc/dist tool that makes the official builds on the golang.org download page.

You can find the link at the Unofficial ARM tarballs for Go item at the top of this page. Please address any bug reports or comments to me directly.

There are also a number of other ways to obtain Go 1.1 appearing on the horizon. For example, if you are using Debian Sid, Go 1.1 is available now. This version has been imported into Ubuntu Saucy (which will become 13.10), although at this time it remains in the proposed channel.

Rest assured I will not be shy in announcing when Go 1.1 has wider availability in Ubuntu.

Go and Juju at Canonical slides posted

This month I had the privilege of presenting a talk at the GoSF meetup to 120 keen Gophers.

I was absolutely blown away by the Iron.io/HeavyBit offices. It was a fantastic presentation space with a professional sound and video crew to stream the meetup straight to G+.

The slides are available on my GitHub account, but a more convenient way to consume them is Gary Burd’s fantastic talks.godoc.org site.

http://talks.godoc.org/github.com/davecheney/gosf/5nines.slide#1

If you’re interested in finding out more about the Juju project itself, you can find us on the project page, https://launchpad.net/juju-core/ or / -dev on IRC.

Curious Channels

Channels are a signature feature of the Go programming language. Channels provide a powerful way to reason about the flow of data from one goroutine to another without the use of locks or critical sections.

Today I want to talk about two important properties of channels that make them useful for controlling not just data flow within your program, but the flow of control as well.

A closed channel never blocks

The first property I want to talk about is a closed channel. Once a channel has been closed, you cannot send a value on this channel, but you can still receive from the channel.

package main

import "fmt"

func main() {
        ch := make(chan bool, 2)
        ch <- true
        ch <- true
        close(ch)

        for i := 0; i < cap(ch)+1; i++ {
                v, ok := <-ch
                fmt.Println(v, ok)
        }
}

In this example we create a channel with a buffer of two, fill the buffer, then close it.

true true
true true
false false

Running the program shows we retrieve the first two values we sent on the channel, then on our third attempt the channel gives us the values of false and false. The first false is the zero value for that channel’s type, which is false, as the channel is of type chan bool. The second indicates the open state of the channel, which is now false, indicating the channel is closed. The channel will continue to report these values indefinitely. As an experiment, alter this example to receive from the channel 100 times.
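If you would rather not make the edit yourself, here is one way the experiment might look. The drain helper is my own addition for illustration; it just counts how many receives returned a real value and how many reported the channel as closed.

```go
package main

import "fmt"

// drain receives n times from ch and reports how many receives returned
// a sent value (ok == true) and how many reported a closed channel.
func drain(ch chan bool, n int) (open, closed int) {
	for i := 0; i < n; i++ {
		if _, ok := <-ch; ok {
			open++
		} else {
			closed++
		}
	}
	return open, closed
}

func main() {
	ch := make(chan bool, 2)
	ch <- true
	ch <- true
	close(ch)

	open, closed := drain(ch, 100)
	fmt.Println(open, closed) // 2 98
}
```

None of the 98 receives after the buffer is drained block; each returns immediately with the zero value.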

Being able to detect if your channel is closed is a useful property; it is used in the range over channel idiom to exit the loop once a channel has been drained.

package main

import "fmt"

func main() {
        ch := make(chan bool, 2)
        ch <- true
        ch <- true
        close(ch)

        for v := range ch {
                fmt.Println(v) // called twice
        }
}

This property really comes into its own when combined with select. Let’s start with this example:

package main

import (
        "fmt"
        "sync"
        "time"
)

func main() {
        finish := make(chan bool)
        var done sync.WaitGroup
        done.Add(1)
        go func() {
                select {
                case <-time.After(1 * time.Hour):
                case <-finish:
                }
                done.Done()
        }()
        t0 := time.Now()
        finish <- true // send the close signal
        done.Wait()    // wait for the goroutine to stop
        fmt.Printf("Waited %v for goroutine to stop\n", time.Since(t0))
}

Running the program on my system gives a low wait duration, so it is clear that the goroutine does not wait the full hour before calling done.Done():

Waited 129.607us for goroutine to stop

But there are a few problems with this program. The first is that the finish channel is not buffered, so the send to finish may block if the receiver forgot to add finish to their select statement. You could solve that problem by wrapping the send in a select block to make it non blocking, or by making the finish channel buffered. However, if you had many goroutines listening on the finish channel, you would need to track them and remember to send the correct number of times to the finish channel. This might get tricky if you aren’t in control of creating these goroutines; they may be created in another part of your program, perhaps in response to incoming requests over the network.
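For reference, a non blocking send might look something like this. The trySend helper is hypothetical, introduced here only to show the select with a default case.

```go
package main

import "fmt"

// trySend attempts to send on finish without blocking. It reports
// whether a receiver (or a free buffer slot) accepted the value.
func trySend(finish chan bool) bool {
	select {
	case finish <- true:
		return true
	default:
		return false
	}
}

func main() {
	unbuffered := make(chan bool)
	fmt.Println(trySend(unbuffered)) // false: nobody is receiving

	buffered := make(chan bool, 1)
	fmt.Println(trySend(buffered)) // true: the buffer had room
	fmt.Println(trySend(buffered)) // false: the buffer is now full
}
```

Note the trade off: the sender no longer blocks, but if nobody is ready the signal is silently dropped, so this only papers over the problem.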

A nice solution to this problem is to leverage the property that a closed channel is always ready to receive. Using this property we can rewrite the program, now including 100 goroutines, without having to keep track of the number of goroutines spawned, or correctly size the finish channel:

package main

import (
        "fmt"
        "sync"
        "time"
)

func main() {
        const n = 100
        finish := make(chan bool)
        var done sync.WaitGroup
        for i := 0; i < n; i++ { 
                done.Add(1)
                go func() {
                        select {
                        case <-time.After(1 * time.Hour):
                        case <-finish:
                        }
                        done.Done()
                }()
        }
        t0 := time.Now()
        close(finish)    // closing finish makes it ready to receive
        done.Wait()      // wait for all goroutines to stop
        fmt.Printf("Waited %v for %d goroutines to stop\n", time.Since(t0), n)
}

On my system, this returns

Waited 231.385us for 100 goroutines to stop

So what is going on here? As soon as the finish channel is closed, it becomes ready to receive. As all the goroutines are waiting to receive either from their time.After channel, or finish, the select statement is now complete and each goroutine exits after calling done.Done() to decrement the WaitGroup counter. This powerful idiom allows you to use a channel to send a signal to an unknown number of goroutines, without having to know anything about them, or worrying about deadlock.

Before moving on to the next topic, I want to mention a final simplification that is preferred by many Go programmers. If you look at the sample program above, you’ll note that we never send a value on the finish channel, and the receiver always discards any value received. Because of this it is quite common to see the program written like this:

package main

import (
        "fmt"
        "sync"
        "time"
)

func main() {
        finish := make(chan struct{})
        var done sync.WaitGroup
        done.Add(1)
        go func() {
                select {
                case <-time.After(1 * time.Hour):
                case <-finish:
                }
                done.Done()
        }()
        t0 := time.Now()
        close(finish)
        done.Wait()
        fmt.Printf("Waited %v for goroutine to stop\n", time.Since(t0))
}

As the behaviour of the close(finish) relies on signalling the close of the channel, not the value sent or received, declaring finish to be of type chan struct{} says that the channel contains no value; we’re only interested in its closed property.

A nil channel always blocks

The second property I want to talk about is the polar opposite of the closed channel property. A nil channel, that is, a channel value that has not been initialised or has been set to nil, will always block. For example

package main

func main() {
        var ch chan bool
        ch <- true // blocks forever
}

will deadlock as ch is nil and will never be ready to send. The same is true for receiving

package main

func main() {
        var ch chan bool
        <-ch // blocks forever
}
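You can compare the two properties side by side without deadlocking your program by using a select with a default case. This is only a sketch; the ready helper is my own name, and note that it consumes a value if one happens to be available.

```go
package main

import "fmt"

// ready reports whether a receive from ch can proceed immediately.
func ready(ch chan bool) bool {
	select {
	case <-ch:
		return true
	default:
		return false
	}
}

func main() {
	var nilch chan bool // nil: never ready
	closed := make(chan bool)
	close(closed) // closed: always ready

	fmt.Println(ready(nilch))  // false
	fmt.Println(ready(closed)) // true
}
```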

This might not seem important, but is a useful property when you want to use the closed channel idiom to wait for multiple channels to close. For example

// WaitMany waits for a and b to close.
func WaitMany(a, b chan bool) {
        var aclosed, bclosed bool
        for !aclosed || !bclosed {
                select {
                case <-a:
                        aclosed = true
                case <-b:
                        bclosed = true
                }
        }
}

WaitMany() looks like a good way to wait for channels a and b to close, but it has a problem. If channel a is closed first, it will always be ready to receive. Because bclosed is still false, the program can enter an infinite loop, preventing the channel b from ever being closed.
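To see the spinning for yourself, a rough sketch like the following counts how often the select runs. The spins and demo names and the 10ms delay are my own choices for illustration, and it assumes a reasonably recent Go release where the goroutine closing b still gets a chance to run while the loop is busy.

```go
package main

import (
	"fmt"
	"time"
)

// spins runs the flawed WaitMany loop and returns how many times the
// select statement executed before both channels were seen as closed.
func spins(a, b chan bool) int {
	var aclosed, bclosed bool
	n := 0
	for !aclosed || !bclosed {
		select {
		case <-a:
			aclosed = true
		case <-b:
			bclosed = true
		}
		n++
	}
	return n
}

// demo closes a immediately and b roughly 10ms later.
func demo() int {
	a, b := make(chan bool), make(chan bool)
	close(a)
	go func() {
		time.Sleep(10 * time.Millisecond)
		close(b)
	}()
	return spins(a, b)
}

func main() {
	fmt.Println("select ran", demo(), "times")
}
```

Ideally the select would run exactly twice, once per channel. The exact count varies by machine, but it will be far higher than that, as the loop burns CPU receiving from the closed a while it waits for b.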

A safe way to solve the problem is to leverage the blocking properties of a nil channel and rewrite the program like this

package main

import (
        "fmt"
        "time"
)

func WaitMany(a, b chan bool) {
        for a != nil || b != nil {
                select {
                case <-a:
                        a = nil 
                case <-b:
                        b = nil
                }
        }
}

func main() {
        a, b := make(chan bool), make(chan bool)
        t0 := time.Now()
        go func() {
                close(a)
                close(b)
        }()
        WaitMany(a, b)
        fmt.Printf("waited %v for WaitMany\n", time.Since(t0))
}

In the rewritten WaitMany() we nil the reference to a or b once a receive from it has succeeded, which happens when it is closed. When a nil channel is part of a select statement, it is effectively ignored, so niling a removes it from selection, leaving only b which blocks until it is closed, exiting the loop without spinning.

Running this on my system gives

waited 54.912us for WaitMany
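The same technique extends to channels that carry values. Here is a minimal sketch, assuming two chan int producers; the sum function is hypothetical and only there to show the two value receive form combined with niling a closed channel.

```go
package main

import "fmt"

// sum adds up every value received from a and b, returning once both
// channels have been closed.
func sum(a, b chan int) int {
	total := 0
	for a != nil || b != nil {
		select {
		case v, ok := <-a:
			if !ok {
				a = nil // a is closed; stop selecting on it
				continue
			}
			total += v
		case v, ok := <-b:
			if !ok {
				b = nil // b is closed; stop selecting on it
				continue
			}
			total += v
		}
	}
	return total
}

func main() {
	a, b := make(chan int), make(chan int)
	go func() {
		for i := 1; i <= 3; i++ {
			a <- i
		}
		close(a)
	}()
	go func() {
		b <- 10
		close(b)
	}()
	fmt.Println(sum(a, b)) // 16
}
```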

In conclusion, the simple properties of closed and nil channels are powerful building blocks that can be used to create highly concurrent programs that are simple to reason about.