Go 1.3 linker improvements

Go obtains much of its compilation speed from the Plan 9 compiler, of which it is a direct descendant. The Plan 9 toolchain deferred much of the work traditionally performed by a compiler to the linking stage and its performance was summarised in section 8 of this paper

The new compilers compile quickly, load slowly, and produce medium quality object code.

Because of the similar division of labour between Go’s compiler and linker, linking is commonly more expensive that compilation. This leads to several problems

  • Linking cannot benefit from incremental compilation, each link pass starts afresh even if only a tiny part of program has changed.
  • Linking speed is at least linear (often worse) with the number of packages being linked into the final executable – larger programs link slower.
  • While multiple commands may be linked in parallel, each individual link is single threaded – as CPU speeds stall, or go backwards, favouring additional cores, linking gets slower in real terms.

It should be noted that while linking is considered slow in terms of the other parts of the Go toolchain, it remains much faster than linking a traditional C or C++ application.

Linking speed has long been recognised as an issue by the Go developers and during the 1.3 development cycle the linker underwent major changes to improve its performance.

What does the Go linker do ?

To understand the change, Russ Cox wrote in his proposal at the beginning of the cycle

The current linker performs two separable tasks.

First, it translates an input stream of pseudo-instructions into executable code and data blocks, along with a list of relocations.

Second, it deletes dead code, merges what’s left into a single image, resolves relocations, and generates a few whole-program data structures such as the runtime symbol table.

In a nutshell, the change has moved the first task of the linker, the pseudo instruction translation, from the linker into the compiler.

What are pseudo instructions ?

Prior to Go 1.3, the output of the compiler, the .a files in your $GOPATH/pkg directory was not directly executable machine code. It was, as Russ describes, a stream of pseudo instructions, which, while not machine independent, it was also not directly executable.

During the linking phase, the linker itself was responsible for turning this pseudo instruction stream into real machine code. Dealing with pseudo instructions during the compilation phase makes the compiler simpler to write.

An example of a pseudo instructions are the unified MOV instruction available to the compiler and assembler

MOV R0, R1          // move the contents of R0 into R1
MOV $80000001, R0   // move the the constant 80000001  into R0
MOV R0, 0           // move the value into R0 into address 0

On a processor like ARM, when translated to machine code, this becomes three separate operations. The first MOV R0, R1 is a simple move from one register to another and the final opcode is MOV.

The second form stores the large constant, 80000001 into R0. ARM processors cannot encode such a large constant directly into the instruction so the linker would store the constant at the end of the function and replace the MOV with a load from an address relative to the program counter into a scratch register (R11 is reserved for the linker) then move the value from R11 to R0.

The final form also needs help from the linker as ARM processors cannot use a constant as an address to store a value. The linker will arrange to load 0 into R11 then store the contents with an instruction like MOV R0, (R11)

The work required for X86 is similar, although the specific restrictions differ.

The results

To evaluate the performance improvements of the change to the linker I’m going to compare building and linking two programs, the venerable godoc and the jujud server binary which depends on almost every line of code in the Juju repo.

In preparation I checked out copies of Go 1.2.1 and Go 1.3beta1 (this data was collected some time ago, but the changes in 1.3beta2 are unrelated to the linker).


% rm -rf ~/pkg
% time /tmp/go1.2.1/bin/go install code.google.com/p/go.tools/cmd/godoc
real    0m3.239s
user    0m3.307s
sys     0m0.595s

The time to compile and link godoc from scratch on this machine with Go 1.2.1 was 3.2 seconds. Let’s look at the time just to recompile the main package and link godoc

% touch ~/src/code.google.com/p/go.tools/cmd/godoc/play.go
% time /tmp/go1.2.1/bin/go install code.google.com/p/go.tools/cmd/godoc
real    0m1.578s
user    0m1.434s
sys     0m0.146s

With most of the compilation avoided, the total time drops to 1.5 seconds. Let’s look at how the linker change in Go 1.3 affects the results.

% rm -rf ~/pkg
% time /tmp/go1.3beta1/bin/go install code.google.com/p/go.tools/cmd/godoc
real    0m3.193s
user    0m3.441s
sys     0m0.530s

Under Go 1.3beta1 the time to compile and link from scratch is roughly the same as Go 1.2.1. There is perhaps a hint that more work is being done in parallel. Let’s compare the results from an incremental compilation.

% touch ~/src/code.google.com/p/go.tools/cmd/godoc/play.go
% time /tmp/go1.3beta1/bin/go install code.google.com/p/go.tools/cmd/godoc
real    0m0.996s
user    0m0.881s
sys     0m0.118s

Under Go 1.3beta1 the time to recompile godoc‘s main package and link has dropped from 1.5 seconds to just under a second. A saving of half a second, or 30% compared to the performance of Go 1.2.1.


% rm -rf ~/pkg
% time /tmp/go1.2.1/bin/go install  launchpad.net/juju-core/cmd/jujud
real    0m8.247s
user    0m18.110s
sys     0m3.861s

Time to compile and link jujud from scratch, roughly 220 packages at this time, was 8.2 seconds using Go 1.2.1. Let’s look at the incremental performance.

% touch ~/src/launchpad.net/juju-core/cmd/jujud/main.go
% time /tmp/go1.2.1/bin/go install launchpad.net/juju-core/cmd/jujud
real    0m3.139s
user    0m2.831s
sys     0m0.305s

Which shows the time to recompile the main package and link the executable is 3.2 seconds. You can also see that the process is almost entirely single threaded as the sum of user and sys is equal to the wall clock time, real.

Let’s turn to Go 1.3beta1

% rm -rf ~/pkg
% time /tmp/go1.3beta1/bin/go install launchpad.net/juju-core/cmd/jujud
real    0m8.107s
user    0m20.533s
sys     0m3.574s

The full rebuild times are comparable to the Go 1.2.1 results, possibly a hair faster. The real improvements show themselves in the incremental compilation scenario.

% touch ~/src/launchpad.net/juju-core/cmd/jujud/main.go
% time /tmp/go1.3beta1/bin/go install launchpad.net/juju-core/cmd/jujud
real    0m2.219s
user    0m1.929s
sys     0m0.290s

Under Go 1.3beta1 the time to recompile the main package and link has dropped from 3.2 seconds to 2.2 seconds, a saving of one second, again roughly 30%.

In conclusion

In my tests, the linker improvements coming in Go 1.3 deliver approximately a 30% reduction in link time. Depending on the size of your final executable this could be a small amount, or a significant amount of time.

Overall the linking change has several important benefits:

    1. Because it’s performed in the compile stage, the result is stored in your $GOPATH/pkg directory. Effectively the instruction selection pass is cached, whereas previously it was recomputed for each binary linked even if they shared many common packages.
    2. Because more work is done in the compile stage, it is done in parallel if you have more than one CPU. If you have one CPU the end result is unchanged, the same amount of work is done, just at different phases, although you will benefit from point 1 regardless
    3. Because more work is done in the compilation stage, less work is done in the linker, so the files passed to the linker are smaller and the work it has to do, which is effectively a mark, sweep and compact over all the symbols presented, is less.

In short, the 1.3 linker rocks.

4 thoughts on “Go 1.3 linker improvements

    1. Dave Cheney Post author

      My GOPATH is set to $HOME, so ~/pkg is the cache of compiled packages. This does not include the Go standard library itself, so the uncache test is only the code from $GOPATH/src

    1. Dave Cheney Post author

      I’d imagine that the filesystem cache does play a part in improving the incremental build time, but as we’re doing the same steps for both versions of Go, I don’t think the cache is favouring one or the other.

      If you wanted to experiment, and are using linux, you could try inserting sync && echo 3 > /proc/sys/vm/drop_caches between each test.

Comments are closed.