Go obtains much of its compilation speed from the Plan 9 compiler, of which it is a direct descendant. The Plan 9 toolchain deferred much of the work traditionally performed by a compiler to the linking stage, and its performance was summarised in section 8 of this paper:
The new compilers compile quickly, load slowly, and produce medium quality object code.
Because of the similar division of labour between Go’s compiler and linker, linking is commonly more expensive than compilation. This leads to several problems:
- Linking cannot benefit from incremental compilation; each link pass starts afresh even if only a tiny part of the program has changed.
- Linking speed scales at least linearly (and often worse) with the number of packages being linked into the final executable – larger programs link more slowly.
- While multiple commands may be linked in parallel, each individual link is single threaded – as CPU speeds stall, or go backwards, favouring additional cores, linking gets slower in real terms.
It should be noted that while linking is considered slow compared to the other parts of the Go toolchain, it remains much faster than linking a traditional C or C++ application.
Linking speed has long been recognised as an issue by the Go developers and during the 1.3 development cycle the linker underwent major changes to improve its performance.
What does the Go linker do?
To understand the change, consider how Russ Cox described the linker in his proposal at the beginning of the cycle:
The current linker performs two separable tasks.
First, it translates an input stream of pseudo-instructions into executable code and data blocks, along with a list of relocations.
Second, it deletes dead code, merges what’s left into a single image, resolves relocations, and generates a few whole-program data structures such as the runtime symbol table.
In a nutshell, the change has moved the first task of the linker, the pseudo instruction translation, from the linker into the compiler.
What are pseudo instructions?
Prior to Go 1.3, the output of the compiler, the .a files in your $GOPATH/pkg directory, was not directly executable machine code. It was, as Russ describes, a stream of pseudo instructions which, while not machine independent, was not directly executable either.
During the linking phase, the linker itself was responsible for turning this pseudo instruction stream into real machine code. Dealing with pseudo instructions during the compilation phase makes the compiler simpler to write.
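If you are curious about what the compiler emits, you can ask the go tool to pass -S to the compiler, which prints its assembly listing as it builds each package. This is only a sketch – how much instruction selection has already been applied to that listing depends on which Go version you are running, and the output format varies between releases:

% go build -gcflags=-S code.google.com/p/go.tools/cmd/godoc
# prints the compiler’s assembly listing for every package it (re)builds;
# prior to Go 1.3 this listing is essentially the pseudo instruction
# stream that the linker later rewrites into machine code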
An example of a pseudo instruction is the unified MOV instruction available to the compiler and assembler:

MOV R0, R1        // move the contents of R0 into R1
MOV $80000001, R0 // move the constant 80000001 into R0
MOV R0, 0         // move the value in R0 into address 0
On a processor like ARM, when translated to machine code, this becomes three separate operations. The first form, MOV R0, R1, is a simple move from one register to another and the final opcode is MOV.
The second form stores the large constant, 80000001, into R0. ARM processors cannot encode such a large constant directly into the instruction, so the linker would store the constant at the end of the function and replace the MOV with a load from an address relative to the program counter into a scratch register (R11 is reserved for the linker), then move the value from R11 to R0.
The final form also needs help from the linker as ARM processors cannot use a constant as an address to store a value. The linker will arrange to load 0 into R11, then store the contents with an instruction like MOV R0, (R11).
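As a rough sketch of the rewriting described above (the exact opcodes and literal pool offsets the toolchain emits differ by architecture and release, so treat this as illustrative only), the second and third forms expand to something like:

// second form – the constant is parked in a literal pool at the end of
// the function and loaded relative to the program counter
MOV nn(PC), R11    // nn is an illustrative literal pool offset
MOV R11, R0

// third form – the address is materialised in the scratch register first
MOV $0, R11
MOV R0, (R11)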
The work required for X86 is similar, although the specific restrictions differ.
The results
To evaluate the performance improvements of the change to the linker I’m going to compare building and linking two programs: the venerable godoc, and the jujud server binary, which depends on almost every line of code in the Juju repo.
In preparation I checked out copies of Go 1.2.1 and Go 1.3beta1 (this data was collected some time ago, but the changes in 1.3beta2 are unrelated to the linker).
godoc
% rm -rf ~/pkg
% time /tmp/go1.2.1/bin/go install code.google.com/p/go.tools/cmd/godoc

real    0m3.239s
user    0m3.307s
sys     0m0.595s
The time to compile and link godoc from scratch on this machine with Go 1.2.1 was 3.2 seconds. Let’s look at the time just to recompile the main package and link godoc.
% touch ~/src/code.google.com/p/go.tools/cmd/godoc/play.go
% time /tmp/go1.2.1/bin/go install code.google.com/p/go.tools/cmd/godoc

real    0m1.578s
user    0m1.434s
sys     0m0.146s
With most of the compilation avoided, the total time drops to 1.5 seconds. Let’s look at how the linker change in Go 1.3 affects the results.
% rm -rf ~/pkg
% time /tmp/go1.3beta1/bin/go install code.google.com/p/go.tools/cmd/godoc

real    0m3.193s
user    0m3.441s
sys     0m0.530s
Under Go 1.3beta1 the time to compile and link from scratch is roughly the same as Go 1.2.1. There is perhaps a hint that more work is being done in parallel. Let’s compare the results from an incremental compilation.
% touch ~/src/code.google.com/p/go.tools/cmd/godoc/play.go
% time /tmp/go1.3beta1/bin/go install code.google.com/p/go.tools/cmd/godoc

real    0m0.996s
user    0m0.881s
sys     0m0.118s
Under Go 1.3beta1 the time to recompile godoc’s main package and link has dropped from 1.5 seconds to just under a second – a saving of half a second, or roughly 30%, compared to the performance of Go 1.2.1.
jujud
% rm -rf ~/pkg
% time /tmp/go1.2.1/bin/go install launchpad.net/juju-core/cmd/jujud

real    0m8.247s
user    0m18.110s
sys     0m3.861s
The time to compile and link jujud from scratch, roughly 220 packages at the time, was 8.2 seconds using Go 1.2.1. Let’s look at the incremental performance.
% touch ~/src/launchpad.net/juju-core/cmd/jujud/main.go
% time /tmp/go1.2.1/bin/go install launchpad.net/juju-core/cmd/jujud

real    0m3.139s
user    0m2.831s
sys     0m0.305s
This shows that the time to recompile the main package and link the executable is 3.2 seconds. You can also see that the process is almost entirely single threaded, as the sum of user and sys is equal to the wall clock time, real.
Let’s turn to Go 1.3beta1.
% rm -rf ~/pkg
% time /tmp/go1.3beta1/bin/go install launchpad.net/juju-core/cmd/jujud

real    0m8.107s
user    0m20.533s
sys     0m3.574s
The full rebuild times are comparable to the Go 1.2.1 results, possibly a hair faster. The real improvements show themselves in the incremental compilation scenario.
% touch ~/src/launchpad.net/juju-core/cmd/jujud/main.go
% time /tmp/go1.3beta1/bin/go install launchpad.net/juju-core/cmd/jujud

real    0m2.219s
user    0m1.929s
sys     0m0.290s
Under Go 1.3beta1 the time to recompile the main package and link has dropped from 3.2 seconds to 2.2 seconds, a saving of one second, again roughly 30%.
In conclusion
In my tests, the linker improvements coming in Go 1.3 deliver approximately a 30% reduction in link time. Depending on the size of your final executable this could be a small amount, or a significant amount of time.
Overall the linking change has several important benefits:
- Because it’s performed in the compile stage, the result is stored in your $GOPATH/pkg directory. Effectively the instruction selection pass is cached, whereas previously it was recomputed for each binary linked, even if they shared many common packages (see the sketch after this list).
- Because more work is done in the compile stage, it is done in parallel if you have more than one CPU. If you have only one CPU the end result is unchanged – the same amount of work is done, just in a different phase – although you still benefit from the caching described in the first point.
- Because more work is done in the compilation stage, less work is done in the linker: the files passed to the linker are smaller, and the work it has to do – effectively a mark, sweep and compact over all the symbols presented – is reduced.
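If you want to watch this division of labour on your own machine, the go tool’s -x flag prints every command it runs during a build. Here is a minimal sketch (the tool names that appear in the output, 6g and 6l on amd64 at the time, vary with architecture and Go version):

% touch ~/src/code.google.com/p/go.tools/cmd/godoc/play.go
% go install -x code.google.com/p/go.tools/cmd/godoc
# only the changed package is recompiled – up to date packages are reused
# from $GOPATH/pkg – but the final link step still runs on every install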
In short, the 1.3 linker rocks.
Wow, you “benchmarked” the build times by doing a full package installation..?
My GOPATH is set to $HOME, so ~/pkg is the cache of compiled packages. This does not include the Go standard library itself, so the uncached test only covers the code from $GOPATH/src.
Does OS (filesystem) cache affect the time results?
I’d imagine that the filesystem cache does play a part in improving the incremental build time, but as we’re doing the same steps for both versions of Go, I don’t think the cache is favouring one or the other.
If you want to experiment and are using Linux, you could try inserting
sync && echo 3 > /proc/sys/vm/drop_caches
between each test.