Tag Archives: arm

Go 1.8 toolchain improvements

This is a progress report on the Go toolchain improvements during the 1.8 development cycle.

Now that we’re well into November, the 1.8 development window is closing fast on the few remaining in-flight change lists, with the remainder being told to wait until the 1.9 development season opens when Go 1.8 ships in February 2017.

For more in this series, read my previous post on the Go 1.8 toolchain improvements from September, and my post on the improvements to the Go toolchain in the 1.7 development cycle.

Faster compilation

Since Go 1.5, released in August 2015, compile times have been significantly slower than Go 1.4. Work on addressing this slowdown started in earnest in the Go 1.7 cycle, and is still ongoing.

Robert Griesemer and Matthew Dempsky worked on rewriting the parser to make it faster and remove many of the package level variables inherited from the previous yacc based parser. This parser produces a new abstract syntax tree while the rest of the compiler expects the previous yacc syntax tree. For 1.8 the new parser must transform its output into the previous syntax tree for consumption by the rest of the compiler. Even with this extra transformation step the new parser is no slower than the previous version, and plans are being made to remove this transformation requirement in Go 1.9.

Compile time for full build relative to Go 1.4.3

The takeaway is that Go 1.8 is on target to improve compile times by an average of 15% over Go 1.7. Compared to the 3-5% improvements reported two months prior, it’s nice to know that there is still blood in this stone.

Note: The benchmark scripts for jujud, kube-controller-manager, and gogs are online. Please try them yourself and report your findings.

Code generation improvements

The big feature of the previous 1.7 cycle was the new SSA backend for 64 bit Intel. In Go 1.8 the SSA backend has been rolled out to all the other architectures that Go supports and the old backend code has been deleted.

amd64, by virtue of being the most popular production architecture, has always been the fastest. As I reported a few months ago, the results comparing Go 1.8 to Go 1.7 on Intel architectures show a middling improvement, driven equally by better code generation, better escape analysis, and optimisations to the standard library.

name                     old time/op    new time/op    delta
BinaryTree17-4              3.04s ± 1%     3.03s ± 0%     ~     (p=0.222 n=5+5)
Fannkuch11-4                3.27s ± 0%     3.39s ± 1%   +3.74%  (p=0.008 n=5+5)
FmtFprintfEmpty-4          60.0ns ± 3%    58.3ns ± 1%   -2.70%  (p=0.008 n=5+5)
FmtFprintfString-4          177ns ± 2%     164ns ± 2%   -7.47%  (p=0.008 n=5+5)
FmtFprintfInt-4             169ns ± 2%     157ns ± 1%   -7.22%  (p=0.008 n=5+5)
FmtFprintfIntInt-4          264ns ± 1%     243ns ± 1%   -8.10%  (p=0.008 n=5+5)
FmtFprintfPrefixedInt-4     254ns ± 2%     244ns ± 1%   -4.02%  (p=0.008 n=5+5)
FmtFprintfFloat-4           357ns ± 1%     348ns ± 2%   -2.35%  (p=0.032 n=5+5)
FmtManyArgs-4              1.10µs ± 1%    0.97µs ± 1%  -11.03%  (p=0.008 n=5+5)
GobDecode-4                9.85ms ± 1%    9.31ms ± 1%   -5.51%  (p=0.008 n=5+5)
GobEncode-4                8.75ms ± 1%    8.17ms ± 1%   -6.67%  (p=0.008 n=5+5)
Gzip-4                      282ms ± 0%     289ms ± 1%   +2.32%  (p=0.008 n=5+5)
Gunzip-4                   50.9ms ± 1%    51.7ms ± 0%   +1.67%  (p=0.008 n=5+5)
HTTPClientServer-4          195µs ± 1%     196µs ± 1%     ~     (p=0.095 n=5+5)
JSONEncode-4               21.6ms ± 6%    19.8ms ± 3%   -8.37%  (p=0.008 n=5+5)
JSONDecode-4               70.2ms ± 3%    71.0ms ± 1%     ~     (p=0.310 n=5+5)
Mandelbrot200-4            5.20ms ± 0%    4.73ms ± 1%   -9.05%  (p=0.008 n=5+5)
GoParse-4                  4.38ms ± 3%    4.28ms ± 2%     ~     (p=0.056 n=5+5)
RegexpMatchEasy0_32-4      96.7ns ± 2%    98.1ns ± 0%     ~     (p=0.127 n=5+5)
RegexpMatchEasy0_1K-4       311ns ± 1%     313ns ± 0%     ~     (p=0.214 n=5+5)
RegexpMatchEasy1_32-4      97.9ns ± 2%    89.8ns ± 2%   -8.33%  (p=0.008 n=5+5)
RegexpMatchEasy1_1K-4       519ns ± 0%     510ns ± 2%   -1.70%  (p=0.040 n=5+5)
RegexpMatchMedium_32-4      158ns ± 2%     146ns ± 0%   -7.71%  (p=0.016 n=5+4)
RegexpMatchMedium_1K-4     46.3µs ± 1%    47.8µs ± 2%   +3.12%  (p=0.008 n=5+5)
RegexpMatchHard_32-4       2.53µs ± 3%    2.46µs ± 0%   -2.91%  (p=0.008 n=5+5)
RegexpMatchHard_1K-4       76.1µs ± 0%    74.5µs ± 2%   -2.12%  (p=0.008 n=5+5)
Revcomp-4                   563ms ± 2%     531ms ± 1%   -5.78%  (p=0.008 n=5+5)
Template-4                 86.7ms ± 1%    82.2ms ± 1%   -5.16%  (p=0.008 n=5+5)
TimeParse-4                 433ns ± 3%     399ns ± 4%   -7.90%  (p=0.008 n=5+5)
TimeFormat-4                467ns ± 2%     430ns ± 1%   -7.76%  (p=0.008 n=5+5)

name                     old speed      new speed      delta
GobDecode-4              77.9MB/s ± 1%  82.5MB/s ± 1%   +5.84%  (p=0.008 n=5+5)
GobEncode-4              87.7MB/s ± 1%  94.0MB/s ± 1%   +7.15%  (p=0.008 n=5+5)
Gzip-4                   68.8MB/s ± 0%  67.2MB/s ± 1%   -2.27%  (p=0.008 n=5+5)
Gunzip-4                  381MB/s ± 1%   375MB/s ± 0%   -1.65%  (p=0.008 n=5+5)
JSONEncode-4             89.9MB/s ± 5%  98.1MB/s ± 3%   +9.11%  (p=0.008 n=5+5)
JSONDecode-4             27.6MB/s ± 3%  27.3MB/s ± 1%     ~     (p=0.310 n=5+5)
GoParse-4                13.2MB/s ± 3%  13.5MB/s ± 2%     ~     (p=0.056 n=5+5)
RegexpMatchEasy0_32-4     331MB/s ± 2%   326MB/s ± 0%     ~     (p=0.151 n=5+5)
RegexpMatchEasy0_1K-4    3.29GB/s ± 1%  3.27GB/s ± 0%     ~     (p=0.222 n=5+5)
RegexpMatchEasy1_32-4     327MB/s ± 2%   357MB/s ± 2%   +9.20%  (p=0.008 n=5+5)
RegexpMatchEasy1_1K-4    1.97GB/s ± 0%  2.01GB/s ± 2%   +1.76%  (p=0.032 n=5+5)
RegexpMatchMedium_32-4   6.31MB/s ± 2%  6.83MB/s ± 1%   +8.31%  (p=0.008 n=5+5)
RegexpMatchMedium_1K-4   22.1MB/s ± 1%  21.4MB/s ± 2%   -3.01%  (p=0.008 n=5+5)
RegexpMatchHard_32-4     12.6MB/s ± 3%  13.0MB/s ± 0%   +2.98%  (p=0.008 n=5+5)
RegexpMatchHard_1K-4     13.4MB/s ± 0%  13.7MB/s ± 2%   +2.19%  (p=0.008 n=5+5)
Revcomp-4                 451MB/s ± 2%   479MB/s ± 1%   +6.12%  (p=0.008 n=5+5)
Template-4               22.4MB/s ± 1%  23.6MB/s ± 1%   +5.43%  (p=0.008 n=5+5)

The big improvements from the switch to the SSA backend show up on non-Intel architectures. Here are the results for arm64:

name                     old time/op    new time/op     delta
BinaryTree17-8              10.6s ± 0%       8.1s ± 1%  -23.62%  (p=0.016 n=4+5)
Fannkuch11-8                9.19s ± 0%      5.95s ± 0%  -35.27%  (p=0.008 n=5+5)
FmtFprintfEmpty-8           136ns ± 0%      118ns ± 1%  -13.53%  (p=0.008 n=5+5)
FmtFprintfString-8          472ns ± 1%      331ns ± 1%  -29.82%  (p=0.008 n=5+5)
FmtFprintfInt-8             388ns ± 3%      273ns ± 0%  -29.61%  (p=0.008 n=5+5)
FmtFprintfIntInt-8          640ns ± 2%      438ns ± 0%  -31.61%  (p=0.008 n=5+5)
FmtFprintfPrefixedInt-8     580ns ± 0%      423ns ± 0%  -27.09%  (p=0.008 n=5+5)
FmtFprintfFloat-8           823ns ± 0%      613ns ± 1%  -25.57%  (p=0.008 n=5+5)
FmtManyArgs-8              2.69µs ± 0%     1.96µs ± 0%  -27.12%  (p=0.016 n=4+5)
GobDecode-8                24.4ms ± 0%     17.3ms ± 0%  -28.88%  (p=0.008 n=5+5)
GobEncode-8                18.6ms ± 0%     15.1ms ± 1%  -18.65%  (p=0.008 n=5+5)
Gzip-8                      1.20s ± 0%      0.74s ± 0%  -38.02%  (p=0.008 n=5+5)
Gunzip-8                    190ms ± 0%      130ms ± 0%  -31.73%  (p=0.008 n=5+5)
HTTPClientServer-8          205µs ± 1%      166µs ± 2%  -19.27%  (p=0.008 n=5+5)
JSONEncode-8               50.7ms ± 0%     41.5ms ± 0%  -18.10%  (p=0.008 n=5+5)
JSONDecode-8                201ms ± 0%      155ms ± 1%  -22.93%  (p=0.008 n=5+5)
Mandelbrot200-8            13.0ms ± 0%     10.1ms ± 0%  -22.78%  (p=0.008 n=5+5)
GoParse-8                  11.4ms ± 0%      8.5ms ± 0%  -24.80%  (p=0.008 n=5+5)
RegexpMatchEasy0_32-8       271ns ± 0%      225ns ± 0%  -16.97%  (p=0.008 n=5+5)
RegexpMatchEasy0_1K-8      1.69µs ± 0%     1.92µs ± 0%  +13.42%  (p=0.008 n=5+5)
RegexpMatchEasy1_32-8       292ns ± 0%      255ns ± 0%  -12.60%  (p=0.000 n=4+5)
RegexpMatchEasy1_1K-8      2.20µs ± 0%     2.38µs ± 0%   +8.38%  (p=0.008 n=5+5)
RegexpMatchMedium_32-8      411ns ± 0%      360ns ± 0%  -12.41%  (p=0.000 n=5+4)
RegexpMatchMedium_1K-8      118µs ± 0%      104µs ± 0%  -12.07%  (p=0.008 n=5+5)
RegexpMatchHard_32-8       6.83µs ± 0%     5.79µs ± 0%  -15.27%  (p=0.016 n=4+5)
RegexpMatchHard_1K-8        205µs ± 0%      176µs ± 0%  -14.19%  (p=0.008 n=5+5)
Revcomp-8                   2.01s ± 0%      1.43s ± 0%  -29.02%  (p=0.008 n=5+5)
Template-8                  259ms ± 0%      158ms ± 0%  -38.93%  (p=0.008 n=5+5)
TimeParse-8                 874ns ± 1%      733ns ± 1%  -16.16%  (p=0.008 n=5+5)
TimeFormat-8               1.00µs ± 1%     0.86µs ± 1%  -13.88%  (p=0.008 n=5+5)

name                     old speed      new speed       delta
GobDecode-8              31.5MB/s ± 0%   44.3MB/s ± 0%  +40.61%  (p=0.008 n=5+5)
GobEncode-8              41.3MB/s ± 0%   50.7MB/s ± 1%  +22.92%  (p=0.008 n=5+5)
Gzip-8                   16.2MB/s ± 0%   26.1MB/s ± 0%  +61.33%  (p=0.008 n=5+5)
Gunzip-8                  102MB/s ± 0%    150MB/s ± 0%  +46.45%  (p=0.016 n=4+5)
JSONEncode-8             38.3MB/s ± 0%   46.7MB/s ± 0%  +22.10%  (p=0.008 n=5+5)
JSONDecode-8             9.64MB/s ± 0%  12.49MB/s ± 0%  +29.54%  (p=0.016 n=5+4)
GoParse-8                5.09MB/s ± 0%   6.78MB/s ± 0%  +33.02%  (p=0.008 n=5+5)
RegexpMatchEasy0_32-8     118MB/s ± 0%    142MB/s ± 0%  +20.29%  (p=0.008 n=5+5)
RegexpMatchEasy0_1K-8     605MB/s ± 0%    534MB/s ± 0%  -11.85%  (p=0.016 n=5+4)
RegexpMatchEasy1_32-8     110MB/s ± 0%    125MB/s ± 0%  +14.23%  (p=0.029 n=4+4)
RegexpMatchEasy1_1K-8     465MB/s ± 0%    430MB/s ± 0%   -7.72%  (p=0.008 n=5+5)
RegexpMatchMedium_32-8   2.43MB/s ± 0%   2.77MB/s ± 0%  +13.99%  (p=0.016 n=5+4)
RegexpMatchMedium_1K-8   8.68MB/s ± 0%   9.87MB/s ± 0%  +13.71%  (p=0.008 n=5+5)
RegexpMatchHard_32-8     4.68MB/s ± 0%   5.53MB/s ± 0%  +18.08%  (p=0.016 n=4+5)
RegexpMatchHard_1K-8     5.00MB/s ± 0%   5.83MB/s ± 0%  +16.60%  (p=0.008 n=5+5)
Revcomp-8                 126MB/s ± 0%    178MB/s ± 0%  +40.88%  (p=0.008 n=5+5)
Template-8               7.48MB/s ± 0%  12.25MB/s ± 0%  +63.74%  (p=0.008 n=5+5)

These are pretty big improvements from just recompiling your binary.
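A note on reading these tables: the time/op and speed deltas describe the same change from two directions, so they are not mirror images of each other. Throughput is inversely proportional to time, which is why a 38% drop in time/op for Gzip-8 shows up as a throughput gain of roughly 61%. The sketch below shows the arithmetic; speedDelta is a name I made up for illustration and is not part of any benchmarking tool.

```go
package main

import "fmt"

// speedDelta converts a relative change in time/op (for example -0.3802
// for -38.02%) into the matching relative change in throughput.
// Throughput is bytes/time, so new/old throughput = 1 / (new/old time).
func speedDelta(timeDelta float64) float64 {
	return 1/(1+timeDelta) - 1
}

func main() {
	// Gzip-8 on arm64 from the table above: time/op changed by -38.02%.
	fmt.Printf("%+.2f%%\n", speedDelta(-0.3802)*100)
}
```

The result differs from the table in the last digit or so because the table's deltas are computed from unrounded measurements.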

Defer and cgo improvements

The question of whether defer can be used in hot code paths remains open, but during the 1.8 cycle Austin reduced the overhead of using defer by half, according to some benchmarks.

The runtime package benchmarks are a little less rosy.

name         old time/op  new time/op  delta
Defer-4       101ns ± 1%    66ns ± 0%  -34.73%  (p=0.000 n=20+20)
Defer10-4    93.2ns ± 1%  62.5ns ± 8%  -33.02%  (p=0.000 n=20+20)
DeferMany-4   148ns ± 3%   131ns ± 3%  -11.42%  (p=0.000 n=19+19)

According to them, defer improved by a third in the most common circumstances, where the statement closes over no more than a single variable.

Additionally, an optimisation by David Crawshaw reduced the overhead of defer in the cgo path by nearly half.

name       old time/op  new time/op  delta
CgoNoop-8  93.5ns ± 0%  51.1ns ± 1%  -45.34%  (p=0.016 n=4+5)

One more thing

Go 1.7 supported 64 bit mips platforms, thanks to the work of Minux and Cherry. However, the less powerful, but plentiful, 32 bit mips platforms were not supported. As a bonus, thanks to the work of Vladimir Stefanovic, Go 1.8 will ship with support for 32 bit mips.

% env GOARCH=mips go build -o godoc.mips golang.org/x/tools/cmd/godoc
% file godoc.mips 
godoc.mips: ELF 32-bit MSB  executable, MIPS, MIPS32 version 1 (SYSV), statically linked, not stripped

While 32 bit mips hosts are probably too small to compile Go programs natively, you can always cross compile from your development workstation for linux/mips.

Unofficial Go 1.2 tarballs for ARM now available

I have updated my unofficial ARM tarball distributions page with prebuilt Go 1.2 tarballs. You can find them by following the link in the main header of this page.

If you are interested in the potential performance improvements in Go 1.2, I wrote a post about it on the Gopher Academy blog as part of this year’s Go Advent series.

Unofficial Go 1.1.2 tarballs for ARM now available

I have updated my unofficial ARM tarball distributions to Go version 1.1.2.

You can find them by following the link in the main header of this page.

Unofficial Go 1.1.1 tarballs for ARM now available

This evening I rebuilt my unofficial ARM tarball distributions to Go version 1.1.1.

You can find them by following the link in the main header of this page.

Go 1.1 performance improvements, part 3

This is the final article in the series exploring the performance improvements available in the recent Go 1.1 release. You can also read part 1 and part 2 to get the back story for amd64 and 386.

This article focuses on the performance of arm platforms. Go 1.1 was an important release as it raised arm to a level on par with amd64 and 386 and introduced support for additional operating systems. Some highlights that Go 1.1 brings to arm are:

Support for cgo.
Addition of experimental support for freebsd/arm and netbsd/arm.
Better code generation, including a now partially working peephole optimiser, better register allocator, and many small improvements to reduce code size.
Support for ARMv6 hosts, including the Raspberry Pi.
The GOARM variable is now optional, and automatically chooses its value based on the host Go is compiled on.
The memory allocator is now significantly faster due to the elimination of many 64 bit instructions which were previously emulated at a high cost.
A significantly faster software division/modulo facility.

These changes would not have been possible without the efforts of Shenghou Ma, Rémy Oudompheng and Daniel Morsing, who made enormous contributions to the compiler and runtime during the Go 1.1 development cycle.

Again, a huge debt of thanks is owed to Anthony Starks who helped prepare the benchmark data and images for this article.

Go 1 benchmarks on `linux/arm`

Since its release Go has supported more than one flavor of arm architecture. Presented here are benchmarks from a wide array of hosts to give a representative sample of the performance of Go 1.1 programs on arm hosts. From top left to bottom right:

Beaglebone Black, Texas Instruments AM335x Cortex-A8 ARMv7
Samsung Chromebook, Samsung Exynos 5250 Dual Cortex-A15 ARMv7
QNAP TS-119P, Marvell Kirkwood ARMv5
Raspberry Pi Model B, Broadcom 2835 ARMv6

As always the results presented here are available in the autobench repository. The thumbnails are clickable for a full resolution view.

Hey, the images don’t work on my iSteve! Yup, it looks like iOS devices have a limit for the size of images they will load inside a web page, and these images are on the sadface side of that limit. If you click on the broken image, you’ll find the images will load fine in a separate page. Sorry for the inconvenience.

The speedup in BinaryTree17, and to a lesser extent Fannkuch11, benchmarks is influenced by the performance of the heap allocator. Part of heap allocation involves updating statistics stored in 64 bit quantities, which flow into runtime.MemStats. During the 1.1 cycle, some quick work on the part of the Atom symbol removed many of these 64 bit operations, which shows as decreased run time in these benchmarks.
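To give a feel for the statistics in question, here is a small sketch that reads two of the 64 bit counters back out of runtime.MemStats. mallocsDuring and sink are names I made up for this example; only runtime.ReadMemStats and the MemStats fields come from the standard library.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink keeps the allocations reachable so they must live on the heap.
var sink []*[64]byte

// mallocsDuring reports how many heap objects were allocated while f ran,
// using the uint64 Mallocs counter exposed through runtime.MemStats.
func mallocsDuring(f func()) uint64 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	f()
	runtime.ReadMemStats(&after)
	return after.Mallocs - before.Mallocs
}

func main() {
	n := mallocsDuring(func() {
		for i := 0; i < 1000; i++ {
			sink = append(sink, new([64]byte))
		}
	})
	fmt.Println("heap objects allocated:", n)
}
```

Every allocation has to bump counters like these, which is why making them cheaper on a 32 bit platform shows up in allocation heavy benchmarks.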

net/http

Across all the samples, net/http benchmarks have benefited from the new poller implementation as well as the pure Go improvements to the net/http package through the work of Brad Fitzpatrick and Jeff Allen.

runtime

The results of the runtime benchmarks mirror those from amd64 and 386. The general trend is towards improvement, and in some cases, a large improvement, in areas like map operations.

The improvements to the Append set of benchmarks show the benefit of a change committed by Rob Pike which avoids a call to runtime.memmove when appending small amounts of data to a []byte.
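The pattern that benefits looks something like the sketch below, which appends a couple of bytes at a time to a buffer that already has spare capacity. appendSmall is a hypothetical helper written for this example, not one of the runtime Append benchmarks themselves.

```go
package main

import (
	"fmt"
	"testing"
)

// appendSmall appends a two byte slice to buf the given number of times,
// the kind of small append the change above makes cheaper.
func appendSmall(buf []byte, rounds int) []byte {
	small := []byte("go")
	for i := 0; i < rounds; i++ {
		buf = append(buf, small...)
	}
	return buf
}

func main() {
	res := testing.Benchmark(func(b *testing.B) {
		// Pre-size the buffer so the loop measures copying, not growth.
		buf := make([]byte, 0, 1024)
		for i := 0; i < b.N; i++ {
			buf = appendSmall(buf[:0], 16)
		}
	})
	fmt.Println(res)
}
```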

The common theme across all the samples is the regression in some channel operations. This may be attributable to the high cost of performing atomic operations on arm platforms. Currently all atomic operations are implemented by the runtime package, but in the future they may be handled directly in the compiler which could reduce their overhead.

The CompareString benchmarks show a smaller improvement than other platforms because CL 8056043 has not yet been backported to arm.

Conclusion

With the addition of cgo support, throughput improvements in the net package, and improvements to code generation and the garbage collector, Go 1.1 represents a significant milestone for writing Go programs targeting arm.

To wrap up this series of articles it is clear that Go 1.1 delivers on its promise of a general 30-40% improvement across all three supported architectures. If we consider the relative improvements across compilers, while 6g remains the flagship compiler and benefits from the fastest underlying hardware, 8g and 5g show a greater improvement relative to the Go 1.0 release of last year.

But wait, there is more

If you’ve enjoyed this series of posts and want to follow the progress of Go 1.2 I’ll soon be opening a branch of autobench which will track Go 1.1 vs tip (1.2). I’ll post and tweet the location when it is ready.

Since the Go 1.2 change window was opened on May 14th, the allocator and garbage collector have already received improvements from Dmitry Vyukov and the Atom symbol aimed at further reducing the cost of GC, and Carl Shapiro has started work on precise collection of stack allocated values.

Also for Go 1.2 are proposals for a better memory allocator, and a change to the scheduler to give it the ability to preempt long running goroutines, which is aimed at reducing GC latency.

Finally, Go 1.2 has a release timetable. So while we can’t really say what will or will not make it into 1.2, we can say that it should be done by the end of 2013.

Go 1.1 tarballs for linux/arm

For the time poor ARM fans in the room, I’ve updated my tarball distributions to Go 1.1. These tarballs are built using the same misc/dist tool that makes the official builds on the golang.org download page.

You can find the link at the Unofficial ARM tarballs for Go item at the top of this page. Please address any bug reports or comments to me directly.

There are also a number of other ways to obtain Go 1.1 appearing on the horizon. For example, if you are using Debian Sid, Go 1.1 is available now. This version has been imported into Ubuntu Saucy (which will become 13.10), although at this time it remains in the proposed channel.

Rest assured I will not be shy in announcing when Go 1.1 has wider availability in Ubuntu.

Notes on exploring the compiler flags in the Go compiler suite

I’ve been doing some work improving the code generation of the 5g compiler, which is the Go compiler for arm. These notes also apply to the 6g and 8g compilers for amd64 and 386 respectively.

For this discussion we’ll use a very simple package.

package addr

func addr(s[]int) *int {
        return &s[2]
}

To see the assembly produced by compiling this package we use the -S flag. -S can be passed directly to the compiler with go tool 5g -S addr.go, but it is simpler (and more portable) to use the -gcflags flag on the go tool itself.

% go build -gcflags=-S addr.go 
# command-line-arguments 
--- prog list "addr" ---
0000 (/home/dfc/src/addr.go:3) TEXT addr+0(SB),$0-16
0001 (/home/dfc/src/addr.go:4) MOVW $s+0(FP),R0
0002 (/home/dfc/src/addr.go:4) MOVW 4(R0),R1
0003 (/home/dfc/src/addr.go:4) CMP $2,R1,
0004 (/home/dfc/src/addr.go:4) BHI ,6(APC)
0005 (/home/dfc/src/addr.go:4) BL ,runtime.panicindex+0(SB)
0006 (/home/dfc/src/addr.go:4) MOVW 0(R0),R0
0007 (/home/dfc/src/addr.go:4) ADD $8,R0
0008 (/home/dfc/src/addr.go:4) MOVW R0,.noname+12(FP)
0009 (/home/dfc/src/addr.go:4) RET ,

This is quite a lot of code for a one line function. One of the reasons for this is s is a slice, whose length is not known at compile time, so the compiler must insert a bounds check. We can tell the compiler to not emit bounds checks with the -B flag.

% go build -gcflags=-SB addr.go
# command-line-arguments
--- prog list "addr" ---
0000 (/home/dfc/src/addr.go:3) TEXT     addr+0(SB),$0-16
0001 (/home/dfc/src/addr.go:4) MOVW     $s+0(FP),R0
0002 (/home/dfc/src/addr.go:4) MOVW     0(R0),R0
0003 (/home/dfc/src/addr.go:4) ADD      $8,R0
0004 (/home/dfc/src/addr.go:4) MOVW     R0,.noname+12(FP)
0005 (/home/dfc/src/addr.go:4) RET      ,

It is important to note that -B is an unsupported flag. The goal of Go is a safe language, one where array subscripts are bounds checked when they are not provably safe. Go already elides bounds checks when you use range loops, and future compilers will improve this. It is also important to note that none of the builders test -B so it might even generate incorrect code. In summary, when the compiler improves, -B will go away, so don’t get too attached.
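Rather than reaching for -B, it is usually possible to write the loop so that no index needs checking in the first place. The sketch below contrasts the two shapes; sumIndex and sumRange are names I made up for illustration, and exactly which checks a given compiler version removes is something you should confirm by looking at the -S output yourself.

```go
package main

import "fmt"

// sumIndex indexes the slice with a caller supplied bound. The compiler
// cannot know that n <= len(s), so s[i] needs a bounds check.
func sumIndex(s []int, n int) int {
	total := 0
	for i := 0; i < n; i++ {
		total += s[i] // panics at run time if n > len(s)
	}
	return total
}

// sumRange lets range hand over each element, so there is no index
// expression that could be out of bounds.
func sumRange(s []int) int {
	total := 0
	for _, v := range s {
		total += v
	}
	return total
}

func main() {
	s := []int{1, 2, 3}
	fmt.Println(sumIndex(s, len(s)), sumRange(s))
}
```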

One other interesting flag is -N, which will disable the optimisation pass in the compiler:

% go build -gcflags=-SN addr.go
# command-line-arguments
--- prog list "addr" ---
0000 (/home/dfc/src/addr.go:3) TEXT     addr+0(SB),$0-16
0001 (/home/dfc/src/addr.go:4) MOVW     $s+0(FP),R0
0002 (/home/dfc/src/addr.go:4) MOVW     R0,R0
0003 (/home/dfc/src/addr.go:4) MOVW     4(R0),R1
0004 (/home/dfc/src/addr.go:4) CMP      $2,R1,
0005 (/home/dfc/src/addr.go:4) BHI      ,8(APC)
0006 (/home/dfc/src/addr.go:4) BL       ,runtime.panicindex+0(SB)
0007 (/home/dfc/src/addr.go:4) UNDEF    ,
0008 (/home/dfc/src/addr.go:4) MOVW     0(R0),R0
0009 (/home/dfc/src/addr.go:4) ADD      $8,R0
0010 (/home/dfc/src/addr.go:4) MOVW     R0,.noname+12(FP)
0011 (/home/dfc/src/addr.go:4) RET      ,
0012 (/home/dfc/src/addr.go:5) BL       ,runtime.throwreturn+0(SB)
0013 (/home/dfc/src/addr.go:5) RET      ,

I think the only useful thing about this example is that it shows it’s a good thing the optimiser is on by default, because there are some strange things going on here, for example the redundant move at line 0002, and the unreachable branch at line 0012.

The last thing to talk about is that the output of 5g is not the final code that is executed. Aside from the usual work of a linker, 5l does several transformations on the code which are important to understand.

func addr(s[]int) *int {
   10c00:       e59a1000        ldr     r1, [sl]
   10c04:       e15d0001        cmp     sp, r1
   10c08:       33a01004        movcc   r1, #4
   10c0c:       33a02010        movcc   r2, #16
   10c10:       31a0300e        movcc   r3, lr
   10c14:       3b00668c        blcc    2a64c
   10c18:       e52de004        push    {lr}            ; (str lr, [sp, #-4]!)
        return &s[2]
   10c1c:       e28d0008        add     r0, sp, #8
   10c20:       e5901004        ldr     r1, [r0, #4]
   10c24:       e3510002        cmp     r1, #2
   10c28:       8a000000        bhi     10c30
   10c2c:       eb0035d5        bl      1e388 
   10c30:       e5900000        ldr     r0, [r0]
   10c34:       e2800008        add     r0, r0, #8
   10c38:       e58d0014        str     r0, [sp, #20]
   10c3c:       e49df004        pop     {pc}            ; (ldr pc, [sp], #4)

Here we use objdump -dS to dump the addr function as it is compiled into the executable. The first six instructions, starting at 10c00, are the function preamble that deals with segmented stacks, which is inserted automatically by 5l.

Taking it further

There are several other compiler flags which are useful when debugging or optimising your Go code.

-g will output the steps the compiler is taking at a very low level. The discussion of the output format is outside the scope of this article. Personally I find it easier to add a warn statement which will tell me the source line the compiler was working on at the time.
-l will disable inlining (but still retain other compiler optimisations). This is very useful if you are investigating small methods, but can’t find them in objdump.
-m is mainly a frontend switch and outputs details about escape analysis and inlining choices.
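As a small illustration of what -m is useful for, here is a sketch with one function whose local variable has to escape and one whose local does not. The point type and both function names are invented for this example. Building it with go build -gcflags=-m should print the compiler's escape analysis and inlining decisions for each function; I have not reproduced that output here because the exact wording varies between compiler versions.

```go
package main

import "fmt"

type point struct{ x, y int }

// newPoint returns the address of a local, so p has to outlive the call.
// Escape analysis is expected to move it to the heap.
func newPoint(x, y int) *point {
	p := point{x, y}
	return &p
}

// sumPoint only uses p inside the function, so it can stay on the stack.
func sumPoint(x, y int) int {
	p := point{x, y}
	return p.x + p.y
}

func main() {
	fmt.Println(newPoint(1, 2).x, sumPoint(1, 2))
}
```

Adding -l alongside -m stops the small functions from being inlined, which makes it easier to match the messages to the code.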

Netgear Stora as an ARM development platform

About a week ago I posted a request for recommendations for ARM based systems that could be used for Go development. There were some great responses, including the BeagleBoard and the GuruPlug. Being impatient, and in Australia, I ended up getting a Netgear Stora which has turned out to be a great home NAS, and a capable ARM5 development system. This is the same hardware, albeit with less RAM, that ships in the SheevaPlug.

axentraserver(~/go/src) % export MAKEFLAGS=-j1
axentraserver(~/go/src) % hg identify 
546b1fc95dcc+ tip
axentraserver(~/go/src) % time ./make.bash > /dev/null
hg not installed
conflicts: 3 shift/reduce

real    10m48.889s
user    9m15.380s
sys     0m52.480s

Not too shabby; my 8g host (2.8GHz Celeron) turns around the same build in just under 12 minutes.

Pros

Very good value. For less than $200 AUD you get a 1.2GHz Marvell ARM5 CPU, 128MB of RAM and a 1TB Seagate 3.5″ drive (and a slot for a second drive). Online Computer have the 1TB units for $185.
Very hackable. SSH is enabled out of the box if you know the magic suffix that Netgear uses, and all users created via the web interface are in /etc/sudoers. The fantastic ipkg system will close the gap between the slimmed down RedHat distribution that Netgear Axentra ship and a GNU buildchain that can bootstrap Go.

Cons

128MB of RAM, non expandable. This actually turns out to not be a big deal. The stock install has ~75MB of RAM free while running. Turning off a few options and trimming the daemons Netgear installs can get another 10-15MB back.

Dave Cheney

The acme of foolishness
