Microblog: TestMain can cause one to question reality

This morning a one line change had several of us tearing up the fabric of reality trying to understand why a failing test wasn’t failing, or, in fact, being run at all. Increasingly frantic efforts to upgrade/downgrade Go, run the tests on another machine, run the tests in CI, all served to only unnerve us further.

Can you spot the bug?

package gosh_darn_important_test

import (
    "testing"
    "go.uber.org/goleak"
)

func TestMain(m *testing.M) {
//	goleak.VerifyTestMain(m)
}

...
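
The answer, spelled out: when a package declares TestMain, the testing package hands control to it, and it is TestMain’s job to call m.Run. goleak.VerifyTestMain does that internally, so commenting it out left a TestMain that ran no tests at all yet exited successfully. A minimal sketch of the correct shape once the goleak call is removed (before Go 1.15 the result of m.Run also had to be passed to os.Exit by hand):

package gosh_darn_important_test

import (
    "os"
    "testing"
)

func TestMain(m *testing.M) {
    // TestMain replaces the default test driver, so it must run the
    // tests itself; if m.Run is never called, no tests run at all.
    os.Exit(m.Run())
}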

A few bytes here, a few there, pretty soon you’re talking real memory

Today’s post comes from a recent Go pop quiz. Consider this benchmark fragment.1

func BenchmarkSortStrings(b *testing.B) {
        s := []string{"heart", "lungs", "brain", "kidneys", "pancreas"}
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
                sort.Strings(s)
        }
}

A convenience wrapper around sort.Sort(sort.StringSlice(s)), sort.Strings sorts the input in place, so it isn’t expected to allocate (or at least that’s what 43% of the tweeps who responded thought). However it turns out that, at least in recent versions of Go, each iteration of the benchmark causes one heap allocation. Why does this happen?

Interfaces, as all Go programmers should know, are implemented as a two word structure. Each interface value contains a field which holds the type of the interface’s contents, and a pointer to the interface’s contents.2

In pseudo Go code, an interface might look something like this:

type interface struct {
        // the ordinal number for the type of the value
        // assigned to the interface 
        type uintptr

        // (usually) a pointer to the value assigned to
        // the interface
        data uintptr
}

interface.data can hold one machine word–8 bytes in most cases–but a []string is 24 bytes: one word for the pointer to the slice’s underlying array, one word for the length, and one for the remaining capacity of the underlying array. So how does Go fit 24 bytes into 8? With the oldest trick in the book, indirection. s, a []string, is 24 bytes, but *[]string–a pointer to a slice of strings–is only 8.
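
The sizes are easy to check for yourself. A minimal sketch (the figures assume a 64 bit platform):

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    s := []string{"heart", "lungs", "brain", "kidneys", "pancreas"}
    fmt.Println(unsafe.Sizeof(s))  // 24: pointer, length, and capacity words
    fmt.Println(unsafe.Sizeof(&s)) // 8: just a pointer to the slice header
}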

Escaping to the heap

To make the example a little more explicit, here’s the benchmark rewritten without the sort.Strings helper function:

func BenchmarkSortStrings(b *testing.B) {
        s := []string{"heart", "lungs", "brain", "kidneys", "pancreas"}
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
                var ss sort.StringSlice = s
                var si sort.Interface = ss // allocation
                sort.Sort(si)
        }
}

To make the interface magic work, the compiler rewrites the assignment as var si sort.Interface = &ss–the address of ss is assigned to the interface value.3 We now have a situation where the interface value holds a pointer to ss, but where does it point? Where in memory does ss live?

It appears that ss is moved to the heap, causing the allocation that the benchmark reports.

  Total:    296.01MB   296.01MB (flat, cum) 99.66%
      8            .          .           func BenchmarkSortStrings(b *testing.B) { 
      9            .          .           	s := []string{"heart", "lungs", "brain", "kidneys", "pancreas"} 
     10            .          .           	b.ReportAllocs() 
     11            .          .           	for i := 0; i < b.N; i++ { 
     12            .          .           		var ss sort.StringSlice = s 
     13     296.01MB   296.01MB           		var si sort.Interface = ss // allocation 
     14            .          .           		sort.Sort(si) 
     15            .          .           	} 
     16            .          .           } 
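
Another way to confirm the escape, without reaching for the profiler, is to ask the compiler to print its escape analysis decisions (a sketch; the exact wording of the diagnostics varies between Go versions):

% go test -gcflags=-m -bench=. sort_test.go

Somewhere in the output there should be a line reporting that the sort.StringSlice value escapes to the heap at the point of the interface conversion.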

The allocation occurs because the compiler currently cannot convince itself that ss outlives si. The general attitude amongst Go compiler hackers seems to be that this could be improved, but that’s a discussion for another time. As it stands, ss is allocated on the heap. Thus the question becomes, how many bytes are allocated per iteration? Why don’t we ask the testing package?

% go test -bench=. sort_test.go
goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i7-5650U CPU @ 2.20GHz
BenchmarkSortStrings-4          12591951                91.36 ns/op           24 B/op          1 allocs/op
PASS
ok      command-line-arguments  1.260s

Using Go 1.16beta1, on amd64, 24 bytes are allocated per operation.4 However, the previous Go version, on the same platform, consumes 32 bytes per operation:

% go1.15 test -bench=. sort_test.go
goos: darwin
goarch: amd64
BenchmarkSortStrings-4          11453016                96.4 ns/op            32 B/op          1 allocs/op
PASS
ok      command-line-arguments  1.225s

This brings us, circuitously, to the subject of this post, a nifty quality of life improvement coming in Go 1.16. But before I can talk about it, I need to discuss size classes.

Size classes

To explain what a size class is, consider how a theoretical Go runtime could allocate 24 bytes on its heap. A simple way to do this would be to keep track of all the memory allocated so far using a pointer to the last allocated byte on the heap. To allocate 24 bytes, the heap pointer is incremented by 24, and the previous value returned to the caller. As long as the code that asked for 24 bytes never writes beyond that mark, this mechanism has no overhead. Sadly, in real life, memory allocators don’t just allocate memory, sometimes they have to free it.

Eventually the Go runtime will have to free those 24 bytes, but from the runtime’s point of view, the only thing it knows is the starting address it gave to the caller. It does not know how many bytes after that address were allocated. To permit deallocation, our hypothetical Go runtime allocator would have to record, for each allocation on the heap, its length. Where are those lengths stored? On the heap, of course.

In our scenario, when the runtime wants to allocate memory, it could request a little more than it was asked for and use it to store the amount requested. For our slice example, when we asked for 24 bytes, we’d actually consume 24 bytes plus some overhead to store the number 24. How large is this overhead? It turns out the minimum amount that is practical is one word.5

To record a 24 byte allocation the overhead would be 8 bytes. 25% is not great, but not terrible, and as the size of the allocation goes up, the overhead would become insignificant. However, what would happen if we wanted to store just one byte on the heap? The overhead would be eight times the amount of data we asked for! Is there a more efficient way to allocate small amounts on the heap?

Rather than storing the length alongside each allocation, what if all the things of the same size were stored together? If all the 24 byte things are stored together, then the runtime would automatically know how large they are. All the runtime needs is a single bit to indicate if a 24 byte area is in use or not. In Go these areas are known as size classes, because all the things of the same size are stored together (think school class–all the students are in the same grade–not a C++ class). When the runtime needs to allocate a small amount, it does so using the smallest size class that can accommodate the allocation.

Unlimited size classes for all

Now that we know how size classes work, the obvious question is, where are they stored? Not surprisingly, the memory for a size class comes from the heap. To minimise overhead, the runtime allocates a larger amount from the heap (usually a multiple of the system page size) and then dedicates that space to allocations of a single size. But, there’s a problem.

The pattern of allocating a large area to store things of the same size works well6 if there’s a fixed, preferably small, number of allocation sizes, but in a general purpose language programs can ask the runtime for an allocation of any size.7

For example, imagine asking the runtime for 9 bytes. 9 bytes is an uncommon size, so it’s likely that a new size class for things of size 9 would be required. As 9 byte things are uncommon, it’s likely the remainder of the allocation, 4kb or more, would be wasted. Because of this the set of possible size classes is fixed. If a size class of the exact amount is not available, the allocation is rounded up to the next size class. In our example 9 bytes might be allocated in the 12 byte size class. The overhead of 3 unused bytes is better than an entire size class allocation which goes mostly unused.
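
A benchmark makes this rounding visible in the real runtime (a sketch; the size class table is an internal detail and the exact figure reported may change between Go versions):

package main

import "testing"

var sink []byte

func BenchmarkNineBytes(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        // Ask the heap for 9 bytes; sink forces the allocation to escape.
        // On a 64 bit platform with a recent Go toolchain expect this to
        // report 16 B/op, the next size class above 9 bytes.
        sink = make([]byte, 9)
    }
}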

All together now

This is the final piece of the puzzle. Go 1.15 did not have a 24 byte size class, so the heap allocation of ss was allocated in the 32 byte size class. Thanks to the work of Martin Möhrmann Go 1.16 has a 24 byte size class, which is a perfect fit for slice values assigned to interfaces.

The story of the one line fix

Picture yourself, an engineer working at the hottest distributed microservices shop du jour, assigned to fix a bug. You jump into an unfamiliar codebase and quickly locate the line where the problem occurred. The fix is simple, just return early or substitute a default value in the case that one cannot be determined from your input. Boom, problem solved. Code compiles, fix goes out to production, and the card’s in the done pile in time for Tuesday drinks at the pub.

Now consider this alternative scenario. You find the problematic line and, before you make the fix, you decide to add a test so you will know that your fix worked, and hopefully someone will not accidentally revert it in the future.

You’ve figured out that some weird edge case causes that line to be called without being able to determine the right value for y. You can see it clearly, y ends up being zero so the program crashes. To write a test, you need to get the conditions that caused y to be zero to occur on command. There are only two parameters passed into this function, this shouldn’t be hard. Oh, but this is a method, not a function, so you need to construct an instance of the object in the right state to trigger the bug. Hmm, this language uses constructors to make sure people can’t just monkey up the bits of the object they need for a test, so you’ll have to find, mock, stub or build instances of all the object’s dependencies (and their dependencies, and so on), then run the object through the precise series of operations to put it in the state that causes y to be zero. Then you can write the test.

At this point Tuesday night drinks are fading before your eyes. Without being able to write a test for your fix, how will you convince a reviewer that it actually fixed what it fixed? All you can prove is that your “simple” change didn’t break any behaviour that was covered by existing tests.

Fast forward to Friday and your one line change has finally been merged, along with a few new test cases to make sure it doesn’t happen again. Next week you’re going to sit down and try to figure out the weird crashes you caused while trying to stub out calls to the database. In the end you gave up and copied some code from another test that uses a copy of production data. It’s much slower than it needs to be, but it worked, and it was the only way to put a reasonable estimate on a job that had already consumed your entire week.

The moral of the story: was the test the reason it took nearly a week to make a one line fix?

How to dump the GOSSAFUNC graph for a method

The Go compiler’s SSA backend contains a facility to produce HTML debugging output of the compilation phases. This post covers how to print the SSA output for functions and methods.

Let’s start with a sample program which contains a function, a value method, and a pointer method:

package main

import (
    "fmt"
)

type Numbers struct {
    vals []int
}

func (n *Numbers) Add(v int) {
    n.vals = append(n.vals, v)
}

func (n Numbers) Average() float64 {
    sum := 0.0
    for _, num := range n.vals {
        sum += float64(num)
    }
    return sum / float64(len(n.vals))
}


func main() {
    var numbers Numbers
    numbers.Add(200)
    numbers.Add(43)
    numbers.Add(-6)
    fmt.Println(numbers.Average())
}

Control of the SSA debugging output is via the GOSSAFUNC environment variable. This variable takes the name of the function to dump. This is not the function’s fully qualified name. For func main above the name of the function is main, not main.main.

% env GOSSAFUNC=main go build
runtime
dumped SSA to ../../go/src/runtime/ssa.html
t
dumped SSA to ./ssa.html

In this example GOSSAFUNC=main matched both main.main and a function called runtime.main.1 This is a little unfortunate, but in practice probably not a big deal as, if you’re performance tuning your code, it won’t be in a giant spaghetti blob in func main.

What is more likely is your code will be in a method, so you’ve probably landed on this post looking for the correct incantation to dump the SSA output for a method.

To print the SSA debug for the pointer method func (n *Numbers) Add, the equivalent function name is (*Numbers).Add:2

% env "GOSSAFUNC=(*Numbers).Add" go build
t
dumped SSA to ./ssa.html

To print the SSA debug for a value method func (n Numbers) Average, the equivalent function name is (*Numbers).Average even though this is a value method:

% env "GOSSAFUNC=(*Numbers).Average" go build
t
dumped SSA to ./ssa.html

Diamond interface composition in Go 1.14

Per the overlapping interfaces proposal, Go 1.14 now permits embedding of interfaces with overlapping method sets. This is a brief post explaining what this change means.

Let’s start with the definition of the three key interfaces from the io package; io.Reader, io.Writer, and io.Closer:

package io

type Reader interface {
    Read([]byte) (int, error)
}

type Writer interface {
    Write([]byte) (int, error)
}

type Closer interface {
    Close() error
}    

Just as embedding a type inside a struct allows the embedded type’s fields and methods to be accessed as if they were declared on the embedding type1, the same is true for interfaces. Thus there is no difference between explicitly declaring

type ReadCloser interface {
    Read([]byte) (int, error)
    Close() error
}

and using embedding to compose the interface

type ReadCloser interface {
    Reader
    Closer
}

You can even mix and match

type WriteCloser interface {
    Write([]byte) (int, error)
    Closer
}

However, prior to Go 1.14, if you continued to compose interface declarations in this manner you would likely find that something like this,

type ReadWriteCloser interface {
    ReadCloser
    WriteCloser
}

would fail to compile

% go build interfaces.go
command-line-arguments
./interfaces.go:27:2: duplicate method Close

Fortunately, with Go 1.14 this is no longer a limitation, thus solving problems that typically occur with diamond-shaped embedding graphs.

However, there is a catch that I ran into attempting to demonstrate this feature to the local user group–this feature is only enabled when the Go compiler uses the 1.14 (or later) spec.

As near as I can make out the rules for which version of the Go spec is used during compilation appear to be:

  1. If your source code is stored inside GOPATH (or you have disabled modules with GO111MODULE=off) then the version of the Go spec used to compile with matches the version of the compiler you are using. Said another way, if you have Go 1.13 installed, your Go version is 1.13. If you have Go 1.14 installed, your version is 1.14. No surprises here.
  2. If your source code is stored outside GOPATH (or you have forced modules on with GO111MODULE=on) then the go tool will take the Go version from the go.mod file.
  3. If there is no Go version listed in go.mod then the version of the spec will be the version of Go installed. This is identical to point 1.
  4. If you are in module mode, either by being outside GOPATH or with GO111MODULE=on, but there is no go.mod file in the current, or any parent, directory then the version of the Go spec used to compile your code defaults to Go 1.13.

The last point caught me out.
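
In practice the fix is a one line change: make sure the go.mod file declares a Go version of 1.14 or later. A minimal sketch (the module path is made up for the example):

module example.com/diamond

go 1.14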

Fatih’s question

A few days ago Fatih posted this question on twitter.

I’m going to attempt to give my answer; however, to do that I need to apply some simplifications, as my previous attempts to answer it involved a lot of phrases like a pointer to a pointer, and other unhelpful waffling. Hopefully my simplified answer can be useful in building a mental framework to answer Fatih’s original question.

Restating the question

Fatih’s original tweet showed four different variations of json.Unmarshal. I’m going to focus on the last two, which I’ll rewrite a little:

package main

import (
    "encoding/json"
    "fmt"
)

type Result struct {
    Foo string `json:"foo"`
}

func main() {
    content := []byte(`{"foo": "bar"}`)
    var result1, result2 *Result

    err := json.Unmarshal(content, &result1)
    fmt.Println(result1, err) // &{bar} <nil>

    err = json.Unmarshal(content, result2)
    fmt.Println(result2, err) // <nil> json: Unmarshal(nil *main.Result)
}

Restated in words, result1 and result2 are the same type; *Result. Decoding into result1 works as expected, whereas decoding into result2 causes the json package to complain that the value passed to Unmarshal is nil. However, both values were declared without an initialiser so both would have taken on the type’s zero value, nil.

Eagle eyed readers will have spotted that the reason for the difference is the first invocation is passed &result1, while the second is passed result2, but this explanation is unsatisfactory because the documentation for json.Unmarshal states:

Unmarshal parses the JSON-encoded data and stores the result in the value pointed to by v. If v is nil or not a pointer, Unmarshal returns an InvalidUnmarshalError.

Which is confusing because result1 and result2 are pointers. Furthermore, without initialisation, both are nil. Now, the documentation is correct (as you’d expect from a package that has been hammered on for a decade), but explaining why takes a little more investigation.

Functions receive a copy of their arguments

Every assignment in Go is a copy, this includes function arguments and return values.

package main

import (
    "fmt"
)

func increment(v int) {
    v++
}

func main() {
    v := 1
    increment(v)
    fmt.Println(v) // 1
}

In this example, increment is operating on a copy of main’s v. This is because the v declared in main and increment’s v parameter have different addresses in memory. Thus changes to increment’s v cannot affect the contents of main’s v.

package main

import (
    "fmt"
)

func increment(v *int) {
    *v++
}

func main() {
    v := 1
    increment(&v)
    fmt.Println(v) // 2
}

If we wanted to write increment in a way that it could affect the contents of its caller we would need to pass a reference, a pointer, to main.v.1 This example demonstrates why json.Unmarshal needs a pointer to the value to decode JSON into.

Pointers to pointers

Returning to the original question, both result1 and result2 are declared as *Result, that is, pointers to a Result value. We established that you have to pass the address of the caller’s value to json.Unmarshal otherwise it won’t be able to alter the contents of the caller’s value. Why then must we pass the address of result1, a **Result, a pointer to a pointer to a Result, for the operation to succeed?

To explain this another detour is required. Consider this code:

package main

import (
    "encoding/json"
    "fmt"
)

type Result struct {
    Foo *string `json:"foo"`
}

func main() {
    content := []byte(`{"foo": "bar"}`)
    var result1 *Result

    err := json.Unmarshal(content, &result1)
    fmt.Printf("%#v %v", result1, err) // &main.Result{Foo:(*string)(0xc0000102f0)} <nil>
}

In this example Result contains a pointer typed field, Foo *string. During JSON decoding Unmarshal allocated a new string value, stored the value bar in it, then placed the address of the string in Result.Foo. This behaviour is quite handy as it frees the caller from having to initialise Result.Foo and makes it easier to detect when a field was not initialised because the JSON did not contain a value. Beyond the convenience this offers for simple examples, it would be prohibitively difficult for the caller to properly initialise every reference type field in a structure before decoding unknown JSON without first inspecting the incoming data, which is itself problematic if the input is coming from an io.Reader that cannot be rewound.

To unmarshal JSON into a pointer, Unmarshal first handles the case of the JSON being the JSON literal null. In that case, Unmarshal sets the pointer to nil. Otherwise, Unmarshal unmarshals the JSON into the value pointed at by the pointer. If the pointer is nil, Unmarshal allocates a new value for it to point to.

json.Unmarshal’s handling of pointer fields is clearly documented, and works as you would expect, allocating a new value whenever there is a need to decode into a pointer shaped field. It is this behaviour that gives us a hint to what is happening in the original example.

We’ve seen that when json.Unmarshal encounters a field which points to nil it will allocate a new value of the correct type and assign its address to the field before proceeding. Not only is this behaviour applied recursively–for example in the case of a complex structure which contains pointers to other structures–but it also applies to the value passed to Unmarshal.

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    content := []byte(`1`)
    var result *int

    err := json.Unmarshal(content, &result)
    fmt.Println(*result, err) // 1 <nil>
}

In this example result is not a struct, but a simple *int which, lacking an initialiser, defaults to nil. After JSON decoding, result now points to an int with the value 1.

Putting the pieces together

Now I think I’m ready to take a shot at answering Fatih’s question.

json.Unmarshal requires the address of the variable you want to decode into, otherwise it would decode into a temporary copy which would be discarded on return. Normally this is done by declaring a value and then passing its address, or by explicitly initialising the value:

var result1 Result
err := json.Unmarshal(content, &result1) // this is fine

var result2 = new(Result)
err = json.Unmarshal(content, result2) // and this

var result3 = &Result{}
err = json.Unmarshal(content, result3) // this is also fine

In all three cases the address that the *Result points to is not nil; it points to initialised memory that json.Unmarshal decodes into.

Now consider what happens when json.Unmarshal encounters this

var result4 *Result
err = json.Unmarshal(content, result4) // err json: Unmarshal(nil *main.Result)

result2, result3, and the expression &result1 point to a Result. However result4, even though it has the same type as the previous three, does not point to initialised memory, it points to nil. Thus, according to the examples we saw previously, before json.Unmarshal can decode into it, the memory result4 points to must be initialised.

However, because each function receives a copy of its arguments, the caller’s result4 variable and the copy inside json.Unmarshal are unique. json.Unmarshal can allocate a new Result value and decode into it, but it cannot alter result4 to point to this new value because it was not provided with a reference to result4, only a copy of its contents.
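
The fix is therefore the same as it was for result1: pass the address of result4 so the copy Unmarshal receives, a **Result, still leads back to the caller’s variable. A sketch:

var result4 *Result
err = json.Unmarshal(content, &result4)
fmt.Println(result4, err) // &{bar} <nil>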

Ensmallening Go binaries by prohibiting comparisons

Conventional wisdom dictates that the larger the number of types declared in a Go program, the larger the resulting binary. Intuitively this makes sense, after all, what’s the point in defining a bunch of types if you’re not going to write code that operates on them. However, part of the job of a linker is to detect functions which are not referenced by a program–say they are part of a library of which only a subset of functionality is used–and remove them from the final output. Yet, the adage mo’ types, mo’ binary holds true for the majority of Go programs.

In this post I’ll dig into what equality, in the context of a Go program, means and why changes like this have a measurable impact on the size of a Go program.

Defining equality between two values

The Go spec defines the concepts of assignability and equality. Assignability is the act of assigning a value to an identifier. Not everything which is declared can be assigned, for example constants and functions. Equality is the act of comparing two identifiers by asking are their contents the same?

Being a strongly typed language, the notion of sameness is fundamentally rooted in the identifier’s type. Two things can only be the same if they are of the same type. Beyond that, the type of the values defines how they are compared.

For example, integers are compared arithmetically. For pointer types, equality is determining if the addresses they point to are the same. Reference types like maps and channels, like pointers, are considered to be the same if they have the same address.

These are all examples of bitwise equality, that is, if the bit patterns of the memory that value occupies are the same, those values are equal. This is known as memcmp, short for memory comparison, as equality is defined by comparing the contents of two areas of memory.

Hold on to this idea, I’ll come back to it in a second.

Struct equality

Beyond scalar types like integers, floats, and pointers is the realm of compound types: structs. All structs are laid out in memory in program order, thus this declaration:

type S struct {
    a, b, c, d int64
}

will consume 32 bytes of memory; 8 bytes for a, then 8 bytes for b, and so on. The spec says that struct values are comparable if all their fields are comparable. Thus two structs are equal iff each of their fields are equal.

a := S{1, 2, 3, 4}
b := S{1, 2, 3, 4}
fmt.Println(a == b) // prints true

Under the hood the compiler uses memcmp to compare the 32 bytes of a and b.

Padding and alignment

However the simplistic bitwise comparison strategy will fail in situations like this:

type S struct {
    a byte
    b uint64
    c int16
    d uint32
}

func main() {
    a := S{1, 2, 3, 4}
    b := S{1, 2, 3, 4}
    fmt.Println(a == b) // prints true
}

The code compiles, the comparison is still true, but under the hood the compiler cannot rely on comparing the bit patterns of a and b because the structure contains padding.

Go requires each field in a struct to be naturally aligned. 2 byte values must start on an even address, four byte values on an address divisible by 4, and so on1. The compiler inserts padding to ensure the fields are aligned according to their type and the underlying platform. In effect, after padding, this is what the compiler sees2:

type S struct {
    a byte
    _ [7]byte // padding
    b uint64
    c int16
    _ [2]byte // padding
    d uint32
}

Padding exists to ensure the correct field alignments, and while it does take up space in memory, the contents of those padding bytes are unknown. You might assume that, being Go, the padding bytes are always zero, but it turns out that’s not the case–the contents of padding bytes are simply not defined. Because they’re not defined to always be a certain value, doing a bitwise comparison may return false because the nine bytes of padding spread throughout the 24 bytes of S may not be the same.
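
You can see the padding the compiler inserted for yourself with the unsafe package. A sketch (the offsets assume a 64 bit platform):

package main

import (
    "fmt"
    "unsafe"
)

type S struct {
    a byte
    b uint64
    c int16
    d uint32
}

func main() {
    var s S
    fmt.Println(unsafe.Sizeof(s))     // 24
    fmt.Println(unsafe.Offsetof(s.b)) // 8: seven bytes of padding after a
    fmt.Println(unsafe.Offsetof(s.c)) // 16
    fmt.Println(unsafe.Offsetof(s.d)) // 20: two bytes of padding after c
}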

The Go compiler solves this problem by generating what is known as an equality function. In this case S’s equality function knows how to compare two values of type S by comparing only the fields in the struct while skipping over the padding.

Type algorithms

Phew, that was a lot of setup to illustrate why, for each type defined in a Go program, the compiler may generate several supporting functions, known inside the compiler as the type’s algorithms. In addition to the equality function the compiler will generate a hash function if the type is used as a map key. Like the equality function, the hash function must consider factors like padding when computing its result to ensure it remains stable.

It turns out that it can be hard, and sometimes non obvious, to intuit when the compiler will generate these functions–it’s more than you’d expect–and it can be hard for the linker to eliminate the ones that are not needed as reflection often causes the linker to be more conservative when trimming types.
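
If you’re curious which of these generated functions end up in a particular binary, they are visible as ordinary symbols. Something like the following will surface them (a sketch; the output binary name is made up and the precise symbol names differ between Go versions):

% go build -o demo .
% go tool nm demo | grep '\.eq\.'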

Reducing binary size by prohibiting comparisons

Now we’re at a point to explain Brad’s change. By adding an incomparable field 3 to the type, the resulting struct is by extension incomparable, thus forcing the compiler to elide the generation of eq and hash algorithms, short circuiting the linker’s elimination of those types and, in practice, reducing the size of the final binary. As an example of this technique, this program:

package main

import "fmt"

func main() {
    type t struct {
        // _ [0][]byte uncomment to prevent comparison
        a byte
        b uint16
        c int32
        d uint64
    }
    var a t
    fmt.Println(a)
}

when compiled with Go 1.14.2 (darwin/amd64), decreased from 2174088 to 2174056, a saving of 32 bytes. In isolation this 32 byte saving may seem like small beer, but consider that equality and hash functions can be generated for every type in the transitive closure of your program and all its dependencies, and the size of these functions varies depending on the size of the type and its complexity, prohibiting them can have a sizeable impact on the final binary over and above the old saw of -ldflags="-s -w".

The bottom line: if you don’t wish to make your types comparable, a hack like this enforces it at the source level while contributing a small reduction in the size of your binary.
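
For completeness, here is what the guard looks like on its own. A zero width incomparable field poisons comparability for the whole struct, so any accidental comparison is caught at compile time (a sketch):

package main

type V struct {
    _ [0][]byte // zero width, but []byte is not comparable, so neither is V
    A, B int
}

func main() {
    var a, b V
    _, _ = a, b
    // _ = a == b // uncommenting this line fails to compile,
    //            // because V contains an incomparable field
}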


Addendum: thanks to Brad’s prodding, Go 1.15 already has a bunch of improvements by Cherry Zhang and Keith Randall that fix the most egregious of the failures to eliminate unnecessary equality and hash functions (although I suspect it was also to avoid the proliferation of this class of CLs).

Mid-stack inlining in Go

In the previous post I discussed how leaf inlining allows the Go compiler to reduce the overhead of function calls and extend optimisation opportunities across function boundaries. In this post I’ll discuss the limits of inlining and leaf vs mid-stack inlining.

The limits of inlining

Inlining a function into its caller removes the call’s overhead and increases the opportunity for the compiler to apply additional optimisations, so the question should be asked: if some inlining is good, would more be better? Why not inline as much as possible?

Inlining trades possibly larger program sizes for potentially faster execution time. The main reason to limit inlining is that creating many inlined copies of a function can increase compile time and result in larger binaries for marginal gain. Even taking into account the opportunities for further optimisation, aggressive inlining tends to increase the size of, and the time to compile, the resulting binary.

Inlining works best for small functions that do relatively little work compared to the overhead of calling them. As the size of a function grows, the time saved avoiding the call’s overhead diminishes relative to the work done inside the function. Larger functions tend to be more complex, thus the benefits of optimising their inlined forms vs in situ are reduced.

Inlining budget

During compilation each function’s inlineability is calculated using what is known as the inlining budget1. The cost calculation can be tricky to internalise but is broadly one unit per node in the AST for simple things like unary and binary operations, but can be higher for complex operations like make. Consider this example:

package main

func small() string {
    s := "hello, " + "world!"
    return s
}

func large() string {
    s := "a"
    s += "b"
    s += "c"
    s += "d"
    s += "e"
    s += "f"
    s += "g"
    s += "h"
    s += "i"
    s += "j"
    s += "k"
    s += "l"
    s += "m"
    s += "n"
    s += "o"
    s += "p"
    s += "q"
    s += "r"
    s += "s"
    s += "t"
    s += "u"
    s += "v"
    s += "w"
    s += "x"
    s += "y"
    s += "z"
    return s
}

func main() {
    small()
    large()
}

Compiling this program with -gcflags=-m=2 allows us to see the cost the compiler assigns to each function.

% go build -gcflags=-m=2 inl.go 
# command-line-arguments
./inl.go:3:6: can inline small with cost 7 as: func() string { s := "hello, world!"; return s }
./inl.go:8:6: cannot inline large: function too complex: cost 82 exceeds budget 80
./inl.go:38:6: can inline main with cost 68 as: func() { small(); large() }
./inl.go:39:7: inlining call to small func() string { s := "hello, world!"; return s }

The compiler determined that func small() can be inlined due to its cost of 7. func large() was determined to be too expensive. func main() has been marked as eligible and assigned a cost of 68; 7 from the body of small, 57 from the function call to small, and the remainder is its own overhead.

The inlining budget can be controlled to some degree with the -gcflags=-l flag. Currently the values that apply are:

  • -gcflags=-l=0 is the default level of inlining.
  • -gcflags=-l (or -gcflags=-l=1) disables inlining.
  • -gcflags=-l=2 and -gcflags=-l=3 are currently unused and have no effect over -gcflags=-l=0
  • -gcflags=-l=4 reduces the cost for inlining non-leaf functions and calls through interfaces.2

Hairy optimisations

Some functions with a relatively low inlining cost may be ineligible because of their complexity. This is known as the function’s hairiness as the semantics of some operations are hard to reason about once inlined, for example recover, break. Others, like select and go, involve co-ordination with the runtime so the extra effort of inlining doesn’t pay for itself.

The list of hairy statements also includes things like for and range which don’t have an inherently large cost, but simply haven’t been optimised yet.

Mid stack inlining

Historically the Go compiler only performed leaf inlining–only functions which did not call other functions were eligible. In the context of the hairiness discussion previously, a function call would disqualify the function from being inlined.

Enter mid stack inlining which, as its name implies, allows functions in the middle of a call stack to be inlined without requiring everything below them to be eligible. Mid stack inlining was introduced by David Lazar in Go 1.9 and improved in subsequent releases. This presentation goes into some of the difficulties with retaining the behaviour of stack traces and runtime.Callers in code paths that had been heavily inlined.

We saw mid-stack inlining at work in the previous example. After inlining, func main() contains the body of func small() and a call to func large(), thus it is considered a non-leaf function. Historically this would have prevented it from being further inlined even though its combined cost was less than the inlining budget.

The primary use case for mid stack inlining is to reduce the overhead of a path through the call stack. Consider this example:

package main

import (
    "fmt"
    "strconv"
)

type Rectangle struct {}

//go:noinline
func (r *Rectangle) Height() int {
    h, _ := strconv.ParseInt("7", 10, 0)
    return int(h)
}

func (r *Rectangle) Width() int {
    return 6
}

func (r *Rectangle) Area() int { return r.Height() * r.Width() }

func main() {
    var r Rectangle
    fmt.Println(r.Area())
}

In this example r.Area() is a simple function which calls two others. r.Width() can be inlined while r.Height(), simulated here with the //go:noinline annotation, cannot. 3

% go build -gcflags='-m=2' square.go                                                                                                          
# command-line-arguments
./square.go:12:6: cannot inline (*Rectangle).Height: marked go:noinline                                                                               
./square.go:17:6: can inline (*Rectangle).Width with cost 2 as: method(*Rectangle) func() int { return 6 }
./square.go:21:6: can inline (*Rectangle).Area with cost 67 as: method(*Rectangle) func() int { return r.Height() * r.Width() }
./square.go:21:61: inlining call to (*Rectangle).Width method(*Rectangle) func() int { return 6 }
./square.go:23:6: cannot inline main: function too complex: cost 150 exceeds budget 80                        
./square.go:25:20: inlining call to (*Rectangle).Area method(*Rectangle) func() int { return r.Height() * r.Width() }
./square.go:25:20: inlining call to (*Rectangle).Width method(*Rectangle) func() int { return 6 }

As the multiplication performed by r.Area() is cheap compared to the overhead of calling it, inlining r.Area()’s single expression is a net win even if its downstream call to r.Height() remains ineligible.

Fast path inlining

The most startling example of the power of mid-stack inlining comes from 2019 when Carlo Alberto Ferraris improved the performance of sync.Mutex.Lock() by allowing the fast path of the lock–the uncontended case–to be inlined into its caller. Prior to this change sync.Mutex.Lock() was a large function containing many hairy conditions which made it ineligible to be inlined. Even in the case where the lock was available, the caller had to pay the overhead of calling sync.Mutex.Lock().

Carlo’s change split sync.Mutex.Lock() into two functions (a process he dubbed outlining). The outer sync.Mutex.Lock() method now calls sync/atomic.CompareAndSwapInt32() and returns to the caller immediately if the CAS succeeds. If not, the function falls through to sync.Mutex.lockSlow() which handles the slow path required to register interest on the lock and park the goroutine.4

% go build -gcflags='-m=2 -l=0' sync 2>&1 | grep '(*Mutex).Lock'
../go/src/sync/mutex.go:72:6: can inline (*Mutex).Lock with cost 69 as: method(*Mutex) func() { if "sync/atomic".CompareAndSwapInt32(&m.state, 0, mutexLocked) { if race.Enabled {  }; return  }; m.lockSlow() }

By splitting the function into an easily inlineable outer function that falls through to a complex inner function handling the slow path, Carlo combined mid stack inlining and the compiler’s support for intrinsic operations to reduce the cost of an uncontended lock by 14%. Then he repeated the trick for an additional 9% saving in sync.RWMutex.Unlock().
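
The outlining pattern generalises beyond the runtime. Here is a sketch of the shape of the trick (a simplified sync.Once style type, not the real standard library code), where the cheap common case sits in a small wrapper the compiler can inline and everything hairy is pushed into a separate slow path function:

package main

import (
    "sync"
    "sync/atomic"
)

type once struct {
    done uint32
    m    sync.Mutex
}

// Do is small enough to inline; the common case is a single atomic load.
func (o *once) Do(f func()) {
    if atomic.LoadUint32(&o.done) == 1 {
        return // fast path, inlinable into the caller
    }
    o.doSlow(f) // rare path, too complex to inline
}

func (o *once) doSlow(f func()) {
    o.m.Lock()
    defer o.m.Unlock()
    if o.done == 0 {
        defer atomic.StoreUint32(&o.done, 1)
        f()
    }
}

func main() {
    var o once
    o.Do(func() { println("initialised") })
    o.Do(func() { println("never printed") })
}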

Inlining optimisations in Go

This is a post about how the Go compiler implements inlining and how this optimisation affects your Go code.

n.b. This article focuses on gc, the de facto Go compiler from golang.org. The concepts discussed apply broadly to other Go compilers like gccgo and tinygo but may differ in implementation and efficacy.

What is inlining?

Inlining is the act of combining smaller functions into their respective callers. In the early days of computing this optimisation was typically performed by hand. Nowadays inlining is one of a class of fundamental optimisations performed automatically during the compilation process.

Why is inlining important?

Inlining is important for two reasons. The first is that it removes the overhead of the function call itself. The second is that it permits the compiler to more effectively apply other optimisation strategies.

Function call overhead

Calling a function1 in any language carries a cost. There are the overheads of marshalling parameters into registers or onto the stack (depending on the ABI) and reversing the process on return. Invoking a function call involves jumping the program counter from one point in the instruction stream to another which can cause a pipeline stall. Once inside the function there is usually some preamble required to prepare a new stack frame for the function to execute and a similar epilogue needed to retire the frame before returning to the caller.

In Go a function call carries additional costs to support dynamic stack growth. On entry the amount of stack space available to the goroutine is compared to the amount required for the function. If insufficient stack space is available, the preamble jumps into the runtime logic that grows the stack by copying it to a new, larger, location. Once this is done the runtime jumps back to the start of the original function, the stack check is performed again, which now passes, and the call continues. In this way goroutines can start with a small stack allocation which grows only when needed.2

This check is cheap–only a few instructions–and because goroutine stacks grow geometrically the check rarely fails. Thus, the branch prediction unit inside a modern processor can hide the cost of the stack check by assuming it will always be successful. In the case where the processor mis-predicts the stack check and has to discard the work done while it was executing speculatively, the cost of the pipeline stall is relatively small compared to the cost of the work needed for the runtime to grow a goroutine’s stack.

While the overhead of the generic and Go specific components of each function call are well optimised by modern processors using speculative execution techniques, those overheads cannot be entirely eliminated, thus each function call carries with it a performance cost over and above the time it takes to perform useful work. As a function call’s overhead is fixed, smaller functions pay a larger cost relative to larger ones because they tend to do less useful work per invocation.

The solution to eliminating these overheads must therefore be to eliminate the function call itself, which the Go compiler does, under certain conditions, by replacing the call to a function with the contents of the function. This is known as inlining because it brings the body of the function in line with its caller.

Improved optimisation opportunities

Dr. Cliff Click describes inlining as the optimisation performed by modern compilers as it forms the basis for optimisations like constant propagation and dead code elimination. In effect, inlining allows the compiler to see further, allowing it to observe, in the context that a particular function is being called, logic that can be further simplified or eliminated entirely. As inlining can be applied recursively, optimisation decisions can be made not only in the context of each individual function, but also applied to the chain of functions in a call path.

Inlining in action

The effects of inlining can be demonstrated with this small example

package main

import "testing"

//go:noinline
func max(a, b int) int {
    if a > b {
        return a
    }
    return b
}

var Result int

func BenchmarkMax(b *testing.B) {
    var r int
    for i := 0; i < b.N; i++ {
        r = max(-1, i)
    }
    Result = r
}

Running this benchmark gives the following result:3

% go test -bench=. 
BenchmarkMax-4   530687617         2.24 ns/op

The cost of max(-1, i) is around 2.24 nanoseconds on my 2015 MacBook Air. Now let’s remove the //go:noinline pragma and see the result:

% go test -bench=. 
BenchmarkMax-4   1000000000         0.514 ns/op

From 2.24 ns to 0.51 ns, or according to benchstat, a 78% improvement.

% benchstat {old,new}.txt
name   old time/op  new time/op  delta
Max-4  2.21ns ± 1%  0.49ns ± 6%  -77.96%  (p=0.000 n=18+19)

Where did these improvements come from?

First, the removal of the function call and associated preamble4 was a major contributor. Pulling the contents of max into its caller reduced the number of instructions executed by the processor and eliminated several branches.

Now that the contents of max are visible to the compiler as it optimises BenchmarkMax, it can make some additional improvements. Consider that once max is inlined, this is what the body of BenchmarkMax looks like to the compiler:

func BenchmarkMax(b *testing.B) {
    var r int
    for i := 0; i < b.N; i++ {
        if -1 > i {
            r = -1
        } else {
            r = i
        }
    }
    Result = r
}

Running the benchmark again we see our manually inlined version performs as well as the version inlined by the compiler

% benchstat {old,new}.txt
name   old time/op  new time/op  delta
Max-4  2.21ns ± 1%  0.48ns ± 3%  -78.14%  (p=0.000 n=18+18)

Now that the compiler has access to the result of inlining max into BenchmarkMax, it can apply optimisation passes which were not possible before. For example, the compiler has noted that i is initialised to 0 and only incremented so any comparison with i can assume i will never be negative. Thus, the condition -1 > i will never be true.5

Having proved that -1 > i will never be true, the compiler can simplify the code to

func BenchmarkMax(b *testing.B) {
    var r int
    for i := 0; i < b.N; i++ {
        if false {
            r = -1
        } else {
            r = i
        }
    }
    Result = r
}

and because the branch is now a constant, the compiler can eliminate the unreachable path leaving it with

func BenchmarkMax(b *testing.B) {
    var r int
    for i := 0; i < b.N; i++ {
        r = i
    }
    Result = r
}

Thus, through inlining and the optimisations it unlocks, the compiler has reduced the expression r = max(-1, i) to simply r = i.

The limits of inlining

In this article I’ve discussed so called leaf inlining; the act of inlining a function at the bottom of a call stack into its direct caller. Inlining is a recursive process: once a function has been inlined into its caller, the compiler may inline the resulting code into its caller, and so on. For example, this code

func BenchmarkMaxMaxMax(b *testing.B) {
    var r int
    for i := 0; i < b.N; i++ {
        r = max(max(-1, i), max(0, i))
    }
    Result = r
}

runs as fast as the previous example as the compiler is able to repeatedly apply the optimisations outlined above to reduce the code to the same r = i expression.

In the next article I’ll discuss an alternative inlining strategy when the Go compiler wishes to inline a function in the middle of a call stack. Finally I’ll discuss the lengths to which the compiler is prepared to go to inline code, and which Go constructs are currently beyond its capability.

go test -v streaming output

The testing package is one of my favourite packages in the Go standard library, not just because of its low noise approach to unit testing, but also because, over the lifetime of Go, it has received a steady stream of quality of life improvements driven by real world usage.

The most recent example of this is that, in Go 1.14, go test -v will stream t.Log output as it happens, rather than hoarding it til the end of the test run. Here’s an example:

package main

import (
	"fmt"
	"testing"
	"time"
)

func TestLogStreaming(t *testing.T) {
	for i := 0; i < 5; i++ {
		time.Sleep(300 * time.Millisecond)
		fmt.Println("fmt.Println:", i)
		t.Log("t.Log:", i)
	}
}

Note: Calling fmt.Println inside a test is generally considered a no-no as it bypasses the testing package’s output buffering irrespective of the -v flag. However, for this example, it’s necessary to demonstrate the streaming t.Log change.

% go1.13 test -v tlog_test.go
=== RUN   TestLogStreaming
fmt.Println: 0
fmt.Println: 1
fmt.Println: 2
fmt.Println: 3
fmt.Println: 4
--- PASS: TestLogStreaming (1.52s)
    tlog_test.go:13: t.Log: 0
    tlog_test.go:13: t.Log: 1
    tlog_test.go:13: t.Log: 2
    tlog_test.go:13: t.Log: 3
    tlog_test.go:13: t.Log: 4
PASS
ok      command-line-arguments  1.971s

Under Go 1.13 and earlier the fmt.Println lines are output immediately. The t.Log lines are buffered and printed after the test completes.

% go1.14 test -v tlog_test.go
=== RUN   TestLogStreaming
fmt.Println: 0
    TestLogStreaming: tlog_test.go:13: t.Log: 0
fmt.Println: 1
    TestLogStreaming: tlog_test.go:13: t.Log: 1
fmt.Println: 2
    TestLogStreaming: tlog_test.go:13: t.Log: 2
fmt.Println: 3
    TestLogStreaming: tlog_test.go:13: t.Log: 3
fmt.Println: 4
    TestLogStreaming: tlog_test.go:13: t.Log: 4
--- PASS: TestLogStreaming (1.51s)
PASS
ok      command-line-arguments  1.809s

Under Go 1.14 the fmt.Println and t.Log lines are interleaved, rather than waiting for the test to complete, demonstrating that test output is streamed when go test -v is used.

This is a great quality of life improvement for integration style tests that often retry for long periods when the test is failing. Streaming t.Log output will help Gophers debug those test failures without having to wait until the entire test times out to receive their output.