Monthly Archives: May 2020

Diamond interface composition in Go 1.14

Per the overlapping interfaces proposal, Go 1.14 now permits embedding of interfaces with overlapping method sets. This is a brief post explain what this change means:

Let’s start with the definition of the three key interfaces from the io package; io.Reader, io.Writer, and io.Closer:

package io

type Reader interface {
    Read([]byte) (int, error)
}

type Writer interface {
    Write([]byte) (int, error)
}

type Closer interface {
    Close() error
}

Just as embedding a type inside a struct allows the embedded type’s fields and methods to be accessed as if it were declared on the embedding type¹, the process is true for interfaces. Thus there is no difference between explicitly declaring

type ReadCloser interface {
    Read([]byte) (int, error)
    Close() error
}

and using embedding to compose the interface

type ReadCloser interface {
    Reader
    Closer
}

You can even mix and match

type WriteCloser interface {
    Write([]byte) (int, error)
    Closer
}

However, prior to Go 1.14, if you continued to compose interface declarations in this manner you would likely find that something like this,

type ReadWriteCloser interface {
    ReadCloser
    WriterCloser
}

would fail to compile

% go build interfaces.go
command-line-arguments
./interfaces.go:27:2: duplicate method Close

Fortunately, with Go 1.14 this is no longer a limitation, thus solving problems that typically occur with diamond-shaped embedding graphs.

However, there is a catch that I ran into attempting to demonstrate this feature to the local user group–this feature is only enabled when the Go compiler uses the 1.14 (or later) spec.

As near as I can make out the rules for which version of the Go spec is used during compilation appear to be:

If your source code is stored inside GOPATH (or you have disabled modules with GO111MODULE=off) then the version of the Go spec used to compile with matches the version of the compiler you are using. Said another way, if you have Go 1.13 installed, your Go version is 1.13. If you have Go 1.14 installed, your version is 1.14. No surprises here.
If your source code is stored outside GOPATH (or you have forced modules on with GO111MODULE=on) then the go tool will take the Go version from the go.mod file.
If there is no Go version listed in go.mod then the version of the spec will be the version of Go installed. This is identical to point 1.
If you are in module mode, either by being outside GOPATH or with GO111MODULE=on, but there is no go.mod file in the current, or any parent, directory then the version of the Go spec used to compile your code defaults to Go 1.13.

The last point caught me out.

Fatih’s question

A few days ago Fatih posted this question on twitter.

I’m going to attempt to give my answer, however to do that I need to apply some simplifications as my previous attempts to answer it involved a lot of phrases like a pointer to a pointer, and other unhelpful waffling. Hopefully my simplified answer can be useful in building a mental framework to answer Fatih’s original question.

Restating the question

Fatih’s original tweet showed four different variations of json.Unmarshal. I’m going to focus on the last two, which I’ll rewrite a little:

package main

import (
    "encoding/json"
    "fmt"
)

type Result struct {
    Foo string `json:"foo"`
}

func main() {
    content := []byte(`{"foo": "bar"}`)
    var result1, result2 *Result

    err := json.Unmarshal(content, &result1)
    fmt.Println(result1, err) // &{bar} <nil>

    err = json.Unmarshal(content, result2)
    fmt.Println(result2, err) // <nil> json: Unmarshal(nil *main.Result)
}

Restated in words, result1 and result2 are the same type; *Result. Decoding into result1 works as expected, whereas decoding into result2 causes the json package to complain that the value passed to Unmarshal is nil. However, both values were declared without an initialiser so both would have taken on the type’s zero value, nil.

Eagle eyed readers will have spotted that the reason for the difference is the first invocation is passed &result1, while the second is passed result2, but this explanation is unsatisfactory because the documentation for json.Unmarshal states:

Unmarshal parses the JSON-encoded data and stores the result in the value pointed to by v. If v is nil or not a pointer, Unmarshal returns an InvalidUnmarshalError.

Which is confusing because result1 and result2 are pointers. Furthermore, without initialisation, both are nil. Now, the documentation is correct (as you’d expect from a package that has been hammered on for a decade), but explaining why takes a little more investigation.

Functions receive a copy of their arguments

Every assignment in Go is a copy, this includes function arguments and return values.

package main

import (
    "fmt"
)

func increment(v int) {
    v++
}

func main() {
    v := 1
    increment(v)
    fmt.Println(v) // 1
}

In this example, increment is operating on a copy of main‘s v. This is because the v declared in main and increment‘s v parameter have different addresses in memory. Thus changes to increment‘s v cannot affect the contents of main‘s v.

package main

import (
    "fmt"
)

func increment(v *int) {
    *v++
}

func main() {
    v := 1
    increment(&v)
    fmt.Println(v) // 2
}

If we wanted to write increment in a way that it could affect the contents of its caller we would need to pass a reference, a pointer, to main.v.² This example demonstrates why json.Unmarshal needs a pointer to the value to decode JSON into.

Pointers to pointers

Returning to the original question, both result1 and result2 are declared as *Result, that is, pointers to a Result value. We established that you have to pass the address of caller’s value to json.Unmarshal otherwise it won’t be able to alter the contents of the caller’s value. Why then must we pass the address of result1, a **Result, a pointer to a pointer to a Result, for the operation to succeed.

To explain this another detour is required. Consider this code:

package main

import (
    "encoding/json"
    "fmt"
)

type Result struct {
    Foo *string `json:"foo"`
}

func main() {
    content := []byte(`{"foo": "bar"}`)
    var result1 *Result

    err := json.Unmarshal(content, &result1)
    fmt.Printf("%#v %v", result1, err) // &main.Result{Foo:(*string)(0xc0000102f0)} <nil>
}

In this example Result contains a pointer typed field, Foo *string. During JSON decoding Unmarshal allocated a new string value, stored the value bar in it, then placed the address of the string in Result.Foo. This behaviour is quite handy as it frees the caller from having to initialise Result.Foo and makes it easier to detect when a field was not initialised because the JSON did not contain a value. Beyond the convenience this offers for simple examples it would be prohibitively difficult for the caller to properly initialise all the reference type fields in a structure before decoding unknown JSON without first inspecting the incoming JSON which itself may be problematic if the input is coming from an io.Reader without the ability to rewind the input.

To unmarshal JSON into a pointer, Unmarshal first handles the case of the JSON being the JSON literal null. In that case, Unmarshal sets the pointer to nil. Otherwise, Unmarshal unmarshals the JSON into the value pointed at by the pointer. If the pointer is nil, Unmarshal allocates a new value for it to point to.

json.Unmarshal‘s handling of pointer fields is clearly documented, and works as you would expect, allocating a new value whenever there is a need to decode into a pointer shaped field. It is this behaviour that gives us a hint to what is happening in the original example.

We’ve seen that when json.Unmarshal encounters a field which points to nil it will allocate a new value of the correct type and assign its address the field before proceeding. Not only is does behaviour is applied recursively–for example in the case of a complex structure which contains pointers to other structures–but it also applies to the value passed to Unmarshal.

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    content := []byte(`1`)
    var result *int

    err := json.Unmarshal(content, &result)
    fmt.Println(*result, err) // 1 <nil>
}

In this example result is not a struct, but a simple *int which, lacking an initialiser, defaults to nil. After JSON decoding, result now points to an int with the value 1.

Putting the pieces together

Now I think I’m ready to take a shot at answering Fatih’s question.

json.Unmarshal requires the address of the variable you want to decode into, otherwise it would decode into a temporary copy which would be discard on return. Normally this is done by declaring a value, then passing its address, or explicitly initialising the the value

var result1 Result
err := json.Unmarshal(content, &result1) // this is fine

var result2 = new(Result)
err = json.Unmarshal(content, result2) // and this

var result3 = &Result{}
err = json.Unmarshal(content, result3) // this is also fine

In all three cases the address that the *Result points too is not nil, it points to initialised memory that json.Unmarshal decodes into.

Now consider what happens when json.Unmarshal encounters this

var result4 *Result
err = json.Unmarshal(content, result4) // err json: Unmarshal(nil *main.Result)

result2, result3, and the expression &result1 point to a Result. However result4, even though it has the same type as the previous three, does not point to initialised memory, it points to nil. Thus, according to the examples we saw previously, before json.Unmarshal can decode into it, the memory result4 points too must be initialised.

However, because each function receives a copy of its arguments, the caller’s result4 variable and the copy inside json.Unmarshal are unique. json.Unmarshal can allocate a new Result value and decode into it, but it cannot alter result4 to point to this new value because it was not provided with a reference to result4, only a copy of its contents.

Ensmallening Go binaries by prohibiting comparisons

Conventional wisdom dictates that the larger the number of types declared in a Go program, the larger the resulting binary. Intuitively this makes sense, after all, what’s the point in defining a bunch of types if you’re not going to write code that operates on them. However, part of the job of a linker is to detect functions which are not referenced by a program–say they are part of a library of which only a subset of functionality is used–and remove them from the final output. Yet, the adage mo’ types, mo’ binary holds true for the majority of Go programs.

In this post I’ll dig into what equality, in the context of a Go program, means and why changes like this have a measurable impact on the size of a Go program.

Defining equality between two values

The Go spec defines the concepts of assignability and equality. Assignabiity is the act of assigning a value to an identifier. Not everything which is declared can be assigned, for example constants and functions. Equality is the act of comparing two identifies by asking are their contents the same?

Being a strongly typed language, the notion of sameness is fundamentally rooted in the identifier’s type. Two things can only be the same if they are of the same type. Beyond that, the type of the values defines how they are compared.

For example, integers are compared arithmetically. For pointer types, equality is determining if the addresses they point too are the same. Reference types like maps and channels, like pointers, are considered to be the same if they have the same address.

These are all examples of bitwise equality, that is, if the bit patterns of the memory that value occupies are the same, those values are equal. This is known as memcmp, short for memory comparison, as equality is defined by comparing the contents of two areas of memory.

Hold on to this idea, I’ll come back to in a second.

Struct equality

Beyond scalar types like integers, floats, and pointers is the realm of compound types; structs. All structs are laid out in memory in program order, thus this declaration:

type S struct {
    a, b, c, d int64
}

will consume 32 bytes of memory; 8 bytes for a, then 8 bytes for b, and so on. The spec says that struct values are comparable if all their fields are comparable. Thus two structs are equal iff each of their fields are equal.

a := S{1, 2, 3, 4}
b := S{1, 2, 3, 4}
fmt.Println(a == b) // prints true

Under the hood the compiler uses memcmp to compare the 32 bytes of a and b.

Padding and alignment

However the simplistic bitwise comparison strategy will fail in situations like this:

type S struct {
    a byte
    b uint64
    c int16
    d uint32
}

func main()
    a := S{1, 2, 3, 4}
    b := S{1, 2, 3, 4}
    fmt.Println(a == b) // prints true
}

The code compiles, the comparison is still true, but under the hood the compiler cannot rely on comparing the bit patterns of a and b because the structure contains padding.

Go requires each field in a struct to be naturally aligned. 2 byte values must start on an even address, four byte values on an address divisible by 4, and so on³. The compiler inserts padding to ensure the fields are aligned to according to their type and the underlying platform. In effect, after padding, this is what the compiler sees⁴:

type S struct {
    a byte
    _ [7]byte // padding
    b uint64
    c int16
    _ [2]int16 // padding
    d uint32
}

Padding exists to ensure the correct field alignments, and while it does take up space in memory, the contents of those padding bytes are unknown. You might assume that, being Go, the padding bytes are always zero, but it turns out that’s not the case–the contents of padding bytes are simply not defined. Because they’re not defined to always be a certain value, doing a bitwise comparison may return false because the nine bytes of padding spread throughout the 24 bytes of S are may not be the same.

The Go compiler solves this problem by generating what is known as an equality function. In this case S‘s equality function knows how to compare two values of type S by comparing only the fields in the function while skipping over the padding.

Type algorithms

Phew, that was a lot of setup to illustrate why, for each type defined in a Go program, the compiler may generate several supporting functions, known inside the compiler as the type’s algorithms. In addition to the equality function the compiler will generate a hash function if the type is used as a map key. Like the equality function, the hash function must consider factors like padding when computing its result to ensure it remains stable.

It turns out that it can be hard, and sometimes non obvious, to intuit when the compiler will generate these functions–it’s more than you’d expect–and it can be hard for the linker to eliminate the ones that are not needed as reflection often causes the linker to be more conservative when trimming types.

Reducing binary size by prohibiting comparisons

Now we’re at a point to explain Brad’s change. By adding an incomparable field ⁵ to the type, the resulting struct is by extension incomparable, thus forcing the compiler to elide the generation of eq and hash algorithms, short circuiting the linkers elimination of those types and, in practice, reducing the size of the final binary. As an example of this technique, this program:

package main

import "fmt"

func main() {
    type t struct {
        // _ [0][]byte uncomment to prevent comparison
        a byte
        b uint16
        c int32
        d uint64
    }
    var a t
    fmt.Println(a)
}

when compiled with Go 1.14.2 (darwin/amd64), decreased from 2174088 to 2174056, a saving of 32 bytes. In isolation this 32 byte saving may seem like small beer, but consider that equality and hash functions can be generated for every type in the transitive closure of your program and all its dependencies, and the size of these functions varies depending on the size of the type and its complexity, prohibiting them can have a sizeable impact on the final binary over and above the old saw of -ldflags="-s -w".

The bottom line, if you don’t wish to make your types comparable, a hack like this enforces it at the source level while contributing to a small reduction in the size of your binary.

Addendum: thanks to Brad’s prodding, Go 1.15 already has a bunch of improvements by Cherry Zhang and Keith Randall that fix the most egregious of the failures to eliminate unnecessary equality and hash functions (although I suspect it was also to avoid the proliferation of this class of CLs).

Mid-stack inlining in Go

In the previous post I discussed how leaf inlining allows the Go compiler to reduce the overhead of function calls and extend optimisation opportunities across function boundaries. In this post I’ll discuss the limits of inlining and leaf vs mid-stack inlining.

The limits of inlining

Inlining a function into its caller removes the call’s overhead and increases the opportunity for the compiler to apply additional optimisations so the question should be asked, if some inlining is good, would more be better, why not inline as much as possible?

Inlining trades possibly larger program sizes for potentially faster execution time. The main reason to limit inlining is creating many inlined copies of a function can increase compile time and result in larger binaries for marginal gain. Even taking into account the opportunities for further optimisation, aggressive inlining tends to increase the size of, and the time too compile, the resulting binary.

Inlining works best for small functions that do relatively little work compared to the overhead of calling them. As the size of a function grows, the time saved avoiding the call’s overhead diminishes relative to the work done inside the function. Larger functions tend to be more complex, thus the benefits of optimising their inlined forms vs in situ are reduced.

Inlining budget

During compilation each function’s inlineabilty is calculated using what is known as the inlining budget⁶. The cost calculation can be tricky to internalise but is broadly one unit per node in the AST for simple things like unary and binary operations but can be higher for complex operations like make. Consider this example:

package main

func small() string {
    s := "hello, " + "world!"
    return s
}

func large() string {
    s := "a"
    s += "b"
    s += "c"
    s += "d"
    s += "e"
    s += "f"
    s += "g"
    s += "h"
    s += "i"
    s += "j"
    s += "k"
    s += "l"
    s += "m"
    s += "n"
    s += "o"
    s += "p"
    s += "q"
    s += "r"
    s += "s"
    s += "t"
    s += "u"
    s += "v"
    s += "w"
    s += "x"
    s += "y"
    s += "z"
    return s
}

func main() {
    small()
    large()
}

Compiling this function with -gcflags=-m=2 allows us to see the cost the compiler assigns to each function.

% go build -gcflags=-m=2 inl.go 
# command-line-arguments
./inl.go:3:6: can inline small with cost 7 as: func() string { s := "hello, world!"; return s }
./inl.go:8:6: cannot inline large: function too complex: cost 82 exceeds budget 80
./inl.go:38:6: can inline main with cost 68 as: func() { small(); large() }
./inl.go:39:7: inlining call to small func() string { s := "hello, world!"; return s }

The compiler determined that func small() can be inlined due to its cost of 7. func large() was determined to be too expensive. func main()has been marked as eligible and assigned a cost of 68; 7 from the body of small, 57 from the function call to small and the remainder in its own overhead.

The inlining budget can be controlled to some degree with the -gcflag=-l flag. Currently the values that apply are:

-gcflags=-l=0 is the default level of inlining.
-gcflags=-l (or -gcflags=-l=1) disables inlining.
-gcflags=-l=2 and -gcflags=-l=3 are currently unused and have no effect over -gcflags=-l=0
-gcflags=-l=4 reduces the cost for inlining non-leaf functions and calls through interfaces.⁷

Hairy optimisations

Some functions with a relatively low inlining cost may be ineligible because of their complexity. This is known as the function’s hairiness as the semantics of some operations are hard to reason about once inlined, for example recover, break. Others, like select and go, involve co-ordination with the runtime so the extra effort of inlining doesn’t pay for itself.

The list of hairy statements also includes things like for and range which don’t have an inherently large cost, but simply haven’t been optimised yet.

Mid stack inlining

Historically the Go compiler only performed leaf inlining–only functions which did not call other functions were eligible. In the context of the hairiness discussion previously, a function call would disqualify the function from being inlined.

Enter mid stack inlining which, as its name implies, allows functions in the middle of a call stack to be inlined without requiring everything below them to be eligible. Mid stack inlining was introduced by David Lazar in Go 1.9 and improved in subsequent releases. This presentation goes into some of the difficulties with retaining the behaviour of stack traces and runtime.Callers in code paths that had been heavily inlined.

We see an example of mid-stack inlining in the previous example. After inlining, func main() contains the body of func small() and a call to func large(), thus it is considered a non-leaf function. Historically this would have prevented it from being further inlined even though its combined cost was less than the inlining budget.

The primary use case for mid stack inlining is to reduce the overhead of a path through the call stack. Consider this example:

package main

import (
    "fmt"
    "strconv"
)

type Rectangle struct {}

//go:noinline
func (r *Rectangle) Height() int {
    h, _ := strconv.ParseInt("7", 10, 0)
    return int(h)
}

func (r *Rectangle) Width() int {
    return 6
}

func (r *Rectangle) Area() int { return r.Height() * r.Width() }

func main() {
    var r Rectangle
    fmt.Println(r.Area())
}

In this example r.Area() is a simple function which calls two others. r.Width() can be inlined while r.Height(), simulated here with the //go:noinline annotation, cannot. ⁸

% go build -gcflags='-m=2' square.go                                                                                                          
# command-line-arguments
./square.go:12:6: cannot inline (*Rectangle).Height: marked go:noinline                                                                               
./square.go:17:6: can inline (*Rectangle).Width with cost 2 as: method(*Rectangle) func() int { return 6 }
./square.go:21:6: can inline (*Rectangle).Area with cost 67 as: method(*Rectangle) func() int { return r.Height() * r.Width() }                       ./square.go:21:61: inlining call to (*Rectangle).Width method(*Rectangle) func() int { return 6 }                                                     
./square.go:23:6: cannot inline main: function too complex: cost 150 exceeds budget 80                        
./square.go:25:20: inlining call to (*Rectangle).Area method(*Rectangle) func() int { return r.Height() * r.Width() }
./square.go:25:20: inlining call to (*Rectangle).Width method(*Rectangle) func() int { return 6 }

As the multiplication performed by r.Area() is cheap compared to the overhead of calling it, inlining r.Area()‘s single expression is a net win even if its downstream caller to r.Height() remains ineligible.

Fast path inlining

The most startling example of the power of mid-stack inlining comes from 2019 when Carlo Alberto Ferraris improved the performance of sync.Mutex.Lock() by allowing the fast path of the lock–the uncontended case–to be inlined into its caller. Prior to this change sync.Mutex.Lock() was a large function containing many hairy conditions which made it ineligible to be inlined. Even in the case where the lock was available, the caller had to pay the overhead of calling sync.Mutex.Lock().

Carlo’s change split sync.Mutex.Lock() into two functions (a process he dubbed outlining). The outer sync.Mutex.Lock() method now calls sync/atomic.CompareAndSwapInt32() and returns to the caller immediately if the CAS succeeds. If not, the function falls through to sync.Mutex.lockSlow() which handles the slow path required to register interest on the lock and park the goroutine.⁹

% go build -gcflags='-m=2 -l=0' sync 2>&1 | grep '(*Mutex).Lock'
../go/src/sync/mutex.go:72:6: can inline (*Mutex).Lock with cost 69 as: method(*Mutex) func() { if "sync/atomic".CompareAndSwapInt32(&m.state, 0, mutexLocked) { if race.Enabled {  }; return  }; m.lockSlow() }

By splitting the function into an easily inlineable outer function, falling through to a complex inner function to handle the slow path Carlo’s combined mid stack inlining and the compiler’s support for intrinsic operations to reduce the cost of an uncontended lock by 14%. Then he repeated the trick for an additional 9% saving in sync.RWMutex.Unlock().

Dave Cheney

The acme of foolishness