Taking Go modules for a spin

Update: Since this post was written, Go 1.11beta2 has been released. I’ve updated the setup section to reflect this. Russ Cox kindly wrote to me to explain the reasoning behind storing the Go module cache in $GOPATH. I’ve included his response inline.

This weekend I wanted to play with Ubuntu 18.04 on a spare machine. This gave me a perfect excuse to try out the modules feature recently merged into the Go 1.11 development branch.

TL;DR: When Go 1.11 ships you’ll be able to download the tarball and unpack it anywhere you like. When Go 1.11 ships you’ll be able to write Go modules anywhere you like.

Setup

The recently released Go 1.11beta2 has support for Go modules.

% curl https://dl.google.com/go/go1.11beta2.linux-amd64.tar.gz | \
         tar xz --transform=s/^go/go1.11/g
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
100  169M  100  169M    0     0  23.6M      0  0:00:07  0:00:07 --:--:-- 21.2M
% go1.11/bin/go version go version go1.11beta2 linux/amd64

That’s all you need to do to install Go 1.11beta2. Out of shot, I’ve added $HOME/go1.11/bin to my $PATH.

Kicking the tires

Now we have a version of Go with module support installed, I wanted to try to use it to manage the dependencies for httpstat, a clone of the Python tool of the same name that many collaborators swarmed on to build in late 2016.

To show that Go 1.11 won’t need you to declare a $GOPATH or use a specific directly layout for the location of your project, I’m going to use my favourite directory for source code, ~/devel.

% git clone https://github.com/davecheney/httpstat devel/httpstat
Cloning into 'devel/httpstat'...
remote: Counting objects: 2326, done.
remote: Total 2326 (delta 0), reused 0 (delta 0), pack-reused 2326
Receiving objects: 100% (2326/2326), 8.73 MiB | 830.00 KiB/s, done.
Resolving deltas: 100% (673/673), done.
Checking out files: 100% (1361/1361), done.
% cd devel/httpstat % go mod -init -module github.com/davecheney/httpstat
go: creating new go.mod: module github.com/davecheney/httpstat
go: copying requirements from Gopkg.lock

Nice, go mod -init translated my existing Gopkg.lock file into its own go.mod format.

% cat go.mod
module github.com/davecheney/httpstat
require (
        github.com/fatih/color v1.5.0
        github.com/mattn/go-colorable v0.0.9
        github.com/mattn/go-isatty v0.0.3
        golang.org/x/net v0.0.0-20170922011244-0744d001aa84
        golang.org/x/sys v0.0.0-20170922123423-429f518978ab
        golang.org/x/text v0.0.0-20170915090833-1cbadb444a80
)

Let’s give it a try

% go build
go: finding golang.org/x/net v0.0.0-20170922011244-0744d001aa84
go: finding github.com/mattn/go-colorable v0.0.9
go: finding github.com/mattn/go-isatty v0.0.3
go: finding golang.org/x/sys v0.0.0-20170922123423-429f518978ab
go: finding github.com/fatih/color v1.5.0
go: finding golang.org/x/text v0.0.0-20170915090833-1cbadb444a80
go: downloading github.com/fatih/color v1.5.0
go: downloading github.com/mattn/go-isatty v0.0.3
go: downloading golang.org/x/net v0.0.0-20170922011244-0744d001aa84
go: downloading github.com/mattn/go-colorable v0.0.9
go: downloading golang.org/x/text v0.0.0-20170915090833-1cbadb444a80

Very nice, go build ignored the vendor/ folder in this repository (because we’re outside $GOPATH) and fetched the revisions it needed. Let’s try out the binary and make sure it works.

% ./httpstat golang.org
Connected to 216.58.196.145:443
HTTP/2.0 200 OK
Server: Google Frontend
Alt-Svc: quic=":443"; ma=2592000; v="44,43,39,35"
Cache-Control: private
Content-Type: text/html; charset=utf-8
Date: Sat, 14 Jul 2018 08:20:43 GMT Strict-Transport-Security: max-age=31536000; preload
Vary: Accept-Encoding
X-Cloud-Trace-Context: 323cd59570cc084fed506f7e85d79d9f
Body discarded

Move along, nothing to see here.

Go module source cache

In the previous version of this article I included a footnote mentioning that go get in module mode stored its downloaded source in $GOPATH/src/mod not the cache added in Go 1.10. Russ Cox kindly wrote to me to explain the rational behind this choice and also copied this to a recent thread on golang-dev. For completeness, here is his response:

The build cache ($GOCACHE, defaulting to $HOME/.cache/go-build) is for storing recent compilation results, so that if you need to do that exact compilation again, you can just reuse the file. The build cache holds entries that are like “if you run this exact compiler on these exact inputs. this is the output you’d get.” If the answer is not in the cache, your build uses a little more CPU to run the compiler nstead of reusing the output. But you are guaranteed to be able to run the compiler instead, since you have the exact inputs and the compiler binary (or else you couldn’t even look up the answer in the cache).
The module cache ($GOPATH/src/mod, defaulting to $HOME/go/src/mod) is for storing downloaded source code, so that every build does not redownload the same code and does not require the network or the original code to be available. The module cache holds entries that are like “if you need to download mymodule@v1.2.3, here are the files you’d get.” If the answer is not in the cache, you have to go out to the network. Maybe you don’t have a network right now. Maybe the code has been deleted. It’s not anywhere near guaranteed that you can redownload the sources and also get the same result. Hopefully you can, but it’s not an absolute certainty like for the build cache. (The go.sum file will detect if you get a different answer on re-download, but knowing you got the wrong bits doesn’t help you make progress on actually building your code. Also these paths end up in file-line information in binaries, so they show up in stack traces, and the like and feed into tools like text editors or debuggers that don’t necessarily know how to trigger the right cache refresh.)

Wrap up

You can build Go 1.11 from source right now anywhere you like. You don’t need to set an environment variable or follow a predefined location.

With Go 1.11 and modules you can write your Go modules anywhere you like. You’re no longer forced into having one copy of a project checked out in a specific sub directory of your $GOPATH.

Slices from the ground up

This blog post was inspired by a conversation with a co-worker about using a slice as a stack. The conversation turned into a wider discussion on the way slices work in Go, so I thought it would be useful to write it up.

Arrays

Every discussion of Go’s slice type starts by talking about something that isn’t a slice, namely, Go’s array type. Arrays in Go have two relevant properties:

They have a fixed size; [5]int is both an array of 5 ints and is distinct from [3]int.
They are value types. Consider this example:
```
package main

import "fmt"

func main() {
        var a [5]int
        b := a
        b[2] = 7
        fmt.Println(a, b) // prints [0 0 0 0 0] [0 0 7 0 0]
}
```
The statement b := a declares a new variable, b, of type [5]int, and copies the contents of a to b. Updating b has no effect on the contents of a because a and b are independent values.¹

Slices

Go’s slice type differs from its array counterpart in two important ways:

Slices do not have a fixed length. A slice’s length is not declared as part of its type, rather it is held within the slice itself and is recoverable with the built-in function len.²
Assigning one slice variable to another does not make a copy of the slices contents. This is because a slice does not directly hold its contents. Instead a slice holds a pointer to its underlying array³ which holds the contents of the slice.

As a result of the second property, two slices can share the same underlying array. Consider these examples:

Slicing a slice:
```
package main

import "fmt"

func main() {
        var a = []int{1,2,3,4,5}
        b := a[2:]
        b[0] = 0
        fmt.Println(a, b) // prints [1 2 0 4 5] [0 4 5]
}
```
In this example a and b share the same underlying array–even though b starts at a different offset in that array, and has a different length. Changes to the underlying array via b are thus visible to a.
Passing a slice to a function:
```
package main

import "fmt"

func negate(s []int) {
        for i := range s {
                s[i] = -s[i]
        }
}

func main() {
        var a = []int{1, 2, 3, 4, 5}
        negate(a)
        fmt.Println(a) // prints [-1 -2 -3 -4 -5]
}
```
In this example a is passed to negateas the formal parameter s. negate iterates over the elements of s, negating their sign. Even though negate does not return a value, or have any way to access the declaration of a in main, the contents of a are modified when passed to negate.

Most programmers have an intuitive understanding of how a Go slice’s underlying array works because it matches how array-like concepts in other languages tend to work. For example, here’s the first example of this section rewritten in Python:

Python 2.7.10 (default, Feb  7 2017, 00:08:15) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> a = [1,2,3,4,5]
>>> b = a
>>> b[2] = 0
>>> a
[1, 2, 0, 4, 5]

And also in Ruby:

irb(main):001:0> a = [1,2,3,4,5]
=> [1, 2, 3, 4, 5]
irb(main):002:0> b = a
=> [1, 2, 3, 4, 5]
irb(main):003:0> b[2] = 0
=> 0
irb(main):004:0> a
=> [1, 2, 0, 4, 5]

The same applies to most languages that treat arrays as objects or reference types.⁴

The slice header value

The magic that makes a slice behave both as a value and a pointer is to understand that a slice is actually a struct type. This is commonly referred to as a slice header after its counterpart in the reflect package. The definition of a slice header looks something like this:

package runtime

type slice struct {
        ptr   unsafe.Pointer
        len   int
        cap   int
}

This is important because unlike map and chan types slices are value types and are copied when assigned or passed as arguments to functions.

To illustrate this, programmers instinctively understand that square‘s formal parameter v is an independent copy of the v declared in main.

package main

import "fmt"

func square(v int) {
        v = v * v
}

func main() {
        v := 3
        square(v)
        fmt.Println(v) // prints 3, not 9
}

So the operation of square on its v has no effect on main‘s v. So too the formal parameter s of double is an independent copy of the slice s declared in main, not a pointer to main‘s s value.

package main

import "fmt"

func double(s []int) {
        s = append(s, s...)
}

func main() {
        s := []int{1, 2, 3}
        double(s)
        fmt.Println(s, len(s)) // prints [1 2 3] 3
}

The slightly unusual nature of a Go slice variable is it’s passed around as a value, not than a pointer. 90% of the time when you declare a struct in Go, you will pass around a pointer to values of that struct.⁵ This is quite uncommon, the only other example of passing a struct around as a value I can think of off hand is time.Time.

It is this exceptional behaviour of slices as values, rather than pointers to values, that can confuses Go programmer’s understanding of how slices work. Just remember that any time you assign, subslice, or pass or return, a slice, you’re making a copy of the three fields in the slice header; the pointer to the underlying array, and the current length and capacity.

Putting it all together

I’m going to conclude this post on the example of a slice as a stack that I opened this post with:

package main

import "fmt"

func f(s []string, level int) {
        if level > 5 {
               return
        }
        s = append(s, fmt.Sprint(level))
        f(s, level+1)
        fmt.Println("level:", level, "slice:", s)
}

func main() {
        f(nil, 0)
}

Starting from main we pass a nil slice into f as level 0. Inside f we append to s the current level before incrementing level and recursing. Once level exceeds 5, the calls to f return, printing their current level and the contents of their copy of s.

level: 5 slice: [0 1 2 3 4 5]
level: 4 slice: [0 1 2 3 4]
level: 3 slice: [0 1 2 3]
level: 2 slice: [0 1 2]
level: 1 slice: [0 1]
level: 0 slice: [0]

You can see that at each level the value of s was unaffected by the operation of other callers of f, and that while four underlying arrays were created ⁶ higher levels of f in the call stack are unaffected by the copy and reallocation of new underlying arrays as a by-product of append.

How the Go runtime implements maps efficiently (without generics)

This post discusses how maps are implemented in Go. It is based on a presentation I gave at the GoCon Spring 2018 conference in Tokyo, Japan.

What is a map function?

To understand how a map works, let’s first talk about the idea of the map function. A map function maps one value to another. Given one value, called a key, it will return a second, the value.

map(key) → value

Now, a map isn’t going to be very useful unless we can put some data in the map. We’ll need a function that adds data to the map

insert(map, key, value)

and a function that removes data from the map

delete(map, key)

There are other interesting properties of map implementations like querying if a key is present in the map, but they’re outside the scope of what we’re going to discuss today. Instead we’re just going to focus on these properties of a map; insertion, deletion and mapping keys to values.

Go’s map is a hashmap

The specific map implementation I’m going to talk about is the hashmap, because this is the implementation that the Go runtime uses. A hashmap is a classic data structure offering O(1) lookups on average and O(n) in the worst case. That is, when things are working well, the time to execute the map function is a near constant.

The size of this constant is part of the hashmap design and the point at which the map moves from O(1) to O(n) access time is determined by its hash function.

The hash function

What is a hash function? A hash function takes a key of an unknown length and returns a value with a fixed length.

hash(key) → integer

this hash value is almost always an integer for reasons that we’ll see in a moment.

Hash and map functions are similar. They both take a key and return a value. However in the case of the former, it returns a value derived from the key, not the value associated with the key.

Important properties of a hash function

It’s important to talk about the properties of a good hash function as the quality of the hash function determines how likely the map function is to run near O(1).

When used with a hashmap, hash functions have two important properties. The first is stability. The hash function must be stable. Given the same key, your hash function must return the same answer. If it doesn’t you will not be able to find things you put into the map.

The second property is good distribution. Given two near identical keys, the result should be wildly different. This is important for two reasons. Firstly, as we’ll see, values in a hashmap should be distributed evenly across buckets, otherwise the access time is not O(1). Secondly as the user can control some of the aspects of the input to the hash function, they may be able to control the output of the hash function, leading to poor distribution which has been a DDoS vector for some languages. This property is also known as collision resistance.

The hashmap data structure

The second part of a hashmap is the way data is stored.
The classical hashmap is an array of buckets each of which contains a pointer to an array of key/value entries. In this case our hashmap has eight buckets (as this is the value that the Go implementation uses) and each bucket can hold up to eight entries each (again drawn from the Go implementation). Using powers of two allows the use of cheap bit masks and shifts rather than expensive division.

As entries are added to a map, assuming a good hash function distribution, then the buckets will fill at roughly the same rate. Once the number of entries across each bucket passes some percentage of their total size, known as the load factor, then the map will grow by doubling the number of buckets and redistributing the entries across them.

With this data structure in mind, if we had a map of project names to GitHub stars, how would we go about inserting a value into the map?

We start with the key, feed it through our hash function, then mask off the bottom few bits to get the correct offset into our bucket array. This is the bucket that will hold all the entries whose hash ends in three (011 in binary). Finally we walk down the list of entries in the bucket until we find a free slot and we insert our key and value there. If the key was already present, we’d just overwrite the value.

Now, lets use the same diagram to look up a value in our map. The process is similar. We hash the key as before, then masking off the lower 3 bits, as our bucket array contains 8 entries, to navigate to the fifth bucket (101 in binary). If our hash function is correct then the string "moby/moby" will always hash to the same value, so we know that the key will not be in any other bucket. Now it’s a case of a linear search through the bucket comparing the key provided with the one stored in the entry.

Four properties of a hash map

That was a very high level explanation of the classical hashmap. We’ve seen there are four properties you need to implement a hashmap;

1. You need a hash function for the key.
2. You need an equality function to compare keys.
3. You need to know the size of the key and,
4. You need to know the size of the value because these affect the size of the bucket structure, which the compiler needs to know, as you walk or insert into that structure, how far to advance in memory.

Hashmaps in other languages

Before we talk about the way Go implements a hashmap, I wanted to give a brief overview of how two popular languages implement hashmaps. I’ve chosen these languages as both offer a single map type that works across a variety of key and values.

C++

The first language we’ll discuss is C++. The C++ Standard Template Library (STL) provides std::unordered_map which is usually implemented as a hashmap.

This is the declaration for std::unordered_map. It’s a template, so the actual values of the parameters depend on how the template is instantiated.

template<
    class Key,                             // the type of the key
    class T,                               // the type of the value
    class Hash = std::hash<Key>,            // the hash function
    class KeyEqual = std::equal_to<Key>,    // the key equality function
    class Allocator = std::allocator< std::pair<const Key, T> > 
> class unordered_map;

There is a lot here, but the important things to take away are;

The template takes the type of the key and value as parameters, so it knows their size.
The template takes a std::hash function specialised on the key type, so it knows how to hash a key passed to it.
And the template takes an std::equal_to function, also specialised on key type, so it knows how to compare two keys.

Now we know how the four properties of a hashmap are communicated to the compiler in C++’s std::unordered_map, let’s look at how they work in practice.

First we take the key, pass it to the std::hash function to obtain the hash value of the key. We mask and index into the bucket array, then walk the entries in that bucket comparing the keys using the std::equal_to function.

Java

The second language we’ll discuss is Java. In java the hashmap type is called, unsurprisingly, java.util.Hashmap.

In java, the java.util.Hashmap type can only operate on objects, which is fine because in Java almost everything is a subclass of java.lang.Object. As every object in Java descends from java.lang.Object they inherit, or override, a hashCode and an equals method.

However, you cannot directly store the eight primitive types; boolean, int, short, long, byte, char, float, and double, because they are not subclasss of java.lang.Object. You cannot use them as a key, you cannot store them as a value. To work around this limitation, those types are silently converted into objects representing their primitive values. This is known as boxing.

Putting this limitation to one side for the moment, let’s look at how a lookup in Java’s hashmap would operate.

First we take the key and call its hashCode method to obtain the hash value of the key. We mask and index into the bucket array, which in Java is a pointer to an Entry, which holds a key and value, and a pointer to the next Entry in the bucket forming a linked list of entries.

Tradeoffs

Now that we’ve seen how C++ and Java implement a Hashmap, let’s compare their relative advantages and disadvantages.

C++ templated `std::unordered_map`

Advantages

Size of the key and value types known at compile time.
Data structure are always exactly the right size, no need for boxing or indiretion.
As code is specialised at compile time, other compile time optimisations like inlining, constant folding, and dead code elimination, can come into play.

In a word, maps in C++ can be as fast as hand writing a custom map for each key/value combination, because that is what is happening.

Disadvantages

Code bloat. Each different map are different types. For N map types in your source, you will have N copies of the map code in your binary.
Compile time bloat. Due to the way header files and template work, each file that mentions a std::unordered_map the source code for that implementation has to be generated, compiled, and optimised.

Java util Hashmap

Advantages

One implementation of a map that works for any subclass of java.util.Object. Only one copy of java.util.HashMap is compiled, and its referenced from every single class.

Disadvantages

Everything must be an object, even things which are not objects, this means maps of primitive values must be converted to objects via boxing. This adds gc pressure for wrapper objects, and cache pressure because of additional pointer indirections (each object is effective another pointer lookup)
Buckets are stored as linked lists, not sequential arrays. This leads to lots of pointer chasing while comparing objects.
Hash and equality functions are left as an exercise to the author of the class. Incorrect hash and equals functions can slow down maps using those types, or worse, fail to implement the map behaviour.

Go’s hashmap implementation

Now, let’s talk about how the hashmap implementation in Go allows us to retain many of the benfits of the best map implementations we’ve seen, without paying for the disadvantages.

Just like C++ and just like Java, Go’s hashmap written in Go. But–Go does not provide generic types, so how can we write a hashmap that works for (almost) any type, in Go?

Does the Go runtime use interface{} ?

No, the Go runtime does not use interface{} to implement its hashmap. While we have the container/{list,heap} packages which do use the empty interface, the runtime’s map implementation does not use interface{}.

Does the compiler use code generation?

No, there is only one copy of the map implementation in a Go binary. There is only one map implementation, and unlike Java, it doesn’t use interface{} boxing. So, how does it work?

There are two parts to the answer, and they both involve co-operation between the compiler and the runtime.

Compile time rewriting

The first part of the answer is to understand that map lookups, insertion, and removal, are implemented in the runtime package. During compilation map operations are rewritten to calls to the runtime. eg.

v := m["key"]     → runtime.mapaccess1(m, ”key", &v)
v, ok := m["key"] → runtime.mapaccess2(m, ”key”, &v, &ok)
m["key"] = 9001   → runtime.mapinsert(m, ”key", 9001)
delete(m, "key")  → runtime.mapdelete(m, “key”)

It’s also useful to note that the same thing happens with channels, but not with slices.

The reason for this is channels are complicated data types. Send, receive, and select have complex interactions with the scheduler so that’s delegated to the runtime. By comparison slices are much simpler data structures, so the compiler natively handles operations like slice access, len and cap while deferring complicated cases in copy and append to the runtime.

Only one copy of the map code

Now we know that the compiler rewrites map operations to calls to the runtime. We also know that inside the runtime, because this is Go, there is only one function called mapaccess, one function called mapaccess2, and so on.

So, how can the compiler can rewrite this

v := m[“key"]

into this

 runtime.mapaccess(m, ”key”, &v)

without using something like interface{}? The easiest way to explain how map types work in Go is to show you the actual signature of runtime.mapaccess1.

func mapaccess1(t *maptype, h *hmap, key unsafe.Pointer) unsafe.Pointer

Let’s walk through the parameters.

key is a pointer to the key, this is the value you provided as the key.
h is a pointer to a runtime.hmap structure. hmap is the runtime’s hashmap structure that holds the buckets and other housekeeping values ⁷.
t is a pointer to a maptype, which is odd.

Why do we need a *maptype if we already have a *hmap? *maptype is the special sauce that makes the generic *hmap work for (almost) any combination of key and value types. There is a maptype value for each unique map declaration in your program. There will be one that describes maps from strings to ints, from strings to http.Headers, and so on.

Rather than having, as C++ has, a complete map implementation for each unique map declaration, the Go compiler creates a maptype during compilation and uses that value when calling into the runtime’s map functions.

type maptype struct {
         typ           _type
         key         *_type
         elem        *_type
         bucket        *_type // internal type representing a hash bucket 
        hmap          *_type // internal type representing a hmap
         keysize       uint8  // size of key slot
         indirectkey   bool   // store ptr to key instead of key itself
         valuesize     uint8  // size of value slot
         indirectvalue bool   // store ptr to value instead of value itself
         bucketsize    uint16 // size of bucket
         reflexivekey  bool   // true if k==k for all keys
         needkeyupdate bool   // true if we need to update key on overwrite 
}

Each maptype contains details about properties of this kind of map from key to elem. It contains infomation about the key, and the elements. maptype.key contains information about the pointer to the key we were passed. We call these type descriptors.

type _type struct {
         size       uintptr
         ptrdata    uintptr // size of memory prefix holding all pointers
         hash       uint32
         tflag      tflag
         align      uint8 
        fieldalign uint8
         kind       uint8
         alg       *typeAlg 
        // gcdata stores the GC type data for the garbage collector.
         // If the KindGCProg bit is set in kind, gcdata is a GC program. 
        // Otherwise it is a ptrmask bitmap. See mbitmap.go for details.
         gcdata    *byte
         str       nameOff 
        ptrToThis typeOff
 }

In the _type type, we have things like it’s size, which is important because we just have a pointer to the key value, but we need to know how large it is, what kind of a type it is; it is an integer, is it a struct, and so on. We also need to know how to compare values of this type and how to hash values of that type, and that is what the _type.alg field is for.

type typeAlg struct {
         // function for hashing objects of this type
         // (ptr to object, seed) -> hash
         hash func(unsafe.Pointer, uintptr) uintptr
         // function for comparing objects of this type
         // (ptr to object A, ptr to object B) -> ==? 
        equal func(unsafe.Pointer, unsafe.Pointer) bool 
}

There is one typeAlg value for each type in your Go program.

Putting it all together, here is the (slightly edited for clarity) runtime.mapaccess1 function.

// mapaccess1 returns a pointer to h[key].  Never returns nil, instead 
// it will return a reference to the zero object for the value type if 
// the key is not in the map.
 func mapaccess1(t *maptype, h *hmap, key unsafe.Pointer) unsafe.Pointer { 
        if h == nil || h.count == 0 {
                 return unsafe.Pointer(&zeroVal[0]) 
        } 
        alg := t.key.alg
         hash := alg.hash(key, uintptr(h.hash0))
         m := bucketMask(h.B)
         b := (*bmap)(add(h.buckets, (hash&m)*uintptr(t.bucketsize)))

One thing to note is the h.hash0 parameter passed into alg.hash. h.hash0 is a random seed generated when the map is created. It is how the Go runtime avoids hash collisions.

Anyone can read the Go source code, so they could come up with a set of values which, using the hash ago that go uses, all hash to the same bucket. The seed value adds an amount of randomness to the hash function, providing some protection against collision attack.

Conclusion

I was inspired to give this presentation at GoCon because Go’s map implementation is a delightful compromise between C++’s and Java’s, taking most of the good without having to accomodate most of the bad.

Unlike Java, you can use scalar values like characters and integers without the overhead of boxing. Unlike C++, instead of N runtime.hashmap implementations in the final binary, there are only N runtime.maptype values, a substantial saving in program space and compile time.

Now I want to be clear that I am not trying to tell you that Go should not have generics. My goal today was to describe the situation we have today in Go 1 and how the map type in Go works under the hood. The Go map implementation we have today is very fast and provides most of the benefits of templated types, without the downsides of code generation and compile time bloat.

I see this as a case study in design that deserves recognition.

Containers versus Operating Systems

What does a distro provide?

The most popular docker base container image is either busybox, or scratch. This is driven by a movement that is equal parts puritanical and pragmatic. The puritan asks “Why do I need to run init(1) just to run my process?” The pragmatist asks “Why do I need a 700 meg base image to deploy my application?” And both, seeking immutable deployment units ask “Is it a good idea that I can ssh into my container?” But let’s step back for a second and look at the history of how we got to the point where questions like this are even a thing.

In the very beginnings, there were no operating systems. Programs ran one at a time with the whole machine at their disposal. While efficient, this created a problem for the keepers of these large and expensive machines. To maximise their investment, the time between one program finishing and another starting must be kept to an absolute minimum; hence monitor programs and batch processing was born.

Monitors started as barely more than watchdog timers. They knew how to load the next program off tape, then set an alarm if the program ran too long. As time went on, monitors became job control–quasi single user operating systems where the operators could schedule batch jobs with slightly more finesse than the previous model of concatenating them in the card reader.⁸

In response to the limitations of batch processing, and with the help of increased computing resources, interactive computing was born. Interactive computing allowing multiple users to interact with the computer directly, time slicing, or time sharing, the resources between users to present the illusion of each program having a whole computer to itself.

“The UNIX kernel is an I/O multiplexer more than a complete operating system. This is as it should be.”
Ken Thompson, BSTJ, 1978

interactive computing in raw terms was less efficient than batch, however it recognised that the potential to deliver programs faster outweighed a less than optimal utilisation of the processor; a fact borne out by the realisation that programming time was not benefiting from the same economies of scale that Moore’s law was delivering for hardware. Job control evolved to became what we know as the kernel, a supervisor program which sits above the raw hardware, portioning it out and mediating access to hardware devices.

With interactive users came the shell, a place to start programs, and return once they completed. The shell presented an environment, a virtual work space to organise your work, communicate with others, and of course customise. Customise with programs you wrote, programs you got from others, and programs that you collaborated with your coworkers on.

Interactive computing, multi user systems and then networking gave birth to the first wave of client/server computing–the server was your world, your terminal was just a pane of glass to interact with it. Thus begat userspace, a crowded bazaar of programs, written in many languages, traded, sold, swapped and sometimes stolen. A great inter breeding between the UNIX vendors produced a whole far larger than the sum of its parts.

Each server was an island, lovingly tended by operators, living for years, slowly patched and upgraded, becoming ever more unique through the tide of software updates and personnel changes.

Skip forward to Linux and the GNU generation, a kernel by itself does not serve the market, it needs a user space of tools to attract and nurture users accustomed to the full interactive environment.

But that software was hard, and messy, and spread across a million ftp, tucows, sourceforge, and cvs servers. Their installation procedures are each unique, their dependencies are unknown or unmanaged–in short, a job for an expert. Thus distributions became experts at packaging open source software to work together as a coherent interactive userspace story.

Container sprawl

We used to just have lots of servers, drawing power, running old software, old operating systems, hidden under people’s desks, and sometimes left running behind dry wall. Along came virtualisation to sweep away all the old, slow, flaky, out of warranty hardware. Yet the software remained, and multiplied.

Vmsprawl, it was called. Now free from a purchase order and a network switch port, virtual machines could spawn faster than rabbits. But their lifespan would be much longer.

Back when physical hardware existed, you could put labels on things, assign them to operators, have someone to blame, or at least ask if the operating system was up to date, but virtual machines became ephemeral, multitudinous, and increasingly, redundant,

Now that a virtual machines’ virtual bulk has given way to containers, what does that mean for the security and patching landscape? Surely it’s as bad, if not worse. Containers can multiply even faster than VMs and at such little cost compared to their bloated cousins that the problem could be magnified many times over. Or will it?

The problem is maintaining the software you didn’t write. Before containers that was everything between you and the hardware; obviously a kernel, that is inescapable, but the much larger surface area (in recent years ballooning to a DVD’s girth) was the userland. The gigabytes of software that existed to haul the machine onto the network, initialise its device drivers, scrub its /tmp partition, and so on.

But what if there was no userland? What if the network was handled for you, truly virtualised at layer 3, not layer 1. Your volumes were always mounted and your local storage was fleeting, so nothing to scrub. What would be the purpose of all those decades of lovingly crafted userland cruft?

If interactive software goes unused, was it ever installed at all?

Immutable images

Netflix tells us that immutable images are the path to enlightenment. Built it once, deploy it often. If there is a problem, an update, a software change, a patch, or a kernel fix, then build another image and roll it out. Never change something in place. This mirrors the trend towards immutability writ large by the functional programming tidal wave.

While Netflix use virtual machines, and so need software to configure their (simulated) hardware and software to plumb their (simulated) network interfaces to get to the point of being able to launch the application, containers leave all these concerns to the host. A container is spawned with any block devices or network interfaces required already mounted or plumbed a priori.

So, if you remove the requirement, and increasingly, the ability, to change the contents of the running image, and you remove the requirement to prepare the environment before starting the application, because the container is created with all its facilities already prepared, why do you need a userland inside a container?

Debugging? Possibly.

Today there are many of my generation who would feel helpless without being about to ssh to a host, run their favourite (and disparate) set of inspection tools. But a container is just a process inside a larger host operating system, so do you diagnosis there instead. Unlike virtual machines, these are not black boxes, the host operating system has far more capability to inspect and diagnose a guest than the guest itself–so leave your diagnosis tools on the host. And your ssh daemon, for good measure.

Updates, updates. Updates!

Why do we have operating system distros? In a word, outsourcing.

Sure, every admin could subscribe to the mailing lists of all the software packages installed on the servers they maintain (you do know all the software installed on the machines you are responsible for, right?) and then download, test, certify, upgrade the software promptly after being notified. Sound’s simple. Any admin worth hiring should be able to do this.

Sure, assuming you can find an admin who wants to do this grunt work, and that they can keep up with the workload, and that they can service more than a few machines before they’re hopelessly chasing their tails.

No, of course not, we outsource this to operating system vendor. In return for using outdated versions of software, distros will centralise the triage, testing and preparation of upgrades and patches.

This is the reason that a distro and its package management tool of choice are synonymous. Without a tool to automate the dissemination, installation and upgrade of packaged software, distro vendors would have no value. And without someone to marshal unique snowflake open source software into a unified form, package management software would have no value.

No wonder that the revenue model for all open source distro vendors centers around tooling that automates the distribution of update packages.

The last laugh

Ironically, the last laugh in this tale may be the closed source operating system vendors. It was Linux and open source that destroyed the proprietary UNIX market after the first dot com crash.

Linux rode Moore’s law to become the server operating system for the internet, and made kings of the operating system distributors. But it’s Linux that is driving containers, at least in their current form, and Linux containers, or more specifically a program that communicates directly with the kernel syscall api inside a specially prepared process namespace, is defining the new normal for applications.

Containers are eating the very Linux distribution market which enabled their creation.

OSX and Windows may be relegated to second class citizens–the clients in the client/server or client/container equation–but at least nobody is asking difficult questions about the role of their userspace.

Whither distros

What is the future of operating system distributions? Their services, while mature, scalable, well integrated, and expertly staffed, will unfortunately be priced out of the market. History tells us this.

In the first dot com bust, companies retreated from expensive proprietary software, not because it wasn’t good, not because it wasn’t extensible or changeable, but because it was too expensive. With no money coming in, thousands of dollars of opex walking out the door in licence fees was unsustainable.

The companies that survived the crash, or were born in its wreckage, chose open source software. Software that was arguably less mature, less refined, at the time, but with a price tag that was much more approachable. They kept their investors money in the bank, rode the wave of hardware improvements, and by pulling together in a million loosely organised software projects created a free (as in free puppy) platform to build their services on top–trading opex for some risk that they may have no-one to blame if their free software balloon sprang a leak.

Now, it is the Linux distributors who are chasing the per seat or per cpu licence fees. Offering scaled out professional services in the form of a stream of software updates and patches, well tested and well integrated.

But, with the exception of the kernel–which is actually provided by the host operating system, not the container–all those patches and updates are for software that is not used by the container, and in the case of our opening examples, busybox and scratch. not present. The temptation to go it alone, cut out the distro vendors, backed by the savings in licence fees is overwhelming.

What can distros do?

What would you do if you woke up one day to find that you owned the best butchers shop in a town that had decided to become vegetarian en mass?

Go’s hidden #pragmas

This is an article about compiler directives; or as they are commonly known, pragmas. It’s derived from a talk of a similar name that I gave last year at GopherChina in Shanghai.

But first, a history lesson

Before we talk about Go, let’s talk a little about pragmas, and their history. Many languages have the notion of an attribute, or directive, that changes the way source code is interpreted during compilation. For example, Perl has the use function:

use strict;
use strict "vars";
use strict "refs";
use strict "subs";

use enable features, or makes the compiler interpret the source of the program differently, by making the compiler more pedantic or enabling a new syntax mode.

Javascript has something similar. ECMAScript 5 extended the language with optional modes, such as:

"use strict";

When the Javascript interpreter comes across the words "use strict"; it enables, so called, Strict Mode when parsing your Javascript source. ⁹

Rust is similar, it uses the attributes syntax to enable unstable features in the compiler or standard library.

#[inline(always)]
 fn super_fast_fn() { ... }

#[cfg(target_os = "macos")]
mod macos_only { ... }

The inline(always) attribute tells the compiler that it must inline super_fast_fn. The target_os attribute tells the compiler to only compile the macos_only module on OS X.

The name pragma comes from ALGOL 68, where they were called pragmats, which was itself shorthand for the word pragmatic. When they were adopted by C in the 1970’s, the name was shortened again to #pragma, and due to the widespread use of C, became fully integrated into the programmer zeitgeist.

#pragma pack(2)
 struct T {
    int i;
    short j;
double k; 
};

This example says to the compiler that the structure should be packed on a two byte boundary; so the double, k, will start at an offset of 6 bytes from the address of T, not the usual 8.

C’s #pragma directive spawned a host of compiler specific extensions, like gcc’s __builtin directive.

Does Go have pragmas?

Now that we know a little bit of the history of pragmas, maybe we can now ask the question, does Go have pragmas?

You saw earlier that #pragma, like #include and #define are implemented in C style languages with a preprocessor, but Go does not have a preprocessor, or macros, so, the question remains, does Go have pragmas?

It turns out that, yes, even though Go does not have macros, or a preprocessor, Go does indeed support pragmas. They are implemented by the compiler as comments.

Just to drive home the point, they’re actually called pragmas in the source of the Go compiler.

So, clearly the name pragma, along with the idea, isn’t going away.

This article focuses on a only a few of the pragmas that the compiler recognises, partly because the list changes frequently, but mostly because not all of them are usable by you as programmers.

Here are some examples to whet your appetite

//go:noescape 
func gettimeofday(tv *Timeval) (err Errno)

This is an example of the noescape directive on the gettimeofday stub from the syscall package.

//go:noinline
 func lshNop1(x uint64) uint64 { 
        // two outer shifts should be removed
         return (((x << 5) >> 2) << 2) 
}

This is an example of the noinline directive from a test fixture in the compiler tests.

//go:nosplit
 func atomicstorep(ptr unsafe.Pointer, new unsafe.Pointer) {
         writebarrierptr_prewrite((*uintptr)(ptr), uintptr(new))
         atomic.StorepNoWB(noescape(ptr), new) 
}

This is an example of the nosplit directive inside the runtime’s atomic support functions.

Don’t worry if this was all a bit quick, we’re going to explore these examples, and more, during the remainder of this article.

A word of caution ?

Before I continue, I want to offer a word of caution.

Pragmas are not part of the language. They might be implemented the gc compiler, but you will not find them in the spec. At a higher level, the idea of adding pragmas to the language caused considerable debate, especially after the first few established a precedent. In a debate about adding the //go:noinline directive Rob Pike opined,

“Useful” is always true for a feature request. The question is, does the usefulness justify the cost? The cost here is continued proliferation of magic comments, which are becoming too numerous already.
–Rob Pike

I’ll leave you to decide if adding pragmas to Go was a good idea or not.

As I mentioned earlier pragma directives are placed in Go comments with a precise syntax. The syntax has the general form:

//go:directive

The go: prefix can be replaced with another, so you can see that the Go team were at least considering future growth, even though they don’t encourage it. It’s also important to note that there is no space between the // and the go keyword. This is partly an accident of history, but it also makes it less likely to conflict with a regular comment.

Lastly, some of these directives require you to do one or more of the following:

import the unsafe package.
compile with the undocumented -+ flag.
be part of the runtime package.

If you get it wrong, your directive might be ignored, and in most cases you code will compile but might be slower or behave incorrectly.

//go:noescape

Enough with the preflight safety checks.

Early in Go’s life, the parts that went into a complete Go program would include Go code (obviously), some C code from the runtime, and some assembly code, again from the runtime or syscall package. The take away is it was expected that inside a package, you’d occasionally find functions which were not implemented in Go.

Now, normally this mixing of languages wouldn’t be a problem, except when it interacts with escape analysis. In Go it’s very common to do something like this,

func NewBook() (*Book) {
        b := Book{ Mice: 12, Men: 9 }
        return &b
}

That is, inside NewBook we declare and initialise a new Book variable b, then return the address of b. We do this so often inside Go it probably doesn’t sink in that if you were to do something like this in C, the result would be pernicious memory corruption as the address returned from NewBook would point to the location on the stack where b was temporarily allocated.

Escape analysis

Escape analysis identifies variables whose lifetimes will live beyond the lifetime of the function in which it is declared, and moves the location where the variable is allocated from the stack to the heap. Technically we say that b escapes to the heap.

Obviously there is a cost; heap allocated variables have to be garbage collected when they are no longer reachable, stack allocated variables are automatically free’d when their function returns. Keep that in mind.

func BuildLibrary() {
        b := Book{Mice: 99: Men: 3} 
        AddToCollection(&b) 
}

Now, lets consider a slightly different version of what we saw above. In this contrived example, BuildLibrary declares a new Book, b, and passes the address of b to AddToCollection. The question is, “does b escape to the heap?”

The answer is, it depends. It depends on what AddToCollection does with the *Book passed to it. If AddToCollection did something like this,

func AddToCollection(b *Book) {
        b.Classification = "fiction" 
}

then that’s fine. AddToCollection can address those fields in Book irrespective of if b points to an address on the stack or on the heap. Escape analysis would conclude that the b declared in BuildLibrary did not escape, because AddToCollection did not retain a copy of the *Book passed to it, and can therefore be allocated cheaply on the stack.

However, if AddToCollection did something like this,

var AvailableForLoan []*Book

func AddToCollection(b *Book) {
        AvailableForLoan = append(AvailableForLoan, b) 
}

that is, keep a copy of b in some long lived slice, then that will have an impact on the b declared in BuildLibrary. b must be allocated on the heap so that it lives beyond the lifetime of AddToCollection and BuildLibrary. Escape analysis has to know what AddToCollection does, what functions it calls, and so on, to know if a value should be heap or stack allocated. This is the essence of escape analysis.

os.File.Read

That was a lot of background, let’s get back to the //go:noescape pragma. Now we know that the call stack of functions affects whether a value escapes or not, consider this very common situation (error handling elided for brevity),

f, _ := os.Open("/tmp/foo")
 buf := make([]byte, 4096)
 n, _ := f.Read(buf)

We open a file, make a buffer, and we read into that buffer. Is buf allocated on the stack, or on the heap?

As we saw above, it depends on what happens inside os.File.Read. os.File.Read calls down through a few layers to syscall.Read, and this is where it gets complicated. syscall.Read calls down into syscall.Syscall to do the operating system call. syscall.Syscall is implemented in assembly. Because syscall.Syscall is implemented in assembly, the compiler, which works on Go code, cannot “see” into that function, so it cannot see if the values passed to syscall.Syscall escape or not. Because the compiler cannot know if the value might escape, it must assume it will escape.

This was the situation in issue 4099. If you wanted to write a small bit of glue code in assembly, like the bytes, md5, or syscall package, anything you passed to it would be forced to allocated on the heap even if you knew that it doesn’t.

package bytes
//go:noescape
 // IndexByte returns the index of the first instance of c in s, 
// or -1 if c is not present in s. 
func IndexByte(s []byte, c byte) int // ../runtime/asm_$GOARCH.s

So this is precisely what the //go:noescape pragma does. It says to the compiler, “the next function declaration you see, assume that none of the arguments escape.” We’ve said to the compiler; trust us, IndexByte and its children do not keep a reference to the byte slice.

In this example from Go 1.5 you can see that bytes.IndexByte is implemented in assembly ¹⁰. By marking this function //go:noescape, it will avoid stack allocated []byte slices escaping to the heap unnecessarily.

Can you use //go:noescape in your code?

Can you use //go:noescape in your own code? Yes, but it can only be used on the forward declarations.

package main

import "fmt"

//go:noescape
func length(s string) int // implemented in an .s file

func main() {
        s := "hello world"
        l := length(s)
        fmt.Println(l)
}

Note, you’re bypassing the checks of the compiler, if you get this wrong you’ll corrupt memory and no tool will be able to spot this.

//go:norace

Forking in a multithreaded program is complicated. The child process gets a complete, independent, copy of the parent’s memory, so things like locks, implemented as values in memory can become corrupt when suddenly two copies of the same program see locks in different state.

Fork/exec in the Go runtime is handled with care by the syscall package which coordinates to make sure that the runtime is in quiescent state during the brief fork period. However, when the race runtime is in effect, this becomes harder.

To spot races, when compiling in race mode, the program is rewritten so every read and write goes via the race detector framework to detect unsafe memory access. I’ll let the commit explain.

// TODO(rsc): Remove. Put //go:norace on forkAndExecInChild instead.
 func isforkfunc(fn *Node) bool {
         // Special case for syscall.forkAndExecInChild.
         // In the child, this function must not acquire any locks, because 
        // they might have been locked at the time of the fork. This means 
        // no rescheduling, no malloc calls, and no new stack segments. 
        // Race instrumentation does all of the above. 
        return myimportpath != "" && myimportpath == "syscall" &&
                fn.Func.Nname.Sym.Name == "forkAndExecInChild" 
}

As Russ’s comment shows above, the special casing in the compiler was removed in favour of a directive on the syscall.forkAndExecInChild functions in the syscall package.

// Fork, dup fd onto 0..len(fd), and exec(argv0, argvv, envv) in child. 
// If a dup or exec fails, write the errno error to pipe. 
// (Pipe is close-on-exec so if exec succeeds, it will be closed.)
 // In the child, this function must not acquire any locks, because 
// they might have been locked at the time of the fork. This means
 // no rescheduling, no malloc calls, and no new stack segments.
 // For the same reason compiler does not race instrument it. 
// The calls to RawSyscall are okay because they are assembly
 // functions that do not grow the stack.
 //go:norace 
func forkAndExecInChild(argv0 *byte, argv, envv []*byte, chroot, dir 
        *byte, attr *ProcAttr, sys *SysProcAttr, pipe int)
        (pid int, err Errno) {

This was replaced by the annotation //go:norace by Ian Lance Taylor in Go 1.6, which removed the special case in the compiler, however //go:norace is still only used in one place in the standard library.

Should you use //go:norace in your own code?

Should you use //go:norace in your own code? Using //go:norace will instruct the compiler to not annotate the function, thus will not detect any data races if they exist. This program contains a data race, which will not be reported by the race detector because of the //go:norace annotation.

package main

var v int

//go:norace
func add() {
        v++
}

func main() {
        for i := 0; i < 5; i++ {
                go add()
        }
}

Given the race detector has no known false positives, there should be very little reason to exclude a function from its scope.

//go:nosplit

Hopefully by now everyone knows that a goroutine’s stack is not a static allocation. Instead each goroutine starts with a few kilobytes of stack and, if necessary, will grow.

The technique that the runtime uses to manage a goroutine’s stack relies on each goroutine keeping track of its current stack usage. During the function preamble, a check is made to ensure there is enough stack space for the function to run. If not, the code traps into the runtime to grow the current stack allocation.

"".fn t=1 size=120 args=0x0 locals=0x80 
        0x0000 00000 (main.go:5)  TEXT    "".fn(SB), $128-0
         0x0000 00000 (main.go:5)  MOVQ    (TLS), CX
         0x0009 00009 (main.go:5)  CMPQ    SP, 16(CX)
         0x000d 00013 (main.go:5)  JLS     113

Now, this preamble is quite small, as we see it’s only a few instructions on x86.

A load from an offset of the current g register, which holds a pointer to the current goroutine.
A compare against the stack usage for this function, which is a constant known at compile time.
And a branch to the slow path, which is rare and easily predictable.

But sometimes even this overhead is unacceptable, and occasionally, unsafe, if you’re the runtime package itself. So a mechanism exists, via an annotation in the compiled form of the function to skip the stack check preamble. It should also be noted that the stack check is inserted by the linker, not the compiler, so it applies to assembly functions and, while they existed, C functions.

Up until Go 1.4, the runtime was implemented in a mix of Go, C and assembly.

// All reads and writes of g's status go through readgstatus, casgstatus 
// castogscanstatus, casfromgscanstatus.
 #pragma textflag NOSPLIT
 uint32 runtime·readgstatus(G *gp) {
         return runtime·atomicload(&gp->atomicstatus); 
}

In this example, runtime.readgstatus, we can see the C style #pragma textflag NOSPLIT. ¹¹

When the runtime was rewritten in Go, a way to say that a particular function should not have the stack split check was still required. This was often needed as taking a stack split inside the runtime was forbidden because a stack split implicitly needs to allocate memory, which would lead to recursive behaviour. Hence #pragma textflag NOSPLIT became //go:nosplit.

// All reads and writes of g's status go through 
// readgstatus, casgstatus, castogscanstatus,
 // casfrom_Gscanstatus.
 //go:nosplit 
funcreadgstatus(gp *g) uint32 {
         return atomic.Load(&gp.atomicstatus)
 }

But this leads to a problem, what happens if you run out of stack with //go:nosplit?

If a function, written in Go or otherwise, uses //go:nosplit to say “I don’t want to grow the stack at this point”, the compiler still has to ensure it’s safe to run the function. Go is a memory safe language, we cannot let functions use more stack than they are allowed just because they want to avoid the overhead of the stack check. They will almost certainly corrupt the heap or another goroutine’s memory.

To do this, the compiler maintains a buffer called the redzone, a 768 byte allocation ¹² at the bottom of each goroutines’ stack frame which is guaranteed to be available.

The compiler keeps track of the stack requirements of each function. When it encounters a nosplit function it accumulates that function’s stack allocation against the redzone. In this way, carefully written nosplit functions can execute safely against the redzone buffer while avoiding stack growth at inconvenient times.

This program uses nosplit to attempt to avoid stack splitting,

package main

type T [256]byte // a large stack allocated type

//go:nosplit
func A(t T) {
        B(t)
}

//go:nosplit
func B(t T) {
        C(t)
}

//go:nosplit
func C(t T) {
        D(t)
}

//go:nosplit
//go:noinline
func D(t T) {}

func main() {
        var t T
        A(t)
}

But will not compile because the compiler detects the redzone would be exhausted.

# command-line-arguments
main.C: nosplit stack overflow
        744     assumed on entry to main.A (nosplit)
        480     after main.A (nosplit) uses 264
        472     on entry to main.B (nosplit)
        208     after main.B (nosplit) uses 264
        200     on entry to main.C (nosplit)
        -64     after main.C (nosplit) uses 264

We occasionally hit this in the -N (no optimisation) build on the dashboard as the redzone is sufficient when optimisations are on, generally inlining small functions, but when inlining is disabled, stack frames are deeper and contain more allocations which are not optimised away.

Can you use //go:nosplit in your own code?

Can you use //go:nosplit in your own functions? Yes, I just showed you that you can. But it’s probably not necessary. Small functions would benefit most from this optimisation are already good candidates for inlining, and inlining is far more effective at eliminating the overhead of function calls than //go:nosplit.

You’ll note in the example above I showed I had to use //go:noinline to disable inlining which otherwise would have detected that D() actually did nothing, so the compiler would optimise away the entire call tree.

Of all the pragmas this one is the safest to use, as it will get spotted at compile time, and should generally not affect the correctness of your program, only the performance.

//go:noinline

This leads us to inlining.

Inlining ameliorates the cost of the stack check preamble, and in fact all the overheads of a function call, by copying the code of the inlined function into its caller. It’s a small trade off of possibly increased program size against reduced runtime by avoiding the function call overhead. Inlining is the key compiler optimisation because it unlocks many other optimisations.

Inlining is most effective with small, simple, functions as they do relatively little work compared to their overhead. For large functions, inlining offers less benefit as the overhead of the function call is small compared to the time spent doing work. However, what if you don’t want a function inlined? It turned out this was the case when developing the new SSA backend, as inlining would cause the nascent compiler to crash. I’ll let Keith Randall explain.

We particularly need this feature on the SSA branch because if a function is inlined, the code contained in that function might switch from being SSA-compiled to old-compiler-compiled. Without some sort of noinline mark the SSA-specific tests might not be testing the SSA backend at all.

The decision to control what can be inlined is made by a function inside the compiler called, ishairy. Hairy statements are things like closures, for loops, range loops, select, switch, and defer. If you wanted to write a small function that you do not want to be inlined, and don’t want the to add any overhead to the function, which of those would you use? It turns out, the answer is switch.

Prior to the SSA compiler, switch {} would prevent a function being inlined, whilst also optimising to nothing, and this was used heavily in compiler test fixtures to isolate individual operations.

func f3a_ssa(x int) *int {
         switch {
         }
         return &x
 }

With the introduction of the SSA compiler, switch was no longer considered hairy as switch is logically the same as a list of if ... else if statements, so switch{} stopped being a placeholder to prevent inlining. The compiler developers debated how to represent the construct “please don’t inline this function, ever”, and settled on a new pragma, //go:noinline.

Can you use //go:noinline in your own code?

Absolutely, although I cannot think of any reason to do so off hand, save silly examples like this article.

But what about …

But wait, there are many more pragmas that Go supports that aren’t part of this set we’re discussing.

+build is implemented by the Go tool, not the compiler, to filter files passed to the compiler for build or test

//go:generate uses the same syntax as a pragma, but is only recognised by the generate tool.

package pdf // import "rsc.io/pdf"

What about the canonical import pragma added in Go 1.4, to force the go tool to refuse to compile packages not imported by their “canonical” name

//line /foo/bar.go:123

What about the //line directive that can renumber the line numbers in stack traces?

Wrapping up

Pragmas in Go have a rich history. I hope the retelling of this history has been interesting to you.

The wider arc of Go’s pragmas is they are used inside the standard library to gain a foothold to implement the runtime, including the garbage collector, in Go itself. Pragmas allowed the runtime developers to extend, the language just enough to meet the requirements of the problem. You’ll find pragmas used, sparingly, inside the standard library, although you’ll never find them listed in godoc.

Should you use these pragmas in your own programs? Possibly //go:noescape is useful when writing assembly glue, which is done quite often in the crypto packages. For the other pragmas, outside demos and presentations like this, I don’t think there is much call for using them.

But please remember, magic comments are not part of the language spec, if you use GopherJS, or llgo, or gccgo, your code will still compile, but may operate differently. So please use this advice sparingly.

Caveat emptor.

If aligned memory writes are atomic, why do we need the sync/atomic package?

This is a post inspired by a question on the Go Forum. The question, paraphrased, was “If properly aligned writes are guaranteed to be atomic by the processor, why does the race detector complain?”

The answer is, there are two uses of the word atomic in play here. The first, the one the OP references, is a property of most microprocessors that, as long as the address of the write is naturally aligned–if it’s a 32-bit value, say, then it is always written to an address which is a multiple of four–then nothing will observe a half written value.

To explain what that means, consider the opposite, an unaligned write where a 32-bit value is written to an address whose bottom two bits are not zero. In this case the processor has to split the write into two, spanning the boundary. This is known as a torn write as an observer on the bus could see this partially updated value.¹³

These words comes from a time before multiple processors were common. At that time the observers of a torn read or write would most likely be other agents on the ISA, VESA, or PCI bus like disk controllers or video cards. However, we now live in the multi-core age so we need to talk about caches and visibility.

Since almost the beginning of computing, the CPU has run faster than main memory. That is to say, the performance of a computer is strongly related to the performance of its memory. This is known as the processor/memory gap. To bridge this gap processors have adopted caches which store recently accessed memory in a small, fast, store, closer to the processor.¹⁴ Because caches also buffer writes back to main memory, while the property that an aligned address will be atomic remains, when that write occurs has become less deterministic.¹⁵ This is the domain of second use of the word atomic, the one implemented by the sync/atomic package.

In a modern multiprocessor system, a write to main memory will be buffered in multiple levels of caches before hitting main memory. This is done to to hide the latency of main memory, but in doing so it means that communicating between processors using main memory is now imprecise; a value read from memory may have already been overwritten by one processor, however the new value has not made its way through the various caches yet.

To solve this ambiguity you need to use a memory fence, also known as a memory barrier. A memory write barrier operation tells the processor that it has to wait until all the outstanding operations in its pipeline, specifically writes, have been flushed to main memory. This operation also invalidates the caches

¹⁶

held by other processors, forcing them to retrieve the new value directly from memory. The same is true for reads, you use a memory read barrier to tell the processor to stop and synchronise with any outstanding writes to memory.

In terms of Go, read and write memory barrier operations are handled by the sync/atomic package, specifically the family of atomic.Load and atomic.Store functions respectively.

¹⁷

In answer to the OP’s question: to safely use a value in memory as a communication channel between two goroutines, the race detector will complain unless the sync/atomic package is used.

What Have We Learned from the PDP-11?

This post is a slightly edited version of my November presentation to the San Francisco chapter of Papers We Love.

C. G. Bell, W. D. Strecker, “Computer What Have We Learned from the PDP-11,” The 3rd Annual Symposium on Computer Architecture Conference Proceedings, pp. l-14, 1976.

The paper I have chosen tonight is a retrospective on a computer design. It is one of a series of papers by Gordon Bell, and various co-authors, spanning the design, growth, and eventual replacement of the companies iconic line of PDP-11 mini computers.

C. G. Bell, R. Cady, H. McFarland, B. Delagi, J. O’Laughlin, R. Noonan and W. Wulf, “A New Architecture for Mini-Computers – The DEC PDP- 11,” Proceedings of the Sprint Joint Computer Conference, pp. 657-675, AFIPS Press, 1970.
C. G. Bell, W. D. Strecker, “Computer What Have We Learned from the PDP-11,” The 3rd Annual Symposium on Computer Architecture Conference Proceedings, pp. l-14, 1976.
W. D. Strecker, “VAX-11/780: A Virtual Address Extension to the DEC PDP-11 Family,” Proceedings of the National Computer Conference, pp. 967-980, AFIPS Press, 1978.
C. G. Bell, W. D. Strecker, “Retrospective: what have we learned from the PDP-11—what we have learned from VAX and Alpha”, Proceedings of the 25th Annual International Symposium on Computer Architecture, pp. 6-10, 1998.

This year represents the 60th anniversary of the founding of the company that produced the PDP-11. It is also 40 years since this paper was written, so I thought it would be entertaining to review Bell’s retrospective through the lens of our own 20/20 hindsight.

Image credit: saccade.com

Who were Digital Equipment Corporation?

To set the scene for this paper, first we should talk a little about the company that produced the PDP-11, the Digital Equipment Corporation of Maynard, Massachusetts. Better known as DEC.

Image Credit: boston.com

DEC was founded in 1957 by Ken Olsen, and Harlan Anderson. Olsen and Anderson had worked together at MIT’s Lincoln Laboratory where they noticed that students would queue for hours to use the TX-0, an experimental interactive computer designed by Wes Clarke¹⁸.

Image credit: computerhistory.org

This is the TX-0. What do you notice about it?

Image credit: computerhistory.org

Let’s compare it to something like the IBM 704, a contemporary mainframe, which the MIT students largely ignored¹⁹. What Olsen and Anderson recognised was the desire for an interactive computer experience was so strong that there was a market for “small” computers dedicated to this role.

Image credit: computer-history.info

DEC’s first offering was the PDP-1, effectively a commercial version of the TX-0.

Correction: Michael Cheponis, the lead on the PDP-1 restoration project, kindly wrote to me with the following information.

Whirlwind and PDP-1 have 5 bit operation codes, TX-0 started with a 2 or 3 bit operation code, but gradually grew more after it lost the large memory to TX-2. A comparison of the PDP-1 and Whirlwind order codes makes it clear that the PDP-1 is a cost reduced and slightly improved version of the Whirlwind architecture.

Improvement: add indirect addressing.

Cost reduction: eliminate the live register by adding the subroutine call instructions, and eliminate the shift counter by replacing integer multiply and divide with multiply step and divide step instructions and making the shift instructions shift once for each bit in the address field.The biggest cost reduction was due to taking advantage of the advancement in logic design and packaging from Whirlwind to the PDP-1. (Using the Barta building for packaging versus 4 standard electronics racks.)

The contrast between the PDP-1 and the IBM 704 was visible at MIT, but the small, slow interactive computers, the Librascope LGP-30 and the Bendix G-15 preceded the PDP-1 by several years and sold in similar quantities.

It’s also worth noting that the name PDP is an acronym for “Programmed Data Processor”, as at the time, computers had a reputation of being large, complicated, and expensive machines, and DEC’s venture capitalists would not support them if they built a “computer”²⁰.

Following the success of the PDP-1, DEC’s offerings blossomed into several families of computers, many of which were designed at least in part by Gordon Bell, the author of tonight’s paper.

Introduction

A computer is not solely determined by its architecture; it reflects the technological, economic, and human aspects of the environment in which it was designed and built. […] The finished computer is a product of the total design environment.

Right from the get go, Bell is letting us know that the success of any computer project is not abstractly building the best computer but building the right computer, and that takes context.

In this chapter, we reflect on the PDP-11: its goals, its architecture, its various implementations, and the people who designed it. We examine the design, beginning with the architectural specifications, and observe how it was affected by technology, by the development organization, the sales, application, and manufacturing organizations, and the nature of the final users.

By this time, 1976, Bell has been the VP of engineering at DEC for nearly four years. It’s clear that he’s considering the success of the PDP-11 in the wider context of the market which it was both developed to serve, and which would later influence the evolution of the PDP-11 line.

In keeping with the spirit of Bell’s words, this presentation focuses on two aspects of the paper; the technology and the people.

Background: Thoughts Behind The Design

Bell opens with this observation

It is the nature of computer engineering to be goal-oriented, with pressure to produce deliverable products. It is therefore difficult to plan for an extensive lifetime.

This is your agile mindset, right here. This is 25 years before Snowbird and the birth of the Agile manifesto. When this was written DEC weren’t a scrappy startup desperate to make their bones, they were an established company with several successful product lines in the market, and Bell is talking about the pressure to ship a minimum viable product trampling all over notions of being able to plan elaborately.

Like the IBM/360, the PDP-11 was designed not just as a single model of computer, but a range of models, for which software written for a small PDP-11 would be compatible with the larger one.

“The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e., the conceptual structure and functional behaviour, as distinct from the organisation of the data flow and controls, the logical design, and the physical implementation.” — G. M. Amdahl G. A. Blaauw and F. P. Brooks Jr. Architecture of the IBM System/360, 1964

Because of the open nature of the PDP-11, anything which interpreted the instructions according to the processor specification, was a PDP-11, so there had been a rush within DEC, once it was clear that the PDP-11 market was heating up, to build implementations; you had different groups building fast, expensive ones and cost reduced slower ones.

Despite its evolutionary planning, the PDP-11 has been quite successful in the marketplace: over 20,000 have been sold in the six years that it has been on the market (197Cb1975). It is not clear how rigorous a test (aside from the marketplace) we have given the design, since a large and aggressive marketing organization, armed with software to correct architectural inconsistencies and omissions, can save almost any design.

Here Bell is introducing his hypothesis for the paper; was the PDP-11 a great design, or was it simply the beneficiary of a hyperactive marketing department? To answer his own question Bell begins to reflect on the product that emerged and evaluate it against the design criteria that he and his fellow authors identified six years earlier.

Address space

The first weakness of minicomputers was their limited addressing capability. The biggest (and most common) mistake that can be made in a computer design is that of not providing enough address bits for memory addressing and management.

Minicomputers of the era often came with a 12 bit address space, offering just 4096 addresses each holding a 12 bit value called a word.

It’s worth taking an aside to note that the word minicomputer, which later came to be understood to be an indication of their physical size, was originally a contraction of the words minimal computer. The canonical example was the PDP-8, DEC’s previous offering, which offered just eight instructions.

Image credit: wikipedia

The reason for this tiny address spaces was cost. Memory was extremely expensive in the 60’s and early 70’s as each bit of storage consisted of a tiny iron doughnut, or core, woven into a mesh of control wires.

Cores were arranged into a plane, in this case representing 4096 bits, then stacked to produce words, so a 4096 word memory contained 16 million cores, all of which were at least partially hand assembled. You can see why memory was expensive.

Image credit: bitsavers.org

Bell and the other PDP-11 designers knew that core memory prices were continuing to fall, and that semiconductor memory, while not cost effective at that time, would continue to drive down the price per bit of storage. Thus the amount of memory that customers could afford to buy would increase over time as “… users tend to buy “constant dollar” systems”. But alas,

The PDP-11 followed this hallowed tradition of skimping on address bits, but it was saved by the principle that a good design can evolve through at least one major change.

Even with the designers foresight, Bell noted that not two years after its introduction, the PDP-11 had to be redesigned to include a memory management unit to allow access to a larger 18 bit address space at the cost of increased programming complexity. A few years later a further 4 bits were added.

In fairness, while Bell chastised himself for not seeing this coming, this pattern of insufficient address bits continues to this day. Who remembers the dos 640k limit? Who remembers fighting with himem.sys? Who has struggled with 32 bit programs that need more than 2gb of heap?

So, none of us are perfect.

Insufficient registers

A second weakness of minicomputers was their tendency not to have enough registers. This was corrected for the PDP-11 by providing eight 16-bit registers. Later, six 32-bit registers were added for floating-point arithmetic. […] More registers would increase the multiprogramming context switch time and confuse the user.

It was not uncommon, even for mainframes of the day, to offer only a single register—the accumulator. If additional registers were provided, they would be specifically for use in indexing operations, they were not general purpose.

It’s also interesting to note Bell’s concern that additional registers would confuse the user. In the early 1970’s the assumption that the machine would be programmed directly in assembly was still the prevailing mindset.

There is a strong interplay between the number of registers in an architecture, the number of address bits, and the size of the instruction. All of these factors are rooted in the scarcity of memory, so it’s worth taking a little detour into instruction set design.

In Von Neumann machines (of which almost all computers of the 60’s were) the program and its data share the same, limited, address space, so inefficient programs didn’t just waste computing time, they wasted memory. A slow program is somewhat tolerable, but a program that is too large to fit in memory is a fatal condition, so you want your instruction encoding to be as efficient as possible.

Let’s consider the very common case of moving an value from one location in memory to another. How many bits would you need to describe that operation? One implementation might be:

MOV <addr> <addr>

Which would take 16 bits for the source, and a further 16 for the destination, and the some bits to encode the MOV instruction itself. Let’s call it 40 bits; this is not a multiple of 16, which would mean a complicated 2.5 word instruction encoding. However, what if we were to load that address into a register?

MOV (R0), (R1)

Then we’d only need to describe the register, the PDP had 8 registers, so only 3 bits per register, plus some for the operator. This could easily fit into a 16 bit word, rather than requiring a complex variable length encoding.

Image credit: bitsavers.org

It actually turns out that the PDP-11 uses 6 bits per register, but that still left 4 bits for the operation when two operands were present.

Hardware stack

A third weakness of minicomputers was their lack of hardware stack capability. In the PDP-11, this was solved with the autoincrement/autodecrement addressing mechanism. This solution is unique to the PDP-11 and has proven to be exceptionally useful. (In fact, it has been copied by other designers.)

Nowadays it’s hard to imagine hardware that doesn’t have a notion of a stack, but consider that a stack isn’t important if you don’t need recursion.

The design for the PDP-11 was laid down in 1969 and if we look at the programming languages of the time, FORTRAN and COBOL, neither supported recursive function calls. The function call sequence would often store the return address at a blank word at the start of the procedure making recursion impossible.

Image credit: bitsavers.org

The PDP-11 defined a stack pointer as we understand today, a register that controlled the operation of PUSH and POP style instructions, but the PDP-11 went one better and permitted any register to operate as a stack pointer by the addition of an auto increment/decrement modifier on all register operands.

For example, this single instruction:

MOV R4, -(R6)

will decrement the value stored in R6 by two, then store the value in R4 into the address stored in R6. This is the how you push a value onto the stack in PDP-11 assembler. If anyone has done any ARM programming, this will be very familiar to you.

This means there is no need for a dedicated PUSH or POP instruction, saving instruction encoding space, and allowing any register to be used as a stack pointer, although traditionally R6 was assumed by the hardware if you used the native subroutine call instruction.

Interrupt latency

A fourth weakness, limited interrupt capability and slow context switching, was essentially solved with the device of UNIBUS interrupt vectors, which direct device interrupts.

At this point in DEC’s lifetime, almost all of its products, save the PDP-10 which was DEC’s mainframe offering, were aimed at interactive, laboratory, or process control applications. Interrupt responsiveness, the delay between an interrupt signal being raised, and the computer being ready to process the interrupt, is key to real time performance.

The PDP-11 addressed this by permitting the device that raised the interrupt to supply the address to service the interrupt. Bell proudly reported that:

The basic mechanism is very fast, requiring only four memory cycles from the time an interrupt request is issued until the first instruction of the interrupt routine begins execution.

Character handling

A fifth weakness of prior minicomputers, inadequate character-handling capability, was met in the PDP-11 by providing direct byte addressing capability.

Strings and character handling were of increasing importance during the 1960’s as scientific and business computing converged. The predominant character encodings at the time were 6 bit character sets which provided just enough space for upper case letters, the digits 0 to 9, space, and a few punctuation characters sufficient for printing financial reports.

DEC Sixbit encoding

Because memory was so expensive, placing one 6 bit character into a 12 or 18 bit word was simply unacceptable so characters would be packed into words.

This proved efficient for storage, but complex for operations like move, compare, and concatenate, which had to account for a character appearing in the top or bottom of the word, expending valuable words of program storage to cope.

The problem was addressed in the PDP-11 by allowing the machine to operate on memory as both a 16-bit word, and the increasingly popular 8-bit byte. The expenditure of 2 additional bits per character was felt to be worth it for simpler string handling, and also eased the adoption of the increasingly popular 7-bit ASCII standard of which DEC were a proponent at the time. Bell concludes this point with the throw away line:

Although string instructions are not yet provided in the hard- ware, the common string operations (move, compare, concatenate) can be programmed with very short loops.

And indeed they can. One can write a string copy routine using two instructions, assuming that the source and destination are already in registers.

loop:   MOVB (src)+, (dst)+
        BNE loop

The routine takes full advantage of the fact that MOV updates the processor flag. The loop will continue until the value at the source address is zero, at which point the branch will fall through to the next instruction. This is why C strings are terminated with zeros.

Read only memory

A sixth weakness, the inability to use read-only memories, was avoided in the PDP-11. Most code written for the PDP-11 tends to be pure and reentrant without special effort by the programmer, allowing a read-only memory (ROM) to be used directly.

In process control applications, where the program is relatively fixed, having to load the program each time from magnetic or punched tape is expensive. You have to buy and maintain infrequently used I/O devices. It would be far more convenient if the program could always be present in the computer at startup. However, because of the extreme memory shortage of early minicomputers, and the lack of notion of a hardware stack, self modifying code was often unavoidable, which seriously limited the use of read only memory. Bell is justifiably proud that the PDP-11 design knocked that one out of the park.

Primitive I/O Capabilities

A seventh weakness, one common to many minicomputers, was primitive I/O capabilities.

During the late 60’s when the PDP-11 was being designed, input/output was very expensive. Mainframes of the time use a model called channel I/O, where the main CPU sent a small program to a channel controller, which would execute the program and report the result. The program would usually instruct a tape drive to load a record, or a punch to punch a card.

Channel I/O was important because it allowed the mainframe to offload the oversight of the I/O operation to another processor, freeing its valuable cycles, and permitting overlapping I/O, which in turn increased processor utilisation. The downside was channel I/O required a smaller CPU inside each channel controller which increased the cost of the installation substantially.

In the minicomputer world, I/O was usually performed directly by the CPU, usually with specialised instructions hard coded for a specific device, eg. the paper tape reader or console printer.

The PDP-11 introduced something unique, memory mapped I/O. This isn’t memory mapped I/O that you might be used to with the mmap(2) system call, but rather the convention that specific addresses in memory weren’t just dumb storage, instead their contents were mapped onto cards plugged into the backplane, which DEC called the UNIBUS.

Image credit: bitsavers.org

For example, a value written to 777566²¹ would be written to device attached to the console, usually a hard copy terminal.

Image credit: bitsavers.org

If you read the value at address 777570, you get the value entered on the front panel switches. This was often used as an early form of bootstrap configuration.

Image credit: bitsavers.org

Similarly talking to the RK05 disk drive was accomplished by writing the sector number you wanted to access to 777412, the address to transfer that sector to address 777410, and the number of words to 777406. Then by setting bit zero at address 777404 to 1 (the GO bit), the drive will transfer the number of words you asked for directly into memory.

Image credit: bitsavers.org

My favorite has got to be the KW11-L line clock. A write to bit 6 of address 777546 starts an interrupt firing every 20ms. What’s special about 20ms? Because that’s the reciprocal of the frequency of the AC waveform here in the US²². Yup, just like a bedside clock, the PDP-11 told time by counting the number of power line cycles.

High programming costs

A ninth weakness of minicomputers was the high cost of programming them. Many users program in assembly language, without the comfortable environment of editors, file systems, and debuggers available on bigger systems. The PDP-11 does not seem to have overcome this weakness, although it appears that more complex systems are being built successfully with the PDP-11 than with its predecessors, the PDP-8 and PDP-15.

Because of their minimal nature, mini computers were not pleasant environments on which to write programs. Often this would involve tedious switch flipping, or perhaps editing and assembling a program on another, larger, computer and then transferring it via paper tape.

This is very much akin to how those of us who work with micro controllers are still programming today; editing on a large workstation, compiling a target hex file, then transferring that hex to the micro controller’s flash storage.

Image credit: Dennis Ritchie

However, it seems Bell was unaware of the work of Thompson and Ritchie, who were busy constructing their own programming environment on their PDP-11 in New Jersey.

People: Builders of the Design

Now we move to the second section; the people.

As an amateur historian this section is the most interesting to me, because the study of the history of computing, or really any historical subject, is fundamentally a study of people, and the context surrounding the decisions they made.

Bell recognises that while computers are built from technology, they are built by people and thus he dedicates this section to describing the group dynamics at DEC during the development of the PDP-11.

The problems faced by computer designers can usually be attributed to one of two causes: inexperience or second-systemitis.

Here Bell recalls the words of Fred Brooks from his recent (at the time) book The Mythical Man-Month.

Brooks, the lead for the OS/360 project struggled for years to build a single general purpose operating system that would run across the entire range of IBM/360 modes, this was after all the goal of the 360 project. Brooks’ words must have been fresh in Bell’s mind when he wrote this paragraph.

Chronology of the design

This section is a perfect window into the operation of DEC in the late 1960’s

The internal organization of DEC design groups has through the years oscillated between market orientation and product orientation. Since the company has been growing at a rate of 30 to 40% a year, there has been a constant need for reorganization. At any given time, one third of the staff has been with the company less than a year.

Raise your hand if this sounds familiar to you.

At the time of the PDP-11 design, the company was structured along product lines. The design talent in the company was organized into tight groups: the PDP-10 group, the PDP-15 (an 18-bit machine) group, the PDP-8 group, an ad hoc PDP-8/S subgroup, and the LINC-8 group. Each group included marketing and engineering people responsible for designing a product, software and hardware. As a result of this organization, architectural experience was diffused among the groups, and there was little understanding of the notion of a range of products.

Here Bell spends some time iterating through each group, listing their strengths and weaknesses as sponsors for the PDP-11. I won’t recount all the choices, with one exception.

The PDP-10 group was the strongest group in the company. They built large powerful time-shared machines. It was essentially a separate division of the company, with little or no interaction with the other groups. Although the PDP-10 group as a whole had the best understanding of system architectural controls, they had no notion of system range, and were only interested in building higher-performance computers.

Having recently worked for a software company where one or two of the oldest products made almost all the company profit I have some sympathy for Bell’s position. The PDP-10 was DEC’s version of a mainframe; enormously powerful, but was only available at one price point.

The first design work for a 16-bit computer was carried out under the eye of the PDP-15 manager, a marketing person with engineering background. This first design was called PDP-X, and included specification for a range of machines. As a range architecture, it was better designed than the later PDP-11, but was not otherwise particularly innovative. Unfortunately, this group managed to convince management that their design was potentially as complex as the PDP-10 (which it was not), and thus ensured its demise, since no one wanted another large computer unrelated to the company’s main large computer.

And here Bell teaches us an important lesson; when your competition is in the same reporting chain, they have effective tools to ensure your project is killed before it reaches the market.

In retrospect, the people involved in designing PDP-X were apparently working simultaneously on the design of Data General.

This shade might go unnoticed by the casual reader, but it’s a reference to a defection to rival that of Shockely’s traitorous eight a decade earlier²³.

Edson de Castro, the product manager of the PDP-8, and lead on the PDP-X project had left DEC, along with several of his team to form Data General. The record is not clear if de Castro left because the PDP-X was canceled or if his departure was the final straw for the faltering project. In either case, the result was obvious, as Bell writes.

As the PDP-X project folded, the DCM (Desk Calculator Machine, a code name chosen for security) was started. Design and planning were in disarray, as Data General had been formed and was competing with the PDP-8, using a very small 16-bit computer.

Data General were now competing with DEC with their 16 bit Nova in the space that the PDP-8 had defined and de Castro knew like the back of his hand; rack mounted laboratory equipment.

Image credit: vintagecomputer.net

The 12 bit PDP-8 versus Data General’s 16 bit Nova

By MBlairMartin (Own work) [CC BY-SA 4.0], via Wikimedia Commons

The PDP-11: An Evaluation

The last section of the paper, having evaluated the PDP-11 against its predecessors, Bell proceeds to evaluate the PDP-11 against itself. The biggest take away was the UNIBUS.

In general, the UNIBUS has behaved beyond all expectations. Several hundred types of memories and peripherals have been interfaced to it; it has become a standard architectural component of systems in the $3K to $100K price range (1975).

What is the UNIBUS?

Image credit: wikipedia

The earliest commercial computers were designed with plugin modules connected by a wired backplane. In the days of vacuum tubes this was a necessity because of the unreliability of tubes and the need to replace modules quickly.

Image credit: wikipedia

Later a desire to build computers out of a standardized modules resulted in the generalised logic blocks interconnected by a sophisticated backplane. This is an example of DEC’s early flip-chip offerings.

Image credit: piercefuller.com

You’d mount them on a complex wire wrapped backplane to build a computer, in this case a PDP-8.

The UNIBUS represented an evolution of the previous DEC designs as an abstraction of an idealised control plane. The availability of medium scale integrated components moved the complexity from the backplane to the modules that populated it. This in turn created a standard way for additional modules to be attached to the computer.

The UNIBUS, as a standard, has provided an architectural component for easily configuring systems. Any company, not just DEC, can easily build components that interface to the bus. Good buses make good engineering neighbours, since people can concentrate on structured design. Indeed, the UNIBUS has created a secondary industry providing alternative sources of supply for memories and peripherals. With the exception of the IBM 360 Multiplexor/Selector bus, the UNIBUS is the most widely used computer interconnection standard.

Prior to the UNIBUS, the I/O devices a minicomputer could support were dictated by the designers. You almost had to encode how to talk to them into the CPU logic. After the UNIBUS, the field for end user customisation and experimentation was cracked wide open.

What have we learned from the PDP-11?

Bell’s retrospective ended when the paper was written in 1976/77, but from our vantage point, forty years later, the impact that the PDP-11 has been huge.

RISC

First of all, while the PDP-11 was not designed, or even understood to be a RISC processor–that term would not be coined util 1976 by John Cocke and the IBM 801. However to anyone with experience with processors like the ARM, a contemporary RISC microprocessor, the similarities between the two instruction sets are striking. Just as programming language design is a process of evolution and cultural poaching, so too it seems is instruction set design.

The PDP-11 also drove a stake firmly through the heart of dedicated I/O instructions, it cemented the model of memory mapped I/O as the predominant control mechanism to this day. The only processors post the PDP that offered seperate input/output instructions that I can think of were the Intel 8080, and its cousin, the Z80.

UNIX

Next is the PDP-11’s impact on software and operating systems. The PDP-11 is the machine on which Ken Thompson and Dennis Ritchie developed UNIX at Bell Labs²⁴.

Before the PDP-11, there was no UNIX. Before the PDP-11, there was no C, this is the computer that C was designed on. If you want to know why the classical C int is 16 bits wide, it’s because of the PDP-11.

UNIX bought us ideas such as pipes, everything is a file, and interactive computing.

VAX-11/780

In interactive computing, memory usage is effectively unbounded, and while the PDP was perfect for process control applications, this demand for interactive computing was the hallmark and driver for the PDP-11’s replacement.

The year this paper was released, 1977, the PDP-11’s successor, the VAX-11, which stood for “virtual address extension”–you can see Bell was not going to address space mistake again–was released.

BSD

UNIX, which had arrived at Berkley in 1974 aboard a tape carried by Ken Thompson, would evolve into the west coast flavoured Berkley Systems Distribution.

Berkeley UNIX had been ported to the VAX by the start of the 1980’s and was thriving as the counter cultural alternative to DEC’s own VMS operating system. Berkeley UNIX spawned a new generation of hackers who would go on to form companies like Sun micro systems, and languages like Self, which lead directly to the development of Java.

UNIX was ported to a bewildering array of computer systems during the 80’s and the fallout from the UNIX wars gave us the various BSD operating systems who continue to this day.

4BSD, a descendant of the original Berkeley distribution became the basis of the operating system for Steve Job’s NeXT line of computers. And when Apple purchased NeXT in 1997, NextSTEP and its BSD derived user space, became the foundations for Darwin, OSX, and iOS.

Windows NT

As we say earlier with Edson de Castro, DEC was no stranger to breakups.

Dave Cutler, the architect of the VAX VMS operating system, after a failed attempt to start a new combined operating system and hardware project, designed to succeed the VAX, decamped to Microsoft in 1988 bringing with him his team and lead the development of Windows NT. Those with a knowledge of Windows’ internals and VMS will perhaps spot the similarities.

Xerox Alto

To close the loop on the de Castro story, the Data General Nova series provided inspiration to Charles Thacker and Butler Lampson, the designers of the Xerox Alto, which itself was the fabled inspiration for the look and feel of the Apple Macintosh.

Data General Nova

The rivalry between Data General and DEC continued into the 32 bit era, the story of which is told in Tracy Kidder’s 1981 Pulitzer winner, The Soul of a New Machine.

What have we learned from the PDP-11?

While its development was sometimes chaotic, and not without its flaws, the PDP-11 is at the intersection of many threads of history.

Hardware, software, programming languages, operating systems, have all been influenced by the PDP-11. I wager there is not a single person in this room who cannot trace the lineage of the language they work with, the computer they use, or the operating system it runs, back to the PDP-11.

And that is worth celebrating.

Never edit a method, always rewrite it

At a recent RubyConf, Chad Fowler presented his ideas for writing software systems that mirror the process of continual replacement observed in biological systems.

The first principal of this approach is, unsurprisingly, to keep the components of the software system small–just as complex organisms like human beings are constituted from billions of tiny cells which are constantly undergoing a process of renewal.

Following from that Fowler proposed this idea:

What would happen if you had a rule on your team that said you never edit a method after it was written, you only rewrote it again from scratch?
Chad Fowler

Fowler quickly walked back this suggestion as possibly not a good idea, nevertheless the idea has stuck in my head all day. What would happen if we developed software this way? What benefits could it bring?

Would it have benefits for software reuse? Opening up a method to add another branch condition or switch clause would become more expensive, and having rewritten the same function over and over again, the author might be tempted to make it more generalisable over a class of problems.
Would it have an impact on function complexity? If you knew that changing a long, complex, function required writing it again from scratch, would it encourage you to make it smaller? Perhaps you would pull non critical setup or checking logic into other functions to limit the amount you had to rewrite.
Would it have an impact on the tests you write? Some functions are truly complex, they contain a core algorithm that can’t be reduced any further. If you had to rewrite them, how would you know you got it right? Are there tests? Do they cover the edge cases? Are there benchmarks so you could ensure your version ran comparably to the previous?
Would it have an impact on the name you chose? Is the name of the current function sufficient to describe how to re-implement it? Would a comment help? Does the current comment give you sufficient guidance?

I agree with Fowler that the idea of immutable source code is likely unworkable. But even if you never actually followed this rule in practice, what would be the impact on the quality, reliability, and usability of your programs if you always wrote your functions with the mindset of it being immutable?

Note: Fowler talks about methods, because in Ruby, everything is a method. I prefer to talk about functions, because in Go, methods are a syntactic sugar over functions. For the purpose of this article, please treat functions and methods as interchangable.

Please, vote Yes for marriage equality in Australia

I wanted to write a few words about the postal survey on marriage law currently underway in Australia.

As an Australian, our country and our government do so many things that make me ashamed; the poverty of our indigenous population, the inhumane treatment of refugees on Manus Island, and the maniacal desire to burn every last ounce of coal in the country, come hell and high water, to name but a few.

It is, quite frankly, overwhelming how institutionally cruel our government, which is after all a representation of the majority of Australians, can be, and nothing has sharpened this meanness to a point than the way the Liberal government have approached this survey.

With everything that is wrong in the world right now; climate change, the threat of nuclear war, and an unqualified narcissist running the White House, voting yes to the survey’s simple question is, quite literally, the smallest thing you could do to bring joy to two people.

So please, when you get your postal survey, vote yes.

Thank you.

Why I joined Heptio

Everyone gets the same set of tools

Something that had long puzzled me was the question “Why do some people [in the organisation] have root, and others do not?” It seemed to me that the reason the sysadmins had the root passwords, and everyone else had to raise tickets, was a tooling problem. Giving everyone root would permit anyone in the organisation to fix their own problems, deploy their own software, or, less charitably, cowboy things or be downright naughty. And while everyone had root, it usually turned out that only the operations team had the on call pager.

After the wholesale failure of organisations to understand Devops, I’m a big fan of the “You build it, you run it” movement. So when George Barnett and I built the Atlassian OnDemand Cloud we made a deliberate decision that everyone would get the same tools, and (modulo permissions and audit logs) be empowered to use the platform to the full extent. There wouldn’t be one set of tools for regular users, and a super set of “power tools” reserved for operators.

To me you build it, you run it, means if you have a problem, we’ll help you learn to use the tools better, not fix your problem for you.

Virtualise the operating system, not the hardware

I remember playing with VMware in 1999 or early 2000. I thought it was an amazing trick, especially as the drivers for my sound card worked way better in virtualized Windows than the real thing.

Fast forward a few years and I was using VMware to maintain a fleet of foreign language Windows installations for testing. Skip forward a few more and the industry had figured out that virtualisation was a solution to the sprawl of single use Windows servers that cluttered up wiring cupboards and data centres.

Virtualisation is a neat trick taken well beyond the point of a joke, but it did shine a light on the dark corners of systems administration. Back when turning up a server involved purchase orders, waiting for hardware to be shipped, contract negotiations, and trips to the data centre, what was a few hours spent installing the operating system? But when virtual hardware could be conjured out of thin air in seconds, it cast a long shadow over the need to automate operating system installation and management.

This was the age of Puppet and Chef, who re-plowed the ground sowed a decade earlier by CFEngine. Now sysadmins could configure and manage servers at the speed they could be virtually provisioned. I remember, thinking back to when I started to use Puppet, and imagining about what it would have been like to have those tools in previous jobs, where automation involved SVN repositories full of perl scripts, and crontab entries lovingly copy pasta’d between machines. And so everything was good for a time in the age of configuration as code.

But, simulating the entirety of an x86 host on another, just so people can share a computer, is a ridiculous waste. This shouldn’t be a surprise, FreeBSD Jails and Solaris Zones (rest in peace) had been coughing loudly about this for decades. Bryan Cantrill said it best when he exclaimed that we should “virtualise the operating system, not the hardware“, or as we’ve come to know them: containers.

The death of the operating system

I remember where I saw Docker for the first time. The product wasn’t even a year old and Docker were carpet bombing any meetup that would have them to promote it. Canonical were sprinting at a hotel near SFO and I convinced several of my teammates to squeeze into a taxi for the first meetup in San Mateo. What I saw that night shook me to my core. It wasn’t just the speed–oh the speed, after spending two years waiting for EC2 and slow apt mirrors–it was the clarity of that Californian mindset: what would happen if I checked my entire application deployment into git?

It was clear to me that night that virtual machines were virtualising stuff that people didn’t care about; virtual video cards, virtual floppy drives, virtual ram that swapped to virtual disks. What people wanted was a virtual kernel–their own pid 1. Orchestration tools like Chef, Puppet, and Juju were trying to orchestrate an entire operating system when what developers really wanted was a way to take a single program, the one that they had written, and deploy it to a server. Filesystems, crontabs, init/upstart/systemd, apt-get and dpkg-reconfigure, weren’t just someone else’s problem, they were irrelevant.

Anyone who’s endured to my rants about product knows my unwavering belief in the Innovator’s Dilemma. Through the window of Christensen’s logic, it was clear that the server orchestration market had been upended in that moment. Squeezed between Docker images at the low end and Netflix’s “everything is an AMI” model at the top end, was a large middle ground filled with orchestration tools that expected to be given a running operating system to configure. The Chefs and Puppets and whatnots would be desperately trying to convince the biggest orchestration users–the Netflixes of the world, with their CI/CD pipelines that pooped AMIs–to adopt agent based tooling, while all the while each developer faced with the question “How should I deploy my application?” would default to docker push.

Orchestration as table stakes

If you’re building your own orchestration layer, then you are betting on the wrong horse–I say this as someone who’s built a bespoke container based PaaS.

Within the next year or two you’ll be able to buy access to a Kubernetes API server at every price point; on your laptop, shared as a VPS, in your own VPC, or even as an appliance. Building on top of the Kubernetes primitives is where the value lies. Building on top of the shared tooling the Kubernetes API provides the level playing field that every development team who is responsible for supporting their own software in production is entitled to.

Why did I join Heptio? Because I believe that the administration of operating systems has reached its endgame. Kubernetes is going to revolutionise the way software is developed, and deployed, and I’m honoured to be given the opportunity to join the company that is going to make that happen.

Setup

Kicking the tires

Go module source cache

Wrap up

Arrays

Slices

The slice header value

Putting it all together

Further reading

Notes

What is a map function?

Go’s map is a hashmap

The hash function

Important properties of a hash function

The hashmap data structure

Four properties of a hash map

Hashmaps in other languages

C++

Java

Tradeoffs

C++ templated std::unordered_map

Advantages

Disadvantages

Java util Hashmap

Advantages

Disadvantages

Go’s hashmap implementation

Does the Go runtime use interface{} ?

Does the compiler use code generation?

Compile time rewriting

Only one copy of the map code

Conclusion

What does a distro provide?

Container sprawl

Immutable images

Updates, updates. Updates!

The last laugh

Whither distros

What can distros do?

But first, a history lesson

Does Go have pragmas?

A word of caution ?

//go:noescape

Escape analysis

os.File.Read

Can you use //go:noescape in your code?

//go:norace

Should you use //go:norace in your own code?

//go:nosplit

Can you use //go:nosplit in your own code?

//go:noinline

Can you use //go:noinline in your own code?

But what about …

Wrapping up

Who were Digital Equipment Corporation?

Introduction

Background: Thoughts Behind The Design

Address space

Insufficient registers

Hardware stack

Interrupt latency

Character handling

Read only memory

Primitive I/O Capabilities

High programming costs

People: Builders of the Design

Chronology of the design

The PDP-11: An Evaluation

What is the UNIBUS?

What have we learned from the PDP-11?

RISC

UNIX

VAX-11/780

BSD

NeXT

Windows NT

Xerox Alto

Data General Nova

What have we learned from the PDP-11?

Everyone gets the same set of tools

C++ templated `std::unordered_map`

Does the Go runtime use interface{} ?