Introduction

This talk is a case study of designing an efficient Go package. I’m going to use the example of building a high performance JSON parser.

The goals of this project were:

Supports streaming operations

It’s unrealistic to expect to have the entire input in memory. Buffering in memory is an availability risk: input sizes are usually unknown and potentially unbounded, and buffering before processing introduces latency. Streaming reads let you process data as it arrives and overlap that processing with reading.

Reasonably compatible with the encoding/json package

This package offers the same high level json.Decoder API, but with higher throughput and fewer allocations.

Allocation free, or allocation bounded, API

In addition to the encoding/json API, provide an alternative, more efficient API.

Time complexity

Let’s talk about the time complexity of this problem.

JSON doesn’t use length markers; to know how much to read, we have to read it all. To parse the 1,000th element in a JSON array, we have to read and process the 999 elements that come before it. This means the lower bound on the time to process the input is the size of the input; you can’t skip ahead.

But reading isn’t the full story: we need to run the input through the JSON state machine to figure out where the tokens start and end. Thus the cost is at least read(N)+parse(N).

But there are other costs:

  • Ideally if we read N bytes, we want to process each byte only once. If we touch the same byte more than once, that adds overhead, and complicates processing if we have to keep those bytes around to come back and look at them again.

  • Just like we don’t want to process a byte more than once, we want to avoid processing a token more than once.

  • Limit function calls in the hot path inside the Scanner or Decoder. We want the number of function calls to be O(tokens), not O(bytes).

  • Limit copies. A design that limits copying also reduces the number of times we (re)visit a byte.

  • Limit allocations. If you limit the number of times data is copied, then you naturally limit allocations, which reduces runtime in several ways:

    1. Reduce the overhead of making the allocation. The heap is a shared resource; allocating on the heap requires working with shared data structures. This means locks, cache contention, etc. c.f. Amdahl’s Law [1]

    2. Reduce the overhead of freeing allocations. The fewer allocations you make, the less heap you consume and the less garbage you produce. Reducing both reduces the overhead of background and foreground garbage collection. (A quick way to measure allocations directly is sketched below.)
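
A cheap way to keep these costs honest is to measure them. The standard library’s testing.AllocsPerRun reports allocations per call; this sketch (not code from the package) shows that merely boxing a float64 into an interface, which is what an interface{} token API must do, costs an allocation:

package main

import (
	"fmt"
	"testing"
)

// sink is package scoped so the compiler cannot optimise the store away.
var sink interface{}

func main() {
	allocs := testing.AllocsPerRun(1000, func() {
		sink = 1.0 // boxing a float64 into an interface allocates
	})
	fmt.Println("allocs per op:", allocs) // prints 1 on current compilers
}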

Tokenisation

A JSON decoder has two main components:

  1. A scanner, or tokeniser, that converts a stream of bytes into a stream of JSON tokens.

  2. An unmarshaller that applies a stream of JSON tokens to a Go object.

What then, is a token?

JSON is a regular, well defined grammar. json.org has a wonderful set of railroad diagrams that describe it.

{"a": 1, "b": true, "c": [1, "two", null]}

is a stream of:

  • {, the opening brace, signifying a collection of name/value pairs.

  • "a", the string a.

  • :, a colon, the delimiter between the key and the value in the key/value pair.

  • 1, the number one.

  • ,, a comma, the delimiter between one key/value pair and the next.

  • "b", the string b.

  • :

  • true, the boolean value for true.

  • ,

  • "c", the string c.

  • :

  • [, the opening square brace, signifying an ordered list of values.

  • 1

  • ,

  • "two"

  • ,

  • null, a null or void value (rare).

  • ], the closing square brace, terminating the list of values.

  • }, closing curly brace, terminating the key/value collection.

If we wanted to parse this with encoding/json, we declare a json.Decoder, then call Token until err is non-nil.

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

func main() {
	input := `{"a": 1, "b": true, "c": [1, "two", null]}`
	dec := json.NewDecoder(strings.NewReader(input))
	for {
		tok, err := dec.Token()
		if err != nil {
			break
		}
		fmt.Printf("%v\t(%T)\n", tok, tok)
	}
}

When we run this we get the following output:

{       (json.Delim)
a       (string)
1       (float64)
b       (string)
true    (bool)
c       (string)
[       (json.Delim)
1       (float64)
two     (string)
<nil>   (<nil>)
]       (json.Delim)
}       (json.Delim)

This is rather convenient: because the returned token is an interface{}, it can represent both the value of the token and its type. Strings are string, numbers are float64, booleans are bool, and even null is represented as nil.
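
For example, a consumer of this API typically type switches on the returned json.Token (which is declared as an interface{}); a fragment, assuming the usual encoding/json and fmt imports:

// printToken switches on the dynamic type of a json.Token. Each case
// receives a value that the Decoder boxed into an interface.
func printToken(tok json.Token) {
	switch v := tok.(type) {
	case json.Delim: // one of { } [ ]
		fmt.Println("delim:", v)
	case string:
		fmt.Println("string:", v)
	case float64:
		fmt.Println("number:", v)
	case bool:
		fmt.Println("bool:", v)
	case nil:
		fmt.Println("null")
	}
}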

But there is a cost to this convenience, and a reason why Brad Fitzpatrick called the Token API a garbage factory.

Because of the design of the Decoder.Token API, the concrete value assigned to each token causes that value to escape to the heap. The number of allocations is tied to the number of tokens in the input.

Let’s look at the allocations for the simplest case, a single token.

package main

import (
	"encoding/json"
	"strings"
	"testing"
)

func BenchmarkJSONDecodeHello(b *testing.B) {
	input := `"hello"`
	r := strings.NewReader(input) // reads strings as []byte
	dec := json.NewDecoder(r)
	b.ReportAllocs()
	b.SetBytes(int64(len(input)))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		r.Seek(0, 0)
		tok, _ := dec.Token()
		if tok != "hello" {
			b.Fatal()
		}
	}
}

Which results in:

name               time/op
JSONDecodeHello-4     355ns ± 0%

name               speed
JSONDecodeHello-4  19.7MB/s ± 0%

name               alloc/op
JSONDecodeHello-4     37.0B ± 0%

name               allocs/op
JSONDecodeHello-4      3.00 ± 0%

Take away: API design influences allocation. Allocations can influence performance.

Implicit tokens

Let’s look back at the sequence of tokens: { "a" : 1 , "b" : true , "c" : [ 1 , "two" , null ] and }.

It turns out that the first character in the token tells you the type of the token.

  • {, } - collection start, end

  • [, ] - array start, end

  • t - true

  • f - false

  • n - null

  • " - string

  • -, 0-9 - a number

This is the first improvement we can make in our Decoder.NextToken API. Rather than converting the input []byte to a value, we return the bytes representing the token straight from the input, as a simple subslice. The first character in the []byte tells us the type of the token.

package main

import (
	"fmt"
	"github.com/pkg/json"
	"strings"
)

func main() {
	input := `{"a": 1, "b": true, "c": [1, "two", null]}`
	dec := json.NewDecoder(strings.NewReader(input))
	for {
		tok, err := dec.NextToken()
		if err != nil {
			break
		}
		fmt.Printf("%s\t(%T)\n", tok, tok)
	}
}
Running this produces:

{       ([]uint8)
"a"     ([]uint8)
1       ([]uint8)
"b"     ([]uint8)
true    ([]uint8)
"c"     ([]uint8)
[       ([]uint8)
1       ([]uint8)
"two"   ([]uint8)
null    ([]uint8)
]       ([]uint8)
}       ([]uint8)

There are a few subtleties with this API.

  1. Because the output is a subslice of the input, not a copy, there are restrictions on how long the output is valid for. This is similar to the bufio.Scanner API.

  2. Sometimes people want to know the type of the token; collection, array, string, number, etc. Sometimes they want the token’s value, the string or the number, in a form they can work with. Decoder.NextToken isn’t convenient for that directly, but it can be used to build higher level abstractions, such as the classifier sketched below.
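
As a sketch of such an abstraction, here is a hypothetical helper (not part of the package) that classifies a token returned by Decoder.NextToken using only its first byte:

// kindOf reports the kind of tok, a token returned by
// Decoder.NextToken, by inspecting its first byte only.
func kindOf(tok []byte) string {
	if len(tok) == 0 {
		return "eof"
	}
	switch tok[0] {
	case '{', '}':
		return "object delimiter"
	case '[', ']':
		return "array delimiter"
	case ':', ',':
		return "separator"
	case 't', 'f':
		return "boolean"
	case 'n':
		return "null"
	case '"':
		return "string"
	default: // '-', '0' through '9'
		return "number"
	}
}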

Reading

Let’s talk about reading data. This can be tricky to do efficiently because JSON is not length delimited; you have to read to the end of a token to know where it ends.

io.Reader problems

The traditional way to do this is with an io.Reader.

buf := make([]byte, 4 << 10)
read, err := r.Read(buf)

But this comes with a number of problems:

  • io.Reader.Read copies data from the reader into a buffer. That copying takes time.

  • io.Reader.Read makes buffer management the problem of the caller.

    1. You can read one byte at a time, but then you need somewhere to store the data you’re walking over, and you may need to put a byte back.

    2. You can read into a large buffer, then look in the buffer for the start and end of the token. If the end of the token isn’t in the buffer, you need to do a lot of bookkeeping and copying to move the data around the buffer, or grow the buffer to make room for more data.

This is tricky to get right, and inefficient even if you do.
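
To make the bookkeeping concrete, here is roughly what the caller of a plain io.Reader ends up writing. This is a sketch, not code from the package; indexTokenEnd is a hypothetical helper that reports where the first complete token ends, or -1 if it is incomplete:

// nextToken scans buf for a complete token, refilling from r as needed.
// Note the growing and copying the caller is forced to manage.
func nextToken(r io.Reader, buf []byte) (tok, rest []byte, err error) {
	n := 0
	for {
		if end := indexTokenEnd(buf[:n]); end >= 0 {
			return buf[:end], buf[end:n], nil // token found in buffer
		}
		if n == len(buf) {
			// token longer than the buffer: grow and copy everything
			bigger := make([]byte, 2*len(buf))
			copy(bigger, buf)
			buf = bigger
		}
		m, rerr := r.Read(buf[n:]) // refill: yet more copying
		n += m
		if m == 0 && rerr != nil {
			return nil, buf[:n], rerr
		}
	}
}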

The alternative is an idea inspired by Steven Schveighoffer’s iopipe [2] and Phil Pearl [3], which I adapted into a type called byteReader. byteReader operates similarly to bufio.Reader, but has a more efficient API.

// A byteReader implements a sliding window over an io.Reader.
type byteReader struct {
	data   []byte
	offset int
	r      io.Reader
	err    error
}

// window returns the current window.
// The window is invalidated by calls to release or extend.
func (b *byteReader) window() []byte {
	return b.data[b.offset:]
}

// release discards n bytes from the front of the window.
func (b *byteReader) release(n int) {
	b.offset += n
}

// extend extends the window with data from the underlying reader.
func (b *byteReader) extend() int {
	// A sketch of the implementation: compact or grow the buffer if
	// there is no spare capacity, then fill it from the underlying
	// reader. The details were elided in the talk.
	if b.err != nil {
		return 0
	}
	if len(b.data) == cap(b.data) {
		grown := make([]byte, len(b.data)-b.offset, 2*cap(b.data)+8192)
		copy(grown, b.data[b.offset:])
		b.data, b.offset = grown, 0
	}
	n, err := b.r.Read(b.data[len(b.data):cap(b.data)])
	b.data = b.data[:len(b.data)+n]
	b.err = err
	return n
}

Example: whitespace

JSON contains a mixture of tokens and whitespace. Space, tab, newline and carriage return can occur between tokens and are ignored; thus the search for a token begins with a search for the first non-whitespace character.

{"a": 1, "b": true, "c": [1, "two", null]}

This is a good time to talk about the search for whitespace with an example of using a byteReader.

var whitespace = [256]bool{
	' ':  true,
	'\t': true,
	'\n': true,
	'\r': true,
}

func countWhitespace(br *byteReader) int {
	n := 0
	w := br.window()
	for {
		for _, c := range w {
			if whitespace[c] {
				n++
			}
		}
		br.release(len(w))
		if br.extend() == 0 {
			return n
		}
		w = br.window()
	}
}

This does the minimum: visit each character and test whether it is a whitespace character. No (useful) JSON decoder can go faster than this.

name                             time/op
CountWhitespace/canada-16          1.10ms ± 2%
CountWhitespace/citm_catalog-16     838µs ± 1%
CountWhitespace/twitter-16          306µs ± 1%
CountWhitespace/code-16             937µs ± 1%
CountWhitespace/example-16         6.40µs ± 1%
CountWhitespace/sample-16           333µs ± 1%

name                             speed
CountWhitespace/canada-16        2.04GB/s ± 2%
CountWhitespace/citm_catalog-16  2.06GB/s ± 1%
CountWhitespace/twitter-16       2.06GB/s ± 1%
CountWhitespace/code-16          2.07GB/s ± 1%
CountWhitespace/example-16       2.04GB/s ± 1%
CountWhitespace/sample-16        2.06GB/s ± 1%

So this is our baseline.

Quiz me after the talk on the best way to write a general purpose whitespace counter.

Scanning

Now that we can tell which characters are tokens and which are simply whitespace, let’s step up a level and talk about scanning.

// Next returns a []byte referencing the next lexical token in the stream.
// The []byte is valid until Next is called again.
// If the stream is at its end, or an error has occurred, Next returns a zero
// length []byte slice.
//
// A valid token begins with one of the following:
//
//	{ Object start
//	[ Array start
//	} Object end
//	] Array End
//	, Literal comma
//	: Literal colon
//	t JSON true
//	f JSON false
//	n JSON null
//	" A string, possibly containing backslash escaped entites.
//	-, 0-9 A number
func (s *Scanner) Next() []byte {
	s.br.release(s.offset) // release the previous token
	s.offset = 0

	c := s.token() // align to the next token
	length := 0
	switch c {
	case ObjectStart, ObjectEnd, Colon, Comma, ArrayStart, ArrayEnd:
		length = 1
		s.offset = 1
	case True:
		length = validateToken(&s.br, "true")
		s.offset = length
	case False:
		length = validateToken(&s.br, "false")
		s.offset = length
	case Null:
		length = validateToken(&s.br, "null")
		s.offset = length
	case String:
		length = parseString(&s.br)
		if length < 2 {
			return nil
		}
		s.offset = length
	case 0:
		// eof
		return nil
	default:
		length = s.parseNumber()
		if length < 0 {
			return nil
		}
	}
	return s.br.window()[:length]
}

This is the core loop of Scanner.Next. It skips over any intermediate whitespace, determines the type of the token from the first character in the window, then continues to read until the token is complete or we hit the end of the input.

Let’s look at how token works, then we’ll talk about some optimisations.

// token positions the scanner at the next token in the stream
// and returns the first byte of the token.
func (s *Scanner) token() byte {
	w := s.br.window()
	for {
		for _, c := range w {
			if whitespace[c] {
				s.offset++
				continue
			}

			// release whitespace
			s.br.release(s.offset)
			s.offset = 0
			return c
		}
		if s.br.extend() == 0 {
			// eof
			return 0
		}
		w = s.br.window()[s.offset:]
	}
}

  • We start by getting the current window from the byteReader. This is a []byte slice of all the data that is yet to be read.

  • We’re looking for the first non-whitespace character. If the character is whitespace, we increment s.offset to skip it and loop around.

  • If we do find a non whitespace character, we release s.offset characters from the front of the window. Now the start of the window is properly aligned with the first character of the token.

  • It turns out that we also get the first character of the token for free: it’s in c, so we can return it as a hint to Scanner.Next.

  • If we run out of characters without hitting a token, we call extend() to grow the window.

  • If we couldn’t grow, then we’ve run out of input and haven’t got a token, so give up.

  • Otherwise update w with a new window.

Some things to note:

  • Note the lack of error handling; it isn’t part of the inner loop, and only happens when we have to read more data from the underlying reader via extend. [4]

  • extend hides the process of reading into, growing, and refilling the buffer. This makes the caller, Scanner.token, simpler: if there is data in the window, process it; extend if you need to; if you can’t extend, give up.

  • release is similar: it shrinks the start of the window to exclude data that we no longer care about.

  • extend is not in the hot path, so there is no need to optimise it; its performance is a function of the buffer it is given. In practice an initial buffer of 8k is sufficient.

Let’s talk about the performance of this code.

name                     time/op
Scanner/canada-16         4.41ms ± 2%
Scanner/citm_catalog-16   2.55ms ± 3%
Scanner/twitter-16        1.03ms ± 1%
Scanner/code-16           4.21ms ± 1%
Scanner/example-16        21.4µs ± 1%
Scanner/sample-16          822µs ± 1%

name                     speed
Scanner/canada-16        510MB/s ± 2%
Scanner/citm_catalog-16  677MB/s ± 3%
Scanner/twitter-16       615MB/s ± 1%
Scanner/code-16          461MB/s ± 1%
Scanner/example-16       608MB/s ± 1%
Scanner/sample-16        837MB/s ± 1%

Comparing the performance of Scanner.Next to our whitespace benchmark, we’re running at between one quarter and two fifths of the baseline.

Let’s talk about improvements we can make to the code. Note the amount of work being spent to keep s.offset up to date. We know that s.offset is set to 0 just before Scanner.Next calls this function, and we set s.offset to zero on the way out of the function, so the changes we make to s.offset within the function are invisible: it’s zero on entry, and zero on exit.

We can rewrite the function to keep a local offset value, which has an impressive effect on token.

func (s *Scanner) token() byte {
	w := s.br.window()
	offset := 0
	for {
		for _, c := range w {
			if whitespace[c] {
				offset++
				continue
			}

			// release whitespace
			s.br.release(offset)
			return c
		}
		if s.br.extend() == 0 {
			// eof
			return 0
		}
		w = s.br.window()[offset:]
	}
}

name                     old time/op    new time/op    delta
Scanner/canada-16          4.39ms ± 1%    4.43ms ± 4%     ~     (p=1.000 n=5+5)
Scanner/citm_catalog-16    2.52ms ± 1%    1.80ms ± 4%  -28.46%  (p=0.008 n=5+5)
Scanner/twitter-16         1.03ms ± 2%    0.95ms ± 3%   -7.41%  (p=0.008 n=5+5)
Scanner/code-16            4.24ms ± 2%    4.18ms ± 1%     ~     (p=0.095 n=5+5)
Scanner/example-16         21.4µs ± 1%    18.9µs ± 2%  -11.68%  (p=0.008 n=5+5)
Scanner/sample-16           828µs ± 2%     528µs ± 2%  -36.24%  (p=0.008 n=5+5)

name                     old speed      new speed      delta
Scanner/canada-16         512MB/s ± 1%   509MB/s ± 4%     ~     (p=1.000 n=5+5)
Scanner/citm_catalog-16   685MB/s ± 1%   958MB/s ± 4%  +39.84%  (p=0.008 n=5+5)
Scanner/twitter-16        616MB/s ± 2%   665MB/s ± 3%   +8.01%  (p=0.008 n=5+5)
Scanner/code-16           458MB/s ± 2%   465MB/s ± 1%     ~     (p=0.095 n=5+5)
Scanner/example-16        608MB/s ± 1%   688MB/s ± 2%  +13.23%  (p=0.008 n=5+5)
Scanner/sample-16         831MB/s ± 2%  1303MB/s ± 2%  +56.84%  (p=0.008 n=5+5)

By keeping offset local, the compiler avoided those temporary writes back to memory. The question I have for you is: why did this improve some inputs and not others?

The answer is that different inputs have different amounts of whitespace. For example, canada has only 33 whitespace characters, whereas citm has 1,227,563.

Quiz me after the talk why the compiler can’t optimise these writes away.

Hint: panic()

Inlining

There is a larger improvement we can make to the runtime of this code, and it relates to inlining. Inlining is the process of automatically (or manually) copying the body of a function inline into its caller, avoiding the overhead of the function call.

The Go compiler has reasonable support for inlining, but has a number of limitations.

% go build -gcflags=-m=2 2>&1 | grep cannot | grep -v decoder
./reader.go:31:6: cannot inline (*byteReader).extend: function too complex: cost 198 exceeds budget 80
./scanner.go:99:6: cannot inline (*Scanner).token: unhandled op FOR
./scanner.go:130:6: cannot inline validateToken: unhandled op FOR
./scanner.go:153:6: cannot inline parseString: unhandled op FOR
./scanner.go:182:6: cannot inline (*Scanner).parseNumber: unhandled op FOR
./scanner.go:56:6: cannot inline (*Scanner).Next: function too complex: cost 476 exceeds budget 80

The first limitation is the size of the function: byteReader.extend cannot be inlined because it is too complex. [5] The second is certain statements within the function: Scanner.token cannot be inlined because it contains a for statement.

Let’s go back to the constraints:

  • Scanner.Next is called for each token in the input.

  • This means that Scanner.token is called for each token in the input.

  • Scanner.token cannot be automatically inlined into its caller because it is too complex.

  • Therefore we’re paying two function calls per token.

We can remove one of these by manually inlining Scanner.token into its caller.

func (s *Scanner) Next() []byte {
	// release the previous token
	s.br.release(s.offset)
	w := s.br.window()
	offset := 0
	for {
		for _, c := range w {
			if whitespace[c] {
				offset++
				continue
			}

			// release whitespace
			s.br.release(offset)

			length := 0
			switch c {
			case ObjectStart, ObjectEnd, Colon, Comma, ArrayStart, ArrayEnd:
				length = 1
				s.offset = 1
			case True:
				length = validateToken(&s.br, "true")
				s.offset = length
			case False:
				length = validateToken(&s.br, "false")
				s.offset = length
			case Null:
				length = validateToken(&s.br, "null")
				s.offset = length
			case String:
				// string
				length = parseString(&s.br)
				if length < 2 {
					return nil
				}
				s.offset = length
			default:
				// ensure the number is correct.
				length = s.parseNumber()
				if length < 0 {
					return nil
				}
			}
			return s.br.window()[:length]
		}
		if s.br.extend() == 0 {
			// eof
			return nil
		}
		w = s.br.window()[offset:]
	}
}

The results support our thesis:

name                     old time/op    new time/op    delta
Scanner/canada-16          4.36ms ± 1%    3.50ms ± 0%  -19.68%  (p=0.008 n=5+5)
Scanner/citm_catalog-16    1.80ms ± 1%    1.56ms ± 2%  -13.16%  (p=0.008 n=5+5)
Scanner/twitter-16          965µs ± 2%     833µs ± 2%  -13.75%  (p=0.008 n=5+5)
Scanner/code-16            4.15ms ± 1%    3.61ms ± 1%  -12.82%  (p=0.008 n=5+5)
Scanner/example-16         18.9µs ± 2%    16.6µs ± 1%  -12.42%  (p=0.008 n=5+5)
Scanner/sample-16           515µs ± 1%     472µs ± 2%   -8.34%  (p=0.008 n=5+5)

name                     old speed      new speed      delta
Scanner/canada-16         516MB/s ± 1%   642MB/s ± 0%  +24.50%  (p=0.008 n=5+5)
Scanner/citm_catalog-16   960MB/s ± 1%  1105MB/s ± 2%  +15.16%  (p=0.008 n=5+5)
Scanner/twitter-16        654MB/s ± 2%   759MB/s ± 1%  +15.94%  (p=0.008 n=5+5)
Scanner/code-16           468MB/s ± 1%   537MB/s ± 1%  +14.69%  (p=0.008 n=5+5)
Scanner/example-16        689MB/s ± 2%   787MB/s ± 1%  +14.17%  (p=0.008 n=5+5)
Scanner/sample-16        1.33GB/s ± 1%  1.46GB/s ± 1%   +9.11%  (p=0.008 n=5+5)

By saving a function call we’ve improved throughput by 9-24%. The largest improvement comes from canada, which contains almost no whitespace, so the call to Scanner.token almost always returned immediately, having done no work!

To recap, the major optimisations were:

  • Avoiding s.offset updates. They cannot be registerised; the CPU has to do a write on every iteration. s.offset updates dropped from one per byte to one per token.

  • Scanner.Next and Scanner.token were effectively one function spread over two. Each is too large to be inlined, so we were paying for an extra function call per token. Manually inlining them increased the indentation depth of the function, but delivered substantial speedups.

  • Most JSON contains some whitespace; it’s moderately optimised for human readability. It turns out that the more whitespace, the faster pkg/json decodes! citm runs at over 50% of the baseline, sample at nearly 75%.

Take away: Avoiding function calls can improve performance in the hot path.

Decoding

So far we have a Scanner which tokenises input at 50-75% of the whitespace baseline. But there are a few more things needed to make the scanner fully functional.

The first part of that is validation.

Validation

JSON is a state machine: depending on the current token, only certain subsequent tokens are valid. For example, if you’ve read the tokens {, "username", then the only valid token is :. To track this, we need to layer some logic on top of Scanner.Next to assert that the sequence of tokens is valid. This is the role of Decoder.NextToken:

const (
	stateValue = iota
	stateObjectString
	stateObjectColon
	stateObjectValue
	stateObjectComma
	stateArrayValue
	stateArrayComma
	stateEnd
)

func (d *Decoder) NextToken() ([]byte, error) {
	tok := d.scanner.Next()
	if len(tok) < 1 {
		return nil, io.EOF
	}
	switch d.state {
	case stateValue:
		return d.stateValue(tok)
	case stateObjectString:
		return d.stateObjectString(tok)
	case stateObjectColon:
		return d.stateObjectColon(tok)
	case stateObjectValue:
		return d.stateObjectValue(tok)
	case stateObjectComma:
		return d.stateObjectComma(tok)
	case stateArrayValue:
		return d.stateArrayValue(tok)
	case stateArrayComma:
		return d.stateArrayComma(tok)
	case stateEnd:
		fallthrough
	default:
		return nil, io.EOF
	}
}

This is pretty straightforward stuff. We track the current state in d.state and, based on its value, dispatch to the various state methods, which assert that the token is valid and update the state.

Here is the source for the various state methods.

func (d *Decoder) stateObjectString(tok []byte) ([]byte, error) {
	switch tok[0] {
	case '}':
		inObj := d.pop()
		switch {
		case d.len() == 0:
			d.state = stateEnd
		case inObj:
			d.state = stateObjectComma
		case !inObj:
			d.state = stateArrayComma
		}
		return tok, nil
	case '"':
		d.state = stateObjectColon
		return tok, nil
	default:
		return nil, fmt.Errorf("stateObjectString: missing string key")
	}
}

func (d *Decoder) stateObjectColon(tok []byte) ([]byte, error) {
	switch tok[0] {
	case Colon:
		d.state = stateObjectValue
		return d.NextToken()
	default:
		return tok, fmt.Errorf("stateObjectColon: expecting colon")
	}
}

func (d *Decoder) stateObjectValue(tok []byte) ([]byte, error) {
	switch tok[0] {
	case '{':
		d.state = stateObjectString
		d.push(true)
		return tok, nil
	case '[':
		d.state = stateArrayValue
		d.push(false)
		return tok, nil
	default:
		d.state = stateObjectComma
		return tok, nil
	}
}

func (d *Decoder) stateObjectComma(tok []byte) ([]byte, error) {
	switch tok[0] {
	case '}':
		inObj := d.pop()
		switch {
		case d.len() == 0:
			d.state = stateEnd
		case inObj:
			d.state = stateObjectComma
		case !inObj:
			d.state = stateArrayComma
		}
		return tok, nil
	case Comma:
		d.state = stateObjectString
		return d.NextToken()
	default:
		return tok, fmt.Errorf("stateObjectComma: expecting comma")
	}
}

func (d *Decoder) stateArrayValue(tok []byte) ([]byte, error) {
	switch tok[0] {
	case '{':
		d.state = stateObjectString
		d.push(true)
		return tok, nil
	case '[':
		d.state = stateArrayValue
		d.push(false)
		return tok, nil
	case ']':
		inObj := d.pop()
		switch {
		case d.len() == 0:
			d.state = stateEnd
		case inObj:
			d.state = stateObjectComma
		case !inObj:
			d.state = stateArrayComma
		}
		return tok, nil
	case ',':
		return nil, fmt.Errorf("stateArrayValue: unexpected comma")
	default:
		d.state = stateArrayComma
		return tok, nil
	}
}

func (d *Decoder) stateArrayComma(tok []byte) ([]byte, error) {
	switch tok[0] {
	case ']':
		inObj := d.pop()
		switch {
		case d.len() == 0:
			d.state = stateEnd
		case inObj:
			d.state = stateObjectComma
		case !inObj:
			d.state = stateArrayComma
		}
		return tok, nil
	case Comma:
		d.state = stateArrayValue
		return d.NextToken()
	default:
		return nil, fmt.Errorf("stateArrayComma: expected comma, %v", d.stack)
	}
}

func (d *Decoder) stateValue(tok []byte) ([]byte, error) {
	switch tok[0] {
	case '{':
		d.state = stateObjectString
		d.push(true)
		return tok, nil
	case '[':
		d.state = stateArrayValue
		d.push(false)
		return tok, nil
	case ',':
		return nil, fmt.Errorf("stateValue: unexpected comma")
	default:
		d.state = stateEnd
		return tok, nil
	}
}
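
These methods rely on a small stack that records whether each open scope is an object or an array. A minimal implementation consistent with the calls above might look like this (the package’s actual version may differ); note that pop returns the kind of the enclosing scope, which is what the } and ] handlers need:

// stack records, for each open scope, whether it is an object (true)
// or an array (false). It is embedded in Decoder, so its methods are
// promoted as d.push, d.pop and d.len.
type stack []bool

func (s *stack) push(obj bool) { *s = append(*s, obj) }

func (s *stack) len() int { return len(*s) }

// pop removes the scope being closed and reports whether the scope
// enclosing it, the new top of the stack, is an object.
func (s *stack) pop() bool {
	*s = (*s)[:len(*s)-1]
	return s.len() > 0 && (*s)[s.len()-1]
}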

Let’s look at the results.

name                                           time/op
DecoderNextToken/pkgjson/canada-16               5.64ms ± 1%
DecoderNextToken/encodingjson/canada-16          65.1ms ± 1%
DecoderNextToken/pkgjson/citm_catalog-16         2.42ms ± 2%
DecoderNextToken/encodingjson/citm_catalog-16    19.8ms ± 1%
DecoderNextToken/pkgjson/twitter-16              1.18ms ± 1%
DecoderNextToken/encodingjson/twitter-16         12.1ms ± 0%
DecoderNextToken/pkgjson/code-16                 6.10ms ± 2%
DecoderNextToken/encodingjson/code-16            77.5ms ± 1%
DecoderNextToken/pkgjson/example-16              25.2µs ± 1%
DecoderNextToken/encodingjson/example-16          266µs ± 1%
DecoderNextToken/pkgjson/sample-16                559µs ± 0%
DecoderNextToken/encodingjson/sample-16          3.18ms ± 2%

name                                           speed
DecoderNextToken/pkgjson/canada-16              399MB/s ± 1%
DecoderNextToken/encodingjson/canada-16        34.6MB/s ± 1%
DecoderNextToken/pkgjson/citm_catalog-16        713MB/s ± 2%
DecoderNextToken/encodingjson/citm_catalog-16  87.1MB/s ± 1%
DecoderNextToken/pkgjson/twitter-16             537MB/s ± 1%
DecoderNextToken/encodingjson/twitter-16       52.2MB/s ± 0%
DecoderNextToken/pkgjson/code-16                318MB/s ± 2%
DecoderNextToken/encodingjson/code-16          25.0MB/s ± 1%
DecoderNextToken/pkgjson/example-16             517MB/s ± 1%
DecoderNextToken/encodingjson/example-16       49.0MB/s ± 1%
DecoderNextToken/pkgjson/sample-16             1.23GB/s ± 0%
DecoderNextToken/encodingjson/sample-16         216MB/s ± 2%

Compared to encoding/json we’re 8-10x faster. But there are some things we can do to improve:

Central to the operation of Decoder.NextToken is the switch statement. In the general case, switch is implemented as a sequence of if statements. Effectively what the compiler sees is

func (d *Decoder) NextToken() ([]byte, error) {
	tok := d.scanner.Next()
	if len(tok) < 1 {
		return nil, io.EOF
	}
	if d.state == stateValue {
		return d.stateValue(tok)
	}
	if d.state == stateObjectString {
		return d.stateObjectString(tok)
	}
	if d.state == stateObjectColon {
		return d.stateObjectColon(tok)
	}
	if d.state == stateObjectValue {
		return d.stateObjectValue(tok)
	}
	if d.state == stateObjectComma {
		return d.stateObjectComma(tok)
	}
	if d.state == stateArrayValue {
		return d.stateArrayValue(tok)
	}
	if d.state == stateArrayComma {
		return d.stateArrayComma(tok)
	}
	return nil, io.EOF
}

switch is convenient, but not optimal in the hot path, because that long sequence of if statements breaks up the instruction stream and puts pressure on the CPU’s branch predictor. Mispredicting a branch is expensive: the CPU has to flush the pipeline, rewind, and take the other branch.

This problem turns up in many places; bytecode interpreters are a classic example. One optimisation we could make is to turn this linear search into a table. This can be space efficient if the state space is small and dense (often it is not), and the result might look something like this:

var stateTable = [...]func(*Decoder, []byte) ([]byte, error){
	stateValue:        (*Decoder).stateValue,
	stateObjectString: (*Decoder).stateObjectString,
	stateObjectColon:  (*Decoder).stateObjectColon,
	stateObjectValue:  (*Decoder).stateObjectValue,
	stateObjectComma:  (*Decoder).stateObjectComma,
	stateArrayValue:   (*Decoder).stateArrayValue,
	stateArrayComma:   (*Decoder).stateArrayComma,
	stateEnd:          (*Decoder).stateEnd,
}

func (d *Decoder) NextToken() ([]byte, error) {
	tok := d.scanner.Next()
	if len(tok) < 1 {
		return nil, io.EOF
	}
	return stateTable[d.state](d, tok)
}

Unfortunately this won’t compile because there is an initialisation loop.
./decoder.go:115:5: initialization loop:
        /Users/davecheney/devel/json/decoder.go:115:5: stateTable refers to
        /Users/davecheney/devel/json/decoder.go:155:19: (*Decoder).stateObjectColon refers to
        /Users/davecheney/devel/json/decoder.go:126:19: (*Decoder).NextToken refers to
        /Users/davecheney/devel/json/decoder.go:115:5: stateTable
FAIL    github.com/pkg/json [build failed]
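
For completeness, the conventional workaround is to populate the table in an init function, which moves the assignment out of the static initialisation graph; a sketch:

var stateTable [stateEnd + 1]func(*Decoder, []byte) ([]byte, error)

func init() {
	// Assigning at run time, rather than in a static initialiser,
	// sidesteps the initialisation loop.
	stateTable = [stateEnd + 1]func(*Decoder, []byte) ([]byte, error){
		stateValue:        (*Decoder).stateValue,
		stateObjectString: (*Decoder).stateObjectString,
		stateObjectColon:  (*Decoder).stateObjectColon,
		stateObjectValue:  (*Decoder).stateObjectValue,
		stateObjectComma:  (*Decoder).stateObjectComma,
		stateArrayValue:   (*Decoder).stateArrayValue,
		stateArrayComma:   (*Decoder).stateArrayComma,
		stateEnd:          (*Decoder).stateEnd,
	}
}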

But there is a better trick, more space efficient than this table, sometimes called a computed goto.

Computed Goto

If you look at the table above, there is a pattern: each state enumeration is matched with exactly one method. What we store in d.state is a proxy for the method we want to call, but to call that method we have to switch on d.state to find it.

What if we could store the method directly and call it directly? In fact, we can do just that with some old Go 1.0 magic: method expressions.

// A Decoder decodes JSON values from an input stream.
type Decoder struct {
	scanner *Scanner
	state   func(*Decoder, []byte) ([]byte, error)
	stack   // keeps track of whether we're in an object or array
}

func (d *Decoder) NextToken() ([]byte, error) {
	tok := d.scanner.Next()
	if len(tok) < 1 {
		return nil, io.EOF
	}
	return d.state(d, tok)
}
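
The initial state is installed when the Decoder is constructed; something like the following sketch (NewScanner and the constructor details are assumptions, not the package’s exact code):

// NewDecoder returns a Decoder that reads from r, starting in the
// stateValue state.
func NewDecoder(r io.Reader) *Decoder {
	return &Decoder{
		scanner: NewScanner(r),
		// a method expression is an ordinary func value, so it can be
		// stored in a field and later invoked as d.state(d, tok)
		state: (*Decoder).stateValue,
	}
}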

Let’s see how this performs.

name                                           old time/op    new time/op    delta
DecoderNextToken/pkgjson/canada-16               5.60ms ± 0%    5.61ms ± 0%     ~     (p=0.841 n=5+5)
DecoderNextToken/encodingjson/canada-16          64.9ms ± 1%    64.4ms ± 0%     ~     (p=0.222 n=5+5)
DecoderNextToken/pkgjson/citm_catalog-16         2.41ms ± 1%    2.42ms ± 1%     ~     (p=0.310 n=5+5)
DecoderNextToken/encodingjson/citm_catalog-16    19.8ms ± 1%    19.7ms ± 1%     ~     (p=0.548 n=5+5)
DecoderNextToken/pkgjson/twitter-16              1.17ms ± 1%    1.20ms ± 2%   +2.70%  (p=0.008 n=5+5)
DecoderNextToken/encodingjson/twitter-16         12.2ms ± 1%    12.1ms ± 1%     ~     (p=1.000 n=5+5)
DecoderNextToken/pkgjson/code-16                 6.11ms ± 3%    6.37ms ± 4%   +4.26%  (p=0.016 n=5+5)
DecoderNextToken/encodingjson/code-16            77.7ms ± 2%    78.6ms ± 2%     ~     (p=0.310 n=5+5)
DecoderNextToken/pkgjson/example-16              25.0µs ± 0%    25.6µs ± 1%   +2.25%  (p=0.008 n=5+5)
DecoderNextToken/encodingjson/example-16          263µs ± 1%     265µs ± 2%     ~     (p=0.222 n=5+5)
DecoderNextToken/pkgjson/sample-16                560µs ± 1%     553µs ± 0%   -1.15%  (p=0.008 n=5+5)
DecoderNextToken/encodingjson/sample-16          3.16ms ± 1%    3.17ms ± 1%     ~     (p=0.690 n=5+5)

name                                           old speed      new speed      delta
DecoderNextToken/pkgjson/canada-16              402MB/s ± 0%   401MB/s ± 0%     ~     (p=0.841 n=5+5)
DecoderNextToken/encodingjson/canada-16        34.7MB/s ± 1%  35.0MB/s ± 0%     ~     (p=0.222 n=5+5)
DecoderNextToken/pkgjson/citm_catalog-16        716MB/s ± 1%   713MB/s ± 1%     ~     (p=0.310 n=5+5)
DecoderNextToken/encodingjson/citm_catalog-16  87.3MB/s ± 1%  87.6MB/s ± 1%     ~     (p=0.548 n=5+5)
DecoderNextToken/pkgjson/twitter-16             539MB/s ± 1%   524MB/s ± 2%   -2.62%  (p=0.008 n=5+5)
DecoderNextToken/encodingjson/twitter-16       51.9MB/s ± 1%  52.0MB/s ± 1%     ~     (p=1.000 n=5+5)
DecoderNextToken/pkgjson/code-16                318MB/s ± 2%   305MB/s ± 3%   -4.07%  (p=0.016 n=5+5)
DecoderNextToken/encodingjson/code-16          25.0MB/s ± 2%  24.7MB/s ± 2%     ~     (p=0.310 n=5+5)
DecoderNextToken/pkgjson/example-16             521MB/s ± 0%   509MB/s ± 1%   -2.19%  (p=0.008 n=5+5)
DecoderNextToken/encodingjson/example-16       49.5MB/s ± 1%  49.1MB/s ± 2%     ~     (p=0.222 n=5+5)
DecoderNextToken/pkgjson/sample-16             1.23GB/s ± 1%  1.24GB/s ± 0%   +1.16%  (p=0.008 n=5+5)
DecoderNextToken/encodingjson/sample-16         218MB/s ± 1%   217MB/s ± 1%     ~     (p=0.690 n=5+5)

The results aren’t that promising, but now we can unlock our final optimisation.

Outlining

Compare the previous version of Decoder.NextToken

func (d *Decoder) NextToken() ([]byte, error) {
	tok := d.scanner.Next()
	if len(tok) < 1 {
		return nil, io.EOF
	}
	return d.state(d, tok)
}

with the version after outlining.

func (d *Decoder) NextToken() ([]byte, error) {
	return d.state(d)
}

func (d *Decoder) stateObjectColon() ([]byte, error) {
	tok := d.scanner.Next()
	if len(tok) < 1 {
		return nil, io.ErrUnexpectedEOF
	}
	switch tok[0] {
	case Colon:
		d.state = (*Decoder).stateObjectValue
		return d.NextToken()
	default:
		return tok, fmt.Errorf("stateObjectColon: expecting colon")
	}
}

// repeated for each Decoder.state... method

Moving the tok := d.scanner.Next() call into each state method might seem like a step backwards, but it has several positive effects.

  • The first is that we no longer pass tok into each state method. This saves 3 words (the slice header) on the call stack.

  • The second is that moving the if len(tok) < 1 check into the same function as the switch enables bounds check elimination.

    Previously, when the len(tok) check happened in Decoder.NextToken, Decoder.stateObjectColon knew nothing about the length of tok, because it could not be inlined; we call it via a method expression. When the compiler encountered switch tok[0], it had to insert a bounds check to make sure tok is at least 1 element long.

    When the if check is moved into the same function, the compiler knows that if we get further than the check then tok is at least 1 element long, so the bounds check is not needed.

    We can see this in the debug information.

% go build -gcflags=-d=ssa/prove/debug=2 2>&1 | grep 135\:
./decoder.go:135:13: Proved IsInBounds (v31)
  • The final optimisation occurs because Decoder.NextToken, which was previously too complex to inline

./decoder.go:107:6: cannot inline (*Decoder).NextToken: function too complex: cost 145 exceeds budget 80

is now inlinable

./bench_test.go:217:29: inlining call to (*Decoder).NextToken method(*Decoder) func() ([]byte, error) { return d.state(d) }

Which means calls to dec.NextToken()

for {
    _, err := dec.NextToken()
    if err == io.EOF {
        break
    }
    check(b, err)
    n++
}

become direct calls to the current state method.

for {
    _, err := dec.state(dec) // this is what the compiler sees
    if err == io.EOF {
        break
    }
    check(b, err)
    n++
}

Eliminating the cost of the function call.

Results

At the lowest level, pkg/json.Scanner can tokenise streaming JSON without allocation (provided it is supplied a few kilobytes of buffer).

BenchmarkScanner/canada-16                  1561           3524005 ns/op         638.78 MB/s           0 B/op          0 allocs/op
BenchmarkScanner/citm_catalog-16            3556           1555322 ns/op        1110.51 MB/s           0 B/op          0 allocs/op
BenchmarkScanner/twitter-16                 7068            836031 ns/op         755.37 MB/s           0 B/op          0 allocs/op
BenchmarkScanner/code-16                    1543           3640425 ns/op         533.03 MB/s           0 B/op          0 allocs/op
BenchmarkScanner/example-16               341224             16362 ns/op         796.00 MB/s           0 B/op          0 allocs/op
BenchmarkScanner/sample-16                 12771            467360 ns/op        1471.01 MB/s           0 B/op          0 allocs/op

At the next level, pkg/json.Decoder.Token is 2-3x faster than encoding/json.Decoder.Token.

BenchmarkDecoderToken/pkgjson/canada-16                      267          22073095 ns/op         101.98 MB/s     4402914 B/op     222279 allocs/op
BenchmarkDecoderToken/encodingjson/canada-16                  86          67830737 ns/op          33.19 MB/s    17740387 B/op     889106 allocs/op
BenchmarkDecoderToken/pkgjson/citm_catalog-16               1114           5183226 ns/op         333.23 MB/s      965992 B/op      81995 allocs/op
BenchmarkDecoderToken/encodingjson/citm_catalog-16           288          20882018 ns/op          82.71 MB/s     5661597 B/op     324692 allocs/op
BenchmarkDecoderToken/pkgjson/twitter-16                    2356           2467042 ns/op         255.98 MB/s      768354 B/op      38992 allocs/op
BenchmarkDecoderToken/encodingjson/twitter-16                471          12606205 ns/op          50.10 MB/s     3584017 B/op     187319 allocs/op
BenchmarkDecoderToken/pkgjson/code-16                        346          16877006 ns/op         114.98 MB/s     4304233 B/op     320235 allocs/op
BenchmarkDecoderToken/encodingjson/code-16                    73          80255994 ns/op          24.18 MB/s    23355962 B/op    1319125 allocs/op
BenchmarkDecoderToken/pkgjson/example-16                  113912             53083 ns/op         245.35 MB/s       16016 B/op        914 allocs/op
BenchmarkDecoderToken/encodingjson/example-16              21734            273991 ns/op          47.53 MB/s       82416 B/op       4325 allocs/op
BenchmarkDecoderToken/pkgjson/sample-16                     6642            871796 ns/op         788.59 MB/s      213761 B/op       5081 allocs/op
BenchmarkDecoderToken/encodingjson/sample-16                1803           3287623 ns/op         209.12 MB/s      723782 B/op      26095 allocs/op

Because allocations make up a large proportion of the cost of the Decoder.Token API, pkg/json.Decoder provides an alternative NextToken API that produces significantly fewer allocations and is 8-10x faster.

BenchmarkDecoderNextToken/pkgjson/canada-16                 1197           4825232 ns/op         466.52 MB/s         136 B/op          3 allocs/op
BenchmarkDecoderNextToken/encodingjson/canada-16              90          65392440 ns/op          34.42 MB/s    17740399 B/op     889106 allocs/op
BenchmarkDecoderNextToken/pkgjson/citm_catalog-16           2709           2162849 ns/op         798.58 MB/s         136 B/op          3 allocs/op
BenchmarkDecoderNextToken/encodingjson/citm_catalog-16       301          20064314 ns/op          86.08 MB/s     5661597 B/op     324692 allocs/op
BenchmarkDecoderNextToken/pkgjson/twitter-16                5395           1068106 ns/op         591.25 MB/s         152 B/op          4 allocs/op
BenchmarkDecoderNextToken/encodingjson/twitter-16            494          12072956 ns/op          52.31 MB/s     3584013 B/op     187319 allocs/op
BenchmarkDecoderNextToken/pkgjson/code-16                   1135           5124666 ns/op         378.65 MB/s         248 B/op          6 allocs/op
BenchmarkDecoderNextToken/encodingjson/code-16                74          77579973 ns/op          25.01 MB/s    23355955 B/op    1319125 allocs/op
BenchmarkDecoderNextToken/pkgjson/example-16              269010             22323 ns/op         583.43 MB/s         152 B/op          4 allocs/op
BenchmarkDecoderNextToken/encodingjson/example-16          22707            264078 ns/op          49.32 MB/s       82416 B/op       4325 allocs/op
BenchmarkDecoderNextToken/pkgjson/sample-16                10000            510445 ns/op        1346.85 MB/s        1144 B/op          9 allocs/op
BenchmarkDecoderNextToken/encodingjson/sample-16            1836           3161804 ns/op         217.44 MB/s      723781 B/op      26095 allocs/op

At the highest level, pkg/json can unmarshal data into a Go object with the same API as encoding/json.

BenchmarkDecoderDecodeInterfaceAny/pkgjson/canada-16                 217          27425893 ns/op          82.08 MB/s     8747163 B/op     281408 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/canada-16            153          38347477 ns/op          58.70 MB/s    20647616 B/op     392553 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/citm_catalog-16           747           8008839 ns/op         215.66 MB/s     5197853 B/op      89673 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/citm_catalog-16      360          16607501 ns/op         104.00 MB/s     9406809 B/op      95389 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/twitter-16               1606           3714515 ns/op         170.01 MB/s     2130731 B/op      30182 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/twitter-16           862           6927998 ns/op          91.15 MB/s     4283407 B/op      31278 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/code-16                   333          17939351 ns/op         108.17 MB/s     7331643 B/op     232059 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/code-16              236          25324951 ns/op          76.62 MB/s    12332753 B/op     271292 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/example-16              76874             78079 ns/op         166.81 MB/s       50980 B/op        739 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/example-16         40886            146685 ns/op          88.79 MB/s       82855 B/op        782 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/sample-16                5240           1116081 ns/op         615.99 MB/s      399970 B/op       5542 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/sample-16           1123           5369313 ns/op         128.04 MB/s     2661872 B/op       7527 allocs/op

Thematic ideas

Allocations affect performance

The garbage collector might be fast to allocate and efficient to collect, but not allocating will always be faster.

Eliminating allocations through API design

Most of the speedups in this package come from reducing allocations; specifically, the time not spent in the heap allocation path, and the time not spent in GC cycles, is available for scanning.

API design influences performance

The encoding/json.Decoder API requires allocations: returning a primitive value via an interface causes it to escape to the heap; it effectively becomes a pointer to the value. When dealing with data, allocations can be the biggest performance cost of an algorithm. You’re benchmarking the garbage collector at that point.

Careful attention to the per byte and per token overheads

Optimising the hot path to convert per-byte function calls into per-token function calls is the second biggest performance win.

I started this project because I wanted to apply some ideas I had seen to a problem in Go. I believed it was possible to implement an efficient JSON parser, based on my assumption that encoding/json was slower than it could be because of its API.

It turned out there is 2-3x performance available in some unmarshalling paths, and between 8x and 10x in tokenisation, if you’re prepared to accept a different API.

Thank you


1. en.wikipedia.org/wiki/Amdahl%27s_law
2. DConf 2017: IOPipe: a High Performance I/O Library — Steven Schveighoffer www.youtube.com/watch?v=un-bZdyumog
3. philpearl.github.io/post/reader/
4. John Ousterhout, Define errors out of existence. youtu.be/bmSAYlu0NcY?si=WjC1ouEN1ad2OWjp&t=1312
5. Complexity is determined by giving each statement a weighting. The weightings are somewhat arbitrary, but more or less each token costs 1, function calls cost 52. github.com/golang/go/issues/17566 covers improving the inlining cost model.