Introduction
This talk is a case study of designing an efficient Go package. I’m going to use the example of building a high performance JSON parser.
The goals of this project were:
- Supports streaming operations

  It's unrealistic to expect to have the entire input in memory. Buffering in memory is an availability risk, input sizes are usually unknown and potentially unbounded, and buffering before processing introduces latency. Streaming reads let you process data as it arrives and overlap that processing with reading.
- Reasonably compatible with the encoding/json package

  This package offers the same high level json.Decoder API, but with higher throughput and reduced allocations.
- Allocation free, or bounded, API

  In addition to the encoding/json API, provide an alternative, more efficient API.
Time complexity
Let’s talk about the time complexity of this problem.
JSON doesn't use length markers; to know how much to read, we have to read it all. To parse the 1,000th element in a JSON array, we have to read and process the 999 that come before it. This means the lower bound on the time to process the input is the size of the input — you can't skip ahead.
But reading isn't the full story; we also need to process the input through the JSON state machine to figure out where the tokens start and end.
Thus the cost is at least read(N) + parse(N).
But there are other costs:
- Ideally, if we read N bytes, we want to process each byte only once. If we touch the same byte more than once, that adds overhead, and complicates processing if we have to keep those bytes around to come back and look at them again.
- Just as we don't want to process a byte more than once, we want to avoid processing a token more than once.
- Limit function calls in the hot path inside the Scanner or Decoder. We want to limit the number of function calls to O(tokens), not O(bytes).
- Limit copies. If we have a design that limits copying then we reduce the number of times we (re)visit a byte.
- Limit allocations. If you limit the number of times data is copied, then you naturally limit allocations, which reduces runtime in several ways:
  - Reduce the overhead of making the allocation. The heap is a shared resource; allocating on the heap requires working with shared data structures. This means locks, cache contention, etc. c.f. Amdahl's Law [1]
  - Reduce the overhead of freeing allocations. The fewer allocations you make, the less heap you consume and the less garbage you produce. Reducing these two factors reduces the overhead of background and foreground garbage collection.
Tokenisation
A JSON decoder has two main components:
- A scanner, or tokeniser, that converts a stream of bytes into a stream of JSON tokens.
- An unmarshaller that applies a stream of JSON tokens to a Go object.
What then, is a token?
JSON is a regular, well defined grammar. json.org has a wonderful set of railroad diagrams that describe the grammar of JSON.
{"a": 1, "b": true, "c": [1, "two", null]}
is a stream of:
- {, the opening brace, signifying a collection of name/value pairs.
- "a", the string a.
- :, a colon, the delimiter between the key and the value in the key/value pair.
- 1, the number one.
- ,, a comma, the delimiter between one key/value pair and the next.
- "b", the string b.
- :
- true, the boolean value for true.
- ,
- "c", the string c.
- :
- [, the opening square brace, signifying an ordered list of values.
- 1
- ,
- "two"
- ,
- null, a null or void value (rare).
- ], the closing square brace, terminating the list of values.
- }, the closing curly brace, terminating the key/value collection.
If we wanted to parse this with encoding/json, we declare a json.Decoder, then call Token until err is non-nil.
package main
import (
"encoding/json"
"fmt"
"strings"
)
func main() {
input := `{"a": 1, "b": true, "c": [1, "two", null]}`
dec := json.NewDecoder(strings.NewReader(input))
for {
tok, err := dec.Token()
if err != nil {
break
}
fmt.Printf("%v\t(%T)\n", tok, tok)
}
}
When we run this we get the following output
{ (json.Delim)
a (string)
1 (float64)
b (string)
true (bool)
c (string)
[ (json.Delim)
1 (float64)
two (string)
<nil> (<nil>)
] (json.Delim)
} (json.Delim)
This is rather convenient: the type of the token returned is an interface{}, so it can represent both the value of the token and its type. Strings are string, numbers are float64, booleans are true and false, and even null is represented as a nil.
But there is a cost to this convenience, and a reason why Brad Fitzpatrick called the Token API a garbage factory.
Because of the design of the Decoder.Token
API, the concrete value assigned to each token causes that value to escape to the heap.
The number of allocations is tied to the number of tokens in the input.
Let’s look at the allocations for the simplest case, a single token.
package main
import (
"encoding/json"
"strings"
"testing"
)
func BenchmarkJSONDecodeHello(b *testing.B) {
input := `"hello"`
r := strings.NewReader(input) // reads strings as []byte
dec := json.NewDecoder(r)
b.ReportAllocs()
b.SetBytes(int64(len(input)))
b.ResetTimer()
for i := 0; i < b.N; i++ {
r.Seek(0, 0)
tok, _ := dec.Token()
if tok != "hello" {
b.Fatal()
}
}
}
Which results in
name time/op
JSONDecodeHello-4 355ns ± 0%
name speed
JSONDecodeHello-4 19.7MB/s ± 0%
name alloc/op
JSONDecodeHello-4 37.0B ± 0%
name allocs/op
JSONDecodeHello-4 3.00 ± 0%
Take away: API design influences allocation. Allocations can influence performance.
Implicit tokens
Let's look back at the sequence of tokens: {, "a", :, 1, ,, "b", :, true, ,, "c", :, [, 1, ,, "two", ,, null, ], and }.
It turns out that the first character in the token tells you the type of the token.
- {, } - collection start, end
- [, ] - array start, end
- t - true
- f - false
- n - null
- " - string
- -, 0-9 - a number
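As an illustration, a hypothetical helper (not part of any package) could classify a token using nothing more than its first byte:

// kind reports the kind of a JSON token based on its first byte.
func kind(tok []byte) string {
	switch tok[0] {
	case '{', '}':
		return "object"
	case '[', ']':
		return "array"
	case ':', ',':
		return "delimiter"
	case 't', 'f':
		return "boolean"
	case 'n':
		return "null"
	case '"':
		return "string"
	default: // '-', '0' through '9'
		return "number"
	}
}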
This is the first improvement we can make in our Decoder.NextToken API. Rather than converting the input []byte to a value, we just return the bytes representing the token straight from the input—a simple subslice. The first character in the []byte tells you the type of the token.
package main
import (
"fmt"
"github.com/pkg/json"
"strings"
)
func main() {
input := `{"a": 1, "b": true, "c": [1, "two", null]}`
dec := json.NewDecoder(strings.NewReader(input))
for {
tok, err := dec.NextToken()
if err != nil {
break
}
fmt.Printf("%s\t(%T)\n", tok, tok)
}
}
{ ([]uint8)
"a" ([]uint8)
1 ([]uint8)
"b" ([]uint8)
true ([]uint8)
"c" ([]uint8)
[ ([]uint8)
1 ([]uint8)
"two" ([]uint8)
null ([]uint8)
] ([]uint8)
} ([]uint8)
There are a few subtleties with this API.
- Because the output is a subslice of the input, not a copy, there are restrictions on how long the output is valid for. This is similar to the bufio.Scanner API.
- Sometimes people want to know the type of the token; collection, array, string, number, etc. Sometimes they want the token's value, the string, the number, in a form they can work with. Decoder.NextToken isn't convenient for that, but it can be used to build higher level abstractions (one sketch follows this list).
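For example, a caller that wants to keep a token beyond the next call must copy it first. A sketch, assuming the pkg/json Decoder from the example above (collectStrings is an invented helper):

// collectStrings gathers a copy of every string token in the stream.
// The copy is required because the slice returned by NextToken is only
// valid until the next call.
func collectStrings(dec *json.Decoder) [][]byte {
	var out [][]byte
	for {
		tok, err := dec.NextToken()
		if err != nil {
			return out
		}
		if tok[0] == '"' {
			out = append(out, append([]byte(nil), tok...)) // copy before the window is reused
		}
	}
}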
Reading
Let's talk about reading data. This can be tricky to do efficiently because JSON is not length delimited; you have to read to the end of the token to find the token.
io.Reader problems
The traditional way to do this is with an io.Reader.
buf := make([]byte, 4 << 10)
read, err := r.Read(buf)
But this comes with a number of problems:
- io.Reader.Read copies data from the reader into a buffer. That copying takes time.
- io.Reader.Read makes buffer management the problem of the caller.
- You can read one byte at a time, but you need a place to store the thing you're walking over, and you might need to put the byte back.
- You can read into a large buffer, then look in the buffer for the start and end of the token. If the end of the token isn't in the buffer you need to do a lot of bookkeeping and copying to move the data around in the buffer, or grow the buffer to make room for more data.
- This is tricky to get right, and inefficient even if you do.
The alternative is an idea inspired by Steven Schveighoffer's iopipe [2] and Phil Pearl [3], which I adapted into a type called byteReader.
byteReader operates similarly to bufio.Reader, but has a more efficient API.
// A byteReader implements a sliding window over an io.Reader.
type byteReader struct {
data []byte
offset int
r io.Reader
err error
}
// window returns the current window.
// The window is invalidated by calls to release or extend.
func (b *byteReader) window() []byte {
return b.data[b.offset:]
}
// release discards n bytes from the front of the window.
func (b *byteReader) release(n int) {
b.offset += n
}
// extend extends the window with data from the underlying reader.
// (The body below is a simplified sketch; the real implementation also
// compacts previously released data before growing the buffer.)
func (b *byteReader) extend() int {
	if b.err != nil {
		return 0
	}
	if len(b.data) == cap(b.data) {
		// no free space left; grow the buffer
		grown := make([]byte, len(b.data), 2*cap(b.data)+4096)
		copy(grown, b.data)
		b.data = grown
	}
	n, err := b.r.Read(b.data[len(b.data):cap(b.data)])
	b.data = b.data[:len(b.data)+n]
	b.err = err
	return n
}
Example: whitespace
JSON contains a mixture of tokens and whitespace. Space, tab, newline and carriage return can occur between tokens and are ignored; thus the search for a token begins with a search for the first non whitespace character.
{"a": 1, "b": true, "c": [1, "two", null]}
This is a good time to talk about the search for whitespace with an example of using a byteReader.
var whitespace = [256]bool{
' ': true,
'\t': true,
'\n': true,
'\r': true,
}
func countWhitespace(br *byteReader) int {
n := 0
w := br.window()
for {
for _, c := range w {
if whitespace[c] {
n++
}
}
br.release(len(w))
if br.extend() == 0 {
return n
}
w = br.window()
}
}
This does the minimum: visit each character and test whether it matches a whitespace character. Any (useful) JSON decoder cannot go faster than this.
name time/op
CountWhitespace/canada-16 1.10ms ± 2%
CountWhitespace/citm_catalog-16 838µs ± 1%
CountWhitespace/twitter-16 306µs ± 1%
CountWhitespace/code-16 937µs ± 1%
CountWhitespace/example-16 6.40µs ± 1%
CountWhitespace/sample-16 333µs ± 1%
name speed
CountWhitespace/canada-16 2.04GB/s ± 2%
CountWhitespace/citm_catalog-16 2.06GB/s ± 1%
CountWhitespace/twitter-16 2.06GB/s ± 1%
CountWhitespace/code-16 2.07GB/s ± 1%
CountWhitespace/example-16 2.04GB/s ± 1%
CountWhitespace/sample-16 2.06GB/s ± 1%
So this is our baseline.
Quiz me after the talk on the best way to write a general purpose whitespace counter.
Scanning
Now that we can tell which characters are tokens and which are simply whitespace, let's step up a level and talk about scanning.
// Next returns a []byte referencing the next lexical token in the stream.
// The []byte is valid until Next is called again.
// If the stream is at its end, or an error has occurred, Next returns a zero
// length []byte slice.
//
// A valid token begins with one of the following:
//
// { Object start
// [ Array start
// } Object end
// ] Array End
// , Literal comma
// : Literal colon
// t JSON true
// f JSON false
// n JSON null
// " A string, possibly containing backslash escaped entites.
// -, 0-9 A number
func (s *Scanner) Next() []byte {
s.br.release(s.offset) // release the previous token
s.offset = 0
c := s.token() // align to the next token
length := 0
switch c {
case ObjectStart, ObjectEnd, Colon, Comma, ArrayStart, ArrayEnd:
length = 1
s.offset = 1
case True:
length = validateToken(&s.br, "true")
s.offset = length
case False:
length = validateToken(&s.br, "false")
s.offset = length
case Null:
length = validateToken(&s.br, "null")
s.offset = length
case String:
length = parseString(&s.br)
if length < 2 {
return nil
}
s.offset = length
case 0:
// eof
return nil
default:
length = s.parseNumber()
if length < 0 {
return nil
}
}
return s.br.window()[:length]
}
This is the core loop of Scanner.Next
.
Scanner.Next
skips over any intermediate whitespace, determines the token from the first character in the window, then continues to read until the token is read or we hit the end of the input.
Let's look at how token works, then we'll talk about some optimisations.
// token positions the scanner at the next token in the stream
// and returns the first byte of the token.
func (s *Scanner) token() byte {
w := s.br.window()
for {
for _, c := range w {
if whitespace[c] {
s.offset++
continue
}
// release whitespace
s.br.release(s.offset)
s.offset = 0
return c
}
if s.br.extend() == 0 {
// eof
return 0
}
w = s.br.window()[s.offset:]
}
}
- We start by getting the current window from the byteReader. This is a []byte slice of all the data that is yet to be read.
- We're looking for the first non whitespace character. If the character is whitespace we increment s.offset to ignore the character and loop around.
- If we do find a non whitespace character, we release s.offset characters from the front of the window. Now the start of the window is properly aligned with the first character of the token.
- It turns out that we also get the first character of the token for free; it's in c, so we can return that as a hint to Scanner.Next.
- If we run out of characters without hitting a token then we call extend() to grow the window.
- If we couldn't grow, then we've run out of input and haven't got a token, so give up.
- Otherwise update w with the new window.
Some things to note:
- Note the lack of error handling. It's not part of the inner loop; it only happens when we have to read more data from the underlying reader via extend. [4]
- extend hides the process of reading into, growing, and refilling the buffer. This makes the caller--Scanner.token--simpler; if there is data in the window, process it, extend if you need to. If you can't extend, give up.
- release is similar; it shrinks the start of the window to exclude data that we no longer care about.
- extend is not in the hot path, so there is no need to optimise it; its performance is a function of the buffer it is given. In practice an initial buffer of 8k is sufficient.
Let’s talk about the performance of this code.
name time/op
Scanner/canada-16 4.41ms ± 2%
Scanner/citm_catalog-16 2.55ms ± 3%
Scanner/twitter-16 1.03ms ± 1%
Scanner/code-16 4.21ms ± 1%
Scanner/example-16 21.4µs ± 1%
Scanner/sample-16 822µs ± 1%
name speed
Scanner/canada-16 510MB/s ± 2%
Scanner/citm_catalog-16 677MB/s ± 3%
Scanner/twitter-16 615MB/s ± 1%
Scanner/code-16 461MB/s ± 1%
Scanner/example-16 608MB/s ± 1%
Scanner/sample-16 837MB/s ± 1%
Comparing the performance of Scanner.Next
to our whitespace benchmark we can see that we’re between 1/4 and 2/5ths of our baseline.
Let’s talk about improvements we can make to the code.
Note the amount of work being spent to keep s.offset
up to date.
We know that s.offset
is set to 0 just before Scanner.Next
calls this function, and we set s.offset
to zero on the way out of the function, so the changes we make to s.offset
within the function are invisible—it’s zero on entry, and zero on exit.
We can rewrite the function to keep a local offset
value, which has an impressive effect on token
.
func (s *Scanner) token() byte {
w := s.br.window()
offset := 0
for {
for _, c := range w {
if whitespace[c] {
offset++
continue
}
// release whitespace
s.br.release(offset)
return c
}
if s.br.extend() == 0 {
// eof
return 0
}
w = s.br.window()[offset:]
}
}
name old time/op new time/op delta
Scanner/canada-16 4.39ms ± 1% 4.43ms ± 4% ~ (p=1.000 n=5+5)
Scanner/citm_catalog-16 2.52ms ± 1% 1.80ms ± 4% -28.46% (p=0.008 n=5+5)
Scanner/twitter-16 1.03ms ± 2% 0.95ms ± 3% -7.41% (p=0.008 n=5+5)
Scanner/code-16 4.24ms ± 2% 4.18ms ± 1% ~ (p=0.095 n=5+5)
Scanner/example-16 21.4µs ± 1% 18.9µs ± 2% -11.68% (p=0.008 n=5+5)
Scanner/sample-16 828µs ± 2% 528µs ± 2% -36.24% (p=0.008 n=5+5)
name old speed new speed delta
Scanner/canada-16 512MB/s ± 1% 509MB/s ± 4% ~ (p=1.000 n=5+5)
Scanner/citm_catalog-16 685MB/s ± 1% 958MB/s ± 4% +39.84% (p=0.008 n=5+5)
Scanner/twitter-16 616MB/s ± 2% 665MB/s ± 3% +8.01% (p=0.008 n=5+5)
Scanner/code-16 458MB/s ± 2% 465MB/s ± 1% ~ (p=0.095 n=5+5)
Scanner/example-16 608MB/s ± 1% 688MB/s ± 2% +13.23% (p=0.008 n=5+5)
Scanner/sample-16 831MB/s ± 2% 1303MB/s ± 2% +56.84% (p=0.008 n=5+5)
By keeping offset
local the compiler avoided those temporary writes back to memory.
The question I have for you is why did this improve some inputs and not others?
The answer is different inputs have different amounts of whitespace.
For example canada
only has 33 whitespace characters whereas citm
has 1,227,563.
Quiz me after the talk why the compiler can’t optimise these writes away.
Hint: panic()
Inlining
There is a larger improvement we can make to the runtime of this code, and it relates to inlining. Inlining is the process of automatically (or manually) copying the body of a function into its caller, in line. This avoids the overhead of the function call.
The Go compiler has reasonable support for inlining, but has a number of limitations.
% go build -gcflags=-m=2 2>&1 | grep cannot | grep -v decoder
./reader.go:31:6: cannot inline (*byteReader).extend: function too complex: cost 198 exceeds budget 80
./scanner.go:99:6: cannot inline (*Scanner).token: unhandled op FOR
./scanner.go:130:6: cannot inline validateToken: unhandled op FOR
./scanner.go:153:6: cannot inline parseString: unhandled op FOR
./scanner.go:182:6: cannot inline (*Scanner).parseNumber: unhandled op FOR
./scanner.go:56:6: cannot inline (*Scanner).Next: function too complex: cost 476 exceeds budget 80
The first limitation is the size of the function: byteReader.extend cannot be inlined because it is too complex. [5]
The second is certain statements within the function: Scanner.token cannot be inlined because it contains a for statement.
Let’s go back to the constraints:
- Scanner.Next is called for each token in the input.
- This means that Scanner.token is called for each token in the input.
- Scanner.token cannot be automatically inlined into its caller because it is too complex.
- Therefore we're paying two function calls per token.
We can remove one of these by manually inlining Scanner.token
into its caller.
func (s *Scanner) Next() []byte {
// release the previous token
s.br.release(s.offset)
w := s.br.window()
offset := 0
for {
for _, c := range w {
if whitespace[c] {
offset++
continue
}
// release whitespace
s.br.release(offset)
length := 0
switch c {
case ObjectStart, ObjectEnd, Colon, Comma, ArrayStart, ArrayEnd:
length = 1
s.offset = 1
case True:
length = validateToken(&s.br, "true")
s.offset = length
case False:
length = validateToken(&s.br, "false")
s.offset = length
case Null:
length = validateToken(&s.br, "null")
s.offset = length
case String:
// string
length = parseString(&s.br)
if length < 2 {
return nil
}
s.offset = length
default:
// ensure the number is correct.
length = s.parseNumber()
if length < 0 {
return nil
}
}
return s.br.window()[:length]
}
if s.br.extend() == 0 {
// eof
return nil
}
w = s.br.window()[offset:]
}
}
The results support our thesis:
name old time/op new time/op delta
Scanner/canada-16 4.36ms ± 1% 3.50ms ± 0% -19.68% (p=0.008 n=5+5)
Scanner/citm_catalog-16 1.80ms ± 1% 1.56ms ± 2% -13.16% (p=0.008 n=5+5)
Scanner/twitter-16 965µs ± 2% 833µs ± 2% -13.75% (p=0.008 n=5+5)
Scanner/code-16 4.15ms ± 1% 3.61ms ± 1% -12.82% (p=0.008 n=5+5)
Scanner/example-16 18.9µs ± 2% 16.6µs ± 1% -12.42% (p=0.008 n=5+5)
Scanner/sample-16 515µs ± 1% 472µs ± 2% -8.34% (p=0.008 n=5+5)
name old speed new speed delta
Scanner/canada-16 516MB/s ± 1% 642MB/s ± 0% +24.50% (p=0.008 n=5+5)
Scanner/citm_catalog-16 960MB/s ± 1% 1105MB/s ± 2% +15.16% (p=0.008 n=5+5)
Scanner/twitter-16 654MB/s ± 2% 759MB/s ± 1% +15.94% (p=0.008 n=5+5)
Scanner/code-16 468MB/s ± 1% 537MB/s ± 1% +14.69% (p=0.008 n=5+5)
Scanner/example-16 689MB/s ± 2% 787MB/s ± 1% +14.17% (p=0.008 n=5+5)
Scanner/sample-16 1.33GB/s ± 1% 1.46GB/s ± 1% +9.11% (p=0.008 n=5+5)
By saving the function call we’ve improved throughput by 9-24%.
The largest improvement comes from canada
, which basically contained no whitespace, so the call to Scanner.token
almost always returned immediately having done no work!
To recap the major optimisations were:
- Avoiding s.offset updates. They cannot be registerised; the CPU has to do a write on every iteration. s.offset updates were reduced from one per byte to one per token.
- Scanner.Next and Scanner.token were effectively one function spread over two. Each is too large to be inlined, so we were paying for an extra function call per token. Manually inlining them increased the indentation depth of the function, but delivered substantial speedups.
- Most JSON contains some whitespace; it's moderately optimised for human readability. It turns out, the more whitespace, the faster pkg/json decodes! citm is over 50% of the baseline, sample is nearly 75%.
Take away: Avoiding function calls can improve performance in the hot path.
Decoding
So far we have a Scanner
which tokenises input at 50-75% of the whitespace baseline.
But there are a few more things we need to make the scanner fully functional.
The first part of that is validation.
Validation
JSON is a state machine.
Depending on the current token, only certain subsequent tokens are valid.
For example, if you've read the tokens { and "username", then the only valid next token is :.
To track this we need to layer some logic on top of Scanner.Next to assert that the sequence of tokens is valid.
This is the role of Decoder.NextToken:
const (
stateValue = 0
stateObjectString = iota
stateObjectColon
stateObjectValue
stateObjectComma
stateArrayValue
stateArrayComma
stateEnd
)
func (d *Decoder) NextToken() ([]byte, error) {
tok := d.scanner.Next()
if len(tok) < 1 {
return nil, io.EOF
}
switch d.state {
case stateValue:
return d.stateValue(tok)
case stateObjectString:
return d.stateObjectString(tok)
case stateObjectColon:
return d.stateObjectColon(tok)
case stateObjectValue:
return d.stateObjectValue(tok)
case stateObjectComma:
return d.stateObjectComma(tok)
case stateArrayValue:
return d.stateArrayValue(tok)
case stateArrayComma:
return d.stateArrayComma(tok)
case stateEnd:
fallthrough
default:
return nil, io.EOF
}
}
This is pretty straightforward stuff.
We track the current state in d.state and, based on its value, dispatch to the various state methods, which assert the token is valid and update the state.
Here is the source for the various state methods.
func (d *Decoder) stateObjectString(tok []byte) ([]byte, error) {
switch tok[0] {
case '}':
inObj := d.pop()
switch {
case d.len() == 0:
d.state = stateEnd
case inObj:
d.state = stateObjectComma
case !inObj:
d.state = stateArrayComma
}
return tok, nil
case '"':
d.state = stateObjectColon
return tok, nil
default:
return nil, fmt.Errorf("stateObjectString: missing string key")
}
}
func (d *Decoder) stateObjectColon(tok []byte) ([]byte, error) {
switch tok[0] {
case Colon:
d.state = stateObjectValue
return d.NextToken()
default:
return tok, fmt.Errorf("stateObjectColon: expecting colon")
}
}
func (d *Decoder) stateObjectValue(tok []byte) ([]byte, error) {
switch tok[0] {
case '{':
d.state = stateObjectString
d.push(true)
return tok, nil
case '[':
d.state = stateArrayValue
d.push(false)
return tok, nil
default:
d.state = stateObjectComma
return tok, nil
}
}
func (d *Decoder) stateObjectComma(tok []byte) ([]byte, error) {
switch tok[0] {
case '}':
inObj := d.pop()
switch {
case d.len() == 0:
d.state = stateEnd
case inObj:
d.state = stateObjectComma
case !inObj:
d.state = stateArrayComma
}
return tok, nil
case Comma:
d.state = stateObjectString
return d.NextToken()
default:
return tok, fmt.Errorf("stateObjectComma: expecting comma")
}
}
func (d *Decoder) stateArrayValue(tok []byte) ([]byte, error) {
switch tok[0] {
case '{':
d.state = stateObjectString
d.push(true)
return tok, nil
case '[':
d.state = stateArrayValue
d.push(false)
return tok, nil
case ']':
inObj := d.pop()
switch {
case d.len() == 0:
d.state = stateEnd
case inObj:
d.state = stateObjectComma
case !inObj:
d.state = stateArrayComma
}
return tok, nil
case ',':
return nil, fmt.Errorf("stateArrayValue: unexpected comma")
default:
d.state = stateArrayComma
return tok, nil
}
}
func (d *Decoder) stateArrayComma(tok []byte) ([]byte, error) {
switch tok[0] {
case ']':
inObj := d.pop()
switch {
case d.len() == 0:
d.state = stateEnd
case inObj:
d.state = stateObjectComma
case !inObj:
d.state = stateArrayComma
}
return tok, nil
case Comma:
d.state = stateArrayValue
return d.NextToken()
default:
return nil, fmt.Errorf("stateArrayComma: expected comma, %v", d.stack)
}
}
func (d *Decoder) stateValue(tok []byte) ([]byte, error) {
switch tok[0] {
case '{':
d.state = stateObjectString
d.push(true)
return tok, nil
case '[':
d.state = stateArrayValue
d.push(false)
return tok, nil
case ',':
return nil, fmt.Errorf("stateValue: unexpected comma")
default:
d.state = stateEnd
return tok, nil
}
}
Let’s look at the results.
name                                             time/op
DecoderNextToken/pkgjson/canada-16               5.64ms ± 1%
DecoderNextToken/encodingjson/canada-16          65.1ms ± 1%
DecoderNextToken/pkgjson/citm_catalog-16         2.42ms ± 2%
DecoderNextToken/encodingjson/citm_catalog-16    19.8ms ± 1%
DecoderNextToken/pkgjson/twitter-16              1.18ms ± 1%
DecoderNextToken/encodingjson/twitter-16         12.1ms ± 0%
DecoderNextToken/pkgjson/code-16                 6.10ms ± 2%
DecoderNextToken/encodingjson/code-16            77.5ms ± 1%
DecoderNextToken/pkgjson/example-16              25.2µs ± 1%
DecoderNextToken/encodingjson/example-16         266µs ± 1%
DecoderNextToken/pkgjson/sample-16               559µs ± 0%
DecoderNextToken/encodingjson/sample-16          3.18ms ± 2%

name                                             speed
DecoderNextToken/pkgjson/canada-16               399MB/s ± 1%
DecoderNextToken/encodingjson/canada-16          34.6MB/s ± 1%
DecoderNextToken/pkgjson/citm_catalog-16         713MB/s ± 2%
DecoderNextToken/encodingjson/citm_catalog-16    87.1MB/s ± 1%
DecoderNextToken/pkgjson/twitter-16              537MB/s ± 1%
DecoderNextToken/encodingjson/twitter-16         52.2MB/s ± 0%
DecoderNextToken/pkgjson/code-16                 318MB/s ± 2%
DecoderNextToken/encodingjson/code-16            25.0MB/s ± 1%
DecoderNextToken/pkgjson/example-16              517MB/s ± 1%
DecoderNextToken/encodingjson/example-16         49.0MB/s ± 1%
DecoderNextToken/pkgjson/sample-16               1.23GB/s ± 0%
DecoderNextToken/encodingjson/sample-16          216MB/s ± 2%
Compared to encoding/json
we’re 8-10x faster.
But there are some things we can do to improve:
Linear search
Central to the operation of Decoder.NextToken
is the switch
statement.
In the general case, switch
is implemented as a sequence of if
statements.
Effectively what the compiler sees is
func (d *Decoder) NextToken() ([]byte, error) {
tok := d.scanner.Next()
if len(tok) < 1 {
return nil, io.EOF
}
if d.state == stateValue {
return d.stateValue(tok)
}
if d.state == stateObjectString {
return d.stateObjectString(tok)
}
if d.state == stateObjectColon {
return d.stateObjectColon(tok)
}
if d.state == stateObjectValue {
return d.stateObjectValue(tok)
}
if d.state == stateObjectComma {
return d.stateObjectComma(tok)
}
if d.state == stateArrayValue {
return d.stateArrayValue(tok)
}
if d.state == stateArrayComma {
return d.stateArrayComma(tok)
}
return nil, io.EOF
}
switch is convenient, but not optimal in the hot path, because that long sequence of if statements breaks up the instruction stream and puts pressure on the CPU's branch predictor. Mispredicting a branch is expensive: the CPU has to flush the pipeline, rewind, and take the other branch.
This problem turns up in many places; bytecode interpreters are a classic example. One of the optimisations we could make is to turn this linear search into a table. This can be space efficient if the state space is small and dense (often it is not), and the result might be something like this:
var stateTable = [...]func(*Decoder, []byte) ([]byte, error){
stateValue: (*Decoder).stateValue,
stateObjectString: (*Decoder).stateObjectString,
stateObjectColon: (*Decoder).stateObjectColon,
stateObjectValue: (*Decoder).stateObjectValue,
stateObjectComma: (*Decoder).stateObjectComma,
stateArrayValue: (*Decoder).stateArrayValue,
stateArrayComma: (*Decoder).stateArrayComma,
stateEnd: (*Decoder).stateEnd,
}
func (d *Decoder) NextToken() ([]byte, error) {
tok := d.scanner.Next()
if len(tok) < 1 {
return nil, io.EOF
}
return stateTable[d.state](d, tok)
}
Unfortunately this won't compile because there is an initialisation loop.
./decoder.go:115:5: initialization loop:
/Users/davecheney/devel/json/decoder.go:115:5: stateTable refers to
/Users/davecheney/devel/json/decoder.go:155:19: (*Decoder).stateObjectColon refers to
/Users/davecheney/devel/json/decoder.go:126:19: (*Decoder).NextToken refers to
/Users/davecheney/devel/json/decoder.go:115:5: stateTable
FAIL github.com/pkg/json [build failed]
But there is a better trick that we can use that is more space efficient than this table, and is sometimes called a computed goto.
Computed Goto
If you look at the table above there is a pattern.
Each state
enumeration is matched with exactly one method.
What we store in d.state
is a proxy for the method we want to call, but to call the method, we have to switch
on d.state
to find the appropriate method.
What if we could just store the method directly and call it directly? And in fact, we can do just that with some old Go 1.0 magic: method expressions.
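If method expressions are unfamiliar, here is a tiny standalone example; the Greeter type is invented purely for illustration:

package main

import "fmt"

type Greeter struct{ name string }

func (g *Greeter) Hello() string { return "hello, " + g.name }

func main() {
	// A method expression turns a method into an ordinary function
	// value whose first argument is the receiver.
	hello := (*Greeter).Hello
	fmt.Println(hello(&Greeter{name: "gopher"})) // hello, gopher
}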
// A Decoder decodes JSON values from an input stream.
type Decoder struct {
scanner *Scanner
state func(*Decoder, []byte) ([]byte, error)
stack // keeps track of whether we're in an object or array
}
func (d *Decoder) NextToken() ([]byte, error) {
tok := d.scanner.Next()
if len(tok) < 1 {
return nil, io.EOF
}
return d.state(d, tok)
}
Let’s see how this performs.
name old time/op new time/op delta
DecoderNextToken/pkgjson/canada-16 5.60ms ± 0% 5.61ms ± 0% ~ (p=0.841 n=5+5)
DecoderNextToken/encodingjson/canada-16 64.9ms ± 1% 64.4ms ± 0% ~ (p=0.222 n=5+5)
DecoderNextToken/pkgjson/citm_catalog-16 2.41ms ± 1% 2.42ms ± 1% ~ (p=0.310 n=5+5)
DecoderNextToken/encodingjson/citm_catalog-16 19.8ms ± 1% 19.7ms ± 1% ~ (p=0.548 n=5+5)
DecoderNextToken/pkgjson/twitter-16 1.17ms ± 1% 1.20ms ± 2% +2.70% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/twitter-16 12.2ms ± 1% 12.1ms ± 1% ~ (p=1.000 n=5+5)
DecoderNextToken/pkgjson/code-16 6.11ms ± 3% 6.37ms ± 4% +4.26% (p=0.016 n=5+5)
DecoderNextToken/encodingjson/code-16 77.7ms ± 2% 78.6ms ± 2% ~ (p=0.310 n=5+5)
DecoderNextToken/pkgjson/example-16 25.0µs ± 0% 25.6µs ± 1% +2.25% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/example-16 263µs ± 1% 265µs ± 2% ~ (p=0.222 n=5+5)
DecoderNextToken/pkgjson/sample-16 560µs ± 1% 553µs ± 0% -1.15% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/sample-16 3.16ms ± 1% 3.17ms ± 1% ~ (p=0.690 n=5+5)
name old speed new speed delta
DecoderNextToken/pkgjson/canada-16 402MB/s ± 0% 401MB/s ± 0% ~ (p=0.841 n=5+5)
DecoderNextToken/encodingjson/canada-16 34.7MB/s ± 1% 35.0MB/s ± 0% ~ (p=0.222 n=5+5)
DecoderNextToken/pkgjson/citm_catalog-16 716MB/s ± 1% 713MB/s ± 1% ~ (p=0.310 n=5+5)
DecoderNextToken/encodingjson/citm_catalog-16 87.3MB/s ± 1% 87.6MB/s ± 1% ~ (p=0.548 n=5+5)
DecoderNextToken/pkgjson/twitter-16 539MB/s ± 1% 524MB/s ± 2% -2.62% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/twitter-16 51.9MB/s ± 1% 52.0MB/s ± 1% ~ (p=1.000 n=5+5)
DecoderNextToken/pkgjson/code-16 318MB/s ± 2% 305MB/s ± 3% -4.07% (p=0.016 n=5+5)
DecoderNextToken/encodingjson/code-16 25.0MB/s ± 2% 24.7MB/s ± 2% ~ (p=0.310 n=5+5)
DecoderNextToken/pkgjson/example-16 521MB/s ± 0% 509MB/s ± 1% -2.19% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/example-16 49.5MB/s ± 1% 49.1MB/s ± 2% ~ (p=0.222 n=5+5)
DecoderNextToken/pkgjson/sample-16 1.23GB/s ± 1% 1.24GB/s ± 0% +1.16% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/sample-16 218MB/s ± 1% 217MB/s ± 1% ~ (p=0.690 n=5+5)
The results aren’t that promising, but now we can unlock our final optimisation.
Outlining
Compare the previous version of Decoder.NextToken
func (d *Decoder) NextToken() ([]byte, error) {
tok := d.scanner.Next()
if len(tok) < 1 {
return nil, io.EOF
}
return d.state(d, tok)
}
with the version after outlining.
func (d *Decoder) NextToken() ([]byte, error) {
return d.state(d)
}
func (d *Decoder) stateObjectColon() ([]byte, error) {
tok := d.scanner.Next()
if len(tok) < 1 {
return nil, io.ErrUnexpectedEOF
}
switch tok[0] {
case Colon:
d.state = (*Decoder).stateObjectValue
return d.NextToken()
default:
return tok, fmt.Errorf("stateObjectColon: expecting colon")
}
}
// repeated for each Decoder.state... method
Moving the tok := d.scanner.Next()
call into each state method might seem like a step backwards, but it has several positive effects.
- The first is not passing tok into each state method. This saves 3 words on the call stack.
- The second is, by moving the if len(tok) < 1 check into the same function as the switch, it enables bounds check elimination (a small sketch follows at the end of this section).

  Previously, when the len(tok) check happened in Decoder.NextToken, Decoder.stateObjectColon knew nothing about the length of tok, because it cannot be inlined--we're calling it via a method expression. When the compiler encounters switch tok[0], it needs to insert a bounds check to make sure tok is at least 1 element long.

  When the if check is moved into the same function, the compiler knows that if we get past the check then tok is at least 1 element long, so the bounds check is not needed. We can see this in the debug information.

  % go build -gcflags=-d=ssa/prove/debug=2 2>&1 | grep 135:
  ./decoder.go:135:13: Proved IsInBounds (v31)

- The final optimisation occurs because Decoder.NextToken, which was previously too complex to inline,

  ./decoder.go:107:6: cannot inline (*Decoder).NextToken: function too complex: cost 145 exceeds budget 80

  is now inlinable:

  ./bench_test.go:217:29: inlining call to (*Decoder).NextToken method(*Decoder) func() ([]byte, error) { return d.state(d) }
Which means calls to dec.NextToken()
for {
_, err := dec.NextToken()
if err == io.EOF {
break
}
check(b, err)
n++
}
become direct calls to the current state method.
for {
_, err := dec.state(dec) // this is what the compiler sees
if err == io.EOF {
break
}
check(b, err)
n++
}
Eliminating the cost of the function call.
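Returning to the bounds check elimination mentioned above, here is a minimal sketch of the same effect; firstByte is a hypothetical function, written purely to show what the prove pass sees:

// Because the length check and the indexing live in the same function,
// the compiler can prove tok[0] is in bounds and drop the bounds check.
func firstByte(tok []byte) byte {
	if len(tok) < 1 {
		return 0
	}
	return tok[0] // Proved IsInBounds
}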
Results
At the lowest level pkg/json.Scanner
can tokenise streaming JSON without allocation (provided it is supplied a few kilobytes of buffer).
BenchmarkScanner/canada-16 1561 3524005 ns/op 638.78 MB/s 0 B/op 0 allocs/op
BenchmarkScanner/citm_catalog-16 3556 1555322 ns/op 1110.51 MB/s 0 B/op 0 allocs/op
BenchmarkScanner/twitter-16 7068 836031 ns/op 755.37 MB/s 0 B/op 0 allocs/op
BenchmarkScanner/code-16 1543 3640425 ns/op 533.03 MB/s 0 B/op 0 allocs/op
BenchmarkScanner/example-16 341224 16362 ns/op 796.00 MB/s 0 B/op 0 allocs/op
BenchmarkScanner/sample-16 12771 467360 ns/op 1471.01 MB/s 0 B/op 0 allocs/op
At the next level pkg/json.Decoder.Token
is 2-3x faster than encoding/json.Decoder.Token
.
BenchmarkDecoderToken/pkgjson/canada-16 267 22073095 ns/op 101.98 MB/s 4402914 B/op 222279 allocs/op
BenchmarkDecoderToken/encodingjson/canada-16 86 67830737 ns/op 33.19 MB/s 17740387 B/op 889106 allocs/op
BenchmarkDecoderToken/pkgjson/citm_catalog-16 1114 5183226 ns/op 333.23 MB/s 965992 B/op 81995 allocs/op
BenchmarkDecoderToken/encodingjson/citm_catalog-16 288 20882018 ns/op 82.71 MB/s 5661597 B/op 324692 allocs/op
BenchmarkDecoderToken/pkgjson/twitter-16 2356 2467042 ns/op 255.98 MB/s 768354 B/op 38992 allocs/op
BenchmarkDecoderToken/encodingjson/twitter-16 471 12606205 ns/op 50.10 MB/s 3584017 B/op 187319 allocs/op
BenchmarkDecoderToken/pkgjson/code-16 346 16877006 ns/op 114.98 MB/s 4304233 B/op 320235 allocs/op
BenchmarkDecoderToken/encodingjson/code-16 73 80255994 ns/op 24.18 MB/s 23355962 B/op 1319125 allocs/op
BenchmarkDecoderToken/pkgjson/example-16 113912 53083 ns/op 245.35 MB/s 16016 B/op 914 allocs/op
BenchmarkDecoderToken/encodingjson/example-16 21734 273991 ns/op 47.53 MB/s 82416 B/op 4325 allocs/op
BenchmarkDecoderToken/pkgjson/sample-16 6642 871796 ns/op 788.59 MB/s 213761 B/op 5081 allocs/op
BenchmarkDecoderToken/encodingjson/sample-16 1803 3287623 ns/op 209.12 MB/s 723782 B/op 26095 allocs/op
Because allocations make up a large proportion of the cost of the Decoder.Token API, pkg/json.Decoder provides an alternative API that produces significantly fewer allocations and is 8-10x faster.
BenchmarkDecoderNextToken/pkgjson/canada-16 1197 4825232 ns/op 466.52 MB/s 136 B/op 3 allocs/op
BenchmarkDecoderNextToken/encodingjson/canada-16 90 65392440 ns/op 34.42 MB/s 17740399 B/op 889106 allocs/op
BenchmarkDecoderNextToken/pkgjson/citm_catalog-16 2709 2162849 ns/op 798.58 MB/s 136 B/op 3 allocs/op
BenchmarkDecoderNextToken/encodingjson/citm_catalog-16 301 20064314 ns/op 86.08 MB/s 5661597 B/op 324692 allocs/op
BenchmarkDecoderNextToken/pkgjson/twitter-16 5395 1068106 ns/op 591.25 MB/s 152 B/op 4 allocs/op
BenchmarkDecoderNextToken/encodingjson/twitter-16 494 12072956 ns/op 52.31 MB/s 3584013 B/op 187319 allocs/op
BenchmarkDecoderNextToken/pkgjson/code-16 1135 5124666 ns/op 378.65 MB/s 248 B/op 6 allocs/op
BenchmarkDecoderNextToken/encodingjson/code-16 74 77579973 ns/op 25.01 MB/s 23355955 B/op 1319125 allocs/op
BenchmarkDecoderNextToken/pkgjson/example-16 269010 22323 ns/op 583.43 MB/s 152 B/op 4 allocs/op
BenchmarkDecoderNextToken/encodingjson/example-16 22707 264078 ns/op 49.32 MB/s 82416 B/op 4325 allocs/op
BenchmarkDecoderNextToken/pkgjson/sample-16 10000 510445 ns/op 1346.85 MB/s 1144 B/op 9 allocs/op
BenchmarkDecoderNextToken/encodingjson/sample-16 1836 3161804 ns/op 217.44 MB/s 723781 B/op 26095 allocs/op
At the highest level, pkg/json
can unmarshal data into a Go object with the same API as encoding/json
.
BenchmarkDecoderDecodeInterfaceAny/pkgjson/canada-16 217 27425893 ns/op 82.08 MB/s 8747163 B/op 281408 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/canada-16 153 38347477 ns/op 58.70 MB/s 20647616 B/op 392553 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/citm_catalog-16 747 8008839 ns/op 215.66 MB/s 5197853 B/op 89673 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/citm_catalog-16 360 16607501 ns/op 104.00 MB/s 9406809 B/op 95389 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/twitter-16 1606 3714515 ns/op 170.01 MB/s 2130731 B/op 30182 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/twitter-16 862 6927998 ns/op 91.15 MB/s 4283407 B/op 31278 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/code-16 333 17939351 ns/op 108.17 MB/s 7331643 B/op 232059 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/code-16 236 25324951 ns/op 76.62 MB/s 12332753 B/op 271292 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/example-16 76874 78079 ns/op 166.81 MB/s 50980 B/op 739 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/example-16 40886 146685 ns/op 88.79 MB/s 82855 B/op 782 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/sample-16 5240 1116081 ns/op 615.99 MB/s 399970 B/op 5542 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/sample-16 1123 5369313 ns/op 128.04 MB/s 2661872 B/op 7527 allocs/op
Thematic ideas
- Allocations affect performance

  The garbage collector might be fast to allocate and efficient to collect, but not allocating will always be faster.
- Eliminating allocations through API design

  Most of the speedups of this package come from reducing allocations; specifically, the time not spent in the heap allocation path, and the time not spent in GC cycles, is available for scanning.
- API design influences performance

  The encoding/json.Decoder API requires allocations because returning a primitive value via an interface causes it to escape to the heap — it effectively becomes a pointer to the value (a small illustration follows this list). When dealing with data, allocations can be the biggest performance cost of an algorithm; at that point you're benchmarking the garbage collector.
- Careful attention to the per byte and per token overheads

  Optimising the hot path to convert per byte function calls into per token function calls was the second biggest performance win.
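A minimal illustration of that escape; box is a hypothetical stand-in for any API that returns a concrete value through an interface:

// The float64 cannot live in the interface header, so it is copied to
// the heap and the interface points at it. go build -gcflags=-m will
// typically report the value escaping.
func box(v float64) interface{} {
	return v // v escapes to the heap
}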
I started this project because I wanted to apply some ideas that I had seen to a problem in Go.
I believed that it was possible to implement an efficient JSON parser based on my assumption that encoding/json was slower than it could be because of its API.
It turned out there is 2-3x performance available in some unmarshalling paths, and between 8x and 10x in tokenisation, if you're prepared to accept a different API.
Thank you
- Thank you to Valentine and all the GopherCon Singapore organisers. ❤️
- Link to this presentation: dave.cheney.net/paste/gophercon-sg-2023.html
- Link to the code: github.com/pkg/json