JSON is important; damn near everything that we do as programmers or operators involves JSON at some point. JSON decoding is expensive, and if your product talks JSON then the performance of marshalling data in and out of JSON matters. This is a talk about designing an efficient replacement for encoding/json.Decoder.
The code for this talk is available at https://github.com/pkg/json.
Thanks to Jon Forrest and Dave Russell for copy editing.
Problem statement
JSON is an important data interchange format. Damn near everything we do as programmers involves JSON in some way.
At the same time, JSON decoding is expensive.
Every Go release we see improvements in the efficiency of the encoding/json
package.
Sometimes these are large improvements, like the move away from segmented stacks in Go 1.3; more recently the improvements have been moderate.
In the 1.15 cycle I’ve seen two performance improvements that had to be rolled back because, while they made it faster, they broke a subtle implicit behaviour that people were relying on.
c.f. Hyrum’s Law.[1]
In the Go ecosystem there are a bunch of alternative JSON libraries, usually maintained by a small number of people, so this suggested to me that this is a problem with some meat on its bones, and equally, not impenetrable. I figured I’d give it a try.
Results
At the lowest level, pkg/json.Scanner can tokenize streaming JSON without allocation (provided it is supplied a few kilobytes of buffer).
BenchmarkScanner/canada-16 1561 3524005 ns/op 638.78 MB/s 0 B/op 0 allocs/op
BenchmarkScanner/citm_catalog-16 3556 1555322 ns/op 1110.51 MB/s 0 B/op 0 allocs/op
BenchmarkScanner/twitter-16 7068 836031 ns/op 755.37 MB/s 0 B/op 0 allocs/op
BenchmarkScanner/code-16 1543 3640425 ns/op 533.03 MB/s 0 B/op 0 allocs/op
BenchmarkScanner/example-16 341224 16362 ns/op 796.00 MB/s 0 B/op 0 allocs/op
BenchmarkScanner/sample-16 12771 467360 ns/op 1471.01 MB/s 0 B/op 0 allocs/op
At the next level, pkg/json.Decoder.Token is 2-3x faster than encoding/json.Decoder.Token.
BenchmarkDecoderToken/pkgjson/canada-16 267 22073095 ns/op 101.98 MB/s 4402914 B/op 222279 allocs/op
BenchmarkDecoderToken/encodingjson/canada-16 86 67830737 ns/op 33.19 MB/s 17740387 B/op 889106 allocs/op
BenchmarkDecoderToken/pkgjson/citm_catalog-16 1114 5183226 ns/op 333.23 MB/s 965992 B/op 81995 allocs/op
BenchmarkDecoderToken/encodingjson/citm_catalog-16 288 20882018 ns/op 82.71 MB/s 5661597 B/op 324692 allocs/op
BenchmarkDecoderToken/pkgjson/twitter-16 2356 2467042 ns/op 255.98 MB/s 768354 B/op 38992 allocs/op
BenchmarkDecoderToken/encodingjson/twitter-16 471 12606205 ns/op 50.10 MB/s 3584017 B/op 187319 allocs/op
BenchmarkDecoderToken/pkgjson/code-16 346 16877006 ns/op 114.98 MB/s 4304233 B/op 320235 allocs/op
BenchmarkDecoderToken/encodingjson/code-16 73 80255994 ns/op 24.18 MB/s 23355962 B/op 1319125 allocs/op
BenchmarkDecoderToken/pkgjson/example-16 113912 53083 ns/op 245.35 MB/s 16016 B/op 914 allocs/op
BenchmarkDecoderToken/encodingjson/example-16 21734 273991 ns/op 47.53 MB/s 82416 B/op 4325 allocs/op
BenchmarkDecoderToken/pkgjson/sample-16 6642 871796 ns/op 788.59 MB/s 213761 B/op 5081 allocs/op
BenchmarkDecoderToken/encodingjson/sample-16 1803 3287623 ns/op 209.12 MB/s 723782 B/op 26095 allocs/op
Because allocations make up a large proportion of the cost of the Decoder.Token API, pkg/json.Decoder provides an alternative API that produces significantly fewer allocations and is 8-10x faster.
BenchmarkDecoderNextToken/pkgjson/canada-16 1197 4825232 ns/op 466.52 MB/s 136 B/op 3 allocs/op
BenchmarkDecoderNextToken/encodingjson/canada-16 90 65392440 ns/op 34.42 MB/s 17740399 B/op 889106 allocs/op
BenchmarkDecoderNextToken/pkgjson/citm_catalog-16 2709 2162849 ns/op 798.58 MB/s 136 B/op 3 allocs/op
BenchmarkDecoderNextToken/encodingjson/citm_catalog-16 301 20064314 ns/op 86.08 MB/s 5661597 B/op 324692 allocs/op
BenchmarkDecoderNextToken/pkgjson/twitter-16 5395 1068106 ns/op 591.25 MB/s 152 B/op 4 allocs/op
BenchmarkDecoderNextToken/encodingjson/twitter-16 494 12072956 ns/op 52.31 MB/s 3584013 B/op 187319 allocs/op
BenchmarkDecoderNextToken/pkgjson/code-16 1135 5124666 ns/op 378.65 MB/s 248 B/op 6 allocs/op
BenchmarkDecoderNextToken/encodingjson/code-16 74 77579973 ns/op 25.01 MB/s 23355955 B/op 1319125 allocs/op
BenchmarkDecoderNextToken/pkgjson/example-16 269010 22323 ns/op 583.43 MB/s 152 B/op 4 allocs/op
BenchmarkDecoderNextToken/encodingjson/example-16 22707 264078 ns/op 49.32 MB/s 82416 B/op 4325 allocs/op
BenchmarkDecoderNextToken/pkgjson/sample-16 10000 510445 ns/op 1346.85 MB/s 1144 B/op 9 allocs/op
BenchmarkDecoderNextToken/encodingjson/sample-16 1836 3161804 ns/op 217.44 MB/s 723781 B/op 26095 allocs/op
At the highest level, pkg/json can unmarshal data into a Go object with the same API as encoding/json.
This is very much a work in progress, but the results are promising for folks who want to use this package as a drop-in replacement.
BenchmarkDecoderDecodeInterfaceAny/pkgjson/canada-16 217 27425893 ns/op 82.08 MB/s 8747163 B/op 281408 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/canada-16 153 38347477 ns/op 58.70 MB/s 20647616 B/op 392553 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/citm_catalog-16 747 8008839 ns/op 215.66 MB/s 5197853 B/op 89673 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/citm_catalog-16 360 16607501 ns/op 104.00 MB/s 9406809 B/op 95389 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/twitter-16 1606 3714515 ns/op 170.01 MB/s 2130731 B/op 30182 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/twitter-16 862 6927998 ns/op 91.15 MB/s 4283407 B/op 31278 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/code-16 333 17939351 ns/op 108.17 MB/s 7331643 B/op 232059 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/code-16 236 25324951 ns/op 76.62 MB/s 12332753 B/op 271292 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/example-16 76874 78079 ns/op 166.81 MB/s 50980 B/op 739 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/example-16 40886 146685 ns/op 88.79 MB/s 82855 B/op 782 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/sample-16 5240 1116081 ns/op 615.99 MB/s 399970 B/op 5542 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/sample-16 1123 5369313 ns/op 128.04 MB/s 2661872 B/op 7527 allocs/op
This is a story of how I went about building this package.
Note: I’m using a pre-release version of Go 1.15 built from source. If you’re using an older version your numbers may vary. When Go 1.15 comes out, you should upgrade.
Design
The design of this package has the following features:
- Reasonably compatible with the encoding/json package: this package offers the same high level json.Decoder API with higher throughput and/or reduced allocations. A success criterion for this package is to serve as a drop-in replacement for encoding/json.
- Supports streaming operation: it’s nice if you can have the entire input in memory, but that’s unrealistic. Input sizes are usually unknown and potentially unbounded. Buffering in memory is an availability risk, and buffering before processing introduces latency. Streaming reads let you process data as it arrives and overlap that processing with the transfer or read. This package supports streaming operation via io.Reader input sources (which is also required for compatibility with encoding/json).
- Allocation free, or bounded, API: in addition to the encoding/json API, provide an alternative API that can operate with minimal, ideally zero, allocations.
Time complexity
Let’s talk about the time complexity of this problem.
JSON doesn’t use length markers; to know how much to read, we have to read it all. This means the lower bound on the time to process the input is the size of the input.
But reading the input isn’t enough, we have to follow the JSON state machine to figure out where the tokens start and end.
Now, reading N bytes isn’t the full story; we also need to process those N bytes, so the cost is at least read(N) + parse(N).
But there are other costs:
- If we have to allocate memory to read or process those bytes, that overhead will grow with the size of the input.
- We know that the big factor in the performance of a parse is the size of the input. Ideally we want N to be the number of bytes in the input, that is, we want to process each byte only once. If we touch the same byte more than once, that adds overhead, and complicates processing if we have to keep those bytes around to come back and look at them again.
- Just like we don’t want to process a byte more than once, we want to avoid processing a token more than once.
- Limit function calls in the hot path inside the Scanner or Decoder. encoding/json uses one function call per byte; pkg/json does better at one call per token. We want to limit the number of function calls per token, ideally O(tokens), not O(bytes). If we did nothing else we’d be ahead.
- Limit copies. If we design to limit copying of data then we limit the number of times we (re)visit a byte.
- Limit allocations. If you limit the number of places you can copy from and to, ideally only copies within existing buffers, then you naturally limit allocations. Limiting allocations reduces runtime in two ways:
  - Reduce the overhead of making the allocation. The heap is a shared resource; allocating on the heap requires working with shared data structures. This means locks, cache contention, etc. c.f. Amdahl’s Law [2]
  - Reduce the overhead of freeing allocations. The fewer allocations you make, the less heap you consume and the less garbage you produce. Reducing these two factors reduces the overhead of background and foreground garbage collection.
Tokenisation
JSON is a stream of tokens. To build higher level components like pretty printers and decoders we need to break the stream into tokens.
Go’s JSON decoder has two main components:
- A scanner, or tokeniser, that converts a stream of bytes into a stream of JSON tokens.
- An unmarshaller that applies a stream of JSON tokens to a Go object.
Let’s talk about tokenization first.
What is a token?
JSON is a regular, well defined grammar. There is a great set of charts over on json.org.
{"a": 1, "b": true, "c": [1, "two", null]}
is a stream of:
- {, the opening brace, signifying a collection of name/value pairs.
- "a", the string a.
- :, a colon, the delimiter between the key and the value in the key/value pair.
- 1, the number one.
- ,, a comma, the delimiter between one key/value pair and the next.
- "b", the string b.
- :
- true, the boolean value for true.
- ,
- "c", the string c.
- :
- [, the opening square brace, signifying an ordered list of values.
- 1
- ,
- "two"
- ,
- null, a null or void value (rare).
- ], the closing square brace, terminating the list of values.
- }, the closing curly brace, terminating the key/value collection.
encoding/json does this with the Decoder.Token API.
You declare a json.Decoder, then call Token until err is non-nil.
package main
import (
"encoding/json"
"fmt"
"strings"
)
func main() {
input := `{"a": 1, "b": true, "c": [1, "two", null]}`
dec := json.NewDecoder(strings.NewReader(input))
for {
tok, err := dec.Token()
if err != nil {
break
}
fmt.Printf("%v\t(%T)\n", tok, tok)
}
}
When we run this we get the following output
{ (json.Delim)
a (string)
1 (float64)
b (string)
true (bool)
c (string)
[ (json.Delim)
1 (float64)
two (string)
<nil> (<nil>)
] (json.Delim)
} (json.Delim)
This is rather convenient: tok is an interface{} value, so it can represent both the value being returned and also its type; strings are string, numbers are float64, booleans are real true and false, even null is represented as a nil.
But there is a cost to this convenience.[3] To see why, let’s talk about a string.
var b = make([]byte, 10)
var s = string(b)
When we write the statement s = string(b) the compiler makes a copy of b, because the rules of Go mandate that strings are immutable.
If a string and a byte slice shared the same backing data, then changing b could change the contents of s.
This would be bad, thus the expression string(b) makes a copy of the contents of b.
Now, let’s look at the input to the json.Decoder; it’s an io.Reader.
% go doc encoding/json NewDecoder
package json // import "encoding/json"
func NewDecoder(r io.Reader) *Decoder
NewDecoder returns a new decoder that reads from r.
The decoder introduces its own buffering and may read data from r beyond the
JSON values requested.
Let’s look at the io.Reader.Read method:
% go doc io Reader.Read
package io // import "io"
func Read(p []byte) (n int, err error)
You give Read a []byte buffer, and Read returns to you the number of bytes it read into the buffer, and possibly an error.
Now we know the input is a stream of bytes, and the output is runes, float64s, bools, and strings.
At a minimum, the input "hello" is going to result in a byte slice []byte{'h','e','l','l','o'}, and that byte slice is going to be copied to a string.
package main
import (
"encoding/json"
"fmt"
"io"
"strings"
)
func main() {
input := `"hello"`
var r io.Reader = strings.NewReader(input) // reads strings as []byte
dec := json.NewDecoder(r)
tok, _ := dec.Token()
fmt.Printf("%s\t(%T)\n", tok, tok)
}
The convenience of the Decoder.Token API means one allocation per token.
But it gets worse.
Values assigned to interfaces escape
A more serious issue, from a performance point of view, is that assigning a value to an interface generally causes an allocation.
Because of the design of the Decoder.Token API, the concrete value assigned to each token escapes to the heap.
So not only do we have an allocation for every []byte to string conversion, but each token escapes to the heap.
The number of allocations is tied to the number of tokens in the file, and the size of those allocations will be in part related to the size of the file.
The reason for this is the garbage collector.
We all know that an interface value is a two word structure. It looks something like this [4]
type interface struct {
type uintptr
data uintptr
}
Note: uintptr above is not to suggest that type and data are pointers (although in most cases they are), just that the size of those fields is large enough to hold the address of a value in memory — the size of a pointer. It’s unsigned because a signed pointer would halve the address space, and a negative pointer doesn’t make any sense.
interface values are special in that they record both the value and the type of the value.
Another property of interface values is they can hold any value, regardless of its type.
Note: In early Go versions it was possible for an interface to store a uintptr or smaller value directly in the data field of the interface. However this changed in Go 1.4 because it is not possible to change two fields atomically, which caused a problem for the concurrent collector. [5]
var x interface{} = 1
x = "one"
From the point of view of the compiler, this must change the type stored in x from int to string and the value from 1 to "one" atomically.
To do this the compiler stores a pointer to the value in the data slot.
This means each token escapes to the heap.
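Here’s a small illustrative program (not from the talk); compiling it with go build -gcflags=-m should report that the converted string escapes to the heap, because the interface return value holds a pointer to it.
package main

import "fmt"

// token mimics the shape of Decoder.Token: it returns a decoded value
// through an interface.
func token(b []byte) interface{} {
	// The copy made by string(b) escapes: the returned interface value
	// stores a pointer to it.
	return string(b)
}

func main() {
	fmt.Println(token([]byte("hello")))
}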
But it gets worse. We know that a Go string is itself a small struct:
type string struct {
ptr *byte
len int
}
So to convert a []byte token to a string, first we copy the []byte, and that copy goes on the heap; then a string header is created, and that goes on the heap; and finally the pointer to that header goes into the data slot in the interface.
package main
import (
"encoding/json"
"strings"
"testing"
)
func BenchmarkJSONDecodeHello(b *testing.B) {
input := `"hello"`
r := strings.NewReader(input) // reads strings as []byte
dec := json.NewDecoder(r)
b.ReportAllocs()
b.SetBytes(int64(len(input)))
b.ResetTimer()
for i := 0; i < b.N; i++ {
r.Seek(0, 0)
tok, _ := dec.Token()
if tok != "hello" {
b.Fatal()
}
}
}
% go test -bench=. -memprofile=m.p json/tok_test.go
goos: darwin
goarch: amd64
BenchmarkJSONDecodeHello-4 3523440 355 ns/op 19.73 MB/s 37 B/op 3 allocs/op
PASS
ok command-line-arguments 1.951s
Take away: API design influences allocation.
Eliminating allocations through API design.
Spoiler alert: most of the speedups in this package come from reducing allocations; specifically, the time not spent in the heap allocation path, and the time not spent in GC cycles, is available for scanning.
If we want to build an API that has lower allocations than encoding/json, we have to address each of the problems I’ve discussed.
Implicit tokens
Let’s look back at the sequence of tokens: { "a" : 1 , "b" : true , "c" : [ 1 , "two" , null ] and }.
It turns out that the first character in the token tells you what the token is:
- { , } - collection start, end
- [ , ] - array start, end
- t - true
- f - false
- n - null
- " - string
- -, 0-9 - a number
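As a sketch of the idea, a hypothetical classifier (illustration only, not part of pkg/json) needs nothing more than the first byte:
// kindOf reports what kind of token starts with c.
func kindOf(c byte) string {
	switch c {
	case '{', '}':
		return "object delimiter"
	case '[', ']':
		return "array delimiter"
	case ':':
		return "colon"
	case ',':
		return "comma"
	case 't', 'f':
		return "boolean"
	case 'n':
		return "null"
	case '"':
		return "string"
	case '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
		return "number"
	default:
		return "invalid"
	}
}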
This is the first improvement in the Scanner.Next and Decoder.NextToken APIs.
Rather than converting the []byte to a value, they just return the token straight from the input—a simple subslice.
package main
import (
"fmt"
"github.com/pkg/json"
"strings"
)
func main() {
input := `{"a": 1, "b": true, "c": [1, "two", null]}`
dec := json.NewDecoder(strings.NewReader(input))
for {
tok, err := dec.NextToken()
if err != nil {
break
}
fmt.Printf("%s\t(%T)\n", tok, tok)
}
}
{ ([]uint8)
"a" ([]uint8)
1 ([]uint8)
"b" ([]uint8)
true ([]uint8)
"c" ([]uint8)
[ ([]uint8)
1 ([]uint8)
"two" ([]uint8)
null ([]uint8)
] ([]uint8)
} ([]uint8)
There are a few subtleties with this API.
- Because the output is a subslice of the input, not a copy, there are restrictions on how long the output is valid for. This is similar to the bufio.Scanner API.
- Sometimes people want to know the type of the token (collection, array, string, number, etc), and sometimes they want the token value, the string or the number, in a form they can work with. Scanner.Next and Decoder.NextToken aren’t convenient for that, but they can be used to build higher level abstractions.
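In practice that first subtlety means a caller who wants to hold on to a token past the next call has to copy it, just as with bufio.Scanner.Bytes. A hypothetical helper (not part of the pkg/json API):
// keepToken copies tok so it remains valid after the next call to
// Decoder.NextToken or Scanner.Next.
func keepToken(tok []byte) []byte {
	return append([]byte(nil), tok...)
}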
Reading
Let’s talk about reading data.
This can be tricky to do efficiently because JSON is not length delimited; you have to read until you find the end of the token.
The traditional way to do this is with an io.Reader.
- You can read one byte at a time, but you need a place to store the thing you’re walking over, and you might need to put the byte back.
- You can read into a buffer, then look in the buffer for the start and end of the token. If the end of the token isn’t in the buffer you need to do a lot of bookkeeping and copying to move the data around the buffer, or grow the buffer to make room for more data.
encoding/json does a combination of these, often with a smattering of sync.Pool to try to reuse small objects transparently.
// A byteReader implements a sliding window over an io.Reader.
type byteReader struct {
	data   []byte
	offset int
	r      io.Reader
	err    error
}

// release discards n bytes from the front of the window.
func (b *byteReader) release(n int) {
	b.offset += n
}

// window returns the current window.
// The window is invalidated by calls to release or extend.
func (b *byteReader) window() []byte {
	return b.data[b.offset:]
}

// extend extends the window with data from the underlying reader.
// (The body below is a sketch: the real implementation also compacts
// released data and manages buffer growth more carefully.)
func (b *byteReader) extend() int {
	if b.err != nil {
		return 0
	}
	if len(b.data) == cap(b.data) {
		// no free space left; grow the buffer
		b.data = append(b.data, make([]byte, 8192)...)[:len(b.data)]
	}
	n, err := b.r.Read(b.data[len(b.data):cap(b.data)])
	b.data = b.data[:len(b.data)+n]
	b.err = err
	return n
}
Example: whitespace
JSON contains a mixture of tokens and whitespace. Space, tab, newline and carriage return can occur between tokens and are ignored, thus the search for a token begins with a search for the first non-whitespace character.
{"a": 1, "b": true, "c": [1, "two", null]}
This is a good time to talk about optimising the search for whitespace, with an example of using a byteReader.
func countWhitespace(br *byteReader) int {
n := 0
w := br.window()
for {
for _, c := range w {
if isWhitespace(c) {
n++
}
}
br.release(len(w))
if br.extend() == 0 {
return n
}
w = br.window()
}
}
This does the minimum: it visits each character, makes one function call per character (which is inlined), and counts the number of whitespace characters. Any (useful) JSON decoder cannot go faster than this.
What’s the fastest way to implement isWhitespace?
Here’s the implementation that encoding/json uses (with a different name):
func isWhitespace(c byte) bool {
return c <= ' ' && (c == ' ' || c == '\t' || c == '\r' || c == '\n')
}
And here is the table lookup that pkg/json uses instead:
var whitespace = [256]bool{
' ': true,
'\t': true,
'\n': true,
'\r': true,
}
func isWhitespace(c byte) bool {
return whitespace[c]
}
name time/op
CountWhitespace/canada-16 1.10ms ± 2%
CountWhitespace/citm_catalog-16 838µs ± 1%
CountWhitespace/twitter-16 306µs ± 1%
CountWhitespace/code-16 937µs ± 1%
CountWhitespace/example-16 6.40µs ± 1%
CountWhitespace/sample-16 333µs ± 1%
name speed
CountWhitespace/canada-16 2.04GB/s ± 2%
CountWhitespace/citm_catalog-16 2.06GB/s ± 1%
CountWhitespace/twitter-16 2.06GB/s ± 1%
CountWhitespace/code-16 2.07GB/s ± 1%
CountWhitespace/example-16 2.04GB/s ± 1%
CountWhitespace/sample-16 2.06GB/s ± 1%
So this is our baseline.
Scanning
Now that we can tell which characters are tokens and which are simply whitespace, let’s step up a level and talk about breaking the input into those tokens.
// Next returns a []byte referencing the next lexical token in the stream.
// The []byte is valid until Next is called again.
// If the stream is at its end, or an error has occurred, Next returns a zero
// length []byte slice.
//
// A valid token begins with one of the following:
//
// { Object start
// [ Array start
// } Object end
// ] Array End
// , Literal comma
// : Literal colon
// t JSON true
// f JSON false
// n JSON null
// " A string, possibly containing backslash escaped entites.
// -, 0-9 A number
func (s *Scanner) Next() []byte {
// release the previous token
s.br.release(s.pos)
s.pos = 0
c := s.token()
length := 0
switch c {
case ObjectStart, ObjectEnd, Colon, Comma, ArrayStart, ArrayEnd:
length = 1
s.pos = 1
case True:
length = validateToken(&s.br, "true")
s.pos = length
case False:
length = validateToken(&s.br, "false")
s.pos = length
case Null:
length = validateToken(&s.br, "null")
s.pos = length
case String:
// string
length = parseString(&s.br)
if length < 2 {
return nil
}
s.pos = length
case 0:
// eof
return nil
default:
// ensure the number is correct.
length = s.parseNumber()
if length < 0 {
return nil
}
}
return s.br.window()[:length]
}
This is the core loop of Scanner.Next.
Scanner.Next skips over any intermediate whitespace, determines the token from the first character in the window, then continues to read until the token is complete or we hit the end of the input.
Let’s look at how token works, then we’ll talk about some optimisations.
func (s *Scanner) token() byte {
w := s.br.window()
for {
for _, c := range w {
if whitespace[c] {
s.pos++
continue
}
// release whitespace
s.br.release(s.pos)
s.pos = 0
return c
}
if s.br.extend() == 0 {
// eof
return 0
}
w = s.br.window()[s.pos:]
}
}
var whitespace = [256]bool{
' ': true,
'\r': true,
'\n': true,
'\t': true,
}
- We start by getting the current window from the byteReader. This is a []byte slice of all the data that is yet to be read.
- We’re looking for the first non-whitespace character. If the character is whitespace we increment s.pos to ignore it and loop around.
- If we do find a non-whitespace character, we release s.pos characters from the front of the window. Now the start of the window is properly aligned with the first character of the token.
- It turns out that we also get the first character of the token for free; it’s in c, so we can return that as a hint to Scanner.Next.
- If we run out of characters without hitting a token then we call extend() to grow the window.
- If we couldn’t grow, then we’ve run out of input and haven’t got a token, so we give up.
- Otherwise we update w with a new window.
This is the basic operation of byteReader; we’ll see this pattern repeated across the scanner.
Some things to note:
- Note the lack of error handling. It’s not part of the inner loop; it only happens when we have to read more data from the underlying reader.
- extend hides the process of reading into, growing, and refilling the buffer. It makes the loop above it, Scanner.token, simpler: if there is data in the window, process it; extend if you need to; give up if you can’t extend.
- release is similar, it shrinks the start of the window to exclude data that we don’t care about.
- extend is not in the hot path, so there is no need to optimise it; its performance is a function of the buffer it is given. In practice a buffer of 8k is sufficient.
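To make the shape of the API concrete, here is roughly how a caller drives the Scanner. The NewScanner constructor name is an assumption on my part and may not match the package exactly.
// countTokens tokenises everything readable from r and returns the number of
// tokens seen. Sketch only; assumes a NewScanner(io.Reader) constructor.
func countTokens(r io.Reader) int {
	sc := json.NewScanner(r)
	n := 0
	for len(sc.Next()) > 0 {
		n++
	}
	return n
}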
Let’s talk about the performance of this code.
name time/op
Scanner/canada-16 4.41ms ± 2%
Scanner/citm_catalog-16 2.55ms ± 3%
Scanner/twitter-16 1.03ms ± 1%
Scanner/code-16 4.21ms ± 1%
Scanner/example-16 21.4µs ± 1%
Scanner/sample-16 822µs ± 1%
name speed
Scanner/canada-16 510MB/s ± 2%
Scanner/citm_catalog-16 677MB/s ± 3%
Scanner/twitter-16 615MB/s ± 1%
Scanner/code-16 461MB/s ± 1%
Scanner/example-16 608MB/s ± 1%
Scanner/sample-16 837MB/s ± 1%
This is a benchmark you saw earlier, minus a few optimisations we’ll talk about next.
Comparing the performance of Scanner.Next to our whitespace benchmark, we can see that we’re between 1/4 and 2/5ths of our baseline.
Let’s talk about the first improvement we can make to the code.
Note the amount of work being spent to keep s.pos up to date.
We know that s.pos is set to 0 just before Scanner.Next calls this function, and we set s.pos to zero on the way out of the function, so the changes we make to s.pos within the function are invisible—it’s zero on entry, and zero on exit.
We can rewrite the function to keep a local pos value, which has an impressive effect on token.
func (s *Scanner) token() byte {
w := s.br.window()
pos := 0
for {
for _, c := range w {
if whitespace[c] {
pos++
continue
}
// release whitespace
s.br.release(pos)
return c
}
if s.br.extend() == 0 {
// eof
return 0
}
w = s.br.window()[pos:]
}
}
var whitespace = [256]bool{
' ': true,
'\r': true,
'\n': true,
'\t': true,
}
name old time/op new time/op delta
Scanner/canada-16 4.39ms ± 1% 4.43ms ± 4% ~ (p=1.000 n=5+5)
Scanner/citm_catalog-16 2.52ms ± 1% 1.80ms ± 4% -28.46% (p=0.008 n=5+5)
Scanner/twitter-16 1.03ms ± 2% 0.95ms ± 3% -7.41% (p=0.008 n=5+5)
Scanner/code-16 4.24ms ± 2% 4.18ms ± 1% ~ (p=0.095 n=5+5)
Scanner/example-16 21.4µs ± 1% 18.9µs ± 2% -11.68% (p=0.008 n=5+5)
Scanner/sample-16 828µs ± 2% 528µs ± 2% -36.24% (p=0.008 n=5+5)
name old speed new speed delta
Scanner/canada-16 512MB/s ± 1% 509MB/s ± 4% ~ (p=1.000 n=5+5)
Scanner/citm_catalog-16 685MB/s ± 1% 958MB/s ± 4% +39.84% (p=0.008 n=5+5)
Scanner/twitter-16 616MB/s ± 2% 665MB/s ± 3% +8.01% (p=0.008 n=5+5)
Scanner/code-16 458MB/s ± 2% 465MB/s ± 1% ~ (p=0.095 n=5+5)
Scanner/example-16 608MB/s ± 1% 688MB/s ± 2% +13.23% (p=0.008 n=5+5)
Scanner/sample-16 831MB/s ± 2% 1303MB/s ± 2% +56.84% (p=0.008 n=5+5)
By keeping pos local the compiler avoided those temporary writes back to memory.
The question I have for you is: why did this improve some inputs and not others?
The answer, I think, is that different inputs have different amounts of whitespace.
For example, canada only has 33 whitespace characters whereas citm has 1,227,563.
There is a larger improvement we can make to the runtime of this code, and it relates to inlining. Inlining is the process of automatically (or manually) copying the body of a function into its caller, in line. This avoids the overhead of the function call.
Usually inlining is performed automatically by the compiler according to a set of rules it controls. The Go compiler has reasonable support for inlining, but has a number of limitations.
% go build -gcflags=-m=2 2>&1 | grep cannot | grep -v decoder
./reader.go:31:6: cannot inline (*byteReader).extend: function too complex: cost 198 exceeds budget 80
./scanner.go:99:6: cannot inline (*Scanner).token: unhandled op FOR
./scanner.go:130:6: cannot inline validateToken: unhandled op FOR
./scanner.go:153:6: cannot inline parseString: unhandled op FOR
./scanner.go:182:6: cannot inline (*Scanner).parseNumber: unhandled op FOR
./scanner.go:56:6: cannot inline (*Scanner).Next: function too complex: cost 476 exceeds budget 80
The first is the size of the function: byteReader.extend cannot be inlined because it is too complex. [10]
The second is particular statements within the function: Scanner.token cannot be inlined because it contains a for statement.
Also note that Scanner.Next, the caller of Scanner.token, cannot be inlined because it too is too complex.
Let’s go back to the constraints.
Scanner.Next is called for each token in the input. This means that Scanner.token is called for each token in the input. Scanner.token cannot be automatically inlined into its caller because it is too complex. Therefore we’re paying for an extra function call for each token.
We can remove this overhead by manually inlining Scanner.token into its caller.
func (s *Scanner) Next() []byte {
// release the previous token
s.br.release(s.pos)
w := s.br.window()
pos := 0
for {
for _, c := range w {
if whitespace[c] {
pos++
continue
}
// release whitespace
s.br.release(pos)
length := 0
switch c {
case ObjectStart, ObjectEnd, Colon, Comma, ArrayStart, ArrayEnd:
length = 1
s.pos = 1
case True:
length = validateToken(&s.br, "true")
s.pos = length
case False:
length = validateToken(&s.br, "false")
s.pos = length
case Null:
length = validateToken(&s.br, "null")
s.pos = length
case String:
// string
length = parseString(&s.br)
if length < 2 {
return nil
}
s.pos = length
default:
// ensure the number is correct.
length = s.parseNumber()
if length < 0 {
return nil
}
}
return s.br.window()[:length]
}
if s.br.extend() == 0 {
// eof
return nil
}
w = s.br.window()[pos:]
}
}
The results support our thesis:
name old time/op new time/op delta
Scanner/canada-16 4.36ms ± 1% 3.50ms ± 0% -19.68% (p=0.008 n=5+5)
Scanner/citm_catalog-16 1.80ms ± 1% 1.56ms ± 2% -13.16% (p=0.008 n=5+5)
Scanner/twitter-16 965µs ± 2% 833µs ± 2% -13.75% (p=0.008 n=5+5)
Scanner/code-16 4.15ms ± 1% 3.61ms ± 1% -12.82% (p=0.008 n=5+5)
Scanner/example-16 18.9µs ± 2% 16.6µs ± 1% -12.42% (p=0.008 n=5+5)
Scanner/sample-16 515µs ± 1% 472µs ± 2% -8.34% (p=0.008 n=5+5)
name old speed new speed delta
Scanner/canada-16 516MB/s ± 1% 642MB/s ± 0% +24.50% (p=0.008 n=5+5)
Scanner/citm_catalog-16 960MB/s ± 1% 1105MB/s ± 2% +15.16% (p=0.008 n=5+5)
Scanner/twitter-16 654MB/s ± 2% 759MB/s ± 1% +15.94% (p=0.008 n=5+5)
Scanner/code-16 468MB/s ± 1% 537MB/s ± 1% +14.69% (p=0.008 n=5+5)
Scanner/example-16 689MB/s ± 2% 787MB/s ± 1% +14.17% (p=0.008 n=5+5)
Scanner/sample-16 1.33GB/s ± 1% 1.46GB/s ± 1% +9.11% (p=0.008 n=5+5)
By saving the function call we’ve improved throughput by 9-24%.
The largest improvement comes from canada, which contains basically no whitespace, so the call to Scanner.token almost always returned immediately having done no work, while also paying for all the s.pos and release overhead.
This is where the inner loop of the Scanner stands today.
Note that citm is over 50% of the baseline, and sample is nearly 75%.
To recap, the major optimisations were:
- The whitespace[c] table lookup for detecting whitespace.
- Avoiding s.pos updates. They cannot be registerised; the CPU has to do a write on every iteration. s.pos updates were reduced from one per byte to one per token.
- Scanner.Next and Scanner.token were effectively one function spread over two. Each is too large to be inlined, so we were paying for an extra function call per token. Manually inlining them increased the indentation depth of the function, but delivered substantial speedups.
- Most JSON contains some whitespace; it’s moderately optimised for human readability. It turns out, the more whitespace, the faster pkg/json decodes.
Decoding
So far we have a Scanner which tokenises input at 50-75% of the baseline speed we established before.
But there are a few more things we need to make the scanner fully functional.
The first part of that is validation.
JSON is a state machine.
Depending on the current token you’re in, some may be valid some may not be.
For example, if you’ve read these tokens {
, "username"
, then the only valid token is :
.
To track this we need to layer some logic on top of Scanner.Next
to assert that the token sequence is valid.
This is the role of Decoder.NextToken
:
const (
stateValue = 0
stateObjectString = iota
stateObjectColon
stateObjectValue
stateObjectComma
stateArrayValue
stateArrayComma
stateEnd
)
func (d *Decoder) NextToken() ([]byte, error) {
tok := d.scanner.Next()
if len(tok) < 1 {
return nil, io.EOF
}
switch d.state {
case stateValue:
return d.stateValue(tok)
case stateObjectString:
return d.stateObjectString(tok)
case stateObjectColon:
return d.stateObjectColon(tok)
case stateObjectValue:
return d.stateObjectValue(tok)
case stateObjectComma:
return d.stateObjectComma(tok)
case stateArrayValue:
return d.stateArrayValue(tok)
case stateArrayComma:
return d.stateArrayComma(tok)
case stateEnd:
fallthrough
default:
return nil, io.EOF
}
}
This is pretty straightforward stuff.
We track the current state in d.state and, based on its value, dispatch to the various state methods.
Here is the source for the various state methods.
func (d *Decoder) stateObjectString(tok []byte) ([]byte, error) {
switch tok[0] {
case '}':
inObj := d.pop()
switch {
case d.len() == 0:
d.state = stateEnd
case inObj:
d.state = stateObjectComma
case !inObj:
d.state = stateArrayComma
}
return tok, nil
case '"':
d.state = stateObjectColon
return tok, nil
default:
return nil, fmt.Errorf("stateObjectString: missing string key")
}
}
func (d *Decoder) stateObjectColon(tok []byte) ([]byte, error) {
switch tok[0] {
case Colon:
d.state = stateObjectValue
return d.NextToken()
default:
return tok, fmt.Errorf("stateObjectColon: expecting colon")
}
}
func (d *Decoder) stateObjectValue(tok []byte) ([]byte, error) {
switch tok[0] {
case '{':
d.state = stateObjectString
d.push(true)
return tok, nil
case '[':
d.state = stateArrayValue
d.push(false)
return tok, nil
default:
d.state = stateObjectComma
return tok, nil
}
}
func (d *Decoder) stateObjectComma(tok []byte) ([]byte, error) {
switch tok[0] {
case '}':
inObj := d.pop()
switch {
case d.len() == 0:
d.state = stateEnd
case inObj:
d.state = stateObjectComma
case !inObj:
d.state = stateArrayComma
}
return tok, nil
case Comma:
d.state = stateObjectString
return d.NextToken()
default:
return tok, fmt.Errorf("stateObjectComma: expecting comma")
}
}
func (d *Decoder) stateArrayValue(tok []byte) ([]byte, error) {
switch tok[0] {
case '{':
d.state = stateObjectString
d.push(true)
return tok, nil
case '[':
d.state = stateArrayValue
d.push(false)
return tok, nil
case ']':
inObj := d.pop()
switch {
case d.len() == 0:
d.state = stateEnd
case inObj:
d.state = stateObjectComma
case !inObj:
d.state = stateArrayComma
}
return tok, nil
case ',':
return nil, fmt.Errorf("stateArrayValue: unexpected comma")
default:
d.state = stateArrayComma
return tok, nil
}
}
func (d *Decoder) stateArrayComma(tok []byte) ([]byte, error) {
switch tok[0] {
case ']':
inObj := d.pop()
switch {
case d.len() == 0:
d.state = stateEnd
case inObj:
d.state = stateObjectComma
case !inObj:
d.state = stateArrayComma
}
return tok, nil
case Comma:
d.state = stateArrayValue
return d.NextToken()
default:
return nil, fmt.Errorf("stateArrayComma: expected comma, %v", d.stack)
}
}
func (d *Decoder) stateValue(tok []byte) ([]byte, error) {
switch tok[0] {
case '{':
d.state = stateObjectString
d.push(true)
return tok, nil
case '[':
d.state = stateArrayValue
d.push(false)
return tok, nil
case ',':
return nil, fmt.Errorf("stateValue: unexpected comma")
default:
d.state = stateEnd
return tok, nil
}
}
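The state methods above call push, pop and len on a stack embedded in the Decoder which records whether each open scope is an object or an array. The talk doesn’t show that type; a minimal sketch consistent with how it is used (pop discards the closed scope and reports whether the enclosing scope is an object) might look like this:
// stack records, for each open scope, true for an object and false for an
// array. Sketch only; the real type in pkg/json may differ.
type stack []bool

func (s *stack) push(isObject bool) { *s = append(*s, isObject) }

// pop removes the scope that has just been closed and reports whether the
// enclosing scope, the new top of the stack, is an object.
func (s *stack) pop() bool {
	*s = (*s)[:len(*s)-1]
	if len(*s) == 0 {
		return false
	}
	return (*s)[len(*s)-1]
}

func (s *stack) len() int { return len(*s) }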
Let’s look at the results.
name time/op
DecoderNextToken/pkgjson/canada-16 5.64ms ± 1%
DecoderNextToken/encodingjson/canada-16 65.1ms ± 1%
DecoderNextToken/pkgjson/citm_catalog-16 2.42ms ± 2%
DecoderNextToken/encodingjson/citm_catalog-16 19.8ms ± 1%
DecoderNextToken/pkgjson/twitter-16 1.18ms ± 1%
DecoderNextToken/encodingjson/twitter-16 12.1ms ± 0%
DecoderNextToken/pkgjson/code-16 6.10ms ± 2%
DecoderNextToken/encodingjson/code-16 77.5ms ± 1%
DecoderNextToken/pkgjson/example-16 25.2µs ± 1%
DecoderNextToken/encodingjson/example-16 266µs ± 1%
DecoderNextToken/pkgjson/sample-16 559µs ± 0%
DecoderNextToken/encodingjson/sample-16 3.18ms ± 2%
name speed
DecoderNextToken/pkgjson/canada-16 399MB/s ± 1%
DecoderNextToken/encodingjson/canada-16 34.6MB/s ± 1%
DecoderNextToken/pkgjson/citm_catalog-16 713MB/s ± 2%
DecoderNextToken/encodingjson/citm_catalog-16 87.1MB/s ± 1%
DecoderNextToken/pkgjson/twitter-16 537MB/s ± 1%
DecoderNextToken/encodingjson/twitter-16 52.2MB/s ± 0%
DecoderNextToken/pkgjson/code-16 318MB/s ± 2%
DecoderNextToken/encodingjson/code-16 25.0MB/s ± 1%
DecoderNextToken/pkgjson/example-16 517MB/s ± 1%
DecoderNextToken/encodingjson/example-16 49.0MB/s ± 1%
DecoderNextToken/pkgjson/sample-16 1.23GB/s ± 0%
DecoderNextToken/encodingjson/sample-16 216MB/s ± 2%
Compared to encoding/json we’re 8-10x faster.
But there are some things we can do to improve:
Computed goto
Central to the operation of Decoder.NextToken is the switch statement.
As we saw earlier in the whitespace benchmark, switch was the second worst performer.
This is because switch, in the general case at least, is implemented as a set of if {} else if {} … operations.
Effectively what the compiler sees is:
func (d *Decoder) NextToken() ([]byte, error) {
tok := d.scanner.Next()
if len(tok) < 1 {
return nil, io.EOF
}
if d.state == stateValue {
return d.stateValue(tok)
}
if d.state == stateObjectString {
return d.stateObjectString(tok)
}
if d.state == stateObjectColon {
return d.stateObjectColon(tok)
}
if d.state == stateObjectValue {
return d.stateObjectValue(tok)
}
if d.state == stateObjectComma {
return d.stateObjectComma(tok)
}
if d.state == stateArrayValue {
return d.stateArrayValue(tok)
}
if d.state == stateArrayComma {
return d.stateArrayComma(tok)
}
return nil, io.EOF
}
Benchmarking this explicit if {} else {} … version confirms this is close to what the compiler is seeing:
DecoderNextToken/pkgjson/canada-16 5.60ms ± 0% 5.65ms ± 0% +0.96% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/canada-16 64.9ms ± 1% 65.6ms ± 1% ~ (p=0.151 n=5+5)
DecoderNextToken/pkgjson/citm_catalog-16 2.41ms ± 1% 2.45ms ± 1% +1.42% (p=0.016 n=5+5)
DecoderNextToken/encodingjson/citm_catalog-16 19.8ms ± 1% 19.7ms ± 0% ~ (p=0.690 n=5+5)
DecoderNextToken/pkgjson/twitter-16 1.17ms ± 1% 1.18ms ± 2% ~ (p=1.000 n=5+5)
DecoderNextToken/encodingjson/twitter-16 12.2ms ± 1% 12.1ms ± 1% ~ (p=0.310 n=5+5)
DecoderNextToken/pkgjson/code-16 6.11ms ± 3% 6.09ms ± 1% ~ (p=1.000 n=5+5)
DecoderNextToken/encodingjson/code-16 77.7ms ± 2% 77.2ms ± 1% ~ (p=0.548 n=5+5)
DecoderNextToken/pkgjson/example-16 25.0µs ± 0% 25.0µs ± 1% ~ (p=0.841 n=5+5)
DecoderNextToken/encodingjson/example-16 263µs ± 1% 264µs ± 1% ~ (p=0.222 n=5+5)
DecoderNextToken/pkgjson/sample-16 560µs ± 1% 558µs ± 1% ~ (p=0.841 n=5+5)
DecoderNextToken/encodingjson/sample-16 3.16ms ± 1% 3.19ms ± 1% ~ (p=0.095 n=5+5)
name old speed new speed delta
DecoderNextToken/pkgjson/canada-16 402MB/s ± 0% 398MB/s ± 0% -0.95% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/canada-16 34.7MB/s ± 1% 34.3MB/s ± 1% ~ (p=0.151 n=5+5)
DecoderNextToken/pkgjson/citm_catalog-16 716MB/s ± 1% 706MB/s ± 1% -1.39% (p=0.016 n=5+5)
DecoderNextToken/encodingjson/citm_catalog-16 87.3MB/s ± 1% 87.6MB/s ± 0% ~ (p=0.690 n=5+5)
DecoderNextToken/pkgjson/twitter-16 539MB/s ± 1% 537MB/s ± 2% ~ (p=1.000 n=5+5)
DecoderNextToken/encodingjson/twitter-16 51.9MB/s ± 1% 52.2MB/s ± 1% ~ (p=0.310 n=5+5)
DecoderNextToken/pkgjson/code-16 318MB/s ± 2% 319MB/s ± 1% ~ (p=1.000 n=5+5)
DecoderNextToken/encodingjson/code-16 25.0MB/s ± 2% 25.1MB/s ± 1% ~ (p=0.548 n=5+5)
DecoderNextToken/pkgjson/example-16 521MB/s ± 0% 520MB/s ± 1% ~ (p=0.841 n=5+5)
DecoderNextToken/encodingjson/example-16 49.5MB/s ± 1% 49.3MB/s ± 1% ~ (p=0.167 n=5+5)
DecoderNextToken/pkgjson/sample-16 1.23GB/s ± 1% 1.23GB/s ± 1% ~ (p=0.841 n=5+5)
DecoderNextToken/encodingjson/sample-16 218MB/s ± 1% 216MB/s ± 1% ~ (p=0.095 n=5+5)
switch is convenient, but not optimal in the hot path.
This problem turns up in many places; bytecode interpreters are a classic example.
One of the optimisations compilers can make (although the Go compiler does not implement this currently) is to turn this set of if … else clauses into a table.
This can be space efficient if the state space is small and dense (often it is not), and the result might be something like this:
var stateTable = [...]func(*Decoder, []byte) ([]byte, error){
stateValue: (*Decoder).stateValue,
stateObjectString: (*Decoder).stateObjectString,
stateObjectColon: (*Decoder).stateObjectColon,
stateObjectValue: (*Decoder).stateObjectValue,
stateObjectComma: (*Decoder).stateObjectComma,
stateArrayValue: (*Decoder).stateArrayValue,
stateArrayComma: (*Decoder).stateArrayComma,
stateEnd: (*Decoder).stateEnd,
}
func (d *Decoder) NextToken() ([]byte, error) {
tok := d.scanner.Next()
if len(tok) < 1 {
return nil, io.EOF
}
return stateTable[d.state](d, tok)
}
Unfortunately this won’t compile because there is an initialisation loop.
./decoder.go:115:5: initialization loop:
/Users/davecheney/devel/json/decoder.go:115:5: stateTable refers to
/Users/davecheney/devel/json/decoder.go:155:19: (*Decoder).stateObjectColon refers to
/Users/davecheney/devel/json/decoder.go:126:19: (*Decoder).NextToken refers to
/Users/davecheney/devel/json/decoder.go:115:5: stateTable
FAIL github.com/pkg/json [build failed]
But there is a better trick that we can use that is more space efficient than this table.
It’s one that encoding/json uses, and it is sometimes called a computed goto.
If you look at the table above there is a pattern: each state enumeration is matched with exactly one method. We use state values as a proxy for the method we want to call.
What if we could just store the method directly and call it directly?
In fact, we can do just that.
// A Decoder decodes JSON values from an input stream.
type Decoder struct {
scanner *Scanner
state func(*Decoder, []byte) ([]byte, error)
stack
}
func (d *Decoder) NextToken() ([]byte, error) {
tok := d.scanner.Next()
if len(tok) < 1 {
return nil, io.EOF
}
return d.state(d, tok)
}
Note: Method expressions like this aren’t that common, because in Go 1.1 the ability to capture the receiver of a method (a method value) was added.
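For readers who haven’t met them, a quick illustrative comparison of a method expression and a method value (not from the talk):
package main

import "fmt"

type counter struct{ n int }

func (c *counter) inc() { c.n++ }

func main() {
	c := &counter{}

	// Method expression: the receiver becomes an explicit first argument,
	// which is how the Decoder's state funcs are invoked as d.state(d, tok).
	f := (*counter).inc
	f(c)

	// Method value: the receiver c is captured when the value is created.
	g := c.inc
	g()

	fmt.Println(c.n) // 2
}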
The results aren’t that promising
name old time/op new time/op delta
DecoderNextToken/pkgjson/canada-16 5.60ms ± 0% 5.61ms ± 0% ~ (p=0.841 n=5+5)
DecoderNextToken/encodingjson/canada-16 64.9ms ± 1% 64.4ms ± 0% ~ (p=0.222 n=5+5)
DecoderNextToken/pkgjson/citm_catalog-16 2.41ms ± 1% 2.42ms ± 1% ~ (p=0.310 n=5+5)
DecoderNextToken/encodingjson/citm_catalog-16 19.8ms ± 1% 19.7ms ± 1% ~ (p=0.548 n=5+5)
DecoderNextToken/pkgjson/twitter-16 1.17ms ± 1% 1.20ms ± 2% +2.70% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/twitter-16 12.2ms ± 1% 12.1ms ± 1% ~ (p=1.000 n=5+5)
DecoderNextToken/pkgjson/code-16 6.11ms ± 3% 6.37ms ± 4% +4.26% (p=0.016 n=5+5)
DecoderNextToken/encodingjson/code-16 77.7ms ± 2% 78.6ms ± 2% ~ (p=0.310 n=5+5)
DecoderNextToken/pkgjson/example-16 25.0µs ± 0% 25.6µs ± 1% +2.25% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/example-16 263µs ± 1% 265µs ± 2% ~ (p=0.222 n=5+5)
DecoderNextToken/pkgjson/sample-16 560µs ± 1% 553µs ± 0% -1.15% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/sample-16 3.16ms ± 1% 3.17ms ± 1% ~ (p=0.690 n=5+5)
name old speed new speed delta
DecoderNextToken/pkgjson/canada-16 402MB/s ± 0% 401MB/s ± 0% ~ (p=0.841 n=5+5)
DecoderNextToken/encodingjson/canada-16 34.7MB/s ± 1% 35.0MB/s ± 0% ~ (p=0.222 n=5+5)
DecoderNextToken/pkgjson/citm_catalog-16 716MB/s ± 1% 713MB/s ± 1% ~ (p=0.310 n=5+5)
DecoderNextToken/encodingjson/citm_catalog-16 87.3MB/s ± 1% 87.6MB/s ± 1% ~ (p=0.548 n=5+5)
DecoderNextToken/pkgjson/twitter-16 539MB/s ± 1% 524MB/s ± 2% -2.62% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/twitter-16 51.9MB/s ± 1% 52.0MB/s ± 1% ~ (p=1.000 n=5+5)
DecoderNextToken/pkgjson/code-16 318MB/s ± 2% 305MB/s ± 3% -4.07% (p=0.016 n=5+5)
DecoderNextToken/encodingjson/code-16 25.0MB/s ± 2% 24.7MB/s ± 2% ~ (p=0.310 n=5+5)
DecoderNextToken/pkgjson/example-16 521MB/s ± 0% 509MB/s ± 1% -2.19% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/example-16 49.5MB/s ± 1% 49.1MB/s ± 2% ~ (p=0.222 n=5+5)
DecoderNextToken/pkgjson/sample-16 1.23GB/s ± 1% 1.24GB/s ± 0% +1.16% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/sample-16 218MB/s ± 1% 217MB/s ± 1% ~ (p=0.690 n=5+5)
But it unlocks several optimisations.
Outlining
func (d *Decoder) NextToken() ([]byte, error) {
return d.state(d)
}
func (d *Decoder) stateObjectColon() ([]byte, error) {
tok := d.scanner.Next()
if len(tok) < 1 {
return nil, io.ErrUnexpectedEOF
}
switch tok[0] {
case Colon:
d.state = (*Decoder).stateObjectValue
return d.NextToken()
default:
return tok, fmt.Errorf("stateObjectColon: expecting colon")
}
}
Moving the tok := d.scanner.Next() call into each state method might seem like a step backwards, but it has several positive effects.
- The first is not passing tok into each state method. This saves 3 words on the call stack.
- The second is that moving the if len(tok) < 1 check into the same function as the switch enables bounds check elimination. Previously, when the len(tok) check happened in Decoder.NextToken, Decoder.stateObjectColon didn’t know anything about the length of tok, so when the compiler encountered switch tok[0] it had to insert a bounds check to make sure tok was at least 1 element long. With the if check moved into the same function, the compiler knows that if we get past the check then tok is at least 1 element long, so the bounds check is not needed. We can see this in the debug information (a tiny illustrative sketch of the idea appears a little further below):
% go build -gcflags=-d=ssa/prove/debug=2 2>&1 | grep 135:
./decoder.go:135:13: Proved IsInBounds (v31)
- The final optimisation occurs because Decoder.NextToken, which was previously too complex to inline,
./decoder.go:107:6: cannot inline (*Decoder).NextToken: function too complex: cost 145 exceeds budget 80
is now inlinable:
./bench_test.go:217:29: inlining call to (*Decoder).NextToken method(*Decoder) func() ([]byte, error) { return d.state(d) }
This means that the call to dec.NextToken() in
for {
_, err := dec.NextToken()
if err == io.EOF {
break
}
check(b, err)
n++
}
becomes a direct call to the current state method.
for {
_, err := dec.state(dec) // this is what the compiler sees
if err == io.EOF {
break
}
check(b, err)
n++
}
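As mentioned above, here is a tiny hypothetical illustration of the bounds check elimination at work; because the length check dominates the index expression in the same function, the compiler’s prove pass can remove the bounds check on tok[0]:
// firstByte returns the first byte of tok, or 0 if tok is empty. With the
// length check in the same function, go build -gcflags=-d=ssa/prove/debug=2
// reports the tok[0] access as proved in bounds. Illustrative only.
func firstByte(tok []byte) byte {
	if len(tok) < 1 {
		return 0
	}
	return tok[0] // no bounds check needed here
}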
Let’s look at the results
DecoderNextToken/pkgjson/canada-16 5.60ms ± 0% 4.73ms ± 1% -15.54% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/canada-16 64.9ms ± 1% 64.2ms ± 1% ~ (p=0.056 n=5+5)
DecoderNextToken/pkgjson/citm_catalog-16 2.41ms ± 1% 2.17ms ± 2% -9.94% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/citm_catalog-16 19.8ms ± 1% 19.6ms ± 1% ~ (p=0.095 n=5+5)
DecoderNextToken/pkgjson/twitter-16 1.17ms ± 1% 1.05ms ± 1% -10.84% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/twitter-16 12.2ms ± 1% 11.9ms ± 1% -2.02% (p=0.008 n=5+5)
DecoderNextToken/pkgjson/code-16 6.11ms ± 3% 5.15ms ± 1% -15.77% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/code-16 77.7ms ± 2% 77.0ms ± 2% ~ (p=0.095 n=5+5)
DecoderNextToken/pkgjson/example-16 25.0µs ± 0% 22.0µs ± 0% -12.15% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/example-16 263µs ± 1% 263µs ± 0% ~ (p=0.690 n=5+5)
DecoderNextToken/pkgjson/sample-16 560µs ± 1% 510µs ± 1% -8.92% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/sample-16 3.16ms ± 1% 3.16ms ± 1% ~ (p=0.841 n=5+5)
name old speed new speed delta
DecoderNextToken/pkgjson/canada-16 402MB/s ± 0% 476MB/s ± 1% +18.39% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/canada-16 34.7MB/s ± 1% 35.1MB/s ± 1% ~ (p=0.056 n=5+5)
DecoderNextToken/pkgjson/citm_catalog-16 716MB/s ± 1% 795MB/s ± 2% +11.05% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/citm_catalog-16 87.3MB/s ± 1% 88.3MB/s ± 1% ~ (p=0.095 n=5+5)
DecoderNextToken/pkgjson/twitter-16 539MB/s ± 1% 604MB/s ± 1% +12.16% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/twitter-16 51.9MB/s ± 1% 53.0MB/s ± 1% +2.05% (p=0.008 n=5+5)
DecoderNextToken/pkgjson/code-16 318MB/s ± 2% 377MB/s ± 1% +18.71% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/code-16 25.0MB/s ± 2% 25.2MB/s ± 2% ~ (p=0.095 n=5+5)
DecoderNextToken/pkgjson/example-16 521MB/s ± 0% 593MB/s ± 0% +13.82% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/example-16 49.5MB/s ± 1% 49.6MB/s ± 0% ~ (p=0.651 n=5+5)
DecoderNextToken/pkgjson/sample-16 1.23GB/s ± 1% 1.35GB/s ± 1% +9.80% (p=0.008 n=5+5)
DecoderNextToken/encodingjson/sample-16 218MB/s ± 1% 218MB/s ± 1% ~ (p=0.841 n=5+5)
9-18% improvement over the previous version.
Unmarshalling
The second part of decoding is the conversion from JSON tokens to Go objects. In Go this is called unmarshalling.
This part of the package is very much a work in progress. Unmarshalling is the most expensive part of the package because it combines reflect, which is expensive, with unavoidable allocations.
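For completeness, this is how the package is meant to be used at this level; a minimal sketch assuming Decoder.Decode mirrors the encoding/json signature, as the benchmark names below suggest.
package main

import (
	"fmt"
	"strings"

	"github.com/pkg/json"
)

func main() {
	input := `{"a": 1, "b": true, "c": [1, "two", null]}`
	var v interface{}
	dec := json.NewDecoder(strings.NewReader(input))
	if err := dec.Decode(&v); err != nil { // assumed to match encoding/json's Decode
		panic(err)
	}
	fmt.Println(v)
}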
BenchmarkDecoderDecodeInterfaceAny/pkgjson/canada-16 217 27425893 ns/op 82.08 MB/s 8747163 B/op 281408 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/canada-16 153 38347477 ns/op 58.70 MB/s 20647616 B/op 392553 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/citm_catalog-16 747 8008839 ns/op 215.66 MB/s 5197853 B/op 89673 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/citm_catalog-16 360 16607501 ns/op 104.00 MB/s 9406809 B/op 95389 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/twitter-16 1606 3714515 ns/op 170.01 MB/s 2130731 B/op 30182 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/twitter-16 862 6927998 ns/op 91.15 MB/s 4283407 B/op 31278 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/code-16 333 17939351 ns/op 108.17 MB/s 7331643 B/op 232059 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/code-16 236 25324951 ns/op 76.62 MB/s 12332753 B/op 271292 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/example-16 76874 78079 ns/op 166.81 MB/s 50980 B/op 739 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/example-16 40886 146685 ns/op 88.79 MB/s 82855 B/op 782 allocs/op
BenchmarkDecoderDecodeInterfaceAny/pkgjson/sample-16 5240 1116081 ns/op 615.99 MB/s 399970 B/op 5542 allocs/op
BenchmarkDecoderDecodeInterfaceAny/encodingjson/sample-16 1123 5369313 ns/op 128.04 MB/s 2661872 B/op 7527 allocs/op
- unsafe could be used here to reduce allocations further.
Thematic ideas
- Allocations affect performance. The garbage collector might be fast to allocate and efficient to collect, but not allocating will always be faster.
- When dealing with data, allocations can be the biggest performance cost of an algorithm. O(n) allocations, per token or per byte, can be the limiting constraint on an algorithm; you can never go below that floor.
- API design influences performance. The encoding/json.Decoder API requires allocations because returning a primitive value via an interface causes it to escape to the heap — it effectively becomes a pointer to the value. This includes strings.
- Careful attention to the per byte and per token overheads delivered a JSON decoder that performs 2-3x faster than encoding/json.Decoder.Token and 8-10x faster with the alternative NextToken API.
- O(n) function calls per byte of input are the second limit. Optimising the hot path to convert per byte operations into per token/line/message/etc batches is key to avoiding function call overhead. Note how I predicted the outcome of encoding/json.Decoder.Token without looking at the code.
- Beyond this is the domain of micro optimisations: things like moving statements across function call boundaries to play inlining tricks. Don’t reach for the tricks I showed here without addressing the big O effects in your API; it won’t make any difference.
Performance matters, sometimes
I started this project because I wanted to apply some ideas that I had seen elsewhere to Go.
I believed that I could implement an efficient JSON parser based on my assumption that encoding/json was slower than it could be because of its API.
It turned out that I was right; it looks like there is 2-3x performance available in some unmarshalling paths and between 8x and 10x in tokenisation, if you’re prepared to accept a different API.
In this presentation I made some statements about performance, but I could only do so because I specifically constructed the question such that the run time of the algorithm was the most important feature. Don’t extrapolate my words to mean that all code should optimise for runtime and allocations above all else.
Sometimes performance matters, most times it doesn’t. It probably matters more when writing library code. It probably matters when dealing with input data of an unknown size and quality. It certainly doesn’t matter unless you have metrics that show a code path is too slow, and profiling to identify the cause.
Further work
The basics are done. Now what?
- Better compatibility with the encoding/json package.
- Reduce allocations in Decoder.NextToken, probably by replacing the []bool state stack with a bit vector (a rough sketch follows this list).
- Decode encoded strings in Scanner.parseString. I think this can be done without allocation because the unencoded form is always smaller.
- Scanner.parseNumber is slow because it visits its input twice; once in the scanner and a second time when it is converted to a float. I did an experiment and the first parse can be faster if we just look for the termination of the number without validation; canada.json went from 650 MB/s to 820 MB/s.
- The byteReader model is generally applicable to other kinds of parsing: xml decoding, http requests, source code …
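For the bit vector idea above, a rough sketch (hypothetical, not the package’s implementation) of how a word of bits plus a depth counter could replace the []bool stack:
// bitStack stores one bit per open scope: 1 for object, 0 for array. Up to 64
// levels of nesting fit in a single word; deeper inputs would need an
// overflow strategy that is not shown here.
type bitStack struct {
	bits  uint64
	depth int
}

func (s *bitStack) push(isObject bool) {
	s.bits <<= 1
	if isObject {
		s.bits |= 1
	}
	s.depth++
}

// pop discards the scope that has just been closed and reports whether the
// enclosing scope is an object, matching the semantics assumed for the
// []bool stack earlier.
func (s *bitStack) pop() bool {
	s.bits >>= 1
	s.depth--
	return s.bits&1 == 1
}

func (s *bitStack) len() int { return s.depth }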
Bonus
Benchmarking is hard, especially on laptops.
- OSX seems to be very poor at process isolation.
- Downloading a file while running a benchmark cost 10%.
- Storing my source on iCloud cost 10% (randomly).
- OSX seems to scan the binary on first run, which can slow the first benchmark in a run. Workaround: use go test -c and pause before running the test binary.
- Workaround: shut down everything on the machine, don’t touch it, don’t let it go to sleep.