Matthew Jaffee for the GopherCon Liveblog on August 29, 2018
Presenter: George Tankersley
Liveblogger: Matt Jaffee
Everyone knows: if you need to go fast, bust out the assembly. But what if there's better performance to be had just by writing Go more carefully?
In this tutorial session, George will optimize a hash function in pure Go, starting from a direct implementation of the RFC and going until he beats the standard library. This will cover the whole process - starting with the naive code and a set of tests, then progressing through some algorithmic improvements and down the rabbit hole of program tuning. Along the way you'll see the performance impact of API design, some microbenchmarking pitfalls, an introduction to profiling, and a lot of practical tips for doing this kind of optimization.
When the hash function example runs out, George will draw further examples from the standard library (mostly crypto!) and other high-performance Go projects.
Code at: https://github.com/gtank/blake2s
George gets carried away with optimization - he wanted to make a Go version of a hash function as fast as the Rust/C++ implementations.
George is employed by Zcash - think Bitcoin with privacy.
A hash function maps variable-length input to fixed-length output.
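For instance, the standard library's SHA-256 (used here purely for illustration, since it's in the stdlib) hashes any input length down to exactly 32 bytes:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	short := sha256.Sum256([]byte("hi"))       // 2 bytes in
	long := sha256.Sum256(make([]byte, 1<<20)) // 1 MiB in

	// Both outputs are exactly 32 bytes.
	fmt.Printf("%x\n%x\n", short, long)
}
```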
Blake2 is an awesome new hash function with some great features, but... it's under-specified, and no one implements all of it. "Under-specified" is not a phrase you want to hear with regard to things like crypto and hash functions.
Zcash wanted to implement Blake2 independently in several languages to validate it.
The Blake2 algorithm: a block-based hash construction. Go's standard hash.Hash is a fairly complex interface - and it doesn't work for Blake2, which takes extra parameters (a key, a salt, and a personalization string) at construction time.
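For reference, here is the interface in question, from the standard library's hash package:

```go
package hash

import "io"

// Hash is the common interface implemented by all hash functions.
type Hash interface {
	// Write (via the embedded io.Writer interface) adds more
	// data to the running hash. It never returns an error.
	io.Writer

	// Sum appends the current hash to b and returns the
	// resulting slice.
	Sum(b []byte) []byte

	// Reset resets the Hash to its initial state.
	Reset()

	// Size returns the number of bytes Sum will return.
	Size() int

	// BlockSize returns the hash's underlying block size.
	BlockSize() int
}
```

None of these methods has anywhere to put Blake2's key, salt, or personalization, so the package needs its own constructor (hence the NewDigest call in the benchmark below).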
How do we measure? pprof, plus a script like this to run the benchmarks several times and record the results per branch:

```
DATE=`date -u +'%s' | tr -d '\n'`; BRANCH=`git rev-parse --abbrev-ref HEAD`
for i in {1..8}; do go test -bench . >> benchmark-$BRANCH-$DATE; done
```
testing.B is how you write benchmarks in Go - you can put them right in your test files. Each benchmark needs to loop from 0 to b.N; go test -bench determines N, you just have to make sure that whatever you're benchmarking runs N times.
```go
// emptyBuf is a zero-filled input buffer shared by the benchmarks
// (its size here is illustrative; it just needs to cover the largest case).
var emptyBuf [8192]byte

func benchmarkHashSize(b *testing.B, size int) {
	b.SetBytes(int64(size)) // makes -bench report throughput in MB/s
	sum := make([]byte, 32)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		digest, _ := NewDigest(nil, nil, nil, 32)
		digest.Write(emptyBuf[:size])
		digest.Sum(sum[:0])
	}
}
```
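To cover multiple input sizes, the helper gets wrapped in one exported benchmark per size (the particular sizes here are an assumption):

```go
func BenchmarkHash64(b *testing.B) { benchmarkHashSize(b, 64) }
func BenchmarkHash1K(b *testing.B) { benchmarkHashSize(b, 1024) }
func BenchmarkHash8K(b *testing.B) { benchmarkHashSize(b, 8192) }
```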
Quite slow initially - nowhere near 1 GB/s.
We found a function that is pretty small and doesn't have side effects - this is a good candidate for inlining.
Inlining is when the compiler copies the body of a function into the caller. It avoids function call overhead, but increases binary size.
Is it worth it? In this case, inlining the function gives a 70% speed boost.
How do you know if a function will inline? Ask the compiler to print its inlining decisions:

```
go test -gcflags="-m=2" 2>&1 | grep "too complex"
```
Manually inline these math/bits function calls! Now our cost is 81 - one over the inliner's budget of 80! George had to change the API to remove an argument to get it under budget - a pretty ugly hack.
Be careful - under inliner budget doesn't always mean faster - always benchmark.
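As a sketch of what "manually inline" means above (the function names are hypothetical; the rotate amount is one of the ones Blake2s's G function uses):

```go
package blake2s

import "math/bits"

// Before: a call into math/bits that counts against the
// enclosing function's inlining budget.
func rotBefore(x uint32) uint32 {
	return bits.RotateLeft32(x, -12) // rotate right by 12
}

// After: the same rotation written out by hand - no call
// for the inliner to account for.
func rotAfter(x uint32) uint32 {
	return x>>12 | x<<20
}
```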
What's next?
Use "weblist" this time. Gives a nice source view in the browser with line-by-line profiling data.
Search for a hot function from our "top5" pprof command. We see 2 slice lookups that are taking over 1.5s of the total 6s. Diving into that in the weblist view gives us an assembly breakdown. Two JAE (jump) instructions to runtime.panicindex - this is bound checking!
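For reference, the workflow to get to that view looks roughly like this (the profile filename and function name are assumptions):

```
go test -bench=. -cpuprofile=cpu.out
go tool pprof cpu.out
(pprof) top5
(pprof) weblist hashBlocks
```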
"Safety is great... but we're trying to go fast."
Finding bounds checks - the compiler can report every bounds check it inserts:

```
go test -gcflags="-d=ssa/check_bce/debug=1"
```
You can do one early bounds check at the deepest index you'll need, which lets the compiler eliminate the checks on every access after it.
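A minimal sketch of that pattern - it's the same trick encoding/binary uses in its ByteOrder methods:

```go
package blake2s

// load32 decodes a little-endian uint32 from the start of b.
func load32(b []byte) uint32 {
	_ = b[3] // early bounds check: proves b[0..3] are in range,
	// so the compiler drops the four individual checks below
	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}
```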
If that won't work, you can restructure the code by hand so the compiler never has to check. This can make the code super ugly - a dozen lines becomes a few hundred - but... 80% throughput boost!
Still some bounds checks - one more technique: Replace slices with arrays.
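Sketch of the idea (sizes and names here are illustrative): once data lives in a fixed-size array, every constant index is provably in range, so the compiler emits no checks at all:

```go
package blake2s

func firstLastXOR(in []byte) byte {
	var block [64]byte
	copy(block[:], in) // one copy up front...

	// ...then constant indices into the array need no bounds
	// checks: 0 and 63 are known to be < 64 at compile time.
	return block[0] ^ block[63]
}
```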
Going back to pprof, we're seeing mallocgc, makeslice, and memmove. Allocations are expensive - both up front, and again later because of garbage collection. Let's grep for make and see what we can get rid of.
The Go compiler contains "pattern matching" which can catch things like looping over a slice and zeroing it out. It will replace the loop with a special low-level call like memset.
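For example, the compiler recognizes exactly this loop shape and lowers it to a single runtime memory-clearing call, with no Go-level loop at all:

```go
package blake2s

// zero clears buf. The compiler recognizes this range-and-assign-zero
// pattern and replaces the whole loop with one runtime clear call.
func zero(buf []byte) {
	for i := range buf {
		buf[i] = 0
	}
}
```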
You can check the capacity of a slice with cap(), which is often greater than its length - you may be able to reuse existing allocated memory rather than making a new slice.
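This is why hash.Hash's Sum method takes a destination slice: the caller's buffer can be reused when it has room. A hedged sketch of the capacity check (the helper name is hypothetical):

```go
package blake2s

// grow extends b by n bytes for a digest to be written into,
// allocating only when b lacks the spare capacity.
func grow(b []byte, n int) []byte {
	if cap(b)-len(b) >= n {
		return b[:len(b)+n] // reuse the existing allocation
	}
	out := make([]byte, len(b)+n)
	copy(out, b)
	return out
}
```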
We've eked out another few percent!
There's more we could do, but it's probably not worth going much further.
Things start to depend heavily on the compiler version - you have to watch carefully that new Go versions don't break your "optimizations".
You won't be competitive with assembly in pure Go - maybe use Rust?