A naive implementation of the XGBoost schema is quite quick. Obviously it scales with the number and depth of the trees, but it should be fast because the per-node operations are quite simple.
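To give a sense of how simple the per-prediction work is, a naive traversal looks roughly like this. This is a minimal sketch assuming a pointer-based node type; the type and field names are illustrative, not this library's actual API.

```go
package xgb

import "math"

// node mirrors the kind of per-node fields the XGBoost JSON schema exposes.
// Hypothetical layout, for illustration only.
type node struct {
	splitIndex     int     // index of the feature to test
	splitCondition float32 // split threshold
	defaultLeft    bool    // which branch missing values take
	left, right    *node   // nil for leaf nodes
	leafValue      float32 // contribution of a leaf to the margin
}

// predictTree walks a single tree for one feature vector and returns
// the leaf value it lands on.
func predictTree(n *node, features []float32) float32 {
	for n.left != nil { // still at an internal node
		v := features[n.splitIndex]
		switch {
		case math.IsNaN(float64(v)):
			if n.defaultLeft {
				n = n.left
			} else {
				n = n.right
			}
		case v < n.splitCondition:
			n = n.left
		default:
			n = n.right
		}
	}
	return n.leafValue
}
```

The naive version benchmarks as follows: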
```
BenchmarkXGBoost-10              196039     5223 ns/op     1829 B/op      4 allocs/op
BenchmarkXGBoostConcurrent-10    793015     1496 ns/op      969 B/op      4 allocs/op
BenchmarkLoadXGBoost-10             200  5957921 ns/op  1062448 B/op  17759 allocs/op
```
It's a bit annoying to compare directly against XGBoost and other libraries. These results are quite good, though, and are on the same order of magnitude as FPGA implementations. In practice, fetching and preprocessing the input data, along with network latency, will dominate inference time.
```
BenchmarkXGBoost-10               246240    4803 ns/op    844 B/op    4 allocs/op
BenchmarkXGBoostOptimized-10      252561    4250 ns/op     12 B/op    2 allocs/op
BenchmarkXGBoostConcurrent-10    2146634   591.4 ns/op    469 B/op    4 allocs/op
```
Here, we've applied a few optimizations:
- `float32` instead of `float64`: XGBoost also uses single-precision floats, so this actually aligns us more closely with XGBoost's defaults. Both benchmarks above use this; it was just much easier to find/replace it everywhere. It does make me a bit uncomfortable, since we're probably doing a bunch of back-and-forth conversions, even with `math32`.
- Inlined tree predict functions: we used 'outlining' (splitting functions into smaller ones so the hot path can be inlined) to increase performance, at the cost of a larger binary. This is reasonable, given that the tree predict functions are squarely in the hot path and called often.
- Cache locality: XGBoost's format stores node info in parallel lists, so a node can be represented as an integer index into each of those lists. The alternative is a single list of structs, each holding all of the info for one node. The latter is more efficient on modern architectures: because the data for a given node is stored contiguously, the CPU can access it more quickly (see the sketch after this list).
- Move transformation code into the predict loops: this is actually quite a large improvement, since we're no longer allocating a slice of floats on the heap just to pass to the transformation functions. It's probably worth it, as it's a 10-15% speedup and something like a 99% reduction in memory overhead, at the cost of making the predict code uglier: it has to handle every case itself instead of delegating.
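To make the last two points concrete, here's a rough sketch of a contiguous node layout with the transform applied inline in the predict loop, so no intermediate slice of margins is allocated. Everything here is an assumption for illustration: the `flatNode` type, the field names, and the hard-coded logistic transform are not this library's actual code.

```go
package xgb

import "math"

// flatNode keeps every field for a node next to each other in memory
// (a slice of structs), instead of spreading them across parallel lists.
type flatNode struct {
	splitIndex     uint32  // feature to test
	splitCondition float32 // split threshold
	left, right    uint32  // child indices into the same slice
	isLeaf         bool
	defaultLeft    bool // which branch missing values take
	leafValue      float32
}

type tree struct{ nodes []flatNode }

// predict sums the leaf values over all trees and applies a logistic
// transform inline (assuming a binary:logistic objective), avoiding a
// heap-allocated slice of per-tree margins.
func predict(trees []tree, features []float32) float32 {
	var margin float32
	for _, t := range trees {
		i := uint32(0)
		for {
			n := t.nodes[i]
			if n.isLeaf {
				margin += n.leafValue
				break
			}
			v := features[n.splitIndex]
			switch {
			case v != v: // NaN check without a float64 round-trip
				if n.defaultLeft {
					i = n.left
				} else {
					i = n.right
				}
			case v < n.splitCondition:
				i = n.left
			default:
				i = n.right
			}
		}
	}
	return 1 / (1 + float32(math.Exp(float64(-margin))))
}
```

With the parallel-lists layout, each of those fields would live in its own slice, so visiting a node touches several cache lines instead of one.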
- The XGBoost model format may change. (2024 update: it did change a little, but the library still seems to run fine in production.)
- XGBoost internally uses single-precision `float32`s. Go's standard library math functions operate on `float64`, and it's a bit of a pain to use `float32`s (e.g. we're just casting back and forth; see the small example below). This may lead to slightly different results, and slightly worse performance than if we could use `float32`s directly.
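As a small illustration of that casting overhead, compare a sigmoid computed by round-tripping through the standard library with one that stays in single precision. This assumes the `chewxy/math32` package for the `math32` mentioned above; the import path is an assumption, not necessarily what this library uses.

```go
package xgb

import (
	"math"

	"github.com/chewxy/math32" // assumed import path, for illustration
)

// sigmoid64 round-trips through float64 because the standard library's
// math functions only operate on float64.
func sigmoid64(x float32) float32 {
	return float32(1 / (1 + math.Exp(float64(-x))))
}

// sigmoid32 stays in single precision, matching XGBoost's internal type.
func sigmoid32(x float32) float32 {
	return 1 / (1 + math32.Exp(-x))
}
```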
Some observations
- I haven't applied many "clever" optimizations at all to this code. I aimed to write clean code that derived from the XGBoost export format and was easy to understand. This was pretty straightforward, and I was surprised at how easy it was to get good performance and to unmarshal the file format quickly (less than an hour, taking advantage of autogenerated structs). I'm sure there are some optimizations that could be made, but I'm not sure they would be worth it.
- Python's interop with C++ and Rust is great, but mainly that's a function of C++ and Rust offering a simple FFI. Because Go does not offer a similar setup, it's really difficult to use Go to optimize code that's later used from Python. While this library is designed to be plugged into a Go web server, if I were writing something more focused on the data science / ML workflow, I would do it in Rust for better portability.
- Likewise, the `cgo` API is a nightmare to work with. This is a shame, since sometimes you really do want to hyperoptimize a hot path or use a library written in C++. Before writing this library, I tried a 'simpler' implementation that used `cgo` to call into XGBoost, but you're just trading complexity in one arena for complexity elsewhere, and I found it to be a net negative.
- Go's philosophy of offering more batteries included than C++/Rust, while still offering more control over the underlying system than Python, is great for lots of things, but it seems best if you limit your service's scope to problems that can be solved with pure Go. That seems fitting, given that Go was developed so Google engineers could build services quickly, with reasonably good performance, and without having to worry about too many low-level details.
- The prediction code for XGBoost is actually quite simple; I'm a bit surprised that more people haven't written similar implementations. Other efforts exist, but I found them quite unintuitive to read and understand, especially since they rely on different file formats for parsing the underlying XGBoost model. In this library I made a more deliberate effort to keep the code simple and logically readable.
- On that point, I'm a bit displeased with the XGBoost team's decisions around file formats. They maintain quite a few different formats, and even though the JSON format is supposed to be their 'format of the future,' they have since added yet another UBJSON format. The format is also not documented clearly; there is a JSON schema, but it's not easy to understand how it is actually meant to be used.