If there is any one salient technological aspect of the Burst world, it is the single-pass-scan and the various attendant challenges of getting it right. The single-pass-scan is the critical inner loop of Burst Behavioral Analysis, and the way we approach that game is where many of its performance wins lie.
Burst analytics are significant calculations across high-cardinality sets of behavioral entities, where each of those entities is an object-tree with generally high-cardinality collections of behavioral 'events'. All of that data needs to be filtered, measured, and categorized as fast and as efficiently as is technologically practical.
The bad news is that Burst needs to provide high-transaction-rate, low-latency calculations day in and day out, on very large entity sets where each entity can be quite large, with the basic algorithms so efficient as to be limited by the simple reading and writing of memory. Very simple changes to how memory is read, or how an instruction is turned into bytecode, can make dramatic differences.
This means:
- We need to strenuously limit the number of VM objects created and carefully manage non-VM memory as well.
- We need to carefully optimize how multiple CPUs and cores and their cache lines interact with the various cache levels of the memory architecture.
- We need to be sure we are using best practices with our multicore thread usage especially as regards synchronization.
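As a concrete illustration of these constraints, here is a minimal, hypothetical JVM sketch (not Burst's actual Tesla code; the fixed-width record layout and the filter are invented for illustration) that scans 'event' records held in off-heap direct memory, moving strictly forward and creating no per-record objects:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public final class OffHeapScanSketch {
    public static void main(String[] args) {
        // allocate off-heap (direct) memory -- no per-record JVM objects, no GC pressure
        ByteBuffer region = ByteBuffer.allocateDirect(1 << 20).order(ByteOrder.LITTLE_ENDIAN);
        // write 1000 fixed-width "event" records: (timestamp: long, code: int)
        for (int i = 0; i < 1000; i++) {
            region.putLong(1_000_000L + i);
            region.putInt(i % 7);
        }
        region.flip();
        // scan strictly forward: sequential access keeps hardware prefetch
        // and cache lines working for us rather than against us
        long count = 0;
        while (region.remaining() >= 12) { // 12 bytes per record
            long ts = region.getLong();    // decode in place, no object created
            int code = region.getInt();
            if (code == 3) count++;        // filter/measure without allocating
        }
        System.out.println("matched events: " + count);
    }
}
```

The same scan over a heap of per-event objects would pay for allocation, pointer chasing, and eventual collection on every record; here the only cost is the memory read itself.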
The good news comes in two forms:
- There is no need to calculate direct inter-entity relationships, i.e. most of the calculus ends up with a high degree of locality within the entity object-tree. This allows us to divide our processing across multiple cores and multiple nodes.
- Processing is inherently ordered by causality/time, i.e. there is a high degree of directionality in our algorithms. This allows us to take advantage of modern hardware's innate forward-moving path optimizations.
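These two properties combine naturally: because entities never reference one another, batches of entities can be scanned completely independently, one task per batch, each moving strictly forward. The sketch below is hypothetical (the batch layout and the divisible-by-3 "measure" are invented), but shows the shape of that embarrassingly parallel division of work:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.LongStream;

public final class PartitionedScanSketch {
    public static void main(String[] args) throws Exception {
        // hypothetical setup: 4 batches of 10_000 "entities" (a long stands in for an entity)
        int batchCount = 4;
        long[][] batches = new long[batchCount][];
        for (int b = 0; b < batchCount; b++) {
            long base = b * 10_000L;
            batches[b] = LongStream.range(base, base + 10_000).toArray();
        }
        // one scan task per batch; entities never reference each other,
        // so each task runs to completion with no cross-thread coordination
        ExecutorService pool = Executors.newFixedThreadPool(batchCount);
        List<Callable<Long>> tasks = new ArrayList<>();
        for (long[] batch : batches) {
            tasks.add(() -> {
                long matches = 0;
                for (long entity : batch) {          // strictly forward traversal
                    if (entity % 3 == 0) matches++;  // per-entity filter/measure
                }
                return matches;
            });
        }
        long total = 0;
        for (Future<Long> f : pool.invokeAll(tasks)) total += f.get(); // merge results
        pool.shutdown();
        System.out.println("total matches: " + total);
    }
}
```

The only synchronization point is the final merge of per-batch results, which is exactly the property that lets the same division of labor scale from cores to nodes.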
All this translates to a finite number of design practices:
- rigorously (no exceptions) translate all analytic processing of the entity object-model into a single-pass, depth-first traversal (easier said than done)
- don't create any VM objects during the scan. Even small amounts of GC are death at Burst operation rates.
- on a given worker node, batch entities into contiguous memory 'regions' and bind all operations to a single thread/core.
- place all significant data structures into off-heap memory 'parts'
- manage parts using lock-free, off-heap queues (thanks JCL)
- always move forward in a byte order sense when accessing large chunks of off heap memory (such as the Brio Blob)
- carefully divide threads into finite sized 'cpu bound' and cached 'async request' pools.
- be mindful of concurrency levels and transaction rates on queues
- have the OS do what it is best at, e.g. `mmap` files
- generate the final analysis algorithm into reusable, maximally 'efficient' bytecode and allow that bytecode to JIT optimize
- use highly specialized data structures such as Felt Cubes and Routes
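The first practice above, a single-pass depth-first traversal with no object creation, can be sketched as follows. The encoding is invented for illustration (pre-order `(value, childCount)` pairs in a flat array; Burst's actual Brio format is far richer), as is the depth-weighted measure, but the mechanics are the point: an explicit stack of counters replaces recursion, and the cursor only ever moves forward through the blob.

```java
public final class SinglePassTraversalSketch {
    // hypothetical encoding: the object-tree is serialized pre-order as
    // (value, childCount) pairs, so depth-first order IS byte order
    static long depthWeightedSum(int[] blob) {
        int[] remaining = new int[64]; // pending-child counts per level (assumed max depth)
        int top = 0;                   // stack size; depth of current node == top + 1
        long sum = 0;
        int i = 0;
        while (i < blob.length) {      // single forward pass, no recursion, no objects
            int value = blob[i], childCount = blob[i + 1];
            i += 2;
            sum += (long) value * (top + 1);   // visit: weight node by its depth (root = 1)
            if (top > 0) remaining[top - 1]--; // consume one pending child of the parent
            if (childCount > 0) {
                remaining[top++] = childCount; // descend into this node's children
            } else {
                while (top > 0 && remaining[top - 1] == 0) top--; // ascend past finished nodes
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // root(1) -> [ a(2) -> [ c(4) ], b(3) ]  encoded pre-order
        int[] blob = { 1, 2,  2, 1,  4, 0,  3, 0 };
        System.out.println("weighted sum: " + depthWeightedSum(blob));
    }
}
```

Note what is absent: no node objects, no pointer chasing, no backward seeks. Every analytic question has to be rephrased so it can be answered by exactly this kind of forward march, which is why the practice is "easier said than done."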
The single-pass-scan was an enormous architectural bet made in the very early stages of the Burst architecture. It was not even always clear that it could be done, i.e. that all the questions we would want to ask could be answered that way. Fortunately, it did in fact turn out to be a successful bet. This decision permeates the architecture. However, take an especially close look at these relevant modules for a deeper dive:
- Brio -- single pass scan encoded binary data format
- Tesla -- thread and memory management
- EQL -- declarative language with single pass scan semantic output
- Felt -- an execution semantic object model for single pass scans
- Fabric -- multi-node / multi-core distributed processing
- Zap -- high performance off heap data structures for single pass calculus