DuckDB-based query sidecar for MatrixOne, powered by the
Sirius GPU execution engine.
MatrixOne rewrites and forwards queries annotated with `/*+ SIDECAR */`
(DuckDB on CPU) or `/*+ SIDECAR GPU */` (Sirius on GPU) to this sidecar,
taking advantage of the GPU for analytic query processing.
What distinguishes MatrixOne from other Sirius integrations is its Hybrid Transactional/Analytical Processing (HTAP) architecture: MatrixOne sustains tens of thousands of transactions per second on modest CPU hardware while offloading heavy analytical queries to the GPU through Sirius. The result is near-instantaneous, transaction-consistent analytics over live operational data, without the latency of traditional data movement between systems.
| Path | Hint | Engine | Scan Pipeline |
|---|---|---|---|
| CPU | `/*+ SIDECAR */` | DuckDB vectorized | tae_scan() → pread → LZ4 (CPU) → DuckDB vectors |
| GPU | `/*+ SIDECAR GPU */` | Sirius + cuDF | tae_scan_task → coalesced pread → pinned host → cudaMemcpy → nvCOMP LZ4 (GPU) → CUDA decode → cudf tables |
Both paths apply object-level and block-level zone-map pruning to skip data that
cannot match filter predicates. The GPU path uses coalesced I/O to merge adjacent
reads into single pread() calls (e.g., 360 reads → 12 I/O calls for a 5-column
scan), and CRC stripping is performed in memory when reading local MO files.
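The coalescing pass itself is a single sorted sweep; the following is an illustrative C++ sketch of the idea (hypothetical names, not the tae-scanner source): sort the planned block reads by file offset, then merge neighbors whose gap is small enough to absorb into one read.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct ReadRange {
    uint64_t offset;
    uint64_t size;
};

// Merge block reads that are adjacent (or nearly so) in the file, so
// each merged range becomes a single pread() call.
std::vector<ReadRange> CoalesceReads(std::vector<ReadRange> reads,
                                     uint64_t max_gap) {
    std::sort(reads.begin(), reads.end(),
              [](const ReadRange &a, const ReadRange &b) {
                  return a.offset < b.offset;
              });
    std::vector<ReadRange> merged;
    for (const auto &r : reads) {
        if (!merged.empty() &&
            r.offset <= merged.back().offset + merged.back().size + max_gap) {
            // Close enough to the previous range: extend it instead of
            // issuing another I/O call. Small gaps are read and discarded.
            uint64_t end = std::max(merged.back().offset + merged.back().size,
                                    r.offset + r.size);
            merged.back().size = end - merged.back().offset;
        } else {
            merged.push_back(r);
        }
    }
    return merged; // e.g. 360 block reads collapsing into 12 pread() calls
}
```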
The GPU path bypasses the DuckDB execution engine entirely — compressed TAE data
goes directly from disk to GPU memory, with decompression and column decoding
performed by CUDA kernels. Filter predicates are pushed down and evaluated on GPU
via cudf::compute_column().
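A minimal C++ sketch of the disk-to-GPU staging step, under assumed names (`StageToGpu` is illustrative; the nvCOMP batched-LZ4 decompression and decode-kernel launches that follow it are omitted):

```cpp
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical sketch: disk → pinned host buffer → device buffer.
// The real pipeline batches many blocks per task and then runs GPU
// decompression and decode kernels on the device buffer.
bool StageToGpu(const char *path, off_t offset, size_t size,
                void **device_out, cudaStream_t stream) {
    void *pinned = nullptr;
    // Page-locked host memory lets cudaMemcpyAsync overlap with compute.
    if (cudaHostAlloc(&pinned, size, cudaHostAllocDefault) != cudaSuccess)
        return false;

    // One coalesced pread() brings the compressed blocks into the pinned
    // staging buffer (CRC stripping would happen here, in memory).
    int fd = open(path, O_RDONLY);
    if (fd < 0 || pread(fd, pinned, size, offset) != (ssize_t)size) {
        if (fd >= 0) close(fd);
        cudaFreeHost(pinned);
        return false;
    }
    close(fd);

    void *dev = nullptr;
    if (cudaMalloc(&dev, size) != cudaSuccess) {
        cudaFreeHost(pinned);
        return false;
    }
    // Async host→device copy of the still-compressed data; decompression
    // happens on the GPU afterwards, so the CPU never touches LZ4.
    cudaMemcpyAsync(dev, pinned, size, cudaMemcpyHostToDevice, stream);
    // Caller must synchronize `stream` before freeing `pinned` or
    // launching dependent kernels.
    *device_out = dev;
    return true;
}
```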
See DESIGN.md §13 for full architecture.
| Extension | Source | Description |
|---|---|---|
| tae-scanner | duckdb-tae-scanner | Reads MatrixOne TAE storage objects as DuckDB table functions |
| httpserver | duckdb-httpserver | DuckDB HTTP server for accepting SQL queries |
| sirius | sirius | GPU-accelerated SQL execution via cuCascade/cuDF |
Extensions are statically linked into the DuckDB binary — no manual LOAD needed.
The GPU build adds Sirius on top of the base extensions.
CMake ≥ 3.15, Ninja, Clang (recommended) or GCC ≥ 11, plus lz4 and OpenSSL dev libraries.
Debian / Ubuntu:

```sh
sudo apt install clang cmake ninja-build liblz4-dev libssl-dev git libcurl4-openssl-dev
```

Fedora / RHEL / Rocky:

```sh
sudo dnf install clang cmake ninja-build lz4-devel openssl-devel git
```

Arch Linux:

```sh
sudo pacman -S clang cmake ninja lz4 openssl git
```

```sh
git clone --recurse-submodules https://github.com/matrixorigin/mo-sirius-sidecar.git
cd mo-sirius-sidecar

# Configure (first time only)
cmake -S duckdb -B build/release -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DDUCKDB_EXTENSION_CONFIGS="$(pwd)/extension_config.cmake"

# Build
ninja -C build/release
```

Artifacts:

- `build/release/duckdb` — DuckDB shell with all extensions linked
- `build/release/extension/tae_scanner/tae_scanner.duckdb_extension` — loadable
- `build/release/extension/httpserver/httpserver.duckdb_extension` — loadable
- `build/release/extension/sirius/sirius.duckdb_extension` — loadable (GPU build only)
The Sirius extension uses pixi to manage CUDA toolkit, cuDF, and other RAPIDS dependencies. Install pixi first, then initialize the Sirius conda environment:
```sh
# Install pixi (one-time)
curl -fsSL https://pixi.sh/install.sh | bash

# Initialize Sirius submodule deps (cucascade is required at build time)
git -C sirius submodule update --init cucascade

# Install CUDA + RAPIDS toolchain into sirius/.pixi/
cd sirius && pixi install && cd ..
```

Build from within the pixi environment so the compiler can find CUDA, cuDF, lz4, and OpenSSL:

```sh
SIDECAR_DIR=$(pwd)

# Configure (first time only)
cd sirius && pixi run -- bash -c "
  cmake -S $SIDECAR_DIR/duckdb -B $SIDECAR_DIR/build/release-gpu -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DDUCKDB_EXTENSION_CONFIGS=$SIDECAR_DIR/extension_config_gpu.cmake
" && cd ..

# Build
cd sirius && pixi run -- bash -c "
  ninja -C $SIDECAR_DIR/build/release-gpu
" && cd ..
```

Note: The pixi compiler wrapper (conda-forge GCC) does not see the system `/usr/include`, so system libraries like lz4 and OpenSSL are declared as pixi dependencies in `sirius/pixi.toml`.
Note: On machines without an NVIDIA GPU the build succeeds but the binary will print "NVML not available" and refuse GPU queries at runtime.
This adds the Sirius GPU execution engine on top of tae_scanner + httpserver.
Run the sidecar binary directly. The HTTP server auto-starts on the
specified port; DUCKDB_HTTPSERVER_FOREGROUND=1 blocks on an atexit
handler so the process stays up after the DuckDB REPL ends. Send
SIGINT to shut down gracefully.
```sh
# CPU sidecar
DUCKDB_HTTPSERVER_FOREGROUND=1 DUCKDB_HTTPSERVER_PORT=9876 \
  ./build/release/duckdb

# GPU sidecar (RPATH already points at sirius/.pixi/envs/default/lib —
# no `pixi run` wrap needed on the host where it was built)
DUCKDB_HTTPSERVER_FOREGROUND=1 DUCKDB_HTTPSERVER_PORT=9876 \
  SIRIUS_LOG_LEVEL=info \
  ./build/release-gpu/duckdb
```

When backgrounding without a TTY, redirect stdin to avoid SIGTTIN:

```sh
DUCKDB_HTTPSERVER_FOREGROUND=1 DUCKDB_HTTPSERVER_PORT=9876 \
  ./build/release-gpu/duckdb < /dev/null > sidecar.log 2>&1 &
```

Set `SIRIUS_LOG_LEVEL=debug` for verbose GPU execution logs (very noisy).
A combined MO + GPU sidecar image is defined in docker/Dockerfile. The
canonical build entrypoint is docker/build.sh, which defaults to
podman (override with BUILD_ENGINE=docker):
```sh
./docker/build.sh                                # uses ../mo-tpch by default
MO_TPCH_DIR=/path/to/mo-tpch ./docker/build.sh
IMAGE_TAG=mo-sirius:dev ./docker/build.sh
BUILD_ENGINE=docker ./docker/build.sh
```

A typical run with all bind-mounts (data, TPC-H scratch, logs, sirius config), daemonized so we can drive it later via `podman exec`:

```sh
mkdir -p $(pwd)/{mo-data,tpch-data,log}
podman run -d --name mo-sirius --device nvidia.com/gpu=all \
  -p 6001:6001 -p 8888:8888 -p 9999:9999 \
  -v $(pwd)/mo-data:/mo-data \
  -v $(pwd)/tpch-data:/opt/mo-tpch/data \
  -v $(pwd)/log:/log \
  -v $(pwd)/sirius.yaml:/etc/sidecar/sirius.yaml:ro \
  mo-sirius:latest
```

GPU access: podman uses CDI (`--device nvidia.com/gpu=all`, or `=<index>` / `=<UUID>` to pin one GPU). Docker users substitute `--gpus all`.
What each mount is for:
- `/mo-data` — MO catalog, logs, and TAE objects. The bundled MO configs use `data-dir = "./mo-data"` and the entrypoint runs from `/`. Without this mount the data lives in the container's writable layer and is lost when the container is removed. Same convention as upstream's `etc/docker-multi-cn-local-disk/docker-compose.yml`.
- `/opt/mo-tpch/data` — TPC-H `dbgen` output. Required for SF ≥ 10 to keep multi-GB `.tbl` files out of the writable layer. Set the `DATA_DIR` env var to override.
- `/log` — see "Container logs" below.
Running TPC-H benchmarks. The image bundles
mo-tpch at /opt/mo-tpch
with a pre-built dbgen, the schema (mo.ddl), all 22 queries, and
golden answers. A convenience wrapper tpch-bench runs the full
generate → create-tables → load → query pipeline, with an ENGINE
switch to route queries through MO native, the CPU sidecar, or the
GPU sidecar:
```sh
# inside the running container (or via podman exec):
tpch-bench 1                              # SF=1, all phases, ENGINE=native (default)
SF=10 tpch-bench                          # SF=10
GEN=0 LOAD=0 tpch-bench 10                # SF=10, queries only
ENGINE=cpu GEN=0 LOAD=0 tpch-bench 10     # route via CPU sidecar (/*+ SIDECAR */)
ENGINE=gpu GEN=0 LOAD=0 tpch-bench 10     # route via GPU sidecar (/*+ SIDECAR GPU */)

# override MO connection or data location:
MO_HOST=mo MO_PORT=6001 tpch-bench 1
DATA_DIR=/data/sf10 tpch-bench 10         # bind-mount /data for large SFs
```

`ENGINE=cpu|gpu` injects the corresponding sidecar hint as the first line of every query before piping to `mariadb --comments`, so MO forwards the rewritten SQL to the in-container sidecar at `http://127.0.0.1:9999`.
To drive the bench from the host against the daemonized container above,
use podman exec:
```sh
podman exec mo-sirius bash -lc 'ENGINE=gpu tpch-bench 10'

# Reuse already-loaded data — queries only:
podman exec mo-sirius bash -lc 'ENGINE=gpu GEN=0 CTAB=0 LOAD=0 tpch-bench 10'

# Run in the background (detached); follow output via the /log bind-mount:
podman exec -d mo-sirius bash -lc 'ENGINE=gpu tpch-bench 10'
tail -f log/tpch/*/run.log
```

Container logs. MO and the sidecar run at debug level by default and would otherwise flood the host's syslog through the journald log driver. The entrypoint redirects them, along with mo-tpch's `/opt/mo-tpch/report` and `run.log`, into files under `/log` inside the container:
```text
/log/
├── mo-YYYYMMDD-HHMMSS.log        # mo-service stdout/stderr (per container start)
├── sidecar-YYYYMMDD-HHMMSS.log   # DuckDB sidecar stdout/stderr (per container start)
└── tpch/
    └── YYYYMMDD-HHMMSS/          # one subdir per container start (symlinked
        ├── run.log               #   from /opt/mo-tpch/{report,run.log})
        └── TPCH_<SF>/q*.txt
```
Container stdout only carries [entrypoint] lifecycle messages.
Bind-mount /log (see the run example above) to harvest everything
on the host, or set LOG_DIR to a different in-container path.
Runtime configuration overrides. The image ships a default
sirius.yaml at /etc/sidecar/sirius.yaml and MO configs at
/etc/launch/*.toml. The typical run above already shows the
sirius.yaml bind-mount; you can do the same for the MO configs, or
bypass them entirely:
- Point `SIRIUS_CONFIG_FILE` at a custom path:

  ```sh
  podman run --device nvidia.com/gpu=all ... \
    -v /host/configs:/custom:ro \
    -e SIRIUS_CONFIG_FILE=/custom/my-sirius.yaml \
    mo-sirius:latest
  ```

- Tune knobs via environment variables (see the table below); all `SIRIUS_*`, `DUCKDB_HTTPSERVER_*`, `MO_DEBUG_HTTP`, and `MO_LAUNCH_CONF` are passed through:

  ```sh
  podman run --device nvidia.com/gpu=all ... \
    -e SIRIUS_TAE_BASELINE_COLS=6 \
    -e SIRIUS_LOG_LEVEL=info \
    -e DUCKDB_HTTPSERVER_AUTH=my-secret-token \
    mo-sirius:latest
  ```
| Variable | Default | Description |
|---|---|---|
| `DUCKDB_HTTPSERVER_PORT` | (none) | Set to auto-start the HTTP server on this port |
| `DUCKDB_HTTPSERVER_HOST` | `0.0.0.0` | Listen address |
| `DUCKDB_HTTPSERVER_AUTH` | (empty) | Auth token (X-API-Key or Basic auth) |
| `DUCKDB_HTTPSERVER_FOREGROUND` | `0` | Set to `1` to block after startup (daemon mode) |
| `SIRIUS_LOG_LEVEL` | `warn` | Sirius GPU engine log level (`info`, `debug`, `trace`) |
| `SIRIUS_TAE_BASELINE_COLS` | `4` | GPU TAE scan: projected-column count at which `scan_task_batch_size` is used as-is. The effective cap scales as `scan_task_batch_size × baseline / proj_cols` (floored at 32 MB); wider projections get smaller per-task batches to reduce GPU tail latency. `0` disables scaling. |
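For concreteness, the scaling rule restated as a small C++ helper (hypothetical code, not the engine's internals):

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative restatement of the SIRIUS_TAE_BASELINE_COLS rule above;
// names are hypothetical, not the engine's internal identifiers.
uint64_t EffectiveBatchCap(uint64_t scan_task_batch_size, // configured batch size
                           uint64_t baseline_cols,        // SIRIUS_TAE_BASELINE_COLS
                           uint64_t projected_cols) {
    const uint64_t kFloor = 32ULL << 20; // 32 MB lower bound
    if (baseline_cols == 0 || projected_cols <= baseline_cols)
        return scan_task_batch_size; // 0 disables scaling; narrow scans use it as-is
    uint64_t scaled = scan_task_batch_size * baseline_cols / projected_cols;
    return std::max(scaled, kFloor);
}
// Example: a 256 MB batch size with baseline 4 and an 8-column projection
// yields 256 MB × 4 / 8 = 128 MB per scan task.
```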
```sh
./build/release/duckdb \
  -cmd "SELECT httpserve_start('0.0.0.0', 9876, '')"
```

```sh
curl 'http://localhost:9876/?default_format=JSONCompact&query=SELECT+42'
# Expected: {"meta":[{"name":"42","type":"Int32"}],"data":[[42]],"rows":1}
```

Start the CPU or GPU sidecar on port 9876 (see Deploy above).
MatrixOne supports Sirius sidecar offloading. Currently MO must be started with the `-debug-http` flag, which enables the internal `/debug/tae/manifest` endpoint that the sidecar uses to discover TAE objects:
```sh
cd /path/to/matrixone
./mo-service -debug-http :8888 -launch etc/launch/launch.toml
```

Add to `cn.toml`:

```toml
[cn.frontend]
sidecarUrl = "http://localhost:9876"
```

Or set it per-session (useful for testing):

```sql
SET sidecar_url = 'http://localhost:9876';
```

A TPC-H dataset can be loaded from mo-tpch.
```sql
-- CPU sidecar (DuckDB vectorized engine):
/*+ SIDECAR */ SELECT count(*) FROM tpch.lineitem WHERE l_shipdate < '1998-09-01';

-- GPU sidecar (Sirius + cuDF, wraps query in gpu_execution()):
/*+ SIDECAR GPU */ SELECT count(*) FROM tpch.lineitem WHERE l_shipdate < '1998-09-01';
```

If the sidecar is not configured or not reachable, MO silently falls back to native execution (the hint is stripped).
NOTE: MatrixOne uses a MySQL-compatible client protocol. Any MySQL client can connect
to MatrixOne and run queries. If using the mariadb client, add --comments so that
SQL hints are preserved:
```sh
mariadb --skip-ssl -h 127.0.0.1 -P 6001 -u dump -p111 --comments
```

- MO HTTP timeout: MO's `fileservice` package overrides `http.DefaultTransport` with a 20-second `ResponseHeaderTimeout`. The sidecar HTTP client in MO must use a dedicated `http.Transport` to avoid this; see `pkg/frontend/sidecar_offload.go`.
- GPU VRAM limits: multi-table joins at SF100+ may hang if the GPU has insufficient VRAM (tested: an RTX 3070 8 GB handles SF10 fully, SF100 Q1-Q2 only).
```text
Client                     MatrixOne                  Sidecar (DuckDB + Sirius)
  │                         │                          │
  │ /*+ SIDECAR [GPU] */    │                          │
  │────────────────────────>│                          │
  │                         │ GET /debug/tae/manifest  │
  │                         │ (internal, for schema)   │
  │                         │                          │
  │                         │ Rewrite: table refs →    │
  │                         │ tae_scan(manifest_url)   │
  │                         │ GPU: wrap in             │
  │                         │ gpu_execution()          │
  │                         │                          │
  │                         │ POST rewritten SQL       │
  │                         │─────────────────────────>│
  │                         │                          │
  │                         │                          │ CPU path:
  │                         │                          │   tae_scan → pread →
  │                         │                          │   LZ4 decompress (CPU) →
  │                         │                          │   DuckDB vectors →
  │                         │                          │   DuckDB engine
  │                         │                          │
  │                         │                          │ GPU path:
  │                         │                          │   tae_scan_task → pread →
  │                         │                          │   pinned host memory →
  │                         │                          │   cudaMemcpy to GPU →
  │                         │                          │   nvCOMP LZ4 decompress →
  │                         │                          │   CUDA decode kernels →
  │                         │                          │   cudf filter pushdown →
  │                         │                          │   Sirius GPU engine
  │                         │                          │
  │                         │ JSONCompact response     │
  │                         │<─────────────────────────│
  │   MySQL result set      │                          │
  │<────────────────────────│                          │
```
```text
mo-sirius-sidecar/
├── duckdb/                     ← DuckDB v1.5.2 (submodule)
├── extension-ci-tools/         ← DuckDB build helpers (submodule)
├── tae-scanner/                ← TAE storage reader (submodule)
│   ├── src/                    ← Scanner, filter, object reader
│   └── include/                ← Headers
├── httpserver/                 ← HTTP query server (submodule)
│   └── src/                    ← Server, serializers
├── sirius/                     ← GPU SQL engine (submodule)
│   └── src/
│       ├── op/scan/            ← tae_scan_task (GPU native TAE reader)
│       ├── data/               ← host_tae→gpu_table converter (nvCOMP + CUDA)
│       ├── cuda/tae/           ← CUDA kernels (fixed decode, varchar, null mask)
│       ├── tae/                ← TAE metadata parser
│       └── ...                 ← GPU operators, cuCascade, planner
├── extension_config.cmake      ← CPU extensions config
├── extension_config_gpu.cmake  ← GPU extensions config
├── Makefile                    ← Build wrapper
├── DESIGN.md                   ← Architecture document
└── README.md
```