Changes from 250 commits
Commits
433 commits
ef6d255
flattend
cpegeric Mar 13, 2026
08fc2e4
revert to main
cpegeric Mar 16, 2026
b5060c3
merge fix
cpegeric Mar 16, 2026
4bb6fc3
quantizer
cpegeric Mar 16, 2026
15047f1
bug fix misalign memory
cpegeric Mar 16, 2026
ae15dc3
get/set quantizer
cpegeric Mar 16, 2026
38ea4a9
pairwise
cpegeric Mar 16, 2026
9be0c39
pairwise
cpegeric Mar 16, 2026
4ede16e
hybrid
cpegeric Mar 16, 2026
a816435
pairwise distance in blockio/read.go
cpegeric Mar 16, 2026
622dc5e
bvt fix
cpegeric Mar 16, 2026
dde7275
cagra merged index need explicit call Start() before search
cpegeric Mar 17, 2026
588f756
Merge branch 'gpu_ivfpq' of github.com:cpegeric/matrixone into gpu_ivfpq
cpegeric Mar 17, 2026
d4cd5b5
remove compiler warning
cpegeric Mar 17, 2026
94ab8b4
benchmark
cpegeric Mar 17, 2026
922472c
optimize for replicated mode
cpegeric Mar 17, 2026
d45a0b5
dynamic batching
cpegeric Mar 17, 2026
2849c87
run false and then true
cpegeric Mar 17, 2026
d2a1833
run old path when useBatching = false
cpegeric Mar 17, 2026
bcc938e
go fmt
cpegeric Mar 17, 2026
4b428cc
set_use_batch and set_per_thread_device
cpegeric Mar 17, 2026
1347342
index_base class
cpegeric Mar 17, 2026
016702a
introduce main thread queue to make sure build in main thread
cpegeric Mar 17, 2026
8db3f72
bug fix thread safe queue with capacity limit
cpegeric Mar 18, 2026
db4b2e9
info
cpegeric Mar 18, 2026
c892823
bug fix thread safe queue stopped
cpegeric Mar 18, 2026
b28d746
build_internal refactor
cpegeric Mar 18, 2026
04b0bab
add chunk benchmark
cpegeric Mar 18, 2026
d306e77
tests for cuvs_worker
cpegeric Mar 18, 2026
01b9093
thread safe queue stress test
cpegeric Mar 18, 2026
4158505
search_batch_internal
cpegeric Mar 18, 2026
5f21f0e
clean up include headers
cpegeric Mar 18, 2026
559399b
get centroids return []T
cpegeric Mar 18, 2026
b587cc6
recall rate
cpegeric Mar 18, 2026
87fe7ea
recall rate shown
cpegeric Mar 18, 2026
1fc6dbc
info in JSON
cpegeric Mar 18, 2026
8b26d2b
info in JSON
cpegeric Mar 18, 2026
e01bc66
info test
cpegeric Mar 18, 2026
0533f7f
comment out Info
cpegeric Mar 18, 2026
2305f42
go fmt
cpegeric Mar 18, 2026
2fe384a
readme
cpegeric Mar 18, 2026
64c9acb
auto quantization
cpegeric Mar 18, 2026
172de9b
add blog.md
cpegeric Mar 19, 2026
1b63918
Merge branch 'main' into gpu_ivfpq
cpegeric Mar 19, 2026
bc40d61
more log
cpegeric Mar 19, 2026
44bf29a
more log
cpegeric Mar 19, 2026
daa10b8
bug fix assign wrong device id in single gpu mode
cpegeric Mar 19, 2026
8cd64c5
bug fix device id
cpegeric Mar 19, 2026
2bcfc08
sharded mode use int64 id in cagra
cpegeric Mar 19, 2026
9b2922e
cagra id use int64
cpegeric Mar 19, 2026
9324e56
bug fix deallocate
cpegeric Mar 19, 2026
914352d
inner scope to free temp memory
cpegeric Mar 19, 2026
73718dd
bug
Mar 19, 2026
679084f
uint32 as id cagra
Mar 19, 2026
45d9ee3
bug fix
Mar 19, 2026
1ef798e
repro case
Mar 19, 2026
af0a7c8
Revert "repro case"
Mar 19, 2026
862bdba
repro
Mar 19, 2026
f18a6d1
fixme
Mar 20, 2026
3800bb9
working
Mar 20, 2026
a7930cc
compilable
Mar 20, 2026
93b0bc3
snmg
Mar 20, 2026
73d98d8
test program
Mar 20, 2026
56ab723
all test passed
Mar 20, 2026
c53e9a8
snmg test
Mar 20, 2026
9b069f1
helper
Mar 20, 2026
d7d0519
add nccl library
Mar 20, 2026
1aa69a4
benchmark
Mar 20, 2026
f31fbc8
bug fix single gpu
Mar 21, 2026
ef7e7d0
train quantizer in load() and build()
Mar 21, 2026
e65496d
better sync
Mar 21, 2026
88308b5
rmm_pool
Mar 21, 2026
664afee
per device resource for replicated mode
Mar 21, 2026
5c9f908
remove rmm pool always crash
Mar 21, 2026
b066e7d
fix recall=1.0 when int8
Mar 21, 2026
6be19fe
get local index faster
Mar 21, 2026
df7e9cd
working replicated
Mar 22, 2026
5beb025
optimize with multiple queue, sharded result store
Mar 22, 2026
cf2cf3a
self replication mode working
Mar 22, 2026
57b9fff
bug fix any_cast error in single gpu
Mar 22, 2026
c760a7a
load with replicated mode
Mar 23, 2026
2ed731a
free replicated indexes memory
Mar 23, 2026
9cf56e5
explicate destroy
Mar 23, 2026
51b4399
disable sharded test
Mar 23, 2026
b4f4bbc
disable sharded test
Mar 23, 2026
a7419bc
disable sharded test
Mar 23, 2026
e838956
bug fix go tests
Mar 23, 2026
90ddf20
submit_all_devices
Mar 24, 2026
bec22d0
id mapping
Mar 24, 2026
14f80d3
worker threads
Mar 24, 2026
40aaa3d
ivf_flat sharded
Mar 24, 2026
9e3c6b4
ivfpq and cagra sharded mode
Mar 24, 2026
7c23bfb
benchmark tests
Mar 24, 2026
5f04081
add brute-force index benchmark
Mar 24, 2026
8c5d8d9
go tests
Mar 24, 2026
c486c5c
info
Mar 24, 2026
9a8bcd6
device threads pool and remove worker threads
Mar 24, 2026
69203a3
pinned memory
Mar 24, 2026
26bd7d2
fix f16 failure because of gencode
cpegeric Mar 25, 2026
67cc8b9
add test
cpegeric Mar 25, 2026
60e3e80
cleanup
cpegeric Mar 25, 2026
52a8aae
pinned memory pool
cpegeric Mar 25, 2026
13a13ae
async pairwise
cpegeric Mar 26, 2026
cf81cb7
async pairwise in vectorindex cpu and gpu
cpegeric Mar 26, 2026
632f925
async reader not tested
cpegeric Mar 26, 2026
7d3ad4a
Merge branch 'main' into gpu_async_reader
cpegeric Mar 26, 2026
5dc9885
async pairwise working
cpegeric Mar 26, 2026
8535a30
bug fix empty batch crash
cpegeric Mar 26, 2026
ebf4c57
bug fix prefix function didn't check constant
cpegeric Mar 26, 2026
a6d4338
code review and fix
cpegeric Mar 27, 2026
0483833
remove compiler warning
cpegeric Mar 27, 2026
89ec9c7
expose fromCache flag to reader
cpegeric Mar 27, 2026
14e61ab
brute force search with gpu - distance functions
cpegeric Mar 27, 2026
f095f5d
delete_id
cpegeric Mar 27, 2026
2d59334
save_dir, load_dir, pack and unpack
cpegeric Mar 27, 2026
1d2b2d9
bug fix
cpegeric Mar 27, 2026
27299e8
json refactor
cpegeric Mar 27, 2026
7a05541
bug fix for code review
cpegeric Mar 27, 2026
f63ee3d
remove device id from pairwise distance. automatic assign device id
cpegeric Mar 28, 2026
85ff0ef
extend() with ivf_flat and ivf_pq
cpegeric Mar 28, 2026
64d8c38
bug fix cagra extend and fix go failed tests
cpegeric Mar 28, 2026
0402280
fix New with ids and fix merge with ids
cpegeric Mar 28, 2026
44aaade
add delete_id go interface
cpegeric Mar 28, 2026
a01918c
fix shard bitset
cpegeric Mar 28, 2026
bfa0566
developer guide
cpegeric Mar 28, 2026
9aa9ca2
developer guide
cpegeric Mar 28, 2026
9a68215
overlap IO
cpegeric Mar 28, 2026
5df142b
sharded mode support extend to last shard
cpegeric Mar 29, 2026
9f30704
blog 8192 block size challenge
cpegeric Mar 29, 2026
68b7b9f
bug fix race condition with cusv_worker_t
cpegeric Mar 29, 2026
b9fb792
fix sharded extend() and negative inner product distance
Mar 30, 2026
c9a1036
validate build param with dataset
Mar 30, 2026
ac065fa
store all shard size
Mar 30, 2026
eaf09b8
race condition fix for cagra
Mar 30, 2026
e52758b
fix cagra validation
Mar 30, 2026
0f89af7
race condition fix on ivf_flat and ivf_pq
Mar 30, 2026
f420425
kmeans
cpegeric Mar 30, 2026
7fbab93
kmeans
cpegeric Mar 30, 2026
9199773
balanced kmeans
cpegeric Mar 30, 2026
3ff4c74
merge fix read.go
cpegeric Mar 30, 2026
c3ce145
no overlay
Mar 30, 2026
6d1f454
search async
Mar 31, 2026
b3ed7bc
validate params
Mar 31, 2026
831944c
multi index
Mar 31, 2026
bbbb9cb
revert to main
Mar 31, 2026
d4eda69
add cuvs library
Mar 31, 2026
cb1cb30
get next gpu device id as round robin fashion
cpegeric Mar 31, 2026
1fdd314
Merge branch 'main' into gpu_async_search
mergify[bot] Mar 31, 2026
70e6163
sca
cpegeric Mar 31, 2026
59c3c76
change between pairwise distance
cpegeric Mar 31, 2026
91f2dea
bvt test
cpegeric Mar 31, 2026
b4ab340
update comment
cpegeric Mar 31, 2026
48f961a
remove overlay
cpegeric Mar 31, 2026
385967b
submit_to_rank
cpegeric Mar 31, 2026
44e8049
merge fix
Apr 1, 2026
977a49d
python apis
cpegeric Apr 1, 2026
5e7a0ff
update
cpegeric Apr 1, 2026
314c9ee
python api update
cpegeric Apr 1, 2026
8023e70
remove snmg
Apr 1, 2026
a982bf4
bug fix multiple call of stop_fn and cagra invalid parameter value
cpegeric Apr 2, 2026
df68bd7
SearchFloat32 to avoid escape to heap
cpegeric Apr 2, 2026
086b8e4
update tests and fix sca
cpegeric Apr 2, 2026
43de38c
revert to main
cpegeric Apr 2, 2026
a3d1cdc
update cuda
cpegeric Apr 3, 2026
135c733
cagra
cpegeric Apr 13, 2026
c4f06c7
bug fix
cpegeric Apr 13, 2026
8631451
Pack and Unpack
cpegeric Apr 13, 2026
d6cfdbd
add chunk with ids
cpegeric Apr 13, 2026
0bdfc7e
preallocate memory for host_ids
cpegeric Apr 13, 2026
86ef619
preallocate memory for host_ids
cpegeric Apr 13, 2026
ce61030
sql support cagra and ivfpq
cpegeric Apr 14, 2026
9d17477
merge fix
cpegeric Apr 14, 2026
1623748
cagra cpu create index and search index
cpegeric Apr 14, 2026
bccc784
cagra table function
cpegeric Apr 14, 2026
324e558
fix setting
cpegeric Apr 14, 2026
9430c27
bug fix query type mismatch
cpegeric Apr 14, 2026
a794cae
set batch window
cpegeric Apr 15, 2026
61834ed
buffer data for quantizer
cpegeric Apr 15, 2026
146f208
configurable batch_window
cpegeric Apr 15, 2026
3e856e2
bug fix show create table
cpegeric Apr 15, 2026
0d283a0
merge fix
Apr 15, 2026
812fa9a
bug fix distribution mode
Apr 15, 2026
1acf4f1
cagra int64 ids
cpegeric Apr 15, 2026
0653213
remove debug log
cpegeric Apr 15, 2026
545dcd6
bug fix sharded mode low recall
Apr 16, 2026
0e6e458
fix race condition and set shard size in sharded mode
Apr 16, 2026
c3d49ee
remove start() and use f32 to train cagra
Apr 16, 2026
e586c43
cagra max iteration to 30
Apr 16, 2026
247d6ca
fix topk bigger than itopk_size
Apr 17, 2026
d970dae
itopk_size
Apr 17, 2026
99deade
cleanup index_base.hpp
Apr 18, 2026
0e96bc0
add ivfpq
Apr 20, 2026
ebcbc14
cudf
cpegeric Apr 20, 2026
f1d763d
fine tuning ivfpq with float16
Apr 20, 2026
0f44bbc
kmeans_train_percent for ivfpq
Apr 20, 2026
7d9fc38
nprobe
Apr 20, 2026
8b5648d
fix ivf_pq internal distance type to float32 with float16 quantization
Apr 20, 2026
e64c8c7
pushdown filter with cuvs index
Apr 21, 2026
2bdc06b
revert
Apr 21, 2026
02c8bb6
omp
Apr 21, 2026
99d2df5
fix race condition with shared_ptr for del_bs
Apr 22, 2026
dd3de23
optimize bitset auto-unrolling
Apr 22, 2026
c6ca2c7
bug fix sync_stream
Apr 22, 2026
a91c875
add nullmap for each filter column
Apr 22, 2026
3802043
add nullmap for each filter column
Apr 22, 2026
5365917
integration with MO and include columns
Apr 23, 2026
3da9ccf
fix special character escaped like < >
Apr 23, 2026
67b7ae0
merge fix
Apr 23, 2026
7338956
remove log
Apr 23, 2026
2ebea8f
fix ivfpq with post filtering
Apr 23, 2026
76f4596
validate include columns
Apr 23, 2026
9e712a0
better error handling
Apr 24, 2026
dd13e73
fix l2_distance() replacement with tablefunction in projection and fi…
Apr 24, 2026
4964594
support pkid as filter column
Apr 24, 2026
c182cb0
better error handling with submit_all_devices
Apr 24, 2026
3bc6af8
fix error handling
Apr 24, 2026
45e5855
blog
cpegeric Apr 27, 2026
0b62d58
focus pre-filter
cpegeric Apr 27, 2026
bea153e
lists
cpegeric Apr 27, 2026
0798e2c
update
cpegeric Apr 27, 2026
3a2734a
GPU 48G
cpegeric Apr 27, 2026
48b7317
why L40S but not A10
cpegeric Apr 27, 2026
cc35850
update build time
Apr 28, 2026
ab34f50
update build time
Apr 28, 2026
906f025
use moerr to replace fmt.Errorf
cpegeric Apr 28, 2026
b59f660
merge fix
cpegeric Apr 28, 2026
099bb60
cleanup cpu build
cpegeric Apr 28, 2026
875519f
go fmt
cpegeric Apr 28, 2026
2005414
add UT tests for coverage
cpegeric Apr 28, 2026
031b59a
add test
cpegeric Apr 28, 2026
ac66f01
add UT test
cpegeric Apr 28, 2026
b0b5992
UT Tests
cpegeric Apr 28, 2026
ccc3cff
UT Tests
cpegeric Apr 28, 2026
a50d44c
fix probe_limit propagate to ivfpq
cpegeric Apr 29, 2026
82ea663
hardware configuration
cpegeric Apr 30, 2026
9e0d061
update
cpegeric Apr 30, 2026
f555069
update
cpegeric Apr 30, 2026
348e1ea
update
cpegeric Apr 30, 2026
0875bc5
update stats
cpegeric Apr 30, 2026
b393974
remove auto batching
cpegeric Apr 30, 2026
4d7fdf2
update ivfflat recall
cpegeric Apr 30, 2026
5b9422c
update
cpegeric Apr 30, 2026
ae12105
auto-detect data size when index build
May 1, 2026
f9195ab
merge fix
May 1, 2026
d48dca3
update python library
May 1, 2026
a669306
go fmt
May 1, 2026
3490408
update blog
cpegeric May 6, 2026
7af60cb
update
cpegeric May 6, 2026
7cf1d07
remove usable DB cache
cpegeric May 6, 2026
5d1c194
pre-filter lock optimization
May 7, 2026
9 changes: 5 additions & 4 deletions Makefile
@@ -178,6 +178,7 @@ pb: vendor-build generate-pb fmt

VERSION_INFO :=-X '$(GO_MODULE)/pkg/version.GoVersion=$(GO_VERSION)' -X '$(GO_MODULE)/pkg/version.BranchName=$(BRANCH_NAME)' -X '$(GO_MODULE)/pkg/version.CommitID=$(LAST_COMMIT_ID)' -X '$(GO_MODULE)/pkg/version.BuildTime=$(BUILD_TIME)' -X '$(GO_MODULE)/pkg/version.Version=$(MO_VERSION)'
THIRDPARTIES_INSTALL_DIR=$(ROOT_DIR)/thirdparties/install
CGO_DIR=$(ROOT_DIR)/cgo
RACE_OPT :=
DEBUG_OPT :=
CGO_DEBUG_OPT :=
@@ -188,7 +189,7 @@ ifeq ($(MO_CL_CUDA),1)
$(error CONDA_PREFIX env variable not found.)
endif
CUVS_CFLAGS := -I$(CONDA_PREFIX)/include
CUVS_LDFLAGS := -L$(CONDA_PREFIX)/envs/go/lib -lcuvs -lcuvs_c
CUVS_LDFLAGS := -L$(CONDA_PREFIX)/lib -lcuvs -lcuvs_c -lnccl -lucxx -lucp -luct -lucs -lucm
CUDA_CFLAGS := -I/usr/local/cuda/include $(CUVS_CFLAGS)
CUDA_LDFLAGS := -L/usr/local/cuda/lib64/stubs -lcuda -L/usr/local/cuda/lib64 -lcudart $(CUVS_LDFLAGS) -lstdc++
TAGS += -tags "gpu"
@@ -198,11 +199,11 @@ ifeq ($(TYPECHECK),1)
TAGS += -tags "typecheck"
endif

CGO_OPTS :=CGO_CFLAGS="-I$(THIRDPARTIES_INSTALL_DIR)/include $(CUDA_CFLAGS)"
GOLDFLAGS=-ldflags="-extldflags '$(CUDA_LDFLAGS) -L$(THIRDPARTIES_INSTALL_DIR)/lib -Wl,-rpath,\$${ORIGIN}/lib -fopenmp' $(VERSION_INFO)"
CGO_OPTS :=CGO_CFLAGS="-I$(CGO_DIR) -I$(THIRDPARTIES_INSTALL_DIR)/include $(CUDA_CFLAGS)"
GOLDFLAGS=-ldflags="-extldflags '$(CUDA_LDFLAGS) -L$(CGO_DIR) -lmo -L$(THIRDPARTIES_INSTALL_DIR)/lib -Wl,-rpath,\$${ORIGIN}/lib -fopenmp' $(VERSION_INFO)"

ifeq ("$(UNAME_S)","darwin")
GOLDFLAGS:=-ldflags="-extldflags '-L$(THIRDPARTIES_INSTALL_DIR)/lib -Wl,-rpath,@executable_path/lib' $(VERSION_INFO)"
GOLDFLAGS:=-ldflags="-extldflags '-L$(CGO_DIR) -lmo -L$(THIRDPARTIES_INSTALL_DIR)/lib -Wl,-rpath,@executable_path/lib' $(VERSION_INFO)"
endif

ifeq ($(GOBUILD_OPT),)
65 changes: 47 additions & 18 deletions cgo/Makefile
@@ -1,48 +1,77 @@
DEBUG_OPT :=
UNAME_M := $(shell uname -m)
UNAME_S := $(shell uname -s)
CC ?= gcc

# Yeah, fast math. We want it to be fast, for all xcall,
# IEEE compliance should not be an issue.
OPT_LV := -O3 -ffast-math -ftree-vectorize -funroll-loops
CFLAGS=-std=c99 -g ${OPT_LV} -Wall -Werror -I../thirdparties/install/include
OBJS=mo.o arith.o compare.o logic.o xcall.o usearchex.o bloom.o
CUDA_OBJS=
COMMON_CFLAGS := -g $(OPT_LV) -Wall -Werror -fPIC -I../thirdparties/install/include
CFLAGS := -std=c99 $(COMMON_CFLAGS)
OBJS := mo.o arith.o compare.o logic.o xcall.o usearchex.o bloom.o
CUDA_OBJS :=
LDFLAGS := -L../thirdparties/install/lib -lusearch_c
TARGET_LIB := libmo.so

ifeq ($(UNAME_S),Darwin)
TARGET_LIB := libmo.dylib
LDFLAGS += -dynamiclib -undefined dynamic_lookup -install_name @rpath/$(TARGET_LIB)
else
LDFLAGS += -shared
endif

ifeq ($(UNAME_M), x86_64)
CFLAGS+= -march=haswell
CFLAGS += -march=haswell
endif

ifeq ($(MO_CL_CUDA),1)
ifeq ($(CONDA_PREFIX),)
$(error CONDA_PREFIX env variable not found. Please activate your conda environment.)
endif
CC = /usr/local/cuda/bin/nvcc
CFLAGS = -ccbin g++ -m64 --shared -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90
CFLAGS = -ccbin g++ -m64 -Xcompiler -fPIC -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90
CFLAGS += -I../thirdparties/install/include -DMO_CL_CUDA
CUDA_OBJS += cuda/cuda.o
CUDA_LDFLAGS := -L/usr/local/cuda/lib64/stubs -lcuda -L/usr/local/cuda/lib64 -lcudart -lstdc++
# Explicitly include all needed libraries for shared library linking
CUDA_LDFLAGS := -L/usr/local/cuda/lib64/stubs -lcuda -L/usr/local/cuda/lib64 -lcudart -L$(CONDA_PREFIX)/lib -lcuvs -lcuvs_c -ldl -lrmm -lstdc++
LDFLAGS += $(CUDA_LDFLAGS)
endif

all: libmo.a
.PHONY: all clean test debug

all: $(TARGET_LIB) libmo.a

libmo.a: $(OBJS)
$(TARGET_LIB): $(OBJS)
ifeq ($(MO_CL_CUDA),1)
make -C cuda
$(MAKE) -C cuda
$(MAKE) -C cuvs
$(CC) $(LDFLAGS) -o $@ $(OBJS) $(CUDA_OBJS) cuvs/*.o
else
$(CC) $(LDFLAGS) -o $@ $(OBJS)
endif
ar -rcs libmo.a $(OBJS) $(CUDA_OBJS)

#
# $(CC) -o libmo.a $(OBJS) $(CUDA_OBJS) $(CUDA_LDFLAGS)
libmo.a: $(OBJS)
ifeq ($(MO_CL_CUDA),1)
$(MAKE) -C cuda
$(MAKE) -C cuvs
ar -rcs $@ $(OBJS) $(CUDA_OBJS) cuvs/*.o
else
ar -rcs $@ $(OBJS)
endif

%.o: %.c
$(CC) $(CFLAGS) -c $< -o $@

test: libmo.a
make -C test
test: $(TARGET_LIB)
$(MAKE) -C test

.PHONY: debug
debug: override OPT_LV := -O0
debug: override DEBUG_OPT := debug
debug: all

.PHONY: clean
clean:
rm -f *.o *.a *.so
rm -f *.o *.a *.so *.dylib
ifeq ($(MO_CL_CUDA),1)
make -C cuda clean
$(MAKE) -C cuda clean
$(MAKE) -C cuvs clean
endif
33 changes: 18 additions & 15 deletions cgo/README.md
@@ -1,25 +1,28 @@
MatrixOne CGO Kernel
===============================

This directory contains cgo source code for MO. Running
make should produce two files to be used by go code.
On go side, go will `include "mo.h"` and `-lmo`.
This directory contains CGO source code for MatrixOne. Running `make` produces the core library files used by Go code.

On the Go side, the integration typically uses `mo.h` and links against the generated libraries:
```
mo.h
libmo.a
libmo.a / libmo.so
```

`mo.h` should be pristine, meaning it only contains C function
prototype used by go. The only datatypes that can be passed
between go and c code are int and float/double and pointer.
Always explicitly specify int size such as `int32_t`, `uint64_t`.
Do not use `int`, `long`, etc.
`mo.h` should remain pristine, containing only C function prototypes for Go to consume. Data passed between Go and C should be limited to standard types (int, float, double, pointers). Always specify explicit integer sizes (e.g., `int32_t`, `uint64_t`) and avoid platform-dependent types like `int` or `long`.

GPU Support (CUDA & cuVS)
-------------------------
The kernel supports GPU acceleration for certain operations (e.g., vector search) via NVIDIA CUDA and the cuVS library.

- **Build Flag:** GPU support is enabled by setting `MO_CL_CUDA=1` during the build.
- **Environment:** Requires a working CUDA installation and a Conda environment with `cuvs` and `rmm` installed.
- **Source Code:** GPU-specific code resides in the `cuda/` and `cuvs/` subdirectories.

Implementation Notes
--------------------------------
--------------------

1. Pure C.
2. Use memory passed from go. Try not allocate memory in C code.
3. Only depends on libc and libm.
4. If 3rd party lib is absolutely necessary, import source code
and build from source. If 3rd party lib is C++, wrap it completely in C.
1. **Language:** Core kernel is Pure C. GPU extensions use C++ and CUDA, wrapped in a C-compatible interface.
2. **Memory Management:** Prefer using memory allocated and passed from Go. Minimize internal allocations in C/C++ code.
3. **Dependencies:** The base kernel depends only on `libc`, `libm`, and `libusearch`. GPU builds introduce dependencies on CUDA, `cuvs`, and `rmm`.
4. **Third-party Libraries:** If a third-party library is necessary, it should be built from source (see `thirdparties/` directory). C++ libraries must be fully wrapped in C before being exposed to Go.
2 changes: 1 addition & 1 deletion cgo/cuda/Makefile
@@ -395,7 +395,7 @@ $(FATBIN_FILE): mocl.cu
$(EXEC) $(NVCC) $(INCLUDES) $(ALL_CCFLAGS) $(GENCODE_FLAGS) -o $@ -fatbin $<

cuda.o: cuda.cpp
$(EXEC) $(NVCC) $(INCLUDES) -O3 --shared $(ALL_CCFLAGS) $(GENCODE_FLAGS) -o $@ -c $<
$(EXEC) $(NVCC) $(INCLUDES) -O3 --shared -Xcompiler -fPIC $(ALL_CCFLAGS) $(GENCODE_FLAGS) -o $@ -c $<

mytest.o: cuda.cpp $(FATBIN_FILE)
$(EXEC) $(NVCC) $(INCLUDES) -DTEST_RUN -g -O0 $(ALL_CCFLAGS) $(GENCODE_FLAGS) -o $@ -c $<
84 changes: 84 additions & 0 deletions cgo/cuvs/Makefile
@@ -0,0 +1,84 @@
# Copyright 2021 Matrix Origin
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

NVCC := /usr/local/cuda/bin/nvcc
CC := gcc
CXX := g++

# Libraries
LIBS := -L/usr/local/cuda/lib64/stubs -lcuda -L/usr/local/cuda/lib64/cudart -L$(CONDA_PREFIX)/lib -lcuvs -lcuvs_c -lnccl -lucxx -lucp -luct -lucs -lucm -ldl -lrmm -lrapids_logger -Xlinker -lpthread -Xlinker -lm

INCLUDES := -I. -I/usr/local/cuda/include -I$(CONDA_PREFIX)/include -I$(CONDA_PREFIX)/include/rapids -I$(CONDA_PREFIX)/include/raft -I$(CONDA_PREFIX)/include/cuvs

# NVCC_FLAGS are for compilation only. -x cu tells nvcc to treat .cpp as .cu
NVCC_FLAGS := -O3 -std=c++17 -x cu -Xcompiler "-Wall -Wextra -fPIC" --extended-lambda --expt-relaxed-constexpr $(INCLUDES) -DLIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE -DRAFT_SYSTEM_LITTLE_ENDIAN=1 \
-gencode arch=compute_75,code=sm_75 \
-gencode arch=compute_80,code=sm_80 \
-gencode arch=compute_86,code=sm_86 \
-gencode arch=compute_89,code=sm_89 \
-gencode arch=compute_90,code=sm_90 \
-gencode arch=compute_90,code=compute_90

# LDFLAGS for linking only. DO NOT include -x cu here.
LDFLAGS := -O3 -std=c++17 -Xcompiler "-Wall -Wextra -fPIC"

# Source files
C_SRCS := brute_force_c.cpp ivf_flat_c.cpp ivf_pq_c.cpp cagra_c.cpp kmeans_c.cpp adhoc_c.cpp distance_c.cpp
CPP_SRCS := helper.cpp
TEST_SRCS := test/main_test.cu test/brute_force_test.cu test/ivf_flat_test.cu test/ivf_pq_test.cu test/cagra_test.cu test/kmeans_test.cu test/quantize_test.cu test/distance_test.cu test/batching_test.cu test/snmg_test.cu test/verify_half_conversion.cu

# Object files
OBJS := $(C_SRCS:.cpp=.o) $(CPP_SRCS:.cpp=.o)
TEST_OBJS := $(patsubst test/%.cu,obj/test/%.o,$(TEST_SRCS))

.PHONY: all clean debug release test

all: libmocuvs.so

test: test_cuvs_worker benchmark_cuvs test_kmeans

release: all

debug: NVCC_FLAGS := $(filter-out -O3,$(NVCC_FLAGS)) -O0 -g -lineinfo
debug: LDFLAGS := $(filter-out -O3,$(LDFLAGS)) -g
debug: all

libmocuvs.so: $(OBJS)
$(NVCC) $(LDFLAGS) -shared -o $@ $^ $(LIBS)

%.o: %.cpp
@echo "Compiling $< with NVCC"
$(NVCC) $(NVCC_FLAGS) -c $< -o $@

obj/test/%.o: test/%.cu
@mkdir -p $(@D)
@echo "NVCC $<"
$(NVCC) $(NVCC_FLAGS) -c $< -o $@

test_cuvs_worker: $(TEST_OBJS) $(OBJS)
@echo "Linking $@"
$(NVCC) $(LDFLAGS) $^ $(LIBS) -o $@

benchmark_cuvs: obj/test/benchmark_cuvs.o $(OBJS)
@echo "Linking $@"
$(NVCC) $(LDFLAGS) $^ $(LIBS) -o $@

test_kmeans: obj/test/test_kmeans.o $(OBJS)
@echo "Linking $@"
$(NVCC) $(LDFLAGS) $^ $(LIBS) -o $@

clean:
@echo "Cleaning up..."
rm -f libmocuvs.so *.o test_cuvs_worker benchmark_cuvs test_kmeans
rm -rf obj
119 changes: 119 additions & 0 deletions cgo/cuvs/README.md
@@ -0,0 +1,119 @@
Architecture Design: cuVS-Accelerated Vector Indexing

1. Overview
The MatrixOne cuvs package provides a high-performance, GPU-accelerated vector search and clustering infrastructure. It acts as
a bridge between the Go-based database kernel and NVIDIA's cuVS and RAFT libraries. The architecture is designed to solve three
primary challenges:
1. Impedance Mismatch: Reconciling Go’s concurrent goroutine scheduler with CUDA’s thread-specific resource requirements.
2. Scalability: Supporting datasets that exceed single-GPU memory (Sharding) or high-concurrency search requirements
(Replicated).
3. Efficiency: Minimizing CUDA kernel launch overhead via dynamic query batching.

---

2. Core Component: cuvs_worker_t
The cuvs_worker_t is the foundational engine of the architecture.

Implementation Details:
* Persistent C++ Thread Pool: Instead of executing CUDA calls directly from CGO (which could be scheduled on any OS thread),
the worker maintains a dedicated pool of long-lived C++ threads. Each thread is pinned to a specific GPU device.
* Job Queuing: Requests from the Go layer are submitted as "Jobs" to an internal thread-safe queue. The worker returns a
std::future, allowing the Go layer to perform other tasks while the GPU processes the request.
* Context Stability: By using dedicated threads, we ensure that CUDA context and RAFT resource handles remain stable and
cached, avoiding the expensive overhead of context creation or handle re-initialization.
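
The dedicated-thread/job-queue pattern described above can be sketched as follows. This is a minimal, CPU-only illustration with hypothetical names: the real cuvs_worker_t maintains a pool of such threads, each pinned to a GPU device, and the jobs wrap cuVS/RAFT calls.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

// Minimal single-thread job queue: callers submit jobs and receive a
// std::future; one long-lived thread (in the real worker, pinned to a
// GPU device) drains the queue, keeping the CUDA context stable.
class worker_t {
public:
    worker_t() : stop_(false), thr_([this] { run(); }) {}
    ~worker_t() {
        { std::lock_guard<std::mutex> lk(mu_); stop_ = true; }
        cv_.notify_all();
        thr_.join();
    }
    std::future<int> submit(std::function<int()> job) {
        auto task = std::make_shared<std::packaged_task<int()>>(std::move(job));
        std::future<int> fut = task->get_future();
        { std::lock_guard<std::mutex> lk(mu_); jobs_.push([task] { (*task)(); }); }
        cv_.notify_one();
        return fut;
    }
private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [this] { return stop_ || !jobs_.empty(); });
                if (stop_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // in the real worker: a CUDA/cuVS call on this thread
        }
    }
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool stop_;
    std::thread thr_;
};
```

The key property is that every job executes on the same long-lived thread, so per-thread CUDA state is created once and reused.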

---

3. Distribution Modes
The system supports three distinct modes to leverage multi-GPU hardware:

A. Single GPU Mode
* Design: The index resides entirely on one device.
* Use Case: Small to medium datasets where latency is the priority.

B. Replicated Mode (Scaling Throughput)
* Design: The full index is loaded onto multiple GPUs simultaneously.
* Mechanism: The cuvs_worker implements a load-balancing strategy (typically round-robin). Incoming queries are dispatched to
the next available GPU.
* Benefit: Linearly scales the Queries Per Second (QPS) by utilizing the compute power of all available GPUs.
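
The round-robin dispatch can be sketched with an atomic counter (hypothetical helper, not the actual class name in the worker):

```cpp
#include <atomic>
#include <cstddef>

// Round-robin device selection for replicated mode: an atomic counter
// makes the choice safe under concurrent callers without a lock.
class round_robin {
public:
    explicit round_robin(size_t n_devices) : n_(n_devices), next_(0) {}
    size_t pick() { return next_.fetch_add(1, std::memory_order_relaxed) % n_; }
private:
    size_t n_;
    std::atomic<size_t> next_;
};
```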

C. Sharded Mode (Scaling Capacity)
* Design: The dataset is partitioned into $N$ shards across $N$ GPUs.
* Mechanism:
1. Broadcast: A search request is sent to all GPUs.
2. Local Search: Each GPU searches its local shard independently using RAFT resources.
3. Top-K Merge: The worker aggregates the results ($N \times K$ candidates) and performs a final merge-sort (often on the
CPU or via a fast GPU kernel) to return the global top-K.
* Benefit: Enables indexing of massive datasets (e.g., 100M+ vectors) that would not fit in the memory of a single GPU.
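
The CPU-side variant of the top-K merge step can be sketched like this (illustrative only; names and the (id, distance) result shape are assumptions, with smaller L2 distance treated as better):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Merge per-shard results into a global top-k: concatenate the N*k
// candidates, then keep only the k best ordered by distance.
std::vector<std::pair<int64_t, float>>
merge_topk(const std::vector<std::vector<std::pair<int64_t, float>>>& shard_results,
           size_t k) {
    std::vector<std::pair<int64_t, float>> all;
    for (const auto& r : shard_results)
        all.insert(all.end(), r.begin(), r.end());
    size_t n = std::min(k, all.size());
    // partial_sort orders only the first n elements -- O(N*k log k)-ish work.
    std::partial_sort(all.begin(), all.begin() + n, all.end(),
                      [](const auto& a, const auto& b) { return a.second < b.second; });
    all.resize(n);
    return all;
}
```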

---

4. RAFT Resource Management
The package relies on RAFT (raft::resources) for all CUDA-accelerated operations.

* Resource Caching: raft::resources objects (containing CUDA streams, cuBLAS handles, and workspace memory) are held within the
cuvs_worker threads. They are created once at Start() and reused for the lifetime of the index.
* Stream-Based Parallelism: Every index operation is executed asynchronously on a RAFT-managed CUDA stream. This allows the
system to overlap data transfers (Host-to-Device) with kernel execution, maximizing hardware utilization.
* Memory Layout: Leveraging raft::mdspan and raft::mdarray ensures that memory is handled in a layout-aware manner
(C-contiguous or Fortran-contiguous), matching the requirements of optimized BLAS and LAPACK kernels.

---

5. Dynamic Batching: The Throughput Key
In a database environment, queries often arrive one by one from different users. Processing these as individual CUDA kernels is
inefficient due to launch overhead and under-utilization of GPU warps.

The Dynamic Batching Mechanism:
* Aggregation Window: When multiple search requests arrive at the worker within a small time window (microseconds), the worker
stalls briefly to aggregate them.
* Matrix Consolidation: Individual query vectors are packed into a single large query matrix.
* Consolidated Search: A single cuvs::neighbors::search call is made. GPUs are significantly more efficient at processing one
$64 \times D$ matrix than 64 individual $1 \times D$ vectors.
* Automatic Fulfillment: Once the batch search completes, the worker de-multiplexes the results and fulfills the specific
std::future for each individual Go request.
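
The consolidation and de-multiplexing steps can be sketched as two helpers (hypothetical names; the real worker operates on device buffers):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Packing step of dynamic batching: n individual dim-D query vectors
// are consolidated into one row-major (n x dim) matrix so a single
// search call can process them together.
std::vector<float> pack_queries(const std::vector<std::vector<float>>& queries,
                                size_t dim) {
    std::vector<float> batch;
    batch.reserve(queries.size() * dim);
    for (const auto& q : queries)
        batch.insert(batch.end(), q.begin(), q.end());
    return batch;
}

// De-multiplexing step: the flat (n x k) id buffer from the batched
// search is split back into one top-k slice per original request.
std::vector<std::vector<int64_t>> unpack_results(const std::vector<int64_t>& ids,
                                                 size_t n, size_t k) {
    std::vector<std::vector<int64_t>> per_query(n);
    for (size_t i = 0; i < n; ++i)
        per_query[i].assign(ids.begin() + i * k, ids.begin() + (i + 1) * k);
    return per_query;
}
```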

---

6. Automatic Type Quantization
To optimize memory footprint and search speed, the architecture features an automated quantization pipeline that converts
high-precision float32 vectors into compressed formats.

* Transparent Conversion: The Go layer can consistently provide float32 data. The system automatically handles the conversion
to the index's internal type (half, int8, or uint8) directly on the GPU.
* FP16 (Half Precision):
* Mechanism: Uses raft::copy to perform bit-level conversion from 32-bit to 16-bit floating point.
* Benefit: 2x memory reduction with negligible impact on search recall.
* 8-Bit Integer (int8/uint8):
* Mechanism: Implements a learned Scalar Quantizer. The system samples the dataset to determine optimal min and max
clipping bounds.
* Training: Before building, the quantizer is "trained" on a subset of the data to ensure the 256 available integer levels
are mapped to the most significant range of the distribution.
* Benefit: 4x memory reduction, enabling massive datasets to reside in VRAM.
* GPU-Accelerated: All quantization kernels are executed on the device. This minimizes CPU usage and avoids the latency of
converting data before sending it over the PCIe bus.
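
A host-side sketch of the learned scalar quantizer idea (the production kernels run on the GPU; the clipping strategy here is a simplified min/max version and the struct name is hypothetical):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// "Training" finds clipping bounds from a data sample; quantize() then
// maps values linearly onto the 256 available int8 levels.
// Assumes the trained range is non-degenerate (hi > lo).
struct scalar_quantizer {
    float lo = 0.f, hi = 1.f;

    void train(const std::vector<float>& sample) {
        auto mm = std::minmax_element(sample.begin(), sample.end());
        lo = *mm.first;
        hi = *mm.second;
    }
    int8_t quantize(float v) const {
        float clamped = std::min(std::max(v, lo), hi);      // clip outliers
        float scaled = (clamped - lo) / (hi - lo);          // 0..1
        return static_cast<int8_t>(std::lround(scaled * 255.f) - 128);  // -128..127
    }
};
```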

7. Supported Index Types
The following indexes are fully integrated into the MatrixOne GPU architecture:


┌──────────┬──────────────────────┬───────────────────────────────────────────────────────────────────────────────┐
│ Index │ Algorithm │ Strengths │
├──────────┼──────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ CAGRA │ Hardware-accelerated │ Best-in-class search speed and high recall. Optimized for hardware graph │
│ │ Graph │ traversal. │
│ IVF-Flat │ Inverted File Index │ High accuracy and fast search. Excellent for general-purpose use. │
│ IVF-PQ │ Product Quantization │ Extreme compression. Supports billions of vectors via lossy code compression. │
│ Brute │ Exact Flat Search │ 100% recall. Ideal for small datasets or generating ground-truth for │
│ Force │ │ benchmarks. │
│ K-Means │ Clustering │ High-performance centroid calculation for data partitioning and unsupervised │
│ │ │ learning. │
└──────────┴──────────────────────┴───────────────────────────────────────────────────────────────────────────────┘


8. Operational Telemetry
All indexes implement a unified Info() method that returns a JSON-formatted string. This allows the database to programmatically
verify:
* Hardware Mapping: Which GPU devices are holding which shards.
* Data Layout: Element sizes, dimensions, and current vector counts.
* Hyper-parameters: Internal tuning values like NLists, GraphDegree, or PQBits.