DRAFT: Add `embedfile` for all-in-one embeddings CLI tool #644

asg017 · 2024-11-27T20:57:41Z

embedfile is a CLI tool that bundles llama.cpp / llamafile, the SQLite CLI, sqlite-vec, sqlite-lembed, and a few other SQLite extensions into a comprehensive and performant tool for generating text embeddings from CSV, JSON, NDJSON, txt, or SQLite database files.

Just like llamafile and whisperfile, you can embed a .gguf embeddings model file into an embedfile, removing the need to manage weights yourself.

| Model | embedfile | Size (f16 quant) |
| --- | --- | --- |
| sentence-transformers/all-MiniLM-L6-v2 | `all-MiniLM-L6-v2.f16.embedfile` | `56MB` |
| mixedbread-ai/mxbai-embed-xsmall-v1 | `mxbai-embed-xsmall-v1-f16.embedfile` | `61MB` |
| nomic-ai/nomic-embed-text-v1.5 | `nomic-embed-text-v1.5.f16.embedfile` | `273MB` |
| snowflake-arctic-embed-m-v1.5 | `snowflake-arctic-embed-m-v1.5-f16.embedfile` | `221MB` |
| - | `embedfile` (no embedded model) | `12MB` |

Here's an example, using MixedBread's xsmall model:

$ wget https://huggingface.co/asg017/embedfile/resolve/main/mxbai-embed-xsmall-v1-f16.embedfile
$ chmod u+x mxbai-embed-xsmall-v1-f16.embedfile 
$ ./mxbai-embed-xsmall-v1-f16.embedfile --version
embedfile 0.0.1-alpha.1, llamafile 0.8.16, SQLite 3.47.0, sqlite-vec=v0.1.6, sqlite-lembed=v0.0.1-alpha.8

This executable file already has sqlite-vec, sqlite-lembed, and the embeddings model pre-configured. Test that embeddings work with:

$ ./mxbai-embed-xsmall-v1-f16.embedfile embed 'hello!'
[-0.058174,0.043776,0.030660,...]

You can embed data from CSV, JSON, NDJSON, and .txt files and save the results to a SQLite database. Here we are embedding the text column in the dbpedia.min.csv file, outputting to a dbpedia.db database.

$ ./mxbai-embed-xsmall-v1-f16.embedfile import --embed text dbpedia.min.csv dbpedia.db
INSERT INTO vec_items SELECT rowid, lembed("text") FROM temp.source;
100%|████████████████████| 10000/10000 [02:00<00:00, 83/s]
✔ dbpedia.min.csv imported into dbpedia.db, 10000 items

That was 10,000 rows with 820,604 tokens. I got 83 embeddings per second on my older 2019 Intel MacBook. On my M1 Mac Mini I get 173 embeddings/second, and I'm sure it's faster on newer Macs.
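As a rough sanity check on those figures, the arithmetic can be sketched in Python. The row count, token count, and elapsed time come from the output above; the function and variable names are illustrative only and not part of embedfile.

```python
# Figures quoted above: 10,000 rows, 820,604 tokens, progress bar at 02:00.
ROWS = 10_000
TOKENS = 820_604
ELAPSED_SECONDS = 2 * 60


def throughput(rows, tokens, seconds):
    """Derive simple rates from a finished import run."""
    return {
        "rows_per_sec": rows / seconds,      # about 83.3 on the Intel MacBook
        "tokens_per_sec": tokens / seconds,  # about 6,838
        "tokens_per_row": tokens / rows,     # about 82
    }


stats = throughput(ROWS, TOKENS, ELAPSED_SECONDS)

# At the 173 embeddings/second reported for the M1 Mac Mini,
# the same 10,000 rows would take roughly 58 seconds.
m1_seconds = ROWS / 173

print(stats)
print(f"estimated M1 time: {m1_seconds:.1f}s")
```

So the average row is about 82 tokens long, and the M1 run should finish in roughly half the time of the Intel run.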

Once indexed, you can search with the search command:

$ ./mxbai-embed-xsmall-v1-f16.embedfile search dbpedia.db 'global warming'
3240 0.852299 Attribution of recent climate change is the effort to scientifically ascertain mechanisms ...
6697 0.904844 The global warming controversy concerns the public debate over whether global warming is occurring, how ...
...
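For intuition, a nearest-neighbor search like this can be sketched as a brute-force scan in plain Python. This is only an illustration, not embedfile's implementation: the output above lists the smaller number first, which suggests a distance sorted ascending, but the metric is not stated here, so the Euclidean distance below is an assumption. The toy vectors and function names are made up.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(query_vec, items, k=2):
    """Return the k items closest to query_vec.

    items is a list of (rowid, vector, text) tuples.
    """
    scored = [(rowid, l2_distance(query_vec, vec), text) for rowid, vec, text in items]
    scored.sort(key=lambda row: row[1])  # smallest distance first
    return scored[:k]


# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions.
items = [
    (1, [1.0, 0.0, 0.0], "climate"),
    (2, [0.0, 1.0, 0.0], "cooking"),
    (3, [0.9, 0.1, 0.0], "warming"),
]

results = search([1.0, 0.0, 0.0], items, k=2)
for rowid, dist, text in results:
    print(rowid, f"{dist:.6f}", text)
```

Each printed line has the same shape as the search output above: a row id, a distance, and the matching text.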

At any point, if you want to "eject" and run SQL scripts yourself, the sh command will fire up the sqlite3 CLI with all extensions and embeddings models pre-configured.

$ ./mxbai-embed-xsmall-v1-f16.embedfile sh
SQLite version 3.47.0 2024-10-21 16:30:22
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .mode qbox
sqlite> select sqlite_version(), vec_version(), lembed_version();
┌──────────────────┬───────────────┬──────────────────┐
│ sqlite_version() │ vec_version() │ lembed_version() │
├──────────────────┼───────────────┼──────────────────┤
│ '3.47.0'         │ 'v0.1.6'      │ 'v0.0.1-alpha.8' │
└──────────────────┴───────────────┴──────────────────┘
sqlite> select vec_to_json(vec_slice(lembed('hello!'), 0, 8)) as sample;
┌──────────────────────────────────────────────────────────────┐
│                            sample                            │
├──────────────────────────────────────────────────────────────┤
│ '[-0.058174,0.043776,0.030660,0.047412,-0.059377,-0.036267,0 │
│ .038117,0.005184]'                                           │
└──────────────────────────────────────────────────────────────┘

Status

This was really fun to put together, and I'd love to see this (or something like this) as part of the llamafile project. I totally get it if it's out of scope or not a priority; I'd be happy to maintain an experimental fork if needed.

As-is, though, this branch isn't quite ready yet. There are a few things I want to fix:

- Code is under llama.cpp/embedfile directory, but maybe could be a top-level /embedfile?
- llama.cpp/embedfile/BUILD.mk is a bit messy. I had trouble compiling .c files in the subdirectory, so I manually added those builds. Would love some help cleaning that up!
- I made manual changes to the vendored-in sqlite-vec.c, sqlite-lembed.c, sqlite3.c, and shell.c files in order to fix a few cosmopolitan/integration issues. I want to clean those up before merging.
- Include licenses/notices
- A ton of assert()s that fail on any error

TODO

- Metadata + auxiliary column options in import
- Better TUI for search results. Maybe REPL?
- --k and other search options
- --prefix option for nomic-like embeddings, e.g. --prefix 'search_document:'
- Better perf
- More embeddings models uploaded to HF
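To illustrate the proposed --prefix option: some embedding models expect a task prefix in front of each input. A minimal sketch of that preprocessing step follows. The function name and the single space after the prefix are assumptions for illustration; the option is still on the TODO list and its behavior is not defined here.

```python
def with_prefix(texts, prefix):
    """Prepend a task prefix to every text before it is embedded.

    An empty prefix leaves the texts unchanged.
    """
    if not prefix:
        return list(texts)
    # Assumption: one space separates the prefix from the text.
    return [f"{prefix} {text}" for text in texts]


docs = ["The global warming controversy concerns the public debate."]
print(with_prefix(docs, "search_document:"))
```

Documents and queries would typically get different prefixes, so the same helper could be called with a query-side prefix at search time.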

Build yourself

./make o//llama.cpp/embedfile/embedfile
make -f embedfile.mk all


jart · 2024-11-29T04:10:51Z

Hi Alex. Thanks for sending this. This would make an awesome addition to the project.

> Code is under llama.cpp/embedfile directory, but maybe could be a top-level /embedfile?

I recommend putting it in the root of the repo, for better visibility.

> llama.cpp/embedfile/BUILD.mk is a bit messy, I had trouble compiling .c files in the subdirectory so I manually added those builds. Would love some help cleaning that up!

I've checked in SQLite under third_party/. Your build rule can now simply depend on o/$(MODE)/third_party/sqlite/sqlite.a. The build is configured to have all the compile-time options you specified in this change, e.g. FTS5, FTS3, etc.

> I made manual changes to the vendored in sqlite-vec.c, sqlite-lembed.c, sqlite3.c, and shell.c files in order to fix a few cosmopolitan/integration issues. I want to clean those up before merging.

I only needed to change the zlib include in sqlite3.c. If you need any other local changes, please feel free to make them to the new third_party location.

> Include licenses/notices

It's recommended that you declare them like this:

__notice(mbedtls_notice, "\                                                                                                                                                                             
Mbed TLS (Apache 2.0)\n\                                                                                                                                                                                
Copyright ARM Limited\n\                                                                                                                                                                                
Copyright The Mbed TLS Contributors");

In any one of your .c or .cpp files. This will ensure your copyright notice is distributed inside any binaries that are built with it.

> A ton of assert()'s that fail on any error

Tell me more? Maybe I can help.

> make -f embedfile.mk all

Could you incorporate this into the monolithic Makefile? While the default make rule needs to be hermetic, you can do whatever you want in manually-run rules. For example, under llamafile, there's a lot of manually-run CUDA stuff. You could have manual rules that package your standard embedfiles.

Here's some feedback:

- Thank you for taking the time to write a man page.
- Thank you for using the new cosmo_args() API.

Here are some suggestions / action items:

- Could you update make install so it installs embedfile and its man page?
- Please use the new third_party/sqlite/ package. Be sure to update #include lines to say #include "third_party/sqlite/sqlite3.h" etc.
- Consider adding a .clang-format file to your package directory, with your preferred style (use Mozilla style in llamafile/highlight/.clang-format if you don't have a preference) and then run clang-format -i on your sources.

github-actions bot added the llama.cpp label Nov 27, 2024

jart added a commit that referenced this pull request Nov 29, 2024

Introduce sqlite

d8123c7

See #644

asg017 added 16 commits November 30, 2024 10:29

initial pass

0d08588

sqlite-lembed

4b4665f

add sqlite.org csv

ae178b8

progress

9eeceec

sqlite-lines

3c3c103

"index" cmd, fixup a few things

92065a5

rename to embedfile

581f64d

include embedfile in dist

2c45550

bestlineover readline in shell

d1aed34

comso_dlopen for loadable SQLite extensions

ddf73f6

snapshot tests

c54c950

more sqlite compile time options

80b3693

import and search commands

7ed9101

0.0.1-alpha.1

d9a2d7f

depend on third_party/sqlite instead

542cd2e

llama.cpp/embedfile -> embedfile

6947bfa

asg017 force-pushed the embedfile-init branch from 1cc8e3e to 6947bfa Compare November 30, 2024 18:49

asg017 added 4 commits November 30, 2024 17:12

fix include

30d4e69

clang-format embedfile

481f3ef

small build fixes, error handling

61b718d

small man changes

81845d5

asg017 mentioned this pull request Dec 20, 2024

DRAFT: Add jamfile, a JavaScript runtime for creating scripts/CLIs on top of llamafile #661

Draft
