DRAFT: Add embedfile for all-in-one embeddings CLI tool #644

Draft: asg017 wants to merge 20 commits into main

asg017 commented Nov 27, 2024

embedfile is a CLI tool that bundles llama.cpp / llamafile, the SQLite CLI, sqlite-vec, sqlite-lembed, and a few other SQLite extensions into a comprehensive and performant tool for generating text embeddings from CSV, JSON, NDJSON, txt, or SQLite database files.

Just like llamafile and whisperfile, you can embed a .gguf embeddings model file into an embedfile, removing the need to manage weights yourself.

| Model | embedfile | Size (f16 quant) |
| --- | --- | --- |
| sentence-transformers/all-MiniLM-L6-v2 | all-MiniLM-L6-v2.f16.embedfile | 56MB |
| mixedbread-ai/mxbai-embed-xsmall-v1 | mxbai-embed-xsmall-v1-f16.embedfile | 61MB |
| nomic-ai/nomic-embed-text-v1.5 | nomic-embed-text-v1.5.f16.embedfile | 273MB |
| snowflake-arctic-embed-m-v1.5 | snowflake-arctic-embed-m-v1.5-f16.embedfile | 221MB |
| - | embedfile (no embedded model) | 12MB |

Here's an example, using MixedBread's xsmall model:

$ wget https://huggingface.co/asg017/embedfile/resolve/main/mxbai-embed-xsmall-v1-f16.embedfile
$ chmod u+x mxbai-embed-xsmall-v1-f16.embedfile 
$ ./mxbai-embed-xsmall-v1-f16.embedfile --version
embedfile 0.0.1-alpha.1, llamafile 0.8.16, SQLite 3.47.0, sqlite-vec=v0.1.6, sqlite-lembed=v0.0.1-alpha.8

This executable file already has sqlite-vec, sqlite-lembed, and the embeddings model pre-configured. Test that embeddings work with:

./mxbai-embed-xsmall-v1-f16.embedfile embed 'hello!'
[-0.058174,0.043776,0.030660,...]

You can embed data from CSV, JSON, NDJSON, and .txt files and save the results to a SQLite database. Here we are embedding the text column in the dbpedia.min.csv file, outputting to a dbpedia.db database.

$ ./mxbai-embed-xsmall-v1-f16.embedfile import --embed text dbpedia.min.csv dbpedia.db
INSERT INTO vec_items SELECT rowid, lembed("text") FROM temp.source;
100%|████████████████████| 10000/10000 [02:00<00:00, 83/s]
✔ dbpedia.min.csv imported into dbpedia.db, 10000 items

That was 10,000 rows with 820,604 tokens. I got 83 embeddings per second on my older 2019 Intel MacBook. On my M1 Mac Mini I get 173 embeddings/second, and I'm sure it's faster on newer Macs.
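
The 83/s figure follows from the progress bar above (10,000 rows in about 02:00). A small C sketch of that arithmetic, assuming the elapsed time was exactly 120 seconds (the progress bar only shows whole seconds); the function names are illustrative and not part of embedfile:

```c
#include <stdio.h>

/* Items processed per second; returns 0 if the elapsed time is not positive. */
static double rate_per_second(double items, double seconds) {
  return seconds > 0 ? items / seconds : 0.0;
}

/* Prints throughput figures for the dbpedia.min.csv run quoted above. */
static void print_import_stats(void) {
  const double rows = 10000;    /* rows imported */
  const double tokens = 820604; /* total tokens reported */
  const double seconds = 120;   /* 02:00 on the progress bar, assumed exact */
  printf("%.1f embeddings/s\n", rate_per_second(rows, seconds));
  printf("%.1f tokens/s\n", rate_per_second(tokens, seconds));
  printf("%.1f tokens/row\n", tokens / rows);
}
```

Under that assumption, the run works out to roughly 83.3 embeddings/s, 6838.4 tokens/s, and 82.1 tokens per row on average.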

Once indexed, you can search with the search command:

$ ./mxbai-embed-xsmall-v1-f16.embedfile search dbpedia.db 'global warming'
3240 0.852299 Attribution of recent climate change is the effort to scientifically ascertain mechanisms ...
6697 0.904844 The global warming controversy concerns the public debate over whether global warming is occurring, how ...
...
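
Conceptually, a vector search embeds the query and ranks the stored vectors by distance, with smaller values meaning closer matches (the second column in the output above appears to be such a distance). A rough brute-force sketch in C, assuming squared Euclidean (L2) distance; the metric embedfile actually uses is not stated here, sqlite-vec does the real work inside SQLite, and l2_squared() and nearest() are hypothetical names used only for illustration:

```c
#include <float.h>
#include <stddef.h>

/* Squared Euclidean (L2) distance between two vectors of length n.
   Skipping the square root does not change the ranking. */
static float l2_squared(const float *a, const float *b, size_t n) {
  float sum = 0.0f;
  for (size_t i = 0; i < n; i++) {
    float d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

/* Index of the stored vector closest to query, or -1 when count is 0.
   items holds count vectors of length n, laid out back to back. */
static long nearest(const float *query, const float *items,
                    size_t count, size_t n) {
  long best = -1;
  float best_dist = FLT_MAX;
  for (size_t i = 0; i < count; i++) {
    float d = l2_squared(query, items + i * n, n);
    if (d < best_dist) {
      best_dist = d;
      best = (long)i;
    }
  }
  return best;
}
```

A real top-k search would keep the k smallest distances rather than only the single best one.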

At any point, if you want to "eject" and run SQL scripts yourself, the sh command will fire up the sqlite3 CLI with all extensions and embeddings models pre-configured.

$ ./mxbai-embed-xsmall-v1-f16.embedfile sh
SQLite version 3.47.0 2024-10-21 16:30:22
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .mode qbox
sqlite> select sqlite_version(), vec_version(), lembed_version();
┌──────────────────┬───────────────┬──────────────────┐
│ sqlite_version() │ vec_version() │ lembed_version() │
├──────────────────┼───────────────┼──────────────────┤
│ '3.47.0'         │ 'v0.1.6'      │ 'v0.0.1-alpha.8' │
└──────────────────┴───────────────┴──────────────────┘
sqlite> select vec_to_json(vec_slice(lembed('hello!'), 0, 8)) as sample;
┌──────────────────────────────────────────────────────────────┐
│                            sample                            │
├──────────────────────────────────────────────────────────────┤
│ '[-0.058174,0.043776,0.030660,0.047412,-0.059377,-0.036267,0 │
│ .038117,0.005184]'                                           │
└──────────────────────────────────────────────────────────────┘

Status

This was really fun to put together, and I'd love to see this (or something like this) as part of the llamafile project. I totally get it if it's out of scope or not a priority; I'd be happy to maintain an experimental fork if needed.

As-is, though, this branch isn't quite ready yet. There are a few things I want to fix:

  • Code is under llama.cpp/embedfile directory, but maybe could be a top-level /embedfile?
  • llama.cpp/embedfile/BUILD.mk is a bit messy, I had trouble compiling .c files in the subdirectory so I manually added those builds. Would love some help cleaning that up!
  • I made manual changes to the vendored in sqlite-vec.c, sqlite-lembed.c, sqlite3.c, and shell.c files in order to fix a few cosmopolitan/integration issues. I want to clean those up before merging.
  • Include licenses/notices
  • A ton of assert()'s that fail on any error
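
On the last item: assert() aborts the whole process on failure and is compiled out when NDEBUG is defined, so user-facing errors (a missing input file, for example) are usually better reported and returned to the caller. A minimal sketch of that pattern in plain C; open_source(), import_file(), and the EF_* codes are hypothetical stand-ins, not functions from this branch:

```c
#include <stdio.h>

enum { EF_OK = 0, EF_ERROR = 1 }; /* hypothetical status codes */

/* Hypothetical stand-in for a fallible step, such as opening an input file.
   On failure, writes a human-readable message into err. */
static int open_source(const char *path, char *err, size_t errlen) {
  if (path == NULL || path[0] == '\0') {
    snprintf(err, errlen, "no input file given");
    return EF_ERROR;
  }
  FILE *f = fopen(path, "rb");
  if (!f) {
    snprintf(err, errlen, "cannot open %s", path);
    return EF_ERROR;
  }
  fclose(f);
  return EF_OK;
}

/* Instead of assert(open_source(...) == EF_OK), report and propagate. */
static int import_file(const char *path) {
  char err[256];
  if (open_source(path, err, sizeof err) != EF_OK) {
    fprintf(stderr, "embedfile: %s\n", err);
    return EF_ERROR;
  }
  return EF_OK;
}
```

The caller can then turn EF_ERROR into a non-zero exit status instead of crashing mid-import.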

TODO

  • Metadata + auxiliary column options in import
  • Better TUI for search results. Maybe REPL?
  • --k and other search options
  • --prefix option for nomic-like embeddings, e.g. --prefix 'search_document:'
  • Better perf
  • More embeddings models uploaded to HF
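
For the --prefix item: models such as nomic-embed-text expect a task prefix like 'search_document: ' in front of each input. A hedged sketch of how such an option might be applied before the text reaches the model; with_prefix() is a hypothetical helper and does not exist in this branch:

```c
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated "prefix + text" string, or NULL on allocation
   failure. A NULL or empty prefix yields a plain copy of text.
   The caller frees the result. */
static char *with_prefix(const char *prefix, const char *text) {
  size_t plen = prefix ? strlen(prefix) : 0;
  size_t tlen = strlen(text);
  char *out = malloc(plen + tlen + 1);
  if (!out) return NULL;
  if (plen) memcpy(out, prefix, plen);
  memcpy(out + plen, text, tlen + 1); /* copies the trailing NUL too */
  return out;
}
```

For example, with_prefix("search_document: ", "hello!") produces "search_document: hello!", which would then be passed to the embedding call in place of the raw text.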

Build yourself

./make o//llama.cpp/embedfile/embedfile
make -f embedfile.mk all

jart added a commit that referenced this pull request Nov 29, 2024

jart (Collaborator) commented Nov 29, 2024

Hi Alex. Thanks for sending this. This would make an awesome addition to the project.

> Code is under llama.cpp/embedfile directory, but maybe could be a top-level /embedfile?

I recommend putting it in the root of the repo, for better visibility.

> llama.cpp/embedfile/BUILD.mk is a bit messy, I had trouble compiling .c files in the subdirectory so I manually added those builds. Would love some help cleaning that up!

I've checked SQLite into third_party/sqlite/. Your build rule can now simply depend on o/$(MODE)/third_party/sqlite/sqlite.a. The build is configured to have all the compile-time options you specified in this change, e.g. FTS5, FTS3, etc.

> I made manual changes to the vendored in sqlite-vec.c, sqlite-lembed.c, sqlite3.c, and shell.c files in order to fix a few cosmopolitan/integration issues. I want to clean those up before merging.

I only needed to change the zlib include in sqlite3.c. If you need any other local changes, please feel free to make them to the new third_party location.

> Include licenses/notices

It's recommended that you declare them like this:

__notice(mbedtls_notice, "\
Mbed TLS (Apache 2.0)\n\
Copyright ARM Limited\n\
Copyright The Mbed TLS Contributors");

In any one of your .c or .cpp files. This will ensure your copyright notice is distributed inside any binaries that are built with it.

> A ton of assert()'s that fail on any error

Tell me more? Maybe I can help.

> make -f embedfile.mk all

Could you incorporate this into the monolithic Makefile? While the default make rule needs to be hermetic, you can do whatever you want in manually-run rules. For example, under llamafile, there's a lot of manually-run CUDA stuff. You could have manual rules that package your standard embedfiles.


Here is some feedback:

  1. Thank you for taking the time to write a man page.
  2. Thank you for using the new cosmo_args() API.

Here are some suggestions / action items:

  1. Could you update make install so it installs embedfile and its man page?
  2. Please use the new third_party/sqlite/ package. Be sure to update #include lines to say #include "third_party/sqlite/sqlite3.h" etc.
  3. Consider adding a .clang-format file to your package directory, with your preferred style (use Mozilla style in llamafile/highlight/.clang-format if you don't have a preference) and then run clang-format -i on your sources.
