DRAFT: Add embedfile for all-in-one embeddings CLI tool #644
Hi Alex. Thanks for sending this. This would make an awesome addition to the project.
I recommend putting it in the root of the repo, for better visibility.
I've checked-in SQLite to third party. Your build rule can now simply depend on
I only needed to change the zlib include in sqlite3.c. If you need any other local changes, please feel free to make them to the new third_party location.
It's recommended that you declare them like this:

__notice(mbedtls_notice, "\
Mbed TLS (Apache 2.0)\n\
Copyright ARM Limited\n\
Copyright The Mbed TLS Contributors");

in any one of your .c or .cpp files. This will ensure your copyright notice is distributed inside any binaries that are built with it.
Tell me more? Maybe I can help.
Could you incorporate this into the monolithic Makefile? While the default Here's some feedback:
Here's some suggestions / action items:
embedfile is a CLI tool that bundles llama.cpp / llamafile, the SQLite CLI, sqlite-vec, sqlite-lembed, and a few other SQLite extensions into a comprehensive and performant tool for generating text embeddings from CSV, JSON, NDJSON, txt, or SQLite database files.

Just like llamafile and whisperfile, you can embed a .gguf embeddings model file into an embedfile, removing the need for managing weights yourself.

| File | Size |
| --- | --- |
| all-MiniLM-L6-v2.f16.embedfile | 56MB |
| mxbai-embed-xsmall-v1-f16.embedfile | 61MB |
| nomic-embed-text-v1.5.f16.embedfile | 273MB |
| snowflake-arctic-embed-m-v1.5-f16.embedfile | 221MB |
| embedfile (no embedded model) | 12MB |
Here's an example, using MixedBread's xsmall model:
This executable file already has sqlite-vec, sqlite-lembed, and the embeddings model pre-configured. Test that embeddings work with:

You can embed data from CSV, JSON, NDJSON, and .txt files and save the results to a SQLite database. Here we are embedding the text column in the dbpedia.min.csv file, outputting to a dbpedia.db database.

That was 10,000 rows with 820,604 tokens. I got 83 embeddings per second on my older 2019 Intel MacBook. On my M1 Mac Mini I get 173 embeddings/second, and I'm sure it's faster on newer Macs.
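For a rough sense of what those numbers mean in wall-clock time, here is a small back-of-the-envelope sketch in Python. It uses only the figures quoted above and assumes one embedding per CSV row; the helper name estimate_seconds is made up for illustration and is not part of embedfile.

```python
# Figures quoted in the PR description.
rows = 10_000
tokens = 820_604

# Reported throughput, in embeddings per second.
rates = {"2019 Intel MacBook": 83, "M1 Mac Mini": 173}

# Average text length, assuming one embedding per row.
avg_tokens_per_row = tokens / rows


def estimate_seconds(n_rows, embeddings_per_second):
    """Estimated time to embed n_rows at a steady rate (hypothetical helper)."""
    return n_rows / embeddings_per_second


print(f"average tokens per row: {avg_tokens_per_row:.1f}")
for machine, rate in rates.items():
    seconds = estimate_seconds(rows, rate)
    tokens_per_second = rate * avg_tokens_per_row
    print(f"{machine}: ~{seconds:.0f} s total, ~{tokens_per_second:,.0f} tokens/s")
```

So the 10,000-row run works out to about two minutes on the Intel machine and about one minute on the M1, under the assumption that throughput stays constant across the file.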
Once indexed, you can search with the search command:

At any point, if you want to "eject" and run SQL scripts yourself, the sh command will fire up the sqlite3 CLI with all extensions and embeddings models pre-configured.

Status
This was really fun to put together, and I'd love to see this (or something like this) as part of the llamafile project. I totally get it if it's out of scope or not a priority; I'd be happy to maintain an experimental fork if needed.

Though as-is this branch isn't quite ready yet, there are a few things I want to fix:

- llama.cpp/embedfile directory, but maybe could be a top-level /embedfile?
- llama.cpp/embedfile/BUILD.mk is a bit messy, I had trouble compiling .c files in the subdirectory so I manually added those builds. Would love some help cleaning that up!
- sqlite-vec.c, sqlite-lembed.c, sqlite3.c, and shell.c files in order to fix a few cosmopolitan/integration issues. I want to clean those up before merging.
- assert()'s that fail on any error

TODO

- --k and other search options
- --prefix option for nomic-like embeddings, ex --prefix 'search_document:'
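
To illustrate what a --prefix option would need to do, here is a minimal Python sketch. Nomic-style models expect a task prefix such as search_document: at the start of each input text. The function name with_prefix and the choice to join prefix and text with a single space are assumptions for illustration, not how embedfile implements (or will implement) the option.

```python
def with_prefix(texts, prefix=""):
    """Prepend a task prefix to each text before it is embedded.

    Hypothetical helper: the prefix and the text are joined by one space,
    so "search_document:" + "hello" becomes "search_document: hello".
    """
    if not prefix:
        return list(texts)
    return [f"{prefix.rstrip()} {text}" for text in texts]


documents = ["SQLite is a small, fast SQL database engine."]
print(with_prefix(documents, "search_document:"))
```

The same helper would cover query-time prefixes (for example search_query:), which is why the prefix is a parameter rather than a hard-coded string.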
Build yourself