
llamafile v0.8.10

Released by @jart on 23 Jul 17:53 · f7c6ef4

[line drawing of a llama head in front of a slightly open manila folder filled with files]

llamafile lets you distribute and run LLMs with a single file

llamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023. It offers superior performance and binary portability: the same file runs on the stock installs of six OSes with no installation required. It combines the best of llama.cpp and Cosmopolitan Libc while aiming to stay ahead of the curve by including the most cutting-edge performance and accuracy enhancements. What llamafile gives you is a fun web GUI chatbot, a turnkey OpenAI API-compatible server, and a shell-scriptable CLI interface, which together put you in control of artificial intelligence.

This release includes a build of the new llamafile server rewrite we've
been promising, which we're calling llamafiler. It has matured enough to
recommend for serving embeddings, and it's the fastest way to do it:
with all-MiniLM-L6-v2.Q6_K.gguf on a Threadripper it serves JSON
/embedding requests at 800 req/sec, whereas the old llama.cpp server
could only do 100 req/sec. So you can fill up your RAG databases very
quickly if you productionize this.
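
As a rough illustration of productionizing this, here's a minimal Python
sketch of requesting embeddings from llamafiler over HTTP. The listen
address (localhost:8080), the "content" request field, and the
"embedding" response field are assumptions modeled on llama.cpp-style
servers rather than details from this release, so check the llamafiler
documentation for the exact flags and schema.

    # Assumes llamafiler was started with something like:
    #   llamafiler -m all-MiniLM-L6-v2.Q6_K.gguf
    # and listens on localhost:8080 (an assumption, not a documented default).
    import json
    import urllib.request

    def embed(text, url="http://localhost:8080/embedding"):
        """POST one string to the /embedding endpoint and return its vector."""
        body = json.dumps({"content": text}).encode("utf-8")
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["embedding"]

    if __name__ == "__main__":
        vec = embed("llamafile lets you distribute and run LLMs with a single file")
        print(len(vec), vec[:4])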

The old llama.cpp server came from a folder named "examples" and was
never intended to be production worthy. This server is designed to be
sturdy and uncrashable. It has /completion and /tokenize endpoints too;
the /tokenize endpoint serves 3.7 million requests per second on
Threadripper, thanks to Cosmo Libc improvements.
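
For the new endpoints, here's a similarly hedged Python sketch that hits
/tokenize. It reuses the assumptions above (server on localhost:8080);
the "prompt" request field and the "tokens" response field are guesses
based on llama.cpp-style servers, so consult the llamafiler
documentation for the real parameter names.

    import json
    import urllib.request

    def tokenize(text, url="http://localhost:8080/tokenize"):
        """Ask the /tokenize endpoint to tokenize a string; return token ids."""
        body = json.dumps({"prompt": text}).encode("utf-8")
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["tokens"]

    print(tokenize("hello world"))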

See the LLaMAfiler Documentation for further details.

  • 73b1836 Write documentation for new server
  • b3930aa Make GGML asynchronously cancelable
  • 8604e9a Fix POSIX undefined cancelation behavior
  • 323f50a Let SIGQUIT produce per-thread backtraces
  • 15d7fba Use semaphore to limit GGML worker threads
  • d7c8e33 Add support for JSON parameters to new server
  • 7f099cd Make stack overflows recoverable in new server
  • fb3421c Add barebones /completion endpoint to new server

This release restores support for non-AVX x86 microprocessors. We had to
drop support at the beginning of the year; however, our CPUID dispatching
has advanced considerably since then. We're now able to offer top speeds
on modern hardware without leaving old hardware behind.

  • a674cfb Restore support for non-AVX microprocessors
  • 555fb80 Improve build configuration

Here are the remaining improvements included in this release:

  • cc30400 Supports SmolLM (#495)
  • 4a4c065 Fix CUDA compile warnings and errors
  • 82f845c Avoid crashing with BF16 on Apple Metal