llamafile v0.8.13

Released by @jart on 18 Aug at 17:22 · b17ccd1

[line drawing of llama animal head in front of slightly open manila folder filled with files]

llamafile lets you distribute and run LLMs with a single file

llamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023. It offers superior performance and binary portability across the stock installs of six OSes, with no installation required. It combines the best of llama.cpp and cosmopolitan libc while aiming to stay ahead of the curve by including the most cutting-edge performance and accuracy enhancements. What llamafile gives you is a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a shell-scriptable CLI interface which together put you in control of artificial intelligence.

v0.8.13 changes

This release synchronizes with upstream projects, bringing with it
support for the newest models (e.g. Gemma 2B). Support for LLaMA v3 has
been significantly improved.

  • e9ee3f9 Synchronize with llama.cpp upstream
  • d0b5e8f Upgrade to Cosmopolitan v3.7.1

The new llamafiler server can now serve 2400 embeddings per second on
CPU. That's 3x faster than the upstream llama.cpp server. It's also
been hardened for security, so you should be able to safely run it as a
public-facing web server. There's a man page for llamafiler, and you
can also read the docs online: /llamafile/server/doc/index.md.

  • 070aa13 Bring new server up to 2421 embedding/sec
  • 584a327 Increase tokens per second on tiny models
  • 99dd1c0 Add seccomp, tokenbucket, and batch prioritization
  • cda83f8 Make GGML threads spawn 10x faster
  • d451e0e Add chrome://tracing/ feature

The new llamafiler server now fully supports all the old embedding
endpoints that were provided by llamafile --server. Support for
serving embeddings has been removed from the old server.

  • be94c1f Add OpenAI /v1/embeddings to new llamafiler server
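
As a quick illustration, the endpoint speaks the OpenAI embeddings wire
format, so it can be exercised with curl. The model filename, listen
address, and flags below are placeholder assumptions rather than
authoritative usage; consult the llamafiler man page for the real
details.

    # Start the server (model path and listen address are hypothetical).
    ./llamafiler -m all-MiniLM-L6-v2.F32.gguf -l 127.0.0.1:8080

    # Request an embedding via the OpenAI-compatible endpoint.
    curl http://127.0.0.1:8080/v1/embeddings \
      -H "Content-Type: application/json" \
      -d '{"input": "hello world"}'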

This release introduces whisperfile, a single-file implementation of
OpenAI's Whisper model. It lets you transcribe speech to text and even
translate it too. Our implementation is based on Georgi Gerganov's
whisper.cpp project. The whisperfile port was created by CJ Pais, who
has handed over maintenance of his awesome work. There's a man page for
whisperfile (which can also be viewed by running ./whisperfile --help)
and we have online documentation with markdown tutorials at
/whisper.cpp/doc/index.md.
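
As a sketch of what a run might look like, the flags below are
assumptions carried over from upstream whisper.cpp (-m for the model,
-f for the audio file, --translate to translate to English); verify
them against ./whisperfile --help.

    # Transcribe speech to text (model and audio filenames are hypothetical).
    ./whisperfile -m whisper-tiny.en.gguf -f recording.wav

    # Translate foreign-language speech into English text.
    ./whisperfile -m whisper-small.gguf -f interview.wav --translate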

We developed a faster, more accurate implementation of GeLU. This helps
improve the performance of tiny models. It leads to measurable quality
improvements in whisper model output.
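
The release notes don't reproduce the kernel itself, but for reference,
here are the two standard formulations of GeLU as a C++ sketch; the
function names are illustrative, and this is not llamafile's actual
optimized implementation.

    // Illustrative definitions only, not llamafile's vectorized kernel.
    #include <cmath>

    // Exact GeLU: x * Phi(x), where Phi is the standard normal CDF.
    float gelu_exact(float x) {
        return 0.5f * x * (1.0f + erff(x / sqrtf(2.0f)));
    }

    // The common tanh approximation used across GGML-family projects:
    // 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    float gelu_tanh(float x) {
        const float k = 0.7978845608f;  // sqrt(2/pi)
        return 0.5f * x * (1.0f + tanhf(k * (x + 0.044715f * x * x * x)));
    }

A faster and more accurate variant generally means tightening the
approximation's error against the exact erf form while keeping the
computation vectorizable.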

We've been improving floating point numerical stability for very large
models, e.g. Mixtral 8x22b and Command-R-Plus. tinyBLAS on CPU now
computes dot products for F32, F16, and BF16 weights using a new
zero-overhead divide-and-conquer approach we call ruler reduction,
which can yield a 10x reduction in worst-case roundoff error
accumulation (a sketch of the idea follows the commits below).

  • cb817f5 Reduce rounding errors for very large models
  • 5b06924 Use ruler reduction for GGML dot products
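
To make the intuition concrete, here is a minimal C++ sketch of
divide-and-conquer dot-product accumulation. With naive accumulation,
worst-case roundoff grows linearly with vector length n; with pairwise
splitting, each partial sum participates in only about log2(n)
additions. The recursive form below is for illustration only: ruler
reduction is a zero-overhead variant of this idea, and the actual
tinyBLAS kernels are structured differently.

    #include <cstddef>

    // Naive accumulation: worst-case roundoff error grows O(n).
    float dot_naive(const float *x, const float *y, size_t n) {
        float acc = 0.0f;
        for (size_t i = 0; i < n; ++i)
            acc += x[i] * y[i];
        return acc;
    }

    // Pairwise (divide-and-conquer) accumulation: worst-case roundoff
    // error grows O(log n).
    float dot_pairwise(const float *x, const float *y, size_t n) {
        if (n <= 8) {  // small base case keeps recursion overhead low
            float acc = 0.0f;
            for (size_t i = 0; i < n; ++i)
                acc += x[i] * y[i];
            return acc;
        }
        size_t h = n / 2;
        return dot_pairwise(x, y, h) + dot_pairwise(x + h, y + h, n - h);
    }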

This release introduces sdfile, which is our implementation of stable
diffusion. No documentation is yet provided for this command, other than
the docs provided by the upstream stable-diffusion.cpp
project on which it's based.

The new architectures and tokenizers introduced by this version are:
Open ELM, GPT NEOX, Arctic, DeepSeek2, ChatGLM, BitNet, T5, JAIS, Poro,
Viking, Tekken, and CodeShell.

Known Issues

This release increases the llamafile executable size from 30MB to
200MB. This is caused by ggerganov/llama.cpp#7156. We're already
employing some workarounds to minimize the impact of upstream
development contributions on binary size, and we're aiming to find more
in the near future.