Releases: Mozilla-Ocho/llamafile

llamafile v0.7.1

13 Apr 04:21
e5d53ac

This release fixes bugs in the 0.7.0 release.

  • Fix 2 embeddings-related issues in server.cpp (#324)
  • Detect search query to start webchat (#333)
  • Use LLAMAFILE_GPU_ERROR value -2 instead of -1 (#291)
  • Fix --silent-prompt flag regression (#328)
  • Clamp out of range values in K quantizer (ef0307e)
  • Update to latest q5_k quantization code (a8b0b15)
  • Change the file format magic number for the bf16 file format that was recently introduced in 0.7.0. This is a breaking change, due to a numbering conflict with the upstream project. We're still waiting on a permanent assignment for bfloat16, so this could potentially change again. Follow ggerganov/llama.cpp#6412 for updates.

Mixtral 8x22b and Grok support are not available in this release, but they are available if you build llamafile from source on the main branch at HEAD. We're currently dealing with an AMD Windows GPU support regression there. Once it's resolved, a 0.8 release will ship.

llamafile v0.7

31 Mar 04:21
c7780c4

llamafile lets you distribute and run LLMs with a single file

This release improves the performance and accuracy of both CPU and GPU computations, and includes a security fix.

  • tinyBLAS now gives outputs consistent with cuBLAS, thanks to Kahan summation on matvec ops. This is good news for Windows users, because llamafile releases bundle tinyBLAS DLLs for driver-only GPU support. That support is now faster and more accurate than before, reducing the need to install the CUDA / ROCm SDKs yourself.
  • Prompt evaluation now goes much faster on CPU. For example, f16 weights on Raspberry Pi 5 are now 8x faster. These new optimizations mostly apply to F16, BF16, Q8_0, Q4_0, and F32 weights. Depending on the hardware and weights being used, we've observed llamafile-0.7 going anywhere between 30% and 500% faster than llama.cpp upstream.
  • Support for the bf16 (Google Brain floating point) data type has been introduced, for CPU only.
  • Support for AVX512 has been introduced. Owners of CPUs like Zen4 can expect to see 10x faster prompt eval times.
  • If you want to use llamafile-0.7 [...] --recompile --gpu amd support on Windows, this release requires version 5.7+ of the ROCm HIP SDK, which may be downloaded here (see the sketch after this list).
  • This release includes a security fix for CVE-2024-23496 (see #294).
  • This release is synced with llama.cpp 2024-03-22 upstream.
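
A rough sketch of the Windows AMD recompile path described above, assuming the ROCm HIP SDK 5.7+ is installed; the binary and model file names are placeholders, and -ngl 35 is just an example layer count:

    llamafile-0.7.exe -m model.llamafile -ngl 35 --recompile --gpu amd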

llamafile v0.6.2

27 Jan 21:35
d4c602d

llamafile lets you distribute and run LLMs with a single file

This release synchronizes with llama.cpp upstream and polishes GPU
auto-configuration. Support for splitting a model onto multiple NVIDIA
GPUs has been restored.

  • dfd3335 Synchronize with llama.cpp 2024-01-27
  • c008e43 Synchronize with llama.cpp 2024-01-26
  • e34b35c Make GPU auto configuration more resilient
  • 79b88f8 Sanitize -ngl flag on Apple Metal

There's a known issue with splitting a model onto multiple AMD GPUs, which
currently doesn't work. This is an upstream issue we're working to solve.
The workaround is to set export HIP_VISIBLE_DEVICES=0 in your environment
when running llamafile, so it'll only see the first GPU.
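
For example, on Linux the workaround looks roughly like this (the model file name is a placeholder):

    # hide all but the first AMD GPU from llamafile
    export HIP_VISIBLE_DEVICES=0
    ./llamafile-0.6.2 -m mistral-7b-instruct.llamafile -ngl 35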

Example llamafiles

Our llamafiles on Hugging Face are updated shortly after a release goes live.

Flagship models

Supreme models (highest-end consumer hardware)

Tiny models (small enough to use on Raspberry Pi)

Other models:

If you have a slow Internet connection and want to update your llamafiles without needing to redownload, then see the instructions here: #24 (comment). You can also download llamafile-0.6.2 and simply say ./llamafile-0.6.2 -m old.llamafile to run your old weights.

llamafile v0.6.1

20 Jan 08:09
389c389

llamafile lets you distribute and run LLMs with a single file

This release fixes a crash that can happen on Apple Metal GPUs.

  • 9c85d9c Fix free() related crash in ggml-metal.m

Windows users will see better performance with tinyBLAS. Please note we
still recommend installing the CUDA SDK (NVIDIA) or the HIP/ROCm SDK (AMD)
for maximum performance and accuracy if your hardware is supported by them.

  • df0b3ff Use thread-local register file for matmul speedups (#205)
  • 4892494 Change BM/BN/BK to template parameters (#203)
  • ed05ba9 Reduce server memory use on Windows

This release also synchronizes with llama.cpp upstream (as of Jan 9th)
and includes other improvements.

  • 133b05e Sync with llama.cpp upstream
  • 67d97b5 Use hipcc on $PATH if it exists
  • 15e2339 Do better job reporting AMD hipBLAS errors
  • c617679 Don't crash when --image argument is invalid
  • 3e8aa78 Clarify install/gpu docs/behavior per feedback
  • eb4989a Fix typo in OpenAI API

Example llamafiles

Our llamafiles on Hugging Face are updated shortly after a release goes live.

Flagship models

Supreme models (highest-end consumer hardware)

Tiny models (small enough to use on Raspberry Pi)

Other models:

If you have a slow Internet connection and want to update your llamafiles without needing to redownload, then see the instructions here: #24 (comment). You can also download llamafile-0.6.1 and simply say ./llamafile-0.6.1 -m old.llamafile to run your old weights.

llamafile v0.6

09 Jan 11:48
64d1e65

llamafile lets you distribute and run LLMs with a single file

This release features significant improvements to GPU support.

  • 4616816 Introduce support for multiple GPUs
  • 6559da6 Introduce AMD GPU support for Linux
  • 20d5f46 Make CLIP GPU acceleration work on UNIX / Windows

The llamafile server is now more reliable. Invalid JSON won't crash the
server. Opening a browser tab won't prevent the server from starting.

  • 3384234 Upgrade to cosmocc 3.2.4
  • 585c2d8 Make browser tab launching more reliable
  • 7a5ec37 Show IP addresses when binding to 0.0.0.0
  • d39ec38 Enable setting thread affinity on NUMA systems

You can now say llamafile -m foo.llamafile to load a model from a
llamafile without having to execute it or extract the gguf file.

  • bb136e1 Support opening weights from llamafiles
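
For example, the weights packed inside an existing llamafile can be reused with the new engine without unpacking them (a sketch; the model file name is a placeholder):

    # reuse the weights inside an existing llamafile with the 0.6 engine
    ./llamafile-0.6 -m mistral-7b-instruct.llamafile -p 'The capital of France is'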

The documentation has been improved (but is still a work in progress).

  • 7ad00db Add more content to manual

Example llamafiles

Our llamafiles on Hugging Face are updated shortly after a release goes live.

Flagship models

Supreme models (highest-end consumer hardware)

Tiny models (small enough to use on Raspberry Pi)

Other models:

If you have a slow Internet connection and want to update your llamafiles without needing to redownload, then see the instructions here: #24 (comment). You can also download llamafile-0.6 and simply say ./llamafile-0.6 -m old.llamafile to run your old weights.

llamafile v0.5

05 Jan 18:12
ef83e2b

llamafile lets you distribute and run LLMs with a single file

The llamafile-server command is now unified into llamafile. This way
you won't need to upload your llamafiles to Hugging Face twice. We also
have rich man page documentation for this command, which can be viewed
with pagination on all platforms via the llamafile --help flag.

  • b86dcb7 Unify llamafile-server command into llamafile
  • 156f0a6 Embed man page into --help flag of each program

This release introduces support for AMD graphics cards on Windows. Our
release binaries include a prebuilt tinyBLAS DLL. Like our Nvidia DLL,
it works on stock installs and only depends on the graphics driver. GPU
inference on Windows is also much faster out of the box, thanks to
improvements we've made to our tinyBLAS kernels.

  • 1f1c53f Get AMD GPU support working on Windows
  • 1d9fa85 Add 2D blocking to tinyBLAS GemmEx (#153)
  • c0589f0 Apply 2D blocking to all kernels (#156)
  • c2bc6e6 Separate kernel for GemmStridedBatchedEx (#163)
  • f6ee33c Read and write column-major matrices better (#164)
  • d7cbaf7 Reduce BM/BN/BK from 64/32/64 to 48/12/48
  • 04d6e93 Introduce --gpu flag
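
With this release, selecting the AMD GPU on a stock Windows install might look like the following sketch (the binary and weights file names are placeholders, and -ngl 35 is just an example layer count):

    llamafile-0.5.exe -m mistral-7b-instruct.Q4_K_M.gguf -ngl 35 --gpu amd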

Apple Metal users should expect to see LLaVA image summarization go
roughly 33% faster. Complete support for Microsoft's new Phi-2 model is
now available, which works great on Raspberry Pi. FreeBSD ARM64 users
can now also enjoy this project. Shell scriptability is improved. We've
also introduced a llamafile-convert command that makes it easier for
you to create your own llamafiles.

  • 922c4f1 Add GPU acceleration of LLaVA image processing on MacOS
  • 6423228 Add Phi-2 architecture support
  • ce4aac6 Support FreeBSD ARM64
  • 1dcf274 Add llamafile-convert command (#112)
  • 50bdf69 7d23bc9 Make --log-disable work better
  • 7843183 Make default thread count capped at 12 maximum
  • 2e276a1 Sync with llama.cpp upstream
  • dd4c9d7 Make JSON server crashes more informative
  • 8762f13 474b44f Introduce --nocompile flag
  • 5cf6e76 Introduce --cli flag
  • f0e86e1 Don't schlep weights into CPU when using GPU
  • f1410a1 Fix repeat_last_n in OpenAI server
  • 3119f09 Increase server max payload size
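
The new llamafile-convert command mentioned above packages existing GGUF weights as a llamafile. A minimal sketch, assuming a GGUF file named mistral-7b-instruct.Q4_K_M.gguf (the exact invocation and output name may differ; check llamafile-convert --help):

    # wrap GGUF weights into a self-contained llamafile
    llamafile-convert mistral-7b-instruct.Q4_K_M.gguf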

Known Issues

  • Multiple GPUs isn't supported yet.
  • CLIP only supports GPU acceleration on Apple Silicon.

Example llamafiles

Our llamafiles on Hugging Face are updated shortly after a release goes live.

Flagship models

Supreme models (highest-end consumer hardware)

Tiny models (small enough to use on Raspberry Pi)

Other models:

If you have a slow Internet connection and want to update your llamafiles
without needing to redownload, then see the instructions here: #24 (comment)

llamafile v0.4.1

28 Dec 10:41
f6ea6bf

llamafile lets you distribute and run LLMs with a single file

If you had trouble generating filenames with the latest release while
following the "bash one-liners" blog post, then please try again.

  • 0984ed8 Fix regression with --grammar flag

Crashes on older Intel / AMD systems should be fixed:

  • 3490afa Fix SIGILL on older Intel/AMD CPUs w/o F16C

The OpenAI API-compatible endpoint has been improved.

  • 9e4bf29 Fix OpenAI server sampling w.r.t. temp and seed
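
Once a llamafile server is running locally, the OpenAI-compatible endpoint can be exercised with curl. A rough sketch, assuming the default port 8080 and the v1/chat/completions route (the model field value is just a placeholder):

    curl http://localhost:8080/v1/chat/completions \
         -H "Content-Type: application/json" \
         -d '{
               "model": "local-model",
               "messages": [{"role": "user", "content": "Say hello in one sentence."}],
               "temperature": 0.7
             }'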

This release improves the documentation.

  • 5c7ff6e Improve llamafile manual
  • 658b18a Add WSL CUDA to GPU section (#105)
  • 586b408 Update README.md so links and curl commands work (#136)
  • a56ffd4 Update README to clarify Darwin kernel versioning
  • 47d8a8f Fix README changing SSE3 to SSSE3
  • 4da8e2e Fix README examples for certain UNIX shells
  • faa7430 Change README to list Mixtral Q5 (instead of Q3)
  • 6b0b64f Fix CLI README examples

We're making strides toward automating our testing process.

Some other improvements:

  • 9e972b2 Improve README examples
  • 9de5686 Support bos token in llava-cli
  • 3d81e22 Set logger callback for Apple Metal
  • 9579b73 Make it easier to override CPPFLAGS

Our .llamafiles on Hugging Face have been updated to incorporate these
new release binaries. You can redownload here:

Known Issues

LLaVA image processing using the built-in tinyBLAS library may be slow on Windows.
Here's the workaround for using the faster NVIDIA cuBLAS library instead.

  1. Delete the .llamafile directory in your home directory.
  2. Install CUDA.
  3. Install MSVC.
  4. Open the "x64 MSVC command prompt" from Start.
  5. Run llamafile there for the first invocation (see the sketch below).
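
In the x64 MSVC command prompt, steps 1 and 5 might look like the following sketch (the binary and weights file names are placeholders, and the release binary is assumed to have been renamed with a .exe suffix):

    rem step 1: delete the cached DLLs so llamafile rebuilds them against cuBLAS
    rd /s /q "%USERPROFILE%\.llamafile"
    rem step 5: run llamafile once from this prompt so the CUDA toolchain is found
    llamafile-0.4.1.exe -m llava-v1.5-7b-Q4_K.gguf -ngl 35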

There's a YouTube video tutorial on doing this here: https://youtu.be/d1Fnfvat6nM?si=W6Y0miZ9zVBHySFj

llamafile v0.4

14 Dec 09:23
188f7fc

llamafile lets you distribute and run LLMs with a single file

This release features Mixtral support. Support has also been added for
Qwen models. The --chatml, --samplers, and other flags have been added.

  • 820d42d Synchronize with llama.cpp upstream

GPU now works out of the box on Windows. You still need to pass the
-ngl 35 flag, but you're no longer required to install CUDA/MSVC.

  • a7de00b Make tinyBLAS go 95% as fast as cuBLAS for token generation (#97)
  • 9d85a72 Improve GEMM performance by nearly 2x (#93)
  • 72e1c72 Support CUDA without cuBLAS (#82)
  • 2849b08 Make it possible for CUDA to extract prebuilt DSOs

Additional fixes and improvements:

  • c236a71 Improve markdown and syntax highlighting in server (#88)
  • 69ec1e4 Update the llamafile manual
  • 782c81c Add SD ops, kernels
  • 93178c9 Polyfill $HOME on some Windows systems
  • fcc727a Write log to /dev/null when main.log fails to open
  • 77cecbe Fix handling of characters that span multiple tokens when streaming

Our .llamafiles on Hugging Face have been updated to incorporate these
new release binaries. You can redownload here:

llamafile v0.3

11 Dec 20:18
1f17930

llamafile lets you distribute and run LLMs with a single file

The llamafile-main and llamafile-llava-cli programs have been
unified into a single command named llamafile. Man pages now exist in
PDF, troff, and PostScript formats. There's much better support for shell
scripting, thanks to a new --silent-prompt flag. It's now possible to
shell script vision models like LLaVA using grammar constraints.

  • d4e2388 Add --version flag
  • baf216a Make ctrl-c work better
  • 762ad79 Add make install build rule
  • 7a3e557 Write man pages for all commands
  • c895a44 Remove stdout logging in llava-cli
  • 6cb036c Make LLaVA more shell script friendly
  • 28d3160 Introduce --silent-prompt flag to main
  • 1cd334f Allow --grammar to be used on --image prompts
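
For example, a vision model can be constrained to a yes/no answer so the output is trivial to consume from a shell script. A sketch (the model, --mmproj projector, and image file names are placeholders):

    ./llamafile -m llava-v1.5-7b-Q4_K.gguf \
        --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \
        --image photo.jpg \
        -p 'Is there a cat in this image? Answer yes or no.' \
        --grammar 'root ::= "yes" | "no"' \
        --silent-prompt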

The OpenAI API in llamafile-server has been improved.

  • e8c92bc Make OpenAI API stop field optional (#36)
  • c1c8683 Avoid bind() conflicts on port 8080 w/ server
  • 8cb9fd8 Recognize cache_prompt parameter in OpenAI API

Performance regressions have been fixed for Intel and AMD users.

  • 73ee0b1 Add runtime dispatching for Q5 weights
  • 36b103e Make Q2/Q3 weights go 2x faster on AMD64 AVX2 CPUs
  • b4dea04 Slightly speed up LLaVA runtime dispatch on Intel

The zipalign command is now feature complete.

  • 76d47c0 Put finishing touches on zipalign tool
  • 7b2fbcb Add support for replacing zip files to zipalign

Some additional improvements:

  • 5f69bb9 Add SVG logo
  • cd0fae0 Make memory map loader go much faster on MacOS
  • c8cd8e1 Fix output path in llamafile-quantize
  • dd1e0cd Support attention_bias on LLaMA architecture
  • 55467d9 Fix integer overflow during quantization
  • ff1b437 Have makefile download cosmocc automatically
  • a7cc180 Update grammar-parser.cpp (#48)
  • 61944b5 Disable pledge on systems with GPUs
  • ccc377e Log cuda build command to stderr

Our .llamafiles on Hugging Face have been updated to incorporate these new release binaries. You can redownload here:

If you have a slower Internet connection and don't want to re-download, then you don't have to! Instructions are here:

llamafile v0.2.1

01 Dec 18:51
57cc1f4

llamafile lets you distribute and run LLMs with a single file. See our README file for documentation and to learn more.

Changes

  • 95703b6 Fix support for old Intel CPUs
  • 401dd08 Add OpenAI API compatibility to server
  • e5c2315 Make server open tab in browser on startup
  • 865462f Cherry pick StableLM support from llama.cpp
  • 8f21460 Introduce pledge() / seccomp security to llama.cpp
  • 711344b Fix server so it doesn't consume 100% cpu when idle
  • 12f4319 Add single-client multi-prompt support to server
  • c64989a Add --log-disable flag to server
  • 90fa20f Fix typical sampling (#4261)
  • e574488 reserve space in decode_utf8
  • 481b6a5 Look for GGML DSO before looking for NVCC
  • 41f243e Check for i/o errors in httplib read_file()
  • ed87fdb Fix uninitialized variables in server
  • c5d35b0 Avoid CUDA assertion error with some models
  • c373b5d Fix LLaVA regression for square images
  • 176e54f Fix server crash when prompt exceeds context size

Example Llamafiles

Our .llamafiles on Hugging Face have been updated to incorporate these new release binaries. You can redownload here:

If you have a slower Internet connection and don't want to re-download, then you don't have to! Instructions are here: