Add support for chunked request bodies (llama-swap compatibility) #154
Motivation
I use mlx-lm in combination with Open WebUI. One issue I've encountered is that once a model is loaded into memory, it stays resident indefinitely—even after it's no longer in use. Since I also rely on this machine for other productive tasks, I wanted to enable automatic memory release after inactivity, similar to the keep_alive behavior in Ollama.
During my research, I came across llama-swap, a lightweight proxy for managing different models. One of its features is a `ttl` config parameter, which unloads a model after a specified idle timeout, freeing up system resources automatically.

Problem
However, while integrating llama-swap into my setup, I discovered that mlx-lm (as of v0.24.0) is not compatible with `Transfer-Encoding: chunked`, which llama-swap apparently uses when proxying HTTP requests. This causes inference requests to fail with a 502 due to the missing `Content-Length` header.
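As an illustration (not taken from llama-swap itself), a chunked request of this kind can be reproduced from Python: the `requests` library switches to `Transfer-Encoding: chunked` and omits `Content-Length` when the body is a generator. The URL, port, and payload below are placeholders.

```python
# Hypothetical repro (placeholder URL/payload): `requests` sends
# Transfer-Encoding: chunked and no Content-Length when the body is a
# generator, mimicking a proxy forwarding a body of unknown length.
import requests

def chunked_body():
    # Yielding the payload in pieces forces chunked encoding.
    yield b'{"model": "my-model", "prompt": "Hello", '
    yield b'"max_tokens": 16}'

resp = requests.post(
    "http://localhost:8080/v1/completions",  # adjust to your local server
    data=chunked_body(),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.text)
```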
Solution
This PR introduces a minimal addition to the `do_POST` method in `server.py`, enabling mlx-lm to decode chunked HTTP request bodies. It preserves the existing behavior for requests that carry a `Content-Length` header, so the change should have no effect on current usage patterns while enabling broader compatibility with toolchains that rely on chunked streaming.
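For illustration, here is a minimal sketch of what such a decoding path can look like in a `BaseHTTPRequestHandler` subclass. The class and helper names are illustrative, not the actual mlx-lm code, and the real change in `server.py` may differ in detail.

```python
# Illustrative sketch only -- not the actual mlx-lm diff. It shows the
# general technique: fall back to RFC 9112 chunked decoding when the
# request advertises Transfer-Encoding: chunked instead of Content-Length.
import json
from http.server import BaseHTTPRequestHandler

class ChunkAwareHandler(BaseHTTPRequestHandler):
    def _read_chunked_body(self) -> bytes:
        """Decode a Transfer-Encoding: chunked request body."""
        chunks = []
        while True:
            # Each chunk starts with its size in hex, optionally
            # followed by chunk extensions after a ';'.
            size_line = self.rfile.readline()
            chunk_size = int(size_line.split(b";")[0].strip(), 16)
            if chunk_size == 0:
                # A zero-size chunk ends the body; drain optional
                # trailer fields up to the final blank line.
                while self.rfile.readline() not in (b"\r\n", b"\n", b""):
                    pass
                return b"".join(chunks)
            chunks.append(self.rfile.read(chunk_size))
            self.rfile.readline()  # consume the CRLF after the chunk data

    def do_POST(self):
        if self.headers.get("Transfer-Encoding", "").lower() == "chunked":
            body = self._read_chunked_body()
        else:
            # Existing path, unchanged: body length is known up front.
            body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        payload = json.loads(body)
        # ... the existing request handling would continue from here;
        # this sketch just echoes the parsed payload back.
        response = json.dumps(payload).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)
```

Because the chunked branch is only taken when the `Transfer-Encoding` header is present, clients that already send `Content-Length` follow the original code path; a hardened version would likely also bound the total body size, which the sketch omits for brevity.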