
Conversation

@reznej commented May 5, 2025

Motivation

I use mlx-lm in combination with Open WebUI. One issue I've encountered is that once a model is loaded into memory, it stays resident indefinitely—even after it's no longer in use. Since I also rely on this machine for other productive tasks, I wanted to enable automatic memory release after inactivity, similar to the keep_alive behavior in Ollama.

During my research, I came across Llama-Swap, a lightweight proxy to manage different models. One of its features is a ttl config parameter, which unloads the model after a specified idle timeout—freeing up system resources automatically.

Problem

However, while integrating llama-swap into my setup, I discovered that mlx-lm (as of v0.24.0) is not compatible with `Transfer-Encoding: chunked`, which llama-swap apparently uses when proxying HTTP requests. This causes inference requests to fail with a 502 because the `Content-Length` header is missing.
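For reference, this is roughly what a chunked request looks like on the wire (the path and payload here are illustrative, not captured from llama-swap): the body is sent as hex-sized chunks terminated by a zero-length chunk, and no `Content-Length` header is present, so a server that only reads `Content-Length` bytes sees an empty or malformed body.

```
POST /v1/chat/completions HTTP/1.1
Host: localhost:8080
Content-Type: application/json
Transfer-Encoding: chunked

1b
{"prompt": "Hello, world!"}
0

```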

Solution

This PR introduces a minimal addition to the `do_POST` method in `server.py`, enabling mlx-lm to decode chunked HTTP request bodies. It preserves the existing behavior for requests that carry a `Content-Length` header, so this change should have no effect on current usage patterns while enabling broader compatibility with toolchains that rely on chunked streaming.
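A minimal sketch of the idea (simplified, and not the exact code in this PR): read the body chunk by chunk when `Transfer-Encoding: chunked` is set, otherwise fall back to the usual `Content-Length` path. The handler below uses Python's `http.server.BaseHTTPRequestHandler`, which mlx-lm's server builds on; trailer headers after the final chunk are not handled here.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json


class APIHandler(BaseHTTPRequestHandler):
    def _read_chunked_body(self) -> bytes:
        """Decode a Transfer-Encoding: chunked request body (no trailer support)."""
        body = b""
        while True:
            # Each chunk starts with its size in hex, optionally followed by
            # extensions after ';', terminated by CRLF.
            size_line = self.rfile.readline().strip()
            chunk_size = int(size_line.split(b";")[0], 16)
            if chunk_size == 0:
                self.rfile.readline()  # consume the CRLF after the last chunk
                break
            body += self.rfile.read(chunk_size)
            self.rfile.readline()  # discard the CRLF that follows each chunk
        return body

    def do_POST(self):
        if self.headers.get("Transfer-Encoding", "").lower() == "chunked":
            body = self._read_chunked_body()
        else:
            # Existing behavior: rely on Content-Length (defaults to 0).
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)

        payload = json.loads(body or b"{}")
        response = json.dumps({"received": payload}).encode()

        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), APIHandler).serve_forever()
```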

@reznej reznej changed the title handle chunked requests (llama-swap compatibility) Add support for chunked request bodies (llama-swap compatibility) May 5, 2025
@yihongang (Contributor) commented

I think llama-swap wants to use `Content-Length`, but it's just broken at the moment. There's an open PR to fix it, but it may take some time to agree on an approach.
