Extend Tau compute to support AI inference more efficiently. The Ollama plugin was integrated previously, but it introduced significant overhead and offered limited concurrency. See the existing implementation at ollama-cloud.
Objective: Develop a plugin that exports a model management and inference interface capable of handling concurrent requests efficiently.
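As a rough illustration of what that interface could look like (all names below are hypothetical placeholders, not an existing Tau or plugin API), the plugin might export something along these lines:

```go
package inference

import "context"

// ModelManager covers the model lifecycle: pulling, listing and removing
// models on the node. Names here are illustrative only.
type ModelManager interface {
	Pull(ctx context.Context, name string) error
	List(ctx context.Context) ([]string, error)
	Remove(ctx context.Context, name string) error
}

// Engine serves inference requests. Implementations are expected to be safe
// for concurrent use, e.g. by pooling llama.cpp contexts or queuing requests,
// so many callers can stream completions at once.
type Engine interface {
	Load(ctx context.Context, model string) error
	Infer(ctx context.Context, model, prompt string, onToken func(token string)) error
	Unload(model string) error
}
```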
Proposed Approach:
Start with llama.cpp. Consider using go-llama.cpp, or fork it and update it to the latest version of llama.cpp (a binding sketch follows this list).
Simplify the compilation process, as Ollama does: use builder to build shared objects from llama.cpp and embed them into the plugin with the embed package (an embedding sketch follows this list).
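A minimal sketch of the binding side, assuming the go-skynet/go-llama.cpp New/Predict API (exact option names differ between versions). The mutex around a single context is only a placeholder for a proper context pool or batching scheduler:

```go
package inference

import (
	"sync"

	llama "github.com/go-skynet/go-llama.cpp"
)

// llamaEngine wraps a single llama.cpp model. Serializing calls with a mutex
// is the simplest correct behavior; real concurrency would come from a pool
// of contexts in a later iteration.
type llamaEngine struct {
	mu    sync.Mutex
	model *llama.LLama
}

func newLlamaEngine(modelPath string) (*llamaEngine, error) {
	m, err := llama.New(modelPath, llama.SetContext(2048))
	if err != nil {
		return nil, err
	}
	return &llamaEngine{model: m}, nil
}

func (e *llamaEngine) Infer(prompt string, onToken func(string)) (string, error) {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.model.Predict(prompt,
		llama.SetTokens(256),
		llama.SetTokenCallback(func(t string) bool {
			onToken(t)
			return true
		}),
	)
}
```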
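And a sketch of the embedding step using the standard embed package; the lib/ layout and .so glob are assumptions about what builder would produce, not an existing convention:

```go
package inference

import (
	"embed"
	"os"
	"path/filepath"
)

// Shared objects built from llama.cpp would be copied into lib/ before the
// plugin is compiled; the glob below is a placeholder for those artifacts.
//
//go:embed lib/*.so
var libFS embed.FS

// extractLibs writes the embedded shared objects to dir so the runtime can
// locate them (e.g. via LD_LIBRARY_PATH) before loading a model.
func extractLibs(dir string) error {
	entries, err := libFS.ReadDir("lib")
	if err != nil {
		return err
	}
	for _, e := range entries {
		data, err := libFS.ReadFile("lib/" + e.Name())
		if err != nil {
			return err
		}
		if err := os.WriteFile(filepath.Join(dir, e.Name()), data, 0o755); err != nil {
			return err
		}
	}
	return nil
}
```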
Next Steps for Enhancement:
In a second iteration, integrate support for TensorRT-LLM, which offers better performance and is better suited to cloud environments.
Inspiration and Resources:
Draw inspiration from Ollama for embedding techniques.
Refer to LocalAI for effective llama.cpp integration.
Consider using cortex.llamacpp rather than llama.cpp directly. Build configurations can be found here.