Provide an offline engine API #1567

Merged

@ByronHsu ByronHsu commented Oct 4, 2024

Motivation

This PR implements the item "Add APIs for using the inference engine in a single script without launching a separate server" from #1487. It is a simplified version of #1127 (by @JianyuZhan) that reuses most of the existing code.

Modifications

Context

The current SRT server consists of an HTTP server and the SRT engine.

  1. HTTP server: A FastAPI server that routes requests to the engine.
  2. SRT engine:
    1. Tokenizer Manager: Tokenizes the requests and sends them to the controller.
    2. Controller (subprocess): Receives requests from the Tokenizer Manager, schedules them into batches, runs the batches forward, and sends the output tokens to the Detokenizer Manager.
    3. Detokenizer Manager (subprocess): Detokenizes the output tokens and sends the result back to the Tokenizer Manager.

The HTTP server and the Tokenizer Manager both run in the main process, and there is currently no way to decouple them and instantiate only the Tokenizer Manager.
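
To make the data flow above concrete, here is a toy sketch of the three stages. It is not sglang code: every name in it (VOCAB, controller, detokenizer, run_pipeline) is hypothetical, threads stand in for the real subprocesses, and the "forward pass" just reverses the input ids.

```python
from __future__ import annotations

import queue
import threading

# Hypothetical toy vocabulary; the real engine uses the model's tokenizer.
VOCAB = ["hello", "world", "!"]


def controller(requests: queue.Queue, outputs: queue.Queue) -> None:
    """Receive token ids, fake a forward pass, and pass output ids on."""
    while (item := requests.get()) is not None:
        rid, token_ids = item
        # Stand-in for scheduling + model forward: reverse the ids.
        outputs.put((rid, token_ids[::-1]))
    outputs.put(None)  # propagate shutdown to the detokenizer


def detokenizer(outputs: queue.Queue, results: queue.Queue) -> None:
    """Turn output ids back into text and hand them to the front end."""
    while (item := outputs.get()) is not None:
        rid, token_ids = item
        results.put((rid, " ".join(VOCAB[i] for i in token_ids)))


def run_pipeline(prompts: list[str]) -> list[str]:
    requests, outputs, results = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=controller, args=(requests, outputs)),
        threading.Thread(target=detokenizer, args=(outputs, results)),
    ]
    for worker in workers:
        worker.start()
    # "Tokenizer Manager" role: tokenize in the main thread and submit.
    for rid, prompt in enumerate(prompts):
        requests.put((rid, [VOCAB.index(tok) for tok in prompt.split()]))
    requests.put(None)  # no more requests
    for worker in workers:
        worker.join()
    texts = dict(results.get() for _ in prompts)
    return [texts[rid] for rid in range(len(prompts))]
```

The point of the sketch is only that the tokenizer side lives in the caller's process while the other two stages run elsewhere, which is why an HTTP server is not inherently required to drive them.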

Decouple SRT engine and HTTP server

This PR introduces the SRT engine as a standalone component by splitting launch_server into two functions, launch_server and launch_engine.

launch_server: calls launch_engine and then creates the HTTP server; used by the SRT Runtime and the standalone server.
launch_engine: creates the SRT engine only; used by the new Engine API.
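
The split can be pictured roughly as follows. This is a simplified sketch under stated assumptions, not the code in python/sglang/srt/server.py: ToyArgs, the component names, and the return values are all hypothetical and exist only to show that launch_server is a thin layer on top of launch_engine.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToyArgs:
    """Hypothetical stand-in for the real argument object."""
    model_path: str
    host: str = "127.0.0.1"  # only relevant when an HTTP server is created
    port: int = 30000


def launch_engine(args: ToyArgs) -> list[str]:
    """Bring up only the engine components; no HTTP server involved."""
    return ["tokenizer_manager", "controller", "detokenizer_manager"]


def launch_server(args: ToyArgs) -> list[str]:
    """Engine first, then an HTTP front end layered on top of it."""
    components = launch_engine(args)
    components.append(f"http_server@{args.host}:{args.port}")
    return components
```

With this shape, an offline caller uses launch_engine alone, while the server path gets the same engine plus the HTTP layer.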

New public API: Engine

Engine is exposed at the top level of the package, so users can call it directly as sgl.Engine.

Engine Usage Example

Same settings as the vLLM example, but using the SRT Engine.

import sglang as sgl

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Create an LLM.
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

Output:

===============================
Prompt: Hello, my name is
Generated text:  Alistair. I am an independent game developer.

I am currently working on a virtual reality game that focuses on exploring the world and discovering new things. The game has a unique art style that combines elements of both photorealism and cartoonish art. The game will have no story, but rather focus on the exploration of the world and the various things that can be found within.

I am looking for a 3D artist who can help me bring my vision to life. The ideal candidate will have experience in creating environments, characters, and objects in Unity. They should also have a strong
===============================
Prompt: The president of the United States is
Generated text:  Donald Trump. The president of India is Narendra Modi. The president of China is Xi Jinping. The president of Russia is Vladimir Putin. The president of France is Emmanuel Macron. The president of the United Kingdom is Boris Johnson. The president of Canada is Justin Trudeau. The president of Australia is Scott Morrison. The president of Japan is Shinzo Abe. The president of South Korea is Moon Jae-in. The president of Brazil is Jair Bolsonaro. The president of Colombia is Ivan Duque. The president of Argentina is Mauricio Macri
===============================
Prompt: The capital of France is
Generated text:  Paris. Paris is one of the world’s leading cities and is known for its history, art, culture, architecture, and gastronomy. Paris is located in the north-central region of France and is surrounded by the Seine River, which flows through the city. The city is divided into twenty districts or arrondissements, each with its own unique character and charm. Some of the most famous landmarks in Paris include the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, the Arc de Triomphe, and the Champs-Élysées. Paris is also known
===============================
Prompt: The future of AI is
Generated text:  bright, and its potential to revolutionize various industries is immense. Here are some ways in which AI can impact the future of healthcare:

1. Personalized Medicine: AI can help in developing personalized treatment plans for patients based on their genetic makeup, medical history, and lifestyle factors. This can lead to more effective treatments and better outcomes for patients.
2. Diagnosis and Treatment: AI-powered tools can help healthcare providers diagnose diseases more accurately and quickly. They can also help in selecting the most appropriate treatment options based on the patient's condition.
3. Drug Development: AI can help in acceler

Discussion

One caveat: we still construct a full ServerArgs object, but its HTTP-server-related arguments are never used by the engine. I think this is acceptable because ServerArgs is a superset of the engine arguments, so it covers everything the engine needs.
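
A minimal sketch of that trade-off, assuming a dataclass-style argument object. ToyServerArgs, ToyEngine, HTTP_ONLY, and engine_view are hypothetical names for illustration and do not mirror the real ServerArgs fields.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class ToyServerArgs:
    """Hypothetical superset of engine arguments and HTTP arguments."""
    model_path: str
    tp_size: int = 1
    host: str = "127.0.0.1"  # HTTP-only, ignored by the engine
    port: int = 30000        # HTTP-only, ignored by the engine


# Fields that only matter when an HTTP server is created (assumption).
HTTP_ONLY = {"host", "port"}


class ToyEngine:
    def __init__(self, **kwargs):
        # Reuse the server argument class as-is instead of defining
        # a separate engine-only argument class.
        self.server_args = ToyServerArgs(**kwargs)

    def engine_view(self) -> dict:
        """Return the arguments the engine would actually consume."""
        return {
            f.name: getattr(self.server_args, f.name)
            for f in fields(self.server_args)
            if f.name not in HTTP_ONLY
        }
```

The unused fields simply keep their defaults, which is the cost accepted here in exchange for not maintaining a second argument class.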

Testing

Add test_srt_engine.py, which runs batch inference and asserts the answer.

TODO

  1. Add async generate
  2. Add encode

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@ByronHsu ByronHsu force-pushed the byhsu/decouple-engine-with-server branch from 57f9cd8 to 94346ea Compare October 4, 2024 08:10
@ByronHsu ByronHsu marked this pull request as ready for review October 4, 2024 08:13
@zhyncs
Member

zhyncs commented Oct 5, 2024

I think async generate is needed. Quick question: what does "add decode" in TODO item 2 refer to?

@ByronHsu
Collaborator Author

ByronHsu commented Oct 5, 2024

@zhyncs Oh, I meant encode, like the one in Runtime. It was a typo.

@ByronHsu
Collaborator Author

ByronHsu commented Oct 5, 2024

@zhyncs I can do the async generate in the next PR.

@ByronHsu ByronHsu force-pushed the byhsu/decouple-engine-with-server branch 2 times, most recently from 1bfd171 to c080de1 Compare October 5, 2024 23:14
@merrymercy merrymercy mentioned this pull request Oct 6, 2024
4 tasks
python/sglang/api.py (outdated, resolved)
python/sglang/srt/server.py (outdated, resolved)
examples/frontend_language/usage/srt_engine.py (outdated, resolved)
@merrymercy merrymercy mentioned this pull request Oct 6, 2024
33 tasks
@ByronHsu ByronHsu force-pushed the byhsu/decouple-engine-with-server branch 2 times, most recently from 3d92605 to 34a6c2e Compare October 6, 2024 06:28
python/sglang/srt/server.py (resolved)
python/sglang/srt/server.py (outdated, resolved)
test/lang/run_suite.py (outdated, resolved)
@ByronHsu ByronHsu force-pushed the byhsu/decouple-engine-with-server branch from 005e5e0 to b18b447 Compare October 6, 2024 07:19
@ByronHsu ByronHsu force-pushed the byhsu/decouple-engine-with-server branch from 66d5402 to 22c7e3e Compare October 6, 2024 17:06
@ByronHsu
Collaborator Author

ByronHsu commented Oct 6, 2024

Please don't merge yet. The consistency test is failing on H100 in CI but passing on my A100.

@ByronHsu ByronHsu force-pushed the byhsu/decouple-engine-with-server branch from 1a49352 to 61f17a2 Compare October 6, 2024 22:03
test/srt/test_srt_engine.py (outdated, resolved)
test/srt/test_srt_engine.py (outdated, resolved)
@merrymercy merrymercy enabled auto-merge (squash) October 7, 2024 03:02
@merrymercy merrymercy enabled auto-merge (squash) October 7, 2024 03:03
@merrymercy merrymercy changed the title from "Decouple engine with server and provide an engine API" to "Provide an offline engine API" Oct 7, 2024
@merrymercy merrymercy merged commit 551a3a9 into sgl-project:main Oct 7, 2024
11 checks passed
@imadoualid

Hey guys, I'm getting AttributeError: module 'sglang' has no attribute 'Engine' on sglang 0.3.2. Is this still not released?

@ByronHsu
Collaborator Author

@imadoualid The changes should be at HEAD of main, if you can build from source.

@ByronHsu ByronHsu deleted the byhsu/decouple-engine-with-server branch October 13, 2024 16:16
5 participants