
Feature/function calling update #2700

Merged
zhaochenyang20 merged 37 commits into sgl-project:main from YAMY1234:feature/function-calling-update on Jan 26, 2025

Conversation

@YAMY1234 (Contributor) commented Jan 2, 2025

Pull Request Description

Summary

This pull request introduces streaming modes for function calling within the OpenAI API integration and updates the non-streaming framework for better extensibility. The changes include:

  1. New Features:

    • Implemented a FunctionCallParser for robust, efficient parsing of function calls in both streaming and non-streaming contexts.
    • Added support for incremental streaming responses via the parse_streaming_increment method (a usage sketch follows this list).
    • Enhanced tooling support with structured parsing of tool calls, enabling seamless function integration and improved parameter handling.
  2. Refactoring:

    • Refactored openai_api/adapter.py to integrate streaming tool call parsing logic.
    • Updated openai_api/protocol.py with additional models (ToolCallItem, DeltaMessage) to support streaming functionalities.
  3. Documentation:

    • Added detailed comments and docstrings for new classes and methods to enhance readability and maintainability.
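For orientation, here is a minimal sketch of how the streaming parser might be driven. The names FunctionCallParser and parse_streaming_increment come from this PR's description; the constructor arguments and return shape are illustrative assumptions, not the merged API.

from sglang.srt.function_call_parser import FunctionCallParser

# Hedged sketch: the signature and return types below are assumptions.
# `tools` and `generated_text_chunks` are assumed to be defined elsewhere.
parser = FunctionCallParser(tools=tools, tool_call_parser="llama3")

for chunk in generated_text_chunks:  # decoded text deltas from the model
    # Feed each increment; the parser buffers partial JSON internally and
    # returns plain text plus any tool-call deltas completed so far.
    normal_text, tool_call_items = parser.parse_streaming_increment(chunk)
    for item in tool_call_items:
        print(item.name, item.parameters)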

Detailed Changes

  • docs/backend/function_calling_streaming.py:

    • Added functionality to demonstrate streaming and non-streaming API calls with mock tool integrations.
    • Included an example for handling tool calls and parsing streamed arguments incrementally.
  • python/sglang/srt/function_call_parser.py:

    • Introduced FunctionCallParser, StreamingJSONParser, and related utility functions to handle function calls during streaming responses.
    • Implemented logic for detecting and parsing incremental JSON inputs with robust error handling.
  • python/sglang/srt/openai_api/adapter.py:

    • Integrated FunctionCallParser to enable real-time function call parsing during streaming response generation.
    • Adjusted tool-related logic to align with the new structured tool parsing approach.
  • python/sglang/srt/openai_api/protocol.py:

    • Modified the FunctionResponse and ToolCall models to use Optional fields for compatibility with the new parser.
    • Added ToolCallItem and DeltaMessage models to streamline the representation of parsed tool calls and response deltas (a sketch of these models follows this list).
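As a rough illustration of the protocol additions above, the new models might look like the following. The names ToolCallItem and DeltaMessage are from this PR, but the exact fields shown are assumptions rather than the merged definitions.

from typing import List, Optional

from pydantic import BaseModel


class ToolCallItem(BaseModel):
    tool_index: int  # position of this call within the response
    name: Optional[str] = None  # function name, once parsed
    parameters: str = ""  # JSON-encoded arguments, possibly partial


class DeltaMessage(BaseModel):
    role: Optional[str] = None
    content: Optional[str] = None
    tool_calls: Optional[List[ToolCallItem]] = None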

Testing

  • Verified the functionality of streaming and non-streaming API calls using mock scenarios.
  • Validated the correctness of tool call parsing through tests and real-time simulations.

@YAMY1234 YAMY1234 changed the title Feature/function calling update WIP: Feature/function calling update Jan 2, 2025
@merrymercy (Contributor)

@HaoyuWang4188 @Tushar-ml @YAMY1234 Can you review each other's code? #2576

@YAMY1234 YAMY1234 requested a review from HaiShaw as a code owner January 2, 2025 19:59
@YAMY1234 YAMY1234 force-pushed the feature/function-calling-update branch 3 times, most recently from 25a03b0 to 63c3d4e Compare January 2, 2025 20:10
@YAMY1234 (Contributor, Author) commented Jan 3, 2025

@HaoyuWang4188 @Tushar-ml @YAMY1234 Can you review each other's code? #2576

Sure! We’ll review each other’s code and collaborate to work out a great solution. 🚀

@HaoyuWang4188 (Contributor) commented Jan 3, 2025

Hi! After a general review, I'd like to start some discussions to help us determine the best solution:

1. Support for parallel_tool_calls

The OpenAI API supports the parallel_tool_calls option (default true), which determines whether the LLM may output multiple tool calls at once.
In vLLM, this option is accepted but effectively ignored, always behaving as parallel_tool_calls=true (details).
Our current implementation does not consider this option in either #2544 or #2700. The actual behaviour is summarized as follows:

Static API in #2544 (details)

  • parallel_tool_calls=true for qwen2.5 (can output multiple tool calls at once)
  • parallel_tool_calls=false for internlm2, llama3.1, llama3.2 (only the first parsed tool call is output)

I tried to align these behaviours by setting parallel_tool_calls=false by default and forcing qwen2.5 to output only the first tool call at the API level in link.
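For reference, this is how the option appears in an OpenAI-style request. This is a hedged sketch: the port and model are illustrative, and tools is assumed to be defined elsewhere.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is the weather in SF and in NYC?"}],
    tools=tools,  # tool schemas, assumed defined
    parallel_tool_calls=False,  # aligned behaviour: only the first parsed tool call
)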

IMO, parallel_tool_calls should be supported in our function calling API (both static and streaming), and I suggest supporting it in two steps:

2. Aligning terms for model names

In #2544, we use these names (link):

Name       Special Token (i.e. bot_token)
Llama 3.2  <|python_tag|>
Llama 3.1  <function=
Qwen 2.5   <tool_call>
InternLM   <|plugin|>

I prefer to change "Llama 3.1/3.2" into "Llama 3.1+" (since Llama 3.3 also shares the same pattern) and to use the terms JSON-based and User-defined from Meta's doc for clarity: Llama 3.2 adds no new function calling support in its training phase, and both <|python_tag|> and <function= have been supported since 3.1.
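For context, the two Llama 3.1+ output styles referred to here look roughly like this (paraphrased from Meta's docs; the exact formats may differ):

JSON-based:   {"name": "get_weather", "parameters": {"city": "Boston"}}
User-defined: <function=get_weather>{"city": "Boston"}</function>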

@HaoyuWang4188 (Contributor)

Please let me know your thoughts on the above discussion. @YAMY1234 @Thunderbeee @Tushar-ml @merrymercy
If you guys agree with my suggestions, I will proceed with the proposed plan. Thank you! 🚀

@YAMY1234 YAMY1234 force-pushed the feature/function-calling-update branch from 8ad4c1f to 882b77a Compare January 8, 2025 06:00
@YAMY1234 (Contributor, Author) commented Jan 8, 2025

Please let me know your thoughts on the above discussion. @YAMY1234 @Thunderbeee @Tushar-ml @merrymercy If you guys agree with my suggestions, I will proceed with the proposed plan. Thank you! 🚀

Hi @HaoyuWang4188, thank you for the detailed review and suggestions!
1. We fully support the idea of creating a new PR to add support for parallel_tool_calls as described.
2. We also agree with the plan to change [Feature] Add partial support for parallel_tool_calls in Function Calling API #2576 to a support PR that always sets parallel_tool_calls to false.
3. For changing Llama 3.1/3.2 into Llama 3.1+, we totally agree and have changed this in our recent commits.

Currently, our implementation is nearly complete. Could you consider adding this new parameter based on our existing changes (perhaps cherry-picking them)? Alternatively, I have another suggestion: we could combine the addition of the parallel_tool_calls parameter and its full implementation into the next PR to streamline the process.

Let me know your thoughts!

@HaoyuWang4188 (Contributor) left a comment

Add some suggested changes for parallel_tool_calls.

Contributor

Suggested change:
-            calls.append(tool_call_item)
+            calls.append(tool_call_item)
+            break

Contributor

Suggested change:
-        _, action = text.split("<|python_tag|>")
+        _, action = text.split("<|python_tag|>")
+        # split multiple actions and only select the first one
+        # e.g. {"name": "A", "parameters": {"arg": "x"}}; {"name": "B", "parameters": {"arg": "y"}}
+        if "}};" in action:
+            action = action.split("}};")[0] + "}}"

@HaoyuWang4188 (Contributor)

> (quoting the exchange above)

I'm glad we agreed 😄. I commented with the related suggestions (no cherry-pick needed now). Next, we just need to merge #2700 and #2576 to finish the main support for the function calling API. Can't wait to use it in SGLang 🚀.

Collaborator

Refer to the docs readme:

# 4) Clean notebook outputs
# nbstripout removes notebook outputs so your PR stays clean
pip install nbstripout
find . -name '*.ipynb' -exec nbstripout {} \;

Comment on lines 9 to 19
Collaborator

Make this concise.

This guide demonstrates how to use SGLang’s ToolCalling functionality with a get_current_weather function. You can replace or add any tool function depending on your use case.

Comment on lines 26 to 56
Collaborator

Make it concise, and do not use os.environ["CUDA_VISIBLE_DEVICES"] = "7". We only have 1 GPU for docs CI.


Launch the Server

import os
from openai import OpenAI
import json

from sglang.utils import execute_shell_command, wait_for_server, terminate_process

server_process = execute_shell_command(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --tool-call-parser llama3 --port 30222 --host 0.0.0.0" 
)
wait_for_server("http://localhost:30222")

Note that --tool-call-parser defines the parser used to interpret responses. Currently supported parsers include:

  • llama3: Llama 3.1 / 3.2 (e.g. meta-llama/Llama-3.1-8B-Instruct, meta-llama/Llama-3.2-1B-Instruct).
  • mistral: Mistral (e.g. mistralai/Mistral-7B-Instruct-v0.3, mistralai/Mistral-Nemo-Instruct-2407, mistralai/Mistral-7B-v0.3).
  • qwen25: Qwen 2.5 (e.g. Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-7B-Instruct).
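A minimal follow-up request against this server might look like the sketch below; this is a hedged example, with tools assumed to be defined as elsewhere in the notebook (OpenAI is imported in the launch snippet above).

client = OpenAI(base_url="http://localhost:30222/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather like in Boston today?"}],
    tools=tools,  # tool schemas, assumed defined elsewhere in the notebook
)
print(response.choices[0].message.tool_calls)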

Comment on lines 137 to 58
Collaborator

Below is a Python snippet that shows how to define a tool as a dictionary. The dictionary includes the tool name, a description, and its parameters, defined via properties.
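A hedged sketch of such a dictionary, modeled on the OpenAI tool format (the get_current_weather schema below is illustrative, not the exact one from the notebook):

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name, e.g. Boston"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
]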

Collaborator

delete this line.

Collaborator

This should always be true, and should even be asserted to be true. Do not add this if-else.

Collaborator

This should also be asserted to be true.

Collaborator

print_highlight

Collaborator

print_highlight.

Comment on lines 456 to 338
Collaborator

Just add one block at the end:

terminate_process(server_process)

No need to explain or add a title.

Comment on lines 16 to 58
Collaborator

This is verbose. Please use this:

Launch the Server

import os
from openai import OpenAI
import json

from sglang.utils import execute_shell_command, wait_for_server, terminate_process

server_process = execute_shell_command(
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --tool-call-parser llama3 --port 30222 --host 0.0.0.0" 
)
wait_for_server("http://localhost:30222")

Note that --tool-call-parser defines the parser used to interpret responses. Currently supported parsers include:

  • llama3: Llama 3.1 / 3.2 (e.g. meta-llama/Llama-3.1-8B-Instruct, meta-llama/Llama-3.2-1B-Instruct).
  • mistral: Mistral (e.g. mistralai/Mistral-7B-Instruct-v0.3, mistralai/Mistral-Nemo-Instruct-2407, mistralai/Mistral-7B-v0.3).
  • qwen25: Qwen 2.5 (e.g. Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-7B-Instruct).

Collaborator

Delete this line. No explanation is needed since the code is clear.

Collaborator

When the server -> When the engine

Collaborator

Note, non-streaming mode also supports function calling. ->

Give it an example like the streaming one.

Collaborator

Delete this explanation, since the code already has this line:

# This is a demonstration, define real function according to your usage.

Collaborator

Delete this line; it's redundant with # This is a demonstration, define real function according to your usage.

Collaborator

Delete this line; I don't see the point of leaving it here.

Collaborator

delete this.

Collaborator

delete this.

Comment on lines 306 to 327
Collaborator

There is one blank block. Delete it (screenshot attached).

@Thunderbeee Thunderbeee force-pushed the feature/function-calling-update branch from 9f9ecec to cce2859 Compare January 11, 2025 08:42
Collaborator

Make it concise; name it: Tool and Function Calling

Collaborator

Make it concise:

This guide demonstrates how to use SGLang’s Tool Calling functionality.

Collaborator

delete.

Collaborator

Do not use more than ### titles. Use **Non-Streaming Request** instead.

Collaborator

Do not use more than ### titles. Use **Streaming Request** instead.

Collaborator

Do not use ( and ) here.

Just:

arguments_non_stream = response_non_stream.choices[0].message.tool_calls[0].function.arguments
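For the streaming counterpart, the arguments arrive as incremental deltas; below is a hedged sketch of accumulating them (field names follow the OpenAI streaming schema; response_stream is assumed to be created with stream=True):

arguments_stream = ""
for chunk in response_stream:
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        # each delta carries a fragment of the JSON-encoded arguments
        arguments_stream += delta.tool_calls[0].function.arguments or ""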

@YAMY1234 YAMY1234 force-pushed the feature/function-calling-update branch from 6d71133 to 94a1338 Compare January 14, 2025 17:09
@zhaochenyang20 (Collaborator) left a comment

The docs look good to me. @shuaills will review the other parts.

@YAMY1234 YAMY1234 force-pushed the feature/function-calling-update branch from 8f4ac89 to 9cf3086 Compare January 20, 2025 19:23
@zhaochenyang20 (Collaborator)

@YAMY1234 @shuaills @Thunderbeee Tom is rebasing these days, so it's urgent to merge and review 😂

@Thunderbeee Thunderbeee force-pushed the feature/function-calling-update branch from 5442889 to cc147ba Compare January 22, 2025 00:08
@zhaochenyang20 (Collaborator)

Nice work!

@YAMY1234 YAMY1234 force-pushed the feature/function-calling-update branch from 522ab47 to 5717965 Compare January 25, 2025 17:41
@YAMY1234 YAMY1234 changed the title WIP: Feature/function calling update Feature/function calling update Jan 25, 2025
Collaborator

As a matter of fact, if this content is redundant with function_calling.ipynb, please just link to the corresponding content in function_calling.ipynb. Using a URL is okay, like https://docs.sglang.ai/backend/openai_api_completions.html#Launch-A-Server

Collaborator

As a matter of fact, if this content is redundant with function_calling.ipynb, please just link to the corresponding content in function_calling.ipynb. Using a URL is okay, like https://docs.sglang.ai/backend/openai_api_completions.html#Launch-A-Server

@zhaochenyang20 (Collaborator) left a comment

LGTM

@zhaochenyang20 zhaochenyang20 merged commit b045841 into sgl-project:main Jan 26, 2025
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025
Co-authored-by: Mingyuan Ma <mamingyuan2001@berkeley.edu>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: shuaills <shishuaiuoe@gmail.com>