[Feature] support pooling model runner #4590

lizexu123 · 2025-10-24T13:54:13Z

This PR adds support for running actual inference with pooling models.

Usage

Online serving

Launch the service.
Embeddings can be obtained through either the EmbeddingCompletionRequest API or the EmbeddingChatRequest API.

# Start the service demo
model_path=Qwen3-Embedding-0.6B
# The following setting is required
export FD_DISABLE_CHUNKED_PREFILL=1  # Force-disable the default Chunked Prefill feature

python -m fastdeploy.entrypoints.openai.api_server --model ${model_path} \
    --max-num-seqs 256 --max-model-len 32768 \
    --port 13331 --engine-worker-queue-port 7132 \
    --metrics-port 7431 --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --load-choices "default_v1" \
    --runner pooling \
    --graph-optimization-config '{"use_cudagraph":false}'

Request Method (curl example)

A. EmbeddingCompletionRequest example (standard text input)

curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "text-embedding-chat-model",
    "input": [
      "This is a sentence for pooling embedding.",
      "Another input text."
    ],
    "user": "test_client"
  }'

B. EmbeddingChatRequest example (message-sequence input)

curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "text-embedding-chat-model",
    "messages": [
      {"role": "user", "content": "Generate embedding for user query."}
    ]
  }'
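
For readers who prefer a script over curl, the two request bodies above can be sketched in Python using only the standard library. This is a minimal sketch, not part of the PR: the helper names `build_embedding_payload` and `post_embeddings` are hypothetical, and the base URL in the commented-out call assumes the service was started locally with the `--port 13331` setting from the launch command.

```python
import json
import urllib.request


def build_embedding_payload(model, inputs=None, messages=None, user=None):
    """Build the JSON body for either request style shown in the curl examples."""
    # Exactly one of `inputs` / `messages` must be given.
    if (inputs is None) == (messages is None):
        raise ValueError("pass exactly one of `inputs` or `messages`")
    payload = {"model": model}
    if inputs is not None:
        payload["input"] = list(inputs)  # EmbeddingCompletionRequest style
    else:
        payload["messages"] = list(messages)  # EmbeddingChatRequest style
    if user is not None:
        payload["user"] = user
    return payload


def post_embeddings(base_url, payload, timeout=30.0):
    """POST the payload to <base_url>/v1/embeddings and return the parsed JSON."""
    request = urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    body = build_embedding_payload(
        "text-embedding-chat-model",
        inputs=["This is a sentence for pooling embedding.", "Another input text."],
        user="test_client",
    )
    print(json.dumps(body, indent=2))
    # Assumed local address; replace with YOUR_SERVICE_URL before sending:
    # result = post_embeddings("http://localhost:13331", body)
```

Keeping payload construction separate from the network call makes it possible to check the request body without a running server.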

Enabling CUDA Graph or custom all-reduce currently triggers bugs, so both are temporarily disabled.
TODO:
The offline interface does not support pooling yet; support will be added in the future.


paddle-bot · 2025-10-24T13:54:21Z

Thanks for your contribution!


zoooo0820 · 2025-10-28T08:39:34Z

custom_ops/gpu_ops/update_inputs_v1.cu

      const_cast<bool*>(is_block_step.data<bool>()),
      next_tokens.data<int64_t>(),
-      now_bsz,
+      bsz_to_process,


Why does this need to be changed to max_bsz?

Here, all max-num-seqs entries of seq_lens_encoder need to be set to 0.

zoooo0820 · 2025-10-28T08:49:55Z

fastdeploy/model_executor/pre_and_post_process.py

+
+            pooler_output = pooler_output.numpy()
+            if pooler_output.dtype != np.float32:
+                pooler_output = pooler_output.astype(np.float32)


Is the output here always returned as fp32?

Fixed. Only bfloat16 is converted this way now, because NumPy has no such type.
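
As background for why float32 is a safe target for the bfloat16 case: bfloat16 is the upper 16 bits of an IEEE-754 float32, so widening it loses no information. The sketch below illustrates that with plain NumPy on raw bit patterns. It is only an illustration under that assumption; the PR itself converts Paddle tensors, and the helper name `bfloat16_bits_to_float32` is hypothetical.

```python
import numpy as np


def bfloat16_bits_to_float32(bits) -> np.ndarray:
    """Widen raw bfloat16 bit patterns (stored as uint16) to float32 values."""
    bits = np.asarray(bits, dtype=np.uint16)
    # Place the 16 bfloat16 bits in the high half of a 32-bit word; the low
    # 16 mantissa bits become zero, so the numeric value is unchanged.
    widened = bits.astype(np.uint32) << np.uint32(16)
    # Reinterpret the 32-bit words as float32 without changing any bits.
    return widened.view(np.float32)


if __name__ == "__main__":
    raw = np.array([0x3F80, 0xC000, 0x0000], dtype=np.uint16)
    print(bfloat16_bits_to_float32(raw))
```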

zoooo0820 · 2025-10-28T08:56:00Z

tests/pooling/test_Qwen3-Embedding_serving.py

+        save_embedding_baseline(embedding, baseline_file)
+    else:
+        print(f"Comparing with baseline: {baseline_file}")
+        check_embedding_against_baseline(embedding, baseline_file, threshold=0.01)


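Does this mean that every CI run will only save the baseline dump and never perform the accuracy comparison?

Let's have Baoku (宝库) run this once in the CI environment first, and then merge.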
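
The concern in this thread is that a fresh environment has no baseline file, so only the save branch ever runs. A minimal sketch of the save-or-compare pattern makes the two branches explicit. All names here (`save_baseline`, `mean_abs_diff`, `save_or_check`) are hypothetical, the JSON file format and the mean-absolute-difference metric are assumptions, and the helpers in the PR's test may work differently.

```python
import json
import os
import tempfile


def save_baseline(embedding, path):
    """Write the embedding to `path` as a JSON list of floats."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump([float(x) for x in embedding], f)


def mean_abs_diff(embedding, path):
    """Mean absolute difference between `embedding` and the stored baseline."""
    with open(path, "r", encoding="utf-8") as f:
        baseline = json.load(f)
    if not baseline or len(baseline) != len(embedding):
        raise ValueError("baseline is empty or has a different length")
    return sum(abs(a - b) for a, b in zip(embedding, baseline)) / len(baseline)


def save_or_check(embedding, path, threshold=0.01):
    """Return 'saved' when no baseline exists yet, 'checked' after a comparison."""
    if not os.path.exists(path):
        # First run in this environment: nothing to compare against.
        save_baseline(embedding, path)
        return "saved"
    diff = mean_abs_diff(embedding, path)
    if diff >= threshold:
        raise AssertionError(f"mean abs diff {diff:.6f} >= threshold {threshold}")
    return "checked"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        baseline_file = os.path.join(tmp, "baseline.json")
        print(save_or_check([0.1, 0.2, 0.3], baseline_file))  # first run saves
        print(save_or_check([0.1, 0.2, 0.3], baseline_file))  # second run compares
```

Returning which branch ran lets a CI job assert that a comparison actually happened, which is one way to catch the situation the reviewer describes.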

zoooo0820 · 2025-10-28T09:27:00Z

fastdeploy/model_executor/pre_and_post_process.py

+            if save_each_rank or model_output.mp_rank == 0:
+                output = _build_stream_transfer_data(output_tokens=None, pooler_outputs=pooler_output.outputs)
+
+                async_output_queue.put(output)


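If FD_USE_GET_SAVE_OUTPUT_V1 is not enabled here, is the behavior undefined? This environment variable defaults to 0, and users are unlikely to notice that they have to turn it on. Could it be enabled automatically for pooling models? If something goes wrong, it would also be best to raise a prominent error with a hint on how to fix it.

I removed this environment variable check here entirely. Pooling always takes this path; there is no second option.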

yuanlehome · 2025-10-28T11:34:21Z

What is the reason CUDA Graph cannot be used?

lizexu123 · 2025-10-28T11:35:51Z

What is the reason CUDA Graph cannot be used?

yuanlehome · 2025-10-28T11:38:06Z

fastdeploy/engine/common_engine.py

-                            delta_text, token_ids = self._decode_token(
-                                token_ids=content.outputs.token_ids, req_id=request_id, is_end=content.finished
-                            )
+                        if isinstance(content, RequestOutput):


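State explicitly which type the else branch handles, then add a final else that raises an error.

For now, everything that is not a generative output goes through the else branch. We currently have only these two kinds; reward models and others will come later, and they will all take the else branch. This change follows Sun Lei's (孙磊) reference.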
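
The reviewer's suggestion, an explicit type check per branch plus a final error branch, can be sketched as follows. This is a standalone illustration, not the code in common_engine.py: `RequestOutput` here is a stand-in dataclass for the type named in the diff, while `PoolingRequestOutput` and `describe_output` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestOutput:
    """Stand-in for the generative output type named in the diff."""
    token_ids: List[int] = field(default_factory=list)


@dataclass
class PoolingRequestOutput:
    """Hypothetical stand-in for a pooling output type."""
    embedding: List[float] = field(default_factory=list)


def describe_output(content) -> str:
    """Dispatch on the concrete output type and fail loudly on anything else."""
    if isinstance(content, RequestOutput):
        return f"generation: {len(content.token_ids)} tokens"
    elif isinstance(content, PoolingRequestOutput):
        return f"pooling: {len(content.embedding)} dims"
    else:
        # New output kinds (for example reward) need their own branch.
        raise TypeError(f"unsupported output type: {type(content).__name__}")


if __name__ == "__main__":
    print(describe_output(RequestOutput(token_ids=[1, 2, 3])))
    print(describe_output(PoolingRequestOutput(embedding=[0.1, 0.2])))
```

The trade-off raised in the reply is that a catch-all else needs no change when new non-generative outputs are added, whereas the explicit form surfaces an unexpected type immediately.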

yuanlehome · 2025-10-28T11:38:55Z

custom_ops/gpu_ops/update_inputs_v1.cu

-          seq_lens_encoder[thread_idx] = 0;
-          int64_t* input_ids_now = input_ids + thread_idx * input_ids_stride;
-          input_ids_now[0] = next_tokens[thread_idx];
+      if (is_pooling_task) {


What logic change was made to this operator?

For pooling, every value across the full shape of seq_lens_encoder is set to 0. This ensures exist_prefill is 0 and fixes the hang.
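
The behavior described in this reply can be approximated on the host side with NumPy. This is a rough sketch of the described effect only, not the CUDA kernel: the function names are hypothetical, the real operator applies further per-slot conditions that are not shown in the thread, and treating exist_prefill as "any encoder length is still positive" is an assumption.

```python
import numpy as np


def clear_encoder_lens(seq_lens_encoder, now_bsz, is_pooling_task):
    """Return a copy with encoder (prefill) lengths cleared for the processed slots."""
    out = np.array(seq_lens_encoder, dtype=np.int32, copy=True)
    # For pooling, process every slot (max-num-seqs entries) rather than
    # only the first `now_bsz` active ones.
    bsz_to_process = out.shape[0] if is_pooling_task else now_bsz
    out[:bsz_to_process] = 0
    return out


def exist_prefill(seq_lens_encoder) -> bool:
    """Assumed meaning: some slot still has a pending prefill length."""
    return bool(np.any(np.asarray(seq_lens_encoder) > 0))


if __name__ == "__main__":
    lens = [5, 0, 3, 7]
    print(clear_encoder_lens(lens, now_bsz=2, is_pooling_task=False))
    print(clear_encoder_lens(lens, now_bsz=2, is_pooling_task=True))
```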

yuanlehome · 2025-10-28T11:40:12Z

fastdeploy/model_executor/pre_and_post_process.py



-def _build_stream_transfer_data(output_tokens: np.ndarray):
+def _build_stream_transfer_data(output_tokens: np.ndarray, pooler_outputs: None):


If pooler_outputs is meant to default to None, is this the right way to write it? Shouldn't it be pooler_outputs: type = None?
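
For reference, the annotation form the reviewer is pointing at looks like the sketch below. In `pooler_outputs: None`, `None` is the type annotation and the parameter has no default; an optional parameter is written with `Optional[...] = None`. The function `build_transfer_data` and its dictionary return value are hypothetical, and `np.ndarray` as the type of `pooler_outputs` is an assumption, since the thread does not state the real type.

```python
from typing import Optional

import numpy as np


def build_transfer_data(
    output_tokens: Optional[np.ndarray],
    pooler_outputs: Optional[np.ndarray] = None,  # annotation and default value
) -> dict:
    """Report which payload was supplied (illustration only)."""
    if pooler_outputs is not None:
        return {"kind": "pooling", "size": int(pooler_outputs.size)}
    if output_tokens is not None:
        return {"kind": "tokens", "size": int(output_tokens.size)}
    raise ValueError("either output_tokens or pooler_outputs is required")


if __name__ == "__main__":
    print(build_transfer_data(np.array([1, 2, 3])))
    print(build_transfer_data(output_tokens=None, pooler_outputs=np.zeros(4)))
```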

yuanlehome · 2025-10-28T11:43:29Z

fastdeploy/worker/gpu_model_runner.py

+        print("num_tokens", num_tokens)
+        print("max_num_seqs", max_num_seqs)
+        print("num_reqs", num_reqs)
+        print("min_tokens_per_req", min_tokens_per_req)
+        print("num_scheduled_token_list", num_scheduled_tokens_list)


yuanlehome · 2025-10-28T11:47:33Z

fastdeploy/worker/gpu_model_runner.py

            )
+            return None
+
+    def _pool(self, hidden_states: paddle.Tensor, num_running_requests: int) -> Optional[ModelRunnerOutput]:


Please make this function's name more semantically explicit.

This name also follows vLLM's naming.

CLAassistant · 2025-10-28T12:06:00Z

All committers have signed the CLA.


gongshaotian

LGTM

lizexu123 added 6 commits September 22, 2025 20:33

support qwen3-embedding

7716866

merge develop

5fde033

Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …

001f23d

…into develop

Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …

73141d4

…into develop

Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …

6a2ddaf

…into develop

support qwen3-embedding-0.6b

85d14ba

lizexu123 added 2 commits October 24, 2025 21:56

fix

8200040

update

5832cc4

lizexu123 force-pushed the suport_spoling branch from cfb1754 to 5832cc4 Compare October 27, 2025 05:43

lizexu123 added 4 commits October 27, 2025 14:57

fix bug

58616e4

fix test_return_token_ids.py and update enable_thinking

ad2f7b6

Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …

aeddcac

…into suport_spoling

fix mtp dummy_run

955fac1

lizexu123 force-pushed the suport_spoling branch from 26b8569 to 955fac1 Compare October 27, 2025 10:01

lizexu123 added 2 commits October 28, 2025 13:52

Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …

21c20a7

…into suport_spoling

merge develop

0206d42

zoooo0820 reviewed Oct 28, 2025

fix np.float32

30795d2

zoooo0820 reviewed Oct 28, 2025

delete FD_DISABLE_CHUNKED_PREFILL and FD_USE_GET_SAVE_OUTPUT_V1

6bc1ed2

lizexu123 closed this Oct 28, 2025

lizexu123 reopened this Oct 28, 2025

yuanlehome reviewed Oct 28, 2025

delete and build_stream_transfer_data

f439ca2

lizexu123 force-pushed the suport_spoling branch from b4a8a1a to f439ca2 Compare October 28, 2025 12:10

yuanlehome previously approved these changes Oct 28, 2025

fix test_update_v1:

27d686b

lizexu123 dismissed yuanlehome’s stale review via 27d686b October 28, 2025 16:12

lizexu123 added 2 commits October 28, 2025 16:26

update develop

a6a9483

Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …

15a0df8

…into suport_spoling

lizexu123 force-pushed the suport_spoling branch from a0e0452 to 15a0df8 Compare October 29, 2025 06:18

lizexu123 added 6 commits October 29, 2025 09:15

fix

57e76be

fix

eae6db6

update dummy_run post_process

90d5ee1

delete test_update_v1

1a35691

fix

2fa8733

fix dummy_run

5b12f6f

yuanlehome previously approved these changes Oct 30, 2025

lizexu123 dismissed yuanlehome’s stale review via 1cfbcb7 October 30, 2025 07:52

fix model_path

7ca73ba

lizexu123 force-pushed the suport_spoling branch from 1169a1b to 7ca73ba Compare October 30, 2025 08:26

lizexu123 added 4 commits October 30, 2025 08:27

fix model_path

1e3cae5

fix dummy_run

90ef114

Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …

6c20954

…into suport_spoling

merge develop

c8c3664

EmmonsCurse approved these changes Oct 31, 2025

gongshaotian approved these changes Oct 31, 2025

gongshaotian merged commit 4ac6de9 into PaddlePaddle:develop Oct 31, 2025
35 of 39 checks passed


