
Conversation

@lizexu123 (Collaborator) commented Oct 24, 2025

This PR adds support for actual inference with pooling models.

Usage

Online serving

  • Launch the server.
  • Embeddings can be obtained via the EmbeddingCompletionRequest API or the EmbeddingChatRequest API.
# Start the service demo
model_path=Qwen3-Embedding-0.6B
# The following parameter is required
export FD_DISABLE_CHUNKED_PREFILL=1  # force-disable the default chunked prefill feature

python -m fastdeploy.entrypoints.openai.api_server --model ${model_path} \
    --max-num-seqs 256 --max-model-len 32768 \
    --port 13331 --engine-worker-queue-port 7132 \
    --metrics-port 7431 --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --load-choices "default_v1" \
    --runner pooling \
    --graph-optimization-config '{"use_cudagraph":false}'

Request Method (curl example)

A. EmbeddingCompletionRequest example (plain text input)

curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "text-embedding-chat-model",
    "input": [
      "This is a sentence for pooling embedding.",
      "Another input text."
    ],
    "user": "test_client"
  }'

B. EmbeddingChatRequest example (message-sequence input)

curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "text-embedding-chat-model",
    "messages": [
      {"role": "user", "content": "Generate embedding for user query."}
    ]
  }'
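
Equivalently, the endpoint can be exercised with the standard openai Python client. A minimal sketch (the port and model name follow the launch command above; the api_key value is an unused placeholder):

from openai import OpenAI

# Point the stock OpenAI client at the local FastDeploy server.
client = OpenAI(base_url="http://localhost:13331/v1", api_key="EMPTY")

resp = client.embeddings.create(
    model="text-embedding-chat-model",
    input=["This is a sentence for pooling embedding.", "Another input text."],
)
print(len(resp.data), len(resp.data[0].embedding))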

Currently there are bugs when CUDA Graph and custom all-reduce are enabled, so both are temporarily disabled.
TODO:
Offline interface support for pooling is temporarily unavailable and will be added in the future.

@paddle-bot bot commented Oct 24, 2025

Thanks for your contribution!

const_cast<bool*>(is_block_step.data<bool>()),
next_tokens.data<int64_t>(),
now_bsz,
bsz_to_process,
Collaborator:

Why does this need to change to max_bsz?

Collaborator Author:

Here all max-num-seqs entries of seq_lens_encoder need to become 0.


pooler_output = pooler_output.numpy()
if pooler_output.dtype != np.float32:
    pooler_output = pooler_output.astype(np.float32)
Collaborator:

Is the output always returned as fp32 here?

Collaborator Author:

Fixed; only bfloat16 is converted this way, since numpy has no such dtype.
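
A minimal sketch of the revised conversion (the helper name here is illustrative, not the PR's actual function):

import numpy as np
import paddle

def pooler_output_to_numpy(pooler_output: paddle.Tensor) -> np.ndarray:
    # numpy has no bfloat16 dtype, so only that case is upcast before .numpy();
    # float16/float32 outputs keep their original dtype.
    if pooler_output.dtype == paddle.bfloat16:
        pooler_output = pooler_output.cast("float32")
    return pooler_output.numpy()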

    save_embedding_baseline(embedding, baseline_file)
else:
    print(f"Comparing with baseline: {baseline_file}")
    check_embedding_against_baseline(embedding, baseline_file, threshold=0.01)
Collaborator:

In the CI environment, will this only ever run the save/dump path and never the accuracy comparison?

Collaborator Author:

Let's have 宝库 run this in the CI environment first, then merge.
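
For reference, a save-or-compare baseline flow of this shape typically behaves like the sketch below (assumed semantics; the PR's actual helpers may differ):

import os
import numpy as np

def check_embedding_against_baseline(embedding, baseline_file, threshold=0.01):
    # Compare against the saved baseline by max absolute elementwise
    # difference; fail if the embedding drifts past the threshold.
    baseline = np.load(baseline_file)
    diff = float(np.max(np.abs(np.asarray(embedding, dtype=np.float32) - baseline)))
    assert diff <= threshold, f"embedding deviates from baseline by {diff}"

def save_or_check(embedding, baseline_file):
    # First run saves the baseline; later runs compare against it, so the
    # comparison is only skipped when no baseline file exists yet.
    if not os.path.exists(baseline_file):
        np.save(baseline_file, np.asarray(embedding, dtype=np.float32))
    else:
        print(f"Comparing with baseline: {baseline_file}")
        check_embedding_against_baseline(embedding, baseline_file, threshold=0.01)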

if save_each_rank or model_output.mp_rank == 0:
    output = _build_stream_transfer_data(output_tokens=None, pooler_outputs=pooler_output.outputs)

    async_output_queue.put(output)
Collaborator:

If FD_USE_GET_SAVE_OUTPUT_V1 is not enabled here, is the behavior undefined? That environment variable defaults to 0, and users are unlikely to notice they need to turn it on. Could it be enabled automatically for pooling models? If there is a problem, it would also be best to raise a prominent error with a fix hint.

Collaborator:

+1

Collaborator Author:

The environment variable has been removed entirely; pooling always takes this path, with no second option.

@yuanlehome (Collaborator):

What is the reason CUDA Graph cannot run?

@lizexu123 (Collaborator Author), quoting:

> What is the reason CUDA Graph cannot run?

@lizexu123 lizexu123 closed this Oct 28, 2025
@lizexu123 lizexu123 reopened this Oct 28, 2025
delta_text, token_ids = self._decode_token(
    token_ids=content.outputs.token_ids, req_id=request_id, is_end=content.finished
)
if isinstance(content, RequestOutput):
Collaborator:

State explicitly what type the else branch handles, then add a final else that raises an error.

Collaborator Author:

Right now anything non-generative falls through to the else branch, and we only have these two cases at the moment. Later there will be reward models and the like, which will also go through the else; this part was adapted with reference to 孙磊's changes.
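
The explicit dispatch the reviewer asked for would look roughly like this (PoolingRequestOutput is an assumed name for the non-generative output type; only the structure is the point):

def _process(self, content, request_id):
    if isinstance(content, RequestOutput):
        # Generative path: decode incremental tokens.
        delta_text, token_ids = self._decode_token(
            token_ids=content.outputs.token_ids,
            req_id=request_id,
            is_end=content.finished,
        )
    elif isinstance(content, PoolingRequestOutput):  # assumed class name
        ...  # pooling path; reward models etc. would also land here
    else:
        raise TypeError(f"Unsupported output type: {type(content)!r}")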

seq_lens_encoder[thread_idx] = 0;
int64_t* input_ids_now = input_ids + thread_idx * input_ids_stride;
input_ids_now[0] = next_tokens[thread_idx];
if (is_pooling_task) {
Collaborator:

What logic change was made to this operator?

@lizexu123 (Collaborator Author) commented Oct 28, 2025:

During pooling, every entry of seq_lens_encoder (across its full shape) is set to 0, which guarantees exist_prefill is 0 and fixes the hang.
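
In other words, prefill detection keys off nonzero entries in seq_lens_encoder, so zeroing every slot (not just the active batch) forces the no-prefill path. A sketch of that relationship (helper name assumed):

import paddle

def exist_prefill(seq_lens_encoder: paddle.Tensor) -> bool:
    # Any nonzero entry means some request is still in the prefill phase.
    return bool((seq_lens_encoder > 0).any())

seq_lens_encoder = paddle.zeros([256], dtype="int32")  # 256 = max-num-seqs
assert exist_prefill(seq_lens_encoder) is False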



-def _build_stream_transfer_data(output_tokens: np.ndarray):
+def _build_stream_transfer_data(output_tokens: np.ndarray, pooler_outputs: None):
Collaborator:

If pooler_outputs is meant to default to None, is this the right way to write it? Shouldn't it be pooler_outputs: type = None?

Collaborator Author:

Fixed.
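
The corrected signature would be of the form below (the element type of pooler_outputs is assumed; the merged code may annotate it differently):

from typing import Optional
import numpy as np

def _build_stream_transfer_data(
    output_tokens: Optional[np.ndarray],
    pooler_outputs: Optional[list] = None,
):
    # "pooler_outputs: None" annotated the parameter with the None type but
    # gave it no default; "Optional[...] = None" is the intended spelling.
    ...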

Comment on lines 1540 to 1544
print("num_tokens", num_tokens)
print("max_num_seqs", max_num_seqs)
print("num_reqs", num_reqs)
print("min_tokens_per_req", min_tokens_per_req)
print("num_scheduled_token_list", num_scheduled_tokens_list)
Collaborator:

delete it

Collaborator Author:

done

)
return None

def _pool(self, hidden_states: paddle.Tensor, num_running_requests: int) -> Optional[ModelRunnerOutput]:
Collaborator:

The semantics of this function name could be clearer.

Collaborator Author:

This naming also follows vLLM.

@CLAassistant commented Oct 28, 2025

CLA assistant check
All committers have signed the CLA.

@yuanlehome previously approved these changes Oct 28, 2025
@yuanlehome previously approved these changes Oct 30, 2025
@gongshaotian (Collaborator) left a comment:

LGTM

@gongshaotian gongshaotian merged commit 4ac6de9 into PaddlePaddle:develop Oct 31, 2025
35 of 39 checks passed