
Conversation

@lizexu123
Collaborator

Motivation

Support the stop_token_ids parameter.

💡 If this PR is a Cherry Pick, the PR title must follow the required format: add the [Cherry-Pick] label at the very beginning and append the original PR ID at the end, e.g. [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

online serving

  • launch the serving (a hedged launch sketch follows this list)
  • request with the stop_token_ids parameter, which accepts a List[int]
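A minimal launch sketch, assuming FastDeploy's OpenAI-compatible entrypoint fastdeploy.entrypoints.openai.api_server; the model path and port are taken from the examples in this PR, and all other flags are left at their defaults:

# hedged launch sketch; adjust flags to your deployment
python -m fastdeploy.entrypoints.openai.api_server \
    --model /root/paddlejob/workspace/env_run/output/models/paddle/Qwen/Qwen3-0.6B \
    --port 13312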
# create a chat request with "stop_token_ids" parameter
curl -X POST "http://0.0.0.0:13312/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
    "model": "default",
    "messages": [
        {
            "role": "user",
            "content": "北京天安门在哪里?"
        }
    ],
    "temperature": 0.7,
    "stream": false,
    "seed": 1,
    "stop_token_ids":[104208]
}'

# the original output without `stop_token_ids` is: 
# {"id":"chatcmpl-33610f95-7d01-47a6-b040-39b18316f727","object":"chat.completion","created":1760692757,"model":"/root/paddlejob/workspace/env_run/output/models/paddle/Qwen/Qwen3-0.6B","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n好的,用户问的是“北京天安门在哪里?”。首先,我需要确认用户的需求是什么。可能他们想知道天安门的具体位置,或者想了解它的重要性。接下来,我得回忆一下北京天安门广场的地理位置。天安门广场位于北京市中心,周围环绕着著名的胡同,比如大栅栏、小街等。用户可能对城市规划和地标建筑感兴趣,也可能是想了解天安门的历史和功能。\n\n用户可能没有明确说明他们的需求,但作为回答者,我需要确保信息准确且易于理解。天安门广场是北京的标志性建筑之一,周围有丰富的历史和文化元素。此外,用户可能还想知道天安门与周围其他景点的关系,比如人民广场、故宫等,这有助于提供更全面的回答。\n\n需要注意的是,用户可能对“哪里”这个词语有歧义,可能需要进一步澄清。但根据问题本身,直接回答地理位置是合适的。同时,保持回答简洁明了,避免使用过于专业的术语,让用户容易理解。\n</think>\n\n北京天安门广场位于中国北京市中心,是中华人民共和国的象征性建筑之一。广场周围环绕着著名的胡同,如大栅栏、小街等,是北京的城市地标和历史文化中心。","multimodal_content":null,"reasoning_content":null,"tool_calls":null,"prompt_token_ids":null,"completion_token_ids":null,"prompt_tokens":null,"completion_tokens":null},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":14,"total_tokens":276,"completion_tokens":262,"prompt_tokens_details":{"cached_tokens":0}}}

# the output with `stop_token_ids` is:
# {"id":"chatcmpl-51f772d0-e0d8-48da-9b5f-a3690849ffca","object":"chat.completion","created":1760692873,"model":"/root/paddlejob/workspace/env_run/output/models/paddle/Qwen/Qwen3-0.6B","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n好的,用户问的是“北京天安门在哪里?”。首先,我需要确认用户的需求是什么。可能他们想知道天安门的具体位置,或者想了解它的重要性。接下来,我得回忆一下北京天安门","multimodal_content":null,"reasoning_content":null,"tool_calls":null,"prompt_token_ids":null,"completion_token_ids":null,"prompt_tokens":null,"completion_tokens":null},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":14,"total_tokens":64,"completion_tokens":50,"prompt_tokens_details":{"cached_tokens":0}}}```

offline demo

from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM

model_name_or_path = "/root/paddlejob/workspace/env_run/output/models/paddle/Qwen/Qwen3-0.6B"

# sampling hyperparameters
sampling_params = SamplingParams(temperature=1, seed=1, stop_token_ids=[104208])
llm = LLM(model=model_name_or_path, tensor_parallel_size=1)
output = llm.chat(messages=[{"role": "user", "content": "北京天安门在哪里?"}], use_tqdm=True, sampling_params=sampling_params)

print(output)

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests; if none are added, explain the reason in this PR.
  • Provide accuracy results.
  • If the current PR targets the release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Dec 5, 2025

Thanks for your contribution!

@codecov-commenter

codecov-commenter commented Dec 5, 2025

Codecov Report

❌ Patch coverage is 84.37500% with 5 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@d1bd40d). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/input/utils.py 72.22% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #5399   +/-   ##
==========================================
  Coverage           ?   59.52%           
==========================================
  Files              ?      327           
  Lines              ?    40641           
  Branches           ?     6169           
==========================================
  Hits               ?    24193           
  Misses             ?    14584           
  Partials           ?     1864           
Flag Coverage Δ
GPU 59.52% <84.37%> (?)


const int accept_tokens_len,
const int stop_seqs_bs,
const int stop_seqs_max_len,
const int64_t *min_tokens,
Collaborator

If the parameters of these two operators are changed, remember to update ernie5_serving in sync.

Comment on lines 213 to 229
stop_token_ids_final = []
if request.get("stop_token_ids") is not None:
    stop_token_ids = request.get("stop_token_ids")
    if isinstance(stop_token_ids, list) and len(stop_token_ids) > 0:
        if isinstance(stop_token_ids[0], int):
            stop_token_ids_final.extend([[t] for t in stop_token_ids])
        elif isinstance(stop_token_ids[0], list):
            stop_token_ids_final.extend(stop_token_ids)

stop_sequences = request.get("stop", [])
if stop_sequences:
    stop_seqs, stop_seqs_len = self.update_stop_seq(stop_sequences)
    request["stop_token_ids"] = stop_seqs
    stop_token_ids_final.extend(stop_seqs)

if stop_token_ids_final:
    stop_seqs_len = [len(seq) for seq in stop_token_ids_final]
    request["stop_token_ids"] = stop_token_ids_final
Collaborator

This same chunk appears in many files; see whether it can be abstracted into a shared helper (a sketch follows).
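A minimal sketch of such a helper; the function name normalize_stop_token_ids and its location are hypothetical, not part of this PR:

from typing import List, Optional, Union

def normalize_stop_token_ids(
    stop_token_ids: Optional[Union[List[int], List[List[int]]]],
) -> List[List[int]]:
    """Normalize stop_token_ids into a list of token-id sequences.

    A flat list of ints becomes one single-token sequence per id; a list
    of lists is used as-is. Hypothetical helper intended to replace the
    duplicated inline logic quoted above.
    """
    normalized: List[List[int]] = []
    if isinstance(stop_token_ids, list) and stop_token_ids:
        if isinstance(stop_token_ids[0], int):
            normalized.extend([[t] for t in stop_token_ids])
        elif isinstance(stop_token_ids[0], list):
            normalized.extend(stop_token_ids)
    return normalized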

const paddle::Tensor &step_idx,
const paddle::Tensor &stop_seqs,
const paddle::Tensor &stop_seqs_len,
const paddle::Tensor &min_tokens,
Collaborator

What is the purpose of introducing min_tokens?

Collaborator Author

It sets the minimum number of tokens to generate: if the number of tokens generated so far is less than min_tokens, generation will not stop even when stop_token_ids is set. A sketch of this gating logic follows.
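A minimal sketch of that gating, with hypothetical function and parameter names (the real check lives in the stop-checking operator):

def should_stop(generated_len: int, last_token: int,
                min_tokens: int, stop_token_ids: set) -> bool:
    # A stop token only ends generation once at least min_tokens
    # tokens have been produced.
    if generated_len < min_tokens:
        return False  # too early to stop, even on a stop token
    return last_token in stop_token_ids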

Comment on lines 355 to 356
print("model_output.min_tokens", model_output.min_tokens)
print("stop_token_ids", model_output.stop_token_ids)
Collaborator

delete
