
Conversation

@lizexu123
Collaborator

Motivation

Support the stop_token_ids parameter.

💡 If this PR is a Cherry Pick, the PR title must follow the required format: add the [Cherry-Pick] label at the very beginning and append the original PR ID at the end, e.g. [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

online serving

  • launch the serving (a hedged launch sketch follows this list)
  • request with the stop_token_ids parameter, which accepts a List[int]
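A minimal launch sketch, assuming FastDeploy's OpenAI-compatible entrypoint fastdeploy.entrypoints.openai.api_server; the model path and port are taken from the examples in this PR, and all other flags are left at their defaults:

# hedged launch sketch; adjust flags to your deployment
python -m fastdeploy.entrypoints.openai.api_server \
    --model /root/paddlejob/workspace/env_run/output/models/paddle/Qwen/Qwen3-0.6B \
    --port 13312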
# create a chat request with "stop_token_ids" parameter
curl -X POST "http://0.0.0.0:13312/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
    "model": "default",
    "messages": [
        {
            "role": "user",
            "content": "北京天安门在哪里?"
        }
    ],
    "temperature": 0.7,
    "stream": false,
    "seed": 1,
    "stop_token_ids":[104208]
}'

# the original output without `stop_token_ids` is: 
# {"id":"chatcmpl-33610f95-7d01-47a6-b040-39b18316f727","object":"chat.completion","created":1760692757,"model":"/root/paddlejob/workspace/env_run/output/models/paddle/Qwen/Qwen3-0.6B","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n好的,用户问的是“北京天安门在哪里?”。首先,我需要确认用户的需求是什么。可能他们想知道天安门的具体位置,或者想了解它的重要性。接下来,我得回忆一下北京天安门广场的地理位置。天安门广场位于北京市中心,周围环绕着著名的胡同,比如大栅栏、小街等。用户可能对城市规划和地标建筑感兴趣,也可能是想了解天安门的历史和功能。\n\n用户可能没有明确说明他们的需求,但作为回答者,我需要确保信息准确且易于理解。天安门广场是北京的标志性建筑之一,周围有丰富的历史和文化元素。此外,用户可能还想知道天安门与周围其他景点的关系,比如人民广场、故宫等,这有助于提供更全面的回答。\n\n需要注意的是,用户可能对“哪里”这个词语有歧义,可能需要进一步澄清。但根据问题本身,直接回答地理位置是合适的。同时,保持回答简洁明了,避免使用过于专业的术语,让用户容易理解。\n</think>\n\n北京天安门广场位于中国北京市中心,是中华人民共和国的象征性建筑之一。广场周围环绕着著名的胡同,如大栅栏、小街等,是北京的城市地标和历史文化中心。","multimodal_content":null,"reasoning_content":null,"tool_calls":null,"prompt_token_ids":null,"completion_token_ids":null,"prompt_tokens":null,"completion_tokens":null},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":14,"total_tokens":276,"completion_tokens":262,"prompt_tokens_details":{"cached_tokens":0}}}

# the output with `stop_token_ids` is:
# {"id":"chatcmpl-51f772d0-e0d8-48da-9b5f-a3690849ffca","object":"chat.completion","created":1760692873,"model":"/root/paddlejob/workspace/env_run/output/models/paddle/Qwen/Qwen3-0.6B","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n好的,用户问的是“北京天安门在哪里?”。首先,我需要确认用户的需求是什么。可能他们想知道天安门的具体位置,或者想了解它的重要性。接下来,我得回忆一下北京天安门","multimodal_content":null,"reasoning_content":null,"tool_calls":null,"prompt_token_ids":null,"completion_token_ids":null,"prompt_tokens":null,"completion_tokens":null},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":14,"total_tokens":64,"completion_tokens":50,"prompt_tokens_details":{"cached_tokens":0}}}```

offline demo

from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM

model_name_or_path = "/root/paddlejob/workspace/env_run/output/models/paddle/Qwen/Qwen3-0.6B"

# sampling hyperparameters
sampling_params = SamplingParams(temperature=1, seed=1, stop_token_ids=[104208])
llm = LLM(model=model_name_or_path, tensor_parallel_size=1)
output = llm.chat(messages=[{"role": "user", "content": "北京天安门在哪里?"}], use_tqdm=True, sampling_params=sampling_params)

print(output)

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests; if none are added, explain the reason in this PR.
  • Provide accuracy results.
  • If the current PR targets the release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Dec 5, 2025

Thanks for your contribution!

@codecov-commenter

codecov-commenter commented Dec 5, 2025

Codecov Report

❌ Patch coverage is 84.37500% with 5 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@d1bd40d). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/input/utils.py 72.22% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #5399   +/-   ##
==========================================
  Coverage           ?   59.52%           
==========================================
  Files              ?      327           
  Lines              ?    40641           
  Branches           ?     6169           
==========================================
  Hits               ?    24193           
  Misses             ?    14584           
  Partials           ?     1864           
Flag Coverage Δ
GPU 59.52% <84.37%> (?)


const int accept_tokens_len,
const int stop_seqs_bs,
const int stop_seqs_max_len,
const int64_t *min_tokens,
Collaborator

If the parameters of these two operators are changed, remember to update ernie5_serving in sync.

Comment on lines 213 to 229
stop_token_ids_final = []
if request.get("stop_token_ids") is not None:
    stop_token_ids = request.get("stop_token_ids")
    if isinstance(stop_token_ids, list) and len(stop_token_ids) > 0:
        if isinstance(stop_token_ids[0], int):
            stop_token_ids_final.extend([[t] for t in stop_token_ids])
        elif isinstance(stop_token_ids[0], list):
            stop_token_ids_final.extend(stop_token_ids)

stop_sequences = request.get("stop", [])
if stop_sequences:
    stop_seqs, stop_seqs_len = self.update_stop_seq(stop_sequences)
    request["stop_token_ids"] = stop_seqs
    stop_token_ids_final.extend(stop_seqs)

if stop_token_ids_final:
    stop_seqs_len = [len(seq) for seq in stop_token_ids_final]
    request["stop_token_ids"] = stop_token_ids_final
Collaborator

This same chunk appears in many files; see whether it can be abstracted into a shared helper (a sketch follows).
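A minimal sketch of such a helper; the function name normalize_stop_token_ids and its location are hypothetical, not part of this PR:

from typing import List, Optional, Union

def normalize_stop_token_ids(
    stop_token_ids: Optional[Union[List[int], List[List[int]]]],
) -> List[List[int]]:
    """Normalize stop_token_ids into a list of token-id sequences.

    A flat list of ints becomes one single-token sequence per id; a list
    of lists is used as-is. Hypothetical helper intended to replace the
    duplicated inline logic quoted above.
    """
    normalized: List[List[int]] = []
    if isinstance(stop_token_ids, list) and stop_token_ids:
        if isinstance(stop_token_ids[0], int):
            normalized.extend([[t] for t in stop_token_ids])
        elif isinstance(stop_token_ids[0], list):
            normalized.extend(stop_token_ids)
    return normalized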

const paddle::Tensor &step_idx,
const paddle::Tensor &stop_seqs,
const paddle::Tensor &stop_seqs_len,
const paddle::Tensor &min_tokens,
Collaborator

What is the purpose of introducing min_tokens?

Collaborator Author

It sets the minimum number of tokens to generate: if the number of tokens generated so far is less than min_tokens, generation will not stop even when stop_token_ids is set. A sketch of this gating logic follows.
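A minimal sketch of that gating, with hypothetical function and parameter names (the real check lives in the stop-checking operator):

def should_stop(generated_len: int, last_token: int,
                min_tokens: int, stop_token_ids: set) -> bool:
    # A stop token only ends generation once at least min_tokens
    # tokens have been produced.
    if generated_len < min_tokens:
        return False  # too early to stop, even on a stop token
    return last_token in stop_token_ids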

Comment on lines 355 to 356
print("model_output.min_tokens", model_output.min_tokens)
print("stop_token_ids", model_output.stop_token_ids)
Collaborator

delete
