Add gsm8k accuracy test for multi-node Qwen3-235B-A22B #4802
base: main
Conversation
Signed-off-by: leo-pony <[email protected]>
Code Review
This pull request adds a gsm8k accuracy test for the Qwen3-235B-A22B model in a multi-node setup. It also increases max-num-seqs from 16 to 32 for the server deployments. My review identifies a potential performance issue in the new benchmark configuration. The batch_size for the accuracy test is set to 512, while the vLLM server is configured to handle a maximum of 32 sequences. This mismatch could lead to inefficient test execution or timeouts. I've suggested aligning these values. On a side note, there seems to be a discrepancy between the configuration filename (Qwen3-235B-A3B.yaml) and the model being tested (Qwen3-235B-A22B). While not part of this PR's changes, it would be good to correct this in the future for clarity.
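For context, the engine knob the review refers to is the `--max-num-seqs` flag of `vllm serve`; the same setting is reachable through vLLM's Python API. A minimal sketch follows — the model id is an assumption, and the value simply mirrors the PR's new server setting:

```python
# Sketch of the setting the review discusses. On a deployed server this is
# the `--max-num-seqs` CLI flag of `vllm serve`; the PR raises it from 16 to 32.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-235B-A22B",  # assumed HF model id
    max_num_seqs=32,  # upper bound on sequences the scheduler batches concurrently
)
```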
request_conf: vllm_api_general_chat
dataset_conf: gsm8k/gsm8k_gen_0_shot_cot_chat_prompt
max_out_len: 7680
batch_size: 512
The benchmark batch_size is set to 512, which is significantly larger than the server's configured max-num-seqs of 32 (as seen on lines 26 and 43).
This large discrepancy can lead to:
- Request queuing: The server will queue the excess requests, as it can only process 32 at a time.
- Increased test time: The PR notes a 30-minute test time, which might be partly due to this queuing.
- Potential timeouts: The client might time out waiting for responses.
To ensure efficient testing and avoid potential stability issues, it's recommended to align the batch_size with the server's capacity.
Suggested change:
batch_size: 32
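An alternative to lowering batch_size is to cap concurrency on the client side so the benchmark never submits more in-flight requests than the server can schedule. A hedged sketch against vLLM's OpenAI-compatible endpoint — the URL and model name are assumptions, and the token limit mirrors max_out_len from the config above:

```python
# Sketch: bound in-flight requests to the server's max-num-seqs so excess
# requests are not queued server-side. Endpoint and model id are assumptions.
import asyncio

from openai import AsyncOpenAI

MAX_IN_FLIGHT = 32  # align with the server's --max-num-seqs

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
sem = asyncio.Semaphore(MAX_IN_FLIGHT)

async def ask(question: str) -> str:
    async with sem:  # at most MAX_IN_FLIGHT concurrent requests
        resp = await client.chat.completions.create(
            model="Qwen/Qwen3-235B-A22B",
            messages=[{"role": "user", "content": question}],
            max_tokens=7680,  # mirrors max_out_len in the config
        )
        return resp.choices[0].message.content

async def run(questions: list[str]) -> list[str]:
    return await asyncio.gather(*(ask(q) for q in questions))
```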
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
acc:
  case_type: accuracy
  dataset_path: vllm-ascend/gsm8k
Suggested change:
- dataset_path: vllm-ascend/gsm8k
+ dataset_path: vllm-ascend/gsm8k-lite
gsm8k-lite is enough for this check.
What this PR does / why we need it?
There is currently no accuracy test for the Qwen3-235B-A22B model; this PR adds one.
Test result:
| dataset | version | metric | mode | vllm-api-general-chat |
| --- | --- | --- | --- | --- |
| gsm8k | 7cd45e | accuracy | gen | 96.29 |
Test case running time: about 30 minutes.
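For reference, gsm8k accuracy is typically scored by comparing the final number in the model's output against the reference answer after the "####" marker. A minimal sketch of that scoring logic — the regex and normalization are assumptions, not the harness actually used here:

```python
import re

def extract_gold(reference: str) -> str:
    # GSM8K reference answers end with "#### <number>".
    return reference.split("####")[-1].strip().replace(",", "")

def extract_pred(output: str) -> str | None:
    # Take the last number appearing in the model's generated solution.
    nums = re.findall(r"-?\d+(?:\.\d+)?", output.replace(",", ""))
    return nums[-1] if nums else None

def accuracy(preds: list[str], refs: list[str]) -> float:
    # Exact string match; a real harness would normalize numeric formats.
    hits = sum(extract_pred(p) == extract_gold(r) for p, r in zip(preds, refs))
    return 100.0 * hits / len(refs)
```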
Does this PR introduce any user-facing change?
How was this patch tested?