Add gsm8k accuracy test for multi-node Qwen3-235B-A22B #4802
base: main
Conversation
Signed-off-by: leo-pony <[email protected]>
Code Review
This pull request adds a gsm8k accuracy test for the Qwen3-235B-A22B model in a multi-node setup. It also increases max-num-seqs from 16 to 32 for the server deployments. My review identifies a potential performance issue in the new benchmark configuration. The batch_size for the accuracy test is set to 512, while the vLLM server is configured to handle a maximum of 32 sequences. This mismatch could lead to inefficient test execution or timeouts. I've suggested aligning these values. On a side note, there seems to be a discrepancy between the configuration filename (Qwen3-235B-A3B.yaml) and the model being tested (Qwen3-235B-A22B). While not part of this PR's changes, it would be good to correct this in the future for clarity.
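For context, the engine knob the review refers to is the `--max-num-seqs` flag of `vllm serve`; the same setting is reachable through vLLM's Python API. A minimal sketch follows — the model id is an assumption, and the value simply mirrors the PR's new server setting:

```python
# Sketch of the setting the review discusses. On a deployed server this is
# the `--max-num-seqs` CLI flag of `vllm serve`; the PR raises it from 16 to 32.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-235B-A22B",  # assumed HF model id
    max_num_seqs=32,  # upper bound on sequences the scheduler batches concurrently
)
```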
request_conf: vllm_api_general_chat
dataset_conf: gsm8k/gsm8k_gen_0_shot_cot_chat_prompt
max_out_len: 7680
batch_size: 512
The benchmark batch_size is set to 512, which is significantly larger than the server's configured max-num-seqs of 32 (as seen on lines 26 and 43).
This large discrepancy can lead to:
- Request queuing: The server will queue the excess requests, as it can only process 32 at a time.
- Increased test time: The PR notes a 30-minute test time, which might be partly due to this queuing.
- Potential timeouts: The client might time out waiting for responses.
To ensure efficient testing and avoid potential stability issues, it's recommended to align the batch_size with the server's capacity.
Suggested change:
batch_size: 32
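An alternative to lowering batch_size is to cap concurrency on the client side so the benchmark never submits more in-flight requests than the server can schedule. A hedged sketch against vLLM's OpenAI-compatible endpoint — the URL and model name are assumptions, and the token limit mirrors max_out_len from the config above:

```python
# Sketch: bound in-flight requests to the server's max-num-seqs so excess
# requests are not queued server-side. Endpoint and model id are assumptions.
import asyncio

from openai import AsyncOpenAI

MAX_IN_FLIGHT = 32  # align with the server's --max-num-seqs

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
sem = asyncio.Semaphore(MAX_IN_FLIGHT)

async def ask(question: str) -> str:
    async with sem:  # at most MAX_IN_FLIGHT concurrent requests
        resp = await client.chat.completions.create(
            model="Qwen/Qwen3-235B-A22B",
            messages=[{"role": "user", "content": question}],
            max_tokens=7680,  # mirrors max_out_len in the config
        )
        return resp.choices[0].message.content

async def run(questions: list[str]) -> list[str]:
    return await asyncio.gather(*(ask(q) for q in questions))
```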
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
acc:
  case_type: accuracy
  dataset_path: vllm-ascend/gsm8k
Suggested change:
- dataset_path: vllm-ascend/gsm8k
+ dataset_path: vllm-ascend/gsm8k-lite
gsm8k-lite is enough for this check.
What this PR does / why we need it?
There is currently no accuracy test for the Qwen3-235B-A22B model; this PR adds one.
Test result:
| dataset | version | metric | mode | vllm-api-general-chat |
| --- | --- | --- | --- | --- |
| gsm8k | 7cd45e | accuracy | gen | 96.29 |
Test case running time: about 30 minutes.
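For reference, gsm8k accuracy is typically scored by comparing the final number in the model's output against the reference answer after the "####" marker. A minimal sketch of that scoring logic — the regex and normalization are assumptions, not the harness actually used here:

```python
import re

def extract_gold(reference: str) -> str:
    # GSM8K reference answers end with "#### <number>".
    return reference.split("####")[-1].strip().replace(",", "")

def extract_pred(output: str) -> str | None:
    # Take the last number appearing in the model's generated solution.
    nums = re.findall(r"-?\d+(?:\.\d+)?", output.replace(",", ""))
    return nums[-1] if nums else None

def accuracy(preds: list[str], refs: list[str]) -> float:
    # Exact string match; a real harness would normalize numeric formats.
    hits = sum(extract_pred(p) == extract_gold(r) for p, r in zip(preds, refs))
    return 100.0 * hits / len(refs)
```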
Does this PR introduce any user-facing change?
How was this patch tested?