[Refactor] 2/N Unify all mask generation methods and cache mask #4779
Conversation
Code Review
This pull request refactors the attention mask generation by unifying all methods into the AttentionMaskBuilder class and implementing lazy initialization to improve memory efficiency. The changes are well-aligned with the stated goals. However, I've identified a few critical issues, including a broken unit test and a syntax error that would prevent the code from running. Additionally, there are a couple of high-severity bugs in the new mask caching logic where data type changes are not handled correctly. I've provided suggestions to address these issues.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
patach_config is useless now. Let's remove it - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: wangxiyuan <[email protected]> Co-authored-by: Mengqing Cao <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
…ntion ops by size of BS. Signed-off-by: weijinqian_v1 <[email protected]>
…ntion ops by size of BS. Signed-off-by: weijinqian_v1 <[email protected]>
…ntion ops by size of BS. Signed-off-by: weijinqian_v1 <[email protected]>
…ntion ops by size of BS. Signed-off-by: weijinqian_v1 <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? Add bmm_transpose operator support for DeepSeekV3.2. - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: ZYang6263 <[email protected]> Signed-off-by: ZYang6263 <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? Add log Info for MOE_load Imbalance Ratio ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0 --------- Signed-off-by: daishixun <[email protected]> Co-authored-by: weijinqian0 <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? Due to the differences in operators used and execution order between xlite and eager modes, there will be slight precision discrepancies. This patch skips the xlite e2e tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM version: v0.12.0 vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: lulina <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? After enabling Mlapo and DCP, since Mlapo has its own mla_preprocess logic and does not perform additional all_gather operations on the DCP group, this will lead to a dimension mismatch during the subsequent forward process. ### Does this PR introduce _any_ user-facing change? N/A - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: zengran <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it?
1. Add the implementation of normal Aclnn operators: MoeCombineNormal, MoeDispatchNormal, NotifyDispatch, and DispatchLayout.
   - MoeCombineNormal: Implements the combine logic within MoE operations.
   - MoeDispatchNormal: Implements the dispatch logic within MoE operations.
   - NotifyDispatch: Exchanges topk_idx information among different ranks to calculate the device memory required for the dispatch stage.
   - DispatchLayout: Used to calculate information related to the device memory layout for the dispatch stage.
2. Provide PyTorch interfaces for the normal operators (get_dispatch_layout, dispatch_prefill, and combine_prefill) to be used for MoE communication during the prefill stage in vLLM.
   - get_dispatch_layout: Calculates information related to the device memory layout for the dispatch operator, and is called before dispatch_prefill.
   - dispatch_prefill: Initiates the dispatch operation.
   - combine_prefill: Initiates the combine operation.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
The functionality has already been validated using the local Qwen model. Test cases will be added after support for multi-NPU use cases in the CI pipeline is finalized.
- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e
Signed-off-by: shiro-zzzz <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
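As a rough, hedged sketch of the call order described in that commit message, the flow below is only illustrative; the argument lists are placeholders, not the real operator signatures.

```python
# Hypothetical illustration of the prefill-stage MoE communication flow; the
# function arguments are placeholders, not the actual operator signatures.
def moe_prefill_communication(hidden_states, topk_idx, ops):
    # 1. Compute the device-memory layout needed by the dispatch stage
    #    (called before dispatch_prefill).
    layout = ops.get_dispatch_layout(topk_idx)
    # 2. Dispatch tokens to the ranks that own the selected experts.
    dispatched = ops.dispatch_prefill(hidden_states, topk_idx, layout)
    # 3. Run the expert MLPs on the dispatched tokens (omitted here).
    expert_out = dispatched
    # 4. Combine the expert outputs back into the original token order.
    return ops.combine_prefill(expert_out, layout)
```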
Bumps [actions/checkout](https://github.com/actions/checkout) from 6.0.0 to 6.0.1. - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: wangxiyuan <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? The Qwen2.5-VL mrope precision problem will be solved once this PR is merged. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested on G8600 with the textVQA dataset. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: 李少鹏 <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? Add Qwen3-235B tutorial including the following examples - Single-node Online Deployment for 128k context inference - Multi-node Deployment with MP - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e --------- Signed-off-by: xuyexiong <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it?
Fix dp padding logic in dummy_run. After vllm-project/vllm#28579, `num_tokens` will be padded in `CudagraphDispatcher`, thus we also need to do the pad in the dummy_run.
### How was this patch tested?
Test locally with the following scripts
```bash
VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server \
    --model wemaster/deepseek_mtp_main_random_bf16 \
    --trust-remote-code \
    --data-parallel-size 4 \
    --tensor-parallel-size 1 \
    --compilation-config '{"cudagraph_capture_sizes":[96],"cudagraph_mode":"FULL_DECODE_ONLY"}' \
    --enable-expert-parallel
```
```bash
vllm bench serve --model wemaster/deepseek_mtp_main_random_bf16 --endpoint /v1/completions --dataset-name random --random-input 512 --random-output 100 --num-prompts 48 --request-rate 1 --ready-check-timeout-sec 0
```
- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e
Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? In reinforcement learning scenarios, the current inference applies a transpose operation to the weights. For a cleaner architecture, the weight transpose module was moved to wakeup. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: lhp-deep <[email protected]> Co-authored-by: weijinqian0 <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
…ct#4774) ### What this PR does / why we need it? Fix incorrect MLAPO weight release in PD mixed scenarios. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: ZYang6263 <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
Reverts vllm-project#4194 as it broke CI in https://github.com/vllm-project/vllm-ascend/actions/runs/20030369087/job/57437687382?pr=4791 Co-authored-by: wangxiyuan <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? In vllm-omni, we create the empty `VllmConfig`, which raised the null error in [`vllm-ascend/vllm_ascend/utils.py`](https://github.com/vllm-project/vllm-ascend/blob/a7f91079b8576a846f671c9e6923805e74e35c87/vllm_ascend/utils.py#L833). More details are [here](vllm-project/vllm-omni#208). ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: gcanlin <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? Currently, suffix decoding has a known correctness issue; see https://github.com/vllm-project/vllm-ascend/actions/runs/20033509824/job/57457565620?pr=4781 Signed-off-by: wangli <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
Force-pushed from 0737845 to d1fce55
RFC: #4629
Reason:
There are various types of masks here, and some of them do not have a caching mechanism. As a result, the masks have to be re-initialized for each layer, which wastes device memory.
At the same time, we hope to standardize the management and usage of masks.
So we have gathered all the masks into the AttentionMaskBuilder class.
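For illustration, a minimal sketch (with hypothetical method names, not the exact API added by this PR) of what a unified, lazily initialized mask cache can look like: the mask tensor is allocated once on first use and then sliced by every caller, instead of being rebuilt per layer.

```python
from typing import Optional

import torch


class AttentionMaskBuilder:
    """Sketch only: one lazily built causal mask shared by all layers."""

    def __init__(self) -> None:
        self._causal_mask: Optional[torch.Tensor] = None  # built on first use

    def get_causal_mask(self, seq_len: int, dtype: torch.dtype,
                        device: torch.device) -> torch.Tensor:
        cache = self._causal_mask
        # Rebuild only when there is no cached mask yet, the cached mask is
        # too small, or the requested dtype/device differs from the cache.
        if (cache is None or cache.size(0) < seq_len
                or cache.dtype != dtype or cache.device != device):
            full = torch.full((seq_len, seq_len), float("-inf"),
                              dtype=dtype, device=device)
            self._causal_mask = torch.triu(full, diagonal=1)
        return self._causal_mask[:seq_len, :seq_len]
```

Every attention layer would then query the same builder instance, so a single mask tensor lives in device memory regardless of the number of layers.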
Todo: