[Refactor] 2/N Unify all mask generation methods and cache mask #4779
Conversation
Code Review
This pull request refactors the attention mask generation by unifying all methods into the AttentionMaskBuilder class and implementing lazy initialization to improve memory efficiency. The changes are well-aligned with the stated goals. However, I've identified a few critical issues, including a broken unit test and a syntax error that would prevent the code from running. Additionally, there are a couple of high-severity bugs in the new mask caching logic where data type changes are not handled correctly. I've provided suggestions to address these issues.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
patach_config is useless now. Let's remove it - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: wangxiyuan <[email protected]> Co-authored-by: Mengqing Cao <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
…ntion ops by size of BS. Signed-off-by: weijinqian_v1 <[email protected]>
…ntion ops by size of BS. Signed-off-by: weijinqian_v1 <[email protected]>
…ntion ops by size of BS. Signed-off-by: weijinqian_v1 <[email protected]>
…ntion ops by size of BS. Signed-off-by: weijinqian_v1 <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? Add bmm_transpose operator support for DeepSeekV3.2. - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: ZYang6263 <[email protected]> Signed-off-by: ZYang6263 <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? Add log Info for MOE_load Imbalance Ratio ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0 --------- Signed-off-by: daishixun <[email protected]> Co-authored-by: weijinqian0 <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? Due to the differences in operators used and execution order between xlite and eager modes, there will be slight precision discrepancies. This patch skips the xlite e2e tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? vLLM version: v0.12.0 vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: lulina <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? After enabling Mlapo and DCP, since Mlapo has its own mla_preprocess logic and does not perform additional all_gather operations on the DCP group, this will lead to a dimension mismatch during the subsequent forward process. ### Does this PR introduce _any_ user-facing change? N/A - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: zengran <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it?
1. Add the implementation of normal Aclnn operators: MoeCombineNormal, MoeDispatchNormal, NotifyDispatch, and DispatchLayout.
   - MoeCombineNormal: Implements the combine logic within MoE operations.
   - MoeDispatchNormal: Implements the dispatch logic within MoE operations.
   - NotifyDispatch: Exchanges topk_idx information among different ranks to calculate the device memory required for the dispatch stage.
   - DispatchLayout: Used to calculate information related to the device memory layout for the dispatch stage.
2. Provide PyTorch interfaces for the normal operators (get_dispatch_layout, dispatch_prefill, and combine_prefill) to be used for MoE communication during the prefill stage in vLLM.
   - get_dispatch_layout: Calculates information related to the device memory layout for the dispatch operator, and is called before dispatch_prefill.
   - dispatch_prefill: Initiates the dispatch operation.
   - combine_prefill: Initiates the combine operation.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
The functionality has already been validated using the local Qwen model. Test cases will be added after support for multi-NPU use cases in the CI pipeline is finalized.
- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e
Signed-off-by: shiro-zzzz <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
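As a rough, hedged sketch of the call order described in that commit message, the flow below is only illustrative; the argument lists are placeholders, not the real operator signatures.

```python
# Hypothetical illustration of the prefill-stage MoE communication flow; the
# function arguments are placeholders, not the actual operator signatures.
def moe_prefill_communication(hidden_states, topk_idx, ops):
    # 1. Compute the device-memory layout needed by the dispatch stage
    #    (called before dispatch_prefill).
    layout = ops.get_dispatch_layout(topk_idx)
    # 2. Dispatch tokens to the ranks that own the selected experts.
    dispatched = ops.dispatch_prefill(hidden_states, topk_idx, layout)
    # 3. Run the expert MLPs on the dispatched tokens (omitted here).
    expert_out = dispatched
    # 4. Combine the expert outputs back into the original token order.
    return ops.combine_prefill(expert_out, layout)
```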
Bumps [actions/checkout](https://github.com/actions/checkout) from 6.0.0 to 6.0.1. - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: wangxiyuan <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? The Qwen2.5-VL mrope precision problem will be solved once this PR is merged. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested on G8600 with the textVQA dataset. - vLLM version: v0.11.2 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2 --------- Signed-off-by: 李少鹏 <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? Add Qwen3-235B tutorial including the following examples - Single-node Online Deployment for 128k context inference - Multi-node Deployment with MP - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e --------- Signed-off-by: xuyexiong <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it?
Fix dp padding logic in dummy_run. After vllm-project/vllm#28579, `num_tokens` will be padded in `CudagraphDispatcher`, thus we also need to do the pad in the dummy_run.
### How was this patch tested?
Test locally with the following scripts
```bash
VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server \
    --model wemaster/deepseek_mtp_main_random_bf16 \
    --trust-remote-code \
    --data-parallel-size 4 \
    --tensor-parallel-size 1 \
    --compilation-config '{"cudagraph_capture_sizes":[96],"cudagraph_mode":"FULL_DECODE_ONLY"}' \
    --enable-expert-parallel
```
```bash
vllm bench serve --model wemaster/deepseek_mtp_main_random_bf16 --endpoint /v1/completions --dataset-name random --random-input 512 --random-output 100 --num-prompts 48 --request-rate 1 --ready-check-timeout-sec 0
```
- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e
Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? In reinforcement learning scenarios, the current inference applies a transpose operation to the weights. For a cleaner architecture, the weight transpose module was moved to wakeup. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: lhp-deep <[email protected]> Co-authored-by: weijinqian0 <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
…ct#4774) ### What this PR does / why we need it? Fix incorrect MLAPO weight release in PD mixed scenarios. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: ZYang6263 <[email protected]> Co-authored-by: wangxiyuan <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
Reverts vllm-project#4194 as it broke CI in https://github.com/vllm-project/vllm-ascend/actions/runs/20030369087/job/57437687382?pr=4791 Co-authored-by: wangxiyuan <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? In vllm-omni, we create the empty `VllmConfig`, which raised the null error in [`vllm-ascend/vllm_ascend/utils.py`](https://github.com/vllm-project/vllm-ascend/blob/a7f91079b8576a846f671c9e6923805e74e35c87/vllm_ascend/utils.py#L833). More details are [here](vllm-project/vllm-omni#208). ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: gcanlin <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it? Currently, suffix decoding has a known correctness issue; see https://github.com/vllm-project/vllm-ascend/actions/runs/20033509824/job/57457565620?pr=4781 Signed-off-by: wangli <[email protected]> Signed-off-by: weijinqian_v1 <[email protected]>
Force-pushed from 0737845 to d1fce55
RFC: #4629
Reason:
There are various types of masks here, and some of them do not have a caching mechanism. As a result, the masks have to be re-initialized for each layer, which wastes device memory.
At the same time, we hope to standardize the management and usage of masks.
So we have gathered all the masks into the AttentionMaskBuilder class.
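For illustration, a minimal sketch (with hypothetical method names, not the exact API added by this PR) of what a unified, lazily initialized mask cache can look like: the mask tensor is allocated once on first use and then sliced by every caller, instead of being rebuilt per layer.

```python
from typing import Optional

import torch


class AttentionMaskBuilder:
    """Sketch only: one lazily built causal mask shared by all layers."""

    def __init__(self) -> None:
        self._causal_mask: Optional[torch.Tensor] = None  # built on first use

    def get_causal_mask(self, seq_len: int, dtype: torch.dtype,
                        device: torch.device) -> torch.Tensor:
        cache = self._causal_mask
        # Rebuild only when there is no cached mask yet, the cached mask is
        # too small, or the requested dtype/device differs from the cache.
        if (cache is None or cache.size(0) < seq_len
                or cache.dtype != dtype or cache.device != device):
            full = torch.full((seq_len, seq_len), float("-inf"),
                              dtype=dtype, device=device)
            self._causal_mask = torch.triu(full, diagonal=1)
        return self._causal_mask[:seq_len, :seq_len]
```

Every attention layer would then query the same builder instance, so a single mask tensor lives in device memory regardless of the number of layers.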
Todo: