
Conversation


@weijinqian0 weijinqian0 commented Dec 8, 2025

RFC: #4629

Reason:

There are several types of attention masks, and some of them have no caching mechanism. As a result, those masks are re-initialized for every layer, wasting device memory.

We also want to standardize how masks are managed and used.

We have therefore consolidated all mask generation into the AttentionMaskBuilder class.

Todo:

  1. remove spec_attn_mask; @LICO1314
  2. remove pcp_prefill_mask; @LICO1314
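
For illustration, the consolidation and lazy caching can be pictured roughly as follows (a minimal sketch, not the actual vllm-ascend code; everything except the AttentionMaskBuilder name is an assumption). Keying the cache on dtype is one way to sidestep the dtype-change issue flagged in the review below.

```python
import torch


class AttentionMaskBuilder:
    """Minimal sketch of a centralized, lazily cached mask builder."""

    def __init__(self, device: torch.device):
        self.device = device
        # Cache keyed by (mask kind, dtype); a mask is built on first use
        # and then shared across layers instead of being re-created per layer.
        self._cache: dict[tuple[str, torch.dtype], torch.Tensor] = {}

    def get_causal_mask(self, seq_len: int, dtype: torch.dtype) -> torch.Tensor:
        key = ("causal", dtype)
        cached = self._cache.get(key)
        # Rebuild only when there is no cached mask of this dtype or the
        # cached one is too small for the requested sequence length.
        if cached is None or cached.shape[0] < seq_len:
            cached = torch.triu(
                torch.full((seq_len, seq_len), float("-inf"),
                           dtype=dtype, device=self.device),
                diagonal=1,
            )
            self._cache[key] = cached
        return cached[:seq_len, :seq_len]
```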


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the attention mask generation by unifying all methods into the AttentionMaskBuilder class and implementing lazy initialization to improve memory efficiency. The changes are well-aligned with the stated goals. However, I've identified a few critical issues, including a broken unit test and a syntax error that would prevent the code from running. Additionally, there are a couple of high-severity bugs in the new mask caching logic where data type changes are not handled correctly. I've provided suggestions to address these issues.


github-actions bot commented Dec 8, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Fill in the PR description and write a clear commit message to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@weijinqian0 weijinqian0 added the ready (read for review) and ready-for-test (start test by label for PR) labels Dec 9, 2025
wangxiyuan and others added 22 commits December 9, 2025 12:11
patch_config is useless now. Let's remove it.

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: wangxiyuan <[email protected]>
Co-authored-by: Mengqing Cao <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
…ntion ops by size of BS.

Signed-off-by: weijinqian_v1 <[email protected]>
…ntion ops by size of BS.

Signed-off-by: weijinqian_v1 <[email protected]>
…ntion ops by size of BS.

Signed-off-by: weijinqian_v1 <[email protected]>
…ntion ops by size of BS.

Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it?
DeepSeekV3.2 supports the bmm_transpose operator.

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: ZYang6263 <[email protected]>
Signed-off-by: ZYang6263 <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it?
Add log Info for MOE_load Imbalance Ratio
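
As a rough illustration of such a metric (a hedged sketch, not the actual vllm-ascend implementation; the function name and tensor layout are assumptions), the imbalance ratio can be taken as the most-loaded expert's token count over the mean load:

```python
import torch


def log_moe_load_imbalance(tokens_per_expert: torch.Tensor) -> float:
    """Ratio of the most-loaded expert's token count to the mean load."""
    mean_load = tokens_per_expert.float().mean().clamp_min(1e-6)
    imbalance_ratio = (tokens_per_expert.float().max() / mean_load).item()
    print(f"MoE load imbalance ratio: {imbalance_ratio:.2f}")
    return imbalance_ratio
```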

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0

---------

Signed-off-by: daishixun <[email protected]>
Co-authored-by: weijinqian0 <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it?
Due to differences in the operators used and the execution order between
xlite and eager modes, there are slight precision discrepancies.
This patch skips the xlite e2e tests.
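
Skipping such a test typically looks like the following (illustrative pytest sketch; the test name is an assumption, not the actual test touched by this patch):

```python
import pytest


@pytest.mark.skip(
    reason="xlite and eager modes use different operators and execution "
           "order, so small precision discrepancies are expected")
def test_xlite_e2e_accuracy():
    ...
```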

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
vLLM version: v0.12.0
vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: lulina <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
)

### What this PR does / why we need it?
After enabling Mlapo and DCP, Mlapo uses its own mla_preprocess
logic and does not perform the additional all_gather operations on the DCP
group, which leads to a dimension mismatch during the subsequent
forward pass.
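
Conceptually, the missing step is an all_gather over the DCP group, roughly as in the sketch below (illustrative only; the function name, gather dimension, and group handle are assumptions, not the Mlapo code):

```python
import torch
import torch.distributed as dist


def gather_over_dcp(local_tensor: torch.Tensor, dcp_group) -> torch.Tensor:
    """Illustrative only: all_gather per-rank slices over the DCP group so
    that downstream shapes match what the attention path expects."""
    world_size = dist.get_world_size(group=dcp_group)
    chunks = [torch.empty_like(local_tensor) for _ in range(world_size)]
    dist.all_gather(chunks, local_tensor, group=dcp_group)
    return torch.cat(chunks, dim=0)
```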

### Does this PR introduce _any_ user-facing change?

N/A

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: zengran <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it?
1. Add the implementation of the normal Aclnn operators: MoeCombineNormal,
MoeDispatchNormal, NotifyDispatch, and DispatchLayout.

- MoeCombineNormal: Implements the combine logic within MoE operations.
- MoeDispatchNormal: Implements the dispatch logic within MoE
operations.
- NotifyDispatch: Exchanges topk_idx information among different ranks
to calculate the device memory required for the dispatch stage.
- DispatchLayout: Used to calculate information related to the device
memory layout for the dispatch stage.

2. Provide PyTorch interfaces for the normal operators (get_dispatch_layout,
dispatch_prefill, and combine_prefill) to be used for MoE communication
during the prefill stage in vLLM; see the usage sketch after the list below.

- get_dispatch_layout: Calculates information related to the device
memory layout for the dispatch operator, and is called before
dispatch_prefill.
- dispatch_prefill: Initiates the dispatch operation.
- combine_prefill: Initiates the combine operation.
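
A rough picture of how the three interfaces fit together during prefill (a sketch only; the module path and all signatures are assumptions, not the actual API introduced by this PR):

```python
# Illustrative flow only: the module path and all signatures below are
# hypothetical, not the real interfaces added by this PR.
from vllm_ascend.ops import moe_comm  # assumed module name


def moe_prefill_forward(hidden_states, topk_idx, topk_weights, run_local_experts):
    # 1. Compute how tokens are laid out across ranks/experts before dispatch.
    layout = moe_comm.get_dispatch_layout(topk_idx)

    # 2. Dispatch tokens to the ranks owning their selected experts.
    dispatched, handle = moe_comm.dispatch_prefill(
        hidden_states, topk_idx, topk_weights, layout)

    # 3. Run the local experts on the tokens received by this rank.
    expert_out = run_local_experts(dispatched)

    # 4. Combine expert outputs back into the original token order.
    return moe_comm.combine_prefill(expert_out, handle)
```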

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
The functionality has already been validated using the local Qwen model.
Test cases will be added after support for multi-NPU use cases in the CI
pipeline is finalized.

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: shiro-zzzz <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
Bumps [actions/checkout](https://github.com/actions/checkout) from 6.0.0 to 6.0.1.

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: wangxiyuan <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it?
The Qwen2.5-VL mrope precision problem will be solved once this PR is
merged.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Test on G8600 with textVQA dataset

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: 李少鹏 <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it?
Add a Qwen3-235B tutorial including the following examples:
- Single-node Online Deployment for 128k context inference
- Multi-node Deployment with MP

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

---------

Signed-off-by: xuyexiong <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it?
Fix dp padding logic in dummy_run. After
vllm-project/vllm#28579, `num_tokens` will be
padded in `CudagraphDispatcher`, so we also need to apply the same
padding in dummy_run.
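
The padding itself amounts to something like the following (a minimal sketch, assuming a helper name and rounding rule that are not the actual vLLM/vllm-ascend code):

```python
def pad_num_tokens(num_tokens: int, cudagraph_capture_sizes: list[int]) -> int:
    """Pad num_tokens up to the smallest captured graph size that fits, so
    dummy_run sees the same padded size as the dispatcher-padded forward."""
    for size in sorted(cudagraph_capture_sizes):
        if num_tokens <= size:
            return size
    return num_tokens  # larger than any captured size: leave unpadded


# With the capture size [96] from the test script below:
assert pad_num_tokens(48, [96]) == 96
```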

### How was this patch tested?
Test locally with the following scripts
```bash
VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server \
         --model wemaster/deepseek_mtp_main_random_bf16 \
         --trust-remote-code \
         --data-parallel-size 4 \
         --tensor-parallel-size 1 \
         --compilation-config '{"cudagraph_capture_sizes":[96],"cudagraph_mode":"FULL_DECODE_ONLY"}' \
         --enable-expert-parallel
```
```bash
vllm bench serve --model wemaster/deepseek_mtp_main_random_bf16 --endpoint /v1/completions --dataset-name random --random-input 512 --random-output 100 --num-prompts 48 --request-rate 1 --ready-check-timeout-sec 0
```

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
)

### What this PR does / why we need it?
In reinforcement learning scenarios, the current inference applies a
transpose operation to the weights. For a cleaner architecture, the
weight transpose module was moved to wakeup.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: lhp-deep <[email protected]>
Co-authored-by: weijinqian0 <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
…ct#4774)

### What this PR does / why we need it?
Fix incorrect MLAPO weight release in PD mixed scenarios.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: ZYang6263 <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it?

In vllm-omni, we create an empty `VllmConfig`, which raises a null
error in
[`vllm-ascend/vllm_ascend/utils.py`](https://github.com/vllm-project/vllm-ascend/blob/a7f91079b8576a846f671c9e6923805e74e35c87/vllm_ascend/utils.py#L833).
More details are
[here](vllm-project/vllm-omni#208).
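
The fix is presumably a None guard along these lines (a sketch under assumptions; the function and attribute names are illustrative, not the actual code at utils.py#L833):

```python
def check_model_config(vllm_config) -> bool:  # hypothetical function name
    # Guard against a bare/empty VllmConfig (as created by vllm-omni):
    # return early instead of dereferencing attributes that may be None.
    model_config = getattr(vllm_config, "model_config", None)
    if model_config is None or getattr(model_config, "hf_config", None) is None:
        return False
    # ... original logic that inspects model_config.hf_config ...
    return True
```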

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

Signed-off-by: gcanlin <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
### What this PR does / why we need it?
Currently, suffix decoding has a known correctness issue; see
https://github.com/vllm-project/vllm-ascend/actions/runs/20033509824/job/57457565620?pr=4781

Signed-off-by: wangli <[email protected]>
Signed-off-by: weijinqian_v1 <[email protected]>
@weijinqian0 weijinqian0 force-pushed the refactor_attention_mask branch from 0737845 to d1fce55 on December 9, 2025 04:11
@weijinqian0 weijinqian0 merged commit c331503 into vllm-project:main Dec 9, 2025
13 of 18 checks passed
@weijinqian0 weijinqian0 deleted the refactor_attention_mask branch December 9, 2025 11:15

Labels

module:tests, ready (read for review), ready-for-test (start test by label for PR)
