Support MXFP4 for GPT-OSS #5435
Conversation
Thanks for your contribution!
Pull request overview
This PR adds MXFP4 offline quantization support for the GPT-OSS-120B and GPT-OSS-20B models, using FlashInfer as the backend for the MoE components.
The main changes include:
- A complete implementation of the MXFP4 quantization method
- Module-conversion control logic to support selective quantization
- A fix for a potential issue in how the number of safetensors files is computed
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| fastdeploy/utils.py | Corrects the logic for computing the number of safetensors files |
| fastdeploy/model_executor/models/gpt_oss.py | Adds MXFP4 quantization config support and improves weight-loading logic |
| fastdeploy/model_executor/layers/utils.py | Adds a modules_to_convert utility function for deciding whether a module needs quantization conversion |
| fastdeploy/model_executor/layers/quantization/mxfp4.py | Implements the core MXFP4 quantization logic, including weight creation, loading, and the forward pass |
| fastdeploy/model_executor/layers/quantization/__init__.py | Registers the MXFP4 quantization method with the quantization system |
| fastdeploy/model_executor/layers/normalization.py | Integrates modules_to_convert to support selective quantization |
| fastdeploy/model_executor/layers/moe/moe.py | Adds MXFP4-specific MoE weight-loading logic |
| fastdeploy/model_executor/layers/linear.py | Integrates the modules_to_convert check to control linear-layer quantization |
| fastdeploy/envs.py | Adds the environment variable for the FlashInfer MXFP4 backend |
| fastdeploy/config.py | Adds MXFP4 model-format detection logic |
Comments suppressed due to low confidence (1)
fastdeploy/model_executor/layers/quantization/mxfp4.py:250
- This statement is unreachable.
block_size = 32

class MXFP4Config(QuantConfigBase):
    """Base class for quantization configs."""
Copilot AI (Dec 8, 2025)
The docstring "Base class for quantization configs." is misleading since this is actually a concrete implementation for MXFP4 quantization, not a base class. Consider updating it to: "Configuration for MXFP4 quantization."
| """Base class for quantization configs.""" | |
| """Configuration for MXFP4 quantization.""" |
router_logits (torch.Tensor): Router logits of shape [batch_size, num_experts]
top_k (int): Number of experts to route to per token
Returns:
    tuple[torch.Tensor, torch.Tensor]: A tuple containing:
Copilot AI (Dec 8, 2025)
The docstring references torch.Tensor types, but this is a Paddle-based codebase and the function signature uses paddle.Tensor. Update the documentation to use paddle.Tensor for consistency.
Suggested change:
- router_logits (torch.Tensor): Router logits of shape [batch_size, num_experts]
- top_k (int): Number of experts to route to per token
- Returns:
-     tuple[torch.Tensor, torch.Tensor]: A tuple containing:
+ router_logits (paddle.Tensor): Router logits of shape [batch_size, num_experts]
+ top_k (int): Number of experts to route to per token
+ Returns:
+     tuple[paddle.Tensor, paddle.Tensor]: A tuple containing:
elif quant_method == "mxfp4":
    quant_config_name = "mxfp4"
else:
    raise ValueError("Torch weight offline quantization only supports block-wise FP8.")
Copilot AI (Dec 8, 2025)
The error message is now incorrect. Since MXFP4 is supported, the error should be updated to reflect this. Consider: raise ValueError("Torch weight offline quantization only supports block-wise FP8 and MXFP4.")
Suggested change:
- raise ValueError("Torch weight offline quantization only supports block-wise FP8.")
+ raise ValueError("Torch weight offline quantization only supports block-wise FP8 and MXFP4.")
try:
    # check all the file exists
- safetensors_num = int(model_files[0].strip(".safetensors").split("-")[-1])
+ safetensors_num = int(model_files[0].strip(".safetensors").split("-")[-1]) + 1
Copilot AI (Dec 8, 2025)
The + 1 addition may be incorrect. The filename format appears to be model-XXXXX-of-YYYYY.safetensors where the last number (YYYYY) represents the total count, not a zero-indexed value. Adding 1 to this value would result in an incorrect safetensors_num. For example, if the file is model-00001-of-00005.safetensors, the code would extract 5 and add 1 to get 6, but there should only be 5 files total. Please verify the filename format and remove the + 1 if YYYYY represents the total count.
Suggested change:
- safetensors_num = int(model_files[0].strip(".safetensors").split("-")[-1]) + 1
+ safetensors_num = int(model_files[0].strip(".safetensors").split("-")[-1])
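For reference, a small self-contained sketch of how the shard count can be read from this naming scheme; the helper name and regex are illustrative, not the PR's code. It also sidesteps the subtlety that str.strip(".safetensors") strips a character set rather than a suffix.

```python
import os
import re

def count_safetensors_shards(model_dir: str) -> int:
    """Illustrative helper: for shards named model-00001-of-00005.safetensors,
    the trailing field is already the total shard count, so no "+ 1" is needed."""
    pattern = re.compile(r"-of-(\d+)\.safetensors$")
    for name in sorted(os.listdir(model_dir)):
        match = pattern.search(name)
        if match:
            return int(match.group(1))
    raise FileNotFoundError(f"no sharded .safetensors files found in {model_dir}")
```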
hasattr(fd_config, "quant_config")
and fd_config.model_config.quantization_config is not None
Copilot AI (Dec 8, 2025)
The condition checks hasattr(fd_config, "quant_config") but then accesses fd_config.model_config.quantization_config. This is inconsistent. If you're checking for the existence of quant_config attribute, you should also validate fd_config.model_config.quantization_config is not None before checking if it contains "modules_to_not_convert". The second condition on line 231 already checks if quantization_config is not None, but the first check should verify the correct attribute path. Consider: hasattr(fd_config.model_config, "quantization_config") or simply remove the first check since line 231 already handles the None case.
Suggested change:
- hasattr(fd_config, "quant_config")
- and fd_config.model_config.quantization_config is not None
+ fd_config.model_config.quantization_config is not None
return vocab_range_from_per_partition_vocab_size(per_partition_vocab_size, rank, offset=offset)

def modules_to_convert(prefix: str, fd_config: FDConfig):
Copilot AI (Dec 8, 2025)
The function modules_to_convert lacks a docstring. Add documentation explaining the purpose, parameters, and return value. For example: "Check if a module should be converted for quantization based on its prefix and quantization config patterns."
Suggested change:
- def modules_to_convert(prefix: str, fd_config: FDConfig):
+ def modules_to_convert(prefix: str, fd_config: FDConfig):
+     """
+     Determines whether a module should be converted for quantization based on its prefix and quantization configuration patterns.
+     Args:
+         prefix (str): The prefix (name) of the module to check.
+         fd_config (FDConfig): The FastDeploy configuration object containing model and quantization settings.
+     Returns:
+         bool: True if the module should be converted for quantization, False otherwise.
+     """
def _interleave_mxfp4_cutlass_sm90(w):
    w_shape = w.shape
    w_interleaved = w.reshape([w_shape[0], w_shape[1], (w_shape[2] // 4), 4])
    w_interleaved = w_interleaved.permute([0, 2, 1, 3])
    w_interleaved = w_interleaved.reshape([w_shape[0], w_shape[2] // 4, w_shape[1] * 4])
    return w_interleaved
Copilot AI (Dec 8, 2025)
The helper function _interleave_mxfp4_cutlass_sm90 is defined inside _load_fused_experts_weight and is duplicated in fastdeploy/model_executor/layers/quantization/mxfp4.py (line 297, though in unreachable code). Consider extracting this function to a shared utility module to avoid duplication and improve maintainability.
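A minimal sketch of the suggested extraction, keeping the body identical to the helper quoted above; the target module path and public name are placeholders, not the PR's actual choice. Both call sites would then import this single implementation instead of redefining it locally.

```python
# e.g. fastdeploy/model_executor/layers/quantization/mxfp4_utils.py (hypothetical path)

def interleave_mxfp4_cutlass_sm90(w):
    """Interleave packed MXFP4 weights into the SM90 CUTLASS layout.

    Same body as the inline helper above, lifted out so moe.py and
    mxfp4.py can share one implementation.
    """
    w_shape = w.shape
    w_interleaved = w.reshape([w_shape[0], w_shape[1], (w_shape[2] // 4), 4])
    w_interleaved = w_interleaved.permute([0, 2, 1, 3])
    w_interleaved = w_interleaved.reshape([w_shape[0], w_shape[2] // 4, w_shape[1] * 4])
    return w_interleaved
```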
def get_mxfp4_backend():
    if current_platform.is_cuda():
        if check_device_capability(90) and has_flashinfer() and envs.FD_USE_FLASHINFER_MOE_MXFP4_BF16:
Copilot AI (Dec 8, 2025)
The environment variable envs.FD_USE_FLASHINFER_MOE_MXFP4_BF16 is being evaluated as a string. According to fastdeploy/envs.py, it defaults to "0" (a string). This condition will always be truthy because non-empty strings evaluate to True in Python. You need to convert it to a boolean or integer, e.g., envs.FD_USE_FLASHINFER_MOE_MXFP4_BF16 == "1" or use int(envs.FD_USE_FLASHINFER_MOE_MXFP4_BF16) or similar conversion.
Suggested change:
- if check_device_capability(90) and has_flashinfer() and envs.FD_USE_FLASHINFER_MOE_MXFP4_BF16:
+ if check_device_capability(90) and has_flashinfer() and envs.FD_USE_FLASHINFER_MOE_MXFP4_BF16 == "1":
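A short, runnable illustration of the truthiness pitfall described above and one way to parse the flag; the int conversion shown is an assumption about how the variable is meant to be interpreted, not the PR's final fix.

```python
import os

# "0" is a non-empty string, so it is truthy even though it means "disabled".
flag = os.getenv("FD_USE_FLASHINFER_MOE_MXFP4_BF16", "0")
assert bool(flag) is True  # the pitfall: the branch above would always be taken

# One possible fix: convert the string to an int before using it as a condition.
use_flashinfer_mxfp4_bf16 = bool(int(flag))
assert use_flashinfer_mxfp4_bf16 is False
```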
    logger.info("FastDeploy Using Triton backend in MoE")
    return Mxfp4Backend.TRITON
else:
    raise NotImplementedError
Copilot AI (Dec 8, 2025)
The error message "NotImplementedError" is not helpful. Consider adding a descriptive error message to explain which platforms are unsupported and why. For example: raise NotImplementedError("MXFP4 quantization is only supported on CUDA platforms").
Suggested change:
- raise NotImplementedError
+ raise NotImplementedError("MXFP4 quantization is only supported on CUDA platforms due to hardware requirements.")
is_checkpoint_bf16 = not config.get("is_quantized", False)
return cls(is_checkpoint_bf16)

def get_quant_method(self, layer) -> Optional[QuantMethodBase]:
Copilot AI (Dec 8, 2025)
This method requires 2 positional arguments, whereas overridden QuantConfigBase.get_quant_method requires 3.
Suggested change:
- def get_quant_method(self, layer) -> Optional[QuantMethodBase]:
+ def get_quant_method(self, layer, weight=None) -> Optional[QuantMethodBase]:
Codecov Report
Additional details and impacted files

@@ Coverage Diff @@
## develop #5435 +/- ##
==========================================
Coverage ? 59.39%
==========================================
Files ? 328
Lines ? 40818
Branches ? 6197
==========================================
Hits ? 24242
Misses ? 14714
Partials ? 1862
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Motivation
This PR adds support for native GPT-OSS-120B and GPT-OSS-20B models using MXFP4 offline quantization, leveraging FlashInfer as the backend for MoE components.
Modifications
Usage or Command
Accuracy Tests
Checklist
- Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- If the PR targets the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.