Support MXFP4 for GPT-OSS #5435
Conversation
Thanks for your contribution!
Pull request overview
This PR adds MXFP4 offline quantization support for the GPT-OSS-120B and GPT-OSS-20B models, using FlashInfer as the backend for the MoE components.
The main changes include:
- A complete implementation of the MXFP4 quantization method
- Module-conversion control logic to support selective quantization
- A fix for a potential issue in how the number of safetensors files is computed
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| fastdeploy/utils.py | Corrects the logic for computing the number of safetensors files |
| fastdeploy/model_executor/models/gpt_oss.py | Adds MXFP4 quantization config support and improves weight-loading logic |
| fastdeploy/model_executor/layers/utils.py | Adds a modules_to_convert utility function for deciding whether a module needs quantization conversion |
| fastdeploy/model_executor/layers/quantization/mxfp4.py | Implements the core MXFP4 quantization logic, including weight creation, loading, and the forward pass |
| fastdeploy/model_executor/layers/quantization/__init__.py | Registers the MXFP4 quantization method with the quantization system |
| fastdeploy/model_executor/layers/normalization.py | Integrates modules_to_convert to support selective quantization |
| fastdeploy/model_executor/layers/moe/moe.py | Adds MXFP4-specific MoE weight-loading logic |
| fastdeploy/model_executor/layers/linear.py | Integrates the modules_to_convert check to control linear-layer quantization |
| fastdeploy/envs.py | Adds the environment variable for the FlashInfer MXFP4 backend |
| fastdeploy/config.py | Adds MXFP4 model-format detection logic |
Comments suppressed due to low confidence (1)
fastdeploy/model_executor/layers/quantization/mxfp4.py:250
- This statement is unreachable.
block_size = 32

class MXFP4Config(QuantConfigBase):
    """Base class for quantization configs."""
Copilot AI (Dec 8, 2025)
The docstring "Base class for quantization configs." is misleading since this is actually a concrete implementation for MXFP4 quantization, not a base class. Consider updating it to: "Configuration for MXFP4 quantization."
| """Base class for quantization configs.""" | |
| """Configuration for MXFP4 quantization.""" |
router_logits (torch.Tensor): Router logits of shape [batch_size, num_experts]
top_k (int): Number of experts to route to per token
Returns:
    tuple[torch.Tensor, torch.Tensor]: A tuple containing:
Copilot AI (Dec 8, 2025)
The docstring references torch.Tensor types, but this is a Paddle-based codebase and the function signature uses paddle.Tensor. Update the documentation to use paddle.Tensor for consistency.
Suggested change:
- router_logits (torch.Tensor): Router logits of shape [batch_size, num_experts]
- top_k (int): Number of experts to route to per token
- Returns:
-     tuple[torch.Tensor, torch.Tensor]: A tuple containing:
+ router_logits (paddle.Tensor): Router logits of shape [batch_size, num_experts]
+ top_k (int): Number of experts to route to per token
+ Returns:
+     tuple[paddle.Tensor, paddle.Tensor]: A tuple containing:
elif quant_method == "mxfp4":
    quant_config_name = "mxfp4"
else:
    raise ValueError("Torch weight offline quantization only supports block-wise FP8.")
Copilot AI (Dec 8, 2025)
The error message is now incorrect. Since MXFP4 is supported, the error should be updated to reflect this. Consider: raise ValueError("Torch weight offline quantization only supports block-wise FP8 and MXFP4.")
Suggested change:
- raise ValueError("Torch weight offline quantization only supports block-wise FP8.")
+ raise ValueError("Torch weight offline quantization only supports block-wise FP8 and MXFP4.")
try:
    # check all the file exists
- safetensors_num = int(model_files[0].strip(".safetensors").split("-")[-1])
+ safetensors_num = int(model_files[0].strip(".safetensors").split("-")[-1]) + 1
Copilot AI (Dec 8, 2025)
The + 1 addition may be incorrect. The filename format appears to be model-XXXXX-of-YYYYY.safetensors where the last number (YYYYY) represents the total count, not a zero-indexed value. Adding 1 to this value would result in an incorrect safetensors_num. For example, if the file is model-00001-of-00005.safetensors, the code would extract 5 and add 1 to get 6, but there should only be 5 files total. Please verify the filename format and remove the + 1 if YYYYY represents the total count.
Suggested change:
- safetensors_num = int(model_files[0].strip(".safetensors").split("-")[-1]) + 1
+ safetensors_num = int(model_files[0].strip(".safetensors").split("-")[-1])
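For reference, a small self-contained sketch of how the shard count can be read from this naming scheme; the helper name and regex are illustrative, not the PR's code. It also sidesteps the subtlety that str.strip(".safetensors") strips a character set rather than a suffix.

```python
import os
import re

def count_safetensors_shards(model_dir: str) -> int:
    """Illustrative helper: for shards named model-00001-of-00005.safetensors,
    the trailing field is already the total shard count, so no "+ 1" is needed."""
    pattern = re.compile(r"-of-(\d+)\.safetensors$")
    for name in sorted(os.listdir(model_dir)):
        match = pattern.search(name)
        if match:
            return int(match.group(1))
    raise FileNotFoundError(f"no sharded .safetensors files found in {model_dir}")
```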
hasattr(fd_config, "quant_config")
and fd_config.model_config.quantization_config is not None
Copilot AI (Dec 8, 2025)
The condition checks hasattr(fd_config, "quant_config") but then accesses fd_config.model_config.quantization_config. This is inconsistent. If you're checking for the existence of quant_config attribute, you should also validate fd_config.model_config.quantization_config is not None before checking if it contains "modules_to_not_convert". The second condition on line 231 already checks if quantization_config is not None, but the first check should verify the correct attribute path. Consider: hasattr(fd_config.model_config, "quantization_config") or simply remove the first check since line 231 already handles the None case.
Suggested change:
- hasattr(fd_config, "quant_config")
- and fd_config.model_config.quantization_config is not None
+ fd_config.model_config.quantization_config is not None
return vocab_range_from_per_partition_vocab_size(per_partition_vocab_size, rank, offset=offset)

def modules_to_convert(prefix: str, fd_config: FDConfig):
Copilot AI (Dec 8, 2025)
The function modules_to_convert lacks a docstring. Add documentation explaining the purpose, parameters, and return value. For example: "Check if a module should be converted for quantization based on its prefix and quantization config patterns."
Suggested change:
- def modules_to_convert(prefix: str, fd_config: FDConfig):
+ def modules_to_convert(prefix: str, fd_config: FDConfig):
+     """
+     Determines whether a module should be converted for quantization based on its prefix and quantization configuration patterns.
+     Args:
+         prefix (str): The prefix (name) of the module to check.
+         fd_config (FDConfig): The FastDeploy configuration object containing model and quantization settings.
+     Returns:
+         bool: True if the module should be converted for quantization, False otherwise.
+     """
def _interleave_mxfp4_cutlass_sm90(w):
    w_shape = w.shape
    w_interleaved = w.reshape([w_shape[0], w_shape[1], (w_shape[2] // 4), 4])
    w_interleaved = w_interleaved.permute([0, 2, 1, 3])
    w_interleaved = w_interleaved.reshape([w_shape[0], w_shape[2] // 4, w_shape[1] * 4])
    return w_interleaved
Copilot AI (Dec 8, 2025)
The helper function _interleave_mxfp4_cutlass_sm90 is defined inside _load_fused_experts_weight and is duplicated in fastdeploy/model_executor/layers/quantization/mxfp4.py (line 297, though in unreachable code). Consider extracting this function to a shared utility module to avoid duplication and improve maintainability.
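A minimal sketch of the suggested extraction, keeping the body identical to the helper quoted above; the target module path and public name are placeholders, not the PR's actual choice. Both call sites would then import this single implementation instead of redefining it locally.

```python
# e.g. fastdeploy/model_executor/layers/quantization/mxfp4_utils.py (hypothetical path)

def interleave_mxfp4_cutlass_sm90(w):
    """Interleave packed MXFP4 weights into the SM90 CUTLASS layout.

    Same body as the inline helper above, lifted out so moe.py and
    mxfp4.py can share one implementation.
    """
    w_shape = w.shape
    w_interleaved = w.reshape([w_shape[0], w_shape[1], (w_shape[2] // 4), 4])
    w_interleaved = w_interleaved.permute([0, 2, 1, 3])
    w_interleaved = w_interleaved.reshape([w_shape[0], w_shape[2] // 4, w_shape[1] * 4])
    return w_interleaved
```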
def get_mxfp4_backend():
    if current_platform.is_cuda():
        if check_device_capability(90) and has_flashinfer() and envs.FD_USE_FLASHINFER_MOE_MXFP4_BF16:
Copilot AI (Dec 8, 2025)
The environment variable envs.FD_USE_FLASHINFER_MOE_MXFP4_BF16 is being evaluated as a string. According to fastdeploy/envs.py, it defaults to "0" (a string). This condition will always be truthy because non-empty strings evaluate to True in Python. You need to convert it to a boolean or integer, e.g., envs.FD_USE_FLASHINFER_MOE_MXFP4_BF16 == "1" or use int(envs.FD_USE_FLASHINFER_MOE_MXFP4_BF16) or similar conversion.
Suggested change:
- if check_device_capability(90) and has_flashinfer() and envs.FD_USE_FLASHINFER_MOE_MXFP4_BF16:
+ if check_device_capability(90) and has_flashinfer() and envs.FD_USE_FLASHINFER_MOE_MXFP4_BF16 == "1":
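A short, runnable illustration of the truthiness pitfall described above and one way to parse the flag; the int conversion shown is an assumption about how the variable is meant to be interpreted, not the PR's final fix.

```python
import os

# "0" is a non-empty string, so it is truthy even though it means "disabled".
flag = os.getenv("FD_USE_FLASHINFER_MOE_MXFP4_BF16", "0")
assert bool(flag) is True  # the pitfall: the branch above would always be taken

# One possible fix: convert the string to an int before using it as a condition.
use_flashinfer_mxfp4_bf16 = bool(int(flag))
assert use_flashinfer_mxfp4_bf16 is False
```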
    logger.info("FastDeploy Using Triton backend in MoE")
    return Mxfp4Backend.TRITON
else:
    raise NotImplementedError
Copilot AI (Dec 8, 2025)
The error message "NotImplementedError" is not helpful. Consider adding a descriptive error message to explain which platforms are unsupported and why. For example: raise NotImplementedError("MXFP4 quantization is only supported on CUDA platforms").
Suggested change:
- raise NotImplementedError
+ raise NotImplementedError("MXFP4 quantization is only supported on CUDA platforms due to hardware requirements.")
is_checkpoint_bf16 = not config.get("is_quantized", False)
return cls(is_checkpoint_bf16)

def get_quant_method(self, layer) -> Optional[QuantMethodBase]:
Copilot AI (Dec 8, 2025)
This method requires 2 positional arguments, whereas overridden QuantConfigBase.get_quant_method requires 3.
Suggested change:
- def get_quant_method(self, layer) -> Optional[QuantMethodBase]:
+ def get_quant_method(self, layer, weight=None) -> Optional[QuantMethodBase]:
Codecov Report
Additional details and impacted files

@@ Coverage Diff @@
## develop #5435 +/- ##
==========================================
Coverage ? 59.39%
==========================================
Files ? 328
Lines ? 40818
Branches ? 6197
==========================================
Hits ? 24242
Misses ? 14714
Partials ? 1862
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Motivation
This PR adds support for native GPT-OSS-120B and GPT-OSS-20B models using MXFP4 offline quantization, leveraging FlashInfer as the backend for MoE components.
Modifications
Usage or Command
Accuracy Tests
Checklist
- Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- If the PR targets the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.