
Conversation

@Limerances
Contributor

Motivation

This PR adds support for native GPT-OSS-120B and GPT-OSS-20B models using MXFP4 offline quantization, leveraging FlashInfer as the backend for MoE components.

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests, or explain in this PR why they are not included.
  • Provide accuracy results.
  • If this PR targets a release branch, make sure it has already been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings December 8, 2025 09:08
@paddle-bot

paddle-bot bot commented Dec 8, 2025

Thanks for your contribution!

paddle-bot bot added the contributor (External developers) label on Dec 8, 2025
Contributor

Copilot AI left a comment


Pull request overview

This PR adds MXFP4 offline quantization support for the GPT-OSS-120B and GPT-OSS-20B models, using FlashInfer as the backend for the MoE components.

The main changes include:

  • A complete implementation of the MXFP4 quantization method
  • Module-conversion control logic to support selective quantization
  • A fix for a potential issue in counting the number of safetensors files

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 13 comments.

Summary per file

fastdeploy/utils.py: Fix the calculation of the number of safetensors files
fastdeploy/model_executor/models/gpt_oss.py: Add MXFP4 quantization config support and improve the weight-loading logic
fastdeploy/model_executor/layers/utils.py: Add the modules_to_convert utility function for deciding whether a module needs quantization conversion
fastdeploy/model_executor/layers/quantization/mxfp4.py: Implement the core MXFP4 quantization logic, including weight creation, loading, and the forward computation
fastdeploy/model_executor/layers/quantization/__init__.py: Register the MXFP4 quantization method with the quantization system
fastdeploy/model_executor/layers/normalization.py: Integrate modules_to_convert to support selective quantization
fastdeploy/model_executor/layers/moe/moe.py: Add MXFP4-specific MoE weight-loading logic
fastdeploy/model_executor/layers/linear.py: Integrate the modules_to_convert check to control linear-layer quantization
fastdeploy/envs.py: Add the environment variable for the FlashInfer MXFP4 backend
fastdeploy/config.py: Add MXFP4 model-format detection logic
Comments suppressed due to low confidence (1)

fastdeploy/model_executor/layers/quantization/mxfp4.py:250

  • This statement is unreachable.
        block_size = 32



class MXFP4Config(QuantConfigBase):
"""Base class for quantization configs."""

Copilot AI Dec 8, 2025


The docstring "Base class for quantization configs." is misleading since this is actually a concrete implementation for MXFP4 quantization, not a base class. Consider updating it to: "Configuration for MXFP4 quantization."

Suggested change
"""Base class for quantization configs."""
"""Configuration for MXFP4 quantization."""

Comment on lines 320 to 324
router_logits (torch.Tensor): Router logits of shape [batch_size, num_experts]
top_k (int): Number of experts to route to per token
Returns:
tuple[torch.Tensor, torch.Tensor]: A tuple containing:

Copilot AI Dec 8, 2025


The docstring references torch.Tensor types, but this is a Paddle-based codebase and the function signature uses paddle.Tensor. Update the documentation to use paddle.Tensor for consistency.

Suggested change
router_logits (torch.Tensor): Router logits of shape [batch_size, num_experts]
top_k (int): Number of experts to route to per token
Returns:
tuple[torch.Tensor, torch.Tensor]: A tuple containing:
router_logits (paddle.Tensor): Router logits of shape [batch_size, num_experts]
top_k (int): Number of experts to route to per token
Returns:
tuple[paddle.Tensor, paddle.Tensor]: A tuple containing:

elif quant_method == "mxfp4":
quant_config_name = "mxfp4"
else:
raise ValueError("Torch weight offline quantization only supports block-wise FP8.")

Copilot AI Dec 8, 2025


The error message is now incorrect. Since MXFP4 is supported, the error should be updated to reflect this. Consider: raise ValueError("Torch weight offline quantization only supports block-wise FP8 and MXFP4.")

Suggested change
raise ValueError("Torch weight offline quantization only supports block-wise FP8.")
raise ValueError("Torch weight offline quantization only supports block-wise FP8 and MXFP4.")

try:
# check all the file exists
safetensors_num = int(model_files[0].strip(".safetensors").split("-")[-1])
safetensors_num = int(model_files[0].strip(".safetensors").split("-")[-1]) + 1

Copilot AI Dec 8, 2025


The + 1 addition may be incorrect. The filename format appears to be model-XXXXX-of-YYYYY.safetensors where the last number (YYYYY) represents the total count, not a zero-indexed value. Adding 1 to this value would result in an incorrect safetensors_num. For example, if the file is model-00001-of-00005.safetensors, the code would extract 5 and add 1 to get 6, but there should only be 5 files total. Please verify the filename format and remove the + 1 if YYYYY represents the total count.

Suggested change
safetensors_num = int(model_files[0].strip(".safetensors").split("-")[-1]) + 1
safetensors_num = int(model_files[0].strip(".safetensors").split("-")[-1])

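For reference, a minimal sketch (not part of this PR) of deriving the shard count from a filename of the form model-XXXXX-of-YYYYY.safetensors; the helper name is hypothetical. It also sidesteps the separate pitfall that str.strip(".safetensors") removes a set of characters rather than the suffix:

import re

def shard_count_from_filename(filename: str) -> int:
    # "model-00001-of-00005.safetensors" -> 5: the trailing number already
    # is the total file count, so no "+ 1" is needed.
    stem = filename.removesuffix(".safetensors")  # Python 3.9+; strip() removes characters, not the suffix
    match = re.search(r"-of-(\d+)$", stem)
    if match is None:
        raise ValueError(f"Unexpected safetensors filename: {filename}")
    return int(match.group(1))

assert shard_count_from_filename("model-00001-of-00005.safetensors") == 5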
Comment on lines +230 to +231
hasattr(fd_config, "quant_config")
and fd_config.model_config.quantization_config is not None

Copilot AI Dec 8, 2025


The condition checks hasattr(fd_config, "quant_config") but then accesses fd_config.model_config.quantization_config, which is inconsistent: the attribute being guarded is not the attribute being used. The second condition on line 231 already checks that quantization_config is not None before "modules_to_not_convert" is looked up, so the first check should either verify the correct attribute path, e.g. hasattr(fd_config.model_config, "quantization_config"), or simply be removed, since line 231 already handles the None case.

Suggested change
hasattr(fd_config, "quant_config")
and fd_config.model_config.quantization_config is not None
fd_config.model_config.quantization_config is not None

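A minimal sketch of the alternative guard described above, checking the attribute path that is actually accessed; the surrounding names come from the quoted diff and the rest is hypothetical:

# Guard the attribute path that is used afterwards, then handle the None case
# before looking up the key (quantization_config is assumed to be dict-like).
quantization_config = getattr(fd_config.model_config, "quantization_config", None)
if quantization_config is not None and "modules_to_not_convert" in quantization_config:
    modules_to_not_convert = quantization_config["modules_to_not_convert"]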
return vocab_range_from_per_partition_vocab_size(per_partition_vocab_size, rank, offset=offset)


def modules_to_convert(prefix: str, fd_config: FDConfig):

Copilot AI Dec 8, 2025


The function modules_to_convert lacks a docstring. Add documentation explaining the purpose, parameters, and return value. For example: "Check if a module should be converted for quantization based on its prefix and quantization config patterns."

Suggested change
def modules_to_convert(prefix: str, fd_config: FDConfig):
def modules_to_convert(prefix: str, fd_config: FDConfig):
"""
Determines whether a module should be converted for quantization based on its prefix and quantization configuration patterns.
Args:
prefix (str): The prefix (name) of the module to check.
fd_config (FDConfig): The FastDeploy configuration object containing model and quantization settings.
Returns:
bool: True if the module should be converted for quantization, False otherwise.
"""

Comment on lines +426 to +431
def _interleave_mxfp4_cutlass_sm90(w):
w_shape = w.shape
w_interleaved = w.reshape([w_shape[0], w_shape[1], (w_shape[2] // 4), 4])
w_interleaved = w_interleaved.permute([0, 2, 1, 3])
w_interleaved = w_interleaved.reshape([w_shape[0], w_shape[2] // 4, w_shape[1] * 4])
return w_interleaved

Copilot AI Dec 8, 2025


The helper function _interleave_mxfp4_cutlass_sm90 is defined inside _load_fused_experts_weight and is duplicated in fastdeploy/model_executor/layers/quantization/mxfp4.py (line 297, though in unreachable code). Consider extracting this function to a shared utility module to avoid duplication and improve maintainability.

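As one possible shape of that refactor, a sketch of the shared helper; the target module and import path are assumptions, and the body is copied unchanged from the duplicated function:

# Hypothetical shared location, e.g. fastdeploy/model_executor/layers/utils.py.
# Interleaves the last dimension of the weight tensor in groups of 4, as the
# CUTLASS SM90 MXFP4 path expects.
def interleave_mxfp4_cutlass_sm90(w):
    w_shape = w.shape
    w_interleaved = w.reshape([w_shape[0], w_shape[1], (w_shape[2] // 4), 4])
    w_interleaved = w_interleaved.permute([0, 2, 1, 3])
    w_interleaved = w_interleaved.reshape([w_shape[0], w_shape[2] // 4, w_shape[1] * 4])
    return w_interleaved

# Both call sites would then import it instead of redefining it locally:
# from fastdeploy.model_executor.layers.utils import interleave_mxfp4_cutlass_sm90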

def get_mxfp4_backend():
if current_platform.is_cuda():
if check_device_capability(90) and has_flashinfer() and envs.FD_USE_FLASHINFER_MOE_MXFP4_BF16:

Copilot AI Dec 8, 2025


The environment variable envs.FD_USE_FLASHINFER_MOE_MXFP4_BF16 is being evaluated as a string. According to fastdeploy/envs.py, it defaults to "0" (a string). Because any non-empty string evaluates to True in Python, this term of the condition is truthy regardless of the flag's value. Convert it to a boolean or integer, e.g. envs.FD_USE_FLASHINFER_MOE_MXFP4_BF16 == "1" or int(envs.FD_USE_FLASHINFER_MOE_MXFP4_BF16).

Suggested change
if check_device_capability(90) and has_flashinfer() and envs.FD_USE_FLASHINFER_MOE_MXFP4_BF16:
if check_device_capability(90) and has_flashinfer() and envs.FD_USE_FLASHINFER_MOE_MXFP4_BF16 == "1":

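To illustrate the pitfall, a small self-contained sketch; only the variable name is taken from the comment above, the rest is hypothetical:

import os

# String-valued flags are truthy whenever they are non-empty, so a default of
# "0" would still enable the branch:
flag = os.getenv("FD_USE_FLASHINFER_MOE_MXFP4_BF16", "0")
print(bool(flag))  # True even when the flag is "0"

# Coercing to an integer (or comparing against "1") restores the intended semantics:
use_flashinfer_bf16 = int(os.getenv("FD_USE_FLASHINFER_MOE_MXFP4_BF16", "0")) != 0
print(use_flashinfer_bf16)  # False unless the variable is set to a non-zero integer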
logger.info("FastDeploy Using Triton backend in MoE")
return Mxfp4Backend.TRITON
else:
raise NotImplementedError

Copilot AI Dec 8, 2025


The error message "NotImplementedError" is not helpful. Consider adding a descriptive error message to explain which platforms are unsupported and why. For example: raise NotImplementedError("MXFP4 quantization is only supported on CUDA platforms").

Suggested change
raise NotImplementedError
raise NotImplementedError("MXFP4 quantization is only supported on CUDA platforms due to hardware requirements.")

is_checkpoint_bf16 = not config.get("is_quantized", False)
return cls(is_checkpoint_bf16)

def get_quant_method(self, layer) -> Optional[QuantMethodBase]:

Copilot AI Dec 8, 2025


This method requires 2 positional arguments, whereas overridden QuantConfigBase.get_quant_method requires 3.

Suggested change
def get_quant_method(self, layer) -> Optional[QuantMethodBase]:
def get_quant_method(self, layer, weight=None) -> Optional[QuantMethodBase]:

@codecov-commenter

Codecov Report

❌ Patch coverage is 27.83505% with 140 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@3066a0c). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...deploy/model_executor/layers/quantization/mxfp4.py 31.25% 77 Missing ⚠️
fastdeploy/model_executor/layers/moe/moe.py 2.32% 42 Missing ⚠️
fastdeploy/model_executor/models/gpt_oss.py 0.00% 15 Missing ⚠️
fastdeploy/config.py 0.00% 3 Missing ⚠️
...loy/model_executor/layers/quantization/__init__.py 33.33% 2 Missing ⚠️
fastdeploy/model_executor/layers/utils.py 90.90% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #5435   +/-   ##
==========================================
  Coverage           ?   59.39%           
==========================================
  Files              ?      328           
  Lines              ?    40818           
  Branches           ?     6197           
==========================================
  Hits               ?    24242           
  Misses             ?    14714           
  Partials           ?     1862           
Flag Coverage Δ
GPU 59.39% <27.83%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
