
Conversation

@lyogavin commented Nov 24, 2025

Add Block-wise INT8 Quantization

This PR adds DeepSeek-style block-wise INT8 quantization support to ComfyUI, enabling ~50% memory reduction with limited accuracy loss and improved performance on large layers.

How it works:

It works similarly to our current scaled FP8, but the scale is block-based: the given tensor is split into blocks and a scale value is stored for each block.

Note that the implementation is based on this, which blocks activations and weights asymmetrically: activations are split into blocks only along the last dimension, while weights are blocked across the last two dimensions.
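To make the blocking concrete, here is a minimal PyTorch sketch of the scheme just described (illustrative only: the helper names are made up, it assumes dimensions divide evenly by the block size, and it is not the PR's actual kernel code):

```python
import torch

def quantize_weight_blockwise(w: torch.Tensor, block: int = 128):
    # Weights: (block x block) tiles across the last two dims, one scale per tile.
    M, N = w.shape  # assumes M and N are multiples of `block`
    tiles = w.reshape(M // block, block, N // block, block)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True)   # per-tile absolute max
    scale = amax.clamp(min=1e-8) / 127.0                 # map each tile into int8 range
    q = (tiles / scale).round().clamp(-127, 127).to(torch.int8)
    return q.reshape(M, N), scale.reshape(M // block, N // block)

def quantize_activation_blockwise(x: torch.Tensor, block: int = 128):
    # Activations: 1D blocks along the last dimension only.
    *lead, K = x.shape  # assumes K is a multiple of `block`
    blocks = x.reshape(*lead, K // block, block)
    amax = blocks.abs().amax(dim=-1, keepdim=True)
    scale = amax.clamp(min=1e-8) / 127.0
    q = (blocks / scale).round().clamp(-127, 127).to(torch.int8)
    return q.reshape(*lead, K), scale.squeeze(-1)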

More papers/code for reference:
Jetfire
Jetfire Repo
Deepseek v3 paper
Deepseek block-wise scaled fp8 implementation
Deepseek Deepgemm

Changes:

Core Implementation

  • BlockWiseINT8Layout: new quantization format with per-block scaling, built on the QuantizedLayout mechanism (here)
  • Triton-optimized CUDA kernels with PyTorch fallbacks (referencing this implementation); a simplified fallback sketch follows this list
  • Configurable block size (default: 128)
  • Necessary changes to the weight-adapter code that references the internal data dtype
  • Updated the MixedPrecisionOps implementation to load the new field (is_weight) for the new layout
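
As a rough mental model of what the PyTorch fallback path has to do, here is a hedged sketch that reuses the illustrative scale layout from the block above (not the PR's actual fallback code):

```python
import torch
import torch.nn.functional as F

def dequantize_weight_blockwise(q: torch.Tensor, scale: torch.Tensor, block: int = 128):
    # Expand each per-tile scale back over its (block x block) tile and multiply.
    s = scale.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    return q.to(torch.float32) * s

def fallback_linear(x, q_weight, w_scale, bias=None, block=128):
    # Pure-PyTorch fallback: dequantize the block-scaled weight, then run a
    # regular floating-point linear. A Triton kernel would instead keep the
    # matmul in int8 and apply the scales on the accumulator.
    w = dequantize_weight_blockwise(q_weight, w_scale, block).to(x.dtype)
    return F.linear(x, w, bias)
```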

Tests

  • Added unit tests in tests-unit/comfy_quant/test_quant_registry.py: they verify quantization/dequantization error, verify error for ops such as gemm, gelu, etc., and add runtime benchmarking (disabled by default); a sketch of this kind of check follows below.
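
For illustration, a round-trip error check in that spirit might look roughly like the following. This is a hypothetical standalone example built on the public API shown in the Usage section, not the PR's actual test code, and the 5% bound is an arbitrary illustrative threshold:

```python
import torch
from comfy.quant_ops import QuantizedTensor  # import path taken from the Usage section below

def test_blockwise_int8_roundtrip_error():
    torch.manual_seed(0)
    w = torch.randn(256, 512, device='cuda')
    q = QuantizedTensor.from_float(w, "BlockWiseINT8Layout", block_size=128, is_weight=True)
    rel_err = (q.dequantize().float() - w).abs().mean() / w.abs().mean()
    # Arbitrary illustrative bound: block-wise INT8 should keep the mean
    # relative round-trip error small on Gaussian weights.
    assert rel_err < 0.05
```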

Performance Benchmarks (RTX 4090)

  • Memory: ~50% reduction vs FP16
  • Speed:
    • Wan t2v model on RTX 4090 48GB VRAM (in seconds):
      • 5B
        • Block-wise scaled INT8 with linear+gelu+transpose: 168.05
        • Block-wise scaled INT8 with linear+gelu: 167.12
        • Block-wise scaled INT8 with linear: 171.33
        • FP16: 178.77
      • 14B (20 steps)
        • Block-wise scaled INT8 with linear+gelu+transpose: 233.23
        • Block-wise scaled INT8 with linear+gelu: 233.86
        • Block-wise scaled INT8 with linear: 236.90
        • FP8 scaled: 209.39
        • FP16: 253.19
      • 14B (on 4090 24GB VRAM, 5 steps with step distillation):
        • INT8: 85.72 sec (no offload)
        • FP16: 106.9 sec (10242MB offload)
        • FP8: 80 sec (no offload)


More detailed performance, precision, and memory-consumption comparison across Wan video model sizes:

Summary: FP16 vs INT8 vs FP8 performance

WAN2.2-5B:

| Layer                          | FP16     | INT8     | FP8      | INT8 speedup | FP8 speedup | Mem reduction |
|--------------------------------|----------|----------|----------|--------------|-------------|---------------|
| First layer (small batch)      | 0.146ms  | 0.268ms  | 0.117ms  | 0.54x        | 1.25x       | 2.00x         |
| Attention layer (long seq)     | 6.331ms  | 6.549ms  | 5.519ms  | 0.97x        | 1.15x       | 1.94x         |
| MLP down projection (long seq) | 30.536ms | 23.795ms | 18.422ms | 1.28x        | 1.66x       | 1.94x         |
| Attention layer (medium seq)   | 0.149ms  | 0.246ms  | 0.160ms  | 0.61x        | 0.93x       | 1.98x         |
| SUBTOTAL                       | 37.162ms | 30.857ms | 24.218ms | 1.20x        | 1.53x       |               |

WAN2.2-5B avg memory reduction: 1.97x
WAN2.2-5B avg INT8 precision error: 0.179672
WAN2.2-5B avg FP8 precision error: 0.389072
WAN2.2-5B VRAM usage: FP16 6138.19MB, INT8 7189.63MB (during inference with both)

WAN2.2-14B:

| Layer                        | FP16     | INT8     | FP8      | INT8 speedup | FP8 speedup | Mem reduction |
|------------------------------|----------|----------|----------|--------------|-------------|---------------|
| First layer (small batch)    | 0.360ms  | 0.395ms  | 0.268ms  | 0.91x        | 1.34x       | 2.00x         |
| Attention layer (long seq)   | 17.401ms | 15.633ms | 12.488ms | 1.11x        | 1.39x       | 1.94x         |
| Attention layer (medium seq) | 0.366ms  | 0.357ms  | 0.262ms  | 1.02x        | 1.40x       | 1.99x         |
| SUBTOTAL                     | 18.127ms | 16.385ms | 13.018ms | 1.11x        | 1.39x       |               |

WAN2.2-14B avg memory reduction: 1.98x
WAN2.2-14B avg INT8 precision error: 0.190389
WAN2.2-14B avg FP8 precision error: 0.365195
WAN2.2-14B VRAM usage: FP16 2829.11MB, INT8 3310.05MB (during inference with both)

Conclusion: INT8 is slower than FP8 but faster than FP16; its precision is better than FP8's, and its memory consumption is similar to FP8's.

Usage

```python
from comfy.quant_ops import QuantizedTensor

weight_int8 = QuantizedTensor.from_float(
    weight,
    "BlockWiseINT8Layout",
    block_size=128,
    is_weight=True
)

# to dequantize:
weight_float = weight_int8.dequantize()

# the following will internally trigger the INT8-based linear operation:
output = torch.nn.functional.linear(input, weight_int8)
```

Actual ComfyUI workflow test:

I've uploaded some quantized Wan2.2 models here and created this sample workflow.

Generation result:
https://github.com/user-attachments/assets/35227283-f8b6-4b7c-af18-6d86e6ed18f6

LoRA loading precision tests

I ran some more tests on the quantization/dequantization errors for model+LoRA here: compare_lora_error.ipynb

From the tests, it seems that for the original Wan model the new INT8 quantization's error is smaller, but if we load a LoRA and then quantize (the actual form the weights take at runtime), the error becomes larger than scaled FP8's.

I suspect this is because in some cases the LoRA shifts the weight distribution enough that some scales are no longer accurate. So I've added an argument to the model converter tool that accepts a LoRA, letting the converter consider both the original base model and the LoRA-applied version when quantizing (see below for more details; a rough sketch of the idea follows this paragraph).
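
As a loose illustration of what "considering both versions" could mean, one way to pick clip-safe scales is from the element-wise max of the base and LoRA-applied weights. The function below is a hypothetical sketch (reusing the tile layout from the earlier sketch), not the converter's actual code:

```python
import torch

def lora_aware_block_scales(w_base: torch.Tensor, w_lora_applied: torch.Tensor, block: int = 128):
    # Pick per-tile scales from the element-wise max of both weight variants,
    # so neither the base nor the LoRA-applied weights get clipped by the scale.
    M, N = w_base.shape  # assumes M and N are multiples of `block`
    combined = torch.maximum(w_base.abs(), w_lora_applied.abs())
    tiles = combined.reshape(M // block, block, N // block, block)
    amax = tiles.amax(dim=(1, 3))
    return amax.clamp(min=1e-8) / 127.0   # scales shaped (M//block, N//block)
```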

Tool to convert a model to block-wise scaled INT8 format:

convert_to_int8_blockwise.py

Usage:

```bash
# convert a model to the quantized version:
python convert_to_int8_blockwise.py input_original_model.safetensors output_quantized_model.safetensors

# convert, taking the specified LoRAs into account:
python convert_to_int8_blockwise.py input_original_model.safetensors output_quantized_model.safetensors \
    --lora lora1.safetensors --lora lora2.safetensors --lora-strength 1.0
```

@Kosinkadink (Collaborator) commented Nov 25, 2025

The changes for flux2 caused a slight conflict, could you take a look? Thanks!

@lyogavin lyogavin force-pushed the support_int8_quantization branch 2 times, most recently from 597ab49 to 16c2dfa Compare November 25, 2025 21:06
@lyogavin (Author):

> The changes for flux2 caused a slight conflict, could you take a look? Thanks!

Sure. I've resolved the conflicts. Thanks.

@lyogavin lyogavin force-pushed the support_int8_quantization branch 6 times, most recently from 7f9a65c to 86c0361 Compare November 26, 2025 18:41
```diff
     weight = weight_decompose(dora_scale, weight, lora_diff, alpha, strength, intermediate_dtype, function)
 else:
-    weight += function((strength * lora_diff).type(weight.dtype))
+    weight += function((strength * lora_diff).type(weight.dtype if not isinstance(weight, QuantizedTensor) else torch.float32))
```
Owner:

I don't think these are needed in the latest commit.

Author:

It's a little tricky: weight.dtype returns torch.int8, so if we don't add this, it will cast the LoRA diff to int8.

Here's an example:

```python
import torch
from comfy.quant_ops import QuantizedTensor

M, N = 256, 512
weight = torch.randn(M, N, dtype=torch.float32, device='cuda')

int8v = QuantizedTensor.from_float(weight, layout_type='BlockWiseINT8Layout', is_weight=True)
fp8v = QuantizedTensor.from_float(weight, layout_type='TensorCoreFP8Layout')

print(int8v.dtype, fp8v.dtype)
# output:
# torch.int8 torch.float8_e4m3fn
```

So code that directly accesses dtype might run into issues.

Maybe we should try to find a way to override Tensor.dtype?

Owner:

In the most recent code, the LoRA logic is called after QuantizedTensor.dequantize.

Owner:

Basically it does: convert_weight() -> apply the lora -> set_weight()

The convert and set weight functions are: https://github.com/comfyanonymous/ComfyUI/blob/master/comfy/ops.py#L502
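
Paraphrasing that flow as pseudocode (convert_weight/set_weight are the names from the comment above; patch_module_weight and apply_lora_patch are hypothetical placeholders, not the actual comfy/ops.py code):

```python
def patch_module_weight(module, lora_patches):
    weight = convert_weight(module.weight)        # QuantizedTensor -> plain float tensor
    for patch in lora_patches:
        weight = apply_lora_patch(weight, patch)  # LoRA math runs on the float weight
    set_weight(module, weight)                    # store back / re-quantize on the module
```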

Author:

I'll check. Thanks.

Author:

@comfyanonymous you are right. It seems that with change #10899 we no longer need to modify the types in the LoRA code, which is great. It also fixed the LoRA issue I mentioned in the description. I removed all the related unnecessary changes.

@lyogavin lyogavin force-pushed the support_int8_quantization branch from 86c0361 to 3322d21 Compare November 26, 2025 20:57
@Kosinkadink Kosinkadink added the Core Core team dependency label Dec 3, 2025
@lyogavin lyogavin force-pushed the support_int8_quantization branch from 3322d21 to dfaa2c1 Compare December 10, 2025 16:05
@lyogavin lyogavin requested a review from guill as a code owner December 10, 2025 16:05
@lyogavin lyogavin force-pushed the support_int8_quantization branch 2 times, most recently from ae0b939 to bbb4626 Compare December 10, 2025 18:22
…anism

add more tests by comparing with manual torch implementation

add perf benchmarks

fix errors caused by merging

default no output quant

fix unittest
@lyogavin lyogavin force-pushed the support_int8_quantization branch from 46679ef to 5ba2d28 Compare December 10, 2025 18:23
