
Conversation

@lyogavin commented Nov 24, 2025

Add Block-wise INT8 Quantization

This PR adds DeepSeek-style block-wise INT8 quantization support to ComfyUI, enabling ~50% memory reduction with limited accuracy loss and improved performance on large layers.

How it works:

It works similarly to our current scaled FP8, but the scale is block-based: the given tensor is split into blocks and a scale value is stored for each block.

Note that the implementation is based on this, which blocks activations and weights asymmetrically: activations are split into blocks only along the last dimension, while weights are blocked across the last two dimensions.
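To make the blocking concrete, here is a minimal PyTorch sketch of the scheme just described (illustrative only: the helper names are made up, it assumes dimensions divide evenly by the block size, and it is not the PR's actual kernel code):

```python
import torch

def quantize_weight_blockwise(w: torch.Tensor, block: int = 128):
    # Weights: (block x block) tiles across the last two dims, one scale per tile.
    M, N = w.shape  # assumes M and N are multiples of `block`
    tiles = w.reshape(M // block, block, N // block, block)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True)   # per-tile absolute max
    scale = amax.clamp(min=1e-8) / 127.0                 # map each tile into int8 range
    q = (tiles / scale).round().clamp(-127, 127).to(torch.int8)
    return q.reshape(M, N), scale.reshape(M // block, N // block)

def quantize_activation_blockwise(x: torch.Tensor, block: int = 128):
    # Activations: 1D blocks along the last dimension only.
    *lead, K = x.shape  # assumes K is a multiple of `block`
    blocks = x.reshape(*lead, K // block, block)
    amax = blocks.abs().amax(dim=-1, keepdim=True)
    scale = amax.clamp(min=1e-8) / 127.0
    q = (blocks / scale).round().clamp(-127, 127).to(torch.int8)
    return q.reshape(*lead, K), scale.squeeze(-1)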

More papers/code for reference:
Jetfire
Jetfire Repo
Deepseek v3 paper
Deepseek block-wise scaled fp8 implementation
Deepseek Deepgemm

Changes:

Core Implementation

  • BlockWiseINT8Layout: new quantization format with per-block scaling, built on the QuantizedLayout mechanism (here)
  • Triton-optimized CUDA kernels with PyTorch fallbacks (referencing this implementation); a simplified fallback sketch follows this list
  • Configurable block size (default: 128)
  • Necessary changes to the weight-adapter code that references the internal data dtype
  • Updated the MixedPrecisionOps implementation to load the new field (is_weight) for the new layout
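
As a rough mental model of what the PyTorch fallback path has to do, here is a hedged sketch that reuses the illustrative scale layout from the block above (not the PR's actual fallback code):

```python
import torch
import torch.nn.functional as F

def dequantize_weight_blockwise(q: torch.Tensor, scale: torch.Tensor, block: int = 128):
    # Expand each per-tile scale back over its (block x block) tile and multiply.
    s = scale.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    return q.to(torch.float32) * s

def fallback_linear(x, q_weight, w_scale, bias=None, block=128):
    # Pure-PyTorch fallback: dequantize the block-scaled weight, then run a
    # regular floating-point linear. A Triton kernel would instead keep the
    # matmul in int8 and apply the scales on the accumulator.
    w = dequantize_weight_blockwise(q_weight, w_scale, block).to(x.dtype)
    return F.linear(x, w, bias)
```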

Tests

  • Added unit tests in tests-unit/comfy_quant/test_quant_registry.py: they verify quantization/dequantization error, verify error for ops such as gemm, gelu, etc., and add runtime benchmarking (disabled by default); a sketch of this kind of check follows below.
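
For illustration, a round-trip error check in that spirit might look roughly like the following. This is a hypothetical standalone example built on the public API shown in the Usage section, not the PR's actual test code, and the 5% bound is an arbitrary illustrative threshold:

```python
import torch
from comfy.quant_ops import QuantizedTensor  # import path taken from the Usage section below

def test_blockwise_int8_roundtrip_error():
    torch.manual_seed(0)
    w = torch.randn(256, 512, device='cuda')
    q = QuantizedTensor.from_float(w, "BlockWiseINT8Layout", block_size=128, is_weight=True)
    rel_err = (q.dequantize().float() - w).abs().mean() / w.abs().mean()
    # Arbitrary illustrative bound: block-wise INT8 should keep the mean
    # relative round-trip error small on Gaussian weights.
    assert rel_err < 0.05
```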

Performance Benchmarks (RTX 4090)

  • Memory: ~50% reduction vs FP16
  • Speed:
    • Wan t2v model on RTX 4090 48GB VRAM (in seconds):
      • 5B
        • Block-wise scaled INT8 with linear+gelu+transpose: 168.05
        • Block-wise scaled INT8 with linear+gelu: 167.12
        • Block-wise scaled INT8 with linear: 171.33
        • FP16: 178.77
      • 14B (20 steps)
        • Block-wise scaled INT8 with linear+gelu+transpose: 233.23
        • Block-wise scaled INT8 with linear+gelu: 233.86
        • Block-wise scaled INT8 with linear: 236.90
        • FP8 scaled: 209.39
        • FP16: 253.19
      • 14B (on 4090 24GB VRAM, 5 steps with step distillation):
        • INT8: 85.72 sec (no offload)
        • FP16: 106.9 sec (10242MB offload)
        • FP8: 80 sec (no offload)


More detailed performance, precision, and memory-consumption comparison across Wan video model sizes:

Summary: FP16 vs INT8 vs FP8 performance

WAN2.2-5B:

| Layer                          | FP16     | INT8     | FP8      | INT8 speedup | FP8 speedup | Mem reduction |
|--------------------------------|----------|----------|----------|--------------|-------------|---------------|
| First layer (small batch)      | 0.146ms  | 0.268ms  | 0.117ms  | 0.54x        | 1.25x       | 2.00x         |
| Attention layer (long seq)     | 6.331ms  | 6.549ms  | 5.519ms  | 0.97x        | 1.15x       | 1.94x         |
| MLP down projection (long seq) | 30.536ms | 23.795ms | 18.422ms | 1.28x        | 1.66x       | 1.94x         |
| Attention layer (medium seq)   | 0.149ms  | 0.246ms  | 0.160ms  | 0.61x        | 0.93x       | 1.98x         |
| SUBTOTAL                       | 37.162ms | 30.857ms | 24.218ms | 1.20x        | 1.53x       |               |

WAN2.2-5B avg memory reduction: 1.97x
WAN2.2-5B avg INT8 precision error: 0.179672
WAN2.2-5B avg FP8 precision error: 0.389072
WAN2.2-5B VRAM usage: FP16 6138.19MB, INT8 7189.63MB (during inference with both)

WAN2.2-14B:

| Layer                        | FP16     | INT8     | FP8      | INT8 speedup | FP8 speedup | Mem reduction |
|------------------------------|----------|----------|----------|--------------|-------------|---------------|
| First layer (small batch)    | 0.360ms  | 0.395ms  | 0.268ms  | 0.91x        | 1.34x       | 2.00x         |
| Attention layer (long seq)   | 17.401ms | 15.633ms | 12.488ms | 1.11x        | 1.39x       | 1.94x         |
| Attention layer (medium seq) | 0.366ms  | 0.357ms  | 0.262ms  | 1.02x        | 1.40x       | 1.99x         |
| SUBTOTAL                     | 18.127ms | 16.385ms | 13.018ms | 1.11x        | 1.39x       |               |

WAN2.2-14B avg memory reduction: 1.98x
WAN2.2-14B avg INT8 precision error: 0.190389
WAN2.2-14B avg FP8 precision error: 0.365195
WAN2.2-14B VRAM usage: FP16 2829.11MB, INT8 3310.05MB (during inference with both)

Conclusion: INT8 is slower than FP8 but faster than FP16; its precision is better than FP8's, and its memory consumption is similar to FP8's.

Usage

```python
from comfy.quant_ops import QuantizedTensor

weight_int8 = QuantizedTensor.from_float(
    weight,
    "BlockWiseINT8Layout",
    block_size=128,
    is_weight=True
)

# to dequantize:
weight_float = weight_int8.dequantize()

# the following will internally trigger the INT8-based linear operation:
output = torch.nn.functional.linear(input, weight_int8)
```

Actual ComfyUI workflow test:

I've uploaded some quantized Wan2.2 models here and created this sample workflow.

Generation result:
https://github.com/user-attachments/assets/35227283-f8b6-4b7c-af18-6d86e6ed18f6

LoRA loading precision tests

I ran some more tests on the quantization/dequantization errors for model+LoRA here: compare_lora_error.ipynb

From the tests, it seems that for the original Wan model the new INT8 quantization's error is smaller, but if we load a LoRA and then quantize (the actual form the weights take at runtime), the error becomes larger than scaled FP8's.

I suspect this is because in some cases the LoRA shifts the weight distribution enough that some scales are no longer accurate. So I've added an argument to the model converter tool that accepts a LoRA, letting the converter consider both the original base model and the LoRA-applied version when quantizing (see below for more details; a rough sketch of the idea follows this paragraph).
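
As a loose illustration of what "considering both versions" could mean, one way to pick clip-safe scales is from the element-wise max of the base and LoRA-applied weights. The function below is a hypothetical sketch (reusing the tile layout from the earlier sketch), not the converter's actual code:

```python
import torch

def lora_aware_block_scales(w_base: torch.Tensor, w_lora_applied: torch.Tensor, block: int = 128):
    # Pick per-tile scales from the element-wise max of both weight variants,
    # so neither the base nor the LoRA-applied weights get clipped by the scale.
    M, N = w_base.shape  # assumes M and N are multiples of `block`
    combined = torch.maximum(w_base.abs(), w_lora_applied.abs())
    tiles = combined.reshape(M // block, block, N // block, block)
    amax = tiles.amax(dim=(1, 3))
    return amax.clamp(min=1e-8) / 127.0   # scales shaped (M//block, N//block)
```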

Tool to convert a model to block-wise scaled INT8 format:

convert_to_int8_blockwise.py

Usage:

```bash
# convert a model to the quantized version:
python convert_to_int8_blockwise.py input_original_model.safetensors output_quantized_model.safetensors

# convert, taking the specified LoRAs into account:
python convert_to_int8_blockwise.py input_original_model.safetensors output_quantized_model.safetensors \
    --lora lora1.safetensors --lora lora2.safetensors --lora-strength 1.0
```

@Kosinkadink (Collaborator) commented Nov 25, 2025

The changes for flux2 caused a slight conflict, could you take a look? Thanks!

@lyogavin lyogavin force-pushed the support_int8_quantization branch 2 times, most recently from 597ab49 to 16c2dfa Compare November 25, 2025 21:06
@lyogavin (Author):

> The changes for flux2 caused a slight conflict, could you take a look? Thanks!

Sure. I've resolved the conflicts. Thanks.

@lyogavin lyogavin force-pushed the support_int8_quantization branch 6 times, most recently from 7f9a65c to 86c0361 Compare November 26, 2025 18:41
```diff
     weight = weight_decompose(dora_scale, weight, lora_diff, alpha, strength, intermediate_dtype, function)
 else:
-    weight += function((strength * lora_diff).type(weight.dtype))
+    weight += function((strength * lora_diff).type(weight.dtype if not isinstance(weight, QuantizedTensor) else torch.float32))
```
Owner:

I don't think these are needed in the latest commit.

Author:

It's a little tricky: weight.dtype returns torch.int8, so if we don't add this, it will cast the LoRA diff to int8.

Here's an example:

```python
import torch
from comfy.quant_ops import QuantizedTensor

M, N = 256, 512
weight = torch.randn(M, N, dtype=torch.float32, device='cuda')

int8v = QuantizedTensor.from_float(weight, layout_type='BlockWiseINT8Layout', is_weight=True)
fp8v = QuantizedTensor.from_float(weight, layout_type='TensorCoreFP8Layout')

print(int8v.dtype, fp8v.dtype)
# output:
# torch.int8 torch.float8_e4m3fn
```

So code that directly accesses dtype might run into issues.

Maybe we should try to find a way to override Tensor.dtype?

Owner:

In the most recent code, the LoRA logic is called after QuantizedTensor.dequantize.

Owner:

Basically it does: convert_weight() -> apply the lora -> set_weight()

The convert and set weight functions are: https://github.com/comfyanonymous/ComfyUI/blob/master/comfy/ops.py#L502
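
Paraphrasing that flow as pseudocode (convert_weight/set_weight are the names from the comment above; patch_module_weight and apply_lora_patch are hypothetical placeholders, not the actual comfy/ops.py code):

```python
def patch_module_weight(module, lora_patches):
    weight = convert_weight(module.weight)        # QuantizedTensor -> plain float tensor
    for patch in lora_patches:
        weight = apply_lora_patch(weight, patch)  # LoRA math runs on the float weight
    set_weight(module, weight)                    # store back / re-quantize on the module
```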

Author:

I'll check. Thanks.

Author:

@comfyanonymous you are right. It seems that with change #10899 we no longer need to modify the types in the LoRA code, which is great. It also fixed the LoRA issue I mentioned in the description. I removed all the related unnecessary changes.

@lyogavin lyogavin force-pushed the support_int8_quantization branch from 86c0361 to 3322d21 Compare November 26, 2025 20:57
@Kosinkadink Kosinkadink added the Core Core team dependency label Dec 3, 2025
@lyogavin lyogavin force-pushed the support_int8_quantization branch from 3322d21 to dfaa2c1 Compare December 10, 2025 16:05
@lyogavin lyogavin requested a review from guill as a code owner December 10, 2025 16:05
@lyogavin lyogavin force-pushed the support_int8_quantization branch 2 times, most recently from ae0b939 to bbb4626 Compare December 10, 2025 18:22
…anism

add more tests by comparing with manual torch implementation

add perf benchmarks

fix errors caused by merging

default no output quant

fix unittest
@lyogavin lyogavin force-pushed the support_int8_quantization branch from 46679ef to 5ba2d28 Compare December 10, 2025 18:23
