
[Bug]: Bad output of GPT-OSS in AutoDeploy #9810

@Fridah-nv

Description

System Info

Tested on H100

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

There's an accuracy issue for unsloth/gpt-oss-20b-BF16 and openai/gpt-oss-20b across different configurations.
The example configs and outputs, ranked from best to worst:

  1. torch attention backend + torch-simple compile backend + greedy decoding: repetitive, but not too bad
args:
  mode: graph
  world_size: 1
  runtime: demollm
  compile_backend: torch-simple
  attn_page_size: 64
  attn_backend: torch
  model_factory: AutoModelForCausalLM
  skip_loading_weights: false
  disable_overlap_scheduler: true
  kv_cache_config:
    enable_block_reuse: false
  model_kwargs:
    torch_dtype: bfloat16
benchmark:
  enabled: false
prompt:
  sp_kwargs:
    top_k: 0
    temperature: 0
dry_run: false

Output is:

[12/02/2025-13:05:16] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 0] How big is the universe? : 1.5 trillion light years? 1.5 trillion? Wait: The observable universe radius is about 46.5 billion light years. But the entire universe might be infinite. But the question: "How big is the universe? 1.5 trillion light years? 1.5 trillion? 1.5 trillion? 1.5 trillion? 1.5 trillion? 1.5 trillion? 1.5 trillion? 1.5 trillion? 1
[12/02/2025-13:05:16] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: : 1) The 2nd law of Newton's law?

The second law of Newton's law states that the force acting on an object is equal to the mass of the object multiplied by its acceleration.

The second law of Newton's

The second law of Newton states that the force acting on an object is equal to the mass of the object multiplied by its acceleration.

The second law of Newton states that the force acting on an object is equal to the mass of the object multiplied by its
  2. torch attention backend + torch-simple compile backend + default sampling kwargs: much worse, though not totally random
[12/02/2025-13:06:43] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 0] How big is the universe? : 20km means? 0.99%'

They might want to know about the mass, but likely ask for visual diameter measured in kilometers.

Hence final answer: It's impossible to measure.

But we can compute size if mass constant.

Let's deliver finalcomend.

Alternatively, we can compute using baryon number vs mass.

Also compute scale as 'R ~ 13Mpc/h'.

Ok.

Let's craft answer accordingly.

Focus on measurement and mention that the radius is ~ $10^{23} m
[12/02/2025-13:06:43] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: :  It's the natural force that draws efficientlyDesign and produce; let's say planning in **OKM hotcountdown portion & answer plugin extent thou?**

It appears there might have been some typos or errors in your query. Let'sossi clarify or reinterpret what you're asking about.

It looks like your message might be a bit mixed up. Could you clarify what you need? Are you talking about a countdown, OKM (which could mean "Other Closed Markets" or something else?), planning,
  3. any attention backend or compile backend other than torch: output is totally random

The results above were obtained with unsloth/gpt-oss-20b-BF16; openai/gpt-oss-20b (using triton_mxfp4_moe) shows similar behaviour. I suggest
we look into the accuracy issue for the BF16 model first.

The error may come from several sources; likely culprits include the harmony chat template, the attention sink, etc.
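On the chat-template culprit: gpt-oss expects prompts rendered in the harmony layout, and feeding a raw or generically templated prompt alone can produce exactly this kind of rambling continuation. The sketch below builds an approximate harmony-style prompt purely for comparison against what the tokenizer actually renders; the special-token names are taken from the published harmony spec from memory and may not match the tokenizer exactly, so treat this as an assumption to verify with `tokenizer.apply_chat_template`:

```python
def harmony_prompt(user_msg: str, system_msg: str = "You are a helpful assistant.") -> str:
    # Approximate harmony chat layout for gpt-oss (sketch only; the real
    # rendering should come from tokenizer.apply_chat_template).
    return (
        f"<|start|>system<|message|>{system_msg}<|end|>"
        f"<|start|>user<|message|>{user_msg}<|end|>"
        f"<|start|>assistant"
    )

prompt = harmony_prompt("How big is the universe?")
print(prompt)
```

If AutoDeploy's prompt path emits something materially different from the template's own rendering, that would explain incoherence even with correct kernels.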

Expected behavior

Coherent output. We can also run this model with benchmarks like MMLU.

actual behavior

Outputs are incoherent or degenerate across all tested configs.

additional notes

Likely an accuracy regression introduced roughly two months ago.
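Before bisecting two months of commits, a cheap numerical check can rule the attention-sink culprit in or out. As I understand the gpt-oss design (this is my summary, not TRT-LLM code), each head carries an extra sink logit that joins the softmax normalization but contributes no value, so the weights over real tokens sum to less than 1; a backend that drops the sink (or applies a plain softmax) rescales every head's output. A minimal pure-Python sketch of the difference:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sink_softmax(logits, sink_logit):
    # The sink joins the normalization but is discarded afterwards,
    # so the weights over real tokens sum to less than 1.
    full = softmax(logits + [sink_logit])
    return full[:-1]

scores = [2.0, 1.0, 0.5]  # hypothetical per-token attention scores
plain = softmax(scores)
sunk = sink_softmax(scores, sink_logit=1.5)

print(sum(plain))  # 1.0
print(sum(sunk))   # < 1.0: the sink absorbed probability mass
```

A backend that produces `plain` where the reference produces `sunk` would be systematically off in every layer, which is consistent with the "coherent-ish but wrong" outputs above.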

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Assignees

No one assigned

    Labels

    AutoDeploy <NV> AutoDeploy Backend
    Customized kernels <NV> Specialized/modified CUDA kernels in TRTLLM for LLM ops, beyond standard TRT. Dev & perf.
    Decoding <NV> Token sampling algorithms in TRTLLM for text gen (top-k, top-p, beam).
    Inference runtime <NV> General operational aspects of TRTLLM execution not in other categories.
    bug Something isn't working

    Type: No type

    Projects: Backlog

    Milestone: No milestone

    Relationships: None yet

    Development: No branches or pull requests