Description
System Info
Tested on H100
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
There is an accuracy issue with unsloth/gpt-oss-20b-BF16 and openai/gpt-oss-20b across different configurations.
The example configs and their outputs, ranked from best to worst:
- using the torch attention backend and the torch-simple compile backend + greedy decoding: repetitive, but not too bad
```yaml
args:
  mode: graph
  world_size: 1
  runtime: demollm
  compile_backend: torch-simple
  attn_page_size: 64
  attn_backend: torch
  model_factory: AutoModelForCausalLM
  skip_loading_weights: false
  disable_overlap_scheduler: true
  kv_cache_config:
    enable_block_reuse: false
  model_kwargs:
    torch_dtype: bfloat16
benchmark:
  enabled: false
prompt:
  sp_kwargs:
    top_k: 0
    temperature: 0
dry_run: false
```
Output is:
```
[12/02/2025-13:05:16] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 0] How big is the universe? : 1.5 trillion light years? 1.5 trillion? Wait: The observable universe radius is about 46.5 billion light years. But the entire universe might be infinite. But the question: "How big is the universe? 1.5 trillion light years? 1.5 trillion? 1.5 trillion? 1.5 trillion? 1.5 trillion? 1.5 trillion? 1.5 trillion? 1.5 trillion? 1
[12/02/2025-13:05:16] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: : 1) The 2nd law of Newton's law?
The second law of Newton's law states that the force acting on an object is equal to the mass of the object multiplied by its acceleration.
The second law of Newton's
The second law of Newton states that the force acting on an object is equal to the mass of the object multiplied by its acceleration.
The second law of Newton states that the force acting on an object is equal to the mass of the object multiplied by its
```
- using the torch attention backend and the torch-simple compile backend + default sampling kwargs: much worse, though not totally random. Output is:
```
[12/02/2025-13:06:43] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 0] How big is the universe? : 20km means? 0.99%'
They might want to know about the mass, but likely ask for visual diameter measured in kilometers.
Hence final answer: It's impossible to measure.
But we can compute size if mass constant.
Let's deliver finalcomend.
Alternatively, we can compute using baryon number vs mass.
Also compute scale as 'R ~ 13Mpc/h'.
Ok.
Let's craft answer accordingly.
Focus on measurement and mention that the radius is ~ $10^{23} m
[12/02/2025-13:06:43] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: : It's the natural force that draws efficientlyDesign and produce; let's say planning in **OKM hotcountdown portion & answer plugin extent thou?**
It appears there might have been some typos or errors in your query. Let'sossi clarify or reinterpret what you're asking about.
It looks like your message might be a bit mixed up. Could you clarify what you need? Are you talking about a countdown, OKM (which could mean "Other Closed Markets" or something else?), planning,
```
- using any attention backend or compile backend other than torch: output is completely random
The results above were obtained with unsloth/gpt-oss-20b-BF16; openai/gpt-oss-20b (using triton_mxfp4_moe) shows similar behaviour. I suggest we look into the accuracy issue for the BF16 model first.
The error may come from several sources; possible culprits include the harmony chat template, attention-sink handling, etc. A reference check against plain Hugging Face transformers is sketched below.
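As a first isolation step, one could generate a greedy reference completion with plain Hugging Face transformers, building the prompt through the model's own chat template (which encodes the harmony format). This is only a sketch; the prompt, generation length, and model id choice are placeholders, not the exact repro inputs.

```python
# Sketch: greedy reference output with plain Hugging Face transformers.
# If this is coherent while the TRT-LLM AutoDeploy path is not, the regression is
# more likely in the runtime (attention-sink handling, kernels, sampling) than in
# the weights or the harmony chat template itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/gpt-oss-20b-BF16"  # or "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

messages = [{"role": "user", "content": "How big is the universe?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding, matching the temperature=0 / top_k=0 config above.
output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Comparing the tokenized prompt produced by `apply_chat_template` with the prompt the AutoDeploy path actually feeds the model would also confirm or rule out the chat-template hypothesis.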
Expected behavior
Coherent output. We should also be able to validate the model quantitatively with benchmarks such as MMLU; a minimal spot-check sketch follows.
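For a signal that is easier to track than eyeballing completions, a small MMLU-style spot check could be wired to whichever generation path is under test. This is only a sketch: `generate` is a placeholder callable (prompt in, completion out), and the subject and sample size are arbitrary.

```python
# Minimal MMLU-style spot check; `generate` stands in for whichever backend
# (HF transformers baseline or the TRT-LLM AutoDeploy path) is being evaluated.
from datasets import load_dataset

def mmlu_spot_check(generate, subject="abstract_algebra", n=50):
    ds = load_dataset("cais/mmlu", subject, split="test").select(range(n))
    letters = ["A", "B", "C", "D"]
    correct = 0
    for row in ds:
        prompt = (
            row["question"]
            + "\n"
            + "\n".join(f"{l}. {c}" for l, c in zip(letters, row["choices"]))
            + "\nAnswer with a single letter:"
        )
        pred = generate(prompt).strip()[:1].upper()
        correct += pred == letters[row["answer"]]
    return correct / n
```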
actual behavior
Outputs are degraded across the different configs, ranging from repetitive to completely incoherent.
additional notes
Likely an accuracy regression introduced roughly 2 months ago. Bisecting that window (e.g., with git bisect) could localize the offending change; a rough pass/fail predicate is sketched below.
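An automated predicate would make the bisection (or a CI check) hands-off. The heuristic below is purely illustrative: it flags the kind of degenerate repetition shown above by measuring how much of the completion consists of repeated n-grams; the thresholds are arbitrary.

```python
# Illustrative pass/fail predicate for bisecting the regression window: returns True
# when the completion is dominated by repeated n-grams, as in the
# "1.5 trillion? 1.5 trillion? ..." output above. Thresholds are arbitrary.
def looks_degenerate(text: str, n: int = 4, max_repeat_ratio: float = 0.5) -> bool:
    tokens = text.split()
    if len(tokens) < 2 * n:
        return False
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    unique_ratio = len(set(ngrams)) / len(ngrams)
    return (1.0 - unique_ratio) > max_repeat_ratio
```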
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.