
Training hangs at 0/12 with Qwen2.5-14B + Unsloth/vLLM (KV cache reported as 0.0 GB despite free VRAM) #472

@ansh-info

Description


I’m using the ART agent training pipeline with LangGraph and Qwen2.5-14B, and my training run consistently hangs at the start of the training step with no explicit error. The script does not crash, but the training progress bar never moves past 0/12 steps.

From the logs, it looks like Unsloth/vLLM is incorrectly concluding that there is no free GPU memory available for the KV cache, even though the model uses ~26 GB VRAM and the GPU has plenty of free memory.

Environment

  • Model: Qwen2.5-14B-Instruct (4-bit bnb via Unsloth, as configured by ART; the load log shows unsloth/qwen2.5-14b-instruct-unsloth-bnb-4bit)
  • Framework: ART agent training for LangGraph
  • GPU: NVIDIA H100 80GB (2x GPUs shown in logs)
  • CUDA: 12.6
  • Torch: 2.7.1+cu126
  • vLLM: 0.10.0
  • Transformers: 4.53.2
  • Unsloth: 2025.10.3 (per log banner)
  • OS: Linux

What happens

The run starts normally:

  • Trajectories are collected.
  • RULER reward stats are logged.
  • Trajectories are packed into sequences for training.

Then, when training with vLLM starts, I see:

2025-12-10 20:52:00,769 | INFO | RULER reward stats — step=59 | trajectories=44 | mean=0.812 | stdev=0.189
Packed 44 trajectories into 12 sequences of length 4096
2025-12-10 20:52:02,337 | INFO | Unsloth: Patching vLLM
train:   0%|                                                          | 0/12 [00:00<?, ?it/s]
==((====))==  Unsloth 2025.10.3: Fast Qwen2 patching. Transformers: 4.53.2. vLLM: 0.10.0.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 2. Max memory: 79.189 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.31. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Your GPU cannot handle sequence lengths of 256 due to limited GPU memory.
Unsloth: Your GPU can only handle approximately the maximum sequence length of 256.
Unsloth: vLLM loading unsloth/qwen2.5-14b-instruct-unsloth-bnb-4bit with actual GPU utilization = 12.79%
Unsloth: Your GPU has CUDA compute capability 9.0 with VRAM = 79.19 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 256. Num Sequences = 128.
Unsloth: vLLM's KV Cache can use up to 0.0 GB. Also swap space = 6 GB.
Unsloth: Not an error, but `device` is not supported in vLLM. Skipping.

After this, nothing else happens:

  • The progress bar stays at 0/12.
  • No Python traceback / exception is printed.
  • The process keeps running but never finishes the training step.

Why this seems wrong

  • Qwen2.5-14B with 4-bit quantization uses about 26 GB VRAM on this H100 80GB.

  • nvidia-smi shows that there is still plenty of free VRAM on the GPU during the run (well over the few GB that a KV cache would need for sequence lengths of 256–4096).

  • Despite that, Unsloth/vLLM logs:

    • Your GPU cannot handle sequence lengths of 256 due to limited GPU memory.
    • vLLM's KV Cache can use up to 0.0 GB.
  • This suggests either:

    • ART’s integration with Unsloth/vLLM is mis-configuring something like gpu_memory_utilization / max_model_len, or
    • There’s a bug in how available VRAM is being computed before allocating KV cache in this setup.

Because the KV cache size is computed as 0.0 GB, I suspect vLLM cannot actually schedule any tokens, and the training loop ends up waiting forever for outputs.
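
For reference, a quick free-memory check along these lines (an illustrative sketch, not part of ART’s training script) shows the headroom I would expect on the GPU that holds the model:

    # Free-VRAM check on cuda:0 while the 4-bit model is resident (illustrative sketch).
    import torch

    free_bytes, total_bytes = torch.cuda.mem_get_info(0)  # bytes free/total on device 0
    gib = 1024 ** 3
    print(f"free: {free_bytes / gib:.1f} GiB / total: {total_bytes / gib:.1f} GiB")

    # With ~26 GB used by the model on an 80 GB H100, roughly 50 GB should be free,
    # far more than a KV cache for sequence lengths of 256-4096 needs, yet the log
    # says "vLLM's KV Cache can use up to 0.0 GB".

If I understand vLLM’s gpu_memory_utilization semantics correctly (it is the total fraction of GPU memory the engine may use, weights included), the logged value of 12.79% corresponds to only ~10 GB of the 79.19 GB card, which is less than the ~26 GB the quantized weights need, so a zero KV-cache budget would follow from that.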

What I’ve tried

  • Verified there is no other heavy process on the same GPU (no big competing jobs).
  • Confirmed that Qwen2.5-14B fits fine on this GPU in isolation (see the standalone vLLM sketch after this list).
  • Reproduced multiple times; behavior is consistent.
  • I didn’t explicitly set gpu_memory_utilization or any other vLLM engine arguments in my code; I’m using the defaults provided by ART’s training script.
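
By “in isolation” I mean a plain vLLM load of the same model, roughly like the following (a minimal sketch, independent of ART/Unsloth; the gpu_memory_utilization and max_model_len values are just examples, not anything ART sets):

    # Standalone vLLM load, outside ART/Unsloth (minimal sketch).
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-14B-Instruct",
        gpu_memory_utilization=0.85,  # explicit VRAM budget for weights + KV cache
        max_model_len=4096,           # matches the packed sequence length in my run
    )

    out = llm.generate(["Say hello."], SamplingParams(max_tokens=16))
    print(out[0].outputs[0].text)

Loaded like this, vLLM normally reports a multi-gigabyte KV cache on an 80 GB card, which is why the 0.0 GB figure inside the ART run looks wrong to me.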

Questions

  1. Is ART setting any vLLM/Unsloth parameters like gpu_memory_utilization, max_model_len, or max_num_seqs that could cause the KV cache to be computed as 0.0 GB in this configuration?
  2. Is this a known issue when using Qwen2.5-14B with ART’s agent training + vLLM backend?
  3. Do you have recommended overrides for the vLLM engine args in this setup (e.g. context length, sequences per batch) to avoid the “KV Cache can use up to 0.0 GB” situation? The sketch below shows the kind of knobs I mean.
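
For concreteness, these are the standard vLLM engine arguments I would expect to matter here; what I don’t know is where (or whether) ART exposes them, so nothing below is meant to be ART’s API:

    # Candidate vLLM engine arguments to override (all standard vLLM kwargs).
    # Where and how ART would accept these is exactly what I'm asking about.
    engine_overrides = dict(
        gpu_memory_utilization=0.85,  # explicit VRAM budget instead of the derived 12.79%
        max_model_len=4096,           # match the packed sequence length
        max_num_seqs=64,              # cap concurrent sequences if memory is tight
        swap_space=4,                 # GiB of CPU swap space for the KV cache
    )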
