Description
I’m using the ART agent training pipeline with LangGraph and Qwen2.5-14B, and my training run consistently pauses mid-training with no explicit error. The script does not crash, but the training progress bar never moves past 0/12 steps.
From the logs, it looks like Unsloth/vLLM is incorrectly concluding that there is no free GPU memory available for the KV cache, even though the model uses ~26 GB VRAM and the GPU has plenty of free memory.
Environment
- Model: Qwen2.5-14B-Instruct (4-bit / 8-bit via Unsloth, as configured by ART)
- Framework: ART agent training for LangGraph
- GPU: NVIDIA H100 80GB (2x GPUs shown in logs)
- CUDA: 12.6
- Torch: 2.7.1+cu126
- vLLM: 0.10.0
- Transformers: 4.53.2
- Unsloth: 2025.10.3 (per log banner)
- OS: Linux
What happens
The run starts normally:
- Trajectories are collected.
- RULER reward stats are logged.
- Trajectories are packed into sequences for training.
Then, when training with vLLM starts, I see:

```
2025-12-10 20:52:00,769 | INFO | RULER reward stats — step=59 | trajectories=44 | mean=0.812 | stdev=0.189
Packed 44 trajectories into 12 sequences of length 4096
2025-12-10 20:52:02,337 | INFO | Unsloth: Patching vLLM
train: 0%| | 0/12 [00:00<?, ?it/s]
==((====))== Unsloth 2025.10.3: Fast Qwen2 patching. Transformers: 4.53.2. vLLM: 0.10.0.
\\ /| NVIDIA H100 80GB HBM3. Num GPUs = 2. Max memory: 79.189 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.7.1+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.1
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.31. FA2 = False]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Your GPU cannot handle sequence lengths of 256 due to limited GPU memory.
Unsloth: Your GPU can only handle approximately the maximum sequence length of 256.
Unsloth: vLLM loading unsloth/qwen2.5-14b-instruct-unsloth-bnb-4bit with actual GPU utilization = 12.79%
Unsloth: Your GPU has CUDA compute capability 9.0 with VRAM = 79.19 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 256. Num Sequences = 128.
Unsloth: vLLM's KV Cache can use up to 0.0 GB. Also swap space = 6 GB.
Unsloth: Not an error, but `device` is not supported in vLLM. Skipping.
```
After this, nothing else happens:
- The progress bar stays at 0/12.
- No Python traceback / exception is printed.
- The process keeps running but never finishes the training step.
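If it helps with triage, I can capture a stack dump of the hung process to show where the training step is blocked. A minimal sketch of what I'd add at the top of my training script (my own debugging addition, not part of ART):

```python
# Periodically dump all thread stacks to stderr so we can see where the
# training loop is blocked once the progress bar stops moving.
import faulthandler

faulthandler.dump_traceback_later(60, repeat=True)  # dump every 60 seconds
```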
Why this seems wrong
- Qwen2.5-14B with 4-bit quantization uses about 26 GB of VRAM on this H100 80GB.
- `nvidia-smi` shows that there is still plenty of free VRAM on the GPU during the run (well over the few GB that a KV cache would need for sequence lengths of 256–4096).
- Despite that, Unsloth/vLLM logs "Your GPU cannot handle sequence lengths of 256 due to limited GPU memory." and "vLLM's KV Cache can use up to 0.0 GB."
- This suggests either:
  - ART's integration with Unsloth/vLLM is mis-configuring something like `gpu_memory_utilization` / max model length, or
  - there's a bug in how available VRAM is computed before the KV cache is allocated in this setup.
Because the KV cache size is computed as 0.0 GB, I suspect vLLM cannot actually schedule any tokens, and the training loop ends up waiting forever for outputs.
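For scale, here is a rough back-of-envelope estimate of what the KV cache should actually need, assuming Qwen2.5-14B's published architecture (48 layers, 8 KV heads with GQA, head dim 128) and a bf16 cache; treat the constants as assumptions rather than measurements:

```python
# Back-of-envelope KV-cache size for Qwen2.5-14B (constants assumed from the
# public model config, not measured).
num_layers = 48      # transformer layers
num_kv_heads = 8     # GQA key/value heads
head_dim = 128       # per-head dimension
bytes_per_elem = 2   # bf16

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K and V
gib = 1024 ** 3

for seq_len in (256, 4096):
    print(f"seq_len={seq_len}: ~{bytes_per_token * seq_len / gib:.2f} GiB per sequence")
# seq_len=256: ~0.05 GiB per sequence
# seq_len=4096: ~0.75 GiB per sequence
```

Even at the logged "Num Sequences = 128" with length 256, that is on the order of 6 GiB, which this GPU clearly has free, so a computed budget of 0.0 GB looks like a memory-accounting problem rather than a real shortage.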
What I’ve tried
- Verified there is no other heavy process on the same GPU (no big competing jobs).
- Confirmed that Qwen2.5-14B fits fine on this GPU in isolation.
- Reproduced multiple times; behavior is consistent.
- I didn't explicitly set `gpu_memory_utilization` or any other vLLM engine arguments in my code; I'm using the defaults provided by ART's training script.
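One more check I plan to run: loading the same checkpoint with plain vLLM, outside the ART/Unsloth patching, to see whether the KV cache gets a sane budget there. A minimal sketch, assuming vLLM 0.10 can load this pre-quantized bitsandbytes checkpoint directly:

```python
# Standalone vLLM load test, outside the ART training loop, to check whether
# the KV-cache budget comes out non-zero with explicit engine args.
from vllm import LLM, SamplingParams

llm = LLM(
    model="unsloth/qwen2.5-14b-instruct-unsloth-bnb-4bit",
    quantization="bitsandbytes",   # assumption: bnb support applies to this checkpoint
    max_model_len=4096,            # matches the packed sequence length
    gpu_memory_utilization=0.85,   # explicit, instead of whatever the integration derives
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```

If this also reports a near-zero KV cache, the problem is likely vLLM's free-memory accounting on this machine; if it behaves, the values that ART/Unsloth pass in are the more likely culprit.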
Questions
- Is ART setting any vLLM/Unsloth parameters like `gpu_memory_utilization`, `max_model_len`, or `max_num_seqs` that could cause the KV cache to be computed as 0.0 GB in this configuration?
- Is this a known issue when using Qwen2.5-14B with ART's agent training + vLLM backend?
- Do you have recommended overrides for the vLLM engine args in this setup (e.g. context length, sequences per batch) to avoid the "KV Cache can use up to 0.0 GB" situation?
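For the last question, this is the shape of override I was hoping to apply. I could not find the actual configuration surface for these in ART, so the snippet only lists the knobs; pointing me at the real place to set them would already answer the question:

```python
# HYPOTHETICAL: the vLLM engine arguments I'd like ART to apply for this model.
# I don't know ART's real configuration surface for these; that's the question.
desired_engine_args = {
    "gpu_memory_utilization": 0.85,  # give the KV cache a real budget
    "max_model_len": 4096,           # matches the packed sequence length
    "max_num_seqs": 32,              # fewer concurrent sequences than the default 128
}
```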