Description
I’m using the ART agent training pipeline with LangGraph and Qwen2.5-14B, and my training run consistently pauses mid-training with no explicit error. The script does not crash, but the training progress bar never moves past 0/12 steps.
From the logs, it looks like Unsloth/vLLM is incorrectly concluding that there is no free GPU memory available for the KV cache, even though the model uses ~26 GB VRAM and the GPU has plenty of free memory.
Environment
- Model: Qwen2.5-14B-Instruct (4-bit / 8-bit via Unsloth, as configured by ART)
- Framework: ART agent training for LangGraph
- GPU: NVIDIA H100 80GB (2x GPUs shown in logs)
- CUDA: 12.6
- Torch: 2.7.1+cu126
- vLLM: 0.10.0
- Transformers: 4.53.2
- Unsloth: 2025.10.3 (per log banner)
- OS: Linux
What happens
The run starts normally:
- Trajectories are collected.
- RULER reward stats are logged.
- Trajectories are packed into sequences for training.
Then, when training with vLLM starts, I see:

```
2025-12-10 20:52:00,769 | INFO | RULER reward stats — step=59 | trajectories=44 | mean=0.812 | stdev=0.189
Packed 44 trajectories into 12 sequences of length 4096
2025-12-10 20:52:02,337 | INFO | Unsloth: Patching vLLM
train: 0%| | 0/12 [00:00<?, ?it/s]
==((====))== Unsloth 2025.10.3: Fast Qwen2 patching. Transformers: 4.53.2. vLLM: 0.10.0.
\\ /| NVIDIA H100 80GB HBM3. Num GPUs = 2. Max memory: 79.189 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.7.1+cu126. CUDA: 9.0. CUDA Toolkit: 12.6. Triton: 3.3.1
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.31. FA2 = False]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Your GPU cannot handle sequence lengths of 256 due to limited GPU memory.
Unsloth: Your GPU can only handle approximately the maximum sequence length of 256.
Unsloth: vLLM loading unsloth/qwen2.5-14b-instruct-unsloth-bnb-4bit with actual GPU utilization = 12.79%
Unsloth: Your GPU has CUDA compute capability 9.0 with VRAM = 79.19 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 256. Num Sequences = 128.
Unsloth: vLLM's KV Cache can use up to 0.0 GB. Also swap space = 6 GB.
Unsloth: Not an error, but `device` is not supported in vLLM. Skipping.
```
After this, nothing else happens:
- The progress bar stays at 0/12.
- No Python traceback / exception is printed.
- The process keeps running but never finishes the training step.
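If it helps with triage, I can capture a stack dump of the hung process to show where the training step is blocked. A minimal sketch of what I'd add at the top of my training script (my own debugging addition, not part of ART):

```python
# Periodically dump all thread stacks to stderr so we can see where the
# training loop is blocked once the progress bar stops moving.
import faulthandler

faulthandler.dump_traceback_later(60, repeat=True)  # dump every 60 seconds
```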
Why this seems wrong
- Qwen2.5-14B with 4-bit quantization uses about 26 GB of VRAM on this H100 80GB.
- `nvidia-smi` shows that there is still plenty of free VRAM on the GPU during the run (well over the few GB that a KV cache would need for sequence lengths of 256–4096).
- Despite that, Unsloth/vLLM logs "Your GPU cannot handle sequence lengths of 256 due to limited GPU memory." and "vLLM's KV Cache can use up to 0.0 GB."
- This suggests either:
  - ART's integration with Unsloth/vLLM is mis-configuring something like `gpu_memory_utilization` / max model length, or
  - there's a bug in how available VRAM is computed before the KV cache is allocated in this setup.
Because the KV cache size is computed as 0.0 GB, I suspect vLLM cannot actually schedule any tokens, and the training loop ends up waiting forever for outputs.
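For scale, here is a rough back-of-envelope estimate of what the KV cache should actually need, assuming Qwen2.5-14B's published architecture (48 layers, 8 KV heads with GQA, head dim 128) and a bf16 cache; treat the constants as assumptions rather than measurements:

```python
# Back-of-envelope KV-cache size for Qwen2.5-14B (constants assumed from the
# public model config, not measured).
num_layers = 48      # transformer layers
num_kv_heads = 8     # GQA key/value heads
head_dim = 128       # per-head dimension
bytes_per_elem = 2   # bf16

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K and V
gib = 1024 ** 3

for seq_len in (256, 4096):
    print(f"seq_len={seq_len}: ~{bytes_per_token * seq_len / gib:.2f} GiB per sequence")
# seq_len=256: ~0.05 GiB per sequence
# seq_len=4096: ~0.75 GiB per sequence
```

Even at the logged "Num Sequences = 128" with length 256, that is on the order of 6 GiB, which this GPU clearly has free, so a computed budget of 0.0 GB looks like a memory-accounting problem rather than a real shortage.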
What I’ve tried
- Verified there is no other heavy process on the same GPU (no big competing jobs).
- Confirmed that Qwen2.5-14B fits fine on this GPU in isolation.
- Reproduced multiple times; behavior is consistent.
- I didn't explicitly set `gpu_memory_utilization` or any other vLLM engine arguments in my code; I'm using the defaults provided by ART's training script.
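One more check I plan to run: loading the same checkpoint with plain vLLM, outside the ART/Unsloth patching, to see whether the KV cache gets a sane budget there. A minimal sketch, assuming vLLM 0.10 can load this pre-quantized bitsandbytes checkpoint directly:

```python
# Standalone vLLM load test, outside the ART training loop, to check whether
# the KV-cache budget comes out non-zero with explicit engine args.
from vllm import LLM, SamplingParams

llm = LLM(
    model="unsloth/qwen2.5-14b-instruct-unsloth-bnb-4bit",
    quantization="bitsandbytes",   # assumption: bnb support applies to this checkpoint
    max_model_len=4096,            # matches the packed sequence length
    gpu_memory_utilization=0.85,   # explicit, instead of whatever the integration derives
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```

If this also reports a near-zero KV cache, the problem is likely vLLM's free-memory accounting on this machine; if it behaves, the values that ART/Unsloth pass in are the more likely culprit.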
Questions
- Is ART setting any vLLM/Unsloth parameters like `gpu_memory_utilization`, `max_model_len`, or `max_num_seqs` that could cause the KV cache to be computed as 0.0 GB in this configuration?
- Is this a known issue when using Qwen2.5-14B with ART's agent training + vLLM backend?
- Do you have recommended overrides for the vLLM engine args in this setup (e.g. context length, sequences per batch) to avoid the "KV Cache can use up to 0.0 GB" situation?
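For the last question, this is the shape of override I was hoping to apply. I could not find the actual configuration surface for these in ART, so the snippet only lists the knobs; pointing me at the real place to set them would already answer the question:

```python
# HYPOTHETICAL: the vLLM engine arguments I'd like ART to apply for this model.
# I don't know ART's real configuration surface for these; that's the question.
desired_engine_args = {
    "gpu_memory_utilization": 0.85,  # give the KV cache a real budget
    "max_model_len": 4096,           # matches the packed sequence length
    "max_num_seqs": 32,              # fewer concurrent sequences than the default 128
}
```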