What happened?
I’m running ART on a machine with multiple NVIDIA H100 80GB GPUs.
ART automatically detects multiple GPUs and logs messages like:
`NVIDIA H100 80GB HBM3. Num GPUs = 2` (or 4)
However, as soon as I try to use multi-GPU inference or training through ART’s internal Unsloth + vLLM engine, the system becomes unstable:
1. When I set `CUDA_VISIBLE_DEVICES` to limit GPUs (e.g., `export CUDA_VISIBLE_DEVICES=0,1`), the engine crashes
Even though:
- Those GPUs are valid
- vLLM is supposed to support multi-GPU tensor parallelism
- Unsloth detects the GPUs correctly
The moment ART initializes the model, I get:
- vLLM engine crashes
- Runtime exceptions (varies)
- Training or inference does not start
2. Setting tensor parallelism to 1 or more also breaks execution
Even with `tensor_parallel=1`, or when passing the parameters through the backend config, the engine fails to initialize (a minimal sketch of this setup follows after this list).
Instead of loading the model across the selected GPUs, ART/Unsloth/vLLM simply crashes without completing startup.
3. I cannot run any parallel execution even though hardware supports it
Even basic inference through the ART model wrapper fails to run if:
- More than one GPU is present or
- The model attempts to distribute across GPUs
4. This prevents using larger models (e.g., Qwen2.5-14B, Qwen3-32B) effectively
Since models must fit on a single GPU, this limits model size and sequence length, and makes RL training very slow.
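For concreteness, the failing setup looks roughly like the sketch below. It is a minimal sketch only: the `_internal_config` / `art.dev.EngineArgs` route for forwarding `tensor_parallel_size` is my best recollection of ART's dev config, so the exact import paths and field names may be assumptions.

```python
import asyncio

import art
from art.local import LocalBackend


async def main() -> None:
    model = art.TrainableModel(
        name="multi-gpu-repro",
        project="multi-gpu-repro",
        base_model="Qwen/Qwen2.5-14B-Instruct",
        # Assumption: forwarding vLLM's tensor_parallel_size through
        # ART's dev config; the field names below may not match the
        # current API exactly.
        _internal_config=art.dev.InternalModelConfig(
            engine_args=art.dev.EngineArgs(tensor_parallel_size=2),
        ),
    )
    backend = LocalBackend()
    # With two or more GPUs visible, the crash happens here, during
    # Unsloth/vLLM initialization, before any rollouts or training.
    await model.register(backend)


if __name__ == "__main__":
    asyncio.run(main())
```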
My setup
- 4 × NVIDIA H100 80GB (PCIe)
- Ubuntu Linux
- CUDA 12.6
- PyTorch 2.7.1
- vLLM 0.10.x
- Unsloth 2025.10.3
- ART latest version (Dec 2025)
- Models tested:
  - Qwen2.5-14B-Instruct
  - Qwen3-32B (FP16 and quantized)
  - Smaller models also experience the GPU crash on parallel init
I export GPU visibility like:
`export CUDA_VISIBLE_DEVICES=0,1`

ART detects this correctly and logs the right GPU count, but fails during Unsloth/vLLM initialization.
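To confirm the mask is applied before ART starts, a quick check like this (plain PyTorch, nothing ART-specific) reports the visible device count:

```python
import os

import torch

# Sanity check that the CUDA_VISIBLE_DEVICES mask is in effect;
# with CUDA_VISIBLE_DEVICES=0,1 this should report 2 devices.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.cuda.device_count() =", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```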
What I expected
- ART should be able to run Unsloth + vLLM across multiple GPUs using tensor parallelism.
- If `CUDA_VISIBLE_DEVICES="0,1"` is set, ART should cleanly use those two GPUs.
- No engine crash during initialization.
- Ability to train/infer on larger models that require parallelism.
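For comparison, this is the kind of bare vLLM tensor-parallel run (no ART or Unsloth involved) that I would expect to work on this hardware; the model name and memory fraction here are just illustrative:

```python
from vllm import LLM, SamplingParams

# Plain vLLM with tensor parallelism across the two visible H100s,
# no ART or Unsloth in the loop.
llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.85,
)
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```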
Actual behavior
- vLLM / Unsloth engine crashes immediately
- No clear error message (varies by run)
- Parallel execution never enters the training loop
- Only single-GPU mode works, limiting model size and speed
Why this is a problem
- ART advertises support for large-scale RL training.
- But without working multi-GPU support, users cannot:
  - Run 14B / 32B models at full context length
  - Scale training
  - Perform faster rollouts
  - Avoid VRAM bottlenecks
- This makes multi-GPU deployments unusable in practice.
Potential cause (hypotheses)
After checking logs, it seems:
1. Unsloth’s vLLM patching may incorrectly disable TP or enforce single-GPU mode
The log even includes:
`Unsloth: vLLM's KV Cache can use up to 0.0 GB`
when multiple GPUs are visible (see the per-GPU memory dump after this list).
2. vLLM engine may not be receiving correct TP configuration from ART
No TP config seems to propagate, or the engine rejects it and crashes.
3. `CUDA_VISIBLE_DEVICES` filtering + Unsloth’s GPU detection produces inconsistent internal state
ART logs the correct number of GPUs, but vLLM later fails.
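To help narrow down the "0.0 GB KV cache" symptom, a per-GPU memory dump like this (plain PyTorch, nothing ART-specific) would show whether the visible devices actually report free memory at startup:

```python
import torch

# Print free/total memory for every visible GPU. If Unsloth's KV cache
# budget comes out as 0.0 GB, comparing it against these numbers helps
# separate a detection bug from genuinely exhausted memory.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free / {total / 1e9:.1f} GB total")
```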
Requested fix / enhancements
- Proper multi-GPU vLLM + Unsloth support inside ART
- Explicit configuration options for TP, PP, and GPU selection
- Graceful failure with meaningful error messages
- Documentation for how ART expects multi-GPU environments to be configured
- Validation that `CUDA_VISIBLE_DEVICES` is respected by all internal components
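Purely to illustrate the kind of configuration surface I mean (these names do not exist in ART today, as far as I know; this is a proposal sketch, not an existing API):

```python
from dataclasses import dataclass, field


# Hypothetical configuration object, only to illustrate the request;
# none of these names are real ART options.
@dataclass
class ParallelismConfig:
    tensor_parallel_size: int = 1      # vLLM tensor parallelism
    pipeline_parallel_size: int = 1    # vLLM pipeline parallelism
    gpu_ids: list[int] = field(default_factory=list)  # explicit GPU selection
    fail_fast: bool = True             # clear error instead of a hard crash


config = ParallelismConfig(tensor_parallel_size=2, gpu_ids=[0, 1])
print(config)
```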
I can provide full logs, environment files, or run debugging commands if needed.
Let me know what additional details would help.