What happened?
I’m running ART on a machine with multiple NVIDIA H100 80GB GPUs.
ART automatically detects multiple GPUs and logs messages like:
`NVIDIA H100 80GB HBM3. Num GPUs = 2` (or 4)
However, as soon as I try to use multi-GPU inference or training through ART’s internal Unsloth + vLLM engine, the system becomes unstable:
1. When I set `CUDA_VISIBLE_DEVICES` to limit GPUs (e.g., `export CUDA_VISIBLE_DEVICES=0,1`), the engine crashes
Even though:
- Those GPUs are valid
- vLLM is supposed to support multi-GPU tensor parallelism
- Unsloth detects the GPUs correctly
The moment ART initializes the model, I get:
- vLLM engine crashes
- Runtime exceptions (varies)
- Training or inference does not start
2. Setting tensor parallelism to 1 or more also breaks execution
Even with `tensor_parallel=1`, or when passing the parameters through the backend config, the engine fails to initialize (a minimal sketch of this setup follows after this list).
Instead of loading the model across the selected GPUs, ART/Unsloth/vLLM simply crashes without completing startup.
3. I cannot run any parallel execution even though hardware supports it
Even basic inference through the ART model wrapper fails to run if:
- More than one GPU is present or
- The model attempts to distribute across GPUs
4. This prevents using larger models (e.g., Qwen2.5-14B, Qwen3-32B) effectively
Since models must fit on a single GPU, this limits model size and sequence length, and makes RL training very slow.
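For concreteness, the failing setup looks roughly like the sketch below. It is a minimal sketch only: the `_internal_config` / `art.dev.EngineArgs` route for forwarding `tensor_parallel_size` is my best recollection of ART's dev config, so the exact import paths and field names may be assumptions.

```python
import asyncio

import art
from art.local import LocalBackend


async def main() -> None:
    model = art.TrainableModel(
        name="multi-gpu-repro",
        project="multi-gpu-repro",
        base_model="Qwen/Qwen2.5-14B-Instruct",
        # Assumption: forwarding vLLM's tensor_parallel_size through
        # ART's dev config; the field names below may not match the
        # current API exactly.
        _internal_config=art.dev.InternalModelConfig(
            engine_args=art.dev.EngineArgs(tensor_parallel_size=2),
        ),
    )
    backend = LocalBackend()
    # With two or more GPUs visible, the crash happens here, during
    # Unsloth/vLLM initialization, before any rollouts or training.
    await model.register(backend)


if __name__ == "__main__":
    asyncio.run(main())
```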
My setup
- 4 × NVIDIA H100 80GB (PCIe)
- Ubuntu Linux
- CUDA 12.6
- PyTorch 2.7.1
- vLLM 0.10.x
- Unsloth 2025.10.3
- ART latest version (Dec 2025)
- Models tested:
  - Qwen2.5-14B-Instruct
  - Qwen3-32B (FP16 and quantized)
  - Smaller models also experience the GPU crash on parallel init
I export GPU visibility like:
`export CUDA_VISIBLE_DEVICES=0,1`

ART detects this correctly and logs the right GPU count, but fails during Unsloth/vLLM initialization.
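To confirm the mask is applied before ART starts, a quick check like this (plain PyTorch, nothing ART-specific) reports the visible device count:

```python
import os

import torch

# Sanity check that the CUDA_VISIBLE_DEVICES mask is in effect;
# with CUDA_VISIBLE_DEVICES=0,1 this should report 2 devices.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.cuda.device_count() =", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```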
What I expected
- ART should be able to run Unsloth + vLLM across multiple GPUs using tensor parallelism.
- If `CUDA_VISIBLE_DEVICES="0,1"` is set, ART should cleanly use those two GPUs.
- No engine crash during initialization.
- Ability to train/infer on larger models that require parallelism.
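For comparison, this is the kind of bare vLLM tensor-parallel run (no ART or Unsloth involved) that I would expect to work on this hardware; the model name and memory fraction here are just illustrative:

```python
from vllm import LLM, SamplingParams

# Plain vLLM with tensor parallelism across the two visible H100s,
# no ART or Unsloth in the loop.
llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.85,
)
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```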
Actual behavior
- vLLM / Unsloth engine crashes immediately
- No clear error message (varies by run)
- Parallel execution never enters the training loop
- Only single-GPU mode works, limiting model size and speed
Why this is a problem
- ART advertises support for large-scale RL training.
- But without working multi-GPU support, users cannot:
  - Run 14B / 32B models at full context length
  - Scale training
  - Perform faster rollouts
  - Avoid VRAM bottlenecks
- This makes multi-GPU deployments unusable in practice.
Potential cause (hypotheses)
After checking logs, it seems:
1. Unsloth’s vLLM patching may incorrectly disable TP or enforce single-GPU mode
The log even includes:
`Unsloth: vLLM's KV Cache can use up to 0.0 GB`
when multiple GPUs are visible (see the per-GPU memory dump after this list).
2. vLLM engine may not be receiving correct TP configuration from ART
No TP config seems to propagate, or the engine rejects it and crashes.
3. `CUDA_VISIBLE_DEVICES` filtering + Unsloth’s GPU detection produces inconsistent internal state
ART logs the correct number of GPUs, but vLLM later fails.
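To help narrow down the "0.0 GB KV cache" symptom, a per-GPU memory dump like this (plain PyTorch, nothing ART-specific) would show whether the visible devices actually report free memory at startup:

```python
import torch

# Print free/total memory for every visible GPU. If Unsloth's KV cache
# budget comes out as 0.0 GB, comparing it against these numbers helps
# separate a detection bug from genuinely exhausted memory.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free / {total / 1e9:.1f} GB total")
```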
Requested fix / enhancements
- Proper multi-GPU vLLM + Unsloth support inside ART
- Explicit configuration options for TP, PP, and GPU selection
- Graceful failure with meaningful error messages
- Documentation for how ART expects multi-GPU environments to be configured
- Validation that `CUDA_VISIBLE_DEVICES` is respected by all internal components
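Purely to illustrate the kind of configuration surface I mean (these names do not exist in ART today, as far as I know; this is a proposal sketch, not an existing API):

```python
from dataclasses import dataclass, field


# Hypothetical configuration object, only to illustrate the request;
# none of these names are real ART options.
@dataclass
class ParallelismConfig:
    tensor_parallel_size: int = 1      # vLLM tensor parallelism
    pipeline_parallel_size: int = 1    # vLLM pipeline parallelism
    gpu_ids: list[int] = field(default_factory=list)  # explicit GPU selection
    fail_fast: bool = True             # clear error instead of a hard crash


config = ParallelismConfig(tensor_parallel_size=2, gpu_ids=[0, 1])
print(config)
```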
I can provide full logs, environment files, or run debugging commands if needed.
Let me know what additional details would help.