
Unsloth + vLLM multi-GPU setup in ART crashes when restricting CUDA_VISIBLE_DEVICES or using tensor parallelism >1 on H100 GPUs #478

@ansh-info


What happened?

I’m running ART on a machine with multiple NVIDIA H100 80GB GPUs.
ART automatically detects multiple GPUs and logs messages like:

NVIDIA H100 80GB HBM3. Num GPUs = 2 (or 4)

However, as soon as I try to use multi-GPU inference or training through ART’s internal Unsloth + vLLM engine, the system becomes unstable:

1. When I set CUDA_VISIBLE_DEVICES to limit GPUs (e.g., export CUDA_VISIBLE_DEVICES=0,1), the engine crashes

Even though:

  • Those GPUs are valid
  • vLLM is supposed to support multi-GPU tensor parallelism
  • Unsloth detects the GPUs correctly

The moment ART initializes the model, I get (minimal repro sketch below this list):

  • vLLM engine crashes
  • Runtime exceptions (varies)
  • Training or inference does not start
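
Below is a minimal sketch of the failing path. The TrainableModel / LocalBackend names are paraphrased from the ART quickstart and may not match the current release exactly; the point is that the crash happens at model registration, before any rollout or training step:

# Repro sketch -- ART class/import names paraphrased from the quickstart,
# so treat them as approximate rather than exact.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # restrict to two of the four H100s

import asyncio

import art
from art.local import LocalBackend


async def main():
    model = art.TrainableModel(
        name="repro-agent",                      # placeholder name
        project="multi-gpu-repro",               # placeholder project
        base_model="Qwen/Qwen2.5-14B-Instruct",
    )
    backend = LocalBackend()
    await model.register(backend)  # Unsloth/vLLM engine crashes during this call


asyncio.run(main())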

2. Setting tensor parallelism explicitly (even to 1) also breaks execution

Even with:

tensor_parallel=1

or passing parameters through the backend config, the engine fails to initialize.

Instead of loading the model across the selected GPUs, the ART/Unsloth/vLLM stack simply crashes without completing startup.
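
To help separate ART/Unsloth from vLLM itself, the check below drives vLLM directly through its documented LLM entrypoint on the same two GPUs. If this initializes with tensor_parallel_size=2 while ART's patched path does not, the problem is in the wrapper rather than in vLLM:

# Standalone vLLM sanity check (no ART, no Unsloth patching).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",
    tensor_parallel_size=2,        # one shard per visible H100
    gpu_memory_utilization=0.90,
)
outputs = llm.generate(["Hello from two GPUs"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)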

3. I cannot run any parallel execution even though the hardware supports it

Even basic inference through the ART model wrapper fails to run if:

  • More than one GPU is present, or
  • The model attempts to distribute across GPUs

4. This prevents using larger models (e.g., Qwen2.5-14B, Qwen3-32B) effectively

Since models must fit on a single GPU, this limits model size and sequence length, and makes RL training very slow.


My setup

  • 4 × NVIDIA H100 80GB (PCIe)

  • Running Ubuntu Linux

  • CUDA 12.6

  • PyTorch 2.7.1

  • vLLM 0.10.x

  • Unsloth 2025.10.3

  • ART latest version (Dec 2025)

  • Models tested:

    • Qwen2.5-14B-Instruct
    • Qwen3-32B (FP16 and quantized)
    • Smaller models also experience the GPU crash on parallel init

I export GPU visibility like this:

export CUDA_VISIBLE_DEVICES=0,1

ART detects this correctly, logs the right GPU count, but fails during Unsloth/vLLM initialization.
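
For completeness, this is the plain-PyTorch visibility check (no ART or Unsloth imported) I use to confirm the process sees exactly the exported devices:

# Visibility check outside ART (plain PyTorch only).
import os
import torch

print(os.environ.get("CUDA_VISIBLE_DEVICES"))   # expect "0,1"
print(torch.cuda.device_count())                # expect 2
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(i))        # expect NVIDIA H100 80GB HBM3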


What I expected

  • ART should be able to run Unsloth + vLLM across multiple GPUs using tensor parallelism.
  • If CUDA_VISIBLE_DEVICES="0,1" is set, ART should cleanly use those two GPUs.
  • No engine crash during initialization.
  • Ability to train/infer on larger models that require parallelism.

Actual behavior

  • vLLM / Unsloth engine crashes immediately
  • No clear error message (varies by run)
  • Parallel execution never enters the training loop
  • Only single-GPU mode works, limiting model size and speed

Why this is a problem

  • ART advertises support for large-scale RL training

  • But without working multi-GPU support, users cannot:

    • Run 14B / 32B models at full context length
    • Scale training
    • Perform faster rollouts
    • Avoid VRAM bottlenecks
  • This makes multi-GPU deployments unusable in practice


Potential cause (hypotheses)

After checking the logs, it seems that:

1. Unsloth’s vLLM patching may incorrectly disable TP or enforce single-GPU mode

The log even includes:

Unsloth: vLLM's KV Cache can use up to 0.0 GB

when multiple GPUs are visible.
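
To rule out the devices simply being full when that line is logged, this is the plain PyTorch check I run for free memory on each visible GPU:

# Free/total memory per visible GPU (plain PyTorch, no ART involved).
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB total")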

2. vLLM engine may not be receiving correct TP configuration from ART

No TP config seems to propagate, or the engine rejects it and crashes.
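
For comparison, this is roughly what would need to reach the vLLM layer for TP=2 to engage. AsyncEngineArgs and its fields below are vLLM's own (not ART options); my suspicion is that ART/Unsloth never forwards the equivalent of tensor_parallel_size, or overrides it back to 1:

# vLLM-side engine arguments that a TP=2 setup would need
# (vllm.engine.arg_utils.AsyncEngineArgs is vLLM's own dataclass).
from vllm.engine.arg_utils import AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model="Qwen/Qwen2.5-14B-Instruct",
    tensor_parallel_size=2,         # split across both visible GPUs
    gpu_memory_utilization=0.90,    # leave headroom for training buffers
)
print(engine_args.tensor_parallel_size)   # this value never seems to survive ART's wrapping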

3. CUDA_VISIBLE_DEVICES filtering + Unsloth’s GPU detection produces inconsistent internal state

ART logs the correct number of GPUs, but vLLM later fails.


Requested fix / enhancements

  1. Proper multi-GPU vLLM + Unsloth support inside ART
  2. Explicit configuration options for TP, PP, and GPU selection
  3. Graceful failure with meaningful error messages
  4. Documentation for how ART expects multi-GPU environments to be configured
  5. Validation that CUDA_VISIBLE_DEVICES is respected by all internal components

I can provide full logs and environment files, or run debugging commands, if needed.

Let me know what additional details would help.
