I'm encountering an issue similar to #395 with the latest version of ART (0.5.3) and vLLM 0.10.0, running in a local script (outside of Colab or a notebook).
Training runs fine for 2 batches and then suddenly hangs in the gather step of the next batch:
Gather step 2: 83%|████████▎ | 5/6
There is no error message or any other output; it just hangs.
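
In case it helps narrow this down, here is a minimal sketch of one way to find which of the concurrent rollouts never returns. It is plain `asyncio` and does not use ART's API: `gather_with_timeout` and `timeout_s` are hypothetical names, and it assumes the rollouts are ordinary coroutines that could be wrapped this way before being gathered.

```python
import asyncio


async def gather_with_timeout(coros, timeout_s):
    """Run coroutines concurrently and report which ones exceeded the timeout.

    Hypothetical helper for debugging, not part of ART.
    Returns (results, stalled_indices); a stalled entry has result None.
    """

    async def guarded(index, coro):
        try:
            # wait_for cancels the coroutine and raises if it takes too long
            value = await asyncio.wait_for(coro, timeout=timeout_s)
            return index, value, False
        except asyncio.TimeoutError:
            return index, None, True

    # gather preserves input order, so results line up with coros
    outcomes = await asyncio.gather(
        *(guarded(i, c) for i, c in enumerate(coros))
    )
    results = [value for _, value, _ in outcomes]
    stalled = [i for i, _, timed_out in outcomes if timed_out]
    return results, stalled
```

This only helps if the stuck rollout is waiting on an `await`. If something blocks the event loop itself with a synchronous call, the timeout cannot fire; in that case calling the standard library's `faulthandler.dump_traceback_later(60)` at the start of the script should print every thread's stack after 60 seconds and show where it is stuck.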