Description
The error occurs during the initial perplexity evaluation (Epoch 0/1). Full log:
[2025-12-16 23:27:41,996] [WARNING] [runner.py:232:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2025-12-16 23:27:41,997] [INFO] [runner.py:630:main] cmd = /root/anaconda3/envs/deepspeed/bin/python3.12 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None --log_level=info main.py --model_name_or_path /home/nbw/model/opt-1.3b/ --gradient_accumulation_steps 8 --lora_dim 128 --zero_stage 0 --enable_tensorboard --tensorboard_path ./output --deepspeed --output_dir ./output
[2025-12-16 23:27:48,547] [INFO] [launch.py:162:main] WORLD INFO DICT: {'localhost': [0]}
[2025-12-16 23:27:48,547] [INFO] [launch.py:168:main] nnodes=1, num_local_procs=1, node_rank=0
[2025-12-16 23:27:48,547] [INFO] [launch.py:179:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2025-12-16 23:27:48,548] [INFO] [launch.py:180:main] dist_world_size=1
[2025-12-16 23:27:48,548] [INFO] [launch.py:184:main] Setting CUDA_VISIBLE_DEVICES=0
[2025-12-16 23:27:48,549] [INFO] [launch.py:272:main] process 8596 spawned with command: ['/root/anaconda3/envs/deepspeed/bin/python3.12', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', '/home/nbw/model/opt-1.3b/', '--gradient_accumulation_steps', '8', '--lora_dim', '128', '--zero_stage', '0', '--enable_tensorboard', '--tensorboard_path', './output', '--deepspeed', '--output_dir', './output']
[rank0]:[W1216 23:27:56.691535111 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
tokenizer initialize start
load_hf_tokenizer model_name_or_path /home/nbw/model/opt-1.3b/
exist
tokenizer initialize end
model initialize start
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
model initialize end
dataset initialize start
Creating prompt dataset ['/home/nbw/datasets/Dahoas/rm-static/'], reload=False
Creating dataset Dahoas_rm_static for train_phase=1 size=15252
Creating dataset Dahoas_rm_static for train_phase=1 size=1021
dataset initialize end
/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
***** Running training *****
***** Evaluating perplexity, Epoch 0/1 *****
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/nbw/train/main.py", line 447, in <module>
[rank0]: main()
[rank0]: File "/home/nbw/train/main.py", line 397, in main
[rank0]: perplexity, eval_loss = evaluation(model, eval_dataloader)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/nbw/train/main.py", line 348, in evaluation
[rank0]: outputs = model(**batch)
[rank0]: ^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 2179, in forward
[rank0]: loss = self.module(*inputs, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
[rank0]: return inner()
[rank0]: ^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1805, in inner
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/utils/generic.py", line 918, in wrapper
[rank0]: output = func(self, *args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/models/opt/modeling_opt.py", line 818, in forward
[rank0]: outputs = self.model.decoder(
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/utils/generic.py", line 918, in wrapper
[rank0]: output = func(self, *args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/models/opt/modeling_opt.py", line 648, in forward
[rank0]: layer_outputs = decoder_layer(
[rank0]: ^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/modeling_layers.py", line 94, in __call__
[rank0]: return super().__call__(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/models/opt/modeling_opt.py", line 255, in forward
[rank0]: hidden_states, self_attn_weights = self.self_attn(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/models/opt/modeling_opt.py", line 160, in forward
[rank0]: query_states = self.q_proj(hidden_states) * self.scaling
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/nbw/train/lora.py", line 82, in forward
[rank0]: return F.linear(
[rank0]: ^^^^^^^^^
[rank0]: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 2048 n 8192 k 2048 mat1_ld 2048 mat2_ld 2048 result_ld 2048 abcType 2 computeType 68 scaleType 0
[rank0]:[W1216 23:29:04.283389756 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[2025-12-16 23:29:07,562] [INFO] [launch.py:335:sigkill_handler] Killing subprocess 8596
[2025-12-16 23:29:07,563] [ERROR] [launch.py:341:sigkill_handler] ['/root/anaconda3/envs/deepspeed/bin/python3.12', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', '/home/nbw/model/opt-1.3b/', '--gradient_accumulation_steps', '8', '--lora_dim', '128', '--zero_stage', '0', '--enable_tensorboard', '--tensorboard_path', './output', '--deepspeed', '--output_dir', './output'] exits with return code = 1
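
To make the failing call easier to read, here is a small sketch that pulls the GEMM parameters out of the `RuntimeError` line above. The helper name `parse_cublas_error` and the two lookup tables are my own and hypothetical. The enum names are assumptions based on the CUDA/cuBLAS headers as I understand them, so please check them against the installed toolkit.

```python
import re

# The RuntimeError line, copied from the traceback above.
ERROR_LINE = (
    "RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling "
    "cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 2048 n 8192 "
    "k 2048 mat1_ld 2048 mat2_ld 2048 result_ld 2048 abcType 2 "
    "computeType 68 scaleType 0"
)

# Assumed enum names (cudaDataType / cublasComputeType_t); verify against
# the headers of the CUDA toolkit actually in use.
DATA_TYPES = {0: "CUDA_R_32F", 2: "CUDA_R_16F", 14: "CUDA_R_16BF"}
COMPUTE_TYPES = {64: "CUBLAS_COMPUTE_16F", 68: "CUBLAS_COMPUTE_32F"}


def parse_cublas_error(message):
    """Return the 'name value' integer pairs that follow 'cublasLtMatmul with'."""
    _, _, tail = message.partition("cublasLtMatmul with")
    pairs = re.findall(r"([A-Za-z_]\w*) (\d+)", tail)
    return {name: int(value) for name, value in pairs}


fields = parse_cublas_error(ERROR_LINE)
print("GEMM m/n/k:", fields["m"], fields["n"], fields["k"])
print("abcType:", DATA_TYPES.get(fields["abcType"], "unknown"))
print("computeType:", COMPUTE_TYPES.get(fields["computeType"], "unknown"))
```

If the assumed mappings are right, the failing call is a half-precision matmul with fp32 accumulation, where `k = 2048` matches the hidden size of opt-1.3b and `n = 8192` is the number of tokens in the evaluation batch. Rerunning with `CUDA_LAUNCH_BLOCKING=1` may help confirm that this `F.linear` call in `lora.py` is really where the failure starts, since CUDA errors can be reported asynchronously.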