WARNING: THIS SITE IS A MIRROR OF GITHUB.COM / IT CANNOT LOGIN OR REGISTER ACCOUNTS / THE CONTENTS ARE PROVIDED AS-IS / THIS SITE ASSUMES NO RESPONSIBILITY FOR ANY DISPLAYED CONTENT OR LINKS / IF YOU FOUND SOMETHING MAY NOT GOOD FOR EVERYONE, CONTACT ADMIN AT ilovescratch@foxmail.com
Skip to content

Step1 failed when I run "bash training_scripts/opt/single_gpu/run_1.3b.sh" #995

@niebowen666

Description

@niebowen666

the error:

[2025-12-16 23:27:41,996] [WARNING] [runner.py:232:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2025-12-16 23:27:41,997] [INFO] [runner.py:630:main] cmd = /root/anaconda3/envs/deepspeed/bin/python3.12 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None --log_level=info main.py --model_name_or_path /home/nbw/model/opt-1.3b/ --gradient_accumulation_steps 8 --lora_dim 128 --zero_stage 0 --enable_tensorboard --tensorboard_path ./output --deepspeed --output_dir ./output
[2025-12-16 23:27:48,547] [INFO] [launch.py:162:main] WORLD INFO DICT: {'localhost': [0]}
[2025-12-16 23:27:48,547] [INFO] [launch.py:168:main] nnodes=1, num_local_procs=1, node_rank=0
[2025-12-16 23:27:48,547] [INFO] [launch.py:179:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2025-12-16 23:27:48,548] [INFO] [launch.py:180:main] dist_world_size=1
[2025-12-16 23:27:48,548] [INFO] [launch.py:184:main] Setting CUDA_VISIBLE_DEVICES=0
[2025-12-16 23:27:48,549] [INFO] [launch.py:272:main] process 8596 spawned with command: ['/root/anaconda3/envs/deepspeed/bin/python3.12', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', '/home/nbw/model/opt-1.3b/', '--gradient_accumulation_steps', '8', '--lora_dim', '128', '--zero_stage', '0', '--enable_tensorboard', '--tensorboard_path', './output', '--deepspeed', '--output_dir', './output']
[rank0]:[W1216 23:27:56.691535111 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
tokenizer initialize start
load_hf_tokenizer model_name_or_path /home/nbw/model/opt-1.3b/
exist
tokenizer initialize end
model initialize start
The module name  (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
model initialize end
dataset initialize start
Creating prompt dataset ['/home/nbw/datasets/Dahoas/rm-static/'], reload=False
Creating dataset Dahoas_rm_static for train_phase=1 size=15252
Creating dataset Dahoas_rm_static for train_phase=1 size=1021
dataset initialize end
/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
***** Running training *****
***** Evaluating perplexity, Epoch 0/1 *****
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/nbw/train/main.py", line 447, in <module>
[rank0]:     main()
[rank0]:   File "/home/nbw/train/main.py", line 397, in main
[rank0]:     perplexity, eval_loss = evaluation(model, eval_dataloader)
[rank0]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/nbw/train/main.py", line 348, in evaluation
[rank0]:     outputs = model(**batch)
[rank0]:               ^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 2179, in forward
[rank0]:     loss = self.module(*inputs, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
[rank0]:     return inner()
[rank0]:            ^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1805, in inner
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/utils/generic.py", line 918, in wrapper
[rank0]:     output = func(self, *args, **kwargs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/models/opt/modeling_opt.py", line 818, in forward
[rank0]:     outputs = self.model.decoder(
[rank0]:               ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/utils/generic.py", line 918, in wrapper
[rank0]:     output = func(self, *args, **kwargs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/models/opt/modeling_opt.py", line 648, in forward
[rank0]:     layer_outputs = decoder_layer(
[rank0]:                     ^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/modeling_layers.py", line 94, in __call__
[rank0]:     return super().__call__(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/models/opt/modeling_opt.py", line 255, in forward
[rank0]:     hidden_states, self_attn_weights = self.self_attn(
[rank0]:                                        ^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/transformers/models/opt/modeling_opt.py", line 160, in forward
[rank0]:     query_states = self.q_proj(hidden_states) * self.scaling
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/anaconda3/envs/deepspeed/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/nbw/train/lora.py", line 82, in forward
[rank0]:     return F.linear(
[rank0]:            ^^^^^^^^^
[rank0]: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 2048 n 8192 k 2048 mat1_ld 2048 mat2_ld 2048 result_ld 2048 abcType 2 computeType 68 scaleType 0
[rank0]:[W1216 23:29:04.283389756 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[2025-12-16 23:29:07,562] [INFO] [launch.py:335:sigkill_handler] Killing subprocess 8596
[2025-12-16 23:29:07,563] [ERROR] [launch.py:341:sigkill_handler] ['/root/anaconda3/envs/deepspeed/bin/python3.12', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', '/home/nbw/model/opt-1.3b/', '--gradient_accumulation_steps', '8', '--lora_dim', '128', '--zero_stage', '0', '--enable_tensorboard', '--tensorboard_path', './output', '--deepspeed', '--output_dir', './output'] exits with return code = 1

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions