I am trying to run DeepSeek R1 7B with tensor parallelism on two RTX 5090 cards.
I followed the instructions at https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3
using
pip install "sglang[all]>=0.5.2rc2"
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 2
but it always fails with the following error:
[2025-09-09 09:54:42 TP0] sglang is using nccl==2.27.3
[p-8f2c329b559b-ackcs-00gjf2kv:55682:0:55682] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 55682) ====
0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7fd3c0253654]
1 /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7fd3c025384c]
2 /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7fd3c0253a88]
Fatal Python error: Segmentation fault
Thread 0x00007fd3d824f6c0 (most recent call first):
File "/root/.local/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 61 in _recv_msg
File "/root/.local/lib/python3.10/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 195 in _read_thread
File "/base/mambaforge/lib/python3.10/threading.py", line 953 in run
File "/base/mambaforge/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/base/mambaforge/lib/python3.10/threading.py", line 973 in _bootstrap
Current thread 0x00007fda0995ed80 (most recent call first):
File "/root/.local/lib/python3.10/site-packages/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 400 in ncclCommInitRank
File "/root/.local/lib/python3.10/site-packages/sglang/srt/distributed/device_communicators/pynccl.py", line 110 in __init__
File "/root/.local/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py", line 296 in __init__
File "/root/.local/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py", line 1221 in init_model_parallel_group
File "/root/.local/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py", line 1433 in initialize_model_parallel
File "/root/.local/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 610 in init_torch_distributed
File "/root/.local/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 235 in __init__
File "/root/.local/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 84 in __init__
File "/root/.local/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 67 in __init__
File "/root/.local/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 328 in __init__
File "/root/.local/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 2553 in run_scheduler_process
File "/base/mambaforge/lib/python3.10/multiprocessing/process.py", line 108 in run
File "/base/mambaforge/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
File "/base/mambaforge/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
File "/base/mambaforge/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
File "<string>", line 1 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pybase64._pybase64, _brotli, zstandard.backend_c, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, setproctitle._setproctitle, uvloop.loop, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, zmq.backend.cython._zmq, PIL._imaging, yaml._yaml, regex._regex, markupsafe._speedups, PIL._imagingft, _cffi_backend, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.linalg._flinalg, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, 
scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize.__nnls, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.optimize._direct, sentencepiece._sentencepiece, msgspec._core (total: 105)
[p-8f2c329b559b-ackcs-00gjf2kv:55683:0:55683] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
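To see at a glance where the native frames come from, a small sketch like the one below tallies them by shared library. The helper name `summarize_native_frames` and the regex are my own illustration, not part of sglang, NCCL, or UCX. For the backtrace above, all three frames land in HPC-X's `libucs.so.0`, starting at `ucs_handle_error`. That looks like UCX's signal handler reporting the crash, so it does not necessarily point at the code that actually faulted.

```python
import re

# Matches native frames such as:
#   0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7fd3c0253654]
FRAME_RE = re.compile(r"^\s*(\d+)\s+(\S+?)\(([^)]*)\)\s+\[(0x[0-9a-fA-F]+)\]\s*$")


def summarize_native_frames(log_text):
    """Count how many native backtrace frames come from each shared library."""
    counts = {}
    for line in log_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            library = match.group(2)
            counts[library] = counts.get(library, 0) + 1
    return counts


# The native part of the backtrace, copied from the log in this post.
SAMPLE = """\
==== backtrace (tid: 55682) ====
0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7fd3c0253654]
1 /opt/hpcx/ucx/lib/libucs.so.0(+0x3684c) [0x7fd3c025384c]
2 /opt/hpcx/ucx/lib/libucs.so.0(+0x36a88) [0x7fd3c0253a88]
Fatal Python error: Segmentation fault
"""

print(summarize_native_frames(SAMPLE))
```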
I suspect the problem is the NCCL version or something else in the multi-GPU setup, because the same model runs fine on a single RTX 5090.
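One way to gather more detail would be to rerun the same launch command with NCCL's own logging turned on. The sketch below is only an illustration of that idea: `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables, but the helper names (`build_debug_env`, `build_launch_command`, `launch_with_nccl_debug`) are hypothetical, and the choice of the `INIT,NET` subsystems is an assumption based on the crash happening inside `ncclCommInitRank`.

```python
import os
import subprocess


def build_debug_env(base_env=None):
    """Return a copy of the environment with verbose NCCL logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"             # print NCCL version, topology, and transport choices
    env["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # assumption: limit output to init and network setup
    return env


def build_launch_command(model="deepseek-ai/DeepSeek-V3", tp=2):
    """Rebuild the launch command from this post as an argument list."""
    return [
        "python3", "-m", "sglang.launch_server",
        "--model", model,
        "--tp", str(tp),
    ]


def launch_with_nccl_debug():
    """Start the server with NCCL logging on and return its exit code."""
    completed = subprocess.run(build_launch_command(), env=build_debug_env())
    return completed.returncode
```

Calling `launch_with_nccl_debug()` starts the server exactly as in the command above, just with the extra environment variables, so the NCCL lines printed before the segfault should show which transport or network plugin it was trying to initialize.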