Description
System Info
- CPU: i5-14000kf
- GPU: RTX 4070 Super
- Python: 3.10
- nvidia-smi: NVIDIA-SMI 580.105.08, Driver Version 572.16, CUDA Version 12.8
- nvcc: Cuda compilation tools, release 12.8, V12.8.61 (Build cuda_12.8.r12.8/compiler.35404655_0)
- tensorrt-llm: 1.0.0
- tensorrt: 10.11.0.33
- torch: 2.7.1
- torchvision: 0.22.1
- torchprofile: 0.0.4
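For reference, a small hypothetical snippet (not part of the original report) that collects the same version information programmatically; it relies only on the standard __version__ attributes of these packages and PyTorch's CUDA device query:

import torch
import torchvision
import tensorrt
import tensorrt_llm

# Print the package versions and the visible GPU, matching the list above.
print("torch:", torch.__version__, "| torch CUDA:", torch.version.cuda)
print("torchvision:", torchvision.__version__)
print("tensorrt:", tensorrt.__version__)
print("tensorrt-llm:", tensorrt_llm.__version__)
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no CUDA device")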
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
import os

os.environ["TLLM_LLM_ENABLE_DEBUG"] = "1"

from tensorrt_llm import LLM, SamplingParams


def main():
    # The model argument accepts an HF model name, a path to a local HF model,
    # or TensorRT Model Optimizer quantized checkpoints like nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", backend="pytorch")

    # Sample prompts.
    prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Create the sampling params.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    for output in llm.generate(prompts, sampling_params):
        print(
            f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}"
        )

    # Got output like:
    # Prompt: 'Hello, my name is', Generated text: '\n\nJane Smith. I am a student pursuing my degree in Computer Science at [university]. I enjoy learning new things, especially technology and programming'
    # Prompt: 'The president of the United States is', Generated text: 'likely to nominate a new Supreme Court justice to fill the seat vacated by the death of Antonin Scalia. The Senate should vote to confirm the'
    # Prompt: 'The capital of France is', Generated text: 'Paris.'
    # Prompt: 'The future of AI is', Generated text: 'an exciting time for us. We are constantly researching, developing, and improving our platform to create the most advanced and efficient model available. We are'


if __name__ == '__main__':
    main()
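As a side note on the comment inside main(): a purely hypothetical sketch of the other model sources it mentions (the local path is a placeholder, not something used in this reproduction; the FP8 checkpoint name is taken from that comment):

from tensorrt_llm import LLM

# Sketch only: same constructor, alternative model sources.
llm_local = LLM(model="/path/to/local/hf/model", backend="pytorch")          # local HF checkpoint directory
llm_fp8 = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8", backend="pytorch")   # Model Optimizer FP8 checkpoint on HF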
Expected behavior
The script should successfully generate and print the model output for the three prompts.
actual behavior
/home/rookie/Qwen/.venv/bin/python3.10 /home/rookie/Qwen/trt_llm_demo.py
:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
[2025-12-07 13:29:44] INFO config.py:54: PyTorch version 2.7.1 available.
LLM debug mode enabled.
[12/07/2025-13:29:46] [TRT-LLM] [I] Starting TensorRT LLM init.
2025-12-07 13:29:46,319 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[12/07/2025-13:29:46] [TRT-LLM] [I] TensorRT LLM inited.
[TensorRT-LLM] TensorRT LLM version: 1.0.0
[12/07/2025-13:29:46] [TRT-LLM] [I] Using LLM with PyTorch backend
[12/07/2025-13:29:46] [TRT-LLM] [W] Using default gpus_per_node: 1
[12/07/2025-13:29:46] [TRT-LLM] [I] Set nccl_plugin to None.
[12/07/2025-13:29:46] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
LLM.args.mpi_session: None
/home/rookie/Qwen/.venv/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[12/07/2025-13:29:47] [TRT-LLM] [I] PyTorchConfig(extra_resource_managers={}, use_cuda_graph=True, cuda_graph_batch_sizes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 64, 128], cuda_graph_max_batch_size=128, cuda_graph_padding_enabled=False, disable_overlap_scheduler=False, moe_max_num_tokens=None, moe_load_balancer=None, attention_dp_enable_balance=False, attention_dp_time_out_iters=50, attention_dp_batching_wait_iters=10, attn_backend='TRTLLM', moe_backend='CUTLASS', enable_mixed_sampler=False, enable_trtllm_sampler=False, kv_cache_dtype='auto', enable_iter_perf_stats=False, enable_iter_req_stats=False, print_iter_log=False, torch_compile_enabled=False, torch_compile_fullgraph=True, torch_compile_inductor_enabled=False, torch_compile_piecewise_cuda_graph=False, torch_compile_enable_userbuffers=True, torch_compile_max_num_streams=1, enable_autotuner=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>, enable_min_latency=False, allreduce_strategy='AUTO', stream_interval=1, force_dynamic_quantization=False, _limit_torch_cuda_mem_fraction=True)
create pool session ...
rank 0 using MpiPoolSession to spawn MPI processes
Server [proxy_request_queue] bound to tcp://127.0.0.1:45799 in PAIR
[12/07/2025-13:29:47] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
Server [worker_init_status_queue] bound to tcp://127.0.0.1:45971 in PAIR
[12/07/2025-13:29:47] [TRT-LLM] [I] Generating a new HMAC key for server worker_init_status_queue
Server [proxy_result_queue] bound to tcp://127.0.0.1:46275 in PAIR
[12/07/2025-13:29:47] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
Server [proxy_stats_queue] bound to tcp://127.0.0.1:45893 in PAIR
[12/07/2025-13:29:47] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
Server [proxy_kv_cache_events_queue] bound to tcp://127.0.0.1:45993 in PAIR
[12/07/2025-13:29:47] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
additional notes
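The log above stops right after rank 0 reports using MpiPoolSession to spawn MPI processes and binds the proxy queues, and no generated text is ever printed. As a purely hypothetical way to narrow this down (not something this report confirms was run), a standalone mpi4py check would show whether MPI dynamic process spawning works at all in this environment:

import sys
from mpi4py import MPI

# Report which MPI implementation mpi4py was built against.
print("MPI vendor:", MPI.get_vendor())

# Spawn a single child interpreter that just prints a message; if this call
# also stalls, the problem is likely the local MPI setup rather than TensorRT-LLM.
child = MPI.COMM_SELF.Spawn(sys.executable,
                            args=["-c", "print('spawned child ok')"],
                            maxprocs=1)
child.Disconnect()
print("spawn check finished")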
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.