🚀 The feature, motivation and pitch
Description
Currently, our TensorRT-LLM inference service exposes a limited set of metrics (e.g., Time-To-First-Token, Time-Per-Output-Token, and end-to-end latency). To improve observability, debugging, and production monitoring, we propose implementing a complete set of Prometheus metrics similar to those exposed by SGLang and vLLM.
This will provide deep insights into system performance, queue states, cache efficiency, and token-level statistics.
Current Limitations
- Only basic latency metrics are available (ttft, tpot, e2e_latency).
- Missing critical gauges for real-time system state (queue sizes, cache hit rate, throughput).
- Missing histograms for detailed latency analysis (inter-token latency, request latency distributions).
- Missing counters for aggregated token and request statistics.
Proposed Metrics to Implement
The following metrics should be added, following the naming and structure used by SGLang (example output is provided in the references). An illustrative prometheus_client sketch follows each group below.
A. Latency Histograms
1. time_to_first_token_seconds (histogram) – Distribution of time to generate the first token.
2. e2e_request_latency_seconds (histogram) – Distribution of total request latency.
3. inter_token_latency_seconds (histogram) – Distribution of latency between consecutive generated tokens.
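As a rough illustration, these histograms could be defined with the Python prometheus_client library. The bucket boundaries and the model_name label below are assumptions for the sketch, not settled choices:

```python
# Hedged sketch: metric names follow the list above; the buckets and the
# model_name label are illustrative assumptions.
from prometheus_client import Histogram

TIME_TO_FIRST_TOKEN = Histogram(
    "time_to_first_token_seconds",
    "Time to generate the first token of a request.",
    labelnames=["model_name"],
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)

E2E_REQUEST_LATENCY = Histogram(
    "e2e_request_latency_seconds",
    "End-to-end request latency.",
    labelnames=["model_name"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0, 120.0],
)

INTER_TOKEN_LATENCY = Histogram(
    "inter_token_latency_seconds",
    "Latency between consecutive generated tokens.",
    labelnames=["model_name"],
    buckets=[0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.25, 0.5, 1.0],
)

# Example observation in the streaming path:
# TIME_TO_FIRST_TOKEN.labels(model_name="my-model").observe(ttft_seconds)
```

Metrics defined this way land in the default registry, so they can be exposed with prometheus_client.start_http_server or mounted as a /metrics endpoint on the existing API server.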
B. Real-time Gauges (System State)
4. num_running_reqs (gauge) – Number of requests currently being processed.
5. num_used_tokens (gauge) – Number of active tokens held in GPU (KV cache) memory.
6. token_usage (gauge) – Fraction of the total token capacity currently in use.
7. gen_throughput (gauge) – Current generation throughput (tokens/second).
8. num_queue_reqs (gauge) – Number of requests waiting in the general queue.
9. num_grammar_queue_reqs (gauge) – Number of requests in the grammar-specific queue (if supported).
10. cache_hit_rate (gauge) – Prefix cache hit rate.
11. spec_accept_length (gauge) – Average acceptance length for speculative decoding (if supported).
12. num_prefill_prealloc_queue_reqs (gauge) – Requests in the prefill pre-allocation queue.
13. num_prefill_inflight_queue_reqs (gauge) – Requests in the prefill execution queue.
14. num_decode_prealloc_queue_reqs (gauge) – Requests in the decode pre-allocation queue.
15. num_decode_transfer_queue_reqs (gauge) – Requests in the decode transfer queue.
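These gauges would be refreshed from the scheduler loop. A minimal sketch, assuming a hypothetical stats object that stands in for wherever TensorRT-LLM tracks this state:

```python
from prometheus_client import Gauge

NUM_RUNNING_REQS = Gauge(
    "num_running_reqs", "Number of requests currently being processed.",
    ["model_name"],
)
NUM_QUEUE_REQS = Gauge(
    "num_queue_reqs", "Number of requests waiting in the general queue.",
    ["model_name"],
)
TOKEN_USAGE = Gauge(
    "token_usage", "Fraction of the total token capacity currently in use.",
    ["model_name"],
)
GEN_THROUGHPUT = Gauge(
    "gen_throughput", "Current generation throughput (tokens/second).",
    ["model_name"],
)

def record_scheduler_state(stats, model_name: str) -> None:
    """Push a snapshot of scheduler state into the gauges.

    `stats` is a hypothetical object exposing the counts below; the real
    source of truth would be the TensorRT-LLM executor/scheduler.
    """
    NUM_RUNNING_REQS.labels(model_name=model_name).set(stats.num_running)
    NUM_QUEUE_REQS.labels(model_name=model_name).set(stats.num_queued)
    TOKEN_USAGE.labels(model_name=model_name).set(
        stats.used_tokens / max(stats.total_token_capacity, 1)
    )
    GEN_THROUGHPUT.labels(model_name=model_name).set(stats.tokens_per_second)
```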
C. Cumulative Counters
16. prompt_tokens_total (counter) – Total number of prompt tokens processed.
17. generation_tokens_total (counter) – Total number of generated tokens.
18. num_requests_total (counter) – Total number of requests processed.
19. num_so_requests_total (counter) – Total number of structured output requests processed.
20. cached_tokens_total (counter) – Total number of prompt tokens served from cache.
21. num_aborted_requests_total (counter) – Total number of aborted requests.
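The counters would be incremented monotonically as requests finish. Again a hedged sketch, with the per-request fields (req.num_prompt_tokens, req.aborted, etc.) assumed for illustration:

```python
from prometheus_client import Counter

PROMPT_TOKENS_TOTAL = Counter(
    "prompt_tokens_total", "Total number of prompt tokens processed.",
    ["model_name"],
)
GENERATION_TOKENS_TOTAL = Counter(
    "generation_tokens_total", "Total number of generated tokens.",
    ["model_name"],
)
NUM_REQUESTS_TOTAL = Counter(
    "num_requests_total", "Total number of requests processed.",
    ["model_name"],
)
NUM_ABORTED_REQUESTS_TOTAL = Counter(
    "num_aborted_requests_total", "Total number of aborted requests.",
    ["model_name"],
)

def record_finished_request(req, model_name: str) -> None:
    # `req` is a hypothetical finished-request record with token counts.
    PROMPT_TOKENS_TOTAL.labels(model_name=model_name).inc(req.num_prompt_tokens)
    GENERATION_TOKENS_TOTAL.labels(model_name=model_name).inc(req.num_output_tokens)
    NUM_REQUESTS_TOTAL.labels(model_name=model_name).inc()
    if req.aborted:
        NUM_ABORTED_REQUESTS_TOTAL.labels(model_name=model_name).inc()
```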
Benefits
- Unified Observability: Aligns with industry-standard inference servers.
- Enhanced Debugging: Provides visibility into queues, cache performance, and token generation dynamics.
- Production Readiness: Enables robust alerting, capacity planning, and performance trend analysis.
- Better Integration: Facilitates integration with existing Prometheus/Grafana monitoring stacks.
Alternatives
None.
Additional context
None.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.