🚀 The feature, motivation and pitch
Description
Currently, our TensorRT-LLM inference service exposes a limited set of metrics (e.g., Time-To-First-Token, Time-Per-Output-Token, and end-to-end latency). To improve observability, debugging, and production monitoring, we propose implementing a complete set of Prometheus metrics similar to those exposed by SGLang and vLLM.
This will provide deep insights into system performance, queue states, cache efficiency, and token-level statistics.
Current Limitations
- Only basic latency metrics are available (ttft, tpot, e2e_latency).
- Missing critical gauges for real-time system state (queue sizes, cache hit rate, throughput).
- Missing histograms for detailed latency analysis (inter-token latency, request latency distributions).
- Missing counters for aggregated token and request statistics.
Proposed Metrics to Implement
The following metrics should be added, following the naming and structure used by SGLang (example output is provided in the references). An illustrative prometheus_client sketch follows each group below.
A. Latency Histograms
1. time_to_first_token_seconds (histogram) – Distribution of time to generate the first token.
2. e2e_request_latency_seconds (histogram) – Distribution of total request latency.
3. inter_token_latency_seconds (histogram) – Distribution of latency between consecutive generated tokens.
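As a rough illustration, these histograms could be defined with the Python prometheus_client library. The bucket boundaries and the model_name label below are assumptions for the sketch, not settled choices:

```python
# Hedged sketch: metric names follow the list above; the buckets and the
# model_name label are illustrative assumptions.
from prometheus_client import Histogram

TIME_TO_FIRST_TOKEN = Histogram(
    "time_to_first_token_seconds",
    "Time to generate the first token of a request.",
    labelnames=["model_name"],
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)

E2E_REQUEST_LATENCY = Histogram(
    "e2e_request_latency_seconds",
    "End-to-end request latency.",
    labelnames=["model_name"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0, 120.0],
)

INTER_TOKEN_LATENCY = Histogram(
    "inter_token_latency_seconds",
    "Latency between consecutive generated tokens.",
    labelnames=["model_name"],
    buckets=[0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.25, 0.5, 1.0],
)

# Example observation in the streaming path:
# TIME_TO_FIRST_TOKEN.labels(model_name="my-model").observe(ttft_seconds)
```

Metrics defined this way land in the default registry, so they can be exposed with prometheus_client.start_http_server or mounted as a /metrics endpoint on the existing API server.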
B. Real-time Gauges (System State)
4. num_running_reqs (gauge) – Number of requests currently being processed.
5. num_used_tokens (gauge) – Number of active tokens held in GPU (KV cache) memory.
6. token_usage (gauge) – Fraction of the total token capacity currently in use.
7. gen_throughput (gauge) – Current generation throughput (tokens/second).
8. num_queue_reqs (gauge) – Number of requests waiting in the general queue.
9. num_grammar_queue_reqs (gauge) – Number of requests in the grammar-specific queue (if supported).
10. cache_hit_rate (gauge) – Prefix cache hit rate.
11. spec_accept_length (gauge) – Average acceptance length for speculative decoding (if supported).
12. num_prefill_prealloc_queue_reqs (gauge) – Requests in the prefill pre-allocation queue.
13. num_prefill_inflight_queue_reqs (gauge) – Requests in the prefill execution queue.
14. num_decode_prealloc_queue_reqs (gauge) – Requests in the decode pre-allocation queue.
15. num_decode_transfer_queue_reqs (gauge) – Requests in the decode transfer queue.
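These gauges would be refreshed from the scheduler loop. A minimal sketch, assuming a hypothetical stats object that stands in for wherever TensorRT-LLM tracks this state:

```python
from prometheus_client import Gauge

NUM_RUNNING_REQS = Gauge(
    "num_running_reqs", "Number of requests currently being processed.",
    ["model_name"],
)
NUM_QUEUE_REQS = Gauge(
    "num_queue_reqs", "Number of requests waiting in the general queue.",
    ["model_name"],
)
TOKEN_USAGE = Gauge(
    "token_usage", "Fraction of the total token capacity currently in use.",
    ["model_name"],
)
GEN_THROUGHPUT = Gauge(
    "gen_throughput", "Current generation throughput (tokens/second).",
    ["model_name"],
)

def record_scheduler_state(stats, model_name: str) -> None:
    """Push a snapshot of scheduler state into the gauges.

    `stats` is a hypothetical object exposing the counts below; the real
    source of truth would be the TensorRT-LLM executor/scheduler.
    """
    NUM_RUNNING_REQS.labels(model_name=model_name).set(stats.num_running)
    NUM_QUEUE_REQS.labels(model_name=model_name).set(stats.num_queued)
    TOKEN_USAGE.labels(model_name=model_name).set(
        stats.used_tokens / max(stats.total_token_capacity, 1)
    )
    GEN_THROUGHPUT.labels(model_name=model_name).set(stats.tokens_per_second)
```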
C. Cumulative Counters
16. prompt_tokens_total (counter) – Total number of prompt tokens processed.
17. generation_tokens_total (counter) – Total number of generated tokens.
18. num_requests_total (counter) – Total number of requests processed.
19. num_so_requests_total (counter) – Total number of structured output requests processed.
20. cached_tokens_total (counter) – Total number of prompt tokens served from cache.
21. num_aborted_requests_total (counter) – Total number of aborted requests.
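The counters would be incremented monotonically as requests finish. Again a hedged sketch, with the per-request fields (req.num_prompt_tokens, req.aborted, etc.) assumed for illustration:

```python
from prometheus_client import Counter

PROMPT_TOKENS_TOTAL = Counter(
    "prompt_tokens_total", "Total number of prompt tokens processed.",
    ["model_name"],
)
GENERATION_TOKENS_TOTAL = Counter(
    "generation_tokens_total", "Total number of generated tokens.",
    ["model_name"],
)
NUM_REQUESTS_TOTAL = Counter(
    "num_requests_total", "Total number of requests processed.",
    ["model_name"],
)
NUM_ABORTED_REQUESTS_TOTAL = Counter(
    "num_aborted_requests_total", "Total number of aborted requests.",
    ["model_name"],
)

def record_finished_request(req, model_name: str) -> None:
    # `req` is a hypothetical finished-request record with token counts.
    PROMPT_TOKENS_TOTAL.labels(model_name=model_name).inc(req.num_prompt_tokens)
    GENERATION_TOKENS_TOTAL.labels(model_name=model_name).inc(req.num_output_tokens)
    NUM_REQUESTS_TOTAL.labels(model_name=model_name).inc()
    if req.aborted:
        NUM_ABORTED_REQUESTS_TOTAL.labels(model_name=model_name).inc()
```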
Benefits
- Unified Observability: Aligns with industry-standard inference servers.
- Enhanced Debugging: Provides visibility into queues, cache performance, and token generation dynamics.
- Production Readiness: Enables robust alerting, capacity planning, and performance trend analysis.
- Better Integration: Facilitates integration with existing Prometheus/Grafana monitoring stacks.
Alternatives
None.
Additional context
None.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.