
[Feature]: Add comprehensive Prometheus metrics similar to SGLang #9779

@0xd8b


🚀 The feature, motivation and pitch

Description

Currently, our TensorRT-LLM inference service exposes a limited set of metrics (e.g., Time-To-First-Token, Time-Per-Output-Token, and end-to-end latency). To improve observability, debugging, and production monitoring, we propose implementing a complete set of Prometheus metrics similar to those exposed by SGLang and vLLM.

This will provide deep insights into system performance, queue states, cache efficiency, and token-level statistics.

Current Limitations

  • Only basic latency metrics are available (ttft, tpot, e2e_latency).
  • Missing critical gauges for real-time system state (queue sizes, cache hit rate, throughput).
  • Missing histograms for detailed latency analysis (inter-token latency, request latency distributions).
  • Missing counters for aggregated token and request statistics.

Proposed Metrics to Implement

The following metrics should be added, following the naming and structure used by SGLang (example output provided in the references).
A. Latency Histograms
1. time_to_first_token_seconds (histogram) – Distribution of time to generate the first token.
2. e2e_request_latency_seconds (histogram) – Distribution of total end-to-end request latency.
3. inter_token_latency_seconds (histogram) – Distribution of latency between consecutive generated tokens.
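As a sketch of how these three histograms could be registered with the Python prometheus_client library (the metric names come from the list above; the bucket boundaries are illustrative assumptions, not values taken from SGLang or TensorRT-LLM):

```python
from prometheus_client import CollectorRegistry, Histogram, generate_latest

registry = CollectorRegistry()

# Bucket boundaries below are illustrative guesses, not SGLang's actual values.
TTFT = Histogram(
    "time_to_first_token_seconds",
    "Time to generate the first token.",
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0],
    registry=registry,
)
E2E_LATENCY = Histogram(
    "e2e_request_latency_seconds",
    "Total end-to-end request latency.",
    buckets=[0.1, 0.5, 1.0, 5.0, 30.0, 120.0],
    registry=registry,
)
ITL = Histogram(
    "inter_token_latency_seconds",
    "Latency between consecutive generated tokens.",
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1],
    registry=registry,
)

# The serving loop would record one observation per event, e.g.:
TTFT.observe(0.12)        # first token arrived after 120 ms
E2E_LATENCY.observe(2.3)  # request finished after 2.3 s
ITL.observe(0.018)        # 18 ms between two consecutive tokens

# Render the metrics in Prometheus text exposition format.
print(generate_latest(registry).decode())
```

Each Histogram automatically exposes the `_bucket`, `_sum`, and `_count` series that Prometheus needs to compute quantiles via `histogram_quantile()`.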

B. Real-time Gauges (System State)
4. num_running_reqs – Number of requests currently being processed.
5. num_used_tokens – Number of active tokens in GPU memory.
6. token_usage – Fraction of the token (KV cache) capacity currently in use.
7. gen_throughput – Current generation throughput (tokens/second).
8. num_queue_reqs – Number of requests waiting in the general queue.
9. num_grammar_queue_reqs – Number of requests in the grammar-specific queue (if supported).
10. cache_hit_rate – Prefix cache hit rate.
11. spec_accept_length – Average acceptance length for speculative decoding (if supported).
12. num_prefill_prealloc_queue_reqs – Requests in the prefill pre-allocation queue.
13. num_prefill_inflight_queue_reqs – Requests in the prefill execution queue.
14. num_decode_prealloc_queue_reqs – Requests in the decode pre-allocation queue.
15. num_decode_transfer_queue_reqs – Requests in the decode transfer queue.
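Gauges are point-in-time values that the scheduler would overwrite on each iteration rather than accumulate. A minimal sketch with prometheus_client, using a few of the names from the list above (the values set here are purely illustrative):

```python
from prometheus_client import CollectorRegistry, Gauge, generate_latest

registry = CollectorRegistry()

# Gauge names taken from the proposed list; help strings paraphrase it.
NUM_RUNNING_REQS = Gauge(
    "num_running_reqs", "Requests currently being processed.", registry=registry
)
NUM_QUEUE_REQS = Gauge(
    "num_queue_reqs", "Requests waiting in the general queue.", registry=registry
)
GEN_THROUGHPUT = Gauge(
    "gen_throughput", "Current generation throughput (tokens/s).", registry=registry
)
CACHE_HIT_RATE = Gauge(
    "cache_hit_rate", "Prefix cache hit rate.", registry=registry
)

# A scheduler loop would refresh these every iteration, e.g.:
NUM_RUNNING_REQS.set(8)
NUM_QUEUE_REQS.set(3)
GEN_THROUGHPUT.set(1532.4)
CACHE_HIT_RATE.set(0.87)

print(generate_latest(registry).decode())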

C. Cumulative Counters
16. prompt_tokens_total – Total number of prompt tokens processed.
17. generation_tokens_total – Total number of generated tokens.
18. num_requests_total – Total number of requests processed.
19. num_so_requests_total – Total number of structured output requests processed.
20. cached_tokens_total – Total number of prompt tokens served from cache.
21. num_aborted_requests_total – Total number of aborted requests.
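Counters only ever increase; Prometheus derives rates from them with `rate()`. A sketch of the per-request accounting for a few of the counters above (note that prometheus_client appends the `_total` suffix to Counter samples automatically, so the metric is declared without it):

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()

# Declared without "_total": prometheus_client adds that suffix on export.
PROMPT_TOKENS = Counter(
    "prompt_tokens", "Total prompt tokens processed.", registry=registry
)
GENERATION_TOKENS = Counter(
    "generation_tokens", "Total generated tokens.", registry=registry
)
NUM_REQUESTS = Counter(
    "num_requests", "Total requests processed.", registry=registry
)

# After finishing one hypothetical request with 128 prompt tokens
# and 512 generated tokens:
PROMPT_TOKENS.inc(128)
GENERATION_TOKENS.inc(512)
NUM_REQUESTS.inc()

print(generate_latest(registry).decode())
```

A dashboard would then chart, for example, `rate(generation_tokens_total[1m])` as sustained generation throughput.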

Benefits

  • Unified Observability: Aligns with industry-standard inference servers.
  • Enhanced Debugging: Provides visibility into queues, cache performance, and token generation dynamics.
  • Production Readiness: Enables robust alerting, capacity planning, and performance trend analysis.
  • Better Integration: Facilitates integration with existing Prometheus/Grafana monitoring stacks.

Alternatives

none

Additional context

none

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels: feature request (New feature or request. This includes new model, dtype, functionality support)
