[misc,trainer,rollout] feat: add Prometheus metrics logging to experiment tracking by guillemgt · Pull Request #5291 · verl-project/verl

guillemgt · 2026-02-11T15:41:32Z

What does this PR do?

Adds the ability to query Prometheus metrics and log them to experiment tracking backends (WandB, TensorBoard, MLflow, etc.) during training. This allows users to correlate infrastructure metrics (GPU cache usage, throughput) with training metrics in a unified view.
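As a rough illustration of what querying Prometheus for a single metric involves, here is a minimal sketch built on Prometheus' standard instant-query endpoint (`/api/v1/query`). The helper names (`build_query_url`, `parse_instant_query`, `query_metric`) are hypothetical, and so is the choice to sum the values of all returned series; the PR's `PrometheusClient` may aggregate differently.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def build_query_url(host: str, port: int, promql: str) -> str:
    """Build a URL for Prometheus' instant-query endpoint."""
    return f"http://{host}:{port}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_query(payload: dict) -> Optional[float]:
    """Sum the sample values of an instant-vector response, or return None."""
    if payload.get("status") != "success":
        return None
    data = payload.get("data", {})
    if data.get("resultType") != "vector":
        return None
    samples = data.get("result", [])
    if not samples:
        return None
    # Each sample looks like {"metric": {...labels}, "value": [timestamp, "value-as-string"]}.
    # Summing across series is an assumption made for this sketch.
    return sum(float(sample["value"][1]) for sample in samples)


def query_metric(host: str, port: int, promql: str, timeout: float = 5.0) -> Optional[float]:
    """Run one instant query over HTTP and reduce the result to a single float."""
    with urllib.request.urlopen(build_query_url(host, port, promql), timeout=timeout) as resp:
        return parse_instant_query(json.load(resp))
```

Keeping the parsing separate from the HTTP call makes the response handling checkable without a running Prometheus server.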

Checklist Before Starting

Search for similar PRs: prometheus metrics
Format the PR title as [{modules}] {type}: {description}

Test

Testing was performed locally/internally during development. The feature has been validated to work correctly with:

Prometheus HTTP API queries
Ray head node auto-discovery
Cache behavior
Error handling (connection errors, timeouts, malformed responses)
Graceful failure modes
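The cache behavior and graceful failure modes listed above could look roughly like the sketch below. It is not taken from the PR: the class name `CachedFetcher`, the `ttl_s` parameter, the injectable `clock`, and the policy of serving the last good values when a refresh fails are all assumptions chosen for illustration.

```python
import logging
import time
from typing import Callable, Dict, Optional

logger = logging.getLogger(__name__)


class CachedFetcher:
    """Cache the result of `fetch` for `ttl_s` seconds; serve stale data on failure."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, float]],
        ttl_s: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._cached: Dict[str, float] = {}
        self._fetched_at: Optional[float] = None

    def get(self) -> Dict[str, float]:
        now = self._clock()
        # Serve from the cache while the last successful fetch is still fresh.
        if self._fetched_at is not None and now - self._fetched_at < self._ttl_s:
            return dict(self._cached)
        try:
            self._cached = dict(self._fetch())
            self._fetched_at = now
        except Exception as exc:
            # Training should not stop because monitoring is unavailable:
            # report the problem and fall back to the last good values.
            logger.warning("Metric refresh failed, serving cached values: %s", exc)
        return dict(self._cached)
```

Because `_fetched_at` is only updated on success, a failed refresh is retried on the next call instead of being cached for a full TTL.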

Due to the complexity of mocking Ray and Prometheus infrastructure, comprehensive unit tests are not included in this PR. The feature can be validated end-to-end by configuring metrics_to_log and verifying metrics appear in experiment tracking backends.

Tested with the example config below. The metric was successfully logged as rollout/vllm_generation_tokens_total in TensorBoard over a 20-iteration run on 2 nodes.
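The configured metric name (`vllm:generation_tokens_total`) and the logged key (`rollout/vllm_generation_tokens_total`) suggest a simple name mapping. The sketch below reproduces that one observed pair; the function name `to_tracking_key` and the exact rule (replace `:` with `_`, prepend `rollout/`) are inferred assumptions, and the PR's actual sanitization may cover more characters.

```python
def to_tracking_key(metric_name: str, prefix: str = "rollout") -> str:
    """Map a Prometheus metric name to a tracking-backend key (assumed rule)."""
    # Prometheus allows ':' in metric names; the key observed in the test run
    # uses '_' in its place.
    sanitized = metric_name.replace(":", "_")
    return f"{prefix}/{sanitized}"


print(to_tracking_key("vllm:generation_tokens_total"))
```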

API and Usage Example

actor_rollout_ref:
  rollout:
    prometheus:
      enable: True
      port: 9090
      metrics_to_log:
        - "vllm:generation_tokens_total"

Design & Code Changes

verl/workers/config/rollout.py: Add metrics_to_log field to PrometheusConfig
verl/experimental/agent_loop/prometheus_utils.py: Add PrometheusClient class (~180 lines)
verl/trainer/ppo/ray_trainer.py: Initialize client and query metrics before logging
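To make the three changes above concrete, here is a hedged sketch of how a config field and a pre-logging merge step might fit together. `PrometheusConfigSketch` only mirrors the three keys shown in the usage example (the real `PrometheusConfig` likely has more fields), and `add_prometheus_metrics` with its `query_fn` parameter is a hypothetical stand-in for whatever the trainer calls on `PrometheusClient`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PrometheusConfigSketch:
    """Illustrative subset of the rollout Prometheus config."""

    enable: bool = False
    port: int = 9090
    metrics_to_log: List[str] = field(default_factory=list)


def add_prometheus_metrics(
    metrics: Dict[str, float],
    config: PrometheusConfigSketch,
    query_fn: Callable[[List[str]], Dict[str, float]],
) -> Dict[str, float]:
    """Return the training metrics extended with queried Prometheus values."""
    # Nothing to do when the feature is off or no metrics were requested.
    if not config.enable or not config.metrics_to_log:
        return metrics
    merged = dict(metrics)
    merged.update(query_fn(config.metrics_to_log))
    return merged
```

In this sketch the merge happens once per step, just before the combined dictionary is handed to the experiment tracker.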

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide
Apply pre-commit checks
Add documentation
Add unit tests - Explained in Test section why tests are not included
Notify ci-request Slack channel (will do after PR creation)
Recipe submodule not affected

gemini-code-assist

Code Review

This pull request introduces a PrometheusClient to query and log metrics from Prometheus to experiment tracking backends. The implementation is well-structured, including features like Ray head node discovery, caching, and retry logic. However, I've identified two critical issues in the error handling within prometheus_utils.py. One issue in the Ray head node discovery defaults to localhost on any failure, which is problematic in a distributed setting. The other issue is in the metric querying loop, which silently swallows all exceptions. Addressing these will significantly improve the robustness and debuggability of this new feature.
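One possible way to address the two points raised in the review is sketched below. This is not the fix that was pushed to the PR: the function names and the `discover` and `query_one` callables are hypothetical, and returning `None` on discovery failure is just one alternative to defaulting to localhost.

```python
import logging
from typing import Callable, Dict, Iterable, Optional

logger = logging.getLogger(__name__)


def resolve_prometheus_host(discover: Callable[[], str]) -> Optional[str]:
    """Return the discovered host, or None so the caller can disable the feature."""
    try:
        return discover()
    except Exception:
        # Falling back to localhost would point at the wrong machine on a
        # multi-node cluster, so report the failure and return None instead.
        logger.exception("Could not discover the Ray head node")
        return None


def query_all(
    names: Iterable[str],
    query_one: Callable[[str], Optional[float]],
) -> Dict[str, float]:
    """Query each metric independently and log failures instead of hiding them."""
    results: Dict[str, float] = {}
    for name in names:
        try:
            value = query_one(name)
        except Exception as exc:
            logger.warning("Prometheus query for %r failed: %s", name, exc)
            continue
        if value is not None:
            results[name] = value
    return results
```

A failure for one metric leaves the others intact, and every failure leaves a log line that can be traced later.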

verl/experimental/agent_loop/prometheus_utils.py

guillemgt added 5 commits February 11, 2026 16:02

feat(misc): add option to log prometheus metrics in tracking

41f29ef

chore(misc): renamed some things

9f35127

chore(misc): updated prometheus documentation

19e01e6

chore(misc): regenerate ppo_veomni config with metrics_to_log field

2374692

chore(misc): updated prometheus documentation

94d4920

guillemgt requested review from PeterSH6, eric-haibin-lin, tongyx361 and vermouth1992 as code owners February 11, 2026 15:41

gemini-code-assist bot reviewed Feb 11, 2026

chore(misc): better error handling in prometheus client

3a60c25

guillemgt force-pushed the guillem.tarrach/upstream-prometheus-metrics branch from b1e6d83 to 3a60c25 Compare February 11, 2026 16:15

guillemgt wants to merge 6 commits into verl-project:main from guillemgt:guillem.tarrach/upstream-prometheus-metrics
