feat(reexecute/c): decouple metrics server and collector #4415

RodrigoVillar · 2025-10-12T21:04:49Z

Why this should be merged

As mentioned in #4362, splitting up the Prometheus server and collector is ideal for clients of Firewood who want access to VM metrics but would prefer to use their own monitoring stack.

Although #4362 also discusses the option of hardcoding the metrics server port, I've opted to split this out into a separate PR.

How this works

Changes the METRICS_ENABLED parameter to METRICS_MODE, which has the following three options:

disabled: starts neither the Prometheus server nor the collector
server-only: starts only the Prometheus server
full: starts both the Prometheus server and collector

I've opted to keep a single environment variable rather than splitting METRICS_ENABLED in two, since there should never be a case for starting the Prometheus collector without the Prometheus server. In server-only mode, the metrics endpoint URL is logged; in full mode, a Grafana URL is logged.
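As a rough illustration of the three modes, here is a minimal sketch of a mode type that satisfies the standard library's flag.Value interface. The `metrics-mode` flag name and the `shouldStartServer`/`shouldStartCollector` helpers appear elsewhere in this PR; the integer representation, the lowercase constant names, and `parseMode` are assumptions made for the example and are not necessarily what the test file does.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"strings"
)

// metricsMode models the three METRICS_MODE options.
// The int representation and constant names are illustrative assumptions.
type metricsMode int

const (
	metricsDisabled metricsMode = iota
	metricsServerOnly
	metricsFull
)

// String and Set together implement flag.Value.
func (m metricsMode) String() string {
	switch m {
	case metricsServerOnly:
		return "server-only"
	case metricsFull:
		return "full"
	default:
		return "disabled"
	}
}

func (m *metricsMode) Set(s string) error {
	switch strings.ToLower(strings.TrimSpace(s)) {
	case "disabled":
		*m = metricsDisabled
	case "server-only":
		*m = metricsServerOnly
	case "full":
		*m = metricsFull
	default:
		return fmt.Errorf("invalid metrics mode: %q (valid options: disabled, server-only, full)", s)
	}
	return nil
}

// The server runs in every mode except disabled.
func (m metricsMode) shouldStartServer() bool { return m != metricsDisabled }

// The collector runs only in full mode.
func (m metricsMode) shouldStartCollector() bool { return m == metricsFull }

// parseMode is a hypothetical helper that parses -metrics-mode from args.
func parseMode(args []string) (metricsMode, error) {
	var mode metricsMode
	fs := flag.NewFlagSet("reexecute", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the example output
	fs.Var(&mode, "metrics-mode", "one of: disabled, server-only, full")
	err := fs.Parse(args)
	return mode, err
}

func main() {
	mode, err := parseMode([]string{"-metrics-mode=server-only"})
	if err != nil {
		panic(err)
	}
	fmt.Println(mode, mode.shouldStartServer(), mode.shouldStartCollector())
}
```

With this shape there is no way to express "collector without server", which is the invariant the single variable is meant to protect.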

How this was tested

CI

Need to be documented in RELEASES.md?

No

RodrigoVillar · 2025-10-13T12:46:48Z

The full option is enabled in CI as seen here: https://github.com/ava-labs/avalanchego/actions/runs/18465753103/job/52607232591?pr=4415#step:4:968

When running locally with METRICS_MODE=server-only, you get the following:

goos: darwin
goarch: arm64
pkg: github.com/ava-labs/avalanchego/tests/reexecute/c
cpu: Apple M1 Max
BenchmarkReexecuteRange
BenchmarkReexecuteRange/[1,1000]-Config-default-Runner-dev
[10-13|08:45:03.588] INFO c-chain-reexecution c/vm_reexecute_test.go:616 metrics endpoint available {"url": "http://127.0.0.1:62042/ext/metrics"}
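For readers who want to feed this endpoint into their own monitoring stack, the following is a minimal sketch of fetching the metrics page over HTTP. The `/ext/metrics` path comes from the log line above; the port is chosen at runtime, so the example runs against a stand-in test server with a placeholder metric. `scrape` and `newStandInServer` are hypothetical names, not part of the PR.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// scrape fetches the body served at a metrics URL.
func scrape(url string) (string, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// newStandInServer imitates the benchmark's endpoint with a placeholder metric.
// The real endpoint serves the VM's metrics; this one exists only for the demo.
func newStandInServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/ext/metrics", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "example_metric 1\n")
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newStandInServer()
	defer srv.Close()

	// In a real run, use the URL printed by "metrics endpoint available".
	body, err := scrape(srv.URL + "/ext/metrics")
	if err != nil {
		panic(err)
	}
	fmt.Print(body)
}
```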

When running locally with METRICS_MODE=full, you get the following:

goos: darwin
goarch: arm64
pkg: github.com/ava-labs/avalanchego/tests/reexecute/c
cpu: Apple M1 Max
BenchmarkReexecuteRange
BenchmarkReexecuteRange/[1,1000]-Config-default-Runner-dev
[10-13|08:46:22.299] INFO prometheus tmpnet/monitor_processes.go:370 collector already running {"cmd": "prometheus"}
[10-13|08:46:22.299] INFO prometheus tmpnet/monitor_processes.go:610 waiting for collector readiness {"cmd": "prometheus", "url": "http://127.0.0.1:9090/-/ready", "logPath": "/Users/rodrigo.villar/.tmpnet/prometheus/prometheus.log"}
[10-13|08:46:22.300] INFO prometheus tmpnet/monitor_processes.go:634 collector ready {"cmd": "prometheus"}
[10-13|08:46:22.300] INFO prometheus tmpnet/monitor_processes.go:60 To stop: tmpnetctl stop-metrics-collector
[10-13|08:46:22.300] INFO c-chain-reexecution c/vm_reexecute_test.go:665 metrics available via grafana {"url": "https://grafana-poc.avax-dev.network/d/Gl1I20mnk/c-chain?&var-filter=network_uuid%7C%3D%7C8ce59182-3fcf-4a8d-8272-eeadbcea1537&var-filter=is_ephemeral_node%7C%3D%7Cfalse&from=1760359582300&to=now"}
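The Grafana link in the last log line packs the dashboard filters and the start time into its query string. Below is a small sketch that decodes them with net/url. Reading each `var-filter` value as `key|operator|value` is an inference from the decoded URL rather than something this PR states, and `parseDashboardURL` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
	"time"
)

// filter is one decoded var-filter entry (assumed layout: key|operator|value).
type filter struct {
	key, op, value string
}

// parseDashboardURL extracts the var-filter entries and the "from" timestamp
// (milliseconds since the Unix epoch) from a dashboard URL.
func parseDashboardURL(raw string) ([]filter, time.Time, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, time.Time{}, err
	}
	q := u.Query()

	var filters []filter
	for _, f := range q["var-filter"] {
		parts := strings.SplitN(f, "|", 3)
		if len(parts) != 3 {
			return nil, time.Time{}, fmt.Errorf("malformed filter %q", f)
		}
		filters = append(filters, filter{key: parts[0], op: parts[1], value: parts[2]})
	}

	ms, err := strconv.ParseInt(q.Get("from"), 10, 64)
	if err != nil {
		return nil, time.Time{}, fmt.Errorf("parsing from: %w", err)
	}
	return filters, time.UnixMilli(ms).UTC(), nil
}

func main() {
	// URL copied from the log output above.
	raw := "https://grafana-poc.avax-dev.network/d/Gl1I20mnk/c-chain?&var-filter=network_uuid%7C%3D%7C8ce59182-3fcf-4a8d-8272-eeadbcea1537&var-filter=is_ephemeral_node%7C%3D%7Cfalse&from=1760359582300&to=now"
	filters, from, err := parseDashboardURL(raw)
	if err != nil {
		panic(err)
	}
	for _, f := range filters {
		fmt.Printf("%s %s %s\n", f.key, f.op, f.value)
	}
	fmt.Println("from:", from.Format(time.RFC3339))
}
```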

Copilot

Pull Request Overview

This PR decouples the metrics server and collector functionality by replacing the boolean METRICS_ENABLED parameter with a more granular METRICS_MODE parameter. This allows users to run only the Prometheus server without the collector, enabling access to VM metrics while using their own monitoring stack.

Introduces a new metricsMode type with three values: disabled, server-only, and full
Replaces METRICS_ENABLED with METRICS_MODE across all configuration files and scripts
Updates the collectRegistry function to conditionally start the collector based on the metrics mode

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/reexecute/c/vm_reexecute_test.go | Implements the new metricsMode type with validation and updates function signatures |
| tests/reexecute/c/README.md | Updates documentation to explain the three metrics mode options |
| scripts/benchmark_cchain_range.sh | Changes flag from metrics-enabled to metrics-mode |
| Taskfile.yml | Updates default value from "false" to "disabled" for the new metrics mode |
| .github/actions/c-chain-reexecution-benchmark/action.yml | Sets CI to use "full" metrics mode instead of boolean true |

Taskfile.yml

aaronbuchwald · 2025-10-13T16:17:57Z

tests/reexecute/c/vm_reexecute_test.go

+func (m *metricsMode) Set(s string) error {
+	s = strings.ToLower(strings.TrimSpace(s))
+
+	switch s {
+	case "disabled":
+		*m = MetricsDisabled
+	case "server-only":
+		*m = MetricsServerOnly
+	case "full":
+		*m = MetricsFull
+	default:
+		return fmt.Errorf("invalid metrics mode: %s (valid options: disabled, server-only, full)", s)
+	}
+	return nil
+}


Could we just use a string here and perhaps an alias type rather than implementing Set?

Simplified the metricsMode type here: a7cb056
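One way to read the suggestion is a string-backed type that is parsed with the ordinary string flag and validated once afterwards. The sketch below assumes that reading; the contents of a7cb056 are not shown in this thread, so the names and structure here are illustrative only. Note that `type metricsMode string` is a defined type in Go terms rather than a true alias (`type metricsMode = string`), which keeps the constants type-checked.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// metricsMode is backed by a string, so no custom Set method is needed.
type metricsMode string

const (
	metricsDisabled   metricsMode = "disabled"
	metricsServerOnly metricsMode = "server-only"
	metricsFull       metricsMode = "full"
)

// validate rejects anything other than the three known modes.
func (m metricsMode) validate() error {
	switch m {
	case metricsDisabled, metricsServerOnly, metricsFull:
		return nil
	default:
		return fmt.Errorf("invalid metrics mode: %q (valid options: disabled, server-only, full)", string(m))
	}
}

// parseMode is a hypothetical helper: parse a plain string flag, then validate.
func parseMode(args []string) (metricsMode, error) {
	fs := flag.NewFlagSet("reexecute", flag.ContinueOnError)
	fs.SetOutput(io.Discard)
	raw := fs.String("metrics-mode", string(metricsDisabled), "one of: disabled, server-only, full")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	mode := metricsMode(*raw)
	return mode, mode.validate()
}

func main() {
	mode, err := parseMode([]string{"-metrics-mode=full"})
	fmt.Println(mode, err)
}
```

The trade-off is that validation moves from parse time to an explicit call, so every code path that builds a mode from user input has to remember to call `validate`.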

aaronbuchwald · 2025-10-13T16:29:30Z

tests/reexecute/c/vm_reexecute_test.go

+	if metricsMode.shouldStartServer() {
+		collectRegistry(b, log, "c-chain-reexecution", prefixGatherer, labels, metricsMode.shouldStartCollector())


Can we clean this up a little bit? It seems odd that we decompose metricsMode into two separate booleans and use one to handle the if condition here and the other half of it as an argument that gets passed in.

Cleaned up the logic for starting the metrics server vs starting the metrics server and collector here: 9e319d0

tests/reexecute/c/vm_reexecute_test.go

aaronbuchwald · 2025-10-14T13:36:10Z

tests/reexecute/c/README.md

+- `METRICS_MODE=disabled`: no metrics are available.
+- `METRICS_MODE=server-only`: starts a Prometheus server exporting VM metrics. A
+  link to the metrics endpoint is logged during execution.
+- `METRICS_MODE=full`: starts both a Prometheus server exporting VM metrics and


Given these names are not very self-explanatory (it's not clear what server-only and full refer to without the descriptions), I think it would be better to simply configure the two separately and require that if grafana is enabled, then the prometheus server must be enabled as well.

It's fine imo for the metrics server to be enabled and grafana disabled by default since that won't require setting any extra credentials.

Done: 5160738
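The constraint requested above (Grafana collection requires the Prometheus server) can be checked in a few lines. This is a sketch under assumed names; `metricsConfig`, its fields, and the error value are hypothetical and do not come from commit 5160738.

```go
package main

import (
	"errors"
	"fmt"
)

// metricsConfig holds the two independent switches.
type metricsConfig struct {
	serverEnabled    bool
	collectorEnabled bool
}

var errCollectorNeedsServer = errors.New("metrics collector requires the metrics server to be enabled")

// validate enforces that the collector is never enabled without the server.
func (c metricsConfig) validate() error {
	if c.collectorEnabled && !c.serverEnabled {
		return errCollectorNeedsServer
	}
	return nil
}

func main() {
	configs := []metricsConfig{
		{serverEnabled: false, collectorEnabled: false},
		{serverEnabled: true, collectorEnabled: false},
		{serverEnabled: true, collectorEnabled: true},
		{serverEnabled: false, collectorEnabled: true}, // the only invalid combination
	}
	for _, c := range configs {
		fmt.Printf("%+v -> %v\n", c, c.validate())
	}
}
```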

maru-ava · 2025-10-15T15:37:24Z

tests/reexecute/c/README.md

+If running locally, metrics collection can be customized via the following parameters:
+
+- `METRICS_SERVER_ENABLED`: starts a Prometheus server exporting VM metrics.
+- `METRICS_COLLECTOR_ENABLED`: starts a Prometheus collector (if enabled, then `METRICS_SERVER_ENABLED` must be enabled as well).


Why isn't this just implicit?

On its own, I'm not opposed to METRICS_COLLECTOR_ENABLED=true implicitly setting METRICS_SERVER_ENABLED=true as well. However, since this PR will be followed by #4418 (which adds the ability to configure a port for the metrics server), I think the implicit behavior becomes confusing: it isn't clear what happens when METRICS_COLLECTOR_ENABLED=true and METRICS_PORT=X without reading the description of METRICS_COLLECTOR_ENABLED.

This could be fixed by renaming METRICS_COLLECTOR_ENABLED to METRICS_SERVER_AND_COLLECTOR_ENABLED, but this looks similar to a previous iteration of this PR which received this review comment: #4415 (comment)

I'm not going to block on this, but I'd definitely suggest a rethink. I don't think the comment you linked to, which suggests separate flags, precludes improving what appears in this PR.

Can you elaborate? The only options regarding the flag design are as follows:

Enabling the collector implicitly enables the server.

Enabling the collector does not implicitly enable the server.

With this PR choosing the latter, I'm not sure what else needs to be rethought (if you think enabling the collector needs to implicitly enable the server, I'm happy to follow up on that).

I think the issue is the chosen terminology. ENABLE_METRICS_SERVER starts a PrometheusServer instance, and that name could be confused with an actual Prometheus server (one that collects and aggregates metrics). I recommend changing ENABLE_METRICS_SERVER to ENABLE_METRICS_EXPORT and PrometheusServer to MetricsExporter. That way there is an unambiguous relationship between exporting metrics and collecting those exported metrics. Once that change is made, I think it would be a no-brainer to implicitly enable metrics export when collection is enabled rather than requiring that the user explicitly configure that.

Thank you for the feedback. Repeating what I said in #4418, enabling the metrics collector in #4418 will implicitly enable the metrics server with a port of 0 (as of commit 830e8c1)
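A minimal sketch of that implicit behavior, under assumed names: enabling the collector turns the server on, and a port of 0 asks the operating system to pick any free port. `resolveServerEnabled`, `listenMetrics`, and `boundPort` are hypothetical helpers and are not taken from #4418.

```go
package main

import (
	"fmt"
	"net"
)

// resolveServerEnabled applies the implicit rule: the server runs if it was
// requested directly or if the collector (which depends on it) was requested.
func resolveServerEnabled(serverEnabled, collectorEnabled bool) bool {
	return serverEnabled || collectorEnabled
}

// listenMetrics binds a loopback listener. Port 0 lets the OS choose a free port.
func listenMetrics(port int) (net.Listener, error) {
	return net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", port))
}

// boundPort reports the port the listener actually received.
func boundPort(ln net.Listener) int {
	return ln.Addr().(*net.TCPAddr).Port
}

func main() {
	// Only the collector is requested; the server is enabled implicitly.
	if !resolveServerEnabled(false, true) {
		return
	}
	ln, err := listenMetrics(0)
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	fmt.Printf("metrics endpoint would be served on port %d\n", boundPort(ln))
}
```

Because the port is only known after binding, the URL has to be logged from the listener's address, which matches the changing port seen in the server-only log output earlier in this PR.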

feat(reexecution/c): decouple metrics server and collector

0178776

RodrigoVillar self-assigned this Oct 12, 2025

github-project-automation bot added this to avalanchego Oct 12, 2025

Merge branch 'master' into rodrigo/decouple-reexecution-metrics

4ad3f2b

RodrigoVillar added 2 commits October 13, 2025 08:52

Merge branch 'master' into rodrigo/decouple-reexecution-metrics

fc83c89

docs: improve collectRegistry

848c6ad

RodrigoVillar requested a review from AminR443 October 13, 2025 12:59

RodrigoVillar marked this pull request as ready for review October 13, 2025 13:00

RodrigoVillar requested review from aaronbuchwald, joshua-kim and maru-ava as code owners October 13, 2025 13:00

Copilot AI review requested due to automatic review settings October 13, 2025 13:00

Copilot AI reviewed Oct 13, 2025

aaronbuchwald reviewed Oct 13, 2025

Taskfile.yml (outdated, resolved)

aaronbuchwald reviewed Oct 13, 2025

Taskfile.yml (outdated, resolved)

aaronbuchwald reviewed Oct 13, 2025

tests/reexecute/c/vm_reexecute_test.go (resolved)

RodrigoVillar added 7 commits October 13, 2025 12:41

chore: set default to empty string

be761ac

docs: benchmark script

ca0b993

chore: simplify metricsMode

a7cb056

Merge branch 'master' into rodrigo/decouple-reexecution-metrics

d36d4ca

chore: unexport metricsMode

c7f3185

chore: clean up

9e319d0

chore: self-review nits

2cc0a79

AminR443 approved these changes Oct 14, 2025

RodrigoVillar requested a review from aaronbuchwald October 14, 2025 13:09

RodrigoVillar changed the title from "feat(reexecution/c): decouple metrics server and collector" to "feat(reexecute/c): decouple metrics server and collector" Oct 14, 2025

aaronbuchwald reviewed Oct 14, 2025

tests/reexecute/c/vm_reexecute_test.go (resolved)

aaronbuchwald reviewed Oct 14, 2025

RodrigoVillar added 3 commits October 14, 2025 10:27

chore: split metrics flag

5160738

chore: switch

1832b58

chore: reduce duplicate code

915ae5d

RodrigoVillar requested a review from aaronbuchwald October 14, 2025 15:25

aaronbuchwald approved these changes Oct 15, 2025

aaronbuchwald enabled auto-merge October 15, 2025 15:12

maru-ava reviewed Oct 15, 2025

Merge branch 'master' into rodrigo/decouple-reexecution-metrics

d846616

maru-ava approved these changes Oct 20, 2025

aaronbuchwald added this pull request to the merge queue Oct 20, 2025

Merged via the queue into master with commit 36fa7f3 Oct 20, 2025
35 checks passed

aaronbuchwald deleted the rodrigo/decouple-reexecution-metrics branch October 20, 2025 13:46

github-project-automation bot moved this to Done 🎉 in avalanchego Oct 20, 2025

maru-ava mentioned this pull request Oct 28, 2025

feat(reexecute/c): explicit metrics port #4418

Merged

5 participants
