Conversation

Contributor

@Elvis339 Elvis339 commented Dec 22, 2025

Why this should be merged

Patch is related to #1578

The aws-launch.sh script is currently not collecting metrics because it uses the old avalanchego task interface. The new version of avalanchego's Taskfile.yml changed how metrics configuration is passed:

  • Old: METRICS_ENABLED passed as a task variable
  • New: METRICS_SERVER_ENABLED and / or METRICS_COLLECTOR_ENABLED passed as environment variables

This mismatch causes metrics to silently not be configured when running benchmarks on EC2 instances.

Related PRs:

How this works

Changed the bootstrap command from:

task reexecute-cchain-range ... METRICS_ENABLED=false

To:

METRICS_SERVER_ENABLED=true task reexecute-cchain-range ...

This aligns with how avalanchego's benchmark_cchain_range.sh now reads these settings from the environment rather than from task variables.
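
The difference between the two forms can be sketched in plain shell. This is a minimal illustration, not code from AvalancheGo: `READER` is a hypothetical stand-in for a script that reads its setting from the environment, and the real benchmark_cchain_range.sh may behave differently.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a script that reads its metrics setting from the
# environment; the real benchmark_cchain_range.sh may differ.
unset METRICS_SERVER_ENABLED
READER='echo "metrics-server-enabled: ${METRICS_SERVER_ENABLED:-false}"'

# Nothing set in the environment: the fallback value is used.
sh -c "$READER"                                      # prints: metrics-server-enabled: false

# A VAR=value prefix places the variable in the child's environment.
METRICS_SERVER_ENABLED=true sh -c "$READER"          # prints: metrics-server-enabled: true

# A trailing NAME=value is only a positional argument; the child's
# environment is unchanged, so the setting is silently ignored.
sh -c "$READER" reader METRICS_SERVER_ENABLED=true   # prints: metrics-server-enabled: false
```

The third call mirrors the failure mode described above: the value is passed, but in a place the reader never looks, so nothing fails and nothing is configured.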

How this was tested

1. AvalancheGo environment variable behavior

Navigate to or clone AvalancheGo:

git clone git@github.com:ava-labs/avalanchego.git
cd avalanchego

Default (metrics disabled):

./scripts/run_task.sh reexecute-cchain-range \
  BLOCK_DIR=/tmp/test \
  CURRENT_STATE_DIR=/tmp/test \
  START_BLOCK=101 \
  END_BLOCK=102 2>&1 | grep -o '"metrics-server-enabled": "[^"]*"'

Expected: "metrics-server-enabled": "false"

With metrics enabled:

METRICS_SERVER_ENABLED=true ./scripts/run_task.sh reexecute-cchain-range \
  BLOCK_DIR=/tmp/test \
  CURRENT_STATE_DIR=/tmp/test \
  START_BLOCK=101 \
  END_BLOCK=102 2>&1 | grep -o '"metrics-server-enabled": "[^"]*"'

Expected: "metrics-server-enabled": "true"

2. Firewood aws-launch.sh flag behavior

Default (metrics enabled):

./benchmark/bootstrap/aws-launch.sh --dry-run 2>&1 | grep "Metrics Server:"

Expected: Metrics Server: true

Disable metrics:

./benchmark/bootstrap/aws-launch.sh --dry-run --metrics-server false 2>&1 | grep "Metrics Server:"

Expected: Metrics Server: false

Case insensitivity:

./benchmark/bootstrap/aws-launch.sh --dry-run --metrics-server TRUE 2>&1 | grep "Metrics Server:"

Expected: Metrics Server: true (normalized to lowercase)

Invalid value (fail fast):

./benchmark/bootstrap/aws-launch.sh --metrics-server tru 2>&1

Expected:

Error: Invalid --metrics-server value 'tru'
Valid values: true, false

- >
sudo -u ubuntu -D /mnt/nvme/ubuntu/avalanchego --login
time task reexecute-cchain-range CURRENT_STATE_DIR=/mnt/nvme/ubuntu/exec-data/current-state BLOCK_DIR=/mnt/nvme/ubuntu/exec-data/blocks START_BLOCK=1 END_BLOCK=__END_BLOCK__ CONFIG=__CONFIG__ METRICS_ENABLED=false
METRICS_SERVER_ENABLED=false METRICS_COLLECTOR_ENABLED=false
Contributor Author

Question for reviewers: Should metrics be configurable via a CLI flag (e.g., --enable-metrics)?

Currently I kept metrics disabled (METRICS_SERVER_ENABLED=false) to match the previous behavior. To test with metrics enabled, you would need to:

  1. Manually edit the script to change false to true:

    METRICS_SERVER_ENABLED=true METRICS_COLLECTOR_ENABLED=true
  2. Or we could add a --enable-metrics flag to the script that sets these environment variables accordingly.

Let me know if you'd like me to add this as a configurable option.

Contributor

We shouldn't need two different ways to pass parameters to the task. This is why I was saying the parameter needed to be added to the taskfile.

Contributor Author

@Elvis339 Elvis339 Dec 22, 2025

You're right, we don't need both. I initially included METRICS_COLLECTOR_ENABLED to show that it's possible to pass multiple env vars.

The two options serve different purposes:

  • METRICS_SERVER_ENABLED - Starts a metrics HTTP server that exposes a /metrics endpoint
  • METRICS_COLLECTOR_ENABLED - Controls whether a Prometheus agent is started to collect and forward metrics to a remote Prometheus instance (used by tmpnet and CI workflows)

Since this script runs Grafana locally and scrapes the metrics endpoint, you only need METRICS_SERVER_ENABLED.

As for why these are environment variables instead of task variables, that's just how we designed it in ava-labs/avalanchego#4443. Task variables are for the benchmark itself (which blocks, which config); environment variables are for the runtime context (metrics, Prometheus creds, runner type). Either approach would work; this is just convention.

Member

@rkuris rkuris left a comment

I would prefer the default on, not off. Why would someone not want metrics?

@github-actions

Metrics Change Detection ⚠️

This PR contains changes related to metrics:

-            firewood_counter!(
+            firewood_counter!("ffi.commit_ms", "FFI commit timing in milliseconds")
-        firewood_counter!("firewood.ffi.batch_ms", "FFI batch timing in milliseconds")
+        firewood_counter!("ffi.batch_ms", "FFI batch timing in milliseconds")
-        firewood_counter!("firewood.ffi.batch", "Number of FFI batch operations").increment(1);
+        firewood_counter!("ffi.batch", "Number of FFI batch operations").increment(1);
-            firewood_counter!(
+            firewood_counter!("ffi.cached_view.miss", "Number of FFI cached view misses")
-            firewood_counter!(
+            firewood_counter!("ffi.cached_view.hit", "Number of FFI cached view hits").increment(1);
-            firewood_counter!(
-            firewood_counter!("firewood.ffi.commit", "Number of FFI commit operations")
+            firewood_counter!("ffi.commit_ms", "FFI commit timing in milliseconds")
+            firewood_counter!("ffi.commit", "Number of FFI commit operations").increment(1);
-            firewood_counter!(
-            firewood_counter!("firewood.ffi.merge", "Number of FFI merge operations").increment(1);
+            firewood_counter!("ffi.commit_ms", "FFI commit timing in milliseconds")
+            firewood_counter!("ffi.merge", "Number of FFI merge operations").increment(1);
-        firewood_counter!(
-        firewood_counter!("firewood.ffi.propose", "Number of FFI propose operations").increment(1);
+        firewood_counter!("ffi.propose_ms", "FFI propose timing in milliseconds")
+        firewood_counter!("ffi.propose", "Number of FFI propose operations").increment(1);
-            proposals: firewood_counter!("firewood.proposals", "Number of proposals created"),
+            proposals: firewood_counter!("proposals", "Number of proposals created"),
-            firewood_gauge!(
+            firewood_gauge!("max_revisions", "Maximum number of revisions configured")
-                firewood_counter!("firewood.insert", "Number of merkle insert operations", "merkle" => "update").increment(1);
+                firewood_counter!("insert", "Number of merkle insert operations", "merkle" => "update").increment(1);
-                firewood_counter!("firewood.insert", "Number of merkle insert operations", "merkle"=>"above").increment(1);
+                firewood_counter!("insert", "Number of merkle insert operations", "merkle"=>"above").increment(1);
-                            firewood_counter!("firewood.insert", "Number of merkle insert operations", "merkle"=>"below").increment(1);
+                            firewood_counter!("insert", "Number of merkle insert operations", "merkle"=>"below").increment(1);
-                        firewood_counter!("firewood.insert", "Number of merkle insert operations", "merkle"=>"split").increment(1);
+                        firewood_counter!("insert", "Number of merkle insert operations", "merkle"=>"split").increment(1);
-                firewood_counter!("firewood.insert", "Number of merkle insert operations", "merkle" => "split").increment(1);
+                firewood_counter!("insert", "Number of merkle insert operations", "merkle" => "split").increment(1);
-            firewood_counter!("firewood.remove", "Number of merkle remove operations", "prefix" => "false", "result" => "nonexistent")
+            firewood_counter!("remove", "Number of merkle remove operations", "prefix" => "false", "result" => "nonexistent")
-            firewood_counter!("firewood.remove", "Number of merkle remove operations", "prefix" => "false", "result" => "success").increment(1);
+            firewood_counter!("remove", "Number of merkle remove operations", "prefix" => "false", "result" => "success").increment(1);
-            firewood_counter!("firewood.remove", "Number of merkle remove operations", "prefix" => "false", "result" => "nonexistent")
+            firewood_counter!("remove", "Number of merkle remove operations", "prefix" => "false", "result" => "nonexistent")
-            firewood_counter!("firewood.remove", "Number of merkle remove operations", "prefix" => "true", "result" => "nonexistent").increment(1);
+            firewood_counter!("remove", "Number of merkle remove operations", "prefix" => "true", "result" => "nonexistent").increment(1);
-        firewood_counter!("firewood.remove", "Number of merkle remove operations", "prefix" => "true", "result" => "success")
+        firewood_counter!("remove", "Number of merkle remove operations", "prefix" => "true", "result" => "success")
-        firewood_counter!("firewood.read_node", "Number of node reads", "from" => "file")
+        firewood_counter!("read_node", "Number of node reads", "from" => "file").increment(1);
-        firewood_counter!("firewood.cache.node", "Number of node cache operations", "mode" => mode, "type" => if cached.is_some() { "hit" } else { "miss" })
+        firewood_counter!("cache.node", "Number of node cache operations", "mode" => mode, "type" => if cached.is_some() { "hit" } else { "miss" })
-        firewood_counter!("firewood.cache.freelist", "Number of freelist cache operations", "type" => if cached.is_some() { "hit" } else { "miss" }).increment(1);
+        firewood_counter!("cache.freelist", "Number of freelist cache operations", "type" => if cached.is_some() { "hit" } else { "miss" }).increment(1);
-        firewood_counter!("firewood.io.read_ms", "IO read timing in milliseconds")
+        firewood_counter!("io.read_ms", "IO read timing in milliseconds")
-        firewood_counter!("firewood.io.read", "Number of IO read operations").increment(1);
+        firewood_counter!("io.read", "Number of IO read operations").increment(1);
-        firewood_counter!("firewood.read_node", "Number of node reads", "from" => "memory")
+        firewood_counter!("read_node", "Number of node reads", "from" => "memory").increment(1);
-        firewood_counter!("firewood.flush_nodes", "amount flushed nodes").increment(flush_time);
+        firewood_counter!("flush_nodes", "amount flushed nodes").increment(flush_time);

However, the dashboard was not modified.

You may need to update benchmark/Grafana-dashboard.json accordingly.


This check is automated to help maintain the dashboard.

@Elvis339
Contributor Author

> I would prefer the default on, not off. Why would someone not want metrics?

The default was false; I updated it to true.

@Elvis339 Elvis339 requested a review from rkuris December 23, 2025 06:14
@Elvis339 Elvis339 force-pushed the es/enable-metrics-in-launch-script branch from 93f01e1 to 3cdd82f Compare December 23, 2025 06:19
@Elvis339 Elvis339 force-pushed the es/enable-metrics-in-launch-script branch from 3cdd82f to fa8676d Compare December 23, 2025 06:20
@rkuris
Member

rkuris commented Dec 23, 2025

This just needs a little more TLC.

As Brandon mentioned, configuration variables for the script should be handled the same way. This means setting METRICS_SERVER_ENABLED should be just another parameter to the task file.

We could use some way to shut it off via an option when the script is run.

How are invalid values handled? Is the error sane? If I set it to "tru" will it give me a good error message?

This must be tested. Did it work? Please update the PR to show how you tested it. Ideally you tested it with firewood bootstrap.

Summary of offline discussion:

  • There is a move toward environment variables rather than adding more task variables. The code is being refactored anyway in https://github.com/ava-labs/avalanchego/blob/0eebefb80b34849c44c7657a431b57b50e194300/scripts/benchmark_cchain_range.sh#L15
  • Still could use an option to turn it off.
  • Invalid options are ignored; this should be fixed. It's a line or two to check that the value is one of true|TRUE|false|FALSE and to print a sane error otherwise, potentially saving some debugging down the road. Failing fast is always better, especially for a tool like this that is run manually during debugging.
  • Some testing was done, I'll do some final acceptance testing once the PR is fully ready.
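
The validation asked for in the third bullet could look roughly like the sketch below. The function name `normalize_metrics_server` is hypothetical and this is not the code that was merged; the error text is borrowed from the expected output in the PR description. It lowercases with `tr` instead of bash's `${var,,}` so it also runs on older bash versions, and as a side effect it accepts mixed case such as `True`.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the code merged in this PR.
# Prints the normalized value on stdout; on invalid input, prints an error
# on stderr and returns 1.
normalize_metrics_server() {
  local value
  # Lowercase with tr so the sketch does not depend on bash 4's ${var,,}.
  value="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$value" in
    true|false)
      printf '%s\n' "$value"
      ;;
    *)
      echo "Error: Invalid --metrics-server value '$1'" >&2
      echo "Valid values: true, false" >&2
      return 1
      ;;
  esac
}

normalize_metrics_server TRUE                      # prints: true
normalize_metrics_server tru || echo "rejected"    # error on stderr, then prints: rejected
```

Returning a non-zero status instead of calling `exit` leaves the decision to abort with the caller, which keeps the helper easy to test in isolation.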

Add configurable METRICS_SERVER_ENABLED environment variable to align
with AvalancheGo's updated Taskfile which reads metrics configuration
from environment variables instead of task variables.

- Add --metrics-server flag (true/false, default: true)
- Validate input with clear error message for invalid values
- Normalize input to lowercase for case-insensitive matching
- Display metrics server setting in configuration summary
@Elvis339
Contributor Author

Elvis339 commented Dec 23, 2025

@rkuris this commit 189d359

  • Added --metrics-server flag to control whether the metrics server is enabled during benchmark runs
  • Default is true (metrics enabled)
  • Use --metrics-server false to disable
  • Validates input and normalizes case (e.g., TRUE → true)
  • Rejects invalid values like tru with a clear error

Validation is done here because the underlying AvalancheGo task only recognizes the exact strings true and false; anything else silently disables metrics. This makes the behavior explicit on the consumer side (Firewood) rather than letting typos fail silently.

# Normalize to lowercase
METRICS_SERVER="${2,,}"
# Validate boolean value
if [[ "$METRICS_SERVER" != "true" && "$METRICS_SERVER" != "false" ]]; then
Contributor Author

If the team agrees, we can also map common boolean aliases: true|t|1|yes|y|on and false|f|0|no|n|off
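
For illustration only, the alias mapping proposed here might be sketched as follows. `parse_bool` is a hypothetical name and its error message is invented for this sketch; nothing like it was added in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating the alias proposal; not part of this PR.
# Maps common boolean spellings to "true" or "false"; returns 1 otherwise.
parse_bool() {
  local value
  # Compare case-insensitively by lowercasing the input first.
  value="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$value" in
    true|t|1|yes|y|on)  echo "true" ;;
    false|f|0|no|n|off) echo "false" ;;
    *)
      echo "Error: '$1' is not a recognized boolean" >&2
      return 1
      ;;
  esac
}

parse_bool YES   # prints: true
parse_bool off   # prints: false
```

The trade-off is a wider accepted vocabulary against a slightly less obvious contract, since inputs such as `1` and `on` become valid.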

Contributor

@demosdemon demosdemon Dec 29, 2025

This is fine. Fancier handling can be implemented in fwdctl (#1318).

@Elvis339 Elvis339 enabled auto-merge (squash) December 30, 2025 11:41
@Elvis339 Elvis339 merged commit 1a1b988 into main Dec 30, 2025
45 checks passed
@Elvis339 Elvis339 deleted the es/enable-metrics-in-launch-script branch December 30, 2025 11:46
4 participants