@gupadhyaya (Contributor) commented Sep 12, 2025

Add Comprehensive Metrics for Blob Submission Path

Overview

This PR adds comprehensive OpenTelemetry metrics to track performance and behavior of the blob submission path in both the blob and state modules. The metrics provide detailed insights into submission timing, error rates, gas estimation performance, and resource usage.

What's Added

Blob Module Metrics (blob/metrics.go)

  • Retrieval tracking: Get operations, timing, and error rates
  • Proof tracking: Proof generation timing and success rates
  • Error categorization: Detailed error tracking with context

Note: Submission metrics are handled by the state module to avoid duplication.

State Module Metrics (state/metrics.go)

  • PFB submission tracking: PayForBlob transaction metrics with gas details
  • Gas estimation metrics: Timing and success rates for gas calculations
  • Gas price estimation: Performance tracking for price estimation
  • Account query metrics: Account lookup timing and error rates

Key Metrics

Blob Metrics

blob_retrieval_total
blob_retrieval_duration_seconds
blob_proof_total
blob_proof_duration_seconds

Note: Blob submission metrics are tracked in the state module to avoid duplication.

State Metrics

state_pfb_submission_total{blob_count, total_size_bytes, gas_price_utia}
state_pfb_submission_duration_seconds
state_pfb_gas_estimation_duration_seconds
state_pfb_gas_price_estimation
state_gas_estimation_total
state_account_query_total

How to Use

1. Enable Metrics in Node Configuration

# Start node with OTLP metrics enabled
./celestia bridge start --metrics --metrics.endpoint localhost:4318 --metrics.tls=false
./celestia light start --metrics --metrics.endpoint localhost:4318 --metrics.tls=false

2. Set Up OTLP Collector (Required for Grafana)

Create otel-collector-config.yaml:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "celestia"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]

Run OTLP collector:

docker run -p 4318:4318 -p 8889:8889 -v $(pwd)/otel-collector-config.yaml:/etc/otel-collector-config.yaml otel/opentelemetry-collector-contrib:latest --config=/etc/otel-collector-config.yaml

3. Query Metrics Endpoint

# Get all metrics from OTLP collector
curl http://localhost:8889/metrics

# Filter blob metrics
curl http://localhost:8889/metrics | grep "blob_"

# Filter state metrics  
curl http://localhost:8889/metrics | grep "state_"

4. Monitor Specific Operations

# Monitor blob retrieval operations
curl -s http://localhost:8889/metrics | grep "blob_retrieval_total"

# Monitor blob proof generation
curl -s http://localhost:8889/metrics | grep "blob_proof_total"

# Monitor gas estimation performance
curl -s http://localhost:8889/metrics | grep "state_pfb_gas_estimation"

5. Prometheus Integration

Add to your prometheus.yml:

scrape_configs:
  - job_name: 'celestia-node-otlp'
    static_configs:
      - targets: ['localhost:8889']  # OTLP collector's Prometheus endpoint
    metrics_path: '/metrics'
    scrape_interval: 15s

6. Grafana Dashboard

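Configure Grafana with a Prometheus data source that points at your Prometheus server (http://localhost:9090 by default), not at the collector's exporter endpoint on port 8889. Grafana queries the Prometheus HTTP API, which the collector's /metrics endpoint does not serve; Prometheus scrapes that endpoint as configured in step 5.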

Create dashboards to visualize:

  • Blob retrieval and proof generation performance
  • State PFB submission performance
  • Error rates and types
  • Gas estimation metrics
  • Account query performance

7. Architecture Overview

Celestia Node (OpenTelemetry) → OTLP Collector (Prometheus format) → Prometheus → Grafana
  • Celestia Node: Exports OpenTelemetry metrics to the OTLP collector
  • OTLP Collector: Converts OpenTelemetry metrics to Prometheus format and exposes them on port 8889
  • Prometheus: Scrapes the collector endpoint (step 5)
  • Grafana: Visualizes metrics queried from Prometheus

Example Metrics Output

# State PFB metrics (from OTLP collector)
celestia_pfb_count_total{instance="12D3KooW...",job="test/Bridge",label1="value1",otel_scope_name="state"} 5
celestia_last_pfb_timestamp_total{instance="12D3KooW...",job="test/Bridge",label1="value1",otel_scope_name="state"} 1.758100573267e+12

# Blob metrics
celestia_blob_retrieval_observable_total{instance="12D3KooW...",job="test/Bridge",label1="value1",otel_scope_name="blob"} 0
celestia_blob_proof_observable_total{instance="12D3KooW...",job="test/Bridge",label1="value1",otel_scope_name="blob"} 0

# Discovery metrics
celestia_archival_discovery_amount_of_peers{instance="12D3KooW...",job="test/Bridge",label1="value1",otel_scope_name="share_discovery"} 0
celestia_archival_discovery_find_peers_result_total{enough_peers="false",instance="12D3KooW...",job="test/Bridge",label1="value1",otel_scope_name="share_discovery"} 65

Note: Metrics are prefixed with celestia_ and include OpenTelemetry scope information when exported through the OTLP collector.
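
To make the shape of these lines concrete, here is a minimal Go sketch that splits one exported sample into its name, labels, and value. It is an illustration only, not code from this PR: `parseSample` is a hypothetical helper, and it deliberately ignores escaped quotes inside label values and optional trailing timestamps, which the full Prometheus text format allows.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sample is one parsed line of Prometheus text exposition output.
type sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// parseSample parses `name{k="v",...} value` or `name value`.
// Simplified: no escaped quotes in label values, no trailing timestamp.
func parseSample(line string) (sample, error) {
	s := sample{Labels: map[string]string{}}
	line = strings.TrimSpace(line)
	var rest string
	if open := strings.IndexByte(line, '{'); open >= 0 {
		end := strings.LastIndexByte(line, '}')
		if end < open {
			return s, fmt.Errorf("unbalanced braces in %q", line)
		}
		s.Name = line[:open]
		body := line[open+1 : end]
		rest = line[end+1:]
		for body != "" {
			eq := strings.Index(body, `="`)
			if eq < 0 {
				return s, fmt.Errorf("malformed label in %q", line)
			}
			closeQ := strings.IndexByte(body[eq+2:], '"')
			if closeQ < 0 {
				return s, fmt.Errorf("unterminated label value in %q", line)
			}
			s.Labels[body[:eq]] = body[eq+2 : eq+2+closeQ]
			// Drop the consumed label and the comma that may follow it.
			body = strings.TrimPrefix(body[eq+2+closeQ+1:], ",")
		}
	} else {
		sp := strings.IndexAny(line, " \t")
		if sp < 0 {
			return s, fmt.Errorf("missing value in %q", line)
		}
		s.Name, rest = line[:sp], line[sp:]
	}
	v, err := strconv.ParseFloat(strings.TrimSpace(rest), 64)
	if err != nil {
		return s, fmt.Errorf("bad value in %q: %w", line, err)
	}
	s.Value = v
	return s, nil
}

func main() {
	// Line copied from the example output above.
	line := `celestia_pfb_count_total{instance="12D3KooW...",job="test/Bridge",label1="value1",otel_scope_name="state"} 5`
	s, err := parseSample(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(s.Name, s.Labels["otel_scope_name"], s.Value)
}
```

The `otel_scope_name` label is what separates the modules here: `state` and `blob` in the samples above correspond to the two modules this PR instruments.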

Fixes #4538

@gupadhyaya added the kind:feat label Sep 12, 2025

@codecov-commenter commented Sep 12, 2025

Codecov Report

❌ Patch coverage is 16.42651% with 290 lines in your changes missing coverage. Please review.
✅ Project coverage is 35.55%. Comparing base (2469e7a) to head (24a6adc).
⚠️ Report is 600 commits behind head on main.

Files with missing lines     Patch %   Lines
state/metrics.go             7.10%     183 Missing ⚠️
blob/metrics.go              6.38%     88 Missing ⚠️
state/core_access.go         70.96%    5 Missing and 4 partials ⚠️
nodebuilder/settings.go      54.54%    3 Missing and 2 partials ⚠️
blob/service.go              62.50%    2 Missing and 1 partial ⚠️
nodebuilder/state/core.go    0.00%     1 Missing ⚠️
share/root.go                0.00%     1 Missing ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4549      +/-   ##
==========================================
- Coverage   44.83%   35.55%   -9.29%     
==========================================
  Files         265      305      +40     
  Lines       14620    20502    +5882     
==========================================
+ Hits         6555     7289     +734     
- Misses       7313    12253    +4940     
- Partials      752      960     +208     

@Wondertan (Member) left a comment

Solid. Mainly one comment about metrics duplication. I also wonder whether you tested this on Grafana, so we don't accidentally get a nil pointer or a metric that doesn't work.

@gupadhyaya (Contributor, Author) commented

wonder if you tested this on Grafana, so we don't accidentally get a nil pointer or a metric that doesn't work

Yes, tested locally on Grafana.

- Remove unsupported broadcast control options from bitswap
- Add missing HashOnRead method to BlockstoreWithMetrics
- Enable metrics flags in tastora framework for bridge and light nodes
The HashOnRead method is not part of the current boxo Blockstore interface
- Change nodeImage to use local celestia-node-local image
- Update defaultNodeTag to use our commit 6bc2316 with metrics fixes
The interface check was failing because HashOnRead method was removed from the Blockstore interface
Wondertan previously approved these changes Sep 23, 2025

@renaynay (Member) left a comment

  • pls make sure go mods for our internal deps (celestiaorg) match what's on main branch (like don't update go-square)
  • I'd prefer to do metric observations (time it took) in the helper methods themselves instead of inline in the submission code path as the submission code path is already extremely long and convoluted
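
One way to read the second point is that each helper times itself through a deferred call, so the submission path carries no inline measurement code. The Go sketch below shows that pattern under stated assumptions: `recorder`, `observe`, and `estimateGas` are hypothetical stand-ins rather than names from this PR, and the in-memory maps take the place of real OpenTelemetry instruments.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// recorder is a hypothetical in-memory stand-in for a metrics type;
// a real one would record to OpenTelemetry instruments instead.
type recorder struct {
	mu       sync.Mutex
	calls    map[string]int
	failures map[string]int
	total    map[string]time.Duration
}

func newRecorder() *recorder {
	return &recorder{
		calls:    map[string]int{},
		failures: map[string]int{},
		total:    map[string]time.Duration{},
	}
}

// observe records one call of op, the time elapsed since start,
// and whether the call failed.
func (r *recorder) observe(op string, start time.Time, err error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.calls[op]++
	r.total[op] += time.Since(start)
	if err != nil {
		r.failures[op]++
	}
}

// estimateGas is a hypothetical helper that measures itself. The named
// result err lets the deferred closure see the final error value.
func estimateGas(r *recorder, blobCount int) (gas uint64, err error) {
	start := time.Now()
	defer func() { r.observe("gas_estimation", start, err) }()

	if blobCount <= 0 {
		return 0, errors.New("no blobs to estimate")
	}
	return uint64(blobCount) * 1000, nil // placeholder, not a real gas formula
}

func main() {
	r := newRecorder()
	_, _ = estimateGas(r, 2) // succeeds
	_, _ = estimateGas(r, 0) // fails
	fmt.Println(r.calls["gas_estimation"], r.failures["gas_estimation"])
}
```

With this shape, a caller on the submission path only invokes `estimateGas`; the start time, duration, and error bookkeeping stay inside the helper.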

Resolved conflicts in:
- nodebuilder/header/config.go: Fixed return value in trustedPeers function
- share/root.go: Fixed return value in RowsWithNamespace function
- state/core_access.go: Kept metrics-related code from HEAD
- state/core_access_test.go: Kept metrics parameter in NewCoreAccessor
- nodebuilder/tests/tastora/go.mod: Used origin/main versions
- nodebuilder/tests/tastora/go.sum: Used origin/main versions
- Fixed estimateGasForBlobs function call to match new signature
- Properly declared author variable to avoid redeclaration
- Maintained metrics functionality while fixing function signature changes
- Applied goimports-reviser formatting to all Go files
- Fixed import grouping and ordering according to project standards
- This addresses the lint-imports failures after the merge
- Reordered imports to match goimports-reviser expectations
- Grouped imports: standard library, third-party, external project, local project
- This resolves the lint-imports failure
- Fix duplicate // indirect comment in go.mod
- Fix compilation error in state/core_access_test.go where accounts[0].Name was called on a string

@renaynay (Member) left a comment

  • Lint errors that need resolving
  • thank you for adding grafana dashboard, I UTACK that
  • might need to be rebased
  • bunch of root level files that do not belong at root
  • some of my comments re necessity of some of the metrics not addressed - pls let's not overbloat w unnecessary metrics.

@gupadhyaya gupadhyaya force-pushed the blob_submission_metrics branch from cb476b3 to c373bd9 Compare October 21, 2025 07:32
renaynay previously approved these changes Oct 21, 2025

@gupadhyaya (Contributor, Author) commented

closing in favor of #4664

@gupadhyaya gupadhyaya closed this Oct 28, 2025
Linked issue: feat(blob/state): add metrics to submission