Conversation

@johnsonw
Contributor

EHT-1025 - Replace otel deps with prometheus client

@johnsonw johnsonw self-assigned this Aug 27, 2025
@johnsonw johnsonw added the enhancement New feature or request label Aug 27, 2025

codecov bot commented Aug 27, 2025

Codecov Report

❌ Patch coverage is 95.02392% with 52 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.29%. Comparing base (a678244) to head (5c35210).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| lustrefs-exporter/src/lib.rs | 85.25% | 14 Missing and 9 partials ⚠️ |
| lustrefs-exporter/src/jobstats.rs | 92.18% | 9 Missing and 11 partials ⚠️ |
| lustrefs-exporter/src/routes.rs | 82.35% | 0 Missing and 3 partials ⚠️ |
| lustrefs-exporter/src/main.rs | 0.00% | 2 Missing ⚠️ |
| lustrefs-exporter/src/brw_stats.rs | 99.75% | 1 Missing ⚠️ |
| lustrefs-exporter/src/host.rs | 96.77% | 0 Missing and 1 partial ⚠️ |
| lustrefs-exporter/src/metrics.rs | 96.66% | 1 Missing ⚠️ |
| lustrefs-exporter/src/quota.rs | 98.38% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #108      +/-   ##
==========================================
+ Coverage   92.76%   93.29%   +0.52%     
==========================================
  Files          43       43              
  Lines        5391     5220     -171     
  Branches     5391     5220     -171     
==========================================
- Hits         5001     4870     -131     
+ Misses        326      283      -43     
- Partials       64       67       +3     
| Flag | Coverage | Δ |
|---|---|---|
| 2_14_0_ddn133 | 36.05% <0.00%> | -0.06% ⬇️ |
| 2_14_0_ddn145 | 37.13% <0.00%> | -0.06% ⬇️ |
| all-tests | 93.29% <95.02%> | +0.52% ⬆️ |

Flags with carried forward coverage won't be shown.


github-actions bot commented Aug 27, 2025

🐰 Bencher Report

Branch: johnsonw/prometheus-client-switch
Testbed: ci-runner

| Benchmark | Latency (ms) | Δ% | Baseline (ms) | Lower Boundary (ms) (Limit %) | Upper Boundary (ms) (Limit %) |
|---|---|---|---|---|---|
| jobstats 100 | 1.50 | -0.09% | 1.50 | 1.43 (95.63%) | 1.57 (95.65%) |
| jobstats 1000 | 14.34 | -1.12% | 14.50 | 13.94 (97.23%) | 15.06 (95.21%) |
🐰 View full continuous benchmarking report in Bencher


github-actions bot commented Aug 27, 2025

🐰 Bencher Report

Branch: johnsonw/prometheus-client-switch
Testbed: ci-runner

⚠️ WARNING: No Threshold found! Without a Threshold, no Alerts will ever be generated. (Only the Instructions measure below has a Threshold configured.)

Benchmark: lustre_metrics::memory_benches::bench_encode_lustre_metrics with_setup:generate_records()

| Measure | Result |
|---|---|
| D1 Miss Rate | 0.91 % |
| D1mr (read misses) | 24.45 x 1e3 |
| D1mw (write misses) | 8.66 x 1e3 |
| DLmr (read misses) | 107.00 |
| DLmw (write misses) | 6.37 x 1e3 |
| Dr (reads) | 2.44 x 1e6 |
| Dw (writes) | 1.22 x 1e6 |
| Estimated Cycles | 14.61 x 1e6 |
| I1 Miss Rate | 0.01 % |
| I1mr (read misses) | 996.00 |
| ILmr (read misses) | 843.00 |
| Instructions | 10.60 x 1e6 (-22.99%); Baseline: 13.77 x 1e6; Lower Boundary: 3.66 x 1e6 (34.49%); Upper Boundary: 23.87 x 1e6 (44.40%) |
| L1 Hit Rate | 99.76 % |
| L1 Hits | 14.22 x 1e6 |
| LL Hit Rate | 0.19 % |
| LL Hits | 26.79 x 1e3 |
| LL Miss Rate | 0.05 % |
| LLd Miss Rate | 0.18 % |
| LLi Miss Rate | 0.01 % |
| RAM Hit Rate | 0.05 % |
| RAM Hits | 7.32 x 1e3 |
| Total read+write | 14.26 x 1e6 |
🐰 View full continuous benchmarking report in Bencher


github-actions bot commented Aug 27, 2025

🐰 Bencher Report

Branch: johnsonw/prometheus-client-switch
Testbed: ci-runner

⚠️ WARNING: No Threshold found! Without a Threshold, no Alerts will ever be generated. (Only the peak_rss_mib measure below has a Threshold configured.)

Benchmark: scrape_allocations

| Measure | Result |
|---|---|
| avg_runtime_rss_mib | 491.83 MiB |
| avg_runtime_virtual_mib | 1.17 MiB x 1e3 |
| end_rss_mib | 450.21 MiB |
| end_virtual_mib | 1.12 MiB x 1e3 |
| memory_growth_mib | 7.32 MiB |
| peak_over_start_rss_ratio | 2.44 |
| peak_over_start_virtual_ratio | 1.21 |
| peak_rss_mib | 649.02 MiB (+148.54%); Baseline: 261.13 MiB; Lower Boundary: -215.80 MiB (-33.25%); Upper Boundary: 738.07 MiB (87.93%) |
| peak_virtual_mib | 1.44 MiB x 1e3 |
| start_rss_mib | 442.89 MiB |
| start_virtual_mib | 1.11 MiB x 1e3 |
| virtual_growth_mib | 15.47 MiB |
🐰 View full continuous benchmarking report in Bencher

] }
pretty_assertions = "1.4.1"
prometheus = "0.14"
prometheus-client = { git = "https://github.com/whamcloud/client_rust", branch = "whamcloud-08-12-2025" }
Member

Is there an upstream PR for these changes?

Contributor Author

There are a couple of patches that are in review:

prometheus/client_rust#278
prometheus/client_rust#279

This patch does not have a PR upstream: whamcloud/client_rust@d81f9a1

@jgrund (Member) left a comment

  • Some snapshot outputs have a different number of lines (and the output appears to have changed order). Please pass through all the snapshots and ensure that the content has not notably changed (we cannot break backwards compatibility).
  • Output coming from the server should be Vec<u8>. I noticed that this patch changes some output back to String.
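
For context on the Vec<u8> point, here is a minimal sketch of the requested shape (illustrative only: `encode_metrics` is a hypothetical helper, not the exporter's actual code, and it assumes the prometheus-client text encoder, which writes into a `std::fmt::Write`):

```rust
use prometheus_client::{encoding::text::encode, registry::Registry};

// Hypothetical helper: the text encoder writes into a `String` (it takes
// `std::fmt::Write`), but the value handed back to the HTTP layer stays
// `Vec<u8>` rather than `String`.
fn encode_metrics(registry: &Registry) -> Result<Vec<u8>, std::fmt::Error> {
    let mut buf = String::new();
    encode(&mut buf, registry)?;
    Ok(buf.into_bytes())
}
```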

@jgrund jgrund added this to the 2025082700 milestone Aug 27, 2025
  changes
- update `normalize_docs` to strip the trailing `.` from help text so that the period at the end is not used during comparison (see the sketch below)
- update a few other properties to match the existing help text

Signed-off-by: William Johnson <[email protected]>
Signed-off-by: William Johnson <[email protected]>
Signed-off-by: William Johnson <[email protected]>
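
A rough sketch of the trailing-period handling described in the commit above (function names are hypothetical; the exporter's actual `normalize_docs` may differ):

```rust
// Hypothetical stand-in for the normalization described above: ignore a
// trailing `.` on metric help text when comparing against existing help text.
fn normalize_help(help: &str) -> &str {
    let help = help.trim_end();
    help.strip_suffix('.').unwrap_or(help)
}

fn help_matches(expected: &str, actual: &str) -> bool {
    normalize_help(expected) == normalize_help(actual)
}
```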
@johnsonw
Contributor Author

johnsonw commented Aug 27, 2025

I've updated https://github.com/whamcloud/lustrefs-exporter/blob/main/lustrefs-exporter/benches/lustre_metrics.rs#L38 to use Vec<u8> again. Just want to make sure there isn't a different spot you are thinking about.

@johnsonw
Contributor Author

Comparing snapshots is a bit difficult due to the following:

  1. New snapshots have a different number of lines than the historical snapshots. This is caused by a couple of things:
     - The historical snapshots include a newline between sections; when the prometheus-client encoder generates its output, it does not include a newline between sections.
     - Many historical snapshots have duplicate lines. For example, in lustrefs_exporter__tests__stats.histsnap the line
       lustre_connected_clients{component="mdt",target="ai400x2-MDT0000"} 1
       appears twice. lustrefs_exporter__tests__valid_fixture_2.14.0_ddn133_quota.txt.histsnap is also a good example of this, as it has around 650 duplicate entries (mostly starting with lustre_block_maps_milliseconds_total).
     - Prometheus-client adds an #EOF to the end of its output. This does not exist in the historical snapshots.
     - We add target_info{service_name="lustrefs-exporter",telemetry_sdk_language="rust",telemetry_sdk_name="opentelemetry",telemetry_sdk_version="0.29.0"} 1 to the new snapshots.

  2. New snapshots contain the otel_scope_name="lustre" label, which does not exist in the historical snapshots.

  3. Attribute order changed in just about every snapshot. Even though each line is essentially the same (aside from the otel_scope_name label being added), the attribute order changed, which makes it difficult to view the diff.

The approach I have taken to compare snapshots is the following:

  1. Create a temporary file and copy the historical snapshot into it.
  2. Remove all duplicate lines.
  3. Sort the attributes to match the order in the new snapshot.
  4. Remove the otel_scope_name label.
  5. Run sort <filepath> to sort the file.
  6. Open a second temporary file.
  7. Run sort <new snapshot file> | pbcopy and paste the results into the second temporary file.
  8. Diff the contents of the first temporary file against the second.

This procedure compares the two files purely as text to see what is different between them. That said, I think we also have a test that compares the historical vs new snapshots: https://github.com/whamcloud/lustrefs-exporter/blob/johnsonw/prometheus-client-switch/lustrefs-exporter/src/lib.rs#L330

Would it be worth writing a unit test that performs the above procedure from a text standpoint?
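
For reference, a rough sketch of what such a text-level test could look like (helper names and fixture paths are hypothetical; it just mechanizes the manual steps above and assumes label values contain no commas):

```rust
use std::collections::BTreeSet;

/// Drop the otel_scope_name label, including any comma that separated it
/// from a neighbouring label.
fn strip_otel_scope(line: &str) -> String {
    line.replace("otel_scope_name=\"lustre\",", "")
        .replace(",otel_scope_name=\"lustre\"", "")
        .replace("otel_scope_name=\"lustre\"", "")
}

/// Sort the labels inside `{...}` so attribute order does not affect the
/// diff. Naive split on ','; assumes label values contain no commas.
fn sort_labels(line: &str) -> String {
    match (line.find('{'), line.rfind('}')) {
        (Some(open), Some(close)) if open < close => {
            let mut labels: Vec<&str> = line[open + 1..close].split(',').collect();
            labels.sort_unstable();
            format!("{}{{{}}}{}", &line[..open], labels.join(","), &line[close + 1..])
        }
        _ => line.to_string(),
    }
}

/// Normalize a snapshot: drop blank, `# EOF`/`#EOF`, and `target_info` lines,
/// strip the otel_scope_name label, sort labels, de-duplicate, and sort lines.
fn normalize_snapshot(snapshot: &str) -> Vec<String> {
    snapshot
        .lines()
        .map(str::trim)
        .filter(|l| {
            !l.is_empty()
                && !l.starts_with("# EOF")
                && !l.starts_with("#EOF")
                && !l.starts_with("target_info")
        })
        .map(|l| sort_labels(&strip_otel_scope(l)))
        .collect::<BTreeSet<_>>()
        .into_iter()
        .collect()
}

#[test]
fn historical_and_new_snapshots_match_textually() {
    // Hypothetical fixture paths.
    let historical = include_str!("snapshots/stats.histsnap");
    let new = include_str!("snapshots/stats.snap");
    assert_eq!(normalize_snapshot(historical), normalize_snapshot(new));
}
```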

- Update some comments

Signed-off-by: William Johnson <[email protected]>
@johnsonw johnsonw marked this pull request as ready for review August 27, 2025 18:43
@mdiep25 mdiep25 modified the milestones: 2025082700, 2025091000 Aug 27, 2025
Signed-off-by: William Johnson <[email protected]>
  `otelsnap` extension. Otherwise, we will receive an error indicating
  that there are stale snapshots.

Signed-off-by: William Johnson <[email protected]>
@johnsonw
Contributor Author

johnsonw commented Aug 28, 2025

Benchmark on server running original binary:

original-lustrefs-exporter2.mp4

RES: 1.6G
Average time / request: 23.4s

@johnsonw
Contributor Author

johnsonw commented Aug 28, 2025

Benchmark on server running el8 binary:

el8-lustrefs-exporter2.mp4

RES: 1.4G
Average time / request: 1.34s

@johnsonw johnsonw force-pushed the johnsonw/prometheus-client-switch branch from f72e507 to 667cff4 Compare August 29, 2025 15:59
@jgrund
Member

jgrund commented Aug 29, 2025

> Benchmark on server running el8 binary:
>
> el8-lustrefs-exporter2.mp4
> RES: 1.4G Average time / request: 1.34s

Does your benchmark take compression into account? I wonder if that is / is not making a difference here.

@johnsonw
Contributor Author

johnsonw commented Aug 29, 2025

Benchmark on server running el8 binary compiled with the mimalloc global allocator:

el8-mimalloc-lustrefs-exporter2.mp4

RES: 992MB
Average time / request: 1.29s

@johnsonw
Contributor Author

Lustre alerts are working:
image

@johnsonw johnsonw force-pushed the johnsonw/prometheus-client-switch branch from 4d9717b to 40ddc41 Compare August 30, 2025 03:49
Signed-off-by: William Johnson <[email protected]>
- remove stale snapshot

Signed-off-by: William Johnson <[email protected]>
Signed-off-by: William Johnson <[email protected]>
@johnsonw johnsonw force-pushed the johnsonw/prometheus-client-switch branch from ce555f3 to ddd771f Compare September 2, 2025 16:35
  https requests to hit the /metrics endpoint. This will make it closer
  to what we are doing in real life.

Signed-off-by: William Johnson <[email protected]>
@johnsonw
Contributor Author

johnsonw commented Sep 3, 2025

> Benchmark on server running el8 binary:
> el8-lustrefs-exporter2.mp4
> RES: 1.4G Average time / request: 1.34s
>
> Does your benchmark take compression into account? I wonder if that is / is not making a difference here.

I've updated the benchmark to run an HTTP server and make HTTP requests. In fact, it uses the same app() call to construct the service, set up gzip, etc. This makes the benchmark much closer to what we see on a live system. The benchmarks have been updated (a rough sketch of the benchmark shape follows the charts below):

Chart: Scrape Memory Metrics
Library: Otel (Based on main branch)
Notable: RSS almost reached 1.6 GiB. This is very close to what I saw on a live system (~1.7 GiB).
Link: https://bencher.dev/console/projects/lustrefs-exporter/perf?report=70781f3b-b789-4393-89e9-1816bcf1a249&branches=49aa4653-c3c0-4fde-9e8c-9d8ff08e5abb&heads=9adb14cd-d3b5-4132-94b3-b7956b517306&testbeds=84e6157f-c316-46cc-b713-79e028002265&benchmarks=85fb7422-d312-4f4b-b783-81d51494565f&measures=79483139-f37c-4438-abe3-b55761a739db&lower_boundary=true&upper_boundary=true&start_time=1754256573000&end_time=1756848573000&key=true&reports_per_page=4&branches_per_page=8&testbeds_per_page=8&benchmarks_per_page=8&plots_per_page=8&reports_page=1&branches_page=1&testbeds_page=1&benchmarks_page=1&plots_page=1

image

Chart: Scrape Memory Metrics
Library: Prometheus-Client
Notable: RSS around 700 MB. I've seen it hit this much on a live system; it just depends on how the global allocator decides to release memory back to the OS.
Link: https://bencher.dev/console/projects/lustrefs-exporter/perf?report=42a8f22a-c3c4-4ff8-a3e2-5a0606956227&branches=7b0b6c87-a9e5-4677-bb2d-3afa586d912b&heads=b4ba7fcd-b2f6-4b66-b550-b007ed6df833&testbeds=84e6157f-c316-46cc-b713-79e028002265&benchmarks=85fb7422-d312-4f4b-b783-81d51494565f&measures=79483139-f37c-4438-abe3-b55761a739db&lower_boundary=true&upper_boundary=true&start_time=1754248906000&end_time=1756840906000&key=true&reports_per_page=4&branches_per_page=8&testbeds_per_page=8&benchmarks_per_page=8&plots_per_page=8&reports_page=1&branches_page=1&testbeds_page=1&benchmarks_page=1&plots_page=1

image
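
For completeness, here is a rough sketch of the shape of such an HTTP benchmark (assumptions: the crate exposes the `app()` router constructor mentioned above with no arguments, axum 0.7-style `axum::serve`, reqwest built with the `gzip` feature, and criterion's `async_tokio` support; this is not the exporter's actual benchmark code):

```rust
// benches/scrape_http.rs (hypothetical) - bench the full HTTP scrape path.
use criterion::{criterion_group, criterion_main, Criterion};
use lustrefs_exporter::app; // assumed export of the axum Router constructor
use tokio::net::TcpListener;

fn bench_http_scrape(c: &mut Criterion) {
    let rt = tokio::runtime::Runtime::new().unwrap();

    // Serve the same router the production binary uses (routes, gzip layer,
    // etc.) on an ephemeral local port.
    let addr = rt.block_on(async {
        let listener = TcpListener::bind("127.0.0.1:0").await.unwrap();
        let addr = listener.local_addr().unwrap();
        tokio::spawn(async move { axum::serve(listener, app()).await.unwrap() });
        addr
    });

    let client = reqwest::Client::builder().gzip(true).build().unwrap();
    let url = format!("http://{addr}/metrics");

    c.bench_function("scrape_metrics_over_http", |b| {
        b.to_async(&rt).iter(|| async {
            // A full request/response round trip, including gzip decoding.
            let body = client.get(&url).send().await.unwrap().bytes().await.unwrap();
            assert!(!body.is_empty());
        });
    });
}

criterion_group!(benches, bench_http_scrape);
criterion_main!(benches);
```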

@jgrund jgrund merged commit 0066053 into main Sep 9, 2025
19 checks passed
@jgrund jgrund deleted the johnsonw/prometheus-client-switch branch September 9, 2025 17:27