Add Fleet Detection Plugin #6151

jonathanrainer · 2024-10-15T13:11:08Z

Adds an initial plugin that loads at startup and emits metrics for three simple cases: cpus, cpu_freq and total_memory. I'm not sure this is the correct approach, especially as this plugin will expand over time, so I'm happy to take any pointers in that regard.
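For readers without the diff to hand, here is a rough, std-only sketch of how two of the three values could be gathered. It is not the code in this PR: `Sample`, `cpu_count`, `parse_total_memory_bytes` and `collect` are hypothetical names, and reading Linux's /proc/meminfo is an assumption made for illustration. cpu_freq is left out because the standard library has no portable way to read it.

```rust
use std::thread;

/// One metric sample: a name and an integer value (hypothetical shape).
#[derive(Debug, PartialEq)]
pub struct Sample {
    pub name: &'static str,
    pub value: u64,
}

/// Number of logical CPUs available to this process, falling back to 1
/// if the platform cannot report it.
pub fn cpu_count() -> u64 {
    thread::available_parallelism()
        .map(|n| n.get() as u64)
        .unwrap_or(1)
}

/// Parse total memory in bytes from the text of Linux's /proc/meminfo.
/// Returns None if the `MemTotal` line is missing or malformed.
pub fn parse_total_memory_bytes(meminfo: &str) -> Option<u64> {
    let line = meminfo.lines().find(|l| l.starts_with("MemTotal:"))?;
    // Expected form: "MemTotal:       16315404 kB" (the unit is 1024 bytes).
    let kib: u64 = line.split_whitespace().nth(1)?.parse().ok()?;
    kib.checked_mul(1024)
}

/// Build the samples that can be derived from the given meminfo text.
/// total_memory is skipped when it cannot be parsed.
pub fn collect(meminfo: &str) -> Vec<Sample> {
    let mut samples = vec![Sample { name: "cpus", value: cpu_count() }];
    if let Some(bytes) = parse_total_memory_bytes(meminfo) {
        samples.push(Sample { name: "total_memory", value: bytes });
    }
    samples
}
```

On Linux the input could come from `std::fs::read_to_string("/proc/meminfo")`; other platforms would need a different source.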

Tested using the OpenTelemetry Collector with the following router config (first block) and OTEL Collector config (second block):

telemetry:
  apollo:
    experimental_otlp_endpoint: "http://0.0.0.0:4317"
  instrumentation:
    spans:
      mode: "spec_compliant"

receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  debug:
    verbosity: detailed

processors:
  batch:
  filter:
    metrics:
      include:
        match_type: regexp
        metric_names:
          - apollo.router.instance.*

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [filter]
      exporters: [debug]

And produced the following

svc-apollo-docs · 2024-10-15T13:11:11Z

✅ Docs Preview Ready

No new or changed pages found.

github-actions · 2024-10-15T13:11:22Z

@jonathanrainer, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.

router-perf · 2024-10-15T13:11:40Z

CI performance tests

apollo-router/src/plugins/fleet_detector.rs

bnjjj

Last but not least, could you run a perf test with this plugin enabled? We have a benchmark system in place called router-scale (check the docs in our own Confluence space). It would be great to create a flamegraph from these benchmarks to make sure the plugin doesn't consume a lot of resources. Feel free to ask if you need help.

bnjjj · 2024-10-17T15:16:35Z

...uter/src/configuration/snapshots/apollo_router__configuration__tests__schema_generation.snap

+    "IntrospectionMode": {
+      "description": "Which implementation of GraphQL schema introspection to use, if enabled",
+      "oneOf": [
+        {
+          "description": "Use the new Rust-based implementation.",
+          "enum": [
+            "new"
+          ],
+          "type": "string"
+        },
+        {
+          "description": "Use the old JavaScript-based implementation.",
+          "enum": [
+            "legacy"
+          ],
+          "type": "string"
+        },
+        {
+          "description": "Use Rust-based and Javascript-based implementations side by side, logging warnings if the implementations disagree.",
+          "enum": [
+            "both"
+          ],
+          "type": "string"
+        }
+      ]
+    },


Why is this part of this PR?

We kept having CI failures, I think because of changes to the JSON schema for the config. I thought this would fix it, but it doesn't. Any ideas much appreciated!

jonathanrainer · 2024-10-29T10:27:56Z

OK @bnjjj, I've run the perf tests, building the router from this branch and additionally enabling metrics so that we can see them being emitted. I'll upload the top output for the router because that's probably the most instructive. Memory usage goes up by about 200Mi over the course of the test, which doesn't seem terrible to me, but I don't have a baseline so it's hard to compare. I'll also attach the router logs.

I have a few further questions as well to push us forward on this:

1. We want to ensure that customers can turn this off, and we were thinking of reusing the APOLLO_TELEMETRY_DISABLED env var. Are you folks happy with that?
2. We'll need to add documentation about what's being collected and how to turn it off. Should we do that on this branch too, or is it better to do it separately?
3. The CI on this branch seems to be failing on config schema generation, and try as I might I can't see how to fix it. Any pointers?
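To make the first question concrete, here is a minimal sketch of what gating the plugin on that env var could look like. The function names and the accepted spellings ("1" or "true", case-insensitive) are assumptions for illustration only; the router's existing handling of APOLLO_TELEMETRY_DISABLED may parse the value differently.

```rust
use std::env;

/// Environment variable proposed in this thread as the opt-out switch.
pub const TELEMETRY_DISABLED_VAR: &str = "APOLLO_TELEMETRY_DISABLED";

/// Interpret the raw value of the variable.
/// Assumption of this sketch: "1" or "true" (any case, surrounding
/// whitespace ignored) means disabled; anything else, or unset, does not.
pub fn is_disabled(raw: Option<&str>) -> bool {
    match raw {
        Some(value) => matches!(
            value.trim().to_ascii_lowercase().as_str(),
            "1" | "true"
        ),
        None => false,
    }
}

/// Hypothetical helper the plugin could call at startup to decide
/// whether to register its metrics at all.
pub fn fleet_detection_enabled() -> bool {
    let raw = env::var(TELEMETRY_DISABLED_VAR).ok();
    !is_disabled(raw.as_deref())
}
```

Keeping the parsing in a pure function (`is_disabled`) means it can be unit tested without mutating the process environment.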

otel.log
router.log
top.router.txt

apollo-router/src/plugins/fleet_detector.rs

bnjjj

@jonathanrainer To make a comparison, could you run the same benchmark on dev? That will give you a baseline to compare this branch against.

bnjjj · 2024-11-06T08:56:06Z

apollo-router/src/plugins/fleet_detector.rs

+    // We have to store a reference to the gauge otherwise it will be dropped once the plugin is
+    // initialised, even though it still has data to emit
+    freq_gauge: ObservableGauge<u64>,


@BrynCooke Could you confirm it will handle hot reload properly?

Yes, I believe it will, but do test it.
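The comment on `freq_gauge` describes a drop-based lifetime: the callback only stays registered while something owns the gauge. The sketch below illustrates that pattern with stand-in types. `Registry`, `GaugeHandle` and `register` are hypothetical and are not the OpenTelemetry API; only the field name `freq_gauge` comes from the diff.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

type Callback = Box<dyn Fn() -> u64>;

/// Stand-in for a meter: holds the callbacks that produce gauge values.
#[derive(Default)]
pub struct Registry {
    callbacks: RefCell<HashMap<&'static str, Callback>>,
}

impl Registry {
    /// Invoke every registered callback and return (name, value) pairs,
    /// sorted by name so the output is deterministic.
    pub fn collect(&self) -> Vec<(&'static str, u64)> {
        let mut out: Vec<_> = self
            .callbacks
            .borrow()
            .iter()
            .map(|(name, cb)| (*name, cb()))
            .collect();
        out.sort();
        out
    }
}

/// Stand-in for an observable gauge: unregisters its callback on drop.
pub struct GaugeHandle {
    name: &'static str,
    registry: Rc<Registry>,
}

impl Drop for GaugeHandle {
    fn drop(&mut self) {
        self.registry.callbacks.borrow_mut().remove(self.name);
    }
}

/// Register a callback and return the handle that keeps it alive.
pub fn register(
    registry: &Rc<Registry>,
    name: &'static str,
    cb: impl Fn() -> u64 + 'static,
) -> GaugeHandle {
    registry.callbacks.borrow_mut().insert(name, Box::new(cb));
    GaugeHandle { name, registry: Rc::clone(registry) }
}

/// The plugin owns the handle, so the gauge lives as long as the plugin.
pub struct FleetDetector {
    pub freq_gauge: GaugeHandle,
}
```

Under this model, a hot reload that drops the old plugin instance also unregisters its gauge, and the new instance registers a fresh one. Whether the real gauge behaves the same way is exactly what the "do test" above is asking to verify.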

jonathanrainer · 2024-11-06T10:34:29Z

@bnjjj Ah yes, apologies, I should have thought of that. I've done that below and reran the tests I posted above so the comparison is easier. From the memory figures it looks like the plugin does increase memory, but it's not the constant increase we were seeing before. The baseline the test starts from also appears higher on the branch, which presumably isn't plugin related because the plugin won't be running in the early part of the test (I imagine). I've attached the logs again; let me know if you think there's anything we need to worry about.

Dev Files
otel_dev.log
router_dev.log
top.router_dev.txt

Branch Files
otel_branch.log
router_branch.log
top.router_branch.txt

bnjjj

Thanks @jonathanrainer LGTM

BrynCooke

CPU and memory should be gauges. We also talked about activate.

jonathanrainer requested review from a team as code owners October 15, 2024 13:11

nmoutschen reviewed Oct 15, 2024

jonathanrainer force-pushed the jr/task/FLEET-19 branch 2 times, most recently from 7e971cf to cad95b7 on October 16, 2024 06:23

bnjjj reviewed Oct 17, 2024

jonathanrainer force-pushed the jr/task/FLEET-19 branch from 87dfa7c to 5e47ad8 on October 28, 2024 10:47

jonathanrainer requested a review from bnjjj October 29, 2024 10:28

jonathanrainer force-pushed the jr/task/FLEET-19 branch 2 times, most recently from 25ce977 to 492ab28 on November 1, 2024 10:50

garypen reviewed Nov 6, 2024

bnjjj reviewed Nov 6, 2024

jonathanrainer added 9 commits November 6, 2024 10:05

FLEET-19 Add initial take on Fleet Detection Plugin

3e2c16e

Adds an initial plugin, that loads at startup and emits metrics for three simple cases: cpus, cpu_freq and total_memory.

FLEET-19 Fix linting issues

179d01a

FLEET-19 Fix incorrect Linux name

205a428

FLEET-19 Fix clippies

d8de417

FLEET-19 Respond to PR comments

86ecaa1

FLEET-19 Fix /proc/filesystems query to be more resilient

dbc64b0

FLEET-19 Change gears to make metrics less spammy

0d866c2

FLEET-19 Use correct metrics form

d5f2321

Also improve how we handle refreshing the System object, rather than doing it in either the callback or the async task, contain that within a struct and do it there.

FLEET-19 Responding to PR comments

c652b17
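The d5f2321 commit message above mentions containing the System refresh inside a struct rather than doing it in the callback or the async task. One way to sketch that idea with only the standard library is a small wrapper that owns the refresh and caches the last reading for a minimum interval. `CachedReading` and its methods are hypothetical names, and the throttling interval is an assumption; the actual struct in the PR may work differently.

```rust
use std::time::{Duration, Instant};

/// Wraps an expensive read (for example, refreshing system information)
/// and reuses the last value until `min_interval` has elapsed.
pub struct CachedReading<F: FnMut() -> u64> {
    read: F,
    min_interval: Duration,
    last: Option<(Instant, u64)>,
}

impl<F: FnMut() -> u64> CachedReading<F> {
    pub fn new(read: F, min_interval: Duration) -> Self {
        Self { read, min_interval, last: None }
    }

    /// Return the cached value if it is still fresh at `now`,
    /// otherwise perform the read and cache the result.
    pub fn get_at(&mut self, now: Instant) -> u64 {
        if let Some((at, value)) = self.last {
            if now.duration_since(at) < self.min_interval {
                return value;
            }
        }
        let value = (self.read)();
        self.last = Some((now, value));
        value
    }

    /// Convenience wrapper that uses the current time.
    pub fn get(&mut self) -> u64 {
        self.get_at(Instant::now())
    }
}
```

Callers such as a gauge callback would only ever call `get`, so the decision about when to refresh lives in one place. Passing `now` explicitly in `get_at` keeps the logic testable without sleeping.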

jonathanrainer force-pushed the jr/task/FLEET-19 branch from 492ab28 to c652b17 on November 6, 2024 10:05

jonathanrainer requested a review from garypen November 6, 2024 10:06

jonathanrainer requested a review from bnjjj November 6, 2024 10:34

bnjjj approved these changes Nov 6, 2024

FLEET-19 Fixing CI with new schema generation

f3ac1fc

BrynCooke requested changes Nov 6, 2024
