Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metricdog: add boot time related metrics to send-boot-success #3204

Merged
merged 1 commit into from
Jul 18, 2023

Conversation

etungsten
Copy link
Contributor

@etungsten etungsten commented Jun 15, 2023

Issue number:

N/A

Description of changes:

    metricdog: add boot-time-related metrics to send-boot-success
    
    Refactors a bit to accommodate a new trait for checking host status
    using systemd utilities.

Testing done:
Checked emitted event from metricdog upon first boot and was able to serialize out the following:

{
...
  "sender": "metricdog",
  "version": "1.15.0",
  "variant": "aws-k8s-1.27",
  "arch": "x86_64",
  "region": "us-west-2",
  "seed": 2039,
  "version_lock": "latest",
  "ignore_waves": true,
  "event": "boot_success",
  "is_first_boot": true,
  "preconfigured_time_ms": 12841,
  "configured_time_ms": 16096,
  "network_ready_time_ms": 7966,
  "filesystem_ready_time_ms": 4624
}

After a reboot, metricdog sent another send-boot-success event with the following parameters:

{
..
  "sender": "metricdog",
  "version": "1.15.0",
  "variant": "aws-k8s-1.27",
  "arch": "x86_64",
  "region": "us-west-2",
  "seed": 2039,
  "version_lock": "latest",
  "ignore_waves": true,
  "event": "boot_success",
  "is_first_boot": false,
  "preconfigured_time_ms": 5090,
  "configured_time_ms": 5780,
  "network_ready_time_ms": 4399,
  "filesystem_ready_time_ms": 2072
}

Note how is_first_boot is no longer true and it took less time for the host to become ready as expected.

Checked a health-ping event to make sure there's no regression:

{
..
  "sender": "metricdog",
  "version": "1.15.0",
  "variant": "aws-k8s-1.27",
  "arch": "x86_64",
  "region": "us-west-2",
  "seed": 2039,
  "version_lock": "latest",
  "ignore_waves": true,
  "event": "health_ping",
  "is_healthy": true,
  "failed_services": {}
}

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@etungsten etungsten requested a review from webern June 15, 2023 20:43
sources/metricdog/src/systemd.rs Outdated Show resolved Hide resolved
sources/metricdog/src/host_check.rs Outdated Show resolved Hide resolved
sources/metricdog/src/main.rs Show resolved Hide resolved
sources/metricdog/src/metricdog.rs Outdated Show resolved Hide resolved
sources/metricdog/src/systemd.rs Show resolved Hide resolved
sources/metricdog/src/systemd.rs Outdated Show resolved Hide resolved
@etungsten
Copy link
Contributor Author

etungsten commented Jul 10, 2023

Push above addresses @arnaldo2792's comments.

Push below rebases onto develop

sources/metricdog/src/systemd.rs Outdated Show resolved Hide resolved
sources/metricdog/src/host_check.rs Outdated Show resolved Hide resolved
sources/metricdog/src/systemd.rs Outdated Show resolved Hide resolved
Comment on lines 130 to 133
values.insert(
"network_ready_time".to_string(),
self.host_check.network_ready_time()?,
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want these values to be stable, to simplify querying the results - would it make sense to have a real struct with known, optional fields? Or an enum with two variants - BootStatus and HealthStatus?

sources/metricdog/src/host_check.rs Outdated Show resolved Hide resolved
@etungsten
Copy link
Contributor Author

etungsten commented Jul 13, 2023

Push above addresses @bcressey's comments.

Adds filesystem_ready_time_ms as another dimension that tracks when local-fs.target becomes active.
Refactors how the query parameters are constructed to use structs and enums to denote fields more stably.

Unit tests pass.

Generated metrics event:

{
...
  "sender": "metricdog",
  "version": "1.15.0",
  "variant": "aws-k8s-1.27",
  "arch": "x86_64",
  "region": "us-west-2",
  "seed": 2039,
  "version_lock": "latest",
  "ignore_waves": true,
  "event": "boot_success",
  "is_first_boot": true,
  "preconfigured_time_ms": 12841,
  "configured_time_ms": 16096,
  "network_ready_time_ms": 7966,
  "filesystem_ready_time_ms": 4624
}

Updated PR description with new testing.

Copy link
Contributor

@stmcginnis stmcginnis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Refactors to accommodate a new trait for checking host status using
systemd utilities.

Refactors how query parameters are constructed with structs to denote
fields more stably.
Copy link
Contributor

@stmcginnis stmcginnis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better error handling!

})
}

fn is_active(service: &str) -> Result<bool> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: My knee-jerk reaction anytime I see an &str as a funtion argument is to suggest to use AsRef<str> to make it more friendly to use and robust. Given these functions are only used in this module... I can see why we wouldn't do that. It just looked odd to my eyes given most of our other code does so.

@etungsten etungsten merged commit 4e7145a into bottlerocket-os:develop Jul 18, 2023
@etungsten etungsten deleted the timing-metrics branch July 18, 2023 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants