Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add opt-in metrics to CloudWatch agent #988

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ouranos
Copy link
Contributor

@ouranos ouranos commented Feb 25, 2022

Following #811, we've removed our custom patch to install the CloudWatch agent.

The default config (downloaded with amazon-cloudwatch-agent-ctl -m ec2 -a fetch-config -c default -s) had an extra config block to reports on memory and disk usage.

This is a nice feature to monitor buildkite instances utilisation (and assist with dimensioning).

For example, see the bottom right graph:
image

We could potentially add other dimensions to make it easier to differentiate by queue name. I'm currently using the ASG name which might change when updating the stack.

Comment on lines 22 to 25
"ImageId": "${aws:ImageId}",
"InstanceId": "${aws:InstanceId}",
"InstanceType": "${aws:InstanceType}",
"AutoScalingGroupName": "${aws:AutoScalingGroupName}"
Copy link

@freewil freewil Feb 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for purposes of cloudwatch metric billing, each unique dimension combination is considered 1 metric, so including granular dimensions like InstanceId or even InstanceType could really blow up costs for those of us with a large number of ec2 instances.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. How about having just the AutoScalingGroupName dimension then?
Or would there be a way to have the queue as a dimension?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i think that makes sense. not sure if it's possible to add the queue, but i think that'd be nice as well. It should be available as a tag on the instance, BuildkiteQueue.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this a bit, and one way to add BuildkiteQueue as a dimension would be to modify the cloudwatch config file on boot using jq (similar to docker), as part of the bk-install-elastic-stack.sh script. You should have access to BUILDKITE_QUEUE from there and then just enable and start the service on boot instead of at AMI creation time.

@ouranos ouranos force-pushed the add-default-metrics-to-cloudwatch branch from 19278b8 to 5756f14 Compare March 4, 2022 01:11
@ouranos ouranos changed the title Add default metrics to CloudWatch agent Add opt-in metrics to CloudWatch agent Mar 4, 2022
@ouranos
Copy link
Contributor Author

ouranos commented Mar 4, 2022

@freewil Ok updated the PR after much testing (lost a bit of time with the issue mentioned in #995)

I've done the following changes:

  • Added a new stack parameter (EnableCloudWatchMetrics) to opt-in - by default the behavior stays the same (ie, no metrics)
  • Deferred the start of the CloudWatch agent to boot hook bk-configure-cloudwatch-agent.sh (so we can generate the configuration at run time)
  • Tweaked the append_dimensions

As per the documentation:

The only supported key-value pairs for append_dimensions are shown in the following list. Any other key-value pairs are ignored. ["ImageID", "InstanceId", "InstanceType", "AutoScalingGroupName"]
If you want to append dimensions to metrics with arbitrary key-value pairs, use the append_dimensions parameter in the field for that particular type of metric.

After this change I get the following metrics in CloudWatch (ignore the pollution due to various tests):
image

image
image

I reckon we don't need both the ASG name and the Queue, so we could drop one of them.

Let me know what you think. Maybe we can change the namespace too? Or make it more customisable?
I don't really care about the disk metrics (the auto clean up works fine for us) so we could drop it too.

Comment on lines +6 to +19
cw_config="/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json"
cat <<<"$(jq \
--arg queue "$BUILDKITE_QUEUE" \
'. + {
metrics: {
metrics_collected: {
mem: {measurement: ["mem_used_percent"], append_dimensions: {BuildkiteQueue: $queue}},
disk: {measurement: ["used_percent"], resources: ["*"], append_dimensions: {BuildkiteQueue: $queue}}
},
append_dimensions: {
AutoScalingGroupName: "${aws:AutoScalingGroupName}"
}
}
}' $cw_config)" >$cw_config
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it actually rewrites it. It's the same pattern used to configure docker, albeit more complicated.

  • The jq command will read the $cw_config file, append the metrics configuration block and output it (jq [options] <jq filter> [file...]). Where the jq filter is:
    • . identity filter
    • {}: object construction
    • +: addition operator between two objects. So takes the original configuration and merge the new object created:

      Objects are added by merging, that is, inserting all the key-value pairs from both objects into a single combined object.

  • This output is passed to the standard input cat with a here-string (<<<) using command substitution ($(..)). So the jq command runs in a subshell.
  • cat is then overwriting the original config file

It's equivalent to:

new_config=$(jq <filter> $cw_config)
echo $new_config | cat > $cw_config

From what I understand, the combination of here-string and command substitution will use a temp file or memory to avoid any race condition while modifying the file in place.

Copy link

@freewil freewil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool - LGTM. I don't work for buildkite btw, but would love to see this get merged.

@eleanorakh eleanorakh self-assigned this Mar 17, 2022
Copy link
Contributor

@moskyb moskyb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is awesome @ouranos! thanks for submitting.

Maybe we can change the namespace too? Or make it more customisable?

i think adding a namespace is probably a good idea, it'll make filtering in the console significantly less of a pain. maybe something like BuildkiteElasticStack/Agent or something similar? very open to other ideas though, and agree that making it customizable (with a sensible default) is a good idea.

Comment on lines +12 to +13
mem: {measurement: ["mem_used_percent"], append_dimensions: {BuildkiteQueue: $queue}},
disk: {measurement: ["used_percent"], resources: ["*"], append_dimensions: {BuildkiteQueue: $queue}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're adding the queue as dimension for memory and disk, should we do it for CPU too?

Report on memory and disk usage
@ouranos ouranos force-pushed the add-default-metrics-to-cloudwatch branch from 5756f14 to a1e83f6 Compare September 29, 2022 03:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants