[AWS Content Packs] [OOTB Alerts] Add alerting templates #16750
base: main
Template name: Can we update the template names as below?
High-risk actions succeeded
High resource deletion
High error rate
Multiple failed login attempts
Application errors
Backend errors
High data transfer rate
High reject actions
💚 Build Succeeded
cc @Linu-Elias
tommyers-elastic
left a comment
i sort of reviewed this back to front, so the more general comments are on the later rules.
i noticed that all these rules run every 5m over the last 10/15m of data. did we consider each rule independently and decide that this is the best schedule in every case?
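to make the schedule question concrete, here is a rough sketch of how often a single event gets evaluated. the helper name and the assumption that runs fire on exact multiples of the interval with a half-open lookback window are mine, not taken from the rule engine:

```python
def runs_covering_event(event_min: int, window_min: int, interval_min: int,
                        horizon_min: int = 120) -> int:
    """Count scheduled runs whose lookback window contains an event.

    Assumes (hypothetically) that runs fire at 0, interval, 2*interval, ...
    and that each run looks at the half-open window (run - window, run].
    """
    count = 0
    for run in range(0, horizon_min + 1, interval_min):
        if run - window_min < event_min <= run:
            count += 1
    return count


# An event at minute 42, checked every 5 minutes.
print(runs_covering_event(42, window_min=10, interval_min=5))
print(runs_covering_event(42, window_min=15, interval_min=5))
```

under those assumptions a 10m window is seen by two consecutive runs and a 15m window by three, so the overlap is worth a deliberate choice per rule.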
"id": "aws-vpcflow-otel-massive-data-transfer",
"type": "alerting_rule_template",
"attributes": {
  "name": "[AWS VPC OTEL] Excessive data transfer from a single source",
[AWS VPC OTEL] doesn't seem very user friendly
can we remove 'OTEL'?
"searchType": "esqlQuery",
"timeWindowSize": 10,
"timeWindowUnit": "m",
"esqlQuery": {
i don't think we need to include WHERE @timestamp > NOW()- 10m - it's handled by the timeWindowSize param in the rule.
(same for all other rules in this PR)
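a minimal sketch of the cleanup, assuming (as the comment above says) that `timeWindowSize` already bounds the query. `strip_time_filter` is a hypothetical helper, and the regex only covers the literal `NOW() - <N><unit>` filter used in these templates:

```python
import re

# Matches one pipe stage of the form: | WHERE @timestamp > NOW() - <N><unit>
# Only the literal filter shape used in this PR is handled.
TIME_FILTER = re.compile(
    r"\|\s*WHERE\s+@timestamp\s*>\s*NOW\(\)\s*-\s*\d+\s*[a-z]+\s*",
    re.IGNORECASE,
)


def strip_time_filter(esql: str) -> str:
    """Remove the explicit @timestamp filter stage from an ES|QL string."""
    return TIME_FILTER.sub("", esql)


query = (
    "FROM logs-aws.vpcflow.otel-default "
    "| WHERE @timestamp > NOW()- 10m "
    "| STATS total_bytes = SUM(aws.vpc.flow.bytes) BY source.address "
    "| WHERE total_bytes > 53687091200"
)
print(strip_time_filter(query))
```

the remaining stages are left untouched, so the threshold filter on `total_bytes` still applies.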
"id": "aws-vpcflow-otel-reject-ip",
"type": "alerting_rule_template",
"attributes": {
  "name": "[AWS VPC OTEL] Excessive REJECT actions with single source IP",
let's keep the naming consistent. above we have "from a single source", here we have "with single source IP"
"type": "alerting_rule_template",
"attributes": {
  "name": "[AWS VPC OTEL] Excessive data transfer from a single source",
  "tags": ["AWS VPC Logs OpenTelemetry Assets"],
this isn't a good tag name
we should have tags for 'aws', 'vpc' (and possibly 'otel'?)
(same for all other rules in this PR)
"timeWindowSize": 10,
"timeWindowUnit": "m",
"esqlQuery": {
  "esql": "// Alert triggers when any source IP address whose bytes exceed a threshold (e.g. > 50GB in 10 minutes)\n// You can adjust the threshold value in WHERE clause as needed.\nFROM logs-aws.vpcflow.otel-default | WHERE @timestamp > NOW()- 10m | STATS total_bytes = SUM(aws.vpc.flow.bytes) BY source.address | WHERE total_bytes > 53687091200 | SORT total_bytes DESC"
do we need the SORT?
(same for all other rules)
},
"params": {
  "searchType": "esqlQuery",
  "timeWindowSize": 10,
should be 15m to match description
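a mismatch like this could be caught mechanically. the sketch below is a hypothetical lint check, not part of the PR: it assumes the window is quoted as "in N minutes" inside the query comment and compares it with `timeWindowSize`. the sample `params` dict is made up for illustration:

```python
import re


def window_mismatch(params: dict) -> bool:
    """Return True when the window quoted in the ES|QL comment differs
    from timeWindowSize. Only minute-based windows are compared."""
    match = re.search(r"in (\d+) minutes", params["esqlQuery"]["esql"])
    if match is None or params.get("timeWindowUnit") != "m":
        return False  # nothing comparable found
    return int(match.group(1)) != params["timeWindowSize"]


# Illustrative params only: the comment says 15 minutes, the window is 10.
sample_params = {
    "searchType": "esqlQuery",
    "timeWindowSize": 10,
    "timeWindowUnit": "m",
    "esqlQuery": {"esql": "// threshold exceeded in 15 minutes\nFROM logs-*"},
}
print(window_mismatch(sample_params))
```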
},
"params": {
  "searchType": "esqlQuery",
  "timeWindowSize": 10,
does 10m seem like a long time period for detecting failed login attempts?
"timeWindowSize": 10,
"timeWindowUnit": "m",
"esqlQuery": {
  "esql": "// Alert triggers when any source IP address whose reject requests exceed a threshold (e.g. > 100 in 10 minutes)\n// You can adjust the threshold value in WHERE clause as needed.\nFROM logs-aws.cloudtrail.otel-default | WHERE @timestamp > NOW()- 10m | WHERE rpc.method == \"ConsoleLogin\" | WHERE aws.error.code IS NOT NULL | STATS failed_count = COUNT(*), users_tried = VALUES(user.name) BY source.address | WHERE failed_count >= 100 | SORT failed_count DESC"
do we need the VALUES agg here?
"timeWindowSize": 10,
"timeWindowUnit": "m",
"esqlQuery": {
  "esql": "// Alert triggers when any client IP address whose error count exceed a threshold (e.g. > 50 in 10 minutes)\n// You can adjust the threshold value in WHERE clause as needed.\nFROM logs-aws.elbaccess.otel-default | WHERE @timestamp > NOW()- 10m | WHERE aws.elb.status.code != 200| STATS error_count = COUNT(*) BY client.address | WHERE error_count >= 50 | SORT error_count DESC"
should client errors, e.g. 404, trigger this alert?
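to illustrate the question, here is what the current `!= 200` predicate counts compared with a stricter server-error filter. the `>= 500` alternative is only one possible option, and the status codes are sample values:

```python
# Sample HTTP status codes, chosen for illustration.
codes = [200, 204, 301, 404, 500, 503]

# Predicate as written in the template: anything that is not exactly 200.
not_200 = [c for c in codes if c != 200]

# One possible stricter alternative: server-side errors only.
server_errors = [c for c in codes if c >= 500]

print("counted by != 200:", not_200)
print("counted by >= 500:", server_errors)
```

with `!= 200`, successful non-200 responses such as 204 and redirects such as 301 are counted alongside 404, so the choice depends on what the rule is meant to detect.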
"timeWindowSize": 10,
"timeWindowUnit": "m",
"esqlQuery": {
  "esql": "// Alert triggers when any source IP address whose reject requests exceed a threshold (e.g. > 100 in 10 minutes)\n// You can adjust the threshold value in WHERE clause as needed.\nFROM logs-aws.cloudtrail.otel-default | WHERE @timestamp > NOW()- 10m | WHERE rpc.method == \"ConsoleLogin\" | WHERE aws.error.code IS NOT NULL | STATS failed_count = COUNT(*), users_tried = VALUES(user.name) BY source.address | WHERE failed_count >= 100 | SORT failed_count DESC"
i had a few concerns about this rule and did a sanity check by asking chatgpt for some feedback. it has a lot of concerns about this rule.
did we get an LLM to thoroughly review all the queries here?
i don't know if the concerns are valid, but i just want to check we have considered feedback like this.
please DM me for the detail i got from GPT, but the summary was:
Primary concerns:
Threshold is orders of magnitude too high
Failure signal is weak
Missing service scoping
Detection intent is unclear
As written, this alert will almost certainly never fire for real attacks, while giving a false sense of coverage.
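for context on the threshold point, a small sketch of what `failed_count >= 100` over a 10 minute window means as a sustained rate from one source address. `sustained_rate` is a hypothetical helper, and this arithmetic does not by itself settle whether the threshold is right:

```python
def sustained_rate(threshold: int, window_min: int) -> tuple[float, float]:
    """Return (events per minute, seconds between events) needed to
    reach the threshold if events are spread evenly over the window."""
    per_minute = threshold / window_min
    seconds_between = window_min * 60 / threshold
    return per_minute, seconds_between


per_minute, seconds_between = sustained_rate(threshold=100, window_min=10)
print(f"{per_minute} failures/min, one every {seconds_between} s")
```

in other words, a single IP has to fail a console login about once every 6 seconds for the full window before the rule fires.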
Proposed commit message
Adding alerting rule templates to AWS Content Packs:
Checklist
changelog.yml file.
Author's Checklist
How to test this PR locally
Related issues
Screenshots