A production-ready alert normalization pipeline that processes CloudWatch and Grafana alerts, normalizes them into a canonical format, and automatically manages GitHub issues for incident response.
Key Features:
- Alert Normalization: Converts CloudWatch and Grafana alerts to a canonical schema
- Intelligent Routing: Team-based alert assignment with priority handling
- Alert Grouping: Groups recurring alerts by fingerprint, with a fresh GitHub issue per occurrence
- Issue Lifecycle: Automated GitHub issue creation, updates, and closure
- Resilience: Circuit breakers, rate limiting, and graceful degradation
- Serverless: Fully serverless AWS architecture with auto-scaling
- Architecture Overview
- Quick Start
- Project Structure
- Development Commands
- Configuration
- Operations Guide
- Infrastructure Details
- Monitoring & Observability
- Troubleshooting
- Security Features
- Testing
- Contributing
- License
graph TD
%% Alert Sources
GF[Grafana Alerts]
CW[CloudWatch Alarms]
CUSTOM[Custom Sources<br/>Normalized Format]
%% Entry Points
WEBHOOK[Webhook Lambda<br/>external-alerts-webhook]
SNS[SNS Topic<br/>alerts]
%% Processing Pipeline
SQS[SQS Queue<br/>alerts]
DLQ[Dead Letter Queue<br/>dlq]
COLLECTOR[Collector Lambda<br/>Main Processing Engine]
%% Transformation Layer
DETECT{Source Detection}
GRAFANA_T[Grafana Transformer]
CW_T[CloudWatch Transformer]
NORM_T[Normalized Transformer<br/>Skip Transform]
%% Core Processing
FINGERPRINT[Generate Fingerprint<br/>SHA-256 of stable fields]
STATE_CHECK{Check Alert State<br/>DynamoDB}
ACTION{Determine Action}
%% Actions
CREATE[CREATE<br/>New GitHub Issue]
COMMENT[COMMENT<br/>Add to existing issue]
CLOSE[CLOSE<br/>Close GitHub issue]
SKIP[SKIP<br/>Stale/Manual close]
%% Storage & External
DYNAMO[(DynamoDB<br/>Alert State Tracking)]
GITHUB[GitHub Issues<br/>Incident Management]
%% Flow connections
GF -->|POST /webhook<br/>X-Grafana-Token| WEBHOOK
CUSTOM -->|POST /webhook<br/>Custom Headers| WEBHOOK
CW -->|CloudWatch Action| SNS
WEBHOOK -->|Authenticated<br/>Requests| SNS
SNS -->|Forward to persistent store| SQS
SQS -->|Batch Processing<br/>Partial Failure Support| COLLECTOR
SQS -.->|Failed Messages| DLQ
COLLECTOR --> DETECT
DETECT -->|grafana| GRAFANA_T
DETECT -->|cloudwatch| CW_T
DETECT -->|normalized| NORM_T
GRAFANA_T --> FINGERPRINT
CW_T --> FINGERPRINT
NORM_T --> FINGERPRINT
FINGERPRINT --> STATE_CHECK
STATE_CHECK <--> DYNAMO
STATE_CHECK --> ACTION
ACTION --> CREATE
ACTION --> COMMENT
ACTION --> CLOSE
ACTION --> SKIP
CREATE --> GITHUB
COMMENT --> GITHUB
CLOSE --> GITHUB
CREATE --> DYNAMO
COMMENT --> DYNAMO
CLOSE --> DYNAMO
%% Styling
classDef alertSource fill:#ff9999
classDef processing fill:#99ccff
classDef storage fill:#99ff99
classDef action fill:#ffcc99
classDef transformer fill:#cc99ff
class GF,CW,CUSTOM alertSource
class WEBHOOK,SNS,SQS,COLLECTOR processing
class DYNAMO,GITHUB storage
class CREATE,COMMENT,CLOSE,SKIP action
class GRAFANA_T,CW_T,NORM_T transformer
- Alert Ingestion
  - Grafana: Sends webhooks → Webhook Lambda → SNS
  - CloudWatch: Sends directly → SNS
  - Custom Sources: Can use the webhook (any format) or send pre-normalized alerts
- Message Processing
  - SNS fans out to the SQS queue, with a dead letter queue for failures
  - Collector Lambda processes messages in batches with partial-failure support
  - Source detection automatically routes each message to the appropriate transformer
- Alert Transformation
  - Grafana/CloudWatch: Full transformation to the canonical AlertEvent schema
  - Normalized: Skips transformation; direct validation for optimal performance
  - Generates a SHA-256 fingerprint from stable alert identifiers (see the sketch after this list)
- State Management & Actions
  - Check DynamoDB for existing alert state by fingerprint
  - Determine action: CREATE (new), COMMENT (recurring), CLOSE (resolved), or SKIP
  - Update both GitHub issues and DynamoDB state atomically
- Resilience Features
  - Circuit breakers prevent GitHub API cascading failures
  - Rate limiting respects GitHub API limits with exponential backoff
  - Dead letter queue captures poison messages for manual review
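To make the fingerprinting and action-selection steps concrete, here is a minimal TypeScript sketch. It assumes the AlertEvent identity fields documented later in this README and a simplified state-record shape; the collector's actual implementation may differ.

```typescript
// Illustrative only: fingerprinting and action selection as described above.
// Field names follow the AlertEvent schema documented below; the StoredState
// shape is an assumption, not the collector's actual DynamoDB layout.
import { createHash } from "crypto";

interface AlertIdentity {
  aws_account?: string;
  region?: string;
  alarm_arn?: string;
  org_id?: string;
  rule_id?: string;
}

type AlertState = "FIRING" | "RESOLVED";
type Action = "CREATE" | "COMMENT" | "CLOSE" | "SKIP";

// SHA-256 over stable identifiers, so the same alert rule + resource always
// maps to the same fingerprint (and therefore the same state record).
function fingerprint(source: string, identity: AlertIdentity): string {
  const stable = [
    source,
    identity.aws_account,
    identity.region,
    identity.alarm_arn,
    identity.org_id,
    identity.rule_id,
  ].map((v) => v ?? "").join("|");
  return createHash("sha256").update(stable).digest("hex");
}

interface StoredState {
  status: "OPEN" | "CLOSED"; // state of the most recent GitHub issue
  issueNumber: number;
}

function determineAction(incoming: AlertState, existing?: StoredState): Action {
  if (incoming === "FIRING") {
    // Already-open issue -> add a comment; otherwise open a fresh issue.
    return existing?.status === "OPEN" ? "COMMENT" : "CREATE";
  }
  // RESOLVED: close the tracked issue if one is open, otherwise nothing to do.
  return existing?.status === "OPEN" ? "CLOSE" : "SKIP";
}
```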
- Terraform >= 1.6
- AWS CLI configured (SSO or profile)
- Node.js 18+ and Yarn
- GitHub App with issues permissions (see setup below)
# Builds all Lambda functions. You can also run `yarn build` from each function's individual folder
make build
Prerequisites: AWS CLI configured with appropriate credentials for development environment
make aws-apply-dev
Prerequisites: AWS CLI configured with appropriate credentials for production environment
make aws-apply-prod
# Tail logs in one terminal
make logs-dev
# Send test alert in another terminal
make aws-publish-dev
Grafana Webhook:
# Get webhook URL
cd infra && terraform output -raw external_alerts_webhook_url
# Configure in Grafana with header:
# X-Grafana-Token: <your-webhook-token>
CloudWatch Alarms:
# Get SNS topic ARN for CloudWatch alarm actions
cd infra && terraform output -raw sns_topic_arn
├── infra/                        # Terraform infrastructure
│   ├── *.tf                      # AWS resource definitions
│   ├── dev.tfvars                # Development environment config
│   ├── prod.tfvars               # Production environment config
│   └── backend-*.hcl             # Remote state configuration
├── lambdas/                      # TypeScript Lambda functions
│   ├── collector/                # Main alert processing engine
│   │   ├── src/                  # TypeScript source code
│   │   ├── schemas/              # JSON Schema definitions for validation
│   │   ├── __tests__/            # Unit tests with Vitest
│   │   └── dist/                 # Compiled JavaScript (build output)
│   └── external-alerts-webhook/  # Grafana webhook endpoint
├── ReferenceData/                # Documentation and schemas
├── bootstrap/                    # Infrastructure setup utilities
└── scratch/                      # Development workspace
# Build all Lambda functions
make build
# Clean build artifacts
make clean
# Run tests for collector Lambda
cd lambdas/collector
yarn test # Run unit tests
yarn test:watch # Watch mode
yarn test:coverage # With coverage report
yarn lint # TypeScript checking
# Development Environment
make aws-init-dev # Initialize Terraform backend
make aws-apply-dev # Deploy to dev
make aws-destroy-dev # Destroy dev resources
make logs-dev # Tail dev Lambda logs
make aws-publish-dev # Send test message
# Production Environment
make aws-init-prod # Initialize Terraform backend
make aws-apply-prod # Deploy to prod
make aws-destroy-prod # Destroy prod resources
make logs-prod # Tail prod Lambda logs
make aws-publish-prod # Send test message
# Local Development (LocalStack)
make ls-apply # Deploy to LocalStack
make ls-logs # Tail LocalStack logs
make ls-publish # Send test message locally
make ls-destroy # Clean up LocalStack
Set variables in your tfvars files or via command line:
# In dev.tfvars or prod.tfvars
github_repo = "your-org/your-repo"
- Create a GitHub App in your organization:
  - Permissions: Issues (Read/Write), Metadata (Read)
  - Note the App ID and generate a private key
- Install the App on your target repository
- Store credentials in AWS Secrets Manager:
aws secretsmanager create-secret \
--name "alerting-dev-alerting-app-secrets" \
--secret-string '{
"github_app_id": "123456",
"github_app_key_base64": "<base64-encoded-private-key>"
}'
Webhook Token Setup
Important: The webhook secret for each environment must be created before deploying the infrastructure. Terraform references it but doesn't manage it.
- Generate a secure token:
# Generate a cryptographically secure token
TOKEN=$(openssl rand -base64 64)
echo "Generated token: $TOKEN"
- Create the secret (before running terraform):
# Create the secret (adjust name for your environment)
aws secretsmanager create-secret \
--name "alerting-$ENV-webhook-secrets" \
--description "Authentication tokens for external webhook notifications" \
--secret-string "{\"x-grafana-token\": \"$TOKEN\"}"
- Deploy infrastructure:
# Now Terraform can reference the existing secret
make aws-apply-dev
- Configure Grafana notification policy with:
  - URL: Get with `terraform output -raw external_alerts_webhook_url` (after `make aws-init-dev` or `make aws-init-prod`)
  - Method: POST
  - Header: `X-Grafana-Token: <your-generated-token>`
Note: The secret supports multiple webhook tokens. Future alert sources can be added like:
{
"x-grafana-token": "token-for-grafana",
"x-pagerduty-signature": "token-for-pagerduty",
}
To onboard a new webhook emitter (e.g., PagerDuty, Datadog, custom services) to the alerting system:
Add the new service's authentication header and token to the webhook secret:
# Get current secret value
CURRENT_SECRET=$(aws secretsmanager get-secret-value \
--secret-id "alerting-$ENV-webhook-secrets" \
--query SecretString --output text)
# Add new header/token pair (example for PagerDuty)
# Generated the token via: `openssl rand -base64 64`
UPDATED_SECRET=$(echo "$CURRENT_SECRET" | jq '. + {"x-pagerduty-signature": "your-pagerduty-webhook-secret"}')
# Update the secret
aws secretsmanager update-secret \
--secret-id "alerting-$ENV-webhook-secrets" \
--secret-string "$UPDATED_SECRET"
Point your new service webhook to the alerting system endpoint:
# Get webhook URL (make sure you've initialized the correct environment first)
# For dev environment:
make aws-init-dev
cd infra && terraform output -raw external_alerts_webhook_url
# For prod environment:
make aws-init-prod
cd infra && terraform output -raw external_alerts_webhook_url
Configure your service to send POST requests to this URL with the appropriate authentication header.
For optimal performance and reliability, custom webhook emitters should send alerts in the normalized format. When alerts are pre-normalized, the collector can skip transformation and directly process them.
Normalized AlertEvent Schema:
interface AlertEvent {
schema_version: number; // Version for schema evolution (currently 1)
source: "grafana" | "cloudwatch" | string; // Alert source identifier
state: "FIRING" | "RESOLVED"; // Alert state
title: string; // Alert title/name
description?: string; // Optional alert description
summary?: string; // High-level summary for display
reason?: string; // Provider-specific reason/message
priority: "P0" | "P1" | "P2" | "P3"; // Canonical priority
occurred_at: string; // ISO8601 timestamp of state change
teams: string[]; // Owning team identifiers (supports multiple teams)
resource: { // Resource information
type: "runner" | "instance" | "job" | "service" | "generic";
id?: string; // Resource identifier
region?: string; // AWS region (if applicable)
extra?: Record<string, any>; // Additional context
};
identity: { // Identity for fingerprinting
aws_account?: string; // AWS account ID
region?: string; // Region
alarm_arn?: string; // CloudWatch alarm ARN
org_id?: string; // Organization ID
rule_id?: string; // Rule/alert ID
};
links: { // Navigation links
runbook_url?: string; // Runbook/playbook URL
dashboard_url?: string; // Dashboard URL
source_url?: string; // Source console/panel URL
silence_url?: string; // Silence/mute URL
};
raw_provider?: any; // Original payload for debugging
}
Example Pre-Normalized Alert:
{
"schema_version": 1,
"source": "datadog",
"state": "FIRING",
"title": "High CPU Usage on prod-web-01",
"description": "CPU utilization has exceeded 90% for 5 minutes",
"summary": "Critical CPU alert on production web server",
"priority": "P1",
"occurred_at": "2024-01-15T10:30:00Z",
"teams": ["platform-team"],
"resource": {
"type": "instance",
"id": "i-1234567890abcdef0",
"region": "us-west-2",
"extra": {
"instance_type": "m5.large",
"availability_zone": "us-west-2a"
}
},
"identity": {
"aws_account": "123456789012",
"region": "us-west-2",
"rule_id": "cpu-high-prod-web"
},
"links": {
"runbook_url": "https://wiki.company.com/runbooks/high-cpu",
"dashboard_url": "https://datadog.com/dashboard/cpu-monitoring",
"source_url": "https://datadog.com/monitors/12345"
},
"raw_provider": {
"monitor_id": 12345,
"original_payload": "..."
}
}
To use the pre-normalized format (see the publishing sketch after this list):
- Set the SQS message attribute `source = "normalized"`
- Send the AlertEvent JSON directly as the message body
- The collector validates it against the JSON Schema and processes it directly
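As a sketch of the publishing side, the snippet below sends a pre-normalized AlertEvent to the alerts SNS topic with the `source` attribute set to `normalized`, assuming message attributes are propagated to the SQS subscription. The topic ARN and region are placeholders, not values from this repo.

```typescript
// Hedged sketch: publish a pre-normalized AlertEvent to the alerts SNS topic.
// Topic ARN, region, and attribute plumbing are assumptions for illustration.
import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";

const sns = new SNSClient({ region: "us-west-2" });

async function publishNormalizedAlert(alert: object, topicArn: string): Promise<void> {
  await sns.send(
    new PublishCommand({
      TopicArn: topicArn,
      Message: JSON.stringify(alert), // the AlertEvent JSON is the message body
      MessageAttributes: {
        // Tells the collector to skip transformation and validate directly.
        source: { DataType: "String", StringValue: "normalized" },
      },
    })
  );
}
```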
Schema Location & Validation:
- JSON Schema: `/lambdas/collector/schemas/alert-event.schema.json`
- Schema ID: `https://schemas.pytorch.org/alerting/alert-event.schema.json`
- Current Version: 1.0 (schema_version: 1)
- Validation: Messages are validated using AJV with comprehensive error reporting (see the sketch below)
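Below is an illustrative TypeScript use of AJV against that schema; the collector's actual error reporting may be richer. Importing the JSON schema directly assumes `resolveJsonModule` is enabled.

```typescript
// Illustrative AJV validation of a message against the AlertEvent schema.
import Ajv from "ajv";
import schema from "./schemas/alert-event.schema.json"; // path relative to lambdas/collector

const ajv = new Ajv({ allErrors: true });
const validateAlertEvent = ajv.compile(schema);

function assertValidAlertEvent(payload: unknown): void {
  if (!validateAlertEvent(payload)) {
    const details = (validateAlertEvent.errors ?? [])
      .map((e) => `${e.instancePath} ${e.message}`)
      .join("; ");
    throw new Error(`AlertEvent failed schema validation: ${details}`);
  }
}
```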
External Integration:
# Validate your alerts against the schema
curl -O https://raw.githubusercontent.com/pytorch/test-infra-alerting/main/lambdas/collector/schemas/alert-event.schema.json
# Use with any JSON Schema validator (Python example)
pip install jsonschema
python -c "
import json, jsonschema
schema = json.load(open('alert-event.schema.json'))
alert = {'schema_version': 1, 'source': 'myapp', ...}
jsonschema.validate(alert, schema)
"
If your service cannot send pre-normalized alerts and uses a different payload format, you may need to (see the skeleton after this list):
- Add a new transformer in `lambdas/collector/src/transformers/`
- Update source detection in the collector Lambda to recognize the new format
- Test the transformation with sample payloads
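For orientation, a hypothetical transformer could look like the skeleton below. The payload shape, the `AlertEvent` import path, and the hard-coded team/priority are illustrative placeholders; follow the conventions of the existing Grafana and CloudWatch transformers.

```typescript
// Hypothetical transformer skeleton for a new source. Names and paths are
// illustrative; the real transformers live in lambdas/collector/src/transformers/.
import type { AlertEvent } from "../types"; // assumed location of the canonical type

interface ExamplePagerDutyPayload {
  event: { id: string; summary: string; status: "triggered" | "resolved" };
}

export function transformPagerDuty(payload: ExamplePagerDutyPayload): AlertEvent {
  return {
    schema_version: 1,
    source: "pagerduty",
    state: payload.event.status === "resolved" ? "RESOLVED" : "FIRING",
    title: payload.event.summary,
    priority: "P2",                 // in practice, derive from the payload's severity
    occurred_at: new Date().toISOString(),
    teams: ["pytorch-dev-infra"],   // in practice, derive from routing metadata
    resource: { type: "generic" },
    identity: { rule_id: payload.event.id },
    links: {},
    raw_provider: payload,
  };
}
```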
# Monitor logs
make logs-dev
# Send a test webhook from your new service
# Check that alerts are processed and GitHub issues are created correctly
# 1. Add PagerDuty webhook secret
aws secretsmanager update-secret \
--secret-id "alerting-dev-webhook-secrets" \
--secret-string '{
"x-grafana-token": "existing-grafana-token",
"x-pagerduty-signature": "your-pagerduty-secret"
}'
# 2. Configure PagerDuty webhook
# URL: https://your-webhook-url/webhook
# Headers: X-PagerDuty-Signature: your-pagerduty-secret
# Method: POST
# 3. Test with a PagerDuty incident to verify processing
CloudWatch Alarms - Add to AlarmDescription:
High CPU usage detected on production instances.
TEAMS=pytorch-dev-infra, pytorch-platform
PRIORITY=P1
RUNBOOK=https://runbook.example.com
Grafana Alerts - Use annotations:
annotations:
Teams: pytorch-dev-infra, pytorch-platform
Priority: P2
runbook_url: https://runbook.example.com
description: Database connection pool exhausted
Multi-Team Support: Use comma-separated teams (a parsing sketch follows below):
- Single team: `TEAMS=dev-infra` or `Teams: dev-infra`
- Multiple teams: `TEAMS=dev-infra, platform, security` or `Teams: dev-infra, platform, security`
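As a rough illustration of how the `TEAMS` / `PRIORITY` / `RUNBOOK` lines might be pulled out of an AlarmDescription, here is a small parsing sketch; the collector's actual parser may apply different rules.

```typescript
// Illustrative parser for key=value routing lines in a CloudWatch AlarmDescription.
interface RoutingMetadata {
  teams: string[];
  priority?: string;
  runbookUrl?: string;
}

function parseAlarmDescription(description: string): RoutingMetadata {
  const fields = new Map<string, string>();
  for (const line of description.split("\n")) {
    const match = line.match(/^([A-Z_]+)=(.+)$/); // lines such as TEAMS=a, b
    if (match) fields.set(match[1], match[2].trim());
  }
  return {
    teams: (fields.get("TEAMS") ?? "")
      .split(",")
      .map((t) => t.trim())
      .filter(Boolean),
    priority: fields.get("PRIORITY"),
    runbookUrl: fields.get("RUNBOOK"),
  };
}

// parseAlarmDescription("High CPU...\nTEAMS=pytorch-dev-infra, pytorch-platform\nPRIORITY=P1")
//   -> { teams: ["pytorch-dev-infra", "pytorch-platform"], priority: "P1", runbookUrl: undefined }
```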
For detailed instructions on configuring new alerts in Grafana and CloudWatch, see OPERATIONS.md.
That guide covers how to add a new alert, with examples.
- SNS Topic: `{prefix}-alerts` - Multi-source alert ingestion
- SQS Queue: `{prefix}-alerts` - Alert buffering with DLQ
- Lambda Functions: Collector (processing) + Webhook (Grafana)
- DynamoDB Table: `{prefix}-alerts-state` - Alert state tracking
- IAM Roles: Least-privilege access for Lambda execution
- CloudWatch: Logs, metrics, and monitoring alarms
Important: This system creates fresh GitHub issues for each alert occurrence, even for recurring alerts.
How it works:
- Alert Grouping: Alerts with the same fingerprint (alert rule + resource) are grouped logically
- DynamoDB State: One record per unique alert fingerprint tracks the current state and most recent GitHub issue
- GitHub Issues: Each alert firing creates a new GitHub issue for clean discussion context
- Fresh Context: When an alert recurs after being resolved, it gets a new issue number (not reopened)
Example behavior:
- Alert "CPU High" fires → Create GitHub issue #101 → DynamoDB tracks the fingerprint with issue #101
- Alert resolves → Close GitHub issue #101 → DynamoDB status = CLOSED
- Same alert fires again → Create NEW GitHub issue #102 → DynamoDB tracks the same fingerprint with issue #102
- Alert resolves → Close GitHub issue #102
Result: Multiple GitHub issues may exist for the same logical alert, but only one DynamoDB state record per unique alert fingerprint.
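For reference, the state record keyed by fingerprint might look roughly like the interface below; the exact attribute names in the `{prefix}-alerts-state` table are an assumption.

```typescript
// Assumed (illustrative) shape of the per-fingerprint state record.
interface AlertStateRecord {
  fingerprint: string;        // partition key: SHA-256 of stable identifiers
  status: "OPEN" | "CLOSED";  // state of the most recent occurrence
  issueNumber: number;        // most recent GitHub issue (e.g. #102 after a recurrence)
  lastStateChangeAt: string;  // ISO8601 timestamp of the last transition
}
```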
- Development: `us-west-2` region, `alerting-dev` prefix
- Production: `us-east-1` region, `alerting-prod` prefix
- State Management: Separate S3 backends with DynamoDB locking
Create backend configuration files:
`infra/backend-dev.hcl`:
bucket = "your-terraform-state-dev"
key = "alerting/dev/terraform.tfstate"
region = "us-west-2"
dynamodb_table = "terraform-locks-dev"
encrypt = true
`infra/backend-prod.hcl`:
bucket = "your-terraform-state-prod"
key = "alerting/prod/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-locks-prod"
encrypt = true
(Aspirational: not yet implemented)
- Alert Processing: Success/failure rates by source and team
- GitHub Integration: API success rates and rate limiting
- Queue Depth: SQS and DLQ message counts
- Processing Latency: P50/P95/P99 response times
- DLQ High Depth: Failed message accumulation
- High Error Rate: Processing failures above threshold
- Lambda Duration: Function timeout approaching
All logs use structured JSON with correlation IDs:
{
"timestamp": "2024-01-15T10:30:00Z",
"level": "INFO",
"messageId": "12345-abcde",
"fingerprint": "abc123...",
"action": "CREATE",
"teams": ["pytorch-dev-infra", "pytorch-platform"],
"priority": "P1",
"source": "grafana"
}
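A minimal helper in this spirit could look like the sketch below; field names beyond the example above are assumptions.

```typescript
// Minimal structured-logging sketch: one JSON object per line.
function logEvent(level: "INFO" | "WARN" | "ERROR", fields: Record<string, unknown>): void {
  console.log(
    JSON.stringify({
      timestamp: new Date().toISOString(),
      level,
      ...fields, // e.g. messageId, fingerprint, action, teams, priority, source
    })
  );
}

// logEvent("INFO", { messageId: "12345-abcde", action: "CREATE", source: "grafana" });
```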
Alert not creating GitHub issue:
- Check CloudWatch logs for `NORMALIZED_ALERT` entries
- Verify GitHub App installation and permissions
- Check the DynamoDB `alerts_state` table for alert state
- Look for circuit breaker or rate limiting logs
Missing required fields error:
CloudWatch alerts need TEAMS and PRIORITY in AlarmDescription
TEAMS=dev-infra, platform
PRIORITY=P1
RUNBOOK=https://...
Grafana alerts need Teams and Priority annotations
annotations:
Teams: dev-infra, platform
Priority: P2
High DLQ depth:
- Check DLQ messages for common error patterns
- Review CloudWatch error logs for processing failures
- Verify alert payload format matches expected schema
# View recent Lambda logs
aws logs tail /aws/lambda/alerting-dev-collector --follow
# Check DynamoDB alert state
aws dynamodb scan --table-name alerting-dev-alerts-state --limit 10
# View DLQ messages
aws sqs receive-message --queue-url $(terraform output -raw dlq_url)
# Test alert processing locally
cd lambdas/collector && yarn test --verbose
# Validate Terraform configuration
cd infra && terraform validate && terraform plan
- Input Validation: Comprehensive sanitization and size limits
- Authentication: Timing-safe webhook token comparison (see the sketch after this list)
- GitHub Integration: App-based authentication with scoped permissions
- Secret Management: AWS Secrets Manager
- IAM: Least-privilege roles with resource-specific permissions
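A timing-safe comparison in Node typically relies on `crypto.timingSafeEqual`, as in the sketch below; the webhook Lambda's actual implementation may differ.

```typescript
// Sketch of a timing-safe webhook token check (illustrative).
import { timingSafeEqual } from "crypto";

function tokenMatches(presented: string, expected: string): boolean {
  const a = Buffer.from(presented);
  const b = Buffer.from(expected);
  // timingSafeEqual throws on unequal lengths, so check lengths first;
  // this reveals only the length, not the token contents.
  return a.length === b.length && timingSafeEqual(a, b);
}
```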
cd lambdas/collector
yarn test # Run all tests
yarn test fingerprint # Run specific test file
yarn test --coverage # Generate coverage report
yarn test --ui # Interactive test UI
# LocalStack full pipeline test
make ls-apply
make ls-publish
make ls-logs
# Cleanup
make ls-destroy
Realistic test payloads are available in `lambdas/collector/test-data/`:
- `grafana-firing.json` - Grafana alert in firing state
- `cloudwatch-alarm.json` - CloudWatch alarm notification
- `grafana-resolved.json` - Grafana alert resolution
- Development Setup: Follow the quick start guide
- Testing: Ensure tests pass (`make build && cd lambdas/collector && yarn test`)
- Code Style: Use Prettier formatting (`yarn format`)
- Commits: Use conventional commit format with scope prefixes
- Pull Requests: Include test results and infrastructure changes
feat(collector): add circuit breaker for GitHub API resilience
fix(webhook): resolve timing attack vulnerability in auth
docs: update architecture overview with new components
test: add fingerprint edge cases for CloudWatch alarms
This repo is BSD 3-Clause licensed, as found in the LICENSE file.
Need Help? Check the troubleshooting section above or review CloudWatch logs for detailed error information.