| Issue | Command | Time |
|---|---|---|
| View failed run | gh run view <run-id> --log | 1 min |
| Cancel stuck run | gh run cancel <run-id> | 30 sec |
| Re-run workflow | gh run rerun <run-id> | 30 sec |
| Rollback deployment | git revert <commit> && git push | 5 min |
| Check status page | https://www.githubstatus.com | 1 min |
Symptoms: ❌ Red X on CI/CD
Steps:
- Check failure type

      gh run view <run-id> --log | grep ERROR

- Categorize:
  - Compilation error: Fix code, push, re-run
  - Test failure: Investigate test, fix or update
  - Lint error: Run locally, fix with npm run lint:fix
  - Build timeout: Check logs for hang, optimize
- Fix based on category

      # For lint errors
      cd frontend && npm run lint:fix && npm run format
      # Commit and push
      git add .
      git commit -m "fix: resolve lint errors"
      git push

- Monitor re-run
  - Check GitHub Actions tab
  - Wait for completion
  - Verify all checks pass
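The categorization step above can be partly automated against a saved log (for example, one written with `gh run view <run-id> --log > run.log`). The following is a minimal sketch: `categorize_failure` is a hypothetical helper, and its patterns are illustrative assumptions about what TypeScript, Jest-style runners, ESLint/Prettier, and GitHub Actions print, so adjust them to the output your toolchain actually produces.

```shell
# Hypothetical helper: guess a failure category from CI log text on stdin.
# The patterns are illustrative assumptions, not guaranteed log formats.
categorize_failure() {
  log=$(cat)
  case "$log" in
    *"error TS"*|*"SyntaxError"*)
      echo "compilation" ;;
    *"FAIL "*)
      echo "test" ;;
    *"eslint"*|*"prettier"*)
      echo "lint" ;;
    *"exceeded the maximum execution time"*|*"timed out"*)
      echo "timeout" ;;
    *)
      echo "unknown" ;;
  esac
}
```

Usage might look like `categorize_failure < run.log`. The first matching pattern wins, so order the cases from most to least specific.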
Symptoms: ❌ Tests failed in CI
Steps:
- Identify failing test

      gh run view <run-id> --log | grep "FAIL"

- Run locally to reproduce

      npm test -- failing-test.spec.ts

- Fix based on root cause:
  - Logic error: Fix code
  - Environment issue: Check env variables
  - Flaky test: Add retry logic
  - Mock issue: Update mock data
- Re-run workflow

      gh run rerun <run-id>
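For the flaky-test case, one way to add retry logic at the command level is a small wrapper. This is a sketch under assumptions: `retry` and the `RETRY_DELAY` variable are hypothetical names, and a retry option built into your test runner, if it has one, is usually the better place for this. Retrying hides flakiness rather than fixing it, so treat it as a stopgap.

```shell
# Hypothetical helper: run a command up to N times, stopping at the first success.
# Usage: retry <max-attempts> <command> [args...]
# RETRY_DELAY (seconds between attempts) is an assumed knob, defaulting to 1.
retry() {
  max=$1
  shift
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max" ]; then
      echo "failed after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-1}"
  done
}
```

For example, `retry 3 npm test -- failing-test.spec.ts` would run the suspect test up to three times and fail only if every attempt fails.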
Symptoms: ❌ Deployment step failed
Steps:
- Check deployment logs

      gh run view <run-id> --log | tail -50

- Identify issue:
  - Build failed: Fix and rebuild
  - Secrets missing: Check GitHub Secrets
  - Permission denied: Check branch protection rules
  - Vercel error: Check Vercel project settings
- For Vercel issues:
  - Visit Vercel dashboard
  - Check project settings
  - Verify environment variables
  - Check build logs
- Fix and re-deploy:
  - Merge fix to main (if code issue)
  - Workflow triggers automatically
  - Monitor new deployment
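Missing secrets and environment variables are easier to diagnose when the deploy step fails fast with a clear message. The sketch below shows one way to do that. `require_env` is a hypothetical helper, and any variable names you pass to it are your own; none are taken from this project's configuration.

```shell
# Hypothetical helper: report every named variable that is unset or empty.
# Returns non-zero if any are missing. Pass plain variable names only.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `require_env SOME_TOKEN SOME_PROJECT_ID` (placeholder names) at the top of a deploy script lists every missing variable in one pass instead of failing on the first.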
Symptoms:
Steps:
- Identify slow job (compare start and completion times per job)

      gh run view <run-id> --json jobs --jq '.jobs[] | {name, startedAt, completedAt}'

- Investigate:
  - Check GitHub Actions status
  - Review job logs
  - Check resource constraints
- Optimize:
  - Add caching
  - Parallelize jobs
  - Reduce test scope
  - Optimize dependencies
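The job data from `gh run view --json jobs` carries start and completion timestamps rather than a duration, so finding the slow job means subtracting them. A minimal sketch, assuming GNU `date` (the `-d` flag; BSD/macOS `date` takes different flags) and ISO 8601 UTC timestamps. `duration_seconds` is a hypothetical helper name.

```shell
# Hypothetical helper: seconds elapsed between two ISO 8601 UTC timestamps.
# Assumes GNU date; fails with a non-zero status if either timestamp cannot be parsed.
duration_seconds() {
  start=$(date -u -d "$1" +%s) || return 1
  end=$(date -u -d "$2" +%s) || return 1
  echo $((end - start))
}

# Example with literal timestamps:
#   duration_seconds 2024-01-01T10:00:00Z 2024-01-01T10:05:30Z
```

Feeding it each job's `startedAt` and `completedAt` values and sorting the results shows which job dominates the run time.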
Symptoms: 🔴 Secret detected, vulnerability found
Steps:
- If secret exposed:

      # Immediately in order:
      # 1. Rotate the secret
      # 2. Remove from repository
      # 3. Check git history
      git log --all --pretty=format: --name-only | sort -u | grep secret
      # 4. Force push if needed

- If vulnerability detected:

      npm audit
      npm audit fix
      npm install
      git add package*.json
      git commit -m "chore: fix security vulnerabilities"
      git push

- Follow up:
  - Document incident
  - Review security practices
  - Update secrets rotation policy
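As a follow-up to reviewing security practices, a quick local scan before committing can catch the most obvious leaks. The sketch below is illustrative only. `scan_for_secrets` is a hypothetical helper, and its three patterns (AWS access key IDs, classic GitHub personal access tokens, PEM private key headers) are far from exhaustive, so a dedicated scanner will catch more.

```shell
# Hypothetical helper: print numbered lines from stdin that look like credentials.
# Exit status is 0 when something suspicious is found, 1 when nothing matches.
scan_for_secrets() {
  grep -nE 'AKIA[0-9A-Z]{16}|ghp_[A-Za-z0-9]{36}|-----BEGIN [A-Z ]*PRIVATE KEY-----'
}
```

One possible use is `git diff --cached | scan_for_secrets`, run before `git commit`, which inspects only what is about to be committed.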
- Re-run failed workflow
- Fix obvious code errors
- Update failing tests
- Restart stuck jobs
- Deep dive into logs
- Check GitHub status page
- Investigate dependencies
- Review recent commits
- Notify team in Slack
- Share error logs
- Discuss solution
- Implement fix together
- Contact GitHub Support (Business plan)
- Contact Vercel Support
- Contact Stacks Foundation
- Report to open source community
# Check if GitHub is down
curl -s https://www.githubstatus.com | grep status
# If GitHub is up, restart workflows
gh run list --repo owner/repo --status failure --limit 5 \
  --json databaseId --jq '.[].databaseId' \
  | xargs -I {} gh run rerun {} --repo owner/repo

# If deployment failed and main is broken:
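# 1. Identify last good commit (head SHA of the latest successful run on main)
gh run list --repo owner/repo --branch main --status success --limit 1 \
  --json headSha --jq '.[0].headSha'
# 2. Revert the commits made after the last known good one
git revert --no-edit <good-commit-hash>..HEAD
git push origin main
# 3. Verify new deployment
gh run list --repo owner/repo --branch main --limit 1

# If accidentally deleted important file: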
# 1. Check git history (the most recent commit listed is the one that deleted the file)
git log --follow -- path/to/file
# 2. Recover file from the parent (^) of the deleting commit
git show <commit-hash>^:path/to/file > path/to/file
# 3. Commit recovery
git add path/to/file
git commit -m "fix: restore accidentally deleted file"
git push

# Full log output
gh run view <run-id> --log
# Just errors
gh run view <run-id> --log | grep -i error
# Specific job logs
gh run view <run-id> --log | grep -A 20 "test-frontend"

# List failed runs
gh run list --repo owner/repo --status failure --limit 10
# Filter by workflow
gh run list --repo owner/repo --workflow ci.yml --status failure
# Get failure summary
gh run list --repo owner/repo --limit 20 --json databaseId,workflowName,conclusion \
  --jq '.[] | select(.conclusion=="failure")'

# View job status
gh run view <run-id> --json jobs
# View with timing
gh run view <run-id> --json jobs \
  --jq '.jobs[] | {name, status, startedAt, completedAt}'

To team (Slack):
⚠️ CI/CD Alert
Workflow: [name]
Status: Failed
Branch: [branch]
Commit: [hash]
Logs: [link to run]
Investigating...
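
Filling in the alert template by hand is easy to get wrong under pressure, so it can be generated instead. A minimal sketch: `format_alert` is a hypothetical helper that only prints the message. How it reaches Slack (webhook, bot, copy and paste) is left open because the runbook does not specify it.

```shell
# Hypothetical helper: fill in the alert template from positional arguments.
# Usage: format_alert <workflow> <branch> <commit> <run-url>
format_alert() {
  printf '⚠️ CI/CD Alert\nWorkflow: %s\nStatus: Failed\nBranch: %s\nCommit: %s\nLogs: %s\nInvestigating...\n' \
    "$1" "$2" "$3" "$4"
}
```

For example, `format_alert ci.yml main abc1234 https://example.com/run/1` (placeholder values) prints the seven-line alert ready to paste into the team channel.
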
After resolution (Slack):
✅ Incident Resolved
Issue: [brief description]
Root cause: [what happened]
Fix: [what was done]
Actions: [prevention steps]
After resolving any incident >15 minutes:
- Document:
  - What went wrong
  - Root cause
  - How it was fixed
  - Prevention measures
- Share:
  - Post summary in team channel
  - Link to relevant changes
  - Discuss lessons learned
- Prevent:
  - Add checks to catch issue early
  - Update documentation
  - Improve monitoring
  - Train team if needed
- Tests pass locally before pushing
- Branch is up to date with main
- Lint/format run locally
- No console errors/warnings
- Environment variables documented
- Security practices followed
- Changes tested in staging if available
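
Several items on this checklist can be run as one script before pushing. The sketch below assumes nothing about the project beyond the commands already used in this runbook. `run_checks` is a hypothetical helper, and the command lines passed to it are examples to adapt.

```shell
# Hypothetical helper: run each check command in order and stop at the first failure.
# Each argument is a complete command line, executed with sh -c.
run_checks() {
  for check in "$@"; do
    echo "running: $check"
    if ! sh -c "$check"; then
      echo "check failed: $check" >&2
      return 1
    fi
  done
  echo "all checks passed"
}
```

A possible invocation is `run_checks "npm test" "npm run lint:fix" "git fetch origin main && git merge-base --is-ancestor origin/main HEAD"`, where the last command succeeds only if the current branch already contains the tip of main.
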
| Week | Lead | Backup |
|---|---|---|

Rotate weekly or as needed.
- Team Lead: Mosas2000
- Security: [contact]
- Infrastructure: [contact]
- Escalation: [contact]