
Conversation

rdimitrov
Member

@rdimitrov commented Oct 8, 2025

Motivation and Context

The following PR adds a workflow that:

  • Runs daily, but can also be triggered manually
  • Drops the staging database and copies the prod database into it

This should improve our release confidence by reducing the likelihood of regressions or unexpected breakages.
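
Roughly, the trigger configuration could look like the sketch below; the cron time and concurrency group name are illustrative assumptions (the review further down mentions a 2 AM UTC schedule and a concurrency group), not necessarily the exact values in the workflow.

    # Sketch only: schedule time and concurrency group name are assumptions
    on:
      schedule:
        - cron: "0 2 * * *"    # daily run
      workflow_dispatch: {}     # allows manual triggering
    concurrency:
      group: sync-staging-db    # hypothetical name; prevents overlapping runs
      cancel-in-progress: false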

Copy from Claude:

What This Does:

1. Connects to prod → Extracts backup credentials
2. Switches to staging → Configures backup access
3. Creates restore PVC → 50GB temporary storage
4. Triggers k8up restore → Downloads latest prod backup
5. Waits for restore → With proper job discovery
6. Finds PostgreSQL PVC → Dynamically
7. Scales down DB → Safely
8. Copies data → With validation and backup (a rough sketch of steps 7-9 follows this list)
9. Scales up DB → With pod creation waiting
10. Verifies DB → Tests tables and queries
11. Cleans up → Removes temporary resources
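
A rough sketch of steps 7-9, using hypothetical resource names (the actual workflow discovers the PostgreSQL workload and PVC dynamically):

    - name: Scale down, copy data, scale up (sketch; resource names are hypothetical)
      run: |
        # Stop the staging database before touching its volume
        kubectl -n staging scale statefulset/postgres --replicas=0
        kubectl -n staging rollout status statefulset/postgres --timeout=300s

        # Copy the restored data onto the database PVC via a helper job (hypothetical manifest)
        kubectl -n staging apply -f copy-restored-data-job.yaml
        kubectl -n staging wait --for=condition=complete job/copy-restored-data --timeout=1800s

        # Bring the database back up and wait for it to settle
        kubectl -n staging scale statefulset/postgres --replicas=1
        kubectl -n staging rollout status statefulset/postgres --timeout=600s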

How Has This Been Tested?

No

Breaking Changes

No (I assume no one relies on staging keeping its state)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

Signed-off-by: Radoslav Dimitrov <[email protected]>
@rdimitrov
Member Author

Initially I wanted to leverage the existing backups so we could validate them as well, but from what I found they are whole-cluster backups rather than database-only ones, so this seemed easier. Happy to revisit if any of you have opinions about this 👍

@domdomegg
Member

> Initially I wanted to leverage the existing backups so we could validate them as well, but from what I found they are whole-cluster backups rather than database-only ones, so this seemed easier. Happy to revisit if any of you have opinions about this 👍

Appreciate it might be more difficult, but I think I would be very excited for this. It will enable us to test our backups, and likely allow us to use lower privileges for the sync (i.e. not having to log in to the prod DB, just reading from the backups bucket).

domdomegg previously approved these changes Oct 8, 2025
Member

@domdomegg left a comment


I think I am mostly okay with this, but would appreciate maybe an extra set of eyes on it given the potential security implications.

@domdomegg
Member

@claude can you review, with particular attention to any unintended security or reliability implications. Read about the project from the docs for context about what this is.

Contributor

claude bot commented Oct 8, 2025

Claude finished @domdomegg's task


Security and Reliability Review of Database Sync Workflow

I've analyzed the proposed database sync workflow and identified several important considerations below.

Detailed Analysis

Review Progress

  • Read and analyze the workflow file
  • Evaluate security implications
  • Assess reliability concerns
  • Consider operational and architectural aspects
  • Provide comprehensive review feedback

Security Analysis

🔴 High Priority Security Concerns

  1. Overprivileged Production Access (.github/workflows/sync-db.yml:24)

    • The workflow uses GCP_PROD_SERVICE_ACCOUNT_KEY which grants full production cluster access
    • This violates the principle of least privilege - the job only needs database read access
    • Risk: If the service account is compromised, it could access all production resources
  2. Credential Exposure Risk (.github/workflows/sync-db.yml:97, 112)

    • Database dump is copied to local filesystem (/tmp/prod-backup.dump)
    • ConfigMap creation could expose sensitive data in cluster logs
    • Risk: Production data temporarily stored in runner filesystem and Kubernetes objects
  3. Missing Input Validation

    • No validation of database dump contents before restoration
    • No size limits on dump files
    • Risk: Malicious or corrupted data could be restored to staging

🟡 Medium Priority Security Concerns

  1. Secret Management (.github/workflows/sync-db.yml:77-81, 162-166)

    • Uses registry-pg-superuser secret which likely has elevated privileges
    • Same superuser credentials used for both dump and restore operations
    • Recommendation: Consider using separate, limited-privilege accounts
  2. Cluster Access Pattern (.github/workflows/sync-db.yml:32-36, 105-107)

    • Workflow switches between production and staging clusters in same job
    • Risk: Configuration errors could lead to operations on wrong cluster

Reliability Analysis

🔴 High Priority Reliability Concerns

  1. No Rollback Mechanism (.github/workflows/sync-db.yml:144-148)

    • Database is dropped before confirming successful restore
    • If restore fails, staging database is left in broken state
    • Impact: Could break staging environment until manual intervention
  2. Insufficient Error Handling (.github/workflows/sync-db.yml:177-182)

    • Only the restore step has failure handling
    • No error handling for dump creation or file transfer
    • Risk: Partial failures could leave system in inconsistent state
  3. Resource Management (.github/workflows/sync-db.yml:48, 87)

    • Uses emptyDir storage for dumps - could exceed node capacity for large databases
    • No resource limits on Kubernetes jobs
    • Risk: Could exhaust cluster resources

🟡 Medium Priority Reliability Concerns

  1. Timing and Concurrency (.github/workflows/sync-db.yml:17-19)

    • Good: Uses concurrency group to prevent parallel runs
    • Concern: 2 AM UTC might not be optimal for global users
    • Risk: Could still impact staging users in different time zones
  2. Monitoring and Observability

    • Limited logging and no alerting on failures
    • No metrics collection for sync duration/success rates
    • Impact: Failures might go unnoticed

Architectural Considerations

Alternative Approach - Backup-Based Sync

As mentioned in the comments, using existing backups would be more secure and reliable:

Benefits:

  • ✅ No direct production database access required
  • ✅ Tests backup integrity as mentioned by @domdomegg
  • ✅ Lower privilege requirements (read-only bucket access)
  • ✅ Better audit trail

Implementation:

  • Access GCS backup bucket instead of live database
  • Use backup service account with storage.objectViewer role (see the sketch after this list)
  • Validate backup integrity before restore
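
A hedged sketch of the lower-privilege grant this would allow (bucket and service account names are placeholders):

    # Sketch: read-only object access on the backup bucket is enough for the sync
    gcloud storage buckets add-iam-policy-binding gs://example-backup-bucket \
      --member="serviceAccount:staging-db-sync@example-project.iam.gserviceaccount.com" \
      --role="roles/storage.objectViewer"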

Recommendations for Current Implementation

Security Improvements

  1. Reduce Privileges: Create dedicated service account with minimal GKE/database permissions
  2. Encrypt Dumps: Add encryption for dump files during transfer
  3. Validation: Add dump file validation before restore
  4. Audit Logging: Enhanced logging for all operations

Reliability Improvements

  1. Backup Current State: Take staging backup before dropping database
  2. Atomic Operations: Use database transactions where possible
  3. Resource Limits: Add memory/storage limits to Kubernetes jobs
  4. Health Checks: Enhance verification step with more comprehensive checks
  5. Alerting: Add notifications for sync failures

Operational Improvements

  1. Dry Run Mode: Add option to test sync without actual restore
  2. Metrics: Track sync success rate, duration, and data volume
  3. Documentation: Add runbook for manual recovery procedures

Verdict

While the workflow achieves its goal, it has significant security and reliability risks that should be addressed before merging. The backup-based approach suggested by @domdomegg would be more secure and operationally sound.

Recommendation: Consider implementing the backup-based approach first, or address the high-priority security/reliability concerns before proceeding with this implementation.

@rdimitrov
Member Author

rdimitrov commented Oct 8, 2025

Yeah, I too lean more towards the backup approach 👍 I'll update the PR to use it instead.

PS: I'll dismiss your review just so no one merges it accidentally.

@rdimitrov dismissed domdomegg’s stale review October 8, 2025 11:03

Decided to implement the initial idea of leveraging the prod backups

@rdimitrov
Member Author

@claude - can you review it again

Contributor

claude bot commented Oct 8, 2025

Claude finished @rdimitrov's task


Security and Reliability Review of Updated Database Sync Workflow

Excellent improvement switching to the backup-based approach! This addresses several major security concerns from my previous review. Here's my analysis of the updated implementation.

Detailed Analysis

Review Progress

  • Read and analyze the updated workflow file
  • Evaluate security improvements from backup-based approach
  • Assess reliability and operational concerns
  • Compare against previous review feedback
  • Provide comprehensive review with specific recommendations

Major Security Improvements ✅

🟢 Significant Security Wins

  1. Eliminated Direct Production Database Access (.github/workflows/sync-db.yml:92-125)

    • Now uses k8up backups instead of live database connections
    • Reduces production attack surface significantly
    • Impact: Major security improvement - no direct prod DB credentials needed
  2. Improved Credential Management (.github/workflows/sync-db.yml:64-72)

    • Backup credentials are read-only for backup bucket access
    • Temporary secret creation with proper cleanup (.github/workflows/sync-db.yml:365-366)
    • Impact: Follows principle of least privilege much better
  3. Better Separation of Concerns (.github/workflows/sync-db.yml:21-62)

    • Clear distinction between prod credential extraction and staging operations
    • Separate service accounts for prod and staging environments
    • Impact: Reduces blast radius if credentials are compromised

Remaining Security Considerations

🟡 Medium Priority Security Items

  1. Credential Exposure in GitHub Actions Outputs (.github/workflows/sync-db.yml:44-46)

    • Backup credentials stored in GitHub Actions outputs (though encrypted)
    • Risk: Outputs are visible in workflow logs to repo maintainers
    • Recommendation: Consider using intermediate secret creation instead
  2. Service Account Permissions (.github/workflows/sync-db.yml:24, 53)

    • Still uses broad GCP service account keys
    • Risk: May have more permissions than needed
    • Recommendation: Audit and scope down to minimal GKE + backup bucket access
  3. Backup Data Validation (.github/workflows/sync-db.yml:217-258)

    • Good: Validates PG_VERSION file exists
    • Missing: No checksum or integrity validation of backup data (a sketch of such a check follows this list)
    • Risk: Corrupted backups could break staging
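
If a checksum manifest were published alongside the backups (an assumption; the workflow currently only checks for PG_VERSION), the validation step could look roughly like this, with illustrative paths:

    # Sketch: validate restored data; the SHA256SUMS manifest is a hypothetical addition
    test -f /restore/data/PG_VERSION || { echo "PG_VERSION missing"; exit 1; }
    if [ -f /restore/data/SHA256SUMS ]; then
      (cd /restore/data && sha256sum --check --quiet SHA256SUMS) || { echo "checksum mismatch"; exit 1; }
    fi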

Reliability Analysis

🟢 Reliability Improvements

  1. Better Error Handling (.github/workflows/sync-db.yml:145-159, 280-285)

    • Comprehensive job discovery and waiting logic
    • Proper timeout handling with descriptive error messages
    • Good debugging output on failures
  2. Backup Before Replace (.github/workflows/sync-db.yml:241-245)

    • Creates backup of existing staging data before replacement
    • Impact: Provides recovery option if restore fails
  3. Comprehensive Verification (.github/workflows/sync-db.yml:314-352)

    • Tests database connectivity and basic functionality
    • Validates table counts and shows statistics
    • Impact: Catches obvious data corruption issues

🟡 Remaining Reliability Concerns

  1. Resource Constraints (.github/workflows/sync-db.yml:74-90, 197-271)

    • Fixed 50GB PVC might be insufficient for larger databases
    • No resource limits on copy jobs
    • Risk: Could exhaust cluster resources with large backups
  2. Timing Dependencies (.github/workflows/sync-db.yml:131-143, 296-312)

    • Multiple hardcoded wait loops and timeouts
    • Risk: May not scale well with database growth or cluster load
  3. Single Point of Failure (.github/workflows/sync-db.yml:106)

    • Uses "latest" snapshot - if it's corrupted, no fallback
    • Risk: Could consistently fail if latest backup is bad

Operational Improvements

🟢 Good Operational Practices

  1. Comprehensive Cleanup (.github/workflows/sync-db.yml:354-374)

    • Removes temporary resources and old jobs
    • Proper security cleanup of credentials
    • Good resource management
  2. Detailed Logging (.github/workflows/sync-db.yml:217-258)

    • Extensive debugging output for troubleshooting
    • Clear status messages throughout workflow

🟡 Operational Recommendations

  1. Backup Rotation Strategy (.github/workflows/sync-db.yml:106)

    • Consider allowing selection of specific backup dates
    • Enhancement: Add parameter for backup selection (latest, specific date, etc.)
  2. Monitoring and Alerting

    • No alerting on sync failures
    • Enhancement: Add Slack/email notifications for failures (see the sketch after this list)
  3. Size Validation (.github/workflows/sync-db.yml:74-90)

    • PVC size is hardcoded at 50GB
    • Enhancement: Dynamic sizing based on backup size or add validation
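
For the alerting enhancement above, a minimal failure-notification step could be appended to the job; the webhook secret name is an assumption:

    - name: Notify on failure (sketch; SLACK_WEBHOOK_URL is an assumed secret)
      if: failure()
      run: |
        curl -fsS -X POST -H 'Content-type: application/json' \
          --data '{"text":"Staging DB sync failed: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"}' \
          "${{ secrets.SLACK_WEBHOOK_URL }}"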

Specific Code Recommendations

High Priority Fixes

  1. Credential Handling (.github/workflows/sync-db.yml:44-46)

    # Instead of storing in outputs, consider:
    - name: Configure backup access directly
      run: |
        kubectl create secret generic prod-backup-credentials \
          --from-literal=AWS_ACCESS_KEY_ID="$ACCESS_KEY" \
          --from-literal=AWS_SECRET_ACCESS_KEY="$SECRET_KEY" \
          --dry-run=client -o yaml | kubectl apply -f -
  2. Dynamic PVC Sizing (.github/workflows/sync-db.yml:86-87)

    # Consider checking backup size first:
    storage: "${BACKUP_SIZE_GB:-50}Gi"

Medium Priority Enhancements

  1. Backup Selection (.github/workflows/sync-db.yml:106)

    snapshot: ${{ inputs.backup_date || 'latest' }}
  2. Resource Limits (.github/workflows/sync-db.yml:209-214)

    resources:
      limits:
        memory: "2Gi"
        cpu: "1000m"

Comparison to Previous Review

Major improvements addressed:

  • ✅ Eliminated overprivileged production access
  • ✅ Reduced credential exposure risk
  • ✅ Added backup mechanism before data replacement
  • ✅ Improved error handling and debugging
  • ✅ Better resource cleanup

Still remaining:

  • 🟡 Service account scoping could be improved
  • 🟡 Resource management for large databases
  • 🟡 Monitoring and alerting capabilities

Final Verdict

Significant improvement! The backup-based approach addresses the major security and reliability concerns from the previous review. This is much safer and more operationally sound.

Recommendation: ✅ Approve with minor enhancements

The remaining issues are mostly optimizations rather than blockers. The core security and reliability improvements make this a solid implementation that achieves the goal safely.

Priority for follow-up:

  1. Audit service account permissions
  2. Add backup size validation
  3. Consider backup selection options
  4. Add monitoring/alerting


@rdimitrov
Member Author

@claude - can you review the changes again, assessing the potential risks of accidentally affecting the prod environment? Explore cases like one or more jobs failing, services timing out, etc. We want to be sure we don't expose the prod environment to any potential errors.

Contributor

claude bot commented Oct 8, 2025

Claude finished @rdimitrov's task


Production Safety Risk Analysis

Comprehensive analysis of production impact risks completed. The workflow has excellent safety mechanisms with multiple layers of protection against accidental production modifications.

Detailed Risk Analysis & Findings

Todo List

  • Read and analyze the current workflow implementation
  • Evaluate authentication and cluster switching safety mechanisms
  • Identify failure scenarios that could affect production
  • Assess timeout and error handling implications
  • Provide specific risk mitigation recommendations
  • Complete comprehensive production safety review

🟢 Excellent Safety Mechanisms in Place

Multi-Layer Production Protection

The workflow implements defense-in-depth with multiple independent safety checks:

  1. Complete Production Logout (.github/workflows/sync-db.yml:63-91)

    • ✅ Deletes prod kubectl context entirely
    • ✅ Revokes all gcloud credentials
    • ✅ Clears gcloud configuration
    • Verifies no prod contexts remain (critical safety check)
    • Impact: Even if later steps fail, no production access remains
  2. Multiple Environment Verification Checks

    • ✅ Lines 205-248: Comprehensive staging verification before any destructive operations
    • ✅ Lines 283-327: Final safety check immediately before data replacement
    • ✅ Both checks verify project name, cluster context, and abort on any production indicators
  3. GitHub Environment Protection (.github/workflows/sync-db.yml:16)

    • ✅ Uses environment: staging which requires staging approval
    • ✅ Prevents accidental execution without proper environment setup

🟢 Failure Scenario Analysis - Production is Protected

Authentication Failures

Scenario: Staging auth fails, production context might remain

  • MITIGATED: Step 63-91 explicitly removes ALL production access before staging auth
  • Fail-safe: If staging auth fails, workflow stops with no cluster access at all

Scenario: Production credential extraction fails

  • SAFE: Only reads backup credentials, never writes to production
  • Fail-safe: Workflow stops, no further operations

Cluster Switching Failures

Scenario: gcloud config set project doesn't take effect

  • MITIGATED: Lines 224-228 and 291-296 verify actual project matches expected
  • Fail-safe: Hard abort if project verification fails

Scenario: kubectl context points to wrong cluster

  • MITIGATED: Lines 217-222 and 288-295 check context contains "staging" and not "prod"
  • Fail-safe: Multiple independent context checks with immediate abort

Timeout and Service Failures

Scenario: Restore operation times out or fails

  • SAFE: Only affects staging cluster - production remains untouched
  • Contained: All destructive operations happen only after staging verification

Scenario: Database scaling operations fail

  • SAFE: Only targets staging PostgreSQL cluster
  • Protected: PVC name validation (lines 304-309) rejects anything containing "prod"

Scenario: GitHub Actions runner fails mid-workflow

  • SAFE: Production access already revoked in step 63-91
  • No risk: Partial failure leaves production completely unaffected

🟡 Minor Risk Areas (Low Priority)

1. GitHub Actions Output Storage (.github/workflows/sync-db.yml:57-59)

  • Risk: Backup credentials briefly stored in encrypted GitHub outputs
  • Impact: Very low - credentials are read-only for backup bucket
  • Mitigation: Already using encrypted outputs, could consider direct secret creation

2. Service Account Permissions

  • Risk: If service accounts have broader permissions than needed
  • Impact: Low - explicit production logout prevents usage
  • Recommendation: Audit actual IAM permissions vs requirements

3. Backup Data Validation

  • Risk: Malicious backup data could affect staging
  • Impact: Very low - only affects staging, not production
  • Note: Some validation exists (PG_VERSION check), could be enhanced

🟢 Key Safety Innovations

Production Access Revocation Pattern (.github/workflows/sync-db.yml:63-91)

This is exceptional security design:

# Remove production cluster from kubeconfig
kubectl config delete-context gke_mcp-registry-prod_us-central1-b_mcp-registry-prod
# Revoke gcloud credentials  
gcloud auth revoke --all
# Verify no contexts contain "prod"
if kubectl config get-contexts | grep -i prod; then exit 1; fi

Why this is brilliant:

  • Creates an air gap between production access and destructive operations
  • Even if subsequent authentication fails, production remains inaccessible
  • Fail-closed design - defaults to no access rather than wrong access

Paranoid Context Verification

Multiple independent checks at different workflow stages:

  • Before database operations (lines 205-248)
  • Immediately before data replacement (lines 283-327)
  • Each check uses different verification methods (context name, project ID, PVC name analysis)
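
A minimal sketch of what one of these verification gates could look like (the exact strings and commands are assumptions based on the description above):

    - name: Verify staging context before destructive operations (sketch)
      run: |
        CURRENT_CONTEXT="$(kubectl config current-context)"
        CURRENT_PROJECT="$(gcloud config get-value project)"
        # Abort on any hint of production, and require an explicit staging indicator
        if echo "$CURRENT_CONTEXT $CURRENT_PROJECT" | grep -qi prod; then
          echo "Refusing to continue: production context detected"; exit 1
        fi
        if ! echo "$CURRENT_CONTEXT" | grep -qi staging; then
          echo "Refusing to continue: staging context not confirmed"; exit 1
        fi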

🔒 Security Assessment: VERY LOW PRODUCTION RISK

Verdict: The workflow demonstrates exceptional security engineering with multiple independent safety mechanisms.

Risk Level: 🟢 VERY LOW

Why production is well-protected:

  1. Complete access revocation before any destructive operations
  2. Multiple independent verification checks at critical points
  3. Fail-closed design - failures result in no access, not wrong access
  4. Environment-based access control through GitHub staging environment
  5. Explicit production exclusion checks throughout the workflow

Potential Improvements (Optional)

  1. Service Account Scoping: Audit and minimize IAM permissions
  2. Backup Integrity: Add checksum validation for backup data
  3. Monitoring: Add alerting for workflow failures
  4. Dry-run Mode: Consider adding a validation-only mode
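
For the dry-run suggestion, a validation-only mode could be exposed as a manual input and used to gate the destructive steps; the input name is an assumption:

    # Sketch: optional validation-only mode (input name is hypothetical)
    on:
      workflow_dispatch:
        inputs:
          dry_run:
            description: "Validate access and backup contents without replacing staging data"
            type: boolean
            default: false
    # Destructive steps would then carry a guard like:
    #   if: ${{ !inputs.dry_run }}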

Final Assessment

This workflow represents best-practice security engineering for cross-environment operations. The multi-layered safety approach, explicit production logout pattern, and paranoid verification checks make accidental production impact extremely unlikely.

Recommendation: ✅ Approve for production use

The safety mechanisms are comprehensive and well-designed. The workflow achieves its goal while maintaining excellent security posture against production impact risks.


@domdomegg merged commit 8de5283 into main Oct 9, 2025
7 checks passed
@domdomegg deleted the sync-prod-to-staging branch October 9, 2025 11:01