feat: batch region migration for failover #7245
base: main
Signed-off-by: WenyXu <[email protected]>
Pull request overview
This PR refactors the region migration failover logic to support batch processing of regions. Previously, each region migration was handled individually; now, regions are grouped by their source and destination peer pairs and migrated together in a single procedure.
Key Changes:
- Enhanced event recording infrastructure to support multiple rows per event
- Implemented batch region migration with comprehensive result tracking
- Refactored `PersistentContext` to support multiple catalog/schema pairs
- Added utility functions for analyzing region migration tasks
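To illustrate the result tracking mentioned above, here is a minimal sketch of a per-batch result that buckets region IDs by outcome. The outcome names mirror the categories listed in the PR description; the type names `Outcome` and `BatchResultSketch`, the `u64` region IDs, and the `record` method are hypothetical and are not the PR's actual `SubmitRegionMigrationTaskResult` definition.

```rust
/// Hypothetical outcome of trying to submit one region for migration.
/// The variants mirror the categories named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Migrated,
    Migrating,
    TableNotFound,
    LeaderChanged,
    PeerConflict,
    Submitted,
}

/// Hypothetical per-batch result: one bucket of region IDs per outcome.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct BatchResultSketch {
    pub migrated: Vec<u64>,
    pub migrating: Vec<u64>,
    pub table_not_found: Vec<u64>,
    pub leader_changed: Vec<u64>,
    pub peer_conflict: Vec<u64>,
    pub submitted: Vec<u64>,
}

impl BatchResultSketch {
    /// Files a region ID into the bucket that matches its outcome.
    pub fn record(&mut self, region_id: u64, outcome: Outcome) {
        let bucket = match outcome {
            Outcome::Migrated => &mut self.migrated,
            Outcome::Migrating => &mut self.migrating,
            Outcome::TableNotFound => &mut self.table_not_found,
            Outcome::LeaderChanged => &mut self.leader_changed,
            Outcome::PeerConflict => &mut self.peer_conflict,
            Outcome::Submitted => &mut self.submitted,
        };
        bucket.push(region_id);
    }
}
```

Keeping one bucket per outcome lets a caller decide what to do with each group, for example retrying only the regions that were not submitted.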
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| supervisor.rs | Groups failover tasks by peer pairs and submits them in batches |
| utils.rs | New module containing batch migration task structure and analysis functions |
| manager.rs | Adds batch submission API with detailed result tracking |
| region_migration.rs | Updates PersistentContext to support multiple catalog/schema pairs |
| region_migration_event.rs | Refactored to generate multiple event rows for batch operations |
| event.rs | Updated Event trait to return multiple rows instead of single row |
| recorder.rs | Updated to handle multiple rows per event |
| slow_query_event.rs | Adapted to new extra_rows API returning Vec |
| test files | Updated test fixtures to match new PersistentContext structure |
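The "multiple rows per event" change summarized in the table can be pictured with a simplified sketch. Everything here is an assumption made for illustration: the `Row` alias, the trait shape, the `BatchMigrationEventSketch` type, and `collect_rows` are hypothetical, and only the method name `extra_rows` and its `Vec` return type come from the summary above.

```rust
/// Hypothetical row representation: one string per column.
pub type Row = Vec<String>;

/// Simplified stand-in for an event trait whose `extra_rows` returns
/// several rows instead of a single one.
pub trait Event {
    fn event_type(&self) -> &str;
    fn extra_rows(&self) -> Vec<Row>;
}

/// Hypothetical batch migration event covering several regions.
pub struct BatchMigrationEventSketch {
    pub from_peer_id: u64,
    pub to_peer_id: u64,
    pub region_ids: Vec<u64>,
}

impl Event for BatchMigrationEventSketch {
    fn event_type(&self) -> &str {
        "region_migration"
    }

    /// Emits one row per region in the batch.
    fn extra_rows(&self) -> Vec<Row> {
        self.region_ids
            .iter()
            .map(|region_id| {
                vec![
                    region_id.to_string(),
                    self.from_peer_id.to_string(),
                    self.to_peer_id.to_string(),
                ]
            })
            .collect()
    }
}

/// A recorder-like helper that flattens the rows of all events.
pub fn collect_rows(events: &[&dyn Event]) -> Vec<Row> {
    events.iter().flat_map(|event| event.extra_rows()).collect()
}
```

With this shape, a single batch event expands to one row per migrated region, while single-row events can return a one-element `Vec`.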
    let from_peer_id = from_peer.id;
    let to_peer = tasks[0].0.to_peer.clone();
    let to_peer_id = to_peer.id;
    let timeout = Duration::from_secs(120) * max_count;
Copilot AI, Nov 24, 2025:
The timeout calculation uses a hardcoded value Duration::from_secs(120) instead of the DEFAULT_REGION_MIGRATION_TIMEOUT constant. This should be DEFAULT_REGION_MIGRATION_TIMEOUT * max_count for consistency with the timeout calculation in generate_failover_tasks (line 692) and to maintain a single source of truth for the timeout value.
Suggested change:
    - let timeout = Duration::from_secs(120) * max_count;
    + let timeout = DEFAULT_REGION_MIGRATION_TIMEOUT * max_count;
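A minimal sketch of the single-source-of-truth approach the comment recommends. The constant's value of 120 seconds is an assumption inferred from the hardcoded literal it would replace, and both the helper `batch_timeout` and the `u32` type of `max_count` are hypothetical.

```rust
use std::time::Duration;

/// Assumed value: taken from the hardcoded 120 s literal in the diff.
pub const DEFAULT_REGION_MIGRATION_TIMEOUT: Duration = Duration::from_secs(120);

/// Scales the per-region timeout by the number of regions in the largest
/// batch. `Duration * u32` is provided by the standard library and panics
/// on overflow.
pub fn batch_timeout(max_count: u32) -> Duration {
    DEFAULT_REGION_MIGRATION_TIMEOUT * max_count
}
```

For example, `batch_timeout(3)` evaluates to a six-minute `Duration` under the assumed constant, and changing the constant updates every call site at once.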
    use common_meta::key::TableMetadataManagerRef;
    use common_meta::peer::Peer;
    use common_meta::rpc::router::RegionRoute;
    use itertools::Itertools;
Copilot AI, Nov 24, 2025:
The itertools::Itertools import is unused. Only standard iterator methods (iter(), into_iter(), zip()) are used in this file. Consider removing this import to keep dependencies minimal.
Suggested change:
    - use itertools::Itertools;
I hereby agree to the terms of the GreptimeDB CLA.
Refer to a related PR or issue link (optional)
#7021
What's changed and what's your intention?
Previously, when multiple regions from the same datanode needed to be migrated to another datanode during failover, each region was processed individually in separate migration procedures. This PR refactors the failover logic to batch regions by their source and destination peers, allowing multiple regions to be migrated together in a single procedure.
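The grouping step described above can be sketched as follows. This is an illustration only, not the PR's code: `FailoverTaskSketch`, its `u64` fields, and `group_by_peer_pair` are hypothetical names, and the real tasks carry more information than shown here.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a single-region failover task.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FailoverTaskSketch {
    pub region_id: u64,
    pub from_peer_id: u64,
    pub to_peer_id: u64,
}

/// Groups tasks by their (from_peer_id, to_peer_id) pair so that each group
/// can be submitted as a single batch migration.
pub fn group_by_peer_pair(
    tasks: Vec<FailoverTaskSketch>,
) -> HashMap<(u64, u64), Vec<FailoverTaskSketch>> {
    let mut groups: HashMap<(u64, u64), Vec<FailoverTaskSketch>> = HashMap::new();
    for task in tasks {
        groups
            .entry((task.from_peer_id, task.to_peer_id))
            .or_default()
            .push(task);
    }
    groups
}
```

Each map entry then corresponds to one procedure instead of one procedure per region; tasks within a group keep their original order.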
Changes
1. Event Recorder Refactoring
- `EventRecorder` interface and implementation
2. Batch Region Migration Support
- `RegionSupervisor::do_failover()` to group migration tasks by `(from_peer_id, to_peer_id)` pairs before submission
- `RegionMigrationManager::submit_region_migration_task()` to handle batch region migration requests
- `SubmitRegionMigrationTaskResult` to track different migration outcomes:
  - `migrated`: Regions already at the target peer
  - `migrating`: Regions with ongoing migrations
  - `table_not_found`: Regions whose tables have been dropped
  - `leader_changed`: Regions where leadership has changed
  - `peer_conflict`: Regions with peer conflicts
  - `submitted`: Regions successfully submitted for migration
- `utils.rs` with `RegionMigrationTask` and analysis functions to support batch operations
PR Checklist
Please convert it to a draft if some of the following conditions are not met.