Skip to content

Conversation

@WenyXu
Copy link
Member

@WenyXu WenyXu commented Nov 17, 2025

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

#7021

What's changed and what's your intention?

Previously, when multiple regions from the same datanode needed to be migrated to another datanode during failover, each region was processed individually in separate migration procedures. This PR refactors the failover logic to batch regions by their source and destination peers, allowing multiple regions to be migrated together in a single procedure.

Changes

1. Event Recorder Refactoring

  • Refactor event recorder to support multiple rows per event: Modified the event recording infrastructure to support recording multiple rows for a single event, which enables better tracking of batch operations.
  • Updated EventRecorder interface and implementation
  • Modified region migration event recording to leverage the new batch capability

2. Batch Region Migration Support

  • Group failover tasks by peer pairs: Refactored RegionSupervisor::do_failover() to group migration tasks by (from_peer_id, to_peer_id) pairs before submission
  • New batch submission API: Added RegionMigrationManager::submit_region_migration_task() to handle batch region migration requests
  • Comprehensive result handling: Introduced SubmitRegionMigrationTaskResult to track different migration outcomes:
    • migrated: Regions already at the target peer
    • migrating: Regions with ongoing migrations
    • table_not_found: Regions whose tables have been dropped
    • leader_changed: Regions where leadership has changed
    • peer_conflict: Regions with peer conflicts
    • submitted: Regions successfully submitted for migration
  • New utility module: Added utils.rs with RegionMigrationTask and analysis functions to support batch operations
  • Improved logging: Enhanced logging to show batch operation details with region lists

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.
  • API changes are backward compatible.
  • Schema or data changes are backward compatible.

@github-actions github-actions bot added size/M docs-not-required This change does not impact docs. labels Nov 17, 2025
@WenyXu WenyXu force-pushed the batch-migration-part2 branch 3 times, most recently from 7b931d5 to b3f2d4e Compare November 18, 2025 06:56
@WenyXu WenyXu force-pushed the batch-migration-part2 branch from b3f2d4e to b6cb913 Compare November 24, 2025 07:02
@WenyXu WenyXu requested a review from Copilot November 24, 2025 07:02
Copilot finished reviewing on behalf of WenyXu November 24, 2025 07:04
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the region migration failover logic to support batch processing of regions. Previously, each region migration was handled individually; now, regions are grouped by their source and destination peer pairs and migrated together in a single procedure.

Key Changes:

  • Enhanced event recording infrastructure to support multiple rows per event
  • Implemented batch region migration with comprehensive result tracking
  • Refactored PersistentContext to support multiple catalog/schema pairs
  • Added utility functions for analyzing region migration tasks

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
supervisor.rs Groups failover tasks by peer pairs and submits them in batches
utils.rs New module containing batch migration task structure and analysis functions
manager.rs Adds batch submission API with detailed result tracking
region_migration.rs Updates PersistentContext to support multiple catalog/schema pairs
region_migration_event.rs Refactored to generate multiple event rows for batch operations
event.rs Updated Event trait to return multiple rows instead of single row
recorder.rs Updated to handle multiple rows per event
slow_query_event.rs Adapted to new extra_rows API returning Vec
test files Updated test fixtures to match new PersistentContext structure

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions github-actions bot added size/L and removed size/M labels Nov 24, 2025
@WenyXu WenyXu requested a review from Copilot November 24, 2025 09:03
Copilot finished reviewing on behalf of WenyXu November 24, 2025 09:07
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

let from_peer_id = from_peer.id;
let to_peer = tasks[0].0.to_peer.clone();
let to_peer_id = to_peer.id;
let timeout = Duration::from_secs(120) * max_count;
Copy link

Copilot AI Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout calculation uses a hardcoded value Duration::from_secs(120) instead of the DEFAULT_REGION_MIGRATION_TIMEOUT constant. This should be DEFAULT_REGION_MIGRATION_TIMEOUT * max_count for consistency with the timeout calculation in generate_failover_tasks (line 692) and to maintain a single source of truth for the timeout value.

Suggested change
let timeout = Duration::from_secs(120) * max_count;
let timeout = DEFAULT_REGION_MIGRATION_TIMEOUT * max_count;

Copilot uses AI. Check for mistakes.
use common_meta::key::TableMetadataManagerRef;
use common_meta::peer::Peer;
use common_meta::rpc::router::RegionRoute;
use itertools::Itertools;
Copy link

Copilot AI Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The itertools::Itertools import is unused. Only standard iterator methods (iter(), into_iter(), zip()) are used in this file. Consider removing this import to keep dependencies minimal.

Suggested change
use itertools::Itertools;

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs-not-required This change does not impact docs. size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant