Integrate PPAF with other cross region availability functionality. #44099

Open
wants to merge 129 commits into
base: main

jeet1995 (Member) commented Feb 8, 2025

Description

This PR introduces region-routing decision making for the case where several high-availability settings are enabled together: Per-Partition Circuit Breaker (PPCB), Per-Partition Automatic Failover (PPAF), Exclude Regions, and Cross-Region Availability Strategy.

It also enables PPCB for document reads (point reads, queries, read all, and read many) against a single-write multi-region Cosmos DB account.

Thought Process

Consider the applicable region list, an SDK-internal list of the regions a request can be routed to. It is computed by first intersecting the user-provided preferred regions with the regions of the account, and then removing the user-provided exclude regions.
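
A minimal sketch of that computation, for illustration only. The class and method names (`ApplicableRegionsSketch`, `applicableRegions`) are hypothetical and are not the SDK's `LocationCache` API. Treating region names as plain case-sensitive strings is an assumption, and the fallback for an empty result is deliberately left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ApplicableRegionsSketch {

    // Hypothetical helper (not an SDK method): intersect the preferred regions
    // with the account regions, then drop the user-provided exclude regions.
    // The order of the preferred regions is preserved.
    static List<String> applicableRegions(List<String> preferredRegions,
                                          List<String> accountRegions,
                                          List<String> excludedRegions) {
        Set<String> account = new LinkedHashSet<>(accountRegions);
        Set<String> excluded = new LinkedHashSet<>(excludedRegions);
        List<String> result = new ArrayList<>();
        for (String region : preferredRegions) {
            boolean eligible = account.contains(region) && !excluded.contains(region);
            if (eligible && !result.contains(region)) {
                result.add(region);
            }
        }
        // An empty result is where a fallback region would come in; not modeled here.
        return result;
    }

    public static void main(String[] args) {
        List<String> regions = List.of("NCUS", "WUS", "CUS");
        // Preferred [NCUS, WUS, CUS] with NCUS excluded.
        System.out.println(applicableRegions(regions, regions, List.of("NCUS")));
    }
}
```

With preferred regions `[NCUS, WUS, CUS]` and `NCUS` excluded, the sketch yields `[WUS, CUS]`.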

  • Customer-expressed regions, whether exclude regions or preferred regions, are always honored for reads, and for writes to multi-write multi-region accounts. The caveats are when all regions are excluded, or when empty preferred regions are used with the global endpoint. In these cases a fallback is used, which is either the account-level primary region (the write region in a single-write multi-region account) or the partition-set-level primary region (the write region for a given partition set as determined by PPAF).
  • When PPCB deems a region unavailable for a partition, and that region is not already excluded through exclude regions, the idea is to grow the applicable region list if its size is 1. This is done by appending first the fallback and then the oldest unavailable region for the partition from PPCB's perspective. This ensures availability is not scoped to a single region in the face of transient failures (for example, a network blip or client-side compute issues).
  • With Cross-Region Availability Strategy, the SDK issues a "main" request, which can also be retried across regions within the applicable region list, and a "hedged" request, which is pinned to a single region from the second preferred region onwards. The rules above apply only to the "main" request.
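
The PPCB augmentation rule could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `PpcbAugmentationSketch` and `augment` are hypothetical names, the PPCB list is assumed to arrive ordered oldest first, and skipping a fallback region that the user has excluded is an assumption made so the sketch agrees with the multi-write example row that has Exclude Regions `[NCUS]` and PPCB Unavailable Regions `[WUS]`.

```java
import java.util.ArrayList;
import java.util.List;

final class PpcbAugmentationSketch {

    // Hypothetical helper (not an SDK method). Grows a single-region applicable
    // list for the "main" request when PPCB has marked regions unavailable.
    static List<String> augment(List<String> applicableRegions,
                                String fallbackRegion,
                                List<String> userExcludedRegions,
                                List<String> ppcbUnavailableOldestFirst,
                                boolean isHedgedRequest) {
        List<String> result = new ArrayList<>(applicableRegions);

        // Hedged requests stay pinned to their region; lists that are not of
        // size 1, or partitions without PPCB exclusions, are left untouched.
        if (isHedgedRequest || result.size() != 1 || ppcbUnavailableOldestFirst.isEmpty()) {
            return result;
        }

        // Step 1: append the fallback region.
        // Assumption: skipped when the user explicitly excluded it.
        if (fallbackRegion != null
                && !userExcludedRegions.contains(fallbackRegion)
                && !result.contains(fallbackRegion)) {
            result.add(fallbackRegion);
        }

        // Step 2: append the oldest PPCB-unavailable region that the user did not exclude.
        for (String region : ppcbUnavailableOldestFirst) {
            if (userExcludedRegions.contains(region)) {
                continue;
            }
            if (!result.contains(region)) {
                result.add(region);
            }
            break; // one circuit-broken region is enough
        }
        return result;
    }

    public static void main(String[] args) {
        // Applicable list [CUS] after excluding NCUS (user) and WUS (PPCB).
        System.out.println(augment(
            List.of("CUS"), "NCUS", List.of("NCUS"), List.of("WUS"), false));
    }
}
```

For the input in `main`, the sketch returns `[CUS, WUS]`; the same input with `isHedgedRequest` set to `true` is returned unchanged as `[CUS]`.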

Region Augmentation

flowchart TD
    A[LocationCache#getApplicableRegions] --> B[Exclude user-configured exclude regions.]
    B --> C[Exclude PPCB-provided exclude regions - store in a helper list  the URIs and regions excluded by PPCB.]
    C --> D{Is preferred endpoints list size <= 1?}
    C --> E{Is applicable regions list size >= 2?}
    D --> |Yes|F[Return]
    E --> |Yes|F[Return]
    D --> |No|G{Is hedged request?}
    E --> |No|G
    G --> |Yes|F
    G --> |No|H{Is fallback endpoint being used?}
    H --> |Yes|I{Is user-configured exclude regions non-empty && PPCB-configured exclude regions empty?}
    I --> |Yes|K[Apply PPAF override for reads too]
    I --> |No|L{Is user-provided preferred regions empty?}
    L --> |Yes|O[Apply PPAF override as first applicable region]
    L --> |No|M
    K --> F
    H --> |No|M[Loop through PPCB excluded regions and add to applicable region set if different from fallback region and until applicable region list contains at least 1 circuit broken region]
   

Some examples

Take a Multi-Write 3-Region Account and a Single-Write 3-Region Account, each with regions NCUS, WUS, and CUS. Let NCUS be the write region in both account types.

Legend

| Notation | Description |
|---|---|
| Rd | Denotes a read operation |
| Wt | Denotes a write operation |
| PPCB | Per-Partition Circuit Breaker |
| PPAF | Per-Partition Automatic Failover |
| UR | Unavailable Regions |
| O | Override Regions |

Multi-Write 3-Region Account

| Operation | Preferred Regions | Exclude Regions | PPCB Unavailable Regions | Is Hedged? | PPAF Overrides | Applicable Regions | Override Regions |
|---|---|---|---|---|---|---|---|
| Rd | [NCUS, WUS, CUS] | [] | [] | No | N/A | [NCUS, WUS, CUS] | N/A |
| Rd | [NCUS, WUS, CUS] | [NCUS] | [] | No | N/A | [WUS, CUS] | N/A |
| Rd | [NCUS, WUS, CUS] | [NCUS, WUS, CUS] | [] | No | N/A | [NCUS] | N/A |
| Rd | [NCUS, WUS, CUS] | [NCUS] | [WUS] | No | N/A | [CUS, WUS] | N/A |
| Rd | [] | [NCUS] | [WUS] | No | N/A | [CUS, WUS] | N/A |
| Wt | [NCUS, WUS, CUS] | [] | [] | No | N/A | [NCUS, WUS, CUS] | N/A |
| Wt | [NCUS, WUS, CUS] | [NCUS] | [] | No | N/A | [WUS, CUS] | N/A |
| Wt | [NCUS, WUS, CUS] | [NCUS, WUS, CUS] | [] | No | N/A | [NCUS] | N/A |
| Wt | [NCUS, WUS, CUS] | [NCUS] | [WUS] | No | N/A | [CUS, WUS] | N/A |
| Wt | [] | [NCUS] | [WUS] | No | N/A | [CUS, WUS] | N/A |

Single-Write 3-Region Account

| Operation | Preferred Regions | Exclude Regions | PPCB Unavailable Regions | Is Hedged? | PPAF Overrides | Applicable Regions | Override Regions |
|---|---|---|---|---|---|---|---|
| Rd | [NCUS, WUS, CUS] | [] | [] | No | {UR:[], O: <>} | [NCUS, WUS, CUS] | N/A |
| Rd | [NCUS, WUS, CUS] | [NCUS, WUS, CUS] | [] | No | {UR:[], O: <>} | [NCUS] | N/A |
| Rd | [NCUS, WUS, CUS] | [NCUS] | [WUS] | No | {UR:[], O: <>} | [CUS, WUS] | N/A |
| Rd | [NCUS, WUS, CUS] | [NCUS, WUS, CUS] | [] | No | {UR:[NCUS], O: <WUS>} | [WUS] | WUS |
| Rd | [NCUS, WUS, CUS] | [] | [] | No | {UR:[NCUS], O: <WUS>} | [NCUS, WUS, CUS] | WUS |
| Rd | [] | [] | [] | No | {UR:[NCUS], O: <WUS>} | [WUS, NCUS] | WUS |
| Rd | [] | [NCUS, WUS, CUS] | [] | No | {UR:[NCUS], O: <WUS>} | [WUS] | WUS |
| Rd | [] | [NCUS] | [WUS, CUS] | No | {UR:[NCUS], O: <WUS>} | [WUS, NCUS, CUS] | WUS |
| Wt | [NCUS, WUS, CUS] | [] | N/A | No | {UR:[], O: <>} | [NCUS] | N/A |
| Wt | [NCUS, WUS, CUS] | [NCUS, WUS, CUS] | N/A | No | {UR:[], O: <>} | [NCUS] | N/A |
| Wt | [NCUS, WUS, CUS] | [NCUS, WUS, CUS] | N/A | No | {UR:[NCUS], O: <WUS>} | [WUS] | WUS |
| Wt | [NCUS, WUS, CUS] | [] | N/A | No | {UR:[NCUS], O: <WUS>} | [WUS] | WUS |
| Wt | [] | [] | N/A | No | {UR:[NCUS], O: <WUS>} | [WUS] | WUS |
| Wt | [] | [] | N/A | No | {UR:[NCUS, WUS], O: <CUS>} | [CUS] | CUS |
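
Reading only the write (Wt) rows for the single-write account, the applicable region appears to be the PPAF override region when one exists and the account write region otherwise, regardless of preferred or exclude regions. A small sketch of that reading follows; `PpafWriteRoutingSketch` and `applicableWriteRegions` are hypothetical names, and representing "no override" as `null` is an assumption made for brevity.

```java
import java.util.List;

final class PpafWriteRoutingSketch {

    // Hypothetical helper (not an SDK method): applicable regions for a write
    // against a single-write account. overrideRegion is the PPAF override (O)
    // for the partition, or null when PPAF has not set one.
    static List<String> applicableWriteRegions(String accountWriteRegion, String overrideRegion) {
        String target = (overrideRegion != null) ? overrideRegion : accountWriteRegion;
        return List.of(target);
    }

    public static void main(String[] args) {
        // {UR:[], O: <>}: no override, so the account write region is used.
        System.out.println(applicableWriteRegions("NCUS", null));
        // {UR:[NCUS], O: <WUS>}: the override wins.
        System.out.println(applicableWriteRegions("NCUS", "WUS"));
        // {UR:[NCUS, WUS], O: <CUS>}
        System.out.println(applicableWriteRegions("NCUS", "CUS"));
    }
}
```

The three calls in `main` correspond to the write rows with Applicable Regions `[NCUS]`, `[WUS]`, and `[CUS]`.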

Issues closed

closes #43896
closes #43897
closes #43898

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

jeet1995 added 30 commits July 6, 2024 11:17
…zure-sdk-for-java into PerPartitionAutomaticFailover
…rPartitionAutomaticFailover

# Conflicts:
#	sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/GlobalPartitionEndpointManagerForPerPartitionCircuitBreakerTests.java
#	sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/faultinjection/SessionRetryOptionsTests.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/perPartitionCircuitBreaker/GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/perPartitionCircuitBreaker/LocationSpecificHealthContextTransitionHandler.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/query/ChangeFeedFetcher.java
…rPartitionAutomaticFailover

# Conflicts:
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java
…rPartitionAutomaticFailover

# Conflicts:
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/Configs.java
jeet1995 added 16 commits March 2, 2025 21:42
…sdk-for-java into PerPartitionAutomaticFailover

# Conflicts:
#	sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/ClientRetryPolicyTest.java
#	sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/RenameCollectionAwareClientRetryPolicyTest.java
#	sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/RxGatewayStoreModelTest.java
#	sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/ThinClientStoreModelTest.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ClientRetryPolicy.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ClientSideRequestStatistics.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/routing/LocationCache.java
…rPartitionAutomaticFailover

# Conflicts:
#	sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/ClientRetryPolicyTest.java
#	sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/RenameCollectionAwareClientRetryPolicyTest.java
#	sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/RxGatewayStoreModelTest.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ClientRetryPolicy.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/ClientSideRequestStatistics.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/routing/LocationCache.java
…zure-sdk-for-java into IntegratePPAFWithCrossRegionAvailabilityFunctionality

# Conflicts:
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/perPartitionCircuitBreaker/GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/routing/LocationCache.java
…zure-sdk-for-java into IntegratePPAFWithCrossRegionAvailabilityFunctionality
…zure-sdk-for-java into IntegratePPAFWithCrossRegionAvailabilityFunctionality

jeet1995 commented Mar 5, 2025

/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 added 3 commits March 6, 2025 21:27
…zure-sdk-for-java into IntegratePPAFWithCrossRegionAvailabilityFunctionality

# Conflicts:
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/routing/LocationCache.java

jeet1995 commented Mar 7, 2025

/azp run java - cosmos - tests


jeet1995 commented Mar 7, 2025

/azp run java - cosmos - spark

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 added 5 commits March 7, 2025 15:57
…zure-sdk-for-java into IntegratePPAFWithCrossRegionAvailabilityFunctionality
…zure-sdk-for-java into IntegratePPAFWithCrossRegionAvailabilityFunctionality