chore: AzureSQL - Introduce the estimation of number_of_batches in the unconditional masking to optimize number of masking API calls #91
Conversation
PR URL: https://www.github.com/delphix/dcs-for-azure-templates/pull/91
Fix it, ship it!
dcsazure_AzureSQL_to_AzureSQL_mask_pl/dcsazure_AzureSQL_to_AzureSQL_mask_pl.json
```diff
   " DF_COLUMNS_TO_CAST_BACK_TO_LONG as string[] ([\"\"]),",
-  " DF_COLUMNS_TO_CAST_BACK_TO_TIMESTAMP as string[] ([\"\"])",
+  " DF_COLUMNS_TO_CAST_BACK_TO_TIMESTAMP as string[] ([\"\"]),",
+  " DF_NUMBER_OF_BATCHES as integer (100)",
```
issue (blocking): this is now a dataflow parameter, so it needs to be added to the `input_parameters` part of the "Update Masked State No Filter*" stored procedure calls in the pipeline so that it is persisted to the event store.
This parameter already comes from the previous activity, 'Lookup Masking Parameters', with the value `@activity('Lookup Masking Parameters').output.firstRow.NumberOfBatches`.
That's why I added it as a Data Flow parameter instead of including it in the `input_parameters` section.
Do you mean it should be added as an input parameter and then overridden within the Data Flow?
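For illustration, the wiring discussed above might look like the fragment below inside the dataflow activity's pipeline JSON. Only the `@activity(...)` expression is taken from this thread; the surrounding property placement is an assumption, not a copy of the actual pipeline definition.

```json
"DF_NUMBER_OF_BATCHES": {
    "value": "@activity('Lookup Masking Parameters').output.firstRow.NumberOfBatches",
    "type": "Expression"
}
```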
- > whereas the above formula automatically find the perfect number of batches the table will be divided into

  Is it really perfect? Unless you can prove beyond reasonable doubt that this formula is perfect, I would avoid using that word. Perhaps you can use the word 'optimal'?
- I don't see a screenshot in the "Testing Done" section which shows the number of batches is 250.
- So now the unfiltered masking dataflow does not depend on the `TARGET_BATCH_SIZE` variable, correct? If so, we should mention this in the `README.md` of the AzureSQL masking pipeline.
I've modified the solution statement in GitHub and added the correct screenshot. I'll also update the README shortly.
Background
For unconditional masking, we were calculating the number of batches in the lookup task via the stored procedures, but we were not using it anywhere. With this change, the unconditional masking dataflow uses the number of batches calculated by the lookup task.

The `NumberOfBatches` parameter is evaluated dynamically in the `generate_masking_parameters` stored procedure based on the `column_width_estimate`, using the expression below.

Problem
We need to re-introduce the use of the `DF_NUMBER_OF_BATCHES` variable in the unfiltered masking dataflow instead of using the customer-estimated value of the `DF_NUMBER_OF_ROWS_PER_BATCH` variable. Asking the customer to find a value for `TARGET_BATCH_SIZE` that optimally fits the API request into 2 MB requires a lot of trial and error, whereas the formula above automatically finds the optimal number of batches the table will be divided into.

Solution
Updated the unfiltered masking dataflow to use the `DF_NUMBER_OF_BATCHES` variable instead of `DF_NUMBER_OF_ROWS_PER_BATCH` for row aggregation when sending batches to the masking API.
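The kind of batch-count estimation the pipeline relies on can be sketched as below. This is an illustrative approximation only: the actual expression lives in the `generate_masking_parameters` stored procedure and is not reproduced in this PR description, and the function name, the 2 MB request limit interpretation, and the example row width are all assumptions.

```python
import math

def estimate_number_of_batches(row_count: int,
                               column_width_estimate_bytes: int,
                               max_request_bytes: int = 2 * 1024 * 1024) -> int:
    """Hypothetical sketch: derive a batch count from an estimated row width
    so that each masking API request stays within the ~2 MB request limit."""
    # How many rows of the estimated width fit in one request (at least 1).
    rows_per_batch = max(1, max_request_bytes // max(1, column_width_estimate_bytes))
    # Number of batches needed to cover the whole table.
    return math.ceil(row_count / rows_per_batch)
```

For example, a 1.2-million-row table with an estimated row width of roughly 436 bytes would be split into 250 batches, matching the batch count observed in testing.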
Testing Done
I have created a products table containing 1.2 million records. I also verified that the table is being processed in 250 batches, as expected.
Pipeline link: https://adf.azure.com/en/monitoring/pipelineruns/d402a268-dc30-4286-8f1e-27e71041e003?factory=%2Fsubscriptions%2F247fb129-0717-412e-b3ce-28407e52e28b%2FresourceGroups%2Fpurusottam_rcg%2Fproviders%2FMicrosoft.DataFactory%2Ffactories%2FAzureSQL-PM-ADF