Contributor

@purusottamm purusottamm commented Oct 9, 2025

Background

For unconditional masking, the lookup task was already calculating the number of batches through the stored procedure, but that value was not used anywhere. This change uses the number of batches calculated by the lookup task in unconditional masking.

The NumberOfBatches parameter is evaluated dynamically in the generate_masking_parameters stored procedure, based on column_width_estimate, using the following expression:

ceil(((max(row_count) * (sum(column_width_estimate) + log10(max(row_count)) + 1)) / (2000000 * .9)))
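
To make the arithmetic easier to check by hand, here is a minimal Python sketch of the same expression. The function and constant names are hypothetical and not part of the stored procedure. Reading 2000000 as the 2MB API request limit and .9 as a headroom factor is an assumption based on the Problem section below.

```python
import math

# Hypothetical names; the values are the literals from the expression above.
REQUEST_SIZE_BYTES = 2_000_000  # assumed to be the 2MB API request limit
HEADROOM_FACTOR = 0.9           # assumed safety margin below that limit


def number_of_batches(max_row_count: int, total_column_width: float) -> int:
    """Mirror ceil(max(row_count) * (sum(width) + log10(max(row_count)) + 1) / (2000000 * .9))."""
    if max_row_count <= 0:
        # log10 is undefined for non-positive values; how the stored
        # procedure handles an empty table is not shown in this PR.
        raise ValueError("max_row_count must be positive")
    estimated_row_size = total_column_width + math.log10(max_row_count) + 1
    return math.ceil(
        max_row_count * estimated_row_size / (REQUEST_SIZE_BYTES * HEADROOM_FACTOR)
    )


# Illustrative input only: 1,000,000 rows with a summed width estimate of 100.
print(number_of_batches(1_000_000, 100))  # prints 60
```

With these illustrative inputs the estimated row size is 100 + 6 + 1 = 107, so 107,000,000 / 1,800,000 ≈ 59.4, which rounds up to 60 batches.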

Problem

We need to reintroduce the DF_NUMBER_OF_BATCHES variable in the unfiltered masking dataflow instead of relying on the customer-estimated value of the DF_NUMBER_OF_ROWS_PER_BATCH variable. Asking the customer to find a TARGET_BATCH_SIZE value that fits each API request within 2MB takes a lot of trial and error, whereas the formula above automatically finds the optimal number of batches to divide the table into.

Solution

Updated the unfiltered masking dataflow to use the DF_NUMBER_OF_BATCHES variable instead of DF_NUMBER_OF_ROWS_PER_BATCH when aggregating rows into the batches sent to the masking API.

[Screenshots: 2025-10-09 at 10:50:39 AM, 10:55:31 AM, 10:51:29 AM, 10:55:17 AM]

Testing Done

I created a products table containing 1.2 million records and verified that the table is processed in 250 batches, as expected.
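
As a quick sanity check on the figures above, the following sketch computes the batch size implied by 1.2 million rows and 250 batches. It assumes rows are spread as evenly as possible across batches. The dataflow's actual distribution logic is not shown in this PR, and the function name is hypothetical.

```python
def rows_per_batch(row_count: int, number_of_batches: int) -> int:
    """Largest batch size when rows are split evenly (an assumption, see above)."""
    if row_count < 0 or number_of_batches <= 0:
        raise ValueError("row_count must be >= 0 and number_of_batches > 0")
    # Integer ceiling division avoids floating-point rounding.
    return (row_count + number_of_batches - 1) // number_of_batches


print(rows_per_batch(1_200_000, 250))  # prints 4800
```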

Pipeline link: https://adf.azure.com/en/monitoring/pipelineruns/d402a268-dc30-4286-8f1e-27e71041e003?factory=%2Fsubscriptions%2F247fb129-0717-412e-b3ce-28407e52e28b%2FresourceGroups%2Fpurusottam_rcg%2Fproviders%2FMicrosoft.DataFactory%2Ffactories%2FAzureSQL-PM-ADF

[Screenshots: 2025-10-09 at 1:42:19 PM, 1:41:52 PM, 8:35:57 PM]

@purusottamm purusottamm force-pushed the dlpx/pr/purusottamm/ea1e10c5-6c46-4a13-9365-3b6f54a82f10 branch from e77f411 to 51dc497 Compare October 9, 2025 05:04
@purusottamm purusottamm marked this pull request as ready for review October 9, 2025 08:20
@purusottamm purusottamm requested a review from a team as a code owner October 9, 2025 08:20
Contributor

@ankurs-delphix ankurs-delphix left a comment

Fix it, ship it!

  " DF_COLUMNS_TO_CAST_BACK_TO_LONG as string[] ([\"\"]),",
- " DF_COLUMNS_TO_CAST_BACK_TO_TIMESTAMP as string[] ([\"\"])",
+ " DF_COLUMNS_TO_CAST_BACK_TO_TIMESTAMP as string[] ([\"\"]),",
+ " DF_NUMBER_OF_BATCHES as integer (100)",
Contributor

issue (blocking): this is now a dataflow parameter, so it needs to be added to the input_parameters part of the "Update Masked State No Filter*" stored procedure calls in the pipeline so that it is persisted to the event store.

Contributor Author

This parameter is already coming from the previous activity named ‘Lookup Masking Parameters’, with the value @activity('Lookup Masking Parameters').output.firstRow.NumberOfBatches.
That’s why I added it as a Data Flow parameter instead of including it in the input_parameters section.

Do you mean it should be added as an input parameter and then overridden within the Data Flow?

Contributor

@sumeetdas-dlpx sumeetdas-dlpx left a comment

  • whereas the above formula automatically find the perfect number of batches the table will be divided into

    Is it really perfect? Unless you can prove beyond reasonable doubt that this formula is perfect, I would avoid using that word. Perhaps you can use the word 'optimal'?

  • I don't see a screenshot in the Testing Done section which shows the number of batches is 250.

  • So now the unfiltered masking dataflow does not depend on the TARGET_BATCH_SIZE variable, correct? If so, we should mention this in the README.md of the AzureSQL masking pipeline.

@purusottamm
Contributor Author

I’ve modified the solution statement in GitHub and added the correct screenshot. I’ll also update the README shortly.
