chore: AzureSQL - Introduce the estimation of number_of_batches in the unconditional masking to optimize number of masking API calls #91
Conversation
PR URL: https://www.github.com/delphix/dcs-for-azure-templates/pull/91
Fix it, ship it!
dcsazure_AzureSQL_to_AzureSQL_mask_pl/dcsazure_AzureSQL_to_AzureSQL_mask_pl.json
```diff
   " DF_COLUMNS_TO_CAST_BACK_TO_LONG as string[] ([\"\"]),",
-  " DF_COLUMNS_TO_CAST_BACK_TO_TIMESTAMP as string[] ([\"\"])",
+  " DF_COLUMNS_TO_CAST_BACK_TO_TIMESTAMP as string[] ([\"\"]),",
+  " DF_NUMBER_OF_BATCHES as integer (100)",
```
issue (blocking): this is now a dataflow parameter, so it needs to be added to the `input_parameters` part of the "Update Masked State No Filter*" stored procedure calls in the pipeline so that it is persisted to the event store.
This parameter already comes from the previous activity, 'Lookup Masking Parameters', with the value `@activity('Lookup Masking Parameters').output.firstRow.NumberOfBatches`.
That's why I added it as a Data Flow parameter instead of including it in the `input_parameters` section.
Do you mean it should be added as an input parameter and then overridden within the Data Flow?
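For illustration, the wiring discussed above might look like the fragment below inside the dataflow activity's pipeline JSON. Only the `@activity(...)` expression is taken from this thread; the surrounding property placement is an assumption, not a copy of the actual pipeline definition.

```json
"DF_NUMBER_OF_BATCHES": {
    "value": "@activity('Lookup Masking Parameters').output.firstRow.NumberOfBatches",
    "type": "Expression"
}
```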
- > whereas the above formula automatically find the perfect number of batches the table will be divided into

  Is it really perfect? Unless you can prove beyond reasonable doubt that this formula is perfect, I would avoid using that word. Perhaps you can use the word 'optimal'?
- I don't see a screenshot in the "Testing Done" section which shows the number of batches is 250.
- So now the unfiltered masking dataflow does not depend on the `TARGET_BATCH_SIZE` variable, correct? If so, we should mention this in the `README.md` of the AzureSQL masking pipeline.
I've modified the solution statement in GitHub and added the correct screenshot. I'll also update the README shortly.
Background
For unconditional masking, we were calculating the number of batches in the lookup task via the stored procedures, but we were not using it anywhere. With this change, the unconditional masking dataflow uses the number of batches calculated by the lookup task.

The `NumberOfBatches` parameter is evaluated dynamically in the `generate_masking_parameters` stored procedure based on the `column_width_estimate`, using the expression below.

Problem
We need to re-introduce the use of the `DF_NUMBER_OF_BATCHES` variable in the unfiltered masking dataflow instead of using the customer-estimated value of the `DF_NUMBER_OF_ROWS_PER_BATCH` variable. Asking the customer to find a value for `TARGET_BATCH_SIZE` that optimally fits the API request into 2 MB requires a lot of trial and error, whereas the formula above automatically finds the optimal number of batches the table will be divided into.

Solution
Updated the unfiltered masking dataflow to use the `DF_NUMBER_OF_BATCHES` variable instead of `DF_NUMBER_OF_ROWS_PER_BATCH` for row aggregation when sending batches to the masking API.
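The kind of batch-count estimation the pipeline relies on can be sketched as below. This is an illustrative approximation only: the actual expression lives in the `generate_masking_parameters` stored procedure and is not reproduced in this PR description, and the function name, the 2 MB request limit interpretation, and the example row width are all assumptions.

```python
import math

def estimate_number_of_batches(row_count: int,
                               column_width_estimate_bytes: int,
                               max_request_bytes: int = 2 * 1024 * 1024) -> int:
    """Hypothetical sketch: derive a batch count from an estimated row width
    so that each masking API request stays within the ~2 MB request limit."""
    # How many rows of the estimated width fit in one request (at least 1).
    rows_per_batch = max(1, max_request_bytes // max(1, column_width_estimate_bytes))
    # Number of batches needed to cover the whole table.
    return math.ceil(row_count / rows_per_batch)
```

For example, a 1.2-million-row table with an estimated row width of roughly 436 bytes would be split into 250 batches, matching the batch count observed in testing.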
Testing Done
I have created a products table containing 1.2 million records. I also verified that the table is being processed in 250 batches, as expected.
Pipeline link: https://adf.azure.com/en/monitoring/pipelineruns/d402a268-dc30-4286-8f1e-27e71041e003?factory=%2Fsubscriptions%2F247fb129-0717-412e-b3ce-28407e52e28b%2FresourceGroups%2Fpurusottam_rcg%2Fproviders%2FMicrosoft.DataFactory%2Ffactories%2FAzureSQL-PM-ADF