Linked Services, Datasets and Integration Runtimes
Task 1: Create Azure Data Factory Integration Runtime
Task 2: Create a new ADF Key Vault Linked Service
Task 3: Add Blob Storage credentials to AKV
Task 4: Add other credentials to Azure Key Vault
Task 5: Create SFTP Linked Service
Task 6: Create datasets for WWI SFTP data
Task 7: Create Azure Blob Storage Linked Service
Task 8: Create Blob Storage Datasets for WWI Input data
Task 9: Create Blob Storage Datasets for SmartFoods Input data
Task 10: Create Blob Storage Datasets for WWI Data Warehouse output
Task 11: Create a SQL Database Linked Service and Dataset
Task 12: Create an HTTP Linked Service
Task 13: Create a CSV Dataset on the HTTP Linked Service
Task 14: Create a JSON Dataset on the HTTP Linked Service
Duration: 90 minutes
In this exercise, attendees will create multiple Azure Data Factory pipelines and related components, including ADF Integration Runtimes, Linked Services, and Datasets.
Azure Integration Runtime
The Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory. Data Factory offers three types of Integration Runtime, and you should choose the type that best serves the data integration capabilities and network environment you need. The three types are:
- Azure: runs in the cloud in a serverless fashion.
  - Auto Resolve: Azure picks the region closest to the source/target dataset to run the activity.
  - Fixed Region: the developer sets up the runtime in one of the Azure regions, and the IR is always deployed in that region.
- Self-hosted: the developer installs the runtime on on-premises infrastructure (Microsoft Windows only) or on an Azure VM. A Self-Hosted IR (SHIR) is most useful for accessing resources within private networks (either in a cloud VNet or behind an on-premises firewall) or for accessing "File"-type sources on a file system.
- Azure-SSIS: a special IR used for running SSIS packages in the cloud within ADF.
In this task we create a “Fixed Region Azure IR”.
- Open the Azure Data Factory resource in the Azure Portal.
- Click on Author and Monitor; this takes you to the ADF UI shown below. This is the main ADF UI. On the left-hand side there are Author and Monitor tabs; click on the Author tab to start building the first pipeline.
- Create a new Integration Runtime in a specific Azure region: go to the Author tab, then Connections (at the bottom of the screen), and click New.
- Select "Perform data movement and dispatch activities to external computes".
- Choose "Azure Public, Self-Hosted".
- Give your new IR a name and choose the preferred region.
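For reference, the ADF UI stores the new runtime as a JSON definition (visible in the Code view). A fixed-region Azure IR looks roughly like the sketch below; the name and region are placeholders for the values you chose, so this is an approximation rather than an exact copy of what the UI generates:

    {
        "name": "<your IR name>",
        "properties": {
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {
                    "location": "West US 2"
                }
            }
        }
    }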
This lab uses various services (Azure & non-Azure) referenced via Linked Services. In order to use those services, access needs to be granted to them.
The approach demonstrated in further steps is as follows:
- All credentials are stored in Azure Key Vault (AKV)
- Azure Data Factory accesses AKV via the data factory's Managed Identity, to reference the above credentials
As ADF is a trusted Azure service (link), a data factory's system-assigned Managed Identity can be used to grant access to other Azure services such as Azure Storage, Azure SQL Database, and Azure Key Vault. The documentation for those services shows how Managed Identities can be used to access them.
Process summary: for ADF to access different services it requires credentials; the recommended approach is to store these secrets in Azure Key Vault and grant ADF permission to retrieve them at runtime.
To add an Azure Key Vault Linked Service, besides creating it in ADF you also need to authorize ADF to access the secrets in the Key Vault. So, after creating the linked service there are some further steps (Step E onward) to be completed in the Azure Portal.
- Create a new Linked Service: go to the Author tab, then Connections (at the bottom of the screen), and click New.
- Search for Key Vault and select Azure Key Vault.
- Give your linked service a name, then select your subscription and the Key Vault created in previous steps.
- Copy the managed identity object ID and click on the "Edit Key Vault" link.
- On the Key Vault page (in the Azure Portal) select Access policies.
- From the Secret permissions drop-down select "Get" and "List".
- Click on Select principal.
- Paste the managed identity object ID, which will then show your ADF, and select it.
- Finally click Add and you should see something like below.
- If all looks good, click "Save".
Note: Leave the browser tab with Azure Key Vault open as we will need it soon.
- Go back to ADF where we left off and click on "Test Connection"; if the connection was successful, click "Create" to create the linked service.
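Behind the UI, the resulting Key Vault linked service is stored as JSON similar to the sketch below (open the linked service's Code view to compare). The name and vault URL are placeholders for the values you used; note that no secret is stored here, only the vault's base URL, because ADF authenticates with its managed identity:

    {
        "name": "<your key vault linked service name>",
        "properties": {
            "type": "AzureKeyVault",
            "typeProperties": {
                "baseUrl": "https://<your key vault name>.vault.azure.net/"
            }
        }
    }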
As mentioned before, Azure Key Vault (AKV) is used to store all credentials for the services that ADF will connect to. This has multiple advantages:
- Security: sensitive information is kept in a credential store that only the ADF service or administrators can read from.
- If credentials need to be rotated, the ADF Linked Service does not need to be modified.
- When the ADF pipeline is migrated from Dev to Test to Production, no change is necessary.
- From the Azure Portal, go to your Storage account.
- Copy the connection string.
- Go back to the browser tab where you have AKV open.
- Click on Secrets.
- Click Generate/Import.
- For the name, provide <storage account name>connectionstring.
- For the value, paste the Storage account connection string.
- Click "Create".
Follow the same steps as in the previous task and add the following credentials to your Azure Key Vault:
Name | Value |
---|---|
SmartFoodsRestAPIUsername | adflab |
SmartFoodsRestAPIPassword | Password.1! |
WWISftpPassword | adflab |
As mentioned in the solution architecture section, the WWI input data is extracted in Parquet format from the OLTP RDBMS and stored on an SFTP server. So, in this step we create a Linked Service to the SFTP server.
- Click New Linked Service.
- Select SFTP.
- Connect via: <NAME OF YOUR IR>
- For the name, enter WWISftp.
- Host: adflabsftp.westus2.cloudapp.azure.com
- Port: 22
- Disable SSH host key validation.
- Authentication type: Basic
- User name: sftpuser
- For the password, select Azure Key Vault.
- Secret name: WWISftpPassword
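The resulting linked service definition should look roughly like the JSON sketch below; the Key Vault linked service name is a placeholder for the one you created earlier, and the password is resolved from the WWISftpPassword secret at runtime rather than being stored in ADF:

    {
        "name": "WWISftp",
        "properties": {
            "type": "Sftp",
            "connectVia": {
                "referenceName": "<NAME OF YOUR IR>",
                "type": "IntegrationRuntimeReference"
            },
            "typeProperties": {
                "host": "adflabsftp.westus2.cloudapp.azure.com",
                "port": 22,
                "skipHostKeyValidation": true,
                "authenticationType": "Basic",
                "userName": "sftpuser",
                "password": {
                    "type": "AzureKeyVaultSecret",
                    "store": {
                        "referenceName": "<your key vault linked service name>",
                        "type": "LinkedServiceReference"
                    },
                    "secretName": "WWISftpPassword"
                }
            }
        }
    }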
In this step we create a parametrized dataset on the SFTP Linked Service to access the Parquet files on the SFTP server.
- Click the plus sign at the top left of ADF and select Dataset.
- Select SFTP.
- Select Parquet for the format.
- From the drop-down, select the SFTP linked service created in the previous task.
- For the name, provide "WWISftpParquet".
- Leave "Directory" and "File" blank (we are going to parametrize these).
- Change "Import schema" to "None".
- Click OK.
- Go to the Parameters tab of your dataset and create three parameters as in the screenshot above.
- Go to the Connection tab and select the Directory box; once selected, click on "Add dynamic content" or hit Alt+P.
- In the expression editor, your list of parameters is shown at the bottom; select folder, inspect the contents of the expression editor box, and then click "Finish".
- Repeat the same steps for the file name, but instead enter the following in the expression editor:
@{dataset().filename}.@{dataset().filetype}
Hint: The above expression concatenates the two parameters with a '.' in between to make a full file name.
Hint 2: We could also write the same expression as: @concat(dataset().filename,'.',dataset().filetype)
- Click on Preview data and fill in the parameters as:
  - Folder: WorldWideImporters/orderlines
  - Filename: orderlines_2019-09-02
  - Filetype: parquet
- If the dataset and parameters are created correctly, you should see something like below:
This way our dataset can be re-used in different pipelines, or within the same pipeline, to access different files.
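For reference, the finished dataset definition (Code view) should resemble the sketch below. The three parameters are declared under "parameters", and the folder path and file name are expressions that read them at runtime; this is an approximation of what the UI generates, not a definitive copy:

    {
        "name": "WWISftpParquet",
        "properties": {
            "linkedServiceName": {
                "referenceName": "WWISftp",
                "type": "LinkedServiceReference"
            },
            "parameters": {
                "folder": { "type": "String" },
                "filename": { "type": "String" },
                "filetype": { "type": "String" }
            },
            "type": "Parquet",
            "typeProperties": {
                "location": {
                    "type": "SftpLocation",
                    "folderPath": {
                        "value": "@dataset().folder",
                        "type": "Expression"
                    },
                    "fileName": {
                        "value": "@{dataset().filename}.@{dataset().filetype}",
                        "type": "Expression"
                    }
                }
            },
            "schema": []
        }
    }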
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. A dataset is a named view of data that simply points to or references the data you want to use in your activities as inputs and outputs.
Before you create a dataset, you must create a linked service to link your data store to the data factory. Linked services are much like connection strings, which define the connection information needed for Data Factory to connect to external resources.
Now in this task you will create a Linked Service to Azure Blob Storage.
- Click New in the Linked Service window again.
- Select Azure Blob Storage.
- For the name, call it <your storage account name>john.
- Select your IR.
- Change the authentication mechanism to Key Vault.
- Select the AKV linked service you created in the previous task.
- For the Secret name, provide <storage account name>connectionstring.
- Test the connection and click "Create".
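The resulting Blob Storage linked service JSON should look roughly like the sketch below; the connection string is not stored in ADF but is pulled from the <storage account name>connectionstring secret at runtime (the linked service and IR names are placeholders for yours):

    {
        "name": "<your storage account name>john",
        "properties": {
            "type": "AzureBlobStorage",
            "connectVia": {
                "referenceName": "<NAME OF YOUR IR>",
                "type": "IntegrationRuntimeReference"
            },
            "typeProperties": {
                "connectionString": {
                    "type": "AzureKeyVaultSecret",
                    "store": {
                        "referenceName": "<your key vault linked service name>",
                        "type": "LinkedServiceReference"
                    },
                    "secretName": "<storage account name>connectionstring"
                }
            }
        }
    }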
All the input data for WWI is sent in Parquet format from the source, which makes our life easier as we only need a single parametrized Parquet dataset for all the different data domains. The dataset will be created on the Azure Blob linked service created in the previous task.
The datasets are parametrized so the same dataset can be used for writing and reading different files and data domains (i.e. customers or orders on any date). The only part of the dataset that gets hardcoded is the container; the "folder path" and "file name" remain blank to be parametrized.
- Click the plus sign at the top left of ADF and select Dataset.
- Select Azure Blob Storage.
- Select Parquet for the format.
- From the drop-down, select the Azure Blob Storage linked service created in the previous task.
- For the name, provide "WWIStagingBlobParquet".
- Click "Browse" and select the wwistaging container (created in previous steps).
- Leave "Directory" and "File" blank (we are going to parametrize these).
- Change "Import schema" to "None".
- Click OK.
- Parametrize the dataset like the SFTP dataset, with the following parameters (see the example after this list):
  - folder
  - filename
  - filetype
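As an example, the Code view of this dataset should resemble the sketch below. Compared with the SFTP dataset, the location type is AzureBlobStorageLocation and the wwistaging container is hardcoded, while the folder and file name remain parametrized (the linked service name is a placeholder for yours):

    {
        "name": "WWIStagingBlobParquet",
        "properties": {
            "linkedServiceName": {
                "referenceName": "<your blob storage linked service name>",
                "type": "LinkedServiceReference"
            },
            "parameters": {
                "folder": { "type": "String" },
                "filename": { "type": "String" },
                "filetype": { "type": "String" }
            },
            "type": "Parquet",
            "typeProperties": {
                "location": {
                    "type": "AzureBlobStorageLocation",
                    "container": "wwistaging",
                    "folderPath": { "value": "@dataset().folder", "type": "Expression" },
                    "fileName": { "value": "@{dataset().filename}.@{dataset().filetype}", "type": "Expression" }
                }
            },
            "schema": []
        }
    }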
SmartFoods data is available via an API. Customer data is in JSON format and transactions are in CSV format, but we are using Azure Data Factory's Copy activity to convert the customer data to CSV so that we have the same file format for both feeds from the SmartFoods API source.
- Click the plus sign at the top left of ADF and select Dataset.
- Select Azure Blob Storage.
- Select "Delimited Text" for the format.
- From the drop-down, select the Azure Blob Storage linked service created in the previous task.
- For the name, provide "SmartFoodsDelimitedTextBlob".
- Click "Browse" and select the smartfoodsstaging container (created in previous steps).
- Leave "Directory" and "File" blank (we are going to parametrize these).
- Change "Import schema" to "None".
- Click OK.
- Parametrize the dataset like the SFTP dataset, with the following parameters (see the example after this list):
  - folder
  - filename
  - filetype
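The JSON for this dataset follows the same pattern as the Parquet datasets, with type DelimitedText instead of Parquet. The column delimiter and header settings below are assumptions for a typical CSV feed; match them to the screenshots and the actual source data:

    {
        "name": "SmartFoodsDelimitedTextBlob",
        "properties": {
            "linkedServiceName": {
                "referenceName": "<your blob storage linked service name>",
                "type": "LinkedServiceReference"
            },
            "parameters": {
                "folder": { "type": "String" },
                "filename": { "type": "String" },
                "filetype": { "type": "String" }
            },
            "type": "DelimitedText",
            "typeProperties": {
                "location": {
                    "type": "AzureBlobStorageLocation",
                    "container": "smartfoodsstaging",
                    "folderPath": { "value": "@dataset().folder", "type": "Expression" },
                    "fileName": { "value": "@{dataset().filename}.@{dataset().filetype}", "type": "Expression" }
                },
                "columnDelimiter": ",",
                "firstRowAsHeader": true
            },
            "schema": []
        }
    }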
We are planning to use the Parquet file type for storing all output DW files, and as such a single Parquet dataset is enough for the output.
Note: A question here would be: why can't we use the Parquet dataset created in a previous step for the WWI input data? The answer is: technically we can, but reusing the same objects in a solution comes with an extra complexity cost.
Better Practices Note: How dynamic should the solution be?1
It can be oh-so-tempting to want to build one solution to rule them all. (Especially if you love tech and problem-solving, like me. It's fun figuring things out!) But be mindful of how much time you spend on the solution itself. If you start spending more time figuring out how to make your solution work for all sources and all edge-cases, or if you start getting lost in your own framework… stop. Your solution should be dynamic enough that you save time on development and maintenance, but not so dynamic that it becomes difficult to understand.
…don’t try to make a solution that is generic enough to solve everything :)
Your goal is to deliver business value. If you end up looking like this cat, spinning your wheels and working hard (and maybe having lots of fun) but without getting anywhere, you are probably over-engineering your solution.
Alright, now that we've got the warnings out of the way… Let's start by looking at parameters :)
1: Reference: https://www.cathrinewilhelmsen.net/2019/12/20/parameters-azure-data-factory/
- Click the plus sign at the top left of ADF and select Dataset.
- Select Azure Blob Storage.
- Select Parquet for the format.
- From the drop-down, select the Azure Blob Storage linked service created in the previous task.
- For the name, provide "WWIDataWarehouseBlobParquet".
- Click "Browse" and select the wwidatawarehouse container (created in previous steps).
- Leave "Directory" and "File" blank (we are going to parametrize these).
- Change "Import schema" to "None".
- Click OK.
- Parametrize the dataset like the SFTP dataset, with the following parameters:
  - folder
  - filename
  - filetype
- Create a Linked Service for the SQL Database by following a procedure similar to the one in Task 5, except for the Linked Service type select "Azure SQL Database".
Note 1: Like Blob Storage, use Azure Key Vault for the credentials.
Note 2: From the Azure Portal / SQL Database, get the connection string and store it in AKV (screenshots below).
Note 3: In the connection string, make sure you replace "{your_password}" with the SQL Database password you chose in the pre-lab setup.
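A hedged sketch of the resulting Azure SQL Database linked service JSON is shown below; the linked service name and secret name are placeholders for the ones you choose, and the connection string (including the password you substituted) is retrieved from Key Vault at runtime:

    {
        "name": "<your SQL database linked service name>",
        "properties": {
            "type": "AzureSqlDatabase",
            "typeProperties": {
                "connectionString": {
                    "type": "AzureKeyVaultSecret",
                    "store": {
                        "referenceName": "<your key vault linked service name>",
                        "type": "LinkedServiceReference"
                    },
                    "secretName": "<your SQL connection string secret name>"
                }
            }
        }
    }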
Create a dataset similar to the Blob Storage datasets – use the screenshots as a guide.
For the HTTP Linked Service, use the following settings:
- Name: SmartFoodsApiLinkedService
- Base URL: https://smartfoods.azurewebsites.net/api/
Note: Use the screenshots as a guide for the other options.
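The HTTP linked service definition should look roughly like the sketch below. Basic authentication using the SmartFoods API credentials stored in Key Vault in Task 4 is an assumption here; follow the screenshots for the exact options used in the lab:

    {
        "name": "SmartFoodsApiLinkedService",
        "properties": {
            "type": "HttpServer",
            "typeProperties": {
                "url": "https://smartfoods.azurewebsites.net/api/",
                "enableServerCertificateValidation": true,
                "authenticationType": "Basic",
                "userName": "adflab",
                "password": {
                    "type": "AzureKeyVaultSecret",
                    "store": {
                        "referenceName": "<your key vault linked service name>",
                        "type": "LinkedServiceReference"
                    },
                    "secretName": "SmartFoodsRestAPIPassword"
                }
            }
        }
    }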
ADF allows us to create datasets of various formats on top of an HTTP Linked Service to receive data over an HTTP connection from another source, in formats such as delimited text.
- Start creating a dataset.
- Select HTTP as the type.
- Select Delimited Text as the format.
- Name: SmartFoodsTransactionApiCsv
- Leave the other options blank as they are and click "OK".
- Create a parameter called "authCode" under the dataset.
Note: We are going to use this parameter to pass the authentication token to the API service when calling it.
- Under the Connection tab, for "Relative URL" go to the dynamic content editor and provide:
smartfoods?code=@{dataset().authCode}
Here we are creating an HTTP dataset whose query parameters are parametrized, so we can change them at runtime and retrieve different portions of data from the API.
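Put together, the dataset JSON should resemble the sketch below: the authCode parameter is declared on the dataset and injected into the relative URL expression at runtime (format settings are omitted here; set them per the screenshots):

    {
        "name": "SmartFoodsTransactionApiCsv",
        "properties": {
            "linkedServiceName": {
                "referenceName": "SmartFoodsApiLinkedService",
                "type": "LinkedServiceReference"
            },
            "parameters": {
                "authCode": { "type": "String" }
            },
            "type": "DelimitedText",
            "typeProperties": {
                "location": {
                    "type": "HttpServerLocation",
                    "relativeUrl": {
                        "value": "smartfoods?code=@{dataset().authCode}",
                        "type": "Expression"
                    }
                }
            },
            "schema": []
        }
    }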
The SmartFoods API provides the customer data in JSON format, so we need to create a second dataset on top of the same HTTP Linked Service, but in JSON format.
Follow the same steps as in the previous task, except select JSON for the data format. For the name, provide "SmartFoodsCustomerApiJson".
See the screenshots below for guidance: