Objective

  • This pipeline takes the JSON data in FHIR standard format from our "raw" ADLS container and converts it to Parquet. Because Parquet is a compressed, columnar file format, the data becomes much faster to import and work with. We store the Parquet output in our "processed" ADLS container under a folder called "Patient".

  • We plan to eventually load this data into the Dedicated SQL Pool across two tables representing Patient Addresses and Patient Identifiers. We need to extract the data required for each table, clean it, and write it back to ADLS. The second activity in our pipeline handles all of this inside a Data Flow activity. This could have been done in a Spark notebook like the previous two activities, but doing it here lets you compare the two methods (a minimal PySpark sketch of both steps follows this list).

  • Now that the data is prepared and cleaned, we are ready to load it into our Dedicated Pool, but we need to create the tables first. We have a Script activity that runs against our Dedicated Pool to create these artifacts for us.

    Note: Make sure your Dedicated Pool is running prior to executing this pipeline. You can see this in the SQL Pools tab under the Manage Hub.


  • We are now all set up, with the data ready to go and tables to load it into, and we'll use a Copy activity to perform the load.
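For comparison with the Data Flow approach, here is a minimal PySpark sketch of what the first two steps above do conceptually. It is not the pipeline's actual notebook or Data Flow logic: the container paths follow the description above, but the `<StorageName>` placeholder, the output folder names, and the specific FHIR Patient fields selected are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

storage = "<StorageName>"  # pipeline parameter: your workspace ADLS account
raw_path = f"abfss://raw@{storage}.dfs.core.windows.net/Patient/"
processed_path = f"abfss://processed@{storage}.dfs.core.windows.net/Patient/"

# 1) Convert raw FHIR Patient JSON to compressed, columnar Parquet.
spark.read.json(raw_path).write.mode("overwrite").parquet(processed_path)

# 2) Data Flow equivalent: FHIR Patient resources carry repeating "address"
#    and "identifier" arrays, so explode() un-nests them into one row per
#    address and one row per identifier, keyed by the patient id.
patients = spark.read.parquet(processed_path)

addresses = (
    patients
    .select(col("id").alias("PatientId"), explode("address").alias("addr"))
    .select("PatientId", "addr.line", "addr.city", "addr.state", "addr.postalCode")
)

identifiers = (
    patients
    .select(col("id").alias("PatientId"), explode("identifier").alias("ident"))
    .select("PatientId", "ident.system", "ident.value")
)

# Write the cleaned outputs back to the "processed" container (folder names
# here are assumptions) so the later Copy activity can load them.
addresses.write.mode("overwrite").parquet(processed_path + "PatientAddress/")
identifiers.write.mode("overwrite").parquet(processed_path + "PatientIdentifier/")
```

Because Parquet stores data by column and compresses it, both the Data Flow and the later Copy activity into the Dedicated SQL Pool read far less data than they would from the raw JSON.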

STEP 1: Parameter Setup

Prior to running the Patient pipeline (FHIR_Pipeline4Patient_DataFlow_OC), you will need to set the pipeline parameters to the artifact names you chose during deployment. Go to the Integrate hub, expand the Patient folder, and select the pipeline to open it.


Once the pipeline opens, click somewhere on the canvas (open space or background) so that none of the activities is highlighted or selected. Then select the Parameters tab in the bottom pane to view the pipeline-level parameters.


Change the default value for each of the following five parameters to match what you chose during deployment (a programmatic alternative is sketched after the list):

  • StorageName - This is the name of your Synapse workspace ADLS account
  • DatabaseName - This is the name of your database in Synapse Dedicated SQL Pool
  • ServerName - This is the name of your Synapse Dedicated SQL Pool
  • SparkPoolName - This is the name of your Synapse Spark Pool
  • DatasetSize - This is either "1tb" or "30tb" depending on which size dataset you want to use
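
This walkthrough sets the values through the Studio UI, but the same parameter overrides can also be supplied when triggering the pipeline programmatically. A minimal sketch, assuming the `azure-synapse-artifacts` and `azure-identity` packages; all of the values below are placeholders you would replace with your deployment's names:

```python
from azure.identity import DefaultAzureCredential
from azure.synapse.artifacts import ArtifactsClient

# Connect to the Synapse workspace development endpoint (placeholder name).
client = ArtifactsClient(
    credential=DefaultAzureCredential(),
    endpoint="https://<your-workspace>.dev.azuresynapse.net",
)

# Start the Patient pipeline, overriding the five pipeline-level parameters.
run = client.pipeline.create_pipeline_run(
    "FHIR_Pipeline4Patient_DataFlow_OC",
    parameters={
        "StorageName": "<your ADLS account>",
        "DatabaseName": "<your Dedicated SQL Pool database>",
        "ServerName": "<your Dedicated SQL Pool>",
        "SparkPoolName": "<your Spark pool>",
        "DatasetSize": "1tb",  # or "30tb"
    },
)
print("Started pipeline run:", run.run_id)
```

Either way, the values must match the artifact names from your deployment, or the pipeline's activities will fail to connect.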

STEP 2: Execute Pipeline

  • Since this pipeline contains a data flow, we'll kick it off a bit differently from the previous exercises. Enable "Data Flow Debug", hit the drop-down arrow next to Debug, and select the last option, "Use Activity Runtime" (a sketch for polling the run status programmatically follows below).

    Note: Make sure your Dedicated Pool is running prior to executing this pipeline. You can see this in the SQL Pools tab under the Manage Hub.

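If you started the run programmatically instead (see the sketch in Step 1), you can poll its status rather than watching the Monitor hub. A minimal sketch, reusing the `client` and `run` objects from that example (names are assumptions):

```python
import time

# Poll until the pipeline run reaches a terminal state.
while True:
    status = client.pipeline_run.get_pipeline_run(run.run_id).status
    print("Pipeline status:", status)
    if status in ("Succeeded", "Failed", "Cancelled"):
        break
    time.sleep(60)
```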

Congratulations on completing Exercise 03.