This repository contains examples and helper functions to help you develop your custom technical lineage files.
The dependencies are collected in the requirements.txt file and can be installed using pip install -r requirements.txt.
Usage:
python3 -m tools.translate_to_batch_format <source_directory> <target_directory> [--migrate_source_code]
Where:
- <source_directory> is the existing directory with the single-file definition files that you want to convert.
- <target_directory> is the target directory for the resulting batch definition artifacts. If the target directory doesn't exist, it will be created.
- --migrate_source_code is an optional element that extracts the source code.
Usage:
python3 -m tools.ingest_csv <source_directory> <target_directory> [--collibraInstance] [--username] [--password]
Where:
- <source_directory> is the existing directory with the CSV files that you want to convert.
- <target_directory> is the target directory for the resulting batch definition artifacts. If the target directory doesn't exist, it will be created.
- --collibraInstance is the Collibra instance name. If the instance URL is https://myinstance.collibra.com, the instance name is myinstance.
- --username is the Collibra username used to make API calls.
- --password is the Collibra account password.
When collibraInstance, username, and password are provided, the UUIDs of the asset types used in the CSV files are automatically fetched from your catalog instance. When they are not provided, you need to update the function _get_default_asset_types in tools.ingest_csv.py so that it returns all the asset types used.
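To illustrate how the instance name relates to the instance URL, here is a minimal sketch. The helper instance_name_from_url is hypothetical and not part of this repository; it assumes that the instance name is always the first label of the host name, as in the example above.

```python
from urllib.parse import urlparse


def instance_name_from_url(url: str) -> str:
    """Return the first host label of a Collibra URL (hypothetical helper)."""
    host = urlparse(url).hostname
    if not host:
        raise ValueError(f"no host name found in {url!r}")
    # Assumption: the instance name is the part before the first dot.
    return host.split(".")[0]


print(instance_name_from_url("https://myinstance.collibra.com"))  # myinstance
```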
The first row in every CSV file is the header. The following rows define the lineage relationships. If we take a look at an example header row, we find in order:
System,Database,Schema,Table,Column,fullname,domain_id,System,Database,Schema,Table,Column,fullname,domain_id,source_code,highlights,transformation_display_name
- System, Database, and Schema define the asset types for the nodes of the source. You can have as many nodes as you want, but you must have at least one.
- Table is the asset type of the parent asset of the source. This header is mandatory.
- Column is the asset type of the leaf asset of the source. This header is mandatory.
- fullname and domain_id are mandatory headers in cases where a custom fullname and domain_id must be provided for the source.
- System, Database, and Schema define the asset types for the nodes of the target. You can have as many nodes as you want, but you must have at least one.
- Table is the asset type of the parent asset of the target. This header is mandatory.
- Column is the asset type of the leaf asset of the target. This header is mandatory.
- fullname and domain_id are mandatory headers in cases where a custom fullname and domain_id must be provided for the target.
- source_code, highlights, and transformation_display_name are mandatory headers.
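The header layout described above can be sketched as a small parser. This is an illustration only, not the logic used by tools.ingest_csv: the function split_header is hypothetical, and it assumes that fullname and domain_id are present on both the source and the target side, as in the example header.

```python
TRAILING = ["source_code", "highlights", "transformation_display_name"]


def split_header(header):
    """Split a header row into source and target asset-type columns.

    Assumption: both sides end with fullname,domain_id.
    """
    if header[-3:] != TRAILING:
        raise ValueError("header must end with " + ",".join(TRAILING))
    body = header[:-3]
    if body.count("domain_id") != 2:
        raise ValueError("expected one domain_id column per side")
    # The first domain_id column closes the source side.
    cut = body.index("domain_id") + 1
    sides = {}
    for name, cols in (("source", body[:cut]), ("target", body[cut:])):
        if cols[-2:] != ["fullname", "domain_id"]:
            raise ValueError(f"{name} side must end with fullname,domain_id")
        asset_types = cols[:-2]
        if len(asset_types) < 3:
            raise ValueError(f"{name} side needs at least one node, a parent, and a leaf")
        sides[name] = {
            "nodes": asset_types[:-2],   # one or more node asset types
            "parent": asset_types[-2],   # parent asset type, for example Table
            "leaf": asset_types[-1],     # leaf asset type, for example Column
        }
    return sides


header = (
    "System,Database,Schema,Table,Column,fullname,domain_id,"
    "System,Database,Schema,Table,Column,fullname,domain_id,"
    "source_code,highlights,transformation_display_name"
).split(",")
print(split_header(header)["source"])
```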
When providing the lineage relationships, keep the following in mind:
- Values for the node and parent asset types are mandatory for both the source and the target.
- A value for the leaf asset type is optional.
- Values for fullname and domain_id are optional.
- Values for source_code, highlights, and transformation_display_name are optional. source_code can either be a string or the full path to a file.
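The rules above can be checked before running the conversion. The sketch below is a hypothetical pre-check, not part of the repository: check_row assumes each side is laid out as node columns, parent, leaf, fullname, domain_id, followed by the three trailing columns, and it only reports missing mandatory values.

```python
import csv
import io


def check_row(row, source_nodes, target_nodes):
    """Return a list of problems for one CSV row (an empty list means valid).

    source_nodes and target_nodes are the numbers of node columns per side.
    """
    expected = (source_nodes + 4) + (target_nodes + 4) + 3
    if len(row) != expected:
        return [f"expected {expected} values, found {len(row)}"]
    problems = []
    start = 0
    for side, nodes in (("source", source_nodes), ("target", target_nodes)):
        mandatory = row[start:start + nodes + 1]  # node values plus the parent
        for offset, value in enumerate(mandatory):
            if not value.strip():
                problems.append(f"{side}: column {start + offset + 1} is mandatory")
        start += nodes + 4  # skip leaf, fullname, and domain_id (all optional)
    return problems


line = (
    "snowflake,KRISTOF,PUBLIC,T1,USERID,,,snowflake,KRISTOF,PUBLIC,V2,UI_2L,,,"
    'CREATE VIEW KRISTOF.PUBLIC.V2 AS select USERID from KRISTOF.PUBLIC.T1,"[0:70]",transformation'
)
row = next(csv.reader(io.StringIO(line)))
print(check_row(row, source_nodes=3, target_nodes=3))  # []
```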
Headers define the asset types for which you define the lineage relationships; therefore, one file can only contain lineage relationships for the same type of assets in the source/target. You can, however, create as many CSV files as you want in the directory. The generated metadata.json file will contain the definitions for System, Database, Schema, Table, and Column. If you are using any other asset types, you need to add them to the metadata.json file, with their respective UUIDs.
Let's have a look at two examples:
System,Database,Schema,Table,Column,fullname,domain_id,System,Database,Schema,Table,Column,fullname,domain_id,source_code,highlights,transformation_display_name
snowflake,KRISTOF,PUBLIC,T1,USERID,,,snowflake,KRISTOF,PUBLIC,V2,UI_2L,,,CREATE VIEW KRISTOF.PUBLIC.V2 AS select USERID from KRISTOF.PUBLIC.T1,"[0:70]",transformation
snowflake,KRISTOF,PUBLIC,T2,,domain1,,,KRISTOF,PUBLIC,V2,,,,,,
The first row defines the column-level lineage and provides details about the source_code, highlights, and transformation_display_name. The second row defines the table-level lineage.
GCS File System,GCS Bucket,Directory,Directory,File,fullname,domain_id,System,Database,Schema,Table,Column,fullname,domain_id,source_code,highlights,transformation_display_name
gcs,catingestiontest,/,ingestion-test,mytest.csv,f13bf705-13a4-44c9-843e-f341feccfb6e > catingestiontest/ingestion-test/ingestion copy/mytest.csv/1611609340099809,fea1b0b0-705f-4e0d-b5eb-1f21132cc718,snowflake,KRISTOF,PUBLIC,V2,UI_2L,,,snowflake data pipline abc,,transformation
This example creates a lineage relationship between a file and a column. The custom fullname and domain_id are provided for the file because they are needed for stitching.
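A source_code value that contains commas, as SQL often does, must be quoted so that it stays in a single CSV field. One way to avoid quoting mistakes is to generate the files with Python's csv module. The sketch below is an assumption about how you might produce such a file, not a tool from this repository; rows_to_csv is a hypothetical helper, and the SQL statement with two selected columns is made up for illustration.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialize a header and rows to CSV text, quoting fields where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


header = (
    "System,Database,Schema,Table,Column,fullname,domain_id,"
    "System,Database,Schema,Table,Column,fullname,domain_id,"
    "source_code,highlights,transformation_display_name"
).split(",")

# Hypothetical statement: the comma between USERID and NAME forces quoting.
sql = "CREATE VIEW KRISTOF.PUBLIC.V2 AS select USERID, NAME from KRISTOF.PUBLIC.T1"
rows = [
    ["snowflake", "KRISTOF", "PUBLIC", "T1", "USERID", "", "",
     "snowflake", "KRISTOF", "PUBLIC", "V2", "UI_2L", "", "",
     sql, "", "transformation"],
]

text = rows_to_csv(header, rows)
print(text)

# Reading the text back yields the same 17 fields per row.
parsed = list(csv.reader(io.StringIO(text)))
```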
tools.example.py and tools.example_with_props.py contain examples of how you can use the models and helper functions defined in src.models.py and src.helper.py to generate the required files for custom technical lineage. They also show how the functions can be used to upload the files to Edge, trigger the edgecli command, and synchronize the capability.
Usage:
python3 -m tools.collect_assets_fullname [--collibraInstance] [--username] [--password] [--domainId] [--typeId] [--name]
Where:
- --collibraInstance is the name of the Collibra environment. If, for example, the URL of the environment is https://myinstance.collibra.com, the environment name is myinstance.
- --username is the Collibra username used to make API calls.
- --password is the Collibra account password.
- --domainId is the domain ID of the relevant asset. This is optional.
- --typeId is the asset type ID of the relevant asset. This is optional.
- --name is the display name of the relevant asset. This is optional.
Usage:
python3 -m tools.collect_assets_fullname [--collibraInstance] [--username] [--password] [--domainId] [--typeId] [--name]
Where:
- --collibraInstance is the Collibra instance name. If the instance URL is https://myinstance.collibra.com, the instance name is myinstance.
- --username is the Collibra username used to make API calls.
- --password is the Collibra account password.
- --applicationName is the type of data source for which you are creating a technical lineage.
- --typeId is the type ID of the assets whose details are to be retrieved. This is optional.
Custom technical lineage examples are available under the Collibra Marketplace License agreement.