- environments — This directory holds configurations for different environments such as development, staging, and production. Each environment has its own set of Terraform configuration files (main.tf, variables.tf, outputs.tf, terraform.tfvars).
- modules — This directory contains reusable modules for the different components of the infrastructure. Each module has its own directory with main.tf, variables.tf, and outputs.tf defining the module's functionality.
- images — This directory contains the Docker image details for the infra, intended to be used with CI/CD and an artifact registry such as JFrog Artifactory or GCR. Currently this is a placeholder.
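For orientation, an indicative layout based on the descriptions above (environment and module names are placeholders, not necessarily the repo's actual contents):

```
environments/
  dev/          # main.tf, variables.tf, outputs.tf, terraform.tfvars
  staging/
  production/
modules/
  <module>/     # main.tf, variables.tf, outputs.tf
images/         # placeholder for Docker image details (CI/CD + registry)
```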
- Creating a Kubernetes cluster requires the following components:
  - Node pools
  - Networks (VPC, NAT, subnet)
  - Enabling of the required GCP services, and service account creation
- For the scope of this assessment, the default network is used.
- Workload Identity Federation has been enabled for seamless authentication.
- A service account has been created and bound to Workload Identity Federation.
- A ConfigMap with the PII fields info has been created in the airflow-ns namespace.
- A Git repo is used for DAG sync.
- Port forwarding is used instead of a load balancer.
- The default admin user is used; no other users are created.
- basic_dag is a DAG used for testing; a minimal sketch of such a smoke-test DAG is shown below.
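A minimal sketch in the spirit of basic_dag; the actual DAG in the repo may differ, and the task and echo message here are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="basic_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # trigger manually while testing
    catchup=False,
) as dag:
    BashOperator(
        task_id="smoke_test",
        bash_command="echo 'scheduler, executor and git-sync are working'",
    )
```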
- create_dataproc_single_master is the DAG that (see the sketch after this list):
  - Creates a Dataproc cluster
  - Inserts sample data into the GCS location (Spark job)
  - Reads the PII mapping and generates the masked data (Spark job)
  - Generates the Delta format for both the masked data and the normal data (Spark job)
- The code for the jobs is kept in this repo.
- As a best practice, a CI/CD pipeline will zip the code and put it in a GCS bucket for PySpark to pick up.
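A sketch of the create_dataproc_single_master flow using the Google provider's Dataproc operators. The project, region, cluster name, bucket, and job file names are placeholders, not the repo's actual values:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocSubmitJobOperator,
)

PROJECT_ID = "<project-id>"
REGION = "<region>"
CLUSTER_NAME = "pii-single-master"


def pyspark_job(uri: str) -> dict:
    """Build a Dataproc PySpark job spec pointing at a script in GCS."""
    return {
        "reference": {"project_id": PROJECT_ID},
        "placement": {"cluster_name": CLUSTER_NAME},
        "pyspark_job": {"main_python_file_uri": uri},
    }


with DAG(
    dag_id="create_dataproc_single_master",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 0},
            # a single-master (single-node) cluster needs this property
            "software_config": {
                "properties": {"dataproc:dataproc.allow.zero.workers": "true"}
            },
        },
    )
    insert_sample_data = DataprocSubmitJobOperator(
        task_id="insert_sample_data",
        project_id=PROJECT_ID,
        region=REGION,
        job=pyspark_job("gs://<bucket>/jobs/insert_sample_data.py"),
    )
    mask_pii = DataprocSubmitJobOperator(
        task_id="mask_pii",
        project_id=PROJECT_ID,
        region=REGION,
        job=pyspark_job("gs://<bucket>/jobs/mask_pii.py"),
    )
    write_delta = DataprocSubmitJobOperator(
        task_id="write_delta",
        project_id=PROJECT_ID,
        region=REGION,
        job=pyspark_job("gs://<bucket>/jobs/write_delta.py"),
    )

    create_cluster >> insert_sample_data >> mask_pii >> write_delta
```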
- Dataproc Metastore (serverless) is used as the Hive metastore; a session sketch follows.
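A sketch of how a job can use the attached metastore as its Hive metastore and write Delta format. It assumes the Delta Lake package is available on the cluster; the app name, sample data, and GCS path are placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("pii-masking")
    .enableHiveSupport()  # tables resolve against the attached Hive metastore
    .getOrCreate()
)

# Tiny sample frame so the snippet is runnable end to end.
df = spark.createDataFrame([("a@example.com",)], ["email"])
df.write.format("delta").mode("overwrite").save("gs://<bucket>/delta/sample")
```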
- Reading the ConfigMap PII fields from Airflow has been attempted and works.
- To keep the PII fields dynamic, the ConfigMap mount path has been added as an Airflow Variable and is passed to the job, as sketched below.
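A sketch of that handoff: the ConfigMap is mounted into the Airflow pods, the mount path comes from an Airflow Variable, and the mapping is read and forwarded to the Spark job. The variable name, default path, and file name are assumptions for illustration:

```python
import json
from pathlib import Path

from airflow.models import Variable

# Mount path lives in an Airflow Variable so it can change without a code change.
pii_mount_path = Variable.get("pii_fields_mount_path", default_var="/opt/airflow/pii")

# Read the mapping from the mounted ConfigMap file (file name is hypothetical).
pii_map = json.loads(Path(pii_mount_path, "pii_map.json").read_text())

# The mapping can then be forwarded to the Spark job, e.g. via the job spec:
#   "pyspark_job": {"main_python_file_uri": "...", "args": [json.dumps(pii_map)]}
```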
- The structure of the PII map is given below:
  - field represents the column
  - table represents the dataset the PII mapper should be applied to
  - type represents the data type of the column:
    - If it is a string, masking is straightforward.
    - If it is json, a parsable path has to be given; this works only with JSON paths that do not contain arrays.
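Illustrative entries only; the field and table names are made up for the example:

```python
pii_map = [
    {"field": "email", "table": "customers", "type": "string"},
    # for JSON columns the field carries a parsable path (no arrays):
    {"field": "payload.user.ssn", "table": "events", "type": "json"},
]
```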
- A generic function has been implemented to interpret this mapper, present here.
- With this approach we can implement many custom types and extend to nested and more complex data fields; a sketch of the idea follows.
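A sketch of such a generic masker driven by the mapping above; the repo's actual implementation may differ, and SHA-256 hashing is used here purely as an example masking strategy:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def apply_pii_map(df: DataFrame, table: str, pii_map: list[dict]) -> DataFrame:
    """Mask the columns that the PII map declares for this table."""
    for entry in (e for e in pii_map if e["table"] == table):
        if entry["type"] == "string":
            # Plain column: hash it in place.
            df = df.withColumn(entry["field"], F.sha2(F.col(entry["field"]), 256))
        elif entry["type"] == "json":
            # JSON column: extract the value at the parsable path and hash it
            # into a derived column. Rewriting the value in place inside the
            # JSON document would need a UDF — which is where custom types for
            # nested / more complex fields come in.
            column, path = entry["field"].split(".", 1)
            df = df.withColumn(
                f"{column}_{path.replace('.', '_')}_masked",
                F.sha2(F.get_json_object(F.col(column), f"$.{path}"), 256),
            )
    return df
```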