- environments — This directory holds configurations for different environments such as development, staging, and production. Each environment has its own set of Terraform configuration files (main.tf, variables.tf, outputs.tf, terraform.tfvars).
- modules — This directory contains reusable modules for the different components of the infrastructure. Each module has its own directory with main.tf, variables.tf, and outputs.tf defining the module's functionality.
- images — This directory contains the Docker image details for the infra, intended to be used with CI/CD and an artifact registry such as JFrog Artifactory or GCR. Currently this is a placeholder.
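For orientation, an indicative layout based on the descriptions above (environment and module names are placeholders, not necessarily the repo's actual contents):

```
environments/
  dev/          # main.tf, variables.tf, outputs.tf, terraform.tfvars
  staging/
  production/
modules/
  <module>/     # main.tf, variables.tf, outputs.tf
images/         # placeholder for Docker image details (CI/CD + registry)
```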
- Creating a Kubernetes cluster requires the following components:
  - Node pools
  - Networks (VPC, NAT, subnet)
  - Enabling of the required GCP services, and service account creation
- For the scope of this assessment, the default network is used.
- Workload Identity Federation has been enabled for seamless authentication.
- A service account has been created and bound to Workload Identity Federation.
- A ConfigMap with the PII fields info has been created in the airflow-ns namespace.
- A Git repo is used for DAG sync.
- Port forwarding is used instead of a load balancer.
- The default admin user is used; no other users are created.
- basic_dag is a DAG used for testing; a minimal sketch of such a smoke-test DAG is shown below.
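A minimal sketch in the spirit of basic_dag; the actual DAG in the repo may differ, and the task and echo message here are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="basic_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # trigger manually while testing
    catchup=False,
) as dag:
    BashOperator(
        task_id="smoke_test",
        bash_command="echo 'scheduler, executor and git-sync are working'",
    )
```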
- create_dataproc_single_master is the DAG that (see the sketch after this list):
  - Creates a Dataproc cluster
  - Inserts sample data into the GCS location (Spark job)
  - Reads the PII mapping and generates the masked data (Spark job)
  - Generates the Delta format for both the masked data and the normal data (Spark job)
- The code for the jobs is kept in this repo.
- As a best practice, a CI/CD pipeline will zip the code and put it in a GCS bucket for PySpark to pick up.
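A sketch of the create_dataproc_single_master flow using the Google provider's Dataproc operators. The project, region, cluster name, bucket, and job file names are placeholders, not the repo's actual values:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocSubmitJobOperator,
)

PROJECT_ID = "<project-id>"
REGION = "<region>"
CLUSTER_NAME = "pii-single-master"


def pyspark_job(uri: str) -> dict:
    """Build a Dataproc PySpark job spec pointing at a script in GCS."""
    return {
        "reference": {"project_id": PROJECT_ID},
        "placement": {"cluster_name": CLUSTER_NAME},
        "pyspark_job": {"main_python_file_uri": uri},
    }


with DAG(
    dag_id="create_dataproc_single_master",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 0},
            # a single-master (single-node) cluster needs this property
            "software_config": {
                "properties": {"dataproc:dataproc.allow.zero.workers": "true"}
            },
        },
    )
    insert_sample_data = DataprocSubmitJobOperator(
        task_id="insert_sample_data",
        project_id=PROJECT_ID,
        region=REGION,
        job=pyspark_job("gs://<bucket>/jobs/insert_sample_data.py"),
    )
    mask_pii = DataprocSubmitJobOperator(
        task_id="mask_pii",
        project_id=PROJECT_ID,
        region=REGION,
        job=pyspark_job("gs://<bucket>/jobs/mask_pii.py"),
    )
    write_delta = DataprocSubmitJobOperator(
        task_id="write_delta",
        project_id=PROJECT_ID,
        region=REGION,
        job=pyspark_job("gs://<bucket>/jobs/write_delta.py"),
    )

    create_cluster >> insert_sample_data >> mask_pii >> write_delta
```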
- Dataproc Metastore (serverless) is used as the Hive metastore; a session sketch follows.
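A sketch of how a job can use the attached metastore as its Hive metastore and write Delta format. It assumes the Delta Lake package is available on the cluster; the app name, sample data, and GCS path are placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("pii-masking")
    .enableHiveSupport()  # tables resolve against the attached Hive metastore
    .getOrCreate()
)

# Tiny sample frame so the snippet is runnable end to end.
df = spark.createDataFrame([("a@example.com",)], ["email"])
df.write.format("delta").mode("overwrite").save("gs://<bucket>/delta/sample")
```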
- Reading the ConfigMap PII fields from Airflow has been attempted and works.
- To keep the PII fields dynamic, the ConfigMap mount path has been added as an Airflow Variable and is passed to the job, as sketched below.
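A sketch of that handoff: the ConfigMap is mounted into the Airflow pods, the mount path comes from an Airflow Variable, and the mapping is read and forwarded to the Spark job. The variable name, default path, and file name are assumptions for illustration:

```python
import json
from pathlib import Path

from airflow.models import Variable

# Mount path lives in an Airflow Variable so it can change without a code change.
pii_mount_path = Variable.get("pii_fields_mount_path", default_var="/opt/airflow/pii")

# Read the mapping from the mounted ConfigMap file (file name is hypothetical).
pii_map = json.loads(Path(pii_mount_path, "pii_map.json").read_text())

# The mapping can then be forwarded to the Spark job, e.g. via the job spec:
#   "pyspark_job": {"main_python_file_uri": "...", "args": [json.dumps(pii_map)]}
```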
- The structure of the PII map is given below:
  - field represents the column
  - table represents the dataset the PII mapper should be applied to
  - type represents the data type of the column:
    - If it is a string, masking is straightforward.
    - If it is json, a parsable path has to be given; this works only with JSON paths that do not contain arrays.
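Illustrative entries only; the field and table names are made up for the example:

```python
pii_map = [
    {"field": "email", "table": "customers", "type": "string"},
    # for JSON columns the field carries a parsable path (no arrays):
    {"field": "payload.user.ssn", "table": "events", "type": "json"},
]
```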
- A generic function has been implemented to interpret this mapper, present here.
- With this approach we can implement many custom types and extend to nested and more complex data fields; a sketch of the idea follows.
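A sketch of such a generic masker driven by the mapping above; the repo's actual implementation may differ, and SHA-256 hashing is used here purely as an example masking strategy:

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def apply_pii_map(df: DataFrame, table: str, pii_map: list[dict]) -> DataFrame:
    """Mask the columns that the PII map declares for this table."""
    for entry in (e for e in pii_map if e["table"] == table):
        if entry["type"] == "string":
            # Plain column: hash it in place.
            df = df.withColumn(entry["field"], F.sha2(F.col(entry["field"]), 256))
        elif entry["type"] == "json":
            # JSON column: extract the value at the parsable path and hash it
            # into a derived column. Rewriting the value in place inside the
            # JSON document would need a UDF — which is where custom types for
            # nested / more complex fields come in.
            column, path = entry["field"].split(".", 1)
            df = df.withColumn(
                f"{column}_{path.replace('.', '_')}_masked",
                F.sha2(F.get_json_object(F.col(column), f"$.{path}"), 256),
            )
    return df
```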