In this project, you will find my initial setup for a Rabobank assignment. The assignment can be found in `assignment/Assignment.pdf`.
- Python 3.12
- Java
Rabobank assignment received through email. In `assignment/`, the assignment PDF can be found together with the data. The records CSV file is a simplified version of the MT940 format.
```
.
├── api/            # a simple FastAPI solution
├── assignment/     # the original assignment
├── azurite/        # initialisation of Azurite blobs
├── notebooks/      # Spark notebook for running validation checks
├── postgresql/     # initialisation of PostgreSQL tables
├── terraform/      # Terraform code for deploying Azure resources
├── docker-compose.yml
├── Makefile
├── README.md
└── requirements.txt
```
Due to Azure Sandbox limitations, this project contains local deployments for storage solutions, namely Azurite and PostgreSQL. Furthermore, it contains a notebook to run the PySpark commands and an API for retrieving the failed records.
To set up the initial project, run:
```
make setup
make validate
```
This will install the dependencies from `requirements.txt` and start Azurite, PostgreSQL and the FastAPI service defined in the Docker Compose file. Afterwards, the Jupyter notebook can be executed to insert the `records 1.csv` data into the Postgres tables. Finally, the invalid records can be retrieved using the API.
Go to http://127.0.0.1:8000/docs to see the API spec, or retrieve the invalid records from http://127.0.0.1:8000/records/invalid.
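For reference, a minimal sketch of fetching the invalid records programmatically, assuming the containers from `make setup` are running on the default port (the exact response shape depends on the implementation in `api/`):

```python
import requests

# Fetch the invalid records from the locally running FastAPI service.
response = requests.get("http://127.0.0.1:8000/records/invalid", timeout=10)
response.raise_for_status()

for record in response.json():
    print(record)
```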
When you're finished reviewing the records, make sure to exit all processes using `make teardown`. This will stop all containers and remove the related images.
- Create infrastructure using Terraform
  - Deploy Storage Account, Azure SQL Server and Azure Synapse (with Spark Pool)
  - Deploy extra/optional Azure Key Vault for storing admin passwords
- Upload `records 1.csv` to Storage Account
- Create table `records` in database with the following schema:
  - `transaction_reference` as integer primary key
  - `account_number` as varchar
  - `description` as varchar
  - `start_balance` as float
  - `mutation` as float
  - `end_balance` as float
- Create test PySpark pipeline for testing validations on a fake table?
  - Insert one successful record
  - Insert one failed record (non-unique key)
  - Insert one failed record (incorrect end balance)
- Create PySpark pipeline in Azure Synapse
  - Read CSV from Storage Account
  - Do validation on end balance
  - Insert into table
  - Record failed inserts due to non-unique transaction reference (see the sketch after this list)
- Create simple API to retrieve failed records from database
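This plan was not fully realised (see below), but as a rough illustration, the end-balance check and duplicate-key detection could look something like the following in PySpark. The column names follow the schema above; the file path and Spark session setup are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("records-validation").getOrCreate()

# Read the simplified MT940 records CSV (local path is an assumption).
records = spark.read.csv("records 1.csv", header=True, inferSchema=True)

# Count how often each transaction_reference occurs to spot non-unique keys,
# and check that start_balance + mutation equals end_balance.
ref_window = Window.partitionBy("transaction_reference")
validated = (
    records.withColumn("ref_count", F.count("*").over(ref_window))
    .withColumn(
        "valid_end_balance",
        F.round(F.col("start_balance") + F.col("mutation"), 2) == F.col("end_balance"),
    )
)

# Invalid records: duplicate transaction reference or incorrect end balance.
invalid = validated.filter((F.col("ref_count") > 1) | (~F.col("valid_end_balance")))
invalid.show()
```

The valid subset would then be inserted into the `records` table, while the invalid subset is stored separately so the API can serve it.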
In the end, I was not able to follow through on the initial plan due to some limitations of the Azure Sandbox that I used. In the following section, I will go through each part of the implementation with some explanation and ideas for further development.
- Super simple API created with FastAPI with one endpoint. Could add a parameter to limit the number of invalid records returned, or a query parameter to filter on reference number (see the sketch below).
- To run the FastAPI app, or any kind of API that retrieves records from a database, I would compare the following solutions: https://learn.microsoft.com/en-us/azure/container-apps/compare-options. My preference would go to Kubernetes, as it is a cloud-agnostic framework for orchestrating containerized workloads, but if the team wants less maintenance, I would consider a PoC with either Azure Functions or Azure Container Apps.
- It was my first experience using FastAPI; it was very easy to set up an initial API, but I am unsure whether it is suitable for production use cases.
- Could have used a multi-stage build for the FastAPI image to make it slimmer.
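As a sketch of the query-parameter idea mentioned above (the `invalid_records` table name, connection settings and endpoint shape are assumptions, not the actual implementation in `api/`):

```python
from typing import Optional

import psycopg2
import psycopg2.extras
from fastapi import FastAPI, Query

app = FastAPI()

# Connection settings are an assumption matching a local docker-compose PostgreSQL.
DSN = "host=localhost port=5432 dbname=postgres user=postgres password=postgres"


@app.get("/records/invalid")
def get_invalid_records(
    limit: int = Query(100, ge=1, le=1000),
    transaction_reference: Optional[int] = Query(None),
):
    """Return invalid records, optionally filtered by transaction reference."""
    query = "SELECT * FROM invalid_records"
    params: list = []
    if transaction_reference is not None:
        query += " WHERE transaction_reference = %s"
        params.append(transaction_reference)
    query += " LIMIT %s"
    params.append(limit)

    with psycopg2.connect(DSN) as conn:
        with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
            cur.execute(query, params)
            return cur.fetchall()
```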
- In the end, I did not use an Azure Synapse pipeline to do the validations; instead I used Jupyter notebooks (with Spark configured) and DuckDB to perform the necessary actions to simulate database upserts.
- I would consider adding a Storage Event trigger to the pipeline. Furthermore, I would store the pipeline definitions and notebooks in a Git repository, which opens up the possibility for CI/CD.
- Connecting to Azurite from a local Spark configuration turned out to be more complex than initially thought, so I opted for using the CSV file as is. This will be easier in a Spark notebook in Azure Synapse, where the Spark configuration has already been set up properly.
- Spark writes with JDBC are limited; you have to manually write inserts to respect constraints such as primary keys (see the sketch below).
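A minimal sketch of such a manual insert, assuming a PostgreSQL `records` table with the schema listed earlier and local docker-compose connection settings:

```python
import psycopg2

# Connection settings are an assumption matching a local docker-compose PostgreSQL.
DSN = "host=localhost port=5432 dbname=postgres user=postgres password=postgres"

INSERT_SQL = """
    INSERT INTO records
        (transaction_reference, account_number, description,
         start_balance, mutation, end_balance)
    VALUES (%s, %s, %s, %s, %s, %s)
    ON CONFLICT (transaction_reference) DO NOTHING
"""


def insert_records(rows):
    """Insert rows one by one; return the rows rejected by the primary key."""
    failed = []
    with psycopg2.connect(DSN) as conn:
        with conn.cursor() as cur:
            for row in rows:
                cur.execute(INSERT_SQL, row)
                if cur.rowcount == 0:  # ON CONFLICT skipped this insert
                    failed.append(row)
    return failed
```

Rows can be collected from the validated Spark DataFrame with `collect()` before calling this; for larger volumes a staging table and a set-based upsert would scale better.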
- Due to limited permissions on the A Cloud Guru Azure Sandbox environment, I was unable to perform the following actions:
  - Deploy infrastructure via Terraform
  - Create a Spark Pool in Azure Synapse
  - Connect to Azure SQL Database
- I am not certain the Terraform code will deploy, because I was not able to test it.
- I did not implement any networking on the Azure resources, although I do expect some networking to be involved when deploying the actual infrastructure.
- I did not set up a remote state storage account in Azure to store the Terraform state file; this should be done as well, because we do not want local state files.
- An alternative platform would be an Azure Storage Account combined with Azure Databricks for engineering workflows; DBT can be run from Databricks clusters using Databricks Workflows.
- Regarding testing/developing/maintaining all the code, I would have created separate environments. Using Azure Pipelines, we can easily integrate linting, testing and security scanning on all repositories. I would have separated the API, Synapse pipelines and Terraform into different repositories.
- For Azure Synapse pipelines, I am not quite sure about CI/CD, but I would probably have separate workspaces for Development and Production.
- For FastAPI, I would consider creating Azure Pipelines that perform linting, tests and security scanning as well, such as Ruff for linting and Snyk for security scanning. After those steps, the image would be built and pushed to the Container Registry.