Welcome to the Cloud Data Engineer Challenge! This challenge evaluates your ability to work with Infrastructure as Code (IaC), AWS data services, and data engineering workflows, with a focus on efficient data ingestion, storage, and querying.
> [!NOTE]
> You can use any IaC tool of your choice (Terraform is preferred, but alternatives are allowed). If you choose a different tool or a combination of tools, justify your decision!
Your task is to deploy the following infrastructure on AWS:
Key Objectives:
- An S3 bucket that will receive data files as new objects.
- A Lambda function that is triggered by a `PUT` event in the S3 bucket (see the Terraform sketch after this list).
- The Lambda function must:
  - Process the ingested data and perform a minimal aggregation.
  - Store the processed data in a PostgreSQL database with PostGIS enabled.
- Expose an API Gateway endpoint (`GET /aggregated-data`) to query and retrieve the aggregated data.
- A PostgreSQL database running in a private subnet with PostGIS enabled.
- Networking must include: VPC, public/private subnets, and security groups.
- The Lambda must be in a private subnet and use a NAT Gateway in a public subnet for internet access.
- CloudWatch logs should capture Lambda execution details and possible errors.
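To make the event wiring concrete, here is a minimal Terraform sketch of the S3-to-Lambda trigger and the function's private-subnet placement. All names (`ingest`, `processor`, and the role, subnet, and security group references) are illustrative placeholders, not prescribed by the challenge:

```hcl
# Ingestion bucket -- name is a placeholder.
resource "aws_s3_bucket" "ingest" {
  bucket = "my-ingest-bucket"
}

# The processing function lives in a private subnet; the NAT Gateway
# in the public subnet provides its outbound internet access.
# The role, subnet, and security group are assumed to be defined elsewhere.
resource "aws_lambda_function" "processor" {
  function_name = "data-processor"
  role          = aws_iam_role.lambda_exec.arn
  runtime       = "python3.12"
  handler       = "handler.lambda_handler"
  filename      = "lambda.zip"

  vpc_config {
    subnet_ids         = [aws_subnet.private_a.id]
    security_group_ids = [aws_security_group.lambda.id]
  }
}

# S3 must be granted explicit permission to invoke the function.
resource "aws_lambda_permission" "allow_s3" {
  statement_id  = "AllowS3Invoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.processor.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.ingest.arn
}

# Invoke the Lambda on every PUT to the bucket.
resource "aws_s3_bucket_notification" "on_put" {
  bucket = aws_s3_bucket.ingest.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.processor.arn
    events              = ["s3:ObjectCreated:Put"]
  }

  depends_on = [aws_lambda_permission.allow_s3]
}
```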
> [!IMPORTANT]
> Ensure that your solution is modular, well-documented, and follows best practices for security and maintainability.
Must Include:
- IaC: Any tool of your choice (Terraform preferred, but others are allowed if justified).
- AWS Services: S3, Lambda, API Gateway, CloudWatch, PostgreSQL with PostGIS (RDS or self-hosted on EC2).
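One plausible Terraform shape for the required `GET /aggregated-data` endpoint, shown here with API Gateway's HTTP API (a REST API is equally valid); names are placeholders and the `query` Lambda is assumed to be defined elsewhere:

```hcl
resource "aws_apigatewayv2_api" "api" {
  name          = "aggregated-data-api" # placeholder name
  protocol_type = "HTTP"
}

# Proxy integration to a query Lambda (assumed defined elsewhere).
resource "aws_apigatewayv2_integration" "query_lambda" {
  api_id                 = aws_apigatewayv2_api.api.id
  integration_type       = "AWS_PROXY"
  integration_uri        = aws_lambda_function.query.invoke_arn
  payload_format_version = "2.0"
}

# The GET /aggregated-data route required by the challenge.
resource "aws_apigatewayv2_route" "get_aggregated" {
  api_id    = aws_apigatewayv2_api.api.id
  route_key = "GET /aggregated-data"
  target    = "integrations/${aws_apigatewayv2_integration.query_lambda.id}"
}

resource "aws_apigatewayv2_stage" "default" {
  api_id      = aws_apigatewayv2_api.api.id
  name        = "$default"
  auto_deploy = true
}

# Note: an aws_lambda_permission granting apigateway.amazonaws.com
# invoke rights on the query Lambda is also required.
```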
Your submission must be a Pull Request that includes:
- An IaC module that deploys the entire architecture.
- A `README.md` with deployment instructions and tool selection justification.
- A working API Gateway endpoint that returns the aggregated data stored in PostgreSQL.
- CloudWatch logs capturing Lambda execution details.
- Example input files to trigger the data pipeline (placed in an `examples/` directory).
- A sample event payload (JSON format) to simulate the S3 `PUT` event (a sketch follows this list).
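For orientation, the S3 `PUT` event that Lambda receives follows the standard `Records` envelope. An abbreviated sketch with placeholder bucket and key (a real event carries additional fields):

```json
{
  "Records": [
    {
      "eventVersion": "2.1",
      "eventSource": "aws:s3",
      "awsRegion": "us-east-1",
      "eventName": "ObjectCreated:Put",
      "s3": {
        "bucket": {
          "name": "my-ingest-bucket",
          "arn": "arn:aws:s3:::my-ingest-bucket"
        },
        "object": {
          "key": "examples/sample-input.csv",
          "size": 1024
        }
      }
    }
  ]
}
```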
> [!TIP]
> Use the `docs` folder to store any additional documentation or diagrams that help explain your solution. Mention any assumptions or constraints in your `README.md`.
Bonus Points For:
- Data Quality & Validation: Implementing schema validation before storing data in PostgreSQL.
- Indexing & Query Optimization: Using PostGIS spatial indexing for efficient geospatial queries.
- Monitoring & Alerts: Setting up AWS CloudWatch Alarms for S3 event failures or Lambda errors (see the sketch after this list).
- Automated Data Backups: Creating periodic database backups to S3 using AWS Lambda or AWS Backup.
- GitHub Actions for validation: Running `terraform fmt`, `terraform validate`, or equivalent for the chosen IaC tool.
- Pre-commit hooks: Ensuring linting and security checks before committing.
- Docker for local testing: Using Docker Compose to spin up:
  - A local PostgreSQL database with PostGIS to simulate the cloud environment.
  - A local S3-compatible service (e.g., MinIO) to test file ingestion before deployment.
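As one example of the monitoring bonus, a minimal Terraform sketch of a CloudWatch alarm on Lambda errors (the alarm name and function reference are placeholders):

```hcl
# Alarm whenever the processing Lambda reports any error in a 5-minute window.
resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "data-processor-errors"
  namespace           = "AWS/Lambda"
  metric_name         = "Errors"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"

  dimensions = {
    FunctionName = aws_lambda_function.processor.function_name
  }

  # Wire alarm_actions to an SNS topic if you want notifications.
}
```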
> [!TIP]
> Looking for inspiration or additional ideas to earn extra points? Check out our Awesome NaNLABS repository for reference projects and best practices!
Follow these steps to submit your solution:
1. Fork this repository.
2. Create a feature branch for your implementation.
3. Commit your changes with meaningful commit messages.
4. Open a Pull Request following the provided template.
5. Our team will review and provide feedback.
What we'll be looking at:
- Correctness and completeness of the data pipeline.
- Use of best practices for event-driven processing (S3 triggers, Lambda execution).
- Data transformation & aggregation logic implemented in Lambda.
- Optimization for geospatial queries using PostGIS.
- Data backup & integrity strategies (optional, e.g., automated S3 backups).
- CI/CD automation using GitHub Actions and pre-commit hooks (optional).
- Documentation clarity: Clear explanation of data flow, transformation logic, and infrastructure choices.