Introduction to Data Engineering
- What is data engineering
- Data engineering roadmap
- Software engineering
- Analytics
- Domain knowledge
Beginner path
- Databases
- Types of databases
- Data warehouses, MPP, NoSQL
- Learning SQL
- Basics
- Window functions
- Joins
- Optimization
- Business Intelligence tools
- Purpose and tasks
- Redash, Metabase, Google Data Studio
- Tableau, Power BI
- Version control
- Git
- GitHub, GitLab
- Python for Data Engineering
- Basics
- Virtual environments
- Packages
- Scientific packages (NumPy, Pandas)
- Data visualization
- Web
- Network
- REST API
- HTTP
- HTTP methods (GET, POST)
- Web servers
- Domain knowledge
- Data modeling
- Kimball
- Snowflake vs star schema
- Data Vault
- Data pipelines
- Data modeling
- Databases
Big Data path
- Virtualization and containerization
- Docker
- Linux
- Basics
- Terminal
- Big Data
- Distributed processing
- MapReduce
- Rise of Hadoop
- HDFS
- Hive
- Spark
- Hadoop distributions (Cloudera, Hortonworks)
- Integrations between systems (Sqoop, API)
- Distributed systems
- CAP theorem
- Analytics
- Statistics
- Metrics and KPIs
- A/B testing
- Programming
- Data structures and algorithms
- SOLID principles
- Advanced Python
- Classes, context managers
- Collections module
- GIL and asynchronous Python
- Advanced domain knowledge
- Data quality
- Data monitoring
- Security
- Privacy (GDPR)
- Fault tolerance
- Virtualization and containerization
Data architect path
- Streaming
- Kafka
- Spark Streaming
- Flink / Beam
- Types of architectures
- Monolith, microservices, data lakes
- Batch vs stream
- Lambda, Kappa, Zeta
- Advanced Analytics
- Machine learning
- Time series analysis
- High performance programming languages
- Java and Scala
- Example of Scala + SBT + Spark
- DevOps
- Clouds
- CI/CD
- Orchestration
- DE landscape examples
- Cassandra
- Dask
- Prefect
- ELK stack
- Dremio
- Streaming
Useful sources: