Topics to cover

  1. Introduction to Data Engineering

    1. What is data engineering
    2. Data engineering roadmap
      • Software engineering
      • Analytics
      • Domain knowledge
  2. Beginner path

    1. Databases
      • Types of databases
      • Data warehouses, MPP, NoSQL
    2. Learning SQL
      • Basics
      • Window functions
      • Joins
      • Optimization
    3. Business Intelligence tools
      • Purpose and tasks
      • Redash, Metabase, Google Data Studio
      • Tableau, Power BI
    4. Version control
      • Git
      • GitHub, GitLab
    5. Python for Data Engineering
      • Basics
      • Virtual environments
      • Packages
      • Scientific packages (NumPy, Pandas)
      • Data visualization
    6. Web
      • Network
      • REST API
      • HTTP
      • HTTP methods (GET, POST)
      • Web servers
    7. Domain knowledge
      • Data modeling
        • Kimball
        • Snowflake vs. star schema
        • Data Vault
      • Data pipelines
  3. Big Data path

    1. Virtualization and containerization
      • Docker
    2. Linux
      • Basics
      • Terminal
    3. Big Data
      • Distributed processing
      • MapReduce
      • Rise of Hadoop
      • HDFS
      • Hive
      • Spark
      • Hadoop distributions (Cloudera, Hortonworks)
      • Integrations between systems (Sqoop, API)
    4. Distributed systems
      • CAP theorem
    5. Analytics
      • Statistics
      • Metrics and KPIs
      • A/B testing
    6. Programming
      • Data structures and algorithms
      • SOLID principles
    7. Advanced Python
      • Classes, context managers
      • Collections module
      • GIL and asynchronous Python
    8. Advanced domain knowledge
      • Data quality
      • Data monitoring
      • Security
      • Privacy (GDPR)
      • Fault tolerance
  4. Data architect path

    1. Streaming
      • Kafka
      • Spark Streaming
      • Flink / Beam
    2. Types of architectures
      • Monolith, microservices, data lakes
      • Batch vs. stream
      • Lambda, Kappa, Zeta
    3. Advanced Analytics
      • Machine learning
      • Time series analysis
    4. High performance programming languages
      • Java and Scala
      • Example of Scala + SBT + Spark
    5. DevOps
      • Cloud platforms
      • CI/CD
      • Orchestration
    6. DE landscape examples
      • Cassandra
      • Dask
      • Prefect
      • ELK stack
      • Dremio

Useful sources: