David Liu edited this page Jan 21, 2025 · 9 revisions

Welcome to the data-integration wiki!

Concepts

Stream vs Batch

  • Streaming format: for sending an arbitrary-length sequence of records. It behaves like an iterator with lazy evaluation, as with yield
    • Pros
      • Lowest latency: records can be processed as soon as they arrive.
    • Cons
      • The format must be processed from start to end and does not support random access
      • Computing statistics requires windowing or a cursor, as in stream analytics
  • Batch format: for serializing a fixed number of record batches. It supports random access
    • aka. file format
    • It is especially useful when combined with memory maps
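
The contrast above can be sketched in a few lines of Python. This is only an illustration, not the API of any particular library: the function names (stream_records, tumbling_window_means, make_batches) and the in-memory list standing in for a file are assumptions made for the example. The stream is consumed strictly in order and needs a window to produce a statistic, while the batches can be indexed directly.

```python
from typing import Iterable, Iterator, List


def stream_records(n: int) -> Iterator[int]:
    """Streaming: lazily yield records one at a time (hypothetical source)."""
    for i in range(n):
        yield i


def tumbling_window_means(stream: Iterable[int], size: int) -> Iterator[float]:
    """Compute a mean per fixed-size window, since a stream has no known end."""
    window: List[int] = []
    for record in stream:
        window.append(record)
        if len(window) == size:
            yield sum(window) / size
            window = []
    if window:  # flush a final, partially filled window
        yield sum(window) / len(window)


def make_batches(records: List[int], batch_size: int) -> List[List[int]]:
    """Batch: a fixed number of record batches held together (stand-in for a file)."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


# Streaming: records are only visible in order, start to end.
means = list(tumbling_window_means(stream_records(6), size=3))

# Batch: any batch can be read directly without touching the earlier ones.
batches = make_batches(list(range(6)), batch_size=2)
third_batch = batches[2]

print(means)
print(third_batch)
```

In a real batch (file) format, the index lookup would typically go through a footer or offset table, which is what makes memory-mapped access practical.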

Data integration vs ETL

Data wrangling

aka. Data munging, Data

  • It means transforming data
  • It is closely aligned with ETL