Docs

Visit docs.quiltdata.com. Or browse the docs on GitHub.

Manage data like code

Quilt provides versioned, reusable building blocks for analysis in the form of data packages. A data package may contain data of any type or size. In spirit, Quilt does for data what package managers and Docker registries do for code: provide a centralized, collaborative store of record.

Getting started tutorial

Benefits

  • Reproducibility - Imagine source code without versions. Ouch. Why live with un-versioned data? Versioned data makes analysis reproducible by creating unambiguous references to potentially complex data dependencies.
  • Collaboration and transparency - Data likes to be shared. Quilt offers a centralized data warehouse for finding and sharing data.
  • Auditing - The registry tracks all reads and writes so that admins know when data are accessed or changed.
  • Less data prep - The registry abstracts away network, storage, and file format so that users can focus on what they wish to do with the data.
  • Deduplication - Data fragments are hashed with SHA256. Duplicate fragments are written to disk only once per user, so large, repeated fragments consume less disk space and network bandwidth.
  • Faster analysis - Serialized data loads 5 to 20 times faster than files. Moreover, specialized storage formats like Apache Parquet minimize I/O bottlenecks so that tools like Presto DB and Hive run faster.
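
The deduplication bullet above can be illustrated with a minimal sketch of content-addressed storage. This is not Quilt's implementation or API: `FragmentStore`, `put`, `get`, and `bytes_written` are hypothetical names chosen for illustration, and the store lives in memory rather than on disk. It only shows the general idea that keying fragments by their SHA-256 digest makes a repeated fragment a no-op on write.

```python
import hashlib


class FragmentStore:
    """Toy in-memory content-addressed store (hypothetical, not Quilt's API)."""

    def __init__(self):
        self._blobs = {}        # SHA-256 hex digest -> fragment bytes
        self.bytes_written = 0  # counts only bytes that were actually stored

    def put(self, data: bytes) -> str:
        """Store a fragment and return its digest; duplicates are skipped."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._blobs:
            self._blobs[digest] = data
            self.bytes_written += len(data)
        return digest

    def get(self, digest: str) -> bytes:
        """Look up a fragment by its digest."""
        return self._blobs[digest]


store = FragmentStore()
first = store.put(b"a,b\n1,2\n")
second = store.put(b"a,b\n1,2\n")  # same content: nothing new is written
third = store.put(b"a,b\n3,4\n")   # different content: stored separately
```

Because the digest depends only on the bytes, two packages that share a fragment end up pointing at the same stored blob.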

Commands

Here are the basic Quilt commands:

Service

Quilt is offered as a managed service at quiltdata.com.

Architecture

Quilt consists of three source-level components:

  1. A data catalog

    • Displays package metadata in HTML
    • Implemented in JavaScript with Redux and redux-saga
  2. A data registry

    • Controls permissions
    • Stores package fragments in blob storage
    • Stores package metadata
    • De-duplicates repeated data fragments
    • Implemented in Python with Flask and PostgreSQL
  3. A data compiler

    • Serializes tabular data to Apache Parquet
    • Transforms and parses files
    • Builds packages locally
    • Pushes packages to the registry
    • Pulls packages from the registry
    • Implemented in Python with pandas and PyArrow
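
To make the division of labor between compiler and registry concrete, here is a toy sketch of a build, push, and pull cycle. Everything in it is an assumption made for illustration: `build`, `package_hash`, and `Registry` are hypothetical names, the registry is an in-memory dictionary instead of blob storage plus PostgreSQL, and fragments are raw bytes, not Parquet. It is not Quilt's actual API or wire protocol.

```python
import hashlib
import json


def build(files: dict) -> tuple:
    """Return (manifest, fragments) for a mapping of logical name -> bytes.

    The manifest maps each logical name to the SHA-256 digest of its content;
    fragments maps each digest to the bytes themselves.
    """
    manifest = {name: hashlib.sha256(blob).hexdigest() for name, blob in files.items()}
    fragments = {hashlib.sha256(blob).hexdigest(): blob for blob in files.values()}
    return manifest, fragments


def package_hash(manifest: dict) -> str:
    """Hash a canonical JSON form of the manifest to identify a package version."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


class Registry:
    """Toy in-memory registry (hypothetical): stores fragments and manifests."""

    def __init__(self):
        self.fragments = {}  # digest -> bytes
        self.packages = {}   # package hash -> manifest

    def push(self, manifest: dict, fragments: dict) -> tuple:
        """Store a package; return (version, number of fragments uploaded)."""
        uploaded = 0
        for digest, blob in fragments.items():
            if digest not in self.fragments:  # skip fragments already stored
                self.fragments[digest] = blob
                uploaded += 1
        version = package_hash(manifest)
        self.packages[version] = manifest
        return version, uploaded

    def pull(self, version: str) -> dict:
        """Rebuild the logical name -> bytes mapping for a package version."""
        manifest = self.packages[version]
        return {name: self.fragments[digest] for name, digest in manifest.items()}


registry = Registry()
v1_files = {"train": b"x,y\n1,2\n", "test": b"x,y\n3,4\n"}
v1, uploaded_v1 = registry.push(*build(v1_files))

# Second version changes only "test"; "train" is already in the registry.
v2_files = {"train": b"x,y\n1,2\n", "test": b"x,y\n5,6\n"}
v2, uploaded_v2 = registry.push(*build(v2_files))
```

In this sketch the second push uploads a single fragment, and each version hash is an unambiguous reference to one exact set of contents, which is the property the reproducibility and deduplication benefits rely on.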