paradigmxyz/absorb

absorb 🧽🫧🫧

absorb makes it easy to 1) collect, 2) manage, 3) query, and 4) customize datasets from nearly any data source

🚧 this is a preview release of beta software, and it is still under active development 🚧

Features

  • limitless dataset library: access to millions of datasets across 20+ diverse data sources
  • intuitive cli+python interfaces: collect or query any dataset in a single line of code
  • maximal modularity: built on open standards for frictionless integration with other tools
  • easy extensibility: add new datasets or data sources with just a few lines of code

Contents

  1. Installation
  2. Example Usage
    1. Command Line
    2. Python
  3. Supported Data Sources
  4. Output Format
  5. Configuration

Installation

basic installation

uv tool install paradigm_absorb

install with all extras

uv tool install 'paradigm_absorb[test,datasources,interactive]'

install from source

git clone git@github.com:paradigmxyz/absorb.git
cd absorb
uv tool install --editable '.[test,datasources,interactive]'

Example Usage

Example Command Line Usage

# collect dataset and save as local files
absorb collect kalshi

# list datasets that are collected or available
absorb ls

# show schemas of dataset
absorb schema kalshi

# create new custom dataset
absorb new custom_dataset

# upload custom dataset
absorb upload custom_dataset

Example Python Usage

import absorb

# collect dataset and save as local files
absorb.collect('kalshi.metrics')

# get schemas of dataset
schema = absorb.get_schema('kalshi.metrics')

# query dataset eagerly, as polars DataFrame
df = absorb.query('kalshi.metrics')

# query dataset lazily, as polars LazyFrame
lf = absorb.query('kalshi.metrics', lazy=True)

# upload custom dataset
absorb.upload('source.table')

Supported Data Sources

🚧 under construction 🚧

absorb collects data from each of these sources:

To list all available datasets and data sources, type absorb ls on the command line.

To display information about the schema and other metadata of a dataset, type absorb help <DATASET> on the command line.

Output Format

absorb uses the filesystem as its database. Each dataset is stored as a collection of parquet files, either on local disk or in the cloud.

Datasets can be stored in any location on your disks, and absorb will use symlinks to organize those files in the ABSORB_ROOT tree.

The ABSORB_ROOT directory is organized as:

{ABSORB_ROOT}/
    datasets/
        <source>/
            tables/
                <datatype>/
                    {filename}.parquet
                table_metadata.json
            repos/
                {repo_name}/
    absorb_config.json
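
The layout above can be navigated with plain `pathlib`. The following is a minimal sketch, not part of absorb's API: the helper names `table_dir`, `list_parquet_files`, and `link_external_table` are hypothetical, the root directory is passed in explicitly, and the symlink is assumed to be made at the `<datatype>` directory level (the layout above does not say at which level absorb places its symlinks).

```python
from pathlib import Path


def table_dir(root: Path, source: str, datatype: str) -> Path:
    """Return {root}/datasets/<source>/tables/<datatype> (hypothetical helper)."""
    return root / "datasets" / source / "tables" / datatype


def list_parquet_files(root: Path, source: str, datatype: str) -> list[Path]:
    """List the parquet files of one table, sorted by filename."""
    return sorted(table_dir(root, source, datatype).glob("*.parquet"))


def link_external_table(
    root: Path, source: str, datatype: str, external: Path
) -> Path:
    """Symlink a directory stored elsewhere into the tree.

    Assumption: the link is created at the <datatype> directory level;
    absorb itself may organize its symlinks differently.
    """
    link = table_dir(root, source, datatype)
    link.parent.mkdir(parents=True, exist_ok=True)
    link.symlink_to(external.resolve(), target_is_directory=True)
    return link
```

For example, `list_parquet_files(Path('/data/absorb'), 'kalshi', 'metrics')` would return the parquet files under `/data/absorb/datasets/kalshi/tables/metrics/`, where `/data/absorb` is a placeholder root.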

Configuration

absorb uses a config file to specify which datasets to track.

Schema of absorb_config.json:

{
    'version': str,
    'tracked_tables': list[TableDict],
    'use_git': bool,
    'default_bucket': {
        'rclone_remote': str | None,
        'bucket_name': str | None,
        'path_prefix': str | None,
        'provider': str | None,
    },
}
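
As an illustration, the schema above can be mirrored as Python `TypedDict`s and round-tripped through `json`. This is a sketch under assumptions: the class names `BucketConfig` and `AbsorbConfig` are hypothetical, every value shown is a placeholder rather than an absorb default, and `tracked_tables` is typed loosely as `list[dict]` because `TableDict` is not defined here.

```python
import json
from typing import Optional, TypedDict


class BucketConfig(TypedDict):
    rclone_remote: Optional[str]
    bucket_name: Optional[str]
    path_prefix: Optional[str]
    provider: Optional[str]


class AbsorbConfig(TypedDict):
    version: str
    tracked_tables: list[dict]  # stand-in for list[TableDict]
    use_git: bool
    default_bucket: BucketConfig


# Placeholder values only; these are not absorb's real defaults.
example_config: AbsorbConfig = {
    "version": "0.0.0",
    "tracked_tables": [],
    "use_git": False,
    "default_bucket": {
        "rclone_remote": None,
        "bucket_name": None,
        "path_prefix": None,
        "provider": None,
    },
}

# Serialize the config as JSON text.
config_text = json.dumps(example_config, indent=4)
```

Note that the serialized file uses JSON syntax: double-quoted keys, `null` for `None`, and `false` for `False`.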

Schema of dataset_config.json:

{
    'source_name': str,
    'table_name': str,
    'table_class': str,
    'parameters': dict[str, JSONValue],
    'table_version': str,
}
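
A minimal sketch of checking a dict against the fields listed above before writing it to disk. The function name `check_dataset_config` and the example values (including the `table_class` string) are hypothetical and only illustrate the shape; `parameters` is checked only for being a dict with string keys, not for JSON-compatible values.

```python
# Field names and types taken from the schema above.
REQUIRED_FIELDS = {
    "source_name": str,
    "table_name": str,
    "table_class": str,
    "parameters": dict,
    "table_version": str,
}


def check_dataset_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape matches."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in config:
            problems.append(f"missing field: {field}")
        elif not isinstance(config[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    params = config.get("parameters")
    if isinstance(params, dict) and not all(isinstance(k, str) for k in params):
        problems.append("parameters keys should be str")
    return problems


# Hypothetical example values, for illustration only.
example = {
    "source_name": "kalshi",
    "table_name": "metrics",
    "table_class": "example.module.Metrics",
    "parameters": {},
    "table_version": "0.0.0",
}
```

Calling `check_dataset_config(example)` returns an empty list, and removing a field or giving it the wrong type yields one message per problem.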

License

Apache-2.0 (LICENSE-APACHE) and MIT (LICENSE-MIT)