# absorb

`absorb` makes it easy to 1) collect, 2) manage, 3) query, and 4) customize datasets from nearly any data source
🚧 this is a preview release of beta software, and it is still under active development 🚧
- limitless dataset library: access to millions of datasets across 20+ diverse data sources
- intuitive cli+python interfaces: collect or query any dataset in a single line of code
- maximal modularity: built on open standards for frictionless integration with other tools
- easy extensibility: add new datasets or data sources with just a few lines of code
```bash
# basic installation
uv tool install paradigm_absorb

# install with all extras
uv tool install 'paradigm_absorb[test,datasources,interactive]'

# install from source
git clone [email protected]:paradigmxyz/absorb.git
cd absorb
uv tool install --editable '.[test,datasources,interactive]'
```
```bash
# collect dataset and save as local files
absorb collect kalshi

# list datasets that are collected or available
absorb ls

# show the schema of a dataset
absorb schema kalshi

# create a new custom dataset
absorb new custom_dataset

# upload a custom dataset
absorb upload custom_dataset
```
```python
import absorb

# collect dataset and save as local files
absorb.collect('kalshi.metrics')

# get the schema of a dataset
schema = absorb.get_schema('kalshi.metrics')

# query dataset eagerly, as a polars DataFrame
df = absorb.query('kalshi.metrics')

# query dataset lazily, as a polars LazyFrame
lf = absorb.query('kalshi.metrics', lazy=True)

# upload a custom dataset
absorb.upload('source.table')
```
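The lazy result composes with ordinary polars operations and only executes when you call `.collect()`. A minimal sketch, using a hypothetical column name:

```python
import absorb
import polars as pl

# build up a lazy query; nothing runs until .collect()
lf = absorb.query('kalshi.metrics', lazy=True)

# 'volume' is a hypothetical column name, used only for illustration
result = lf.filter(pl.col('volume') > 0).select(pl.col('volume').sum()).collect()
```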
🚧 under construction 🚧
`absorb` collects data from each of these sources:
- `4byte`: function and event signatures
- `allium`: crypto data platform
- `bigquery`: crypto ETL datasets
- `binance`: trades and OHLC candles on the Binance CEX
- `blocknative`: Ethereum mempool archive
- `chain_ids`: chain IDs
- `coingecko`: token prices
- `cryo`: EVM datasets
- `defillama`: DeFi data
- `dune`: tables and queries
- `fred`: federal macroeconomic data
- `git`: commits, authors, and file diffs of a repo
- `growthepie`: L2 metrics
- `kalshi`: prediction market metrics
- `l2beat`: L2 metrics
- `mempool_dumpster`: Ethereum mempool archive
- `snowflake`: generalized data platform
- `sourcify`: verified contracts
- `tic`: US Treasury Department data
- `tix`: price feeds
- `vera`: verified contract archives
- `xatu`: many Ethereum datasets
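Each source name above becomes the first component of a dataset name, following the `source.table` pattern used throughout this README. As a sketch (the second table name below is a hypothetical example, not a confirmed dataset):

```python
import absorb

# 'kalshi.metrics' is the dataset used in the examples above
absorb.collect('kalshi.metrics')

# hypothetical table name, shown only to illustrate the naming scheme
absorb.collect('defillama.tvl')
```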
To list all available datasets and data sources, run `absorb ls` on the command line. To display a dataset's schema and other metadata, run `absorb help <DATASET>`.
`absorb` uses the filesystem as its database. Each dataset is stored as a collection of parquet files, either on local disk or in the cloud. Datasets can be stored in any location on your disks, and `absorb` will use symlinks to organize those files in the `ABSORB_ROOT` tree.
The `ABSORB_ROOT` filesystem directory is organized as:

```
{ABSORB_ROOT}/
    datasets/
        <source>/
            tables/
                <datatype>/
                    {filename}.parquet
                    table_metadata.json
    repos/
        {repo_name}/
    absorb_config.json
```
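Because each dataset is just parquet files on disk, you can also read it directly with polars and bypass `absorb`'s query layer. A minimal sketch, assuming the root defaults to `~/absorb` (an assumption) and using the `kalshi.metrics` layout from above:

```python
import os
import pathlib

import polars as pl

# the default root location here is an assumption; set ABSORB_ROOT to override
root = pathlib.Path(os.environ.get('ABSORB_ROOT', '~/absorb')).expanduser()

# lazily scan every parquet file of one dataset, per the layout above
lf = pl.scan_parquet(str(root / 'datasets' / 'kalshi' / 'tables' / 'metrics' / '*.parquet'))
df = lf.collect()
```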
`absorb` uses a config file to specify which datasets to track.

Schema of `absorb_config.json`:
```python
{
    'version': str,
    'tracked_tables': list[TableDict],
    'use_git': bool,
    'default_bucket': {
        'rclone_remote': str | None,
        'bucket_name': str | None,
        'path_prefix': str | None,
        'provider': str | None,
    },
}
```
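For illustration, a freshly initialized config might look like the following. All values here are hypothetical, and each `tracked_tables` entry is a `TableDict` (assumed here to match the `dataset_config.json` schema below):

```python
# hypothetical absorb_config.json contents, shown as a python literal
{
    'version': '0.1.0',
    'tracked_tables': [],       # filled in as datasets are tracked
    'use_git': False,
    'default_bucket': {
        'rclone_remote': None,  # e.g. an rclone remote name
        'bucket_name': None,
        'path_prefix': None,
        'provider': None,
    },
}
```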
Schema of `dataset_config.json`:
```python
{
    'source_name': str,
    'table_name': str,
    'table_class': str,
    'parameters': dict[str, JSONValue],
    'table_version': str,
}
```
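And a hypothetical filled-in instance of this schema, using the `kalshi.metrics` dataset from the examples above (the class name and version are invented for illustration):

```python
# hypothetical dataset_config.json contents, shown as a python literal
{
    'source_name': 'kalshi',
    'table_name': 'metrics',
    'table_class': 'KalshiMetrics',  # hypothetical class name
    'parameters': {},                # per-dataset parameters, if any
    'table_version': '1.0.0',
}
```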