Open an issue if you have questions!
Frictionless standards provide a lightweight and minimal abstraction layer (data packages are JSON/YAML files) on top of data files to make them easier to use. Adhering to the Frictionless specs makes it easier to integrate into the existing community and interoperate with all the datasets and tools already built.
Another interesting side effect of the Frictionless design philosophy is that it lets anyone package datasets in a permissionless way. You don't need to move the data; just wrap it with a simple metadata file.
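For instance, a minimal sketch using the `frictionless` Python library (the file name `data.csv` is illustrative): describe a file you already have, in place, and the resulting metadata is the whole "package".

```python
# A minimal sketch: describe an existing CSV without moving it.
# `data.csv` is an illustrative file name.
from frictionless import describe

resource = describe("data.csv")  # infers schema, format, encoding, ...
print(resource.to_yaml())        # this metadata file is the whole "package"
```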
It's already being used by organizations like Our World in Data, cooperatives like Catalyst, and many other places.
We need to solve the problem of "packaging data" as a community. Frictionless is a great starting point: integrating a new platform/format/scheme/portal into the ecosystem only takes writing a plugin/extension.
I've tried quite a few data package managers. Frictionless is the simplest and most flexible one, and it has reasonable adoption and an active community.
That said, I'm open to other options. If you have a better idea, let's chat!
Datasets could be IPFS-native. Clients could either fetch the data via IPFS or use a public gateway. For example:
```yaml
name: my-dataset
resources:
  - name: my-data
    type: table
    path: bafkreidgvpkjawlxz6sffxzwgooowe5yt7i6wsyg236mfoks77nywkptdq
    scheme: ipfs
```
In the end, the Frictionless abstraction is just a URL. We can use anything we want in the backend as long as we provide a way to read the data. In this case:
```python
from frictionless import Package

ipfs_package = Package("my-dataset-datapackage.yaml")  # Could even be Package("bafyreca4sf...")
ipfs_resource = ipfs_package.get_resource("my-data")

# Enjoy!
ipfs_resource.to_pandas()
ipfs_resource.sql("SELECT * FROM my-data")
```
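A hedged sketch of one way to provide that read path, assuming we simply resolve the `ipfs` scheme through a public HTTP gateway (the gateway URL and the `csv` format are illustrative choices):

```python
# A hedged sketch, not part of Frictionless: resolve an `ipfs`-scheme resource
# to a public HTTP gateway URL so any HTTP-capable reader can fetch it.
from frictionless import Resource

IPFS_GATEWAY = "https://ipfs.io/ipfs/"  # any public gateway works

def resource_from_cid(cid: str, fmt: str = "csv") -> Resource:
    """Wrap a bare CID as a Frictionless Resource served over HTTP."""
    return Resource(path=IPFS_GATEWAY + cid, format=fmt)

resource = resource_from_cid("bafkreidgvpkjawlxz6sffxzwgooowe5yt7i6wsyg236mfoks77nywkptdq")
print(resource.to_pandas().head())
```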
Depending on the dataset, this feature could be pushed to the hosting layer. If you publish to Hugging Face, you get versioning and backups for free! Once the data is there, we can rely on a `_cache` property in the data package descriptor (or a `_backup` one) to point to the previous backup.
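For instance (a hedged sketch; `_cache` is used here as a custom, underscore-prefixed property and the Hugging Face URL pattern is illustrative):

```yaml
resources:
  - name: my-data
    path: https://example.org/my-data.csv
    _cache: https://huggingface.co/datasets/<user>/my-dataset/resolve/main/my-data.csv
```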
This is something we have to do as a community. A great start is to create Catalogs. Storing the Catalog definitions in places like GitHub will make it easy to discover them and surface the best ones. In the end, a data package is just a URL, so a catalog can be as simple as a list of those URLs:
```yaml
datasets:
  - name: airport-codes
    package: https://raw.githubusercontent.com/datasets/airport-codes/master/datapackage.json
  - name: country-codes
    package: https://raw.githubusercontent.com/datasets/country-codes/master/datapackage.json
```
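A minimal sketch of how a client could consume such a catalog, assuming it is saved as `catalog.yaml` (the file name is illustrative):

```python
# A minimal sketch: read the catalog above and load each entry as a
# Frictionless data package straight from its URL.
import yaml
from frictionless import Package

with open("catalog.yaml") as f:
    catalog = yaml.safe_load(f)

for entry in catalog["datasets"]:
    package = Package(entry["package"])  # a data package is just a URL
    print(entry["name"], [resource.name for resource in package.resources])
```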
The tabular resource representation can be an Arrow Table. With that, we get access to the whole Apache Arrow ecosystem. Data should be just a `resource.to_arrow()` call away!
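Until a native `to_arrow()` is wired up, a hedged sketch of the same idea going through pandas and `pyarrow` (the URL is illustrative):

```python
# A hedged sketch: load a resource with Frictionless and hand it to Arrow via
# pandas. A native `to_arrow()` would remove the intermediate step.
import pyarrow as pa
from frictionless import Resource

resource = Resource("https://example.org/my-data.csv")  # illustrative URL
table = pa.Table.from_pandas(resource.to_pandas())
print(table.schema)
```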
Additionally, using a file system abstraction like `fsspec` makes it easy to interact with different "remotes" like S3, GCS, HDFS, etc.
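A hedged sketch of that abstraction, assuming `s3fs` is installed (the bucket and key below are illustrative):

```python
# A hedged sketch: fsspec gives one open() for local files, HTTP, S3, GCS, HDFS...
import fsspec

with fsspec.open("s3://my-bucket/my-data.csv", mode="rt", anon=True) as f:
    print(f.readline())  # the CSV header, regardless of where the file lives
```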
The Frictionless framework is extensible via plugins/extensions. You can write a plugin to integrate any platform into the ecosystem. For example, a plugin for HuggingFace datasets could make your package look something like this:
```yaml
name: hf-dataset
title: Hugging Face Dataset
resources:
  - name: rotten_tomatoes
    type: table
    path: rotten_tomatoes
    format: huggingface
    schema:
      fields:
        - name: text
          type: string
        - name: label
          type: integer
```
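Such a plugin doesn't exist yet; a hedged sketch of what it would do under the hood, using the existing `datasets` library:

```python
# A hedged sketch of what a hypothetical `huggingface` format handler would do:
# resolve the resource path to a dataset on the Hub and return its rows.
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes", split="train")
df = dataset.to_pandas()
print(df[["text", "label"]].head())
```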
Some interesting plugin ideas might be integrations with Socrata (Simon Willison did something similar), Kaggle Datasets, or DataLad.
By versioning the code and the data together, it should be possible to reproduce the dataset. The easiest way to do this is to publish datasets via GitHub Actions; that way the code and the data are always in sync. Furthermore, attaching a Docker image and a Dev Container environment makes it easy to reproduce the dataset anywhere.
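A minimal sketch of such a workflow, assuming a hypothetical `scripts/build_dataset.py` that regenerates the data and the descriptor (the schedule and commit flow are illustrative):

```yaml
name: publish-dataset
on:
  schedule:
    - cron: "0 6 * * 1"  # rebuild weekly
  workflow_dispatch:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install frictionless
      - run: python scripts/build_dataset.py
      - run: |
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add data/ datapackage.yaml
          git commit -m "Update dataset" || echo "Nothing to commit"
          git push
```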
Mainly by using Git, but "Git for data" is another problem that opens a deep rabbit hole and could be solved by a separate project. The ideal would be to make everything append-only, rely on immutable and content-addressed storage, and use Git to version the metadata.
If only using Git, `git-history` provides a simple starting point to explore the committed files.
Grab the `datapackage.json` or `datapackage.yaml` URI and paste it into a data explorer.
Yes, the new LLMs could help with this vision. A few things that could be automated:
- Extract data and generate resources from anything. Define the schema and let GPT-N do the rest (see the sketch after this list). Some projects are already working on this.
- Can data packages be written in natural language? Can we use GPT-N to generate them? ChatGPT plugins are starting to be written with only a `description_for_model` text; could something like this work for data packages? Embeddings become the flexible metadata we all want.
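A hedged sketch of the first idea above: hand the model a Table Schema and some raw text, and ask for rows that fit it. The model name, prompt, and sample are illustrative, and the output would still need validation (e.g. with `frictionless validate`).

```python
# A hedged sketch: ask an LLM to extract rows that match a given Table Schema.
import json
from openai import OpenAI

schema = {"fields": [{"name": "airport", "type": "string"},
                     {"name": "iata_code", "type": "string"}]}
raw_text = "Flights leave from Barajas (MAD) and El Prat (BCN) every hour."

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Extract rows matching this Table Schema as JSON: "
                   + json.dumps(schema) + "\n\nText:\n" + raw_text,
    }],
)
print(response.choices[0].message.content)
```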
Yes! It is probably not the best fit, but the basic idea would be to have a table pointing to the URIs of the non-tabular data. For example, you could have a dataset of sounds, images, or videos by having a column with the URI of each file.
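A minimal sketch of such a resource (all names are illustrative): a plain table whose rows point at audio files by URI.

```yaml
name: bird-sounds
resources:
  - name: recordings
    type: table
    path: recordings.csv
    schema:
      fields:
        - name: species
          type: string
        - name: audio_uri
          type: string
          format: uri
```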
TODO: Explain why people should use the project. Coldstart problem, incentives, etc.
TODO: Explain how the project will be funded and how it will generate revenue. Grants, consulting, selling curated data, bounties, etc.