Restructure Parquet dataset download and processing #46

daverigby · 2024-05-10T08:05:29Z

The Dataset class in datasets.py handles multiple aspects of Parquet datasets:

- Downloading Parquet datasets (each consisting of multiple files across a passages set and a queries set)
- Maintaining a local cache of previously downloaded files
- Reading records from Parquet files
  - For passages, this is done by iterating over batches
  - For queries, this is done by reading the entire query set into memory

Additionally, there is functionality which is currently unused:

- Generating queries by sampling passages (for datasets which don't have a queries set)

All of this leads to a class which is more complex than it needs to be, and which is harder to test as a result.

Look at how we can restructure this to simplify it, potentially splitting the class into multiple independent parts (downloading, caching, reading) and removing functionality we don't need.
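
One possible shape for such a split, as a rough sketch only. `FileCache`, `Downloader` and `read_records` are hypothetical names, and the transport and file-format reader are injected as plain callables so that each part can be tested on its own. None of this reflects the existing code.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, Iterator


@dataclass
class FileCache:
    """Knows where dataset files live locally and whether they are present."""
    root: Path

    def path_for(self, name: str) -> Path:
        return self.root / name

    def has(self, name: str) -> bool:
        return self.path_for(name).is_file()


@dataclass
class Downloader:
    """Fetches files missing from the cache; the transport is injected."""
    cache: FileCache
    fetch: Callable[[str, Path], None]  # (remote name, local destination)

    def ensure(self, names: Iterable[str]) -> list[Path]:
        paths = []
        for name in names:
            dest = self.cache.path_for(name)
            if not self.cache.has(name):
                dest.parent.mkdir(parents=True, exist_ok=True)
                self.fetch(name, dest)
            paths.append(dest)
        return paths


def read_records(
    paths: Iterable[Path],
    read_file: Callable[[Path], Iterator[dict]],
) -> Iterator[dict]:
    """Yield records from each file in turn; the file format is injected."""
    for path in paths:
        yield from read_file(path)
```

With this layout a test can pass a fake `fetch` that writes a small local file, and check cache hits and record reading without any network access.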

daverigby added the enhancement New feature or request label May 10, 2024

daverigby added this to the Phase 2: More workloads, more databases milestone May 10, 2024

This was referenced Jun 11, 2024

Distribute parquet population over all users even if num_files < num_users #101 (Closed)

Remove dead code from dataset.py #114 (Merged)
