Restructure Parquet dataset download and processing #46

Open
daverigby opened this issue May 10, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

Collaborator

daverigby commented May 10, 2024

The Dataset class in datasets.py handles several aspects of Parquet datasets:

  • Downloading Parquet datasets (each consisting of multiple files, split into a passages set and a queries set)
  • Maintaining a local cache of previously downloaded files
  • Reading records from Parquet files
    • For passages, this is done by iterating over batches
    • For queries, this is done by reading the entire query set into memory

Additionally, there is functionality which is currently unused:

  • Generating queries by sampling passages (where a dataset doesn't have a queries set).
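
For reference, the idea behind the unused query generation can be sketched in a few lines. This is a hypothetical stand-in, not the existing implementation: the function name, the seed parameter, and the choice to sample without replacement are all assumptions.

```python
import random
from typing import Sequence


def sample_queries(
    passages: Sequence[dict], num_queries: int, seed: int = 0
) -> list[dict]:
    """Pick passages at random (without replacement) to act as queries."""
    rng = random.Random(seed)  # seeded so repeated runs pick the same queries
    count = min(num_queries, len(passages))
    return rng.sample(list(passages), count)
```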

All of this makes the class more complex than it needs to be, and harder to test.

Investigate how to restructure and simplify this, potentially by splitting the class into multiple independent parts (downloading, caching, reading) and removing functionality we don't need.
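
One possible shape for such a split is sketched below. Every name here (Downloader, CopyDownloader, LocalCache, read_records, read_lines) is hypothetical and does not come from the codebase, and plain text files stand in for Parquet so the sketch runs with the standard library alone.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, Iterator, Protocol


class Downloader(Protocol):
    """Anything that can fetch one remote file to a local destination."""

    def fetch(self, remote_name: str, dest: Path) -> None:
        ...


class CopyDownloader:
    """Stand-in downloader that copies from a local directory.

    A real implementation would fetch from remote storage instead.
    """

    def __init__(self, source_dir: Path) -> None:
        self.source_dir = source_dir
        self.calls = 0  # counts fetches, so tests can check cache hits

    def fetch(self, remote_name: str, dest: Path) -> None:
        self.calls += 1
        shutil.copyfile(self.source_dir / remote_name, dest)


@dataclass
class LocalCache:
    """Maps remote file names to local paths, downloading only on a miss."""

    root: Path
    downloader: Downloader

    def path_for(self, remote_name: str) -> Path:
        local = self.root / remote_name
        if not local.exists():
            local.parent.mkdir(parents=True, exist_ok=True)
            self.downloader.fetch(remote_name, local)
        return local


# A reader is any callable that turns one file into a stream of records.
FileReader = Callable[[Path], Iterator[dict]]


def read_lines(path: Path) -> Iterator[dict]:
    """Toy reader: one record per line of a text file."""
    with path.open() as f:
        for line in f:
            yield {"text": line.rstrip("\n")}


def read_records(paths: Iterable[Path], read_file: FileReader) -> Iterator[dict]:
    """Read records from several files in order; knows nothing about caching."""
    for path in paths:
        yield from read_file(path)
```

With this layout the cache can be tested against a fake downloader, and the reader can be tested against files on disk without any download or cache involved.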
