`Dataset` from `datasets.py` handles multiple aspects of Parquet datasets:
- Downloading Parquet datasets (each consisting of multiple files in `passages` and `queries` sets)
- Maintaining a local cache of previously downloaded files
- Reading records from Parquet files
  - For `passages`, this is done by iterating over batches
  - For `queries`, this is done by reading the entire query set into memory

Additionally, there's functionality which is currently unused:
- Generating `queries` by sampling `passages` (where a dataset doesn't have a queries set)
All of this makes the class more complex than it needs to be, and harder to test.
We should look at how to restructure and simplify it, potentially by splitting the class into multiple independent parts (downloading, caching, reading) and removing functionality we don't need.
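
As a starting point for discussion, here is a minimal sketch of what the split could look like. Every name in it (`FileCache`, `Fetch`, `ReadRecords`, `iter_records`, `load_all`) is hypothetical and not taken from the current `datasets.py`; the download and Parquet-reading steps are passed in as plain callables, which is an assumption made so that each part can be tested without network access or real Parquet files.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical type aliases; nothing here exists in datasets.py today.
# Fetch: download the remote file called `name` to the given local path.
Fetch = Callable[[str, Path], None]
# ReadRecords: yield records (as dicts) from one local file.
ReadRecords = Callable[[Path], Iterator[Dict]]


@dataclass
class FileCache:
    """Caching only: maps file names to local paths, fetching on a miss."""

    root: Path
    fetch: Fetch

    def get(self, name: str) -> Path:
        local = self.root / name
        if not local.exists():
            local.parent.mkdir(parents=True, exist_ok=True)
            self.fetch(name, local)
        return local


def iter_records(
    cache: FileCache, names: Iterable[str], read: ReadRecords
) -> Iterator[Dict]:
    """Reading only: stream records file by file (the passages case)."""
    for name in names:
        yield from read(cache.get(name))


def load_all(
    cache: FileCache, names: Iterable[str], read: ReadRecords
) -> List[Dict]:
    """Materialise every record in memory (the queries case)."""
    return list(iter_records(cache, names, read))
```

With this shape, the cache can be unit-tested with a fake `fetch` that just writes a file, and the reading functions can be tested with a fake `read` that yields fixed records. The real implementations would plug in the existing download logic and a Parquet batch reader.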