We currently have 3 medical datasets that aren't working:
The fixes in this PR introduce the concept of a `target feature` when handling HF datasets, to explicitly name the column we would like to use as the `label` column (which may not exist in the dataset). I also added handlers and conversions for non-standard content in that column.

Datasets (2) and (3) now download successfully and are compatible with the existing models, but dataset (1) still has an incompatibility issue when building the embeddings. We should decide whether we want dataset (1) at all in this phase, as its data alone occupies 85 GB.
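
To illustrate the `target feature` idea, here is a minimal sketch of how a chosen column could be converted into a `label` column. It is not the code from this PR: the helper names (`to_label_id`, `add_label_column`), the `diagnosis` column, and the class list are all hypothetical, and rows are plain dicts rather than a real HF dataset so the example stays self-contained.

```python
from typing import Any, Dict, List


def to_label_id(value: Any, classes: List[str]) -> int:
    """Map a raw target value to an integer class id (hypothetical helper)."""
    if isinstance(value, int):
        # Already an id: only check that it is in range.
        if 0 <= value < len(classes):
            return value
        raise ValueError(f"label id {value} is out of range")
    if isinstance(value, (list, tuple)):
        # Assumption: a one-element sequence wraps a single label.
        if len(value) != 1:
            raise ValueError(f"expected exactly one label, got {value!r}")
        return to_label_id(value[0], classes)
    # Anything else is treated as a class name, matched case-insensitively.
    text = str(value).strip().lower()
    lowered = [name.lower() for name in classes]
    if text in lowered:
        return lowered.index(text)
    raise ValueError(f"unknown label {value!r}")


def add_label_column(
    rows: List[Dict[str, Any]], target_feature: str, classes: List[str]
) -> List[Dict[str, Any]]:
    """Return copies of the rows with a `label` key derived from `target_feature`."""
    converted = []
    for row in rows:
        if target_feature not in row:
            raise KeyError(f"target feature {target_feature!r} is missing")
        new_row = dict(row)  # leave the input rows untouched
        new_row["label"] = to_label_id(row[target_feature], classes)
        converted.append(new_row)
    return converted


if __name__ == "__main__":
    sample = [
        {"text": "a", "diagnosis": "Positive"},
        {"text": "b", "diagnosis": [0]},
        {"text": "c", "diagnosis": 1},
    ]
    for row in add_label_column(sample, "diagnosis", ["negative", "positive"]):
        print(row["text"], row["label"])
```

With a real HF dataset, the same per-row conversion would presumably be applied through the library's own mapping utilities instead of a Python loop.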
As part of the testing process, I hit a blocker where the disk ran out of space: the dataset cache wasn't set to a single location for all users and runs, which ended up filling a small-capacity drive in the VM. Now, by default, all dataset-related content is stored in the `.cache` dir in the project's root.
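
One way to pin the cache to a single location is sketched below. It is an assumption about the approach, not the change made in this PR: `configure_dataset_cache` is a hypothetical helper, and it relies on the `HF_DATASETS_CACHE` environment variable that the Hugging Face `datasets` library reads for its cache location. To my understanding the variable has to be set before `datasets` is imported for it to take effect.

```python
import os
from pathlib import Path


def configure_dataset_cache(project_root: Path) -> Path:
    """Create `<project_root>/.cache` and point the HF datasets cache at it.

    Hypothetical helper; call it before importing `datasets`.
    """
    cache_dir = project_root / ".cache"
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Every user and run that goes through this helper shares one directory.
    os.environ["HF_DATASETS_CACHE"] = str(cache_dir)
    return cache_dir


if __name__ == "__main__":
    # Assumption: the script is launched from the project's root.
    print(configure_dataset_cache(Path.cwd()))
```

Passing `cache_dir=` explicitly to `load_dataset` is an alternative that avoids depending on import order, at the cost of threading the path through every call site.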