feat: fix broken medical datasets #39

Merged
12 commits merged on Mar 8, 2024

eloy-encord
Contributor

We currently have 3 medical datasets that aren't working:

  1. NIH-Chest-X-ray,
  2. skin-cancer,
  3. chest-xray-classification.

The fixes in this PR introduce the concept of a target feature when handling HF datasets: it explicitly names the column we would like to use as the label column (the expected label column may not exist in the dataset). I also added some handlers and conversions for non-standard content in that column.
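
As a rough illustration of the target feature idea (a sketch under assumptions, not the code in this PR): `resolve_target_feature` is a hypothetical helper, and it works on a plain dict of columns rather than on an HF dataset object. It exposes the configured target column under the `label` name that the models are assumed to expect.

```python
from typing import Dict, List

EXPECTED_LABEL = "label"  # column name assumed to be expected by the models


def resolve_target_feature(columns: Dict[str, List], target_feature: str) -> Dict[str, List]:
    """Return a copy of `columns` with `target_feature` exposed as 'label'."""
    if target_feature not in columns:
        raise KeyError(
            f"target feature {target_feature!r} not found; available: {sorted(columns)}"
        )
    # Keep every other column as-is and drop the original target column name.
    resolved = {name: values for name, values in columns.items() if name != target_feature}
    resolved[EXPECTED_LABEL] = columns[target_feature]
    return resolved
```

For example, a dataset that ships a `labels` column would be handled with `resolve_target_feature(columns, "labels")`.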

Datasets (2) and (3) download successfully and are compatible with the existing models, but dataset (1) still has an incompatibility issue when building the embeddings. We should decide whether we want dataset (1) at all in this phase, as its data alone occupies 85 GB.

While testing, I hit a blocker: the disk ran out of space because the dataset cache wasn't set to a single location for all users and runs, which ended up filling a small-capacity drive in the VM. Now, by default, all dataset-related content is stored in the .cache dir in the project's root.
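
A minimal sketch of pinning the cache to one project-level location, assuming the cache is controlled through the `HF_DATASETS_CACHE` environment variable that the `datasets` library reads; `configure_dataset_cache` is a hypothetical name and the PR may wire this up differently.

```python
import os
from pathlib import Path


def configure_dataset_cache(project_root: Path) -> Path:
    """Point the HF datasets cache at `<project_root>/.cache` and return that path."""
    cache_dir = project_root / ".cache"
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: this must run before `datasets` is imported, since the
    # library reads the variable when it is first loaded.
    os.environ["HF_DATASETS_CACHE"] = str(cache_dir)
    return cache_dir
```

Calling this once at start-up, before any dataset code is imported, keeps every user and run writing to the same directory.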

The handler was missing the name of the dataset configuration and also had a feature column 'labels' instead of the expected 'label'.
Added the required target feature and the encoding from string to integer values.
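
The string-to-integer encoding mentioned above could look roughly like this. It is a sketch only: `encode_labels` is a hypothetical helper, and assigning ids in sorted order of the class names is an assumption, not necessarily what the handler does.

```python
from typing import Dict, List, Tuple


def encode_labels(labels: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map string labels to integer ids, assigned in sorted order of class name."""
    class_to_id = {name: idx for idx, name in enumerate(sorted(set(labels)))}
    encoded = [class_to_id[name] for name in labels]
    return encoded, class_to_id
```

Returning the mapping alongside the encoded values makes it possible to translate predictions back to class names later.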
@eloy-encord eloy-encord marked this pull request as ready for review March 4, 2024 15:35
Remove the source folder, as we want to enable fetching, copying and deleting individual datasets regardless of their source. Now it's as easy as performing the action on the folder with the dataset title (all information related to the dataset is available in that folder).
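
With one folder per dataset title, deleting a dataset reduces to removing that folder. The sketch below assumes such a layout under a common root; `delete_dataset` and `datasets_root` are hypothetical names, not identifiers from this PR.

```python
import shutil
from pathlib import Path


def delete_dataset(datasets_root: Path, title: str) -> bool:
    """Remove the folder named after the dataset title; return False if it is absent."""
    dataset_dir = datasets_root / title
    if not dataset_dir.is_dir():
        return False
    shutil.rmtree(dataset_dir)
    return True
```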
@eloy-encord eloy-encord merged commit e6698b3 into main Mar 8, 2024
1 check passed
@eloy-encord eloy-encord deleted the eloy/fix-broken-datasets branch March 8, 2024 11:29