feat: fix broken medical datasets #39

eloy-encord · 2024-03-04T15:32:10Z

We currently have 3 medical datasets that aren't working:

1. NIH-Chest-X-ray
2. skin-cancer
3. chest-xray-classification

The fixes in this PR introduce the concept of a target feature when handling HF datasets, to explicitly name the column we would like to use as the label column (a column that may not exist in the dataset). I also added some handlers and conversions for non-standard content in that column.
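To make the target feature idea concrete, here is a minimal sketch that works on plain dictionaries rather than on the actual handlers in this PR. The function name `select_target` and the sample rows are hypothetical. It only illustrates exposing a chosen column under the expected `label` name. With HF datasets the renaming step could plausibly be done with `Dataset.rename_column`, but that is an assumption about the implementation, not a description of it.

```python
from typing import Any, Dict, List


def select_target(
    records: List[Dict[str, Any]],
    target_feature: str,
    label_column: str = "label",
) -> List[Dict[str, Any]]:
    """Return copies of `records` with `target_feature` exposed as `label_column`."""
    normalised = []
    for row in records:
        if target_feature not in row:
            # Fail loudly: the requested target column is not in this row.
            raise KeyError(f"target feature {target_feature!r} not found in row")
        # Copy every other column, then store the target under the expected name.
        new_row = {key: value for key, value in row.items() if key != target_feature}
        new_row[label_column] = row[target_feature]
        normalised.append(new_row)
    return normalised


# Hypothetical rows whose label column is called 'labels' instead of 'label'.
rows = [{"image": "a.png", "labels": 0}, {"image": "b.png", "labels": 1}]
normalised_rows = select_target(rows, target_feature="labels")
```

Naming the column explicitly keeps the downstream code unaware of per-dataset naming quirks.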

Datasets (2) and (3) are downloaded successfully and are compatible with the existing models, but dataset (1) still has an incompatibility issue when building the embeddings. We should assess whether we want dataset (1) at all in this phase, as its data alone occupies 85 GB.

As part of the testing process, I hit a blocker where the disk ran out of space: the dataset cache wasn't set to a single location for all users and runs, so it ended up filling a small-capacity drive in the VM. Now, by default, all dataset-related content is stored in the .cache dir in the project's root.
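The cache behaviour described above (a single default location, resolved to an absolute path, with an optional per-call override) could look roughly like the sketch below. The names `PROJECT_ROOT`, `GLOBAL_CACHE_DIR` and `resolve_cache_dir` are hypothetical, and using the current working directory as the project root is an assumption made only so the example is self-contained.

```python
from pathlib import Path
from typing import Optional, Union

# Assumption: the project root is the current working directory.
PROJECT_ROOT = Path.cwd()
# Single default location for all dataset-related content, as an absolute path.
GLOBAL_CACHE_DIR = (PROJECT_ROOT / ".cache").resolve()


def resolve_cache_dir(cache_dir: Optional[Union[str, Path]] = None) -> Path:
    """Return an absolute cache path, falling back to the global cache."""
    chosen = Path(cache_dir) if cache_dir is not None else GLOBAL_CACHE_DIR
    # Resolving makes the result independent of later working-directory changes.
    return chosen.expanduser().resolve()


default_cache = resolve_cache_dir()
custom_cache = resolve_cache_dir("some/relative/cache")
```

Resolving to an absolute path up front means every user and run lands in the same directory regardless of where the process was started.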

The handler was missing the name of the dataset configuration and also had a feature column 'labels' instead of expected 'label'.

Added the required target feature and the encoding from string to integer values.
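As an illustration of encoding string labels as integers, the sketch below maps each distinct class name to an index in sorted order. The function `encode_labels` and the sample class names are made up for the example and are not taken from the skin-cancer dataset. HF datasets offers `Dataset.class_encode_column` for a similar purpose, but whether this PR uses it is not stated here.

```python
from typing import List, Sequence, Tuple


def encode_labels(values: Sequence[str]) -> Tuple[List[int], List[str]]:
    """Map string labels to integer ids; also return the ordered class names."""
    # Sorting gives a deterministic name-to-id mapping across runs.
    class_names = sorted(set(values))
    name_to_id = {name: index for index, name in enumerate(class_names)}
    return [name_to_id[value] for value in values], class_names


# Hypothetical string labels, for illustration only.
encoded, class_names = encode_labels(["malignant", "benign", "malignant"])
```

Returning the class names alongside the ids keeps the mapping reversible when results need to be reported with readable labels.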

Removed the source folder, since we want to enable fetching, copying and deleting individual datasets regardless of their source. Now it's as easy as performing the action on the folder named after the dataset title (all information related to the dataset is available in that folder).
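With one folder per dataset title, copying or deleting a dataset reduces to a single directory operation. The sketch below shows this with the standard library only. The helper names, the `datasets` and `backup` roots, and the `info.json` file are hypothetical and exist only to make the example runnable.

```python
import shutil
import tempfile
from pathlib import Path


def dataset_dir(root: Path, title: str) -> Path:
    """Folder that holds everything related to the dataset called `title`."""
    return Path(root) / title


def copy_dataset(root: Path, title: str, destination_root: Path) -> Path:
    """Copy the whole dataset folder under another root and return the new path."""
    target = dataset_dir(destination_root, title)
    shutil.copytree(dataset_dir(root, title), target)
    return target


def delete_dataset(root: Path, title: str) -> None:
    """Delete the dataset folder and everything inside it."""
    shutil.rmtree(dataset_dir(root, title))


# Demonstration in a temporary directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "datasets"
    backup = Path(tmp) / "backup"
    (root / "skin-cancer").mkdir(parents=True)
    (root / "skin-cancer" / "info.json").write_text("{}")  # placeholder file

    copied = copy_dataset(root, "skin-cancer", backup)
    delete_dataset(root, "skin-cancer")

    copy_ok = (copied / "info.json").exists()
    source_gone = not (root / "skin-cancer").exists()
```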

eloy-encord added 8 commits March 1, 2024 14:42

fix: chest-xray-classification dataset

26bffc5

The handler was missing the name of the dataset configuration and also had a feature column 'labels' instead of expected 'label'.

fix: nih-chest-x-ray dataset

0129549

The handler was missing the name of the dataset configuration and also had a feature column 'labels' instead of expected 'label'.

feat: add global cache configuration for datasets

a2d7315

fix: redirect global cache path to an absolute path

ccc1d0e

chore: update cache dir management in datasets

97b339c

feat: enable target column selection in HF datasets

93ffb20

fix: drop potential wrapper around the class names

d90a5b1

fix: skin-cancer dataset

26e6756

Added the required target feature and the encoding from string to integer values.

eloy-encord requested a review from frederik-encord March 4, 2024 15:32

eloy-encord marked this pull request as ready for review March 4, 2024 15:35

eloy-encord added 4 commits March 5, 2024 11:14

misc: make cache_dir optional, default to global cache

ea3f91e

Merge branch 'main' into eloy/fix-broken-datasets

7e69097

Merge branch 'main' into eloy/fix-broken-datasets

c9a121b

frederik-encord approved these changes Mar 8, 2024

eloy-encord merged commit e6698b3 into main Mar 8, 2024
1 check passed

eloy-encord deleted the eloy/fix-broken-datasets branch March 8, 2024 11:29
