feat: Custom labels #27

Jim-Encord · 2024-02-22T17:02:41Z

The issue appeared once we loaded the "dx" column: it is stored as strings, one per category, rather than as encoded class labels. I looked for a while for a method that used either a nicer PyTorch feature or some HF column transformer, but couldn't find one.

This seems okay. A possible alternative approach is:

dataset.set_transform(lambda x: feature_map[model.get_transform(x)])

which may work better because it is applied later in the process. I would appreciate some eyes and discussion on the best approach. It is also unclear how to programmatically select the label out of 'NIH-Chest-X-ray-dataset'.
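For illustration, a lazy label encoding could look roughly like the sketch below. It is pure Python so that it runs without the `datasets` library; `LABEL_COLUMN`, `build_label_map`, and `make_encode_transform` are hypothetical names, not code from this PR. The only thing borrowed from HF is the shape of the input: a function given to `set_transform` receives a batch as a dict of lists.

```python
from typing import Callable, Dict, List

# Assumption: the string-valued category column mentioned above.
LABEL_COLUMN = "dx"

Batch = Dict[str, list]


def build_label_map(values: List[str]) -> Dict[str, int]:
    """Assign a stable integer id to each distinct string category."""
    return {name: idx for idx, name in enumerate(sorted(set(values)))}


def make_encode_transform(
    label_map: Dict[str, int],
    source: str = LABEL_COLUMN,
    target: str = "label",
) -> Callable[[Batch], Batch]:
    """Return a batch transform that adds an integer `target` column."""

    def transform(batch: Batch) -> Batch:
        out = dict(batch)  # shallow copy so the input batch is not mutated
        out[target] = [label_map[value] for value in batch[source]]
        return out

    return transform


# Toy stand-in for the real column values.
rows = {"dx": ["nv", "mel", "nv", "bkl"]}
label_map = build_label_map(rows["dx"])
encode = make_encode_transform(label_map)

# Encoding happens only when a batch is requested, not at load time.
encoded = encode({"dx": ["mel", "nv"]})
```

With a real HF dataset, `encode` would presumably be passed as `dataset.set_transform(encode)`, composed with whatever image transform the model needs, since `set_transform` installs a single function.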

eloy-encord · 2024-02-23T11:35:30Z

We shouldn't be adding dataset-specific variables to all the HF datasets. Can you check whether the label can be used as a parameter of load_dataset()? Otherwise it would be wise to just add a line after adding the dataset that performs the subsetting if required.

Jim-Encord · 2024-02-23T11:54:19Z

I don't think the label can be used as an argument to load_dataset (I've looked into the source and checked whether their 'features' param can do it, and I believe not). The purpose of the label argument is to support datasets like the Skin_cancer dataset, which haven't explicitly declared and encoded a label column, so that we can select the column after inspecting the dataset. I can't really think of a way around this complexity. We could instead have the HF loader method apply a transformation that takes the "label" column, encodes it, and makes that the "label" column. But I wanted it to be applied 'lazily'.

eloy-encord · 2024-02-23T12:03:12Z

We need to rethink this: the point of using **kwargs in the build method is to feed custom parameters to custom datasets, so adding the custom params to all the datasets is not a great plan. We are also thinking about those datasets that will require some data manipulation in the download/extraction phase, so an input manipulation function could probably be good.

Nevertheless, we should avoid putting dataset logic in the embeddings when feasible, and in this case we won't use the object detection layer.
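As a rough sketch of the **kwargs point, the pattern could look like the following. Every name here (`GenericDataset`, `SkinCancerDataset`, `REGISTRY`, `build`, `target_column`) is hypothetical and only illustrates the idea; it is not the project's actual build method.

```python
from typing import Any, Callable, Dict


class GenericDataset:
    """Generic wrapper; it knows nothing about any particular dataset."""

    def __init__(self, title: str) -> None:
        self.title = title


class SkinCancerDataset(GenericDataset):
    """Custom dataset that owns its dataset-specific parameter."""

    def __init__(self, title: str, *, target_column: str = "dx") -> None:
        super().__init__(title)
        self.target_column = target_column


# Hypothetical registry mapping a dataset name to the class that builds it.
REGISTRY: Dict[str, Callable[..., GenericDataset]] = {
    "generic": GenericDataset,
    "skin_cancer": SkinCancerDataset,
}


def build(name: str, **kwargs: Any) -> GenericDataset:
    """Forward custom parameters only to the dataset class that declares them."""
    return REGISTRY[name](name, **kwargs)


generic = build("generic")
skin = build("skin_cancer", target_column="dx")
```

In this arrangement the generic class never gains a label parameter, and passing `target_column` to a dataset that does not declare it fails with a `TypeError` instead of being silently ignored.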

Jim-Encord · 2024-02-23T12:10:20Z

Yeah, fair point. I'll try to adapt this later, inspecting your encord PR as is.

eloy-encord · 2024-04-15T09:05:02Z

Closing this one: we updated the code in other PRs and now allow custom target columns and configurations for HF datasets, and NIH-Chest-X-ray-dataset requires much more customisation than the rest of the datasets.

Jim-Encord added 3 commits February 23, 2024 16:24

Introduced label field

b163484

Can now load in the skin cancer dataset. Potentially an ugly solution

a218a27

Moved towards selecting labels in the __getitem__ method

c050d7e

Jim-Encord force-pushed the jb/custom_labels branch from b4eb7de to c050d7e Compare February 23, 2024 16:25

eloy-encord closed this Apr 15, 2024

frederik-encord deleted the jb/custom_labels branch April 22, 2024 14:15
