feat: Custom labels #27

Closed
wants to merge 3 commits into from

Jim-Encord
Contributor

The issue occurs once we load the "dx" column: it is stored as a string naming each of the categories rather than as an encoded label. I looked for a while for a cleaner method, either a PyTorch feature or some HF column transformer, but couldn't find one.

This seems okay. A possible alternative approach is:

dataset.set_transform(lambda x: feature_map[model.get_transform(x)])

which may work better because it is applied later in the process. I would appreciate some eyes and discussion on the best approach. It is also very unclear how to programmatically select from 'NIH-Chest-X-ray-dataset'.
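
A minimal sketch of the lazy encoding idea, for discussion only. It assumes the transform receives a batch as a dict mapping column names to lists of values, which is the shape HF's set_transform passes to its callable. The helper name make_label_transform and the class names are hypothetical placeholders, not the dataset's actual categories.

```python
from typing import Any, Callable, Dict, List

Batch = Dict[str, List[Any]]


def make_label_transform(column: str, classes: List[str]) -> Callable[[Batch], Batch]:
    """Return a batch transform that encodes a string column as integer labels.

    Hypothetical helper: no encoding happens until the returned transform is called.
    """
    label_to_id = {name: idx for idx, name in enumerate(classes)}

    def transform(batch: Batch) -> Batch:
        out = dict(batch)  # shallow copy so the caller's batch is not mutated
        out["label"] = [label_to_id[value] for value in batch[column]]
        return out

    return transform


# Placeholder class names, for illustration only.
encode_dx = make_label_transform("dx", ["class_a", "class_b", "class_c"])
encoded = encode_dx({"dx": ["class_b", "class_a"]})
```

With an HF dataset this could in principle be passed as dataset.set_transform(encode_dx), but it would need to be composed with the model's own transform, since set_transform replaces any transform already set.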

@eloy-encord
Contributor

We shouldn't be adding dataset-specific variables to all the HF datasets. Can you check whether the label can be passed as a param of load_dataset()? Otherwise, it would be wise to just add a line after adding the dataset that performs the subsetting if required.

@Jim-Encord
Contributor Author

I don't think the label can be passed as an argument to load_dataset (I've looked into the source and checked whether its 'features' param can do it, and I believe not). The purpose of the label argument is to support datasets like the Skin_cancer dataset, which haven't explicitly declared and encoded a label column, so that we can select the column after inspecting the dataset. I can't really think of a workaround for this complexity. We could instead have the HF loader method apply a transformation that takes the "label" column, encodes it, and makes that the "label" column. But I wanted it to be applied 'lazily'.

@eloy-encord
Contributor

We need to rethink this. The point of using **kwargs in the build method is to feed custom parameters to custom datasets, so adding the custom params to all the datasets is not a great plan. We are also thinking about datasets that will require some data manipulation in the download/extraction phase, so an input manipulation function would probably be a good fit.

Nevertheless, we should avoid putting dataset logic in the embeddings when feasible, and in this case we won't use the object detection layer.
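
One possible shape for such an input manipulation function, as a rough sketch only: dataset-specific logic is registered per dataset and applied in the loader, so the shared build path and the embeddings code stay generic. Every name here (PREPROCESSORS, register_preprocessor, build_dataset, the "skin_cancer" key) is hypothetical and not taken from the codebase.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Preprocessor = Callable[[List[Record]], List[Record]]

# Hypothetical registry: dataset name -> dataset-specific manipulation hook.
PREPROCESSORS: Dict[str, Preprocessor] = {}


def register_preprocessor(name: str) -> Callable[[Preprocessor], Preprocessor]:
    """Decorator that registers a hook for a single dataset."""
    def decorator(fn: Preprocessor) -> Preprocessor:
        PREPROCESSORS[name] = fn
        return fn
    return decorator


@register_preprocessor("skin_cancer")
def use_dx_as_label(records: List[Record]) -> List[Record]:
    # Copy the "dx" column into "label" without mutating the input records.
    return [{**record, "label": record["dx"]} for record in records]


def build_dataset(name: str, records: List[Record]) -> List[Record]:
    """Generic build path: apply a hook only if one is registered for this dataset."""
    hook = PREPROCESSORS.get(name)
    return hook(records) if hook is not None else records


rows = [{"image": "img_0", "dx": "class_a"}]
built = build_dataset("skin_cancer", rows)
untouched = build_dataset("other_dataset", rows)
```

Datasets without a registered hook pass through unchanged, so no dataset-specific parameter has to be threaded through every loader.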

@Jim-Encord
Contributor Author

Yeah, fair point. I'll try to adapt this later, after inspecting your encord PR as is.

@eloy-encord
Contributor

Closing this one: we updated the code in other PRs and now allow custom target columns and configurations for HF datasets, and NIH-Chest-X-ray-dataset requires much more customisation than the rest of the datasets.

@frederik-encord deleted the jb/custom_labels branch April 22, 2024 14:15