Moves to mapper interface #266

Merged
merged 20 commits into from
Dec 2, 2024
Conversation

kylebgorman
Contributor

@kylebgorman kylebgorman commented Nov 25, 2024

Borrowing a design element I used in UDTube, I decompose the dataset object into two pieces:

  • a Mapper interface which knows how to map between lists of strings and tensors (to decode and encode)
  • DataSet, as before.

There was no particular reason for the mapper functions to live inside the dataset. This separates the two pieces and uses the mapper object for prediction.

Closes #137. That issue says that the encoding/decoding should be moved to the index, but this actually makes those two even more modular.

(I also imported fix #272 and resolved some merge stuff from #268 etc.)

Borrowing a design element I used in UDTube, I decompose the dataset
object into two pieces:

* a `Mapper` interface which knows how to map between lists of strings
  and tensors (to decode and encode)
* `DataSet`, as before

There was no particular reason for the mapper functions to live inside
the dataset, and this commit simply makes this separation.

A subsequent commit will use this mapper object during prediction.
You can just simulate this by appending an additional string onto the
name of the model_dir if needed.
@kylebgorman kylebgorman marked this pull request as ready for review November 30, 2024 18:23
@kylebgorman kylebgorman requested a review from Adamits November 30, 2024 18:23
@kylebgorman kylebgorman added the enhancement New feature or request label Dec 1, 2024
Collaborator

@Adamits Adamits left a comment

I don't really have a problem with it, but I don't actually follow the reasoning for needing another class. The "mapping" operations, to me, could intuitively live in the index, the dataset, or the datamodule. Is the ambiguity there around where it should live actually the problem? Or is this just an OOP principle that someone established?

@kylebgorman
Contributor Author

kylebgorman commented Dec 2, 2024

> I don't really have a problem with it, but I don't actually follow the reasoning for needing another class. The "mapping" operations, to me, could intuitively live in the index, the dataset, or the datamodule. Is the ambiguity there around where it should live actually the problem? Or is this just an OOP principle that someone established?

The way to think of the design is this:

  • The datamodule creates an index.
  • The datamodule creates datasets.
  • At the creation of each dataset, a mapper is made from the index and passed to the dataset. (This is necessary because the dataset generates tensors on the fly.)
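
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class bodies, the `encode`/`decode` method names, and the `<pad>` symbol are all assumptions, and plain Python lists stand in for tensors so that the sketch runs without PyTorch.

```python
from dataclasses import dataclass
from typing import Dict, List


class Index:
    """Hypothetical index: maps symbols to integer ids and back."""

    def __init__(self, symbols: List[str]):
        # Id 0 is reserved for padding in this sketch.
        self.vocab: List[str] = ["<pad>"] + sorted(set(symbols))
        self.ids: Dict[str, int] = {s: i for i, s in enumerate(self.vocab)}


@dataclass
class Mapper:
    """Hypothetical mapper: encodes symbol lists and decodes id lists."""

    index: Index

    def encode(self, symbols: List[str], length: int) -> List[int]:
        # A real mapper would return a tensor; a padded list stands in here.
        ids = [self.index.ids[s] for s in symbols]
        return ids + [0] * (length - len(ids))

    def decode(self, ids: List[int]) -> List[str]:
        # Drops padding and maps the remaining ids back to symbols.
        return [self.index.vocab[i] for i in ids if i != 0]


@dataclass
class DataSet:
    """Holds the raw samples and encodes them on the fly via the mapper."""

    samples: List[List[str]]
    mapper: Mapper

    def __getitem__(self, i: int) -> List[int]:
        length = max(len(s) for s in self.samples)
        return self.mapper.encode(self.samples[i], length)


class DataModule:
    """Creates the index, then makes a mapper for each dataset it creates."""

    def __init__(self, train: List[List[str]]):
        self.index = Index([sym for sample in train for sym in sample])
        self.train = DataSet(train, Mapper(self.index))


dm = DataModule([["c", "a", "t"], ["a", "t"]])
encoded = dm.train[1]
decoded = dm.train.mapper.decode(encoded)
```

Note that the mapper holds a reference to the index only, so it can be handed to other code without the list of samples coming along.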

I am moving it out of the dataset in this PR, because there are a number of places where you want to do what the mapping does but you don't need a reference to the full dataset. (For instance, you don't need the huge list that contains the actual data.) One example of this is in prediction: you need to map but there's no direct reference to the dataset. Putting it in a separate class helps with potential circularity issues.

The other obvious option is to make it part of the index, but there are places where we need an index but not a mapper, or vice versa. For instance, the expert class doesn't need tensors or any of the padding, so it uses the index but not the mapper. I despaired of finding a way to separate "string to integer" from "integers to tensors" other than putting them in separate classes.

The general OOP principle at play here is the Law of Demeter. The old code (this was part of predict.py) had loader.dataset.decode_target(...); this now reads mapper.decode_target(...).
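
To make the Law of Demeter point concrete, here is a small before/after sketch. Only the two call shapes, `loader.dataset.decode_target(...)` and `mapper.decode_target(...)`, come from the comment above; the classes `OldDataSet`, `Loader`, and `Mapper` and the functions `predict_old` and `predict_new` are hypothetical stand-ins, and lists of integers stand in for tensors.

```python
from typing import List


class OldDataSet:
    """Hypothetical pre-PR dataset: owns the data and the decoding logic."""

    def __init__(self, samples: List[List[int]], vocab: List[str]):
        self.samples = samples  # The (potentially huge) list of data.
        self.vocab = vocab

    def decode_target(self, ids: List[int]) -> List[str]:
        return [self.vocab[i] for i in ids]


class Loader:
    """Hypothetical stand-in for a data loader wrapping a dataset."""

    def __init__(self, dataset: OldDataSet):
        self.dataset = dataset


class Mapper:
    """Hypothetical post-PR mapper: decoding logic with no data attached."""

    def __init__(self, vocab: List[str]):
        self.vocab = vocab

    def decode_target(self, ids: List[int]) -> List[str]:
        return [self.vocab[i] for i in ids]


def predict_old(loader: Loader, ids: List[int]) -> List[str]:
    # Reaches through the loader into the dataset: two hops.
    return loader.dataset.decode_target(ids)


def predict_new(mapper: Mapper, ids: List[int]) -> List[str]:
    # Talks only to its direct collaborator: one hop.
    return mapper.decode_target(ids)


vocab = ["a", "b", "c"]
old_result = predict_old(Loader(OldDataSet([[0, 1], [2]], vocab)), [2, 0, 1])
new_result = predict_new(Mapper(vocab), [2, 0, 1])
```

Both calls decode to the same symbols; the difference is that `predict_new` no longer needs to know that a loader has a dataset, or that a dataset exists at all.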

@Adamits
Collaborator

Adamits commented Dec 2, 2024

Thanks for describing your thinking.

> there are a number of places where you want to do what the mapping does but you don't need a reference to the full dataset

Yes, good point. I think another place this is helpful is runtime debugging. Previously, if I wanted to log encoded inputs or predicted outputs in the forward function at runtime, I needed the dataset.

Collaborator

@Adamits Adamits left a comment

LGTM

@kylebgorman kylebgorman merged commit f3550a1 into master Dec 2, 2024
8 checks passed
@kylebgorman kylebgorman deleted the mapper branch December 9, 2024 22:48
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Move tensor encoding/decoding into the Index class
2 participants