Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concatenated features break symbol decoding #149

Open
Adamits opened this issue Nov 2, 2023 · 6 comments
Open

Concatenated features break symbol decoding #149

Adamits opened this issue Nov 2, 2023 · 6 comments
Assignees
Labels
enhancement New feature or request

Comments

@Adamits
Copy link
Collaborator

Adamits commented Nov 2, 2023

For models where the features are concatenated to the source string, we now handle this in the collator. We simply add the source_token vocabulary length to each feature index in order to avoid clashes: https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/data/collators.py#L71

However, the symbol maps do not track this. Thus, if we want to decode a source (e.g. as a sanity check in the logs), this will not work since the feature indices are out of raneg---they no longer can be meaningfully mapped back to their surface form. Not a critical bug, but definitely an odd behavior.

@kylebgorman
Copy link
Contributor

What would you do about this? My thought was that it's hard to imagine (except for debugging I guess) why you'd want to decode anything but the target.

@kylebgorman kylebgorman added the enhancement New feature or request label Nov 2, 2023
@Adamits
Copy link
Collaborator Author

Adamits commented Nov 7, 2023

Yes I suppose debugging is my only motivation. It is a bit awkward that we have totally untracked values, though.

What would you do about this?

I guess it would essentially require reverting back to the old way of doing this, where we add the features into the source vocab. But iirc that might also cause weird friction.

Otherwise, maybe I could do something tricky in the decode function where I check the source vocab size, and assume any integer > that size should be decoded with the features_vocab at index (idx - source_vocab_size).

I will think through if theres any issues with that.

@kylebgorman
Copy link
Contributor

It occurs to me that clashes can't happen anyways because of the special treatment we give to features. Does that help at all?

@Adamits
Copy link
Collaborator Author

Adamits commented Nov 9, 2023

That is a good point. Are you suggesting that we just merge the feature vocab to the source vocab in the setting that they will be concatenated to the source? Or do you mean does it help to justify not making an update?

@kylebgorman
Copy link
Contributor

Yeah, you could just add features to the source vocabulary. (Actually I'd do it regardless of whether concatenation is happening since it's harmless and it's annoying to pass information about whether concatenation is happening---which depends both on whether features are specified and the architecture---into the index and such.)

@Adamits
Copy link
Collaborator Author

Adamits commented Oct 16, 2024

Commenting here to say that I am setting a reminder to dig into this. I recently was trying to debug some projects and its quite unpleasant without being able to decode features at runtime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants