Training UNK tokens #161
Yeah, see here under
Rephrasing for my comprehension, lemme know if this is correct. Two proposals:
Correct?
(1) doesn't seem coherent to me without (2). How would the model be exposed to UNK in the vocabulary unless you replace some training data symbols with it? (2) is the only way you'd get exposure to it during training.
You're right, just doing (1) is only relevant for dev training. For (2), I'm wondering if we could achieve the same effect with a masking scheme: just randomly replace x% of symbols with UNK during training. This would give the embedding exposure to more context. It would also make the threshold less of an issue to figure out; otherwise we'd really need to fine-tune it to avoid masking out the majority of the vocabulary. Or maybe a best of both worlds: set a threshold for characters that SHOULDN'T be masked, and then allow masking for all other characters. This would avoid the Zipfian distribution stuff.
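A minimal sketch of this stochastic replacement idea, assuming symbols are already mapped to indices; UNK_IDX and mask_prob are illustrative names, not existing options in the library:

```python
import random

UNK_IDX = 3  # assumed index of the UNK symbol in the vocabulary


def stochastically_mask(symbols: list[int], mask_prob: float = 0.05) -> list[int]:
    """Independently replaces each symbol with UNK with probability mask_prob."""
    return [UNK_IDX if random.random() < mask_prob else s for s in symbols]


# Roughly mask_prob of the symbols in each training sequence become UNK,
# so the UNK embedding sees many different contexts during training.
masked = stochastically_mask([10, 42, 7, 99, 15], mask_prob=0.05)
```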
Sure, do both eventually. That said, I think (2) is low priority and your stochastic replacement proposal is lower yet, given how much else we have to do and given that it's not something we know makes a positive difference in our domain.
@Adamits if it's not something that needs to be done in the next week or so, you can assign it to me; I think I know where to do some of the implementation.
Note: we're going to fork this to experiment on various sampling paradigms. When done, we'll write a paper, merge to main, and have a pint at Charlene's/(SF/Colorado equivalent). For record keeping, here are the masking approaches to try:
Since these will be quite a few experiments, we'd likely want to stay as model-agnostic as possible. I'm thinking we just do runs with a basic transformer and a basic LSTM? @kylebgorman @Adamits Any add-ons for the pile?
I find percentages unintuitive in this domain, so I would recommend that whatever percentages you target, at least one set of experiments involves just masking all hapax legomena, and another one just masking all hapax legomena and dis legomena. I think this would be equivalent across (1-3).
You may have prior art for this, but a simpler solution is to just use a randomly initialized embedding (so just create UNK in the embedding matrix but don't train it).
I'd recommend that, yeah. You could even pick just one. Or, you could put in pointer-generators if you're doing g2p.
"I find percentages unintuitive in this domain, so I would recommend that whatever percentages you target, at least one set of experiments involves just masking all hapax legomena, and another one just masking all hapax legomena and dis legomena. I think this would be equivalent across (1-3)."
For single counts (I don't know Latin, so assuming that's what you mean), sure, that makes sense. But beyond that I personally find percentages more intuitive for scaling.
Well, that's what we already do. But I've toyed around with this before, and the average embedding shows some noticeable anecdotal improvements.
Isn't our current problem that ptr-gens don't extend well to disjoint source/target vocabs? (I know there's a paper out on this, but I thought that was why we had #156.)
Hapax legomenon is a Greek expression for a word/term/symbol occurring only once, and dis legomenon for one occurring twice. They are involved in a lot of theorizing about how to handle rare words. Anyway, I'm just saying you should make sure your experimental grid includes one set where you UNK symbols that occur once and another where you UNK symbols that occur once or twice (and probably draw the reader's attention to what percentage of the corpus and/or vocabulary gives you those two effects).
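Concretely, those two conditions amount to a frequency count with a cutoff of 1 or 2. A rough sketch (function and variable names are illustrative, not part of the library):

```python
from collections import Counter

UNK = "<UNK>"


def unk_rare_symbols(sequences: list[list[str]], max_count: int) -> list[list[str]]:
    """Replaces every symbol whose corpus count is <= max_count with UNK."""
    counts = Counter(sym for seq in sequences for sym in seq)
    rare = {sym for sym, count in counts.items() if count <= max_count}
    return [[UNK if sym in rare else sym for sym in seq] for seq in sequences]


# max_count=1 masks hapax legomena only; max_count=2 also masks dis legomena.
data = [["c", "a", "t"], ["c", "a", "b"], ["q", "a", "t"]]
print(unk_rare_symbols(data, max_count=1))  # "b" and "q" become <UNK>
```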
That's fine, I'm just saying that for comparison you should try both. Once you show averaging is better than random initialization, you won't need random initialization as an option anymore.
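To make the comparison concrete, here is a rough PyTorch sketch of the two UNK-embedding options under discussion; the sizes and indices are hypothetical, and this is not the library's actual implementation:

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim, unk_idx = 100, 64, 3  # hypothetical sizes/indices
embedding = nn.Embedding(vocab_size, embedding_dim)

# Option A: leave the UNK row randomly initialized and untrained
# (e.g. exclude it from the optimizer or zero its gradient each step).

# Option B: after (or during) training, set the UNK row to the average
# of all other symbol embeddings.
with torch.no_grad():
    keep = torch.ones(vocab_size, dtype=torch.bool)
    keep[unk_idx] = False
    embedding.weight[unk_idx] = embedding.weight[keep].mean(dim=0)
```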
Well, the current impl is undefined in the case where the vocabularies have zero overlap, but ptr-gens work great (as far as I can tell) when there's any overlap at all, and there is a decent amount in g2p (with languages written in Latin scripts) and usually perfect overlap in inflection tasks. But absolutely not essential.
Sorry, I have been a little MIA. This sounds good. Need me to check what people have typically done in morphology/phonology tasks? I recall a shared task submission replacing all UNKs by copying, which we could also compare to just for fun. E.g. cat + PL -> c<UNK>t + PL -> cats, where the a is copied back over whatever OOV symbol was there. Actually, this requires an alignment, so maybe we skip it...
Is this different from 2?
These all sound reasonable, though.
Nah, you're good. We're just putting in work during downtime. Perfectly fine if you need to focus on more pertinent stuff. If you wouldn't mind finding some relevant papers, that would be great. (Morphology is a bit tangential to my general work.) How strong an alignment? I'm currently writing up the Wu strong alignment models, so I may be able to transfer some of their alignment code for this.
Yeah, 2) would be: token x must occur more than 5% of the time (for example) to not be masked. 3) is: the 5% of tokens with the lowest frequency would be masked. The former can be a no-op depending on the corpus; the latter will always kick in regardless of the distribution. They're different paradigms for approaching low frequency: do we just want UNK to mask tokens that are barely there, or do we want UNK to be a filler for the tail end of a corpus?
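A small sketch of that contrast, assuming we work from symbol counts (helper names are illustrative): the threshold version can select nothing, while the percentile version always selects the bottom of the type distribution.

```python
from collections import Counter


def rare_by_threshold(counts: Counter, min_rel_freq: float) -> set[str]:
    """Paradigm 2): mask symbols whose relative frequency is below the cutoff."""
    total = sum(counts.values())
    return {sym for sym, c in counts.items() if c / total < min_rel_freq}


def rare_by_percentile(counts: Counter, pct: float) -> set[str]:
    """Paradigm 3): always mask the lowest-frequency pct of symbol *types*."""
    ranked = sorted(counts, key=counts.get)  # least frequent first
    return set(ranked[: int(len(ranked) * pct)])


counts = Counter({"a": 50, "b": 30, "c": 15, "d": 4, "e": 1})
print(rare_by_threshold(counts, 0.05))   # {"d", "e"}; can be empty for other corpora
print(rare_by_percentile(counts, 0.20))  # always the bottom 20% of types: {"e"}
```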
Ok I will try to take a look.
I think it can just be a post-processing step that assumes a very approximate alignment. See e.g. https://aclanthology.org/K17-2010.pdf
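As an illustration of that kind of post-processing, here is a naive positional stand-in for the approximate alignment; it is a sketch, not the linked system's actual code:

```python
UNK = "<UNK>"


def copy_unks(source: list[str], prediction: list[str]) -> list[str]:
    """Wherever the model emits UNK, copy the source symbol at the same position."""
    return [
        source[i] if sym == UNK and i < len(source) else sym
        for i, sym in enumerate(prediction)
    ]


# e.g. source "cat" with OOV "a": prediction c <UNK> t s -> c a t s
print(copy_unks(list("cat"), ["c", UNK, "t", "s"]))  # ['c', 'a', 't', 's']
```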
Nice, this distinction makes sense. In my comment I had meant: is it different from 1)? I guess the reasoning is that under "do not exceed X% of occurrences in the data" we could have a nearly uniform distribution of symbols?
Technically yes, but our general assumption with the library is natural language, so I don't really know off the top of my head when that would occur. I guess with absurdly small datasets? But at that point I don't think most of our models would be able to train anyhow.
Currently we create a vocabulary of all items in all datapaths specified to the training script.
However, we may want to study how models perform when provided unknown symbols. In this case:
Kyle suggested we follow a fairseq feature that allows you to automatically replace low-frequency symbols with UNK during training. I think we should add this as a feature option, which also deletes those low-frequency symbols from the vocabulary, so that when we come across them at inference, they use the UNK embedding.
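A hedged sketch of what that could look like: build the vocabulary only from symbols meeting a minimum training-set count, so rarer symbols map to UNK at both training and inference time. The min_symbol_count parameter and the special-symbol inventory here are hypothetical, not existing options.

```python
from collections import Counter

SPECIALS = ["<PAD>", "<BOS>", "<EOS>", "<UNK>"]  # assumed special-symbol inventory


def build_vocab(sequences: list[list[str]], min_symbol_count: int = 1) -> dict[str, int]:
    """Keeps only symbols seen at least min_symbol_count times in the training data."""
    counts = Counter(sym for seq in sequences for sym in seq)
    kept = sorted(sym for sym, c in counts.items() if c >= min_symbol_count)
    return {sym: idx for idx, sym in enumerate(SPECIALS + kept)}


def encode(seq: list[str], vocab: dict[str, int]) -> list[int]:
    """Symbols pruned from (or never in) the vocabulary fall back to the UNK index."""
    unk = vocab["<UNK>"]
    return [vocab.get(sym, unk) for sym in seq]
```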