Label smoothing update #83
* Passes label smoothing from the CLI to the model class.
* Replaces NLLLoss plus log softmax with CrossEntropyLoss.
* Passes label smoothing to the loss function.

Currently this does not seem to work: loss does not decline more than a tiny bit on Maori passives, and validation accuracy is 0.
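For context, a minimal sketch of the change being described, using toy tensors rather than the yoyodyne model code: `CrossEntropyLoss` consumes raw logits directly and exposes a `label_smoothing` argument (added in PyTorch 1.10), replacing the explicit log softmax plus `NLLLoss` pair.

```python
import torch
from torch import nn

# Toy shapes: a batch of 4 target positions over a 10-symbol vocabulary.
logits = torch.randn(4, 10, requires_grad=True)
targets = torch.randint(0, 10, (4,))

# Old style: explicit log softmax followed by NLLLoss.
old_loss = nn.NLLLoss()(nn.functional.log_softmax(logits, dim=-1), targets)

# New style: CrossEntropyLoss takes raw logits and supports label smoothing.
new_loss = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, targets)

# Without smoothing, the two formulations are numerically identical.
assert torch.allclose(
    old_loss, nn.CrossEntropyLoss(label_smoothing=0.0)(logits, targets)
)
```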
@Adamits requesting review when you have time: this isn't working for me.
I now think there was a regression earlier, probably related to the student forcing CL.
Ah ok, do you know what it is? I have to check which commit I am at, but when I had some time yesterday, I remade my change and ran it on Maori from Abstractness. I trained on files 1-8 and used 9 for validation, and it was at 60% accuracy after a few epochs for attentive_lstm. Can you give me more specific steps to reproduce your 0-accuracy case? I can check which commit I was at and try to help! Or do you have a good idea of what happened now?
On MacOS:
```bash
#!/bin/bash
set -eou pipefail
readonly DATA=../../abstractness/data
readonly ARCH=attentive_lstm
# TODO: Set according to language.
readonly LANGUAGE=mri
readonly MODEL_DIR=../models/attentive_bilstm
readonly TRAIN="${DATA}/${LANGUAGE}/${LANGUAGE}_train.tsv"
cat "${DATA}/${LANGUAGE}/${LANGUAGE}_"[0-7]".tsv" > "${TRAIN}"
trap "rm -f ${TRAIN}" EXIT
readonly DEV="${DATA}/${LANGUAGE}/${LANGUAGE}_8.tsv"
# Model parameters from Kann & Schütze (MED):
# https://aclanthology.org/W16-2010/
yoyodyne-train \
--experiment "${LANGUAGE}" \
--train "${TRAIN}" \
--dev "${DEV}" \
--model_dir "${MODEL_DIR}" \
--arch "${ARCH}" \
--embedding_size 300 \
--hidden_size 100 \
--batch_size 20 \
--dropout .3 \
--gradient_clip_val 3 \
--learning_rate 1 \
--accelerator gpu \
--max_epochs 60 \
--log_every_n_step 20 \
--optimizer adadelta \
--seed 49 \
--precision 16
```

It never makes it past a loss of about 2.9, and accuracy is stuck at zero.
After installing from PyPI as you suggest, I ran a few epochs on CPU on my Mac. That is, I copied your script, but changed After 8 epochs I have:
I will test on GPU later when I get to the airport! P.S. The version on PyPI does not have this commit in it, right? You were able to produce this behavior without the update in this PR?
Okay, wow, that's it: the MPS support is broken... just use CPU on MacOS in the meantime...
Should we keep this, with a doc note or warning, or should we revert it? Also, what version of torch are you on?
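For the record, a quick way to answer the version question from the active environment (assuming both packages import cleanly):

```python
import torch
import pytorch_lightning as pl

print("torch:", torch.__version__)
print("pytorch-lightning:", pl.__version__)
```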
The problem is very clearly upstream, in PyTorch itself, and is probably limited to an unknowable combination of PyTorch and/or Lightning versions. My inclination is to first complete the migration to PyTorch 2 (#60) and then document it if it persists:
Re: this PR itself, I'm now testing it (slowly, on CPU) and will merge if it works.
In case this wasn't clear, I'm now reasonably certain that the issue I was seeing in this PR is related to the Torch MPS backend and not the change itself. I'm testing the PR itself on CPU now.
Ahhh right. For some reason I was thinking it was related to MPS + CrossEntropyLoss. FWIW I did not realize I could run with my MacBook Pro with
Me too, initially.
If you do that and you don't also have a third-party NVIDIA GPU hooked up, it trains on the GPU via the MPS backend rather than CUDA. I have seen this characterized as "experimental" in other sources. But as you know from #60 we are currently stuck on an elderly, pre-2.0 version of Torch and PL, and that might be related... which is why we should take care of #60 soon.
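To see what a given Mac actually exposes, a small diagnostic along these lines can help (the MPS flags assume a PyTorch recent enough to ship the Metal backend):

```python
import torch

# On a Mac with no NVIDIA card, CUDA is unavailable, so a "gpu" accelerator
# request ends up on the Metal (MPS) backend when it is built and available.
print("CUDA available:", torch.cuda.is_available())
print("MPS built:", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())
```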
I think we may need an update to the pointer generator, which first takes a softmax (https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/models/pointer_generator.py#L135), then uses that to compute generation probs, and lastly takes the log (https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/models/pointer_generator.py#L157). I might need to check the details of the paper again to be sure about what's right (i.e., can we use logits when computing the pointer/generator distributions?).
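To make the complication concrete, here is a rough sketch of that flow with illustrative tensor names (simplified; not the actual yoyodyne code). The key point is that the generation and copy distributions are mixed in probability space, so the final log cannot simply be folded into `CrossEntropyLoss`, which expects raw logits:

```python
import torch

def pointer_generator_log_probs(output_logits, attn_weights, p_gen, src_indices):
    """Sketch of the pointer-generator scoring step (names are illustrative).

    output_logits: (batch, vocab) raw decoder scores.
    attn_weights:  (batch, src_len) source attention weights (already a softmax).
    p_gen:         (batch, 1) probability of generating vs. copying.
    src_indices:   (batch, src_len) vocabulary ids of the source symbols.
    """
    # P_vocab: softmax over the decoder output (the softmax in question).
    p_vocab = torch.softmax(output_logits, dim=-1)
    # Copy distribution: scatter the attention mass onto vocabulary ids.
    p_copy = torch.zeros_like(p_vocab).scatter_add(-1, src_indices, attn_weights)
    # Mix the two *probability* distributions, then take the log for NLLLoss.
    mixture = p_gen * p_vocab + (1.0 - p_gen) * p_copy
    return torch.log(mixture + 1e-12)
```

Because the mixture is a sum of two probability distributions, its log is not the log softmax of any simple combination of logits, which is why this piece resists a drop-in CrossEntropyLoss rewrite.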
Accidentally added my main comment as a separate comment. Inline comments can probably be ignored.
I tried updating pointer_generator so that the source attention additionally returns logits, and to use the P_vocab (which we call output_probs) as logits instead of taking a softmax, but this seems to change the results (for the worse). I guess this does change the computation graph, so I need to think about this a little harder.
FWIW the ptr-gen version in this PR is working great for me on Polish… are you getting something different?
I was trying the Maori example. Are you able to reproduce the same results as before the change? I believe the current state of it computes
This works fine:

```bash
#!/bin/bash
# Training for the pointer-generator LSTM.
set -eou pipefail
readonly DATA=../../abstractness/data
readonly ARCH=pointer_generator_lstm
# TODO: Set according to language.
readonly LANGUAGE=pol
readonly MODEL_DIR="../models/pointer_generator_bilstm"
readonly TRAIN="${DATA}/${LANGUAGE}/${LANGUAGE}_train.tsv"
cat "${DATA}/${LANGUAGE}/${LANGUAGE}_"[0-7]".tsv" > "${TRAIN}"
trap "rm -f ${TRAIN}" EXIT
readonly DEV="${DATA}/${LANGUAGE}/${LANGUAGE}_8.tsv"
# Model parameters from Sharma et al. for the high setting:
# https://aclanthology.org/K18-3013/
yoyodyne-train \
--experiment "${LANGUAGE}" \
--train "${TRAIN}" \
--dev "${DEV}" \
--model_dir "${MODEL_DIR}" \
--features_col 3 \
--arch "${ARCH}" \
--embedding_size 300 \
--hidden_size 100 \
--learning_rate .001 \
--optimizer adam \
--label_smoothing .1 \
--batch_size 32 \
--dropout .3 \
--gradient_clip_val 3 \
--max_epochs 60 \
--seed 49 \
--precision bf16
```

The pointer generator does call softmax on the output probabilities, and then once they're combined with the generation probabilities, calls (There's also log softmax in the attention module, and in the transducer, but I don't think they can be factored out.)
Ok I will run this. I think we need to make sure it reproduces the results with
Tried a bunch of things and still no luck reproducing the two softmaxes + NLLLoss by somehow getting both as logits and then using CrossEntropyLoss. To be clear, the complication is that we need the attention softmax. Different versions of getting that softmax while also getting logits for computing the output score seem to change the results (presumably because we change the computation graph, and thus backprop through a slightly altered function).
Thought about this a bit more, and I don't think softmax and addition commute; that is, for tensors x and y:

log(softmax(x + y)) != log(softmax(x) + softmax(y))

In particular, notice that
Still, I wonder if there is a trick.
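A quick numerical check of that claim (a standalone snippet, unrelated to the PR code):

```python
import torch

torch.manual_seed(0)
x = torch.randn(5)
y = torch.randn(5)

lhs = torch.log(torch.softmax(x + y, dim=-1))
rhs = torch.log(torch.softmax(x, dim=-1) + torch.softmax(y, dim=-1))

print(torch.allclose(lhs, rhs))  # False
# softmax(x) + softmax(y) sums to 2, so the right-hand side is not even
# the log of a probability distribution.
print((torch.softmax(x, dim=-1) + torch.softmax(y, dim=-1)).sum())  # tensor(2.)
```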
You mean distribute, right?
Oops, yes :)
Just updating this thread with some offline chat. @Adamits is trying to figure out how to make this work well with the pointer-generator, since that does sort of need non-logit probabilities in its internal calculations.
Closes #80.