
Label smoothing update #83

Closed · wants to merge 2 commits

Conversation

@kylebgorman (Contributor) commented Jun 30, 2023

  • Passes label smoothing from the CLI to the model class.
  • Replaces NLLLoss plus log softmax with CrossEntropyLoss.
  • Passes label smoothing to the loss function.

Closes #80.

Currently this does not seem to work: loss does not decline more than a
tiny bit on Maori passives, and validation accuracy is 0.
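
For reference, a minimal sketch of the loss substitution described above; the shapes and names here are illustrative rather than yoyodyne's actual code. nn.CrossEntropyLoss consumes raw logits, applies log-softmax internally, and (since torch 1.10) accepts a label_smoothing argument, so the explicit log-softmax plus NLLLoss pair can be dropped:

import torch
import torch.nn as nn

batch_size, seq_len, vocab_size = 2, 5, 7
logits = torch.randn(batch_size, vocab_size, seq_len)        # (B, V, T), the layout the losses expect
targets = torch.randint(vocab_size, (batch_size, seq_len))   # (B, T)

# Before: explicit log-softmax over the vocabulary dimension, then NLLLoss.
old_loss = nn.NLLLoss()(torch.log_softmax(logits, dim=1), targets)

# After: CrossEntropyLoss takes the raw logits and applies log-softmax itself;
# label smoothing is passed straight through to the loss (torch >= 1.10).
new_loss = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, targets)

# With label_smoothing=0.0 the two formulations are numerically identical.
assert torch.allclose(old_loss, nn.CrossEntropyLoss(label_smoothing=0.0)(logits, targets))
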
@kylebgorman requested a review from @Adamits on June 30, 2023 at 18:00
@kylebgorman (Contributor, Author)

@Adamits requesting review when you have time: this isn't working for me.

@kylebgorman (Contributor, Author) commented Jul 2, 2023

I now think there was a regression earlier, probably related to the student forcing CL.

@Adamits (Collaborator) commented Jul 2, 2023

Ah, OK, do you know what it is? I have to check which commit I am at, but when I had some time yesterday, I remade my change and ran it on Maori from Abstractness. I trained on files 1-8 and used 9 for validation, and it was at 60% accuracy after a few epochs for attentive_lstm.

Can you give me more specific steps to reproduce your 0% accuracy case? I can check which commit I was at and try to help! Or do you have a good idea of what happened now?

@kylebgorman (Contributor, Author) commented Jul 2, 2023

On macOS:

  • Install the version from PyPI: pip install yoyodyne
  • Run the following script:
#!/bin/bash

set -eou pipefail

readonly DATA=../../abstractness/data
readonly ARCH=attentive_lstm
# TODO: Set according to language.
readonly LANGUAGE=mri
readonly MODEL_DIR=../models/attentive_bilstm

readonly TRAIN="${DATA}/${LANGUAGE}/${LANGUAGE}_train.tsv"
cat "${DATA}/${LANGUAGE}/${LANGUAGE}_"[0-7]".tsv" > "${TRAIN}"
trap "rm -f ${TRAIN}" EXIT
readonly DEV="${DATA}/${LANGUAGE}/${LANGUAGE}_8.tsv"

# Model parameters from Kann & Schütze (MED):
# https://aclanthology.org/W16-2010/
yoyodyne-train \
    --experiment "${LANGUAGE}" \
    --train "${TRAIN}" \
    --dev "${DEV}" \
    --model_dir "${MODEL_DIR}" \
    --arch "${ARCH}" \
    --embedding_size 300 \
    --hidden_size 100 \
    --batch_size 20 \
    --dropout .3 \
    --gradient_clip_val 3 \
    --learning_rate 1 \
    --accelerator gpu \
    --max_epochs 60 \
    --log_every_n_step 20 \
    --optimizer adadelta \
    --seed 49 \
    --precision 16

It never makes it past a loss of about 2.9, and accuracy is stuck at zero.

@Adamits (Collaborator) commented Jul 2, 2023

After installing from PyPI as you suggest, I ran a few epochs on CPU on my Mac. That is, I copied your script but changed --accelerator gpu to --accelerator cpu.

After 8 epochs I have:

Epoch 8:   0%|    | 0/25 [00:00<?, ?it/s, loss=0.268, v_num=0, val_accuracy=0.419, val_loss=0.849]

I will test on GPU later when I get to the airport!

P.S. The version on PyPI does not have this commit in it, right? You were able to produce this behavior without the update in this PR?

@kylebgorman (Contributor, Author)

Okay, wow, that's it: the MPS support is broken... just use CPU on macOS in the meantime...

@Adamits (Collaborator) commented Jul 3, 2023

Should we keep this with a doc or warning, or should we revert it? Also, what version of torch are you on?

@kylebgorman (Contributor, Author)

The problem is very clearly upstream, in PyTorch itself, and is probably limited to an unknowable combination of PyTorch and/or Lightning versions. My inclination is to first complete the migration to PyTorch 2 (#60) and then document it if it persists:

PyTorch supports the use of Apple's [Metal Performance Shaders](https://developer.apple.com/metal/) (or MPS) as an accelerator. However, we find that this accelerator backend may be buggy; models which converge on CPU, or on other platforms, may fail to converge when trained with MPS. We do not currently recommend using `--accelerator mps` or `--accelerator gpu` on macOS, though the latter can be safely used with a third-party (i.e., NVIDIA) GPU.

Re: this PR itself, I'm now testing it (slowly, on CPU) and will merge if it works.

@kylebgorman (Contributor, Author)

In case this wasn't clear I'm now reasonably certain that the issue I was seeing in this PR is related to the Torch MPS backend and not the change itself. I'm testing the PR itself on CPU now.

@Adamits (Collaborator) commented Jul 3, 2023

In case this wasn't clear I'm now reasonably certain that the issue I was seeing in this PR is related to the Torch MPS backend and not the change itself. I'm testing the PR itself on CPU now.

Ahhh right. For some reason I was thinking it was related to MPS + CrossEntropyLoss.

FWIW I did not realize I could run with --accelerator gpu on my MacBook Pro, but I tried it and it was converging for me.

@kylebgorman (Contributor, Author)

Ahhh right. For some reason I was thinking it was related to MPS + CrossEntropyLoss.

Me too, initially.

FWIW I did not realize I could run with --accelerator gpu on my MacBook Pro, but I tried it and it was converging for me.

If you do that and you don't also have a third-party NVIDIA GPU hooked up, it trains using the MPS backend rather than CUDA. I have seen this characterized as "experimental" elsewhere. But as you know from #60, we are currently stuck on an elderly, pre-2.0 version of Torch and PL, and that might be related... which is why we should take care of #60 soon.
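
As an aside, a quick way to check which accelerator backends PyTorch actually sees on a given machine; this is just a sketch and assumes torch >= 1.12, where torch.backends.mps was introduced:

import torch

# On a Mac without an NVIDIA GPU, CUDA is unavailable, so requesting the GPU
# accelerator ends up on the MPS backend, as described above.
print("CUDA available:", torch.cuda.is_available())
print("MPS available:", torch.backends.mps.is_available())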

@Adamits (Collaborator) commented Jul 3, 2023

I think we may need an update to the pointer-generator, which first takes a softmax (https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/models/pointer_generator.py#L135), then uses that to compute the generation probabilities, and lastly takes the log (https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/models/pointer_generator.py#L157).

I might need to check the details of the paper again to be sure about what's right (i.e., can we use logits when computing the pointer/generator distributions).
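
A heavily hedged schematic of the computation those links point to; the function and argument names below are invented, and the real pointer_generator.py differs in detail. The point is that the vocabulary scores have to be softmaxed before they can be mixed with the copy (attention) distribution, and only then is the log taken, so the model naturally produces log-probabilities rather than logits:

import torch

def pointer_generator_output(vocab_logits, attention_weights, src_token_ids, p_gen):
    """Schematic pointer-generator mixture (not the actual yoyodyne code).

    vocab_logits:      (B, V) raw decoder scores over the target vocabulary.
    attention_weights: (B, S) softmaxed attention over source positions.
    src_token_ids:     (B, S) vocabulary indices of the source tokens.
    p_gen:             (B, 1) generation probability in [0, 1].
    """
    # The generation distribution must be a probability distribution before mixing ...
    p_vocab = torch.softmax(vocab_logits, dim=-1)                                    # (B, V)
    # ... so it can be combined with the copy distribution scattered onto the vocabulary.
    p_copy = torch.zeros_like(p_vocab).scatter_add(-1, src_token_ids, attention_weights)
    p_final = p_gen * p_vocab + (1.0 - p_gen) * p_copy      # probabilities, not logits
    # Taking the log yields log-probabilities, which pair with NLLLoss;
    # CrossEntropyLoss instead expects raw logits.
    return torch.log(p_final + 1e-12)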

@Adamits (Collaborator) left a review comment


Accidentally added my main comment as a separate comment. Inline comments can probably be ignored.

Inline comments on yoyodyne/models/base.py (resolved).
@Adamits (Collaborator) commented Jul 3, 2023

I tried updating pointer_generator so that the source attention additionally returns logits, and using the P_vocab (which we call output_probs) as logits instead of taking a softmax, but this seems to change the results (for the worse).

I guess this does change the computation graph, so I need to think about this a little harder.

@kylebgorman (Contributor, Author) commented Jul 3, 2023 via email

FWIW the ptr-gen version in this PR is working great for me on Polish…are you getting something different?

@Adamits (Collaborator) commented Jul 3, 2023

FWIW the ptr-gen version in this PR is working great for me on Polish…are you getting something different?

I was trying the Maori example. Are you able to reproduce the same results as before the change? I believe the current state of it computes log(softmax) in the decode step, and then CrossEntropyLoss applies log-softmax again inside the loss function.

@kylebgorman (Contributor, Author)

This works fine:

#!/bin/bash
# Training for the pointer-generator LSTM.

set -eou pipefail

readonly DATA=../../abstractness/data
readonly ARCH=pointer_generator_lstm
# TODO: Set according to language.
readonly LANGUAGE=pol
readonly MODEL_DIR="../models/pointer_generator_bilstm"

readonly TRAIN="${DATA}/${LANGUAGE}/${LANGUAGE}_train.tsv"
cat "${DATA}/${LANGUAGE}/${LANGUAGE}_"[0-7]".tsv" > "${TRAIN}"
trap "rm -f ${TRAIN}" EXIT
readonly DEV="${DATA}/${LANGUAGE}/${LANGUAGE}_8.tsv"

# Model parameters from Sharma et al. for the high setting:
# https://aclanthology.org/K18-3013/
yoyodyne-train \
    --experiment "${LANGUAGE}" \
    --train "${TRAIN}" \
    --dev "${DEV}" \
    --model_dir "${MODEL_DIR}" \
    --features_col 3 \
    --arch "${ARCH}" \
    --embedding_size 300 \
    --hidden_size 100 \
    --learning_rate .001 \
    --optimizer adam \
    --label_smoothing .1 \
    --batch_size 32 \
    --dropout .3 \
    --gradient_clip_val 3 \
    --max_epochs 60 \
    --seed 49 \
    --precision bf16

The pointer generator does call softmax on the output probabilities, and then once they're combined with the generation probabilities, calls torch.log on them. Should I do something there?

(There's also a log softmax in the attention module, and in the transducer, but I don't think they can be factored out.)

@Adamits (Collaborator) commented Jul 5, 2023

This works fine:

#!/bin/bash
# Training for the pointer-generator LSTM.

set -eou pipefail

readonly DATA=../../abstractness/data
readonly ARCH=pointer_generator_lstm
# TODO: Set according to language.
readonly LANGUAGE=pol
readonly MODEL_DIR="../models/pointer_generator_bilstm"

readonly TRAIN="${DATA}/${LANGUAGE}/${LANGUAGE}_train.tsv"
cat "${DATA}/${LANGUAGE}/${LANGUAGE}_"[0-7]".tsv" > "${TRAIN}"
trap "rm -f ${TRAIN}" EXIT
readonly DEV="${DATA}/${LANGUAGE}/${LANGUAGE}_8.tsv"

# Model parameters from Sharma et al. for the high setting:
# https://aclanthology.org/K18-3013/
yoyodyne-train \
    --experiment "${LANGUAGE}" \
    --train "${TRAIN}" \
    --dev "${DEV}" \
    --model_dir "${MODEL_DIR}" \
    --features_col 3 \
    --arch "${ARCH}" \
    --embedding_size 300 \
    --hidden_size 100 \
    --learning_rate .001 \
    --optimizer adam \
    --label_smoothing .1 \
    --batch_size 32 \
    --dropout .3 \
    --gradient_clip_val 3 \
    --max_epochs 60 \
    --seed 49 \
    --precision bf16

The pointer generator does call softmax on the output probabilities, and then once they're combined with the generation probabilities, calls torch.log on them. Should I do something there?

(There's also a log softmax in the attention module, and in the transducer, but I don't think they can be factored out.)

OK, I will run this. I think we need to make sure it reproduces the results with NLLLoss vs. CrossEntropyLoss, though. For the attention softmax, I was just returning the logits along with the attention weights and using the logits for computing the output, but this changed the results (I think it must be changing the backward pass, because the forward computation should be the same).
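
A hypothetical sketch of the "return the logits along with the attention weights" idea described above; the class and names are made up, and yoyodyne's actual attention module differs. The raw scores are returned untouched while the softmaxed weights remain available for building the context vector:

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Toy attention that exposes both raw scores and normalized weights."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_size * 2, 1)

    def forward(self, query: torch.Tensor, keys: torch.Tensor):
        # query: (B, H); keys: (B, S, H).
        expanded = query.unsqueeze(1).expand(-1, keys.size(1), -1)
        scores = self.scorer(torch.cat([expanded, keys], dim=-1)).squeeze(-1)  # (B, S) raw scores
        weights = torch.softmax(scores, dim=-1)                                # (B, S) for the context vector
        return weights, scores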

@Adamits (Collaborator) commented Jul 6, 2023

Tried a bunch of things and still no luck reproducing the two softmaxes + NLLLoss by somehow getting both as logits and then using CrossEntropyLoss. To be clear, the complication is that we need the attention softmax. Different versions of getting that softmax while also getting the logits used for computing the output score seem to change the results (presumably since we change the computation graph, and thus backprop through a slightly altered function).

@Adamits (Collaborator) commented Jul 8, 2023

Thought about this a bit more, and I don't think softmax and addition commute, that is, for tensors x and y:

log(softmax(x + y)) != log(softmax(x) + softmax(y))

In particular, notice that e^{x + y} == e^{x} * e^{y}, whereas we want e^{x} + e^{y}.

Still I wonder if there is a trick.
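
A tiny numerical check of the point above; x and y are toy stand-ins for the two sets of scores being mixed:

import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([0.5, -1.0, 2.0])

lhs = torch.log_softmax(x + y, dim=-1)                                 # log softmax of the summed scores
rhs = torch.log(torch.softmax(x, dim=-1) + torch.softmax(y, dim=-1))   # log of the summed softmaxes

print(lhs)                        # approximately [-3.55, -4.05, -0.05]: a proper log-distribution
print(rhs)                        # not even a log-distribution: softmax(x) + softmax(y) sums to 2
print(torch.allclose(lhs, rhs))   # False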

@kylebgorman (Contributor, Author)

Thought about this a bit more, and I don't think softmax and addition commute

You mean distribute, right?

@Adamits (Collaborator) commented Jul 8, 2023

Thought about this a bit more, and I don't think softmax and addition commute

You mean distribute, right?

Oops, yes :)

@kylebgorman (Contributor, Author)

Just updating this thread with some offline chat. @Adamits is trying to figure out how to make this work well with the pointer-generator, since that does sort of need non-logit probabilities in its internal calculations.

@kylebgorman (Contributor, Author)

Closing this because the relevant work is happening in response to #80 in PR #83; @Adamits is on it.

@kylebgorman deleted the smoothing branch December 9, 2024 22:47
Successfully merging this pull request may close these issues:

Switch to CrossEntropyLoss? (#80)