
Label smoothing update #83

Closed · wants to merge 2 commits

Conversation

@kylebgorman (Contributor) commented Jun 30, 2023

  • Passes label smoothing from the CLI to the model class.
  • Replaces NLLLoss plus log softmax with CrossEntropyLoss.
  • Passes label smoothing to the loss function.

Closes #80.

Currently this does not seem to work: loss does not decline more than a
tiny bit on Maori passives, and validation accuracy is 0.
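
For reference, a minimal sketch of the loss substitution described above; the shapes and names here are illustrative rather than yoyodyne's actual code. nn.CrossEntropyLoss consumes raw logits, applies log-softmax internally, and (since torch 1.10) accepts a label_smoothing argument, so the explicit log-softmax plus NLLLoss pair can be dropped:

import torch
import torch.nn as nn

batch_size, seq_len, vocab_size = 2, 5, 7
logits = torch.randn(batch_size, vocab_size, seq_len)        # (B, V, T), the layout the losses expect
targets = torch.randint(vocab_size, (batch_size, seq_len))   # (B, T)

# Before: explicit log-softmax over the vocabulary dimension, then NLLLoss.
old_loss = nn.NLLLoss()(torch.log_softmax(logits, dim=1), targets)

# After: CrossEntropyLoss takes the raw logits and applies log-softmax itself;
# label smoothing is passed straight through to the loss (torch >= 1.10).
new_loss = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, targets)

# With label_smoothing=0.0 the two formulations are numerically identical.
assert torch.allclose(old_loss, nn.CrossEntropyLoss(label_smoothing=0.0)(logits, targets))
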
@kylebgorman requested a review from @Adamits on June 30, 2023 at 18:00
@kylebgorman (Contributor, Author)

@Adamits requesting review when you have time: this isn't working for me.

@kylebgorman (Contributor, Author) commented Jul 2, 2023

I now think there was a regression earlier, probably related to the student forcing CL.

@Adamits (Collaborator) commented Jul 2, 2023

Ah, OK, do you know what it is? I have to check which commit I am at, but when I had some time yesterday, I remade my change and ran it on Maori from Abstractness. I trained on files 1-8 and used 9 for validation, and it was at 60% accuracy after a few epochs for attentive_lstm.

Can you give me more specific steps to reproduce your 0% accuracy case? I can check which commit I was at and try to help! Or do you have a good idea of what happened now?

@kylebgorman (Contributor, Author) commented Jul 2, 2023

On macOS:

  • Install the version from PyPI: pip install yoyodyne
  • Run the following script:
#!/bin/bash

set -eou pipefail

readonly DATA=../../abstractness/data
readonly ARCH=attentive_lstm
# TODO: Set according to language.
readonly LANGUAGE=mri
readonly MODEL_DIR=../models/attentive_bilstm

readonly TRAIN="${DATA}/${LANGUAGE}/${LANGUAGE}_train.tsv"
cat "${DATA}/${LANGUAGE}/${LANGUAGE}_"[0-7]".tsv" > "${TRAIN}"
trap "rm -f ${TRAIN}" EXIT
readonly DEV="${DATA}/${LANGUAGE}/${LANGUAGE}_8.tsv"

# Model parameters from Kann & Schütze (MED):
# https://aclanthology.org/W16-2010/
yoyodyne-train \
    --experiment "${LANGUAGE}" \
    --train "${TRAIN}" \
    --dev "${DEV}" \
    --model_dir "${MODEL_DIR}" \
    --arch "${ARCH}" \
    --embedding_size 300 \
    --hidden_size 100 \
    --batch_size 20 \
    --dropout .3 \
    --gradient_clip_val 3 \
    --learning_rate 1 \
    --accelerator gpu \
    --max_epochs 60 \
    --log_every_n_step 20 \
    --optimizer adadelta \
    --seed 49 \
    --precision 16

It never makes it past a loss of about 2.9, and accuracy is stuck at zero.

@Adamits (Collaborator) commented Jul 2, 2023

After installing from PyPI as you suggest, I ran a few epochs on CPU on my Mac. That is, I copied your script but changed --accelerator gpu to --accelerator cpu.

After 8 epochs I have:

Epoch 8:   0%|    | 0/25 [00:00<?, ?it/s, loss=0.268, v_num=0, val_accuracy=0.419, val_loss=0.849]

I will test on GPU later when I get to the airport!

P.S. The version on PyPI does not have this commit in it, right? You were able to produce this behavior without the update in this PR?

@kylebgorman (Contributor, Author)

Okay, wow, that's it: the MPS support is broken... just use CPU on macOS in the meantime...

@Adamits (Collaborator) commented Jul 3, 2023

Should we keep this with a doc or warning, or should we revert it? Also, what version of torch are you on?

@kylebgorman (Contributor, Author)

The problem is very clearly upstream, in PyTorch itself, and is probably limited to an unknowable combination of PyTorch and/or Lightning versions. My inclination is to first complete the migration to PyTorch 2 (#60) and then document it if it persists:

PyTorch supports the use of Apple's [Metal Performance Shaders](https://developer.apple.com/metal/) (or MPS) as an accelerator. However, we find that this accelerator backend may be buggy; models which converge on CPU, or on other platforms, may fail to converge when trained with MPS. We do not currently recommend using `--accelerator mps` or `--accelerator gpu` on macOS, though the latter can be safely used with a third-party (i.e., NVIDIA) GPU.

Re: this PR itself, I'm now testing it (slowly, on CPU) and will merge if it works.

@kylebgorman (Contributor, Author)

In case this wasn't clear I'm now reasonably certain that the issue I was seeing in this PR is related to the Torch MPS backend and not the change itself. I'm testing the PR itself on CPU now.

@Adamits (Collaborator) commented Jul 3, 2023

In case this wasn't clear I'm now reasonably certain that the issue I was seeing in this PR is related to the Torch MPS backend and not the change itself. I'm testing the PR itself on CPU now.

Ahhh right. For some reason I was thinking it was related to MPS + CrossEntropyLoss.

FWIW I did not realize I could run with --accelerator gpu on my MacBook Pro, but I tried it and it was converging for me.

@kylebgorman (Contributor, Author)

Ahhh right. For some reason I was thinking it was related to MPS + CrossEntropyLoss.

Me too, initially.

FWIW I did not realize I could run with --accelerator gpu on my MacBook Pro, but I tried it and it was converging for me.

If you do that and you don't also have a third-party NVIDIA GPU hooked up, it trains using the MPS backend rather than CUDA. I have seen this characterized as "experimental" elsewhere. But as you know from #60, we are currently stuck on an elderly, pre-2.0 version of Torch and PL, and that might be related... which is why we should take care of #60 soon.
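
As an aside, a quick way to check which accelerator backends PyTorch actually sees on a given machine; this is just a sketch and assumes torch >= 1.12, where torch.backends.mps was introduced:

import torch

# On a Mac without an NVIDIA GPU, CUDA is unavailable, so requesting the GPU
# accelerator ends up on the MPS backend, as described above.
print("CUDA available:", torch.cuda.is_available())
print("MPS available:", torch.backends.mps.is_available())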

@Adamits (Collaborator) commented Jul 3, 2023

I think we may need an update to the pointer-generator, which first takes a softmax (https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/models/pointer_generator.py#L135), then uses that to compute the generation probabilities, and lastly takes the log (https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/models/pointer_generator.py#L157).

I might need to check the details of the paper again to be sure about what's right (i.e., can we use logits when computing the pointer/generator distributions).
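
A heavily hedged schematic of the computation those links point to; the function and argument names below are invented, and the real pointer_generator.py differs in detail. The point is that the vocabulary scores have to be softmaxed before they can be mixed with the copy (attention) distribution, and only then is the log taken, so the model naturally produces log-probabilities rather than logits:

import torch

def pointer_generator_output(vocab_logits, attention_weights, src_token_ids, p_gen):
    """Schematic pointer-generator mixture (not the actual yoyodyne code).

    vocab_logits:      (B, V) raw decoder scores over the target vocabulary.
    attention_weights: (B, S) softmaxed attention over source positions.
    src_token_ids:     (B, S) vocabulary indices of the source tokens.
    p_gen:             (B, 1) generation probability in [0, 1].
    """
    # The generation distribution must be a probability distribution before mixing ...
    p_vocab = torch.softmax(vocab_logits, dim=-1)                                    # (B, V)
    # ... so it can be combined with the copy distribution scattered onto the vocabulary.
    p_copy = torch.zeros_like(p_vocab).scatter_add(-1, src_token_ids, attention_weights)
    p_final = p_gen * p_vocab + (1.0 - p_gen) * p_copy      # probabilities, not logits
    # Taking the log yields log-probabilities, which pair with NLLLoss;
    # CrossEntropyLoss instead expects raw logits.
    return torch.log(p_final + 1e-12)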

@Adamits (Collaborator) left a review comment


Accidentally added my main comment as a separate comment. Inline comments can probably be ignored.

Inline comments on yoyodyne/models/base.py (resolved).
@Adamits (Collaborator) commented Jul 3, 2023

I tried updating pointer_generator so that the source attention additionally returns logits, and using the P_vocab (which we call output_probs) as logits instead of taking a softmax, but this seems to change the results (for the worse).

I guess this does change the computation graph, so I need to think about this a little harder.

@kylebgorman (Contributor, Author) commented Jul 3, 2023 via email

FWIW the ptr-gen version in this PR is working great for me on Polish…are you getting something different?

@Adamits (Collaborator) commented Jul 3, 2023

FWIW the ptr-gen version in this PR is working great for me on Polish…are you getting something different?

I was trying the Maori example. Are you able to reproduce the same results as before the change? I believe the current state of it computes log(softmax) in the decode step, and then CrossEntropyLoss applies log-softmax again inside the loss function.

@kylebgorman (Contributor, Author)

This works fine:

#!/bin/bash
# Training for the pointer-generator LSTM.

set -eou pipefail

readonly DATA=../../abstractness/data
readonly ARCH=pointer_generator_lstm
# TODO: Set according to language.
readonly LANGUAGE=pol
readonly MODEL_DIR="../models/pointer_generator_bilstm"

readonly TRAIN="${DATA}/${LANGUAGE}/${LANGUAGE}_train.tsv"
cat "${DATA}/${LANGUAGE}/${LANGUAGE}_"[0-7]".tsv" > "${TRAIN}"
trap "rm -f ${TRAIN}" EXIT
readonly DEV="${DATA}/${LANGUAGE}/${LANGUAGE}_8.tsv"

# Model parameters from Sharma et al. for the high setting:
# https://aclanthology.org/K18-3013/
yoyodyne-train \
    --experiment "${LANGUAGE}" \
    --train "${TRAIN}" \
    --dev "${DEV}" \
    --model_dir "${MODEL_DIR}" \
    --features_col 3 \
    --arch "${ARCH}" \
    --embedding_size 300 \
    --hidden_size 100 \
    --learning_rate .001 \
    --optimizer adam \
    --label_smoothing .1 \
    --batch_size 32 \
    --dropout .3 \
    --gradient_clip_val 3 \
    --max_epochs 60 \
    --seed 49 \
    --precision bf16

The pointer generator does call softmax on the output probabilities, and then once they're combined with the generation probabilities, calls torch.log on them. Should I do something there?

(There's also a log softmax in the attention module, and in the transducer, but I don't think they can be factored out.)

@Adamits (Collaborator) commented Jul 5, 2023

This works fine:

#!/bin/bash
# Training for the pointer-generator LSTM.

set -eou pipefail

readonly DATA=../../abstractness/data
readonly ARCH=pointer_generator_lstm
# TODO: Set according to language.
readonly LANGUAGE=pol
readonly MODEL_DIR="../models/pointer_generator_bilstm"

readonly TRAIN="${DATA}/${LANGUAGE}/${LANGUAGE}_train.tsv"
cat "${DATA}/${LANGUAGE}/${LANGUAGE}_"[0-7]".tsv" > "${TRAIN}"
trap "rm -f ${TRAIN}" EXIT
readonly DEV="${DATA}/${LANGUAGE}/${LANGUAGE}_8.tsv"

# Model parameters from Sharma et al. for the high setting:
# https://aclanthology.org/K18-3013/
yoyodyne-train \
    --experiment "${LANGUAGE}" \
    --train "${TRAIN}" \
    --dev "${DEV}" \
    --model_dir "${MODEL_DIR}" \
    --features_col 3 \
    --arch "${ARCH}" \
    --embedding_size 300 \
    --hidden_size 100 \
    --learning_rate .001 \
    --optimizer adam \
    --label_smoothing .1 \
    --batch_size 32 \
    --dropout .3 \
    --gradient_clip_val 3 \
    --max_epochs 60 \
    --seed 49 \
    --precision bf16

The pointer generator does call softmax on the output probabilities, and then once they're combined with the generation probabilities, calls torch.log on them. Should I do something there?

(There's also a log softmax in the attention module, and in the transducer, but I don't think they can be factored out.)

OK, I will run this. I think we need to make sure it reproduces the results with NLLLoss vs. CrossEntropyLoss, though. For the attention softmax, I was just returning the logits along with the attention weights and using the logits for computing the output, but this changed the results (I think it must be changing the backward pass, because the forward computation should be the same).
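
A hypothetical sketch of the "return the logits along with the attention weights" idea described above; the class and names are made up, and yoyodyne's actual attention module differs. The raw scores are returned untouched while the softmaxed weights remain available for building the context vector:

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Toy attention that exposes both raw scores and normalized weights."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_size * 2, 1)

    def forward(self, query: torch.Tensor, keys: torch.Tensor):
        # query: (B, H); keys: (B, S, H).
        expanded = query.unsqueeze(1).expand(-1, keys.size(1), -1)
        scores = self.scorer(torch.cat([expanded, keys], dim=-1)).squeeze(-1)  # (B, S) raw scores
        weights = torch.softmax(scores, dim=-1)                                # (B, S) for the context vector
        return weights, scores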

@Adamits (Collaborator) commented Jul 6, 2023

Tried a bunch of things and still no luck reproducing the two softmaxes + NLLLoss by somehow getting both as logits and then using CrossEntropyLoss. To be clear, the complication is that we need the attention softmax. Different versions of getting that softmax while also getting the logits used for computing the output score seem to change the results (presumably since we change the computation graph, and thus backprop through a slightly altered function).

@Adamits (Collaborator) commented Jul 8, 2023

Thought about this a bit more, and I don't think softmax and addition commute, that is, for tensors x and y:

log(softmax(x + y)) != log(softmax(x) + softmax(y))

In particular, notice that e^{x + y} == e^{x} * e^{y}, whereas we want e^{x} + e^{y}.

Still I wonder if there is a trick.
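
A tiny numerical check of the point above; x and y are toy stand-ins for the two sets of scores being mixed:

import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([0.5, -1.0, 2.0])

lhs = torch.log_softmax(x + y, dim=-1)                                 # log softmax of the summed scores
rhs = torch.log(torch.softmax(x, dim=-1) + torch.softmax(y, dim=-1))   # log of the summed softmaxes

print(lhs)                        # approximately [-3.55, -4.05, -0.05]: a proper log-distribution
print(rhs)                        # not even a log-distribution: softmax(x) + softmax(y) sums to 2
print(torch.allclose(lhs, rhs))   # False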

@kylebgorman (Contributor, Author)

Thought about this a bit more, and I don't think softmax and addition commute

You mean distribute, right?

@Adamits (Collaborator) commented Jul 8, 2023

Thought about this a bit more, and I don't think softmax and addition commute

You mean distribute, right?

Oops, yes :)

@kylebgorman (Contributor, Author)

Just updating this thread with some offline chat. @Adamits is trying to figure out how to make this work well with the pointer-generator, since that does sort of need non-logit probabilities in its internal calculations.

@kylebgorman (Contributor, Author)

Closing this because the relevant work is happening in response to #80 in PR #83; @Adamits is on it.

@kylebgorman deleted the smoothing branch December 9, 2024 22:47
Successfully merging this pull request may close these issues:

Switch to CrossEntropyLoss? (#80)