Uses torchmetrics for metric computation #284

Draft: kylebgorman wants to merge 4 commits into master
Conversation

kylebgorman
Contributor

@kylebgorman kylebgorman commented Dec 8, 2024

Closes #158.

  • Loss is computed as before, but streamlined somewhat.
  • torchmetrics' implementation of exact match accuracy is lightly adapted. This does everything in tensor-land and should keep things on devices. My tests confirm that accuracy is what it was before.
  • A torchmetrics-compatible implementation of symbol error rate (here defined as the edit distance divided by the sum of target lengths) is added; see the sketch below this list. It is heavily documented and compatible with our existing implementation. The hot inner loop is still on CPU, but as mentioned in the documentation, this is probably the best option, and I don't observe any obvious performance penalty when enabling it. Note that the old computation did not follow the usual definition: it gave the number of edits per word, not the number of edits per gold target symbol.
  • We do away with the evaluation module altogether. Rather, we treat the metrics objects as nullables living in the base class, a design adapted from UDTube.
  • Both loss and the metrics expect logits tensors of shape B x target_vocab_size x seq_len, but the RNN library code wants B x seq_len x target_vocab_size. I choose the former as the "default" representation and treat the RNN dimensionality as an implementation detail.

The CLI interface is unimpacted, and my side-by-side shows the metrics are exactly the same as before this change.
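For concreteness, here is a minimal sketch of what a torchmetrics-compatible symbol error rate metric could look like under the definition above (total edit distance divided by the summed length of the gold targets). This is not the code in this PR; the class name, the padding handling, and the pure-Python Levenshtein helper are illustrative assumptions.

```python
import torch
from torchmetrics import Metric


def _edit_distance(hypo: list, gold: list) -> int:
    # Standard dynamic-programming Levenshtein distance; this is the hot
    # inner loop, running on CPU over plain Python lists.
    dp = list(range(len(gold) + 1))
    for i, h in enumerate(hypo, 1):
        prev, dp[0] = dp[0], i
        for j, g in enumerate(gold, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # skip a hypothesis symbol
                dp[j - 1] + 1,    # skip a gold symbol
                prev + (h != g),  # substitute (free if the symbols match)
            )
    return dp[-1]


class SymbolErrorRate(Metric):
    """Total edit distance divided by the summed length of the gold targets."""

    def __init__(self, pad_idx: int, **kwargs):
        super().__init__(**kwargs)
        self.pad_idx = pad_idx
        # Accumulators are tensor states so distributed reduction works.
        self.add_state("edits", default=torch.tensor(0), dist_reduce_fx="sum")
        self.add_state("length", default=torch.tensor(0), dist_reduce_fx="sum")

    def update(self, hypos: torch.Tensor, golds: torch.Tensor) -> None:
        # hypos, golds: B x seq_len tensors of symbol indices.
        for hypo, gold in zip(hypos, golds):
            hypo = [s for s in hypo.tolist() if s != self.pad_idx]
            gold = [s for s in gold.tolist() if s != self.pad_idx]
            self.edits += _edit_distance(hypo, gold)
            self.length += len(gold)

    def compute(self) -> torch.Tensor:
        return self.edits.float() / self.length
```

Under this definition, two edits against gold targets totaling ten symbols yield a symbol error rate of 0.2.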

@Adamits
Collaborator

Adamits commented Dec 10, 2024

Nice!

  • Loss is computed as before, but streamlined somewhat.
  • torchmetrics' implementation of exact match accuracy is lightly adapted. This does everything in tensor-land and should keep things on devices. My tests confirm that accuracy is what it was before.

Cool, I think when we added our implementation, torchmetrics was requiring strings, right? Glad to see this.

  • A torchmetrics-compatible implementation of symbol error rate (here defined as the edit distance divided by sum of target lengths) is inserted here. This is heavily documented and it is compatible with our existing implementation. The hot inner loop is still on CPU, but as mentioned in the documentation, this is probably the best option and I don't observe any obvious performance penalty when enabling this. Note that the computation of the old one was not as normally defined and gave the number of edits per word, not the number of edits per gold target symbol.

So we go from a denominator of 1 (I am interpreting word as sequence, since we do not do pretokenization) to a denominator of the length of the gold symbol sequence? Makes sense if so, but I cannot really remember what is standard.

  • We do away with the evaluation module altogether. Rather we treat the metrics objects as nullables living in the base class, a design adapted from UDTube.

Generally awesome!! Looking at the implementation, I am wondering why we replace the generic set of evals with specific metric attributes and metric booleans. What are the benefits of this (where I see the downsides as adding some bloat to the code, and lots of steps for including new metrics)?

  • Both loss and the metrics expect logits tensors to be of the shape B x target_vocab_size x seq_len, but the RNN library code wants it to be B x seq_len x target_vocab_size. I choose the former to be the "default" representation and let the RNN dimensionality be an implementational detail.

Does "RNN library code" mean the forward function implemented in PyTorch? I will look closer at the code, but my feeling is that it is conceptually better to maintain a consistent shape and just reshape the tensor as close as possible to the function that needs a different shape.

@@ -64,11 +66,12 @@ def __init__(
vocab_size,
# All of these have keyword defaults.
beam_width=defaults.BEAM_WIDTH,
compute_accuracy=False,
Collaborator

Why are we initializing these here since we define a property with a method below?

@property
def num_parameters(self) -> int:
    return sum(part.numel() for part in self.parameters())

def compute_accuracy(self) -> bool:
Collaborator

There is always a piece of me that thinks naming boolean properties like this in a dynamically typed language is confusing (it sounds like a method for computing the accuracy). Of course, alternatives that try to be cute are not my favorite either (e.g. should_compute_accuracy/is_compute_accuracy), so I don't know if I'd argue for a change, but I always want to call it out in case someone has a better idea :D.

Contributor Author

I'll try has_accuracy instead.
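Purely for illustration (the names here are assumptions, not the PR's code), the predicate reading of has_accuracy might look like this, with the nullable metric object as the backing state:

```python
from typing import Optional

import torchmetrics


class BaseModel:
    # The metric object is nullable: it exists only if the user requested it.
    accuracy: Optional[torchmetrics.Metric]

    def __init__(self, accuracy: Optional[torchmetrics.Metric] = None):
        self.accuracy = accuracy

    @property
    def has_accuracy(self) -> bool:
        # Reads as a predicate, unlike `compute_accuracy`, which sounds like
        # a method that performs the computation.
        return self.accuracy is not None
```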

dropout_layer: nn.Dropout
eval_metrics: Set[evaluators.Evaluator]
Collaborator

Ok, I made a top-level comment about this. Looking at the code and thinking about it, I suppose that if we wanted to avoid all of the metric-specific properties/null checks, we would need to implement and maintain a generic Metric, which might create friction for implementing new metrics, so I agree with this change. Adding a new metric probably only requires adding code in 3 places in this file (beyond implementing the metric and updating the CLI), right?
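To make the "code in 3 places" estimate concrete, here is a rough sketch of the plumbing a new metric might need in the base class. The names (MyMetric, compute_my_metric, _reset_metrics, _update_metrics) and the plain-class framing are illustrative assumptions, not the PR's actual code.

```python
from typing import Optional

import torch
from torchmetrics import Metric


class MyMetric(Metric):
    """Hypothetical new metric; only the plumbing below is the point."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.add_state("total", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("count", default=torch.tensor(0), dist_reduce_fx="sum")

    def update(self, hypos: torch.Tensor, golds: torch.Tensor) -> None:
        # Toy statistic: per-symbol agreement on same-shaped index tensors.
        self.total += (hypos == golds).float().sum()
        self.count += golds.numel()

    def compute(self) -> torch.Tensor:
        return self.total / self.count


class BaseModel:
    my_metric: Optional[MyMetric]

    # (1) Instantiate the metric only when the user asks for it.
    def __init__(self, compute_my_metric: bool = False):
        self.my_metric = MyMetric() if compute_my_metric else None

    # (2) Reset accumulated state at the start of each evaluation epoch.
    def _reset_metrics(self) -> None:
        if self.my_metric is not None:
            self.my_metric.reset()

    # (3) Feed predictions in each step; compute/log at epoch end.
    def _update_metrics(self, hypos: torch.Tensor, golds: torch.Tensor) -> None:
        if self.my_metric is not None:
            self.my_metric.update(hypos, golds)
```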


Returns:
Dict: averaged metrics over all validation steps.
def test_step(
Collaborator

This is new, yes? I guess it does the same thing as validation but specifies "test" mode?

Contributor Author

Yeah it's something you can do in LightningCLI.
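For reference, the Lightning hook in question has the same shape as validation_step and is reached via the test subcommand of LightningCLI (or Trainer.test). A minimal hedged sketch, where the loss placeholder and logging prefixes are assumptions rather than this repo's code:

```python
import torch
from lightning.pytorch import LightningModule


class Model(LightningModule):
    def _evaluate(self, batch, subset: str) -> torch.Tensor:
        loss = torch.tensor(0.0)  # placeholder for the real loss computation
        self.log(f"{subset}_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx) -> torch.Tensor:
        return self._evaluate(batch, "val")

    def test_step(self, batch, batch_idx) -> torch.Tensor:
        # Same logic as validation, but logged under the "test" prefix.
        return self._evaluate(batch, "test")
```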

)
return loss

def _reset_metrics(self) -> None:
Collaborator

Is there a link to the torchmetrics docs explaining this that we could put here?
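For context (a general torchmetrics behavior rather than anything specific to this PR): a Metric accumulates its registered states across update() calls, and reset() restores them to their defaults, which is why a reset belongs at the start of each evaluation epoch. For example:

```python
import torch
from torchmetrics import MeanMetric

metric = MeanMetric()
metric.update(torch.tensor(1.0))
metric.update(torch.tensor(3.0))
print(metric.compute())  # tensor(2.): state accumulated across updates
metric.reset()           # back to the default (empty) state
metric.update(torch.tensor(5.0))
print(metric.compute())  # tensor(5.): unaffected by the earlier updates
```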


def validation_step(self, batch: data.PaddedBatch, batch_idx: int) -> Dict:
def on_validation_epoch_start(self) -> None:
Collaborator

Shouldn't this be inherited? Or does some class in its inheritance override this so we have to override it back?

def on_validation_epoch_start(self) -> None:
self._reset_metrics()

def validation_step(self, batch: data.PaddedBatch, batch_idx: int) -> None:
Collaborator

This also looks the same as the base method.

self._log_loss(loss, len(batch), "val")
self._update_metrics(predictions, batch.target.padded)

def on_validation_epoch_end(self) -> None:
Collaborator

ditto

@kylebgorman
Contributor Author

> Nice!

Just to say: this isn't really ready for review yet. It depends on a lot of other small changes which I'll make first. I thought I could do it all in one go: I was wrong.

> Cool, I think when we added our implementation, torchmetrics was requiring strings, right? Glad to see this.

Yep. It wasn't too hard to make our own.
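As an illustration of the tensor-land approach (a sketch under an assumed padding convention, not the implementation in this PR): if hypotheses and gold targets are index tensors padded with the same symbol to the same length, exact match reduces to a row-wise comparison that never leaves the device.

```python
import torch


def exact_match_accuracy(hypos: torch.Tensor, golds: torch.Tensor) -> torch.Tensor:
    """Fraction of sequences predicted exactly.

    Assumes hypos and golds are B x seq_len index tensors that use the same
    padding symbol and are padded out to the same length, so a row-wise
    equality check suffices.
    """
    return (hypos == golds).all(dim=1).float().mean()
```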

> So we go from a denominator of 1 (I am interpreting word as sequence, since we do not do pretokenization) to a denominator of the length of the gold symbol sequence? Makes sense if so, but I cannot really remember what is standard.

Actually, I misspoke here somewhat because I misread the code: the denominator was the length of the tensor; now it's the length of the string the tensor denotes.

> Generally awesome!! Looking at the implementation, I am wondering why we replace the generic set of evals with specific metric attributes and metric booleans. What are the benefits of this (where I see the downsides as adding some bloat to the code, and lots of steps for including new metrics)?

Benefits: the data lives on the accelerator, like the loss data does. I actually don't think adding a metric is meaningfully more difficult than it was previously, so I don't see any big downsides either. It also can be documented without much trouble.

> • Both loss and the metrics expect logits tensors of shape B x target_vocab_size x seq_len, but the RNN library code wants B x seq_len x target_vocab_size. I choose the former as the "default" representation and treat the RNN dimensionality as an implementation detail.

> Does "RNN library code" mean the forward function implemented in PyTorch? I will look closer at the code, but my feeling is that it is conceptually better to maintain a consistent shape and just reshape the tensor as close as possible to the function that needs a different shape.

I originally piloted with that. Most things in the Torch universe (including loss functions, but also everything in torchmetrics) assume that the first dimension is batch size and the second is vocabulary size (the third is length), so it's basically a quirk of the library's RNN modules' forward pass that they swap vocabulary size and length. Since the output of those RNN modules needs some postprocessing anyway (run through a classifier, combined with pointer-generator information, softmaxed, etc.), it makes sense to me to transpose them back to the Torch default once ("set it and leave it") during those steps. I think this can be done in just a few places and will be an improvement in usability. Transpositions are not, I assume, computationally expensive (I assume they just change the "striding" logic by making a minor edit to some metadata in the tensors), but they are a big maintenance burden: 90% of my debugging here involves transpositions and making shapes conform.
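On the cost question (a general PyTorch fact, not something specific to this PR): transpose returns a view that only rewrites shape and stride metadata, so no data moves unless a later op forces a contiguous copy. And nn.CrossEntropyLoss does indeed expect the class dimension second, i.e. B x target_vocab_size x seq_len for sequence output.

```python
import torch

logits = torch.randn(8, 20, 100)  # B x seq_len x target_vocab_size
swapped = logits.transpose(1, 2)  # B x target_vocab_size x seq_len
print(swapped.shape)  # torch.Size([8, 100, 20])
# Same underlying storage: transpose is metadata-only, not a copy.
print(swapped.data_ptr() == logits.data_ptr())  # True
print(logits.stride(), swapped.stride())  # (2000, 100, 1) vs. (2000, 1, 100)
```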

@Adamits
Collaborator

Adamits commented Dec 23, 2024


Sorry for my delay. This mostly all makes sense to me. "90% of my debugging here involves transpositions and making shapes conform" -- this has typically been my experience in general when writing torch code :). I am mostly trying to suggest that having a yoyodyne-wide default assumption about what shape tensors are in would be nice -- and I think it would make it conceptually easier to visualize tensors as you code. If I follow correctly, though, the decisions you made about reshaping sound very reasonable.
