added support for using multiple losses and metrics in evaluator #8
Conversation
I like the structure of the code and the introduction of small functions. It helps a lot with reading the code!
I was thinking about the next steps for the metrics and losses: they should be configurable via the config and assembled dynamically.
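A sketch of what such config-driven assembly could look like; the registry, the config layout, and the build_measures helper are illustrative assumptions, not existing code:

```python
from typing import Any, Callable, Dict, List

class AggregativeMeasure:
    """Stub standing in for the PR's aggregated measures."""
    def __init__(self, **kwargs: Any) -> None:
        self.params = kwargs

class AggregativeCLMCrossEntropyLoss(AggregativeMeasure):
    pass

class AggregativePerplexity(AggregativeMeasure):
    pass

# hypothetical registry mapping config type names to measure constructors
MEASURE_REGISTRY: Dict[str, Callable[..., AggregativeMeasure]] = {
    "clm_cross_entropy": AggregativeCLMCrossEntropyLoss,
    "perplexity": AggregativePerplexity,
}

def build_measures(measure_configs: List[Dict[str, Any]]) -> List[AggregativeMeasure]:
    # each config entry, e.g. {"type": "perplexity", "target_key": "target_ids"},
    # selects a constructor by its "type" and passes the remaining keys as kwargs
    return [
        MEASURE_REGISTRY[cfg["type"]](**{k: v for k, v in cfg.items() if k != "type"})
        for cfg in measure_configs
    ]
```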
Co-authored-by: Teucher, Roman <[email protected]>
* adding generic measure class
* adding perplexity as well as cross entropy loss computations
* starting unit tests
Is this PR finalised and ready for review?
Done:
- integrated new aggregated measures into evaluator
- updated and expanded evaluator tests
- extracted throughput measurement into a class
- fixed missing keys in measure implementations
- code clean-up

Todo:
- integration into trainer
- parameterization of eval losses/metrics
- testing the loss implementations (in particular perplexity)
…and_metrics_in_evaluator
* add aggregative measure factories to constructor of Evaluator
* extract big train method into smaller ones
* adapt test_evaluator to new params of changed Evaluator class
* gym: remove loss functions and metrics for evaluation from evaluation method
WIP: The current version does not yet work.
…xity impl and test
* add throughput aggregator factory callable; needed because otherwise it was instantiated in the class itself and was not mockable
* fix perplexity computation: now tracking losses, summing over them, and then computing torch.exp(loss_sum / #samples)
Perplexity now gets computed correctly for each sequence in a batch and summed up afterwards.
Also: Small fix in perplexity tests.
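For reference, a minimal sketch of the described computation; the function name and tensor shapes are assumptions, not the PR's code:

```python
import torch
import torch.nn.functional as F

def batch_perplexity_sum(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Per-sequence perplexity, summed over the batch.

    logits: (batch, seq_len, vocab), targets: (batch, seq_len).
    """
    # token-level cross entropy without reduction -> shape (batch, seq_len);
    # F.cross_entropy expects the class dimension second, hence the transpose
    token_losses = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    # perplexity of each sequence is exp of its mean token loss
    per_sequence_ppl = torch.exp(token_losses.mean(dim=-1))
    # summed up afterwards, as described; divide by the sample count later to average
    return per_sequence_ppl.sum()
```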
… "mean". Otherwise (with reduction="sum"), this would drastically impact training loss. Instead, for accumulating and logging the training loss, the added losses now get divided by number of batches instead of the added batch sizes.
Also, small refactoring of corresponding tests.
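A small sketch of the accumulation scheme this describes (names are assumed):

```python
from typing import Iterable
import torch

def accumulate_train_loss(batch_losses: Iterable[torch.Tensor]) -> torch.Tensor:
    # each element comes from a loss with reduction="mean", i.e. it is already
    # averaged over the samples of its batch
    total = torch.tensor(0.0)
    num_batches = 0
    for batch_loss in batch_losses:
        total = total + batch_loss.detach()
        num_batches += 1
    # divide by the number of batches, not by the summed batch sizes
    return total / num_batches
```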
…_losses_and_metrics_in_evaluator

# Conflicts:
# config_files/config_example_hf_meditron_7B_instruction.yaml
# config_files/config_example_mem_map_dataset.yaml
# config_files/config_lorem_ipsum.yaml
# src/modalities/__main__.py
# src/modalities/config/config.py
# src/modalities/config/lookup_types.py
# src/modalities/evaluator.py
# src/modalities/resolver_register.py
# src/modalities/trainer.py
# tests/conftest.py
…ltiple_losses_and_metrics_in_evaluator
Really good work on the evaluation/aggregation part. I left a bunch of comments (some minor and some that need discussion). Primarily, I'm not convinced about the metric and loss factories being instantiated at a very low level. I would prefer if the instantiation happened completely in the hierarchical instantiation and these aggregated measures (instead of the factories) were passed to the evaluator. I think that's the main point that needs discussion, and I would like to hear your thoughts on it :-)
FYI, I did not take a look at the tests yet, but wanted to add my comments already, as I think they will be helpful.
config_files/config.yaml
Outdated
The config is outdated w.r.t. component_key and variant_key.
```python
) -> torch.Tensor:
    # we clone the value so that we can always resync the value without side-effects
    cloned_value = self._key_to_value[key].clone()
    value = Reducer.reduce(tensor=cloned_value, operation=reduce_operation)
```
Since we have the hierarchical instantiation up and running now, we should pass in the reducer via the constructor.
We can think of different reducers, e.g., a torch distributed reducer that reduces the tensors across ranks. Another reducer for single-GPU training without FSDP (which is still a todo) could just call torch.mean(). What do you think?
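For illustration, a rough sketch of such constructor-injected reducers; the interface and class names are assumptions, and the distributed variant presumes an initialized process group:

```python
from abc import ABC, abstractmethod

import torch
import torch.distributed as dist

class ReducerIF(ABC):
    @abstractmethod
    def reduce(self, tensor: torch.Tensor) -> torch.Tensor:
        ...

class DistributedMeanReducer(ReducerIF):
    # averages a tensor across all ranks via torch.distributed
    def reduce(self, tensor: torch.Tensor) -> torch.Tensor:
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        return tensor / dist.get_world_size()

class LocalMeanReducer(ReducerIF):
    # single-GPU training without FSDP: just take the local mean
    def reduce(self, tensor: torch.Tensor) -> torch.Tensor:
        return torch.mean(tensor)
```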
Yes, I agree that this makes sense. But I probably would postpone such a change until we actually have a second Reducer.
```python
        AggregativeCLMCrossEntropyLossFactory,
        CLMCrossEntropyLossConfig,
    ),
    ComponentEntity("eval_measures", "perplexity", AggregativePerplexityFactory, CLMCrossEntropyLossConfig),
```
The config does not fit the AggregativePerplexityFactory from a naming perspective.
src/modalities/evaluator.py
Outdated
```python
            dataloader_tag=data_loader.dataloader_tag,
        )

        return {loss: loss.compute() for loss in losses}, {metric: metric.compute() for metric in metrics}
```
Maybe it makes sense to unify losses and metrics simply as measures? Maybe we can even find a better word?
Also, what about replacing compute with aggregate?
Yeah, that sounds better. I renamed it.
I also agree that joining the metrics and losses would be better.
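A minimal sketch of what the joined interface could look like; apart from the agreed aggregate naming, the names here are assumptions:

```python
from abc import ABC, abstractmethod

import torch

class AggregativeMeasure(ABC):
    """One shared interface for eval losses and metrics."""

    @abstractmethod
    def add(self, batch_output: torch.Tensor) -> None:
        """Accumulate internal state from a single evaluation batch."""

    @abstractmethod
    def aggregate(self) -> torch.Tensor:
        """Collapse the accumulated state into the final value."""

# the two dict comprehensions from the diff above would then collapse into one:
# return {measure: measure.aggregate() for measure in measures}
```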
```python
        global_train_sample_id: int,
        local_sample_id_to_global_sample_id: Callable[[int], int],
    ) -> Tuple[Dict[str, torch.Tensor], Dict[str, torch.Tensor]]:
        losses = [f.create(self.local_rank) for f in self._loss_factories]
```
I'm not a big fan of the factories here, as they seem a bit overengineered.
How about we instantiate the aggregated measures already in the hierarchical instantiation and pass the measures to the evaluator in main?
We could implement a reset function for all measures that clears out their internal state. In this case, we could get rid of the factories here, right?
I don't think it would be a good idea to have access to the stateful measures outside of this context. That seems like a potential source of errors to me. Consider, for example, evaluating multiple dataloaders in parallel using the same loss object.
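For illustration, a minimal sketch contrasting the two options in this exchange; the class names loosely follow the PR, but the bodies are assumed:

```python
import torch

class AggregativePerplexity:
    def __init__(self) -> None:
        self._loss_sum = torch.tensor(0.0)
        self._num_samples = 0

    def reset(self) -> None:
        # option (a): keep one shared instance and clear its state between
        # dataloaders; risky if several evaluations use it in parallel
        self._loss_sum = torch.tensor(0.0)
        self._num_samples = 0

class AggregativePerplexityFactory:
    def create(self, local_rank: int) -> AggregativePerplexity:
        # option (b), as in the PR: a fresh, independent instance per
        # evaluation run, so parallel dataloaders never share state
        # (local_rank is unused in this sketch)
        return AggregativePerplexity()
```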
src/modalities/evaluator.py
Outdated
```python
    def _extract_num_samples(data_loader: LLMDataLoader) -> int:
        num_samples = len(data_loader.dataset)
        if data_loader.batch_size is not None and data_loader.drop_last:
```
In LLMDataLoader the batch_size is never None.
Yeah, but I think the LLMDataLoader should change in that regard, since the current form does not allow IterDatasets at all. So I would leave this here, to be on the safe side.
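The if-body is cut off in the excerpt above; a standalone sketch of the drop_last arithmetic it presumably implements:

```python
def extract_num_samples(dataset_len: int, batch_size: int, drop_last: bool) -> int:
    if drop_last:
        # an incomplete final batch is discarded, so only full batches count
        return (dataset_len // batch_size) * batch_size
    return dataset_len

# e.g. 103 samples with batch_size=10 and drop_last=True -> 100 evaluated samples
assert extract_num_samples(103, 10, True) == 100
```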
```python
        self,
        num_samples: int,
        local_rank: int,
        throughput_aggregator_factory: Callable[[], ThroughputAggregator] = ThroughputAggregator,
```
Could we pass in the aggregator directly? The reset function is already implemented.
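For illustration, a rough sketch of such a resettable aggregator; the fields and method names are assumptions:

```python
import time
from typing import Optional

class ThroughputAggregator:
    def __init__(self) -> None:
        self._num_samples = 0
        self._start_time: Optional[float] = None

    def reset(self) -> None:
        # clearing the internal state is what makes it safe to pass one shared
        # instance in directly, instead of a factory creating fresh ones
        self._num_samples = 0
        self._start_time = None

    def start(self) -> None:
        self._start_time = time.perf_counter()

    def add_samples(self, num_samples: int) -> None:
        self._num_samples += num_samples

    def samples_per_second(self) -> float:
        assert self._start_time is not None, "call start() before aggregating"
        return self._num_samples / (time.perf_counter() - self._start_time)
```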
```python
        self._target_key = target_key
        self._prediction_key = prediction_key

    def create(self, local_rank: int) -> AggregativeMeasure[PerplexityKeys]:
```
As mentioned earlier, I think the factories are overkill.
… compute() to aggregate().
…n Trainer instead of iterable decorator. This code should be more readable than using start_throughput_measurement().
…ltiple_losses_and_metrics_in_evaluator