added support for using multiple losses and metrics in evaluator #8
Conversation
I like the structure of the code and the introduction of small functions. It helps a lot with reading the code!
I was thinking about the next steps for the metrics and losses: they should be configurable via the config and assembled dynamically.
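A sketch of what such config-driven assembly could look like; the registry, the config layout, and the build_measures helper are illustrative assumptions, not existing code:

```python
from typing import Any, Callable, Dict, List

class AggregativeMeasure:
    """Stub standing in for the PR's aggregated measures."""
    def __init__(self, **kwargs: Any) -> None:
        self.params = kwargs

class AggregativeCLMCrossEntropyLoss(AggregativeMeasure):
    pass

class AggregativePerplexity(AggregativeMeasure):
    pass

# hypothetical registry mapping config type names to measure constructors
MEASURE_REGISTRY: Dict[str, Callable[..., AggregativeMeasure]] = {
    "clm_cross_entropy": AggregativeCLMCrossEntropyLoss,
    "perplexity": AggregativePerplexity,
}

def build_measures(measure_configs: List[Dict[str, Any]]) -> List[AggregativeMeasure]:
    # each config entry, e.g. {"type": "perplexity", "target_key": "target_ids"},
    # selects a constructor by its "type" and passes the remaining keys as kwargs
    return [
        MEASURE_REGISTRY[cfg["type"]](**{k: v for k, v in cfg.items() if k != "type"})
        for cfg in measure_configs
    ]
```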
Co-authored-by: Teucher, Roman <[email protected]>
* adding generic measure class
* adding perplexity as well as cross entropy loss computations
* starting unit tests
Is this PR finalised and ready for review?
Done:
- integrated new aggregated measures into evaluator
- updated and expanded evaluator tests
- extracted throughput measurement into a class
- fixed missing keys in measure implementations
- code clean-up

Todo:
- integration into trainer
- parameterization of eval losses/metrics
- testing the loss implementations (in particular perplexity)
…and_metrics_in_evaluator
* add aggregative measure factories to constructor of Evaluator
* extract big train method into smaller ones
* adapt test_evaluator to new params of changed Evaluator class
* gym: remove loss functions and metrics for evaluation from evaluation method
WIP: The current version does not yet work.
…xity impl and test
* add throughput aggregator factory callable; needed because otherwise it was instantiated in the class itself and was not mockable
* fix perplexity computation: now tracking losses, summing over them, and then computing torch.exp(loss_sum / #samples)
Perplexity now gets computed correctly for each sequence in a batch and summed up afterwards.
Also: Small fix in perplexity tests.
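For reference, a minimal sketch of the described computation; the function name and tensor shapes are assumptions, not the PR's code:

```python
import torch
import torch.nn.functional as F

def batch_perplexity_sum(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Per-sequence perplexity, summed over the batch.

    logits: (batch, seq_len, vocab), targets: (batch, seq_len).
    """
    # token-level cross entropy without reduction -> shape (batch, seq_len);
    # F.cross_entropy expects the class dimension second, hence the transpose
    token_losses = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    # perplexity of each sequence is exp of its mean token loss
    per_sequence_ppl = torch.exp(token_losses.mean(dim=-1))
    # summed up afterwards, as described; divide by the sample count later to average
    return per_sequence_ppl.sum()
```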
… "mean". Otherwise (with reduction="sum"), this would drastically impact training loss. Instead, for accumulating and logging the training loss, the added losses now get divided by number of batches instead of the added batch sizes.
Also, small refactoring of corresponding tests.
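A small sketch of the accumulation scheme this describes (names are assumed):

```python
from typing import Iterable
import torch

def accumulate_train_loss(batch_losses: Iterable[torch.Tensor]) -> torch.Tensor:
    # each element comes from a loss with reduction="mean", i.e. it is already
    # averaged over the samples of its batch
    total = torch.tensor(0.0)
    num_batches = 0
    for batch_loss in batch_losses:
        total = total + batch_loss.detach()
        num_batches += 1
    # divide by the number of batches, not by the summed batch sizes
    return total / num_batches
```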
…_losses_and_metrics_in_evaluator

# Conflicts:
# config_files/config_example_hf_meditron_7B_instruction.yaml
# config_files/config_example_mem_map_dataset.yaml
# config_files/config_lorem_ipsum.yaml
# src/modalities/__main__.py
# src/modalities/config/config.py
# src/modalities/config/lookup_types.py
# src/modalities/evaluator.py
# src/modalities/resolver_register.py
# src/modalities/trainer.py
# tests/conftest.py
…ltiple_losses_and_metrics_in_evaluator
Really good work on the evaluation/aggregation part. I left a bunch of comments (some minor and some that need discussion). Primarily, I'm not convinced about the metric and loss factories being instantiated at a very low level. I would prefer if the instantiation happened completely in the hierarchical instantiation and these aggregated measures (instead of the factories) were passed to the evaluator. I think that's the main point that needs discussion, and I would like to hear your thoughts on it :-)
FYI, I did not take a look at the tests yet, but wanted to add my comments already, as I think they will be helpful.
config_files/config.yaml
Outdated
The config is outdated w.r.t. component_key and variant_key.
```python
) -> torch.Tensor:
    # we clone the value so that we can always resync the value without side-effects
    cloned_value = self._key_to_value[key].clone()
    value = Reducer.reduce(tensor=cloned_value, operation=reduce_operation)
```
Since we have the hierarchical instantiation up and running now, we should pass in the reducer via the constructor.
We can think of different reducers, e.g., a torch distributed reducer that reduces the tensors across ranks. Another reducer for single-GPU training without FSDP (which is still a todo) could just call torch.mean(). What do you think?
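For illustration, a rough sketch of such constructor-injected reducers; the interface and class names are assumptions, and the distributed variant presumes an initialized process group:

```python
from abc import ABC, abstractmethod

import torch
import torch.distributed as dist

class ReducerIF(ABC):
    @abstractmethod
    def reduce(self, tensor: torch.Tensor) -> torch.Tensor:
        ...

class DistributedMeanReducer(ReducerIF):
    # averages a tensor across all ranks via torch.distributed
    def reduce(self, tensor: torch.Tensor) -> torch.Tensor:
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        return tensor / dist.get_world_size()

class LocalMeanReducer(ReducerIF):
    # single-GPU training without FSDP: just take the local mean
    def reduce(self, tensor: torch.Tensor) -> torch.Tensor:
        return torch.mean(tensor)
```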
Yes, I agree that this makes sense. But I probably would postpone such a change until we actually have a second Reducer.
```python
        AggregativeCLMCrossEntropyLossFactory,
        CLMCrossEntropyLossConfig,
    ),
    ComponentEntity("eval_measures", "perplexity", AggregativePerplexityFactory, CLMCrossEntropyLossConfig),
```
The config does not fit the AggregativePerplexityFactory from a naming perspective.
src/modalities/evaluator.py
Outdated
```python
            dataloader_tag=data_loader.dataloader_tag,
        )

        return {loss: loss.compute() for loss in losses}, {metric: metric.compute() for metric in metrics}
```
Maybe it makes sense to unify losses and metrics simply as measures? Maybe we can even find a better word?
Also, what about replacing compute with aggregate?
Yeah, that sounds better. I renamed it.
I also agree that joining the metrics and losses would be better.
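A minimal sketch of what the joined interface could look like; apart from the agreed aggregate naming, the names here are assumptions:

```python
from abc import ABC, abstractmethod

import torch

class AggregativeMeasure(ABC):
    """One shared interface for eval losses and metrics."""

    @abstractmethod
    def add(self, batch_output: torch.Tensor) -> None:
        """Accumulate internal state from a single evaluation batch."""

    @abstractmethod
    def aggregate(self) -> torch.Tensor:
        """Collapse the accumulated state into the final value."""

# the two dict comprehensions from the diff above would then collapse into one:
# return {measure: measure.aggregate() for measure in measures}
```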
```python
        global_train_sample_id: int,
        local_sample_id_to_global_sample_id: Callable[[int], int],
    ) -> Tuple[Dict[str, torch.Tensor], Dict[str, torch.Tensor]]:
        losses = [f.create(self.local_rank) for f in self._loss_factories]
```
I'm not a big fan of the factories here, as they seem a bit overengineered.
How about we instantiate the aggregated measures already in the hierarchical instantiation and pass the measures to the evaluator in main?
We could implement a reset function for all measures that clears out their internal state. In this case, we could get rid of the factories here, right?
I don't think it would be a good idea to have access to the stateful measures outside of this context. That seems like a potential source of errors to me. Consider, for example, evaluating multiple dataloaders in parallel using the same loss object.
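For illustration, a minimal sketch contrasting the two options in this exchange; the class names loosely follow the PR, but the bodies are assumed:

```python
import torch

class AggregativePerplexity:
    def __init__(self) -> None:
        self._loss_sum = torch.tensor(0.0)
        self._num_samples = 0

    def reset(self) -> None:
        # option (a): keep one shared instance and clear its state between
        # dataloaders; risky if several evaluations use it in parallel
        self._loss_sum = torch.tensor(0.0)
        self._num_samples = 0

class AggregativePerplexityFactory:
    def create(self, local_rank: int) -> AggregativePerplexity:
        # option (b), as in the PR: a fresh, independent instance per
        # evaluation run, so parallel dataloaders never share state
        # (local_rank is unused in this sketch)
        return AggregativePerplexity()
```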
src/modalities/evaluator.py
Outdated
```python
    def _extract_num_samples(data_loader: LLMDataLoader) -> int:
        num_samples = len(data_loader.dataset)
        if data_loader.batch_size is not None and data_loader.drop_last:
```
In LLMDataLoader the batch_size is never None.
Yeah, but I think the LLMDataLoader should change in that regard, since the current form does not allow IterDatasets at all. So I would leave this here, to be on the safe side.
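The if-body is cut off in the excerpt above; a standalone sketch of the drop_last arithmetic it presumably implements:

```python
def extract_num_samples(dataset_len: int, batch_size: int, drop_last: bool) -> int:
    if drop_last:
        # an incomplete final batch is discarded, so only full batches count
        return (dataset_len // batch_size) * batch_size
    return dataset_len

# e.g. 103 samples with batch_size=10 and drop_last=True -> 100 evaluated samples
assert extract_num_samples(103, 10, True) == 100
```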
```python
        self,
        num_samples: int,
        local_rank: int,
        throughput_aggregator_factory: Callable[[], ThroughputAggregator] = ThroughputAggregator,
```
Could we pass in the aggregator directly? The reset function is already implemented.
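For illustration, a rough sketch of such a resettable aggregator; the fields and method names are assumptions:

```python
import time
from typing import Optional

class ThroughputAggregator:
    def __init__(self) -> None:
        self._num_samples = 0
        self._start_time: Optional[float] = None

    def reset(self) -> None:
        # clearing the internal state is what makes it safe to pass one shared
        # instance in directly, instead of a factory creating fresh ones
        self._num_samples = 0
        self._start_time = None

    def start(self) -> None:
        self._start_time = time.perf_counter()

    def add_samples(self, num_samples: int) -> None:
        self._num_samples += num_samples

    def samples_per_second(self) -> float:
        assert self._start_time is not None, "call start() before aggregating"
        return self._num_samples / (time.perf_counter() - self._start_time)
```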
```python
        self._target_key = target_key
        self._prediction_key = prediction_key

    def create(self, local_rank: int) -> AggregativeMeasure[PerplexityKeys]:
```
As mentioned earlier, I think the factories are overkill.
… compute() to aggregate().
…n Trainer instead of iterable decorator. This code should be more readable than using start_throughput_measurement().
…ltiple_losses_and_metrics_in_evaluator