Add MLPerf logging #831

Merged on May 3, 2022 (63 commits)
Commits
c52f47b  draft mlperf logger (hanlint, Mar 18, 2022)
1d75c3a  add to callbacks module (hanlint, Mar 21, 2022)
caab2e9  add mlperf logging callback (hanlint, Mar 23, 2022)
8f2fee6  add submission directory structure (hanlint, Mar 25, 2022)
69bb806  add mlperf to setup (hanlint, Mar 25, 2022)
0813bb6  fix duplicate logging (hanlint, Mar 25, 2022)
43c74cd  Merge branch 'dev' into hanlin/mlperf (hanlint, Mar 25, 2022)
d2153d2  Apply suggestions from code review (hanlint, Mar 28, 2022)
8bbd7cd  Merge branch 'dev' into hanlin/mlperf (hanlint, Apr 18, 2022)
e010476  update with current_metrics (hanlint, Apr 18, 2022)
9d588f7  fix setup (hanlint, Apr 19, 2022)
bee409f  fix docstrings (hanlint, Apr 19, 2022)
f70406b  add hparams object (hanlint, Apr 19, 2022)
ba8652f  fix error (hanlint, Apr 19, 2022)
7ac866b  skip callback in asset test (hanlint, Apr 19, 2022)
03758b1  Merge branch 'dev' into hanlin/mlperf (hanlint, Apr 19, 2022)
689d84c  cleanup (hanlint, Apr 19, 2022)
f02eef3  try removing world_size (hanlint, Apr 19, 2022)
6491e8c  restore world_size (hanlint, Apr 19, 2022)
5fe7957  Merge branch 'dev' into hanlin/mlperf (hanlint, Apr 19, 2022)
b1b6004  Merge branch 'dev' into hanlin/mlperf (hanlint, Apr 19, 2022)
465f76f  Merge branch 'hanlin/mlperf' of github.com:mosaicml/composer into han… (hanlint, Apr 19, 2022)
d80e39d  trying removing mlperf tag (hanlint, Apr 19, 2022)
99bb2ab  cleanup (hanlint, Apr 19, 2022)
50ae088  Merge branch 'dev' into hanlin/mlperf (ravi-mosaicml, Apr 19, 2022)
73931a8  please jenkins help (hanlint, Apr 19, 2022)
2139ae0  one more time (hanlint, Apr 19, 2022)
6696fdb  never say timeout (hanlint, Apr 19, 2022)
7ff3392  Merge branch 'hanlin/mlperf' of github.com:mosaicml/composer into han… (hanlint, Apr 19, 2022)
e03c14e  Merge branch 'dev' into hanlin/mlperf (hanlint, Apr 19, 2022)
fed0d3f  Merge branch 'dev' into hanlin/mlperf (hanlint, Apr 20, 2022)
09ed9e5  remove world_size again (hanlint, Apr 20, 2022)
95c26fc  Merge branch 'hanlin/mlperf' of github.com:mosaicml/composer into han… (hanlint, Apr 20, 2022)
f2d3c51  Merge branch 'dev' into hanlin/mlperf (ravi-mosaicml, Apr 20, 2022)
7034f13  remove logging pip (hanlint, Apr 21, 2022)
67c0640  Merge branch 'hanlin/mlperf' of github.com:mosaicml/composer into han… (hanlint, Apr 21, 2022)
24e7439  Merge branch 'dev' into hanlin/mlperf (hanlint, Apr 21, 2022)
e497ce9  Merge branch 'hanlin/mlperf' of github.com:mosaicml/composer into han… (hanlint, Apr 21, 2022)
2862776  address comments (hanlint, Apr 23, 2022)
e0a13a4  implement cache clear (hanlint, Apr 23, 2022)
aae087b  Merge branch 'dev' into hanlin/mlperf (hanlint, Apr 25, 2022)
5585ce8  fix doctest (hanlint, Apr 26, 2022)
a524a65  Merge branch 'hanlin/mlperf' of github.com:mosaicml/composer into han… (hanlint, Apr 26, 2022)
a553110  Merge branch 'dev' into hanlin/mlperf (hanlint, Apr 27, 2022)
6a2c637  Update composer/callbacks/mlperf.py (hanlint, Apr 29, 2022)
8f4ea2f  address comments (hanlint, Apr 29, 2022)
a80a208  Merge branch 'hanlin/mlperf' of github.com:mosaicml/composer into han… (hanlint, Apr 29, 2022)
cc4d9be  restore dataloaders to state (hanlint, May 3, 2022)
a431577  cleanup (hanlint, May 3, 2022)
4cbd163  move items to init (hanlint, May 3, 2022)
b7fd11e  Merge branch 'dev' into hanlin/mlperf (hanlint, May 3, 2022)
cdeac03  fix pyright (hanlint, May 3, 2022)
212b089  clean up tests (hanlint, May 3, 2022)
13066df  use code block because cannot automate testcode (hanlint, May 3, 2022)
f8c9732  Apply suggestions from code review (hanlint, May 3, 2022)
b689f86  address comments (hanlint, May 3, 2022)
1705a61  Merge branch 'dev' into hanlin/mlperf (hanlint, May 3, 2022)
051035b  cleanup (hanlint, May 3, 2022)
afe5313  type ignore until logging pypi is done (hanlint, May 3, 2022)
ec2a578  Merge branch 'dev' into hanlin/mlperf (hanlint, May 3, 2022)
2480a2b  cleanup (hanlint, May 3, 2022)
0136d3d  Merge branch 'hanlin/mlperf' of github.com:mosaicml/composer into han… (hanlint, May 3, 2022)
621d12e  cleanup (hanlint, May 3, 2022)
2 changes: 1 addition & 1 deletion Makefile
@@ -27,4 +27,4 @@ test-dist-gpu:
 clean-notebooks:
 	$(PYTHON) scripts/clean_notebooks.py -i notebooks/*.ipynb

-.PHONY: test test-gpu test-dist test-dist-gpu lint style clean-notebooks
+.PHONY: test test-gpu test-dist test-dist-gpu clean-notebooks
6 changes: 5 additions & 1 deletion composer/callbacks/__init__.py
@@ -6,11 +6,13 @@
 examples for writing your own callbacks at the :class:`~composer.core.callback.Callback` base class.
 """
 from composer.callbacks.callback_hparams import (CallbackHparams, CheckpointSaverHparams, GradMonitorHparams,
-                                                 LRMonitorHparams, MemoryMonitorHparams, SpeedMonitorHparams)
+                                                 LRMonitorHparams, MemoryMonitorHparams, MLPerfCallbackHparams,
+                                                 SpeedMonitorHparams)
 from composer.callbacks.checkpoint_saver import CheckpointSaver
 from composer.callbacks.grad_monitor import GradMonitor
 from composer.callbacks.lr_monitor import LRMonitor
 from composer.callbacks.memory_monitor import MemoryMonitor
+from composer.callbacks.mlperf import MLPerfCallback
 from composer.callbacks.speed_monitor import SpeedMonitor

 __all__ = [
@@ -19,11 +21,13 @@
     "MemoryMonitor",
     "SpeedMonitor",
     "CheckpointSaver",
+    "MLPerfCallback",
     # hparams objects
     "CallbackHparams",
     "CheckpointSaverHparams",
     "GradMonitorHparams",
     "LRMonitorHparams",
     "MemoryMonitorHparams",
     "SpeedMonitorHparams",
+    "MLPerfCallbackHparams",
 ]
62 changes: 59 additions & 3 deletions composer/callbacks/callback_hparams.py
@@ -5,7 +5,7 @@

 import abc
 import textwrap
-from dataclasses import dataclass
+from dataclasses import asdict, dataclass
 from typing import Optional

 import yahp as hp
@@ -14,6 +14,7 @@
 from composer.callbacks.grad_monitor import GradMonitor
 from composer.callbacks.lr_monitor import LRMonitor
 from composer.callbacks.memory_monitor import MemoryMonitor
+from composer.callbacks.mlperf import MLPerfCallback
 from composer.callbacks.speed_monitor import SpeedMonitor
 from composer.core.callback import Callback
 from composer.core.time import Time
@@ -48,7 +49,7 @@ class GradMonitorHparams(CallbackHparams):
     """:class:`~.GradMonitor` hyperparameters.

     Args:
-        log_layer_grad_norms (bool, optional):
+        log_layer_grad_norms (bool, optional):
             See :class:`~.GradMonitor` for documentation.
     """

@@ -119,10 +120,65 @@ def initialize_object(self) -> SpeedMonitor:
         return SpeedMonitor(window_size=self.window_size)


+@dataclass
+class MLPerfCallbackHparams(CallbackHparams):
+    """:class:`~.MLPerfCallback` hyperparameters.
+
+    Args:
+        root_folder (str): The root submission folder.
+        index (int): The repetition index of this run. The filename created will be
+            ``result_[index].txt``.
+        benchmark (str, optional): Benchmark name. Currently only ``resnet`` is supported.
+        target (float, optional): The target metric before the mllogger marks the stop
+            of the timing run. Default: ``0.759`` (resnet benchmark).
+        division (str, optional): Division of submission. Currently only the ``open`` division is supported.
+        metric_name (str, optional): Name of the metric to compare against the target. Default: ``Accuracy``.
+        metric_label (str, optional): Label name. The metric will be accessed via ``state.current_metrics[metric_label][metric_name]``.
+        submitter (str, optional): Submitting organization. Default: ``MosaicML``.
+        system_name (str, optional): Name of the system (e.g. ``8xA100_composer``). If
+            not provided, system name will default to ``[world_size]x[device_name]_composer``,
+            e.g. ``8xNVIDIA_A100_80GB_composer``.
+        status (str, optional): Submission status. One of ``onprem``, ``cloud``, or ``preview``.
+            Default: ``"onprem"``.
+        cache_clear_cmd (str, optional): Command to invoke during the cache clear. This callback
+            will call ``subprocess(cache_clear_cmd)``. Default: ``None`` (disabled).
+
+    """
+
+    root_folder: str = hp.required("The root submission folder.")
+    index: int = hp.required("The repetition index of this run.")
+    benchmark: str = hp.optional("Benchmark name. Default: resnet", default="resnet")
+    target: float = hp.optional("The target metric before mllogger marks run_stop. Default: 0.759 (resnet)",
+                                default=0.759)
+    division: Optional[str] = hp.optional(
+        "Division of submission. Currently only open division "
+        "is supported. Default: open", default="open")
+    metric_name: str = hp.optional('name of the metric to compare against the target. Default: Accuracy',
+                                   default='Accuracy')
+    metric_label: str = hp.optional(
+        'label name. The metric will be accessed via state.current_metrics[metric_label][metric_name]. Default: eval',
+        default='eval')
+    submitter: str = hp.optional("Submitting organization. Default: MosaicML", default='MosaicML')
+    system_name: Optional[str] = hp.optional("Name of the system, defaults to [world_size]x[device_name]", default=None)
+    status: str = hp.optional("Submission status. Default: onprem", default="onprem")
+    cache_clear_cmd: Optional[str] = hp.optional(
+        "Command to invoke during the cache clear. This callback will call subprocess(cache_clear_cmd). Default: Disabled.",
+        default=None,
+    )
+
+    def initialize_object(self) -> MLPerfCallback:
+        """Initialize the MLPerf Callback.
+
+        Returns:
+            MLPerfCallback: An instance of :class:`~.MLPerfCallback`
+        """
+        return MLPerfCallback(**asdict(self))

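Note on `initialize_object`: it forwards every dataclass field to the callback constructor via `asdict`, so the hparams field names have to match the constructor's keyword arguments one to one. The following is a minimal, self-contained sketch of that pattern. `DemoCallback` and `DemoCallbackHparams` are hypothetical stand-ins that use only the standard library; they are not the real `MLPerfCallback` or the yahp-based hparams class, and only a subset of the fields is mirrored.

```python
from dataclasses import asdict, dataclass
from typing import Optional


class DemoCallback:
    """Hypothetical stand-in for the real callback; it only stores its arguments."""

    def __init__(self,
                 root_folder: str,
                 index: int,
                 benchmark: str = "resnet",
                 target: float = 0.759,
                 cache_clear_cmd: Optional[str] = None):
        self.root_folder = root_folder
        self.index = index
        self.benchmark = benchmark
        self.target = target
        self.cache_clear_cmd = cache_clear_cmd


@dataclass
class DemoCallbackHparams:
    """Hypothetical hparams object whose fields mirror the constructor above."""

    root_folder: str
    index: int
    benchmark: str = "resnet"
    target: float = 0.759
    cache_clear_cmd: Optional[str] = None

    def initialize_object(self) -> DemoCallback:
        # asdict() converts the dataclass fields into a plain dict, which is
        # then unpacked as keyword arguments for the constructor.
        return DemoCallback(**asdict(self))


hparams = DemoCallbackHparams(root_folder="/tmp/submission", index=0)
callback = hparams.initialize_object()
print(callback.benchmark, callback.target)
```

One consequence of this design: a field added to the hparams class without a matching constructor argument fails at construction time with a `TypeError`, which keeps the two definitions from drifting apart silently. Note also that `asdict` recurses into nested dataclasses, so the pattern is simplest when all fields are plain values, as they are here.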

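The docstring above describes three derived values: the metric lookup that is compared against `target`, the default `system_name`, and the result filename. The sketch below shows one way these could be computed. It is not taken from `composer/callbacks/mlperf.py`; the helper names are hypothetical, the `>=` comparison is an assumption about how the target is checked, and replacing spaces with underscores in the device name is inferred only from the `8xNVIDIA_A100_80GB_composer` example.

```python
from typing import Mapping


def reached_target(current_metrics: Mapping[str, Mapping[str, float]],
                   metric_label: str = "eval",
                   metric_name: str = "Accuracy",
                   target: float = 0.759) -> bool:
    """Look up current_metrics[metric_label][metric_name] and compare it to the target.

    Assumption: the run counts as finished once the metric is >= target.
    """
    value = current_metrics[metric_label][metric_name]
    return float(value) >= target


def default_system_name(world_size: int, device_name: str) -> str:
    """Build '[world_size]x[device_name]_composer'.

    Assumption: spaces in the device name are replaced with underscores.
    """
    return f"{world_size}x{device_name.replace(' ', '_')}_composer"


def result_filename(index: int) -> str:
    """Build the per-repetition filename 'result_[index].txt'."""
    return f"result_{index}.txt"


metrics = {"eval": {"Accuracy": 0.76}}
print(reached_target(metrics))
print(default_system_name(8, "NVIDIA A100 80GB"))
print(result_filename(0))
```

In the real trainer the metric value may be a tensor rather than a float, which is why the sketch converts it with `float()` before comparing.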
 @dataclass
 class CheckpointSaverHparams(CallbackHparams):
     """:class:`~.CheckpointSaver` hyperparameters.

     Args:
         save_folder (str, optional): See :class:`~.CheckpointSaver`.
         filename (str, optional): See :class:`~.CheckpointSaver`.