Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Runtime Bugets and Step Hints #838

Open
wants to merge 20 commits into
base: dev
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
20 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .github/workflows/CI.yml
Original file line number Diff line number Diff line change
Expand Up @@ -199,7 +199,7 @@ jobs:
pip install .[pytorch_cpu]
- name: Run pytest tests
run: |
pytest -vx tests/version_test.py
pytest -vx tests/test_version.py
pytest -vx tests/test_num_params.py
pytest -vx tests/test_param_shapes.py
pytest -vx tests/test_param_types.py
Expand Down
4 changes: 2 additions & 2 deletions .github/workflows/linting.yml
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ jobs:
pip install pylint==2.16.1
- name: Run pylint
run: |
pylint algorithmic_efficiency
pylint algoperf
pylint reference_algorithms
pylint prize_qualification_baselines
pylint submission_runner.py
Expand Down Expand Up @@ -50,7 +50,7 @@ jobs:
- name: Install yapf
run: |
python -m pip install --upgrade pip
pip install yapf==0.32
pip install yapf==0.32 toml
- name: Run yapf
run: |
yapf . --diff --recursive
2 changes: 1 addition & 1 deletion .github/workflows/regression_tests_variants.yml
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,7 @@ jobs:
run: |
docker pull us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_${{ github.head_ref || github.ref_name }}
docker run -v $HOME/data/:/data/ -v $HOME/experiment_runs/:/experiment_runs -v $HOME/experiment_runs/logs:/logs --gpus all --ipc=host us-central1-docker.pkg.dev/training-algorithms-external/mlcommons-docker-repo/algoperf_pytorch_${{ github.head_ref || github.ref_name }} -d criteo1tb -f pytorch -s reference_algorithms/paper_baselines/adamw/pytorch/submission.py -w criteo1tb_resnet -t reference_algorithms/paper_baselines/adamw/tuning_search_space.json -e tests/regression_tests/adamw -m 10 -c False -o True -r false
criteo_resnet_pytorch:
criteo_embed_init_pytorch:
runs-on: self-hosted
needs: build_and_push_pytorch_docker_image
steps:
Expand Down
8 changes: 5 additions & 3 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -12,8 +12,8 @@ makefile
*.swp
*/data/
*events.out.tfevents*
algorithmic_efficiency/workloads/librispeech_conformer/data_dir
algorithmic_efficiency/workloads/librispeech_conformer/work_dir
algoperf/workloads/librispeech_conformer/data_dir
algoperf/workloads/librispeech_conformer/work_dir
*.flac
*.npy
*.csv
Expand All @@ -23,4 +23,6 @@ wandb/
scoring/plots/

!scoring/test_data/experiment_dir/study_0/mnist_jax/trial_0/eval_measurements.csv
!scoring/test_data/experiment_dir/study_0/mnist_jax/trial_1/eval_measurements.csv
!scoring/test_data/experiment_dir/study_0/mnist_jax/trial_1/eval_measurements.csv

algoperf/_version.py
20 changes: 13 additions & 7 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,34 +4,39 @@

- Finalized variant workload targets.
- Fix in random_utils helper function.
- For conformer PyTorch Dropout layers set `inplace=True`.
- For conformer PyTorch Dropout layers set `inplace=True`.
- Clear CUDA cache at begining of each trial for PyTorch.

## algoperf-benchmark-0.1.4 (2024-03-26)

Upgrade CUDA version to CUDA 12.1:

- Upgrade CUDA version in Dockerfiles that will be used for scoring.
- Update Jax and PyTorch package version tags to use local CUDA installation.

Add flag for completely disabling checkpointing.
Add flag for completely disabling checkpointing.

- Note that we will run with checkpointing off at scoring time.

Update Deepspeech and Conformer variant target setting configurations.
- Note that variant targets are not final.
Update Deepspeech and Conformer variant target setting configurations.

- Note that variant targets are not final.

Fixed bug in scoring code to take best trial in a study for external-tuning ruleset.

Added instructions for submission.
Added instructions for submission.

Changed default number of workers for PyTorch data loaders to 0. Running with >0 may lead to incorrect eval results see https://github.com/mlcommons/algorithmic-efficiency/issues/732.
Changed default number of workers for PyTorch data loaders to 0. Running with >0 may lead to incorrect eval results see <https://github.com/mlcommons/algorithmic-efficiency/issues/732>.

## algoperf-benchmark-0.1.2 (2024-03-04)

Workload variant additions and fixes:

- Add Deepspeech workload variant
- Fix bugs in Imagenet ResNet, WMT and Criteo1tb variants

Add prize qualification logs for external tuning ruleset.
Note: FastMRI trials with dropout are not yet added due to https://github.com/mlcommons/algorithmic-efficiency/issues/664.
Note: FastMRI trials with dropout are not yet added due to <https://github.com/mlcommons/algorithmic-efficiency/issues/664>.

Add missing funcitonality to Docker startup script for self_tuning ruleset.
Add self_tuning ruleset option to script that runs all workloads for scoring.
Expand All @@ -41,6 +46,7 @@ Datasetup fixes.
Fix tests that check training differences in PyTorch and JAX on GPU.

## algoperf-benchmark-0.1.1 (2024-01-19)

Bug fixes to FastMRI metric calculation and targets.

Added workload variants and targets for ogbg, fastmri, librispeech_conformer, imagenet_resnet, imagenet_vit, criteo1tb to be used as held-out workloads.
Expand Down
19 changes: 16 additions & 3 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@
- [Style Testing](#style-testing)
- [Unit and Integration Tests](#unit-and-integration-tests)
- [Regression Tests](#regression-tests)
- [Versioning](#versioning)

## Contributing to MLCommons

Expand Down Expand Up @@ -204,7 +205,7 @@ docker run -t -d \
-v $HOME/data/:/data/ \
-v $HOME/experiment_runs/:/experiment_runs \
-v $HOME/experiment_runs/logs:/logs \
-v $HOME/algorithmic-efficiency:/algorithmic-efficiency \
-v $HOME/algorithmic-efficiency:/algoperf \
--gpus all \
--ipc=host \
<image_path> \
Expand All @@ -228,7 +229,7 @@ To run the below commands, use the versions installed via `pip install -e '.[dev
To automatically fix formatting errors, run the following (*WARNING:* this will edit your code, so it is suggested to make a git commit first!):

```bash
yapf -i -r -vv -p algorithmic_efficiency datasets prize_qualification_baselines reference_algorithms tests *.py
yapf -i -r -vv -p algoperf datasets prize_qualification_baselines reference_algorithms tests *.py
```

To sort all import orderings, run the following:
Expand All @@ -246,7 +247,7 @@ isort . --check --diff
To print out all offending pylint issues, run the following:

```bash
pylint algorithmic_efficiency
pylint algoperf
pylint datasets
pylint prize_qualification_baselines
pylint reference_algorithms
Expand Down Expand Up @@ -276,3 +277,15 @@ To run a regression test:
2. Turn on the self-hosted runner.
3. Run the self-hosted runner application for the runner to accept jobs.
4. Open a pull request into mian to trigger the workflow.

### Versioning

The package version is automatically determined by the `setuptools_scm` package based on the last git tag.
It follows the structure `major.minor.patch` + `devN` where `N` is the number of commits since the last tag.
It automatically increments the patch version (i.e. it guesses the next version) if there are commits after the last tag.
Additionally, if there are uncommitted changes, the version will include a suffix separated by a `+` character and includes the last commit hash plus the date on dirt workdir (see [setuptools_scm's documentation](https://setuptools-scm.readthedocs.io/en/latest/extending/#setuptools_scmlocal_scheme) with the default version and local scheme).
You can check what version `setuptools_scm` is creating by running `python -m setuptools_scm`.

To create a new version, create a new release (and tag) in the GitHub UI.
The package version is automatically updated to the new version.
Once the package is installed, the version can be accessed as the package attribute `algoperf.__version__`, i.e. via `python -c "import algoperf; print(algoperf.__version__)"`.
Loading