
[WIP] Migrate JAX workloads from pmap to jit #848

Open · wants to merge 56 commits into base: dev
Conversation

@priyakasimbeg (Contributor) commented on Mar 6, 2025

Purpose

The goal of this PR is to enable sharding of model parameters and optimizer state, and to migrate the JAX code from jax.pmap to jax.jit.
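In broad strokes, the migration follows the pattern below (a minimal sketch with illustrative names, not the PR's actual code; it assumes single-process execution and a batch size that is a multiple of the device count):

```python
# Minimal sketch of the pmap -> jit pattern (illustrative names, not the
# PR's actual code). Params are replicated; the batch is split across devices.
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), axis_names=('batch',))
replicated = NamedSharding(mesh, P())            # full copy on every device
data_sharding = NamedSharding(mesh, P('batch'))  # leading dim split over devices

def loss_fn(params, batch):
  preds = batch['x'] @ params['w']
  return jnp.mean((preds - batch['y']) ** 2)

# Under pmap this would be jax.pmap(update, axis_name='batch') plus an
# explicit jax.lax.pmean over the gradients. Under jit, the mean over the
# sharded batch already implies the cross-device reduction.
@jax.jit
def update(params, batch, lr=0.1):
  grads = jax.grad(loss_fn)(params, batch)
  return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

n = jax.device_count() * 2  # batch size must be a multiple of the device count
params = jax.device_put({'w': jnp.zeros((8, 1))}, replicated)
batch = jax.device_put(
    {'x': jnp.ones((n, 8)), 'y': jnp.ones((n, 1))}, data_sharding)
params = update(params, batch)
```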

TODOs:

  • Migrate reference optimizers to use jax.jit (see the optimizer sketch after this list)
    • Nesterov
    • AdamW
    • Others
  • Migrate workloads to use jax.jit
    • (Test workload) MNIST
    • (Test workload) CIFAR
    • WMT
    • Criteo1TB
    • FastMRI
    • Librispeech
    • OGBG
    • ImageNet
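
For the optimizer items above, a jitted reference optimizer step could look roughly like this (a hedged sketch using optax.adamw as a stand-in; the function and variable names are illustrative, not taken from the PR):

```python
# Hedged sketch of a jitted reference optimizer step, using optax.adamw as a
# stand-in for the PR's AdamW implementation.
import jax
import jax.numpy as jnp
import optax

tx = optax.adamw(learning_rate=1e-3, weight_decay=1e-2)

def loss_fn(params, batch):
  preds = batch['x'] @ params['w']
  return jnp.mean((preds - batch['y']) ** 2)

@jax.jit
def train_step(params, opt_state, batch):
  grads = jax.grad(loss_fn)(params, batch)
  # adamw takes params as well, for decoupled weight decay.
  updates, opt_state = tx.update(grads, opt_state, params)
  return optax.apply_updates(params, updates), opt_state

params = {'w': jnp.zeros((8, 1))}
opt_state = tx.init(params)  # moment buffers mirror the params' structure
batch = {'x': jnp.ones((4, 8)), 'y': jnp.ones((4, 1))}
params, opt_state = train_step(params, opt_state, batch)
```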

Changelog

  • Added sharding utilities to handle distributed data (see the sketch after this list)
  • Replaced pmap code for CIFAR/MNIST with jit
  • Modified AdamW and Nesterov accordingly
  • Updated checkpoint and data_utils to support the new approach (mostly removing explicit jax_utils.replicate calls)
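
The sharding utilities are roughly of this shape (an illustrative sketch; the PR's actual helper names and signatures may differ):

```python
# Illustrative sketch of sharding utilities of this kind (the actual helper
# names in the PR may differ).
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

MESH = Mesh(np.array(jax.devices()), axis_names=('batch',))

def shard_along_batch(batch):
  """Split each array's leading (batch) dimension across devices."""
  return jax.device_put(batch, NamedSharding(MESH, P('batch')))

def replicate(tree):
  """jit-era stand-in for flax.jax_utils.replicate: a full copy per device,
  without the extra leading device axis that pmap required."""
  return jax.device_put(tree, NamedSharding(MESH, P()))
```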

Issues

  • Prefetching in CIFAR is temporarily disabled (marked with FIXME); it is unclear how best to support it under jit (one possible approach is sketched below).
  • I haven't edited any of the PyTorch code; we will need to verify that the PyTorch workloads still perform comparably.
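
For the prefetching gap, one option (a sketch, not necessarily what this PR will adopt) is to eagerly device_put upcoming batches with the target sharding and rely on JAX's asynchronous dispatch to overlap host-to-device transfer with compute, much as flax.jax_utils.prefetch_to_device does for pmap:

```python
# Sketch: prefetch batches onto devices with a target sharding under jit.
import collections
import itertools
import jax

def prefetch_to_sharding(iterator, sharding, size=2):
  """Yield batches from `iterator`, keeping `size` of them already
  device_put with `sharding` so transfer overlaps with compute."""
  queue = collections.deque()

  def enqueue(n):
    for batch in itertools.islice(iterator, n):
      queue.append(jax.device_put(batch, sharding))

  enqueue(size)
  while queue:
    yield queue.popleft()
    enqueue(1)
```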

@priyakasimbeg requested a review from a team as a code owner on March 6, 2025 21:47

github-actions bot commented Mar 6, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@priyakasimbeg changed the title from Jit switch to [WIP] Migrate JAX workloads from pmap to jit on Mar 6, 2025
@priyakasimbeg changed the base branch from main to dev on March 7, 2025 00:17
@@ -154,13 +156,12 @@ def _eval_model_on_split(self,
num_batches=num_batches)

total_metrics = {'ssim': 0., 'loss': 0.}
@priyakasimbeg (Contributor, Author) commented:

Why did we swap out the eval_rngs with the model rng?
