Multi-Host training checkpointing #21290
Replies: 2 comments
-
I managed to solve it.
-
For those googling this error: I also ran into it because of a threading issue. Specifically, I was inadvertently launching TPU JAX kernels from a background data-loading thread. Make sure you always launch the same TPU kernel on all workers at the same time; both the original error and mine come down to that.
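A minimal sketch of that pattern, with a placeholder loader standing in for real data loading (the function bodies are illustrative, not from the original report): keep every `jax` call on the main thread so all hosts issue the same compiled computation in the same order, and let background threads do host-side work only.

```python
import queue
import threading

import numpy as np
import jax
import jax.numpy as jnp


def loader_worker(prefetch_queue: queue.Queue) -> None:
    """Background thread: host-side work only (NumPy, file I/O) -- never call jax.* here."""
    rng = np.random.default_rng(0)
    while True:
        # Stand-in for real data loading; only CPU/NumPy work happens on this thread.
        prefetch_queue.put(rng.standard_normal((32, 128)).astype(np.float32))


@jax.jit
def train_step(params, batch):
    # Placeholder update; a real step would compute gradients here.
    return params - 1e-3 * batch.mean()


def train_loop(params, num_steps: int):
    prefetch_queue: queue.Queue = queue.Queue(maxsize=4)
    threading.Thread(target=loader_worker, args=(prefetch_queue,), daemon=True).start()

    for _ in range(num_steps):
        batch = jnp.asarray(prefetch_queue.get())  # device transfer on the main thread
        # Every host reaches this call at the same step with the same computation,
        # which is what multi-host TPU execution requires.
        params = train_step(params, batch)
    return params
```

The same reasoning applies to checkpointing: if only one host (or a stray thread) triggers a compiled computation or a collective, the remaining hosts block on it and the pod eventually fails.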
-
Hi,
I am trying to do multi-host training on a TPU pod. I managed to run backpropagation, but I got stuck on saving checkpoints, specifically on saving the distributed Flax train state.
I distributed the initialised state using the following:
After a few updates I tried to save the state, but I failed to get the parameters onto one process. The things I tried:
Do you have any advice on how to get the parameters onto one host and save them to disk with Orbax? (One possible approach is sketched after this message.)
I also tried saving directly with the legacy API:
but the error that I get is:
Regards,
Rares
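A sketch of one way to approach the Orbax question above, not the original poster's code: with a sharded train state you generally do not need to gather the parameters onto one host at all; every process calls `save()` on the same global state, and Orbax writes only the shards that host owns to a directory visible to all hosts. The `gs://` path is hypothetical, and `state` / `abstract_state` are assumed to come from the training code.

```python
import orbax.checkpoint as ocp

# Assumed to exist from the training code: `state`, a Flax TrainState whose
# leaves are globally sharded jax.Arrays.
ckpt_dir = "gs://my-bucket/run-0/checkpoints/step_100"  # hypothetical; must be reachable from every host

checkpointer = ocp.PyTreeCheckpointer()
# Call this on every host, with the same arguments, at the same point in the
# training loop; each host writes only its addressable shards.
checkpointer.save(ckpt_dir, state)

# Restoring (also on every host). Passing a target/abstract state tells Orbax
# which shardings to restore into.
# restored = checkpointer.restore(ckpt_dir, item=abstract_state)
```

If a host-local copy of the full parameters is really needed, `jax.experimental.multihost_utils.process_allgather` (itself a collective, so call it on every host) returns fully replicated NumPy arrays on each process, after which process 0 alone can write them to disk.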