
[train] Add fault tolerance variant to the training data ingest benchmark #50399

Merged · 33 commits into ray-project:master · Feb 13, 2025

Conversation

@justinvyu (Contributor) commented Feb 10, 2025

Summary

Adds a .skip_training.fault_tolerance variant to the image classification training ingest release test, which kills a node every N seconds and tests worker recovery.

Also adds the ability to stress-test concurrent multi-dataset execution with training ingest (training and validation datasets) by performing validation every N steps during the training epoch. This is not enabled yet because it is not performant enough and would cause the test to run for too long; it will be addressed in a follow-up PR.

Also updates the training benchmark script to load training state properly and perform mid-epoch resumption by skipping batches up to the batch that corresponds to the latest checkpoint.

Adds the following metrics (recorded via per-key metric objects; a sketch of such an object follows the list):

  • checkpoint/download: Time spent downloading the checkpoint from storage to local.
  • checkpoint/load: Time spent loading the checkpoint from local.
  • train/iter_skip_batch: Time spent skipping batches upon restoration to do "mid-epoch resumption."
  • checkpoint/restoration_time: Extra time spent on restoration; the sum of the three metrics above.
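The code snippets quoted in the review below record these metrics through per-key objects that expose .add(), .get(), and a .timer() context manager. A minimal sketch of what such an object might look like, purely for illustration (this is not the benchmark's actual implementation):

import time
from contextlib import contextmanager


class MetricAccumulator:
    """Illustrative stand-in for the benchmark's per-key metric objects."""

    def __init__(self) -> None:
        self._total: float = 0.0

    def add(self, value: float) -> None:
        self._total += value

    def get(self) -> float:
        return self._total

    @contextmanager
    def timer(self):
        # Accumulate wall-clock time spent inside the `with` block.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.add(time.perf_counter() - start)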

Here's what the fault tolerance test does:

  • Starts a chaos killer which kills a node every ~480 seconds, killing up to 2 nodes across the entire job.
  • Runs training with max_failures=4 (2 more failures than needed, as a buffer); a configuration sketch follows below.
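For reference, here is a minimal sketch of how the max_failures=4 setting maps onto Ray Train's public API; the trainer arguments and sizing are illustrative, not the benchmark's exact wiring:

import ray.train
from ray.train.torch import TorchTrainer


def train_fn(config):
    # Placeholder for the benchmark's per-worker training loop.
    pass


trainer = TorchTrainer(
    train_fn,
    # Illustrative sizing; the release test defines its own cluster shape.
    scaling_config=ray.train.ScalingConfig(num_workers=16, use_gpu=True),
    run_config=ray.train.RunConfig(
        # Tolerate up to 4 worker-group failures: ~2 are expected from the
        # chaos killer, and the remaining 2 act as a buffer.
        failure_config=ray.train.FailureConfig(max_failures=4),
    ),
)
result = trainer.fit()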

Comment on lines 60 to 73
if self.benchmark_config.validate_every_n_steps > 0:
    # TODO: This is just hard-coded for now. Maybe move this to be a configuration.
    # Maybe move this to the RayDataLoaderFactory.
    cpus_to_exclude = 16
    train_ds.context.execution_options.exclude_resources = (
        train_ds.context.execution_options.exclude_resources.add(
            ray.data.ExecutionResources(cpu=cpus_to_exclude)
        )
    )
    logger.info(
        f"[Dataloader] Reserving {cpus_to_exclude} CPUs for validation "
        "that happens concurrently with training every "
        f"{self.benchmark_config.validate_every_n_steps} steps. "
    )
@justinvyu (Contributor, Author):

@raulchen Is this a reasonable way to handle multi-dataset?
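For context, the concurrent validation described in the summary amounts to something like the loop below; train_step and validate are illustrative placeholders rather than the benchmark's exact APIs:

def train_epoch(train_dataloader, val_dataloader, benchmark_config, train_step, validate):
    # Interleave a validation pass every N training steps so that the training
    # and validation datasets execute concurrently, with the validation dataset
    # running on the CPUs excluded from the training dataset above.
    for step, batch in enumerate(train_dataloader):
        train_step(batch)

        if (
            benchmark_config.validate_every_n_steps > 0
            and step > 0
            and step % benchmark_config.validate_every_n_steps == 0
        ):
            validate(val_dataloader)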

Comment on lines +280 to +281
if restoration_time > 0:
    metrics["checkpoint/restoration_time"] = restoration_time
@justinvyu (Contributor, Author):

Note: We should track "restoration time" in Ray Train by default. The time measured here does not include process group re-init, actor startup, etc. It would be good to sum those with this metric to get the full restoration/startup cost.

Comment on lines 90 to 94
# Skip through batches if we restored to a middle of the epoch.
# TODO: Compare this baseline to the data checkpointing approach once we have it.
for _ in range(self._train_batch_idx):
    with self._metrics["train/iter_skip_batch"].timer():
        self.get_next_batch(train_dataloader)
@justinvyu (Contributor, Author):

Keeping an accurate total for the time spent skipping batches is a little tricky, because this step could take a long time (e.g., if training was previously killed at the second-to-last batch of the epoch). A node could then get killed during the skipping itself, and that time would be lost from the metrics, which are only snapshotted at every checkpoint.

We may want to checkpoint the metrics separately from the model if we want to track this better.
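A rough sketch of what checkpointing the metrics separately might look like; the file layout and helper name are hypothetical:

import json
import os
import time


def save_metrics_snapshot(metrics: dict, snapshot_dir: str) -> None:
    # Persist the current metric totals on their own, independent of (and
    # potentially more often than) the model checkpoint, so that e.g. time
    # spent skipping batches is not lost if a node dies before the next
    # model checkpoint.
    os.makedirs(snapshot_dir, exist_ok=True)
    snapshot = {key: metric.get() for key, metric in metrics.items()}
    path = os.path.join(snapshot_dir, f"metrics_{int(time.time())}.json")
    with open(path, "w") as f:
        json.dump(snapshot, f)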

@matthewdeng (Contributor) left a comment:

Very clean!

Comment on lines +52 to +61
download_start = time.perf_counter()
checkpoint.to_directory(temp_checkpoint_dir)
download_time = time.perf_counter() - download_start

load_start = time.perf_counter()
self.load_checkpoint(temp_checkpoint_dir)
load_time = time.perf_counter() - load_start

self._metrics["checkpoint/download"].add(download_time)
self._metrics["checkpoint/load"].add(load_time)
@matthewdeng (Contributor):

nit: use the same context manager pattern as is done later for the batch iteration?

e.g.

with self._metrics["checkpoint/download"].timer():
    checkpoint.to_directory(temp_checkpoint_dir)


with self._metrics["checkpoint/load"].timer():
    self.load_checkpoint(temp_checkpoint_dir)

@justinvyu (Contributor, Author):

load_checkpoint loads the snapshot of the metrics, which would overwrite these values. So I need to save the measured times separately and then add them to the metrics afterwards.

Comment on lines 184 to +186
    with self._metrics["validation/step"].timer():
-       with torch.no_grad():
-           out = self.model(input_batch)
-           loss = self.loss_fn(out, labels)
-           total_loss += loss
-       num_rows += len(labels)
-       self._metrics["validation/rows_processed"].add(len(labels))
+       if not self.benchmark_config.skip_validation_step:
+           total_loss += self.validate_step(input_batch, labels)
@matthewdeng (Contributor):

Invert the ordering of the timer and the condition? Or do you intentionally want to include this as well, as you mentioned in the other comment?

            if not self.benchmark_config.skip_validation_step:
                with self._metrics["validation/step"].timer():
                    total_loss += self.validate_step(input_batch, labels)

@justinvyu (Contributor, Author):

I think it's easier to parse the output if this metric always exists and is just 0 rather than existing conditionally.

# which includes downloading the checkpoint, loading the checkpoint,
# and skipping through batches that were already processed.
restoration_time = (
    self._metrics["checkpoint/download"].get()
@matthewdeng (Contributor):

nit: Use constants for all keys to make it safe against typos.

@justinvyu (Contributor, Author):

I'll do this in a followup.
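For reference, the follow-up could define the keys roughly like this (constant names are illustrative):

# Module-level constants for the metric keys, so producers and consumers of
# the metrics cannot drift apart via typos.
CHECKPOINT_DOWNLOAD_KEY = "checkpoint/download"
CHECKPOINT_LOAD_KEY = "checkpoint/load"
TRAIN_ITER_SKIP_BATCH_KEY = "train/iter_skip_batch"
CHECKPOINT_RESTORATION_TIME_KEY = "checkpoint/restoration_time"

The restoration_time computation above would then index self._metrics with these constants instead of raw string literals.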

@justinvyu justinvyu enabled auto-merge (squash) February 13, 2025 05:58
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Feb 13, 2025
@justinvyu justinvyu merged commit 68c0ead into ray-project:master Feb 13, 2025
7 checks passed