Conversation

@eric-czech eric-czech commented Nov 23, 2025

Description

refs: #2101
refs: #1729

Log files
Run 1 Errors
# Every training job ends with `BrokenPipeError: [Errno 32] Broken pipe`
cat job-driver-ray-run-eczech-plantcad_isoflops_batch_size-20251123-121917.log | grep 'Broken pipe' | wc -l
4

# (pid=, ip=10.128.0.91) E1123 04:35:32.542303    3439 tpu_pjrt_client.cc:1811] TpuClient::AllocateBuffer returning error after 1 attempt(s): RESOURCE_EXHAUSTED: Error allocating device buffer: Attempting to allocate 224.00M. That was not possible. There are 77.60M free. The largest contiguous region of free memory is 63.92M due to fragmentation.; (0x0x0_HBM0)
# (pid=, ip=10.128.0.91) third_party/tensorflow/compiler/xla/util.h:288
# (pid=, ip=10.128.0.91) learning/45eac/research/pjrt/tpu_pjrt_client.cc:1779
# (pid=, ip=10.128.0.91) [OOM Summary: Reason: Not enough free memory. Allocator stats:
# (pid=, ip=10.128.0.91) Limit:                    102803437568
# (pid=, ip=10.128.0.91) InUse:                     68240086016
# (pid=, ip=10.128.0.91) MaxInUse:                  68240086016
# (pid=, ip=10.128.0.91) NumAllocs:                      104400
# (pid=, ip=10.128.0.91) MaxAllocSize:              16107439104
# (pid=, ip=10.128.0.91) Reserved:                  34481979392
# (pid=, ip=10.128.0.91) PeakReserved:              34481979392
# (pid=, ip=10.128.0.91) LargestFreeBlock:                    0
# (pid=, ip=10.128.0.91) ]
cat job-driver-ray-run-eczech-plantcad_isoflops_batch_size-20251123-121917.log | grep -E 'RESOURCE_EXHAUSTED: Error allocating device buffer' | wc -l
2214
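The allocator stats in the OOM summary above are self-consistent and worth extracting: Limit minus InUse minus Reserved leaves about 77.6 MiB free, matching the "77.60M free" in the error, and LargestFreeBlock of 0 explains why a 224 MiB contiguous request fails. A minimal parsing sketch (field names taken from the excerpt; the log format is assumed and may differ across XLA versions):

```python
import re

# Parse the XLA allocator stats from an OOM summary block.
# Alternation order matters: longer names (MaxInUse, PeakReserved) are
# matched at their own start position before the shorter substrings.
STAT_RE = re.compile(
    r"(Limit|MaxInUse|InUse|NumAllocs|MaxAllocSize|PeakReserved|Reserved|LargestFreeBlock):\s+(\d+)"
)

def parse_allocator_stats(text: str) -> dict[str, int]:
    return {name: int(value) for name, value in STAT_RE.findall(text)}

stats = parse_allocator_stats("""
Limit:                    102803437568
InUse:                     68240086016
Reserved:                  34481979392
LargestFreeBlock:                    0
""")
free = stats["Limit"] - stats["InUse"] - stats["Reserved"]
# free == 81372160 bytes, i.e. ~77.6 MiB, matching the error message.
# With LargestFreeBlock == 0, even that remainder is too fragmented to
# satisfy a single 224 MiB buffer.
print(stats, free)
```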

# (pid=, ip=10.128.0.111) E1123 04:47:35.539063    3307 memory_space_assignment_util.cc:1382] INVALID_ARGUMENT: 104857600 bytes of scoped Vmem requested (via backend config for RPA-bq_16-bkvp_8-p_256.58), but the max valid bytes is 67043328. See go/scoped-vmem for more details.
# (pid=, ip=10.128.0.111) platforms/xla/service/ba16c7433/memory_space_assignment_util.cc:178
# (pid=, ip=10.128.0.111)  We are lowering the scoped Vmem for RPA-bq_16-bkvp_8-p_256.58 to 67043328 bytes.
# (pid=, ip=10.128.0.111) E1123 04:47:35.539071    3307 memory_space_assignment_util.cc:1382] INVALID_ARGUMENT: 104857600 bytes of scoped Vmem requested (via backend config for RPA-bq_16-bkvp_8-p_256.59), but the max valid bytes is 67043328. See go/scoped-vmem for more details.
# (pid=, ip=10.128.0.111) platforms/xla/service/ba16c7433/memory_space_assignment_util.cc:178
# (pid=, ip=10.128.0.111)  We are lowering the scoped Vmem for RPA-bq_16-bkvp_8-p_256.59 to 67043328 bytes.
# (pid=, ip=10.128.0.111) E1123 04:47:35.539083    3307 memory_space_assignment_util.cc:1382] INVALID_ARGUMENT: 104857600 bytes of scoped Vmem requested (via backend config for RPA-bq_16-bkvp_8-p_256.60), but the max valid bytes is 67043328. See go/scoped-vmem for more details.
# (pid=, ip=10.128.0.111) platforms/xla/service/ba16c7433/memory_space_assignment_util.cc:178
cat job-driver-ray-run-eczech-plantcad_isoflops_batch_size-20251123-121917.log | grep -E 'INVALID_ARGUMENT: .* bytes of scoped Vmem requested' | wc -l
    3565
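The scoped-Vmem errors all report the same pair of numbers: 104857600 bytes requested (exactly 100 MiB) against a cap of 67043328 (~64 MiB), after which XLA lowers the allocation per op. A small sketch for pulling those numbers out of each line (message format assumed from the excerpt above):

```python
import re

# Extract requested vs. allowed scoped-Vmem sizes from an
# INVALID_ARGUMENT line. The wording is copied from the log excerpt;
# XLA may change it between versions.
VMEM_RE = re.compile(
    r"(\d+) bytes of scoped Vmem requested .* but the max valid bytes is (\d+)"
)

line = (
    "INVALID_ARGUMENT: 104857600 bytes of scoped Vmem requested "
    "(via backend config for RPA-bq_16-bkvp_8-p_256.58), "
    "but the max valid bytes is 67043328."
)
m = VMEM_RE.search(line)
requested, max_valid = (int(g) for g in m.groups())
# requested == 100 MiB; max_valid (~64 MiB) is the cap XLA lowers to.
print(requested, max_valid)
```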

# [The node with node id: 906cbc47f7e0a7da881b51655f2c2fae74fb329c8d248d6af06dee4e and address: 10.128.0.230 and node name: 10.128.0.230 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a         
# (1) raylet crashes unexpectedly (OOM, etc.)
# (2) raylet has lagging heartbeats due to slow network or busy workload.
cat job-driver-ray-run-eczech-plantcad_isoflops_batch_size-20251123-121917.log | grep 'has been marked dead because the detector has missed too many heartbeats' | wc -l
      17
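Seventeen heartbeat-failure events may involve far fewer distinct nodes, which matters for deciding between the two causes Ray lists (raylet OOM vs. slow network). A hypothetical tallying sketch (regex written against the message format above; the sample node id is truncated for illustration):

```python
import re
from collections import Counter

# Count how many times each node address was marked dead.
DEAD_RE = re.compile(r"node id: (\w+) and address: ([\d.]+)")

log_lines = [
    "The node with node id: 906cbc47f7e0 and address: 10.128.0.230 "
    "and node name: 10.128.0.230 has been marked dead",
    "The node with node id: 906cbc47f7e0 and address: 10.128.0.230 "
    "and node name: 10.128.0.230 has been marked dead",
]
counts = Counter(
    m.group(2) for line in log_lines if (m := DEAD_RE.search(line))
)
print(counts)  # a repeated address points at one unhealthy node
```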

# (pid=, ip=10.128.0.91) E1123 08:17:29.823055   28612 tpu_hal_vxc_common_helper.cc:678] Failed to cleanup driver after error: INTERNAL: FAILED_PRECONDITION: Tried to close when driver was in state FAILED
# (pid=, ip=10.128.0.91) learning/45eac/tpu/runtime/hal/internal/vxc/tpu_vxc_driver.cc:843
# (pid=, ip=10.128.0.91) E1123 08:17:29.823093   28612 tpu_hal_vxc_common_helper.cc:678] Failed to cleanup driver after error: INTERNAL: FAILED_PRECONDITION: Tried to close when driver was in state FAILED
# (pid=, ip=10.128.0.91) learning/45eac/tpu/runtime/hal/internal/vxc/tpu_vxc_driver.cc:843
# (pid=, ip=10.128.0.91) E1123 08:17:29.823112   28612 tpu_hal_vxc_common_helper.cc:678] Failed to cleanup driver after error: INTERNAL: FAILED_PRECONDITION: Tried to close when driver was in state FAILED
# (pid=, ip=10.128.0.91) learning/45eac/tpu/runtime/hal/internal/vxc/tpu_vxc_driver.cc:843
# (pid=, ip=10.128.0.91) E1123 08:17:29.823129   28612 tpu_hal_vxc_common_helper.cc:678] Failed to cleanup driver after error: INTERNAL: FAILED_PRECONDITION: Tried to close when driver was in state FAILED
# (pid=, ip=10.128.0.91) learning/45eac/tpu/runtime/hal/internal/vxc/tpu_vxc_driver.cc:843
cat job-driver-ray-run-eczech-plantcad_isoflops_batch_size-20251123-121917.log | grep 'Failed to initialize TPU system with error' | wc -l
       9

# (train_lm_task pid=1728, ip=10.128.1.20) 2025-11-23T08:27:47 - 0 - levanter.store.cache - cache.py:302 - INFO :: Loading cache from gs://marin-us-central1/tokenized/plantcad_cropped-b9b31c/validation [repeated 2x across cluster]
# (train_lm_task pid=1728, ip=10.128.1.20) 2025-11-23T08:27:47 - 0 - levanter.store.cache - cache.py:506 - INFO :: Attempting to load cache ledger from gs://marin-us-central1/tokenized/plantcad_cropped-b9b31c/validation/shard_ledger.json [repeated 2x across cluster]
# (train_lm_task pid=1728, ip=10.128.1.20) 2025-11-23T08:27:47 - 0 - levanter.store.cache - cache.py:514 - WARNING :: Metadata mismatch: { 'type_changes': { 'root.preprocessor_metadata': { 'new_type': <class 'dict'>,
# (train_lm_task pid=1728, ip=10.128.1.20)                                                     'new_value': { 'append_bos': False,
# (train_lm_task pid=1728, ip=10.128.1.20)                                                                    'append_eos': False,
# (train_lm_task pid=1728, ip=10.128.1.20)                                                                    'max_length': 1000000000000000019884624838656,
# (train_lm_task pid=1728, ip=10.128.1.20)                                                                    'padding': False,
# (train_lm_task pid=1728, ip=10.128.1.20)                                                                    'return_attention_mask': False,
# (train_lm_task pid=1728, ip=10.128.1.20)                                                                    'tokenizer': 'kuleshov-group/PlantCAD2-Small-l24-d0768',
# (train_lm_task pid=1728, ip=10.128.1.20)                                                                    'vocab_size': 7},
# (train_lm_task pid=1728, ip=10.128.1.20)                                                     'old_type': <class 'NoneType'>,
# (train_lm_task pid=1728, ip=10.128.1.20)                                                     'old_value': None}}}
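The cache warning above fires because the on-disk ledger recorded `preprocessor_metadata` as None while the current config supplies a dict, so the comparison reports a type change (NoneType to dict) rather than a value change. A stdlib approximation of that comparison (field values taken from the log; the real cache loader's diff logic is assumed, not reproduced):

```python
# Old ledger metadata vs. metadata from the current run.
old = {"preprocessor_metadata": None}
new = {"preprocessor_metadata": {
    "append_bos": False,
    "append_eos": False,
    "tokenizer": "kuleshov-group/PlantCAD2-Small-l24-d0768",
    "vocab_size": 7,
}}

def diff(a: dict, b: dict) -> dict:
    """Classify each key as a type change or a value change."""
    changes = {}
    for key in a.keys() | b.keys():
        if type(a.get(key)) is not type(b.get(key)):
            changes[key] = ("type_change",
                            type(a.get(key)).__name__,
                            type(b.get(key)).__name__)
        elif a.get(key) != b.get(key):
            changes[key] = ("value_change",)
    return changes

print(diff(old, new))
```

This is only a warning in the log, but a type change usually means the cache was built before the metadata field existed, not that the tokenization itself differs.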
Notes

TODO

  • Is left crop ideal for seq_len reduction?
  • Should zephyr pipelines be mixed with executor steps?
  • How do you type annotate functions for .map in zephyr?
  • How should grad accum be factored into training step counts?
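On the last TODO item, the usual accounting is that one optimizer step consumes `accum_steps` microbatches, so the token budget per step scales with the global batch, not the per-device microbatch. A worked sketch with purely illustrative numbers (none of these are from the experiment config):

```python
# Illustrative grad-accumulation accounting.
seq_len = 8192
global_batch = 256                         # sequences per optimizer step
microbatch = 64                            # sequences per fwd/bwd pass
accum_steps = global_batch // microbatch   # microbatches per step -> 4

token_budget = 10**10                      # total training tokens (made up)
tokens_per_step = global_batch * seq_len   # tokens per optimizer step
num_steps = token_budget // tokens_per_step
print(accum_steps, tokens_per_step, num_steps)
```

If "training steps" is instead counted in microbatches, the step count is `accum_steps` times larger for the same token budget, which is the ambiguity worth pinning down in the sweep config.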

Commands

> cat .env
export MARIN_PREFIX=gs://marin-dna-us-central1
export WANDB_PROJECT=marin-dna
export WANDB_ENTITY=eric-czech

# Run dashboard (required to submit job)
ray dashboard infra/marin-us-central1.yaml

# Submit job
# First run:
python lib/marin/src/marin/run/ray_run.py \
  --env_vars WANDB_API_KEY ${WANDB_API_KEY} --  \
  python experiments/plantcad/plantcad_isoflops_batch_size.py

# Second run:
python lib/marin/src/marin/run/ray_run.py \
  --env_vars WANDB_API_KEY ${WANDB_API_KEY} \
  --env_vars HUGGING_FACE_HUB_TOKEN ${HUGGING_FACE_HUB_TOKEN} --  \
  python experiments/plantcad/plantcad_isoflops_batch_size.py \
  --prefix $MARIN_PREFIX --force_run_failed true

WIP

@eric-czech eric-czech changed the title Add isoflops by batch size experiment Experiment: Run PlantCAD isoflops sweep by batch size Nov 23, 2025
@eric-czech eric-czech force-pushed the eac/plantcad-isoflop branch 5 times, most recently from c404be1 to deccd2a on November 26, 2025 01:15
@eric-czech eric-czech changed the title Experiment: Run PlantCAD isoflops sweep by batch size Experiment: Run PlantCAD isoflops sweep Nov 26, 2025
@eric-czech eric-czech changed the title Experiment: Run PlantCAD isoflops sweep Experiment: Run PlantCAD IsoFLOP sweep Nov 26, 2025
@eric-czech eric-czech force-pushed the eac/plantcad-isoflop branch 3 times, most recently from 2743e72 to e0134ee on December 9, 2025 15:49
@eric-czech eric-czech force-pushed the eac/plantcad-isoflop branch 3 times, most recently from dc393c7 to 82855eb on December 22, 2025 19:17
@eric-czech eric-czech force-pushed the eac/plantcad-isoflop branch 8 times, most recently from 05ed07c to 312e57d on January 12, 2026 21:45
@eric-czech eric-czech force-pushed the eac/plantcad-isoflop branch 2 times, most recently from b236f7b to 2f9a7d7 on January 15, 2026 21:40
@eric-czech eric-czech force-pushed the eac/plantcad-isoflop branch from 2f9a7d7 to 446e1be on January 15, 2026 21:46