Conversation

@eric-czech eric-czech commented Nov 23, 2025

Description

refs: #2101
refs: #1729

Log files
Run 1 Errors
# Every training job ends with `BrokenPipeError: [Errno 32] Broken pipe`
cat job-driver-ray-run-eczech-plantcad_isoflops_batch_size-20251123-121917.log | grep 'Broken pipe' | wc -l
4

# (pid=, ip=10.128.0.91) E1123 04:35:32.542303    3439 tpu_pjrt_client.cc:1811] TpuClient::AllocateBuffer returning error after 1 attempt(s): RESOURCE_EXHAUSTED: Error allocating device buffer: Attempting to allocate 224.00M. That was not possible. There are 77.60M free. The largest contiguous region of free memory is 63.92M due to fragmentation.; (0x0x0_HBM0)
# (pid=, ip=10.128.0.91) third_party/tensorflow/compiler/xla/util.h:288
# (pid=, ip=10.128.0.91) learning/45eac/research/pjrt/tpu_pjrt_client.cc:1779
# (pid=, ip=10.128.0.91) [OOM Summary: Reason: Not enough free memory. Allocator stats:
# (pid=, ip=10.128.0.91) Limit:                    102803437568
# (pid=, ip=10.128.0.91) InUse:                     68240086016
# (pid=, ip=10.128.0.91) MaxInUse:                  68240086016
# (pid=, ip=10.128.0.91) NumAllocs:                      104400
# (pid=, ip=10.128.0.91) MaxAllocSize:              16107439104
# (pid=, ip=10.128.0.91) Reserved:                  34481979392
# (pid=, ip=10.128.0.91) PeakReserved:              34481979392
# (pid=, ip=10.128.0.91) LargestFreeBlock:                    0
# (pid=, ip=10.128.0.91) ]
cat job-driver-ray-run-eczech-plantcad_isoflops_batch_size-20251123-121917.log | grep -E 'RESOURCE_EXHAUSTED: Error allocating device buffer' | wc -l
2214
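The allocator stats in the OOM summary above are self-consistent and worth extracting: Limit minus InUse minus Reserved leaves about 77.6 MiB free, matching the "77.60M free" in the error, and LargestFreeBlock of 0 explains why a 224 MiB contiguous request fails. A minimal parsing sketch (field names taken from the excerpt; the log format is assumed and may differ across XLA versions):

```python
import re

# Parse the XLA allocator stats from an OOM summary block.
# Alternation order matters: longer names (MaxInUse, PeakReserved) are
# matched at their own start position before the shorter substrings.
STAT_RE = re.compile(
    r"(Limit|MaxInUse|InUse|NumAllocs|MaxAllocSize|PeakReserved|Reserved|LargestFreeBlock):\s+(\d+)"
)

def parse_allocator_stats(text: str) -> dict[str, int]:
    return {name: int(value) for name, value in STAT_RE.findall(text)}

stats = parse_allocator_stats("""
Limit:                    102803437568
InUse:                     68240086016
Reserved:                  34481979392
LargestFreeBlock:                    0
""")
free = stats["Limit"] - stats["InUse"] - stats["Reserved"]
# free == 81372160 bytes, i.e. ~77.6 MiB, matching the error message.
# With LargestFreeBlock == 0, even that remainder is too fragmented to
# satisfy a single 224 MiB buffer.
print(stats, free)
```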

# (pid=, ip=10.128.0.111) E1123 04:47:35.539063    3307 memory_space_assignment_util.cc:1382] INVALID_ARGUMENT: 104857600 bytes of scoped Vmem requested (via backend config for RPA-bq_16-bkvp_8-p_256.58), but the max valid bytes is 67043328. See go/scoped-vmem for more details.
# (pid=, ip=10.128.0.111) platforms/xla/service/ba16c7433/memory_space_assignment_util.cc:178
# (pid=, ip=10.128.0.111)  We are lowering the scoped Vmem for RPA-bq_16-bkvp_8-p_256.58 to 67043328 bytes.
# (pid=, ip=10.128.0.111) E1123 04:47:35.539071    3307 memory_space_assignment_util.cc:1382] INVALID_ARGUMENT: 104857600 bytes of scoped Vmem requested (via backend config for RPA-bq_16-bkvp_8-p_256.59), but the max valid bytes is 67043328. See go/scoped-vmem for more details.
# (pid=, ip=10.128.0.111) platforms/xla/service/ba16c7433/memory_space_assignment_util.cc:178
# (pid=, ip=10.128.0.111)  We are lowering the scoped Vmem for RPA-bq_16-bkvp_8-p_256.59 to 67043328 bytes.
# (pid=, ip=10.128.0.111) E1123 04:47:35.539083    3307 memory_space_assignment_util.cc:1382] INVALID_ARGUMENT: 104857600 bytes of scoped Vmem requested (via backend config for RPA-bq_16-bkvp_8-p_256.60), but the max valid bytes is 67043328. See go/scoped-vmem for more details.
# (pid=, ip=10.128.0.111) platforms/xla/service/ba16c7433/memory_space_assignment_util.cc:178
cat job-driver-ray-run-eczech-plantcad_isoflops_batch_size-20251123-121917.log | grep -E 'INVALID_ARGUMENT: .* bytes of scoped Vmem requested' | wc -l
    3565
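The scoped-Vmem errors all report the same pair of numbers: 104857600 bytes requested (exactly 100 MiB) against a cap of 67043328 (~64 MiB), after which XLA lowers the allocation per op. A small sketch for pulling those numbers out of each line (message format assumed from the excerpt above):

```python
import re

# Extract requested vs. allowed scoped-Vmem sizes from an
# INVALID_ARGUMENT line. The wording is copied from the log excerpt;
# XLA may change it between versions.
VMEM_RE = re.compile(
    r"(\d+) bytes of scoped Vmem requested .* but the max valid bytes is (\d+)"
)

line = (
    "INVALID_ARGUMENT: 104857600 bytes of scoped Vmem requested "
    "(via backend config for RPA-bq_16-bkvp_8-p_256.58), "
    "but the max valid bytes is 67043328."
)
m = VMEM_RE.search(line)
requested, max_valid = (int(g) for g in m.groups())
# requested == 100 MiB; max_valid (~64 MiB) is the cap XLA lowers to.
print(requested, max_valid)
```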

# [The node with node id: 906cbc47f7e0a7da881b51655f2c2fae74fb329c8d248d6af06dee4e and address: 10.128.0.230 and node name: 10.128.0.230 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a         
# (1) raylet crashes unexpectedly (OOM, etc.)
# (2) raylet has lagging heartbeats due to slow network or busy workload.
cat job-driver-ray-run-eczech-plantcad_isoflops_batch_size-20251123-121917.log | grep 'has been marked dead because the detector has missed too many heartbeats' | wc -l
      17
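Seventeen heartbeat-failure events may involve far fewer distinct nodes, which matters for deciding between the two causes Ray lists (raylet OOM vs. slow network). A hypothetical tallying sketch (regex written against the message format above; the sample node id is truncated for illustration):

```python
import re
from collections import Counter

# Count how many times each node address was marked dead.
DEAD_RE = re.compile(r"node id: (\w+) and address: ([\d.]+)")

log_lines = [
    "The node with node id: 906cbc47f7e0 and address: 10.128.0.230 "
    "and node name: 10.128.0.230 has been marked dead",
    "The node with node id: 906cbc47f7e0 and address: 10.128.0.230 "
    "and node name: 10.128.0.230 has been marked dead",
]
counts = Counter(
    m.group(2) for line in log_lines if (m := DEAD_RE.search(line))
)
print(counts)  # a repeated address points at one unhealthy node
```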

# (pid=, ip=10.128.0.91) E1123 08:17:29.823055   28612 tpu_hal_vxc_common_helper.cc:678] Failed to cleanup driver after error: INTERNAL: FAILED_PRECONDITION: Tried to close when driver was in state FAILED
# (pid=, ip=10.128.0.91) learning/45eac/tpu/runtime/hal/internal/vxc/tpu_vxc_driver.cc:843
# (pid=, ip=10.128.0.91) E1123 08:17:29.823093   28612 tpu_hal_vxc_common_helper.cc:678] Failed to cleanup driver after error: INTERNAL: FAILED_PRECONDITION: Tried to close when driver was in state FAILED
# (pid=, ip=10.128.0.91) learning/45eac/tpu/runtime/hal/internal/vxc/tpu_vxc_driver.cc:843
# (pid=, ip=10.128.0.91) E1123 08:17:29.823112   28612 tpu_hal_vxc_common_helper.cc:678] Failed to cleanup driver after error: INTERNAL: FAILED_PRECONDITION: Tried to close when driver was in state FAILED
# (pid=, ip=10.128.0.91) learning/45eac/tpu/runtime/hal/internal/vxc/tpu_vxc_driver.cc:843
# (pid=, ip=10.128.0.91) E1123 08:17:29.823129   28612 tpu_hal_vxc_common_helper.cc:678] Failed to cleanup driver after error: INTERNAL: FAILED_PRECONDITION: Tried to close when driver was in state FAILED
# (pid=, ip=10.128.0.91) learning/45eac/tpu/runtime/hal/internal/vxc/tpu_vxc_driver.cc:843
cat job-driver-ray-run-eczech-plantcad_isoflops_batch_size-20251123-121917.log | grep 'Failed to initialize TPU system with error' | wc -l
       9

# (train_lm_task pid=1728, ip=10.128.1.20) 2025-11-23T08:27:47 - 0 - levanter.store.cache - cache.py:302 - INFO :: Loading cache from gs://marin-us-central1/tokenized/plantcad_cropped-b9b31c/validation [repeated 2x across cluster]
# (train_lm_task pid=1728, ip=10.128.1.20) 2025-11-23T08:27:47 - 0 - levanter.store.cache - cache.py:506 - INFO :: Attempting to load cache ledger from gs://marin-us-central1/tokenized/plantcad_cropped-b9b31c/validation/shard_ledger.json [repeated 2x across cluster]
# (train_lm_task pid=1728, ip=10.128.1.20) 2025-11-23T08:27:47 - 0 - levanter.store.cache - cache.py:514 - WARNING :: Metadata mismatch: { 'type_changes': { 'root.preprocessor_metadata': { 'new_type': <class 'dict'>,
# (train_lm_task pid=1728, ip=10.128.1.20)                                                     'new_value': { 'append_bos': False,
# (train_lm_task pid=1728, ip=10.128.1.20)                                                                    'append_eos': False,
# (train_lm_task pid=1728, ip=10.128.1.20)                                                                    'max_length': 1000000000000000019884624838656,
# (train_lm_task pid=1728, ip=10.128.1.20)                                                                    'padding': False,
# (train_lm_task pid=1728, ip=10.128.1.20)                                                                    'return_attention_mask': False,
# (train_lm_task pid=1728, ip=10.128.1.20)                                                                    'tokenizer': 'kuleshov-group/PlantCAD2-Small-l24-d0768',
# (train_lm_task pid=1728, ip=10.128.1.20)                                                                    'vocab_size': 7},
# (train_lm_task pid=1728, ip=10.128.1.20)                                                     'old_type': <class 'NoneType'>,
# (train_lm_task pid=1728, ip=10.128.1.20)                                                     'old_value': None}}}
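The cache warning above fires because the on-disk ledger recorded `preprocessor_metadata` as None while the current config supplies a dict, so the comparison reports a type change (NoneType to dict) rather than a value change. A stdlib approximation of that comparison (field values taken from the log; the real cache loader's diff logic is assumed, not reproduced):

```python
# Old ledger metadata vs. metadata from the current run.
old = {"preprocessor_metadata": None}
new = {"preprocessor_metadata": {
    "append_bos": False,
    "append_eos": False,
    "tokenizer": "kuleshov-group/PlantCAD2-Small-l24-d0768",
    "vocab_size": 7,
}}

def diff(a: dict, b: dict) -> dict:
    """Classify each key as a type change or a value change."""
    changes = {}
    for key in a.keys() | b.keys():
        if type(a.get(key)) is not type(b.get(key)):
            changes[key] = ("type_change",
                            type(a.get(key)).__name__,
                            type(b.get(key)).__name__)
        elif a.get(key) != b.get(key):
            changes[key] = ("value_change",)
    return changes

print(diff(old, new))
```

This is only a warning in the log, but a type change usually means the cache was built before the metadata field existed, not that the tokenization itself differs.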
Notes

TODO

  • Is left crop ideal for seq_len reduction?
  • Should zephyr pipelines be mixed with executor steps?
  • How do you type annotate functions for .map in zephyr?
  • How should grad accum be factored into training step counts?
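On the last TODO item, the usual accounting is that one optimizer step consumes `accum_steps` microbatches, so the token budget per step scales with the global batch, not the per-device microbatch. A worked sketch with purely illustrative numbers (none of these are from the experiment config):

```python
# Illustrative grad-accumulation accounting.
seq_len = 8192
global_batch = 256                         # sequences per optimizer step
microbatch = 64                            # sequences per fwd/bwd pass
accum_steps = global_batch // microbatch   # microbatches per step -> 4

token_budget = 10**10                      # total training tokens (made up)
tokens_per_step = global_batch * seq_len   # tokens per optimizer step
num_steps = token_budget // tokens_per_step
print(accum_steps, tokens_per_step, num_steps)
```

If "training steps" is instead counted in microbatches, the step count is `accum_steps` times larger for the same token budget, which is the ambiguity worth pinning down in the sweep config.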

Commands

> cat .env
export MARIN_PREFIX=gs://marin-dna-us-central1
export WANDB_PROJECT=marin-dna
export WANDB_ENTITY=eric-czech

# Run dashboard (required to submit job)
ray dashboard infra/marin-us-central1.yaml

# Submit job
# First run:
python lib/marin/src/marin/run/ray_run.py \
  --env_vars WANDB_API_KEY ${WANDB_API_KEY} --  \
  python experiments/plantcad/plantcad_isoflops_batch_size.py

# Second run:
python lib/marin/src/marin/run/ray_run.py \
  --env_vars WANDB_API_KEY ${WANDB_API_KEY} \
  --env_vars HUGGING_FACE_HUB_TOKEN ${HUGGING_FACE_HUB_TOKEN} --  \
  python experiments/plantcad/plantcad_isoflops_batch_size.py \
  --prefix $MARIN_PREFIX --force_run_failed true

WIP

@eric-czech eric-czech changed the title Add isoflops by batch size experiment Experiment: Run PlantCAD isoflops sweep by batch size Nov 23, 2025
@eric-czech eric-czech force-pushed the eac/plantcad-isoflop branch 5 times, most recently from c404be1 to deccd2a on November 26, 2025 01:15
@eric-czech eric-czech changed the title Experiment: Run PlantCAD isoflops sweep by batch size Experiment: Run PlantCAD isoflops sweep Nov 26, 2025
@eric-czech eric-czech changed the title Experiment: Run PlantCAD isoflops sweep Experiment: Run PlantCAD IsoFLOP sweep Nov 26, 2025
@eric-czech eric-czech force-pushed the eac/plantcad-isoflop branch 3 times, most recently from 2743e72 to e0134ee on December 9, 2025 15:49
@eric-czech eric-czech force-pushed the eac/plantcad-isoflop branch 3 times, most recently from dc393c7 to 82855eb on December 22, 2025 19:17
@eric-czech eric-czech force-pushed the eac/plantcad-isoflop branch 8 times, most recently from 05ed07c to 312e57d on January 12, 2026 21:45
@eric-czech eric-czech force-pushed the eac/plantcad-isoflop branch 2 times, most recently from b236f7b to 2f9a7d7 on January 15, 2026 21:40
@eric-czech eric-czech force-pushed the eac/plantcad-isoflop branch from 2f9a7d7 to 446e1be on January 15, 2026 21:46