@jeremymanning

Summary

Replaces rsync-based code syncing with GitHub-based workflow for all cluster scripts.

Changes

Created: scripts/cluster/setup_cluster.sh

  • Centralized idempotent setup function for all cluster scripts
  • Clones/updates repo from ContextLab/giblet-responses (upstream)
  • Downloads Sherlock dataset if needed (11GB, checks for 15+ .nii.gz files)
  • Creates conda environment giblet-py311 if missing
  • Installs dependencies from requirements.txt
  • All operations are re-runnable and safe (idempotent)
  • Color-coded output with clear success/warning/error messages
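
A minimal sketch of how the idempotent dataset check described above could look. This is not the actual contents of setup_cluster.sh: the function names (`count_nifti`, `dataset_ready`, `ensure_dataset`) and the directory layout are assumptions made for illustration; only the 15-file threshold comes from this description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an idempotent dataset check (not the real setup_cluster.sh).

MIN_NIFTI_FILES=15   # threshold from the PR description ("15+ .nii.gz files")

# Print the number of .nii.gz files under a directory (0 if it does not exist).
count_nifti() {
    local data_dir="$1"
    if [ ! -d "$data_dir" ]; then
        echo 0
        return 0
    fi
    find "$data_dir" -type f -name '*.nii.gz' | wc -l | tr -d ' '
}

# Succeed when the dataset looks complete, so the download can be skipped.
dataset_ready() {
    local data_dir="$1"
    [ "$(count_nifti "$data_dir")" -ge "$MIN_NIFTI_FILES" ]
}

# Run the download command (all remaining arguments) only when needed.
# Re-running this after a successful download is a no-op.
ensure_dataset() {
    local data_dir="$1"
    shift
    if dataset_ready "$data_dir"; then
        echo "dataset present, skipping download"
    else
        echo "dataset missing, downloading"
        "$@"
    fi
}
```

The same check-then-act shape would apply to the conda environment and dependency steps: test for the desired end state first, and only do the expensive work when that test fails.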

Updated: scripts/cluster/remote_train.sh

  • Removed: All rsync operations for code syncing
  • Added: GitHub sync via setup_cluster_environment() function call
  • Kept: All existing functionality (screen sessions, training, monitoring, etc.)

Updated: scripts/cluster/remote_evaluate.sh

  • Added: GitHub sync step (new Step 1) before evaluation
  • Kept: rsync for syncing reconstruction results BACK to local machine (correct - results are too large for git)
  • Updated: Step numbering in output messages
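
For illustration, pulling results back to the local machine could be sketched as follows. The function name `pull_results`, the `RSYNC` override variable, and all hosts and paths are hypothetical; the real script's rsync options are not shown in this description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of syncing results from the cluster to the local machine.
# RSYNC can be overridden (e.g. for testing); it defaults to the real rsync.

pull_results() {
    local remote="$1"       # e.g. user@host (assumed form)
    local remote_dir="$2"   # results directory on the cluster
    local local_dir="$3"    # destination on the local machine

    mkdir -p "$local_dir"
    # -a: preserve attributes, -v: verbose, -z: compress in transit,
    # --partial: keep partially transferred files so large copies can resume.
    # The trailing slash on the source copies the directory's contents.
    "${RSYNC:-rsync}" -avz --partial \
        "${remote}:${remote_dir%/}/" "${local_dir%/}/"
}
```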

Benefits of GitHub Sync

  1. Version Control: All code changes tracked in git history
  2. Atomic Updates: Hard reset ensures clean state every time
  3. No Conflicts: Untracked files automatically cleaned
  4. Faster: No SSH file transfer needed
  5. Reliable: GitHub is single source of truth
  6. Reproducible: Exact same code on local and remote
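
The hard-reset and clean behaviour behind points 2 and 3 can be sketched roughly as below. This is an assumption about how such a sync step might be written, not the code in setup_cluster.sh; `sync_repo`, `run`, and the `DRY_RUN` switch are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a GitHub-based sync step.
# Set DRY_RUN=1 to print the git commands instead of executing them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Clone the repository if it is missing; otherwise force the existing
# checkout to match the remote branch exactly.
sync_repo() {
    local repo_url="$1"
    local target_dir="$2"
    local branch="${3:-main}"

    if [ -d "$target_dir/.git" ]; then
        run git -C "$target_dir" fetch origin "$branch"
        # Discard local commits and modifications on the cluster copy.
        run git -C "$target_dir" reset --hard "origin/$branch"
        # Remove untracked files and directories (ignored files are kept).
        run git -C "$target_dir" clean -fd
    else
        run git clone --branch "$branch" "$repo_url" "$target_dir"
    fi
}
```

Note that a sync of this kind throws away any uncommitted edits made directly on the cluster, which is the trade-off behind the "clean state every time" guarantee.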

Usage

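Before running cluster scripts, always commit and push your changes, then wait for them to be merged into upstream (ContextLab/giblet-responses). The cluster pulls from upstream, so unmerged or unpushed changes will not reach it: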

git add .
git commit -m "Update training code"
git push origin main
# Wait for merge to upstream

Then run cluster scripts as normal:

# Training
./scripts/cluster/remote_train.sh --cluster tensor02 --gpus 6

# Evaluation
./scripts/cluster/remote_evaluate.sh --cluster tensor02 --checkpoint path/to/checkpoint.pt

Testing

The scripts have been tested locally only and have not yet been run on a cluster. Testing on the actual cluster will follow once this PR is merged.

Related Issues

Closes #33
Closes #34

Generated with Claude Code

Co-Authored-By: Claude [email protected]

@jeremymanning jeremymanning merged commit ce1d685 into ContextLab:main Nov 7, 2025
0 of 3 checks passed
Development

Successfully merging this pull request may close these issues.

  • Create shared cluster setup function
  • Refactor cluster scripts to use GitHub sync instead of rsync
