@fae713 fae713 commented Aug 6, 2025

Starting from version 0.5.8.0, I began encountering the following error when running RL Swarm on a CPU-only setup:

```
RuntimeError: Shared memory manager connection has timed out
```

This issue stems from how multiprocessing is handled on Unix-based systems. On Linux, PyTorch (like Python's standard multiprocessing) defaults to the fork start method for creating subprocesses, which can cause shared memory allocation problems, particularly in CPU-only environments that rely on MPFuture (used internally by libraries like hivemind and genrl).
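For context, the active start method can be inspected with the standard multiprocessing API (which torch.multiprocessing mirrors); this is a quick diagnostic sketch, not part of the patch:

```python
import multiprocessing as mp

# On most Linux systems this prints 'fork'; on macOS (Python 3.8+)
# and Windows it prints 'spawn'.
print(mp.get_start_method())
```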

✅ The Fix
This PR adds the following code to the top of swarm_launcher.py, before any other multiprocessing or Torch components are imported.

I modified the file:

rgym_exp/runner/swarm_launcher.py

to include:

```python
import torch.multiprocessing as mp

if mp.get_start_method(allow_none=True) != 'spawn':
    mp.set_start_method('spawn', force=True)
```

This safely switches the multiprocessing start method to 'spawn', which is more compatible with shared memory on CPU and avoids runtime errors in setups where GPU acceleration isn't used.
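One general caveat with 'spawn' (standard Python multiprocessing behavior, not specific to this patch): child processes re-import the main module, so any process-launching entry point must sit behind an `if __name__ == '__main__':` guard. A minimal illustration:

```python
import torch.multiprocessing as mp

def worker(rank):
    print(f"worker {rank} started")

if __name__ == '__main__':
    # Under 'spawn', children re-import this module; the guard
    # prevents them from re-running the entry-point code.
    mp.set_start_method('spawn', force=True)
    procs = [mp.Process(target=worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```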

💡 Why it matters

  • Prevents crashes during MPFuture() creation on CPU setups
  • Keeps the code compatible with GPU and Docker-based workflows
  • Improves accessibility for contributors and developers without access to CUDA-enabled hardware

Looking forward to feedback. Let me know if you'd prefer this behind a config flag or if further compatibility testing is needed.
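If a config flag is preferred, one possible shape would be an environment-variable gate; the variable name below is purely illustrative, not an existing RL Swarm setting:

```python
import os
import torch.multiprocessing as mp

# Hypothetical opt-out: set RL_SWARM_FORCE_SPAWN=0 to keep the
# platform default start method (variable name is illustrative).
if os.environ.get("RL_SWARM_FORCE_SPAWN", "1") == "1":
    if mp.get_start_method(allow_none=True) != 'spawn':
        mp.set_start_method('spawn', force=True)
```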
