GRPO OOM after checkpointing #1057

@bxyu-nvidia

Description

Describe the bug

In GRPO, the first vLLM wakeup after a checkpoint is saved fails with an out-of-memory (OOM) error. See https://wandb.ai/nvidia/bxyu-nemo-gym-rl-integration/runs/fbgrwc5h/logs, line 3329. The same log also contains some unrelated messages regarding penguin.

Steps/Code to reproduce bug

Please list minimal steps or code snippet for us to be able to reproduce the bug.

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.

Expected behavior

A clear and concise description of what you expected to happen.

Environment overview (please complete the following information)

  • Environment location: [Bare-metal, Docker, Cloud (specify cloud provider: AWS, Azure, GCP, Colab)]
  • Method of install: [pip install or from source]. Please specify exact commands you used to install.
  • If method of install is [Docker], provide docker pull & docker run commands used

Environment details

If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:

  • OS version
  • PyTorch version
  • Python version

Additional context

Add any other context about the problem here.
Example: GPU model

Metadata

Labels

  • bug: Something isn't working
  • research: Tag for research team's issues
