GRPO OOM after checkpointing

**Describe the bug**

In GRPO, the first vLLM wake-up after checkpointing hits an out-of-memory (OOM) error. See line 3329 of the logs at https://wandb.ai/nvidia/bxyu-nemo-gym-rl-integration/runs/fbgrwc5h/logs. Those logs also contain some unrelated messages about penguin.

**Steps/Code to reproduce bug**

Please list *minimal* steps or code snippet for us to be able to reproduce the bug.

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports


**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment overview (please complete the following information)**

 - Environment location: [Bare-metal, Docker, Cloud (specify cloud provider: AWS, Azure, GCP, Colab)]
 - Method of install: [pip install or from source]. Please specify exact commands you used to install.
 - If method of install is [Docker], provide `docker pull` & `docker run` commands used

**Environment details**

If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:
- OS version
- PyTorch version
- Python version

**Additional context**

Add any other context about the problem here.
Example: GPU model


GRPO OOM after checkpointing #1057

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

GRPO OOM after checkpointing #1057

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions