1 change: 1 addition & 0 deletions README.md
@@ -82,6 +82,7 @@ NeMo Gym includes a curated collection of resource servers for training and eval
| agent | Stateful Counter | <a href='resources_servers/stateful_counter/configs/stateful_counter.yaml'>resources_servers/stateful_counter/configs/stateful_counter.yaml</a> | Apache 2.0 | Train, Validation, Example |
| agent | Workbench | <a href='resources_servers/workbench/configs/workbench.yaml'>resources_servers/workbench/configs/workbench.yaml</a> | Apache 2.0 | Train, Validation, Example |
| coding | Comp Coding | <a href='resources_servers/comp_coding/configs/comp_coding.yaml'>resources_servers/comp_coding/configs/comp_coding.yaml</a> | Apache 2.0 | Train, Validation, Example |
| games | Grl Sokoban | <a href='resources_servers/grl_sokoban/configs/grl_sokoban.yaml'>resources_servers/grl_sokoban/configs/grl_sokoban.yaml</a> | Apache 2.0 | |
| instruction_following | Instruction Following | <a href='resources_servers/instruction_following/configs/instruction_following.yaml'>resources_servers/instruction_following/configs/instruction_following.yaml</a> | Apache 2.0 | Train, Example |
| instruction_following | Multineedle | <a href='resources_servers/multineedle/configs/multineedle.yaml'>resources_servers/multineedle/configs/multineedle.yaml</a> | Apache 2.0 | Train, Validation, Example |
| instruction_following | Structured Outputs | <a href='resources_servers/structured_outputs/configs/structured_outputs_json.yaml'>resources_servers/structured_outputs/configs/structured_outputs_json.yaml</a> | Apache 2.0 | Train, Validation, Example |
310 changes: 310 additions & 0 deletions resources_servers/grl_sokoban/README.md
@@ -0,0 +1,310 @@
# GRL Sokoban Resource Server

A single-box Sokoban puzzle environment served via FastAPI, following NeMo Gym conventions. The environment is implemented locally under `resources_servers/grl_sokoban/env`, mirroring GRL’s behaviour without requiring the external repository.
Contributor:

> I think it's worth pointing out that this integrates the environment https://github.com/mpSchrader/gym-sokoban, which implements DeepMind's paper "Imagination-Augmented Agents for Deep Reinforcement Learning" and follows the https://gymnasium.farama.org/ standard.

Collaborator (Author):

> Yes, good catch!

## Why it exists
- **Domain**: Deterministic Sokoban puzzles.
- **Evaluation**: Agents must push a box onto its target with minimal invalid moves.
- **Verifier**: `/verify` rewards the cumulative Sokoban score only when `success` is reported by the environment.
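
A minimal sketch of that gating logic, assuming hypothetical `success` and `cumulative_score` field names (the actual `/verify` payload follows the server's implementation):

```python
# Sketch of the /verify gating described above. Field names are
# assumptions for illustration; see the server code for the real schema.
def verify_reward(env_result: dict) -> float:
    """Pass through the cumulative Sokoban score only on success."""
    if env_result.get("success"):
        return env_result.get("cumulative_score", 0.0)
    return 0.0  # placeholder fallback; the server may return step penalties instead
```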

## Setup
1. **Install NeMo Gym locally (one-time)**
   ```bash
   uv pip install -e ".[dev]"
   ```
   This makes the `ng_*` CLI available in your active environment.
2. **Install Sokoban-specific dependencies**
   ```bash
   uv pip install -r resources_servers/grl_sokoban/requirements.txt
   ```
3. (Optional) Prepare datasets using `ng_collect_rollouts` once custom rollouts are available.

## Running
Spin up the server alongside a compatible agent:
```bash
config_paths="responses_api_models/openai_model/configs/openai_model.yaml,\
resources_servers/grl_sokoban/configs/grl_sokoban.yaml"
ng_run "+config_paths=[$config_paths]"
```

Collect trajectories:
```bash
ng_collect_rollouts +agent_name=grl_sokoban_game_agent \
+input_jsonl_fpath=resources_servers/grl_sokoban/data/example.jsonl \
+output_jsonl_fpath=resources_servers/grl_sokoban/data/example_rollouts.jsonl \
+limit=5
```

Launch the rollout viewer:
```bash
ng_viewer +jsonl_fpath=resources_servers/grl_sokoban/data/example_rollouts.jsonl
```

## Generating Test Examples for Reward Profiling

To meet the reward-profiling requirements in CONTRIBUTING.md, generate ~500 diverse test examples with varying seeds and room dimensions:

```bash
cd resources_servers/grl_sokoban
python generate_test_examples.py --num-examples 500
```

This creates `data/test_examples.jsonl` with diverse configurations:
- **Room sizes**: [4,4] to [8,8] with various aspect ratios
- **Num boxes**: Weighted distribution (62% 1-box, 25% 2-box, 13% 3-box)
- **Seeds**: Randomized for unique, solvable puzzles

Use the generated test set for reward profiling (see next section).
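
As a sketch, the weighted configuration sampling described above might look like the following (illustrative field names; `generate_test_examples.py` is the authoritative implementation):

```python
# Illustrative sketch of sampling one test-example configuration with
# the distribution described above; field names are assumptions.
import random

def sample_config(rng: random.Random) -> dict:
    width, height = rng.randint(4, 8), rng.randint(4, 8)         # room sizes [4,4]..[8,8]
    num_boxes = rng.choices([1, 2, 3], weights=[62, 25, 13])[0]  # 62/25/13 split
    return {
        "room_size": [width, height],
        "num_boxes": num_boxes,
        "seed": rng.randrange(2**31),  # randomized seed per puzzle
    }

rng = random.Random(0)
examples = [sample_config(rng) for _ in range(500)]
```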

## Running with vLLM for Reward Profiling

For reward profiling and RL training (as per CONTRIBUTING.md), use vLLM with local models like Qwen3-30B-A3B.

**Choose your setup:**
- **Single GPU?** → Follow the "Quick Start (Single GPU)" section below
- **Multi-GPU (4+ GPUs)?** → Follow the "Multi-GPU Setup" section below

---

## Quick Start (Single GPU)

### 1. Start vLLM Server

**Prerequisites:**
```bash
uv pip install vllm hf_transfer
```

```bash
HF_HOME=.cache/ \
vllm serve Qwen/Qwen3-30B-A3B \
--dtype auto \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.85 \
--enable-auto-tool-choice --tool-call-parser hermes \
--host 0.0.0.0 \
--port 10240 \
--max-model-len 8192 \
--trust-remote-code
```

**Wait 2-5 minutes for the model to load.** Then verify the server is ready:
```bash
curl http://localhost:10240/v1/models
```

### 2. Start NeMo Gym Servers

In a new terminal:
```bash
# Set environment variables
export policy_base_url="http://localhost:10240/v1"
export policy_api_key="dummy"
export policy_model_name="Qwen/Qwen3-30B-A3B"

# Start servers (no Ray cluster needed for single GPU)
ng_run "+config_paths=[responses_api_models/vllm_model/configs/vllm_model.yaml,resources_servers/grl_sokoban/configs/grl_sokoban.yaml]"
```

**Wait until you see:** `All 3 / 3 servers ready!` before proceeding.

### 3. Collect Rollouts

**In a new terminal** (keep servers running):
```bash
ng_collect_rollouts +agent_name=grl_sokoban_game_agent \
+input_jsonl_fpath=resources_servers/grl_sokoban/data/test_examples.jsonl \
+output_jsonl_fpath=resources_servers/grl_sokoban/data/test_rollouts.jsonl \
+limit=null \
+num_repeats=4 \
+num_samples_in_parallel=32 \
+responses_create_params.temperature=0.8 \
+responses_create_params.max_output_tokens=3000
```

---

## Multi-GPU Setup (4+ GPUs)

### 1. Start vLLM Server with Multi-GPU
```bash
HF_HOME=.cache/ \
vllm serve Qwen/Qwen3-30B-A3B \
--dtype auto \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--enable-auto-tool-choice --tool-call-parser hermes \
--host 0.0.0.0 \
--port 10240 \
--max-model-len 8192 \
--trust-remote-code
```

**Wait 2-5 minutes for the model to load.** Then verify the server is ready:
```bash
curl http://localhost:10240/v1/models
```

### 2. Start Shared Ray Cluster

**Important for multi-GPU setups:** To avoid slow startup and port conflicts, start a shared Ray cluster first:

```bash
# Clean up any existing Ray sessions
ray stop --force

# Start a shared Ray cluster
ray start --head --port=6379 --dashboard-host=0.0.0.0 --disable-usage-stats

# Wait a few seconds for cluster to be ready
sleep 3
```

### 3. Start NeMo Gym Servers

In a new terminal (or the same terminal after Ray starts):
```bash
# Set environment variables
export policy_base_url="http://localhost:10240/v1"
export policy_api_key="dummy"
export policy_model_name="Qwen/Qwen3-30B-A3B"

# Start servers with shared Ray cluster
ng_run "+config_paths=[responses_api_models/vllm_model/configs/vllm_model.yaml,resources_servers/grl_sokoban/configs/grl_sokoban.yaml]" \
"+ray_head_node_address=127.0.0.1:6379"
```

**Wait until you see:** `All 3 / 3 servers ready!` before proceeding.

### 4. Collect Rollouts

**In a new terminal** (keep servers running):

**Using the test examples dataset (500 diverse puzzles, with high parallelism):**
```bash
ng_collect_rollouts +agent_name=grl_sokoban_game_agent \
+input_jsonl_fpath=resources_servers/grl_sokoban/data/test_examples.jsonl \
+output_jsonl_fpath=resources_servers/grl_sokoban/data/test_rollouts.jsonl \
+limit=null \
+num_repeats=1 \
+num_samples_in_parallel=128 \
+responses_create_params.temperature=0.8 \
+responses_create_params.max_output_tokens=3000
```

---

## Analyze Reward Distribution (Both Setups)

### Automated Analysis (Recommended)

**Generate a comprehensive reward-profiling report** (required by CONTRIBUTING.md):

```bash
cd resources_servers/grl_sokoban

# Install pandas if not already installed
pip install pandas

# Generate report for Qwen3-30B-A3B
python analyze_rewards.py \
--rollouts-path data/test_rollouts.jsonl \
--model-name "Qwen3-30B-A3B" \
--output data/reward_analysis_qwen3_30b.md

# View the report
cat data/reward_analysis_qwen3_30b.md
```

This generates a complete report including:
- Reward distribution statistics (min, max, mean, median)
- Success rate analysis
- Reward histogram
- Tool call metrics and correlation with rewards
- Per-prompt performance breakdown
- Top/bottom performing prompts

**For Qwen3-235B-Instruct** (second required model):
```bash
# After collecting rollouts with 235B model, run:
python analyze_rewards.py \
--rollouts-path data/test_rollouts_qwen3_235b.jsonl \
--model-name "Qwen3-235B-Instruct" \
--output data/reward_analysis_qwen3_235b.md
```

### Results Summary (Qwen3-4B)

**Dataset**: 3,200 rollouts (200 prompts × 16 repeats)

**Performance Metrics**:
- **Success Rate**: 13.47% (431/3,200 rollouts)
- **Mean Reward**: 0.93 (range: -8.90 to 10.90)
- **Median Reward**: 0.00

**Key Findings**:
- Most rollouts (66.7%) received a reward of 0.00 (no valid actions taken)
- Successful puzzle solutions achieved rewards of ~10.5-10.9
- Average 2.64 tool calls per rollout
- Moderate negative correlation between tool calls and reward (-0.23)

**Top Reward Distribution**:
- `0.0`: 2,134 rollouts (66.7%) - no valid actions or early termination
- `10.8`: 206 rollouts (6.4%) - successful puzzle completion
- `10.9`: 72 rollouts (2.2%) - successful puzzle completion
- `10.7`: 51 rollouts (1.6%) - successful puzzle completion
- Negative rewards: Invalid moves or non-optimal solutions

The low success rate (13.47%) shows that Sokoban puzzle-solving demands spatial planning and an understanding of box-pushing mechanics. Most failures result from the model not taking valid actions (reward 0.0), while successful completions achieve consistently high rewards (~10.5-10.9). The negative correlation between tool calls and reward suggests that longer action sequences often lead to invalid moves or dead-end states.

See [`data/qwen3_4b_eval/reward-analysis.md`](data/qwen3_4b_eval/reward-analysis.md) for complete analysis.

### Interactive Viewer

**Visual exploration of rollouts:**
```bash
ng_viewer +jsonl_fpath=resources_servers/grl_sokoban/data/test_rollouts.jsonl
```

### Manual Command-Line Analysis

**Quick stats** (if you prefer manual analysis):
```bash
# Reward distribution
jq '.reward' resources_servers/grl_sokoban/data/test_rollouts.jsonl | sort -n | uniq -c

# Statistics (min, max, avg)
jq -s 'map(.reward) | {
min: min,
max: max,
avg: (add / length),
count: length
}' resources_servers/grl_sokoban/data/test_rollouts.jsonl

# Success count (divide by the total rollout count for the rate)
jq -s 'map(select(.success == true)) | length' \
resources_servers/grl_sokoban/data/test_rollouts.jsonl

# Tool call metrics (average per rollout)
jq -s 'map([.output[] | select(.type == "function_call")] | length) | add / length' \
resources_servers/grl_sokoban/data/test_rollouts.jsonl
```
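
The tool-call/reward correlation reported in the results summary is not covered by these jq commands; a minimal Python sketch could compute it, assuming the same `.reward` and `function_call` fields used above:

```python
# Compute the Pearson correlation between tool calls and reward from a
# rollouts JSONL file, using only the fields the jq queries rely on.
import json

rewards, tool_calls = [], []
with open("resources_servers/grl_sokoban/data/test_rollouts.jsonl") as f:
    for line in f:
        rollout = json.loads(line)
        rewards.append(rollout["reward"])
        tool_calls.append(
            sum(1 for item in rollout["output"] if item.get("type") == "function_call")
        )

# Pearson correlation computed by hand to avoid extra dependencies.
n = len(rewards)
mean_r, mean_t = sum(rewards) / n, sum(tool_calls) / n
cov = sum((r - mean_r) * (t - mean_t) for r, t in zip(rewards, tool_calls))
std_r = sum((r - mean_r) ** 2 for r in rewards) ** 0.5
std_t = sum((t - mean_t) ** 2 for t in tool_calls) ** 0.5
print(f"corr(tool_calls, reward) = {cov / (std_r * std_t):.2f}")
```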

### Other Recommended Models

- **For math/coding tasks:** `Qwen/Qwen3-235B-Thinking`
- **For agents/instruction following:** `Qwen/Qwen3-235B-Instruct`

Adjust `--tensor-parallel-size` based on available GPUs (235B models typically need 8 GPUs).
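
For example, serving the Instruct model could reuse the multi-GPU command with the model name and tensor parallelism swapped (a sketch using the model name as given above; confirm the exact Hugging Face model id before use):

```bash
HF_HOME=.cache/ \
vllm serve Qwen/Qwen3-235B-Instruct \
--dtype auto \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.9 \
--enable-auto-tool-choice --tool-call-parser hermes \
--host 0.0.0.0 \
--port 10240 \
--max-model-len 8192 \
--trust-remote-code
```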

## Dataset artifacts
Placeholder files live under `data/` (`example.jsonl`, `example_metrics.json`, `example_rollouts.jsonl`). Replace them with generated rollouts and metrics when integrating into training pipelines.
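
For reference, the fields the analysis commands above depend on look roughly like this (a hypothetical, truncated record; the `name` and `arguments` values are purely illustrative, and real rollouts carry additional fields):

```python
# Hypothetical, truncated rollout record; only the fields queried by the
# jq commands in this README are shown, with illustrative values.
rollout = {
    "reward": 10.8,
    "success": True,
    "output": [
        {"type": "function_call", "name": "step", "arguments": '{"action": "up"}'},
        # ...model messages and tool results omitted...
    ],
}
```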

## Tests
```bash
pytest resources_servers/grl_sokoban/tests
```

## Licensing
- Code: Apache 2.0
- Data: Apache 2.0