95 changes: 95 additions & 0 deletions resources_servers/grl_sokoban/README.md
@@ -0,0 +1,95 @@
# GRL Sokoban Resource Server

Single-box Sokoban puzzle environment. The environment is implemented under `resources_servers/grl_sokoban/env`, mirroring the Sokoban implementation in the GRL repository (https://github.com/lmgame-org/GRL), and is based on code from https://github.com/lmgame-org/lmenv, developed in collaboration with NVIDIA. It uses the gym-sokoban package (https://github.com/mpSchrader/gym-sokoban), which implements the environment from DeepMind's paper *Imagination-Augmented Agents for Deep Reinforcement Learning* and follows the standard of https://gymnasium.farama.org.

## Why it exists
- **Domain**: Deterministic Sokoban puzzles.
- **Evaluation**: Agents must push a box onto its target with minimal invalid moves.
- **Verifier**: `/verify` returns the session's cumulative Sokoban reward together with a `success` flag taken from the environment's final `info`.
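
As a rough sketch of what the verifier computes, mirroring the `verify` handler in `app.py`: the reward is the running total accumulated over all steps, and `success` is read from the last `info` dict the environment returned. The names `SessionState` and `score_session` are illustrative stand-ins, not part of the server.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class SessionState:
    """Hypothetical stand-in for the per-session state kept by the server."""

    total_reward: float = 0.0
    last_info: Dict[str, Any] = field(default_factory=dict)


def score_session(state: Optional[SessionState]) -> Tuple[float, bool]:
    """Return (reward, success) for a session.

    A session that was never seeded scores 0.0 and counts as a failure.
    """
    if state is None:
        return 0.0, False
    # The success flag comes from the environment's most recent info dict.
    return state.total_reward, bool(state.last_info.get("success"))
```

For example, `score_session(SessionState(total_reward=10.8, last_info={"success": True}))` yields a reward of `10.8` with `success` set to `True`, while a session whose last `info` lacks a `success` key is reported as unsuccessful.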

## Setup

Follow the clone-and-install instructions in the setup tutorial: https://github.com/NVIDIA-NeMo/Gym/blob/main/docs/tutorials/02-setup.md#step-1-clone-and-install

## Running
Spin up the server alongside a compatible agent:
```bash
config_paths="responses_api_models/openai_model/configs/openai_model.yaml,\
resources_servers/grl_sokoban/configs/grl_sokoban.yaml"
ng_run "+config_paths=[$config_paths]"
```

Collect trajectories:
```bash
ng_collect_rollouts +agent_name=grl_sokoban_simple_agent \
+input_jsonl_fpath=resources_servers/grl_sokoban/data/example.jsonl \
+output_jsonl_fpath=resources_servers/grl_sokoban/data/example_rollouts.jsonl \
+limit=5
```

Launch the rollout viewer:
```bash
ng_viewer +jsonl_fpath=resources_servers/grl_sokoban/data/example_rollouts.jsonl
```

## Tests
```bash
pytest resources_servers/grl_sokoban/tests
```

## Licensing
- Code: Apache 2.0
- Data: Apache 2.0

---

## Reward Profiling Results

### Qwen3-4B

**Dataset**: 3,200 rollouts (200 prompts × 16 repeats)

**Performance Metrics**:
- **Success Rate**: 13.47% (431/3,200 rollouts)
- **Mean Reward**: 0.93 (range: -8.90 to 10.90)
- **Median Reward**: 0.00

**Key Findings**:
- Most rollouts (66.7%) received reward of 0.00 (no valid actions taken)
- Successful puzzle solutions achieved rewards of ~10.5-10.9
- Average 2.64 tool calls per rollout
- Weak negative correlation between tool calls and reward (-0.23)

**Top Reward Distribution**:
- `0.0`: 2,134 rollouts (66.7%) - no valid actions or early termination
- `10.8`: 206 rollouts (6.4%) - successful puzzle completion
- `10.9`: 72 rollouts (2.2%) - successful puzzle completion
- `10.7`: 51 rollouts (1.6%) - successful puzzle completion
- Negative rewards: Invalid moves or non-optimal solutions

The low success rate (13.47%) suggests that Sokoban puzzle-solving demands spatial planning and an understanding of box-pushing mechanics that the model often lacks. Most failures come from the model taking no valid actions (reward 0.0), while successful completions earn consistently high rewards (~10.5-10.9). The negative correlation between tool calls and reward suggests that longer sequences often lead to invalid moves or dead-end states.

### Qwen3-30B-A3B

**Dataset**: 3,200 rollouts (200 prompts × 16 repeats)

**Performance Metrics**:
- **Success Rate**: 38.56% (1,234/3,200 rollouts)
- **Mean Reward**: 4.00 (range: -5.40 to 10.90)
- **Median Reward**: 0.00

**Key Findings**:
- The largest share of rollouts (43.9%) received a reward of 0.00 (no valid actions taken)
- Successful puzzle solutions achieved rewards of ~10.5-10.9
- Average 2.10 tool calls per rollout
- Weak positive correlation between tool calls and reward (0.22)

**Top Reward Distribution**:
- `0.0`: 1,405 rollouts (43.9%) - no valid actions or early termination
- `10.8`: 477 rollouts (14.9%) - successful puzzle completion
- `10.6`: 183 rollouts (5.7%) - successful puzzle completion
- `10.7`: 172 rollouts (5.4%) - successful puzzle completion
- `10.9`: 157 rollouts (4.9%) - successful puzzle completion
- Negative rewards: Invalid moves or non-optimal solutions

The higher success rate (38.56%) compared to Qwen3-4B indicates that the larger model is substantially better at spatial planning and understanding box-pushing mechanics. While the majority of failures still result from not taking valid actions (reward 0.0), the model achieves nearly 3x the success rate of the 4B variant. The positive correlation between tool calls and reward suggests that the model can use longer action sequences effectively to solve puzzles.
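
The summary statistics above (success rate, mean and median reward, average tool calls, and the tool-call/reward correlation) could be recomputed from collected rollouts along these lines. This is a minimal sketch, not the script used to produce the numbers: it assumes each rollout has already been reduced to a dict with `reward`, `success`, and `num_tool_calls` keys, and the `num_tool_calls` field in particular is a hypothetical name.

```python
import math
from statistics import mean, median
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def summarize(rollouts: List[Dict]) -> Dict[str, float]:
    """Aggregate per-rollout records into the metrics reported above."""
    rewards = [r["reward"] for r in rollouts]
    calls = [r["num_tool_calls"] for r in rollouts]  # assumed field name
    successes = sum(1 for r in rollouts if r["success"])
    return {
        "success_rate": successes / len(rollouts),
        "mean_reward": mean(rewards),
        "median_reward": median(rewards),
        "mean_tool_calls": mean(calls),
        "tool_call_reward_corr": pearson(calls, rewards),
    }
```

Note that the reported median of 0.00 for both models follows directly from the distribution: in each run, more than half of the rollouts score 0.0 or below while reward 0.0 is the single most common value.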
237 changes: 237 additions & 0 deletions resources_servers/grl_sokoban/app.py
@@ -0,0 +1,237 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

from fastapi import Body, FastAPI, HTTPException, Request
from pydantic import BaseModel, Field

from nemo_gym.base_resources_server import (
    BaseResourcesServerConfig,
    BaseSeedSessionRequest,
    BaseSeedSessionResponse,
    BaseVerifyRequest,
    BaseVerifyResponse,
    SimpleResourcesServer,
)
from nemo_gym.server_utils import SESSION_ID_KEY
from resources_servers.grl_sokoban.sokoban_env import SokobanEnv


DEFAULT_GRID_LOOKUP = {0: "#", 1: "_", 2: "O", 3: "√", 4: "X", 5: "P", 6: "S"}
DEFAULT_ACTION_LOOKUP = {1: "Up", 2: "Down", 3: "Left", 4: "Right"}


class GrlSokobanResourcesServerConfig(BaseResourcesServerConfig):
    env_config: Dict[str, Any] = Field(
        default_factory=lambda: {
            "grid_lookup": DEFAULT_GRID_LOOKUP,
            "action_lookup": DEFAULT_ACTION_LOOKUP,
            "search_depth": 100,
            "dim_room": (6, 6),
            "max_steps": 100,
            "num_boxes": 1,
            "render_mode": "text",
        }
    )


class GrlSokobanSeedSessionRequest(BaseSeedSessionRequest):
    seed: Optional[int] = None


class GrlSokobanSeedSessionResponse(BaseSeedSessionResponse):
    observation: str


class GrlSokobanStepRequest(BaseModel):
    actions: List[Union[str, int]] = Field(default_factory=list)


class GrlSokobanStepTrace(BaseModel):
    action_id: int
    action_label: str
    reward: float
    done: bool
    info: Dict[str, Any]


class GrlSokobanStepResponse(BaseModel):
    observation: str
    reward: float
    total_reward: float
    done: bool
    steps: List[GrlSokobanStepTrace]
    history: List[GrlSokobanStepTrace] = Field(default_factory=list)


class GrlSokobanVerifyResponse(BaseVerifyResponse):
    success: bool


@dataclass
class SokobanSessionState:
    env: Any
    observation: str
    total_reward: float = 0.0
    done: bool = False
    last_info: Dict[str, Any] = field(default_factory=dict)
    history: List[GrlSokobanStepTrace] = field(default_factory=list)


class GrlSokobanResourcesServer(SimpleResourcesServer):
    config: GrlSokobanResourcesServerConfig
    session_id_to_state: Dict[str, SokobanSessionState] = Field(default_factory=dict)

    def setup_webserver(self) -> FastAPI:
        app = super().setup_webserver()
        app.post("/step")(self.step)
        return app

    def _create_env(self):
        return SokobanEnv(self.config.env_config)

    async def seed_session(
        self, request: Request, body: GrlSokobanSeedSessionRequest
    ) -> GrlSokobanSeedSessionResponse:
        session_id = request.session[SESSION_ID_KEY]
        env = self._create_env()
        observation = env.reset(seed=body.seed)

        self.session_id_to_state[session_id] = SokobanSessionState(
            env=env,
            observation=observation,
        )

        return GrlSokobanSeedSessionResponse(observation=observation)

    async def step(self, request: Request, body: Any = Body()) -> GrlSokobanStepResponse:
        session_id = request.session.get(SESSION_ID_KEY)
        if session_id is None or session_id not in self.session_id_to_state:
            raise HTTPException(status_code=400, detail="Session not initialized. Call /seed_session first.")

        # Handle both formats: {"actions": [...]} and [...]
        # This makes the endpoint more robust to handle cases where the model
        # might send just the array instead of the expected object format
        if isinstance(body, list):
            # If body is directly a list, wrap it in the expected format
            parsed_body = GrlSokobanStepRequest(actions=body)
        elif isinstance(body, dict):
            # If body is a dict, try to parse it normally
            try:
                parsed_body = GrlSokobanStepRequest.model_validate(body)
            except Exception as e:
                raise HTTPException(
                    status_code=422,
                    detail=f"Invalid request format. Expected {{'actions': [...]}} or [...], got: {body}. Error: {str(e)}",
                )
        else:
            raise HTTPException(
                status_code=422,
                detail=f"Invalid request format. Expected {{'actions': [...]}} or [...], got: {type(body).__name__}",
            )

        session_state = self.session_id_to_state[session_id]
        env = session_state.env

        reverse_lookup = {label.lower(): idx for idx, label in env.ACTION_LOOKUP.items()}
        total_step_reward = 0.0
        steps: List[GrlSokobanStepTrace] = []

        if session_state.done:
            return GrlSokobanStepResponse(
                observation=session_state.observation,
                reward=0.0,
                total_reward=session_state.total_reward,
                done=True,
                steps=[],
                history=list(session_state.history),
            )

        for action in parsed_body.actions:
            action_id = self._parse_action(action, reverse_lookup)
            if action_id not in env.ACTION_LOOKUP:
                raise HTTPException(status_code=400, detail=f"Invalid action identifier: {action}")

            next_obs, reward, done, info = env.step(action_id)
            total_step_reward += reward
            session_state.total_reward += reward
            session_state.observation = next_obs
            session_state.last_info = info
            session_state.done = bool(done)

            step = GrlSokobanStepTrace(
                action_id=action_id,
                action_label=env.ACTION_LOOKUP[action_id],
                reward=reward,
                done=session_state.done,
                info=info,
            )
            session_state.history.append(step)
            steps.append(step)

            if session_state.done:
                break

        return GrlSokobanStepResponse(
            observation=session_state.observation,
            reward=total_step_reward,
            total_reward=session_state.total_reward,
            done=session_state.done,
            steps=steps,
            history=list(session_state.history),  # Return full history for convenience
        )

    async def verify(self, request: Request, body: BaseVerifyRequest) -> GrlSokobanVerifyResponse:
        session_id = request.session.get(SESSION_ID_KEY)
        session_state = self.session_id_to_state.get(session_id)

        success = False
        reward = 0.0
        if session_state is not None:
            success = bool(session_state.last_info.get("success"))
            reward = session_state.total_reward

        if session_id in self.session_id_to_state:
            try:
                session_state.env.close()
            except Exception:
                pass
            del self.session_id_to_state[session_id]

        return GrlSokobanVerifyResponse(
            **body.model_dump(),
            reward=reward,
            success=success,
        )

    @staticmethod
    def _parse_action(action: Union[str, int], reverse_lookup: Dict[str, int]) -> int:
        if isinstance(action, int):
            return action

        candidate = action.strip()
        if candidate.lower() in reverse_lookup:
            return reverse_lookup[candidate.lower()]

        try:
            return int(candidate)
        except ValueError as exc:  # pragma: no cover - clarity around invalid input
            raise HTTPException(status_code=400, detail=f"Unable to parse action: {action}") from exc


if __name__ == "__main__":
    GrlSokobanResourcesServer.run_webserver()
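
The `/step` handler above accepts actions either as labels (`"Up"`, `" left "`) or as numeric ids (`3`, `"3"`). The following standalone sketch reproduces that normalisation without FastAPI, so it can be tried in isolation. It is an illustration rather than the server code: `parse_action` is a hypothetical helper, it folds the id-range check into the same function, and it raises `ValueError` where the server raises `HTTPException`.

```python
from typing import Dict, Union

# Same mapping as DEFAULT_ACTION_LOOKUP in app.py.
ACTION_LOOKUP: Dict[int, str] = {1: "Up", 2: "Down", 3: "Left", 4: "Right"}
REVERSE_LOOKUP: Dict[str, int] = {label.lower(): idx for idx, label in ACTION_LOOKUP.items()}


def parse_action(action: Union[str, int]) -> int:
    """Map an action label or numeric id to a valid action id."""
    if isinstance(action, int):
        action_id = action
    else:
        candidate = action.strip().lower()
        if candidate in REVERSE_LOOKUP:
            # Case-insensitive label such as "up" or " Right ".
            action_id = REVERSE_LOOKUP[candidate]
        else:
            try:
                # Numeric string such as "3".
                action_id = int(candidate)
            except ValueError as exc:
                raise ValueError(f"Unable to parse action: {action!r}") from exc
    if action_id not in ACTION_LOOKUP:
        raise ValueError(f"Invalid action identifier: {action!r}")
    return action_id
```

With this helper, `parse_action(" up ")` and `parse_action(1)` both resolve to action id `1`, while `parse_action("7")` and `parse_action("jump")` are rejected.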
28 changes: 28 additions & 0 deletions resources_servers/grl_sokoban/configs/grl_sokoban.yaml
@@ -0,0 +1,28 @@
grl_sokoban_resources_server:
  resources_servers:
    grl_sokoban:
      entrypoint: app.py
      domain: games
      verified: true
grl_sokoban_simple_agent:
  responses_api_agents:
    simple_agent:
      entrypoint: app.py
      max_steps: 10
      count_tool_calls: true
      resources_server:
        type: resources_servers
        name: grl_sokoban_resources_server
      model_server:
        type: responses_api_models
        name: policy_model
      datasets:
      - name: example
        type: example
        jsonl_fpath: resources_servers/grl_sokoban/data/example.jsonl
        num_repeats: 1
        gitlab_identifier:
          dataset_name: grl_sokoban
          version: 0.0.1
          artifact_fpath: example.jsonl
        license: Apache 2.0
5 changes: 5 additions & 0 deletions resources_servers/grl_sokoban/data/example.jsonl
@@ -0,0 +1,5 @@
{"level_id": 1, "seed": 84810, "dim_room": [5, 5], "num_boxes": 1, "responses_create_params": {"max_tool_calls": 10, "input": [{"role": "developer", "content": "You are a Sokoban-solving assistant. IMPORTANT: First call the `step` tool with an empty array [] to see the initial puzzle state. Example: step({\"actions\": []}). The tool will return the board as a string with symbols: #=wall, _=floor, O=target, X=box, \u221a=box on target, P=player, S=player on target. Then continue calling `step` with valid actions (Up, Down, Left, Right) until the puzzle is solved or you run out of moves. At the end, respond with <answer>Action1 || Action2 || ...</answer> summarizing all moves you executed."}, {"role": "user", "content": "Call the step tool to see the puzzle, then solve it step by step."}], "tools": [{"name": "step", "type": "function", "description": "Execute Sokoban moves sequentially. Call with empty array [] to see current puzzle state without moving.", "strict": true, "parameters": {"type": "object", "properties": {"actions": {"type": "array", "items": {"type": "string"}, "description": "Sequence of actions e.g. ['Up', 'Right']. Use empty array [] to view current state."}}, "required": ["actions"], "additionalProperties": false}}]}}
{"level_id": 2, "seed": 98293, "dim_room": [8, 8], "num_boxes": 1, "responses_create_params": {"max_tool_calls": 10, "input": [{"role": "developer", "content": "You are a Sokoban-solving assistant. IMPORTANT: First call the `step` tool with an empty array [] to see the initial puzzle state. Example: step({\"actions\": []}). The tool will return the board as a string with symbols: #=wall, _=floor, O=target, X=box, \u221a=box on target, P=player, S=player on target. Then continue calling `step` with valid actions (Up, Down, Left, Right) until the puzzle is solved or you run out of moves. At the end, respond with <answer>Action1 || Action2 || ...</answer> summarizing all moves you executed."}, {"role": "user", "content": "Call the step tool to see the puzzle, then solve it step by step."}], "tools": [{"name": "step", "type": "function", "description": "Execute Sokoban moves sequentially. Call with empty array [] to see current puzzle state without moving.", "strict": true, "parameters": {"type": "object", "properties": {"actions": {"type": "array", "items": {"type": "string"}, "description": "Sequence of actions e.g. ['Up', 'Right']. Use empty array [] to view current state."}}, "required": ["actions"], "additionalProperties": false}}]}}
{"level_id": 3, "seed": 30450, "dim_room": [6, 6], "num_boxes": 1, "responses_create_params": {"max_tool_calls": 10, "input": [{"role": "developer", "content": "You are a Sokoban-solving assistant. IMPORTANT: First call the `step` tool with an empty array [] to see the initial puzzle state. Example: step({\"actions\": []}). The tool will return the board as a string with symbols: #=wall, _=floor, O=target, X=box, \u221a=box on target, P=player, S=player on target. Then continue calling `step` with valid actions (Up, Down, Left, Right) until the puzzle is solved or you run out of moves. At the end, respond with <answer>Action1 || Action2 || ...</answer> summarizing all moves you executed."}, {"role": "user", "content": "Call the step tool to see the puzzle, then solve it step by step."}], "tools": [{"name": "step", "type": "function", "description": "Execute Sokoban moves sequentially. Call with empty array [] to see current puzzle state without moving.", "strict": true, "parameters": {"type": "object", "properties": {"actions": {"type": "array", "items": {"type": "string"}, "description": "Sequence of actions e.g. ['Up', 'Right']. Use empty array [] to view current state."}}, "required": ["actions"], "additionalProperties": false}}]}}
{"level_id": 4, "seed": 89987, "dim_room": [6, 7], "num_boxes": 1, "responses_create_params": {"max_tool_calls": 10, "input": [{"role": "developer", "content": "You are a Sokoban-solving assistant. IMPORTANT: First call the `step` tool with an empty array [] to see the initial puzzle state. Example: step({\"actions\": []}). The tool will return the board as a string with symbols: #=wall, _=floor, O=target, X=box, \u221a=box on target, P=player, S=player on target. Then continue calling `step` with valid actions (Up, Down, Left, Right) until the puzzle is solved or you run out of moves. At the end, respond with <answer>Action1 || Action2 || ...</answer> summarizing all moves you executed."}, {"role": "user", "content": "Call the step tool to see the puzzle, then solve it step by step."}], "tools": [{"name": "step", "type": "function", "description": "Execute Sokoban moves sequentially. Call with empty array [] to see current puzzle state without moving.", "strict": true, "parameters": {"type": "object", "properties": {"actions": {"type": "array", "items": {"type": "string"}, "description": "Sequence of actions e.g. ['Up', 'Right']. Use empty array [] to view current state."}}, "required": ["actions"], "additionalProperties": false}}]}}
{"level_id": 5, "seed": 78785, "dim_room": [6, 4], "num_boxes": 1, "responses_create_params": {"max_tool_calls": 10, "input": [{"role": "developer", "content": "You are a Sokoban-solving assistant. IMPORTANT: First call the `step` tool with an empty array [] to see the initial puzzle state. Example: step({\"actions\": []}). The tool will return the board as a string with symbols: #=wall, _=floor, O=target, X=box, \u221a=box on target, P=player, S=player on target. Then continue calling `step` with valid actions (Up, Down, Left, Right) until the puzzle is solved or you run out of moves. At the end, respond with <answer>Action1 || Action2 || ...</answer> summarizing all moves you executed."}, {"role": "user", "content": "Call the step tool to see the puzzle, then solve it step by step."}], "tools": [{"name": "step", "type": "function", "description": "Execute Sokoban moves sequentially. Call with empty array [] to see current puzzle state without moving.", "strict": true, "parameters": {"type": "object", "properties": {"actions": {"type": "array", "items": {"type": "string"}, "description": "Sequence of actions e.g. ['Up', 'Right']. Use empty array [] to view current state."}}, "required": ["actions"], "additionalProperties": false}}]}}
8 changes: 8 additions & 0 deletions resources_servers/grl_sokoban/data/example_metrics.json
@@ -0,0 +1,8 @@
{
"name": "example",
"type": "example",
"jsonl_fpath": "resources_servers/grl_sokoban/data/example.jsonl",
"gitlab_identifier": null,
"license": "Apache 2.0",
"Number of examples": 5
}
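
The `Number of examples` value matches the five rows in `example.jsonl`. A minimal sketch of how such a count could be derived, assuming it is simply the number of non-empty, valid JSON lines in the file (`count_examples` is a hypothetical helper, not part of the repository):

```python
import json
from typing import Iterable


def count_examples(lines: Iterable[str]) -> int:
    """Count non-empty lines, raising if any of them is not valid JSON."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        json.loads(line)  # raises json.JSONDecodeError on a malformed row
        count += 1
    return count
```

Because an open text file iterates line by line, it can be passed directly: `count_examples(open("resources_servers/grl_sokoban/data/example.jsonl", encoding="utf-8"))`.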