NVIDIA-NeMo · yixinhuang48 · Oct 30, 2025 · Oct 30, 2025 · Nov 4, 2025 · Nov 4, 2025
diff --git a/resources_servers/grl_sokoban/README.md b/resources_servers/grl_sokoban/README.md
@@ -0,0 +1,95 @@
+# GRL Sokoban Resource Server
+
+Single-box Sokoban puzzle environment. The environment is implemented under `resources_servers/grl_sokoban/env` and mirrors the Sokoban implementation in the GRL repo (https://github.com/lmgame-org/GRL). It is based on code from https://github.com/lmgame-org/lmenv, developed in collaboration with NVIDIA. The implementation uses the gym-sokoban package (https://github.com/mpSchrader/gym-sokoban), which implements the Sokoban environment from DeepMind's paper "Imagination-Augmented Agents for Deep Reinforcement Learning" and follows the standard of https://gymnasium.farama.org.
+
+## Why it exists
+- **Domain**: Deterministic Sokoban puzzles.
+- **Evaluation**: Agents must push a box onto its target with minimal invalid moves.
+- **Verifier**: `/verify` returns the session's cumulative Sokoban reward together with the `success` flag reported by the environment.
+
+## Setup
+
+Follow the setup instructions at https://github.com/NVIDIA-NeMo/Gym/blob/main/docs/tutorials/02-setup.md#step-1-clone-and-install
+
+## Running
+Spin up the server alongside a compatible agent:
+```bash
+config_paths="responses_api_models/openai_model/configs/openai_model.yaml,\
+resources_servers/grl_sokoban/configs/grl_sokoban.yaml"
+ng_run "+config_paths=[$config_paths]"
+```
+
+Collect trajectories:
+```bash
+ng_collect_rollouts +agent_name=grl_sokoban_simple_agent \
+    +input_jsonl_fpath=resources_servers/grl_sokoban/data/example.jsonl \
+    +output_jsonl_fpath=resources_servers/grl_sokoban/data/example_rollouts.jsonl \
+    +limit=5
+```
+
+Launch the rollout viewer:
+```bash
+ng_viewer +jsonl_fpath=resources_servers/grl_sokoban/data/example_rollouts.jsonl
+```
+
+## Tests
+```bash
+pytest resources_servers/grl_sokoban/tests
+```
+
+## Licensing
+- Code: Apache 2.0
+- Data: Apache 2.0
+
+---
+
+## Reward Profiling Results
+
+### Qwen3-4B
+
+**Dataset**: 3,200 rollouts (200 prompts × 16 repeats)
+
+**Performance Metrics**:
+- **Success Rate**: 13.47% (431/3,200 rollouts)
+- **Mean Reward**: 0.93 (range: -8.90 to 10.90)
+- **Median Reward**: 0.00
+
+**Key Findings**:
+- Most rollouts (66.7%) received reward of 0.00 (no valid actions taken)
+- Successful puzzle solutions achieved rewards of ~10.5-10.9
+- Average 2.64 tool calls per rollout
+- Weak negative correlation between tool calls and reward (-0.23)
+
+**Top Reward Distribution**:
+- `0.0`: 2,134 rollouts (66.7%) - no valid actions or early termination
+- `10.8`: 206 rollouts (6.4%) - successful puzzle completion
+- `10.9`: 72 rollouts (2.2%) - successful puzzle completion
+- `10.7`: 51 rollouts (1.6%) - successful puzzle completion
+- Negative rewards: Invalid moves or non-optimal solutions
+
+The low success rate (13.47%) suggests that Sokoban puzzle-solving requires spatial planning and an understanding of box-pushing mechanics. Most failures are rollouts with reward 0.0 (no valid actions or early termination), while successful completions earn consistently high rewards (~10.5-10.9). The negative correlation between tool calls and reward (-0.23) suggests that longer action sequences often lead to invalid moves or dead-end states.
+
+### Qwen3-30B-A3B
+
+**Dataset**: 3,200 rollouts (200 prompts × 16 repeats)
+
+**Performance Metrics**:
+- **Success Rate**: 38.56% (1,234/3,200 rollouts)
+- **Mean Reward**: 4.00 (range: -5.40 to 10.90)
+- **Median Reward**: 0.00
+
+**Key Findings**:
+- The largest group of rollouts (43.9%) received a reward of 0.00 (no valid actions taken)
+- Successful puzzle solutions achieved rewards of ~10.5-10.9
+- Average 2.10 tool calls per rollout
+- Weak positive correlation between tool calls and reward (0.22)
+
+**Top Reward Distribution**:
+- `0.0`: 1,405 rollouts (43.9%) - no valid actions or early termination
+- `10.8`: 477 rollouts (14.9%) - successful puzzle completion
+- `10.6`: 183 rollouts (5.7%) - successful puzzle completion
+- `10.7`: 172 rollouts (5.4%) - successful puzzle completion
+- `10.9`: 157 rollouts (4.9%) - successful puzzle completion
+- Negative rewards: Invalid moves or non-optimal solutions
+
+The higher success rate (38.56%, versus 13.47% for Qwen3-4B) indicates that the larger model is substantially better at spatial planning and box-pushing mechanics, reaching roughly 2.9x the success rate of the 4B variant. Most failures are still rollouts with reward 0.0 (no valid actions or early termination). The positive correlation between tool calls and reward (0.22) suggests that this model can use longer action sequences to solve puzzles.
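
Note on the profiling numbers above: the reported success rates and the "nearly 3x" comparison can be re-derived from the raw counts in the README. The following is a minimal sketch that only uses the figures quoted there (431 and 1,234 successes out of 3,200 rollouts each); the names `TOTAL_ROLLOUTS`, `SUCCESSES`, and `success_rate` are illustrative and not part of this PR.

```python
# Raw counts quoted in the README's "Reward Profiling Results" section.
TOTAL_ROLLOUTS = 3200  # 200 prompts x 16 repeats
SUCCESSES = {"Qwen3-4B": 431, "Qwen3-30B-A3B": 1234}


def success_rate(successes: int, total: int = TOTAL_ROLLOUTS) -> float:
    """Return the success rate as a percentage of all rollouts."""
    return 100.0 * successes / total


rates = {model: success_rate(count) for model, count in SUCCESSES.items()}
# Ratio of the 30B-A3B success rate to the 4B success rate.
ratio = rates["Qwen3-30B-A3B"] / rates["Qwen3-4B"]

for model, rate in rates.items():
    print(f"{model}: {rate:.2f}%")
print(f"ratio: {ratio:.2f}x")
```

The ratio comes out slightly below 3, which matches the README's "nearly 3x" wording.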
diff --git a/resources_servers/grl_sokoban/app.py b/resources_servers/grl_sokoban/app.py
@@ -0,0 +1,237 @@
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from typing import Any, Dict, List, Optional, Union
+
+from fastapi import Body, FastAPI, HTTPException, Request
+from pydantic import BaseModel, Field
+
+from nemo_gym.base_resources_server import (
+    BaseResourcesServerConfig,
+    BaseSeedSessionRequest,
+    BaseSeedSessionResponse,
+    BaseVerifyRequest,
+    BaseVerifyResponse,
+    SimpleResourcesServer,
+)
+from nemo_gym.server_utils import SESSION_ID_KEY
+from resources_servers.grl_sokoban.sokoban_env import SokobanEnv
+
+
+DEFAULT_GRID_LOOKUP = {0: "#", 1: "_", 2: "O", 3: "√", 4: "X", 5: "P", 6: "S"}
+DEFAULT_ACTION_LOOKUP = {1: "Up", 2: "Down", 3: "Left", 4: "Right"}
+
+
+class GrlSokobanResourcesServerConfig(BaseResourcesServerConfig):
+    env_config: Dict[str, Any] = Field(
+        default_factory=lambda: {
+            "grid_lookup": DEFAULT_GRID_LOOKUP,
+            "action_lookup": DEFAULT_ACTION_LOOKUP,
+            "search_depth": 100,
+            "dim_room": (6, 6),
+            "max_steps": 100,
+            "num_boxes": 1,
+            "render_mode": "text",
+        }
+    )
+
+
+class GrlSokobanSeedSessionRequest(BaseSeedSessionRequest):
+    seed: Optional[int] = None
+
+
+class GrlSokobanSeedSessionResponse(BaseSeedSessionResponse):
+    observation: str
+
+
+class GrlSokobanStepRequest(BaseModel):
+    actions: List[Union[str, int]] = Field(default_factory=list)
+
+
+class GrlSokobanStepTrace(BaseModel):
+    action_id: int
+    action_label: str
+    reward: float
+    done: bool
+    info: Dict[str, Any]
+
+
+class GrlSokobanStepResponse(BaseModel):
+    observation: str
+    reward: float
+    total_reward: float
+    done: bool
+    steps: List[GrlSokobanStepTrace]
+    history: List[GrlSokobanStepTrace] = Field(default_factory=list)
+
+
+class GrlSokobanVerifyResponse(BaseVerifyResponse):
+    success: bool
+
+
+@dataclass
+class SokobanSessionState:
+    env: Any
+    observation: str
+    total_reward: float = 0.0
+    done: bool = False
+    last_info: Dict[str, Any] = field(default_factory=dict)
+    history: List[GrlSokobanStepTrace] = field(default_factory=list)
+
+
+class GrlSokobanResourcesServer(SimpleResourcesServer):
+    config: GrlSokobanResourcesServerConfig
+    session_id_to_state: Dict[str, SokobanSessionState] = Field(default_factory=dict)
+
+    def setup_webserver(self) -> FastAPI:
+        app = super().setup_webserver()
+        app.post("/step")(self.step)
+        return app
+
+    def _create_env(self):
+        return SokobanEnv(self.config.env_config)
+
+    async def seed_session(
+        self, request: Request, body: GrlSokobanSeedSessionRequest
+    ) -> GrlSokobanSeedSessionResponse:
+        session_id = request.session[SESSION_ID_KEY]
+        env = self._create_env()
+        observation = env.reset(seed=body.seed)
+
+        self.session_id_to_state[session_id] = SokobanSessionState(
+            env=env,
+            observation=observation,
+        )
+
+        return GrlSokobanSeedSessionResponse(observation=observation)
+
+    async def step(self, request: Request, body: Any = Body()) -> GrlSokobanStepResponse:
+        session_id = request.session.get(SESSION_ID_KEY)
+        if session_id is None or session_id not in self.session_id_to_state:
+            raise HTTPException(status_code=400, detail="Session not initialized. Call /seed_session first.")
+
+        # Handle both formats: {"actions": [...]} and [...]
+        # This makes the endpoint more robust to handle cases where the model
+        # might send just the array instead of the expected object format
+        if isinstance(body, list):
+            # If body is directly a list, wrap it in the expected format
+            parsed_body = GrlSokobanStepRequest(actions=body)
+        elif isinstance(body, dict):
+            # If body is a dict, try to parse it normally
+            try:
+                parsed_body = GrlSokobanStepRequest.model_validate(body)
+            except Exception as e:
+                raise HTTPException(
+                    status_code=422,
+                    detail=f"Invalid request format. Expected {{'actions': [...]}} or [...], got: {body}. Error: {str(e)}",
+                )
+        else:
+            raise HTTPException(
+                status_code=422,
+                detail=f"Invalid request format. Expected {{'actions': [...]}} or [...], got: {type(body).__name__}",
+            )
+
+        session_state = self.session_id_to_state[session_id]
+        env = session_state.env
+
+        reverse_lookup = {label.lower(): idx for idx, label in env.ACTION_LOOKUP.items()}
+        total_step_reward = 0.0
+        steps: List[GrlSokobanStepTrace] = []
+
+        if session_state.done:
+            return GrlSokobanStepResponse(
+                observation=session_state.observation,
+                reward=0.0,
+                total_reward=session_state.total_reward,
+                done=True,
+                steps=[],
+                history=list(session_state.history),
+            )
+
+        for action in parsed_body.actions:
+            action_id = self._parse_action(action, reverse_lookup)
+            if action_id not in env.ACTION_LOOKUP:
+                raise HTTPException(status_code=400, detail=f"Invalid action identifier: {action}")
+
+            next_obs, reward, done, info = env.step(action_id)
+            total_step_reward += reward
+            session_state.total_reward += reward
+            session_state.observation = next_obs
+            session_state.last_info = info
+            session_state.done = bool(done)
+
+            step = GrlSokobanStepTrace(
+                action_id=action_id,
+                action_label=env.ACTION_LOOKUP[action_id],
+                reward=reward,
+                done=session_state.done,
+                info=info,
+            )
+            session_state.history.append(step)
+            steps.append(step)
+
+            if session_state.done:
+                break
+
+        return GrlSokobanStepResponse(
+            observation=session_state.observation,
+            reward=total_step_reward,
+            total_reward=session_state.total_reward,
+            done=session_state.done,
+            steps=steps,
+            history=list(session_state.history),  # Return full history for convenience
+        )
+
+    async def verify(self, request: Request, body: BaseVerifyRequest) -> GrlSokobanVerifyResponse:
+        session_id = request.session.get(SESSION_ID_KEY)
+        session_state = self.session_id_to_state.get(session_id)
+
+        success = False
+        reward = 0.0
+        if session_state is not None:
+            success = bool(session_state.last_info.get("success"))
+            reward = session_state.total_reward
+
+        if session_id in self.session_id_to_state:
+            try:
+                session_state.env.close()
+            except Exception:
+                pass
+            del self.session_id_to_state[session_id]
+
+        return GrlSokobanVerifyResponse(
+            **body.model_dump(),
+            reward=reward,
+            success=success,
+        )
+
+    @staticmethod
+    def _parse_action(action: Union[str, int], reverse_lookup: Dict[str, int]) -> int:
+        if isinstance(action, int):
+            return action
+
+        candidate = action.strip()
+        if candidate.lower() in reverse_lookup:
+            return reverse_lookup[candidate.lower()]
+
+        try:
+            return int(candidate)
+        except ValueError as exc:  # pragma: no cover - clarity around invalid input
+            raise HTTPException(status_code=400, detail=f"Unable to parse action: {action}") from exc
+
+
+if __name__ == "__main__":
+    GrlSokobanResourcesServer.run_webserver()
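
Note on action handling in `app.py` above: `/step` accepts each action either as a label ("Up", "right") or as a numeric id (3, "3"), and rejects ids that are not in the environment's action lookup. The sketch below restates that logic as a standalone function so it can be tried without FastAPI or the Sokoban environment. It is an illustration under assumptions, not the server code: `parse_action` is a hypothetical helper, it raises `ValueError` where the server raises `HTTPException(400)`, and `ACTION_LOOKUP` copies `DEFAULT_ACTION_LOOKUP` from this diff.

```python
from typing import Dict, Union

# Copied from DEFAULT_ACTION_LOOKUP in app.py (this diff).
ACTION_LOOKUP: Dict[int, str] = {1: "Up", 2: "Down", 3: "Left", 4: "Right"}


def parse_action(
    action: Union[str, int],
    action_lookup: Dict[int, str] = ACTION_LOOKUP,
) -> int:
    """Resolve a label or numeric id to a valid action id.

    Hypothetical standalone helper: raises ValueError where the
    server would respond with HTTP 400.
    """
    # Case-insensitive label -> id map, e.g. {"up": 1, "down": 2, ...}.
    reverse_lookup = {label.lower(): idx for idx, label in action_lookup.items()}

    if isinstance(action, int):
        action_id = action
    else:
        candidate = action.strip()
        if candidate.lower() in reverse_lookup:
            action_id = reverse_lookup[candidate.lower()]
        else:
            # Fall back to numeric strings such as "3".
            try:
                action_id = int(candidate)
            except ValueError as exc:
                raise ValueError(f"Unable to parse action: {action!r}") from exc

    # Mirrors the membership check the server performs before env.step().
    if action_id not in action_lookup:
        raise ValueError(f"Invalid action identifier: {action!r}")
    return action_id
```

For example, `parse_action(" Right ")` and `parse_action("4")` both resolve to the same id, while `parse_action("jump")` and `parse_action(9)` raise `ValueError`.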
diff --git a/resources_servers/grl_sokoban/configs/grl_sokoban.yaml b/resources_servers/grl_sokoban/configs/grl_sokoban.yaml
@@ -0,0 +1,28 @@
+grl_sokoban_resources_server:
+  resources_servers:
+    grl_sokoban:
+      entrypoint: app.py
+      domain: games
+      verified: true
+grl_sokoban_simple_agent:
+  responses_api_agents:
+    simple_agent:
+      entrypoint: app.py
+      max_steps: 10
+      count_tool_calls: true
+      resources_server:
+        type: resources_servers
+        name: grl_sokoban_resources_server
+      model_server:
+        type: responses_api_models
+        name: policy_model
+      datasets:
+      - name: example
+        type: example
+        jsonl_fpath: resources_servers/grl_sokoban/data/example.jsonl
+        num_repeats: 1
+        gitlab_identifier:
+          dataset_name: grl_sokoban
+          version: 0.0.1
+          artifact_fpath: example.jsonl
+        license: Apache 2.0
diff --git a/resources_servers/grl_sokoban/data/example.jsonl b/resources_servers/grl_sokoban/data/example.jsonl
@@ -0,0 +1,5 @@
+{"level_id": 1, "seed": 84810, "dim_room": [5, 5], "num_boxes": 1, "responses_create_params": {"max_tool_calls": 10, "input": [{"role": "developer", "content": "You are a Sokoban-solving assistant. IMPORTANT: First call the `step` tool with an empty array [] to see the initial puzzle state. Example: step({\"actions\": []}). The tool will return the board as a string with symbols: #=wall, _=floor, O=target, X=box, \u221a=box on target, P=player, S=player on target. Then continue calling `step` with valid actions (Up, Down, Left, Right) until the puzzle is solved or you run out of moves. At the end, respond with <answer>Action1 || Action2 || ...</answer> summarizing all moves you executed."}, {"role": "user", "content": "Call the step tool to see the puzzle, then solve it step by step."}], "tools": [{"name": "step", "type": "function", "description": "Execute Sokoban moves sequentially. Call with empty array [] to see current puzzle state without moving.", "strict": true, "parameters": {"type": "object", "properties": {"actions": {"type": "array", "items": {"type": "string"}, "description": "Sequence of actions e.g. ['Up', 'Right']. Use empty array [] to view current state."}}, "required": ["actions"], "additionalProperties": false}}]}}
+{"level_id": 2, "seed": 98293, "dim_room": [8, 8], "num_boxes": 1, "responses_create_params": {"max_tool_calls": 10, "input": [{"role": "developer", "content": "You are a Sokoban-solving assistant. IMPORTANT: First call the `step` tool with an empty array [] to see the initial puzzle state. Example: step({\"actions\": []}). The tool will return the board as a string with symbols: #=wall, _=floor, O=target, X=box, \u221a=box on target, P=player, S=player on target. Then continue calling `step` with valid actions (Up, Down, Left, Right) until the puzzle is solved or you run out of moves. At the end, respond with <answer>Action1 || Action2 || ...</answer> summarizing all moves you executed."}, {"role": "user", "content": "Call the step tool to see the puzzle, then solve it step by step."}], "tools": [{"name": "step", "type": "function", "description": "Execute Sokoban moves sequentially. Call with empty array [] to see current puzzle state without moving.", "strict": true, "parameters": {"type": "object", "properties": {"actions": {"type": "array", "items": {"type": "string"}, "description": "Sequence of actions e.g. ['Up', 'Right']. Use empty array [] to view current state."}}, "required": ["actions"], "additionalProperties": false}}]}}
+{"level_id": 3, "seed": 30450, "dim_room": [6, 6], "num_boxes": 1, "responses_create_params": {"max_tool_calls": 10, "input": [{"role": "developer", "content": "You are a Sokoban-solving assistant. IMPORTANT: First call the `step` tool with an empty array [] to see the initial puzzle state. Example: step({\"actions\": []}). The tool will return the board as a string with symbols: #=wall, _=floor, O=target, X=box, \u221a=box on target, P=player, S=player on target. Then continue calling `step` with valid actions (Up, Down, Left, Right) until the puzzle is solved or you run out of moves. At the end, respond with <answer>Action1 || Action2 || ...</answer> summarizing all moves you executed."}, {"role": "user", "content": "Call the step tool to see the puzzle, then solve it step by step."}], "tools": [{"name": "step", "type": "function", "description": "Execute Sokoban moves sequentially. Call with empty array [] to see current puzzle state without moving.", "strict": true, "parameters": {"type": "object", "properties": {"actions": {"type": "array", "items": {"type": "string"}, "description": "Sequence of actions e.g. ['Up', 'Right']. Use empty array [] to view current state."}}, "required": ["actions"], "additionalProperties": false}}]}}
+{"level_id": 4, "seed": 89987, "dim_room": [6, 7], "num_boxes": 1, "responses_create_params": {"max_tool_calls": 10, "input": [{"role": "developer", "content": "You are a Sokoban-solving assistant. IMPORTANT: First call the `step` tool with an empty array [] to see the initial puzzle state. Example: step({\"actions\": []}). The tool will return the board as a string with symbols: #=wall, _=floor, O=target, X=box, \u221a=box on target, P=player, S=player on target. Then continue calling `step` with valid actions (Up, Down, Left, Right) until the puzzle is solved or you run out of moves. At the end, respond with <answer>Action1 || Action2 || ...</answer> summarizing all moves you executed."}, {"role": "user", "content": "Call the step tool to see the puzzle, then solve it step by step."}], "tools": [{"name": "step", "type": "function", "description": "Execute Sokoban moves sequentially. Call with empty array [] to see current puzzle state without moving.", "strict": true, "parameters": {"type": "object", "properties": {"actions": {"type": "array", "items": {"type": "string"}, "description": "Sequence of actions e.g. ['Up', 'Right']. Use empty array [] to view current state."}}, "required": ["actions"], "additionalProperties": false}}]}}
+{"level_id": 5, "seed": 78785, "dim_room": [6, 4], "num_boxes": 1, "responses_create_params": {"max_tool_calls": 10, "input": [{"role": "developer", "content": "You are a Sokoban-solving assistant. IMPORTANT: First call the `step` tool with an empty array [] to see the initial puzzle state. Example: step({\"actions\": []}). The tool will return the board as a string with symbols: #=wall, _=floor, O=target, X=box, \u221a=box on target, P=player, S=player on target. Then continue calling `step` with valid actions (Up, Down, Left, Right) until the puzzle is solved or you run out of moves. At the end, respond with <answer>Action1 || Action2 || ...</answer> summarizing all moves you executed."}, {"role": "user", "content": "Call the step tool to see the puzzle, then solve it step by step."}], "tools": [{"name": "step", "type": "function", "description": "Execute Sokoban moves sequentially. Call with empty array [] to see current puzzle state without moving.", "strict": true, "parameters": {"type": "object", "properties": {"actions": {"type": "array", "items": {"type": "string"}, "description": "Sequence of actions e.g. ['Up', 'Right']. Use empty array [] to view current state."}}, "required": ["actions"], "additionalProperties": false}}]}}
diff --git a/resources_servers/grl_sokoban/data/example_metrics.json b/resources_servers/grl_sokoban/data/example_metrics.json
@@ -0,0 +1,8 @@
+{
+    "name": "example",
+    "type": "example",
+    "jsonl_fpath": "resources_servers/grl_sokoban/data/example.jsonl",
+    "gitlab_identifier": null,
+    "license": "Apache 2.0",
+    "Number of examples": 5
+}