Draft

Changes from all commits
100 commits
316d43e
Miscellaneous infra.
pjin-nvidia Nov 13, 2025
dd953b0
Ray utils.
pjin-nvidia Nov 13, 2025
5717d7a
No cover.
pjin-nvidia Nov 13, 2025
4ecd8d3
Remove DEBUG. Comment.
pjin-nvidia Nov 15, 2025
8103dbf
Comment about ray package extra.
pjin-nvidia Nov 15, 2025
dc493d5
The.
pjin-nvidia Nov 15, 2025
f9e5d8f
Merge remote-tracking branch 'origin/main' into pjin/misc-infra
pjin-nvidia Nov 15, 2025
9502d82
Fix test (?).
pjin-nvidia Nov 15, 2025
0475d5e
Initial support for server pyproject.toml (WIP).
pjin-nvidia Nov 15, 2025
d86756b
Fix pyproject.toml check.
pjin-nvidia Nov 15, 2025
79028a6
Working directory Path.
pjin-nvidia Nov 15, 2025
7e62b1d
Install a server venv from pyproject.toml if available.
pjin-nvidia Nov 15, 2025
36efb94
Deprecated vllm_model requirements.txt.
pjin-nvidia Nov 15, 2025
8d49b95
Consistently use dashes in package names.
pjin-nvidia Nov 15, 2025
6fb0a95
Lint.
pjin-nvidia Nov 15, 2025
7231efa
Cleanup.
pjin-nvidia Nov 15, 2025
8fc0d9d
VLLM server spinup.
pjin-nvidia Nov 15, 2025
8975e98
VLLM server host and port.
pjin-nvidia Nov 15, 2025
51ba6fc
Allocate the free port for VLLM in the model server process.
pjin-nvidia Nov 16, 2025
aa97796
Type.
pjin-nvidia Nov 16, 2025
6ec9325
Fix for pyproject.toml (this works lol).
pjin-nvidia Nov 16, 2025
33ec3f9
VLLM server "routing" (just re-using the existing multiple clients).
pjin-nvidia Nov 16, 2025
77cda85
Better order.
pjin-nvidia Nov 16, 2025
44dcee1
Merge remote-tracking branch 'origin/main' into pjin/ray-utils
pjin-nvidia Nov 16, 2025
a85f4f0
WIP.
pjin-nvidia Nov 16, 2025
7201c8f
Comment.
pjin-nvidia Nov 16, 2025
834d9b9
Default to "mp" backend.
pjin-nvidia Nov 16, 2025
5ee8b57
Cleanup.
pjin-nvidia Nov 16, 2025
10b5295
Cleanup.
pjin-nvidia Nov 16, 2025
e4c5573
Non-async VLLM server heartbeat to avoid early asyncio event loop.
pjin-nvidia Nov 16, 2025
0a8da20
With pyproject.toml, no pre-install command needed.
pjin-nvidia Nov 16, 2025
85a09fe
Ray GPU node-related global config keys. Simplified spinup (WIP).
pjin-nvidia Nov 16, 2025
ad0e2fc
Improved server venv pyproject install that does not use editable.
pjin-nvidia Nov 17, 2025
5c1fe99
Querying ray state to find nodes with available and unused GPUs.
pjin-nvidia Nov 17, 2025
f32957e
Only use explicitly reserved ray GPU nodes if specified.
pjin-nvidia Nov 17, 2025
ef77c4c
Comment. Cleanup.
pjin-nvidia Nov 17, 2025
bbf4631
Cleanup.
pjin-nvidia Nov 17, 2025
531a61d
Type.
pjin-nvidia Nov 17, 2025
f88ec6a
No cover.
pjin-nvidia Nov 17, 2025
d819740
Type.
pjin-nvidia Nov 17, 2025
7640773
Rename reserved => allowed.
pjin-nvidia Nov 17, 2025
0436b47
Packaging and setup.
pjin-nvidia Nov 17, 2025
70670a2
Rename.
pjin-nvidia Nov 17, 2025
e61253c
VLLMModel local spinup (originally from PR #317).
pjin-nvidia Nov 17, 2025
854609f
Revert VLLMModel changes (moving to PR #318).
pjin-nvidia Nov 17, 2025
dc6ffef
One line uv pip install.
pjin-nvidia Nov 18, 2025
8758142
ruff
kbhardwaj-nvidia Nov 18, 2025
5eb9817
update app.py
kbhardwaj-nvidia Nov 19, 2025
9971a03
change name
kbhardwaj-nvidia Nov 19, 2025
56b9bfa
VLLM spinup in a Ray worker.
pjin-nvidia Nov 20, 2025
cb3a21d
Merge remote-tracking branch 'origin/pjin/misc-infra' into pjin/hross…
pjin-nvidia Nov 20, 2025
1951071
Merge remote-tracking branch 'origin/pjin/ray-utils' into pjin/hross/…
pjin-nvidia Nov 20, 2025
e23d73f
debugging
kbhardwaj-nvidia Nov 20, 2025
e8afd2d
Print the names of servers yet to have finished spinning up.
pjin-nvidia Nov 20, 2025
0142784
Formatting.
pjin-nvidia Nov 20, 2025
04a97dd
Import.
pjin-nvidia Nov 20, 2025
70ac196
Do not count resources of ray actors in 'DEAD' state (these resources…
pjin-nvidia Nov 20, 2025
fde1dc2
cleanup app, readme
kbhardwaj-nvidia Nov 20, 2025
bd4a420
Debug WIP.
pjin-nvidia Nov 20, 2025
5d8caba
add tests
kbhardwaj-nvidia Nov 20, 2025
d6ae991
end newline
kbhardwaj-nvidia Nov 20, 2025
ef7e6d2
Fixes.
pjin-nvidia Nov 21, 2025
7db6c1c
Debug.
pjin-nvidia Nov 24, 2025
8aae1d5
Merge remote-tracking branch 'origin/khushi/format' into pjin/hross/m…
pjin-nvidia Nov 24, 2025
afd9ee7
Fixes.
pjin-nvidia Nov 26, 2025
3e5c924
Support for specifying non-anonymous Ray namespace.
pjin-nvidia Nov 26, 2025
8bdcec0
Fix for starting nested Ray actors.
pjin-nvidia Nov 26, 2025
a0c0d19
Merge remote-tracking branch 'origin/main' into pjin/misc-infra
pjin-nvidia Nov 26, 2025
17f640f
Merge remote-tracking branch 'origin/main' into pjin/ray-utils
pjin-nvidia Nov 26, 2025
3f914dc
Default max_steps = 1.
pjin-nvidia Nov 27, 2025
0a94c2d
Merge remote-tracking branch 'origin/main' into pjin/misc-infra
pjin-nvidia Dec 1, 2025
d4b8074
Merge remote-tracking branch 'origin/main' into pjin/ray-utils
pjin-nvidia Dec 1, 2025
8fe389f
Matching the misc infra PR.
pjin-nvidia Dec 1, 2025
613efb4
No cover.
pjin-nvidia Dec 1, 2025
7575eb6
Global scheduling helper to track free GPUs of schedulable ray nodes.
pjin-nvidia Dec 2, 2025
d7e1683
Rename.
pjin-nvidia Dec 2, 2025
f7c1937
Print.
pjin-nvidia Dec 2, 2025
2d37d17
Avoid an unnecessary ray import.
pjin-nvidia Dec 2, 2025
a35f58d
Try to pass the linter.
pjin-nvidia Dec 2, 2025
1b53089
Test.
pjin-nvidia Dec 2, 2025
6327760
Tests.
pjin-nvidia Dec 2, 2025
f5466f9
Fix test.
pjin-nvidia Dec 2, 2025
7a7e952
Fix test.
pjin-nvidia Dec 2, 2025
eab68a0
Unfix test.
pjin-nvidia Dec 2, 2025
66b788d
Revert to just cd into working dir.
pjin-nvidia Dec 2, 2025
a78f226
Deduplicate.
pjin-nvidia Dec 2, 2025
fdb54fe
Also add explicit check for requirements.txt.
pjin-nvidia Dec 2, 2025
3fb2911
Revert format.
pjin-nvidia Dec 2, 2025
d62ab6c
VLLMModel refresh.
pjin-nvidia Dec 2, 2025
243ba60
Merge remote-tracking branch 'origin/pjin/misc-infra' into pjin/hross…
pjin-nvidia Dec 2, 2025
7809170
Add vllm_model pyproject.toml (depends on PR #317).
pjin-nvidia Dec 3, 2025
156f039
Unpin vllm version.
pjin-nvidia Dec 3, 2025
21ba79e
Consolidated ray actor env vars setup.
pjin-nvidia Dec 3, 2025
8c70fa1
Merge remote-tracking branch 'origin/pjin/ray-utils' into pjin/hross/…
pjin-nvidia Dec 3, 2025
aa34b0d
Resources server readme.
pjin-nvidia Dec 3, 2025
17171d7
Cleanup.
pjin-nvidia Dec 3, 2025
be30cad
Fixes for vllm_model server spinup.
pjin-nvidia Dec 4, 2025
6f731a3
Add translation_llm_judge.
pjin-nvidia Dec 4, 2025
372b4a4
Spinup of model worker only needs 1 GPU.
pjin-nvidia Dec 4, 2025
295e4c5
Fix requirements.txt.
pjin-nvidia Dec 5, 2025
111 changes: 96 additions & 15 deletions nemo_gym/cli.py
@@ -21,14 +21,14 @@
 import tomllib
 from glob import glob
 from importlib.metadata import version as md_version
-from os import environ, makedirs
+from os import environ, getcwd, makedirs
 from os.path import exists
 from pathlib import Path
 from signal import SIGINT
 from subprocess import Popen
 from threading import Thread
 from time import sleep
-from typing import Dict, List, Optional
+from typing import Any, Dict, List, Optional, Tuple
 
 import psutil
 import rich
@@ -49,6 +49,9 @@
     GlobalConfigDictParserConfig,
     get_global_config_dict,
 )
+from nemo_gym.ray_utils import (
+    _start_global_ray_gpu_scheduling_helper,
+)
 from nemo_gym.server_utils import (
     HEAD_SERVER_KEY_NAME,
     HeadServer,
@@ -59,21 +62,62 @@
 
 
 def _setup_env_command(dir_path: Path, global_config_dict: DictConfig) -> str:  # pragma: no cover
-    install_cmd = "uv pip install -r requirements.txt"
     head_server_deps = global_config_dict[HEAD_SERVER_DEPS_KEY_NAME]
-    install_cmd += " " + " ".join(head_server_deps)
 
-    return f"""cd {dir_path} \\
-&& uv venv --allow-existing --python {global_config_dict[PYTHON_VERSION_KEY_NAME]} \\
+    uv_venv_cmd = f"uv venv --seed --allow-existing --python {global_config_dict[PYTHON_VERSION_KEY_NAME]} .venv"
+
+    pyproject_toml = False
+    requirements_txt = False
+    try:
+        with open(f"{dir_path / 'pyproject.toml'}", "r") as _f:
+            pyproject_toml = True
+    except OSError:
+        pass
+    try:
+        with open(f"{dir_path / 'requirements.txt'}", "r") as _f:
+            requirements_txt = True
+    except OSError:
+        pass
+
+    if pyproject_toml:
+        install_cmd = f"""uv pip install '-e .' {" ".join(head_server_deps)}"""
+    elif requirements_txt:
+        install_cmd = f"""uv pip install -r requirements.txt {" ".join(head_server_deps)}"""
+    else:
+        raise RuntimeError(f"Missing pyproject.toml or requirements.txt for uv venv setup in server dir: {dir_path}")
+
+    cmd = f"""cd {dir_path} \\
+&& {uv_venv_cmd} > {dir_path}/venv.out.log 2> {dir_path}/venv.err.log \\
 && source .venv/bin/activate \\
 && {install_cmd} \\
-"""
+"""
 
+    print(f"DEBUG: _setup_env_command: cmd = {cmd}", flush=True)
 
-def _run_command(command: str, working_directory: Path) -> Popen:  # pragma: no cover
+    return cmd
+
+
+def _run_command(command: str, working_dir_path: Path, name: Optional[str] = None) -> Popen:  # pragma: no cover
+    work_dir = f"{working_dir_path.absolute()}"
+    print(f"DEBUG: _run_command: cwd = {getcwd()}", flush=True)
+    print(f"DEBUG: _run_command: work dir = {work_dir}", flush=True)
     custom_env = environ.copy()
-    custom_env["PYTHONPATH"] = f"{working_directory.absolute()}:{custom_env.get('PYTHONPATH', '')}"
-    return Popen(command, executable="/bin/bash", shell=True, env=custom_env)
+    py_path = custom_env.get("PYTHONPATH", None)
+    if py_path is not None:
+        custom_env["PYTHONPATH"] = f"{work_dir}:{py_path}"
+    else:
+        custom_env["PYTHONPATH"] = work_dir
+    if name is not None:
+        out_log_file = open(f"{work_dir}/run-{name}.out.log", "a")
+        err_log_file = open(f"{work_dir}/run-{name}.err.log", "a")
+    return Popen(
+        command,
+        executable="/bin/bash",
+        shell=True,
+        env=custom_env,
+        stdout=out_log_file,
+        stderr=err_log_file,
+    )
 
 
 class RunConfig(BaseNeMoGymCLIConfig):
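
For illustration, this is roughly the command string the new _setup_env_command assembles for a server directory that ships a pyproject.toml. The directory path, Python version, and pinned dependency versions below are hypothetical placeholders, not values taken from this PR:

# Hypothetical rendering of the returned command (all paths and versions
# are illustrative placeholders):
example_cmd = """cd /workspace/servers/example_server \\
&& uv venv --seed --allow-existing --python 3.12 .venv > /workspace/servers/example_server/venv.out.log 2> /workspace/servers/example_server/venv.err.log \\
&& source .venv/bin/activate \\
&& uv pip install '-e .' ray[default]==2.49.0 openai==1.99.0 \\
"""
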
@@ -119,18 +163,23 @@ class ServerInstanceDisplayConfig(BaseModel):
 class RunHelper:  # pragma: no cover
     _head_server: uvicorn.Server
     _head_server_thread: Thread
+    _head_ray_gpu_helper: Any
 
     _processes: Dict[str, Popen]
     _server_instance_display_configs: List[ServerInstanceDisplayConfig]
     _server_client: ServerClient
 
     def start(self, global_config_dict_parser_config: GlobalConfigDictParserConfig) -> None:
+        print(f"DEBUG: RunHelper.start: ...", flush=True)
+
         global_config_dict = get_global_config_dict(global_config_dict_parser_config=global_config_dict_parser_config)
 
         # Initialize Ray cluster in the main process
         # Note: This function will modify the global config dict - update `ray_head_node_address`
         initialize_ray()
 
+        self._head_ray_gpu_helper = _start_global_ray_gpu_scheduling_helper()
+
         # Assume Nemo Gym Run is for a single agent.
         escaped_config_dict_yaml_str = shlex.quote(OmegaConf.to_yaml(global_config_dict))
 
@@ -166,12 +215,25 @@ def start(self, global_config_dict_parser_config: GlobalConfigDictParserConfig)
 
             dir_path = PARENT_DIR / Path(first_key, second_key)
 
+            print(f"DEBUG: RunHelper: 1st key = {first_key}", flush=True)
+            print(f"DEBUG: RunHelper: 2nd key = {second_key}", flush=True)
+            print(f"DEBUG: RunHelper: dir path = {dir_path}", flush=True)
+            if (
+                f"{dir_path}".endswith("/bin/python") or
+                f"{dir_path}".endswith("/bin/python3")
+            ):
+                dir_path = dir_path.parent
+                dir_path = dir_path.parent
+                print(f"DEBUG: RunHelper: dir path = {dir_path} (rewrite)", flush=True)
+
+            print(f"DEBUG: RunHelper: entry = {str(entrypoint_fpath)}", flush=True)
+
             command = f"""{_setup_env_command(dir_path, global_config_dict)} \\
 && {NEMO_GYM_CONFIG_DICT_ENV_VAR_NAME}={escaped_config_dict_yaml_str} \\
 {NEMO_GYM_CONFIG_PATH_ENV_VAR_NAME}={shlex.quote(top_level_path)} \\
 python {str(entrypoint_fpath)}"""
 
-            process = _run_command(command, dir_path)
+            process = _run_command(command, dir_path, top_level_path)
             self._processes[top_level_path] = process
 
             host = server_config_dict.get("host")
@@ -233,6 +295,18 @@ def poll(self) -> None:
 
         for process_name, process in self._processes.items():
             if process.poll() is not None:
+                proc_out, proc_err = process.communicate()
+                print(f"Process `{process_name}` finished unexpectedly!")
+                print(f"Process `{process_name}` stdout:", flush=True)
+                if isinstance(proc_out, bytes):
+                    print(proc_out.decode("utf-8"), flush=True)
+                else:
+                    print(proc_out, flush=True)
+                print(f"Process `{process_name}` stderr:", flush=True)
+                if isinstance(proc_err, bytes):
+                    print(proc_err.decode("utf-8"), flush=True)
+                else:
+                    print(proc_err, flush=True)
                 raise RuntimeError(f"Process `{process_name}` finished unexpectedly!")
 
     def wait_for_spinup(self) -> None:
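
A minimal sketch of the bytes-vs-str guard above: Popen streams are bytes unless the process was opened in text mode, so poll() decodes defensively. Note that when stdout/stderr are redirected to log files (as _run_command now does for named processes), communicate() returns None for those streams, which the else branch simply prints. The value below is a toy example:

# Toy illustration of the isinstance guard in poll():
proc_out = b"server listening on port 8000\n"  # bytes when text mode is off
text = proc_out.decode("utf-8") if isinstance(proc_out, bytes) else proc_out
print(text, flush=True)
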
@@ -243,11 +317,18 @@
             self.poll()
             statuses = self.check_http_server_statuses()
 
-            num_spun_up = statuses.count("success")
+            num_spun_up = 0
+            waiting = []
+            for name, status in statuses:
+                if status == "success":
+                    num_spun_up += 1
+                else:
+                    waiting.append(name)
             if len(statuses) != num_spun_up:
                 print(
                     f"""{num_spun_up} / {len(statuses)} servers ready ({statuses.count("timeout")} timed out, {statuses.count("connection_error")} connection errored, {statuses.count("unknown_error")} had unknown errors).
-Waiting for servers to spin up. Sleeping {sleep_interval}s..."""
+Waiting for servers to spin up: {waiting}
+Sleeping {sleep_interval}s..."""
                 )
             else:
                 print(f"All {num_spun_up} / {len(statuses)} servers ready! Polling every 60s")
@@ -289,15 +370,15 @@ async def sleep():
         finally:
             self.shutdown()
 
-    def check_http_server_statuses(self) -> List[ServerStatus]:
+    def check_http_server_statuses(self) -> List[Tuple[str, ServerStatus]]:
         print(
             "Checking for HTTP server statuses (you should see some HTTP requests to `/` that may 404. This is expected.)"
         )
         statuses = []
         for server_instance_display_config in self._server_instance_display_configs:
             name = server_instance_display_config.config_path
             status = self._server_client.poll_for_status(name)
-            statuses.append(status)
+            statuses.append((name, status))
 
         return statuses
 
3 changes: 2 additions & 1 deletion nemo_gym/config_types.py
@@ -196,8 +196,8 @@ class DatasetConfig(BaseModel):
             Literal["MIT"],
             Literal["Creative Commons Attribution 4.0 International"],
             Literal["Creative Commons Attribution-ShareAlike 4.0 International"],
+            Literal["NVIDIA Internal Use Only, Do Not Distribute"],
             Literal["TBD"],
-            Literal["MIT"],
         ]
     ] = None
 
@@ -224,6 +224,7 @@ class Domain(str, Enum):
     LONG_CONTEXT = "long_context"
     SAFETY = "safety"
     GAMES = "games"
+    TRANSLATION = "translation"
     E2E = "e2e"
     OTHER = "other"
 
7 changes: 6 additions & 1 deletion nemo_gym/global_config.py
@@ -45,6 +45,8 @@
 DISALLOWED_PORTS_KEY_NAME = "disallowed_ports"
 HEAD_SERVER_DEPS_KEY_NAME = "head_server_deps"
 PYTHON_VERSION_KEY_NAME = "python_version"
+RAY_GPU_NODES_KEY_NAME = "ray_gpu_nodes"
+RAY_NUM_GPUS_PER_NODE_KEY_NAME = "ray_num_gpus_per_node"
 USE_ABSOLUTE_IP = "use_absolute_ip"
 NEMO_GYM_RESERVED_TOP_LEVEL_KEYS = [
     CONFIG_PATHS_KEY_NAME,
@@ -54,6 +56,8 @@
     DISALLOWED_PORTS_KEY_NAME,
     HEAD_SERVER_DEPS_KEY_NAME,
     PYTHON_VERSION_KEY_NAME,
+    RAY_GPU_NODES_KEY_NAME,
+    RAY_NUM_GPUS_PER_NODE_KEY_NAME,
     USE_ABSOLUTE_IP,
 ]
 
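A hypothetical sketch of how the two new reserved keys could appear in a global config dict; the node addresses and GPU count are invented, and the exact schema is defined by the Ray utilities in this PR rather than here:

# Invented example values for the new reserved keys (schema assumed):
global_config_fragment = {
    "ray_gpu_nodes": ["10.0.0.4", "10.0.0.5"],  # explicitly allowed GPU nodes
    "ray_num_gpus_per_node": 8,                 # GPUs expected on each node
}
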
@@ -261,7 +265,8 @@ def parse(self, parse_config: Optional[GlobalConfigDictParserConfig] = None) ->
         # Constrain sensitive package versions
         global_config_dict[HEAD_SERVER_DEPS_KEY_NAME] = [
             # The ray version is very sensitive. The children ray versions must exactly match those of the parent ray.
-            f"ray=={ray_version}",
+            # The ray extra [default] should also exactly match the extra in the top-level Gym pyproject.toml.
+            f"ray[default]=={ray_version}",
             # OpenAI version is also sensitive since it changes so often and may introduce subtle incompatibilities.
             f"openai=={openai_version}",
         ]
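
A sketch of how these pins are derived, assuming the surrounding parse() resolves versions from the parent venv via importlib.metadata (the actual values are whatever is installed at runtime):

# Sketch: pinning children to the parent's exact ray/openai versions.
from importlib.metadata import version as md_version

ray_version = md_version("ray")
openai_version = md_version("openai")
head_server_deps = [
    f"ray[default]=={ray_version}",  # children must match the parent ray exactly
    f"openai=={openai_version}",
]
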
3 changes: 2 additions & 1 deletion nemo_gym/openai_utils.py
@@ -467,7 +467,8 @@ async def _request(self, **request_kwargs: Dict) -> ClientResponse:
 
     async def _raise_for_status(self, response: ClientResponse, request_kwargs: Dict[str, Any]) -> None:
         if not response.ok:
-            print(f"Request kwargs: {json.dumps(request_kwargs)}")
+            print(f"Response status: {response.status}", flush=True)
+            # print(f"Request kwargs: {json.dumps(request_kwargs)}")
 
             await raise_for_status(response)
 