Commit a0cbc66

feat: add embedded mode with auto-init runtime
1 parent 0a23df5 commit a0cbc66

18 files changed

Lines changed: 1763 additions & 8 deletions

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
# Embedded Mode

## Background

TuFT is designed to serve as a **transparent compute service layer** for RL training frameworks like Trinity and veRL. In production, TuFT typically runs as a standalone daemon (`tuft launch`), and users must:

1. Write a `tuft_config.yaml` configuration file
2. Manually start the server with `tuft launch --config ...`
3. Set the `TINKER_BASE_URL` environment variable for clients to connect

This manual setup creates friction, especially for:
- **RL framework users** who just want to run training scripts without learning TuFT internals
- **Development/debugging** workflows where quick iteration is key
- **CI pipelines** that need reproducible, self-contained environments

**Embedded mode** solves this by providing a `tuft.init()` API, similar to `ray.init()`, that handles service discovery, configuration generation, startup, and connection automatically.

## Two Modes of Operation

| | Daemon Mode | Embedded Mode |
|---|---|---|
| How to start | `tuft launch --config ...` | `tuft.init(model=...)` |
| Lifecycle | Independent process, manually managed | Follows main process, auto-cleanup via `atexit` |
| Best for | Production deployments, multi-user shared clusters | Dev/debug, training scripts, CI |
| Service discovery | User sets `TINKER_BASE_URL` manually | Automatic (env var → address file → process scan → default port) |

**Both modes coexist**: `tuft.init()` first tries to discover an existing daemon. Only when no running service is found does it start an embedded instance.

## Quick Start

```python
import tuft

# Initialize TuFT: auto-discovers an existing service or starts one
tuft.init(model="/path/to/Qwen2.5-0.5B-Instruct")

# Create a training client
training_client = tuft.create_training_client(
    base_model="Qwen2.5-0.5B-Instruct",
    rank=8,
)
# ... your training loop ...

# Optional: explicit shutdown (atexit handles this automatically)
tuft.shutdown()
```

### Other `init()` patterns

```python
# Connect to a specific running server
tuft.init(address="http://gpu-cluster:10610")

# Use an existing config file
tuft.init(config="/path/to/tuft_config.yaml")

# No arguments: relies on env vars or the default config file
tuft.init()

# Get a service client (auto-inits if not already done)
service_client = tuft.get_service_client()
```

## Service Discovery Priority

When `tuft.init()` is called, it tries to find an existing service in this order:

1. `address=...` argument passed to `init()`
2. `TUFT_ADDRESS` environment variable
3. Address file at `~/.tuft/tuft_current_server`
4. Process scan (looks for running `tuft launch` or `uvicorn` processes)
5. Default port probe: `http://127.0.0.1:10610`
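
The discovery order above amounts to a first-match search over candidate addresses. The sketch below illustrates that idea only; it is not TuFT's implementation. The names `discover_address` and `is_reachable` and the returned `(source, address)` tuple are hypothetical, the reachability check is assumed to be a plain TCP connect, and the process scan (step 4) is left out because its details are not described here.

```python
import os
import socket
from pathlib import Path
from urllib.parse import urlparse

DEFAULT_ADDRESS = "http://127.0.0.1:10610"  # default port from the list above


def is_reachable(address, timeout=0.2):
    """Return True if a TCP connection to the address's host and port succeeds."""
    parsed = urlparse(address)
    if not parsed.hostname or not parsed.port:
        return False
    try:
        with socket.create_connection((parsed.hostname, parsed.port), timeout=timeout):
            return True
    except OSError:
        return False


def discover_address(address=None, env=None, address_file=None, probe=is_reachable):
    """Hypothetical helper: return (source, address) for the first reachable candidate."""
    env = os.environ if env is None else env
    path = Path(address_file) if address_file else Path.home() / ".tuft" / "tuft_current_server"
    file_address = path.read_text(encoding="utf-8").strip() if path.is_file() else None

    candidates = [
        ("argument", address),
        ("TUFT_ADDRESS", env.get("TUFT_ADDRESS")),
        ("address file", file_address),
        # Step 4 (process scan) is intentionally omitted from this sketch.
        ("default port", DEFAULT_ADDRESS),
    ]
    for source, candidate in candidates:
        if candidate and probe(candidate):
            return source, candidate
    return None  # nothing found: the caller would start an embedded instance
```

Passing `probe` as a parameter keeps the ordering logic testable without a running server; whether TuFT probes an explicitly passed `address=` or uses it unconditionally is not stated above, so the sketch treats every candidate the same way.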

If no service is found, embedded mode starts a new one using configuration from:

1. `config=...` argument passed to `init()`
2. `TUFT_CONFIG` environment variable
3. `model=...` argument → auto-generates minimal config
4. `TUFT_MODEL_PATH` environment variable → auto-generates minimal config
5. Default config file: `~/.tuft/configs/tuft_config.yaml`
6. None available → raises `RuntimeError` with helpful guidance
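
The configuration fallback chain above can be read as a sequence of early returns. The following sketch shows that reading under stated assumptions: `resolve_config_source` and its `(kind, value)` return shape are hypothetical and only mirror the documented order; they are not part of the TuFT API.

```python
import os
from pathlib import Path


def resolve_config_source(config=None, model=None, env=None, default_config=None):
    """Hypothetical helper: pick the configuration source in the documented order."""
    env = os.environ if env is None else env
    if default_config is None:
        default_config = Path.home() / ".tuft" / "configs" / "tuft_config.yaml"

    if config:                              # 1. explicit config= argument
        return ("config_file", str(config))
    if env.get("TUFT_CONFIG"):              # 2. TUFT_CONFIG environment variable
        return ("config_file", env["TUFT_CONFIG"])
    if model:                               # 3. model= argument, config is generated
        return ("generated", str(model))
    if env.get("TUFT_MODEL_PATH"):          # 4. TUFT_MODEL_PATH, config is generated
        return ("generated", env["TUFT_MODEL_PATH"])
    if Path(default_config).is_file():      # 5. default config file, if present
        return ("config_file", str(default_config))
    raise RuntimeError(                     # 6. nothing available
        "No TuFT configuration found: pass config=... or model=..., "
        "or set TUFT_CONFIG or TUFT_MODEL_PATH."
    )
```

Note the consequence of this order: a `TUFT_CONFIG` value in the environment takes precedence over a `model=` argument passed in code.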

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `TUFT_ADDRESS` | Address of running TuFT service | (none) |
| `TUFT_API_KEY` | API authentication key | Auto-generated |
| `TUFT_CONFIG` | Path to configuration file | (none) |
| `TUFT_MODEL_PATH` | Model path for auto-config generation | (none) |
| `TUFT_ENABLE_AUTO_CONNECT` | Enable auto-connect in `get_service_client()` | `"1"` |
| `TUFT_HOME` | TuFT home directory | `~/.tuft` |
| `TUFT_HOST` | Server bind address | `127.0.0.1` |
| `TUFT_PORT` | Server bind port | `10610` |
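
As a rough illustration of how the defaults in this table combine, the sketch below reads the variables into one settings object. `EmbeddedSettings` and `load_settings` are hypothetical names, `TUFT_API_KEY` is skipped because its auto-generation scheme is not described, and treating only the string `"1"` as "enabled" is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EmbeddedSettings:
    """Hypothetical container for the environment-driven settings."""
    address: Optional[str]
    config: Optional[str]
    model_path: Optional[str]
    auto_connect: bool
    home: str
    host: str
    port: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> EmbeddedSettings:
    """Read the TUFT_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return EmbeddedSettings(
        address=env.get("TUFT_ADDRESS"),
        config=env.get("TUFT_CONFIG"),
        model_path=env.get("TUFT_MODEL_PATH"),
        # Assumption: only the exact string "1" enables auto-connect.
        auto_connect=env.get("TUFT_ENABLE_AUTO_CONNECT", "1") == "1",
        home=os.path.expanduser(env.get("TUFT_HOME", "~/.tuft")),
        host=env.get("TUFT_HOST", "127.0.0.1"),
        port=int(env.get("TUFT_PORT", "10610")),
    )
```

With an empty environment this yields host `127.0.0.1`, port `10610`, and auto-connect enabled, matching the table.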

## Lifecycle

- **Embedded services** are tied to the main process. When the Python process exits (normally or via a signal that Python handles), the embedded TuFT server is automatically terminated via `atexit`. Python does not run `atexit` handlers when the process is killed by an unhandled signal such as `SIGKILL`, so in that case the server may need to be stopped manually.
- **Daemon services** (`tuft launch`) are independent and persist until manually stopped.
- `tuft.shutdown()` can be called explicitly to stop an embedded service early.
- `tuft.init()` is **idempotent**: calling it multiple times is safe (no-op after first success).

## Integration with RL Frameworks

For framework integrations (e.g., Trinity), the pattern is:

```python
import tuft

# In your framework's initialization code:
tuft.init(model=model_path, ignore_reinit_error=True)
service_client = tuft.get_service_client()

# Use service_client as before...
```

This requires no changes to the user's workflow: the framework handles TuFT setup transparently.

docs/sphinx_doc/source/getting-started/index.md

Lines changed: 9 additions & 0 deletions
@@ -20,6 +20,14 @@ Install TuFT from source, PyPI, or Docker.
 
 Run your first training and sampling example with TuFT.
 :::
+
+:::{grid-item-card} Embedded Mode
+:link: embedded-mode
+:link-type: doc
+:shadow: none
+
+Use `tuft.init()` for automatic service discovery and startup.
+:::
 ```
 
 ```{toctree}
@@ -28,4 +36,5 @@ Run your first training and sampling example with TuFT.
 
 installation
 quickstart
+embedded-mode
 ```
Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
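# Embedded Mode

## Background

TuFT is designed as a **transparent compute service layer** for RL training frameworks such as Trinity. In production, TuFT usually runs as a standalone daemon (`tuft launch`), and users need to:

1. Write a `tuft_config.yaml` configuration file
2. Start the service manually with `tuft launch --config ...`
3. Set the `TINKER_BASE_URL` environment variable so that clients can connect

This manual setup adds overhead, especially for:
- **RL framework users**, who only want to run training scripts and not learn how to install and configure TuFT
- **Development and debugging**, where workflows depend on fast iteration
- **CI pipelines**, which need reproducible, self-contained environments

**Embedded mode** addresses this with a `tuft.init()` API, similar to `ray.init()`, that automatically handles service discovery, configuration generation, startup, and connection.
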
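## Two Modes of Operation

| | Daemon Mode | Embedded Mode |
|---|---|---|
| How to start | `tuft launch --config ...` | `tuft.init(model=...)` |
| Lifecycle | Independent process, manually managed | Follows the main process, cleaned up automatically via `atexit` |
| Best for | Production deployments, multi-user shared clusters | Development and debugging, training scripts, CI |
| Service discovery | User sets `TINKER_BASE_URL` manually | Automatic (environment variable → address file → process scan → default port) |

**Both modes coexist**: `tuft.init()` first tries to discover an existing daemon service. It starts an embedded instance only when no running service is found.
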
## Quick Start

```python
import tuft

# Initialize TuFT: auto-discovers an existing service or starts a new one
tuft.init(model="/path/to/Qwen2.5-0.5B-Instruct")

# Create a training client
training_client = tuft.create_training_client(
    base_model="Qwen2.5-0.5B-Instruct",
    rank=8,
)
# ... your training loop ...

# Optional: explicit shutdown (atexit handles this automatically)
tuft.shutdown()
```

### Other `init()` patterns

```python
# Connect to a specific running service
tuft.init(address="http://gpu-cluster:10610")

# Use an existing config file
tuft.init(config="/path/to/tuft_config.yaml")

# No arguments: relies on environment variables or the default config file
tuft.init()

# Get a service client (triggers init automatically if not yet initialized)
service_client = tuft.get_service_client()
```

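## Service Discovery Priority

When `tuft.init()` is called, it tries to discover an existing service in this order:

1. `address=...` argument passed explicitly
2. `TUFT_ADDRESS` environment variable
3. Address file `~/.tuft/tuft_current_server`
4. Process scan (looks for running `tuft launch` or `uvicorn` processes)
5. Default port probe: `http://127.0.0.1:10610`

If no service is found, embedded mode obtains its configuration in the following priority order and starts a new service:

1. `config=...` argument passed explicitly
2. `TUFT_CONFIG` environment variable
3. `model=...` argument → auto-generates a minimal config
4. `TUFT_MODEL_PATH` environment variable → auto-generates a minimal config
5. Default config file: `~/.tuft/configs/tuft_config.yaml`
6. None of the above → raises `RuntimeError` with guidance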

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `TUFT_ADDRESS` | TuFT service address | (none) |
| `TUFT_API_KEY` | API authentication key | Auto-generated |
| `TUFT_CONFIG` | Path to the configuration file | (none) |
| `TUFT_MODEL_PATH` | Model path (used for auto-generating the config) | (none) |
| `TUFT_ENABLE_AUTO_CONNECT` | Enable auto-connect in `get_service_client()` | `"1"` |
| `TUFT_HOME` | TuFT home directory | `~/.tuft` |
| `TUFT_HOST` | Server bind address | `127.0.0.1` |
| `TUFT_PORT` | Server bind port | `10610` |
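
## Lifecycle

- **Embedded services** are tied to the main process. When the Python process exits (normally or via a signal that Python handles), the embedded TuFT service is terminated automatically via `atexit`. Python does not run `atexit` handlers when the process is killed by an unhandled signal such as `SIGKILL`, so in that case the service may need to be stopped manually.
- **Daemon services** (`tuft launch`) run independently and persist until stopped manually.
- `tuft.shutdown()` can be called explicitly to stop an embedded service early.
- `tuft.init()` is **idempotent**: calling it multiple times is safe (a no-op after the first success).

## Integration with RL Frameworks

The pattern for framework integrations (e.g., Trinity) is:

```python
import tuft

# In the framework's initialization code:
tuft.init(model=model_path, ignore_reinit_error=True)
service_client = tuft.get_service_client()

# Use service_client as before...
```

This requires no change to the user's workflow: the framework handles TuFT configuration and startup transparently.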

docs/sphinx_doc/source_zh/getting-started/index.md

Lines changed: 9 additions & 0 deletions
@@ -20,6 +20,14 @@
 
 Run your first training and inference example with TuFT.
 :::
+
+:::{grid-item-card} Embedded Mode
+:link: embedded-mode
+:link-type: doc
+:shadow: none
+
+Use `tuft.init()` for automatic service discovery and startup.
+:::
 ```
 
 ```{toctree}
@@ -28,4 +36,5 @@
 
 installation
 quickstart
+embedded-mode
 ```

examples/chat_sft/train.py

Lines changed: 2 additions & 0 deletions
@@ -112,6 +112,8 @@ def compute_weighted_nll_from_outputs(loss_fn_outputs, datums) -> float:
 
 def connect(cfg: Config) -> tinker.ServiceClient:
     print(f"[1/6] connect service: {cfg.base_url}")
+    # Alternative: use tuft.get_service_client() for auto-discovery/embedded mode
+    # import tuft; return tuft.get_service_client()
     return tinker.ServiceClient(base_url=cfg.base_url, api_key=cfg.api_key)
 
 

examples/countdown_rl/train.py

Lines changed: 2 additions & 0 deletions
@@ -97,6 +97,8 @@ def init_wandb(cfg: Config):
 
 def connect(cfg: Config) -> tinker.ServiceClient:
     print(f"[1/6] connect service: {cfg.base_url}")
+    # Alternative: use tuft.get_service_client() for auto-discovery/embedded mode
+    # import tuft; return tuft.get_service_client()
     return tinker.ServiceClient(base_url=cfg.base_url, api_key=cfg.api_key)
 
 

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
"""Embedded TuFT quickstart: demonstrates auto-init (embedded mode).

This example shows how to use TuFT in embedded mode, where the service
is automatically started and managed within your training script's lifecycle.

No manual `tuft launch` or configuration files needed!

Usage:
    python train.py --model /path/to/Qwen2.5-0.5B-Instruct

The script will:
1. Auto-detect the model and GPU configuration
2. Start a TuFT server in the background
3. Connect and run a minimal training loop
4. Automatically shut down the server on exit
"""

from __future__ import annotations

import argparse

from tinker import types

import tuft


def main():
    parser = argparse.ArgumentParser(description="Embedded TuFT quickstart")
    parser.add_argument(
        "--model",
        type=str,
        required=True,
        help="Path to the base model (e.g., /path/to/Qwen2.5-0.5B-Instruct)",
    )
    parser.add_argument("--rank", type=int, default=8, help="LoRA rank")
    parser.add_argument("--steps", type=int, default=5, help="Number of training steps")
    args = parser.parse_args()

    # =========================================================================
    # Step 1: Initialize TuFT in embedded mode
    # This will auto-detect GPUs, generate a minimal config, and start the server.
    # If a TuFT server is already running, it will connect to it instead.
    # =========================================================================
    print(f"[1/4] Initializing TuFT with model: {args.model}")
    tuft.init(model=args.model)
    # init() may have attached to an existing server rather than starting one,
    # so the message does not claim a specific mode.
    print(" TuFT initialized (embedded, or connected to an existing server)")

    # =========================================================================
    # Step 2: Create a training client
    # =========================================================================
    print(f"[2/4] Creating LoRA training client (rank={args.rank})")
    training_client = tuft.create_training_client(
        base_model=args.model,
        rank=args.rank,
        train_mlp=True,
        train_attn=True,
    )

    # =========================================================================
    # Step 3: Run a minimal training loop
    # =========================================================================
    print(f"[3/4] Running {args.steps} training steps (with fake data)")
    for step in range(args.steps):
        # Create a fake training datum (in practice, use real tokenized data)
        datum = types.Datum(
            model_input=types.ModelInput.from_ints([101, 42, 37, 102]),
            loss_fn_inputs={
                "target_tokens": types.TensorData(
                    data=[101, 99, 73, 102], dtype="int64", shape=[4]
                ),
                "weights": types.TensorData(data=[1.0, 1.0, 1.0, 1.0], dtype="float32", shape=[4]),
            },
        )
        training_client.forward_backward([datum], loss_fn="cross_entropy").result()
        training_client.optim_step(types.AdamParams(learning_rate=1e-4)).result()
        print(f" Step {step + 1}/{args.steps} complete")

    # =========================================================================
    # Step 4: Clean up (optional: atexit handles this automatically)
    # =========================================================================
    print("[4/4] Shutting down TuFT")
    tuft.shutdown()
    print(" Done!")


if __name__ == "__main__":
    main()
