From 106a7b12bfa61168d642738f50611d31c9d6e002 Mon Sep 17 00:00:00 2001 From: qin-ctx Date: Tue, 31 Mar 2026 20:02:55 +0800 Subject: [PATCH] docs(guides): reorganize observability documentation Promote observability to a dedicated guide and keep telemetry as a focused reference. --- docs/en/guides/03-deployment.md | 2 +- docs/en/guides/05-monitoring.md | 94 --------- docs/en/guides/05-observability.md | 238 +++++++++++++++++++++++ docs/en/guides/07-operation-telemetry.md | 6 +- docs/zh/guides/03-deployment.md | 2 +- docs/zh/guides/05-monitoring.md | 94 --------- docs/zh/guides/05-observability.md | 238 +++++++++++++++++++++++ docs/zh/guides/07-operation-telemetry.md | 6 +- 8 files changed, 486 insertions(+), 194 deletions(-) delete mode 100644 docs/en/guides/05-monitoring.md create mode 100644 docs/en/guides/05-observability.md delete mode 100644 docs/zh/guides/05-monitoring.md create mode 100644 docs/zh/guides/05-observability.md diff --git a/docs/en/guides/03-deployment.md b/docs/en/guides/03-deployment.md index 8f591a902..e1f42a64f 100644 --- a/docs/en/guides/03-deployment.md +++ b/docs/en/guides/03-deployment.md @@ -246,5 +246,5 @@ Use `/health` for Kubernetes liveness probes and `/ready` for readiness probes. ## Related Documentation - [Authentication](04-authentication.md) - API key setup -- [Monitoring](05-monitoring.md) - Health checks and observability +- [Observability & Diagnostics](05-observability.md) - Health checks, tracing, and debugging - [API Overview](../api/01-overview.md) - Complete API reference diff --git a/docs/en/guides/05-monitoring.md b/docs/en/guides/05-monitoring.md deleted file mode 100644 index 90f0ea69e..000000000 --- a/docs/en/guides/05-monitoring.md +++ /dev/null @@ -1,94 +0,0 @@ -# Monitoring & Health Checks - -OpenViking Server provides endpoints for monitoring system health and component status. - -## Health Check - -The `/health` endpoint provides a simple liveness check. It does not require authentication. 
- -```bash -curl http://localhost:1933/health -``` - -```json -{"status": "ok"} -``` - -## System Status - -### Overall System Health - -**Python SDK (Embedded / HTTP)** - -```python -status = client.get_status() -print(f"Healthy: {status['is_healthy']}") -print(f"Errors: {status['errors']}") -``` - -**HTTP API** - -```bash -curl http://localhost:1933/api/v1/observer/system \ - -H "X-API-Key: your-key" -``` - -```json -{ - "status": "ok", - "result": { - "is_healthy": true, - "errors": [], - "components": { - "queue": {"name": "queue", "is_healthy": true, "has_errors": false}, - "vikingdb": {"name": "vikingdb", "is_healthy": true, "has_errors": false}, - "vlm": {"name": "vlm", "is_healthy": true, "has_errors": false} - } - } -} -``` - -### Component Status - -Check individual components: - -| Endpoint | Component | Description | -|----------|-----------|-------------| -| `GET /api/v1/observer/queue` | Queue | Processing queue status | -| `GET /api/v1/observer/vikingdb` | VikingDB | Vector database status | -| `GET /api/v1/observer/vlm` | VLM | Vision Language Model status | - -### Quick Health Check - -**Python SDK (Embedded / HTTP)** - -```python -if client.is_healthy(): - print("System OK") -``` - -**HTTP API** - -```bash -curl http://localhost:1933/api/v1/debug/health \ - -H "X-API-Key: your-key" -``` - -```json -{"status": "ok", "result": {"healthy": true}} -``` - -## Response Time - -Every API response includes an `X-Process-Time` header with the server-side processing time in seconds: - -```bash -curl -v http://localhost:1933/api/v1/fs/ls?uri=viking:// \ - -H "X-API-Key: your-key" 2>&1 | grep X-Process-Time -# < X-Process-Time: 0.0023 -``` - -## Related Documentation - -- [Deployment](03-deployment.md) - Server setup -- [System API](../api/07-system.md) - System API reference diff --git a/docs/en/guides/05-observability.md b/docs/en/guides/05-observability.md new file mode 100644 index 000000000..b4705f242 --- /dev/null +++ 
b/docs/en/guides/05-observability.md @@ -0,0 +1,238 @@ +# Observability & Diagnostics + +This guide collects the current OpenViking observability entry points in one place, including: + +- service health and component status +- request-level `telemetry` +- terminal-side `ov tui` +- web-side `OpenViking Console` + +If you just want to know where to look first, start with the table below. + +## Choose the right entry point + +| Entry point | Best for | Typical use case | +| --- | --- | --- | +| `/health`, `observer/*` | service health, queue backlog, VikingDB and VLM status | deployment validation, on-call checks | +| `ov tui` | `viking://` trees, directory summaries, file content, vector records | development debugging, verifying that data actually landed | +| `OpenViking Console` | web UI for browsing, search, resource import, tenants, and system state | interactive investigation without typing every command | +| `telemetry` | per-request duration, token usage, vector retrieval, ingestion stages | debugging one specific slow or unexpected call | + +## Service health and component status + +### Health check + +`/health` provides a simple liveness check and does not require authentication. 
+ +```bash +curl http://localhost:1933/health +``` + +```json +{"status": "ok"} +``` + +### Overall system status + +**Python SDK (Embedded / HTTP)** + +```python +status = client.get_status() +print(f"Healthy: {status['is_healthy']}") +print(f"Errors: {status['errors']}") +``` + +**HTTP API** + +```bash +curl http://localhost:1933/api/v1/observer/system \ + -H "X-API-Key: your-key" +``` + +```json +{ + "status": "ok", + "result": { + "is_healthy": true, + "errors": [], + "components": { + "queue": {"name": "queue", "is_healthy": true, "has_errors": false}, + "vikingdb": {"name": "vikingdb", "is_healthy": true, "has_errors": false}, + "vlm": {"name": "vlm", "is_healthy": true, "has_errors": false} + } + } +} +``` + +### Component status + +| Endpoint | Component | Description | +| --- | --- | --- | +| `GET /api/v1/observer/queue` | Queue | Processing queue status | +| `GET /api/v1/observer/vikingdb` | VikingDB | Vector database status | +| `GET /api/v1/observer/vlm` | VLM | Vision Language Model status | + +For example: + +```bash +curl http://localhost:1933/api/v1/observer/queue \ + -H "X-API-Key: your-key" +``` + +### Quick health check + +**Python SDK (Embedded / HTTP)** + +```python +if client.is_healthy(): + print("System OK") +``` + +**HTTP API** + +```bash +curl http://localhost:1933/api/v1/debug/health \ + -H "X-API-Key: your-key" +``` + +```json +{"status": "ok", "result": {"healthy": true}} +``` + +### Response time + +Every API response includes an `X-Process-Time` response header that reports the server-side processing time in seconds: + +```bash +curl -v "http://localhost:1933/api/v1/fs/ls?uri=viking://" \ + -H "X-API-Key: your-key" 2>&1 | grep -i X-Process-Time +# < X-Process-Time: 0.0023 +``` + +This layer answers "is the service up, blocked, or unhealthy?" If you want to inspect what happened inside one request, move on to telemetry. 
+ +## Use `ov tui` for data-plane inspection + +The `ov` CLI includes a dedicated TUI file explorer: + +```bash +ov tui / +``` + +You can also start from a specific scope: + +```bash +ov tui viking://resources +``` + +Prerequisites: + +- OpenViking Server is running +- `ovcli.conf` is configured +- the current `X-API-Key` can read the target tenant data + +This TUI is useful for two kinds of inspection: + +- checking what actually exists under `viking://resources`, `viking://user`, `viking://agent`, and `viking://session` +- checking whether vector records for a URI were actually written, and how many there are + +Common keys: + +- `q`: quit +- `Tab`: switch focus between the tree and content panels +- `j` / `k`: move up and down +- `.`: expand or collapse a directory +- `g` / `G`: jump to the top or bottom +- `v`: toggle vector-record view +- `n`: load the next page in vector-record view +- `c`: count total vector records for the current URI + +A typical debugging flow is: + +1. Run `ov tui viking://resources` and locate the target document or directory. +2. Confirm the right-side panel shows `abstract`, `overview`, or file content. +3. Press `v` to inspect vector records for that URI. +4. Press `c` to get the total count, and `n` to keep paging if needed. + +TUI is primarily for data-plane inspection. It helps answer "did the resource really land?" and "were vectors really written?" but it does not directly show token totals or per-stage request timing. + +## Use OpenViking Console for web-based investigation + +The repo also contains a standalone web console. It is not wired into the main CLI and must be started separately: + +```bash +python -m openviking.console.bootstrap \ + --host 127.0.0.1 \ + --port 8020 \ + --openviking-url http://127.0.0.1:1933 +``` + +Then open: + +```text +http://127.0.0.1:8020/ +``` + +On first use, go to `Settings` and set your `X-API-Key`. 
+ +The most useful panels for observability are: + +- `FileSystem`: browse URIs, directories, and files +- `Find`: run retrieval requests and inspect results +- `Add Resource`: import resources and inspect responses +- `Add Memory`: submit content through a session commit and inspect the memory flow +- `Tenants` / `Monitor`: inspect tenant, user, and system state + +If you need write operations such as `Add Resource`, `Add Memory`, or tenant/user administration, start the console with `--write-enabled`: + +```bash +python -m openviking.console.bootstrap \ + --host 127.0.0.1 \ + --port 8020 \ + --openviking-url http://127.0.0.1:1933 \ + --write-enabled +``` + +From an observability standpoint, one useful detail is that the console result panel shows raw API responses. For operations such as `find`, `add-resource`, and `session commit`, the proxy layer requests `telemetry` by default, so you can usually inspect `telemetry.summary` directly in the UI. + +Console is best for interactive click-through debugging. If you need to feed observability data into your own logs or automation, prefer the HTTP API or SDK and request telemetry explicitly. + +## Request-level telemetry + +The public request-tracing feature in OpenViking is called `operation telemetry`. 
It attaches a structured summary to a response so you can inspect things like: + +- total duration +- LLM and embedding token usage +- vector search counts, scan volume, and returned results +- resource-ingestion stages +- memory extraction stats for `session.commit` + +The most common way to request it is to pass: + +```json +{"telemetry": true} +``` + +For example: + +```bash +curl -X POST http://localhost:1933/api/v1/search/find \ + -H "Content-Type: application/json" \ + -H "X-API-Key: your-key" \ + -d '{ + "query": "memory dedup", + "limit": 5, + "telemetry": true + }' +``` + +For the full field reference, supported operations, and more examples, see: + +- [Operation Telemetry Reference](07-operation-telemetry.md) + +## Related Documentation + +- [Deployment](03-deployment.md) - server setup +- [Authentication](04-authentication.md) - API key setup +- [Operation Telemetry Reference](07-operation-telemetry.md) - request-level structured tracing +- [System API](../api/07-system.md) - system and observer API reference diff --git a/docs/en/guides/07-operation-telemetry.md b/docs/en/guides/07-operation-telemetry.md index 27c452c47..86453ea71 100644 --- a/docs/en/guides/07-operation-telemetry.md +++ b/docs/en/guides/07-operation-telemetry.md @@ -1,4 +1,4 @@ -# Operation Telemetry +# Operation Telemetry Reference Operation telemetry lets you ask OpenViking to return a compact summary of what happened during a request, such as duration, token usage, vector retrieval activity, queue progress, and resource-processing stages. @@ -8,6 +8,8 @@ Use it when you want to: - inspect token or retrieval behavior - capture structured execution data in your own logs or observability pipeline +For the broader observability entry points, including health checks, `ov tui`, and `OpenViking Console`, see [Observability & Diagnostics](05-observability.md). + ## How it works Telemetry is opt-in. OpenViking only returns a top-level `telemetry` object when you request it. 
@@ -297,6 +299,6 @@ print(result["telemetry"]["summary"]["duration_ms"]) ## Related docs -- [Monitoring & Health Checks](05-monitoring.md) +- [Observability & Diagnostics](05-observability.md) - [Authentication](04-authentication.md) - [System API](../api/07-system.md) diff --git a/docs/zh/guides/03-deployment.md b/docs/zh/guides/03-deployment.md index 4c77ce145..de68897fa 100644 --- a/docs/zh/guides/03-deployment.md +++ b/docs/zh/guides/03-deployment.md @@ -244,5 +244,5 @@ curl http://localhost:1933/ready ## 相关文档 - [认证](04-authentication.md) - API Key 设置 -- [监控](05-monitoring.md) - 健康检查与可观测性 +- [可观测性与排障](05-observability.md) - 健康检查、追踪与排障 - [API 概览](../api/01-overview.md) - 完整 API 参考 diff --git a/docs/zh/guides/05-monitoring.md b/docs/zh/guides/05-monitoring.md deleted file mode 100644 index 54359b86b..000000000 --- a/docs/zh/guides/05-monitoring.md +++ /dev/null @@ -1,94 +0,0 @@ -# 监控与健康检查 - -OpenViking Server 提供了用于监控系统健康状态和组件状态的端点。 - -## 健康检查 - -`/health` 端点提供简单的存活检查,不需要认证。 - -```bash -curl http://localhost:1933/health -``` - -```json -{"status": "ok"} -``` - -## 系统状态 - -### 整体系统健康状态 - -**Python SDK (Embedded / HTTP)** - -```python -status = client.get_status() -print(f"Healthy: {status['is_healthy']}") -print(f"Errors: {status['errors']}") -``` - -**HTTP API** - -```bash -curl http://localhost:1933/api/v1/observer/system \ - -H "X-API-Key: your-key" -``` - -```json -{ - "status": "ok", - "result": { - "is_healthy": true, - "errors": [], - "components": { - "queue": {"name": "queue", "is_healthy": true, "has_errors": false}, - "vikingdb": {"name": "vikingdb", "is_healthy": true, "has_errors": false}, - "vlm": {"name": "vlm", "is_healthy": true, "has_errors": false} - } - } -} -``` - -### 组件状态 - -检查各个组件的状态: - -| 端点 | 组件 | 描述 | -|------|------|------| -| `GET /api/v1/observer/queue` | Queue | 处理队列状态 | -| `GET /api/v1/observer/vikingdb` | VikingDB | 向量数据库状态 | -| `GET /api/v1/observer/vlm` | VLM | 视觉语言模型状态 | - -### 快速健康检查 - -**Python SDK (Embedded / HTTP)** - 
-```python -if client.is_healthy(): - print("System OK") -``` - -**HTTP API** - -```bash -curl http://localhost:1933/api/v1/debug/health \ - -H "X-API-Key: your-key" -``` - -```json -{"status": "ok", "result": {"healthy": true}} -``` - -## 响应时间 - -每个 API 响应都包含一个 `X-Process-Time` 请求头,其中包含服务端处理时间(单位为秒): - -```bash -curl -v http://localhost:1933/api/v1/fs/ls?uri=viking:// \ - -H "X-API-Key: your-key" 2>&1 | grep X-Process-Time -# < X-Process-Time: 0.0023 -``` - -## 相关文档 - -- [部署](03-deployment.md) - 服务器设置 -- [系统 API](../api/07-system.md) - 系统 API 参考 diff --git a/docs/zh/guides/05-observability.md b/docs/zh/guides/05-observability.md new file mode 100644 index 000000000..8bd8c8578 --- /dev/null +++ b/docs/zh/guides/05-observability.md @@ -0,0 +1,238 @@ +# 可观测性与排障 + +这份指南把 OpenViking 当前和“观测”有关的入口放在一起介绍,包括: + +- 服务健康检查与组件状态 +- 请求级 `telemetry` +- 终端侧 `ov tui` +- Web 侧 `OpenViking Console` + +如果你只想快速判断“该看哪里”,先看下面这张表。 + +## 先选哪个入口 + +| 入口 | 适合看什么 | 典型场景 | +| --- | --- | --- | +| `/health`、`observer/*` | 服务是否健康、队列是否堆积、VikingDB/VLM 状态 | 部署验收、值班巡检 | +| `ov tui` | `viking://` 文件树、目录摘要、文件正文、向量记录 | 开发调试、核对资源是否真正落库 | +| `OpenViking Console` | Web UI 里的文件浏览、检索、资源导入、租户与系统状态 | 不想手敲命令时做交互式排查 | +| `telemetry` | 单次请求耗时、token、向量检索、资源处理阶段 | 排查一次具体调用为什么慢、为什么结果异常 | + +## 服务健康与组件状态 + +### 健康检查 + +`/health` 提供简单的存活检查,不需要认证。 + +```bash +curl http://localhost:1933/health +``` + +```json +{"status": "ok"} +``` + +### 整体系统状态 + +**Python SDK (Embedded / HTTP)** + +```python +status = client.get_status() +print(f"Healthy: {status['is_healthy']}") +print(f"Errors: {status['errors']}") +``` + +**HTTP API** + +```bash +curl http://localhost:1933/api/v1/observer/system \ + -H "X-API-Key: your-key" +``` + +```json +{ + "status": "ok", + "result": { + "is_healthy": true, + "errors": [], + "components": { + "queue": {"name": "queue", "is_healthy": true, "has_errors": false}, + "vikingdb": {"name": "vikingdb", "is_healthy": true, "has_errors": false}, + "vlm": {"name": "vlm", "is_healthy": true, 
"has_errors": false} + } + } +} +``` + +### 组件状态 + +| 端点 | 组件 | 描述 | +| --- | --- | --- | +| `GET /api/v1/observer/queue` | Queue | 处理队列状态 | +| `GET /api/v1/observer/vikingdb` | VikingDB | 向量数据库状态 | +| `GET /api/v1/observer/vlm` | VLM | 视觉语言模型状态 | + +例如: + +```bash +curl http://localhost:1933/api/v1/observer/queue \ + -H "X-API-Key: your-key" +``` + +### 快速健康检查 + +**Python SDK (Embedded / HTTP)** + +```python +if client.is_healthy(): + print("System OK") +``` + +**HTTP API** + +```bash +curl http://localhost:1933/api/v1/debug/health \ + -H "X-API-Key: your-key" +``` + +```json +{"status": "ok", "result": {"healthy": true}} +``` + +### 响应时间 + +每个 API 响应都包含一个 `X-Process-Time` 请求头,表示服务端处理时间(单位为秒): + +```bash +curl -v http://localhost:1933/api/v1/fs/ls?uri=viking:// \ + -H "X-API-Key: your-key" 2>&1 | grep X-Process-Time +# < X-Process-Time: 0.0023 +``` + +这部分解决的是“服务现在是不是活着、是不是堵了、哪个组件有问题”。如果你要看某一次请求内部发生了什么,请继续看 telemetry。 + +## 用 `ov tui` 看数据面 + +`ov` CLI 里有一个独立的 TUI 文件浏览器命令: + +```bash +ov tui / +``` + +也可以从某个 scope 直接进入: + +```bash +ov tui viking://resources +``` + +使用前提: + +- OpenViking Server 已启动 +- 已配置好 `ovcli.conf` +- 当前 `X-API-Key` 有权读取对应租户数据 + +这个 TUI 适合做两类观测: + +- 看 `viking://resources`、`viking://user`、`viking://agent`、`viking://session` 下实际落了哪些数据 +- 看某个 URI 对应的向量记录是否已经写入,以及数量是否符合预期 + +常用按键: + +- `q`:退出 +- `Tab`:在左侧树和右侧内容面板之间切换焦点 +- `j` / `k`:上下移动 +- `.`:展开或折叠目录 +- `g` / `G`:跳到顶部或底部 +- `v`:切换到向量记录视图 +- `n`:在向量记录视图里加载下一页 +- `c`:在向量记录视图里统计当前 URI 的向量总数 + +一个常见排查流程是: + +1. 用 `ov tui viking://resources` 找到目标文档或目录。 +2. 确认右侧能看到 `abstract` / `overview` / 正文内容。 +3. 按 `v` 进入向量记录视图,确认该 URI 下是否已经有向量数据。 +4. 
按 `c` 查看总量,必要时按 `n` 翻页继续核对。 + +TUI 更偏“数据面排查”。它适合回答“资源到底有没有进去”“向量到底有没有写进去”,但不直接展示单次请求的 token 或阶段耗时。 + +## 用 OpenViking Console 做 Web 观测 + +仓库里还有一个独立的 Web Console,它不是主 CLI 的一部分,需要单独启动: + +```bash +python -m openviking.console.bootstrap \ + --host 127.0.0.1 \ + --port 8020 \ + --openviking-url http://127.0.0.1:1933 +``` + +然后打开: + +```text +http://127.0.0.1:8020/ +``` + +第一次使用时,在 `Settings` 面板里填入 `X-API-Key`。 + +当前比较适合观测的面板有: + +- `FileSystem`:浏览 URI、查看目录和文件 +- `Find`:直接发检索请求并查看结果 +- `Add Resource`:导入资源并查看返回结果 +- `Add Memory`:通过 session 提交一段内容,观察 memory 提交流程 +- `Tenants` / `Monitor`:查看租户、用户以及系统状态 + +如果你要执行写操作,例如 `Add Resource`、`Add Memory`、租户或用户管理,需要带 `--write-enabled` 启动: + +```bash +python -m openviking.console.bootstrap \ + --host 127.0.0.1 \ + --port 8020 \ + --openviking-url http://127.0.0.1:1933 \ + --write-enabled +``` + +从观测角度看,Console 的一个优点是结果面板会直接显示接口返回值。对于 `find`、`add-resource` 和 `session commit` 这类操作,Console 代理层会默认帮你请求 `telemetry`,所以页面结果里通常可以直接看到 `telemetry.summary`。 + +Console 更适合“边点边看”的交互式排查;如果你要把观测数据接到自己的日志系统或自动化链路,建议直接调用 HTTP API 或 SDK,并显式请求 telemetry。 + +## 请求级 Telemetry + +OpenViking 的请求级追踪能力对外名称是 `operation telemetry`。它会在响应里附带一份结构化摘要,用来说明这次调用里发生了什么,例如: + +- 总耗时 +- LLM / embedding token 消耗 +- 向量检索次数、扫描量、返回量 +- 资源导入阶段耗时 +- `session.commit` 的 memory 提取统计 + +最常见的请求方式是在 body 里显式传: + +```json +{"telemetry": true} +``` + +例如: + +```bash +curl -X POST http://localhost:1933/api/v1/search/find \ + -H "Content-Type: application/json" \ + -H "X-API-Key: your-key" \ + -d '{ + "query": "memory dedup", + "limit": 5, + "telemetry": true + }' +``` + +完整字段、支持范围和更多示例见: + +- [操作级 Telemetry 参考](07-operation-telemetry.md) + +## 相关文档 + +- [部署](03-deployment.md) - 服务器设置 +- [认证](04-authentication.md) - API Key 设置 +- [操作级 Telemetry 参考](07-operation-telemetry.md) - 请求级结构化追踪 +- [系统 API](../api/07-system.md) - 系统与 observer 接口参考 diff --git a/docs/zh/guides/07-operation-telemetry.md b/docs/zh/guides/07-operation-telemetry.md index 7f74a14b5..0d4487218 100644 --- 
a/docs/zh/guides/07-operation-telemetry.md +++ b/docs/zh/guides/07-operation-telemetry.md @@ -1,4 +1,4 @@ -# 操作级 Telemetry 使用指南 +# 操作级 Telemetry 参考 操作级 telemetry 用来让 OpenViking 在请求结果里额外返回一份结构化摘要,帮助你了解这次操作实际发生了什么,例如耗时、token 消耗、向量检索情况、队列处理进度,以及资源导入阶段统计。 @@ -8,6 +8,8 @@ - 观察 token 或检索行为 - 把结构化执行摘要接入你自己的日志或观测系统 +更完整的观测入口说明,包括健康检查、`ov tui` 和 `OpenViking Console`,请先看 [可观测性与排障](05-observability.md)。 + ## 基本说明 Telemetry 是按需返回的。只有你显式请求时,OpenViking 才会在响应顶层返回 `telemetry` 字段。 @@ -297,6 +299,6 @@ print(result["telemetry"]["summary"]["duration_ms"]) ## 相关文档 -- [监控与健康检查](05-monitoring.md) +- [可观测性与排障](05-observability.md) - [认证](04-authentication.md) - [系统 API](../api/07-system.md)