1 change: 1 addition & 0 deletions docs/ROADMAP.md
@@ -57,4 +57,5 @@ created: 2026-02-26
| F147 | i18n: Chinese/English switching for the Hub UI | idea | TBD | internal | - |
| F149 | ACP Runtime Operations: project-level process pool + Session Lease | spec | Maine Coon | internal | [F149](features/F149-acp-runtime-operations.md) |
| F152 | Expedition Memory: cold-start memory for external projects + experience feedback loop | spec | Ragdoll | internal | [F152](features/F152-expedition-memory.md) |
| F153 | Observability Phase 1: telemetry redaction + health endpoints + OTel foundation | in-progress | Ragdoll | internal | [F153](features/F153-observability-phase1.md) |

254 changes: 254 additions & 0 deletions docs/features/F153-observability-phase1.md
@@ -0,0 +1,254 @@
---
feature_ids: [F153]
related_features: [F008, F045, F081, F118, F130, F150]
topics: [observability, telemetry, opentelemetry, security, metrics, tracing, logging]
doc_kind: spec
created: 2026-04-08
---

# F153: Observability Phase 1 (Telemetry Redaction + Health Endpoints + OTel Foundation)

> **Status**: in-progress | **Owner**: Ragdoll (宪宪) | **Reviewer**: Maine Coon (砚砚) | **Priority**: P1

## Why

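Cat Cafe currently lacks systematic observability:

- After an exception is thrown, the root cause is hard to locate
- After a timeout, it is hard to detect the problem and to tell "hung" apart from "busy"
- There is no reliable system-level signal for whether a cat is working
- External telemetry (Sentry/OTel) cannot be connected because nothing guarantees redaction

Existing foundations: Pino structured logging (F130), ProcessLivenessProbe (F118), EventAuditLog (F045), and the `/health` endpoint. The Phase 1 goal is to **connect the existing pipelines and fill in the security prerequisites**, not to build from scratch.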

## What

### D1: Telemetry Redaction Policy

#### Principle

Outbound telemetry (any signal that leaves the local machine) must pass through the `TelemetryRedactor` filter. Local logs and archives stay as they are, with no downgrade.

#### Field classification

| Class | Fields | Outbound handling | Local handling |
|------|------|----------|----------|
| **Class A: credentials** | authorization, cookie, set-cookie, x-api-key, token, apiKey, api_key, secret, password, credential, callbackToken, `*_TOKEN`, `*_API_KEY` | `[REDACTED]` | Existing Pino redaction |
| **Class B: business content** | prompt, message.content, thinking, toolInput, tool_result content, command, aggregated_output, MCP arguments, rich_block image (base64) | Not exported by default; only `{ hash: sha256_16, length, type }` is kept | Kept in plaintext |
| **Class C: system identifiers** | userId, threadId, invocationId, sessionId, messageId, rawArchivePath | `HMAC-SHA256(tenant_salt, id)` | Kept in plaintext |
| **Class D: observable values** | durationMs, timeoutMs, status, errorCode, tokenUsage (aggregate), queueDepth, retries, liveness level | Export allowed | Allowed |

#### Class C HMAC scheme

```typescript
import { createHmac } from 'node:crypto';

// The salt must be injected externally (.env / secret manager); generating it at runtime is not allowed.
// Fail fast when it is missing outside dev, so the key source stays controlled and consistent across instances.
const TENANT_SALT = process.env.TELEMETRY_HMAC_SALT;

function assertSalt(): string {
  if (TENANT_SALT) return TENANT_SALT;
  if (process.env.NODE_ENV === 'development' || process.env.NODE_ENV === 'test') {
    return 'dev-only-insecure-salt';
  }
  throw new Error('TELEMETRY_HMAC_SALT is required in non-dev environments');
}

export function hmacId(id: string): string {
  return createHmac('sha256', assertSalt()).update(id).digest('hex').slice(0, 32);
}
```

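- The salt is injected externally, so this does not violate the "immutable configuration" iron rule
- Missing salt in a non-dev environment → fail fast instead of silently falling back to a weak salt (for a true startup failure, `assertSalt()` has to be invoked during initialization, since the snippet above only checks on the first `hmacId` call)
- dev/test environments use a fixed weak salt to make local debugging easy
- Same value, same hash → records can still be correlated inside outbound telemetry (two entries with the same hash in Sentry = same thread)
- Not directly reversible after a leak → an attacker cannot use the hash to trace back into local logs
- Escape hatch: `TELEMETRY_EXPORT_RAW_SYSTEM_IDS=1` (default 0), only for short-term use in self-hosted, controlled environments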
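
The correlation and escape-hatch properties listed above can be illustrated with a small sketch. `exportSystemId` and its parameters are hypothetical names introduced here for illustration; the salt and the raw-export flag are passed as arguments (rather than read from `process.env`) only so the behavior is easy to exercise in isolation.

```typescript
import { createHmac } from 'node:crypto';

// Hypothetical helper (not part of the spec): mirrors hmacId, but takes the
// salt and the raw-export switch as explicit parameters.
export function exportSystemId(id: string, salt: string, exportRaw: boolean = false): string {
  // Escape hatch, modelled on TELEMETRY_EXPORT_RAW_SYSTEM_IDS=1.
  if (exportRaw) return id;
  return createHmac('sha256', salt).update(id).digest('hex').slice(0, 32);
}

const first = exportSystemId('thread-42', 'salt-A');
const second = exportSystemId('thread-42', 'salt-A');
const otherSalt = exportSystemId('thread-42', 'salt-B');

console.log(first === second);    // true: same salt + same ID can be correlated
console.log(first === otherSalt); // false: a different salt yields a different hash
console.log(exportSystemId('thread-42', 'salt-A', true)); // thread-42
```

Because the hash depends on the salt, two deployments with different salts cannot correlate each other's identifiers, which is the reason the salt has to be consistent across instances of one deployment.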

#### TelemetryRedactor architecture

```
                 ┌─────────────────────┐
                 │  TelemetryRedactor  │
                 │   (Class A/B/C/D)   │
                 └──────────┬──────────┘
        ┌───────────────────┼───────────────────┐
        ▼                   ▼                   ▼
OTel SpanProcessor  OTel LogProcessor   OTel MetricReader
    (traces)             (logs)             (metrics)
```

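- A single module shared by all three OTel signals
- Implemented as a processor/exporter wrapper in the OTel pipeline
- Filters attributes in `onEnd` (spans) / `onEmit` (logs)
- Metrics are filtered through the `MetricAttributeAllowlist` View (see D2), a hard code-level constraint rather than a documentation-only convention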
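
As a rough sketch of the attribute filtering that such a processor could call from `onEnd` / `onEmit`, the function below applies the Class A/B/C/D rules to a flat attribute map. The function name, the abbreviated key lists, the flattened `key.hash` / `key.length` / `key.type` output shape, and the choice to drop unknown keys are all assumptions made for illustration; the spec does not define them.

```typescript
import { createHash, createHmac } from 'node:crypto';

type AttrValue = string | number | boolean;
type Attrs = Record<string, AttrValue>;

// Key lists abbreviated from the classification table; a real list would be longer.
const CLASS_A = new Set(['authorization', 'cookie', 'token', 'apiKey', 'password', 'secret']);
const CLASS_A_SUFFIX = /(_TOKEN|_API_KEY)$/;
const CLASS_B = new Set(['prompt', 'thinking', 'toolInput', 'command']);
const CLASS_C = new Set(['userId', 'threadId', 'invocationId', 'sessionId', 'messageId']);
const CLASS_D = new Set(['durationMs', 'timeoutMs', 'status', 'errorCode', 'queueDepth', 'retries']);

export function redactAttributes(attrs: Attrs, salt: string): Attrs {
  const out: Attrs = {};
  for (const [key, value] of Object.entries(attrs)) {
    const text = String(value);
    if (CLASS_A.has(key) || CLASS_A_SUFFIX.test(key)) {
      out[key] = '[REDACTED]';
    } else if (CLASS_B.has(key)) {
      // Assumption: sha256_16 means the first 16 hex chars of SHA-256.
      out[`${key}.hash`] = createHash('sha256').update(text).digest('hex').slice(0, 16);
      out[`${key}.length`] = text.length;
      out[`${key}.type`] = typeof value;
    } else if (CLASS_C.has(key)) {
      out[key] = createHmac('sha256', salt).update(text).digest('hex').slice(0, 32);
    } else if (CLASS_D.has(key)) {
      out[key] = value;
    }
    // Assumption: keys in none of the classes are dropped (fail closed).
  }
  return out;
}

const redacted = redactAttributes(
  { authorization: 'Bearer abc', GITHUB_TOKEN: 'ghp', prompt: 'hello', threadId: 't-1', durationMs: 12, mystery: 'x' },
  'example-salt',
);
console.log(redacted);
```

Keeping the rules in one pure function makes it straightforward to reuse the same logic for spans and log records, and to unit-test it without an OTel pipeline.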

#### Emergency fix (done)

The Windows debug log at `cli-spawn.ts:470` printed `shimSpawn.args` (which contains the prompt in plaintext) → changed to `argCount`.

### D2: OpenTelemetry Metric Attribute Budget

#### Principle

OTel metric attributes follow the same cardinality constraints as Prometheus labels: **bounded dimensions become attributes; unbounded IDs do not**.

#### Allowed metric attributes

| Attribute | Cardinality bound | OTel semantic convention |
|-----------|---------|---------------|
| `agent.id` | ≤15 | custom |
| `gen_ai.system` | ≤5 | `gen_ai.system` (semconv) |
| `gen_ai.request.model` | ≤15 | `gen_ai.request.model` (semconv), normalized via `normalizeModel()` |
| `operation.name` | enum | custom |
| `status` | 4 | custom (success/error/timeout/cancelled) |

**Worst-case cardinality**: 15 × 5 × 15 × ~10 × 4 = 45,000 time series (taking ~10 as the size of the `operation.name` enum), which a single instance can handle.
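
The worst-case figure is simply the product of the per-attribute bounds. A minimal sketch of that arithmetic follows; `ATTRIBUTE_BOUNDS` and `worstCaseSeries` are hypothetical names, and the value 10 for `operation.name` is the approximate enum size used in the estimate above.

```typescript
// Upper bounds taken from the attribute table; 'operation.name' uses the ~10 estimate.
const ATTRIBUTE_BOUNDS: Record<string, number> = {
  'agent.id': 15,
  'gen_ai.system': 5,
  'gen_ai.request.model': 15,
  'operation.name': 10,
  'status': 4,
};

// Worst case: every combination of attribute values actually occurs.
export function worstCaseSeries(bounds: Record<string, number>): number {
  return Object.values(bounds).reduce((product, bound) => product * bound, 1);
}

console.log(worstCaseSeries(ATTRIBUTE_BOUNDS)); // 45000
```

Adding a single unbounded attribute such as `threadId` would multiply this product by the number of threads, which is why such IDs are kept out of metrics.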

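#### Model normalization

`defaultModel` is a free-form string in the code (`runtime-cat-catalog.ts`); using it directly as an attribute would let cardinality grow out of control. The telemetry layer adds `normalizeModel()` to normalize it: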

```typescript
// Bucket by provider + family; the long tail falls into 'other'.
const MODEL_BUCKETS: Record<string, string> = {
  'claude-opus': 'claude-opus',
  'claude-sonnet': 'claude-sonnet',
  'claude-haiku': 'claude-haiku',
  'gpt-4': 'gpt-4',
  'gpt-4o': 'gpt-4o',
  'o3': 'o3',
  'gemini-2.5': 'gemini-2.5',
  'gemini-2.0': 'gemini-2.0',
};

// Check longer prefixes first so that 'gpt-4o' is not swallowed by 'gpt-4'.
const MODEL_PREFIXES = Object.keys(MODEL_BUCKETS).sort((a, b) => b.length - a.length);

export function normalizeModel(raw: string): string {
  const lowered = raw.toLowerCase();
  for (const prefix of MODEL_PREFIXES) {
    if (lowered.includes(prefix)) return MODEL_BUCKETS[prefix] ?? 'other';
  }
  return 'other';
}
```

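The raw model string is never reported directly, which keeps cardinality bounded.

#### Forbidden as metric attributes

`threadId`, `invocationId`, `sessionId`, `userId`, `path`, `command`

These go into trace span attributes and log record attributes (after HMAC), not into metric attributes.

#### Code-level hard constraint: MetricAttributeAllowlist

A documented allowlist is not enough: forbidden fields must be blocked in code so that later developers cannot add high-cardinality attributes by mistake.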

```typescript
// MetricAttributeAllowlist: interception at the View layer of the OTel MetricReader
const ALLOWED_METRIC_ATTRIBUTES = new Set([
  'agent.id', 'gen_ai.system', 'gen_ai.request.model',
  'operation.name', 'status',
]);

// Implemented through the attributeKeys parameter of an OTel View
const invocationDurationView = new View({
  instrumentName: 'cat_cafe.invocation.duration',
  attributeKeys: [...ALLOWED_METRIC_ATTRIBUTES], // keep only allowlisted fields; everything else is dropped
});
```

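- Each instrument is bound to a View at registration, so the attribute allowlist is locked in at registration time
- Non-allowlisted fields are dropped by the OTel SDK automatically (they are not merely ignored; they never enter aggregation)
- Unit test coverage: pass forbidden fields (threadId, etc.) and assert that the exported metric does not contain those attributes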
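
The behavior the unit test should assert can be modelled with a standalone filter. This is a sketch of the expected outcome only, not the SDK's View implementation; `filterMetricAttributes` is a hypothetical name.

```typescript
const ALLOWED_KEYS = new Set([
  'agent.id', 'gen_ai.system', 'gen_ai.request.model',
  'operation.name', 'status',
]);

// Keeps allowlisted keys and drops everything else, mirroring what the
// View-based filtering is expected to achieve before aggregation.
export function filterMetricAttributes(attrs: Record<string, string>): Record<string, string> {
  return Object.fromEntries(Object.entries(attrs).filter(([key]) => ALLOWED_KEYS.has(key)));
}

const filtered = filterMetricAttributes({
  'agent.id': 'ragdoll',
  status: 'success',
  threadId: 'thread-42', // forbidden: unbounded ID
});
console.log(filtered);
```

A test against the real pipeline would record a measurement with `threadId` attached, collect from the reader, and check the exported data points in the same way.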

#### GenAI semantic convention isolation

The OTel GenAI Semantic Conventions are still in the Development stage. To keep upstream breaking changes from rippling through the codebase:

```typescript
// packages/api/src/infrastructure/telemetry/genai-semconv.ts
// Version isolation layer: all internal code references the constants defined here
export const GENAI_SYSTEM = 'gen_ai.system'; // stable
export const GENAI_MODEL = 'gen_ai.request.model'; // stable
export const GENAI_TOKENS_INPUT = 'gen_ai.usage.input_tokens'; // development
export const GENAI_TOKENS_OUTPUT = 'gen_ai.usage.output_tokens'; // development
```

When upstream renames something, only this file changes; business code is unaffected.

### I1: `/ready` endpoint

```typescript
app.get('/ready', async () => {
  const checks = await Promise.allSettled([
    redisClient.ping(), // Redis reachable
    // Wrapped in an async function so that a synchronous throw becomes a
    // rejection that Promise.allSettled can capture.
    (async () => sqliteDb.prepare('SELECT 1').get())(), // SQLite readable (no side effects)
  ]);
  const allOk = checks.every(c => c.status === 'fulfilled');
  return { status: allOk ? 'ready' : 'degraded', checks: formatChecks(checks) };
});
```

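Note: `wal_checkpoint` has lock-contention side effects and is unsuitable as a readiness probe. `SELECT 1` is enough to verify that the SQLite connection is usable. Checkpointing is left to diagnostic/maintenance paths.

- `/health` (existing) = process is alive
- `/ready` (new) = dependencies are reachable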
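
The handler above calls a `formatChecks` helper that the spec does not define. One possible shape is sketched below; the `CheckReport` type, the default check names, and the decision to surface only the error message are assumptions.

```typescript
export interface CheckReport {
  name: string;
  ok: boolean;
  error?: string;
}

// Hypothetical helper: turns Promise.allSettled results into a JSON-friendly report.
// The default names follow the order of the checks in the /ready handler.
export function formatChecks(
  results: PromiseSettledResult<unknown>[],
  names: string[] = ['redis', 'sqlite'],
): CheckReport[] {
  return results.map((result, i) => {
    const name = names[i] ?? `check-${i}`;
    if (result.status === 'fulfilled') return { name, ok: true };
    const reason: unknown = result.reason;
    return { name, ok: false, error: reason instanceof Error ? reason.message : String(reason) };
  });
}

const sample: PromiseSettledResult<unknown>[] = [
  { status: 'fulfilled', value: 'PONG' },
  { status: 'rejected', reason: new Error('database is locked') },
];
const reports = formatChecks(sample);
console.log(reports);
```

Reporting each dependency by name lets an operator see which one is failing without reading logs.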

### I2: OTel SDK initialization

Single entry point `packages/api/src/infrastructure/telemetry/init.ts`:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-http';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';

const sdk = new NodeSDK({
  serviceName: 'cat-cafe-api',
  traceExporter: new OTLPTraceExporter(),
  metricReader: new PrometheusExporter({ port: 9464 }), // /metrics endpoint
  logRecordProcessor: new RedactingLogProcessor(new OTLPLogExporter()),
  spanProcessor: new RedactingSpanProcessor(new OTLPTraceExporter()),
});
```

- Pino is bridged through `pino-opentelemetry-transport`, and the traceId is injected automatically
- Zero overhead when disabled (`OTEL_SDK_DISABLED=true`)

### I3: Initial instruments

| Instrument | Type | Description |
|------------|------|------|
| `cat_cafe.invocation.duration` | Histogram | Duration of each invocation |
| `cat_cafe.llm.call.duration` | Histogram | Duration of a single LLM call |
| `cat_cafe.agent.liveness` | Gauge | ProcessLivenessProbe state (0=dead, 1=idle-silent, 2=busy-silent, 3=active) |
| `cat_cafe.invocation.active` | UpDownCounter | Number of currently active invocations |
| `cat_cafe.token.usage` | Counter | Token consumption (split by input/output) |
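
The liveness gauge encodes a state as a number, so a small mapping keeps the encoding in one place. The sketch below uses the four states from the table; the `LivenessState` type and `livenessToGauge` function are hypothetical names, and the real ProcessLivenessProbe types may differ.

```typescript
export type LivenessState = 'dead' | 'idle-silent' | 'busy-silent' | 'active';

// Numeric encoding as listed for cat_cafe.agent.liveness.
const LIVENESS_GAUGE_VALUE: Record<LivenessState, number> = {
  dead: 0,
  'idle-silent': 1,
  'busy-silent': 2,
  active: 3,
};

export function livenessToGauge(state: LivenessState): number {
  return LIVENESS_GAUGE_VALUE[state];
}

console.log(livenessToGauge('busy-silent')); // 2
```

Typing the map as `Record<LivenessState, number>` makes the compiler flag any state that is added later without a gauge value.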

### I4: cli-spawn regression test

Mock the logger and assert that the Windows shim debug log payload contains neither an `args` field nor the prompt text.
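
A minimal sketch of such a test follows. `logShimSpawn` is a hypothetical stand-in for the fixed log statement in `cli-spawn.ts`, and the log message text is invented; only the `argCount` field comes from the fix described in D1.

```typescript
type Payload = Record<string, unknown>;

interface DebugLogger {
  debug(payload: Payload, msg: string): void;
}

// Minimal mock logger that records every debug call for later assertions.
export function createMockLogger(): DebugLogger & { calls: Array<{ payload: Payload; msg: string }> } {
  const calls: Array<{ payload: Payload; msg: string }> = [];
  return {
    calls,
    debug(payload: Payload, msg: string): void {
      calls.push({ payload, msg });
    },
  };
}

// Hypothetical stand-in for the fixed logging call: logs the argument count only.
export function logShimSpawn(logger: DebugLogger, args: string[]): void {
  logger.debug({ argCount: args.length }, 'windows shim spawn');
}

const promptText = 'super secret prompt text';
const mockLogger = createMockLogger();
logShimSpawn(mockLogger, ['--prompt', promptText]);

const logged = mockLogger.calls[0];
const serialized = JSON.stringify(logged);
console.log(serialized);
```

Checking the serialized payload as a whole, rather than individual fields, also catches the prompt leaking through a differently named field.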

## Acceptance Criteria

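- [ ] TelemetryRedactor module implemented and covering the OTel traces/logs pipelines
- [ ] Class A/B/C/D field classification passes 砚砚's security review
- [ ] `/ready` endpoint returns Redis + SQLite dependency status
- [ ] OTel SDK initialized; `/metrics` endpoint exposes metrics in Prometheus format
- [ ] The first 5 instruments are wired up and producing data
- [ ] cli-spawn regression test passes
- [ ] Zero overhead when `OTEL_SDK_DISABLED=true` (existing behavior unaffected)
- [ ] Everything passes `pnpm check` + `pnpm lint` + tests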

## Out of Scope (Phase 2)

- Finer-grained OpenTelemetry end-to-end tracing spans (tool-call level)
- Unified Grafana dashboard
- burn-rate alerting rules
- Sentry integration (can be done independently once D1 is ready)
6 changes: 6 additions & 0 deletions docs/features/index.json
@@ -929,6 +929,12 @@
"name": "Expedition Memory: cold-start memory for external projects + experience feedback loop",
"status": "spec | **Owner**: Ragdoll | **Priority**: P1",
"file": "F152-expedition-memory.md"
},
{
"id": "F153",
"name": "Observability Phase 1: telemetry redaction + health endpoints + OTel foundation",
"status": "in-progress | **Owner**: Ragdoll (宪宪) | **Reviewer**: Maine Coon (砚砚) | **Priority**: P1",
"file": "F153-observability-phase1.md"
}
],
"generated_at": "2026-04-09T04:33:26.998Z"
12 changes: 12 additions & 0 deletions packages/api/package.json
@@ -30,6 +30,18 @@
"@huggingface/transformers": "^3",
"@larksuiteoapi/node-sdk": "^1.59.0",
"@modelcontextprotocol/sdk": "^1.0.0",
"@opentelemetry/api": "^1.9.1",
"@opentelemetry/api-logs": "^0.214.0",
"@opentelemetry/exporter-logs-otlp-http": "^0.214.0",
"@opentelemetry/exporter-metrics-otlp-http": "^0.214.0",
"@opentelemetry/exporter-prometheus": "^0.214.0",
"@opentelemetry/exporter-trace-otlp-http": "^0.214.0",
"@opentelemetry/resources": "^2.6.1",
"@opentelemetry/sdk-logs": "^0.214.0",
"@opentelemetry/sdk-metrics": "^2.6.1",
"@opentelemetry/sdk-node": "^0.214.0",
"@opentelemetry/sdk-trace-node": "^2.6.1",
"@opentelemetry/semantic-conventions": "^1.40.0",
"@wecom/aibot-node-sdk": "1.0.4",
"better-sqlite3": "^12.6.2",
"cheerio": "^1.1.2",