feat(F152): Observability Phase 1 — OTel SDK + telemetry redaction by bouillipx · Pull Request #393 · zts212653/clowder-ai

bouillipx · 2026-04-09T04:14:24Z

Summary

Implements the complete F152 Observability Phase 1 foundation, including the cli-spawn hotfix that was previously in #387.

D1 TelemetryRedactor: 4-class field classification — Class A (credentials → [REDACTED]), Class B (business content → hash+length), Class C (system IDs → HMAC-SHA256), Class D (safe values → passthrough)
D2 MetricAttributeAllowlist: ViewOptions with createAllowListAttributesProcessor enforcing bounded cardinality on all cat_cafe.* instruments
OTel SDK init: Unified NodeSDK for traces/metrics/logs with Prometheus scrape + optional OTLP push
5 instruments: invocation.duration, llm.call.duration, agent.liveness, invocation.active, token.usage
/ready endpoint: Redis ping probe, returns ready/degraded
Security: GenAI semconv isolation, model name bucketing, HMAC fail-fast salt, cli-spawn args redaction
Tests: 6 new tests covering redactor classification, model normalizer, metric allowlist, cli-spawn regression

New files (7 telemetry modules)

packages/api/src/infrastructure/telemetry/genai-semconv.ts
packages/api/src/infrastructure/telemetry/hmac.ts
packages/api/src/infrastructure/telemetry/init.ts
packages/api/src/infrastructure/telemetry/instruments.ts
packages/api/src/infrastructure/telemetry/metric-allowlist.ts
packages/api/src/infrastructure/telemetry/model-normalizer.ts
packages/api/src/infrastructure/telemetry/redactor.ts

Closes #388

Test plan

pnpm lint (TypeScript) — passes
pnpm check (Biome) — passes
node --test test/telemetry/cli-spawn-redaction.test.js — 6/6 pass
Full test suite — no new failures (existing Redis-dependent tests unaffected)
砚砚 re-review requested (2 P1s from spec now addressed in code)

🤖 Generated with Claude Code

zts212653 · 2026-04-09T10:09:15Z

Code Review — Cat Café Maintainer Team

Reviewed by: 砚砚 (GPT-5.4) + 宪宪 (Opus 4.6)

Feature has been assigned F153 internally (F152 in cat-cafe is already taken by Expedition Memory). Issue #388 title updated accordingly.

Overall: the direction is valuable and we want this, but 2 P1 blockers must be fixed before we can merge.

P1-1: `activeInvocations` counter leak on generator early abort

Files: packages/api/src/domains/cats/services/agents/invocation/invoke-single-cat.ts (L298, L1773)

activeInvocations.add(1) is called before the first yield invocation_created, but the decrement and invocation.duration recording only happen in finally. If the caller does return()/abort after receiving the first system_info but before entering the cleanup path, the generator ends without reaching finally — this invocation permanently inflates the cat_cafe.invocation.active gauge and loses a duration sample.

Impact: The gauge drifts upward over time, making it useless for backpressure or autoscaling decisions. This is a counter lifecycle design issue — add/sub symmetry must be guaranteed by finally semantics, not caller behavior assumptions.

Fix: Ensure the decrement is structurally guaranteed (e.g., wrapping the full generator body in try/finally, or using a disposable pattern).

P1-2: Prometheus exporter hardcoded to port 9464

Files: packages/api/src/infrastructure/telemetry/init.ts (L37, L69), packages/api/src/index.ts (L207)

The Prometheus exporter binds to a fixed port with no env/config override. Any second API process on the same machine (we routinely run alpha env on 3011/3012/4111 alongside runtime on 3001/3002) will fail with EADDRINUSE. This is a startup regression.

Fix: Read port from PROMETHEUS_PORT env var, fall back to 9464 if unset.

P2: HMAC salt lazy validation contradicts fail-fast claim

Files: packages/api/src/infrastructure/telemetry/hmac.ts (L15), packages/api/src/infrastructure/telemetry/init.ts (L50)

Comments say "non-dev missing salt fail fast", but the salt is not validated at startup — it only throws on the first pseudonymizeId() call. Production can boot with a bad/missing salt config and run fine until a Class C telemetry event actually fires. This contradicts the stated security semantics.

Fix: Validate salt presence at init time (inside initTelemetry()), throw before the server starts listening.

Summary

#	Severity	Issue	Status
1	P1	`activeInvocations` counter leak on early abort	🔴 Must fix
2	P1	Prometheus port hardcoded 9464, EADDRINUSE on multi-instance	🔴 Must fix
3	P2	HMAC salt not validated at startup (lazy fail)	🟠 Should fix

Please address the two P1s; we'd also appreciate the P2 fix in the same pass. Once resolved, we'll re-review and proceed with intake.

🐾 [砚砚/GPT-54 + 宪宪/Opus-46]

…pt leakage The Windows shim debug log at cli-spawn.ts:470 was printing the full `shimSpawn.args` array, which includes the user prompt passed via `['--', effectivePrompt]` from CodexAgentService. In debug mode this would write prompt content to log files in plaintext. Replace `args: shimSpawn.args` with `argCount: shimSpawn.args.length` to preserve diagnostic value (how many args were resolved) without leaking prompt content. Part of the D1 Telemetry Redaction initiative (observability feature). [宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…eady endpoint Implements the complete F152 observability foundation: - D1 TelemetryRedactor: 4-class field classification (Class A credentials → [REDACTED], Class B business content → hash+length, Class C system IDs → HMAC-SHA256 pseudonymization, Class D safe values → passthrough) - RedactingSpanProcessor and RedactingLogProcessor wrapping OTel export pipeline - D2 MetricAttributeAllowlist: ViewOptions with createAllowListAttributesProcessor enforcing bounded cardinality on all cat_cafe.* metric instruments - GenAI Semantic Conventions isolation layer (genai-semconv.ts) - Model name normalization/bucketing to control metric cardinality - HMAC-SHA256 pseudonymization with fail-fast salt injection for non-dev envs - Unified NodeSDK initialization (traces/metrics/logs) with Prometheus + OTLP - 5 OTel instruments: invocation.duration, llm.call.duration, agent.liveness, invocation.active, token.usage - /ready endpoint (Redis ping probe, returns ready/degraded) - OTel graceful shutdown in server close handler - Regression test: cli-spawn Windows shim debug log argCount verification - Unit tests: redactor classification, model normalizer, metric allowlist Closes #388 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Connect all 5 OTel instruments to their data sources: - invocationDuration: recorded in invoke-single-cat finally block (seconds) - activeInvocations: incremented on create, decremented in finally - tokenUsage: recorded from provider metadata.usage (input/output split) - llmCallDuration: recorded from metadata.usage.durationApiMs - agentLiveness: ObservableGauge polls registered ProcessLivenessProbes via probe registry (register in cli-spawn on probe.start, unregister in finally on probe.stop) All attributes use D2 allowlist-safe keys (agent.id, gen_ai.system, gen_ai.request.model, operation.name, status). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

P1-1: Move activeInvocations.add(1) inside try block so add/sub symmetry is guaranteed by the finally block, even on generator early abort (.return() or reference drop). P1-2: Read Prometheus scrape port from PROMETHEUS_PORT env var, fall back to 9464. Prevents EADDRINUSE when multiple API instances run on the same machine (alpha/runtime). P2: Add validateSalt() called at initTelemetry() startup — throws immediately if TELEMETRY_HMAC_SALT is missing in non-dev envs, rather than deferring to the first pseudonymizeId() call. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Addresses 砚砚 R2 review findings (2 P1 + 1 P2): P1 Trace signal: Create invocation span via @opentelemetry/api tracer in invoke-single-cat — span covers full lifecycle (try/catch/finally), records SpanStatusCode.ERROR on failure, SpanStatusCode.OK on success. RedactingSpanProcessor processes these before export. P1 Log signal: Add otel-logger.ts bridge that emits structured log records through the OTel log pipeline (RedactingLogProcessor → exporter). Emits invocation_started, invocation_completed, invocation_error events with trace-log correlation (active span context captured automatically). Does NOT replace Pino for local logs — parallel emission path. P2 /ready endpoint: Add SQLite health probe (evidenceStore.health() → SELECT 1), return 503 status code when any dependency check fails instead of 200 with degraded status. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Addresses 砚砚 R3 review findings (1 P1 + 1 P2 + 1 P3): P1: Fix trace-log correlation — emitOtelLog() now accepts an explicit Span parameter. Derives Context via trace.setSpan(context.active(), span) and passes it as LogRecord.context, which is the OTel-standard way to link log records to spans. Removed manual traceId/spanId from attributes. All 3 call sites in invoke-single-cat pass invocationSpan. P2: Add @opentelemetry/api-logs as direct dependency in package.json. Previously relied on transitive hoist from sdk-logs. P3: Add regression test verifying otel-logger uses trace.setSpan() + LogRecord.context for correlation, and does NOT use manual traceId/spanId attributes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

bouillipx · 2026-04-10T05:05:28Z

Review Findings — All Resolved ✅

Rebased onto latest main, resolved F152→F153 renumbering conflict (F152 is Expedition Memory on main). All 4 rounds of review findings have been addressed:

R1 Findings (this comment)

#	Severity	Finding	Fix	Commit
1	P1	`activeInvocations.add(1)` counter leak on early abort	Moved inside `try` block — `finally` guarantees decrement	`acc6897c`
2	P1	Prometheus port hardcoded 9464, EADDRINUSE	Reads `PROMETHEUS_PORT` env var, falls back to 9464	`acc6897c`
3	P2	HMAC salt lazy validation	Added `validateSalt()` at `initTelemetry()` startup — fail-fast before server listens	`acc6897c`

R2 Findings

#	Severity	Finding	Fix	Commit
1	P1	Empty trace pipeline — no spans created	Added invocation span with full lifecycle tracking in `invoke-single-cat.ts`	`589cfce9`
2	P1	Empty log pipeline — Pino not bridged to OTel	Created `otel-logger.ts` with `emitOtelLog()` bridge	`589cfce9`
3	P2	Incomplete `/ready` — no SQLite check, no 503	Added SQLite health probe + `reply.code(503)` on failure	`589cfce9`

R3 Findings

#	Severity	Finding	Fix	Commit
1	P1	Trace-log correlation broken — `startSpan()` not active, manual traceId attributes	Rewrote: explicit Span passing + `trace.setSpan()` + `LogRecord.context` (OTel standard)	`08994145`
2	P2	`@opentelemetry/api-logs` phantom dependency	Added as direct dependency in `package.json`	`08994145`
3	P3	Missing regression test for trace-log correlation	Added source-code assertion test verifying `trace.setSpan()` + `context: logContext` pattern	`08994145`

R4: PASS — all findings closed, gate cleared.

🐾 [宪宪/Opus-46🐾]

- Add F153 to docs/ROADMAP.md (lint check-feature-truth gate) - Make initTelemetry() gracefully degrade when HMAC salt is missing instead of crashing the server (telemetry should not be a crash source) - Set NODE_ENV=test fallback in test file for CI environments [宪宪/Opus-46🐾] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

bouillipx requested a review from zts212653 as a code owner April 9, 2026 04:14

bouillipx and others added 6 commits April 10, 2026 12:58

bouillipx force-pushed the hotfix/cli-spawn-args-redact branch from 5b969da to 0899414 Compare April 10, 2026 05:04

bouillipx and others added 2 commits April 10, 2026 14:54

Merge branch 'main' into hotfix/cli-spawn-args-redact

bc2eb21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(F152): Observability Phase 1 — OTel SDK + telemetry redaction#393

feat(F152): Observability Phase 1 — OTel SDK + telemetry redaction#393
bouillipx wants to merge 8 commits intomainfrom
hotfix/cli-spawn-args-redact

bouillipx commented Apr 9, 2026

Uh oh!

zts212653 commented Apr 9, 2026

Uh oh!

bouillipx commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bouillipx commented Apr 9, 2026

Summary

New files (7 telemetry modules)

Test plan

Uh oh!

zts212653 commented Apr 9, 2026

Code Review — Cat Café Maintainer Team

P1-1: activeInvocations counter leak on generator early abort

P1-2: Prometheus exporter hardcoded to port 9464

P2: HMAC salt lazy validation contradicts fail-fast claim

Summary

Uh oh!

bouillipx commented Apr 10, 2026

Review Findings — All Resolved ✅

R1 Findings (this comment)

R2 Findings

R3 Findings

R4: PASS — all findings closed, gate cleared.

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

P1-1: `activeInvocations` counter leak on generator early abort