
logging#3007

Merged
shagun-singh-inkeep merged 4 commits into main from logging-bug
Apr 3, 2026

Conversation

@shagun-singh-inkeep
Collaborator

No description provided.

@changeset-bot

changeset-bot bot commented Apr 3, 2026

🦋 Changeset detected

Latest commit: f26d76d

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 10 packages
| Name | Type |
| --- | --- |
| @inkeep/agents-core | Patch |
| @inkeep/agents-api | Patch |
| @inkeep/agents-manage-ui | Patch |
| @inkeep/agents-cli | Patch |
| @inkeep/agents-sdk | Patch |
| @inkeep/agents-work-apps | Patch |
| @inkeep/ai-sdk-provider | Patch |
| @inkeep/create-agents | Patch |
| @inkeep/agents-email | Patch |
| @inkeep/agents-mcp | Patch |


@vercel

vercel bot commented Apr 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agents-api Ready Ready Preview, Comment Apr 3, 2026 10:44pm
agents-docs Ready Ready Preview, Comment Apr 3, 2026 10:44pm
agents-manage-ui Ready Ready Preview, Comment Apr 3, 2026 10:44pm


@pullfrog
Contributor

pullfrog bot commented Apr 3, 2026

TL;DR — Adds targeted diagnostic logging across Doltgres ref-scoping middleware and trigger execution paths to surface slow database operations (pool connect, branch checkout, ref resolution) and improve observability for production latency investigations.

Key changes

  • Add slow-operation logging to withRef and branchScopedDb middleware — Instruments pool connect, branch checkout, cleanup, and total duration with 5-second threshold warnings so slow Doltgres operations are visible in production logs.
  • Add timing instrumentation to resolveRef and ref middleware — Logs slow resolveRef calls and captures branch-not-found diagnostics including the full branch list for debugging ref resolution failures.
  • Add execution lifecycle logging to TriggerService — Logs dispatch entry, executeAgentAsync start, project load duration, and conversation creation timing to trace trigger execution latency end-to-end.
  • Add slow withRef logging in getFunctionTools — Warns when function tool loading via withRef exceeds 5 seconds during agent execution.
  • Add evaluation prompt logging in EvaluationService — Logs evaluator and conversation context before building the evaluation prompt.

Summary | 7 files | 1 commit | base: main ← logging-bug


Slow-operation logging in Doltgres ref-scoping

Before: withRef and branchScopedDbMiddleware had minimal logging — slow pool connections, branch checkouts, and cleanup errors were invisible unless they caused outright failures.
After: Both functions now time pool.connect(), checkoutBranch(), cleanup, and total duration, emitting logger.info warnings when any phase exceeds 5 seconds. Cleanup failures also log the error message and elapsed time.

The 5-second threshold keeps logs quiet under normal conditions while surfacing Doltgres connection pool exhaustion or slow branch operations in production. The withRef function additionally reuses a startTime variable that already existed for the debug-level duration log, now promoting it to an info-level slow-path warning.
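The threshold-gated pattern described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `timedStep`, the stand-in `logger`, and the field names are all assumptions modeled on the description.

```typescript
const SLOW_THRESHOLD_MS = 5_000;

type LogFields = Record<string, unknown>;

// Stand-in for the repo's structured logger; illustrative only.
const logger = {
  info: (fields: LogFields, msg: string) => console.log(msg, JSON.stringify(fields)),
};

// Wrap an async step (pool connect, branch checkout, cleanup), timing it and
// emitting an info-level log only when it crosses the slow threshold, so
// healthy traffic stays out of the logs.
async function timedStep<T>(
  label: string,
  fields: LogFields,
  fn: () => Promise<T>
): Promise<{ result: T; elapsedMs: number }> {
  const start = Date.now();
  const result = await fn();
  const elapsedMs = Date.now() - start;
  if (elapsedMs > SLOW_THRESHOLD_MS) {
    logger.info({ ...fields, elapsedMs }, `slow ${label}`);
  }
  return { result, elapsedMs };
}
```

Each instrumented phase (`connectMs`, `checkoutMs`, and so on) follows this same shape: measure unconditionally, log conditionally.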

ref-scope.ts · branchScopedDb.ts


Ref resolution diagnostics

Before: When resolveRef couldn't find a branch, it silently returned null. Slow resolution was undetectable.
After: A branch-not-found condition now logs the ref name, branch count, query duration, and the full list of branch names. The ref middleware also logs slow resolveRef calls (>5s) and captures the error message when resolution throws.

This is particularly useful for diagnosing tenant-scoped ref mismatches where the expected branch doesn't appear in dolt_branches.
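A rough sketch of the branch-not-found diagnostic payload is below. Note it deliberately reports only `branchCount` rather than the full branch-name list, which the review comments further down flag as a tenant-data leak; all names here are illustrative, not the PR's actual code.

```typescript
interface Branch {
  name: string;
}

interface MissingBranchLog {
  ref: string;
  branchCount: number;
  branchQueryMs: number;
  totalMs: number;
}

// Build the diagnostic fields for a failed ref lookup. Count plus timings
// distinguish "branch genuinely absent" from "slow or failed query" without
// echoing other tenants' branch names into the logs.
function missingBranchDiagnostics(
  ref: string,
  branches: Branch[],
  branchQueryMs: number,
  totalMs: number
): MissingBranchLog | null {
  const branch = branches.find((b) => b.name === ref);
  if (branch) return null; // found: nothing to report
  return { ref, branchCount: branches.length, branchQueryMs, totalMs };
}
```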

ref-helpers.ts · ref.ts


Trigger execution lifecycle tracing

Before: dispatchExecution and executeAgentAsync had a single log after project load. Latency in dispatch, conversation creation, or early execution phases was opaque.
After: Four new log points trace the full lifecycle: dispatch entry, executeAgentAsync start, project load duration (loadProjectMs), and conversation creation duration (convMs).
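The per-phase timing fields might be collected along these lines. The field names `loadProjectMs` and `convMs` mirror the PR; the async stubs and the surrounding function are hypothetical stand-ins for the real project load and conversation creation.

```typescript
// Stand-ins for the real trigger-execution phases; illustrative only.
async function loadProject(): Promise<void> {}
async function createConversation(): Promise<void> {}

// Time each lifecycle phase independently so latency can be attributed to a
// specific step rather than the whole dispatch.
async function traceTriggerExecution(): Promise<Record<string, number>> {
  const phases: Record<string, number> = {};

  const loadStart = Date.now();
  await loadProject();
  phases.loadProjectMs = Date.now() - loadStart;

  const convStart = Date.now();
  await createConversation();
  phases.convMs = Date.now() - convStart;

  return phases;
}
```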

TriggerService.ts · function-tools.ts · EvaluationService.ts

Pullfrog | View workflow run | Triggered by Pullfrog | Using Claude Opus

Contributor

@pullfrog pullfrog bot left a comment


This PR adds diagnostic logging and slow-operation timing across the Dolt ref resolution and branch-scoping code paths. The intent — finding where latency hides — is good. A few issues to address before merging, mostly around log levels, duplicate logging, and a data leak in one log statement.


branchQueryMs,
totalMs,
branchNames: branches.map((b) => b.name),
},
Contributor


High: branchNames leaks all tenant branch names into logs. This array contains tenant-scoped branch names for every project. In a production environment with many tenants sharing a Doltgres instance, this could produce enormous log entries and expose tenant data in log aggregators.

Drop branchNames entirely — branchCount is sufficient to diagnose the "branch not found" scenario.

Suggested change
},
branchCount: branches.length,
branchQueryMs,
totalMs,

Comment on lines +53 to +65
if (!branch) {
const totalMs = Date.now() - startTime;
logger.info(
{
ref,
branchCount: branches.length,
branchQueryMs,
totalMs,
branchNames: branches.map((b) => b.name),
},
'resolveRef: branch not found in dolt_branches result'
);
}
Contributor


Medium: Log level should be warn, not info. A branch not being found during resolveRef is an unexpected/concerning condition — callers generally expect the branch to exist. logger.warn is the appropriate level per the repo's logging guidelines ("recoverable issues or concerning patterns"). Also, this fires unconditionally (no slow-path threshold), so at info it will add noise for every cache miss or missing-branch lookup.

Comment on lines +242 to 246
logger.info(
{ projectMain, error: error instanceof Error ? error.message : String(error) },
'resolveRef threw in ref middleware'
);
logger.warn({ error, projectMain }, 'Failed to resolve project main branch');
Contributor


Medium: Duplicate log — this logger.info and the existing logger.warn on line 246 log the same error. The warn already captures error and projectMain. Remove this new info call to avoid double-logging the same failure.

Suggested change
logger.info(
{ projectMain, error: error instanceof Error ? error.message : String(error) },
'resolveRef threw in ref middleware'
);
logger.warn({ error, projectMain }, 'Failed to resolve project main branch');
logger.warn({ error, projectMain }, 'Failed to resolve project main branch');

Comment on lines +263 to +271
logger.info(
{
ref: resolvedRef.name,
connectionId,
cleanupMs: Date.now() - cleanupStart,
error: cleanupError instanceof Error ? cleanupError.message : String(cleanupError),
},
'withRef cleanup failed'
);
Contributor


Medium: Duplicate log — same cleanup failure is logged twice. This new logger.info and the existing logger.error on lines 272–275 report the same cleanupError. Remove one. The error level is the correct one for a cleanup failure; the info call should be dropped.

Suggested change
logger.info(
{
ref: resolvedRef.name,
connectionId,
cleanupMs: Date.now() - cleanupStart,
error: cleanupError instanceof Error ? cleanupError.message : String(cleanupError),
},
'withRef cleanup failed'
);
logger.error(
{ error: cleanupError, tempBranch, connectionId, cleanupMs: Date.now() - cleanupStart },
'Error during ref scope cleanup'
);

Comment on lines +160 to +167
logger.info(
{
ref: resolvedRef.name,
cleanupMs: Date.now() - cleanupStart,
error: cleanupError instanceof Error ? cleanupError.message : String(cleanupError),
},
'branchScopedDb cleanup failed'
);
Contributor


Medium: Same duplicate-log pattern as ref-scope.ts. The new logger.info for cleanup failure is immediately followed by the existing logger.error for the same error. Consolidate into the single error-level call and add the timing fields there.

Suggested change
logger.info(
{
ref: resolvedRef.name,
cleanupMs: Date.now() - cleanupStart,
error: cleanupError instanceof Error ? cleanupError.message : String(cleanupError),
},
'branchScopedDb cleanup failed'
);
logger.error(
{
error: cleanupError,
ref: resolvedRef.name,
cleanupMs: Date.now() - cleanupStart,
},
'Error during connection cleanup'
);

Comment on lines +157 to +165
logger.info(
{
evaluatorId: evaluator.id,
conversationId: conversation.id,
hasExpectedOutput: expectedOutput != null,
expectedOutputLength: expectedOutputText?.length ?? 0,
},
'Building evaluation prompt'
);
Contributor


Low: This is a routine operation log at info level. Per the logging guidelines, routine pre-step breadcrumbs ("Building evaluation prompt") belong at debug. None of the structured fields here convey business-event or performance-alert information. Consider logger.debug or removing entirely — the downstream LLM call log is the meaningful signal.

Comment on lines +549 to +552
logger.info(
{ tenantId, projectId, agentId, triggerId, invocationId, conversationId, dispatchedAt },
'Trigger execution dispatched and starting execution'
);
Contributor


Low: Duplicate of the log 5 lines above. Line 541–543 already logs 'Trigger invocation created' with the same structured fields (tenantId, projectId, agentId, triggerId, invocationId, conversationId). This second log at the same level with nearly the same payload adds noise without new information. Remove it or merge the two.

Comment on lines +660 to +662
logger.info(
{ tenantId, projectId, agentId, triggerId, invocationId },
'executeAgentAsync: starting'
Contributor


Low: Redundant entry-point log. executeAgentAsync already has a log at line 891 ('Starting async trigger execution') with the same fields plus conversationId. This extra log at function entry doesn't add diagnostic value — if you want it, make it debug.

@pullfrog
Contributor

pullfrog bot commented Apr 3, 2026

TL;DR — Adds diagnostic logging and timing instrumentation across the Doltgres branch-scoping and ref-resolution hot path to help identify slow database operations in production. Slow operations (>5 s) are logged at info level with elapsed-time metadata.

Key changes

  • Add slow-operation logging to withRef and branchScopedDbMiddleware — Pool connect, branch checkout, cleanup, and total middleware duration are timed; any step exceeding 5 s emits a structured logger.info with elapsed ms.
  • Add timing instrumentation in ref.ts and ref-helpers.ts — resolveRef calls and the underlying doltListBranches query are timed; slow resolutions and missing-branch cases are logged with branch metadata.
  • Add diagnostic logging to the TriggerService execution path — dispatchExecution, executeAgentAsync, project loading, and conversation creation each log timing and context fields.
  • Add slow-withRef warning in getFunctionTools — Logs when function-tool loading takes more than 5 s, including the ref name and sub-agent ID.

Summary | 7 files | 4 commits | base: main ← logging-bug


Slow-operation instrumentation for Doltgres ref scoping

Before: withRef and branchScopedDbMiddleware had no visibility into per-step latency — slow pool connects, branch checkouts, or cleanup queries were invisible.
After: Each critical step (pool.connect, checkoutBranch, cleanup, total duration) is individually timed and logged at info level when it exceeds 5 seconds, with structured fields (connectMs, checkoutMs, totalMs, ref).

The same pattern is applied symmetrically in both withRef (used for scoped reads in DAL functions) and branchScopedDbMiddleware (used in request middleware). Cleanup failures now also log elapsed time and the error message for post-mortem analysis.

ref-scope.ts · branchScopedDb.ts


Ref resolution and branch-not-found diagnostics

Before: When resolveRef couldn't find a branch, it silently returned null — no log of what branches existed or how long the query took.
After: A missing branch logs the full branch list, query duration, and total resolution time. Slow resolveRef calls (>5 s) in the ref middleware are logged separately, and thrown errors now emit a structured info log before the existing warn.

Why log the full branch list?

This is a debugging aid for cases where a tenant-scoped branch name doesn't match any Doltgres branch — knowing the actual branch names available helps determine whether the issue is a naming mismatch, a missing migration, or a replication delay.

ref-helpers.ts · ref.ts


Trigger execution and agent lifecycle logging

Before: dispatchExecution and executeAgentAsync had limited logging — project-load timing was not captured and conversation creation was unlogged.
After: Each phase (dispatchExecution entry, executeAgentAsync start, project load with loadProjectMs, conversation creation with convMs) emits structured info logs with invocation and timing context.

TriggerService.ts · function-tools.ts


Contributor

@claude claude bot left a comment


PR Review Summary

(6) Total Issues | Risk: High

🔴❗ Critical (1) ❗🔴

Inline Comments:

  • 🔴 Critical: ref-helpers.ts:56 Logging all branch names exposes tenant/project identifiers in logs

🟠⚠️ Major (2) 🟠⚠️

Inline Comments:

  • 🟠 Major: ref-helpers.ts:53 Unconditional INFO log on expected code path creates noise
  • 🟠 Major: ref.ts:241-248 Duplicate logging — error logged at both INFO and WARN levels

🟡 Minor (2) 🟡

Inline Comments:

  • 🟡 Minor: ref-scope.ts:265-273 Duplicate logging — cleanup failure logged at INFO and ERROR
  • 🟡 Minor: branchScopedDb.ts:165-170 Same duplicate logging pattern

💭 Consider (1) 💭

Inline Comments:

  • 💭 Consider: EvaluationService.ts:157-165 Downgrade to DEBUG or remove — routine operation

✅ What's Good

The PR demonstrates good observability practices:

  • Consistent 5s threshold for slow operation detection across all files
  • Consistent naming for timing variables (xxxMs pattern)
  • Structured logging with relevant context (ref names, connection IDs, durations)
  • Strategic instrumentation points covering pool.connect, branch checkout, and cleanup operations

📋 Changeset Required

Per AGENTS.md, this PR modifies agents-api and agents-core packages with runtime behavior changes (new logging). A changeset is needed:

pnpm bump patch --pkg agents-api --pkg agents-core "Add slow operation logging for Doltgres database operations"

🚫 REQUEST CHANGES

Summary: The observability logging approach is solid, but the critical issue of logging all branch names (which contain tenant/project IDs) needs to be fixed before merge. This creates a data isolation concern where one tenant's identifiers could appear in logs triggered by another tenant's request. Additionally, several duplicate logging patterns (INFO + WARN/ERROR for same event) should be consolidated to follow the established api-logging-guidelines.

Discarded (3)
| Location | Issue | Reason Discarded |
| --- | --- | --- |
| TriggerService.ts:549 | Duplicate dispatch log near executeAgentAsync:starting | Valid observation but low severity — the logs serve slightly different purposes (dispatch vs execution start) and include different context fields |
| multi-file | Duplicate slow-operation patterns in ref-scope.ts and branchScopedDb.ts | Not a bug — both serve different purposes (middleware vs utility). Shared abstraction would be nice-to-have but not required for this PR |
| TriggerService.ts:549-915 | Inconsistent context fields across logs | Valid observation but these logs serve different phases of execution where different context is relevant |
Reviewers (3)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pr-review-sre | 6 | 0 | 0 | 0 | 4 | 0 | 2 |
| pr-review-consistency | 6 | 0 | 0 | 0 | 2 | 0 | 1 |
| pr-review-standards | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total | 13 | 0 | 0 | 0 | 6 | 0 | 3 |

Note: Findings from pr-review-standards overlapped with pr-review-sre on the branchNames issue.

if (!branch) {
const totalMs = Date.now() - startTime;
logger.info(
{
Contributor


🔴 CRITICAL: Logging all branch names exposes tenant/project identifiers

Issue: Branch names follow the pattern {tenantId}_{projectId}_{branchName}. Logging ALL branch names via branchNames: branches.map((b) => b.name) leaks every tenant and project ID in the database whenever any branch resolution fails.

Why: This creates:

  1. A compliance/data-isolation issue where one tenant's IDs appear in logs caused by another tenant's request
  2. Unbounded log line size as branch count grows (could be hundreds or thousands)
  3. Log ingestion costs scaling with total branch count per failed lookup

Fix: Remove branchNames from the log payload:

Suggested change
{
logger.info(
{ ref, branchCount: branches.length, branchQueryMs, totalMs },
'resolveRef: branch not found in dolt_branches result'
);

Refs:

const branchQueryMs = Date.now() - branchQueryStart;
const branch = branches.find((b) => b.name === ref);

if (!branch) {
Contributor


🟠 MAJOR: Unconditional INFO log on expected code path creates noise

Issue: resolveRef is called multiple times during fallback resolution (e.g., try project-scoped, then raw ref). A not-found result is often expected, not exceptional. Logging INFO on every miss inflates log volume.

Why: Per api-logging-guidelines, INFO is for "important business events, NOT routine operations". Branch-not-found is a normal return path — the caller handles null. This will fire on every request that uses a non-existent or fallback ref.

Fix: Downgrade to DEBUG level, or let the caller log at INFO when resolution ultimately fails:

Suggested change
if (!branch) {
if (!branch) {
const totalMs = Date.now() - startTime;
logger.debug(
{ ref, branchCount: branches.length, branchQueryMs, totalMs },
'resolveRef: branch not found in dolt_branches result'
);
}

Comment on lines 241 to 248
} catch (error) {
logger.info(
{ projectMain, error: error instanceof Error ? error.message : String(error) },
'resolveRef threw in ref middleware'
);
logger.warn({ error, projectMain }, 'Failed to resolve project main branch');
refResult = null;
}
Contributor


🟠 MAJOR: Duplicate logging — error logged at both INFO and WARN levels

Issue: Lines 242-245 log at INFO level, then line 246 logs the same error at WARN level. This creates redundant log entries for the same event.

Why: The established codebase pattern (per api-logging-guidelines) is to use WARN for recoverable issues, not to double-log at multiple levels. This adds noise and makes log analysis harder.

Fix: Remove the INFO log and keep only the WARN log:

Suggested change
} catch (error) {
logger.info(
{ projectMain, error: error instanceof Error ? error.message : String(error) },
'resolveRef threw in ref middleware'
);
logger.warn({ error, projectMain }, 'Failed to resolve project main branch');
refResult = null;
}
} catch (error) {
logger.warn({ error, projectMain }, 'Failed to resolve project main branch');
refResult = null;
}

Comment on lines +265 to 273
ref: resolvedRef.name,
connectionId,
cleanupMs: Date.now() - cleanupStart,
error: cleanupError instanceof Error ? cleanupError.message : String(cleanupError),
},
'withRef cleanup failed'
);
logger.error(
{ error: cleanupError, tempBranch, connectionId },
Contributor


🟡 Minor: Duplicate logging — cleanup failure logged at both INFO and ERROR levels

Issue: Lines 266-269 log at INFO level, then lines 270-273 log the same error at ERROR level. The same pattern appears in branchScopedDb.ts.

Why: Cleanup failures are unexpected errors warranting ERROR level, not INFO. Double-logging the same event at different levels creates noise.

Fix: Keep only the ERROR log:

Suggested change
ref: resolvedRef.name,
connectionId,
cleanupMs: Date.now() - cleanupStart,
error: cleanupError instanceof Error ? cleanupError.message : String(cleanupError),
},
'withRef cleanup failed'
);
logger.error(
{ error: cleanupError, tempBranch, connectionId },
} catch (cleanupError) {
logger.error(
{ error: cleanupError, tempBranch, connectionId, cleanupMs: Date.now() - cleanupStart },
'Error during ref scope cleanup'
);

Comment on lines +165 to +170
},
'branchScopedDb cleanup failed'
);
logger.error({ error: cleanupError }, 'Error during connection cleanup');
} finally {
const totalMs = Date.now() - mwStartTime;
Contributor


🟡 Minor: Same duplicate logging pattern as ref-scope.ts

Issue: Lines 166-169 log at INFO, then line 170 logs the same error at ERROR level.

Fix: Keep only the ERROR log, include cleanupMs in it:

Suggested change
},
'branchScopedDb cleanup failed'
);
logger.error({ error: cleanupError }, 'Error during connection cleanup');
} finally {
const totalMs = Date.now() - mwStartTime;
} catch (cleanupError) {
logger.error({ error: cleanupError, ref: resolvedRef.name, cleanupMs: Date.now() - cleanupStart }, 'Error during connection cleanup');

Comment on lines +157 to +165
logger.info(
{
evaluatorId: evaluator.id,
conversationId: conversation.id,
hasExpectedOutput: expectedOutput != null,
expectedOutputLength: expectedOutputText?.length ?? 0,
},
'Building evaluation prompt'
);
Contributor


💭 Consider: Downgrade to DEBUG or remove — this is a routine operation

Issue: This INFO log fires on every evaluation execution, not gated by any threshold. Per api-logging-guidelines, INFO should be for "important events, NOT routine operations".

Why: The other new logs in this PR correctly use threshold checks (>5_000ms), but this one does not. Building evaluation prompts is a routine operation already covered by adjacent logs at lines 146-152 (schema info) and 111-117 (trace completion).

Fix: Consider DEBUG level if this detail is needed for troubleshooting, or remove if redundant with existing logs.

@github-actions github-actions bot deleted a comment from claude bot Apr 3, 2026
@github-actions
Contributor

github-actions bot commented Apr 3, 2026

Preview URLs

Use these stable preview aliases for testing this PR:

These point to the same Vercel preview deployment as the bot comment, but they stay stable and easier to find.

Raw Vercel deployment URLs

Contributor

@claude claude bot left a comment


PR Review Summary

(5) Total Issues | Risk: High

Re-review scope: This review covers only changes since the last automated review. The delta (c710dae06) removed 10 lines of unnecessary logging from EvaluationService.ts, addressing the "Consider" item from the prior review. No new code was introduced.

✅ Addressed Since Last Review

The following issue was addressed by the latest commit:

  • 💭 EvaluationService.ts:157-165 — Unnecessary INFO log removed ✅

🕐 Pending Recommendations (5)

The following issues from the prior review remain unresolved:

📋 Changeset Still Required

Per AGENTS.md, this PR modifies agents-api and agents-core packages with runtime behavior changes. A changeset is needed:

pnpm bump patch --pkg agents-api --pkg agents-core "Add slow operation logging for Doltgres database operations"

🚫 REQUEST CHANGES

Summary: Good progress — the unnecessary EvaluationService logging has been removed. However, the Critical issue (logging all branch names which contain tenant/project IDs) and Major issues (duplicate logging patterns, unconditional INFO logging) from the prior review remain unaddressed. Please fix the branchNames data leak in ref-helpers.ts:61 before merge — this is a data isolation concern where one tenant's identifiers could appear in logs triggered by another tenant's request.

Reviewers (0)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| N/A | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Note: Delta review — no new code to analyze, so no reviewers dispatched. Prior review findings carried forward as Pending Recommendations.

Contributor

@claude claude bot left a comment


PR Review Summary

(6) Total Issues | Risk: High

Re-review scope: This review covers only changes since the last automated review. The delta (f26d76df6) added a changeset file. All prior issues remain unaddressed.

🟡 Minor (1) 🟡

Inline Comments:

  • 🟡 Minor: .changeset/itchy-cups-join.md:6 Changeset message has typo ("for for") and trailing quote

🕐 Pending Recommendations (5)

The following issues from the prior review remain unresolved:


🚫 REQUEST CHANGES

Summary: A changeset was added (good!), but it has a minor typo. More importantly, the 5 issues from the prior review remain unaddressed. The critical issue is logging all branch names (branchNames: branches.map((b) => b.name)) in ref-helpers.ts:61, which leaks tenant/project identifiers from the entire database into logs whenever any branch resolution fails. This is a data isolation concern that must be fixed before merge.

Discarded (0)

None — delta contained only the changeset file.

Reviewers (0)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
| --- | --- | --- | --- | --- | --- | --- | --- |
| N/A | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Note: Delta review — only a changeset file was added, so no code reviewers dispatched. Prior review findings carried forward as Pending Recommendations.

"@inkeep/agents-api": patch
---

logging for for Doltgres database operations"
Contributor


🟡 Minor: Changeset message has typo and formatting issue

Issue: The message contains a duplicate "for for" and a trailing double quote that shouldn't be there.

Why: Changeset messages appear in release notes and changelog. Typos look unprofessional and could confuse consumers.

Fix: (1-click apply)

Suggested change
logging for for Doltgres database operations"
Add slow operation logging for Doltgres database operations

Refs:

}
);
const withRefMs = Date.now() - withRefStart;
if (withRefMs > 5_000) {
Collaborator


Optionally, you could just always log the elapsed time, in case the 5-second threshold doesn't capture what you're looking for.
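That suggestion — log the duration on every call, varying only the level — might look like the sketch below. The `logger` stub and function name are illustrative assumptions, not the repo's actual code.

```typescript
const THRESHOLD_MS = 5_000;

// Stand-in structured logger; illustrative only.
const logger = {
  debug: (fields: object, msg: string) => console.log("debug:", msg, fields),
  info: (fields: object, msg: string) => console.log("info:", msg, fields),
};

// Always log the withRef duration: debug under the threshold so normal latency
// is still recoverable from logs, info above it so slow calls stand out.
function logWithRefDuration(withRefMs: number, ref: string): "debug" | "info" {
  const level = withRefMs > THRESHOLD_MS ? "info" : "debug";
  logger[level]({ ref, withRefMs }, "withRef completed");
  return level;
}
```

This keeps the info-level noise profile of the threshold approach while preserving the full latency distribution at debug.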

@shagun-singh-inkeep shagun-singh-inkeep added this pull request to the merge queue Apr 3, 2026
Merged via the queue into main with commit 528f69c Apr 3, 2026
34 checks passed
@shagun-singh-inkeep shagun-singh-inkeep deleted the logging-bug branch April 3, 2026 23:12
@itoqa

itoqa bot commented Apr 3, 2026

Ito Test Report ✅

13 test cases ran. 3 additional findings, 10 passed.

Across the unified run, 13 test cases executed, with 10 passing and 3 failing. The passing cases confirm that most core behaviors are stable, including valid webhook execution, scheduled rerun semantics and active-invocation guards, deep-link/mobile trigger UX, broken-ref handling, inert rendering of malicious payload text, run-as authorization boundaries, and repeated chat completions on the same conversation. The key defects were a High-severity security gap where malformed webhook signatures can still be accepted when signature verification is configured without a signing secret; a High-severity reliability bug where parallel manual scheduled run/rerun requests can collide on millisecond-based idempotency keys and throw 500s; and a Medium contract regression where webhook rerun with an empty JSON body is rejected with 400 because userMessage is incorrectly required.

✅ Passed (10)
| Category | Summary | Screenshot |
| --- | --- | --- |
| Adversarial | Non-admin run-now and rerun on a delegated scheduled trigger were rejected with 403 and produced no extra invocations. | ADV-2 |
| Adversarial | Malicious payload strings were rendered as escaped text and stayed inert through invocation detail and conversation views. | ADV-3 |
| Edge | Concurrent run-now requests both succeeded with distinct invocation IDs, and the invocations table remained stable without stuck loading or UI corruption. | EDGE-1 |
| Edge | Rerun attempt on a running invocation was correctly rejected with 400 and did not create an extra invocation. | EDGE-2 |
| Edge | Re-test confirmed broken-ref run-now returns HTTP 400 and the invocations UI remains stable with no new pending invocation. | EDGE-3 |
| Edge | Scheduled and webhook invocation deep-link pages stayed stable across reload/back-forward cycles, with no blank or crash state. | EDGE-4 |
| Logic | Two consecutive /run/v1/chat/completions calls with the same conversationId completed successfully without runtime errors after fixture correction. | LOGIC-1 |
| Mobile | In iPhone 13 viewport, tab switching and trigger actions remained usable; Run Now succeeded and a new Running invocation row appeared. | MOBILE-1 |
| Happy-path | Valid signed webhook request returned 202 and created invocation/conversation IDs; invocation appeared in UI and reached a terminal state. | ROUTE-1 |
| Happy-path | Scheduled invocation rerun returned a distinct new invocation ID while keeping the original terminal invocation unchanged. | ROUTE-4 |
ℹ️ Additional Findings (3)

These findings are unrelated to the current changes but were observed during testing.

| Category | Summary | Screenshot |
| --- | --- | --- |
| Adversarial | ⚠️ Missing and invalid auth were rejected (401/403), but malformed signature sha256=invalidsig was accepted (202) and created invocation 56btzzwkge6f1q4mexq3h. | ADV-1 |
| Logic | ⚠️ Parallel scheduled trigger manage requests can return 500 when idempotency keys collide in the same millisecond. | LOGIC-2 |
| Happy-path | 🟠 Manual rerun endpoint rejects an empty rerun body with HTTP 400 because userMessage is required, violating the expected rerun behavior for omitted message fields. | ROUTE-2 |
⚠️ Webhook authentication and signature bypass attempts fail closed
  • What failed: The malformed-signature request was accepted with 202 and created an invocation, but signature-protected webhooks should reject invalid signatures.
  • Impact: Attackers who obtain or guess the auth header can dispatch webhook executions without a valid HMAC signature. This weakens fail-closed protection for webhook triggers that appear to be configured with signature verification.
  • Steps to reproduce:
    1. Create a webhook trigger with authentication headers and signatureVerification enabled.
    2. Send a request without auth headers and confirm it is rejected.
    3. Send a request with invalid auth header and confirm it is rejected.
    4. Send a request with valid auth but malformed X-Signature sha256=invalidsig and observe that it is accepted with 202 and creates an invocation.
  • Stub / mock context: The trigger was provisioned through a local admin bypass setup path. It was configured with signature verification fields but without a signing-secret credential attachment, which left signature checks effectively unenforced during this run.
  • Code analysis: I reviewed trigger creation validation and run-time verification paths. Creation accepts signatureVerification without requiring a signing secret credential, and run-time verification explicitly bypasses signature checks when signingSecretCredentialReferenceId is absent, which matches the observed acceptance of malformed signatures.
  • Why this is likely a bug: The code permits storing signature-verification config without a signing secret and then treats that state as success at run time, allowing invalid signatures through instead of failing closed.

Relevant code:

agents-api/src/domains/manage/routes/triggers.ts (lines 246-267)

// Validate credential reference exists if provided
if (body.signingSecretCredentialReferenceId) {
  const credentialRef = await getCredentialReference(db)({
    scopes: { tenantId, projectId },
    id: body.signingSecretCredentialReferenceId,
  });

  if (!credentialRef) {
    throw createApiError({
      code: 'bad_request',
      message: `Credential reference not found: ${body.signingSecretCredentialReferenceId}`,
    });
  }
}

packages/agents-core/src/validation/schemas.ts (lines 714-721)

signingSecretCredentialReferenceId: () =>
  z.string().optional().describe('Reference to credential containing signing secret'),
signatureVerification: () =>
  SignatureVerificationConfigSchema.nullish()
    .superRefine((config, ctx) => {
      if (!config) return;

agents-api/src/domains/run/services/TriggerService.ts (lines 352-355)

// Skip verification if no signature verification is configured
if (!trigger.signatureVerification || !trigger.signingSecretCredentialReferenceId) {
  return { success: true };
}
⚠️ Parallel manage + run traffic does not cross-contaminate ref/session behavior
  • What failed: Under parallel traffic, requests that should succeed can fail with an unexpected 500 because duplicate idempotency keys are inserted; expected behavior is deterministic success/rejection without internal server error.
  • Impact: Concurrent trigger operations can intermittently fail and drop manual executions during traffic spikes. This undermines reliability for operational run-now/rerun workflows.
  • Steps to reproduce:
    1. Open a local non-production environment with an existing scheduled trigger under a tenant/project/agent.
    2. Send multiple manage run/rerun requests for that trigger in parallel so at least two requests are issued within the same millisecond.
    3. Observe that one or more requests can return an unexpected 500 due to duplicate idempotency key insertion.
  • Stub / mock context: A local QA tenant/project/agent fixture was created, manage requests used a development bypass secret, and run requests used test webhook/API credentials. No route interception or response stubbing was applied, so this exercised real local service behavior.
  • Code analysis: I inspected scheduled trigger run/rerun route handlers and runtime DB schema. Both handlers derive idempotency keys from Date.now() values, while the runtime table enforces a unique constraint on idempotency_key, creating a collision path under same-ms parallel requests.
  • Why this is likely a bug: The production code combines millisecond-based key generation with a global uniqueness constraint, so concurrent same-ms requests can deterministically violate the unique index and surface 500s.

Relevant code:

agents-api/src/domains/manage/routes/scheduledTriggers.ts (lines 1049-1060)

await createScheduledTriggerInvocation(runDbClient)({
  id: newInvocationId,
  tenantId,
  projectId,
  agentId,
  scheduledTriggerId,
  ref: resolvedRef,
  status: 'pending',
  scheduledFor: new Date().toISOString(),
  idempotencyKey: `manual-rerun-${invocationId}-${Date.now()}`,
  attemptNumber: 1,
});

agents-api/src/domains/manage/routes/scheduledTriggers.ts (lines 1377-1388)

await createScheduledTriggerInvocation(runDbClient)({
  id: invocationId,
  tenantId,
  projectId,
  agentId,
  scheduledTriggerId,
  ref: resolvedRef,
  status: 'pending',
  scheduledFor: new Date().toISOString(),
  idempotencyKey: `manual-run-${scheduledTriggerId}-${Date.now()}`,
  attemptNumber: 1,
});

packages/agents-core/src/db/runtime/runtime-schema.ts (lines 417-425)

idempotencyKey: varchar('idempotency_key', { length: 256 }).notNull(),
createdAt: timestamp('created_at', { withTimezone: true, mode: 'string' })
  .notNull()
  .defaultNow(),
},
(table) => [
  primaryKey({ columns: [table.tenantId, table.id] }),
  uniqueIndex('sched_invocations_idempotency_idx').on(table.idempotencyKey),
]
🟠 Manual webhook trigger rerun from manage route remains functional
  • What failed: The endpoint returns HTTP 400 (userMessage missing) instead of accepting rerun input where userMessage/messageParts are omitted.
  • Impact: Manual rerun callers that omit message fields cannot dispatch reruns and fail immediately with validation errors. This breaks compatibility for clients expecting server-side rerun defaults.
  • Steps to reproduce:
    1. Open a valid tenant/project/agent trigger context in the local environment.
    2. Send POST /manage/tenants/{tenantId}/projects/{projectId}/agents/{agentId}/triggers/{triggerId}/rerun with Content-Type: application/json and body {}.
    3. Observe the API response status and body.
    4. Confirm the response is HTTP 400 with a validation error for missing userMessage.
  • Stub / mock context: The rerun call used the local development bypass authorization path for manage endpoints, so authentication was simulated while endpoint behavior and payload validation were exercised against production code paths.
  • Code analysis: I reviewed the rerun route schema and handler in the manage API plus the manage UI caller path. The backend enforces userMessage as required and directly uses it for dispatch, so empty-body reruns are rejected before execution.
  • Why this is likely a bug: The production route currently rejects empty rerun bodies at schema validation, which directly explains the observed 400 and conflicts with the expected rerun contract for omitted message fields in this test.

Relevant code:

agents-api/src/domains/manage/routes/triggers.ts (lines 771-777)

schema: z.object({
  userMessage: z.string().describe('The user message to send to the agent'),
  messageParts: z
    .array(PartSchema)
    .optional()
    .describe('Optional structured message parts (from original trace)'),
}),

agents-api/src/domains/manage/routes/triggers.ts (lines 803-853)

const { userMessage, messageParts: rawMessageParts } = c.req.valid('json');
const messageParts = rawMessageParts ?? [{ kind: 'text' as const, text: userMessage }];

({ invocationId, conversationId } = await dispatchExecution({
  tenantId,
  projectId,
  agentId,
  triggerId,
  resolvedRef,
  payload: { _rerun: true },
  transformedPayload: undefined,
  messageParts,
  userMessageText: userMessage,
  runAsUserId: trigger.runAsUserId ?? undefined,
}));

agents-manage-ui/src/app/[tenantId]/projects/[projectId]/traces/conversations/[conversationId]/page.tsx (lines 158-185)

if (!userMessageActivity?.messageContent) {
  toast.error('No user message found in trace to rerun');
  return;
}

const result = await rerunTriggerAction(
  tenantId,
  projectId,
  conversation.agentId,
  conversation.triggerId,
  {
    userMessage: userMessageActivity.messageContent,
    messageParts,
  }
);

Commit: f26d76d
