ArtifactService binary sanitization and child artifacts#2745
ArtifactService binary sanitization and child artifacts#2745mike-inkeep wants to merge 1 commit intostack/artifact_binary_sanitizerfrom
Conversation
|
|
The latest updates on your projects. Learn more about Vercel for GitHub.
1 Skipped Deployment
|
There was a problem hiding this comment.
PR Review Summary
(5) Total Issues | Risk: Medium
🟠⚠️ Major (3) 🟠⚠️
🟠 1) ArtifactService.ts:1007-1059 Partial batch failure leaves inconsistent state
Issue: The for-loop in createBinaryChildArtifacts iterates through binary parts and calls upsertLedgerArtifact for each one sequentially. If the database insert fails mid-way through the batch (e.g., on the 3rd of 5 parts), some child artifacts are persisted while others are not. The error propagates up, but the refs Map will be incomplete.
Why: Consider an artifact with 5 images: if child artifact creation succeeds for images 1-2 but fails on image 3, the error propagates and saveArtifact fails. However, child artifacts 1-2 remain in the database as orphaned records. If retried, deduplication may behave unexpectedly. The parent artifact's data structure would be inconsistent - some binary parts with artifactRef, others without.
Fix: Options to consider:
- Wrap in transaction for atomicity (if supported by the upsert function)
- Use
Promise.allto fail atomically if any fail:
const insertPromises = binaryParts.map(async (part) => {
const hash = this.extractContentHashFromBlobUri(part.data) || this.fallbackHash(part.data);
// ... build childArtifactId, check dedupe ...
await upsertLedgerArtifact(runDbClient)({ ... });
return { blobUri: part.data, ref: { artifactId, toolCallId } };
});
const results = await Promise.all(insertPromises);- Catch per-part errors and continue with partial success + explicit tracking
Refs:
🟠 2) ArtifactService.ts:906-911 Binary child artifacts only created for fullData, not summaryData
Issue: Binary child artifacts are created by traversing sanitizedData only. If artifact.summaryData contains blob-backed binary parts that are NOT present in artifact.data, those binaries will not have child artifacts created. The attachBinaryArtifactRefs call on summaryData will then fail to find refs for those parts, leaving them without artifactRef.
Why: If a caller provides different binary content in summaryData vs data (e.g., a thumbnail in summary, full image in data), the summary's thumbnail would have an orphaned blob URI with no child artifact record.
Fix: Consider whether summaryData can contain different binary parts than data. If yes, collect binary parts from both:
const binaryReferences = await this.createBinaryChildArtifacts({
parentArtifactId: artifact.artifactId,
parentArtifactType: artifact.type,
toolCallId: artifact.toolCallId,
value: sanitizedSummaryData ? [sanitizedData, sanitizedSummaryData] : sanitizedData,
});Or document that summaryData must be a subset of data for binary content.
Refs:
🟠 3) system Implicit contract change: artifact data now includes artifactRef for binary parts
Issue: The PR introduces an implicit data contract change. Artifact data structures (summaryData and fullData) will now contain artifactRef objects embedded within binary parts:
Before: { type: 'image', data: 'blob://...', mimeType: 'image/png' }
After: { type: 'image', data: 'blob://...', mimeType: 'image/png', artifactRef: { artifactId: 'bin_...', toolCallId: '...' } }
Why: This affects consumers retrieving artifacts via:
- Chat API (Vercel AI SDK Data Stream)
getContextArtifacts,getArtifactSummary,getArtifactFull- Streaming via
data-artifactevents - UI components rendering artifact data
While additive fields are generally backward-compatible, consumers doing strict validation or field enumeration may be affected.
Fix: Consider documenting this contract change. If artifactRef is intended to be a stable API for clients to resolve binary artifacts, add it to validation schemas and document the shape.
Refs:
Inline Comments:
- 🟠 Major:
ArtifactService.ts:997Silent fallback when context is missing - 🟠 Major:
ArtifactService.ts:1163-1165Regex requires literal.after hash
🟡 Minor (1) 🟡
Inline Comments:
- 🟡 Minor:
ArtifactService.ts:1148-1160Type guard returnstype: stringinstead of literal union
💭 Consider (1) 💭
💭 1) ArtifactService.ts:1044-1051 Child artifacts use metadata.parentArtifactId instead of existing derivedFrom column
Issue: The ledger_artifacts table already has a derivedFrom column designed for parent-child artifact relationships. This PR stores the relationship in metadata.parentArtifactId instead, creating two mechanisms for the same concept.
Why: The derivedFrom column is already wired into the DAL (maps from metadata.derivedFrom). Having two ways to express parent-child relationships may lead to inconsistent queries.
Fix: Consider using derivedFrom in metadata (or directly) for consistency:
metadata: {
derivedFrom: params.parentArtifactId, // Uses existing column
parentArtifactType: params.parentArtifactType,
// ...
}Refs:
Inline Comments:
- 💭 Consider:
ArtifactService.test.ts:1052Additional test coverage for edge cases
💡 APPROVE WITH SUGGESTIONS
Summary: This PR introduces a useful pattern for managing binary content in artifacts by extracting binaries into child artifacts with deduplicated storage. The main areas for improvement are: (1) partial batch failures can leave inconsistent state with orphaned child artifacts, (2) missing logging for silent fallbacks makes debugging difficult, and (3) the implicit contract change should be documented. The test coverage is a good start but could be expanded to cover edge cases. None of these are blockers, but addressing the error handling and logging would improve production reliability.
Discarded (9)
| Location | Issue | Reason Discarded |
|---|---|---|
ArtifactService.ts:1007 |
Sequential upserts for child artifacts | Intentional for deduplication logic; parallelization would break the dedupe Map accumulation |
ArtifactService.ts:1008-1014 |
Deduplication scope limited to tool call | Working as designed - per-tool-call isolation provides clearer provenance |
ArtifactService.ts:1021 |
Synthetic toolCallId pattern is new | The :binary suffix is actually good practice - distinguishes from real tool call IDs |
ArtifactService.ts:1030 |
Child artifact type naming uses hyphens | Consistent with existing test data showing binary-child type |
ArtifactService.ts:1037 |
blobUri vs data field name |
Internal storage structure, not a consistency issue |
ArtifactService.ts:989-1061 |
Extract named interfaces for types | Valid suggestion but INFO level - design preference |
ArtifactService.ts:1064-1096 |
Return type looser than guard | Follows naturally from type guard fix - INFO level |
ArtifactService.ts:1168-1170 |
fallbackHash uses hex encoding not SHA | Fallback path for edge cases, collision risk is minimal |
ArtifactService.ts:1172-1179 |
bin_ prefix is new naming pattern |
Acceptable new pattern - clearly distinguishes binary children |
Reviewers (6)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
pr-review-errors |
4 | 1 | 0 | 0 | 1 | 0 | 2 |
pr-review-standards |
3 | 1 | 0 | 0 | 1 | 0 | 1 |
pr-review-architecture |
5 | 1 | 1 | 0 | 0 | 0 | 3 |
pr-review-tests |
7 | 0 | 0 | 0 | 1 | 0 | 6 |
pr-review-types |
3 | 0 | 0 | 0 | 1 | 0 | 2 |
pr-review-consistency |
6 | 0 | 0 | 0 | 0 | 0 | 6 |
| Total | 28 | 3 | 1 | 0 | 4 | 0 | 20 |
| }): Promise<Map<string, { artifactId: string; toolCallId: string }>> { | ||
| if (!this.context.taskId || !this.context.contextId) { | ||
| return new Map(); | ||
| } |
There was a problem hiding this comment.
🟠 MAJOR: Silent fallback when context is missing
Issue: When taskId or contextId is missing, createBinaryChildArtifacts silently returns an empty Map with no logging. The caller has no indication that binary child artifact creation was skipped.
Why: This creates orphaned blob URIs in parent artifacts that have no corresponding child artifact records. Debugging this in production would require correlating blob storage contents with artifact records - difficult to diagnose without logs indicating the skip occurred.
Fix:
| } | |
| if (!this.context.taskId || !this.context.contextId) { | |
| logger.warn( | |
| { parentArtifactId: params.parentArtifactId, hasTaskId: !!this.context.taskId, hasContextId: !!this.context.contextId }, | |
| 'Skipping binary child artifact creation due to missing context' | |
| ); | |
| return new Map(); | |
| } |
Refs:
| private extractContentHashFromBlobUri(blobUri: string): string | null { | ||
| const match = blobUri.match(/sha256-([a-f0-9]{16,64})\./i); | ||
| return match?.[1] || null; |
There was a problem hiding this comment.
🟠 MAJOR: Regex requires literal . after hash - fragile for URIs without extensions
Issue: The regex /sha256-([a-f0-9]{16,64})\./i requires a literal . after the hash. Blob URIs without an extension (e.g., blob://v1/.../sha256-abc123 with no trailing .ext) will fail to match, causing fallback to hex-encoding the entire URI.
Why: This could cause deduplication failures if the same binary is referenced via URIs with and without extensions. The fallback hash would differ from the extracted hash.
Fix:
| private extractContentHashFromBlobUri(blobUri: string): string | null { | |
| const match = blobUri.match(/sha256-([a-f0-9]{16,64})\./i); | |
| return match?.[1] || null; | |
| private extractContentHashFromBlobUri(blobUri: string): string | null { | |
| const match = blobUri.match(/sha256-([a-f0-9]{16,64})(?:\.|$)/i); | |
| return match?.[1] || null; | |
| } |
Refs:
- buildStorageKey always adds extension - current code may be safe if all blob URIs have extensions, but defensive regex is safer
| private isBlobBackedBinaryPart( | ||
| value: unknown | ||
| ): value is { type: string; data: string; mimeType?: string } { | ||
| if (!value || typeof value !== 'object' || Array.isArray(value)) { | ||
| return false; | ||
| } | ||
|
|
||
| const maybePart = value as Record<string, unknown>; | ||
| return ( | ||
| (maybePart.type === 'image' || maybePart.type === 'file') && | ||
| typeof maybePart.data === 'string' && | ||
| isBlobUri(maybePart.data) | ||
| ); |
There was a problem hiding this comment.
🟡 Minor: Type guard returns type: string instead of literal union
Issue: The type guard validates only 'image' | 'file' for the type field but returns a predicate with type: string. This loses type information that could enable exhaustive checking downstream.
Why: Downstream code consuming the narrowed type cannot safely narrow further since TypeScript sees type: string instead of the literal union.
Fix:
| private isBlobBackedBinaryPart( | |
| value: unknown | |
| ): value is { type: string; data: string; mimeType?: string } { | |
| if (!value || typeof value !== 'object' || Array.isArray(value)) { | |
| return false; | |
| } | |
| const maybePart = value as Record<string, unknown>; | |
| return ( | |
| (maybePart.type === 'image' || maybePart.type === 'file') && | |
| typeof maybePart.data === 'string' && | |
| isBlobUri(maybePart.data) | |
| ); | |
| private isBlobBackedBinaryPart( | |
| value: unknown | |
| ): value is { type: 'image' | 'file'; data: string; mimeType?: string } { | |
| if (!value || typeof value !== 'object' || Array.isArray(value)) { | |
| return false; | |
| } | |
| const maybePart = value as Record<string, unknown>; | |
| return ( | |
| (maybePart.type === 'image' || maybePart.type === 'file') && | |
| typeof maybePart.data === 'string' && | |
| isBlobUri(maybePart.data) | |
| ); | |
| } |
Refs:
- InlineBinaryPart in artifact-binary-sanitizer.ts - uses
type: 'image' | 'file'pattern
| expect(upsertFn).toHaveBeenCalledTimes(2); | ||
| }); | ||
| }); | ||
| }); |
There was a problem hiding this comment.
💭 Consider: Additional test coverage for edge cases
Issue: The new tests cover the happy path well but miss several edge cases:
- Missing context - No test for when
taskId/contextIdis missing (early return path) - No binary parts - No test confirming correct behavior when artifact has no blob-backed binaries
summaryDatapath - No test when bothdataandsummaryDataare providedtype: 'file'- All tests usetype: 'image'; thefiletype path is untested- Deeply nested parts - Tests only cover flat structures, not
{ outer: { inner: { images: [...] } } }
Why: These paths exist in the implementation and should have test coverage to prevent regressions.
Fix: Consider adding test cases for these scenarios. Example for missing context:
it('skips binary child artifact creation when taskId is missing', async () => {
const serviceWithoutTask = new ArtifactService({
...mockContext,
taskId: undefined,
});
// ... verify only parent artifact created
});Refs:














No description provided.