
Vision & Media Architecture -- Image Processing for AI Providers

Three-layer system: capture/storage, media processing (analyze/convert/resize/optimize), adapter formatting. Context-aware resizing using flexbox-style budget allocation. Phase 1 (basic format handling) complete.

Parent: Live

Status: Phase 1 Complete (Basic Format Handling) | Phases 2-5 Planned

Table of Contents

  1. Current Implementation
  2. Identified Problems
  3. Architecture Overview
  4. Media Processing Pipeline
  5. RAG Integration (CSS Flexbox Analogy)
  6. Implementation Phases
  7. Command Specifications
  8. Testing Strategy
  9. Success Metrics
  10. Open Questions

Current Implementation

✅ Phase 1: Basic Format Handling (COMPLETE)

What Works:

  • Vision models (Grok-vision, GPT-4V) receive multimodal ContentPart[] format
  • Non-vision models receive flattened text (images discarded)
  • Anthropic/Claude handles nested part.image.base64 format
  • OpenAI-compatible adapters detect vision capability per model

Files Modified:

  • adapters/anthropic/shared/AnthropicAdapter.ts:279-294 - Flexible format detection
  • shared/adapters/BaseOpenAICompatibleAdapter.ts:147-173 - Vision capability detection + message flattening
  • system/user/server/modules/PersonaResponseGenerator.ts:392-399 - ContentPart[] generation

Format Sent from PersonaResponseGenerator:

{
  type: 'image',
  image: {           // Nested (OpenAI-style)
    base64: string,
    mimeType: string
  }
}
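For reference, here is the ContentPart shape this format implies, as a TypeScript sketch (field names are taken from the examples in this document; the canonical type lives in the shared adapter code):

type TextPart = {
  type: 'text';
  text: string;
};

type ImagePart = {
  type: 'image';
  image: {
    base64: string;     // assumed raw base64 payload (no data: URI prefix)
    mimeType: string;   // e.g. 'image/png'
  };
};

type ContentPart = TextPart | ImagePart;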

Adapter Handling:

  • Grok/XAI: ✅ Works (BaseOpenAICompatibleAdapter)
  • Claude: ✅ Fixed (checks both part.base64 and part.image?.base64)
  • Groq/DeepSeek/Together/Fireworks: ✅ Fixed (non-vision models get plain text)

Identified Problems

🔴 Critical Production Issues

1. Images Discarded for Non-Vision Models

Problem: Non-vision models lose all visual information when images are present.

Example:

User sends: "Describe this meme [image of Steve Buscemi]"
Groq Lightning receives: "Describe this meme"  // Image lost!
Response: "I don't see any meme to describe."

Solution: Describe images using a vision model ONCE, insert descriptions as text.

2. Context Window Explosion

Problem: Base64 images consume massive amounts of context window.

Math:

1MB image (1920x1080 PNG)
→ Base64 encoding: ~1.33MB
→ As text: ~1,330,000 characters
→ Tokens (4 chars/token): ~332,500 tokens
→ With a 128K context window: ~2.6× the entire budget!
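
The same arithmetic as a quick sanity check (the 4-characters-per-token ratio is a rough heuristic, not an exact tokenizer):

// Worst-case token estimate for an image inlined as base64 text
function estimateBase64ImageTokens(imageBytes: number): number {
  const base64Chars = Math.ceil(imageBytes * 4 / 3);  // base64 inflates size by ~33%
  return Math.ceil(base64Chars / 4);                  // ~4 characters per token
}

estimateBase64ImageTokens(1_000_000);  // ≈ 333K tokens -- ~2.6× a 128K window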

Impact:

  • One image can exceed entire context window
  • RAG system can't include conversation history
  • API requests fail with "context_length_exceeded"

Solution: Intelligent resizing based on available context budget.

3. Format Incompatibility

Problem: Some models don't support certain image formats.

Examples:

  • WebP: Not supported by older vision models
  • AVIF: Cutting edge, limited support
  • HEIC: Apple format, limited support

Solution: Format conversion to safe fallbacks (PNG, JPEG).

4. No Cost Optimization

Problem: Sending full-resolution images wastes money.

Math:

Illustrative pricing at $0.01/1K tokens, assuming base64-length token accounting:
- Image (1920×1080, ~1MB): ~330K tokens ≈ $3.30 per message!
- Resized (512×288): ~20K tokens ≈ $0.20 per message
- Savings: ~94% cost reduction with minimal quality loss

(Tile-based providers such as GPT-4V count far fewer tokens per image -- see Token Estimation Methods below -- but resizing yields proportional savings there too.)

Solution: Resize to minimum viable resolution for task.

5. RAG Budget Integration Missing

Problem: Images don't participate in RAG's token budget system.

Current RAG System (works for text only):

const ragBudget = {
  total: 128000,           // Model's context window
  systemPrompt: 2000,      // Fixed
  messages: 100000,        // Flexible
  outputReserve: 4000      // Reserved for response
};

What's Missing: Images need to "flex" within available space, just like text messages.


Architecture Overview

The Three-Layer System

┌─────────────────────────────────────────────────────────┐
│ Layer 1: CAPTURE & STORAGE                             │
│ ─────────────────────────────────────────────────────── │
│ • chat/send receives media paths                        │
│ • file/load reads files as base64                       │
│ • file/mime-type detects format                         │
│ • ChatMessageEntity stores in database                  │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│ Layer 2: MEDIA PROCESSING (NEW)                        │
│ ─────────────────────────────────────────────────────── │
│ • media/analyze - Vision model describes image          │
│ • media/convert - Format conversion (webp → png)        │
│ • media/resize - Intelligent resizing for context       │
│ • media/optimize - Compression without quality loss     │
│ • media/estimate-tokens - Calculate token cost          │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│ Layer 3: ADAPTER FORMATTING (CURRENT)                  │
│ ─────────────────────────────────────────────────────── │
│ • Vision models: Format as ContentPart[]                │
│ • Non-vision models: Use text descriptions              │
│ • Provider-specific formatting (Anthropic vs OpenAI)    │
└─────────────────────────────────────────────────────────┘

Media Processing Pipeline

Phase 2: Image Description for Non-Vision Models

Goal: Non-vision models get semantic understanding via text descriptions.

Flow:

// In BaseOpenAICompatibleAdapter.generateText()

if (!supportsVision && hasImages) {
  // Describe images ONCE using vision model
  for (const imagePart of imageParts) {
    const description = await Commands.execute('media/analyze', {
      base64: imagePart.image.base64,
      mimeType: imagePart.image.mimeType,
      prompt: 'Describe this image in detail for a text-only AI model',
      context: this.context,
      sessionId: this.sessionId
    });

    // Replace image with description
    textParts.push(`[Image: ${description.text}]`);
  }

  // Send text-only message with descriptions
  const content = textParts.join('\n');
}

Caching Strategy:

  • Cache descriptions by image hash (SHA-256 of base64)
  • Avoid re-describing same image across multiple messages
  • Cache TTL: 24 hours (descriptions rarely change)
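
A minimal sketch of the cache-key step, assuming Node's built-in crypto module (the cache store itself is left abstract):

import { createHash } from 'node:crypto';

// Content-addressed key: identical images always hash to the same description
function imageCacheKey(base64: string): string {
  return createHash('sha256').update(base64).digest('hex');
}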

Provider Selection for Description:

  1. Try Grok-vision (fast, cheap, excellent quality)
  2. Fallback to GPT-4V (more expensive but very reliable)
  3. Fallback to Claude Sonnet 4.5 (most expensive, best quality)
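
One way this chain could be expressed, sketched with hypothetical provider IDs and an assumed availability check:

// Hypothetical: try providers in cost/speed order, fall through on failure
const VISION_FALLBACK_CHAIN = ['grok-vision', 'gpt-4v', 'claude-sonnet-4.5'];

async function selectVisionProvider(
  isAvailable: (id: string) => Promise<boolean>
): Promise<string> {
  for (const id of VISION_FALLBACK_CHAIN) {
    if (await isAvailable(id)) return id;
  }
  throw new Error('No vision-capable provider available');
}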

Phase 3: Context-Aware Resizing

Goal: Images fit within available context budget without exceeding limits.

Flow:

// In BaseOpenAICompatibleAdapter.generateText()

// 1. Calculate available context
const modelInfo = this.config.models?.find(m => m.id === model);
const contextWindow = modelInfo?.contextWindow || 128000;

// 2. Estimate current usage
const messagesTokens = this.estimateTokens(request.messages);
const systemPromptTokens = this.estimateTokens(request.systemPrompt);
const outputReserve = request.maxTokens || 4000;

// 3. Calculate available space for images
const availableForImages = contextWindow - messagesTokens - systemPromptTokens - outputReserve;

// 4. Resize each image to fit budget
for (const imagePart of imageParts) {
  const currentTokens = await Commands.execute('media/estimate-tokens', {
    base64: imagePart.image.base64,
    mimeType: imagePart.image.mimeType
  });

  if (currentTokens > availableForImages) {
    // Resize to fit
    const resized = await Commands.execute('media/resize', {
      base64: imagePart.image.base64,
      mimeType: imagePart.image.mimeType,
      maxTokens: availableForImages,
      preserveAspectRatio: true,
      quality: 85
    });

    imagePart.image.base64 = resized.base64;
    imagePart.image.mimeType = resized.mimeType;
  }
}

Token Estimation Methods:

Different models count image tokens differently:

  1. GPT-4V (Tile-Based):

    Tiles = ceil(width/512) × ceil(height/512)
    Tokens = (Tiles × 170) + 85 base
    
    Example (1920×1080):
    Tiles = 4 × 3 = 12
    Tokens = (12 × 170) + 85 = 2,125 tokens
    
  2. Claude (Base64 Length):

    Base64 chars = image bytes × 1.33
    Tokens ≈ Base64 chars / 4
    
    Example (1MB PNG):
    Base64 = 1,330,000 chars
    Tokens ≈ 332,500 tokens
    
  3. Grok (Similar to GPT-4V):

    Uses tile-based counting
    Far fewer tokens per image than base64-length accounting
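
The tile-based method as a function (the base64-length method matches estimateBase64ImageTokens sketched earlier):

// GPT-4V-style tile counting, per the formula above
function tileBasedTokens(width: number, height: number): number {
  const tiles = Math.ceil(width / 512) * Math.ceil(height / 512);
  return tiles * 170 + 85;
}

tileBasedTokens(1920, 1080);  // 12 tiles → 2,125 tokens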
    

Resizing Algorithm:

async function resizeToFitBudget(
  base64: string,
  mimeType: string,
  maxTokens: number,
  model: string
): Promise<{ base64: string, mimeType: string }> {
  // 1. Decode image dimensions
  const { width, height } = await getImageDimensions(base64);

  // 2. Calculate target dimensions
  let targetWidth = width;
  let targetHeight = height;

  while (estimateTokens(targetWidth, targetHeight, model) > maxTokens) {
    // Reduce by 20% each iteration
    targetWidth = Math.floor(targetWidth * 0.8);
    targetHeight = Math.floor(targetHeight * 0.8);

    // Guard against looping forever when maxTokens is below the model's
    // per-image floor (e.g. the 85-token base for tile-counted models)
    if (targetWidth < 64 || targetHeight < 64) break;
  }

  // 3. Resize with quality preservation
  return await Commands.execute('media/resize', {
    base64,
    mimeType,
    targetWidth,
    targetHeight,
    quality: 85,
    format: 'png'  // Safe fallback
  });
}

Phase 4: Format Conversion

Goal: Convert unsupported formats to safe fallbacks.

Supported Format Matrix:

| Format | GPT-4V | Claude | Grok | Gemini | Notes |
|--------|--------|--------|------|--------|-------|
| PNG    | ✅     | ✅     | ✅   | ✅     | Universal support |
| JPEG   | ✅     | ✅     | ✅   | ✅     | Universal support |
| WebP   | ✅     | ❌     | ✅   | ✅     | Claude doesn't support |
| GIF    | ✅     | ✅     | ✅   | ✅     | Static only (no animation) |
| AVIF   | ❌     | ❌     | ❌   | ❌     | Too new |
| HEIC   | ❌     | ❌     | ❌   | ❌     | Apple proprietary |

Conversion Strategy:

async function ensureCompatibleFormat(
  base64: string,
  mimeType: string,
  provider: string,
  model: string
): Promise<{ base64: string, mimeType: string }> {
  // Check if format is supported
  const supported = isFormatSupported(provider, model, mimeType);

  if (!supported) {
    // Convert to PNG (safest fallback)
    return await Commands.execute('media/convert', {
      base64,
      fromFormat: mimeType,
      toFormat: 'image/png',
      quality: 95  // High quality for conversion
    });
  }

  return { base64, mimeType };
}
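
A sketch of the isFormatSupported lookup this relies on, seeded from the matrix above (the entries are assumptions to verify against current provider docs, since support changes over time):

// Formats assumed safe per provider; anything absent triggers conversion
const SUPPORTED_FORMATS: Record<string, Set<string>> = {
  anthropic: new Set(['image/png', 'image/jpeg', 'image/gif']),
  openai:    new Set(['image/png', 'image/jpeg', 'image/webp', 'image/gif']),
  xai:       new Set(['image/png', 'image/jpeg', 'image/webp', 'image/gif'])
};

function isFormatSupported(provider: string, model: string, mimeType: string): boolean {
  // Per-model overrides could refine this; provider-level is the coarse default
  return SUPPORTED_FORMATS[provider]?.has(mimeType) ?? false;
}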

RAG Integration (CSS Flexbox Analogy)

The Flexbox Mental Model

Think of RAG's context window like a CSS flexbox container:

.context-window {
  display: flex;
  flex-direction: column;
  max-height: 128000px; /* tokens */
}

.system-prompt {
  flex: 0 0 2000px;  /* Fixed: 2000 tokens */
}

.messages {
  flex: 1 1 auto;    /* Flexible: Grows/shrinks */
}

.images {
  flex: 1 1 auto;    /* Flexible: Competes with messages */
  max-height: 50000px; /* Cap at 50K tokens */
}

.output-reserve {
  flex: 0 0 4000px;  /* Fixed: 4000 tokens */
}

Current RAG Budget System

Location: system/user/server/modules/rag-builders/ChatRAGBuilder.ts

interface RAGBudget {
  total: number;           // Context window
  systemPrompt: number;    // Fixed cost
  messages: number;        // Flexible (grows/shrinks)
  outputReserve: number;   // Fixed reserve
}

// Calculate message budget
const budget: RAGBudget = {
  total: modelContextWindow,
  systemPrompt: estimateTokens(systemPrompt),
  outputReserve: 4000,
  messages: 0  // Calculated below
};

budget.messages = budget.total - budget.systemPrompt - budget.outputReserve;

Enhanced RAG Budget with Images

New Structure:

interface EnhancedRAGBudget {
  total: number;           // Context window
  systemPrompt: number;    // Fixed cost
  messages: number;        // Flexible
  images: number;          // Flexible (NEW)
  outputReserve: number;   // Fixed reserve
}

// Dynamic allocation strategy
const budget: EnhancedRAGBudget = {
  total: modelContextWindow,
  systemPrompt: estimateTokens(systemPrompt),
  outputReserve: 4000,

  // Allocate remaining space between messages and images
  // Default split: 70% messages, 30% images
  messages: 0,  // Calculated
  images: 0     // Calculated
};

const available = budget.total - budget.systemPrompt - budget.outputReserve;
budget.messages = Math.floor(available * 0.7);
budget.images = Math.floor(available * 0.3);

Dynamic Resizing (Flexbox Behavior)

Scenario 1: No Images (messages grow to fill, like flex-grow: 1)

// Images don't exist, messages get full space
budget.messages = available;
budget.images = 0;

Scenario 2: Small Images (under budget)

// Images fit comfortably, no resizing needed
const imageTokens = 5000;
if (imageTokens < budget.images) {
  // Keep full resolution
  // Messages get remaining space
  budget.messages = available - imageTokens;
}

Scenario 3: Large Images (over budget - resize!)

// Images exceed budget, resize to fit (flex-shrink)
const imageTokens = 60000;  // Too large!
if (imageTokens > budget.images) {
  // Resize images to fit budget
  await resizeImagesTo(budget.images);

  // Messages get their allocated space
  budget.messages = Math.floor(available * 0.7);
}

Scenario 4: Too Many Messages (images get compressed)

// Many messages in history, reduce image budget
const messageCount = 50;
if (messageCount > 30) {
  // Shift allocation: 85% messages, 15% images
  budget.messages = Math.floor(available * 0.85);
  budget.images = Math.floor(available * 0.15);

  // Resize images to fit smaller budget
  await resizeImagesTo(budget.images);
}
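
The four scenarios collapse into one allocation function; a sketch using the split ratios and the 30-message threshold assumed above:

interface FlexAllocation {
  messages: number;
  images: number;   // image budget; images exceeding it get resized down
}

function allocateFlexBudget(
  available: number,     // total minus systemPrompt minus outputReserve
  imageTokens: number,   // estimated cost of images at current resolution
  messageCount: number
): FlexAllocation {
  // Scenario 1: no images, messages take everything
  if (imageTokens === 0) return { messages: available, images: 0 };

  // Scenario 4: long histories squeeze the image share from 30% to 15%
  const imageCap = Math.floor(available * (messageCount > 30 ? 0.15 : 0.3));

  // Scenario 2: images fit under the cap, messages absorb the slack
  if (imageTokens <= imageCap) {
    return { messages: available - imageTokens, images: imageTokens };
  }

  // Scenario 3: images over budget -- caller resizes them down to the cap
  return { messages: available - imageCap, images: imageCap };
}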

Implementation in ChatRAGBuilder

Location: Modify buildRAGContext() method

// In ChatRAGBuilder.buildRAGContext()

async buildRAGContext(params: {
  messages: ChatMessageEntity[],
  artifacts: Artifact[],
  systemPrompt: string,
  modelContextWindow: number
}): Promise<RAGContext> {
  // 1. Calculate base budget
  const budget = this.calculateBudget(params.modelContextWindow, params.systemPrompt);

  // 2. Detect images in artifacts
  const imageArtifacts = params.artifacts.filter(a => a.type === 'image');

  if (imageArtifacts.length > 0) {
    // 3. Estimate image token cost
    const imageTokens = await this.estimateImageTokens(imageArtifacts);

    // 4. Resize images if needed
    if (imageTokens > budget.images) {
      await this.resizeArtifactsToFitBudget(imageArtifacts, budget.images);
    }

    // 5. Adjust message budget (images took some space)
    const actualImageTokens = await this.estimateImageTokens(imageArtifacts);
    budget.messages = budget.total - budget.systemPrompt - budget.outputReserve - actualImageTokens;
  }

  // 6. Build message list within budget
  const messages = await this.selectMessagesWithinBudget(params.messages, budget.messages);

  return {
    messages,
    artifacts: imageArtifacts,
    systemPrompt: params.systemPrompt,
    budget
  };
}

Implementation Phases

✅ Phase 1: Basic Format Handling (COMPLETE)

Status: Deployed 2025-11-26

  • Vision capability detection per model
  • Format flexibility (nested/flat base64)
  • Non-vision models get plain text (images discarded)

🚧 Phase 2: Image Description (NEXT)

Estimated: 2-3 hours

Tasks:

  1. Create media/analyze command
    • Use Grok-vision to describe images
    • Cache descriptions by image hash
    • Return text descriptions
  2. Integrate into BaseOpenAICompatibleAdapter
    • Detect non-vision models with images
    • Call media/analyze for each image
    • Insert descriptions as [Image: ...] text
  3. Test with non-vision models
    • Groq Lightning should understand image content
    • DeepSeek should get semantic information

Success Criteria:

  • Non-vision models respond accurately to image content
  • Descriptions cached (avoid redundant API calls)
  • Cost under $0.01 per image description

🚧 Phase 3: Context-Aware Resizing

Estimated: 4-6 hours

Tasks:

  1. Implement token estimation
    • GPT-4V tile-based calculation
    • Claude base64 length estimation
    • Grok tile-based calculation
  2. Create media/resize command
    • Accept target token budget
    • Resize to fit within budget
    • Preserve aspect ratio
  3. Create media/estimate-tokens command
    • Calculate token cost per model
    • Account for different counting methods
  4. Integrate into BaseOpenAICompatibleAdapter
    • Calculate available context budget
    • Resize images before sending to API
    • Log token savings

Success Criteria:

  • Images never exceed context window
  • Automatic resizing maintains quality
  • Token usage reduced by 80%+ for large images

🚧 Phase 4: Format Conversion

Estimated: 2-3 hours

Tasks:

  1. Create media/convert command
    • Support webp → png
    • Support heic → jpeg
    • Support avif → png
  2. Build format compatibility matrix
    • Per provider (anthropic, openai, xai, etc.)
    • Per model (gpt-4v, claude-3-opus, etc.)
  3. Integrate into adapters
    • Auto-convert unsupported formats
    • Choose optimal output format
    • Log conversions

Success Criteria:

  • All formats work with all providers
  • Automatic conversion transparent to user
  • No format-related API errors

🚧 Phase 5: RAG Budget Integration

Estimated: 6-8 hours

Tasks:

  1. Extend RAGBudget interface
    • Add images field
    • Add allocation strategy
  2. Implement flexbox-style allocation
    • Dynamic split between messages/images
    • Adapt based on content
  3. Integrate image resizing into RAG builder
    • Resize artifacts to fit budget
    • Adjust message inclusion based on space
  4. Add budget monitoring
    • Log actual vs allocated tokens
    • Warn when approaching limits

Success Criteria:

  • Images and messages coexist within budget
  • Dynamic allocation prevents overflow
  • RAG builder never exceeds context window
  • Budget utilization >90% (efficient use of space)

Command Specifications

media/analyze (Phase 2)

Purpose: Describe images using vision models for semantic understanding.

Parameters:

interface MediaAnalyzeParams extends CommandParams {
  base64: string;           // Base64-encoded image
  mimeType: string;         // MIME type (image/png, etc.)
  prompt?: string;          // Custom prompt (optional)
  detail?: 'low' | 'high';  // Detail level (default: high)
  provider?: string;        // Force specific provider (optional)
}

Response:

interface MediaAnalyzeResult extends CommandResult {
  text: string;                    // Image description
  provider: string;                // Provider used (grok, openai, anthropic)
  model: string;                   // Model used (grok-vision-4, etc.)
  tokensUsed: number;              // Tokens consumed
  estimatedCost: number;           // Cost in USD
  cached: boolean;                 // Was description cached?
  cacheKey: string;                // SHA-256 hash for caching
}

Implementation:

// commands/media/analyze/server/MediaAnalyzeServerCommand.ts

async execute(params: MediaAnalyzeParams): Promise<MediaAnalyzeResult> {
  // 1. Generate cache key (SHA-256 of base64)
  const cacheKey = this.generateCacheKey(params.base64);

  // 2. Check cache
  const cached = await this.checkCache(cacheKey);
  if (cached) {
    return { ...cached, cached: true };
  }

  // 3. Select provider (Grok → GPT-4V → Claude)
  const provider = params.provider || await this.selectBestVisionProvider();

  // 4. Generate description
  const prompt = params.prompt ||
    'Describe this image in detail, focusing on key visual elements, text content, and overall context. Be concise but thorough.';

  const result = await Commands.execute('ai/generate', {
    provider,
    messages: [{
      role: 'user',
      content: [
        { type: 'text', text: prompt },
        { type: 'image', image: { base64: params.base64, mimeType: params.mimeType }}
      ]
    }],
    maxTokens: 500
  });

  // 5. Cache result
  await this.cacheDescription(cacheKey, result.text, provider, result.model);

  return {
    success: true,
    text: result.text,
    provider,
    model: result.model,
    tokensUsed: result.usage.totalTokens,
    estimatedCost: result.usage.estimatedCost,
    cached: false,
    cacheKey
  };
}
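
A hypothetical call site, matching the Phase 2 flow above:

const description = await Commands.execute('media/analyze', {
  base64: imagePart.image.base64,
  mimeType: imagePart.image.mimeType,
  prompt: 'Describe this image for a text-only model'
});

console.log(description.cached
  ? `cache hit (${description.cacheKey})`
  : `described via ${description.provider} for $${description.estimatedCost}`);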

media/resize (Phase 3)

Purpose: Resize images to fit within token budget.

Parameters:

interface MediaResizeParams extends CommandParams {
  base64: string;                  // Base64-encoded image
  mimeType: string;                // Input MIME type

  // Resize strategy (one required)
  maxTokens?: number;              // Target token budget
  targetWidth?: number;            // Target width in pixels
  targetHeight?: number;           // Target height in pixels
  scale?: number;                  // Scale factor (0.5 = 50%)

  // Options
  preserveAspectRatio?: boolean;   // Default: true
  quality?: number;                // JPEG quality 1-100 (default: 85)
  format?: string;                 // Output format (default: input format)
  model?: string;                  // Model for token estimation
}

Response:

interface MediaResizeResult extends CommandResult {
  base64: string;                  // Resized image (base64)
  mimeType: string;                // Output MIME type
  originalDimensions: {
    width: number;
    height: number;
  };
  newDimensions: {
    width: number;
    height: number;
  };
  originalTokens: number;          // Before resize
  newTokens: number;               // After resize
  reductionPercent: number;        // Token reduction %
  originalSize: number;            // Bytes before
  newSize: number;                 // Bytes after
}
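
No implementation exists yet; a minimal sketch of the resize core using the sharp library (assuming sharp is an acceptable dependency -- the command wrapper, token estimation, and budget loop are omitted):

import sharp from 'sharp';

// Resize a base64 image into a target box, preserving aspect ratio
async function resizeBase64(
  base64: string,
  targetWidth: number,
  targetHeight: number,
  quality = 85
): Promise<{ base64: string; mimeType: string }> {
  const output = await sharp(Buffer.from(base64, 'base64'))
    .resize(targetWidth, targetHeight, {
      fit: 'inside',              // fit within the box, keep aspect ratio
      withoutEnlargement: true    // never upscale smaller images
    })
    .jpeg({ quality })
    .toBuffer();

  return { base64: output.toString('base64'), mimeType: 'image/jpeg' };
}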

media/estimate-tokens (Phase 3)

Purpose: Estimate token cost for images per model.

Parameters:

interface MediaEstimateTokensParams extends CommandParams {
  base64: string;                  // Base64-encoded image
  mimeType: string;                // MIME type
  model: string;                   // Model for estimation
  provider: string;                // Provider (openai, anthropic, xai)
}

Response:

interface MediaEstimateTokensResult extends CommandResult {
  tokens: number;                  // Estimated tokens
  method: 'tile-based' | 'base64-length' | 'pixels';
  details: {
    width: number;
    height: number;
    tiles?: number;                // For tile-based models
    base64Length?: number;         // For base64-based models
  };
}

media/convert (Phase 4)

Purpose: Convert between image formats.

Parameters:

interface MediaConvertParams extends CommandParams {
  base64: string;                  // Base64-encoded image
  fromFormat: string;              // Input MIME type
  toFormat: string;                // Output MIME type
  quality?: number;                // Compression quality (default: 95)
}

Response:

interface MediaConvertResult extends CommandResult {
  base64: string;                  // Converted image
  mimeType: string;                // Output MIME type
  originalSize: number;            // Bytes before
  newSize: number;                 // Bytes after
}
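
The conversion core could reuse the same library; a sketch assuming sharp (note that HEIC input requires a libheif-enabled build, which is worth verifying):

import sharp from 'sharp';

// Convert any sharp-readable input to PNG; input format is inferred from the bytes
async function convertToPng(base64: string): Promise<{ base64: string; mimeType: string }> {
  const output = await sharp(Buffer.from(base64, 'base64')).png().toBuffer();
  return { base64: output.toString('base64'), mimeType: 'image/png' };
}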

Testing Strategy

Unit Tests

# Phase 2: Image Description
npx vitest tests/unit/media-analyze.test.ts

# Phase 3: Resizing
npx vitest tests/unit/media-resize.test.ts
npx vitest tests/unit/media-estimate-tokens.test.ts

# Phase 4: Format Conversion
npx vitest tests/unit/media-convert.test.ts

Integration Tests

# Test with real models
./jtag collaboration/chat/send --room="general" --message="Describe this" \
  --media="/path/to/test-image.png"

# Wait for responses
sleep 10

# Check responses
./jtag collaboration/chat/export --room="general" --limit=20

Performance Tests

// Test token reduction
const before = await estimateTokens(originalImage);
const after = await estimateTokens(resizedImage);
expect(after).toBeLessThan(before * 0.2);  // 80% reduction

// Test cost savings
const costBefore = calculateCost(before);
const costAfter = calculateCost(after);
expect(costAfter).toBeLessThan(costBefore * 0.2);  // 80% savings

Success Metrics

Phase 2: Image Description

  • ✅ Non-vision models respond accurately to image content
  • ✅ Description cache hit rate >50% after 1 hour
  • ✅ Average description cost <$0.01 per image

Phase 3: Context-Aware Resizing

  • ✅ Zero context window exceeded errors
  • ✅ Token usage reduced 80%+ for large images
  • ✅ Image quality remains visually acceptable
  • ✅ Resizing adds <100ms latency per image

Phase 4: Format Conversion

  • ✅ Zero format compatibility errors
  • ✅ All formats work with all providers
  • ✅ Conversion adds <200ms latency per image

Phase 5: RAG Integration

  • ✅ Context budget never exceeded
  • ✅ Budget utilization >90% (efficient space use)
  • ✅ Images and messages coexist gracefully
  • ✅ Flexbox-style allocation adapts to content

Open Questions

  1. Image Description Caching: Where to store cache?

    • Option A: In-memory (lost on restart)
    • Option B: Database (persistent, slower)
    • Option C: Redis (fast + persistent)
    • Decision: TBD
  2. Vision Provider Selection: Auto-select or user choice?

    • Grok-vision: Fast, cheap, good quality
    • GPT-4V: Expensive, very reliable
    • Claude Sonnet 4.5: Most expensive, best quality
    • Decision: Auto-select with fallback chain
  3. Token Estimation Accuracy: How to improve?

    • Current: Approximations based on documentation
    • Better: Calibrate against actual API responses
    • Best: Provider APIs expose token counting
    • Decision: Start with approximations, calibrate over time
  4. Base64 in Text Content: Worth trying?

    • Some models might understand data URIs in text
    • Could be simpler than multimodal format
    • Needs testing with each provider
    • Decision: Test in Phase 2, low priority

Related Documentation

  • CLAUDE.md - Main development guide
  • docs/UNIVERSAL-PRIMITIVES.md - Commands.execute() architecture
  • system/user/server/modules/rag-builders/ChatRAGBuilder.ts - RAG budget system
  • daemons/ai-provider-daemon/shared/adapters/BaseOpenAICompatibleAdapter.ts - Adapter implementation
  • commands/media/resize/ - Media resize command (to be created)

Changelog

2025-11-26: Initial document created

  • Documented Phase 1 implementation (complete)
  • Designed Phases 2-5 architecture
  • Specified command interfaces
  • Outlined RAG integration strategy