[BUG] Index Out of Bounds Panic When Embedding Count Mismatches #142

@EnthusiasticTech

Project

vgrep

Description

The indexer accesses embeddings by index without validating that the embedding count matches the chunk count. If the embedding API returns fewer embeddings than expected (due to errors, timeouts, or bugs), the application panics with "index out of bounds".

Affected Files

  • src/core/indexer.rs (lines 139, 167) - Indexer
  • src/core/indexer.rs (lines 515, 544) - ServerIndexer

Evidence

The Bug Pattern:

        let all_chunks: Vec<&str> = pending_files
            .iter()
            .flat_map(|f| f.chunks.iter().map(|c| c.content.as_str()))
            .collect();
        
        // ... 
        
        let all_embeddings = self.engine.embed_batch(&all_chunks)?;
        // ❌ NO VALIDATION: all_embeddings.len() == all_chunks.len()
        for pending in &pending_files {
            let file_id = self.db.insert_file(&pending.path, &pending.hash)?;

            for (chunk_idx, chunk) in pending.chunks.iter().enumerate() {
                let embedding = &all_embeddings[embedding_idx];  // 💥 PANIC if index >= len!
                self.db.insert_chunk(...)?;
                embedding_idx += 1;
            }
        }

Panic Scenario:

all_chunks.len() = 100
all_embeddings.len() = 95  (server returned partial results)

Loop iteration 96:
  embedding_idx = 95
  all_embeddings[95]  // 💥 PANIC: index out of bounds: the len is 95 but the index is 95
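
The scenario above can be reproduced in isolation from vgrep. The following is a minimal sketch, not code from the project: `lookup_checked` is a hypothetical helper that uses the standard `slice::get` method, which returns `None` past the end of the slice instead of panicking, so the caller can turn the shortfall into an error.

```rust
/// Hypothetical helper (not part of vgrep): fetches the embedding at `idx`
/// and reports a shortfall as an error instead of panicking.
fn lookup_checked(embeddings: &[Vec<f32>], idx: usize) -> Result<&Vec<f32>, String> {
    // `get` returns None when idx >= embeddings.len(), where `embeddings[idx]` would panic.
    embeddings.get(idx).ok_or_else(|| {
        format!(
            "embedding index {} out of range: only {} embeddings returned",
            idx,
            embeddings.len()
        )
    })
}
```

With 95 embeddings, index 94 resolves normally and index 95 yields an `Err` describing the mismatch rather than aborting the process.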

System Information

Bounty Version: 0.1.0
OS: Ubuntu 24.04 LTS
CPU: AMD EPYC-Genoa Processor (8 cores)
RAM: 15 GB

Steps to Reproduce

Method 1: Simulate embedding API failure

// In embed_batch, simulate partial failure
pub fn embed_batch(&self, texts: &[&str]) -> Result<Vec<Vec<f32>>> {
    // Bug: Return fewer embeddings than requested
    // (saturating_sub avoids a usize underflow when fewer than 5 texts are passed)
    let results: Vec<Vec<f32>> = texts.iter()
        .take(texts.len().saturating_sub(5))  // Missing 5 embeddings!
        .map(|t| self.embed(t).unwrap())
        .collect();
    Ok(results)
}

Method 2: Server timeout during batch

# Start server with artificial delay
VGREP_EMBED_DELAY=100ms vgrep serve

# Index large codebase (will timeout partway)
timeout 10s vgrep index /large/codebase

# Server returns partial embeddings before timeout
# Client panics when iterating

Method 3: Memory pressure

# Limit memory to force OOM during embedding
systemd-run --scope -p MemoryMax=500M vgrep index /large/codebase

# embed_batch runs out of memory partway through
# Returns partial results, indexer panics

Expected Behavior

  1. Validate that all_embeddings.len() == all_chunks.len() before processing
  2. Return a descriptive error if counts don't match
  3. Never panic due to index out of bounds
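
The expected behavior above could look roughly like the following. This is a sketch under assumptions, not the project's actual fix: `EmbeddingCountMismatch` and `validate_embedding_count` are hypothetical names, and vgrep's real error type may differ.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; vgrep's real error type may differ.
#[derive(Debug, PartialEq)]
struct EmbeddingCountMismatch {
    expected: usize,
    actual: usize,
}

impl fmt::Display for EmbeddingCountMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "embedding count mismatch: expected {} embeddings, got {}",
            self.expected, self.actual
        )
    }
}

impl std::error::Error for EmbeddingCountMismatch {}

/// Returns Ok only when there is exactly one embedding per chunk.
fn validate_embedding_count(
    chunks: usize,
    embeddings: usize,
) -> Result<(), EmbeddingCountMismatch> {
    if chunks == embeddings {
        Ok(())
    } else {
        Err(EmbeddingCountMismatch {
            expected: chunks,
            actual: embeddings,
        })
    }
}
```

In the indexer, a call such as `validate_embedding_count(all_chunks.len(), all_embeddings.len())?` would sit directly after `embed_batch` and before the first `insert_file`, so that nothing is written when the counts differ. This assumes the indexer's error type can be built from a `std::error::Error` value.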

Actual Behavior

  1. No validation of embedding count
  2. Blind index access into all_embeddings
  3. PANIC: index out of bounds: the len is X but the index is Y
  4. Entire indexing operation crashes
  5. Database may be left in inconsistent state
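
One way to avoid the blind index access listed above is to consume the embeddings through an iterator, so a shortfall surfaces as an `Err` rather than a panic. The sketch below is illustrative only: `PendingFile` is a simplified stand-in for the indexer's real type, and `pair_chunks` is a hypothetical function.

```rust
/// Hypothetical, simplified stand-in for a file awaiting indexing.
struct PendingFile {
    chunks: Vec<String>,
}

/// Pairs every chunk with its embedding without indexing into the vector.
/// Fails if the embeddings run out early or if any are left over.
fn pair_chunks<'a>(
    files: &'a [PendingFile],
    embeddings: Vec<Vec<f32>>,
) -> Result<Vec<(&'a str, Vec<f32>)>, String> {
    let mut remaining = embeddings.into_iter();
    let mut pairs = Vec::new();
    for file in files {
        for chunk in &file.chunks {
            // `next()` returns None once the embeddings are exhausted.
            let embedding = remaining
                .next()
                .ok_or_else(|| format!("ran out of embeddings after {} chunks", pairs.len()))?;
            pairs.push((chunk.as_str(), embedding));
        }
    }
    // Surplus embeddings also indicate a mismatch.
    let extra = remaining.count();
    if extra != 0 {
        return Err(format!("{} unused embeddings", extra));
    }
    Ok(pairs)
}
```

Because all pairs are built before anything is stored, a mismatch is detected before the first database write, which may also help with the inconsistent-state concern.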

Metadata

Labels: bug, ide, invalid, vgrep
