Project
vgrep
Description
The search_similar() function in src/core/db.rs loads all embeddings matching the path prefix into memory before calculating similarity scores. For large codebases, this can cause Out-Of-Memory (OOM) crashes.
Each embedding is ~1.5KB (384 dimensions × 4 bytes), so:
- 10,000 chunks = ~15 MB
- 100,000 chunks = ~150 MB
- 1,000,000 chunks = ~1.5 GB (common in large monorepos)
The entire result set is loaded into a Vec<SearchResult> before any filtering or limiting occurs.
Affected Files
src/core/db.rs (lines 155-199)
Evidence
```rust
pub fn search_similar(
    &self,
    query_embedding: &[f32],
    path_prefix: &Path,
    limit: usize,
) -> Result<Vec<SearchResult>> {
    let path_prefix_str = path_prefix.to_string_lossy();
    let like_pattern = format!("{}%", path_prefix_str);
    let mut stmt = self.conn.prepare(
        r"SELECT c.id, c.file_id, f.path, c.content, c.start_line, c.end_line, c.embedding
          FROM chunks c
          JOIN files f ON c.file_id = f.id
          WHERE f.path LIKE ?", // No LIMIT here!
    )?;
    let mut results: Vec<SearchResult> = stmt
        .query_map([&like_pattern], |row| {
            let embedding_blob: Vec<u8> = row.get(6)?; // Each ~1.5 KB
            let embedding = bytes_to_embedding(&embedding_blob);
            let similarity = cosine_similarity(query_embedding, &embedding);
            Ok(SearchResult {
                chunk_id: row.get(0)?,
                file_id: row.get(1)?,
                path: PathBuf::from(row.get::<_, String>(2)?),
                content: row.get(3)?, // Full content also loaded
                start_line: row.get(4)?,
                end_line: row.get(5)?,
                similarity,
            })
        })?
        .filter_map(Result::ok) // ALL matching rows pass through here
        .collect(); // <-- OOM can happen here!

    // Sort and truncate AFTER loading everything
    results.sort_by(...);
    results.truncate(limit * 3);
    Ok(results)
}
```
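One way to bound memory is to score rows as they stream out of the query and keep only the best `k` in a fixed-size min-heap, instead of collecting everything first. Below is a minimal sketch of that technique; `Scored` is a simplified stand-in for `SearchResult` and `top_k` is a hypothetical helper, not part of vgrep's current API:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Simplified stand-in for SearchResult: just an id and a similarity score.
#[derive(Debug, PartialEq)]
struct Scored {
    similarity: f32,
    chunk_id: i64,
}

// f32 is not Ord, so order by total_cmp on the similarity alone.
impl Eq for Scored {}
impl PartialOrd for Scored {
    fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Scored {
    fn cmp(&self, other: &Self) -> std::cmp::Ordering {
        self.similarity.total_cmp(&other.similarity)
    }
}

/// Keep only the top `k` results while consuming `rows`, so peak memory is
/// O(k) rather than O(total matching chunks).
fn top_k(rows: impl Iterator<Item = Scored>, k: usize) -> Vec<Scored> {
    // Reverse makes this a min-heap: the root is the worst result kept so far.
    let mut heap: BinaryHeap<Reverse<Scored>> = BinaryHeap::with_capacity(k + 1);
    for row in rows {
        heap.push(Reverse(row));
        if heap.len() > k {
            heap.pop(); // evict the current minimum; heap never exceeds k entries
        }
    }
    let mut out: Vec<Scored> = heap.into_iter().map(|Reverse(s)| s).collect();
    out.sort_by(|a, b| b.similarity.total_cmp(&a.similarity)); // best first
    out
}
```

The same pattern would slot into `search_similar()` by feeding the `query_map` iterator into the heap instead of `.collect()`-ing it, so only `limit * 3` results are ever resident.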
Memory Calculation:
For each chunk loaded:
- embedding: 384 × 4 = 1,536 bytes
- content: ~500 bytes average (chunk_size / 2)
- path: ~100 bytes average
- SearchResult struct overhead: ~80 bytes
- Total per chunk: ~2.2 KB
| Chunks | Memory Required |
| --- | --- |
| 10,000 | 22 MB |
| 50,000 | 110 MB |
| 100,000 | 220 MB |
| 500,000 | 1.1 GB |
| 1,000,000 | 2.2 GB |
Error Message
Debug Logs
System Information
Bounty Version: 0.1.0
OS: Ubuntu 24.04 LTS
CPU: AMD EPYC-Genoa Processor (8 cores)
RAM: 15 GB
Screenshots
No response
Steps to Reproduce
# 1. Index a large codebase (e.g., Linux kernel, Chromium)
cd /path/to/large/codebase
vgrep index
# 2. Check the chunk count
sqlite3 ~/.vgrep/projects/*.db "SELECT COUNT(*) FROM chunks;"
# Output: 500000+
# 3. Run a search (will try to load all 500K+ embeddings)
vgrep "function"
# 4. Watch memory usage spike, potentially OOM
Expected Behavior
- Search should work efficiently regardless of index size
- Memory usage should be bounded and predictable
- Only top-K results should be fully loaded into memory
Actual Behavior
- ALL matching chunks are loaded into memory
- Memory grows linearly with index size
- Large codebases cause OOM crashes
- No streaming or pagination
Additional Context
No response