@jrochkind jrochkind commented Dec 17, 2025

We fetch chunks by embedding-vector similarity in pgvector, using the neighbor gem. That's just one line of code, but we start by extracting it into a ChunkFetcher service object so we can add more complicated logic, primarily limits and exclusions.

The ability to say "not including these Chunks" (use case: ones I already fetched) or "not including these Interviews" (same use case) is pretty easy in ActiveRecord.

Harder: we want to say "give me the top-ranked chunks, but no more than N per interview." Googling and ChatGPT showed me there's a way to do that with a Common Table Expression (CTE) sub-query that uses the ROW_NUMBER() window function... phew! It was fairly straightforward once I figured it out. Then I figured out how to use ActiveRecord to generically wrap a query as a sub-query in a larger query, so it could be "that query, but limited to the top 2 per interview".

Basic format of the SQL (as given to me by Claude, ha) is:

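-- Each ? is a bind placeholder for the query embedding vector.
WITH ranked_chunks AS (
  SELECT
    chunks.*,
    chunks.embedding <=> ? AS distance,
    -- Rank each chunk among the chunks of its own document, closest first
    ROW_NUMBER() OVER (PARTITION BY document_id ORDER BY chunks.embedding <=> ?) AS doc_rank
  FROM chunks
  ORDER BY chunks.embedding <=> ?
  LIMIT <big limit to get enough to choose from>
)
SELECT *
FROM ranked_chunks
WHERE doc_rank <= 2  -- keep at most 2 chunks per document
ORDER BY distance
LIMIT <actual limit>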
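
The "wrap any query as a sub-query" idea can be sketched in plain Ruby as a string builder. This is only an illustration, not the code in this PR: the method name `limit_per_group_sql`, its keyword arguments, and the `group_rank` column are all made up here, and it assumes the inner query already exposes the partition and ordering columns under trusted names.

```ruby
# Hypothetical helper: wrap any SQL query in a CTE that keeps at most
# `max_per_group` rows per group, then applies an overall limit.
# Assumes the inner query exposes the `partition_by` and `order_by` columns,
# and that both names are trusted identifiers (they are interpolated as-is).
def limit_per_group_sql(inner_sql, partition_by:, order_by:, max_per_group:, limit:)
  <<~SQL
    WITH ranked AS (
      SELECT inner_query.*,
             ROW_NUMBER() OVER (PARTITION BY #{partition_by} ORDER BY #{order_by}) AS group_rank
      FROM (#{inner_sql}) AS inner_query
    )
    SELECT *
    FROM ranked
    WHERE group_rank <= #{Integer(max_per_group)}
    ORDER BY #{order_by}
    LIMIT #{Integer(limit)}
  SQL
end

# Example with a stand-in inner query that already computes `distance`.
inner = "SELECT chunks.*, 0.5 AS distance FROM chunks"
puts limit_per_group_sql(inner, partition_by: "document_id", order_by: "distance",
                                max_per_group: 2, limit: 8)
```

In a Rails app the inner SQL could plausibly come from `relation.to_sql`; `Integer(...)` is used so that a non-numeric limit raises instead of being interpolated into the SQL.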

You can ask ChatGPT or Claude for more info on it. :)

Then use it to expand the chunks we send to Claude

After some experimentation, I think this is a fine point to test at:

  • Fetch the 8 closest-vector chunks
  • Then fetch 8 more: the closest remaining chunks, only one per interview, not including any interviews from the first 8
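
The two-stage selection above can be sketched on in-memory data to make the rule concrete. This is a plain-Ruby stand-in, not the ChunkFetcher from this PR (which queries pgvector): the `Chunk` struct, `closest_chunks`, and `two_stage_chunks` are hypothetical names, and distances are assumed to be precomputed.

```ruby
# Hypothetical in-memory stand-in for a chunk row with a precomputed distance.
Chunk = Struct.new(:id, :interview_id, :distance, keyword_init: true)

# Closest `limit` chunks, optionally capped per interview and
# skipping any chunk whose interview is in `exclude_interview_ids`.
def closest_chunks(chunks, limit:, max_per_interview: nil, exclude_interview_ids: [])
  per_interview = Hash.new(0) # interview_id => chunks picked so far
  picked = []
  chunks.sort_by(&:distance).each do |chunk|
    break if picked.size >= limit
    next if exclude_interview_ids.include?(chunk.interview_id)
    next if max_per_interview && per_interview[chunk.interview_id] >= max_per_interview
    per_interview[chunk.interview_id] += 1
    picked << chunk
  end
  picked
end

# Stage 1: the closest chunks, with no per-interview cap.
# Stage 2: more chunks, one per interview, from interviews not seen in stage 1.
def two_stage_chunks(chunks, first: 8, second: 8)
  closest = closest_chunks(chunks, limit: first)
  diverse = closest_chunks(chunks, limit: second, max_per_interview: 1,
                                   exclude_interview_ids: closest.map(&:interview_id).uniq)
  closest + diverse
end
```

The second stage trades a little similarity for breadth: it can only add interviews the first stage did not already cover.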

(It's possible the point of diminishing returns is even a bit fewer chunks; adding TONS more chunks did not seem to help, see wiki.)

@jrochkind jrochkind force-pushed the chunk_fetcher branch 2 times, most recently from aba9004 to 073be58, December 17, 2025 20:28
@jrochkind jrochkind marked this pull request as ready for review December 18, 2025 16:44
@eddierubeiz eddierubeiz merged commit fa6ee2a into master Dec 18, 2025
1 check passed
@eddierubeiz eddierubeiz deleted the chunk_fetcher branch December 18, 2025 17:05