Fix chunk deletion pagination to prevent 16 MB read limit #52
Conversation
```ts
  dataUsedSoFar += estimateEmbeddingSize(embedding);
  await ctx.db.delete(chunk.state.embeddingId);
}
await ctx.db.delete(chunk.state.embeddingId);
```
My concern here is whether there can be multiple references to the same embeddingId, in which case trying to delete one that no longer exists would fail. And I believe deletes today require reading the document anyway, so we may as well get it and figure out whether it's putting us over the limit?
Yeah, that can happen, but as I understand it, it shouldn’t occur for “ready” chunks since each one has its own embedding. Only “replaced” ones can share the same embeddingId, but we don’t delete those again. So wouldn’t it make more sense to skip the pre‑get and just wrap the delete in a try‑catch?
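For reference, a minimal sketch of what that try/catch variant could look like (assuming, as noted above, that ctx.db.delete fails when the embedding document is already gone; purely illustrative):

```ts
// Hypothetical sketch only: swallow the failure if the embedding was already
// deleted (e.g. an embeddingId shared with a "replaced" chunk).
try {
  await ctx.db.delete(chunk.state.embeddingId);
} catch {
  // Already gone; nothing to do for this chunk.
}
```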
Ah, good call. But regardless, we need to know how much data is being loaded, right? If we just delete, we don't know how much "read bandwidth" was used. My assumption here is that delete adds to the read bandwidth, but that it won't be counted twice if we get and then delete.
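In other words, something along these lines (a sketch of the get-then-delete accounting; it assumes the read is only billed once even when followed by the delete, which is exactly the behavior we'd want to confirm):

```ts
// Sketch: read the embedding so we know its size, charge it against the
// per-transaction budget, then delete it. estimateEmbeddingSize is the helper
// from the snippet above; everything else is illustrative.
const embedding = await ctx.db.get(chunk.state.embeddingId);
if (embedding !== null) {
  dataUsedSoFar += estimateEmbeddingSize(embedding);
  await ctx.db.delete(chunk.state.embeddingId);
}
```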
```ts
await ctx.db.delete(chunk.contentId);
await ctx.db.delete(chunk._id);
lastOrder = chunk.order;
if (dataUsedSoFar > BANDWIDTH_PER_TRANSACTION_HARD_LIMIT) {
```
Do you have an idea for why this isn't already preventing an issue?
I think the previous version was doing the get for embedding and content before checking the limit, so it was already hitting the limit during those reads, before even reaching the check.
Even so, the "hard limit" is set to 8 MB, which is half of the actual transaction limit. So either it was being called from another transaction that had already read 8 MB, or it's reading twice as much as it thinks here, which could be the case if get and delete are accounted for separately. Were you seeing this when it was being called directly from something, e.g. when your code was calling entries.deleteAsync? Or was it failing on some subsequent page? Maybe an issue here is that it does the first page eagerly, instead of always doing it async: https://github.com/get-convex/rag/blob/main/src/component/entries.ts#L552
I tried rolling chunks.ts back to the previous version and moving the work to workers, but the error is still there. It seems like the real issue is in deleteChunksPageHandler: it still does those big gets before deleting, and since delete also counts as a read, the I/O roughly doubles. Even with workers, a single page can still hit the 16 MB limit on its own.
Yeah, so the big issue here is the accounting doubling. I don't want it to rely on stopping at 15.9 MB read instead of 16.1 MB; it should stop at 8 MB. A simple test of this would be to make a simple mutation that reads a big document and then deletes it, and see whether the dashboard lists the data read as doubled. If we just remove the get, I don't know how we count the size of what we deleted. Another approach here is to use a heuristic and/or set limits so documents can't get bigger than, say, 100k, and then just fetch 80 of them per page.
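Something like this is the kind of probe mutation I mean (a rough sketch; the table name and size estimate are made up, the point is just to compare the dashboard's reported read bandwidth against the document's size):

```ts
import { mutation } from "./_generated/server";
import { v } from "convex/values";

// Hypothetical probe mutation: read one large document, then delete it, and
// compare the read bandwidth the dashboard reports against the document's
// actual size to see whether get + delete counts the data once or twice.
export const readThenDelete = mutation({
  args: { id: v.id("bigDocs") }, // "bigDocs" is a made-up table name
  handler: async (ctx, { id }) => {
    const doc = await ctx.db.get(id);
    if (doc === null) {
      return { bytes: 0 };
    }
    const bytes = JSON.stringify(doc).length; // rough size estimate
    await ctx.db.delete(id);
    return { bytes };
  },
});
```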
Hi there!
I opened the issue about the “Too many bytes read” error in the deleteAsync function and decided to give it a try.
Does this seem like a reasonable direction to fix it? :)
Summary:
Implement truly paginated chunk deletion to avoid “Too many bytes read” errors when deleting large entries. Remove unnecessary doc reads, delete chunks by ID, and return an exclusive nextStartOrder = order + 1. Re‑export internal.chunks.deleteChunksPage for entries.deleteSync.
Problem:
Unbounded queries and extra reads of embedding/content docs caused > 16 MB reads, breaking deletion tests and large‑entry cleanup.
Changes:
- Add a paginated deletion loop with byte-budget thresholds (see the sketch after this list).
- Delete chunks directly by ID (no content/embedding fetch).
- Use nextStartOrder = lastOrder + 1 for exclusive pagination.
- Restore export const deleteChunksPage = internalMutation(...).
- Keep entries.deleteSync/Async logic; it now loops safely until empty.
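A rough sketch of the shape of that loop (the index name, table names, and constants below are assumptions or taken from the snippets in this thread, not the exact implementation):

```ts
import { internalMutation } from "./_generated/server";
import { v } from "convex/values";

// Illustrative constants: the 8 MB budget and "80 per page" figure come from
// the discussion above; the per-chunk size estimate is an assumption.
const BANDWIDTH_PER_TRANSACTION_HARD_LIMIT = 8 * 1024 * 1024;
const ESTIMATED_CHUNK_BYTES = 100 * 1024;
const PAGE_SIZE = 80;

export const deleteChunksPage = internalMutation({
  args: { entryId: v.id("entries"), startOrder: v.number() },
  handler: async (ctx, { entryId, startOrder }) => {
    let dataUsedSoFar = 0;
    let lastOrder = startOrder;
    // Read one bounded page of chunks for this entry, ordered by `order`.
    // The index name and fields are assumptions about the schema.
    const chunks = await ctx.db
      .query("chunks")
      .withIndex("by_entry_order", (q) =>
        q.eq("entryId", entryId).gte("order", startOrder),
      )
      .take(PAGE_SIZE);
    for (const chunk of chunks) {
      // Delete by ID only (no content/embedding get); sizes are covered by a
      // heuristic per-chunk estimate instead of being measured.
      await ctx.db.delete(chunk.state.embeddingId); // only for "ready" chunks in the real code
      await ctx.db.delete(chunk.contentId);
      await ctx.db.delete(chunk._id);
      lastOrder = chunk.order;
      dataUsedSoFar += ESTIMATED_CHUNK_BYTES;
      if (dataUsedSoFar > BANDWIDTH_PER_TRANSACTION_HARD_LIMIT) break;
    }
    return {
      isDone: chunks.length === 0,
      // Exclusive cursor: the next page starts strictly after the last order
      // deleted on this page.
      nextStartOrder: lastOrder + 1,
    };
  },
});
```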
Testing:
Unit and large‑entry tests pass (deleteSync, deleteByKeySync). Multiple paginated passes complete within read limits.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.