Fix chunk deletion pagination to prevent 16 MB read limit #52
Conversation
```ts
  dataUsedSoFar += estimateEmbeddingSize(embedding);
  await ctx.db.delete(chunk.state.embeddingId);
}
await ctx.db.delete(chunk.state.embeddingId);
```
My concern here is whether there can be multiple references to the same embeddingId, in which case trying to delete one that no longer exists would fail. And I believe deletes today require reading the document anyway, so we may as well get it and figure out whether it's putting us over the limit?
Yeah, that can happen, but as I understand it, it shouldn’t occur for “ready” chunks since each one has its own embedding. Only “replaced” ones can share the same embeddingId, but we don’t delete those again. So wouldn’t it make more sense to skip the pre‑get and just wrap the delete in a try‑catch?
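For reference, a minimal sketch of what that try/catch variant could look like (assuming, as noted above, that ctx.db.delete fails when the embedding document is already gone; purely illustrative):

```ts
// Hypothetical sketch only: swallow the failure if the embedding was already
// deleted (e.g. an embeddingId shared with a "replaced" chunk).
try {
  await ctx.db.delete(chunk.state.embeddingId);
} catch {
  // Already gone; nothing to do for this chunk.
}
```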
Ah, good call. But regardless, we need to know how much data is being loaded, right? If we just delete, we don't know how much "read bandwidth" was used. My assumption here is that delete adds to the read bandwidth, but that it won't be counted twice if we get and then delete.
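In other words, something along these lines (a sketch of the get-then-delete accounting; it assumes the read is only billed once even when followed by the delete, which is exactly the behavior we'd want to confirm):

```ts
// Sketch: read the embedding so we know its size, charge it against the
// per-transaction budget, then delete it. estimateEmbeddingSize is the helper
// from the snippet above; everything else is illustrative.
const embedding = await ctx.db.get(chunk.state.embeddingId);
if (embedding !== null) {
  dataUsedSoFar += estimateEmbeddingSize(embedding);
  await ctx.db.delete(chunk.state.embeddingId);
}
```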
```ts
await ctx.db.delete(chunk.contentId);
await ctx.db.delete(chunk._id);
lastOrder = chunk.order;
if (dataUsedSoFar > BANDWIDTH_PER_TRANSACTION_HARD_LIMIT) {
```
Do you have an idea for why this isn't already preventing an issue?
I think the previous version was doing the get for embedding and content before checking the limit, so it was already hitting the limit during those reads, before even reaching the check.
Even so, the "hard limit" is set to 8 MB, which is half of the actual transaction limit. So either it was being called from another transaction that had already read 8 MB, or it's reading twice as much as it thinks here, which could be the case if get and delete are accounted for separately. Were you seeing this when it was being called directly from something, e.g. when your code was calling entries.deleteAsync? Or was it failing on some subsequent page? Maybe an issue here is that it does the first page eagerly, instead of always doing it async: https://github.com/get-convex/rag/blob/main/src/component/entries.ts#L552
I tried rolling chunks.ts back to the previous version and moving the work to workers, but the error is still there. It seems like the real issue is in deleteChunksPageHandler: it still does those big gets before deleting, and since delete also counts as a read, the I/O roughly doubles. Even with workers, a single page can still hit the 16 MB limit on its own.
Yeah, so the big issue here is the accounting doubling. I don't want it to rely on stopping at 15.9 MB read instead of 16.1 MB; it should stop at 8 MB. A simple test of this would be to make a simple mutation that reads a big document and then deletes it, and see whether the dashboard lists the data read as doubled. If we just remove the get, I don't know how we count the size of what we deleted. Another approach here is to use a heuristic and/or set limits so documents can't get bigger than, say, 100k, and then just fetch 80 of them per page.
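Something like this is the kind of probe mutation I mean (a rough sketch; the table name and size estimate are made up, the point is just to compare the dashboard's reported read bandwidth against the document's size):

```ts
import { mutation } from "./_generated/server";
import { v } from "convex/values";

// Hypothetical probe mutation: read one large document, then delete it, and
// compare the read bandwidth the dashboard reports against the document's
// actual size to see whether get + delete counts the data once or twice.
export const readThenDelete = mutation({
  args: { id: v.id("bigDocs") }, // "bigDocs" is a made-up table name
  handler: async (ctx, { id }) => {
    const doc = await ctx.db.get(id);
    if (doc === null) {
      return { bytes: 0 };
    }
    const bytes = JSON.stringify(doc).length; // rough size estimate
    await ctx.db.delete(id);
    return { bytes };
  },
});
```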
Hi there!
I opened the issue about the “Too many bytes read” error in the deleteAsync function and decided to give it a try.
Does this seem like a reasonable direction to fix it? :)
Summary:
Implement truly paginated chunk deletion to avoid “Too many bytes read” errors when deleting large entries. Remove unnecessary doc reads, delete chunks by ID, and return an exclusive nextStartOrder = order + 1. Re‑export internal.chunks.deleteChunksPage for entries.deleteSync.
Problem:
Unbounded queries and extra reads of embedding/content docs caused > 16 MB reads, breaking deletion tests and large‑entry cleanup.
Changes:
- Add a paginated deletion loop with byte-budget thresholds (see the sketch after this list).
- Delete chunks directly by ID (no content/embedding fetch).
- Use nextStartOrder = lastOrder + 1 for exclusive pagination.
- Restore export const deleteChunksPage = internalMutation(...).
- Keep entries.deleteSync/Async logic; it now loops safely until empty.
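A rough sketch of the shape of that loop (the index name, table names, and constants below are assumptions or taken from the snippets in this thread, not the exact implementation):

```ts
import { internalMutation } from "./_generated/server";
import { v } from "convex/values";

// Illustrative constants: the 8 MB budget and "80 per page" figure come from
// the discussion above; the per-chunk size estimate is an assumption.
const BANDWIDTH_PER_TRANSACTION_HARD_LIMIT = 8 * 1024 * 1024;
const ESTIMATED_CHUNK_BYTES = 100 * 1024;
const PAGE_SIZE = 80;

export const deleteChunksPage = internalMutation({
  args: { entryId: v.id("entries"), startOrder: v.number() },
  handler: async (ctx, { entryId, startOrder }) => {
    let dataUsedSoFar = 0;
    let lastOrder = startOrder;
    // Read one bounded page of chunks for this entry, ordered by `order`.
    // The index name and fields are assumptions about the schema.
    const chunks = await ctx.db
      .query("chunks")
      .withIndex("by_entry_order", (q) =>
        q.eq("entryId", entryId).gte("order", startOrder),
      )
      .take(PAGE_SIZE);
    for (const chunk of chunks) {
      // Delete by ID only (no content/embedding get); sizes are covered by a
      // heuristic per-chunk estimate instead of being measured.
      await ctx.db.delete(chunk.state.embeddingId); // only for "ready" chunks in the real code
      await ctx.db.delete(chunk.contentId);
      await ctx.db.delete(chunk._id);
      lastOrder = chunk.order;
      dataUsedSoFar += ESTIMATED_CHUNK_BYTES;
      if (dataUsedSoFar > BANDWIDTH_PER_TRANSACTION_HARD_LIMIT) break;
    }
    return {
      isDone: chunks.length === 0,
      // Exclusive cursor: the next page starts strictly after the last order
      // deleted on this page.
      nextStartOrder: lastOrder + 1,
    };
  },
});
```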
Testing:
Unit and large‑entry tests pass (deleteSync, deleteByKeySync). Multiple paginated passes complete within read limits.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.