fix: context window exceeded #165
Conversation
Pull Request Overview
This PR fixes a critical bug where the context window token limit could be exceeded when messages are too large, and implements proactive chunk truncation to prevent oversized prompts from being sent to the model.
Key Changes:
- Added `_limit_chunkspans()` function to proactively truncate retrieved chunks before adding them to context, with automatic per-tool adjustment when multiple tool calls occur (see the sketch after this list)
- Enhanced `_clip()` function with explicit edge case handling to return an empty list when even the last message exceeds the token limit
- Added `_num_queries` tracking to `RAGLiteConfig` to enable dynamic token limit adjustment across multiple tool calls
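A rough sketch of what such proactive truncation could look like is below; the `ChunkSpan` shape, the `count_tokens` helper, and the budget arithmetic are illustrative assumptions, not the PR's actual implementation:

```python
# Illustrative sketch only; not the actual code in src/raglite/_rag.py.
from dataclasses import dataclass


@dataclass
class ChunkSpan:
    """Stand-in for a retrieved chunk span (the real class holds more metadata)."""
    content: str


def count_tokens(text: str) -> int:
    """Crude token estimate; the real code would use the model's tokenizer."""
    return max(1, len(text) // 4)


def _limit_chunkspans(
    chunk_spans: list[ChunkSpan], max_tokens: int, num_queries: int = 1
) -> list[ChunkSpan]:
    """Keep only the chunk spans that fit within a per-query token budget."""
    # When multiple tool calls run in one turn, each query gets a proportional share.
    budget = max_tokens // max(1, num_queries)
    kept: list[ChunkSpan] = []
    used = 0
    for span in chunk_spans:
        tokens = count_tokens(span.content)
        if used + tokens > budget:
            break  # Drop this span and everything after it to stay within the budget.
        kept.append(span)
        used += tokens
    return kept
```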
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/raglite/_rag.py | Implements chunk truncation logic, edge case handling in message clipping, and tool call query counting |
| src/raglite/_config.py | Adds `_num_queries` field to track the number of concurrent queries for token limit calculations (see the sketch below) |
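For illustration, the `_num_queries` addition could look something like this minimal sketch; the surrounding fields and defaults are assumptions, not RAGLite's actual `RAGLiteConfig`:

```python
# Minimal sketch; the real RAGLiteConfig in src/raglite/_config.py has many more fields.
from dataclasses import dataclass


@dataclass
class RAGLiteConfig:
    llm: str = "gpt-4o-mini"  # Assumed example default.
    # Number of concurrent queries (tool calls) in the current turn; used to split
    # the context token budget so each query's retrieved chunks get a fair share.
    _num_queries: int = 1
```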
I think we need better handling for truncating multiple tool calls
Fix: Improved Context Management and Context Window Handling
Previously, context size was managed only by `_clip`, which failed when even the last message exceeded the model's context window: in those cases, it could return all messages. In other cases it could drop the user query entirely, leading to responses built only from tool messages without the original user query. This update introduces robust context limiting and proportional chunk allocation:
New `_limit_chunkspans`
Proactively truncates retrieved chunk spans so that the resulting context fits within the model's token limit, splitting the budget across multiple tool calls.

Updated `_clip`
Explicitly handles the edge case where even the last message exceeds the token limit by returning an empty list.

Updated `add_context`
`config` is now a required parameter, so ChunkSpans can be limited.

Updated `test_rag`
The file was updated so that `chunk_spans` is only asserted when the LLM does not start with `"llama-cpp-python"`. This change was made because the test could fail when all chunk spans are dropped by `_limit_chunkspans`, especially when using the `sat-1l-sm` sentence splitter.

Integration
`_limit_chunkspans` is applied in `add_context` and `_run_tools` to constrain retrieved chunk spans before building messages (see the sketch below).

These changes prevent context overflows and ensure that context truncation is handled with clear warnings.
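Putting the pieces together, the integration described above might look roughly like the following sketch. It reuses the illustrative `ChunkSpan`, `count_tokens`, `_limit_chunkspans`, and `RAGLiteConfig` stand-ins from the earlier sketches; the message format, signatures, and the 8192-token budget are assumptions, not the PR's exact code:

```python
# Illustrative sketch; actual signatures in src/raglite/_rag.py may differ.
import warnings


def _clip(messages: list[dict[str, str]], max_tokens: int) -> list[dict[str, str]]:
    """Keep the most recent messages that fit; return [] if even the last one is too large."""
    clipped: list[dict[str, str]] = []
    used = 0
    for message in reversed(messages):
        tokens = count_tokens(message["content"])
        if used + tokens > max_tokens:
            break
        clipped.insert(0, message)
        used += tokens
    if not clipped:
        warnings.warn("Even the last message exceeds the context window; returning no messages.")
    return clipped


def add_context(
    user_prompt: str, chunk_spans: list[ChunkSpan], config: RAGLiteConfig
) -> list[dict[str, str]]:
    """Build a user message from retrieved chunk spans, limiting them proactively first."""
    max_context_tokens = 8192  # Assumed; would normally be derived from the model's context size.
    chunk_spans = _limit_chunkspans(chunk_spans, max_context_tokens, config._num_queries)
    context = "\n\n".join(span.content for span in chunk_spans)
    return [{"role": "user", "content": f"{context}\n\n{user_prompt}"}]
```

Because whole spans are dropped rather than cut mid-span, `chunk_spans` can end up empty, which is consistent with the `test_rag` change above where the assertion on `chunk_spans` is skipped for models whose spans may all be dropped.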