Conversation

@jirastorza commented Oct 15, 2025

This PR includes two fixes:

Fix: Improved Context Management and Context Window Handling

Previously, context size was managed only by _clip, which failed when even the last message exceeded the model’s context window: in that case it could return all messages unclipped. In other cases it could drop the user query entirely, leading to responses built only from tool messages without the original question.

This update introduces robust context limiting and proportional chunk allocation:

  • New _limit_chunkspans

    • Allocates the available context tokens proportionally across tools based on the sizes of their retrieved chunks (see the sketch below this list).
    • Enforces the model’s context limit while accounting for the token counts of the other messages (user, system, tool messages, and the RAG template).
    • Logs warnings when chunks are truncated due to window constraints.
  • Updated _clip

    • Ensures the user query is always preserved after clipping (if possible).
    • Warns when no messages fit or when clipping would remove the user query.
  • Updated add_context

    • Added a new required config parameter so that ChunkSpans can be limited.
  • Updated test_rag

    • The file was updated so that chunk_spans is only asserted when the LLM name does not start with "llama-cpp-python". This change was made because the test could fail when all chunk spans are dropped by _limit_chunkspans, especially when using the sat-1l-sm sentence splitter.
  • Integration

    • _limit_chunkspans is applied in add_context and _run_tools to constrain retrieved chunk spans before building messages.

These changes prevent context overflows and ensure that context truncation is reported with clear warnings.
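
For reference, here is a minimal sketch of the proportional allocation idea behind _limit_chunkspans. The ChunkSpan dataclass, the count_tokens helper, and the ~4-characters-per-token estimate are illustrative assumptions, not the actual raglite implementation:

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class ChunkSpan:
    """Hypothetical stand-in for a retrieved chunk span; only the text is used here."""
    text: str


def count_tokens(text: str) -> int:
    """Crude token estimate (~4 characters per token); a real tokenizer would be used instead."""
    return max(1, len(text) // 4)


def limit_chunk_spans(
    spans: list[ChunkSpan], context_window: int, other_messages_tokens: int
) -> list[ChunkSpan]:
    """Truncate chunk spans so they fit in the remaining context budget.

    The budget left after the system, user, and tool messages (plus the RAG template)
    is split across spans proportionally to their original token counts.
    """
    budget = context_window - other_messages_tokens
    if budget <= 0:
        logger.warning("No context budget left for chunk spans; dropping all of them.")
        return []
    sizes = [count_tokens(span.text) for span in spans]
    total = sum(sizes)
    if total <= budget:
        return spans  # Everything already fits.
    limited = []
    for span, size in zip(spans, sizes):
        share = max(1, budget * size // total)  # Proportional share of the budget.
        if share < size:
            logger.warning("Truncating a chunk span from %d to ~%d tokens.", size, share)
            span = ChunkSpan(text=span.text[: share * 4])  # Approximate cut back to characters.
        limited.append(span)
    return limited
```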

Fix: Change sentence splitter to "sat-1l-sm"

This pull request updates the sentence splitting logic by switching the SaT (Segment-any-Text) model used in _load_sat from "sat-3l-sm" to "sat-1l-sm" in src/raglite/_split_sentences.py.
The change reduces model size while maintaining comparable segmentation accuracy.
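
For context, the model swap boils down to changing the identifier passed to wtpsplit's SaT class. A self-contained example of that kind of call (assuming the standard wtpsplit API, not the exact _load_sat code):

```python
from wtpsplit import SaT  # Segment-any-Text models from the wtpsplit package.

# Load the smaller 1-layer model instead of the previous 3-layer one.
sat = SaT("sat-1l-sm")

# Split raw text into sentences before chunking.
sentences = sat.split("RAGLite first splits documents into sentences. It then groups them into chunks.")
print(sentences)
```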

[Benchmark plot comparing "sat-1l-sm" and "sat-3l-sm" on the CUAD dataset]

Benchmark results on the CUAD dataset (see the attached plot) show that "sat-1l-sm" achieves performance similar to the larger "sat-3l-sm" model under the default RAGLite benchmark setup.
This makes "sat-1l-sm" a better trade-off between efficiency and accuracy.

| Model | English Score | Multilingual Score |
| --- | --- | --- |
| sat-1l | 88.5 | 84.3 |
| sat-1l-sm | 88.2 | 87.9 |
| sat-3l-sm | 96.5 | 93.5 |

Source: https://github.com/segment-any-text/wtpsplit

We selected "sat-1l-sm" over "sat-1l" because it provides better multilingual performance with only a small trade-off in English accuracy.

@jirastorza (Author) commented Oct 15, 2025

Somehow, changing the model has led to larger chunks, which apparently causes the context window size to be exceeded (this did not happen when benchmarking).

@lsorber (Member) commented Oct 19, 2025

My guess at why the tests are failing is that sat-1l-sm yields larger sentences and/or chunks, which end up taking up more context than we have room for in the tests. EDIT: Just saw your comment too @jirastorza!

@emilradix (Contributor) commented Oct 20, 2025

EDIT: the bug seems to be in the clip function: if the last message exceeds max_tokens, then first_message is set to zero, hence we return all messages.
https://github.com/superlinear-ai/raglite/blob/main/src/raglite/_rag.py#L88-L92

So if I understand correctly, the reason the test is failing is that we are not specifying the model's max_tokens in the test config?
Edit: actually, this should be inferred automatically here:
https://github.com/superlinear-ai/raglite/blob/main/src/raglite/_litellm.py#L329-L348
which seems to be failing.

If this were detected correctly in test_rag.py, would the text get clipped at max_tokens?
https://github.com/superlinear-ai/raglite/blob/main/src/raglite/_rag.py#L188

@lsorber @jirastorza @Robbe-Superlinear Is this correct?

If so, this leads to a perhaps more important point: surely we do not want to simply allow the context length to exceed the maximum token size and then clip the input down to the context size. We need a cleverer way to limit this.

For example, if we are passing chunk spans, the actual chunk needs to be in there, which as far as I can tell is not guaranteed (e.g. chunks at the end of a section).

@jirastorza (Author) commented Oct 20, 2025

The context window is being exceeded because the system prompt, retrieved chunk spans, and user prompt are all combined into a single message:

"""
---
You are a friendly and knowledgeable assistant that provides complete and insightful answers.
Whenever possible, use only the provided context to respond to the question at the end.
When responding, you MUST NOT reference the existence of the context, directly or indirectly.
Instead, you MUST treat the context as if its contents are entirely part of your working memory.
---

<context>
{context}
</context>

{user_prompt}
"""

In a new conversation, this results in only one message in the messages list. The _clip method is designed to remove entire messages from the left when the total token count exceeds the context window. However, when there is only one (very large) message, the cumulative token count already exceeds max_tokens at its first element, so np.searchsorted(cum_tokens, max_tokens) returns 0 and the slice messages[-0:] evaluates to the entire messages list; nothing is clipped. This means that if the message itself is too large, _clip cannot reduce its size, and the context window is still exceeded.
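
To make the failure mode concrete, here is a minimal, self-contained sketch of this kind of clipping logic (an approximation with a crude tokenizer, not the exact raglite code):

```python
import numpy as np


def clip(messages: list[str], max_tokens: int) -> list[str]:
    """Simplified left-clipping: keep only the most recent messages that fit in max_tokens."""
    # Cumulative token counts from the most recent message backwards (~4 characters per token).
    cum_tokens = np.cumsum([len(m) // 4 for m in reversed(messages)])
    # Number of trailing messages that still fit, negated to use as a slice start.
    first_message = -int(np.searchsorted(cum_tokens, max_tokens))
    return messages[first_message:]


# If the only message already exceeds max_tokens, searchsorted returns 0,
# -0 == 0, and messages[0:] is the whole list: nothing gets clipped.
print(len(clip(["x" * 10_000], max_tokens=100)))  # 1: the oversized message is returned unchanged.
```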

Additionally, although the maximum chunk size is set to 2048, the most relevant chunks are expanded with their neighbors using the retrieve_chunk_spans method. This means the actual retrieved context is much larger than expected, further increasing the risk of exceeding the context window.

edit: To address the context window issue, I will create a new branch named fix/context_window_exceeded and open a PR with a solution: when the context window is exceeded, the user will be warned and advised to reduce the number of retrieved chunks or use a model with a larger context window.

Do you agree? @emilradix @Robbe-Superlinear @lsorber

Copilot AI left a comment

Pull Request Overview

This PR changes the sentence splitting model from "sat-3l-sm" to "sat-1l-sm" to optimize the trade-off between model size and accuracy. Benchmark results show that the smaller model achieves comparable performance on the CUAD dataset while reducing resource requirements.

Key Changes:

  • Switched to a lighter-weight sentence segmentation model ("sat-1l-sm") that provides better multilingual support with minimal impact on English accuracy
  • Updated inline comment to reflect the new model's characteristics


@jirastorza marked this pull request as draft October 24, 2025 09:58
@jirastorza (Author) commented

Let's first solve the exceeded context window problem in the fix/context_window_exceeded branch. Once that is solved, the tests for this branch should pass.

@jirastorza closed this Oct 27, 2025
@jirastorza reopened this Oct 27, 2025
@jirastorza changed the title from "feat: change sentence splitter to sat-1l-sm" to "feat: change sentence splitter to sat-1l-sm and fix context management." Oct 27, 2025
@jirastorza changed the title from "feat: change sentence splitter to sat-1l-sm and fix context management." to "feat: change sentence splitter to sat-1l-sm and fix: context management." Oct 27, 2025
@jirastorza (Author) commented

I merged the fix/context_window_exceeded branch into this branch, so now we change the sentence splitter and solve the context window problem in the same PR.
