feat: Align configuration for inference and evaluation #61
base: main
Conversation
Partial review on the config modifications only. I chose to focus on this first as it will affect the remainder of the PR as well. My goal is to minimise the additions to the config dataclass; I'd like to keep it as simple as we can make it. Every parameter should be self-explanatory.
src/raglite/_config.py (outdated)
@@ -53,6 +67,12 @@ class RAGLiteConfig:
         ),
         compare=False,  # Exclude the reranker from comparison to avoid lru_cache misses.
     )
+    search_method: "SearchMethod" = field(default_factory=_default_search_method, compare=False)
+    system_prompt: str | None = None
Which system prompt is this? Can we leave it out of the config? I don't believe we use one for RAG currently.
It is the one for the specific use case. It contains information about the assistant role, language, style, etc. I think it is important to keep it here and make it available to the evaluation, as it contains valuable information. My aim is also that, just by modifying the Config, one could switch from one use case to another without modifying anything else.
In other words, everything that could be modified to improve performance should be in Config (num_chunks, search method and prompts).
I am using it like this:
messages = [
    {"role": "system", "content": config.system_prompt},
    *history,
    create_rag_instruction(
        user_prompt=user_prompt,
        context=chunk_spans,
        config=config,
    ),
]
I have the feeling we'll end up with potentially many system prompts in the future, and then it will be difficult to distinguish between them. Can we solve this with the same partial trick?
I don't know how to use the same trick (any suggestions?). I don't think that we should put the assistant instructions (role, language, tone, format, examples, etc.) in the RAG instructions, as they are immutable. One option would be to leave it outside of the configuration as an application-specific feature, but then the evaluation would need to take care of the system prompt for the answer and evaluation phases.
Maybe we could configure create_rag_instruction and create_system_prompt as (partial) functions (I would call them system_prompt and user_prompt).
To be clear: I consider that the evaluation pipeline should take into account the same system prompt that is going to be used during inference. Imagine that the system_prompt says that all the answers should be in Dutch; this information should be taken into account by the answer generation for evaluation.
Another thing I want is to configure everything in a single configuration class. This way, switching between different versions or use cases becomes trivial.
Finally (a long shot maybe), having access to the system_prompt, which describes the particular use case, could be useful for other RAG phases. We could leverage the system prompt to augment the chunks with contextual information, hypothetical questions, keywords, etc.
TL;DR: I am willing to change how it is configured (partial or another method, as sketched below), but I think that it should be included in the configuration.
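For illustration, a minimal sketch of how the partial trick could look; the names build_system_prompt and MyConfig are hypothetical, not existing RAGLite API:

from dataclasses import dataclass, field
from functools import partial
from typing import Callable

def build_system_prompt(role: str, language: str) -> str:
    # Hypothetical helper: renders the use-case-specific assistant instructions.
    return f"You are a {role}. Always answer in {language}."

@dataclass(frozen=True)
class MyConfig:
    # The use case is frozen into the config via functools.partial,
    # so switching use cases only means swapping the config instance.
    system_prompt: Callable[[], str] = field(
        default=partial(build_system_prompt, role="support assistant", language="Dutch")
    )

config = MyConfig()
messages = [{"role": "system", "content": config.system_prompt()}]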
Big refactoring to prevent cyclic dependencies. Not fully convinced about the interface yet. In particular I don't like …
Another review round.
-    answered_evals: pd.DataFrame | int = 100, config: RAGLiteConfig | None = None
+    answered_evals_df: pd.DataFrame,
+    *,
+    metrics: Sequence[Any] | None,
I think this should be a Sequence[Metric] | None, or maybe even a list[Metric] | None, where from ragas.metrics.base import Metric.
Yes, but as ragas is an optional dependency I used Any. A better solution would be to assign Metric conditionally to Any or ragas.metrics.base.Metric, depending on the presence of ragas.
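For example, a sketch of that conditional assignment (assuming ragas exposes ragas.metrics.base.Metric, as mentioned above):

from collections.abc import Sequence
from typing import Any

try:
    from ragas.metrics.base import Metric  # real type when the optional dependency is installed
except ImportError:
    Metric = Any  # fallback so the annotation still evaluates without ragas

def evaluate(answered_evals_df: "pd.DataFrame", *, metrics: Sequence[Metric] | None = None) -> None:
    ...  # signature trimmed to the relevant part for illustration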
     strict: bool = False,  # noqa: FBT001,FBT002
-    config: RAGLiteConfig | None = None,
+    *,
+    config: "RAGLiteConfig",
I would put the * above strict, as a positional True or False without the keyword argument name would not be very clear.
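Concretely, the suggested ordering would look like this (the surrounding function is trimmed and its name is shown only for illustration; the parameter order is the point):

def insert_evals(  # hypothetical name
    *,
    strict: bool = False,  # now keyword-only: callers must write strict=True explicitly
    config: "RAGLiteConfig",
) -> None:
    ...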
-    search: SearchMethod = hybrid_search,
-    config: RAGLiteConfig | None = None,
+    search: "ChunkSearchMethod",
+    rerank: Optional["ChunkRerankingMethod"] = None,
Python 3.10+:
-    rerank: Optional["ChunkRerankingMethod"] = None,
+    rerank: "ChunkRerankingMethod" | None = None,
    if rerank:
        chunks = rerank(query, chunk_ids=chunk_ids, config=config)
    else:
        chunks = retrieve_chunks(chunk_ids, config=config)
    context = retrieve_chunk_spans(chunks, chunk_neighbors=chunk_neighbors, config=config)
    return context[:max_chunk_spans]
This is the algorithm as implemented:

- Let's say you retrieve 40 chunks with search.
- Then you rerank those 40 chunks with rerank.
- Then you retrieve all chunk spans from those 40 reranked chunks.
- Only then do you limit the number of chunk spans to the desired amount (or not at all).

There are two downsides to this algorithm:

- Retrieving all 40 chunks is relatively expensive if you intend to throw away a number of chunk spans afterwards.
- If you don't throw away any chunk spans (which is the default), then reranking doesn't do very much, as the LLM gets to see the same chunks whether you rerank or not!

To address both issues, I think we should introduce a max_chunks: int argument and update the algorithm as follows:

- Let's say you retrieve 40 chunks with search.
- Then you rerank those 40 chunks with rerank, and keep only the top max_chunks.
- Then you retrieve all chunk spans from the top max_chunks.
- Optional: limit the number of chunk spans to max_chunk_spans (with a default of None as is the case now).
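A minimal sketch of that proposed flow, reusing the names from this thread (search, rerank, retrieve_chunks, retrieve_chunk_spans); the exact signatures and defaults are assumptions, and imports of the helpers named in the diff are omitted:

def retrieve_rag_context(
    query: str,
    *,
    search: "ChunkSearchMethod",
    rerank: "ChunkRerankingMethod | None" = None,
    max_chunks: int = 10,
    max_chunk_spans: int | None = None,
    chunk_neighbors: tuple[int, int] = (-1, 1),
    config: "RAGLiteConfig | None" = None,
) -> list["ChunkSpan"]:
    chunk_ids = search(query, config=config)  # e.g. 40 candidate chunks
    if rerank:
        # Rerank the candidates and keep only the top max_chunks.
        chunks = rerank(query, chunk_ids=chunk_ids, config=config)[:max_chunks]
    else:
        chunks = retrieve_chunks(chunk_ids, config=config)[:max_chunks]
    # Expand only the retained chunks into chunk spans (with their neighbours).
    context = retrieve_chunk_spans(chunks, chunk_neighbors=chunk_neighbors, config=config)
    # Optionally limit the number of chunk spans, as before.
    return context if max_chunk_spans is None else context[:max_chunk_spans]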
I am not sure if I understand. max_chunks is always set to a value and search always returns [:max_chunks], therefore step 2 and step 3 are also limited to [:max_chunks] in the original algorithm.

Regarding the downsides:

- Chunk spans consist of a list of contiguous chunks retrieved by search (+ neighbours). We need to compose all the chunk spans because we need to assign a pooled score using the sum of reciprocal rankings of the chunks, and then return the top max_chunk_spans (see the small sketch after the configuration below). Maybe you are thinking that for some aggregated span scores we can be sure they will never be in the top, so we could skip the neighbours retrieval phase; indeed, but the algorithm is not trivial and depends on the chunk score pooling function (e.g. if we decided that span_score = max(chunk_scores), it would be trivial).
- You are right here that the default makes the reranking less useful. Maybe it would be better not to allow None in max_chunk_spans.
For instance, I have been using this configuration:
retrieval = partial(
    retrieve_rag_context,
    max_chunk_spans=3,
    search_method=partial(
        hybrid_search,
        max_chunks=20,
    ),
    rerank=rerank_chunks,
    chunk_neighbors=(-1, 1),
)
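And the small pooled-score sketch referenced above; the helper name and data shapes are illustrative only, not the actual implementation:

def span_score(chunk_ranks: list[int]) -> float:
    # Pool a span's score as the sum of reciprocal ranks of the chunks it contains.
    return sum(1.0 / rank for rank in chunk_ranks)

# A span containing the chunks ranked 1st and 4th by search:
assert span_score([1, 4]) == 1.0 + 0.25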
 def create_rag_instruction(
     user_prompt: str,
     context: list[ChunkSpan],
     *,
-    rag_instruction_template: str = RAG_INSTRUCTION_TEMPLATE,
+    rag_instruction_template: str,
Is there a reason to remove the default?
I think that the RAG_INSTRUCTION_TEMPLATE is quite specific and the instructions should be in a system prompt. If we want to keep it, I think that something more general makes more sense. I am using the one below and adding the instructions to the system prompt (doing so also helps with caching):
rag_instruction_template = "{user_prompt}\n\n{context}"
    rag_instruction_template: str | None = None,
) -> list[dict[str, str]]:
    """Compose a list of messages to generate a response."""
    messages = [
        *([{"role": "system", "content": system_prompt}] if system_prompt else []),
        *(history or []),
        create_rag_instruction(
            user_prompt=user_prompt,
            context=context if context else [],
            rag_instruction_template=rag_instruction_template or DEFAULT_RAG_INSTRUCTION_TEMPLATE,
If we do end up keeping this function, why not:
-    rag_instruction_template: str | None = None,
+    rag_instruction_template: str = DEFAULT_RAG_INSTRUCTION_TEMPLATE,
 ) -> list[dict[str, str]]:
     """Compose a list of messages to generate a response."""
     messages = [
         *([{"role": "system", "content": system_prompt}] if system_prompt else []),
         *(history or []),
         create_rag_instruction(
             user_prompt=user_prompt,
             context=context if context else [],
-            rag_instruction_template=rag_instruction_template or DEFAULT_RAG_INSTRUCTION_TEMPLATE,
+            rag_instruction_template=rag_instruction_template,
) -> list[dict[str, str]]:
    """Compose a list of messages to generate a response."""
    messages = [
        *([{"role": "system", "content": system_prompt}] if system_prompt else []),
If a system prompt is provided, this will prevent prompt caching. And if we were to drop this system prompt feature, then compose_rag_messages is equivalent to messages.append(create_rag_instruction(user_prompt, context)). Therefore, I think we can actually do without compose_rag_messages, or is there a good reason to have this?
I don't know how this prevents (prefix) prompt caching. If system_prompt is static, at least with OpenAI prompt caching, the common token prefix will be cached. Maybe there is confusion with the syntax I used; it is just a compact way of adding the system prompt conditionally (using a list for this is weird, I reckon, but it is a way of expressing "nothing" to be added).
I use this method in a couple of places, including eval. I think it is a good abstraction to keep, as it removes the cognitive load of remembering in which order and format the messages should be constructed: just give it the user prompt, the system prompt, the history and the contexts, and it takes care of composing the right format for you. This method could also be extended to manage a maximum number of messages, filter out tool results in history, add special caching mechanisms as in Claude, etc.
Finally, about dropping the system prompt for good: I think that any realistic scenario should give the assistant a role, guidelines, style, language and response formatting instructions. I find the alternative of adding them to every user prompt unsatisfactory. Can you explain why you prefer one over the other?
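For reference, this is roughly how the abstraction is meant to be used; the argument values are made up and the signature is inferred from the diff above:

messages = compose_rag_messages(
    user_prompt=user_prompt,
    system_prompt="You are a support assistant. Answer in Dutch.",  # hypothetical prompt
    history=history,        # prior chat messages, if any
    context=chunk_spans,    # list[ChunkSpan] from retrieval
)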
@@ -270,28 +273,41 @@ class Eval(SQLModel, table=True):
     document: Document = Relationship(back_populates="evals")

     @staticmethod
-    def from_chunks(
-        question: str, contexts: list[Chunk], ground_truth: str, **kwargs: Any
+    def from_contexts(
Would call it from_context (singular) for consistency with the context: list[ChunkSpan] arguments elsewhere.
Ensure that evaluation can be configured using the same configuration as inference.
#60