Adjusted query generation #150
Conversation
Pull request overview
This PR enhances query generation capabilities by refining question type descriptions, adding support for reproducible random sampling via a seed parameter, and documenting custom prompt template functionality for Vectara.
Key Changes:
- Updated question type descriptions to be more precise (e.g., "factual questions with clear, direct answers" instead of "answerable directly from the text")
- Added `seed` parameter to `generate()` and `_post_process_questions()` methods for reproducible question sampling
- Documented custom prompt template feature in README with three configuration options (file-based JSON, file-based plain text, and inline YAML)
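The reproducible-sampling behavior the new `seed` parameter enables can be sketched as a small standalone helper (illustrative only; `sample_questions` is a hypothetical name, not the PR's actual method):

```python
import random
from typing import List, Optional

def sample_questions(questions: List[str], n: int, seed: Optional[int] = None) -> List[str]:
    # An isolated Random instance keeps the global RNG state untouched,
    # so the same seed always yields the same sample.
    rng = random.Random(seed)
    return rng.sample(questions, n)

pool = [f"question {i}" for i in range(10)]
a = sample_questions(pool, 3, seed=42)
b = sample_questions(pool, 3, seed=42)
assert a == b  # same seed, same sample
```

With `seed=None`, `random.Random(None)` seeds from system entropy, so sampling stays random by default.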
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `tests/query_generation/test_llm_generator.py` | Updated test assertions to match new question type descriptions |
| `open_rag_eval/query_generation/llm_generator.py` | Refined question type instructions, added seed parameter for reproducibility, improved prompt clarity |
| `open_rag_eval/connectors/vectara_connector.py` | Added debug logging and json import (though JSON parsing not yet implemented) |
| `README.md` | Added comprehensive documentation for custom prompt template configuration options |
Comments suppressed due to low confidence (1)
README.md:171
- The documentation claims "Automatic JSON parsing: JSON files are automatically parsed and sent as structured data (not strings)", but the implementation at lines 135-137 reads the file content as plain text without JSON parsing. The code should attempt to parse JSON files:

```python
with open(prompt_template_file, "r", encoding="utf-8") as f:
    prompt_content = f.read().strip()
try:
    # Try to parse as JSON
    generation_dict["prompt_template"] = json.loads(prompt_content)
except json.JSONDecodeError:
    # Fall back to plain text if not valid JSON
    generation_dict["prompt_template"] = prompt_content
```

This would align the implementation with the documented behavior.
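As a standalone check, the parse-with-fallback pattern suggested above behaves like this (the `load_prompt_template` helper is hypothetical, extracted here for illustration):

```python
import json

def load_prompt_template(raw: str):
    """Return parsed JSON when the content is valid JSON, else the raw text."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw

# A JSON file's content becomes structured data...
assert load_prompt_template('[{"role": "user"}]') == [{"role": "user"}]
# ...while a plain-text template passes through unchanged.
assert load_prompt_template("plain text template") == "plain text template"
```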
```diff
@@ -1,4 +1,5 @@
 from itertools import islice
+import json
```
Copilot AI (Nov 23, 2025):
[nitpick] The import statement for json is not in alphabetical order. Python imports should be organized alphabetically within their grouping (standard library imports). The json import should come before logging and after itertools.
```python
# DEBUG LOGGING
if "prompt_template" in generation_dict:
    pt = generation_dict["prompt_template"]
    print(f"DEBUG_PAYLOAD: query_id={query.get('queryId')} prompt_template_type={type(pt)} value={str(pt)}")
```
Copilot AI (Nov 23, 2025):
Debug logging using print() should not be committed to the codebase. Use the logger instance already available in this module (defined at line 15) instead:

```python
logger.debug("DEBUG_PAYLOAD: query_id=%s prompt_template_type=%s value=%s",
             query.get('queryId'), type(pt).__name__, str(pt))
```

This follows Python logging best practices and allows proper log level control.
Suggested change:

```diff
-print(f"DEBUG_PAYLOAD: query_id={query.get('queryId')} prompt_template_type={type(pt)} value={str(pt)}")
+logger.debug("DEBUG_PAYLOAD: query_id=%s prompt_template_type=%s value=%s", query.get('queryId'), type(pt).__name__, str(pt))
```
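The practical benefit of `logger.debug` over `print` is level control: the message is suppressed until DEBUG logging is enabled for that logger. A minimal sketch (the logger name here is hypothetical):

```python
import logging

logger = logging.getLogger("vectara_connector")
logging.basicConfig(level=logging.INFO)

pt = {"role": "user"}
# Silent while the effective level is INFO.
logger.debug("DEBUG_PAYLOAD: prompt_template_type=%s value=%s", type(pt).__name__, pt)

# Lowering the level to DEBUG makes subsequent debug calls visible,
# without touching the call sites themselves.
logger.setLevel(logging.DEBUG)
logger.debug("DEBUG_PAYLOAD: prompt_template_type=%s value=%s", type(pt).__name__, pt)
```

Note also the `%s` lazy-formatting style: the message string is only interpolated when the record is actually emitted.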
```diff
     min_words: int,
-    max_words: int
+    max_words: int,
+    seed: Optional[int] = None
```
Copilot AI (Nov 23, 2025):
The seed parameter is missing from the function's docstring. The Args section should include documentation for the seed parameter:

```
seed: Random seed for reproducible sampling (None for random)
```

```diff
@@ -1,4 +1,5 @@
 from itertools import islice
+import json
```
Copilot AI (Nov 23, 2025):
Import of 'json' is not used.

Suggested change:

```diff
-import json
```
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
open_rag_eval/query_generation/llm_generator.py:335
- The docstring is missing documentation for the new `seed` parameter. Please add it to the Args section:

```
Args:
    questions: List of raw generated questions
    n_questions: Target number of questions
    min_words: Minimum words per question
    max_words: Maximum words per question
    seed: Random seed for reproducible sampling (None for random)
```

The current docstring reads:

```python
"""
Post-process generated questions: deduplicate, filter, and sample.

Args:
    questions: List of raw generated questions
    n_questions: Target number of questions
    min_words: Minimum words per question
    max_words: Maximum words per question
"""
```
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
open_rag_eval/query_generation/llm_generator.py:338
- The docstring for `_post_process_questions` is missing documentation for the new `seed` parameter. Please add it to the Args section:

```
Args:
    questions: List of raw generated questions
    n_questions: Target number of questions
    min_words: Minimum words per question
    max_words: Maximum words per question
    seed: Random seed for reproducible sampling (None for random)
```

The current docstring reads:

```python
"""
Post-process generated questions: deduplicate, filter, and sample.

Args:
    questions: List of raw generated questions
    n_questions: Target number of questions
    min_words: Minimum words per question
    max_words: Maximum words per question

Returns:
    Filtered and sampled list of questions
"""
```
README.md (outdated)

```
# ... other generation settings
```

The connector will automatically detect and parse JSON files, sending them as structured JSON arrays to Vectara's API.
Copilot AI (Nov 24, 2025):
The README claims "The connector will automatically detect and parse JSON files, sending them as structured JSON arrays to Vectara's API" but the actual implementation in vectara_connector.py (lines 130-142) only reads the file content as a string with f.read().strip(). There is no JSON parsing logic to convert JSON files into structured data. Either the README documentation needs to be corrected to remove the claim about automatic JSON parsing, or the implementation needs to be updated to actually parse JSON files.
Suggested change:

```diff
-The connector will automatically detect and parse JSON files, sending them as structured JSON arrays to Vectara's API.
+The connector will read the file content and send it as-is to Vectara's API. If the file contains valid JSON, it will be sent as a JSON string; otherwise, it will be sent as plain text.
```
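The distinction the suggested wording draws (a JSON-encoded string versus structured data) can be illustrated with a toy payload (variable names here are hypothetical, not from the connector):

```python
import json

prompt_content = '[{"role": "system", "content": "Answer briefly."}]'

# As-is (current behavior): the template travels as one JSON-encoded string.
payload_string = {"prompt_template": prompt_content}

# Parsed (documented behavior): the template travels as a structured array.
payload_parsed = {"prompt_template": json.loads(prompt_content)}

assert isinstance(payload_string["prompt_template"], str)
assert isinstance(payload_parsed["prompt_template"], list)
```

Whether the receiving API treats these two shapes the same is exactly what the documentation and implementation need to agree on.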
```python
random.seed(seed)
filtered_questions = random.sample(filtered_questions, n_questions)
```
Copilot AI (Nov 24, 2025):
Setting the global random seed inside a conditional block can cause non-deterministic behavior across multiple calls. If this method is called multiple times with the same seed, only the first call to random.sample() will use the intended seed; subsequent calls will use whatever state the random number generator was left in. Consider using random.Random(seed).sample() instead to create an isolated random instance:

```python
if seed is not None:
    filtered_questions = random.Random(seed).sample(filtered_questions, n_questions)
else:
    filtered_questions = random.sample(filtered_questions, n_questions)
```

Suggested change:

```diff
-    random.seed(seed)
-    filtered_questions = random.sample(filtered_questions, n_questions)
+    filtered_questions = random.Random(seed).sample(filtered_questions, n_questions)
+else:
+    filtered_questions = random.sample(filtered_questions, n_questions)
```
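The pitfall can be demonstrated directly: an isolated `random.Random(seed)` reproduces the same sample on every call, while seeding the global RNG only pins the very next draw (a sketch, not the PR's code):

```python
import random

pool = list(range(100))

def sample_isolated(seed):
    # Fresh, isolated RNG per call: fully reproducible.
    return random.Random(seed).sample(pool, 5)

assert sample_isolated(7) == sample_isolated(7)

# Global seeding: the RNG state advances after the first draw,
# so a second draw with no re-seed yields a different sample.
random.seed(7)
first = random.sample(pool, 5)
second = random.sample(pool, 5)
assert first != second
```

The isolated instance also avoids perturbing any other code that relies on the module-level RNG.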
Co-authored-by: Copilot <[email protected]>
To enable controlling the percentage of queries drawn from each category.