Conversation

@ofermend (Collaborator)

To enable controlling the % of queries from each category.

Copilot finished reviewing on behalf of ofermend November 23, 2025 06:40
Copilot AI (Contributor) left a comment


Pull request overview

This PR enhances query generation capabilities by refining question type descriptions, adding support for reproducible random sampling via a seed parameter, and documenting custom prompt template functionality for Vectara.

Key Changes:

  • Updated question type descriptions to be more precise (e.g., "factual questions with clear, direct answers" instead of "answerable directly from the text")
  • Added seed parameter to generate() and _post_process_questions() methods for reproducible question sampling
  • Documented custom prompt template feature in README with three configuration options (file-based JSON, file-based plain text, and inline YAML)

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/query_generation/test_llm_generator.py | Updated test assertions to match new question type descriptions |
| open_rag_eval/query_generation/llm_generator.py | Refined question type instructions, added seed parameter for reproducibility, improved prompt clarity |
| open_rag_eval/connectors/vectara_connector.py | Added debug logging and json import (though JSON parsing not yet implemented) |
| README.md | Added comprehensive documentation for custom prompt template configuration options |
Comments suppressed due to low confidence (1)

README.md:171

  • The documentation claims "Automatic JSON parsing: JSON files are automatically parsed and sent as structured data (not strings)", but the implementation at lines 135-137 reads the file content as plain text without JSON parsing. The code should attempt to parse JSON files:
```python
with open(prompt_template_file, "r", encoding="utf-8") as f:
    prompt_content = f.read().strip()
    try:
        # Try to parse as JSON
        generation_dict["prompt_template"] = json.loads(prompt_content)
    except json.JSONDecodeError:
        # Fall back to plain text if not valid JSON
        generation_dict["prompt_template"] = prompt_content
```

This would align the implementation with the documented behavior.
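The parse-or-fallback pattern above can be exercised on its own. This is a standalone sketch, not project code: the helper name `load_prompt_template` is illustrative.

```python
import json

def load_prompt_template(text: str):
    """Return structured data for valid JSON input, the raw text otherwise."""
    text = text.strip()
    try:
        return json.loads(text)  # valid JSON becomes lists/dicts
    except json.JSONDecodeError:
        return text  # anything else passes through as plain text

# A JSON file body is parsed into structured data...
structured = load_prompt_template('[{"role": "system", "content": "Be concise."}]')
# ...while a plain-text template survives unchanged.
plain = load_prompt_template("Answer using only the supplied context.")
```

Because `json.JSONDecodeError` is the only exception caught, genuinely malformed input still falls back to plain text rather than raising.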

Or create a `.env` file in your working directory with these variables (the CLI will automatically load it):


```diff
@@ -1,4 +1,5 @@
 from itertools import islice
+import json
```

Copilot AI Nov 23, 2025


[nitpick] The import statement for json is not in alphabetical order. Python imports should be organized alphabetically within their grouping (standard library imports). The json import should come before logging and after itertools.
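Under that convention, the standard-library import group would read as follows (a sketch; the module is assumed to use only these three imports):

```python
# Standard-library imports, alphabetized by module name within the group:
# itertools < json < logging.
from itertools import islice
import json
import logging
```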

```python
# DEBUG LOGGING
if "prompt_template" in generation_dict:
    pt = generation_dict["prompt_template"]
    print(f"DEBUG_PAYLOAD: query_id={query.get('queryId')} prompt_template_type={type(pt)} value={str(pt)}")
```

Copilot AI Nov 23, 2025


Debug logging using print() should not be committed to the codebase. Use the logger instance already available in this module (defined at line 15) instead:

```python
logger.debug("DEBUG_PAYLOAD: query_id=%s prompt_template_type=%s value=%s",
             query.get('queryId'), type(pt).__name__, str(pt))
```

This follows Python logging best practices and allows proper log level control.
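A self-contained sketch of that pattern (the logger name here is illustrative; the module's actual logger is defined near its top):

```python
import logging

# Module-level logger, the standard Python pattern the comment refers to.
logger = logging.getLogger("vectara_connector_demo")
logging.basicConfig(level=logging.INFO)

pt = {"role": "system"}

# At INFO level this call is effectively a no-op: the arguments are not even
# interpolated into the format string, unlike an unconditional print().
logger.debug("DEBUG_PAYLOAD: prompt_template_type=%s value=%s", type(pt).__name__, pt)

# Flipping the level surfaces the payload only when debugging is needed.
logger.setLevel(logging.DEBUG)
```

Passing the arguments separately (rather than pre-formatting with an f-string) defers string formatting until the log record is actually emitted.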

Suggested change:

```diff
-print(f"DEBUG_PAYLOAD: query_id={query.get('queryId')} prompt_template_type={type(pt)} value={str(pt)}")
+logger.debug("DEBUG_PAYLOAD: query_id=%s prompt_template_type=%s value=%s", query.get('queryId'), type(pt).__name__, str(pt))
```

```diff
     min_words: int,
-    max_words: int
+    max_words: int,
+    seed: Optional[int] = None
```

Copilot AI Nov 23, 2025


The seed parameter is missing from the function's docstring. The Args section should include documentation for the seed parameter:

    seed: Random seed for reproducible sampling (None for random)

```diff
@@ -1,4 +1,5 @@
 from itertools import islice
+import json
```

Copilot AI Nov 23, 2025


Import of 'json' is not used.

Suggested change:

```diff
-import json
```

@ofermend ofermend requested a review from Copilot November 23, 2025 15:34
Copilot finished reviewing on behalf of ofermend November 23, 2025 15:36
Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

open_rag_eval/query_generation/llm_generator.py:335

  • The docstring is missing documentation for the new seed parameter. Please add it to the Args section:

```
Args:
    questions: List of raw generated questions
    n_questions: Target number of questions
    min_words: Minimum words per question
    max_words: Maximum words per question
    seed: Random seed for reproducible sampling (None for random)
```

The existing docstring reads:

```python
"""
Post-process generated questions: deduplicate, filter, and sample.

Args:
    questions: List of raw generated questions
    n_questions: Target number of questions
    min_words: Minimum words per question
    max_words: Maximum words per question
```


@ofermend ofermend requested review from Copilot and vish119 November 24, 2025 16:18
Copilot finished reviewing on behalf of ofermend November 24, 2025 16:20
Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

open_rag_eval/query_generation/llm_generator.py:338

  • The docstring for _post_process_questions is missing documentation for the new seed parameter. Please add it to the Args section:

```
Args:
    questions: List of raw generated questions
    n_questions: Target number of questions
    min_words: Minimum words per question
    max_words: Maximum words per question
    seed: Random seed for reproducible sampling (None for random)
```

The existing docstring reads:

```python
"""
Post-process generated questions: deduplicate, filter, and sample.

Args:
    questions: List of raw generated questions
    n_questions: Target number of questions
    min_words: Minimum words per question
    max_words: Maximum words per question

Returns:
    Filtered and sampled list of questions
"""
```


README.md Outdated
```yaml
  # ... other generation settings
```

The connector will automatically detect and parse JSON files, sending them as structured JSON arrays to Vectara's API.

Copilot AI Nov 24, 2025


The README claims "The connector will automatically detect and parse JSON files, sending them as structured JSON arrays to Vectara's API", but the actual implementation in vectara_connector.py (lines 130-142) only reads the file content as a string with `f.read().strip()`. There is no JSON parsing logic to convert JSON files into structured data. Either the README documentation needs to be corrected to remove the claim about automatic JSON parsing, or the implementation needs to be updated to actually parse JSON files.

Suggested change:

```diff
-The connector will automatically detect and parse JSON files, sending them as structured JSON arrays to Vectara's API.
+The connector will read the file content and send it as-is to Vectara's API. If the file contains valid JSON, it will be sent as a JSON string; otherwise, it will be sent as plain text.
```

Comment on lines +361 to 362
```python
random.seed(seed)
filtered_questions = random.sample(filtered_questions, n_questions)
```

Copilot AI Nov 24, 2025


Setting the global random seed inside a conditional block can cause non-deterministic behavior across multiple calls. If this method is called multiple times with the same seed, only the first call to random.sample() will use the intended seed - subsequent calls will use whatever state the random number generator was left in. Consider using random.Random(seed).sample() instead to create an isolated random instance:

```python
if seed is not None:
    filtered_questions = random.Random(seed).sample(filtered_questions, n_questions)
else:
    filtered_questions = random.sample(filtered_questions, n_questions)
```
Suggested change:

```diff
-    random.seed(seed)
-    filtered_questions = random.sample(filtered_questions, n_questions)
+    filtered_questions = random.Random(seed).sample(filtered_questions, n_questions)
+else:
+    filtered_questions = random.sample(filtered_questions, n_questions)
```
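The difference is easy to check in isolation. This standalone sketch (not project code) shows that an isolated `random.Random(seed)` reproduces the same sample on every call, regardless of the global generator's state:

```python
import random

population = [f"q{i}" for i in range(100)]  # stand-in for generated questions

# Two independent Random instances built from the same seed agree exactly.
first = random.Random(42).sample(population, 5)
random.seed()  # perturb the global generator to show it has no effect here
second = random.Random(42).sample(population, 5)
```

By contrast, `random.seed(seed)` mutates shared global state, so any other code drawing from the module-level generator between calls breaks reproducibility.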

@ofermend ofermend merged commit 3ef5f09 into dev Nov 25, 2025
1 check passed