Adjusted query generation #150
Conversation
Pull request overview
This PR enhances query generation capabilities by refining question type descriptions, adding support for reproducible random sampling via a seed parameter, and documenting custom prompt template functionality for Vectara.
Key Changes:
- Updated question type descriptions to be more precise (e.g., "factual questions with clear, direct answers" instead of "answerable directly from the text")
- Added `seed` parameter to `generate()` and `_post_process_questions()` methods for reproducible question sampling
- Documented custom prompt template feature in README with three configuration options (file-based JSON, file-based plain text, and inline YAML)
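The reproducible-sampling behavior the new `seed` parameter enables can be sketched as a small standalone helper (illustrative only; `sample_questions` is a hypothetical name, not the PR's actual method):

```python
import random
from typing import List, Optional

def sample_questions(questions: List[str], n: int, seed: Optional[int] = None) -> List[str]:
    # An isolated Random instance keeps the global RNG state untouched,
    # so the same seed always yields the same sample.
    rng = random.Random(seed)
    return rng.sample(questions, n)

pool = [f"question {i}" for i in range(10)]
a = sample_questions(pool, 3, seed=42)
b = sample_questions(pool, 3, seed=42)
assert a == b  # same seed, same sample
```

With `seed=None`, `random.Random(None)` seeds from system entropy, so sampling stays random by default.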
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `tests/query_generation/test_llm_generator.py` | Updated test assertions to match new question type descriptions |
| `open_rag_eval/query_generation/llm_generator.py` | Refined question type instructions, added seed parameter for reproducibility, improved prompt clarity |
| `open_rag_eval/connectors/vectara_connector.py` | Added debug logging and json import (though JSON parsing not yet implemented) |
| `README.md` | Added comprehensive documentation for custom prompt template configuration options |
Comments suppressed due to low confidence (1)
README.md:171
- The documentation claims "Automatic JSON parsing: JSON files are automatically parsed and sent as structured data (not strings)", but the implementation at lines 135-137 reads the file content as plain text without JSON parsing. The code should attempt to parse JSON files:

```python
with open(prompt_template_file, "r", encoding="utf-8") as f:
    prompt_content = f.read().strip()
try:
    # Try to parse as JSON
    generation_dict["prompt_template"] = json.loads(prompt_content)
except json.JSONDecodeError:
    # Fall back to plain text if not valid JSON
    generation_dict["prompt_template"] = prompt_content
```

This would align the implementation with the documented behavior.
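As a standalone check, the parse-with-fallback pattern suggested above behaves like this (the `load_prompt_template` helper is hypothetical, extracted here for illustration):

```python
import json

def load_prompt_template(raw: str):
    """Return parsed JSON when the content is valid JSON, else the raw text."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw

# A JSON file's content becomes structured data...
assert load_prompt_template('[{"role": "user"}]') == [{"role": "user"}]
# ...while a plain-text template passes through unchanged.
assert load_prompt_template("plain text template") == "plain text template"
```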
```diff
@@ -1,4 +1,5 @@
 from itertools import islice
+import json
```
Copilot AI (Nov 23, 2025):
[nitpick] The import statement for json is not in alphabetical order. Python imports should be organized alphabetically within their grouping (standard library imports). The json import should come before logging and after itertools.
```python
# DEBUG LOGGING
if "prompt_template" in generation_dict:
    pt = generation_dict["prompt_template"]
    print(f"DEBUG_PAYLOAD: query_id={query.get('queryId')} prompt_template_type={type(pt)} value={str(pt)}")
```
Copilot AI (Nov 23, 2025):
Debug logging using print() should not be committed to the codebase. Use the logger instance already available in this module (defined at line 15) instead:

```python
logger.debug("DEBUG_PAYLOAD: query_id=%s prompt_template_type=%s value=%s",
             query.get('queryId'), type(pt).__name__, str(pt))
```

This follows Python logging best practices and allows proper log level control.
Suggested change:

```diff
-print(f"DEBUG_PAYLOAD: query_id={query.get('queryId')} prompt_template_type={type(pt)} value={str(pt)}")
+logger.debug("DEBUG_PAYLOAD: query_id=%s prompt_template_type=%s value=%s", query.get('queryId'), type(pt).__name__, str(pt))
```
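The practical benefit of `logger.debug` over `print` is level control: the message is suppressed until DEBUG logging is enabled for that logger. A minimal sketch (the logger name here is hypothetical):

```python
import logging

logger = logging.getLogger("vectara_connector")
logging.basicConfig(level=logging.INFO)

pt = {"role": "user"}
# Silent while the effective level is INFO.
logger.debug("DEBUG_PAYLOAD: prompt_template_type=%s value=%s", type(pt).__name__, pt)

# Lowering the level to DEBUG makes subsequent debug calls visible,
# without touching the call sites themselves.
logger.setLevel(logging.DEBUG)
logger.debug("DEBUG_PAYLOAD: prompt_template_type=%s value=%s", type(pt).__name__, pt)
```

Note also the `%s` lazy-formatting style: the message string is only interpolated when the record is actually emitted.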
```diff
     min_words: int,
-    max_words: int
+    max_words: int,
+    seed: Optional[int] = None
```
Copilot AI (Nov 23, 2025):
The seed parameter is missing from the function's docstring. The Args section should include documentation for the seed parameter:

```
seed: Random seed for reproducible sampling (None for random)
```

```diff
@@ -1,4 +1,5 @@
 from itertools import islice
+import json
```
Copilot AI (Nov 23, 2025):
Import of 'json' is not used.

Suggested change:

```diff
-import json
```
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
open_rag_eval/query_generation/llm_generator.py:335
- The docstring is missing documentation for the new `seed` parameter. Please add it to the Args section:

```
Args:
    questions: List of raw generated questions
    n_questions: Target number of questions
    min_words: Minimum words per question
    max_words: Maximum words per question
    seed: Random seed for reproducible sampling (None for random)
```

The current docstring reads:

```python
"""
Post-process generated questions: deduplicate, filter, and sample.

Args:
    questions: List of raw generated questions
    n_questions: Target number of questions
    min_words: Minimum words per question
    max_words: Maximum words per question
"""
```
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
open_rag_eval/query_generation/llm_generator.py:338
- The docstring for `_post_process_questions` is missing documentation for the new `seed` parameter. Please add it to the Args section:

```
Args:
    questions: List of raw generated questions
    n_questions: Target number of questions
    min_words: Minimum words per question
    max_words: Maximum words per question
    seed: Random seed for reproducible sampling (None for random)
```

The current docstring reads:

```python
"""
Post-process generated questions: deduplicate, filter, and sample.

Args:
    questions: List of raw generated questions
    n_questions: Target number of questions
    min_words: Minimum words per question
    max_words: Maximum words per question

Returns:
    Filtered and sampled list of questions
"""
```
README.md (outdated)

```
# ... other generation settings
```

The connector will automatically detect and parse JSON files, sending them as structured JSON arrays to Vectara's API.
Copilot AI (Nov 24, 2025):
The README claims "The connector will automatically detect and parse JSON files, sending them as structured JSON arrays to Vectara's API" but the actual implementation in vectara_connector.py (lines 130-142) only reads the file content as a string with f.read().strip(). There is no JSON parsing logic to convert JSON files into structured data. Either the README documentation needs to be corrected to remove the claim about automatic JSON parsing, or the implementation needs to be updated to actually parse JSON files.
Suggested change:

```diff
-The connector will automatically detect and parse JSON files, sending them as structured JSON arrays to Vectara's API.
+The connector will read the file content and send it as-is to Vectara's API. If the file contains valid JSON, it will be sent as a JSON string; otherwise, it will be sent as plain text.
```
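The distinction the suggested wording draws (a JSON-encoded string versus structured data) can be illustrated with a toy payload (variable names here are hypothetical, not from the connector):

```python
import json

prompt_content = '[{"role": "system", "content": "Answer briefly."}]'

# As-is (current behavior): the template travels as one JSON-encoded string.
payload_string = {"prompt_template": prompt_content}

# Parsed (documented behavior): the template travels as a structured array.
payload_parsed = {"prompt_template": json.loads(prompt_content)}

assert isinstance(payload_string["prompt_template"], str)
assert isinstance(payload_parsed["prompt_template"], list)
```

Whether the receiving API treats these two shapes the same is exactly what the documentation and implementation need to agree on.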
```python
random.seed(seed)
filtered_questions = random.sample(filtered_questions, n_questions)
```
Copilot AI (Nov 24, 2025):
Setting the global random seed inside a conditional block can cause non-deterministic behavior across multiple calls. If this method is called multiple times with the same seed, only the first call to random.sample() will use the intended seed; subsequent calls will use whatever state the random number generator was left in. Consider using random.Random(seed).sample() instead to create an isolated random instance:

```python
if seed is not None:
    filtered_questions = random.Random(seed).sample(filtered_questions, n_questions)
else:
    filtered_questions = random.sample(filtered_questions, n_questions)
```

Suggested change:

```diff
-    random.seed(seed)
-    filtered_questions = random.sample(filtered_questions, n_questions)
+    filtered_questions = random.Random(seed).sample(filtered_questions, n_questions)
+else:
+    filtered_questions = random.sample(filtered_questions, n_questions)
```
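The pitfall can be demonstrated directly: an isolated `random.Random(seed)` reproduces the same sample on every call, while seeding the global RNG only pins the very next draw (a sketch, not the PR's code):

```python
import random

pool = list(range(100))

def sample_isolated(seed):
    # Fresh, isolated RNG per call: fully reproducible.
    return random.Random(seed).sample(pool, 5)

assert sample_isolated(7) == sample_isolated(7)

# Global seeding: the RNG state advances after the first draw,
# so a second draw with no re-seed yields a different sample.
random.seed(7)
first = random.sample(pool, 5)
second = random.sample(pool, 5)
assert first != second
```

The isolated instance also avoids perturbing any other code that relies on the module-level RNG.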
Co-authored-by: Copilot <[email protected]>
To enable controlling the percentage of queries drawn from each category.