Add Query Division Classifier to Code Search #273
base: develop
- Add a division classifier under `agents/intents`
- Add system prompt based on SDE document used in GitHub repo search
- Add division filtering support to both local and SDE code search tools

*To-Do:*
- Add division classifier to the agent flow
- Update division classifier prompt based on Rachel's feedback
- Integrate division classifier into the code search sub tools
- Enable division classifier flag at the agent level
❌ Tests failed (exit code: 1) 📊 Test Results
Branch: 📋 Full coverage report and logs are available in the workflow run.
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.division_agent = DivisionAgent(
        config=BaseAgentConfig(model_name="gpt-4o-mini", system_prompt=DIVISION_PROMPT)
    )
I thought this would be part of the agent, not the tool.
Something like:
agent = CodeSearchAgent(division_classifier=..., ...)
The filter would then be applied at the agent level. At the tool level, what we can do is provide an enum as a way to filter without any classification:
tool = CodeSearchTool(Config(division_filter=True, ...))
# `division` is passed in from upstream and may be None.
# If `division` is None, there is no need to filter; otherwise just filter on it.
await tool.arun(CodeInputSchema(query=..., division="Something"))
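To make the tool-level idea concrete, here is a minimal sketch of a deterministic, enum-driven filter. `Division`, `Repo`, and `filter_by_division` are hypothetical names used only for illustration, not the PR's actual API. The one behavior taken from the comment above is that `division=None` means no filtering.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Division(str, Enum):
    # Hypothetical enum with two members, for illustration only.
    EARTH_SCIENCE = "Earth Science"
    ASTROPHYSICS = "Astrophysics"


@dataclass(frozen=True)
class Repo:
    # Hypothetical search result that carries a division label.
    name: str
    division: Optional[Division]


def filter_by_division(repos: list[Repo], division: Optional[Division]) -> list[Repo]:
    """Deterministic filter: no LLM call, so the same input gives the same output."""
    if division is None:
        # No division supplied upstream: return everything unfiltered.
        return list(repos)
    return [r for r in repos if r.division == division]
```

Because the filter takes the division as plain input, any classification stays with the caller and the tool call itself remains replicable.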
I'd also try to avoid complicating the tools themselves. If we have this as part of the agent, a few things would be nice:
- A division classification reasoning trace from the original query
- We can use that trace and the division to drive subsequent agentic iterations
Let's have it as part of CodeSearchAgent, not the tool. I'd avoid modifying the tool to include LLM logic unless necessary. This way, the division classifier will be an intent-understanding layer at the agent level.
Reiterating:
division = await self.division_classifier.arun(query, ...) if self.division_classifier else None
results_full = []
while iteration...:
    # If a division is set, filter on it; otherwise a plain None filter might work.
    results = await self.tool.arun(..., division=division, ...)
    # We need to test whether performing classification at each iteration yields the same set of divisions. If not, the original intent might drive the iteration.
    # We could also reuse the same intent understanding to interpret the results:
    # same division classifier -> run through each result -> get division -> then filter.
    # Doing this, we know the tool is already deterministic and the tuning happens only at the agent level.
    relevant_results = await self.checker.arun(results)  # relevancy checks tuned to code search itself, something like the archetype we have for lit and data search
    results_full.extend(relevant_results)
    # This plays nicely with reformulation as well: the intent/division gives more context to drive the next iteration.
    reformulation = self.reformulate(relevant_results, division_traces, division=...)
    # perform next iteration or stop, etc.
This way, tool behaviour is deterministic and all the non-determinism comes from the LLM/agent workflow. It also lets us make the case that the agent improved because we did classification, then the tool call, then filtering, then checks, and so on. The tool tentatively remains the same and we focus only on agentic improvement. This spares us from having to pinpoint where the non-determinism comes from. If the tool is non-deterministic, we'd have a hard time replicating tool call results because there's an LLM layer in it. This is also the problem I have with LinkAssessor: replication is hard with it (but there's leeway: because we're giving it so many URLs, tens of them, some non-determinism is fine; just a side note).
This is also the reason we're not using Searchpipeline in akd lit v2: Searchpipeline was somewhat non-deterministic. But again, since it's working with hundreds of URLs, it's statistically less non-deterministic in nature.
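A stripped-down, runnable version of the loop described above might look like the sketch below. Everything in it is a hypothetical stand-in (`run_agent`, `fake_tool`, `fake_classifier`, the stop condition); it leaves out the relevancy checker and reformulation steps and only shows classification happening once at the agent level while the tool stays a deterministic filter.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical signatures, not the project's actual interfaces.
Classifier = Callable[[str], Awaitable[Optional[str]]]
Tool = Callable[[str, Optional[str]], Awaitable[list[str]]]


async def run_agent(
    query: str,
    tool: Tool,
    classifier: Optional[Classifier] = None,
    max_iterations: int = 2,
) -> list[str]:
    # Classify once, up front; the only non-deterministic step lives here.
    division = await classifier(query) if classifier else None
    results_full: list[str] = []
    for _ in range(max_iterations):
        # The tool only filters on whatever division it is handed.
        results = await tool(query, division)
        new = [r for r in results if r not in results_full]
        if not new:
            break  # nothing new came back, so stop iterating
        results_full.extend(new)
    return results_full


async def fake_tool(query: str, division: Optional[str]) -> list[str]:
    # Deterministic stub: a fixed corpus filtered by division.
    corpus = {"repo-earth": "Earth Science", "repo-astro": "Astrophysics"}
    return [name for name, d in corpus.items() if division is None or d == division]


async def fake_classifier(query: str) -> Optional[str]:
    # Stub standing in for an LLM-backed classifier.
    return "Earth Science" if "ocean" in query else None
```

With this split, the tool stub can be tested on its own as a fixed baseline, and any change in end results can be attributed to the classifier or the loop around it.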
Yeah, makes sense
But thank you for the work. You're on the right track. Let's segregate non-determinism into the agentic workflow rather than the tool. That way, tool calls will always be our baselines and always replicable. The comments are feedback, not meant to discourage you from thinking about the process.
Yup, got it. I'll continue working on this PR after tomorrow's SME session.
- Remove division classifier from the code agent
Summary 📝
This PR adds query classification to the code search agent and tools by implementing an LLM-based division classifier that categorizes user queries into one of NASA's five science divisions. This enhancement allows the search agent to narrow down searches to repositories belonging to specific divisions, improving search precision and relevance.
Details
- `ScienceDivision` enum and `DivisionAgent` class in `intents.py` for classifying queries into Earth Science, Planetary Science, Astrophysics, Heliophysics, Biological and Physical Sciences, or Unknown divisions
- `DIVISION_PROMPT` in `code_prompts.py` with detailed overviews, study areas, and mission examples for each division
- `LocalRepoCodeSearchTool` and `SDECodeSearchTool` to support division-based filtering with `use_division_filter_local` and `use_division_filter_sde` configuration flags in the agent
- `find_repo()` and `sde_search()` methods to be async, perform division classification, and filter results by division
- `CodeSearchAgentConfig` to inherit division filter configuration and automatically initialize division agent when enabled
- `extra` field for transparency
Usage
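The PR's actual `ScienceDivision` definition is not shown in this conversation. As a rough illustration of the classification target described above, the sketch below assumes a string-valued enum with the six listed categories, plus a hypothetical `parse_division` helper that falls back to Unknown for unrecognized labels. Member names, values, and the helper are assumptions, not the merged code.

```python
from enum import Enum


class ScienceDivision(str, Enum):
    # Member names and values are assumptions; only the category
    # names themselves come from the PR summary.
    EARTH_SCIENCE = "Earth Science"
    PLANETARY_SCIENCE = "Planetary Science"
    ASTROPHYSICS = "Astrophysics"
    HELIOPHYSICS = "Heliophysics"
    BIOLOGICAL_AND_PHYSICAL_SCIENCES = "Biological and Physical Sciences"
    UNKNOWN = "Unknown"


def parse_division(label: str) -> ScienceDivision:
    """Map a raw classifier label to the enum, defaulting to UNKNOWN."""
    normalized = label.strip().lower()
    for member in ScienceDivision:
        if member.value.lower() == normalized:
            return member
    # Anything the classifier returns outside the known set is treated as Unknown.
    return ScienceDivision.UNKNOWN
```

Normalizing free-text labels into a closed enum this way keeps downstream filtering a simple equality check.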
Checks