-
Notifications
You must be signed in to change notification settings - Fork 1.4k
feat: add multi-modal attachment propagation to all worker agents thr… #1196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@Wendong-Fan Could you please review this PR? thanks |
thanks @bittoby 's contribution! could @nitpicker55555 and @Zephyroam help checking this? |
|
@Zephyroam @nitpicker55555 I would appreciate your feedback. please review the PR! thanks |
nitpicker55555
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obviously, you didn’t test your code. You didn’t register the ImageAnalysisToolkit for the agent—so how is the agent supposed to gain image-reading capability?
Additionally, this design is overly complicated. Why not simply modify the prompt of the decompose agent so that, when breaking down the task, it passes the image location to the corresponding agent? That way, we would only need to adjust the prompt and register the ImageAnalysisToolkit for the agent.
|
@nitpicker55555 Obviously, after I tested it correctly I pushed the PR. When I attach a test image and input “As a developer, analyze this screenshot and write code” as prompt, then the developer agent generates HTML that matches the screenshot. |
you're right - I haven't added |
@bittoby Thank you for the explanation, but I still don’t understand: without providing a dedicated tool, how can the model read an image from a passed-in image URL instead of a base64-encoded input? Could you point out which part of your PR implements this functionality? |
|
Right now it only passes the image file paths along as additional_info. The current setup depends on the model’s built-in vision (which needs base64 data or a URL in the actual message). My PR doesn’t do that conversion or attach the images to the message - it just sends the paths as metadata. I will update PR. I misunderstood. I am clear now. For this to really work, we need to either:
Which option do you recommend? |
|
My idea is to tweak The benefits are: the change is relatively small, we’d only need to update the decompose agent’s prompt to pass along the image path. It would also allow the agent to read other image files, not just images uploaded by the user in the prompt, and it could support users providing an image path instead of the image itself. I also looked into how Claude Code does image reading, and it works in a similar way, via a What do you think @Wendong-Fan @bittoby |
|
@nitpicker55555 Thanks for your explanation! Concerns:
My approach (like Claude Code): make a small, separate tool:
But it requires more code, a bit more wiring I agree with your approach for now since it's faster to ship. We can refactor to a separate ReadImageTool later if we see reusability issues. @Wendong-Fan what's your preference? |
@bittoby what do you mean document agent read image for embedding? |
|
I mean, here "embedding" = putting the image into a document not analyzing it. |
|
@bittoby It seems you haven’t reviewed this tool at all. I strongly recommend checking the codebase before continuing the discussion. |
|
Okay. @nitpicker55555 I’ll update the PR to follow your approach. |
1e21ba1 to
805ed97
Compare
1e21ba1 to
53d8830
Compare
…ocument agents by integrating ImageAnalysisToolkit with proper agent registration and explicit priority instructions in system prompts
698b49b to
dba3f58
Compare
…lti-modal-worker-agents
|
@nitpicker55555 I updated pr to follow your feedback. please review again |
Enable Image Analysis for All Worker Agents
Problem
Only Multi-Modal Agent could analyze images. Worker agents (Developer, Browser, Document) couldn't see or process image attachments, causing tasks to fail.
Solution
ImageAnalysisToolkitto Developer, Browser, and Document agentstoolkits_to_register_agentparameterChanges
Agent Factories:
backend/app/agent/factory/developer.pybackend/app/agent/factory/browser.pybackend/app/agent/factory/document.pySystem Prompts:
backend/app/agent/prompt.pyImpact
All worker agents can now:
Testing
✅ Developer Agent: Extracts code from screenshots
✅ Browser Agent: Analyzes webpage screenshots
✅ Document Agent: Ready for document image analysis
Type
Closes #956