feat: add multi-modal attachment propagation to all worker agents thr… #1196

bittoby · 2026-02-09T05:50:25Z

Enable Image Analysis for All Worker Agents

Problem

Only Multi-Modal Agent could analyze images. Worker agents (Developer, Browser, Document) couldn't see or process image attachments, causing tasks to fail.

Solution

Added ImageAnalysisToolkit to Developer, Browser, and Document agents
Registered toolkits with toolkits_to_register_agent parameter
Updated system prompts with explicit image analysis instructions

Changes

Agent Factories:

backend/app/agent/factory/developer.py
backend/app/agent/factory/browser.py
backend/app/agent/factory/document.py

System Prompts:

backend/app/agent/prompt.py

Impact

All worker agents can now:

Analyze screenshots and images
Extract text from images
Answer questions about visual content
Process images alongside text in complex tasks

Testing

✅ Developer Agent: Extracts code from screenshots
✅ Browser Agent: Analyzes webpage screenshots
✅ Document Agent: Ready for document image analysis

Type

Feature
Bug Fix

Closes #956

bittoby · 2026-02-09T05:51:53Z

@Wendong-Fan Could you please review this PR? thanks

Wendong-Fan · 2026-02-09T18:47:54Z

@Wendong-Fan Could you please review this PR? thanks

thanks @bittoby 's contribution! could @nitpicker55555 and @Zephyroam help checking this?

bittoby · 2026-02-10T12:03:41Z

@Zephyroam @nitpicker55555 I would appreciate your feedback. please review the PR! thanks

nitpicker55555

Obviously, you didn’t test your code. You didn’t register the ImageAnalysisToolkit for the agent—so how is the agent supposed to gain image-reading capability?

Additionally, this design is overly complicated. Why not simply modify the prompt of the decompose agent so that, when breaking down the task, it passes the image location to the corresponding agent? That way, we would only need to adjust the prompt and register the ImageAnalysisToolkit for the agent.

bittoby · 2026-02-11T11:47:13Z

@nitpicker55555 Obviously, after I tested it correctly I pushed the PR. When I attach a test image and input “As a developer, analyze this screenshot and write code” as prompt, then the developer agent generates HTML that matches the screenshot.

bittoby · 2026-02-11T11:56:55Z

Obviously, you didn’t test your code. You didn’t register the ImageAnalysisToolkit for the agent—so how is the agent supposed to gain image-reading capability?

Additionally, this design is overly complicated. Why not simply modify the prompt of the decompose agent so that, when breaking down the task, it passes the image location to the corresponding agent? That way, we would only need to adjust the prompt and register the ImageAnalysisToolkit for the agent.

you're right - I haven't added ImageAnalysisToolkit to the worker agents yet. The current implementation works because it relies on the LLM's native vision capability (like gpt-5), but I agree we should add the toolkit for explicit tool calls and non-vision model support. Would you prefer I complete the current approach by adding the toolkit, or switch to your suggestion of modifying the decompose prompt to selectively pass images?

nitpicker55555 · 2026-02-11T12:21:48Z

Obviously, you didn’t test your code. You didn’t register the ImageAnalysisToolkit for the agent—so how is the agent supposed to gain image-reading capability?
Additionally, this design is overly complicated. Why not simply modify the prompt of the decompose agent so that, when breaking down the task, it passes the image location to the corresponding agent? That way, we would only need to adjust the prompt and register the ImageAnalysisToolkit for the agent.

you're right - I haven't added ImageAnalysisToolkit to the worker agents yet. The current implementation works because it relies on the LLM's native vision capability (like gpt-5), but I agree we should add the toolkit for explicit tool calls and non-vision model support. Would you prefer I complete the current approach by adding the toolkit, or switch to your suggestion of modifying the decompose prompt to selectively pass images?

@bittoby Thank you for the explanation, but I still don’t understand: without providing a dedicated tool, how can the model read an image from a passed-in image URL instead of a base64-encoded input? Could you point out which part of your PR implements this functionality?

bittoby · 2026-02-11T12:28:30Z

Right now it only passes the image file paths along as additional_info. The current setup depends on the model’s built-in vision (which needs base64 data or a URL in the actual message). My PR doesn’t do that conversion or attach the images to the message - it just sends the paths as metadata. I will update PR. I misunderstood. I am clear now.

For this to really work, we need to either:

add ImageAnalysisToolkit to the worker agents, or
change how we build the LLM request so it includes the images as base64.

Which option do you recommend?

nitpicker55555 · 2026-02-11T12:37:43Z

My idea is to tweak ImageAnalysisToolkit so it supports taking an image path as input and returning the actual image data back to the agent (right now ImageAnalysisToolkit only returns an image description).

The benefits are: the change is relatively small, we’d only need to update the decompose agent’s prompt to pass along the image path. It would also allow the agent to read other image files, not just images uploaded by the user in the prompt, and it could support users providing an image path instead of the image itself.

I also looked into how Claude Code does image reading, and it works in a similar way, via a Read tool.

What do you think @Wendong-Fan @bittoby

bittoby · 2026-02-11T12:59:23Z

@nitpicker55555 Thanks for your explanation!
Your idea (extending ImageAnalysisToolkit) is a solid quick win, but I see some limitations:

Concerns:

Mixed responsibility: one toolkit ends up doing both “read” and “analyze”
Less reusable: e.g., Document Agent might only need to read images for embeddings, but it would still have to pull in the whole analysis toolkit
Heavier deps everywhere: any agent that needs to read an image would also load the analysis dependencies

My approach (like Claude Code): make a small, separate tool:

read_image(path) -> base64
Keep ImageAnalysisToolkit focused on analysis only
More flexible: agents can read images for embeddings, uploads, etc.

But it requires more code, a bit more wiring

I agree with your approach for now since it's faster to ship. We can refactor to a separate ReadImageTool later if we see reusability issues.

@Wendong-Fan what's your preference?

nitpicker55555 · 2026-02-11T13:04:49Z

@nitpicker55555 Thanks for your explanation!

Your idea (extending ImageAnalysisToolkit) is a solid quick win, but I see some limitations:

Concerns:

Mixed responsibility: one toolkit ends up doing both “read” and “analyze”

Less reusable: e.g., Document Agent might only need to read images for embeddings, but it would still have to pull in the whole analysis toolkit

Heavier deps everywhere: any agent that needs to read an image would also load the analysis dependencies

My approach (like Claude Code): make a small, separate tool:

read_image(path) -> base64

Keep ImageAnalysisToolkit focused on analysis only

More flexible: agents can read images for embeddings, uploads, etc.

But it requires more code, a bit more wiring

I agree with your approach for now since it's faster to ship. We can refactor to a separate ReadImageTool later if we see reusability issues.

@Wendong-Fan what's your preference?

@bittoby what do you mean document agent read image for embedding?

bittoby · 2026-02-11T13:17:04Z

I mean, here "embedding" = putting the image into a document not analyzing it.
For example:
User uploads screenshot.png and asks: "Create a PDF report with this screenshot",
Document Agent needs to read the png from disk , convert it to base64 data then insert it to pdf. It doesn't need analyze the png, describe or extract text etc

nitpicker55555 · 2026-02-11T15:13:01Z

@bittoby It seems you haven’t reviewed this tool at all. I strongly recommend checking the codebase before continuing the discussion.
ImageAnalysisToolkit has only one dependency—Pillow—which is also required for simple image reading. The difference between this tool and the “read-only image” functionality we need is merely whether we leverage the built-in agent to generate a description. From the agent’s perspective, the only difference lies in the arguments passed in.
We could, of course, implement a separate tool as you suggested, but the supposed benefit of being “more flexible: agents can read images for embeddings, uploads, etc.” does not really exist.

bittoby · 2026-02-11T15:38:31Z

Okay. @nitpicker55555 I’ll update the PR to follow your approach.

…ocument agents by integrating ImageAnalysisToolkit with proper agent registration and explicit priority instructions in system prompts

…lti-modal-worker-agents

bittoby · 2026-02-11T19:13:32Z

@nitpicker55555 I updated pr to follow your feedback. please review again

Wendong-Fan requested review from Zephyroam and nitpicker55555 February 9, 2026 18:47

Wendong-Fan added this to the Sprint 14 milestone Feb 9, 2026

nitpicker55555 reviewed Feb 11, 2026

View reviewed changes

bittoby closed this Feb 11, 2026

bittoby force-pushed the feat/multi-modal-worker-agents branch from 1e21ba1 to 805ed97 Compare February 11, 2026 16:59

bittoby reopened this Feb 11, 2026

bittoby closed this Feb 11, 2026

bittoby force-pushed the feat/multi-modal-worker-agents branch from 1e21ba1 to 53d8830 Compare February 11, 2026 17:11

bittoby reopened this Feb 11, 2026

feat: enable multi-modal image analysis for Developer, Browser, and D…

dba3f58

…ocument agents by integrating ImageAnalysisToolkit with proper agent registration and explicit priority instructions in system prompts

bittoby force-pushed the feat/multi-modal-worker-agents branch from 698b49b to dba3f58 Compare February 11, 2026 19:03

bittoby added 2 commits February 11, 2026 19:09

feat: add mock toolkits

3392d95

Merge branch 'main' of https://github.com/bittoby/eigent into feat/mu…

236d2ae

…lti-modal-worker-agents

bittoby requested a review from nitpicker55555 February 11, 2026 19:54

feat: add multi-modal attachment propagation to all worker agents thr… #1196

Are you sure you want to change the base?

feat: add multi-modal attachment propagation to all worker agents thr… #1196

Conversation

bittoby commented Feb 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Enable Image Analysis for All Worker Agents

Problem

Solution

Changes

Impact

Testing

Type

Uh oh!

bittoby commented Feb 9, 2026

Uh oh!

Wendong-Fan commented Feb 9, 2026

Uh oh!

bittoby commented Feb 10, 2026

Uh oh!

nitpicker55555 left a comment

Choose a reason for hiding this comment

Uh oh!

bittoby commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bittoby commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nitpicker55555 commented Feb 11, 2026

Uh oh!

bittoby commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nitpicker55555 commented Feb 11, 2026

Uh oh!

bittoby commented Feb 11, 2026

Uh oh!

nitpicker55555 commented Feb 11, 2026

Uh oh!

bittoby commented Feb 11, 2026

Uh oh!

nitpicker55555 commented Feb 11, 2026

Uh oh!

bittoby commented Feb 11, 2026

Uh oh!

bittoby commented Feb 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bittoby commented Feb 9, 2026 •

edited

Loading

bittoby commented Feb 11, 2026 •

edited

Loading

bittoby commented Feb 11, 2026 •

edited

Loading

bittoby commented Feb 11, 2026 •

edited

Loading