[Feature Request] Allow all workers to accept multi-modal information

### Motivation

Currently, only the multi-modal agent has the toolkit needed to read multi-modal information. Since many LLMs now support vision, it would be better for worker agents to natively accept image paths from decomposed subtasks.
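
To make the request concrete, here is a rough sketch of one way a worker could pick up image paths from a subtask description. It is not based on the project's actual API: `WorkerInput`, `build_worker_input`, and `IMAGE_PATH_RE` are hypothetical names, and the regex-based detection and the list of extensions are assumptions for illustration only.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical pattern: any whitespace-free token ending in a common image extension.
IMAGE_PATH_RE = re.compile(
    r"[^\s'\"()<>]+\.(?:png|jpe?g|gif|webp|bmp)\b", re.IGNORECASE
)


@dataclass
class WorkerInput:
    """Hypothetical container for what a worker would pass to a vision-capable LLM."""
    text: str
    image_paths: List[str] = field(default_factory=list)


def build_worker_input(subtask: str) -> WorkerInput:
    """Collect image paths mentioned in a subtask description (order kept, duplicates dropped)."""
    paths = list(dict.fromkeys(IMAGE_PATH_RE.findall(subtask)))
    return WorkerInput(text=subtask, image_paths=paths)


if __name__ == "__main__":
    example = build_worker_input(
        "Describe the chart in ./data/plot.png and compare it with /tmp/a.JPG."
    )
    print(example.image_paths)
```

With something like this, any worker whose model supports vision could load the listed files and attach them to its prompt, while workers on text-only models could ignore `image_paths`.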

### Solution

_No response_

### Alternatives

_No response_

### Additional context

_No response_