-
Yes, what you're trying to do (making an agent analyze the visual styling of a PDF using a multimodal LLM) is totally viable, especially with frontier models like Claude, GPT-4o, and Gemini that support image inputs. Please check out multimodal compatibility here: https://docs.agno.com/models/compatibility#multimodal-support Agno doesn't currently have an out-of-the-box example of this flow in the cookbook, but it's absolutely possible to build using a custom tool. Making custom tools with Agno is easy; docs: https://docs.agno.com/tools/tools So the steps would look something like this (see the sketch below):
1. Render the PDF page(s) to an image, e.g. with PyMuPDF or pdf2image.
2. Pass that image to the agent as multimodal input (or have a custom tool produce it).
3. Let the model analyze the styling in its context and return feedback.
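A minimal sketch of that flow, assuming PyMuPDF (`fitz`) for the rendering step and Agno's `Agent`, `Image`, and `Claude` classes; the model id, file name, and prompts are placeholders, not an official example:

```python
import fitz  # PyMuPDF: renders PDF pages to raster images

from agno.agent import Agent
from agno.media import Image
from agno.models.anthropic import Claude


def pdf_page_to_png(pdf_path: str, page_number: int = 0) -> bytes:
    """Render one page of a PDF to PNG bytes."""
    doc = fitz.open(pdf_path)
    pix = doc[page_number].get_pixmap(dpi=150)  # 150 dpi is plenty for a style review
    return pix.tobytes("png")


agent = Agent(
    model=Claude(id="claude-3-5-sonnet-20241022"),
    instructions="You review ad creatives and give concrete styling feedback.",
)

png_bytes = pdf_page_to_png("ad_campaign.pdf")
agent.print_response(
    "Describe the visual styling of this ad (layout, typography, color) and suggest improvements.",
    images=[Image(content=png_bytes)],
)
```

Any multimodal model from the compatibility table above should slot in the same way; only the `model=` line changes.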
Let me know if that answers your question.
-
Any update?
-
Hi, I am also having the same problem; #2927 says an update is coming soon. I'm very pleased that the Agno framework exists and works like magic. This feature would be a great add-on for me. Thanks.
-
As part of a project, I want an agent to make style-based recommendations based on what a PDF for an ad campaign "looks like". This is part of a larger agent that works on ads, so I would want it to get a screenshot/image of the PDF using a tool call specifically, then analyze it in the LLM's context and give feedback.
Most modern frontier models are multimodal, so they should be able to do something like this. I know Anthropic supports this in their API, along with others.
Is there a way to do this in Agno? I have tried scouring the cookbook for examples and tried to look through the codebase, but couldn't find anything, and figured people in the community would have some experience with something like this. Am I missing something simple?
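For the "screenshot via a tool call" part specifically, a custom Agno tool can do the rendering; Agno accepts plain Python functions in `tools=[...]`. A hypothetical sketch (`screenshot_pdf`, the model id, and the PyMuPDF rendering are assumptions, not an official Agno recipe):

```python
import fitz  # PyMuPDF

from agno.agent import Agent
from agno.models.anthropic import Claude


def screenshot_pdf(pdf_path: str, page_number: int = 0) -> str:
    """Render one page of a PDF to a PNG file and return its path."""
    doc = fitz.open(pdf_path)
    pix = doc[page_number].get_pixmap(dpi=150)
    out_path = f"page_{page_number}.png"
    pix.save(out_path)  # write the rendered page to disk
    return out_path


agent = Agent(
    model=Claude(id="claude-3-5-sonnet-20241022"),
    tools=[screenshot_pdf],  # plain functions are valid Agno tools
)
```

Note the tool returns a file path as text; to get the actual pixels into the model's context, pass the rendered image back in via `images=[...]` on a run (as in the sketch above), since support for tools returning media varies across Agno versions.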