
[Experimental] Modality Transforms #2836

Draft · wants to merge 15 commits into main

Conversation

@WaelKarkoub (Collaborator) commented May 30, 2024

Why are these changes needed?

NOTE: Do not review; I have not finished this feature just yet.

With the introduction of GPT-4o, we should expect increased interest in multimodal capabilities in AutoGen. This PR introduces a new transform that lets users add the image modality to any agent, with any image captioner, and it will serve as the blueprint for other modalities.

Current State of Multimodality Support in AutoGen

  • Only images are currently supported.
  • Users are limited to MultimodalConversableAgent and VisionCapability if they want to add image support to their agents; both rely on LLM-based image captioning.

My requirements for adding multimodality support to agents:

  1. Don't break anything, i.e., avoid making changes to existing interfaces.
  2. The solution has to be modular and easily extensible.

Approaches Considered

I considered two approaches:

  1. ModalityAdapters: A new agent capability that sits in front of incoming messages and converts them from one modality to another (primarily to text).
  2. Modality Transforms: Use the TransformMessages capability to convert messages from one modality to another.

Initially, I worked on ModalityAdapters, as it seemed promising (I documented my thought process in this pdf; I initially called it ModalityTranslators, but we voted for the adapter naming convention as it fit better). However, I encountered a few roadblocks that led to the decision to use TransformMessages instead:

  • I need information on which modalities the agent supports. This is nontrivial and currently not possible, as the OpenAIWrapper can fall back to an LLM with a different modality than the one it started with.
  • The implementation of ModalityAdapters seemed too close to TransformMessages, leading to unnecessary repeated code.

Using TransformMessages is more verbose, but it has several advantages (a minimal sketch follows this list):

  • It forces the user to be explicit about the modalities they want to support.
  • Modality transforms can interact nicely with other transforms to improve performance (imagine performing some image denoising before captioning).
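
To make this concrete, here's a minimal sketch of what an image modality transform could look like. It follows the apply_transform method that TransformMessages expects of its transforms, but the captioner interface (a caption method) and the exact content handling are illustrative assumptions, not this PR's actual implementation:

from typing import Any, Dict, List

class ImageModalitySketch:
    """Illustrative only: replaces image content items with text captions
    so a text-only LLM can 'see' incoming images."""

    def __init__(self, image_captioner: Any):
        # Assumed captioner interface: caption(image_url: str) -> str
        self._captioner = image_captioner

    def apply_transform(self, messages: List[Dict]) -> List[Dict]:
        for message in messages:
            content = message.get("content")
            if not isinstance(content, list):
                continue  # plain-text message, nothing to transform
            new_content = []
            for item in content:
                if isinstance(item, dict) and item.get("type") == "image_url":
                    caption = self._captioner.caption(item["image_url"]["url"])
                    new_content.append({"type": "text", "text": f"(image: {caption})"})
                else:
                    new_content.append(item)
            message["content"] = new_content
        return messages

Because it's just another transform, it composes with the rest of the pipeline; for example, a hypothetical ImageDenoiser transform could run before it in the same transforms list.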

Tasks to complete before opening this PR for review.

  1. Get approval for the final design choices.
  2. Add caching.
  3. Add comments, docstrings, etc.
  4. Add tests.
  5. Clean up code.

Things I noticed in the codebase that made it difficult to add new modalities

  • No way for the agent to know which LLM config is being used. I recommend letting the agent directly control which LLM config to feed to the OpenAIWrapper. This would give us better control when things go wrong, and we could rerun all the message hooks when an API request fails.
  • Content types are not clearly defined: I can open a PR that adds/defines types in AutoGen for things like content and message (see the sketch after this list).
  • For some reason, OpenAI decided to reuse their image API for their video API, so we have to add more utilities to handle that case.
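
For illustration, here's the kind of typing such a PR could introduce. These TypedDicts mirror the OpenAI-style content-item shape, but they are my hypothetical sketch, not AutoGen's actual definitions:

from typing import List, TypedDict, Union

class TextContent(TypedDict):
    type: str  # always "text"
    text: str

class ImageUrl(TypedDict):
    url: str

class ImageContent(TypedDict):
    type: str  # always "image_url"
    image_url: ImageUrl

# A message's "content" is either a plain string or a list of typed items.
MessageContent = Union[str, List[Union[TextContent, ImageContent]]]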

Demo

Here's a screenshot of a GPT-3.5 agent (named "gpt_3_w_image_modality") identifying the animal generated by DALL-E 3. (Ignore the double messages; GroupChat doesn't work with transform messages just yet, so I had to hack around it.)

[Screenshot from 2024-05-30 20-34-02]

Here's the code I used to test:

import os

from autogen import ConversableAgent
from autogen.agentchat.contrib.capabilities.generate_images import DalleImageGenerator, ImageGeneration
from autogen.agentchat.contrib.capabilities.image_captioners import HuggingFaceImageCaptioner
from autogen.agentchat.contrib.capabilities.modality_transforms import ImageModality
from autogen.agentchat.contrib.capabilities.transform_messages import TransformMessages
from autogen.agentchat.user_proxy_agent import UserProxyAgent

MAIN_SYSTEM_MESSAGE = """You are partaking in a three-player game:
- player 1: whispers the animal name to player 2
- player 2: needs to draw the animal
- player 3: needs to guess what is in the image
"""

PLAYER1_AGENT_SYSTEM_MESSAGE = """You are player 1: you must always respond in this format:
PROMPT: Draw me an animal. Replace the word animal with any animal of your choosing.

e.g. PROMPT: Draw me a zebra.
"""
PAINTER_AGENT_SYSTEM_MESSAGE = """You are player 2: you must draw the animal."""

CAPABLE_AGENT_SYSTEM_MESSAGE = """You are player 3: you must guess which animal is in the image."""

user_agent = UserProxyAgent(name="user_agent", human_input_mode="NEVER")

player1_agent = ConversableAgent(
    name="gpt_3",
    system_message=MAIN_SYSTEM_MESSAGE + PLAYER1_AGENT_SYSTEM_MESSAGE,
    llm_config={"model": "gpt-3.5-turbo", "api_key": os.environ["OPENAI_API_KEY"]},
    human_input_mode="NEVER",
    max_consecutive_auto_reply=3,
)

dalle_agent = ConversableAgent(
    name="dalle_agent",
    system_message=MAIN_SYSTEM_MESSAGE + PAINTER_AGENT_SYSTEM_MESSAGE,
    llm_config={"model": "gpt-3.5-turbo", "api_key": os.environ["OPENAI_API_KEY"]},
    human_input_mode="NEVER",
    max_consecutive_auto_reply=3,
)

capable_agent = ConversableAgent(
    name="gpt_3_w_image_modality",
    system_message=MAIN_SYSTEM_MESSAGE + CAPABLE_AGENT_SYSTEM_MESSAGE,
    llm_config={"model": "gpt-3.5-turbo", "api_key": os.environ["OPENAI_API_KEY"], "cache_seed": None},
    human_input_mode="NEVER",
    max_consecutive_auto_reply=3,
)

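# Give dalle_agent the ability to draw: DALL-E 3 wired in through the ImageGeneration capability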
llm_config = {"config_list": [{"model": "dall-e-3", "api_key": os.environ["OPENAI_API_KEY"]}]}
dalle = DalleImageGenerator(llm_config=llm_config)
image_generator = ImageGeneration(image_generator=dalle, output_prompt_template="")
image_generator.add_to_agent(dalle_agent)

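# Add the image modality to the text-only capable_agent: incoming images are captioned into text it can read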
message_transforms = TransformMessages(transforms=[ImageModality(image_captioner=HuggingFaceImageCaptioner())])
message_transforms.add_to_agent(capable_agent)

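# Drive the exchange manually (GroupChat doesn't work with transform messages yet; see note above)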
user_agent.send("Let's start the game", player1_agent, request_reply=True)
player1_agent.send(user_agent.last_message(player1_agent), dalle_agent, request_reply=True)
dalle_agent.send(player1_agent.last_message(dalle_agent), capable_agent, request_reply=True)
capable_agent.send(dalle_agent.last_message(capable_agent), user_agent, request_reply=True)


@WaelKarkoub added the enhancement (New feature or request) label May 30, 2024
@BeibinLi (Collaborator)

Lovely!

@WaelKarkoub mentioned this pull request May 30, 2024
@WaelKarkoub requested a review from sonichi May 31, 2024 18:03
@WaelKarkoub mentioned this pull request Jun 4, 2024
@codecov-commenter commented Jun 10, 2024

Codecov Report

Attention: Patch coverage is 2.67380% with 182 lines in your changes missing coverage. Please review.

Project coverage is 12.24%. Comparing base (84c7c24) to head (f0a1e01).
Report is 8 commits behind head on main.

Files                                                       Patch %   Lines
...ntchat/contrib/capabilities/modality_transforms.py        0.00%    151 Missing ⚠️
autogen/agentchat/utils.py                                  25.00%     15 Missing ⚠️
...agentchat/contrib/capabilities/image_captioners.py        0.00%     13 Missing ⚠️
autogen/agentchat/contrib/img_utils.py                       0.00%      2 Missing ⚠️
.../agentchat/contrib/capabilities/generate_images.py        0.00%      1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2836       +/-   ##
===========================================
- Coverage   33.12%   12.24%   -20.89%     
===========================================
  Files          88       91        +3     
  Lines        9518     9775      +257     
  Branches     2037     2095       +58     
===========================================
- Hits         3153     1197     -1956     
- Misses       6096     8565     +2469     
+ Partials      269       13      -256     
Flag        Coverage Δ
unittests   12.24% <2.67%> (-20.89%) ⬇️

