
[Experimental] Modality Transforms #2836

Draft · wants to merge 15 commits into main

Conversation

@WaelKarkoub (Collaborator) commented May 30, 2024

Why are these changes needed?

NOTE: Do not review; I have not finished this feature just yet.

With the introduction of GPT-4o, we should expect increased interest in multimodal capabilities in AutoGen. This PR introduces a new transform that lets users add the image modality to any agent, with any image captioner, and it will serve as the blueprint for other modalities.

Current State of Multimodality Support in AutoGen

  • Only images are currently supported.
  • Users are limited to MultimodalConversableAgent and VisionCapability if they want to add image support to their agents; both rely on LLM-based image captioning.

My requirements for adding multimodality support to agents:

  1. Don't break anything, i.e., avoid making changes to existing interfaces.
  2. The solution has to be modular and easily extensible.

Approaches Considered

I considered two approaches:

  1. ModalityAdapters: A new agent capability that sits in front of incoming messages and converts them from one modality to another (primarily to text).
  2. Modality Transforms: Use the TransformMessages capability to convert messages from one modality to another.

Initially, I worked on ModalityAdapters, as it seemed promising (I documented my thought process in this pdf; I initially called it ModalityTranslators, but we voted for the adapter naming convention as it fit better). However, I encountered a few roadblocks that led to the decision to use TransformMessages instead:

  • I need information on which modalities the agent supports. This is nontrivial and currently not possible, as the OpenAIWrapper can fall back to an LLM with a different modality than the one it started with.
  • The implementation of ModalityAdapters seemed too close to TransformMessages, leading to unnecessary repeated code.

Using TransformMessages is more verbose, but it has several advantages (a minimal sketch follows this list):

  • It forces the user to be explicit about the modalities they want to support.
  • Modality transforms can interact nicely with other transforms to improve performance (imagine performing some image denoising before captioning).
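
To make this concrete, here's a minimal sketch of what an image modality transform could look like. It follows the apply_transform method that TransformMessages expects of its transforms, but the captioner interface (a caption method) and the exact content handling are illustrative assumptions, not this PR's actual implementation:

from typing import Any, Dict, List

class ImageModalitySketch:
    """Illustrative only: replaces image content items with text captions
    so a text-only LLM can 'see' incoming images."""

    def __init__(self, image_captioner: Any):
        # Assumed captioner interface: caption(image_url: str) -> str
        self._captioner = image_captioner

    def apply_transform(self, messages: List[Dict]) -> List[Dict]:
        for message in messages:
            content = message.get("content")
            if not isinstance(content, list):
                continue  # plain-text message, nothing to transform
            new_content = []
            for item in content:
                if isinstance(item, dict) and item.get("type") == "image_url":
                    caption = self._captioner.caption(item["image_url"]["url"])
                    new_content.append({"type": "text", "text": f"(image: {caption})"})
                else:
                    new_content.append(item)
            message["content"] = new_content
        return messages

Because it's just another transform, it composes with the rest of the pipeline; for example, a hypothetical ImageDenoiser transform could run before it in the same transforms list.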

Tasks to complete before opening this PR for review.

  1. Get approval for the final design choices.
  2. Add caching.
  3. Add comments, docstrings, etc.
  4. Add tests.
  5. Clean up code.

Things I noticed in the codebase that made it difficult to add new modalities

  • No way for the agent to know which LLM config is being used. I recommend letting the agent directly control which LLM config to feed to the OpenAIWrapper. This would give us better control when things go wrong, and we could rerun all the message hooks when an API request fails.
  • Content types are not clearly defined: I can open a PR that adds/defines types in AutoGen for things like content and message (see the sketch after this list).
  • For some reason, OpenAI decided to reuse their image API for their video API, so we have to add more utilities to handle that case.
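
For illustration, here's the kind of typing such a PR could introduce. These TypedDicts mirror the OpenAI-style content-item shape, but they are my hypothetical sketch, not AutoGen's actual definitions:

from typing import List, TypedDict, Union

class TextContent(TypedDict):
    type: str  # always "text"
    text: str

class ImageUrl(TypedDict):
    url: str

class ImageContent(TypedDict):
    type: str  # always "image_url"
    image_url: ImageUrl

# A message's "content" is either a plain string or a list of typed items.
MessageContent = Union[str, List[Union[TextContent, ImageContent]]]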

Demo

Here's a screenshot of a GPT-3.5 agent (named "gpt_3_w_image_modality") identifying the animal generated by DALL-E 3. (Ignore the double messages; GroupChat doesn't work with transform messages just yet, so I had to hack around it.)

[Screenshot from 2024-05-30 20-34-02]

Here's the code I used to test:

import os

from autogen import ConversableAgent
from autogen.agentchat.contrib.capabilities.generate_images import DalleImageGenerator, ImageGeneration
from autogen.agentchat.contrib.capabilities.image_captioners import HuggingFaceImageCaptioner
from autogen.agentchat.contrib.capabilities.modality_transforms import ImageModality
from autogen.agentchat.contrib.capabilities.transform_messages import TransformMessages
from autogen.agentchat.user_proxy_agent import UserProxyAgent

MAIN_SYSTEM_MESSAGE = """You are partaking in a three-player game:
- player 1: whispers the animal name to player 2
- player 2: needs to draw the animal
- player 3: needs to guess what is in the image
"""

PLAYER1_AGENT_SYSTEM_MESSAGE = """You are player 1: you must always respond in this format:
PROMPT: Draw me an animal. Replace the word animal with any animal of your choosing.

e.g. PROMPT: Draw me a zebra.
"""
PAINTER_AGENT_SYSTEM_MESSAGE = """You are player 2: you must draw the animal."""

CAPABLE_AGENT_SYSTEM_MESSAGE = """You are player 3: you must guess which animal is in the image."""

user_agent = UserProxyAgent(name="user_agent", human_input_mode="NEVER")

player1_agent = ConversableAgent(
    name="gpt_3",
    system_message=MAIN_SYSTEM_MESSAGE + PLAYER1_AGENT_SYSTEM_MESSAGE,
    llm_config={"model": "gpt-3.5-turbo", "api_key": os.environ["OPENAI_API_KEY"]},
    human_input_mode="NEVER",
    max_consecutive_auto_reply=3,
)

dalle_agent = ConversableAgent(
    name="dalle_agent",
    system_message=MAIN_SYSTEM_MESSAGE + PAINTER_AGENT_SYSTEM_MESSAGE,
    llm_config={"model": "gpt-3.5-turbo", "api_key": os.environ["OPENAI_API_KEY"]},
    human_input_mode="NEVER",
    max_consecutive_auto_reply=3,
)

capable_agent = ConversableAgent(
    name="gpt_3_w_image_modality",
    system_message=MAIN_SYSTEM_MESSAGE + CAPABLE_AGENT_SYSTEM_MESSAGE,
    llm_config={"model": "gpt-3.5-turbo", "api_key": os.environ["OPENAI_API_KEY"], "cache_seed": None},
    human_input_mode="NEVER",
    max_consecutive_auto_reply=3,
)

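# Give dalle_agent the ability to draw: DALL-E 3 wired in through the ImageGeneration capability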
llm_config = {"config_list": [{"model": "dall-e-3", "api_key": os.environ["OPENAI_API_KEY"]}]}
dalle = DalleImageGenerator(llm_config=llm_config)
image_generator = ImageGeneration(image_generator=dalle, output_prompt_template="")
image_generator.add_to_agent(dalle_agent)

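# Add the image modality to the text-only capable_agent: incoming images are captioned into text it can read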
message_transforms = TransformMessages(transforms=[ImageModality(image_captioner=HuggingFaceImageCaptioner())])
message_transforms.add_to_agent(capable_agent)

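# Drive the exchange manually (GroupChat doesn't work with transform messages yet; see note above)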
user_agent.send("Let's start the game", player1_agent, request_reply=True)
player1_agent.send(user_agent.last_message(player1_agent), dalle_agent, request_reply=True)
dalle_agent.send(player1_agent.last_message(dalle_agent), capable_agent, request_reply=True)
capable_agent.send(dalle_agent.last_message(capable_agent), user_agent, request_reply=True)


@WaelKarkoub added the enhancement (New feature or request) label May 30, 2024
@BeibinLi (Collaborator)

Lovely!

@WaelKarkoub mentioned this pull request May 30, 2024
@WaelKarkoub requested a review from sonichi May 31, 2024 18:03
@WaelKarkoub mentioned this pull request Jun 4, 2024
@codecov-commenter commented Jun 10, 2024

Codecov Report

Attention: Patch coverage is 2.67380% with 182 lines in your changes missing coverage. Please review.

Project coverage is 12.24%. Comparing base (84c7c24) to head (f0a1e01).
Report is 8 commits behind head on main.

Files                                                       Patch %   Lines
...ntchat/contrib/capabilities/modality_transforms.py        0.00%    151 Missing ⚠️
autogen/agentchat/utils.py                                  25.00%     15 Missing ⚠️
...agentchat/contrib/capabilities/image_captioners.py        0.00%     13 Missing ⚠️
autogen/agentchat/contrib/img_utils.py                       0.00%      2 Missing ⚠️
.../agentchat/contrib/capabilities/generate_images.py        0.00%      1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2836       +/-   ##
===========================================
- Coverage   33.12%   12.24%   -20.89%     
===========================================
  Files          88       91        +3     
  Lines        9518     9775      +257     
  Branches     2037     2095       +58     
===========================================
- Hits         3153     1197     -1956     
- Misses       6096     8565     +2469     
+ Partials      269       13      -256     
Flag        Coverage Δ
unittests   12.24% <2.67%> (-20.89%) ⬇️

