feat: Add multimodal support (vision and audio) #49
mohammed840 wants to merge 2 commits into alexzhang13:main
Conversation
This PR adds multimodal capabilities to RLM, enabling vision (image analysis)
and audio (transcription + TTS) support in the REPL environment.
## Changes
### Gemini Client (`rlm/clients/gemini.py`)
- Added `_load_image_as_part()` to handle image loading from files, URLs, and base64 (see the sketch after this list)
- Added `_load_audio_as_part()` to handle audio file loading
- Extended `_get_mime_type()` with audio and video MIME types
- Updated `_content_to_parts()` to process multimodal content (images + audio)
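For orientation, a minimal sketch of what `_load_image_as_part()` could look like; it assumes the `google-genai` SDK's `types.Part.from_bytes` helper, and the actual implementation in `rlm/clients/gemini.py` may differ:

```python
import base64
import mimetypes
import pathlib
import urllib.request

from google.genai import types


def _load_image_as_part(image: str) -> types.Part:
    """Sketch: build a Part from a local path, URL, or base64 string."""
    if image.startswith(("http://", "https://")):
        # Remote image: fetch the raw bytes and take the server's MIME type.
        with urllib.request.urlopen(image) as resp:
            data = resp.read()
            mime = resp.headers.get_content_type()
    elif pathlib.Path(image).is_file():
        # Local file: read the bytes, guess the MIME type from the extension.
        data = pathlib.Path(image).read_bytes()
        mime = mimetypes.guess_type(image)[0] or "image/png"
    else:
        # Otherwise assume a base64-encoded payload.
        data = base64.b64decode(image)
        mime = "image/png"
    return types.Part.from_bytes(data=data, mime_type=mime)
```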
### REPL Environment (`rlm/environments/local_repl.py`)
- Added `vision_query(prompt, images)` - Analyze images with vision-capable LLMs (a wiring sketch follows this list)
- Added `vision_query_batched(prompts, images_list)` - Batch image analysis
- Added `audio_query(prompt, audio_files)` - Transcribe or analyze audio files
- Added `speak(text, output_path)` - Text-to-speech generation
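As a rough wiring sketch (names like `MultimodalClient` and `completion()` are assumptions, not the PR's actual interfaces), each helper can close over the client and be injected into the REPL namespace:

```python
from typing import Any, Protocol


class MultimodalClient(Protocol):
    # Hypothetical interface; the PR's Gemini client may expose a
    # different method name or signature.
    def completion(self, content: list[dict[str, Any]]) -> str: ...


def make_vision_query(client: MultimodalClient):
    """Build the vision_query callable exposed inside the REPL."""

    def vision_query(prompt: str, images: list[str]) -> str:
        # One multimodal message: the text prompt plus image references,
        # which the client converts into Parts (see the Gemini sketch above).
        content: list[dict[str, Any]] = [{"type": "text", "text": prompt}]
        content += [{"type": "image", "image": img} for img in images]
        return client.completion(content)

    return vision_query
```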
### System Prompt (`rlm/utils/prompts.py`)
- Documented new multimodal functions in the REPL environment description
## Usage Examples
### Vision
```python
# Analyze an image
description = vision_query("What objects are in this image?", ["photo.jpg"])
```
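For multiple images, `vision_query_batched` takes parallel lists, one prompt per image list; the exact shapes below are inferred from the signature and may not match the PR precisely:

```python
# Batch analysis: prompts[i] is asked about images_list[i]
captions = vision_query_batched(
    ["Describe this image", "Is there a person here?"],
    [["scene1.jpg"], ["scene2.jpg"]],
)
```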
### Audio
```python
# Transcribe audio
transcript = audio_query("Transcribe this audio", ["recording.mp3"])
# Text-to-speech
audio_path = speak("Hello world", "output.aiff")
```
Closes #multimodal-support
Love this, @mohammed840. Before I make minor changes, can you add a flag in the `RLM(...)` constructor?

**alexzhang13** left a comment:

Keep the changes minimal, but add a flag to `RLM(...)`, `enable_multimodal: bool = False` by default, that enables these features. Use a separate system prompt as well.
Addresses PR feedback:
- Added `enable_multimodal: bool = False` flag to the RLM constructor
- Created separate `RLM_SYSTEM_PROMPT` (base) and `RLM_MULTIMODAL_SYSTEM_PROMPT`
- Multimodal REPL functions are only registered when `enable_multimodal=True`
- Updated examples to use the new flag
@alexzhang13 Thanks for the feedback! I've implemented all the requested changes.

Usage:

```python
# Default: no multimodal (backward compatible)
rlm = RLM(backend="gemini", ...)

# With multimodal enabled
rlm = RLM(backend="gemini", ..., enable_multimodal=True)
```

Let me know if you'd like any additional changes!
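For readers skimming the thread, a minimal sketch of the opt-in gating described above, assuming the prompt names from the commit message; the prompt strings and `_register_multimodal_functions` helper are hypothetical:

```python
# Sketch only: illustrates the gating described above, not the PR's code.
RLM_SYSTEM_PROMPT = "You are RLM, a model with access to a Python REPL..."
RLM_MULTIMODAL_SYSTEM_PROMPT = (
    RLM_SYSTEM_PROMPT
    + "\nYou may also call vision_query, vision_query_batched, "
    "audio_query, and speak."
)


class RLM:
    def __init__(self, backend: str, enable_multimodal: bool = False, **kwargs):
        self.backend = backend
        self.enable_multimodal = enable_multimodal
        # Default path is unchanged; multimodal is strictly opt-in.
        self.system_prompt = (
            RLM_MULTIMODAL_SYSTEM_PROMPT if enable_multimodal else RLM_SYSTEM_PROMPT
        )
        if enable_multimodal:
            self._register_multimodal_functions()

    def _register_multimodal_functions(self) -> None:
        # Hypothetical helper: inject vision_query / audio_query / speak
        # into the REPL namespace only when the flag is set.
        ...
```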
## New REPL Functions
| Function | Example |
| --- | --- |
| `vision_query(prompt, images)` | `vision_query("What's in the image?", ["photo.jpg"])` |
| `vision_query_batched(prompts, images_list)` | `vision_query_batched([...], [[img1], [img2]])` |
| `audio_query(prompt, audio_files)` | `audio_query("Transcribe this", ["speech.mp3"])` |
| `speak(text, output_path)` | `speak("Hello world", "output.aiff")` |
## Testing

- Text-to-speech falls back to the `say` command

This addresses the open request for multimodal support in RLM.