A comprehensive multimodal AI application built with Streamlit and Google's Gemini AI that provides video analysis, object detection, audio transcription, and file management capabilities through an intuitive web interface.
This application leverages Google's Gemini AI models to process and analyze various types of media content. It provides four main functionalities:
- Video Analysis: Upload videos to automatically generate metadata including title, summary, duration, and tags
- Object Detection: Upload images or use camera input to detect and locate objects with bounding box visualization
- Audio Transcription: Convert audio files to text with speaker identification
- File API Management: List, view, and delete files uploaded to the Gemini API
The application is designed to be user-friendly with a clean, tabbed interface that allows users to easily switch between different AI-powered features.
- Upload video files (MP4, MOV, AVI, MKV)
- Automatic metadata generation with a structured JSON schema (see the sketch below)
- Extract video title, summary, duration, and relevant tags
- Real-time processing status updates
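Under the hood, the structured metadata could be requested roughly like this. This is a minimal sketch assuming a recent google-generativeai release with `response_schema` support; the schema fields mirror the metadata listed above, and the exact prompt and schema in `utils/util.py` may differ:

```python
# Sketch: request video metadata as schema-constrained JSON.
import os
from typing import List

import google.generativeai as genai
import typing_extensions as typing_ext

genai.configure(api_key=os.environ["API_KEY"])

class VideoMetadata(typing_ext.TypedDict):
    title: str
    summary: str
    duration: str
    tags: List[str]

model = genai.GenerativeModel(
    "gemini-1.5-flash-latest",
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=VideoMetadata,        # constrain the response to this schema
        temperature=0.2,
    ),
)

video = genai.upload_file("sample.mp4")       # poll genai.get_file() until ACTIVE before use
response = model.generate_content([video, "Generate metadata for this video."])
print(response.text)                          # JSON string with title, summary, duration, tags
```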
- Upload images (JPG, JPEG, PNG) for object detection
- Specify custom objects to detect or detect all objects
- Visual bounding box annotations with object labels
- Pixel-perfect coordinate extraction and display
- Conversion of normalized coordinates to pixel values (see the sketch below)
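The coordinate handling could look roughly like the following. This sketch assumes Gemini returns boxes as `[ymin, xmin, ymax, xmax]` normalized to a 0-1000 range (a common convention for Gemini detection prompts); the drawing code in `utils/util.py` may differ:

```python
# Sketch: scale normalized [ymin, xmin, ymax, xmax] boxes to pixels and draw them.
from PIL import Image, ImageDraw

def draw_boxes(image_path, detections):
    """detections: e.g. [{"label": "dog", "box_2d": [ymin, xmin, ymax, xmax]}, ...]"""
    img = Image.open(image_path)
    draw = ImageDraw.Draw(img)
    width, height = img.size
    for det in detections:
        ymin, xmin, ymax, xmax = det["box_2d"]
        left, top = xmin / 1000 * width, ymin / 1000 * height      # 0-1000 -> pixels
        right, bottom = xmax / 1000 * width, ymax / 1000 * height
        draw.rectangle([left, top, right, bottom], outline="red", width=3)
        draw.text((left, max(top - 12, 0)), det["label"], fill="red")
    return img
```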
- Support for multiple audio formats (MP3, WAV, AIFF, AAC, OGG, FLAC)
- Speaker identification and dialogue formatting (see the sketch below)
- Accurate transcription that preserves filler words
- Interview and conversation transcription optimization
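The transcription step could be driven by a prompt along these lines. This is a sketch assuming the google-generativeai SDK; the prompt actually used in `utils/util.py` may differ:

```python
# Sketch: transcribe an audio file with speaker labels and preserved filler words.
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["API_KEY"])

audio = genai.upload_file("interview.mp3")    # any supported format (MP3, WAV, FLAC, ...)
model = genai.GenerativeModel("gemini-1.5-flash-latest")
prompt = (
    "Transcribe this audio verbatim, keeping filler words. "
    "Label each speaker (Speaker 1, Speaker 2, ...) and format the output as a dialogue."
)
print(model.generate_content([audio, prompt]).text)
```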
- List all uploaded files with display names and file names (see the sketch below)
- Individual file deletion by name
- Bulk delete all files functionality
- Real-time file status monitoring
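These operations map onto the SDK's File API helpers. A minimal sketch assuming the google-generativeai package (the file name shown is illustrative):

```python
# Sketch: list, delete one, and delete all files uploaded to the Gemini File API.
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["API_KEY"])

# List all uploaded files with their display names and resource names.
for f in genai.list_files():
    print(f.display_name, f.name)

# Delete a single file by resource name (e.g. "files/abc123" -- illustrative).
genai.delete_file("files/abc123")

# Delete everything.
for f in genai.list_files():
    genai.delete_file(f.name)
```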
- Python 3.x - Primary programming language
- Streamlit - Web application framework for the user interface
- Google Generative AI (Gemini) - Core AI model for multimodal processing
- google-generativeai - Official Google Gemini AI SDK
- Pillow (PIL) - Image processing and manipulation
- OpenCV - Computer vision operations
- python-dotenv - Environment variable management
- Rich - Enhanced terminal output formatting
- IPython - Interactive Python environment
- streamlit-chat - Chat interface components
- Live - Development server utilities
- JSON - Data serialization and parsing
- Regex (re) - Text processing and markdown removal
GeminiMultiModalStreamlit/
├── app.py                      # Main Streamlit application
├── requirements.txt            # Python dependencies
├── README.md                   # Project documentation
├── utils/                      # Utility modules
│   ├── model.py                # AI model loading and configuration
│   ├── util.py                 # Core utility functions
│   └── removemarkdownsyntax.py # Markdown text processing
└── temp/                       # Temporary file storage (auto-created)
- `app.py`: Main application entry point with Streamlit UI and tab management
- `utils/model.py`: Handles Gemini AI model initialization, configuration, and caching
- `utils/util.py`: Core utilities for file upload, processing, metadata generation, and image processing
- `utils/removemarkdownsyntax.py`: Text processing utilities for cleaning AI responses
- Python 3.7 or higher
- Google AI API key (from Google AI Studio)
- Virtual environment (recommended)
git clone <repository-url>
cd GeminiMultiModalStreamlit

# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python -m venv venv
source venv/bin/activate

pip install -r requirements.txt

Create a .env file in the project root directory:
API_KEY=your_google_ai_api_key_here
MODEL=gemini-1.5-flash-latest
CACHING_MODEL=gemini-1.5-flash-001
Important:
- Replace `your_google_ai_api_key_here` with your actual Google AI API key
- You can obtain an API key from Google AI Studio
- The `MODEL` can be any supported Gemini model version
- `CACHING_MODEL` is used for cached content operations (loaded at startup as sketched below)
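At startup these variables need to be read and handed to the SDK. A minimal sketch of that step, assuming python-dotenv and the google-generativeai package (the actual loading code lives in `utils/model.py` and may differ):

```python
# Sketch: load the .env values and configure the Gemini SDK.
import os

import google.generativeai as genai
from dotenv import load_dotenv

load_dotenv()                                    # reads .env from the project root
genai.configure(api_key=os.environ["API_KEY"])   # API_KEY is required
MODEL = os.getenv("MODEL", "gemini-1.5-flash-latest")
CACHING_MODEL = os.getenv("CACHING_MODEL", "gemini-1.5-flash-001")
```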
streamlit run app.py

The application will start and be accessible at http://localhost:8501.
- Launch the application using the command above
- The interface will display four tabs: Video, Image, Audio, and File API
- Select the appropriate tab based on your use case
- Navigate to the Video tab
- Upload a video file using the file uploader
- Click "Analyze Video" to start processing
- Wait for file upload and processing completion
- View the generated metadata including title, summary, duration, and tags
- Go to the Image tab
- Upload an image using the sidebar file uploader
- Choose detection mode:
- Specific Object: Enter the object name you want to detect
- All Objects: Check "Detect All Objects" to find everything
- Click "Detect Objects" to process the image
- View the annotated image with bounding boxes and coordinate details
- Select the Audio tab
- Upload an audio file (supports multiple formats)
- Click "Transcribe Audio" to start processing
- Wait for upload and transcription completion
- View the formatted transcription with speaker identification
- Access the File API tab
- List Files: Click to view all uploaded files
- Delete Single File: Enter file name and click delete
- Delete All Files: Check the checkbox and confirm to remove all files
The application requires the following environment variables in your .env file:
| Variable | Description | Required | Example |
|---|---|---|---|
| `API_KEY` | Google AI API key | Yes | `AIza...` |
| `MODEL` | Gemini model version | Yes | `gemini-1.5-flash-latest` |
| `CACHING_MODEL` | Model for caching operations | No | `gemini-1.5-flash-001` |
The application automatically configures different model settings based on use case, as sketched after this list:
- Structured Output (Video Analysis): JSON schema response with specific temperature and token limits
- General Purpose (Image/Audio): Standard configuration with higher creativity settings
- Caching: Optimized for repeated operations with TTL management
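A rough illustration of those three configurations, assuming the google-generativeai SDK (the values shown are illustrative, not the ones hard-coded in `utils/model.py`):

```python
# Sketch: per-use-case model configuration (illustrative values).
import datetime

import google.generativeai as genai
from google.generativeai import caching

# Structured output (video analysis): deterministic JSON with a token cap.
structured_config = genai.GenerationConfig(
    response_mime_type="application/json",
    temperature=0.2,
    max_output_tokens=1024,
)

# General purpose (image/audio): free-form text with higher creativity.
general_config = genai.GenerationConfig(temperature=0.9)

# Caching: pin large content (e.g. an uploaded video) for repeated prompts with a TTL.
# cache = caching.CachedContent.create(
#     model="models/gemini-1.5-flash-001",
#     contents=[uploaded_video],                  # a previously uploaded File object
#     ttl=datetime.timedelta(minutes=10),
# )
# model = genai.GenerativeModel.from_cached_content(cached_content=cache)
```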
- Temporary files are automatically created and cleaned up (see the sketch after this list)
- Supported video formats: MP4, MOV, AVI, MKV
- Supported image formats: JPG, JPEG, PNG
- Supported audio formats: MP3, WAV, AIFF, AAC, OGG, FLAC
- Files are uploaded to Google's servers and processed remotely
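One plausible shape for that temporary-file handling, assuming Streamlit's `UploadedFile` objects and the google-generativeai File API (the helper name and `temp/` layout here are illustrative):

```python
# Sketch: persist a Streamlit upload to temp/, send it to the File API, then clean up.
import os

import google.generativeai as genai

def save_and_upload(uploaded, temp_dir: str = "temp"):
    """uploaded: the object returned by st.file_uploader / st.camera_input."""
    os.makedirs(temp_dir, exist_ok=True)
    path = os.path.join(temp_dir, uploaded.name)
    with open(path, "wb") as fh:
        fh.write(uploaded.getbuffer())            # write the in-memory upload to disk
    try:
        return genai.upload_file(path)            # upload to Google's servers
    finally:
        os.remove(path)                           # remove the temporary local copy
```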
- Model instances are cached using Streamlit's `@st.cache_resource`
- File processing includes polling mechanisms for completion status (see the sketch after this list)
- Bounding box coordinates are normalized and converted for accuracy
- Error handling and validation throughout the processing pipeline
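A minimal sketch of those two mechanisms, assuming Streamlit and the google-generativeai SDK (the function names here are illustrative, not the ones used in the app):

```python
# Sketch: cached model loading and a completion-status polling loop.
import time

import google.generativeai as genai
import streamlit as st

@st.cache_resource  # reuse one model instance across Streamlit reruns
def load_model(name: str) -> genai.GenerativeModel:
    return genai.GenerativeModel(name)

def wait_until_active(uploaded_file, poll_seconds: float = 2.0):
    """Poll the File API until server-side processing finishes."""
    while uploaded_file.state.name == "PROCESSING":
        time.sleep(poll_seconds)
        uploaded_file = genai.get_file(uploaded_file.name)
    if uploaded_file.state.name != "ACTIVE":
        raise RuntimeError(f"File {uploaded_file.name} failed processing")
    return uploaded_file
```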
Note: This application requires an active internet connection and valid Google AI API credentials to function properly. Make sure your API key has sufficient quota for the operations you plan to perform.