Visual Understanding Chat Assistant

An AI-powered video analysis and conversational assistant that processes video input, recognizes events, summarizes content, and engages in multi-turn conversations.

Project Overview

This system implements a comprehensive visual understanding chat assistant that:

Analyzes video content using Vision-Language Models (VLM)
Detects events and timestamps with high accuracy
Identifies safety violations and guideline adherence issues
Engages in natural conversations about video content
Maintains conversation context across multiple turns
Provides structured responses with relevant timestamps

Key Features

Video Event Recognition and Summarization

The system accepts video streams up to 2 minutes duration. It identifies specific events within videos such as accidents, traffic violations, pedestrian behavior, conversations, performances, and daily activities. The system provides precise timestamps for all detected events and summarizes video content highlighting key events. It detects guideline adherence and violations with detailed descriptions.

Multi-Turn Conversations

The system supports natural, contextual conversations about video content. It retains conversation history for coherent follow-up responses. The system implements agentic workflow for intelligent query understanding. It handles clarification requests and follow-up questions and provides suggested questions based on video analysis.

Video Input Processing

The system processes various video formats including MP4, WebM, and AVI. It handles videos up to 2 minutes duration. The system extracts representative frames for analysis and provides detailed video metadata and duration information.

System Architecture

Web Interface (HTML/JavaScript) → FastAPI Backend → Video Processor (VLM Analysis)
                                       ↓                      ↓
Chat Handler (Multi-turn) ← Chat History Database ← vLLM Server (Port 8000)
         ↓                       ↓                        ↓
   LLM Server              SQLite Database          Qwen2.5-VL-32B
   (Port 8001)             (Sessions)               (Vision Model)
         ↓
   Llama-3.1-8B
   (Text Model)

Component Details

Frontend Web Interface

Modern HTML5 and JavaScript interface provides drag-and-drop video upload functionality. Real-time chat interface with progress indicators and status updates. Responsive design works on all devices.

Backend FastAPI

RESTful API endpoints handle video upload and chat requests. Async video processing prevents blocking operations. Session management tracks user conversations. Error handling and logging ensure reliability. CORS support enables cross-origin requests.

Video Processor

Frame extraction uses PyAV library for robust video decoding. VLM-based event detection identifies activities and objects. Structured output parsing extracts events with timestamps. Timestamp generation provides precise timing. Safety violation detection identifies concerning content.

Chat Handler

Intent classification understands user query types. Context-aware responses use video analysis results. Multi-turn conversation management maintains dialogue flow. Follow-up question generation suggests relevant queries. Response personalization adapts to user needs.

Database Layer

SQLite stores chat history and session data. Session management tracks user conversations. User conversation tracking maintains context. Foreign key constraints ensure data integrity. Automatic cleanup prevents data bloat.

Tech Stack

Backend Technologies

FastAPI with Python

FastAPI provides high-performance async capabilities and excellent OpenAPI documentation. Native support for modern Python features makes it ideal for AI and ML workloads. Easy integration with ML models enables rapid development.

SQLite Database

SQLite offers lightweight, serverless database functionality perfect for chat history storage. No additional infrastructure required makes deployment simple. Excellent for development and moderate production loads.

PyAV for Video Processing

PyAV provides robust video decoding and frame extraction with comprehensive format support. More reliable than OpenCV for production video processing. Handles edge cases and various video formats well.

AI Model Stack

vLLM Framework

vLLM delivers high-performance inference serving with OpenAI-compatible API. Efficient GPU utilization maximizes hardware performance. Optimal for serving both VLM and LLM models with maximum throughput.

Qwen2.5-VL-32B-Instruct Vision Model

State-of-the-art vision-language understanding provides excellent event detection capabilities. Detailed video analysis with accurate timestamp detection. Handles diverse video content types effectively.

Meta-Llama-3.1-8B-Instruct Text Model

Superior conversational abilities with excellent reasoning and context understanding. Handles complex multi-turn conversations with high quality responses. Natural language processing for user queries.

Hardware Requirements

Two NVIDIA L40S GPU cards with 46GB VRAM each provide sufficient processing power. Linux environment tested on Ubuntu. Docker and Docker Compose for containerized deployment.

Setup and Installation

Prerequisites

Python 3.8 or higher version required. Docker and Docker Compose must be installed. NVIDIA GPU with sufficient VRAM (24GB+ recommended). Linux environment (tested on Ubuntu).

Clone Repository

git clone <repository-url>
cd visual-understanding-chat-assistant

Setup vLLM Servers

The system requires two vLLM instances running simultaneously.

VLM Server on Port 8000 for video analysis:

docker run -d --name vlm \
  --gpus device=0 \
  -p 8000:8000 \
  -v /mnt/models:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-VL-32B-Instruct \
  --trust-remote-code \
  --max-model-len 2048

LLM Server on Port 8001 for conversations:

docker run -d --name llm \
  --gpus device=1 \
  -p 8001:8000 \
  -v /mnt/models:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=your_token_here \
  vllm/vllm-openai:latest \
  --model Meta-Llama/Meta-Llama-3.1-8B-Instruct \
  --max-model-len 2048

Install Python Dependencies

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Verify Setup

Test system components:

python test_system.py

Check API health:

curl http://localhost:8000/health
curl http://localhost:8001/health

Start Application

python main.py

The web interface will be available at: http://localhost:8002

Usage Instructions

Basic Usage Flow

Upload Video Navigate to http://localhost:8002. Click "Choose File" and select a video (maximum 2 minutes). Click "Analyze Video" and wait for analysis to complete (typically 30-60 seconds).
Review Analysis Results View detected events with timestamps. Read the video summary. Check for any safety violations or guideline issues.
Start Conversation Use the chat interface to ask questions about the video. Ask follow-up questions for clarification. Request specific details about events or timestamps.

Example Interactions

Video Analysis Example

The system displays detected events such as "Person Walking (00:15 - 00:30): Individual crosses street using crosswalk" and "Vehicle Movement (00:05 - 00:20): Green bus travels along main road with other vehicles visible". Summary describes overall video content highlighting key activities and interactions.

Conversation Examples

User asks "What happened at the intersection?" Assistant responds with specific details about detected events including timestamps and descriptions of activities observed.

User asks "Can you tell me more about the accident?" Assistant provides detailed analysis of incident including time range, vehicles involved, and circumstances based on video analysis.

User asks "Were there any safety violations?" Assistant identifies specific safety issues detected in video analysis with precise timestamps and severity assessment.

Advanced Features

Time-based Queries

Users can ask "What happened at 01:30?" or "Show me events between 00:15 and 00:45" or "When did the accident occur?" System responds with relevant timestamp information.

Event-specific Queries

Users can request "Tell me about the traffic violations" or "Were there any pedestrian safety issues?" or "What vehicles were involved in the accident?" System provides focused analysis of specific event types.

Follow-up Conversations

Users can ask "Can you explain that in more detail?" or "What were the consequences of that action?" or "How serious was that violation?" System maintains context across conversation turns.

Testing

Test with Sample Videos

Run comprehensive system test:

python test_system.py

Test specific video:

python -c "
import asyncio
from video_processor import VideoProcessor

async def test():
    processor = VideoProcessor()
    result = await processor.analyze_video('eval_dataset/test_sample1.webm')
    print(result)

asyncio.run(test())
"

API Testing

Test video upload:

curl -X POST "http://localhost:8002/analyze-video" \
  -F "video=@eval_dataset/test_sample1.webm"

Test chat functionality:

curl -X POST "http://localhost:8002/chat" \
  -H "Content-Type: application/json" \
  -d '{"message": "What events happened?", "session_id": "test_session", "user_id": "test_user"}'

Performance Metrics

Analysis Speed

Video processing takes 30-60 seconds for 2-minute videos. Event detection identifies 15-30 events detected per video. Response time under 3 seconds for chat responses.

Accuracy

Event detection provides high precision for traffic and safety events. Timestamp accuracy within 2 seconds precision. Conversation relevance maintains context-aware responses.

Resource Usage

GPU memory usage approximately 15GB for VLM and 8GB for LLM. CPU shows moderate usage during video processing. Storage uses temporary files with automatic cleanup.

Configuration

Environment Variables

Set custom model endpoints:

export VLM_BASE_URL="http://localhost:8000/v1"
export LLM_BASE_URL="http://localhost:8001/v1"

Set database path:

export DB_PATH="./chat_history.db"

Set temporary file directory:

export TEMP_DIR="./temp"

Model Configuration

Modify video_processor.py and chat_handler.py to use different models. Adjust max_tokens, temperature, and other parameters as needed. Configure frame extraction settings for different video types.

Troubleshooting

Common Issues

vLLM Servers Not Responding

Check container status with docker ps. Check logs with docker logs vlm and docker logs llm. Restart if needed with docker restart vlm llm.

Video Analysis Fails

Ensure video file is valid and under 2 minutes duration. Check GPU memory availability. Verify video format compatibility.

Chat Responses Poor Quality

Check LLM server status. Verify analysis context is properly formatted. Review conversation history length.

Performance Optimization

Reduce max_frames in video processor for faster analysis. Adjust max_tokens in chat handler for shorter responses. Use GPU monitoring to optimize memory usage.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Built for Mantra Hackathon

Visual Understanding Chat Assistant demonstrates cutting-edge AI capabilities for video analysis and conversational intelligence using state-of-the-art vision-language models and efficient inference frameworks.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.gitignore		.gitignore
LICENSE		LICENSE
LLM_sql_vllm.py		LLM_sql_vllm.py
README.md		README.md
chat_handler.py		chat_handler.py
chat_history_db.py		chat_history_db.py
docker-compose.yaml		docker-compose.yaml
main.py		main.py
requirements.txt		requirements.txt
video_processor.py		video_processor.py

Folders and files

Latest commit

History

Repository files navigation

Visual Understanding Chat Assistant

Project Overview

Key Features

Video Event Recognition and Summarization

Multi-Turn Conversations

Video Input Processing

System Architecture

Component Details

Frontend Web Interface

Backend FastAPI

Video Processor

Chat Handler

Database Layer

Tech Stack

Backend Technologies

FastAPI with Python

SQLite Database

PyAV for Video Processing

AI Model Stack

vLLM Framework

Qwen2.5-VL-32B-Instruct Vision Model

Meta-Llama-3.1-8B-Instruct Text Model

Hardware Requirements

Setup and Installation

Prerequisites

Clone Repository

Setup vLLM Servers

Install Python Dependencies

Verify Setup

Start Application

Usage Instructions

Basic Usage Flow

Example Interactions

Video Analysis Example

Conversation Examples

Advanced Features

Time-based Queries

Event-specific Queries

Follow-up Conversations

Testing

Test with Sample Videos

API Testing

Performance Metrics

Analysis Speed

Accuracy

Resource Usage

Configuration

Environment Variables

Model Configuration

Troubleshooting

Common Issues

vLLM Servers Not Responding

Video Analysis Fails

Chat Responses Poor Quality

Performance Optimization

License

Built for Mantra Hackathon

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages