中文 | English
Author: 寻找格陵兰🌴
AI Scientist × Travel Blogger — an intelligent tool that addresses two major challenges: managing large amounts of footage and producing creative content. It's now open-source to assist every content creator!
This repository discloses the core code we use to manage large volumes of travel videos. Its main functions are video content understanding, automatic tagging, and generating a concise description for each video. The project integrates multimodal models (Google Gemma3 and sensevoice) and leverages large language models to produce accurate descriptions and tags. The results can then be used together with tools such as Adobe Bridge for video tagging, content search, and narration retrieval. Additionally, you can generate script outlines for platforms such as Rednote or TikTok based on the descriptions.
- Video Scene Understanding (Local Deployment): Use Google Gemma3 to analyze the entire video's dynamic information, including people, objects, and scenes.
- Audio Analysis (Local Deployment): Employ sensevoice to analyze audio and extract dialogue, providing textual information for content summarization and tag generation.
- Tag and Video Description Generation (via API): Based on the multimodal information, call a large language model to generate tags, summaries, or descriptions.
- Database Storage and Retrieval (Local Deployment): Use a structured database (SQLite) and a vector database (ChromaDB) to store video analysis results for efficient retrieval.
- Natural Language Video Search (Local Deployment): Use query.py to run natural language queries, finding the best-matching videos through vector similarity search.
- Web Interface Management and Search (Local Deployment): Provide a Flask-based web interface supporting video browsing, searching, previewing, and exporting.
- Search and Management (Local Deployment): Combined with Adobe Bridge or similar tools, search for video clips by keyword.
- Creative Content Production: Combine video descriptions, tags, and extracted dialogue to generate viral script outlines for Rednote or TikTok.
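To make the output of this pipeline concrete, here is a minimal sketch of the kind of per-video record the steps above produce; the field names are illustrative and do not mirror the repository's actual schema.

```python
# Illustrative only: a hypothetical per-video record combining the outputs of
# scene understanding, audio analysis, and LLM-based tag/description generation.
from dataclasses import dataclass, field

@dataclass
class VideoRecord:
    path: str                                      # source clip on disk
    description: str                               # LLM-generated summary
    transcript: str                                # dialogue extracted from the audio track
    tags: list[str] = field(default_factory=list)  # scene/time/location/color tags

record = VideoRecord(
    path="/home/user/videos/IMG_0001.mp4",
    description="A handheld walk along a canyon rim under an overcast sky.",
    transcript="",
    tags=["canyon", "cloudy", "walking", "daytime"],
)
print(record)
```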
Usage Guide (a video introduction is available on Rednote)
The project automatically analyzes video content and narration to create multiple keyword tags, including scene, time, location, color, and more.
Users can enter keywords (e.g., "white") to quickly find relevant videos. The project will automatically filter clips that match the query and display them.
In addition to tags, the system generates detailed descriptions of the video based on its content and stores them as text files for further organization and management.
Use query.py to search the video library using natural language:
python query.py "Help me find videos describing the Grand Canyon, with people walking, cloudy day, people admiring the beauty"
The system will return a list of the most matching videos, including similarity scores, video descriptions, and metadata information.
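Under the hood, this search relies on vector similarity over the stored descriptions. The sketch below shows the general idea with ChromaDB; the collection name, storage path, and embedding setup are assumptions, and the repository's query.py pipeline (which uses HuggingFace embeddings) will differ in detail.

```python
# Minimal sketch of description-based vector search; not the repository's
# actual query.py implementation. "videos" and "db/data" are assumed names.
import chromadb

client = chromadb.PersistentClient(path="db/data")
collection = client.get_or_create_collection("videos")

# during processing, each video's generated description would be indexed
collection.add(
    ids=["IMG_0001.mp4"],
    documents=["People walking along the Grand Canyon rim on a cloudy day."],
    metadatas=[{"location": "Grand Canyon", "weather": "cloudy"}],
)

# at query time, the natural language request is embedded and matched
hits = collection.query(
    query_texts=["people admiring the Grand Canyon, cloudy day"],
    n_results=3,
)
for video_id, distance in zip(hits["ids"][0], hits["distances"][0]):
    print(video_id, f"distance={distance:.3f}")
```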
Start the web interface for visual management:
cd web
python app.py
Web interface features include:
- View video library statistics
- Select folders for video processing
- Natural language video search
- Video preview and playback
- Export selected videos to a specified folder
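For reference, a stripped-down Flask app with a single search route looks roughly like the sketch below; the actual routes, templates, and streaming/export logic live in web/app.py and are more complete.

```python
# Minimal Flask sketch, not the bundled web/app.py. The /search route and its
# response shape are illustrative assumptions.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/search")
def search():
    query = request.args.get("q", "")
    # the real app would run the same vector similarity search as query.py
    # and return matching videos together with their descriptions and tags
    return jsonify({"query": query, "results": []})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)
```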
The clip similarity finder analyzes a video by detecting its scenes/clips and then finding similar videos in the library for each clip:
python tools/clip_similarity_finder.py --video_path /path/to/your/video.mp4 --output_dir /path/to/output
Key capabilities:
- Automatically detects scene changes in videos
- Analyzes audio, video frames, and motion in each clip
- Finds similar videos for each clip using multi-threading
- Ensures each similar video is only used once across all clips
- Organizes results in a clear directory structure
- Supports background information to guide video selection
Advanced options:
python tools/clip_similarity_finder.py --video_path /path/to/video.mp4 --output_dir /path/to/output --threshold 30 --min_duration 1.0 --max_threads 8 --background "Need European city style videos with warm tones"
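As a rough illustration of what threshold-based scene detection can look like (echoing the --threshold and --min_duration options above), here is a simplified frame-difference sketch with OpenCV; the detector actually used by tools/clip_similarity_finder.py may work differently.

```python
# Simplified scene-change detection sketch; illustrative only.
import cv2

def detect_scenes(video_path: str, threshold: float = 30.0, min_duration: float = 1.0):
    """Return approximate clip start times (seconds) based on mean frame difference."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    cuts, prev_gray, frame_idx, last_cut = [0.0], None, 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(cv2.resize(frame, (160, 90)), cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            diff = cv2.absdiff(gray, prev_gray).mean()
            # cut when the change is large and the current clip is long enough
            if diff > threshold and (frame_idx - last_cut) / fps >= min_duration:
                cuts.append(frame_idx / fps)
                last_cut = frame_idx
        prev_gray, frame_idx = gray, frame_idx + 1
    cap.release()
    return cuts

print(detect_scenes("/path/to/your/video.mp4", threshold=30.0, min_duration=1.0))
```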
The text similarity finder locates videos in the library that match text descriptions or instructions:
python tools/text_similarity_finder.py --text "Your text or instructions here" --output_dir /path/to/output
Key capabilities:
- Splits input text into meaningful segments
- Generates visual descriptions for each segment
- Finds similar videos for each description
- Can expand brief instructions into full video scripts
- Supports background information to guide video selection
Advanced options:
# Using a text file as input
python tools/text_similarity_finder.py --text_file /path/to/text_file.txt --output_dir /path/to/output
# Expanding an instruction with target duration
python tools/text_similarity_finder.py --text "Create a short video about spring" --is_instruction --target_duration 30 --output_dir /path/to/output
# With background information
python tools/text_similarity_finder.py --text "Cherry blossoms in bloom" --background "Need European city style videos with warm tones" --output_dir /path/to/output
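To illustrate the per-segment matching idea (splitting the input text and finding videos for each piece), here is a rough sketch; the repository's segmentation logic, LLM-based script expansion, and the collection name/path used below are assumptions.

```python
# Illustrative sketch only: split text into segments and look up the closest
# videos for each one. Not the logic of tools/text_similarity_finder.py.
import re
import chromadb

client = chromadb.PersistentClient(path="db/data")       # assumed storage path
collection = client.get_or_create_collection("videos")   # assumed collection name

text = "Cherry blossoms in bloom. A quiet canal at dusk. Street food stalls at night."
segments = [s.strip() for s in re.split(r"[.!?。！？\n]+", text) if s.strip()]

for segment in segments:
    hits = collection.query(query_texts=[segment], n_results=3)
    print(segment, "->", hits["ids"][0])
```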
The system now uses a simplified architecture with a single model environment based on Google Gemma3 for video understanding.
- Development & Testing Environment
  - Mac mini M4 Pro, 24GB Unified Memory
  - Tested only on this configuration. If you need to run in a CUDA environment or a pure CPU environment, adjust parameters and paths in the code accordingly.
- Deep Learning Dependencies
  - Google Gemma3 for video understanding
  - sensevoice for audio transcription
  - Other dependencies listed in requirements.txt
- Database Dependencies
  - SQLite (structured data storage)
  - ChromaDB (vector database)
  - HuggingFace Embeddings (vector embeddings)
git clone https://github.com/greenland-dream/video-understanding.git
cd video-understanding
pip install -r requirements.txt
- Modify config/model_config.yaml to configure API provider priorities.
- Copy config/api_configs.json.example to config/api_configs.json and fill in the necessary API keys. Currently supported APIs: siliconflow, deepseek_call, github_call, azure_call, and qwen_call. You can configure the priority of each API provider in the config file, and the code supports dynamically switching among them.
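The dynamic switching between providers comes down to trying them in the configured priority order and falling back on failure. Here is a minimal, self-contained sketch of that idea; the provider names below come from the config description, but the callables and dispatch code are illustrative, not the repository's actual clients.

```python
# Illustrative fallback loop; the real dispatch logic and API clients live in
# the repository's modules and read their priorities from config/model_config.yaml.
def call_with_fallback(prompt, providers):
    """providers: list of (name, callable) pairs, highest priority first."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:             # e.g. quota exhausted or network error
            errors.append(f"{name}: {exc}")  # try the next provider in line
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# dummy providers for demonstration
def flaky_provider(prompt):
    raise TimeoutError("simulated timeout")

def working_provider(prompt):
    return f"tags for: {prompt}"

print(call_with_fallback("Describe this clip", [
    ("deepseek_call", flaky_provider),
    ("siliconflow", working_provider),
]))
```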
This project provides multiple ways to run:
- Open main.py and replace "your_folder_path" with your video folder path, for instance:

      folder_paths = [
          "/home/user/videos"  # e.g., your video folder
      ]

- Add a meta_data.txt file: inside the "/home/user/videos" folder, add a meta_data.txt file that contains a brief description (one sentence) of the videos' shoot time/location. For example:

      These videos were shot in December 2024 in the town of CONSUEGRA, Spain.

- Run the Python code:

      python main.py

  The code will iterate through each folder listed in folder_paths and automatically process any videos within them.
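For orientation, the folder scan described above amounts to something like the sketch below; it only shows the discovery step, and main.py's real processing pipeline (scene understanding, audio analysis, tagging, database storage) is far more involved.

```python
# Illustrative folder scan only; not main.py itself.
from pathlib import Path

folder_paths = ["/home/user/videos"]           # same setting as in main.py
VIDEO_EXTS = {".mp4", ".mov", ".m4v", ".avi"}  # adjust to your footage

for folder in map(Path, folder_paths):
    meta = (folder / "meta_data.txt").read_text(encoding="utf-8").strip()
    videos = [p for p in folder.iterdir() if p.suffix.lower() in VIDEO_EXTS]
    print(f"{folder}: {len(videos)} videos, shoot info: {meta}")
    # each video would then go through scene understanding, audio analysis,
    # tag/description generation, and storage in SQLite/ChromaDB
```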
After processing videos, you can use natural language queries to search for videos:
python query.py "Help me find videos describing the Grand Canyon, with people walking, cloudy day, people admiring the beauty"
The system will return a list of the most matching videos, sorted by similarity.
Start the web interface for visual management:
cd web
python app.py
Then visit http://127.0.0.1:5000 in your browser to use the web interface.
- You can add multiple folder paths to folder_paths; each folder must contain a meta_data.txt file.
- Make sure the paths are formatted correctly, such as:
  - macOS/Linux: "/Users/yourname/Videos"
- Database files will be stored in the db/data/ directory, including the SQLite database and the ChromaDB vector database.
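If you want to peek at what has been stored, the SQLite side can be inspected with the standard library; the exact database file name is defined by db/video_db.py, so the glob pattern below is a guess.

```python
# Illustrative inspection of the SQLite store under db/data/; the file name
# pattern is an assumption, adjust it to whatever db/video_db.py creates.
import glob
import sqlite3

for db_path in glob.glob("db/data/*.db") + glob.glob("db/data/*.sqlite*"):
    with sqlite3.connect(db_path) as conn:
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"
        ).fetchall()
    print(db_path, "->", [name for (name,) in tables])
```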
.
├── config/                        # Configuration files
├── db/                            # Database files
│   ├── data/                      # Store SQLite and ChromaDB data
│   └── video_db.py                # Database operation class
├── docs/                          # Documentation
├── modules/                       # Core modules
│   ├── video_query/               # Video query module
│   └── ...
├── tools/                         # Utility tools
│   ├── clip_similarity_finder.py  # Find similar videos for each clip in a video
│   └── text_similarity_finder.py  # Find videos matching text descriptions
├── utils/                         # Utility functions
├── web/                           # Web interface
│   ├── app.py                     # Flask application
│   ├── static/                    # Static resources
│   └── templates/                 # HTML templates
├── main.py                        # Main entry script (video processing)
├── query.py                       # Query entry script (video search)
├── requirements.txt               # Dependency list
└── README.md                      # Documentation
The following flowchart illustrates the relationships and data flow between the three main entry points of the system:
This flowchart shows:
- Video Processing Flow (main.py): Processes videos by extracting audio, analyzing frames, generating descriptions and tags, and storing results in databases.
- Video Query Flow (query.py): Parses natural language queries, searches by description and transcript, and displays results to the user.
- Web Interface Flow (web/app.py): Provides a web interface for video statistics, processing, searching, streaming, and exporting.
The dotted lines represent connections between different modules, particularly how they all interact with the shared database.
This project is released under the MIT License.
Pull Requests and Issues are welcome!
Thank you to everyone who has supported and contributed to this project!