diff --git a/tutorials/video/getting-started/video_pipeline_tutorial.ipynb b/tutorials/video/getting-started/video_pipeline_tutorial.ipynb new file mode 100644 index 000000000..b10c221ad --- /dev/null +++ b/tutorials/video/getting-started/video_pipeline_tutorial.ipynb @@ -0,0 +1,693 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Video Pipeline Tutorial with NeMo Curator\n", + "\n", + "This notebook demonstrates how to use NeMo Curator's video curation pipeline to process videos, extract clips, generate embeddings, and create captions.\n", + "\n", + "## Table of Contents\n", + "1. [Installation and Setup](#installation-and-setup)\n", + "2. [Understanding the Video Pipeline](#understanding-the-video-pipeline)\n", + "3. [Basic Example: Reading Videos](#basic-example-reading-videos)\n", + "4. [Advanced Example: Complete Video Processing](#advanced-example-complete-video-processing)\n", + "5. [Pipeline Parameters Explained](#pipeline-parameters-explained)\n", + "6. [Troubleshooting](#troubleshooting)\n", + "\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Installation and Setup\n", + "\n", + "### Prerequisites\n", + "\n", + "Before running the video pipeline, ensure you have:\n", + "\n", + "- **NVIDIA GPU** with Volta™ or higher (compute capability 7.0+)\n", + "- **CUDA 12 or above**\n", + "- **FFmpeg 7+** (will be installed using the provided script)\n", + "\n", + "### System Requirements\n", + "\n", + "- **Memory**: 16GB+ RAM for basic processing\n", + "- **GPU Memory**: 16GB+ VRAM recommended (up to 38GB for full pipeline with captions)\n", + "- **Storage**: Sufficient space for input videos and output clips\n", + "\n", + "### Installation Steps\n", + "\n", + "1. **Install FFmpeg:**\n", + "First, install FFmpeg using the provided installation script:\n", + "```bash\n", + "# Download and run the FFmpeg installation script\n", + "curl -O https://raw.githubusercontent.com/NVIDIA-NeMo/Curator/main/docker/common/install_ffmpeg.sh\n", + "chmod +x install_ffmpeg.sh\n", + "./install_ffmpeg.sh\n", + "```\n", + "\n", + "2. **Install UV (if not already installed):**\n", + "UV is a fast Python package installer and resolver that's significantly faster than pip:\n", + "```bash\n", + "# Install UV package manager\n", + "curl -LsSf https://astral.sh/uv/install.sh | sh\n", + "# Or on Windows: powershell -c \"irm https://astral.sh/uv/install.ps1 | iex\"\n", + "```\n", + "\n", + "3. **Create and activate a virtual environment with UV:**\n", + "```bash\n", + "uv venv .venv\n", + "source .venv/bin/activate # On Windows: .venv\\Scripts\\activate\n", + "```\n", + "\n", + "4. **Install NeMo Curator with video support using UV:**\n", + "```bash\n", + "uv pip install \"nemo-curator[video,video_cuda]\"\n", + "```\n", + "\n", + "5. **Verify installation:**\n", + "```bash\n", + "python -c \"import nemo_curator; print('Installation successful!')\"\n", + "```\n", + "\n", + "### Download Required Models\n", + "\n", + "The video pipeline requires several pre-trained models (e.g. [Cosmos Embed](https://huggingface.co/nvidia/Cosmos-Embed1-448p)). Models will be downloaded automatically based on the selected stages.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Understanding the Video Pipeline\n", + "\n", + "NeMo Curator's video pipeline is built on a **stage-based architecture** where each stage performs a specific processing step:\n", + "\n", + "### Core Components\n", + "\n", + "1. 
**Pipelines**: Ordered sequences of stages forming an end-to-end workflow\n", + "2. **Stages**: Individual processing units that perform single steps\n", + "3. **Tasks**: Data units that flow through the pipeline (`VideoTask` containing `Video` and `Clip` objects)\n", + "4. **Executors**: Components that run pipelines on distributed backends (Ray)\n", + "\n", + "### Pipeline Stages\n", + "\n", + "The video pipeline includes these stages (all optional - choose based on your needs):\n", + "\n", + "1. **VideoReader**: Reads video files and extracts metadata\n", + "2. **Splitting Algorithm**: \n", + " - **Fixed Stride**: Splits videos into fixed-length clips\n", + " - **TransNetV2**: Uses AI to detect scene transitions for intelligent splitting ([GitHub](https://github.com/soCzech/TransNetV2))\n", + "3. **ClipTranscodingStage**: Converts clips to standardized format\n", + "4. **MotionFilterStage**: Filters clips based on motion content\n", + "5. **ClipAestheticFilterStage**: Filters clips based on aesthetic quality using [CLIP](https://openai.com/research/clip) model\n", + "6. **Embedding Generation**: Creates vector embeddings for similarity search\n", + " - **Cosmos-Embed1**: NVIDIA's state-of-the-art video embedding model (224p, 336p, 448p variants) ([Hugging Face](https://huggingface.co/nvidia/Cosmos-Embed1-448p))\n", + " - **InternVideo2**: Advanced video understanding model for comprehensive embeddings ([GitHub](https://github.com/OpenGVLab/InternVideo2))\n", + "7. **Caption Generation**: Generates text descriptions of video content using [Qwen-VL](https://huggingface.co/Qwen/Qwen-VL) model\n", + "8. **Caption Enhancement**: Improves and refines generated captions using [Qwen-LM](https://huggingface.co/Qwen/Qwen2.5-7B) for better quality\n", + "9. **ClipWriterStage**: Saves processed clips and metadata\n", + "\n", + "### Data Flow\n", + "\n", + "```\n", + "Input Videos → VideoReader → Splitting → Transcoding → Filtering → Embeddings → Captions → Caption Enhancement → Output\n", + "```\n", + "\n", + "**Note**: All stages except VideoReader are optional. You can customize the pipeline by:\n", + "- **Basic**: VideoReader → Splitting → Transcoding → Output\n", + "- **With Quality Control**: Add Motion/Aesthetic filtering\n", + "- **With AI Features**: Add Embedding generation and/or Caption generation\n", + "- **Full Pipeline**: Include all stages for comprehensive video processing\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Running the Basic Example\n", + "\n", + "[`video_read_example.py`](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/video/getting-started/video_read_example.py). 
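This script builds a minimal pipeline that wires the `VideoReader` stage into a `Pipeline`, reads the videos, and logs their metadata.\n",
+    "\n",
+    "The sketch below only illustrates the general shape of such a read-only pipeline; it is not a drop-in replacement for the script. The `VideoReader` import path and the constructor arguments shown here are assumptions for illustration, so refer to the linked file for the actual code:\n",
+    "\n",
+    "```python\n",
+    "from nemo_curator.pipeline.pipeline import Pipeline\n",
+    "\n",
+    "# Assumed import path and arguments for the reader stage; check video_read_example.py for the real ones\n",
+    "from nemo_curator.stages.video.io.video_reader import VideoReader\n",
+    "\n",
+    "# Build a pipeline with a single stage that reads videos and extracts metadata\n",
+    "pipeline = Pipeline(name=\"video_read\")\n",
+    "pipeline.add_stage(VideoReader(input_video_path=\"/path/to/your/videos\", video_limit=5, verbose=True))\n",
+    "\n",
+    "# Execute the pipeline on the default (Ray-based) backend\n",
+    "pipeline.run()\n",
+    "```\n",
+    "\n",
+    "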
To run this example:\n", + "\n", + "```bash\n", + "python video_read_example.py --video-folder /path/to/your/videos --video-limit 5 --verbose\n", + "```\n", + "\n", + "**Parameters:**\n", + "- `--video-folder`: Path to directory containing video files\n", + "- `--video-limit`: Maximum number of videos to process (-1 for unlimited)\n", + "- `--verbose`: Enable detailed logging\n", + "\n", + "**What it does:**\n", + "- Reads video files from the specified directory\n", + "- Extracts metadata (duration, framerate, resolution, etc.)\n", + "- Processes videos in parallel using Ray\n", + "- Provides detailed logging of the process\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Running the Advanced Example\n", + "\n", + "The comprehensive video processing example is available in [video_split_clip_example.py](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/video/getting-started/video_split_clip_example.py)\n", + "\n", + "To run the comprehensive video processing pipeline, use the provided script:\n", + "\n", + "Key features of the comprehensive pipeline:\n", + "- Video reading and metadata extraction\n", + "- Multiple splitting algorithms (Fixed Stride and TransNetV2)\n", + "- Clip transcoding with various encoders \n", + "- Motion and aesthetic filtering\n", + "- Embedding generation (Cosmos-Embed1, InternVideo2)\n", + "- Caption generation (Qwen)\n", + "- Preview generation\n", + "- Flexible output options\n", + "\n", + "\n", + "```bash\n", + "python video_split_clip_example.py \\\n", + " --video-dir /path/to/your/videos \\\n", + " --model-dir /path/to/models \\\n", + " --output-clip-path /path/to/output/clips \\\n", + " --splitting-algorithm fixed_stride \\\n", + " --generate-embeddings \\\n", + " --video-limit 5 \\\n", + " --verbose\n", + "```\n", + "\n", + "**Key Parameters:**\n", + "- `--video-dir`: Input video directory\n", + "- `--model-dir`: Model directory (Can be empty and models will be automatically downloaded)\n", + "- `--output-clip-path`: Output directory for processed clips\n", + "- `--splitting-algorithm`: Choose between \"fixed_stride\" or \"transnetv2\"\n", + "- `--generate-embeddings`: Enable embedding generation\n", + "- `--generate-captions`: Enable caption generation\n", + "- `--aesthetic-threshold`: Filter clips by aesthetic score (e.g., 3.5)\n", + "- `--motion-filter`: Motion filtering mode (\"disable\", \"enable\", \"score-only\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pipeline Parameters Explained\n", + "\n", + "### Splitting Algorithms\n", + "\n", + "#### Fixed Stride Splitting\n", + "**What it does**: Splits videos into clips of fixed duration at regular intervals.\n", + "- **Parameters**:\n", + " - `--fixed-stride-split-duration`: Duration of each clip in seconds (default: 10.0)\n", + " - `--fixed-stride-min-clip-length-s`: Minimum clip length in seconds (default: 2.0)\n", + " - `--limit-clips`: Maximum clips per video (0 = unlimited)\n", + "\n", + "#### TransNetV2 Splitting\n", + "**What it does**: Uses AI to detect scene transitions and intelligently split videos at natural break points.\n", + "- **Parameters**:\n", + " - `--transnetv2-threshold`: Probability threshold for scene transitions (default: 0.4)\n", + " - `--transnetv2-min-length-s`: Minimum scene length in seconds (default: 2.0)\n", + " - `--transnetv2-max-length-s`: Maximum scene length in seconds (default: 10.0)\n", + " - `--transnetv2-max-length-mode`: How to handle long scenes (\"truncate\" or \"stride\")\n", + " 
- `--transnetv2-crop-s`: Seconds to crop from start/end of scenes (default: 0.5)\n", + "\n", + "### Transcoding Parameters\n", + "\n", + "**What it does**: Converts video clips to a standardized format for consistent processing and storage.\n", + "- `--transcode-encoder`: Video encoder (\"libopenh264\", \"h264_nvenc\", \"libx264\")\n", + "- `--transcode-encoder-threads`: CPU threads per encoding operation\n", + "- `--transcode-ffmpeg-batch-size`: Number of clips to encode in parallel\n", + "- `--transcode-use-hwaccel`: Use GPU acceleration for decoding\n", + "- `--transcode-use-input-video-bit-rate`: Use input video's bit rate\n", + "\n", + "### Filtering Parameters\n", + "\n", + "#### Motion Filtering\n", + "**What it does**: Analyzes video motion content to filter out static or low-motion clips.\n", + "- `--motion-filter`: Mode (\"disable\", \"enable\", \"score-only\")\n", + "- `--motion-global-mean-threshold`: Global motion threshold (default: 0.00098)\n", + "- `--motion-per-patch-min-256-threshold`: Per-patch motion threshold (default: 0.000001)\n", + "\n", + "#### Aesthetic Filtering\n", + "**What it does**: Uses AI to score video clips based on visual quality and aesthetic appeal.\n", + "- `--aesthetic-threshold`: Minimum aesthetic score (e.g., 3.5)\n", + "- `--aesthetic-reduction`: Score reduction method (\"mean\" or \"min\")\n", + "\n", + "### Embedding Parameters\n", + "\n", + "**What it does**: Generates vector embeddings from video clips for similarity search and clustering.\n", + "- `--embedding-algorithm`: Algorithm (\"cosmos-embed1-224p\", \"cosmos-embed1-336p\", \"cosmos-embed1-448p\", \"internvideo2\")\n", + "- `--embedding-gpu-memory-gb`: GPU memory allocation (default: 20.0)\n", + "\n", + "### Captioning Parameters\n", + "\n", + "**What it does**: Generates text descriptions of video content using AI vision-language models.\n", + "- `--generate-captions`: Enable caption generation\n", + "- `--captioning-algorithm`: Model variant (\"qwen\")\n", + "- `--captioning-batch-size`: Batch size for processing (default: 8)\n", + "- `--captioning-max-output-tokens`: Maximum tokens per caption (default: 512)\n", + "- `--captioning-sampling-fps`: Frames per second for sampling (default: 2.0)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Example Usage Scenarios\n", + "\n", + "### Scenario 1: Basic Video Splitting\n", + "For simple video splitting without advanced features:\n", + "\n", + "```bash\n", + "python video_split_clip_example.py \\\n", + " --video-dir /path/to/videos \\\n", + " --model-dir /path/to/models \\\n", + " --output-clip-path /path/to/output \\\n", + " --splitting-algorithm fixed_stride \\\n", + " --fixed-stride-split-duration 15.0 \\\n", + " --video-limit 10\n", + "```\n", + "\n", + "### Scenario 2: High-Quality Video Processing\n", + "For production-quality processing with all features:\n", + "\n", + "```bash\n", + "python video_split_clip_example.py \\\n", + " --video-dir /path/to/videos \\\n", + " --model-dir /path/to/models \\\n", + " --output-clip-path /path/to/output \\\n", + " --splitting-algorithm transnetv2 \\\n", + " --transnetv2-threshold 0.3 \\\n", + " --transnetv2-min-length-s 3.0 \\\n", + " --transnetv2-max-length-s 15.0 \\\n", + " --generate-embeddings \\\n", + " --embedding-algorithm cosmos-embed1-336p \\\n", + " --generate-captions \\\n", + " --captioning-batch-size 4 \\\n", + " --aesthetic-threshold 3.5 \\\n", + " --motion-filter enable \\\n", + " --transcode-encoder h264_nvenc \\\n", + " --transcode-use-hwaccel 
\\\n", + " --video-limit 50\n", + "```\n", + "\n", + "### Scenario 3: Quick Testing\n", + "For rapid testing with minimal resources:\n", + "\n", + "```bash\n", + "python video_split_clip_example.py \\\n", + " --video-dir /path/to/videos \\\n", + " --model-dir /path/to/models \\\n", + " --output-clip-path /path/to/output \\\n", + " --splitting-algorithm fixed_stride \\\n", + " --fixed-stride-split-duration 5.0 \\\n", + " --transcode-encoder libopenh264 \\\n", + " --video-limit 3 \\\n", + " --dry-run\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Interactive End-to-End Example\n", + "\n", + "Now let's put everything together! This section will walk you through a complete video processing pipeline from start to finish.\n", + "\n", + "### What We'll Do\n", + "\n", + "1. **Download sample videos** from the PE-Video dataset\n", + "2. **Process the videos** using NeMo Curator's video pipeline\n", + "3. **Explore the results** and understand the output structure\n", + "\n", + "This hands-on example will help you understand how all the components work together in practice.\n", + "\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 1: Download Sample Videos\n", + "\n", + "First, let's download some sample videos from the [PE-Video](https://huggingface.co/datasets/facebook/PE-Video) dataset. This will give us real video content to work with.\n", + "\n", + "The following code cell would download 10 videos from PE-Video dataset:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Install required dependencies for this example\n", + "!pip install datasets\n", + "\n", + "import os\n", + "from pathlib import Path\n", + "\n", + "from datasets import load_dataset\n", + "\n", + "# Create output directory for sample videos\n", + "output_dir = Path(\"./pe_video_samples\")\n", + "output_dir.mkdir(exist_ok=True)\n", + "\n", + "print(f\"Downloading sample videos to: {output_dir.absolute()}\")\n", + "\n", + "# Load PE-Video dataset (streaming mode for efficiency)\n", + "dataset = load_dataset(\"facebook/PE-Video\", split=\"train\", streaming=True)\n", + "\n", + "# Download 10 sample videos (adjust this number as needed)\n", + "count = 0\n", + "max_videos = 10\n", + "\n", + "print(f\"Downloading {max_videos} sample videos...\")\n", + "\n", + "for sample in dataset:\n", + " if count >= max_videos:\n", + " break\n", + "\n", + " video_data = sample.get(\"mp4\")\n", + " description = sample.get(\"json\", {}).get(\"description\", f\"video_{count+1}\")\n", + "\n", + " if video_data:\n", + " # Create safe filename\n", + " safe_name = \"\".join(c for c in description[:30] if c.isalnum() or c in (\" \", \"-\", \"_\")).strip()\n", + " filename = f\"{safe_name}_{count+1}.mp4\" if safe_name else f\"video_{count+1}.mp4\"\n", + "\n", + " # Save video\n", + " with open(output_dir / filename, \"wb\") as f:\n", + " f.write(video_data)\n", + "\n", + " print(f\"✓ Downloaded: {filename}\")\n", + " count += 1\n", + "\n", + "print(f\"Successfully downloaded {count} videos to {output_dir.absolute()}\")\n", + "print(\"Video files:\")\n", + "for video_file in output_dir.glob(\"*.mp4\"):\n", + " file_size = video_file.stat().st_size / (1024 * 1024) # Size in MB\n", + " print(f\" - {video_file.name} ({file_size:.1f} MB)\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 2: Set Up Video Processing Pipeline\n", + "\n", + "Now let's configure and 
run the video processing pipeline on our downloaded videos. We'll use a moderate configuration that demonstrates key features without requiring excessive resources.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Command Breakdown\n", + "\n", + "The following command runs the complete video processing pipeline. Here's what each parameter does:\n", + "\n", + "**📁 Input/Output:**\n", + "- `--video-dir ./pe_video_samples` → Input directory containing our downloaded videos\n", + "- `--output-clip-path ./processed_clips` → Output directory where processed clips will be saved\n", + "\n", + "**✂️ Video Splitting:**\n", + "- `--splitting-algorithm fixed_stride` → Split videos into clips using fixed time intervals\n", + " - *Alternative: `transnetv2` for AI-based scene detection*\n", + "- `--fixed-stride-split-duration 8.0` → Each clip will be 8 seconds long\n", + "- `--fixed-stride-min-clip-length-s 2.0` → Discard clips shorter than 2 seconds\n", + "\n", + "**🎥 Video Processing:**\n", + "- `--transcode-encoder libopenh264` → Use libopenh264 codec (good speed/quality balance)\n", + " - *Alternatives: `h264_nvenc` (GPU), `libx264` (CPU)*\n", + "- `--transcode-ffmpeg-batch-size 8` → Process 8 clips in parallel during transcoding\n", + "\n", + "**🧠 AI Features:**\n", + "- `--generate-embeddings` → Generate vector embeddings for similarity search and clustering\n", + "- `--embedding-algorithm cosmos-embed1-224p` → Use NVIDIA's Cosmos-Embed1 model at 224p resolution\n", + " - *Alternatives: `cosmos-embed1-336p`, `cosmos-embed1-448p`, `internvideo2`*\n", + "- `--embedding-gpu-memory-gb 8.0` → Allocate 8GB of GPU memory for embedding generation\n", + "\n", + "**🔍 Quality Filtering:**\n", + "- `--motion-filter score-only` → Calculate motion scores but don't filter clips based on motion\n", + " - *Alternatives: `enable` (filter low-motion clips), `disable` (no motion analysis)*\n", + "- `--aesthetic-threshold 3.0` → Filter out clips with aesthetic scores below 3.0 (1-5 scale)\n", + " - *Higher values = more selective filtering*\n", + "\n", + "**⚙️ Processing Control:**\n", + "- `--video-limit 3` → Process only 3 videos (for this example)\n", + " - *Remove this parameter to process all videos*\n", + "- `--verbose` → Show detailed progress information during processing\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!python video_split_clip_example.py \\\n", + " --video-dir ./pe_video_samples \\\n", + " --output-clip-path ./processed_clips \\\n", + " --splitting-algorithm fixed_stride \\\n", + " --fixed-stride-split-duration 8.0 \\\n", + " --fixed-stride-min-clip-length-s 2.0 \\\n", + " --transcode-encoder libopenh264 \\\n", + " --transcode-ffmpeg-batch-size 8 \\\n", + " --generate-embeddings \\\n", + " --embedding-algorithm cosmos-embed1-224p \\\n", + " --embedding-gpu-memory-gb 8.0 \\\n", + " --motion-filter score-only \\\n", + " --aesthetic-threshold 3.0 \\\n", + " --video-limit 3 \\\n", + " --verbose\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 3: Understanding the Output\n", + "\n", + "The video pipeline produces several types of output:\n", + "\n", + "#### 📁 Directory Structure\n", + "```\n", + "processed_clips/\n", + "├── clips/ # Processed video clips (.mp4 files)\n", + "│ ├── video1_clip_0.mp4\n", + "│ ├── video1_clip_1.mp4\n", + "│ └── ...\n", + "├── metadata/ # Metadata files (.json)\n", + "│ ├── video1_metadata.json\n", + "│ └── ...\n", + "└── iv2_embd/ 
# InternVideo2 Embedding files (if generated)\n", + " └── ...\n", + "```\n", + "\n", + "#### 📊 Metadata Fields\n", + "Each clip in the metadata includes:\n", + "- **Basic Info**: `clip_path`, `duration`, `fps`, `resolution`\n", + "- **Quality Scores**: `aesthetic_score`, `motion_score`\n", + "- **AI Features**: `embedding` (vector), `caption` (text description)\n", + "- **Processing Info**: `source_video`, `clip_index`, `timestamp`\n", + "\n", + "#### 🎯 Next Steps\n", + "Now that you've seen the complete pipeline in action, you can:\n", + "\n", + "1. **Experiment with parameters** - Try different splitting algorithms, thresholds, or models\n", + "2. **Scale up** - Process more videos or use higher-quality settings\n", + "3. **Customize the pipeline** - Add or remove stages based on your needs\n", + "4. **Use the results** - Leverage embeddings for similarity search or captions for content analysis\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Video Deduplication Pipeline\n", + "\n", + "After processing videos and generating embeddings, you may want to remove duplicate or very similar video clips from your dataset. NeMo Curator provides a powerful semantic deduplication pipeline that uses the generated embeddings to identify and remove near-duplicate content.\n", + "\n", + "### What is Semantic Deduplication?\n", + "\n", + "Semantic deduplication goes beyond simple hash-based deduplication by understanding the *content* of videos. It uses the embeddings generated in the previous steps to:\n", + "\n", + "- **Identify similar content** even when videos have different encoding, resolution, or slight variations\n", + "- **Group similar clips** using clustering algorithms\n", + "- **Remove duplicates** while preserving the most representative examples\n", + "- **Maintain metadata** for all processed clips\n", + "\n", + "### When to Use Deduplication\n", + "\n", + "- **Large datasets** with potential duplicate content\n", + "- **Video collections** from multiple sources\n", + "- **Content curation** where quality over quantity matters\n", + "- **Storage optimization** by removing redundant clips\n", + "- **Training data preparation** for machine learning models\n", + "\n", + "### Deduplication Pipeline Parameters\n", + "\n", + "The semantic deduplication pipeline offers several key parameters:\n", + "\n", + "- **`n_clusters`**: Number of clusters for grouping similar content (default: 100)\n", + "- **`distance_metric`**: Method for measuring similarity (\"cosine\", \"euclidean\", \"manhattan\")\n", + "- **`eps`**: Maximum distance threshold for considering clips as duplicates (lower = more strict)\n", + "- **`which_to_keep`**: Strategy for selecting which clip to keep from duplicates (\"random\", \"first\", \"last\")\n", + "- **`random_state`**: Seed for reproducible results\n", + "\n", + "### Running the Deduplication Pipeline\n", + "\n", + "The following example shows how to run semantic deduplication on your processed video clips:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Import required modules\n", + "from nemo_curator.pipeline.pipeline import Pipeline\n", + "from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow\n", + "\n", + "# Configuration for deduplication\n", + "# Update these paths to match your actual processed video output\n", + "input_embeddings_path = \"./processed_clips/iv2_embd_parquet\" # Path to your embedding parquet files\n", + 
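"# NOTE: iv2_embd_parquet is only an example; point input_embeddings_path at the embedding parquet folder your run actually produced\n",
+    "\n",
+    "import os  # os.makedirs is called below; imported here in case the earlier download cell was not run in this session\n",
+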
"output_dedup_path = \"./processed_clips/dedup_output\" # Path for deduplicated results\n", + "\n", + "# Create output directory if it doesn't exist\n", + "os.makedirs(output_dedup_path, exist_ok=True)\n", + "\n", + "# Create the deduplication pipeline\n", + "def create_video_dedup_pipeline() -> Pipeline:\n", + " return SemanticDeduplicationWorkflow(\n", + " input_path=input_embeddings_path,\n", + " output_path=output_dedup_path,\n", + " id_field=\"id\", # Field containing unique clip identifiers\n", + " embedding_field=\"embeddings\", # Field containing the vector embeddings\n", + " metadata_fields=[\"id\"], # Additional metadata fields to preserve\n", + " n_clusters=100, # Number of clusters for grouping similar content\n", + " distance_metric=\"cosine\", # Distance metric for similarity calculation\n", + " which_to_keep=\"random\", # Strategy for selecting which duplicate to keep\n", + " random_state=42, # Random seed for reproducible results\n", + " eps=0.002, # Maximum distance threshold for duplicates (lower = more strict)\n", + " # Storage options for local filesystem\n", + " read_kwargs={\"storage_options\": {}},\n", + " write_kwargs={\"storage_options\": {}},\n", + " verbose=True, # Enable detailed logging\n", + " )\n", + "\n", + "# Run the deduplication pipeline\n", + "print(\"Starting video deduplication pipeline...\")\n", + "print(f\"Input embeddings: {input_embeddings_path}\")\n", + "print(f\"Output directory: {output_dedup_path}\")\n", + "\n", + "# Create and run the pipeline\n", + "pipeline = create_video_dedup_pipeline()\n", + "pipeline.run()\n", + "\n", + "print(\"Deduplication completed!\")\n", + "print(f\"Results saved to: {output_dedup_path}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Understanding Deduplication Results\n", + "\n", + "After running the deduplication pipeline, you'll find:\n", + "\n", + "#### 📁 Output Structure\n", + "```\n", + "dedup_output/\n", + "├── deduplicated_embeddings.parquet # Deduplicated embedding data\n", + "├── cluster_assignments.parquet # Cluster membership for each clip\n", + "└── duplicate_groups.parquet # Groups of identified duplicates\n", + "```\n", + "\n", + "#### 📊 Key Metrics\n", + "The pipeline provides several useful metrics:\n", + "- **Total clips processed**: Number of input clips\n", + "- **Duplicates found**: Number of clips identified as duplicates\n", + "- **Deduplication ratio**: Percentage of clips removed\n", + "- **Clusters created**: Number of similarity groups formed\n", + "\n", + "#### 🎯 Customizing Deduplication\n", + "\n", + "You can adjust the deduplication behavior by modifying these parameters:\n", + "\n", + "**Strictness Control:**\n", + "- **`eps=0.001`**: Very strict (only nearly identical clips are considered duplicates)\n", + "- **`eps=0.005`**: Moderate (somewhat similar clips are considered duplicates)\n", + "- **`eps=0.01`**: Lenient (loosely similar clips are considered duplicates)\n", + "\n", + "**Clustering Strategy:**\n", + "- **`n_clusters=50`**: Fewer, larger clusters (more aggressive deduplication)\n", + "- **`n_clusters=200`**: More, smaller clusters (more conservative deduplication)\n", + "\n", + "**Distance Metrics:**\n", + "- **`\"cosine\"`**: Best for high-dimensional embeddings (recommended)\n", + "- **`\"euclidean\"`**: Good for normalized embeddings\n", + "- **`\"manhattan\"`**: Alternative for specific use cases\n", + "\n", + "### Integration with Video Pipeline\n", + "\n", + "The deduplication pipeline seamlessly integrates with the video 
processing pipeline:\n", + "\n", + "1. **Process videos** → Generate embeddings using the video pipeline\n", + "2. **Run deduplication** → Remove duplicate clips using this pipeline\n", + "3. **Use results** → Apply deduplicated dataset for your specific use case\n", + "\n", + "This two-step approach ensures you have both high-quality video content and an optimized, duplicate-free dataset.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", + "\n", + "Now that you understand the basics of NeMo Curator's video pipeline, you can:\n", + "\n", + "1. **Experiment with different parameters** to optimize for your specific use case\n", + "2. **Scale up processing** by increasing `--video-limit` and using more powerful hardware\n", + "3. **Customize the pipeline** by adding or removing stages based on your needs\n", + "4. **Integrate with other tools** by using the generated embeddings and metadata\n", + "5. **Explore advanced features** like caption enhancement and preview generation\n", + "\n", + "### Additional Resources\n", + "\n", + "- **Official Documentation**: [NeMo Curator Video Guide](https://docs.nvidia.com/nemo-curator/)\n", + "- **API Reference**: Detailed documentation of all stages and parameters\n", + "- **Examples**: More complex examples in the `tutorials/` directory\n", + "- **Community**: Join discussions and get help from the community\n", + "\n", + "### Key Takeaways\n", + "\n", + "- NeMo Curator provides a powerful, scalable framework for video curation\n", + "- The pipeline is modular and can be customized for different use cases\n", + "- GPU acceleration significantly improves performance for large-scale processing\n", + "- Proper parameter tuning is essential for optimal results\n", + "- The system handles distributed processing automatically through Ray\n", + "\n", + "Happy video curating! 🎬✨" + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}