A multimodal video annotation pipeline that runs entirely on local GPU hardware. One CLI run takes a long-form video and produces a time-aligned event corpus across audio, vision, on-screen text, audience chat, brand observations, music, and clip-worthy moments — all stored as Parquet + SQLite with full provenance back to the source stage and model version.
The included demo corpus is IRL livestreaming because it's a brand-dense, multi-speaker, noisy domain that exercises every stage at once — but the pipeline works on any long-form video that has people on camera: podcasts, talk shows, interviews, lectures, sports broadcasts, news recordings, conference talks. If your input is a long mp4 with humans in it, the pipeline produces structured labels for it.
TL;DR — drop in video.mp4, one run produces:
- ASR transcript (`stage_b_asr.whisper_large_v3_v1.jsonl`) — Whisper-large-v3 segments
- Speaker diarization (`stage_b_diar.pyannote_speaker_diarization_3.1_v1.jsonl`) — pyannote-3.1
- Scene captions / OCR / shoppable detection (`stage_c_vlm.qwen2.5vl_7b_local_v1.jsonl`) — Qwen2.5-VL-7B
- Faces (`stage_face.yunet_2023mar+dlib_face_resnet_v1_128d_v1.jsonl` + `face_embeddings.npz`) — YuNet detections + dlib 128-d
- Brand observations + entity grounding (`stage_h_hot.intel_hot_v1.jsonl`) — Qwen2.5 via local Ollama
- Cold-pass key-moment scoring (`stage_h_cold.intel_cold_v1.jsonl`) — Qwen3-14B
- Auto-discovered clip candidates (`stage_auto_clips.auto_clips_heuristic_v2.jsonl`) + per-platform exports under `packages/`
- Unified `events.parquet` — every detection above in one time-aligned table, joinable from DuckDB / Polars
- `global.db` (SQLite) — corpus catalog: streams, channels, brands, identities, dossier claims + verdicts
Every emission stamps the stage version + model version. Want to swap Whisper-large-v3 for whisper-v3-turbo? Just re-run that one layer; the rest of the corpus is preserved.
See it concretely: `examples/state_of_gpt/` — the full pipeline output from running corpus-mill on Andrej Karpathy's 42-minute "State of GPT" talk (Microsoft Build 2023). 8,202 words, 86 scene captions, 36 auto-discovered clip candidates, 144 platform exports — runtime ~12 minutes on a 5090. The `events.parquet` is included; query it with DuckDB or Polars to see the unified timeline shape.
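For example, a minimal DuckDB pass over that file — the `event_type` column name is an assumption for illustration; the authoritative fields live in `catalog/schema.py`:

```python
# Count events per type in the demo corpus's unified timeline.
# Assumes an `event_type` column — check catalog/schema.py for the real EventRow fields.
import duckdb

duckdb.sql("""
    SELECT event_type, COUNT(*) AS n
    FROM 'examples/state_of_gpt/events.parquet'
    GROUP BY event_type
    ORDER BY n DESC
""").show()
```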
Zero AI inference leaves your hardware. The pipeline never sends your video, audio, transcripts, scene captions, detected faces, brand observations, or chat content to OpenAI, Anthropic, Google, or any other third-party AI service. Every model runs on your GPU:
| Stage | Model | Where it runs |
|---|---|---|
| ASR | Whisper-large-v3 (faster-whisper / ctranslate2) | local GPU |
| Diarization | pyannote/speaker-diarization-3.1 | local GPU |
| Face detection | YuNet (OpenCV, Apache 2.0) | local CPU/GPU |
| Face embeddings | dlib face_recognition_resnet (Boost License) | local CPU |
| Scene captions + OCR + shoppable detection | Qwen2.5-VL-7B (transformers) | local GPU |
| Hot-pass entity extraction, glossary, rerank, placement value | qwen2.5:7b via local Ollama | local GPU |
| Cold-pass key-moment scoring + profile generation | qwen3:14b via local Ollama | local GPU |
| Audio fingerprinting | chromaprint | local CPU |
| PDQ visual fingerprinting | pdqhash | local CPU |
| Cohort discovery | sklearn NMF | local CPU |
| Audience-overlap graph | networkx | local CPU |
This matters because the videos you process may be sensitive. Internal training material, unaired interviews, security-camera footage, medical or legal recordings, NDA-bound podcast guests, private streamer VODs — none of that should be sent to a cloud LLM provider, and with this pipeline none of it ever is. Air-gap your machine after `pip install` and the pipeline still works.
The only network calls the pipeline ever makes are:
- The optional `corpus-mill ingest-url` / channel-watcher path, which uses yt-dlp to pull a public YouTube VOD into your local storage. (Skip this path entirely if you're processing your own files.)
- Two genuinely optional metadata enrichments, both opt-in via env var and off by default:
  - AcoustID for resolving chromaprint audio fingerprints to track + artist names. Only the fingerprint hash is sent — not audio. Disable: leave `CORPUS_MILL_ACOUSTID_API_KEY` empty.
  - Firecrawl for the brand-safety dossier's open-web claim intake. Off unless you set `CORPUS_MILL_FIRECRAWL_URL`. Can be self-hosted to keep this local too.
Everything else — every embedding, every transcript, every brand detection, every clip ranking — happens on the silicon you own.
I (Cahlen) built this in my spare time. I'm an AI researcher — not a TikTok creator, not a clipper, not a sponsorship-agency operator. The end goal of this work, for me, is synthetic data for model training. Long-form video is one of the densest signal sources we have for grounding language and vision models, and the existing public corpora barely scratch what's extractable.
But to actually produce good synthetic data, you have to build a lot of useful intermediate things along the way: a clipping platform, a brand-intelligence surface, an evidence-only adjudication ledger, an identity-persistence layer, a forensic fingerprinting stack. So this repo is all of those things at once. Each piece is also independently useful — you don't need to care about training data to get value from the brand drill-down, or the clip discovery surface, or just the ASR + diarization layer.
Why open-source, why now: I built this for my own work and could keep it private. But honestly — if I do that, it'll just sit on a hard drive and rot while almost nobody benefits from any of it. So I'm publishing it on the chance that someone, somewhere finds a piece they can use:
- An AI researcher who wants real multimodal supervision data without spending months scraping and aligning their own corpus
- A content creator who wants to find their own best moments algorithmically instead of paying clipper teams
- A research lab that wants the brand-significance / cohort / authenticity stack as a starting point for their own analysis
- A small studio that wants on-prem video annotation they fully control, without sending their footage to a cloud API
- Or anyone curious enough to pull just one stage out of the pipeline and use it standalone
If even a few people get real value from any one piece of it, I'd rather it be public than gathering dust in my home directory.
Setting expectations honestly: this is not a polished SaaS or a plug-and-play tool. It's a substantial engineering project with a real footprint, and getting it running on your own videos is real engineering work.
What that means in practice:
- Significant supporting infrastructure required. You need a capable NVIDIA GPU (~16 GB+ VRAM), a configured local Ollama with the right model quantizations pulled, working faster-whisper / pyannote / Qwen2.5-VL stacks, properly sized scratch and cold-archive disks, an HF token with the pyannote terms accepted, ffmpeg on PATH, and a willingness to debug PyTorch / CUDA / cuDNN version mismatches when they happen.
- Bugs are to be expected. This is one person's spare-time project across a meaningful surface area (~30K LOC, ~30 pipeline stages, dozens of models). I run it on my own machine; your hardware, OS, driver versions, and video-content edge cases inevitably differ. Things will break. The pipeline is well instrumented and the failures are usually local to a single stage, but fixing them for your use case is on you.
- Each user's setup is bespoke. If you only care about ASR + diarization, you don't need the VLM stack. If you only care about brand detection, you don't need the chat ingest. If you're processing podcast audio you can skip the keyframe pipeline entirely. Adapting the pipeline to YOUR specific use case is part of the work.
- No support guarantees. I'll fix issues that materially affect the project's correctness when I have the time and the bandwidth, but this isn't a funded product with an on-call rotation. Treat the repo as a starting point you'll customize, not a finished service you'll consume.
If "real engineering work" sounds fine — clone it, read the code, break things, learn from it, build something useful. If you want it working on your videos without doing the engineering yourself, see the setup service section at the bottom of this README.
- AI / ML researchers building multimodal models (video-VLMs, speaker diarization, audience-aware captioning, identity re-identification, brand detection, audio fingerprinting). The pipeline turns long-form video into typed labels with full provenance — exactly the supervision signal that's expensive to build by hand.
- Content creators with their own VOD catalog. If you're a streamer / podcaster / video creator, you don't need to hire an army of clippers to grow your social presence. Run your own VODs through this pipeline, look at the per-stream view, pick the highest-scoring moments yourself, and post them. The same pipeline also tells you which brands actually appear in your content (useful for renewal conversations with your sponsors), and how your placement patterns compare to peer creators in the corpus.
- Brand teams and analysts. If you have a catalog of recordings (podcast guest appearances, conference talks, sponsored streams), the pipeline produces an audit-grade record of every visible brand placement with frame-thumbnail evidence. The cross-stream catalog lets you compare placement density across creators.
- Anyone curious about long-form video at scale. Each individual layer is useful on its own. Want only the ASR + diarization? That's `stage_b_asr.jsonl` + `stage_b_diar.jsonl`, two stages. Want only the brand catalog? That's the brand-intel rollups. The pipeline is end-to-end but the output is decomposable — you can cherry-pick the layers that matter for your use case.
Concrete things you can do with the pipeline output as training input:
- Distill a smaller VLM using the Qwen2.5-VL-7B scene captions + shoppable detections as input/output supervision. Every prompt and completion is captured to `global/training_corpus/llm_calls/` (see the loading sketch after this list).
- Build a video-RLHF preference dataset by joining the auto-clip scoring traces (which clips ranked highly and why) with the reason-trace fields. The ranking function becomes a reward model.
- Train an identity-re-identification model on the face cluster assignments — every face is tagged with a stable cluster ID across the entire corpus, with thumbnails for visual verification.
- Train an engagement-aware captioning model on the chat-aligned transcript windows. Real audience reactions are weak labels for "this segment landed."
- Pre-train a multimodal embedding model on the time-aligned (transcript turn, scene caption, brand bbox, music presence) tuples — exactly the cross-modal supervision pairs HowTo100M / WebVid-style scrapers spend months building, here as a side effect of running the pipeline.
- Bootstrap weak labels for downstream classifiers using the authenticity regime (organic vs promotional), the cohort assignments (NMF), or the streamer-brand affinity matrix.
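As referenced in the first bullet above, a minimal sketch of turning the captured LLM calls into supervised pairs — the field names (`prompt`, `completion`) are assumptions; inspect a real line of the JSONL before wiring this into a fine-tune:

```python
# Collect prompt/completion pairs from the archived LLM calls for SFT / distillation.
# Field names here are hypothetical — confirm them against an actual record first.
import glob
import json

pairs = []
for path in sorted(glob.glob("global/training_corpus/llm_calls/*.jsonl")):
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            pairs.append({"prompt": rec["prompt"], "completion": rec["completion"]})

print(f"{len(pairs)} prompt/completion pairs collected")
```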
There's a lot of valuable signal in long-form video that single-modality tools miss:
- ASR alone gives you words but not who said them, what was on the shelf behind them, or what brand the streamer was wearing.
- Scene caption alone gives you what's visible but loses the temporal alignment to the audio and the reactions in chat.
- Brand-detection alone gives you placement counts but not the context — was the bottle being held up promotionally, or was it just on a desk?
corpus-mill fuses every modality onto one timeline with deterministic event IDs, so cross-modal joins are SQL-trivial: "the chat reactions during the 30s window where Apple was visible and the streamer said 'price'" is one query, not a research project.
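A sketch of that query with DuckDB over `events.parquet` — the column names (`event_type`, `t_start_s`, `text`, `brand`) and event-type values are illustrative assumptions, not the actual schema (see `catalog/schema.py` and `ddl.sql`):

```python
# Chat messages inside 30s windows that contain an Apple brand observation
# and a transcript turn mentioning "price". Column and value names are assumed.
import duckdb

QUERY = """
WITH brand_windows AS (
    SELECT t_start_s AS w_start, t_start_s + 30 AS w_end
    FROM 'events.parquet'
    WHERE event_type = 'brand_observation' AND brand = 'Apple'
)
SELECT c.t_start_s, c.text
FROM 'events.parquet' AS c
JOIN brand_windows AS w
  ON c.t_start_s BETWEEN w.w_start AND w.w_end
WHERE c.event_type = 'chat_message'
  AND EXISTS (
      SELECT 1 FROM 'events.parquet' AS t
      WHERE t.event_type = 'transcript_turn'
        AND t.text ILIKE '%price%'
        AND t.t_start_s BETWEEN w.w_start AND w.w_end
  )
"""
duckdb.sql(QUERY).show()
```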
Every pipeline run produces high-signal training data as a side effect:
- Multimodal supervision pairs. Time-aligned (transcript turn, scene caption, brand bbox, chat reaction) tuples are exactly the kind of weakly-supervised pairs you'd otherwise scrape and align by hand.
- LLM call capture. Every prompt + completion from the entity extraction, glossary, placement-value, and rerank stages is archived to `global/training_corpus/llm_calls/<date>.jsonl` — ready as input for distillation or LoRA fine-tunes of smaller models.
- Cohort-discovery weak labels. NMF cohort assignments, brand significance scores, and authenticity-regime labels are typed signals you can use to bootstrap downstream classifiers without human annotation.
- Forensic fingerprints. PDQ + chromaprint signatures over every packaged clip make derivation tracking automatic — useful both for clip-attribution work and for building deduplicated corpora.
- Frame thumbnails with bbox provenance. Every brand observation saves a padded JPEG crop alongside the source bbox, so downstream consumers can verify model calls or use the crops directly as training images.
The full corpus exhaust — Parquet events, SQLite catalog, JPEG thumbnails, audio fingerprints — is structured for direct ingestion into model training pipelines, dataset builders (HuggingFace datasets, WebDataset, etc.), or as a labeled evaluation set.
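One possible ingestion path (not shipped with the repo): load the Parquet events directly into a Hugging Face `Dataset` and filter to a single modality. The `event_type` column and the `scene_caption` value are assumptions:

```python
# Turn the unified event table into a filtered Hugging Face dataset.
from datasets import Dataset

events = Dataset.from_parquet("examples/state_of_gpt/events.parquet")
# "event_type" / "scene_caption" are assumed names — verify against the real schema.
captions = events.filter(lambda row: row["event_type"] == "scene_caption")
captions.save_to_disk("scene_caption_ds")
print(captions)
```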
- Audio: Whisper-large-v3 word-level transcripts and pyannote speaker diarization, fused into per-turn `transcript_turn` events.
- Vision: keyframes every 5s, Qwen2.5-VL-7B scene captions, OCR of on-screen text overlays, and shoppable-item detection (category + bbox + brand guess + price tier).
- Faces: YuNet detection (Apache 2.0) + dlib 128-d face embeddings (Boost License) + face thumbnails, clustered cross-corpus for identity persistence.
- Chat: yt-dlp live-chat replay (when YouTube exposes it for the source VOD), with per-message timestamps aligned to the transcript timeline.
- LLM intel: schema-constrained entity + money-mention extraction with grounded `event_id` citations and post-validation against source turn text (rejects ungrounded LLM output).
- Clips: heuristic key-moment scoring + ffmpeg cuts + per-platform exports (9:16 face-aware center-crop, 16:9, 1:1).
- Music: chromaprint fingerprinting with optional AcoustID resolution to track + artist + ISRC.
Corpus-level rollups derive brand significance (PMI / Dunning LLR), brand cohorts (NMF on channel × brand), authenticity scores (burstiness + PELT changepoints), audience-overlap graphs (chat-author Jaccard + networkx centrality), counterfactual recommendations, and forensic clip fingerprints (PDQ + TMK + chromaprint with HNSW lookup).
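For intuition on the significance statistic, here is a toy PMI computation over a channel × brand count matrix. This is illustrative only — the repo's rollup (see docs/ml-stack.md) may differ in smoothing and thresholds, and additionally uses Dunning's log-likelihood ratio:

```python
# Pointwise mutual information: log2( P(channel, brand) / (P(channel) * P(brand)) ).
# High PMI = this brand appears in this channel far more often than chance predicts.
import numpy as np

counts = np.array([[12, 0, 3],    # toy channel × brand co-occurrence counts
                   [ 1, 8, 0],
                   [ 2, 1, 9]], dtype=float)

p_joint = counts / counts.sum()
p_channel = p_joint.sum(axis=1, keepdims=True)
p_brand = p_joint.sum(axis=0, keepdims=True)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_joint / (p_channel * p_brand))
pmi[~np.isfinite(pmi)] = 0.0   # never-seen pairs carry no signal here
print(np.round(pmi, 2))
```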
The webapp surfaces all of this as a queryable dashboard at http://127.0.0.1:8000.
- NVIDIA GPU with ≥16 GB VRAM. The pipeline was developed against an RTX 5090 (32 GB). Smaller cards work if you skip the larger Qwen models or run them remotely.
- ~30 GB of working scratch per active stream (local NVMe strongly recommended — the pipeline rsyncs each stream onto fast scratch before processing).
- ~5-15 GB of cold archive per stream (post-pipeline).
- Python 3.11
- `ffmpeg` on PATH
- A locally-running Ollama daemon with `qwen2.5:7b` and `qwen3:14b` pulled
- An `HF_TOKEN` env var with read access to `pyannote/speaker-diarization-3.1` (free, requires accepting model terms on Hugging Face)
- For the visual-language stages: pulled Qwen2.5-VL-7B weights (transformers will fetch on first use)
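A quick preflight you could run before the first pipeline invocation — this script is not part of the repo; it only verifies the prerequisites listed above:

```python
# Check ffmpeg, HF_TOKEN, and that the local Ollama daemon has both models pulled.
import json
import os
import shutil
import urllib.request

assert shutil.which("ffmpeg"), "ffmpeg not found on PATH"
assert os.environ.get("HF_TOKEN"), "HF_TOKEN not set (needed for pyannote)"

with urllib.request.urlopen("http://127.0.0.1:11434/api/tags") as resp:
    pulled = {m["name"] for m in json.load(resp)["models"]}
for required in ("qwen2.5:7b", "qwen3:14b"):
    assert any(name.startswith(required) for name in pulled), f"missing: ollama pull {required}"

print("preflight OK")
```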
A Dockerfile and docker-compose.yml are included. They package the webapp + pipeline + all native dependencies (PyTorch + CUDA + ffmpeg + chromaprint + dlib compiled CPU-only to dodge CUDA-version issues + YuNet weights baked in). Ollama runs on the host — it's a one-line install and avoids rebuilding it inside the container.
Prerequisites:
- NVIDIA driver ≥ 535 + `nvidia-container-toolkit` configured (so Docker can see the GPU)
- Docker Compose v2
- Ollama installed on the host
- An HF_TOKEN with the pyannote/speaker-diarization-3.1 T&Cs accepted on the model page
- Plenty of disk: ~50 GB for model weights on first run, plus 5–15 GB per processed stream
One-time host setup:
```bash
# 1. Install Ollama on the host + pull the two text models the pipeline uses
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:7b   # ~5 GB — hot pass
ollama pull qwen3:14b    # ~9 GB — cold pass

# 2. Make Ollama reachable from the container.
#    By default Ollama binds to 127.0.0.1 only — the container lives on a
#    different network namespace and won't be able to reach it. Tell the
#    systemd unit to bind on all interfaces:
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Verify — should return JSON, not a connection error:
curl -s http://127.0.0.1:11434/api/tags | head -c 200

# 3. Clone the repo + create your .env
git clone https://github.com/cahlen/corpus-mill
cd corpus-mill
cp .env.example .env
# Edit .env — at minimum set HF_TOKEN
```

Heads up: binding Ollama to `0.0.0.0` exposes it on every interface the host has. On a workstation behind a firewall this is fine; on a machine with a public IP, gate the port (e.g. `ufw allow from 172.17.0.0/16 to any port 11434`) so only the Docker bridge can reach it.
Bring it up:
```bash
docker compose up --build
```

First boot pulls the PyTorch base layer, compiles dlib (~5 min), and the first pipeline run downloads ~25 GB of HuggingFace model weights. Subsequent restarts are instant — model weights persist in a named volume.
The webapp is at http://127.0.0.1:8000. Drag-and-drop an mp4 in the upload box and watch the pipeline process it.
Disk planning:
```bash
# Override the default volume locations to point at your big disks.
# Edit .env:
STREAMS_DIR=/mnt/big-disk/corpus-mill/streams   # cold archive
SCRATCH_DIR=/mnt/nvme/corpus-mill-scratch       # active processing
```

Optional: Firecrawl for dossier intake
The dossier system (channel-level claim tracking + adjudication) is OFF
by default. If you want it on, point corpus-mill at a Firecrawl
endpoint by setting these in .env:
```bash
CORPUS_MILL_FIRECRAWL_URL=https://api.firecrawl.dev   # or http://your-self-host:3002
CORPUS_MILL_FIRECRAWL_API_KEY=fc-...                  # required, even for self-host
```

Two ways to get one:
- Hosted (easiest): sign up at https://firecrawl.dev — free tier is enough for low-volume use.
- Self-host: https://github.com/devflowinc/firecrawl-simple is the lightweight build the codebase was tested against. The repo has its own docker-compose; run it on a separate port and point `CORPUS_MILL_FIRECRAWL_URL` at it. Note: the project hasn't been updated in some time — expect to read the source.
Without Firecrawl configured, the dossier intake CLI command will exit early. Everything else in the pipeline runs fine without it.
If you'd rather skip Docker:
```bash
git clone https://github.com/cahlen/corpus-mill
cd corpus-mill
python3.11 -m venv ~/.venvs/corpus-mill
source ~/.venvs/corpus-mill/bin/activate
pip install -e ".[dev]"

# pyannote requires a HuggingFace token — set in your shell rc
export HF_TOKEN=hf_...

# Optional but recommended: point at fast scratch + cold archive disks
export CORPUS_MILL_STREAMS_DIR=/path/to/cold/archive
export CORPUS_MILL_SCRATCH_ROOT=$HOME/.cache/corpus-mill-scratch

# Pull required Ollama models
ollama pull qwen2.5:7b
ollama pull qwen3:14b

# Run on one stream
corpus-mill ingest "/path/to/yt-dlp-output [VIDEO_ID].mp4"
corpus-mill run VIDEO_ID

# Browse results
corpus-mill serve   # http://127.0.0.1:8000
```

For yt-dlp-based capture from a YouTube URL:

```bash
corpus-mill ingest-url "https://www.youtube.com/watch?v=VIDEO_ID"
```

Running the tests:

```bash
pytest -m "not slow"                                            # fast unit tests
ffmpeg -ss 60 -i <video> -t 30 tests/fixtures/sample-30s.mp4    # fixture for slow tests
pytest -m slow                                                  # slow tests (real ML models)
```

Repository layout:

```
src/corpus_mill/
├── stages/ # pipeline stages, one module each
│ ├── prep.py # audio extract + keyframe sample
│ ├── asr.py # Whisper-large-v3
│ ├── diarize.py # pyannote
│ ├── asr_diar.py # combine words + speaker turns
│ ├── face_detect.py # YuNet + dlib face_recognition
│ ├── vlm.py # Qwen-VL scene captions
│ ├── shoppable.py # Qwen-VL product detection
│ ├── chat_ingest.py # yt-dlp live-chat replay
│ ├── intel.py # LLM hot/cold passes
│ ├── auto_clips.py # key-moment scoring + cuts
│ ├── clip_*.py # per-clip enrichments
│ ├── brand_*.py # corpus-level brand rollups
│ └── ...
├── catalog/ # Parquet + SQLite event store
│ ├── schema.py # EventRow + closed EventType enum
│ └── ddl.sql # full SQLite DDL
├── stats/ # ML rollups (NMF cohorts, etc)
├── webapp/ # FastAPI + static HTML/JS dashboard
└── ...
streams/<id>/ # per-stream artifacts (cold archive)
global/global.db # cross-stream catalog
```
See AGENTS.md for the architecture decisions a contributor needs to know before editing the code.
After ingesting streams, run the rollup stages to populate the recommendation and discovery surfaces:
```bash
corpus-mill aggregate-brands        # brand_observations + streamer_brand_affinity
corpus-mill refresh-rollups         # 6 stages: significance, cohorts, authenticity, graph, recommendations, discovery
corpus-mill fingerprint-clips-pdq   # PDQ + chromaprint per packaged clip
corpus-mill build-clip-index        # HNSW index over TMK signatures
```

See docs/ml-stack.md for the math + references and AGENTS.md for the operational details.
`corpus-mill watch-all --loop` polls a list of YouTube channels for new VODs and auto-ingests them. The channel list lives at `global/watch_channels.json` (curate as you like).
The webapp surfaces every layer of the pipeline. Each view below is served by the same FastAPI app (`corpus-mill serve`) reading the same `events.parquet` + `global.db` produced by the pipeline.
The video plays on the left with a face-tracking overlay; on the right are the LLM-scored top moments tabbed by signal type (branded / money / exchange / brand drop) with one-click clip download or report card export. Below the player, the diarized Whisper transcript stays in sync with playback.
Tier-classified placement counts per brand across the whole corpus. Searchable; clicking any brand drills into the per-brand view.
For any brand: top branded clips with frame thumbnails, KPIs (unique placements / streams / estimated CPM), structural-streamers table (who features the brand most), placement-vs-mention correlation. Same view exists for every brand in the corpus.
The corpus auto-tags each brand by tier (luxury / high / mid / mass) based on mention concentration. Cartier sits in the luxury tier with 74% of placements at luxury price-tier shoppables.
Latent cohorts discovered by non-negative matrix factorization on the channel-by-brand occurrence matrix. Each cohort is a soft cluster of brands that co-appear via the same set of streamers. Cohort labels are auto-generated by qwen2.5:7b.
NMF reconstruction predicts what each streamer's brand share should look like given their cohort profile. The gap between predicted and observed is an exploratory signal — high-gap pairs are brands that the cohort model expects to see in this streamer's content but that aren't yet present. Useful as weak labels for downstream analysis and as a cohort-discovery sanity check.
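A toy sketch of that reconstruction-gap idea with sklearn's NMF — random data and hyperparameters chosen purely for illustration, not the repo's implementation:

```python
# Fit NMF on a channel × brand occurrence matrix, then look at where the cohort
# reconstruction expects a brand that the channel doesn't actually feature yet.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
occurrence = rng.poisson(1.5, size=(20, 12)).astype(float)   # channels × brands

model = NMF(n_components=4, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(occurrence)   # channel → cohort weights
H = model.components_                 # cohort → brand profile
predicted = W @ H

gap = predicted - occurrence          # positive = expected by the cohort, absent in practice
channel, brand = np.unravel_index(np.argmax(gap), gap.shape)
print(f"largest gap: channel {channel}, brand {brand}, gap {gap[channel, brand]:.2f}")
```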
Cross-stream brand intelligence with plain-English search and one-click bundling. Suggested bundles are generated by the local Qwen 7B from the fresh catalog state — they auto-group clips by brand mix + tier so you can scan candidate sets fast.
Top-level brand intel — KPIs, sponsorship-decision matrix, recent placements with frame thumbnails, streamer × branded-clip activity table, full pipeline-stage receipt for every stream.
Drag a suspect clip; the pipeline runs PDQ frame fingerprinting + chromaprint audio matching against the corpus index. Output is an evidence receipt with source identification, splice detection, audio-overdub flags, and an honest out-of-corpus fallback.
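For a feel of the per-frame PDQ check (one piece of the full evidence receipt, which also uses TMK and chromaprint), a rough sketch — it assumes the `pdqhash` binding's `compute(rgb_array) -> (bits, quality)` signature and a commonly cited match threshold of ≤31 differing bits out of 256:

```python
# Compare two frames by PDQ hamming distance. API shape and threshold are assumptions.
import cv2
import numpy as np
import pdqhash

def pdq_bits(path: str) -> np.ndarray:
    frame = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    bits, _quality = pdqhash.compute(frame)
    return np.asarray(bits, dtype=np.uint8)

suspect = pdq_bits("suspect_frame.jpg")
corpus = pdq_bits("corpus_frame.jpg")
distance = int(np.count_nonzero(suspect != corpus))   # hamming distance over 256 bits
print("likely same source" if distance <= 31 else f"no match (distance={distance})")
```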
Inline explainer of the pipeline stages, hardware footprint, models, and storage layout. Useful as a self-served doc page when sharing the webapp with collaborators.
Most cards on every page open into modals with the full underlying record — useful both for spot-checking model output and for showing how every signal traces back to the source frame.
Click any clip on the dashboard, brand drill-down, or per-stream view to open the clip modal: video player, virality scores (corpus-relative + LLM virality + brand count), the auto-generated glossary blurb, per-platform breakdown (TikTok / Instagram / Shorts / Twitter — angle, posting strategy, cut, hashtags, fit score), brand placements with confidences, speaker share, transcript excerpt, and direct download links to every export artifact (.zip bundle, .mp4 per platform, .vtt/.srt/.txt transcript, report.json/md).
Click any campaign row to see the full bundle — every clip member with thumbnails, status workflow (draft → approved → active → delivered → closed), and the export-bundle button that produces a single archive of all the clip artifacts together.
Click any row of the brand-audience matrix on the intelligence page to see the full per-pocket analysis: priority score, organic-fit metrics, the streamer's existing brand mix (showing what brands they already feature, sized by occurrence count), and every clip from that streamer that already contains the queried brand (with frame thumbnails).
Per-streamer claim ledger. Claims about the streamer are adjudicated only against processed footage — every verdict cites specific clip evidence; anything outside what the cameras saw gets `not_adjudicable`. Verdicts are append-only — overrides write a new row pointing back at the prior verdict, so the lineage stays intact.
Critically: claims are not human-submittable. The intake pipeline scrapes the open web (via optional self-hosted Firecrawl), extracts specific allegations with provenance (source URL + date + quoted passage), and routes each one through the adjudicator. Every persisted claim therefore traces back to a public, citable source — not to anyone's say-so.
This repo is published as-is for anyone who wants to clone it and make it work for themselves. I'll fix bugs and accept contributions when they make sense, but this isn't a funded SaaS — please don't expect a hosted offering or aggressive feature roadmap from this repository.
That said: the hardware and software footprint is non-trivial (NVIDIA GPU with ≥16 GB VRAM, ~5 TB of usable storage for any real corpus, properly configured local Ollama with the right model quantizations, working faster-whisper / pyannote / YuNet+dlib / Qwen2.5-VL stack, sensible scratch-vs-cold-archive disk layout, optional bgutil POT provider for yt-dlp) and getting it all working end-to-end on YOUR videos is real engineering work.
If you'd like me to come do that for you in person:
I offer a setup service where I fly to your location, procure the right hardware, install everything, point the pipeline at your existing video catalog, validate the output is what you need, and hand you a running system. Same code, no proprietary fork — you get this exact open-source repo running on metal you own.
I can also tailor the pipeline to your specific use case if it's properly scoped and budgeted — adding a new stage, swapping in different models, integrating with your existing data pipeline, fine-tuning the brand-detection allowlist for your domain, building a custom report layout, or anything else that's a meaningful extension of what's already here.
If any of that is interesting, email cahlen@gmail.com with subject MULTIMODAL STREAM REQUEST and we can have a conversation.
The corpus-mill pipeline code is licensed under the Apache License 2.0 — permissive, commercial use OK, patent grant included.
This pipeline orchestrates models that each carry their own license. Apache 2.0 covers the code in this repo only. You are responsible for complying with the terms of every model you load.
| Model | License | Commercial use |
|---|---|---|
| Whisper-large-v3 (via faster-whisper) | MIT | ✓ |
| pyannote/speaker-diarization-3.1 | MIT (gated; accept T&Cs on HF) | ✓ |
| Qwen2.5-VL-7B-Instruct | Tongyi Qianwen License | ✓ for most users; review clauses re: >100M MAU and output attribution |
| Qwen2.5:7b · Qwen3:14b | Apache 2.0 | ✓ |
| YuNet (face detection) | Apache 2.0 | ✓ |
| dlib face_recognition_resnet (face embeddings) | Boost Software License | ✓ |
| chromaprint | LGPL | ✓ (linking only) |
| pdqhash | BSD | ✓ |
Every model the default pipeline loads is permissively licensed for commercial use. If you're running against your own videos for personal, research, academic, or commercial use, you're covered end-to-end by the table above.