corpus-mill

A multimodal video annotation pipeline that runs entirely on local GPU hardware. One CLI run takes a long-form video and produces a time-aligned event corpus across audio, vision, on-screen text, audience chat, brand observations, music, and clip-worthy moments — all stored as Parquet + SQLite with full provenance back to the source stage and model version.

The included demo corpus is IRL livestreaming because it's a brand-dense, multi-speaker, noisy domain that exercises every stage at once — but the pipeline works on any long-form video that has people on camera: podcasts, talk shows, interviews, lectures, sports broadcasts, news recordings, conference talks. If your input is a long mp4 with humans in it, the pipeline produces structured labels for it.

TL;DR — drop in video.mp4, one run produces:

  • ASR transcript (stage_b_asr.whisper_large_v3_v1.jsonl) — Whisper-large-v3 segments
  • Speaker diarization (stage_b_diar.pyannote_speaker_diarization_3.1_v1.jsonl) — pyannote-3.1
  • Scene captions / OCR / shoppable detection (stage_c_vlm.qwen2.5vl_7b_local_v1.jsonl) — Qwen2.5-VL-7B
  • Faces (stage_face.yunet_2023mar+dlib_face_resnet_v1_128d_v1.jsonl + face_embeddings.npz) — YuNet detections + dlib 128-d
  • Brand observations + entity grounding (stage_h_hot.intel_hot_v1.jsonl) — Qwen2.5 via local Ollama
  • Cold-pass key-moment scoring (stage_h_cold.intel_cold_v1.jsonl) — Qwen3-14B
  • Auto-discovered clip candidates (stage_auto_clips.auto_clips_heuristic_v2.jsonl) + per-platform exports under packages/
  • Unified events.parquet — every detection above in one time-aligned table, joinable from DuckDB / Polars
  • global.db (SQLite) — corpus catalog: streams, channels, brands, identities, dossier claims + verdicts

Every emission stamps the stage version + model version. Want to swap Whisper-large-v3 for whisper-v3-turbo? Just re-run that one layer; the rest of the corpus is preserved.
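
The stage / model / version encoding is recoverable mechanically from the output filenames. A minimal sketch, assuming the `<stage>.<model>_v<N>.jsonl` pattern visible in the filenames listed above:

```python
import re

def parse_stage_filename(name: str) -> dict:
    """Split a stage output filename into stage, model tag, and version.

    Pattern inferred from the TL;DR list above
    (e.g. stage_b_asr.whisper_large_v3_v1.jsonl) — not an official API.
    """
    m = re.fullmatch(r"(?P<stage>[^.]+)\.(?P<model>.+)_v(?P<version>\d+)\.jsonl", name)
    if m is None:
        raise ValueError(f"unrecognized stage filename: {name}")
    return m.groupdict()

info = parse_stage_filename("stage_b_asr.whisper_large_v3_v1.jsonl")
# info == {"stage": "stage_b_asr", "model": "whisper_large_v3", "version": "1"}
```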

See it concretely: examples/state_of_gpt/ — the full pipeline output from running corpus-mill on Andrej Karpathy's 42-minute "State of GPT" talk (Microsoft Build 2023). 8,202 words, 86 scene captions, 36 auto-discovered clip candidates, 144 platform exports — runtime ~12 minutes on a 5090. The events.parquet is included; query it with DuckDB or Polars to see the unified timeline shape.

Dashboard — corpus of processed streams + live job tracker

Entirely on-prem. Entirely offline.

Zero AI inference leaves your hardware. The pipeline never sends your video, audio, transcripts, scene captions, detected faces, brand observations, or chat content to OpenAI, Anthropic, Google, or any other third-party AI service. Every model runs on your GPU:

Stage — model — where it runs:

  • ASR — Whisper-large-v3 (faster-whisper / ctranslate2) — local GPU
  • Diarization — pyannote/speaker-diarization-3.1 — local GPU
  • Face detection — YuNet (OpenCV, Apache 2.0) — local CPU/GPU
  • Face embeddings — dlib face_recognition_resnet (Boost License) — local CPU
  • Scene captions + OCR + shoppable detection — Qwen2.5-VL-7B (transformers) — local GPU
  • Hot-pass entity extraction, glossary, rerank, placement value — qwen2.5:7b via local Ollama — local GPU
  • Cold-pass key-moment scoring + profile generation — qwen3:14b via local Ollama — local GPU
  • Audio fingerprinting — chromaprint — local CPU
  • PDQ visual fingerprinting — pdqhash — local CPU
  • Cohort discovery — sklearn NMF — local CPU
  • Audience-overlap graph — networkx — local CPU

This matters because the videos you process may be sensitive. Internal training material, unaired interviews, security-camera footage, medical or legal recordings, NDA-bound podcast guests, private streamer VODs — none of that should be sent to a cloud LLM provider, and with this pipeline none of it ever is. Air-gap your machine after pip install and the pipeline still works.

The only network calls the pipeline ever makes are:

  1. The optional corpus-mill ingest-url / channel-watcher path, which uses yt-dlp to pull a public YouTube VOD into your local storage. (Skip this path entirely if you're processing your own files.)
  2. Two genuinely optional metadata enrichments, both opt-in via env var and off by default:
    • AcoustID for resolving chromaprint audio fingerprints to track + artist names. Only the fingerprint hash is sent — not audio. Disable: leave CORPUS_MILL_ACOUSTID_API_KEY empty.
    • Firecrawl for the brand-safety dossier's open-web claim intake. Off unless you set CORPUS_MILL_FIRECRAWL_URL. Can be self-hosted to keep this local too.

Everything else — every embedding, every transcript, every brand detection, every clip ranking — happens on the silicon you own.

About this project — and why I'm publishing it

I (Cahlen) built this in my spare time. I'm an AI researcher — not a TikTok creator, not a clipper, not a sponsorship-agency operator. The end goal of this work, for me, is synthetic data for model training. Long-form video is one of the densest signal sources we have for grounding language and vision models, and the existing public corpora barely scratch what's extractable.

But to actually produce good synthetic data, you have to build a lot of useful intermediate things along the way: a clipping platform, a brand-intelligence surface, an evidence-only adjudication ledger, an identity-persistence layer, a forensic fingerprinting stack. So this repo is all of those things at once. Each piece is also independently useful — you don't need to care about training data to get value from the brand drill-down, or the clip discovery surface, or just the ASR + diarization layer.

Why open-source, why now: I built this for my own work and could keep it private. But honestly — if I do that, it'll just sit on a hard drive and rot while almost nobody benefits from any of it. So I'm publishing it on the chance that someone, somewhere finds a piece they can use:

  • An AI researcher who wants real multimodal supervision data without spending months scraping and aligning their own corpus
  • A content creator who wants to find their own best moments algorithmically instead of paying clipper teams
  • A research lab that wants the brand-significance / cohort / authenticity stack as a starting point for their own analysis
  • A small studio that wants on-prem video annotation they fully control, without sending their footage to a cloud API
  • Or anyone curious enough to pull just one stage out of the pipeline and use it standalone

If even a few people get real value from any one piece of it, I'd rather it be public than gathering dust in my home directory.

Heads up: this is not an out-of-the-box product

Setting expectations honestly: this is not a polished SaaS or a plug-and-play tool. It's a substantial engineering project with a real footprint, and getting it running on your own videos is real engineering work.

What that means in practice:

  • Significant supporting infrastructure required. You need a capable NVIDIA GPU (~16 GB+ VRAM), a configured local Ollama with the right model quantizations pulled, working faster-whisper / pyannote / Qwen2.5-VL stacks, properly sized scratch and cold-archive disks, an HF token with the pyannote terms accepted, ffmpeg on PATH, and a willingness to debug PyTorch / CUDA / cuDNN version mismatches when they happen.
  • Bugs are to be expected. This is one person's spare-time project across a meaningful surface area (~30K LOC, ~30 pipeline stages, dozens of models). I run it on my own machine; your hardware, OS, driver versions, and video-content edge cases inevitably differ. Things will break. The pipeline is well instrumented and the failures are usually local to a single stage, but fixing them for your use case is on you.
  • Each user's setup is bespoke. If you only care about ASR + diarization, you don't need the VLM stack. If you only care about brand detection, you don't need the chat ingest. If you're processing podcast audio you can skip the keyframe pipeline entirely. Adapting the pipeline to YOUR specific use case is part of the work.
  • No support guarantees. I'll fix issues that materially affect the project's correctness when I have the time and the bandwidth, but this isn't a funded product with an on-call rotation. Treat the repo as a starting point you'll customize, not a finished service you'll consume.

If "real engineering work" sounds fine — clone it, read the code, break things, learn from it, build something useful. If you want it working on your videos without doing the engineering yourself, see the setup service section at the bottom of this README.

Who this benefits

  • AI / ML researchers building multimodal models (video-VLMs, speaker diarization, audience-aware captioning, identity re-identification, brand detection, audio fingerprinting). The pipeline turns long-form video into typed labels with full provenance — exactly the supervision signal that's expensive to build by hand.

  • Content creators with their own VOD catalog. If you're a streamer / podcaster / video creator, you don't need to hire an army of clippers to grow your social presence. Run your own VODs through this pipeline, look at the per-stream view, pick the highest-scoring moments yourself, and post them. The same pipeline also tells you which brands actually appear in your content (useful for renewal conversations with your sponsors), and how your placement patterns compare to peer creators in the corpus.

  • Brand teams and analysts. If you have a catalog of recordings (podcast guest appearances, conference talks, sponsored streams), the pipeline produces an audit-grade record of every visible brand placement with frame-thumbnail evidence. The cross-stream catalog lets you compare placement density across creators.

  • Anyone curious about long-form video at scale. Each individual layer is useful on its own. Want only the ASR + diarization? That's stage_b_asr.jsonl + stage_b_diar.jsonl, two stages. Want only the brand catalog? That's the brand-intel rollups. The pipeline is end-to-end but the output is decomposable — you can cherry-pick the layers that matter for your use case.

Synthetic-data examples

Concrete things you can do with the pipeline output as training input:

  • Distill a smaller VLM using the Qwen2.5-VL-7B scene captions + shoppable detections as input/output supervision. Every prompt and completion is captured to global/training_corpus/llm_calls/.
  • Build a video-RLHF preference dataset by joining the auto-clip scoring traces (which clips ranked highly and why) with the reason-trace fields. The ranking function becomes a reward model.
  • Train an identity-re-identification model on the face cluster assignments — every face is tagged with a stable cluster ID across the entire corpus, with thumbnails for visual verification.
  • Train an engagement-aware captioning model on the chat-aligned transcript windows. Real audience reactions are weak labels for "this segment landed."
  • Pre-train a multimodal embedding model on the time-aligned (transcript turn, scene caption, brand bbox, music presence) tuples — exactly the cross-modal supervision pairs HowTo100M / WebVid-style scrapers spend months building, here as a side effect of running the pipeline.
  • Bootstrap weak labels for downstream classifiers using the authenticity regime (organic vs promotional), the cohort assignments (NMF), or the streamer-brand affinity matrix.

Why this exists

There's a lot of valuable signal in long-form video that single-modality tools miss:

  • ASR alone gives you words but not who said them, what was on the shelf behind them, or what brand the streamer was wearing.
  • Scene caption alone gives you what's visible but loses the temporal alignment to the audio and the reactions in chat.
  • Brand-detection alone gives you placement counts but not the context — was the bottle being held up promotionally, or was it just on a desk?

corpus-mill fuses every modality onto one timeline with deterministic event IDs, so cross-modal joins are SQL-trivial: "the chat reactions during the 30s window where Apple was visible and the streamer said 'price'" is one query, not a research project.
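
That query shape can be sketched against a toy in-memory SQLite table (the real corpus lives in events.parquet; the column names here are illustrative stand-ins, not the pipeline's actual schema):

```python
import sqlite3

# Toy events table shaped like a unified multimodal timeline.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (event_type TEXT, t_start REAL, t_end REAL, payload TEXT)")
con.executemany("INSERT INTO events VALUES (?,?,?,?)", [
    ("brand_observation", 100.0, 130.0, "Apple"),                         # 30s brand window
    ("transcript_turn",   105.0, 112.0, "what do you think of the price"),
    ("chat_message",      108.0, 108.0, "no way it costs that much"),
    ("chat_message",      300.0, 300.0, "lol"),                           # outside the window
])

# "Chat reactions during the window where Apple was visible and the
# streamer said 'price'" — one self-join over the shared timeline.
rows = con.execute("""
    SELECT c.payload
    FROM events b
    JOIN events s ON s.event_type = 'transcript_turn'
                 AND s.t_start BETWEEN b.t_start AND b.t_end
                 AND s.payload LIKE '%price%'
    JOIN events c ON c.event_type = 'chat_message'
                 AND c.t_start BETWEEN b.t_start AND b.t_end
    WHERE b.event_type = 'brand_observation' AND b.payload = 'Apple'
""").fetchall()
# rows == [('no way it costs that much',)]
```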

Use as a training-data factory

Every pipeline run produces high-signal training data as a side effect:

  • Multimodal supervision pairs. Time-aligned (transcript turn, scene caption, brand bbox, chat reaction) tuples are exactly the kind of weakly-supervised pairs you'd otherwise scrape and align by hand.
  • LLM call capture. Every prompt + completion from the entity extraction, glossary, placement-value, and rerank stages is archived to global/training_corpus/llm_calls/<date>.jsonl — ready as input for distillation or LoRA fine-tunes of smaller models.
  • Cohort-discovery weak labels. NMF cohort assignments, brand significance scores, and authenticity-regime labels are typed signals you can use to bootstrap downstream classifiers without human annotation.
  • Forensic fingerprints. PDQ + chromaprint signatures over every packaged clip make derivation tracking automatic — useful both for clip-attribution work and for building deduplicated corpora.
  • Frame thumbnails with bbox provenance. Every brand observation saves a padded JPEG crop alongside the source bbox, so downstream consumers can verify model calls or use the crops directly as training images.

The full corpus exhaust — Parquet events, SQLite catalog, JPEG thumbnails, audio fingerprints — is structured for direct ingestion into model training pipelines, dataset builders (HuggingFace datasets, WebDataset, etc.), or as a labeled evaluation set.
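
As one concrete example, turning the captured LLM calls into distillation pairs is a few lines of stdlib Python. A sketch — the real files live under global/training_corpus/llm_calls/, but the "prompt"/"completion" field names here are assumptions about that JSONL format, so check one real line first:

```python
import json, pathlib, tempfile

def load_pairs(jsonl_path):
    """Read one llm_calls JSONL file into input/output training pairs.

    Assumes each record has "prompt" and "completion" keys — verify against
    a real capture file before relying on this.
    """
    pairs = []
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            pairs.append({"input": rec["prompt"], "output": rec["completion"]})
    return pairs

# Demo against a synthetic file standing in for llm_calls/<date>.jsonl:
tmp = pathlib.Path(tempfile.mkdtemp()) / "2024-01-01.jsonl"
tmp.write_text(json.dumps({"prompt": "Extract entities: ...",
                           "completion": '{"entities": []}'}) + "\n")
pairs = load_pairs(tmp)
```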

What the pipeline produces per video

  • Audio: Whisper-large-v3 word-level transcripts and pyannote speaker diarization, fused into per-turn transcript_turn events.
  • Vision: keyframes every 5s, Qwen2.5-VL-7B scene captions, OCR of on-screen text overlays, and shoppable-item detection (category + bbox + brand guess + price tier).
  • Faces: YuNet detection (Apache 2.0) + dlib 128-d face embeddings (Boost License) + face thumbnails, clustered cross-corpus for identity persistence.
  • Chat: yt-dlp live-chat replay (when YouTube exposes it for the source VOD), with per-message timestamps aligned to the transcript timeline.
  • LLM intel: schema-constrained entity + money-mention extraction with grounded event_id citations and post-validation against source turn text (rejects ungrounded LLM output).
  • Clips: heuristic key-moment scoring + ffmpeg cuts + per-platform exports (9:16 face-aware center-crop, 16:9, 1:1).
  • Music: chromaprint fingerprinting with optional AcoustID resolution to track + artist + ISRC.
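
The audio fusion in the first bullet can be sketched as a max-overlap assignment of ASR segments to diarized speaker turns (field names are illustrative, not the pipeline's schema):

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """Tag each transcript segment with the speaker whose turns overlap it most."""
    fused = []
    for seg in segments:
        scores = [(overlap(seg["start"], seg["end"], t["start"], t["end"]), t["speaker"])
                  for t in turns]
        ov, speaker = max(scores, default=(0.0, None))
        fused.append({**seg, "speaker": speaker if ov > 0 else "UNKNOWN"})
    return fused

segments = [{"start": 0.0, "end": 4.2, "text": "welcome back"},
            {"start": 4.5, "end": 9.0, "text": "thanks for having me"}]
turns = [{"start": 0.0, "end": 4.3, "speaker": "SPEAKER_00"},
         {"start": 4.3, "end": 9.5, "speaker": "SPEAKER_01"}]
fused = assign_speakers(segments, turns)
# fused[0]["speaker"] == "SPEAKER_00", fused[1]["speaker"] == "SPEAKER_01"
```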

Corpus-level rollups derive brand significance (PMI / Dunning LLR), brand cohorts (NMF on channel × brand), authenticity scores (burstiness + PELT changepoints), audience-overlap graphs (chat-author Jaccard + networkx centrality), counterfactual recommendations, and forensic clip fingerprints (PDQ + TMK + chromaprint with HNSW lookup).
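
For intuition on the significance rollup, the PMI half is a one-liner over co-occurrence counts. Toy numbers below; the pipeline's exact count definitions may differ, and it pairs PMI with Dunning's log-likelihood ratio, which is omitted here for brevity:

```python
import math

def pmi(n_channel_brand, n_channel, n_brand, n_total):
    """Pointwise mutual information between a channel and a brand.

    Positive when the brand appears in the channel more often than
    independence would predict.
    """
    p_joint = n_channel_brand / n_total
    p_channel = n_channel / n_total
    p_brand = n_brand / n_total
    return math.log2(p_joint / (p_channel * p_brand))

# Independence predicts 200 * 100 / 10_000 = 2 co-occurrences;
# we observed 40, so PMI = log2(20) ≈ 4.32 — a strongly over-represented brand.
score = pmi(n_channel_brand=40, n_channel=200, n_brand=100, n_total=10_000)
```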

The webapp surfaces all of this as a queryable dashboard at http://127.0.0.1:8000.

Hardware requirements

  • NVIDIA GPU with ≥16 GB VRAM. The pipeline was developed against an RTX 5090 (32 GB). Smaller cards work if you skip the larger Qwen models or run them remotely.
  • ~30 GB of working scratch per active stream (local NVMe strongly recommended — the pipeline rsyncs each stream onto fast scratch before processing).
  • ~5-15 GB of cold archive per stream (post-pipeline).

Software requirements

  • Python 3.11
  • ffmpeg on PATH
  • A locally-running Ollama daemon with qwen2.5:7b and qwen3:14b pulled
  • An HF_TOKEN env var with read access to pyannote/speaker-diarization-3.1 (free, requires accepting model terms on Hugging Face)
  • For the visual-language stages: pulled Qwen2.5-VL-7B weights (transformers will fetch on first use)

Quickstart — Docker (recommended)

A Dockerfile and docker-compose.yml are included. They package the webapp + pipeline + all native dependencies (PyTorch + CUDA + ffmpeg + chromaprint + dlib compiled CPU-only to dodge CUDA-version issues + YuNet weights baked in). Ollama runs on the host — it's a one-line install and avoids rebuilding it inside the container.

One-time host setup:

# 1. Install Ollama on the host + pull the two text models the pipeline uses
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:7b      # ~5 GB — hot pass
ollama pull qwen3:14b       # ~9 GB — cold pass

# 2. Make Ollama reachable from the container.
#    By default Ollama binds to 127.0.0.1 only — the container lives on a
#    different network namespace and won't be able to reach it. Tell the
#    systemd unit to bind on all interfaces:
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Verify — should return JSON, not a connection error:
curl -s http://127.0.0.1:11434/api/tags | head -c 200

# 3. Clone the repo + create your .env
git clone https://github.com/cahlen/corpus-mill
cd corpus-mill
cp .env.example .env
# Edit .env — at minimum set HF_TOKEN

Heads up: binding Ollama to 0.0.0.0 exposes it on every interface the host has. On a workstation behind a firewall this is fine; on a machine with a public IP, gate the port (e.g. ufw allow from 172.17.0.0/16 to any port 11434) so only the Docker bridge can reach it.

Bring it up:

docker compose up --build

First boot pulls the PyTorch base layer, compiles dlib (~5 min), and the first pipeline run downloads ~25 GB of HuggingFace model weights. Subsequent restarts are instant — model weights persist in a named volume.

The webapp is at http://127.0.0.1:8000. Drag-and-drop an mp4 in the upload box and watch the pipeline process it.

Disk planning:

# Override the default volume locations to point at your big disks.
# Edit .env:
STREAMS_DIR=/mnt/big-disk/corpus-mill/streams        # cold archive
SCRATCH_DIR=/mnt/nvme/corpus-mill-scratch            # active processing

Optional: Firecrawl for dossier intake

The dossier system (channel-level claim tracking + adjudication) is OFF by default. If you want it on, point corpus-mill at a Firecrawl endpoint by setting these in .env:

CORPUS_MILL_FIRECRAWL_URL=https://api.firecrawl.dev      # or http://your-self-host:3002
CORPUS_MILL_FIRECRAWL_API_KEY=fc-...                     # required, even for self-host

Two ways to get one:

  • Hosted (easiest): sign up at https://firecrawl.dev — free tier is enough for low-volume use.
  • Self-host: https://github.com/devflowinc/firecrawl-simple is the lightweight build the codebase was tested against. The repo has its own docker-compose; run it on a separate port and point CORPUS_MILL_FIRECRAWL_URL at it. Note: the project hasn't been updated in some time — expect to read the source.

Without Firecrawl configured, the dossier intake CLI command will exit early. Everything else in the pipeline runs fine without it.

Quickstart — bare metal

If you'd rather skip Docker:

git clone https://github.com/cahlen/corpus-mill
cd corpus-mill
python3.11 -m venv ~/.venvs/corpus-mill
source ~/.venvs/corpus-mill/bin/activate
pip install -e ".[dev]"

# pyannote requires a HuggingFace token — set in your shell rc
export HF_TOKEN=hf_...

# Optional but recommended: point at fast scratch + cold archive disks
export CORPUS_MILL_STREAMS_DIR=/path/to/cold/archive
export CORPUS_MILL_SCRATCH_ROOT=$HOME/.cache/corpus-mill-scratch

# Pull required Ollama models
ollama pull qwen2.5:7b
ollama pull qwen3:14b

# Run on one stream
corpus-mill ingest "/path/to/yt-dlp-output [VIDEO_ID].mp4"
corpus-mill run VIDEO_ID

# Browse results
corpus-mill serve   # http://127.0.0.1:8000

For yt-dlp-based capture from a YouTube URL:

corpus-mill ingest-url "https://www.youtube.com/watch?v=VIDEO_ID"

Tests

pytest -m "not slow"                           # fast unit tests
ffmpeg -ss 60 -i <video> -t 30 tests/fixtures/sample-30s.mp4   # fixture for slow tests
pytest -m slow                                 # slow tests (real ML models)

Layout

src/corpus_mill/
├── stages/                 # pipeline stages, one module each
│   ├── prep.py             # audio extract + keyframe sample
│   ├── asr.py              # Whisper-large-v3
│   ├── diarize.py          # pyannote
│   ├── asr_diar.py         # combine words + speaker turns
│   ├── face_detect.py      # YuNet + dlib face_recognition
│   ├── vlm.py              # Qwen-VL scene captions
│   ├── shoppable.py        # Qwen-VL product detection
│   ├── chat_ingest.py      # yt-dlp live-chat replay
│   ├── intel.py            # LLM hot/cold passes
│   ├── auto_clips.py       # key-moment scoring + cuts
│   ├── clip_*.py           # per-clip enrichments
│   ├── brand_*.py          # corpus-level brand rollups
│   └── ...
├── catalog/                # Parquet + SQLite event store
│   ├── schema.py           # EventRow + closed EventType enum
│   └── ddl.sql             # full SQLite DDL
├── stats/                  # ML rollups (NMF cohorts, etc)
├── webapp/                 # FastAPI + static HTML/JS dashboard
└── ...

streams/<id>/               # per-stream artifacts (cold archive)
global/global.db            # cross-stream catalog

See AGENTS.md for the architecture decisions a contributor needs to know before editing the code.

Statistical-ML rollups

After ingesting streams, run the rollup stages to populate the recommendation and discovery surfaces:

corpus-mill aggregate-brands         # brand_observations + streamer_brand_affinity
corpus-mill refresh-rollups          # 6 stages: significance, cohorts, authenticity, graph, recommendations, discovery
corpus-mill fingerprint-clips-pdq    # PDQ + chromaprint per packaged clip
corpus-mill build-clip-index         # HNSW index over TMK signatures

See docs/ml-stack.md for the math + references and AGENTS.md for the operational details.

Channel watcher

corpus-mill watch-all --loop polls a list of YouTube channels for new VODs and auto-ingests them. Channel list lives at global/watch_channels.json (curate as you like).

Screenshots

The webapp surfaces every layer of the pipeline. Each view below is served by the same FastAPI app (corpus-mill serve) reading the same events.parquet + global.db produced by the pipeline.

Per-stream view — multimodal events on one timeline

The video plays on the left with a face-tracking overlay; on the right are the LLM-scored top moments tabbed by signal type (branded / money / exchange / brand drop) with one-click clip download or report card export. Below the player, the diarized Whisper transcript stays in sync with playback.

Per-stream view

Brand catalog — every detected brand across every stream

Tier-classified placement counts per brand across the whole corpus. Searchable; clicking any brand drills into the per-brand view.

Brand catalog

Brand drill-down — Apple

For any brand: top branded clips with frame thumbnails, KPIs (unique placements / streams / estimated CPM), structural-streamers table (who features the brand most), placement-vs-mention correlation. Same view exists for every brand in the corpus.

Apple brand detail

Brand drill-down — Cartier (luxury tier)

The corpus auto-tags each brand by tier (luxury / high / mid / mass) based on mention concentration. Cartier sits in the luxury tier with 74% of placements at luxury price-tier shoppables.

Cartier brand detail

Brand cohorts — NMF on (channel × brand)

Latent cohorts discovered by non-negative matrix factorization on the channel-by-brand occurrence matrix. Each cohort is a soft cluster of brands that co-appear via the same set of streamers. Cohort labels are auto-generated by qwen2.5:7b.
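
The factorization itself is compact. A minimal sketch on a toy channel × brand matrix — the pipeline uses sklearn's NMF, but the bare multiplicative-update rule (Lee & Seung, Frobenius objective) shown here has the same shape and needs only numpy:

```python
import numpy as np

# Toy channel × brand occurrence counts: channels 0-1 favor brands 0-1,
# channels 2-3 favor brands 2-4.
V = np.array([
    [5., 3., 0., 0., 1.],
    [4., 4., 1., 0., 0.],
    [0., 0., 6., 5., 2.],
    [0., 1., 5., 6., 3.],
])

k = 2                                     # number of latent cohorts
rng = np.random.default_rng(0)
W = rng.random((V.shape[0], k)) + 0.1     # channel → cohort loadings
H = rng.random((k, V.shape[1])) + 0.1     # cohort → brand profile

for _ in range(500):                      # multiplicative updates: V ≈ W @ H
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

cohorts = W.argmax(axis=1)                # hard cohort assignment per channel
# Channels 0-1 land in one cohort, channels 2-3 in the other.
```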

Brand cohorts

Counterfactual brand recommendations

NMF reconstruction predicts what each streamer's brand share should look like given their cohort profile. The gap between predicted and observed is an exploratory signal — high-gap pairs are brands that the cohort model expects to see in this streamer's content but that aren't yet present. Useful as weak labels for downstream analysis and as a cohort-discovery sanity check.

Brand recommendations

Clip discovery + campaign assembly

Cross-stream brand intelligence with plain-English search and one-click bundling. Suggested bundles are generated by the local Qwen 7B from the fresh catalog state — they auto-group clips by brand mix + tier so you can scan candidate sets fast.

Clip discovery + campaign assembly

Sponsorship intelligence terminal

Top-level brand intel — KPIs, sponsorship-decision matrix, recent placements with frame thumbnails, streamer × branded-clip activity table, full pipeline-stage receipt for every stream.

Intelligence

Clip authenticity forensics

Drag a suspect clip; the pipeline runs PDQ frame fingerprinting + chromaprint audio matching against the corpus index. Output is an evidence receipt with source identification, splice detection, audio-overdub flags, and an honest out-of-corpus fallback.

Forensics

Architecture page

Inline explainer of the pipeline stages, hardware footprint, models, and storage layout. Useful as a self-served doc page when sharing the webapp with collaborators.

Architecture

Drill-down modals

Most cards on every page open into modals with the full underlying record — useful both for spot-checking model output and for showing how every signal traces back to the source frame.

Clip detail

Click any clip on the dashboard, brand drill-down, or per-stream view to open the clip modal: video player, virality scores (corpus-relative + LLM virality + brand count), the auto-generated glossary blurb, per-platform breakdown (TikTok / Instagram / Shorts / Twitter — angle, posting strategy, cut, hashtags, fit score), brand placements with confidences, speaker share, transcript excerpt, and direct download links to every export artifact (.zip bundle, .mp4 per platform, .vtt/.srt/.txt transcript, report.json/md).

Clip modal

Campaign / bundle detail

Click any campaign row to see the full bundle — every clip member with thumbnails, status workflow (draft → approved → active → delivered → closed), and the export-bundle button that produces a single archive of all the clip artifacts together.

Campaign modal

Brand × audience pocket detail

Click any row of the brand-audience matrix on the intelligence page to see the full per-pocket analysis: priority score, organic-fit metrics, the streamer's existing brand mix (showing what brands they already feature, sized by occurrence count), and every clip from that streamer that already contains the queried brand (with frame thumbnails).

Brand × audience pocket modal

Brand-safety dossier

Per-streamer claim ledger. Claims about the streamer are adjudicated only against processed footage — every verdict cites specific clip evidence; anything outside what the cameras saw gets not_adjudicable. Verdicts are append-only — an override writes a new row pointing back at the prior verdict, so the lineage stays intact.

Critically: claims are not human-submittable. The intake pipeline scrapes the open web (via optional self-hosted Firecrawl), extracts specific allegations with provenance (source URL + date + quoted passage), and routes each one through the adjudicator. Every persisted claim therefore traces back to a public, citable source — not to anyone's say-so.

Brand-safety dossier

Project status + setup help

This repo is published as-is for anyone who wants to clone it and make it work for themselves. I'll fix bugs and accept contributions when they make sense, but this isn't a funded SaaS — please don't expect a hosted offering or aggressive feature roadmap from this repository.

That said: the hardware and software footprint is non-trivial (NVIDIA GPU with ≥16 GB VRAM, ~5 TB of usable storage for any real corpus, properly configured local Ollama with the right model quantizations, working faster-whisper / pyannote / YuNet+dlib / Qwen2.5-VL stack, sensible scratch-vs-cold-archive disk layout, optional bgutil POT provider for yt-dlp) and getting it all working end-to-end on YOUR videos is real engineering work.

If you'd like me to come do that for you in person:

I offer a setup service where I fly to your location, procure the right hardware, install everything, point the pipeline at your existing video catalog, validate the output is what you need, and hand you a running system. Same code, no proprietary fork — you get this exact open-source repo running on metal you own.

I can also tailor the pipeline to your specific use case if it's properly scoped and budgeted — adding a new stage, swapping in different models, integrating with your existing data pipeline, fine-tuning the brand-detection allowlist for your domain, building a custom report layout, or anything else that's a meaningful extension of what's already here.

If any of that is interesting, email cahlen@gmail.com with subject MULTIMODAL STREAM REQUEST and we can have a conversation.

License

The corpus-mill pipeline code is licensed under the Apache License 2.0 — permissive, commercial use OK, patent grant included.

Third-party model licenses (read this if you're running commercially)

This pipeline orchestrates models that each carry their own license. Apache 2.0 covers the code in this repo only. You are responsible for complying with the terms of every model you load.

Model — license — commercial use:

  • Whisper-large-v3 (via faster-whisper) — MIT — ✓
  • pyannote/speaker-diarization-3.1 — MIT (gated; accept T&Cs on HF) — ✓
  • Qwen2.5-VL-7B-Instruct — Tongyi Qianwen License — ✓ for most users; review clauses re: >100M MAU and output attribution
  • qwen2.5:7b · qwen3:14b — Apache 2.0 — ✓
  • YuNet (face detection) — Apache 2.0 — ✓
  • dlib face_recognition_resnet (face embeddings) — Boost Software License — ✓
  • chromaprint — LGPL — ✓ (linking only)
  • pdqhash — BSD — ✓

Every model the default pipeline loads is permissively licensed for commercial use. If you're running against your own videos for personal, research, academic, or commercial use, you're covered end-to-end by the table above.
