fastapi-media-transcriber

A local-first, containerized transcription API using FastAPI, ffmpeg, and faster-whisper.

Features

Transcribe video/audio files to text with segment timestamps
Chunked transcription: Split long segments into smaller chunks for finer editing control
Automatic clip selection using heuristics (no external APIs)
Content-aware cleanup: Skip filler words, fluffy intros, and bad segments
Marker-based boundaries: Use verbal cues like "cut" or "restart" to control clip boundaries
Punchline boost: Fun clips prefer short, punchy segments with exclamations
EDL rendering: Stitch multiple keep ranges into one video, cutting out mess-ups
AI planning layer: Generate edit plans in stub, heuristic, ai, or ai_labels mode
Trim video clips by time range
Full auto-clip pipeline: transcribe -> select -> render
Two input modes: file path (for mounted volumes) or direct file upload
Auto language detection or specify language explicitly
Multiple model sizes (tiny, base, small, medium, large)
Clean JSON response with full text, segments, and metadata

Quickstart

docker compose up --build

The API will be available at http://localhost:3000.

API Endpoints

Method	Path	Purpose
GET	`/health`	Health check
POST	`/transcribe`	Transcribe media file by path
POST	`/transcribe/upload`	Transcribe uploaded file
POST	`/select-clips`	Select best clips from segments
POST	`/trim`	Trim video by time range
POST	`/auto-clip`	Full pipeline: transcribe → select → render
POST	`/render-edl`	Stitch keep ranges, cut out mess-ups
POST	`/plan-edits`	AI planning layer (stub/heuristic/ai/ai_labels modes)
POST	`/make-clips`	One-call AI pipeline: transcribe → plan → render

Health Check

curl http://localhost:3000/health

Response:

{"ok": true}

Transcribe via Path

Place your media files in the ./data directory (mounted to /data in the container):

curl -X POST http://localhost:3000/transcribe \
  -H "Content-Type: application/json" \
  -d '{"path": "/data/sample.mp4"}'

With optional parameters:

curl -X POST http://localhost:3000/transcribe \
  -H "Content-Type: application/json" \
  -d '{"path": "/data/sample.mp4", "language": "en", "model": "base"}'

With chunked granularity (splits long segments >3s into smaller chunks):

curl -X POST http://localhost:3000/transcribe \
  -H "Content-Type: application/json" \
  -d '{"path": "/data/sample.mp4", "granularity": "chunked"}'

Transcribe via File Upload

curl -X POST http://localhost:3000/transcribe/upload \
  -F "file=@./data/sample.mp4"

With optional parameters:

curl -X POST http://localhost:3000/transcribe/upload \
  -F "file=@./data/sample.mp4" \
  -F "language=en" \
  -F "model=small"

With chunked granularity:

curl -X POST http://localhost:3000/transcribe/upload \
  -F "file=@./data/sample.mp4" \
  -F "granularity=chunked"

Transcription Response Format

{
  "text": "Full transcription text here...",
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "start_ms": 0,
      "end_ms": 2500,
      "text": "First segment"
    },
    {
      "start": 2.5,
      "end": 5.0,
      "start_ms": 2500,
      "end_ms": 5000,
      "text": "Second segment"
    }
  ],
  "meta": {
    "language": "en",
    "duration_s": 10.5,
    "engine": "faster-whisper",
    "model": "small",
    "granularity": "default"
  }
}

Transcription Parameters:

Field	Default	Description
path	required	Path to media file (for `/transcribe`)
file	required	Uploaded file (for `/transcribe/upload`)
language	auto	Language code (e.g., "en", "es"); auto-detected when omitted
model	"small"	Whisper model: tiny, base, small, medium, large
granularity	"default"	`"default"` or `"chunked"` - chunked splits segments >3s

Granularity Options:

Value	Behavior
`"default"`	Return segments as-is from Whisper
`"chunked"`	Split segments longer than 3000ms into smaller chunks by word boundaries

Chunked mode is useful when you need finer-grained timestamps for precise editing or AI planning.
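
A minimal sketch of how chunked splitting could work, assuming only segment-level timestamps are available and that words are spread evenly across the segment. The function name `chunk_segment` and the even-spacing interpolation are illustrative assumptions, not the service's actual implementation:

```python
def chunk_segment(segment, max_ms=3000):
    """Split a long segment into chunks of roughly max_ms at word boundaries.

    Assumption: words are evenly spaced in time, since only segment-level
    timestamps are used here.
    """
    start_ms = round(segment["start"] * 1000)
    end_ms = round(segment["end"] * 1000)
    duration = end_ms - start_ms
    words = segment["text"].split()
    if duration <= max_ms or len(words) < 2:
        return [{"start_ms": start_ms, "end_ms": end_ms, "text": segment["text"]}]

    # Largest word count whose interpolated duration stays within max_ms.
    per_chunk = max(1, len(words) * max_ms // duration)
    chunks = []
    for i in range(0, len(words), per_chunk):
        j = min(i + per_chunk, len(words))
        chunks.append({
            "start_ms": start_ms + duration * i // len(words),
            "end_ms": start_ms + duration * j // len(words),
            "text": " ".join(words[i:j]),
        })
    return chunks
```

Under these assumptions, a 6-second segment with six words is split into two 3-second chunks of three words each.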

Select Clips from Segments

Select the best clip windows from transcript segments using deterministic heuristics:

curl -X POST http://localhost:3000/select-clips \
  -H "Content-Type: application/json" \
  -d '{
    "segments": [
      {"start": 0.0, "end": 3.5, "text": "Hello and welcome to this video."},
      {"start": 3.8, "end": 7.2, "text": "Today we are going to talk about something interesting."},
      {"start": 7.5, "end": 12.0, "text": "This is a really important topic."},
      {"start": 12.3, "end": 16.8, "text": "Let me explain the first key point."},
      {"start": 17.0, "end": 21.5, "text": "Here is the second thing you need to know."}
    ],
    "clip_type": "mixed",
    "clean_level": "light",
    "max_clips": 2
  }'

With markers (creator "cut here" cheat code):

curl -X POST http://localhost:3000/select-clips \
  -H "Content-Type: application/json" \
  -d '{
    "segments": [
      {"start": 0.0, "end": 5.0, "text": "First good segment here."},
      {"start": 5.5, "end": 10.0, "text": "More good content."},
      {"start": 10.5, "end": 12.0, "text": "Okay cut that was bad."},
      {"start": 12.5, "end": 17.0, "text": "Starting fresh with new content."},
      {"start": 17.5, "end": 22.0, "text": "This is also good content."}
    ],
    "clip_type": "fun",
    "clean_level": "aggressive",
    "markers": ["cut", "restart", "take two"],
    "max_clips": 2
  }'

Response:

{
  "clips": [
    {
      "start": 5.5,
      "end": 16.8,
      "duration_s": 11.3,
      "start_ms": 5500,
      "end_ms": 16800,
      "score": 0.72,
      "reason": "mixed + light cleanup: good duration, dense speech, shifted start"
    }
  ]
}

Parameters:

Field	Default	Description
segments	required	Array of transcript segments
clip_type	"mixed"	Type of clips: "document", "fun", or "mixed"
clean_level	"light"	Cleanup level: "none", "light", or "aggressive"
markers	[]	Marker strings that act as hard boundaries (e.g., ["cut", "restart"])
target_s	profile default	Target clip duration in seconds
min_s	profile default	Minimum clip duration
max_s	profile default	Maximum clip duration
max_clips	3	Maximum number of clips to return
max_gap_s	1.2	Max gap between segments when merging

Clip Types:

Type	Target	Min	Max	Best For
document	30s	15s	60s	Longer, polished clips with clean endings
fun	12s	6s	20s	Short, punchy clips with expressive content + punchline boost
mixed	15s	10s	25s	Balanced selection; variety when max_clips >= 2
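
The profile table above can be read as a set of defaults that explicit `target_s`, `min_s`, and `max_s` values override. A small sketch of that lookup follows; the names `PROFILES` and `resolve_durations` are hypothetical, and only the numbers come from the table:

```python
# Duration defaults per clip type, copied from the Clip Types table.
PROFILES = {
    "document": {"target_s": 30, "min_s": 15, "max_s": 60},
    "fun": {"target_s": 12, "min_s": 6, "max_s": 20},
    "mixed": {"target_s": 15, "min_s": 10, "max_s": 25},
}

def resolve_durations(clip_type="mixed", target_s=None, min_s=None, max_s=None):
    """Return the effective durations: explicit values win, else profile defaults."""
    profile = PROFILES[clip_type]
    overrides = {"target_s": target_s, "min_s": min_s, "max_s": max_s}
    return {key: (value if value is not None else profile[key])
            for key, value in overrides.items()}
```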

Clean Levels:

Level	Behavior
`"none"`	No cleanup filtering (legacy behavior)
`"light"`	Penalize filler-heavy segments; shift start to skip fluffy intros
`"aggressive"`	Stronger penalties; filter out candidates with >50% bad segments

Markers:

Segments containing marker text become hard boundaries
Clips will not start on, end on, or cross marker segments
Useful for verbal cues like "cut", "restart", "take two", "that was bad"

Scoring heuristics:

Prefers duration close to target
Prefers dense speech (more words per second)
Penalizes large internal gaps
Bonus for clean endings (sentences ending in . ! ?)
Document: Stricter gap penalties, stronger clean ending bonus
Fun: Bonus for expressive punctuation (! ?) and engaging words, plus a punchline boost for punchy single segments (5-12s containing words such as "wow" or "insane")

Trimming a Video Clip

Trim a video to a specific time range:

curl -X POST http://localhost:3000/trim \
  -H "Content-Type: application/json" \
  -d '{
    "path": "/data/test.MOV",
    "start": 4.32,
    "end": 19.84
  }'

With custom output filename:

curl -X POST http://localhost:3000/trim \
  -H "Content-Type: application/json" \
  -d '{
    "path": "/data/test.MOV",
    "start": 4.32,
    "end": 19.84,
    "output_name": "my_clip.mp4"
  }'

Response:

{
  "input": "/data/test.MOV",
  "output": "/data/test_clip.mp4",
  "start": 4.32,
  "end": 19.84,
  "duration_s": 15.52
}

Notes:

Output is saved to the same directory as the input file
If output_name is omitted, output is named {input}_clip.mp4, where {input} is the input filename without its extension (test.MOV becomes test_clip.mp4)
Uses fast stream copy when possible and falls back to re-encoding if needed

Auto-Clip Pipeline

One-shot endpoint that chains transcribe -> select-clips -> trim:

curl -X POST http://localhost:3000/auto-clip \
  -H "Content-Type: application/json" \
  -d '{
    "path": "/data/test.MOV",
    "clip_type": "fun",
    "clean_level": "light",
    "max_clips": 2
  }'

With markers for creator control:

curl -X POST http://localhost:3000/auto-clip \
  -H "Content-Type: application/json" \
  -d '{
    "path": "/data/selfie-video.MOV",
    "clip_type": "fun",
    "clean_level": "aggressive",
    "markers": ["cut", "restart", "that was dumb"],
    "max_clips": 3
  }'

Response:

{
  "transcript": {
    "text": "Full transcription...",
    "segments": [...],
    "meta": {...}
  },
  "clips": [
    {
      "start": 5.5,
      "end": 14.2,
      "duration_s": 8.7,
      "start_ms": 5500,
      "end_ms": 14200,
      "score": 0.72,
      "reason": "fun + light cleanup: snappy duration, expressive, punchline, shifted start"
    }
  ],
  "renders": [
    {
      "output": "/data/test_clip_1.mp4",
      "start": 5.5,
      "end": 14.2,
      "duration_s": 8.7
    }
  ]
}

Parameters:

Field	Default	Description
path	required	Path to input video
language	auto	Language code for transcription
model	small	Whisper model to use
clip_type	"mixed"	Type of clips: "document", "fun", or "mixed"
clean_level	"light"	Cleanup level: "none", "light", or "aggressive"
markers	[]	Marker strings that act as hard boundaries
target_s	profile default	Target clip duration
min_s	profile default	Minimum clip duration
max_s	profile default	Maximum clip duration
max_clips	1	Number of clips to generate
max_gap_s	1.2	Max gap for segment merging

Notes:

Output clips are named {input}_clip_1.mp4, {input}_clip_2.mp4, etc.
All processing is local (no external API calls)
Clip type determines duration defaults (see /select-clips for profile details)
Clean level controls filler detection and start shifting
Markers let you verbally mark "cut points" while recording
Designed for automation and chaining with other tools

Render EDL (Edit Decision List)

Stitch together multiple keep ranges into a single output video, cutting out mess-ups:

curl -X POST http://localhost:3000/render-edl \
  -H "Content-Type: application/json" \
  -d '{
    "path": "/data/raw-recording.MOV",
    "keep_ms": [
      [0, 5000],
      [8000, 15000],
      [20000, 30000]
    ]
  }'

With options:

curl -X POST http://localhost:3000/render-edl \
  -H "Content-Type: application/json" \
  -d '{
    "path": "/data/raw-recording.MOV",
    "keep_ms": [
      [0, 5000],
      [5100, 10000],
      [15000, 25000],
      [25500, 35000]
    ],
    "output": "/data/final-edit.mp4",
    "min_segment_ms": 500,
    "merge_gap_ms": 200,
    "max_segments": 10
  }'

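Response (illustrative; kept_ms reports the ranges after merging and filtering, so it can differ from the keep_ms you sent):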

{
  "input": "/data/raw-recording.MOV",
  "output": "/data/raw-recording_edl.mp4",
  "kept_ms": [
    [0, 10000],
    [15000, 35000]
  ],
  "duration_s": 30.0,
  "segments_rendered": 2
}

Parameters:

Field	Default	Description
path	required	Path to input video
keep_ms	required	List of `[start_ms, end_ms]` pairs to keep
output	`{input}_edl.mp4`	Output file path
min_segment_ms	500	Filter out segments shorter than this
merge_gap_ms	100	Merge segments within this gap
max_segments	unlimited	Maximum segments to render (keeps longest)
reencode	false	Force re-encoding (default: stream copy)

Processing Pipeline:

Validate: Check all ranges have valid start < end
Sort: Order ranges by start time
Merge: Combine overlapping or close ranges (within merge_gap_ms)
Filter: Remove segments shorter than min_segment_ms
Limit: Keep only the longest max_segments if specified
Render: Trim each segment and concatenate into final output
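
The validate, sort, merge, filter, and limit steps above can be sketched as a pure function. This is an illustration under stated assumptions, not the service's code: the name `normalize_keep_ranges` is hypothetical, a gap equal to `merge_gap_ms` is assumed to merge, and the survivors of the limit step are assumed to be returned in chronological order.

```python
def normalize_keep_ranges(keep_ms, min_segment_ms=500, merge_gap_ms=100, max_segments=None):
    """Return the keep ranges that would be rendered, as [start_ms, end_ms] pairs."""
    # Validate: every range needs start < end.
    for start, end in keep_ms:
        if start >= end:
            raise ValueError(f"invalid range: [{start}, {end}]")

    # Sort by start time.
    ordered = sorted(keep_ms, key=lambda r: r[0])

    # Merge overlapping or close ranges (assumption: gap <= merge_gap_ms merges).
    merged = []
    for start, end in ordered:
        if merged and start - merged[-1][1] <= merge_gap_ms:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    # Filter out ranges shorter than min_segment_ms.
    kept = [r for r in merged if r[1] - r[0] >= min_segment_ms]

    # Limit: keep the longest max_segments, then restore chronological order.
    if max_segments is not None and len(kept) > max_segments:
        kept = sorted(kept, key=lambda r: r[1] - r[0], reverse=True)[:max_segments]
        kept.sort()
    return kept
```

Running a function like this before calling `/render-edl` is one way to preview which ranges would survive a given `merge_gap_ms` and `min_segment_ms`.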

Use Cases:

Cut out verbal mess-ups, false starts, or awkward pauses
Manual editing workflow: transcribe -> review -> mark keeps -> render
Combine with /select-clips output for automated cleanup
Post-processing after reviewing auto-generated clips

Notes:

Uses fast stream copy by default; falls back to re-encode if needed
Set reencode: true if you encounter audio/video sync issues
Ranges that overlap or touch are automatically merged
Output kept_ms shows the final merged ranges actually rendered

Plan Edits (AI Planning Layer)

Generate an edit plan (clips with keep ranges) from transcript segments. Supports four modes:

Stub mode (for wiring/testing):

curl -X POST http://localhost:3000/plan-edits \
  -H "Content-Type: application/json" \
  -d '{
    "segments": [
      {"start": 0.0, "end": 10.0, "text": "Hello world"},
      {"start": 10.0, "end": 20.0, "text": "More content here"}
    ],
    "mode": "stub"
  }'

Heuristic mode (deterministic, no external APIs):

curl -X POST http://localhost:3000/plan-edits \
  -H "Content-Type: application/json" \
  -d '{
    "segments": [
      {"start": 0.0, "end": 10.0, "text": "Hello and welcome."},
      {"start": 10.0, "end": 20.0, "text": "Let me restart that."},
      {"start": 20.0, "end": 30.0, "text": "Okay, here is the real content."}
    ],
    "mode": "heuristic",
    "max_clips": 2,
    "markers": ["restart"]
  }'

AI mode (calls Claude API to generate intelligent edit plans):

curl -X POST http://localhost:3000/plan-edits \
  -H "Content-Type: application/json" \
  -d '{
    "segments": [
      {"start": 0.0, "end": 10.0, "text": "Hello world"},
      {"start": 10.5, "end": 15.0, "text": "Uh let me restart that"},
      {"start": 15.5, "end": 25.0, "text": "Here is the actual content you want"}
    ],
    "mode": "ai",
    "markers": ["restart"]
  }'

Note: AI mode requires ANTHROPIC_API_KEY to be set. It uses Claude to detect mess-ups and plan edits intelligently.

AI Labels mode (labels segments, then deterministic code converts to clips):

curl -X POST http://localhost:3000/plan-edits \
  -H "Content-Type: application/json" \
  -d '{
    "segments": [
      {"start": 0.0, "end": 5.0, "text": "Today we will discuss Python."},
      {"start": 5.0, "end": 10.0, "text": "Uh wait, let me restart that."},
      {"start": 10.0, "end": 15.0, "text": "Today we will discuss Python programming."},
      {"start": 15.0, "end": 20.0, "text": "Python is a great language!"},
      {"start": 20.0, "end": 25.0, "text": "Now let me talk about JavaScript."},
      {"start": 25.0, "end": 30.0, "text": "JavaScript is also popular."}
    ],
    "mode": "ai_labels",
    "max_clips": 2
  }'

Note: AI Labels mode asks Claude to classify each segment (keep/cut/unsure + tags + story_id), then deterministic code converts labels to clips. This is better for detecting retakes where earlier content should be cut and later versions kept.

AI Labels with debug output (see the labels and which segments formed each clip):

curl -X POST http://localhost:3000/plan-edits \
  -H "Content-Type: application/json" \
  -d '{
    "segments": [...],
    "mode": "ai_labels",
    "debug": true
  }'

AI Labels with custom unsure_policy (control how uncertain segments are handled):

curl -X POST http://localhost:3000/plan-edits \
  -H "Content-Type: application/json" \
  -d '{
    "segments": [...],
    "mode": "ai_labels",
    "unsure_policy": "adjacent"
  }'

Response (heuristic mode):

{
  "clips": [
    {
      "clip_id": "550e8400-e29b-41d4-a716-446655440000",
      "clip_type": "mixed",
      "title": "Hello and welcome",
      "keep_ms": [[0, 10000]],
      "total_ms": 10000,
      "reason": "heuristic: mixed selection",
      "confidence": 0.7
    }
  ],
  "meta": {
    "planner": "heuristic",
    "segments_in": 3,
    "max_clips": 2
  }
}

Response (AI mode):

{
  "clips": [
    {
      "clip_id": "a1b2c3d4-...",
      "clip_type": "document",
      "title": "Actual Content",
      "keep_ms": [[15500, 25000]],
      "total_ms": 9500,
      "reason": "Skipped restart section, kept coherent content",
      "confidence": 0.85
    }
  ],
  "meta": {
    "planner": "ai",
    "segments_in": 3,
    "max_clips": 2
  }
}

Response (AI Labels mode):

{
  "clips": [
    {
      "clip_id": "e5f6g7h8-...",
      "clip_type": "mixed",
      "title": "Today we will discuss Python",
      "keep_ms": [[10000, 20000]],
      "total_ms": 10000,
      "reason": "ai_labels: story 1, tags: ['clean_story']",
      "confidence": 0.85
    },
    {
      "clip_id": "i9j0k1l2-...",
      "clip_type": "mixed",
      "title": "Now let me talk about JavaScript",
      "keep_ms": [[20000, 30000]],
      "total_ms": 10000,
      "reason": "ai_labels: story 2, tags: ['topic_shift', 'clean_story']",
      "confidence": 0.85
    }
  ],
  "meta": {
    "planner": "ai_labels",
    "segments_in": 6,
    "max_clips": 2,
    "labels_count": 6
  }
}

Parameters:

Field	Default	Description
segments	required	Array of transcript segments
mode	"heuristic"	Planner mode: "stub", "heuristic", "ai", or "ai_labels"
max_clips	3	Maximum clips to generate
clip_types	["document", "fun", "mixed"]	Allowed clip types
preferred_clip_type	"mixed"	Preferred clip type
markers	[]	Marker words for mess-up detection
clean_level	"light"	Cleanup level: "none", "light", "aggressive"
min_clip_ms	6000	Minimum clip duration (ms)
max_clip_ms	60000	Maximum clip duration (ms)
max_keep_ranges	10	Max keep ranges per clip
enforce_segment_boundaries	true	Snap keep_ms to segment boundaries
unsure_policy	(by clip type)	ai_labels only: "keep", "cut", or "adjacent"
debug	false	ai_labels only: include labels and clip_sources in meta
lead_in_ms	300	Expand keep range starts by this amount (clamp to bounds)
tail_out_ms	300	Expand keep range ends by this amount (clamp to bounds)

Planner Modes:

Mode	Description
`stub`	Returns single clip covering entire transcript (for wiring)
`heuristic`	Deterministic multi-clip plan using markers, topic shifts, or time buckets
`ai`	Calls Claude API to intelligently detect mess-ups and plan edits
`ai_labels`	Claude labels each segment (keep/cut/unsure + tags + story_id), then deterministic code converts the labels to clips. Best for retake detection.

Unsure Policy (ai_labels only):

When the AI labels a segment as "unsure", this policy determines what to do:

Policy	Behavior	Default For
`"keep"`	Treat unsure as keep (conservative)	"document" clips
`"cut"`	Treat unsure as cut (aggressive)	"fun" clips
`"adjacent"`	Keep if neighbors keep, cut if neighbors cut	"mixed" clips
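
A sketch of how the three policies could resolve "unsure" labels. The table does not say what "adjacent" does when the neighbors disagree or are themselves unsure; this sketch assumes the segment is kept only if every decided neighbor is "keep". The names `DEFAULT_POLICY` and `resolve_unsure` are hypothetical.

```python
# Default policy per preferred clip type, taken from the table above.
DEFAULT_POLICY = {"document": "keep", "fun": "cut", "mixed": "adjacent"}

def resolve_unsure(actions, policy):
    """Replace every "unsure" in a list of actions with "keep" or "cut"."""
    if policy not in ("keep", "cut", "adjacent"):
        raise ValueError(f"unknown unsure_policy: {policy}")
    resolved = []
    for i, action in enumerate(actions):
        if action != "unsure":
            resolved.append(action)
        elif policy in ("keep", "cut"):
            resolved.append(policy)
        else:
            # "adjacent": look at the original labels on either side.
            neighbors = [actions[j] for j in (i - 1, i + 1) if 0 <= j < len(actions)]
            decided = [a for a in neighbors if a != "unsure"]
            # Assumption: keep only when all decided neighbors are "keep".
            keep = bool(decided) and all(a == "keep" for a in decided)
            resolved.append("keep" if keep else "cut")
    return resolved
```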

Debug Output (ai_labels only):

When debug: true, the response includes extra metadata:

{
  "clips": [...],
  "meta": {
    "planner": "ai_labels",
    "labels": [
      {"idx": 0, "action": "keep", "tags": ["intro"], "story_id": 1},
      {"idx": 1, "action": "cut", "tags": ["retake_repeat"], "story_id": 1},
      {"idx": 2, "action": "keep", "tags": ["clean_story"], "story_id": 1}
    ],
    "clip_sources": [
      {
        "clip_index": 0,
        "story_id": 1,
        "kept_segment_indexes": [0, 2],
        "cut_segment_indexes": [1]
      }
    ],
    "unsure_policy": "keep"
  }
}

This is useful for debugging why certain segments were included or excluded.

Validation & Fallback (ai_labels):

AI-generated clips are automatically validated. If validation fails (empty output, out of bounds, etc.), the system falls back to heuristic mode. Check meta.planner to see which mode was actually used:

"ai_labels" - AI labels were used successfully
"ai_labels_fallback" - Fell back to heuristic (check meta.fallback_reason)

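The fallback described above could look roughly like the following. Only the `meta.planner` values and the `meta.fallback_reason` field come from this README; the function names, the specific checks, and the reason strings are assumptions made for illustration.

```python
def validation_error(clips, transcript_end_ms):
    """Return a reason string if the plan is unusable, else None."""
    if not clips:
        return "empty output"
    for clip in clips:
        for start, end in clip["keep_ms"]:
            if start < 0 or end > transcript_end_ms or start >= end:
                return "keep range out of bounds"
    return None

def finalize_plan(ai_clips, heuristic_clips, transcript_end_ms):
    """Use the AI plan when it validates, otherwise fall back to the heuristic plan."""
    reason = validation_error(ai_clips, transcript_end_ms)
    if reason is None:
        return {"clips": ai_clips, "meta": {"planner": "ai_labels"}}
    return {
        "clips": heuristic_clips,
        "meta": {"planner": "ai_labels_fallback", "fallback_reason": reason},
    }
```
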
Post-Processing (ai_labels):

After AI labels are converted to clips, several deterministic post-processing steps are applied to produce cleaner, more publishable output:

Label Normalization — Tags that indicate bad content force action="cut" regardless of what the AI returned:
- Cut-forcing tags: false_start, retake_repeat, filler, restart_phrase, garbled, non_story, meta_commentary, outro
- Example: {"action": "unsure", "tags": ["false_start"]} → normalized to {"action": "cut", "tags": ["false_start"]}
- Unknown/invalid tags are dropped silently
- This prevents "unsure + bad tag" segments from being kept due to unsure_policy=keep
Outro Auto-Cut — Segments containing common wrap-up phrases are automatically marked as "cut" with an "outro" tag:
- "let's see", "that's it", "anyway", "cool", "ok bye", "alright so", "alright then", "so yeah", "yeah so"
- Short segments (<30 chars) containing these phrases are also cut
- Prevents awkward "umm, let's see..." endings in clips
Trailing Unsure Trimming (document clips or unsure_policy "keep") — Removes trailing "unsure" segments from clip ends:
- Only applies when preferred_clip_type == "document" or unsure_policy == "keep"
- If the last N segments in a clip were originally labeled "unsure" by the AI, they are trimmed
- Prevents clips ending with mumbling, trailing thoughts, or uncertain content
- Will not trim if it would make the clip shorter than min_clip_ms
Smart Lead-in Range Drop — For clips with multiple keep ranges, short "lead-in" ranges at the start are intelligently removed:
- If the first range is <2500ms and there are 2+ ranges, it may be dropped
- Exceptions (first range is kept if):
  - The range contains 2+ kept segments
  - Any segment in the range has a clean_story tag
- Prevents clips starting with a tiny fragment before the main content
- Example: [[0, 2000], [10000, 25000]] → [[10000, 25000]] (unless exceptions apply)
Keep Range Expansion — Keep ranges are expanded by lead_in_ms and tail_out_ms:
- Start of each range is moved earlier by lead_in_ms (default: 300ms)
- End of each range is moved later by tail_out_ms (default: 300ms)
- Expansion is clamped to transcript bounds
- Expanded positions are snapped to nearest segment boundaries
- Creates more natural cuts that don't start/end abruptly mid-word

These rules run automatically in the order listed. They are designed to handle common issues in selfie/talking-head videos where creators trail off, have false starts, or AI labels segments as uncertain.
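
Two of these rules, label normalization and keep range expansion, are simple enough to sketch. The cut-forcing tags are copied from the list above. `KNOWN_TAGS` is an assumption: it contains only the tags that appear in this README, and the real vocabulary may be larger. Snapping to segment boundaries is left out, and the function names are hypothetical.

```python
CUT_FORCING_TAGS = {
    "false_start", "retake_repeat", "filler", "restart_phrase",
    "garbled", "non_story", "meta_commentary", "outro",
}
# Assumption: only the tags mentioned in this README are treated as valid.
KNOWN_TAGS = CUT_FORCING_TAGS | {"intro", "clean_story", "topic_shift"}

def normalize_label(label, known_tags=KNOWN_TAGS):
    """Drop unknown tags and force action="cut" when a cut-forcing tag is present."""
    tags = [t for t in label.get("tags", []) if t in known_tags]
    forced = any(t in CUT_FORCING_TAGS for t in tags)
    return {**label, "action": "cut" if forced else label["action"], "tags": tags}

def expand_range(keep_range, bounds, lead_in_ms=300, tail_out_ms=300):
    """Widen a [start_ms, end_ms] range, clamped to the transcript bounds."""
    start, end = keep_range
    lowest, highest = bounds
    return [max(lowest, start - lead_in_ms), min(highest, end + tail_out_ms)]
```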

Heuristic Strategy:

If markers provided, split at marker segments
Otherwise, detect topic shifts via gaps (>3s) or reset phrases ("restart", "take two", etc.)
Fall back to equal time buckets
For each chunk, use select_clips to find best window
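
The first two steps of this strategy can be sketched as a splitting function. This is an approximation under stated assumptions: a segment containing a marker or reset phrase is treated as a boundary and dropped, matching is a case-insensitive substring test, and `RESET_PHRASES` lists only the two phrases named above. The time-bucket fallback and the per-chunk `select_clips` call are omitted.

```python
RESET_PHRASES = ("restart", "take two")  # the README's list ends with "etc."

def split_into_chunks(segments, markers=(), gap_s=3.0):
    """Group segments into chunks separated by markers, reset phrases, or long gaps."""
    phrases = [p.lower() for p in (markers or RESET_PHRASES)]
    chunks, current = [], []
    for seg in segments:
        text = seg["text"].lower()
        if any(p in text for p in phrases):
            # Boundary segment: close the current chunk and skip the segment itself.
            if current:
                chunks.append(current)
                current = []
            continue
        # Gap-based topic shifts only apply when no markers were provided.
        if not markers and current and seg["start"] - current[-1]["end"] > gap_s:
            chunks.append(current)
            current = []
        current.append(seg)
    if current:
        chunks.append(current)
    return chunks
```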

Use Cases:

Test integration with stub mode before going live
Deterministic clip planning with heuristic mode (no API costs)
Intelligent mess-up detection with AI mode (requires ANTHROPIC_API_KEY)

Chaining transcribe → plan-edits:

# 1. Transcribe and save to file
curl -s -X POST http://localhost:3000/transcribe \
  -H "Content-Type: application/json" \
  -d '{"path": "/data/my-video.MOV", "granularity": "chunked"}' \
  > /tmp/chunked.json

# 2. Pipe segments into plan-edits
python3 -c '
import json
d = json.load(open("/tmp/chunked.json"))
print(json.dumps({
  "segments": [{"start": s["start"], "end": s["end"], "text": s["text"]} for s in d["segments"]],
  "mode": "ai_labels",
  "max_clips": 2,
  "preferred_clip_type": "document",
  "lead_in_ms": 800,
  "tail_out_ms": 800
}))
' | curl -s -X POST http://localhost:3000/plan-edits \
  -H "Content-Type: application/json" \
  -d @- | python3 -m json.tool

Make Clips (One-Call Pipeline)

New one-call endpoint that orchestrates the full clip creation workflow: transcribe → plan-edits → render.

curl -X POST http://localhost:3000/make-clips \
  -H "Content-Type: application/json" \
  -d '{
    "path": "/data/my-video.MOV",
    "output_prefix": "my_video"
  }'

With options:

curl -X POST http://localhost:3000/make-clips \
  -H "Content-Type: application/json" \
  -d '{
    "path": "/data/my-video.MOV",
    "output_prefix": "my_video",
    "max_clips": 3,
    "preferred_clip_type": "document",
    "min_clip_ms": 10000,
    "max_clip_ms": 45000,
    "lead_in_ms": 500,
    "tail_out_ms": 300
  }'

Response:

{
  "clips": [
    {
      "clip_id": "a1b2c3d4-...",
      "output_path": "/data/my_video_clip1.mp4",
      "keep_ms": [[5000, 20000]],
      "total_ms": 15000,
      "title": "Today we will discuss Python"
    },
    {
      "clip_id": "e5f6g7h8-...",
      "output_path": "/data/my_video_clip2.mp4",
      "keep_ms": [[25000, 40000]],
      "total_ms": 15000,
      "title": "Now let me talk about JavaScript"
    }
  ],
  "meta": {
    "input_path": "/data/my-video.MOV",
    "output_prefix": "my_video",
    "segments_transcribed": 12,
    "clips_planned": 2,
    "clips_rendered": 2,
    "planner": "ai_labels"
  }
}

Parameters:

Field	Default	Description
path	required	Path to input video
output_prefix	required	Prefix for output filenames (e.g., "my_video" → my_video_clip1.mp4)
max_clips	2	Maximum clips to generate
preferred_clip_type	"document"	Preferred clip type
markers	[]	Marker words for mess-up detection
min_clip_ms	6000	Minimum clip duration (ms)
max_clip_ms	60000	Maximum clip duration (ms)
unsure_policy	(by clip type)	How to handle uncertain segments
lead_in_ms	300	Expand clip starts by this amount
tail_out_ms	300	Expand clip ends by this amount
model	settings default	Whisper model for transcription
language	auto	Language code for transcription

Key Features:

Deterministic filenames: Output files are named <output_prefix>_clip1.mp4, <output_prefix>_clip2.mp4, etc.
Uses ai_labels mode: Leverages Claude API for intelligent segment labeling
Full post-processing: Includes label normalization, lead-in/tail-out expansion, bridge range dropping
Automatic cleanup: Applies outro auto-cut and trailing unsure trimming

Note: Requires ANTHROPIC_API_KEY to be set.

Workflow Examples

Manual: Transcribe, Review, Trim

Transcribe to get segments with timestamps
Review segments and pick interesting ranges
Call /trim with your chosen timestamps

Automatic: One-Shot Clip Generation

# Generate up to 3 clips of ~15 seconds each
curl -X POST http://localhost:3000/auto-clip \
  -H "Content-Type: application/json" \
  -d '{"path": "/data/interview.mp4", "max_clips": 3, "target_s": 15}'

Programmatic: Custom Selection Logic

Call /transcribe to get segments
Apply your own selection logic
Call /trim for each clip you want
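
As one example of custom selection logic, the sketch below picks every transcript segment that mentions a keyword and builds one `/trim` request body per match. The `path`, `start`, `end`, and `output_name` fields are the ones documented for `/trim`; the function name, the keyword rule, and the `match_N.mp4` naming are hypothetical.

```python
def build_trim_payloads(transcript, path, keyword):
    """Return one /trim request body per segment whose text mentions the keyword."""
    payloads = []
    for number, seg in enumerate(transcript["segments"], start=1):
        if keyword.lower() in seg["text"].lower():
            payloads.append({
                "path": path,
                "start": seg["start"],
                "end": seg["end"],
                "output_name": f"match_{number}.mp4",
            })
    return payloads
```

Each returned dictionary can then be sent as the JSON body of a POST to `/trim`.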

File Input

Place media files in the ./data directory. This folder is mounted to /data inside the container.

Supported formats: Any format supported by ffmpeg (mp4, mp3, wav, mkv, webm, mov, etc.)

Models

Available Whisper models (via faster-whisper):

Model	Size	Speed	Quality
tiny	~75MB	Fastest	Lower
base	~150MB	Fast	Good
small	~500MB	Medium	Better
medium	~1.5GB	Slower	Great
large	~3GB	Slowest	Best

Default model: small

Performance notes:

First request loads the model (takes a few seconds)
Subsequent requests reuse the cached model
Running on CPU; GPU support would significantly speed up transcription
For long files, expect roughly real-time processing with the small model on modern CPUs

Configuration

Environment variables (set in docker-compose.yml or .env):

Variable	Default	Description
TRANSCRIBE_DEFAULT_MODEL	small	Default Whisper model
TRANSCRIBE_TMP_DIR	/tmp/transcriber	Temp directory for processing
MAX_SEGMENT_MS	3000	Max segment duration for chunked mode (ms)
ANTHROPIC_API_KEY	(none)	Anthropic API key; required for the `ai` and `ai_labels` planner modes and for `/make-clips`
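
For reference, these variables and defaults could be read in Python as shown below. This is only a sketch: `load_settings` and the dictionary keys are hypothetical, and the app may load its settings differently.

```python
import os

def load_settings(env=None):
    """Read the documented environment variables, applying the documented defaults."""
    env = os.environ if env is None else env
    return {
        "default_model": env.get("TRANSCRIBE_DEFAULT_MODEL", "small"),
        "tmp_dir": env.get("TRANSCRIBE_TMP_DIR", "/tmp/transcriber"),
        "max_segment_ms": int(env.get("MAX_SEGMENT_MS", "3000")),
        # None means the AI planner modes are unavailable.
        "anthropic_api_key": env.get("ANTHROPIC_API_KEY"),
    }
```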

Development

Run tests:

pip install -r requirements.txt
pytest tests/

Future Enhancements

S3/URL input mode
GPU support
Word-level timestamps
Webhook callbacks for long files
Batch processing
LLM-powered clip selection (upgrade from heuristics)

Content-Aware Cleanup (Implemented)

Status: Implemented in v0.2

The clip selection system now includes content-aware cleanup to avoid filler-heavy segments and improve clip quality for selfie/talking-head videos.

What It Does

When recording selfie videos, creators often produce raw footage with:

Filler words — "uh", "um", "like", "you know", "so", "basically"
False starts — "So today we're going to— actually let me start over"
Awkward intros — Rambling warm-ups before getting to the point
Dead air — Long pauses or silence

The cleanup system detects and handles these automatically.

Per-Segment Quality Signals

Each segment is analyzed for:

Signal	Description
`word_count`	Total words in segment
`filler_count`	Count of filler words (uh, um, like, you know, so, basically, etc.)
`filler_ratio`	`filler_count / word_count`
`starts_fluffy`	Segment begins with filler or hedge words
`ends_clean`	Segment ends with sentence-final punctuation

"Bad Segment" Detection

Rule	Trigger
High filler density	`filler_ratio > 0.25`
Short fluff	`duration < 2s` AND `starts_fluffy`
Acknowledgement-only	`word_count <= 3` AND `starts_fluffy`
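
The signals and rules in the two tables above can be sketched as follows. The thresholds come from the tables. The filler list is limited to the words named in this README, hedge words are not enumerated there so `starts_fluffy` checks fillers only, and the tokenization is a simple assumption; the function names are hypothetical.

```python
import re

# Single-word fillers named in this README; "you know" is handled separately.
SINGLE_FILLERS = {"uh", "um", "like", "so", "basically"}

def segment_signals(segment):
    """Compute the per-segment quality signals for one transcript segment."""
    words = re.findall(r"[a-z']+", segment["text"].lower())
    # Note: this counts every "like" and "so", even in non-filler uses.
    filler_count = sum(w in SINGLE_FILLERS for w in words)
    # Count the two-word filler "you know" once per occurrence.
    filler_count += sum(1 for a, b in zip(words, words[1:]) if (a, b) == ("you", "know"))
    word_count = len(words)
    starts_fluffy = bool(words) and (
        words[0] in SINGLE_FILLERS or words[:2] == ["you", "know"]
    )
    return {
        "word_count": word_count,
        "filler_count": filler_count,
        "filler_ratio": filler_count / word_count if word_count else 0.0,
        "starts_fluffy": starts_fluffy,
        "ends_clean": segment["text"].rstrip().endswith((".", "!", "?")),
    }

def is_bad_segment(segment):
    """Apply the three "bad segment" rules from the table."""
    s = segment_signals(segment)
    duration = segment["end"] - segment["start"]
    return (
        s["filler_ratio"] > 0.25
        or (duration < 2 and s["starts_fluffy"])
        or (s["word_count"] <= 3 and s["starts_fluffy"])
    )
```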

Clean Levels

Level	Behavior
`"none"`	No cleanup (legacy behavior)
`"light"`	Penalize filler-heavy segments; shift start to skip fluffy intros
`"aggressive"`	Stronger penalties; filter out candidates with >50% bad segments

Marker Boundaries

Pass markers array to create hard clip boundaries:

{"markers": ["cut", "restart", "take two", "that was dumb"]}

Clips will not cross segments containing marker text (case-insensitive).

Punchline Boost (Fun Clips)

For clip_type: "fun", single segments (5-12s) with exclamations or punch words get a significant score boost:

Punch words: "wow", "no way", "insane", "crazy", "bro", "wild", "unreal", "incredible"
Exclamation endings: !
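
A sketch of the eligibility check for the boost, using the punch words and the 5-12s window listed above. The size of the boost is not documented, so only the predicate is shown. Inclusive duration bounds and whole-word matching are assumptions, and the function name is hypothetical.

```python
import re

PUNCH_WORDS = ("wow", "no way", "insane", "crazy", "bro", "wild", "unreal", "incredible")

def qualifies_for_punchline_boost(segment):
    """True when a single segment is 5-12s long and has a punch word or ends with "!"."""
    duration = segment["end"] - segment["start"]
    text = segment["text"].lower()
    # Whole-word match, so "bro" does not match inside "browser".
    has_punch_word = any(
        re.search(rf"\b{re.escape(word)}\b", text) for word in PUNCH_WORDS
    )
    ends_with_exclamation = text.rstrip().endswith("!")
    return 5.0 <= duration <= 12.0 and (has_punch_word or ends_with_exclamation)
```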

Future: LLM-Powered Judging

Once heuristics are proven, optionally upgrade to an LLM judge:

Heuristics remain the candidate generator (fast, deterministic)
LLM re-ranks top candidates based on clip_type and clean_level
Enables nuanced judgments: "this segment sounds hesitant" vs. "this is confident delivery"
Keeps latency low by only scoring the top N candidates

This keeps the system local-first while allowing opt-in AI enhancement.
