A distributed system for summarizing long transcripts using multiple LLMs in parallel. The project handles large transcripts by breaking them into manageable chunks that preserve context and timing information, then processing those chunks with LLMs to generate a comprehensive summary.
Summarizing very long transcripts (like interviews, speeches, or meetings) is challenging because:
- Large language models (LLMs) have token limits that prevent processing entire transcripts at once
- Simply splitting the text arbitrarily can lose important context and coherence
- Timestamps and speaker information need to be preserved for accurate summaries
This system addresses these challenges by intelligently preprocessing transcripts, chunking them based on linguistic and semantic boundaries, processing chunks in parallel with multiple LLMs, and then aggregating the results into a cohesive summary.
The system follows a modular pipeline architecture:
Input Transcript → Preprocessor → Chunker → LLM Executor (parallel) → Result Aggregator → Final Summary
Sentence detection feeds the Chunker, and the Prompt Manager feeds the LLM Executor.
- ✅ Preprocessor: Cleans and prepares transcript data, combines segments, handles timestamps
- ✅ Big Chunkeroosky: Splits preprocessed transcript into chunks respecting sentence boundaries and token limits
- ✅ LLM Executor: Distributes chunks to multiple LLMs in parallel with support for OpenAI and Anthropic
- ✅ Result Aggregator: Combines individual chunk summaries into a coherent whole
- 🚧 Prompt Manager: Provides customizable prompts for different summarization tasks
Legend: ✅ Implemented | 🚧 Coming Soon
Handles the initial processing of raw transcript data:
- Cleans text and normalizes formatting
- Converts timestamps to readable format (HH:MM:SS), as sketched below
- Combines consecutive segments from the same speaker (with configurable limits)
- Aggregates segments into time intervals if needed
- Preserves detailed timing information for all processing steps
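The HH:MM:SS conversion itself is simple; a minimal sketch (the helper name is illustrative, not necessarily the module's actual API):

def format_timestamp(seconds: float) -> str:
    # Convert raw seconds (e.g. 3725.5) into an HH:MM:SS string.
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

# format_timestamp(3725.5) -> "01:02:05"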
Output Format: A list of dictionaries, where each dictionary represents a processed segment with:
- Start/end times (both raw seconds and formatted timestamps)
- Speaker information
- Text content (with embedded timestamps)
- Metadata about original segments
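For illustration, a single processed segment might look like this (key names are examples, not the preprocessor's exact schema):

{
    "start": 0.0,
    "end": 25.52,
    "start_formatted": "00:00:00",
    "end_formatted": "00:00:25",
    "speaker": "SPEAKER_00",
    "text": "[00:00:00] SPEAKER_00: Example text...",
    "original_segments": 3
}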
Divides preprocessed segments into chunks suitable for LLM processing:
- Splits content based on token limits of target LLMs
- Respects sentence boundaries to avoid cutting sentences in half (see the sketch below)
- Handles extremely long segments by splitting at natural points
- Adds contextual information to each chunk including timing, speaker, and position data
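The core strategy is a greedy fill that never splits a sentence; a minimal sketch, assuming a rough characters-per-token estimate (a real implementation would count tokens with the target model's tokenizer, e.g. tiktoken, and would also split sentences that alone exceed the limit):

import re

def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token); swap in a real tokenizer for production.
    return max(1, len(text) // 4)

def chunk_by_sentences(text: str, max_tokens: int) -> list:
    # Split on sentence-ending punctuation so no sentence is cut in half.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = estimate_tokens(sentence)
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks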
Output Format: A list of chunk dictionaries, where each contains:
- Multiple segments with timing information
- Token count tracking
- Position metadata (chunk index, percentage through transcript)
- Context headers for LLM processing
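A single chunk might look roughly like this (key names are illustrative):

{
    "chunk_index": 0,
    "total_chunks": 12,
    "position_percent": 0.0,
    "start_time": 0.0,
    "end_time": 992.4,
    "token_count": 3874,
    "context_header": "Chunk 1 of 12 | 00:00:00 - 00:16:32 | Speaker: SPEAKER_00",
    "segments": ["..."]
}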
Handles parallel processing of chunks:
- Supports multiple LLM providers (OpenAI and Anthropic)
- Manages async requests with rate limiting and semaphores (see the sketch below)
- Implements robust error handling and automatic retries
- Tracks token usage and estimated costs
- Provides mock responses for development without API keys
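The rate limiting follows the standard asyncio semaphore pattern; a sketch of the idea (call_llm is a stand-in for the real provider call, not the project's actual function):

import asyncio

async def process_chunks(chunks, max_concurrent=10):
    # The semaphore caps how many requests are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_one(chunk):
        async with semaphore:
            return await call_llm(chunk)  # stand-in for the provider API call

    # All chunks are scheduled in parallel, but at most
    # max_concurrent requests run at any moment.
    return await asyncio.gather(*(process_one(c) for c in chunks))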
Output Format: Processed chunks with added summary data including:
- Generated summary content
- Token usage statistics
- Processing metadata (model used, cost, etc.)
Combines individual chunk summaries into a coherent final summary:
- Makes direct API calls to LLMs for reliable summaries
- Supports both single-pass and hierarchical aggregation for large documents (see the sketch below)
- Eliminates redundancy and ensures coherent narrative flow
- Preserves key insights, quotes, and themes from all chunks
- Structures output with consistent sections (Overview, Main Topics, Key Points, Notable Quotes)
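Hierarchical aggregation is effectively a recursive reduction over the chunk summaries; a sketch, with summarize() standing in for a real LLM call:

async def aggregate(summaries, group_size=10):
    # Single-pass: all summaries fit in one request.
    if len(summaries) <= group_size:
        return await summarize("\n\n".join(summaries))
    # Hierarchical: summarize groups first, then recurse on the
    # intermediate results until one summary remains.
    groups = [summaries[i:i + group_size] for i in range(0, len(summaries), group_size)]
    intermediate = [await summarize("\n\n".join(g)) for g in groups]
    return await aggregate(intermediate, group_size)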
Manages prompts for different summarization tasks:
- ✅ Provides default prompt templates
- ✅ Allows loading custom prompts from files
- ✅ Supports dedicated system prompts for controlling AI behavior
- ✅ Includes sample prompts for different use cases (analytical, video editing, etc.)
- ✅ Enables command-line selection of prompt files
- ✅ Supports custom aggregator prompts for tailored final summaries
- ✅ Preserves intermediate chunk summaries for detailed analysis
The /prompts directory contains sample prompt templates for different use cases:
- analytical_prompt.txt: Focuses on critical analysis of arguments and evidence
- video_editor_prompt.txt: Specialized for video editing with detailed timestamps
- video_editor_system.txt: System prompt that sets the AI's persona as a video editor
- video_editor_aggregator.txt: Aggregator prompt that preserves timestamps in the final summary
- academic_system.txt: System prompt for scholarly, academic-style analysis
- accessibility_system.txt: System prompt for clear, accessible summaries
| Prompt Type | Purpose | When Applied | CLI Argument |
|---|---|---|---|
| Regular Prompt | Main instructions for processing each chunk | Individual chunks | --prompt-file |
| System Prompt | Sets the tone, personality and style | Individual chunks | --system-prompt-file |
| Aggregator Prompt | Controls how chunks are combined | Final aggregation | --aggregator-prompt-file |
- Regular prompts: Use {transcript} as a placeholder for the transcript content
- Aggregator prompts: Use {summaries} as a placeholder for the list of summaries to combine
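Internally this is presumably plain Python string formatting, along these lines (a hypothetical illustration, not the project's exact code):

prompt = prompt_template.format(transcript=chunk_text)
final_prompt = aggregator_template.format(summaries="\n\n".join(chunk_summaries))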
Saving intermediate chunks provides detailed summaries with timestamps before they're aggregated into the final summary. This is especially useful for video editing workflows where detailed timestamp information is critical.
{
"timestamp": "2025-05-15 20:25:25",
"chunks": [
{
"chunk_index": 0,
"start_time": 0.0,
"end_time": 992.4,
"summary": "### TIMELINE SUMMARY\n[00:00] - Speaker introduces...",
"tokens_used": 4947
}
]
}

1. Input: JSON transcript with segments containing:
   { "segments": [ { "start": 0.0, "end": 25.52, "text": "Example text", "speaker": "SPEAKER_00" }, ... ] }
2. Preprocessing: Segments are cleaned, combined, and enriched with timestamp information
3. Chunking: Preprocessed segments are divided into chunks based on token limits and sentence boundaries
4. LLM Processing: Each chunk is processed by an LLM to generate a summary
5. Aggregation: Individual summaries are combined into a final, coherent summary
The simplest way to use the transcript summarizer is through the main.py script, which provides a command-line interface to the entire pipeline:
# Basic usage
python main.py --input transcript.json --output summary.txt
# Limit segments (for testing or cost control)
python main.py --input transcript.json --output summary.txt --limit-segments 100
# Use a different model or provider
python main.py --input transcript.json --output summary.txt --provider anthropic --model claude-3-sonnet-20240229
# Generate detailed report with processing stats
python main.py --input transcript.json --output summary.txt --report
# Customize chunking parameters
python main.py --input transcript.json --output summary.txt --max-tokens-per-chunk 3000 --max-segment-duration 90
# Use custom prompt templates
python main.py --input transcript.json --output summary.txt --prompt-file prompts/analytical_prompt.txt
# Use both custom prompt and system prompt
python main.py --input transcript.json --output summary.txt --prompt-file prompts/video_editor_prompt.txt --system-prompt-file prompts/video_editor_system.txt
# Use custom aggregator prompt
python main.py --input transcript.json --output summary.txt --prompt-file prompts/video_editor_prompt.txt --aggregator-prompt-file prompts/video_editor_aggregator.txt
# Save intermediate chunk summaries (before aggregation)
python main.py --input transcript.json --output summary.txt --save-chunks chunks_output.json
# Full video editor workflow with all features
python main.py --input transcript.json --output summary.txt --prompt-file prompts/video_editor_prompt.txt --system-prompt-file prompts/video_editor_system.txt --aggregator-prompt-file prompts/video_editor_aggregator.txt --save-chunks chunks_output.json

Run python main.py --help for a full list of options.
You can also use the transcript summarizer programmatically in your Python code:
import asyncio
import json
from main import TranscriptSummarizer
async def summarize_my_transcript(transcript_path, output_path):
# Load transcript
with open(transcript_path, 'r', encoding='utf-8') as f:
transcript_data = json.load(f)
# Create summarizer with desired config
summarizer = TranscriptSummarizer(
provider="openai",
model="gpt-4o-mini", # Specify model or leave as None to use default from .env
max_tokens_per_chunk=4000,
max_concurrent_requests=5,
hierarchical_aggregation=True
)
# Process transcript
result = await summarizer.summarize(
transcript_data,
merge_same_speaker=True,
max_segment_duration=120, # 2 minutes max per segment
# Custom prompt template (optional)
prompt_template="""Please summarize this transcript segment with attention to key points and important quotes:
{transcript}
Provide your summary in this format:
1. Main Points:
2. Key Details:
3. Notable Quotes:""",
# Additional metadata to include in summary
metadata={"title": "My Transcript", "speaker": "John Doe"}
)
# Extract summary and write to file
summary = result["summary"]
with open(output_path, 'w', encoding='utf-8') as f:
f.write(summary)
return result
# Run the summarizer
# asyncio.run(summarize_my_transcript('transcript.json', 'summary.txt'))

While a dedicated prompt manager is still in development, you can already customize the prompts used for summarization in several ways:
Create a prompt file:
# Create a prompt file
echo '
Please analyze the following transcript segment:
{transcript}
Analyze with these sections:
1. Key Topics
2. Main Arguments
3. Evidence Presented
4. Notable Quotes
' > analytical_prompt.txt
# Use the prompt with main script
python main.py --input transcript.json --output summary.txt --prompt-file analytical_prompt.txt

With the TranscriptSummarizer class:
summarizer = TranscriptSummarizer()
result = await summarizer.summarize(
transcript_data,
prompt_template="""Summarize this transcript focusing on the emotional tone:
{transcript}
Include sections on:
1. Overall emotional themes
2. Key emotional moments
3. Relationship dynamics"""
)

For advanced customization:
# Process with custom prompt
executor = LLMExecutor(provider="openai", model="gpt-4o-mini")
processed_chunks = await executor.process_chunks(
chunks,
"""Create a creative summary of this transcript segment as if it were a movie scene:
{transcript}
Include: setting, characters, dialogue highlights, and mood."""
)

Environment variables can be configured in a .env file. See .env.template for an example with all available options.
# Provider Selection
DEFAULT_PROVIDER=openai # Options: openai, anthropic
# OpenAI Configuration
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_ORG_ID=your_openai_org_id_here # Optional
OPENAI_MODEL=gpt-4o-mini # Options: gpt-3.5-turbo, gpt-4, gpt-4o-mini, etc.
# Anthropic Configuration
ANTHROPIC_API_KEY=your_anthropic_api_key_here
ANTHROPIC_MODEL=claude-3-sonnet-20240229
# Request Configuration
MAX_CONCURRENT_REQUESTS=10 # Maximum parallel requests to LLM API
TEMPERATURE=1.0 # Controls creativity of outputs (0.0-2.0)
MAX_TOKENS=4000 # Maximum tokens in LLM responses
REQUEST_TIMEOUT=60 # Timeout for API requests in seconds
Preprocessor options:

- merge_same_speaker: Whether to combine consecutive segments from the same speaker
- time_interval_seconds: If set, aggregates segments into fixed time intervals
- max_segment_duration: Maximum duration for combined segments (in seconds)
- preserve_timestamps: Whether to include original timestamps in text

Chunker options:

- max_tokens_per_chunk: Maximum tokens per chunk for the target LLM
- overlap_tokens: Number of tokens to overlap between chunks (for context)
- context_tokens: Tokens reserved for metadata and context headers

LLM executor options:
- Provider selection: Choose between OpenAI and Anthropic
- Model selection: Specify which model to use for each provider
- Concurrency controls: Limit parallel API requests
- Error handling: Configure retry attempts and delay
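The retry behavior is the usual exponential-backoff pattern; a sketch (names and defaults are illustrative, not the project's actual API):

import asyncio

async def call_with_retries(make_request, max_retries=3, base_delay=1.0):
    # Retry transient API failures, doubling the delay after each attempt.
    for attempt in range(max_retries + 1):
        try:
            return await make_request()
        except Exception:
            if attempt == max_retries:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt))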
The system is designed to be extended in several ways:
You can customize how segments are processed by modifying the preprocessor parameters or extending the module with new functions. The preprocessor supports different aggregation strategies:
# Custom preprocessing with specific parameters
processed_segments = preprocess_transcript(
transcript_data['segments'],
merge_same_speaker=True, # Combine consecutive segments from same speaker
max_segment_duration=180, # 3 minutes max per segment
preserve_timestamps=True, # Include original timestamps in text
time_interval_seconds=None # Don't use time-interval aggregation
)

The LLM executor supports different LLM providers through a standardized interface:
# Use OpenAI
executor_openai = LLMExecutor(provider="openai", model="gpt-4o-mini")
# Use Anthropic
executor_anthropic = LLMExecutor(provider="anthropic", model="claude-3-sonnet-20240229")

The Big Chunkeroosky class can be configured with different chunking parameters:
# Configure chunking with specific parameters
chunker = BigChunkeroosky(
max_tokens_per_chunk=4000, # Max tokens per chunk
overlap_tokens=200, # Overlap between chunks for context
context_tokens=150 # Reserved tokens for metadata
)

You can create custom prompts for different summarization needs:
# Different prompt templates for different purposes
summary_prompt = """
Please summarize the following transcript segment,
focusing on the main points and key ideas.
{transcript}
"""
analysis_prompt = """
Please analyze the following transcript segment,
identifying themes, insights, and notable statements.
{transcript}
"""The modular design allows for easy integration with other systems and pipelines.