Add on-device audio transcription with speaker diarization #1

Open
ldt wants to merge 2 commits into main from claude/audio-transcription-research-Y0pM2

Conversation


@ldt ldt commented Mar 30, 2026

Summary

Adds automatic post-recording transcription with speaker diarization to TabRecord, using WhisperKit and SpeakerKit Swift packages. All processing runs on-device using Apple Neural Engine acceleration — no audio is sent to external servers.

Key Changes

  • TranscriptionCoordinator: Central orchestrator that manages the full pipeline (model download → audio prep → ASR → diarization → alignment → file output) and reports progress to the UI
  • ModelManager: Actor that handles CoreML model download and caching for WhisperKit and SpeakerKit models in ~/Library/Application Support/TabRecord/Models/
  • AudioPreprocessor: Extracts the selected audio track from multi-track MP4 recordings and resamples to 16 kHz mono PCM (Whisper's required format)
  • WhisperKitTranscriber: Thin wrapper around WhisperKit SPM that produces word-timestamped segments with language detection
  • SpeakerKitDiarizer: Thin wrapper around SpeakerKit SPM that produces speaker-labeled time segments using pyannote v4 models
  • SegmentAligner: Merges word-level transcription with diarization segments to produce a speaker-attributed transcript
  • TranscriptWriter: Serializes aligned segments to .txt, .json, and .srt output formats
  • TranscriptionPreferences: @AppStorage-backed model for user preferences (model variant, language, speaker count, audio source, output formats)
  • PreferencesView: SwiftUI preferences panel with Transcription tab (macOS 14+)
  • MenuBarController: Updated to trigger transcription after recording stops and display transcription progress in the menu
  • AppDelegate: Updated to request notification permissions for transcription completion/failure alerts
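The SegmentAligner's merge step can be sketched roughly as follows. This is a simplified illustration of midpoint-based word-to-speaker assignment with consecutive-speaker grouping; the types and names here are hypothetical, not the PR's exact code:

```swift
import Foundation

// Hypothetical simplified types standing in for the PR's real ones.
struct Word { let text: String; let start: Double; let end: Double }
struct SpeakerSegment { let speaker: String; let start: Double; let end: Double }
struct AlignedSegment { let speaker: String; var text: String; let start: Double; var end: Double }

/// Assigns each word to the diarization segment containing the word's midpoint,
/// then groups consecutive same-speaker words into one transcript segment.
func align(words: [Word], speakers: [SpeakerSegment]) -> [AlignedSegment] {
    var result: [AlignedSegment] = []
    for word in words {
        let mid = (word.start + word.end) / 2
        let speaker = speakers.first { mid >= $0.start && mid < $0.end }?.speaker ?? "Unknown"
        if var last = result.last, last.speaker == speaker {
            // Same speaker as the previous word: extend the current segment.
            last.text += " " + word.text
            last.end = word.end
            result[result.count - 1] = last
        } else {
            result.append(AlignedSegment(speaker: speaker, text: word.text, start: word.start, end: word.end))
        }
    }
    return result
}
```

Midpoint assignment is robust to small boundary disagreements between the ASR and diarization models, since a word straddling a speaker change is attributed to whichever side holds most of it.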

Implementation Details

  • Hardware utilization: Uses cpuAndNeuralEngine compute units to maximize ANE throughput on Apple Silicon (encoder on ANE, decoder on CPU)
  • Privacy: Fully on-device processing; models cached locally after first download (~1.5 GB for large-v3-turbo)
  • Background processing: Pipeline runs at .utility priority on a detached Task to keep UI responsive
  • Error handling: Graceful failure with user notifications; recording files are preserved if transcription fails
  • Minimum macOS: Requires macOS 14.0+ for transcription (recording still works on 13.0+)
  • Output: Transcript files saved alongside recordings with speaker labels and timestamps
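The hardware and scheduling choices above amount to something like the following configuration sketch (WhisperKit exposes its own configuration API; the commented entry point is illustrative, not the PR's actual call):

```swift
import CoreML

// Compute-unit choice described above: the Whisper encoder runs on the
// Apple Neural Engine while the decoder falls back to CPU.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

// Run the whole pipeline off the main actor at low priority so the
// menu bar UI stays responsive during long transcriptions.
Task.detached(priority: .utility) {
    // await coordinator.transcribe(recording)  // hypothetical entry point
}
```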

Testing

Comprehensive unit tests added for:

  • ModelManager (cache hits, disk space checks, model listing)
  • AudioPreprocessor (track extraction, resampling, temp file cleanup)
  • SegmentAligner (empty inputs, single/multi-speaker alignment, boundary cases)
  • TranscriptWriter (TXT/JSON/SRT formatting, timestamp conversion)
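As one example of what the TranscriptWriter tests cover, SRT output requires converting seconds to the `HH:MM:SS,mmm` timestamp format. A minimal sketch of that conversion (the function name is illustrative):

```swift
import Foundation

/// Converts a time in seconds to the SRT timestamp format HH:MM:SS,mmm.
func srtTimestamp(_ seconds: Double) -> String {
    let totalMillis = Int((seconds * 1000).rounded())
    let h = totalMillis / 3_600_000
    let m = (totalMillis / 60_000) % 60
    let s = (totalMillis / 1000) % 60
    let ms = totalMillis % 1000
    // SRT uses a comma (not a period) before the milliseconds field.
    return String(format: "%02d:%02d:%02d,%03d", h, m, s, ms)
}
```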

https://claude.ai/code/session_01DkKXRRkUErWxE49oAvcxwE

claude added 2 commits March 30, 2026 05:32
Adds a complete Software Design Document under .kiro/specs/ covering:

- requirements.md: 8 user stories with acceptance criteria (auto-transcription,
  diarization, output formats, audio source selection, language support,
  on-device privacy, M3 Max performance, preferences)

- design.md: Full technical design recommending WhisperKit + SpeakerKit
  (Swift SPM, CoreML/ANE) as the optimal stack for M3 Max. Includes
  component architecture, data flow diagram, all struct/actor interfaces,
  output formats (TXT/JSON/SRT), error handling, and performance targets
  (~3.5 min for 60-min audio on M3 Max).

- tasks.md: 15 implementation tasks across 4 phases (~31h total effort)
  with dependency graph and acceptance criteria per task.

Research basis: benchmarked WhisperKit (2.2% WER, RTF 0.05x on M3 Max ANE)
vs mlx-whisper, WhisperX, Scribe, Parakeet-MLX, and Apple SpeechAnalyzer.

https://claude.ai/code/session_01DkKXRRkUErWxE49oAvcxwE
Full on-device transcription pipeline using WhisperKit (Whisper large-v3-turbo,
CoreML/ANE) + SpeakerKit (pyannote v4, CoreML/ANE) on Apple Silicon.

New files:
- TranscriptionPreferences.swift  — @AppStorage-backed settings model
- RecordingFiles.swift            — shared value type grouping recording output URLs
- ModelManager.swift              — actor; downloads/caches CoreML models in App Support
- AudioPreprocessor.swift         — extracts audio track → 16 kHz mono WAV for Whisper
- WhisperKitTranscriber.swift     — thin actor wrapper around WhisperKit SPM
- SpeakerKitDiarizer.swift        — thin actor wrapper around SpeakerKit SPM
- SegmentAligner.swift            — midpoint-based word→speaker assignment + grouping
- TranscriptWriter.swift          — writes TXT, JSON, SRT transcript files to disk
- TranscriptionCoordinator.swift  — @MainActor pipeline orchestrator; UNNotifications
- PreferencesView.swift           — SwiftUI preferences panel (macOS 14+)

Modified files:
- MenuBarController.swift  — integrates coordinator; adds transcription menu items
- AppDelegate.swift        — notification centre setup, show-in-finder / retry actions
- OutputFileNamer.swift    — adds makeTranscriptURL(date:ext:)
- project.yml              — adds WhisperKit + SpeakerKit SPM packages
- Info.plist               — version bump 1.0.4 → 1.1.0
- release.yml              — SPM package cache step
- README.md                — documents transcription feature, output formats, privacy

Tests:
- SegmentAlignerTests.swift   — 10 unit tests covering alignment logic and edge cases
- TranscriptWriterTests.swift — 16 unit tests covering TXT/JSON/SRT format correctness

Design: .kiro/specs/audio-transcription-diarization/ (requirements, design, tasks)
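The ModelManager's cache-before-download behavior might be sketched like this. An actor serializes access so concurrent callers never race on the same model directory; the paths, type name, and methods here are hypothetical, not the PR's exact code:

```swift
import Foundation

/// Sketch of the ModelManager caching idea: check the local cache directory
/// before downloading. Names and layout are illustrative only.
actor ModelCache {
    /// Cache root, e.g. ~/Library/Application Support/TabRecord/Models/
    let root: URL

    init(root: URL) { self.root = root }

    /// Where a given model variant lives on disk once downloaded.
    nonisolated func localURL(for model: String) -> URL {
        root.appendingPathComponent(model, isDirectory: true)
    }

    /// Returns the cached model directory, or nil if it must be downloaded first.
    func cachedModel(named model: String) -> URL? {
        let url = localURL(for: model)
        return FileManager.default.fileExists(atPath: url.path) ? url : nil
    }
}
```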

https://claude.ai/code/session_01DkKXRRkUErWxE49oAvcxwE