Add on-device audio transcription with speaker diarization (#1)
Adds a complete Software Design Document under .kiro/specs/ covering:

- requirements.md: 8 user stories with acceptance criteria (auto-transcription, diarization, output formats, audio source selection, language support, on-device privacy, M3 Max performance, preferences)
- design.md: full technical design recommending WhisperKit + SpeakerKit (Swift SPM, CoreML/ANE) as the optimal stack for M3 Max. Includes component architecture, data-flow diagram, all struct/actor interfaces, output formats (TXT/JSON/SRT), error handling, and performance targets (~3.5 min for 60-min audio on M3 Max)
- tasks.md: 15 implementation tasks across 4 phases (~31 h total effort) with a dependency graph and acceptance criteria per task

Research basis: benchmarked WhisperKit (2.2% WER, RTF 0.05x on M3 Max ANE) against mlx-whisper, WhisperX, Scribe, Parakeet-MLX, and Apple SpeechAnalyzer.

https://claude.ai/code/session_01DkKXRRkUErWxE49oAvcxwE
Full on-device transcription pipeline using WhisperKit (Whisper large-v3-turbo, CoreML/ANE) + SpeakerKit (pyannote v4, CoreML/ANE) on Apple Silicon.

New files:

- TranscriptionPreferences.swift — @AppStorage-backed settings model
- RecordingFiles.swift — shared value type grouping recording output URLs
- ModelManager.swift — actor; downloads/caches CoreML models in App Support
- AudioPreprocessor.swift — extracts audio track → 16 kHz mono WAV for Whisper
- WhisperKitTranscriber.swift — thin actor wrapper around the WhisperKit SPM package
- SpeakerKitDiarizer.swift — thin actor wrapper around the SpeakerKit SPM package
- SegmentAligner.swift — midpoint-based word→speaker assignment + grouping
- TranscriptWriter.swift — writes TXT, JSON, SRT transcript files to disk
- TranscriptionCoordinator.swift — @MainActor pipeline orchestrator; UNNotifications
- PreferencesView.swift — SwiftUI preferences panel (macOS 14+)

Modified files:

- MenuBarController.swift — integrates coordinator; adds transcription menu items
- AppDelegate.swift — notification-center setup, show-in-Finder / retry actions
- OutputFileNamer.swift — adds makeTranscriptURL(date:ext:)
- project.yml — adds WhisperKit + SpeakerKit SPM packages
- Info.plist — version bump 1.0.4 → 1.1.0
- release.yml — SPM package cache step
- README.md — documents transcription feature, output formats, privacy

Tests:

- SegmentAlignerTests.swift — 10 unit tests covering alignment logic and edge cases
- TranscriptWriterTests.swift — 16 unit tests covering TXT/JSON/SRT format correctness

Design: .kiro/specs/audio-transcription-diarization/ (requirements, design, tasks)

https://claude.ai/code/session_01DkKXRRkUErWxE49oAvcxwE
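The midpoint-based word→speaker assignment mentioned for SegmentAligner.swift can be sketched as follows. This is a simplified illustration, not the actual implementation: the `Word` and `SpeakerTurn` types stand in for WhisperKit word timings and SpeakerKit diarization segments, and the gap-fallback heuristic is an assumption.

```swift
import Foundation

// Hypothetical simplified types standing in for WhisperKit word timings
// and SpeakerKit diarization output.
struct Word { let text: String; let start: Double; let end: Double }
struct SpeakerTurn { let speaker: String; let start: Double; let end: Double }

/// Assigns each word to the speaker turn containing the word's temporal
/// midpoint, then groups consecutive same-speaker words into utterances.
func align(words: [Word], turns: [SpeakerTurn]) -> [(speaker: String, text: String)] {
    var result: [(speaker: String, text: String)] = []
    for word in words {
        let mid = (word.start + word.end) / 2
        // If the midpoint falls in a diarization gap, fall back to the
        // turn whose midpoint is nearest (an assumed heuristic).
        let speaker = turns.first(where: { mid >= $0.start && mid < $0.end })?.speaker
            ?? turns.min(by: {
                abs(($0.start + $0.end) / 2 - mid) < abs(($1.start + $1.end) / 2 - mid)
            })?.speaker
            ?? "UNKNOWN"
        if let last = result.last, last.speaker == speaker {
            // Same speaker as the previous word: extend the utterance.
            result[result.count - 1].text += " " + word.text
        } else {
            result.append((speaker, word.text))
        }
    }
    return result
}
```

Working on midpoints rather than word boundaries avoids ambiguity when a word's start and end straddle two diarization turns.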
Summary
Adds automatic post-recording transcription with speaker diarization to TabRecord, using WhisperKit and SpeakerKit Swift packages. All processing runs on-device using Apple Neural Engine acceleration — no audio is sent to external servers.
Key Changes
- Models are downloaded and cached in ~/Library/Application Support/TabRecord/Models/
- .txt, .json, and .srt output formats
- @AppStorage-backed model for user preferences (model variant, language, speaker count, audio source, output formats)
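A minimal sketch of what such a preferences model might look like. The property names and defaults here are illustrative assumptions; this version reads `UserDefaults` directly, while the app's `@AppStorage` properties in SwiftUI would back onto the same store.

```swift
import Foundation

/// Illustrative preferences model; key names and defaults are assumptions.
/// In a SwiftUI view these keys would typically be exposed via @AppStorage,
/// which reads and writes the same UserDefaults storage.
struct TranscriptionPreferences {
    static let defaults = UserDefaults.standard

    var modelVariant: String {
        get { Self.defaults.string(forKey: "transcription.modelVariant") ?? "large-v3-turbo" }
        set { Self.defaults.set(newValue, forKey: "transcription.modelVariant") }
    }
    var language: String {
        get { Self.defaults.string(forKey: "transcription.language") ?? "auto" }
        set { Self.defaults.set(newValue, forKey: "transcription.language") }
    }
    var outputFormats: [String] {
        get { Self.defaults.stringArray(forKey: "transcription.outputFormats") ?? ["txt"] }
        set { Self.defaults.set(newValue, forKey: "transcription.outputFormats") }
    }
}
```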
Implementation Details

- Uses cpuAndNeuralEngine compute units to maximize ANE throughput on Apple Silicon (encoder on ANE, decoder on CPU)
- Transcription runs with .utility priority on a detached Task to keep the UI responsive
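The detached-task pattern can be sketched like this. `runPipeline` and `startTranscription` are hypothetical stand-ins for the coordinator's actual entry points; the hop back to the main actor for UI work is the relevant structure.

```swift
import Foundation

// Stand-in for the real preprocess → transcribe → diarize → align → write
// pipeline (which would configure CoreML with .cpuAndNeuralEngine units).
func runPipeline(on url: URL) async -> String {
    return "transcript for \(url.lastPathComponent)"
}

// Illustrative launcher: a detached Task at .utility priority keeps heavy
// work off the main actor so the menu-bar UI stays responsive.
func startTranscription(of url: URL) {
    Task.detached(priority: .utility) {
        let transcript = await runPipeline(on: url)
        // Hop back to the main actor for notifications / UI updates.
        await MainActor.run {
            print("Done: \(transcript)")
        }
    }
}
```

`Task.detached` deliberately does not inherit the caller's actor context or priority, which is what prevents a long transcription from inheriting main-actor isolation.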
Testing

Comprehensive unit tests added for:

- SegmentAligner — 10 tests covering alignment logic and edge cases
- TranscriptWriter — 16 tests covering TXT/JSON/SRT format correctness
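As an illustration of the kind of formatting those TranscriptWriter tests would exercise, here is a hedged sketch of SRT cue generation. The helper names are hypothetical, not the actual TranscriptWriter API; the `HH:MM:SS,mmm` timestamp layout follows the standard SRT convention.

```swift
import Foundation

// Hypothetical helper: formats seconds as an SRT timestamp (HH:MM:SS,mmm).
func srtTimestamp(_ seconds: Double) -> String {
    let ms = Int((seconds * 1000).rounded())
    return String(format: "%02d:%02d:%02d,%03d",
                  ms / 3_600_000, (ms / 60_000) % 60, (ms / 1000) % 60, ms % 1000)
}

// Hypothetical helper: builds one numbered SRT cue with a speaker label.
func srtCue(index: Int, start: Double, end: Double, speaker: String, text: String) -> String {
    "\(index)\n\(srtTimestamp(start)) --> \(srtTimestamp(end))\n\(speaker): \(text)\n"
}
```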
https://claude.ai/code/session_01DkKXRRkUErWxE49oAvcxwE