Add language auto-detection #4

Merged
TroyHernandez merged 1 commit into main from feature/language-detection on Mar 13, 2026

@TroyHernandez
Contributor

Summary

  • transcribe() now defaults to language = NULL, auto-detecting the spoken language from audio before decoding
  • New exported detect_language() for standalone language identification (returns language code + top-k probabilities)
  • Refactored language token table into shared whisper_language_table() with reverse lookup via whisper_lang_from_id()
  • Bumps version to 0.3.0

Breaking: previous default was language = "en". Pass language = "en" explicitly to restore old behavior.
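
The shared table with reverse lookup described in the summary could look roughly like the sketch below. This is not the package's code: the function names ending in `_sketch` are hypothetical, only four languages are listed, and the token IDs are illustrative placeholders rather than values taken from this PR.

```r
# Hypothetical sketch of a language table with forward and reverse lookup.
# The four token IDs below are illustrative placeholders, not package values.
lang_table_sketch <- function() {
  data.frame(
    code = c("en", "zh", "de", "es"),
    token_id = c(50259L, 50260L, 50261L, 50262L),
    stringsAsFactors = FALSE
  )
}

# Forward lookup: language code -> token ID.
lang_to_id_sketch <- function(code) {
  tbl <- lang_table_sketch()
  idx <- match(code, tbl$code)
  if (is.na(idx)) stop("Unknown language code: ", code)
  tbl$token_id[idx]
}

# Reverse lookup: token ID -> language code.
lang_from_id_sketch <- function(token_id) {
  tbl <- lang_table_sketch()
  idx <- match(token_id, tbl$token_id)
  if (is.na(idx)) stop("Not a language token: ", token_id)
  tbl$code[idx]
}

# Round-trip check: every code must survive code -> ID -> code.
codes <- lang_table_sketch()$code
round_trip <- vapply(
  codes,
  function(cc) lang_from_id_sketch(lang_to_id_sketch(cc)),
  character(1)
)
stopifnot(identical(unname(round_trip), codes))
```

Keeping both directions on one table means a round-trip test like the one above can catch a code and an ID drifting apart.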

Test plan

  • detect_language() returns "en" (97.7%) on JFK audio
  • detect_language() returns "es" (93.2%) on Allende audio
  • transcribe() with language = NULL auto-detects and transcribes correctly
  • Long audio (30.6 s, 2 chunks): language is detected once and propagated to all chunks
  • 199 unit tests pass (105 new language tests including 99 round-trip checks)
  • R CMD check: 0 errors, 0 warnings, 1 NOTE (timestamp-related, not a real issue)
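
The "detect once, propagate to all chunks" behavior from the test plan can be sketched as follows. This is a minimal illustration, not the package's implementation: `transcribe_chunks_sketch()`, `detect_fn`, and `decode_fn` are hypothetical names, and running detection on the first chunk is an assumption.

```r
# Hypothetical sketch: resolve the language once, then reuse it for every chunk.
# detect_fn and decode_fn stand in for the real detection and decoding steps.
transcribe_chunks_sketch <- function(chunks, language = NULL,
                                     detect_fn, decode_fn) {
  if (is.null(language)) {
    # Assumption: detection runs on the first chunk only.
    language <- detect_fn(chunks[[1]])
  }
  texts <- vapply(chunks, function(ch) decode_fn(ch, language), character(1))
  list(language = language, text = paste(texts, collapse = " "))
}

# Stub detector that counts how often it is called.
calls <- 0L
fake_detect <- function(chunk) {
  calls <<- calls + 1L
  "es"
}
fake_decode <- function(chunk, language) paste0("[", language, "] ", chunk)

out <- transcribe_chunks_sketch(
  list("chunk one", "chunk two"),
  detect_fn = fake_detect,
  decode_fn = fake_decode
)
stopifnot(calls == 1L, out$language == "es")
```

Passing an explicit `language` skips the detector entirely, which mirrors how `language = "en"` restores the old behavior.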

transcribe() now defaults to language = NULL, which detects the spoken
language from the audio before decoding. New exported detect_language()
for standalone identification. Detection feeds the SOT (start-of-transcript)
token to the decoder and reads the language-token logits (99 languages supported).

Breaking: previous default was language = "en".
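
The logit-reading step in the commit message can be illustrated with a small base-R sketch. It assumes the decoder has already produced one logit per vocabulary entry after the SOT token and that the softmax is taken over the language tokens only. The function name, its arguments, and the toy six-token vocabulary are all hypothetical, not the package's API.

```r
# Hypothetical sketch: pick a language from decoder logits.
# logits     : numeric vector, one entry per vocabulary token
# lang_ids   : 1-based positions of the language tokens in `logits`
# lang_codes : language codes in the same order as lang_ids
detect_from_logits_sketch <- function(logits, lang_ids, lang_codes,
                                      top_k = 3L) {
  lang_logits <- logits[lang_ids]
  # Numerically stable softmax restricted to the language tokens.
  probs <- exp(lang_logits - max(lang_logits))
  probs <- probs / sum(probs)
  names(probs) <- lang_codes
  # Keep the top_k most probable languages, highest first.
  ord <- order(probs, decreasing = TRUE)
  top <- probs[ord[seq_len(min(top_k, length(probs)))]]
  list(language = names(top)[1], probabilities = top)
}

# Toy vocabulary of 6 tokens; positions 3 to 5 play the role of language tokens.
logits <- c(9.0, -1.0, 2.0, 0.5, 4.0, 7.0)
res <- detect_from_logits_sketch(
  logits,
  lang_ids = 3:5,
  lang_codes = c("en", "de", "es"),
  top_k = 2L
)
print(res$language)
print(res$probabilities)
```

Restricting the softmax to language tokens keeps high-scoring non-language tokens (positions 1 and 6 in the toy vector) from being chosen as the language.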
@TroyHernandez TroyHernandez merged commit 7730967 into main Mar 13, 2026
3 of 4 checks passed
@TroyHernandez TroyHernandez deleted the feature/language-detection branch March 13, 2026 17:39