Natural Subtitle Segmentation and Splitting without trashing the readability. #829

Open
ankitgurua opened this issue Jun 26, 2024 · 0 comments

Comments

@ankitgurua

I ran into an issue with both Whisper and WhisperX that kills subtitle readability whenever you set length limits: full stops appearing mid-sentence, segments splitting people's names, and random sentence cuts that feel unnatural.

To deal with this, I found a spaCy-based Python script (credit to Glenn Langford) that avoids all of the above while still enforcing length limits. It preserves the readability of the subtitles regardless of your maximum character or maximum line settings. The script shortens subtitles while keeping the natural flow by splitting at punctuation, conjunctions, and other natural break words. It also takes care not to split at nouns, people's names, or city names.
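To illustrate the splitting idea described above, here is a minimal sketch in plain Python. It is not the linked script and does not use spaCy: the `CONJUNCTIONS` word list, the `split_caption` function, and the 42-character default are all assumptions made for illustration, and this simplified heuristic does nothing to protect names or nouns the way a real parser-based approach can.

```python
# Simplified stand-in for parser-based splitting: plain string heuristics only.
PUNCTUATION = ",.;:?!"
# Hypothetical word list; a spaCy-based script would use part-of-speech tags.
CONJUNCTIONS = {"and", "but", "or", "so", "because", "while", "although"}


def _best_cut(words, fit):
    """Pick how many words go on the current line (at most `fit`)."""
    lowest = fit // 2  # avoid very short lines by searching the later half only
    # Preference 1: the latest break right after a punctuation mark.
    for i in range(fit, lowest, -1):
        if words[i - 1][-1] in PUNCTUATION:
            return i
    # Preference 2: the latest break right before a conjunction.
    for i in range(fit, lowest, -1):
        if words[i].lower() in CONJUNCTIONS:
            return i
    # Fallback: break at the last word that fits.
    return fit


def split_caption(text, max_chars=42):
    """Split text into lines of at most max_chars at natural break points."""
    words = text.split()
    lines = []
    while words:
        # Count how many leading words fit within max_chars.
        fit, length = 0, 0
        for word in words:
            extra = len(word) + (1 if fit else 0)
            if fit and length + extra > max_chars:
                break
            length += extra
            fit += 1
        if fit == len(words):
            lines.append(" ".join(words))
            break
        cut = _best_cut(words, fit)
        lines.append(" ".join(words[:cut]))
        words = words[cut:]
    return lines


if __name__ == "__main__":
    sentence = ("I went to the market yesterday, and I bought some apples "
                "because they were cheap.")
    for line in split_caption(sentence):
        print(line)
```

With the sample sentence, the first line ends at the comma and the second breaks before "because" rather than cutting at the raw character limit.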

But there's a problem: this script only works with Whisper. When I tried running it on WhisperX JSON output, it immediately threw errors. I understand this is because of structural differences between the WhisperX and Whisper JSON output. But I would really like to run this script with WhisperX, because the timestamps from the original Whisper give me headaches.
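One possible workaround is to reshape the WhisperX JSON into something closer to Whisper's layout before running the script. The sketch below is untested against subwisp and rests on assumptions: that each WhisperX segment has `start`, `end`, `text`, and a `words` list whose entries use `score` where Whisper uses `probability`; that some words may lack timestamps when alignment fails; and that the script wants Whisper-style words with a leading space. The function name `whisperx_to_whisper` is made up for this example.

```python
def whisperx_to_whisper(data):
    """Reshape a WhisperX result dict into a Whisper-like dict.

    Assumed (not confirmed) target layout: each segment carries "id",
    "start", "end", "text", and "words"; each word carries "word",
    "start", "end", and "probability".
    """
    segments = []
    for seg_id, seg in enumerate(data.get("segments", [])):
        words = []
        prev_end = seg["start"]
        for w in seg.get("words", []):
            # Words that could not be aligned may have no timestamps;
            # reuse the previous word's end time as a stand-in.
            start = w.get("start", prev_end)
            end = w.get("end", start)
            words.append({
                # Whisper words usually begin with a space.
                "word": " " + w["word"].strip(),
                "start": start,
                "end": end,
                # WhisperX reports "score"; Whisper reports "probability".
                "probability": w.get("score", 0.0),
            })
            prev_end = end
        segments.append({
            "id": seg_id,
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"],
            "words": words,
        })
    return {
        "text": " ".join(s["text"].strip() for s in segments),
        "segments": segments,
        "language": data.get("language", "en"),
    }
```

To use it, load the WhisperX file with `json.load`, pass the result through `whisperx_to_whisper`, and write it back out with `json.dump` before running the script on the converted file. If the script reads other fields, those would need to be added as well.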

If you want to run this script with the original Whisper, follow these steps:

Install Python
pip install -U pip setuptools wheel
pip install -U 'spacy[cuda11x]'
python -m spacy download en_core_web_trf
Run this Python script with the Whisper JSON file in the same directory:
(https://gist.githubusercontent.com/glangford/a2b24ffd92c832c60e1b1b49da1a8b27/raw/c588b33d2598f7ef92a26edf3dc314d119a70602/subwisp.py)
python3 -m subwisp input.json >output.srt
