
Feature Suggestion - Existing subtitle text accuracy enhancement #93

Open
codefaux opened this issue Dec 9, 2024 · 3 comments

@codefaux

codefaux commented Dec 9, 2024

Hi there.

Amazing project; it looks like *almost* exactly what I'm looking for, and I think this feature might be worth implementing.

Right now I'm trying to improve the subtitles available for a series. The series has subtitles, but they're very... poor. Timing is only OK, but word accuracy is dog poo. One example, within seconds of the opening of the first episode:

Subtitle:

- I question that proposal.
- What does it matter?

Actual scene:

- Considering the circumstances, I question that proposal at this time.
- What does it matter? We're not-

The problem is that the series is sci-fi, so planet/race/person names and technobabble come up frequently, which means transcription models tend to get too... creative.

EDIT: Forgot to mention: this is why I'm using the original subtitle text as an initial_prompt below. I believe it helps the model match those difficult words to the audio, and in my experience it has.

My current (very WIP) effort uses the .srt to split the audio segment for each subtitle out of the media file, then processes each individual clip with Whisper, using the original subtitle text as initial_prompt and patience=2. The text I'm getting back from Whisper matches the audio very accurately, but I'm fighting with timing issues.
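For reference, the per-cue pipeline described above can be sketched roughly like this. This is a minimal sketch, not the actual WIP code: the file names and the `extract_clip` helper are placeholder assumptions, and the Whisper call (shown in comments) requires the openai-whisper package.

```python
import re

def parse_srt_time(ts: str) -> float:
    """Convert an SRT timestamp like '00:01:02,500' to seconds."""
    h, m, rest = ts.split(":")
    s, ms = rest.split(",")
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_srt(text: str):
    """Yield (start, end, text) tuples from raw SRT content."""
    pattern = re.compile(
        r"(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})\s*\n"
        r"(.*?)(?:\n\n|\Z)",
        re.S,
    )
    for m in pattern.finditer(text):
        yield (parse_srt_time(m.group(1)),
               parse_srt_time(m.group(2)),
               m.group(3).strip())

# Per-cue transcription with the original text as initial_prompt
# (requires openai-whisper; not executed here):
#
#   import whisper
#   model = whisper.load_model("medium")
#   for start, end, original in parse_srt(srt_text):
#       clip = extract_clip("episode.mkv", start, end)  # e.g. ffmpeg -ss/-to
#       result = model.transcribe(clip,
#                                 initial_prompt=original,
#                                 beam_size=5,   # patience requires beam search
#                                 patience=2.0)
```

Note that Whisper's `patience` parameter only takes effect with beam search enabled (`beam_size` set), which is why it appears alongside it above.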

Is this worth implementing here?

Is there a better way?

Can this project already accomplish my goal, but somehow I've missed it?

Thanks for your time.

@baxtree
Owner

baxtree commented Dec 11, 2024

Hi, @codefaux, glad to hear the transcription you are getting via prompting is highly accurate. Currently this project transcribes the whole audio without segmenting it. The enclosing method could be modified to take in the original subtitle cues, in line with your idea, and of course that means the original time codes need to be reasonably accurate. Your approach sounds to me like a promising way to improve subtitle quality, and it would also be interesting to know how well Whisper works on very short audio segments without surrounding context.
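To illustrate the "keep the trusted time codes, swap in the corrected text" idea: a hypothetical sketch (not this project's actual API) that rebuilds an SRT from the original cue timings and the per-cue text returned by the transcriber could look like this:

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 62.5 -> '00:01:02,500'."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def rebuild_srt(cues, new_texts):
    """Keep the original (trusted) time codes but replace each cue's text.

    cues: iterable of (start, end, original_text); new_texts: corrected text
    per cue, in the same order.
    """
    blocks = []
    for i, ((start, end, _), text) in enumerate(zip(cues, new_texts), 1):
        blocks.append(f"{i}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}")
    return "\n\n".join(blocks) + "\n"
```

This sidesteps the timing fight entirely, at the cost of inheriting whatever inaccuracy the original time codes have.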

@codefaux
Author

Hey there. Thanks for your attention. I think there's strong merit here, but after a few days of poking at it I can say there are a few issues, and I'm at my limit on improving the situation.

Using a DVD source whose time codes I would judge to be very accurate (even though the wording is not), I've found a few key things:

  • Significant improvement in word accuracy versus the original subtitles, with typically flawless accuracy versus the audio, except...
  • Issues with names and technobabble when the supplied original subtitle line does not include the word(s). This suggests a transcription model that accepted a supplied list of names etc. would be much better suited to this, if/when such a model exists. This is a big one for sci-fi content especially, and I don't see a way around it for now.
  • Frequent issues breaking subtitles between speakers, when one subtitle segment contains two voices speaking separate lines in sequence, even when the input text properly differentiates them and they are clearly different voices (for example, a switch between male and female voices of distinct vocal quality). Obviously not a huge deal; time, model improvements, and perhaps some parameter tuning can help with this.
  • Quite significant hallucination issues when word fragments are present in the clips, which is probably what you were alluding to regarding reasonably accurate time codes. This might be worked around with intelligent silence detection around splits and by combining nearby subtitle groups, but realistically the effort to build that framework would be pretty monumental, for questionable gains.
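On the last point, a cheap partial mitigation of the "combine nearby subtitle groups" idea can be sketched without any silence detection: merge cues separated by tiny gaps before cutting the audio, so clip boundaries are less likely to fall mid-word. The 0.3-second threshold below is an arbitrary assumption, not a tested value.

```python
def merge_nearby_cues(cues, max_gap=0.3):
    """Merge adjacent (start, end, text) cues whose gap is below max_gap
    seconds, so audio-clip boundaries are less likely to cut through a word.
    """
    merged = []
    for start, end, text in cues:
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous cue instead of starting a new clip.
            prev_start, prev_end, prev_text = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end),
                          prev_text + "\n" + text)
        else:
            merged.append((start, end, text))
    return merged
```

Longer merged clips also give Whisper more surrounding context, which may help with the hallucination issue, though it does nothing for fragments at the merged clip's outer edges.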

I've reached the end of the road in terms of my own ability to improve this. I'm not deeply versed in either Python or AI model manipulation; it's been a somewhat brute-force approach from the start, and I don't know how to begin implementing the required changes either here or in my own project.

If you wish for help testing implementations of the above, I'd gladly provide it, but as of now I'm only worth as much as any other Ideas Guy, lol.

@baxtree
Owner

baxtree commented Dec 13, 2024

No worries at all. It has been quite common so far for people with less insight into the code base to use issues to throw out ideas. Do you have a royalty-free pair of video and subtitle files for me to test this with? If you want to draft a PR, that's also welcome.
