There is a delay between mouth shape and audio #55

Open
hu1148509978 opened this issue Dec 7, 2024 · 4 comments

@hu1148509978

Thank you for publicly releasing such outstanding work!
When running inference on Chinese audio, I ran into a problem where the mouth shapes and the audio drift apart:
in the first few seconds the mouth shapes are synchronized, but over time the delay between the mouth shapes and the audio grows, so they fall out of sync.
I can re-align them by dragging the audio forward or backward by a few seconds in video editing software, but I would rather not have to fix the synchronization through editing.
How can this problem be solved? I would greatly appreciate a reply!

macron2story.mp4
@Fictionarry
Owner

Hi, this looks like a mismatch in the video fps or the audio sampling rate. The input audio should have a sampling rate of 16,000 Hz before being processed by DeepSpeech, wav2vec, or HuBERT (recommended for your cross-lingual application), and the generated video is 25 fps. That is my best guess at the cause. If not, first check whether the generated video and audio have the same length. By the way, the video you provided is 30 fps, not 25 fps as in the original output. I recommend using ffmpeg to combine the video and audio as follows, which ensures no misalignment is introduced during this step.

ffmpeg -i <video_path> -i <audio_path> -q 2 <output_path>
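
A minimal sketch of the preprocessing suggested above, assuming ffmpeg and ffprobe are on the PATH; the file names are placeholders, not paths from this thread. Resample the audio to 16 kHz mono before feature extraction, then verify that the generated video is 25 fps and that the audio and video durations match:

ffmpeg -i input_audio.wav -ar 16000 -ac 1 audio_16k.wav
ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate -of default=noprint_wrappers=1:nokey=1 generated_video.mp4
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 generated_video.mp4
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 audio_16k.wav

The first command downmixes to mono and resamples to 16,000 Hz; the ffprobe calls print the video stream's frame rate (expect 25/1) and the container durations, so any length mismatch shows up before muxing.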

@hu1148509978
Author

Thank you for your prompt response. I will try again. Thank you again for such excellent work!

@hu1148509978
Author

Excellent work! Following your explanation, I have achieved good results!

@Fictionarry
Owner

Haha, thanks for your feedback :)
