Lip sync
This page details the lip-sync system used in the application to animate avatar mouth movements in synchronization with audio. It covers the technical aspects of the implementation, including viseme data generation, viseme-to-blend-shape mapping, and integration with the avatar animation system.
- Overview
- Viseme Data Generation
- Viseme to Blend Shape Mapping
- Implementation Details
- Teeth Animation
- Synchronization with Audio
- Extensibility
- Performance Considerations
- Integration with Other Animations
The lip-sync system uses viseme data generated from audio clips to animate the avatar's mouth movements. It leverages the morph targets (blend shapes) provided by Ready Player Me avatars to create realistic lip movements synchronized with speech.
We use Rhubarb Lip Sync to generate viseme data from audio clips. The output is a JSON file containing timing information for each viseme.
Example of generated viseme data:
{
  "mouthCues": [
    { "start": 0.00, "end": 0.05, "value": "X" },
    { "start": 0.05, "end": 0.10, "value": "A" },
    // ... more cues ...
  ],
  "metadata": {
    "duration": 1.5
  }
}
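In the application this JSON is loaded at runtime before playback starts. A minimal loading sketch in TypeScript, assuming the file is served as a static asset (the MouthCue/LipsyncData type names and the loadLipsyncData helper are illustrative, not taken from the codebase):

interface MouthCue {
  start: number;  // cue start time in seconds
  end: number;    // cue end time in seconds
  value: string;  // Rhubarb viseme code: 'A'–'H' or 'X'
}
interface LipsyncData {
  mouthCues: MouthCue[];
  metadata: { duration: number };
}
// Hypothetical loader for the Rhubarb output shown above.
const loadLipsyncData = async (url: string): Promise<LipsyncData> => {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Failed to load lipsync data from ${url}`);
  }
  return (await response.json()) as LipsyncData;
};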
Ready Player Me avatars come with a set of blend shapes compatible with the Oculus LipSync SDK. We map Rhubarb's visemes to these blend shapes:
const visemeToBlendShape: { [key: string]: string[] } = {
  'X': ['viseme_sil'],
  'A': ['viseme_PP'],
  'B': ['viseme_kk'],
  'C': ['viseme_I'],
  'D': ['viseme_aa'],
  'E': ['viseme_O'],
  'F': ['viseme_U'],
  'G': ['viseme_FF'],
  'H': ['viseme_TH'],
};
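During playback, the active cue's value is looked up in this table. A small helper with a fallback to the silence viseme keeps an unrecognised cue from freezing the mouth in its last pose (a sketch; the helper name is not from the original code):

// Resolve a Rhubarb cue value to blend shape names, falling back to silence.
const blendShapesForCue = (cueValue: string): string[] =>
  visemeToBlendShape[cueValue] ?? visemeToBlendShape['X'];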
Available blend shapes include:
- Basic visemes: viseme_sil, viseme_PP, viseme_FF, viseme_TH, viseme_DD, viseme_kk, viseme_CH, viseme_SS, viseme_nn, viseme_RR, viseme_aa, viseme_E, viseme_I, viseme_O, viseme_U
- Additional shapes: mouthOpen, mouthSmile, eyesClosed, eyesLookUp, eyesLookDown
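Because the exact set of morph targets can vary between avatars, it is worth verifying at runtime which blend shapes a loaded model actually exposes. A quick debugging sketch using standard Three.js mesh properties (scene is assumed to be the loaded avatar scene, and THREE the existing three import):

// Log every morph target exposed by the avatar's skinned meshes.
scene.traverse((object) => {
  const mesh = object as THREE.SkinnedMesh;
  if (mesh.isSkinnedMesh && mesh.morphTargetDictionary) {
    console.log(mesh.name, Object.keys(mesh.morphTargetDictionary));
  }
});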
The lip-sync animation is implemented in the animateLipsync function:
const animateLipsync = (delta: number) => {
  // Nothing to do unless lip-sync data is loaded and playback has started.
  if (!lipsyncDataRef.current || !isLipsyncPlaying || lipsyncStartTimeRef.current === null) {
    return;
  }
  // Time elapsed since the audio started, in seconds.
  const currentTime = performance.now() / 1000 - lipsyncStartTimeRef.current;
  // Find the mouth cue covering the current playback time.
  const currentCue = lipsyncDataRef.current.mouthCues.find((cue: any) =>
    currentTime >= cue.start && currentTime < cue.end
  );
  if (currentCue && currentCue !== currentCueRef.current) {
    // New cue: target 1 for its blend shapes, 0 for every other lip-sync shape.
    const blendShapes = visemeToBlendShape[currentCue.value];
    if (blendShapes) {
      Object.values(visemeToBlendShape).flat().forEach(shape => {
        targetValuesRef.current[shape] = blendShapes.includes(shape) ? 1 : 0;
      });
    }
    currentCueRef.current = currentCue;
    lerpFactorRef.current = 0;
  }
  // Interpolate between current and target values
  lerpFactorRef.current = Math.min(lerpFactorRef.current + delta * 5, 1);
  const shapes = Object.keys(targetValuesRef.current);
  const values = shapes.map(shape => {
    const current = currentValuesRef.current[shape] || 0;
    const target = targetValuesRef.current[shape] || 0;
    const value = THREE.MathUtils.lerp(current, target, lerpFactorRef.current);
    currentValuesRef.current[shape] = value;
    return value;
  });
  setBlendShapes(shapes, values);
};
This function is called every frame when lip-sync is active, updating the blend shape values based on the current viseme.
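The setBlendShapes helper called at the end of animateLipsync is not reproduced on this page. A minimal sketch of what it needs to do, assuming the avatar's skinned meshes are collected in a meshesRef (the ref name is an assumption):

// Write each blend shape value into the matching morph target influence
// on every avatar mesh that has that morph target.
const setBlendShapes = (shapes: string[], values: number[]) => {
  meshesRef.current.forEach((mesh: THREE.SkinnedMesh) => {
    const dictionary = mesh.morphTargetDictionary;
    const influences = mesh.morphTargetInfluences;
    if (!dictionary || !influences) return;
    shapes.forEach((shape, index) => {
      const morphIndex = dictionary[shape];
      if (morphIndex !== undefined) {
        influences[morphIndex] = values[index];
      }
    });
  });
};

The teeth-specific branch described in the next section plugs into this same per-mesh loop.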
For teeth animation, we use a simplified approach:
if (meshName === 'Wolf3D_Teeth') {
  // For teeth, only use 'mouthOpen'
  const mouthOpenIndex = mesh.morphTargetDictionary?.['mouthOpen'];
  if (mouthOpenIndex !== undefined && mesh.morphTargetInfluences) {
    let mouthOpenValue = 0;
    shapes.forEach((shape, index) => {
      if (teethMovingVisemes.includes(shape)) {
        mouthOpenValue = Math.max(mouthOpenValue, values[index]);
      }
    });
    mesh.morphTargetInfluences[mouthOpenIndex] = mouthOpenValue * 1;
  }
}
This approach uses the 'mouthOpen' morph target for teeth movement, as most visemes don't significantly affect teeth visibility.
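The teethMovingVisemes list referenced above holds the visemes that visibly open the jaw. A plausible definition (the exact contents in the application may differ):

// Visemes that open the jaw and should therefore move the lower teeth.
const teethMovingVisemes = ['viseme_aa', 'viseme_E', 'viseme_I', 'viseme_O', 'viseme_U'];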
Audio synchronization is achieved by using the timing information from the viseme data:
lipsyncStartTimeRef.current = performance.now() / 1000;
We use this start time to calculate the current position in the audio playback and apply the appropriate viseme.
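In practice, the start time is recorded at the same moment audio playback begins, so the performance.now()-based cue lookups line up with the audio timeline. A sketch (startLipsync and setIsLipsyncPlaying are assumed names; LipsyncData is the type sketched earlier):

// Start audio and lip-sync together so cue timing matches playback.
const startLipsync = async (audio: HTMLAudioElement, data: LipsyncData) => {
  lipsyncDataRef.current = data;
  currentCueRef.current = null;
  lipsyncStartTimeRef.current = performance.now() / 1000;
  setIsLipsyncPlaying(true);
  await audio.play();
};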
The current system can be adapted to use different viseme sets or audio processing methods:
- Modify the visemeToBlendShape mapping to accommodate new viseme sets.
- Update the viseme data loading and parsing if using a different audio processing tool (see the sketch below).
- Adjust the animateLipsync function to handle different data formats if necessary.
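For example, if a different tool emits its own viseme labels, a thin adapter can translate them into the Rhubarb-style codes before the existing lookup is applied (a sketch; the external labels below are made up for illustration):

// Hypothetical adapter from another tool's labels to Rhubarb-style codes.
const externalToRhubarb: { [key: string]: string } = {
  sil: 'X',
  PP: 'A',
  aa: 'D',
  // ... extend for the rest of the tool's viseme set ...
};
const blendShapesForExternalCue = (label: string): string[] =>
  visemeToBlendShape[externalToRhubarb[label] ?? 'X'];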
The lip-sync system is designed to be performant:
- Transitions between viseme states are handled with lightweight linear interpolation (lerping).
- Blend shape calculations amount to a single pass over the tracked shapes each frame, so they add no significant overhead (see the sketch below).
- The system only updates when audio is playing, reducing unnecessary computations.
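One small optimization consistent with the code above is to flatten the viseme table once instead of on every cue change (a sketch):

// Computed once; animateLipsync can iterate this array when resetting targets
// instead of calling Object.values(visemeToBlendShape).flat() per cue change.
const allLipsyncShapes: string[] = Object.values(visemeToBlendShape).flat();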
The lip-sync system is designed to work seamlessly with other facial and full-body animations:
- Lip-sync blend shapes are applied independently of other animations.
- The useFrame hook from @react-three/fiber keeps lip-sync updates synchronized with the render loop:
useFrame((state, delta) => {
  // Advance skeletal/full-body animations.
  if (mixerRef.current) {
    mixerRef.current.update(delta);
  }
  // Drive mouth blend shapes while audio is playing.
  if (isLipsyncPlaying) {
    animateLipsync(delta);
  }
});
This approach allows lip-sync to be active during any full-body animation without conflicts.