In today's world, effective communication is essential for fostering inclusivity and breaking down barriers. However, for individuals who rely on visual communication methods like American Sign Language (ASL), traditional communication tools often fall short. That's where GenASL comes in – a Generative AI-powered solution that translates speech or text into expressive ASL avatar animations, bridging the gap between spoken/written language and sign language.
The rise of the foundational models, and the fascinating world of Generative AI that we live in, is incredibly exciting and it opens up doors to imagine and build what was not previously possible. In this blog-post, we'll dive into the architecture and implementation details of GenASL, leveraging AWS Generative AI capabilities to create human-like ASL avatar videos.
The GenASL solution comprises several AWS services working together to enable seamless translation from speech/text to ASL avatar animations. Users send audio, video, or text as input to GenASL, and they will visualize an ASL avatar video that interprets the input data. The solution utilizes the AWS AIML services namely Amazon Transcribe, Amazon SageMaker/Amazon Bedrock, and Foundational Models.
Here's a high-level overview of the architecture:
- Amazon EC2 instance initiates a batch process to create ASL Avatars from a video data set consisting of 8000+ poses using RTMPose, a Real-Time Multi-Person Pose Estimation toolkit based on MMPose.
- AWS Amplify distributes the GenASL web app consisting of HTML, JavaScript, and CSS to end users’ mobile devices.
- The Amazon Cognito Identity pool grants temporary access to the Amazon S3 bucket.
- The user uploads an audio file to the Amazon S3 bucket using AWS SDK through the web app.
- The GenASL web app invokes the backend services by sending the Amazon S3 object Key in the payload to an API hosted on Amazon API Gateway.
- Amazon API Gateway instantiates an AWS Step Functions workflow. The state Machine orchestrates the Artificial Intelligence /Machine Learning (AI/ML) services Amazon Transcribe, and Amazon Bedrock and the NoSQL datastore Amazon DynamoDB using AWS lambda functions.
- The AWS Step Functions workflow generates a pre-signed URL of the ASL Avatar video for the corresponding audio file.
- A pre-signed URL for the video file stored in Amazon S3 is sent back to the user’s browser through Amazon API Gateway. The user’s mobile device plays the video file using the pre-signed URL.
Let's dive into the implementation details of each component:
The American Sign Language Lexicon Video Dataset (ASLLVD) consists of multiple synchronized videos showing the signing from different angles of more than 3,300 ASL signs in citation form, each produced by 1-6 native ASL signers. Linguistic annotations include gloss labels, sign start and end time codes, start and end handshape labels for both hands, morphological and articulatory classifications of sign type. For compound signs, the dataset includes annotations for each morpheme. To facilitate computer vision-based sign language recognition, the dataset also includes numeric ID labels for sign variants, video sequences in uncompressed-raw format, and camera calibration sequences.
We store the input dataset in an S3 bucket (Video dataset) and using RTMPose, a real-time multi-person pose estimation model based on MMPose, a Pytorch-based pose estimation open-source toolkit, we generate the ASL avatar videos. MMPose, is a member of the OpenMMLab Project, and contains a rich set of algorithms for 2d multi-person human pose estimation, 2d hand pose estimation, 2d face landmark detection, and 133 keypoint whole-body human pose estimation
Amazon SageMaker notebooks initiates the batch process which stores the ASL Avatar videos in another S3 bucket (ASL Avatars) for every ASL gloss and stores the ASL Gloss and it’s corresponding ASL avatar video’s S3 key in the DynamoDB table.
The back end process has 3 steps, processing the input audio to an english text, translating the english text to ASL gloss and lastly, generating ASL avatar video from the ASL gloss. This API layer is fronted by Amazon API Gateway. API Gateway allows the user to authenticate, monitor and throttle the API request. So, whenever the API gets a request to generate the sign video, it invokes an AWS step function workflow and then return the step function execution URL back to the front-end application. The step function has three steps, outlined below.
Input Processing: First, we convert the audio input to english text using Amazon Transcribe, an automatic speech-to-text AI service that uses deep learning for speech recognition. Amazon Transcribe is a fully managed and continuously training service, designed to handle a wide range of speech/acoustic characteristics, including variations in volume, pitch, and speaking rate.
Translation: In this step, we translate the english text to ASL Gloss using Amazon Bedrock, which is the easiest way to build and scale generative AI applications utilizing FMs. Amazon Bedrock is a fully managed service that makes Foundational Models from leading AI startups and Amazon to be available from an API, so you can choose from some of the most cutting-edge FMs available today that is best suited for your use case. It provides an API driven serverless experience for builders, providing an ability to accelerate development. We have used Anthropic’s Claude V3 to create ASL Gloss.
Avatar Generation: In this last step, we generate the ASL avatar video from the ASL Gloss. Using the ASL gloss created in the translation layer, we look up corresponding ASL sign from the Dynamo DB table. The “lookup ASL Avatar“ Lambda function stitches the videos together, and generate a temporary video, upload that to the S3 bucket, create a pre-signed URL, and send the pre-signed URL for both sign video and avatar video back to the front-end. The front-end plays the video in a loop.
The front end application is built using AWS Amplify, a framework that allows you to build, develop, and deploy full stack applications including mobile and web applications. So, you can simply add the authentication to a front-end AWS Amplify app using Amplify CLI Add Auth that generates the signup screen, login screen, as well as the backend and the Amazon Cognito Identity pools. During the audio file upload to S3, the front-end connects with S3 using the temporary identity is provided by the Cognito identity pool.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.
This section provides steps to deploy an ASL avatar generator using AWS services. The following sections outline the steps for cloning the repository, processing data, deploying the backend, and setting up the frontend.
Clone the git repository using the following command:
git clone https://github.com/aws-samples/genai-asl-avatar-generator.gitFollow the instructions specified in the dataprep folder to intialize the database,
Modify genai-asl-avatar-generator/dataprep/config.ini with information specific to your environment:
[DEFAULT]
s3_bucket= <your S3 bucket>
s3_prefix= <files will be generated in this prefix within your S3 bucket>
table_name=<dynamodb table name>
region=<your preferred AWS region> Set up your environment by installing the required Python packages:
cd genai-asl-avatar-generator/dataprep
./env_setup.cmdPrepare the sign video annotation file for each processing run:
python prep_metadata.pyDownload sign videos, segment them, and store them in S3:
python create_sign_videos.pyGenerate avatar videos:
python create_pose_videos.pyDeploy the backend:
cd genai-asl-avatar-generator/backend
sam deploy --guidedInitialize your Amplify environment:
amplify initModify the frontend configuration to point to the backend API:
- Open
frontend/amplify/backend/function/Audio2Sign/index.py - Modify the
stateMachineArnvariable to have the state machine ARN shown in the output generated from backend deployment
amplify add hostingIn the prompt, select "Amazon CloudFront and S3" and select the bucket to host GenASL application
npm install --forceamplify publishAfter deploying the Amplify project using the amplify publish command, a CloudFront URL will be returned. You can use this URL to access the GenASL demo application.
With the application open, you can register a new user and test the ASL avatar generation functionality.
Remove all the frontend resources created by Amplify using the following delete command:
amplify deleteRemove all the backend resources created by SAM using the following delete command:
sam delete- If you created a new EC2 instance for running the batch process, you can terminate the EC2 instance using the AWS Console.
- If you reused an existing EC2 instance, you can delete the project folder recursively to clean up all the resources:
rm -rf genai-asl-avatar-generatorUse the following AWS CLI commands to delete the buckets created for storing ASL videos: Replace with the name of your S3 bucket.
aws s3 rm s3://<bucket-name> --recursive
aws s3 rb s3://<bucket-name> --force By following these steps, you should be able to deploy the GenAI ASL Avatar Generator to your AWS environment. Make sure to adjust the configurations according to your specific requirements and environment settings. If you encounter any issues, refer to the project documentation or seek help from the community.