The idea of this project is to develop a basic ML model that recognizes the language a given text is written in, and to deploy it to an AWS instance so that it can be queried over HTTP.
The data used to train the model comes from sources such as the TED talks dataset (see the data preparation instructions below).
The repository is organized as follows:

- The `notebooks` directory contains Jupyter notebooks for data exploration and model development.
- The `data-preparation` directory contains scripts and utilities to prepare the dataset for the models.
- The `service` directory contains the code for running and deploying the model as an HTTP API.
The S3 bucket of the project is organized as follows:
- `raw-data/` contains the raw data as downloaded from the datasets.
- `data-preparation/` contains the necessary scripts and utilities to prepare the data for the model.
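To work with the data locally, you can mirror the bucket's `raw-data/` prefix into a local `data/` directory; the bucket name below is a placeholder:

```bash
aws s3 sync s3://<project-bucket>/raw-data/ data/raw-data/
```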
Since the model will be trained using a TensorFlow text line dataset, the data must be organized in the following structure:

```
output_dir
├── Language1
│   ├── Text1.txt
│   ├── Text2.txt
│   └── ...
└── Language2
    ├── Text1.txt
    └── ...
```
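As a rough sketch of how this layout maps to a training pipeline (the path below is a placeholder, and the real training code lives in the notebooks), each language subdirectory becomes the label for the lines in its text files:

```python
import pathlib
import tensorflow as tf

# Placeholder path; point this at the directory produced by make_dataset.py.
data_dir = pathlib.Path("data/output_dir")
languages = sorted(p.name for p in data_dir.iterdir() if p.is_dir())

# Build one line-level dataset per language, attaching the language index as label.
per_language = []
for label, lang in enumerate(languages):
    files = [str(f) for f in sorted((data_dir / lang).glob("*.txt"))]
    lines = tf.data.TextLineDataset(files)
    per_language.append(lines.map(lambda line, lab=label: (line, lab)))

# Mix the languages together and shuffle (sample_from_datasets needs TF >= 2.7).
dataset = tf.data.Dataset.sample_from_datasets(per_language).shuffle(10_000)
```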
To prepare the data from the TED dataset CSVs and store it in this format, run the `data-preparation/make_dataset.py` script as follows:

```bash
python data-preparation/make_dataset.py --input_dir=data/raw-data/ted-dataset
```
You can also pass the following optional arguments:

```
--output_dir=<directory to store output>
--col_name=<name of column with text in the csvs>
```
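For example, a full invocation might look like this (the output directory and column name are illustrative; check the script for its actual defaults):

```bash
python data-preparation/make_dataset.py \
  --input_dir=data/raw-data/ted-dataset \
  --output_dir=data/prepared \
  --col_name=transcript
```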
To run the model server locally, first install the required dependencies from `service/src/requirements.txt`.
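For example, from the repository root:

```bash
pip install -r service/src/requirements.txt
```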
If your Python version is different from 3.7, you must install the appropriate version of `tflite_runtime` for your interpreter (see the TensorFlow Lite Python guide).
Next, ensure you have saved the prediction model in tflite format at `service/src/assets/model.tflite`.
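A quick sanity check is to load the file with the tflite interpreter; a minimal sketch, assuming `tflite_runtime` is installed:

```python
from tflite_runtime.interpreter import Interpreter

# Load the model file the service expects and verify its tensors allocate.
interpreter = Interpreter(model_path="service/src/assets/model.tflite")
interpreter.allocate_tensors()
print(interpreter.get_input_details())
```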
Then go to the `service/src` directory and start the server with:

```bash
uvicorn main:app
```
You can then query the model via HTTP requests. The interactive documentation for the API is served at `/docs` (for example, http://localhost:8000/docs with uvicorn's default port).
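As a hypothetical example (the exact route and payload depend on the API defined in `main.py`), a prediction request might look like:

```bash
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "Bonjour tout le monde"}'
```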
To run the service with Docker, first navigate to the `service` directory and build the image with:

```bash
docker build -t language-recognize .
```
Then start the service with:

```bash
docker run --name lang-recognize-container -p 3000:80 language-recognize
```

The API is then reachable at http://localhost:3000 on the host (port 3000 is mapped to port 80 in the container).
To deploy the service to an AWS Fargate container, follow these steps. First, set up an AWS ECR repository for your Docker images. Then build and push the service's Docker image to the repository.
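A typical push sequence looks like the following, assuming the AWS CLI is configured and substituting your own account ID, region, and repository name:

```bash
# Authenticate docker against your ECR registry.
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com

# Tag the local image with the repository URI and push it.
docker tag language-recognize:latest <account-id>.dkr.ecr.us-east-1.amazonaws.com/language-recognize:latest
docker push <account-id>.dkr.ecr.us-east-1.amazonaws.com/language-recognize:latest
```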
Once you have done this, navigate to the `service/aws-infrastructure` directory, copy the `.env.example` file to a `.env` file, and fill in the values with the image URI from ECR and the ID of the subnet you want to deploy to (this can be a subnet in your default VPC).
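The variable names in this sketch are hypothetical; use the ones defined in `.env.example`:

```bash
# Hypothetical variable names; copy the real ones from .env.example.
IMAGE_URI=<account-id>.dkr.ecr.us-east-1.amazonaws.com/language-recognize:latest
SUBNET_ID=subnet-0123456789abcdef0
```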
Finally, run the `build.sh` script to launch the stack. When you're done playing around with it, you can take the service down from AWS by running:

```bash
aws cloudformation delete-stack --stack-name language-recognize-stack
```