Language Detection System

This repo provides a clean implementation of a Language Detection System in TensorFlow 2, following standard best practices.

The models can detect the following languages:

  • Bulgarian
  • Czech
  • Danish
  • Dutch
  • English (Of course)
  • Estonian
  • Finnish
  • French
  • German
  • Greek
  • Hungarian
  • Italian
  • Latvian
  • Lithuanian
  • Polish
  • Portuguese
  • Romanian
  • Slovak
  • Slovenian
  • Spanish
  • Swedish

Usage

Installation

Conda (Recommended)

# TensorFlow CPU: create and activate an environment with TensorFlow installed
# (the environment name "langdet" is arbitrary)
conda create -n langdet tensorflow
conda activate langdet

pip

pip install -r requirements.txt

Downloading pre-trained weights

Model_two
Tokenizer

NOTE: Each model requires its respective tokenizer to work, so kindly download models along with their tokenizers.

Or use wget from the terminal (Linux):

# Model (append ?raw=true so wget fetches the file itself rather than the GitHub HTML page)
wget https://github.com/saahiluppal/langdet/blob/master/model.h5?raw=true -O model.h5
# Tokenizer
wget https://github.com/saahiluppal/langdet/blob/master/tokenizer.json?raw=true -O tokenizer.json

Not sure which model to use? You can find information about the models here.
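
A minimal sketch of loading the downloaded weights together with their tokenizer using tf.keras; the file names match the wget commands above, and pairing the two files this way is exactly what the note earlier asks for:

import tensorflow as tf

# Load the pre-trained classifier weights downloaded above.
model = tf.keras.models.load_model("model.h5")

# Load the matching tokenizer; a model only works with the tokenizer it was trained with.
with open("tokenizer.json") as f:
    tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(f.read())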

Action

# Detect a language (we recommend input of more than 5 words for better accuracy)
# file dependencies soon to be added
python detect.py
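
For reference, here is a hedged sketch of what a detection step looks like with a Keras tokenizer and model; the label order and sequence length below are illustrative assumptions, not values taken from detect.py:

import tensorflow as tf

# Hypothetical label order; the real order is fixed by the training data.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian", "Finnish",
    "French", "German", "Greek", "Hungarian", "Italian", "Latvian", "Lithuanian",
    "Polish", "Portuguese", "Romanian", "Slovak", "Slovenian", "Spanish", "Swedish",
]

model = tf.keras.models.load_model("model.h5")
with open("tokenizer.json") as f:
    tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(f.read())

text = "ceci est une phrase d'exemple avec plus de cinq mots"

# Convert the text to the integer sequence representation used in training,
# then pad to a fixed length (maxlen=100 is an assumption).
seq = tokenizer.texts_to_sequences([text])
seq = tf.keras.preprocessing.sequence.pad_sequences(seq, maxlen=100)

probs = model.predict(seq)[0]
print(LANGUAGES[int(probs.argmax())])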

# Train a custom model (we recommend adapting the code to better suit your needs)
python manual_tokens.py
# Jupyter notebook for the same workflow
jupyter notebook manual_tokens.ipynb

# Preprocess the downloaded data for custom use
python extraction.py
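
manual_tokens.py builds its own tokenizer and network. A rough sketch of that kind of training pipeline follows; the layer sizes, vocabulary size, and file names here are illustrative choices, not the repo's actual configuration:

import tensorflow as tf

# Toy labelled data: (sentence, language-id) pairs.
texts = ["this is a short english sentence", "ceci est une phrase en francais"]
labels = [0, 1]
num_languages = 2                  # 21 for the full Europarl language set
vocab_size, maxlen = 20000, 100    # illustrative values

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size, oov_token="<unk>")
tokenizer.fit_on_texts(texts)
x = tf.keras.preprocessing.sequence.pad_sequences(
    tokenizer.texts_to_sequences(texts), maxlen=maxlen)
y = tf.constant(labels)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(num_languages, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, epochs=3)

# Save in the same format as the pre-trained artifacts above.
model.save("my_model.h5")
with open("my_tokenizer.json", "w") as f:
    f.write(tokenizer.to_json())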

Dataset Used

I used the European Parliament Proceedings Parallel Corpus, which can be found here.
The full dataset is large (1.5 GB unextracted), so you might prefer the smaller preprocessed dataset, which can be found here.
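
extraction.py is what preprocesses the downloaded corpus. As a hedged sketch of that step, here is one way to turn per-language Europarl text files into (sentence, label) pairs; the directory layout and file extension used here are assumptions about how the extracted corpus might be arranged:

import pathlib

CORPUS_DIR = pathlib.Path("europarl")   # hypothetical root: one sub-directory per language code
LANG_CODES = ["bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
              "it", "lt", "lv", "nl", "pl", "pt", "ro", "sk", "sl", "sv"]

samples = []  # (sentence, language-code) pairs
for code in LANG_CODES:
    for path in (CORPUS_DIR / code).glob("*.txt"):
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            # Skip Europarl markup lines such as <CHAPTER ...> and keep
            # sentences long enough to classify reliably (5+ words).
            if line and not line.startswith("<") and len(line.split()) >= 5:
                samples.append((line, code))

print(f"collected {len(samples)} labelled sentences across {len(LANG_CODES)} languages")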

LICENSE

Apache License 2.0
