In this paper, a Multilayer Perceptron (MLP) with tf-idf features and n-gram features, a Recurrent Neural Network (RNN) are applied to identify the predominant language of a given paragraph from the WiLI-2018 dataset [Thoma, 2018]. The WiLI-2018 dataset includes 235 distinct languages. We could achieve 90 % accuracy on the test set for the MLP with tf-idf features, 93% for the MLP with n-grams and 91% with the RNN.
Implementation of the three lanugage identification models:
-
MLP classifier with tf-idf features
-
MLP classifier with n-grams features
-
RNN
Each folder contains a readme that explains the parameter of each python script. Example bash scripts show how each python script in the pipeline can be runned.