Classifies a given short sentence as English or Dutch using either (1) a Decision Tree or (2) Adaptive Boosting (AdaBoost) with a Decision Stump. The implementation requires Python 3; the required packages are listed in requirement.txt.
1. Using Decision Tree: 98.58% accuracy
2. Using AdaBoost: 96.6% accuracy
- data_collection.py: Collects the raw English and Dutch sentences and stores them in the data.csv file
- data.csv: Contains the collected data with three fields: sentence, length, lang. sentence is the raw collected text, length is the length of the sentence, and lang is the language type: en for English and de for Dutch
- adaboost.py: Implementation of the AdaBoost learning technique, from scratch, with a decision stump
- decision_tree.py: Implementation of the decision tree technique, from scratch, using ID3
- features.py: Transforms a sentence into features
- main.py: Main program that picks the classifier technique and performs one of the following: (a) trains the chosen classifier with the train and test sentences of word length 10, 20, or 50; (b) predicts the language of the given text (either English or Dutch) using the trained model
- writeup.pdf: Detailed explanation of Data Collection, Preprocessing, Training, and Evaluation.
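
As an illustration of how features.py and decision_tree.py might fit together, the following is a minimal sketch of boolean feature extraction and the information-gain criterion that ID3 uses to choose a split. The feature names (`contains_the`, `contains_de`, `contains_ij`) and the helper functions are hypothetical and are not taken from the repository; the actual features are described in writeup.pdf.

```python
import math
from collections import Counter

# Hypothetical boolean features; the real ones in features.py may differ.
FEATURES = {
    "contains_the": lambda words: "the" in words,
    "contains_de": lambda words: "de" in words,
    "contains_ij": lambda words: any("ij" in w for w in words),
}

def extract(sentence):
    """Map a raw sentence to a dict of feature name -> bool."""
    words = sentence.lower().split()
    return {name: fn(words) for name, fn in FEATURES.items()}

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on one boolean feature."""
    gain = entropy(labels)
    for value in (True, False):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# Toy data; labels follow the data.csv convention (en / de).
sentences = [
    "the cat sat on the mat",
    "de kat zit op de mat",
    "it is the best time",
    "hij is de beste",
]
labels = ["en", "de", "en", "de"]
rows = [extract(s) for s in sentences]

# ID3 splits on the feature with the highest information gain.
gains = {f: information_gain(rows, labels, f) for f in FEATURES}
best_feature = max(gains, key=gains.get)
```

ID3 would apply this selection recursively to each resulting subset until the labels are pure or no features remain.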
1. dataset:
It contains two subdirectories: train and val. Each holds files containing sentences of lengths 10, 20, and 50 respectively.
2. models:
Directory that holds the trained models for the decision tree and AdaBoost
3. weights:
Directory that holds the AdaBoost weights produced during training
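
To show what such weights look like, here is a minimal sketch of AdaBoost with decision stumps over boolean features. It assumes labels encoded as +1 (English) and -1 (Dutch) and uses hypothetical feature names; the function names, the encoding, and the data are illustrative and are not taken from adaboost.py.

```python
import math

def train_adaboost(rows, labels, features, rounds):
    """rows: list of dicts (feature -> bool); labels: +1 or -1."""
    n = len(rows)
    weights = [1.0 / n] * n          # start with uniform example weights
    ensemble = []                    # list of (feature, polarity, alpha)
    for _ in range(rounds):
        best = None
        # A stump tests one feature; polarity says which side predicts +1.
        for f in features:
            for polarity in (1, -1):
                preds = [polarity if r[f] else -polarity for r in rows]
                err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
                if best is None or err < best[0]:
                    best = (err, f, polarity, preds)
        err, f, polarity, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((f, polarity, alpha))
        # Increase weights of misclassified examples, then renormalize.
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, labels, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble, weights

def predict(ensemble, row):
    """Weighted vote of all stumps; returns +1 or -1."""
    score = sum(alpha * (pol if row[f] else -pol) for f, pol, alpha in ensemble)
    return 1 if score >= 0 else -1

# Toy data with hypothetical features; +1 = English, -1 = Dutch (assumed encoding).
rows = [
    {"contains_the": True, "contains_de": False},
    {"contains_the": False, "contains_de": True},
    {"contains_the": True, "contains_de": True},
    {"contains_the": False, "contains_de": False},
]
labels = [1, -1, 1, -1]
ensemble, final_weights = train_adaboost(rows, labels, ["contains_the", "contains_de"], rounds=3)
predictions = [predict(ensemble, r) for r in rows]
```

The per-stump `alpha` values and the per-example weights are the kind of quantities a weights directory could store between training rounds.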
1. For training, use the following command:
python main.py classifier_type "train" train_sentence_length val_sentence_length
where classifier_type is "dec" for the decision tree or "ada" for AdaBoost,
train_sentence_length is the length of the sentences you want to train with (10, 20, or 50), and
val_sentence_length is the length of the sentences you want to use for hyperparameter tuning (10, 20, or 50).
For example, to train the decision tree classifier with a train sentence length of 50 and a val sentence length of 50, use the following:
python main.py "train" "dec" "50" "50"
2. For prediction, use the following command:
python main.py classifier_type "predict" file_name
where file_name is the name of the file you want to test. By default, put the sentence inside the test.txt file and
run the prediction. For multiple sentences, use one line separation between sentences.
For example, to make a prediction using the AdaBoost classifier with a test.txt file, use the following command:
python main.py "predict" "ada" "test.txt"
Please refer to writeup.pdf for details on data collection, feature extraction, and accuracy.