Flipkart Product Category Classification

The task is to classify the product category based on the product's description.
The product's category label has to be extracted from the product category tree.
The dataset at hand is the Flipkart e-commerce sales sample dataset containing about 20k samples. [link]
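As a minimal sketch of pulling a top-level category out of a category tree string: the function name `primary_category` is hypothetical, and the input format (a bracketed, quoted path joined by `>>`) is an assumption about how the tree is stored, not something this README confirms.

```python
def primary_category(category_tree: str) -> str:
    """Return the top-level category from a category-tree string."""
    # Assumed format: '["Root >> Child >> Leaf"]' (hypothetical example).
    path = category_tree.strip().strip("[]").strip('"')
    # The first element of the ">>"-separated path is the top-level category.
    return path.split(">>")[0].strip()


print(primary_category('["Clothing >> Women\'s Clothing >> Shorts"]'))
```

Taking only the first path element keeps the number of classes small, which fits the top-category setup described in this README.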

Solution:

The textual description of each product is used to categorize the product.
The text data is preprocessed by removing emails, newline characters, distracting single quotes, digits, punctuation, single characters, accented characters, and multiple spaces.
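The cleaning steps listed above could look roughly like the following sketch. The function name `preprocess`, the exact regular expressions, and the order of the steps are assumptions for illustration. In particular, accents are handled here by folding accented characters to plain ASCII, which is only one possible reading of the step; the notebook may do it differently.

```python
import re
import unicodedata


def preprocess(text: str) -> str:
    """Apply the cleaning steps in one pass (illustrative order)."""
    # Fold accented characters to ASCII (assumed handling of accents).
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"\S*@\S*\s?", " ", text)    # emails
    text = re.sub(r"\s+", " ", text)           # newlines and tabs become spaces
    text = text.replace("'", "")               # single quotes
    text = re.sub(r"\d+", " ", text)           # digits
    text = re.sub(r"[^\w\s]|_", " ", text)     # punctuation and underscores
    text = re.sub(r"\b[a-zA-Z]\b", " ", text)  # stray single characters
    return re.sub(r"\s+", " ", text).strip()   # collapse multiple spaces


sample = "Café's price: 499!\nMail a@b.com  now x"
print(preprocess(sample))
```

Removing single quotes before punctuation matters in this sketch: otherwise a possessive such as "Café's" would leave a stray "s" that the single-character step then deletes.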
Various feature engineering techniques are used to develop input representations for the ML models. They are as follows:
1. Word frequency based representation (only unigrams).
2. Word frequency based representation (unigrams and bigrams).
3. Word frequency based representation (unigrams, bigrams, and trigrams).
4. Word TF-IDF based representation (only unigrams).
5. Word TF-IDF based representation (unigrams and bigrams).
6. Word TF-IDF based representation (unigrams, bigrams, and trigrams).
7. Character TF-IDF based representation (bigrams and trigrams).
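To make the seven representations above concrete, here is a small pure-Python sketch of word n-grams, character n-grams, raw counts, and a textbook TF-IDF weighting. All function names are hypothetical, and the weighting `tf * log(N / df)` is the plain textbook form; libraries such as scikit-learn use smoothed variants, so the numbers will not match a library implementation exactly. In scikit-learn, comparable features are typically built with `CountVectorizer` and `TfidfVectorizer` through their `ngram_range` and `analyzer` parameters; whether the notebook does so is not stated here.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=1):
    """Word n-grams for every n in [n_min, n_max]."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n_min=2, n_max=3):
    """Character n-grams (spaces included) for every n in [n_min, n_max]."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def count_features(docs, analyzer):
    """Frequency-based representation: one term-count mapping per document."""
    return [Counter(analyzer(doc)) for doc in docs]


def tfidf_features(docs, analyzer):
    """Textbook TF-IDF: term count times log(N / document frequency)."""
    counts = count_features(docs, analyzer)
    doc_freq = Counter(term for c in counts for term in c)  # once per document
    n_docs = len(docs)
    return [{t: tf * math.log(n_docs / doc_freq[t]) for t, tf in c.items()}
            for c in counts]


docs = ["blue cotton shirt", "blue denim jeans"]
uni_bi = lambda d: word_ngrams(d, 1, 2)  # unigrams and bigrams
print(count_features(docs, uni_bi)[0])
print(tfidf_features(docs, uni_bi)[0])
print(char_ngrams("shirt", 2, 3))
```

In this toy example "blue" appears in both documents, so its TF-IDF weight is zero, while terms unique to one document keep a positive weight.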
The dataset is split 70:30 into train and test sets using stratified sampling.
The models are trained on the train data and their performance is measured on the test data.
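A stratified split keeps the class proportions roughly equal in both parts. The sketch below shows the idea in pure Python; the function name `stratified_split`, the fixed seed, and the per-class rounding are assumptions, so the resulting sizes can differ by a few samples from what a library routine would produce.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.3, seed=0):
    """Return (train_indices, test_indices) with per-class proportions kept."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed for reproducibility (assumption)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)  # per-class test share
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = ["Clothing"] * 10 + ["Jewellery"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting each class separately is what makes the split stratified: a rare category still contributes about 30% of its samples to the test set.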
Standard machine learning models are used, as they already achieve high accuracy on this task. They are as follows:
1. Multinomial Naive Bayes
2. Random Forest
3. Linear SVC
A confusion matrix and a performance analysis for each model are provided in the notebook.
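For readers unfamiliar with the evaluation, a confusion matrix and accuracy can be computed as in the following sketch. The helper names and the toy labels are hypothetical; rows are true classes and columns are predicted classes, which is a common convention but an assumption about the notebook.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels for illustration only, not results from the dataset.
y_true = ["Clothing", "Clothing", "Jewellery", "Footwear"]
y_pred = ["Clothing", "Jewellery", "Jewellery", "Footwear"]
labels = ["Clothing", "Footwear", "Jewellery"]

for row in confusion_matrix(y_true, y_pred, labels):
    print(row)
print(accuracy(y_true, y_pred))
```

Off-diagonal entries show which categories get confused with each other, which a single accuracy number hides.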

16,631 of the 20k samples fall in the top 10 categories, which is about 83% of the total data.
The 70:30 stratified split results in 11,641 train samples and 4,990 test samples.
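The figures above can be checked with a few lines of arithmetic. This sketch assumes that "20k" means exactly 20,000 samples and that the test size is rounded up; neither is stated in this README, and `split_sizes` is a hypothetical helper.

```python
import math

TOTAL_SAMPLES = 20_000  # assumption: "20k" taken as exactly 20,000


def split_sizes(n_samples, test_ratio=0.3):
    """Return (train_size, test_size), rounding the test size up (assumed)."""
    n_test = math.ceil(n_samples * test_ratio)
    return n_samples - n_test, n_test


n_top = 16_631
coverage = n_top / TOTAL_SAMPLES
train_size, test_size = split_sizes(n_top)
print(f"coverage: {coverage:.1%}, train: {train_size}, test: {test_size}")
```

Under these assumptions the helper gives 11,641 train and 4,990 test samples for 16,631 inputs, matching the reported split.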

Feature	Multinomial Naive Bayes	Random Forest	Linear SVC
Word frequency based representation (only unigrams)	97.35%	98.29%	98.93%
Word frequency based representation (unigrams and bigrams)	95.53%	98.39%	98.79%
Word frequency based representation (unigrams, bigrams, and trigrams)	95.39%	98.43%	98.71%
Word TF-IDF based representation (only unigrams)	95.13%	97.87%	98.21%
Word TF-IDF based representation (unigrams and bigrams)	97.61%	97.83%	99.27%
Word TF-IDF based representation (unigrams, bigrams, and trigrams)	97.17%	97.73%	99.17%
Character TF-IDF based representation (bigrams and trigrams)	87.27%	96.89%	98.95%

19,619 of the 20k samples fall in the top 10 categories, which is about 98% of the total data.
The 70:30 stratified split results in 13,733 train samples and 5,886 test samples.

Feature	Multinomial Naive Bayes	Random Forest	Linear SVC
Word frequency based representation (only unigrams)	93.84%	96.55%	97.04%
Word frequency based representation (unigrams and bigrams)	93.30%	97.17%	97.19%
Word frequency based representation (unigrams, bigrams, and trigrams)	91.76%	97.26%	96.97%
Word TF-IDF based representation (only unigrams)	88.26%	96.00%	97.63%
Word TF-IDF based representation (unigrams and bigrams)	92.18%	96.29%	97.72%
Word TF-IDF based representation (unigrams, bigrams, and trigrams)	92.15%	96.29%	97.77%
Character TF-IDF based representation (bigrams and trigrams)	76.97%	94.75%	97.38%

Simple machine learning models perform well on this task.
Thus, no deep learning models (DNN, CNN, LSTM) or other complex architectures (BERT, etc.) are used.
From the tables, Linear SVC outperforms Multinomial Naive Bayes and Random Forest in all but one configuration (Random Forest is slightly ahead for word-frequency unigrams, bigrams, and trigrams in the second table), and it is also the better choice in terms of memory.
The performance could be improved further with a multimodal approach that also extracts information from the product images.
