complement-naive-bayes

An implementation of the Complement Naive Bayes text classifier, used for automatic categorisation of product listings on eCommerce sites. Complement Naive Bayes was chosen over classic Naive Bayes because the distribution of products among categories tends to be skewed (some categories contain many more products than others), which causes classic Naive Bayes to favour the categories that had more products during the training phase. Complement Naive Bayes performs much better on skewed training data.
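
To make the idea concrete, here is a minimal, self-contained sketch of the weight-normalized Complement Naive Bayes formulation described by Rennie et al. (2003). It is not taken from this project's source: the class name, method names and the smoothing parameter alpha are assumptions made for illustration. Each category's weights are estimated from the term counts of all other categories, so a category with few training products still gets weights backed by plenty of data.

```java
// Illustrative sketch only; names and structure are hypothetical, not this project's API.
final class ComplementNaiveBayesSketch {

    /**
     * termCounts[c][i] is how often term i occurs in the training documents of category c.
     * Returns normalized log-weights estimated from the complement of each category.
     */
    static double[][] complementWeights(int[][] termCounts, double alpha) {
        int categories = termCounts.length;
        int terms = termCounts[0].length;

        // Totals over all categories, used to derive complement counts cheaply.
        long[] totalPerTerm = new long[terms];
        long grandTotal = 0;
        for (int[] row : termCounts) {
            for (int i = 0; i < terms; i++) {
                totalPerTerm[i] += row[i];
                grandTotal += row[i];
            }
        }

        double[][] weights = new double[categories][terms];
        for (int c = 0; c < categories; c++) {
            long categoryTotal = 0;
            for (int count : termCounts[c]) {
                categoryTotal += count;
            }
            // Smoothed number of term occurrences outside category c.
            double denominator = (grandTotal - categoryTotal) + alpha * terms;
            double norm = 0.0;
            for (int i = 0; i < terms; i++) {
                double complementCount = totalPerTerm[i] - termCounts[c][i];
                weights[c][i] = Math.log((complementCount + alpha) / denominator);
                norm += Math.abs(weights[c][i]);
            }
            // Weight normalization: scale each category's weight vector to unit L1 norm.
            for (int i = 0; i < terms; i++) {
                weights[c][i] /= norm;
            }
        }
        return weights;
    }

    /** Picks the category whose complement fits the document worst (lowest score). */
    static int classify(double[][] weights, int[] documentTermCounts) {
        int best = -1;
        double bestScore = Double.POSITIVE_INFINITY;
        for (int c = 0; c < weights.length; c++) {
            double score = 0.0;
            for (int i = 0; i < documentTermCounts.length; i++) {
                score += documentTermCounts[i] * weights[c][i];
            }
            if (score < bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }
}
```

For example, with term counts {90, 10, 0} for a large category and {0, 1, 9} for a small one, a document containing only the third term is assigned to the small category, even though that category has far fewer training examples.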

Usage

complement-naive-bayes can be used either as a library that exposes an API for training and labeling new products, or as a standalone command line application.

Command line interface

To use complement-naive-bayes from the command line:

clone the repo:

git clone https://github.com/wolny/complement-naive-bayes.git

go to the project directory and build the executable jar:

cd complement-naive-bayes
./gradlew jar

invoke java -jar complement-naive-bayes-{version}.jar to see the options:

The following option is required: -c, --command
Usage: <main class> [options]
  Options:
  * -c, --command
       Command for the classifier, can be 'train' for training, 'label' for
       label assignment, or 'validate' for validating the classifier accuracy
    -m, --multithreaded
       Use multi-threaded model (true/false)
       Default: false
    -o, --outputModel
       Output file for the model. Option valid only for training.
       Default: ~/.cbayes/model.json
    -te, --testDir
       Input directory containing product files for labeling/validation
       Default: ~/.cbayes/test
    -tr, --trainDir
       Input directory containing product files for training
       Default: ~/.cbayes/train

put your JSON training product files in trainDir and train your model:

java -jar complement-naive-bayes-{version}.jar -c train --trainDir trainDir

put your JSON test product files in testDir and validate your model:

java -jar complement-naive-bayes-{version}.jar -c validate --testDir testDir

put the JSON product files that you want to label in testDir and label your products:

java -jar complement-naive-bayes-{version}.jar -c label --testDir testDir

Important note: because all products need to be loaded into memory for training, make sure to run the app with a sufficient heap size (-Xmx<memory>).
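
As a quick way to confirm that the -Xmx setting took effect, the JVM can report its own heap limit. The sketch below uses only the standard Runtime API; the class and method names are hypothetical and not part of this project.

```java
// Illustrative sketch only; HeapCheckSketch is a hypothetical helper, not part of this project.
final class HeapCheckSketch {

    /** Maximum heap the JVM will attempt to use, in megabytes (reflects -Xmx when set). */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** True when the configured heap limit is at least the requested size. */
    static boolean hasEnoughHeap(long requiredMegabytes) {
        return maxHeapMegabytes() >= requiredMegabytes;
    }
}
```

A check such as hasEnoughHeap(4096) before loading the training set fails fast with a clear message instead of an OutOfMemoryError halfway through training; the right threshold depends on the size of your data.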

JSON product files

trainDir/testDir must contain product files in JSON format. Each file must contain a list of products with the following JSON schema:

[
    {
        "id": 1,
        "sellerId": 12,
        "category": 123,
        "title": "test title1",
        "description": "test description1"
    },
    {
        "id": 2,
        "sellerId": 23,
        "category": 123,
        "title": "test title2",
        "description": "test description2"
    }
]

For training, the categoryId, title, description and sellerId attributes are required. sellerId is needed to filter out products of the same seller within a given category in order to avoid Seller Bias.
For testing, only the title and description attributes are required.
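
The seller filtering mentioned above could look roughly like the following sketch. The exact rule this project applies is not documented here, so capping the number of products per (category, seller) pair is an assumption, as are the class names and the maxPerSeller parameter. The Product fields follow the JSON example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; the real filtering rule used by this project may differ.
final class SellerBiasFilterSketch {

    /** Minimal stand-in for a training product; field names follow the JSON example. */
    static final class Product {
        final long id;
        final long sellerId;
        final long category;

        Product(long id, long sellerId, long category) {
            this.id = id;
            this.sellerId = sellerId;
            this.category = category;
        }
    }

    /**
     * Keeps at most maxPerSeller products for each (category, sellerId) pair,
     * preserving the input order, so a single prolific seller cannot dominate a category.
     */
    static List<Product> limitPerSeller(List<Product> products, int maxPerSeller) {
        Map<String, Integer> seen = new HashMap<>();
        List<Product> kept = new ArrayList<>();
        for (Product product : products) {
            String key = product.category + ":" + product.sellerId;
            // merge returns the updated count for this (category, seller) pair
            int count = seen.merge(key, 1, Integer::sum);
            if (count <= maxPerSeller) {
                kept.add(product);
            }
        }
        return kept;
    }
}
```

Without such a filter, a seller who lists hundreds of near-identical items in one category can teach the model that seller's wording rather than the vocabulary of the category itself.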
For now, only English is supported, but it is easy to add support for other languages: create a Tokenizer for the given language and train the model using that Tokenizer.
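
As a rough illustration of what a tokenizer for another language might do, the sketch below splits text into lowercase word tokens with the JDK's locale-aware BreakIterator. The project's actual Tokenizer interface is not shown in this README, so the class and method signature here are hypothetical; adapt them to whatever the interface requires.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; this does not implement the project's Tokenizer interface.
final class LocaleTokenizerSketch {

    /** Splits text into lowercase word tokens using locale-aware word boundaries. */
    static List<String> tokenize(String text, Locale locale) {
        List<String> tokens = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String candidate = text.substring(start, end);
            // Skip segments that are whitespace or punctuation.
            if (!candidate.isEmpty() && Character.isLetterOrDigit(candidate.codePointAt(0))) {
                tokens.add(candidate.toLowerCase(locale));
            }
        }
        return tokens;
    }
}
```

A production tokenizer would typically also handle stop words and stemming for the target language. Whatever is used, the same tokenizer has to be applied both when training and when labeling.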

API

The following snippet shows how to use an already trained model to label a sample product:

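// Guava and JDK imports used below (imports for the project's own classes are omitted)
import com.google.common.collect.Iterables;
import com.google.common.collect.Lists;
import java.util.List;

// read Naive Bayes model from JSON file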
String pathToModel = "./model.json";
NaiveBayesModel model = NaiveBayesSerializer.readFrom(pathToModel);

// create Complement Naive Bayes classifier
DocumentClassifier classifier = new WeightNormalizedComplementNaiveBayes(model);

// get the title and description of the product which is to be labeled
String title = "...";
String description = "...";
String text = title + " " + description;

// extract features, MAKE SURE THE SAME EXTRACTOR WAS USED DURING TRAINING PHASE
Document document = Extractors.STANDARD_EXTRACTOR.extractFeatureVector(text);

// label document
LabelingResult labelingResult = classifier.label(document);

// get categories ordered by score
List<LabelingResult.ScoredCategory> categories = labelingResult.getOrderedCategories();

// print 3 best category suggestions according to the model
System.out.println(Lists.newArrayList(Iterables.limit(categories, 3)));

... or use the following Play application
