From d076d864eece25ba91c792cb6ee49917f5147216 Mon Sep 17 00:00:00 2001
From: Andrew DalPino <me@andrewdalpino.com>
Date: Wed, 8 Jul 2020 05:02:36 -0500
Subject: [PATCH] Update to 0.1.0-rc5

Bump the rubix/ml requirement from ^0.1.0-rc3 to ^0.1.0-rc5 and pass a
maximum document frequency of 5000 to WordCountVectorizer in README.md
and train.php.

---
 README.md     | 4 ++--
 composer.json | 2 +-
 train.php     | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index bc110aa9..568a45f6 100644
--- a/README.md
+++ b/README.md
@@ -57,7 +57,7 @@ Neural networks compute a non-linear continuous function and therefore require c
 
 First, we'll convert all characters to lowercase and remove any extra whitespace using [Text Normalizer](https://docs.rubixml.com/en/latest/transformers/text-normalizer.html). Then, [Word Count Vectorizer](https://docs.rubixml.com/en/latest/transformers/word-count-vectorizer.html) is responsible for creating a continuous feature vector of word counts from the raw text and [TF-IDF Transformer](https://docs.rubixml.com/en/latest/transformers/tf-idf-transformer.html) applies a weighting scheme to those counts. Finally, [Z Scale Standardizer](https://docs.rubixml.com/en/latest/transformers/z-scale-standardizer.html) takes the TF-IDF weighted counts and centers and scales the sample matrix to have 0 mean and unit variance. This last step will help the neural network converge quicker.
 
-The Word Count Vectorizer is a bag-of-words feature extractor that uses a fixed vocabulary and term counts to quantify the words that appear in a particular document. We elect to limit the size of the vocabulary to 10,000 of the most frequent words that satisfy the criteria of appearing in at least 3 different documents. In this way, we limit the amount of *noise* words that enter the training set.
+The Word Count Vectorizer is a bag-of-words feature extractor that uses a fixed vocabulary and term counts to quantify the words that appear in a particular document. We elect to limit the vocabulary to the 10,000 most frequent words that appear in at least 3 different documents but no more than 5,000 documents. In this way, we limit the number of *noise* words that enter the training set.
 
 Another common feature representation for words are their [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) values which take the term frequencies (TF) from Word Count Vectorizer and weight them by their inverse document frequencies (IDF). IDFs can be interpreted as the word's *importance* within the text corpus. Specifically, higher weight is given to words that are more rare within the corpus.
 
@@ -84,7 +84,7 @@ use Rubix\ML\Persisters\Filesystem;
 $estimator = new PersistentModel(
     new Pipeline([
         new TextNormalizer(),
-        new WordCountVectorizer(10000, 3, new NGram(1, 2)),
+        new WordCountVectorizer(10000, 3, 5000, new NGram(1, 2)),
         new TfIdfTransformer(),
         new ZScaleStandardizer(),
     ], new MultilayerPerceptron([
diff --git a/composer.json b/composer.json
index b5f2013a..0039421f 100644
--- a/composer.json
+++ b/composer.json
@@ -23,7 +23,7 @@
     "require": {
         "php": ">=7.2",
         "league/csv": "^9.5",
-        "rubix/ml": "^0.1.0-rc3"
+        "rubix/ml": "^0.1.0-rc5"
     },
     "suggest": {
         "ext-tensor": "For faster training and inference"
diff --git a/train.php b/train.php
index dbedf1f1..731fb359 100644
--- a/train.php
+++ b/train.php
@@ -41,7 +41,7 @@
 $estimator = new PersistentModel(
     new Pipeline([
         new TextNormalizer(),
-        new WordCountVectorizer(10000, 3, new NGram(1, 2)),
+        new WordCountVectorizer(10000, 3, 5000, new NGram(1, 2)),
         new TfIdfTransformer(),
         new ZScaleStandardizer(),
     ], new MultilayerPerceptron([