diff --git a/opennlp-docs/src/docbkx/langdetect.xml b/opennlp-docs/src/docbkx/langdetect.xml index 865f5f943..7e901714f 100644 --- a/opennlp-docs/src/docbkx/langdetect.xml +++ b/opennlp-docs/src/docbkx/langdetect.xml @@ -147,7 +147,7 @@ lav Egija Tri-Active procedūru īpaši iesaka izmantot siltākajos gadalaik
Training Tool - The following command will train the language detector and write the model to langdetect.bin: + The following command will train the language detector and write the model to langdetect-custom.bin: +model.serialize(new File("langdetect-custom.bin"));]]>
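For readers reproducing the renamed langdetect example end to end, here is a minimal sketch of the training step that produces the serialized model above, in the style of the manual's other API samples. The training file name langdetect.train and the use of default training parameters are illustrative assumptions; only the model.serialize(...) call is part of this change.
<programlisting><![CDATA[
import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.langdetect.LanguageDetectorFactory;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;
import opennlp.tools.langdetect.LanguageDetectorSampleStream;
import opennlp.tools.langdetect.LanguageSample;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

// Each line of the (assumed) langdetect.train file holds a language code,
// a tab, and sample text in that language.
ObjectStream<LanguageSample> sampleStream = new LanguageDetectorSampleStream(
    new PlainTextByLineStream(
        new MarkableFileInputStreamFactory(new File("langdetect.train")),
        StandardCharsets.UTF_8));

// Train with default parameters and write the model file named above.
LanguageDetectorModel model = LanguageDetectorME.train(sampleStream,
    TrainingParameters.defaultParams(), new LanguageDetectorFactory());
model.serialize(new File("langdetect-custom.bin"));
]]></programlisting>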
diff --git a/opennlp-docs/src/docbkx/lemmatizer.xml b/opennlp-docs/src/docbkx/lemmatizer.xml index 44356e040..4c07b66ad 100644 --- a/opennlp-docs/src/docbkx/lemmatizer.xml +++ b/opennlp-docs/src/docbkx/lemmatizer.xml @@ -41,31 +41,31 @@ +$ opennlp LemmatizerME opennlp-en-ud-ewt-lemmas-1.2-2.5.0.bin < sentences]]> The Lemmatizer now reads one POS-tagged sentence per line from standard input. For example, you can copy this sentence to the console: +Rockwell_PROPN International_ADJ Corp_NOUN 's_PUNCT Tulsa_PROPN unit_NOUN said_VERB it_PRON +signed_VERB a_DET tentative_NOUN agreement_NOUN extending_VERB its_PRON contract_NOUN +with_ADP Boeing_PROPN Co._NOUN to_PART provide_VERB structural_ADJ parts_NOUN for_ADP +Boeing_PROPN 's_PUNCT 747_NUM jetliners_NOUN ._PUNCT]]> The Lemmatizer will now echo the lemmas for each word/postag pair to the console: @@ -89,7 +89,7 @@ signed VBD sign @@ -116,10 +116,10 @@ String[] tokens = new String[] { "Rockwell", "International", "Corp.", "'s", "provide", "structural", "parts", "for", "Boeing", "'s", "747", "jetliners", "." }; -String[] postags = new String[] { "NNP", "NNP", "NNP", "POS", "NNP", "NN", - "VBD", "PRP", "VBD", "DT", "JJ", "NN", "VBG", "PRP$", "NN", "IN", - "NNP", "NNP", "TO", "VB", "JJ", "NNS", "IN", "NNP", "POS", "CD", "NNS", - "." }; +String[] postags = new String[] { "PROPN", "ADJ", "NOUN", "PUNCT", "PROPN", "NOUN", + "VERB", "PRON", "VERB", "DET", "NOUN", "NOUN", "VERB", "PRON", "NOUN", "ADP", + "PROPN", "NOUN", "PART", "VERB", "ADJ", "NOUN", "ADP", "PROPN", "PUNCT", "NUM", "NOUN", + "PUNCT" }; String[] lemmas = lemmatizer.lemmatize(tokens, postags);]]> @@ -136,20 +136,20 @@ String[] lemmas = lemmatizer.lemmatize(tokens, postags);]]> corresponding lemma, each column separated by a tab character. Alternatively, if a (word,postag) pair can output multiple lemmas, the @@ -157,10 +157,10 @@ shrapnel NN shrapnel each row, a word, its postag and the corresponding lemmas separated by "#": First the dictionary must be loaded into memory from disk or another @@ -170,7 +170,7 @@ entramos V entrar @@ -217,22 +217,22 @@ String[] lemmas = lemmatizer.lemmatize(tokens, postags); Sample sentence of the training data: +1.8 NUM 1.8 +millions NOUN million +in ADP in +September PROPN september +. PUNCT O]]> The Universal Dependencies Treebank and the CoNLL 2009 datasets distribute training data for many languages. @@ -267,11 +267,11 @@ Arguments description: It is now assumed that the English lemmatizer model should be trained from a file called - 'en-lemmatizer.train' which is encoded as UTF-8. The following command will train the - lemmatizer and write the model to en-lemmatizer.bin: + 'en-custom-lemmatizer.train' which is encoded as UTF-8.
The following command will train the + lemmatizer and write the model to en-custom-lemmatizer.bin: +$ opennlp LemmatizerTrainerME -model en-custom-lemmatizer.bin -params PerceptronTrainerParams.txt -lang en -data en-custom-lemmatizer.train -encoding UTF-8]]> @@ -294,7 +294,7 @@ $ opennlp LemmatizerTrainerME -model en-lemmatizer.bin -params PerceptronTrainer InputStreamFactory inputStreamFactory = null; try { inputStreamFactory = new MarkableFileInputStreamFactory( - new File(en-lemmatizer.train)); + new File("en-custom-lemmatizer.train")); } catch (FileNotFoundException e) { e.printStackTrace(); } @@ -345,7 +345,7 @@ InputStreamFactory inputStreamFactory = null; The following command shows how the tool can be run: +$ opennlp LemmatizerEvaluator -model en-custom-lemmatizer.bin -data en-custom-lemmatizer.test -encoding utf-8]]> This will display the resulting accuracy score, e.g.: diff --git a/opennlp-docs/src/docbkx/postagger.xml b/opennlp-docs/src/docbkx/postagger.xml index 5f045e4f8..71a0fc388 100644 --- a/opennlp-docs/src/docbkx/postagger.xml +++ b/opennlp-docs/src/docbkx/postagger.xml @@ -41,7 +41,7 @@ under the License. Download the English pos model and start the POS Tagger Tool with this command: +$ opennlp POSTagger opennlp-en-ud-ewt-pos-1.2-2.5.0.bin]]> The POS Tagger now reads a tokenized sentence per line from stdin. Copy these two sentences to the console: @@ -53,9 +53,9 @@ Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .]]> The POS Tagger will now echo the sentences with pos tags to the console: +Pierre_PROPN Vinken_PROPN ,_PUNCT 61_NUM years_NOUN old_ADJ ,_PUNCT will_AUX join_VERB the_DET board_NOUN as_ADP + a_DET nonexecutive_ADJ director_NOUN Nov._PROPN 29_NUM ._PUNCT +Mr._PROPN Vinken_PROPN is_AUX chairman_NOUN of_ADP Elsevier_ADJ N.V._PROPN ,_PUNCT the_DET Dutch_PROPN publishing_VERB group_NOUN .]]> -The tag set used by the English pos model is the Penn Treebank tag set. +The tag set used by the English pos model is the Universal Dependencies (UD) tag set. @@ -69,7 +69,7 @@ Mr._NNP Vinken_NNP is_VBZ chairman_NN of_IN Elsevier_NNP N.V._NNP ,_, the_DT Dut In the sample below it is loaded from disk. @@ -125,8 +125,8 @@ Sequence[] topSequences = tagger.topKSequences(sent);]]> The native POS Tagger training material looks like this: +About_ADV 10_NUM Euro_PROPN ,_PUNCT I_PRON reckon_VERB ._PUNCT +That_PRON sounds_VERB good_ADJ ._PUNCT]]> Each sentence must be on a single line. The token/tag pairs are combined with "_". The token/tag pairs are whitespace separated.
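Before the data-format details that follow, a minimal sketch of tagging with the renamed UD model through the API; the model file name is taken from the command above, while the sample tokens are illustrative only.
<programlisting><![CDATA[
import java.io.File;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

// Load the renamed UD model from disk and tag one pre-tokenized sentence.
POSModel model = new POSModel(new File("opennlp-en-ud-ewt-pos-1.2-2.5.0.bin"));
POSTaggerME tagger = new POSTaggerME(model);

String[] sent = { "Pierre", "Vinken", ",", "61", "years", "old", "." };
String[] tags = tagger.tag(sent);  // one UD tag per token
double[] probs = tagger.probs();   // confidence of each predicted tag
]]></programlisting>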
The data format does not @@ -180,8 +180,8 @@ Arguments description: The following command illustrates how an English part-of-speech model can be trained: +$ opennlp POSTaggerTrainer -type maxent -model en-custom-pos-maxent.bin \ + -lang en -data en-custom-pos.train -encoding UTF-8]]> @@ -207,7 +207,8 @@ $ opennlp POSTaggerTrainer -type maxent -model en-pos-maxent.bin \ POSModel model = null; try { - ObjectStream<String> lineStream = new PlainTextByLineStream(new MarkableFileInputStreamFactory(new File("en-pos.train")), StandardCharsets.UTF_8); + ObjectStream<String> lineStream = new PlainTextByLineStream( + new MarkableFileInputStreamFactory(new File("en-custom-pos.train")), StandardCharsets.UTF_8); ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream); diff --git a/opennlp-docs/src/docbkx/sentdetect.xml b/opennlp-docs/src/docbkx/sentdetect.xml index 4e3a1db6d..11b047d31 100644 --- a/opennlp-docs/src/docbkx/sentdetect.xml +++ b/opennlp-docs/src/docbkx/sentdetect.xml @@ -63,13 +63,13 @@ Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, Download the English sentence detector model and start the Sentence Detector Tool with this command: +$ opennlp SentenceDetector opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin]]> Just copy the sample text from above to the console. The Sentence Detector will read it and echo one sentence per line to the console. Usually the input is read from a file and the output is redirected to another file. This can be achieved with the following command. output.txt]]> +$ opennlp SentenceDetector opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin < input.txt > output.txt]]> For the English sentence model from the website, the input text should not be tokenized. @@ -81,8 +81,7 @@ $ opennlp SentenceDetector en-sent.bin < input.txt > output.txt]]> To instantiate the Sentence Detector, the sentence model must be loaded first. @@ -148,7 +147,7 @@ Arguments description: To train an English sentence detector use the following command: It should produce the following output: @@ -183,7 +182,7 @@ Performing 100 iterations. 99: .. loglikelihood=-284.24296917223916 0.9834118369854598 100: .. loglikelihood=-283.2785335773966 0.9834118369854598 Wrote sentence detector model. -Path: en-sent.bin +Path: en-custom-sent.bin ]]> @@ -209,7 +208,7 @@ Path: en-sent.bin lineStream = - new PlainTextByLineStream(new MarkableFileInputStreamFactory(new File("en-sent.train")), StandardCharsets.UTF_8); + new PlainTextByLineStream(new MarkableFileInputStreamFactory(new File("en-custom-sent.train")), StandardCharsets.UTF_8); SentenceModel model; @@ -235,7 +234,7 @@ try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(model The command shows how the evaluator tool can be run: - The en-sent.eval file has the same format as the training data. + The en-custom-sent.eval file has the same format as the training data. diff --git a/opennlp-docs/src/docbkx/tokenizer.xml b/opennlp-docs/src/docbkx/tokenizer.xml index 3627d8253..c68b5ced2 100644 --- a/opennlp-docs/src/docbkx/tokenizer.xml +++ b/opennlp-docs/src/docbkx/tokenizer.xml @@ -66,18 +66,15 @@ A form of asbestos once used to make Kent cigarette filters has caused a high Most part-of-speech taggers, parsers, and so on work with text tokenized in this manner. It is important to ensure that your - tokenizer - produces tokens of the type expected by your later text - processing - components. + tokenizer produces tokens of the type expected by your later text + processing components.
With OpenNLP (as with many systems), tokenization is a two-stage process: first, sentence boundaries are identified, then tokens within - each - sentence are identified. + each sentence are identified.
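Since the rewrapped paragraph above describes the two-stage process only in prose, a brief sketch of the same pipeline with the renamed 2.5.0 models may help; the model file names come from the hunks in this diff, while the raw input string is made up.
<programlisting><![CDATA[
import java.io.File;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// Stage 1: detect sentence boundaries. Stage 2: detect tokens within each sentence.
SentenceDetectorME sentenceDetector = new SentenceDetectorME(
    new SentenceModel(new File("opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin")));
TokenizerME tokenizer = new TokenizerME(
    new TokenizerModel(new File("opennlp-en-ud-ewt-tokens-1.2-2.5.0.bin")));

String rawText = "Pierre Vinken is 61 years old. Mr. Vinken is chairman of Elsevier N.V.";
for (String sentence : sentenceDetector.sentDetect(rawText)) {
  String[] tokens = tokenizer.tokenize(sentence);
  // hand the tokens to later components, e.g. the POS tagger
}
]]></programlisting>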
@@ -100,7 +97,7 @@ $ opennlp SimpleTokenizer]]> our website. +$ opennlp TokenizerME opennlp-en-ud-ewt-tokens-1.2-2.5.0.bin]]> To test the tokenizer, copy the sample from above to the console. The whitespace-separated tokens will be written back to the @@ -110,7 +107,7 @@ $ opennlp TokenizerME en-token.bin]]> Usually the input is read from a file and written to a file. article-tokenized.txt]]> +$ opennlp TokenizerME opennlp-en-ud-ewt-tokens-1.2-2.5.0.bin < article.txt > article-tokenized.txt]]> It can be done in the same way for the Simple Tokenizer. @@ -154,8 +151,7 @@ London share prices were bolstered largely by continued gains on Wall Street and can be loaded. @@ -258,7 +254,7 @@ Arguments description: To train the English tokenizer use the following command: +Path: en-custom-token.bin]]>
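As a companion to the TokenizerME commands above, a small sketch of the API side: besides plain token strings, the learnable tokenizer can report character offsets and per-token probabilities. Only the model file name is taken from this diff; the input sentence is illustrative.
<programlisting><![CDATA[
import java.io.File;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

TokenizerME tokenizer = new TokenizerME(
    new TokenizerModel(new File("opennlp-en-ud-ewt-tokens-1.2-2.5.0.bin")));

String text = "An input sample sentence.";
String[] tokens = tokenizer.tokenize(text);          // plain token strings
Span[] spans = tokenizer.tokenizePos(text);          // character offsets of each token
double[] probs = tokenizer.getTokenProbabilities();  // confidence for the last call
]]></programlisting>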
@@ -314,7 +310,7 @@ Path: en-token.bin]]> The following sample code illustrates these steps: -ObjectStream<String> lineStream = new PlainTextByLineStream(new MarkableFileInputStreamFactory(new File("en-sent.train")), +ObjectStream<String> lineStream = new PlainTextByLineStream(new MarkableFileInputStreamFactory(new File("en-custom-token.train")), StandardCharsets.UTF_8); ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);
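The hunk above ends right after the sample stream is created; for completeness, a sketch of how the training code typically continues in the 2.x API. The factory arguments and the en-custom-token.bin output name mirror the renamed examples in this diff but are assumptions here, not quoted manual text.
<programlisting><![CDATA[
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.TrainingParameters;

TokenizerModel model;
// lineStream is the ObjectStream<String> created in the snippet above.
try (ObjectStream<TokenSample> samples = new TokenSampleStream(lineStream)) {
  // "en" language code, no abbreviation dictionary, alphanumeric optimization enabled.
  model = TokenizerME.train(samples,
      new TokenizerFactory("en", null, true, null),
      TrainingParameters.defaultParams());
}

// Persist the trained model under the name reported by the CLI output above.
try (OutputStream modelOut = new BufferedOutputStream(
    new FileOutputStream("en-custom-token.bin"))) {
  model.serialize(modelOut);
}
]]></programlisting>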