A Lucene plugin based on [Sudachi](https://github.com/WorksApplications/Sudachi)


### Table of Contents
* [Solr Lucene Analyzer Sudachi plugin philosophy](#solr-lucene-analyzer-sudachi-plugin-philosophy)
* [Plugin compatibility with Lucene and Solr](#plugin-compatibility-with-lucene-and-solr)
* [Solr field analysis screen](#solr-field-analysis-screen)
* [Functional tests](#functional-tests)
* [End-to-End tests](#end-to-end-tests)
* [Smoke tests](#smoke-tests)
* [Appendix - Lucene Japanese morphological analysis landscape](#appendix---lucene-japanese-morphological-analysis-landscape)
* [Lucene Kuromoji morphological analyzer](#lucene-kuromoji-morphological-analyzer)
* [What is MeCab](#what-is-mecab)
* [How MeCab-based tokenizers work](#how-mecab-based-tokenizers-work)
* [What is IPA dictionary](#what-is-ipa-dictionary)
* [What is UniDic dictionary](#what-is-unidic-dictionary)
* [Why the built-in Lucene Kuromoji module may impact Japanese search accuracy](#why-the-built-in-lucene-kuromoji-module-may-impact-japanese-search-accuracy)
* [Conclusion](#conclusion)
* [Licenses](#licenses)
* [Sudachi and Sudachi Logo](#sudachi-and-sudachi-logo)
* [Lucene and Lucene Logo](#lucene-and-lucene-logo)
* [Current work](#current-work)
<!-- TOC -->

## Solr Lucene Analyzer Sudachi plugin philosophy

The plugin strives, where possible, to:

### Smoke tests

Smoke tests utilize Docker Solr images to deploy the built plugin JAR into a Solr application. These tests are not automated (i.e.: they do not run in CI) and should be executed manually. You can find the Dockerfiles under [src/smokeTest](src/smokeTest).

[`Back to top`](#table-of-contents)

## Appendix - Lucene Japanese morphological analysis landscape

Tokenization, or morphological analysis, is a fundamental and important technology for processing Japanese text, especially in industrial applications. Unlike English text, where whitespace separates words, Japanese text carries no explicit word boundary information. Recognizing words within a text is therefore not obvious, and the morphological analysis of a token (segmentation + part-of-speech tagging) in Japanese is not trivial. Over time, various morphological analysis tools have been developed, each following a different segmentation standard.

[`Back to top`](#table-of-contents)

### Lucene Kuromoji morphological analyzer

The Lucene "Kuromoji" is a built-in MeCab-style Japanese morphological analysis component that provides analysis/tokenization capabilities. By default, Kuromoji leverages [under the hood](https://github.com/apache/lucene/blob/2a3e5ca07f5df1f5080b5cb54ff104b7924e99f3/gradle/generation/kuromoji.gradle#L50-L97) the [MeCab tokenizer’s “IPA” dictionary (ja)](https://taku910.github.io/mecab/).

The Kuromoji analyzer has its roots in the analyzer made by [Atilika](https://www.atilika.org/), a small NLP company in Tokyo. Atilika donated the Kuromoji codebase (see [LUCENE-3305](https://issues.apache.org/jira/browse/LUCENE-3305)) to the Apache Software Foundation as of Apache Lucene and Apache Solr v3.6. Since then, the Atilika and Lucene Kuromoji implementations have diverged, and [Atilika Kuromoji](https://github.com/atilika/kuromoji) appears to be abandoned.

[`Back to top`](#table-of-contents)

### What is MeCab

[MeCab (ja)](https://taku910.github.io/mecab/) is an open source morphological analysis engine developed through a joint research unit project between Kyoto University Graduate School of Informatics and Nippon Telegraph and Telephone Corporation's Communication Science Research Institute.

MeCab was created by [Taku Kudo](http://chasen.org/~taku/) around 2007. Its breakthrough was leveraging the [CRF algorithm](https://en.wikipedia.org/wiki/Conditional_random_field) (Conditional Random Fields) to train a segmentation model and to build a word dictionary by [utilizing the trained model](https://taku910.github.io/mecab/).

[`Back to top`](#table-of-contents)

### How MeCab-based tokenizers work

A MeCab-style tokenizer builds a graph-like structure (i.e.: a lattice) to represent the input text (i.e.: its candidate terms/words) and finds the best connected path through that graph by leveraging the Viterbi algorithm.

For lattice-based tokenizers, a dictionary is a data structure that provides a list of known terms or words, as well as how likely those terms are to appear next to each other (i.e.: the connection cost) according to Japanese grammar or some statistical probability. During the tokenization process, the tokenizer uses this dictionary metadata to tokenize the input text. The objective of the tokenizer is to find the tokenization that maximizes the sum of the phrase scores (equivalently, minimizes the total path cost).

To expand on the dictionary: a dictionary is not a mere "word collection"; it includes a machine-learned language model which is carefully trained (for example, with the help of the [MeCab CLI (ja)](https://taku910.github.io/mecab/learn.html)). If you want to update the dictionary, you have to start by re-training the model on a larger / fresher lexicon.
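
To make the lattice-and-Viterbi idea concrete, here is a deliberately tiny, self-contained sketch (it is **not** Sudachi's or Kuromoji's actual implementation): the dictionary entries, word costs, flat connection cost, and sample input are all made up, unknown-word handling is omitted, and the search keeps only a single best predecessor per position, whereas a real tokenizer scores every adjacent pair of lattice nodes against a trained connection-cost matrix:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// A deliberately tiny lattice + Viterbi sketch (illustration only).
public class LatticeViterbiSketch {

    // A lattice node covering input[start, end) with its dictionary word cost.
    record Node(int start, int end, String surface, int wordCost) {}

    // Toy dictionary: surface form -> word cost (lower is better).
    private static final Map<String, Integer> DICTIONARY = Map.of(
            "東京", 500, "都", 800, "東京都", 400, "に", 300, "住む", 600);

    // A real tokenizer looks this up in a trained matrix keyed by the part-of-speech
    // ids of the left and right entries; a flat cost is enough for the illustration.
    private static int connectionCost(final Node left, final Node right) {
        return 100;
    }

    public static void main(final String[] args) {
        final String input = "東京都に住む";
        final int length = input.length();

        // 1. Build the lattice: every dictionary entry matching at every offset becomes a node.
        final List<List<Node>> nodesStartingAt = new ArrayList<>();
        for (int i = 0; i <= length; i++) {
            nodesStartingAt.add(new ArrayList<>());
        }
        for (int start = 0; start < length; start++) {
            for (int end = start + 1; end <= length; end++) {
                final String surface = input.substring(start, end);
                final Integer wordCost = DICTIONARY.get(surface);
                if (wordCost != null) {
                    nodesStartingAt.get(start).add(new Node(start, end, surface, wordCost));
                }
            }
        }

        // 2. Viterbi pass: keep the cheapest path reaching every character position
        //    (simplified to a single best predecessor per position).
        final int[] bestCost = new int[length + 1];
        final Node[] bestLastNode = new Node[length + 1];
        Arrays.fill(bestCost, Integer.MAX_VALUE);
        bestCost[0] = 0;
        for (int position = 0; position < length; position++) {
            if (bestCost[position] == Integer.MAX_VALUE) {
                continue;
            }
            for (final Node node : nodesStartingAt.get(position)) {
                final Node previous = bestLastNode[position];
                final int cost = bestCost[position] + node.wordCost()
                        + (previous == null ? 0 : connectionCost(previous, node));
                if (cost < bestCost[node.end()]) {
                    bestCost[node.end()] = cost;
                    bestLastNode[node.end()] = node;
                }
            }
        }

        // 3. Backtrack from the end of the input to recover the cheapest segmentation.
        final List<String> tokens = new ArrayList<>();
        for (int position = length; position > 0; position = bestLastNode[position].start()) {
            tokens.add(0, bestLastNode[position].surface());
        }
        System.out.println(tokens); // [東京都, に, 住む]
    }
}
```

With these toy costs the input 東京都に住む comes out as `[東京都, に, 住む]`, because the single dictionary entry 東京都 is cheaper than the combined path 東京 + 都.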

[`Back to top`](#table-of-contents)

### What is IPA dictionary

The IPA dictionary is MeCab's so-called "standard dictionary", characterized by a more intuitive separation of morphological units than UniDic. In contrast, UniDic splits a sentence into smaller units, which helps retrieval. UniDic is a dictionary based on "[short units](https://clrd.ninjal.ac.jp/bccwj/en/morphology.html)" (短単位, read as "tantani") as defined by NINJAL (National Institute for Japanese Language and Linguistics), which produces and maintains the UniDic dictionary.

From a Japanese full-text search perspective, consistency of the tokenization (regardless of the length of the text) is more important. Therefore, the UniDic dictionary is more suitable for Japanese full-text information retrieval: the dictionary is well maintained by researchers at NINJAL (to the best of my knowledge), and its shorter lexical units make it more suitable for splitting words when searching (tokenization is more fine-grained than with the IPA dictionary). For example, if a compound like 関西国際空港 ("Kansai International Airport") is indexed as one long token, a query for just 空港 ("airport") will not match it, whereas short-unit segmentation (関西 / 国際 / 空港) allows it to.

[`Back to top`](#table-of-contents)

### What is UniDic dictionary

UniDic dictionaries are produced by NLP researchers at NINJAL (National Institute for Japanese Language and Linguistics); they are based on the BCCWJ corpus and leverage the MeCab-style dictionary format.

The "Balanced Corpus of Contemporary Written Japanese" (BCCWJ) is a corpus created to grasp the breadth of contemporary written Japanese; it contains extensive samples of modern Japanese texts in order to be as balanced a corpus as possible.

The data comprises ~104.3 million words, covering genres such as general books and magazines, newspapers, business reports, blogs, internet forums, textbooks, and legal documents, among others. Random samples of each genre were morphologically analyzed for the purpose of creating the dictionary.

Thus, UniDic is a lexicon (i.e.: a collection of morphemes) of the BCCWJ core data (about a couple of percent of the whole corpus is manually annotated with part of speech and other information). The approximate UniDic size is ~20-30k sentences.

As a supplementary fun read, have a look at the excellent article that outlines the [Differences between IPADic and UniDic](https://github.com/ikawaha/kagome/wiki/About-the-dictionary#differences-between-ipadic-and-unidic), written by the author of the Go-based [Kagome](https://github.com/ikawaha/kagome) tokenizer (**TL;DR**: UniDic is more advantageous for lexical search purposes).

Thus, all of the above makes UniDic (which is the dictionary that the Sudachi tokenizer leverages) the best choice of dictionary for a MeCab-based tokenizer.

[`Back to top`](#table-of-contents)

### Why the built-in Lucene Kuromoji module may impact Japanese search accuracy

The MeCab IPA dictionary ([bundled within Lucene Kuromoji](https://github.com/apache/lucene/blob/2a3e5ca07f5df1f5080b5cb54ff104b7924e99f3/gradle/generation/kuromoji.gradle#L57-L60) by default) dates back to 2007. This means that there is a _high likelihood_ that some newer words / proper nouns that came into use after 2007 (e.g.: the new Japanese imperial era `令和` (read as "Reiwa"), people's names, manga/anime/brand/place names, etc.) _may not_ be tokenized correctly. "Not correctly" here means under-tokenized or over-tokenized.

Although [support for the current Japanese imperial era "Reiwa" (令和) has been added to Lucene Kuromoji by Uchida Tomoko](https://github.com/apache/lucene/commit/7619c07d3a80bb781f688c2cbbff33024142670a), many post-2007 (i.e.: more modern) words have no explicit support from the Lucene Kuromoji maintainers.

[`Back to top`](#table-of-contents)

### Conclusion

Adopting a more up-to-date dictionary can directly influence the search quality and accuracy of the first-phase retrieval, i.e. the Solr output. Depending on the business domain of a company that leverages search as a core function, this may create more or fewer issues.

Therefore, **Solr Lucene Analyzer Sudachi** is a reasonable choice for those who are interested in running their Solr ecosystem on more up-to-date Japanese morphological analysis tooling.

[`Back to top`](#table-of-contents)
