diff --git a/.gitignore b/.gitignore
index 2bffc7a..84e2a23 100644
--- a/.gitignore
+++ b/.gitignore
@@ -5,3 +5,4 @@ Gemfile.lock
 doc/*
 pkg/*
 coverage/*
+vendor/*
diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 0000000..96c2cc1
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,31 @@
+# Changelog
+
+## v0.2.0
+
+### Added
+
+- Add `tokenizer` option to `Document` class
+
+  The value is an object with a `tokenize` method that accepts a string and returns an array of `Token` instances.
+
+  For example, to use [natto](https://rubygems.org/gems/natto) instead of [unicode_utils](https://rubygems.org/gems/unicode_utils) for Japanese, install MeCab (`brew install mecab`), and then:
+
+  ```ruby
+  require 'natto'
+
+  class Tokenizer
+    def initialize
+      @nm = Natto::MeCab.new
+    end
+
+    def tokenize(text)
+      @nm.enum_parse(text).map do |node|
+        Token.new(node)
+      end
+    end
+  end
+
+  document = TfIdfSimilarity::Document.new("こんにちは世界", tokenizer: tokenizer)
+  ```
+
+- Add `to_s` method to `Token` class, to use less memory than chaining `lowercase_filter` with `classic_filter`