Merge pull request #29 from jpmckinney/satoryu-define_tokenizer
Satoryu define tokenizer
jpmckinney authored Nov 17, 2019
2 parents 00777f8 + 3719492 commit 0d6a631
Showing 7 changed files with 40 additions and 14 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -4,3 +4,4 @@
 Gemfile.lock
 doc/*
 pkg/*
+coverage/*
2 changes: 1 addition & 1 deletion .travis.yml
@@ -18,7 +18,7 @@ addons:
     # Installing ATLAS will install BLAS.
     - libatlas-dev
     - libatlas-base-dev
-    - libatlas3gf-base
+    - libatlas3-base
 before_install:
   - bundle config build.nmatrix --with-lapacklib
   - export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/include/atlas
10 changes: 5 additions & 5 deletions README.md
@@ -1,11 +1,11 @@
-# Ruby Vector Space Model (VSM) with tf*idf weights
+# Ruby Vector Space Model (VSM) with tf\*idf weights
 
 [![Gem Version](https://badge.fury.io/rb/tf-idf-similarity.svg)](https://badge.fury.io/rb/tf-idf-similarity)
 [![Build Status](https://secure.travis-ci.org/jpmckinney/tf-idf-similarity.png)](https://travis-ci.org/jpmckinney/tf-idf-similarity)
 [![Coverage Status](https://coveralls.io/repos/jpmckinney/tf-idf-similarity/badge.png)](https://coveralls.io/r/jpmckinney/tf-idf-similarity)
 [![Code Climate](https://codeclimate.com/github/jpmckinney/tf-idf-similarity.png)](https://codeclimate.com/github/jpmckinney/tf-idf-similarity)
 
-Calculates the similarity between texts using a [bag-of-words](https://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](https://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency (tf*idf)](https://en.wikipedia.org/wiki/Tf–idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) (see below).
+Calculates the similarity between texts using a [bag-of-words](https://en.wikipedia.org/wiki/Bag_of_words_model) [Vector Space Model](https://en.wikipedia.org/wiki/Vector_space_model) with [Term Frequency-Inverse Document Frequency (tf\*idf)](https://en.wikipedia.org/wiki/Tf–idf) weights. If your use case demands performance, use [Lucene](http://lucene.apache.org/core/) (see below).
 
 ## Usage
 
@@ -47,7 +47,7 @@ Find the similarity of two documents in the matrix:
 matrix[model.document_index(document1), model.document_index(document2)]
 ```
 
-Print the tf*idf values for terms in a document:
+Print the tf\*idf values for terms in a document:
 
 ```ruby
 tfidf_by_term = {}
@@ -113,11 +113,11 @@ You can access more term frequency, document frequency, and normalization formul
 require 'tf-idf-similarity/extras/document'
 require 'tf-idf-similarity/extras/tf_idf_model'
 
-The default tf*idf formula follows the [Lucene Conceptual Scoring Formula](http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).
+The default tf\*idf formula follows the [Lucene Conceptual Scoring Formula](http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).
 
 ## Why?
 
-At the time of writing, no other Ruby gem implemented the tf*idf formula used by Lucene, Sphinx and Ferret.
+At the time of writing, no other Ruby gem implemented the tf\*idf formula used by Lucene, Sphinx and Ferret.
 
 * [rsemantic](https://github.com/josephwilk/rsemantic) now uses the same [term frequency](https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L14) and [document frequency](https://github.com/josephwilk/rsemantic/blob/master/lib/semantic/transform/tf_idf_transform.rb#L13) formulas as Lucene.
 * [treat](https://github.com/louismullie/treat) offers many term frequency formulas, [one of which](https://github.com/louismullie/treat/blob/master/lib/treat/workers/extractors/tf_idf/native.rb#L13) is the same as Lucene.
3 changes: 0 additions & 3 deletions lib/tf-idf-similarity.rb
@@ -1,9 +1,6 @@
 require 'forwardable'
 require 'set'
 
-require 'unicode_utils/downcase'
-require 'unicode_utils/each_word'
-
 module TfIdfSimilarity
 end
 
12 changes: 7 additions & 5 deletions lib/tf-idf-similarity/document.rb
@@ -1,3 +1,5 @@
+require 'tf-idf-similarity/tokenizer'
+
 # A document.
 module TfIdfSimilarity
   class Document
@@ -19,7 +21,8 @@ class Document
     def initialize(text, opts = {})
       @text = text
       @id = opts[:id] || object_id
-      @tokens = opts[:tokens]
+      @tokens = Array(opts[:tokens]).map { |t| Token.new(t) } if opts[:tokens]
+      @tokenizer = opts[:tokenizer] || Tokenizer.new
 
       if opts[:term_counts]
         @term_counts = opts[:term_counts]
@@ -51,10 +54,9 @@ def term_count(term)
 
     # Tokenizes the text and counts terms and total tokens.
     def set_term_counts_and_size
-      tokenize(text).each do |word|
-        token = Token.new(word)
+      tokenize(text).each do |token|
         if token.valid?
-          term = token.lowercase_filter.classic_filter.to_s
+          term = token.to_s
           @term_counts[term] += 1
           @size += 1
         end
@@ -76,7 +78,7 @@ def set_term_counts_and_size
     # @see http://unicode.org/reports/tr29/#Default_Word_Boundaries
     # @see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StandardTokenizerFactory
     def tokenize(text)
-      @tokens || UnicodeUtils.each_word(text)
+      @tokens || @tokenizer.tokenize(text)
     end
   end
 end
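With this change, `Document` no longer hardcodes UnicodeUtils: `tokenize` delegates to any object passed via `opts[:tokenizer]` that responds to `#tokenize(text)`. A minimal sketch of such a custom tokenizer (the `WhitespaceTokenizer` class and the sample text are ours, not part of the gem):

```ruby
# Hypothetical custom tokenizer: Document#tokenize only requires an object
# responding to #tokenize(text) that returns token-like objects.
class WhitespaceTokenizer
  # Split on runs of whitespace; the gem's own Tokenizer yields Token
  # objects (a String subclass) instead of plain strings.
  def tokenize(text)
    text.split(/\s+/)
  end
end

tokens = WhitespaceTokenizer.new.tokenize('Hello world, hello Ruby')
# => ["Hello", "world,", "hello", "Ruby"]
```

With the gem loaded, this would be wired in as `TfIdfSimilarity::Document.new(text, tokenizer: WhitespaceTokenizer.new)`.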
7 changes: 7 additions & 0 deletions lib/tf-idf-similarity/token.rb
@@ -1,5 +1,7 @@
 # coding: utf-8
 require 'delegate'
+require 'unicode_utils/downcase'
+require 'unicode_utils/each_word'
 
 # A token.
 #
@@ -47,5 +49,10 @@ def lowercase_filter
     def classic_filter
       self.class.new(self.gsub('.', '').sub(/['`’]s\z/, ''))
     end
+
+    def to_s
+      # Don't call #lowercase_filter and #classic_filter to avoid creating unnecessary objects.
+      UnicodeUtils.downcase(self).gsub('.', '').sub(/['`’]s\z/, '')
+    end
   end
 end
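The new `Token#to_s` collapses the lowercase and classic filters into a single pass. A stand-alone sketch of the same normalization, substituting `String#downcase` for `UnicodeUtils.downcase` (adequate for ASCII input; the method name `normalize` is ours):

```ruby
# Sketch of the combined filter chain in Token#to_s: downcase, strip
# periods (as in acronyms), and drop a trailing possessive 's.
# String#downcase stands in for UnicodeUtils.downcase here.
def normalize(token)
  token.downcase.gsub('.', '').sub(/['`’]s\z/, '')
end

normalize("U.S.A.")  # => "usa"
normalize("Ruby's")  # => "ruby"
```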
19 changes: 19 additions & 0 deletions lib/tf-idf-similarity/tokenizer.rb
@@ -0,0 +1,19 @@
+require 'unicode_utils/each_word'
+require 'tf-idf-similarity/token'
+
+# A tokenizer using UnicodeUtils to tokenize a text.
+#
+# @see https://github.com/lang/unicode_utils
+module TfIdfSimilarity
+  class Tokenizer
+    # Tokenizes a text.
+    #
+    # @param [String] text
+    # @return [Enumerator] an enumerator of Token objects
+    def tokenize(text)
+      UnicodeUtils.each_word(text).map do |word|
+        Token.new(word)
+      end
+    end
+  end
+end
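Note that `UnicodeUtils.each_word` performs Unicode word segmentation (UAX #29), yielding every boundary segment, including runs of whitespace and punctuation, which is why `Document` keeps only tokens where `token.valid?` is true. A rough stand-in using a regex scan (the method name `each_word_approx` is ours; this approximates, but does not reproduce, the UAX #29 rules):

```ruby
# Approximate sketch of word-boundary segmentation: alternate runs of
# word characters and non-word characters, so punctuation and spaces
# come back as segments too, just as UnicodeUtils.each_word yields them.
def each_word_approx(text)
  text.scan(/[[:word:]]+|[^[:word:]]+/)
end

each_word_approx('Hello, world')  # => ["Hello", ", ", "world"]
```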
