SymSpellChecker.jl

A Julia port of SymSpell, an extremely fast spelling correction and fuzzy search algorithm.

TL;DR

using SymSpellChecker

d = SymSpell()
push!(d, "hello")
push!(d, "world")

d["wrold"] # ["world"]

Dictionary creation

Dictionaries can be created as follows:

using SymSpellChecker

# Loading from file
d = SymSpell("assets/frequency_dictionary_en_30_000.txt")

# Manual update
d = SymSpell()
push!(d, "hello", 100)
push!(d, "world", 50)

The third argument of push! is the word frequency, which lookup later uses to sort results from the highest frequency to the lowest.

The SymSpell constructor accepts the following arguments:

  • max_dictionary_edit_distance: maximum allowed search distance. High values of this argument require a lot of memory. Default value is 2.
  • prefix_length: prefix length used to generate candidates. Higher values correspond to higher memory requirements but shorter search times. Default value is 5.
  • count_threshold: words with frequencies below this threshold are not shown in search results.
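
To give a feel for why the first two arguments affect memory, here is a minimal sketch of the delete-variant generation that the SymSpell approach relies on. It does not use the package's internals: delete_variants is a hypothetical helper written for illustration, and the real index may store and deduplicate variants differently.

```julia
# Hypothetical helper, not part of SymSpellChecker.jl: collect every string
# that can be obtained from the first `prefix_length` characters of `word`
# by deleting at most `max_distance` characters.
function delete_variants(word::AbstractString, max_distance::Integer, prefix_length::Integer)
    prefix = String(first(word, prefix_length))
    variants = Set{String}([prefix])
    frontier = Set{String}([prefix])
    for _ in 1:max_distance
        next_frontier = Set{String}()
        for w in frontier
            chars = collect(w)
            for i in eachindex(chars)
                # Drop the i-th character and keep the rest.
                push!(next_frontier, String(vcat(chars[1:i-1], chars[i+1:end])))
            end
        end
        union!(variants, next_frontier)
        frontier = next_frontier
    end
    return variants
end

# The number of stored variants grows with both arguments.
println(length(delete_variants("world", 1, 5)))
println(length(delete_variants("world", 2, 5)))
println(length(delete_variants("world", 2, 3)))

# A misspelling and the correct word share at least one variant,
# which is what makes the lookup possible.
println(intersect(delete_variants("world", 2, 5), delete_variants("wrold", 2, 5)))
```

Every dictionary word contributes its own set of variants, so a larger edit distance or a longer prefix multiplies the size of the index.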

Lookup procedure

A word search can be performed as follows:

lookup(d, "wrold") # [SuggestItem("world", 1, 50)]

Here 1 is the Damerau-Levenshtein distance between world and wrold, and 50 is the word's frequency in the current dictionary.
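
For readers unfamiliar with the metric, the following is a minimal sketch of the restricted Damerau-Levenshtein (optimal string alignment) distance, which counts a swap of two adjacent characters as a single edit. The name osa_distance is hypothetical, and whether the package uses exactly this variant internally is an assumption.

```julia
# Illustrative only: `osa_distance` is a hypothetical name, not a package function.
# Computes the restricted Damerau-Levenshtein (optimal string alignment) distance.
function osa_distance(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    m, n = length(s), length(t)
    # d[i + 1, j + 1] is the distance between the first i characters of s
    # and the first j characters of t.
    d = zeros(Int, m + 1, n + 1)
    d[:, 1] .= 0:m
    d[1, :] .= 0:n
    for i in 1:m, j in 1:n
        cost = s[i] == t[j] ? 0 : 1
        d[i + 1, j + 1] = min(d[i, j + 1] + 1,  # deletion
                              d[i + 1, j] + 1,  # insertion
                              d[i, j] + cost)   # substitution
        if i > 1 && j > 1 && s[i] == t[j - 1] && s[i - 1] == t[j]
            # Transposition of two adjacent characters.
            d[i + 1, j + 1] = min(d[i + 1, j + 1], d[i - 1, j - 1] + 1)
        end
    end
    return d[m + 1, n + 1]
end

println(osa_distance("world", "wrold"))  # the swap of "or" counts as one edit
```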

To extract only the words from the lookup result:

term.(lookup(d, "wrold")) # ["world"]

A more convenient form of lookup also exists:

d["wrold"] # ["world"]

Search arguments can either be passed to the lookup function or set globally with set_options!(d::SymSpell; kwargs...).

set_options!(d, include_unknown = true, verbosity = "closest")
d["wrold"] # ["wrold", "world"]

# this is equivalent to
term.(lookup(d, "wrold", include_unknown = true, verbosity = "closest"))

The following arguments are supported:

  • include_unknown: whether to include the original word in the results if it falls within the search criteria.
  • ignore_token: ignore words in lookup that contain the given token string or match the given regexp.
  • transfer_casing: when this option is set to true, results try to mimic the casing of the original word. For example, d["Wrold"] returns ["World"].
  • max_edit_distance: maximum allowed distance for the search. By default it equals max_dictionary_edit_distance.
  • verbosity: selects the type of search result. Three levels of verbosity exist:
    • "top": only a single suggestion is returned, the one with the lowest distance and the highest frequency
    • "closest": all words with the lowest distance are returned
    • "all": all words within the given max_edit_distance are returned
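
The three verbosity levels can be pictured as different ways of trimming one ranked candidate list. The sketch below is not the package's implementation: the Suggestion struct, its field names, and select_suggestions are hypothetical, and the sample distances and counts are made up.

```julia
# Hypothetical stand-in for a lookup result; the field names are assumptions.
struct Suggestion
    term::String
    distance::Int
    count::Int
end

# Rank by distance (ascending), then by frequency (descending),
# and trim the ranked list according to the verbosity level.
function select_suggestions(candidates::Vector{Suggestion}, verbosity::AbstractString)
    isempty(candidates) && return Suggestion[]
    ranked = sort(candidates; by = s -> (s.distance, -s.count))
    if verbosity == "top"
        return ranked[1:1]
    elseif verbosity == "closest"
        best = ranked[1].distance
        return filter(s -> s.distance == best, ranked)
    elseif verbosity == "all"
        return ranked
    else
        throw(ArgumentError("unknown verbosity: $verbosity"))
    end
end

# Made-up candidates for illustration.
candidates = [Suggestion("alpha", 2, 200), Suggestion("beta", 1, 50), Suggestion("gamma", 1, 80)]
for v in ("top", "closest", "all")
    println(v, " => ", [s.term for s in select_suggestions(candidates, v)])
end
```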

License

The SymSpellChecker.jl package is licensed under the MIT License. This package is based on SymSpell and its Python adaptation. Some parts of the code are based on StringDistances.jl.