This is currently a playground for experimenting with tools for working with Korean in Elixir.
This project borrows liberally from open-korean-text. In some cases code has been ported directly (`KrDict.Util.Hangul`).
- https://en.wikipedia.org/wiki/Suffix_array
- https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
- https://en.wikipedia.org/wiki/FM-index
- Make the prefix search more efficient by passing the current place in the node and modifying the query
- How to handle adding a meaning
- Add search frequency to the trie
- Find a good dictionary
- Create a struct for a sentence that holds "segments" as they are expanded by processing and splitting. This would likely be a map of arbitrarily assigned keys to "segments", plus a list of keys to record their order
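The sentence struct idea from the list above could look roughly like the sketch below. The module name `SentenceSketch` and its functions are hypothetical and not part of this project; references from `make_ref/0` stand in for the arbitrarily assigned keys.

```elixir
defmodule SentenceSketch do
  # Hypothetical struct: `segments` maps arbitrary keys to segment text,
  # `order` lists those keys in reading order.
  defstruct segments: %{}, order: []

  # Build a sentence that holds a single, unsplit segment.
  def new(text) do
    key = make_ref()
    %__MODULE__{segments: %{key => text}, order: [key]}
  end

  # Replace the segment stored under `key` with several smaller parts,
  # keeping their position in the overall order.
  def split(%__MODULE__{} = sentence, key, parts) do
    new_pairs = Enum.map(parts, fn part -> {make_ref(), part} end)
    new_keys = Enum.map(new_pairs, fn {k, _part} -> k end)

    segments =
      sentence.segments
      |> Map.delete(key)
      |> Map.merge(Map.new(new_pairs))

    order =
      Enum.flat_map(sentence.order, fn
        ^key -> new_keys
        other -> [other]
      end)

    %{sentence | segments: segments, order: order}
  end

  # Return the segment texts in reading order.
  def to_list(%__MODULE__{segments: segments, order: order}) do
    Enum.map(order, &Map.fetch!(segments, &1))
  end
end
```

Keeping the order in a separate list means a segment can be split again later without renumbering the other keys.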
A basic strategy of removing syllables from the end of a word and then performing a prefix search reliably reaches about 50% accuracy, depending on the text.

This strategy is highly sensitive to how complete and how well tailored the dictionary is. I have not been able to find a very complete word list that includes a good number of foreign loanwords.
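A minimal sketch of the syllable-stripping strategy described above, assuming Elixir 1.12 or later for the stepped range. `StemSketch` is a hypothetical module, and a plain list scan stands in for the trie-based prefix search, so this illustrates the idea and is not the project's implementation.

```elixir
defmodule StemSketch do
  # Return every dictionary entry that starts with `prefix`.
  # A list scan is used here in place of a real trie lookup.
  def prefix_search(dictionary, prefix) do
    Enum.filter(dictionary, &String.starts_with?(&1, prefix))
  end

  # Drop one syllable at a time from the end of `word` until a prefix
  # search returns at least one candidate. Returns [] if nothing matches.
  def candidates(dictionary, word) do
    syllables = String.graphemes(word)
    lengths = length(syllables)..1//-1

    Enum.find_value(lengths, [], fn n ->
      prefix = syllables |> Enum.take(n) |> Enum.join()

      case prefix_search(dictionary, prefix) do
        [] -> nil
        matches -> matches
      end
    end)
  end
end
```

With a toy dictionary of `["먹다", "먹이", "가다", "학교"]`, the word `"먹었어요"` is shortened to `"먹"` before anything matches, and both `"먹다"` and `"먹이"` come back as candidates. A word such as `"갔다"` finds nothing, because `"갔"` is not a prefix of `"가다"`.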
Next strategies to try:
- Find a really good dictionary
- Try stemming by deconstructing instead of syllable by syllable
- Create additional pipelines for words not found in dictionary
- Smarter strategies for choosing matches
  - Can't be longer than the original word? (This might cause some issues.)
  - Can't have too large a difference in length?
  - Partial POS tagging to know whether we prefer 다 at the end because we think it's a verb
- What mechanisms could help this learn as we use it? (Manual submission?)
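For the "stemming by deconstructing" idea in the list above, a first step would be splitting each syllable into its jamo. The sketch below uses the standard Unicode arithmetic for precomposed Hangul syllables (U+AC00 to U+D7A3). `JamoSketch` is a hypothetical module and is not the API of `KrDict.Util.Hangul`.

```elixir
defmodule JamoSketch do
  @base 0xAC00
  @last 0xD7A3
  # 19 initial consonants, 21 vowels, and 27 final consonants (plus "none"),
  # listed in Unicode order as compatibility jamo.
  @initials ~w(ㄱ ㄲ ㄴ ㄷ ㄸ ㄹ ㅁ ㅂ ㅃ ㅅ ㅆ ㅇ ㅈ ㅉ ㅊ ㅋ ㅌ ㅍ ㅎ)
  @medials ~w(ㅏ ㅐ ㅑ ㅒ ㅓ ㅔ ㅕ ㅖ ㅗ ㅘ ㅙ ㅚ ㅛ ㅜ ㅝ ㅞ ㅟ ㅠ ㅡ ㅢ ㅣ)
  @finals [nil | ~w(ㄱ ㄲ ㄳ ㄴ ㄵ ㄶ ㄷ ㄹ ㄺ ㄻ ㄼ ㄽ ㄾ ㄿ ㅀ ㅁ ㅂ ㅄ ㅅ ㅆ ㅇ ㅈ ㅊ ㅋ ㅌ ㅍ ㅎ)]

  # Split one precomposed syllable into {initial, medial, final}.
  # `final` is nil when the syllable has no final consonant.
  def decompose(<<cp::utf8>>) when cp >= @base and cp <= @last do
    offset = cp - @base
    initial = div(offset, 21 * 28)
    medial = div(rem(offset, 21 * 28), 28)
    final = rem(offset, 28)

    {Enum.at(@initials, initial), Enum.at(@medials, medial), Enum.at(@finals, final)}
  end

  # Anything that is not a precomposed Hangul syllable is returned unchanged.
  def decompose(other), do: other

  # Decompose every syllable of a word.
  def decompose_word(word) do
    word |> String.graphemes() |> Enum.map(&decompose/1)
  end
end
```

`JamoSketch.decompose("갔")` yields `{"ㄱ", "ㅏ", "ㅆ"}`. Dropping the final consonant leaves the jamo of `"가"`, which a syllable-by-syllable approach cannot recover from `"갔"`.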
If [available in Hex](https://hex.pm/docs/publish), the package can be installed by adding `kr_dict` to your list of dependencies in `mix.exs`:
    def deps do
      [
        {:kr_dict, "~> 0.1.0"}
      ]
    end
Documentation can be generated with ExDoc and published on HexDocs. Once published, the docs can be found at https://hexdocs.pm/kr_dict.