Implementation Details

Tino Didriksen edited this page May 9, 2019 · 2 revisions

Kukkuniiaat Implementation Details

Backend

Linguistic data

Command Line Usage

There are two main command-line tools for using the spell checker: hfst-ospell and libdivvun. Both are available as nightly builds for various Linux distros, Windows, and macOS via the Apertium build repository.

To install and run a basic plain-text spell checker session on Debian/Ubuntu, follow the same steps that the Docker image uses.

Via libdivvun

libdivvun is currently used for the web and HTML5 frontends for Google Docs and Microsoft Office.

Install everything:

$ sudo apt-get install wget ca-certificates
$ wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
$ sudo apt-get install giella-kal divvun-gramcheck

Run text through tokenizer and spell checker:

$ echo "Aajap biilinik misissuisoqartanginnera kamassutigigaa" | kal-tokenise | divvun-cgspell -u 1.0 -n 5 /usr/share/voikko/3/kl.zhfst

"<Aajap>"
        "Aaja" Dial/Sgr Sem/Fem Sem/Hum Prop Rel Sg
        "Aaja" Dial/Sgr Sem/Mask Sem/Hum Prop Rel Sg
"<biilinik>"
        "biili" Dial/Ngr N Ins Pl
        "biili" N Ins Pl
"<misissuisoqartanginnera>"
        "misissuisoqartanginnera" ?
        "misissuisoqartannginnera" <W:8> <WA:0> <spelled> "<misissuisoqartannginnera>"
        "misissuisoqartannginnerai" <W:18> <WA:0> <spelled> "<misissuisoqartannginnerai>"
        "misissuisoqartannginnerat" <W:18> <WA:0> <spelled> "<misissuisoqartannginnerat>"
        "misissuisoqartuannginnera" <W:18> <WA:0> <spelled> "<misissuisoqartuannginnera>"
        "misissuissoqartannginnera" <W:18> <WA:0> <spelled> "<misissuissoqartannginnera>"
"<kamassutigigaa>"
        "kamassut" GE Der/nv Gram/TV V Par 3Sg 3SgO
        "kamassutige" Gram/TV V Par 3Sg 3SgO

That's a lot more information than we currently need, so I wrote spell-stream.pl to trim the excess:

$ echo "Aajap biilinik misissuisoqartanginnera kamassutigigaa" | kal-tokenise | divvun-cgspell -u 1.0 -n 5 /usr/share/voikko/3/kl.zhfst | ./spell-stream.pl

Aajap
biilinik
misissuisoqartanginnera	@spell <R:misissuisoqartannginnera> <R:misissuisoqartannginnerai> <R:misissuisoqartannginnerat> <R:misissuisoqartuannginnera> <R:misissuissoqartannginnera>
kamassutigigaa
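
To illustrate the kind of trimming involved, here is a rough Python sketch. It is not the actual spell-stream.pl; it only assumes what the sample output above shows: a token line looks like "<token>", readings are indented, and a reading tagged <spelled> carries a suggestion as its quoted base form. Treating a lone "?" reading with no suggestions as @unknown is also an assumption, and the function names are hypothetical.

```python
import re

# Token lines look like "<Aajap>"; reading lines are indented and start with "lemma".
WORDFORM = re.compile(r'^"<(.*)>"\s*$')
READING = re.compile(r'^\s+"([^"]*)"\s*(.*)$')


def format_token(token, suggestions, unknown):
    """Render one token line in the trimmed format."""
    if suggestions:
        marks = " ".join(f"<R:{s}>" for s in suggestions)
        return f"{token}\t@spell {marks}"
    if unknown:
        # Assumption: a "?" reading without any suggestion means @unknown.
        return f"{token}\t@unknown"
    return token


def trim_cg_stream(lines):
    """Collapse the tokenizer + speller stream into one line per token."""
    out = []
    token, suggestions, unknown = None, [], False
    for line in lines:
        line = line.rstrip("\n")
        m = WORDFORM.match(line)
        if m:
            # A new token starts; emit the previous one first.
            if token is not None:
                out.append(format_token(token, suggestions, unknown))
            token, suggestions, unknown = m.group(1), [], False
            continue
        m = READING.match(line)
        if m and token is not None:
            lemma, tags = m.group(1), m.group(2).split()
            if "<spelled>" in tags:
                suggestions.append(lemma)
            elif tags == ["?"]:
                unknown = True
    if token is not None:
        out.append(format_token(token, suggestions, unknown))
    return out
```

Calling trim_cg_stream(sys.stdin) and printing each returned line would give a filter usable at the end of the pipeline above, with those assumptions in mind.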

Via hfst-ospell

hfst-ospell is currently used as the backend for the native Microsoft Windows and Office spellers.

Install everything:

$ sudo apt-get install wget ca-certificates
$ wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
$ sudo apt-get install giella-kal hfst-ospell

Run a single word through, prefixed with the number of suggestions to ask for (5 here):

$ echo "5 misissuisoqartanginnera" | hfst-ospell-office /usr/share/voikko/3/kl.zhfst

@@ hfst-ospell-office is alive
&	misissuisoqartannginnera	misissuisoqartannginnerai	misissuisoqartannginnerat	misissuisoqartuannginnera	misissuissoqartannginnera
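
A caller that talks to hfst-ospell-office over a pipe has to build the query line and split the reply. The sketch below covers only what the session above shows: a query is the suggestion count, a space, and the word, and a reply starting with "&" lists tab-separated suggestions. Any other reply line is simply reported as None here, since the meaning of other prefixes is not covered on this page. The function names are hypothetical.

```python
def format_query(word, max_suggestions=5):
    """Build one query line: '<count> <word>', as in the session above."""
    return f"{max_suggestions} {word}"


def parse_reply(line):
    """Return the list of suggestions from an '&' reply line.

    Returns None for any other line (for example the '@@ ... is alive'
    banner), because only the '&' form is documented on this page.
    """
    line = line.rstrip("\r\n")
    if not line.startswith("&\t"):
        return None
    return [s for s in line[2:].split("\t") if s]
```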

HTTP Service

For the Google Docs and Microsoft Word frontends, an HTTP service is needed. callback.php implements such a service by doing minimal forwarding to the Docker service, and it is running live on my server.

For example:

https://tinodidriksen.com/spell/kal/callback.php?a=grammar&t=%3Cs1%3E%0aAajap%20biilinik%20misissuisoqartanginnera%20kamassutigigaa%0a%3C/s1%3E

yields JSON output:

{"a":"grammar","c":"<s1>\n\nAajap\nbiilinik\nmisissuisoqartanginnera\t@spell <R:misissuisoqartannginnera> <AFR:misissuisoqartannginnera> <R:misissuisoqartannginnerai> <AFR:misissuisoqartannginnerai> <R:misissuisoqartannginnerat> <AFR:misissuisoqartannginnerat> <R:misissuisoqartuannginnera> <AFR:misissuisoqartuannginnera> <R:misissuissoqartannginnera> <AFR:misissuissoqartannginnera>\nkamassutigigaa\n\n</s1>"}

The <s1>...</s1> tag is explained below.
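
As a minimal client-side sketch, the request URL can be built and the reply decoded with the Python standard library. The parameter names a and t and the c field are taken from the example above; the function names are hypothetical, and the sketch makes no network call itself.

```python
import json
from urllib.parse import quote, urlencode

# Endpoint taken from the example above.
BASE_URL = "https://tinodidriksen.com/spell/kal/callback.php"


def build_url(text, base=BASE_URL):
    """Percent-encode the text (spaces as %20, newlines as %0A) into the query."""
    query = urlencode({"a": "grammar", "t": text}, quote_via=quote)
    return f"{base}?{query}"


def extract_annotated(body):
    """Decode the JSON reply and return the annotated text from its 'c' field."""
    return json.loads(body)["c"]
```

Fetching build_url(...) with any HTTP client and passing the response body to extract_annotated gives the verticalized text.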

Frontends

Google Docs and Microsoft Word (GASMSO)

The GASMSO frontend sends multiple whole paragraphs to the HTTP service. To keep track of which segments belong where, each paragraph is wrapped in numbered s-tags, e.g. <s1>...</s1> <s2>...</s2>. The returned data is a verticalized and annotated version of the input: each token is on a line of its own, and any error markings follow the token on the same line, separated from it by a single tab character. Error type markings are @-prefixed, and spelling corrections are <R:...> tags. Greenlandic currently has only two error types: @spell for errors with corrections, and @unknown for errors where no corrections can be found. Having the whole text in the output helps with matching the text to the source, as opposed to offsets, which vary wildly.
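
The format described above can be produced and consumed roughly as follows. This is a sketch under stated assumptions, not the actual frontend code: the function names are hypothetical, paragraphs are assumed to be joined with newlines, and each s-tag is assumed to sit on its own line in the reply, as in the JSON example earlier. Tags other than <R:...>, such as <AFR:...>, are ignored.

```python
import re

S_OPEN = re.compile(r"^<s(\d+)>$")
S_CLOSE = re.compile(r"^</s(\d+)>$")
CORRECTION = re.compile(r"<R:([^>]*)>")


def wrap_paragraphs(paragraphs):
    """Wrap each paragraph in numbered s-tags: <s1>...</s1>, <s2>...</s2>, ..."""
    return "\n".join(
        f"<s{i}>\n{p}\n</s{i}>" for i, p in enumerate(paragraphs, start=1)
    )


def parse_annotated(text):
    """Map paragraph number -> list of (token, error_type, corrections)."""
    result = {}
    current = None
    for line in text.split("\n"):
        m = S_OPEN.match(line)
        if m:
            current = int(m.group(1))
            result[current] = []
            continue
        if S_CLOSE.match(line):
            current = None
            continue
        if current is None or not line:
            continue  # skip blank lines and anything outside s-tags
        # The token comes first; markings, if any, follow a single tab.
        token, _, marks = line.partition("\t")
        error = next((w for w in marks.split() if w.startswith("@")), None)
        result[current].append((token, error, CORRECTION.findall(marks)))
    return result
```

Because every token is returned in order, the frontend can walk the original paragraph and the token list side by side instead of relying on character offsets.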