
Backend Stream Format

Tino Didriksen edited this page Apr 11, 2019 · 2 revisions

Grammateket Backend I/O

Definitions

  • GDocs/Office: The frontend that runs in Google Apps scripts and MS Office.
  • Website: In this case, the code running on https://grammateket.com/, specifically callback.php. The code can be read at https://github.com/GrammarSoft/mv-grammar
  • Backend: The actual analysis machine, hidden from public access. Only the website can query this.

Flow Overview

The GDocs/Office frontend POSTs data to the website with parameters such as:

  • a: The action to perform, such as comma, danproof, logout, etc.
  • t: The text to perform the action on, if applicable

Here, t is the text copied from the document, potentially formatted. E.g.:

<s1>
A little houses.
</s1>

or if the word little was italicized in the source:

<s1>
A
<STYLE:i:abc123>
little
</STYLE:i:abc123>
houses.
</s1>
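
The request itself is an ordinary form-encoded POST. As a minimal sketch (the parameter names a and t come from the list above; the surrounding code is purely illustrative and does not actually send anything):

```python
from urllib.parse import urlencode

# Build the form-encoded POST body the frontend would send to
# callback.php. Only 'a' and 't' are documented parameters; real
# requests also carry session data not shown here.
text = "<s1>\nA little houses.\n</s1>"
body = urlencode({"a": "danproof", "t": text})
print(body)
```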

The website then generates a nonce, embeds that in the sentence markers, and forwards t to the backend over raw TCP/IP:

<s1-def456>
A little houses.
</s1-def456>

The nonce guards against the extremely rare occurrence where the backend hiccups and returns someone else's analysis.
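
Embedding the nonce is a simple rewrite of the sentence markers. A sketch of how that could look (the function name and nonce length are our own illustration, not the website's actual code):

```python
import re
import secrets

def embed_nonce(text: str, nonce: str) -> str:
    """Rewrite <sN> / </sN> markers as <sN-nonce> / </sN-nonce>."""
    return re.sub(r"(</?s\d+)>", r"\1-" + nonce + ">", text)

nonce = secrets.token_hex(3)  # e.g. 'def456'
tagged = embed_nonce("<s1>\nA little houses.\n</s1>", nonce)
```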

The backend then returns the analysis in a verticalized format, with error markings after a tab:

<s1-def456>
A
little
houses    <R:house> @sg
.
</s1-def456>

The website checks that the nonce matches, removes it from the analysis, wraps it in a JSON response, and returns it to GDocs/Office.

{"a": "danproof", "c": "<s1>\nA\nlittle\nhouses\t<R:house> @sg\n.\n</s1>"}
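
The verify-strip-wrap step on the website side can be sketched like this (the helper name and error handling are illustrative assumptions, not the actual callback.php logic):

```python
import json

def strip_nonce(analysis: str, nonce: str) -> str:
    """Verify the nonce on the sentence markers, then remove it."""
    if f"-{nonce}>" not in analysis:
        raise ValueError("nonce mismatch: stale or foreign analysis")
    return analysis.replace(f"-{nonce}>", ">")

raw = "<s1-def456>\nA\nlittle\nhouses\t<R:house> @sg\n.\n</s1-def456>"
clean = strip_nonce(raw, "def456")
response = json.dumps({"a": "danproof", "c": clean})
```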

Error Types

  • We use @ to prefix grammatical errors and % to prefix comma errors.
  • Suggestions come in two forms: <R:...> is the primary suggestion. <AFR:...> entries are secondary suggestions that a very early spell-checker module considers potential corrections; they are ignored if there is no @ error type or primary suggestion.

Following our format exactly is not critical. What is important is that suggestions can be distinguished from error types, and that the primary suggestion includes all corrections the error types imply.
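
A minimal sketch of how a consumer might split a verticalized output line into those pieces, assuming the documented token<TAB>markings layout (the function is illustrative, not part of any frontend):

```python
import re

def parse_line(line: str):
    """Split a line into (token, suggestions, error_types).

    Suggestions are <R:...> / <AFR:...>; error types start with @ or %.
    """
    token, _, markings = line.partition("\t")
    suggestions = re.findall(r"<(R|AFR):([^>]*)>", markings)
    errors = re.findall(r"[@%]\S+", markings)
    return token, suggestions, errors

print(parse_line("houses\t<R:house> @sg"))
# → ('houses', [('R', 'house')], ['@sg'])
```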

However, some error types are meaningful to the code because Options affect them, so it would be nice to stick with them. From https://github.com/GrammarSoft/proofing-gasmso/blob/master/js/sidebar.js#L674 :

  • @green: A soft error. Put on errors that the backend isn't entirely certain of.
  • @sentsplit: Put on a token that's probably meant to be the last in a sentence. The suggestion must include the new full stop.
  • @upper and @lower: This token should've been upper/lower-case instead. The suggestion must be the new form.
  • @proper: Unknown proper name.
  • @new: Unknown word that's probably an ok compound, but isn't in the dictionary.
  • @abbreviation: Unknown abbreviation.
  • @check!: Unknown other word.
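
For illustration, the list above could be carried in code roughly like this; the description strings merely paraphrase the list, and is_soft is a hypothetical helper, not taken from sidebar.js:

```python
# Illustrative table of the documented error types.
ERROR_TYPES = {
    "@green": "soft error; may be hidden depending on user options",
    "@sentsplit": "suggestion must include the new full stop",
    "@upper": "suggestion must be the upper-cased form",
    "@lower": "suggestion must be the lower-cased form",
    "@proper": "unknown proper name",
    "@new": "unknown but plausible compound",
    "@abbreviation": "unknown abbreviation",
    "@check!": "unknown other word",
}

def is_soft(errors):
    """A finding counts as 'soft' if every error type on it is @green."""
    return bool(errors) and all(e == "@green" for e in errors)
```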

Token Merging

Our backends may merge tokens by gluing them with =, e.g. United=Kingdom. Our frontends will turn the = back into a space. Using = is not required in the token list, because \t delimits that clearly. But it would be required in the suggestion, because those are space-delimited. E.g., it would be valid to return an analysis of

United=Kingdoms    <R:United=Kingdom> @sg

or

United Kingdoms    <R:United=Kingdom> @sg
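
On the frontend side, undoing the glue is a one-line transformation, sketched here for illustration:

```python
def display_form(s: str) -> str:
    """Turn the '=' glue in a token or suggestion back into a space."""
    return s.replace("=", " ")

print(display_form("United=Kingdom"))  # → United Kingdom
```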

Token Joining

Similarly, marking words that should be joined can be done in two ways. Either merge the tokens and provide a suggestion that's the new joined token:

proto type    <R:prototype> @comp

or say which token should be merged and in which direction:

proto
type    @-comp
proto    @comp-
type

The former approach becomes complicated if either part has errors of its own. Keeping the tokens split is easier for everyone.
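
A sketch of how a consumer might resolve the directional markers, assuming tokens arrive as (form, error_types) pairs (the function and its policy of keeping other error types on the joined token are our own illustration):

```python
def join_tokens(tokens):
    """Merge adjacent tokens marked for joining with @comp- / @-comp.

    tokens: list of (form, error_types) pairs in sentence order.
    Other error types on either half are kept on the joined token.
    """
    out, i = [], 0
    while i < len(tokens):
        form, errs = tokens[i]
        join_next = "@comp-" in errs or (
            i + 1 < len(tokens) and "@-comp" in tokens[i + 1][1])
        if join_next and i + 1 < len(tokens):
            nform, nerrs = tokens[i + 1]
            kept = [e for e in errs + nerrs if e not in ("@comp-", "@-comp")]
            out.append((form + nform, kept))
            i += 2
        else:
            out.append((form, errs))
            i += 1
    return out
```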

Token Reordering

See token joining. Same problems. We do not currently have error types for token reordering.

Non-alphanumeric Input

Our backends cannot fully preserve non-alphanumeric text. Things like « may come out as ", and other transformations and normalizations may occur. Because of this, the frontend performs fuzzy matching on only the alphanumeric parts of the output. This makes the frontend rather robust in the face of slightly mangled output, so it is not critical that a backend reproduces the input spacing or punctuation 100% faithfully.
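
One way to think about the fuzzy matching: reduce each token to its alphanumeric core before comparing. A minimal, Unicode-aware sketch (the helper is illustrative, not the frontend's actual matcher):

```python
def alnum_key(s: str) -> str:
    """Reduce a token to its lower-cased alphanumeric core."""
    return "".join(ch for ch in s if ch.isalnum()).lower()

# «quoted» mangled into "quoted" still matches on the alnum core:
print(alnum_key("«quoted»") == alnum_key('"quoted"'))  # → True
```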

Sentence Delimiters

The <s>...</s> tags must be handled as wholly separate sentences. The backend is not allowed to analyse across <s> boundaries. Typically, each paragraph in the input will result in one <s> sentence. The format of the tags is not XML. Strictly speaking, any line starting with <s denotes a new sentence, and a line starting with </s ends that sentence. What comes after that s will, in our implementation, match the regex \d+-\w+. The s tags must come back in the output in the same places.
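
Since the tags are line-oriented rather than XML, splitting the stream into sentences only requires the prefix checks described above. A sketch under that assumption:

```python
def split_sentences(stream: str):
    """Group a verticalized stream into per-sentence token lists.

    Any line starting with '<s' opens a sentence; '</s' closes it.
    """
    sentences, current = [], None
    for line in stream.splitlines():
        if line.startswith("</s"):
            sentences.append(current)
            current = None
        elif line.startswith("<s"):
            current = []
        elif current is not None:
            current.append(line)
    return sentences

print(split_sentences("<s1-def456>\nA\nlittle\nhouses\n.\n</s1-def456>"))
# → [['A', 'little', 'houses', '.']]
```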

Formatted Input

Our backends can handle formatted input, where the format is passed along in <STYLE> tags. The tags are always on a line of their own, surrounding the tokens they apply to. They can be nested, but never overlapping in a crossing-branches fashion. The format is <STYLE:T:U>...</STYLE:T:U> where T is the primary type of formatting, usually expressed in HTML terms, and U is a hash identifier for this particular format block. The hash is there to let the frontend track which secondary formats apply. Our backends do not currently let this through - it is not part of the expected output for these projects. But our backends do use it for some contextual aid in rules. If you can't use the information, just discard <STYLE> lines.
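
Since the tags are always on lines of their own, discarding them is a simple filter. A sketch for backends that cannot use the style information:

```python
def strip_style_lines(text: str) -> str:
    """Drop <STYLE:...> open/close lines from a verticalized stream."""
    kept = [ln for ln in text.splitlines()
            if not (ln.startswith("<STYLE:") or ln.startswith("</STYLE:"))]
    return "\n".join(kept)

src = "<s1>\nA\n<STYLE:i:abc123>\nlittle\n</STYLE:i:abc123>\nhouses.\n</s1>"
print(strip_style_lines(src))
# → <s1>\nA\nlittle\nhouses.\n</s1>
```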