Commit

Commented and documented the DOT lexicon, added full stop to regular FST analyser

Comments #6
snomos committed Sep 14, 2023
1 parent 660e856 commit 1ef640b
Showing 1 changed file with 47 additions and 1 deletion.
48 changes: 47 additions & 1 deletion src/fst/affixes/abbreviations.lexc
@@ -95,13 +95,59 @@ DOT ;

LEXICON DOT !!= * **@CODE@** - Adds the dot to dotted abbreviations.
!!= **@LEXNAME@**
!! We also allow different variations of dotted abbreviations at
!! the end of the sentence (especially for tokenisers):
!! * "kvæð." gets analysed as `"kvæð" ABBR Gram/IAbbr N Abbr`
!! In tokeniser mode also:
!! * "kvæð." -> `"kvæð" ABBR Gram/IAbbr N Abbr` + `"." CLB` to account for
!!   sentence-final kvæð with no extra full stop.
!! * also `"kvæða" V Imp Sg` + `"." CLB` due to
!!   homonymy.
!! The same treatment applies to two and three full stops after an abbreviation
!! at the end of the sentence:
!! * "kvæð.." -> `"kvæð" ABBR Gram/IAbbr N Abbr` + `"." CLB Err/Orth`
!! * "kvæð..." -> `"kvæð" ABBR Gram/IAbbr N Abbr` + `"..." CLB`

+Use/-PMatch.:. # ; ! We need the dot here for regular fsts
+Err/Orth+Use/-PMatch.:.. # ; ! We need the dot here for regular fsts

+Use/-PMatch:%. # ; ! We need the dot here for regular fsts
! Split the abbr + full stop in two segments, but only when using pmatch:
< "@P.Pmatch.Loc@" {.} "+CLB":0 "+Use/PMatch":0 > # ;

! Make a regular ABBR analysis AND backtrack to find alternative analyses:
! NB! Not all backtracking will give alternative analyses, and those
! cases will give a warning about missing substring analysis. The warnings
! can be ignored.
< "+Use/PMatch":0 "@P.Pmatch.Backtrack@" 0:%. > # ;

! Error variants for cases with two full stops:
< "@P.Pmatch.Loc@" {.} "+CLB":0 "+Use/PMatch":0 "+Err/Orth":"." > # ;
< "+Use/PMatch":0 "@P.Pmatch.Backtrack@" 0:%. "+Err/Orth":"." > # ;

! folded three full-stops?
< "@P.Pmatch.Loc@" {...} "+CLB":0 "+Use/PMatch":0 > # ;
< "+Use/PMatch":0 "@P.Pmatch.Backtrack@" 0:"." 0:%. 0:%. > # ;

! Gives:
!$ echo 'kvæð.' \
!| hfst-tokenise -g tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst
!"<kvæð.>"
!    "." CLB <W:0.0> "<.>"
!        "kvæð" ABBR Gram/IAbbr N Abbr <W:0.0> "<kvæð>"
!    "kvæð" ABBR Gram/IAbbr N Abbr <W:0.0>
!    "." CLB <W:0.0> "<.>"
!        "kvæða" V Imp Sg <W:0.0> "<kvæð>"
!:\n
!
! This is exactly what we want. After mwedis and cg-mwesplit, it is
! reformatted as:
!
!"<kvæð.>"
!    "kvæð" ABBR Gram/IAbbr N Abbr <W:0.0>
!:\n
!
! Note that the CLB analysis is lost here. That is a bug and must be looked into.

! =================

!LEXICON ab-dot-adv-itrab +ABBR+Gram/IAbbr: ab-dot-adv ;
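
The comment in the diff notes that the CLB reading disappears after mwedis and cg-mwesplit. A minimal sketch, in Python, of how that loss could be detected automatically is given here. It assumes the CG-style layout shown in the comment (cohort lines starting with `"<`, reading lines starting with a quoted lemma); the helper name `cohort_has_tag`, the sample constants, and the indentation of the sample readings are illustrative assumptions and not part of the repository.

```python
import re

# Sample output copied from the comment in the diff (the ":\n" line is left out).
# The indentation of the reading lines is an assumption.
BEFORE_MWESPLIT = '''"<kvæð.>"
    "." CLB <W:0.0> "<.>"
        "kvæð" ABBR Gram/IAbbr N Abbr <W:0.0> "<kvæð>"
    "kvæð" ABBR Gram/IAbbr N Abbr <W:0.0>
    "." CLB <W:0.0> "<.>"
        "kvæða" V Imp Sg <W:0.0> "<kvæð>"
'''

AFTER_MWESPLIT = '''"<kvæð.>"
    "kvæð" ABBR Gram/IAbbr N Abbr <W:0.0>
'''


def cohort_has_tag(cg_text, tag):
    """Return True if any reading line in the CG-style text carries `tag`."""
    for line in cg_text.splitlines():
        stripped = line.strip()
        # Reading lines start with a quoted lemma; cohort lines start with "<
        # (hypothetical simplification: lemmas beginning with "<" are not handled).
        if stripped.startswith('"') and not stripped.startswith('"<'):
            # Drop the quoted lemma, then split the remaining tags on whitespace.
            tags = re.sub(r'^"[^"]*"', '', stripped).split()
            if tag in tags:
                return True
    return False


if __name__ == "__main__":
    print(cohort_has_tag(BEFORE_MWESPLIT, "CLB"))  # True
    print(cohort_has_tag(AFTER_MWESPLIT, "CLB"))   # False
```

A check like this could run over the pipeline output for a list of sentence-final abbreviations, flagging any cohort where the tokeniser produced a CLB reading that is missing after the later steps.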