This repository has been archived by the owner on Feb 14, 2023. It is now read-only.

jimregan/tag-clusterer


tag-clusterer. It clusters tags.

Actually, it doesn't. It's just a pair of filters: the tagged text is
delexicalised, clustered by a word clustering tool (currently only mkcls),
and the resulting classes are turned into a tagset specification (.tsx) file
for use with apertium-tagger.
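
As a rough illustration of what delexicalising means here, the sketch below
replaces each lexical unit with its tag string. It is not the code of the
Perl filters: the stream layout (^surface/lemma<tag>...$), the dot-joined
output and the function name delexicalise are all assumptions made for the
example.

```python
import re

# One lexical unit, e.g. ^The/the<det><def><sp>$ (layout is an assumption).
LEXICAL_UNIT = re.compile(r"\^([^<$]*)((?:<[^>]+>)*)\$")

def delexicalise(line):
    """Replace each lexical unit with its dot-joined tag string."""
    tokens = []
    for head, tags in LEXICAL_UNIT.findall(line):
        tag_names = re.findall(r"<([^>]+)>", tags)
        # Assumption: a unit without tags is passed through unchanged.
        tokens.append(".".join(tag_names) if tag_names else head)
    return " ".join(tokens)

if __name__ == "__main__":
    print(delexicalise("^The/the<det><def><sp>$ ^house/house<n><sg>$"))
```

The result is a line of "words" made only of tags, which is the kind of
input a word clustering tool can work on.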

The input must be tagged (not merely analysed) text, in the format used for
supervised training of the tagger. Feed the delexicalised output to mkcls
(experiment with the number of classes it generates), then feed the mkcls
output to mkcls-to-tsx.pl.
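
The last step can be pictured with the sketch below, which groups clustered
tag strings by class and prints a .tsx-like skeleton. It is only a guess at
the idea, not a port of mkcls-to-tsx.pl: the "token, whitespace, class
number" input layout, the CLASSn label names and the function name
classes_to_tsx are assumptions, and a real tagset file may need more than
this skeleton.

```python
from collections import defaultdict
from xml.sax.saxutils import quoteattr

def classes_to_tsx(mkcls_lines, tagger_name="example"):
    """Group tag strings by class id and emit a .tsx-like skeleton."""
    groups = defaultdict(list)
    for line in mkcls_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        token, class_id = parts
        groups[class_id].append(token)

    out = ['<?xml version="1.0" encoding="UTF-8"?>',
           f"<tagger name={quoteattr(tagger_name)}>",
           "<tagset>"]
    # Assumption: class ids are integers.
    for class_id in sorted(groups, key=int):
        out.append(f'  <def-label name="CLASS{class_id}">')
        for token in groups[class_id]:
            out.append(f"    <tags-item tags={quoteattr(token)}/>")
        out.append("  </def-label>")
    out += ["</tagset>", "</tagger>"]
    return "\n".join(out)

if __name__ == "__main__":
    print(classes_to_tsx(["det.def.sp\t2", "n.sg\t3", "n.pl\t3"]))
```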

semi-lexicalise.pl can take a set of tags to treat as stopwords, in the
apertium-transfer-tools .atx format, and semi-lexicalise the input. In this
case, the generated .tsx file will have closed classes, and may be usable
without extra intervention.

.atx looks like this:

<?xml version="1.0" encoding="iso-8859-1"?>
<transfer-at source="Portuguese" target="Spanish">
<source>
  <lexicalized-words>
    <lexicalized-word tags="cnjsub"/>
    <lexicalized-word tags="det.*"/>
    <lexicalized-word tags="pr"/>
    <lexicalized-word tags="prn.tn.*"/> 
    <lexicalized-word tags="prn.enc.*"/>
    <lexicalized-word tags="prn.pro.*"/>
    <lexicalized-word tags="rel.*"/>
    <lexicalized-word tags="vbser.*"/>
    <lexicalized-word tags="vbhaver.*"/>
    <lexicalized-word tags="vbmod.*"/> 
    <lexicalized-word tags="vblex.*" lemma="há"/>  
  </lexicalized-words>
</source>
<target>
  <lexicalized-words>
    <lexicalized-word tags="cnjsub"/>
    <lexicalized-word tags="det.*"/>
    <lexicalized-word tags="pr"/>
    <lexicalized-word tags="prn.tn.*"/> 
    <lexicalized-word tags="prn.enc.*"/>
    <lexicalized-word tags="prn.pro.*"/>
    <lexicalized-word tags="rel.*"/>
    <lexicalized-word tags="vbser.*"/>
    <lexicalized-word tags="vbhaver.*"/>
    <lexicalized-word tags="vbmod.*"/>
    <lexicalized-word tags="vblex.*" lemma="hacer"/>
  </lexicalized-words>
</target>
</transfer-at>

semi-lexicalise.pl only uses the <source> part, so you only need that.
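
To make the stopword idea concrete, here is a small sketch that reads the
<source> patterns from an .atx file and checks whether a word would stay
lexicalised. It is not how semi-lexicalise.pl is written; in particular, the
matching rule ("*" stands for any run of further tags, and a lemma attribute
must match exactly) and the helper names are assumptions.

```python
import re
import xml.etree.ElementTree as ET

ATX = """<transfer-at source="Portuguese" target="Spanish">
<source>
  <lexicalized-words>
    <lexicalized-word tags="pr"/>
    <lexicalized-word tags="det.*"/>
    <lexicalized-word tags="vblex.*" lemma="há"/>
  </lexicalized-words>
</source>
</transfer-at>"""

def load_source_patterns(atx_text):
    """Return (compiled tag pattern, lemma or None) pairs from <source>."""
    root = ET.fromstring(atx_text)
    patterns = []
    for word in root.findall("./source/lexicalized-words/lexicalized-word"):
        # Assumption: "*" matches any sequence of further tags.
        regex = re.escape(word.get("tags")).replace(r"\*", ".*")
        patterns.append((re.compile(regex), word.get("lemma")))
    return patterns

def is_lexicalised(lemma, tags, patterns):
    """True if the dot-joined tags (and lemma, when given) match a pattern."""
    return any(
        rx.fullmatch(tags) and (want is None or want == lemma)
        for rx, want in patterns
    )

if __name__ == "__main__":
    pats = load_source_patterns(ATX)
    print(is_lexicalised("o", "det.def.m.sg", pats))
    print(is_lexicalised("casa", "n.f.sg", pats))
```

Words that match would keep their lemma in the clustering input, while
everything else would be reduced to its tags.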
