New INSTALL file with detailed instructions for rule-based mode.

akutuzov · akutuzov · commit ff8e5cc79b9d · 2014-04-16T16:22:09.000+04:00
diff --git a/INSTALL b/INSTALL
@@ -0,0 +1,88 @@
+INSTALLATION AND USAGE GUIDE FOR AN@PHORA
+
+General overview
+================
+An@phora is an open-source system for detecting anaphoric chains in Russian texts. It takes raw text as input and produces a list of anaphor-antecedent pairs.
+For details about what anaphora is, see https://en.wikipedia.org/wiki/Anaphora_(linguistics) or https://ru.wikipedia.org/wiki/%D0%90%D0%BD%D0%B0%D1%84%D0%BE%D1%80%D0%B0_%28%D0%BB%D0%B8%D0%BD%D0%B3%D0%B2%D0%B8%D1%81%D1%82%D0%B8%D0%BA%D0%B0%29
+
+The system was built as an entry in the anaphora resolution evaluation forum held at the 'Dialog – 2014' conference (http://dialog-21.ru).
+It is implemented in Python. A live demo, built on the Brat online markup system [Stenetorp et al. 2012], is available at http://ling.go.mail.ru/anaphora.
+
+
+Requirements
+============
+To successfully run An@phora, you will need the following:
+
+1. Python (the latest version of the 2.7 branch is recommended)
+
+2. The Freeling suite of language analyzers (http://nlp.lsi.upc.edu/freeling/). The latest version at the time of writing is Freeling 3.1.
+
+3. The authors run An@phora on Linux; in theory it should work on MS Windows as well.
+
+Installation
+============
+You need three Python scripts: anaphora.py, freeling.py and lemmatizer.py.
+The files prons.txt, reflexives.txt and relatives.txt provide lists of possible anaphoric expressions (in normal form). You can safely edit these lists to add your own words or remove existing ones.
+
+An@phora expects a Freeling service to listen on port 50005 of localhost (this can be changed in lemmatizer.py) and to return morphological descriptions of Russian text.
+Once Freeling is installed, you can start such a service with the command 'analyzer --server -p 50005 -f /usr/share/freeling/config/ru.cfg'.
+NB: the original Freeling dictionaries contain some inconsistencies with respect to the Russian language. An@phora will still run with them, but the quality of its results will be worse.
+To avoid this, replace the files dicc.src and probabilitas.dat in /usr/share/freeling/ru with the manually fixed versions (download them from http://ling.go.mail.ru/freeling_ru_fixed.tar.gz).
+
+Usage
+=====
+General usage is simply 'anaphora.py <filename>'. The file to analyze should be plain text in UTF-8. By default the system outputs anaphoric chains in a simple format, with antecedents and anaphoric expressions separated by "<---":
+описания        <---    оно             pronoun
+комментариев    <---    которые         relative
+просто название <---    оно             pronoun
+
+The last column gives the class of the anaphoric expression.
+
+Two settings can be tuned with the second and third arguments. The defaults are equivalent to:
+    anaphora.py filename 23 plain
+Here '23' is the length of the analysis window in words and 'plain' is the output format.
+
+Increasing the analysis window helps detect antecedents that are located far from their anaphors, at the expense of precision; decreasing it has the opposite effect. A window of 23 words was shown empirically to give the best F-measure.
+
+As for the third argument, An@phora supports three output formats: plain, xml and brat. Plain output is shown above. XML output returns the same data in the format used at the Dialog evaluation forum. Each item is located by its character offset from the beginning of the text ('sh') and its length in characters ('ln'):
+<chain>
+<item sh="1778" ln="12" type="anaph">
+<cont><![CDATA[комментариев]]></cont>
+</item>
+<item sh="1857" ln="7" comment="relative" type="anaph">
+<cont><![CDATA[которые]]></cont>
+</item>
+</chain>
+
+Finally, brat output returns data in the standoff format used by the annotation files of the Brat online markup system mentioned above:
+T1      antecedent 355 364
+T2      pronoun 378 382
+R1      anaphora Arg1:T2 Arg2:T1
+T3      antecedent 1642 1650
+T4      pronoun 1670 1673
+R2      anaphora Arg1:T4 Arg2:T3
+T5      antecedent 1778 1790
+T6      relative 1857 1864
+R3      anaphora Arg1:T6 Arg2:T5
+
+
+Another advanced feature of An@phora is the reuse of morphological analysis performed beforehand. You can analyze files with the tool of your choice, save the data in Freeling format and reuse it many times.
+An@phora itself can also save the Freeling analysis to annotation files that correspond to the text files.
+To use this feature, replace the following line in anaphora.py
+    processed, curOffset = lemmatizer(res, startOffset = curOffset)
+with
+    processed, curOffset = lemmatizer(res, startOffset = curOffset, loadFrom = argument)
+If the file under analysis has a companion file with the same name plus a '.words' extension (e.g., 'test.txt.words' for 'test.txt'), the morphological analysis is taken from that file. If it does not exist, An@phora obtains the morphological analysis from the Freeling service and saves it to the .words file.
+The Freeling format follows the example below:
+    ругательствам   ругательство    NCDPAI0000      1       33
+The fields are: token, lemma, tag (see http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-ru.html), probability, offset.
+
+Authors
+=======
+An@phora was built by Max Ionov and Andrey Kutuzov of the applied linguistics team at the Mail.ru search engine (http://go.mail.ru).
+You can reach the authors by e-mail: m.ionov@corp.mail.ru, andrey.kutuzov@corp.mail.ru.
+Details of the system are described in the paper "Influence of Morphology Processing Quality on Automated Anaphora Resolution for Russian", available online at http://ling.go.mail.ru/ionov_kutuzov_anaphora.pdf
+
+Copyright
+=========
+An@phora is distributed under the GNU public license. Please read the file LICENSE.
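
Review note on the '.words' format described in INSTALL: the field list (token, lemma, tag, probability, offset) can be illustrated with a small parser. This is only a sketch, not code from the repository. The names `Word` and `parse_words_line` are hypothetical, and it assumes that fields are separated by runs of whitespace and that tokens contain no spaces. It is written to run on Python 3 as well as 2.7.

```python
# -*- coding: utf-8 -*-
from collections import namedtuple

# Hypothetical record type mirroring the five fields listed in INSTALL.
Word = namedtuple('Word', ['token', 'lemma', 'tag', 'prob', 'offset'])


def parse_words_line(line):
    """Split one line of a .words file into a Word record.

    Assumes whitespace-separated fields and tokens without spaces.
    """
    fields = line.split()
    if len(fields) != 5:
        raise ValueError('expected 5 fields, got %d' % len(fields))
    token, lemma, tag, prob, offset = fields
    # Probability and offset are stored as text; convert them to numbers.
    return Word(token, lemma, tag, float(prob), int(offset))


# The sample line from the INSTALL file.
sample = u'ругательствам   ругательство    NCDPAI0000      1       33'
word = parse_words_line(sample)
```

The record order matches the `(token,lemma,tag,prob,offset)` unpacking in anaphora.py, so a file produced by another tool would need to follow the same column order.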
diff --git a/README.md b/README.md
@@ -1,10 +1,17 @@
 russian-anaphora
 ================
 
-System for automatic pronominal resolution for Russian
+An@phora is a system for automatic pronominal anaphora resolution for Russian
 
 The repository is a mess right now. Here are the main moments:
 These are rule-based, machine learning and hybrid systems for pronominal anaphora resolution in Russian.
+
+Detailed instructions for the rule-based mode are given in the INSTALL file.
+
+
+Machine Learning mode
+=====================
+
 To get antecedents for anaphors using only ML, one can use resolute-text.py
 
 ```
diff --git a/anaphora.py b/anaphora.py
@@ -6,17 +6,30 @@
 from lemmatizer import GetGroups
 from lemmatizer import GetConjunctions
 
+def print_usage(file_used):
+    print 'Usage: ' + file_used + ' <file to analyze> <length of analysis window, in words> <output format: plain,xml,brat>'
+
+if len(sys.argv) > 4 or len(sys.argv) < 2:
+    print_usage(sys.argv[0])
+    sys.exit(1)
+
+
 argument = sys.argv[1]
 
 window = 23
 if len(sys.argv) > 2:
     window = int(sys.argv[2])
+
+currentOutput = 'plain'
+if len(sys.argv) > 3:
+    currentOutput = sys.argv[3]
+
+
 pronouns = []
 reflexives = []
 demonstratives = []
 relatives = []
 
-currentOutput = 'xml'
 
 def printxml(antecedent,anaphora):
     print "<chain>"
@@ -78,8 +91,8 @@ def printbrat(antecedent,anaphora):
     res = text.replace(u' ее',u' её')
     if currentOutput == 'plain':
 	print res.strip().encode('utf-8')
-    processed, curOffset = lemmatizer(res, startOffset = curOffset, loadFrom = argument)
-    #processed, curOffset = lemmatizer(res, startOffset = curOffset)
+    #processed, curOffset = lemmatizer(res, startOffset = curOffset, loadFrom = argument)
+    processed, curOffset = lemmatizer(res, startOffset = curOffset)
     for i in processed:
 	found = False
 	(token,lemma,tag,prob,offset) = i
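
Review note on the new argument handling in anaphora.py: the script reads the window length and output format positionally, with defaults of 23 and 'plain'. The same logic can be sketched as one function that also rejects unknown output formats. The function name `parse_args`, the `VALID_FORMATS` tuple and the extra validation are assumptions made for illustration; the commit itself does not validate the format. The sketch runs on Python 3 as well as 2.7.

```python
# Output formats named in the usage string of anaphora.py.
VALID_FORMATS = ('plain', 'xml', 'brat')


def parse_args(argv):
    """Return (filename, window, output_format) from an argv-style list.

    Hypothetical helper: mirrors the defaults in anaphora.py (window 23,
    format 'plain') and additionally checks the format name.
    """
    if len(argv) < 2 or len(argv) > 4:
        raise ValueError('usage: %s <file> [window] [plain|xml|brat]' % argv[0])
    filename = argv[1]
    # int() raises ValueError itself if the window is not a number.
    window = int(argv[2]) if len(argv) > 2 else 23
    output_format = argv[3] if len(argv) > 3 else 'plain'
    if output_format not in VALID_FORMATS:
        raise ValueError('unknown output format: %s' % output_format)
    return filename, window, output_format


defaults = parse_args(['anaphora.py', 'test.txt'])
custom = parse_args(['anaphora.py', 'test.txt', '30', 'brat'])
```

Raising an exception instead of printing inside the helper keeps it testable; the caller can still print the usage message and exit with status 1.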
diff --git a/freeling.py b/freeling.py
@@ -0,0 +1,42 @@
+#!/usr/bin/python2.7
+# -!- coding: utf-8 -!-
+# usage: freeling.py
+
+import os, sys, codecs, subprocess
+
+usage = 'usage: freeling.py'
+
+"""
+This is the recommended way to check a word's part of speech. Add a lambda function for the desired POS
+and use it in your code as follows: posFilters['desired_pos'](word), where word is a sequence whose third element (word[2]) is the Freeling tag
+"""
+posFilters = {
+	'noun': lambda x: x[2].startswith('N') or x[2].startswith('PP'),
+	'adj': lambda x: x[2].startswith('A') or x[2].startswith('R'),
+	'properNoun': lambda x: x[2].startswith('NP'),
+	'pronoun': lambda x: x[2].startswith('E'),
+	'comma': lambda x: x[2] == 'Fc',
+	'prep': lambda x: x[2] == 'B0',
+	'insideQuote': lambda x: x[2] == 'Fra' or x[2].startswith('QuO'),
+	'closeQuote': lambda x: x[2] == 'Frc',
+	'firstName': lambda x: x[2].startswith('N') and x[2][6] == 'N',
+	'secondName': lambda x: (x[2].startswith('N') and x[2][6] in ['F', 'S']) or (x[2].startswith('A') and x[2][5] in ['F', 'S']),
+	#'conj': lambda x: x[2] == 'C0' or x[2] == 'Fc'
+	'conj': lambda x: x[2] == 'C0',
+	'quant': lambda x: x[2].startswith('Z')
+}
+
+"""
+This is the list of groups which we are trying to extract. To disable any of the groups, just comment it out
+Prepositional phrase extraction is disabled in order to stay closer to the gold standard
+"""
+agreementFilters = {
+	'adjNoun': lambda adj, noun: 'NN' + noun[2][2:] if (posFilters['adj'](adj) and posFilters['noun'](noun) and adj[2][2] == noun[2][3]) else None,
+	#'prepNP': lambda prep, noun: 'PP' if (posFilters['prep'](prep) and posFilters['noun'](noun)) else None,
+	#'insideQuote': lambda quote, word: 'QuO' if posFilters['insideQuote'](quote) else None,
+	#'closeQuote': lambda quote, closeQuote: 'QuC' if posFilters['insideQuote'](quote) and posFilters['closeQuote'](closeQuote) else None,
+	'name': lambda name, famName: 'NN' + name[2][2:] if posFilters['firstName'](name) and posFilters['secondName'](famName) else None,
+	'quantNoun': lambda quant, noun: 'NN%sP0%s' % (noun[2][2] if quant[2][1] != 'N' else 'N', noun[2][5:]) if (posFilters['quant'](quant) and posFilters['noun'](noun) and (quant[2][1] == noun[2][2] or (quant[2][1] == 'N' and noun[2][2] == 'G'))) else None
+}
+
+npConjunction = lambda word1, conj, word2: 'NN' if (posFilters['noun'](word1) and posFilters['noun'](word2) and posFilters['conj'](conj)) else None
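
Review note on `posFilters` in freeling.py: each filter takes one word and inspects its tag at index 2. The sketch below shows a call on the sample word from INSTALL. To stay self-contained it reproduces only the 'noun' and 'comma' entries rather than importing freeling.py. The `comma` tuple, including its offset, is a made-up example. The sketch runs on Python 3 as well as 2.7.

```python
# -*- coding: utf-8 -*-
# Reduced copy of two entries from posFilters in freeling.py.
posFilters = {
    'noun': lambda x: x[2].startswith('N') or x[2].startswith('PP'),
    'comma': lambda x: x[2] == 'Fc',
}

# Sample word from INSTALL: (token, lemma, tag, probability, offset).
word = (u'ругательствам', u'ругательство', 'NCDPAI0000', 1.0, 33)
# Hypothetical comma token carrying the 'Fc' tag that the filter tests for.
comma = (u',', u',', 'Fc', 1.0, 46)

is_noun = posFilters['noun'](word)      # tag starts with 'N'
is_comma = posFilters['comma'](comma)   # tag equals 'Fc'
```

Because the filters only index into position 2, any tuple or list with the tag in that position works, which matches the unpacking order used in anaphora.py.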
diff --git a/lemmatizer.py b/lemmatizer.py