Commit ff8e5cc

New INSTALL file with detailed instructions for rule-based mode.
1 parent 62ea04e commit ff8e5cc

5 files changed: +154 -4 lines changed

Diff for: INSTALL

+88
@@ -0,0 +1,88 @@
+INSTALLATION AND USAGE GUIDE FOR AN@PHORA
+
+General overview
+================
+An@phora is an open source system for detecting anaphoric chains in Russian texts. It takes raw text as an input and produces a list of anaphor-antecedent pairs.
+For details about what anaphora is, see https://en.wikipedia.org/wiki/Anaphora_(linguistics) or https://ru.wikipedia.org/wiki/%D0%90%D0%BD%D0%B0%D1%84%D0%BE%D1%80%D0%B0_%28%D0%BB%D0%B8%D0%BD%D0%B3%D0%B2%D0%B8%D1%81%D1%82%D0%B8%D0%BA%D0%B0%29
+
+The system was built as a participant of anaphora resolution systems evaluation forum held at the conference 'Dialog – 2014' (http://dialog-21.ru).
+It is implemented in Python. Live demo is available using Brat on-line markup system [Stenetorp et al 2012] at http://ling.go.mail.ru/anaphora.
+
+
+Requirements
+============
+To successfully run An@phora, you will need the following:
+
+1. Python (latest version of 2.7 branch is recommended)
+
+2. Freeling suite of language analyzers (http://nlp.lsi.upc.edu/freeling/). Latest version as of now is Freeling 3.1
+
+3. The authors run An@phora on Linux, but theoretically it should work with MS Windows as well.
+
+Installation
+============
+You need 3 Python scripts: anaphora.py, freeling.py and lemmatizer.py.
+Files prons.txt, reflexives.txt and relatives.txt provide lists of possible anaphoric expressions (in normal form). You can safely edit these lists, add your own words or remove those present.
+
+An@phora expects Freeling service to listen on port 50005 of localhost (can be changed in lemmatizer.py) and return morphological description of Russian texts.
+Once you have Freeling installed, you can run such a service with a command 'analyzer --server -p 50005 -f /usr/share/freeling/config/ru.cfg'.
+NB: original Freeling dictionaries contain some inconsistencies w.r.t. Russian language. This is not a problem for An@phora itself, but the quality of results will be worse.
+To avoid this, replace files dicc.src and probabilitas.dat in /usr/share/freeling/ru with the files we fixed manually (download from http://ling.go.mail.ru/freeling_ru_fixed.tar.gz).
+
+Usage
+=====
+General usage is simply 'anaphora.py + filename'. The file to analyze should be plain text in UTF-8. By default the system will output anaphoric chains in simple format with antecedents and anaphoric expressions separated with "<---":
+описания <--- оно pronoun
+комментариев <--- которые relative
+просто название <--- оно pronoun
+
+The last column describes the class of anaphoric expression.
+
+Two settings can be tuned with the second and the third arguments. Default state equals to:
+anaphora.py filename 23 plain,
+where '23' is the length of analysis window in words and 'plain' is the output format.
+
+One can increase analysis window length to improve detecting antecedents that are located far from their anaphors, at the expense of degrading precision, or vice versa. 23 words was shown empirically to be the best value w.r.t. F-measure.
+
+As for the third argument, An@phora supports three output formats: plain, xml and brat. Plain output is shown above. XML output returns the same data in the format used at Dialog evaluation forum. It uses offsets in characters from the beginning of the text:
+<chain>
+<item sh="1778" ln="12" type="anaph">
+<cont><![CDATA[комментариев]]></cont>
+</item>
+<item sh="1857" ln="7" comment="relative" type="anaph">
+<cont><![CDATA[которые]]></cont>
+</item>
+</chain>
+
+Finally, brat output returns data in standoff format used by annotation files of Brat on-line markup system mentioned above:
+T1 antecedent 355 364
+T2 pronoun 378 382
+R1 anaphora Arg1:T2 Arg2:T1
+T3 antecedent 1642 1650
+T4 pronoun 1670 1673
+R2 anaphora Arg1:T4 Arg2:T3
+T5 antecedent 1778 1790
+T6 relative 1857 1864
+R3 anaphora Arg1:T6 Arg2:T5
+
+
+There is another advanced feature of An@phora - using morphological analysis performed beforehand. One can analyze the files with his/her tool of choice, save the data using Freeling format and reuse it many times.
+Also, An@phora itself can save Freeling analysis to special annotation files correspondent to text files.
+To use this feature, one should replace in anaphora.py the string
+processed, curOffset = lemmatizer(res, startOffset = curOffset)
+with
+processed, curOffset = lemmatizer(res, startOffset = curOffset, loadFrom = argument)
+If for file under analysis there is a file with the same name and '.words' extension (e.g., 'test.txt.words' for 'test.txt'), then morphological analysis will be taken from this file. If it does not exist, An@phora will receive morphological analysis from Freeling service and save it in .words file.
+Freeling format follows the example below:
+ругательствам ругательство NCDPAI0000 1 33
+The elements are: token, lemma, tagset (see http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-ru.html), probability, offset.
+
+Authors
+=======
+An@phora was built by Max Ionov and Andrey Kutuzov, of applied linguistics team at Mail.ru search engine (http://go.mail.ru).
+You can reach authors by e-mail: [email protected], [email protected].
+Details of the system are described in the paper "Influence of Morphology Processing Quality on Automated Anaphora Resolution for Russian", available online at http://ling.go.mail.ru/ionov_kutuzov_anaphora.pdf
+
+Copyright
+=========
+An@phora is distributed under the GNU public license. Please read the file LICENSE.
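
Note on the brat output described in INSTALL: the sample lines pair each anaphor (Arg1) with its antecedent (Arg2) through T and R records. The following is a minimal sketch of how a consumer might read that output back into span pairs. It is not part of this commit; `Span` and `parse_brat` are hypothetical names, the code is written for Python 3 (the project itself targets 2.7), and it assumes whitespace-separated fields exactly as printed in the INSTALL sample.

```python
from typing import Dict, List, NamedTuple, Tuple


class Span(NamedTuple):
    label: str  # e.g. "antecedent", "pronoun", "relative"
    start: int  # character offset where the span starts
    end: int    # character offset where the span ends


def parse_brat(lines: List[str]) -> List[Tuple[Span, Span]]:
    """Return (anaphor, antecedent) span pairs from brat-style lines."""
    spans: Dict[str, Span] = {}
    relations: List[Tuple[str, str]] = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("T"):
            # Text-bound record: ID, label, start offset, end offset.
            spans[fields[0]] = Span(fields[1], int(fields[2]), int(fields[3]))
        elif fields[0].startswith("R"):
            # Relation record: arguments look like "Arg1:T2".
            args = dict(f.split(":", 1) for f in fields[2:])
            relations.append((args["Arg1"], args["Arg2"]))
    # Resolve IDs only after all lines are read, so record order does not matter.
    return [(spans[a], spans[b]) for a, b in relations]


sample = """T1 antecedent 355 364
T2 pronoun 378 382
R1 anaphora Arg1:T2 Arg2:T1""".splitlines()

pairs = parse_brat(sample)
```

Here `pairs` holds one tuple whose first element is the pronoun span and whose second is the antecedent span, mirroring the Arg1/Arg2 order in the sample.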

Diff for: README.md

+8 -1
@@ -1,10 +1,17 @@
 russian-anaphora
 ================
 
-System for automatic pronominal resolution for Russian
+An@phora is a system for automatic pronominal resolution for Russian
 
 The repository is a mess right now. Here are the main moments:
 These are rule-based, machine learning and hybrid systems for pronominal anaphora resolution in Russian.
+
+Detailed instructions for rule-based mode are given in the INSTALL file
+
+
+Machine Learning mode
+=====================
+
 To get antecedents for anaphors using only ML, one can use resolute-text.py
 
 ```

Diff for: anaphora.py

mode 100644 → 100755
+16 -3
@@ -6,17 +6,30 @@
 from lemmatizer import GetGroups
 from lemmatizer import GetConjunctions
 
+def print_usage(file_used):
+    print 'Usage: ' + file_used + ' <file to analyze> <length of analysis window, in words> <output format: plain,xml,brat>'
+
+if len(sys.argv) > 4 or len(sys.argv) < 2:
+    print_usage(sys.argv[0])
+    exit(1)
+
+
 argument = sys.argv[1]
 
 window = 23
 if len(sys.argv) > 2:
     window = int(sys.argv[2])
+
+currentOutput = 'plain'
+if len(sys.argv) > 3:
+    currentOutput = sys.argv[3]
+
+
 pronouns = []
 reflexives = []
 demonstratives = []
 relatives = []
 
-currentOutput = 'xml'
 
 def printxml(antecedent,anaphora):
     print "<chain>"
@@ -78,8 +91,8 @@ def printbrat(antecedent,anaphora):
 res = text.replace(u' ее',u' её')
 if currentOutput == 'plain':
     print res.strip().encode('utf-8')
-processed, curOffset = lemmatizer(res, startOffset = curOffset, loadFrom = argument)
-#processed, curOffset = lemmatizer(res, startOffset = curOffset)
+#processed, curOffset = lemmatizer(res, startOffset = curOffset, loadFrom = argument)
+processed, curOffset = lemmatizer(res, startOffset = curOffset)
 for i in processed:
     found = False
     (token,lemma,tag,prob,offset) = i
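
Note on the argument handling added above: the script reads one required and two optional positional arguments, with defaults of 23 for the window and 'plain' for the output format. The same interface could be sketched with the standard `argparse` module, as below. This is an illustration only, not the code in this commit; `parse_args` is a hypothetical helper and the example is written for Python 3. One deliberate difference is that `choices` rejects unknown output formats, whereas the committed code accepts any string.

```python
import argparse


def parse_args(argv):
    """Parse <file> [window] [output format] with the defaults used in anaphora.py."""
    parser = argparse.ArgumentParser(prog="anaphora.py")
    parser.add_argument("filename", help="plain-text UTF-8 file to analyze")
    parser.add_argument("window", nargs="?", type=int, default=23,
                        help="length of the analysis window, in words")
    parser.add_argument("output", nargs="?", default="plain",
                        choices=["plain", "xml", "brat"],
                        help="output format")
    return parser.parse_args(argv)


# Only the file name is given, so both defaults apply.
defaults = parse_args(["test.txt"])

# All three arguments are given explicitly.
explicit = parse_args(["test.txt", "40", "brat"])
```

With this layout a missing file name or a fourth argument produces a usage message and a non-zero exit, which is the behaviour the `print_usage` check implements by hand.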

Diff for: freeling.py

+42
@@ -0,0 +1,42 @@
+#!/usr/bin/python2.7
+# -!- coding: utf-8 -!-
+# usage: freeling.py
+
+import os, sys, codecs, subprocess
+
+usage = 'usage: freeling.py'
+
+"""
+This is the recommended way to check against part of speech. Add a lambda-function for the desired POS
+and use it in your code in the following way: posFilters['desired_pos'](word), where word is a list
+"""
+posFilters = {
+    'noun': lambda x: x[2].startswith('N') or x[2].startswith('PP'),
+    'adj': lambda x: x[2].startswith('A') or x[2].startswith('R'),
+    'properNoun': lambda x: x[2].startswith('NP'),
+    'pronoun': lambda x: x[2].startswith('E'),
+    'comma': lambda x: x[2] == 'Fc',
+    'prep': lambda x: x[2] == 'B0',
+    'insideQuote': lambda x: x[2] == 'Fra' or x[2].startswith('QuO'),
+    'closeQuote': lambda x: x[2] == 'Frc',
+    'firstName': lambda x: x[2].startswith('N') and x[2][6] == 'N',
+    'secondName': lambda x: (x[2].startswith('N') and x[2][6] in ['F', 'S']) or (x[2].startswith('A') and x[2][5] in ['F', 'S']),
+    #'conj': lambda x: x[2] == 'C0' or x[2] == 'Fc'
+    'conj': lambda x: x[2] == 'C0',
+    'quant': lambda x: x[2].startswith('Z')
+}
+
+"""
+This is the list of groups which we are trying to extract. To disable any of the groups, just comment it
+Preposition Phrases extraction is disabled in order to be closer to Gold Standard
+"""
+agreementFilters = {
+    'adjNoun': lambda adj, noun: 'NN' + noun[2][2:] if (posFilters['adj'](adj) and posFilters['noun'](noun) and adj[2][2] == noun[2][3]) else None,
+    #'prepNP': lambda prep, noun: 'PP' if (posFilters['prep'](prep) and posFilters['noun'](noun)) else None,
+    #'insideQuote': lambda quote, word: 'QuO' if posFilters['insideQuote'](quote) else None,
+    #'closeQuote': lambda quote, closeQuote: 'QuC' if posFilters['insideQuote'](quote) and posFilters['closeQuote'](closeQuote) else None,
+    'name': lambda name, famName: 'NN' + name[2][2:] if posFilters['firstName'](name) and posFilters['secondName'](famName) else None,
+    'quantNoun': lambda quant, noun: 'NN%sP0%s' % (noun[2][2] if quant[2][1] != 'N' else 'N', noun[2][5:]) if (posFilters['quant'](quant) and posFilters['noun'](noun) and (quant[2][1] == noun[2][2] or (quant[2][1] == 'N' and noun[2][2] == 'G'))) else None
+}
+
+npConjunction = lambda word1, conj, word2: 'NN' if (posFilters['noun'](word1) and posFilters['noun'](word2) and posFilters['conj'](conj)) else None
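
Note on how the filters in freeling.py are meant to be called: each filter receives a word record and inspects element 2, the morphological tag. The sketch below copies the 'noun' and 'conj' filters and `npConjunction` from the diff and applies them to sample records. The records are assumptions made for illustration: the first reuses the example line from INSTALL, while the second noun and the conjunction (including their offsets) are invented and are not real Freeling output.

```python
# Two filters and the conjunction rule, copied from freeling.py in this commit.
posFilters = {
    'noun': lambda x: x[2].startswith('N') or x[2].startswith('PP'),
    'conj': lambda x: x[2] == 'C0',
}

npConjunction = lambda word1, conj, word2: 'NN' if (
    posFilters['noun'](word1)
    and posFilters['noun'](word2)
    and posFilters['conj'](conj)
) else None

# Word records are (token, lemma, tag, probability, offset).
# The first record comes from the INSTALL example; the other two are illustrative.
noun = ('ругательствам', 'ругательство', 'NCDPAI0000', 1, 33)
conj = ('и', 'и', 'C0', 1, 47)
other_noun = ('ругательствам', 'ругательство', 'NCDPAI0000', 1, 49)

# noun + conjunction + noun is recognised as a conjoined noun group.
grouped = npConjunction(noun, conj, other_noun)

# With the arguments in the wrong order the rule does not fire.
not_grouped = npConjunction(noun, other_noun, conj)
```

Keeping the checks in a dictionary of lambdas means a new part of speech only needs one new entry, which is the usage pattern the docstring in the file recommends.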

Diff for: lemmatizer.py

mode 100644 → 100755
File mode changed.
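
Note on the '.words' files that INSTALL says the lemmatizer can load and save: each line carries token, lemma, tag, probability and offset. A minimal sketch of reading one such line follows. It assumes whitespace-separated fields exactly as in the INSTALL example; `Word` and `parse_words_line` are hypothetical names and do not reflect how lemmatizer.py actually reads these files. The example is written for Python 3.

```python
from typing import NamedTuple


class Word(NamedTuple):
    token: str   # surface form as it appears in the text
    lemma: str   # normal form
    tag: str     # morphological tag from the Russian Freeling tagset
    prob: float  # probability assigned to this analysis
    offset: int  # offset of the token in the source text


def parse_words_line(line: str) -> Word:
    """Split one line of a .words file into its five fields."""
    token, lemma, tag, prob, offset = line.split()
    return Word(token, lemma, tag, float(prob), int(offset))


# The example line given in INSTALL.
word = parse_words_line("ругательствам ругательство NCDPAI0000 1 33")
```

The field order matches the tuple unpacked in anaphora.py, `(token,lemma,tag,prob,offset)`, so a record parsed this way has the same shape as the items the main loop iterates over.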
