This repository contains Ancient Greek texts that have been automatically tokenized, POS-tagged, sentence-split, and lemmatized. The texts come from the following repositories, which currently contain most of the Ancient Greek texts freely accessible on the internet:
- https://github.com/PerseusDL/canonical-greekLit/releases/tag/0.0.236
- https://github.com/OpenGreekAndLatin/First1KGreek/releases/tag/1.1.1802
As for the tokenization, POS tagging and sentence splitting, the data rely on those provided in:
Refer to these repositories for further documentation. In the present repository, the POS tag and the word form of each token have been automatically matched against those contained in Morpheus (see the "Morpheus" folder) and MorpheusUnderPhilologic. Since both databases also contain lemmata, this made it possible to extract the lemmata automatically.
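The matching idea described above can be illustrated with a minimal sketch. It is not the pipeline actually used to build this repository: the function names `build_index` and `lemmata_for` and the three lexicon entries are hypothetical, and the sketch assumes that a lemma is accepted only when both the word form and the 9-character POS tag agree exactly.

```python
from collections import defaultdict

def build_index(entries):
    """Index (form, tag, lemma) triples by the pair (form, tag)."""
    index = defaultdict(set)
    for form, tag, lemma in entries:
        index[(form, tag)].add(lemma)
    return index

def lemmata_for(index, form, tag):
    """Return the lemmata whose form AND POS tag both match, sorted."""
    return sorted(index.get((form, tag), ()))

# Hypothetical miniature lexicon, for illustration only.
ENTRIES = [
    ("τάσδε", "p-p---fa-", "ὅδε"),
    ("λόγος", "n-s---mn-", "λόγος"),
    ("λόγον", "n-s---ma-", "λόγος"),
]

index = build_index(ENTRIES)
print(lemmata_for(index, "τάσδε", "p-p---fa-"))  # form and tag both match
print(lemmata_for(index, "τάσδε", "n-p---fa-"))  # same form, different tag: no lemma
```

Requiring agreement on the POS tag as well as on the word form means that a token whose tag has no counterpart in the lexicon gets no lemma rather than a doubtful one.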
The XML structure of each file is self-explanatory, and the abbreviations are explained at the beginning of each file. For convenience, I give an example here:
<s n="2">
<t p="4" n="1" a="[1]" o="p-s---mn-" u="1">
<f>ὃς</f>
<l i="234">
<l1 o="pr-s---mn-">ὅς</l1>
</l>
</t>
<t p="4" n="2" a="[1]" o="p-p---fa-" u="2">
<f>τάσδε</f>
<l i="5901">
<l1 o="pd-p---fa-">ὅδε</l1>
<l2>ὅδε</l2>
</l>
</t>
<!-- further t elements -->
</s>
Read the above XML fragment as follows:
- s element: sentence element, where @n is the sentence number
- t element: token element, which contains a number of values providing its morphological analysis:
  - @p: passage-level CTS URN
  - @n: position of the token in @p
  - @a: nth occurrence of that token in @p
  - @o: morphological analysis of the token as provided automatically by the Mate tagger (this analysis follows the Morpheus format explained below)
  - @u: position of the token within the s(entence) element
- f element: the word form of the token
- l element: possible lemmata extracted from Morpheus (<l2/>) and PerseusUnderPhilologic (<l1/>), found by matching their word forms AND POS tags with those found in the present database. In <l1/>, @o contains the original PerseusUnderPhilologic POS tag (see the abbreviations below), which can be more informative than the Morpheus one. For example, ὃς in the above example is analyzed in PerseusUnderPhilologic as a relative pronoun (o="pr-s---mn-": see "r" in second position). Similarly, ὅδε is analyzed as a demonstrative pronoun, while Morpheus simply treats it as a pronoun. One token may have more than one <l1/> and/or <l2/> element associated with it.
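As a rough illustration of how such a fragment might be read programmatically, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The function name `read_sentence` and the dictionary keys are assumptions made for this example; the sample string simply repeats the two tokens shown above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<s n="2">
  <t p="4" n="1" a="[1]" o="p-s---mn-" u="1">
    <f>ὃς</f>
    <l i="234">
      <l1 o="pr-s---mn-">ὅς</l1>
    </l>
  </t>
  <t p="4" n="2" a="[1]" o="p-p---fa-" u="2">
    <f>τάσδε</f>
    <l i="5901">
      <l1 o="pd-p---fa-">ὅδε</l1>
      <l2>ὅδε</l2>
    </l>
  </t>
</s>"""

def read_sentence(s_elem):
    """Return one dict per t element: form, POS tag, position, lemmata."""
    tokens = []
    for t in s_elem.iter("t"):
        lemmata = []
        for l in t.findall("l"):
            for child in l:
                # Collect l1 and l2 lemmata alike, skipping repeats.
                if child.tag in ("l1", "l2") and child.text:
                    lemma = child.text.strip()
                    if lemma not in lemmata:
                        lemmata.append(lemma)
        tokens.append({
            "form": t.findtext("f"),
            "pos": t.get("o"),
            "position": int(t.get("u")),
            "lemmata": lemmata,
        })
    return tokens

tokens = read_sentence(ET.fromstring(SAMPLE))
for tok in tokens:
    print(tok["position"], tok["form"], tok["pos"], tok["lemmata"])
```

A token without any l element simply ends up with an empty `lemmata` list, so the same loop works for lemmatized and unlemmatized tokens.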
The Morpheus POS tag in t/@o consists of 9 characters, each of which has an unambiguous meaning:
- 1: part of speech
  - n: noun
  - v: verb
  - a: adjective
  - d: adverb
  - l: article
  - g: particle
  - c: conjunction
  - r: preposition
  - p: pronoun
  - m: numeral
  - i: interjection
  - u: punctuation
- 2: person
  - 1: first person
  - 2: second person
  - 3: third person
- 3: number
  - s: singular
  - p: plural
  - d: dual
- 4: tense
  - p: present
  - i: imperfect
  - r: perfect
  - l: pluperfect
  - t: future perfect
  - f: future
  - a: aorist
- 5: mood
  - i: indicative
  - s: subjunctive
  - o: optative
  - n: infinitive
  - m: imperative
  - p: participle
- 6: voice
  - a: active
  - p: passive
  - m: middle
  - e: medio-passive
- 7: gender
  - m: masculine
  - f: feminine
  - n: neuter
- 8: case
  - n: nominative
  - g: genitive
  - d: dative
  - a: accusative
  - v: vocative
  - l: locative
- 9: degree
  - c: comparative
  - s: superlative
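The positional scheme above lends itself to a simple lookup. The following is a minimal sketch, not part of the repository: the names `POSITIONS` and `decode_tag` are hypothetical, and the sketch assumes, as the example tags such as "p-s---mn-" suggest, that a hyphen marks a position that does not apply to the token.

```python
# One (feature name, value table) pair per character position, in order.
POSITIONS = [
    ("part of speech", {"n": "noun", "v": "verb", "a": "adjective",
                        "d": "adverb", "l": "article", "g": "particle",
                        "c": "conjunction", "r": "preposition",
                        "p": "pronoun", "m": "numeral",
                        "i": "interjection", "u": "punctuation"}),
    ("person", {"1": "first person", "2": "second person",
                "3": "third person"}),
    ("number", {"s": "singular", "p": "plural", "d": "dual"}),
    ("tense", {"p": "present", "i": "imperfect", "r": "perfect",
               "l": "pluperfect", "t": "future perfect", "f": "future",
               "a": "aorist"}),
    ("mood", {"i": "indicative", "s": "subjunctive", "o": "optative",
              "n": "infinitive", "m": "imperative", "p": "participle"}),
    ("voice", {"a": "active", "p": "passive", "m": "middle",
               "e": "medio-passive"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "v": "vocative", "l": "locative"}),
    ("degree", {"c": "comparative", "s": "superlative"}),
]

def decode_tag(tag):
    """Expand a 9-character tag into a {feature: value} dict."""
    if len(tag) != len(POSITIONS):
        raise ValueError("expected a 9-character tag, got %r" % tag)
    features = {}
    for (name, values), char in zip(POSITIONS, tag):
        if char == "-":  # assumption: hyphen means "not applicable"
            continue
        features[name] = values.get(char, "unknown (%s)" % char)
    return features

print(decode_tag("p-s---mn-"))
print(decode_tag("v3saia---"))
```

Characters that are not listed in the tables are reported as "unknown" instead of raising an error, since automatically produced tags may contain values not documented here.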
The abbreviations in t/l/l1/@o (used in MorpheusUnderPhilologic) have the same meaning as in Morpheus (see above), except for the first two characters, which are to be read as follows:
- ae: proper adjective (e.g., Ἀθηναῖος)
- ne: proper noun (e.g., Ζεύς)
- d-: adverb (e.g., οὐ)
- dd: demonstrative adverb (e.g., ταύτῃ)
- de: proper name adverb (e.g., Ἀθήναζε)
- di: interrogative adverb (e.g., ποῦ)
- dr: relative adverb (e.g., οἷ)
- dx: indefinite adverb (e.g., που)
- c-: conjunction (e.g., καί)
- r-: preposition
- p-: pronoun
- pa: definite article
- pc: reciprocal pronoun (e.g., ἀλλήλους)
- pd: demonstrative pronoun (e.g., οὗτος)
- pi: interrogative pronoun (e.g., τίς)
- pk: reflexive pronoun (e.g., σεαυτόν)
- pp: personal pronoun (e.g., με)
- pr: relative pronoun (e.g., ὅς)
- ps: possessive pronoun (e.g., ἐμός)
- px: indefinite pronoun (e.g., τις)
- m-: numeral
- i-: interjection (e.g., ὀτοτοί)
- e-: exclamation
- y-: math term or abbreviation for all of Euclid's ΑΒΓ geometrical figures
- g-: particle
- gm: modal particle (e.g., κε)
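To make the two-character prefix concrete, here is a minimal sketch that separates it from the rest of an l1/@o value. The names `FINE_POS` and `split_fine_tag` are hypothetical. The sketch assumes that the value is 10 characters long (a two-character part of speech followed by the remaining eight positions), which is what the example values "pr-s---mn-" and "pd-p---fa-" suggest; prefixes not listed above are returned with a label of `None`.

```python
FINE_POS = {
    "ae": "proper adjective", "ne": "proper noun",
    "d-": "adverb", "dd": "demonstrative adverb",
    "de": "proper name adverb", "di": "interrogative adverb",
    "dr": "relative adverb", "dx": "indefinite adverb",
    "c-": "conjunction", "r-": "preposition",
    "p-": "pronoun", "pa": "definite article",
    "pc": "reciprocal pronoun", "pd": "demonstrative pronoun",
    "pi": "interrogative pronoun", "pk": "reflexive pronoun",
    "pp": "personal pronoun", "pr": "relative pronoun",
    "ps": "possessive pronoun", "px": "indefinite pronoun",
    "m-": "numeral", "i-": "interjection", "e-": "exclamation",
    "y-": "math term or abbreviation", "g-": "particle",
    "gm": "modal particle",
}

def split_fine_tag(tag):
    """Split an l1/@o value into (part-of-speech label, remaining positions)."""
    if len(tag) != 10:  # assumption based on the example values
        raise ValueError("expected a 10-character tag, got %r" % tag)
    prefix, rest = tag[:2], tag[2:]
    return FINE_POS.get(prefix), rest

label, rest = split_fine_tag("pr-s---mn-")
print(label, rest)

# For the two example tokens, the first prefix character plus the
# remaining positions reproduces the 9-character value of t/@o.
print("pr-s---mn-"[0] + rest == "p-s---mn-")
```

The final comparison holds for the example tokens only; it should not be read as a general rule, since prefixes such as "pa" (definite article) or "e-" (exclamation) have no obvious one-character counterpart in the Morpheus list.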
In version (1.2.5):
- Lemmas for prepositions and particles have been corrected, along with a few clear mistakes concerning article lemmas. This has dramatically increased the number of lemmas available in the corpus: 21493806 lemmas against 25522507 tokens.
In version (1.2.4):
- The codepoint "’" is used both as an apostrophe and as a quotation mark. Some known issues stemming from (wrong) Betacode conversion make tokenization for this codepoint hard to handle. This version corrects some tokenization errors by moving the quotation mark "’" into a separate token.
In version (1.2.3):
- The position of "’" for elision has been corrected, i.e., it is now placed in the same element as the elided word. (This error arose because the apostrophe had been encoded with different codepoints.)
In version (1.2.2):
- In tlg0018.tlg010.opp-grc1.xml and tlg0018.tlg015.opp-grc1.xml, the erroneous ’Kv at the beginning of the first sentence has been corrected to Ἐν.
- In tlg0018.tlg019.opp-grc1.xml, the erroneous ’Η at the beginning of the first sentence has been corrected to Ἡ.
- The position of "’" has been corrected, i.e., it is now placed at the end of a sentence.
- Duplicate l1 and l2 elements have been deleted.
In version (1.2.1):
- Lemmas have been corrected: if a Morpheus lemma (l2) is the same as a MorpheusUnderPhilologic lemma (l1), it is deleted.
- Documentation has been improved: the meaning of the abbreviations in @o is now published.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.