Current methods of computational morphological tagging provide no information with respect to the phonological processes that led to a given form’s inclusion in a particular morphological category. In order to retrieve this information for a particular form, one must undertake the cumbersome task of consulting the abstract phonological schemes in standard grammars and then attempt to manually apply those schemes to reconstruct the derivational path of the form in question (e.g. θεσος → -σ → θεός). In light of this state of affairs, it is our contention that a computational morphophonemic model is needed. This model would utilize both deductive and inductive approaches to verify and display the morphophonemic derivational patterns of existing lexical forms in ancient Greek. The deductive aspect of the project would entail leveraging the phonological and morphological paradigms contained in various grammars as well as existing texts with rich information to establish a set of formalized morphophonemic rules that are manually encoded into a functional programming language. This program is then run against a text in order to isolate patterns that reveal where and how each rule applies. Moreover, the inductive aspect of the project will entail applying the aforementioned rules to a Unicode phonological script (i.e. consonants, vowels, and diacritics) in an effort to computationally construct the phonological derivation of existing morphological and lexical forms. In short, we are proposing a dynamic morphophonemic data set for ancient Greek that is partly manual and partly automated. We manually specify prototypical patterns along with exceptions to the patterns; the computer then applies those patterns, showing where they occur.
At the beginning of his book The Morphology of Biblical Greek, William Mounce states that “morphology is directly controlled by phonology” (p. 2). Despite the veracity of this axiom, current methods of computational morphological tagging provide no information with respect to the phonological processes that led to a given form’s inclusion in a particular morphological category. Moreover, in order to retrieve this information for a particular form, one must still undertake the cumbersome task of consulting the abstract phonological schemes in the grammars (e.g. Smyth, Mounce, inter alia) and then attempt to manually apply those schemes to reconstruct the derivational path of the form in question (e.g. θεσος → -σ → θεός). Adding to the complexity of this process, existing grammars typically provide only prototypical phonological patterns, making it all the more difficult to reconstruct the derivational path of non-prototypical forms.
In light of this state of affairs, it is our contention that a computational morphophonemic model is needed. This model would utilize both deductive and inductive approaches to verify and display the morphophonemic derivational patterns of existing lexical forms in ancient Greek. The deductive aspect of the project would entail leveraging the phonological and morphological paradigms contained in various grammars (e.g. Mounce, 1994; Smyth, 1956; inter alia) as well as existing texts with rich information (e.g. SBLGNT; Perseus) to establish a set of formalized morphophonemic rules that are manually encoded into a functional programming language (e.g. Haskell). This program is then run against a text in order to isolate patterns that reveal where and how each rule applies. Moreover, the inductive aspect of the project will entail applying the aforementioned rules to a Unicode phonological script (i.e. consonants, vowels, and diacritics) in an effort to computationally construct the phonological derivation of existing morphological and lexical forms.
This computational approach will involve the following steps:
- Loading Greek Corpora: We will load a text with rich information (e.g. metadata such as paragraphs, book, chapter, verse, punctuation, variant readings) into the computer’s memory. This will enable us to isolate the surface form of the text by omitting spaces, punctuation, annotation, etc. (although this information will remain readily available when needed).
- Modeling Greek Script: Using the functional programming language Haskell, we will create a model of the script itself, including letters, consonants, capitalization, accents, diaereses, and some possible amalgamation rules, although strictly limited to the script level. The surface form of the text will then be mapped onto the script model. This will allow us to ask questions of the surface text that arise only at the script level (e.g. breathing marks and some aspects of accentuation).
- Modeling Greek Phonology: As a first step in modeling Greek phonology, we will break down the letters of the script into their respective vowel and consonant sounds (e.g. dental, glottal, velar, etc.). This phonological reconstruction will be largely deductive, relying on data from existing sources (e.g. grammars). Subsequently, we will map the letters of the script to their corresponding phonological values. This mapping will initiate the process of forming specific phonological derivation rules as individual vowels and consonants combine and transform into divergent and sometimes more complex phonological units (e.g. diphthongs, syllables, etc.). Moreover, the rules we create at this stage may be highly specific due to the availability of the data described in steps 1–3. For example, we can create a rule that combines punctuation (step 1), script (e.g. letters and accents, step 2), and syllables (step 3).
- Modeling Greek Morphology: Lastly, we will introduce the morphological notions of stem, suffix, prefix, etc., in addition to rules of elision and contraction (e.g. how adding a suffix or prefix to a stem often induces phonological transformations). At this stage, the human and the computer collaborate to isolate patterns and exceptions to those patterns. By utilizing our formalized rules along with existing data, we will be able to derive paradigm and lemma information.
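The first three layers above could be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the names `Token`, `letters`, `Sound`, `classify`, and `phonemes` are hypothetical, the consonant classes are a coarse simplification of what the grammars describe, and the input is assumed to have been stripped of accents and breathing marks already.

```haskell
import Data.Char (isPunctuation, toLower)

-- Corpus layer: a word together with the metadata it was loaded with.
data Token = Token
  { reference :: String  -- e.g. book, chapter, and verse
  , surface   :: String  -- the word as it appears in the text
  } deriving (Eq, Show)

-- Script layer: lowercase letters only, with punctuation removed.
letters :: Token -> String
letters = map toLower . filter (not . isPunctuation) . surface

-- Phonology layer: a coarse, hypothetical classification of sounds.
data Place = Labial | Dental | Velar
  deriving (Eq, Show)

data Sound
  = Vowel Char
  | Stop Place Char
  | Sibilant
  | Liquid Char
  | Nasal Char
  | DoubleConsonant Char
  deriving (Eq, Show)

-- Map one unaccented lowercase letter to its class.
-- Anything unrecognized is reported as Nothing instead of being dropped.
classify :: Char -> Maybe Sound
classify c
  | c `elem` "αεηιουω" = Just (Vowel c)
  | c `elem` "πβφ"     = Just (Stop Labial c)
  | c `elem` "τδθ"     = Just (Stop Dental c)
  | c `elem` "κγχ"     = Just (Stop Velar c)
  | c `elem` "σς"      = Just Sibilant
  | c `elem` "λρ"      = Just (Liquid c)
  | c `elem` "μν"      = Just (Nasal c)
  | c `elem` "ζξψ"     = Just (DoubleConsonant c)
  | otherwise          = Nothing

-- Map a token from the corpus layer through the script layer to sounds.
phonemes :: Token -> [Maybe Sound]
phonemes = map classify . letters
```

Loaded into GHCi, `phonemes (Token "John 1:1" "Λογος,")` yields a liquid, a vowel, a velar stop, a vowel, and a sibilant. Because each layer keeps the original `Token`, the reference and punctuation stay available to later rules. Accented vowels are returned as `Nothing` in this sketch; a fuller script model would have to represent accents and breathing marks explicitly.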
By computationally constructing a morphology in this way, it is probable that we will be able to transparently display the possible phonological processes that led to a given form. Additionally, we will be able to computationally verify or falsify existing derivational hypotheses (i.e. phonological rules as posited by the grammars) that produced particular morphological forms, including instances where exceptions to established rules are apparent.
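To make the idea of displaying and checking a derivation concrete, here is a small sketch under stated assumptions. The types `Rule` and `Step` and the functions `derive` and `supports` are hypothetical names, the forms are unaccented lowercase strings, and the two rules (loss of σ between vowels, and a deliberately naive contraction of ε + ο to ου) are simplified stand-ins for what the grammars describe.

```haskell
-- A named rewrite rule over unaccented, lowercase Greek (an assumption of this sketch).
data Rule = Rule
  { ruleName  :: String
  , applyRule :: String -> String
  }

-- One step of a derivation: which rule fired and what it produced.
data Step = Step
  { stepRule   :: String
  , stepResult :: String
  } deriving (Eq, Show)

isVowel :: Char -> Bool
isVowel c = c `elem` "αεηιουω"

-- Delete σ between two vowels. Final ς is a different character and is untouched.
dropSigma :: String -> String
dropSigma (a : 'σ' : b : rest)
  | isVowel a && isVowel b = a : dropSigma (b : rest)
dropSigma (c : cs) = c : dropSigma cs
dropSigma []       = []

-- Deliberately naive contraction: every ε + ο becomes ου, with no exceptions.
contractEO :: String -> String
contractEO ('ε' : 'ο' : rest) = 'ο' : 'υ' : contractEO rest
contractEO (c : cs)           = c : contractEO cs
contractEO []                 = []

sigmaLoss, contraction :: Rule
sigmaLoss   = Rule "intervocalic sigma loss" dropSigma
contraction = Rule "ε + ο contraction" contractEO

-- Apply the rules in order, recording only the rules that changed the form.
derive :: [Rule] -> String -> [Step]
derive [] _ = []
derive (r : rs) form
  | next == form = derive rs form
  | otherwise    = Step (ruleName r) next : derive rs next
  where
    next = applyRule r form

-- Does the hypothesized rule sequence turn the underlying form into the attested one?
supports :: [Rule] -> String -> String -> Bool
supports rules underlying attested = finalForm == attested
  where
    steps     = derive rules underlying
    finalForm = if null steps then underlying else stepResult (last steps)
```

With these definitions, `derive [sigmaLoss, contraction] "γενεσος"` records two steps, first γενεος and then γενους. By contrast, `supports [sigmaLoss, contraction] "θεσος" "θεος"` is `False`, because the naive contraction goes on to produce θους, whereas `supports [sigmaLoss] "θεσος" "θεος"` is `True`. A mismatch of this kind is the signal that a rule needs a manually specified exception or restriction.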
More broadly, this morphophonemic project represents the implementation of an exploratory methodology aimed at advancing the grammatical research of ancient texts beyond its current state. As a general observation, there currently exist two main approaches to the problem of doing grammatical analyses on ancient Greek texts: manual tagging and automated tagging using machine learning. There are advantages and limitations to both approaches.
In the first approach, a person manually tags each word, phrase, clause, etc. against a grammatical model. For example, morphological tagging is often done by a person manually annotating each word with a morphological code. The set of valid morphological codes must be decided in advance. Often the person is using a tool that validates the morphological codes as they are entered. The main advantage of this manual approach is high-quality data. Since the annotator is human and understands the language, the results have a low percentage of error, especially when compared with computer-based statistical approaches.
However, the manual approach is also inherently limited due to constraints on time and the degree to which the human can consistently annotate the text. Manually annotating each word is repetitive and time-consuming, often taking months or years. With respect to consistency, although the annotator knows the language well, comprehensive grammatical analyses involve a high number of decisions about details that are not well represented in the grammatical model. The annotator may start by making decisions one way, but months later may be annotating in a different way. Often it may be desired to revisit the earlier decisions, but time and other constraints make that unrealistic.
In the second approach, the computer automatically tags the data using statistical methods, often using training data. This is commonly known as machine learning. The advantage to the computer doing the annotation is time and scale. Once the computer has been trained to annotate text according to a grammatical model, it can complete annotations of large volumes of text in a reasonable time, given enough computer hardware.
The primary limitation of machine learning is the quality of the data. Accuracy rates around 80% are not uncommon. Even at 95% accuracy, there would be on average one error every 20 words, which is significant and disappointing for detailed grammatical analyses. Another limitation of machine learning is that its decision processes are opaque. One cannot access the reasoning of the computer. For example, consider the machine learning task of tagging emails as spam. One does not know why an email was tagged as spam, and one cannot easily investigate the reasoning that led the computer to its decision.
Our approach seeks to leverage the advantages of both the manual and automated approaches. Instead of manually tagging each word or phrase, the human manually enters patterns into a formal language. These patterns represent grammatical rules that apply to the text. Since the computer excels at applying patterns, it is easy for it to automatically apply those patterns to the text.
The key aspect of this approach is the formal language for the patterns or rules. For our initial project, the formal language will be the functional programming language Haskell. In the future, we may consider other options such as type theory, which has been successful in the formalization of mathematics. Type theory is a language with a proof-theoretic component, allowing us to prove properties about our rules. For example, we would be able to prove that our rules are complete, that is, that our set of phonological rules completely and unambiguously annotates each individual word of a set of corpora.
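As a rough illustration of what completeness and unambiguity mean here, the following sketch checks them at run time over a finite word list. It is not a proof of the kind type theory would provide, and every name in it (`Pattern`, `Coverage`, `coverage`, `report`, `isComplete`, `endsIn`) is hypothetical. The suffix patterns are toy examples over unaccented forms, not actual grammatical rules.

```haskell
import Data.List (isSuffixOf)

-- A pattern is a named predicate over a word (a hypothetical representation).
data Pattern = Pattern
  { patternName :: String
  , matches     :: String -> Bool
  }

-- How many patterns account for a given word.
data Coverage
  = Uncovered          -- no pattern matches
  | Unique String      -- exactly one pattern matches
  | Ambiguous [String] -- more than one pattern matches
  deriving (Eq, Show)

-- Toy pattern constructor: the word ends in the given suffix.
endsIn :: String -> Pattern
endsIn suffix = Pattern ("ends in -" ++ suffix) (suffix `isSuffixOf`)

-- Classify one word by the patterns that match it.
coverage :: [Pattern] -> String -> Coverage
coverage ps w =
  case [patternName p | p <- ps, matches p w] of
    []  -> Uncovered
    [n] -> Unique n
    ns  -> Ambiguous ns

-- Show, for every word, where the patterns apply.
report :: [Pattern] -> [String] -> [(String, Coverage)]
report ps ws = [(w, coverage ps w) | w <- ws]

-- The rule set is complete and unambiguous for these words
-- if every word is matched by exactly one pattern.
isComplete :: [Pattern] -> [String] -> Bool
isComplete ps = all (isUnique . coverage ps)
  where
    isUnique (Unique _) = True
    isUnique _          = False
```

For instance, `report [endsIn "ος", endsIn "ον"] ["λογος", "εργον", "και"]` marks the first two words as uniquely covered and the third as uncovered, so `isComplete` returns `False` for that list. A check like this only covers the words it is given; proving the property for all inputs is where a proof-theoretic language would come in.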
Since applying type theory to our specific problem did not seem like a trivial task, we are starting with a more accessible language. Type theories are similar to functional programming languages. We chose Haskell for this project since it is a functional programming language with many open source libraries and a strong online community.
Initially, the program will be a command-line utility that reports the results of applying the grammatical rules. All of the rules and modeling are written in Haskell. The code is available on GitHub at https://github.com/scott-fleischman/greek-grammar. Future developments may include a web application for viewing and interacting with the results, as well as for creating and updating rules.