arprince/addressing_the_homeric_question_with_machine_learning

Addressing the Homeric Question with Machine Learning

Prelude

I had planned to move away from ancient author identification tasks and on to other areas of NLP, but I happened across a gold mine - a large corpus of ancient Greek texts with part-of-speech tags included. My initial motivation was the hope that I might be able to train a transformer on the corpus. This proved infeasible, as the corpus is much too small and the time range over which the documents were written is far too wide. Given that the writers in this corpus are thousands of miles and years apart, I could not be certain that a classification model pretrained on the entire corpus would classify based on authorial signature instead of dialect or time period.

Nevertheless, I found some really good data and I believe I have the ability to address the Homeric Question - was Homer of the Iliad and Odyssey one writer, two, or many? I am really excited to share these results as I feel that they are quite compelling. However, I want to be upfront that for any scientific hypothesis there is a certain burden of proof that must be met. The great challenge is that the Iliad and the Odyssey are old. They are incredibly, incredibly old. These texts are thought to have been written around 750 BCE (with Homer himself writing about a topic that likely predated him by some 400 years). My oldest sizable non-Homeric sources (Herodotus and Thucydides) are 200-300 years later than Homer, and thus may not make a perfect comparator. That said, while not ideal, I'd say our data is strong enough to begin.

Method

First we need to build our dataset. Any texts we wish to compare are joined into a single document, with each verse labeled by its author. If one work is larger than another, we drop random verses from the larger one until we have a balanced dataset. Each word in the Greek text is then converted to its part-of-speech (POS) tag. A given sentence might now look like:

x-------- g-------- g-------- p-d---ma- n-p---mg- n-s---fd- v3saia--- v--pne--- u--------
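
The dataset-building steps above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: it assumes the treebank files are XML in which each `<word>` element carries a `postag` attribute, the sample XML is hand-written for the example, and the helper names `sentences_to_tags` and `balance` are hypothetical.

```python
import random
import xml.etree.ElementTree as ET

# Hand-written sample in the style of the treebank XML (an assumption about
# the file layout; real files carry more attributes per word).
SAMPLE_XML = """
<treebank>
  <sentence id="1">
    <word id="1" form="μῆνιν" postag="n-s---fa-"/>
    <word id="2" form="ἄειδε" postag="v2spma---"/>
    <word id="3" form="θεά" postag="n-s---fv-"/>
  </sentence>
  <sentence id="2">
    <word id="1" form="ἄνδρα" postag="n-s---ma-"/>
    <word id="2" form="μοι" postag="p-s----d-"/>
  </sentence>
</treebank>
"""


def sentences_to_tags(xml_text):
    """Return one space-separated string of POS tags per <sentence>."""
    root = ET.fromstring(xml_text)
    docs = []
    for sentence in root.iter("sentence"):
        tags = [w.get("postag") for w in sentence.iter("word") if w.get("postag")]
        if tags:
            docs.append(" ".join(tags))
    return docs


def balance(docs_a, docs_b, seed=0):
    """Randomly drop items from the larger list so both have equal length.

    The surviving items keep their original order.
    """
    rng = random.Random(seed)
    n = min(len(docs_a), len(docs_b))

    def keep(docs):
        kept_indices = sorted(rng.sample(range(len(docs)), n))
        return [docs[i] for i in kept_indices]

    return keep(docs_a), keep(docs_b)


tag_docs = sentences_to_tags(SAMPLE_XML)
print(tag_docs[0])  # n-s---fa- v2spma--- n-s---fv-
```

Keeping the surviving verses in their original order is a design choice made here so that a verse's position in the text can still be used later, for example when plotting where misclassifications occur.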

POS tags, among other stylometric feature sets, are excellent (almost mandatory) for author identification tasks because they abstract situational context away from the author's words and force the classifier to learn from the underlying style. There is no punctuation and there are no unstemmed words to worry about, so the data is ready to vectorize. In the past, I've preferred W2V and fastText for vectorization, but keeping it simple proved to be a very performant strategy for this dataset. I will use a TF-IDF vectorizer with a random forest classifier - very simple. I shied away from deep learning because of the small size of the data (though a little initial tinkering showed that bidirectional LSTMs or TCNs were a good place to start). We fit our data with this same kernel and plot a confusion matrix and a misclassification histogram for each comparison we do. We will use accuracy as our evaluation metric. An accuracy score of 100% would indicate with confidence that we do in fact have two different authors. A score of 50% would indicate either that we have a single author or that our classifier is a dud.
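
A minimal sketch of such a kernel is shown below, assuming scikit-learn (the text names the technique but not the library). The synthetic "verses", the `AUTHOR_A`/`AUTHOR_B` labels, the helper names `toy_corpus` and `evaluate`, and every hyperparameter are assumptions made for illustration; they are not the repository's settings or data.

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def toy_corpus(tag_pool, n_docs, rng, doc_len=12):
    """Build synthetic 'verses' by sampling tags from a pool (stand-in data)."""
    return [" ".join(rng.choices(tag_pool, k=doc_len)) for _ in range(n_docs)]


def evaluate(docs, labels, seed=0):
    """Fit TF-IDF + random forest; return (accuracy, confusion matrix)."""
    x_train, x_test, y_train, y_test = train_test_split(
        docs, labels, test_size=0.25, stratify=labels, random_state=seed
    )
    model = make_pipeline(
        # Tokenize on whitespace only so tags such as "v3saia---" stay whole;
        # the default token pattern would cut them at the hyphens.
        TfidfVectorizer(token_pattern=r"\S+", lowercase=False),
        RandomForestClassifier(n_estimators=100, random_state=seed),
    )
    model.fit(x_train, y_train)
    predicted = model.predict(x_test)
    return accuracy_score(y_test, predicted), confusion_matrix(y_test, predicted)


rng = random.Random(0)
shared = ["n-s---mn-", "d--------", "c--------"]
# Two synthetic "authors" that differ only in one favoured verb form.
docs = toy_corpus(shared + ["v3saia---"], 100, rng) + toy_corpus(
    shared + ["v--pne---"], 100, rng
)
labels = ["AUTHOR_A"] * 100 + ["AUTHOR_B"] * 100

accuracy, matrix = evaluate(docs, labels)
print(f"accuracy: {accuracy:.2f}")
print(matrix)
```

On real data, `docs` would be the balanced POS-tag strings for the two texts being compared, and the same `evaluate` call would be repeated for each pairing.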

Implementation

To start, we need to establish some ground truth and run some sanity checks. On this continuum of classification accuracy, where 50% means that there is (potentially) one author and 100% means two, our first step is to see where two authors who are known to be different land on the scale. We will first compare two historians who lived around the same time, Herodotus and Thucydides (Thucydides was fascinating for his use of primary sources and his refusal to attribute events to the will of the gods as his contemporary Herodotus did). This should give us a reference point for two authors writing in the same genre within the same time period. We run our kernel and see that we classify the two works with 75% accuracy.

For our next sanity check we will compare Herodotus and Thucydides to both the Iliad and the Odyssey. We run the kernel four times: Herodotus v. Iliad, Herodotus v. Odyssey, Thucydides v. Iliad, and Thucydides v. Odyssey. Our results yield classification accuracies in the high 90s for all tests. This is rather unfortunate, as it sheds some light on just how different our Homeric texts are from our historians, who lived hundreds of years later. Nonetheless, it helps to build a baseline for what to expect from a document that we know has multiple authors.

For our last type of sanity check we'll start by taking the work of Herodotus and splitting it in half randomly, as if he were two different authors. We will label half of the Herodotus text as "MOCK_CLASS." After running our kernel we see that we only achieve 50% classification accuracy. This is good and is to be expected. We run this same test for Thucydides and get the same results. We then perform this test on the Iliad and the Odyssey. Each comes back with roughly 50% accuracy. This is potentially our first piece of exciting information. A high score would have indicated that the classifier detected two distinct styles within the same work. Low scores here mean that we "may" be looking at two single-author texts. Let's investigate further.
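
The random split described above can be sketched as follows. The helper name `mock_split` and the placeholder verses are hypothetical; only the "MOCK_CLASS" label comes from the text.

```python
import random


def mock_split(verses, author, seed=0):
    """Randomly relabel half of one author's verses as MOCK_CLASS."""
    rng = random.Random(seed)
    mock_indices = set(rng.sample(range(len(verses)), len(verses) // 2))
    return [
        "MOCK_CLASS" if i in mock_indices else author
        for i in range(len(verses))
    ]


verses = [f"verse {i}" for i in range(10)]  # placeholder for POS-tag strings
labels = mock_split(verses, "HERODOTUS")
print(labels.count("MOCK_CLASS"), labels.count("HERODOTUS"))  # 5 5
```

Because the two labels are assigned at random within one author's text, a classifier has no real signal to learn, so accuracy near 50% is the expected outcome.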

Ignoring that the works of Homer are much older than our sanity checks, the theory goes that for Homer to be a single author we would expect a score very close to 50%, and certainly less than 75-95%. Let's run the kernel! We see that we classify the Iliad and the Odyssey correctly 71% of the time. This number is lower than our sanity check of the two historians (Herodotus v. Thucydides) but much greater than the sanity checks with just one author at a time.

Confusion matrix showing classification accuracy between the Iliad and Odyssey.

Some interesting observations: We see in the confusion matrix that there is a slight misclassification preference in which the classifier thinks a verse is from the Iliad but it ends up being from the Odyssey - the sequel. Our misclassification histogram shows misclassifications spread roughly evenly across the board - there is no core section of the Odyssey that consistently gets misclassified as being written by the author of the Iliad.

Histogram showing classifier misclassification between the Iliad and Odyssey from beginning to end of each poem. There does not appear to be a significant centre of misclassification.
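
One way to compute a positional histogram like this is sketched below. It assumes the true and predicted labels are available in the text's original verse order; how held-out verses were mapped back to their positions is not stated in the write-up, and the function name and bin count are hypothetical.

```python
def misclassification_histogram(y_true, y_pred, n_bins=10):
    """Count wrong predictions per positional bin, from start to end of a text.

    y_true and y_pred must be in the text's original verse order.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = [0] * n_bins
    total = len(y_true)
    for position, (truth, guess) in enumerate(zip(y_true, y_pred)):
        if truth != guess:
            # Map the verse index onto one of n_bins equal-width slices.
            counts[position * n_bins // total] += 1
    return counts


y_true = ["ODYSSEY"] * 8
y_pred = [
    "ODYSSEY", "ILIAD", "ODYSSEY", "ODYSSEY",
    "ILIAD", "ILIAD", "ODYSSEY", "ODYSSEY",
]
print(misclassification_histogram(y_true, y_pred, n_bins=4))  # [1, 0, 2, 0]
```

A pronounced spike in one bin would point to a section written in a different style; roughly flat counts are what the text describes.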

Result

We're left with a score that is likely far too high to indicate a single author, but too low not to need some additional explanation. Here I can only hypothesize. Perhaps the lower score is simply due to the shared dactylic hexameter making sentences appear more similar (for Greek, this meter plays a role much like the iambic pentameter we are familiar with from Shakespeare). Perhaps the writer of the Iliad wrote a core piece of the Odyssey (I reject this idea because there is no central point of misclassification in either text according to our histograms). Perhaps, in the works' long history of being performed, a redactor came along, standardized the two works, and imparted some of their own style on them.

While I wish the results could tell me exactly "why" the Homeric works are a little more similar to each other than the different-author pairs in our sanity checks, I feel like we've shown very strong evidence for a multiple-authorship theory, and it's been exciting to put some math to the Homeric Question.

Classification results for all data sources. Sources that have minimal stylometric differences and are most likely to be one author are shaded green. Sources whose authors most likely lived in the same time period are shaded yellow. Sources whose authors lived in different time periods are shaded red and have the most stylometric differences.

Data

https://github.com/PerseusDL/treebank_data/tree/master/v2.1/Greek
