DC Session 7 Translation alignment

Session 7. Translation alignment

Thursday Feb 27, 16:00 UK = 17:00 CET

Convenors: Chiara Palladino (Furman University), Tariq Yousef (Leipzig)

Introduction: what is text alignment? (5 mins)
Ugarit: a tool for text alignment (10 mins)
Live demo of Ugarit (10 mins)
Case studies: translation alignment in the classroom (10 mins)
Applications: automatic translation alignment, graph databases, dynamic lexicon (15 mins)
Presentation of the exercise: low stakes and high stakes (15 mins)

Gregory Crane (2019). "Beyond Translation: Language Hacking and Philology." Harvard Data Science Review 1.2. Available: https://doi.org/10.1162/99608f92.282ad764
Tamara Pataridze & Bastien Kindt (2018). "Text Alignment in Ancient Greek and Georgian: A Case-Study on the First Homily of Gregory of Nazianzus." Journal of Data Mining and Digital Humanities. Available: https://jdmdh.episciences.org/4182/pdf

Go to Ugarit and create a bilingual alignment of a parallel corpus of your choice, or use our suggestion, a Bible parallel corpus in several languages: https://github.com/SunoikisisDC/SunoikisisDC-2019-2020/tree/master/2020-Digital-Classics-slides/Translation%20Alignment/data/txt Choose two languages that you are familiar with and focus on the differences between the translations: Which words align perfectly? Which words align imperfectly, or not at all? Which words are missing from one text or the other? What is the overall percentage of matches?
After you have completed the bilingual alignment, choose a parallel text in a third language that you do not know and perform a trilingual alignment. See how much of the third language you can align, using the other two languages as an aid to understanding.
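
If you want to check the "overall percentage of matches" by hand, the following minimal Python sketch shows one possible way to compute it from a list of alignment links. The function name `alignment_coverage`, the link format (pairs of token positions), and the short Latin/English fragment are assumptions made for illustration; the statistics Ugarit displays may be defined differently.

```python
def alignment_coverage(source_tokens, target_tokens, links):
    """Summarise how much of each text is covered by alignment links.

    links is a list of (source_index, target_index) pairs, 0-based.
    This is a hypothetical helper, not part of Ugarit.
    """
    aligned_src = {i for i, _ in links}
    aligned_tgt = {j for _, j in links}
    return {
        # share of tokens on each side that take part in at least one link
        "source_pct": 100.0 * len(aligned_src) / len(source_tokens),
        "target_pct": 100.0 * len(aligned_tgt) / len(target_tokens),
        # tokens left without a counterpart in the other text
        "unaligned_source": [w for i, w in enumerate(source_tokens)
                             if i not in aligned_src],
        "unaligned_target": [w for j, w in enumerate(target_tokens)
                             if j not in aligned_tgt],
    }


# Illustrative fragment; the links below are one reader's choice.
latin = ["In", "principio", "creavit", "Deus"]
english = ["In", "the", "beginning", "God", "created"]
links = [(0, 0), (1, 2), (2, 4), (3, 3)]

stats = alignment_coverage(latin, english, links)
print(stats)
```

Here the English article "the" has no Latin counterpart, so it appears among the unaligned target tokens and lowers the English-side percentage.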

Look at Tariq's Jupyter notebook on translation alignment with NLTK/Python. Edit the notebook to compare two texts of your choice, and examine the results. Report any interesting features back to your class.
Update: Tariq has very kindly added a new Jupyter notebook that should allow you to (a) visualize the automated IBM alignments from the above exercise directly in the browser; and (b) use Ugarit visualization with already aligned sentences from the NLTK Comtrans corpus. Documentation will be added shortly. Please feel free to get in touch if you have any questions about this process.
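
To get a feel for what an automated IBM alignment does before opening the notebooks, here is a minimal pure-Python sketch of IBM Model 1 trained with expectation-maximisation. It is not the code from Tariq's notebooks, and it simplifies the model by leaving out the NULL word; the function names `train_ibm1` and `align` and the three-sentence German/English toy corpus are assumptions made for illustration. NLTK offers its own implementation in `nltk.translate`, which the notebooks build on.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate word translation probabilities t[(target, source)].

    pairs is a list of (source_tokens, target_tokens) sentence pairs.
    Simplified sketch: no NULL word, no smoothing.
    """
    target_vocab = {f for _, tgt in pairs for f in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform guess

    for _ in range(iterations):
        count = defaultdict(float)  # expected counts of (target, source)
        total = defaultdict(float)  # expected counts per source word
        for src, tgt in pairs:
            for f in tgt:
                # E-step: spread each target word over the source words
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: re-estimate probabilities from the expected counts
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def align(src, tgt, t):
    """Link each target token to its most probable source token.

    Returns (source_index, target_index) pairs.
    """
    return [
        (max(range(len(src)), key=lambda i: t[(f, src[i])]), j)
        for j, f in enumerate(tgt)
    ]


corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t = train_ibm1(corpus, iterations=10)
for src, tgt in corpus:
    print(src, tgt, align(src, tgt, t))
```

Because "das" co-occurs with "the" in two sentences and "Buch" with "book" in two sentences, the probabilities for those pairs grow with each iteration, and the remaining words ("Haus"/"house", "ein"/"a") are resolved by elimination. Try replacing the toy corpus with a few verses from the parallel Bible texts to see how sparse data affects the result.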