DC Session 4 Python
Thursday Feb 6, 16:00 UK = 17:00 CET
Convenors: Paula Granados García (Open University & The Watercolour World), Matteo Romanello (École polytechnique fédérale de Lausanne)
YouTube link: https://youtu.be/JDxRd-RYkXA
Presentation: run the Jupyter notebooks on Binder:
NB: if you want to edit the notebooks on Binder (a cloud-based platform) – for example to do the exercise – without losing your edits when the session expires (usually after 15 minutes of inactivity), make sure to read this tutorial.
This session will begin with a general discussion of programming for the humanities with a specific focus on how programming languages can be useful to humanists, followed by a general introduction to the Python programming language. We will then look at two key Python libraries (collections of code that enhance Python functionality for specific purposes): Pandas (for structuring and analysing data) and Beautiful Soup (for parsing HTML and XML). These skills will then be illustrated with specific examples and exercises, all of which will be available for your use and adaptation in the Jupyter notebook linked from this session page.
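To give a first flavour of what Pandas offers before the session, here is a minimal, self-contained sketch; the table of inscriptions is invented purely for illustration and is not the session data:

```python
import pandas as pd

# A toy table of inscriptions (invented data, for illustration only)
df = pd.DataFrame({
    "place": ["Athens", "Rome", "Athens", "Ephesus"],
    "century": [-5, 1, -4, 2],
})

# Pandas makes quick summaries easy, e.g. counting occurrences per category
print(df["place"].value_counts())
```

Running this prints a count of inscriptions per findspot, the kind of one-line summary that would otherwise take a loop and a dictionary in plain Python.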
In preparation for this session, please install or activate a version of Jupyter Notebooks on your computer or in the cloud (see below under "Exercise" for links).
- Kestemont, Mike & Justin A. Stover (2016), "The Authorship of the Historia Augusta: Two new computational studies." Bulletin of the Institute of Classical Studies 59.2. Pp. 140–157. Available: https://onlinelibrary.wiley.com/doi/epdf/10.1111/j.2041-5370.2016.12043.x
- Romanello, Matteo (2016). "Exploring Intertextuality in Classics through Citation Networks" Digital Humanities Quarterly 10.2. Available: http://www.digitalhumanities.org/dhq/vol/10/2/000255/000255.html
- Büchler, Marco, et al. (2013), "Measuring the Influence of a Work by Text-Reuse." In ed. Dunn/Mahony, The Digital Classicist 2013. Bulletin of the Institute of Classical Studies, Supplement 122. Pp. 63–79.
- Hawkins, Laura F. (2018). "Computational Models for Analyzing Data Collected from Reconstructed Cuneiform Syllabaries." Digital Humanities Quarterly 12.1. Available: http://digitalhumanities.org:8081/dhq/vol/12/1/000368/000368.html (Wayback Machine)
- McKinney, W. (2011). "pandas: a foundational Python library for data analysis and statistics." Python for High Performance and Scientific Computing 14. Available: https://www.dlr.de/sc/portaldata/15/resources/dokumente/pyhpc2011/submissions/pyhpc2011_submission_9.pdf
- Teodora Petkova (2017). "Semantic Information Extraction: From Data Bits to Knowledge Bytes." Ontotext blog, 22 June 2017. Available: https://ontotext.com/semantic-information-extraction-data-bits-knowledge-bytes/
- Python Programming for the Humanities http://www.karsdorp.io/python-course/
- Programming Historian: https://programminghistorian.org/en/lessons/
- Python for Everybody: https://www.py4e.com/
- Pandas: https://pandas.pydata.org/pandas-docs/version/0.15/tutorials.html
- Charlie Harper (2018). "Visualizing Data with Bokeh and Pandas." Programming Historian. Available: https://programminghistorian.org/en/lessons/visualizing-with-bokeh
- Jeri Wieringa (2012), "Intro to Beautiful Soup." Programming Historian. Available: https://programminghistorian.org/en/lessons/intro-to-beautiful-soup
- To set up your own Jupyter Notebook environment:
- Install Jupyter on your desktop (easiest as part of the Anaconda package) (getting started with Jupyter)
- Set up a Microsoft Azure Notebooks instance online (if you have a Microsoft 365 or Skype account)
- Set up a Google Colab notebooks instance online (if you have Gmail or Google account) (getting started with Colab)
- The session notebooks can also be downloaded as a bundle and run locally using any of the above tools
Exercise description
- You are asked to write a simple Python program by modifying the code we provided in the notebook `Pandas_BeautifulSoup.ipynb`, section "XML data → DataFrame"; the current code looks for the `<name>` element and creates a `DataFrame` out of it. For the exercise you are asked to do something similar, but for a different set of TEI/EpiDoc elements of your choice.
- These are the steps to follow:
  - identify one or more TEI elements of interest (these can be lemmata, variants, bibliographic elements, metadata, etc.);
  - specify what information you want to retain from them, and extract it from the XML (via `BeautifulSoup`) by modifying the code provided;
  - convert it to a `pandas.DataFrame` and explore some statistics (for example by using `value_counts()`).
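The exercise steps above can be sketched as follows. This is only an illustrative outline, not the notebook's own code: the `<persName>` element and the inline XML snippet are invented assumptions standing in for whichever TEI/EpiDoc elements you choose.

```python
# Requires beautifulsoup4 (and lxml for the "xml" parser) plus pandas.
from bs4 import BeautifulSoup
import pandas as pd

# Invented TEI-like snippet, standing in for the real EpiDoc data
xml = """
<TEI>
  <body>
    <persName type="divine">Zeus</persName>
    <persName type="mortal">Achilles</persName>
    <persName type="divine">Athena</persName>
  </body>
</TEI>
"""

# Steps 1-2: pick an element of interest and extract the information to retain
soup = BeautifulSoup(xml, "xml")
records = [
    {"name": el.get_text(), "type": el.get("type")}
    for el in soup.find_all("persName")
]

# Step 3: convert to a DataFrame and explore some statistics
df = pd.DataFrame(records)
print(df["type"].value_counts())
```

The same shape works for any element: change the tag name in `find_all()` and the fields collected in the dictionary, and the rest of the pipeline stays identical.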