Data transformation with Python and regular expressions

Thursday November 2, 2017, 16h00-17h15 Greenwich Mean Time

Session 7: Data transformation with Python and regular expressions

Convenors: Matteo Romanello (DAI/EPFL), Simona Stoyanova (University of London)

YouTube link: https://youtu.be/KoRMngEWbNE

Notebook: http://nlp.dainst.org:8888/notebooks/SunoikisisPython.ipynb (for security reasons, the token to access the notebook server will be given at the beginning of the lesson)

Outline

This session will introduce the syntax and some uses of the Python programming language. Through a Jupyter notebook, we will give examples and exercises that enable the beginning student to grasp the basics of changing the format of text and other data. Examples will include tabular geographical and other archaeological datasets in CSV or JSON, which you will be able to transform and enrich with functions and regular expressions.
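To give a flavour of this kind of transformation, here is a minimal sketch that reads a small CSV table, uses a regular expression to split a coordinate string into signed latitude and longitude values, and prints the enriched records as JSON. The sample rows, the column names, the coordinate format and the helper `parse_coordinates` are all invented for illustration; they are not taken from the session notebook.

```python
import csv
import io
import json
import re

# Invented sample rows with approximate coordinates (illustration only).
csv_text = "name,coordinates\nAthens,37.98N 23.73E\nRome,41.90N 12.50E\n"

# Assumed format: decimal degrees followed by a hemisphere letter.
COORD_PATTERN = re.compile(r"(\d+\.\d+)([NS])\s+(\d+\.\d+)([EW])")


def parse_coordinates(value):
    """Turn a string like '37.98N 23.73E' into signed (latitude, longitude) floats."""
    match = COORD_PATTERN.fullmatch(value.strip())
    if match is None:
        return None  # leave rows we cannot parse untouched
    lat, ns, lon, ew = match.groups()
    latitude = float(lat) * (1 if ns == "N" else -1)
    longitude = float(lon) * (1 if ew == "E" else -1)
    return latitude, longitude


records = []
for row in csv.DictReader(io.StringIO(csv_text)):
    coords = parse_coordinates(row["coordinates"])
    if coords is not None:
        # Enrich the record with two new numeric fields.
        row["latitude"], row["longitude"] = coords
    records.append(row)

print(json.dumps(records, indent=2))
```

With a real dataset, `io.StringIO(csv_text)` would be replaced by a file opened with `open(path, newline="", encoding="utf-8")`.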

Seminar readings

Seth van Hooland, Max De Wilde, Ruben Verborgh, Thomas Steiner, Rik Van de Walle, 'Exploring entity recognition and disambiguation for cultural heritage collections', Literary and Linguistic Computing, Volume 30, Issue 2, 1 June 2015, Pages 262–279. Available: http://freeyourmetadata.org/publications/named-entity-recognition.pdf
Giovanni Moretti, Rachele Sprugnoli, Stefano Menini, Sara Tonelli (2017), "ALCIDE: Extracting and visualising content from large document collections to support Humanities studies." Science Direct 00 (2017), 1–19. Available: https://drive.google.com/file/d/0B-Y9d701jA5WeTJ3dDJKY3NQanc/view

Other resources

Teodora Petkova, 'Semantic Information Extraction: From Data Bits to Knowledge Bytes', Ontotext blog post, 22 June 2017 https://ontotext.com/semantic-information-extraction-data-bits-knowledge-bytes/
Stephen Brown, 'Words, words. They’re all we have to go on: Image finding without the pictures', Digital Scholarship in the Humanities, Volume 31, Issue 4, 1 December 2016, Pages 671–688, https://doi.org/10.1093/llc/fqv018
Suleiman Odat, Tudor Groza, Jane Hunter, 'Extracting structured data from publications in the Art Conservation Domain', Literary and Linguistic Computing, Volume 30, Issue 2, 1 June 2015, Pages 225–245, https://doi.org/10.1093/llc/fqu002
NLTK and CLTK
Learn more Python with: Codecademy; Python Programming for the Humanities; Programming Historian; Python for Everybody.

Exercise

Ex 1

By modifying code contained in this notebook, extract all dates (e.g. "1982") from a web page of your choice (hint: you may need regexps); when done, write the extracted dates to a CSV file.

Too easy? Do the same as above, but keep track of the line number where each date occurs (hint: split the text of the web page into lines, then apply the regexp to each line; you may need a counter to keep track of the line numbers).

The CSV file can have as many columns as you think are necessary.
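One possible starting point for Ex 1 is sketched below. It is not the notebook's code. It assumes that a "date" is a standalone four-digit year between 1000 and 2099, and it works on a short invented string instead of a downloaded web page, so you will still need to fetch and clean the page text yourself. The function names, the column names and the output file name `dates.csv` are hypothetical.

```python
import csv
import re

# Assumption: a "date" is a standalone four-digit year from 1000 to 2099.
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def extract_years(text):
    """Return (line_number, year) pairs, counting lines from 1."""
    rows = []
    # enumerate() plays the role of the counter mentioned in the hint.
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in YEAR_PATTERN.finditer(line):
            rows.append((line_number, match.group(0)))
    return rows


def write_years_csv(rows, path):
    """Write the extracted pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["line_number", "year"])
        writer.writerows(rows)


# Invented sample text standing in for the text of a web page.
sample_text = "Excavated in 1982.\nNo date here.\nPublished 1990, revised 2005."
rows = extract_years(sample_text)
write_years_csv(rows, "dates.csv")
```

The pattern deliberately ignores longer digit runs such as "19825", because `\b` requires a word boundary on both sides of the year. Widen or narrow the year range to suit the page you chose.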

Ex 2 (optional; difficulty=advanced)

Read the CSV file produced for Ex. 1 into a dataframe and add a column indicating how many times the date occurs in the dataframe.
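A minimal sketch of one way to approach Ex 2, assuming the pandas library and a CSV with the hypothetical columns `line_number` and `year`. To stay self-contained it reads an invented CSV string through `io.StringIO`; with your own file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Invented stand-in for the CSV produced in Ex 1 (column names are assumptions).
csv_text = "line_number,year\n1,1982\n3,1990\n7,1982\n"

df = pd.read_csv(io.StringIO(csv_text))

# For every row, count how many rows in the dataframe share the same year.
df["occurrences"] = df.groupby("year")["year"].transform("count")

print(df)
```

`groupby(...).transform("count")` returns one value per original row, which is why the result can be assigned directly as a new column.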
