Data transformation with Python and regular expressions
Thursday November 2, 2017, 16h00-17h15 Greenwich Mean Time
Session 7: Data transformation with Python and regular expressions
Convenors: Matteo Romanello (DAI/EPFL), Simona Stoyanova (University of London)
YouTube link: https://youtu.be/KoRMngEWbNE
Notebook: http://nlp.dainst.org:8888/notebooks/SunoikisisPython.ipynb (for security reasons, the token to access the notebook server will be given at the beginning of the lesson)
This session will introduce the syntax and some uses of the Python programming language. Through a Jupyter notebook we will give examples and exercises that help beginners grasp the basics of changing the format of text and other data. Examples will include tabular geographical and other archaeological datasets in CSV or JSON, which you will be able to transform and enrich with functions and regular expressions.
- Seth van Hooland, Max De Wilde, Ruben Verborgh, Thomas Steiner, Rik Van de Walle, 'Exploring entity recognition and disambiguation for cultural heritage collections', Literary and Linguistic Computing, Volume 30, Issue 2, 1 June 2015, Pages 262–279. Available: http://freeyourmetadata.org/publications/named-entity-recognition.pdf
- Giovanni Moretti, Rachele Sprugnoli, Stefano Menini, Sara Tonelli (2017), "ALCIDE: Extracting and visualising content from large document collections to support Humanities studies." Science Direct 00 (2017), 1–19. Available: https://drive.google.com/file/d/0B-Y9d701jA5WeTJ3dDJKY3NQanc/view
- Teodora Petkova, 'Semantic Information Extraction: From Data Bits to Knowledge Bytes', Ontotext blog post, 22 June 2017. Available: https://ontotext.com/semantic-information-extraction-data-bits-knowledge-bytes/
- Stephen Brown, 'Words, words. They’re all we have to go on: Image finding without the pictures', Digital Scholarship in the Humanities, Volume 31, Issue 4, 1 December 2016, Pages 671–688, https://doi.org/10.1093/llc/fqv018
- Suleiman Odat, Tudor Groza, Jane Hunter, 'Extracting structured data from publications in the Art Conservation Domain', Literary and Linguistic Computing, Volume 30, Issue 2, 1 June 2015, Pages 225–245, https://doi.org/10.1093/llc/fqu002
- NLTK and CLTK
- Learn more Python with: Codecademy; Python Programming for the Humanities; Programming Historian; Python for Everybody.
Ex 1
By modifying code contained in this notebook, extract all dates (e.g. "1982") from a web page of your choice (hint: you may need regexps); when done, write the extracted dates to a CSV file.
Too easy? Same as above, but keep track of the line number where each date occurs (hint: split the text of the web page into lines, then apply the regexp to each line; you may need a counter to keep track of the line numbers).
The CSV file can have as many columns as you think are necessary.
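One possible shape for Ex 1, offered as a minimal sketch and not as the notebook's own solution. It assumes a "date" is a four-digit year between 1000 and 2099. The names `YEAR_PATTERN`, `extract_dates`, `write_dates_csv` and the column names are invented for illustration, and `sample_text` stands in for the text of the web page you download.

```python
import csv
import io
import re

# Assumption: a "date" is a four-digit year from 1000 to 2099.
# Adjust the pattern to the kinds of dates you expect to find.
YEAR_PATTERN = re.compile(r"\b(?:1[0-9]{3}|20[0-9]{2})\b")


def extract_dates(text):
    """Return (line_number, date) pairs for every year-like match in text."""
    rows = []
    # enumerate() plays the role of the line counter mentioned in the hint.
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in YEAR_PATTERN.finditer(line):
            rows.append((line_number, match.group(0)))
    return rows


def write_dates_csv(rows, handle):
    """Write a header row followed by one row per extracted date."""
    writer = csv.writer(handle)
    writer.writerow(["line_number", "date"])
    writer.writerows(rows)


# Hypothetical stand-in for the text of a downloaded web page.
sample_text = "Founded in 1982.\nNo date here.\nRevised in 1995 and again in 2017."
rows = extract_dates(sample_text)

# Written to an in-memory buffer here so the sketch has no side effects.
buffer = io.StringIO()
write_dates_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To produce a real file, replace the buffer with `open("dates.csv", "w", newline="")` inside a `with` statement; the file name is only an example.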
Ex 2 (optional; difficulty=advanced)
Read the CSV file produced for Ex. 1 into a dataframe and add a column indicating how many times the date occurs in the dataframe.
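A possible starting point for Ex 2, assuming the dataframe library is pandas and that the CSV has the hypothetical columns `line_number` and `date`. The inline `csv_text` stands in for the file written in Ex 1; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; in the exercise this comes from your Ex 1 file.
csv_text = "line_number,date\n1,1982\n3,1995\n4,1982\n"

df = pd.read_csv(io.StringIO(csv_text))

# For each row, count how many rows in the dataframe share the same date.
df["count"] = df.groupby("date")["date"].transform("count")

print(df)
```

`groupby(...).transform("count")` returns one value per original row, so the result can be assigned directly as a new column without merging.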