
Data transformation with python and regular expressions

simonastoyanova edited this page Oct 31, 2017 · 9 revisions

Thursday November 2, 2017, 16h00-17h15 Greenwich Mean Time

Session 7: Data transformation with python and regular expressions

Convenors: Matteo Romanello (DAI/EPFL), Simona Stoyanova (University of London)

YouTube link: https://youtu.be/KoRMngEWbNE

Notebook: http://nlp.dainst.org:8888/notebooks/SunoikisisPython.ipynb (for security reasons, the token to access the notebook server will be given at the beginning of the lesson)

Outline

This session will introduce the syntax and some uses of the Python programming language. Through a Jupyter notebook, we will give examples and exercises that help beginners grasp the basics of reshaping text and other data. Examples will include tabular geographical and other archaeological datasets in CSV or JSON, which you will be able to transform and enrich with functions and regular expressions.

Seminar readings

Other resources

Exercise

Ex 1

By modifying code contained in this notebook, extract all dates (e.g. "1982") from a web page of your choice (hint: you may need regexps); when done, write the extracted dates to a CSV file.
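As a starting point, here is a minimal sketch of the extract-and-write step. It is not the notebook's code: the pattern assumes a "date" is a standalone four-digit year between 1000 and 2099, the sample text stands in for a downloaded web page, and the names `YEAR_PATTERN`, `extract_years` and `write_years_csv` are invented for illustration. It writes to an in-memory buffer; a real file opened with `open("dates.csv", "w", newline="")` would work the same way.

```python
import csv
import io
import re

# Assumption: a "date" is a standalone four-digit year from 1000 to 2099.
YEAR_PATTERN = re.compile(r"\b(?:1\d{3}|20\d{2})\b")


def extract_years(text):
    """Return every year-like string found in text, in order of appearance."""
    return YEAR_PATTERN.findall(text)


def write_years_csv(years, handle):
    """Write one year per row to an open file-like object, after a header row."""
    writer = csv.writer(handle)
    writer.writerow(["date"])
    for year in years:
        writer.writerow([year])


# Hypothetical sample text standing in for the text of a web page.
sample_text = "Excavated in 1982 and again in 2004; find number 12345 is not a date."
years = extract_years(sample_text)

# An in-memory buffer is used here instead of a file on disk.
buffer = io.StringIO()
write_years_csv(years, buffer)
print(buffer.getvalue())
```

The word boundaries (`\b`) keep the pattern from matching inside longer numbers such as `12345`; you may want a stricter or looser pattern depending on the page you choose.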

Too easy? Same as above, but keep track of the line number where each date occurs (hint: split the text of the web page into lines, and then apply the regexp to each line; you may need a counter to keep track of the line numbers).

The CSV file can have as many columns as you think are necessary.
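One possible way to approach the line-number variant is sketched below. Again, this is only an illustration under assumptions: the year pattern is the same simplified one (standalone four-digit years from 1000 to 2099), the sample text is made up, and `extract_years_with_lines` is a hypothetical helper name. Here `enumerate` plays the role of the counter mentioned in the hint.

```python
import re

# Assumption: a "date" is a standalone four-digit year from 1000 to 2099.
YEAR_PATTERN = re.compile(r"\b(?:1\d{3}|20\d{2})\b")


def extract_years_with_lines(text):
    """Return (line_number, year) pairs, with line numbers starting at 1."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        # A single line may contain several dates, so collect all of them.
        for year in YEAR_PATTERN.findall(line):
            rows.append((line_number, year))
    return rows


# Hypothetical sample text standing in for the text of a web page.
sample_text = "Founded in 1871.\nNo dates here.\nRestored in 1982 and 2004."
rows = extract_years_with_lines(sample_text)
print(rows)
```

Each pair can then become one CSV row, for example with columns `line_number` and `date`.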

Ex 2 (optional; difficulty=advanced)

Read the CSV file produced for Ex. 1 into a dataframe and add a column indicating how many times the date occurs in the dataframe.
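A rough sketch of one way to do this with pandas follows. It assumes the CSV from Ex. 1 has columns named `line_number` and `date` (your file may differ), and it reads from an in-memory string rather than a file so that the example is self-contained; the column name `occurrences` is likewise just a choice made here.

```python
import io

import pandas as pd

# Assumption: the Ex. 1 CSV has these two columns; the rows are made-up sample data.
csv_text = "line_number,date\n1,1982\n2,1999\n3,1982\n"

# With a real file you would pass its path to read_csv instead of a StringIO object.
df = pd.read_csv(io.StringIO(csv_text))

# For every row, count how many rows in the dataframe share the same date.
df["occurrences"] = df.groupby("date")["date"].transform("count")
print(df)
```

Using `transform` keeps one row per extracted date, so the count sits next to each original row instead of collapsing the dataframe to one row per distinct date.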
