This course familiarises PhD students with the main text mining techniques used in social science and develops basic skills in digital methods. On completion, you will be familiar with the theoretical and methodological underpinnings of the natural language processing perspective and able to conduct a basic text analysis. Throughout the course we will focus on applying text analysis to empirical data, where possible related to students' own research.
Students will become familiar with digital methods in text analysis as a flexible approach that comes with a practical set of research instruments for empirically investigating a range of questions in social science. They will learn how to approach and manage text data, analyse texts, and visualise the results.
Session date | Session | Lecture topic | Seminar topic |
---|---|---|---|
12 April | 1 | Introduction to text mining | Import text data |
14 April | 2 | Analysing text | Methods for text preprocessing |
19 April | 3 | Analysing words | Methods for word analysis |
21 April | 4 | Topic modelling | Methods for analysing topics |
26 April | 5 | NLP Ethics + live coding | Biases and real example analysis |
28 April | 6 | Text mining in the real world | Analysing your own text |
You must be a PhD student at King's, Queen Mary or Imperial, and you must have already registered as a LISS DTP student via the following link: https://www.liss-dtp.ac.uk/registration/.
Turing, A.M. and Haugeland, J., 1950. Computing machinery and intelligence. The Turing Test: Verbal Behavior as the Hallmark of Intelligence, pp.29-56.
Weizenbaum, J., 1966. ELIZA - a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1), pp.36-45.
Hutchins, W.J., 2004, September. The Georgetown-IBM experiment demonstrated in January 1954. In Conference of the Association for Machine Translation in the Americas (pp. 102-114). Springer, Berlin, Heidelberg.
https://www.ibm.com/ibm/history/exhibits/701/701_translator.html
Bender, E.M., Hovy, D. and Schofield, A., 2020, July. Integrating ethics into the NLP curriculum. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts (pp. 6-9).
Friedl, J.E., 2006. Mastering regular expressions. O'Reilly Media, Inc. [Introduction]
Anandarajan, M., Hill, C. and Nolan, T., 2019. Practical text analytics. Springer, Cham. [Chapters 4 and 5]
Bird, S., Klein, E. and Loper, E., 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc. [Chapters 3 and 7]
Mikolov, T., Chen, K., Corrado, G. and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Rong, X., 2014. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.
Bird, S., Klein, E. and Loper, E., 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc. [Chapter 9]
Blei, D.M., Ng, A.Y. and Jordan, M.I., 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), pp.993-1022.
Anandarajan, M., Hill, C. and Nolan, T., 2019. Practical text analytics. Springer, Cham. [Chapter 7]
Bird, S., Klein, E. and Loper, E., 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc. [Chapter 6]
Bender, E.M., Gebru, T., McMillan-Major, A. and Shmitchell, S., 2021, March. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610-623).
Bolukbasi, T., Chang, K.W., Zou, J.Y., Saligrama, V. and Kalai, A.T., 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems, 29.
Caliskan, A., Bryson, J.J. and Narayanan, A., 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), pp.183-186.
Garg, N., Schiebinger, L., Jurafsky, D. and Zou, J., 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), pp.E3635-E3644.
Anandarajan, M., Hill, C. and Nolan, T., 2019. Practical text analytics. Springer, Cham. [Chapter 12]
Bird, S., Klein, E. and Loper, E., 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc. [Chapter 11]
Bengfort, B., Bilbro, R. and Ojeda, T., 2018. Applied text analysis with Python: enabling language-aware data products with machine learning. O'Reilly Media, Inc.
Bird, S., Klein, E. and Loper, E., 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc.
Eisenstein, J., 2018. Natural language processing.
Hovy, D., 2020. Text analysis in Python for social scientists: discovery and exploration. Cambridge University Press.
Manning, C. and Schütze, H., 1999. Foundations of statistical natural language processing. MIT Press.
https://www.nltk.org