Dear students, welcome to the course repository, where you will find all information supplementing this term's machine learning for policy analysis course. Here you will find the lectures on the two topics introduced (Supervised Machine Learning & Natural Language Processing) in video format, plus accompanying R Markdown notebooks.
To get the most out of these lectures, I expect you to have R & RStudio installed and updated on your local machine, and to be generally used to doing data analytics in R using the `tidyverse` ecosystem. If that is not the case, you might want to take a look at the additional resources such as `My R Brush-up course (Bonus)` below, where I recap the fundamentals of working with data in R.
::::::::::::::> Watch this intro video to get started <:::::::::::::::::
Daniel is a Strategic Business Manager at NovoNordisk, where his team develops data-driven methods and workflows to improve the performance of clinical trials. He is also an Associate Professor in Data Science & Innovation Economics at the Aalborg University Business School, where he led the Data Science research track at the AI:Growth lab and coordinated teaching at the Social Data Science (SDS) master's specialization. His research is dedicated to the development and application of data-driven methods to map, understand, and predict technological change, and its causes and consequences for socioeconomic systems on various levels of aggregation. His current contextual focus is the dynamics of AI research and industry.
His research is featured in leading academic journals such as Research Policy, has also attracted attention and funding from industry, and has led to prize-winning applications. Daniel is actively engaged in initiatives to educate (social science) students and researchers, professionals, and policymakers in understanding, evaluating, and applying modern Data Science and Artificial Intelligence methods for data-driven decision making.
As part of the AI:DK project, he coordinates and leads AI proof-of-concept projects within industry. His team also develops enterprise and policy software solutions for IP search and technology mapping.
- A: Case study: Using NLP and ML to predict green patents ::> Html <::
Legend:
- T: Theory lecture, explaining concepts without using too much code
- A: Applications and demonstrations of concepts and techniques, mostly code-based
- E: Exercises for you to try your skills
This part will introduce you to the fundamentals of supervised machine learning (SML, a.k.a. predictive modelling), and illustrate practical applications thereof in R.
- T: Introduction to supervised ML ::> Video 1: Introduction & Statistics Refresher <:: ::> Video 2: Generalization, Hyperparameter Tuning & Model Classes <:: ::> Slides <::
- A: Applied supervised machine learning in R: ::> Video 1: Introduction & ML workflows with tidymodels <:: ::> Video 2: Regression problem case <:: ::> Video 3: Classification problem case <:: ::> Html <:: ::> Colab <::
In this part you will be introduced to the fundamentals of analysing textual data and their practical application in R. After reviewing the basics of string manipulation, we will move to bag-of-words style text summaries, and then on to slightly more advanced applications such as sentiment analysis and topic modelling.
- A: Basics of text analysis in R ::> Video 1: Introduction to text analysis in R <:: ::> Html <::
- A: Working with long text and extracting text elements in R ::> Video 1 <:: ::> Html <::
- A: Text Vectorization and Topic Modelling in R ::> Video 1 <:: ::> Html <::
Find below a list of further resources (including my own material), either to brush up on basic R knowledge, supplement what you learn here, or dive deeper into related or advanced topics.
- Hain, D. S., Jurowetzki, R., Squicciarini, M., & Xu, L. (2023). Unveiling the neurotechnology landscape: scientific advancements innovations and major trends.
- Nechaev, I., & Hain, D. S. (2023). Social impacts reflected in CSR reports: Method of extraction and link to firms innovation capacity. Journal of Cleaner Production, 429, 139256.
- Hain, D. S., Jurowetzki, R., Buchmann, T., & Wolf, P. (2022). A text-embedding-based approach to measuring patent-to-patent technological similarity. Technological Forecasting and Social Change, 177, 121559.: Own paper, where we introduce text embeddings and use them to map technology based on patent data.
- Bekamiri, H., Hain, D. S., & Jurowetzki, R. (2021). PatentSBERTa: A Deep NLP based Hybrid Model for Patent Distance and Classification using Augmented SBERT. arXiv preprint arXiv:2103.11933.: More advanced version of the use of embeddings on patents.
- Hain, D. S., Jurowetzki, R., Konda, P., & Oehler, L. (2020). From catching up to industrial leadership: towards an integrated market-technology perspective. An application of semantic patent-to-patent similarity in the wind and EV sector. Industrial and Corporate Change, 29(5), 1233-1255.: Application of the technique.
- Wickham, H., & Grolemund, G. (2016). R for data science: import, tidy, transform, visualize, and model data. O'Reilly Media, Inc.: The bible of modern data science in R. Use this to get started.
- Baumer, B., Kaplan, D. & Horton, N. (2020) Modern Data Science with R (2nd Ed.). CRC Press: A nice supplementary book that also touches upon topics such as simulation and network analysis.
- Ismay & Kim (2020), Statistical Inference via Data Science: A ModernDive into R and the Tidyverse, CRC Press.: For those who want to first update their knowledge in basic and inferential statistics in a modern R setup.
- Hain, D., & Jurowetzki, R. (2020). Introduction to Rare-Event Predictive Modeling for Inferential Statisticians--A Hands-On Application in the Prediction of Breakthrough Patents. arXiv preprint arXiv:2003.13441.: One of our introductory papers. A somewhat more elaborate version of what we did so far, on a more exciting dataset.
- Kuhn, M., Silge, J. (2020). Tidy Modeling with R: Great introduction to `tidymodels` by the makers.
- Kuhn, M. & Johnson (2019), Feature Engineering and Selection: A Practical Approach for Predictive Models, Taylor & Francis.: Less code but many deep insights into modern ML details, by Max Kuhn, the maker of much of `tidymodels` and `caret`.
- Silge, Julia (2020). Supervised Machine Learning Case Studies in R. Online course: Great interactive course Julia took out of DataCamp to offer it for free instead. Fully updated to the tidymodels workflow. YOU ALL SHOULD DO IT!
- Julia Silge and David Robinson (2020). Text Mining with R: A Tidy Approach, O’Reilly.: Great introduction to the `tidytext` ecosystem and NLP in R by the package makers.
- Emil Hvidfeldt and Julia Silge (2020). Supervised Machine Learning for Text Analysis in R: More advanced introduction to SML based on textual data.
- Efficient R Programming
- Fundamentals of Data Visualization (O'Reilly)
- Data Visualization (R): A practical introduction
- Exploring Enterprise Databases with R
- R Markdown: The Definitive Guide
As a bonus, find some very basic introductions to working with data in R (from another course of mine) below. If you are already used to working with R and the tidyverse, there is no need to go through them. But in case you feel your R skills need a bit of a brush-up, feel free to go through the material before attending my classes.
- T: Introduction to the R Data Science Ecosystem ::> Video <:: ::> Slides <::
- A: Basics of statistical programming in R ::> Video <:: ::> Html <:: ::> Colab <::
- T: Introduction to data ::> Video <:: ::> Slides <::
- T: Data manipulation basics in R ::> Video <:: ::> Slides <::
- A: Data manipulation in R ::> Video <:: ::> Html <:: ::> Colab <::
- T: Data Visualization ::> Video <:: ::> Slides <::
- A: Basic data visualization in R using ggplot ::> Video 1 <:: ::> Html <:: ::> Colab <::
- E: Data manipulation & visualization basic exercises ::> 1: Basics <:: ::> 2: Joins <:: ::> 3: Data Manipulation Challenge <:: ::> 4: EDA & Dataviz <::