We have started!
This course is an advanced course at CAUP running in March and April 2019. Lectures take place on Mondays at 14:00 and practical classes on Thursdays at 10:00. Both last two hours, with a short break.
The aim of this course is to give you a good practical grasp of machine learning. I will not spend a lot of time on algorithm details; instead I will focus on how to use these methods in python and try to discuss which methods are useful for which type of scientific question or research goal.
- March 4 - Managing data and simple regression
  - Covering git and SQL
  - Introducing machine learning through regression techniques
- March 11 - Visualisation and inference methods
  - Visualisation of data: dos and don'ts
  - Classical inference
  - Bayesian inference
  - MCMC
- March 18 - Density estimation and model choice
  - Estimating densities, parametric & non-parametric
  - Bias-variance trade-off
  - Cross-validation
  - Classification
- March 25 - Dimensional reduction
  - Standardising data
  - Principal Component Analysis
  - Manifold learning
- April 8 - Ensemble methods, neural networks, deep learning
  - Local regression methods
  - Random forests and other ensemble methods
  - Neural networks & deep learning
I expect that you have read through these two documents:
- A couple of Python & Topcat pointers. This is a very basic document and might not contain much that is new to you. It does have a couple of tasks to try out; the solutions to these can be found in the [ProblemSets/0 - Pyton and Topcat](ProblemSets/0 - Pyton and Topcat) directory.
- A reminder/intro to relevant math, which contains a summary of some basic facts from linear algebra and probability theory that are useful for this course.
Below you can find some books of use. The links from the titles take you to the Amazon page. Where free versions of the books are legally available online, I include a link as well.
- "Statistics, Data Mining, and Machine Learning in Astronomy" - Ivezic, Connolly, VanderPlas & Gray. I base the course partially on this.
- "Deep Learning" - Goodfellow, Bengio & Courville. I have also consulted this one.
- "Pattern Classification" - Duda, Hart & Stork. A classic in the field.
- "Pattern Recognition and Machine Learning" - Bishop. A very good and comprehensive book; personally I really like this one.
- "Bayesian Data Analysis" - Gelman. Often the first book you are pointed to if you ask questions about Bayesian analysis.
- "Information Theory, Inference and Learning Algorithms" - MacKay. A very readable book on a lot of related topics. The book is also freely available on the web.
- "Introduction to Statistical Learning" - James et al. A readable (fairly basic) introduction to statistical techniques of relevance. It is also freely available on the web.
- "Elements of Statistical Learning" - Hastie et al. A more advanced version of the Introduction to Statistical Learning, with much the same authors. This is also freely available on the web.
- "Bayesian Models for Astrophysical Data" - Hilbe, Souza & Ishida. A good reference book for a range of Bayesian techniques and a good way to learn about different modelling frameworks for Bayesian inference.
In this case you will want to fork the repository rather than just clone it. You can follow the instructions below (credit to Alexander Mechev for this) to create a fork of the repository:
- Make a github account and log in.
- Click on 'Fork' at the top right. This will create a 'fork' under your own account. That means that you now have the latest commit of the repo and its history under your control. If you've tried to 'git push' to the MLD2019 repo you'd have noticed that you don't have access to it.
- Once it's forked, you can go to your github profile and you'll see a MLD2019 repo. Go to it and get the .git link (green button).
- Somewhere on your machine, run `git clone https://github.com/[YOUR_GIT_UNAME]/MLD2019.git`. You also need to enter the directory (`cd MLD2019`).
- Add our repo as an upstream, so that you can get (pull) new updates: `git remote add upstream https://github.com/jbrinchmann/MLD2019.git`
- `git remote -v` should now give:

  ```
  origin    https://github.com/[YOUR_GIT_UNAME]/MLD2019.git (fetch)
  origin    https://github.com/[YOUR_GIT_UNAME]/MLD2019.git (push)
  upstream  https://github.com/jbrinchmann/MLD2019.git (fetch)
  upstream  https://github.com/jbrinchmann/MLD2019.git (push)
  ```
- Now you're ready to add files and folders to your local fork. Use `git add`, `git commit` and `git push` (to `origin master`) to add your assignments; the full sequence is sketched below.
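Putting these steps together, here is a minimal sketch of the whole workflow. The file name `my_solution.ipynb` and the commit message are placeholders for illustration, and fetching plus merging `upstream/master` is one way to pull in new course material:

```
# Clone your fork (not the course repo) and enter the directory
git clone https://github.com/[YOUR_GIT_UNAME]/MLD2019.git
cd MLD2019

# Connect the course repo as 'upstream'
git remote add upstream https://github.com/jbrinchmann/MLD2019.git

# Later: pull in new course material from upstream
git fetch upstream
git merge upstream/master

# Add your own work and push it to your fork
git add ProblemSets/my_solution.ipynb   # placeholder file name
git commit -m "Add my solution"
git push origin master
```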
The course will make use of python throughout, and for this you need a recent version of python installed. I use python 3 by default but will try to make all scripts compatible with both python 2 and python 3. For python I recommend that you have at least these libraries installed:
- numpy - for numerical calculations
- astropy - because we are astronomers
- scipy - because we are scientists
- sklearn - machine learning (full package name: scikit-learn)
- matplotlib - plotting (you can use alternatives of course)
- pandas - nice handling of data
- seaborn - nice plots
(the last two are really "nice to have" but if you can install the others then these are easy).
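If you want to check that everything is in place, a quick sanity check is the snippet below; it is just a sketch, relying on each package exposing the usual `__version__` attribute:

```python
# Try to import each recommended package and print its version
for name in ["numpy", "astropy", "scipy", "sklearn",
             "matplotlib", "pandas", "seaborn"]:
    try:
        module = __import__(name)
        print(name, module.__version__)
    except ImportError:
        print(name, "is NOT installed")
```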
You should also get astroML, which has a nice web page at https://www.astroml.org and a git repository at https://github.com/astroML/astroML.
It turns out that the astroML distribution that is often picked up when you install it using a package manager (maybe also pip?) is outdated and does not work with new versions of sklearn. To check whether you have a problem, try:

```python
from astroML.datasets import fetch_sdss_sspp
```

If this crashes with a complaint about a module GMM, you have the old version. To fix this, the best way is probably to check out the git version of astroML linked above using e.g.:

```
git clone https://github.com/astroML/astroML.git
```
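Note that after cloning you still need to install the package into your python environment; one way to do this (a standard pip install from the cloned directory, run from where you did the clone) is:

```
cd astroML
pip install .
```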
To use astroML in Anaconda you need to get it from the astropy channel. For a one-off install you can do:

```
conda install -c astropy astroML
```

If you want to add the astropy channel permanently (which probably is a good idea), you can do:

```
conda config --add channels astropy
```
The slides are available in the Lectures directory. You can find some files for creating tables in the ProblemSets/MakeTables directory.
In the final problem class we will look at using deep learning in python. There are quite a few libraries for this around, but we will use the most commonly used one, TensorFlow, and we will use the keras python package for interacting with TensorFlow. Keras is a high-level interface (and can also use other libraries, Theano and CNTK, in addition to TensorFlow).
There are many pages that detail the installation of these packages and what you need for them. A good one with a bias towards Windows is this one. I will give a very brief summary here of how I set things up. This is not optimised for Graphical Processing Unit (GPU) work so for serious future work you will need to adjust this.
I am going to assume you use anaconda for your python environment. If not, you need to change this section a bit - use virtualenv instead of setting up a conda environment. It is definitely better to keep your TensorFlow/keras etc. setup out of your default python work environment. Most of the packages are also installed with pip rather than conda, so what I use is:

```
conda create -n tensorflow pip python=3.6
```

This creates an environment called tensorflow which uses python 3.6 and pip for installation. To use this we need to activate it first:

```
conda activate tensorflow
```

(assuming you use bash - I do not, so I need to do some more tricks; use bash). Your prompt should now change to include (tensorflow).
I went for the simplest approach here:

```
pip install --upgrade tensorflow
```

This takes a while - the package is fairly large, 71.6Mb in my installation, and it requires a fair number of additional packages.

```
pip install keras
```

This is quicker.

```
pip install ipython
```

because that is not installed by default (you can skip this if you prefer not to use ipython).

```
pip install jupyter
```

because my example is a jupyter notebook.
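To check that the installation worked, one quick (optional) test is to import TensorFlow and print its version:

```
python -c "import tensorflow as tf; print(tf.__version__)"
```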
You will also need to install some other packages that you are likely to need:

```
pip install matplotlib
pip install astropy
pip install pandas
pip install scikit-learn
pip install seaborn
```

and you might have others that you want to use, but that should set you up fairly well for deep learning.
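Finally, as a sanity check of the whole setup, here is a minimal example (a sketch for illustration only; the network size, the random data and all training settings are arbitrary choices of mine) that builds and trains a tiny keras model on random numbers, exercising both keras and the TensorFlow backend:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Random stand-in data: 100 samples, 10 features, binary labels
X = np.random.random((100, 10))
y = np.random.randint(2, size=100)

# A tiny fully-connected network, just to exercise the installation
model = Sequential()
model.add(Dense(16, activation='relu', input_shape=(10,)))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

# Train for a couple of epochs; this should only take seconds
model.fit(X, y, epochs=2, batch_size=16)
```

If this runs through without errors, you are set up for the deep learning problem class.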