Hello everyone,
I'm João Almeida, I'm a Software Engineer at CERN, and I'm by no means an expert in Machine Learning. I took a few courses in college, attended some talks, read a lot, did a few projects...
There won't be any slides, I am going to show you a lot of code, some math and some plots.
We won't focus on the theoretical details behind the ML techniques, but on their intuition and on how to apply them. We will do this with the help of Python and scikit-learn, a very nice Python ML library.
ML is a set of techniques used to teach computers to learn from data.
Computers learn from examples and experience instead of following hard coded rules.
The learning task can be varied:
- Grouping similar examples together (clustering);
- Predicting a continuous value for a new example (regression): a typical example is predicting the price of a house based on its characteristics;
- Finding which examples have unexpected characteristics (anomaly detection): one example is fraud detection in online purchases;
- Taking new examples and assigning labels to them (classification), for instance taking a picture of a fruit and saying whether it is an apple or a banana;
- Predicting the next value in a time series, for instance predicting stock prices.
4 min
There are two distinct groups of ML applications: supervised learning and unsupervised learning. The distinction is whether you have data telling the model what the correct output should be.
For instance when doing classification or regression you have data on what is supposed to be the output of your model for each example.
Clustering is the typical example of unsupervised learning: you have a bunch of data and want to extract some knowledge from it, but you don't know exactly what. Clustering techniques let you find clusters, "groups" of examples that are somehow similar.
This is an environment that allows you to run some python code in blocks and see the results right away.
Very similar to Matlab cells, if you have ever worked with them.
... doing some imports necessary for later
A very well known regression dataset, where we have a bunch of house features and want to predict the price of that house.
It's available inside scikit learn.
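A sketch of this loading step. One caveat: the Boston housing loader (`load_boston`) the talk relied on was removed in scikit-learn 1.2, so this shows the same pattern with the bundled diabetes regression dataset instead.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Any bundled regression dataset follows the same structure:
# .data (features), .target (value to predict), .feature_names
data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

print(df.shape)   # 442 examples, 10 features plus the target column
print(df.head())
```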
Load dataset
Here's some info about the dataset;
Show dataframe:
Here I'm using Pandas, another very cool python library to take a look at the dataset.
We can see a few examples with their features and the price of each house.
A dataset is a set of examples used to train a Machine Learning model. An example contains information about an object or event; the example is represented by its features.
I think this is more understandable with examples.
Show dataframe and explain what are features, examples and labels
One very important step when doing machine learning is to understand the data and how the features relate to each other and to the target.
To try to understand these relations we can plot the features against the price. Let's see some plots
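Alongside the plots, a quick numeric check of the same idea: correlating each feature with the target. This is a sketch on the bundled diabetes dataset (the talk's housing loader is no longer shipped with scikit-learn).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes

data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Pearson correlation of each feature with the target,
# listed from strongest to weakest in absolute value
corr = df.corr()["target"].drop("target")
print(corr.reindex(corr.abs().sort_values(ascending=False).index))
```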
These plots show clearly that some features have a much higher correlation with the price of the house than others: compare, for instance, the average number of rooms with the per capita crime rate.
However all plotted features seem to be somewhat correlated with the price.
Until here 10 min
Who has heard of it? who has used it? maybe with Excel?
We are in a higher dimension now, so we can't easily visualize this model; we have to rely on metrics to estimate the model's performance.
We can also look at some examples and see how the model is performing.
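Roughly what the notebook does at this point, sketched on the bundled diabetes dataset: fit a linear regression on a training split and score it on held-out data with standard metrics.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)        # learn the weights from the training data
pred = model.predict(X_test)       # predict prices for unseen examples

print("MSE:", mean_squared_error(y_test, pred))
print("R^2:", r2_score(y_test, pred))
```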
Now let's look at a classification task before delving into how linear regression works.
Until here ~ 13-15 min
This is a very common classification dataset, it's small and so is available inside scikit learn.
Irises are a genus of flowering plants, and in this dataset we have examples from 3 different species.
LOAD dataset. These are the features and the 3 class labels.
Look at dataframe (1min)
Here we can see the examples, the features and the labels.
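The loading step, sketched the same way as before; the Iris dataset ships with scikit-learn, so this is self-contained.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# map numeric class codes (0, 1, 2) to the species names
df["species"] = [iris.target_names[t] for t in iris.target]

print(df.head())
print(df["species"].value_counts())   # 50 examples of each species
```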
Is everyone understanding?
Plot data (20 s)
We can see here all the examples colored by class.
To start slowly we will focus only on the Iris Setosa and try only to classify each example as belonging to that species or not.
New plot (40 s)
Let's look at the plot again, now with only two classes.
Notes: Looking at this plot, if I had a new flower at (4.5, 4.0), what kind of Iris would you predict it is? And here, (7.0, 3.5)? And here, (7.0, 4.0)?
That's exactly what a machine learning algorithm does: it uses the available data to make predictions. Sometimes it gets them right, other times it fails.
It's called logistic regression but it is used for classification; there is a reason behind this, but it's a long story. In short, naming things in computer science is hard.
Run the algorithm
As you can see, the model follows exactly the same API, and this is true for all scikit-learn models, which makes it very easy to use and to swap the model we are working with.
Here the accuracy is the percentage of examples we classified correctly.
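A sketch of the binary task described above: is this flower an Iris setosa or not? I'm assuming the two sepal measurements are the features used in the plots, so the decision boundary can be drawn in 2-D.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, :2]                 # sepal length, sepal width
y = (iris.target == 0).astype(int)   # 1 = setosa, 0 = everything else

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Same fit/predict/score API as LinearRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))  # fraction classified correctly
```

Setosa happens to be linearly separable from the other two species on these features, so the accuracy comes out very high.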
We can take a look at what the model is doing:
plot data with decision boundary
As you can see, it draws a linear decision boundary: all examples on one side are classified as one class, and those on the other side as the other.
The output is a linear combination of the features.
Each feature has a weight, which can be positive or negative, and the sum of the products between weights and features is the output of the model.
Here w_0 is the offset; if we create a 'feature' x_0 with value 1, we get a vectorized version.
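Written out, with the constant feature x_0 = 1 absorbing the offset:

```latex
\hat{y} = w_0 + w_1 x_1 + \dots + w_n x_n
        = \sum_{i=0}^{n} w_i x_i
        = \mathbf{w}^\top \mathbf{x},
\qquad x_0 = 1
```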
Now that we understand how the model makes its predictions, how do we fit the model to the data? How do we find the weights that minimize the error?
Usually we use a method called Ordinary Least Squares, where essentially we minimize the sum of the squared error over all the data points.
It's an optimization problem we want to find the weights that minimize the error over all the examples we have for training.
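A minimal sketch of that optimization on synthetic data: for least squares the optimal weights have a closed form, the normal equations w = (XᵀX)⁻¹Xᵀy, so we can recover known weights directly with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X = np.hstack([np.ones((100, 1)), X])   # x_0 = 1 column for the offset w_0
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)   # data = model + noise

# Normal equations: solve (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)   # close to the true weights [1, 2, -3]
```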
We want to use the same linear model but now build a classifier.
We want this classifier to take a continuous value, the output of the linear regression, and turn it into a label. Today we will focus only on binary labels.
For that we use the logistic/sigmoid function.
It has some interesting properties:
- monotonic
- continuous
- limited between 0 and 1
plot sigmoid
Now that we have covered the basics of machine learning, let's take the knowledge we gained and try to apply it to a real-world dataset.
First Look at Scotch data: 5 min
TODO how to connect this?
For instance, let's imagine you're trying to classify different types of fruit. Do you think having more features would improve our model? For instance, the name of the person that picked the fruit? Or their age? Or whether they are vegetarian? In theory, if you added these features the model should just ignore them; in practice, irrelevant features can actually hurt the model.
Curse of Dimensionality: 3 min
Feature selection: throwing away the features with less useful information.
There are techniques to do this in an automated way, but we can essentially inspect the data and select the features that have a stronger correlation with what we are trying to predict.
Feature Extraction: feature selection is not optimal, though, since even features with a weak correlation might carry useful information for our model. Feature extraction techniques instead combine the existing features into new ones, in practice transforming the data into a lower dimension while maintaining most of its information.
Principal components are the directions of largest variance
The Eigenvectors of covariance matrix with the largest eigenvalues are the principal components
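A sketch checking that statement numerically: the components scikit-learn's PCA finds match the top eigenvectors of the covariance matrix (up to sign). This uses the Iris data already loaded in the talk.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
Xc = X - X.mean(axis=0)                 # PCA works on centered data

cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
# take the eigenvectors of the two largest eigenvalues
top = eigvecs[:, np.argsort(eigvals)[::-1][:2]].T

pca = PCA(n_components=2).fit(X)
# same directions, up to a sign flip
print(np.abs(top))
print(np.abs(pca.components_))
```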
There are two classes, and it seems easy to draw a line between them.
I've also plotted the principal components of the data, the directions in which it varies the most.
If we apply PCA and keep the same number of features, we are in fact just performing a rotation.
However, if we reduce the number of components, what we are in fact doing is projecting the data onto a lower-dimensional space,
resulting in this plot, where, as you can see, most of the information was maintained and we can still draw the line that separates the classes.
PCA: 3 min
There is another problem we might face when working with high-dimensionality datasets: we end up with a more complex model, which in turn has a higher chance of overfitting.
Overfitting is what happens when your model, instead of learning the system that produces your data, learns the noise in your data.
This causes the model to perform worse than expected when tested on new data.
Let's try to visualize why a complex model leads to Overfitting.
Imagine we want to model a sinusoidal function, we take some samples from this function and try to fit a polynomial to it.
Looking at the code,
- we have here the function we want to model
- we sample it
- And then fit 4 different polynomials with 1, 3, 6 and 10 degrees
Looking at the resulting plots we can see that
- the first two polynomials are not flexible enough to model the system; this is called underfitting;
- the degree-6 polynomial is not perfect, but I would say it fits the system pretty well in the regions where we have data;
- the degree-10 polynomial, on the other hand, despite being the one that passes closest to the data points, is clearly very far from the system we are modelling.
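The experiment above, sketched in a few lines (plots omitted; the sample size and noise level are my assumptions, not the notebook's exact values). We fit each polynomial to noisy samples and then measure its error against the true, noiseless sine.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 2 * np.pi, 12))      # a few noisy samples
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

x_fine = np.linspace(0, 2 * np.pi, 200)
errs = {}
for degree in (1, 3, 6, 10):
    coeffs = np.polyfit(x, y, degree)           # least-squares polynomial fit
    # error against the system we are actually modelling, not the samples
    errs[degree] = np.mean((np.polyval(coeffs, x_fine) - np.sin(x_fine)) ** 2)
    print(degree, round(errs[degree], 4))
```

The low-degree fits have a large error against the true function (underfitting), while the degree-10 fit hugs the noisy samples instead of the sine.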
Model Complexity: 2 min
Applying PCA to the scotch data, we are able to reduce the dimensionality from 11 features down to just 2.
We can see that we still retain about 50% of the variance with just 2 principal components.
And we can now plot the data and visualize it.
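The scotch dataset isn't bundled with scikit-learn, so here is the same pattern sketched on the bundled wine dataset (13 chemical features): reduce to 2 principal components and check how much variance survives.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_wine().data
X_scaled = StandardScaler().fit_transform(X)   # PCA is scale-sensitive

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)             # project onto the top 2 PCs

print(X_2d.shape)                              # now only 2 features per example
print(pca.explained_variance_ratio_.sum())     # fraction of variance retained
```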
PCA to Scotch: 1 min
Predicting Tobacco: 5 min
Cross validation: 2 min
I've been lying to you: I've been hiding most of the problems you might face when working with machine learning. However, the goal of this talk was to make you interested in it, not to scare you away.
A nearest-neighbours classifier does not attempt to construct a general internal model; it simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbours of each point: a query point is assigned the class which has the most representatives within the nearest neighbours of the point.
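That idea in code, on the Iris data used earlier; note the same fit/predict/score API once more.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)   # "fitting" just stores the training points
# each prediction is a majority vote among the 5 closest stored points
print("accuracy:", knn.score(X_test, y_test))
```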