What's this presentation going to be about? (1 min)

Hello everyone,

Who am I

I'm João Almeida, I'm a Software Engineer at CERN and I'm by no means an expert in Machine Learning. I took a few courses in college, attended some talks, read a lot, and did a few projects...

This presentation's format:

There won't be any slides, I am going to show you a lot of code, some math and some plots.

We won't focus on the theoretical details behind the ML techniques, but on their intuition and on how to apply them. We will do this with the help of Python and scikit-learn, a very nice Python ML library.


What is Machine Learning? (1 min)

ML is a set of techniques used to teach computers to learn from data.

Computers learn from examples and experience instead of following hard coded rules.

The learning task can be varied:

  • Grouping similar examples together (clustering);
  • Predicting a continuous value for a new example (regression): a typical example is predicting the price of a house based on its characteristics;
  • Finding examples with unexpected characteristics (anomaly detection), for instance fraud detection in online purchases;
  • Taking new examples and assigning labels to them (classification), for instance taking a picture of a fruit and saying whether it is an apple or a banana;
  • Predicting the next value in a time series, for instance predicting stock prices.

4 min

Supervised vs Unsupervised Learning (1min)

There are two distinct groups of ML applications: supervised learning and unsupervised learning. The distinction is whether you have data telling the model what it is supposed to learn.

Supervised:

For instance, when doing classification or regression, you have data on what the output of your model is supposed to be for each example.

Unsupervised Learning

Clustering is the typical example: you have a bunch of data and want to try to extract some knowledge from it, but you don't know exactly what. Clustering techniques allow you to find clusters, i.e. "groups" of examples that are somehow similar.


4 min

Let's look at some code

The Jupyter Notebook (1 min)

This is an environment that allows you to run some python code in blocks and see the results right away.

Very similar to MATLAB cells, if you have ever worked with them.

... doing some imports necessary for later
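A minimal sketch of the kind of imports assumed later (the exact list in the original notebook may differ):

```python
# Libraries used throughout the notebook (sketch, not the exact original cell)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
```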


Boston House Prices

A very well known regression dataset, where we have a bunch of house features and want to predict the price of that house.

It's available inside scikit-learn.

Load dataset

Here's some info about the dataset.

Show dataframe:

Here I'm using pandas, another very cool Python library, to take a look at the dataset.

We can see a few examples with their features and the price of each house.
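A minimal sketch of this step (note that `load_boston` was removed from scikit-learn 1.2 onward, so recent versions need to fetch the data another way, e.g. from OpenML):

```python
# Load the Boston house prices dataset and look at it with pandas (sketch)
import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn >= 1.2

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df["PRICE"] = boston.target   # the value we want to predict

print(boston.DESCR)           # info about the dataset
df.head()                     # a few examples with their features and price
```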

The data (20 s)

A dataset is a set of examples used to train a Machine Learning model

An example contains information about an object or event.

The example is represented by its features.

I think this is more understandable with examples.

Let's see a real example:

Show dataframe and explain what are features, examples and labels


One very important step when doing machine learning is to understand the data: how the features relate to each other and to what we are trying to predict.

To try to understand these relations we can plot the features against the price. Let's see some plots

These plots clearly show that some features have a much higher correlation with the price of the house than others, for instance the average number of rooms versus the per capita crime rate.

However all plotted features seem to be somewhat correlated with the price.
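A minimal sketch of these plots, assuming `df` is the DataFrame built above (the feature selection here is mine):

```python
# Scatter plots of a few features against the house price (sketch)
import matplotlib.pyplot as plt

features = ["RM", "CRIM", "LSTAT"]  # avg rooms, per capita crime, % lower status
fig, axes = plt.subplots(1, len(features), figsize=(12, 4))
for ax, feature in zip(axes, features):
    ax.scatter(df[feature], df["PRICE"], s=10)
    ax.set_xlabel(feature)
    ax.set_ylabel("PRICE")
plt.tight_layout()
plt.show()
```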


Until here 10 min


Using Linear Regression:

Who has heard of it? Who has used it? Maybe with Excel?

Visualizing the resulting model:

We are now in a higher dimension, so we can't easily visualize this model; we have to rely on metrics to estimate the model's performance.

We can also look at some examples and see how the model is performing.
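A minimal sketch of this cell, fitting a linear regression and looking at a couple of standard metrics (the exact split and metrics in the notebook may differ):

```python
# Fit a linear regression on the Boston data and evaluate it (sketch)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.25, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, predictions))
print("R^2:", r2_score(y_test, predictions))

# Look at a few individual predictions next to the true prices
print(list(zip(predictions[:5].round(1), y_test[:5])))
```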

Now let's look at a classification task before delving into how linear regression works.


Until here ~ 13-15 min


The Iris dataset (1 min)

This is a very common classification dataset; it's small, and it is included inside scikit-learn.

Iris is a genus of flowering plants, and in this dataset we have examples from 3 different species.

Load dataset. These are the features and the 3 class labels.

Look at dataframe (1min)

Here we can see the examples, the features and the labels.
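A minimal sketch of loading the dataset and viewing it as a DataFrame:

```python
# Load the Iris dataset and look at examples, features and labels (sketch)
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["species"] = [iris.target_names[t] for t in iris.target]
iris_df.head()
```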

Is everyone understanding?

Plot data (20 s)

We can see here all the examples colored by class.

To start slowly we will focus only on the Iris Setosa and try only to classify each example as belonging to that species or not.

New plot (40 s)

Let's look at the plot again, now with only two classes.

Notes: Looking at this plot, if I have a new flower at (4.5, 4.0), what kind of Iris would you predict it is? And here (7.0, 3.5)? And here (7.0, 4.0)?

That's exactly what a machine learning algorithm does: it uses the available data to make predictions. Sometimes it gets it right, other times it fails.


Logistic Regression:

It's called logistic regression, but it is used for classification. There is a reason behind this, but it's a long story. In short, naming things in Computer Science is hard.

Run the algorithm

As you can see, the model follows exactly the same API, and this is true for all scikit-learn models, which makes it very easy to use them and to swap the model we are working with.

Here the accuracy is the % of examples we classified correctly.

We can take a look at what the model is doing:

plot data with decision boundary

As you can see, it draws a linear decision boundary: all examples on one side are classified as one class, and those on the other side as the other.
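A minimal sketch of this step, using only the two sepal features so the decision boundary can be drawn (the plotting details are mine):

```python
# Logistic regression on the binary "setosa or not" task (sketch)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

X = iris.data[:, :2]                 # sepal length and sepal width
y = (iris.target == 0).astype(int)   # 1 if Iris setosa, 0 otherwise

clf = LogisticRegression()
clf.fit(X, y)
print("Accuracy:", clf.score(X, y))  # % of examples classified correctly

# Plot the data and the linear decision boundary w0*x1 + w1*x2 + b = 0
w = clf.coef_[0]
b = clf.intercept_[0]
xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", s=15)
plt.plot(xs, -(b + w[0] * xs) / w[1], "k--")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()
```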


Linear Regression and Logistic Regression

Linear Regression

The output is a Linear combination of the features

Each feature has a weight, which can be positive or negative, and the sum of the products between weights and features is the output of the model.

Here w_0 is the offset; if we create a 'feature' x_0 with a constant value of 1, we get a vectorized version.
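Written out (my notation, not necessarily the notebook's):

$$\hat{y} = w_0 + w_1 x_1 + \dots + w_n x_n = \sum_{i=0}^{n} w_i x_i = \mathbf{w}^\top \mathbf{x}, \qquad x_0 = 1$$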

How to find the Weights

Now that we understand how the model makes its predictions, how do we fit the model to the data? How do we find the weights that minimize the error?

Usually we use a method called Ordinary least squares where essentially we minimize the sum of the squared error for each datapoint.

It's an optimization problem we want to find the weights that minimize the error over all the examples we have for training.
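In formula form (again my notation), ordinary least squares finds the weights that minimize the sum of squared errors over the m training examples:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{j=1}^{m} \left(y^{(j)} - \mathbf{w}^\top \mathbf{x}^{(j)}\right)^2$$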

Logistic Regression

We want to use the same linear model but now build a classifier.

We want this classifier to take a continuous value, the output of the linear regression, and turn it into a label. Today we will focus only on binary labels.

For that we use the logistic/sigmoid function.

It has some interesting properties:

  • monotonic
  • continuous
  • bounded between 0 and 1

Sigmoid Function

plot sigmoid
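A minimal sketch of this plot:

```python
# The logistic/sigmoid function: sigma(z) = 1 / (1 + exp(-z)) (sketch)
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-10, 10, 200)
sigmoid = 1 / (1 + np.exp(-z))

plt.plot(z, sigmoid)
plt.xlabel("z")
plt.ylabel("sigma(z)")
plt.show()
```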


Now that we have covered the basics of machine learning let's play with a real world dataset.

Let's play with a real world dataset

Let's take the knowledge we gained and try to apply it to a real world dataset.

Look at Data

Look at the feature Histograms

Look at how they are related to each other
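A minimal sketch of this exploration, assuming the scotch data lives in a CSV file (the actual file name and columns in the notebook may differ):

```python
# First look at the scotch data (sketch; "scotch.csv" is a hypothetical file name)
import pandas as pd

scotch = pd.read_csv("scotch.csv")
scotch.head()                                          # look at the data
scotch.hist(figsize=(12, 10))                          # feature histograms
pd.plotting.scatter_matrix(scotch, figsize=(12, 12))   # how features relate to each other
```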

First Look at Scotch data: 5 min


TODO: how to connect this?

Small detour

Curse of Dimensionality

More features != Better data

For instance, let's imagine you are trying to classify different types of fruit. Do you think having more features would always improve our model? For instance, the name of the person that picked the fruit? Or their age? Or whether they are vegetarian? In theory, if you added these features the model should just ignore them. However, to learn that it can ignore them the model needs a lot more data: the amount of data required to cover the feature space grows very quickly with the number of features.

Curse of Dimensionality: 3 min


Feature Selection and Extraction

Feature selection: throwing away the features with the least useful information.

There are techniques to do this in an automated way, but we can essentially inspect the data and select the features that have a stronger correlation with what we are trying to predict.

Feature extraction: feature selection, however, is not optimal, since even features with a weak correlation might contain useful information for our model. Feature extraction techniques instead combine the original features into new ones, in practice transforming the data into a lower-dimensional space while keeping most of its information.

PCA

Principal components are the directions of largest variance

The eigenvectors of the covariance matrix with the largest eigenvalues are the principal components.

code:
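A minimal sketch of what this cell could look like; the data here is synthetic (not the notebook's), generated just to show the principal components and the projection:

```python
# PCA on two synthetic Gaussian classes (sketch)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
class_a = rng.multivariate_normal([0, 0], [[3, 1], [1, 1]], size=100)
class_b = rng.multivariate_normal([4, 3], [[3, 1], [1, 1]], size=100)
X = np.vstack([class_a, class_b])
y = np.array([0] * 100 + [1] * 100)

pca = PCA(n_components=2)
pca.fit(X)

# Plot the data and the principal components (directions of largest variance)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", s=15)
for length, vector in zip(pca.explained_variance_, pca.components_):
    plt.quiver(*pca.mean_, *(vector * np.sqrt(length) * 2),
               angles="xy", scale_units="xy", scale=1)
plt.axis("equal")
plt.show()

# Keeping both components is just a rotation; keeping only one projects the
# data onto a 1D space while preserving most of the variance.
X_1d = PCA(n_components=1).fit_transform(X)
```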

Two classes, and it seems easy to draw a line between them.

I have also plotted the principal components of the data, the directions in which it varies the most.

If we apply PCA and keep the same number of components, we are in fact just performing a rotation.

However, if we reduce the number of components, what we are in fact doing is projecting the data onto a lower-dimensional space,

resulting in this plot where, as you can see, most of the information was maintained and we can still draw the line that separates the classes.

PCA: 3 min


Model Complexity

There is another problem that we might face when working with high-dimensional datasets: we will have a more complex model and in turn a higher chance of suffering from overfitting.

Overfitting is what happens when your model instead of learning the system that produces your data learns the noise in your data.

This causes the model to perform worse than expected when tested on new data.

Let's try to visualize why a complex model leads to Overfitting.

code:

Imagine we want to model a sinusoidal function: we take some samples from this function and try to fit a polynomial to them.

Looking at the code,

  • we have here the function we want to model
  • we sample it
  • and then fit 4 different polynomials, with degrees 1, 3, 6 and 10 (a sketch of this code follows below)
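A minimal sketch, assuming a sine curve as the true system (the notebook's exact function, noise level and sample count may differ):

```python
# Fit polynomials of increasing degree to noisy samples of a sine curve (sketch)
import numpy as np
import matplotlib.pyplot as plt

true_function = np.sin                     # the system we want to model
rng = np.random.RandomState(1)
x_samples = rng.uniform(0, 2 * np.pi, 15)  # a few noisy samples of the system
y_samples = true_function(x_samples) + rng.normal(scale=0.1, size=x_samples.shape)

x_plot = np.linspace(0, 2 * np.pi, 200)
for degree in [1, 3, 6, 10]:
    coeffs = np.polyfit(x_samples, y_samples, degree)      # fit a polynomial
    plt.plot(x_plot, np.polyval(coeffs, x_plot), label=f"degree {degree}")
plt.plot(x_plot, true_function(x_plot), "k--", label="true function")
plt.scatter(x_samples, y_samples, color="black", zorder=3)
plt.ylim(-2, 2)
plt.legend()
plt.show()
```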

Looking at the resulting plots we can see that

  • the first two polynomials are not flexible enough to model the system; this is called underfitting
  • the degree-6 polynomial is not perfect, but I would say it fits the system pretty well in the regions where we have data
  • the degree-10 polynomial, on the other hand, despite being the one that passes closest to the data points, is clearly very far from the system we are modeling; this is overfitting

Model Complexity: 2 min


Back to Scotch

PCA to Scotch

Applying PCA to the scotch data, we are able to reduce the dimensionality from 11 features down to only 2.

We can see that we still keep about 50% of the variance with just 2 principal components.

And we can now plot the data and visualize it.
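A minimal sketch, assuming `scotch` is the DataFrame loaded earlier and that its numeric columns are the 11 features being reduced:

```python
# Reduce the scotch data to 2 principal components and plot it (sketch)
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

features = scotch.select_dtypes("number")        # the numeric feature columns
pca = PCA(n_components=2)
scotch_2d = pca.fit_transform(features)

print("Variance kept:", pca.explained_variance_ratio_.sum())
plt.scatter(scotch_2d[:, 0], scotch_2d[:, 1], s=15)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```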

PCA to Scotch: 1 min


Predicting Tobacco

Unbalanced dataset

Confusion Matrices
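A minimal sketch of computing a confusion matrix; the "TOBACCO" label column and the choice of classifier are assumptions, not the notebook's actual code:

```python
# Confusion matrix for an unbalanced binary classification task (sketch)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X = scotch.select_dtypes("number").drop(columns=["TOBACCO"])  # "TOBACCO" is hypothetical
y = scotch["TOBACCO"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Rows are true classes, columns are predicted classes; with an unbalanced
# dataset, accuracy alone hides how badly the minority class is predicted.
print(confusion_matrix(y_test, clf.predict(X_test)))
```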

Predicting Tobacco: 5 min


Cross validation
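A minimal sketch of k-fold cross validation with scikit-learn; the model and data here reuse the earlier Iris example rather than the notebook's exact setup:

```python
# Estimate model performance with 5-fold cross validation (sketch)
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

iris = load_iris()
scores = cross_val_score(LogisticRegression(max_iter=1000), iris.data, iris.target, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # a more robust estimate than a single train/test split
```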

Cross validation: 2 min


Finishing remarks

I've been lying to you: I've been hiding most of the problems you might face when working with machine learning. However, the goal of this talk was to make you interested in it, not to scare you away.

THE END


Overfitting

KNN

It does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.
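A minimal sketch, again using the Iris data for illustration:

```python
# k-nearest neighbours classification on the Iris data (sketch)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

# No model is really "built": the training data is stored, and each query point
# is assigned the majority class among its k nearest stored neighbours.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Accuracy:", knn.score(X_test, y_test))
```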

3 min