Machine Learning - CS50AI Lecture 4

Machine learning is the study of computer algorithms that improve automatically through experience.

Supervised Learning:

Given a data set of input-output pairs, learn a function to map inputs to outputs.

Classification:

Supervised learning task of learning a function mapping an input point to a discrete category. For example, if we wanted to figure out whether it is going to rain tomorrow, the computer should be able to draw reasonable conclusions based on past data. We have already seen how to approach similar problems using probability, but we won't always have definite probabilities to work with; machine learning algorithms focus instead on learning from historical data. An example of historical data could be the following:
| Date      | Humidity | Pressure | Rain    |
|-----------|----------|----------|---------|
| January 1 | 93%      | 999.7    | Rain    |
| January 2 | 49%      | 1015.5   | No Rain |
| January 3 | 79%      | 1031.1   | No Rain |
| January 4 | 65%      | 984.9    | Rain    |
| January 5 | 90%      | 975.2    | Rain    |

What we would like to do is write a function f that can predict rain or no rain from a given input (humidity, pressure).

f(humidity, pressure)

PREDICT:

f(93, 999.7) = Rain
f(49, 1015.5) = No Rain
f(79, 1031.1) = No Rain

Since we don't know exactly how f works, we approximate it with a hypothesis function h. Comparing h's predictions against the known outputs of f on historical data tells us how reliable h is.

2D Graphs:

One approach we can use to predict, in our case, rain or no rain would be to graph the data points. With humidity on the x axis and pressure on the y axis, we can first graph all the data points where it did rain.

*(figure: data points where it rained)*

Then, we can do the same for all the data points where it didn't rain.

*(figure: data points where it did not rain)*

Nearest-Neighbor Classification:

Algorithm that, given an input, chooses the class of the nearest data point to that input. Like most algorithms, it has its downsides: when outliers exist in a data set, it won't always choose the most appropriate class. In the case below, we humans would probably agree that, given the humidity and pressure of the white dot, it will rain. The computer, however, will choose the closest data point, in this case, not raining.

*(figure: nearest-neighbor classification misled by a nearby outlier)*

k-Nearest-Neighbor Classification:

Algorithm that, given an input, chooses the most common class out of the k nearest data points to that input (see the sketch after this list). This is a better approach because it considers a neighborhood of data points rather than reaching a conclusion from a single neighbor. However, this algorithm also has a few downsides:

  • It can be slow to go through every neighbor and measure distance, though there are optimizations:
    • Pruning the data
    • Using efficient data structures
    • Selecting only relevant data
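
As an illustration, here is a minimal Python sketch of k-nearest-neighbor classification (not from the lecture; the data and names are just examples):

```python
from collections import Counter
import math

def k_nearest_neighbors(training_data, point, k=3):
    """Classify `point` by majority vote among its k nearest neighbors.

    training_data: list of ((humidity, pressure), label) pairs.
    point: (humidity, pressure) tuple to classify.
    """
    # Sort training points by Euclidean distance to the query point.
    by_distance = sorted(training_data,
                         key=lambda item: math.dist(item[0], point))
    # Take the labels of the k closest points; return the most common one.
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

data = [((93, 999.7), "Rain"), ((49, 1015.5), "No Rain"),
        ((79, 1031.1), "No Rain"), ((65, 984.9), "Rain"),
        ((90, 975.2), "Rain")]
print(k_nearest_neighbors(data, (85, 990.0)))  # "Rain"
```

With k = 1 this reduces to plain nearest-neighbor classification; larger k makes the vote more robust to outliers.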

Data Boundaries:

Another way to approach classification is to look at all the data and somehow come up with a decision boundary. In the case of two dimensions, we can draw a line that separates the rainy days from the non-rainy days.

linearRegression

Obviously, data sets will never be this clean. There are usually outliers and other factors that will pollute the data. We still want to be able to draw boundaries but these boundaries may not always be linear.

x₁ = Humidity
x₂ = Pressure

h(x₁, x₂) = 1 if w₀ + w₁x₁ + w₂x₂ ≥ 0, and 0 otherwise

Where:

  • w's are weights that determine the boundary
  • 1 : Rain
  • 0 : No Rain

The dot product of the following vectors represents the same idea:

Weight Vector w: (w₀, w₁, w₂)
Input Vector x: (1, x₁, x₂)
w · x = w₀ + w₁x₁ + w₂x₂

Then, we can simplify the hypothesis function to:

h_w(x) = 1 if w · x ≥ 0, and 0 otherwise
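
As a quick sketch (illustrative, not the lecture's code), this hypothesis function translates directly into Python:

```python
def hypothesis(weights, x):
    """h_w(x): output 1 if w . x >= 0, and 0 otherwise.

    weights: (w0, w1, w2); x: (x1, x2). A leading 1 is prepended to x
    so that w0 acts as the bias term in the dot product.
    """
    inputs = (1,) + tuple(x)
    dot = sum(w * xi for w, xi in zip(weights, inputs))
    return 1 if dot >= 0 else 0  # 1 = Rain, 0 = No Rain
```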

Now, the big question is how to determine the values of the weights. For this, we can use a technique known as the perceptron learning rule.

Perceptron Learning Rule:

Given data point (x, y), update each weight according to:

wᵢ ← wᵢ + α(y − h_w(x)) × xᵢ

where α is the learning rate. In plain terms, this formula states:

If the actual value is equal to the value we predicted, the right side of the expression equals 0 and the weight doesn't change. If the actual value is greater than our predicted value (actual − estimate > 0), we increase the weight so the value of the dot product goes up. If the actual value is lower than the predicted value (actual − estimate < 0), we decrease the weight so the value of the dot product goes down. This is one way we can teach the computer to "learn" and choose the appropriate boundaries.
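
A minimal sketch of one update step, reusing the `hypothesis` function from the sketch above (the learning rate `alpha` is an assumed parameter controlling the size of each adjustment):

```python
def perceptron_update(weights, x, y, alpha=0.1):
    """One perceptron learning step on data point (x, y).

    Applies w_i <- w_i + alpha * (y - h_w(x)) * x_i to every weight.
    If the prediction was correct, the error term is 0 and nothing changes.
    """
    inputs = (1,) + tuple(x)            # prepend 1 so w0 gets updated too
    error = y - hypothesis(weights, x)  # actual minus estimate
    return [w + alpha * error * xi for w, xi in zip(weights, inputs)]
```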

Hard & Soft Threshold:

One problem with our function is that it always outputs a definite rain or no rain. Sometimes that is what we want, but for a data point close to the boundary, such a definite conclusion may not make sense. Looking at the image below, even if the value of the dot product is barely to the right of the vertical line (the threshold), we still conclude a value of 1 with full confidence.

Hard Threshold:
*(figure: step function)*

To represent this more logically, we can use soft thresholds instead.

Soft Threshold:
*(figure: sigmoid function)*

Now, as the output of the dot product gets larger, we can conclude the outcome with more certainty. In other words, the output isn't restricted to 1 or 0, which allows us to express likelihood and probability.
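
A soft threshold can be sketched with the logistic (sigmoid) function, one common choice:

```python
import math

def soft_threshold(weights, x):
    """Map w . x to a value in (0, 1) that can be read as the
    probability of rain, rather than a hard 0 or 1."""
    inputs = (1,) + tuple(x)
    dot = sum(w * xi for w, xi in zip(weights, inputs))
    return 1 / (1 + math.exp(-dot))
```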

Other Classification Functions:

  • ReLU (rectified linear unit)

Summary:

Given some vector of weights that have been determined by past data, our hypothesis function will determine which class an input belongs to. The problem, then, is to find a vector of weights that is representative of our data. At first, we can set the weights to random values; as we run through the training data, the weights become more and more accurate. Updating the weights uses our hypothesis function, which, in turn, uses the weights.

  • Perceptron can only be used when there are at most 2 classes (e.g. rain/no rain).
  • Perceptron doesn't care how close the boundary is to the data (this can lead to lower precision/accuracy).

Support Vector Machines:

Maximum Margin Separator:

Boundary that maximizes the distance between itself and any of the data points. Without this idea, drawing boundaries can be rather arbitrary: there is rarely just one way to draw a line between the data points, and this ultimately leads to other problems:

  • Data points for which we want to predict the outcome could be really close to both classes.
  • This makes predicting outcomes hard.

To solve this issue, we can draw our line based on the idea of a maximum margin separator that will position the boundary as far away from the data points as possible.

Example:

*(figure: maximum margin separator)*
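
For reference, scikit-learn (not part of the lecture) can fit a maximum margin separator in a few lines; a linear-kernel SVC is a minimal sketch:

```python
from sklearn.svm import SVC

X = [[93, 999.7], [49, 1015.5], [79, 1031.1], [65, 984.9], [90, 975.2]]
y = [1, 0, 0, 1, 1]  # 1 = Rain, 0 = No Rain

# SVC with a linear kernel places the boundary as far as possible
# from the nearest points of each class (the support vectors).
model = SVC(kernel="linear")
model.fit(X, y)
print(model.predict([[85, 990.0]]))
```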

Regression:

Supervised learning task of learning a function mapping an input point to a continuous value. Instead of creating boundaries, we can try to find the line of best fit and use that as a reference to determine the result.

Loss Functions:

Function that expresses how poorly our hypothesis performs.

0-1 Loss Function:

L(actual, predicted) = 0 if actual = predicted, 1 otherwise

L1 Loss Function:

L(actual, predicted) = |actual − predicted|. We can use this to find the line of best fit: the lower the total loss, the more accurate our predictions will be.

*(figure: line of best fit with L1 loss)*

L2 Loss Function:

L(actual, predicted) = (actual − predicted)². This version penalizes outliers more harshly.
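
Both loss functions are one-liners in Python (an illustrative sketch over lists of values):

```python
def l1_loss(actual, predicted):
    """Total absolute error: penalizes all misses proportionally."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))

def l2_loss(actual, predicted):
    """Total squared error: penalizes outliers more harshly."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))
```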

Overfitting:

A model that fits too closely to a particular data set and therefore may fail to generalize to future data.

*(figure: overfitted decision boundary)*

Sometimes, a boundary with some loss on the training data is better than a boundary with no loss at all, because it generalizes better.

Cost:

With the cost function below, we run into the problem of overfitting.

cost(h) = loss(h)

To fix this, we introduce another term in the cost function so that we don't end up drawing a boundary that exactly matches the data. We can add complexity(h), a measure of how complicated our boundary is, giving precedence to simpler hypotheses over complex ones. The new cost function could be written as follows:

cost(h) = loss(h) + complexity(h)

Regularization:

Penalizing hypotheses that are more complex to favor simpler, more general hypotheses.

Building on the above equation, we can control how strongly complexity is penalized by weighting it with a parameter λ:

cost(h) = loss(h) + λcomplexity(h)

With this, we can represent the idea that if λ is larger, then we want to penalize that complexity more but if λ is smaller, then we want to penalize the complexity less.
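
The lecture leaves complexity(h) abstract; as one concrete (assumed) choice, the sum of squared weights is a common complexity measure:

```python
def regularized_cost(loss, weights, lam=0.1):
    """cost(h) = loss(h) + lambda * complexity(h).

    Here complexity is the sum of squared weights, so large weights
    (wiggly boundaries) are penalized; `lam` controls how harshly.
    """
    complexity = sum(w ** 2 for w in weights)
    return loss + lam * complexity
```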

Holdout Cross-Validation:

Splitting data into a training set and a test set, such that learning happens on the training set and is evaluated on the test set.
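
With scikit-learn (an assumed tool, not lecture code), a holdout split is one call:

```python
from sklearn.model_selection import train_test_split

X = [[93, 999.7], [49, 1015.5], [79, 1031.1], [65, 984.9], [90, 975.2]]
y = [1, 0, 0, 1, 1]

# Learn on the training set, evaluate on the held-out test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
```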

k-Fold Cross-Validation:

Splitting data into k sets, and experimenting k times, using each set as a test set once, and using remaining data as training set.
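
Again with scikit-learn as an illustration, KFold yields the k train/test splits:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[93, 999.7], [49, 1015.5], [79, 1031.1],
              [65, 984.9], [90, 975.2]])
y = np.array([1, 0, 0, 1, 1])

# k = 5: five experiments, each sample lands in the test set exactly once.
for train_idx, test_idx in KFold(n_splits=5).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # ...fit a model on the training fold and score it on the test fold
```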

Reinforcement Learning:

Given a set of rewards or punishments, learn what actions to take in the future.

*(figure: reinforcement learning loop between agent and environment)*

Given that the agent is in some state, it takes an action and, in return, receives either a reward or a punishment. It ends up in a new state and repeats the process. To formalize these worlds, we'll use something known as a Markov decision process.

Markov Decision Process:

Model for decision making, representing states, actions, and their rewards. It is similar to a Markov chain, but where a Markov chain's state path is linear and transitions between states are purely probabilistic, this process has more of a tree-like structure in which each state has multiple possible actions. We assign rewards/punishments to each possible action, and those become the determining factor in which action the AI chooses.

Main ideas:

  • Set of states: S
  • Set of actions: ACTIONS(s)
  • Transition model: P(s' | s,a)
  • Reward function: R(s, a, s')

Q-Learning:

Method for learning a function Q(s, a): an estimate of the value (expected reward) of performing action a in state s.

  • Start with Q(s, a) = 0 for all s,a
  • When we take an action and receive a reward:
    • Estimate the value of Q(s, a) based on current reward and expected future rewards
    • Update Q(s, a) to take into account old estimate as well as our new estimate

Approach:

  • Start with Q(s, a) = 0 for all s,a
  • Every time we take an action a in state s and observe a reward r, we update:
    Q(s, a) ← Q(s, a) + α(new value estimate − old value estimate)
    OR:
    Q(s, a) ← Q(s, a) + α((r + future reward estimate) − Q(s, a))
    OR, including a discount factor γ if we want to weight future rewards less than the current reward r:
    Q(s, a) ← Q(s, a) + α((r + γ · max over a′ of Q(s′, a′)) − Q(s, a))
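
The full update fits in one function; this sketch stores Q-values in a dictionary keyed by (state, action) pairs (the representation is an assumption, not prescribed by the lecture):

```python
def q_update(Q, state, action, reward, next_state, next_actions,
             alpha=0.5, gamma=0.9):
    """Q(s, a) <- Q(s, a) + alpha * ((r + gamma * max_a' Q(s', a')) - Q(s, a))

    Q: dict mapping (state, action) to a value estimate (default 0).
    next_actions: the actions available from next_state.
    """
    old = Q.get((state, action), 0)
    # Best estimate of future reward: highest Q-value from the next state.
    future = max((Q.get((next_state, a), 0) for a in next_actions), default=0)
    Q[(state, action)] = old + alpha * ((reward + gamma * future) - old)
```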

Greedy Decision-Making & Explore vs. Exploit:

  • When in state s, choose the action a with the highest Q(s, a).

Although this works, there are downsides to this approach. The path the agent settles on may not be the optimal one: once it finds a path that works, it will keep using that path and won't explore alternatives that might also reach the reward, but faster.

Exploration: Trying other actions that may be faster and lead to larger rewards.
Exploitation: Using the knowledge the AI already has.

ε-Greedy:

Rather than choosing the best-known move all the time, ε-greedy uses a randomness factor ε, similar in spirit to the damping factor d in PageRank:

  • Set ε equal to how often we want to move randomly.
  • With probability 1 − ε, choose the estimated best move.
  • With probability ε, choose a random move.
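
A sketch of ε-greedy action selection, reusing the dictionary representation of Q from the Q-learning sketch above:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon, explore: pick a random action.
    Otherwise exploit: pick the action with the highest Q-value."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0))
```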

Function Approximation:

Approximating Q(s, a), often by a function combining various features, rather than storing one value for every state-action pair.

For games like chess, where the number of state-action possibilities is enormous, it is infeasible to compute and store Q(s, a) values for all of them.

Unsupervised Learning:

Given input data without any additional feedback, learn patterns in the data.

Clustering:

Organizing a set of objects into groups in such a way that similar objects tend to be in the same group.

Clustering Applications:

  • Genetic research
  • Image segmentation
  • Market research
  • Medical imaging
  • Social network analysis

k-Means Clustering:

Algorithm for clustering data based on repeatedly assigning points to clusters and updating those clusters' centers.

Here is an example where we randomly chose 3 cluster centers and assigned points to them based on distance:

*(figures: randomly chosen cluster centers; points assigned to their nearest center)*

Since this clustering algorithm is an iterative process, the centers aren't fixed to where they are now. As we can see, the data in the red and green clusters are not very close together. The next step is to move each cluster center to the average of all the points in that cluster, then re-assign the points to the new clusters, and continue this process until no points change assignment. At that point, we can say the data is reasonably well clustered.

*(figures: cluster centers moved to the cluster averages; points re-assigned)*
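
The whole loop is short; here is an illustrative Python sketch of k-means over points given as coordinate tuples (the names and the iteration cap are assumptions):

```python
import math
import random

def k_means(points, k, max_iterations=100):
    """Repeat: assign each point to its nearest center, then move
    each center to the average of the points assigned to it."""
    centers = random.sample(points, k)  # start from k random points
    for _ in range(max_iterations):
        # Assignment step: group each point with its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the average of its cluster.
        new_centers = [
            tuple(sum(dim) / len(dim) for dim in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # assignments have stabilized
            break
        centers = new_centers
    return centers, clusters
```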