- Linearly Separable SVM <- Data is linearly separable (the simplest case)
- Linear SVM <- Data has a little noise (not entirely linearly separable)
- Gaussian Kernel SVM <- Data is not linearly separable at all (complicated decision boundary)
- Lagrange
- SMO
- Lagrange Duality
- Karush-Kuhn-Tucker (KKT)
Original problem in $(\vec{w}, b)$ $\xRightarrow{\text{Lagrange Multiplier}}$ Dual problem in $\vec{\alpha}$ $\xRightarrow{\text{KKT}}$ Solve for $\vec{\alpha}$ (use SMO here); the solved $\alpha_i$ tell you which points are support vectors
$\Rightarrow$ Find $\vec{w}, b$ (i.e. the separating hyperplane is found)
Training dataset: $T = \{(\vec{x}_1, y_1), (\vec{x}_2, y_2), \dots, (\vec{x}_N, y_N)\}$
$\vec{x}_i$ is the $i$-th feature vector, $y_i \in \{+1, -1\}$ is the labeled class of $\vec{x}_i$
Find a separating hyperplane: $\vec{w} \cdot \vec{x} + b = 0$
Classification function: $f(x) = \mathrm{sign}(\vec{w} \cdot \vec{x} + b)$
Distance between a vector $\vec{x}$ and the separating hyperplane: $\frac{|\vec{w} \cdot \vec{x} + b|}{||\vec{w}||}$
We can represent the correctness of a classification by the sign of $y_i(\vec{w} \cdot \vec{x}_i + b)$:
- If classified correctly => $y_i$ and $(\vec{w} \cdot \vec{x}_i + b)$ have the same sign => positive product
- Else => negative product
Based on the training dataset $T$ and a separating hyperplane $(\vec{w}, b)$, define the functional margin between the hyperplane $(\vec{w}, b)$ and a data point $(\vec{x}_i, y_i)$: $$ \hat{\gamma}_i = y_i(\vec{w} \cdot \vec{x_i} + b)$$ The functional margin of the whole dataset is the smallest one: $\hat{\gamma} = \min_i \hat{\gamma}_i$.
We'd like to find the point closest to the separating hyperplane and make sure it is as far away from the hyperplane as possible.
The geometric margin is the functional margin scaled by the norm of $\vec{w}$: $\gamma_i = \hat{\gamma}_i / ||\vec{w}||$. When $||\vec{w}|| = 1$, the functional margin equals the geometric margin.
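As a quick numeric check (a minimal sketch with made-up numbers, assuming numpy):

```python
import numpy as np

# Hypothetical hyperplane w.x + b = 0 with w = (3, 4), b = -5, and one made-up point
w = np.array([3.0, 4.0])
b = -5.0
x_i = np.array([3.0, 3.0])
y_i = 1  # labeled class of x_i

functional_margin = y_i * (np.dot(w, x_i) + b)             # y_i (w.x_i + b) = 16
geometric_margin = functional_margin / np.linalg.norm(w)   # 16 / 5 = 3.2, the actual distance

# Rescaling (w, b) changes the functional margin but not the geometric one;
# with ||w|| = 1 the two coincide:
w_u, b_u = w / np.linalg.norm(w), b / np.linalg.norm(w)
print(functional_margin, geometric_margin, y_i * (np.dot(w_u, x_i) + b_u))  # 16.0 3.2 3.2
```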
Optimization Problem
$$ \max_{\vec{w}, b} \gamma \\
\ni y_i\left(\frac{\vec{w}}{||\vec{w}||} \cdot \vec{x_i} + \frac{b}{||\vec{w}||}\right) \geq \gamma $$
$$ \max_{\vec{w}, b} \frac{\hat{\gamma}}{||\vec{w}||} \\
\ni y_i(\vec{w} \cdot \vec{x_i} + b) \geq \hat{\gamma} $$
Let $\hat{\gamma} = 1$ (we can always rescale $(\vec{w}, b)$ so the functional margin is $1$ without changing the hyperplane).
And note that maximizing $\frac{1}{||\vec{w}||}$ is the same as minimizing $\frac{1}{2}||\vec{w}||^2$, so the problem becomes
$$ \min_{\vec{w}, b} \frac{1}{2}||\vec{w}||^2 \\
\ni y_i(\vec{w} \cdot \vec{x_i} + b) - 1 \geq 0 $$ This is a convex quadratic programming problem. Assume we have solved it for $\vec{w}^*$ and $b^*$; then we get the maximum-margin hyperplane and the classification function.
That's the Maximum Margin Method
- Construct and solve the constrained optimization problem (a numeric sketch follows this list) $$ \min_{\vec{w}, b} \frac{1}{2}||\vec{w}||^2 \\
\ni y_i(\vec{w} \cdot \vec{x_i} + b) - 1 \geq 0 $$
- Solve it and get the optimal solution $\vec{w}^*$ and $b^*$
- Get the separating hyperplane and classification function
  - Separating hyperplane $$ \vec{w}^* \cdot \vec{x} + b^* = 0 $$
  - Classification function $$ f(x) = \mathrm{sign}(\vec{w}^* \cdot \vec{x} + b^*) $$
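To make these steps concrete, here is a minimal numeric sketch (not from the references) that hands the primal QP to a generic solver. It assumes `cvxopt` and `numpy` are available and uses a tiny made-up 2-D dataset; the decision variables are stacked as $z = (\vec{w}, b)$.

```python
import numpy as np
from cvxopt import matrix, solvers  # assumed dependency; any QP solver would do

# Made-up linearly separable data, labels in {+1, -1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N, d = X.shape

# Objective 1/2 ||w||^2 over z = (w_1, ..., w_d, b)
P = np.zeros((d + 1, d + 1))
P[:d, :d] = np.eye(d)
P[d, d] = 1e-8                      # tiny term on b to keep P numerically positive semidefinite
q = np.zeros(d + 1)

# Constraints y_i (w . x_i + b) >= 1, rewritten as G z <= h
G = -y[:, None] * np.hstack([X, np.ones((N, 1))])
h = -np.ones(N)

z = np.array(solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))['x']).ravel()
w_star, b_star = z[:d], z[d]
print("w* =", w_star, "b* =", b_star)
print("sign(w*.x + b*):", np.sign(X @ w_star + b_star))  # should reproduce y
```

In practice one usually solves the dual instead (derived below), which is what makes the kernel trick and SMO possible.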
Support vectors are vectors that fulfill the condition $$ y_i(\vec{w} \cdot \vec{x}_i + b) - 1 = 0 $$
Consider the two hyperplanes the support vectors lie on: $H_1: \vec{w} \cdot \vec{x} + b = 1$ and $H_2: \vec{w} \cdot \vec{x} + b = -1$.
The distance between the two hyperplanes is called the margin; it equals $\frac{2}{||\vec{w}||}$, so the margin depends only on the normal vector $\vec{w}$.
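Where the $\frac{2}{||\vec{w}||}$ comes from: for any point $\vec{x}_0$ on $H_1$ (so $\vec{w} \cdot \vec{x}_0 + b = 1$), its distance to the separating hyperplane $\vec{w} \cdot \vec{x} + b = 0$ is
$$ \frac{|\vec{w} \cdot \vec{x}_0 + b|}{||\vec{w}||} = \frac{1}{||\vec{w}||} $$
and the same holds for $H_2$ by symmetry, so the margin is $\frac{2}{||\vec{w}||}$; maximizing it is exactly minimizing $||\vec{w}||$.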
Apply Lagrange duality: get the optimal solution of the primal problem by solving the dual problem.
Advantages
- The dual problem is easier to solve
- Introducing a kernel function extends the method to non-linear classification problems (see the kernel sketch below)
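On the kernel point: every inner product $\vec{x}_i \cdot \vec{x}_j$ in the dual can be replaced by a kernel value $K(\vec{x}_i, \vec{x}_j)$. A minimal Gaussian (RBF) kernel, assuming numpy (the bandwidth name `sigma` is just for illustration):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

# In the dual objective, x_i . x_j would simply become gaussian_kernel(x_i, x_j)
print(gaussian_kernel([1.0, 0.0], [0.0, 1.0]))  # exp(-1) ≈ 0.368
```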
For each constraint introduce a Lagrange multiplier $\alpha_i \geq 0$
Lagrange function $$ \mathcal{L}(\vec{w}, b, \vec{\alpha}) = \frac{1}{2} ||\vec{w}||^2 - \sum_{i=1}^N \alpha_i y_i (\vec{w} \cdot \vec{x}_i + b) + \sum_{i=1}^N \alpha_i $$
According to Lagrange duality, the dual problem of the primal problem is a max-min problem: $$ \max_{\vec{\alpha}} \min_{\vec{w}, b} \mathcal{L}(\vec{w}, b, \vec{\alpha}) $$
First solve the inner minimization $\min_{\vec{w}, b} \mathcal{L}(\vec{w}, b, \vec{\alpha})$ by setting the partial derivatives of $\mathcal{L}$ with respect to $\vec{w}$ and $b$ to zero,
then we get $$ \vec{w} = \sum_{i=1}^N \alpha_i y_i \vec{x}_i, \qquad \sum_{i=1}^N \alpha_i y_i = 0 $$
Substitute these back into $\mathcal{L}$:
$$ \min_{\vec{w}, b} \mathcal{L}(\vec{w}, b, \vec{\alpha}) = -\frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j (\vec{x}_i \cdot \vec{x}_j) + \sum_{i=1}^N \alpha_i $$
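Filling in the intermediate algebra: with $\vec{w} = \sum_i \alpha_i y_i \vec{x}_i$ and $\sum_i \alpha_i y_i = 0$,
$$ \frac{1}{2}||\vec{w}||^2 = \frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j (\vec{x}_i \cdot \vec{x}_j), \qquad \sum_{i=1}^N \alpha_i y_i (\vec{w} \cdot \vec{x}_i + b) = \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j (\vec{x}_i \cdot \vec{x}_j) + b \sum_{i=1}^N \alpha_i y_i $$
The $b$ term vanishes, so subtracting the second expression from the first and adding $\sum_i \alpha_i$ gives the result above.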
Second, solve the outer maximization of $\min_{\vec{w}, b} \mathcal{L}(\vec{w}, b, \vec{\alpha})$ over $\vec{\alpha}$.
Changing the sign turns it into an equivalent minimization:
$$ \min_{\vec{\alpha}} \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j (\vec{x}_i \cdot \vec{x}_j) - \sum_{i=1}^N \alpha_i \\
\ni \sum_{i=1}^N \alpha_i y_i = 0, \quad \alpha_i \geq 0 $$
Then solve this problem to get $\vec{\alpha}^*$; if $\exists\, j \ni \alpha_j^* > 0$, we can recover $$ \vec{w}^* = \sum_{i=1}^N \alpha_i^* y_i \vec{x}_i, \qquad b^* = y_j - \sum_{i=1}^N \alpha_i^* y_i (\vec{x}_i \cdot \vec{x}_j) $$
Then we can substitute $\vec{w}^*$ and $b^*$ back and get the separating hyperplane and classification function.
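To make the recovery step concrete, here is a minimal numeric sketch (assuming `cvxopt` and the same made-up data as above) that solves the dual QP and then recovers $\vec{w}^*$ and $b^*$ from a support vector:

```python
import numpy as np
from cvxopt import matrix, solvers  # assumed dependency

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# Dual: min 1/2 a^T P a - 1^T a   s.t.  a_i >= 0  and  sum_i a_i y_i = 0
P = matrix(np.outer(y, y) * (X @ X.T))   # P_ij = y_i y_j (x_i . x_j)
q = matrix(-np.ones(N))
G = matrix(-np.eye(N))                   # -a_i <= 0
h = matrix(np.zeros(N))
A = matrix(y.reshape(1, -1))             # equality constraint y^T a = 0
b_eq = matrix(0.0)

alpha = np.array(solvers.qp(P, q, G, h, A, b_eq)['x']).ravel()

# w* = sum_i a_i y_i x_i;  b* = y_j - sum_i a_i y_i (x_i . x_j) for any alpha_j > 0
w_star = (alpha * y) @ X
sv = alpha > 1e-6                        # numerical threshold for "alpha_j > 0"
j = int(np.argmax(sv))
b_star = y[j] - w_star @ X[j]
print("support vectors:", np.where(sv)[0], "w* =", w_star, "b* =", b_star)
```

For a handful of points a generic QP solver is fine; the point of SMO below is to avoid this cost when the training set is large.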
The derivation above says that we can definitely get the alphas and use them to compute $\vec{w}$ and $b$, but it doesn't say how to find the alphas.
For large datasets it isn't practical to do this the traditional way (e.g. gradient descent or a generic QP solver), since the dimension is too high (there is one $\alpha_i$ per training example).
So here is an efficient optimization algorithm: SMO (Sequential Minimal Optimization).
It takes the large optimization problem and breaks it into many small problems (a simplified code sketch follows the list below).
- Find a set of alphas and b.
- Once we have a set of alphas we can easily compute our weights w
- And get the separating hyperplane
- The SMO algorithm chooses two alphas to optimize on each cycle. Once a suitable pair of alphas is found, one is increased and one is decreased.
- Suitable criteria a pair must meet:
  - both of the alphas have to be outside their margin boundary
  - the alphas aren't already clamped or bounded
- The reason we have to change two alphas at the same time is that we need to maintain the constraint
$\displaystyle \sum_i \alpha_i y_i = 0$
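Below is a simplified SMO sketch, assuming numpy. Caveats: it picks the second alpha at random instead of using the full pairing heuristics described above, uses a linear kernel, and includes a box constraint $0 \leq \alpha_i \leq C$ (the soft-margin / "clamped" case the bullets refer to), which the hard-margin derivation above did not have; function and parameter names are made up for illustration.

```python
import numpy as np

def smo_simple(X, y, C=1.0, tol=1e-3, max_passes=20, rng=None):
    """Simplified SMO: optimize pairs (alpha_i, alpha_j) until no alpha changes for max_passes sweeps."""
    rng = np.random.default_rng(0) if rng is None else rng
    N = len(y)
    K = X @ X.T                                   # linear kernel matrix
    alpha, b, passes = np.zeros(N), 0.0, 0
    while passes < max_passes:
        changed = 0
        for i in range(N):
            Ei = (alpha * y) @ K[:, i] + b - y[i]             # prediction error on x_i
            # pick an alpha_i that violates the KKT conditions (within tolerance)
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = int(rng.integers(N - 1))
                j = j + 1 if j >= i else j                    # random j != i
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                # box ends L, H keep alpha_j feasible while preserving sum_i alpha_i y_i = 0
                if y[i] != y[j]:
                    L, H = max(0.0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0.0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                eta = 2.0 * K[i, j] - K[i, i] - K[j, j]       # minus the curvature along the constraint line
                if L == H or eta >= 0:
                    continue
                alpha[j] = np.clip(aj_old - y[j] * (Ei - Ej) / eta, L, H)  # optimize and clip alpha_j
                if abs(alpha[j] - aj_old) < 1e-5:
                    continue
                alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])      # move alpha_i the opposite way
                # update the threshold b from whichever alpha is strictly inside the box
                b1 = b - Ei - y[i] * (alpha[i] - ai_old) * K[i, i] - y[j] * (alpha[j] - aj_old) * K[i, j]
                b2 = b - Ej - y[i] * (alpha[i] - ai_old) * K[i, j] - y[j] * (alpha[j] - aj_old) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2.0)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

# Usage with the same made-up data as before
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, b = smo_simple(X, y, C=10.0)
w = (alpha * y) @ X                                # w = sum_i alpha_i y_i x_i
print("alphas:", np.round(alpha, 3), "w:", w, "b:", round(b, 3))
print("predictions:", np.sign(X @ w + b))          # should match y
```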
- 李航 (Hang Li), 統計機器學習 (*Statistical Machine Learning*)
- Peter Harrington, *Machine Learning in Action*