WHEN STUCK, SWITCH TO ANOTHER PERSPECTIVE
Support Vector Machine
- If we want to distinguish between ‘+’ and ‘-’ samples, how should we divide it?
- If you draw a line and divide it, what line should it be?
The easiest and most intuitive answer is probably the following line (dotted line) with some lines that use the widest distance between the ‘+’ and ‘-’ samples.
The most intuitive way is putting in the widest street that separates the positive samples from the negative samples. Therefore to separate the fittest way is to find the widest street approach(street is wide as possible).
SVM is the process of solving the above question. I will introduce how to set up and solve problems from this point of view.
1. Decision Rule
Let’s think about “what should be the decision rule to decide the decision boundary?”
- The w⃗ is perpendicular to the median line of the street.
- The u⃗ is an unknown sample which is left side or right side of the street.
What we are wondering is whether the unknown sample, u⃗ , belongs on the righ or left side of the street.
One way we can do this is to make sure that the value is greater than the constant c after we used inner product of w⃗ and u⃗. In the range not exceeding the generality, it can be expressed as follows:
Logic is very simple. It is easy to understand that the inner product is to project u⃗ to w⃗ in the above plot, and it is easy to think that the length is long and it goes to the right if it goes beyond the boundary and to the left if it is shorter.
Therefore, the above equation (1) becomes our decision rule. It is also the first tool we need to understand SVM, but there are still a lot of deficiencies. Yet, we have no idea what w⃗ should be set in that formula or what b should be set. We but can only know that w⃗ is orthogonal to the median line of the street we want.
Unfortunately, such a w⃗ could be drawn in a wide variety of ways, so the judgment I can make here is still lacking of constraints. So what we are going to do is to add several constraints to the formula so that we can compute w⃗ and b.
Design and add additional constraints
Let’s say x₊ is a positive(+) sample and x₋ is a negative(-) sample.
Dealing with two expressions is annoying. So we will design a variable here and generalize it to one expression.
Now let’s multiply this yᵢ by the formula above.
In other words, the result of the formula for two samples, ‘+’ and ‘-’, that will be on the boundary (gray lines; gutters).
What we want to do is to set a line and to maximize the distance between the resulting ‘+’ and ‘-’ samples. Then we have to think about how we can create ‘street’.
What’s the width of the street?
If I had a unit vector in that direction, then I could just dot the two together, that would be the width.
The equation (4) above is the third tool we need to understand SVM.
The goal of maximizing WIDTH can be described as follows:
- Quadratic Programing
- Lagrange Multiplier Method (1802)
We can maximize or minimize without thinking about the constraints anymore. In the below equation, the α is the Lagrange multiplier.
Differentiation perspective to w⃗:
Set to zero:
It tells us that the w⃗ is a linear sum of samples (all or some). The w⃗ is going to be a linear some of these vectors(in the sample set, because for some α will be 0)
Differenciation perspective to b:
Set to zero:
Now, using the equations we obtained here, we will plug in for the equation (5) for α and simplify the problem.
because the ∑ᵢαᵢyᵢ is zero:
bind the expression:
So now everything is over with maximization problem for α. By solving this problem and obtaining α, w⃗ can be obtained by using the equation (6) and then b can also be obtained through equation (3). Equation (3) means that the value of α is not zero, which means that the corresponding x⃗ is the sample defining the boundary line. In SVM, these samples are called “support vector”.
I want to find what this maxization depends on with respect these sample vectors. What I’ve discovered is that the optimization depends only on the dot product of pairs of samples. So now, my decision rule with this expression for w⃗ is going to be w⃗ plugged.
The decision rule depends only on the dot product of these sample.
For the last, the L equation has very good properties when it is expanded. For α, the first term is linear and the second term is quadratic, so any off-the-shelf algorithm can be used to solve α using the quadratic programming technique.
In other words, unlike neural networks, this SVM does not have to worry about falling into local minima at all. So the solution obtained with SVM is “theoretically guaranteed to be the best solution”.
Therefore, decision boundaries defined by support vectors are the most optimal boundaries, and when a new sample comes in with only the data that we have, we have found a decision rule that can best generalize.