Support Vector Machines From Scratch Part 6 – SMO

Introduction

This is the sixth installment in a series of blog posts about Support Vector Machines. If you have not read the first five blog posts, I highly recommend you go back and read those before continuing on to this blog post.

Last time we looked at kernels and the kernel trick, which allowed us to use SVMs to classify data that is not linearly separable.

In this blog post we will be focusing once again on how we train SVMs. In the third blog post I introduced Quadratic Programming and we used the CVXOPT QP solver in Python to train our SVM.

In this blog post I will be describing in detail a superior algorithm specifically designed for training Support Vector Machines called Sequential Minimal Optimization (SMO).

Sequential Minimal Optimization is a modified version of an optimization algorithm called coordinate ascent. In SMO, we solve the optimization problem posed by the SVM dual iteratively by solving the smallest possible optimization subproblem at every step.

Why Sequential Minimal Optimization?

Why do we even need SMO? We were already training our SVM without it, after all.

It is true that we can train SVMs using Quadratic Programming solvers such as CVXOPT. As a matter of fact, until SMO’s discovery/invention by John Platt in 1998, QP solvers were how SVMs were trained. There are many issues with QP solvers, though.

First of all, QP solvers are notoriously tricky to implement, with many numerical precision issues to handle. SMO, by contrast, is straightforward enough that we will be implementing the algorithm ourselves from scratch.

Also, the techniques that QP solvers use to actually train SVMs involve keeping very large matrices in memory, and performing computationally costly operations with them. In the toy examples we looked at, you likely wouldn’t have run into any issues, but in a real scenario you could very easily hit a wall with large data sets where QP solvers are simply too inefficient to get the job done. SMO, on the other hand, does not require any of these expensive matrix operations whatsoever.

Coordinate Ascent

Coordinate ascent is an algorithm for maximizing objective functions.

The algorithm goes like this:

  1. Start with an initial guess at what the variables should be.
  2. Pick one decision variable.
  3. Find the partial derivative of the objective function with respect to that variable.
  4. If the partial derivative with respect to that variable is 0, pick another variable. If the partial derivative with respect to all variables is 0, you’ve found the maximum.
  5. Step that variable in the direction of that partial derivative, keeping the other decision variables constant.
  6. Repeat from step 2 until you’ve found the maximum.

Let’s take a look at one of the examples we saw in the second blog post:

min \; x^2 - 8x + y^2 - 12y + 48

This is a minimization problem. Coordinate ascent is for maximizing objective functions.

I’d like to change this minimization problem into a maximization problem so that, if we successfully perform coordinate ascent, we will get the same answer we did in the second blog post. To do this we just need to negate the objective function.

max \; -x^2 + 8x - y^2 + 12y - 48

Alternatively, we could modify the algorithm slightly to step the decision variable in the opposite direction of the partial derivative. This is called coordinate descent, and it would minimize our objective. For the sake of relating this to SMO, we will stick with coordinate ascent.

Now, we start by picking an initial guess at our decision variables (x and y). Let’s set them both equal to 0.

Let’s start by optimizing x.

We find the partial derivative of our objective with respect to x:

\dfrac{\partial}{\partial x} = -2x + 8

We plug in our current value of x, which is 0. We get 8.

Now we need to step our current value of x in the direction of the partial derivative. The partial derivative is positive, so we know we’re going to move x in the positive direction. Usually we pick a “learning rate”, some fixed value that we multiply our derivative by to get our final step amount. Let’s pick a learning rate of 0.5.

So we increase x by 4. Plugging our new x back into the partial derivative equation gives us 0. Our partial derivative is 0! That means that we’ve optimized our objective function for x.

Now we repeat the process with y and get an optimal value of y = 6. This is a maximum of our function.
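To make the procedure concrete, here is a tiny Python sketch of coordinate ascent on this exact objective (my own illustration, not code from the earlier posts; for this quadratic, a learning rate of 0.5 happens to land exactly on the optimum in a single step per variable):

def df_dx(x):
    return -2 * x + 8    # partial derivative with respect to x

def df_dy(y):
    return -2 * y + 12   # partial derivative with respect to y

x, y = 0.0, 0.0          # initial guess
lr = 0.5                 # learning rate

# Step each variable along its partial derivative, holding the other fixed,
# until both partials are zero. (Exact comparison is safe here only because
# the steps land exactly on the optimum.)
while df_dx(x) != 0 or df_dy(y) != 0:
    x += lr * df_dx(x)
    y += lr * df_dy(y)

print(x, y)  # 4.0 6.0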

Graphically, coordinate ascent looks like this:

The ellipses are the contour lines of our function. The red lines represent the steps we take in the direction of x and y. As you can see, we make axis-aligned movements towards the maximum.

Sequential Minimal Optimization

So, can we use coordinate ascent to optimize the SVM dual?

Recall that the SVM dual is:

max \sum_{i = 1}^n\alpha_i - \dfrac{1}{2}\sum_{i = 1}^n\sum_{j = 1}^n\alpha_i\alpha_jy_iy_jK_{ij}

subject to:

\sum_{i = 1}^n\alpha_iy_i = 0

0 \le \alpha_i \le C

Unfortunately, we can’t. The reason we can’t is the linearity constraint:

\sum_{i = 1}^n\alpha_iy_i = 0

This means that the Lagrange multipliers all lie on a single hyperplane: each \alpha_i is a function of all of the other \alphas. If we knew all but one of the \alphas, we could solve for the remaining one deterministically.
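Concretely, if we held every multiplier except some \alpha_k fixed, multiplying the constraint through by y_k (and using y_k^2 = 1) pins \alpha_k down completely:

\alpha_k = -y_k\sum_{i \ne k}\alpha_iy_i

A coordinate ascent step on a single \alpha therefore has no freedom to move at all.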

To get around this, SMO uses a modified version of coordinate ascent where we solve for two Lagrange multipliers at a time.

Derivation

This section is a full derivation of the solution for one iteration of SMO’s modified version of coordinate ascent. I could not find the full derivation anywhere online so I feel compelled to post it in its entirety, step by step. The original paper has a derivation, but it skips over a lot.

If you do not care about the derivation you can skip this section. If you see any issues in my derivation please let me know.

Our goal is to optimize the SVM dual by solving for two Lagrange multipliers at a time.

Looking at the linearity constraint:

\sum_{i = 1}^n\alpha_iy_i = 0

We can rewrite it to be in terms of \alpha_1 and \alpha_2, the two Lagrange multipliers chosen for optimization:

\alpha_1y_1 + \alpha_2y_2 = - \sum_{i = 3}^n \alpha_iy_i

The term on the right-hand side is just some fixed constant that we can calculate. We will call this constant \zeta:

\alpha_1y_1 + \alpha_2y_2 = \zeta

We can now solve this equation for \alpha_1:

\alpha_1 = \dfrac{1}{y_1}(\zeta - \alpha_2y_2)

Since y_1 is either 1 or -1 we know that the reciprocal of y_1 is just y_1:

\alpha_1 = y_1(\zeta - \alpha_2y_2)

Distributing the y_1 gives us \alpha_1 = y_1\zeta - y_1y_2\alpha_2. Since y_1\zeta is still just a fixed constant, we can absorb the y_1 into \zeta and write:

\alpha_1 = \zeta - y_1y_2\alpha_2

Now, I would also like to make a substitution in this equation using this equality:

s = y_1y_2

Plugging this back into the equation for \alpha_1 we get:

\alpha_1 = \zeta - s\alpha_2

Now we can rewrite the dual in terms of \alpha_1 and \alpha_2.

\mathcal{L}(\alpha_1, \alpha_2) = \dfrac{1}{2}K_{11}\alpha_1^2 + \dfrac{1}{2}K_{22}\alpha_2^2 + sK_{12}\alpha_1\alpha_2 + y_1\alpha_1\sum_{j = 3}^n\alpha_jy_jK_{1j} + y_2\alpha_2\sum_{j = 3}^n\alpha_jy_jK_{2j} - \alpha_1 - \alpha_2 + \text{constant}

This looks very complicated, but all I have done is flip the sign of the objective (turning the maximization into an equivalent minimization, as Platt’s paper does), separate out the terms that include \alpha_1 or \alpha_2, and fold everything else into the constant.

Now, we can rewrite this equation in terms of only \alpha_2 by substituting in \alpha_1 = \zeta - s\alpha_2:

\mathcal{L}(\alpha_2) = \dfrac{1}{2}K_{11}(\zeta - s\alpha_2)^2 + \dfrac{1}{2}K_{22}\alpha_2^2 + sK_{12}(\zeta - s\alpha_2)\alpha_2 + y_1(\zeta - s\alpha_2)\sum_{j = 3}^n\alpha_jy_jK_{1j} + y_2\alpha_2\sum_{j = 3}^n\alpha_jy_jK_{2j} - (\zeta - s\alpha_2) - \alpha_2 + \text{constant}

Now, I will introduce another variable to help simplify this equation:

v_i = \sum_{j = 3}^n\alpha_jy_jK_{ij}

Plugging v in and rearranging terms a bit we get:

\mathcal{L}(\alpha_2) = \dfrac{1}{2}K_{11}(\zeta - s\alpha_2)^2 + \dfrac{1}{2}K_{22}\alpha_2^2 + sK_{12}(\zeta - s\alpha_2)\alpha_2 + y_1(\zeta - s\alpha_2)v_1 + y_2\alpha_2v_2 - (\zeta - s\alpha_2) - \alpha_2 + \text{constant}

Now we have the dual equation in terms of only two Lagrange multipliers. This is still a Quadratic Program. We could plug this into a QP solver and it would still be much faster than before overall because we only have two decision variables.

We can do even better though, and solve for \alpha_2 analytically, thus avoiding the need for a QP solver altogether.

To analytically solve for \alpha_2 we differentiate with respect to \alpha_2 and set it equal to 0.

Let’s differentiate it term by term:

First Term: \dfrac{1}{2}K_{11}(\zeta - s\alpha_2)^2

We differentiate this with the chain rule and get:

\dfrac{d}{d\alpha_2} = -sK_{11}(\zeta - s\alpha_2)

Second Term: \dfrac{1}{2}K_{22}\alpha_2^2

We use the power rule:

\dfrac{d}{d\alpha_2} = K_{22}\alpha_2

Third Term: sK_{12}(\zeta - s\alpha_2)\alpha_2

Here \alpha_2 appears in both factors, so we first expand to sK_{12}\zeta\alpha_2 - K_{12}\alpha_2^2 (using s^2 = 1) and then differentiate:

\dfrac{d}{d\alpha_2} = sK_{12}\zeta - 2K_{12}\alpha_2 = sK_{12}(\zeta - s\alpha_2) - K_{12}\alpha_2

Fourth Term: y_1(\zeta - s\alpha_2)v_1

Expanding gives us:

y_1\zeta v_1 - sy_1v_1\alpha_2

The first piece is a constant, so only the second survives differentiation:

\dfrac{d}{d\alpha_2} = -sy_1v_1

But, recall that s = y_1y_2:

\dfrac{d}{d\alpha_2} = -y_2y_1^2v_1

Because y_i must be either 1 or -1, y_i^2 is always just 1:

\dfrac{d}{d\alpha_2} = -y_2v_1

Fifth Term: y_2\alpha_2v_2

This is a straightforward application of the constant rule:

\dfrac{d}{d\alpha_2} = y_2v_2

Sixth Term: -(\zeta - s\alpha_2)

First, we distribute the negative:

-\zeta + s\alpha_2

Then apply the constant rule:

\dfrac{d}{d\alpha_2} = s

Seventh Term: -\alpha_2

This is obvious:

\dfrac{d}{d\alpha_2} = -1

Eighth Term: the constant

The eighth term does not include \alpha_2, so it vanishes when we differentiate.

At long last putting the terms together into one equation and setting that equation equal to zero gives us:

\dfrac{\partial \mathcal{L}}{\partial \alpha_2} = -sK_{11}(\zeta - s\alpha_2) + K_{22}\alpha_2 - K_{12}\alpha_2 + sK_{12}(\zeta - s\alpha_2) - y_2v_1 + y_2v_2 + s - 1 = 0

Now, we need to solve for \alpha_2 to find our optimal solution. It is a lot of algebra. I will do it in small steps explaining each step as I go.

First, we can move s - 1 over to the other side of the equation:

-sK_{11}(\zeta - s\alpha_2) + K_{22}\alpha_2 - K_{12}\alpha_2 + sK_{12}(\zeta - s\alpha_2) - y_2v_1 + y_2v_2 = 1 - s

Now we can factor a y_2 out of the v terms, and move them over to the right side of the equation:

-sK_{11}(\zeta - s\alpha_2) + K_{22}\alpha_2 - K_{12}\alpha_2 + sK_{12}(\zeta - s\alpha_2) = y_2(v_1 - v_2) + 1 - s

Wherever there are parentheses on the left side of the equation, we can distribute:

-sK_{11}\zeta + s^2K_{11}\alpha_2 + K_{22}\alpha_2 - K_{12}\alpha_2 + sK_{12}\zeta - s^2K_{12}\alpha_2 = y_2(v_1 - v_2) + 1 - s

Since s = y_1y_2 and each y_i must be either 1 or -1, s^2 = 1, so the s^2 factors drop out. This will come up a lot during this derivation.

-sK_{11}\zeta + K_{11}\alpha_2 + K_{22}\alpha_2 - K_{12}\alpha_2 + sK_{12}\zeta - K_{12}\alpha_2 = y_2(v_1 - v_2) + 1 - s

Combining like terms:

-sK_{11}\zeta + sK_{12}\zeta + K_{11}\alpha_2 + K_{22}\alpha_2 - 2K_{12}\alpha_2 = y_2(v_1 - v_2) + 1 - s

We can factor out the s and \zeta from the leftmost terms and move them over to the other side of the equation:

K_{11}\alpha_2 + K_{22}\alpha_2 - 2K_{12}\alpha_2 = s(K_{11} - K_{12})\zeta + y_2(v_1 - v_2) + 1 - s

We can factor out the \alpha_2 on the left side:

\alpha_2(K_{11} + K_{22} - 2K_{12}) = s(K_{11} - K_{12})\zeta + y_2(v_1 - v_2) + 1 - s

At this point we need to take another look at the equation for \alpha_1 to make more progress:

\alpha_1 = \zeta - s\alpha_2

Solving for \zeta gives us:

\zeta = \alpha_1 + s\alpha_2

We don’t know what \alpha_1 and \alpha_2 are supposed to be yet for the current iteration, but we do know what they were last iteration. Because of the linear equality constraint, it must always be true that:

\alpha_1 + s\alpha_2 = \alpha_1^* + s\alpha_2^* = \zeta

Where the starred variables are the values found in the previous iteration.
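This invariant also hands us the update for \alpha_1 once the new \alpha_2 is known, which we will use when we recap the algorithm:

\alpha_1 = \alpha_1^* + s(\alpha_2^* - \alpha_2)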

Plugging this equation in for \zeta we get:

\alpha_2(K_{11} + K_{22} - 2K_{12}) = s(K_{11} - K_{12})(\alpha_1^* + s\alpha_2^*) + y_2(v_1 - v_2) + 1 - s

Distributing some of the parenthetical terms out on the right side of the equation gives us:

\alpha_2(K_{11} + K_{22} - 2K_{12}) = s(K_{11}\alpha_1^* + sK_{11}\alpha_2^* - K_{12}\alpha_1^* - sK_{12}\alpha_2^*) + y_2(v_1 - v_2) + 1 - s

\alpha_2(K_{11} + K_{22} - 2K_{12}) = sK_{11}\alpha_1^* + s^2K_{11}\alpha_2^* - sK_{12}\alpha_1^* - s^2K_{12}\alpha_2^* + y_2(v_1 - v_2) + 1 - s

\alpha_2(K_{11} + K_{22} - 2K_{12}) = sK_{11}\alpha_1^* + K_{11}\alpha_2^* - sK_{12}\alpha_1^* - K_{12}\alpha_2^* + y_2(v_1 - v_2) + 1 - s

Now, to make further progress we must revisit the equation we had for v:

v_i = \sum_{j = 3}^n\alpha_jy_jK_{ij}

Next we define a new variable, u_i, which is the SVM’s output for the ith training example under the current parameters, written in terms of the Lagrange multipliers:

u_i = \sum_{j = 1}^n\alpha_jy_jK_{ij} - b

We can then express v_i in terms of u_i:

v_i = u_i + b^* - y_1\alpha_1^*K_{1i} - y_2\alpha_2^*K_{2i}

Once again this is in terms of the values from the previous iteration (our current parameters).

Plugging this equation for v_i into our equality we can make more progress:

\alpha_2(K_{11} + K_{22} - 2K_{12}) = sK_{11}\alpha_1^* + K_{11}\alpha_2^* - sK_{12}\alpha_1^* - K_{12}\alpha_2^* + y_2(u_1 + b^* - y_1\alpha_1^*K_{11} - y_2\alpha_2^*K_{12} - u_2 - b^* + y_1\alpha_1^*K_{12} + y_2\alpha_2^*K_{22}) + 1 - s

The b^*s cancel out:

\alpha_2(K_{11} + K_{22} - 2K_{12}) = sK_{11}\alpha_1^* + K_{11}\alpha_2^* - sK_{12}\alpha_1^* - K_{12}\alpha_2^* + y_2(u_1 - y_1\alpha_1^*K_{11} - y_2\alpha_2^*K_{12} - u_2 + y_1\alpha_1^*K_{12} + y_2\alpha_2^*K_{22}) + 1 - s

Now we distribute the y_2. Anywhere a y_1y_2 appears we can replace it with s, and anywhere two y_2s multiply each other we can simply drop them, because y_i^2 = 1:

\alpha_2(K_{11} + K_{22} - 2K_{12}) = sK_{11}\alpha_1^* + K_{11}\alpha_2^* - sK_{12}\alpha_1^* - K_{12}\alpha_2^* + y_2u_1 - s\alpha_1^*K_{11} - \alpha_2^*K_{12} - y_2u_2 + s\alpha_1^*K_{12} + \alpha_2^*K_{22} + 1 - s

Now we have some terms that cancel out:

\alpha_2(K_{11} + K_{22} - 2K_{12}) = K_{11}\alpha_2^* - K_{12}\alpha_2^* + y_2u_1 - \alpha_2^*K_{12} - y_2u_2 + \alpha_2^*K_{22} + 1 - s

We can group all of the terms containing \alpha_2^* together:

\alpha_2(K_{11} + K_{22} - 2K_{12}) = K_{11}\alpha_2^* - 2K_{12}\alpha_2^* + \alpha_2^*K_{22} + y_2u_1 - y_2u_2 + 1 - s

And factor out the \alpha_2^*:

\alpha_2(K_{11} + K_{22} - 2K_{12}) = \alpha_2^*(K_{11} + K_{22} - 2K_{12}) + y_2u_1 - y_2u_2 + 1 - s

Now we can factor out a y_2 from the other terms on the right side:

\alpha_2(K_{11} + K_{22} - 2K_{12}) = \alpha_2^*(K_{11} + K_{22} - 2K_{12}) + y_2(u_1 - u_2 + y_2 - y_1)

Notice how, after factoring out the y_2, the 1 became y_2 and the s became y_1, because 1 = y_2^2 and s = y_1y_2.

Now, I will define another variable:

\eta = K_{11} + K_{22} - 2K_{12}

And substitute into the equation:

\alpha_2\eta = \alpha_2^*\eta + y_2(u_1 - u_2 + y_2 - y_1)

Now we divide both sides by \eta. (Note that \eta is the squared distance between the two points in feature space, so it is non-negative for any valid kernel; Platt’s paper treats the degenerate \eta \le 0 case separately.)

\alpha_2 = \alpha_2^* + \dfrac{y_2(u_1 - u_2 + y_2 - y_1)}{\eta}

The original paper defines another variable:

E_i = u_i - y_i

which is the “error” in the ith training example.

Rewriting our equation in terms of E gives us:

\alpha_2 = \alpha_2^* + \dfrac{y_2(E_1 - E_2)}{\eta}

Which is the equation found in the paper.
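To make this concrete, here is a small Python sketch of the unclipped update (my own illustration, not code from the paper; it assumes a precomputed kernel matrix K, a ±1 label vector y, the current multipliers alpha, and the current threshold b, as NumPy arrays and a scalar):

import numpy as np

def unclipped_alpha2(K, y, alpha, b, i1, i2):
    # u_i: the SVM's output for example i under the current parameters
    u1 = np.sum(alpha * y * K[:, i1]) - b
    u2 = np.sum(alpha * y * K[:, i2]) - b

    # E_i = u_i - y_i: the "error" on example i
    E1 = u1 - y[i1]
    E2 = u2 - y[i2]

    # eta = K11 + K22 - 2*K12: curvature along the constraint line
    eta = K[i1, i1] + K[i2, i2] - 2 * K[i1, i2]

    # The analytical (not yet clipped) optimum for alpha_2
    return alpha[i2] + y[i2] * (E1 - E2) / eta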

Keeping The Solution Feasible

There is one more step. This is the analytical solution for \alpha_2, but it does not take into account the inequality constraints:

0 \le \alpha_i \le C

All we need to do to obey these constraints is clip our \alpha_2 value so that it stays within the feasible region.

To see how, let’s start from the linearity constraint:

\sum_{i = 1}^n\alpha_iy_i = 0

Rewriting it in terms of \alpha_1 and \alpha_2, as before, we get:

\alpha_1y_1 + \alpha_2y_2 = \zeta

We know that each y must be either 1 or -1; whether y_1 and y_2 are equal or different determines the signs involved in this equation.

There are two possibilities:

  1. y_1 = y_2 \rightarrow \alpha_1 + \alpha_2 = \zeta
  2. y_1 \ne y_2 \rightarrow \alpha_1 - \alpha_2 = \zeta

Let’s first take a look at the case where y_1 = y_2:

We can plot this line to help gain some intuition about how we need to clip \alpha_2 to satisfy the inequality constraints:

A solution which satisfies the linearity constraint must lie on the diagonal line:

As you can see, the line crosses the \alpha_1 axis at \zeta. This comes from the equation for the line:

\alpha_1 + \alpha_2 = \zeta

When \alpha_2 = 0, the equation becomes:

\alpha_1 = \zeta

The same is true of the other end of the line segment: when \alpha_1 = 0, \alpha_2 = \zeta.

Not only must our solution lie on this line, but it must also lie within a box because of the inequality constraints: 0 \le \alpha_i \le C:

If we let \alpha_1 span its whole range [0, C], then \alpha_2 must be between these two bounds, L and H:

We can define the bounds like this:

L = max(0, \zeta - C)

H = min(C, \zeta)

Plugging the equation for \zeta back in, we can write these bounds in terms of \alpha_1 and \alpha_2:

L = max(0, \alpha_1 + \alpha_2 - C)

H = min(C, \alpha_1 + \alpha_2)
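As a quick sanity check (my own numbers, not from the original post): with C = 1, \alpha_1 = 0.3, and \alpha_2 = 0.5, we have \alpha_1 + \alpha_2 = 0.8, so L = max(0, -0.2) = 0 and H = min(1, 0.8) = 0.8. The new \alpha_2 must land in [0, 0.8].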

If the labels are not equal to each other (y_1 \ne y_2), the equation for the line becomes \alpha_1 - \alpha_2 = \zeta and the picture looks slightly different:

Now, the bounds look like this:

L = max(0, -\zeta)

H = min(C, C - \zeta)

Again, writing these bounds in terms of \alpha_1 and \alpha_2:

L = max(0, \alpha_2 - \alpha_1)

H = min(C, C + \alpha_2 - \alpha_1)

That’s it. We can now solve for two Lagrange multipliers at a time in the dual problem, keeping the solution feasible at every step.
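Here is a short Python sketch of the bounds and the clipping step, using the same hypothetical variable names as the snippet above (again my own illustration of the formulas, not code from the original post):

def clip_alpha2(a2_unclipped, a1_old, a2_old, y1, y2, C):
    # Feasible segment [L, H] for alpha_2 along the constraint line
    if y1 == y2:
        L = max(0.0, a1_old + a2_old - C)
        H = min(C, a1_old + a2_old)
    else:
        L = max(0.0, a2_old - a1_old)
        H = min(C, C + a2_old - a1_old)
    # Clip the analytical solution into the box
    return min(H, max(L, a2_unclipped))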

Conclusion

To recap, the algorithm goes like this:

  1. Select two Lagrange multipliers, \alpha_1 and \alpha_2, to optimize.
  2. Solve for \alpha_2 using the equation \alpha_2^{(new)} = \alpha_2^{(old)} + \dfrac{y_2(E_1 - E_2)}{\eta}
  3. Clip this solution for \alpha_2 so that it lies within the feasible region given by the bounds L and H.
  4. Solve for \alpha_1 using the linearity constraint: \alpha_1^{(new)} = \alpha_1^{(old)} + s(\alpha_2^{(old)} - \alpha_2^{(new)})
  5. Repeat until convergence.

There is a whole other component of the algorithm that the paper goes into: heuristics for choosing which Lagrange multipliers to optimize. We can simply select them at random, though, and the algorithm will often still converge. I will put the heuristics off until later and jump into the simplest possible implementation of SMO, sketched below.
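Here is a minimal, unoptimized sketch of what that simplest implementation might look like. This is my own sketch under stated assumptions: random pair selection instead of the paper’s heuristics, a precomputed kernel matrix K with a ±1 label vector y (NumPy arrays), and the threshold update rule from Platt’s paper. A production version needs the selection heuristics, an error cache, and proper numerical tolerances.

import numpy as np

def smo_simplified(K, y, C, max_passes=50, tol=1e-5, seed=0):
    # Bare-bones SMO: random pair selection, no selection heuristics.
    rng = np.random.default_rng(seed)
    n = len(y)
    alpha = np.zeros(n)
    b = 0.0

    def u(i):
        # SVM output for example i under the current parameters
        return np.sum(alpha * y * K[:, i]) - b

    for _ in range(max_passes):
        changed = False
        for i2 in range(n):
            i1 = int(rng.integers(n))
            if i1 == i2:
                continue
            s = y[i1] * y[i2]
            E1, E2 = u(i1) - y[i1], u(i2) - y[i2]
            eta = K[i1, i1] + K[i2, i2] - 2 * K[i1, i2]
            if eta <= 0:
                continue  # degenerate case; the paper handles it separately
            # Feasible segment [L, H] for alpha_2
            if s > 0:  # y1 == y2
                L = max(0.0, alpha[i1] + alpha[i2] - C)
                H = min(C, alpha[i1] + alpha[i2])
            else:      # y1 != y2
                L = max(0.0, alpha[i2] - alpha[i1])
                H = min(C, C + alpha[i2] - alpha[i1])
            if L >= H:
                continue
            # Analytical solution for alpha_2, then clip it into [L, H]
            a2_new = alpha[i2] + y[i2] * (E1 - E2) / eta
            a2_new = min(H, max(L, a2_new))
            if abs(a2_new - alpha[i2]) < tol:
                continue
            # Recover alpha_1 from the linearity constraint
            a1_new = alpha[i1] + s * (alpha[i2] - a2_new)
            # Update the threshold b (the b1/b2 rule from Platt's paper)
            b1 = E1 + y[i1] * (a1_new - alpha[i1]) * K[i1, i1] + y[i2] * (a2_new - alpha[i2]) * K[i1, i2] + b
            b2 = E2 + y[i1] * (a1_new - alpha[i1]) * K[i1, i2] + y[i2] * (a2_new - alpha[i2]) * K[i2, i2] + b
            if 0 < a1_new < C:
                b = b1
            elif 0 < a2_new < C:
                b = b2
            else:
                b = (b1 + b2) / 2
            alpha[i1], alpha[i2] = a1_new, a2_new
            changed = True
        if not changed:
            break
    return alpha, b

With the paper’s multiplier-selection heuristics layered on top of this same skeleton, you get the full SMO algorithm.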

Thank you for reading!

Resources:

John Platt, “Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines” (the original paper): https://www.microsoft.com/en-us/research/publication/sequential-minimal-optimization-a-fast-algorithm-for-training-support-vector-machines/
Andrew Ng’s Fast SMO Algorithm Notes
