Support Vector Machines From Scratch Part 4 – Soft Margin SVM

Introduction

This is the fourth installment in a series of blog posts about Support Vector Machines. If you have not read the first three posts, I highly recommend going back and reading them before continuing with this one.

Last time we talked about Quadratic Programming, and successfully used a QP solver to train a Support Vector Machine to classify linearly separable data. What if our data is not linearly separable? Are we out of luck?

The answer is no. We can modify our SVM formulation to accommodate data that is not perfectly linearly separable. This new formulation is called Soft Margin SVM.

Linearly Separable Data

First of all, what do we mean when we say linearly separable data?

In the first blog post, I described some data that looked like this:

This data is perfectly linearly separable. There is a straight line that separates the classes perfectly. There are no green “x”s on the positive side, nor are there yellow triangles on the negative side.

In reality, things get messy. You will more frequently encounter data sets that are not perfectly linearly separable. It might look something like this:

It’s still mostly linearly separable, but some data points are outliers.

How do we modify our formulation of SVM to allow for some of these outliers, and still draw the ideal decision boundary?

First, we will need some way of quantifying how “off” the outlier data points are.

Hinge Loss

Hinge loss is a loss function used in maximum margin classifiers (like SVM). It gives us a measure of how far a training data point is from being on the correct side of its margin. The formula looks like this:

max(0, 1 - y_i(w^Tx_i - b))

  • y_i is the label
  • w^Tx_i - b is the “score”, the raw output of the classifier. Any observation on or beyond the positive support vector has a score \ge 1, and any observation on or beyond the negative support vector has a score \le -1.

I think the best way to gain an intuition about how hinge loss works is to look at some data points and calculate their hinge loss.

For example, consider this data point:

It is of the positive class (y_i = 1) and it is quite far to the right of the positive support vector. This means that its “score” (w^Tx_i - b) is going to be some positive value greater than 1.

Let’s say it has a score of 3. This means that its hinge loss is:

max(0, (1 - 3))

max(0, -2) = 0

It has a loss of 0, which is what we’d expect. It is comfortably on the right side of the decision boundary.

Notice that any point from the positive class that lies on or beyond the positive support vector will have a hinge loss of 0.

What about this point?

It’s not on the correct side of its support vector. It belongs to the negative class (y_i = -1), but it is not too far off: it sits inside the margin. Let’s say its score is -0.8.

max(0, 1 - (-1)(-0.8)) = max(0, 1 - 0.8) = 0.2

As you can see, it is assigned some loss, but only a small amount.

Here is a graph of what the hinge loss function looks like:

It is called the hinge loss because it looks like a hinge. Any time the “score” multiplied by the label is greater than or equal to 1, the hinge loss is 0. This is true of all data points that are on the correct side of their respective support vectors.

For data points whose y_i(w^Tx_i - b) value is less than 0 (on the wrong side of the decision boundary), we’re going to assign a loss that is greater than 1.

For data points whose y_i(w^Tx_i - b) value is between 0 and 1 (inside the margin, but on the correct side of the decision boundary) we’re going to assign a hinge loss between 0 and 1.
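
To make this concrete, here is a quick Python sketch (the hinge_loss helper and the example scores are just my own illustration, not something we will reuse later) that reproduces the two examples above:

def hinge_loss(score, label):
    # Hinge loss for a single point, where score = w^T x_i - b
    return max(0.0, 1.0 - label * score)

# The positive point with a score of 3: comfortably outside the margin
print(hinge_loss(3.0, 1))    # 0.0

# The negative point with a score of -0.8: inside the margin
print(hinge_loss(-0.8, -1))  # ~0.2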

Modifying Hard Margin SVM

All we need to do to modify our SVM formulation to accommodate data that is not perfectly linearly separable is to add a hinge loss penalty term to the SVM Primal function:

min \dfrac{1}{2}\|w\|^2

Becomes:

min\dfrac{1}{2}\|w\|^2 + C\sum_{i = 1}^nmax(0, 1 - y_i(w^Tx_i - b))

We’re trying to minimize the total hinge loss across all of our training examples, while also minimizing the squared magnitude of the weight vector, \|w\|^2.

C is just some hyperparameter we choose. The higher the value we give to C, the more we penalize points that are not on the correct side of their support vector. As C \rightarrow \infty we get closer to hard margin SVM.

You will often see it written this way:

min\dfrac{1}{2}\|w\|^2 + C\sum_{i = 1}^n\zeta_i

Where \zeta_i is the “slack variable” for a given data point i. We will use this form going forward but just know that \zeta_i is the hinge loss for that data point.

We also need to incorporate \zeta_i into the constraints. First, we subtract the slack variable from the right side of the original constraint, which allows a data point to fall short of the margin as long as the shortfall is covered by its slack variable:

y_i(w^Tx_i - b) \ge 1 - \zeta_i

If the hinge loss is 0, we get the original constraint back.

We also add a constraint to enforce that the slack variable be non-negative (as the hinge loss must be non-negative):

\zeta_i \ge 0

So, all together the Soft Margin SVM Primal becomes:

min\dfrac{1}{2}\|w\|^2 + C\sum_{i = 1}^n\zeta_i

subject to:

y_i(w^Tx_i - b) \ge 1 - \zeta_i

\zeta_i \ge 0
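
To see how these pieces fit together numerically, here is a minimal numpy sketch (my own, with a made-up toy w, b and two data points, not code from this series) that evaluates the soft margin primal objective:

import numpy as np

def primal_objective(w, b, X, y, C):
    # 1/2 ||w||^2 plus C times the sum of the slack variables (hinge losses)
    scores = X @ w - b
    slacks = np.maximum(0.0, 1.0 - y * scores)
    return 0.5 * np.dot(w, w) + C * np.sum(slacks)

# Toy example: the second point sits inside the margin, so it picks up slack
X_toy = np.array([[2.0, 1.0], [1.0, 1.0]])
y_toy = np.array([1.0, -1.0])
print(primal_objective(np.array([1.0, 0.0]), 1.0, X_toy, y_toy, C=3.0))  # 3.5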

Soft Margin SVM Dual

What about the dual?

Let’s derive the dual form again starting with this new primal form.

First, we form the Lagrangian:

\mathcal{L}(w, b, \zeta, \alpha, \beta) = \dfrac{1}{2}\|w\|^2 + C\sum_i\zeta_i - \sum_i\alpha_i(y_i(w^Tx_i - b) - 1 + \zeta_i) - \sum_i\beta_i\zeta_i

This is pretty similar to the Lagrangian we saw in the hard margin case, except the original objective function now contains the slack variables, and we subtract an extra term at the end for the new inequality constraints. These new constraints get their own set of Lagrange multipliers, \beta.

Now we differentiate the same way we did before. The partial derivatives with respect to w and b stay the same:

\dfrac{\partial \mathcal{L}}{\partial w} = w - \sum_{i = 1}^n\alpha_iy_ix_i = 0 \rightarrow w = \sum_{i = 1}^n\alpha_iy_ix_i

\dfrac{\partial \mathcal{L}}{\partial b} = \sum_{i = 1}^n\alpha_iy_i = \alpha^Ty = 0

We have another variable to differentiate with respect to: \zeta:

\dfrac{\partial \mathcal{L}}{\partial \zeta_i} = C - \alpha_i - \beta_i = 0
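
If you want to double-check these derivatives, here is an optional sympy sketch (my own addition; it uses a single one-dimensional training point with symbolic variables, so the sums disappear):

import sympy as sp

w, b, zeta, alpha, beta, C, x, y = sp.symbols('w b zeta alpha beta C x y')

# Lagrangian for a single one-dimensional training point
L = sp.Rational(1, 2) * w**2 + C * zeta \
    - alpha * (y * (w * x - b) - 1 + zeta) - beta * zeta

print(sp.diff(L, w))     # w - alpha*x*y     ->  w = alpha*y*x
print(sp.diff(L, b))     # alpha*y           ->  summing over i gives 0
print(sp.diff(L, zeta))  # C - alpha - beta  ->  C - alpha - beta = 0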

Let’s rearrange the summations in the Lagrangian the same way we did in our original derivation, only this time we need to include the terms we added:

\mathcal{L}(w, b, \zeta, \alpha) = \dfrac{1}{2}\|w\|^2 - w^T\sum \alpha_iy_ix_i + b\sum\alpha_iy_i + \sum \alpha_i + C\sum\zeta_i - \sum \alpha_i \zeta_i - \sum \beta_i \zeta_i

\mathcal{L}(w, b, \zeta, \alpha) = \dfrac{1}{2}\|w\|^2 - w^T\sum \alpha_iy_ix_i + b\sum\alpha_iy_i + \sum \alpha_i + \sum \zeta_i (C - \alpha_i - \beta_i)

Now we can plug the equations we got from the partial derivatives with respect to w, b, and \zeta_i into this expression.

Let’s start with w:

\mathcal{L}(w, b, \zeta, \alpha) = \dfrac{1}{2}\|w\|^2 - w^Tw + b\sum\alpha_iy_i + \sum \alpha_i + \sum \zeta_i (C - \alpha_i - \beta_i)

\mathcal{L}(w, b, \zeta, \alpha) = -\dfrac{1}{2}\|w\|^2 + b\sum\alpha_iy_i + \sum \alpha_i + \sum \zeta_i (C - \alpha_i - \beta_i)

Now b:

\mathcal{L}(w, b, \zeta, \alpha) = -\dfrac{1}{2}\|w\|^2 + 0 + \sum \alpha_i + \sum \zeta_i (C - \alpha_i - \beta_i)

\mathcal{L}(w, b, \zeta, \alpha) = -\dfrac{1}{2}\|w\|^2  + \sum \alpha_i + \sum \zeta_i (C - \alpha_i - \beta_i)

Finally, \zeta_i:

\mathcal{L}(w, b, \zeta, \alpha) = -\dfrac{1}{2}\|w\|^2  + \sum \alpha_i + \sum \zeta_i (0)

\mathcal{L}(w, b, \zeta, \alpha) = -\dfrac{1}{2}\|w\|^2  + \sum \alpha_i

We plug our equation in for w and get:

\mathcal{L}(w, b, \zeta, \alpha) = - \dfrac{1}{2}\sum_i\sum_j\alpha_i\alpha_jy_iy_j(x_i^Tx_j)  + \sum_i\alpha_i

After rearranging terms what we’re left with for our final dual formulation objective function is:

\mathcal{L}(\alpha) = \sum_i\alpha_i - \dfrac{1}{2}\sum_i\sum_j\alpha_i\alpha_jy_iy_j(x_i^Tx_j)

It’s the same! This is an amazing result.

The constraint associated with the Stationarity Condition stays the same. The equation we get when differentiating the Lagrangian with respect to b is no different in this formulation:

\sum_i\alpha_iy_i = 0

However, we now have two inequality constraints:

\alpha_i \ge 0

\beta_i \ge 0

Since \beta_i does not appear in the dual objective, we can eliminate it by rewriting its inequality constraint in terms of \alpha.

Solving the stationarity condition for \zeta_i for \beta_i:

C - \alpha_i - \beta_i = 0

\beta_i = C - \alpha_i

Substituting this expression for \beta_i into the constraint \beta_i \ge 0, we get:

C - \alpha_i \ge 0

C \ge \alpha_i

Combining these two constraints gives us:

0 \le \alpha_i \le C
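
Collecting the objective and the constraints we just derived, the complete Soft Margin SVM dual problem is:

max \sum_i\alpha_i - \dfrac{1}{2}\sum_i\sum_j\alpha_i\alpha_jy_iy_j(x_i^Tx_j)

subject to:

\sum_i\alpha_iy_i = 0

0 \le \alpha_i \le C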

So, after all that, the only thing that changes in the dual formulation is this extra upper bound on \alpha. Hopefully this starts to justify our decision to work with the dual form. It will become even more salient when we discuss kernels and SMO.

Solving Soft Margin SVM with CVXOPT

How do we need to change the code we wrote in the last blog post to handle this new formulation?

Well, the only thing that changed was our inequality constraints. If you remember, in the last post we started with the constraint:

\alpha_i \ge 0

But we needed to rewrite it into a form the QP solver was happy with. All inequality constraints must be in less than or equal to form when fed into the QP solver.

Let’s take our new constraint and transform it into a QP-friendly form:

0 \le \alpha_i \le C

First, we break this up into two separate constraints:

\alpha_i \le C

\alpha_i \ge 0

Now, as before, both of these need to be in less-than-or-equal-to form. The first inequality already is. For the second, we do the same thing we did last time and multiply both sides by -1 to flip the alligator mouth:

-\alpha_i \le 0

The only lines of code we need to modify are the matrix G and vector h that we give to the QP solver:

Here was what we had:

G = matrix(-np.eye(n))
h = matrix(np.zeros(n))

This handles the -\alpha_i \le 0 constraint. We can add the other constraint using numpy’s stack functions:

G = matrix(np.vstack((-np.eye(n), np.eye(n))))
h = matrix(np.hstack((np.zeros(n), np.full(n, C))))

Those are the only lines we need to change! The stacked G is now a 2n \times n matrix: the first n rows encode -\alpha_i \le 0 and the last n rows encode \alpha_i \le C, and h holds the matching right-hand sides (n zeros followed by n copies of C). Of course, C will have to be defined somewhere. Once again, C is just a hyperparameter you choose that determines how much you want to penalize outlier data points.

Here is the full, modified code again, so that you don’t need to go back to the previous blog post:

### Solving the SVM Dual with QP
from cvxopt import matrix, solvers
import numpy as np
import matplotlib.pyplot as plt

# Data Set
X = np.array([
    [1, 4], 
    [4, 4],
    [2, 6],
    [8, -1],
    [6, -2],
    [9, -3]
]) * 1.0

y = np.array([-1, 1, -1, 1, -1, 1]) * 1.0

# Create masks for positive and negative labels
mask_positive = y == 1
mask_negative = y == -1

plt.scatter(X[mask_positive, 0], X[mask_positive, 1], color='red')
plt.scatter(X[mask_negative, 0], X[mask_negative, 1], color='blue')

#### Solve with QP
C = 3.0
n, m = X.shape
X_hat = X * y.reshape(-1, 1) 
P = matrix(np.dot(X_hat, X_hat.T))
q = matrix(-np.ones((n, 1)))
G = matrix(np.vstack((-np.eye(n), np.eye(n))))
h = matrix(np.hstack((np.zeros(n), np.full(n, C))))
A = matrix(y.reshape(1, -1))
b = matrix(0.0)

sol = solvers.qp(P, q, G, h, A, b)
alphas = np.array(sol['x'])

# Support vectors are the points with non-zero alpha
S = (alphas > 1e-8).flatten()

# Of those, the "free" support vectors (0 < alpha < C) lie exactly on the
# margin, so they are the ones we can use to recover the bias
M = S & (alphas < C - 1e-5).flatten()

# Compute weights
w = np.zeros(2)
for i, x in enumerate(X):
    w += alphas[i] * y[i] * x

# Compute bias by averaging over the free support vectors
# (the code treats the decision function as w.x + b)
b = 0
for x_s, y_s in zip(X[M], y[M]):
    b += (y_s - np.dot(w, x_s))
b /= M.sum()

# Draw the decision boundary
x1 = np.linspace(0, 10, 50)  # Generate 50 evenly spaced points for x1
x2 = (-w[0] * x1 - b) / w[1]  # Solve w[0]*x1 + w[1]*x2 + b = 0 for x2

plt.scatter(X[mask_positive, 0], X[mask_positive, 1], color='red')
plt.scatter(X[mask_negative, 0], X[mask_negative, 1], color='blue')
plt.xlabel('x1')
plt.ylabel('x2')
plt.plot(x1, x2, color='purple')
plt.show()
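
If you would like to see the soft margin at work, here is a short optional snippet (my own addition; it assumes the script above has just been run and uses the same w^Tx + b convention as the code) that prints each training point's Lagrange multiplier and its slack value. The outliers should come back with \alpha_i at (or numerically very close to) C and a positive slack:

# Inspect the solution: points with alpha pinned at C are the margin
# violators that a hard margin SVM could not have handled
scores = X @ w + b
slacks = np.maximum(0.0, 1.0 - y * scores)

for i in range(n):
    print(f"point {X[i]}, label {y[i]:+.0f}, "
          f"alpha = {alphas[i, 0]:.4f}, slack = {slacks[i]:.4f}")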

Conclusion

Our SVM is becoming more powerful. We can now handle noisy data that isn’t perfectly linearly separable. In the next blog post, we will look at kernels and the kernel trick, which will make SVMs even more powerful and allow us to classify data that is not linearly separable at all.

Thank you for reading!
