Support Vector Machine

Padres
classification
scikit-learn
Author

Oliver Chang

Published

July 26, 2025

Introduction

In a previous article, we introduced logistic regression and applied it to a binary classification problem. In our implementation, we used a 0.5 probability threshold for classification, which corresponds to a linear decision boundary separating the two classes. However, nothing about that choice controls how far the boundary sits from the training points, which can hurt generalization, especially when the data is not cleanly separable. What if there were a way to find a hyperplane that maximizes the margin between the two classes? In this article, we turn to support vector machines (SVM).

Hard Margin SVM

Support vector machines are a set of supervised learning methods used for classification, regression, and outliers detection. The main idea behind SVM is to find a hyperplane that best separates the data into different classes. In the case of binary classification, this hyperplane is a line (in 2D) or a plane (in 3D) that divides the feature space into two halves, with each half representing a different class.

The optimal hyperplane is the one that maximizes the margin between the two classes. The margin is defined as the distance between the hyperplane and the nearest data point from either class. SVM aims to find the hyperplane that has the largest margin, as this is believed to lead to better generalization on unseen data.

Mathematically, we can represent the hyperplane or decision boundary as \theta^T x + b = 0 where \theta is the weight vector, x is the input feature vector, and b is the bias term. The goal of SVM is to find the optimal values for \theta and b that maximize the margin. The optimization problem can be formulated as follows:

\text{max}_{\theta, b} \; \gamma \quad \text{subject to} \quad \frac{y_i(\theta^T x_i + b)}{||\theta||} \geq \gamma, \quad \forall i. Here, \gamma represents the geometric margin, y_i is the class label for the i-th data point, and x_i is its feature vector. The constraint ensures that every data point is correctly classified and lies on or outside the margin boundary. The quantity y_i(\theta^T x_i + b) is the functional margin; since \theta and b can be rescaled by any positive constant without changing the hyperplane, we will set the functional margin of the closest points to 1 for simplicity. Observe the illustration below, which shows the hyperplane and the margin for a hard margin SVM.

Hard margin SVM

The perpendicular distance from the hyperplane to a data point is \text{distance} = |\theta^T x^i + b| / ||\theta||. Since every point must be correctly classified, y^i(\theta^T x^i + b) > 0, so multiplying by the label lets us drop the absolute value: \text{distance} = \frac{y^i(\theta^T x^i + b)}{||\theta||}, which is just the functional margin divided by ||\theta||. Because we can scale \theta and b by any positive constant, we set the functional margin of the closest points to 1. This means that the distance from the hyperplane to the nearest data point is 1 / ||\theta||.

If our goal is to maximize the margin 1 / ||\theta||, we can equivalently minimize the norm of \theta. Our optimization problem becomes \text{min}_{\theta, b} \; \frac{1}{2}\theta^T\theta \quad \text{s.t.} \quad y^i(\theta^T x^i + b) \geq 1, \; \forall i.
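
Before moving on to the soft margin case, here is a quick NumPy sanity check of the distance formula. This is a minimal sketch with made-up numbers; the hyperplane, point, and label below are hypothetical and chosen only for illustration.

import numpy as np

theta = np.array([2.0, 1.0])   # hypothetical weight vector
b = -3.0                       # hypothetical bias term
x_i = np.array([3.0, 1.0])     # a single data point
y_i = 1                        # its label

functional_margin = y_i * (theta @ x_i + b)              # y^i (theta^T x^i + b) = 4.0
distance = functional_margin / np.linalg.norm(theta)     # 4 / sqrt(5) ≈ 1.79
print(functional_margin, distance)

Rescaling \theta and b by the same positive constant changes the functional margin but leaves the geometric distance unchanged, which is why we are free to pin the functional margin of the closest points to 1.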

Soft Margin SVM

Among Us

What happens if data is not linearly separable? Take for example, the datapoints below.

Non-linearly separable data

Imposter acorn! There is no way we can draw a single line that separates the two classes, so we are going to need to add some flexibility. Continuing with the formalism from above, let’s define two conditions: (1) if the functional margin is at least 1, we pay nothing, and (2) if it is less than 1, we pay a linear penalty. This leads us to the soft margin SVM, which allows some misclassification of data points. More formally,

\text{min}_{\theta, \xi, b} \; \frac{1}{2}\theta^T\theta + C\sum_{i=1}^N \xi_i \quad \text{s.t.} \quad y^i(\theta^T x^i + b) + \xi_i \geq 1, \; \xi_i \geq 0, \; \forall i.

\xi_i is the “slack” variable that allows some misclassification of data points. For 0 < \xi_i \leq 1, the point lies inside the margin but on the correct side of the hyperplane. For \xi_i > 1, the point is misclassified. C is a regularization parameter: a small C allows constraints to be ignored cheaply \rightarrow large margin, while a large C forces the constraints to be satisfied \rightarrow small margin. C = \infty recovers the hard margin SVM.
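
To make the role of C concrete, here is a short sketch, using my own synthetic blobs rather than anything from this article, that fits a linear SVC with a tiny and a huge C and compares the resulting margin width 2 / ||\theta||:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so some slack is unavoidable
X_toy, y_toy = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X_toy, y_toy)
    w = clf.coef_[0]
    print(f"C={C:>6}: margin width = {2 / np.linalg.norm(w):.3f}, "
          f"support vectors = {len(clf.support_)}")

You should see the small-C model report a noticeably wider margin (and more support vectors) than the large-C model.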

So how do we solve this optimization problem? As written, it calls for a quadratic programming solver, which becomes computationally expensive as the data size increases. Instead, we can rewrite the soft margin SVM as an unconstrained optimization problem and use gradient descent to solve it.

Recall that \xi_i has two constraints: y^i(\theta^T x^i + b) + \xi_i \geq 1 and \xi_i \geq 0. We can rewrite the first constraint as y^i(\theta^T x^i + b) \geq 1 - \xi_i. Since the objective penalizes \xi_i, the optimal slack is the smallest value satisfying both constraints: \xi_i = \text{max}(0, 1 - y^i(\theta^T x^i + b)). Hence, \text{min}_{\theta, b} \frac{1}{2}\theta^T\theta + C\sum_{i=1}^N \text{max}(0, 1-y^i(\theta^T x^i + b)). Depending on where a data point lies, if it is …

  1. Outside the margin, then \xi_i = 0. No contribution to the loss.
  2. On the margin, then \xi_i = 0. No contribution to the loss.
  3. Inside the margin or on the wrong side of the hyperplane, then \xi_i = 1 - y^i(\theta^T x^i + b). The point violates the margin, so it contributes to the loss.

This is known as the hinge loss - sometimes regarded as a rigid, piecewise-linear cousin of the logistic loss. We can now use gradient descent to minimize the loss function.
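
To see what that comparison looks like numerically, the sketch below evaluates both losses at a few margin values m = y(\theta^T x + b). Both penalize small or negative margins, but the hinge loss is exactly zero once m \geq 1, while the logistic loss decays smoothly and never quite reaches zero.

import numpy as np

m = np.linspace(-2, 3, 6)              # a few functional-margin values
hinge = np.maximum(0, 1 - m)           # hinge loss
logistic = np.log1p(np.exp(-m))        # logistic loss, log(1 + e^{-m})

for mi, h, l in zip(m, hinge, logistic):
    print(f"margin={mi:5.1f}   hinge={h:5.2f}   logistic={l:5.2f}")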

Note

Try to think about the relationship between the hinge loss and the logistic loss. What are the similarities and differences? Read more about Logistic Regression here.

SVM Implementation

Gradient Descent for SVM

Suppose COST(\theta, b) is the cost function for soft margin SVM.

\begin{align*} COST(\theta, b) &= \frac{1}{2}\theta^T\theta + C\sum_{i=1}^N \text{max}(0, 1-y^i(\theta^T x^i + b)) \\ &= \sum_{i=1}^N (\frac{1}{2N}\theta^T\theta + C \text{max}(0, 1-y^i(\theta^T x^i + b))). \end{align*}

For each data point x^i, the per-sample gradients are

\frac{\partial COST(\theta, b)}{\partial \theta_j} = \begin{cases} \frac{1}{N}\theta_j - C y^i x_j^i , & \text{if } 1-y^i(\theta^T x^i + b) > 0 \\ \frac{1}{N}\theta_j, & \text{otherwise} \end{cases}

\frac{\partial COST(\theta, b)}{\partial b} = \begin{cases} -C y^i & \text{if } 1-y^i(\theta^T x^i + b) > 0 \\ 0, & \text{otherwise.} \end{cases}

Enough math. Let’s code.

SVM with NumPy

Using the unconstrained optimization formulation, we can implement SVM with just NumPy.

import numpy as np

class SVM:
    def __init__(self, learning_rate=0.001, n_iters=1000):
        self.lr = learning_rate
        self.n_iters = n_iters
        self.w = None
        self.b = None

    def fit(self, X, y, C=1):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0

        for epoch in range(self.n_iters):
            condition = y * (np.dot(X, self.w) + self.b)
            # Initialize gradients
            grad_w = self.w.copy()
            grad_b = 0

            # Calculate gradient from loss
            for indx, x_i in enumerate(X):
                if condition[indx] < 1:
                    grad_w -= C * y[indx] * x_i
                    grad_b -= C * y[indx]
            
            # Update weights and bias
            self.w -= self.lr * grad_w
            self.b -= self.lr * grad_b

    def predict(self, X):
        linear_output = np.dot(X, self.w) + self.b
        predictions = np.sign(linear_output)
        predictions[predictions == 0] = -1
        return predictions
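
The inner Python loop is easy to follow but slow on larger datasets. As a sketch, the same update can be written in vectorized form (identical hinge-loss gradients, labels assumed to be in {-1, +1}):

def fit_vectorized(X, y, lr=0.001, n_iters=1000, C=1.0):
    # Vectorized variant of SVM.fit above: same gradients, no per-sample loop
    n_features = X.shape[1]
    w, b = np.zeros(n_features), 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        violating = margins < 1                                    # points that violate the margin
        grad_w = w - C * (y[violating][:, None] * X[violating]).sum(axis=0)
        grad_b = -C * y[violating].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b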

Let’s validate our implementation with a simple dataset.

Code
import matplotlib.pyplot as plt
from sklearn import datasets

# Use only two classes and two features for simplicity
iris = datasets.load_iris()
X = iris.data[iris.target != 2, :2]
y = iris.target[iris.target != 2]
y = np.where(y == 0, -1, 1)

# Use more iterations to allow the model to converge
svm = SVM(learning_rate=0.001, n_iters=1000)
svm.fit(X, y)
print(f"Weights: {svm.w}, Bias: {svm.b}")

predictions = svm.predict(X)
accuracy = np.mean(predictions == y)
print(f"Accuracy: {accuracy * 100:.2f}%")

def plot_decision_boundary(X, y, model):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm')
    plt.title("SVM Decision Boundary")
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.show()  

plot_decision_boundary(X, y, svm)
Weights: [ 1.78678929 -2.79549737], Bias: -1.0460000000000018
Accuracy: 99.00%
Figure 1: Simple SVM implementation using NumPy

Looking good! We have successfully implemented a simple SVM classifier using NumPy. Next, let’s see how we can apply SVM to a real-world dataset with the sklearn library.

Baseball Application

In this section, we will apply SVM to a baseball dataset, using the svm module from the sklearn library. The dataset is a subset of Statcast data containing pitch-by-pitch information.

Padres pitcher Dylan Cease has one of the highest whiff rates in MLB. According to Baseball Savant, Cease stands alone at the top with a 33.9% whiff rate; Tarik Skubal sits close behind at 33.6%. For this case study, we are going to investigate which factors contribute to Cease’s swing-and-miss stuff. Like decision trees, kernel SVMs are non-parametric models that can capture non-linear relationships, which is ideal when numerous features may interact in complex ways. Every pitch carries granular information about the pitch type, velocity, spin rate, and location, and we will use SVM to classify whether a pitch results in a whiff or not.

We will use the cease-whiff.csv dataset, which contains information about every pitch thrown by Dylan Cease in the 2024 and 2025 seasons. The data was queried from Baseball Savant, so we’ll need to tidy it first. Two data-prep steps: categorize the description column into whiff or not whiff, and drop fouls.

Code
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("cw2.csv")

# categorize description into whiff or not whiff

df['whiff'] = df['description'].apply(lambda x: 1 if x == 'swinging_strike' else 0)
# let's also drop fouls

df = df[df['description'] != 'foul']

# select relevant features
features = ['pitch_type', 'release_speed', 'release_pos_x', 'release_pos_z', 'pfx_x', 'pfx_z', 'plate_x', 'plate_z', 'vx0', 'vy0', 'vz0', 'ax', 'ay', 'az', 'effective_speed', 'release_spin_rate', 'release_extension','release_pos_y', 'n_thruorder_pitcher', 'arm_angle', 'spin_axis', 'bat_speed','whiff']
df = df[features]

# drop rows with missing values
df = df.dropna()

# encode categorical variables
df = pd.get_dummies(df, columns=['pitch_type'], drop_first=True, dtype=int)

# scale numerical features
scaler = StandardScaler()
num_features = ['release_speed', 'release_pos_x', 'release_pos_z', 'pfx_x', 'pfx_z', 'plate_x', 'plate_z', 'vx0', 'vy0', 'vz0', 'ax', 'ay', 'az', 'effective_speed', 'release_spin_rate', 'release_extension','release_pos_y', 'n_thruorder_pitcher', 'arm_angle', 'bat_speed', 'spin_axis']
df[num_features] = scaler.fit_transform(df[num_features])
df.describe()
(df.describe() output: 1,882 rows across 28 columns; the scaled numeric features have mean ≈ 0 and std ≈ 1, and the whiff column has a mean of about 0.46.)

There you have it. Twenty-seven features to work with for binary classification. Let’s split the data into training and testing sets.

Code
from sklearn.model_selection import train_test_split

X = df.drop('whiff', axis=1)
y = df['whiff']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape}, Testing set size: {X_test.shape}")
Training set size: (1505, 27), Testing set size: (377, 27)

For this analysis, we will use the LinearSVC model from sklearn.

Code
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

model = LinearSVC(C=10)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(f"Test set accuracy: {score * 100:.2f}%")

# cross-validation
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV accuracy: {np.mean(cv_scores) * 100:.2f}%")
Test set accuracy: 61.54%
Cross-validation scores: [0.63660477 0.59681698 0.56117021 0.55585106 0.65957447]
Mean CV accuracy: 60.20%

61% accuracy on the test set: so-so, but still better than guessing. Let’s look at the feature importances (the linear model’s coefficients) to see which features are driving the model.

Code
import matplotlib.pyplot as plt
feature_names = X.columns
coefficients = model.coef_[0]
importance = pd.Series(coefficients, index=feature_names).sort_values(ascending=False)

plt.figure(figsize=(10, 6))
importance.plot(kind='bar')
plt.title('Feature Importance')
plt.ylabel('Coefficient Value')
plt.show()

Low-magnitude coefficients include arm_angle, release_speed, release_pos_y, ay, n_thruorder_pitcher, and release_extension. The highest-magnitude coefficient belongs to vz0, the pitch velocity in the z-direction; this makes sense because pitches with more vertical velocity tend to have more movement, making them harder to hit. The most influential feature pushing a sample toward NOT being a whiff is plate_z, the vertical position of the ball as it crosses the plate.
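
If you prefer a numeric list to the bar chart, sorting the same coefficients by absolute value makes the strongest and weakest features easy to pick out. A small sketch reusing the importance Series from the previous block:

# Coefficients ranked by absolute magnitude (largest first)
ranked = importance.reindex(importance.abs().sort_values(ascending=False).index)
print(ranked.head(5))   # largest-magnitude coefficients in either direction
print(ranked.tail(5))   # near-zero coefficients, candidates for removal later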

It is also worth noting that the classes could be imbalanced: not every pitch results in a whiff, and in fact most pitches do not. To get a more holistic view of the model’s performance, we can plot the precision-recall curve, which illustrates the trade-off between precision and recall across threshold values.

Note

Wikipedia has a great article on Precision and Recall if you want to learn more. I’ll write an article on this topic in the future.

Code
from sklearn.metrics import PrecisionRecallDisplay
display = PrecisionRecallDisplay.from_estimator(
    model, X_test, y_test, name="LinearSVC", plot_chance_level=True, despine=True
)

Observe the dip in the precision-recall curve. This indicates that the model struggles to balance precision and recall: only at low recall, where a handful of high-confidence samples are selected, do we gain high precision. The area under the precision-recall curve (AUC-PR) is a useful metric to summarize the model’s performance, and the linear SVC model has an AUC-PR of 0.64.
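
If you want that number programmatically rather than reading it off the plot, average_precision_score computes essentially the same summary from the decision-function scores. A sketch reusing the fitted LinearSVC and the test split from above:

from sklearn.metrics import average_precision_score

# LinearSVC has no predict_proba, but its signed distance to the hyperplane
# works as a ranking score for the precision-recall curve.
scores = model.decision_function(X_test)
print(f"AUC-PR (average precision): {average_precision_score(y_test, scores):.2f}")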

While the initial results are not flattering, there is promise in the model’s ability to distinguish between classes. Moreover, this highlights the challenge in predicting baseball outcomes - a game where randomness plays a significant role. That said, we can do better. What if we use a non-linear kernel? Let’s try the SVC model with an RBF kernel. Let’s also remove the least important features.

Code
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Remove low importance features
low_importance_features = ['arm_angle', 'release_speed', 'release_pos_y', 'ay', 'n_thruorder_pitcher', 'release_extension', 'pitch_type_FF', 'pitch_type_SL', 'pitch_type_SI', 'pitch_type_ST', 'pitch_type_KC']
X_reduced = X.drop(columns=low_importance_features)
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.2, random_state=42)

model_rbf = SVC(kernel='rbf', C=10, gamma='scale')
model_rbf.fit(X_train, y_train)
score = model_rbf.score(X_test, y_test)
print(f"Test set accuracy: {score * 100:.2f}%")

# cross-validation
cv_scores = cross_val_score(model_rbf, X, y, cv=5)  # note: cross-validates on the full feature set, not X_reduced
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV accuracy: {np.mean(cv_scores) * 100:.2f}%")
Test set accuracy: 67.37%
Cross-validation scores: [0.59416446 0.58885942 0.57978723 0.59042553 0.58776596]
Mean CV accuracy: 58.82%

67% accuracy on the test set. That’s progress! The non-linear kernel allows the model to capture more complex relationships between features. Below is the precision-recall curve for the RBF SVC model.

Code
display_rbf = PrecisionRecallDisplay.from_estimator(
    model_rbf, X_test, y_test, name="SVC RBF", plot_chance_level=True, despine=True
)

Just as importantly, the AUC-PR for the RBF SVC model is 0.72, a sizeable improvement from simply changing the kernel and removing low-importance features. Reading a point off the curve: the RBF model reaches a precision just shy of 0.8 when recall is around 0.5, meaning that at the threshold where the model recovers half of all whiffs, roughly 80% of the pitches it flags are actually whiffs. This is promising! Consider this: a starting pitcher throws thousands of pitches throughout a season, and many of those are called strikes, balls, or put in play. If a pitcher can identify the pitches that are most likely to result in whiffs, they can optimize their pitch selection and location to maximize their effectiveness.
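
To read that operating point off the curve programmatically, one can scan precision_recall_curve for the point with recall closest to 0.5. A sketch using the fitted RBF model and the reduced-feature test split:

from sklearn.metrics import precision_recall_curve

scores_rbf = model_rbf.decision_function(X_test)
precision, recall, thresholds = precision_recall_curve(y_test, scores_rbf)

idx = np.argmin(np.abs(recall - 0.5))   # point on the curve nearest 50% recall
print(f"recall = {recall[idx]:.2f}, precision = {precision[idx]:.2f}")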

Note

Pitch characteristics alone do not determine the outcome of a pitch. It takes two to tango. The batter’s skill, the game situation, and even luck play significant roles. However, understanding the factors that contribute to whiffs can help pitchers and coaches make more informed decisions. A future direction could incorporate batter data, pitch sequences, and situational context to build a more comprehensive model.

Finally, let’s visualize the pitch locations that resulted in whiffs vs those that did not using our new model.

Code
import seaborn as sns
sns.set_theme()

df_reduced = df.drop(columns=low_importance_features)
# undo one-hot encoding of pitch type
df_reduced['pitch_type'] = df[['pitch_type_FF', 'pitch_type_SL', 'pitch_type_SI', 'pitch_type_ST', 'pitch_type_KC']].idxmax(axis=1)
df_reduced['pitch_type'] = df_reduced['pitch_type'].str.replace('pitch_type_', '')
y_pred = model_rbf.predict(X_reduced)
df_reduced['whiff_pred'] = y_pred

df = pd.read_csv("cw2.csv")
df_reduced['effective_speed'] = df['effective_speed']
df_reduced['pfx_x'] = df['pfx_x']
df_reduced['pfx_z'] = df['pfx_z']

# plot two side-by-side scatter plots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))
ax1 = sns.scatterplot(data=df_reduced, x='pfx_x', y='pfx_z', hue='whiff_pred', size="effective_speed", sizes=(20,100), style="pitch_type", alpha=0.5, ax=axes[0])
ax1.set_title('Pitch Movement')
ax1.legend_.remove()

# change axis based on original plate_x and plate_z
df_reduced['plate_x'] = df['plate_x']
df_reduced['plate_z'] = df['plate_z']

ax2 = sns.scatterplot(data=df_reduced, x='plate_x', y='plate_z', hue='whiff_pred', size="effective_speed", sizes=(20,100), style="pitch_type", alpha=0.5, ax=axes[1])
# draw strike zone
strike_zone = plt.Rectangle((-0.83, 1.17), 1.66, 2.75, fill=False, color='black', linestyle='--')
ax2.add_patch(strike_zone)
ax2.axhline(1.17, color='black', linestyle='--')
ax2.axhline(3.92, color='black', linestyle='--')
ax2.axvline(-0.83, color='black', linestyle='--')
ax2.axvline(0.83, color='black', linestyle='--')
ax2.set_title('Pitch Location')

# rename whiff_pred legend
handles, labels = ax2.get_legend_handles_labels()
new_labels = ['Predicted Whiff','No Whiff','Whiff','Effective Speed','−3','−2','−1','0','1','Pitch Type','FF','SI','SL','KC','ST']
ax2.legend(handles, new_labels, loc='upper left', bbox_to_anchor=(1, 1))
sns.move_legend(ax2, "upper left", bbox_to_anchor=(1, 1))
plt.show()

fig.savefig("svm-pitch-movement-location.png", bbox_inches='tight', dpi=300)

These two plots work together to tell a story about what the SVM model has learned, separating the physical “stuff” of a pitch from its ultimate “strategy.” The Pitch Movement plot on the left reveals that fastballs with high vertical movement and sliders exhibit the most whiffs. However, the Pitch Location plot on the right adds more complexity in where these pitches must end up to be effective. It clearly shows the model predicting whiffs primarily in the classic “chase” zones—low below the strike zone and high above it—while correctly identifying pitches over the heart of the plate as non-whiffs (blue). Taken together, the plots demonstrate that the model has learned a two-part rule: a pitch is predicted as a whiff (red) only when it possesses both an elite movement profile and is delivered to a deceptive, difficult-to-hit location.

Conclusion

Analyzing pitch data serves as a testament to the phrase, “the Game within the Game”. Every pitch contains a wealth of information, and understanding these nuances can provide a competitive edge. Dylan Cease possesses elite stuff, and his fastball-slider combination demonstrates how challenging it is for a batter to distinguish a fastball from a slider—especially when both pitches come from the same arm slot.

It’s a beautiful lie, sold sixty feet six inches away. For the first forty feet, two entirely different products look identical. The batter, a man paid millions to make a split-second investment decision, is being sold a false prospectus. He has a fraction of a second to decide if he’s buying a 98-mile-per-hour corporate bond or a junk-rated slider that will default into the dirt. The old way of seeing the game—the scout in the stands with a radar gun and a gut feeling—would simply say, “The kid’s got a nasty slider.” And they’d be right, but also completely, hopelessly wrong.

Because what this model shows is that the game has moved on. The real edge isn’t knowing that the slider is good; it’s knowing precisely which version of the slider is good. Our little Support Vector Machine, humming away on a laptop, doesn’t care about labels. It has found the inefficiency in the market: the tiny, measurable quadrant of movement and location where a Cease slider transforms from merely difficult to statistically impossible. It’s the quant in the dugout, front-running the batter’s decision. And so the man in the box is left swinging at shadows, a victim not just of a pitcher’s talent, but of an arbitrage opportunity he never even knew existed. But to truly appreciate the deception, you have to see it for yourself.