unsupervised learning
clustering
k-means
scikit-learn
Author

Oliver Chang

Published

August 12, 2025

Introduction: The Problem of “Player Comps”

“Who does this guy remind you of?” For decades, player comparisons—or “comps”—have been a cornerstone of how we talk about baseball. We’ve relied on simple, useful archetypes: the “power hitter,” the “contact hitter,” the “speedster,” the “ace.” While handy, these labels have always been oversimplified. Take a modern star like Juan Soto. Is he just a “power hitter”? That label would ignore his elite plate discipline and on-base skills.

In today’s Statcast era, we have access to a wealth of data that can help us move beyond these simplistic labels. Can we create a more nuanced understanding of player similarities? Instead of relying on subjective comparisons, we can use data-driven methods to identify player archetypes based on their actual performance metrics.

Enter K-Means Clustering

In this post, we’ll explore how to use K-Means clustering, an unsupervised machine learning technique, to group players based on their statistical profiles. By doing so, we can uncover natural groupings of players that share similar characteristics, leading to more meaningful and accurate player comparisons. We will first walk through the algorithm on synthetic data, and then cluster players using a focused set of four expected offensive metrics from Statcast.

Note

In our previous posts, we have covered logistic regression, support vector machines, and decision trees - these are all supervised learning techniques. K-Means clustering is a different beast altogether, as it is an unsupervised learning technique. This means we do not have labeled data to train on; instead, we are trying to find patterns in the data without any prior knowledge of the outcomes.

The data we will use comes from Baseball Savant, specifically the percentile rankings for various offensive metrics. You can download the data here. We use the 2025 season data for this analysis. We use percentile rankings because they standardize the metrics: every player lands on a 0-100 scale for each metric, which makes it easy to compare players on a relative scale and spot similarities and differences.
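
To make that concrete, here is a minimal sketch of how a percentile ranking could be derived from a raw metric with pandas; the raw_xwoba column and its values are hypothetical, since Savant already provides the percentiles for us.

Code
import pandas as pd

# Hypothetical raw metric for a handful of players (illustration only)
raw = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E"],
    "raw_xwoba": [0.412, 0.355, 0.301, 0.288, 0.344],
})

# rank(pct=True) gives the fraction of players at or below each value;
# scaling by 100 puts everyone on the same 0-100 percentile scale.
raw["xwoba_percentile"] = (raw["raw_xwoba"].rank(pct=True) * 100).round(1)
print(raw)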

K-Means Clustering Overview

Picture this: you are a shelf-stocker at a grocery store. Your goal is to sort the items in a giant, messy supermarket into neat, organized sections. You want to group similar items together, like all the cereals in one aisle, all the snacks in another, and so on. This is essentially what K-Means clustering does with data.

Step 1: Choose the Number of Clusters (K) - You decide how many groups (clusters) you want to create. This is like deciding how many aisles you want in the store. For example, you might choose 5 aisles for cereals, snacks, beverages, dairy, and produce.

Step 2: Randomly Place Cluster Centers - You randomly place a few points in the store to represent the center of each aisle. These points are called “centroids.” They are like the managers of each aisle, guiding where items should go.

Step 3: Assign Items to Clusters - You look at each item in the store and decide which aisle it belongs to based on its proximity to the centroids. If a cereal box is closest to the cereal aisle centroid, you put it there. This is like assigning items to the right shelves based on their characteristics.

Step 4: Update Cluster Centers - After assigning items, you check the centroids again. You calculate the average position of all items in each aisle and move the centroids to these new positions. This is like the aisle managers adjusting their positions based on where the items are now located.

Step 5: Repeat Until Stable - You repeat steps 3 and 4 until the centroids stop moving significantly. This means the aisles are now stable, and items are grouped as best as possible. You have successfully organized the store!

In our baseball analogy, the players are the items in the store, and the clusters are the aisles. The metrics we use to compare players are like the characteristics of the items. By applying K-Means clustering, we can group players with similar performance profiles together, just like organizing items into their respective aisles.

Under the Hood: How K-Means Works

The goal of K-means is to partition a dataset into k sets, S=\{S_1, S_2, \dots, S_k\}, so as to minimize the within-cluster sum of squares (WCSS), in other words, the within-cluster variance. More formally,

\text{argmin}_S \sum_{i=1}^{k} \sum_{x \in S_i} \|x - \mu_i\|^2 = \text{argmin}_S \sum_{i=1}^{k} |S_i| \operatorname{Var}(S_i) where \mu_i is the centroid (mean) of cluster S_i and |S_i| is the number of points in cluster S_i. The algorithm works by iteratively updating the centroids and reassigning points to clusters until convergence. This follows Wikipedia’s K-means clustering article.
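
To make the objective concrete, here is a small sketch on toy data (not the baseball set) that computes the WCSS by hand and checks that it matches scikit-learn's inertia_ attribute.

Code
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_toy, _ = make_blobs(n_samples=50, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)

# WCSS: squared distance from every point to its assigned centroid, summed
wcss = sum(
    np.sum((X_toy[km.labels_ == j] - km.cluster_centers_[j]) ** 2)
    for j in range(km.n_clusters)
)
print(wcss, km.inertia_)  # the two numbers agree up to floating-point precision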

The algorithm itself goes as follows: Given a set of k clusters, C = \{c_1, c_2, \dots, c_k\}, and a set of data points X = \{x_1, x_2, \dots, x_n\}:

  1. Assignment Step: Assign each data point x_i to the cluster c_j it is closest to, based on a distance metric (usually Euclidean distance). S_j^{(t)} = \{x_i \in X \; | \; \|x_i - c_j\|^2 \leq \|x_i - c_m\|^2, \; \forall m \neq j\}
  2. Update Step: Recalculate the centroids of each cluster based on the assigned points. c_j^{(t+1)} = \frac{1}{|S_j^{(t)}|} \sum_{x \in S_j^{(t)}} x where S_j^{(t)} is the set of points assigned to cluster c_j at iteration t.
  3. Repeat: Repeat the assignment and update steps until the centroids stop changing significantly or a maximum number of iterations is reached. A short numpy sketch of one assignment-and-update pass follows below.
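
To connect the formulas to code, here is a minimal numpy sketch of a single assignment-and-update pass on made-up data; the full from-scratch implementation appears later in the post.

Code
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 2))   # 10 points, 2 features
centroids = X_demo[:3].copy()       # k = 3 centroids, seeded from the data itself

# Assignment step: squared Euclidean distance from every point to every centroid
d2 = ((X_demo[:, np.newaxis, :] - centroids[np.newaxis, :, :]) ** 2).sum(axis=2)
labels = d2.argmin(axis=1)          # index of the nearest centroid for each point

# Update step: each centroid moves to the mean of the points assigned to it
centroids = np.array([X_demo[labels == j].mean(axis=0) for j in range(3)])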

The million-dollar question is: how many clusters should we use? The answer is not always straightforward, but there are methods to help us decide. The elbow method is a common technique for determining the optimal number of clusters. It involves plotting the sum of squared distances (inertia) between data points and their assigned cluster centers for different values of K. The idea is to find the “elbow” point in the plot, where adding more clusters yields diminishing returns in reducing inertia. Let’s visualize this with sample data.

Code
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import seaborn as sns

# Create synthetic dataset
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
df = pd.DataFrame(X, columns=['feature1', 'feature2'])

# Calculate inertia for different values of K
inertia = []
k_range = range(2, 11)  # Testing K from 2 to 10
for k in k_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init='auto', random_state=42)
    kmeans.fit(df)
    inertia.append(kmeans.inertia_)

# Plotting the Elbow Method
plt.figure(figsize=(10, 6))
plt.plot(k_range, inertia, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Sum of Squared Distances)')
plt.xticks(k_range)
plt.grid()
plt.show()

Elbow Method for Optimal K

Based on the elbow plot, we can see that the inertia decreases as we increase K, but there is a point where the decrease becomes less significant. This point is where we should choose our optimal K. For this synthetic dataset, which was generated with four centers, K=4 is a natural choice, as the inertia reduction slows down significantly after that point. Now that we have our optimal K, we can proceed with clustering the data.

Code
import numpy as np

class SimpleKMeans:
    def __init__(self, n_clusters=4, max_iters=100, random_state=42):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.random_state = random_state
        self.centroids = None
        self.labels = None

    def _initialize_centroids(self, X):
        np.random.seed(self.random_state)
        random_indices = np.random.permutation(X.shape[0])
        self.centroids = X[random_indices[:self.n_clusters]]

    def _assign_clusters(self, X):
        distances = np.sqrt(((X - self.centroids[:, np.newaxis])**2).sum(axis=2))
        return np.argmin(distances, axis=0)

    def _update_centroids(self, X, labels):
        new_centroids = np.zeros((self.n_clusters, X.shape[1]))
        for k in range(self.n_clusters):
            cluster_points = X[labels == k]
            if len(cluster_points) > 0:
                new_centroids[k] = cluster_points.mean(axis=0)
            else:
                # If a cluster ends up empty, keep its previous centroid
                new_centroids[k] = self.centroids[k]
        return new_centroids

    def fit(self, X):
        self._initialize_centroids(X)
        for _ in range(self.max_iters):
            self.labels = self._assign_clusters(X)
            new_centroids = self._update_centroids(X, self.labels)
            if np.all(self.centroids == new_centroids):
                break
            self.centroids = new_centroids
        return self

# Using the same synthetic data from before
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit the custom KMeans model
custom_kmeans = SimpleKMeans(n_clusters=4, random_state=42)
custom_kmeans.fit(X)
custom_labels = custom_kmeans.labels

# Plot the results
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=custom_labels, palette='viridis', s=100)
plt.scatter(custom_kmeans.centroids[:, 0], custom_kmeans.centroids[:, 1], 
            s=300, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering from Scratch')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid()
plt.show()

K-Means Clustering from Scratch

The algorithm has successfully grouped the data points into clusters based on their features. Each color represents a different cluster, and we can see how the points are distributed across the feature space.
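
As a sanity check, we can compare the from-scratch labels against scikit-learn's KMeans on the same blobs. The adjusted Rand index ignores how the cluster IDs are numbered, so a value near 1.0 indicates the two clusterings essentially agree; this comparison is an extra check layered on top of the code above.

Code
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Fit scikit-learn's implementation on the same synthetic data
sk_kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# A score close to 1.0 means the two groupings are essentially identical
print(adjusted_rand_score(custom_labels, sk_kmeans.labels_))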

The Payoff - Profiling MLB Players with K-Means

Now that we have a solid understanding of K-Means clustering, let’s apply it to our baseball data. We will use the percentile rankings for various offensive metrics to cluster players and identify their profiles.

Code
df = pd.read_csv("percentile_rankings.csv")
df.sort_values("xwoba", ascending=False).head(5)
player_name player_id year xwoba xba xslg xiso xobp brl brl_percent ... k_percent bb_percent whiff_percent chase_percent arm_strength sprint_speed oaa bat_speed squared_up_rate swing_length
326 Soto, Juan 665742 2025 100.0 98.0 99.0 99.0 100.0 98.0 96.0 ... 63.0 100.0 64.0 100.0 58.0 14 2.0 69.0 91.0 77.0
536 Judge, Aaron 592450 2025 100.0 99.0 100.0 100.0 99.0 100.0 100.0 ... 20.0 99.0 1.0 75.0 89.0 37 79.0 97.0 14.0 2.0
529 Schwarber, Kyle 656941 2025 99.0 73.0 99.0 99.0 97.0 98.0 99.0 ... 16.0 97.0 8.0 87.0 NaN 16 NaN 98.0 36.0 19.0
78 Ohtani, Shohei 660271 2025 99.0 86.0 100.0 100.0 94.0 100.0 100.0 ... 14.0 96.0 3.0 59.0 NaN 67 NaN 95.0 36.0 3.0
173 Guerrero Jr., Vladimir 665489 2025 98.0 100.0 94.0 78.0 100.0 95.0 84.0 ... 88.0 93.0 70.0 91.0 36.0 34 22.0 97.0 84.0 19.0

5 rows × 23 columns

The percentile rankings dataset contains various offensive metrics for players, such as xwOBA, xBA, and xSLG. For this application, we will use xba, xslg, xiso, and xobp. These metrics provide a comprehensive view of a player’s offensive performance, allowing us to cluster players based on their hitting profiles.

Code
features = ["xba", "xslg", "xiso", "xobp"]
df_k = df[features]
df_k = df_k.dropna()
df_k = df_k.sort_values("xslg", ascending=False)
df_k.head(5)
xba xslg xiso xobp
78 86.0 100.0 100.0 94.0
536 99.0 100.0 100.0 99.0
529 73.0 99.0 99.0 97.0
326 98.0 99.0 99.0 100.0
475 94.0 98.0 96.0 97.0

Let’s apply the Elbow method to determine the optimal number of clusters for our baseball data. We will plot the inertia for different values of K and look for the “elbow” point.

Code
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertia = []
k_values = range(1, 11)

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(df_k)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(k_values, inertia, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.xticks(k_values)
plt.grid()
plt.show()

The same pattern appears with the baseball data: inertia keeps falling as K increases, but the returns diminish. For our dataset, K=6 looks like a reasonable choice, as the inertia reduction slows down noticeably after that point.
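
Since the elbow is ultimately a judgment call, it can help to cross-check it with a second heuristic. The sketch below computes the silhouette score (higher is better) over the same range of K; this is an additional check layered on top of the elbow method, not part of it.

Code
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# silhouette_score requires at least 2 clusters, so the range starts at 2
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(df_k)
    print(f"K={k}: silhouette = {silhouette_score(df_k, labels):.3f}")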

The Final Clustering

Code
optimal_k = 6
final_kmeans = KMeans(n_clusters=optimal_k, init='k-means++', n_init='auto', random_state=42)
clusters = final_kmeans.fit_predict(df_k)

cluster_names = {2: "Elite Slugger", 5: "High-Average Hitter", 3: "Contact Specialist", 4: "Three True Outcome Hitter", 0: "Low-Average Power Threat", 1: "Struggling Hitter"}

# Add the cluster labels back to a copy of our original dataframe
results_df = df_k.copy()
results_df['cluster'] = clusters

# map cluster names to the cluster labels
results_df['cluster'] = results_df['cluster'].map(cluster_names)

cluster_profiles = results_df.groupby('cluster')[features].mean().round(1)
cluster_profiles = cluster_profiles.sort_values(by=features, ascending=False)

print("--- Cluster Profiles (Average Percentiles) ---")
cluster_profiles.to_clipboard()

cluster_profiles = cluster_profiles.rename(index=cluster_names)
cluster_profiles
--- Cluster Profiles (Average Percentiles) ---
xba xslg xiso xobp
cluster
Elite Slugger 80.5 91.8 89.3 89.7
High-Average Hitter 75.8 64.9 55.8 72.8
Contact Specialist 59.0 26.0 18.3 60.4
Three True Outcome Hitter 56.2 81.8 82.8 40.5
Low-Average Power Threat 30.3 50.8 59.0 32.6
Struggling Hitter 17.2 14.3 20.5 18.6

Here are the descriptions of each cluster based on their expected offensive performance percentiles.

  • Elite Slugger: Players in this cluster are expected to excel in all major offensive categories, with particularly high percentiles in xBA, xSLG, and xOBP.
  • High-Average Hitter: This group consists of players who may not have the same power as the elite sluggers but still maintain strong overall offensive numbers, especially in batting average and on-base percentage.
  • Contact Specialist: Players here are characterized by their ability to make contact and avoid strikeouts, often at the expense of power numbers.
  • Three True Outcome Hitter: This cluster includes players who can slug, walk, or strike out, with less emphasis on traditional batting average.
  • Low-Average Power Threat: These players have power potential but struggle with consistency and making contact.
  • Struggling Hitter: This group consists of players who are below average in most offensive categories and may be at risk of losing their roster spots.
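
Before profiling individual players, it is worth seeing how many players land in each archetype. A quick sketch; the exact counts will depend on the season data and on which rows survived the earlier dropna step.

Code
# Number of players assigned to each archetype
print(results_df['cluster'].value_counts())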

The figure below visualizes the average percentiles for each cluster across the four metrics we used. Each cluster is represented by a different color, and the radar chart allows us to see how each cluster compares across the metrics.

Code
# Average percentile profile for each cluster, taken from the table above
data = {
    'cluster': ['Elite Slugger', 'High-Average Hitter', 'Contact Specialist', 'Three True Outcome Hitter', 'Low-Average Power Threat', 'Struggling Hitter'],
    'xba': [80.5, 75.8, 59.0, 56.2, 30.3, 17.2],
    'xslg': [91.8, 64.9, 26.0, 81.8, 50.8, 14.3],
    'xiso': [89.3, 55.8, 18.3, 82.8, 59.0, 20.5],
    'xobp': [89.7, 72.8, 60.4, 40.5, 32.6, 18.6]
}
df_vis = pd.DataFrame(data)
df_vis = df_vis.set_index('cluster')

# Number of variables we're plotting.
num_vars = len(df_vis.columns)

# Compute angle for each axis.
angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()

# The plot is a circle, so we need to "complete the loop"
# and append the start to the end.
angles += angles[:1]

# Labels for each axis
labels = df_vis.columns

# Create the figure and subplots
fig, axes = plt.subplots(figsize=(10, 9), nrows=3, ncols=2, subplot_kw=dict(polar=True))
axes = axes.flatten() # Flatten the 3x2 grid of axes for easy iteration

# Define colors for each cluster
colors = plt.cm.viridis(np.linspace(0, 1, len(df_vis)))

# Plot each cluster on a separate subplot
for i, (cluster_name, row) in enumerate(df_vis.iterrows()):
    ax = axes[i]
    values = row.tolist()
    values += values[:1]  # complete the loop

    # Plot the data
    ax.plot(angles, values, color=colors[i], linewidth=2)
    ax.fill(angles, values, color=colors[i], alpha=0.25)

    # Prettify the plot
    ax.set_rlim(0, 100) # Set radial limits to be consistent (0-100 for percentiles)
    ax.set_yticks([0, 25, 50, 75, 100]) # Fix the radial tick positions before labeling them
    ax.set_yticklabels(['0', '25', '50', '75', '100'])
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(labels, size=8)
    ax.set_title(cluster_name, size=12, y=1.1)

# Adjust layout to prevent titles from overlapping
plt.tight_layout(pad=3.0)
plt.show()

Cluster Profiles Radar Chart

Let’s take a closer look at the top players in each cluster to see how they compare.

Elite Sluggers

Code
results_df = results_df.sort_values('xslg', ascending=False)
results_df = results_df.merge(df[['player_name']], left_index=True, right_index=True, how='left')
results_df[results_df['cluster'] == "Elite Slugger"].sort_values('xslg', ascending=False).head(5)
xba xslg xiso xobp cluster player_name
78 86.0 100.0 100.0 94.0 Elite Slugger Ohtani, Shohei
536 99.0 100.0 100.0 99.0 Elite Slugger Judge, Aaron
529 73.0 99.0 99.0 97.0 Elite Slugger Schwarber, Kyle
326 98.0 99.0 99.0 100.0 Elite Slugger Soto, Juan
475 94.0 98.0 96.0 97.0 Elite Slugger Seager, Corey

For example, the top players in the “Elite Slugger” cluster post high xSLG and xOBP, indicating their ability to hit for power and get on base. On paper, Corey Seager might not have as high an OPS as some of the other players, but his expected statistics suggest he is a top-tier slugger. That said, this list contains the usual suspects like Shohei Ohtani and Aaron Judge.

High-Average Hitters

Code
results_df[results_df['cluster'] == "High-Average Hitter"].sort_values('xba', ascending=False).head(5)
xba xslg xiso xobp cluster player_name
339 100.0 80.0 61.0 73.0 High-Average Hitter Bichette, Bo
444 97.0 71.0 48.0 77.0 High-Average Hitter Correa, Carlos
320 96.0 59.0 39.0 88.0 High-Average Hitter Henderson, Gunnar
448 96.0 75.0 52.0 63.0 High-Average Hitter García Jr., Luis
458 96.0 71.0 49.0 82.0 High-Average Hitter Kirk, Alejandro

What do Bo Bichette, Carlos Correa, and Gunnar Henderson have in common? They are all expected to have high batting averages and on-base percentages, making them valuable assets to their teams. These players may not hit for as much power as the elite sluggers, but they excel at getting on base and making contact.

Contact Specialists

Code
results_df[results_df['cluster'] == "Contact Specialist"].sort_values('xba', ascending=False).head(5)
xba xslg xiso xobp cluster player_name
17 93.0 6.0 1.0 51.0 Contact Specialist Simpson, Chandler
49 88.0 21.0 6.0 59.0 Contact Specialist Hoerner, Nico
35 88.0 18.0 4.0 49.0 Contact Specialist Wilson, Jacob
348 86.0 24.0 10.0 91.0 Contact Specialist Freeman, Tyler
488 86.0 13.0 3.0 35.0 Contact Specialist Arraez, Luis

The “Contact Specialist” cluster includes players like Luis Arraez and Nico Hoerner. These players are expected to have high batting averages and low strikeout rates, making them valuable for their ability to put the ball in play consistently. They may not hit for as much power, but their contact skills make them effective hitters.

Three True Outcome Hitters

Code
results_df[results_df['cluster'] == "Three True Outcome Hitter"].sort_values('xslg', ascending=False).head(5)
xba xslg xiso xobp cluster player_name
164 19.0 95.0 98.0 73.0 Three True Outcome Hitter Raleigh, Cal
474 86.0 93.0 89.0 44.0 Three True Outcome Hitter Perez, Salvador
494 54.0 92.0 95.0 51.0 Three True Outcome Hitter Buxton, Byron
459 44.0 92.0 95.0 6.0 Three True Outcome Hitter Carpenter, Kerry
244 64.0 91.0 92.0 63.0 Three True Outcome Hitter Suzuki, Seiya

The Three True Outcome Hitter. This group is significant in that it represents the new wave of hitters who rely on power, patience, and selectivity at the plate; it is the product of baseball’s shift toward a more analytical approach to hitting. Players like Cal Raleigh and Byron Buxton are prime examples of this cluster: they rate highly in expected slugging and isolated power, while their expected batting average and on-base numbers lag well behind.

Low-Average Power Threats

Code
results_df[results_df['cluster'] == "Low-Average Power Threat"].sort_values('xba', ascending=False).head(5)
xba xslg xiso xobp cluster player_name
208 79.0 52.0 42.0 19.0 Low-Average Power Threat Harris II, Michael
75 60.0 52.0 46.0 17.0 Low-Average Power Threat Chourio, Jackson
534 60.0 54.0 50.0 39.0 Low-Average Power Threat Bellinger, Cody
272 58.0 45.0 40.0 25.0 Low-Average Power Threat Wagaman, Eric
385 49.0 56.0 58.0 15.0 Low-Average Power Threat Castellanos, Nick

This is a one-dimensional group of players. They provide above-average power, as seen in their xslg and xiso percentiles. However, they are poor at getting on base (xobp) and hitting for average (xba), with both metrics falling in the bottom third of the league. They are a significant offensive risk, offering power but little else. Enter Cody Bellinger and Nick Castellanos, two players who have shown flashes of brilliance in the past but have struggled with consistency: respectable expected slugging, paired with batting averages and on-base numbers that lag well behind.

Struggling Hitters

Code
results_df[results_df['cluster'] == "Struggling Hitter"].sort_values('xba', ascending=False).head(5)
xba xslg xiso xobp cluster player_name
186 46.0 19.0 15.0 28.0 Struggling Hitter Winn, Masyn
9 41.0 32.0 30.0 29.0 Struggling Hitter Myers, Dane
425 41.0 2.0 1.0 11.0 Struggling Hitter Kiner-Falefa, Isiah
129 39.0 21.0 18.0 32.0 Struggling Hitter Toro, Abraham
110 37.0 5.0 5.0 27.0 Struggling Hitter Frazier, Adam

This cluster represents the least productive offensive players. The average member of this group ranks around the 20th percentile or lower across all four expected categories. They struggle to get on base, hit for average, or generate any kind of power. These players are likely experiencing significant slumps or are simply overmatched.

For those who are curious…

Code
results_df[results_df['cluster'] == "Struggling Hitter"].sort_values('xba', ascending=False).tail(5)
xba xslg xiso xobp cluster player_name
538 2.0 2.0 9.0 4.0 Struggling Hitter Walls, Taylor
373 1.0 6.0 22.0 1.0 Struggling Hitter Bailey, Patrick
518 1.0 2.0 11.0 4.0 Struggling Hitter Sweeney, Trey
472 1.0 19.0 41.0 3.0 Struggling Hitter Toglia, Michael
251 1.0 13.0 35.0 34.0 Struggling Hitter Jansen, Danny

Conclusion

Using K-Means clustering, we discovered six distinct player profiles based on their expected offensive performance metrics. This approach allows us to move beyond simplistic player comparisons and gain a more nuanced understanding of player archetypes. By clustering players based on their actual performance data, we can identify similarities and differences that may not be immediately apparent through traditional scouting methods.

K-means has its limitations, such as sensitivity to the initial placement of centroids and the need to specify the number of clusters beforehand. This method requires a careful selection of features and preprocessing steps to ensure meaningful results, in addition to a heuristic-based approach to determine the optimal number of clusters. However, it remains a powerful tool for uncovering patterns in data and can be applied to various domains beyond baseball.
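
To illustrate the sensitivity to initialization, here is a small sketch that refits the baseball features several times with a single random initialization each; the spread in final inertia is exactly what n_init (running multiple initializations and keeping the best) is meant to smooth over.

Code
from sklearn.cluster import KMeans

# Each run uses exactly one random initialization, so the final inertia can vary
for seed in range(5):
    km = KMeans(n_clusters=6, init='random', n_init=1, random_state=seed).fit(df_k)
    print(f"seed={seed}: inertia = {km.inertia_:.1f}")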

Data science isn’t here to replace baseball wisdom but to enhance it. This kind of analysis provides a new, powerful lens through which we can appreciate the diverse skill sets of the players we love to watch.

Look at Cody Bellinger this season. The numbers show his bat speed is ticking up. Just a little. But it’s there. Data science sees it as a player beginning to migrate from one cluster to another, a tiny tremor that might signal a return to the guy who won an MVP in 2019. It’s a reminder that these profiles aren’t destiny. They’re just a moment in time.

The data points are clues, but the players are still people. And people, thank God, can still surprise you.

