unsupervised learning
clustering
k-means
scikit-learn
Author

Oliver Chang

Published

August 12, 2025

Introduction: The Problem of “Player Comps”

“Who does this guy remind you of?” For decades, player comparisons—or “comps”—have been a cornerstone of how we talk about baseball. We’ve relied on simple, useful archetypes: the “power hitter,” the “contact hitter,” the “speedster,” the “ace.” While handy, these labels have always been oversimplified. Take a modern star like Juan Soto. Is he just a “power hitter”? That label would ignore his elite plate discipline and on-base skills.

In today’s Statcast era, we have access to a wealth of data that can help us move beyond these simplistic labels. Can we create a more nuanced understanding of player similarities? Instead of relying on subjective comparisons, we can use data-driven methods to identify player archetypes based on their actual performance metrics.

Enter K-Means Clustering

In this post, we’ll explore how to use K-Means clustering, an unsupervised machine learning technique, to group players based on their statistical profiles. By doing so, we can uncover natural groupings of players that share similar characteristics, leading to more meaningful and accurate player comparisons. We will first walk through the algorithm on synthetic data, and then cluster players using a focused set of four expected offensive metrics from Statcast.

Note

In our previous posts, we have covered logistic regression, support vector machines, and decision trees - these are all supervised learning techniques. K-Means clustering is a different beast altogether, as it is an unsupervised learning technique. This means we do not have labeled data to train on; instead, we are trying to find patterns in the data without any prior knowledge of the outcomes.

The data we will use comes from Baseball Savant, specifically the percentile rankings for various offensive metrics. You can download the data here. We use the 2025 season data for this analysis. We use percentile rankings because they standardize the metrics: every player lands on a 0-100 scale for each metric, which makes it easy to compare players on a relative scale and spot similarities and differences.
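
To make that concrete, here is a minimal sketch of how a percentile ranking could be derived from a raw metric with pandas; the raw_xwoba column and its values are hypothetical, since Savant already provides the percentiles for us.

Code
import pandas as pd

# Hypothetical raw metric for a handful of players (illustration only)
raw = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E"],
    "raw_xwoba": [0.412, 0.355, 0.301, 0.288, 0.344],
})

# rank(pct=True) gives the fraction of players at or below each value;
# scaling by 100 puts everyone on the same 0-100 percentile scale.
raw["xwoba_percentile"] = (raw["raw_xwoba"].rank(pct=True) * 100).round(1)
print(raw)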

K-Means Clustering Overview

Picture this: you are a shelf-stocker at a grocery store. Your goal is to sort the items in a giant, messy supermarket into neat, organized sections. You want to group similar items together, like all the cereals in one aisle, all the snacks in another, and so on. This is essentially what K-Means clustering does with data.

Step 1: Choose the Number of Clusters (K) - You decide how many groups (clusters) you want to create. This is like deciding how many aisles you want in the store. For example, you might choose 5 aisles for cereals, snacks, beverages, dairy, and produce.

Step 2: Randomly Place Cluster Centers - You randomly place a few points in the store to represent the center of each aisle. These points are called “centroids.” They are like the managers of each aisle, guiding where items should go.

Step 3: Assign Items to Clusters - You look at each item in the store and decide which aisle it belongs to based on its proximity to the centroids. If a cereal box is closest to the cereal aisle centroid, you put it there. This is like assigning items to the right shelves based on their characteristics.

Step 4: Update Cluster Centers - After assigning items, you check the centroids again. You calculate the average position of all items in each aisle and move the centroids to these new positions. This is like the aisle managers adjusting their positions based on where the items are now located.

Step 5: Repeat Until Stable - You repeat steps 3 and 4 until the centroids stop moving significantly. This means the aisles are now stable, and items are grouped as best as possible. You have successfully organized the store!

In our baseball analogy, the players are the items in the store, and the clusters are the aisles. The metrics we use to compare players are like the characteristics of the items. By applying K-Means clustering, we can group players with similar performance profiles together, just like organizing items into their respective aisles.

Under the Hood: How K-Means Works

The goal of K-means is to partition a dataset into k sets, S=\{S_1, S_2, \dots, S_k\}, so as to minimize the within-cluster sum of squares (WCSS), in other words, the within-cluster variance. More formally,

\text{argmin}_S \sum_{i=1}^{k} \sum_{x \in S_i} \|x - \mu_i\|^2 = \text{argmin}_S \sum_{i=1}^{k} |S_i| \operatorname{Var}(S_i) where \mu_i is the centroid (mean) of cluster S_i and |S_i| is the number of points in cluster S_i. The algorithm works by iteratively updating the centroids and reassigning points to clusters until convergence. This follows Wikipedia’s K-means clustering article.
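
To make the objective concrete, here is a small sketch on toy data (not the baseball set) that computes the WCSS by hand and checks that it matches scikit-learn's inertia_ attribute.

Code
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_toy, _ = make_blobs(n_samples=50, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)

# WCSS: squared distance from every point to its assigned centroid, summed
wcss = sum(
    np.sum((X_toy[km.labels_ == j] - km.cluster_centers_[j]) ** 2)
    for j in range(km.n_clusters)
)
print(wcss, km.inertia_)  # the two numbers agree up to floating-point precision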

The algorithm itself goes as follows: Given a set of k clusters, C = \{c_1, c_2, \dots, c_k\}, and a set of data points X = \{x_1, x_2, \dots, x_n\}:

  1. Assignment Step: Assign each data point x_i to the cluster c_j it is closest to, based on a distance metric (usually Euclidean distance). S_j^{(t)} = \{x_i \in X \; | \; \|x_i - c_j\|^2 \leq \|x_i - c_m\|^2, \; \forall m \neq j\}
  2. Update Step: Recalculate the centroids of each cluster based on the assigned points. c_j^{(t+1)} = \frac{1}{|S_j^{(t)}|} \sum_{x \in S_j^{(t)}} x where S_j^{(t)} is the set of points assigned to cluster c_j at iteration t.
  3. Repeat: Repeat the assignment and update steps until the centroids stop changing significantly or a maximum number of iterations is reached. A short numpy sketch of one assignment-and-update pass follows below.
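
To connect the formulas to code, here is a minimal numpy sketch of a single assignment-and-update pass on made-up data; the full from-scratch implementation appears later in the post.

Code
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 2))   # 10 points, 2 features
centroids = X_demo[:3].copy()       # k = 3 centroids, seeded from the data itself

# Assignment step: squared Euclidean distance from every point to every centroid
d2 = ((X_demo[:, np.newaxis, :] - centroids[np.newaxis, :, :]) ** 2).sum(axis=2)
labels = d2.argmin(axis=1)          # index of the nearest centroid for each point

# Update step: each centroid moves to the mean of the points assigned to it
centroids = np.array([X_demo[labels == j].mean(axis=0) for j in range(3)])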

The million-dollar question is: how many clusters should we use? The answer is not always straightforward, but there are methods to help us decide. The elbow method is a common technique for determining the optimal number of clusters. It involves plotting the sum of squared distances (inertia) between data points and their assigned cluster centers for different values of K. The idea is to find the “elbow” point in the plot, where adding more clusters yields diminishing returns in reducing inertia. Let’s visualize this with sample data.

Code
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import seaborn as sns

# Create synthetic dataset
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
df = pd.DataFrame(X, columns=['feature1', 'feature2'])

# Calculate inertia for different values of K
inertia = []
k_range = range(2, 11)  # Testing K from 2 to 10
for k in k_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init='auto', random_state=42)
    kmeans.fit(df)
    inertia.append(kmeans.inertia_)

# Plotting the Elbow Method
plt.figure(figsize=(10, 6))
plt.plot(k_range, inertia, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Sum of Squared Distances)')
plt.xticks(k_range)
plt.grid()
plt.show()

Elbow Method for Optimal K

Based on the elbow plot, we can see that the inertia decreases as we increase K, but there is a point where the decrease becomes less significant. This point is where we should choose our optimal K. For this synthetic dataset, which was generated with four centers, K=4 is a natural choice, as the inertia reduction slows down significantly after that point. Now that we have our optimal K, we can proceed with clustering the data.

Code
import numpy as np

class SimpleKMeans:
    def __init__(self, n_clusters=4, max_iters=100, random_state=42):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.random_state = random_state
        self.centroids = None
        self.labels = None

    def _initialize_centroids(self, X):
        np.random.seed(self.random_state)
        random_indices = np.random.permutation(X.shape[0])
        self.centroids = X[random_indices[:self.n_clusters]]

    def _assign_clusters(self, X):
        distances = np.sqrt(((X - self.centroids[:, np.newaxis])**2).sum(axis=2))
        return np.argmin(distances, axis=0)

    def _update_centroids(self, X, labels):
        new_centroids = np.zeros((self.n_clusters, X.shape[1]))
        for k in range(self.n_clusters):
            cluster_points = X[labels == k]
            if len(cluster_points) > 0:
                new_centroids[k] = cluster_points.mean(axis=0)
            else:
                # If a cluster ends up empty, keep its previous centroid
                new_centroids[k] = self.centroids[k]
        return new_centroids

    def fit(self, X):
        self._initialize_centroids(X)
        for _ in range(self.max_iters):
            self.labels = self._assign_clusters(X)
            new_centroids = self._update_centroids(X, self.labels)
            if np.all(self.centroids == new_centroids):
                break
            self.centroids = new_centroids
        return self

# Using the same synthetic data from before
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit the custom KMeans model
custom_kmeans = SimpleKMeans(n_clusters=4, random_state=42)
custom_kmeans.fit(X)
custom_labels = custom_kmeans.labels

# Plot the results
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=custom_labels, palette='viridis', s=100)
plt.scatter(custom_kmeans.centroids[:, 0], custom_kmeans.centroids[:, 1], 
            s=300, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering from Scratch')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid()
plt.show()

K-Means Clustering from Scratch

The algorithm has successfully grouped the data points into clusters based on their features. Each color represents a different cluster, and we can see how the points are distributed across the feature space.
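
As a sanity check, we can compare the from-scratch labels against scikit-learn's KMeans on the same blobs. The adjusted Rand index ignores how the cluster IDs are numbered, so a value near 1.0 indicates the two clusterings essentially agree; this comparison is an extra check layered on top of the code above.

Code
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Fit scikit-learn's implementation on the same synthetic data
sk_kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# A score close to 1.0 means the two groupings are essentially identical
print(adjusted_rand_score(custom_labels, sk_kmeans.labels_))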

The Payoff - Profiling MLB Players with K-Means

Now that we have a solid understanding of K-Means clustering, let’s apply it to our baseball data. We will use the percentile rankings for various offensive metrics to cluster players and identify their profiles.

Code
df = pd.read_csv("percentile_rankings.csv")
df.sort_values("xwoba", ascending=False).head(5)
player_name player_id year xwoba xba xslg xiso xobp brl brl_percent ... k_percent bb_percent whiff_percent chase_percent arm_strength sprint_speed oaa bat_speed squared_up_rate swing_length
326 Soto, Juan 665742 2025 100.0 98.0 99.0 99.0 100.0 98.0 96.0 ... 63.0 100.0 64.0 100.0 58.0 14 2.0 69.0 91.0 77.0
536 Judge, Aaron 592450 2025 100.0 99.0 100.0 100.0 99.0 100.0 100.0 ... 20.0 99.0 1.0 75.0 89.0 37 79.0 97.0 14.0 2.0
529 Schwarber, Kyle 656941 2025 99.0 73.0 99.0 99.0 97.0 98.0 99.0 ... 16.0 97.0 8.0 87.0 NaN 16 NaN 98.0 36.0 19.0
78 Ohtani, Shohei 660271 2025 99.0 86.0 100.0 100.0 94.0 100.0 100.0 ... 14.0 96.0 3.0 59.0 NaN 67 NaN 95.0 36.0 3.0
173 Guerrero Jr., Vladimir 665489 2025 98.0 100.0 94.0 78.0 100.0 95.0 84.0 ... 88.0 93.0 70.0 91.0 36.0 34 22.0 97.0 84.0 19.0

5 rows × 23 columns

The percentile rankings dataset contains various offensive metrics for players, such as xwOBA, xBA, and xSLG. For this application, we will use xba, xslg, xiso, and xobp. These metrics provide a comprehensive view of a player’s offensive performance, allowing us to cluster players based on their hitting profiles.

Code
features = ["xba", "xslg", "xiso", "xobp"]
df_k = df[features]
df_k = df_k.dropna()
df_k = df_k.sort_values("xslg", ascending=False)
df_k.head(5)
xba xslg xiso xobp
78 86.0 100.0 100.0 94.0
536 99.0 100.0 100.0 99.0
529 73.0 99.0 99.0 97.0
326 98.0 99.0 99.0 100.0
475 94.0 98.0 96.0 97.0

Let’s apply the Elbow method to determine the optimal number of clusters for our baseball data. We will plot the inertia for different values of K and look for the “elbow” point.

Code
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertia = []
k_values = range(1, 11)

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(df_k)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(k_values, inertia, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.xticks(k_values)
plt.grid()
plt.show()

The same pattern appears with the baseball data: inertia keeps falling as K increases, but the returns diminish. For our dataset, K=6 looks like a reasonable choice, as the inertia reduction slows down noticeably after that point.
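
Since the elbow is ultimately a judgment call, it can help to cross-check it with a second heuristic. The sketch below computes the silhouette score (higher is better) over the same range of K; this is an additional check layered on top of the elbow method, not part of it.

Code
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# silhouette_score requires at least 2 clusters, so the range starts at 2
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(df_k)
    print(f"K={k}: silhouette = {silhouette_score(df_k, labels):.3f}")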

The Final Clustering

Code
optimal_k = 6
final_kmeans = KMeans(n_clusters=optimal_k, init='k-means++', n_init='auto', random_state=42)
clusters = final_kmeans.fit_predict(df_k)

cluster_names = {2: "Elite Slugger", 5: "High-Average Hitter", 3: "Contact Specialist", 4: "Three True Outcome Hitter", 0: "Low-Average Power Threat", 1: "Struggling Hitter"}

# Add the cluster labels back to a copy of our original dataframe
results_df = df_k.copy()
results_df['cluster'] = clusters

# map cluster names to the cluster labels
results_df['cluster'] = results_df['cluster'].map(cluster_names)

cluster_profiles = results_df.groupby('cluster')[features].mean().round(1)
cluster_profiles = cluster_profiles.sort_values(by=features, ascending=False)

print("--- Cluster Profiles (Average Percentiles) ---")
cluster_profiles.to_clipboard()

cluster_profiles = cluster_profiles.rename(index=cluster_names)
cluster_profiles
--- Cluster Profiles (Average Percentiles) ---
xba xslg xiso xobp
cluster
Elite Slugger 80.5 91.8 89.3 89.7
High-Average Hitter 75.8 64.9 55.8 72.8
Contact Specialist 59.0 26.0 18.3 60.4
Three True Outcome Hitter 56.2 81.8 82.8 40.5
Low-Average Power Threat 30.3 50.8 59.0 32.6
Struggling Hitter 17.2 14.3 20.5 18.6

Here are the descriptions of each cluster based on their expected offensive performance percentiles.

  • Elite Slugger: Players in this cluster are expected to excel in all major offensive categories, with particularly high percentiles in xBA, xSLG, and xOBP.
  • High-Average Hitter: This group consists of players who may not have the same power as the elite sluggers but still maintain strong overall offensive numbers, especially in batting average and on-base percentage.
  • Contact Specialist: Players here are characterized by their ability to make contact and avoid strikeouts, often at the expense of power numbers.
  • Three True Outcome Hitter: This cluster includes players who can slug, walk, or strike out, with less emphasis on traditional batting average.
  • Low-Average Power Threat: These players have power potential but struggle with consistency and making contact.
  • Struggling Hitter: This group consists of players who are below average in most offensive categories and may be at risk of losing their roster spots.
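
Before profiling individual players, it is worth seeing how many players land in each archetype. A quick sketch; the exact counts will depend on the season data and on which rows survived the earlier dropna step.

Code
# Number of players assigned to each archetype
print(results_df['cluster'].value_counts())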

The figure below visualizes the average percentiles for each cluster across the four metrics we used. Each cluster is represented by a different color, and the radar chart allows us to see how each cluster compares across the metrics.

Code
# Average percentile profile for each cluster, taken from the table above
data = {
    'cluster': ['Elite Slugger', 'High-Average Hitter', 'Contact Specialist', 'Three True Outcome Hitter', 'Low-Average Power Threat', 'Struggling Hitter'],
    'xba': [80.5, 75.8, 59.0, 56.2, 30.3, 17.2],
    'xslg': [91.8, 64.9, 26.0, 81.8, 50.8, 14.3],
    'xiso': [89.3, 55.8, 18.3, 82.8, 59.0, 20.5],
    'xobp': [89.7, 72.8, 60.4, 40.5, 32.6, 18.6]
}
df_vis = pd.DataFrame(data)
df_vis = df_vis.set_index('cluster')

# Number of variables we're plotting.
num_vars = len(df_vis.columns)

# Compute angle for each axis.
angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()

# The plot is a circle, so we need to "complete the loop"
# and append the start to the end.
angles += angles[:1]

# Labels for each axis
labels = df_vis.columns

# Create the figure and subplots
fig, axes = plt.subplots(figsize=(10, 9), nrows=3, ncols=2, subplot_kw=dict(polar=True))
axes = axes.flatten() # Flatten the 3x2 grid of axes for easy iteration

# Define colors for each cluster
colors = plt.cm.viridis(np.linspace(0, 1, len(df_vis)))

# Plot each cluster on a separate subplot
for i, (cluster_name, row) in enumerate(df_vis.iterrows()):
    ax = axes[i]
    values = row.tolist()
    values += values[:1]  # complete the loop

    # Plot the data
    ax.plot(angles, values, color=colors[i], linewidth=2)
    ax.fill(angles, values, color=colors[i], alpha=0.25)

    # Prettify the plot
    ax.set_rlim(0, 100) # Set radial limits to be consistent (0-100 for percentiles)
    ax.set_yticks([0, 25, 50, 75, 100]) # Fix the radial tick positions before labeling them
    ax.set_yticklabels(['0', '25', '50', '75', '100'])
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(labels, size=8)
    ax.set_title(cluster_name, size=12, y=1.1)

# Adjust layout to prevent titles from overlapping
plt.tight_layout(pad=3.0)
plt.show()

Cluster Profiles Radar Chart

Let’s take a closer look at the top players in each cluster to see how they compare.

Elite Sluggers

Code
results_df = results_df.sort_values('xslg', ascending=False)
results_df = results_df.merge(df[['player_name']], left_index=True, right_index=True, how='left')
results_df[results_df['cluster'] == "Elite Slugger"].sort_values('xslg', ascending=False).head(5)
xba xslg xiso xobp cluster player_name
78 86.0 100.0 100.0 94.0 Elite Slugger Ohtani, Shohei
536 99.0 100.0 100.0 99.0 Elite Slugger Judge, Aaron
529 73.0 99.0 99.0 97.0 Elite Slugger Schwarber, Kyle
326 98.0 99.0 99.0 100.0 Elite Slugger Soto, Juan
475 94.0 98.0 96.0 97.0 Elite Slugger Seager, Corey

For example, the top players in the “Elite Slugger” cluster post high xSLG and xOBP, indicating their ability to hit for power and get on base. On paper, Corey Seager might not have as high an OPS as some of the other players, but his expected statistics suggest he is a top-tier slugger. That said, this list contains the usual suspects like Shohei Ohtani and Aaron Judge.

High-Average Hitters

Code
results_df[results_df['cluster'] == "High-Average Hitter"].sort_values('xba', ascending=False).head(5)
xba xslg xiso xobp cluster player_name
339 100.0 80.0 61.0 73.0 High-Average Hitter Bichette, Bo
444 97.0 71.0 48.0 77.0 High-Average Hitter Correa, Carlos
320 96.0 59.0 39.0 88.0 High-Average Hitter Henderson, Gunnar
448 96.0 75.0 52.0 63.0 High-Average Hitter García Jr., Luis
458 96.0 71.0 49.0 82.0 High-Average Hitter Kirk, Alejandro

What do Bo Bichette, Carlos Correa, and Gunnar Henderson have in common? They are all expected to have high batting averages and on-base percentages, making them valuable assets to their teams. These players may not hit for as much power as the elite sluggers, but they excel at getting on base and making contact.

Contact Specialists

Code
results_df[results_df['cluster'] == "Contact Specialist"].sort_values('xba', ascending=False).head(5)
xba xslg xiso xobp cluster player_name
17 93.0 6.0 1.0 51.0 Contact Specialist Simpson, Chandler
49 88.0 21.0 6.0 59.0 Contact Specialist Hoerner, Nico
35 88.0 18.0 4.0 49.0 Contact Specialist Wilson, Jacob
348 86.0 24.0 10.0 91.0 Contact Specialist Freeman, Tyler
488 86.0 13.0 3.0 35.0 Contact Specialist Arraez, Luis

The “Contact Specialist” cluster includes players like Luis Arraez and Nico Hoerner. These players are expected to have high batting averages and low strikeout rates, making them valuable for their ability to put the ball in play consistently. They may not hit for as much power, but their contact skills make them effective hitters.

Three True Outcome Hitters

Code
results_df[results_df['cluster'] == "Three True Outcome Hitter"].sort_values('xslg', ascending=False).head(5)
xba xslg xiso xobp cluster player_name
164 19.0 95.0 98.0 73.0 Three True Outcome Hitter Raleigh, Cal
474 86.0 93.0 89.0 44.0 Three True Outcome Hitter Perez, Salvador
494 54.0 92.0 95.0 51.0 Three True Outcome Hitter Buxton, Byron
459 44.0 92.0 95.0 6.0 Three True Outcome Hitter Carpenter, Kerry
244 64.0 91.0 92.0 63.0 Three True Outcome Hitter Suzuki, Seiya

The Three True Outcome Hitter. This group is significant in that it represents the new wave of hitters who rely on power, patience, and selectivity at the plate; it is the product of baseball’s shift toward a more analytical approach to hitting. Players like Cal Raleigh and Byron Buxton are prime examples of this cluster: they rate highly in expected slugging and isolated power, while their expected batting average and on-base numbers lag well behind.

Low-Average Power Threats

Code
results_df[results_df['cluster'] == "Low-Average Power Threat"].sort_values('xba', ascending=False).head(5)
xba xslg xiso xobp cluster player_name
208 79.0 52.0 42.0 19.0 Low-Average Power Threat Harris II, Michael
75 60.0 52.0 46.0 17.0 Low-Average Power Threat Chourio, Jackson
534 60.0 54.0 50.0 39.0 Low-Average Power Threat Bellinger, Cody
272 58.0 45.0 40.0 25.0 Low-Average Power Threat Wagaman, Eric
385 49.0 56.0 58.0 15.0 Low-Average Power Threat Castellanos, Nick

This is a one-dimensional group of players. They provide above-average power, as seen in their xslg and xiso percentiles. However, they are poor at getting on base (xobp) and hitting for average (xba), with both metrics falling in the bottom third of the league. They are a significant offensive risk, offering power but little else. Enter Cody Bellinger and Nick Castellanos, two players who have shown flashes of brilliance in the past but have struggled with consistency: respectable expected slugging, paired with batting averages and on-base numbers that lag well behind.

Struggling Hitters

Code
results_df[results_df['cluster'] == "Struggling Hitter"].sort_values('xba', ascending=False).head(5)
xba xslg xiso xobp cluster player_name
186 46.0 19.0 15.0 28.0 Struggling Hitter Winn, Masyn
9 41.0 32.0 30.0 29.0 Struggling Hitter Myers, Dane
425 41.0 2.0 1.0 11.0 Struggling Hitter Kiner-Falefa, Isiah
129 39.0 21.0 18.0 32.0 Struggling Hitter Toro, Abraham
110 37.0 5.0 5.0 27.0 Struggling Hitter Frazier, Adam

This cluster represents the least productive offensive players. The average member of this group ranks around the 20th percentile or lower across all four expected categories. They struggle to get on base, hit for average, or generate any kind of power. These players are likely experiencing significant slumps or are simply overmatched.

For those who are curious…

Code
results_df[results_df['cluster'] == "Struggling Hitter"].sort_values('xba', ascending=False).tail(5)
xba xslg xiso xobp cluster player_name
538 2.0 2.0 9.0 4.0 Struggling Hitter Walls, Taylor
373 1.0 6.0 22.0 1.0 Struggling Hitter Bailey, Patrick
518 1.0 2.0 11.0 4.0 Struggling Hitter Sweeney, Trey
472 1.0 19.0 41.0 3.0 Struggling Hitter Toglia, Michael
251 1.0 13.0 35.0 34.0 Struggling Hitter Jansen, Danny

Conclusion

Using K-Means clustering, we discovered six distinct player profiles based on their expected offensive performance metrics. This approach allows us to move beyond simplistic player comparisons and gain a more nuanced understanding of player archetypes. By clustering players based on their actual performance data, we can identify similarities and differences that may not be immediately apparent through traditional scouting methods.

K-means has its limitations, such as sensitivity to the initial placement of centroids and the need to specify the number of clusters beforehand. This method requires a careful selection of features and preprocessing steps to ensure meaningful results, in addition to a heuristic-based approach to determine the optimal number of clusters. However, it remains a powerful tool for uncovering patterns in data and can be applied to various domains beyond baseball.
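
To illustrate the sensitivity to initialization, here is a small sketch that refits the baseball features several times with a single random initialization each; the spread in final inertia is exactly what n_init (running multiple initializations and keeping the best) is meant to smooth over.

Code
from sklearn.cluster import KMeans

# Each run uses exactly one random initialization, so the final inertia can vary
for seed in range(5):
    km = KMeans(n_clusters=6, init='random', n_init=1, random_state=seed).fit(df_k)
    print(f"seed={seed}: inertia = {km.inertia_:.1f}")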

Data science isn’t here to replace baseball wisdom but to enhance it. This kind of analysis provides a new, powerful lens through which we can appreciate the diverse skill sets of the players we love to watch.

Look at Cody Bellinger this season. The numbers show his bat speed is ticking up. Just a little. But it’s there. Data science sees it as a player beginning to migrate from one cluster to another, a tiny tremor that might signal a return to the guy who won an MVP in 2019. It’s a reminder that these profiles aren’t destiny. They’re just a moment in time.

The data points are clues, but the players are still people. And people, thank God, can still surprise you.

