## Introduction: The Problem of "Player Comps"

“Who does this guy remind you of?” For decades, player comparisons—or “comps”—have been a cornerstone of how we talk about baseball. We’ve relied on simple, useful archetypes: the “power hitter,” the “contact hitter,” the “speedster,” the “ace.” While handy, these labels have always been oversimplified. Take a modern star like Juan Soto. Is he just a “power hitter”? That label would ignore his elite plate discipline and on-base skills.
In today’s Statcast era, we have access to a wealth of data that can help us move beyond these simplistic labels. Can we create a more nuanced understanding of player similarities? Instead of relying on subjective comparisons, we can use data-driven methods to identify player archetypes based on their actual performance metrics.
### Enter K-Means Clustering
In this post, we’ll explore how to use K-Means clustering, an unsupervised machine learning technique, to group players based on their statistical profiles. By doing so, we can uncover natural groupings of players that share similar characteristics, leading to more meaningful and accurate player comparisons. We will cluster players on a small set of key Statcast percentile metrics to illustrate the process.
::: {.callout-note}
In our previous posts, we covered [logistic regression](https://runningonnumbers.com/posts/logistic-regression/), [support vector machines](https://runningonnumbers.com/posts/support-vector-machine/), and [decision trees](https://runningonnumbers.com/posts/fast-swings-and-barrels/) - all supervised learning techniques. K-Means clustering is a different beast altogether: it is an unsupervised learning technique. This means we do not have labeled data to train on; instead, we try to find patterns in the data without any prior knowledge of the outcomes.
:::
The data we will use comes from [Baseball Savant](https://baseballsavant.mlb.com/), specifically the percentile rankings for various offensive metrics. You can download the data [here](https://baseballsavant.mlb.com/leaderboard/percentile-rankings?type=batter&team=). We use the 2025 season data for this analysis. We use percentile rankings because they standardize the metrics: every player lands on a 0-100 scale for each metric, which lets us compare players on a relative scale and makes similarities and differences easier to spot.
## K-Means Clustering Overview
Picture this: you are a shelf-stocker at a grocery store. Your goal is to organize items in a giant, messy supermarket into neat, organized sections. You want to group similar items together, like all the cereals in one aisle, all the snacks in another, and so on. This is essentially what K-Means clustering does with data.
Step 1: **Choose the Number of Clusters (K)** - You decide how many groups (clusters) you want to create. This is like deciding how many aisles you want in the store. For example, you might choose 5 aisles for cereals, snacks, beverages, dairy, and produce.

Step 2: **Randomly Place Cluster Centers** - You randomly place a few points in the store to represent the center of each aisle. These points are called “centroids.” They are like the managers of each aisle, guiding where items should go.

Step 3: **Assign Items to Clusters** - You look at each item in the store and decide which aisle it belongs to based on its proximity to the centroids. If a cereal box is closest to the cereal aisle centroid, you put it there. This is like assigning items to the right shelves based on their characteristics.

Step 4: **Update Cluster Centers** - After assigning items, you check the centroids again. You calculate the average position of all items in each aisle and move the centroids to these new positions. This is like the aisle managers adjusting their positions based on where the items are now located.

Step 5: **Repeat Until Stable** - You repeat steps 3 and 4 until the centroids stop moving significantly. This means the aisles are now stable, and items are grouped as best as possible. You have successfully organized the store!
In our baseball analogy, the players are the items in the store, and the clusters are the aisles. The metrics we use to compare players are like the characteristics of the items. By applying K-Means clustering, we can group players with similar performance profiles together, just like organizing items into their respective aisles.
### Under the Hood: How K-Means Works
The goal of K-means is to partition a dataset into $k$ sets, $S=\{S_1, S_2, \dots, S_k\}$, chosen to minimize the within-cluster sum of squares (WCSS), in other words, the within-cluster variance. More formally,
$$
\underset{S}{\operatorname{argmin}} \sum_{i=1}^{k} \sum_{x \in S_i} \|x - \mu_i\|^2 = \underset{S}{\operatorname{argmin}} \sum_{i=1}^{k} |S_i| \operatorname{Var}(S_i),
$$
where $\mu_i$ is the centroid (mean) of cluster $S_i$ and $|S_i|$ is the number of points in cluster $S_i$. The algorithm works by iteratively updating the centroids and reassigning points to clusters until convergence. This is based on Wikipedia's [K-means clustering](https://en.wikipedia.org/wiki/K-means_clustering) article.
The algorithm itself goes as follows. Given a set of $k$ clusters, $C = \{c_1, c_2, \dots, c_k\}$, and a set of data points $X = \{x_1, x_2, \dots, x_n\}$:

1. **Assignment Step**: Assign each data point $x_i$ to the cluster whose centroid $c_j^{(t)}$ is closest to it, based on a distance metric (usually Euclidean distance).
$$
S_j^{(t)} = \{x_i \in X \;|\; \|x_i - c_j^{(t)}\|^2 \leq \|x_i - c_l^{(t)}\|^2 \;\; \forall\, l \neq j\}
$$
2. **Update Step**: Recalculate the centroid of each cluster as the mean of its assigned points,
$$
c_j^{(t+1)} = \frac{1}{|S_j^{(t)}|} \sum_{x \in S_j^{(t)}} x,
$$
where $S_j^{(t)}$ is the set of points assigned to cluster $j$ at iteration $t$.
3. **Repeat**: Repeat the assignment and update steps until the centroids no longer change significantly or a maximum number of iterations is reached.
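To make the two steps concrete, here is a minimal sketch (illustrative only, with made-up numbers) that runs a single assignment step and a single update step with NumPy:

```{python}
#| code-fold: true
import numpy as np

# Six made-up 2-D points and two deliberately rough starting centroids
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

# Assignment step: each point joins the cluster of its nearest centroid
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # shape (6, 2)
labels = distances.argmin(axis=1)
print(labels)  # [0 0 0 1 1 1]

# Update step: each centroid moves to the mean of its assigned points
centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
print(centroids)  # approximately [[1.17, 1.17], [8.5, 8.5]]
```

Repeating these two steps drives the centroids toward the middle of each group of points, which is exactly what the full algorithm below automates.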
The million-dollar question: how many clusters should we use? The answer is not always straightforward, but there are methods to help us decide. The **Elbow Method** is a common technique for determining the optimal number of clusters. It involves plotting the sum of squared distances (inertia) between data points and their assigned cluster centers for different values of K. The idea is to find the "elbow" point in the plot, where adding more clusters yields diminishing returns in reducing inertia. Let's visualize this with sample data.
```{python}
#| code-fold: true
#| warning: false
#| fig-cap: Elbow Method for Optimal K
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import seaborn as sns

# Create a synthetic dataset with four well-separated blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
df = pd.DataFrame(X, columns=['feature1', 'feature2'])

# Calculate inertia for different values of K
inertia = []
k_range = range(2, 11)  # Testing K from 2 to 10
for k in k_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init='auto', random_state=42)
    kmeans.fit(df)
    inertia.append(kmeans.inertia_)

# Plot the Elbow Method
plt.figure(figsize=(10, 6))
plt.plot(k_range, inertia, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Sum of Squared Distances)')
plt.xticks(k_range)
plt.grid()
plt.show()
```
Elbow Method for Optimal K
Based on the elbow plot, we can see that the inertia decreases as we increase K, but there is a point where the decrease becomes less significant. This point is where we should choose our optimal K. For this synthetic dataset, $K=4$ is a good choice, as the inertia reduction slows down significantly after that point. Now that we have our optimal K, we can proceed with clustering the data.
```{python}
#| code-fold: true
#| warning: false
#| fig-cap: K-Means Clustering from Scratch
import numpy as np

class SimpleKMeans:
    def __init__(self, n_clusters=4, max_iters=100, random_state=42):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.random_state = random_state
        self.centroids = None
        self.labels = None

    def _initialize_centroids(self, X):
        # Pick n_clusters random points from X as the starting centroids
        np.random.seed(self.random_state)
        random_indices = np.random.permutation(X.shape[0])
        self.centroids = X[random_indices[:self.n_clusters]]

    def _assign_clusters(self, X):
        # Distance from every centroid to every point, then nearest centroid per point
        distances = np.sqrt(((X - self.centroids[:, np.newaxis])**2).sum(axis=2))
        return np.argmin(distances, axis=0)

    def _update_centroids(self, X, labels):
        # Move each centroid to the mean of the points assigned to it
        new_centroids = np.zeros((self.n_clusters, X.shape[1]))
        for k in range(self.n_clusters):
            cluster_points = X[labels == k]
            if len(cluster_points) > 0:
                new_centroids[k] = cluster_points.mean(axis=0)
        return new_centroids

    def fit(self, X):
        self._initialize_centroids(X)
        for _ in range(self.max_iters):
            self.labels = self._assign_clusters(X)
            new_centroids = self._update_centroids(X, self.labels)
            if np.all(self.centroids == new_centroids):
                break
            self.centroids = new_centroids
        return self

# Using the same synthetic data from before
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit the custom KMeans model
custom_kmeans = SimpleKMeans(n_clusters=4, random_state=42)
custom_kmeans.fit(X)
custom_labels = custom_kmeans.labels

# Plot the results
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=custom_labels, palette='viridis', s=100)
plt.scatter(custom_kmeans.centroids[:, 0], custom_kmeans.centroids[:, 1],
            s=300, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering from Scratch')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid()
plt.show()
```
K-Means Clustering from Scratch
The algorithm has successfully grouped the data points into clusters based on their features. Each color represents a different cluster, and we can see how the points are distributed across the feature space.
## The Payoff - Profiling MLB Players with K-Means
Now that we have a solid understanding of K-Means clustering, let’s apply it to our baseball data. We will use the percentile rankings for various offensive metrics to cluster players and identify their profiles.
The percentile rankings dataset contains various offensive metrics for players, such as xwOBA, xBA, and xSLG. For this application, we will use `xba`, `xslg`, `xiso`, and `xobp`. These metrics provide a comprehensive view of a player's offensive performance, allowing us to cluster players based on their hitting profiles.
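Loading the percentile rankings and pulling out those four columns looks like this (assuming the Baseball Savant export has been saved locally as `percentile_rankings.csv`):

```{python}
#| code-fold: true
#| warning: false
# Load the Baseball Savant percentile rankings (2025 season export)
df = pd.read_csv("percentile_rankings.csv")

# Keep only the four expected-stat percentiles and drop players with missing values
features = ["xba", "xslg", "xiso", "xobp"]
df_k = df[features].dropna()
df_k = df_k.sort_values("xslg", ascending=False)
df_k.head(5)
```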
Let’s apply the Elbow method to determine the optimal number of clusters for our baseball data. We will plot the inertia for different values of K and look for the “elbow” point.
```{python}
#| code-fold: true
#| warning: false
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertia = []
k_values = range(1, 11)
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(df_k)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(k_values, inertia, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.xticks(k_values)
plt.grid()
plt.show()
```
Looking at the elbow plot, we can see that the inertia decreases as we increase K, but there is a point where the decrease becomes less significant. This point is where we should choose our optimal K. For our dataset, it appears that $K=6$ is a good choice, as the inertia reduction slows down significantly after that point.
### The Final Clustering
```{python}
#| code-fold: true
#| warning: false
optimal_k = 6
final_kmeans = KMeans(n_clusters=optimal_k, init='k-means++', n_init='auto', random_state=42)
clusters = final_kmeans.fit_predict(df_k)

# Descriptive names assigned to each cluster label after inspecting the profiles
cluster_names = {2: "Elite Slugger",
                 5: "High-Average Hitter",
                 3: "Contact Specialist",
                 4: "Three True Outcome Hitter",
                 0: "Low-Average Power Threat",
                 1: "Struggling Hitter"}

# Add the cluster labels back to a copy of our original dataframe
results_df = df_k.copy()
results_df['cluster'] = clusters

# Map cluster numbers to their descriptive names
results_df['cluster'] = results_df['cluster'].map(cluster_names)

# Average percentile in each metric, by cluster
cluster_profiles = results_df.groupby('cluster')[features].mean().round(1)
cluster_profiles = cluster_profiles.sort_values(by=features, ascending=False)

print("--- Cluster Profiles (Average Percentiles) ---")
cluster_profiles
```
**Cluster Profiles (Average Percentiles)**

| cluster                    |  xba | xslg | xiso | xobp |
|----------------------------|-----:|-----:|-----:|-----:|
| Elite Slugger              | 80.5 | 91.8 | 89.3 | 89.7 |
| High-Average Hitter        | 75.8 | 64.9 | 55.8 | 72.8 |
| Contact Specialist         | 59.0 | 26.0 | 18.3 | 60.4 |
| Three True Outcome Hitter  | 56.2 | 81.8 | 82.8 | 40.5 |
| Low-Average Power Threat   | 30.3 | 50.8 | 59.0 | 32.6 |
| Struggling Hitter          | 17.2 | 14.3 | 20.5 | 18.6 |
Here are the descriptions of each cluster based on their expected offensive performance percentiles.

- **Elite Slugger**: Players in this cluster are expected to excel in all major offensive categories, with particularly high percentiles in xBA, xSLG, and xOBP.
- **High-Average Hitter**: This group consists of players who may not have the same power as the elite sluggers but still maintain strong overall offensive numbers, especially in batting average and on-base percentage.
- **Contact Specialist**: Players here are characterized by their ability to make contact and avoid strikeouts, often at the expense of power numbers.
- **Three True Outcome Hitter**: This cluster includes players who slug, walk, or strike out, with less emphasis on traditional batting average.
- **Low-Average Power Threat**: These players have power potential but struggle with consistency and making contact.
- **Struggling Hitter**: This group consists of players who are below average in most offensive categories and may be at risk of losing their roster spots.
The figure below visualizes the average percentiles for each cluster across the four metrics we used. Each cluster gets its own radar chart and color, making it easy to see how the profiles compare across the metrics.
```{python}
#| code-fold: true
#| warning: false
#| fig-cap: Cluster Profiles Radar Chart
# Cluster profile averages from the table above
data = {'cluster': ['Elite Slugger', 'High-Average Hitter', 'Contact Specialist',
                    'Three True Outcome Hitter', 'Low-Average Power Threat', 'Struggling Hitter'],
        'xba':  [80.5, 75.8, 59.0, 56.2, 30.3, 17.2],
        'xslg': [91.8, 64.9, 26.0, 81.8, 50.8, 14.3],
        'xiso': [89.3, 55.8, 18.3, 82.8, 59.0, 20.5],
        'xobp': [89.7, 72.8, 60.4, 40.5, 32.6, 18.6]}
df_vis = pd.DataFrame(data)
df_vis = df_vis.set_index('cluster')

# Number of variables we're plotting
num_vars = len(df_vis.columns)

# Compute the angle for each axis
angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()

# The plot is a circle, so we "complete the loop" by appending the start to the end
angles += angles[:1]

# Labels for each axis
labels = df_vis.columns

# Create the figure and subplots
fig, axes = plt.subplots(figsize=(10, 9), nrows=3, ncols=2, subplot_kw=dict(polar=True))
axes = axes.flatten()  # Flatten the 3x2 grid of axes for easy iteration

# Define colors for each cluster
colors = plt.cm.viridis(np.linspace(0, 1, len(df_vis)))

# Plot each cluster on a separate subplot
for i, (cluster_name, row) in enumerate(df_vis.iterrows()):
    ax = axes[i]
    values = row.tolist()
    values += values[:1]  # complete the loop

    # Plot the data
    ax.plot(angles, values, color=colors[i], linewidth=2)
    ax.fill(angles, values, color=colors[i], alpha=0.25)

    # Prettify the plot
    ax.set_rlim(0, 100)  # Consistent radial limits (0-100 for percentiles)
    ax.set_yticklabels([0, 25, 50, 75, 100])
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(labels, size=8)
    ax.set_title(cluster_name, size=12, y=1.1)

# Adjust layout to prevent titles from overlapping
plt.tight_layout(pad=3.0)
plt.show()
```
Cluster Profiles Radar Chart
Let’s take a closer look at the top players in each cluster to see how they compare.
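#### Elite Sluggers

To see who lands in a cluster, we can filter `results_df` on the cluster name; here we also merge the player names back in from the original dataframe (the index of `df_k` still lines up with `df`, so an index join works):

```{python}
#| code-fold: true
#| warning: false
# Attach player names to the clustered rows (indexes still match the original df)
results_df = results_df.merge(df[['player_name']], left_index=True, right_index=True, how='left')

# Top five Elite Sluggers by expected slugging percentile
results_df[results_df['cluster'] == "Elite Slugger"].sort_values('xslg', ascending=False).head(5)
```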
For example, the top players in the “Elite Slugger” cluster are expected to have high xSLG and xOBP, indicating their ability to hit for power and get on base. On paper, [Corey Seager](https://www.baseball-reference.com/players/s/seageco01.shtml) might not have as high of an OPS as some of the other players, but his expected statistics suggest he is a top-tier slugger. That said, this list contains the usual suspects like [Shohei Ohtani](https://www.baseball-reference.com/players/o/ohtansh01.shtml) and [Aaron Judge](https://www.baseball-reference.com/players/j/judgeaa01.shtml).
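#### High-Average Hitters

The same lookup, sorted by expected batting average for this profile:

```{python}
#| code-fold: true
#| warning: false
# Top five High-Average Hitters by expected batting average percentile
results_df[results_df['cluster'] == "High-Average Hitter"].sort_values('xba', ascending=False).head(5)
```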
What do [Bo Bichette](https://baseballsavant.mlb.com/savant-player/bo-bichette-666182?stats=statcast-r-hitting-mlb), [Carlos Correa](https://baseballsavant.mlb.com/savant-player/carlos-correa-621043?stats=statcast-r-hitting-mlb), and [Gunnar Henderson](https://baseballsavant.mlb.com/savant-player/gunnar-henderson-683002?stats=statcast-r-hitting-mlb) have in common? They are all expected to have high batting averages and on-base percentages, making them valuable assets to their teams. These players may not hit for as much power as the elite sluggers, but they excel at getting on base and making contact.
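#### Contact Specialists

Filtering for the contact-first group:

```{python}
#| code-fold: true
#| warning: false
# Top five Contact Specialists by expected batting average percentile
results_df[results_df['cluster'] == "Contact Specialist"].sort_values('xba', ascending=False).head(5)
```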
The “Contact Specialist” cluster includes players like [Luis Arraez](https://baseballsavant.mlb.com/savant-player/luis-arraez-650333?stats=statcast-r-hitting-mlb) and [Nico Hoerner](https://baseballsavant.mlb.com/savant-player/nico-hoerner-663538?stats=statcast-r-hitting-mlb). These players are expected to have high batting averages and low strikeout rates, making them valuable for their ability to put the ball in play consistently. They may not hit for as much power, but their contact skills make them effective hitters.
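#### Three True Outcome Hitters

For this power-and-patience group, sorting by expected slugging is more revealing:

```{python}
#| code-fold: true
#| warning: false
# Top five Three True Outcome Hitters by expected slugging percentile
results_df[results_df['cluster'] == "Three True Outcome Hitter"].sort_values('xslg', ascending=False).head(5)
```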
The Three True Outcome Hitter. This group is significant in that it represents the new wave of hitters who rely on power, patience, and selectivity at the plate. They are the product of baseball's shift toward a more analytical approach to hitting. Players like [Cal Raleigh](https://baseballsavant.mlb.com/savant-player/cal-raleigh-663728?stats=statcast-r-hitting-mlb) and [Byron Buxton](https://baseballsavant.mlb.com/savant-player/byron-buxton-621439?stats=statcast-r-hitting-mlb) are prime examples of this cluster, as they are expected to have high slugging percentages and walk rates, but also high strikeout rates.
#### Low-Average Power Threats
```{python}
#| code-fold: true
#| warning: false
# Top five Low-Average Power Threats by expected batting average percentile
results_df[results_df['cluster'] == "Low-Average Power Threat"].sort_values('xba', ascending=False).head(5)
```
|     |  xba | xslg | xiso | xobp | cluster                  | player_name        |
|-----|-----:|-----:|-----:|-----:|--------------------------|--------------------|
| 208 | 79.0 | 52.0 | 42.0 | 19.0 | Low-Average Power Threat | Harris II, Michael |
| 75  | 60.0 | 52.0 | 46.0 | 17.0 | Low-Average Power Threat | Chourio, Jackson   |
| 534 | 60.0 | 54.0 | 50.0 | 39.0 | Low-Average Power Threat | Bellinger, Cody    |
| 272 | 58.0 | 45.0 | 40.0 | 25.0 | Low-Average Power Threat | Wagaman, Eric      |
| 385 | 49.0 | 56.0 | 58.0 | 15.0 | Low-Average Power Threat | Castellanos, Nick  |
This is a one-dimensional group of players. They provide above-average power, as seen in their xSLG and xISO percentiles. However, they are very poor at getting on base (xOBP) and hitting for average (xBA), with both metrics falling in the bottom third of the league. They are a significant risk offensively, offering power but little else. Enter [Cody Bellinger](https://baseballsavant.mlb.com/savant-player/cody-bellinger-641355?stats=statcast-r-hitting-mlb) and [Nick Castellanos](https://baseballsavant.mlb.com/savant-player/nick-castellanos-592206?stats=statcast-r-hitting-mlb), two players who have shown flashes of brilliance in the past but have struggled with consistency. They are expected to post high slugging percentages, but their batting averages may not be as high as other players'.
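#### Struggling Hitters

Finally, the same filter for the last cluster:

```{python}
#| code-fold: true
#| warning: false
# Struggling Hitters, sorted by expected batting average percentile
results_df[results_df['cluster'] == "Struggling Hitter"].sort_values('xba', ascending=False).head(5)
```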
This cluster represents the least productive offensive players. The average player in this group ranks around or below the 20th percentile in all four expected categories. They struggle to get on base, hit for average, or generate any kind of power. These players are likely experiencing significant slumps or are simply overmatched.
## Conclusion

Using K-Means clustering, we discovered six distinct player profiles based on their expected offensive performance metrics. This approach allows us to move beyond simplistic player comparisons and gain a more nuanced understanding of player archetypes. By clustering players based on their actual performance data, we can identify similarities and differences that may not be immediately apparent through traditional scouting methods.
K-means has its limitations, such as sensitivity to the initial placement of centroids and the need to specify the number of clusters beforehand. This method requires a careful selection of features and preprocessing steps to ensure meaningful results, in addition to a heuristic-based approach to determine the optimal number of clusters. However, it remains a powerful tool for uncovering patterns in data and can be applied to various domains beyond baseball.
Data science isn’t here to replace baseball wisdom but to enhance it. This kind of analysis provides a new, powerful lens through which we can appreciate the diverse skill sets of the players we love to watch.
Look at Cody Bellinger this season. The numbers show his bat speed is ticking up. Just a little. But it’s there. Data science sees it as a player beginning to migrate from one cluster to another, a tiny tremor that might signal a return to the guy who won an MVP in 2019. It’s a reminder that these profiles aren’t destiny. They’re just a moment in time.
The data points are clues, but the players are still people. And people, thank God, can still surprise you.