Swing Fast: Predicting Barrel Rate with Fast Swing Rate
scikit-learn
regression
MLB
Author
Oliver Chang
Published
July 17, 2025
Introduction
In this article, we will explore the relationship between swing speed and barrel rate in Major League Baseball (MLB) players. Barrel rate is a key metric that measures the percentage of batted balls classified as “barrels”: balls struck with a high probability of resulting in extra-base hits. According to MLB, batted-ball events classified as barrels have led to a “minimum 0.500 batting average and 1.500 slugging percentage” (“Standard Stats Glossary” 2025). By understanding how swing speed affects barrel rate, we can gain insight into player performance and potentially identify areas for improvement. We will use a linear regression model, but with a twist: we will apply a square root transformation to capture the non-linear relationship between swing speed and barrel rate. I refer you to my previous article on linear regression for a refresher on the topic.
Data Preparation and Linear Regression
As usual, most of my data comes from playing around on Baseball Savant. I like making custom leaderboards and viewing relationships between different stats. For this analysis, I focus on two features: fast swing rate and barrel rate. Fast swing rate is the percentage of a player’s swings that reach 75 mph or greater. Barrel rate is the percentage of batted balls classified as barrels, which at the minimum qualifying exit velocity of 98 mph requires a launch angle between 26 and 30 degrees.
Let’s plot some data.
Code
import pandas as pd
import plotly.express as px
import numpy as np

df = pd.read_csv("stats.csv")
p = px.scatter(
    df,
    x="fast_swing_rate",
    y="barrel_batted_rate",
    trendline="ols",
    trendline_color_override="red",
    title="Fast Swing Rate vs Barrel Rate",
    hover_data=["last_name, first_name", "fast_swing_rate", "barrel_batted_rate", "year"],
)
p.update_xaxes(title="Fast Swing Rate")
p.update_yaxes(title="Barrel Rate")
Scatter Plot of Fast Swing Rate vs Barrel Rate
Observe that the relationship between fast swing rate and barrel rate is not perfectly linear. In fact, there is a slight biphasic trend: players with a fast swing rate under 5% are clustered at low barrel rates, while players above that mark show a more linear relationship with barrel rate. This suggests there may be a threshold effect, where players who swing fast often enough can achieve higher barrel rates. The OLS regression line in red has an \(R^2\) value of 0.48, indicating that the model explains 48% of the variance in barrel rate. The coefficient for fast swing rate is 0.153, which means that for every one-unit increase in fast swing rate, barrel rate increases by 0.153 units.
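If you would rather pull those numbers out of the figure than read them off the chart, plotly keeps the underlying statsmodels fit around. A quick sketch, assuming the figure p from the code cell above:

Code
# Inspect the OLS trendline that plotly fit with statsmodels
results = px.get_trendline_results(p)
ols_fit = results.iloc[0]["px_fit_results"]  # statsmodels regression results
print(ols_fit.rsquared)  # R^2, about 0.48 here
print(ols_fit.params)    # intercept and slope for fast_swing_rate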
Square Root Transformation
What if we applied a square root transformation to fast swing rate? This transformation can help capture non-linear relationships and improve the model’s performance. Let’s see how it looks.
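The figure below was generated along these lines; a minimal sketch, assuming the same df and imports as above (the column name sqrt_fast_swing_rate is mine):

Code
# Apply a square root transformation to fast swing rate
df["sqrt_fast_swing_rate"] = np.sqrt(df["fast_swing_rate"])
p2 = px.scatter(
    df,
    x="sqrt_fast_swing_rate",
    y="barrel_batted_rate",
    trendline="ols",
    trendline_color_override="red",
    title="Sqrt Fast Swing Rate vs Barrel Rate",
)
p2.update_xaxes(title="Sqrt Fast Swing Rate")
p2.update_yaxes(title="Barrel Rate")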
Scatter Plot of Sqrt Fast Swing Rate vs Barrel Rate
Slightly more linear! The OLS regression has an \(R^2\) value of 0.49, slightly higher than the non-transformed model. The coefficient for square-root fast swing rate is 1.50, which means that for every one-unit increase in the square root of fast swing rate, barrel rate increases by 1.50 units.
Decision Trees
Let’s go back to the original fast swing rate and barrel rate data. The biphasic trend suggests a simple linear regression may not be the best fit. We saw that a cluster of batters with a fast swing rate under 5% had low barrel rates. But is there a genuine threshold effect? A decision tree regression can help us find out.
Decision trees are our first non-parametric model to make an appearance on this blog! They are a type of supervised learning algorithm that can be used for both classification and regression tasks. Decision trees work by learning simple decision rules inferred from the data features.
Let \(T\) be a decision tree that consists of a set of nodes. Each internal (non-leaf) node \(m\) represents a test on a feature \(j\) with a split at point \(s_m\). For a data point \(x\), if its \(j\)-th feature \(x_j < s_m\), it goes to the left child node; otherwise, it goes to the right child node. The terminal nodes (leaves) of the tree represent the predicted value for the target variable.
For a given feature space \(X \subseteq \mathbb{R}^d\), the tree defines a partition into \(M\) regions \(R_1, R_2, \ldots, R_M\) such that \(\bigcup_{m=1}^{M}R_m = X\) and \(R_i \cap R_j = \emptyset\) for \(i\neq j\).
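To make the routing concrete, here is a tiny illustrative sketch of how a point descends to a leaf. The nodes are hypothetical and hard-coded, not a fitted model:

Code
# Sketch: route a point through a decision tree (hypothetical hard-coded nodes)
def predict_one(node, x):
    if "value" in node:  # leaf: return its stored prediction
        return node["value"]
    if x[node["j"]] < node["s"]:  # internal node: test feature j at split s
        return predict_one(node["left"], x)
    return predict_one(node["right"], x)

# A depth-1 tree: one test on feature 0, two leaf regions R_1 and R_2
tree = {"j": 0, "s": 5.5, "left": {"value": 0}, "right": {"value": 1}}
print(predict_one(tree, [4.2]))  # -> 0, since 4.2 < 5.5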
Classification Trees
In classification tasks, the goal is to predict a categorical label. The decision tree algorithm recursively splits the data into subsets based on feature values, aiming to create pure subsets where all samples in a subset belong to the same class. The splitting criterion is often based on measures like Gini impurity or information gain.
For a leaf node representing region \(R_m\) let \(p_{mk}\) be the proportion of samples in \(R_m\) that belong to class \(k\). \(p_{mk} = \frac{1}{N_m}\sum_{x_i\in R_m} I(y_i=k).\) The predicted class for a sample \(x\) that falls into region \(R_m\) is given by the class with the highest proportion: \(\hat{y}(x) = \arg\max_k p_{mk}.\)
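In code, the leaf prediction is just a majority vote over the training labels that landed in that region; a tiny sketch with made-up labels:

Code
import numpy as np

# Hypothetical labels of the training samples that fell into region R_m
leaf_labels = np.array([0, 1, 1, 1, 0])
p_mk = np.bincount(leaf_labels) / len(leaf_labels)  # class proportions p_mk
print(p_mk)             # [0.4 0.6]
print(np.argmax(p_mk))  # predicted class: 1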
Let’s implement the algorithm for finding the best decision split. The “best split” is the one that results in the highest information gain. Suppose we are given a 2D array data where each row is a sample: the last column of data is the class label (0 or 1), and the preceding columns are the feature values. We are also given feature_indices, a list of column indices to consider for splitting.
Splitting Criterion
We can use functions like entropy or Gini impurity to measure the quality of a split. The goal is to find the feature and threshold that maximize the information gain, which is the reduction in entropy or impurity after the split. You can think of the split as \(\theta\), where \(\theta\) is a threshold value for a feature \(j\). The split divides the data into two subsets: \(D_L = \{x_i \mid x_{ij} < \theta\}\) and \(D_R = \{x_i \mid x_{ij} \geq \theta\}\). We want to find the split that maximizes the information gain: \[IG(D, \theta) = H(D) - \left( \frac{|D_L|}{|D|} H(D_L) + \frac{|D_R|}{|D|} H(D_R) \right)\] where \(H(D)\) is the entropy of the dataset \(D\). For example, a parent node with labels \(\{0, 0, 1, 1\}\) has \(H(D) = 1\) bit; a split that separates it into \(\{0, 0\}\) and \(\{1, 1\}\) leaves both children with zero entropy, for an information gain of 1 bit.
find_best_split(data, feature_indices) iterates through each specified feature and each unique value within that feature as a potential split threshold. For each candidate split, it calculates the information gain. The function returns a tuple containing the index of the best feature to split on and the best threshold value.
import matplotlib.pyplot as plt

def entropy(labels):
    """Calculates the entropy for a set of labels."""
    if len(labels) == 0:
        return 0
    counts = np.bincount(labels.astype(int))
    probabilities = counts[counts > 0] / len(labels)
    return -np.sum(probabilities * np.log2(probabilities))

def find_best_split(data, feature_indices):
    labels = data[:, -1]
    best_info_gain = -1
    best_split = None
    parent_entropy = entropy(labels)
    for f_i in feature_indices:
        feature_data = data[:, f_i]
        unique_values = np.unique(feature_data)
        for threshold in unique_values:
            left_indices = feature_data < threshold
            right_indices = feature_data >= threshold
            if np.all(left_indices) or np.all(right_indices):
                continue
            left_labels = labels[left_indices]
            right_labels = labels[right_indices]
            # Calculate the weighted average entropy
            p_left = len(left_labels) / len(labels)
            p_right = len(right_labels) / len(labels)
            weighted_entropy = p_left * entropy(left_labels) + p_right * entropy(right_labels)
            # Calculate information gain
            info_gain = parent_entropy - weighted_entropy
            # Update the best split if the current one is better
            if info_gain > best_info_gain:
                best_info_gain = info_gain
                best_split = (f_i, threshold)
    return best_split

sample_data = np.array(
    [[2.5, 3.0, 0], [5.1, 3.5, 0], [3.5, 1.4, 0], [6.2, 2.8, 1], [4.7, 3.2, 0],
     [6.0, 2.0, 1], [5.5, 1.5, 1], [6.2, 1.0, 1], [5.6, 2.5, 1]],
)
feature_idx, threshold = find_best_split(sample_data, [0, 1])
print(f"Best feature index: {feature_idx}, Best threshold: {threshold}")

plt.scatter(sample_data[:, 0], sample_data[:, 1], c=sample_data[:, 2], cmap="coolwarm")
plt.axvline(x=threshold, color="blue", linestyle="--", label=f"Split at {threshold:.2f}")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Sample Data for Decision Tree Split")
plt.legend()
plt.show()
Best feature index: 0, Best threshold: 5.5
The function find_best_split finds the feature and threshold that maximize information gain (equivalently, that minimize the weighted child entropy). The output shows that the best feature index is 0 (the first feature) with a threshold of 5.5, and the plot visualizes the sample data and the split line.
find_best_split iterates through each feature in the dataset and treats each unique value of that feature as a potential split threshold. For each threshold, it computes boolean masks for the left and right indices; if one of the masks is all True, the split would be degenerate and is skipped. It then gathers the labels for the left and right splits and calculates the weighted average entropy of the children. If the current split yields a higher information gain than the best split so far, it becomes the new best split. Finally, the function returns the index of the best feature to split on and the best threshold value.
Regression Trees
Regression trees are used to predict continuous values. The decision tree algorithm recursively splits the data into subsets based on feature values, aiming to create pure subsets where all samples in a subset have similar target values. The splitting criterion is often based on measures like mean squared error (MSE) or mean absolute error (MAE). The prediction for a new input \(x\) is given by \(f(x) = \sum_{m=1}^M c_m I(x\in R_m).\) The value \(c_m\) is chosen to minimize the sum of squared errors within the region, which is achieved by the region’s mean: \(c_m = \operatorname{argmin}_c\sum_{x_i \in R_m}(y_i - c)^2 = \frac{1}{N_m}\sum_{x_i \in R_m} y_i.\)
Splitting Criterion
For regression tasks, the splitting criterion is based on minimizing the variance of the target variable in the resulting subsets. Given a node and a split candidate \(\theta=(j,s)\) on feature \(j\) at point \(s\), the data is partitioned into two sets: \(R_{\text{left}}(\theta)=\{(x,y) \mid x_j\leq s\}\) and \(R_{\text{right}}(\theta)=\{(x,y) \mid x_j > s\}\).
The goal is to find the optimal split \(\theta=(j,s)\) that minimizes the sum of squared errors over the two subsets. The objective function is: \(\sum_{x_i \in R_{\text{left}}(\theta)}(y_i - \bar{y}_{\text{left}})^2 + \sum_{x_i \in R_{\text{right}}(\theta)}(y_i - \bar{y}_{\text{right}})^2.\)
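As a regression counterpart to find_best_split above, here is a minimal single-feature sketch that scans thresholds and keeps the one minimizing this objective. It is an illustration under the notation above, not the code used later:

Code
def find_best_regression_split(x, y):
    """Return the threshold on x minimizing the summed squared error
    of the two resulting regions (single-feature version)."""
    best_sse, best_threshold = np.inf, None
    for threshold in np.unique(x):
        left, right = y[x < threshold], y[x >= threshold]
        if len(left) == 0 or len(right) == 0:
            continue  # degenerate split, skip
        # Each region predicts its mean, so its error is the within-region SSE
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_sse, best_threshold = sse, threshold
    return best_threshold

Each leaf then predicts the mean target value of the samples it contains, exactly the \(c_m\) above.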
Grouping Players with Decision Trees
We can use a decision tree regression model to group players based on their fast swing rate and barrel rate. Below, I fit a decision tree regressor to the data to find the fast swing rate threshold that best separates players into performance groups. Since we are interested in a single threshold, we set the maximum depth of the tree to 1, so the tree makes exactly one split: the optimal cut point for fast swing rate.
Code
from sklearn.tree import DecisionTreeRegressor
from scipy.stats import ttest_ind

X_feature_name = "fast_swing_rate"
y_feature_name = "barrel_batted_rate"
X = df[[X_feature_name]]
y = df[y_feature_name]

model = DecisionTreeRegressor(max_depth=1, random_state=42)
model.fit(X, y)
threshold_X = model.tree_.threshold[0]
print(f"The optimal threshold 'X' found by the Decision Tree is: {threshold_X:.2f}%")
The optimal threshold 'X' found by the Decision Tree is: 35.35%
Our decision tree found a fast swing rate threshold of approximately 35.35%. What does the average barrel rate look like on either side of this cut? Let’s calculate it for each group.
Code
below_threshold = df[df["fast_swing_rate"] < threshold_X]
at_or_above_threshold = df[df["fast_swing_rate"] >= threshold_X]

# Calculate the average barrel rate for each group
avg_barrel_rate_below = below_threshold["barrel_batted_rate"].mean()
avg_barrel_rate_above = at_or_above_threshold["barrel_batted_rate"].mean()

print(f"Average barrel rate for players below {threshold_X:.2f}%: {avg_barrel_rate_below:.1f}%")
print(f"Average barrel rate for players at or above {threshold_X:.2f}%: {avg_barrel_rate_above:.1f}%")
print(f"The difference in average barrel rate is: {(avg_barrel_rate_above - avg_barrel_rate_below):.1f}%")

# Welch's t-test between the two groups
stat, pvalue = ttest_ind(
    below_threshold["barrel_batted_rate"],
    at_or_above_threshold["barrel_batted_rate"],
    equal_var=False,
    nan_policy="omit",
)
print(f"\nT-test p-value: {pvalue:.2e}")
if pvalue < 0.05:
    print("The difference between the two groups is statistically significant.")
else:
    print("The difference between the two groups is not statistically significant.")
Average barrel rate for players below 35.35%: 7.9%
Average barrel rate for players at or above 35.35%: 13.3%
The difference in average barrel rate is: 5.4%
T-test p-value: 2.08e-28
The difference between the two groups is statistically significant.
A difference of 5.4 percentage points in barrel rate is pretty stark! A Welch’s t-test confirms that the difference between the two groups is statistically significant, with a p-value of 2.08e-28. This suggests a meaningful gap in barrel rate between players below and above the 35.35% fast swing rate threshold.
Let’s plot the cutoff threshold and the average barrel rates for each group.
Code
fig = px.scatter(
    df,
    x="fast_swing_rate",
    y="barrel_batted_rate",
    title="Decision Tree Regression of Fast Swing Rate vs Barrel Rate",
    hover_data=["last_name, first_name", "fast_swing_rate", "barrel_batted_rate", "year"],
)
fig.update_xaxes(title="Fast Swing Rate")
fig.update_yaxes(title="Barrel Rate")
fig.add_shape(
    type="line",
    x0=threshold_X, y0=0,
    x1=threshold_X, y1=df["barrel_batted_rate"].max(),
    line=dict(color="blue", width=2, dash="dash"),
    name="Decision Tree Threshold",
)
fig.add_annotation(
    x=threshold_X,
    y=df["barrel_batted_rate"].max(),
    text=f"Threshold: {threshold_X:.2f}%",
    showarrow=True,
    arrowhead=2,
    ax=0, ay=-40,
    font=dict(color="green"),
)
fig.show()
Decision Tree Regression Plot of Fast Swing Rate vs Barrel Rate
A single split might not be enough to capture the complexity of the data. Let’s try a deeper decision tree with a maximum depth of 2. This will allow us to create more groups based on fast swing rate and barrel rate.
Code
import matplotlib.pyplot as plt

X_feature_name = "fast_swing_rate"
y_feature_name = "barrel_batted_rate"
X = df[[X_feature_name]]
y = df[y_feature_name]

tree_model_d2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_model_d2.fit(X, y)

# Assign each player to the leaf (group) it lands in
df["group"] = tree_model_d2.apply(X)

group_analysis = df.groupby("group").agg(
    min_swing_rate=(X_feature_name, "min"),
    max_swing_rate=(X_feature_name, "max"),
    avg_barrel_rate=(y_feature_name, "mean"),
    player_count=("player_id", "count"),
).sort_values(by="min_swing_rate").reset_index()

print("--- Analysis of Player Groups (from depth-2 tree) ---")
print(group_analysis)
--- Analysis of Player Groups (from depth-2 tree) ---
group min_swing_rate max_swing_rate avg_barrel_rate player_count
0 2 0.2 7.9 5.116867 83
1 3 8.1 35.2 8.957798 218
2 5 35.5 52.6 11.811429 70
3 6 52.7 78.0 15.351020 49
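To read the actual split points the depth-2 tree chose, rather than inferring them from the group boundaries, scikit-learn can print the fitted tree; a quick sketch, assuming tree_model_d2 from above:

Code
from sklearn.tree import export_text

# Print the learned structure (splits and leaf values) of the fitted depth-2 tree
print(export_text(tree_model_d2, feature_names=["fast_swing_rate"]))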
In the first pile were the monks, the 83 players whose fast swing rate was below 8%; fewer than one swing in twelve crossed 75 mph. Their reward for this monastic restraint? A barrel rate of about 5%. Then you had the great mass of players, the 218 guys who followed the rules and swung fast at a “normal” rate. They did a little better, barreling the ball about 9% of the time. This is where it should have ended. But the machine kept splitting.
The model found a group of 70 hitters who were…antsy. More than a third of their swings were fast swings. The old scout would have benched them for their lack of discipline. The machine, however, noticed their barrel rate had jumped to nearly 12%. Interesting.
Then it found the last group. A tiny cohort of 49 players who were, by any traditional measure, completely out of their minds. These guys let it rip on everything; more than half of their swings crossed the 75 mph mark. They were the hitters your pitching coach warned you about, the ones who would chase a ball if you rolled it to the plate. And what was their reward for this utter lack of discipline? A barrel rate of over 15%. An elite, All-Star-level number. Some outliers were present, though: take 2025 Julio Rodriguez, for example, who has a fast swing rate of 60% but a barrel rate of just 7.7%.
Below is a bar chart that visualizes the average barrel rate for each group. The x-axis shows the fast swing rate ranges, and the y-axis shows the average barrel rate for each group. The groups are color-coded based on their fast swing rate ranges.
Code
plt.style.use('seaborn-v0_8-whitegrid')
fig, ax = plt.subplots(figsize=(9, 4))

# Create labels for the x-axis based on the swing rate ranges
group_labels = [
    f"{row.min_swing_rate:.2f}% - {row.max_swing_rate:.2f}%\n(n={row.player_count})"
    for index, row in group_analysis.iterrows()
]

# Create the bar chart
bars = ax.bar(
    group_labels,
    group_analysis['avg_barrel_rate'],
    color=['#0a7ff5', '#e6f0fa', '#fae6e6', '#fa5f5f'],
    edgecolor='black',
)

# Add labels and titles
ax.set_xlabel('Fast Swing Rate', fontsize=12)
ax.set_ylabel('Average Barrel Rate (%)', fontsize=12)
ax.set_title('Player Barrel Rate Groups by Fast Swing Rate', fontsize=16, pad=20)
ax.tick_params(axis='x', labelsize=10)

# Add text labels on top of the bars
for bar in bars:
    yval = bar.get_height()
    ax.text(bar.get_x() + bar.get_width() / 2.0, yval + 0.1, f'{yval:.2f}%',
            ha='center', va='bottom', fontsize=12)

plt.tight_layout()
plt.savefig('decision_tree_3_groups.png')
plt.show()
Decision Tree Regression Plot of Fast Swing Rate vs Barrel Rate with Average Barrel Rates
Code
# t-tests for the four groups
# Extract barrel rates for each group (leaf ids from the depth-2 tree)
group_0 = df[df['group'] == 2]['barrel_batted_rate']
group_1 = df[df['group'] == 3]['barrel_batted_rate']
group_2 = df[df['group'] == 5]['barrel_batted_rate']
group_3 = df[df['group'] == 6]['barrel_batted_rate']

# Perform Welch's t-tests between adjacent groups
ttest_01 = ttest_ind(group_0, group_1, equal_var=False)
ttest_12 = ttest_ind(group_1, group_2, equal_var=False)
ttest_23 = ttest_ind(group_2, group_3, equal_var=False)

# Display results
print("--- T-Test Results ---")
print(f"Group 0 vs Group 1: t-statistic = {ttest_01.statistic:.2f}, p-value = {ttest_01.pvalue:.4f}")
print(f"Group 1 vs Group 2: t-statistic = {ttest_12.statistic:.2f}, p-value = {ttest_12.pvalue:.4f}")
print(f"Group 2 vs Group 3: t-statistic = {ttest_23.statistic:.2f}, p-value = {ttest_23.pvalue:.4f}")
--- T-Test Results ---
Group 0 vs Group 1: t-statistic = -10.56, p-value = 0.0000
Group 1 vs Group 2: t-statistic = -6.97, p-value = 0.0000
Group 2 vs Group 3: t-statistic = -5.07, p-value = 0.0000
Conclusion
Of course, a walk was as good as a hit. The quants had proven that a decade ago, and it was the one piece of common knowledge that was actually, you know, true. Getting on base for free was a foundational good.
But that truth had created a dangerous piece of collateral wisdom: that patience, in all its forms, was a virtue. That the same mindset that let you take ball four was the one you should use when a pitcher threw you something hittable.
The data, however, saw a clean line in the sand. It wasn’t about patience versus aggression; it was about knowing when to be which. The analysis didn’t argue against the value of a walk—it argued against the crippling passivity that had been mistaken for discipline.
The most valuable players weren’t just patient or just aggressive; they were both. They had the discipline of a monk until the moment they decided to swing. And at that moment, they had the fury of a barbarian. The rest of the league, stuck in the middle, was playing the wrong game entirely.
Of course, barrels are not the only thing that matters in baseball. Many other factors contribute to a player’s success, such as plate discipline, contact rate, and defensive skill. Still, understanding the relationship between swing speed and barrel rate provides valuable insight into player performance and can help teams make better decisions. I leave you with a clip of Julio Rodriguez: a fast swinger who hasn’t quite found his stride in 2025. He is still a young player with a lot of potential, and with the right adjustments he could become one of the best hitters in the league.