Swing Fast: Predicting Barrel Rate with Fast Swing Rate

scikit-learn
regression
MLB
Author

Oliver Chang

Published

July 17, 2025

Introduction

In this article, we will explore the relationship between swing speed and barrel rate in Major League Baseball (MLB) players. Barrel rate is a key metric that measures the percentage of batted balls classified as “barrels,” which are batted balls with a high probability of resulting in extra-base hits. According to MLB, batted-ball events classified as barrels have led to a “minimum 0.500 batting average and 1.500 slugging percentage” (“Standard Stats Glossary” 2025). By understanding how swing speed impacts barrel rate, we can gain insights into player performance and potentially identify areas for improvement. We will use a linear regression model, but with an added twist: we will apply a square root transformation to capture the non-linear relationship between swing speed and barrel rate. I refer you to my previous article on linear regressions for a refresher on the topic.

Data Preparation and Linear Regression

As usual, most of my data comes from playing around on Baseball Savant. I like making custom leaderboards and viewing relationships between different stats. For this analysis, I focus on two main features: fast swing rate and barrel rate. Fast swing rate is the percentage of a player’s swings that reach 75 MPH or greater. Barrel rate is the percentage of batted balls with an exit velocity of 98 MPH or greater and an ideal launch angle of 25-31 degrees.
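
The leaderboard CSV used below already has both rates pre-computed, but as a rough illustration, here is a minimal sketch of how such rates could be derived from per-event data. The column names (swing_speed, launch_speed, launch_angle) and the values are hypothetical, and the real barrel definition widens the launch-angle window as exit velocity increases.

Code
import pandas as pd

# Hypothetical per-event data for a single hitter (illustrative only)
events = pd.DataFrame({
    "swing_speed": [68.0, 77.5, 80.1, 71.2, 76.3],     # bat speed (mph) per swing
    "launch_speed": [92.0, 101.3, 99.5, 85.0, 104.2],  # exit velocity (mph) per batted ball
    "launch_angle": [12, 27, 29, 45, 26],               # launch angle (degrees)
})

# Share of swings at 75 MPH or harder
fast_swing_rate = (events["swing_speed"] >= 75).mean() * 100

# Simplified barrel flag: 98+ MPH exit velocity in the 25-31 degree window
is_barrel = (events["launch_speed"] >= 98) & events["launch_angle"].between(25, 31)
barrel_rate = is_barrel.mean() * 100

print(f"Fast swing rate: {fast_swing_rate:.1f}%, Barrel rate: {barrel_rate:.1f}%")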

Let’s plot some data.

Code
import pandas as pd
import plotly.express as px
import numpy as np

df = pd.read_csv("stats.csv")
p = px.scatter(
    df, 
    x="fast_swing_rate", 
    y="barrel_batted_rate", 
    trendline="ols", 
    trendline_color_override="red",
    title="Swing Speed Rate vs Barrel Rate",
    hover_data=["last_name, first_name", "fast_swing_rate", "barrel_batted_rate", "year"],
)
p.update_xaxes(title="Fast Speed Rate")
p.update_yaxes(title="Barrel Rate")

Scatter Plot of Fast Swing Rate vs Barrel Rate

Observe that the relationship between fast swing rate and barrel rate is not perfectly linear. In fact, there is a slight biphasic trend: players with a fast swing rate below 5% are clustered at low barrel rates, while players with a fast swing rate above 5% show a more linear relationship with barrel rate. This suggests a possible threshold effect, where players who swing fast often enough can achieve higher barrel rates. The OLS regression line in red has an \(R^2\) value of 0.48, indicating that the model explains 48% of the variance in barrel rate. The coefficient for fast swing rate is 0.153, which means that for every one-unit increase in fast swing rate, barrel rate increases by about 0.153 percentage points.
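
For reference, here is a quick sketch of fitting the same line with scikit-learn. The \(R^2\) and slope quoted above come from plotly’s OLS trendline; the exact numbers will depend on the leaderboard export you download.

Code
from sklearn.linear_model import LinearRegression

# Fit barrel rate on fast swing rate with ordinary least squares
X = df[["fast_swing_rate"]]
y = df["barrel_batted_rate"]

lin = LinearRegression().fit(X, y)
print(f"Slope: {lin.coef_[0]:.3f}, Intercept: {lin.intercept_:.3f}")
print(f"R^2: {lin.score(X, y):.2f}")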

Square Root Transformation

What if we applied a square root transformation to fast swing rate? This transformation can help capture non-linear relationships and improve the model’s performance. Let’s see how it looks.

Code
df["sqrt_fast_swing_rate"] = np.sqrt(df["fast_swing_rate"])

p = px.scatter(
    df, 
    x="sqrt_fast_swing_rate", 
    y="barrel_batted_rate", 
    trendline="ols", 
    trendline_color_override="red",
    title="Swing Speed Rate vs Barrel Rate",
    hover_data=["last_name, first_name", "sqrt_fast_swing_rate", "barrel_batted_rate", "year"],
)
p.update_xaxes(title="Sqrt Fast Speed Rate")
p.update_yaxes(title="Barrel Rate")

Scatter Plot of Sqrt Fast Swing Rate vs Barrel Rate

Slightly more linear! The OLS regression has an \(R^2\) value of 0.49, which is slightly higher than the untransformed model. The coefficient for the square root of fast swing rate is 1.50, which means that for every one-unit increase in the square root of fast swing rate, barrel rate increases by about 1.50 percentage points.
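
One way to read that coefficient back on the original scale: if barrel rate \(\approx b_0 + 1.50\sqrt{\text{fast swing rate}}\), then the marginal effect of one more point of fast swing rate is \(1.50 / (2\sqrt{\text{fast swing rate}})\), which shrinks as the rate grows. A rough illustration using the quoted coefficient (the intercept drops out of the derivative):

Code
# Implied marginal effect of one more point of fast swing rate,
# using the quoted sqrt-model coefficient of 1.50
for rate in [10, 25, 50]:
    marginal = 1.50 / (2 * np.sqrt(rate))
    print(f"At {rate}% fast swing rate, +1 point adds ~{marginal:.2f} points of barrel rate")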

Decision Trees

Let’s go back to the original fast swing rate and barrel rate data. The biphasic trend suggests a simple linear regression may not be the best fit. We saw that a bunch of batters with a fast swing rate less than 5% had a low barrel rate. However, is there a threshold effect? A decision tree regression can help us identify if there is a threshold effect in the data.

Decision trees are our first non-parametric model to make an appearance on this blog! They are a type of supervised learning algorithm that can be used for both classification and regression tasks. Decision trees work by learning simple decision rules inferred from the data features.

Let \(T\) be a decision tree that consists of a set of nodes. Each internal (non-leaf) node \(m\) represents a test on a feature \(j\) with a split at point \(s_m\). For a data point \(x\), if its \(j\)-th feature \(x_j < s_m\), it goes to the left child node; otherwise, it goes to the right child node. The terminal nodes (leaves) of the tree represent the predicted value for the target variable.

For a given feature space \(X \in \mathbb{R}^d\), the tree defines a partition into \(M\) regions \(R_1, R_2, \ldots, R_M\) such that \(\cup_{m=1}^{M}R_m = X\) and \(R_i \cap R_j = \emptyset\) for \(i\neq j\).
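
To make the structure concrete, here is a minimal sketch of such a tree as nested dictionaries, with a routing function that sends a sample left or right at each internal node. All numbers here are made up purely for illustration.

Code
def predict_from_tree(node, x):
    """Route a sample x down the tree until it reaches a leaf."""
    if "value" in node:                      # leaf: return the region's prediction
        return node["value"]
    j, s = node["feature"], node["split"]    # internal node: test x_j < s
    child = node["left"] if x[j] < s else node["right"]
    return predict_from_tree(child, x)

# A toy tree with three leaf regions, all split on feature 0
toy_tree = {
    "feature": 0, "split": 35.0,
    "left":  {"value": 7.9},                           # R_1: x_0 < 35
    "right": {"feature": 0, "split": 52.0,
              "left":  {"value": 11.8},                # R_2: 35 <= x_0 < 52
              "right": {"value": 15.4}},               # R_3: x_0 >= 52
}

print(predict_from_tree(toy_tree, [28.0]))   # falls in R_1 -> 7.9
print(predict_from_tree(toy_tree, [60.0]))   # falls in R_3 -> 15.4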

Classification Trees

In classification tasks, the goal is to predict a categorical label. The decision tree algorithm recursively splits the data into subsets based on feature values, aiming to create pure subsets where all samples in a subset belong to the same class. The splitting criterion is often based on measures like Gini impurity or information gain.

For a leaf node representing region \(R_m\), let \(p_{mk}\) be the proportion of samples in \(R_m\) that belong to class \(k\): \(p_{mk} = \frac{1}{N_m}\sum_{x_i\in R_m} I(y_i=k).\) The predicted class for a sample \(x\) that falls into region \(R_m\) is the class with the highest proportion: \(\hat{y}(x) = \arg\max_k p_{mk}.\)
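
A tiny numpy sketch of that leaf rule, with made-up labels for one region:

Code
import numpy as np

leaf_labels = np.array([0, 1, 1, 1, 0, 1])          # labels of samples in R_m
p_mk = np.bincount(leaf_labels) / len(leaf_labels)  # class proportions p_mk
print(p_mk)             # [0.333... 0.666...]
print(np.argmax(p_mk))  # predicted class: 1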

Let’s implement the algorithm for finding the best decision split. The “best split” is the one that results in the highest information gain. We are given a 2D array data where each row is a sample: the last column of data is the class label (0 or 1), and the preceding columns are the feature values. We are also given feature_indices, a list of column indices to consider for splitting.

Splitting Criterion

We can use functions like entropy or Gini impurity to measure the quality of a split. The goal is to find the feature and threshold that maximizes the information gain, which is the reduction in entropy or impurity after the split. You can think of the split as \(\theta\), where \(\theta\) is a threshold value for a feature \(j\). The split divides the data into two subsets: \(D_L = \{x_i | x_{ij} < \theta\}\) and \(D_R = \{x_i | x_{ij} \geq \theta\}\). We want to find the split that maximizes the information gain: \[IG(D, \theta) = H(D) - \left( \frac{|D_L|}{|D|} H(D_L) + \frac{|D_R|}{|D|} H(D_R) \right)\] where \(H(D)\) is the entropy of the dataset \(D\).

find_best_split(data, feature_indices) iterates through each specified feature and each unique value within that feature as a potential split threshold. For each candidate split, it calculates the information gain. The function returns a tuple containing the index of the best feature to split on and the best threshold value.

import matplotlib.pyplot as plt

def entropy(labels):
    """Calculates the entropy for a set of labels."""
    if len(labels) == 0:
        return 0

    counts = np.bincount(labels.astype(int))
    probabilities = counts[counts > 0] / len(labels)
    return -np.sum(probabilities * np.log2(probabilities))

def find_best_split(data, feature_indices):
    labels = data[:, -1]
    best_info_gain = -1
    best_split = None
    parent_entropy = entropy(labels)
    for f_i in feature_indices:
        feature_data = data[:, f_i]
        unique_values = np.unique(feature_data)
        for threshold in unique_values:
            left_indices = feature_data < threshold
            right_indices = feature_data >= threshold

            if np.all(left_indices) or np.all(right_indices):
                continue

            left_labels = labels[left_indices]
            right_labels = labels[right_indices]

            # Calculate the weighted average entropy
            p_left = len(left_labels) / len(labels)
            p_right = len(right_labels) / len(labels)
            weighted_entropy = p_left * entropy(left_labels) + p_right * entropy(right_labels)
            
            # Calculate information gain
            info_gain = parent_entropy - weighted_entropy

            # Update the best split if the current one is better
            if info_gain > best_info_gain:
                best_info_gain = info_gain
                best_split = (f_i, threshold)

    return best_split

sample_data = np.array(
    [[2.5, 3.0, 0],
    [5.1, 3.5, 0],
    [3.5, 1.4, 0],
    [6.2, 2.8, 1],
    [4.7, 3.2, 0],
    [6.0, 2.0, 1],
    [5.5, 1.5, 1],
    [6.2, 1.0, 1],
    [5.6, 2.5, 1]],
)

feature_indx, threshold = find_best_split(sample_data, [0, 1])

print(f"Best feature index: {feature_indx}, Best threshold: {threshold}")

plt.scatter(sample_data[:, 0], sample_data[:, 1], c=sample_data[:, 2], cmap="coolwarm")
plt.axvline(x=threshold, color="blue", linestyle="--", label=f"Split at {threshold:.2f}")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Sample Data for Decision Tree Split")
plt.legend()
plt.show()
Best feature index: 0, Best threshold: 5.5

The function find_best_split finds the feature and threshold that maximize information gain (equivalently, that minimize the weighted entropy of the resulting children). The output shows that the best feature index is 0 (the first feature) with a threshold of 5.5, which separates the two classes perfectly. The plot visualizes the sample data and the split line.

find_best_split iterates through each feature in the dataset and each unique value within that feature as a potential split threshold. It builds boolean masks for the left and right indices of that candidate value. If one of the masks is all True, one side of the split would be empty, so the candidate is skipped. It then gathers the labels for the left and right splits and calculates the weighted average entropy. If the current split yields a higher information gain than the best split so far, the best split is updated. Finally, the function returns the index of the best feature to split on and the best threshold value.
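
As a quick sanity check, we can reuse the entropy helper and the threshold found above: both children of the chosen split are pure, so the information gain equals the parent entropy (about 0.991 bits for four 0s and five 1s).

Code
labels = sample_data[:, -1]
left = labels[sample_data[:, 0] < threshold]
right = labels[sample_data[:, 0] >= threshold]

print(f"Parent entropy: {entropy(labels):.3f}")                                    # ~0.991
print(f"Left entropy: {entropy(left):.3f}, Right entropy: {entropy(right):.3f}")   # 0.000 each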

Regression Trees

Regression trees are used to predict continuous values. The decision tree algorithm recursively splits the data into subsets based on feature values, aiming to create subsets where all samples have similar target values. The splitting criterion is often based on measures like mean squared error (MSE) or mean absolute error (MAE). The prediction for a new input \(x\) is given by \(f(x) = \sum_{m=1}^M c_m I(x\in R_m).\) The value \(c_m\) is chosen to minimize the sum of squared errors within the region: \(c_m = \text{argmin}_c\sum_{x_i \in R_m}(y_i - c)^2 = \frac{1}{N_m}\sum_{x_i \in R_m} y_i.\)

Splitting Criterion

For regression tasks, the splitting criterion is based on minimizing the variance of the target variable in the resulting subsets. At a given node \(n\), a split candidate \(\theta=(j,s)\) on feature \(j\) at point \(s\) partitions the data into two sets: \(R_{\text{left}}(\theta)=\{(x,y) \mid x_j\leq s\}\) and \(R_{\text{right}}(\theta)=\{(x,y) \mid x_j > s\}\).

The goal is to find the optimal split \(\theta=(j,s)\) that minimizes the weighted sum of squared errors of the two subsets. The objective function is: \(\sum_{x_i \in R_{\text{left}}(\theta)}(y_i - \bar{y}_{\text{left}})^2 + \sum_{x_i \in R_{\text{right}}(\theta)}(y_i - \bar{y}_{\text{right}})^2.\)
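
Here is a minimal regression analogue of find_best_split under that objective: try every (feature, split point) pair and keep the one with the smallest total within-region sum of squared errors. This is an illustrative sketch, not the exact routine scikit-learn uses internally.

Code
def find_best_regression_split(X, y, feature_indices):
    """Return the (feature index, threshold) pair minimizing total within-region SSE."""
    best_sse, best_split = np.inf, None
    for j in feature_indices:
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue  # skip splits that leave one side empty
            # Sum of squared errors around each region's mean
            sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if sse < best_sse:
                best_sse, best_split = sse, (j, s)
    return best_split

# Toy data with an obvious jump in the target around x = 10
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([5.0, 6.0, 5.5, 14.0, 15.0, 14.5])
print(find_best_regression_split(X_toy, y_toy, [0]))  # -> (0, 10.0)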

Grouping Players with Decision Trees

We can use a decision tree regression model to group players based on their fast swing rate and barrel rate. Below, I fit a decision tree regressor to the fast swing rate and barrel rate data to find the optimal fast swing rate threshold that separates players into different performance groups. Since we are interested in a single threshold, we set the maximum depth of the tree to 1, so the tree makes exactly one split: the optimal cutoff for fast swing rate.

Code
from sklearn.tree import DecisionTreeRegressor
from scipy.stats import ttest_ind

X_feature_name = "fast_swing_rate"
y_feature_name = "barrel_batted_rate"

X = df[[X_feature_name]]
y = df[y_feature_name]

model = DecisionTreeRegressor(max_depth=1, random_state=42)
model.fit(X, y)

threshold_X = model.tree_.threshold[0]
print(f"The optimal threshold 'X' found by the Decision Tree is: {threshold_X:.2f}%")
The optimal threshold 'X' found by the Decision Tree is: 35.35%

Our decision tree found that the threshold for fast swing rate is approximately 35.35%. What does the average barrel rate look like for players below and above this threshold? Let’s calculate the average barrel rate for players below and above the threshold.

Code
below_threshold = df[df["fast_swing_rate"] < threshold_X]
at_or_above_threshold = df[df["fast_swing_rate"] >= threshold_X]

# Calculate the average barrel rate for each group
avg_barrel_rate_below = below_threshold["barrel_batted_rate"].mean()
avg_barrel_rate_above = at_or_above_threshold["barrel_batted_rate"].mean()

print(f"Average barrel rate for players below {threshold_X:.2f}%: {avg_barrel_rate_below:.1f}%")
print(f"Average barrel rate for players at or above {threshold_X:.2f}%: {avg_barrel_rate_above:.1f}%")
print(f"The drop in average barrel rate is: {(avg_barrel_rate_above - avg_barrel_rate_below):.1f}%")

stat, pvalue = ttest_ind(below_threshold["barrel_batted_rate"], at_or_above_threshold["barrel_batted_rate"], equal_var=False, nan_policy='omit')

print(f"\nT-test p-value: {pvalue:.2e}")
if pvalue < 0.05:
    print("The difference between the two groups is statistically significant.")
else:
    print("The difference between the two groups is not statistically significant.")
Average barrel rate for players below 35.35%: 7.9%
Average barrel rate for players at or above 35.35%: 13.3%
The difference in average barrel rate is: 5.4%

T-test p-value: 2.08e-28
The difference between the two groups is statistically significant.

A difference of 5.4 percentage points in barrel rate is pretty stark! A Welch’s t-test confirms that the difference between the two groups is statistically significant, with a p-value of 2.08e-28. This suggests that there is a meaningful difference in barrel rate between players below and above the 35.35% fast swing rate threshold.

Let’s plot the cutoff threshold and the average barrel rates for each group.

Code
import plotly.graph_objects as go

fig = px.scatter(
    df, 
    x="fast_swing_rate", 
    y="barrel_batted_rate", 
    title="Decision Tree Regression of Fast Swing Rate vs Barrel Rate",
    hover_data=["last_name, first_name", "fast_swing_rate", "barrel_batted_rate", "year"],
)
fig.update_xaxes(title="Fast Speed Rate")
fig.update_yaxes(title="Barrel Rate")
fig.add_shape(
    type="line",
    x0=threshold_X,
    y0=0,
    x1=threshold_X,
    y1=df["barrel_batted_rate"].max(),
    line=dict(color="blue", width=2, dash="dash"),
    name="Decision Tree Threshold"
)
fig.add_annotation(
    x=threshold_X,
    y=df["barrel_batted_rate"].max(),
    text=f"Threshold: {threshold_X:.2f}%",
    showarrow=True,
    arrowhead=2,
    ax=0,
    ay=-40,
    font=dict(color="green"),        
)
fig.show()

Decision Tree Regression Plot of Fast Swing Rate vs Barrel Rate

A single split might not be enough to capture the complexity of the data. Let’s try a deeper decision tree with a maximum depth of 2. This will allow us to create more groups based on fast swing rate and barrel rate.

Code
import matplotlib.pyplot as plt

X_feature_name = 'fast_swing_rate'
y_feature_name = 'barrel_batted_rate'

X = df[[X_feature_name]]
y = df[y_feature_name]

tree_model_d2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_model_d2.fit(X, y)
df['group'] = tree_model_d2.apply(X)

group_analysis = df.groupby('group').agg(
    min_swing_rate=(X_feature_name, 'min'),
    max_swing_rate=(X_feature_name, 'max'),
    avg_barrel_rate=(y_feature_name, 'mean'),
    player_count=('player_id', 'count')
).sort_values(by='min_swing_rate').reset_index()

print("--- Analysis of Player Groups (from depth-2 tree) ---")
print(group_analysis)
--- Analysis of Player Groups (from depth-2 tree) ---
   group  min_swing_rate  max_swing_rate  avg_barrel_rate  player_count
0      2             0.2             7.9         5.116867            83
1      3             8.1            35.2         8.957798           218
2      5            35.5            52.6        11.811429            70
3      6            52.7            78.0        15.351020            49
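
One note on the group column above: tree_model_d2.apply returns scikit-learn’s internal leaf node ids, which is why the groups are numbered 2, 3, 5, and 6 rather than 0-3. To see the actual split points the depth-2 tree learned, export_text prints the tree; the exact thresholds will depend on your leaderboard export.

Code
from sklearn.tree import export_text

# Print the learned splits on fast_swing_rate (thresholds depend on the data)
print(export_text(tree_model_d2, feature_names=["fast_swing_rate"]))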

In the first pile were the monks: the 83 players whose swings reached 75 MPH less than 8% of the time. Their reward for this monastic restraint? A barrel rate of about 5%. Then you had the great mass of players, the 218 guys who followed the rules and took fast swings at a “normal” rate. They did a little better, barreling the ball about 9% of the time. This is where it should have ended. But the machine kept splitting.

The model found a group of 70 hitters who were…antsy. More than a third of their swings crossed the 75 MPH mark. The old scout would have benched them for their lack of discipline. The machine, however, noticed their barrel rate had jumped to nearly 12%. Interesting.

Then it found the last group. A tiny cohort of 49 players who were, by any traditional measure, completely out of their minds. These guys let it rip at everything: more than half of their swings were fast swings. They were the hitters your pitching coach warned you about. And what was their reward for this utter lack of restraint? A barrel rate of over 15%, an elite, All-Star-level number. There are outliers, of course. Take, for example, 2025 Julio Rodriguez, who has a fast swing rate of 60% but a barrel rate of just 7.7%.

Below is a bar chart that visualizes the average barrel rate for each group. The x-axis shows the fast swing rate ranges, and the y-axis shows the average barrel rate for each group. The groups are color-coded based on their fast swing rate ranges.

Code
plt.style.use('seaborn-v0_8-whitegrid')
fig, ax = plt.subplots(figsize=(9, 4))

# Create labels for the x-axis based on the swing rate ranges
group_labels = [
    f"{row.min_swing_rate:.2f}% - {row.max_swing_rate:.2f}%\n(n={row.player_count})"
    for index, row in group_analysis.iterrows()
]

# Create the bar chart
bars = ax.bar(
    group_labels,
    group_analysis['avg_barrel_rate'],
    color=['#0a7ff5', '#e6f0fa', '#fae6e6', '#fa5f5f'],
    edgecolor='black'
)

# Add labels and titles
ax.set_xlabel('Fast Swing Rate', fontsize=12)
ax.set_ylabel('Average Barrel Rate (%)', fontsize=12)
ax.set_title('Player Barrel Rate Groups by Fast Swing Rate', fontsize=16, pad=20)
ax.tick_params(axis='x', labelsize=10)

# Add text labels on top of the bars
for bar in bars:
    yval = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2.0, yval + 0.1, f'{yval:.2f}%', ha='center', va='bottom', fontsize=12)

plt.tight_layout()
plt.savefig('decision_tree_3_groups.png')
plt.show()

Decision Tree Regression Plot of Fast Swing Rate vs Barrel Rate with Average Barrel Rates
Code
# t-test for the four groups
# Extract barrel rates for each group
group_0 = df[df['group'] == 2]['barrel_batted_rate']
group_1 = df[df['group'] == 3]['barrel_batted_rate']
group_2 = df[df['group'] == 5]['barrel_batted_rate']
group_3 = df[df['group'] == 6]['barrel_batted_rate']

# Perform t-tests
ttest_01 = ttest_ind(group_0, group_1, equal_var=False)
ttest_12 = ttest_ind(group_1, group_2, equal_var=False)
ttest_23 = ttest_ind(group_2, group_3, equal_var=False) 

# Display results
print("--- T-Test Results ---")
print(f"Group 0 vs Group 1: t-statistic = {ttest_01.statistic:.2f}, p-value = {ttest_01.pvalue:.4f}")
print(f"Group 1 vs Group 2: t-statistic = {ttest_12.statistic:.2f}, p-value = {ttest_12.pvalue:.4f}")
print(f"Group 2 vs Group 3: t-statistic = {ttest_23.statistic:.2f}, p-value = {ttest_23.pvalue:.4f}")
--- T-Test Results ---
Group 0 vs Group 1: t-statistic = -10.56, p-value = 0.0000
Group 1 vs Group 2: t-statistic = -6.97, p-value = 0.0000
Group 2 vs Group 3: t-statistic = -5.07, p-value = 0.0000

Conclusion

Of course, a walk was as good as a hit. The quants had proven that a decade ago, and it was the one piece of common knowledge that was actually, you know, true. Getting on base for free was a foundational good.

But that truth had created a dangerous piece of collateral wisdom: that patience, in all its forms, was a virtue. That the same mindset that let you take ball four was the one you should use when a pitcher threw you something hittable.

The data, however, saw a clean line in the sand. It wasn’t about patience versus aggression; it was about knowing when to be which. The analysis didn’t argue against the value of a walk—it argued against the crippling passivity that had been mistaken for discipline.

The most valuable players weren’t just patient or just aggressive; they were both. They had the discipline of a monk until the moment they decided to swing. And at that moment, they had the fury of a barbarian. The rest of the league, stuck in the middle, was playing the wrong game entirely.

Of course, barrels are not the only thing that matters in baseball. There are many other factors that contribute to a player’s success, such as plate discipline, contact rate, and defensive skills. However, understanding the relationship between swing speed and barrel rate can provide valuable insights into player performance and help teams make better decisions. I leave you with a clip of Julio Rodriguez: a fast swinger who hasn’t quite found his stride in 2025. However, he is still a young player with a lot of potential. With the right adjustments, he could become one of the best hitters in the league.

References

“Standard Stats Glossary.” 2025. MLB.com. https://www.mlb.com/glossary/statcast/barrel.