There is no denying that baseball has an element of luck. Sometimes the hardest hit balls find a glove, and sometimes a blooper falls in for a hit. We can try to quantify this luck by looking at the difference between a player’s actual stats and their expected stats.

Expected Batting Average (xBA)

First let’s take a look at expected batting average (xBA). We grab the latest expected statistics leaderboard from baseball savant.

Code

import pandas as pd

df = pd.read_csv("expected_stats.csv")
df["est_ba_minus_ba_diff"].describe().round(4)

count    251.0000
mean      -0.0069
std        0.0283
min       -0.0750
25%       -0.0260
50%       -0.0060
75%        0.0125
max        0.0660
Name: est_ba_minus_ba_diff, dtype: float64

The unluckiest batter has as a difference of -0.075, while the luckiest batter has a difference of 0.066. As it stands, Joc Pederson of the Texas Ranges holds unluckiest BA and Pavin Smith of the Arizona Diamondbacks boasts the luckiest BA. Interestingly, the mean xBA is -0.0069, suggesting that the average MLB hitter (with the minimum number of BIP/PA) is slightly unlucky when it comes to batting average.

Above shows Joc Pederson lining out to Samad Taylor. The batted ball had a hit probability of 78% per baseballsavant. Conversely, Pavin Smith’s single to Hyeseong Kim had a 94% chance of becoming an out.

Pederson and Smith are just two players on difference ends of the spectrum. Here is a histogram, illustrating the disitrbution of expected batting average differences. Oberve the slight left-skew distribution.

Code

import plotly.express as px
fig =  px.histogram(df, x="est_ba_minus_ba_diff", marginal="rug", hover_data=["last_name, first_name"],
    labels = {
        "est_ba_minus_ba_diff": "BA - xBA",
        "count": "Count"
    },
    title="Distribution of xBA Difference"
)
fig.show()

Pretty close correlation between xBA and BA with an \(R^2\) value of 0.48. xBA is not a perfect indicator of BA. In fact, xBA should be interpreted more heavily than BA given that it removes luck from the at-bat. Note that there remains a sizable discrepancy between xBA and BA, suggesting a sizable “luck” component observed in BA that xBA does not capture.

Code

from sklearn.metrics import r2_score

fig = px.scatter(df, x="est_ba", y="ba", hover_data=["last_name, first_name"], trendline="ols",
    labels = {
        "est_ba": "xBA",
        "ba": "BA"
    },
    title="Relationship Between xBA and BA"
)

fig.show()

Expected Weighted OBA (xwOBA)

xwOBA is what statcast considers, the most important offensive metric that tells “the story of a player’s season based on the quality of and amount of contact, not coutomes.” In sum, xwOBA weighs hit types and walks with an expected run value based on historically similar batted ball events. The formulation of the coefficients use exit velocity, launch angle, and sprint speed (for topped or weakly hit balls).

Code

df["est_woba_minus_woba_diff"].describe().round(4)

count    251.0000
mean      -0.0107
std        0.0313
min       -0.0940
25%       -0.0300
50%       -0.0110
75%        0.0130
max        0.0580
Name: est_woba_minus_woba_diff, dtype: float64

The mean wOBA-xwOBA difference is -0.0107. This is surpsignly greater than the mean difference of BA and xBA. In addition, the most unlucky batter with xwOBA is most unlucky than the most unlucky batter with xBA. Simiarly, the most lucky xwOBA batter is not as lucky as the most lucky xBA batter. Overall it seems like player are less lucky with xwOBA. A reason for this discrepency is that xwOBA weighs batted ball events with difference values. For example, a double is weighted more heavily than a single. As such a batted ball who’s exit velocity and launch angle that would likely result in a double will increase the diffence in xwOBA and wOBA if that double becomes a single.

Code

fig =  px.histogram(df, x="est_woba_minus_woba_diff", marginal="rug", hover_data=["last_name, first_name"], 
        labels = {
        "diff": "wOBA - xwOBA",
        "count": "Count"
    },
    title="Distribution of wOBA-xwOBA")
fig.show()

Combining Differences

Code

id_cols = ['last_name, first_name', 'player_id', 'pa', 'year']

# melt the three diff-cols into long form
df_long = df.melt(
    id_vars=id_cols,
    value_vars=[
        'est_ba_minus_ba_diff',
        'est_slg_minus_slg_diff',
        'est_woba_minus_woba_diff'
    ],
    var_name='metric',
    value_name='diff'
)
# map the verbose column names to simple metric labels
df_long['metric'] = df_long['metric'].map({
    'est_ba_minus_ba_diff':   'BA',
    'est_slg_minus_slg_diff': 'SLG',
    'est_woba_minus_woba_diff':'wOBA'
})

fig = px.histogram(df_long, x="diff", color="metric", marginal="rug", hover_data=["last_name, first_name"], 
        labels = {
        "diff": "Metric - Expected Metric",
        "count": "Count"
    },
    title="Distribution of xBA, xSLG, and xwOBA Differences"
)
fig.show()

Observe the histogram above that combines xBA, expected slugging (xSLG), and xwOBA. Among xBA, xSLG, and xwOBA, xSLG shows the greated spread in difference values. This is likely a result of how SLG assign weights to hits. Specifically, SLG is defined as (1B + 2Bx2 + 3Bx3 + HRx4)/AB. Unlike, wOBA, slugging weights are fixed and higher-valued. Hence, a batted ball that does not result in its expected outcome faces a greater difference penalty.

Who’s that unlucky slugger to the very left? That’s Royals veteran Salvador Perez.

That ball had a 97% hit probability and would have been a homerun in 9/30 ballparks.

Salvador Pérez’s recent flyout underscores the value of integrating ballpark- and weather-specific factors into our expected-stat models. Although current metrics leverage historical launch-angle and exit-velocity data to predict outcomes, they omit critical dimensions—namely batted-ball trajectory, precise horizontal launch direction, and wind conditions. In Pérez’s case, a strong tailwind or a gust toward the wall might readily have turned that routine out into a home run. That said, Statcast remains an extraordinary tool for quantifying the game, and enriching it with environmental and park-design variables promises even deeper insights into batted-ball performance.

Thanks for reading!