Linear Regression—long before transformers and LLMs, it served as one of the first tools in the machine learning toolbox. Despite the rise of complex models, its role in modern statistical analysis remains essential and unshaken.
Regression models measure the statistical relationship between a dependent variable (\(y\)) and one or more independent variables (\(x_i\)). They are widely used for prediction, estimation, hypothesis testing, and modeling causal relationships. Independent variables serve as inputs into a system and take on different values freely. Dependent variables are those whose values change as a consequence of changes to other values in the system. \(X\) is referred to as the predictor or explanatory variable, while \(Y\) denotes the response variable.
First-Order Linear Model
\[Y=\theta_0 + \theta_1 X\]
where \(Y\) is the dependent variable, \(\theta_0\) is the y-intercept, \(\theta_1\) is the slope of the line, and \(X\) is the independent variable.
This is the standard \(y=mx + b\) model taught in middle school.
From this, we can create a hypothesis function (model):
\[h_\theta(x) = \theta_0 + \theta_1 x.\]
Note that we use a lowercase \(x\) to denote an individual data point. How do we find the optimal \(\theta\) coefficients? Sure, we could plot the data and eyeball a line where half of the data points lie above it and the other half below. But consider an application like baseball data. A historic game like baseball has decades of seasons. If we were to model batting average \((\frac{\text{hits}}{\text{at-bats}})\), it would be impossible to guess the best-fitted line by hand. We need a cleverer method.
We turn to least-squares linear regression, with the cost function \[Cost(\theta) = \frac{1}{2n}\sum_{i=1}^{n}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2.\]
The cost function takes the average of the squared-error differences across all data points (the extra factor of \(\tfrac{1}{2}\) simply cancels when we differentiate). We fit by solving \(\min_\theta Cost(\theta)\), or in other words, by finding the parameters \(\theta\) that minimize the mean squared error (MSE).
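To make this concrete, here is a minimal sketch of the hypothesis and cost in Python; the names `h` and `cost` are my own, not from any library:

```python
import numpy as np

def h(theta, x):
    """Hypothesis: h_theta(x) = theta_0 + theta_1 * x."""
    return theta[0] + theta[1] * x

def cost(theta, x, y):
    """Halved mean squared error, averaged over all n data points."""
    n = len(x)
    return np.sum((h(theta, x) - y) ** 2) / (2 * n)

# Tiny example: points that lie exactly on y = 1 + 2x give zero cost.
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])
print(cost(np.array([1.0, 2.0]), x, y))  # 0.0
print(cost(np.array([0.0, 2.0]), x, y))  # > 0
```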
Gradient descent is the method we use to find the optimal \(\theta\) values. More formally, we repeatedly update each parameter in the direction of steepest descent of the cost:
\[\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} Cost(\theta),\]
where \(\alpha\) is the learning rate. Calculus is inescapable. One of its most powerful contributions to humanity is the ability to systematically find optimal values—a cornerstone of decision-making in science, economics, engineering, and machine learning. Let’s solve for \(\theta_0\) and \(\theta_1\) for first-order linear regression.
We first take the partial derivative of the cost with respect to each \(\theta_j\):
\[\frac{\partial}{\partial \theta_0} Cost(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(h_\theta(x^{(i)}) - y^{(i)}\right), \qquad \frac{\partial}{\partial \theta_1} Cost(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}.\]
Setting both partials to zero and solving yields the familiar closed-form least-squares estimates
\[\theta_1 = \frac{\sum_{i=1}^{n}\left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right)}{\sum_{i=1}^{n}\left(x^{(i)} - \bar{x}\right)^2}, \qquad \theta_0 = \bar{y} - \theta_1\bar{x},\]
where \(\bar{x}\) and \(\bar{y}\) are the sample means. These are exactly the formulas the implementation below uses.
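Alternatively, plugging the partials into the update rule gives a working gradient-descent loop. A minimal sketch (the `alpha` and `epochs` values are arbitrary choices of mine, and data scaling matters in practice):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, epochs=5000):
    """Batch gradient descent on the halved-MSE cost."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        residual = theta0 + theta1 * x - y       # h_theta(x) - y
        theta0 -= alpha * residual.mean()        # partial w.r.t. theta_0
        theta1 -= alpha * (residual * x).mean()  # partial w.r.t. theta_1
    return theta0, theta1

# Converges to the closed-form estimates for well-scaled data.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(gradient_descent(x, y))
```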
Baseball is a game of numbers. There exist several statistics to measure a hitter’s offensive contribution. The simplest and best-known metric is batting average (BA or AVG): \(BA=\frac{\text{Hits}}{\text{At-bats}}\). While batting average only considers hits, slugging percentage weights different types of hits, giving more value to extra-base hits: \(SLG=\frac{1B \cdot 1 + 2B \cdot 2 + 3B\cdot 3 + HR\cdot 4}{AB}\). On-base percentage (\(OBP\)) “refers to how frequently a batter reaches base per plate appearance” (“Standard Stats Glossary” 2025). \(OPS\) (on-base plus slugging) is an amalgamation of \(OBP\) and \(SLG\); that is, \(OPS=OBP + SLG\). \(OPS\) encapsulates a batter’s power and on-base rate. Lastly, there’s weighted on-base average (\(wOBA\)), an all-encompassing offensive measurement. Similar to \(SLG\), \(wOBA\) weights batted events, but with a different formula: each event (walk, single, home run, etc.) is weighted by its adjusted run expectancy in the context of a season.
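To make the formulas concrete, here is a small sketch computing these metrics from raw counting stats. The helper functions are my own; \(OBP\) uses the common \(\frac{H + BB + HBP}{AB + BB + HBP + SF}\) form, and \(wOBA\) is omitted since its weights change every season:

```python
def avg(hits, at_bats):
    return hits / at_bats

def slg(singles, doubles, triples, homers, at_bats):
    return (singles + 2 * doubles + 3 * triples + 4 * homers) / at_bats

def obp(hits, walks, hbp, at_bats, sac_flies):
    return (hits + walks + hbp) / (at_bats + walks + hbp + sac_flies)

# Illustrative stat line (hits must equal 1B + 2B + 3B + HR).
hits, singles, doubles, triples, homers = 135, 60, 27, 3, 45
at_bats, walks, hbp, sac_flies = 373, 232, 9, 3
print(f"AVG: {avg(hits, at_bats):.3f}")
print(f"SLG: {slg(singles, doubles, triples, homers, at_bats):.3f}")
print(f"OBP: {obp(hits, walks, hbp, at_bats, sac_flies):.3f}")
```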
Figure 1 illustrates the relationships between various baseball statistics and runs. Observe that the strength of the relationship between each independent variable (\(AVG, OBP, SLG, OPS, wOBA\)) and the dependent variable \(R\) (runs) is not the same.
Figure 1: Scatter plots of baseball metrics and runs.
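One quick way to quantify what the scatter plots suggest is a per-metric correlation with runs. A sketch, assuming the long-format `long_df` (with `Metric`, `Value`, and `R` columns) used in the code below:

```python
# Pearson correlation of each metric with team runs (closer to 1 = stronger linear fit).
for metric in ["AVG", "OBP", "SLG", "OPS", "wOBA"]:
    sub = long_df[long_df["Metric"] == metric]
    print(metric, round(sub["Value"].corr(sub["R"]), 3))
```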
We will need to compute the linear model’s weights if we want to accurately predict team runs given a statistic. To hold true to our mathematical derivation, we’ll implement our own linear regression model. Of course, ready-made solutions exist in libraries such as Scikit-Learn.
Code
```python
import numpy as np
import seaborn as sns

class LinearRegression:
    def __init__(self):
        self.slope = None
        self.intercept = None

    def fit(self, X, y):
        # Closed-form least-squares estimates derived above
        n = len(X)
        x_mean = np.mean(X)
        y_mean = np.mean(y)
        numerator = 0
        denominator = 0
        for i in range(n):
            numerator += (X[i] - x_mean) * (y[i] - y_mean)
            denominator += (X[i] - x_mean) ** 2
        self.slope = numerator / denominator
        self.intercept = y_mean - self.slope * x_mean

    def predict(self, X):
        y_pred = []
        for x in X:
            y_pred.append(self.slope * x + self.intercept)
        return y_pred

long_df["y_pred"] = np.nan
model_dict = {}

# Fit one model per offensive metric
for metric in ["AVG", "OBP", "SLG", "OPS", "wOBA"]:
    model = LinearRegression()
    mask = long_df["Metric"] == metric
    X = long_df.loc[mask, "Value"].to_numpy()
    y = long_df.loc[mask, "R"].to_numpy()
    model.fit(X, y)
    model_dict[metric] = model
    y_pred = model.predict(X)
    long_df.loc[mask, "y_pred"] = y_pred

# Scatter plots with the fitted line overlaid per metric
g = sns.FacetGrid(long_df, col="Metric", col_wrap=3, height=4, sharex=False, sharey=False)
g.map_dataframe(sns.scatterplot, x="Value", y="R")
g.map_dataframe(sns.lineplot, x="Value", y="y_pred", color="red")
g.set_ylabels("Predicted Runs")
```
Figure 2: Scatter plots of baseball metrics and runs with linear regression line.
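As noted above, ready-made implementations exist. A quick cross-check with Scikit-Learn (a sketch reusing the same `long_df`; note that scikit-learn expects a 2-D feature array) should recover the same slopes and intercepts:

```python
from sklearn.linear_model import LinearRegression as SKLinearRegression

for metric in ["AVG", "OBP", "SLG", "OPS", "wOBA"]:
    mask = long_df["Metric"] == metric
    X = long_df.loc[mask, "Value"].to_numpy().reshape(-1, 1)  # sklearn wants n x d
    y = long_df.loc[mask, "R"].to_numpy()
    sk_model = SKLinearRegression().fit(X, y)
    print(metric, sk_model.intercept_, sk_model.coef_[0])
```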
Suppose we want to predict a team’s runs given a metric. In 2024, the Dodgers had a team OPS of 0.781. Let’s use our newly trained OPS linear regression model to predict their team runs.
Code
```python
dodgers_ops = 0.781
predicted_runs = model_dict["OPS"].predict([dodgers_ops])[0]
print(f"Dodgers predicted runs in 2024: {round(predicted_runs)}")
print(f"Dodgers actual runs in 2024: {842}")
```
Dodgers predicted runs in 2024: 821
Dodgers actual runs in 2024: 842
Not bad! The Dodgers currently have an OPS of .801 in their 2025 season. Based on the OPS model, and assuming the Dodgers maintain their high OPS throughout the remaining season, they are slated to score 860 runs.
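That projection is just the same one-liner with the 2025 figure plugged in:

```python
print(round(model_dict["OPS"].predict([0.801])[0]))  # ~860
```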
I’ll have to revisit these models when the 2025 season ends.
Multivariable Linear Regression
Let’s predict exit velocity from swing metrics. This season, Statcast released their swing path and swing length data on Baseball Savant. Granular bat-tracking data can reveal mechanical trends that lend themselves to batting outcomes.
Observe Figure 3. We consider two independent variables: average swing speed and swing length. The dependent variable is average exit velocity.
Figure 3: Scatter plots of average swing speed, swing length, and average exit velocity.
We apply multivariable linear regression to see if there’s a statistical relationship among average swing speed, swing length, and average exit velocity. To model this relationship, we turn to a vectorized version of the first-order linear regression above: \(\hat{y}=XW\), where \(X\) is the \(n \times d\) design matrix augmented with a column of ones for the intercept, and the least-squares weights are given by the normal equation \(W=(X^T X)^{-1}X^T y\).
```python
class MVLinearRegression:
    def __init__(self):
        self.W = None

    def fit(self, X, y):
        ''' X: n x d '''
        # Add bias term to X -> [1 X]
        n = X.shape[0]
        X = np.hstack([np.ones((n, 1)), X])
        self.W = np.linalg.inv(X.T @ X) @ X.T @ y

    def predict(self, X):
        n = X.shape[0]
        X = np.hstack([np.ones((n, 1)), X])
        return X @ self.W

# Train on pre-2025 seasons, hold out 2025 as a test set
df_train = df[df["year"] < 2025]
X = df_train[["avg_swing_speed", "avg_swing_length"]].to_numpy()
y = df_train[["exit_velocity_avg"]].to_numpy().squeeze()

model = MVLinearRegression()
model.fit(X, y)
print(f"Model weights: {model.W}")

df_test = df[df["year"] == 2025]
X_test = df_test[["avg_swing_speed", "avg_swing_length"]].to_numpy()
y_test = df_test[["exit_velocity_avg"]].to_numpy().squeeze()
y_pred = model.predict(X_test)
```
Model weights: [51.12674671 0.644755 -1.05884044]
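A small aside: explicitly inverting \(X^TX\) can be numerically fragile. `np.linalg.lstsq` solves the same least-squares problem more stably, and a quick check (assuming the `X` and `y` arrays from the block above) should recover the same weights:

```python
# Append the bias column, then solve min ||Xb @ W - y|| directly.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
W_lstsq, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(W_lstsq)  # should match model.W up to floating-point error
```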
Figure 4 depicts the multivariable linear regression analysis, which reveals that swing speed and swing length have opposing effects on exit velocity. Swing speed is a strong positive predictor: for every 1 mph increase in average swing speed, predicted exit velocity increases by approximately 0.64 mph, a relationship that is highly statistically significant. In contrast, swing length shows a significant negative association, with each additional unit of swing length reducing predicted exit velocity by about 1.1 mph when holding swing speed constant. This suggests that while higher swing speeds directly contribute to better batted-ball outcomes, longer swings may hinder efficient energy transfer, potentially due to reduced compactness or timing inefficiencies. Overall, the model highlights the importance of optimizing both speed and mechanical efficiency to maximize exit velocity.
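To read the weights concretely, consider a hypothetical batter; the swing speed of 75 mph and swing length of 7.5 used here are made-up illustrative inputs, not values from the data:

```python
import numpy as np

W = np.array([51.12674671, 0.644755, -1.05884044])  # [intercept, speed, length]
speed, length = 75.0, 7.5                           # hypothetical inputs
ev = W @ np.array([1.0, speed, length])
print(f"Predicted average exit velocity: {ev:.1f} mph")  # about 91.5 mph
```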
Figure 4: (a) Predicted exit velocity as a function of swing speed and swing length.
The residual analysis in Figure 5 for the 2025 test data shows that the multivariable linear regression model predicting exit velocity from average swing speed and swing length produces reasonably centered errors. The histogram of residuals appears roughly symmetric and centered near zero, suggesting that the model does not systematically over- or under-predict exit velocity on unseen data.
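The residual analysis itself is only a few lines. A sketch, assuming `y_test` and `y_pred` from the regression block above and seaborn imported as `sns`:

```python
residuals = y_test - y_pred
print(f"Mean residual: {residuals.mean():.2f} mph")
print(f"Residual std:  {residuals.std():.2f} mph")
sns.histplot(residuals)  # roughly symmetric and centered near zero
```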
Quantitatively, the residuals are relatively small in magnitude, which indicates good calibration. However, the presence of some wider residual spread implies that other unmodeled factors—such as point-of-contact, pitch type, or bat-to-ball precision—may be influencing exit velocity and are not captured by swing speed and swing length alone.
Overall, the model performs well given its simplicity and the limited feature set. It serves as a useful first-order approximation for evaluating how mechanical swing inputs affect batted ball output, but more sophisticated modeling approaches (e.g., nonlinear models or additional features) would likely reduce residual variance and improve predictive accuracy.