Using batting stats from Major League Baseball’s 2023 season
Outlier detection is an unsupervised machine learning task to discover anomalies (unusual observations) inside a given data set. This task is useful in lots of real-world cases where our available dataset is already “contaminated” by anomalies. Scikit-learn implements several outlier detection algorithms, and in cases where we have now an uncontaminated baseline, we may also use these algorithms for novelty detection, a semi-supervised task that predicts whether latest observations are outliers.
The 4 outlier detection algorithms we’ll compare are:
- Elliptic Envelope is suitable for normally-distributed data with low dimensionality. As its name implies, it uses the multivariate normal distribution to create a distance measure to separate outliers from inliers.
- Local Outlier Factor is a comparison of the local density of an remark with that of its neighbors. Observations with much lower density than their neighbors are considered outliers.
- One-Class Support Vector Machine (SVM) with Stochastic Gradient Descent (SGD) is an O(n) approximate solution of the One-Class SVM. Note that the O(n²) One-Class SVM works well on our small example dataset but could also be impractical in your actual use case.
- Isolation Forest is a tree-based approach where outliers are more quickly isolated by random splits than inliers.
Since our task is unsupervised, we don’t have ground truth to match accuracies of those algorithms. As a substitute, we would like to see how their results (player rankings particularly) differ from each other and gain some intuition into their behavior and limitations, in order that we would know when to prefer one over one other.
Let’s compare just a few of those techniques using two metrics of batter performance from 2023’s Major Leage Baseball (MLB) season:
- On-base percentage (OBP), the speed at which a batter reaches base (by hitting, walking, or getting hit by pitch) per plate appearance
- Slugging (SLG), the typical variety of total bases per at bat
There are a lot of more sophisticated metrics of batter performance, including OBP plus SLG (OPS), weighted on-base average (wOBA), and adjusted weighted runs created (WRC+). Nonetheless, we’ll see that along with being commonly used and simple to know, OBP and SLG are moderately correlated and roughly normally distributed, making them well suited to this comparison.
We use the pybaseball
package to acquire hitting data. This Python package is under MIT license and returns data from Fangraphs.com, Baseball-Reference.com, and other sources which have in turn obtained offical records from Major League Baseball.
We use pybaseball’s 2023 batting statistics, which might be obtained either by batting_stats
(FanGraphs) or batting_stats_bref
(Baseball Reference). It seems that the player names are more appropriately formatted from Fangraphs, but player teams and leagues from Baseball Reference are higher formatted within the case of traded players. For a dataset with improved readability, we really need to merge three tables: FanGraphs, Baseball Reference, and a key lookup.
from pybaseball import (cache, batting_stats_bref, batting_stats,
playerid_reverse_lookup)
import pandas as pdcache.enable() # avoid unnecessary requests when re-running
MIN_PLATE_APPEARANCES = 200
# For readability and reasonable default sort order
df_bref = batting_stats_bref(2023).query(f"PA >= {MIN_PLATE_APPEARANCES}"
).rename(columns={"Lev":"League",
"Tm":"Team"}
)
df_bref["League"] =
df_bref["League"].str.replace("Maj-","").replace("AL,NL","NL/AL"
).astype('category')
df_fg = batting_stats(2023, qual=MIN_PLATE_APPEARANCES)
key_mapping =
playerid_reverse_lookup(df_bref["mlbID"].to_list(), key_type='mlbam'
)[["key_mlbam","key_fangraphs"]
].rename(columns={"key_mlbam":"mlbID",
"key_fangraphs":"IDfg"}
)
df = df_fg.drop(columns="Team"
).merge(key_mapping, how="inner", on="IDfg"
).merge(df_bref[["mlbID","League","Team"]],
how="inner", on="mlbID"
).sort_values(["League","Team","Name"])
First, we note that these metrics differ in mean and variance and are moderately correlated. We also note that every metric is fairly symmetric, with median value near mean.
print(df[["OBP","SLG"]].describe().round(3))print(f"nCorrelation: {df[['OBP','SLG']].corr()['SLG']['OBP']:.3f}")
OBP SLG
count 362.000 362.000
mean 0.320 0.415
std 0.034 0.068
min 0.234 0.227
25% 0.300 0.367
50% 0.318 0.414
75% 0.340 0.460
max 0.416 0.654Correlation: 0.630
Let’s visualize this joint distribution, using:
- Scatterplot of the players, coloured by National League (NL) vs American League (AL)
- Bivariate kernel density estimator (KDE) plot of the players, which smoothes the scatterplot with a Gaussian kernel to estimate density
- Marginal KDE plots of every metric
import matplotlib.pyplot as plt
import seaborn as snsg = sns.JointGrid(data=df, x="OBP", y="SLG", height=5)
g = g.plot_joint(func=sns.scatterplot, data=df, hue="League",
palette={"AL":"blue","NL":"maroon","NL/AL":"green"},
alpha=0.6
)
g.fig.suptitle("On-base percentage vs. Sluggingn2023 season, min "
f"{MIN_PLATE_APPEARANCES} plate appearances"
)
g.figure.subplots_adjust(top=0.9)
sns.kdeplot(x=df["OBP"], color="orange", ax=g.ax_marg_x, alpha=0.5)
sns.kdeplot(y=df["SLG"], color="orange", ax=g.ax_marg_y, alpha=0.5)
sns.kdeplot(data=df, x="OBP", y="SLG",
ax=g.ax_joint, color="orange", alpha=0.5
)
df_extremes = df[ df["OBP"].isin([df["OBP"].min(),df["OBP"].max()])
| df["OPS"].isin([df["OPS"].min(),df["OPS"].max()])
]
for _,row in df_extremes.iterrows():
g.ax_joint.annotate(row["Name"], (row["OBP"], row["SLG"]),size=6,
xycoords='data', xytext=(-3, 0),
textcoords='offset points', ha="right",
alpha=0.7)
plt.show()
The highest-right corner of the scatterplot shows a cluster of excellence in hitting corresponding to the heavy upper tails of the SLG and OBP distributions. This small group excels at getting on base and hitting for extra bases. How much we consider them to be outliers (due to their distance from the vast majority of the player population) versus inliers (due to their proximity to at least one one other) relies on the definition utilized by our chosen algorithm.
Scikit-learn’s outlier detection algorithms typically have fit()
and predict()
methods, but there are exceptions and likewise differences between algorithms of their arguments. We’ll consider each algorithm individually, but we’ll fit each to a matrix of attributes (n=2) per player (m=453). We’ll then rating not only each player but a grid of values spanning the range of every attribute, to assist us visualize the prediction function.
To visualise decision boundaries, we’d like to take the next steps:
- Create a 2D
meshgrid
of input feature values. - Apply the
decision_function
to every point on themeshgrid
, which requires unstacking the grid. - Re-shape the predictions back right into a grid.
- Plot the predictions.
We’ll use a 200×200 grid to cover the prevailing observations plus some padding, but you would adjust the grid to your required speed and determination.
import numpy as npX = df[["OBP","SLG"]].to_numpy()
GRID_RESOLUTION = 200
disp_x_range, disp_y_range = ( (.6*X[:,i].min(), 1.2*X[:,i].max())
for i in [0,1]
)
xx, yy = np.meshgrid(np.linspace(*disp_x_range, GRID_RESOLUTION),
np.linspace(*disp_y_range, GRID_RESOLUTION)
)
grid_shape = xx.shape
grid_unstacked = np.c_[xx.ravel(), yy.ravel()]
Elliptic Envelope
The form of the elliptic envelope is decided by the information’s covariance matrix, which provides the variance of feature i
on the foremost diagonal [i,i]
and the covariance of features i
and j
within the [i,j]
positions. Since the covariance matrix is sensitive to outliers, this algorithm uses the Minimum Covariance Determinant (MCD) Estimator, which is advisable for unimodal and symmetric distributions, with shuffling determined by the random_state
input for reproducibility. This robust covariance matrix will turn out to be useful again later.
Because we would like to match the outlier scores of their rating reasonably than a binary outlier/inlier classification, we use the decision_function
to attain players.
from sklearn.covariance import EllipticEnvelopeell = EllipticEnvelope(random_state=17).fit(X)
df["outlier_score_ell"] = ell.decision_function(X)
Z_ell = ell.decision_function(grid_unstacked).reshape(grid_shape)
Local Outlier Factor
This approach to measuring isolation relies on k-nearest neighbors (KNN). We calculate the whole distance from each remark to its nearest neighbors to define local density, after which we compare each remark’s local density with that of its neighbors. Observations with local density much lower than their neighbors are considered outliers.
Selecting the variety of neighbors to incorporate: In KNN, a rule of thumb is to let K = sqrt(N), where N is your remark count. From this rule, we obtain a K close to twenty (which happens to be the default K for LOF). You’ll be able to increase or decrease K to scale back overfitting or underfitting, respectively.
K = int(np.sqrt(X.shape[0]))print(f"Using K={K} nearest neighbors.")
Using K=19 nearest neighbors.
Selecting a distance measure: Note that our features are correlated and have different variances, so Euclidean distance just isn’t very meaningful. We’ll use Mahalanobis distance, which accounts for feature scale and correlation.
In calculating the Mahalanobis distance, we’ll use the robust covariance matrix. If we had not already calculated it via Ellliptic Envelope, we could calculate it directly.
from scipy.spatial.distance import pdist, squareform# If we did not have the elliptical envelope already,
# we could calculate robust covariance:
# from sklearn.covariance import MinCovDet
# robust_cov = MinCovDet().fit(X).covariance_
# But we will just re-use it from elliptical envelope:
robust_cov = ell.covariance_
print(f"Robust covariance matrix:n{np.round(robust_cov,5)}n")
inv_robust_cov = np.linalg.inv(robust_cov)
D_mahal = squareform(pdist(X, 'mahalanobis', VI=inv_robust_cov))
print(f"Mahalanobis distance matrix of size {D_mahal.shape}, "
f"e.g.:n{np.round(D_mahal[:5,:5],3)}...n...n")
Robust covariance matrix:
[[0.00077 0.00095]
[0.00095 0.00366]]Mahalanobis distance matrix of size (362, 362), e.g.:
[[0. 2.86 1.278 0.964 0.331]
[2.86 0. 2.63 2.245 2.813]
[1.278 2.63 0. 0.561 0.956]
[0.964 2.245 0.561 0. 0.723]
[0.331 2.813 0.956 0.723 0. ]]...
...
Fitting the Local Outlier Factor: Note that using a custom distance matrix requires us to pass metric="precomputed"
to the constructor after which the space matrix itself to the fit
method. (See documentation for more details.)
Also note that unlike other algorithms, with LOF we’re instructed not to make use of the score_samples
method for scoring existing observations; this method should only be used for novelty detection.
from sklearn.neighbors import LocalOutlierFactorlof = LocalOutlierFactor(n_neighbors=K, metric="precomputed", novelty=True
).fit(D_mahal)
df["outlier_score_lof"] = lof.negative_outlier_factor_
Create the choice boundary: Because we used a custom distance metric, we must also compute that custom distance between each point within the grid to the unique observations. Before we used the spatial measure pdist
for pairwise distances between each member of a single set, but now we use cdist
to return the distances from each member of the primary set of inputs to every member of the second set.
from scipy.spatial.distance import cdistD_mahal_grid = cdist(XA=grid_unstacked, XB=X,
metric='mahalanobis', VI=inv_robust_cov
)
Z_lof = lof.decision_function(D_mahal_grid).reshape(grid_shape)
Support Vector Machine (SGD-One-Class SVM)
SVMs use the kernel trick to rework features into a better dimensionality where a separating hyperplane might be identified. The radial basis function (RBF) kernel requires the inputs to be standardized, but because the documentation for StandardScaler
notes, that scaler is sensitive to outliers, so we’ll use RobustScaler
. We’ll pipe the scaled inputs into Nyström kernel approximation, as suggested by the documentation for SGDOneClassSVM
.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVMsuv = make_pipeline(
RobustScaler(),
Nystroem(random_state=17),
SGDOneClassSVM(random_state=17)
).fit(X)
df["outlier_score_svm"] = suv.decision_function(X)
Z_svm = suv.decision_function(grid_unstacked).reshape(grid_shape)
Isolation Forest
This tree-based approach to measuring isolation performs random recursive partitioning. If the typical variety of splits required to isolate a given remark is low, that remark is taken into account a stronger candidate outlier. Like Random Forests and other tree-based models, Isolation Forest doesn’t assume that the features are normally distributed or require them to be scaled. By default, it builds 100 trees. Our example only uses two features, so we don’t enable feature sampling.
from sklearn.ensemble import IsolationForestiso = IsolationForest(random_state=17).fit(X)
df["outlier_score_iso"] = iso.score_samples(X)
Z_iso = iso.decision_function(grid_unstacked).reshape(grid_shape)
Note that the predictions from these models have different distributions. We apply QuantileTransformer
to make them more visually comparable on a given grid. From the documentation, please note:
Note that this transform is non-linear. It might distort linear correlations between variables measured at the identical scale but renders variables measured at different scales more directly comparable.
from adjustText import adjust_text
from sklearn.preprocessing import QuantileTransformerN_QUANTILES = 8 # This many color breaks per chart
N_CALLOUTS=15 # Label this many top outliers per chart
fig, axs = plt.subplots(2, 2, figsize=(12, 12), sharex=True, sharey=True)
fig.suptitle("Comparison of Outlier Identification Algorithms",size=20)
fig.supxlabel("On-Base Percentage (OBP)")
fig.supylabel("Slugging (SLG)")
ax_ell = axs[0,0]
ax_lof = axs[0,1]
ax_svm = axs[1,0]
ax_iso = axs[1,1]
model_abbrs = ["ell","iso","lof","svm"]
qt = QuantileTransformer(n_quantiles=N_QUANTILES)
for ax, nm, abbr, zz in zip( [ax_ell,ax_iso,ax_lof,ax_svm],
["Elliptic Envelope","Isolation Forest",
"Local Outlier Factor","One-class SVM"],
model_abbrs,
[Z_ell,Z_iso,Z_lof,Z_svm]
):
ax.title.set_text(nm)
outlier_score_var_nm = f"outlier_score_{abbr}"
qt.fit(np.sort(zz.reshape(-1,1)))
zz_qtl = qt.transform(zz.reshape(-1,1)).reshape(zz.shape)
cs = ax.contourf(xx, yy, zz_qtl, cmap=plt.cm.OrRd.reversed(),
levels=np.linspace(0,1,N_QUANTILES)
)
ax.scatter(X[:, 0], X[:, 1], s=20, c="b", edgecolor="k", alpha=0.5)
df_callouts = df.sort_values(outlier_score_var_nm).head(N_CALLOUTS)
texts = [ ax.text(row["OBP"], row["SLG"], row["Name"], c="b",
size=9, alpha=1.0)
for _,row in df_callouts.iterrows()
]
adjust_text(texts,
df_callouts["OBP"].values, df_callouts["SLG"].values,
arrowprops=dict(arrowstyle='->', color="b", alpha=0.6),
ax=ax
)
plt.tight_layout(pad=2)
plt.show()
for var in ["OBP","SLG"]:
df[f"Pctl_{var}"] = 100*(df[var].rank()/df[var].size).round(3)
model_score_vars = [f"outlier_score_{nm}" for nm in model_abbrs]
model_rank_vars = [f"Rank_{nm.upper()}" for nm in model_abbrs]
df[model_rank_vars] = df[model_score_vars].rank(axis=0).astype(int)
# Averaging the ranks is bigoted; we just need a countdown order
df["Rank_avg"] = df[model_rank_vars].mean(axis=1)
print("Counting right down to the best outlier...n")
print(
df.sort_values("Rank_avg",ascending=False
).tail(N_CALLOUTS)[["Name","AB","PA","H","2B","3B",
"HR","BB","HBP","SO","OBP",
"Pctl_OBP","SLG","Pctl_SLG"
] +
[f"Rank_{nm.upper()}" for nm in model_abbrs]
].to_string(index=False)
)
Counting right down to the best outlier...Name AB PA H 2B 3B HR BB HBP SO OBP Pctl_OBP SLG Pctl_SLG Rank_ELL Rank_ISO Rank_LOF Rank_SVM
Austin Barnes 178 200 32 5 0 2 17 2 43 0.256 2.6 0.242 0.6 19 7 25 12
J.D. Martinez 432 479 117 27 2 33 34 2 149 0.321 52.8 0.572 98.1 15 18 5 15
Yandy Diaz 525 600 173 35 0 22 65 8 94 0.410 99.2 0.522 95.4 13 15 13 10
Jose Siri 338 364 75 13 2 25 20 2 130 0.267 5.5 0.494 88.4 8 14 15 13
Juan Soto 568 708 156 32 1 35 132 2 129 0.410 99.2 0.519 95.0 12 13 11 11
Mookie Betts 584 693 179 40 1 39 96 8 107 0.408 98.6 0.579 98.3 7 10 20 7
Rob Refsnyder 202 243 50 9 1 1 33 5 47 0.365 90.5 0.317 6.6 5 19 2 14
Yordan Alvarez 410 496 120 24 1 31 69 13 92 0.407 98.3 0.583 98.6 6 9 18 6
Freddie Freeman 637 730 211 59 2 29 72 16 121 0.410 99.2 0.567 97.8 9 11 9 8
Matt Olson 608 720 172 27 3 54 104 4 167 0.389 96.5 0.604 99.2 11 6 7 9
Austin Hedges 185 212 34 5 0 1 11 2 47 0.234 0.3 0.227 0.3 10 1 4 3
Aaron Judge 367 458 98 16 0 37 88 0 130 0.406 98.1 0.613 99.4 3 5 6 4
Ronald Acuna Jr. 643 735 217 35 4 41 80 9 84 0.416 100.0 0.596 98.9 2 3 10 2
Corey Seager 477 536 156 42 0 33 49 4 88 0.390 97.0 0.623 99.7 4 4 3 5
Shohei Ohtani 497 599 151 26 8 44 91 3 143 0.412 99.7 0.654 100.0 1 2 1 1
It looks just like the 4 implementations mostly agree on learn how to define outliers, but with some noticeable differences in scores and likewise in ease of use.
Elliptic Envelope has narrower contours across the ellipse’s minor axis, so it tends to focus on those interesting players who run contrary to the general correlation between features. For instance, Rays outfielder José Siri ranks as more of an outlier under this algorithm because of his high SLG (88th percentile) versus low OBP (fifth percentile), which is consistent with an aggressive hitter who swings hard at borderline pitches and either crushes them or gets weak-to-no contact.
Elliptic Envelope can also be easy to make use of without configuration, and it provides the robust covariance matrix. If you’ve gotten low-dimensional data and an inexpensive expectation for it to be normally distributed (which is commonly not the case), it is advisable to try this straightforward approach first.
One-class SVM has more uniformly spaced contours, so it tends to emphasise observations along the general direction of correlation greater than the Elliptic Envelope. All-Star first basemen Freddie Freeman (Dodgers) and Yandy Diaz (Rays) rank more strongly under this algorithm than under others, since their SLG and OBP are each excellent (99th and 97th percentile for Freeman, 99th and ninety fifth for Diaz).
The RBF kernel required an additional step for standardization, nevertheless it also looked as if it would work well on this straightforward example without fine-tuning.
Local Outlier Factor picked up on the “cluster of excellence” mentioned earlier with a small bimodal contour (barely visible within the chart). For the reason that Dodgers’ outfielder/second-baseman Mookie Betts is surrounded by other excellent hitters including Freeman, Yordan Alvarez, and Ronald Acuña Jr., he ranks as only the Twentieth-strongest outlier under LOF, versus tenth or stronger under the opposite algorithms. Conversely, Braves outfielder Marcell Ozuna had barely lower SLG and considerably lower OBP than Betts, but he’s more of an outlier under LOF because his neighborhood is less dense.
LOF was essentially the most time-consuming to implement since we created robust distance matrices for fitting and scoring. We could have spent a while tuning K as well.
Isolation Forest tends to emphasise observations on the corners of the feature space, because splits are distributed across features. Backup catcher Austin Hedges, who played for the Pirates and Rangers in 2023 and signed with Guardians for 2024, is robust defensively however the worst batter (with no less than 200 plate appearances) in each SLG and OBP. Hedges might be isolated in a single split on either OBP or OPS, making him the strongest outlier. Isolation Forest is the only algorithm that didn’t rank Shohei Ohtani because the strongest outlier: since Ohtani was edged out in OBP by Ronald Acuña Jr., each Ohtani and Acuña might be isolated in a single split on only one feature.
As with common supervised tree-based learners, Isolation Forest doesn’t extrapolate, making it higher suited to fitting to a contaminated dataset for outlier detection than for fitting to an anomaly-free dataset for novelty detection (where it wouldn’t rating latest outliers more strongly than the prevailing observations).
Although Isolation Forest worked well out of the box, its failure to rank Shohei Ohtani because the best outlier in baseball (and possibly all skilled sports) illustrates the first limitation of any outlier detector: the information you utilize to suit it.
Not only did we omit defensive stats (sorry, Austin Hedges), we didn’t hassle to incorporate pitching stats. Because pitchers don’t even attempt to hit anymore… aside from Ohtani, whose season included the second-best batting average against (BAA) and Eleventh-best earned run average (ERA) in baseball (minimum 100 innings), a complete-game shutout, and a game during which he struck out ten batters and hit two home runs.
It has been suggested that Shohei Ohtani is a complicated extraterrestrial impersonating a human, nevertheless it seems more likely that there are two advanced extraterrestrials impersonating the identical human. Unfortunately, certainly one of them just had elbow surgery and won’t pitch in 2024… but the opposite just signed a record 10-year, $700 million contract. And because of outlier detection, now we will see why!