Karnuta Et Al 2020 Machine Learning Outperforms Regression Analysis To Predict Next Season Major League Baseball Player

This research investigates the effectiveness of machine learning (ML) models in predicting Major League Baseball player injuries compared to traditional regression analysis. The study analyzed data from 13,982 player-years between 2000 and 2017, finding that ML models outperformed logistic regression in predicting next-season injuries, particularly for position players. The results indicate that advanced ML techniques can provide valuable insights for injury prevention strategies in baseball.


Original Research

Machine Learning Outperforms Regression Analysis to Predict Next-Season Major League Baseball Player Injuries
Epidemiology and Validation of 13,982 Player-Years From Performance and Injury Profile Trends, 2000-2017
Jaret M. Karnuta,* MS, Bryan C. Luu,† BS, Heather S. Haeberle,*† MD, Paul M. Saluan,* MD,
Salvatore J. Frangiamore,* MD, Kim L. Stearns,* MD, Lutul D. Farrow,* MD,
Benedict U. Nwachukwu,‡ MD, Nikhil N. Verma,§ MD, Eric C. Makhni,‖ MD, MBA,
Mark S. Schickendantz,* MD, and Prem N. Ramkumar,*¶ MD, MBA
Investigation performed at the Cleveland Clinic, Cleveland, Ohio, USA

Background: Machine learning (ML) allows for the development of a predictive algorithm capable of imbibing historical data on a
Major League Baseball (MLB) player to accurately project the player’s future availability.
Purpose: To determine the validity of an ML model in predicting the next-season injury risk and anatomic injury location for both
position players and pitchers in the MLB.
Study Design: Descriptive epidemiology study.
Methods: Using 4 online baseball databases, we compiled MLB player data, including age, performance metrics, and injury
history. A total of 84 ML algorithms were developed. The output of each algorithm reported whether the player would sustain an
injury the following season as well as the injury’s anatomic site. The area under the receiver operating characteristic curve (AUC)
primarily determined validation.
Results: Player data were generated from 1931 position players and 1245 pitchers, with a mean follow-up of 4.40 years (13,982
player-years) between the years of 2000 and 2017. Injured players spent a total of 108,656 days on the disabled list, with a mean of
34.21 total days per player. The mean AUC for predicting next-season injuries was 0.76 among position players and 0.65 among
pitchers using the top 3 ensemble classification. Back injuries had the highest AUC among both position players and pitchers, at
0.73. Advanced ML models outperformed logistic regression in 13 of 14 cases.
Conclusion: Advanced ML models generally outperformed logistic regression and demonstrated fair capability in predicting
publicly reportable next-season injuries, including the anatomic region for position players, although not for pitchers.
Keywords: machine learning; injury prediction; injury prevention

The Orthopaedic Journal of Sports Medicine, 8(11), 2325967120963046
DOI: 10.1177/2325967120963046
© The Author(s) 2020

This open-access article is published and distributed under the Creative Commons Attribution - NonCommercial - No Derivatives License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits the noncommercial use, distribution, and reproduction of the article in any medium, provided the original author and source are credited. You may not alter, transform, or build upon this article without the permission of the Author(s). For article reuse guidelines, please visit SAGE’s website at http://www.sagepub.com/journals-permissions.

Baseball is one of the richest data-driven sports, in which a seemingly countless number of metrics exist to quantify player performance. Major League Baseball (MLB) represents a “national pastime” focused on analytics that drive not only the fan base and franchise’s personnel decisions but also the orthopaedic and sports medicine literature.14,22,27,28 With the increased attention to baseball injuries, outcomes, and performance, MLB, its players’ union, and minor league affiliates reached an agreement to create the MLB Health and Injury Tracking System (HITS) in 2010. While the goal of this system is to better understand player safety, access to the raw data is safeguarded, the database lacks prior injury data and is provided without the context of performance metrics.1

From the perspective of MLB franchises and athletes, Conte et al11 reported that the total annual cost of injuries from disabled list (DL) placement for franchises averaged more than US$423 million. In an industry where a single injury carries health, performance, and financial consequences for athletes, and in a sport laden with “big data,” the advent of machine learning (ML) arrives at an auspicious time to manage the growing performance and injury databases to answer complex questions. ML is a subset of artificial intelligence that uses computational algorithms that learn and improve from experience.4,11 In its most simplistic form, this involves using sets of real-world data to predict or estimate an outcome.2,4,11 These data sets represent “training sets” that the machine is then able to study and draw inferences from, or “learn,” using pattern recognition to make decisions on its own.4 Such conclusions are compared with a testing set of actual outcomes to quantify the accuracy of the algorithm. As the data in the training sets grow and the number of testing repetitions increases, akin to “experiential learning,” the machine’s algorithm becomes more accurate and predictive.

Logistic regression (LR) represents the most primitive form of ML and has been frequently applied in the literature.6,7 However, regression analysis is static and not predictive, meaning that it does not autoregulate to “learn” from complex data relationships, especially when more data inputs are added. This study represents the first foray, to our knowledge, in the sports medicine literature applying complex ML algorithms in which LR is compared against different ML algorithms. In this study, player characteristics, injuries, and performance metrics from 2000 to 2017 served as the initial training set from which the machine learned relationships to predict the most likely outcome for future players with similar profiles from a testing set. We hypothesized that, despite the complex scenarios that result in injuries and placement on the DL, an ML model trained in historical injury data may be capable of assessing the future injury risk in MLB players with high validity. Moreover, the anatomic location of the injury may be correctly predicted to target prevention. We believe that modern ML algorithms will be more representative models than primitive LR analyses in all clinical scenarios. For the purpose of leveraging available analytics to permit data-driven injury prevention strategies and informed decisions, the objective of this study of MLB players was to (1) characterize the epidemiology of injury trends on the DL from 2000 to 2017, (2) determine the validity of an ML model in predicting the injury risk for the subsequent year and anatomic injury location, and (3) compare the performance of modern ML algorithms versus LR analyses.

METHODS

Data Source and Database Creation

The data for this study were obtained from several readily accessible and validated sources previously studied in the literature: Baseball-Reference,19 FanGraphs,12 MLB’s Baseball Savant,3 and Professional Baseball Transactions Archive.21 These databases were cross-referenced to validate their content, and redundant variables were kept only if they were the same to a margin of error of less than 1%. Data from Baseball-Reference and FanGraphs were downloaded using the open-source pybaseball package, and a custom Python (Version 3.7.3; Python Software Foundation) programming language script was developed to download data from both MLB’s Baseball Savant and Professional Baseball Transactions Archive.21 Injury data were coded by the designated list (10-day, 15-day, or 60-day DL) to which the player was assigned (if applicable), the site of injury (knee, back, hand, foot/ankle, shoulder, elbow), if the injury required surgery, whether the injury required placement in the minor league for rehabilitation, or whether the injury resulted in the player’s being unable to play for the rest of the season. The total number of days away from sport was also tabulated as the sum of the total days on the DL plus 1 day for every injury labeled as a day-to-day injury (eg, if the player had an upper respiratory infection). Rookies, for whom prior injury data were unavailable, were not included in the study.

Once the raw data were collected, they were compiled using R (Version 3.5.1; R Foundation for Statistical Computing) and Python.24,25 All player injuries were grouped by year and summed to arrive at the total number of injuries for that year. These data were then paired to player statistics for each season, resulting in a list of player statistics and injuries for each season in which they were in the major leagues. The full data-processing code can be viewed at https://github.com/JaretK/BaseballInjuryLearning.

Data Processing and Feature Selection

Age, performance data, professional injury history, and DL data were inputted for each player across every MLB year

¶Address correspondence to Prem N. Ramkumar, MD, MBA, Cleveland Clinic, 9500 Euclid Avenue, Cleveland, OH 44106, USA (email: premramkumar@gmail.com) (Twitter: @prem_ramkumar).
*Orthopaedic Machine Learning Laboratory, Cleveland Clinic, Cleveland, Ohio, USA.
†Department of Orthopedic Surgery, Baylor College of Medicine, Houston, Texas, USA.
‡Hospital for Special Surgery, New York, New York, USA.
§Rush University Medical Center, Chicago, Illinois, USA.
‖Department of Orthopedics, Henry Ford Health System, West Bloomfield, Michigan, USA.
Final revision submitted April 21, 2020; accepted June 1, 2020.
One or more of the authors has declared the following potential conflicts of interest or source of funding: P.M.S. has received educational support from
Arthrex, consulting fees from DJO and DePuy, nonconsulting fees from Arthrex, and hospitality payments from the Musculoskeletal Transplant Foundation.
S.J.F. has received grant payments from Arthrex and DJO and educational support from Arthrex and Rock Medical. K.L.S. has received educational support
from Arthrex; consulting fees from Molnlycke Health Care; nonconsulting fees from Horizon Pharma; honoraria from Fidia Pharma; and hospitality payments
from Biomet Orthopedics, the Musculoskeletal Transplant Foundation, Ramsay Medical, and Stryker. L.D.F. has received consulting fees from Zimmer
Biomet and hospitality payments from the Musculoskeletal Transplant Foundation. B.U.N. has received educational support from Smith & Nephew and
hospitality payments from Stryker, Wright Medical, and Zimmer Biomet. N.N.V. has received educational support from Medwest; consulting fees from
Arthrex, Medacta, and Smith & Nephew; nonconsulting fees from Arthrex and Smith & Nephew; and royalties from Smith & Nephew. E.C.M. has received
educational support from Pinnacle (Arthrex), consulting fees from Smith & Nephew, and hospitality payments from Stryker. M.S.S. has received consulting
fees and nonconsulting fees from Arthrex. AOSSM checks author disclosures against the Open Payments Database (OPD). AOSSM has not conducted an
independent investigation on the OPD and disclaims any liability or responsibility relating thereto.
Ethical approval was not sought for the present study.

Figure 1. Schematic demonstrating machine learning algorithm development and testing.

that he played. Performance data included sabermetrics for hitting (eg, walks, strikeouts, home runs, slugging percentage, total bases, number of hits per base, runs batted in), pitching (eg, walks, strikeouts, number of innings pitched, number of pitches thrown per pitch type, number of intentional walks), and overall (eg, wins above replacement, win probability added, leverage index, clutch score). Sabermetrics are standardized metrics used to track baseball player performance (more details on each metric can be found at https://library.fangraphs.com/). Unique players were extracted from the databases using their MLB identification number.19,21

ML Algorithm Outputs

Algorithms were developed to predict each of the following 7 different outputs related to the subsequent season: next-season injury, next-season knee injury, next-season back injury, next-season hand injury, next-season foot/ankle injury, next-season shoulder injury, and next-season elbow injury.

ML Model Development and Calibration

Separate models were built for position players and pitchers. For each player group, we built models to predict 1 of the 7 clinical outcomes (next-season injury, next-season knee injury, next-season back injury, next-season hand injury, next-season foot/ankle injury, next-season shoulder injury, and next-season elbow injury). For each clinical outcome, 6 different model algorithms were created: LR, random forest, k-nearest neighbors, Naïve Bayes, XGBoost, and top 3 ensemble.10,17 Thus, a total of 84 models were formed: (2 [players and pitchers] × 7 [clinical outcomes] × 6 [different model algorithms]). Models were built using the scikit-learn Python library (Version 0.20.3) and XGBoost (Version 1.0.2).18,20,25,26 The ensemble classifier is a combination of the top 3 performing models (“top 3 ensemble”) for each clinical outcome. The ensemble classifier was built using “soft voting,” in which the model decided to classify a patient as “yes injury” or “no injury” on the average of each model’s predicted probability of an injury. All available data were fed into each model, including year of play to account for any temporal trends in the injury incidence.

Each model utilized a 10 k-fold strategy to cross-validate the model output; 10 k-folds require that 90% of the data be used to train the model, and the remaining 10% is used to test the model in an unbiased fashion. This step is repeated a total of 10 times, using a separate 10% of the data each iteration. This way, all of the data are eventually used to test the model without also being used to train each model (ie, 10% used to test the model per iteration, with 10 total iterations). Feature importance was calculated using the XGBoost model using the Gini importance metric. Figure 1 illustrates the flow of algorithm development and testing, with application to new player data.

All ML algorithms must be calibrated. The algorithms were tested for calibration against one another to ensure that the probability of a player injury was appropriately calculated.

Statistical Analysis

Descriptive statistics were calculated for the cohort. The weight of the input variables contributing to the overall injury risk was calculated using SHAP (SHapley Additive

exPlanations) scores.9 Receiver operating characteristic (ROC) curves and probability calibration curves were created for each outcome. Each model was compared using accuracy, area under the ROC curve (AUC), F1 score, and Brier score loss (BSL).15 AUC values of <0.7 are poor, ≥0.7 are fair, ≥0.8 are good, and ≥0.9 are excellent.30

The accuracy of the model summarizes the number of players correctly classified divided by the total number of players in each analysis. An F1 score represents the weighted average of precision and recall.13 Poor F1 scores are closer to 0, whereas better F1 scores are closer to 1.13 A lower BSL indicates a superior model and signifies the mean squared difference between the predicted probability and the actual probability.15 Because actual probabilities are necessarily 0 or 1, a perfect BSL (indicating a perfectly calibrated model) is 0 when predicted probabilities are equal to actual probabilities. Conversely, a BSL of 1 means that the predicted probabilities are the opposite of the actual probabilities.8 R was used for all statistical analyses.

RESULTS

Player Cohort

The position player group consisted of 9325 player-years (1931 unique players with a mean of 4.83 years of participation in MLB) from 2000 to 2017. Player injury characteristics are summarized in Table 1. A total of 4091 (44.0%) position player–years had prior injuries requiring loss of playing time, while 5225 (56.0%) had no evidence of injuries. Of the injuries that we collected, 147 player-years (1.6%) had at least 1 placement on the 10-day DL, 1859 (19.9%) had at least 1 placement on the 15-day DL, and 496 (5.3%) had at least 1 placement on the 60-day DL. A total of 3052 player-years (32.7%) had injuries that were designated as day to day but missed at least 1 game because of an injury.

The pitcher group consisted of 4657 player-years (1245 unique pitchers with a mean of 3.74 years played) from 2000 to 2017. A total of 2030 (43.6%) pitcher-years had prior injuries requiring loss of playing time (including day-to-day injuries). Of the injuries that we collected, 88 player-years (1.9%) had at least 1 placement on the 10-day DL, 1040 (22.3%) had at least 1 placement on the 15-day DL, and 319 (6.9%) had at least 1 placement on the 60-day DL. A total of 1004 player-years (21.6%) had injuries that were designated as day to day but missed at least 1 game because of an injury.

Injuries were subanalyzed by anatomic location and are summarized in Table 1. Overall, 37.2% of all knee injuries required DL placement, as did 27.2% of all back injuries, 34.8% of hand injuries, 35.0% of foot and ankle injuries, 50.4% of shoulder injuries, and 56.6% of elbow injuries. The 3176 players who were injured spent a total of 108,656 days injured, resulting in a mean of 34.21 total days per player for the duration of the study.

TABLE 1
Player Injury Characteristics^a

| Group | Characteristic | Player-Years, n (%) |
|---|---|---|
| Position players | Total | 9316 (100.0) |
| Position players | With prior injuries | 4091 (44.0) |
| Position players | Without prior injuries | 5225 (56.0) |
| Position players | ≥1 placement on 10-day DL | 147 (1.6) |
| Position players | ≥1 placement on 15-day DL | 1859 (19.9) |
| Position players | ≥1 placement on 60-day DL | 496 (5.3) |
| Position players | ≥1 game missed because of day-to-day injuries | 3052 (32.7) |
| Pitchers | Total | 4657 (100.0) |
| Pitchers | With prior injuries | 2030 (43.6) |
| Pitchers | Without prior injuries | 2627 (56.4) |
| Pitchers | ≥1 placement on 10-day DL | 88 (1.9) |
| Pitchers | ≥1 placement on 15-day DL | 1040 (22.3) |
| Pitchers | ≥1 placement on 60-day DL | 319 (6.9) |
| Pitchers | ≥1 game missed because of day-to-day injuries | 1004 (21.6) |
| Combined | Knee injury | 955 [355] (6.8) |
| Combined | Back injury | 1201 [327] (8.6) |
| Combined | Hand injury | 1668 [581] (11.9) |
| Combined | Foot and ankle injury | 925 [324] (6.6) |
| Combined | Shoulder injury | 1129 [569] (8.1) |
| Combined | Elbow injury | 643 [364] (4.6) |

^a Values in brackets indicate those requiring DL placement. DL, disabled list.

Predicting Next-Season (Future) Injuries

We predicted next-season injuries utilizing the injury and performance data from each player’s most recent season. Each player-year was treated independently from every other (ie, past injuries were not propagated through to future years). Each player-year was used to train the model using the performance data for the current year to predict injuries in the subsequent year. Thus, if a player was in the league for 5 complete years, his data would be used to train the model a total of 4 times (the last year was not used to train the model, as he would not have available future injury data).8,9 The model with the highest AUC for position players was the top 3 ensemble, with a mean AUC across 10 k-fold iterations of 0.76 ± 0.02. This model also had the best accuracy, at 70.0% ± 2.0%. Other models with their associated metrics are shown in Table 2. The top 3 ensemble’s ROC curves for each k-fold and the mean ROC curve are shown in Figure 2. Variables ranked by relative importance for predicting future position player injuries are shown in Figure 3.

The models with the highest AUC for pitchers were random forest and the top 3 ensemble, both with a mean AUC across 10 k-fold iterations of 0.65 ± 0.02. The top 3 ensemble model had the highest accuracy, at 63.7% ± 2.0%. Other models with their associated metrics are shown in Table 3. The top 3 ensemble’s ROC curves for each k-fold and the mean ROC curve are shown in Figure 4.
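The pairing of each season's data with the following season's injury outcome can be illustrated with a minimal sketch. This is not the authors' code (their processing scripts are in the repository cited in the Methods); the record keys `player_id`, `year`, `features`, and `injured` are hypothetical, and the sketch assumes that only consecutive seasons are paired, which the article does not state explicitly.

```python
from collections import defaultdict


def build_training_pairs(player_years):
    """Pair each season's features with the next season's injury flag.

    player_years: iterable of dicts with hypothetical keys
    'player_id', 'year', 'features', and 'injured' (bool).
    """
    by_player = defaultdict(dict)
    for row in player_years:
        by_player[row["player_id"]][row["year"]] = row

    pairs = []
    for seasons in by_player.values():
        for year, row in sorted(seasons.items()):
            following = seasons.get(year + 1)
            if following is None:
                # Assumption: without a following season there is no label,
                # so the last recorded season is never a training example.
                continue
            pairs.append((row["features"], following["injured"]))
    return pairs


# A player with 5 consecutive seasons contributes 4 training examples.
seasons = [
    {"player_id": 1, "year": 2010 + i, "features": {"age": 25 + i}, "injured": i % 2 == 1}
    for i in range(5)
]
pairs = build_training_pairs(seasons)
print(len(pairs))  # 4
```

Each pair is independent of the others, which mirrors the statement that past injuries were not propagated through to future years.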

TABLE 2
Models Predicting Future Injuries Among Position Players^a

| Model | Accuracy, % | AUC | F1 Score | Brier Score Loss |
|---|---|---|---|---|
| Logistic regression | 68.7 ± 1.9 | 0.74 ± 0.021 | 0.68 ± 0.027 | 0.20 ± 0.008 |
| Random forest | 69.0 ± 2.0 | 0.75 ± 0.020 | 0.70 ± 0.027 | 0.20 ± 0.008 |
| k-nearest neighbors | 60.1 ± 1.9 | 0.64 ± 0.017 | 0.59 ± 0.027 | 0.29 ± 0.010 |
| Naïve Bayes | 62.7 ± 3.0 | 0.71 ± 0.027 | 0.59 ± 0.071 | 0.35 ± 0.035 |
| XGBoost | 69.0 ± 2.1 | 0.75 ± 0.021 | 0.70 ± 0.029 | 0.20 ± 0.008 |
| Top 3 ensemble | 70.0 ± 2.0 | 0.76 ± 0.020 | 0.70 ± 0.029 | 0.20 ± 0.008 |

^a Values are reported as mean ± SD across 10 k-folds. AUC, area under the receiver operating characteristic curve.
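The "mean ± SD across 10 k-folds" reporting can be illustrated with a small standard-library sketch of a 10-fold partition, in which every sample is held out for testing exactly once. The function names are illustrative, not taken from the authors' code, and the use of the sample standard deviation is an assumption, since the article does not say which SD was reported.

```python
import random
import statistics


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def summarize(fold_scores):
    """Return (mean, sample SD) of per-fold scores (sample SD is an assumption)."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


splits = list(k_fold_indices(100, k=10))
# With 100 samples and k=10, each iteration trains on 90 and tests on 10.
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

In practice a per-fold metric such as the AUC would be collected for each split and passed to `summarize`.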

Figure 2. Position player receiver operating characteristic (ROC) curve for predicting future injuries based on prior-season performance and injuries, with sensitivity on the y-axis and 1-specificity on the x-axis. Area under the ROC curve (AUC) values of <0.7 are poor, ≥0.7 are fair, ≥0.8 are good, and ≥0.9 are excellent.
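As a rough illustration of how an AUC value is obtained and graded, the sketch below computes the AUC as the probability that a randomly chosen injured player receives a higher predicted score than a randomly chosen uninjured player, then maps it to the caption's categories. It assumes the thresholds are lower bounds (at least 0.7 fair, at least 0.8 good, at least 0.9 excellent); the function names are illustrative and the toy labels and scores are not study data.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_category(auc):
    """Map an AUC to the categories used in the figure captions."""
    if auc >= 0.9:
        return "excellent"
    if auc >= 0.8:
        return "good"
    if auc >= 0.7:
        return "fair"
    return "poor"


# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc, auc_category(toy_auc))  # 0.75 fair
```

Under this grading, the reported mean AUC of 0.76 for position players falls in the "fair" band and 0.65 for pitchers falls in the "poor" band.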

Predicting Location of Injury

For position players, the top 3 ensemble was the best predictive model for future injuries of each anatomic region, with the exception of the elbow, based on the AUC. Elbow injuries were best predicted with LR, with an accuracy of 63.0% ± 3.6% and an AUC of 0.61 ± 0.08. Table 4 shows the accuracy, AUC, F1 score, and BSL of the models with the highest AUCs for predicting future anatomic injuries. Based on the AUC, the top 3 ensemble was the best predictive model among pitchers for each of the 4 anatomic regions studied, as seen in Table 5. Given the lower AUCs with pitchers, the determinants of predicting an injury were not calculated.

DISCUSSION

ML and performance-related big data surrounding MLB, colloquially known as “sabermetrics,” have reached an echelon in which both may be symbiotically applied to answer questions previously thought to be unanswerable. After building a database requiring careful compilation of data from 13,982 player-years of performance and injury data from 1931 position players and 1245 pitchers, we analyzed usage of the DL over the past 17 seasons. From this, we found that 44.0% of position players and 43.6% of pitchers had prior injuries. The hand and back were the most commonly injured regions among position players, whereas shoulder and elbow injuries occurred most frequently in the

Figure 3. Variables ranked by relative importance for predicting future injuries among position players. Previous injuries and
weighted cutter runs per 100 pitches were the most important variables in predicting outcomes. The relative importance is
expressed as a fraction based on the weight of each variable, with 1.0 being the most important and 0.0 having no contribution
to the model. DL, disabled list.

TABLE 3
Models Predicting Future Injuries Among Pitchers^a

| Model | Accuracy, % | AUC | F1 Score | Brier Score Loss |
|---|---|---|---|---|
| Logistic regression | 60.9 ± 3.0 | 0.64 ± 0.03 | 0.54 ± 0.04 | 0.24 ± 0.003 |
| Random forest | 62.2 ± 2.0 | 0.65 ± 0.02 | 0.54 ± 0.02 | 0.23 ± 0.005 |
| k-nearest neighbors | 54.6 ± 3.3 | 0.54 ± 0.03 | 0.42 ± 0.02 | 0.33 ± 0.023 |
| Naïve Bayes | 58.9 ± 2.6 | 0.62 ± 0.03 | 0.38 ± 0.08 | 0.41 ± 0.024 |
| XGBoost | 60.3 ± 2.1 | 0.64 ± 0.01 | 0.54 ± 0.03 | 0.24 ± 0.004 |
| Top 3 ensemble | 63.7 ± 2.0 | 0.65 ± 0.02 | 0.55 ± 0.02 | 0.23 ± 0.003 |

^a Values are reported as mean ± SD across 10 k-folds. AUC, area under the receiver operating characteristic curve.
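The "top 3 ensemble" rows rely on soft voting, which averages the injury probability predicted by each member model before classifying. A minimal standard-library sketch follows; the 0.5 decision threshold and the function name are assumptions for illustration, and the probabilities are made up. The article states the models were built with scikit-learn, which offers `VotingClassifier(voting="soft")` for this purpose, but it does not say which class the authors used.

```python
def soft_vote(probabilities_by_model, threshold=0.5):
    """Average per-player injury probabilities across models, then threshold.

    probabilities_by_model: one list of probabilities per model, all the
    same length and in the same player order.
    Returns a list of (mean_probability, predicted_injury) tuples.
    """
    n_models = len(probabilities_by_model)
    averaged = [sum(col) / n_models for col in zip(*probabilities_by_model)]
    # Assumption: "yes injury" when the averaged probability reaches 0.5.
    return [(p, p >= threshold) for p in averaged]


# Hypothetical outputs of three models for three players.
model_outputs = [
    [0.8, 0.3, 0.6],
    [0.6, 0.2, 0.4],
    [0.7, 0.1, 0.2],
]
results = soft_vote(model_outputs)
print([label for _, label in results])  # [True, False, False]
```

Note that the third player is classified as "no injury" even though one model predicted 0.6, because the average across models is what decides.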

pitcher group. Once this database using publicly reported injuries was complete, we applied LR and advanced ML techniques to assess viability using an algorithm capable of predicting injuries among MLB players before they occurred. Using age, performance data, injury history, and DL data from 17 seasons, we found that our provisional models were predictive of next-season injuries with fair reliability (AUC = 0.71-0.80) among position players and poor reliability (AUC = 0.61-0.69) in pitchers using the top 3 ensemble model. The expected anatomic region of injury demonstrated poor to fair reliability depending on the site. The most important determinants of injury prediction for the subsequent year, in descending order, were as follows: prior injury, weighted cutter runs per 100 pitches, wins above replacement, and player age. Models for pitchers had lower reliability compared with the position player models, perhaps because of the limited data specific to overuse injuries available among modern pitcher databases. Importantly, however, we established that advanced ML models are superior to LR, as advanced ML models, usually the top 3 ensemble and random forest, outperformed LR in terms of the AUC in 13 of the 14 cases.

With the ubiquity of computing power and the availability of large patient data sets, ML represents a form of artificial intelligence that warrants expansion into sports injury prevention and risk management using data-driven predictive analytics. While the simultaneous analysis of thousands of player profiles cannot be fully explained,

Figure 4. Pitcher receiver operating characteristic (ROC) curve for predicting future injuries based on prior-season performance and injuries, with sensitivity on the y-axis and 1-specificity on the x-axis. Area under the ROC curve (AUC) values <0.7 are poor, ≥0.7 are fair, ≥0.8 are good, and ≥0.9 are excellent.

TABLE 4
Best Performing Models Predicting Future Injuries Among Position Players, as Determined by the Highest AUC^a

| Outcome (model) | Accuracy, % | AUC | F1 Score | Brier Score Loss |
|---|---|---|---|---|
| Future knee injury (top 3 ensemble) | 90.0 ± 1.3 | 0.68 ± 0.04 | 0.10 ± 0.07 | 0.10 ± 0.010 |
| Future back injury (top 3 ensemble) | 89.0 ± 1.4 | 0.73 ± 0.03 | 0.22 ± 0.06 | 0.11 ± 0.010 |
| Future hand injury (top 3 ensemble) | 84.2 ± 1.7 | 0.71 ± 0.04 | 0.23 ± 0.03 | 0.13 ± 0.010 |
| Future foot/ankle injury (top 3 ensemble) | 90.7 ± 0.9 | 0.67 ± 0.04 | 0.06 ± 0.04 | 0.11 ± 0.005 |
| Future shoulder injury (top 3 ensemble) | 93.2 ± 0.9 | 0.64 ± 0.05 | 0.06 ± 0.05 | 0.09 ± 0.004 |
| Future elbow injury (logistic regression) | 63.0 ± 3.6 | 0.61 ± 0.08 | 0.07 ± 0.02 | 0.23 ± 0.007 |

^a Values are reported as mean ± SD across 10 k-folds.
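The accuracy, F1 score, and Brier score loss columns can be reproduced in principle with the short sketch below. It is a standard-library illustration rather than the authors' implementation: F1 is computed as the harmonic mean of precision and recall (the usual definition), the 0.5 classification threshold is an assumption, and the labels and probabilities are invented.

```python
def accuracy(labels, preds):
    """Fraction of players classified correctly."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)


def f1_score(labels, preds):
    """Harmonic mean of precision and recall for the positive (injury) class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def brier_score_loss(labels, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


# Invented example: 2 injured players out of 10 (an imbalanced outcome).
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
probs = [0.6, 0.3, 0.2, 0.1, 0.1, 0.2, 0.6, 0.1, 0.2, 0.1]
preds = [1 if p >= 0.5 else 0 for p in probs]  # assumed 0.5 threshold

print(accuracy(labels, preds))        # 0.8
print(f1_score(labels, preds))        # 0.5
print(brier_score_loss(labels, probs))
```

When the injury is uncommon, as with the anatomic outcomes in this table, accuracy can be high while the F1 score stays low, so the columns are best read together.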

TABLE 5
Best Performing Models Predicting Future Injuries Among Pitchers, as Determined by the Highest AUC^a

| Outcome (model) | Accuracy, % | AUC | F1 Score | Brier Score Loss |
|---|---|---|---|---|
| Future knee injury (top 3 ensemble) | 83.0 ± 1.1 | 0.58 ± 0.04 | 0.24 ± 0.07 | 0.13 ± 0.01 |
| Future back injury (random forest) | 94.2 ± 1.4 | 0.73 ± 0.04 | 0.54 ± 0.04 | 0.06 ± 0.01 |
| Future hand injury (top 3 ensemble) | 92.9 ± 1.3 | 0.70 ± 0.06 | 0.11 ± 0.07 | 0.06 ± 0.01 |
| Future foot/ankle injury (top 3 ensemble) | 87.0 ± 0.8 | 0.57 ± 0.04 | 0.33 ± 0.05 | 0.15 ± 0.01 |
| Future shoulder injury (top 3 ensemble) | 83.0 ± 1.9 | 0.63 ± 0.04 | 0.23 ± 0.04 | 0.14 ± 0.01 |
| Future elbow injury (top 3 ensemble) | 86.6 ± 1.9 | 0.61 ± 0.06 | 0.17 ± 0.05 | 0.12 ± 0.01 |

^a Values are reported as mean ± SD across 10 k-folds. AUC, area under the receiver operating characteristic curve.

and the “black box” phenomenon is created with ML models, these dynamic algorithms are not unlike the clinical experience of an evolving surgeon in that they improve with additive data or “experience.” This study does not represent the first attempt to apply ML to baseball. Yang and Swartz29 created a Bayesian model expressed as a Markov chain that predicted division winners partway through a single season by combining prior winning percentages, overall batting ability, and the starting pitcher’s earned run average. Several ML analyses are well-described (ie, LR and random forest) in the literature already and may assist the team physician in predicting injuries or identifying subclinical abnormalities.5,16,17,23,27

Given the array of classic (ie, LR, random forest) and advanced modeling techniques, the results of this study demonstrate 3 important takeaway points to guide future orthopaedic and sports medicine research in this new frontier of injury modeling. First, a single predictive model is not necessarily ideally suited for all clinical questions posed. Specifically, the top 3 ensemble was the model with the highest AUC for predicting next season’s injury risk among position players and pitchers, but random forest was superior in predicting back injuries among pitchers. Thus, no single model represents a panacea, and we recommend that an advanced data engineer work in concert with professional franchises and medical professionals to determine the best-suited model for the clinical question. Second, we illustrated that with more iterations, the algorithm continued to improve or “learn.” After the 10th iteration of the next-season injury risk model for position players, the AUC improved to 0.80, reaching good validity, and was proven to be dynamic (unlike static LR analyses). Third, this is the first study in the sports medicine literature to demonstrate that regression analysis is not necessarily the gold standard when forecasting and predicting risk, especially in the intersecting world of big data, in which performance metrics, injury profiles, and sports medicine interventions are increasingly valued.

Beyond the analytic aspect of ML, how can these findings guide care of these elite MLB athletes? This algorithm offers the orthopaedic surgeon longitudinally caring for these players to more synchronously work with coaching and franchise management using quantitative, not qualitative, metrics. The model may identify players at risk for a shoulder injury during the subsequent year and prompt earlier targeted examinations, ushering in the era of “precision medicine” on the field. Earlier guided interven-

may provide a new perspective on how we approach recovery protocols and postoperative restrictions. For franchises seeking to identify at-risk players, individual player data may be uploaded into the algorithm and can provide the franchise and medical personnel with up to 70% accuracy on whether the player will sustain an injury the following year, allowing the franchise to make informed recruiting decisions. Team physicians may similarly use these tools in expectation management and patient counseling, with the ability to discuss the statistical likelihood of future injuries with players. To a lesser degree, ML was capable of identifying the anatomic region where the injury was likely to occur. This finding may be readily applied to provide the player in question with targeted physical therapy and neuromuscular adaptations.27

While current injury predictive modeling demonstrates limitations that make current deployment untenable, future refinement of these algorithms offers tangible potential utility. Knowledge of which players are likely to incur an injury has the potential of offering not only early interventions but also informed decision making for the organization before signing players to multiyear, multimillion-dollar contracts.28 Certainly, the ethics of predicting injuries merits a discussion. The implication of assigning a player such a value runs the risk of diminishing the player’s value to a franchise. However, a player’s predisposition to injuries has always been under qualitative consideration; this algorithm simply applies a quantitative probability of an injury. Conversely, players who are less likely to sustain injuries may experience an increase in value for availability. This algorithm may be used as a risk-management tool for professional players from the franchise’s perspective. It is conceivable that applying player-specific data to develop algorithms may not be in the best interest of the MLB Players Association (MLBPA) and may cause sufficient concern to highly regulate the development of these advanced models.

Our study had several limitations. First, we were limited by the granularity of available data. Because of inabilities to determine nuanced injury characteristics, such as imaging and physical examination findings, we could not discern at this stage whether the future injury would be attributed to, for example, an elbow sprain versus a complete tear of the medial ulnar collateral ligament. Additionally, we were unable to capture the impact that chronic, lingering injuries may have on future injuries, as team-reported injuries are generally acute and severe enough to withdraw players
tions may offer targeted medical attention that reduces from games. We also acknowledge that the lack of anatomic
time away from the game during critical moments, such specificity of our data prediction algorithm does highlight
as the playoffs. This approach offers key integration points the limited immediate clinical utility of such a model. How-
with the growing wearable market and certain companies ever, this proof-of-concept study provides the framework
that are applying ML algorithms to study human activities for future studies that, with more granular data, may
(including pitching and batting in real time) through sen- potentially explore more specific injury prediction. The
sors on the shoulder and elbow. As we continue to work large size of our database, sourced from multiple databases
with professional MLB franchises and acquire more specific across the entire MLB population for 17 years and cross-
pitcher data, this will certainly improve and may identify referenced for accuracy, gives confidence that our advanced
injuries in this specific population during practice to guide ML model can deduce future injury prediction with mean-
an athlete’s availability and risk profile. This may allow ingful accuracy in the absence of a formal power analysis.
team physicians and franchise personnel to make strategic Another limitation was the sources of input of the databases
decisions to withhold a pitcher from a rotation and quantify utilized to obtain MLB player injury history and performance
the value of rest and recovery, opening a new frontier that data. As previously stated, information was collected from 4
online baseball databases: Baseball-Reference, FanGraphs, MLB’s Baseball Savant, and Professional Baseball Transactions Archive. Both Baseball-Reference and FanGraphs are privately owned entities that compile information from a variety of sources, including companies that specialize in sports data acquisition and commerce. No public information is available on Professional Baseball Transactions Archive’s method of data collection. These 3 databases are not regulated by MLB and should naturally be evaluated with a degree of uncertainty. Baseball Savant, on the other hand, is endorsed by MLB; however, this database only publishes statistics starting in 2015. Moreover, we did not use the official MLB HITS data, as the HITS contains 6 years of data and is presently restricted from any performance-based analyses upon query and requires MLBPA approval. While public databases are certainly prone to inaccuracy and underreporting, a larger database with publicly reported DL and injury data is more than sufficient to preliminarily determine that these advanced computational techniques are predictive of future injuries, superior to regression analysis, and warrant further exploration. Compared with the position player data, the pitcher data were relatively less specific in terms of predictive variables, as factors such as practice pitch count, throwing form, and prior treatment modalities specific to this niche population were not included in the database. The current data set is limited to only game metrics and contains no wearable-based throwing motion data. Additional pitcher data, including pitch count and pitch type, may be readily added into our dynamic algorithm in the future to strengthen its accuracy and prediction confidence.

Despite the limitations of the present study, ML may have potential to play a role in the future of sports medicine. We found that player characteristics such as age, injury history, and performance metrics quantitatively predicted the injury risk for the subsequent year among MLB position players. The location of injury exhibited fair reliability, particularly with the back and hand in position players and pitchers. For pitchers, the prediction algorithms were shown to be less predictive than those used to make the position player models. This is likely because of generalized input parameters that require position-specific optimization to this niche population. While more data for the dynamic algorithm are required to strengthen insights predictive of injuries among these elite athletes, the prospect of applying ML to an elite sports population warrants further exploration, as it demonstrates superiority to the previous gold standard regression analysis, offers quantitative risk management for franchises, and presents an opportunity for targeted preventive interventions for medical personnel.

CONCLUSION

This study affirms the potential of ML in the prediction of the next-season injury risk for MLB players as well as the prediction of the injury’s anatomic location. Advanced ML models generally outperformed LR and demonstrated fair capability of predicting whether a publicly reportable injury was likely to occur the next season, including anatomic region, for position players, although not for pitchers. This study is one example of the potential integration of ML into the practice of clinical sports medicine and provides a foundation for future studies.

REFERENCES

1. Ahmad CS, Dick RW, Snell E, et al. Major and Minor League Baseball hamstring injuries: epidemiologic findings from the Major League Baseball Injury Surveillance System. Am J Sports Med. 2014;42(6):1464-1470.
2. Andrew G, Gao J. Scalable training of L1-regularized log-linear models. In: Proceedings of the 24th International Conference on Machine Learning: ICML ’07. Corvallis, Oregon: ACM Press; 2007:33-40.
3. Baseball Savant. Trending MLB players, Statcast and visualizations. Accessed May 1, 2019. https://baseballsavant.mlb.com/
4. Batista GEAPA, Prati RC, Monard MC. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter. 2004;6(1):20-29.
5. Beam AL, Kohane IS. Big data and machine learning in health care. JAMA. 2018;319(13):1317-1318.
6. Belk JW, Marshall HA, McCarty EC, Kraeutler MJ. The effect of regular-season rest on playoff performance among players in the National Basketball Association. Orthop J Sports Med. 2017;5(10):2325967117729798.
7. Bini SA. Artificial intelligence, machine learning, deep learning, and cognitive computing: what do these terms mean and how will they impact health care? J Arthroplasty. 2018;33(8):2358-2361.
8. Blagus R, Lusa L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics. 2013;14:106.
9. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research. 2002;16:321-357.
10. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: KDD ’16. New York: ACM Press; 2016:785-794.
11. Conte S, Camp CL, Dines JS. Injury trends in Major League Baseball over 18 seasons: 1998-2015. Am J Orthop (Belle Mead NJ). 2016;45(3):116-123.
12. FanGraphs Baseball. Baseball statistics and analysis. Accessed May 1, 2019. https://www.fangraphs.com/
13. Goutte C, Gaussier E. A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In: Losada DE, Fernández-Luna JM, eds. Advances in Information Retrieval. Lecture Notes in Computer Science. Berlin: Springer; 2005:345-359.
14. Hardy R, Ajibewa T, Bowman R, Brand JC. Determinants of Major League Baseball pitchers’ career length. Arthroscopy. 2017;33(2):445-449.
15. Hernández-Orallo J, Flach P, Ferri C. Brier curves: a new cost-based visualisation of classifier performance. In: Proceedings of the 28th International Conference on Machine Learning. Bellevue, WA: ICML; 2011:585-592.
16. Huber M, Kurz C, Leidl R. Predicting patient-reported outcomes following hip and knee replacement surgery using supervised machine learning. BMC Med Inform Decis Mak. 2019;19(1):3.
17. James LeDoux’s Blog. Introducing pybaseball: an open source package for baseball data analysis. Accessed May 1, 2019. https://jamesrledoux.com/projects/open-source/introducing-pybaseball/
18. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. In: Guyon I, Luxburg UV, Bengio S, eds. Advances in Neural Information Processing Systems 30. Curran Associates; 2017:4765-4774.
19. Baseball-Reference.com. MLB stats, scores, history, & records. Accessed May 1, 2019. https://www.baseball-reference.com/
20. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. Journal of Machine Learning Research. 2011;12:2825-2830.
21. Professional Baseball Transactions Archive. Home page. Accessed May 1, 2019. https://www.prosportstransactions.com/baseball/index.htm
22. Reznik A, Urish K. Understanding the impact of artificial intelligence on orthopaedic surgery. American Academy of Orthopaedic Surgeons. Accessed May 17, 2019. www.aaos.org/aaosnow/2018/sep/research/research01/
23. Rommers N, Rössler R, Verhagen E, et al. A machine learning approach to assess injury risk in elite youth football players. Med Sci Sports Exerc. 2020;52(8):1745-1751.
24. Schisterman EF, Perkins NJ, Mumford SL, Ahrens KA, Mitchell EM. Collinearity and causal diagrams: a lesson on the importance of model specification. Epidemiology. 2017;28(1):47-53.
25. Scikit-learn: machine learning in Python. Scikit-learn 0.20.3 documentation. Accessed March 30, 2019. https://scikit-learn.org/stable/index.html
26. Seabold S, Perktold J. Statsmodels: econometric and statistical modeling with Python. In: 9th Python in Science Conference. 2010.
27. Voskanian N. ACL injury prevention in female athletes: review of the literature and practical considerations in implementing an ACL prevention program. Curr Rev Musculoskelet Med. 2013;6(2):158-163.
28. Whiteside D, Martini DN, Lepley AS, Zernicke RF, Goulet GC. Predictors of ulnar collateral ligament reconstruction in Major League Baseball pitchers. Am J Sports Med. 2016;44(9):2202-2209.
29. Yang TY, Swartz T. A two-stage Bayesian model for predicting winners in Major League Baseball. Journal of Data Science. 2004;2(1):61-73.
30. Youngstrom EA. A primer on receiver operating characteristic analysis and diagnostic efficiency statistics for pediatric psychology: we are ready to ROC. J Pediatr Psychol. 2014;39(2):204-221.
