
Spine Deformity 6 (2018) 762–770

www.spine-deformity.org

Predicting Surgical Complications in Patients Undergoing Elective Adult Spinal Deformity Procedures Using Machine Learning
Jun S. Kim, MDa, Varun Arvind, BSa, Eric K. Oermann, MDb, Deepak Kaji, BAa,
Will Ranson, BSa, Chierika Ukogu, BAa, Awais K. Hussain, BAa, John Caridi, MDb,
Samuel K. Cho, MDa,*
a Department of Orthopaedic Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
b Department of Neurosurgery, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
Received 30 July 2017; revised 25 February 2018; accepted 1 March 2018

Abstract
Study Design: Cross-sectional database study.
Objective: To train and validate machine learning models to identify risk factors for complications following surgery for adult spinal
deformity (ASD).
Summary of Background Data: Machine learning models such as logistic regression (LR) and artificial neural networks (ANNs) are
valuable tools for analyzing and interpreting large and complex data sets. ANNs have yet to be used for risk factor analysis in orthopedic
surgery.
Methods: The American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP) database was queried for
patients who underwent surgery for ASD. This query returned 4,073 patients, whose data were used to train and evaluate our models. The
predictive variables used included sex, age, ethnicity, diabetes, smoking, steroid use, coagulopathy, functional status, American Society of
Anesthesiologists (ASA) class >3, body mass index (BMI), pulmonary comorbidities, and cardiac comorbidities. The models were used to
predict cardiac complications, wound complications, venous thromboembolism (VTE), and mortality. Using ASA class as a benchmark for
prediction, area under receiver operating characteristic curves (AUC) was used to determine the accuracy of our machine learning models.
Results: The mean age of patients was 59.5 years. Forty-one percent of patients were male whereas 59.0% of patients were female. ANN
and LR outperformed ASA scoring in predicting every complication (p<.05). The ANN outperformed LR in predicting cardiac
complication, wound complication, and mortality (p<.05).
Conclusions: Machine learning algorithms outperform ASA scoring for predicting individual risk prognosis. These algorithms also
outperform LR in predicting individual risk for all complications except VTE. With the growing size of medical data, the training of
machine learning on these large data sets promises to improve risk prognostication, and the ability of these models to learn continuously makes them
excellent tools in complex clinical scenarios.
Level of Evidence: Level III.
© 2018 Scoliosis Research Society. All rights reserved.
Keywords: Adult spinal deformity; Machine learning; Artificial neural network; Logistic regression; Risk prediction

Introduction
The advent of digital technology, machine learning and deep learning in particular, is increasingly making it possible to utilize big data to more precisely risk stratify and prognosticate how an individual patient will behave given a disease or intervention. Machine learning has already been used in other realms such as retail and search engines. However, healthcare has lagged in the uptake of newer techniques to leverage the rich information contained in electronic health records (EHRs).

Author disclosures: JSK (none), VA (none), EKO (none), DK (none), WR (none), CU (none), AKH (none), JC (none), SKC (grants from Zimmer, Orthopaedic Research and Education Foundation, and Stryker, outside the submitted work).
This study was approved by the Institutional Review Board of the Icahn School of Medicine at Mount Sinai, New York, NY.
*Corresponding author. Department of Orthopaedic Surgery, Icahn School of Medicine at Mount Sinai, 5 East 98th Street, Box 1188, New York, NY 10029, USA. Tel.: (212) 241-0276; fax: (646) 537-8531.
E-mail address: samuel.cho@mountsinai.org (S.K. Cho).
2212-134X/$ - see front matter © 2018 Scoliosis Research Society. All rights reserved.
https://doi.org/10.1016/j.jspd.2018.03.003

The practice of evidence-based medicine has sustained the progress seen in modern care and diagnosis. Traditional statistical approaches have gleaned much about what is known regarding risk factors used for prognostication. Machine learning (ML) combines these fundamental statistical insights with modern high-performance computing to learn patterns that can be used for recognition and prediction. Importantly, machine learning often identifies patterns that are not readily apparent to human intuition, thus identifying otherwise unknown connections [1]. Multivariate logistic regression and artificial neural networks are the two most commonly used machine learning models employed in medicine [2]. Artificial neural networks were first developed to model the neural architecture of the brain. Harnessing the structure of biology, artificial neural networks (ANNs) are particularly well suited for modeling complex, nonlinear data when little is known regarding the underlying distribution of the data or collinearity among the variables [3]. Importantly, ANNs can perform these functions without prior assumptions, leading to a highly adaptable system less susceptible to anchoring biases [3]. However, similar to any machine learning algorithm, neural networks are susceptible to intrinsic limitations and biases of the underlying data set. Additionally, limitations of model design such as neural network architecture, feature selection, and optimization functions can lead to model biases and overfitting that decrease generalizability and prognostication value of neural networks on external data [4]. Advancements in neural network science and proper implementation with recognition of these limitations are important for future integration of machine learning in surgical practice, and the utility of machine learning in adult deformity surgery has not yet been explored.

Adult spinal deformity (ASD) is a spinal disorder defined as a complex spectrum of spinal diseases that present in adulthood including adult scoliosis (progression of childhood scoliosis), degenerative scoliosis, sagittal and coronal imbalance, and iatrogenic deformity (with or without spinal stenosis) [5]. Adult degenerative scoliosis is the most common cause of ASD and is commonly seen in elderly adults, particularly those older than 60 years, as degeneration of intervertebral discs and facet joints exacerbates scoliotic curvature [6]. With the aging baby boomer generation and overall population structure of the United States, it is not surprising that the demand for and prevalence of ASD surgery continue to increase [7]. In the burgeoning era of rising healthcare costs and greater scrutiny over surgical outcomes, there has been increasing emphasis on understanding the risk factors and possible predictors to optimize perioperative planning and management. Data-driven clinical decision support tools have the potential to lead to cost savings by leveraging the information contained in large medical databases. Uptake of machine learning approaches in the realm of spinal surgery has lagged. However, the patient population and associated increased rate of postoperative complications render ASD a prime target for quality improvement through the utilization of machine learning.

ML algorithms have the capability of "learning" using newly generated information to improve their predictive capability. Briefly explained, these algorithms work by utilizing a subset of the overall study data (70% in this case) to "train" and create an accurate predictive model. This established model is then validated using the remainder of the data to determine the accuracy of the post-training model. This study seeks to develop and validate ML algorithms to precisely predict complications following ASD surgery using a national database, and to compare these algorithms with logistic regression (LR) and American Society of Anesthesiologists (ASA) classification.

Methods

Patient selection and preprocessing

The National Surgical Quality Improvement Program (NSQIP) database was used for the purpose of training and validating ANN and LR models. Adult patients (>18 years) undergoing adult deformity surgery were identified based on Current Procedural Terminology (CPT) codes 22800, 22802, 22804, 22808, 22810, 22812, 22818, and 22819. CPT codes 22843, 22844, 22846, or 22847 were also included to capture long, multilevel fusion constructs. Patients with CPT code 22842 or 22845 were included if they had an ICD-9 diagnosis for spinal deformity (including 737.1, 737.2, 737.3, 737.4, 737.8, 737.9). Cases with missing preoperative data, emergency cases, patients with a wound class of 2, 3, or 4, an open wound on their body, current sepsis, current pneumonia, prior surgeries within 30 days, cases requiring cardiopulmonary resuscitation prior to surgery, or neoplasm of spine were excluded in order to reduce the risk of confounding variables. This resulted in 5,818 ASD patients included in our analysis.

Training and Holdout Data Sets

For development of our models, 70% of the initial data (training set) were used for training while 30% (holdout) were set aside for posttraining evaluation. To overcome the low sample size for positive complication cohorts, the adaptive synthetic sampling (ADASYN) approach for imbalanced learning was used to generate synthetic positive complication cases from the training data set, increasing the number of positive cases in the training set and improving class balance. In brief, ADASYN utilizes examples from the minority class that are difficult to learn and generates synthetic new cases based on these examples to improve model learning and generalizability [8].

Feature selection

Input features used for training include sex, age, ethnicity (white, black, Hispanic, or other), history of

diabetes (insulin dependent and independent), history of smoking (within one year of surgery), steroid use (<30 days prior to surgery), history of bleeding disorders, functional status (independent or partially/totally dependent), ASA >3, BMI, and presence of pulmonary (ventilator dependent <48 hours prior to surgery or history of chronic obstructive pulmonary disease <30 days prior to surgery) or cardiac comorbidities (use of antihypertensive medication or history of chronic heart failure <30 days prior to surgery). All variables were selected from the ACS-NSQIP database and the ACS-NSQIP guide may be consulted for further explanation of variables.

In machine learning applications, the number of training examples required to reach a given accuracy grows exponentially with the number of irrelevant features [9]. To combat this, feature selection was performed to prevent overfitting and improve the overall generalizability of our models. Logistic regression analysis was performed on the training data set to obtain probability coefficients for each feature. Stepwise entry and removal of features was used. Prefiltering of features was done by a clinician with domain knowledge to minimize collinearity in the final model. The top six features identified as having the greatest regression coefficient magnitudes were chosen as input variables for the ANN and LR for mortality, venous thromboembolism (VTE), cardiac complications, and wound complications, respectively. Age and BMI were treated as continuous variables; all other features were categorical and were handled using one-hot encoding.

Machine learning construction and testing

Machine learning models were trained to predict occurrence of mortality, venous thromboembolism, wound complications, and cardiac complications. Deep ANNs were constructed using the Neural Network Toolbox in MATLAB 2016b (MathWorks, Inc., Natick, MA). L2 regularization was used to combat ANN overfitting by augmenting the error function used for training with the squared magnitude of the weights used in the ANN. This prevents overly complex models that are overfitted to a specific data set, improving predictive generalizability. Because of the large class imbalance even post-ADASYN, multiple ANNs were created by partitioning the majority class into subsets in a 1:1 ratio with the minority class, with one ANN trained on each partition. Subsequently, patients were distributed in a 2:1:1 ratio to generate training, validation, and testing data sets to train each ANN in a fivefold cross-validation scheme. Data that were not used for training (holdout data) were used for final testing of the ANN to provide an unbiased assessment of ANN performance. Data from the holdout set were fed into each ANN, and final predictions were based on individual accuracy-weighted predictions surveyed across each

Fig. 1. (A) Schematic of study workflow. (B) Diagram of ANN model. Bar lengths represent number of patient cases. ADASYN increases the number of
positive cases to combat class imbalance. Negative cases are then partitioned in a 1:1 ratio with the positive cases to create a class-balanced data set used for
ANN training. Each partition trains an independent neural net. During evaluation, data is fed through each neural net where the responses are surveyed,
weighted by the model’s accuracy, and the net prediction is used. ADASYN, adaptive synthetic sampling; ANN, artificial neural network.
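
The partition-and-vote step described in Fig. 1B can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the authors' MATLAB implementation: the function names, the toy case lists, and the choice to drop leftover majority cases that do not fill a whole partition are all hypothetical.

```python
import random


def partition_majority(majority, minority_size, seed=0):
    """Shuffle majority-class cases and cut them into chunks the size of the
    minority class. Leftover cases that do not fill a chunk are dropped
    (an assumption made for this sketch)."""
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)
    last_start = len(shuffled) - minority_size
    return [shuffled[i:i + minority_size]
            for i in range(0, last_start + 1, minority_size)]


def accuracy_weighted_vote(probabilities, accuracies):
    """Combine per-model predicted probabilities, weighting each model by
    its accuracy so that better models contribute more to the final call."""
    total = sum(accuracies)
    return sum(p * a for p, a in zip(probabilities, accuracies)) / total


# Toy data (hypothetical): 10 negative cases and 3 positive cases.
negatives = list(range(10))
positives = ["p1", "p2", "p3"]

# Each balanced set pairs one chunk of negatives with all positives (1:1);
# one model would be trained per balanced set.
partitions = partition_majority(negatives, len(positives))
balanced_sets = [chunk + positives for chunk in partitions]

# Hypothetical outputs of three trained models for a single holdout patient.
risk = accuracy_weighted_vote([0.9, 0.2, 0.4], [0.8, 0.6, 0.7])
```

Weighting by accuracy is one simple way to let stronger partition models dominate the combined prediction; other weighting schemes would fit the same structure.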

ANN. Lastly, the patient data set was randomized and repartitioned, and training and testing were performed again, iteratively five times, to allow statistical analysis of ANN performance in a process known as ensembling. Performance of the ANN was compared to traditional logistic regression that was trained and tested on the same data that the ANN was evaluated on. Furthermore, these two machine learning models were compared to the ASA physical status classification system. Classification performance for ANN, LR, and ASA was evaluated based on area under the receiver operating characteristic curve (AUC) with 95% confidence intervals (CIs).

Source of funding

There was no external source of funding.

Results

Data and analysis pipeline

A total of 5,818 patients were identified as having undergone ASD surgery between 2010 and 2014. Among this cohort, 4,073 patients (70%) were included in the training set and 1,746 patients (30%) were used as a holdout set for evaluating the trained machine learning models (Fig. 1). Following our exclusion criteria, 2,376 (41.0%) of patients were male, whereas 3,418 (59.0%) were female. The mean age was 59.5 years and the cohort exhibited low rates of complications across all outcomes. Our study uses cardiac complications, VTE, wound complications, and mortality as target outcomes. Specifically, there was a 0.5% (29 patients) rate of mortality, 2% (139) rate of wound complication, 2% (105) rate of VTE, and 0.7% (39) rate of cardiac complication (Table 1). Additionally, there were low rates of overlap between complications except between mortality and cardiac complications. Overall, 37.9% of patients who did not survive had preexisting cardiac complications. Unsurprisingly, age was a highly predictive feature across all outcomes. Diabetic status and tobacco usage were also useful features, which is consistent with their known association with poor clinical outcomes (Fig. 2) (Table 2) [10-12]. Furthermore, 214, 223, 217, and 212 patients were excluded from the cardiac complication, VTE, wound complication, and
Table 1
Patient characteristics for patients included within the data set for model construction.
Feature | Average | Total | Cardiac complication | VTE complication | Wound complication | Mortality
Sex, n (%)
Male | | 2376 (41.0) | 21 (0.9) | 50 (2.1) | 55 (2.3) | 15 (0.6)
Female | | 3418 (59.0) | 18 (0.5) | 55 (1.6) | 84 (2.5) | 14 (0.4)
Age, mean | 59.5
Ethnicity, n (%)
White | | 4703 (81.2) | 26 (0.6) | 89 (1.9) | 111 (2.4) | 22 (0.5)
Black | | 441 (7.6) | 6 (1.4) | 8 (1.8) | 11 (2.5) | 1 (0.2)
Hispanic | | 218 (3.8) | 1 (0.5) | 2 (0.9) | 3 (1.4) | 1 (0.5)
Other | | 432 (7.5) | 6 (1.4) | 6 (1.4) | 14 (3.2) | 5 (1.2)
Diabetes mellitus, n (%)
No | | 4947 (85.4) | 24 (0.5) | 91 (1.8) | 111 (2.2) | 20 (0.4)
Type 2 | | 575 (9.9) | 7 (1.2) | 11 (1.9) | 22 (3.8) | 7 (1.2)
Type 1 | | 272 (4.7) | 8 (2.9) | 3 (1.1) | 6 (2.2) | 2 (0.7)
Smoking history, n (%)
Smoker | | 1207 (20.8) | 6 (0.5) | 11 (0.9) | 26 (2.2) | 9 (0.7)
Nonsmoker | | 4587 (79.2) | 33 (0.7) | 94 (2.0) | 113 (2.5) | 20 (0.4)
Steroid use, n (%)
Steroid use | | 219 (3.8) | 1 (0.5) | 15 (6.8) | 5 (2.3) | 2 (0.9)
No steroid use | | 5575 (96.2) | 38 (0.7) | 90 (1.6) | 134 (2.4) | 27 (0.5)
History of bleeding disorder, n (%) | | 189 (0.9) | 1 (0.5) | 0 (0.0) | 1 (0.5) | 3 (1.6)
None | | 70 (1.2) | 0 (0.0) | 2 (2.9) | 3 (4.3) | 0 (0.0)
Functional status, n (%) | | 5724 (98.8) | 39 (0.7) | 103 (1.8) | 136 (2.4) | 29 (0.5)
Dependent | | 265 (4.6) | 5 (1.9) | 6 (2.3) | 10 (3.8) | 4 (1.5)
Independent | | 5529 (95.4) | 34 (0.6) | 99 (1.8) | 129 (2.3) | 25 (0.5)
ASA score, n (%)
>3 | | 3078 (53.1) | 33 (1.1) | 60 (1.9) | 99 (3.2) | 28 (0.9)
BMI, n (%) | 29.5
Comorbidities, n (%)
Pulmonary | | 294 (5.1) | 2 (0.7) | 6 (2.0) | 6 (2.0) | 5 (1.7)
Cardiac | | 3159 (54.5) | 25 (0.8) | 59 (1.9) | 89 (2.8) | 20 (0.6)
Complications, n (%)
Mortality | | 29 (0.5) | 11 (37.9) | 7 (24.1) | 2 (6.9) | 29 (100.0)
Wound complications | | 139 (2.4) | 3 (2.2) | 12 (8.6) | 139 (100.0) | 2 (1.4)
VTE | | 105 (1.8) | 4 (3.8) | 105 (100.0) | 12 (11.4) | 7 (6.7)
Cardiac complications | | 39 (0.7) | 39 (100.0) | 4 (10.3) | 3 (7.7) | 11 (28.2)
ASA, American Society of Anesthesiologists; BMI, body mass index; VTE, venous thromboembolism.
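
The degree of class imbalance implied by Table 1 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the counts are transcribed from the table, and using the sum of the male and female rows as the denominator is an assumption about how the percentages were computed.

```python
# Counts transcribed from Table 1. The denominator is the sum of the male and
# female rows (an assumption about how the table percentages were derived).
total_patients = 2376 + 3418

events = {"mortality": 29, "wound": 139, "vte": 105, "cardiac": 39}


def event_rate_percent(count, n):
    """Event rate as a percentage, rounded to one decimal place."""
    return round(100 * count / n, 1)


rates = {name: event_rate_percent(c, total_patients)
         for name, c in events.items()}

# Approximate number of negative cases per positive case for each outcome,
# which is the imbalance that resampling and partitioning aim to offset.
negatives_per_positive = {name: round((total_patients - c) / c)
                          for name, c in events.items()}
```

Under this assumption the computed rates agree with the percentages printed in the Total column, and mortality shows roughly two hundred negative cases for every positive one.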
mortality training sets, respectively, because of incomplete data (S1–4). As previously described, adaptive synthetic sampling was used to generate data from the minority class in the training set. A total of 200, 205, 200, and 198 cases were generated by ADASYN in the cardiac complication, VTE, wound complication, and mortality training sets, respectively, to improve learning with class-imbalanced data (S1–4).

Fig. 2. Coefficient weights obtained from logistic regression analysis used for feature selection. Dark cells indicate highly weighted features indicating a strong predictive value, and lighter cells indicate weakly weighted features.

ANN, LR, and ASA classification performance

ASA classification was used to benchmark ANN and LR performance. The AUC was used to measure the performance of our classifiers (Fig. 3). The logistic regression and ASA classifiers were outperformed by the ANN for every target except for VTE, in which ANN and ASA were outperformed by LR (Fig. 4). In order to compare the performance of ANN with LR and ASA, AUC for ANN was obtained from the holdout data. The ANN performed with an AUC of 0.768 (95% CI: 0.76–0.77) for predicting cardiac complications, 0.542 (95% CI: 0.53–0.55) for predicting VTE, 0.606 (95% CI: 0.60–0.61) for predicting wound complications, and 0.844 (95% CI: 0.82–0.86) for predicting mortality. In contrast, the LR performed consistently better than ASA as a classifier with an AUC of 0.690 (95% CI: 0.68–0.69) for cardiac complications, 0.547 (95% CI: 0.54–0.55) for VTE events, 0.575 (95% CI: 0.56–0.58) for wound complications, and 0.787 (95% CI: 0.76–0.80) for mortality. The ASA classifiers performed least effectively for all target outcomes, with an



Table 2
Logistic regression results prior to feature selection.
Outcome groups, left to right: Cardiac complications | Venous thromboembolism | Wound complications | Mortality
Feature | Coefficient, SE, t Stat, p value | Coefficient, SE, t Stat, p value | Coefficient, SE, t Stat, p value | Coefficient, SE, t Stat, p value
(Intercept) | 4.277, 0.182, 23.441, .000 | 3.564, 0.159, 22.439, .000 | 2.920, 0.103, 28.442, .000 | 4.320, 0.205, 21.122, .000
Sex | 0.006, 0.082, 1.398, .162 | 0.130, 0.074, 3.219, .001 | 0.016, 0.072, 0.907, .364 | 0.359, 0.082, 3.366, .001
Age | 0.659, 0.189, 6.098, .000 | 0.067, 0.127, 2.475, .013 | 0.002, 0.117, 0.435, .664 | 0.496, 0.162, 3.745, .000
Black | 0.261, 0.110, 0.291, .771 | 0.223, 0.091, 0.621, .535 | 0.005, 0.091, 0.247, .805 | 0.530, 0.119, 2.028, .043
Hispanic | 0.913, 0.136, 0.873, .383 | 0.642, 0.189, 1.792, .073 | 0.636, 0.170, 1.838, .066 | 0.015, 0.398, 2.123, .034
Other | 0.077, 0.068, 5.449, .000 | 0.068, 0.076, 1.191, .234 | 0.017, 0.075, 1.168, .243 | 0.289, 0.086, 0.865, .387
Diabetes mellitus type 2 | 0.187, 0.083, 1.821, .069 | 0.110, 0.094, 1.154, .249 | 0.011, 0.065, 1.994, .046 | 0.182, 0.079, 1.857, .063
Diabetes mellitus type 1 | 0.283, 0.071, 6.819, .000 | 0.248, 0.188, 2.911, .004 | 0.395, 0.179, 2.650, .008 | 0.224, 0.087, 0.293, .769
Smoke history | 0.196, 0.126, 3.034, .002 | 0.475, 0.120, 3.926, .000 | 0.139, 0.089, 1.928, .054 | 0.161, 0.092, 2.833, .005
Steroid history | 0.159, 0.169, 2.253, .024 | 0.438, 0.068, 5.715, .000 | 0.258, 0.088, 0.288, .774 | 0.031, 0.103, 0.284, .776
Bleeding disorder | 0.015, 0.042, 0.198, .843 | 0.582, 0.383, 2.640, .008 | 0.306, 0.043, 0.199, .842 | 0.773, 0.137, 0.306, .760
Functional status | 0.162, 0.070, 3.097, .002 | 0.017, 0.094, 0.212, .832 | 0.007, 0.120, 1.358, .174 | 0.136, 0.077, 2.438, .015
ASA >3 | 0.603, 0.115, 6.759, .000 | 0.010, 0.082, 1.492, .136 | 0.453, 0.081, 4.580, .000 | 1.337, 0.165, 8.534, .000
BMI | 0.445, 0.132, 0.300, .764 | 0.178, 0.115, 1.865, .062 | 0.585, 0.110, 1.886, .059 | 0.382, 0.125, 1.148, .251
Pulmonary comorbidity | 0.030, 0.116, 1.705, .088 | 0.059, 0.118, 1.252, .210 | 0.138, 0.117, 1.224, .221 | 0.099, 0.082, 0.038, .970
Cardiac comorbidity | 0.495, 0.095, 3.844, .000 | 0.048, 0.084, 1.092, .275 | 0.142, 0.083, 0.919, .358 | 0.100, 0.097, 0.503, .615
ASA, American Society of Anesthesiologists; BMI, body mass index; SE, standard error.
Coefficients with p values <.05 are bolded. The p value for each final regression model was p<.0001.
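
The coefficient-magnitude ranking used for feature selection can be illustrated with the mortality column of Table 2. The sketch below is a simplified stand-in, assuming that ranking by absolute coefficient is the only step; it omits the stepwise entry and removal and the clinician prefiltering described in the Methods, and the function name is hypothetical.

```python
# Mortality coefficients transcribed from Table 2 (magnitudes as printed).
mortality_coefficients = {
    "Sex": 0.359,
    "Age": 0.496,
    "Black": 0.530,
    "Hispanic": 0.015,
    "Other": 0.289,
    "Diabetes mellitus type 2": 0.182,
    "Diabetes mellitus type 1": 0.224,
    "Smoke history": 0.161,
    "Steroid history": 0.031,
    "Bleeding disorder": 0.773,
    "Functional status": 0.136,
    "ASA >3": 1.337,
    "BMI": 0.382,
    "Pulmonary comorbidity": 0.099,
    "Cardiac comorbidity": 0.100,
}


def top_features(coefficients, k=6):
    """Return the k feature names with the largest absolute coefficients."""
    ranked = sorted(coefficients,
                    key=lambda name: abs(coefficients[name]),
                    reverse=True)
    return ranked[:k]


selected = top_features(mortality_coefficients)
```

For mortality, this ranking returns ASA >3, bleeding disorder, black race, age, BMI, and sex, which is the same set of six predictors named for mortality in the Discussion.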

Fig. 3. Receiver operating characteristic curves plotting sensitivity versus 1-specificity for ANN (blue), LR (green), ASA score (red), and random-chance
(black-dashed). ANN, artificial neural network; ASA, American Society of Anesthesiologists; LR, logistic regression.
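
For readers less familiar with the AUC summarized by these curves, it can be computed directly as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below is a generic pure-Python illustration with made-up labels and scores; it is not the authors' evaluation code and does not reproduce their confidence intervals.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and
    negative cases. Ties between a positive and a negative count as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = complication) and predicted risks for six patients.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]
auc = roc_auc(labels, scores)
```

An AUC of 0.5 corresponds to the random-chance diagonal, so a binary score that carries little information about the outcome will sit close to that line.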

Fig. 4. Heatmap of AUC values from LR, ANN, and ASA when predicting
cardiac complications (cardiac), VTE, wound complications (wound),
and mortality. ANN, artificial neural network; ASA, American Society of
Anesthesiologists; AUC, area under the curve; LR, logistic regression;
VTE, venous thromboembolism.

Table 3
Comparison of AUC of a logistic regression, artificial neural network, and ASA evaluated on blinded data.
Outcome | LR (95% CI) | ANN (95% CI) | ASA (95% CI)
Cardiac | 0.690 (0.68–0.69) | 0.768 (0.76–0.77) | 0.469 (0.46–0.47)
VTE | 0.547 (0.54–0.55) | 0.542 (0.53–0.55) | 0.485 (0.47–0.49)
Wound | 0.575 (0.56–0.58) | 0.606 (0.60–0.61) | 0.508 (0.50–0.51)
Mortality | 0.787 (0.76–0.80) | 0.844 (0.82–0.86) | 0.516 (0.49–0.54)
ANN, artificial neural network; ASA, American Society of Anesthesiologists; AUC, area under the curve; CI, confidence interval; LR, logistic regression; VTE, venous thromboembolism.

Fig. 5. Confusion matrices of trained ANN and LR machine learners evaluated on holdout (A) mortality and (B) wound complication data sets to demonstrate real-world performance. ANN, artificial neural network; LR, logistic regression.
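
Sensitivity and likelihood ratios follow directly from confusion-matrix counts such as those in Fig. 5. The sketch below works through the ANN mortality figures quoted in the Discussion (9 of 11 mortalities detected, positive likelihood ratio 2.61). Specificity is not stated in the text, so it is back-solved here from the reported positive likelihood ratio; that step and the function names are assumptions made for illustration.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of actual positive cases that the model flags as positive."""
    return true_positives / (true_positives + false_negatives)


def likelihood_ratios(sens, spec):
    """Return (LR+, LR-) given sensitivity and specificity."""
    positive_lr = sens / (1 - spec)
    negative_lr = (1 - sens) / spec
    return positive_lr, negative_lr


# ANN on holdout mortality: 9 of 11 deaths predicted (from the Discussion).
ann_sensitivity = sensitivity(true_positives=9, false_negatives=2)

# Specificity is not reported in the text; back-solve it from LR+ = 2.61.
reported_positive_lr = 2.61
ann_specificity = 1 - ann_sensitivity / reported_positive_lr

lr_pos, lr_neg = likelihood_ratios(ann_sensitivity, ann_specificity)
```

With these inputs the sensitivity rounds to 82% and the negative likelihood ratio rounds to 0.26, matching the values reported for mortality.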

AUC of 0.469 (95% CI: 0.46–0.47) for cardiac complications, 0.485 (95% CI: 0.47–0.49) for VTE, 0.508 (95% CI: 0.50–0.51) for wound complications, and 0.516 (95% CI: 0.49–0.54) for mortality (Table 3). ANN sensitivity outperformed LR sensitivity as shown for mortality and wound complications in Figure 5.

Discussion

With the advent of large, prospective, multi-institutional clinical registries, physicians have access to large amounts of diverse, high-quality clinical data. This has given birth to ideas such as "precision medicine" with the goal of developing quantitative models that can be used to predict health status, prognosticate disease processes, prevent disease, and reduce complications. Previous groups have applied ANNs and other ML models to these data sets [13-16]. However, these studies either trained models on extremely large databases (>1,400,000 patients) or on complications with high occurrence rates. These examples are impractical for independent institutions or for small-scale procedures with rare complications. Low occurrence rates in relatively small data sets lead to large class imbalances that are a significant challenge in medical machine learning [17,18]. To this end, we have trained several supervised machine learning classifiers to predict the probability of postoperative complications in a relatively small data set (<15,000 patients) that can outperform traditional logistic regression and ASA classification sensitivity with relatively low occurrence rates (<1%). Furthermore, we have rigorously developed and tested our models by employing the best practices in machine learning in this study via performance of automated feature selection, L2 regularization, testing on blinded holdout data sets, and comparison to a standard risk-scoring system, in order to ensure a high standard necessary for implementation of machine learning in clinical settings.

The ANN model was superior to the LR, and both were superior to a clinical benchmark, the ASA score, with a statistically significantly higher AUC when predicting cardiac complications, wound complications, and mortality. The sensitivity of the ANN was superior to LR, indicating an ability to correctly identify a greater proportion of positive cases. This is an important characteristic for use in clinical situations, suggesting that ANNs are more suited for clinical prognostication than LR. There is room, however, for significant improvement. The ANN model had a positive likelihood ratio of 2.61 and 1.59 and a negative likelihood ratio of 0.26 and 0.58 for mortality and wound complication, respectively. In contrast, the LR model had a positive likelihood ratio approaching 0 and a negative likelihood ratio of 1 for both mortality and wound complication.

The ANN model exhibited a tendency to "overtreat," and in contrast, the LR model had a tendency to "undertreat" relative to that of the ANN. The LR model had a sensitivity of 0% whereas the sensitivity for the ANN model was 82%, where the ANN correctly predicted 9 of 11 mortalities. Optimization of the hyperparameters of the ANN algorithm, using a more granular, larger training data set, implementing other techniques for dealing with class imbalance (eg, undersampling), and altering the structure of the ANN itself (eg, adding more hidden layers) are ways in which the performance of the ANN can be improved. We expect that with a larger, more granular data set and further model optimization, we can increase the accuracy of our model. In particular, the use of natural language processing is a potential mechanism whereby the extensive and rich data contained within medical notes can be leveraged to further train artificial intelligence networks.

Automated feature selection with LR showed that age, male gender, black race, bleeding disorder, high ASA score, and BMI were strong independent risk factors for mortality, consistent with current surgical evidence. Wound complications were predicted by Hispanic race, BMI, diabetes mellitus, and high ASA score. VTE was predicted by Hispanic race, history of steroid use, bleeding disorder, and smoking history. Cardiac complications were predicted by age, Hispanic race, BMI, cardiac comorbidity, and high ASA score. These findings are echoed heavily by domain knowledge in prior spine literature [19,20].

A key strength of this study is the adaptability that can be achieved by interrogating medical data with different machine learning models. Indeed, neural network architectures alone are a diverse field of study that seeks to design optimal neural network structures to improve AI predictions [21]. In this study, a grid search was performed to identify optimal hyperparameters. However, this was only carried out over a certain defined domain of hyperparameters, not considering all types of network structures and other macro-scale parameters that may be more suited for medical prognostication. This presents a novel opportunity to design machine learners that are adept at learning and prognosticating based on patient data that is highly diverse, class-imbalanced, and often limited in sample size [22].

The ability of machine learning to identify at-risk patients and predict potential complications has been clearly demonstrated here, yet the ability to suggest avenues of treatment based on predicted complications has not yet been realized. Future work can take advantage of EMR and medical literature to suggest optimal treatment strategies based on key patient data. Such models can not only guide physicians in the decision-making process but can also aid health care systems in low-resource settings, provide personalized care, and improve response times in critical settings. Taken together, the opportunities described here can be used to strengthen medical artificial intelligence (AI) to improve surgical outcomes.

The use of machine learning has several limitations that deserve discussion. The performance of any classifier is rooted, in part, in the quality of the training data. Therefore,

weaknesses in the NSQIP are represented as weaknesses in the neural network classifier. Sources of biases such as selection bias from the NSQIP database translate to poor prognostication generalizability of the neural network and are an inextricable limitation of machine learning models. Larger national inpatient data sets such as the National Inpatient Sample (NIS) can be used in the future to address such biases. Such data sets sampling patients with a broad demographic spectrum can serve to elucidate patterns in the model that are both more generalizable and predictive of future complications and risk. A major challenge in medicine is the paucity of highly granular and robust large-scale data sets for specific operational cohorts. Large-scale databases remain scattered across institutions and are isolated to protect patient privacy [23]. An additional source of bias was the implementation of feature selection during model training. Increasing the number of features with small training sizes can lead to decreased classifier performance; thus, feature selection can be used to combat this [24]. However, omission of such features because of feature selection can exclude important data, leading to increased bias within the model. Larger data sets can be employed in the future, allowing for the inclusion of more features following selection. Furthermore, the NSQIP data set was not designed with spine surgery outcomes in mind. As a result, many features that may serve as stronger inputs were not available. For instance, ICD diagnoses, radiologic data, spine surgery–specific complications, and more granular operative variables are lacking in the NSQIP database. This poses a limitation that is intrinsic to the data set.

The models presented in this paper have several limitations that deserve attention. Logistic regression is a generalized linear model that provides a robust and easy-to-interpret model with informative outputs such as odds ratios and statistics. However, logistic regression requires no collinearity among independent features, which can be difficult to assume in medical research unless an independent study has been performed. Artificial neural networks can be considered a generalization of logistic regression that includes hidden inner layers of logits that in theory allow neural networks to approximate any function, while traditional logistic regression is limited. Unlike logistic regression, following training neural networks cannot

models, and future studies can look to modifying the underlying optimization function to address this issue.

However, the ability to predict clinical outcomes using a multidimensional data set is an attractive prospect, and the methods in this study can be extrapolated to any clinical data set where there are class-imbalance issues between those patients who experience complications and those who do not.

The advent of machine learning algorithms and their implementation in a healthcare environment makes the use of such machine learning increasingly possible. In the past, generalized linear models such as logistic regression have been the most commonly used classifiers for this purpose. However, the machine learning models described here, particularly the ANN, are similarly powerful, and in some circumstances, far exceed logistic regression. As the ability to obtain high-quality patient data and computing power increases over time, it is likely that machine learning techniques will find themselves increasingly commonplace in the hospital setting.

References
[1] Cruz JA, Wishart DS. Applications of machine learning in cancer prediction and prognosis. Cancer Inform 2007;2:59–77.
[2] Dreiseitl S, Ohno-Machado L. Logistic regression and artificial neural network classification models: a methodology review. J Biomed Inform 2002;35:352–9.
[3] Deo RC. Machine learning in medicine. Circulation 2015;132:1920–30.
[4] Crown WH. Potential application of machine learning in health outcomes research and some statistical cautions. Value Health 2015;18:137–40.
[5] Youssef JA, Orndorff DO, Patty CA, et al. Current status of adult spinal deformity. Global Spine J 2013;3:51–62.
[6] Birknes JK, White AP, Albert TJ, et al. Adult degenerative scoliosis: a review. Neurosurgery 2008;63:94–103.
[7] Cowan JA, Dimick JB, Wainess R, et al. Changes in the utilization of spinal fusion in the United States. Neurosurgery 2006;59:15–20; discussion 15.
[8] He H, Bai Y, Garcia EA, Li S. ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). IEEE 2008. p. 1322–8.
[9] Blum AL, Langley P. Selection of relevant features and examples in machine learning. Artif Intell 1997;97:245–71.
[10] Møller AM, Pedersen T, Villebro N, Munksgaard A. Effect of smok-
currently be deconstructed to provide any insight into the ing on early complications after elective orthopaedic surgery. J Bone
structure of the function being approximated. As a result, Joint Surg Br 2003;85:178e81.
neural networks have been considered ‘‘black boxes’’ in [11] Iorio R, Williams KM, Marcantonio AJ, et al. Diabetes mellitus, he-
machine learning. This means that although neural net- moglobin A1C, and the incidence of total joint arthroplasty infection.
works may provide prognosticative value, it is currently not J Arthroplasty 2012;27:726e729.e1.
[12] Chen S, Anderson MV, Cheng WK, Wongworawat MD. Diabetes
possible to understand how the networks are using data to associated with increased surgical site infections in spinal arthrodesis.
make predictions and thus not possible to provide clinical Clin Orthop Relat Res 2009;467:1670e3.
reasoning with underlying pathophysiologic support. [13] Van Esbroeck A, Rubinfeld I, Hall B, Syed Z. Quantifying surgical
Additionally, the prognosticative capability of our models complexity with machine learning: looking beyond patient factors
was dependent on disease prevalence. Class imbalance can to improve surgical models. Surgery 2014;156:1097e105.
[14] Hu Z, Simon GJ, Arsoniadis EG, et al. Automated detection of post-
lead to bias when training machine learning algorithms due operative surgical site infections using supervised methods with elec-
to heavily favoring the majority class. As a result of this, tronic health record data. Stud Health Technol Inform 2015;216:
we observed a high false positive rate associated with our 706e10.
[15] Sohn S, Larson DW, Habermann EB, et al. Detection of clinically important colorectal surgical site infection using Bayesian network. J Surg Res 2017;209:168–73.
[16] Bilimoria KY, Liu Y, Paruch JL, et al. Development and evaluation of the universal ACS NSQIP surgical risk calculator: a decision aid and informed consent tool for patients and surgeons. J Am Coll Surg 2013;217:833–842.e1.
[17] Krell MM, Wilshusen N, Seeland A, Kim SK. Classifier transfer with data selection strategies for online support vector machine classification with class imbalance. J Neural Eng 2017;14:025003.
[18] Wang Q, Luo Z, Huang J, et al. A Novel Ensemble Method for Imbalanced Data Learning: Bagging of Extrapolation-SMOTE SVM. Comput Intell Neurosci 2017;2017:1827016.
[19] Somani S, Di Capua J, Kim JS, et al. Comparing NIS and NSQIP: an independent risk factor analysis for risk stratification in anterior cervical discectomy and fusion. Spine (Phila Pa 1976) 2017;42:565–72.
[20] Wang TY, Martin JR, Loriaux DB, et al. Risk assessment and characterization of 30-day perioperative myocardial infarction following spine surgery: a retrospective analysis of 1346 consecutive adult patients. Spine (Phila Pa 1976) 2016;41:438–44.
[21] Tsai J-T, Chou J-H, Liu T-K. Tuning the structure and parameters of a neural network by using hybrid Taguchi-genetic algorithm. IEEE Trans Neural Netw 2006;17:69–80.
[22] Cios KJ, Moore GW. Uniqueness of medical data mining. Artif Intell Med 2002;26:1–24.
[23] Weber GM, Mandl KD, Kohane IS. Finding the missing link for big biomedical data. JAMA 2014;311:2479–80.
[24] Jain AK, Chandrasekaran B. 39 Dimensionality and sample size considerations in pattern recognition practice. Handbook Stat 1982;2:835–55.
