Database
The Optum© de-identified Integrated Claims-Clinical dataset combines adjudicated claims data with EHR data. The EHR database is derived from more than 50 healthcare provider organizations in the United States, including more than 700 hospitals and 7000 clinics that provide care to more than 91 million patients. Optum©'s Integrated Claims-EHR dataset is statistically de-identified under the Expert Determination method consistent with the Health Insurance Portability and Accountability Act (HIPAA) and is managed according to Optum© and customer data use agreements [12, 13]. The Integrated dataset links claims and clinical data for approximately 14 million matched individuals and is generated by Optum© using a proprietary algorithm that incorporates both salting and cryptographic hashing. The EHR information contained in the Integrated dataset includes medications prescribed and administered, laboratory results, vital signs, body measurements, diagnoses, procedures, and information derived from physician notes using natural language processing (NLP).
The NIHSS scores are part of the information derived from the physician notes [14], and these scores were used as the outcome variable when training and evaluating model performance. Because some invalid values were originally extracted from the physician notes (e.g., values that are not integers within the NIHSS range of 0-42), rigorous pre-processing was applied to exclude as many invalid NIHSS scores as possible. This exclusion criterion was defined in collaboration with Optum, and the remaining extracted NIHSS values were evaluated to ensure accuracy. When a patient had multiple NIHSS scores during their inpatient stay following stroke, the maximum score was used to capture the overall severity of the stroke. This study incorporated EHR data from January 2007 through September 2016.
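As an illustration, the validity filter and per-stay aggregation described above could be implemented along these lines. This is a minimal sketch, assuming a pandas DataFrame with hypothetical columns patient_id, stay_id, and nihss_raw; the actual pre-processing rules were defined with Optum and are not public.

```python
import pandas as pd

def clean_and_aggregate_nihss(df: pd.DataFrame) -> pd.DataFrame:
    """Drop invalid extracted NIHSS values, then keep the maximum
    score per inpatient stay (hypothetical column names)."""
    scores = pd.to_numeric(df["nihss_raw"], errors="coerce")
    # Keep only integer values within the valid NIHSS range (0-42).
    valid = scores.notna() & (scores % 1 == 0) & scores.between(0, 42)
    cleaned = df.loc[valid].assign(nihss=scores[valid].astype(int))
    # One outcome per stay: the maximum score observed during the stay.
    return (cleaned.groupby(["patient_id", "stay_id"], as_index=False)["nihss"]
                   .max())
```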
Study population
Patients were included in the study if they had a primary diagnosis of stroke (hemorrhagic [International Classification of Diseases (ICD)-9: 431; ICD-10: I61.XX], ischemic [ICD-9: 433.XX-434.XX, 436; ICD-10: I63.XXX], or transient ischemic attack (TIA) [ICD-9: 435.X; ICD-10: G45.9]) in an inpatient or emergency room setting; this encounter was defined as the stroke event (Fig. 1). Additionally, patients were required to have a real NIHSS score (extracted from physician notes) during the stroke event and to have been in the database for at least 6 months prior to the stroke diagnosis.
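A sketch of how such code-based inclusion logic might look in practice is shown below. The prefix patterns mirror the ICD code families listed above, while the DataFrame and column names (claims, dx_code, setting) are hypothetical; real claims data would also require handling code formatting and diagnosis position.

```python
import pandas as pd

# ICD-9/ICD-10 code families from the inclusion criteria above.
STROKE_PREFIXES = (
    "431", "I61",                 # hemorrhagic
    "433", "434", "436", "I63",   # ischemic
    "435", "G45.9",               # TIA
)

def is_stroke_event(claims: pd.DataFrame) -> pd.Series:
    """Flag rows whose primary diagnosis (assumed to be in dx_code)
    matches a stroke code family in an inpatient or emergency setting."""
    dx_match = claims["dx_code"].str.startswith(STROKE_PREFIXES)
    setting_ok = claims["setting"].isin(["inpatient", "emergency"])
    return dx_match & setting_ok
```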
From the Optum© de-identified Integrated dataset, 7149 patients were identified who met these inclusion and exclusion criteria. Prior to developing the machine learning model, patients were randomly split into a training set (n = 6116, 85.6%) and a hold-out test set (n = 1033, 14.4%). Model development, parameter tuning, and feature selection were completed using the training dataset.
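For completeness, a random split of this kind could be reproduced along these lines; patient_ids is a hypothetical list of the 7149 eligible patients, and the study's actual seed and splitting method are not published.

```python
from sklearn.model_selection import train_test_split

# Hold out roughly 14.4% of patients (1033 of 7149) for final evaluation.
train_ids, test_ids = train_test_split(
    patient_ids, test_size=1033 / 7149, random_state=0
)
```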
Feature engineering and feature selection
Relevant patient demographics (e.g., age, gender) and billing codes related to procedures, diagnoses, prescriptions/medications, hospital visit information, and comorbidities were collected to form the initial set of 8023 potential features. All features were created from the inpatient hospitalization following stroke occurrence (Fig. 1), except the Charlson Comorbidity Index, which was estimated based on data prior to the stroke [15]. Diagnosis codes were drawn from the ninth and tenth revisions of the ICD (ICD-9 and ICD-10) [16]. To create features for machine learning models that are agnostic to coding version, an equivalency mapping provided by the Centers for Medicare & Medicaid Services (CMS) was leveraged. Because the granularities of diagnosis codes differ between the ICD-9 and ICD-10 revisions, this mapping includes many-to-many relationships. Starting with all diagnosis codes within the stroke events for the patient cohort, and recursively incorporating any diagnosis codes that are equivalent according to the CMS mapping, disjoint diagnosis code groups were created. Binary features were then formed by checking each patient for the presence or absence of any diagnosis within a given 'diagnosis code group' during the patient's stroke event.
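The recursive grouping step is, in effect, a connected-components computation over the CMS equivalency mapping. A minimal sketch of one way to implement it is below; the function name and the representation of the mapping as (code_a, code_b) pairs are assumptions, not the authors' actual code.

```python
from collections import defaultdict

def build_code_groups(cohort_codes, cms_pairs):
    """Group diagnosis codes into disjoint sets, linking codes that the
    CMS mapping declares equivalent, directly or transitively.

    cohort_codes: iterable of ICD codes seen in the cohort's stroke events
    cms_pairs:    iterable of (code_a, code_b) equivalency pairs
    """
    adjacency = defaultdict(set)
    for a, b in cms_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)

    groups, assigned = [], set()
    for seed in cohort_codes:
        if seed in assigned:
            continue
        # Expand outward over the equivalency graph from this seed code.
        group, frontier = set(), [seed]
        while frontier:
            code = frontier.pop()
            if code in group:
                continue
            group.add(code)
            frontier.extend(adjacency[code] - group)
        groups.append(group)
        assigned |= group
    return groups
```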
Additionally, simple presence/absence features were created for procedures coded with the Current Procedural Terminology (CPT4), Healthcare Common Procedure Coding System (HCPCS) procedure codes, Berenson-Eggers Type of Service (BETOS) codes, patient discharge status, the diagnosis-related group assigned to the inpatient stay, the drug class of prescriptions written, the drug class of medications administered, and the routes by which medications were administered. Counts of procedures (e.g., CPT4) and BETOS code groups within a patient's stroke event were also included in the initial feature set. Other features included the patient's age at the time of stroke, gender, and length of hospital stay following stroke occurrence.
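A hedged sketch of how the binary and count features could be derived from coded events, assuming a long-format events table with hypothetical columns patient_id and code_group, restricted to each patient's stroke event:

```python
import pandas as pd

def make_code_features(events: pd.DataFrame) -> pd.DataFrame:
    """Pivot long-format coded events into per-patient count features,
    plus binary presence/absence indicators (hypothetical column names)."""
    counts = pd.crosstab(events["patient_id"], events["code_group"])
    presence = (counts > 0).astype(int).add_prefix("has_")
    return counts.add_prefix("n_").join(presence)
```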
During the feature selection process, features with near-zero variance or with high correlation (> 0.9) to another feature were removed. In the latter case, only the feature of the pair more highly correlated with the response variable was retained. A response-balanced subset of the training cohort was created for this step by randomly selecting an equal number of patients from each of five stroke severity categories [6] (n = 183 per category, n = 915 total); this step was necessary so that feature selection was not affected by the skewed distribution of stroke severity categories (i.e., more patients in less severe categories than in more severe categories; Fig. 2). After initial feature selection, the remaining 619 features were used for the subsequent modeling step.
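One plausible implementation of this filter is sketched below, under the assumption that X is a numeric feature DataFrame and y the NIHSS outcome on the balanced subset; the thresholds follow the text, everything else is illustrative rather than the authors' actual procedure.

```python
import pandas as pd

def filter_features(X: pd.DataFrame, y: pd.Series,
                    var_tol: float = 1e-8,
                    corr_max: float = 0.9) -> pd.DataFrame:
    """Drop near-zero-variance features, then resolve highly correlated
    pairs by keeping the feature more correlated with the response."""
    X = X.loc[:, X.var() > var_tol]        # near-zero variance filter
    corr = X.corr().abs()                  # feature-feature correlations
    target_corr = X.corrwith(y).abs()      # feature-response correlations
    drop = set()
    cols = corr.columns
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > corr_max and a not in drop and b not in drop:
                # Keep whichever of the pair tracks the response better.
                drop.add(b if target_corr[a] >= target_corr[b] else a)
    return X.drop(columns=sorted(drop))
```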
Machine learning model development
The imputed NIHSS scores were ultimately compared to the real NIHSS scores (extracted from physician notes) in the hold-out test dataset to assess model performance using the coefficient of determination (R²), the Pearson correlation coefficient (R), and root-mean-squared error (RMSE). During initial model development, performance was compared across a set of models developed by several machine learning approaches, including a random forest model, a gradient boosting model, a neural network, and linear regression. The random forest model, a meta-estimator that fits multiple decision trees on sub-samples of the dataset and averages their predictions, had the best performance. Model hyperparameters were optimized using a grid search, and performance was evaluated using three-fold cross-validation within the training data (Additional file 1: Table S1). Recursive feature elimination performed on the training data reduced the 619 features further, with only the top 100 features included in the final model. The top 100 features were selected because including additional features would have yielded only minor improvements in performance for a substantial increase in model complexity (Fig. 3).
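The workflow described in this section could be approximated with scikit-learn as follows. This is a sketch under stated assumptions: X_train/y_train and X_test/y_test come from the split described above, the hyperparameter grid is purely illustrative, and the published model's actual settings are given in Additional file 1: Table S1.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV

# Assumes X_train, y_train, X_test, y_test exist from the split above.

# Recursive feature elimination down to the top 100 features.
rfe = RFE(RandomForestRegressor(random_state=0), n_features_to_select=100)
rfe.fit(X_train, y_train)
X_train_top, X_test_top = rfe.transform(X_train), rfe.transform(X_test)

# Grid search with three-fold cross-validation (illustrative grid).
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [200, 500], "max_depth": [None, 10, 20]},
    cv=3,
)
search.fit(X_train_top, y_train)

# Hold-out evaluation with the metrics named in the text.
pred = search.best_estimator_.predict(X_test_top)
print("R2:  ", r2_score(y_test, pred))
print("R:   ", pearsonr(y_test, pred)[0])
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
```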