Development of machine learning models for the detection of surgical site infections following total hip and knee arthroplasty: a multicenter cohort study

Background Population based surveillance of surgical site infections (SSIs) requires precise case-finding strategies. We sought to develop and validate machine learning models to automate the process of complex (deep incisional/organ space) SSIs case detection. Methods This retrospective cohort study included adult patients (age ≥ 18 years) admitted to Calgary, Canada acute care hospitals who underwent primary total elective hip (THA) or knee (TKA) arthroplasty between Jan 1st, 2013 and Aug 31st, 2020. True SSI conditions were judged by the Alberta Health Services Infection Prevention and Control (IPC) program staff. Using the IPC cases as labels, we developed and validated nine XGBoost models to identify deep incisional SSIs, organ space SSIs and complex SSIs using administrative data, electronic medical records (EMR) free text data, and both. The performance of machine learning models was assessed by sensitivity, specificity, positive predictive value, negative predictive value, F1 score, the area under the receiver operating characteristic curve (ROC AUC) and the area under the precision–recall curve (PR AUC). In addition, a bootstrap 95% confidence interval (95% CI) was calculated. Results There were 22,059 unique patients with 27,360 hospital admissions resulting in 88,351 days of hospital stay. This included 16,561 (60.5%) TKA and 10,799 (39.5%) THA procedures. There were 235 ascertained SSIs. Of them, 77 (32.8%) were superficial incisional SSIs, 57 (24.3%) were deep incisional SSIs, and 101 (42.9%) were organ space SSIs. The incidence rates were 0.37 for superficial incisional SSIs, 0.21 for deep incisional SSIs, 0.37 for organ space and 0.58 for complex SSIs per 100 surgical procedures, respectively. The optimal XGBoost models using administrative data and text data combined achieved a ROC AUC of 0.906 (95% CI 0.835–0.978), PR AUC of 0.637 (95% CI 0.528–0.746), and F1 score of 0.79 (0.67–0.90). 
Conclusions Our findings suggest machine learning models derived from administrative data and EMR text data achieved high performance and can be used to automate the detection of complex SSIs. Supplementary Information The online version contains supplementary material available at 10.1186/s13756-023-01294-0.


Background
Surgical site infections (SSIs) are one of the most common healthcare-associated infections (HAIs) in postoperative procedures [1]. In North America, SSIs occur in 2-5% of all surgeries and are associated with extended hospital stays of 11 days, resulting in an increased care cost of 13,000 USD per patient admission [2,3]. The SSI rate varies significantly, ranging from 0.6 to 9.5%, depending on the type of surgical procedure, as reported in the European Centre for Disease Prevention and Control's (ECDC) 2023 surveillance report [4]. Patients with SSIs are more likely to be admitted to critical care units and have a five-fold increase in hospital readmissions [2]. About 77% of surgical patient deaths are associated with SSIs [2].
By the next decade, the demand for total hip (THA) and knee (TKA) arthroplasty procedures in the US is projected to grow by 174% and 673%, respectively [5]. While many infection prevention and control (IPC) strategies are implemented in clinical practice (e.g., improved ventilation in operating rooms, sterilization methods, surgical techniques, antibiotic prophylaxis), SSIs remain a substantial cause of adverse patient outcomes [6]. Surveillance programs audit the occurrence of SSIs. Identifying SSIs from large population-based databases can improve the completeness, accuracy, and efficiency of SSI surveillance programs [7]. In Canada, SSI case identification relies on International Classification of Diseases (ICD) codes [8], sometimes followed by a comprehensive chart review to confirm the presence of SSIs [7,9]. As such, traditional surveillance methods rely on manual chart review by trained reviewers. This process is time-consuming, labour-intensive, and expensive. Additionally, it is well established that administrative data-based adverse event detection methods are suboptimal due to under-coding or miscoding [10].
Electronic medical records (EMR) have been widely implemented and contain detailed and comprehensive information regarding all aspects of patient care, offering a valuable complement to coded data. Advances in artificial intelligence technologies have promoted research on free-text data and enabled the analysis of large, complex EMR text data sets. Machine learning models applied to EMR free-text data can significantly improve the detection of SSIs [11]. The purpose of this study was to determine the incidence of SSI and to develop machine learning models to automate the process of detecting complex (deep incisional/organ space) SSI following THA/TKA.

Patient cohort
We included adult patients (age ≥ 18 years) who were admitted to any tertiary acute care hospital in Calgary, Canada, and underwent primary total elective hip or knee arthroplasty between January 1st, 2013, and August 31st, 2020. Patients who underwent hemiarthroplasty, cement spacers, revisions, or abandoned procedures were excluded.

Data sources
The study cohort was defined using the Canadian Classification of Health Interventions (CCI) administrative codes documented in the Alberta Discharge Abstract Database (DAD), with up to 20 procedure codes per record [12]. Patient information was pulled if any of the following CCI codes were documented in their records: 1.VA.53 (Implantation of internal device, hip joint), 1.SQ.53 (Implantation of internal device, pelvis), 1.VG.53 (Implantation of internal device, knee joint), 1.VP.53 (Implantation of internal device, patella) [13]. Structured data such as patient demographic information, diagnosis codes (up to 25 ICD 10th revision in Canada [ICD-10-CA] codes), procedure details and patient outcomes were extracted.
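The code-based cohort selection described above can be sketched as follows. This is a minimal illustration, not the study's extraction code: it assumes each DAD record exposes its procedure codes as a list of strings and that a full CCI code begins with one of the four listed rubrics. The record layout, the helper name `is_cohort_record`, and the example codes are hypothetical, and the exclusion criteria (hemiarthroplasty, spacers, revisions, abandoned procedures) are not handled here.

```python
# The four CCI rubrics named in the text; matching on the code prefix is an assumption.
ARTHROPLASTY_PREFIXES = ("1.VA.53", "1.SQ.53", "1.VG.53", "1.VP.53")
MAX_PROCEDURE_CODES = 20  # the DAD holds up to 20 procedure codes per record


def is_cohort_record(procedure_codes):
    """Return True if any of the first 20 procedure codes starts with a target rubric."""
    return any(
        code.startswith(ARTHROPLASTY_PREFIXES)
        for code in procedure_codes[:MAX_PROCEDURE_CODES]
    )


# Invented placeholder records; the codes are for illustration only.
records = [
    {"admission_id": "A1", "procedure_codes": ["1.VG.53.LA-PN"]},
    {"admission_id": "A2", "procedure_codes": ["2.ZZ.02", "1.VA.53.LA-PN"]},
    {"admission_id": "A3", "procedure_codes": ["3.OT.20"]},
]

cohort_ids = [r["admission_id"] for r in records if is_cohort_record(r["procedure_codes"])]
print(cohort_ids)
```

Matching on any position among the procedure codes mirrors the "any of the following CCI codes" wording rather than restricting the match to the principal procedure.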
Sunrise Clinical Manager (SCM) is an inpatient electronic medical record system being used at the time of this study in all Calgary hospitals. SCM EMR captures demographic, clinical, and outcome data for all patients admitted to the study hospitals. To develop machine learning models, we extracted the free text data of nursing notes for patients who were readmitted to the Calgary hospitals within 90 days following a THA or TKA procedure. The patient's personal healthcare number (PHN) and unique lifetime identifiers (ULI) were used to link data sets. Patient records without valid PHNs or ULIs were excluded.

Keywords Surgical site infections, Total hip arthroplasty, Total knee arthroplasty, Machine learning

Highlights
• The incidence rates of surgical site infections following total hip and knee arthroplasty were 0.5 and 0.52 per 100 surgical procedures, respectively.
• The incidence of SSIs varied significantly between care facilities (ranging from 0.53 to 1.71 per 100 procedures).

Reference standard
Surgical site infection (SSI) is defined by the Centers for Disease Control and Prevention (CDC) as an infection that occurs after surgery in the part of the body where the surgery took place [14]. Manual case detection is supplemented with an administrative linkage using ICD-10 codes to increase case detection [7]. Since mandatory reporting of superficial SSIs was terminated in April 2018 in Alberta, the incidence rate of superficial SSIs was calculated using data collected before April 2018. In our reference data set, all patients were followed for 90 days after the surgical procedure date to observe if they developed infections. The results from this review served as the reference standard for developing and validating the machine learning models.

Data preprocessing and feature extraction
The proposed method used both structured and unstructured datasets. Please refer to Additional file 1 for information concerning data properties and the specifics of model development. After linking all datasets, we used the reference standard data to create an outcome variable with the values 'Not infected', 'Organ-space infection', or 'Deep incisional infection'. To build a structured dataset, we extracted all unique ICD-10 codes from DAD for the patient cohort to serve as main features and used one-hot encoding to represent each patient [16]. This technique yielded a sparse feature matrix in which each row corresponds to a patient's hospital stay and each column represents a unique ICD code. This approach allowed us to represent patient data efficiently in a concise format, which was passed to the downstream machine learning model together with the text dataset.
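To make the one-hot representation concrete, the sketch below encodes three invented hospital stays. It is an illustration under stated assumptions rather than the authors' pipeline: the stay identifiers and ICD-10 codes are made up, the function names `one_hot_encode` and `to_dense` are hypothetical, and the sparse matrix is stored simply as the list of non-zero column indices per row.

```python
def one_hot_encode(stays):
    """Map each stay's ICD-10 codes to a sparse binary row.

    Returns the sorted code vocabulary (one column per unique code) and,
    for each stay, the sorted column indices whose value is 1. All other
    entries are implicitly 0, which is what makes the matrix sparse.
    """
    vocabulary = sorted({code for codes in stays.values() for code in codes})
    column = {code: j for j, code in enumerate(vocabulary)}
    rows = {
        stay_id: sorted({column[code] for code in codes})
        for stay_id, codes in stays.items()
    }
    return vocabulary, rows


def to_dense(indices, width):
    """Expand one sparse row into a full 0/1 list (for inspection only)."""
    dense = [0] * width
    for j in indices:
        dense[j] = 1
    return dense


# Invented example stays; codes are illustrative, not study data.
stays = {
    "stay_1": ["M17.1", "I10"],
    "stay_2": ["M16.1", "T84.5", "I10"],
    "stay_3": ["M17.1"],
}

vocabulary, rows = one_hot_encode(stays)
print(vocabulary)                                  # column order
print(to_dense(rows["stay_2"], len(vocabulary)))   # one row of the matrix
```

In practice a library sparse format (for example a SciPy CSR matrix) would hold the same information; the index lists here only show why most entries never need to be stored.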
For the text dataset construction, we chose the Multidisciplinary Progress Report (MPR) from each patient's record in the SCM EMR database. An MPR is a nursing note that summarizes the nursing care plan and the patient's treatment response over a period of time. It also contains the patient's vital signs, medication administration, nursing interventions, and any changes to the patient's condition. The MPRs for our cohort were pre-processed with the following techniques in sequence: case folding, lemmatization, stopword removal, special character handling, medical concept extraction, and negation detection [16,17]. To analyze text, we used the Bag-of-Words (BOW) method, which converts text into feature vectors in which each position represents the frequency of a unique word or phrase in the text [16]. Then, we applied term frequency-inverse document frequency (TF-IDF) weighting to enhance the characterization of significant words in the BOW representation. The resulting TF-IDF scores provided a more robust measure of word importance in the analyzed documents [16]. We then concatenated the TF-IDF feature matrix with the ICD-10 feature matrix to obtain a merged representation for the patient cohort. After feature extraction, the dataset was split into training and testing sets at an 80:20 ratio.
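The TF-IDF weighting, feature concatenation, and 80:20 split described above can be sketched with scikit-learn and SciPy. This is a minimal sketch under assumptions, not the study's code: the notes, labels, and ICD indicator columns are invented; only case folding and English stopword removal are applied (lemmatization, concept extraction, and negation detection are omitted); and the n-gram range, stratified split, and random seed are choices made for the illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Invented nursing-note snippets and labels (1 = SSI, 0 = no SSI).
notes = [
    "incision dry and intact no drainage",
    "purulent drainage from incision wound swab sent",
    "patient ambulating well pain controlled",
    "wound red and warm purulent discharge noted",
    "dressing changed incision clean and dry",
    "fever overnight increased drainage from knee wound",
    "physiotherapy completed pain controlled",
    "wound culture positive antibiotics started",
    "vital signs stable incision intact",
    "swelling and purulent drainage at hip incision",
]
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Invented one-hot ICD indicators, one row per note (3 hypothetical codes).
icd_matrix = csr_matrix(np.array([
    [1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0],
    [1, 0, 1], [0, 1, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1],
]))

# Bag-of-words counts re-weighted by TF-IDF; unigrams and bigrams as "words or phrases".
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(notes)

# Concatenate structured and text features column-wise, keeping everything sparse.
features = hstack([icd_matrix, tfidf_matrix]).tocsr()

# 80:20 split; stratification is an assumption that keeps the rare class in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0
)
print(features.shape, X_train.shape, X_test.shape)
```

The sketch follows the order given in the text (feature extraction, then splitting). Fitting the vectorizer on the training portion only is a common alternative that keeps test-set vocabulary statistics out of the features.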

Model development
We developed nine XGBoost models to identify deep incisional SSI, organ/space SSI and complex SSI using administrative data, EMR free text data and both types of data. XGBoost is a gradient boosting method that combines an ensemble of weak decision trees to perform regression and classification. To optimize the performance of the XGBoost model, we performed a grid search using the GridSearchCV function from the Scikit-learn library [18]. The grid search involved creating a range of hyperparameter values, training the model for each combination of hyperparameters, and evaluating its performance using cross-validation and a specified scoring metric. Hyperparameters (e.g., learning_rate, max_depth, gamma, reg_lambda, etc.) were tuned to maximize the models' sensitivity. The optimal hyperparameters were used to train our final XGBoost models. The fine-tuned XGBoost models were evaluated using the held-out testing sets. For a detailed illustration of our methodology, please refer to Fig. 1.
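A grid search that tunes for sensitivity, as described above, might look like the sketch below. To keep it runnable without the XGBoost package, it uses scikit-learn's `GradientBoostingClassifier` as a stand-in; with XGBoost installed, `xgboost.XGBClassifier` would take its place and the grid could also include `gamma` and `reg_lambda`. The synthetic data, grid values, fold count, and seeds are assumptions for illustration and are not the study's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (about 10% positives); not study data.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical grid over two of the hyperparameters named in the text.
param_grid = {"learning_rate": [0.05, 0.3], "max_depth": [2, 4]}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=30, random_state=0),
    param_grid=param_grid,
    scoring="recall",  # recall of the positive class is sensitivity
    cv=3,              # 3-fold cross-validation on the training set only
)
search.fit(X_train, y_train)

# With refit=True (the default), the best model is retrained on all training data,
# and score() applies the same scoring metric to the held-out test set.
test_sensitivity = search.score(X_test, y_test)
print(search.best_params_, round(search.best_score_, 3), round(test_sensitivity, 3))
```

Optimizing recall alone tends to favour models that flag more cases, so the resulting precision should be checked separately, which is one reason the PPV, F1 score, and PR AUC are reported alongside sensitivity.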

Statistical analysis
Patient demographic and clinical characteristics were summarised using frequencies and percentages or medians and interquartile ranges (IQRs) as appropriate. The Charlson Comorbidity Index (CCI) was calculated for each patient based on the diagnosis codes (up to 25) documented in the DAD using the weighted score approach [19]. Chi-square tests and Wilcoxon rank-sum tests were used to compare categorical and continuous variables, respectively. Performance of SSI machine learning models was assessed by sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score. We computed the area under the receiver operating characteristic curve (ROC AUC) to evaluate the trade-off between the sensitivity and specificity of the XGBoost models across various thresholds. In this study, SSIs occurred far less often than non-infections, resulting in imbalanced data, which can present challenges during the evaluation of machine learning algorithms. The area under the precision-recall curve (PR AUC) was computed to present an average precision that combines PPV and sensitivity in a single measure. Unlike the ROC AUC baseline of 0.5 (random classifier), the PR AUC baseline is the fraction of positives among the total sample, so different classes have different baselines. The PR AUC is a useful performance measure for imbalanced data, where the incidence of SSI is low and the aim is to identify positive SSI cases with minimal false positives [20]. The Scikit-learn Python library was used for AUC statistics, and a bootstrap 95% confidence interval (95% CI) was calculated. The XGBoost Python library was used for model development, and the Imbalanced-learn library was applied for resampling training data. All statistical analyses were performed using Stata
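The threshold-based measures listed above and a bootstrap confidence interval can be sketched in plain Python. The text does not state which bootstrap variant was used, so the percentile method, the number of resamples, and the seed below are assumptions, as are the function names and the invented predictions.

```python
import random


def safe_div(numerator, denominator):
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, NPV and F1 from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "ppv": safe_div(tp, tp + fp),
        "npv": safe_div(tn, tn + fn),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }


def bootstrap_ci(y_true, y_pred, metric, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample cases with replacement, recompute the metric."""
    rng = random.Random(seed)
    n = len(y_true)
    estimates = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        value = metrics([y_true[i] for i in idx], [y_pred[i] for i in idx])[metric]
        if value == value:  # drop resamples where the metric is undefined (NaN)
            estimates.append(value)
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[min(len(estimates) - 1, int((1 - alpha / 2) * len(estimates)))]
    return lower, upper


# Invented test set: 10 infected and 90 non-infected cases.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 8 + [0] * 2 + [1] * 3 + [0] * 87

point = metrics(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred, "sensitivity")
print(point)
print(f"sensitivity 95% CI: {lower:.2f} to {upper:.2f}")
```

With only a handful of positive cases per resample, the interval for sensitivity is wide, which is consistent with the broad intervals that rare outcomes such as SSIs tend to produce.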

Study population
The study cohort consisted of 22,059 unique patients with 27,360 hospital admissions resulting in 88,351 days of hospital stay. This included 16,561 (60.5%) TKA and 10,799 (39.5%) THA procedures (Table 1). The median age was 66 years (IQR 59-73), 43.26% were male, and 96.6% were comorbidity-free. The patients spent a median of three days (IQR 2-4) in hospital at the time of the TKA and THA procedure, most of whom were discharged home.

SSIs description
Among all observed procedures, 17,991 were performed before April 2018, and 9,369 were performed after. The chart review ascertained 235 SSIs, resulting in an overall incidence rate of 0.86 per 100 surgical procedures. Of them, 77 (32.8%) were superficial incisional SSIs (66 of which occurred before 2018), 57 (24.3%) were deep incisional SSIs, 101 (42.9%) were organ space SSIs, and 158 (67.2%) were complex SSIs. The incidence rates were 0.37 for superficial incisional SSIs, 0.21 for deep incisional SSIs, 0.37 for organ space SSIs and 0.58 for complex SSIs per 100 surgical procedures.
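The incidence rates above follow directly from the reported counts, as the short calculation below shows. One assumption is made explicit: the superficial incisional rate is taken to use the 66 cases and the 17,991 procedures from the mandatory-reporting period, which reproduces the reported 0.37; all other rates use the full 27,360 procedures. The helper name `rate_per_100` is hypothetical.

```python
def rate_per_100(cases, procedures):
    """Incidence expressed as cases per 100 surgical procedures."""
    return 100 * cases / procedures


TOTAL_PROCEDURES = 27_360
PROCEDURES_BEFORE_APRIL_2018 = 17_991  # period with mandatory superficial SSI reporting

rates = {
    "overall": rate_per_100(235, TOTAL_PROCEDURES),
    # Assumption: 66 superficial cases over the pre-April-2018 denominator.
    "superficial incisional": rate_per_100(66, PROCEDURES_BEFORE_APRIL_2018),
    "deep incisional": rate_per_100(57, TOTAL_PROCEDURES),
    "organ space": rate_per_100(101, TOTAL_PROCEDURES),
    # Complex SSIs are deep incisional plus organ space infections.
    "complex": rate_per_100(57 + 101, TOTAL_PROCEDURES),
}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} per 100 procedures")
```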
SSI incidence varied significantly between hospitals (ranging from 0.53 to 1.71 per 100 procedures). A significant decrease was observed in the incidence of SSIs over the study period (incidence rate ratio [IRR] per year 0.93; 95% CI 0.87-0.98).
Table 2 describes the nature of SSIs in this study cohort. The median age of patients with an SSI was 66 (IQR 59-72), and 54.9% were male. Blood culture tests were positive for only 29.9% of superficial incisional SSIs, compared with 87.7% of deep incisional SSIs and 98.0% of organ space SSIs.

Discussion
In this population-based multicenter cohort study, we observed a modestly reduced incidence of SSIs following total hip and knee arthroplasty over the study period, in contrast to the findings reported in existing literature. The incidence of SSIs varied substantially across hospitals. We developed and evaluated nine machine learning models to identify SSIs from patient charts. The model that was developed using both structured and unstructured (nursing notes) data achieved the best performance. Applying these models has the potential to reduce the workload for chart reviews of traditional IPC surveillance programs.
Surveillance and reporting of SSIs are critically important to prevent and control healthcare-associated infections. Factors such as the data quality of different surveillance programs, the postsurgical follow-up process, and imperfect criteria potentially contribute to the discordance in the reported incidence of SSIs in the literature [22]. In our study, the SSI rates for TKA and THA were 0.52% and 0.5%, respectively. Comparatively, the CDC reported rates for TKA and THA were 0.65% and 0.4%, and the ECDC rates were 0.6% and 1.2%, respectively [4,23]. Our study's TKA and THA rates were generally slightly lower than the CDC and ECDC reported rates [2,24,25]. This finding is consistent with previously published studies [26]. The observed decrease in the incidence of SSIs throughout the study period might have resulted from uniform provincial surveillance initiated by the Alberta Health Services IPC program starting in March 2012 [27].
The detection of SSIs from large population-based cohorts is shifting from solely relying on the composition of ICD codes to a mixed use of structured and unstructured patient data, leveraging the advantages of machine learning techniques [11]. Clinical notes often contain valuable unstructured textual diagnoses and important clinical events, and have demonstrated enormous benefits for enhancing machine learning models' performance. For example, Bucher et al. developed a natural language processing approach using clinical notes to automate SSI surveillance [28]. As a result, they reached a sensitivity of 0.79 and ROC AUC of 0.852 in their external validation model. In our study, the optimal model achieved a sensitivity of 83.9% (95% CI 66.3-94.6%), ROC AUC of 0.906 (95% CI 0.835-0.978), PR AUC of 0.637 (95% CI 0.528-0.746) and F1 score of 0.79. Adding nursing notes in model development improved our model's general performance, with an increase in the F1 score from 0.699 to 0.788 and an increase in PR AUC from 0.52 to 0.64. Considering the comparison baseline of PR AUC is the incidence of SSI, the magnitude of improvement is substantial.
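The point about the PR AUC baseline can be made concrete with a rough calculation. It rests on two assumptions that this section does not confirm: that the reported PR AUC values refer to the complex SSI models, and that the prevalence in the test set is close to the cohort-wide complex SSI incidence (158 of 27,360 procedures). The resulting ratios are therefore indicative only.

```python
# Cohort-wide complex SSI prevalence, used here as an approximate PR AUC baseline.
complex_ssi_cases = 158
total_procedures = 27_360
baseline = complex_ssi_cases / total_procedures  # expected PR AUC of a random classifier

pr_auc_admin_only = 0.52   # reported PR AUC before adding nursing notes
pr_auc_combined = 0.637    # reported PR AUC with administrative and text data

lift_admin_only = pr_auc_admin_only / baseline
lift_combined = pr_auc_combined / baseline

print(f"baseline PR AUC: {baseline:.4f}")
print(f"administrative data only: {lift_admin_only:.0f}x baseline")
print(f"administrative + text data: {lift_combined:.0f}x baseline")
```

Under these assumptions both models sit roughly two orders of magnitude above a random classifier, whereas a ROC AUC of 0.906 is less than twice its baseline of 0.5, which illustrates why the two measures are read against different reference points.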
Our study highlighted that a standard text description structure for nursing notes in the EMR could potentially improve the accuracy of SSI detection models. For example, describing the observed evidence of SSIs (e.g., intraoperative cultures, purulent drainage, positive blood culture tests) and explicitly concluding in the notes that an SSI is present would substantially improve the ability of machines to identify SSIs from text patterns.
Our findings demonstrate that accurate machine learning models can be developed using administrative and EMR text data. The three sets of models developed in this study can be readily translated into surveillance programs. For example, the models could serve as a tool for the initial screening of patient charts to locate the most likely SSIs or exclude negative cases, saving time and cost and enabling large population-based surveillance. The developed models could also be applied to clinical practice to support quality improvement initiatives locally, nationally, or internationally. We believe that the developed models hold the potential to effectively decrease the workload of SSI surveillance, and determining the extent of this reduction represents a valuable direction for future research.
The generalizability of our models to other hospitals is a critical consideration. While the models demonstrated promising results in our specific setting, their applicability to other healthcare facilities may vary. The success of the models largely depends on the availability and quality of data in each hospital's EMR system. Therefore, rigorous validation and customization are strongly recommended before deploying our models in other settings to ensure their accuracy and effectiveness within the unique context of each hospital's healthcare environment.
Finally, while our model has shown promise, there is room for improvement, particularly in terms of precision and reliability. For instance, employing more advanced representations of data, such as language models and embeddings for text data, could be particularly beneficial. Transformer-based models such as BERT or GPT have shown a remarkable ability to understand the nuanced context within text and can convert text into high-dimensional vectors, or embeddings, that encapsulate semantic meaning. Utilizing these advanced techniques in our models represents a significant area for future work to improve our ability to detect SSIs.

Limitations
Our study had several limitations. First, the reported incidence rates of SSIs were calculated using 90 days of follow-up, as the literature suggests most SSIs tend to occur within the first 3 months following surgery [7,14,29]. Different follow-up periods may generate discordance in SSI incidence rates. While using a restricted follow-up period (e.g., 30 or 60 days) may improve the precision of models, the sensitivity will be compromised. Researchers need to choose the cut-offs according to their research objectives. Second, the imbalanced data may create challenges for machines to capture the text patterns of SSI cases. We employed random oversampling strategies during the model training phase to improve the performance of machine learning classification models on the imbalanced datasets. Third, we only included nursing notes for model development as they contain the most clinical detail of daily patient care and are universally documented in all patient records. Other clinical notes, such as diagnostic reports, surgery-related reports, and discharge summaries, were not included in this study.
Incorporating those notes may potentially enhance the sensitivity of the developed models, but it is likely that both the positive predictive value and overall performance will be greatly diminished. Lastly, the performance of models using clinical notes from the EMR database is contingent on the quality of reporting by nurses. Potential human errors, diverse documentation practices, and the adequacy of healthcare professionals' EMR training can influence the accuracy and reliability of the results.
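The random oversampling mentioned under the second limitation can be illustrated with a short, dependency-free sketch. The study applied the Imbalanced-learn library for resampling; the function below is a hypothetical stand-in that shows the idea only, with invented data and an arbitrary seed. It should be applied to the training set alone so that the test set keeps its natural class balance.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap up to the majority size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Invented training set: 5 infected cases among 100 (row ids stand in for features).
train_rows = list(range(100))
train_labels = [1] * 5 + [0] * 95

balanced_rows, balanced_labels = random_oversample(train_rows, train_labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

Because the added rows are exact copies of existing positive cases, oversampling changes the class weighting seen during training without adding new information, so it does not remove the need for imbalance-aware evaluation measures such as the PR AUC.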

Conclusions
Detecting SSIs from large population-based cohorts is imperative for IPC surveillance programs. Our findings suggest machine learning models derived from administrative data and nursing notes in EMR text data achieved high performance and can be used to automate the process of complex SSI detection.

Fig. 1
Fig. 1 Schematic representation of data linkage and ML model for SSI detection. MPR multidisciplinary progress report, SCM Sunrise Clinical Manager, SSI surgical site infection

Fig. 2
Fig. 2 Performance of XGBoost models for the detection of surgical site infections. A The area under the receiver operating characteristic curves (ROC AUC, left) and the area under precision-recall curves (PR AUC, right) for the XGBoost models based on administrative data. B The area under the receiver operating characteristic curves (ROC AUC, left) and the area under precision-recall curves (PR AUC, right) for the XGBoost models based on EMR text data. C The area under the receiver operating characteristic curves (ROC AUC, left) and the area under precision-recall curves (PR AUC, right) for the XGBoost models based on combined administrative and text data

Table 1
Characteristics of patients who underwent primary total elective hip or knee arthroplasty, 2013-2020. IQR interquartile range, SSIs surgical site infections. † Wilcoxon rank-sum test; ‡ Chi-square test

Table 2
Characteristics of patients with surgical site infections following total hip or knee arthroplasty, 2013-2020. IQR interquartile range, SSIs surgical site infections. † Wilcoxon rank-sum test; ‡ Chi-square test

Table 3
Performance measures for developed machine learning algorithms for the detection of SSIs. 95% CI 95% confidence interval, NPV negative predictive value, PPV positive predictive value, PR AUC the area under the precision-recall curve, ROC AUC the area under the receiver operating characteristic curve, SSIs surgical site infections