A Machine Learning Risk Prediction Model for Gastric Cancer with SHapley Additive exPlanations
Abstract
Purpose
Gastric cancer (GC) prediction models hold potential for enhancing early detection by enabling the identification of high-risk individuals, facilitating personalized risk-based screening, and optimizing the allocation of healthcare resources.
Materials and Methods
In this study, we developed a machine learning-based GC prediction model utilizing data from the Korean National Health Insurance Service, encompassing 10,515,949 adults who had not been diagnosed with GC and underwent GC screening during 2013-2014, with a follow-up period of 5 years. The cohort was divided into training and test datasets at an 8:2 ratio, and class imbalance was mitigated through random oversampling.
Results
Among various models, logistic regression demonstrated the highest predictive performance, with an area under the receiver operating characteristic curve (AUC) of 0.708, which was consistent with the AUC obtained in external validation (0.669). Importantly, the outcomes were robust to missing data imputation and variable selection. The SHapley Additive exPlanations (SHAP) algorithm enhanced the explainability of the model, identifying advancing age, being male, Helicobacter pylori infection, current smoking, and a family history of GC as key predictors of elevated risk.
Conclusion
This predictive model could significantly contribute to the early identification of individuals at elevated risk for GC, thereby enabling the implementation of targeted preventive strategies. Furthermore, the integration of noninvasive and cost-effective predictors enhances the clinical utility of the model, supporting its potential application in routine healthcare settings.
Introduction
Gastric cancer (GC) remains a significant global health issue, ranking as the fifth most common malignancy and the fourth leading cause of cancer-related mortality worldwide [1]. Despite a decline in incidence in certain regions, GC persists as a major concern, particularly in Korea, where it was the fourth leading cause of cancer deaths and the second most prevalent cancer in 2020 [2].
Prognostic outcomes for GC are highly variable, contingent upon the clinical stage at diagnosis. Patients with metastatic GC exhibit a poor prognosis, with a 5-year survival rate of approximately 7%, contrasting sharply with the 75% 5-year survival rate observed in patients with localized GC [3]. This disparity underscores the critical importance of early detection in improving patient outcomes. The development of GC typically follows a progressive sequence of precancerous stages, including chronic gastritis, atrophic gastritis, intestinal metaplasia, and dysplasia.
Early-stage GC, however, is often asymptomatic or presents with nonspecific symptoms, leading to delays in diagnosis [4]. As a result, routine screening is imperative for the timely identification of GC, with prior studies demonstrating a correlation between regular screening and reduced GC mortality [5,6]. In Korea, the National Cancer Screening Program (NCSP) offers biennial gastroscopy to individuals over 40 years of age, which has been shown to significantly increase the likelihood of detecting GC at a localized stage compared to those who have never been screened [7]. Additionally, participation in the NCSP has been associated with lower GC mortality [8].
Nevertheless, the widespread implementation of population-based screening programs incurs substantial costs. For example, the biennial gastroscopy program for individuals over 40 in Korea is funded by an annual national budget allocation exceeding 250 million dollars. Moreover, gastric endoscopy, while valuable, is not without risks, including potential complications such as infection, bleeding, and perforation [9]. Therefore, optimizing screening intervals and integrating personalized prevention strategies through risk-based stratification are essential. GC prediction models serve as practical tools for stratifying populations according to GC risk, thereby facilitating the development of more targeted and cost-effective screening and prevention protocols.
While several GC prediction models have been previously proposed [10,11], many fail to incorporate critical risk factors, such as Helicobacter pylori infection or histologically diagnosed gastric adenoma. Furthermore, most existing models rely predominantly on traditional methodologies, such as the Cox proportional hazards model (CPHM), with limited utilization of machine learning approaches. This study sought to develop a machine learning-based GC prediction model that integrates key GC risk factors. However, machine learning models are frequently criticized for their lack of transparency and explainability [12]. To address this challenge, we employed a recently proposed explainable artificial intelligence algorithm to enhance the interpretability of our machine learning-based prediction model, ensuring that the reasoning process is more accessible and comprehensible.
Materials and Methods
In Korea, GC screening is provided biennially by the NCSP to individuals aged over 40 years. Approximately 50%-60% of the Korean population participates in this program, and the gastroscopy results are systematically recorded in the National Health Insurance Service (NHIS) database. This study established a comprehensive longitudinal health information dataset, integrating a qualification database (a national medical check-up database encompassing health-related behavior data, family history, and screening results) and a treatment database that detailed disease types. Utilizing this dataset, the study followed adults aged 40-74 years who underwent gastroscopy screening for GC through the NCSP in 2013 and 2014. Individuals diagnosed with GC prior to screening or those who died within 5 years of screening were excluded from the analysis. For external validation, the Cancer Screenee Cohort of the National Cancer Center, comprising a dataset of 31,285 individuals, was employed to assess the generalizability and performance of the predictive models on independent datasets.
The outcome variable was defined as binary, indicating the presence or absence of GC. GC cases were identified using the International Classification of Diseases, 10th revision (ICD-10) code C16, either as a primary or secondary diagnosis, along with the special codes V193 or V194. The study population was followed for 5 years post-screening to ascertain the development of cancer, categorizing participants into GC and non-GC groups based on their diagnosis.
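As a rough illustration of how such a claims-based outcome could be constructed, the sketch below flags individuals with a C16 diagnosis (primary or secondary) plus the V193/V194 special codes within 5 years of screening. The data frames and column names (claims, cohort, person_id, icd10_main, icd10_sub, special_code, diag_date, screen_date) are hypothetical placeholders and do not reflect the actual NHIS table layout.

```r
# Hypothetical sketch: deriving the binary GC outcome from claims records.
# All table and column names are placeholders, not actual NHIS field names.
# `claims` is assumed to be pre-joined with each person's screening date.
library(dplyr)

gc_ids <- claims %>%
  filter(grepl("^C16", icd10_main) | grepl("^C16", icd10_sub)) %>%  # primary or secondary C16
  filter(special_code %in% c("V193", "V194")) %>%                   # cancer-specific special codes
  filter(diag_date > screen_date,
         diag_date <= screen_date + 365.25 * 5) %>%                 # within 5 years of screening
  distinct(person_id)

cohort <- cohort %>%
  mutate(gc = as.integer(person_id %in% gc_ids$person_id))          # 1 = GC, 0 = non-GC
```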
To predict GC occurrence within 5 years, the study incorporated 17 variables previously identified as GC risk factors: age, sex, body mass index (BMI), smoking, and alcohol consumption; family history of GC, colorectal cancer, or liver cancer; and personal history of hypertension, diabetes mellitus, myocardial infarction/angina pectoris, stroke, dyslipidemia, colorectal cancer, liver cancer, H. pylori infection, or gastric adenoma. The previous use of an H. pylori eradication regimen, indicative of an individual's H. pylori history, was determined by analyzing prescription history data in the treatment database. Gastric adenoma diagnosis was confirmed through gastric endoscopy and gastric tissue biopsy results. For sensitivity analysis, feature selection from the aforementioned 17 risk factors was performed using stepwise variable selection or least absolute shrinkage and selection operator (LASSO) regression, aimed at removing potentially redundant factors and mitigating the risk of data overfitting [13].
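The two feature-selection strategies used in the sensitivity analysis could be implemented along the following lines; this is a minimal sketch assuming the training data are held in a data frame `train` with the binary outcome `gc` and the 17 candidate predictors, and it is not the authors' actual code.

```r
# Sensitivity-analysis feature selection: LASSO and stepwise selection (sketch).
library(glmnet)

# (1) LASSO with a cross-validated penalty; nonzero coefficients indicate selected features
x <- model.matrix(gc ~ ., data = train)[, -1]   # predictor matrix without the intercept
y <- train$gc
cv_lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 5)
coef(cv_lasso, s = "lambda.min")                # rows with nonzero estimates are retained

# (2) Stepwise selection by AIC, starting from the full logistic regression model
full_fit <- glm(gc ~ ., family = binomial, data = train)
step_fit <- step(full_fit, direction = "both", trace = 0)
formula(step_fit)                               # predictors retained after stepwise selection
```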
After excluding observations with missing data for any risk factor, a total of 10,515,949 individuals (men, 4,678,843; women, 5,837,106) with complete data were included in the analysis. To address the substantial amount of missing data, particularly in self-reported lifestyle variables within the NHIS dataset, we conducted a sensitivity analysis using multiple imputation (m=5) via the Multivariate Imputation by Chained Equations (MICE) package.
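A minimal sketch of the multiple-imputation sensitivity analysis with the MICE package follows, assuming the analysis variables (with missingness) are held in a data frame `nhis`; the predictors in the pooled model are illustrative only.

```r
# Multiple imputation (m = 5) with MICE, followed by the standard fit-and-pool workflow.
library(mice)

imp <- mice(nhis, m = 5, seed = 2024)        # default imputation methods chosen per variable type
completed_1 <- complete(imp, action = 1)     # first of the five completed datasets

# Fit the model on each imputed dataset and pool the estimates (illustrative predictors)
fit    <- with(imp, glm(gc ~ age_group + sex + hp_infection, family = binomial))
pooled <- pool(fit)
summary(pooled)
```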
Given the low prevalence of GC in the population, which may result in data imbalance, we implemented Random Over-Sampling Examples (ROSE) on the training dataset to achieve a balanced ratio of 5:5 between individuals with and without GC prior to developing the prediction model. We employed a chi-squared test to compare the characteristics of individuals with GC versus those without GC.
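Random oversampling with the ROSE package could be applied to the training partition roughly as follows; a sketch assuming a data frame `train` with the binary outcome `gc`, not the authors' actual code.

```r
# Random oversampling of the minority (GC) class to an approximately 5:5 ratio.
library(ROSE)

set.seed(2024)
train_bal <- ovun.sample(gc ~ ., data = train,
                         method = "over",    # oversample the rare class only
                         p = 0.5)$data       # target proportion of the rare class
table(train_bal$gc)                          # classes should now be roughly balanced
```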
The NHIS dataset was randomly divided into training and test datasets at an 8:2 ratio. We then developed three machine learning-based prediction models: logistic regression, decision tree, and eXtreme Gradient Boosting (XGBoost). Hyperparameter tuning for the logistic regression and decision tree models was conducted through fivefold cross-validation on the training set, while the boosting iteration for the XGBoost model was fixed at 500. The performance of these prediction models was assessed based on accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUROC).
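The overall modeling pipeline could be sketched as below: an 8:2 split, the three classifiers, and AUROC evaluation on the held-out test set. The sketch assumes a data frame `nhis` with the binary outcome `gc`; cross-validated hyperparameter tuning is omitted for brevity, and the boosting iteration is fixed at 500 as described.

```r
# Sketch of the training/evaluation pipeline (not the authors' actual code).
library(rpart)
library(xgboost)
library(pROC)

set.seed(2024)
idx   <- sample(seq_len(nrow(nhis)), size = floor(0.8 * nrow(nhis)))
train <- nhis[idx, ]                          # in the study, ROSE oversampling is applied to this partition
test  <- nhis[-idx, ]

# Logistic regression
lr_fit  <- glm(gc ~ ., family = binomial, data = train)
lr_pred <- predict(lr_fit, newdata = test, type = "response")

# Decision tree (outcome treated as a factor for classification)
dt_fit  <- rpart(factor(gc) ~ ., data = train, method = "class")
dt_pred <- predict(dt_fit, newdata = test, type = "prob")[, "1"]

# XGBoost with 500 boosting iterations
x_train <- model.matrix(gc ~ ., data = train)[, -1]
x_test  <- model.matrix(gc ~ ., data = test)[, -1]
xgb_fit  <- xgboost(data = x_train, label = train$gc, nrounds = 500,
                    objective = "binary:logistic", verbose = 0)
xgb_pred <- predict(xgb_fit, newdata = x_test)

# AUROC for each classifier on the held-out test set
sapply(list(LR = lr_pred, DT = dt_pred, XGB = xgb_pred),
       function(p) as.numeric(auc(roc(test$gc, p, quiet = TRUE))))
```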
A significant challenge in applying machine learning models is the opacity of feature–response relationships, which limits the interpretability and practical utility of the models for clinicians. To enhance interpretability, we integrated the SHapley Additive exPlanations (SHAP) algorithm into the top-performing model (i.e., the model with the highest AUROC value on the external validation set). The SHAP method, grounded in game theory, offers insights into the contribution of each feature to the model’s predictions. All analyses were conducted using R software ver. 3.6.4 (R Foundation for Statistical Computing), and the study’s reporting adhered to the STROBE guidelines.
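The paper does not specify the SHAP implementation; one way to attach SHAP values to the best-performing logistic regression model in R is the fastshap package, which approximates Shapley values by Monte Carlo sampling. The sketch below continues from the pipeline above (`lr_fit`, `train`, outcome `gc`) and is an assumption, not the authors' code.

```r
# Approximate SHAP values for the logistic regression model (sketch using fastshap).
library(fastshap)

X <- subset(train, select = -gc)              # feature data passed to the explainer

pred_fun <- function(object, newdata) {       # probability-scale predictions
  predict(object, newdata = newdata, type = "response")
}

set.seed(2024)
shap_vals <- explain(lr_fit, X = X, pred_wrapper = pred_fun, nsim = 50)
head(shap_vals)                               # one row per subject, one column per feature
```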
Results
The application of exclusion criteria resulted in a final cohort of 10,515,949 individuals, comprising 65,657 with GC and 10,450,292 without GC, as derived from the NHIS (Fig. 1). The characteristics of the study participants are detailed in Table 1.
We developed three machine learning-based models for GC prediction. The receiver operating characteristic curves generated from the application of these models to the internal validation dataset are presented in Fig. 2. The performance metrics of the models in both internal and external validation datasets are summarized in Table 2. Among the models, logistic regression demonstrated the highest AUROC at 0.708 (95% confidence interval [CI], 0.704 to 0.710) and the greatest sensitivity at 0.670 (95% CI, 0.661 to 0.671). However, it exhibited the lowest accuracy at 0.637 (95% CI, 0.636 to 0.637) and the lowest specificity of the three models at 0.643 (95% CI, 0.630 to 0.640). Conversely, the decision tree model outperformed the others in terms of accuracy (0.795; 95% CI, 0.795 to 0.796) and specificity (0.797; 95% CI, 0.790 to 0.800). Notably, in the independent external validation dataset, the logistic regression model maintained the highest AUROC at 0.669 (95% CI, 0.580 to 0.710).

Fig. 2. AUROCs of the prediction models. AUROC, area under the receiver operating characteristic curve; DT, decision tree; LR, logistic regression; XGB, eXtreme Gradient Boosting.
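The reported AUROCs and 95% CIs could, for example, be obtained with pROC, continuing the pipeline sketched in the Methods (`test$gc`, `lr_pred`); the CI method actually used in the study is not stated, so the DeLong and bootstrap variants below are shown only as possibilities.

```r
# AUROC with 95% CI for the logistic regression predictions (illustrative only).
library(pROC)

roc_lr <- roc(test$gc, lr_pred, quiet = TRUE)
auc(roc_lr)                                           # point estimate of the AUROC
ci.auc(roc_lr)                                        # DeLong 95% CI (pROC default)
ci.auc(roc_lr, method = "bootstrap", boot.n = 2000)   # bootstrap alternative
```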
Further analysis, including missing data imputation, confirmed the robustness of the logistic regression model, which continued to yield the highest AUROC (0.712; 95% CI, 0.708 to 0.717) (S1 Table, S2A Fig.). Combining missing data imputation with stepwise variable selection did not produce significantly different results (S3 Table, S2B Fig.). Similarly, the integration of missing data imputation with variable selection using the LASSO technique did not substantially alter the outcomes (S4 Table, S2C Fig.).
We further analyzed the importance of the 17 variables in the logistic regression model, which had the highest AUROC. SHAP values were computed to rank these variables based on their impact on outcome prediction (Fig. 3). The variables, plotted along the y-axis, are arranged in descending order of importance according to their mean SHAP values. For each case, the SHAP values are depicted along the x-axis (Fig. 3A), with each point representing a predicted case and the color indicating the feature's corresponding value. Positive and negative SHAP values signify the prediction of GC presence and absence, respectively. In Fig. 3B, bar lengths correspond to mean absolute SHAP values, with longer bars indicating a greater contribution of the variable to GC prediction. The analysis identified age, sex, H. pylori infection, smoking status, and family history of GC as the most critical predictors of GC. The likelihood of developing GC increased with advancing age, particularly among males, and was elevated in individuals with a history of H. pylori infection, in current smokers, and in those with a family history of GC. Additional factors associated with an elevated risk included a higher BMI, the absence of hypertension, frequent alcohol consumption, the presence of gastric adenoma, and a history of diabetes.

Fig. 3. Summary of SHapley Additive exPlanations (SHAP) values. (A) Each dot represents the impact of a feature on one subject. The dot's color indicates the feature's value, while its position on the x-axis indicates the SHAP value, reflecting the feature's contribution to altering the model's prediction for that individual. Features are plotted on the y-axis and organized in descending order based on mean SHAP values. Variables and coding for analysis: age group (1, 40-44; 2, 45-49; 3, 50-54; 4, 55-59; 5, 60-64; 6, 65-69; 7, 70-74), sex (1, male; 2, female), Helicobacter pylori infection (1, infection; 0, no infection), smoking status (1, nonsmoker; 2, former smoker; 3, current smoker), family history of gastric cancer (1, yes; 0, no), body mass index (BMI; 1, < 23; 2, 23-24.9; 3, 25-29.9; 4, ≥ 30), alcohol consumption (0, no drinking; 1, ≤ 3 times/wk; 2, ≥ 4 times/wk), disease history or family history of disease (1, yes; 0, no). (B) Mean absolute SHAP values. The five most influential features are age, sex, H. pylori, smoking, and family history of gastric cancer (GC). CC, colorectal cancer; HTN, hypertension; Hx, history; LC, liver cancer; MI, myocardial infarction.
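Summary plots analogous to Fig. 3A and 3B could be drawn from the SHAP matrix with, for instance, the shapviz package; the sketch assumes the `shap_vals` matrix and feature data `X` from the earlier fastshap sketch, and the figures in the paper may have been produced with different tooling.

```r
# Beeswarm and bar summaries of SHAP values (sketch; cf. Fig. 3A and 3B).
library(shapviz)

sv <- shapviz(as.matrix(shap_vals), X = X)    # wrap the SHAP matrix with its feature data
sv_importance(sv, kind = "beeswarm")          # per-subject SHAP values, colored by feature value
sv_importance(sv, kind = "bar")               # mean absolute SHAP value per feature
```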
Discussion
In this study, we developed and rigorously evaluated machine learning-based models for predicting GC risk. Among the models tested, logistic regression demonstrated superior performance, as indicated by the AUROC values. Given the imbalanced nature of the dataset, AUROC values were employed as the primary metric for model evaluation, in lieu of accuracy metrics [14].
The significance of early screening and diagnosis in reducing GC mortality rates cannot be overstated, as timely intervention markedly improves patient outcomes. The prognosis for patients diagnosed with early-stage GC is notably favorable, with 5-year survival rates approaching 95% [15,16]. While gastric endoscopy remains the gold standard for GC detection, its invasive nature limits its widespread application [14]. Risk prediction models that identify individuals at high risk for GC are crucial in facilitating targeted screening, thereby reducing unnecessary procedures in low-risk groups and optimizing resource allocation for early detection. Additionally, these models enable the personalization of screening schedules based on individual risk profiles, thereby enhancing post-endoscopic health outcomes. Our study presents a machine learning-based prediction model specifically designed to identify individuals at high risk for GC within the general population. This model is distinctive in its reliance on data derived primarily from the national health screening program, which conducts comprehensive screenings on an annual or biennial basis. The dataset incorporates a range of noninvasive, cost-effective, and straightforward variables, including lifestyle factors, family history, and medication history. By utilizing existing data, our model enhances accessibility and ease of use in clinical settings, allowing for practical application without incurring substantial costs or necessitating invasive diagnostics.
Recognizing that the performance of prediction models is contingent upon the selection of appropriate predictors, we conducted a thorough literature review to identify epidemiologically significant GC risk factors. These variables were then incorporated into the model development process, prioritizing domain knowledge over indiscriminate variable inclusion or exclusive reliance on statistical selection methods. Notably, our findings indicate that the AUROC values of classifiers did not significantly improve when employing features selected through stepwise variable selection or LASSO, as compared to the original model, which utilized a comprehensive set of features curated by researchers based on their expertise.
To optimize GC interventions, it is imperative to understand the ranked impact of contributing factors, which in turn necessitates prioritizing actions that target preventable elements involved in GC development. To enhance the interpretability of our predictive model, we employed the SHAP algorithm. Through this analysis, we identified age, sex, H. pylori infection, smoking, and a family history of GC as the most significant predictors of GC. Incorporating the SHAP algorithm into GC prediction models enables clinicians to deliver personalized preventive interventions, addressing the most impactful factors for each patient. While previous studies have employed machine learning to predict GC occurrence [14,16-22], this is the first study to address the “black box” issue of machine learning-based GC prediction models using an interpretable method such as the SHAP algorithm.
The likelihood of developing GC was observed to increase with advancing age and was higher among males, individuals with a history of H. pylori infection, current smokers, and those with a family history of GC. Other factors associated with an increased risk included a higher BMI, the absence of hypertension, frequent alcohol consumption, the presence of gastric adenoma, and a history of diabetes.
Our findings align with established knowledge, confirming that H. pylori infection, smoking, and a family history of GC are key risk factors, alongside non-modifiable variables such as age and sex [23-27]. Given that behavioral factors and H. pylori infection are modifiable, it is essential to consider these elements when developing effective GC prevention strategies. H. pylori, a class I carcinogen, is recognized as the leading cause of GC [25-27]. Considering its high prevalence in Korea [28], anti-H. pylori interventions are of paramount importance. Our results underscore the critical role of H. pylori in GC pathogenesis and the necessity for targeted interventions in populations with high prevalence rates.
Furthermore, the increased risk of GC among current smokers emphasizes the need for comprehensive smoking cessation programs. Individuals with a family history of GC—a significant risk factor—would benefit from intensified gastroscopy surveillance. While the impact of BMI as a risk factor appears modest, weight management in individuals with a BMI of 30 or higher may still provide preventive benefits. Additionally, reducing the frequency of alcohol consumption could contribute to lowering GC risk.
This study has several limitations. First, the control group consisted of individuals who were not diagnosed with GC within 5 years following GC screening. However, this group may have included individuals who could develop GC in the future, potentially influencing the study outcomes. Additionally, individuals who died within 5 years of screening were excluded from the analysis, as some of these individuals may have later been diagnosed with GC. Due to constraints within the closed analytical environment (the analysis of national health insurance claims data was restricted to designated computers within a secure network), we were unable to use the version of R required to perform survival machine learning and account for competing risks. This technical limitation hindered the implementation of survival machine learning in this study. Future research should aim to address this limitation by employing survival machine learning approaches. Second, the model's scope was constrained by the limited range of variables included, necessitating cautious interpretation of results based solely on the evaluated variables. Notably, key risk factors for GC, such as chronic atrophic gastritis and salt consumption, were excluded from the analysis due to unavailability of related data. Expanding the range of variables considered could enhance the model's performance. Third, H. pylori infection status was inferred from medication history rather than direct diagnosis, which may have led to an underestimation of GC risk in the H. pylori–infected cohort, as it included individuals who had undergone treatment that could have eradicated the infection. Fourth, the application of machine learning techniques required careful hyperparameter tuning, which is crucial for influencing classification outcomes. However, due to the aforementioned constraints of the analytical environment, our ability to perform comprehensive hyperparameter tuning was limited. With access to a more advanced computational environment, additional tuning could significantly improve model performance. Finally, there is a potential overlap between individuals in the internal validation set and the external validation dataset. Although this overlap is likely minimal, it may introduce some bias into the validation results by affecting the independence of the datasets.
Despite these limitations, a significant strength of this study lies in the use of a large, nationwide screening dataset, encompassing 10,515,949 individuals, to develop a machine learning-based GC prediction model that focuses on lifestyle, noninvasive characteristics, and major risk factors, including H. pylori infection. The adoption of a machine learning approach offers distinct advantages, such as optimizing feature selection, capturing complex nonlinear relationships, and uncovering hidden patterns within the data, thereby surpassing the predictive accuracy of traditional models like the CPHM [29]. Notably, our findings confirmed the superiority of the machine learning models over the CPHM in predicting GC risk. To evaluate comparative performance, we developed a GC prediction model using the CPHM and conducted a direct comparison with several machine learning algorithms. Our analysis demonstrated that the machine learning algorithms consistently outperformed the CPHM in identifying individuals at risk for gastric cancer. Specifically, the AUROC for the CPHM was 0.516 (95% CI, 0.514 to 0.519), which was significantly lower than the AUROC values obtained from all machine learning models (S5 Table, S6 Fig.). This marked improvement in performance underscores the scientific merit of adopting machine learning in this analysis, particularly for addressing key limitations of the CPHM, the current standard approach. The CPHM estimates the effects of multiple variables on survival outcomes by incorporating time-to-event data and generating hazard ratios (HRs) for each variable. However, it operates under the assumption of a constant HR over time, a simplification that may not adequately reflect the dynamic nature of disease progression. Additionally, the CPHM's handling of missing data poses challenges, potentially limiting its effectiveness in risk prediction. By contrast, machine learning models address these constraints, achieving greater predictive accuracy through optimized feature selection, the ability to capture intricate nonlinear relationships, and the detection of latent patterns in the dataset [14,29].
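For reference, the CPHM comparison could be set up along the following lines; this sketch assumes that follow-up time (`futime`, in days) and the event indicator `gc` are available, uses an illustrative subset of predictors with hypothetical names, and scores the Cox linear predictor with AUROC on the 5-year binary outcome so it can be compared with the machine learning classifiers.

```r
# Sketch of the Cox proportional hazards comparison (illustrative variable names).
library(survival)
library(pROC)

cox_fit  <- coxph(Surv(futime, gc) ~ age_group + sex + hp_infection + smoking + fam_hx_gc,
                  data = train)
cox_risk <- predict(cox_fit, newdata = test, type = "lp")   # linear predictor as a risk score
auc(roc(test$gc, cox_risk, quiet = TRUE))                   # compare against the ML AUROCs
```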
Furthermore, the robustness and generalizability of the prediction model were validated through external testing on an independent dataset. Additionally, an important benefit of our machine learning-based approach is its ability to elucidate which variables contribute to an increased risk of GC, as well as whether their impact is positive or negative. This capacity for determining the influence of specific variables enhances the interpretability of the prediction model and its practical utility. Moreover, this approach facilitates precise prevention strategies tailored to each patient’s risk profile.
By enhancing GC risk awareness and promoting appropriate gastroscopic screening, these prediction models are poised to support primary and secondary prevention efforts. Moreover, the integration of the SHAP algorithm increases the model’s transparency, thereby improving its clinical applicability and aiding decision-making in clinical practice. The use of readily available data from the national screening program, collected every 1-2 years, further underscores the model’s clinical relevance. This prediction model holds promise for reducing GC incidence and mortality by enabling the effective identification and management of high-risk individuals.
Electronic Supplementary Material
Supplementary materials are available at Cancer Research and Treatment website (https://www.e-crt.org).
Notes
Ethical Statement
The Institutional Review Board of the National Cancer Center approved this study (Ncc2021-0141). Informed consent was waived because the data analyses were performed retrospectively using anonymized data.
Author Contributions
Conceived and designed the analysis: Park B, Jun JK, Suh M, Choi KS, Choi IJ, Oh HJ.
Collected the data: Park B, Oh HJ.
Contributed data or analysis tools: Park B, Kim CH, Oh HJ.
Performed the analysis: Park B, Kim CH, Oh HJ.
Wrote the paper: Park B.
Funding acquisition: Oh HJ.
Conflicts of Interest
Conflict of interest relevant to this article was not reported.
Funding
This work was supported by the National Cancer Center Grant (2111060).