Application of Machine Learning Algorithms for Risk Stratification and Efficacy Evaluation in Cervical Cancer Screening Among the ASCUS/LSIL Population: Evidence from the Korean HPV Cohort Study
Abstract
Purpose
We assessed human papillomavirus (HPV) genotype-based risk stratification and the efficacy of cytology testing for cervical cancer screening in patients with atypical squamous cells of undetermined significance (ASCUS)/low-grade squamous intraepithelial lesion (LSIL).
Materials and Methods
Between 2010 and 2021, we monitored 1,273 HPV-positive women with ASCUS/LSIL every 6 months for up to 60 months. HPV infections were categorized as persistent (HPV positivity consistently observed post-enrollment), negative (HPV negativity consistently observed post-enrollment), or non-persistent (neither consistently positive nor negative). HPV genotypes were grouped into high-risk (Hr) groups 1 (types 16, 18, 31, 33, 45, 52, and 58) and 2 (types 35, 39, 51, 56, 59, 66, and 68) and a low-risk group. Hr1 was subdivided into types (a) 16 and 18; (b) 31, 33, and 45; and (c) 52 and 58. Cox regression and machine learning (ML) algorithms were used to analyze progression rates.
Results
Among 1,273 participants, 17.6% of those with persistent HPV infections experienced disease progression, whereas no progression occurred in the HPV-negative group (p < 0.001). Cox analysis revealed the highest hazard ratios (HRs) for Hr1-a (11.6, p < 0.001), followed by Hr1-b (9.26, p < 0.001) and Hr1-c (7.21, p < 0.001). HRs peaked at 12-24 months, with Hr1-a maintaining significance at 24-36 months (10.7, p=0.034). ML analysis identified the final cytology change pattern as the most significant factor, with 14-15 months after the first examination being the optimal time for detecting progression.
Conclusion
In ASCUS/LSIL cases, follow-up strategies should be based on HPV risk types. Annual follow-up was the most effective monitoring interval for detecting progression/regression.
Introduction
Cervical cancer is the fourth most common cancer in terms of incidence and the fourth deadliest cancer in women, with an estimated 660,000 new cases and 350,000 deaths worldwide in 2022 [1]. In South Korea, the age-standardized incidence rate of cervical cancer was 3.7 per 100,000 persons in 2020 [2]. For the early detection of uterine cervical abnormalities, the Korean National Health Insurance Service provides biennial cytology tests for individuals over 20 years of age and has offered human papillomavirus (HPV) vaccination for girls 12 years of age and older since 2016 [3]. This cancer screening and prevention program has likely contributed to the decrease in cervical cancer incidence, from 8.6 per 100,000 in 1999 to 3.7 per 100,000 in 2020 [2]. The incidence of cervical cancer in Korea is notably lower than that in the United States (7.7 per 100,000 in 2015-2020) [4]. However, the number of precancerous lesions in Korea increased from 17,651 in 2018 to 20,910 in 2021 [5]. Therefore, the cervical cancer screening program in South Korea remains critical and may need to be revised to decrease the frequency of precancerous cervical lesions.
In the HPV-positive population with atypical squamous cells of undetermined significance (ASCUS), the immediate risk of cervical intraepithelial neoplasm (CIN) 3+ was 4.2% globally [6]. Therefore, repeat HPV testing or co-testing at 1 year is recommended for patients with minor screening abnormalities indicating HPV infection with a low risk of underlying CIN3+ (e.g., HPV-positive, low-grade cytological abnormalities after a documented negative screening HPV test or co-test) [6]. In comparison, the 2-year cumulative incidence of CIN3+ in the HPV-positive population with ASCUS in Japan was 17.5% [7]. In a previous Korean HPV cohort study from 2012 to 2017, the cumulative incidence of CIN2+ was 7.1% [8]. The progression rate is higher in Korea and other Asian countries than in the United States, and follow-up with cytology and HPV testing is performed every 6 months in accordance with the Korean Society of Gynecologic Oncology’s recommendations [9].
The latest 2024 American Society for Colposcopy and Cervical Pathology (ASCCP) guidelines have introduced more detailed classifications for high-risk HPV types and now cover 14 to 20 genotypes [10], as opposed to previous guidelines that categorized HPV types merely as HPV 16 and 18 versus others [6,11]. These guidelines reflect an emerging consensus on the importance of distinguishing between high-risk HPV genotypes, a perspective supported by earlier studies. Research has indicated that HPV 58 could pose a cancer risk comparable to that of HPV 16 [12], suggesting a need for a re-evaluation of how non-HPV 16 and 18 types are assessed for cancer risk.
Machine learning (ML) was first introduced in 1956 and is widely used to assist in providing an accurate analysis of clinical findings and treatment decision-making [13]. In particular, ML algorithms are capable of repeating the same analysis using more than 100 different types of variables and can thus help to find more specific results, especially in the analysis of medical findings or clinical practice [14]. In addition, ML is more effective than conventional survival analysis because the latter can only handle low-dimensional data and faces problems in identifying non-linear associations and complex relationships between covariates and survival time [15]. Therefore, we evaluated the risk stratification of HPV types in a Korean HPV cohort using ML algorithms to determine more suitable guidelines for ASCUS/low-grade squamous intraepithelial lesions (LSIL) and to preliminarily evaluate the efficacy of HPV testing as part of a complete screening test.
Materials and Methods
1. Design of the Korea HPV cohort study
The Korea HPV Cohort Study, which received funding from the Korea Disease Control and Prevention Agency, took place between April 2010 and September 2021. This multicenter study, carried out in the obstetrics departments of eight general hospitals in Korea, aimed to identify the risk factors associated with the progression of cervical disease, up to the high-grade squamous intraepithelial lesion (HSIL) stage, in HPV-infected adult Korean women. Eligible participants were Korean women aged 20-60 years who tested positive for HPV DNA, irrespective of genotype, and had a diagnosis of ASCUS or LSIL through cytology testing. Prior to enrollment, all participants provided written informed consent. Approval was obtained from the Institutional Review Boards of all eight hospitals involved in the study. Throughout the study period, the enrolled patients underwent HPV DNA testing and cytology every 6 months, and data were recorded using an electronic case report form on each occasion [16].
2. Eligibility criteria and definitions of the HPV infection pattern and cytology change
From April 2010 to September 2021, the enrollment period of the Korean HPV Cohort Study, only women who were diagnosed with ASCUS or LSIL, had HPV infection confirmed by an external test, and agreed to participate in the cohort study were included. After enrollment, they underwent cytological and HPV DNA testing every 6 months, with collection of biological samples. Only those who were followed up at least twice after the initial examination were included in the study results. Disease “progression” was defined as a biopsy-confirmed diagnosis of CIN2+. After confirmation of progression, participants were excluded from further cohort follow-up and treatment was recommended. We also excluded women who did not have sufficient biopsy results or did not meet the initial cytology inclusion criteria (Fig. 1).

Fig. 1. Study design. ASCUS, atypical squamous cells of undetermined significance; CIN, cervical intraepithelial neoplasm; HPV, human papillomavirus; LSIL, low-grade squamous intraepithelial lesion.
To evaluate risk factors for progression on biopsy, HPV infection patterns were first subdivided into three groups: (1) HPV-persistent, (2) HPV-negative, and (3) HPV–non-persistent. HPV-persistent infection was identified when the HPV test remained positive in two or more successive evaluations [12,16]. The HPV-negative group included those who showed HPV infection regression within 6 months post-enrollment. The HPV–non-persistent infection group included individuals who did not fit into either the HPV-persistent or HPV-negative categories (Fig. 2) [16].
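The three-way grouping can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: it follows the summary wording used in the Abstract (consistently positive, consistently negative, or neither after enrollment), and the function name and boolean encoding are assumptions. The stricter wording in this section (two or more successive positive tests; regression within 6 months) could assign some participants differently.

```python
from typing import Sequence


def classify_infection_pattern(follow_up_positive: Sequence[bool]) -> str:
    """Classify post-enrollment HPV results (True = HPV-positive at that visit).

    Hypothetical helper: the encoding and the rule are assumptions based on
    the Abstract's definitions, not the study's actual implementation.
    """
    # The study only analyzed women with at least two follow-up visits.
    if len(follow_up_positive) < 2:
        raise ValueError("at least two follow-up visits are required")
    if all(follow_up_positive):
        return "persistent"      # positive at every follow-up visit
    if not any(follow_up_positive):
        return "negative"        # negative at every follow-up visit
    return "non-persistent"      # mixed results


if __name__ == "__main__":
    print(classify_infection_pattern([True, True, True]))
    print(classify_infection_pattern([True, False, True]))
```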
Second, HPV genotypes were classified as follows. Fourteen HPV genotypes (HPV 16, 18, 31, 33, 35, 39, 45, 51, 52, 56, 58, 59, 66, and 68) are considered pathogenic or “high-risk” (Hr) for the development of cervical cancer [17,18]. In a large retrospective cross-sectional worldwide study, the most common HPV types were 16, 18, 31, 33, 35, 45, 52, and 58, with a combined worldwide relative contribution of 8,196 of 8,977 cases (91%; 95% confidence interval, 90 to 92) [19]. In addition, the most common HPV types for HSIL in Korea are 16, 58, 18, and 52 [12]. As a result, we classified HPV types into the following categories: Hr1 (16, 18, 31, 33, 45, 52, and 58), Hr2 (35, 39, 51, 56, 59, 66, and 68), and low risk (any virus type not included in either of the preceding two groups). Furthermore, to analyze specific progression risks, Hr1 was subdivided into types (a) 16 and 18; (b) 31, 33, and 45; and (c) 52 and 58 (Fig. 3).
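The genotype grouping above maps directly onto a small lookup. The following Python sketch uses the type lists exactly as given in the text; the rule for co-infections (assign the highest-priority group among the detected types, in the order Hr1-a, Hr1-b, Hr1-c, Hr2, low risk) is an assumption made for illustration, since the text does not state how multiple infections were assigned.

```python
# Genotype sets as listed in the text.
HR1_A = {16, 18}
HR1_B = {31, 33, 45}
HR1_C = {52, 58}
HR2 = {35, 39, 51, 56, 59, 66, 68}

# ASSUMPTION: for co-infections, the first matching group in this order wins.
PRIORITY = [("Hr1-a", HR1_A), ("Hr1-b", HR1_B), ("Hr1-c", HR1_C), ("Hr2", HR2)]


def genotype_group(genotypes):
    """Return the risk group for an iterable of detected HPV type numbers."""
    detected = set(genotypes)
    if not detected:
        raise ValueError("no HPV genotype supplied")
    for name, members in PRIORITY:
        if detected & members:
            return name
    # Any type outside Hr1 and Hr2 is treated as low risk, as in the text.
    return "Low risk"


if __name__ == "__main__":
    print(genotype_group([16]))      # single infection
    print(genotype_group([58, 51]))  # co-infection spanning Hr1-c and Hr2
    print(genotype_group([6, 11]))   # types outside both high-risk groups
```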
Lastly, the HPV cohort was grouped by cytology change pattern: (1) regression: no cytological abnormality; (2) persistent: no change in the cytology result (ASCUS/LSIL); and (3) progression: change to HSIL, atypical squamous cells that cannot exclude HSIL (ASC-H), or malignancy in the cytology result.
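As a brief illustration, the cytology change pattern can be derived from the final cytology result. The string labels below ("normal", "ASC-H", "malignancy", and so on) are hypothetical codes chosen for this sketch; the study's case report form may encode results differently.

```python
# ASSUMPTION: hypothetical result labels, not the study's actual coding.
REGRESSION_RESULTS = {"normal"}
PERSISTENT_RESULTS = {"ASCUS", "LSIL"}
PROGRESSION_RESULTS = {"HSIL", "ASC-H", "malignancy"}


def cytology_change_pattern(final_cytology: str) -> str:
    """Map a final cytology result to regression, persistent, or progression."""
    if final_cytology in REGRESSION_RESULTS:
        return "regression"
    if final_cytology in PERSISTENT_RESULTS:
        return "persistent"
    if final_cytology in PROGRESSION_RESULTS:
        return "progression"
    raise ValueError(f"unrecognized cytology result: {final_cytology!r}")
```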
3. Statistical analysis
Based on HPV infection patterns, participants were divided into three groups: persistent infection, HPV-negative, and non-persistent infection. Age, body mass index (BMI), disease progression, observation duration, initial cytology, cytology pattern, number of HPV infections, presence of multiple HPV infections, history of sexually transmitted disease (STD) infections, HPV prophylactic vaccination, pregnancy, smoking, age at first coitus, and number of sexual partners were compared among the three groups using ANOVA or the Kruskal-Wallis test for continuous variables and the chi-squared test for categorical variables. To evaluate the detection efficacy for CIN2+ lesions, univariate Cox analysis was conducted for each HPV group in specific periods. Multivariate Cox analysis was performed for the entire period and separately for the HPV-persistent and HPV–non-persistent infection groups. Because there were no progression cases in the HPV-negative group, it could not be used in the survival analysis, and the HPV–non-persistent group served as the reference group instead. The independent variables included categorical variables such as HPV infected number group, type of HPV infection, multiple HPV infections, cytology pattern, and STD infection history, as well as continuous variables such as age and BMI. Stata 17.0 (Stata Corp.) and R 4.3.1 (R Core Team) were used as the statistical software, and a p-value less than 0.05 was considered statistically significant for all analyses.
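For readers less familiar with the model, the hazard ratios reported in this study have the standard Cox proportional hazards interpretation. The notation below (baseline hazard h_0, coefficients β, covariates x) is generic textbook notation and is not taken from the study itself.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta_1 x_1 + \dots + \beta_p x_p\right),
\qquad
\mathrm{HR}_j = \exp(\beta_j)
```

For a categorical covariate such as the HPV genotype group, HR_j compares the hazard of progression at a given level with that of the reference level, holding the other covariates fixed; a value above 1 indicates a higher hazard than in the reference group.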
For ML analysis, gradient boosting, random survival forest, and random forest models were used to identify important factors related to the disease progression rate. Initial experiments were conducted using the AutoML method, which is a common approach for automatically selecting, training, and tuning models. The grid search method was used to tune the hyperparameters. To determine the performance metrics for each model, five separate training sessions were conducted, each utilizing 5-fold cross-validation. We extracted variable importance and created heatmaps to analyze how each variable influenced the results across the model types.
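The evaluation scheme described above (five training sessions, each with 5-fold cross-validation) yields 25 train/test splits per model. A minimal, library-free sketch of how such splits can be generated is shown below. It is not the H2O implementation used in the study; the function name, the seed, and the unstratified shuffling are assumptions. With an event rate of about 7.7%, stratified folds may be preferable in practice, and the text does not say whether stratification was used.

```python
import random


def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        # Reshuffle once per repeat so each session uses a different partition.
        order = list(range(n_samples))
        rng.shuffle(order)
        # Fold i takes every n_splits-th shuffled index, so sizes differ by at most one.
        folds = [order[i::n_splits] for i in range(n_splits)]
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [j for j in order if j not in held_out]
            yield repeat, fold, train_idx, test_idx


if __name__ == "__main__":
    splits = list(repeated_kfold_indices(1273))
    print(len(splits), "splits in total")
```

Each model would be fitted on `train_idx` and scored on `test_idx`, and the performance metric averaged over all splits.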
To determine the period with the highest progression rate, we created 1,000 models each for the random forest and gradient boosting models from training data resampled by bootstrapping (a method for estimating the sampling distribution of a statistic without knowing the true distribution). The area under the curve (AUC) for each month was then calculated using each data set, followed by calculation of the monthly increase in the AUC (average AUC of the following month − average AUC of the previous month). Statistical testing was performed using bootstrapping to determine whether the increase in the AUC for any period was significantly greater than for other periods. An increase in the AUC was considered statistically significantly greater than the increases in other periods if the corresponding bootstrap value exceeded 0.95 on average (i.e., a significance level of 5%). All ML procedures were performed in Python using the H2O.ai API.
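The monthly AUC comparison can be illustrated with a short, library-free Python sketch. It is a simplification under stated assumptions, not the study's pipeline: the study refitted 1,000 models on bootstrapped training data, whereas this sketch resamples the predictions of a single fixed model, and all function names and the toy numbers in the usage lines are hypothetical.

```python
import random
from statistics import mean


def auc(labels, scores):
    """Rank-based AUC: chance that an event outscores a non-event (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_mean_auc(labels, scores, n_boot=1000, seed=0):
    """Average AUC over bootstrap resamples of (label, score) pairs."""
    if len(set(labels)) < 2:
        raise ValueError("labels must contain both classes")
    rng = random.Random(seed)
    n = len(labels)
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            values.append(auc(ys, [scores[i] for i in idx]))
    return mean(values)


def monthly_auc_increase(mean_auc_by_month):
    """Increase in mean AUC between consecutive months: AUC(next) - AUC(previous)."""
    months = sorted(mean_auc_by_month)
    return {(a, b): mean_auc_by_month[b] - mean_auc_by_month[a]
            for a, b in zip(months, months[1:])}


if __name__ == "__main__":
    # Made-up mean AUCs for illustration only; these are not study results.
    toy = {12: 0.80, 13: 0.82, 14: 0.83, 15: 0.90}
    increases = monthly_auc_increase(toy)
    print(max(increases, key=increases.get))
```

In the study's setting, `mean_auc_by_month` would hold the bootstrap-averaged AUC for predicting progression by each follow-up month, and the interval with the largest increase would then be tested against the others.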
Results
Of the 1,273 participants, 98 (7.7%) had progressive disease (classified as CIN2+). There were 266 patients in the HPV-persistent group, 49 in the HPV-negative group, and 958 in the HPV–non-persistent group. In the HPV-persistent group, 17.6% (47 patients) had disease progression (CIN2+), whereas 5.3% (51 patients) had progression in the HPV–non-persistent group (Table 1). The disease progression rate for persistent Hr1 infection was 48.9%, which was significantly higher than that for non-persistent Hr1 infection (41.0%) and that of the HPV-negative group (no progression) (p < 0.001) (Table 1). The highest rates of cytological progression (p < 0.001) and STD infection history (p=0.002) were in the persistent infection group. Multiple HPV infection was significantly more common in the persistent infection group than in the non-persistent (incidental) or HPV-negative (regression) groups (p < 0.001). BMI was significantly lower in the HPV-negative group than in the HPV-persistent and HPV–non-persistent groups (p=0.004). However, there were no significant differences by HPV infection pattern in the distribution of initial cytology results (Table 1), history of prophylactic HPV vaccination, age at first coitus, or history of smoking (S1 Table).
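The headline proportions can be rechecked from the counts given in this paragraph. The Python sketch below recomputes the group sizes and progression rates and, as a simplified illustration only, a Pearson chi-squared statistic for the persistent versus non-persistent comparison. This two-group test is an assumption made for the example; it is not the three-group comparison the authors report in Table 1.

```python
# Counts as reported in the Results text: (group size, biopsy-confirmed CIN2+).
groups = {
    "HPV-persistent": (266, 47),
    "HPV-negative": (49, 0),
    "HPV-non-persistent": (958, 51),
}

total_n = sum(n for n, _ in groups.values())
total_progressed = sum(k for _, k in groups.values())
rates = {name: 100 * k / n for name, (n, k) in groups.items()}


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# ASSUMPTION: simplified two-group comparison, for illustration only.
n_p, k_p = groups["HPV-persistent"]
n_np, k_np = groups["HPV-non-persistent"]
chi2 = chi_squared_2x2(k_p, n_p - k_p, k_np, n_np - k_np)

print(f"overall: {total_progressed}/{total_n} = {100 * total_progressed / total_n:.1f}%")
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
print(f"chi-squared (persistent vs non-persistent, 1 df): {chi2:.1f}")
```

With 1 degree of freedom, a statistic above 10.83 corresponds to p < 0.001, so this rough check points in the same direction as the reported significance.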
To evaluate HPV risk factors, univariable Cox analysis of progression was conducted for each 12-month interval. In the interval analysis, the 12-24-month period showed the highest hazard ratios (HRs): 32.6 for Hr1-a, 16.0 for Hr1-b, and 8.85 for Hr1-c (all p < 0.05). Notably, Hr1-a still showed the highest HR, 10.7, in the 24-36-month period (p=0.034) (Table 2). Multivariable Cox analysis was conducted for 60-month progression (Table 3). The HPV infected number group and cytological pattern showed significant HRs for progression. In particular, Hr1-a and Hr1-b had similar HRs (3.20 for Hr1-a and 3.53 for Hr1-b, both p < 0.05). Moreover, cytological progression was a strong risk factor for disease progression. Subgroup analysis was conducted for the HPV-persistent and HPV–non-persistent groups. Only Hr1-a had a significant HR (2.38, p=0.006) in the HPV-persistent group (Fig. 4A), whereas all Hr1 subgroups were significant in the HPV–non-persistent group, with the highest HR in the Hr1-c group (Fig. 4B).

Fig. 4. Multivariable Cox analysis by human papillomavirus (HPV) infection type. Hazard ratios and 95% confidence intervals were adjusted for age at diagnosis, body mass index (BMI), multiple HPV infection, and sexually transmitted disease infection history. (A) HPV-persistent infection. (B) HPV–non-persistent infection. Hr1, HPV 16, 18, 31, 33, 45, 52, 58; Hr2, HPV 35, 39, 51, 56, 59, 66, 68; Lr, low-risk HPV.
The ML tools random survival forest, gradient boosting, and random forest were used to evaluate important factors for progression and the effective follow-up interval, in comparison with the multivariable Cox analysis results. The gradient boosting model emerged as the most effective method, achieving an AUC of 0.942. The random survival forest model also demonstrated strong performance, with an AUC of 0.915. In contrast, the random forest model achieved an AUC of only 0.5, which is equivalent to chance-level discrimination. Consequently, the gradient boosting and random survival forest models were chosen for use in this analysis.
In terms of importance for disease progression to CIN2+ as predicted by the ML algorithms, the cytology pattern was the most important factor, followed (in order) by HPV infection type (persistent or not) and HPV number group (e.g., Hr1, Hr2) (Fig. 5A). In addition, the increase ratio of the monthly AUC obtained through the bootstrapping method was highest when moving from 14 to 15 months with the gradient boosting model, showing an AUC of 0.97 (Fig. 5B), and from 11 to 12 months with the random survival forest model, showing an AUC of 0.95 (S2 Fig.).
Discussion
In Korea, cervical cancer screening has been conducted using cytology since 1999. Currently, women aged 20 years and older are recommended to undergo screening every 2 years. However, there is a lack of studies on the efficacy of cervical cancer screening through HPV risk stratification or cytology, not only in the ASCUS/LSIL population but also in the general population in Korea. Accordingly, the value of the present study lies in having identified risk strata based on HPV type, determined the follow-up duration based on HPV type, and re-evaluated all results using ML algorithms to confirm the importance of cytology. Even though all results were based on the low-grade abnormal cytology population, the findings can still guide changes in cervical cancer screening.
According to the ASCUS-LSIL Triage Study (ALTS), the total 2-year cumulative incidence of CIN2+ was 15.4% (CIN2, 6.7%; CIN3, 8.8%) [20], and only 5.3% of the overall CIN3 population was found to be high-risk HPV-negative [21]. In this study, CIN2+ was defined as progression, and the total biopsy-confirmed disease progression (CIN2+) rate was 7.7% (mean survival time, 1.95 years). Even though all participants were initially HPV-positive, this rate was lower than that of the ALTS.
In this study, the progression group had a higher proportion of high-risk HPV-persistent infection (70%) than the non-progression group (11%). Furthermore, no progression was seen in the HPV-negative group. Many studies have defined HPV infections as persistent if HPV is detected on two consecutive follow-up visits 4-6 months apart [22], as in the present study. Reports on the time to clearance vary. In one previous study, HPV 16 had a particularly long time to clearance (mean duration, 18.3 months) compared with other HPV types [23], whereas another found that high-risk and low-risk HPV types can be detected for similar periods before clearance [24]. Nevertheless, persistent infection with high-risk HPV is most frequently a major contributing factor to cervical cancer [25]. This is why we stratified the population based on HPV progression risk.
In this study, the Hr1 group was more likely to show disease progression than the Hr2 group during 60 months of follow-up. Within the Hr1 group, Hr1-a and Hr1-b had the highest HRs until 24 months; in particular, the HR for Hr1-a remained significant through 24-36 months (Table 2). However, based on multivariable Cox analysis at 60 months, Hr1-b (HPV 31, 33, and 45) had a risk similar to that of Hr1-a (Table 3). In the subgroup analysis, only Hr1-a showed a significant HR in the HPV-persistent infection group, whereas all Hr1 subgroups were significant in the HPV–non-persistent infection group. Hr2 did not reach significance in any group. Globally, after HPV 16 and 18, the virus type most commonly associated with disease progression was HPV 45, and HPV 16 or 18 was related to almost 90% of cervical cancer progression cases [19]. However, the 36-month follow-up data from the HPV Cohort Study showed that HPV 16 and HPV 58 have similarly high HRs [12]. These findings are in accordance with the new guidelines from the ASCCP, which redefined the carcinogenic strains of HPV as HPV 16, HPV 18/45, and HPV 16-related types (33, 31, 52, 58, and 35). Our study results are in line with these global findings [10].
In this study, we used ML algorithms to further analyze factors related to disease progression. The most effective ML tools for predicting disease progression were the gradient boosting model and the random survival forest model, both of which demonstrated high AUCs (greater than 0.9). In the analysis of the importance of factors related to disease progression, the most significant factor was progression of the cytology pattern (34% in the gradient boosting model), and the second and third most important factors were HPV infection type (3.2%) and number group (2.2%), with the same order of importance found in the random survival forest model. These ML results differ from what has been reported in the literature, where cytology alone was found to be less sensitive than HPV testing in diagnosing disease progression [26]. However, based on the multivariable and ML analyses, both HPV testing and cytology are important for detecting progression in the ASCUS/LSIL population.
According to univariable analysis in each 12-month period, the highest HRs for all HPV types were at the 12-24-month interval. In the ML analysis of the time-dependent AUC of disease progression, the highest mean AUC ratio per month was at 14-15 months in the gradient boosting model and 11-12 months in the random survival forest model. In another study, the median period of progression to CIN2+ from ASCUS/LSIL was 1.95 years [27]. This is consistent with the American Cancer Society guideline recommendations, in which individuals with low-risk abnormal cytology and HPV infection are advised to undergo annual cytology with HPV testing for 2 years [28]. Based on other studies and the present results, we believe that the effective follow-up interval is 1 year, which is longer than the 6-month interval conventionally recommended by the Korean Society of Gynecologic Oncology [9]. Notably, our ML results were based on a bootstrapping method to reduce selection bias and on ML methods that can capture relationships missed by conventional survival analysis, which supports the reliability of the results. This is a strength of our study.
There are several limitations to the present work. First, we were unable to demonstrate a relationship between prophylactic HPV vaccination and the progression rate. In Korea, prophylactic HPV vaccination has been available since 2007, but the national vaccination program for adolescents started in 2016, targeting individuals aged 12-15 years. Consequently, it is highly likely that the impact of vaccination was not captured in the findings of this study. However, other research using HPV Cohort Study data has reported lower rates of HPV 16 and 18 infections among women who received the vaccine [27]. This difference may be due to differences in the definition of progression (biopsy only) and the use of strict inclusion criteria, with only participants who underwent at least two follow-up examinations included. Second, multiple HPV infection has been noted as a risk factor for disease progression in several studies [29,30]. However, multiple infections did not reach statistical significance in our multivariable analysis, even though almost 1,300 participants were included. Third, women with normal cytology and HPV-negative women were not enrolled; only HPV-positive individuals with ASCUS/LSIL were included in this study. Therefore, the study is subject to selection bias and does not reflect the characteristics of the general population, and the Cox analysis was conducted only on the HPV-persistent and non-persistent groups, excluding the HPV-negative group. Finally, the analysis in each follow-up period was conducted using only univariable analysis because of the lower number of participants in each period. However, this approach appeared to be sufficient to verify the duration of the effective follow-up time for each type of HPV. Nevertheless, the present study presents an HPV risk stratification strategy with follow-up based on Korean data, provides a direction for future large-scale research, and, despite its limitations, can be considered to have sufficient clinical significance.
Annual follow-ups are essential for monitoring the progression and regression of HPV in Korean patients with ASCUS/LSIL. The key predictors of disease progression include the persistence of specific high-risk HPV types, especially HPV 16, 18, 31, 33, 45, 52, and 58, and progression of cytology. While progression is typically detected within 2 years, individuals with HPV 16 or 18, regardless of signs of progression, should be followed up more closely because of their significant risk levels.
Electronic Supplementary Material
Supplementary materials are available at the Cancer Research and Treatment website (https://www.e-crt.org).
Notes
Ethical Statement
This study was approved by the relevant Institutional Review Board (XC23ZIDI0039) and adhered to the principles of the Declaration of Helsinki. A waiver of the requirement for informed consent was obtained.
Author Contributions
Conceived and designed the analysis: Hur SY, Choi YJ.
Collected the data: Song H, Lee HY, Seong J.
Contributed data or analysis tools: Oh SA, Seong J.
Performed the analysis: Song H, Oh SA.
Wrote the paper: Song H, Choi YJ.
Review and Interpretation: Hur SY.
Conflict of Interest
Conflict of interest relevant to this article was not reported.
Funding
This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (Ministry of Science and ICT) (No. NRF2021R1A2C2007425).