Data Resource Profile: The Cancer Public Library Database in South Korea
Abstract
This paper provides a comprehensive overview of the Cancer Public Library Database (CPLD), established under the Korean Clinical Data Utilization Network for Research Excellence (K-CURE) project. The CPLD links data from four major population-based public sources: the Korea National Cancer Incidence Database in the Korea Central Cancer Registry, cause-of-death data in Statistics Korea, the National Health Information Database in the National Health Insurance Service, and the National Health Insurance Research Database in the Health Insurance Review & Assessment Service. These databases are linked using an encrypted resident registration number. The CPLD, established in 2022 and updated annually, comprises 1,983,499 men and women newly diagnosed with cancer between 2012 and 2019. It contains data on cancer registration and death, demographics, medical claims, general health checkups, and national cancer screening. The most common cancers among men in the CPLD were stomach (16.1%), lung (14.0%), colorectal (13.3%), prostate (9.6%), and liver (9.3%) cancers. The most common cancers among women were thyroid (20.4%), breast (16.6%), colorectal (9.0%), stomach (7.8%), and lung (6.2%) cancers. Among all patients, 571,285 died between 2012 and 2020 owing to cancer (89.2%) or other causes (10.8%). Upon approval, the CPLD is accessible to researchers through the K-CURE portal. The CPLD is a unique resource for diverse cancer research, enabling investigation of medical use before a cancer diagnosis, during initial diagnosis and treatment, and during long-term follow-up. It offers expanded insight into healthcare delivery across the cancer continuum, from screening to end-of-life care.
Introduction
Advances in information technology have increased the value of big data on cancer and, with it, the demand for such data in cancer research [1]. However, the health and medical information being accumulated includes a large amount of sensitive individual health information, and privacy protections restrict many uses of healthcare big data. In 2020, the Personal Information Protection Act was revised to promote data use. Under the revised act, pseudonymized data that cannot identify individuals can be used for statistics, scientific research, and public records without individual consent. Furthermore, an amendment to the Cancer Control Act was implemented in 2021 to reinforce cancer data collection and sharing, with tasks delegated to the National Cancer Data Center (NCDC). The National Cancer Center was designated as the NCDC in the same year.
The Korean Ministry of Health and Welfare initiated the Korean Clinical Data Utilization Network for Research Excellence (K-CURE) project in 2022 based on the Personal Information Protection and Cancer Control Acts. This project aims to establish an ecosystem for combining and utilizing clinical and public cancer data. The Cancer Public Library Database (CPLD), established under the K-CURE project, combines data from four major population-based public sources: the Korea National Cancer Incidence Database (KNCI DB) in the Korea Central Cancer Registry (KCCR), cause-of-death data in Statistics Korea, National Health Information Database (NHID) in the National Health Insurance Service (NHIS), and National Health Insurance Research Database (NHIRD) in the Health Insurance Review & Assessment Service (HIRA).
This study aimed to offer a comprehensive profile of CPLD data, highlighting its representation of the entire patient population with cancer in Korea. We present descriptive statistics detailing the number of patients included in the CPLD, their demographics, medical usage, and mortality. Furthermore, this study emphasizes the potential value of the CPLD in cancer research by presenting its available data.
Materials and Methods
1. Data sources
The CPLD resulted from the collaborative efforts of the KCCR, NHIS, HIRA, and Statistics Korea. To establish the CPLD, the NCDC requested the KNCI DB from the KCCR, cause-of-death data from Statistics Korea, the NHID from the NHIS, and the NHIRD from the HIRA. The KNCI DB is a nationwide and hospital-based cancer registration database that regularly collects information on newly diagnosed cancer (incident) cases among Korean residents [2]. The KCCR has reported nationwide statistics since 1999; our previous study provides detailed information on the KCCR and KNCI DB [3]. Completeness is an important data quality indicator, and the 2020 KNCI DB was estimated to be 98.3% complete using the method proposed by Ajiki et al. [4]. The KNCI DB contains data on demographics (such as age, sex, and residence), diagnosis date, cancer type (based on the International Classification of Diseases, 10th revision), Surveillance, Epidemiology, and End Results (SEER) summary stage, morphology, and treatment methods for patients with cancer.
The mortality data in the KNCI DB were primarily derived from the cause-of-death data collected by Statistics Korea. The cause-of-death data are obtained from the death certificates of Koreans who resided in Korea. Causes of death were classified using the disease classification recommended by the World Health Organization [5] and the 7th Korean Standard Classification of Diseases and Causes of Death [6]. We collected the cause and date of death information from the cause-of-death database.
The NHIS in Korea is a single insurer that provides health insurance coverage for all citizens living in Korea, managing their eligibility, collecting insurance contributions, and providing health insurance benefits. The HIRA evaluates medical service fees, healthcare quality, and medical service adequacy. Under this universal health coverage system, the NHID and NHIRD contain healthcare information such as treatments, pharmaceuticals, procedures, and diagnoses for approximately 50 million beneficiaries [7,8]. We collected sociodemographic data of the NHIS beneficiaries and medical aid recipients, alongside information on their general health checkups and national cancer screening examinations from the NHID. Medical utilization data were gathered from the NHIRD, which comprised the following files: (1) general information; (2) healthcare services, including inpatient prescriptions; (3) disease diagnosis; (4) outpatient prescriptions; and (5) drug master table.
2. Data linkage
The individuals included in the KNCI DB are linked to their NHID and NHIRD enrollment data, as well as cause-of-death data, using an algorithm based on their resident registration number. Each database entry is linked through a join key, an encrypted value derived from the resident registration number using a secure hash algorithm and a salt value. To protect personal information, the individual join keys and serial numbers generated by each institution are collected by the Korea Health Information Service (a trusted third-party organization). The Korea Health Information Service uses this information to create a linkage table, which the NCDC uses to combine the data from each institution. The NCDC deletes the linkage table once the combination process is complete. Consequently, the personally identifiable information used to link the databases is excluded from the CPLD. Instead, each individual receives a unique, non-identifiable number to enable tracking across data files and over time. Therefore, database users are not allowed to link additional data resources at an individual level. Any additional data linkage requires each institution to generate a new linkage table, a process that can be performed only by specialized institutions authorized under the Personal Information Protection Act.
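The join-key step described above can be illustrated with a minimal sketch. The source states only that a secure hash algorithm and a salt value are used; the choice of HMAC-SHA-256, the function names `make_join_key` and `build_linkage_table`, the sample salt, the serial-number formats, and the registration numbers below are all assumptions made for illustration and do not describe the actual K-CURE implementation.

```python
import hashlib
import hmac


def make_join_key(resident_registration_number: str, salt: bytes) -> str:
    """Return a hex join key for one person (illustrative only)."""
    # Normalize formatting so "900101-1234567" and "9001011234567" match.
    normalized = resident_registration_number.replace("-", "").strip()
    # Keyed hash: the shared secret salt is the key, the number is the message.
    return hmac.new(salt, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def build_linkage_table(source_a: dict, source_b: dict) -> list:
    """Pair institution serial numbers whose join keys match.

    Each argument maps join key -> serial number issued by that institution.
    """
    shared_keys = sorted(set(source_a) & set(source_b))
    return [(source_a[key], source_b[key]) for key in shared_keys]


# Hypothetical salt and made-up registration numbers.
SALT = b"example-shared-salt"
registry = {make_join_key("900101-1234567", SALT): "KCCR-0001"}
claims = {
    make_join_key("9001011234567", SALT): "NHIS-0042",
    make_join_key("850505-2345678", SALT): "NHIS-0043",
}
linkage_table = build_linkage_table(registry, claims)
```

In this sketch the linkage table holds only institution serial numbers, so the party that combines the data never needs the registration number itself, which mirrors the separation of roles described in the text.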
In total, the CPLD includes 1,983,499 individuals of all ages (0 to 100 years or older) diagnosed with cancer between 2012 and 2019. The CPLD includes information on deaths between 2012 and 2020, as well as health insurance eligibility, general health checkups, national cancer screening, and medical claims between 2012 and 2021 (Fig. 1).
3. Data access
The CPLD can be accessed through the K-CURE portal (https://k-cure.mohw.go.kr). Researchers are required to submit a study proposal with ethical approval from their Institutional Review Board. The submission must be approved by the NCDC review committee before data access is granted. In principle, only the minimum data needed to address the research question are provided. Provider identifiers, sensitive disease names (such as mental diseases and sexually transmitted diseases), and related medical information are removed or replaced in the CPLD to protect privacy. Approval from the NCDC review committee is required for all restricted-variable requests.
4. Data included in the CPLD
The CPLD comprises various linkable files. Because a patient can have several cancer cases and many associated claims, the files are organized by the unique serial number assigned to each included patient with cancer. Table 1 presents the various file types. Twenty-four cancer types in the CPLD are classified based on the International Classification of Diseases, 10th revision codes. Ages are grouped in 5-year intervals between 20 and 79 years, while those under 20 and those 80 or older are grouped separately (0-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, and ≥ 80 years old). However, the NCDC review committee can approve special requests for 1-year intervals. The regions are grouped into 17 municipal units: Seoul, Busan, Daegu, Incheon, Gwangju, Daejeon, Ulsan, Sejong, Jeju-do, Gyeonggi-do, Gangwon-do, Chungcheongbuk-do, Chungcheongnam-do, Jeollabuk-do, Jeollanam-do, Gyeongsangbuk-do, and Gyeongsangnam-do. Health insurance premiums are categorized into deciles. The causes of death are grouped into 24 cancer types and major classifications following the Korean Standard Classification of Diseases ver. 7, based on the International Classification of Diseases, 10th revision. The NHIS claims files from the NHIRD include unique patient identifiers, sex, service date(s), diagnosis codes, procedure codes, amount charged, and amount reimbursed. The general health checkups and cancer screening files contain general health status (including height, weight, results of blood tests, and disease history), health behavior (including smoking, alcohol consumption, and physical activity), and screening results for five cancer types.
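The age grouping described above can be sketched as a small function. This is only an illustration of the 14 bands listed in the text; the function name `age_group` and the label strings are assumptions, not variable names or codes used in the CPLD files.

```python
def age_group(age: int) -> str:
    """Map an age in whole years to one of the 14 age bands listed in the text."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 20:
        return "0-19"          # everyone under 20 is pooled
    if age >= 80:
        return "≥80"           # everyone 80 or older is pooled
    lower = (age // 5) * 5     # start of the 5-year band, e.g. 63 -> 60
    return f"{lower}-{lower + 4}"


# All ages from 0 to 100 fall into exactly 14 distinct bands.
all_groups = {age_group(age) for age in range(0, 101)}
```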
5. Statistical analysis
This study presented descriptive statistics on the clinical and sociodemographic characteristics and healthcare utilization of patients with cancer included in the CPLD. We presented the number of patients with cancer based on cancer sites and the proportions of the top five sites over the years. Additionally, we presented the number of deaths between 2012 and 2020 and their main causes. Furthermore, to demonstrate the extent and type of NHIS-covered services used by patients with cancer, we calculated the annual average claims per patient in the year before diagnosis (12 months before the month of cancer diagnosis), the year after diagnosis (12 months after the month of cancer diagnosis), and 12 months before death (11 months before the month of death). Descriptive analyses were performed using SAS ver. 9.4 (SAS Institute Inc., Cary, NC).
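The month-based observation windows can be sketched as follows. The analyses in this study were run in SAS; the Python below is only an illustration, and it rests on several assumptions that the text does not confirm: that the diagnosis month itself is excluded from both diagnosis windows, that the last year of life is the month of death plus the 11 preceding months, and that the average is taken over all patients, including those with no claims in the window. All function names and sample data are hypothetical.

```python
from datetime import date


def month_index(d: date) -> int:
    """Number of calendar months since year 0, used to compare months."""
    return d.year * 12 + (d.month - 1)


def months_from_anchor(claim_date: date, anchor_date: date) -> int:
    """Signed month offset of a claim relative to the anchor month."""
    return month_index(claim_date) - month_index(anchor_date)


def in_year_before_diagnosis(claim_date: date, diagnosis_date: date) -> bool:
    # Assumed window: the 12 calendar months preceding the diagnosis month.
    return -12 <= months_from_anchor(claim_date, diagnosis_date) <= -1


def in_year_after_diagnosis(claim_date: date, diagnosis_date: date) -> bool:
    # Assumed window: the 12 calendar months following the diagnosis month.
    return 1 <= months_from_anchor(claim_date, diagnosis_date) <= 12


def in_last_year_of_life(claim_date: date, death_date: date) -> bool:
    # Assumed window: the month of death plus the 11 months before it.
    return -11 <= months_from_anchor(claim_date, death_date) <= 0


def mean_claims_per_patient(claims, anchors, in_window) -> float:
    """claims: list of (patient_id, claim_date); anchors: patient_id -> date."""
    counts = {patient_id: 0 for patient_id in anchors}
    for patient_id, claim_date in claims:
        if patient_id in anchors and in_window(claim_date, anchors[patient_id]):
            counts[patient_id] += 1
    return sum(counts.values()) / len(counts) if counts else 0.0


# Made-up example: two patients and five outpatient claims.
diagnosis_dates = {"A": date(2015, 6, 15), "B": date(2016, 1, 10)}
example_claims = [
    ("A", date(2014, 6, 1)),    # offset -12: inside the pre-diagnosis window
    ("A", date(2015, 5, 31)),   # offset -1: inside
    ("A", date(2015, 6, 2)),    # offset 0: diagnosis month, outside
    ("B", date(2015, 1, 20)),   # offset -12: inside
    ("B", date(2014, 12, 31)),  # offset -13: outside
]
mean_before = mean_claims_per_patient(
    example_claims, diagnosis_dates, in_year_before_diagnosis
)
```

Working in whole calendar months rather than exact day counts matches the month-level wording of the window definitions and avoids depending on the day of diagnosis within the month.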
Results
Table 2 presents the number of patients with cancer based on their sociodemographic characteristics and diagnosis year. Of the 1,983,499 patients, the largest share were in their 60s (23%), followed by the 70-79 and 50-59 age groups. The 8-10 decile group was the most common health insurance premium group at cancer diagnosis. The distribution of patients with cancer based on the SEER summary stage was as follows: 40.9% had localized cancer, 27.1% belonged to the regional group, 16.1% belonged to the distant group, and 15.8% were categorized as unknown.
Fig. 2 shows the top five cancers by sex from 2012 to 2019. Among the 996,209 men, stomach, lung, colorectal, prostate, and liver cancers were the top five cancers, accounting for 16.1%, 14.0%, 13.3%, 9.6%, and 9.3% of all cancer cases diagnosed, respectively. The proportions of lung and prostate cancers in men steadily increased from 2012 to 2019, while those of stomach, colorectal, and liver cancers decreased. The most common cancers in women were thyroid (20.4%), breast (16.6%), colorectal (9.0%), stomach (7.8%), and lung (6.2%) cancers. The proportions of breast and lung cancers in women steadily increased from 2012 to 2019, while that of stomach cancer steadily decreased. Thyroid cancer accounted for about 30% of cancer cases in women in 2012, but its share gradually decreased thereafter, to 16.9% of all cancer cases in 2019. The number of incident cancer cases from 2012 to 2019 by cancer type in men and women is available in S1 Table.
Among these patients with cancer, 571,285 died between 2012 and 2020, with 89.2% of the deaths attributed to cancer and 10.8% to other causes (Table 3). Lung cancer caused the highest number of deaths in both sexes, with 91,437 deaths in men and 29,707 in women. Liver (14.4%), stomach (9.6%), colorectal (8.3%), and pancreatic (6.1%) cancers had the highest number of deaths among the men after lung cancer. Colorectal (11.3%), pancreatic (9.5%), stomach (9.1%), and liver (8.9%) cancers caused the most deaths among the women.
Table 4 presents the medical service utilization patterns during the 1 year before and after cancer diagnosis, as well as during the 1 year before death. Regarding medical services, 93% of the patients with cancer had outpatient claims and 43% had inpatient hospitalization claims during the 1 year before cancer diagnosis. Almost all patients with cancer (92%) had at least one outpatient claim, and the majority (89%) had at least one inpatient claim during the year after diagnosis. Furthermore, of the 571,285 patients who died between 2012 and 2020, 98% had outpatient and inpatient hospitalization claims in the 1 year before death. The average number of outpatient visits and inpatient hospitalizations per patient was higher during the 1 year after diagnosis than during the 1 year before. The average number of inpatient hospitalization claims increased from 1.9 to 4.5. Medical care use increased during the last year of life, with an average of 38.7 outpatient visits and 7.8 inpatient hospitalizations. Furthermore, 41% and 34% of patients used dental and oriental medicine services, respectively, in the 1 year before cancer diagnosis. However, fewer patients with cancer used dental and oriental medicine services in the 1 year after diagnosis and in the 1 year before death.
Discussion
The CPLD has several strengths. The CPLD encompasses 96.7% of all cancer incidence cases, as published in the annual report of cancer statistics of KCCR [9], ensuring a comprehensive representation of the population. This is advantageous because previous studies using NHIS claims data faced challenges in accurately defining patients with cancer using disease and procedure codes, which led to the underestimation or overestimation of cancer incidence or prevalence [7,10,11]. Consequently, the CPLD is a valuable resource for overcoming the limitations of defining cancer diagnoses in research.
The key features include patient demographics (including age and sex), detailed clinical cancer characteristics (including diagnosis date, site, histology, and summary stage), extensive healthcare service utilization, and cost information. These features facilitate the identification and comparison of cancer treatments and outcomes among the included populations. Moreover, the longitudinal nature of the CPLD, covering before and after cancer diagnosis periods, facilitates the calculation of time-dependent measures such as comorbidity indices, a comprehensive analysis of various treatments (including surgery, radiation, chemotherapy, immunotherapy, and other treatments), and outcomes (including time to subsequent events or death). Additionally, these longitudinal data offer valuable insights into the long-term outcomes of cancer survivors.
The CPLD is similar to the SEER-Medicare database in the United States, which combines SEER cancer registry data with Medicare enrollment and claims data [12]. The SEER-Medicare database offers advantages, including a substantial number of cancer cases, detailed tumor characteristics, population-based data sources, longitudinal Medicare data, an extensive range of covered services, and biennial linkage updates [12]. Additionally, the SEER-Medicare linkage encompasses non-cancer control groups and incorporates ancillary linkage data sources, such as the Medicare Health Outcome survey and the Medicare Consumer Assessment of Healthcare Providers and Systems survey. However, findings from the SEER-Medicare analyses may not be generalizable to younger populations owing to its focus on linking with Medicare data, primarily including individuals aged 65 years and older [12].
The CPLD has some limitations. First, a time lag of 2-3 years exists between the generation of individual data and their availability for research. The CPLD released in 2023 included patients with cancer diagnosed through 2019, deaths through 2020, and claims through 2021. This time lag is primarily driven by the KNCI DB, for which the lag is necessary to ensure the completeness of cancer registration [3]. Therefore, researchers should take this lag into account when designing studies using the CPLD.
Second, claims data from the NHID and NHIRD do not encompass all health-related information. For example, clinically observed information, which may be present in medical records, is excluded from the CPLD. Furthermore, services such as cosmetic surgical procedures or over-the-counter drugs not covered by the NHIS are absent in the CPLD because claims data are generated to reimburse healthcare services covered by the NHIS. The CPLD includes medical procedure codes to indicate that specific tests are conducted; however, the CPLD lacks information on test results (such as imaging test results, biomarker data, and laboratory values). Additionally, certain health conditions, such as mental illness, suicide, sexually transmitted diseases, and miscarriage, are not available because of privacy concerns. Therefore, researchers should consider these constraints when selecting study topics.
Third, researchers should understand the structure and characteristics of the CPLD. Claims data in the CPLD comprise diverse file types, each with one-to-many linkage relationships. Furthermore, the CPLD contains left- and right-truncated data. Therefore, caution should be exercised when interpreting trends in cancer incidence, prevalence, and mortality rates. Additionally, specialized knowledge of NHIS billing and coding is essential for properly manipulating and interpreting data.
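One practical consequence of the one-to-many relationships noted above is that joining a claim-level file to a file with several rows per claim (for example, one row per diagnosis) can duplicate claim amounts. The following sketch illustrates the pitfall with made-up records; the field names `claim_id`, `patient_id`, `amount`, and `code` are hypothetical and are not the CPLD variable names.

```python
from collections import defaultdict

# Made-up claim-level rows: one row per claim.
claims = [
    {"claim_id": "c1", "patient_id": 1, "amount": 100},
    {"claim_id": "c2", "patient_id": 1, "amount": 250},
    {"claim_id": "c3", "patient_id": 2, "amount": 80},
]
# Made-up diagnosis rows: a claim can carry several diagnosis codes.
diagnoses = [
    {"claim_id": "c1", "code": "C16"},
    {"claim_id": "c1", "code": "K29"},
    {"claim_id": "c2", "code": "C16"},
    {"claim_id": "c3", "code": "C34"},
]


def total_by_patient(claim_rows):
    """Sum amounts at the claim level, where each claim appears once."""
    totals = defaultdict(int)
    for claim in claim_rows:
        totals[claim["patient_id"]] += claim["amount"]
    return dict(totals)


def naive_total_after_join(claim_rows, diagnosis_rows):
    """Sum amounts after joining to diagnoses; repeats a claim per diagnosis."""
    by_claim = {claim["claim_id"]: claim for claim in claim_rows}
    totals = defaultdict(int)
    for diagnosis in diagnosis_rows:
        claim = by_claim[diagnosis["claim_id"]]
        totals[claim["patient_id"]] += claim["amount"]
    return dict(totals)


correct_totals = total_by_patient(claims)
inflated_totals = naive_total_after_join(claims, diagnoses)
```

Here claim `c1` has two diagnosis rows, so the joined total for patient 1 counts its amount twice; aggregating at the claim level before joining avoids this.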
Finally, information related to diagnoses and diseases, excluding cancer, may not accurately reflect disease occurrence and prevalence because it primarily comes from the claims data used for reimbursement [8]. Moreover, administrative claims data alone do not provide insight into the decision-making process for cancer care and other patient-reported outcomes. These limitations are not exclusive to the CPLD, but are common in databases relying on claims data, which are primarily gathered for administrative rather than research purposes.
In conclusion, the CPLD provides a unique resource for diverse cancer research, enabling the investigation of medical usage patterns before a cancer diagnosis, during the period of initial diagnosis and treatment, and during long-term follow-up. This facilitates expanded insights into healthcare delivery across the cancer continuum, from screening to end-of-life care. Partners from the NCDC, Statistics Korea, KCCR, NHIS, and HIRA ensure the continual enhancement and maintenance of the CPLD. Each year, data on newly diagnosed cancer patients will be added to the CPLD and data on existing cancer patients will be updated. Furthermore, there are plans to expand the range of public agency data based on researchers’ needs, including the coronavirus disease 2019 DB of the Korea Disease Control and Prevention Agency. Finally, with continuous cooperation and efforts, the CPLD can contribute to the development of future insights into cancer research in South Korea.
Electronic Supplementary Material
Supplementary materials are available at the Cancer Research and Treatment website (https://www.e-crt.org).
Notes
Author Contributions
Conceived and designed the analysis: Choi KS, Chae H, Choi DW, Ryu KS.
Collected the data: Im JS, Choi KS, Choi DW, Ryu KS, Kong HJ, Cha HS, Kim HJ, Chae H, Jeon YS, Kim H, Jung J.
Contributed data or analysis tools: Choi DW, Guk MY, Kim HR.
Performed the analysis: Choi DW, Guk MY, Kim HR.
Wrote the paper: Choi DW, Choi KS.
Interpretation and review: Choi DW, Choi KS.
Review and comment: Im JS, Choi KS, Choi DW, Guk MY, Kim HR, Ryu KS, Kong HJ, Cha HS, Kim HJ, Chae H, Jeon YS, Kim H, Jung J.
Conflicts of Interest
Conflict of interest relevant to this article was not reported.
Acknowledgements
Special thanks to the Korean Ministry of Health and Welfare, Statistics Korea, the Korea Central Cancer Registry, the National Health Insurance Service, the Health Insurance Review & Assessment Service, and the Korea Health Information Service for their support and contributions to the K-CURE project. This work was supported by the Health Promotion Fund of the Ministry of Health & Welfare (No. 22A2400-1) and research grants (No. 2310520-2, No. 2310690-1) from the National Cancer Center, Republic of Korea.