Safe Utilization and Sharing of Genomic Data: Amendment to the Health and Medical Data Utilization Guidelines of South Korea
Abstract
Purpose
In 2024, medical researchers in the Republic of Korea were invited to amend the health and medical data utilization guidelines (Government Publications Registration Number: 11-1352000-0052828-14). This study aimed to show the overall impact of the guideline revision, with a focus on clinical genomic data.
Materials and Methods
This study amended the pseudonymization of genomic data defined in the previous version of the guidelines through a joint study led by the Ministry of Health and Welfare, the Korea Health Information Service, and the Korea Genome Organization. To revise the previous version, we held three conferences with four main medical research institutes and seven academic societies. We also conducted two surveys targeting genome specialists in academia, industry, and research institutes.
Results
We found that cases of pseudonymization in the application of genome data were rare and that there was ambiguity in the terminology used in the previous version of the guidelines. Most experts (>~90%) agreed that the ‘reserved’ condition should be eliminated to make genomic data available after pseudonymization. In this study, the scope of genomic data was defined as clinical next-generation sequencing data, including FASTQ, BAM/SAM, and VCF files and medical records. Pseudonymization targets genomic sequences and metadata, including specific elements such as germline mutations, short tandem repeats, single-nucleotide polymorphisms, and identifiable data (for example, ID or environmental values). Expression data generated from multi-omics can be used without pseudonymization.
Conclusion
This amendment will not only enhance the safe use of healthcare data but also promote advancements in disease prevention, diagnosis, and treatment.
Introduction
In recent years, the medical field has developed technological innovations such as artificial intelligence for healthcare services. The development of high-performance computing systems and next-generation sequencing (NGS) technologies has revolutionized diagnostics and disease causation analysis in healthcare, leading to the accumulation of massive amounts of genomic data and global drug development. The annual growth rate of genome data generation is predicted to reach 16.85% between 2022 and 2030 [1]. Advanced countries in the medical field continue to construct and share healthcare databases, including genomic data. They are also developing standards for responsible data utilization through international consortia such as GA4GH [1].
The representative international genome projects include ‘All of Us’ in the United States [2-4] and ‘UK Biobank’ in Europe [5-7]. By collecting the metadata and genome sequencing data of diverse ethnic and healthy donors, the projects built up big data that were publicly available to researchers worldwide. These data are widely used in various fields of science, medicine, and industry. These international efforts have significantly influenced national genome data construction and utilization strategies in the Republic of Korea [8]. To deposit one million Korean whole-genome sequences, the Korean government launched the National Project of Bio Big Data, which involves multi-ministerial collaboration, including the Ministry of Health and Welfare, the Ministry of Science and ICT, and the Ministry of Trade, Industry, and Energy. This project aimed to deposit clinical and genomic data, including whole-genome sequences of one million Koreans, in the first stage (between 2024 and 2028) and beyond. It also includes data across a broad spectrum of individuals, including the general population and patients with rare diseases, severe illnesses, and chronic conditions. It aimed to contribute to disease prediction, diagnosis, personalized and precise medicine, and public health enhancement. There is an urgent need for the domestic implementation of safe utilization environments for the public release of clinical and genomic data.
Targeted genome sequencing data (whole-exome sequencing or whole-genome sequencing) have been generated in hospitals and medical institutions using NGS and can be categorized by purpose as clinical or research data. Genomic data for research purposes are provided after review by an Institutional Review Board (IRB) in accordance with the Bioethics Law, and the data provided should be stored separately from identifiable personal information. Clinical genomic data governed by the Personal Information Protection Act (PIPA) have been utilized with comprehensive consent forms; since the 2020 amendment to the Act, special provisions concerning pseudonymized information have also applied (S1 and S2 Tables) [9]. This allowed the use of data without consent through pseudonymization, addressing the previously deferred use owing to the risks of personal identification in genomic data. This paper proposes a method for securely using previously deferred genomic data, thereby enabling their application.
In this study, we introduce the overall processes and research results of amending healthcare data utilization guidelines to enable the safe and proactive use of genomic data. This study was conducted under the auspices of the Korea Health Information Service (KHIS) and the Korea Genome Organization (KOGO). While the Health and Medical Data Utilization Guidelines were being amended, consultations and surveys were conducted with expert groups and relevant academic, medical, and industrial societies. This study proposes pseudonymization methods for metadata and genomic sequence information, with a focus on diagnostic NGS data. The amendment includes utilization strategies for omics-level data across industries, hospitals, academia, and research. The amended guidelines provide clear directions for pseudonymizing genomic data, managing personally identifiable genetic information, and sharing and utilizing data. The amended guidelines can also be expected to not only play a significant role in Korean-customized medical services, disease prevention, and treatment research but also contribute to the National Project of Bio Big Data, promoting the active utilization of genomic data.
Materials and Methods
1. Genome experts conference meeting
The project was conducted as joint research between KHIS and KOGO. The main researchers were experts with over 10 years of experience in clinical genomics research.
In this study, we invited representatives from each academic society to a meeting to reflect the opinions of experts on genome data utilization. We consulted at least two experts from each of the Korean Cancer Association (KCA), KOGO, Korean Society for Bioinformatics (KSBI), Korean Society of Medical Genetics and Genomics (KSMG), Korean Society of Medical Informatics (KSMI), Korean Society of Pathologists (KSP), and Korean Society for Laboratory Medicine (KSLM) who specialize in genome data analysis and management. Additionally, to incorporate the viewpoints of the corporate sector, we invited four companies: sequencing and direct-to-consumer analysis companies and drug target discovery companies. The main researchers were provided with weekly agendas covering topics such as the scope of data disclosure, use of data without consent, survey design, and a secure network environment (S3 Fig.).
Based on this study, we developed an amended guideline for safely utilizing genomic data. In the initial phase, meetings focused on gathering insights from representatives of industry, academia, and governmental institutions. In the second stage, advice from medical research institutes utilizing genomic data was analyzed. In the third and final stage, the opinions of specialists working on genomic data analysis at hospitals and medical centers were reflected (S3 Fig.). Subsequently, in the expert advisory meetings, five major issues were discussed: (1) the scope of genomic data disclosure; (2) the extent of information utilization by file; (3) the scope of utilization of omics data, mainly expression data; (4) a secure environment following pseudonymization; and (5) the process of exporting files (S3 Fig.). After a detailed, multi-step review, the scope of genome data covered by the study was limited to VCF, BAM/SAM, and FASTQ files and genome information stored in medical data, owing to limitations in data utilization.
2. Surveys for utilization of genome data
To identify issues with the previous guidelines and investigate safe methods for utilizing genomic data, we conducted two surveys with experts in the field (S4 Fig.). The first survey targeted 52 experts from various institutions, including hospitals (24%), universities (38%), research institutes (10%), and companies (28%). This survey focused on aspects such as analysis, storage, backup, safety measures, public disclosure of genomic data, and understanding of the legislation related to pseudonymization. The second survey, involving 19 participants, assessed the need to amend the health and medical data utilization guidelines. Specifically, this survey aimed to identify the general characteristics associated with NGS results in patient data and records held by medical institutions.
3. Investigation of personally identifiable genetic information for pseudonymization
We approached our research from various perspectives to extract elements for personal identification from genomic data. Raw NGS sequencing data, mapping files, and mutation calling data generated in the genome data analysis pipeline were studied to identify the factors that can be pseudonymized. Depending on the sequencing instrument, multiple pipelines can generate different types of files (S5 Table). The selection of elements to be pseudonymized was discussed through extensive meetings with experts and reflected in the guidelines. These files can be of various types; however, their contents can be broadly classified into genetic sequences and metadata information.
1) Investigation of personally identifiable genetic information elements in nucleic acid sequences
From the genome data, it was possible to identify inherent personal features within the nucleic acid sequences. Germline mutations, short tandem repeats (STRs), and single nucleotide polymorphisms (SNPs) were selected as factors that could identify individuals.
The STR database provided by the National Institute of Standards and Technology (NIST) under the U.S. Department of Commerce was referenced. Through surveys and extensive discussion, we defined common variants as those with a minor allele frequency (MAF) of more than 1% and rare variants as those with an MAF of less than 1%. To determine rare variants, we recommend using the Korea Biobank Array Project [10], Korean Genome and Epidemiology Study (KoGES) project [11], Ulsan Genome Project [12], and gnomAD East Asian data.
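The 1% cutoff can be expressed as a small classification step. The following is a minimal sketch, assuming the allele frequency has already been looked up in one of the reference panels named above; the function names are hypothetical, and treating a frequency of exactly 1% as common is a choice made here, since the text leaves that boundary open.

```python
from typing import Optional

RARE_MAF_THRESHOLD = 0.01  # the 1% cutoff described in the text


def minor_allele_frequency(alt_allele_frequency: float) -> float:
    """Fold an allele frequency onto the minor allele (result is <= 0.5)."""
    if not 0.0 <= alt_allele_frequency <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return min(alt_allele_frequency, 1.0 - alt_allele_frequency)


def classify_variant(alt_allele_frequency: Optional[float],
                     threshold: float = RARE_MAF_THRESHOLD) -> str:
    """Return 'rare', 'common', or 'unknown' for a single variant."""
    if alt_allele_frequency is None:
        # Not present in the reference panel. Reported as unknown here;
        # a cautious pipeline might handle such variants like rare ones.
        return "unknown"
    maf = minor_allele_frequency(alt_allele_frequency)
    return "rare" if maf < threshold else "common"
```

A variant seen at a frequency of 0.004 would be classified as rare, and one seen at 0.25 as common. The threshold is a parameter so that it can be adjusted to the research purpose.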
2) Investigation of personally identifiable genetic information in metadata
Several files (e.g., FASTQ, BAM/SAM, VCF, and the reporting file) of the previously mentioned genomic data contain metadata, that is, information other than the nucleic acid sequence. Metadata such as the identification number, specific instrument number, and analysis environment values can be introduced as personally identifiable genetic information while the analysis pipeline runs. Therefore, the guidelines call for checking personally identifiable genetic information within the metadata of each file. To understand the information generated in each file, the following resources were investigated: the HTSlib/SAMtools documentation and the VCF v4.2 specification (see Data availability).
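As one illustration of metadata inside a raw file, the sketch below replaces FASTQ read names with running numbers. It assumes plain four-line FASTQ records and Illumina-style read names, which usually begin with an instrument identifier followed by run, flowcell, lane, and tile fields. The function name and the `read` prefix are hypothetical, and the guidelines do not prescribe this particular procedure.

```python
from typing import Iterable, Iterator


def pseudonymize_fastq_names(lines: Iterable[str],
                             prefix: str = "read") -> Iterator[str]:
    """Yield FASTQ lines with each read name replaced by a running number.

    Assumes 4-line records (header, sequence, separator, quality).
    """
    for index, line in enumerate(lines):
        position = index % 4
        line = line.rstrip("\n")
        if position == 0:
            if not line.startswith("@"):
                raise ValueError(f"line {index + 1}: expected '@' header")
            # Drops instrument, run, flowcell, and index information.
            yield f"@{prefix}{index // 4 + 1}"
        elif position == 2:
            # The separator line may repeat the read name; keep only '+'.
            yield "+"
        else:
            # Sequence and quality strings are left untouched.
            yield line
```

Running the same function over both files of a paired-end run keeps mates matched, provided both files list the reads in the same order. Only the metadata is addressed here: the sequence lines remain and carry the sequence-level identification risk.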
4. Data availability
The analytical tools and resources used for this study are publicly accessible. Details for access are as follows: Information and resources related to the Korea Biobank Array Project can be found at https://www.koreanchip.org/project. The Genome Aggregation Database (gnomAD), offering an extensive range of human genetic variation insights, is available at https://gnomad.broadinstitute.org/. Documentation and updates for HTSlib/SAMtools can be found at https://www.htslib.org/doc/samtools.html. The Variant Call Format (VCF) version 4.2 specification, a standard format for the representation of sequence variations, can be accessed at https://samtools.github.io/hts-specs/VCFv4.2.pdf.
Results
1. Amendment of the guidelines for the safe use of health and medical data
The amendment of the “Health and Medical Data Utilization Guidelines” focuses on safely utilizing genomic and omics data. To achieve this, we examined domestic laws and prepared amendments based on the previous version of the guidelines (2022). Previously, the guidelines stipulated that genomic data could not be used without the individual’s consent, and in cases of non-consent, only localized data were provided. The previous guidelines divided genomic data into ‘genomic information’ and ‘omics information, excluding the genome.’ For genomic information, it was possible to provide information on larger genetic units instead of specific mutation details or to offer novel mutation information (excluding germline mutations). No special measures were required for omics information, excluding the genome, on the grounds that recovery of genomic information is impossible. However, genomic information can still be recovered from some of these data, such as transcriptome data (e.g., cDNA). Based on these guidelines, we prepared amendments to address concerns regarding personal identification, interpretation uncertainties, and wording ambiguities, thereby expanding safe utilization (Table 1, S6 Table).
The amended guidelines accounted for the inability to interpret genomic data fully, reducing the risk of data identification. Furthermore, genomic data can contain information about third parties, including parents, siblings, and other family members, necessitating caution. These guidelines do not apply to human-derived materials collected and processed through research or donation consent. In addition, the guidelines make three recommendations for genomic data. First, they advise classifying genomic data files into genomic sequence information and non-sequence information (metadata). They recommend replacing or deleting high-risk personally identifiable genetic information. For instance, in typical genomic data files such as BAM/SAM and VCF, replacing or deleting high-risk information, such as germline mutations, STRs, and SNPs, is recommended. For metadata, personally identifiable genetic information and specific information (e.g., ID or environmental variables; file/directory name of analysis server, etc.) should be appropriately deleted or replaced. Special caution is required when utilizing raw data files such as FASTQ, which hold personally identifiable genetic information and can reveal individual genomic details through various analyses (e.g., mapping and BLAST search). Obtaining consent from the subject is advised, particularly when utilizing FASTQ files derived from NGS-based genetic testing in medical institutions. This precaution is crucial because detailed individual genomic information can be exposed when DNA sequence data are extracted from FASTQ files to generate SAM/BAM/VCF files. Finally, matrix data with expression values do not require special measures for omics information (transcriptome, metabolome, and proteome, etc.). The matrix data are created as unidentifiable data during gene expression measurements using anonymized sample IDs.
The deletion or replacement of personally identifiable genetic information, identifiable information, and specific information is recommended because of the existing risk of identification.
Despite these efforts, the complete interpretation of genomic data remains challenging, limiting risk reduction. Therefore, it is necessary to consider access control, physically secure environments, and risk assessment. Information with identifiability may need to be excluded, considering the time, cost, risk, and technical level required for pseudonymization. Alternatively, additional reviews or consent from the data subjects may be required depending on the purpose and files used.
These guidelines provide a basis for utilizing genomic data. This is an important foundation for balancing the protection and utilization of genomic data, which will contribute to safely utilizing genomic data and improving medical research quality.
2. Utilization of genomic data
In this study, we investigated the categorization and legal regulations of genomic data produced by medical institutions. Within this legal framework, our research aimed to present guidelines that enhance the usability of clinical data by minimizing personally identifiable genetic information while complying with legal and ethical standards. The data can be categorized into clinical and research data depending on whether they are deposited for treatment or research purposes. A thorough investigation was conducted regarding the laws and regulations related to these data, particularly focusing on replacing and deleting personally identifiable genetic information.
First, clinical data refer to genomic information collected from the pathology and diagnostic departments for diagnosing patient diseases. In this context, the file formats generated through NGS included FASTQ, BAM/SAM, VCF, and genomic information from medical records (Fig. 1A). Conversely, research data were generated for research purposes after obtaining a consent form for human-derived material donations from the patients. This also produced FASTQ, BAM/SAM, and VCF files through NGS analysis and is utilized based on the Bioethics Law [13]. Clinical and research data are subject to various domestic laws. Moreover, the purpose of these laws and personally identifiable genetic information varied (Fig. 1B, S7 and S8 Tables).
The scope of pseudonymization under the PIPA differs from the scope of anonymization permissible under the Bioethics and Safety Act. Therefore, we pseudonymized the clinical data in accordance with the principle that only the minimum amount of personally identifiable information should be collected, with such information removed or substituted to make it difficult to identify specific individuals. Pseudonymized data retain information and can be linked with other data if necessary. In contrast, anonymizing research data involves permanently deleting personally identifiable genetic information or substituting it with a unique identifier from the institution, making it impossible to recognize specific individuals. While the complete removal of personally identifiable genetic information offers a higher level of privacy protection, it significantly reduces the usefulness of the data. Each data type is utilized after the institution’s review boards evaluate its research purposes and safety, and use outside these purposes is subject to legal sanctions (Fig. 2).
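The practical difference between the two approaches can be illustrated with identifier replacement. This is a minimal sketch under stated assumptions: a keyed hash (HMAC-SHA256) stands in for pseudonymization, and a random token with no retained mapping stands in for anonymization. Neither technique is prescribed by the guidelines or the cited laws, and the function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def pseudonymize_id(patient_id: str, secret_key: bytes) -> str:
    """Keyed, repeatable replacement of an identifier.

    The same ID and key always give the same token, so records from
    different datasets can still be linked by whoever holds the key.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymize_id() -> str:
    """Random replacement; no mapping back to the patient is kept."""
    return secrets.token_hex(8)
```

With `pseudonymize_id`, the institution that holds `secret_key` can re-derive the token for a given patient and link new data to it, which is why the key has to be stored separately and protected. With `anonymize_id`, the link is gone once the original identifier is deleted, which gives stronger protection but rules out later linkage.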
3. Key personal identifiers in genomic data based on nucleotide sequence information
In the meeting, we defined the scope of utilizing genomic data. We also examined personally identifiable genetic information within each file type. The following personally identifiable genetic information is advised for replacement or deletion (Fig. 3A, S1 and S2 Tables).
Genetic mutations in germ cells can be inherited by offspring. In tumor sequencing, mutations may encompass not only tumor-specific mutations, but also preexisting mutations within patients. Such data must be handled carefully to prevent individual identification. For instance, the mitochondrial DNA mutation “m.16093A>G” signifies a change from adenine (A) to guanine (G), potentially indicating genetic connections among maternal relatives due to its inheritance through the mother’s lineage. This mutation may also imply a risk for mitochondrial diseases. Considering the challenges of completely removing germline mutations due to the limitations of bulk sequencing, protective measures such as masking or pseudonymization become crucial to minimize the exposure of sensitive genetic information (Fig. 3B).
STRs are DNA segments in which a sequence of 2 to 7 base pairs is repeated. They are widely used in genetic profiling, criminal investigations, paternity tests, ancestral research, and more, because the variation in length and repeat count per individual allows for personal identification. Although identifying individuals solely based on STR information is challenging, combining STRs with SNPs, population statistics, or other identifiable information can significantly enhance the likelihood of personal identification. For example, one study used two datasets for 872 individuals: 642,563 genome-wide SNPs and 13 STRs used in forensic applications. The results indicated that ~90%-98% of forensic STR records could be matched to the corresponding SNP records, and the accuracy increased to ~99%-100% when approximately 30 STRs were used [14]. Accordingly, personal identification markers provided by the US NIST were identified, and information posing a risk of personal identification when combined with other data should be removed (Fig. 3B).
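The structural definition of an STR (a 2 to 7 base motif repeated in tandem) can be sketched with a regular expression. This is an illustration only: the function name and the `min_repeats` default are hypothetical, the forensic loci listed by NIST are not encoded here, and real STR genotyping relies on dedicated tools and reference catalogs.

```python
import re
from typing import List, Tuple


def find_strs(sequence: str, min_repeats: int = 4) -> List[Tuple[int, str, int]]:
    """Return (start, motif, repeat_count) for tandem repeats of 2-7 bases."""
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    # Shortest motif of 2-7 bases, followed by further copies of itself.
    pattern = re.compile(r"([ACGT]{2,7}?)\1{%d,}" % (min_repeats - 1))
    hits: List[Tuple[int, str, int]] = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAA
        repeat_count = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeat_count))
    return hits
```

For the sequence `TTC` + `GATA` repeated five times + `CCG`, the function reports one repeat starting at offset 3 with motif `GATA` and a count of 5. The count is the part that varies between individuals, which is why these regions are candidates for replacement or deletion.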
SNPs are crucial variations within the human genome and play a significant role in understanding genetic differences related to race, physical appearance, and diseases. Variants with an alternate allele frequency of less than 1% were categorized as ‘rare variants.’ The development of the Korean chip in 2015, which comprises ~830,000 probes specifically designed for the Korean population, has improved the accuracy of determining the frequency of these SNPs. It is essential to recognize that combinations of SNPs or haplotype structures can be used for individual identification. For instance, in most major populations, it is estimated that only 45 SNPs are required to match two sets of genetic data to a unique individual, and only 300 SNPs are necessary to identify any individual uniquely [15]. Additionally, a panel of 52 SNPs has been approved for forensic use in several European countries (Fig. 3B) [16-18]. Moreover, omics data, including transcriptomic, proteomic, and metabolomic data, have expression data that are considered unstructured and devoid of personally identifiable genetic information. This data type can be utilized safely without concerns for personal identification, enabling researchers to leverage this information with reduced privacy risks (Fig. 3C).
We investigated specific instances of personally identifiable genetic information in actual files, demonstrating that these can be removed through parameter adjustments [19-21]. For instance, in a VCF file, if a ‘COMMON’ field is present, its value (typically 0 or 1) indicates whether a variant is common (i.e., frequent in the population) or rare; ‘COMMON=0’ could mean that the variant is rare. To verify the genetic frequency in East Asians, we considered a variant in the VCF sequence information section to be rare if the MAF for East Asians is ≤ 1% in the dbSNP database and removed such variants (Fig. 4A). In FASTQ and BAM/SAM files, information such as the analyzing organization, the corresponding file paths, and pathology details may remain; therefore, deletion is necessary. For example, in BAM/SAM files, the patient and specimen information in the @RG ID, LB, and SM fields and the program information and execution paths in @PG must be removed or replaced (Fig. 4B).
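The BAM/SAM header example can be sketched as follows. This is a minimal illustration operating on SAM header text lines, assuming tab-separated `TAG:value` fields as laid out in the SAM specification; the function names and alias prefixes are hypothetical. It replaces the @RG `ID`, `SM`, and `LB` values with aliases and drops the @PG `CL` (command line) field, where execution paths typically appear.

```python
from typing import Dict, Iterable, List, Tuple


def _alias(value: str, table: Dict[str, str], prefix: str) -> str:
    """Return a stable alias (prefix plus running number) for a value."""
    if value not in table:
        table[value] = f"{prefix}{len(table) + 1}"
    return table[value]


def scrub_sam_header(lines: Iterable[str]) -> Tuple[List[str], Dict[str, str]]:
    """Replace @RG ID/SM/LB values and drop @PG CL in SAM header lines.

    Returns the rewritten header and the old-to-new read-group ID map.
    """
    rg_ids: Dict[str, str] = {}
    samples: Dict[str, str] = {}
    libraries: Dict[str, str] = {}
    rewritten: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        record_type = fields[0]
        if record_type not in ("@RG", "@PG"):
            rewritten.append(line)  # @HD, @SQ, @CO pass through unchanged
            continue
        kept: List[str] = [record_type]
        for tag in fields[1:]:
            key, _, value = tag.partition(":")
            if record_type == "@RG" and key == "ID":
                value = _alias(value, rg_ids, "rg")
            elif record_type == "@RG" and key == "SM":
                value = _alias(value, samples, "sample")
            elif record_type == "@RG" and key == "LB":
                value = _alias(value, libraries, "lib")
            elif record_type == "@PG" and key == "CL":
                continue  # command line may expose server paths and names
            kept.append(f"{key}:{value}")
        rewritten.append("\t".join(kept))
    return rewritten, rg_ids
```

The returned mapping matters because each alignment record refers to its read group through an `RG:Z:` tag, so those tags would need the same substitution for the file to stay consistent. Free-text fields such as @CO comments or @RG `DS` descriptions are not touched by this sketch and would need a separate check.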
Discussion
The Health and Medical Data Utilization Guidelines have been amended to enhance the utilization of genomic data. The primary objective of this amendment was to overcome the limitations arising from phrases in the guidelines that required further clarification and applicable examples, and to propose safer methods for data utilization.
This study investigated anonymization/pseudonymization practices in several foreign countries and conducted research tailored to the domestic situation to enable researchers to safely utilize genomic data (S9-S13 Tables).
This study sought safe pseudonymization methods for genomic data to convey the minimum necessary information. Information providers and recipients must request and process the correct file formats suitable for their research purposes. Genomic data may contain various personally identifiable genetic information that should be replaced or deleted if they do not align with the research objectives. Special attention is needed for germline mutations, STRs, and rare SNPs. The criterion for rare SNPs in this guideline is set as an MAF of less than 1%, which can change depending on the research purpose [10].
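Because the rare-variant criterion may change with the research purpose, a filter with an adjustable threshold is one way to apply it to VCF records. The sketch below is an assumption-laden illustration: it reads the `COMMON` flag described in the Results and, if present, a population allele frequency from the INFO column. The key name `AF_EAS`, the function names, and the choice to drop unannotated records by default are all hypothetical and are not taken from the guidelines.

```python
from typing import Dict, Iterable, Iterator, Optional


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO column into a dict; flag entries map to ''."""
    entries: Dict[str, str] = {}
    if info == ".":
        return entries
    for item in info.split(";"):
        key, _, value = item.partition("=")
        entries[key] = value
    return entries


def drop_rare_variants(lines: Iterable[str], af_key: str = "AF_EAS",
                       threshold: float = 0.01,
                       keep_unannotated: bool = False) -> Iterator[str]:
    """Yield header lines and the records not considered rare."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            yield line
            continue
        info = parse_info(line.split("\t")[7])
        rare: Optional[bool] = None  # None: no usable annotation found
        if "COMMON" in info:
            rare = info["COMMON"] == "0"
        if af_key in info:
            try:
                freq = float(info[af_key].split(",")[0])  # first ALT only
                rare = min(freq, 1.0 - freq) < threshold
            except ValueError:
                pass  # value such as '.', keep the COMMON-based decision
        if rare is None:
            rare = not keep_unannotated  # cautious default: drop
        if not rare:
            yield line
```

When both annotations are present, the frequency takes precedence over the `COMMON` flag in this sketch. Dropping records that carry neither annotation is the more cautious default, since a variant absent from reference panels may be the most identifying of all; `keep_unannotated=True` reverses that choice.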
We investigated germline mutations that could identify individual Koreans by examining Korean-specific rare germline mutations [22]. To study Korean germline mutations, we utilized the Korea4K database, and for examining mutations in East Asians, we used data from gnomAD. As a result, rare variants specific to Koreans that comprised less than 1% of the total population accounted for 77% of our dataset. When compared with East Asians, 75% of these variants were found to be unique (http://honglab.catholic.ac.kr/cmm/fms/FileDown.do?atchFileId=FILE_000000000000462&fileSn=0) (S14 Fig.). Therefore, if the study’s focus is not specifically on Korean germline research, we suggest deleting or replacing the presented rare germline mutations.
We recommend using secure networks for data transfer in cases where the use of analysis tools is limited, or adequate measures are not taken. We also suggest managing data through secure networks and, when used externally, only disclosing pseudonymized VCF containing minimal personal information. Nucleic acid sequences in their entirety can be inferred from BAM/SAM files; therefore, we suggest analyzing them within a secure network and exporting the results. Raw FASTQ data containing personally identifiable genetic information on the nucleic acid sequence should be limited (S15 Fig.). Data-providing institutions should evaluate the appropriateness of the research purposes through independent institutional review processes. Any file that aligns with the research can be used if the research has passed a review. Simultaneously, these institutions should retain the authority and obligation to refuse requests that do not match the research purpose. In cases of inadequate safety measures, recent technologies such as homomorphic encryption or blockchain technology can be used to securely process and analyze personal information in genomic data [23,24]. However, these encryption technologies are operationally limited and require further investigation.
Despite extensive discussions through multiple meetings, there remains disagreement regarding the scope and methods of data utilization among the participating institutions handling genomic data (hospitals, universities, medical research institutes, companies, etc.), mainly due to the competing risks of personal information leakage and identifiability. Fear caused by a lack of legal understanding and uncertainty about liability for damage seem to be the main reasons. Although this study focused on clinical genomic data, it clarified that genomic data produced for research purposes could also be sufficiently provided and utilized following medical research ethics reviews. To raise awareness of this at the national level, efforts such as promotional activities by academic societies, awareness training, and public hearings are required. Additionally, data recipients must be aware that acts aimed at identifying specific individuals are strictly prohibited under the PIPA, and violations can result in serious legal responsibilities. Subsequent research may consider including other file formats, such as compressed reference-oriented alignment map (CRAM) files. Regular reviews and continuous updates to the guidelines are necessary, and the ongoing participation and feedback of experts in various fields are crucial in this process.
South Korea is currently driving innovation in the medical field by establishing the National Project of Bio Big Data. This study provides guidelines not only to address the issue of data fragmentation but also to balance data utilization with the protection of individual privacy. These guidelines include secure pseudonymization methods for genomic data files, focusing on maximizing data usability while minimizing the risk of identification. Furthermore, this study emphasizes the need to continually amend and refine the guidelines through expert feedback based on legal and ethical standards to enhance the use of genomic data in research and industrial domains. This research is expected to contribute to the National Project of Bio Big Data, medical innovation, and research on disease prevention, diagnosis, and treatment through the safe utilization of genomic data.
Electronic Supplementary Material
Supplementary materials are available on the Cancer Research and Treatment website (https://www.e-crt.org).
Notes
Author Contributions
Conceived and designed the analysis: Park H, Park J, Woo HG, Yoon H, Lee M, Hong D.
Collected the data: Park H and Hong D.
Contributed data or analysis tools: Park H, Park J, Hong D.
Performed the analysis: Park H, Park J, Woo HG, Yoon H, Lee M, Hong D.
Wrote the paper: Park H, Park J, Hong D.
Conflicts of Interest
Conflict of interest relevant to this article was not reported.
Acknowledgements
This work was supported in part by grants from the National Research Foundation of Korea (NRF) funded by the Korea government (Ministry of Science and ICT) (NRF-2021M3H9A2097227, NRF-2022R1A2C3008162, and RS-2023-00220840), the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (RS-2023-00265923), and the Basic Medical Science Facilitation Program through the Catholic Medical Center of the Catholic University of Korea funded by the Catholic Education Foundation. We thank the Global Science experimental Data hub Center (GSDC) and the Korea Research Environment Open NETwork (KREONET) service for data computing and network provided by the Korea Institute of Science and Technology Information (KISTI). The authors also thank the Korea Health Information Service (KHIS), the Korean Cancer Association (KCA), the Korea Genome Organization (KOGO), the Korean Society for Bioinformatics (KSBI), the Korean Society of Medical Genetics and Genomics (KSMG), the Korean Society of Medical Informatics (KSMI), the Korean Society of Pathologists (KSP), and the Korean Society for Laboratory Medicine (KSLM). We also extend our gratitude to Macrogen, Theragen Etex, Enzychem Lifesciences, and Endomics for their valuable contributions. Finally, we sincerely appreciate the valuable comments and support of Director Eunhye Shim (Ministry of Health and Welfare), Deputy Director Hee-Jeong Park (Ministry of Health and Welfare), Assistant Director Seungho Hong (Ministry of Health and Welfare), Division Director Namsoo Byeon (Korea Health Information Service), General Manager Jong-Duck Kim (Korea Health Information Service), Section Manager Seungwon Jung (Korea Health Information Service), Professor Murim Choi, Ph.D. (Department of Biomedical Science, Seoul National University; Steering committee of KOGO 2023), Je-Kyung Seong, D.V.M., M.S., Ph.D. (College of Veterinary Medicine, Seoul National University; Steering committee of KOGO 2023), Woong-Yang Park, M.D., Ph.D. (Sungkyunkwan University, College of Medicine; President of KOGO 2023), and Professor Eui Kyu Chie, M.D., Ph.D. (Department of Radiation Oncology, Seoul National University Hospital).