Safe Utilization and Sharing of Genomic Data: Amendment to the Health and Medical Data Utilization Guidelines of South Korea

Article information

Cancer Res Treat. 2024;56(4):1027-1039
Publication date (electronic) : 2024 June 7
doi : https://doi.org/10.4143/crt.2024.146
1Department of Medical Sciences, Graduate School of The Catholic University of Korea, Seoul, Korea
2Department of Medical Informatics, College of Medicine, The Catholic University of Korea, Seoul, Korea
3Department of Physiology, Ajou University School of Medicine, Suwon, Korea
4Center for Precision Medicine, Seoul National University Hospital, Seoul, Korea
5Department of Genomic Medicine, Seoul National University Hospital, Seoul, Korea
6Department of Life Science, Dongguk University, Seoul, Korea
7Department of Precision Medicine and Big Data, The Catholic University of Korea, Seoul, Korea
8Precision Medicine Research Center, College of Medicine, The Catholic University of Korea, Seoul, Korea
9Cancer Evolution Research Center, College of Medicine, The Catholic University of Korea, Seoul, Korea
10CMC Institute for Basic Medical Science, The Catholic Medical Center of The Catholic University of Korea, Seoul, Korea
Correspondence: Dongwan Hong, College of Medicine, The Catholic University of Korea, 222 Banpo-daero, Seocho-gu, Seoul 06591, Korea Tel: 82-2-3147-8424 Fax: 82-2-2258-7749 E-mail: dwhong@catholic.ac.kr
*Hyojeong Park and Jongkeun Park contributed equally to this work.
Received 2024 February 10; Accepted 2024 June 3.

Abstract

Purpose

In 2024, medical researchers in the Republic of Korea were invited to amend the health and medical data utilization guidelines (Government Publications Registration Number: 11-1352000-0052828-14). This study aimed to show the overall impact of the guideline revision, with a focus on clinical genomic data.

Materials and Methods

This study amended the pseudonymization of genomic data defined in the previous version through a joint study led by the Ministry of Health and Welfare, the Korea Health Information Service, and the Korea Genome Organization. To revise the previous version, we held three conferences with four main medical research institutes and seven academic societies. We also conducted two surveys targeting genome experts in academia, industry, and research institutes.

Results

We found that cases of pseudonymization in the application of genome data were rare and that there was ambiguity in the terminology used in the previous version of the guidelines. Most experts (more than approximately 90%) agreed that the ‘reserved’ condition should be eliminated to make genomic data available after pseudonymization. In this study, the scope of genomic data was defined as clinical next-generation sequencing data, including FASTQ, BAM/SAM, VCF, and medical records. Pseudonymization targets genomic sequences and metadata, which embed specific elements such as germline mutations, short tandem repeats, single-nucleotide polymorphisms, and identifiable data (for example, ID or environmental values). Expression data generated from multi-omics can be used without pseudonymization.

Conclusion

This amendment will not only enhance the safe use of healthcare data but also promote advancements in disease prevention, diagnosis, and treatment.

Introduction

In recent years, the medical field has developed technological innovations such as artificial intelligence for healthcare services. The development of high-performance computing systems and next-generation sequencing (NGS) technologies has revolutionized diagnostics and disease causation analysis in healthcare, leading to the accumulation of massive amounts of genomic data and global drug development. The annual growth rate of genome data generation is predicted to reach 16.85% between 2022 and 2030 [1]. Advanced countries in the medical field continue to construct and share healthcare databases, including genomic data. They are also developing standards for responsible data utilization through international consortia such as GA4GH [1].

The representative international genome projects include ‘All of Us’ in the United States [2-4] and ‘UK Biobank’ in Europe [5-7]. By collecting the metadata and genome sequencing data of diverse ethnic and healthy donors, the projects built up big data that were publicly available to researchers worldwide. These data are widely used in various fields of science, medicine, and industry. These international efforts have significantly influenced national genome data construction and utilization strategies in the Republic of Korea [8]. To deposit one million Korean whole-genome sequences, the Korean government launched the National Project of Bio Big Data, which involves multi-ministerial collaboration, including the Ministry of Health and Welfare, the Ministry of Science and ICT, and the Ministry of Trade, Industry, and Energy. This project aimed to deposit clinical and genomic data, including whole-genome sequences of one million Koreans, in the first stage (between 2024 and 2028) and beyond. It also includes data across a broad spectrum of individuals, including the general population and patients with rare diseases, severe illnesses, and chronic conditions. It aimed to contribute to disease prediction, diagnosis, personalized and precise medicine, and public health enhancement. There is an urgent need for the domestic implementation of safe utilization environments for the public release of clinical and genomic data.

Targeted genome sequencing data (whole-exome sequencing or whole-genome sequencing) have been generated in hospitals and medical institutions using NGS and can be categorized by purpose as clinical or research data. Genomic data for research purposes are provided after review by an Institutional Review Board (IRB) in accordance with the Bioethics Law, and the data provided should be stored separately from identifiable personal information. Clinical genomic data governed by the Personal Information Protection Act (PIPA) have been utilized with comprehensive consent forms; since the 2020 amendment to the Act, special cases concerning pseudonymized information have applied (S1 and S2 Tables) [9]. This allowed the use of data without consent through pseudonymization, addressing the previously deferred use owing to the risks of personal identification in genomic data. This paper proposes a method for securely using previously deferred genomic data, thereby enabling their application.

In this study, we introduce the overall processes and research results of amending healthcare data utilization guidelines to enable the safe and proactive use of genomic data. This study was conducted under the auspices of the Korea Health Information Service (KHIS) and the Korea Genome Organization (KOGO). While the Health and Medical Data Utilization Guidelines were being amended, consultations and surveys were conducted with expert groups and relevant academic, medical, and industrial societies. This study proposes pseudonymization methods for metadata and genomic sequence information, with a focus on diagnostic NGS data. The amendment includes utilization strategies for omics-level data across industries, hospitals, academia, and research. The amended guidelines provide clear directions for pseudonymizing genomic data, managing personally identifiable genetic information, and sharing and utilizing data. The amended guidelines can also be expected to not only play a significant role in Korean-customized medical services, disease prevention, and treatment research but also contribute to the National Project of Bio Big Data, promoting the active utilization of genomic data.

Materials and Methods

1. Genome experts conference meeting

The project was conducted as joint research between KHIS and KOGO. The main researchers were experts with over 10 years of experience in clinical genomics research.

In this study, we invited representatives from each academic society to a meeting to reflect the opinions of experts on genome data utilization. We consulted at least two experts from each of the Korean Cancer Association (KCA), KOGO, Korean Society for Bioinformatics (KSBI), Korean Society of Medical Genetics and Genomics (KSMG), Korean Society of Medical Informatics (KSMI), Korean Society of Pathologists (KSP), and Korean Society for Laboratory Medicine (KSLM) who specialize in genome data analysis and management. Additionally, to incorporate the viewpoints of the corporate sector, we invited four companies: sequencing and direct-to-consumer analysis companies and drug target discovery companies. The main researchers were provided with weekly agendas covering topics such as the scope of data disclosure, use of data without consent, survey design, and a secure network environment (S3 Fig.).

Based on this study, we developed an amended guideline for safely utilizing genomic data. In the initial phase, meetings focused on investigating insights from representatives of industry, academia, and governmental institutions. In the second stage, advice from medical research institutes utilizing genomic data was analyzed. In the third and final stage, the opinions of specialists working on genomic data analysis at hospitals and medical centers were reflected (S3 Fig.). Subsequently, in the expert advisory meetings, five major issues were discussed: (1) the scope of genomic data disclosure; (2) the extent of information utilization by file; (3) the scope of utilization of omics data, mainly expression data; (4) a secure environment following pseudonymization; and (5) the process of exporting files (S3 Fig.). After several rounds of detailed study, and owing to limitations in data utilization, the scope of genome data covered by the study was restricted to VCF, BAM/SAM, and FASTQ files and genome information stored in medical records.

2. Surveys for utilization of genome data

To identify issues with the previous guidelines and investigate safe methods for utilizing genomic data, we conducted two surveys with experts in the field (S4 Fig.). The first survey targeted 52 experts from various institutions, including hospitals (24%), universities (38%), research institutes (10%), and companies (28%). This survey focused on aspects such as analysis, storage, backup, safety measures, public disclosure of genomic data, and understanding of the legislation related to pseudonymization. The second survey, involving 19 participants, assessed the need to amend the health and medical data utilization guidelines. Specifically, this survey aimed to identify the general characteristics associated with NGS results in patient data and records held by medical institutions.

3. Investigation of personally identifiable genetic information for pseudonymization

We approached our research from various perspectives to extract elements for personal identification from genomic data. Raw NGS sequencing data, mapping files, and mutation calling data generated in the genome data analysis pipeline were studied to identify the factors that can be pseudonymized. Depending on the sequencing instrument, multiple pipelines can generate different types of files (S5 Table). The selection of elements to be pseudonymized was discussed through extensive meetings with experts and reflected in the guidelines. These files can be of various types; however, their contents can be broadly classified into genetic sequences and metadata information.

1) Investigation of personally identifiable genetic information elements in nucleic acid sequences

From the genome data, it was possible to identify inherent personal features within the nucleic acid sequences. Germline mutations, short tandem repeats (STRs), and single nucleotide polymorphisms (SNPs) were selected as factors that could identify individuals.

The STR database provided by the National Institute of Standards and Technology (NIST) under the U.S. Department of Commerce was referenced. Based on surveys and repeated discussions, we decided to define common variants as those with a minor allele frequency (MAF) of more than 1% and rare variants as those with an MAF of less than 1%. To determine rare variants, we recommend using the Korea Biobank Array Project [10], Korean Genome and Epidemiology Study (KoGES) project [11], Ulsan Genome Project [12], and gnomAD East Asian data.
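The 1% cutoff can be illustrated with a minimal Python sketch. The function name, the example frequencies, and the handling of variants missing from the reference panel are assumptions made for illustration; treating a value of exactly 1% as rare is also an assumption, chosen to be conservative.

```python
from typing import Optional

RARE_MAF_THRESHOLD = 0.01  # the 1% cutoff described in the text


def classify_variant(maf: Optional[float],
                     threshold: float = RARE_MAF_THRESHOLD) -> str:
    """Label a variant 'rare', 'common', or 'unknown' from its MAF.

    Assumption: a frequency exactly equal to the threshold is treated as rare,
    so that borderline variants are handled conservatively.
    """
    if maf is None:
        # Variant absent from the chosen reference panel.
        return "unknown"
    if not 0.0 <= maf <= 1.0:
        raise ValueError(f"allele frequency out of range: {maf}")
    return "rare" if maf <= threshold else "common"


# Hypothetical frequencies, not taken from any real database.
example_mafs = {"variant_a": 0.25, "variant_b": 0.004, "variant_c": None}
labels = {name: classify_variant(maf) for name, maf in example_mafs.items()}
print(labels)
```

In practice the frequency would be looked up in one of the reference resources named above, and variants labeled rare or unknown would be candidates for deletion or replacement.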

2) Investigation of personally identifiable genetic information in metadata

Several files (e.g., FASTQ, BAM/SAM, VCF, and the reporting file) of the previously mentioned genomic data contain metadata in addition to nucleic acid sequence information. Metadata such as the identification number, specific instrument number, and analysis environment values can be introduced as personally identifiable genetic information while the analysis pipeline runs. Therefore, the guidelines call for checking personally identifiable genetic information within the metadata of each file. To understand the information generated by each file, the following resources were investigated: the HTSlib/SAMtools documentation and the VCF v4.2 specification (see Data availability).

4. Data availability

The analytical tools and resources used for this study are publicly accessible. Details for access are as follows: Information and resources related to the Korea Biobank Array Project can be found at https://www.koreanchip.org/project. The Genome Aggregation Database (gnomAD), offering an extensive range of human genetic variation insights, is available at https://gnomad.broadinstitute.org/. Documentation and updates for HTSlib/SAMtools can be found at https://www.htslib.org/doc/samtools.html. The Variant Call Format (VCF) version 4.2 specification, a standard format for the representation of sequence variations, can be accessed at https://samtools.github.io/hts-specs/VCFv4.2.pdf.

Results

1. Amendment of the guidelines for the safe use of health and medical data

The amendment of the “Health and Medical Data Utilization Guidelines” focuses on safely utilizing genomic and omics data. To achieve this, we examined domestic laws and prepared amendments based on the previous version of the guidelines (2022). Previously, the guidelines stipulated that genomic data could not be used without the individual’s consent, and in cases of non-consent, only localized data were provided. In the previous guidelines, genomic data were divided into ‘genomic information’ and ‘omics information, excluding the genome.’ For genomic information, it was possible to provide information on larger genetic units instead of specific mutation details or to offer novel mutation information (excluding germline mutations). No special measures were required for omics information, excluding the genome, because recovery was considered impossible. However, genomic information can still be recovered from some omics data, such as transcriptome data (e.g., cDNA). Based on these guidelines, we prepared amendments to address concerns regarding personal identification, interpretation uncertainties, and wording ambiguities, thereby expanding safe utilization (Table 1, S6 Table).

Table 1. Original text and amended genomic data in “Guidelines for Utilization of Healthcare Data”

The amended guidelines accounted for the inability to interpret genomic data fully, reducing the risk of data identification. Furthermore, genomic data can contain information about third parties, including parents, siblings, and other family members, necessitating caution. These guidelines do not apply to human-derived materials collected and processed through research or donation consent. In addition, there are three recommendations for genomic data. First, the guidelines advise classifying genomic data files into genomic sequence information and non-sequence information (metadata) and recommend replacing or deleting high-risk personally identifiable genetic information. For instance, in typical genomic data files such as BAM/SAM and VCF, replacing or deleting high-risk information, such as germline mutations, STRs, and SNPs, is recommended. For metadata, personally identifiable genetic information and specific information (e.g., ID or environmental variables; file/directory name of analysis server, etc.) should be appropriately deleted or replaced. Second, special caution is required when utilizing raw data files such as FASTQ, which hold personally identifiable genetic information and can reveal individual genomic details through various analyses (e.g., mapping and BLAST search). Obtaining consent from the subject is advised, particularly when utilizing the FASTQ files derived from NGS-based genetic testing in medical institutions. This precaution is crucial because detailed individual genomic information can be exposed when DNA sequence data are extracted from FASTQ files to generate SAM/BAM/VCF files. Finally, matrix data with expression values for omics information (transcriptome, metabolome, proteome, etc.) do not require special measures. The matrix data are created as unidentifiable data during gene expression measurements using anonymized sample IDs. The deletion or replacement of personally identifiable genetic information, identifiable information, and specific information is recommended because of the existing risk of identification.

Despite these efforts, the complete interpretation of genomic data remains challenging, limiting risk reduction. Therefore, it is necessary to consider access control, physically secure environments, and risk assessment. Information with identifiability may need to be excluded, considering the time, cost, risk, and technical level required for pseudonymization. Alternatively, additional reviews or consent from the data subjects may be required depending on the purpose and files used.

These guidelines provide a basis for utilizing genomic data. This is an important foundation for balancing the protection and utilization of genomic data, which will contribute to safely utilizing genomic data and improving medical research quality.

2. Utilization of genomic data

In this study, we investigated the categorization and legal regulations of genomic data produced by medical institutions. Within this legal framework, our research aimed to present guidelines that enhance the usability of clinical data by minimizing personally identifiable genetic information while complying with legal and ethical standards. The data can be categorized into clinical and research data depending on whether they are deposited for treatment or research purposes. A thorough investigation was conducted regarding the laws and regulations related to these data, particularly focusing on replacing and deleting personally identifiable genetic information.

First, clinical data refer to genomic information collected from the pathology and diagnostic departments for diagnosing patient diseases. In this context, the file formats generated through NGS included FASTQ, BAM/SAM, VCF, and genomic information from medical records (Fig. 1A). Conversely, research data were generated for research purposes after obtaining a consent form for human-derived material donations from the patients. This also produced FASTQ, BAM/SAM, and VCF files through NGS analysis and is utilized based on the Bioethics Law [13]. Clinical and research data are subject to various domestic laws. Moreover, the purpose of these laws and personally identifiable genetic information varied (Fig. 1B, S7 and S8 Tables).

Fig. 1.

Workflow of Genomic Data Production in Medical Institutions for Diagnosis and Research Purposes. (A) Utilization of genomic data generated for medical purposes. Clinical data include next-generation sequencing (NGS)–based data such as FASTQ, BAM/SAM, and VCF collected through the pathology department and diagnostic laboratory of the hospital, as well as genomic data included in medical records. The institution responsible for storing clinical data performs pseudonymization in accordance with the principle of minimal information before providing data, as required by the Personal Information Protection Act (PIPA). For pseudonymized genomic data, the recipient (researcher) undergoes an evaluation of the suitability of the research through the institution’s DRB, after which the genomic data are utilized. EMR, electronic medical record. (B) Utilization of genomic data generated for research purposes. Research data are generated by researchers, using techniques such as NGS, from samples (serum, plasma, chromosomes, DNA, protein, etc.) collected for human-derived material research. To utilize human-derived material, samples are obtained through informed consent from the sample providers. The provided samples are subject to anonymization under the Bioethics Act through the Human-Derived Material Bank, ensuring that personally identifiable genetic information is concealed from anyone. For samples that have undergone anonymization, the suitability of the recipient’s (researcher’s) research is evaluated by the IRB. Subsequently, the samples are used to produce the necessary genomic data, including NGS data, for research purposes.

The scope of pseudonymization under the PIPA differs from the scope of anonymization permissible under the Bioethics and Safety Act. Therefore, we pseudonymized the clinical data in accordance with the principle that only a minimum amount of personally identifiable information should be collected and that such information should be removed or substituted to make it difficult to identify specific individuals. Pseudonymized data retain information and can be linked with other data if necessary. In contrast, anonymizing research data involves permanently deleting personally identifiable genetic information or substituting it with a unique identifier from the institution, making it impossible to recognize specific individuals. While the complete removal of personally identifiable genetic information offers a higher level of privacy protection, it significantly reduces the usefulness of the data. Each data type is utilized after evaluation of its research purposes and safety by the institution’s review boards, and use outside these purposes is subject to legal sanctions (Fig. 2).

Fig. 2.

Pseudonymization makes it difficult to identify individuals from data while preserving the ability to link the data back to personally identifiable information. Replacing a person’s actual name with a different code or identifier is an example of pseudonymization. The data can be analyzed, but it becomes challenging to identify specific individuals. Anonymization involves data being transformed to the extent that it is no longer associated with the original individuals. Therefore, it becomes nearly impossible to identify or track individuals through anonymized data. Anonymization is used to enhance personal information protection and data privacy. In simple terms, pseudonymization obscures personal information in a way that makes identification difficult, while anonymization completely removes personal information, making individual identification impossible.
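The distinction can be made concrete with a minimal Python sketch. The function names are hypothetical, the pseudonym pattern merely imitates the codes shown in Fig. 4, and the use of random tokens for anonymization is an assumption; the guidelines do not prescribe a particular coding scheme.

```python
import secrets
from typing import Dict, List, Tuple


def pseudonymize(ids: List[str], prefix: str = "CMT") -> Tuple[List[str], Dict[str, str]]:
    """Replace each ID with a code and return the key table.

    The key table stays with the data-holding institution, so records can be
    re-linked if necessary. The code pattern is an assumed example.
    """
    key_table: Dict[str, str] = {}
    for original in ids:
        if original not in key_table:
            key_table[original] = f"{prefix}{len(key_table) + 1:03d}_00"
    return [key_table[original] for original in ids], key_table


def anonymize(ids: List[str]) -> List[str]:
    """Replace each ID with a random token and keep no mapping.

    Because nothing links a token to its source ID, re-linking is not possible.
    """
    return [secrets.token_hex(8) for _ in ids]


sample_ids = ["S2301217T_20231029", "HD753_S1", "S2301217T_20231029"]
codes, key_table = pseudonymize(sample_ids)
print(codes)                  # the same original ID always receives the same code
print(anonymize(sample_ids))  # random tokens, different on every run
```

The key table is what separates the two approaches: whoever holds it can reverse a pseudonym, which is why it would need to be stored apart from the released data.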

3. Key personal identifiers in genomic data based on nucleotide sequence information

In the meeting, we defined the scope of utilizing genomic data. We also examined personally identifiable genetic information within each file type. The following personally identifiable genetic information is advised for replacement or deletion (Fig. 3A, S1 and S2 Tables).

Fig. 3.

Identification of unique personally identifiable genetic information within genomic sequence. Genomic data (VCF and SAM) consist of metadata (A) and genomic sequence data (B) within the files, with pseudonymization elements included in the composition. (A) Metadata in VCF and SAM files typically include information such as patient identifiers (patient ID, pathology information, etc.), file names, next-generation sequencing (NGS) analysis institutions, data production affiliations, NGS device names, and more. (B) Genomic sequence data can potentially lead to individual identification through information such as germline mutations, short tandem repeats (STRs), and rare single nucleotide polymorphisms (SNPs). For example, within a dbSNP-annotated VCF file, the “INFO” column contains fields such as SAO (variant allele origin) that indicate whether a germline mutation is present: SAO=0 signifies unspecified origin, SAO=1 germline, SAO=2 somatic, and SAO=3 both. Unless the research specifically focuses on germline mutations, this information can lead to individual identification. Additionally, forensic STR databases provided by the National Institute of Standards and Technology (NIST) under the U.S. Department of Commerce can enable individual identification. By extracting information from chromosome locations in VCF and BAM/SAM files, individual identification becomes feasible. In VCF files, rare SNPs specific to Asians or Koreans can be extracted using data from gnomAD. The combination of such rare SNPs can facilitate individual identification. (C) Other omics data such as transcriptomics, proteomics, and metabolomics in files containing expression values are less likely to lead to individual identification due to their nature. Please note that the potential for individual identification exists in genomic data, especially in certain circumstances, and it is essential to handle such data with privacy and security precautions.

Genetic mutations in germ cells can be inherited by offspring. In tumor sequencing, mutations may encompass not only tumor-specific mutations, but also preexisting mutations within patients. Such data must be handled carefully to prevent individual identification. For instance, the mitochondrial DNA mutation “m.16093A>G” signifies a change from adenine (A) to guanine (G), potentially indicating genetic connections among maternal relatives due to its inheritance through the mother’s lineage. This mutation may also imply a risk for mitochondrial diseases. Considering the challenges of completely removing germline mutations due to the limitations of bulk sequencing, protective measures such as masking or pseudonymization become crucial to minimize the exposure of sensitive genetic information (Fig. 3B).

STRs are DNA segments in which a sequence of 2 to 7 base pairs is repeated. They are widely used in genetic profiling, criminal investigations, paternity tests, ancestral research, and more, because the variation in length and repeat count per individual allows for personal identification. Although identifying individuals solely based on STR information is challenging, combining STRs with SNPs, population statistics, or other identifiable information can significantly enhance the likelihood of personal identification. For example, a previous study used two datasets for 872 individuals: 642,563 genome-wide SNPs and 13 STRs used in forensic applications. The results indicate that ~90%-98% of forensic STR records can be matched to the corresponding SNP records, and the accuracy increases to ~99%-100% when approximately 30 STRs are used [14]. Accordingly, the personal identification markers provided by US NIST were identified, and information posing a risk of personal identification when combined with other data should be removed (Fig. 3B).

SNPs are crucial variations within the human genome and play a significant role in understanding genetic differences related to race, physical appearance, and diseases. Variants with an alternative allele frequency of less than 1% were categorized as ‘rare variants.’ The development of the Korean chip in 2015, which comprises ~830,000 probes specifically designed for the Korean population, has improved the accuracy of determining the frequency of these SNPs. It is essential to recognize that combinations of SNPs or haplotype structures can be used for individual identification. For instance, in most major populations, it is estimated that only 45 SNPs are required to match two sets of genetic data to a unique individual, and only 300 SNPs are necessary to identify any individual uniquely [15]. Additionally, a panel of 52 SNPs has been approved for forensic use in several European countries (Fig. 3B) [16-18]. Moreover, omics data, including transcriptomic, proteomic, and metabolomic data, have expression data that are considered unstructured and devoid of personally identifiable genetic information. This data type can be utilized safely without concerns for personal identification, enabling researchers to leverage this information with reduced privacy risks (Fig. 3C).

We investigated specific instances of personally identifiable genetic information in actual files, demonstrating that these can be removed through parameter adjustments [19-21]. For instance, in a VCF file, if a ‘COMMON’ field is present, its value (typically 0 or 1) indicates whether a variant is common (i.e., frequent in the population) or rare. ‘COMMON=0’ could mean that the variant is rare. To verify the genetic frequency in East Asians, we considered a variant in the VCF sequence information section to be rare if the MAF for East Asians is ≤ 1% in the dbSNP database and removed such variants (Fig. 4A). In the case of FASTQ and BAM/SAM files, information on the analyzing organization, the corresponding file path, pathology information, etc., may remain; therefore, deletion is necessary. For example, in BAM/SAM files, the patient and specimen information in the @RG ID, LB, and SM tags and the program information and execution paths in @PG must be removed or replaced (Fig. 4B).
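The VCF handling described above can be sketched in plain Python without external libraries. This is a minimal illustration, not the procedure used in the study: the function names are hypothetical, the example records are invented, and it assumes the file carries dbSNP-style SAO and COMMON annotations in the INFO column.

```python
from typing import Dict, List


def parse_info(info_field: str) -> Dict[str, object]:
    """Parse a VCF INFO column into a dict; flag-only entries map to True."""
    info: Dict[str, object] = {}
    if info_field == ".":
        return info
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def is_high_risk(info: Dict[str, object]) -> bool:
    """Flag records annotated as germline (SAO=1) or rare (COMMON=0)."""
    return info.get("SAO") == "1" or info.get("COMMON") == "0"


def pseudonymize_vcf(lines: List[str], sample_map: Dict[str, str]) -> List[str]:
    """Rename sample columns and drop high-risk records.

    '##' meta lines are passed through unchanged in this sketch; in real files
    they can contain paths or institution names and would need review as well.
    """
    out: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            out.append(line)
        elif line.startswith("#CHROM"):
            cols = line.split("\t")
            # Sample names start at the tenth column of the header line.
            cols[9:] = [sample_map.get(name, name) for name in cols[9:]]
            out.append("\t".join(cols))
        else:
            cols = line.split("\t")
            if len(cols) >= 8 and is_high_risk(parse_info(cols[7])):
                continue
            out.append(line)
    return out


# Invented records for illustration; sample IDs follow the pattern in Fig. 4.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
    "\tS2301217N_20231029\tS2301217T_20231029",
    "chr1\t100\trs1\tC\tT\t50\tPASS\tSAO=1;COMMON=1\tGT\t0/1\t0/1",
    "chr1\t200\trs2\tG\tA\t50\tPASS\tSAO=2;COMMON=1\tGT\t0/0\t0/1",
    "chr2\t300\trs3\tA\tG\t50\tPASS\tCOMMON=0\tGT\t0/1\t0/1",
]
sample_map = {
    "S2301217N_20231029": "CMN001_00",
    "S2301217T_20231029": "CMT001_00",
}
for line in pseudonymize_vcf(example, sample_map):
    print(line)
```

Dropping a whole record is the simplest option; where a variant is needed for the research purpose, the guidelines also allow replacement instead of deletion, which this sketch does not attempt.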

Fig. 4.

Pseudonymization elements and practical application for each file type. (A) In the VCF file, metadata information was pseudonymized by replacing the original sample IDs, S2301217N_20231029 and S2301217T_20231029, with pseudonyms CMN001_00 and CMT001_00, respectively. Within the genomic sequence data of the VCF file, elements indicating germline status and SAO=1 were removed. (B) In the BAM/SAM file, metadata information was pseudonymized by replacing the original sample ID, HD753_S1, with the pseudonym CMT001_00. Additionally, information about the testing institution and equipment used by that institution was pseudonymized: PU:MISEQ was replaced with PU:NGS. Within the genomic sequence data of the BAM/SAM file, germline mutations (T) were either replaced with reference sequences (C) or removed if it was challenging to pseudonymize while ensuring data integrity. Any elements that could potentially lead to individual identification were removed when pseudonymization was not feasible. Pseudonymization elements are indicated in red font in the files.
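The header changes in panel (B) can be sketched as a small Python function operating on SAM header lines. The function name is hypothetical, the example header (tool version and file paths) is invented, and two choices are assumptions rather than part of the guidelines: the CN (sequencing center) tag is dropped as one way to hide the testing institution, and the CL (command line) tag is dropped from @PG because it can embed execution paths. The sketch also assumes a single read group per file.

```python
from typing import List


def scrub_sam_header(header_lines: List[str], pseudonym: str,
                     platform_unit: str = "NGS") -> List[str]:
    """Pseudonymize @RG tags and strip command lines from @PG header lines.

    Assumes one read group per file: every @RG ID is set to the same pseudonym,
    which would create duplicate IDs if several read groups were present.
    """
    out: List[str] = []
    for raw in header_lines:
        line = raw.rstrip("\n")
        fields = line.split("\t")
        if fields[0] == "@RG":
            new_fields = ["@RG"]
            for field in fields[1:]:
                tag = field[:2]
                if tag in ("ID", "SM", "LB"):
                    # Patient and specimen identifiers become the pseudonym.
                    new_fields.append(f"{tag}:{pseudonym}")
                elif tag == "PU":
                    new_fields.append(f"PU:{platform_unit}")
                elif tag == "CN":
                    # Sequencing center name: dropped (assumed choice).
                    continue
                else:
                    new_fields.append(field)
            out.append("\t".join(new_fields))
        elif fields[0] == "@PG":
            # CL holds the executed command line, which may contain paths.
            out.append("\t".join(f for f in fields if not f.startswith("CL:")))
        else:
            out.append(line)
    return out


# Invented header for illustration; the sample ID follows Fig. 4.
example_header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@RG\tID:HD753_S1\tSM:HD753_S1\tLB:HD753_S1\tPL:ILLUMINA\tPU:MISEQ",
    "@PG\tID:bwa\tPN:bwa\tVN:0.7.17\tCL:bwa mem /data/patients/HD753_S1/ref.fa reads.fq",
]
scrubbed = scrub_sam_header(example_header, "CMT001_00")
for line in scrubbed:
    print(line)
```

Only header lines are handled here. Alignment records that refer to the original read-group ID through their RG:Z tag would have to be updated to the new ID as well, otherwise the header and the records would no longer agree.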

Discussion

The Health Care Data Utilization Guidelines have been amended to enhance the utilization of genomic data. The primary objective of this amendment was to overcome limitations arising from phrases in the guidelines that required further clarification and applicable examples, and to propose safer methods for data utilization.

This study investigated anonymization/pseudonymization practices in several foreign countries and conducted research tailored to the domestic situation to enable researchers to safely utilize genomic data (S9-S13 Tables).

This study sought safe pseudonymization methods for genomic data to convey the minimum necessary information. Information providers and recipients must request and process the correct file formats suitable for their research purposes. Genomic data may contain various personally identifiable genetic information that should be replaced or deleted if they do not align with the research objectives. Special attention is needed for germline mutations, STRs, and rare SNPs. The criterion for rare SNPs in this guideline is set as an MAF of less than 1%, which can change depending on the research purpose [10].

We investigated germline mutations that could uniquely identify Koreans by examining Korean-specific rare germline mutations [22]. To study Korean germline mutations, we utilized the Korea4K database, and for examining mutations in East Asians, we used data from gnomAD. As a result, rare Korean variants, present in less than 1% of the population, accounted for 77% of our dataset. When compared with East Asians, 75% of these variants were found to be unique (http://honglab.catholic.ac.kr/cmm/fms/FileDown.do?atchFileId=FILE_000000000000462&fileSn=0) (S14 Fig.). Therefore, if the study’s focus is not specifically on Korean germline research, we suggest deleting or replacing the presented rare germline mutations.

We recommend using secure networks for data transfer in cases where the use of analysis tools is limited or adequate safety measures are not in place. We also suggest managing data within secure networks and, when data are used externally, disclosing only pseudonymized VCF files containing minimal personal information. Because nucleic acid sequences in their entirety can be inferred from BAM/SAM files, we suggest analyzing them within a secure network and exporting only the results. The use of raw FASTQ data, whose nucleic acid sequences contain personally identifiable genetic information, should be limited (S15 Fig.). Data-providing institutions should evaluate the appropriateness of research purposes through independent institutional review processes. Any file type that aligns with the research can be used once the research has passed review. At the same time, these institutions should retain the authority and obligation to refuse requests that do not match the research purpose. Where safety measures are inadequate, recent technologies such as homomorphic encryption or blockchain can be used to securely process and analyze personal information in genomic data [23,24]. However, these technologies are operationally limited and require further investigation.

Despite extensive discussions over multiple meetings, disagreement remains regarding the scope and methods of data utilization among the participating institutions handling genomic data (hospitals, universities, medical research institutes, companies, etc.), mainly because of the risks of personal information leakage and identifiability. Fear arising from a lack of legal understanding and uncertainty about liability for damages appear to be the main reasons. Although this study focused on clinical genomic data, it clarified that genomic data produced for research purposes can also be provided and utilized following medical research ethics reviews. To build a similar understanding at the national level, efforts such as promotional activities by academic societies, awareness training, and public hearings are required. Additionally, data recipients must be aware that acts aimed at identifying specific individuals are strictly prohibited under the PIPA, and violations can result in serious legal responsibilities. Subsequent research may consider including other file formats, such as compressed reference-oriented alignment map (CRAM) files. Regular reviews and continuous updates to the guidelines are necessary, and ongoing participation of and feedback from experts in various fields are crucial in this process.

South Korea is currently driving innovation in the medical field by establishing the National Project of Bio Big Data. This study provides guidelines not only to address the issue of data fragmentation but also to balance data utilization with the protection of individual privacy. These guidelines include secure pseudonymization methods for genomic data files, focusing on maximizing data usability while minimizing the risk of identification. Furthermore, this study emphasizes the need to continually amend and refine the guidelines through expert feedback based on legal and ethical standards to enhance the use of genomic data in research and industrial domains. This research is expected to contribute to the National Project of Bio Big Data, medical innovation, and research on disease prevention, diagnosis, and treatment through the safe utilization of genomic data.

Notes

Author Contributions

Conceived and designed the analysis: Park H, Park J, Woo HG, Yoon H, Lee M, Hong D.

Collected the data: Park H and Hong D.

Contributed data or analysis tools: Park H, Park J, Hong D.

Performed the analysis: Park H, Park J, Woo HG, Yoon H, Lee M, Hong D.

Wrote the paper: Park H, Park J, Hong D.

Conflicts of Interest

Conflict of interest relevant to this article was not reported.

Acknowledgements

This work was supported in part by grants from the National Research Foundation of Korea (NRF) funded by the Korea government (Ministry of Science and ICT) (NRF-2021M3H9A2097227, NRF-2022R1A2C3008162, and RS-2023-00220840), the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (RS-2023-00265923), and the Basic Medical Science Facilitation Program through the Catholic Medical Center of the Catholic University of Korea funded by the Catholic Education Foundation. We thank the Global Science experimental Data hub Center (GSDC) and the Korea Research Environment Open NETwork (KREONET) service for data computing and network provided by the Korea Institute of Science and Technology Information (KISTI). The authors also thank the Korea Health Information Service (KHIS), the Korean Cancer Association (KCA), the Korean Genomics Society (KOGO), the Korean Society for Bioinformatics (KSBI), the Korean Society of Medical Genetics and Genomics (KSMG), the Korean Society of Medical Informatics (KSMI), the Korean Society of Pathologists (KSP), and the Korean Society for Laboratory Medicine (KSLM). We also extend our gratitude to Macrogen, Theragen Etex, Enzychem Lifesciences, and Endomics for their valuable contributions. Finally, we sincerely appreciate the valuable comments and support of Director Eunhye Shim (Ministry of Health and Welfare), Deputy Director Hee-Jeong Park (Ministry of Health and Welfare), Assistant Director Seungho Hong (Ministry of Health and Welfare), Division Director Namsoo Byeon (Korea Health Information Service), General Manager Jong-Duck Kim (Korea Health Information Service), Section Manager Seungwon Jung (Korea Health Information Service), Professor Murim Choi, Ph.D. (Department of Biomedical Science, Seoul National University; Steering committee of KOGO 2023), Je-Kyung Seong, D.V.M., M.S., Ph.D. (College of Veterinary Medicine, Seoul National University; Steering committee of KOGO 2023), Woong-Yang Park, M.D., Ph.D. (Sungkyunkwan University, College of Medicine; President of KOGO 2023), and Professor Eui Kyu Chie, M.D., Ph.D. (Department of Radiation Oncology, Seoul National University Hospital).

References

1. Genomics market (by product and service: systems & software, consumables, services; by technology: sequencing, microarray, PCR, nucleic acid extraction and purification and others; by application: diagnostic application, drug discovery and development, agriculture and medical research and precision medicine and other; by end user: research institute, hospital and clinic, and others) - global industry analysis, size, share, growth, regional outlook and forecast 2023 to 2032 [Internet]. Ottawa: Precedence Research; c2023 [cited 2023 Feb 10]. Available from: https://www.precedenceresearch.com/genomics-market.
2. Ramirez AH, Gebo KA, Harris PA. Progress with the All of Us Research Program: opening access for researchers. JAMA 2021;325:2441–2.
3. All of Us Research Program Investigators, Denny JC, Rutter JL, Goldstein DB, Philippakis A, Smoller JW, et al. The “All of Us” Research Program. N Engl J Med 2019;381:668–76.
4. Ramirez AH, Sulieman L, Schlueter DJ, Halvorson A, Qian J, Ratsimbazafy F, et al. The All of Us Research Program: data quality, utility, and diversity. Patterns (N Y) 2022;3:100570.
5. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 2018;562:203–9.
6. Collins R. What makes UK Biobank special? Lancet 2012;379:1173–4.
7. Littlejohns TJ, Holliday J, Gibson LM, Garratt S, Oesingmann N, Alfaro-Almagro F, et al. The UK Biobank imaging enhancement of 100,000 participants: rationale, data collection, management and future directions. Nat Commun 2020;11:2624.
8. Lee B, Hwang S, Kim PG, Ko G, Jang K, Kim S, et al. Introduction of the Korea BioData Station (K-BDS) for sharing biological data. Genomics Inform 2023;21:e12.
9. Personal Information Protection Act [Internet]. Sejong: Korea Ministry of Government; 2023. [cited 2023 Feb 10]. Available from: https://www.law.go.kr/LSW//lsInfoP.do?lsiSeq=213857&chrClsCd=010203&urlMode=engLsInfoR&viewCls=engLsInfoR#0000.
10. Kim YJ, Moon S, Hwang MY, Han S, Jang HM, Kong J, et al. The contribution of common and rare genetic variants to variation in metabolic traits in 288,137 East Asians. Nat Commun 2022;13:6642.
11. Kim Y, Han BG; KoGES Group. Cohort profile: the Korean Genome and Epidemiology Study (KoGES) Consortium. Int J Epidemiol 2017;46:e20.
12. Jeon Y, Jeon S, Blazyte A, Kim YJ, Lee JJ, Bhak Y, et al. Welfare genome project: a participatory Korean Personal Genome Project with free health check-up and genetic report followed by counseling. Front Genet 2021;12:633731.
13. Enforcement Rule of Bioethics and Safety Act [Internet]. Sejong: Korea Ministry of Government; 2023. [cited 2023 Feb 10]. Available from: https://www.law.go.kr/LSW/lsInfoP.do?lsiSeq=98198&urlMode=engLsInfoR&viewCls=engLsInfoR#0000.
14. Edge MD, Algee-Hewitt BF, Pemberton TJ, Li JZ, Rosenberg NA. Linkage disequilibrium matches forensic genetic records to disjoint genomic marker sets. Proc Natl Acad Sci U S A 2017;114:5671–6.
15. Carter AB. Considerations for genomic data privacy and security when working in the cloud. J Mol Diagn 2019;21:542–52.
16. Phillips C, Prieto L, Fondevila M, Salas A, Gomez-Tato A, Alvarez-Dios J, et al. Ancestry analysis in the 11-M Madrid bomb attack investigation. PLoS One 2009;4:e6583.
17. Sanchez JJ, Phillips C, Borsting C, Balogh K, Bogus M, Fondevila M, et al. A multiplex assay with 52 single nucleotide polymorphisms for human identification. Electrophoresis 2006;27:1713–24.
18. Pakstis AJ, Speed WC, Fang R, Hyland FC, Furtado MR, Kidd JR, et al. SNPs for a universal individual identification panel. Hum Genet 2010;127:315–24.
19. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25:1754–60.
20. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20:1297–303.
21. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009;25:2078–9.
22. Jeon S, Choi H, Jeon Y, Choi WH, Choi H, An K, et al. Korea4K: whole genome sequences of 4,157 Koreans with 107 phenotypes derived from extensive health check-ups. Gigascience 2024;13:giae014.
23. Kuo TT, Jiang X, Tang H, Wang X, Bath T, Bu D, et al. iDASH secure genome analysis competition 2018: blockchain genomic data access logging, homomorphic encryption on GWAS, and DNA segment searching. BMC Med Genomics 2020;13(Suppl 7):98.
24. Raisaro JL, Gwangbae C, Pradervand S, Colsenet R, Jacquemont N, Rosat N, et al. Protecting privacy and security of genomic data in i2b2 with homomorphic encryption and differential privacy. IEEE/ACM Trans Comput Biol Bioinform 2018;15:1413–26.

Fig. 1.

Workflow of Genomic Data Production in Medical Institutions for Diagnosis and Research Purposes. (A) Utilization of genomic data generated for medical purposes. Clinical data include next-generation sequencing (NGS)–based data such as FASTQ, BAM/SAM, and VCF files collected through the pathology department and diagnostic laboratory of the hospital, as well as genomic data included in medical records. The institution responsible for storing clinical data performs pseudonymization in accordance with the principle of minimal information before providing data, as required by the Personal Information Protection Act (PIPA). For pseudonymized genomic data, the suitability of the recipient's (researcher's) research is evaluated by the institution's DRB, after which the genomic data are utilized. EMR, electronic medical record. (B) Utilization of genomic data generated for research purposes. Research data are data that researchers generate, using techniques such as NGS, from samples (serum, plasma, chromosomes, DNA, protein, etc.) collected for human-derived material research. Human-derived material samples are obtained with informed consent from the sample providers. The provided samples are anonymized under the Bioethics Act through the Human-Derived Material Bank, ensuring that personally identifiable genetic information is concealed from everyone. For anonymized samples, the suitability of the recipient's (researcher's) research is evaluated by the IRB. Subsequently, the samples are used to produce the necessary genomic data, including NGS data, for research purposes.

Fig. 2.

Pseudonymization makes it difficult to identify individuals from data while preserving personally identifiable information. Replacing a person's actual name with a different code or identifier is an example of pseudonymization. The data can be analyzed, but it becomes challenging to identify specific individuals. Anonymization involves transforming data to the extent that they are no longer associated with the original individuals. Therefore, it becomes nearly impossible to identify or track individuals through anonymized data. Anonymization is used to enhance personal information protection and data privacy. In simple terms, pseudonymization obscures personal information in a way that makes identification difficult, while anonymization completely removes personal information, making individual identification impossible.

Fig. 3.

Identification of unique personally identifiable genetic information within genomic sequence. Genomic data (VCF and SAM) consist of metadata (A) and genomic sequence data (B) within the files, with pseudonymization elements included in the composition. (A) Metadata in VCF and SAM files typically include information such as patient identifiers (patient ID, pathology information, etc.), file names, next-generation sequencing (NGS) analysis institutions, data production affiliations, NGS device names, and more. (B) Genomic sequence data can potentially lead to individual identification through information such as germline mutations, short tandem repeats (STRs), and rare single nucleotide polymorphisms (SNPs). For example, within the VCF file, the “info” column contains details like SAO (Sequence Ontology) that indicate whether a germline mutation is present. SAO=0 signifies somatic mutation, SAO=1 indicates germline mutation, and SAO=2 represents unknown status. Unless the research specifically focuses on germline mutations, this information can lead to individual identification. Additionally, forensic STR databases provided by the National Institute of Standards and Technology (NIST) under the U.S. Department of Commerce can enable individual identification. By extracting information from chromosome locations in VCF and BAM/SAM files, individual identification becomes feasible. In VCF files, rare SNPs specific to Asians or Koreans can be extracted using data from gnomAD. The combination of such rare SNPs can facilitate individual identification. (C) Other omics data such as transcriptomics, proteomics, and metabolomics in files containing expression values are less likely to lead to individual identification due to their nature. Please note that the potential for individual identification exists in genomic data, especially in certain circumstances, and it is essential to handle such data with privacy and security precautions.
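
As a rough illustration of the metadata elements in panel (A), the sketch below rewrites the read-group (@RG) header line of a SAM file. The tag names ID, SM, and PU come from the SAM format; the function names and the replacement map are hypothetical, and the sample ID, pseudonym, and platform-unit values reuse the example given in the Fig. 4 caption.

```python
# Minimal sketch, assuming SAM header and alignment lines are tab-delimited
# text. Function names and the replacement map are illustrative only.
def pseudonymize_rg_line(line, replacements):
    """Replace the values of selected tags on an @RG header line."""
    if not line.startswith("@RG"):
        return line  # other header lines are left untouched here
    fields = line.split("\t")
    out = [fields[0]]
    for field in fields[1:]:
        # Split on the first colon only, so values containing ':' survive.
        tag, _, value = field.partition(":")
        if tag in replacements:
            value = replacements[tag]
        out.append(f"{tag}:{value}")
    return "\t".join(out)


def pseudonymize_read_line(line, old_rg, new_rg):
    """Keep each read's RG:Z tag consistent with the renamed @RG ID."""
    fields = line.split("\t")
    fields = [f"RG:Z:{new_rg}" if f == f"RG:Z:{old_rg}" else f for f in fields]
    return "\t".join(fields)


# Values taken from the Fig. 4 caption; the header layout is hypothetical.
header = "@RG\tID:HD753_S1\tSM:HD753_S1\tPU:MISEQ"
replacements = {"ID": "CMT001_00", "SM": "CMT001_00", "PU": "NGS"}
new_header = pseudonymize_rg_line(header, replacements)
```

If the read-group ID is replaced, every alignment record that refers to it through an `RG:Z:` tag has to be updated as well, which is what the second function is for; otherwise the file becomes internally inconsistent.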

Table 1.

Original text and amended genomic data in “Guidelines for Utilization of Healthcare Data”

Original text
⑥ (Genomic information) Except for a few exceptional cases as outlined below, whether pseudonymization is possible is deferred (usable only based on individual consent, excluding exceptions).
※ Genomic information may contain information about third parties such as parents, ancestors, siblings, offspring, relatives, etc., and until appropriate pseudonymization methods are developed, deferring the determination of pseudonymization feasibility is appropriate.
1) Presence or absence of genetic mutations related to widely known diseases:
- The risk of individual re-identification is significantly reduced by providing information at the level of genes, not specific mutation details (e.g., Loci).
* (Example) Study on the treatment response of patients with B gene mutations when using anticancer drugs.
2) Newly acquired mutation information of neoplasms with the removal of germline mutation information:
- The newly generated mutation information, with the removal of germline mutations (normal tissue mutations), contains only mutation information that causes cancer, ensuring no risk of individual identification.
· Neoplasm: Abnormal cell proliferation known as a tumor.
⑦ (Omics* information excluding genomics) No separate measures required.
* (Example) Metabolomics, proteomics, etc.
- Unlike genomic information, metabolomics, proteomics, etc., do not allow the recovery of genomic information, making separate measures unnecessary. However, transcriptomics is subject to deferral regarding pseudonymization feasibility since genomic information may be recoverable.
Revised version
Genomic Data:
※ The methods outlined in this guideline do not apply to human-derived materials collected and processed with consent for research or donation purposes.
- For human-derived materials collected by medical institutions and subjected to NGS-based genetic tests, generating SAM/BAM/VCF files and test records, the following appropriate methods should be employed:
1. Nucleic acid sequence information: Rare variant information (germline) and short tandem repeat (STR) information that pose personal identification risks should be deleted or appropriately processed if unrelated to the processing purpose, by either partial deletion or substitution, among other suitable methods.
2. Information excluding the above nucleic acid sequences: Metadata or unstructured strings (or codes) listed in target files and records, which contain information posing personal identification risks or specific information, should be partially deleted, or appropriately processed, either in part or in full, by substitution among other suitable methods.
- When considering the use of raw data, such as FASTQ files generated through NGS-based genetic tests on human-derived materials collected for medical purposes, it is recommended that consent from the data subject be obtained.
- The FASTQ file enables any data processor to generate files like SAM/BAM/VCF, which record chromosome numbers, positions, and variant information through mapping nucleotide sequence information for each sequencing read on the standard reference genome.
- Genomic data, containing nucleotide sequences among other information, has inherent limitations in fully interpreting the contained data, thus posing constraints on reducing the identification risk of the data itself. Since it may include information about third parties such as parents, siblings, and relatives, a crucial step involves restricting the utilization environment through a risk assessment of processing conditions (such as access control management and the establishment of closed environments), especially when compared to other types of information.
Omics Data
- Metabolomics and proteomics data, which cannot be used to reconstruct genomic information, do not require separate measures. Likewise, no separate measures are needed when utilizing expression matrix values of transcriptomes generated through NGS-based genetic tests on human-derived materials collected by medical institutions for diagnostic purposes.
- For data generated through NGS-based genetic testing of human-derived samples collected by medical institutions, excluding expression matrix values, which contain information posing personal identification risks, appropriate measures should be taken by deleting or replacing personal identification information, personally identifiable information, and specific information, either in part or in full, using appropriate methods.

NGS, next-generation sequencing.