Challenge for Diagnostic Assessment of Deep Learning Algorithm for Metastases Classification in Sentinel Lymph Nodes on Frozen Tissue Section Digital Slides in Women with Breast Cancer
Abstract
Purpose
Assessing the status of metastasis in sentinel lymph nodes (SLNs) by pathologists is an essential task for the accurate staging of breast cancer. However, histopathological evaluation of SLNs is difficult, tedious, and time-consuming. The purpose of this study is to review a challenge competition (HeLP 2018) to develop automated solutions for the classification of metastases in hematoxylin and eosin–stained frozen tissue sections of SLNs in breast cancer patients.
Materials and Methods
A total of 297 digital slides were obtained from frozen SLN sections, including post–neoadjuvant cases (n = 144, 48.5%), at Asan Medical Center, South Korea. The slides were divided into training, development, and validation sets. All of the imaging datasets were manually segmented by expert pathologists. A total of 10 participants were allowed to use the Kakao challenge platform for 6 weeks with two P40 GPUs. The algorithms were assessed in terms of the area under the receiver operating characteristic curve (AUC).
Results
The top three teams showed 0.986, 0.985, and 0.945 AUCs for the development set and 0.805, 0.776, and 0.765 AUCs for the validation set. Micrometastatic tumors, neoadjuvant systemic therapy, invasive lobular carcinoma, and histologic grade 3 were associated with lower diagnostic accuracy.
Conclusion
In a challenge competition, accurate deep learning algorithms have been developed, which can be helpful in making frozen diagnosis of intraoperative SLN biopsy. Whether this approach has clinical utility will require evaluation in a clinical setting.
Introduction
Recently, implementation of digital pathology has been rising because of a workforce crisis and an increased need for consultation and collaboration. Digital pathology has many advantages in terms of time saving, slide storage, remote working, and second-opinion practice, and is becoming a part of routine procedure in diverse areas such as primary diagnosis, multidisciplinary clinics, and frozen section diagnosis [1]. Owing to rapid technological progress, machine learning techniques using digital histopathological images have been investigated and have shown satisfactory results in the detection of tumor areas and lymph node metastases in prostate, lung, and breast cancers [2-4].
Breast cancer is the most common cancer in women, accounting for approximately one-third of all cancers in women globally. For patients with localized breast cancer, the treatment of choice is surgical removal of the primary tumor [5]. In order to reduce disease recurrence or metastasis, lymph node sampling or dissection should be performed during surgery. Because axillary lymph node dissection may cause morbidity, such as arm lymphedema and nerve injury, sentinel lymph node (SLN) sampling is recommended in order to determine the nodal metastasis status and whether extensive lymph node dissection is required [6-9]. Although some recent studies suggested that the role of SLN biopsy has diminished in early breast cancer patients [10-13], SLN sampling is still considered important due to its cost- and time-effectiveness. It is usually performed intraoperatively using the frozen section technique, which allows surgeons to make immediate decisions during surgery [14]. However, pathologists frequently experience problems while making diagnoses of frozen sections.
First, frozen section diagnosis should be made as quickly as possible in order to minimize the waiting time for surgeons, which can cause surgical and anesthetic complications. The turnaround time of the frozen section diagnosis is usually kept to less than 20 to 30 minutes, including the gross examination, tissue cutting, staining, and the microscopic examination [15]. Second, microscopic examination of a frozen section is more difficult than that of a conventional section because of the inferior quality of the sections due to freezing artifacts. There are also components in lymph nodes, such as capillaries, histiocytes, and germinal centers, which can be mistaken for metastatic carcinoma. Furthermore, frozen section diagnosis is extremely difficult in some patients who have undergone neoadjuvant systemic therapy before surgery. Deep learning algorithms might help to overcome such difficulties. For example, the ‘CAncer MEtastases in LYmph nOdes challeNge’ (CAMELYON16 and CAMELYON17) competitions disclosed that some deep learning algorithms achieved better diagnostic performance than a panel of 11 pathologists participating in a simulation exercise designed to mimic routine pathology workflow [4,16]. However, the digital slides used in most of those previous studies had been created not from frozen tissue sections but from formalin-fixed paraffin-embedded (FFPE) tissue sections. To the best of our knowledge, no study using frozen tissue sections of SLNs has been reported to date. In addition, the previous studies did not include post–neoadjuvant cases, which have been increasing but are difficult to examine histologically [17].
In the challenge competition originating from the HeLP (HEalthcare ai Learning Platform), several models have been developed. In this challenge setting, we aimed to evaluate the models’ performances for classification of metastases per slide in hematoxylin and eosin–stained frozen tissue sections of SLNs of breast cancer patients.
Materials and Methods
1. Data description
During routine surgical procedures for breast cancer in our institution, the excised SLNs were immediately submitted for frozen section. All of the SLNs were cut into 2-mm slices, entirely embedded in optimum cutting temperature compound, and frozen at –20°C to –30°C. For each lymph node, 5-μm-thick frozen sections were cut and one or two sections were picked up on glass slides and stained with hematoxylin and eosin. In this study, a total of 297 digital slides of SLNs from 132 patients were retrospectively collected. Among those, 144 slides (48.5%) were made from SLNs of patients who had received neoadjuvant therapy. The slides were divided into a training set, a development set, and a validation set (157, 40, and 100 digital slides, respectively) without consideration of the distribution of histologic type. Slides acquired before a specific point in time were used as the training and development sets, and the digital slides acquired after that point were used as the validation set. Patient demographics are summarized in Table 1. The slides were scanned using a digital microscopy scanner (Pannoramic 250 FLASH, 3DHISTECH Ltd., Budapest, Hungary) in MIRAX format (.mrxs) with a resolution of 0.221 μm per pixel.
2. Reference standard
All the imaging datasets were segmented manually by one rater, and their annotations were confirmed by two clinically expert pathologists with 6 and 20 years’ experience in breast pathology. Regions of metastatic carcinoma larger than 200 μm in the greatest dimension were annotated as cancer with the in-house labeling tool, as shown in Fig. 1.
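As a rough illustration of what the 200 μm annotation cutoff means at the stated scan resolution of 0.221 μm per pixel, the conversion can be sketched as follows. The helper names are hypothetical and are not part of the in-house labeling tool.

```python
# Values taken from the text: scanner resolution and annotation size cutoff.
MICRONS_PER_PIXEL = 0.221   # scan resolution (um per pixel)
MIN_METASTASIS_UM = 200.0   # cutoff on the greatest dimension (um)


def microns_to_pixels(length_um: float, mpp: float = MICRONS_PER_PIXEL) -> float:
    """Convert a physical length in micrometres to a length in pixels."""
    return length_um / mpp


def exceeds_cutoff(greatest_dimension_px: float) -> bool:
    """Hypothetical helper: is a region of this pixel extent larger than the cutoff?"""
    return greatest_dimension_px * MICRONS_PER_PIXEL > MIN_METASTASIS_UM


min_px = microns_to_pixels(MIN_METASTASIS_UM)
print(f"{MIN_METASTASIS_UM:.0f} um is about {min_px:.0f} pixels at full resolution")
```

At full resolution the cutoff therefore corresponds to roughly 905 pixels along the greatest dimension of a region.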
3. Challenge competition environment
The challenge competition platform developed by Kakao was used to allocate two GPUs to each team. Through the Kakao platform, competitors were allowed to access only the paths of the digital slides and the corresponding mask images. Docker image files, which enable any deep learning framework to run, were used to train models and to run inference on the development and validation sets. Each team was given two P40 GPUs (NVIDIA, Santa Clara, CA) for training models. The Kakao platform used CUDA 9.0 and cuDNN 7.
During the first stage, which lasted 4 weeks, competitors were given 197 digital slides as the training and development sets. The training set (157 digital slides) with annotated masks was given for training the model, while the development set (40 digital slides) without masks was given for tuning the model. After inference on the development set, model performance calculated with the evaluation metric was listed on the leaderboard. During the second stage, which lasted an additional 2 weeks, the competitors were given 100 additional digital slides for final evaluation of the optimal model derived from the development set.
4. Evaluation metric
The algorithms were assessed on classifying each slide as “metastasis” or “normal.” The area under the receiver operating characteristic curve (AUC) was evaluated by receiver operating characteristic (ROC) analysis.
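For readers unfamiliar with the metric, a minimal sketch of a slide-level AUC computation is given below. It uses the rank interpretation of the AUC (the probability that a randomly chosen metastatic slide receives a higher score than a randomly chosen normal slide). The function name and the toy scores are illustrative assumptions and do not reproduce the challenge's evaluation code or data.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative slide")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (not from the study): 1 = metastasis, 0 = normal.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(f"toy AUC = {roc_auc(toy_labels, toy_scores):.3f}")
```

In the toy example, eight of the nine positive-negative pairs are ranked correctly, so the AUC is 8/9.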
5. Competitors
Forty-five competitors who were interested in digital pathology or machine learning registered for this challenge within 4 weeks from the beginning of November 2018. Ten competitors were selected according to their commitment, given the limited platform environment. The ten competitors comprised students, researchers, and doctors experienced in medical image analysis using machine learning or deep learning. Only four competitors submitted their results to the leaderboard. The methods of each algorithm are described in detail in Table 2. All of the competitors selected deep learning as the main architecture, such as Inception v3 [18] for classification of tumor patches or U-Net [19] for segmentation of the tumor region. Instead of modifying their models, they focused on pre- and post-processing steps to achieve optimal results. One high-ranking team used random forest regression [20] to infer confidence from high-level features extracted from the heat map generated by the deep learning method, including the number of tumor regions, the percentage of the tumor region over the entire tissue region, and the area of the largest tumor region. Real-time augmentation methods were applied while training the models.
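The heat-map features named above (number of tumor regions, tumor fraction of the tissue area, and area of the largest tumor region) could be computed along the following lines. This is only a sketch under stated assumptions: the 0.5 threshold, the 4-connectivity, and the function and key names are illustrative and are not taken from the team's implementation. The resulting feature vector would then be passed to a regressor such as a random forest.

```python
from typing import Dict, List


def heatmap_features(heatmap: List[List[float]],
                     tissue: List[List[bool]],
                     threshold: float = 0.5) -> Dict[str, float]:
    """Summarise a tumor-probability heat map as slide-level features.

    `threshold` and 4-connectivity are illustrative choices only.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    tumor = [[tissue[r][c] and heatmap[r][c] >= threshold for c in range(cols)]
             for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    region_areas = []
    for r in range(rows):
        for c in range(cols):
            if not tumor[r][c] or seen[r][c]:
                continue
            # Flood fill one connected tumor region and count its pixels.
            stack, area = [(r, c)], 0
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                area += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and tumor[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            region_areas.append(area)
    tissue_area = sum(cell for row in tissue for cell in row)
    return {
        "n_regions": float(len(region_areas)),
        "tumor_fraction": sum(region_areas) / tissue_area if tissue_area else 0.0,
        "largest_region_area": float(max(region_areas, default=0)),
    }


# Toy 3x3 heat map (not study data) with two separate tumor regions.
toy_heatmap = [[0.9, 0.8, 0.1],
               [0.1, 0.1, 0.1],
               [0.1, 0.1, 0.7]]
toy_tissue = [[True] * 3 for _ in range(3)]
features = heatmap_features(toy_heatmap, toy_tissue)
print(features)
```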
6. Ethical statement
The institutional review board for human investigations at Asan Medical Center (AMC) approved the study protocol with removal of all patient identifiers from the images and waived the requirement for informed consent, in accordance with the retrospective design of this study.
Results
Model performances were sorted in descending order for the validation set as shown in Table 3 and Fig. 2. Four teams submitted their results to the leaderboard for the development and validation sets. For the development set, the four algorithms showed AUCs of 0.986, 0.985, 0.945, and 0.595. For the validation set, which consisted of 100 digital slides, the Fiffeb team showed the highest AUC (0.805), compared with 0.776, 0.760, and 0.540 for the DoAI, GoldenPass, and SOG teams, respectively. Average times of the first three teams (Fiffeb, DoAI, and GoldenPass) in the validation set were 10.8, 0.6, and 3.9 minutes, respectively.
For more detailed analysis, each algorithm was evaluated with the cutoff threshold determined by the Youden index [21] from the ROC curve in the validation set in terms of the accuracy (ACC), true positive rate (TPR), true negative rate (TNR), positive predictive value (PPV), and negative predictive value (NPV). The first-placed team Fiffeb showed the highest AUC (0.805), ACC (0.770), TNR (0.822), and PPV (0.833), while the second-placed team DoAI showed the highest TPR (0.800) and NPV (0.738).
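As an illustration of this analysis, the sketch below selects a cutoff by the Youden index (J = TPR + TNR − 1) and then derives ACC, TPR, TNR, PPV, and NPV at that cutoff. The function names and toy scores are assumptions for illustration only. Scores at or above the cutoff are called "metastasis", which is one possible convention, and both classes must be present in the data.

```python
from typing import Dict, Sequence, Tuple


def confusion(labels: Sequence[int], scores: Sequence[float],
              cutoff: float) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) when scores >= cutoff are called metastasis."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < cutoff)
    return tp, fp, tn, fn


def youden_cutoff(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pick the observed score that maximises J = TPR + TNR - 1."""
    best_cutoff, best_j = None, float("-inf")
    for cutoff in sorted(set(scores)):
        tp, fp, tn, fn = confusion(labels, scores, cutoff)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:  # the lowest cutoff wins if several share the maximum
            best_cutoff, best_j = cutoff, j
    return best_cutoff


def summary_metrics(labels: Sequence[int], scores: Sequence[float],
                    cutoff: float) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity, and predictive values at a cutoff."""
    tp, fp, tn, fn = confusion(labels, scores, cutoff)
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "PPV": tp / (tp + fp) if tp + fp else float("nan"),
        "NPV": tn / (tn + fn) if tn + fn else float("nan"),
    }


# Toy data (not from the study): 1 = metastasis, 0 = normal.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
cutoff = youden_cutoff(toy_labels, toy_scores)
metrics = summary_metrics(toy_labels, toy_scores, cutoff)
print(cutoff, metrics)
```

Note that choosing the cutoff on the same set that is then used to report ACC, TPR, and the other metrics, as described above for the validation set, gives an optimistic estimate compared with a cutoff fixed in advance.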
In addition, model performance was compared in more detail according to clinical information, such as the metastatic tumor size (smaller or larger than 2 mm in the greatest dimension), whether patients had received neoadjuvant systemic therapy, the histologic type of the tumor, and the histologic grade of the tumor, as shown in Table 4. Four teams showed higher TPR and lower false-negative rate in lymph nodes with larger metastatic tumors. In lymph nodes obtained from patients who had received neoadjuvant systemic therapy, four teams showed lower TPR and two teams showed lower TNR. In terms of the histologic type, three teams showed higher TPR and four teams higher TNR in the invasive lobular carcinoma group than in the invasive ductal carcinoma group. When comparing performance between the histologic grades, four teams showed higher TPR, but only one team showed higher TNR in grade 1 or 2 than in grade 3.
Among the 100 slides in the validation set, 57 slides were correctly categorized by all top three teams (35 slides, true-positive; 22 slides, true-negative), four slides were incorrectly categorized as positive (false-positive) by the top three teams, and six slides were incorrectly categorized as negative (false-negative) by the top three teams, as shown in Fig. 3. All of the four false-positive slides were obtained from patients with invasive ductal carcinoma, histologic grade 2, and two slides were from neoadjuvant systemic therapy patients. Similarly, all of the six false-negative slides were obtained from patients with invasive ductal carcinoma, i.e., five from histologic grade 2 patients and one from a histologic grade 3 patient, and three were from neoadjuvant systemic therapy patients. Four of the six false-negative slides had micrometastases. The size range of metastatic carcinoma in the false-negative slides was 0.13 to 4.45 mm.
Discussion
In the current study, all of the competitors adopted convolutional neural network (CNN)–based deep learning methods, such as classification or segmentation networks, as their main approach, which showed high performance, with AUCs of 0.805, 0.776, and 0.760 for the top three teams.
Interestingly, in all four teams, the AUC was lower in the validation set than in the development set. This might be due to differences in patient demographics, particularly with regard to neoadjuvant systemic therapy. The distribution of histologic type differs among the training, development, and validation sets, as shown in Table 1. In particular, the number of slides obtained from patients after neoadjuvant systemic therapy was significantly higher in the validation set than in the development set. Neoadjuvant systemic therapy often causes fibrosis and macrophage infiltration in the tumor area and fragmentation and/or scattering of tumor clusters [17], which can lead to difficulty in histologic examination. This neoadjuvant systemic therapeutic effect may have caused the decrease in AUC in the validation set.
Inference time is also a key point in this challenge if the methods are to be adopted in routine clinical practice. Turnaround time between receiving samples and reporting in conventional frozen section diagnosis has been variably reported as around 20-30 minutes, including gross examination, freezing, cutting, staining, and microscopic examination [22]. The time consumed for scanning can vary with the size of sections, type of scanning machine, magnification, and focus layering, but recent studies have reported a median handling time for scanning of 3-9 minutes [22,23]. The two types of patch-based CNN methods, classification and segmentation networks, each have pros and cons. The number of outputs of the classification network in this challenge is the same as the number of classes into which the model classifies an input patch (i.e., 1 or 2), because all input dimensions are encoded into compressed features for a precise decision. In the case of the segmentation network, the number of outputs is the same as the number of input dimensions (i.e., 448×448=200,704), which is approximately 100K or 200K times more than that of the classification network. Because a segmentation network labels every pixel of a patch in a single pass, this is a factor that reduces computational time. In our results, the first-placed team, using only a classification network, showed an AUC 0.03 higher than that of the second-placed team, using only a segmentation network, but was too slow to deploy in real clinical routine, while the second-placed team was 18.8 times faster than the first-placed team. An ensemble of these different types of CNN should be considered to enhance model performance in routine clinical practice.
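The output-count comparison and the AUC gap between the first two teams can be checked with a few lines of arithmetic. The sketch below uses only the patch size and the validation AUCs reported in the text; the variable names are illustrative.

```python
PATCH_SIDE = 448  # patch edge length in pixels, from the text

# A segmentation network emits one prediction per input pixel of the patch.
segmentation_outputs = PATCH_SIDE * PATCH_SIDE

# A classification network emits one or two values per patch.
ratios = {}
for classification_outputs in (1, 2):
    ratios[classification_outputs] = segmentation_outputs / classification_outputs
    print(f"{classification_outputs} output(s) per patch: segmentation yields "
          f"{ratios[classification_outputs]:,.0f} times more predictions")

# Validation AUCs reported for the first- and second-placed teams.
auc_first, auc_second = 0.805, 0.776
auc_gap = auc_first - auc_second
print(f"AUC gap between the first two teams: {auc_gap:.3f}")
```

With one or two classification outputs per patch, the ratio is about 200K or 100K, respectively, and the AUC gap is 0.029.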
Next, we compared model performances according to the clinicopathologic factors of the patients. It is generally known that in manual examination of intraoperative SLN biopsy, false-negative results are more likely in micrometastases and favorable and/or lobular histology [24]. In the validation set, the top three teams showed better performances in lymph nodes with macrometastatic tumor, which is consistent with manual examination and the CAMELYON16 study [4]. Lymph nodes obtained from patients who had not received neoadjuvant systemic therapy also showed better performances, as discussed above. Lymph nodes from invasive lobular carcinoma patients showed better TPR in the first three teams and better TNR in four teams than those from invasive ductal carcinoma patients, although the number of slides from invasive lobular carcinoma patients is limited. This is in accordance with the general results in manual examination and the CAMELYON16 study. In the CAMELYON16 study, 29 among 32 teams showed higher AUC in the invasive ductal carcinoma set than in the non-invasive ductal carcinoma set. In addition, tumors of histologic grade 1 or 2 showed higher TPR in the top three teams, but lower TNR in two of the three teams, than tumors of histologic grade 3, which requires further study.
We found that some cases were wrongly categorized by the first three teams. All six false-negative cases showed small-sized metastatic carcinoma, which could result in false negativity. In contrast, the four false-positive cases did not reveal any common clinicopathologic feature. However, we assume that reactive histiocytic infiltration or prominent germinal centers in lymph nodes might cause false positivity. Manual confirmation is probably necessary, and so a screening tool that would expedite this process might have broad appeal. Interestingly, the TPR of mucinous carcinoma cases (0.5-1.0) was not lower than that of invasive ductal carcinoma (0.149-0.766) or invasive lobular carcinoma (0.000-1.000), although mucinous carcinoma was not included in training and validation sets. This might be due to some histologic similarities between mucinous carcinoma and other carcinomas, such as cluster formation, bigger cell size than lymphocytes, and nuclear size enlargement.
Our study has several strengths compared with previously reported studies on the possible usefulness of deep learning algorithms in the diagnosis of SLN metastasis [4,16]. First, we used digital slides from frozen sections that were made intraoperatively, while previous studies used FFPE sections. Since frozen sections have lower quality than FFPE sections due to tissue artifacts, they are more difficult to examine. However, what is used to determine the surgical extent intraoperatively in the real world is frozen sections, not FFPE sections. Therefore, we suggest that studies of deep learning algorithms with SLNs would be more practical if frozen sections are used. Second, our dataset includes a high proportion (48.5%) of post-neoadjuvant patients. The role of neoadjuvant therapy in breast cancer treatment has been increasing, but it is much more difficult to diagnose SLN metastasis histologically after neoadjuvant therapy [17]. During case selection, we included more post-neoadjuvant cases than are seen in the clinical setting, with the intention of making our dataset unique and more useful. To reduce false-positive or false-negative results technically, the deep learning models should be re-trained on the misclassified regions with different hyper-parameters, such as class weights or loss weights. With this strategy, the models are trained intensively on those regions, which are treated as strong positive regions. Applications using these methods could be adopted in routine clinical practice by showing attention maps with augmented reality and by training robustly, through on-line learning, on false-positive cases selected by pathologists.
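The idea of re-training with class weights or loss weights can be illustrated with a weighted binary cross-entropy, in which errors on one class are penalized more heavily. This is a minimal sketch under stated assumptions: the function name, the weight values, and the toy predictions are hypothetical, and none of the teams is known to have used this exact loss.

```python
import math
from typing import Sequence


def weighted_bce(labels: Sequence[int], probs: Sequence[float],
                 pos_weight: float = 1.0, neg_weight: float = 1.0,
                 eps: float = 1e-7) -> float:
    """Mean binary cross-entropy with separate weights for tumor (1) and normal (0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -neg_weight * math.log(1.0 - p)
    return total / len(labels)


# Toy example (not study data): one missed tumor pixel and one correct normal pixel.
toy_labels = [1, 0]
toy_probs = [0.2, 0.1]
plain = weighted_bce(toy_labels, toy_probs)
emphasised = weighted_bce(toy_labels, toy_probs, pos_weight=3.0)
print(f"unweighted loss = {plain:.3f}, tumor-weighted loss = {emphasised:.3f}")
```

Raising `pos_weight` increases the penalty for missed tumor regions (false negatives), whereas raising `neg_weight` would instead penalize false positives more strongly.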
Our contest has several limitations. First, only paths to access the training, development, and validation sets were given to competitors, which means that they had no way to check the heat maps generated by their models, because the datasets provided in the contest were not publicly available. For the same reason, competitors were not allowed to inspect processing in the middle of training. Less than 1 MB of log data could be saved and given to competitors after training, for debugging and for checking whether and how well the training had gone. Information on how much time was spent on training and analyses was also not available. This might be one of the key reasons for the relatively low accuracies of the models. Second, only two GPUs were given to each competitor, which may have been a limited resource, although this constraint kept the competition fair. Third, we did not perform immunohistochemistry to confirm metastatic carcinoma on frozen section slides. In contrast to FFPE sections, multiple frozen sections made from the same tissue fragment showed quite different shapes due to tissue artifacts. Therefore, immunohistochemistry is not as helpful in frozen sections as in FFPE sections for annotating tumor cells. In addition, it is impossible to perform immunohistochemistry retrospectively on frozen sections. Instead, when we annotated tumor cells in frozen sections, we reviewed matched FFPE sections with cytokeratin immunohistochemistry in order to minimize annotation error. Fourth, the high proportion of post–neoadjuvant cases or cases with micrometastases could have negatively affected the diagnostic accuracy of the algorithms in this study. It would have been preferable to divide the dataset into multiple groups and develop different algorithms based on patient information, such as neoadjuvant status, histologic type, or histologic grade of the tumor. However, this was impossible due to the limited number of digital slides. We hope to expand our dataset and include such analysis in a further study. Finally, model performance can be influenced by various parameters, including the quality of tissue sections, staining quality and color differences, type of scanning machine, scanning environment, and accuracy of segmentation. Therefore, further studies on optimizing the pre-processing of digital images might improve the models’ diagnostic performance.
Possibly because of the characteristics of our dataset and the above limitations, even the top three algorithms in this study showed lower performance than the first-prize algorithm in CAMELYON16, and lower diagnostic accuracy than the average of pathologists [25]. However, we believe that it is worth holding a digital pathology challenge competition using frozen tissue sections in an open innovation manner. To adapt algorithms to routine clinical practice, HeLP is preparing another challenge competition to handle other problems such as localization of micrometastasis and processing time.
The recognition abilities of deep learning and humans could complement each other. In addition, deep learning algorithms can be used as computer-aided systems to help doctors diagnose. For example, virtual reality technology can help in making quick and accurate decisions or alert a doctor who misses critical parts.
We held a challenge competition over six weeks to address the classification of digital pathology slides with metastases in hematoxylin and eosin–stained frozen tissue sections of SLNs of breast cancer patients. The top three competitor teams achieved very high AUCs in the development set, while their AUCs were lower in the validation set. In this open innovation manner, deep learning algorithms could be developed and evaluated, which might be helpful in the frozen diagnosis of intraoperative SLN biopsy. Further studies are required in order to increase the accuracy and decrease the time required to apply deep learning algorithms in the clinical setting.
Notes
No conflict of interest relevant to this article was reported.
Acknowledgements
This work was supported by Kakao and Kakao Brain corporations and a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (HI18C0022).