Lu, C., Hanif, A., Singh, P., Chang, K., Coyner, A. S., Brown, J. M., Ostmo, S., Chan, R. P., Rubin, D., Chiang, M. F., Campbell, J. P., Kalpathy-Cramer, J., and the Imaging and Informatics in Retinopathy of Prematurity Consortium. Federated learning for multi-center collaboration in ophthalmology: improving classification performance in retinopathy of prematurity. Ophthalmology Retina. 2022.

Abstract

OBJECTIVE: To compare the performance of deep learning (DL) classifiers for the diagnosis of plus disease in retinopathy of prematurity (ROP) trained using two methods of developing models on multi-institutional datasets: centralizing data versus federated learning (FL), in which no data leave each institution.

DESIGN: Evaluation of a diagnostic test or technology.

SUBJECTS, PARTICIPANTS, AND/OR CONTROLS: DL models were trained, validated, and tested on 5,255 wide-angle retinal images acquired in the neonatal intensive care units of 7 institutions as part of the Imaging and Informatics in ROP (i-ROP) study. All images were labeled for the presence of plus, pre-plus, or no plus disease with a clinical label and with a reference standard diagnosis (RSD) determined by three image-based ROP graders and the clinical diagnosis.

METHODS, INTERVENTION, OR TESTING: We compared the area under the receiver operating characteristic curve (AUROC) for models developed on multi-institutional data using a central approach and then FL, and compared locally trained models with either approach. We compared model performance (kappa) with label agreement (between clinical and RSD labels), dataset size, and number of plus disease cases in each training cohort using Spearman's correlation coefficient (CC).

MAIN OUTCOME MEASURES: Model performance using AUROC and linearly weighted kappa.

RESULTS: We compared four experimental settings: FL trained on RSD versus central trained on RSD, FL trained on clinical labels versus central trained on clinical labels, FL trained on RSD versus central trained on clinical labels, and FL trained on clinical labels versus central trained on RSD (p=0.046, p=0.126, p=0.224, and p=0.0173, respectively). 4/7 (57%) of models trained on local institutional data performed inferiorly to the FL models.
Local model performance was positively correlated with label agreement between clinical and RSD labels (CC=0.389, p=0.387), the total number of plus cases (CC=0.759, p=0.047), and the overall training set size (CC=0.924, p=0.002).

CONCLUSIONS: We show that an FL-trained model performs comparably to a centralized model, confirming that FL may provide an effective, more feasible solution for inter-institutional learning. Smaller institutions benefit more from collaboration than larger institutions, showing the potential of FL for addressing disparities in resource access.
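The abstract's central idea, training a shared model without moving images between institutions, is typically realized by aggregating locally trained weights rather than pooling data. The paper does not include its implementation here, so the following is only a minimal sketch of FedAvg-style weighted averaging; the function name `federated_average` and the per-site image counts are illustrative assumptions, not the authors' code.

```python
import numpy as np

def federated_average(site_weights, site_sizes):
    """Combine per-institution model weights into one global model.

    Each site trains locally and shares only its weight arrays; the
    server averages them, weighted by local training-set size, so raw
    retinal images never leave the institution.
    """
    total = sum(site_sizes)
    return [
        sum(w[k] * (n / total) for w, n in zip(site_weights, site_sizes))
        for k in range(len(site_weights[0]))
    ]

# Toy example: 3 hypothetical sites, each holding one weight matrix
# and one bias vector for the same small model.
rng = np.random.default_rng(0)
site_weights = [[rng.normal(size=(4, 2)), rng.normal(size=2)]
                for _ in range(3)]
site_sizes = [1200, 800, 400]  # hypothetical per-site image counts

global_weights = federated_average(site_weights, site_sizes)
assert global_weights[0].shape == (4, 2)
assert global_weights[1].shape == (2,)
```

The size weighting is one plausible reading of the abstract's finding that larger training sets correlate with better local models: sites contribute to the global update in proportion to the data they hold.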

DOI: 10.1016/j.oret.2022.02.015

PubMedID: 35296449