Inter-observer variability of expert-derived morphologic risk predictors in aortic dissection. European Radiology. Willemink, M. J., Mastrodicasa, D., Madani, M. H., Codari, M., Chepelev, L. L., Mistelbauer, G., Hanneman, K., Ouzounian, M., Ocazionez, D., Afifi, R. O., Lacomis, J. M., Lovato, L., Pacini, D., Folesani, G., Hinzpeter, R., Alkadhi, H., Stillman, A. E., Sailer, A. M., Turner, V. L., Hinostroza, V., Baumler, K., Chin, A. S., Burris, N. S., Miller, D. C., Fischbein, M. P., Fleischmann, D. 2022.


OBJECTIVES: Establishing the reproducibility of expert-derived measurements on CTA exams of aortic dissection is clinically important and paramount for ground-truth determination for machine learning.

METHODS: Four independent observers retrospectively evaluated CTA exams of 72 patients with uncomplicated Stanford type B aortic dissection and assessed the reproducibility of a recently proposed combination of four morphologic risk predictors (maximum aortic diameter, false lumen circumferential angle, false lumen outflow, and intercostal arteries). For the first inter-observer variability assessment, 47 CTA scans from one aortic center were evaluated by expert-observer 1 in an unconstrained clinical assessment without a standardized workflow and compared to a composite of three expert-observers (observers 2-4) using a standardized workflow. A second inter-observer variability assessment, on 30 of the 47 CTA scans, compared observers 3 and 4 using a constrained, standardized workflow. A third inter-observer variability assessment was performed after specialized training and tested agreement between observers 3 and 4 in an external population of 25 CTA scans. Inter-observer agreement was assessed with intraclass correlation coefficients (ICCs) and Bland-Altman plots.

RESULTS: Pre-training ICCs of the four morphologic features ranged from 0.04 (-0.05 to 0.13) to 0.68 (0.49-0.81) between observer 1 and observers 2-4 and from 0.50 (0.32-0.69) to 0.89 (0.78-0.95) between observers 3 and 4. ICCs improved after training, ranging from 0.69 (0.52-0.87) to 0.97 (0.94-0.99), and Bland-Altman analysis showed decreased bias and narrower limits of agreement.

CONCLUSIONS: Manual morphologic feature measurements on CTA images can be optimized, resulting in improved inter-observer reliability. This is essential for robust ground-truth determination for machine learning models.

KEY POINTS: Manual measurements of aortic CTA imaging features made in a clinical fashion showed poor inter-observer reproducibility. A standardized workflow with standardized training resulted in substantial improvements, with excellent inter-observer reproducibility. Robust ground-truth labels obtained manually with excellent inter-observer reproducibility are key to developing reliable machine learning models.
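The two agreement statistics named in the methods, the ICC and Bland-Altman bias with limits of agreement, can be sketched as follows. This is a minimal illustration, not the study's analysis code: the abstract does not state which ICC form was used, so the two-way random-effects, absolute-agreement, single-measure form (often written ICC(2,1)) is an assumption here, as are the function names and the example diameters, which are invented and not taken from the study.

```python
import numpy as np


def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) for paired measurements.

    Bias is the mean of the paired differences a - b; the 95% limits of
    agreement are bias +/- 1.96 * SD of those differences.
    """
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample standard deviation
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def icc_2_1(ratings):
    """ICC(2,1) for an (n subjects x k observers) array.

    Assumed form: two-way random effects, absolute agreement, single
    measure. The abstract does not specify the form actually used.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between observers
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical maximum aortic diameters (mm) from two observers.
obs_a = [38.1, 41.5, 45.0, 36.2, 50.3, 43.8]
obs_b = [37.6, 42.3, 44.1, 37.0, 49.5, 44.6]

bias, lower, upper = bland_altman(obs_a, obs_b)
icc = icc_2_1(np.column_stack([obs_a, obs_b]))
print(f"bias={bias:.2f} mm, limits of agreement=({lower:.2f}, {upper:.2f}) mm")
print(f"ICC(2,1)={icc:.3f}")
```

Under the absolute-agreement form, a constant offset between observers lowers the ICC even when their rankings agree perfectly, which is one reason bias is usually reported alongside it.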

DOI: 10.1007/s00330-022-09056-z

PubMed ID: 36029344