Standardized Assessment of Automatic Segmentation of White Matter Hyperintensities: Results of the WMH Segmentation Challenge

Abstract

Quantification of cerebral white matter hyperintensities (WMH) of presumed vascular origin is of key importance in many neurological research studies. Currently, measurements are often still obtained from manual segmentations on brain MR images, which is a laborious procedure. Automatic WMH segmentation methods exist, but a standardized comparison of the performance of such methods is lacking. We organized a scientific challenge, in which developers could evaluate their method on a standardized multi-center/-scanner image dataset, giving an objective comparison: the WMH Segmentation Challenge (https://wmh.isi.uu.nl/). Sixty T1+FLAIR images from three MR scanners were released with manual WMH segmentations for training. A test set of 110 images from five MR scanners was used for evaluation. Segmentation methods had to be containerized and submitted to the challenge organizers. Five evaluation metrics were used to rank the methods: (1) Dice similarity coefficient, (2) modified Hausdorff distance (95th percentile), (3) absolute log-transformed volume difference, (4) sensitivity for detecting individual lesions, and (5) F1-score for individual lesions. Additionally, methods were ranked on their inter-scanner robustness. Twenty participants submitted their method for evaluation. This paper provides a detailed analysis of the results. In brief, there is a cluster of four methods that rank significantly better than the other methods, with one clear winner. The inter-scanner robustness ranking shows that not all methods generalize to unseen scanners. The challenge remains open for future submissions and provides a public platform for method evaluation.
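
As an illustration of the first metric named in the abstract, the Dice similarity coefficient between a manual and an automatic segmentation can be sketched as follows. This is a minimal sketch, not the challenge's evaluation code: the function name `dice_coefficient`, the representation of masks as flat 0/1 sequences, and the convention that two empty masks score 1.0 are all assumptions made for the example.

```python
def dice_coefficient(reference, prediction):
    """Dice similarity coefficient of two binary masks (flat 0/1 sequences).

    Hypothetical helper for illustration; defined as 2|A n B| / (|A| + |B|).
    """
    if len(reference) != len(prediction):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled as lesion in both masks.
    overlap = sum(1 for r, p in zip(reference, prediction) if r and p)
    # Total number of lesion voxels across the two masks.
    total = sum(1 for r in reference if r) + sum(1 for p in prediction if p)
    # Assumed convention: two empty masks are treated as perfect agreement.
    if total == 0:
        return 1.0
    return 2.0 * overlap / total


# Toy example: 3 lesion voxels in each mask, 2 of them shared.
manual = [0, 1, 1, 1, 0, 0]
automatic = [0, 0, 1, 1, 1, 0]
score = dice_coefficient(manual, automatic)
print(round(score, 3))
```

In practice the masks would be 3D volumes flattened to one dimension; the other four metrics need distance computations or connected-component analysis and are not covered by this sketch.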


Bibtex (lrde.bib)

@Article{	  kuijf.19.tmi,
  author	= {H. J. Kuijf and J. M. Biesbroek and J. de Bresser and R.
		  Heinen and S. Andermatt and M. Bento and M. Berseth and M.
		  Belyaev and M. J. Cardoso and A. Casamitjana and D. L.
		  Collins and M. Dadar and A. Georgiou and M. Ghafoorian and
		  D. Jin and A. Khademi and J. Knight and H. Li and X.
		  Llad\'{o} and M. Luna and Q. Mahmood and R. McKinley and A.
		  Mehrtash and S. Ourselin and B. Park and H. Park and S. H.
		  Park and S. Pezold and \'{E}lodie Puybareau and L. Rittner
		  and C. H. Sudre and S. Valverde and V. Vilaplana and R.
		  Wiest and Yongchao Xu and Z. Xu and G. Zeng and J. Zhang
		  and G. Zheng and C. Chen and W. van der Flier and F.
		  Barkhof and M. A. Viergever and G. J. Biessels},
  journal	= {IEEE Transactions on Medical Imaging},
  title		= {Standardized Assessment of Automatic Segmentation of White
		  Matter Hyperintensities: {R}esults of the {WMH}
		  Segmentation Challenge},
  year		= {2019},
  pages		= {1--13},
  abstract	= {Quantification of cerebral white matter hyperintensities
		  (WMH) of presumed vascular origin is of key importance in
		  many neurological research studies. Currently, measurements
		  are often still obtained from manual segmentations on brain
		  MR images, which is a laborious procedure. Automatic WMH
		  segmentation methods exist, but a standardized comparison
		  of the performance of such methods is lacking. We organized
		  a scientific challenge, in which developers could evaluate
		  their method on a standardized multi-center/-scanner image
		  dataset, giving an objective comparison: the WMH
		  Segmentation Challenge (https://wmh.isi.uu.nl/). Sixty
		  T1+FLAIR images from three MR scanners were released with
		  manual WMH segmentations for training. A test set of 110
		  images from five MR scanners was used for evaluation.
		  Segmentation methods had to be containerized and submitted
		  to the challenge organizers. Five evaluation metrics were
		  used to rank the methods: (1) Dice similarity coefficient,
		  (2) modified Hausdorff distance (95th percentile), (3)
		  absolute log-transformed volume difference, (4) sensitivity
		  for detecting individual lesions, and (5) F1-score for
		  individual lesions. Additionally, methods were ranked on
		  their inter-scanner robustness. Twenty participants
		  submitted their method for evaluation. This paper provides
		  a detailed analysis of the results. In brief, there is a
		  cluster of four methods that rank significantly better than
		  the other methods, with one clear winner. The inter-scanner
		  robustness ranking shows that not all methods generalize to
		  unseen scanners. The challenge remains open for future
		  submissions and provides a public platform for method
		  evaluation.},
  keywords	= {Image segmentation; Three-dimensional displays; Manuals;
		  White matter; Biomedical imaging; Radiology; Magnetic
		  resonance imaging (MRI); Brain; Evaluation and performance;
		  Segmentation},
  note		= {Available as 'Early access'}
}