Convolutional neural networks are known to require large amounts of data to achieve optimal performance. In addition, data is commonly computationally augmented using a variety of geometric and intensity transformations to further extend the set of training samples. In medical imaging, annotated data is often scarce or costly to obtain, and there is considerable interest in methods to reduce the amount of data needed. In this work, we investigate the relative benefit of increasing the amount of original data, compared with computationally augmenting the training samples, for the case of false positive reduction of lung nodule candidates. To this end, we have implemented a previously published topology for classification, shown to achieve state-of-the-art results on the publicly available Luna16 dataset. Numerous models were trained using different amounts of unique training samples and different degrees of data augmentation involving rotations and translations, and the performance was compared. Results indicate that, in general, better performance is achieved when increasing the amount of data or augmenting the data more extensively, as expected. Surprisingly, however, we observed that after reaching a certain amount of unique training samples, data augmentation leads to significantly better performance compared to adding the same number of new samples to the training dataset. We hypothesize that the augmentation has aided in learning more general rotation- and translation-invariant features, leading to improved performance on unseen data. Future experiments include more detailed characterization of this behavior, and relating this to the topology and number of parameters to be trained.
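As an illustration of the kind of augmentation the abstract refers to, the following is a minimal sketch of rotating and translating a 3D candidate patch. It is not the authors' implementation: the function names `augment_patch` and `augment_dataset`, the restriction to 90-degree rotations, the integer voxel shifts, and the `max_shift` default are all assumptions made for the example.

```python
import numpy as np


def augment_patch(patch, rng, max_shift=2):
    """Return a randomly rotated and translated copy of a 3D patch.

    Hypothetical helper: rotations are limited to multiples of 90 degrees
    and translations to whole voxels, which is an assumption of this sketch.
    """
    # Rotate by a random multiple of 90 degrees in a randomly chosen plane.
    planes = [(0, 1), (0, 2), (1, 2)]
    axes = planes[int(rng.integers(3))]
    rotated = np.rot90(patch, k=int(rng.integers(4)), axes=axes)

    # Translate by a random integer offset per axis, zero-filling the border.
    shifts = rng.integers(-max_shift, max_shift + 1, size=3)
    padded = np.pad(rotated, max_shift, mode="constant")
    start = max_shift - shifts
    d, h, w = rotated.shape
    return padded[
        start[0]:start[0] + d,
        start[1]:start[1] + h,
        start[2]:start[2] + w,
    ].copy()


def augment_dataset(patches, copies, seed=0):
    """Create `copies` augmented variants of every patch in `patches`."""
    rng = np.random.default_rng(seed)
    return [augment_patch(p, rng) for p in patches for _ in range(copies)]
```

In a comparison like the one described, `copies` would control the degree of augmentation, while the length of `patches` would control the number of unique training samples.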
Gonidakis, P, Jansen, B & Vandemeulebroucke, J 2020, Artificially augmenting data or adding more samples? A study on a 3D CNN for lung nodule classification. in HK Hahn & MA Mazurowski (eds), SPIE Medical Imaging 2020: Computer-Aided Diagnosis. vol. 11314, 113142F, Progress in Biomedical Optics and Imaging - Proceedings of SPIE, vol. 11314, pp. 113142F1-113142F6, SPIE Medical Imaging 2020, 15/02/20. https://doi.org/10.1117/12.2549810
Gonidakis, P., Jansen, B., & Vandemeulebroucke, J. (2020). Artificially augmenting data or adding more samples? A study on a 3D CNN for lung nodule classification. In H. K. Hahn, & M. A. Mazurowski (Eds.), SPIE Medical Imaging 2020: Computer-Aided Diagnosis (Vol. 11314, pp. 113142F1-113142F6). Article 113142F (Progress in Biomedical Optics and Imaging - Proceedings of SPIE; Vol. 11314). https://doi.org/10.1117/12.2549810
@inproceedings{289a1decceec4188a9726d67dfe1f3ec,
title = "Artificially augmenting data or adding more samples? A study on a 3D CNN for lung nodule classification",
abstract = "Convolutional neural networks are known to require large amounts of data to achieve optimal performance. In addition, data is commonly computationally augmented using a variety of geometric and intensity transformations to further extend the set of training samples. In medical imaging, annotated data is often scarce or costly to obtain, and there is considerable interest in methods to reduce the amount of data needed. In this work, we investigate the relative benefit of increasing the amount of original data, compared with computationally augmenting the training samples, for the case of false positive reduction of lung nodule candidates. To this end, we have implemented a previously published topology for classification, shown to achieve state-of-the-art results on the publicly available Luna16 dataset. Numerous models were trained using different amounts of unique training samples and different degrees of data augmentation involving rotations and translations, and the performance was compared. Results indicate that, in general, better performance is achieved when increasing the amount of data or augmenting the data more extensively, as expected. Surprisingly, however, we observed that after reaching a certain amount of unique training samples, data augmentation leads to significantly better performance compared to adding the same number of new samples to the training dataset. We hypothesize that the augmentation has aided in learning more general rotation- and translation-invariant features, leading to improved performance on unseen data. Future experiments include more detailed characterization of this behavior, and relating this to the topology and number of parameters to be trained.",
author = "Panagiotis Gonidakis and Bart Jansen and Jef Vandemeulebroucke",
year = "2020",
month = mar,
doi = "10.1117/12.2549810",
language = "English",
volume = "11314",
series = "Progress in Biomedical Optics and Imaging - Proceedings of SPIE",
pages = "113142F1--113142F6",
editor = "Hahn, {Horst K.} and Mazurowski, {Maciej A.}",
booktitle = "SPIE Medical Imaging 2020",
note = "SPIE Medical Imaging 2020 ; Conference date: 15-02-2020 Through 20-02-2020",
}