CINECA IRIS Institutional Research Information System

Machine learning (ML) approaches have been widely applied to medical data in order to find reliable classifiers to improve diagnosis and detect candidate biomarkers of a disease. However, as a powerful, multivariate, data-driven approach, ML can be misled by biases and outliers in the training set, finding sample-dependent classification patterns. This phenomenon often occurs in biomedical applications in which, due to the scarcity of the data, combined with their heterogeneous nature and complex acquisition process, outliers and biases are very common. In this work we present a new workflow for biomedical research based on ML approaches, that maximizes the generalizability of the classification. This workflow is based on the adoption of two data selection tools: an autoencoder to identify the outliers and the Confounding Index, to understand which characteristics of the sample can mislead classification. As a study-case we adopt the controversial research about extracting brain structural biomarkers of Autism Spectrum Disorders (ASD) from magnetic resonance images. A classifier trained on a dataset composed by 86 subjects, selected using this framework, obtained an area under the receiver operating characteristic curve of 0.79. The feature pattern identified by this classifier is still able to capture the mean differences between the ASD and Typically Developing Control classes on 1460 new subjects in the same age range of the training set, thus providing new insights on the brain characteristics of ASD. In this work, we show that the proposed workflow allows to find generalizable patterns even if the dataset is limited, while skipping the two mentioned steps and using a larger but not well designed training set would have produced a sample-dependent classifier.

Dealing with confounders and outliers in classification medical studies: The Autism Spectrum Disorders case study

Ferrari E.;Bosco P.;Calderoni S.;Oliva P.;Palumbo L.;Spera G.;Fantacci M. E.^Penultimo;Retico A.

2020-01-01

Abstract

Machine learning (ML) approaches have been widely applied to medical data in order to find reliable classifiers to improve diagnosis and detect candidate biomarkers of a disease. However, as a powerful, multivariate, data-driven approach, ML can be misled by biases and outliers in the training set, finding sample-dependent classification patterns. This phenomenon often occurs in biomedical applications in which, due to the scarcity of the data, combined with their heterogeneous nature and complex acquisition process, outliers and biases are very common. In this work we present a new workflow for biomedical research based on ML approaches, that maximizes the generalizability of the classification. This workflow is based on the adoption of two data selection tools: an autoencoder to identify the outliers and the Confounding Index, to understand which characteristics of the sample can mislead classification. As a study-case we adopt the controversial research about extracting brain structural biomarkers of Autism Spectrum Disorders (ASD) from magnetic resonance images. A classifier trained on a dataset composed by 86 subjects, selected using this framework, obtained an area under the receiver operating characteristic curve of 0.79. The feature pattern identified by this classifier is still able to capture the mean differences between the ASD and Typically Developing Control classes on 1460 new subjects in the same age range of the training set, thus providing new insights on the brain characteristics of ASD. In this work, we show that the proposed workflow allows to find generalizable patterns even if the dataset is limited, while skipping the two mentioned steps and using a larger but not well designed training set would have produced a sample-dependent classifier.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2020
			
	Codice DOI
	
				https://dx.doi.org/10.1016/j.artmed.2020.101926
			
	Tutti gli autori
	
						Ferrari, E.; Bosco, P.; Calderoni, S.; Oliva, P.; Palumbo, L.; Spera, G.; Fantacci, M. E.; Retico, A.
					
	Appare nelle tipologie:
	
				1.1 Articolo in rivista

File in questo prodotto:

File	Dimensione	Formato
1-s2.0-S0933365719306086-main.pdf non disponibili Descrizione: Articolo principale Tipologia: Versione finale editoriale Licenza: NON PUBBLICO - accesso privato/ristretto Dimensione 3.53 MB Formato Adobe PDF Visualizza/Apri Richiedi una copia	3.53 MB	Adobe PDF	Visualizza/Apri Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11568/1055104

Citazioni

8

28

23

social impact