Machine learning techniques in breast cancer preventive diagnosis: a review

Anastasi, Giada; Leporini, Barbara
2024-01-01

Abstract

Breast cancer (BC) is the most prevalent form of cancer among women. Recent research has demonstrated the potential of Machine Learning (ML) techniques for predicting five-year BC risk from personal health data. Support Vector Machine (SVM), Random Forest, K-Nearest Neighbour (K-NN), Naive Bayes, Neural Network, Decision Tree (DT), Logistic Regression (LR), Discriminant Analysis, and their variants are commonly employed in ML for BC analysis. This study investigates the factors that influence the performance of ML techniques in the domain of BC prevention, with a focus on dataset size and feature selection. Specifically, it examines the effect of dataset cardinality, feature selection, and model selection on analytical performance in terms of Accuracy and Area Under the Curve (AUC). To this end, 3917 papers published in the previous 5 years were automatically retrieved from Scopus and PubMed and, after applying inclusion and exclusion criteria, 54 articles were selected for the analysis. Our findings indicate that adequate dataset cardinality and effective feature selection have a greater impact on performance than the choice of model, as corroborated by one of the reviewed studies, which obtains extremely good results with all of the models employed.
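As an illustration of the two performance measures named in the abstract, the following is a minimal sketch of how Accuracy and AUC can be computed for a binary classifier. It is not taken from the paper: the function names `accuracy` and `roc_auc`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for this example.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predicted labels that match the true labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (1 = positive class).
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.2, 0.5]

# Hypothetical decision threshold of 0.5 turns scores into labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(f"Accuracy: {accuracy(y_true, y_pred):.3f}")
print(f"AUC:      {roc_auc(y_true, scores):.3f}")
```

Accuracy depends on the chosen threshold, whereas AUC depends only on how the scores rank positive cases against negative ones. This is one reason the two measures are often reported together.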
2024
Anastasi, Giada; Franchini, Michela; Pieroni, Stefania; Buzzi, Marina; Buzzi, Maria Claudia; Leporini, Barbara; Molinaro, Sabrina
Files in this item:
File: s11042-024-18775-y.pdf
Access: open access
Type: Final published version
License: Creative Commons
Size: 3.16 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11568/1274242
Citations
  • PubMed Central: not available
  • Scopus: 1
  • Web of Science: 1