CINECA IRIS Institutional Research Information System

Explainable machine learning identifies a polygenic risk score as a key predictor of pancreatic cancer risk in the UK Biobank

Peduzzi, Giulia (first author); Felici, Alessio; Pellungrini, Roberto; Campa, Daniele (last author)

2024-01-01

Abstract

Background: Predicting the risk of developing pancreatic ductal adenocarcinoma (PDAC) is of paramount importance, given its high mortality rate. Current PDAC risk prediction models rely on a limited number of variables, do not include genetics, and have modest accuracy. Aim: This study aimed to develop an interpretable PDAC risk prediction model based on machine learning (ML). Methods: Five ML models (Adaptive Boosting, eXtreme Gradient Boosting, CatBoost, Deep Forest, and Random Forest), built on 56 exposome variables and a polygenic risk score (PRS), were tested in 654 PDAC cases and 1,308 controls from the UK Biobank. Additionally, SHapley Additive exPlanation (SHAP) and Global model Interpretation via Recursive Partitioning (Girp) were employed to explain the models. Results: All models performed similarly; judged by recall, the best was CatBoost (77.10%). SHAP highlighted age and the PRS as the primary contributors across all models. Girp derived rules to discern cases from controls, and age, PRS, and pancreatitis appeared in most of them. Conclusion: The predictive models tested exhibited good performance, indicating their potential for clinical application in the near future, with the PRS playing a key role in identifying high-risk individuals, as demonstrated by the explainers.
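The abstract ranks the models by recall, that is, the share of true PDAC cases that a model correctly flags. As a minimal sketch of how that metric is computed, the Python snippet below uses a hypothetical helper named `recall` and a small set of made-up labels. Neither comes from the study, and the toy numbers are purely illustrative and unrelated to the reported 77.10%.

```python
def recall(y_true, y_pred, positive=1):
    """Return the fraction of actual positives that were predicted positive."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    if true_pos + false_neg == 0:
        raise ValueError("y_true contains no positive samples")
    return true_pos / (true_pos + false_neg)


# Toy labels (hypothetical, NOT study data): 1 = PDAC case, 0 = control.
# The 1:2 case-to-control ratio mirrors the design described in the abstract.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

toy_recall = recall(y_true, y_pred)
print(f"recall = {toy_recall:.2%}")
```

Recall ignores false positives among controls, so a model can score well on it while still over-flagging healthy individuals; it is usually read alongside precision or specificity.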

Year: 2024

DOI: https://dx.doi.org/10.1016/j.dld.2024.11.010

All authors: Peduzzi, Giulia; Felici, Alessio; Pellungrini, Roberto; Campa, Daniele

Use this identifier to cite or link to this document: https://hdl.handle.net/11568/1287250
