Background: Predicting the risk of developing pancreatic ductal adenocarcinoma (PDAC) is of paramount importance, given its high mortality rate. Current PDAC risk prediction models rely on a limited number of variables, do not include genetics, and have a modest accuracy. Aim: This study aimed to develop an interpretable PDAC risk prediction model, based on machine learning (ML). Methods: Five ML models (Adaptive Boosting, eXtreme Gradient Boosting, CatBoost, Deep Forest and Random Forest) built on 56 exposome variables and a polygenic risk score (PRS) were tested in 654 PDAC cases and 1,308 controls of the UK Biobank. Additionally, SHapley Additive exPlanation (SHAP) and Global model Interpretation via the Recursive Partitioning (Girp) were employed to explain the models. Results: All models provided similar performance, but based on recall the best was CatBoost (77.10 %). SHAP highlighted age and the PRS as primary contributors across all models. Girp developed rules to discern cases from controls, identifying age, PRS, and pancreatitis in most of the rules. Conclusion: The predictive models tested have exhibited good performance, indicating their potential application in the clinical field in the near future, with the PRS playing a key role in identifying high-risk individuals as demonstrated by the explainers.
Explainable machine learning identifies a polygenic risk score as a key predictor of pancreatic cancer risk in the UK Biobank
Peduzzi, GiuliaPrimo
;Felici, Alessio;Pellungrini, Roberto;Campa, Daniele
Ultimo
2024-01-01
Abstract
Background: Predicting the risk of developing pancreatic ductal adenocarcinoma (PDAC) is of paramount importance, given its high mortality rate. Current PDAC risk prediction models rely on a limited number of variables, do not include genetics, and have a modest accuracy. Aim: This study aimed to develop an interpretable PDAC risk prediction model, based on machine learning (ML). Methods: Five ML models (Adaptive Boosting, eXtreme Gradient Boosting, CatBoost, Deep Forest and Random Forest) built on 56 exposome variables and a polygenic risk score (PRS) were tested in 654 PDAC cases and 1,308 controls of the UK Biobank. Additionally, SHapley Additive exPlanation (SHAP) and Global model Interpretation via the Recursive Partitioning (Girp) were employed to explain the models. Results: All models provided similar performance, but based on recall the best was CatBoost (77.10 %). SHAP highlighted age and the PRS as primary contributors across all models. Girp developed rules to discern cases from controls, identifying age, PRS, and pancreatitis in most of the rules. Conclusion: The predictive models tested have exhibited good performance, indicating their potential application in the clinical field in the near future, with the PRS playing a key role in identifying high-risk individuals as demonstrated by the explainers.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.