ProSECFPs: A Novel Fingerprint-Based Protein Representation Method for Missense Mutation Pathogenicity Prediction

Poles, Clarissa (first author); Di Stefano, Miriana; Piazza, Lisa; Bononi, Giulia; Poli, Giulio; Macchia, Marco; Tuccinardi, Tiziano
2025-01-01

Abstract

Effective computational representations of protein sequences are crucial for many areas of computational biology and bioinformatics. An ideal representation is computationally efficient, scalable, informative, flexible across contexts, and broadly applicable. To address these requirements, we propose Protein Sequence Extended-Connectivity Fingerprints (ProSECFPs), a novel fingerprinting method inspired by Extended-Connectivity Fingerprints (ECFPs), which are commonly used in chemoinformatics to represent small molecules. ProSECFPs capture the physicochemical characteristics, sequence-specific details, and structural attributes intrinsic to protein sequences. We demonstrate their effectiveness and versatility by using them to predict the pathogenicity of missense mutations with a diverse set of machine learning (ML) and deep learning (DL) algorithms. Our results indicate that ProSECFPs, especially their frequency-aware variants, achieve competitive or superior accuracy compared with established protein sequence descriptors. We attribute this performance to their ability to integrate amino acid composition with detailed sequence information. ProSECFPs thus provide a robust, adaptable, and informative computational representation of proteins and a foundation for addressing interdisciplinary challenges in bioinformatics, genomics, and protein engineering.
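The abstract does not describe how ProSECFPs are computed, so the following is only an illustrative sketch of the general idea it names: an ECFP-style fingerprint in which local neighbourhoods of increasing radius are hashed into a fixed-length vector, in either a binary or a frequency-aware (count) form. It is not the authors' algorithm. The function name `sequence_fingerprint`, the parameters `radius`, `n_bits`, and `counts`, the use of residue-level windows, and the SHA-256 hashing are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def _bucket(label: str, n_bits: int) -> int:
    """Map a feature label to a vector index (hypothetical hashing choice)."""
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits


def sequence_fingerprint(seq, radius=2, n_bits=1024, counts=False):
    """Toy ECFP-like fingerprint over residues; NOT the published ProSECFP method.

    For every residue and every radius r in 0..radius, the window of residues
    within r positions is hashed into one of n_bits bins. With counts=False
    each bin holds 0 or 1; with counts=True it holds the number of hits.
    """
    seq = seq.upper()
    hits = Counter()
    for i in range(len(seq)):
        for r in range(radius + 1):
            window = seq[max(0, i - r): i + r + 1]
            # The radius is part of the label so that windows truncated at
            # the sequence ends stay distinct across radii.
            hits[_bucket(f"{r}:{window}", n_bits)] += 1
    vector = [0] * n_bits
    for index, n in hits.items():
        vector[index] = n if counts else 1
    return vector


# Hypothetical wild-type sequence and a single-residue (missense) variant.
wild_type = "MKTAYIAKQR"
mutant = "MKTAYIVKQR"
fp_wild = sequence_fingerprint(wild_type, counts=True)
fp_mutant = sequence_fingerprint(mutant, counts=True)
changed_bins = sum(a != b for a, b in zip(fp_wild, fp_mutant))
print(f"bins that differ between wild type and mutant: {changed_bins}")
```

In this sketch a single substitution only affects the bins hit by windows that overlap the mutated position, which is one way a fingerprint difference could serve as input to a pathogenicity classifier. The count variant also retains how often each local pattern occurs, whereas the binary variant records only its presence.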
2025
Poles, Clarissa; Di Stefano, Miriana; Piazza, Lisa; Bononi, Giulia; Poli, Giulio; Macchia, Marco; Tuccinardi, Tiziano; Giordano, Antonio

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11568/1355967