Germline-aware deep learning models and benchmarks for predicting antibody VH–VL pairing

Joubbi S.; D'Arco E.; Milazzo P.; Micheli A.
2025-01-01

Abstract

Variable heavy (VH) and variable light (VL) chain pairing is a critical determinant of antibody diversity, stability, and antigen-binding specificity. Identifying productive VH–VL combinations experimentally is labor-intensive and costly, motivating the development of computational methods that can more efficiently predict compatible heavy–light chain pairs. In this work, we present a comprehensive framework that includes a new benchmark dataset and three deep learning models, each trained with a different negative sampling strategy: random pairing, V-gene mismatching, and full V(D)J germline mismatching. Our dataset includes natural pairs and these three types of synthetic negatives to simulate increasingly realistic biological constraints. Furthermore, we present a lightweight yet highly effective BERT-based model that achieves over 90% accuracy in discriminating natural from synthetic VH–VL pairs. Through extensive evaluation, we demonstrate that V(D)J-informed negative sampling significantly improves model generalization and biological interpretability. By providing reproducible baselines and a biologically grounded benchmark, this work lays the foundation for future development of efficient computational tools in antibody engineering.
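The abstract names three negative sampling strategies but does not define how the synthetic pairs are built. The following is a minimal sketch under one possible reading, not the authors' implementation: a negative is formed by combining the heavy chain of one natural pair with the light chain of another, and the two source pairs must disagree on every germline gene in a strategy-specific set (no genes for random pairing, the heavy and light V genes for V-gene mismatching, and all V, D and J genes for full V(D)J mismatching). The `Pair` type, the gene keys (`hv`, `hd`, `hj`, `lv`, `lj`), the function `sample_negatives` and the toy sequences are all hypothetical.

```python
import random
from typing import NamedTuple


class Pair(NamedTuple):
    vh: str      # heavy-chain variable sequence
    vl: str      # light-chain variable sequence
    genes: dict  # hypothetical germline annotation: hv, hd, hj, lv, lj


# Assumption: genes on which the two source pairs must differ, per strategy.
GENE_SETS = {
    "random": (),
    "v_gene": ("hv", "lv"),
    "vdj": ("hv", "hd", "hj", "lv", "lj"),
}


def sample_negatives(pairs, strategy, n, seed=0):
    """Return up to n distinct synthetic (vh, vl) negatives."""
    rng = random.Random(seed)
    required = GENE_SETS[strategy]
    natural = {(p.vh, p.vl) for p in pairs}
    negatives = set()
    for _ in range(100 * n):  # bounded number of attempts
        if len(negatives) >= n:
            break
        a, b = rng.sample(pairs, 2)  # two different source pairs
        if (a.vh, b.vl) in natural:
            continue  # never relabel a natural pair as a negative
        # all() over an empty gene set is True, so "random" accepts any swap.
        if all(a.genes[g] != b.genes[g] for g in required):
            negatives.add((a.vh, b.vl))
    return sorted(negatives)


# Toy data with made-up, truncated sequences (illustration only).
toy = [
    Pair("QVQLVQ", "DIQMTQ", {"hv": "IGHV1-69", "hd": "IGHD3-10",
                              "hj": "IGHJ4", "lv": "IGKV1-39", "lj": "IGKJ1"}),
    Pair("EVQLVE", "EIVLTQ", {"hv": "IGHV3-23", "hd": "IGHD3-10",
                              "hj": "IGHJ4", "lv": "IGKV3-20", "lj": "IGKJ1"}),
    Pair("QLQLQE", "SYELTQ", {"hv": "IGHV4-39", "hd": "IGHD2-2",
                              "hj": "IGHJ6", "lv": "IGLV3-21", "lj": "IGLJ2"}),
]

for strategy in GENE_SETS:
    print(strategy, sample_negatives(toy, strategy, n=4))
```

Under this reading the strategies are nested: every negative accepted by the V(D)J rule is also accepted by the V-gene rule. In the toy data the first two pairs share their D and J genes, so swaps between them count as V-gene negatives but not as V(D)J negatives.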
2025
Joubbi, S.; D'Arco, E.; Maccari, G.; Milazzo, P.; Micheli, A.

Use this identifier to cite or link to this document: https://hdl.handle.net/11568/1333102