Germline-aware deep learning models and benchmarks for predicting antibody VH–VL pairing
Joubbi S.;D'Arco E.;Milazzo P.;Micheli A.
2025-01-01
Abstract
Variable heavy (VH) and variable light (VL) chain pairing is a critical determinant of antibody diversity, stability, and antigen-binding specificity. Identifying productive VH–VL combinations experimentally is labor-intensive and costly, motivating the development of computational methods that can more efficiently predict compatible heavy–light chain pairs. In this work, we present a comprehensive framework that includes a new benchmark dataset and three deep learning models, each trained with a different negative sampling strategy: random pairing, V-gene mismatching, and full V(D)J germline mismatching. Our dataset includes natural pairs and these three types of synthetic negatives to simulate increasingly realistic biological constraints. Furthermore, we present a lightweight yet highly effective BERT-based model that achieves over 90% accuracy in discriminating natural from synthetic VH–VL pairs. Through extensive evaluation, we demonstrate that V(D)J-informed negative sampling significantly improves model generalization and biological interpretability. By providing reproducible baselines and a biologically grounded benchmark, this work lays the foundation for future development of efficient computational tools in antibody engineering.
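The three negative sampling strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record fields, the function name `sample_negative`, and the exact mismatch rules are assumptions made for illustration. In particular, the sketch assumes that "V-gene mismatching" means the donor light chain uses a different light V gene than the natural partner, and it approximates "full V(D)J germline mismatching" by requiring both the light V and J genes to differ. The abstract does not specify these rules.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedAntibody:
    """One natural VH-VL pair with (hypothetical) light-chain germline labels."""
    heavy_seq: str
    light_seq: str
    light_v_gene: str
    light_j_gene: str


def sample_negative(records, index, strategy, rng, max_tries=1000):
    """Return a synthetic (heavy_seq, light_seq) negative for records[index].

    strategy is one of "random", "v_gene", "vdj" (names are assumptions).
    Returns None if no suitable donor is found within max_tries draws.
    """
    if strategy not in {"random", "v_gene", "vdj"}:
        raise ValueError(f"unknown strategy: {strategy}")
    anchor = records[index]
    natural = {(r.heavy_seq, r.light_seq) for r in records}
    for _ in range(max_tries):
        donor = rng.choice(records)
        # Never emit a pair that is observed as a natural pair.
        if (anchor.heavy_seq, donor.light_seq) in natural:
            continue
        same_v = donor.light_v_gene == anchor.light_v_gene
        same_j = donor.light_j_gene == anchor.light_j_gene
        # Assumed rule: the donor light chain must use a different V gene.
        if strategy == "v_gene" and same_v:
            continue
        # Assumed rule: the donor must differ in both V and J genes.
        if strategy == "vdj" and (same_v or same_j):
            continue
        return anchor.heavy_seq, donor.light_seq
    return None


# Toy records with placeholder sequences; gene names are illustrative only.
records = [
    PairedAntibody("HA", "LA", "IGKV1-39", "IGKJ1"),
    PairedAntibody("HB", "LB", "IGKV1-39", "IGKJ2"),
    PairedAntibody("HC", "LC", "IGKV3-20", "IGKJ1"),
    PairedAntibody("HD", "LD", "IGLV2-14", "IGLJ3"),
]
rng = random.Random(0)
negative = sample_negative(records, 0, "vdj", rng)
```

Under these assumed rules, each stricter strategy narrows the pool of admissible donor light chains, which is one way to read the abstract's "increasingly realistic biological constraints".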


