Leveraging large language models on free-text symptoms from participatory surveillance enhances pertussis forecasting in the United States

De Angelis, Luigi; Tarallo, Samuele; Rizzo, Caterina; Gertz, Autumn; Baglivo, Francesco; Brownstein, John S.; Rader, Benjamin

doi:10.1186/s12879-026-13452-x

Background: Declines in childhood vaccination in the U.S. have contributed to a resurgence of vaccine-preventable diseases, including a notable increase in pertussis cases. Traditional pertussis surveillance is limited by underdiagnosis and underreporting. Participatory surveillance systems such as Outbreaks Near Me (ONM) provide an additional population-level data stream by capturing self-reported symptoms. Although pertussis signals are difficult to detect because of low incidence and symptom overlap with other infections, ONM collects free-text descriptions that may contain pertussis-specific information. Advances in large language models (LLMs) make it possible to extract relevant signals from unstructured text, which may improve forecasting.

Methods: We analyzed U.S. pertussis case data from the CDC and ONM reports from 2022 to 2025. ONM reports were filtered for prolonged cough without alternative diagnoses, then refined with a two-step GPT-4-based pipeline that summarized participant reports and excluded cases inconsistent with pertussis, to increase case specificity. Three datasets were created: CDC-only cases, CDC plus filtered ONM cases, and CDC plus ONM cases after LLM processing. Aggregated time series were split into a training set (2022-2024) and a test set (first seven months of 2025). We trained multiple forecasting models (ARIMA, XGBoost, and linear regression) on the 2022-2024 data, first using CDC-only data to establish a baseline. The best-performing model was then applied to the two datasets that incorporate ONM participatory data. Performance was evaluated using Mean Absolute Error (MAE).

Results: CDC-reported pertussis cases totaled 862 in 2022, 2,512 in 2023, 11,276 in 2024, and 5,937 in the first seven months of 2025. Of 2,741 ONM-suspected cases, 957 remained after LLM refinement. XGBoost yielded the best baseline performance (MAE 26.65). Incorporating ONM data improved performance: MAE decreased to 25.60 with filtered ONM cases and to 24.69 with LLM-processed cases.

Conclusions: Integrating LLM processing of participatory surveillance data with traditional surveillance improved the accuracy of pertussis forecasting, as measured by MAE. This approach introduces a novel way to leverage free-text data, offering a promising pathway to augment traditional public health surveillance systems.
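The evaluation described in the abstract (a chronological train/test split scored by MAE) can be illustrated with a minimal sketch. This is not the authors' code: the helper names `chronological_split`, `mean_absolute_error`, and `relative_reduction` are hypothetical, the time granularity of the series is not stated in the abstract, and only the three MAE values are taken from the text.

```python
from datetime import date


def chronological_split(series, cutoff):
    """Split (date, value) pairs into train (before cutoff) and test (on or after).

    Hypothetical helper: the abstract only says training used 2022-2024
    and testing used the first seven months of 2025.
    """
    train = [(d, v) for d, v in series if d < cutoff]
    test = [(d, v) for d, v in series if d >= cutoff]
    return train, test


def mean_absolute_error(actual, predicted):
    """MAE: the average absolute difference between observed and forecast counts."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_reduction(baseline_mae, new_mae):
    """Fractional decrease in MAE relative to the baseline model."""
    return (baseline_mae - new_mae) / baseline_mae


# MAE values as reported in the abstract (XGBoost on each dataset).
reported_mae = {
    "CDC only": 26.65,
    "CDC + filtered ONM": 25.60,
    "CDC + LLM-processed ONM": 24.69,
}

baseline = reported_mae["CDC only"]
for name, mae in reported_mae.items():
    change = relative_reduction(baseline, mae)
    print(f"{name}: MAE {mae:.2f} ({change:.1%} below baseline)")
```

Applied to the reported figures, the relative reduction in MAE works out to roughly 3.9% for the filtered ONM dataset and roughly 7.4% for the LLM-processed dataset.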

CINECA IRIS Institutional Research Information System