Simulating metagenomic data for novel pathogen outbreak detection – BlueDot Impact

Simulating metagenomic data for novel pathogen outbreak detection

By Marie Krátká (Published on January 27, 2025)

This project was a runner-up for the "Novel research (quantitative)" prize in our Pandemics (Sept 2024) course. The text below is an excerpt from the final project.

Introduction

The detection of outbreaks caused by a novel pathogen usually depends on healthcare professionals noticing and reporting unusual cases. However, this approach can be inefficient and slow, particularly for diseases with a high proportion of asymptomatic infections (e.g. polio) or delayed symptom onset (e.g. HIV/AIDS). Such “stealth outbreaks” can spread undetected for long periods, delaying the public health response and complicating containment efforts.

To address the challenges posed by stealth outbreaks, it is crucial to develop early warning systems that do not depend on symptomatic patient reports. One promising approach is to leverage environmental surveillance. Even asymptomatic carriers often shed pathogen particles, which can be detected in samples from strategically selected locations, such as metropolitan areas or high-volume transit hubs. Sample types may include wastewater, air, or pooled specimens like nasal swabs and blood samples. Sequencing these samples produces metagenomic profiles detailing the composition of organism communities, enabling the identification of outbreaks by detecting a pathogen's increasing abundance over time, even without prior knowledge of the pathogen.

Developing reliable workflows to detect such signals requires datasets where outbreak scenarios are explicitly modeled. However, obtaining such datasets from natural conditions presents significant challenges: it demands extensive sampling, and the exact dynamics of an outbreak—such as the proportion of infected individuals contributing to the samples—are often unknown. Moreover, to ensure the robustness of detection workflows, it is necessary to test them under diverse conditions, including varying pathogen characteristics and transmission parameters.

Simulated datasets offer a practical alternative. By controlling the parameters of the outbreak and the dataset, it should be possible to systematically explore different scenarios and validate data analysis workflows. This work outlines and demonstrates a methodology for creating a simulated dataset of a viral outbreak. First, we selected and processed a time series of metagenomic wastewater data from a public database to use as a realistic background. Next, we chose a pathogen genome from a public repository, and generated computer-simulated sequencing reads of the pathogen. Finally, we modeled pathogen abundance over time based on an exponential growth curve, and combined the background and pathogen reads in the calculated ratios.
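The last step, modeling abundance with an exponential curve and mixing reads in the calculated ratios, can be illustrated with a minimal sketch. This is not the project's actual code. The function names, the doubling-time parameterization, and the definition of abundance as the pathogen's share of all reads in the combined sample are assumptions made for illustration, and reads are represented as plain strings rather than FASTQ records.

```python
import math
import random


def pathogen_fraction(day, initial_fraction, doubling_time_days):
    """Modeled share of pathogen reads on a given day (exponential growth).

    All parameters are hypothetical; the source does not state specific values.
    """
    growth_rate = math.log(2) / doubling_time_days
    return initial_fraction * math.exp(growth_rate * day)


def spike_in_counts(background_read_counts, sampling_days,
                    initial_fraction, doubling_time_days):
    """Number of pathogen reads to add to each background sample.

    Assumes the target fraction f is pathogen / (pathogen + background),
    so pathogen = background * f / (1 - f).
    """
    counts = []
    for day, n_background in zip(sampling_days, background_read_counts):
        f = pathogen_fraction(day, initial_fraction, doubling_time_days)
        if f >= 1.0:
            raise ValueError(f"modeled fraction {f} on day {day} is not below 1")
        counts.append(round(n_background * f / (1.0 - f)))
    return counts


def mix_reads(background_reads, pathogen_reads, n_pathogen, seed=0):
    """Combine a background sample with n_pathogen randomly chosen pathogen reads."""
    rng = random.Random(seed)
    # Sampling without replacement: the simulated pathogen read pool
    # must contain at least n_pathogen reads.
    chosen = rng.sample(pathogen_reads, n_pathogen)
    mixed = list(background_reads) + chosen
    rng.shuffle(mixed)  # avoid pathogen reads clustering at the end of the sample
    return mixed


if __name__ == "__main__":
    # Hypothetical weekly samples of one million background reads each.
    days = [0, 7, 14, 21]
    background_sizes = [1_000_000] * len(days)
    print(spike_in_counts(background_sizes, days,
                          initial_fraction=1e-5, doubling_time_days=7))
```

Computing the spike-in count from the background read count, rather than using a fixed number of pathogen reads, keeps the modeled fraction meaningful when sequencing depth differs between time points.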

Beyond implementing these steps, we prioritized outlining general guidelines and key considerations for simulating such metagenomic datasets. This should make it easier to develop the workflow further and to adapt the inputs and methodology to other outbreak scenarios.

Full project

You can view the full project here.
