Semi-artificial datasets as a resource for validation of bioinformatics pipelines for plant virus detection

Lucie Tamisier, Annelies Haegeman, Yoika Foucart, Nicolas Fouillien, Maher Al Rwahnih, Nihal Buzkan, Thierry Candresse, Michela Chiumenti, Kris De Jonghe, Marie Lefebvre, Paolo Margaria, Jean-Sébastien Reynard, Kristian Stevens, Denis Kutnjak, Sebastien Massart

Research output: Contribution to journal; A2: International peer-reviewed article (not A1-type); peer-reviewed


The widespread use of High-Throughput Sequencing (HTS) for detecting plant viruses
and sequencing plant virus genomes has generated large amounts of data,
along with bioinformatics challenges in processing them. Many bioinformatics pipelines for virus
detection are available, which makes choosing a suitable one difficult. Robust
benchmarking is needed for an unbiased comparison of these pipelines, but there is
currently a lack of reference datasets that could be used for this purpose. We present 7
semi-artificial datasets composed of real RNA-seq datasets from virus-infected plants
spiked with artificial virus reads. Each dataset addresses challenges that could prevent virus
detection. We also present 3 real datasets with a challenging virus composition, as well
as 8 completely artificial datasets for testing haplotype reconstruction software. With these
datasets, which address several diagnostic challenges, we hope to encourage virologists,
diagnosticians and bioinformaticians to evaluate and benchmark their pipeline(s).
Original language: English
Article number: e53
Journal: Peer Community Journal
Issue number: 1
Pages (from-to): 1-15
Number of pages: 15
Publication status: Published - 2-Dec-2021
