Recent developments in high-throughput sequencing (also called next-generation sequencing, NGS) technologies and in bioinformatics have drastically changed research on viral pathogens and are now raising growing interest in virus diagnostics. However, any diagnostic technique has to be included in standardized protocols. A huge diversity of bioinformatics protocols for virus discovery has been reported in the scientific literature but, to date, their reliability for diagnostic purposes has not been addressed. The objective of this work was therefore to compare the performance of existing bioinformatics pipelines, and of the interpretation of their results, through a double-blind, large-scale proficiency test based on a set of ten fastq files and involving 21 laboratories from 16 countries. The fastq files contained 50,000 (3 files), 250,000 (4 files) and 2.5 M (3 files) sequences of 21-24 nt derived from three samples. The false positive rate was only 0.5% and was mainly related to the identification of integrated sequences or to misinterpretation of the results. The overall sensitivity of detection was 57% and ranged from 35% to 100% among laboratories, with a marked effect of rarefaction for some laboratories. A principal component analysis and correlation studies highlighted the most important parameters for an appropriate diagnosis. The repeatability of detection was 73%. This work also underlined (i) the complexity of discovering new viruses by NGS, (ii) the difficulty of detecting viral pathogens represented by a low number of siRNA reads, and (iii) the inconsistencies of databases and their impact on results. Overall, this work brings key insights into the reliability of bioinformatics pipelines and identifies key parameters for achieving reliable detection of viruses in a diagnostic setting using siRNA sequencing.
Acknowledgement: This article is based upon work from COST Action FA1407 (www.cost-divas.eu), supported by COST (European Cooperation in Science and Technology).