NGSpeciesID: DNA barcode and amplicon consensus generation from
long-read sequencing data
Abstract
Third-generation sequencing technologies, such as Oxford Nanopore
Technologies (ONT) and Pacific Biosciences (PacBio), have gained
popularity in recent years. These platforms can generate millions of
long-read sequences. This is not only advantageous for genome sequencing
projects, but also for amplicon-based high-throughput sequencing
experiments, such as DNA barcoding. However, the relatively high error
rates associated with these technologies still pose challenges for
generating high-quality consensus sequences. Here we present
NGSpeciesID, a program that generates highly accurate consensus
sequences from long-read amplicon sequencing data, including ONT
and PacBio data. The tool clusters the reads to help filter out
contaminants or reads with high error rates and employs polishing
strategies specific to each sequencing platform. We show that
NGSpeciesID produces consensus sequences with improved usability, by
minimizing preprocessing and software installation, and improved
scalability, by enabling rapid processing of hundreds to thousands of
samples, while maintaining consensus accuracy similar to that of
current pipelines.