wholeskim: Utilizing genome skims for taxonomically annotating ancient
DNA metagenomes
- Lucas Elliott,
- Frédéric Boyer,
- Téo Lemane,
- Inger Alsos,
- Eric Coissac
Abstract
Inferring community composition from shotgun sequencing of environmental
DNA is highly dependent on the completeness of reference databases used
to assign taxonomic information as well as the pipeline used. While the
number of complete, fully assembled reference genomes is increasing
rapidly, their taxonomic coverage is generally too sparse to use them to
build complete reference databases that span all or most of the target
taxa. Low-coverage whole-genome sequencing, or genome skimming, provides
a cost-effective and scalable alternative source of genome-wide
information in the interim. Without enough coverage to assemble large
contigs of nuclear DNA, much of the utility of a genome skim in the
context of taxonomic annotation is found in its short read form.
However, previous methods have not been able to fully leverage the data
in this format. We demonstrate the utility of wholeskim, a pipeline for
the indexing of k-mers present in genome skims and subsequent querying
of these indices with short DNA reads. Wholeskim expands on the
functionality of kmindex, a tool that uses Bloom filters to
efficiently index and query billions of k-mers. Applied to a collection
of thousands of plant genome skims, wholeskim is the only software
able to index and query the skims in their unassembled form. We also
explore the effects of taxonomic and genomic completeness of the
reference database on the accuracy and sensitivity of read assignment.
27 Aug 2024 Submitted to Molecular Ecology Resources
29 Aug 2024 Submission Checks Completed
29 Aug 2024 Assigned to Editor
29 Aug 2024 Review(s) Completed, Editorial Evaluation Pending
09 Sep 2024 Reviewer(s) Assigned
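
The abstract describes indexing the k-mers of unassembled genome skims in Bloom filters and then querying those indices with short reads. The general idea can be illustrated with a minimal Python sketch. This is not the wholeskim or kmindex implementation: the class `BloomFilter`, the functions `kmers`, `build_index`, and `shared_fraction`, and all parameter values (filter size, number of hashes, k) are hypothetical names and choices made for illustration, and refinements such as canonical k-mers are omitted.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting one hash function.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(seq, k):
    """Return all overlapping k-mers of seq (upper-cased)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(skim_reads, k):
    """Index every k-mer of the unassembled skim reads of one reference."""
    index = BloomFilter()
    for read in skim_reads:
        for kmer in kmers(read, k):
            index.add(kmer)
    return index


def shared_fraction(read, index, k):
    """Fraction of the query read's k-mers reported present in the index."""
    query = kmers(read, k)
    if not query:
        return 0.0
    return sum(kmer in index for kmer in query) / len(query)


# Hypothetical toy data: one "skim" read and one query read.
index = build_index(["ACGTACGTTGCAAGCT"], k=5)
print(shared_fraction("GTACGTTGCA", index, k=5))
```

In a setting like the one the abstract describes, one such index would presumably be built per reference skim, and a query read would be assigned according to which references share a sufficient fraction of its k-mers; the threshold and assignment rule are left out here because the abstract does not specify them.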