POOLPARTY2: An integrated pipeline for analyzing pooled or indexed low
coverage whole genome sequencing data to discover the genetic basis of
diversity
Abstract
Whole genome sequencing data allow variation to be surveyed across the
entire genome, reducing the constraint of balancing genome sub-sampling
with recombination rates and linkage between sampled markers and target
loci.
As sequencing costs decrease, low coverage whole genome sequencing of
pooled or indexed-individual samples is commonly used to identify
loci associated with phenotypes or environmental axes in non-model
organisms. There are, however, relatively few publicly available
bioinformatic pipelines designed explicitly to analyze these types of
data, and fewer still that process the raw sequencing data, provide
useful metrics of quality control, and then execute analyses. Here, we
present an updated version of a bioinformatics pipeline called
POOLPARTY2 that can effectively handle either pooled or indexed DNA
samples and includes new features to improve computational efficiency.
Using simulated data, we demonstrate the ability of our pipeline to
recover segregating variants, estimate their allele frequencies
accurately, and identify genomic regions harboring loci under selection.
Using the same simulated dataset, we benchmark our pipeline against
another bioinformatic suite, ANGSD, and illustrate the compatibility and
complementarity of the two suites by using ANGSD to generate genotype
likelihoods from alignment files and variants provided by POOLPARTY2,
which then serve as input for identifying linkage outlier regions.
Finally, we apply our updated pipeline to an empirical dataset of low
coverage whole genome sequencing data from uncurated population samples
of Columbia River steelhead trout (Oncorhynchus mykiss); the results
demonstrate the genomic impacts of decades of artificial selection in a
prominent hatchery stock.