Methods
The bioinformatic pipeline: PoolParty2
PoolParty2 is an updated suite of scripts written in the BASH
and R computer languages that create and manipulate text files,
including sequence read files, and call freely distributed programs to
efficiently operate on the data as needed. After installation of
dependencies in a Linux computing environment, for which we provide
explicit instructions on our GitHub page
(https://github.com/stuartwillis/poolparty)
and most of which are available using the Conda package and
environment management system (Anaconda Software Distribution), users
need only provide sequence read files and a haploid genome assembly, a
text file listing sequence read files with their group or population
affiliation, and tailored configuration files for each of the three
modules as appropriate. We distribute two tutorials with the
scripts that help ensure that dependencies are accessible and illustrate
the main features of the pipeline. We additionally provide example code
to assist users in conveying output from the PoolParty2 modules
into angsd and associated utilities.
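As an illustration of the sample listing described above, the following minimal Python sketch groups read files by population. The two-column, tab-delimited layout, the file names, the population labels, and the function name `read_sample_list` are assumptions made for this example only; the tutorials on the GitHub page define the format the pipeline actually expects.

```python
import csv
import io
from collections import defaultdict


def read_sample_list(handle):
    """Group read files by population from a two-column, tab-delimited listing.

    Assumed layout (hypothetical): <read file> TAB <population label>.
    """
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-separated fields, got {len(row)}: {row}")
        read_file, population = row[0].strip(), row[1].strip()
        groups[population].append(read_file)
    return dict(groups)


# Hypothetical listing: the file names and population labels are made up.
example = io.StringIO(
    "popA_R1.fq.gz\t1\n"
    "popA_R2.fq.gz\t1\n"
    "popB_R1.fq.gz\t2\n"
    "popB_R2.fq.gz\t2\n"
)
sample_groups = read_sample_list(example)
for population, files in sorted(sample_groups.items()):
    print(population, files)
```

Checking the listing in this way before a run can catch a misassigned or missing read file early, which is cheaper than discovering the problem after alignment.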
The three main modules of the pipeline focus on distinct aspects of the
bioinformatics process. The PPalign module calls dependency packages
(e.g., BWA mem) for quality
trimming, mapping and filtering, and SNP calling functions to create
read alignments to the genome assembly, identify genetic variants and
their frequencies, and produce input files for the other modules. The
PPstats module utilizes output from the first module and
reports a number of useful statistics about the sample groups, such as
genomic extent at candidate depths or coverage variation among
chromosomes, and allows the user to confirm that sufficient and similar
coverage has been achieved across samples. The PPanalyze module
utilizes and subsets allele summary data from the first module and
performs user-specified analyses, such as principal components, sliding
window FST, and Fisher’s Exact Tests, to resolve
population structure and identify regions of significant genetic
divergence between groups. Additional modules are provided that run
further statistical tests that utilize replicate sample pairs (Cochran,
1954; Mantel & Haenszel, 1959), take into account background
variance and linkage (local score; Fariello et al., 2017), or account
for population structure (Lewontin and Krakauer test with kinship, or
FLK; Bonhomme et al., 2010), as well as one for plotting results from
these analyses.
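To make the per-SNP comparison concrete, the following minimal Python sketch computes a two-sided Fisher's Exact Test on reference and alternate allele counts from two pools. It is a standalone illustration rather than the code path used by PPanalyze, which calls external programs; the function name and the allele counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Rows are pools and columns are alleles (reference, alternate).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical allele counts at one SNP: (reference, alternate) reads per pool.
pool1 = (30, 10)
pool2 = (12, 28)
p_value = fisher_exact_two_sided(*pool1, *pool2)
print(f"p = {p_value:.3g}")
```

In a genome scan, a test of this kind is applied at every SNP, so the resulting p-values require correction for multiple testing before regions are declared significantly divergent.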
Computational requirements for running the pipeline will depend on the
size of the dataset and user-specified configuration, and may range from
a handful of threads and tens of GB of RAM to dozens of processors and
>1 TB of RAM. Runs for each module usually last a few hours
but could take several days for large datasets with limited processing
and RAM resources. In the tutorials we describe strategies for piecemeal
runs of the different modules as data are generated and assembled to
coordinate and combine data subsets, confirm quality early in the
process, and reduce the overall bioinformatic processing time.
Application 1: