Methods
The bioinformatic pipeline: PoolParty2
PoolParty2 is an updated suite of scripts written in BASH and R that create and manipulate text files, including sequence read files, and call freely distributed programs to operate efficiently on the data as needed. Dependencies are installed in a Linux computing environment; we provide explicit instructions on our GitHub page (https://github.com/stuartwillis/poolparty), and most dependencies are available through the Conda package and environment management system (Anaconda Software Distribution). After installation, users need only provide sequence read files, a haploid genome assembly, a text file listing the sequence read files with their group or population affiliation, and a tailored configuration file for each of the three modules as appropriate. We distribute two tutorials with the scripts that help ensure that dependencies are accessible and illustrate the main features of the pipeline. We additionally provide example code to assist users in conveying output from the PoolParty2 modules into angsd and associated utilities.
The three main modules of the pipeline focus on distinct aspects of the bioinformatics process. The PPalign module calls dependency packages (e.g., BWA mem) for quality trimming, mapping, filtering, and SNP calling to create read alignments to the genome assembly, identify genetic variants and their frequencies, and produce input files for the other modules. The PPstats module uses output from the first module to report a number of useful statistics about the sample groups, such as genomic extent at candidate depths or coverage variation among chromosomes, and allows the user to confirm that sufficient and similar coverage has been achieved across samples. The PPanalyze module subsets allele summary data from the first module and performs user-specified analyses, such as principal components, sliding window FST, and Fisher’s Exact Tests, to resolve population structure and identify regions of significant genetic divergence between groups. Additional modules run further statistical tests that utilize replicate sample pairs (Cochran, 1954; Mantel & Haenszel, 1959), take into account background variance and linkage (local score; Fariello et al., 2017), or account for population structure (Lewontin and Krakauer test with kinship, or FLK; Bonhomme et al., 2010); a further module plots results from these analyses.
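To illustrate the kind of per-SNP comparison referred to above, the following is a minimal Python sketch of a two-sided Fisher's Exact Test on a 2 x 2 table of allele read counts from two pools. It is not code from PoolParty2, which is written in BASH and R and relies on its own dependencies for this test; the function name, the table layout (reference and alternate counts per pool), and the example counts are assumptions made for illustration only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Hypothetical layout (an assumption for this sketch):
        a, b = reference and alternate read counts in pool 1
        c, d = reference and alternate read counts in pool 2
    Returns the probability, under fixed margins, of all tables that are
    no more likely than the observed one.
    """
    row1 = a + b            # total reads in pool 1
    row2 = c + d            # total reads in pool 2
    col1 = a + c            # total reference reads across pools
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x reference reads in pool 1
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)    # smallest feasible value of a
    hi = min(row1, col1)        # largest feasible value of a
    p_obs = prob(a)

    # Small relative tolerance so ties with the observed table are included
    total = sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-7)
    )
    return min(1.0, total)


# Made-up counts at a single SNP: pool 1 has 18 reference and 2 alternate
# reads, pool 2 has 5 reference and 15 alternate reads.
p_value = fisher_exact_two_sided(18, 2, 5, 15)
print(f"two-sided p = {p_value:.3g}")
```

In a genome scan, a test of this form would be applied to each SNP in turn, and the resulting p-values would require correction for multiple testing before regions are interpreted as significantly divergent.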
Computational requirements for running the pipeline depend on the size of the dataset and the user-specified configuration, and may range from a handful of threads and tens of GB of RAM to dozens of processors and >1 TB of RAM. Runs of each module usually last a few hours but could take several days for large datasets with limited processing and RAM resources. In the tutorials we describe strategies for piecemeal runs of the different modules as data are generated and assembled, in order to coordinate and combine data subsets, confirm quality early in the process, and reduce the overall bioinformatic processing time.
Application 1: