2.5 Bioinformatics analysis of RAD-seq data
We used the bioinformatics software pipeline, STACKS v.1.44 (Catchen et
al., 2013; Catchen et al., 2011) to process the
restriction-site-associated DNA markers (RAD-tags) and generate single
nucleotide polymorphism (SNP) datasets. First, we executed the
“process_radtags” program in STACKS to demultiplex and trim sequence
reads by the P1 barcodes and remove low quality reads (Phred quality
score less than 20). After removing PCR duplicates with the
“clone_filter” script, the processed reads were used to generate RAD
loci without a reference genome using “denovo_map.pl” (parameter
settings: m = 3 M = 5 n = 4). We empirically
determined these parameters to limit the impact of over-splitting loci
(see Harvey et al., 2015; Ilut et al., 2014). This involved running thede novo assembly over a wide range of values of M (1–8)
with “ustacks”. From these runs, we selected a value of M = 5
since we observed that the percentage of homozygous and heterozygous
loci reached a plateau at this value and thus minimized over-splitting
of alleles for the final SNP calling.
Stacks calls SNPs (“sstacks”) within RAD loci using a
multinomial-based likelihood model that estimates the likelihood of two
most frequently observed genotypes at each site and performs a standard
likelihood ratio test using a chi-square distribution (Catchen et al.,
2011; Hohenlohe et al., 2010). For SNP inference, we used the default
alpha significance level of 0.05. Paralogous loci that stacked together
were identified and removed by subsequent quality control steps built
into STACKS (max number of stacks per loci (m ) = 3; Harvey et
al., 2015; Ilut et al., 2014). After the preliminary assembly of catalog
loci using “denovo_map.pl”, we ran the STACKS correction mode
(rxstacks-cstacks-sstacks) using the bounded SNP model with a 0.05 upper
bound for the error rate. The “rxstacks” program made corrections to
genotype and haplotype calls based on population information, rebuilt
the catalog loci and filtered out loci with average log likelihood ratio
of < 8.0.
We used three additional filtering steps to generate a set of
high-quality RAD loci for down- stream population genetic analysis.
First, we retained only RAD loci that were present in 80% of all
samples. Second, we removed RAD loci that contained more than 40 SNPs,
as these likely represented sequencing errors or over-clustering of
paralogous loci. Lastly, we used the BLAT alignment algorithm (Kent,
2002) to de novo align the RAD loci and removed those that
aligned to multiple positions. The final consensus set of RAD loci
comprised SNP data from a total of 139 individuals. Genotypes were
called, filtered, and bi-allelic SNPs were exported in VCF format using
the STACKS “populations” program. SNPs from the last seven bp of the
RAD loci were removed as this part of the locus is likely to contain
sequence errors at the 3’ end of the reads. The SNP dataset was further
filtered with VCFtools v.0.1.14 (Danecek et al., 2011) to remove SNPs
below a minor allele frequency (MAF) of 0.05 cutoff to reduce artifacts
of sequence and assembly error. The dataset was also filtered to include
only one random SNP per RAD locus for use in FastStructure (Raj et al.,
2014) in order to avoid linkage disequilibrium between SNPs within RAD
loci.