Introduction
The decreasing cost and increasing ease of producing short-read next-generation
sequencing (NGS) datasets have transformed our understanding of
organismal diversity across the tree of life. Next-generation sequencing
offers new opportunities to empirically test both basic and applied
hypotheses in molecular ecology and evolutionary genetics within and
among populations. However, transforming abundant raw sequence data into
biologically meaningful genetic data is non-trivial.
Genotyping errors can be introduced at several steps of NGS data
analysis, which by extension may induce biases in subsequent inference.
This becomes particularly problematic in non-model organisms with large
and complex genomes and fragmented genome assemblies. It has become
increasingly clear that understanding the performance of the analysis
tools involved, and the biases and errors they introduce, is critical
for the success of any NGS-based project.
Various open-source programs have been developed to identify genomic
variants, such as single nucleotide polymorphisms (SNPs) or insertions
and deletions (indels), from short read data. Commonly used programs
include FreeBayes (Garrison & Marth 2012), HaplotypeCaller and
UnifiedGenotyper from the Genome Analysis Toolkit (GATK; Van der Auwera &
O’Connor 2020), SAMtools (Li et al. 2009), and VarScan (Koboldt et al.
2012), which are widely used in NGS-based projects in both model and
non-model systems. However, it is often difficult to know a
priori how well a given tool will perform given the genomic resources
available for any particular study organism, or how the study design
will interact with the underlying assumptions of such tools, which are
often benchmarked with model organism data.
Several studies comparing variant calling programs exist in the
literature (Cornish & Guda 2015; Hwang et al. 2015; Bian et al. 2018;
Chen et al. 2019; Sandmann et al. 2019). Most aim to test program
efficiency (Hwang et al. 2015) or to evaluate the precision and
sensitivity of the variant calling tools (Sandmann et al. 2019).
However, these studies are typically conducted with human genomic data
in mind (Cornish & Guda 2015; Bian et al. 2018; Sandmann et al. 2019).
They often conclude that there are substantial differences in precision
and sensitivity across tools, differences that depend in large part on
aspects of the data, such as sample size and coverage, as well as the
genomic resources available (Cornish & Guda 2015; Sandmann et al. 2019,
but see Bian et al. 2018). As genome-scale datasets are now
common for non-model organisms with limited or fragmented reference
genomes, it is necessary to expand upon these studies to understand and
establish best practices for systems where genomic resources are
limited.
Conifers are non-model organisms whose genomes are recalcitrant to
chromosome-scale assembly, especially under budgetary constraints. For
instance, conifers often have exceptionally large genomes (20–40 Gbp;
Neale et al. 2017) with histories of whole-genome duplication (Zheng et
al. 2015), gene family expansion (De La Torre et al. 2014; Scott et al.
2020), transposable element dynamics (Yi et al. 2018; Scott et al. 2020;
Wang et al. 2020), and extensive repeat regions (Wegrzyn et al. 2014).
These complexities present a major challenge for NGS data analysis and
downstream hypothesis testing in conifers (Shu & Moran 2020; Lind et
al. 2021). Such challenges can be alleviated by quantifying the accuracy
of SNP calling when using the above-mentioned variant calling tools.
High-quality databases exist for model organisms to calibrate existing
programs and account for biases in the data, yet such resources remain
elusive for many non-model organisms. A unique biological attribute of
conifers is that they possess maternally derived haploid megagametophyte
tissue that can be reliably excised from embryonic seeds.
Megagametophyte tissue can be used in a family design with parent and
offspring samples to provide insight into the accuracy of genotype
calls, by quantifying the concordance between observed haploid offspring
genotypes and those expected given the diploid parental genotypes.
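As a minimal illustration of this logic (a sketch, not code from any published pipeline), the example below assumes the diploid maternal genotype is represented as a pair of alleles and each haploid offspring call as a single allele; an offspring allele absent from the parental pair is counted as discordant and, given the rarity of de novo mutation within one generation, is most plausibly a genotyping error.

```python
# Illustrative sketch: checking whether haploid megagametophyte genotype
# calls are consistent with a diploid maternal genotype at one site.
# Genotype encodings here are assumptions for illustration, e.g.
# parent = ("A", "G"), and each offspring call is an allele string or
# None when the genotype is missing.

def is_concordant(parent_alleles, offspring_allele):
    """A haploid offspring allele is concordant if it matches either parental allele."""
    return offspring_allele in parent_alleles

def mismatch_rate(parent_alleles, offspring_calls):
    """Proportion of non-missing offspring calls that conflict with the parent.

    Because de novo mutation is rare within a single generation, most
    mismatches at a site reflect genotyping error rather than mutation.
    """
    called = [a for a in offspring_calls if a is not None]
    if not called:
        return None
    mismatches = sum(1 for a in called if not is_concordant(parent_alleles, a))
    return mismatches / len(called)

# Example with made-up calls: at a heterozygous maternal site, haploid
# offspring should carry only the two parental alleles (roughly 1:1);
# the "T" call below is discordant and counted as an error.
print(mismatch_rate(("A", "G"), ["A", "G", "G", "T", None]))  # -> 0.25
```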
Here we use a familial design consisting of diploid tissue from a
single lodgepole pine (Pinus contorta) parent and maternally
derived haploid tissue from 106 full-sibling offspring to evaluate
similarities and differences among SNP sets generated by FreeBayes,
HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan. We use the
rate of mismatches between parent and offspring genotype calls to infer
the genotype error rate of each variant caller, given the rarity of
mutation within one generation. We describe how the number of SNPs and
proportion of genotyping errors behave under varying threshold levels
for the most commonly used quality metrics during filtering steps. Our
results shed light on how the choice of variant calling program can
affect the results and interpretation of genomic analyses in non-model
organisms lacking extensive genomic resources.
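For intuition, a threshold sweep of this kind can be sketched as follows; the per-SNP data layout (a quality value paired with a parent-offspring mismatch flag) is an illustrative assumption, not the output format of any of the variant callers named above.

```python
# Illustrative sketch: summarize how the number of retained SNPs and the
# proportion of genotyping errors (parent-offspring mismatches) change as
# a quality-metric filtering threshold (e.g. site QUAL or depth) is raised.

def threshold_sweep(snps, thresholds):
    """snps: iterable of (quality_value, has_mismatch) tuples.

    Returns a list of (threshold, n_retained, error_proportion).
    """
    results = []
    for t in thresholds:
        retained = [s for s in snps if s[0] >= t]
        n = len(retained)
        errors = sum(1 for _, has_mismatch in retained if has_mismatch)
        results.append((t, n, errors / n if n else float("nan")))
    return results

# Example with made-up values: stricter thresholds retain fewer SNPs but
# typically lower the proportion of discordant (erroneous) genotype calls.
example = [(10, True), (25, False), (40, False), (55, True), (80, False)]
for row in threshold_sweep(example, [0, 20, 40, 60]):
    print(row)
```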