Introduction

The decreasing cost and increasing ease of producing short-read next-generation sequencing (NGS) datasets have transformed our understanding of organismal diversity across the tree of life. Next-generation sequencing offers new opportunities to empirically test both basic and applied hypotheses about molecular ecology and evolutionary genetics within and among populations. However, transforming abundant raw sequence data into biologically meaningful genetic data is non-trivial. Genotyping errors can be introduced at several steps of NGS data analysis and may, by extension, bias subsequent inference. This is particularly problematic in non-model organisms with large, complex genomes and fragmented genome assemblies. It has become increasingly clear that understanding the performance, biases, and error sources of the tools involved is critical to the success of any NGS-based project.
Various open-source programs have been developed to identify genomic variants, such as single nucleotide polymorphisms (SNPs) or insertions and deletions (indels), from short-read data. Commonly used programs include FreeBayes (Garrison & Marth 2012), HaplotypeCaller and UnifiedGenotyper from the Genome Analysis Tool Kit (Van der Auwera & O'Connor 2020), SAMtools (Li et al. 2009), and VarScan (Koboldt et al. 2012), all of which are widely used in both model and non-model systems. However, it is often difficult to know a priori how well a given tool will perform given the genomic resources available for a particular study organism, or how the study design will interact with the underlying assumptions of such tools, which are often benchmarked with model organism data.
Several studies comparing variant calling programs exist in the literature (Cornish & Guda 2015; Hwang et al. 2015; Bian et al. 2018; Chen et al. 2019; Sandmann et al. 2019). Most aim to test program efficiency (Hwang et al. 2015) or to evaluate the precision and sensitivity of the variant calling tools (Sandmann et al. 2019), and most are conducted with human genomic data in mind (Cornish & Guda 2015; Bian et al. 2018; Sandmann et al. 2019). They generally conclude that precision and sensitivity differ substantially across tools, depending in large part on aspects of the data such as sample size and coverage, as well as on the genomic resources available (Cornish & Guda 2015; Sandmann et al. 2019, but see Bian et al. 2018). As genome-scale datasets are now common for non-model organisms with limited or fragmented reference genomes, it is necessary to expand upon these studies to understand and establish best practices for systems where genomic resources are limited.
Conifers are non-model organisms with genomes recalcitrant to chromosome-scale assembly, especially under budgetary constraints. For instance, conifers often have exceptionally large genomes (20–40 Gbp; Neale et al. 2017) with histories of whole-genome duplication (Zheng et al. 2015), gene family expansion (De La Torre et al. 2014; Scott et al. 2020), transposable element dynamics (Yi et al. 2018; Scott et al. 2020; Wang et al. 2020), and extensive repeat regions (Wegrzyn et al. 2014). These complexities present a major challenge for NGS data analysis and downstream hypothesis testing in conifers (Shu & Moran 2020; Lind et al. 2021). Such challenges can be alleviated by quantifying the accuracy of SNP calling with the above-mentioned variant calling tools. High-quality databases exist for model organisms to calibrate existing programs and account for biases in the data, yet such resources remain elusive for many non-model organisms. A unique biological attribute of conifers is that they possess maternally derived haploid megagametophyte tissue that can be reliably excised from embryonic seeds. Megagametophyte tissue can be used in a family design with parent and offspring samples, providing insight into the accuracy of genotype calls by quantifying concordance between observed haploid offspring genotypes and those that could arise given the diploid parental genotype.
Here we used a familial design consisting of diploid tissue from a single lodgepole pine (Pinus contorta) parent and maternally derived haploid tissue from 106 full-sibling offspring to evaluate similarities and differences among SNP sets generated by FreeBayes, HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan. We use the rate of mismatches between parent and offspring genotype calls to infer the genotype error rate of each variant caller, given the rarity of mutation within a single generation. We describe how the number of SNPs and the proportion of genotyping errors behave under varying thresholds for the most commonly used quality metrics during filtering. Our results shed light on how the choice of variant calling program can affect the results and interpretation of genomic analyses in non-model organisms lacking extensive genomic resources.
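The concordance logic described above can be sketched in a few lines of code. This is a minimal illustration with hypothetical toy data, not the pipeline used in this study: a haploid megagametophyte call is consistent only if its allele is one of the two alleles in the diploid maternal genotype, and the mismatch rate over all comparable calls approximates the genotyping error rate.

```python
def concordance_error_rate(parent_genotypes, offspring_genotypes):
    """Estimate genotyping error from parent-offspring mismatches.

    parent_genotypes: dict mapping site -> (allele1, allele2) for the
        diploid maternal parent.
    offspring_genotypes: list of dicts, one per haploid offspring,
        mapping site -> called allele (None for a missing call).
    """
    checked = errors = 0
    for offspring in offspring_genotypes:
        for site, allele in offspring.items():
            if allele is None or site not in parent_genotypes:
                continue  # skip missing data and uncalled parental sites
            checked += 1
            if allele not in parent_genotypes[site]:
                errors += 1  # haploid allele absent from the diploid parent
    return errors / checked if checked else 0.0

# Toy example: one heterozygous and one homozygous maternal site
parent = {"chr1:100": ("A", "G"), "chr1:200": ("T", "T")}
offspring = [
    {"chr1:100": "A", "chr1:200": "T"},  # fully concordant
    {"chr1:100": "G", "chr1:200": "C"},  # "C" is a genotyping error
]
print(concordance_error_rate(parent, offspring))  # 1 error / 4 calls = 0.25
```

Note that a site where the mother is heterozygous cannot by itself expose an error that happens to match the other maternal allele, so the mismatch rate is a lower bound on the true per-call error rate.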