Discussion

The reliability of SNP genotypes identified from high-throughput sequencing and the choice of variant caller has been a topic of debate for over a decade (Poland & Rife 2012; Hwang et al. 2015). Here we leveraged diploid and haploid sequence data from a single P. contorta parent and 106 full-sibling offspring to compare SNP genotyping across five popular variant calling tools: FreeBayes, HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan. We used the proportion of mismatches between parent and offspring genotype calls to infer the genotype error rates of each variant caller, given the rarity of mutation within one generation. Our comparison finds large differences in the SNPs called and we evaluate the impact of various filtering metrics on the SNP quality and quantity.
After applying an initial, base level of filtering (Table 1) with each program, we found a large disparity in the number of SNPs called and the error rates between callers (Table 2). As might be expected, we found a strong correlation between the number of SNPs called and the by-genotype error rates (R2 = 99.4%; Fig. S1), which led us to apply our additional incremental filtering to compare callers. For our specific dataset, we show that UnifiedGenotyper consistently had the lowest error rates at all degrees of additional filtering stringency (Fig. 2). Not only did UnifiedGenotyper have the lowest overall error rates after additional filtering, but it also resulted in the most SNPs called in total, offering ample opportunity to prioritize either the volume of SNPs called or the error rates. However, note that UnifiedGenotyper is no longer supported by GATK. After UnifiedGenotyper, both HaplotyperCaller and VarScan performed appreciably well in terms of error rates (Fig. 2). VarScan, however, produced quite a low volume of SNPs despite initial filtering on minor allele frequency and depth being relatively lenient (Table 1). After additional filtering, FreeBayes and SAMtools resulted in the highest by-genotype error rate and by-site error rate respectively (Fig. 2).
Our results highlight two different flavours of variant callers. Tools like HaplotypeCaller and UnifiedGenotyper are highly customizable, highly flexible and offer the user the option to prioritize the number of SNPs called or the error rate. However, HaplotypeCaller and UnifiedGenotyper also required extensive additional filtering beyond baseline to achieve acceptable error rates, and may require a more experienced user and likely more tinkering to curate a suitable SNP set. Other variant callers like FreeBayes, SAMtools, and VarScan, achieve decidedly lower error rates when comparing baseline filter SNP sets, and likely require much less tinkering and effort to produce an acceptable SNP set. However, these three variant callers may lack the customization and flexibility of the GATK callers, and because they call many fewer SNPs may be at risk of missing some important sites.
When we compared the specific sites produced with each variant caller, we found a large degree of overlap among most programs (60%) in both the total sites a program called (Fig. S3,5,7), and as well in the error-free sites a program called (Fig. 1, S9-10). The sites where each caller made genotyping errors, however, were largely specific to the caller used (Fig. 1, S9-10), suggesting that the processes by which each caller makes errors differ mechanistically. Taken together, the high degree of overlap among callers in correctly called sites and the low degree of overlap in erroneously called sites (Fig. 1, S9-10), suggests that the practice of calling sites with multiple different variant caller tools and using only the SNPs common to all tools may be a highly effective method to improve accuracy. However, while this may reduce error rates, taking the intersection across multiple tools will result in a smaller number of SNPs called.
Across all callers, we found that filtering by QUAL, DP, or GQ gave the best results in terms of reducing by-genotype error rates in our dataset (Fig. 3; also see Supp. Mat.). For those using GATK callers and investigating non-model organisms where VQSR is not applicable, these three metrics may offer the best returns on increasing filtering stringency. While the majority of the GATK filtering metrics did not perform optimally on our data, it should be noted that our results represent a comparison performed on one single dataset and that results may vary with other input data. For other variant calling programs, our results suggest that substantial improvements in error rate can be achieved solely through filtering with QUAL, DP, and/or GQ only.
Our study provides a quantitative assessment of the accuracy of some of the more popular variant caller programs and their commonly used filtering metrics; however, it comes with three main caveats. Our analysis was performed on a linkage mapping population and as such our results may differ from similar analyses performed on natural populations. For example, a variant caller that is predicated on sample allele frequencies being in Hardy-Weinberg equilibrium will have biased error rates when used on a linkage mapping population where allele frequencies are expected to be 0, 0.5, or 1. As such, it would be valuable to repeat our comparison of variant callers on natural populations in the future. Secondly, because a P. contorareference genome does not yet exist, we aligned our reads against the best available but highly fragmented congeneric loblolly pine (P. taeda ) reference genome. Any differentiation or lack of synteny betweenP. contorta and P. taeda , as well as any assembly errors in the loblolly pine reference genome may have influenced our results. The option to use the reference genome of the correct study species, the quality of that genome, and whether or not the study species is model will all likely have important effects on results. Finally, we chose to include VarScan in our analysis as it is currently a popular variant calling tool, however, VarScan can only call diploid genotypes. As such, all of our VarScan results are diploid genotype calls from haploid input data and should be interpreted with some caution.