Discussion
The reliability of SNP genotypes identified from high-throughput
sequencing and the choice of variant caller has been a topic of debate
for over a decade (Poland & Rife 2012; Hwang et al. 2015). Here we
leveraged diploid and haploid sequence data from a single P.
contorta parent and 106 full-sibling offspring to compare SNP
genotyping across five popular variant calling tools: FreeBayes,
HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan. We used the
proportion of mismatches between parent and offspring genotype calls to
infer the genotype error rates of each variant caller, given the rarity
of mutation within one generation. Our comparison finds large
differences in the SNPs called and we evaluate the impact of various
filtering metrics on the SNP quality and quantity.
After applying an initial, base level of filtering (Table 1) with each
program, we found a large disparity in the number of SNPs called and the
error rates between callers (Table 2). As might be expected, we found a
strong correlation between the number of SNPs called and the by-genotype
error rates (R2 = 99.4%; Fig. S1), which led
us to apply our additional incremental filtering to compare callers. For
our specific dataset, we show that UnifiedGenotyper consistently had the
lowest error rates at all degrees of additional filtering stringency
(Fig. 2). Not only did UnifiedGenotyper have the lowest overall error
rates after additional filtering, but it also resulted in the most SNPs
called in total, offering ample opportunity to prioritize either the
volume of SNPs called or the error rates. However, note that
UnifiedGenotyper is no longer supported by GATK. After UnifiedGenotyper,
both HaplotyperCaller and VarScan performed appreciably well in terms of
error rates (Fig. 2). VarScan, however, produced quite a low volume of
SNPs despite initial filtering on minor allele frequency and depth being
relatively lenient (Table 1). After additional filtering, FreeBayes and
SAMtools resulted in the highest by-genotype error rate and by-site
error rate respectively (Fig. 2).
Our results highlight two different flavours of variant callers. Tools
like HaplotypeCaller and UnifiedGenotyper are highly customizable,
highly flexible and offer the user the option to prioritize the number
of SNPs called or the error rate. However, HaplotypeCaller and
UnifiedGenotyper also required extensive additional filtering beyond
baseline to achieve acceptable error rates, and may require a more
experienced user and likely more tinkering to curate a suitable SNP set.
Other variant callers like FreeBayes, SAMtools, and VarScan, achieve
decidedly lower error rates when comparing baseline filter SNP sets, and
likely require much less tinkering and effort to produce an acceptable
SNP set. However, these three variant callers may lack the customization
and flexibility of the GATK callers, and because they call many fewer
SNPs may be at risk of missing some important sites.
When we compared the specific sites produced with each variant caller,
we found a large degree of overlap among most programs (60%) in both
the total sites a program called (Fig. S3,5,7), and as well in the
error-free sites a program called (Fig. 1, S9-10). The sites where each
caller made genotyping errors, however, were largely specific to the
caller used (Fig. 1, S9-10), suggesting that the processes by which each
caller makes errors differ mechanistically. Taken together, the high
degree of overlap among callers in correctly called sites and the low
degree of overlap in erroneously called sites (Fig. 1, S9-10), suggests
that the practice of calling sites with multiple different variant
caller tools and using only the SNPs common to all tools may be a highly
effective method to improve accuracy. However, while this may reduce
error rates, taking the intersection across multiple tools will result
in a smaller number of SNPs called.
Across all callers, we found that filtering by QUAL, DP, or GQ gave the
best results in terms of reducing by-genotype error rates in our dataset
(Fig. 3; also see Supp. Mat.). For those using GATK callers and
investigating non-model organisms where VQSR is not applicable, these
three metrics may offer the best returns on increasing filtering
stringency. While the majority of the GATK filtering metrics did not
perform optimally on our data, it should be noted that our results
represent a comparison performed on one single dataset and that results
may vary with other input data. For other variant calling programs, our
results suggest that substantial improvements in error rate can be
achieved solely through filtering with QUAL, DP, and/or GQ only.
Our study provides a quantitative assessment of the accuracy of some of
the more popular variant caller programs and their commonly used
filtering metrics; however, it comes with three main caveats. Our
analysis was performed on a linkage mapping population and as such our
results may differ from similar analyses performed on natural
populations. For example, a variant caller that is predicated on sample
allele frequencies being in Hardy-Weinberg equilibrium will have biased
error rates when used on a linkage mapping population where allele
frequencies are expected to be 0, 0.5, or 1. As such, it would be
valuable to repeat our comparison of variant callers on natural
populations in the future. Secondly, because a P. contorareference genome does not yet exist, we aligned our reads against the
best available but highly fragmented congeneric loblolly pine (P.
taeda ) reference genome. Any differentiation or lack of synteny betweenP. contorta and P. taeda , as well as any assembly errors
in the loblolly pine reference genome may have influenced our results.
The option to use the reference genome of the correct study species, the
quality of that genome, and whether or not the study species is model
will all likely have important effects on results. Finally, we chose to
include VarScan in our analysis as it is currently a popular variant
calling tool, however, VarScan can only call diploid genotypes. As such,
all of our VarScan results are diploid genotype calls from haploid input
data and should be interpreted with some caution.