The impact of sequencing depth and relatedness of the reference genome
in population genomic studies: a case study with two caddisfly species
(Trichoptera, Rhyacophilidae, Himalopsyche)
Abstract
Whole-genome sequencing for generating SNP data is increasingly used in
population genetic studies. However, sequencing whole genomes for large
numbers of samples is still beyond the budgets of many researchers.
It is thus imperative to select an appropriate reference genome and
sequencing coverage to ensure the accuracy of the results for a specific
research question, while balancing cost and feasibility. To evaluate the
effect of the choice of the reference genome and sequencing coverage on
downstream analyses, we used five confamilial reference genomes of
variable relatedness and three levels of sequencing coverage (3.5x, 7.5x
and 12x) in a population genomic study on two caddisfly species:
Himalopsyche digitata and H. tibetana. Using these 30 datasets (five
reference genomes × three coverages × two target species), we estimated
population genetic indices (inbreeding coefficient, nucleotide
diversity, pairwise and genome-wide FST) based on variants and
population structure (PCA and admixture) based on genotype likelihood
estimates. The results showed that both a more distantly related reference
genome and lower sequencing coverage reduce the resolution of these analyses.
In addition, choosing a more closely related reference genome may
substantially offset the loss of resolution caused by low coverage. Therefore, we
conclude that population genetic studies would benefit from closely
related reference genomes, especially as the costs of obtaining a
high-quality reference genome continue to decrease. However, a
cost-efficient strategy for a specific population genomic study can be
found by trading off reference genome relatedness against sequencing
depth.
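
As an illustration of the study design summarised above, the following minimal sketch enumerates the 30 datasets as the full cross of five reference genomes, three coverage levels and two target species. The reference genome labels (ref_1 to ref_5) and the function name build_design are hypothetical placeholders, since the abstract does not name the genomes or describe any tooling. Only the coverage levels and species names come from the text.

```python
from itertools import product

# Hypothetical placeholder labels: the abstract does not name the five
# confamilial reference genomes, only that there are five of them.
REFERENCES = ["ref_1", "ref_2", "ref_3", "ref_4", "ref_5"]

# Coverage levels and target species as stated in the abstract.
COVERAGES = ["3.5x", "7.5x", "12x"]
SPECIES = ["Himalopsyche digitata", "Himalopsyche tibetana"]


def build_design():
    """Return one record per (reference, coverage, species) combination."""
    return [
        {"reference": ref, "coverage": cov, "species": sp}
        for ref, cov, sp in product(REFERENCES, COVERAGES, SPECIES)
    ]


if __name__ == "__main__":
    datasets = build_design()
    # 5 references x 3 coverages x 2 species
    print(len(datasets))
```

Each record in the returned list corresponds to one dataset on which the population genetic indices and population structure analyses would be run.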