Application 2: Barcoded individuals in population samples of steelhead
After trimming, mapping, and quality filtering, PPalign provided 287 to 405 million mapped reads per sample, which allowed sampling of between 67.2 and 70.8% of the genome at the minimum number of reads per sample (10), as revealed by PPstats (Supplemental Figure 7). The distribution of genomic extent across chromosomes was similar to other lcWGS analyses of the O. mykiss genome (e.g. Micheletti, Hess, Zendt, & Narum, 2018), suggesting that this pattern reflects either the library preparation technique used for all these samples or an idiosyncrasy of this genome. Sequencing of indexed individuals allowed us to estimate that the mean coverage per individual ranged from 0 to 2.3 (median 0.23), to confirm that coverage was similar across populations (medians of 0.23, 0.26, and 0.23 and standard deviations of 0.39, 0.28, and 0.28 for Willamette River, Lewis River, and Skamania Hatchery, respectively), and to normalize across samples, reducing the bias in allele frequency estimates introduced by sampling variance. After population-specific filters, PPanalyze examined 22,934,298 variants (22,832,805 [99.5%] in the chromosome scaffolds) with a suite of analyses. Density plots revealed that variants were sampled from across the genome, with a handful of areas of notably high density. A principal components analysis of loci with a maximum difference in allele frequencies below 0.9 (thus excluding the most divergent outlier loci), while unremarkable, confirmed that the primary axis, which explained ~86% of the variance in the data, did not segregate the Skamania Hatchery sample from the natural-origin samples, implying that outlier regions related to the main contrast (hatchery vs. natural) would not be confounded by background population structure. Raw PPanalyze output revealed many small regions of strong genomic divergence, while 51 separate regions were identified as significant at p ≤ 0.05 across ξ values and replicates in Local Score analyses (Figure 4, Table 3, Supplemental Table 1).
The two most significant (highest local score) regions were the region of chr. 28 containing the genes GREB1L and ROCK1 and the region of chr. 25 containing the gene SIX6, which have previously been associated with migration timing and age at maturity, respectively, in steelhead and other salmonids (e.g. Willis et al., 2020). There were also many additional regions whose potential association with migration phenology, age at maturity, or domestication (adaptation to hatchery production) could be explored further. For example, a region of chromosome 20 that was consistently recovered in the Local Score analyses contained two protein-coding genes: ATP-citrate lyase (synthase), or ACLY, and dnaJ homolog subfamily C member 7, or DNAJC7. ACLY is a ubiquitous cytosolic enzyme positioned at the intersection of nutrient catabolism and cholesterol and fatty acid biosynthesis, and DNAJC7 is a member of the heat shock protein 40 family that acts as a co-chaperone regulating the molecular chaperones HSP70 and HSP90 in the folding of steroid receptors, such as the glucocorticoid receptor and the progesterone receptor. Notably, the linkage outlier analysis of these three chromosomes (20, 25, and 28) recovered the same regions but, for chromosomes 25 and 28, also identified other regions that Local Score did not, presumably because these regions, while exhibiting strong linkage across all samples, were not consistently divergent between the hatchery and natural-origin samples.
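As an illustration of the locus filter applied before the principal components analysis described above, the following is a minimal sketch, not the PPanalyze implementation. It assumes allele frequencies are held in a loci-by-samples array; the function names, the toy data, and the use of a singular value decomposition for the PCA are assumptions made for this example only.

```python
import numpy as np


def filter_by_max_af_difference(freqs, threshold=0.9):
    """Return a boolean mask of loci to keep (hypothetical helper).

    freqs: array of shape (n_loci, n_samples) with allele frequencies in [0, 1].
    A locus is kept when the largest difference in allele frequency between
    any two samples (max minus min across samples) is below `threshold`.
    """
    freqs = np.asarray(freqs, dtype=float)
    max_diff = freqs.max(axis=1) - freqs.min(axis=1)
    return max_diff < threshold


def pca_variance_explained(freqs):
    """Proportion of variance explained by each principal axis (hypothetical helper).

    Each locus is centered across samples, then the singular values of the
    centered matrix give the variance along each axis.
    """
    freqs = np.asarray(freqs, dtype=float)
    centered = freqs - freqs.mean(axis=1, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


# Toy data (invented for illustration): 4 loci x 3 samples.
toy_freqs = np.array([
    [0.10, 0.15, 0.12],
    [0.50, 0.55, 0.45],
    [0.02, 0.98, 0.05],  # highly divergent locus, excluded by the filter
    [0.80, 0.70, 0.75],
])

keep = filter_by_max_af_difference(toy_freqs, threshold=0.9)
explained = pca_variance_explained(toy_freqs[keep])
```

Excluding the most divergent loci in this way means that the principal axes summarize background structure rather than being dominated by a few outlier regions.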