Genome alignment and SNP calling
Raw Illumina reads of 199 samples were processed to remove adaptor sequences using Trimmomatic software (v0.39). We used the ‘ILLUMINACLIP:2:30:10 MINLEN:50’ settings to keep the read pairs with a minimum length of 50bp for further processing. Adaptor removed reads were aligned to the PDK50 reference genome (PRJNA40349) using bwa mem software (v0.7.15-r1140) (Heng Li, 2013) with default parameters. Reads with supplementary (split/short) alignments were marked by the bwa mem -M option. Each sample alignment was validated with ‘ValidateSam’ (Picard) (http://broadinstitute.github.io/picard) to ensure no errors in output BAM files. Before variant calling, duplicate reads were marked using GATK MarkDuplicates (gatk-4.1.7.0) (DePristo et al., 2011). The following GATK function, ‘FixMateInformation’ and ‘SetNmMdAndUqTags,’ were used to ensure consistent paired-read information and fix the NM, MD, and UQ tags in the sorted BAM files. Variant calling was performed using GATK HaplotypeCaller with GVCF mode. Reads with mapping quality less than 20 and duplicate marked reads were excluded from the variant call. Hard filter parameters were applied to filter raw SNPs and INDELS using bcftools (v1.10.2) (H. Li, 2011). The raw SNPs were filtered using ‘QD<2.0 || FS>60.0 || MQ<40 || SOR>4.0 || MQRankSum<-10 || MQRankSum>10 || ReadPosRankSum<-6’ parameters, and INDELS were filtered with ‘QD < 2.0 || FS > 200.0 || ReadPosRankSum <−20.0’ parameters. Thereafter we utilized QC filter on SNP data. We excluded samples with greater than 70% missingness and SNPs with depth > 11000 summed across the samples in QC analysis. Genotypes were marked as missing if DP was below 10, genotype call rate less than 80%, or a minor allele frequency below 0.01. Finally, multiallelic SNP (-max-alleles 2) and SNPs with Hardy–Weinberg Equilibrium p-value less than 1e-06 (–hwe 1e-06) were filtered out using Vcftools (V0.1.16). The resulting samples and SNPs were used for further analysis.
Linkage Disequilibrium (LD) Analysis
Linkage Disequilibrium (LD) was calculated using PopLDdecay software (C. Zhang, Dong, Xu, He, & Yang, 2018) with -MaxDist 200 -MAF 0.05 option. The pairwise correlation coefficient ( r2 ) was calculated for all SNPs in 200kb window with minor allele frequency greater than 5% for the whole genome and 18 linkage groups separately (LG). The decay of LD was calculated by plotting pairwise r2 onto genetic distance between the SNP pairs using an approach mentioned in Marroni et al. (Marroni et al., 2011) and then calculated the distance at which pairwise correlation coefficient ( r2 ) is half its maximum value as half decay distance.