Bioinformatics: Phasing
To identify reliable variants, reads were split by base type at the DNM position, and the remaining variants were checked for agreement with the split reads using bamql (Masella et al., 2016). The iSNP with the best combination of split-read agreement, coverage and additional supporting sequencing data was selected for final phasing (see Supplementary Table 6 for iSNP information). This preliminary variant sorting and iSNP selection provided a more reliable variant list. Next, the raw reads were split by base type at the DNM and selected iSNP positions using bamql. From these split reads, the allele frequencies, parent-of-origin, and timing of the DNM event (prezygotic or postzygotic) were determined. The parent-of-origin conclusion was confirmed manually in IGV by visualising the reads containing the DNM and the iSNP chosen by the pipeline.
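The phasing logic described above can be illustrated with a minimal Python sketch. It is not the pipeline itself: the read splitting was done with bamql, whereas this example assumes the base observed at the DNM and iSNP positions has already been extracted for each spanning read. The function names, the input format and the read counts are hypothetical, and no threshold for calling a DNM postzygotic is implied.

```python
from collections import Counter

def phase_counts(reads):
    """Count (DNM base, iSNP base) pairs over reads spanning both sites.

    `reads` is a hypothetical list of (dnm_base, isnp_base) tuples that
    were extracted beforehand from the alignments.
    """
    return Counter(reads)

def parent_of_origin(counts, dnm_alt, isnp_paternal, isnp_maternal):
    """Assign the DNM to the parent whose iSNP allele it co-occurs with most."""
    paternal = counts[(dnm_alt, isnp_paternal)]
    maternal = counts[(dnm_alt, isnp_maternal)]
    if paternal == maternal:
        return "undetermined"
    return "paternal" if paternal > maternal else "maternal"

def dnm_fraction_on_haplotype(counts, dnm_alt, isnp_allele):
    """Fraction of reads on one iSNP haplotype that carry the DNM allele."""
    on_haplotype = sum(n for (_, isnp), n in counts.items() if isnp == isnp_allele)
    return counts[(dnm_alt, isnp_allele)] / on_haplotype if on_haplotype else 0.0

# Illustrative (made-up) reads: DNM allele T, reference allele C;
# iSNP allele A inherited from the father, G from the mother.
reads = [("T", "A")] * 18 + [("C", "A")] * 2 + [("C", "G")] * 21 + [("T", "G")] * 1
counts = phase_counts(reads)
origin = parent_of_origin(counts, dnm_alt="T", isnp_paternal="A", isnp_maternal="G")
fraction = dnm_fraction_on_haplotype(counts, dnm_alt="T", isnp_allele="A")
```

In this toy example the DNM allele sits mainly on the paternal haplotype. The fraction of paternal-haplotype reads carrying the DNM is the kind of quantity that, together with coverage and background error, would inform a prezygotic or postzygotic call.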
In addition to manual analysis in IGV, DNMs were classified as prezygotic or postzygotic by assessing three categories of information: WES DNM base information (coverage and frequency), ONT DNM base information (coverage and frequency), and ONT allele information (coverage and allele frequencies). DNMs that appeared postzygotic are detailed in Supplementary Table 11. This information also helped determine background allele error, as it revealed reads representing third and fourth allelic forms.
Allele and base error were calculated in two ways. The first approach represented total error: for a target DNM, total base position error was calculated by combining all known false base frequencies, and total allele error by combining all known false allele frequencies. The second approach used only the most prevalent base or allele frequency that could be determined as false. Assessing error and noise in the data helped support prezygotic and postzygotic DNM calls.
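The two error measures can be sketched as follows. This is an illustration under stated assumptions rather than the exact calculation used: "combining" is interpreted here as summing, the input is a hypothetical dictionary of read counts per base (the same function would apply to allele counts), and the numbers are invented.

```python
def error_percentages(counts, expected):
    """Return (total error %, most prevalent single false form %).

    counts   -- hypothetical dict mapping each observed base (or allele)
                to its read count at the position of interest
    expected -- set of bases (or alleles) considered true at that position
    """
    depth = sum(counts.values())
    if depth == 0:
        return 0.0, 0.0
    false_counts = [n for form, n in counts.items() if form not in expected]
    total_error = 100.0 * sum(false_counts) / depth           # all false forms combined
    top_error = 100.0 * max(false_counts, default=0) / depth  # most prevalent false form
    return total_error, top_error

# Made-up counts at a DNM position where C (reference) and T (DNM) are expected.
dnm_counts = {"C": 52, "T": 45, "A": 2, "G": 1}
total_error, top_error = error_percentages(dnm_counts, expected={"C", "T"})
```

With these invented counts the total base error is 3% and the most prevalent false base accounts for 2%.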
To assess basecalling background error in an allele-relevant way, we did not check every base within every target, as that would have been unnecessarily extensive and time consuming. Instead, we took the highest false base percentage at each iSNP used to phase each target and calculated the mean false base percentage (Supplementary Table 7). Furthermore, the false base could be qualified as false with support from the parental data. A quality assessment of the bases for each target within the BAM files was performed using Picard ‘QualityScoreDistribution’ (PicardToolkit, 2019).
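As a rough illustration of the background error summary, the sketch below takes the highest false base percentage at each phasing iSNP and averages it across targets. The target names, base counts and expected bases are invented for the example, and the helper function is hypothetical. The real values are those reported in Supplementary Table 7.

```python
from statistics import mean

def highest_false_base_pct(base_counts, expected):
    """Percentage of reads carrying the most frequent unexpected base."""
    depth = sum(base_counts.values())
    false_counts = [n for base, n in base_counts.items() if base not in expected]
    return 100.0 * max(false_counts, default=0) / depth if depth else 0.0

# Hypothetical per-target iSNP pileups: (base counts, bases supported by parental data).
isnps = {
    "target1_iSNP": ({"A": 48, "G": 50, "C": 2, "T": 0}, {"A", "G"}),
    "target2_iSNP": ({"C": 95, "T": 101, "A": 4, "G": 0}, {"C", "T"}),
}

per_isnp = {
    name: highest_false_base_pct(counts, expected)
    for name, (counts, expected) in isnps.items()
}
mean_false_base_pct = mean(list(per_isnp.values()))
```

Here the expected bases at each iSNP are taken from the parental genotypes, which is what allows any other base to be treated as false.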