Bioinformatics: Phasing
To identify reliable variants, reads were split based on DNM base type,
and the remaining variants were checked for agreement with the split
reads using bamql (Masella et al., 2016). The iSNP with the best
combination of split-read agreement, coverage and additional supporting
sequencing data was selected for final phasing (see Supplementary Table
6 for iSNP information). This preliminary variant sorting and iSNP
selection approach provided a more reliable variant list. The next step
split the raw reads by base type at DNM and selected iSNP positions
using bamql. After this step, the allele frequencies, parent-of-origin,
and timing of the DNM event (prezygotic or postzygotic) were determined. The
parent-of-origin conclusion was re-affirmed manually in IGV by
visualising the reads containing the DNM and the chosen iSNP from the
pipeline.
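The read-splitting logic can be illustrated with a minimal sketch. It is not the bamql-based pipeline itself: it assumes the base observed at the DNM and at the chosen iSNP has already been extracted for every read spanning both positions, and the function names, the input layout and the example values are hypothetical.

```python
from collections import Counter


def split_read_counts(reads):
    """Count reads for each (DNM base, iSNP base) combination.

    `reads` is an iterable of (dnm_base, isnp_base) pairs, one pair per
    read that covers both positions (assumed input format).
    """
    counts = Counter(reads)
    total = sum(counts.values())
    freqs = {combo: n / total for combo, n in counts.items()}
    return counts, freqs


def parent_of_origin(counts, dnm_alt, isnp_parent):
    """Return the parent whose iSNP base co-occurs most often with the DNM base.

    `isnp_parent` maps each informative iSNP base to 'paternal' or
    'maternal'. Returns None when no read carries the DNM base or when
    both parents are equally supported.
    """
    support = Counter()
    for (dnm_base, isnp_base), n in counts.items():
        if dnm_base == dnm_alt and isnp_base in isnp_parent:
            support[isnp_parent[isnp_base]] += n
    ranked = support.most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: leave the call to manual review
    return ranked[0][0]


# Hypothetical example: DNM base 'T', iSNP base 'A' inherited from the father.
example_reads = [("T", "A")] * 8 + [("C", "A")] * 9 + [("C", "G")] * 18 + [("T", "G")]
example_counts, example_freqs = split_read_counts(example_reads)
example_call = parent_of_origin(example_counts, "T", {"A": "paternal", "G": "maternal"})
```

A call made this way would still need the manual confirmation in IGV described above, since the sketch does no filtering on mapping or base quality.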
In addition to manual analysis in IGV, DNMs were classified as
postzygotic or prezygotic using three categories of assessment; those that
appeared postzygotic are detailed in Supplementary Table 11. The categories are:
WES DNM base information (coverage/frequency), ONT DNM base information
(coverage/frequency), and ONT allele information (coverage and allele
frequencies). This information also helped determine background allele
error, because it revealed reads in the data that represent third and
fourth allelic forms.
Allele and base error were calculated in two ways. The first approach
represented total error: for a target DNM, all known false base
frequencies were combined to give the total base position error, and all
known false allele frequencies were combined to give the total allele
error. The second approach used the frequency of the most prevalent base
or allele that could be determined as false. Assessment of error and
noise in the data helped support prezygotic and postzygotic DNM calls.
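The two error measures can be sketched as follows. This is an illustration only: it assumes that "combining" frequencies means summing them, that frequencies are given as percentages, and that the set of genuine bases or alleles is already known. The function name and the example values are hypothetical.

```python
def error_summary(freqs, true_keys):
    """Return (total error, most prevalent single error).

    `freqs` maps each observed base (or allele) to its frequency.
    `true_keys` is the set of bases (or alleles) regarded as genuine;
    every other entry is treated as false.
    """
    false_freqs = [f for key, f in freqs.items() if key not in true_keys]
    total_error = sum(false_freqs)              # first approach: all false frequencies combined
    max_error = max(false_freqs, default=0.0)   # second approach: most prevalent false frequency
    return total_error, max_error


# Hypothetical base frequencies (%) at a DNM position with genuine bases C and T.
example_base_freqs = {"C": 52.0, "T": 45.0, "A": 2.0, "G": 1.0}
example_total, example_max = error_summary(example_base_freqs, {"C", "T"})
```

The same function applies to allele error if the keys are allele labels, for example (DNM base, iSNP base) pairs, with the third and fourth allelic forms treated as false.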
To provide an assessment of basecalling background error that is allele
relevant, we did not check every base within every target, as that
would be unnecessarily extensive and time-consuming. Instead, we took
the highest false base percentage at the iSNP used to phase each
target and calculated the mean of these percentages (Supplementary
Table 7). Furthermore, these bases could be confirmed as false because
the iSNPs were supported by parental data. A quality assessment of the bases for each target
within the BAM files was performed using Picard
‘QualityScoreDistributions’ (PicardToolkit, 2019).
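As a small worked illustration of the background error summary described above, the sketch below takes the highest false base percentage at each target's iSNP and averages these values across targets. The input layout, target names and percentages are hypothetical and are not taken from Supplementary Table 7.

```python
from statistics import mean


def mean_highest_false_base(isnp_false_pcts):
    """Return per-target highest false base percentage and their mean.

    `isnp_false_pcts` maps each target to a dict of
    {false base: percentage} observed at that target's iSNP.
    """
    highest = {target: max(pcts.values()) for target, pcts in isnp_false_pcts.items()}
    return highest, mean(highest.values())


# Hypothetical false base percentages at the iSNPs of two targets.
example_pcts = {
    "target1": {"A": 1.5, "G": 0.5},
    "target2": {"C": 2.5, "T": 1.0},
}
example_highest, example_mean = mean_highest_false_base(example_pcts)
```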