Variants, filtering, error polishing, and phasing
PCR-targeted long-read sequencing of the 77 successfully phased DNMs produced an average allele coverage of 35,430X, with a mean background noise of 4% at iSNP positions (see Supplementary Table 7 and methods section ‘Bioinformatics’ for the background noise calculation). This increase in coverage relative to alternative targeting methods, WES, or WGS is expected to help with error reduction and mosaicism detection (Wright et al., 2019). The DNMs, which were initially identified with short-read WES and validated with Sanger sequencing, were used to anchor the preliminary phasing of long reads. This anchoring approach groups long reads by the base observed at the DNM position. After variant calling (methods section ‘Bioinformatics’), homozygous variants were removed, and heterozygous variants were checked and filtered based on their agreement with the DNM-grouped reads. All remaining variants from the long-read sequencing approach were error-polished and filtered using WES and parental ONT data (Figure 2 and Supplementary Tables 4 and 5). iSNPs were then identified from the remaining variants, and the iSNP with the greatest confidence (coverage, supporting data, DNM allele agreement) was selected for phasing. When phasing the reads based on the DNM and the selected iSNP, additional alleles were allowed for the DNM in case of a postzygotic event, but these were screened for biological plausibility, i.e. the additional DNM wild-type (wt) reads would have to carry the same iSNP allele as the DNM alt reads. Importantly, this anchored approach filtered out 10% more presumably falsely called variants than standard filtering of variants based on quality criteria, sequencing coverage, and consensus (see Supplementary Table 4). Small indels made up on average 25% of the false positives, and on average 94% of all indels detected in the long-read sequencing data were likely false positives (see Supplementary Table 5; for an example of a false indel, see Supplementary Figure 7).
Because of this high error rate for indel calling, we removed all indels for which no supporting data were available, as noted in the illustration of our approach (Figure 2).
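The anchoring step described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the read representation (a dictionary mapping positions to observed bases), the function names `group_reads_by_dnm` and `agrees_with_dnm_groups`, and both thresholds are hypothetical choices made for illustration only. The sketch groups reads by the base at the DNM position and then keeps a candidate heterozygous variant only if the DNM alt reads consistently carry one allele while a sufficient fraction of the DNM wt reads carry a different one.

```python
from collections import Counter, defaultdict


def group_reads_by_dnm(reads, dnm_pos):
    """Group reads by the base they show at the DNM position.

    Each read is a dict {position: base}; this layout is an assumption
    made for the sketch. Reads that do not cover the DNM are skipped.
    """
    groups = defaultdict(list)
    for read in reads:
        base = read.get(dnm_pos)
        if base is not None:
            groups[base].append(read)
    return dict(groups)


def agrees_with_dnm_groups(groups, wt_base, alt_base, pos,
                           min_alt_purity=0.9, min_wt_other=0.2):
    """Check whether a candidate variant at `pos` agrees with the DNM groups.

    Both thresholds are hypothetical. The DNM alt reads should stem from a
    single haplotype, so they are expected to be dominated by one base at
    `pos`; a truly heterozygous site should show a different base in a
    sizeable fraction of the DNM wt reads.
    """
    alt_counts = Counter(r[pos] for r in groups.get(alt_base, []) if pos in r)
    wt_counts = Counter(r[pos] for r in groups.get(wt_base, []) if pos in r)
    if not alt_counts or not wt_counts:
        return False  # no coverage in one of the groups
    linked_base, linked_n = alt_counts.most_common(1)[0]
    alt_purity = linked_n / sum(alt_counts.values())
    wt_other = 1 - wt_counts[linked_base] / sum(wt_counts.values())
    return alt_purity >= min_alt_purity and wt_other >= min_wt_other


# Toy data: DNM at position 100 (wt C, alt T), a candidate that tracks the
# DNM groups at position 250, and a noisy candidate at position 300.
reads = (
    [{100: "T", 250: "G", 300: "A"}] * 10
    + [{100: "C", 250: "A", 300: "A"}] * 9
    + [{100: "C", 250: "A", 300: "T"}]
)
groups = group_reads_by_dnm(reads, dnm_pos=100)
keep_250 = agrees_with_dnm_groups(groups, "C", "T", 250)
keep_300 = agrees_with_dnm_groups(groups, "C", "T", 300)
```

In this toy example the candidate at position 250 separates cleanly between the two DNM groups, whereas the candidate at position 300 is supported only by a single discordant read. The check tolerates DNM wt reads that carry the same allele as the DNM alt reads, so that a postzygotic event is not excluded by construction.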