mem
The parameters -k and -r were extremely important for the “seeding and reseeding” mapping stages in BWA mem (24). The different parameter values of -k and -r could significantly affect the CRT, LRE and MT, indicating that we can obtain ancient DNA mapping results with a lower contamination rate by optimizing the parameter values (Table S6, S8). The LER was highest at -k = 29, and it increased as the value of k increased. The value of LER decreased by~0.20% from -k = 19 to -k = 9, however, this decrease continued to ~4.66% from -k = 29 to -k = 19, which was 23.3 times larger than that between -k = 19 and -k = 9(Table S7). This means that the decrease of –k resulted to exponential reduction of loss of true endogenous ancient DNA. Therefore, it is of central importance to choose the appropriate value of -k to map against genome for ancient DNA. For the parameter –r, the LER reached the lowest level at -r=2.5, which means that it could avoid losing endogenous reads significantly (Table S9).
In addition, there were significant differences in running time when changing the –k and -r parameters (Table S6, S8). The running time significantly decreased from -k=9 and –r=0.5 to –k=19 and –r=1.5, but it was relatively stable and even slightly longer when –k and –r was larger than 19 and 1.5, respectively (Table S7, Table S9, Figure. 2, Figure. 3). The BWA mem algorithm only found the exact matches in a read while seeding, and this algorithm could trigger re-seeding with the maximal exact matches (MEMs) to reduce the loss of mis-mapping if MEMs are larger than [-k*-r] (24). The larger [-k*-r] meant fewer seeds, which can accelerate the mapping process. However, too long seed would also make seed mapping against genomes more difficult and eventually more time-consuming. We also found that the running time was more sensitive to the change of –k parameter rather than the –r (Table S7, Table S9, Figure. 2, Figure. 3), indicating that the running time was mainly influenced by the minimum seed length. The –r cannot affect the seeding for MEMs, but the –k can influence both seeding and reseeding procedures, which might be the possible reason for their influence on the running time. Based on these comparisons, we recommend use of –k=19 and –r=2.5 for BWA mem mapping of ancient DNA.
4.3 Improving the separation of endogenous DNA
Among all kinds of homologous contaminations, the present-day human DNA is the most frequent contamination in ancient human DNA, because it can be easily induced from the time sample are collected to the time DNA library preparation is performed. These homologous contaminations are extremely difficult to remove. In our testing, the proportion of homologous contamination that could be removed from the simulated raw data decreased with the increase in simulated contamination rates, and there was a significant negative correlation between them (R2 =0.391, P =0.019). However, it was still possible to remove more than 99% homologous contaminations even when the simulated contamination rate reached 99% (Figure. 5). It is worth noting that the proportion of homologous contamination increased to 99.9% when the simulated contamination rate decreased to 95%. On average, 99.07% contamination could be removed using our recommended screening method (Table S15), which was lower than many other ancient DNA studies (26, 27). Besides, no significant differences were found in the endogenous DNA rate considering the different samples, different damage patterns and different contamination rates, demonstrating the universal property of our recommended method. Using the remaining endogenous ancient reads, we summarized a best combination with DeamNum=1, DetectRange=15 and DoubleOrSingle=or.
To explore more possibly effective filtering strategy, we further screened reads with G or A residues preceding the first base at 5’ end of the DNA fragments. This depurination screening decreased homologous contamination ratio to 2.25% which meant that this method may enable the recovery of more endogenous DNA (Table S16). Similar to the deamination screening, no differences were found in relation to sample ages, which was largely due to the weak correlation between depurination and samples ages. No significant correlation between samples ages and the extent of DNA fragmentation was recorded. DNA fragments are usually heavily degraded due to depurination shortly after death (34, 35). However, only 10%-40% of ancient DNA fragmentation is triggered by depurination, and other factors can also result in DNA fragmentations. As such, it is difficult to identify more endogenous ancient reads by screening the DNA length. However, we also offer this option in our python script to support the filtration by depurination and fragmentation.