mem
The parameters -k and -r were extremely important for the “seeding and
reseeding” mapping stages in BWA mem (24). The different
parameter values of -k and -r could significantly affect the CRT, LRE
and MT, indicating that we can obtain ancient DNA mapping results with a
lower contamination rate by optimizing the parameter values (Table S6,
S8). The LER was highest at -k = 29, and it increased as the value of k
increased. The value of LER decreased by~0.20% from -k
= 19 to -k = 9, however, this decrease continued to
~4.66% from -k = 29 to -k = 19, which was 23.3 times
larger than that between -k = 19 and -k = 9(Table S7). This means that
the decrease of –k resulted to exponential reduction of loss of true
endogenous ancient DNA. Therefore, it is of central importance to choose
the appropriate value of -k to map against genome for ancient DNA. For
the parameter –r, the LER reached the lowest level at -r=2.5, which
means that it could avoid losing endogenous reads significantly (Table
S9).
In addition, there were significant differences in running time when
changing the –k and -r parameters (Table S6, S8). The running time
significantly decreased from -k=9 and –r=0.5 to –k=19 and –r=1.5, but
it was relatively stable and even slightly longer when –k and –r was
larger than 19 and 1.5, respectively (Table S7, Table S9, Figure. 2,
Figure. 3). The BWA mem algorithm only found the exact matches in
a read while seeding, and this algorithm could trigger re-seeding with
the maximal exact matches (MEMs) to reduce the loss of mis-mapping if
MEMs are larger than [-k*-r] (24). The larger [-k*-r] meant
fewer seeds, which can accelerate the mapping process. However, too long
seed would also make seed mapping against genomes more difficult and
eventually more time-consuming. We also found that the running time was
more sensitive to the change of –k parameter rather than the –r (Table
S7, Table S9, Figure. 2, Figure. 3), indicating that the running time
was mainly influenced by the minimum seed length. The –r cannot affect
the seeding for MEMs, but the –k can influence both seeding and
reseeding procedures, which might be the possible reason for their
influence on the running time. Based on these comparisons, we recommend
use of –k=19 and –r=2.5 for BWA mem mapping of ancient DNA.
4.3 Improving the separation of endogenous DNA
Among all kinds of homologous contaminations, the present-day human DNA
is the most frequent contamination in ancient human DNA, because it can
be easily induced from the time sample are collected to the time DNA
library preparation is performed. These homologous contaminations are
extremely difficult to remove. In our testing, the proportion of
homologous contamination that could be removed from the simulated raw
data decreased with the increase in simulated contamination rates, and
there was a significant negative correlation between them
(R2 =0.391, P =0.019). However, it was
still possible to remove more than 99% homologous contaminations even
when the simulated contamination rate reached 99% (Figure. 5). It is
worth noting that the proportion of homologous contamination increased
to 99.9% when the simulated contamination rate decreased to 95%.
On average, 99.07% contamination
could be removed using our recommended screening method (Table S15),
which was lower than many other ancient DNA studies (26, 27). Besides,
no significant differences were found in the endogenous DNA rate
considering the different samples, different damage patterns and
different contamination rates, demonstrating the universal property of
our recommended method. Using the remaining endogenous ancient reads, we
summarized a best combination with DeamNum=1, DetectRange=15 and
DoubleOrSingle=or.
To explore more possibly effective filtering strategy, we further
screened reads with G or A residues preceding the first base at 5’ end
of the DNA fragments. This depurination screening decreased homologous
contamination ratio to 2.25% which meant that this method may enable
the recovery of more endogenous DNA (Table S16). Similar to the
deamination screening, no differences were found in relation to sample
ages, which was largely due to the weak correlation between depurination
and samples ages. No significant correlation between samples ages and
the extent of DNA fragmentation was recorded. DNA fragments are usually
heavily degraded due to depurination shortly after death (34, 35).
However, only 10%-40% of ancient DNA fragmentation is triggered by
depurination, and other factors can also result in DNA fragmentations.
As such, it is difficult to identify more endogenous ancient reads by
screening the DNA length. However, we also offer this option in our
python script to support the filtration by depurination and
fragmentation.