3. Results
3.1 Description of Samples and the Simulated Data
The average length of ancient DNA data that we collected ranged from
45bp to 58bp (Table 1). The age of samples ranged from
~2.7 kyr BP (Before Present) to ~50 kyr
BP, which provided a good basis to evaluate the influence of age on
ancient DNA mapping and separation of homologous contaminations. The DNA
damage analysis showed an obvious increase of deaminated substitutions
with the frequency of C-to-T and G-to-A at ends of DNA fragments ranging
from 2% to 80 % (Figure. S2). We simulated a total of 54 ancient DNA
data set containing the same length distribution and damage pattern as
the real data set (Table 1). One million reads were finally simulated
under each condition.
3.2 Comparing different Mapping Algorithms on Ancient DNA
Ancient DNA damage, especially C-to-T changes, can result in mis-mapping
when ancient DNA fragments are mapped to the reference genome. The
mapping methods and parameters used for modern DNA are not always
suitable for ancient DNA. We compared BWA aln and BWA memto explore a more effective mapping strategy based on the
characteristics of ancient DNA damage.
The comparison was achieved by calculating the CRT, LRE and MT for each
dataset, and performing Repeated Measurement Analysis of Variance.
The
average and median value of CRT, LRE and MT of the two algorithms with
different contamination rates are shown in Table S3. The analysis showed
no significant differences in CRT (F = 1.42, P = 0.2870)
and LRE (F = 0.44, P = 0.5344). However, Significant
differences were found in MT (F = 41.57, P = 0.0013) and
BWA aln with the MS parameter taking 8.13-fold time than BWAmem by default (Table S4). We further evaluated the influence of
different samples and different contaminations rates on ancient DNA
mapping. As shown in Table S5 and Figure 1, the CRT maintained the same
level in different samples. The mean values of LRE were stable between
different contamination rates but not when contamination rate was close
to 100%.
The seed-reseed-extend strategy is one of the most important aspects of
the BWA mem algorithm. This strategy is mainly supported by two
parameters including the minimum seed length (parameter -k) and the
maximum seed length without reseeding (parameter -r). Additionally, the
algorithm searches for internal seeds inside a seed longer than x bp
(x=[-k] * [-r]). In this study, we tried to optimize these two
parameters to further explore more effective mapping parameters for
ancient DNA mapping.
We calculated the CRT, LRE and MT using different parameters –k
(9/14/19/24/29) (Figure. 2) and performed Repeated Measurement Analysis
of Variance Analysis to compare the results generated under different
parameters. There were significant differences in CRT (F =
644.61, P < 0.0001), LRE (F = 17.99, P =
0.0057) and MT (F = 146.75, P < 0.0001) (Table
S6). The significant differences of LRE were found between “-k 29” and
the other values. The MT remarkably increased at “–k 9” and was
significantly longer than all other tests. Moreover, “–k 14” consumed
more running time than that of “-k 14” and “-k 29” (Table S7).
We evaluated the parameter –r (0.5/1/1.5/2/2.5) with the same method
used to evaluate the parameter –k (Figure. 3). Significant differences
were found in CRT (F = 392.45, P < 0.0001), LRE
(F = 45.11, P = 0.0010) and MT (F = 9.19, P =
0.0002) (Table S8). A significant decrease in LRE was observed when the
set values of “-r” were greater than 1.5. MT was significantly higher
with “-r 0.5” than that of all other “-r” options. Similarly, there
was a significant decrease in MT between “-r 1.0” and the other “-r”
options (except “-r 0.5”), but no significant differences were found
between other pairwise comparisons (Table S9).
3.3 Separation of endogenous DNA
The unique ancient DNA characteristics, especially the C-to-T and/or
G-to-A changes at ends of DNA fragments help to improve the filtering of
contaminated present-day DNA. Using these characteristics, we tried to
decrease the homologous contamination rate to a very low level through
screening of DNA fragments with C-to-T and G-to-A changes within the
first or last “DetectRange” base
pair at 3’ and/or 5’ ends (Figure. 4). The mean values of CRT and LRE
are shown in the Table S10. No significant differences were found in CRT
(F = 3.27, P = 0.1097) and LRE (F = 1.11, P= 0.3893) (Table S11).
We further tested the influence of parameter “DeamNum” on separating
endogenous DNA. Significant differences were found in CRT (F =
26.01, P = 0.0011) and LRE (F = 24.03, P = 0.0152)
(Table S12). We found that an increase of “DeamNum” could lower the
CRT, but result in a higher LRE (Figure S3). We also calculated CRT and
LRE to evaluate the influence of the parameter “DoubleOrSingle” on
ancient DNA mapping (Figure. S4). Significant difference was found in
the CRT (F = 44.97, P = 0.0068) but not in the LRE
(F = 7.20, P = 0.0748) (Table S13). The result showed a
nearly 10-fold decrease of CRT when screening the reads with C-to-T or
G-to-A on single end (-DoubleOrSingle=or) compared to screening on both
3’ and 5’ ends (-DoubleOrSingle=and) (Table S10). The homologous
contamination rate can be kept to an average of 0.92% by using the
filtering strategy with
“-DetectRange=15 -DeamNum=1
-DoubleOrSingle=or”.
We further compared our method with the software PMDtools. These two
methods were run in parallel using the same dataset, and the results
generated by the PMDtools with different parameters are shown in the
Table S14. To make the comparison fairer, the LRE of the tested dataset
were kept similar in both our pipeline and the PMDtools. We found no
significant differences in the CRT (Z = -1.171, P =0.241)
between the two methods. However, the running time of our method was
6.48 times shorter than that of PMDtools, and the difference was
significant (Difference=2.3mins, Z=-6.50, P=8.28E-11).
Although this comparison could not show the superiority of our method to
PDMtools, we at least demonstrate a fast and reliable complement to the
PDMtools.
Furthermore, we tried to screen reads with G or A residues preceding the
first base at 5’ end. The average
homologous contamination rate was 2.25% after filtering using the
depurination characteristic.