3. Results

3.1 Description of Samples and the Simulated Data
The average length of ancient DNA data that we collected ranged from 45bp to 58bp (Table 1). The age of samples ranged from ~2.7 kyr BP (Before Present) to ~50 kyr BP, which provided a good basis to evaluate the influence of age on ancient DNA mapping and separation of homologous contaminations. The DNA damage analysis showed an obvious increase of deaminated substitutions with the frequency of C-to-T and G-to-A at ends of DNA fragments ranging from 2% to 80 % (Figure. S2). We simulated a total of 54 ancient DNA data set containing the same length distribution and damage pattern as the real data set (Table 1). One million reads were finally simulated under each condition.
3.2 Comparing different Mapping Algorithms on Ancient DNA
Ancient DNA damage, especially C-to-T changes, can result in mis-mapping when ancient DNA fragments are mapped to the reference genome. The mapping methods and parameters used for modern DNA are not always suitable for ancient DNA. We compared BWA aln and BWA memto explore a more effective mapping strategy based on the characteristics of ancient DNA damage.
The comparison was achieved by calculating the CRT, LRE and MT for each dataset, and performing Repeated Measurement Analysis of Variance. The average and median value of CRT, LRE and MT of the two algorithms with different contamination rates are shown in Table S3. The analysis showed no significant differences in CRT (F = 1.42, P = 0.2870) and LRE (F = 0.44, P = 0.5344). However, Significant differences were found in MT (F = 41.57, P = 0.0013) and BWA aln with the MS parameter taking 8.13-fold time than BWAmem by default (Table S4). We further evaluated the influence of different samples and different contaminations rates on ancient DNA mapping. As shown in Table S5 and Figure 1, the CRT maintained the same level in different samples. The mean values of LRE were stable between different contamination rates but not when contamination rate was close to 100%.
The seed-reseed-extend strategy is one of the most important aspects of the BWA mem algorithm. This strategy is mainly supported by two parameters including the minimum seed length (parameter -k) and the maximum seed length without reseeding (parameter -r). Additionally, the algorithm searches for internal seeds inside a seed longer than x bp (x=[-k] * [-r]). In this study, we tried to optimize these two parameters to further explore more effective mapping parameters for ancient DNA mapping.
We calculated the CRT, LRE and MT using different parameters –k (9/14/19/24/29) (Figure. 2) and performed Repeated Measurement Analysis of Variance Analysis to compare the results generated under different parameters. There were significant differences in CRT (F = 644.61, P < 0.0001), LRE (F = 17.99, P = 0.0057) and MT (F = 146.75, P < 0.0001) (Table S6). The significant differences of LRE were found between “-k 29” and the other values. The MT remarkably increased at “–k 9” and was significantly longer than all other tests. Moreover, “–k 14” consumed more running time than that of “-k 14” and “-k 29” (Table S7).
We evaluated the parameter –r (0.5/1/1.5/2/2.5) with the same method used to evaluate the parameter –k (Figure. 3). Significant differences were found in CRT (F = 392.45, P < 0.0001), LRE (F = 45.11, P = 0.0010) and MT (F = 9.19, P = 0.0002) (Table S8). A significant decrease in LRE was observed when the set values of “-r” were greater than 1.5. MT was significantly higher with “-r 0.5” than that of all other “-r” options. Similarly, there was a significant decrease in MT between “-r 1.0” and the other “-r” options (except “-r 0.5”), but no significant differences were found between other pairwise comparisons (Table S9).
3.3 Separation of endogenous DNA
The unique ancient DNA characteristics, especially the C-to-T and/or G-to-A changes at ends of DNA fragments help to improve the filtering of contaminated present-day DNA. Using these characteristics, we tried to decrease the homologous contamination rate to a very low level through screening of DNA fragments with C-to-T and G-to-A changes within the first or last “DetectRange” base pair at 3’ and/or 5’ ends (Figure. 4). The mean values of CRT and LRE are shown in the Table S10. No significant differences were found in CRT (F = 3.27, P = 0.1097) and LRE (F = 1.11, P= 0.3893) (Table S11).
We further tested the influence of parameter “DeamNum” on separating endogenous DNA. Significant differences were found in CRT (F = 26.01, P = 0.0011) and LRE (F = 24.03, P = 0.0152) (Table S12). We found that an increase of “DeamNum” could lower the CRT, but result in a higher LRE (Figure S3). We also calculated CRT and LRE to evaluate the influence of the parameter “DoubleOrSingle” on ancient DNA mapping (Figure. S4). Significant difference was found in the CRT (F = 44.97, P = 0.0068) but not in the LRE (F = 7.20, P = 0.0748) (Table S13). The result showed a nearly 10-fold decrease of CRT when screening the reads with C-to-T or G-to-A on single end (-DoubleOrSingle=or) compared to screening on both 3’ and 5’ ends (-DoubleOrSingle=and) (Table S10). The homologous contamination rate can be kept to an average of 0.92% by using the filtering strategy with “-DetectRange=15 -DeamNum=1 -DoubleOrSingle=or”.
We further compared our method with the software PMDtools. These two methods were run in parallel using the same dataset, and the results generated by the PMDtools with different parameters are shown in the Table S14. To make the comparison fairer, the LRE of the tested dataset were kept similar in both our pipeline and the PMDtools. We found no significant differences in the CRT (Z = -1.171, P =0.241) between the two methods. However, the running time of our method was 6.48 times shorter than that of PMDtools, and the difference was significant (Difference=2.3mins, Z=-6.50, P=8.28E-11). Although this comparison could not show the superiority of our method to PDMtools, we at least demonstrate a fast and reliable complement to the PDMtools.
Furthermore, we tried to screen reads with G or A residues preceding the first base at 5’ end. The average homologous contamination rate was 2.25% after filtering using the depurination characteristic.