2. Materials and Methods
2.1 Samples and Data Resource
We investigated previously sequenced whole-genome sequences from ancient
animals in order to exclude the possibility of missing out any important
data. In total, we retrieved whole genome sequencing (WGS) data from 6
ancient samples derived from different age groups of three species
namely four ancient humans (Homo sapiens )(25-27), one ancient
goat (28) and one ancient aurochs (29). The BAM files were downloaded
from NCBI (https://www.ncbi.nlm.nih.gov/). The 6 samples were used to
explore the methods for mapping and separating the endogenous DNA (Table
1). The reference genomes for each species used for genome mapping are
listed in Table S1.
2.2. DNA Damage Analysis and Ancient DNA Simulation
Removing all contaminations present in real ancient data is often
difficult, and may lead to inaccurate evaluation. Therefore, we avoided
using the real ancient sequencing data for analysis, but instead used
the simulated ancient sequences with the same damage parameters as those
of the real data. With simulated data, it is possible to clearly
determine the true state of the genuine ancient DNA and effectively
evaluate the mapping and separating method of the endogenous DNA. To
obtain damage and fragmentation parameters from the real data, we used
mapDamage2.0 (30) to calculate the frequency of C-to-T and G-to-A
changes at the ends of DNA fragments, and the length distribution of the
real ancient DNA data we collected. To simulate the real contaminations,
we sequenced an ancient panda (~ 100 years
old,
CNP0000732) and
mapped the raw reads by blast using the nucleotide database (31) to
obtain the real proportion of contaminated reads. This contamination
consisted of more than twenty thousand modern species including all
other possible contaminates (Figure S1). The real contamination data
were added into simulated endogenous ancient DNA to test for the ancient
genome mapping methods, and modern human DNA fragments were mixed with
simulated ancient human DNA to explore the method for filtering the
homologues contaminations. Finally, we used the software gargammel (perl
gargammel.pl -n 1000000 –comp 0,cont_rate,endo_rate -f
sizefreq.size -mapdamagee misincorporation.txt single/double -o
data/simulation data/)(32) to simulate FASTQ files including one million
reads of ancient DNA sequences for the above mentioned 6 ancient
samples. Nine different contamination rates were simulated (20%, 40%,
60%, 80%, 90%, 95%, 99%, 99.5% and 99.9%).
2.3. Genome Mapping of Simulated Ancient DNA
We compared the BWA aln (Version: 0.7.17) and BWA mem(Version: 0.7.17) to explore a more effective mapping method for ancient
DNA. Here,“bwa aln -l 1024
-n 0.03”(MS parameters) (18) was compared with BWA mem to seek
a better mapping strategy for ancient DNA.
We defined the valid mapping hits as reads with endogenous ancient DNA
tags (all simulated endogenous ancient DNA were tagged before mapping)
and with a mapping quality higher than 30. For the BWA memalgorithm, the most important part is the seed-reseed-extend strategy
which might be suitable for ancient DNA. As such, we also tested BWAmem with the minimum seed length (parameter -k: 9/14/19/24/29)
and the maximum seed length without reseeding (parameter -r:
0.5/1/1.5/2/2.5) parameters. In order to evaluate the mapping
effectiveness, we defined three main
criteria: 1) CRT: the contamination
rate after treatment (the number of mapped contamination reads / the
number of mapped reads); 2) LRE: the
loss rate of endogenous DNA (the number of unmapped endogenous ancient
reads / the number of endogenous ancient reads); 3) MT: the running time
of mapping.