2. Materials and Methods

2.1 Samples and Data Resource
We investigated previously sequenced whole-genome sequences from ancient animals in order to exclude the possibility of missing out any important data. In total, we retrieved whole genome sequencing (WGS) data from 6 ancient samples derived from different age groups of three species namely four ancient humans (Homo sapiens )(25-27), one ancient goat (28) and one ancient aurochs (29). The BAM files were downloaded from NCBI (https://www.ncbi.nlm.nih.gov/). The 6 samples were used to explore the methods for mapping and separating the endogenous DNA (Table 1). The reference genomes for each species used for genome mapping are listed in Table S1.
2.2. DNA Damage Analysis and Ancient DNA Simulation
Removing all contaminations present in real ancient data is often difficult, and may lead to inaccurate evaluation. Therefore, we avoided using the real ancient sequencing data for analysis, but instead used the simulated ancient sequences with the same damage parameters as those of the real data. With simulated data, it is possible to clearly determine the true state of the genuine ancient DNA and effectively evaluate the mapping and separating method of the endogenous DNA. To obtain damage and fragmentation parameters from the real data, we used mapDamage2.0 (30) to calculate the frequency of C-to-T and G-to-A changes at the ends of DNA fragments, and the length distribution of the real ancient DNA data we collected. To simulate the real contaminations, we sequenced an ancient panda (~ 100 years old, CNP0000732) and mapped the raw reads by blast using the nucleotide database (31) to obtain the real proportion of contaminated reads. This contamination consisted of more than twenty thousand modern species including all other possible contaminates (Figure S1). The real contamination data were added into simulated endogenous ancient DNA to test for the ancient genome mapping methods, and modern human DNA fragments were mixed with simulated ancient human DNA to explore the method for filtering the homologues contaminations. Finally, we used the software gargammel (perl gargammel.pl -n 1000000 –comp 0,cont_rate,endo_rate -f sizefreq.size -mapdamagee misincorporation.txt single/double -o data/simulation data/)(32) to simulate FASTQ files including one million reads of ancient DNA sequences for the above mentioned 6 ancient samples. Nine different contamination rates were simulated (20%, 40%, 60%, 80%, 90%, 95%, 99%, 99.5% and 99.9%).
2.3. Genome Mapping of Simulated Ancient DNA
We compared the BWA aln (Version: 0.7.17) and BWA mem(Version: 0.7.17) to explore a more effective mapping method for ancient DNA. Here,“bwa aln -l 1024 -n 0.03”(MS parameters) (18) was compared with BWA mem to seek a better mapping strategy for ancient DNA.
We defined the valid mapping hits as reads with endogenous ancient DNA tags (all simulated endogenous ancient DNA were tagged before mapping) and with a mapping quality higher than 30. For the BWA memalgorithm, the most important part is the seed-reseed-extend strategy which might be suitable for ancient DNA. As such, we also tested BWAmem with the minimum seed length (parameter -k: 9/14/19/24/29) and the maximum seed length without reseeding (parameter -r: 0.5/1/1.5/2/2.5) parameters. In order to evaluate the mapping effectiveness, we defined three main criteria: 1) CRT: the contamination rate after treatment (the number of mapped contamination reads / the number of mapped reads); 2) LRE: the loss rate of endogenous DNA (the number of unmapped endogenous ancient reads / the number of endogenous ancient reads); 3) MT: the running time of mapping.