Running ARGweaver
Preprocessing files for ARGweaver
To preprocess the input files for ARGweaver, we first subset our 20 diploid individuals from the full VCF files (see Supplementary information in Meier et al, 2021), and then convert the SNP information into the .sites format required for ARGweaver. The .sites format only contains information on the positions that are varying within the 20 individuals that we chose. (see .Rmd file for full code)
In the optix region (43538 bp long), there are 2812 sites altogether - 2426 are variant positions, while 330 and 56 sites are fixed for one or the other allele. Genomic positions that are neither variant nor fixed to one or the other allele (in other words, positions absent in the VCF file) are considered missing information and therefore masked from being used as input data for ARGweaver. Altogether, ARGweaver uses information of variant alleles and invariant alleles; whereas the rest is masked and treated as missing information.
# Keep only the sites that are variant to this subset of 20 diploid individuals
bcftools view -c1 -C39 -Oz ${regionName}_Herato.vcf.gz > ${regionName}_Herato_variants.vcf.gz
bcftools query --format '%CHROM\t%POS\t%REF\t%ALT\t[%GT\t]\n' ${regionName}_Herato_variants.vcf.gz > ${regionName}_Herato_variants.gt
#check numbers of sites in the VCF file
bcftools view -H ${regionName}_Herato.vcf.gz | wc -l #Total sites = 2812
bcftools view -H -c1 -C39 ${regionName}_Herato.vcf.gz | wc -l #Total variant sites = 2426
bcftools view -H -C0 ${regionName}_Herato.vcf.gz | wc -l #Sites fixed to REF allele = 330 (REF according to the VCF format nomenclature)
bcftools view -H -c40 ${regionName}_Herato.vcf.gz | wc -l #SItes fixed to ALT allele = 56 (ALT according to the VCF format nomenclature)
# Create sites file
# USAGE: Rscript vcf2sites.R <sample.txt> <input.gt> <chr> <begin> <end> #
Rscript $workdir/scripts/vcf2sites.R $workdir/sample/all_sample.txt ${regionName}_Herato_variants.gt $chrom $chromStart $chromEnd
cat <(head -2 ${regionName}_Herato_variants_RefAlt.sites) <(cut -f1,2 <(tail +3 ${regionName}_Herato_variants_RefAlt.sites)) > ../${regionName}_Herato.sites
# Create mask file
bcftools view -H ${regionName}_Herato_variants.vcf.gz | awk '$0 !~ /^#/ {print $1,$2-1,$2}' | bedops -d <(echo -e "$chrom\t$(($chromStart-1))\t$chromEnd") - | bgzip > ../${regionName}_Herato_mask.bed.gz
We repeated the same steps for the neutral background region, Scaffold Herato1603:3450000-3493538. In that region (43538 bp), there are 6407 sites altogether - 4926 is variant positions, while 1405 and 76 sites are fixed for one or the other allele. (see .Rmd file for full code).
Run ARGweaver
ARGweaver is run for 5000 iterations and then resumed for another 5000 iterations for the optix region, whereas only 5000 iterations for the neutral region.
# Run ARGweaver
~/.local/bin/arg-sample -s sites/${regionName}_Herato.sites \
--maskmap masks/${regionName}_Herato_mask.bed.gz \
-N 1940078 \
-m 2.9e-9 \
-r 2.9e-9 \
--ntimes 30 \
--maxtime $((1940078*20)) \
--iters 5000 \
--delta 0.01 \
--output {regionName}/sample/${regionName}
# Resume ARGweaver for more iterations
~/.local/bin/arg-sample -s sites/${regionName}_Herato.sites \
--maskmap masks/${regionName}_Herato_mask.bed.gz \
-N 1940078 \
-m 2.9e-9 \
-r 2.9e-9 \
--ntimes 30 \
--maxtime $((1940078*20)) \
--iters 10000 \
--delta 0.01 \
--output {regionName}/sample/${regionName} \
--resume
Analysis of ARGweaver output
MCMC summary
Visualizing the traces of likelihood, prior and joint probabilities from the MCMC iterations, we decided to set the first 3000 iterations as burnin. The plot below is for the optix region.

Fig S2: Traces across all MCMC iterations of prior (log probability of the sampled ARG given the model), likelihood (log probability of the data given the sampled ARG), joint (total log probability of the ARG and the data (prior + likelihood)), recombs (number of recombination events in the sampled ARG), arglen (total length of all branches summed across sites) and noncompats (number of variant sites that cannot be explained by infinite sites mutation model).
Process output from ARGweaver
The ARGweaver output .smc files are then converted to easily maniputable .bed format to extract TMRCAs, trees, recomination breakpoints and total tree branch lengths across all iterations (except burnin). Then, 2 specific iterations (Iteration 8250 and 9000) are chosen for further analysis and identification of haplotype blocks (see .Rmd file for full code, and analysis for the neutral region)
#### OPTIX ####
# Load variables
workdir=$HOME/Proj-Def-Haplo
prefix=optix_newNe
regionName=optix
chrom=Herato1801
chromStart=1362067
chromEnd=1405605
# Convert .smc to .bed format for all iterations
cd $workdir/optix_newrun/$prefix
smc2bed-all $workdir/optix_newrun/$prefix/sample/$prefix
## Check mcmc stats to check burnin
burnin=3000
# Compute TMRCA estimates for between and within popualtions from all iterations
for pop in all_sample Herato_notabilis Herato_lativitta; do
echo Summarizing $pop
arg-summarize --arg-file ./sample/$prefix.bed.gz \
--subset-inds $workdir/sample/$pop.txt \
--log-file ./sample/$prefix.log \
--tmrca \
--burnin $burnin \
--mean --stdev --quantile 0.05,0.5,0.95 |\
bgzip > $prefix.tmrca.$pop.bed.gz
echo " -> Done"
done
## The same is repeated for the neutral region below
## Compute TMRCA, and branch length from individual iterations
stats_list=(tmrca branchlen)
iters_list=(8250 9000)
for pop in all_sample Herato_notabilis Herato_lativitta; do
echo Summarizing $pop
for sample_no in ${iters_list[*]}; do
echo -n "Computing $sample_no"
for stat in ${stats_list[*]}; do
echo -n "Computing $stat"
arg-summarize --arg-file ./sample/$prefix.bed.gz \
--subset-inds $workdir/sample/$pop.txt \
--sample $sample_no \
--log-file ./sample/$prefix.log \
--$stat \
--mean --quantile 0.05,0.5,0.95 |\
bgzip > $prefix.$stat.$pop.${sample_no}.bed.gz
echo " -> Done"
done
done
done
Read SNPs
SNPs are changed from its nucloetide asignment to 0/1; based on allele frequency within the H.e.lativitta samples. (Higher allele frequency - 1, Lower allele frequency - 0)
## Read all SNPs
optix_sites <- readSites("~/Proj-Def-Haplo/optix_newrun/sites/optix_Herato.sites")
#
## Read all sites (SNPs+invariant)
optixALL_sites <- readSites("~/Proj-Def-Haplo/optix_newrun/sites/optixALL_Herato.sites")
#
## Change SNPs to 0 and 1 based on major allele in lativitta
sites<- data.frame(pos=optix_sites$pos, optix_sites$sites)
for (snpID in c(1:nrow(sites))){
cat(snpID, sites[snpID,'pos'])
alleles <- sites[snpID,sampID$ID]
alleles <- as.factor(alleles)
alleleCount1 <- sum(alleles[lat] == levels(alleles)[1])
alleleCount2 <- sum(alleles[lat] == levels(alleles)[2])
majAllele <- ifelse(alleleCount1 > alleleCount2, levels(alleles)[1], levels(alleles)[2])
minAllele <- setdiff(levels(alleles), majAllele)
sites[snpID,which(sites[snpID,] == majAllele)] <- 1
sites[snpID,which(sites[snpID,] == minAllele)] <- 0
}
TMRCA
We first examine the TMRCA of the total tree and the individual populations - H.e.lativitta (in red) and H.e.notabilis (in yellow). Unlike the neutral region (Herato1603) where all TMRCA estimates are fairly constrained between 1 and 10 \(N_e\), TMRCA traces for individual populations in the optix region exhibit shallow coalescence times at multiple positions throughout the entire ~50kb region. In case of a selective sweeps, where a beneficial mutation sweeps through the population, TMRCA tends to be shallow since all samples in the swept population coalesces quickly to the initial “lucky” ancestor. Moving away from the focal mutations, lineages recombine away from the swept ancestral background. The TMRCA estimates from the optix region is consistent with the previously identified selected region in optix. To investigate further the haplotype block structures, we focus into a 3kb region at optix:1385966-1388966.

Fig S3: TMRCA (Ne) for each position in the optix (top panel) and the neutral genomic region (bottom panel). Black line: total TMRCA, Red: Median TMRCA for H.e.lativitta, Yellow: TMRCA for H.e.notabilis. Corresponding shaded regions are the 0.05 and 0.95 percentile intervals.
In the focal region, there are 137 SNPs.

FigS4: TMRCA for each position in focal genomic region (optix:1385966-1388966). Same colour schemes as above. Top panel: TMRCA plot, Bottom panel: SNPs in both populations; RED and YELLOW alleles are variant positions, coloured according to the respective higher and lower allele counts in the H.e.lativitta population. BLACK alleles are fixed (invariant) in both populations. All other positions (in white) are unknown, since those positions do not have SNP information. These unknown positions are masked while running ARGWeaver, therefore they are treated as missing information and NOT as invariant sites.
Masking sites can have a critical effect on sampling ARGs, since invariant sites can shift priors towards a recent coalescence (because there hasn’t been enough time yet for a mutation to occur in any of the branches), whereas missing information does not shift priors and therefore ARGs are sampled neutrally in those regions. In the above region of 3kb, there are 137 SNPs and 16 invariant positions. Out of the 137 SNPs, only 59 SNPs are segregating within the H.e.lativitta population in focus. Below is the folded SNP frequency spectrum.

FigS5: Folded SNP frequency spectrum of the focal region, optix:1385966-1388966.
Hereafter, alleles with lower frequency within the H.e.lativitta population is referred to as the minor allele, and individuals who share the minor allele is the minor clade. Conversely, individuals that carry that major allele is called major clade. Minor allele is coded as 0 and major as 1.
Iteration 1 (MCMC iter: 8250)
We extract ARGs sampled by ARGweaver in its MCMC iteration: 8250, and estimate the TMRCA (total and within population), total branch length of each marginal tree and the recombination breakpoints. Altogether in the ~50kb region, there are 6571 trees (and 6570 recombination events), of which only 464 trees are present in the focal 3kb region, optix:1385966-1388966.
We investigate the distribution of branch length of trees with respect to their genomic spans (Fig S6). We find average tree span in the whole region is ~7 bp. We expect from theory assuming standard coalescence, \(P_r(d \mid \tau)\)=\(\frac {\rho}{2}L(\tau)\) exp\([-\frac {\rho}{2}L(\tau) d]\) where \(L(\tau)\) is in coalescent unit of \(2N_e\) generations and \(\frac {\rho}{2}\)=\(2N_er\) denotes the population-scaled recombination rate per bp. Given the parameters, and mean \(L(\tau)\) = 2, this gives mean d = \(\frac {1} {\frac {\rho}{2}L(\tau)}\) ~ 45bp. It seems in this region, ARGweaver seems to change the tree topology more often than the expected mean. We have not explored this closely (beyond the main reach of this analysis), however, we note that although the ARGs are consistent with the data, we need to take caution in biological inference from these sampled ARGs.
## Read all 6571 tree lengths ----
treelen1 <- readArgSummary("~/Proj-Def-Haplo/optix_newrun/optix_newNe/optix_newNe.branchlen.all_sample.8250.bed.gz")
treelen1$treespan <- treelen1$chromEnd-treelen1$chromStart
## Read 464 tree lengths in the focal genomic region ----
treelen1_sub <- subset_genomic_interval(treelen1, chromStart, chromEnd)
mean(treelen1$treespan) # 6.625932
mean(treelen1_sub$treespan) # 6.460215
## Plot ----
par(mfrow=c(1,3), cex.main=1.5, cex.lab=1.7, cex.axis=1.5)
# Histogram
hist(treelen1$treespan, xlab = "Tree Span")
# Branch Length vs Tree span (all 6571 trees)
plot(treelen1$treespan, treelen1$branchlen_mean, xlab="Tree Span", ylab="Tree Length")
points(aggregate(treelen1$branchlen_mean~treelen1$treespan, FUN=mean), col="red", pch=19, cex=1.5)
# Branch Length vs Tree span (464 trees in the focal genomic region)
plot(treelen1_sub$treespan, treelen1_sub$branchlen_mean, xlab="Tree Span", ylab="Tree Length")
points(aggregate(treelen1_sub$branchlen_mean~treelen1_sub$treespan, FUN=mean), col="red", pch=19, cex=1.5)

FigS6: Left: Histogram of tree spans; Center: Branch length vs tree span for all 6571 marginal trees along the genome; Right: Branch length vs tree spans for 464 trees in the focal genomic region. RED points are average tree lengths for each tree span.
For our analysis, we strictly focus on identifying edges supported by SNPs. In practice, this is done in few steps - First, for each sampled tree along the genome (= 464 trees), we extract the following information - ancestral and descendant node of each edge, edge height (=length), time-point of the descendant node, i.e., when the edge originated (=depth, NOTE: due to rounding errors in Newick format, we round the depth values to 3 significant digits) and the samples (=tips.from.dec) that each edge is ancestral to. NOTE: Although we are identifying haplotype blocks in the H.e.lativitta (red) population, we use genealogical trees that include both populations. This is done in order to estimate the edge height of the most recent common ancestor to all individuals in the red population. Ideally for making biological inferences from only one population, this need not be done. However in our analysis, in order to illustrate the haplotype block patterns generated in a selected genomic region, we decided to incorporate both populations to generate trees along the genome. Nevertheless, it is important to note that the identified haplotype blocks will stay the same irrespective of which and how many populations are included in the analysis; however, the edge height will change along the genome.
## Store edge information of all 464 trees as a list
tr1_sub_list <- extract_edges(tr1_sub)
Second, for every tree at each SNP position (= 137 SNPs), we identify most recent tree node that is commonly ancestral to all individuals that share the same allele. (Note: this assumes infinite sites mutation model, which is generally the default option in ARGweaver). This allows identification of 1 node for the major and 1 for the minor allele at each SNP position, hereafter called major and minor node for each tree.
The table below illustrates information from the tree whose genomic span coincides with the first SNP position - 1385985.
66 |
75 |
3.1531464 |
0.2364560 |
18, 22 |
75 |
18 |
0.2364539 |
0.0000021 |
18 |
75 |
22 |
0.2364539 |
0.0000021 |
22 |
66 |
72 |
2.4942915 |
0.8953109 |
39, 33, 11, 34, 0 , 14, 3 , 9 , 37, 15, 32, 25, 21, 19, 30, 17, 26, 13 |
72 |
76 |
0.7979937 |
0.0973172 |
39, 33 |
76 |
39 |
0.0973136 |
0.0000036 |
39 |
76 |
33 |
0.0973136 |
0.0000036 |
33 |
72 |
55 |
0.3208752 |
0.5744357 |
11, 34, 0 , 14, 3 , 9 , 37, 15, 32, 25, 21, 19, 30, 17, 26, 13 |
55 |
40 |
0.4227382 |
0.1516975 |
11, 34 |
40 |
11 |
0.1516965 |
0.0000010 |
11 |
40 |
34 |
0.1516965 |
0.0000010 |
34 |
55 |
68 |
0.0000000 |
0.5744357 |
0 , 14, 3 , 9 , 37, 15, 32, 25, 21, 19, 30, 17, 26, 13 |
68 |
0 |
0.5744357 |
0.0000000 |
0 |
68 |
63 |
0.0000000 |
0.5744357 |
14, 3 , 9 , 37, 15, 32, 25, 21, 19, 30, 17, 26, 13 |
63 |
14 |
0.5744357 |
0.0000000 |
14 |
63 |
46 |
0.2058814 |
0.3685543 |
3 , 9 , 37, 15, 32, 25, 21, 19, 30, 17, 26, 13 |
46 |
60 |
0.0000000 |
0.3685543 |
3, 9 |
60 |
3 |
0.3685527 |
0.0000015 |
3 |
60 |
9 |
0.3685527 |
0.0000015 |
9 |
46 |
44 |
0.0000000 |
0.3685543 |
37, 15, 32, 25, 21, 19, 30, 17, 26, 13 |
44 |
57 |
0.0000000 |
0.3685543 |
37, 15, 32, 25, 21, 19, 30, 17 |
57 |
37 |
0.3685527 |
0.0000015 |
37 |
57 |
77 |
0.0000000 |
0.3685543 |
15, 32, 25, 21, 19, 30, 17 |
77 |
52 |
0.0000000 |
0.3685543 |
15, 32, 25 |
52 |
15 |
0.3685527 |
0.0000015 |
15 |
52 |
54 |
0.3285208 |
0.0400334 |
32, 25 |
54 |
32 |
0.0400319 |
0.0000015 |
32 |
54 |
25 |
0.0400319 |
0.0000015 |
25 |
77 |
53 |
0.1320988 |
0.2364554 |
21, 19, 30, 17 |
53 |
21 |
0.2364539 |
0.0000015 |
21 |
53 |
65 |
0.0000000 |
0.2364554 |
19, 30, 17 |
65 |
73 |
0.0847579 |
0.1516975 |
19, 30 |
73 |
19 |
0.1516965 |
0.0000010 |
19 |
73 |
30 |
0.1516965 |
0.0000010 |
30 |
65 |
17 |
0.2364539 |
0.0000015 |
17 |
44 |
71 |
0.2712391 |
0.0973152 |
26, 13 |
71 |
26 |
0.0973136 |
0.0000015 |
26 |
71 |
13 |
0.0973136 |
0.0000015 |
13 |
-1 |
66 |
16.6103976 |
3.3896024 |
18, 22, 39, 33, 11, 34, 0 , 14, 3 , 9 , 37, 15, 32, 25, 21, 19, 30, 17, 26, 13 |
Table S1: For the above tree at SNP position 1385985, individuals 18 and 22 share allele 0, and therefore their most recent common ancestor is node 75 (called as minor node) which originated at time (in \(N_e\)) = 0.236 and has a length of 3.153 (in \(N_e\)).
Third, for each minor node (and major node if the SNP is fixed within the H.e.lativitta population) identified from trees at each of the 137 SNPs, we identify all trees along the genome which contains that unique ancestral node. Since we do not know the ancestral reference allele at each SNP position and we assume infinite sites mutation, any minor node that is ancestral exclusively to the minor clade is assumed to contain the causal alternate allele and is considered as a branch on which a mutation has occurred. The table below shows the extent of the above node 75 that is ancestral to indnividuals 18 and 22 at SNP position 1385985. (NOTE: only 10 rows showed for illustration. In other words, this is how the haplotype blocks are encoded as edges.
2 |
55 |
42 |
0.0000000 |
0.236454 |
18, 22 |
1 |
1385966 |
1385966 |
1 |
0.000000 |
21 |
55 |
42 |
0.0000000 |
0.236454 |
18, 22 |
2 |
1385967 |
1385967 |
1 |
0.000000 |
22 |
55 |
42 |
0.0000000 |
0.236454 |
18, 22 |
3 |
1385968 |
1385968 |
1 |
0.000000 |
23 |
55 |
42 |
0.0000000 |
0.236454 |
18, 22 |
4 |
1385969 |
1385972 |
4 |
0.000000 |
24 |
55 |
42 |
0.0000000 |
0.236454 |
18, 22 |
5 |
1385973 |
1385973 |
1 |
0.000000 |
25 |
55 |
42 |
0.0000000 |
0.236454 |
18, 22 |
6 |
1385974 |
1385993 |
20 |
0.000000 |
41 |
55 |
42 |
0.3379803 |
0.236454 |
18, 22 |
7 |
1385994 |
1386005 |
12 |
4.055763 |
36 |
55 |
42 |
0.3379803 |
0.236454 |
18, 22 |
8 |
1386006 |
1386008 |
3 |
1.013941 |
34 |
55 |
42 |
0.3379803 |
0.236454 |
18, 22 |
9 |
1386009 |
1386022 |
14 |
4.731724 |
341 |
55 |
42 |
0.3379803 |
0.236454 |
18, 22 |
10 |
1386023 |
1386025 |
3 |
1.013941 |
Table S2: Haplotype block as a unique edge that contains the mutation at pos: 1385985, shared between individuals 18, 22. Each row represents 1 tree. For example, tree 7 spans from 1385994 to 1386005, is 12 bp long; the haplotype block at this tree has a height of 0.3379 and therefore an area (height x tree span) of 4.055763.
Following the above steps, we can identify each major/minor node ancestral to each major/minor clade and occurs at a particular time point in the past, that in practice informs us of all the edges informed by SNPs. With 137 SNPs, we have 36 edges supported by SNPs (table below).
0.3686 |
9 , 3 , 11, 21, 14, 34, 32, 25, 26, 13 |
8 |
1 |
1 |
3.3896 |
30, 17, 22, 0 , 33, 18, 15, 19, 39, 37 |
11, 12 |
2 |
2 |
2.1748 |
15, 37, 3 , 19, 14, 21, 22, 0 , 39, 33 |
111 |
1 |
3 |
0.5744 |
3 , 15, 19, 14, 25, 33, 9 , 21, 22 |
99, 100 |
2 |
4 |
0.8953 |
37, 3 , 19, 14, 21, 22, 0 , 39, 33 |
106 |
1 |
5 |
1.3954 |
30, 0 , 39, 33, 18, 34, 17, 32, 26 |
113 |
1 |
6 |
0.2365 |
39, 11, 18, 17, 34, 32, 26 |
102 |
1 |
7 |
2.1748 |
17, 33, 18, 15, 19, 22 |
7 |
1 |
8 |
0.3686 |
21, 14, 32, 25, 26, 13 |
13 |
1 |
9 |
0.8953 |
30, 0 , 39, 11, 19, 14 |
88 |
1 |
10 |
NA |
19, 0, 33, 30, 22 |
6 |
1 |
11 |
NA |
9, 13, 14, 37, 22 |
40 |
1 |
12 |
0.5744 |
30, 18, 32, 34 |
61 |
1 |
13 |
1.3954 |
34, 30, 17 |
9, 10, 15 |
3 |
14 |
0.2365 |
14, 9 , 37 |
43 |
1 |
15 |
0.2365 |
9 , 33, 19 |
80 |
1 |
16 |
0.2365 |
18, 22 |
1, 2, 3 |
3 |
17 |
0.1517 |
22, 17 |
56 |
1 |
18 |
0.0624 |
21, 22 |
92, 122, 126 |
3 |
19 |
NA |
9, 11 |
97 |
1 |
20 |
1.3954 |
30, 13 |
105 |
1 |
21 |
0.0000 |
11 |
5 |
1 |
22 |
0.0000 |
30 |
14, 34, 70, 91, 110, 120, 127, 131, 136 |
9 |
23 |
0.0000 |
34 |
17, 109 |
2 |
24 |
0.0000 |
13 |
25, 26, 27, 29, 32, 33, 35, 36, 37, 38, 41, 112, 116, 118 |
14 |
25 |
0.0000 |
17 |
31 |
1 |
26 |
0.0000 |
3 |
60 |
1 |
27 |
0.0000 |
15 |
108 |
1 |
28 |
0.0000 |
18 |
117 |
1 |
29 |
3.3896 |
ALL |
4, 16, 42, 54, 62, 64, 65, 66, 93, 94, 95, 96, 98, 101 |
14 |
30 |
2.1748 |
ALL |
18, 19, 21, 22, 23, 24, 63, 119, 121, 123, 124, 125, 128, 129, 135, 137 |
16 |
31 |
1.3954 |
ALL |
20, 57, 58, 59, 67, 68, 69, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 82, 83, 84, 85, 86 |
22 |
32 |
12.8325 |
ALL |
28, 30, 39 |
3 |
33 |
0.8953 |
ALL |
44, 45, 46, 47, 48, 49, 50, 51, 52 |
9 |
34 |
5.2829 |
ALL |
53, 87, 103, 104, 130, 132 |
6 |
35 |
8.2336 |
ALL |
55, 89, 90, 107, 114, 115, 133, 134 |
8 |
36 |
Table S3: All 36 haplotype blocks as edges from iteration 8250; each row represents an unique edge. As mentioned above, each edge is defined uniquely by its origin time-point (=depth) and the set of descendent individuals (=tips.from.dec). For example, edge 17 (same edge as the Table S1, S2 refers to) originates at 0.2365 (time in Ne), shared by individuals 18, 22 and is supported by 3 SNPs at position (rank sum order) 1,2,3. Edges with depth = NA refers to SNPs that are incompatible with the infinite sites mutation model, and hence cannot be explained by one mutation event.
It is important to clarify that this strategy of identifying nodes across successive trees along the genome can easily be translated to the mathematical notation used for the haplotype block simulations. For example, the above edge can be represented as {0, {34}, {{18,22},{}}}, i.e., the ancestral genome on this haplotype block carries the allele 0 at SNP position 1; and is broken down into regions defined by trees {1,93}, {94,464} and is ancestral to samples {18,22},{} respectively.
Haplotype block visualization
For visualization, we choose to only plot the edges that are supported by 3 or more SNPs. Moreover, for SNPs that are fixed in the red population, we only show edges that originate at any time-point below 5\(N_e\). This leaves us with only 8 edgee. (see .Rmd file for full code)
## Select edges supported by 3 or more SNPs
imp_edges1 <- edges1[edges1$No.of.SNPs >=3 & edges1$depth <=5 , ] #12 important edges
nrow(imp_edges1)
imp_edges1 <- imp_edges1[order(imp_edges1$depth),]
par(mfrow=c(2,1), cex.lab=1.5, cex.axis=1.3)
plot_all_sites(focal_sites, as.character(1)) +
axis(side =2, tick =T, at = c(21:40), labels = F) +
points(bl1_c1$pos, rep(which(sampID$ARGweaverID == c(30)),nrow(bl1_c1)), col='black', pch=19, cex = 1)
plot(hapblock1_c1$chromStart, hapblock1_c1$length, type="n", log="y", ylim=c(0.05,15), xlab="Position (bp)", ylab="Time (Ne)", xlim=c(chromStart, chromEnd)) +
plot_hapblock(hapblock1_c1_all, "black", T)

Fig S7: Haplotype block that is supported by singletons. These edges originate directly from the tree tips, ie, the samples and therefore extends all along the genomic region. Although for biological inference, singletons are often uninformative, this shows the feature of an edge supported by singletons, which extends all along the genome, is normally shallow with certain regions that go are high, where a lot of singleton clusters together. These are normally regions of the genome, which have recombined out and goes all the way back to an ancestor in the deep past. The orange block shows clustering of SNPs in the higher region of the edges, whereas, the green edge shows how SNPs can also occur by chance at other regions.
Now, ploting all the haploype blocks except singletons (Fig S8), same as the figure in main text (see Main Text for caption).

See .Rmd file for all 137 trees at each SNP position.
Case 2: iteration 9000
There are altogether 6457 trees in the ~50kb region, and 425 in the focal genomic region.
Hapotype blocks from both iterations
FigS9 shows the haplotype blocks and the SNPs that support each block from both iterations.

---
title: "ARGweaver Analysis"
output:
  html_notebook: default
  pdf_document: default
editor_options:
  chunk_output_type: inline
---

```{r include=F}
knitr::opts_chunk$set(echo=F)
```

```{bash}
## Load all modules
module load samtools bedops bcftools R 
source ~/.bashrc
export PYTHONPATH=$HOME/.local/bin
workdir="/nfs/scistore03/bartogrp/apal/Proj-Def-Haplo"
```

## Running ARGweaver

#### Sample information
ARGweaver is applied on an empirical, phased dataset of *Heliconius erato* butterflies, generated by haplotagging, a technique that produces synthetic linked-read sequence data. We used parts of two scaffolds - Herato1801:1362067-1405605 (coincides with the previously identified gene *optix* that has undergone a selective sweep) and Herato1603:3450000-3493538 (a neutral background region) (Supplementary Data 1). This dataset is previously published in Meier et al, 2021. 10 individuals of *H.e.lativitta* and *H.e.notabilis* were chosen from opposite ends of the hybrid zone transect for ARG inference (Supplementary Data 2).
```{bash}
#select samples
tail +2 sample.csv | sed 's/\,/\t/' | cut -f 1 > all_sample.txt
```

#### Preprocessing files for ARGweaver
To preprocess the input files for ARGweaver, we first subset our 20 diploid individuals from the full VCF files (see Supplementary information in Meier et al, 2021), and then convert the SNP information into the *.sites* format required for ARGweaver. The *.sites* format only contains information on the positions that are varying within the 20 individuals that we chose. (see *.Rmd* file for full code)

In the *optix* region (43538 bp long), there are 2812 sites altogether - 2426 are variant positions, while 330 and 56 sites are fixed for one or the other allele. Genomic positions that are neither variant nor fixed to one or the other allele (in other words, positions absent in the VCF file) are considered missing information and therefore masked from being used as input data for ARGweaver. Altogether, ARGweaver uses information of variant alleles and invariant alleles; whereas the rest is masked and treated as missing information.

```{bash, echo=T, warning=F, results=F}
## OPTIX: Herato1801:1362067-1405605
# Initiate variables
regionName=optix
chrom=Herato1801
chromStart=1362067
chromEnd=1405605
# Subset samples from full VCF 
cd ~/Proj-Def-Haplo/VCFs/
bcftools view -S ../sample/all_sample.txt -O z \
PC062_merged_Herato1801.PL.AD.HAPCUT2.site_all.vcf.gz > ${chrom}_${chromStart}-${chromEnd}_subset.vcf.gz
tabix ${chrom}_${chromStart}-${chromEnd}_subset.vcf.gz
```

```{bash, echo=F}
# Make new folder for optix
mkdir ~/Proj-Def-Haplo/${regionName}_newrun
mkdir ~/Proj-Def-Haplo/${regionName}_newrun/VCFs
cd ~/Proj-Def-Haplo/${regionName}_newrun/VCFs/
# Make symbolic link to the orginal file
ln -s /nfs/scistore03/bartogrp/apal/Proj-Def-Haplo/VCFs/${chrom}_${chromStart}-${chromEnd}_subset.vcf.gz ${regionName}_Herato.vcf.gz
ln -s /nfs/scistore03/bartogrp/apal/Proj-Def-Haplo/VCFs/${chrom}_${chromStart}-${chromEnd}_subset.vcf.gz.tbi ${regionName}_Herato.vcf.gz.tbi
```

```{bash, echo=T}
# Keep only the sites that are variant to this subset of 20 diploid individuals
bcftools view -c1 -C39 -Oz ${regionName}_Herato.vcf.gz > ${regionName}_Herato_variants.vcf.gz
bcftools query --format '%CHROM\t%POS\t%REF\t%ALT\t[%GT\t]\n' ${regionName}_Herato_variants.vcf.gz > ${regionName}_Herato_variants.gt
```

```{bash, echo=T}
#check numbers of sites in the VCF file
bcftools view -H ${regionName}_Herato.vcf.gz | wc -l #Total sites = 2812
bcftools view -H -c1 -C39 ${regionName}_Herato.vcf.gz | wc -l #Total variant sites = 2426
bcftools view -H -C0 ${regionName}_Herato.vcf.gz | wc -l #Sites fixed to REF allele = 330 (REF according to the VCF format nomenclature)
bcftools view -H -c40 ${regionName}_Herato.vcf.gz | wc -l #SItes fixed to ALT allele = 56 (ALT according to the VCF format nomenclature)
```

```{bash, echo=T}
# Create sites file 
# USAGE: Rscript vcf2sites.R <sample.txt> <input.gt> <chr> <begin> <end> #
Rscript $workdir/scripts/vcf2sites.R $workdir/sample/all_sample.txt ${regionName}_Herato_variants.gt $chrom $chromStart $chromEnd
cat <(head -2 ${regionName}_Herato_variants_RefAlt.sites) <(cut -f1,2 <(tail +3 ${regionName}_Herato_variants_RefAlt.sites)) > ../${regionName}_Herato.sites
```

```{bash, echo=T}
# Create mask file
bcftools view -H ${regionName}_Herato_variants.vcf.gz | awk '$0 !~ /^#/ {print $1,$2-1,$2}' | bedops -d <(echo -e "$chrom\t$(($chromStart-1))\t$chromEnd") - | bgzip > ../${regionName}_Herato_mask.bed.gz
```

We repeated the same steps for the neutral background region, Scaffold Herato1603:3450000-3493538. In that region (43538 bp), there are 6407 sites altogether - 4926 is variant positions, while 1405 and 76 sites are fixed for one or the other allele. (see *.Rmd* file for full code). 

```{bash, echo=F}
## Neutral: Herato1603:3450000-3493538

# Initiate variables
regionName=Herato1603-1 # we call it Herato1603-1 since we subset exactly the same genomic length (43538 bp) as the optix region
chrom=Herato1603
chromStart=3450000
chromEnd=3493538
cd ~/Proj-Def-Haplo/VCFs/

# Subset samples from full VCF 
bcftools view -S ../sample/all_sample.txt \
  -r $chrom:$chromStart-$chromEnd \
  -Oz PC062_merged_Herato1603.3.45Mb.PL.AD.HAPCUT2.vcf.gz > ${chrom}_${chromStart}-${chromEnd}_subset.vcf.gz
tabix ${chrom}_${chromStart}-${chromEnd}_subset.vcf.gz

# Make new folder for Herato1603
mkdir ~/Proj-Def-Haplo/${regionName}_newrun
mkdir ~/Proj-Def-Haplo/${regionName}_newrun/VCFs
cd ~/Proj-Def-Haplo/${regionName}_newrun/VCFs/

ln -s $workdir/VCFs/${chrom}_${chromStart}-${chromEnd}_subset.vcf.gz ${regionName}_Herato.vcf.gz
ln -s $workdir/VCFs/${chrom}_${chromStart}-${chromEnd}_subset.vcf.gz.tbi ${regionName}_Herato.vcf.gz.tbi

# Keep only the sites that are variant to this subset of 20 diploid individuals
bcftools view -c1 -C39 -Oz ${regionName}_Herato.vcf.gz > ${regionName}_Herato_variants.vcf.gz
bcftools query --format '%CHROM\t%POS\t%REF\t%ALT\t[%GT\t]\n' ${regionName}_Herato_variants.vcf.gz > ${regionName}_Herato_variants.gt

#check numbers of sites in the VCF file
bcftools view -H ${regionName}_Herato.vcf.gz | wc -l #Total sites = 6407
bcftools view -H -c1 -C39 ${regionName}_Herato.vcf.gz | wc -l #Total variant sites = 4926
bcftools view -H -C0 ${regionName}_Herato.vcf.gz | wc -l #Sites fixed to REF allele = 1405
bcftools view -H -c40 ${regionName}_Herato.vcf.gz | wc -l #Sites fixed to ALT allele = 76

#Create sites file 
#USAGE: Rscript vcf2sites.R <sample.txt> <input.gt> <chr> <begin> <end>
Rscript $workdir/scripts/vcf2sites.R $workdir/sample/all_sample.txt ${regionName}_Herato_variants.gt $chrom $chromStart $chromEnd
cat <(head -2 ${regionName}_Herato_variants_RefAlt.sites) <(cut -f1,2 <(tail +3 ${regionName}_Herato_variants_RefAlt.sites)) > ../${regionName}_Herato.sites

## Create mask file
bcftools view -H ${regionName}_Herato_variants.vcf.gz | awk '$0 !~ /^#/ {print $1,$2-1,$2}' | bedops -d <(echo -e "$chrom\t$(($chromStart-1))\t$chromEnd") - | bgzip > ../${regionName}_Herato_mask.bed.gz
```

#### Input parameters
In order to run ARGweaver, we consider *$\mu$* (mutation rate) = $2.9 \times 10^{-9}$ per bp per generation, and its ratio to recombination rate, $\frac{\mu}{r}$ = 1. We estimate $N_e$ by calculating Tajima's $\pi$ (= *$4N_e r$*) from the neutral region. *$\pi$* = 0.0225049, $N_e$ = 1940078. The total map length of both *optix* and neutral region is 0.01262602 cM. (see *.Rmd* file for full code)
```{bash, echo=F}
## Calculating pi from original Herato1603 scaffold pos:3.45-3.55 (1MB)
# This outputs pi per site 
vcftools --gzvcf $workdir/VCFs/PC062_merged_Herato1603.3.45Mb.PL.AD.HAPCUT2.vcf.gz \
--site-pi \
--keep $workdir/sample/all_sample.txt \
--out Herato1603
# Sum of pi estimates across all sites
awk '{SUM+=$3} END {print SUM}' Herato1603.sites.pi 
```
```{r, echo=F}
## Calculatig Ne from Herato1603:3450000-3550000
r=2.9e-09 #recombination rate
L=3550000-3450000 #Genomic length
n=40 # no. of samples
pi_sum = 2250.49 # Sum of pi estimates across all sites
pi = pi_sum/L #pi estimate
S=10571 # No of seggregating sites in this region 
Ne_tajima=pi/(4*r) 
Ne_watterson=S/(4*r*L)*(sum(c(1:(n-1))^(-1)))^-1 # Watterson's Ne for sanity check
cat(pi, Ne_tajima, Ne_watterson, Ne_tajima/Ne_watterson)  
# 0.0225049 1940078 2142433 0.9055488

## Calculating map length
2.9e-9*100*43538
```
Moreover, ARGweaver allows coalescence and recombination events to take place only at discretized time points, that are defined by an exponential function, $t(i)=\frac {exp(\frac{ i}{K−1}log(1+\delta t_{max}))−1}{\delta}$; for *K* time points and $i\in{\{0, 1, …, K−1\}}$. Very small values of $\delta ( <\frac {1}{t_{max}})$ will yield roughly linear distribution of times, whereas larger values of $\delta$ will place more time points in recent past and less in deep past. For this analysis, we set *K* = 30, the maximum time for total coalescence (*$t_{max}$*) = 20*$N_e$*, and the shape parameter (*$\delta$*) = 0.01. 
```{r fig1,  fig.width=6, fig.height=6, echo=F, results = F}
timepoints=function(K, delta, tmax){(exp((seq(K)-1)/(K-1)*log(1+delta*(tmax)))-1)/delta} 

par(mar = c(5, 4, 4, 4) + 1)
Ne=1940078; K=30; tmax=20*Ne; 
#data.frame(Ne2 = timepoints(K, 1e-2, tmax2)/Ne2, time = timepoints(K, 1e-2, tmax2))
plot(seq(K),timepoints(K, 1e-2, tmax)/Ne, ylim=c(0,20), ylab = "Time (Ne)", xlab = "Time steps", type = 'o', pch=20, col="black", lwd=3, main="Discrete Time Points") +
  abline(h=1, lwd=2, col="red") +
  axis(side = 4, at=c(0, 1e7, 2e7, 3e7, 4e7)/Ne, labels=c(0,1e7, 2e7, 3e7, 4e7)) +
  mtext("Time (generations)", side = 4, line = 3)
```
**Fig S1:** Discrete time points for ARGweaver analysis - 30 time steps, $N_e$ = 1940078, Left y-axis shows the time points in $N_e$, while right y-axis in number of generations. Red line denotes 1$N_e$.



#### Run ARGweaver

ARGweaver is run for 5000 iterations and then resumed for another 5000 iterations for the *optix* region, whereas only 5000 iterations for the neutral region. 
```{bash, echo=T}
# Run ARGweaver
~/.local/bin/arg-sample -s  sites/${regionName}_Herato.sites \
	--maskmap masks/${regionName}_Herato_mask.bed.gz \
	-N 1940078 \
	-m 2.9e-9 \
	-r 2.9e-9 \
	--ntimes 30 \
	--maxtime $((1940078*20)) \
	--iters 5000 \
	--delta 0.01 \
	--output {regionName}/sample/${regionName}

# Resume ARGweaver for more iterations
~/.local/bin/arg-sample -s sites/${regionName}_Herato.sites \
	--maskmap masks/${regionName}_Herato_mask.bed.gz \
	-N 1940078 \
	-m 2.9e-9 \
	-r 2.9e-9 \
	--ntimes 30 \
	--maxtime $((1940078*20)) \
	--iters 10000 \
	--delta 0.01 \
	--output {regionName}/sample/${regionName} \
	--resume
```



## Analysis of ARGweaver output

```{r}
## Load packages and initiate functions
source(file = "~/Proj-Def-Haplo/scripts/functions.R")
```

#### MCMC summary

Visualizing the traces of likelihood, prior and joint probabilities from the MCMC iterations, we decided to set the first 3000 iterations as burnin. The plot below is for the *optix* region.
```{r fig2, fig.width=10, fig.height=5, echo=F, message=F, results=F}
#MCMC summary
mcmc_summary <- "~/Proj-Def-Haplo/optix_newrun/optix_newNe/sample/optix_newNe.stats"
plotTraces(mcmc_summary, stats=c("prior", "likelihood", "joint", "recombs", "arglen", "noncompats"))
```
**Fig S2:** Traces across all MCMC iterations of **prior** (log probability of the sampled ARG given the model), **likelihood** (log probability of the data given the sampled ARG), **joint** (total log probability of the ARG and the data (*prior* + *likelihood*)), **recombs** (number of recombination events in the sampled ARG), **arglen** (total length of all branches summed across sites) and **noncompats** (number of variant sites that cannot be explained by infinite sites mutation model).

#### Process output from ARGweaver
The ARGweaver output *.smc* files are then converted to easily maniputable *.bed* format to extract TMRCAs, trees, recomination breakpoints and total tree branch lengths across all iterations (except burnin). Then, 2 specific iterations (Iteration 8250 and 9000) are chosen for further analysis and identification of haplotype blocks (see *.Rmd* file for full code, and analysis for the neutral region)

```{bash, echo=F}
## Load modules and source .bashrc 
module load samtools bedops bcftools R 
source ~/.bashrc
export PYTHONPATH=$HOME/.local/bin
```

```{bash echo=T}
#### OPTIX ####

# Load variables
workdir=$HOME/Proj-Def-Haplo
prefix=optix_newNe
regionName=optix
chrom=Herato1801
chromStart=1362067
chromEnd=1405605

# Convert .smc to .bed format for all iterations
cd $workdir/optix_newrun/$prefix
smc2bed-all $workdir/optix_newrun/$prefix/sample/$prefix

## Check mcmc stats to check burnin
burnin=3000
# Compute TMRCA estimates for between and within popualtions from all iterations
for pop in all_sample Herato_notabilis Herato_lativitta; do
    echo Summarizing $pop
    arg-summarize --arg-file ./sample/$prefix.bed.gz \
			 --subset-inds $workdir/sample/$pop.txt \
			 --log-file ./sample/$prefix.log \
			 --tmrca \
			 --burnin $burnin \
			 --mean --stdev --quantile 0.05,0.5,0.95 |\
		bgzip > $prefix.tmrca.$pop.bed.gz
		echo " -> Done"
done

## The same is repeated for the neutral region below
```

```{bash echo=T}
## Compute TMRCA, and branch length from individual iterations
stats_list=(tmrca branchlen)
iters_list=(8250 9000)

for pop in all_sample Herato_notabilis Herato_lativitta; do
    echo Summarizing $pop
    for sample_no in ${iters_list[*]}; do
    	echo -n "Computing $sample_no"
    	for stat in ${stats_list[*]}; do
    	  echo -n "Computing $stat"
    	  arg-summarize --arg-file ./sample/$prefix.bed.gz \
			    --subset-inds $workdir/sample/$pop.txt \
			    --sample $sample_no \
			    --log-file ./sample/$prefix.log \
			    --$stat \
			    --mean --quantile 0.05,0.5,0.95 |\
			  bgzip > $prefix.$stat.$pop.${sample_no}.bed.gz
			  echo " -> Done"
			done
    done
done
```

```{bash echo=F}
## Get trees from iteration 8000 and 9000
zcat $prefix.8000.smc.gz > $prefix.8000.smc
tail +3 $prefix.8000.smc | grep TREE > $prefix.8000_trees.txt
tail +3 $prefix.8000.smc | grep SPR > $prefix.8000_SPR.txt

zcat $prefix.9000.smc.gz > $prefix.5000.smc
tail +3 $prefix.9000.smc | grep TREE > $prefix.9000_trees.txt
tail +3 $prefix.9000.smc | grep SPR > $prefix.9000_SPR.txt

zcat $prefix.8250.smc.gz > $prefix.8250.smc
tail +3 $prefix.8250.smc | grep TREE > $prefix.8250_trees.txt
tail +3 $prefix.8250.smc | grep SPR > $prefix.8250_SPR.txt

zcat $prefix.9200.smc.gz > $prefix.9200.smc
tail +3 $prefix.9200.smc | grep TREE > $prefix.9200_trees.txt
tail +3 $prefix.9200.smc | grep SPR > $prefix.9200_SPR.txt
```

```{r echo=F, results=F, message=F, include=F}
#MCMC summary for the neutral region
mcmc_summary_neutral <- "~/Proj-Def-Haplo/Herato1603-1_newrun/Herato1603-1_newNe/sample/Herato1603-1_newNe.stats"
plotTraces(mcmc_summary_neutral, stats=c("prior", "likelihood", "joint", "recombs", "arglen", "noncompats"))
```

```{bash echo=F}
#### Herato1603 ####

# Convert .smc to .bed format for all iterations
smc2bed-all ~/Proj-Def-Haplo/Herato1603-1_newrun/Herato1603-1_newNe/sample/Herato1603-1_newNe

# Set variable
workdir=$HOME/Proj-Def-Haplo
prefix=Herato1603-1_newNe

# Check mcmc stats to check burnin
burnin=2000

# Compute TMRCA estimates for between and within popualtions from all iterations
for pop in all_sample Herato_notabilis Herato_lativitta; do
    echo Summarizing $pop
    arg-summarize --arg-file ./sample/$prefix.bed.gz \
			 --subset-inds $workdir/sample/$pop.txt \
			 --log-file ./sample/$prefix.log \
			 --tmrca \
			 --burnin $burnin \
			 --mean --stdev --quantile 0.05,0.5,0.95 |\
		bgzip > $prefix.tmrca.$pop.bed.gz
		echo " -> Done"
done
```

```{bash echo=F}
## Get sample IDs
zcat $prefix.9500.smc.gz | head -2 > ../../../sample/newARGsampleID.txt
```

```{r, echo=F}
## Read sample IDs and assign colours and IDs 
# Read samples ----
sampID <- read.table("~/Proj-Def-Haplo/sample/all.newID.txt")
shortID <- read.table("~/Proj-Def-Haplo/sample/shortID.txt")
sampID <- cbind(sampID,shortID)
colnames(sampID) <- c("ID","plotID","shortID")
sampID$pop <- c(rep("Herato_notabilis", 20), rep("Herato_lativitta", 20))
sampID$col <- c(rep("#CCCC00", 20), rep("#C00C28", 20))
#sampID

# Make list of inidviduals to asign colours ----
indColour=list()
for (id in sampID$ID){indColour[[id]] <- sampID[which(sampID$ID==id),"col"]}
indColour

# Make list of individuals to asign names ----
indLabel=list()
for (id in sampID$ID){indLabel[[id]] <- sampID[which(sampID$ID==id),"shortID"]}
#indLabel

# ARGWEAVER has different sets of IDs depending on its initial ARG sample, so fix that! ----
argID_new <- read.table(file = "sample/newARGsampleID.txt", header = T)
colnames(argID_new) <- c("ID","ARGweaverID")
# argID[40,1] <- 1
# argID[40,2] <- setdiff(sampID$ID, argID$ID)
# argID$ARGweaverID <- argID$ARGweaverID-1
sampID <- merge(sampID,argID_new,all=T, by="ID")

# set IDs for final plot (only including lativitta individuals) ----
indLabel_ARG_lat=list()
for (id in sampID$ARGweaverID){indLabel_ARG_lat[[as.character(id)]] <- which(sampID$ARGweaverID == id)-20}


indColour_argIDs <- indColour
names(indColour_argIDs) <- as.character(sampID[,"ARGweaverID"])
indColour_argIDs
#indColour_argIDs

notabilis_individuals <- sampID[which(sampID$pop == "Herato_notabilis"),"ARGweaverID"]
lativitta_individuals <- sampID[which(sampID$pop == "Herato_lativitta"),"ARGweaverID"]

```

#### Read SNPs

SNPs are changed from its nucloetide asignment to 0/1; based on allele frequency within the *H.e.lativitta* samples. (Higher allele frequency - 1, Lower allele frequency - 0)
```{r, echo=T}
## Read all SNPs
optix_sites <- readSites("~/Proj-Def-Haplo/optix_newrun/sites/optix_Herato.sites")
#
## Read all sites (SNPs+invariant)
optixALL_sites <- readSites("~/Proj-Def-Haplo/optix_newrun/sites/optixALL_Herato.sites")
#
## Change SNPs to 0 and 1 based on major allele in lativitta
sites<- data.frame(pos=optix_sites$pos, optix_sites$sites)
for (snpID in c(1:nrow(sites))){
  cat(snpID, sites[snpID,'pos'])
  alleles <- sites[snpID,sampID$ID]
  alleles <- as.factor(alleles)
  
  alleleCount1 <- sum(alleles[lat] == levels(alleles)[1])
  alleleCount2 <- sum(alleles[lat] == levels(alleles)[2])
  
  majAllele <- ifelse(alleleCount1 > alleleCount2, levels(alleles)[1], levels(alleles)[2])
  minAllele <- setdiff(levels(alleles), majAllele)

  sites[snpID,which(sites[snpID,] == majAllele)] <- 1
  sites[snpID,which(sites[snpID,] == minAllele)] <- 0
}
```

```{r include=F}
plot(sites$pos, rep(1, nsites), type='n', ylim=c(1,40), xlab = "Position", ylab = "Individual ID") +
  plot_snps(sites, 1, "#C00C28") +
  plot_snps(sites, 0, "#CCCC00") +
  for(i in setdiff(sitesALL$pos, sites$pos)){points(rep(i,40), 1:40, col = "black", cex = 0.75)}
```


#### TMRCA

We first examine the TMRCA of the total tree and the individual populations - *H.e.lativitta* (in red) and *H.e.notabilis* (in yellow). Unlike the neutral region (*Herato1603*) where all TMRCA estimates are fairly constrained between 1 and 10 $N_e$, TMRCA traces for individual populations in the *optix* region exhibit shallow coalescence times at multiple positions throughout the entire ~50kb region. In case of a selective sweeps, where a beneficial mutation sweeps through the population, TMRCA tends to be shallow since all samples in the swept population coalesces quickly to the initial "lucky" ancestor. Moving away from the focal mutations, lineages recombine away from the swept ancestral background. The TMRCA estimates from the *optix* region is consistent with the previously identified selected region in *optix*. To investigate further the haplotype block structures, we focus into a 3kb region at optix:1385966-1388966. 

```{r fig30, fig.width=20, fig.height=10, message=F, echo=F, results=F}

#### OPTIX #### ----
# READ tmrca
tmrca <- readArgSummary("optix_newrun/optix_newNe/optix_newNe.tmrca.all_sample.bed.gz")
tmrca_not <- readArgSummary("optix_newrun/optix_newNe/optix_newNe.tmrca.Herato_notabilis.bed.gz")
tmrca_lat <- readArgSummary("optix_newrun/optix_newNe/optix_newNe.tmrca.Herato_lativitta.bed.gz")

## Set genomic span and Ne
Ne=1940078
gen <- tmrca$chromStart
gen_not <- tmrca_not$chromStart
gen_lat <- tmrca_lat$chromStart

#### Herato1603-1 #### ----
  
tmrca_neutral <- readArgSummary("~/Proj-Def-Haplo/Herato1603-1_newrun/Herato1603-1_newNe/Herato1603-1_newNe.tmrca.all_sample.bed.gz")
tmrca_not_neutral <- readArgSummary("~/Proj-Def-Haplo/Herato1603-1_newrun/Herato1603-1_newNe/Herato1603-1_newNe.tmrca.Herato_notabilis.bed.gz")
tmrca_lat_neutral <- readArgSummary("~/Proj-Def-Haplo/Herato1603-1_newrun/Herato1603-1_newNe/Herato1603-1_newNe.tmrca.Herato_lativitta.bed.gz")

## Set genomic span
gen_neutral <- tmrca_neutral$chromStart
gen_not_neutral <- tmrca_not_neutral$chromStart
gen_lat_neutral <- tmrca_lat_neutral$chromStart

## Plot TMRCA  ----

par(mfrow=c(2,1), cex.lab=1.5, cex.axis=1.2, cex.main=1.5)
## optix
plot(gen, tmrca$tmrca_quantile_0.500/Ne, log="y", ylim=c(1e-2,20), type='n', col = 'black', lwd=2, yaxt="n",
     xlab = "Position (MB)", ylab = "TMRCA (Ne)", main = "optix: 1362067-1405605 (43.5kb)") 
  polygon(x = c(gen, rev(gen)), y = c(tmrca$tmrca_quantile_0.050/Ne, rev(tmrca$tmrca_quantile_0.950/Ne)), 
        col = adjustcolor("#E0E0E0",0.4), border = adjustcolor("#E0E0E0", 0.4)) +
  polygon(x = c(gen_not, rev(gen_not)), y = c(tmrca_not$tmrca_quantile_0.050/Ne, rev(tmrca_not$tmrca_quantile_0.950/Ne)), 
        col = adjustcolor("#CCCC75",0.4), border = adjustcolor("#CCCC75", 0.4)) +
  polygon(x = c(gen_lat, rev(gen_lat)), y = c(tmrca_lat$tmrca_quantile_0.050/Ne, rev(tmrca_lat$tmrca_quantile_0.950/Ne)), 
        col = adjustcolor("#FF6666",0.4), border = adjustcolor("#FF6666", 0.4)) +
  lines(gen_not, tmrca_not$tmrca_quantile_0.500/Ne, col = adjustcolor('#CCCC00', 0.7), lwd=2) +
  lines(gen_lat, tmrca_lat$tmrca_quantile_0.500/Ne, col = adjustcolor('#C00C28', 0.7), lwd=2) +
  lines(gen, tmrca$tmrca_quantile_0.500/Ne, col = adjustcolor('black'), lwd=0.5, ) +
  axis(side = 2, at = c(0.001,0.01,0.1,1,10), labels = expression(10^-3,10^-2,10^-1,10^0,10^1)) +
  abline (h=c(1,20), lwd=1)
## Herato1603-1
plot(gen_neutral, tmrca_neutral$tmrca_quantile_0.500/Ne, log="y", ylim=c(1e-2,20),
     type='n', col = 'black', lwd=2, yaxt="n",
     xlab = "Position (MB)", ylab = "TMRCA (Ne)", main = "Herato1603: 3450000-3493538 (43.5kb)")  +
  polygon(x = c(gen_neutral, rev(gen_neutral)), y = c(tmrca_neutral$tmrca_quantile_0.050/Ne, rev(tmrca_neutral$tmrca_quantile_0.950/Ne)), 
        col = adjustcolor("#E0E0E0",0.4), border = adjustcolor("#E0E0E0", 0.4)) +
  polygon(x = c(gen_not_neutral, rev(gen_not_neutral)), y = c(tmrca_not_neutral$tmrca_quantile_0.050/Ne, rev(tmrca_not_neutral$tmrca_quantile_0.950/Ne)), 
        col = adjustcolor("#CCCC75",0.4), border = adjustcolor("#CCCC75", 0.4)) +
  polygon(x = c(gen_lat_neutral, rev(gen_lat_neutral)), y = c(tmrca_lat_neutral$tmrca_quantile_0.050/Ne, rev(tmrca_lat_neutral$tmrca_quantile_0.950/Ne)), 
        col = adjustcolor("#FF6666",0.4), border = adjustcolor("#FF6666", 0.4)) +
  lines(gen_not_neutral, tmrca_not_neutral$tmrca_quantile_0.500/Ne, col = adjustcolor('#CCCC00', 0.7), lwd=2) +
  lines(gen_lat_neutral, tmrca_lat_neutral$tmrca_quantile_0.500/Ne, col = adjustcolor('#C00C28', 0.7), lwd=2) +
  lines(gen_neutral, tmrca_neutral$tmrca_quantile_0.500/Ne, col = adjustcolor('black'), lwd=0.5, ) +
  axis(side = 2, at = c(0.001,0.01,0.1,1,10), labels = expression(10^-3,10^-2,10^-1,10^0,10^1)) +
  abline (h=c(1,20), lwd=1)
```
**Fig S3:** TMRCA (Ne) for each position in the *optix* (top panel) and the neutral genomic region (bottom panel). Black line: total TMRCA, Red: Median TMRCA for *H.e.lativitta*, Yellow: TMRCA for *H.e.notabilis*. Corresponding shaded regions are the 0.05 and 0.95 percentile intervals.  

```{r}
chromStart=1385966
chromEnd=1388966
chromIDs <- range(which(sites$pos >= chromStart & sites$pos <= chromEnd))
focal_sites <- sites[chromIDs[1]:chromIDs[2],]
nrow(focal_sites) #there are 137 SNPs 
```

In the focal region, there are 137 SNPs. 

```{r fig4, fig.width=20, fig.height=10, message=F, echo=F, results=F}
par(mfrow=c(2,1), cex.main=1.5, cex.lab=1.5, cex.axis= 1.2)

plot(gen, tmrca$tmrca_quantile_0.500/Ne, log="y", ylim=c(1e-2,20),
     xlim=c(chromStart, chromEnd),
     type='l', col = 'black', lwd=2, 
     xlab = "Position (MB)", ylab = "TMRCA (Ne)", main = "optix: 1385966-1388966 (3kb)") +
  lines(gen_not, tmrca_not$tmrca_quantile_0.500/Ne, col = adjustcolor('#CCCC00', 0.7), lwd=2) +
  lines(gen_lat, tmrca_lat$tmrca_quantile_0.500/Ne, col = adjustcolor('#C00C28', 0.7), lwd=2) +
  lines(gen, tmrca$tmrca_quantile_0.500/Ne, col = adjustcolor('black'), lwd=1 )

plot(focal_sites$pos, rep(1, nrow(focal_sites)), type='n', ylim=c(1,40), xlab = "Position", ylab = "Individual ID") +
  plot_snps(focal_sites, 1, "#C00C28") +
  plot_snps(focal_sites, 0, "#CCCC00") +
  for(i in setdiff(optixALL_sites$pos, focal_sites$pos)){points(rep(i,40), 1:40, col = "black", cex = 0.75)}
```


**FigS4:** TMRCA for each position in focal genomic region (optix:1385966-1388966). Same colour schemes as above. **Top panel:** TMRCA plot, **Bottom panel:** SNPs in both populations; RED and YELLOW alleles are variant positions, coloured according to the respective higher and lower allele counts in the *H.e.lativitta* population. BLACK alleles are fixed (invariant) in both populations. All other positions (in white) are unknown, since those positions do not have SNP information. These unknown positions are masked while running ARGWeaver, therefore they are treated as missing information and NOT as invariant sites. 

Masking sites can have a critical effect on sampling ARGs, since invariant sites can shift priors towards a recent coalescence (because there hasn't been enough time yet for a mutation to occur in any of the branches), whereas missing information does not shift priors and therefore ARGs are sampled neutrally in those regions. In the above region of 3kb, there are 137 SNPs and 16 invariant positions. Out of the 137 SNPs, only 59 SNPs are segregating within the *H.e.lativitta* population in focus. Below is the folded SNP frequency spectrum. 
```{r fig5, fig.width=5, fig.height=5}
barplot(table(20-(tr1_snps$majAlleleCount_lat)), ylab = "No of positions", xlab = "Allele Count")
```
**FigS5:** Folded SNP frequency spectrum of the focal region, optix:1385966-1388966.

Hereafter, alleles with lower frequency within the *H.e.lativitta* population is referred to as the minor allele, and individuals who share the minor allele is the minor clade. Conversely, individuals that carry that major allele is called major clade. Minor allele is coded as 0 and major as 1. 

#### General comments on identifying haplotype blocks as edges from empirical datasets
In a series of marginal genealogical trees along the genome, an edge is considered unique if it originates at a particular coalescent time point, and is ancestral to a given set of samples. Unique edges extend along the genome until recombination breaks them down, but can further re-emerge in a disjunct genomic region if recombination brings the given set of samples back together. Moreover, each edge goes back in time until further coalescence events in deeper past. (Figure 3 in main text, and Box 2). Unlike a simulation where the ground truth about origin and extend of each unique edge is precisely known, inferred edges from sampled ARGs are consistent with the data, but not necessarily supported by the SNP configuration. Those edges on which mutations appear can be reliably inferred given the SNP configuration, whereas the rest are random samples from the posterior distribution and is subject to stochastic noise, suggesting inference be made from edges supported by SNPs. 

ARGweaver outputs genealogical trees along the genome, and each recombination breakpoints and time. For each tree, nodes and tips are labelled uniquely. For each recombination event, the new ancestral node ID of the recombined branch becomes the old node ID from the previous tree. Therefore, across trees, all nodes except the one that underwent recombination in the preceding tree shares the same node ID. Although it is easy to track unique nodes over a short genomic distance by just tracking the SPR (subtree prune and regraft) event; one needs to develop more sophisticated code/algorithm to track over longer distances than considered in this analysis (which we did not attempt here!). Instead, we leveraged the artifact left behind by time-discretization in ARGweaver to identify unique edges. 

The exact step-by-step process that we used to identify edges supported by SNPs is described below. We specifically chose 2 separate iteraions (iteration ID - 8250 and 9000), sufficiently separated to avoid autocorelations between them to demonstrate the noise in identifying haplotype blocks as edges based on ARG samples. We only focus on the *H.e.lativitta* population (always in red labels).

## Iteration 1 (MCMC iter: 8250)

```{r, fig.width=20, fig.height=5}
#Read tmrca
tmrca8250 <- readArgSummary("~/Proj-Def-Haplo/optix_newrun/optix_newNe/optix_newNe.tmrca.all_sample.8250.bed.gz")
tmrca8250_not <- readArgSummary("~/Proj-Def-Haplo/optix_newrun/optix_newNe/optix_newNe.tmrca.Herato_notabilis.8250.bed.gz")
tmrca8250_lat <- readArgSummary("~/Proj-Def-Haplo/optix_newrun/optix_newNe/optix_newNe.tmrca.Herato_lativitta.8250.bed.gz")

## Set genomic span and Ne ----
Ne=1940078
gen8250 <- (tmrca8250$chromStart)#/1e3
gen8250_not <- (tmrca8250_not$chromStart)#/1e3
gen8250_lat <- (tmrca8250_lat$chromStart)#/1e3

plot(gen8250, tmrca8250$tmrca_quantile_0.500/Ne, type='s', col = "black", lwd=0.8, log="y", 
     ylim=c(1e-2,20), xlim=c(1386000, 1389000), xlab = "Position (kB)", ylab = "TMRCA (Ne)") +
  lines(gen8250_not, tmrca8250_not$tmrca_quantile_0.500/Ne, type="s", col = "#CCCC00", lwd=1) +
    lines(gen8250_lat, tmrca8250_lat$tmrca_quantile_0.500/Ne, type="s", col = "#C00C28", lwd=3)
```

```{r echo=T, message=F, results=F}
# Read trees in the whole region
tr1 <- read.table("~/Proj-Def-Haplo/optix_newrun/optix_newNe/sample/optix_newNe.8250_trees.txt")
colnames(tr1) <- c("event","chromStart","chromEnd","tree")
str(tr1)

# Read SPRs in the whole region
spr1 <- read.table("~/Proj-Def-Haplo/optix_newrun/optix_newNe/sample/optix_newNe.8250_SPR.txt")
colnames(spr1) <- c("event","position", "recomb_node", "recomb_time", "coal_node", "coal_time")
str(spr1)

# Find trees in the focal region.
chromStart <- 1385966
chromEnd <- 1388966
tr1_sub <- subset_genomic_interval(tr1, chromStart, chromEnd) #464 trees
spr1_sub <- subset_genomic_interval(spr1, chromStart, chromEnd)
```

We extract ARGs sampled by ARGweaver in its MCMC iteration: 8250, and estimate the TMRCA (total and within population), total branch length of each marginal tree and the recombination breakpoints. Altogether in the ~50kb region, there are 6571 trees (and 6570 recombination events), of which only 464 trees are present in the focal 3kb region, optix:1385966-1388966.

We investigate the distribution of branch length of trees with respect to their genomic spans (Fig S6). We find average tree span in the whole region is ~7 bp. We expect from theory assuming standard coalescence, $P_r(d \mid \tau)$=$\frac {\rho}{2}L(\tau)$ exp$[-\frac {\rho}{2}L(\tau) d]$ where $L(\tau)$ is in coalescent unit of $2N_e$ generations and $\frac {\rho}{2}$=$2N_er$ denotes the population-scaled recombination rate per bp. Given the parameters, and mean $L(\tau)$ = 2, this gives mean *d* = $\frac {1} {\frac {\rho}{2}L(\tau)}$ ~ 45bp. It seems in this region, ARGweaver seems to change the tree topology more often than the expected mean. We have not explored this closely (beyond the main reach of this analysis), however, we note that although the ARGs are consistent with the data, we need to take caution in biological inference from these sampled ARGs. 

```{r fig60, fig.width=20, fig.height=5, echo=T, results=F, warning=F}
## Read all 6571 tree lengths
treelen1 <- readArgSummary("~/Proj-Def-Haplo/optix_newrun/optix_newNe/optix_newNe.branchlen.all_sample.8250.bed.gz")
treelen1$treespan <- treelen1$chromEnd-treelen1$chromStart
## Read 464 tree lengths in the focal genomic region
treelen1_sub <- subset_genomic_interval(treelen1, chromStart, chromEnd)
mean(treelen1$treespan)
mean(treelen1_sub$treespan)
## Plot
par(mfrow=c(1,3), cex.main=1.5, cex.lab=1.7, cex.axis=1.5)
# Histogram
hist(treelen1$treespan, xlab = "Tree Span")
# Branch Length vs Tree span (all 6571 trees)
plot(treelen1$treespan, treelen1$branchlen_mean, xlab="Tree Span", ylab="Tree Length")
points(aggregate(treelen1$branchlen_mean~treelen1$treespan, FUN=mean), col="red", pch=19, cex=1.5)
# Branch Length vs Tree span (464 trees in the focal genomic region)
plot(treelen1_sub$treespan, treelen1_sub$branchlen_mean, xlab="Tree Span", ylab="Tree Length")
points(aggregate(treelen1_sub$branchlen_mean~treelen1_sub$treespan, FUN=mean), col="red", pch=19, cex=1.5)
```
**FigS6:** **Left:** Histogram of tree spans; **Center:** Branch length vs tree span for all 6571 marginal trees along the genome; **Right:** Branch length vs tree spans for 464 trees in the focal genomic region. RED points are average tree lengths for each tree span.

For our analysis, we strictly focus on identifying edges supported by SNPs. In practice, this is done in few steps - First, for each sampled tree along the genome (= 464 trees), we extract the following information - ancestral and descendant node of each edge, edge height (=length), time-point of the descendant node, i.e., when the edge originated (=depth, NOTE: due to rounding errors in Newick format, we round the depth values to 3 significant digits) and the samples (=tips.from.dec) that each edge is ancestral to. NOTE: Although we are identifying haplotype blocks in the *H.e.lativitta* (red) population, we use genealogical trees that include both populations. This is done in order to estimate the edge height of the most recent common ancestor to all individuals in the red population. Ideally for making biological inferences from only one population, this need not be done. However in our analysis, in order to illustrate the haplotype block patterns generated in a selected genomic region, we decided to incorporate both populations to generate trees along the genome. Nevertheless, it is important to note that the identified haplotype blocks will stay the same irrespective of which and how many populations are included in the analysis; however, the edge height will change along the genome.

```{r echo=T}
## Store edge information of all 464 trees as a list
tr1_sub_list <- extract_edges(tr1_sub)
```

Second, for every tree at each SNP position (= 137 SNPs), we identify most recent tree node that is commonly ancestral to all individuals that share the same allele. (Note: this assumes infinite sites mutation model, which is generally the default option in ARGweaver). This allows identification of 1 node for the major and 1 for the minor allele at each SNP position, hereafter called major and minor node for each tree. 

```{r echo =F}
## this only extracts trees at SNPs
tr1_snps <- extract_trees_at_SNPs(focal_sites, tr1, spr1)

timeScale = 1/Ne
tr1_snps_list <- list()
lat <- sampID$ID[which(sampID$pop=="Herato_lativitta")]
not <- sampID$ID[which(sampID$pop=="Herato_notabilis")]
for (snpID in c(1:nrow(tr1_snps))){
  cat(snpID, tr1_snps[snpID,"pos"])
  alleles <- tr1_snps[snpID,sampID$ID]
  alleles <- as.factor(alleles)
  
  alleleCount1 <- sum(alleles[lat] == levels(alleles)[1])
  alleleCount2 <- sum(alleles[lat] == levels(alleles)[2])
  
  majAllele <- ifelse(alleleCount1 > alleleCount2, levels(alleles)[1], levels(alleles)[2])
  tr1_snps$majAllele[snpID] <- majAllele
  minAllele <- setdiff(levels(alleles), majAllele)
  tr1_snps$minAllele[snpID] <- minAllele
  
  #majClade_lat <- setdiff(which(alleles == majAllele), notabilis_individuals+1) # 1-indexed
  #minClade_lat <- setdiff(which(alleles == minAllele), notabilis_individuals+1) # 1-indexed
  majClade_lat_names <- setdiff(names(alleles[which(alleles == majAllele)]),not)
  minClade_lat_names <- setdiff(lat, majClade_lat_names)
  
  majClade_lat <- sampID[which(sampID$ID %in% majClade_lat_names),"ARGweaverID"] # 0-indexed
  tr1_snps$majClade_lat[snpID] <- list(majClade_lat)
  tr1_snps$majAlleleCount_lat[snpID] <- length(tr1_snps$majClade_lat[snpID][[1]]) 
  
  minClade_lat <- sampID[which(sampID$ID %in% minClade_lat_names),"ARGweaverID"] # 0-indexed
  tr1_snps$minClade_lat[snpID] <- list(minClade_lat)
  tr1_snps$minAlleleCount_lat[snpID] <- 20-tr1_snps$majAlleleCount_lat[snpID]
  
  tree <- tr1_snps[snpID,"tree"]
  tr.not <- pruneTree(tree = tree, seqs = as.character(notabilis_individuals)) # 0-indexed
  tr.lat <- pruneTree(tree = tree, seqs = as.character(lativitta_individuals)) # 0-indexed
  
  tr1_snps_list$tree[snpID] <- tr.lat
  apeTree <- read.tree(text=tr.lat)
  
  ## Tree
  nleaf <- length(apeTree$tip.label)
  rootage <- getTreeDepth(apeTree)*timeScale
  rootNode <- getRootNode(apeTree)
  edge <- as.data.frame(apeTree$edge)
  names(edge) <- c("anc", "dec")
  edge$anc.label <- sapply(apeTree$edge[,1], select.tip.or.node, tree = apeTree)
  edge$dec.label <- sapply(apeTree$edge[,2], select.tip.or.node, tree = apeTree)
  
  edge <- rbind(edge, data.frame(anc=-1, dec=rootNode, anc.label=-1, dec.label=rootNode))
  nnode <- nrow(edge)
  edge$length <- c(apeTree$edge.length*timeScale, max(maxtime - rootage, 1))
  edge$depth <- rep(rootage, nnode)
  for (i in 1:nnode) {j <- i
    while (1) {if (edge$dec[j] == rootNode) break
        edge$depth[i] <- edge$depth[i] - edge$length[j]
        j <- which(edge[,"dec"] == edge[j,"anc"])}}
  
  edge[which(edge$anc == -1),"dec.label"] <- edge[1,"anc.label"]
  # edge$tip.label <- rep("", nrow(edge))
  # for (i in 1:length(apeTree$tip.label)){edge[edge$dec==i,"tip.label"] <- apeTree$tip.label[i]}

  for (dec.nodeID in c(1:nrow(edge))){
    decs <- Descendants(apeTree, edge$dec[dec.nodeID],"tips")[[1]]
    edge$tips.from.dec[dec.nodeID] <- list(apeTree$tip.label[decs])
    edge$majClade.node[dec.nodeID] <- all(as.character(majClade_lat) %in% apeTree$tip.label[decs])
    edge$minClade.node[dec.nodeID] <- ifelse(length(minClade_lat) != 0, all(as.character(minClade_lat) %in% apeTree$tip.label[decs]),'NA')
  }
  
  tr1_snps_list$tree.attributes[[snpID]] <- data.frame(edge)
  majClade.node <- edge[which(edge$majClade.node==T),'dec.label']
  if (length(majClade.node)!=1){majClade.node <- edge[which(edge$anc != -1 & edge$majClade.node==T),'dec.label']}
  if (length(majClade.node)!=1){majClade.node <- edge[which(edge$anc != -1 & edge$majClade.node==T),'dec.label']}
  if (length(majClade.node)!=1){
     for (i in majClade.node){
      if (length(edge$tips.from.dec[i][[1]]) == length(minClade_lat)){
        majClade.node.final <- i
      }
    }
  }else{majClade.node.final <- majClade.node}
  
  tr1_snps$majClade.node.depth[snpID] <- edge[which(edge$dec.label == majClade.node.final),'depth']
  tr1_snps$snpID[snpID] <- snpID
}
tr1_snps$node.jump <- c('NA',diff(floor(as.numeric(tr1_snps$majClade.node))))
tr1_snps$depth.jump <- c('NA',diff(floor(tr1_snps$majClade.node.depth)))
```

The table below illustrates information from the tree whose genomic span coincides with the first SNP position - 1385985. 

```{r echo=F}
tab1 <- tr1_snps_list$tree.attributes[[1]][,3:7]
knitr::kable(tab1)
```
**Table S1**: For the above tree at SNP position 1385985, individuals 18 and 22 share allele 0, and therefore their most recent common ancestor is node 75 (called as minor node) which originated at time (in $N_e$) = 0.236 and has a length of 3.153 (in $N_e$). 

Third, for each minor node (and major node if the SNP is fixed within the *H.e.lativitta* population) identified from trees at each of the 137 SNPs, we identify all trees along the genome which contains that unique ancestral node. Since we do not know the ancestral reference allele at each SNP position and we assume infinite sites mutation, any minor node that is ancestral exclusively to the minor clade is assumed to contain the causal alternate allele and is considered as a branch on which a mutation has occurred. The table below shows the extent of the above node 75 that is ancestral to indnividuals 18 and 22 at SNP position 1385985. (NOTE: only 10 rows showed for illustration. In other words, this is how the haplotype blocks are encoded as edges. 
```{r echo =F}
tab2 <- hapblock4_c2_all[[1]][1:10,c(3:7,9:13)] 
knitr::kable(tab2)
```
**Table S2**: Haplotype block as a unique edge that contains the mutation at pos: 1385985, shared between individuals 18, 22. Each row represents 1 tree. For example, tree 7 spans from 1385994 to	1386005, is	12 bp long; the haplotype block at this tree has a height of 0.3379 and therefore an area (height x tree span) of 4.055763.

Following the above steps, we can identify each major/minor node ancestral to each major/minor clade and occurs at a particular time point in the past, that in practice informs us of all the edges informed by SNPs. With 137 SNPs, we have 36 edges supported by SNPs (table below).

```{r, echo=F}
edges1<- data.frame()
for(i in c(10:11,13:19)){
  cat(i,"\n")
  nodes <- subset_clades(tr1_snps, i)
  if(nrow(node)!=0){
      nodes %>%
        dplyr::distinct(minClade_lat) -> uniqueInds
    for (j in seq(nrow(uniqueInds))){
      cat("\t",j,"\n")
      inds = uniqueInds[j,][[1]]
      edges_raw <- find_clade_depths(nodes, tr1_snps_list, count = i, inds)
      
      if(nrow(edges_raw) == 0){
        edges_raw <- data.frame(depth = NA)
        edges_raw$tips.from.dec[1] <- list(inds)
        rowID <- sapply(nodes$minClade_lat, function(decs) all(inds %in% decs))
        edges_raw$snpID <- list(nodes[rowID,"snpID"])
        edges_raw$No.of.SNPs <- nrow(edges_raw)
      }else{
        edges_raw$snpID <- list(edges_raw$snpID)
        edges_raw$depth <- round(edges_raw$depth,4)
        edges_raw$No.of.SNPs <- nrow(edges_raw)
        edges_raw <- edges_raw[1,c("depth", "tips.from.dec","snpID","No.of.SNPs")]
      }
      edges1 <- rbind(edges1, edges_raw)
    }
  }else{
    edges1 <- edges1
  }
}

nodes <- subset_clades(tr1_snps,20)
nodes$majClade.node.depth <- round(nodes$majClade.node.depth,4)
nodes %>%
  dplyr::distinct(majClade.node.depth) -> uniqueCoalTime
for (i in seq(nrow(uniqueCoalTime))){
  edges_raw <- nodes[which(nodes$majClade.node.depth == uniqueCoalTime[i,]),]
  edges_raw$snpID <- list(edges_raw$snpID)
  edges_raw$No.of.SNPs <- nrow(edges_raw)
  edges_raw$tips.from.dec <- "ALL"
  edges_raw$depth  <- edges_raw$majClade.node.depth
  edges_raw <- edges_raw[1,c("depth", "tips.from.dec","snpID","No.of.SNPs")]
  edges1<- rbind(edges1, edges_raw)
}
row.names(edges1) <- seq(nrow(edges1))
edges1$edgeID <- seq(nrow(edges1))
```

```{r, echo=F}
tab3 <- edges1[,1:5]
knitr::kable(tab3)
```

**Table S3**: All 36 haplotype blocks as edges from iteration 8250; each row represents an unique edge. As mentioned above, each edge is defined uniquely by its origin time-point (=depth) and the set of descendent individuals (=tips.from.dec). For example, edge 17 (same edge as the Table S1, S2 refers to) originates at 0.2365 (time in Ne), shared by individuals 18, 22 and is supported by 3 SNPs at position (rank sum order) 1,2,3. Edges with depth = NA refers to SNPs that are incompatible with the infinite sites mutation model, and hence cannot be explained by one mutation event. 

It is important to clarify that this strategy of identifying nodes across successive trees along the genome can easily be translated to the mathematical notation used for the haplotype block simulations. For example, the above edge can be represented as {0, {34}, {{18,22},{}}}, i.e., the ancestral genome on this haplotype block carries the allele 0 at SNP position 1; and is broken down into regions defined by trees {1,93}, {94,464} and is ancestral to samples {18,22},{} respectively. 

#### Haplotype block visualization

For visualization, we choose to only plot the edges that are supported by 3 or more SNPs. Moreover, for SNPs that are fixed in the red population, we only show edges that originate at any time-point below 5$N_e$. This leaves us with only 8 edgee. (see .Rmd file for full code)

```{r echo=T}
## Select edges supported by 3 or more SNPs
imp_edges1 <- edges1[edges1$No.of.SNPs >=3 & edges1$depth <=5 , ] #12 important edges
nrow(imp_edges1)
imp_edges1 <- imp_edges1[order(imp_edges1$depth),]
```

```{r echo=F}
## Find haplotype blocks

#singletons
bl1_c1<-find_clade_depths(subset_clades(tr1_snps, 19), tr1_snps_list, count = 19, inds = c(30))
inds1_c1 <- c(30)
hapblock1_c1 <- find_hapblocks(inds1_c1, imp_edges1$depth[1], "all", T, tr1_sub_list)
hapblock1_c1_all <- find_disjunct_hapblocks(hapblock1_c1)

bl2_c1<-find_clade_depths(subset_clades(tr1_snps, 19), tr1_snps_list, count = 19, inds = c(13))
inds2_c1 <- c(13)
hapblock2_c1 <- find_hapblocks(inds2_c1, imp_edges1$depth[2], "all", T, tr1_sub_list)
hapblock2_c1_all <- find_disjunct_hapblocks(hapblock2_c1)

#doubletons
bl3_c1<-find_clade_depths(subset_clades(tr1_snps, 18), tr1_snps_list, count = 18, inds = c(21,22))
inds3_c1 <- c(21,22)
hapblock3_c1 <- find_hapblocks(inds3_c1, imp_edges1$depth[3], "all", F, tr1_sub_list)
hapblock3_c1_all <- find_disjunct_hapblocks(hapblock3_c1)

bl4_c1<-find_clade_depths(subset_clades(tr1_snps, 18), tr1_snps_list, count = 18, inds = c(18,22))
inds4_c1 <- c(18,22)
hapblock4_c1 <- find_hapblocks(inds4_c1, imp_edges1$depth[4], "all", F, tr1_sub_list)
hapblock4_c1_all <- find_disjunct_hapblocks(hapblock4_c1)

## Fixed SNPs
indsALL <- lativitta_individuals
tr1_count20 <- subset_clades(tr1_snps, 20)

bl6_c1 <- tr1_count20[which(round(tr1_count20$majClade.node.depth,4) == imp_edges1$depth[5]),c("pos","snpID")]
hapblock6_c1 <- find_hapblocks(indsALL, imp_edges1$depth[5], "all", F, tr1_sub_list)
hapblock6_c1_all <- find_disjunct_hapblocks(hapblock6_c1)

bl7_c1 <- tr1_count20[which(round(tr1_count20$majClade.node.depth,4) == imp_edges1$depth[7]),c("pos","snpID")]
hapblock7_c1 <- find_hapblocks(indsALL, imp_edges1$depth[7], "all", F, tr1_sub_list)
hapblock7_c1_all <- find_disjunct_hapblocks(hapblock7_c1)

bl8_c1 <- tr1_count20[which(round(tr1_count20$majClade.node.depth,4) == imp_edges1$depth[8]),c("pos","snpID")]
hapblock8_c1 <- find_hapblocks(indsALL, imp_edges1$depth[8], "all", F, tr1_sub_list)
hapblock8_c1_all <- find_disjunct_hapblocks(hapblock8_c1)

bl9_c1 <- tr1_count20[which(round(tr1_count20$majClade.node.depth,4) == imp_edges1$depth[9]),c("pos","snpID")]
hapblock9_c1 <- find_hapblocks(indsALL, imp_edges1$depth[9], "all", F, tr1_sub_list)
hapblock9_c1_all <- find_disjunct_hapblocks(hapblock9_c1)
```

```{r fig7, fig.width=17, fig.height=8, echo = T, results=F, warning=F }
par(mfrow=c(2,1), cex.lab=1.5, cex.axis=1.3)
plot_all_sites(focal_sites, as.character(1)) +
  axis(side =2, tick =T, at = c(21:40), labels = F) +
  points(bl1_c1$pos, rep(which(sampID$ARGweaverID == c(30)),nrow(bl1_c1)), col='black', pch=19, cex = 1)
plot(hapblock1_c1$chromStart, hapblock1_c1$length, type="n", log="y", ylim=c(0.05,15), xlab="Position (bp)", ylab="Time (Ne)", xlim=c(chromStart, chromEnd)) +   
  plot_hapblock(hapblock1_c1_all, "black", T)
```
**Fig S7**: Haplotype block that is supported by singletons. These edges originate directly from the tree tips, ie, the samples and therefore extends all along the genomic region. Although for biological inference, singletons are often uninformative, this shows the feature of an edge supported by singletons, which extends all along the genome, is normally shallow with certain regions that go are high, where a lot of singleton clusters together. These are normally regions of the genome, which have recombined out and goes all the way back to an ancestor in the deep past. The orange block shows clustering of SNPs in the higher region of the edges, whereas, the green edge shows how SNPs can also occur by chance at other regions. 

Now, ploting all the haploype blocks except singletons (Fig S8), same as the figure in main text (see Main Text for caption).
```{r fig 10000, fig.height=8, fig.width=15, results=F, echo=F, warning=F}
trees_to_plot <- c(2, 23, 51, 57, 75, 95, 137)

par(mfrow=c(3,1), cex.main=1.5, cex.lab=1.3, cex.axis=1.2)

plot_all_sites(focal_sites, as.character(1)) +
  axis(side =2, tick =T, at = c(21:40), labels = F) +
  points(rep((bl3_c1$pos), each = 2), rep(which(sampID$ARGweaverID == c(21,22)),nrow(bl3_c1)) , col='pink', pch=19, cex = 1) +
  points(rep((bl4_c1$pos), each = 2), rep(which(sampID$ARGweaverID == c(18,22)),nrow(bl4_c1)) , col='yellow', pch=19, cex = 1) +
  points(rep((bl6_c1$pos), each = 20), rep(21:40, nrow(bl6_c1)) , col='darkblue', pch=19, cex = 1) +
  points(rep((bl7_c1$pos), each = 20), rep(21:40, nrow(bl7_c1)) , col='cyan', pch=19, cex = 1) +
  points(rep((bl8_c1$pos), each = 20), rep(21:40, nrow(bl8_c1)) , col='magenta', pch=19, cex = 1) +
  points(rep((bl9_c1$pos), each = 20), rep(21:40, nrow(bl9_c1)) , col='sienna', pch=19, cex = 1) +
  points((focal_sites$pos[which(focal_sites$snpID %in% trees_to_plot)]-pos0)/1e3, rep(1,7), pch =17, cex = 2) +
    abline(v=(focal_sites$pos[which(focal_sites$snpID %in% trees_to_plot)]), lty = 2, lwd = 0.8)



plot(hapblock1_c1$chromStart, hapblock1_c1$length, type="n", log="y", ylim=c(0.05,15), xlab="", ylab="Time (Ne)", xlim=c(chromStart, chromEnd)) +
  #plot_hapblock(hapblock2_c1_all, "pink", F) +
  plot_hapblock(hapblock3_c1_all, "pink", F) +
  plot_hapblock(hapblock4_c1_all, "yellow", F) +
  # plot_hapblock(hapblock5_c1_all, "blue", T) +
  plot_hapblock(hapblock6_c1_all, "darkblue", F) +
  plot_hapblock(hapblock7_c1_all, "cyan", F) +
  plot_hapblock(hapblock8_c1_all, "magenta", F) +
  plot_hapblock(hapblock9_c1_all, "sienna", F) +
  points(focal_sites$pos[which(focal_sites$snpID %in% trees_to_plot)], rep(0.05,7), pch =17, cex = 2) +
  abline(v=(focal_sites$pos[which(focal_sites$snpID %in% trees_to_plot)]), lty = 2, lwd = 0.8)


plot(gen8250, tmrca8250$tmrca_quantile_0.500/Ne, type='s', col = "black", lwd=3, log="y", 
     ylim=c(0.05,15), xlim=c(chromStart, chromEnd), xlab = "Position (bp)", ylab = "TMRCA (Ne)") +
    lines(gen8250_lat, tmrca8250_lat$tmrca_quantile_0.500/Ne, type="s", col = "#C00C28", lwd=3) +
    points(focal_sites$pos[which(focal_sites$snpID %in% trees_to_plot)], rep(0.05,7), pch =17, cex = 2) +
  abline(v=(focal_sites$pos[which(focal_sites$snpID %in% trees_to_plot)]), lty = 2, lwd = 0.8)
```


See *.Rmd* file for all 137 trees at each SNP position.
```{r fig100, fig.width=15, fig.height=7, echo=F, results=F, message=F, include=F}
par(mfrow=c(2,4))
for (i in 1:137){
  plotTree(tr1_snps$tree[i], 
          #keepSeqs = as.character(lativitta_individuals),
          leafCol = indColour_argIDs, ylab=i, 
          logScale = T, ylim=c(4e-3,20), timeScale = 1/Ne, cex.lab = 1.3)
}
```

#### Case 2: iteration 9000
There are altogether 6457 trees in the ~50kb region, and 425 in the focal genomic region. 

```{r echo=F}
## Read tree length
treelen2 <- readArgSummary("~/Proj-Def-Haplo/optix_newrun/optix_newNe/optix_newNe.branchlen.all_sample.9200.bed.gz")
str(treelen2)
treelen2$treespan <- treelen2$chromEnd-treelen2$chromStart

treelen2_sub <- subset_genomic_interval(treelen2, chromStart, chromEnd)
par(mfrow=c(2,1))
plot(treelen2_sub$chromStart, treelen2_sub$branchlen_mean, type="s", log="y", xlab="Position", ylab="Tree Length")
plot(treelen2_sub$chromEnd,(treelen2_sub$chromEnd-treelen2_sub$chromStart), type="s", log="y", xlab="Position", ylab="Tree Span")
```
```{r fig, fig.width=15, fig.height=5, include=FALSE}
#plot(treelen2_sub$branchlen_mean, treelen2_sub$treespan, xlab="Tree Length", ylab="Tree Span")
mean(treelen2$treespan)
mean(treelen2_sub$treespan)
table(treelen2$treespan)
nrow(treelen2)
par(mfrow=c(1,2))
plot(treelen2$treespan, treelen2$branchlen_mean, xlab="Tree Span", ylab="Tree Length")
points(aggregate(treelen2$branchlen_mean~treelen2$treespan, FUN=mean), col="red", pch=19)
plot(treelen2_sub$treespan, treelen2_sub$branchlen_mean, xlab="Tree Span", ylab="Tree Length")
points(aggregate(treelen2_sub$branchlen_mean~treelen2_sub$treespan, FUN=mean), col="red", pch=19)
```

```{r, fig.width=20, fig.height=10, include=F}
#Read tmrca
tmrca9200 <- readArgSummary("~/Proj-Def-Haplo/optix_newrun/optix_newNe/optix_newNe.tmrca.all_sample.9200.bed.gz")
tmrca9200_not <- readArgSummary("~/Proj-Def-Haplo/optix_newrun/optix_newNe/optix_newNe.tmrca.Herato_notabilis.9200.bed.gz")
tmrca9200_lat <- readArgSummary("~/Proj-Def-Haplo/optix_newrun/optix_newNe/optix_newNe.tmrca.Herato_lativitta.9200.bed.gz")
## Set genomic span and Ne 
Ne=1940078
gen9200 <- (tmrca9200$chromStart)#/1e3
gen9200_not <- (tmrca9200_not$chromStart)#/1e3
gen9200_lat <- (tmrca9200_lat$chromStart)#/1e3

plot(gen9200, tmrca9200$tmrca_quantile_0.500/Ne, type='s', col = "black", lwd=0.8, log="y", 
     ylim=c(1e-2,20), xlim=c(chromStart, chromEnd), xlab = "Position (kB)", ylab = "TMRCA (Ne)") +
  lines(gen9200_not, tmrca9200_not$tmrca_quantile_0.500/Ne, type="s", col = "#CCCC00", lwd=1) +
    lines(gen9200_lat, tmrca9200_lat$tmrca_quantile_0.500/Ne, type="s", col = "#C00C28", lwd=3)
```

```{r include=F}
#Read trees
tr2 <- read.table("~/Proj-Def-Haplo/optix_newrun/optix_newNe/sample/optix_newNe.9200_trees.txt")
colnames(tr2) <- c("event","chromStart","chromEnd","tree")
str(tr2)
#Read SPRs
spr2 <- read.table("~/Proj-Def-Haplo/optix_newrun/optix_newNe/sample/optix_newNe.9200_SPR.txt")
colnames(spr2) <- c("event","position", "recomb_node", "recomb_time", "coal_node", "coal_time")
str(spr2)
```

```{r}
#Find trees in the region of interest
tr2_sub <- subset_genomic_interval(tr2, chromStart, chromEnd) #425 trees
str(tr2_sub)
spr2_sub <- subset_genomic_interval(spr2, chromStart, chromEnd)
```

```{r}
#Store trees and edge information as a list
tr2_sub_list <- extract_edges(tr2_sub)
```
```{r}
## this only extracts trees at SNPs
tr2_snps <- extract_trees_at_SNPs(focal_sites, tr2, spr2)
str(tr2_snps)
```

```{r}
timeScale = 1/Ne
tr2_snps_list <- list()
lat <- sampID$ID[which(sampID$pop=="Herato_lativitta")]
not <- sampID$ID[which(sampID$pop=="Herato_notabilis")]
for (snpID in c(1:nrow(tr2_snps))){
  cat(snpID, tr2_snps[snpID,"pos"])
  alleles <- tr2_snps[snpID,sampID$ID]
  alleles <- as.factor(alleles)
  
  alleleCount1 <- sum(alleles[lat] == levels(alleles)[1])
  alleleCount2 <- sum(alleles[lat] == levels(alleles)[2])
  
  majAllele <- ifelse(alleleCount1 > alleleCount2, levels(alleles)[1], levels(alleles)[2])
  tr2_snps$majAllele[snpID] <- majAllele
  minAllele <- setdiff(levels(alleles), majAllele)
  tr2_snps$minAllele[snpID] <- minAllele
  
  #majClade_lat <- setdiff(which(alleles == majAllele), notabilis_individuals+1) # 1-indexed
  #minClade_lat <- setdiff(which(alleles == minAllele), notabilis_individuals+1) # 1-indexed
  majClade_lat_names <- setdiff(names(alleles[which(alleles == majAllele)]),not)
  minClade_lat_names <- setdiff(lat, majClade_lat_names)
  
  majClade_lat <- sampID[which(sampID$ID %in% majClade_lat_names),"ARGweaverID"] # 0-indexed
  tr2_snps$majClade_lat[snpID] <- list(majClade_lat)
  tr2_snps$majAlleleCount_lat[snpID] <- length(tr2_snps$majClade_lat[snpID][[1]]) 
  
  minClade_lat <- sampID[which(sampID$ID %in% minClade_lat_names),"ARGweaverID"] # 0-indexed
  tr2_snps$minClade_lat[snpID] <- list(minClade_lat)
  tr2_snps$minAlleleCount_lat[snpID] <- 20-tr2_snps$majAlleleCount_lat[snpID]
  
  tree <- tr2_snps[snpID,"tree"]
  tr.not <- pruneTree(tree = tree, seqs = as.character(notabilis_individuals)) # 0-indexed
  tr.lat <- pruneTree(tree = tree, seqs = as.character(lativitta_individuals)) # 0-indexed
  
  tr2_snps_list$tree[snpID] <- tr.lat
  apeTree <- read.tree(text=tr.lat)
  
  ## Tree
  nleaf <- length(apeTree$tip.label)
  rootage <- getTreeDepth(apeTree)*timeScale
  rootNode <- getRootNode(apeTree)
  edge <- as.data.frame(apeTree$edge)
  names(edge) <- c("anc", "dec")
  edge$anc.label <- sapply(apeTree$edge[,1], select.tip.or.node, tree = apeTree)
  edge$dec.label <- sapply(apeTree$edge[,2], select.tip.or.node, tree = apeTree)
  
  edge <- rbind(edge, data.frame(anc=-1, dec=rootNode, anc.label=-1, dec.label=rootNode))
  nnode <- nrow(edge)
  edge$length <- c(apeTree$edge.length*timeScale, max(maxtime - rootage, 1))
  edge$depth <- rep(rootage, nnode)
  for (i in 1:nnode) {j <- i
    while (1) {if (edge$dec[j] == rootNode) break
        edge$depth[i] <- edge$depth[i] - edge$length[j]
        j <- which(edge[,"dec"] == edge[j,"anc"])}}
  
  edge[which(edge$anc == -1),"dec.label"] <- edge[1,"anc.label"]
  # edge$tip.label <- rep("", nrow(edge))
  # for (i in 1:length(apeTree$tip.label)){edge[edge$dec==i,"tip.label"] <- apeTree$tip.label[i]}

  for (dec.nodeID in c(1:nrow(edge))){
    decs <- Descendants(apeTree, edge$dec[dec.nodeID],"tips")[[1]]
    edge$tips.from.dec[dec.nodeID] <- list(apeTree$tip.label[decs])
    edge$majClade.node[dec.nodeID] <- all(as.character(majClade_lat) %in% apeTree$tip.label[decs])
    edge$minClade.node[dec.nodeID] <- ifelse(length(minClade_lat) != 0, all(as.character(minClade_lat) %in% apeTree$tip.label[decs]),'NA')
  }
  
  tr2_snps_list$tree.attributes[[snpID]] <- data.frame(edge)
  majClade.node <- edge[which(edge$majClade.node==T),'dec.label']
  if (length(majClade.node)!=1){majClade.node <- edge[which(edge$anc != -1 & edge$majClade.node==T),'dec.label']}
  if (length(majClade.node)!=1){majClade.node <- edge[which(edge$anc != -1 & edge$majClade.node==T),'dec.label']}
  if (length(majClade.node)!=1){
     for (i in majClade.node){
      if (length(edge$tips.from.dec[i][[1]]) == length(minClade_lat)){
        majClade.node.final <- i
      }
    }
  }else{majClade.node.final <- majClade.node}
  
  tr2_snps$majClade.node.depth[snpID] <- edge[which(edge$dec.label == majClade.node.final),'depth']
  tr2_snps$snpID[snpID] <- snpID
}
#colnames(tr2_snps)[ncol(tr2_snps)] <- "snpID"
tr2_snps$node.jump <- c('NA',diff(floor(as.numeric(tr2_snps$majClade.node))))
tr2_snps$depth.jump <- c('NA',diff(floor(tr2_snps$majClade.node.depth)))
```

```{r}
edges2<- data.frame()
for(i in c(10:11,13:19)){
  cat(i,"\n")
  nodes <- subset_clades(tr2_snps, i)
  if(nrow(node)!=0){
      nodes %>%
        dplyr::distinct(minClade_lat) -> uniqueInds
    for (j in seq(nrow(uniqueInds))){
      cat("\t",j,"\n")
      inds = uniqueInds[j,][[1]]
      edges_raw <- find_clade_depths(nodes, tr2_snps_list, count = i, inds)
      
      if(nrow(edges_raw) == 0){
        edges_raw <- data.frame(depth = NA)
        edges_raw$tips.from.dec[1] <- list(inds)
        rowID <- sapply(nodes$minClade_lat, function(decs) all(inds %in% decs))
        edges_raw$snpID <- list(nodes[rowID,"snpID"])
        edges_raw$No.of.SNPs <- nrow(edges_raw)
      }else{
        edges_raw$snpID <- list(edges_raw$snpID)
        edges_raw$depth <- round(edges_raw$depth,4)
        edges_raw$No.of.SNPs <- nrow(edges_raw)
        edges_raw <- edges_raw[1,c("depth", "tips.from.dec","snpID","No.of.SNPs")]
      }
      edges2 <- rbind(edges2, edges_raw)
    }
  }else{
    edges2 <- edges2
  }
}

nodes <- subset_clades(tr2_snps,20)
nodes$majClade.node.depth <- round(nodes$majClade.node.depth,4)
nodes %>%
  dplyr::distinct(majClade.node.depth) -> uniqueCoalTime
for (i in seq(nrow(uniqueCoalTime))){
  edges_raw <- nodes[which(nodes$majClade.node.depth == uniqueCoalTime[i,]),]
  edges_raw$snpID <- list(edges_raw$snpID)
  edges_raw$No.of.SNPs <- nrow(edges_raw)
  edges_raw$tips.from.dec <- "ALL"
  edges_raw$depth  <- edges_raw$majClade.node.depth
  edges_raw <- edges_raw[1,c("depth", "tips.from.dec","snpID","No.of.SNPs")]
  edges2<- rbind(edges2, edges_raw)
}
row.names(edges2) <- seq(nrow(edges2))
edges2$edgeID <- seq(nrow(edges2))
```

```{r}
## Select edges supported by 3 or more SNPs
imp_edges2 <- edges2[edges2$No.of.SNPs >=3 & edges2$depth <=5 , ] #12 important edges
nrow(imp_edges2)
imp_edges2 <- imp_edges2[order(imp_edges2$depth),]
```

```{r}
## Find haplotype blocks

#singletons
bl1_c2<-find_clade_depths(subset_clades(tr2_snps, 19), tr2_snps_list, count = 19, inds = c(30))
inds1_c2 <- c(30)
hapblock1_c2 <- find_hapblocks(inds1_c2, imp_edges2$depth[1], "all", T, tr2_sub_list)
hapblock1_c2_all <- find_disjunct_hapblocks(hapblock1_c2)

bl2_c2<-find_clade_depths(subset_clades(tr2_snps, 19), tr2_snps_list, count = 19, inds = c(13))
inds2_c2 <- c(13)
hapblock2_c2 <- find_hapblocks(inds2_c2, imp_edges2$depth[2], "all", T, tr2_sub_list)
hapblock2_c2_all <- find_disjunct_hapblocks(hapblock2_c2)

#doubletons
bl3_c2<-find_clade_depths(subset_clades(tr2_snps, 18), tr2_snps_list, count = 18, inds = c(22,21))
inds3_c2 <- c(22,21)
hapblock3_c2 <- find_hapblocks(inds3_c2, imp_edges2$depth[3], "all", T, tr2_sub_list)
hapblock3_c2_all <- find_disjunct_hapblocks(hapblock3_c2)

bl4_c2<-find_clade_depths(subset_clades(tr2_snps, 18), tr2_snps_list, count = 18, inds = c(18,22))
inds4_c2 <- c(18,22)
hapblock4_c2 <- find_hapblocks(inds4_c2, imp_edges2$depth[4], "all", T, tr2_sub_list)
hapblock4_c2_all <- find_disjunct_hapblocks(hapblock4_c2)


## Fixed SNPs
indsALL <- lativitta_individuals
tr2_count20 <- subset_clades(tr2_snps, 20)

bl6_c2 <- tr2_count20[which(round(tr2_count20$majClade.node.depth,4) == imp_edges2$depth[6]),c("pos","snpID")]
hapblock6_c2 <- find_hapblocks(indsALL, imp_edges2$depth[6], "all", F, tr2_sub_list)
hapblock6_c2_all <- find_disjunct_hapblocks(hapblock6_c2)

bl7_c2 <- tr2_count20[which(round(tr2_count20$majClade.node.depth,4) == imp_edges2$depth[7]),c("pos","snpID")]
hapblock7_c2 <- find_hapblocks(indsALL, imp_edges2$depth[7], "all", F, tr2_sub_list)
hapblock7_c2_all <- find_disjunct_hapblocks(hapblock7_c2)

bl8_c2 <- tr2_count20[which(round(tr2_count20$majClade.node.depth,4) == imp_edges2$depth[8]),c("pos","snpID")]
hapblock8_c2 <- find_hapblocks(indsALL, imp_edges2$depth[8], "all", F, tr2_sub_list)
hapblock8_c2_all <- find_disjunct_hapblocks(hapblock8_c2)

bl9_c2 <- tr2_count20[which(round(tr2_count20$majClade.node.depth,4) == imp_edges2$depth[9]),c("pos","snpID")]
hapblock9_c2 <- find_hapblocks(indsALL, imp_edges2$depth[9], "all", F, tr2_sub_list)
hapblock9_c2_all <- find_disjunct_hapblocks(hapblock9_c2)

# bl10_c2 <- tr2_count20[which(round(tr2_count20$majClade.node.depth,4) == imp_edges2$depth[10]),c("pos","snpID")]
# hapblock10_c2 <- find_hapblocks(indsALL, imp_edges2$depth[10], "all", F, tr2_sub_list)
# hapblock10_c2_all <- find_disjunct_hapblocks(hapblock10_c2)
# 
# bl11_c2 <- tr2_count20[which(round(tr2_count20$majClade.node.depth,4) == imp_edges2$depth[11]),c("pos","snpID")]
# hapblock11_c2 <- find_hapblocks(indsALL, imp_edges2$depth[11], "all", F, tr2_sub_list)
# hapblock11_c2_all <- find_disjunct_hapblocks(hapblock11_c2)
# 
# bl12_c2 <- tr2_count20[which(round(tr2_count20$majClade.node.depth,4) == imp_edges2$depth[12]),c("pos","snpID")]
# hapblock12_c2 <- find_hapblocks(indsALL, imp_edges2$depth[12], "all", F, tr2_sub_list)
# hapblock12_c2_all <- find_disjunct_hapblocks(hapblock12_c2)
```

```{r, fig.height=10, fig.width=20, echo=F, results=F, warning=F, include=F}
par(mfrow=c(3,1))

plot_all_sites(focal_sites, as.character(1)) +
  #axis(side =2, tick =T, at = c(21:40), labels = F) +
  points(rep((bl3_c2$pos), each = 2), rep(which(sampID$ARGweaverID == c(21,22)),nrow(bl3_c2)) , col='pink', pch=19, cex = 1) +
  points(rep((bl4_c2$pos), each = 2), rep(which(sampID$ARGweaverID == c(18,22)),nrow(bl4_c2)) , col='yellow', pch=19, cex = 1) +
  points(rep((bl6_c2$pos), each = 20), rep(21:40, nrow(bl6_c2)) , col='darkblue', pch=19, cex = 1) +
  points(rep((bl7_c2$pos), each = 20), rep(21:40, nrow(bl7_c2)) , col='cyan', pch=19, cex = 1) +
  points(rep((bl8_c2$pos), each = 20), rep(21:40, nrow(bl8_c2)) , col='magenta', pch=19, cex = 1) +
  points(rep((bl9_c2$pos), each = 20), rep(21:40, nrow(bl9_c2)) , col='sienna', pch=19, cex = 1)

 # points((sites$pos[which(sites$snpID %in% trees_to_plot)]-pos0)/1e3, rep(1,6), pch =17, cex = 2)


plot(hapblock4_c2$chromStart, hapblock4_c2$length, type="n", log="y", ylim=c(0.01,15), xlab="", ylab="Time (Ne)", xlim=c(chromStart, chromEnd)) +
  plot_hapblock(hapblock3_c2_all, "pink", F) +
  plot_hapblock(hapblock4_c2_all, "yellow", F) +
  # plot_hapblock(hapblock3_c2_all, "red", T) +
  # plot_hapblock(hapblock4_c2_all, "blue", T) +
  plot_hapblock(hapblock6_c2_all, "darkblue", F) +
  plot_hapblock(hapblock7_c2_all, "cyan", F) +
  plot_hapblock(hapblock8_c2_all, "magenta", F) +
  plot_hapblock(hapblock9_c2_all, "sienna", F)

  # points((sites$pos[which(sites$snpID %in% trees_to_plot)]-pos0)/1e3, rep(0.001,6), pch =17, cex = 2) +
  # abline(v=(sites$pos[which(sites$snpID %in% trees_to_plot)]-pos0)/1e3, lty = 2, lwd = 0.8)


plot(gen9200, tmrca9200$tmrca_quantile_0.500/Ne, type='s', col = "black", lwd=0.8, log="y", 
     ylim=c(1e-2,15), xlim=c(chromStart, chromEnd), xlab = "Position (kB)", ylab = "TMRCA (Ne)") +
  lines(gen9200_not, tmrca9200_not$tmrca_quantile_0.500/Ne, type="s", col = "#CCCC00", lwd=1) +
    lines(gen9200_lat, tmrca9200_lat$tmrca_quantile_0.500/Ne, type="s", col = "#C00C28", lwd=3)
```

```{r fig6, fig.width=20, fig.height=7, echo=F, results=F, message=F, include=F}
par(mfrow=c(2,5))
for (i in c(1:137)){
  plotTree(tr2_snps$tree[i], 
          #keepSeqs = as.character(lativitta_individuals),
          leafCol = indColour_argIDs, ylab=i, 
          logScale = T, ylim=c(1e-2,15), timeScale = 1/Ne, cex.lab = 1.3)
}
```

#### Hapotype blocks from both iterations
**FigS9** shows the haplotype blocks and the SNPs that support each block from both iterations. 


```{r fig9, fig.width=17, fig.height=8, echo=F, results=F, message=F}
## plot 2 iterations side by side
par(mfrow=c(3,2), cex.lab = 1.4, cex.axis = 1.2)

plot_all_sites(focal_sites, as.character(1)) +
  axis(side =2, tick =T, at = c(21:40), labels = F) +
  points(rep((bl3_c1$pos), each = 2), rep(which(sampID$ARGweaverID == c(21,22)),nrow(bl3_c1)) , col='pink', pch=19, cex = 1) +
  points(rep((bl4_c1$pos), each = 2), rep(which(sampID$ARGweaverID == c(18,22)),nrow(bl4_c1)) , col='yellow', pch=19, cex = 1) +
  points(rep((bl6_c1$pos), each = 20), rep(21:40, nrow(bl6_c1)) , col='darkblue', pch=19, cex = 1) +
  points(rep((bl7_c1$pos), each = 20), rep(21:40, nrow(bl7_c1)) , col='cyan', pch=19, cex = 1) +
  points(rep((bl8_c1$pos), each = 20), rep(21:40, nrow(bl8_c1)) , col='magenta', pch=19, cex = 1) +
  points(rep((bl9_c1$pos), each = 20), rep(21:40, nrow(bl9_c1)) , col='sienna', pch=19, cex = 1) #+

 # points((sites$pos[which(sites$snpID %in% trees_to_plot)]-pos0)/1e3, rep(1,6), pch =17, cex = 2)

plot_all_sites(focal_sites, as.character(1)) +
  #axis(side =2, tick =T, at = c(21:40), labels = F) +
  points(rep((bl3_c2$pos), each = 2), rep(which(sampID$ARGweaverID == c(21,22)),nrow(bl3_c2)) , col='pink', pch=19, cex = 1) +
  points(rep((bl4_c2$pos), each = 2), rep(which(sampID$ARGweaverID == c(18,22)),nrow(bl4_c2)) , col='yellow', pch=19, cex = 1) +
  points(rep((bl6_c2$pos), each = 20), rep(21:40, nrow(bl6_c2)) , col='darkblue', pch=19, cex = 1) +
  points(rep((bl7_c2$pos), each = 20), rep(21:40, nrow(bl7_c2)) , col='cyan', pch=19, cex = 1) +
  points(rep((bl8_c2$pos), each = 20), rep(21:40, nrow(bl8_c2)) , col='magenta', pch=19, cex = 1) +
  points(rep((bl9_c2$pos), each = 20), rep(21:40, nrow(bl9_c2)) , col='sienna', pch=19, cex = 1)

 # points((sites$pos[which(sites$snpID %in% trees_to_plot)]-pos0)/1e3, rep(1,6), pch =17, cex = 2)

plot(hapblock1_c1$chromStart, hapblock1_c1$length, type="n", log="y", ylim=c(0.01,15), xlab="", ylab="Time (Ne)", xlim=c(chromStart, chromEnd)) +
  #plot_hapblock(hapblock2_c1_all, "pink", F) +
  plot_hapblock(hapblock3_c1_all, "pink", F) +
  plot_hapblock(hapblock4_c1_all, "yellow", F) +
  # plot_hapblock(hapblock5_c1_all, "blue", T) +
  plot_hapblock(hapblock6_c1_all, "darkblue", F) +
  plot_hapblock(hapblock7_c1_all, "cyan", F) #+
  plot_hapblock(hapblock8_c1_all, "magenta", F) +
  plot_hapblock(hapblock9_c1_all, "sienna", F)
  # points((sites$pos[which(sites$snpID %in% trees_to_plot)]-pos0)/1e3, rep(0.001,6), pch =17, cex = 2) +
  # abline(v=(sites$pos[which(sites$snpID %in% trees_to_plot)]-pos0)/1e3, lty = 2, lwd = 0.8)



plot(hapblock4_c2$chromStart, hapblock4_c2$length, type="n", log="y", ylim=c(0.01,15), xlab="", ylab="Time (Ne)", xlim=c(chromStart, chromEnd)) +
  plot_hapblock(hapblock3_c2_all, "pink", F) +
  plot_hapblock(hapblock4_c2_all, "yellow", F) +
  # plot_hapblock(hapblock3_c2_all, "red", T) +
  # plot_hapblock(hapblock4_c2_all, "blue", T) +
  plot_hapblock(hapblock6_c2_all, "darkblue", F) +
  plot_hapblock(hapblock7_c2_all, "cyan", F) +
  plot_hapblock(hapblock8_c2_all, "magenta", F) +
  plot_hapblock(hapblock9_c2_all, "sienna", F)

  # points((sites$pos[which(sites$snpID %in% trees_to_plot)]-pos0)/1e3, rep(0.001,6), pch =17, cex = 2) +
  # abline(v=(sites$pos[which(sites$snpID %in% trees_to_plot)]-pos0)/1e3, lty = 2, lwd = 0.8)

plot(gen8250, tmrca8250$tmrca_quantile_0.500/Ne, type='s', col = "black", lwd=0.8, log="y", 
     ylim=c(1e-2,15), xlim=c(1386000, 1389000), xlab = "Position (kB)", ylab = "TMRCA (Ne)") +
  lines(gen8250_not, tmrca8250_not$tmrca_quantile_0.500/Ne, type="s", col = "#CCCC00", lwd=1) +
    lines(gen8250_lat, tmrca8250_lat$tmrca_quantile_0.500/Ne, type="s", col = "#C00C28", lwd=3)


plot(gen9200, tmrca9200$tmrca_quantile_0.500/Ne, type='s', col = "black", lwd=0.8, log="y", 
     ylim=c(1e-2,15), xlim=c(chromStart, chromEnd), xlab = "Position (kB)", ylab = "TMRCA (Ne)") +
  lines(gen9200_not, tmrca9200_not$tmrca_quantile_0.500/Ne, type="s", col = "#CCCC00", lwd=1) +
    lines(gen9200_lat, tmrca9200_lat$tmrca_quantile_0.500/Ne, type="s", col = "#C00C28", lwd=3)
```

