Infer gene trees

It may be of interest to compare the true genealogies (one or more) at a locus to the gene tree that would be inferred from the sequence data that was simulated on the genealogies. These may not match if ILS or introgression is high and causes the genealogies within a locus to vary significantly, or if there is insufficient information in the locus to reconstruct the gene tree.

[10]:
import ipcoal
import toytree

Simulate data under a tree model

[11]:
tre = toytree.rtree.unittree(5, treeheight=1e6, seed=111)
[12]:
model = ipcoal.Model(tree=tre, Ne=1e6, recomb=1e-9, seed=111)
[13]:
model.sim_loci(nloci=15, nsites=500)

Infer gene trees for each locus

[14]:
model.infer_gene_trees(inference_method="raxml")

Save the results table to disk

[15]:
model.df.to_csv("./sim-table.csv")

View the result table

[16]:
model.df.head(20)
[16]:
locus start end nbps nsnps genealogy inferred_tree
0 0 0 20 20 2 (r1:3.81212e+06,(r0:1.29... (r2:0.018236661006054526...
1 0 20 207 187 6 ((r0:1.08352e+06,r1:1.08... (r2:0.018236661006054526...
2 0 207 500 293 15 (r0:1.53895e+06,(r1:1.29... (r2:0.018236661006054526...
3 1 0 1 1 1 ((r2:1.13896e+06,r0:1.13... ((r1:0.00734369332777404...
4 1 1 49 48 3 ((r2:1.13896e+06,r0:1.13... ((r1:0.00734369332777404...
5 1 49 232 183 16 ((r2:1.13896e+06,r0:1.13... ((r1:0.00734369332777404...
6 1 232 311 79 7 ((r2:1.13896e+06,r0:1.13... ((r1:0.00734369332777404...
7 1 311 500 189 8 ((r2:1.13896e+06,r0:1.13... ((r1:0.00734369332777404...
8 2 0 102 102 2 (r4:1.14352e+06,((r3:690... (r1:0.036231736161195168...
9 2 102 308 206 21 (r1:4.36661e+06,(r4:1.14... (r1:0.036231736161195168...
10 2 308 329 21 3 (r1:2.26911e+06,(r4:1.14... (r1:0.036231736161195168...
11 2 329 500 171 2 (r1:1.53965e+06,(r4:1.14... (r1:0.036231736161195168...
12 3 0 200 200 29 ((r2:1.78698e+06,r1:1.78... ((r2:0.01434572639409071...
13 3 200 394 194 26 ((r4:1.50328e+06,r3:1.50... ((r2:0.01434572639409071...
14 3 394 449 55 7 ((r4:1.50328e+06,r3:1.50... ((r2:0.01434572639409071...
15 3 449 500 51 4 (r4:5.96902e+06,(r0:1.90... ((r2:0.01434572639409071...
16 4 0 198 198 14 ((r0:1.06221e+06,r1:1.06... (((r3:0.0040143760928282...
17 4 198 205 7 1 (r0:5.23908e+06,(r1:1.89... (((r3:0.0040143760928282...
18 4 205 500 295 13 (r1:1.89605e+06,(r0:1.18... (((r3:0.0040143760928282...
19 5 0 42 42 2 (r4:4.76069e+06,(r0:1.72... (r4:0.028911615930344660...

Compare trees

The true genealogies vary across the locus, both in terms of coalescent times as well as in the topology. Neighboring trees are highly correlated, however, much more than genealogies from different unlinked loci.

[23]:
# the True genealogies at locus 0
mtre = toytree.mtree(model.df.genealogy[model.df.locus == 0])
mtre.draw_tree_grid(
    start=0,
    nrows=1, ncols=3,
    shared_axis=True,
    edge_type='c',
);
r4r3r2r0r1r4r3r2r1r0r4r3r2r1r0019060603812120

The inferred tree for this locus is shown below (three times) just for comparison with the trees above. However, when working empirically we typically cannot identify the exact breakpoint where one genealogical history or another was present. Therefore we infer just one gene tree for the entire locus (250bp in this case). Interestingly, the inferred gene tree topology matches the middle genealogy most closely even though the last topology contained the most SNPs and larger proportion of the locus history.

[22]:
# the inferred genealogy by raxml based on the sequence data
mtre = toytree.mtree(model.df.inferred_tree[model.df.locus == 0])
mtre.draw_tree_grid(
    start=0,
    nrows=1, ncols=3,
    shared_axis=True,
    edge_type='c',
    tip_labels_align=True,
);
r4r3r1r0r2r4r3r1r0r2r4r3r1r0r20.0000.0090.018