User Guide¶
[1]:
import ipcoal
import toytree
import numpy as np
Workflow¶
The user interacts with ipcoal through the Model object, through which the following are defined: * Tree topology * Effective population size * Admixture * Substitution rate * Recombination rate
Then, the methods .sim_loci()
or .sim_snps()
generate data based on the input parameters.
The resulting simulated data consists of a dataframe object that holds the simulated genealogies, and matrices that contain the actual simulated sequences.
After data simulation with .sim_loci
, ipcoal can automatically infer a gene tree from the simulated data at each locus. These gene trees are then incorporated into the dataframe object.
Defining the tree topology¶
ipcoal
uses a toytree
tree object as input for the toplogy of the species tree. You can read more about toytree at the toytree docs.
A toytree
tree object can be generated three different ways:
random topology¶
[2]:
# define the tree topology
tre = toytree.rtree.coaltree(5)
tre.draw(tree_style='c');
Using the coaltree function multiple times can give different topologies and different node heights:
[3]:
# generate 3 randome trees, save as a multitree object
mtre = toytree.mtree([toytree.rtree.coaltree(5) for i in range(3)])
# plot the trees
mtre.draw_tree_grid(
start=0,
nrows=1, ncols=3,
shared_axis=True,
edge_type='c',
);
Node heights can be randomly shifted using the node_slider()
method:
[4]:
# apply `node_slider()` three times to the first tree in our multitree object
nodeslide_trees = toytree.mtree([mtre.treelist[0].mod.node_slider() for i in range(3)])
# plot the nodeslider trees:
nodeslide_trees.draw_tree_grid(
start=0,
nrows=1, ncols=3,
shared_axis=True,
edge_type='c',
);
balanced/imbalanced topology¶
[5]:
# define a 6-tip balanced tree
tre = toytree.rtree.baltree(6)
tre.draw(tree_style='c');
# define a 6-tip imbalanced tree
tre = toytree.rtree.imbtree(6)
tre.draw(tree_style='c');
topology from newick¶
[6]:
tre = toytree.tree(
'((r5:2.62256,(r4:0.96563,r3:0.96563)100:1.65693)100:\
9.63792,(r2:7.42816,(r1:4.0738,r0:4.0738)100:\
3.35436)100:4.83232);'
)
tre.draw(tree_style='c');
Rescaling tree height¶
ipcoal
interprets node heights as units of generations, so on phylogenetic scales we expect the node heights to be of pretty large magnitudes.
The height of the starting tree is very low:
[7]:
tre.draw(tree_style='c');
We can rescale the height to be very high (seen on the y-axis):
[8]:
rescale_tre = tre.mod.node_scale_root_height(treeheight=1e6)
rescale_tre.draw(tree_style='c');
Defining the demographic model¶
The ipcoal
Model object allows users to designate effective population size, admixture, substitution rates, and recombination rates.
First, need to define a tree drawing style for the next few examples.
Define our tree (from previous section)¶
[9]:
tre = toytree.tree(
'((r5:2.62256,(r4:0.96563,r3:0.96563)100:1.65693)100:\
9.63792,(r2:7.42816,(r1:4.0738,r0:4.0738)100:\
3.35436)100:4.83232);'
).mod.node_scale_root_height(treeheight=1e6)
tre.draw(tree_style='c');
Set one Ne for the whole tree¶
[10]:
# define a model with Ne = 2
mod = ipcoal.Model(tree=tre,
Ne=2,
seed=123)
# or define a model with Ne = 1e5
mod = ipcoal.Model(tree=tre,
Ne=1e5,
seed=123)
Set branch-specific Ne¶
First, set an “Ne” attribute for the tree object.
[12]:
# set specific values to some nodes and a global default
tre = tre.set_node_values(
attr="Ne",
# the branches above these nodes get specific values
values={8: 5e6, 6: 5e6, 2: 5e6},
# the rest of the branches take this default value
default=20,
)
Look at the tree:
[13]:
# draw to show edge widths and node idxs
tre.draw(
width=300, height=250,
edge_widths=np.log(tre.get_edge_values("Ne")),
tree_style='c',
tip_labels=True,
);
Now define our model. The “Ne” attribute of the tree will, by default, be the Ne
value used by the model.
[14]:
mod = ipcoal.Model(tree=tre,
seed=123)
Admixture¶
[15]:
# init a random tree
tre = toytree.rtree.unittree(8, treeheight=1e6, seed=123)
[16]:
# draw with one admixture edge
tre.draw(
ts='c',
admixture_edges=(4, 3, 0.5, 0.1),
);
# draw with two admixture edges
tre.draw(
ts='c',
admixture_edges=[
(4, 3, 0.5, 0.1),
(3, 8, 0.5, 0.2),
],
);
We define these admixture edges the same way as input for the the model object.
[17]:
# init a model Class object for simulations
model = ipcoal.Model(
tree=tre,
admixture_edges=[
(3, 4, 0.5, 0.2),
(9, 10, 0.5, 0.2),
],
seed=123,
)
By default, the admixture event occurs as a single pulse at a moment in time (admixture_type=0
).
Admixture can also be simulated at a constant rate over a specified time interval:
[ ]:
# init a model Class object for simulations
model = ipcoal.Model(
tree=tre,
# "1" type specifies interval migration
admixture_type=1,
admixture_edges=[
(3, 4, (0.2,0.5), 0.2),
(9, 10, (0.4,0.7), 0.2),
],
seed=123,
)
Subsitution and recombination¶
[19]:
# init a random tree
tre = toytree.rtree.unittree(8, treeheight=1e6, seed=123)
A single mutation rate argument can be specified, which determines the scaling of gene tree heights when passed to seq-gen for producing sequences.
[20]:
# init a model Class object for simulations
model = ipcoal.Model(
tree=tre,
mut=1e-8,
seed=123,
)
A single recombination rate argument can be specified, which is passed to msprime for producing genealogies.
[21]:
# init a model Class object for simulations
model = ipcoal.Model(
tree=tre,
recomb=1e-8,
seed=123,
)
Simulating unlinked SNPs¶
[22]:
# simulate N unlinked SNPs (will run until N snps are produced)
model.sim_snps(1000)
[23]:
model.df.head(10)
[23]:
locus | start | end | nbps | nsnps | genealogy | |
---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 1 | 1 | (((r7:501240,r2:501240):... |
1 | 1 | 0 | 1 | 1 | 1 | (((r7:503798,r2:503798):... |
2 | 2 | 0 | 1 | 1 | 1 | (((r6:542921,r4:542921):... |
3 | 3 | 0 | 1 | 1 | 1 | (((r7:500400,r2:500400):... |
4 | 4 | 0 | 1 | 1 | 1 | ((r1:751576,(r3:505149,(... |
5 | 5 | 0 | 1 | 1 | 1 | ((r1:752733,(r3:503844,(... |
6 | 6 | 0 | 1 | 1 | 1 | ((r1:754845,(r3:643204,(... |
7 | 7 | 0 | 1 | 1 | 1 | (((r7:501030,r2:501030):... |
8 | 8 | 0 | 1 | 1 | 1 | ((r1:756209,(r3:526348,(... |
9 | 9 | 0 | 1 | 1 | 1 | ((r1:751913,(r3:505907,(... |
We have a built-in way to look at D-statistics on this SNP data:
[ ]:
ipcoal.utils.baba()
Simulating loci¶
[25]:
# simulate N loci of len L
model.sim_loci(100, 3000)
[26]:
# view the genealogies and stats in a table
model.df.head(10)
[26]:
locus | start | end | nbps | nsnps | genealogy | |
---|---|---|---|---|---|---|
0 | 0 | 0 | 325 | 325 | 13 | ((r1:752972,(r3:564190,(... |
1 | 0 | 325 | 350 | 25 | 2 | ((r1:752972,(r3:564190,(... |
2 | 0 | 350 | 638 | 288 | 15 | ((r1:752972,(r3:564190,(... |
3 | 0 | 638 | 998 | 360 | 15 | (((r7:501670,r2:501670):... |
4 | 0 | 998 | 1029 | 31 | 3 | (((r7:501670,r2:501670):... |
5 | 0 | 1029 | 1174 | 145 | 7 | (((r6:567513,r4:567513):... |
6 | 0 | 1174 | 1213 | 39 | 3 | ((r1:759461,(r3:504918,(... |
7 | 0 | 1213 | 1750 | 537 | 25 | ((r1:759461,(r3:504918,(... |
8 | 0 | 1750 | 1951 | 201 | 9 | ((r1:759461,(r3:504918,(... |
9 | 0 | 1951 | 2324 | 373 | 23 | (((r7:520823,r2:520823):... |
[37]:
# look at the shape of the sequence data
model.seqs.shape
[37]:
(100, 5, 3000)
[27]:
# view sequence data as an array
model.seqs
[27]:
array([[[1, 3, 3, ..., 1, 1, 0],
[1, 3, 3, ..., 1, 1, 0],
[1, 3, 3, ..., 1, 1, 0],
...,
[1, 3, 3, ..., 1, 1, 0],
[1, 3, 3, ..., 1, 1, 0],
[1, 3, 3, ..., 1, 1, 0]],
[[0, 0, 3, ..., 3, 1, 2],
[0, 0, 2, ..., 3, 1, 2],
[0, 0, 3, ..., 2, 1, 2],
...,
[0, 0, 3, ..., 3, 1, 2],
[0, 0, 3, ..., 2, 1, 2],
[0, 0, 3, ..., 2, 1, 2]],
[[3, 2, 1, ..., 3, 2, 3],
[3, 2, 1, ..., 3, 2, 3],
[3, 2, 1, ..., 3, 2, 3],
...,
[3, 2, 1, ..., 3, 2, 3],
[3, 2, 1, ..., 3, 2, 3],
[3, 2, 1, ..., 3, 2, 3]],
...,
[[0, 0, 0, ..., 1, 2, 3],
[0, 0, 0, ..., 1, 2, 3],
[0, 0, 0, ..., 1, 2, 3],
...,
[0, 0, 0, ..., 1, 2, 3],
[0, 0, 0, ..., 1, 2, 3],
[0, 0, 0, ..., 1, 2, 3]],
[[1, 1, 2, ..., 0, 1, 3],
[1, 1, 2, ..., 0, 1, 3],
[1, 1, 2, ..., 0, 1, 3],
...,
[1, 1, 2, ..., 0, 1, 3],
[1, 1, 2, ..., 0, 1, 3],
[1, 1, 2, ..., 0, 1, 3]],
[[1, 0, 1, ..., 0, 2, 1],
[1, 0, 1, ..., 0, 2, 1],
[1, 0, 1, ..., 0, 2, 1],
...,
[1, 0, 1, ..., 0, 2, 1],
[1, 0, 1, ..., 0, 2, 1],
[1, 0, 1, ..., 0, 2, 1]]], dtype=uint8)
Inferring gene trees¶
For use with simulated multilocus data.
Once multilocus data has been simulated, the .infer_gene_trees()
method automates the gene tree inference process across all loci.
[28]:
# define the tree topology
tre = toytree.rtree.unittree(5, treeheight=1e6, seed=123)
# define
model = ipcoal.Model(tree=tre, Ne=1e6, recomb=1e-9, seed=123)
# simulate N loci of len L
model.sim_loci(100, 3000)
[29]:
# look at the beginning of the dataframe
model.df.head(10)
[29]:
locus | start | end | nbps | nsnps | genealogy | |
---|---|---|---|---|---|---|
0 | 0 | 0 | 123 | 123 | 12 | ((r2:1.21382e+06,r0:1.21... |
1 | 0 | 123 | 251 | 128 | 14 | (r2:4.91932e+06,((r1:1.2... |
2 | 0 | 251 | 355 | 104 | 17 | ((r1:1.29726e+06,r4:1.29... |
3 | 0 | 355 | 398 | 43 | 5 | ((r1:1.29726e+06,r4:1.29... |
4 | 0 | 398 | 460 | 62 | 12 | ((r1:1.29726e+06,r4:1.29... |
5 | 0 | 460 | 476 | 16 | 5 | ((r1:1.29726e+06,r4:1.29... |
6 | 0 | 476 | 551 | 75 | 11 | ((r1:1.29726e+06,r4:1.29... |
7 | 0 | 551 | 713 | 162 | 35 | ((r1:1.29726e+06,r4:1.29... |
8 | 0 | 713 | 747 | 34 | 2 | ((r1:1.29726e+06,r4:1.29... |
9 | 0 | 747 | 749 | 2 | 0 | (r2:6.02091e+06,((r1:1.2... |
[30]:
# infer a tree for every locus
model.infer_gene_trees(inference_method='raxml')
Alternative inference methods include 'mrbayes'
and 'iqtree'
.
After running gene tree inference, our inferred gene trees can be found in the new inferred_tree
column of our dataframe:
[31]:
model.df.head(10)
[31]:
locus | start | end | nbps | nsnps | genealogy | inferred_tree | |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 123 | 123 | 12 | ((r2:1.21382e+06,r0:1.21... | (((r4:0.0154746174895009... |
1 | 0 | 123 | 251 | 128 | 14 | (r2:4.91932e+06,((r1:1.2... | (((r4:0.0154746174895009... |
2 | 0 | 251 | 355 | 104 | 17 | ((r1:1.29726e+06,r4:1.29... | (((r4:0.0154746174895009... |
3 | 0 | 355 | 398 | 43 | 5 | ((r1:1.29726e+06,r4:1.29... | (((r4:0.0154746174895009... |
4 | 0 | 398 | 460 | 62 | 12 | ((r1:1.29726e+06,r4:1.29... | (((r4:0.0154746174895009... |
5 | 0 | 460 | 476 | 16 | 5 | ((r1:1.29726e+06,r4:1.29... | (((r4:0.0154746174895009... |
6 | 0 | 476 | 551 | 75 | 11 | ((r1:1.29726e+06,r4:1.29... | (((r4:0.0154746174895009... |
7 | 0 | 551 | 713 | 162 | 35 | ((r1:1.29726e+06,r4:1.29... | (((r4:0.0154746174895009... |
8 | 0 | 713 | 747 | 34 | 2 | ((r1:1.29726e+06,r4:1.29... | (((r4:0.0154746174895009... |
9 | 0 | 747 | 749 | 2 | 0 | (r2:6.02091e+06,((r1:1.2... | (((r4:0.0154746174895009... |
We can look at the true genealogies for locus 0:
[32]:
# the True genealogies at locus 0
mtre = toytree.mtree(model.df.genealogy[model.df.locus == 0])
mtre.draw_tree_grid(
start=0,
nrows=1, ncols=3,
shared_axis=True,
edge_type='c',
);
And we can compare these to the inferred tree for this locus:
[33]:
# the inferred genealogy by raxml based on the sequence data
inftre = toytree.tree(model.df.inferred_tree[model.df.locus == 0][0])
inftre.draw(tree_style='c');
Writing to disk¶
We can write a simulated dataframe to a .csv file:
[34]:
# save table to a CSV file
model.df.to_csv("./tree_table.csv")
We can write the simulted loci as separate phylip files to a directory:
[35]:
# write loci as separate phylip files to a directory
model.write_loci_to_phylip(outdir="./tests")
wrote 100 loci (5 x 3000bp) to Users/pmckenz1/googledrive/projects/ipcoal/docs/notebooks/tests/[...].phy
We can concatenate the simulated loci and write this matrix to a single phylip file:
[36]:
# write concatenated loci or snps to a phylip file
model.write_concat_to_phylip(outdir="./tests", name="test.phy")
wrote concatenated loci (5 x 300000bp) to /Users/pmckenz1/googledrive/projects/ipcoal/docs/notebooks/tests/test.phy