User Guide

[1]:
import ipcoal
import toytree
import numpy as np

Workflow

The user interacts with ipcoal through the Model object, through which the following are defined: * Tree topology * Effective population size * Admixture * Substitution rate * Recombination rate

Then, the methods .sim_loci() or .sim_snps() generate data based on the input parameters.

The resulting simulated data consists of a dataframe object that holds the simulated genealogies, and matrices that contain the actual simulated sequences.

After data simulation with .sim_loci, ipcoal can automatically infer a gene tree from the simulated data at each locus. These gene trees are then incorporated into the dataframe object.

Defining the tree topology

ipcoal uses a toytree tree object as input for the toplogy of the species tree. You can read more about toytree at the toytree docs.

A toytree tree object can be generated three different ways:

random topology

[2]:
# define the tree topology
tre = toytree.rtree.coaltree(5)
tre.draw(tree_style='c');
idx: 0 name: r0 dist: 0.0681 support: 100 height: 0.00000idx: 1 name: r1 dist: 0.0681 support: 100 height: 0.00001idx: 2 name: r2 dist: 0.1577 support: 100 height: 0.00002idx: 3 name: r3 dist: 0.1577 support: 100 height: 0.00003idx: 4 name: r4 dist: 0.3409 support: 100 height: 0.00004idx: 5 name: 5 dist: 0.2164 support: 100 height: 0.06815idx: 6 name: 6 dist: 0.1268 support: 100 height: 0.15776idx: 7 name: 7 dist: 0.0563 support: 100 height: 0.28467idx: 8 name: 8 dist: 0.0001 support: 100 height: 0.340980.00.20.3

Using the coaltree function multiple times can give different topologies and different node heights:

[3]:
# generate 3 randome trees, save as a multitree object
mtre = toytree.mtree([toytree.rtree.coaltree(5) for i in range(3)])
# plot the trees
mtre.draw_tree_grid(
    start=0,
    nrows=1, ncols=3,
    shared_axis=True,
    edge_type='c',
);
r0r1r2r3r4r0r1r2r3r4r0r1r2r3r40.00.51.0

Node heights can be randomly shifted using the node_slider() method:

[4]:
# apply `node_slider()` three times to the first tree in our multitree object
nodeslide_trees = toytree.mtree([mtre.treelist[0].mod.node_slider() for i in range(3)])
# plot the nodeslider trees:
nodeslide_trees.draw_tree_grid(
    start=0,
    nrows=1, ncols=3,
    shared_axis=True,
    edge_type='c',
);
r0r1r2r3r4r0r1r2r3r4r0r1r2r3r40.00.20.5

balanced/imbalanced topology

[5]:
# define a 6-tip balanced tree
tre = toytree.rtree.baltree(6)
tre.draw(tree_style='c');
# define a 6-tip imbalanced tree
tre = toytree.rtree.imbtree(6)
tre.draw(tree_style='c');
idx: 0 name: r1 dist: 0.3333 support: 100 height: 0.00000idx: 1 name: r2 dist: 0.3333 support: 100 height: 0.00001idx: 2 name: r0 dist: 0.6667 support: 100 height: 0.00002idx: 3 name: r4 dist: 0.3333 support: 100 height: 0.00003idx: 4 name: r5 dist: 0.3333 support: 100 height: 0.00004idx: 5 name: r3 dist: 0.6667 support: 100 height: 0.00005idx: 6 name: 6 dist: 0.3333 support: 100 height: 0.33336idx: 7 name: 7 dist: 0.3333 support: 100 height: 0.33337idx: 8 name: 8 dist: 0.3333 support: 100 height: 0.66678idx: 9 name: 9 dist: 0.3333 support: 100 height: 0.66679idx: 10 name: 10 dist: 0.3333 support: 100 height: 1.0000100.00.51.0
idx: 0 name: r1 dist: 0.2000 support: 100 height: 0.00000idx: 1 name: r0 dist: 0.2000 support: 100 height: 0.00001idx: 2 name: r2 dist: 0.4000 support: 100 height: 0.00002idx: 3 name: r3 dist: 0.6000 support: 100 height: 0.00003idx: 4 name: r4 dist: 0.8000 support: 100 height: 0.00004idx: 5 name: r5 dist: 1.0000 support: 100 height: 0.00005idx: 6 name: 6 dist: 0.2000 support: 100 height: 0.20006idx: 7 name: 7 dist: 0.2000 support: 100 height: 0.40007idx: 8 name: 8 dist: 0.2000 support: 100 height: 0.60008idx: 9 name: 9 dist: 0.2000 support: 100 height: 0.80009idx: 10 name: 10 dist: 0.2000 support: 100 height: 1.0000100.00.51.0

topology from newick

[6]:
tre = toytree.tree(
             '((r5:2.62256,(r4:0.96563,r3:0.96563)100:1.65693)100:\
              9.63792,(r2:7.42816,(r1:4.0738,r0:4.0738)100:\
              3.35436)100:4.83232);'
            )
tre.draw(tree_style='c');
idx: 0 name: r0 dist: 4.0738 support: 100 height: 0.00000idx: 1 name: r1 dist: 4.0738 support: 100 height: 0.00001idx: 2 name: r2 dist: 7.4282 support: 100 height: 0.00002idx: 3 name: r3 dist: 0.9656 support: 100 height: 0.00003idx: 4 name: r4 dist: 0.9656 support: 100 height: 0.00004idx: 5 name: r5 dist: 2.6226 support: 100 height: 0.00005idx: 6 name: 6 dist: 3.3544 support: 100 height: 4.07386idx: 7 name: 7 dist: 1.6569 support: 100 height: 0.96567idx: 8 name: 8 dist: 4.8323 support: 100 height: 7.42828idx: 9 name: 9 dist: 9.6379 support: 100 height: 2.62269idx: 10 name: 10 dist: 1.0000 support: 100 height: 12.2605100612

Rescaling tree height

ipcoal interprets node heights as units of generations, so on phylogenetic scales we expect the node heights to be of pretty large magnitudes.

The height of the starting tree is very low:

[7]:
tre.draw(tree_style='c');
idx: 0 name: r0 dist: 4.0738 support: 100 height: 0.00000idx: 1 name: r1 dist: 4.0738 support: 100 height: 0.00001idx: 2 name: r2 dist: 7.4282 support: 100 height: 0.00002idx: 3 name: r3 dist: 0.9656 support: 100 height: 0.00003idx: 4 name: r4 dist: 0.9656 support: 100 height: 0.00004idx: 5 name: r5 dist: 2.6226 support: 100 height: 0.00005idx: 6 name: 6 dist: 3.3544 support: 100 height: 4.07386idx: 7 name: 7 dist: 1.6569 support: 100 height: 0.96567idx: 8 name: 8 dist: 4.8323 support: 100 height: 7.42828idx: 9 name: 9 dist: 9.6379 support: 100 height: 2.62269idx: 10 name: 10 dist: 1.0000 support: 100 height: 12.2605100612

We can rescale the height to be very high (seen on the y-axis):

[8]:
rescale_tre = tre.mod.node_scale_root_height(treeheight=1e6)
rescale_tre.draw(tree_style='c');
idx: 0 name: r0 dist: 332270.8409 support: 100 height: 0.00000idx: 1 name: r1 dist: 332270.8409 support: 100 height: 0.00001idx: 2 name: r2 dist: 605862.0870 support: 100 height: 0.00002idx: 3 name: r3 dist: 78759.5592 support: 100 height: 0.00003idx: 4 name: r4 dist: 78759.5592 support: 100 height: 0.00004idx: 5 name: r5 dist: 213903.5340 support: 100 height: 0.00005idx: 6 name: 6 dist: 273591.2460 support: 100 height: 332270.84096idx: 7 name: 7 dist: 135143.9748 support: 100 height: 78759.55927idx: 8 name: 8 dist: 394137.9130 support: 100 height: 605862.08708idx: 9 name: 9 dist: 786096.4660 support: 100 height: 213903.53409idx: 10 name: 10 dist: 81562.8752 support: 100 height: 1000000.00001005000001000000

Defining the demographic model

The ipcoal Model object allows users to designate effective population size, admixture, substitution rates, and recombination rates.

First, need to define a tree drawing style for the next few examples.

Define our tree (from previous section)

[9]:
tre = toytree.tree(
             '((r5:2.62256,(r4:0.96563,r3:0.96563)100:1.65693)100:\
              9.63792,(r2:7.42816,(r1:4.0738,r0:4.0738)100:\
              3.35436)100:4.83232);'
            ).mod.node_scale_root_height(treeheight=1e6)
tre.draw(tree_style='c');
idx: 0 name: r0 dist: 332270.8409 support: 100 height: 0.00000idx: 1 name: r1 dist: 332270.8409 support: 100 height: 0.00001idx: 2 name: r2 dist: 605862.0870 support: 100 height: 0.00002idx: 3 name: r3 dist: 78759.5592 support: 100 height: 0.00003idx: 4 name: r4 dist: 78759.5592 support: 100 height: 0.00004idx: 5 name: r5 dist: 213903.5340 support: 100 height: 0.00005idx: 6 name: 6 dist: 273591.2460 support: 100 height: 332270.84096idx: 7 name: 7 dist: 135143.9748 support: 100 height: 78759.55927idx: 8 name: 8 dist: 394137.9130 support: 100 height: 605862.08708idx: 9 name: 9 dist: 786096.4660 support: 100 height: 213903.53409idx: 10 name: 10 dist: 81562.8752 support: 100 height: 1000000.00001005000001000000

Set one Ne for the whole tree

[10]:
# define a model with Ne = 2
mod = ipcoal.Model(tree=tre,
                      Ne=2,
                      seed=123)

# or define a model with Ne = 1e5
mod = ipcoal.Model(tree=tre,
                      Ne=1e5,
                      seed=123)

Set branch-specific Ne

First, set an “Ne” attribute for the tree object.

[12]:
# set specific values to some nodes and a global default
tre = tre.set_node_values(
    attr="Ne",
    # the branches above these nodes get specific values
    values={8: 5e6, 6: 5e6, 2: 5e6},
    # the rest of the branches take this default value
    default=20,
)

Look at the tree:

[13]:
# draw to show edge widths and node idxs
tre.draw(
    width=300, height=250,
    edge_widths=np.log(tre.get_edge_values("Ne")),
    tree_style='c',
    tip_labels=True,
);
r0r1r2r3r4r5idx: 0 name: r0 dist: 332270.8409 support: 100 height: 0.00000idx: 1 name: r1 dist: 332270.8409 support: 100 height: 0.00001idx: 2 name: r2 dist: 605862.0870 support: 100 height: 0.00002idx: 3 name: r3 dist: 78759.5592 support: 100 height: 0.00003idx: 4 name: r4 dist: 78759.5592 support: 100 height: 0.00004idx: 5 name: r5 dist: 213903.5340 support: 100 height: 0.00005idx: 6 name: 6 dist: 273591.2460 support: 100 height: 332270.84096idx: 7 name: 7 dist: 135143.9748 support: 100 height: 78759.55927idx: 8 name: 8 dist: 394137.9130 support: 100 height: 605862.08708idx: 9 name: 9 dist: 786096.4660 support: 100 height: 213903.53409idx: 10 name: 10 dist: 81562.8752 support: 100 height: 1000000.00001005000001000000

Now define our model. The “Ne” attribute of the tree will, by default, be the Ne value used by the model.

[14]:
mod = ipcoal.Model(tree=tre,
                      seed=123)

Admixture

[15]:
# init a random tree
tre = toytree.rtree.unittree(8, treeheight=1e6, seed=123)
[16]:
# draw with one admixture edge
tre.draw(
    ts='c',
    admixture_edges=(4, 3, 0.5, 0.1),
);

# draw with two admixture edges
tre.draw(
    ts='c',
    admixture_edges=[
        (4, 3, 0.5, 0.1),
        (3, 8, 0.5, 0.2),
    ],
);
idx: 0 name: r6 dist: 500000 support: 100 height: 00idx: 1 name: r4 dist: 500000 support: 100 height: 01idx: 2 name: r7 dist: 500000 support: 100 height: 02idx: 3 name: r2 dist: 500000 support: 100 height: 03idx: 4 name: r5 dist: 250000 support: 100 height: 04idx: 5 name: r0 dist: 250000 support: 100 height: 05idx: 6 name: r3 dist: 500000 support: 100 height: 06idx: 7 name: r1 dist: 750000 support: 100 height: 07idx: 8 name: 8 dist: 250000 support: 100 height: 2500008idx: 9 name: 9 dist: 250000 support: 100 height: 5000009idx: 10 name: 10 dist: 250000 support: 100 height: 50000010idx: 11 name: 11 dist: 250000 support: 100 height: 50000011idx: 12 name: 12 dist: 250000 support: 100 height: 75000012idx: 13 name: 13 dist: 250000 support: 100 height: 75000013idx: 14 name: 14 dist: 250000 support: 100 height: 10000001405000001000000
idx: 0 name: r6 dist: 500000 support: 100 height: 00idx: 1 name: r4 dist: 500000 support: 100 height: 01idx: 2 name: r7 dist: 500000 support: 100 height: 02idx: 3 name: r2 dist: 500000 support: 100 height: 03idx: 4 name: r5 dist: 250000 support: 100 height: 04idx: 5 name: r0 dist: 250000 support: 100 height: 05idx: 6 name: r3 dist: 500000 support: 100 height: 06idx: 7 name: r1 dist: 750000 support: 100 height: 07idx: 8 name: 8 dist: 250000 support: 100 height: 2500008idx: 9 name: 9 dist: 250000 support: 100 height: 5000009idx: 10 name: 10 dist: 250000 support: 100 height: 50000010idx: 11 name: 11 dist: 250000 support: 100 height: 50000011idx: 12 name: 12 dist: 250000 support: 100 height: 75000012idx: 13 name: 13 dist: 250000 support: 100 height: 75000013idx: 14 name: 14 dist: 250000 support: 100 height: 10000001405000001000000

We define these admixture edges the same way as input for the the model object.

[17]:
# init a model Class object for simulations
model = ipcoal.Model(
    tree=tre,
    admixture_edges=[
        (3, 4, 0.5, 0.2),
        (9, 10, 0.5, 0.2),
    ],
    seed=123,
)

By default, the admixture event occurs as a single pulse at a moment in time (admixture_type=0).

Admixture can also be simulated at a constant rate over a specified time interval:

[ ]:
# init a model Class object for simulations
model = ipcoal.Model(
    tree=tre,
    # "1" type specifies interval migration
    admixture_type=1,
    admixture_edges=[
        (3, 4, (0.2,0.5), 0.2),
        (9, 10, (0.4,0.7), 0.2),
    ],
    seed=123,
)

Subsitution and recombination

[19]:
# init a random tree
tre = toytree.rtree.unittree(8, treeheight=1e6, seed=123)

A single mutation rate argument can be specified, which determines the scaling of gene tree heights when passed to seq-gen for producing sequences.

[20]:
# init a model Class object for simulations
model = ipcoal.Model(
    tree=tre,
    mut=1e-8,
    seed=123,
)

A single recombination rate argument can be specified, which is passed to msprime for producing genealogies.

[21]:
# init a model Class object for simulations
model = ipcoal.Model(
    tree=tre,
    recomb=1e-8,
    seed=123,
)

Simulating unlinked SNPs

[22]:
# simulate N unlinked SNPs (will run until N snps are produced)
model.sim_snps(1000)
[23]:
model.df.head(10)
[23]:
locus start end nbps nsnps genealogy
0 0 0 1 1 1 (((r7:501240,r2:501240):...
1 1 0 1 1 1 (((r7:503798,r2:503798):...
2 2 0 1 1 1 (((r6:542921,r4:542921):...
3 3 0 1 1 1 (((r7:500400,r2:500400):...
4 4 0 1 1 1 ((r1:751576,(r3:505149,(...
5 5 0 1 1 1 ((r1:752733,(r3:503844,(...
6 6 0 1 1 1 ((r1:754845,(r3:643204,(...
7 7 0 1 1 1 (((r7:501030,r2:501030):...
8 8 0 1 1 1 ((r1:756209,(r3:526348,(...
9 9 0 1 1 1 ((r1:751913,(r3:505907,(...

We have a built-in way to look at D-statistics on this SNP data:

[ ]:
ipcoal.utils.baba()

Simulating loci

[25]:
# simulate N loci of len L
model.sim_loci(100, 3000)
[26]:
# view the genealogies and stats in a table
model.df.head(10)
[26]:
locus start end nbps nsnps genealogy
0 0 0 325 325 13 ((r1:752972,(r3:564190,(...
1 0 325 350 25 2 ((r1:752972,(r3:564190,(...
2 0 350 638 288 15 ((r1:752972,(r3:564190,(...
3 0 638 998 360 15 (((r7:501670,r2:501670):...
4 0 998 1029 31 3 (((r7:501670,r2:501670):...
5 0 1029 1174 145 7 (((r6:567513,r4:567513):...
6 0 1174 1213 39 3 ((r1:759461,(r3:504918,(...
7 0 1213 1750 537 25 ((r1:759461,(r3:504918,(...
8 0 1750 1951 201 9 ((r1:759461,(r3:504918,(...
9 0 1951 2324 373 23 (((r7:520823,r2:520823):...
[37]:
# look at the shape of the sequence data
model.seqs.shape
[37]:
(100, 5, 3000)
[27]:
# view sequence data as an array
model.seqs
[27]:
array([[[1, 3, 3, ..., 1, 1, 0],
        [1, 3, 3, ..., 1, 1, 0],
        [1, 3, 3, ..., 1, 1, 0],
        ...,
        [1, 3, 3, ..., 1, 1, 0],
        [1, 3, 3, ..., 1, 1, 0],
        [1, 3, 3, ..., 1, 1, 0]],

       [[0, 0, 3, ..., 3, 1, 2],
        [0, 0, 2, ..., 3, 1, 2],
        [0, 0, 3, ..., 2, 1, 2],
        ...,
        [0, 0, 3, ..., 3, 1, 2],
        [0, 0, 3, ..., 2, 1, 2],
        [0, 0, 3, ..., 2, 1, 2]],

       [[3, 2, 1, ..., 3, 2, 3],
        [3, 2, 1, ..., 3, 2, 3],
        [3, 2, 1, ..., 3, 2, 3],
        ...,
        [3, 2, 1, ..., 3, 2, 3],
        [3, 2, 1, ..., 3, 2, 3],
        [3, 2, 1, ..., 3, 2, 3]],

       ...,

       [[0, 0, 0, ..., 1, 2, 3],
        [0, 0, 0, ..., 1, 2, 3],
        [0, 0, 0, ..., 1, 2, 3],
        ...,
        [0, 0, 0, ..., 1, 2, 3],
        [0, 0, 0, ..., 1, 2, 3],
        [0, 0, 0, ..., 1, 2, 3]],

       [[1, 1, 2, ..., 0, 1, 3],
        [1, 1, 2, ..., 0, 1, 3],
        [1, 1, 2, ..., 0, 1, 3],
        ...,
        [1, 1, 2, ..., 0, 1, 3],
        [1, 1, 2, ..., 0, 1, 3],
        [1, 1, 2, ..., 0, 1, 3]],

       [[1, 0, 1, ..., 0, 2, 1],
        [1, 0, 1, ..., 0, 2, 1],
        [1, 0, 1, ..., 0, 2, 1],
        ...,
        [1, 0, 1, ..., 0, 2, 1],
        [1, 0, 1, ..., 0, 2, 1],
        [1, 0, 1, ..., 0, 2, 1]]], dtype=uint8)

Inferring gene trees

  • For use with simulated multilocus data.

Once multilocus data has been simulated, the .infer_gene_trees() method automates the gene tree inference process across all loci.

[28]:
# define the tree topology
tre = toytree.rtree.unittree(5, treeheight=1e6, seed=123)
# define
model = ipcoal.Model(tree=tre, Ne=1e6, recomb=1e-9, seed=123)
# simulate N loci of len L
model.sim_loci(100, 3000)
[29]:
# look at the beginning of the dataframe
model.df.head(10)
[29]:
locus start end nbps nsnps genealogy
0 0 0 123 123 12 ((r2:1.21382e+06,r0:1.21...
1 0 123 251 128 14 (r2:4.91932e+06,((r1:1.2...
2 0 251 355 104 17 ((r1:1.29726e+06,r4:1.29...
3 0 355 398 43 5 ((r1:1.29726e+06,r4:1.29...
4 0 398 460 62 12 ((r1:1.29726e+06,r4:1.29...
5 0 460 476 16 5 ((r1:1.29726e+06,r4:1.29...
6 0 476 551 75 11 ((r1:1.29726e+06,r4:1.29...
7 0 551 713 162 35 ((r1:1.29726e+06,r4:1.29...
8 0 713 747 34 2 ((r1:1.29726e+06,r4:1.29...
9 0 747 749 2 0 (r2:6.02091e+06,((r1:1.2...
[30]:
# infer a tree for every locus
model.infer_gene_trees(inference_method='raxml')

Alternative inference methods include 'mrbayes' and 'iqtree'.

After running gene tree inference, our inferred gene trees can be found in the new inferred_tree column of our dataframe:

[31]:
model.df.head(10)
[31]:
locus start end nbps nsnps genealogy inferred_tree
0 0 0 123 123 12 ((r2:1.21382e+06,r0:1.21... (((r4:0.0154746174895009...
1 0 123 251 128 14 (r2:4.91932e+06,((r1:1.2... (((r4:0.0154746174895009...
2 0 251 355 104 17 ((r1:1.29726e+06,r4:1.29... (((r4:0.0154746174895009...
3 0 355 398 43 5 ((r1:1.29726e+06,r4:1.29... (((r4:0.0154746174895009...
4 0 398 460 62 12 ((r1:1.29726e+06,r4:1.29... (((r4:0.0154746174895009...
5 0 460 476 16 5 ((r1:1.29726e+06,r4:1.29... (((r4:0.0154746174895009...
6 0 476 551 75 11 ((r1:1.29726e+06,r4:1.29... (((r4:0.0154746174895009...
7 0 551 713 162 35 ((r1:1.29726e+06,r4:1.29... (((r4:0.0154746174895009...
8 0 713 747 34 2 ((r1:1.29726e+06,r4:1.29... (((r4:0.0154746174895009...
9 0 747 749 2 0 (r2:6.02091e+06,((r1:1.2... (((r4:0.0154746174895009...

We can look at the true genealogies for locus 0:

[32]:
# the True genealogies at locus 0
mtre = toytree.mtree(model.df.genealogy[model.df.locus == 0])
mtre.draw_tree_grid(
    start=0,
    nrows=1, ncols=3,
    shared_axis=True,
    edge_type='c',
);
r4r1r3r0r2r3r0r4r1r2r3r0r2r4r1024596604919321

And we can compare these to the inferred tree for this locus:

[33]:
# the inferred genealogy by raxml based on the sequence data
inftre = toytree.tree(model.df.inferred_tree[model.df.locus == 0][0])
inftre.draw(tree_style='c');
idx: 0 name: r1 dist: 0.0087 support: 100 height: 0.00680idx: 1 name: r4 dist: 0.0155 support: 100 height: 0.00001idx: 2 name: r2 dist: 0.0587 support: 100 height: 0.01332idx: 3 name: r0 dist: 0.0118 support: 100 height: 0.06853idx: 4 name: r3 dist: 0.0137 support: 100 height: 0.06674idx: 5 name: 5 dist: 0.0566 support: 100 height: 0.01555idx: 6 name: 6 dist: 0.0083 support: 100 height: 0.07216idx: 7 name: 7 dist: 0.0000 support: 100 height: 0.080370.000.040.08

Writing to disk

We can write a simulated dataframe to a .csv file:

[34]:
# save table to a CSV file
model.df.to_csv("./tree_table.csv")

We can write the simulted loci as separate phylip files to a directory:

[35]:
# write loci as separate phylip files to a directory
model.write_loci_to_phylip(outdir="./tests")
wrote 100 loci (5 x 3000bp) to Users/pmckenz1/googledrive/projects/ipcoal/docs/notebooks/tests/[...].phy

We can concatenate the simulated loci and write this matrix to a single phylip file:

[36]:
# write concatenated loci or snps to a phylip file
model.write_concat_to_phylip(outdir="./tests", name="test.phy")
wrote concatenated loci (5 x 300000bp) to /Users/pmckenz1/googledrive/projects/ipcoal/docs/notebooks/tests/test.phy