User Guide¶

[1]:

import ipcoal
import toytree
import numpy as np

Workflow¶

The user interacts with ipcoal through the Model object, through which the following are defined: * Tree topology * Effective population size * Admixture * Substitution rate * Recombination rate

Then, the methods .sim_loci() or .sim_snps() generate data based on the input parameters.

The resulting simulated data consists of a dataframe object that holds the simulated genealogies, and matrices that contain the actual simulated sequences.

After data simulation with .sim_loci, ipcoal can automatically infer a gene tree from the simulated data at each locus. These gene trees are then incorporated into the dataframe object.

Defining the tree topology¶

ipcoal uses a toytree tree object as input for the toplogy of the species tree. You can read more about toytree at the toytree docs.

A toytree tree object can be generated three different ways:

random topology¶

[2]:

# define the tree topology
tre = toytree.rtree.coaltree(5)
tre.draw(tree_style='c');

Using the coaltree function multiple times can give different topologies and different node heights:

[3]:

# generate 3 randome trees, save as a multitree object
mtre = toytree.mtree([toytree.rtree.coaltree(5) for i in range(3)])
# plot the trees
mtre.draw_tree_grid(
    start=0,
    nrows=1, ncols=3,
    shared_axis=True,
    edge_type='c',
);

Node heights can be randomly shifted using the node_slider() method:

[4]:

# apply `node_slider()` three times to the first tree in our multitree object
nodeslide_trees = toytree.mtree([mtre.treelist[0].mod.node_slider() for i in range(3)])
# plot the nodeslider trees:
nodeslide_trees.draw_tree_grid(
    start=0,
    nrows=1, ncols=3,
    shared_axis=True,
    edge_type='c',
);

balanced/imbalanced topology¶

[5]:

# define a 6-tip balanced tree
tre = toytree.rtree.baltree(6)
tre.draw(tree_style='c');
# define a 6-tip imbalanced tree
tre = toytree.rtree.imbtree(6)
tre.draw(tree_style='c');

topology from newick¶

[6]:

tre = toytree.tree(
             '((r5:2.62256,(r4:0.96563,r3:0.96563)100:1.65693)100:\
              9.63792,(r2:7.42816,(r1:4.0738,r0:4.0738)100:\
              3.35436)100:4.83232);'
            )
tre.draw(tree_style='c');

Rescaling tree height¶

ipcoal interprets node heights as units of generations, so on phylogenetic scales we expect the node heights to be of pretty large magnitudes.

The height of the starting tree is very low:

[7]:

tre.draw(tree_style='c');

We can rescale the height to be very high (seen on the y-axis):

[8]:

rescale_tre = tre.mod.node_scale_root_height(treeheight=1e6)
rescale_tre.draw(tree_style='c');

Defining the demographic model¶

The ipcoal Model object allows users to designate effective population size, admixture, substitution rates, and recombination rates.

First, need to define a tree drawing style for the next few examples.

Define our tree (from previous section)¶

[9]:

tre = toytree.tree(
             '((r5:2.62256,(r4:0.96563,r3:0.96563)100:1.65693)100:\
              9.63792,(r2:7.42816,(r1:4.0738,r0:4.0738)100:\
              3.35436)100:4.83232);'
            ).mod.node_scale_root_height(treeheight=1e6)
tre.draw(tree_style='c');

Set one Ne for the whole tree¶

[10]:

# define a model with Ne = 2
mod = ipcoal.Model(tree=tre,
                      Ne=2,
                      seed=123)

# or define a model with Ne = 1e5
mod = ipcoal.Model(tree=tre,
                      Ne=1e5,
                      seed=123)

Set branch-specific Ne¶

First, set an “Ne” attribute for the tree object.

[12]:

# set specific values to some nodes and a global default
tre = tre.set_node_values(
    attr="Ne",
    # the branches above these nodes get specific values
    values={8: 5e6, 6: 5e6, 2: 5e6},
    # the rest of the branches take this default value
    default=20,
)

Look at the tree:

[13]:

# draw to show edge widths and node idxs
tre.draw(
    width=300, height=250,
    edge_widths=np.log(tre.get_edge_values("Ne")),
    tree_style='c',
    tip_labels=True,
);

Now define our model. The “Ne” attribute of the tree will, by default, be the Ne value used by the model.

[14]:

mod = ipcoal.Model(tree=tre,
                      seed=123)

Admixture¶

[15]:

# init a random tree
tre = toytree.rtree.unittree(8, treeheight=1e6, seed=123)

[16]:

# draw with one admixture edge
tre.draw(
    ts='c',
    admixture_edges=(4, 3, 0.5, 0.1),
);

# draw with two admixture edges
tre.draw(
    ts='c',
    admixture_edges=[
        (4, 3, 0.5, 0.1),
        (3, 8, 0.5, 0.2),
    ],
);

We define these admixture edges the same way as input for the the model object.

[17]:

# init a model Class object for simulations
model = ipcoal.Model(
    tree=tre,
    admixture_edges=[
        (3, 4, 0.5, 0.2),
        (9, 10, 0.5, 0.2),
    ],
    seed=123,
)

By default, the admixture event occurs as a single pulse at a moment in time (admixture_type=0).

Admixture can also be simulated at a constant rate over a specified time interval:

[ ]:

# init a model Class object for simulations
model = ipcoal.Model(
    tree=tre,
    # "1" type specifies interval migration
    admixture_type=1,
    admixture_edges=[
        (3, 4, (0.2,0.5), 0.2),
        (9, 10, (0.4,0.7), 0.2),
    ],
    seed=123,
)

Subsitution and recombination¶

[19]:

# init a random tree
tre = toytree.rtree.unittree(8, treeheight=1e6, seed=123)

A single mutation rate argument can be specified, which determines the scaling of gene tree heights when passed to seq-gen for producing sequences.

[20]:

# init a model Class object for simulations
model = ipcoal.Model(
    tree=tre,
    mut=1e-8,
    seed=123,
)

A single recombination rate argument can be specified, which is passed to msprime for producing genealogies.

[21]:

# init a model Class object for simulations
model = ipcoal.Model(
    tree=tre,
    recomb=1e-8,
    seed=123,
)

Simulating unlinked SNPs¶

[22]:

# simulate N unlinked SNPs (will run until N snps are produced)
model.sim_snps(1000)

[23]:

model.df.head(10)

[23]:

	locus	end	nbps	nsnps	genealogy
0	0	1	1	1	(((r7:501240,r2:501240):...
1	1	1	1	1	(((r7:503798,r2:503798):...
2	2	1	1	1	(((r6:542921,r4:542921):...
3	3	1	1	1	(((r7:500400,r2:500400):...
4	4	1	1	1	((r1:751576,(r3:505149,(...
5	5	1	1	1	((r1:752733,(r3:503844,(...
6	6	1	1	1	((r1:754845,(r3:643204,(...
7	7	1	1	1	(((r7:501030,r2:501030):...
8	8	1	1	1	((r1:756209,(r3:526348,(...
9	9	1	1	1	((r1:751913,(r3:505907,(...

We have a built-in way to look at D-statistics on this SNP data:

[ ]:

ipcoal.utils.baba()

Simulating loci¶

[25]:

# simulate N loci of len L
model.sim_loci(100, 3000)

[26]:

# view the genealogies and stats in a table
model.df.head(10)

[26]:

	start	end	nbps	nsnps	genealogy
0	0	325	325	13	((r1:752972,(r3:564190,(...
1	325	350	25	2	((r1:752972,(r3:564190,(...
2	350	638	288	15	((r1:752972,(r3:564190,(...
3	638	998	360	15	(((r7:501670,r2:501670):...
4	998	1029	31	3	(((r7:501670,r2:501670):...
5	1029	1174	145	7	(((r6:567513,r4:567513):...
6	1174	1213	39	3	((r1:759461,(r3:504918,(...
7	1213	1750	537	25	((r1:759461,(r3:504918,(...
8	1750	1951	201	9	((r1:759461,(r3:504918,(...
9	1951	2324	373	23	(((r7:520823,r2:520823):...

[37]:

# look at the shape of the sequence data
model.seqs.shape

[37]:

(100, 5, 3000)

[27]:

# view sequence data as an array
model.seqs

[27]:

array([[[1, 3, 3, ..., 1, 1, 0],
        [1, 3, 3, ..., 1, 1, 0],
        [1, 3, 3, ..., 1, 1, 0],
        ...,
        [1, 3, 3, ..., 1, 1, 0],
        [1, 3, 3, ..., 1, 1, 0],
        [1, 3, 3, ..., 1, 1, 0]],

       [[0, 0, 3, ..., 3, 1, 2],
        [0, 0, 2, ..., 3, 1, 2],
        [0, 0, 3, ..., 2, 1, 2],
        ...,
        [0, 0, 3, ..., 3, 1, 2],
        [0, 0, 3, ..., 2, 1, 2],
        [0, 0, 3, ..., 2, 1, 2]],

       [[3, 2, 1, ..., 3, 2, 3],
        [3, 2, 1, ..., 3, 2, 3],
        [3, 2, 1, ..., 3, 2, 3],
        ...,
        [3, 2, 1, ..., 3, 2, 3],
        [3, 2, 1, ..., 3, 2, 3],
        [3, 2, 1, ..., 3, 2, 3]],

       ...,

       [[0, 0, 0, ..., 1, 2, 3],
        [0, 0, 0, ..., 1, 2, 3],
        [0, 0, 0, ..., 1, 2, 3],
        ...,
        [0, 0, 0, ..., 1, 2, 3],
        [0, 0, 0, ..., 1, 2, 3],
        [0, 0, 0, ..., 1, 2, 3]],

       [[1, 1, 2, ..., 0, 1, 3],
        [1, 1, 2, ..., 0, 1, 3],
        [1, 1, 2, ..., 0, 1, 3],
        ...,
        [1, 1, 2, ..., 0, 1, 3],
        [1, 1, 2, ..., 0, 1, 3],
        [1, 1, 2, ..., 0, 1, 3]],

       [[1, 0, 1, ..., 0, 2, 1],
        [1, 0, 1, ..., 0, 2, 1],
        [1, 0, 1, ..., 0, 2, 1],
        ...,
        [1, 0, 1, ..., 0, 2, 1],
        [1, 0, 1, ..., 0, 2, 1],
        [1, 0, 1, ..., 0, 2, 1]]], dtype=uint8)

Inferring gene trees¶

For use with simulated multilocus data.

Once multilocus data has been simulated, the .infer_gene_trees() method automates the gene tree inference process across all loci.

[28]:

# define the tree topology
tre = toytree.rtree.unittree(5, treeheight=1e6, seed=123)
# define
model = ipcoal.Model(tree=tre, Ne=1e6, recomb=1e-9, seed=123)
# simulate N loci of len L
model.sim_loci(100, 3000)

[29]:

# look at the beginning of the dataframe
model.df.head(10)

[29]:

	start	end	nbps	nsnps	genealogy
0	0	123	123	12	((r2:1.21382e+06,r0:1.21...
1	123	251	128	14	(r2:4.91932e+06,((r1:1.2...
2	251	355	104	17	((r1:1.29726e+06,r4:1.29...
3	355	398	43	5	((r1:1.29726e+06,r4:1.29...
4	398	460	62	12	((r1:1.29726e+06,r4:1.29...
5	460	476	16	5	((r1:1.29726e+06,r4:1.29...
6	476	551	75	11	((r1:1.29726e+06,r4:1.29...
7	551	713	162	35	((r1:1.29726e+06,r4:1.29...
8	713	747	34	2	((r1:1.29726e+06,r4:1.29...
9	747	749	2	0	(r2:6.02091e+06,((r1:1.2...

[30]:

# infer a tree for every locus
model.infer_gene_trees(inference_method='raxml')

Alternative inference methods include 'mrbayes' and 'iqtree'.

After running gene tree inference, our inferred gene trees can be found in the new inferred_tree column of our dataframe:

[31]:

model.df.head(10)

[31]:

	start	end	nbps	nsnps	genealogy	inferred_tree
0	0	123	123	12	((r2:1.21382e+06,r0:1.21...	(((r4:0.0154746174895009...
1	123	251	128	14	(r2:4.91932e+06,((r1:1.2...	(((r4:0.0154746174895009...
2	251	355	104	17	((r1:1.29726e+06,r4:1.29...	(((r4:0.0154746174895009...
3	355	398	43	5	((r1:1.29726e+06,r4:1.29...	(((r4:0.0154746174895009...
4	398	460	62	12	((r1:1.29726e+06,r4:1.29...	(((r4:0.0154746174895009...
5	460	476	16	5	((r1:1.29726e+06,r4:1.29...	(((r4:0.0154746174895009...
6	476	551	75	11	((r1:1.29726e+06,r4:1.29...	(((r4:0.0154746174895009...
7	551	713	162	35	((r1:1.29726e+06,r4:1.29...	(((r4:0.0154746174895009...
8	713	747	34	2	((r1:1.29726e+06,r4:1.29...	(((r4:0.0154746174895009...
9	747	749	2	0	(r2:6.02091e+06,((r1:1.2...	(((r4:0.0154746174895009...

We can look at the true genealogies for locus 0:

[32]:

# the True genealogies at locus 0
mtre = toytree.mtree(model.df.genealogy[model.df.locus == 0])
mtre.draw_tree_grid(
    start=0,
    nrows=1, ncols=3,
    shared_axis=True,
    edge_type='c',
);

And we can compare these to the inferred tree for this locus:

[33]:

# the inferred genealogy by raxml based on the sequence data
inftre = toytree.tree(model.df.inferred_tree[model.df.locus == 0][0])
inftre.draw(tree_style='c');

Writing to disk¶

We can write a simulated dataframe to a .csv file:

[34]:

# save table to a CSV file
model.df.to_csv("./tree_table.csv")

We can write the simulted loci as separate phylip files to a directory:

[35]:

# write loci as separate phylip files to a directory
model.write_loci_to_phylip(outdir="./tests")

wrote 100 loci (5 x 3000bp) to Users/pmckenz1/googledrive/projects/ipcoal/docs/notebooks/tests/[...].phy

We can concatenate the simulated loci and write this matrix to a single phylip file:

[36]:

# write concatenated loci or snps to a phylip file
model.write_concat_to_phylip(outdir="./tests", name="test.phy")

wrote concatenated loci (5 x 300000bp) to /Users/pmckenz1/googledrive/projects/ipcoal/docs/notebooks/tests/test.phy