Tutorial¶

Optional: click here to run this tutorial interactively in the cloud on a free service called binder. This allows you to test out ipcoal without having to even install it.

Getting started¶

Welcome to the quick guide tutorial for ipcoal. This page is intended to introduce major concepts of coalescent simulation, and to provide a short overview of several types of analyses that can be performed with ipcoal.

ipcoal is designed for use within jupyter notebooks to make it easy to interactively perform analyses alongside visualization tools that make it easy to verify your results. The Python package toytree is installed alongside ipcoal and should typically be imported with it, like below, as the two are intended to work hand in hand.

[1]:

import ipcoal
import toytree

The species tree¶

The species tree is the primary model on which ipcoal is designed to simulate genealogies and sequences within the multispecies coalescent framework. One of the primary features of ipcoal is the ability to feed it a tree which it will then parse to build a demographic model (which is used by the msprime coalescent simulator), and which describes when and how different populations (lineages) are able to coalesce with each other. You can think of coalescence on a species tree as several distinct coalescent processes occurring within panmictic populations that are simply connected to each other by the tree structure (See Degnan and Rosenberg 2009 for a nice description).

To simulate genealogies and sequences on a tree we need to first define the tree. This can be done by either loading an inferred tree from a newick string or by generating a random tree. For this we will use the tree manipulation and visualization library toytree. In the example cell below I use toytree to generate a random tree with a set number of tips, a total tree height, and a random seed, and store it as a variable named tree1.

We can visualize this tree by calling .draw() from the toytree and here I provide the argument tree_style='p' to set a style for drawing the figure that will make it look nice for representing ‘population trees’ (i.e., species trees). This is helpful in that it provides numeric labels on the nodes of the tree, shows tip labels, and provides a scalebar for the height of the tree.

Note: Branch lengths on species trees should be in units of generations.

Please see the FAQs for tips on translating branch lengths on empirical trees from absolute time to generations.

[2]:

# generate a random tree with 8 tips and height of 1M generations
tree1 = toytree.rtree.unittree(8, treeheight=1e6, seed=123)

[3]:

# draw tree showing idx labels
tree1.draw(tree_style='p');

The demographic model¶

ipcoal is an object-oriented library of which the main object users interact with is called a Model class object. To create a Model you must provide a number of parameter arguments to ipcoal.Model() which then returns a parameterized Model object that you can store as a variable. By convention, I typically name it something with “model” in its name.

From the Model object you can then call functions to simulate genealogies and sequences, and similarly, the results of simulations are also accessed directly from Model objects.

Here we will start by setting up a simple Model for simulating coalescent genealogies on the species tree plotted above. Because this is our first model we will keep it simple and apply a single effective population size (Ne) to the entire tree, and leave all other parameter arguments at their default values. We will see further on in this tutorial the effect of varying some of these parameters.

[4]:

# initialize a model object given a species tree and Ne setting.
model = ipcoal.Model(tree=tree1, Ne=1e5)

Note: Explore the ipcoal code using interactivity:

One of the benefits of working interactively in a jupyter notebook is that you can easily explore the functionality of your objects. To see a list of all of the attributes and functions that are accessible from a Model class object simply type model.<tab> in a code cell below and place your cursor after the dot and press the tab key (or hold shift and press tab). Do not write the word <tab>. This will raise a pop-up window next to your cursor listing options associated with Model objects.

Simulating genealogies¶

An important distinction that we highlight in ipcoal is whether you are simulating linked or unlinked data. Unlinked data represents independent draws from a distribution, whereas linked data represents correlated draws, meaning that the next data point is influenced by the preceding one.

In the context of a genome we expect that regions located on different chromosomes are independent of each other, whereas sites that are located close together on the same chromosome are not independent. In the context of the coalescent, the correlation among nearby regions of the genome represents that one or more samples shares the same ancestors in both regions. Recombination causes this similarity to decay since it has the effect of causing different genomic regions to trace back to different sampled ancestors. The ability to simulate correlated tree sequences over large genomic regions – enabled by the msprime package that ipcoal uses under the hood – has opened many new opportunities for studying genome-wide genealogical patterns.

There are instances where we may be interested in simulating linked data to study the effect of recombination, or alternatively, we may sometimes wish to simulate unlinked data. Many population genetic and phylogenetic inference tools assume that data are unlinked. One useful application of ipcoal is to generate linked and unlinked datasets to explore the effect of linkage on analytical results.

Simulating (unlinked) genealogies¶

Let’s start our simulations by focusing first only on genealogies (we’ll add sequence simulations later.) To simulate just genealogies – the fastest type of simulation in ipcoal – you can use the sim_trees() function call. This takes two arguments, nloci and nsites. In ipcoal we always treat loci as being independent of one another. You can think of them as separate chromosomes. The length of each locus is represented by some number of sites. To simulate completely unlinked genealogies we can request the genealogy from a single site (nsites=1) from multiple independent loci.

[5]:

# simulate unlinked genealogies
unlinked = ipcoal.Model(tree=tree1, Ne=1e5, seed=1234)
unlinked.sim_trees(nloci=10, nsites=1)

Note: Why do you need to specify nsites when simulating genealogies?

If the simulation includes recombination (which it does by default) then a single locus extending over more than one site may actually represent multiple coalescent genealogies if a recombination crossover occurred in the history of the samples at that locus. See the linked genealogy example below.

Simulating (linked) genealogies¶

To better understand the difference between linked and unlinked genealogies let’s also produce a set of linked genealogies. To do this we simply need to call sim_trees() with nsites set >1, and to use a Model object that includes recombination. Our Model object actually already includes recombination since the default parameter setting is recomb=1e-9. Here I write it out just to be more explicit. I will simulate one long locus on which we expect multiple genealogies will be represented.

[6]:

# simulate linked genealogies
linked = ipcoal.Model(tree=tree1, Ne=1e5, recomb=1e-9, seed=1234)
linked.sim_trees(nloci=1, nsites=100000)

The results dataframe (.df)¶

The genealogical results of a simulation call are available from a Model object as a table (a Pandas DataFrame), that can be accessed from its .df attribute. Above we created two Model objects, called linked and unlinked, each of which has its own table of results. Below I show the first ten trees in each dataset.

[7]:

# show unlinked genealogies
unlinked.df.head(10)

[7]:

	locus	end	nbps	genealogy
0	0	1	1	((r1:859987.773833326529...
1	1	1	1	(((r6:637627.30935939750...
2	2	1	1	(((r6:588489.22426471975...
3	3	1	1	((r1:815025.471483044326...
4	4	1	1	((r1:932309.656962803215...
5	5	1	1	((r1:812246.912016004906...
6	6	1	1	(((r7:586937.86382025317...
7	7	1	1	((r7:860870.744937030365...
8	8	1	1	(((r7:561111.41515366919...
9	9	1	1	(((r5:412543.47529452794...

[8]:

# show linked genealogies
linked.df.head(10)

[8]:

	start	end	nbps	tidx	genealogy
0	0	44	44	0	((r1:871079.646908078808...
1	44	193	149	1	((r2:1002155.44499329023...
2	193	222	29	2	((r1:956116.170226565212...
3	222	497	275	3	((r1:956116.170226565212...
4	497	983	486	4	((r1:956116.170226565212...
5	983	1230	247	5	((r1:956116.170226565212...
6	1230	2350	1120	6	((r1:956116.170226565212...
7	2350	2498	148	7	((r2:1002155.44499329023...
8	2498	3094	596	8	((r1:937942.685911775333...
9	3094	3757	663	9	((r1:937942.685911775333...

What do these tables show? You can see in the unlinked results table that 10 different loci are represented, numbered 0-9 in the “locus” column. Each locus is represented by only a single site, stretching from start=0 to end=1. Each is 1bp in length and contains no SNPs since we have not simulated sequence data yet, only genealogies. A column labeling the tree index (tidx) for each row shows that all are labeled 0, meaning that each genealogy is the first (and here only) tree simulated in that locus. Finally, the results of greatest interest are in the final column, genealogy, which contains newick strings.

Now look at the linked results table. In contrast to the previous table we see that now all of the data are on a single locus (locus=0). The first genealogy stretches from position 0 (start=0) to position 5 (end=5) and it is 5bp in length. Following down the table we can see that recombination has broken this locus into many small chunks each represented by a different sized chunk of the locus, and by a slightly different genealogy. Each genealogy has a different tree index (tidx) indicating their order along the locus.

Of course it is hard to tell from the table how different these genealogies are. The next step is to use visualization tools and statistical analyses to compare trees.

Drawing trees¶

See the toytree documentation for more examples of methods and arguments for drawing and styling species trees, gene trees, or genealogies either individually or together on a canvas.

Here I will focus on how different parameterizations of the Model objects lead to different distributions of genealogies. Below I use the toytree.mtree (multitree) function to load the entire list of trees from each object’s .df table. Then from each multitree object I call the draw_tree_grid() function to draw the first few trees in each list with a set of styling options defined in a dictionary. This is a quick way to visualize the variation among a few simulated trees. Do not worry if it seems complicated for now, this is just for the purposes of visualizing the genealogies.

[13]:

# a dictionary of arguments to style the drawings
kwargs = {
    "ts": "c",
    "tip_labels": True,
    "shared_axis": True,
    "width": 600,
    "height": 200,
    "node_sizes": 6,
}

# draw a grid of trees from model 1
toytree.mtree(unlinked.df.genealogy).draw_tree_grid(**kwargs);

# draw a grid of trees from model 2
toytree.mtree(linked.df.genealogy).draw_tree_grid(**kwargs);

Why do the distributions of unlinked and linked trees above look so different? Because of the effect of linkage, of course. The unlinked trees in the top row represent the full range of genealogical variation across the genome. Each is an independent draw. The linked trees in the lower row, by contrast, represent one single draw from the distribution of genealogies, followed by a series of linked genealogies that differ from the initial tree by small changes in one or a few ancestral coalescences.

If you focus on the tip names you can see that the topology changes frequently among the unlinked trees (top row), but that it only changes slightly among the linked trees (bottom row), or sometimes stays the same and only the edge lengths change. The linked trees change in an ordered way, with trees that are closer together being more similar – they are spatially correlated. This is the underlying cause of linkage disequilibrium.

Demographic parameters (Ne)¶

Now that we understand how linkage works let’s explore other parameters of the demographic model, starting with the effective population size, Ne. Below I create two models with different Ne values and simulate 10 independent genealogies on each.

[14]:

# create two models that differ in Ne
model1 = ipcoal.Model(tree=tree1, Ne=1e4)
model2 = ipcoal.Model(tree=tree1, Ne=1e6)

[15]:

# simulate n genealogies for each model
model1.sim_trees(10)
model2.sim_trees(10)

Once again, we will use tree drawings to visualize the effect of Ne on the genealogical variation. In the plot below the first row corresponds to trees simulated from model 1, where Ne=1e4, and the second row corresonds to trees simulated from model 2, where Ne=1e6.

As we should expect, when Ne is small the coalescent events occur more quickly and in this case the genealogies all match the species tree. When Ne is larger we instead see much deeper coalescent times (note the y-axis differences in numbers of generations) and the topology is often different from the species tree since the coalescent times trace back deeper than most species tree divergence events.

This plot demonstrates nicely the importance of Ne in determing genealogical variation: both datasets were simulated on the same species tree but with different Ne.

[16]:

# draw a grid of trees from model 1
toytree.mtree(model1.df.genealogy).draw_tree_grid(**kwargs);

# draw a grid of trees from model 2
toytree.mtree(model2.df.genealogy).draw_tree_grid(**kwargs);

Demographic parameters (admixture)¶

Another demographic parameter of interest is admixture. We can model admixture either as a pulsed migration event, or as a protracted migration rate. A strength of using ipcoal and toytree is that it is very easy to understand the direction of introgression, which in some software tools can be very tricky to make sense of since it is sometimes referenced backwards in time, and other times forwards in time.

Defining an admixture event¶

In both ipcoal and toytree you define an admixture event with a 4-part tuple of arguments. This designates the (source, destination, time, rate) for pulsed migration. Additionally, you can specify an interval over which migration occurs at a uniform rate, i.e. (source, dest, (time_min, time_max), rate), and specifying admixture_type="interval". In simulations you will need to provide all four arguments, but for tree drawings you only need to provide the first two and the latter two will be automated to make it look nice. So let’s try first to draw an admixture edge.

This is done by supplying the event tuple to the admixture_edges argument. Here the source and destination refer to the index (idx) numeric labels of each node (which you find by drawing the tree). You can see that the admixture edge is not simply a horizontal line, but it follows the edges towards the tips of the source node and towards the ancestors from the destination node.

It is clear from this drawing that (3, 8) in the admixture_edges argument here refers to introgression from 3 to 8 backwards in time; meaning that alleles that arose in the ancestor of node 8 may have introgressed into lineage 3 at the present.

[18]:

tree1.draw(ts='p', admixture_edges=[(3, 8)]);

Now let’s define an admixture event to an ipcoal Model object. The important thing here is that ipcoal can accept arguments for the time and proportion of admixture both as float values (proportions). For example, if you want an admixture pulse to occur at the midpoint of the edge shared by nodes 3 and 8, then the time argument should be 0.5. If you want 15% probability of sampling an ancestor from another lineage at the admixture pulse then the admixture proportion should be 0.15.

[19]:

# a Model with admixture from 3-8
amodel = ipcoal.Model(
    tree=tree1,
    Ne=1e4,
    admixture_edges=[(3, 8, 0.5, 0.15)],
)

[20]:

# simulate 100 unlinked genealogies under this model
amodel.sim_trees(100)

Here I use another plotting option of the multitree object from toytree to plot a cloud of overlapping trees. We simulated the data with a small Ne so that the only discordance we see is due to admixture. The admixture argument we used above describes gene flow (viewed backwards in time) from node 3 to 8, which when looking at the tip names of the tree means introgression from species “r2” into the ancestor of “r5” and “r0”. This is exactly what we see in the simulated genealogies below. Similarly you can see that when “r7” does not coalescent with “r2” it traces back to the ancestor of “r4” and “r6”.

[21]:

# visualize discordance in a cloud tree
mtre = toytree.mtree(amodel.df.genealogy)
mtre.draw_cloud_tree(
    layout='d',
    html=True,
    fixed_order=tree1.get_tip_labels(),
);

Simulating sequences¶

Sequences are simulated on genealogies to produce a sequence alignment stretching over the length of a locus. Because as we learned above a locus can be represented by multiple distinct genealogies in the presence of recombination, sequence data simulated over a locus of length N may in fact represent sequence data simulated over genealogy A for part of its length, and genealogy B for another part of its length. Thus the concept of simulating linked or unlinked genealogies extends naturally to the simulation of linked or unlinked sequence data.

We make one additional distinction in the terminology of simulating sequence data which is between simulating loci and simulating SNPs. Simulation of SNPs is a special case where the simulation is conditioned on returning variable sites, and thus it could potentially run forever waiting for a substitution depending on the parameter settings. Below we describe this distinction between the two sequence simulation functions .sim_loci() and .sim_snps().

Substitution models¶

A major shortcoming of using msprime to study deeper-scale (e.g., phylogenetic scale) processes is that it only supports an infinite-sites mutation model, which is really only appropriate for very short evolutionary time scales. This model does not allow a site to mutate more than once, which disallows convergence (two lineages evolving the same site independently) as well as unonbserved changes (a site changing from the ancestral state to another base and then changing again). Multiple mutations to the same site is not only common in phylogenetics, but it is one of the most important processes to model.

A typical workflow to accomodate more complex substitution models has been to simulate genealogies with a coalescent simulator and then pipe these data to a command-line program called seq-gen. Our approach in ipcoal is aimed to simplify this pipeline. We harness the powerful genealogical simulations from msprime but simulate sequence data on these genealogies using our own sequence simulator written in Python (using just-in-time compiled code, which makes it very fast).

Again, to try to make model setup very transparent we provide a function to Model objects called .get_substitution_model_summary() which as its name suggests will print a summary of the substitution model that will be applied to the data. By default we use the Jukes-Cantor model, but you can set more complex models as well, like the HKY model below.

[22]:

# the default substitution model is Jukes-Cantor
modelJC = ipcoal.Model(tree1)
modelJC.get_substitution_model_summary()

state_frequencies:
    A     C     G     T
 0.25  0.25  0.25  0.25

kappa: 1.0
ts/tv: 0.5

instantaneous transition rate matrix:
        A       C       G       T
A -1.0000  0.3333  0.3333  0.3333
C  0.3333 -1.0000  0.3333  0.3333
G  0.3333  0.3333 -1.0000  0.3333
T  0.3333  0.3333  0.3333 -1.0000

You can set the substitution model parameters using the substitution_model argument which accepts a dictionary of arguments. We currently support state_frequencies and kappa, which together allows implementing JC, F81, K2P, or HKY. We plan to expand on this soon.

[23]:

# the HKY model can be implemented by setting substitution params
modelHKY = ipcoal.Model(
    tree=tree1,
    substitution_model={
        "state_frequencies": (0.3, 0.2, 0.3, 0.2),
        "kappa": 1.25,
    },
)
modelHKY.get_substitution_model_summary()

state_frequencies:
   A    C    G    T
 0.3  0.2  0.3  0.2

kappa: 1.25
ts/tv: 0.6770833333333334

instantaneous transition rate matrix:
        A       C       G       T
A -0.9627  0.2484  0.4658  0.2484
C  0.3727 -1.0559  0.3727  0.3106
G  0.4658  0.2484 -0.9627  0.2484
T  0.3727  0.3106  0.3727 -1.0559

Simulating (linked) sequences¶

The .sim_loci() function is used for simulating linked sequences, either with or without recombination. In the absence of recombination all data in the locus will be simulated on a single genealogy, whereas with recombination sequence data will be simulated on the tree sequence of genealogies and concatenated to produce a sequence alignment.

[24]:

# create a Model and simulate a 1Kb locus
model = ipcoal.Model(tree1, Ne=1e5, recomb=1e-9)
model.sim_loci(nloci=1, nsites=1000)

[25]:

# the results table
model.df

[25]:

	start	end	nbps	nsnps	tidx	genealogy
0	0	478	478	28	0	((r7:863525,(r2:799043,(...
1	478	855	377	23	1	((r7:863525,(r2:799043,(...
2	855	1000	145	9	2	((r3:935172,(r1:892409,(...

The sequence array¶

When one or more loci are simulated the results are stored as a multidimensional numpy array (((nloci, nsamples, nsite)) to the Model object accessible as .seqs. The sequences are ordered by the tip names on the species tree in alphanumeric order. The data are stored in numeric form which is a convenience for operations. Users will not typically need to interact with the raw sequence data in this form, but will instead either write it to a file (in which case it is tranformed to A,C,G,Ts), or infer trees from it directly using the ipcoal built-in inference functions.

[21]:

# the results sequence array
model.seqs

[21]:

array([[[3, 0, 0, ..., 2, 1, 2],
        [3, 0, 0, ..., 2, 1, 2],
        [3, 0, 0, ..., 2, 1, 2],
        ...,
        [3, 0, 0, ..., 2, 1, 2],
        [3, 0, 0, ..., 2, 1, 2],
        [3, 0, 0, ..., 2, 1, 2]]], dtype=uint8)

Writing sequence data to file¶

We provide two options for writing sequence data files either in PHYLIP or NEXUS format. These either write the loci individually as separate files to a shared folder, or concatenate all or some loci to a single alignment and write it to a file.

[29]:

# simulate 10 loci each 500bp in length
model.sim_loci(nloci=10, nsites=500)

[30]:

# write all loci concatenated as a single locus
model.write_concat_to_phylip(outdir="/tmp/", name="test.phy")

# or, concatenate and write only some loci
model.write_concat_to_phylip(
    outdir="/tmp", name="locs-1-3.phy", idxs=[1,2,3]
)

wrote concat loci (8 x 5000bp) to /tmp/test.phy
wrote concat loci (8 x 1500bp) to /tmp/locs-1-3.phy

[31]:

# write all individual loci to a directory
model.write_loci_to_phylip(
    outdir="/tmp",
    idxs=[1,2,3],
    name_prefix="locus",
    name_suffix="model",
)

wrote 3 loci (8 x 500bp) to /tmp/[...].phy

Simulating unlinked SNPs¶

Simulating unlinked SNPs efficiently requires a slightly different approach than simply generating sequence data regardless of whether or not variation is present. ipcoal makes an important distinction in its implementation from the workflow that is commonly performed. Instead of generating a sample of genealogies and then forcing a SNP onto each of these trees, we instead cycle through an infinite sample of genealogies and in each one attempt to evolve a SNP on the genealogy probabilistically based on its edge lengths (coalescent times). In contrast to the forced approach, shorter trees will produce a SNP less often than taller trees, and this will be reflected in the final distribution of genetic variation. In this way ipcoal more accurately simulates an expected distribution of unlinked SNPs.

Of the simulation functions in ipcoal this makes .sim_snps() the slowest function, but it is still generally quite fast. Once you simulate SNPs the results can be processed (e.g., written to file, analyzed as a table, etc) similar to those of .sim_loci(), shown above.

[32]:

model.sim_snps(100)
model.write_concat_to_phylip(outdir="/tmp", name="test.snps")

wrote concat loci (8 x 100bp) to /tmp/test.snps

Inferring gene trees from sequences¶

To infer a gene tree at every locus in a dataset simply call the Model function .infer_gene_trees(). This will concatenate data at every locus and send the alignment to a phylogenetic inference program (default is raxml) and store the results in a new column of the results table labeled “inferred_tree”. Take note that only one tree is inferred per locus, not per genealogical block, so the same inferred tree is represented in multiple rows on the results table.

[34]:

# create a Model object with 3 samples per species
model = ipcoal.Model(tree1, Ne=1e4, nsamples=3, recomb=1e-9, seed=123)

# simulate genealogies and sequences at 10 unlinked loci
model.sim_loci(nloci=10, nsites=1000)

# infer gene trees for each locus
model.infer_gene_trees(inference_method="raxml")

# show the results table
model.df

[34]:

	locus	start	end	nbps	nsnps	tidx	genealogy	inferred_tree
0	0	0	973	973	47	0	(((r1-1:29663.7,(r1-0:64...	((r0-1:0.000001000000500...
1	0	973	1000	27	0	1	(((r1-1:29663.7,(r1-0:64...	((r0-1:0.000001000000500...
2	1	0	267	267	13	0	((((r2-1:30468.8,(r2-0:1...	(r0-1:0.0000010000005000...
3	1	267	345	78	2	1	((((r2-1:30468.8,(r2-0:1...	(r0-1:0.0000010000005000...
4	1	345	1000	655	35	2	((((r2-1:30468.8,(r2-0:1...	(r0-1:0.0000010000005000...
5	2	0	1000	1000	51	0	((((r4-2:1711.3,(r4-0:55...	(r0-1:0.0000010000005000...
6	3	0	1000	1000	60	0	((((r7-2:4402.41,(r7-0:8...	((r0-2:0.000001000000500...
7	4	0	1000	1000	69	0	((((r4-2:22175.7,(r4-0:6...	((r0-1:0.000001000000500...
8	5	0	1000	1000	52	0	((((r7-1:43308.4,(r7-0:1...	((((r5-1:0.0000010000005...
9	6	0	26	26	2	0	((((r4-0:6986.09,(r4-1:1...	(r0-2:0.0000010000005000...
10	6	26	923	897	57	1	((((r4-0:6986.09,(r4-1:1...	(r0-2:0.0000010000005000...
11	6	923	1000	77	3	2	((((r4-0:6986.09,(r4-1:1...	(r0-2:0.0000010000005000...
12	7	0	657	657	37	0	((((r4-0:27989.3,(r4-1:3...	((r0-1:0.000001000000500...
13	7	657	1000	343	18	1	((((r4-0:27989.3,(r4-1:3...	((r0-1:0.000001000000500...
14	8	0	1000	1000	46	0	((((r7-2:25034.7,(r7-0:3...	((((((r3-1:0.00000100000...
15	9	0	1000	1000	53	0	(((r1-0:31251.6,(r1-1:88...	(r0-1:0.0000010000005000...

Analysis tools¶

There are many possibilities for further downstream analysis using either the sequence and/or true and inferred trees at linked or unlinked regions across the genome. More coming soon.