Update multiple-sequence-alignment.md
incorporating some more edits/wording changes, including issue [applied-bioinformatics#296](applied-bioinformatics#296)
Mike Lee authored Sep 21, 2018
1 parent 9171973 commit 4491be6
Showing 1 changed file with 7 additions and 7 deletions.
14 changes: 7 additions & 7 deletions book/fundamentals/multiple-sequence-alignment.md
@@ -106,15 +106,15 @@ You probably have two burning questions in your mind right now:

We'll explore both of those throughout the rest of this notebook. First, let's cover the process of progressive multiple sequence alignment, just assuming for a moment that we know how to do both of those things.

-The process of progressive multiple sequence alignment could look like the following. First, we start with some sequences and a tree representing the relationship between those sequences. We'll call this our guide tree, because it's going to guide us through the process of multiple sequence alignment. In progressive multiple sequence alignment, we build a multiple sequence alignment for each internal node of the tree, where the alignment at a given internal node contains all of the sequences in the clade defined by that node.
+The process of progressive multiple sequence alignment could look like the following. First, we start with some sequences and a tree representing the relationship between those sequences. We'll call this our guide tree, because it's going to guide us through the process of multiple sequence alignment. In progressive multiple sequence alignment, we build a multiple sequence alignment for each internal node of the tree, where the alignment at a given internal node contains all of the sequences in the clade defined by that node. Let's see what this looks like with this example tree:

<img src="https://raw.githubusercontent.com/gregcaporaso/An-Introduction-To-Applied-Bioinformatics/master/book/fundamentals/images/msa-tree-input.png">

-Starting from the root node, descend the bottom branch of the tree until you get to an internal node. If an alignment hasn't been constructed for that node yet, continue descending the tree until to get to a pair of nodes. In this case, we follow the two branches to the tips. We then align the sequences at that pair of tips (usually with Needleman-Wunsch, for multiple sequence alignment), and assign that alignment to the node connecting those tips.
+Starting from the root node (at the far left), we will follow along a branch of the tree until we get to an internal node. If an alignment hasn't been constructed for that node yet, and it holds more than 2 sequences, we would continue moving along one of the branches of that node. In this case, if we follow the bottom branch from the root we run into a node holding only sequences "s4" and "s5". We would then align those sequences at that pair of tips (usually with Needleman-Wunsch, for multiple sequence alignment), and assign that alignment to the node connecting those tips.

<img src="https://raw.githubusercontent.com/gregcaporaso/An-Introduction-To-Applied-Bioinformatics/master/book/fundamentals/images/msa-tree-a1.png">

-Next, we want to find what to align the resulting alignment to, so start from the root node and descend the top branch of the tree. When you get to the next node, determine if an alignment has already been created for that node. If not, our job is to build that alignment so we have something to align against. In this case, that means that we need to align `s1`, `s2`, and `s3`. We can achieve this by aligning `s1` and `s3` first, to get the alignment at the internal node connecting them.
+Next, we want to find what to align that resulting alignment to, so let's start from the root node again and move along the top branch of the tree. When we get to the next node, we determine if an alignment has already been created for that node, and if not, our job is to build that alignment so we have something to align against. In this case, that means that we need to align `s1`, `s2`, and `s3`. We can achieve this by aligning `s1` and `s3` first, to get the alignment at the internal node connecting them.

<img src="https://raw.githubusercontent.com/gregcaporaso/An-Introduction-To-Applied-Bioinformatics/master/book/fundamentals/images/msa-tree-a2.png">
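The order of the steps described so far can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: the guide tree is written by hand as nested tuples to match the example figure, `progressive_order` is a hypothetical helper, and the "alignment" at each node is just a record of which sequences it would contain rather than a real Needleman-Wunsch alignment.

```python
# Hypothetical guide tree as nested tuples; tips are sequence ids.
# It mirrors the example: (s1, s3) joins s2 on the top branch,
# and (s4, s5) sits on the bottom branch.
guide_tree = ((("s1", "s3"), "s2"), ("s4", "s5"))


def progressive_order(node, steps):
    """Return the ids in the clade under `node`, logging each alignment built."""
    if isinstance(node, str):
        return [node]  # a tip holds a single, unaligned sequence
    top, bottom = node
    # Visit the bottom branch first, as in the walkthrough above.
    bottom_ids = progressive_order(bottom, steps)
    top_ids = progressive_order(top, steps)
    clade = top_ids + bottom_ids
    # Stand-in for really aligning the two child alignments/sequences.
    steps.append(tuple(clade))
    return clade


steps = []
progressive_order(guide_tree, steps)
for step in steps:
    print("align:", " + ".join(step))
```

Each internal node is only "aligned" after both of its children are, so the last entry logged is the root, which holds all five sequences.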

@@ -132,7 +132,7 @@ The alignment at the root node is our multiple sequence alignment.

Let's address the first of our outstanding questions. I mentioned above that *we need an alignment to build a good tree*. The key word here is *good*. We can actually build a very rough tree - one that we would never want to present as representing any actual phylogenetic relationships between the sequences in question - without first aligning the sequences. One way to do this is using hierarchical clustering to build an [UPGMA](https://en.wikipedia.org/wiki/UPGMA) (**U**nweighted **P**air **G**roup **M**ethod with **A**rithmetic-mean) tree, requiring only a distance matrix as input. So if we can find a non-alignment-dependent way to compute distances between the sequences, we can build a rough UPGMA tree from them.
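To make the "only a distance matrix" point concrete, here is a minimal pure-Python sketch of UPGMA. It is not the implementation this notebook uses (the notebook relies on library clustering routines); the `upgma` function, the nested-tuple tree, and the toy distances are all assumptions made for illustration. At each step it merges the two clusters with the smallest mean pairwise distance between their members.

```python
from itertools import combinations


def upgma(dist):
    """Build a rough UPGMA tree (nested tuples) from {(a, b): distance}."""

    def d(a, b):
        # Distances are symmetric, so accept either key order.
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    def mean_distance(pair):
        # UPGMA cluster distance: arithmetic mean over all member pairs.
        (_, m1), (_, m2) = pair
        return sum(d(a, b) for a in m1 for b in m2) / (len(m1) * len(m2))

    ids = sorted({x for pair in dist for x in pair})
    clusters = [(i, [i]) for i in ids]  # (subtree, member ids)
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=mean_distance)
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


# Made-up distances, purely for illustration.
toy_distances = {("s1", "s2"): 0.2, ("s1", "s3"): 0.6, ("s2", "s3"): 0.5}
print(upgma(toy_distances))
```

Nothing here needs aligned sequences: any function that turns a pair of sequences into a number can supply the distances.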

-Let's compute distances between the sequences based on their *word* composition. We'll define a *word* here as `k` adjacent characters in the sequence. We can then define a function that will return all of the words in a sequence as follows. These words can be defined as being overlapping, or non-overlapping.
+Let's compute distances between the sequences based on their *kmer* composition – a "kmer" refers to `k` adjacent characters in a sequence, and these can be overlapping or non-overlapping. To get a feel for this, let's look at a function that returns all of the (overlapping) kmers of a specified size in a sequence and then try it out.

```python
>>> from skbio import DNA
@@ -160,9 +160,9 @@ And here are the kmers of size 3 without allowing overlap:
... print(e)
```
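For readers who want to see the overlapping and non-overlapping cases side by side without any library, here is a plain-Python sketch. The `kmers` helper and the example sequence are hypothetical and work on ordinary strings; this is not the scikit-bio code used in the notebook.

```python
def kmers(seq, k, overlap=True):
    """Return the kmers of size k in seq, in order of appearance."""
    # Overlapping kmers start at every position; non-overlapping
    # kmers start every k positions.
    step = 1 if overlap else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


seq = "ACCGGTGACC"  # made-up example sequence
print(kmers(seq, 3))                 # overlapping
print(kmers(seq, 3, overlap=False))  # non-overlapping
```

For this 10-character sequence, allowing overlap gives eight kmers of size 3, while the non-overlapping version gives only three.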

-We'll go with overlapping for this example, as the more words we have, the better our guide tree should be.
+We'll go with overlapping for this example, as the more kmers we have, the better our guide tree should be.

-So now when we have two sequences, we can compute the word counts for each and define a distance between them as the fraction of total words that are unique to either sequence.
+So now when we have two sequences, we can compute the kmer counts for each and define a distance between them as the fraction of total kmers that are unique to either sequence.
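One way to read that definition is sketched below on plain strings. This is an assumption about how such a distance could be computed, not the `iab` implementation: `kmer_distance_sketch` is a hypothetical name, and it treats each sequence as a set of distinct overlapping kmers.

```python
def kmer_distance_sketch(seq1, seq2, k=3):
    """Fraction of distinct kmers that occur in only one of the sequences."""
    kmers1 = {seq1[i:i + k] for i in range(len(seq1) - k + 1)}
    kmers2 = {seq2[i:i + k] for i in range(len(seq2) - k + 1)}
    all_kmers = kmers1 | kmers2
    if not all_kmers:
        return 0.0  # both sequences are shorter than k
    unique_kmers = kmers1 ^ kmers2  # present in exactly one sequence
    return len(unique_kmers) / len(all_kmers)


# Made-up sequences that differ at a single position.
print(kmer_distance_sketch("ACCGGTGACC", "ACCGGTAACC"))
print(kmer_distance_sketch("ACCGGTGACC", "ACCGGTGACC"))
```

Identical sequences get a distance of 0, and sequences that share no kmers get a distance of 1, so the value behaves like a rough, alignment-free dissimilarity.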

```python
>>> from iab.algorithms import kmer_distance
@@ -361,7 +361,7 @@ We can now build a (hopefully) improved tree from our multiple sequence alignmen
>>> fig = msa_dm.plot(cmap='Greens')
```

-The UPGMA tree that result from the initial (not-aligned) sequences is very different from the UPGMA tree that results from the multiple sequence alignment. First we'll look at the intial guide tree, and then the tree resulting from the progressive multiple sequence alignment.
+The UPGMA tree that results from the initial (not-aligned) sequences is very different from the UPGMA tree that results from the multiple sequence alignment. First we'll look at the initial guide tree, and then the tree resulting from the progressive multiple sequence alignment.

```python
>>> d = dendrogram(guide_lm, labels=guide_dm.ids, orientation='right',