Draw UPGMA Phylogenetic Trees

Data visualization is an essential component of research. It allows researchers and the public alike to identify trends and relationships that may be difficult to notice immediately with the numbers alone. Furthermore, it helps to catalyze new ideas and understandings, allowing science to flourish from the collective knowledge of one another. Here, we focus on a niche yet surprisingly fun way to make phylogenetic trees. 

UPGMA, the unweighted pair group method with arithmetic mean, is a method for producing rooted trees from a distance matrix. A distance matrix summarizes how much the nucleotide sequences of each pair of organisms differ; each value is the distance between one pair of organisms.

Before we start constructing trees, it's important to recognize a major assumption of UPGMA: the molecular clock. The molecular clock assumes that all species mutate and change at the same rate, which makes the resulting tree ultrametric (every tip ends up the same distance from the root). This is not completely accurate, since organisms evolve at different rates and UPGMA does not account for that variation. Nevertheless, it is a useful assumption to make, as it allows us to create evolutionary timelines for species that are not well documented in the fossil record.

When given a distance matrix, as illustrated below, we can iterate through the same three steps, slowly constructing our phylogenetic tree as we go, until every organism has been joined into a single tree.

The gray boxes represent repeats or values of zero. They can be ignored when solving the question.

First, identify the smallest distance, or lowest number, in the matrix. This marks the two organisms that are most closely related. Given the matrix below, organisms C and E are the closest, with a distance of 1. Highlight the rows and columns for C and E, which will help us in a later step.
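As a rough sketch of this first step, the snippet below stores the pairwise distances quoted throughout this walkthrough in a dictionary and picks out the smallest one. The matrix figure is not reproduced here, so the placement of 8 (A-B) and 5 (B-D) is an assumption; the name `closest_pair` is also just an illustrative choice.

```python
# Pairwise distances quoted in this walkthrough.
# Assumption: A-B = 8 and B-D = 5 (the text gives the values, not their cells).
distances = {
    ("A", "B"): 8, ("A", "C"): 6, ("A", "D"): 2, ("A", "E"): 6,
    ("B", "C"): 4, ("B", "D"): 5, ("B", "E"): 5,
    ("C", "D"): 7, ("C", "E"): 1,
    ("D", "E"): 3,
}

def closest_pair(dist):
    """Return the pair of labels with the smallest distance, and that distance."""
    pair = min(dist, key=dist.get)
    return pair, dist[pair]

pair, d = closest_pair(distances)
print(pair, d)  # ('C', 'E') 1
```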

The dark green box illustrates the smallest distance, which is used to construct the tree. The light green boxes are the affected values, which must be averaged, and the white boxes are the unaffected values, which can be copied into the next table.

Next, begin drawing the tree by joining the smallest distance pair. The distance from the base to the top should be half the calculated distance. In this case, since C and E have a distance of 1, each of the two branches should have a length of 0.5.

From here, construct a new table that is smaller than the original by one row and one column by joining together the organisms we previously worked with.

For all the unaltered pairs (the values that are not highlighted) simply copy them down into the new table. In this example, the numbers 8, 2, and 5 can be written in the same positions as the first table.

In the remaining empty squares (the new pairs), the new distance will be the average of the two previous lineages, weighted by the size of each lineage. For example, to find the value of CE and A, take the average of the distance between C and A (which is 6) and the distance between E and A (which is also 6); consequently, input 6 into your new table. Repeat this process for B and CE (B and C is 4 while B and E is 5, resulting in an average of 4.5) and D and CE (D and C is 7 while D and E is 3, resulting in an average of 5). At this stage, each lineage has an equal weight in the average because C and E are each a single organism; once the two groups being joined contain different numbers of organisms, we must take the weighted average.
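The averaging for the new CE row can be sketched in a few lines. This assumes the C and E rows are held as plain dictionaries; `merge_singletons` is a hypothetical helper name, and it only covers the equal-weight case where both lineages are single organisms.

```python
# Distances from C and from E to every other organism, as quoted above.
to_C = {"A": 6, "B": 4, "D": 7}
to_E = {"A": 6, "B": 5, "D": 3}

def merge_singletons(row_x, row_y):
    """Average two single-organism rows into the row for their merged cluster."""
    return {other: (row_x[other] + row_y[other]) / 2 for other in row_x}

to_CE = merge_singletons(to_C, to_E)
print(to_CE)  # {'A': 6.0, 'B': 4.5, 'D': 5.0}
```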

Now that you have your new matrix, you can repeat the same steps: identify the smallest distance, highlight the rows and columns, add a line to the phylogenetic tree that is half the distance, then copy down the unaffected values or take the average of the affected ones. Once you create a table with only one value left, use the remaining value as your final distance. 
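For readers who like to check their work, here is a minimal sketch of the whole loop. It is one possible way to code the procedure, not a reference implementation: instead of updating the table row by row, it recomputes each cluster-to-cluster distance as the mean over all organism pairs, which gives the same numbers as the size-weighted average. The A-B = 8 and B-D = 5 placement is an assumption, since the matrix figure is not reproduced here.

```python
from itertools import combinations

def upgma(labels, dist):
    """Run UPGMA and return a list of (cluster_a, cluster_b, height) merges.

    `dist` maps frozenset({x, y}) of leaf labels to their distance.
    Clusters are represented as tuples of leaf labels.
    """
    clusters = [(label,) for label in labels]

    def cluster_distance(a, b):
        # Mean over all leaf-to-leaf pairs between the two clusters.
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    merges = []
    while len(clusters) > 1:
        # Step 1: find the closest pair of clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        # Step 2: join them at half their distance.
        merges.append((a, b, cluster_distance(a, b) / 2))
        # Step 3: shrink the table by replacing a and b with the merged cluster.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Distances quoted in the walkthrough (A-B and B-D placement assumed).
raw = {"AB": 8, "AC": 6, "AD": 2, "AE": 6, "BC": 4,
       "BD": 5, "BE": 5, "CD": 7, "CE": 1, "DE": 3}
dist = {frozenset(pair): value for pair, value in raw.items()}

for a, b, height in upgma("ABCDE", dist):
    print("".join(a), "+", "".join(b), "at height", round(height, 2))
```

Ties are broken by whichever pair `min` meets first, which is an arbitrary choice; the example matrix has no tie at any step's smallest distance.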

Try the next two steps of the practice problem above yourself and scroll through the carousel when you are ready.

The first two hopefully look quite similar to the process demonstrated above. However, when reaching the point where B(CE) and AD have the smallest distance, it may become slightly more confusing.

Consider B, CE, and AD as the three separate units to identify in the table. We can compare AD with CE and AD with B, which correspond to 5.5 and 6.5 respectively. Instead of taking the standard average as before, this is where the weighted average part of UPGMA comes into play. In the previous iterations, the two lineages being joined carried the same weight, but B(CE) is made up of B (one organism) and CE (two organisms), so we must weight each distance by the size of its group. Since CE is the larger group, multiply its distance, 5.5, by 2. Add 6.5, then divide the result by 3.

((5.5 × 2) + (6.5 × 1)) / 3 ≈ 5.83

To find our final distance, divide 5.83 by 2 to get approximately 2.92.
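One way to sanity-check this last step: the size-weighted average should equal the plain mean of all six organism-to-organism distances between {A, D} and {B, C, E}. The sketch below uses the distances quoted earlier; the placement of 8 and 5 on A-B and D-B is an assumption, but swapping them would not change the mean.

```python
# Leaf-to-leaf distances between cluster {A, D} and cluster {B, C, E}.
# Assumption: A-B = 8 and D-B = 5 (swapping them leaves the sum unchanged).
cross = {
    ("A", "B"): 8, ("A", "C"): 6, ("A", "E"): 6,
    ("D", "B"): 5, ("D", "C"): 7, ("D", "E"): 3,
}

mean_distance = sum(cross.values()) / len(cross)

# Size-weighted form from the formula above.
weighted = (5.5 * 2 + 6.5 * 1) / 3

root_height = mean_distance / 2
print(round(mean_distance, 2), round(weighted, 2), round(root_height, 2))
# 5.83 5.83 2.92
```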

When drawing the proportional phylogenetic tree, note that each calculated distance represents only the height of a join measured from the bottom of the tree. For branches that do not start from the bottom, subtract the height of the lower join from the height of the upper join to find their length. When finished, the tree should resemble the drawing shown below:
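The subtraction can be sketched as follows. The join heights are half of the distances used in the walkthrough (1, 2, 4.5, and the final weighted average); the `heights` and `parent` dictionaries are just one illustrative way to write the tree down.

```python
# Height of each join above the bottom of the tree (organisms sit at 0).
heights = {
    "A": 0, "B": 0, "C": 0, "D": 0, "E": 0,
    "CE": 1 / 2,
    "AD": 2 / 2,
    "B(CE)": 4.5 / 2,
    "root": ((5.5 * 2) + (6.5 * 1)) / 3 / 2,
}

# Which join each organism or cluster hangs from.
parent = {
    "C": "CE", "E": "CE",
    "A": "AD", "D": "AD",
    "B": "B(CE)", "CE": "B(CE)",
    "B(CE)": "root", "AD": "root",
}

# Branch length = height of the upper join minus height of the lower node.
branch = {child: round(heights[up] - heights[child], 2)
          for child, up in parent.items()}
print(branch)
# {'C': 0.5, 'E': 0.5, 'A': 1.0, 'D': 1.0, 'B': 2.25,
#  'CE': 1.75, 'B(CE)': 0.67, 'AD': 1.92}
```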

UPGMA questions may come in a variety of forms, such as a difference table, a list of sequences, or even a list of sequences and the Jukes-Cantor model, but finding a way to convert these forms into a distance matrix will allow you to use the method above. For instance, when given a list of sequences and the Jukes-Cantor model, you can identify similarities through sequence alignment, then weigh the differences by the probability that the mutation would have occurred. Construct a distance matrix and now you're on your way to creating phylogenetic trees. With just a little bit of addition, subtraction, time, and knowledge, I hope you are now equipped with all the artillery to combat UPGMA questions headfirst!
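For the Jukes-Cantor case mentioned above, here is a hedged sketch of turning two aligned sequences into a single distance. It uses the standard Jukes-Cantor correction, d = -3/4 · ln(1 - 4p/3), where p is the fraction of sites that differ. The two sequences are made up for illustration, and the sketch assumes they are already aligned with no gaps.

```python
import math

def p_distance(seq1, seq2):
    """Fraction of aligned sites at which the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)

def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed difference fraction p."""
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

# Hypothetical aligned sequences (not from the practice problem).
seq_x = "ACGTACGTAC"
seq_y = "ACGTTCGTAA"

p = p_distance(seq_x, seq_y)
d = jukes_cantor(p)
print(p, round(d, 3))
```

Running this for every pair of sequences fills in the distance matrix, after which the steps above apply unchanged.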
