Node support values and update of treestructure object with new sequences
Vinicius Franceschi
2025-06-20
Source:vignettes/supportValsUpdate.Rmd
supportValsUpdate.Rmd
Introduction
In this tutorial, we will consider node support values (e.g. bootstrap) on
cluster designations and update previous treestructure
object with new sequences.
This tutorial uses SARS-CoV-2 public data available here
to demonstrate the use of node support values from e.g. parametric bootstrap to
avoid designating population structure in badly supported clades.
This tutorial also demonstrates how to update previous cluster designations
(an existing treestructure
object) using a new rooted maximum likelihood tree
incorporating more sequences. The new tips are added to the cluster which shares
its MRCA (most recent common ancestor).
Workflow to analyse SARS-CoV-2 sequence data
For a step-by-step guide to replicate the complete workflow please see here.
Briefly, we downloaded SARS-CoV-2 public metadata, a treefile, and multiple
sequence alignments from the UShER index. Then we extracted sequences up to the
end of February 2020 (n < 1,000 sequences). We then estimated a maximum
likelihood (ML) tree with IQ-TREE v2.2.2.6 with 1,000
ultrafast bootstraps, a time-scaled tree using
treedater
(strict clock and 0.0008 subst./site/year),
removed outliers by root-to-tip regression,
an then re-estimated the timetree without the outliers (n = 891).
Let’s load the resulting ML tree with bootstrap values:
mltr2_outl_rm <- readRDS( system.file('mltr2_outl_rm_sc2_feb2020.rds',
package='treestructure') )
ggtree::ggtree(mltr2_outl_rm)
And here you can load the timetree:
timetr2_phylo <- readRDS( system.file('timetr2_phylo_sc2_feb2020.rds',
package='treestructure') )
ggtree::ggtree(timetr2_phylo)
Assign clusters without using node support
Firstly, we will assign clusters without using bootstrap support:
trestruct_res_nobt <- trestruct(timetr2_phylo,
minCladeSize = 30,
nodeSupportValues = FALSE,
level = 0.01)
Because treestructure
will take several minutes to run, we can load the results:
trestruct_res_nobt <- readRDS( system.file('trestruct_res_nobt.rds',
package='treestructure') )
plot(trestruct_res_nobt, use_ggtree = T) + ggtree::geom_tippoint()
The treestructure
analyses resulted in 13 clusters.
Assign clusters using bootstrap support
Let’s add the support values to a vector that we will pass to trestruct
:
timetr2_boot <- as.integer(mltr2_outl_rm$node.label)
#> Warning: NAs introduced by coercion
#note that IQ-TREE does not give a node support for the "root" of the tree,
#You, as a user with knowledge of your data, will decide if this should be
#a high or low support value.
timetr2_boot[is.na(timetr2_boot)] <- 95
#show the first 6 node support values.
print(head(timetr2_boot))
#> [1] 95 100 8 3 5 8
Finally, we designate clusters that have at least 80% bootstrap support. This is achieved by setting to 80 the parameter nodeSupportThreshold in the trestruct function.
trestruct_res <- trestruct(timetr2_phylo,
minCladeSize = 30,
nodeSupportValues = timetr2_boot,
nodeSupportThreshold = 80,
level = 0.01)
Because it will take a minute to run treestructure
, we can load the result instead.
trestruct_res <- readRDS( system.file('trestruct_res.rds',
package='treestructure') )
plot(trestruct_res, use_ggtree = T) + ggtree::geom_tippoint()
Now we have only 4 well-supported clusters with differences in coalescent patterns. Note that this might change if you use a higher or lower value for the nodeSupportThreshold in the trestruct function.
Update a previous treestrucuture object with new sequences
To update the previous treestructure
object with new sequences, we extracted
SARS-CoV-2 sequences up to 15 March 2020 (n > 6,712 sequences). We then estimated
a new ML tree including all those sequences as before.
Note that this new tree must be rooted, but does not need to be time-scaled or binary.
#Note that this tree has more sequences than the previous tree used in this
#tutorial.
mltr_addtips <- readRDS( system.file('mltr_addtips_mar2020.rds',
package='treestructure') )
ggtree::ggtree(mltr_addtips)
And without the need to re-estimate a timetree or re-run trestruct
from scratch,
we are now able to add the new sequences to the existing treestructure
object:
trestruct_add_tips <- addtips(trst = trestruct_res, tre = mltr_addtips)
#> Some taxa in *trst* were not found in tre. These will be excluded from the output.
#>
plot(trestruct_add_tips, use_ggtree = T) + ggtree::geom_tippoint()
If you would like to compare the sequence names that comprise each cluster in each tree, you can do:
#compare sequences in cluster 1 from trestruct_res object and the
#trestruct_add_tips object
tree1_cluster1 <- trestruct_res$clusterSets$`1`
tree2_cluster1 <- trestruct_add_tips$clusterSets$`1`
length(tree1_cluster1)
#> [1] 55
length(tree2_cluster1)
#> [1] 728
Note that the length of tree1_cluster1 and tree2_cluster1 is different.
That is because we added tips from the ML tree, mltr_addtips, to the
treestructure
object, trestruct_res.
You can also see that all elements in tree1_cluster1 is contained in tree2_cluster1