Skip to contents

Introduction

This vignette will demonstrate how to use structured coalescent models (Volz 2012) to estimate epidemiological parameters given a pathogen genealogy and discrete trait information about sampled individuals.

Requirements

This tutorial assumes you have the following R packages already installed in your computer.

  • phydynR: implements the coalescent simulation and likelihood function for phylodynamic analyses.
  • bbmle: tools for general likelihood estimation.
  • ggplot2: functions to plot the demographic process.

Load the R packages

The epidemiological model

We will focus on the problem of estimating HIV transmission rates by stage of infection and assuming we know the stage of infection when patients are sampled and virus is sequenced. We will fit a very simple model of an HIV epidemic using ordinary differential equations (ODEs). In this model, the infectious period will be broken into three stages of different average duration and with different transmission rates. The first stage, early HIV infection (EHI) is short (average duration \(1/\gamma_0 = 1\) year), but has high transmission rate \(\beta_0\). The second stage, chronic HIV infection, is long (average duration \(1/\gamma_1 = 7\) years), and has small transmission rate. The last stage, AIDS, lasts \(1/\gamma_2 = 2\) years on average and has an intermediate transmission rate. There are births into the susceptible state at rate \(bN\) where \(N = S + I_0 + I_1 + I_2\). And there is mortality due to natural causes from all states at rate \(\mu\). The parameter values are listed in table 1.

The model equations are:

\(\dot{S} = bN − \mu S − (\beta_0I_0 + \beta_1I_1 + \beta_2I_2)S/N\)

\(\dot{I_0} = (\beta_0I_0 + \beta_1I_1 + \beta_2I_2)S/N − (\mu + \gamma_0)I_0\)

\(\dot{I_1} = \gamma_0I_0 − (\mu + \gamma_1)I_1\)

\(\dot{I_2} = \gamma_1I_1 − (\mu + \gamma_2)I_2\)

Table 1: Parameter symbols and values
Parameter Symbol Values
Duration EHI \(1/\gamma_0\) 1 year
Duration chronic \(1/\gamma_1\) 7 years
Duration AIDS \(1/\gamma_2\) 2 years
Birth rate \(b\) 0.036
Natural death rate \(\mu\) \(1/30\)
EHI transmission rate \(\beta_0\) 1.2
Chronic transmission rate \(\beta_1\) 0.03
AIDS transmission rate \(\beta_2\) 0.09
Initial number of susceptibles \(S(0)\) 3000

Create a list of the true parameter values:

 parms_truth <- list(gamma0 = 1,
                     gamma1 = 1/7,
                     gamma2 = 1/2,
                     mu = 1/30,
                     b = 0.036,
                     beta0 = 12./10,
                     beta1 = 3./100,
                     beta2 = 9./100,
                     S_0 = 3000,
                     I0_0 = 1, 
                     I1_0 = 0.01, 
                     I2_0 = 0.01,
                     m = 3,
                     mm = 1)

In order to fit this model, we need to express the equations in a canonical format. According to this format, we will tally birth and migration events between demes. In our example, the deme corresponds to the stage of infection that an infected host can be in, so we will refer the demes with the following names:

INFECTEDNAMES <- c('I0', 'I1', 'I2')

There are \(m = 3\) demes in this model, so the birth events between demes needs to be expressed with a \(3 × 3\) matrix F. The element \(F_kl\) represents the rate of transmissions by a host in deme \(k\) to a host in deme \(l\). In our example, this is the following:

births <- rbind(c('parms$beta0 * S * I0 / (S + I0 + I1 + I2)', '0', '0'),
                c('parms$beta1 * S * I1 / (S + I0 + I1 + I2)', '0', '0'),
                c('parms$beta2 * S * I2 / (S + I0 + I1 + I2)', '0', '0'))

rownames(births) = colnames(births) <- INFECTEDNAMES

Each element of the matrix is a string that will be parsed as R code and evaluated, so it is important to write it exactly as you would if you were solving the equations outside of phydynR. Also note that the object parms is accessible to these equations, which is a list of parameters - this may include parameters to be estimated. Note that there are zero rates in the 2nd and third columns, since all new infected hosts start out in the first stage of infection (EHI). Also note that we must give row and column names to the matrix, and these names must correspond to the names of the demes.

Similarly, we must create a matrix of migrations:

migrations <- rbind(c('0', 'parms$gamma0 * I0', '0'),
                     c('0', '0', 'parms$gamma1 * I1'),
                     c('0', '0', '0'))

rownames(migrations) = colnames(migrations) <- INFECTEDNAMES

Note that this matrix tallys the stage progression from EHI to chronic and from chronic to AIDS.

We must also write a vector of expressions for events that terminate a lineage – In this model, this occurs due to natural or AIDS related mortality:

deaths <- c('parms$mu*I0',
            'parms$mu*I1',
            'parms$mu*I2 + parms$gamma2 * I2')

names(deaths) <- INFECTEDNAMES

Finally, we must write a vector of ODEs for state variables that do not correspond to demes in the coalescent model. In our example, there is only one such variable- the number of susceptibles:

nonDemeDynamics <- paste(sep = '',
                         '-parms$mu*S + parms$mu*(S + I0 + I1 + I2)',
                         '- S * (parms$beta0*I0+parms$beta1*I1+parms$beta2*I2) / (S + I0 + I1 + I2)')

names(nonDemeDynamics) <- 'S'

Note well that in all cases, the expression or equation must have the corresponding name of the state variable. The model can be fitted to a binary tree with dated tips. We will use a simulated tree such that we know the true parameter values and population size through time. We load this by:

# read the tree
tree <- read.tree(system.file('extdata/hivSimulation.nwk', package='phydynR'))

# the sample times are the same, because it is a homochronous sample at 50 years
sampleTimes <- rep(50, length(tree$tip.label))
names(sampleTimes) <- tree$tip.label

# create a tree with dated tips and internal nodes,
# will infer the sample states from tip labels
bdt <- DatedTree(phylo = tree, 
                 sampleTimes = sampleTimes,
                 minEdgeLength = 0.01)
bdt
#> 
#> Phylogenetic tree with 318 tips and 317 internal nodes.
#> 
#> Tip labels:
#>   89_I1, 349_I2, 339_I1, 64_I1, 203_I1, 30_I0, ...
#> 
#> Rooted; includes branch lengths.

Note well that the vector of sample times must have the names of each taxon. If the state of each taxon (the stage of infection of each sample unit) is encoded at the end of taxon name, it can be loaded automatically by passing the argument sampleStatesAnnotations. Alternatively, that information can be passed as a matrix with row names corresponding to the tip labels of the tree.

After setting up all the components of the mathematical model, we can build the demographic process using the function build.demographic.process from the phydynR package. The dm output can be used as input to coalescent models for the calculation of the likelihood.

dm <- build.demographic.process(births = births,
                                nonDemeDynamics = nonDemeDynamics,
                                migrations = migrations,
                                deaths = deaths,
                                parameterNames = names(parms_truth),
                                rcpp = FALSE,
                                sde = FALSE)

Now we can calculate the likelihood of the tree and see how long it takes by using the system.time function:

print(system.time(print(
  colik(tree = bdt,
        theta = parms_truth,
        demographic.process.model = dm,
        x0 = c(I0 = 1, I1 = 0.01, I2 = 0.01, S = parms_truth$S_0),
        t0 = 0,
        res = 1000,
        integrationMethod = 'euler')
  )))

It is a bit slow, so note that there is an alternative low-level interface that can be a bit faster (colik.pik.fgy). Also note that changing the integrationMethod (choose ‘euler’), maxHeight (only fit to part of the tree) and res (set to a smaller value) options can dramatically speed up the calculation at the cost of some accuracy. We can fit the model by using the bbmle or stats4 package.

Fitting the model

We will focus on estimating the three transmission rates of the system along with a nuisance parameter that controls initial conditions, \(t_0\), which is the time of origin of the epidemic. We will assume prior knowledge of the stage progression rates \(\gamma_i\) mortality rates \(\mu\), and susceptible population size \(S(0)\).

First, create the objective function to be minimized:

obj_fun <- function(lnbeta0, lnbeta1, lnbeta2, t0){
  
  parms <- parms_truth
  parms$beta0 <- exp(lnbeta0)
  parms$beta1 <- exp(lnbeta1)
  parms$beta2 <- exp(lnbeta2)
  
  mll <- -colik(tree = bdt,
                theta = parms,
                demographic.process.model = dm,
                x0 = c(I0 = 1, I1 = 0.01, I2 = 0.01, S = parms_truth$S_0),
                t0 = 0,
                res = 1000,
                integrationMethod = 'rk4')

    # track progress:
  print(c(mll, exp(c(lnbeta0, lnbeta1, lnbeta2) ), t0) )
  mll
}

Note that this uses log-transformation for variables that must be positive (like rates).

We can then fit the model by running

fit <- mle2(obj_fun,
            start = list(lnbeta0 = log(0.6), 
                         lnbeta1 = log(0.2), 
                         lnbeta2 = log(0.05), 
                         t0 = 0),
            method='Nelder-Mead', 
            optimizer='optim',
            control = list(trace = 6, reltol = 1e-8))

Note that we are starting the optimizer far from the true parameter values. If fitting a model to real data, it is recommended to try many different starting conditions over a large range of values. The optimizer would take about 10 minutes to run, so instead we will load the results:

load( system.file('extdata/hivModel0-fit.RData', package='phydynR') )

AIC(fit)
#> [1] 4447.468
logLik(fit)
#> 'log Lik.' -2219.734 (df=4)
coef(fit)
#>    lnbeta0    lnbeta1    lnbeta2         t0 
#>  0.1772788 -3.3862403 -2.9639203 -1.8339507
exp(coef(fit))
#>    lnbeta0    lnbeta1    lnbeta2         t0 
#> 1.19396395 0.03383565 0.05161617 0.15978107

We can compare the fitted model to the true number of infected through time, which is shown in Figure 1

parms <- parms_truth
parms[c('beta0', 'beta1', 'beta2', 't0')] <- unname(exp(coef(fit)))
parms$t0 <- log(parms$t0)
ox <- dm(theta = parms, 
         x0 = c(I0 = 1, I1 =0.01, I2 = 0.01, S = parms_truth$S_0), 
         t0 = exp(coef(fit)['t0']),
         t1 = bdt$maxSampleTime, 
         res = 1e3, 
         integrationMethod='rk4')
o <- ox[[5]] 
times <- o[,1]

real_data <- cbind(rtimes, rY)
rdata <- gather(real_data, "Demes", "I", I0:I2, factor_key=TRUE)

ggplot(rdata, aes(x = rtimes, y = I)) +
  geom_point(aes(col = Demes)) + 
  scale_colour_manual(values=c("black", "red", "green")) +
  geom_line(data = o, aes(x = time, y = I0), col='black') +
  geom_line(data = o, aes(x = time, y = I1), col='red') +
  geom_line(data = o, aes(x = time, y = I2), col='green') +
  theme_bw() +
  theme(element_text(size = 12)) +
  xlab("Time") +
  ylab("Number of infected by stage")
The actual (points) and estimated (lines) number of infections through time. Black: EHI, Red: chronic, Green: AIDS

Figure 1: The actual (points) and estimated (lines) number of infections through time. Black: EHI, Red: chronic, Green: AIDS

We can calculate a confidence interval for the transmission rates using likelihood profiles:

profbeta <- profile(fit, 
                    which = 1, 
                    alpha = 0.05, 
                    std.err = 0.5, 
                    trace = TRUE, 
                    tol.newmin = 1 )

This takes a long time, so we will load the results:

load( system.file('extdata/hivModel0-profbeta.RData', package='phydynR') )

We see that the confidence interval covers the true value:

 c(exp(confint(profbeta )), TrueVal = parms_truth$beta0)
#>    2.5 %   97.5 %  TrueVal 
#> 1.093873 1.243095 1.200000

And, we can visualize the profile (Figure 2).

#fig.asp sets the height-to-width ratio of the figure

plot(profbeta)
abline( v = log( parms_truth$beta0) , col='red')
Likelihood profile for the transmission rate $\beta_0$ with confidence levels. The true parameter value is indicated by the vertical red line.

Figure 2: Likelihood profile for the transmission rate \(\beta_0\) with confidence levels. The true parameter value is indicated by the vertical red line.

References

Volz, Erik M. 2012. Complex Population Dynamics and the Coalescent Under Neutrality.” Genetics 190 (1): 187–201. https://doi.org/10.1534/genetics.111.134627.