In this tutorial we’ll explore Maximum Likelihood and MCMC simulations. This builds on the previous tutorial on custom simulation models, so you’ll want to have a good handle on those models before continuing here. You may also want to look at the bootstrap models, which are also relevant.

The Model

Joint Probabilities

Generalizing from our simpler colonization simulation model, consider colonization of roots by \(k\) taxa at each of \(N\) root intersections. Rather than randomly sampling each location on a root, the probability of colonization can be modelled as the joint probability of \(k\) Bernoulli variables. A simple example is the joint probability of mycorrhizal colonization (\(P_1\)) and pathogen colonization (\(P_2\)). For each pair of taxa, the joint probability is:

\[ P(P_1=n_1 \space , \space P_2=n_2) \]

\[= P(P_2=n_2 | P_1=n_1) \cdot P(P_1=n_1)\]

\[= P(P_1=n_1 | P_2=n_2) \cdot P(P_2=n_2) \]

where \(n_1\) and \(n_2\) are the binary colonization outcomes (0 or 1) for species 1 and 2, respectively.

Now consider the joint probabilities of each of the four possible outcomes:

\[A. \space Both := P(P_1 = 1 , P_2 = 1)\]

\[B. \space Mycorrhizae \space Only := P(P_1 = 1 , P_2 = 0)\]

\[C. \space Pathogens \space Only := P(P_1 = 0 , P_2 = 1)\]

\[D. \space Neither := P(P_1 = 0, P_2 = 0)\]

The elements of the variance-covariance matrix for mycorrhizae (species 1) and pathogens (species 2) are:

\[V_1 = (A + B)(1 - (A + B))\]

\[V_2 = (A + C)(1 - (A + C))\]

\[Cov_{1,2} = A - (A + B)(A + C)\]

The correlation coefficient between species is:

\[\frac {Cov_{1,2}}{\sqrt{V_1 \cdot V_2}} \]

\[= \frac {A - (A + B)(A + C)}{\sqrt {(A + B)(1 - (A + B))(A + C)(1 - (A + C))}}\]

Null Model

First, consider the simplest case where there is no biological interaction between the species.

Let \(p_1\) be the colonization probability for Mycorrhizae

Let \(p_2\) be the colonization probability for Pathogens

If colonization probabilities are independent, then the joint probabilities are:

\[ A. \space Both = p_1p_2 \]

\[ B. \space MycOnly = p_1(1-p_2) \]

\[ C. \space PathOnly = (1-p_1)p_2\]

\[ D. \space Neither = (1-p_1)(1-p_2) \]
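As a quick numerical check, with arbitrary example values \(p_1=0.3\) and \(p_2=0.7\), these four joint probabilities sum to 1 and the covariance formula above gives 0 (to within floating-point rounding) under independence:

p1<-0.3
p2<-0.7
A<-p1*p2; B<-p1*(1-p2); C<-(1-p1)*p2; D<-(1-p1)*(1-p2)
c(Total=A+B+C+D, Cov12=A-(A+B)*(A+C))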

Asymmetric interference model

Now consider an interference model, where each species interferes with and excludes colonization by the other: species 1 excludes species 2 with probability \(\alpha_{1,2}\), and species 2 excludes species 1 with probability \(\alpha_{2,1}\). The joint probabilities are then:

\[ A. \space Both = p_1p_2 (1-\alpha_{1,2}-\alpha_{2,1}) \]

\[ B. \space MycOnly = p_1(1-p_2)+p_1p_2*\alpha_{1,2} \]

\[ C. \space PathOnly = (1-p_1)p_2+p_1p_2*\alpha_{2,1}\]

\[ D. \space Neither = (1-p_1)(1-p_2) \]

Symmetric interference model

For the simulations below, we’re going to assume that \(\alpha_{1,2} = \alpha_{2,1}\). In other words, species interfere and exclude colonization by the other with equal probability \(\alpha\). The reason we do this is to have one less parameter to estimate in the models below.

The joint probabilities for the symmetric interference model are:

\[ A. \space Both = p_1p_2 (1-2*\alpha) \]

\[ B. \space MycOnly = p_1(1-p_2)+p_1p_2*\alpha \]

\[ C. \space PathOnly = (1-p_1)p_2+p_1p_2*\alpha\]

\[ D. \space Neither = (1-p_1)(1-p_2) \]

Estimating parameters

Given that \(A\), \(B\), \(C\) & \(D\) can be estimated from the data, it is easy to use the Null Model to derive equations for estimating parameters \(p_1\) and \(p_2\) from the observed data: under independence \(A + B = p_1\) and \(A + C = p_2\), so the marginal totals give the estimates directly.

It’s more difficult for the Symmetric Interference Model, but once \(p_1\) and \(p_2\) are known we can solve for \(\alpha\) by rearranging the equation for \(A\), which gives \[\alpha=\frac{p_1p_2-A}{2p_1p_2}\]

We can also solve \(p_1\) and \(p_2\) using the quadratic formula:

\[x=\frac{-b\pm \sqrt{b^2-4ac}}{2a}\]

Subtracting \(B-C\) (or \(C-B\)) gives \(p_1-p_2\) (or \(p_2-p_1\)), which lets us eliminate one of the probabilities; substituting into the equation for \(D\) and simplifying yields, for \(p_1\): \[-p^2 + p(B-C+2)+(C+D-B-1) = 0\]

Now substitute into the quadratic formula: \(x = p\), \(a=-1\), \(b=B-C+2\) and \(c=C+D-B-1\), yielding:

\[p=\frac{-(B-C+2) \pm \sqrt{(B-C+2)^2-4(-1)(C+D-B-1)}}{2(-1)}\]

ParmCalc<-function(P1,P2,Alpha){
  # Conditional probabilities
  ## A through D can be estimated from the data
  ## Use these equations to try to solve for p1, p2 & a
  ## For example, B-C or C-B will isolate p1 and p2 only
  A<-P1*P2*(1-2*Alpha)
  B<-P1*(1-P2)+P1*P2*Alpha
  C<-(1-P1)*P2+P1*P2*Alpha
  D<-(1-P1)*(1-P2)
  
  # Normalize so the four joint probabilities sum to 1
  # (here A+B+C+D already equals 1, so this is a no-op; it matters when A-D are observed counts)
  N<-A+B+C+D
  A<-A/N
  B<-B/N
  C<-C/N
  D<-D/N

  # Equations for calculating (co)variances and correlation coefficient  
  V1<-(A+B)*(1-(A+B))
  V2<-(A+C)*(1-(A+C))
  COV12<-A-(A+B)*(A+C)

  COR12<-COV12/(sqrt(V1)*sqrt(V2))
  
  ## Combine A through D to isolate each of these terms
  ## These are derived from the quadratic formula, leaving 2 solutions, one of which is outside the bounds of reasonable probabilities (<0 or >1)
  ### Calculate both...
  p1<-(-(B+2-C)+sqrt(((B+2-C)^2) -(4*(-1)*(C+D-B-1)))) / (2*(-1)) 
  # These are the bad values: (-(B+2-C)-sqrt(((B+2-C)^2) -(4*(-1)*(C+D-B-1)))) / (2*(-1))
  ### Repeat for p2
  # These are the bad values: (-(B-2-C)+sqrt(((B-2-C)^2) -(4*(1)*(C-D-B+1)))) / (2*(1)) 
  p2<-(-(B-2-C)-sqrt(((B-2-C)^2) -(4*(1)*(C-D-B+1)))) / (2*(1))
  # Insert equation for estimating alpha (a)
  a<-(p1*p2-A)/(2*p1*p2)  # Solve from A 
  ## a<-(B-p1+p1*p2)/(p1*p2) # Solve from B
  ## a<-(C-p2+p1*p2)/(p1*p2) # Solve from C
  return(cbind(a,p1,p2))
}

Generate 5,000 random parameter combinations and test whether the estimates recover them correctly:

# Test the model
SimDat<-data.frame(P1=runif(5000),P2=runif(5000),a=runif(5000))
Parm<-ParmCalc(P1=SimDat$P1,P2=SimDat$P2,Alpha=SimDat$a)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
# Each of these 3 plots will be a 1:1 line if the calculations are correct
## Test p1
qplot(SimDat$P1,Parm[,2],alpha=I(0.1))

## Test p2
qplot(SimDat$P2,Parm[,3],alpha=I(0.1))

## Test alpha
qplot(SimDat$a,Parm[,1],alpha=I(0.1))

Probability

Working with real data is a little more complicated. For example, consider if \(p_1=p_2=0.8\) and \(\alpha=0.1\) and we look at 3 cross sections.

We might get species 1 and species 2 in all 3 cross-sections, just by chance.

Make a toy data set to demonstrate:

ToyDat<-data.frame(Myc=c(1,1,1),Path=c(1,1,1))
ToyDat
##   Myc Path
## 1   1    1
## 2   1    1
## 3   1    1

What is the probability of observing the data in ToyDat given \(p_1=p_2=0.8\) and \(\alpha=0.1\)? It’s the probability of A occurring 3 times, or \(P(A) \times P(A) \times P(A)\), which is \(A^3\):

\[(p_1p_2-2\alpha p_1p_2)^3 = (0.8*0.8-2*0.1*0.8*0.8)^3 = 13.4\%\]
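A quick check of that arithmetic in R:

(0.8*0.8*(1-2*0.1))^3
## [1] 0.1342177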

In other words, it’s quite a likely outcome. But what will happen when we calculate \(p_1\) and \(p_2\) from the data?

To calculate the parameters \(p_1\), \(p_2\) and \(\alpha\) from the data, we have to modify the above test script in terms of the observed presence/absence data:

ParmFind<-function(Myc,Path){
  # Conditional probabilities
  ## A through D can be estimated from the data
  ## Use these equations to try to solve for p1, p2 & a
  ## For example, B-C or C-B will isolate p1 and p2 only
  N<-length(Myc)
  
  A<-sum(Myc*Path)/N
  B<-sum(Myc*(1-Path))/N
  C<-sum((1-Myc)*Path)/N
  D<-sum((1-Myc)*(1-Path))/N

  ### Calculate parameters
  p1<-(-(B+2-C)+sqrt(((B+2-C)^2) -(4*(-1)*(C+D-B-1)))) / (2*(-1)) 
  p2<-(-(B-2-C)-sqrt(((B-2-C)^2) -(4*(1)*(C-D-B+1)))) / (2*(1))
  a<-(B-p1+p1*p2)/(p1*p2)
  return(cbind(a,p1,p2))
}
ParmFind(Myc=ToyDat$Myc,Path=ToyDat$Path)
##      a p1 p2
## [1,] 0  1  1

Compare these estimated values to the ‘true’ values (which we know because we put them into the model): \(\alpha=0.1\), \(p_1=p_2=0.8\).

Now what if we repeatedly simulate 10 draws with the same parameter values, and then calculate the parameters from each simulated dataset? Let’s make the simulation easier by setting \(\alpha=0\).

Iters<-100
ParmTable<-data.frame(a=rep(NA,Iters),p1=rep(NA,Iters),p2=rep(NA,Iters))
for (i in 1:Iters){
  ToyDat<-data.frame(Myc=sample(0:1,10,replace=T,prob=c(0.2,0.8)),Path=sample(0:1,10,replace=T,prob=c(0.2,0.8)))
  ParmTable[i,]<-ParmFind(Myc=ToyDat$Myc,Path=ToyDat$Path)
}
## Warning in sqrt(((B + 2 - C)^2) - (4 * (-1) * (C + D - B - 1))): NaNs
## produced

## Warning in sqrt(((B + 2 - C)^2) - (4 * (-1) * (C + D - B - 1))): NaNs
## produced

## Warning in sqrt(((B + 2 - C)^2) - (4 * (-1) * (C + D - B - 1))): NaNs
## produced

## Warning in sqrt(((B + 2 - C)^2) - (4 * (-1) * (C + D - B - 1))): NaNs
## produced

## Warning in sqrt(((B + 2 - C)^2) - (4 * (-1) * (C + D - B - 1))): NaNs
## produced
head(ParmTable)
##               a        p1        p2
## 1  1.000000e-01 1.0000000 1.0000000
## 2  1.111111e-01 1.0000000 0.9000000
## 3  1.250000e-01 0.8000000 1.0000000
## 4           NaN       NaN 1.0000000
## 5 -2.485919e-01 0.6837722 0.6837722
## 6 -1.110223e-15 0.8000000 1.0000000
qplot(ParmTable$a)+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5 rows containing non-finite values (stat_bin).

qplot(ParmTable$p1)+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5 rows containing non-finite values (stat_bin).

qplot(ParmTable$p2)+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The calculation only rarely finds the correct \(p_1\) and \(p_2\) values of 0.8! Most of the time it calculates \(p_1=p_2=1\) with a non-zero \(\alpha\) value instead.

Is there an alternative approach that would account for the effect of random sampling?

Likelihood

The alternative is to use Maximum Likelihood or a Bayesian method to calculate the probability distribution of the three parameters, given the data. For this we need simulated root cross-sections, so start by creating a function for simulating the data.

Simulated data

Root<-function(p1=0.5,p2=0.5,a=0,N=100){
  Root<-data.frame(P=rep(0,N),Myc=rep(0,N),Path=rep(0,N))
  
  # Conditional Probabilities
  A<-p1*p2*(1-2*a)
  B<-p1*(1-p2)+p1*p2*a
  C<-(1-p1)*p2+p1*p2*a
  D<-(1-p1)*(1-p2)
  
  # Avoid errors arising when a >> p1 & p2, leading to negative probabilities 
  A[A<0]<-0

  # Assign conditional probabilities
  Root$P<-sample(c("A","B","C","D"),size=N,replace=T,prob=c(A,B,C,D))
  
  # Translate to Mycorrhizae and Pathogen
  Root$Myc[Root$P %in% c("A","B")]<-1
  Root$Path[Root$P %in% c("A","C")]<-1
  
  return(Root)
}
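For example, to simulate 10 cross-sections with arbitrarily chosen parameter values (the output will differ from run to run because colonization is sampled randomly):

head(Root(p1=0.8,p2=0.8,a=0.1,N=10))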

Likelihood Model

The likelihood of a model is the probability of observing the data given a particular set of parameters. For example, we can set arbitrary values of p1, p2 and a to calculate the likelihood that the specified parameters would create the observed data.

The log-likelihood of a joint Bernoulli variable is the sum of the log-likelihoods of each joint outcome (A-D), each multiplied by its number of occurrences.

IMPORTANT: Remember that

\[\log(a \times b \times c) = \log(a) + \log(b) + \log(c)\]

i.e. this is the log of the probability that a AND b AND c all occur, NOT that a OR b OR c occurs.

Write a function that returns the Log-Likelihood given a set of parameters p1, p2 and a.

Lik<-function(p1,p2,a){
  # Conditional Probabilities
  A<-p1*p2*(1-2*a)        
  B<-p1*(1-p2)+p1*p2*a    
  C<-(1-p1)*p2+p1*p2*a    
  D<-(1-p1)*(1-p2)        
  
  # Avoid errors when taking the log of negative or zero values of A
  # (B and C can also go negative when a < 0, which is what produces the NaN warnings below)
  A[A<0]<-1e-100
  
  # Log-Likelihood Function (note: SimDat is read from the global environment)
  return(log(A)*sum(SimDat$Myc*SimDat$Path)+log(B)*sum(SimDat$Myc*(1-SimDat$Path))+
            log(C)*sum((1-SimDat$Myc)*SimDat$Path)+log(D)*sum((1-SimDat$Myc)*(1-SimDat$Path)))
}

Likelihood surface

Take a look at how the likelihood changes across different values of p1, p2 and a. The ‘maximum likelihood’ is the highest likelihood value (y-axis) for all values of p1, p2 and a.

p1sim<-0.421
p2sim<-0.615
asim<-0.153
# Simulate data to test the Maximum likelihood Values
SimDat<-Root(p1=p1sim,p2=p2sim,a=asim,N=10000)

# Examine the Likelihood Function
library(ggplot2)
qplot(x=0:100/100,y=Lik(p1=c(0:100)/100,p2=p2sim,a=asim),geom="line")+
  geom_vline(xintercept=p1sim,colour="red")+theme_bw()+xlab("p1")+ylab("log-L")

qplot(x=0:100/100,y=Lik(p1=p1sim,p2=c(0:100)/100,a=asim),geom="line")+
  geom_vline(xintercept=p2sim,colour="red")+theme_bw()+xlab("p2")+ylab("log-L")

qplot(x=-100:100/100,y=Lik(p1=p1sim,p2=p2sim,a=c(-100:100)/100),geom="line")+
  geom_vline(xintercept=asim,colour="red")+theme_bw()+xlab("a")+ylab("log-L")
## Warning in log(B): NaNs produced

## Warning in log(B): NaNs produced
## Warning: Removed 38 rows containing missing values (geom_path).

Max Likelihood

Calculating the maximum likelihood can be challenging. The Maximum Likelihood values can be derived analytically by taking the partial derivatives of the likelihood function with respect to each parameter and setting them to zero, which is not too difficult if there aren’t too many parameters, and if the probability density functions for each parameter are not too complicated.
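For example, for a single Bernoulli probability \(p\) with \(k\) colonized cross-sections out of \(N\), the log-likelihood is \(k\log(p) + (N-k)\log(1-p)\), and setting its derivative with respect to \(p\) to zero gives the familiar estimate:

\[\frac{k}{p} - \frac{N-k}{1-p} = 0 \implies \hat{p} = \frac{k}{N}\]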

When the likelihood model is too complicated for an analytical solution, you need some way to ‘explore the parameter space’.

Brute force

If we force our parameters to fall between 0 and 1, a conceptually straightforward way to explore parameter space is to consider all values within that range, to a given level of precision (e.g. 2 or 3 decimal places).

In our example, there are 3 parameters. If we want to look at all combinations from 0.01 to 0.99 in 0.01 increments, that is ~100^3, or about a million likelihood calculations. Increasing the precision to 0.001 increments would require ~1000^3, or about a billion calculations.
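To put numbers on that (the grid below actually runs from 0.01 to 0.99, so 99 values per parameter):

c(99^3, 999^3)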

MaxLik<-function(Precise=100){
  # Brute Force Method
  # expand.grid() generates an index matrix of all parameter combinations
  MLdat<-expand.grid(p1=(2:Precise-1)/Precise,p2=(2:Precise-1)/Precise,a=(2:Precise-1)/Precise) 
  # Log-lik can be undefined when p or q are >=1 or <=0
  MLdat$Lik<-Lik(p1=MLdat$p1,p2=MLdat$p2,a=MLdat$a)
  # Return the 5% most likely values, based on Log-Likelihood
  return(MLdat[order(MLdat$Lik,decreasing=T),][1:ceiling(nrow(MLdat)*0.05),])
}
rbind(c(p1sim,p2sim,asim),head(MaxLik()))
##           p1    p2     a        Lik
## 1      0.421 0.615 0.153      0.421
## 143196 0.420 0.610 0.150 -13336.470
## 133395 0.420 0.610 0.140 -13336.710
## 143295 0.420 0.620 0.150 -13336.725
## 133394 0.410 0.610 0.140 -13337.389
## 153096 0.420 0.620 0.160 -13337.589
## 133494 0.420 0.620 0.140 -13337.853

You could try re-running the above with Precise=1000. If you don’t run out of memory, it may take a while to run. As the number of parameter × precision combinations gets too large, we have to think of other alternatives.

Random sampling

Instead of brute-force sampling, we can use a semi-random sampling process. Here is one possible way to do this:

# Find the Maximum likelihood Values
MaxLik<-function(Iters=1000000){
  # Random Sampling Method
  MLdat<-data.frame(p1=runif(Iters),p2=runif(Iters),a=runif(Iters))
  # Log-lik can be undefined when p or q are >=1 or <=0
  MLdat[MLdat==1]<-0.9999  
  MLdat[MLdat==0]<-0.0001
  MLdat$Lik<-Lik(p1=MLdat$p1,p2=MLdat$p2,a=MLdat$a)
  return(MLdat[order(MLdat$Lik,decreasing=T),][1:ceiling(Iters*0.05),])
}
rbind(c(p1sim,p2sim,asim),head(MaxLik()))
##               p1        p2         a        Lik
## 1      0.4210000 0.6150000 0.1530000      0.421
## 604259 0.4181620 0.6134273 0.1440263 -13336.163
## 731859 0.4153664 0.6182921 0.1475074 -13336.684
## 338381 0.4155636 0.6066605 0.1423874 -13336.949
## 983610 0.4150689 0.6179136 0.1408229 -13337.085
## 745921 0.4180495 0.6090503 0.1549842 -13337.419
## 555796 0.4111525 0.6054014 0.1377225 -13337.877

Compare this model to the previous algorithm. What is the key difference?

Structured sampling

Here is a more elegant search algorithm.

Before reading the explanation below, try to work out what this script does.

# Find the Maximum likelihood Values
MaxLik<-function(divs=10,zooms=100){
  Parms<-data.frame(p1=0.5,p2=0.5,a=0.5)
  for(z in 1:zooms){
    tRange<-seq(-0.4999,0.4999,length.out=divs)/z
    tMLdat<-expand.grid(p1=tRange+Parms$p1,p2=tRange+Parms$p2,a=tRange+Parms$a) # this generates an index matrix of all parameter combinations
    tLik<-Lik(p1=tMLdat$p1,p2=tMLdat$p2,a=tMLdat$a)
    tLik[is.nan(tLik)]<-min(tLik,na.rm=T)
    Parms<-unique(tMLdat[tLik==max(tLik),])
  }
  return(Parms)
}
X<-MaxLik(divs=10,zooms=100)
rbind(c(p1sim,p2sim,asim),X)
##            p1       p2         a
## 1   0.4210000 0.615000 0.1530000
## 556 0.4189521 0.614165 0.1470833

This script does a few things:

  1. Set all starting parameters to 0.5
  2. Sample a range of values at low resolution
  3. Find the best combination of low-res parameter values, based on the Likelihood values
  4. Set these values as the new starting values
  5. ‘Zoom in’ to a higher resolution, centred around the new values
  6. Repeat

Compare the estimated parameters (X) to the ones we used to generate the data (SimDat).

This fairly simple model does a pretty good job of finding the best parameters. However, there are cases where this may not find the ‘best’ model. In our case, the ‘likelihood surface’ is relatively smooth, but in other cases it can be more convoluted.

Additionally, we may have parameters that are not bounded. For example, it makes sense to constrain \(p_1\) and \(p_2\) to be between 0 and 1 since they are probabilities, but there is no good reason to bound \(\alpha\). It should be perfectly fine to have a positive or negative alpha, representing antagonistic vs. facilitative interactions (i.e. fewer vs. more co-occurrences relative to the null model).

In these cases, we need a more advanced optimization algorithm.

MCMC Model

Markov chain Monte Carlo (MCMC) is a class of structured random sampling algorithms that perform a sort of ‘random walk’ through parameter space. There are a few steps to set up first.

First, we need a log-likelihood function. We’ll use the same function as in the previous models.

Prior Probability (Uniform)

Next, we have to define a prior probability distribution for each parameter. The priors provide a starting point for the parameter search and, combined with the likelihood, define the posterior distribution that the MCMC simulation explores.

You may want to read up on the probability density function. In our model, we will use uniform priors for the two probabilities, and a normally distributed prior (mean=0, sd=1) for a. Note the log=T option, which returns the log of the density.

Prior<-function(p1,p2,a){
    p1pr<-dunif(p1, min=0, max=1,log=T)
    p2pr<-dunif(p2, min=0, max=1,log=T)
    apr<-dnorm(a,log=T)
    return(p1pr+p2pr+apr)
}

Posterior Probability

Next, calculate the posterior probability as the sum of the log-likelihood and the log-prior. As noted earlier, this is equivalent to multiplying the probabilities and then taking the log.

Post <- function(p1,p2,a){
   return (Lik(p1,p2,a) + Prior(p1,p2,a))
}

Metropolis-Hastings MCMC

This is a particular ‘flavour’ of MCMC called the Metropolis-Hastings algorithm.

The key distinction of this algorithm is the basis of the ‘random walk’: candidate values of p1, p2 and a are drawn from a normal distribution centred on the current value of each parameter, and the size of each random step is determined by the standard deviation of that distribution.

PickParm <- function(p1,p2,a,sd=0.01){
    Xpick<-rnorm(3,mean = c(p1,p2,a), sd= c(sd,sd,sd))
    return(Xpick)
}

Now we’re ready for the MCMC model. For each iteration of the algorithm, we:

  1. Start from the current values of p1, p2 & a (initially, the user-supplied starting values)
  2. Propose new values by adding small, random deviations in PickParm
  3. Calculate the likelihood of the proposed parameters, given the data
  4. Calculate the acceptance ratio: the posterior probability of the proposed parameters divided by that of the current parameters
  5. Choose a random number from a uniform distribution between 0 and 1
  6. If the acceptance ratio > the random number from 5,
    1. then accept the proposed values
    2. otherwise, keep the current values
  7. Repeat

Why do we do step #6?

Eventually, the simulation should settle close to the ‘true’ values of p1, p2 and a. Here is the MCMC function:

MeMCMC <- function(p1=0.5,p2=0.5,a=0.5,Iters=1000,verbose=F){
  # Setup output vector and put user-supplied (or default) priors
  MCout <- data.frame(p1=rep(NA,Iters),p2=rep(NA,Iters),a=rep(NA,Iters))
  MCout[1,] <- c(p1,p2,a)
  
  # MCMC Algorithm
  for (i in 1:Iters){
    if(verbose==T){
      cat("Starting Iteration",i,"\n")
    }
    # Choose new parameters
    Pick <- PickParm(p1=MCout$p1[i],p2=MCout$p2[i],a=MCout$a[i])
    
    # To catch 'bad' values of the a parameter, we need to check that all joint probabilities fall between 0 and 1
    A<-Pick[1]*Pick[2]*(1-2*Pick[3])
    B<-Pick[1]*(1-Pick[2])+Pick[1]*Pick[2]*Pick[3]
    C<-(1-Pick[1])*Pick[2]+Pick[1]*Pick[2]*Pick[3]
    D<-(1-Pick[1])*(1-Pick[2])
    # If illogical parameters are chosen, reject the proposal and keep the current values
    if(max(A,B,C,D) >=1 | min(A,B,C,D) <= 0){
      Pick<-MCout[i,]
      Prob<-0 # ensures rejection below; otherwise Prob would be undefined on a first-iteration rejection
    } else{
      # Calculate Probability
      Prob <- exp(Post(p1=Pick[1],p2=Pick[2],a=Pick[3]) - 
        Post(p1=MCout$p1[i],p2=MCout$p2[i],a=MCout$a[i]))
    }
    # Compare probability to random draw from a uniform distribution to determine whether new values should be chosen
    if (runif(1) < Prob ){
      MCout[i+1,] <- Pick
    }else{
      MCout[i+1,] <- MCout[i,]
    }
  }
  return(MCout)
}

A lot of the extra coding in this function is just to avoid generating errors that trace back to our likelihood function. Since the a parameter can be negative, we can end up with probabilities < 0 for one or more of the outcomes A, B, C, D. But you can’t take the log of a negative number, so this causes an NaN error in the likelihood function. A negative probability also doesn’t make logical sense.

To avoid these problems, we make sure the probabilities of A, B, C, and D are within 0 and 1. If not, we replace the chosen parameters with the ones used in the previous iteration.

Test the MCMC results

Generate a dataset and test.

# Simulated Data
SimDat<-Root(p1=p1sim,p2=p2sim,a=asim,N=1000)
# Run the MCMC algorithm; estimate priors from the data
MCrun<-MeMCMC(p1=sum(SimDat$Myc)/nrow(SimDat),p2=sum(SimDat$Path)/nrow(SimDat),a=0,Iters=10000,verbose=F)
# Remove early values, which are more heavily influenced by the starting values
BurnIn<-1000
PlotDat<-MCrun[-(1:BurnIn),]
library(ggplot2)
qplot(x=p1,data=PlotDat)+
  geom_vline(xintercept=p1sim,colour="red")+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(x=p2,data=PlotDat)+
  geom_vline(xintercept=p2sim,colour="red")+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(x=a,data=PlotDat)+
  geom_vline(xintercept=asim,colour="red")+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Not looking very good – generally in the right area, but a broad range of parameter values. We probably need a lot more iterations. Fortunately, there is an MCMC package that runs much faster.

Programming Woes:

A recurring theme of these tutorials is that coding requires concentration and determination. Case in point: while putting this tutorial together, it took me >3 days and countless hours to figure out that I had an error in the Prob<-... statement in the MeMCMC function above. This is what I had instead:

      # Calculate Probability
      Prob <- exp(Post(p1=Pick[1],p2=Pick[2],a=Pick[3])) - 
        Post(p1=MCout$p1[i],p2=MCout$p2[i],a=MCout$a[i])

Can you see the error?

This is a classic example of a coding ‘bug’. To see why this was such a problem, try substituting this into the function, then re-run the function and the other two code chunks. Compare the distribution of parameter values in the MCMC results for the two different Prob<-... statements.

All of this to say, that it is completely normal to bang your head against the wall for a long time before you finally get some code that works properly!

MCMC Package

The mcmc package for R was written by Charles Geyer at the University of Minnesota. He also wrote the R code for aster models for life-history analysis, with Ruth Shaw.

First, a reformulation of the posterior probability function. This is just a combination of the Prior(), Post() and Lik() functions above, with different formatting to read the parameters as a 3-element vector instead of separate objects. It is also modified to avoid errors when the joint probabilities fall outside of 0 and 1.

PostProb<-function(Parms,SimDat){
  
  # Extract Parameters
  p1<-Parms[1]
  p2<-Parms[2]
  a<-Parms[3]
  
  # Priors
  p1pr = dunif(p1, min=0, max=1,log=T)
  p2pr = dunif(p2, min=0, max=1,log=T)
  apr = dnorm(a,log=T)
  
  # Conditional Probabilities
  A<-p1*p2*(1-2*a)
  B<-p1*(1-p2)+p1*p2*a
  C<-(1-p1)*p2+p1*p2*a
  D<-(1-p1)*(1-p2)
  
  if(max(A,B,C,D) >=1 | min(A,B,C,D) <= 0){
    return(-1e100)
  } else{
    LogLik<-log(A)*sum(SimDat$Myc*SimDat$Path)+log(B)*sum(SimDat$Myc*(1-SimDat$Path))+
            log(C)*sum((1-SimDat$Myc)*SimDat$Path)+log(D)*sum((1-SimDat$Myc)*(1-SimDat$Path))
  
    LogPrior<-p1pr + p2pr + apr

    return(LogLik + LogPrior)
    
    }
}

According to the mcmc vignette, we use the metrop() function for Metropolis MCMC sampling, starting with an initial setup run.

library(mcmc)
## Warning: package 'mcmc' was built under R version 3.3.3
SimDat<-Root(p1=p1sim,p2=p2sim,a=asim,N=10000)
Init<-c(sum(SimDat$Myc)/nrow(SimDat),sum(SimDat$Path)/nrow(SimDat),0)
test<-metrop(obj=PostProb,initial=Init,nbatch=10e3,SimDat=SimDat)

Next we should update the model to find a scale= parameter that gives a ~20% acceptance rate for the search algorithm. The scale parameter is similar to the sd of the normal distribution in our custom MeMCMC function, above – it determines how far each proposal jumps around when sampling the parameters.

The metrop function can take the output of a previous run as input and continues sampling from where that run left off.

test$accept
## [1] 1e-04
test<-metrop(test,scale=0.01,SimDat=SimDat)
test$accept
## [1] 0.2383
test<-metrop(test,scale=0.02,SimDat=SimDat)
test$accept
## [1] 0.0641

Now that we’ve found the right scale, we can run many more iterations.

test<-metrop(test,scale=0.01,nbatch=10e4,SimDat=SimDat)

Time series

It’s common practice to plot the sampled parameter values over time (a ‘trace plot’) to make sure there are no biases in the search algorithm.

# Set up data for plotting
Pdat<-as.data.frame(ts(test$batch))
names(Pdat)<-c("p1","p2","a")
Pdat$Iter<-seq_along(Pdat$a)
# Denote burnin simulations
Pdat$Burnin<-F
Pdat$Burnin[1:20000]<-T

qplot(x=Iter,y=p1,data=Pdat,geom="line",colour=Burnin)+
  geom_hline(yintercept=p1sim,colour="blue")+theme_bw()

qplot(x=Iter,y=p2,data=Pdat,geom="line",colour=Burnin)+
  geom_hline(yintercept=p2sim,colour="blue")+theme_bw()

qplot(x=Iter,y=a,data=Pdat,geom="line",colour=Burnin)+
  geom_hline(yintercept=asim,colour="blue")+theme_bw()

These look quite good – we haven’t even used a burn-in parameter to cut out the first N iterations.

Q: What would these plots look like if you used different priors (e.g. p1=p2=a=1)?

Autocorrelation

We can check for autocorrelation in the parameter estimates

acf(test$batch)
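If the coda package is installed, one way to summarize this autocorrelation is the effective sample size, i.e. roughly how many independent draws the correlated chain is worth. A minimal sketch (not part of the original workflow):

library(coda)
effectiveSize(as.mcmc(test$batch))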

Parameter Mean + SE

We can start by looking at the frequency distribution of MCMC posterior estimates

qplot(x=p1,data=Pdat,fill=Burnin)+
  geom_vline(xintercept=p1sim,colour="blue")+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(x=p2,data=Pdat,fill=Burnin)+
  geom_vline(xintercept=p2sim,colour="blue")+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(x=a,data=Pdat,fill=Burnin)+
  geom_vline(xintercept=asim,colour="blue")+theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The parameter mean is just the mean of the estimates – which we might want to calculate after a burn-in period to avoid biases from the early iterations. The standard error is more complicated and debated. One approach is the Batch Means Standard Error.
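Here is a minimal sketch of the batch means approach, assuming we simply split each chain into 100 equal-sized batches and treat the batch means as approximately independent (an illustration only, not a definitive implementation):

BatchSE<-function(chain,nbatch=100){
  b<-floor(length(chain)/nbatch)  # iterations per batch
  bmeans<-colMeans(matrix(chain[1:(b*nbatch)],nrow=b))  # one batch per column
  sd(bmeans)/sqrt(nbatch)  # standard error of the overall posterior mean
}
apply(test$batch,2,BatchSE)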

MCMC Posterior Mean

X<-apply(test$batch,2,mean)
X
## [1] 0.4313274 0.6289034 0.1620260

MCMC Posterior SE

Recall from basic statistics that the variance is

\[Var = E(X^2)-E(X)^2\] Where \(E(X)\) is the mean (a.k.a. first moment) and \(E(X^2)\) is the mean of the second moment (i.e. squared values) of X

EXsq<-function(x){mean(x^2)}
Xsq<-apply(test$batch,2,FUN=EXsq)
Xsq
## [1] 0.18607780 0.39555063 0.02631518
SE<-Xsq-X^2  # this is the variance of the posterior draws
SE
## [1] 3.448956e-05 3.119435e-05 6.274456e-05
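Since the quantity above is a variance, taking the square root gives the posterior standard deviation for each parameter:

sqrt(SE)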

It’s interesting to compare these to the original parameters used to simulate the dataset, above.

Next steps

  1. Try writing an MCMC model with separate \(\alpha\) values, as described in the asymmetric interference model at the beginning.
  2. Try writing an MCMC model that will calculate separate parameter values for different groups supplied by the user. Consider a data.frame() with the following headings and try to modify the MCMC model to calculate separate parameters to test whether Species and GMdens differ in parameters \(p_1\), \(p_2\), and \(\alpha\)
    • Myc – A binary (1/0) value for a root cross-section indicating presence/absence of Mycorrhizae at that cross-section
    • Path – Similar to Myc but for pathogens
    • ID – A unique identifier for the plant – all cross-sections with the same ID are on the same plant
    • Species – One of 6 species in the study
    • GMdens – (high/low) The density of garlic mustard at the site where the roots were collected