Maximum Likelihood (ML) and Markov Chain Monte Carlo (MCMC) are two methods that are commonly used for the analysis of population genetic datasets. Make sure you understand probability distributions from the Probabilities Tutorial before proceeding further. You may also want to look at the Resampling Models Tutorial.

Frequentist Statistics

You should be familiar with frequentist statistics like the t-test, ANOVA, and other types of linear models. The philosophy behind these models is to compare your data to a null model -- usually a random expectation. With frequentist statistics, you have two basic outcomes:

  1. Reject the Null Hypothesis \(H_0\)
  2. Fail to reject \(H_0\)

It is common in scientific publications to interpret #1 as evidence for the hypothesis of interest \(H_1\). But strictly speaking, we have only shown that the data deviate from random expectations in a way that is consistent with our hypothesis. In reality, there may be many different models that fit the data.

A basic regression of two continuous variables:

library(ggplot2) # Needed for qplot() and geom_smooth()
data(iris) # Load the iris dataset

qplot(x=Sepal.Length, y=Petal.Length, data=iris) + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'

summary(lm(Petal.Length ~ Sepal.Length,data=iris))
## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.47747 -0.59072 -0.00668  0.60484  2.49512 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
## Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7583 
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

What is the null model for a linear regression?

What is the null model for a t-test?

Maximum Likelihood

Rather than reject a null model given the data, we flip the question around and ask: "What is the likelihood that the model would produce the data we observe?"

To answer this question, we need two equations:

  1. The specific model -- Not just the type of model (e.g. linear), but the specific parameters of the model (e.g. slope and intercept).
  2. The Loss Function -- This compares the observed values to what would be predicted given the specific model parameters.

To build a loss function, start by thinking about a single data point. What is the likelihood \(p_i\) of observing datum \(x_i\) given some mathematical model?

In the example above, we are implicitly testing the hypothesis that Petal.Length depends on Sepal.Length following a linear model. Specifically, the petal length of an individual iris plant \(i\) is:

\[ Petal.Length_i = \beta_0 + \beta_1 \times Sepal.Length_i \]

The null hypothesis is that the slope is zero and the y-intercept is just the mean Petal.Length. In other words, the predicted values are:

Pred0<- mean(iris$Petal.Length) + 0*iris$Sepal.Length
Pred0
##   [1] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
##  [13] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
##  [25] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
##  [37] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
##  [49] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
##  [61] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
##  [73] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
##  [85] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
##  [97] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
## [109] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
## [121] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
## [133] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
## [145] 3.758 3.758 3.758 3.758 3.758 3.758

Why are these all the same values? Try graphing Predicted vs. Observed
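
One way to do this, as a sketch (assuming ggplot2 is loaded as above and Pred0 is the vector calculated above):

qplot(x=Pred0, y=iris$Petal.Length) + # Predicted vs. observed for the null (slope = 0) model
  xlab("Predicted Petal.Length (null model)") +
  ylab("Observed Petal.Length")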

Now calculate the predicted values for each observation for a model with a slope of 1 and an intercept of 0:

Pred1<- 0 + 1*iris$Sepal.Length
Pred1
##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
##  [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
##  [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
##  [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
##  [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
##  [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
## [145] 6.7 6.7 6.3 6.5 6.2 5.9

Plot Predicted vs. Observed and compare with the previous model
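
A sketch following the same approach as before:

qplot(x=Pred1, y=iris$Petal.Length) + # Predicted vs. observed for the slope = 1, intercept = 0 model
  xlab("Predicted Petal.Length (slope = 1, intercept = 0)") +
  ylab("Observed Petal.Length")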

Likelihood Calculation

For each predicted value, we can calculate the likelihood of observing the corresponding value of Petal.Length, given the model, using the dnorm function. To do this, think about the expected deviation of observed from predicted values. The mean for each point is the prediction from the linear equation. There is also a residual error -- a random deviation from this prediction. For now, let's assume the residual error is normally distributed with a standard deviation of 1.

L0<-dnorm(iris$Petal.Length, # Likelihoods under the null (slope = 0) model
          mean= mean(iris$Petal.Length) + 0*iris$Sepal.Length,
          sd=1)

L1<-dnorm(iris$Petal.Length, # Likelihoods under the slope = 1, intercept = 0 model
          mean= 0 + 1*iris$Sepal.Length,
          sd=1)

Loss calculation

The vectors L0 and L1 above are vectors of probabilities. Each individual probability \(p_i\) is the likelihood of observing \(Petal.Length_i\) given the prediction from one of the two models. If we assume that each observation is independent, then the likelihood of observing all of the data points is just the product:

\[L_{model} = \prod_{i=1}^N p_i \]

BUT: when we multiply many numbers that are less than 1, the product becomes smaller and smaller, and very small numbers cause problems (numerical underflow) on computers:

X<-0.1
X^2
## [1] 0.01
X^2 * X^-2
## [1] 1
X^10 * X^-10
## [1] 1
X^10000 * X^-10000
## [1] NaN
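
Working on the log scale keeps the same calculation in a numerically safe range. For example (a quick sketch using the X defined above):

10000*log(X) + 10000*log(1/X) # log of X^10000 * X^-10000; effectively zero (up to rounding error)
exp(10000*log(X) + 10000*log(1/X)) # approximately 1, instead of NaN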

Log-likelihood

To circumvent this problem, we take advantage of the equality:

\[\log(a \times b \times c) = \log(a) + \log(b) + \log(c)\]

and calculate the log-likelihood

\[\log(L_{model}) = \sum_{i=1}^N \log(p_i)\]

The log-likelihoods for our two models above are:

sum(log(L0))
## [1] -370.0035
sum(log(L1))
## [1] -557.3608

Think about the log-transformation. Which model is a better fit (i.e. which has the higher likelihood)?

More generally, we can calculate the log-likelihood for any set of parameters:

LogLik<-function(B0=0,B1=1,SD=1){
  # Returns the per-observation likelihoods for a given intercept (B0),
  # slope (B1), and residual standard deviation (SD)
  return(dnorm(iris$Petal.Length,
          mean= B0 + B1*iris$Sepal.Length,
          sd=SD))
  }

sum(log(LogLik(B0=0,B1=1,SD=1)))
## [1] -557.3608
sum(log(LogLik(B0=1,B1=0,SD=1)))
## [1] -940.4958
sum(log(LogLik(B0=0,B1=-1,SD=1)))
## [1] -7524.881
sum(log(LogLik(B0=-1,B1=0,SD=1)))
## [1] -2067.896
sum(log(LogLik(B0=1,B1=0,SD=10)))
## [1] -491.2551

These are just 5 of an infinite number of parameter combinations. What if we wanted to find the 'best' parameters -- i.e. the parameters that provide the best fit?

What is the maximum likelihood model?

Maximum Likelihood

A relatively straightforward way to find the best slope and intercept would be to slice up the parameter space. In theory, each value can range from negative infinity to positive infinity, with an infinite number of decimal places in between. However, we can try to approximate by looking at a range of values and then 'zooming in'.

1. Log-Likelihood Function

We can also generalize the functions above to calculate the log-likelihood of the data given a set of model parameters:

LogLik<-function(B0=0,B1=1,SD=1){
  likelihoods<- dnorm(iris$Petal.Length, # Per-observation likelihoods
      mean= B0 + B1*iris$Sepal.Length,
      sd=SD)
  log.likelihoods<-log(likelihoods) # Log-transform each likelihood
  logLModel <- sum(log.likelihoods) # Sum to get the model log-likelihood
  return(logLModel)
  }

2. Test Values

B0i<-c(-10:10)
B1i<-c(-50:50)/10
Dat<-expand.grid(B0=B0i,B1=B1i) # Every combination of B0i with B1i
head(Dat)
##    B0 B1
## 1 -10 -5
## 2  -9 -5
## 3  -8 -5
## 4  -7 -5
## 5  -6 -5
## 6  -5 -5

3. Calculate log-Likelihoods

Dat$L<-NA
for(Row in 1:nrow(Dat)){
  Dat$L[Row]<-LogLik(B0=Dat$B0[Row],B1=Dat$B1[Row],SD=1)
}

4. Find the 'best' region of parameter space

ggplot(aes(x=B0,y=B1,fill=L),data=Dat) + geom_tile() +
  scale_fill_gradient(low="white", high="red") 
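
To follow the 'zooming in' idea described above, one approach (a sketch; the names Best, B0fine, B1fine, and DatFine are just illustrative) is to find the grid cell with the highest log-likelihood and repeat the grid search on a finer scale around it:

Best<-Dat[which.max(Dat$L),] # Grid cell with the highest log-likelihood
Best

B0fine<-Best$B0 + c(-10:10)/10 # Finer grid centred on the best cell
B1fine<-Best$B1 + c(-10:10)/100
DatFine<-expand.grid(B0=B0fine,B1=B1fine)
DatFine$L<-NA
for(Row in 1:nrow(DatFine)){
  DatFine$L[Row]<-LogLik(B0=DatFine$B0[Row],B1=DatFine$B1[Row],SD=1)
}
DatFine[which.max(DatFine$L),] # Improved estimates of B0 and B1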

Optimize Functions

There are a number of optimizing functions that search the parameter space, automating what we have done visually above. A few of these are available through the optim function in R.

However, we first have to modify our function to deal with input values that may not make sense:

LogLik(0,0,-1)
## Warning in dnorm(iris$Petal.Length, mean = B0 + B1 * iris$Sepal.Length, : NaNs
## produced
## [1] NaN

The optim function also minimizes rather than maximizes by default, so we give it the deviance rather than the likelihood. In this case, the deviance is just

\[-2 \times \sum_{i=1}^N \log(p_i)\]

The deviance is our loss function: it is what we minimize (equivalent to maximizing the likelihood).

LogLikMod<-function(Parms){ # optim passes the parameters as a single vector
  B0<-Parms[1]
  B1<-Parms[2]
  SD<-Parms[3]
  if(SD <= 0){
    deviance<- 10000000 # Penalize impossible SD values (SD must be > 0)
  } else {
    likelihoods<- dnorm(iris$Petal.Length,
        mean= B0 + B1*iris$Sepal.Length,
        sd=SD)
    deviance<- -2*sum(log(likelihoods))
  }
  return(deviance)
}
LogLikMod(c(0,0,-1))
## [1] 1e+07
LogLikMod(c(10,10,1))
## [1] Inf
Fits<-optim(par=c(0,0,1), # Initial parameters (Starting point)
      fn=LogLikMod) # Our likelihood function
Fits
## $par
## [1] -7.1011296  1.8584331  0.8621265
## 
## $value
## [1] 381.135
## 
## $counts
## function gradient 
##      142       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

Compare these parameter estimates $par with the frequentist linear model

Fits$par
## [1] -7.1011296  1.8584331  0.8621265
summary(lm(Petal.Length ~ Sepal.Length,data=iris))
## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.47747 -0.59072 -0.00668  0.60484  2.49512 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
## Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7583 
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16
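
A compact way to line these up (a sketch; note that the third ML parameter is the residual standard deviation, which is expected to be slightly smaller than lm's residual standard error because the ML estimate divides by N rather than N - 2):

Mod<-lm(Petal.Length ~ Sepal.Length, data=iris)
coef(Mod)  # Frequentist intercept and slope
sigma(Mod) # Frequentist residual standard error (divides by N - 2)
Fits$par   # ML estimates: intercept, slope, residual SD (divides by N)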

MCMC

Markov Chain Monte Carlo is a popular algorithm for searching a complex parameter space. It is a sort of 'random walk' through parameter space. In each iteration, it proposes new parameter values and compares the fit of the data to the previous iteration. If the fit improves, the new parameters are accepted; if it gets worse, they may still be accepted, with a probability that depends on how much worse the fit is. This lets the chain explore the parameter space rather than simply climbing to the nearest peak.

MCMC is based on Bayesian statistics, which starts with a set of prior probabilities and updates the probabilities based on the observed data.
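
To make the idea concrete, here is a minimal sketch of a Metropolis-style sampler for the intercept and slope (assuming flat priors, a fixed SD of 1, and the LogLik function defined above; the names Chain, Current, and Proposal are just illustrative):

set.seed(123)
NSteps<-5000
Chain<-matrix(NA, nrow=NSteps, ncol=2, dimnames=list(NULL, c("B0","B1")))
Current<-c(B0=0, B1=1) # Starting values
CurrentLL<-LogLik(B0=Current["B0"], B1=Current["B1"], SD=1)

for(i in 1:NSteps){
  Proposal<-Current + rnorm(2, mean=0, sd=0.1) # Propose a small random step
  ProposalLL<-LogLik(B0=Proposal["B0"], B1=Proposal["B1"], SD=1)
  # Accept if the fit improves; otherwise accept with probability equal to
  # the likelihood ratio (the Metropolis rule, on the log scale)
  if(log(runif(1)) < ProposalLL - CurrentLL){
    Current<-Proposal
    CurrentLL<-ProposalLL
  }
  Chain[i,]<-Current
}

colMeans(Chain[-(1:1000),]) # Posterior means after discarding an initial 'burn-in'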

For a walkthrough example building an MCMC function, see This ML MCMC Tutorial.