Maximum Likelihood (ML) and Markov Chain Monte Carlo (MCMC) are two methods commonly used in the analysis of population genetic datasets. Make sure you understand probability distributions from the Probabilities Tutorial before proceeding further. You may also want to look at the Resampling Models Tutorial.
You should be familiar with frequentist statistics like the t-test, ANOVA, and other types of linear models. The philosophy behind these models is to compare your data to a null model -- usually a random expectation. With frequentist statistics, you have two basic outcomes:
1. The data are consistent with the null model, so you fail to reject the null hypothesis \(H_0\).
2. The data deviate from the null model more than expected by chance, so you reject \(H_0\).
It is common in scientific publications to interpret #2 as evidence for the hypothesis of interest \(H_1\). But strictly speaking, we have only shown that the data deviate from random expectations in a way that is consistent with our hypothesis. In reality, there may be many different models that fit the data.
A basic regression of two continuous variables:
library(ggplot2) # qplot() and geom_smooth() come from ggplot2 (skip if already loaded)
data(iris) # Load the iris dataset
qplot(x=Sepal.Length, y=Petal.Length, data=iris) + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
summary(lm(Petal.Length ~ Sepal.Length,data=iris))
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.47747 -0.59072 -0.00668 0.60484 2.49512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.10144 0.50666 -14.02 <2e-16 ***
## Sepal.Length 1.85843 0.08586 21.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared: 0.76, Adjusted R-squared: 0.7583
## F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16
What is the null model for a linear regression?
What is the null model for a t-test?
Rather than reject a null model given the data, we flip the question around and ask: "What is the likelihood that the model would produce the data we observe?"
To answer this question, we need two equations: one describing the model itself (the prediction for each observation), and one describing the likelihood of the observed data given that model (our loss function).
To build a loss function, start by thinking about a single data point. What is the likelihood \(p_i\) of observing datum \(x_i\) given some mathematical model?
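For example, the first iris in the dataset has a petal length of 1.4 cm. If some model predicted a value of 1.5 cm for that plant (a made-up prediction, just for illustration), and we assume normally distributed error with a standard deviation of 1, the likelihood of that single observation is:
dnorm(x=1.4, mean=1.5, sd=1) # Likelihood of observing 1.4 when the model predicts 1.5; about 0.397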
In the example above, we are implicitly testing the hypothesis that Petal.Length depends on Sepal.Length following a linear model. Specifically, the petal length of an individual iris plant \(i\) is:
\[ Petal.Length_i = \beta_0 + \beta_1 \times Sepal.Length_i \]
The null hypothesis is that the slope is zero and the y-intercept is just the mean Petal.Length. In other words, the predicted values are:
Pred0<- mean(iris$Petal.Length) + 0*iris$Sepal.Length
Pred0
## [1] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
## [13] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
## [25] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
## [37] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
## [49] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
## [61] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
## [73] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
## [85] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
## [97] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
## [109] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
## [121] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
## [133] 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758 3.758
## [145] 3.758 3.758 3.758 3.758 3.758 3.758
Why are these all the same values? Try graphing Predicted vs. Observed
Now calculate the predicted values for each observation for a model with a slope of 1 and an intercept of 0:
Pred1<- 0 + 1*iris$Sepal.Length
Pred1
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
## [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
## [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
## [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
## [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
## [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
## [109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
## [127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
## [145] 6.7 6.7 6.3 6.5 6.2 5.9
Plot Predicted vs. Observed and compare with the previous model
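One way to make these Predicted vs. Observed plots (a rough sketch; any plotting approach will do):
qplot(x=Pred0, y=iris$Petal.Length) + xlab("Predicted") + ylab("Observed Petal.Length") # Null model: every plant gets the same prediction
qplot(x=Pred1, y=iris$Petal.Length) + xlab("Predicted") + ylab("Observed Petal.Length") # Slope = 1, intercept = 0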
For each predicted value, we can calculate the likelihood of observing that value, given the model, using the dnorm function. To do this, think about how far we expect the observed values to deviate from the predicted ones. The mean for each point will be the prediction from the linear equation. There will also be a residual error -- a random deviation from this prediction. For now, let's assume the residual error is normally distributed with a standard deviation of 1.
L0<-dnorm(iris$Petal.Length,
          mean= mean(iris$Petal.Length) + 0*iris$Sepal.Length,
          sd=1)
L1<-dnorm(iris$Petal.Length,
          mean= 0 + 1*iris$Sepal.Length,
          sd=1)
The vectors L0 and L1 above are vectors of probabilities. Each individual probability \(p_i\) is the likelihood of observing \(Petal.Length_i\) given the prediction from one of the two models. If we assume that each observation is independent, then the likelihood of observing all of the data points is just the product
\[L_{model} = \prod_{i=1}^N p_i \]
BUT: when we multiply a bunch of numbers that are each less than 1, we get smaller and smaller numbers, and problems arise when we multiply very small numbers together on a computer:
X<-0.1
X^2
## [1] 0.01
X^2 * X^-2
## [1] 1
X^10 * X^-10
## [1] 1
X^10000 * X^-10000
## [1] NaN
To circumvent this problem, we take advantage of the equality:
\[\log(a \times b \times c) = \log(a) + \log(b) + \log(c)\]
and calculate the log-likelihood
\[\log(L_{model}) = \sum_{i=1}^N \log(p_i)\]
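As a quick sanity check of this identity, try it on a few arbitrary numbers:
p<-c(0.5, 0.2, 0.1) # Three arbitrary probabilities
log(prod(p)) # log of the product, about -4.6
sum(log(p)) # sum of the logs -- the same value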
The log-likelihoods for our two models above are:
sum(log(L0))
## [1] -370.0035
sum(log(L1))
## [1] -557.3608
Think about the log-transformation. Which model is a better fit (i.e. which has a higher probability)?
More generally, we can write a function that calculates the likelihood of each observation for any set of parameters, and then sum the logs:
LogLik<-function(B0=0,B1=1,SD=1){
  return(dnorm(iris$Petal.Length,
               mean= B0 + B1*iris$Sepal.Length,
               sd=SD))
}
sum(log(LogLik(B0=0,B1=1,SD=1)))
## [1] -557.3608
sum(log(LogLik(B0=1,B1=0,SD=1)))
## [1] -940.4958
sum(log(LogLik(B0=0,B1=-1,SD=1)))
## [1] -7524.881
sum(log(LogLik(B0=-1,B1=0,SD=1)))
## [1] -2067.896
sum(log(LogLik(B0=1,B1=0,SD=10)))
## [1] -491.2551
These are just 5 of an infinite number of parameter combinations. What if we wanted to find the 'best' parameters -- i.e. the parameters that provide the best fit?
What is the maximum likelihood model?
A relatively straightforward way to find the best slope and intercept would be to slice up the parameter space. In theory, each value can range from negative infinity to positive infinity, with an infinite number of decimal places in between. However, we can approximate the best fit by looking at a grid of values and then 'zooming in' (sketched below, after the heat map).
We can also rewrite the function above so that it returns the overall log-likelihood for a given set of model parameters:
LogLik<-function(B0=0,B1=1,SD=1){
  likelihoods<- dnorm(iris$Petal.Length,
                      mean= B0 + B1*iris$Sepal.Length,
                      sd=SD)
  log.likelihoods<-log(likelihoods)
  logLModel<-sum(log.likelihoods)
  return(logLModel)
}
B0i<-c(-10:10) # Candidate intercepts: -10 to 10 in steps of 1
B1i<-c(-50:50)/10 # Candidate slopes: -5 to 5 in steps of 0.1
Dat<-expand.grid(B0=B0i,B1=B1i) # Every combination of B0i with B1i
head(Dat)
## B0 B1
## 1 -10 -5
## 2 -9 -5
## 3 -8 -5
## 4 -7 -5
## 5 -6 -5
## 6 -5 -5
Dat$L<-NA # Empty column to hold the log-likelihood of each parameter combination
for(Row in 1:nrow(Dat)){
  Dat$L[Row]<-LogLik(B0=Dat$B0[Row],B1=Dat$B1[Row],SD=1)
}
ggplot(aes(x=B0,y=B1,fill=L),data=Dat) + geom_tile() +
scale_fill_gradient(low="white", high="red")
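To 'zoom in', we could find the best cell in this coarse grid and then repeat the search on a finer grid centered on it. Here is a rough sketch of that idea (the ranges and step sizes below are arbitrary choices):
Best<-Dat[which.max(Dat$L),] # Best combination from the coarse grid
B0fine<-Best$B0 + c(-100:100)/100 # Intercepts within +/- 1 of the coarse optimum
B1fine<-Best$B1 + c(-10:10)/100 # Slopes within +/- 0.1 of the coarse optimum
Fine<-expand.grid(B0=B0fine,B1=B1fine)
Fine$L<-NA
for(Row in 1:nrow(Fine)){
  Fine$L[Row]<-LogLik(B0=Fine$B0[Row],B1=Fine$B1[Row],SD=1)
}
Fine[which.max(Fine$L),] # Best parameters at the finer resolution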
There are a number of optimizing functions that search the parameter space, automating what we have done visually above. One of these is the optim function in R. However, we first have to modify our function to deal with input values that may not make sense:
LogLik(0,0,-1)
## Warning in dnorm(iris$Petal.Length, mean = B0 + B1 * iris$Sepal.Length, : NaNs
## produced
## [1] NaN
The optim function minimizes its objective by default, so rather than the likelihood we give it the deviance. In this case, the deviance is just
\[-2 \times \sum_{i=1}^N \log(p_i)\]
The deviance is our loss function, and it is what we are trying to minimize (versus maximizing the likelihood).
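For example, the deviances of the two models we compared earlier are:
-2*sum(log(L0)) # Null model (intercept = mean, slope = 0); about 740
-2*sum(log(L1)) # Slope = 1, intercept = 0; about 1115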
LogLikMod<-function(Parms){ # optim() passes all of the parameters as a single vector
  B0<-Parms[1]
  B1<-Parms[2]
  SD<-Parms[3]
  if(SD <= 0){
    deviance<- 10000000 # Penalize impossible values of SD
  } else {
    likelihoods<- dnorm(iris$Petal.Length,
                        mean= B0 + B1*iris$Sepal.Length,
                        sd=SD)
    deviance<- -2*sum(log(likelihoods))
  }
  return(deviance)
}
LogLikMod(c(0,0,-1))
## [1] 1e+07
LogLikMod(c(10,10,1))
## [1] Inf
Fits<-optim(par=c(0,0,1), # Initial parameter values (starting point)
            fn=LogLikMod) # Our deviance (loss) function
Fits
## $par
## [1] -7.1011296 1.8584331 0.8621265
##
## $value
## [1] 381.135
##
## $counts
## function gradient
## 142 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
Compare these parameter estimates $par with the frequentist linear model. The intercept and slope match lm() almost exactly; the maximum likelihood estimate of the residual standard deviation (about 0.862) is slightly smaller than lm()'s residual standard error (about 0.868) because the ML estimate divides the sum of squared residuals by \(N\) rather than \(N - 2\).
Fits$par
## [1] -7.1011296 1.8584331 0.8621265
summary(lm(Petal.Length ~ Sepal.Length,data=iris))
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.47747 -0.59072 -0.00668 0.60484 2.49512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.10144 0.50666 -14.02 <2e-16 ***
## Sepal.Length 1.85843 0.08586 21.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared: 0.76, Adjusted R-squared: 0.7583
## F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16
Markov Chain Monte Carlo (MCMC) is a popular algorithm for searching a complex parameter space. It is a sort of 'random walk' through parameter space. In each iteration, new parameter values are proposed and the fit of the model to the data is compared to the fit from the previous iteration. If the likelihood improves, the new parameters are accepted; if it gets worse, the new parameters may still be accepted, with a probability that depends on how much worse the fit is. Occasionally accepting worse parameters helps the chain avoid getting stuck at local optima.
MCMC is based on Bayesian statistics, which starts with a set of prior probabilities and updates them based on the data observed.
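Here is a very rough sketch of the idea: a simple Metropolis sampler for the slope and intercept of our iris model, using the LogLik function defined above, with flat priors, the residual standard deviation fixed at 1, and arbitrary choices for the proposal step size and number of iterations:
set.seed(123)
Niter<-10000
Chain<-matrix(NA, nrow=Niter, ncol=2) # One row per iteration: B0 and B1
colnames(Chain)<-c("B0","B1")
Chain[1,]<-c(0,0) # Arbitrary starting values
CurLL<-LogLik(B0=Chain[1,1], B1=Chain[1,2], SD=1) # Log-likelihood at the starting values
for(i in 2:Niter){
  Prop<-Chain[i-1,] + rnorm(2, mean=0, sd=0.1) # Propose a small random step
  PropLL<-LogLik(B0=Prop[1], B1=Prop[2], SD=1)
  if(log(runif(1)) < PropLL - CurLL){ # Always accept better fits; sometimes accept worse ones
    Chain[i,]<-Prop
    CurLL<-PropLL
  } else {
    Chain[i,]<-Chain[i-1,] # Otherwise keep the previous values
  }
}
colMeans(Chain[-c(1:1000),]) # Rough parameter estimates after discarding a 'burn-in' period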
For a walkthrough example building an MCMC function, see This ML MCMC Tutorial.