Genetic Drift

Let's look at a basic model of genetic drift for a gene with just two alleles in a population that is large enough that we can assume sampling with replacement.

Review the binomial probability distribution from the previous tutorial. Recall the binomial probability distribution is:

\[P(K = k) = \binom Nk p^k q^{N-k}\]

Genetic drift occurs because a finite number of \(N\) individuals contribute genes to the next generation. So what happens if we sample \(N\) diploid individuals and count the number \(A\) alleles. What's the probability of getting \(k\) of the \(A\) alleles? If we are sampling \(N\) individuals at random, and those diploid individuals are themselves created from random draws from the gene pool, then it is simply:

\[P(K = k) = \binom {2N}k p^k q^{N-k}\]

Why \(2N\)? Since each individual contains 2 alleles, we are sampling gametes from each individual twice.

Mean allele frequency

What is the expected number \(k\) of \(A\) alleles in the sample if \(p\) is the frequency of \(A\) in the population? It is the expectation:

\[E(k) = 2Np\]

Write an R script with a binomial sampling distribution to test this equality.

If these \(N\) individuals now mate randomly to produce the next generation, and we again sample \(N\) individuals, then what is the expected frequency \(p\) in each generation?

In the first generation it was just \(p\), the frequency of \(A\) in the population.

In the second generation, we sampled \(2N\) gametes resulting in an average of \(2Np\) alleles in our sample. So the expected proportion is just the number of \(A\) alleles we predict divided by the total number of alleles:

\[E(p') = 2Np/2N = p\]

We can calculate the per-generation change in the mean as:

\[E(\Delta p) = E(p'-p) = E(p')-E(p) = p - p = 0\]

In other words, the frequency of the \(A\) allele is constant on average. But what about the variance in allele frequencies among drift simulations?

Variance in allele frequency

Again, recall that variance is the 2nd central moment. What is the variance in the number of \(A\) alleles in the first generation of sampling?

\[V(p) = 2Npq\]

where \(q = 1-p\). In the second generation:

\[V(p') = pq/2N\]

We can calculate the per-generation rate of change in the variance:

\[V(\Delta p) = V(p')-V(p) = \frac{pq}{2N}\]

Note that the variance is \(0\) when \(p = 0\) and \(p = 1\). In other words, the allele frequency doesn't change once the \(A\) allele is fixed or lost in the population.

We can see that variance in the change of \(p\) due to genetic drift dependes on allele frequency \(p\) but even moreso on the population size \(N\). Let's explore in R:

# Code the variance function
CalcVar<-function(p,N){
  return((p*(1-p))/(2*N))
}

# Set some parameter values
N<-c(1:100) # A range of opulation sizes
p<-c(0.1,0.25,0.5,0.75,0.9) # A few p values

# Make data set for plotting
PDat<-as.data.frame(expand.grid(p,N))
names(PDat)<-c("p","N")
PDat$Var<-CalcVar(p=PDat$p,N=PDat$N)

# Plot
qplot(x=N,y=Var,colour=as.factor(p),data=PDat,geom="line")

Conservation genetics is the application of population genetics (both theoretical and empirical) for conservation research and management. One of the keystone concepts in conservation genetics is the Minimum Viable Population (MVP). The '50/500 rule' is a rule of thumb that says we should try to preserve 50 genetically distinct individuals to prevent inbreeding depression and 500 individuals to guard against genetic drift. For a discussion see Frankham et al. 2014.

What does the graph of \(V(p)\) tell us about the 50/500 rule?

IBD

Identity by descent (IBD) is another fundamental concept in population genetics, used to analyze ancestry, inbreeding, trait mapping, forensics, and other applications. The basic concept is that two alleles are 'identical by descent' if they came from the same ancestor, as opposed to a convergent mutation or migration event.

Note: IBD is also an acronym for Isolation By Distance. This is another important concept in population genetics, but they are quite different concepts despite having the same acronym.

IBD can be defined mathematically as the probability \(F_t\) that any two alleles chosen at random from the gamete pool in generation \(t\) are descended from the same parental allele at time \(0\). For the purposes of this model, imagine that all alleles are unique at time \(0\). Let's follow \(F_t\) through time:

\(F_{t=0} = 0\) when all alleles are unique by definition of the model.

\(F_{t=1} = \frac{1}{2N}\) because there are N individuals and 2N alleles. These are sampled, with replacement from time 0. Given that you've already sampled one allele of any type, \(\frac{1}{2N}\) is the probability that your next allele is of the same type.

\(F_{t=2}\) Now it gets a bit more complicated. In the second generation, we again have the same probability \(\frac{1}{2N}\) of sampling alleles, BUT only for alleles that were not already IBD in generation 1:

\[F_{t=2} = \frac{1}{2N} + (1-\frac{1}{2N})F_{t=1}\] Again, \(1/2N\) is the probability of IBD from sampling \(2N\) alleles from the \(F(1)\) generation and added to this are the fraction that were not IBD from time \(1\) but were already IBD at time \(1\).

Substituting \(F_{t=1} = \frac{1}{2N}\)

\[F_{t=2} = (\frac{1}{2N})(1+(1-\frac{1}{2N}))\]

Continuing in the same way:

\[F_{t=3} = (\frac{1}{2N})(1+(1-\frac{1}{2N})+(1-\frac{1}{2N})^2)\]

and

\[F_t = (\frac{1}{2N})(\sum_{k=0}^{t-1}(1-\frac{1}{2N})^k)\]

We can also calculate the change at each time point by subtracting equations:

\[\Delta F = F_t - F_{t-1} = \frac{1}{2N}+(1-\frac{1}{2N})F_{t-1} \]

which simplifies to:

\[= \frac{1}{2N}(1-F_{t-1}) \]

As \(t\) approaches infinity, IBD approaches equilibrium:

\[F_{eq} = \frac{1}{2N}\times 2N = 1\]

What does this mean, in terms of IBD and genetic variation in the population over time?

What happens if new alleles can enter the population through migration and mutation?

Drift + Migration

In each generation we have \(mN\) new migrant individuals entering the population (i.e. IBD of a migrant allele = 0). Important: this means \(m\) is not a number but a proportion, and \(2Nm\) is the number of migrant alleles.

An interesting asside for models more generally: why choose to define \(m\) in this way, as a proportion of \(N\) rather than just the number of migrants? It helps to simplify the equations later when we have values like \(m^2\) and \(m/2N\). We can set them to zero to simplify the math if we assume these values are very small (i.e. migration is rare and populations are large).

Now we have to modify our equation because a migrant allele sampled in generation \(t\) from generation \(t-1\) cannot be IBD. Going back to our general equation above (see \(F_{t=2}\)):

\[F_{t} = \frac{1}{2N} + (1-\frac{1}{2N})F_{t-1}\]

We have to modify the second part where an allele was IBD in \(t-1\) to exclude new migrant alleles, because they cannot be IBD. The probability that any two alleles were not migrants is \((1-m)^2\), so with migration:

\[F_{t} = \frac{1}{2N} + (1-m)^2(1-\frac{1}{2N})F_{t-1}\]

What is IBD at equilibrium? As time approaches infinity:

\[F_{eq} = \frac{1}{2N}+(1-m)^2(1-\frac{1}{2N})F_{eq}\]

If we multiply through and estimate very small values as zero:

\[m^2 \approx \frac{m}{2N} \approx 0\]

then we can simplify to:

\[F_{eq} = \frac{1}{1+4Nm}\]

This is a classic equation in population genetics, so let's put it in words: The probability that two randomly selected alleles from a population are identical by decent is inversely proportional to the average number of migrants in each generation.

Drift + Mutation

Instead of migrants, what if new alleles arise through mutation? We can define a term \(\mu\) that is analogous to \(m\), so that the number of new alleles in a given generation is \(2N \mu\). Using the same logic and substituting, we get a similar equation:

\[F_{eq} = \frac{1}{1+4N \mu}\]

Drift + Migration + Mutation

We could use similar logic to simultaneously account for both migration and mutation, with \((1-m)^2\) as the probability of two alleles not being migrant alleles, \((1-\mu)^2\) as the probability of not sampling two mutations, and \(2(1-m)(1-\mu)\) as the probability of not sampling one mutation plus one migrant allele. Make the same assumptions to set small values = 0, simplifies to:

\[F_{eq} = \frac{1}{1+4N(m+\mu)}\]

Drift Summary

We saw above how IBD eventually goes to 1 in the absence of mutation or migration because an allele that hits \(p_i = 0\) is lost forever, eventually resulting in a single allele that is fixed in the population because there are no polymorphisms remaining.

Going back in our IBD model to \(t=0\) where each allele is unique, what is the probability that any particular allele will become the one that is fixed due to drift? \(\frac{1}{2N}\). Now what if all alleles were not unique? If each allele has probability \(\frac{1}{2N}\) of fixation, then an allele with \(n\) copies would have a probability of \(\frac{n}{2N}\).

In other words, the probability that an allele in a population will drift to fixation, in the absence of migration and mutation, is equal to its allele frequency \(p_i\).

Now imagine a population is split into two identical populations. Over time, new mutations arise at rate \(\mu\). In this case, the rate of sequence-level divergence between the two isolated populations is \(2\mu\) and the total divergence after \(t\) generations is:

\[d = 2\mu t\]

This is another classic model in population genetics and the basis of molecular evolution. But doesn't population size matter? Surprisingly not. Let's think about it.

We know that larger populations have more opportunities for new mutations, given that the number of mutations is \(2N \mu\). But the probability of fixation is lower in a larger population: \(\frac{1}{2N}\). The \(2N\) cancels out leaving just \(\mu\) in each generation. This is multiplied by the number of generations \(t\), and mutations can occur in either of the two populations, so \(d = \mu_1 t + \mu_2 t\), which simplifies to the equation above if the mutation rates are the same in each population.

Sequence variation is the foundation of modern applications in population genetics.

\(N_e\)

In the Identity By Descent (IBD) models above, we assumed that each allele was unique at generation \(t=0\) and that we sampled alleles randomly in each generation. But what if there is some degree of inbreeding in the population along the lines of those captured by Wright's F-statistics described earlier?

Under inbreeding, the probability of IBD is higher than under completely random mating. We can account for inbreeding by defining an effective population size \(N_e\) that is distinct from the census size \(N\). We define \(N_e\) as the census size that would give the same population parameters under random mating. In a population with a lot of inbreeding, \(N_e << N\), whereas the IBD model assumes \(N_e = N\)

In our IBD analysis we assumed populations were constant over time. What happens if population size varies? The effective population size is the harmonic mean:

\[\frac{1}{N_e} = \frac{1}{t}\sum_{j=0}^1\frac{1}{N_j}\]

Program the harmonic mean in R and see how \(N_e\) is affected by large vs. small numbers, compared to the average.

Natural Selection

So far we have ignored the most powerful force in population genetics. Let's incorporate the effect of natural selection. To do this, we can define fitness of a genotype as \(W_G\) mathematically. Fitness can be difficult to measure empirically, but it is common to estimate fitness through traits most closely related to survival and reproduction. We will see that fitness has a more nuanced definition that we can understand with math.

Let's say we want to model evolution in a population of \(g\) different genotypes. Each genotype has a frequency of \(P_g\) in the population, so of course \(\sum_g P_g = 1\)

We might think about natural selection acting in two non-exclusive ways.

Viability Selection

In the viability selection model, we are interested in survival, with some number of each genotype surviving. For example, let's say we are looking at overwinter survival in birds and there are three genotypes. We might model the pre-selection number of each genotype and the survival rate of each with vectors. For example:

Pre<-c(10,33,14,59,6,22)
Srv<-c(0.3,0.12,0.2,0.2,0.4,0.11)

Calculate the frequency \(P_g\) of each genotype.

Fecundity Selection

In the fecundity model, we are interested in reproduction -- the number of offspring produced by each genotype in each subsequent generation. Here, natural selection occurs at the end of each generation. We might model the number of each genotype using the Pre vector above, and then calculate \(N_g\) as the number of each offspring for each genotype vector by multiplying out the average number of surviving offspring per genotype:

Fec<-c(1.3,1.1,0.9,2,1.7,1.3)
Ng<-Pre*Fec
Ng
## [1]  13.0  36.3  12.6 118.0  10.2  28.6

What is the change in allele frequency due to selection in the two models above? Let's start by thinking of only a single genetic locus.

One-locus Model

If \(P_g\) is the frequency of a genotype before selection, the frequency after selection depends on the number of survivors (viability) or offspring (fecundity) after selection, relative to the mean. Let's call \(W_g\) the fitness (N surviving or N offspring) of each genotype and \(\bar W\) as the mean fitness (recall definition of the mean \(\bar W = \sum P_g W_g\)). The freuency of genotype \(g\) after selection is:

\[P'_g = P_g \frac{W_g}{\bar W}\] where \('\) denotes 'after selection'.

In this equation, we can define a genotype's fitness relative to the mean as:

\[w_g = \frac{W_g}{\bar W)\]

It is common for upper-case \(W\) to refer to absolute fitness and lower-case \(w\) to refer to relative fitness.

Note that \(\bar w_g = 1\). How can you prove this?

Genotype Model

Moving from a single locus to multiple loci quickly gets complicated. If we are looking at a simple model with perfect inheritence and no recombination (e.g. clonal/asexual reproduction), then we can just track each individual genotype using the same equation. This might be appropriate for a community of bacteria, or the bird survival model without considering fecundity fitness.

Let's consider a phenotype that might affect fitness, and predict how the phenotype will evolve after selection. A common way to define evolution is a change in the mean phenotype of a population:

\[\Delta \bar z = \bar z' - \bar z = \sum_g P_g \frac{w_g}{\bar w} z_g - \sum_g P_gz_g = \sum_g P_g w_g z_g - \bar w \sum_g P_gz_g\]

Compare these terms to the 'expectations' in the Probabilities tutorial.

Robertson-Price Identity

The equation above can also be re-written as:

\[\bar \Delta z = E(w_gz_g) - E(w_g)E(z_g) = C(w_g,z_g)\]

This is a powerful equation in evolution known as the Robertson-Price Identity. Think about what it is saying: the rate of change in a trait due to selection is equal to the covariance between the trait and relative fitness. We can also define this as:

\[ \Delta \bar z = \beta_{w,z}V(z_g)\]

In this equation, \(\beta\) is the regression coefficient from a linear model with \(z\) standardized to a mean of zero and variance of 1:

\[\beta_{w,z} = \frac{\sum_g{E[(z_g-\bar z)(w_g-\bar w)]}}{\sum_g E[(z_g-\bar z)^2]}\]

which is just \(C(z,w)/V(z)\).

Substitute the equation for \(\beta\) into the equation \(\Delta z = \beta_{w,z}\) to prove to yourself that \(\Delta z = C(w,z)\)

Selection gradient

Let's put the simplied Price Equation \(\Delta z = C(w,z)\) into words. The evolution of a population due to selection is measured as a change in the mean phenotype \(\Delta z\). This can be predicted by simply knowing the among-genotype covariance between a standardized trait value \(z\) and fitness. This equation is often called the predicted response to selection.

Technically we derived this from a one-locus model but there is a similar model for a quantitative trait if we replace the among-genotype (co)variance with additive genetic (co)variance. Additive genetic variance the the variance in phenotype (i.e. variance among genotypes) caused by the additive effects of many individual genes summed across all the different genetic loci. Dominance and epistasis also affect the phenotypes expressed by an individual but NOT the population mean change in response to selection.

Another formulation is the breeders equation:

\[R = h^2S\]

Where \(R\) is \(\Delta \bar z\) and \(S\) is defined below. The \(h^2\) parameter is the trait heritability, defined as:

Heritability

Heritability describes how much an offspring's phenotype resembles their parent. In the equation above, it can be defined as:

\[h^2 = V_A/V_T\]

The fraction of total phenotypic variation is \(V_T\) (i.e. variance calculated across all phenotypes of all individuals in the population).

What can cause \(V_A < V_T\)?

Multi-trait Evolution

We can expand the Breeder's equation to include multiple traits. If traits evolve independently, then we can just substitute vectors into our equation.

\[ \Delta \bar z = G \beta\] where \(\Delta z\) is a vector of \(N\) elements describing the response to selection of \(N\) traits. \(\beta\) is a vector of length \(N\) describing the selection gradient for each trait. \(G\) is a vector of \(N\) additive genetic variances for the traits.

G-matrix

But a big problem with the vector model is that traits are rarely independent but are more often genetically correlated. If you are confused about **genetic *correlations**, here's a simple way to think about it: if two traits are correlated then measuring one trait on a genotype allows you to predict the correlated trait better than random. In other words, there is additive genetic covariance among genotypes.

When traits are correlated, then we can't just look at selection on each trait independently, we also have to consider evolution in a trait that can occur if a correlated trait is under selection. To model this, we can replace the vector \(G\) with an \(N \times N\) matrix where the diagonals are still the \(N\) additive genetic variance for each trait and the off-diagonals are the additive genetic covariance for each trait.

It sounds complicated, but it's easier to understand in R:

Single Trait

Vz<-0.3 # Variance of a trait
Vt<-1 # Total Variance (scaled to 1 for simplicity)
Czw<-0.9 # Covariance between a trait and fitness

Beta<-Czw/Vz # Selection gradient

S<-Czw*Vt/Vz # Selection differential

paste(Beta,S)
## [1] "3 3"

Multiple Traits

Vz<-c(0.3,0.4,0.1) # Variances of 3 traits
Vt<-c(1,1,1) # Total Variance for each trait (scaled to 1 for simplicity)
Vg<-c(0.9,0.6,1.1) # Additive genetic variance of each trait

G<-matrix(c(  Vg[1], 0.6,  0.1, # First row of G-matrix
               0.6, Vg[2],-0.4, # Second Row of G-matrix
               0.1, -0.4, Vg[3]),nrow=3)   # Third Row of G-matrix

Beta<-c(3,-2,6) # Vector of selection gradients

Let's take a closer look at G:

G
##      [,1] [,2] [,3]
## [1,]  0.9  0.6  0.1
## [2,]  0.6  0.6 -0.4
## [3,]  0.1 -0.4  1.1

Note the diagonal values are Vz, the additive genetic variance for each trait.

Now calculated the expected response to selection:

dz<-G %*% Beta # A vector of the response to selection
dz
##      [,1]
## [1,]  2.1
## [2,] -1.8
## [3,]  7.7

Note that we are using the matrix multiplication %*%, not the simple multiplication symbol *. If you are foggy on your matrix algebra, this Kahn Academy video on YouTube might help. It's a bit complicated to calculate by hand, but luckily R makes it easy.

What happens if you use * instead of %*% (i.e. how is the matrix calculated)?

Note that \(\beta = P^{-1}S\) where \(P^{-1}\) is the transposed matrix of total phenotypic trait (co)variances. Think of it like \(V_T\) except that we have to measure the total phenotypic variance for each trait (diagonal) and the covariance between traits (off-diagonal).

Price Theorem

The Price Theorem is the \(E=mc^2\) of evolutionary biology. It is a formulation of evolution that produces a number of important equations including the Robertson-Price Identity (RPI), the Breeder's Equation, Fisher's Fundamental Theorem of Natural Selection. These are three foundational equations in evolutionary biology! There are several formulations, but we can derive one by going back to the RPI, which explains the expected change in a population phenotype due to selection:

\[\Delta \bar z = C(w_g,z_g)\]

When we derived the RPI, we didn't think about how traits are inherited. Going back to the definition of evolution

\[\Delta \bar z = z' - z = \sum_g P'_gz'_g - \sum_g P_gz_g \]

For the RPI, we only looked at the effect of selection. However we can define a new term \(\delta\) as \(z'_G - z_G\) to account for factors that cause the genetic trait values of offspring to differ from the parents. Substituting into the RPI equation gives us a formulation of the Price Theorem:

\[\Delta \bar z = C(w_G,z_G) + E(w_G\delta z_G)\]

The second component weights the fitness of the parent phenotype by factors that cause offspring to differ from their parents. Note that fitness is a component of both parts of the equation.

This equation holds for all types of evolution, not even just biological evolution! Any system with differential survival and inheritence can be modeled using this framework. You can read more in this Gardner 2020 paper.