Models in population genetics generally come in two flavours: genotype models and allele models.
Models considering genotypes define parameters for each combination of alleles. For example, the genotype frequencies in a one-locus, two-allele diploid model would be:
\[P_{AA} + P_{Aa} + P_{aa} = 1\]
Here there are three frequencies (a.k.a. sampling probabilities) corresponding to the three possible combinations of two alleles at the locus of interest. Technically, we only need to define 2 parameters because the third can be inferred by subtracting the other two from 1. We can model any number of genotypes \(N\) in this way:
\[\sum_{i=1}^N P_i =1\]
In contrast to 3 parameters for the genotype model, a one-locus, two-allele diploid allele model requires only 2 parameters corresponding to the frequency of the two alleles:
\[p + q = 1\]
where \(p\) is the frequency of the \(A\) allele and \(q\) is the frequency of the \(a\) allele. Technically, we only need to define 1 parameter (\(p\)) because we know that \(q = 1-p\).
Under Hardy-Weinberg Equilibrium (HWE) assumptions, diploid genotypes are created by randomly sampling two alleles, leading to four possible outcomes:
Draw 1 | Draw 2 |
---|---|
A | A |
A | a |
a | A |
a | a |
Under HWE draws are independent, so we multiply the probabilities together and we know that they must sum to 1:
\[P_{AA} + P_{Aa} + P_{aA} + P_{aa} = 1\] substituting: \[p\times p + p\times q + q\times p + q\times q = 1\]
If \(P_{Aa}\) is equivalent to \(P_{aA}\) then this simplifies to:
\[p^2+2pq+q^2=1\]
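As a quick check, here is a minimal R sketch (the allele frequency is an arbitrary example value) computing the expected HWE genotype frequencies and confirming that they sum to 1:
p<-0.7 # frequency of the A allele (arbitrary example)
q<-1-p # frequency of the a allele
Freqs<-c(AA=p^2, Aa=2*p*q, aa=q^2) # expected genotype frequencies under HWE
Freqs
sum(Freqs) # should equal 1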
Let's try to add more realism to these models by adding more alleles.
With the genotype model we can see this quickly gets very complicated. We can define a frequency parameter for each pair of alleles. For example, with three alleles \(A_1\), \(A_2\), and \(A_3\), if \(P_{A1,A2}\) represents the frequency of the genotype carrying one copy of \(A_1\) and one copy of \(A_2\), then:
\[P_{A1,A1} + P_{A1,A2} + P_{A1,A3} + P_{A2,A2} + P_{A2,A3} + P_{A3,A3} = 1\]
Similarly, under the HWE allele model and assuming heterozygotes are equivalent, the expected frequencies are:
\[p^2 + 2pq + q^2 + 2pr + 2qr + r^2 = 1\]
More generally, for \(N_A\) alleles the number of genotypes in a diploid model is given by the binomial coefficient \(\binom{N_A+1}{2}\):
\[\frac{N_A(N_A+1)}{2}\]
One interesting way to think about the equation is as the sum of integers from \(1\) to \(N_A\). Let's explore this by writing a function in R:
Genotypes<-function(N=2){
  return(N*(N+1)/2) # number of diploid genotypes for N alleles at one locus
}
Genotypes(2) # Number of genotypes for 2 alleles
## [1] 3
Genotypes(3) # Number of genotypes for 3 alleles
## [1] 6
Genotypes(10) # Number of genotypes for 10 alleles
## [1] 55
Write R code to compare these values with the sum of integers \(1\) to \(N_A\) to show that they are the same
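One way to start (a sketch, not the only solution), using the Genotypes() function defined above:
NA_values<-c(2,3,10) # example allele numbers
sapply(NA_values,Genotypes) # formula N*(N+1)/2
sapply(NA_values,function(x){sum(1:x)}) # sum of integers 1 to N; should match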
Let's go back to the two-allele model but add a second locus with alleles \(B\) and \(b\). Now we need genotype parameters for each combination of \(A/a\) and \(B/b\). Recall your Punnett Squares:
. | A | a |
---|---|---|
B | AB | aB |
b | Ab | ab |
Now assign a fraction to each combination:
\[P_{AB} + P_{Ab} + P_{aB} + P_{ab} = 1\] NOTE: this is similar to the two-allele model except that \(P_{Aa} = P_{aA}\) in the single-locus model BUT \(P_{Ab} \ne P_{aB}\) in this two-locus model, where each subscript gives one allele at each of the two loci.
Explain why this is true, based on your understanding of biology.
Now add a 3rd locus with two alleles \(C\) and \(c\). To find all the genotypes we could make a new Punnett Square for each of the four genotypes as rows and the C locus as columns:
. | C | c |
---|---|---|
AB | ABC | ABc |
Ab | AbC | Abc |
aB | aBC | aBc |
ab | abC | abc |
This gives us 8 different genotype frequencies in the population.
More generally, for \(K\) loci, each with two alleles, the number of these genotypes (one allele per locus, as above) will be:
\[2^{K}\]
This makes sense if you think about Punnett Squares. The table for \(ABC\) above can be re-written as a 3 dimensional table of A x B x C. This can be complicated to write out, but R makes it easy:
LocusA<-c("A","a")
LocusB<-c("B","b")
LocusC<-c("C","c")
# Create a Table
expand.grid(LocusA,LocusB,LocusC)
## Var1 Var2 Var3
## 1 A B C
## 2 a B C
## 3 A b C
## 4 a b C
## 5 A B c
## 6 a B c
## 7 A b c
## 8 a b c
# Create a 3D array of the three loci
array(c(LocusA,LocusB,LocusC),dim=c(2,1,3))
## , , 1
##
## [,1]
## [1,] "A"
## [2,] "a"
##
## , , 2
##
## [,1]
## [1,] "B"
## [2,] "b"
##
## , , 3
##
## [,1]
## [1,] "C"
## [2,] "c"
Note the dim parameter indicating that we want three of the 2x1 vectors. In other words, each locus is a 1D vector of length 2 and we have 3 of them.
Understanding these data structures can be very handy for models and simulations. This includes null and neutral models for statistical tests of empirical data.
To move from genotypes to phenotypes, we can assign values to represent the effect of each genotype, or we can empirically measure the effect of a gene on a trait. For example, if we are looking at an allele for flower colour, we might assign a value ranging from 0 to 1 to describe the amount of pigmentation, ranging from white (0) to pink (0.5) to red (1). This makes it easy to look at the mean and variance of a phenotype based on its allele and/or genotype frequencies.
Let's model an organism whose pigmentation (e.g. flower, scale, fur, feather) ranges from light to dark and is completely determined by two loci, each with two alleles. Assume HWE and that the trait values are additive with each allele at each locus contributing equally. Assign a value of 1 to the allele that increases pigmentation and 0 to an allele that doesn't. These numbers are arbitrary but 1 and 0 will simplify the math in these examples. We'll use the convention \(A\)/\(B\) for pigmented and \(a\)/\(b\) for non-pigmented.
What is the phenotypic value for each genotype?
Genotype | Trait Value |
---|---|
AABB | 2 |
AABb | 1.5 |
AAbb | 1 |
AaBB | 1.5 |
AaBb | 1 |
Aabb | 0.5 |
aaBB | 1 |
aaBb | 0.5 |
aabb | 0 |
NOTE: Don't confuse upper/lower-case with dominant/recessive. This is an additive model, so each allele contributes equally (co-dominance).
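For a quick check, here is a minimal R sketch that builds this table with expand.grid (the object names LocusAval, LocusBval, and Genos are just illustrative):
# Additive single-locus genotype values (each A or B allele adds 0.5)
LocusAval<-c(AA=1,Aa=0.5,aa=0)
LocusBval<-c(BB=1,Bb=0.5,bb=0)
# All 9 two-locus genotype combinations
Genos<-expand.grid(LocusA=names(LocusAval),LocusB=names(LocusBval),
                   stringsAsFactors=FALSE)
# Additive trait value = sum of the two single-locus values
Genos$Trait<-LocusAval[Genos$LocusA]+LocusBval[Genos$LocusB]
Genos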
What is the expected population mean? It depends on the genotype frequencies, which are determined by allele frequencies under HWE. Let's think about each locus independently.
This is where our moments and central moments come in handy.
Recall that \(E(x) = \bar x = \sum P(x_i)x_i\), meaning the sum of the frequency of each genotype \(i\) multiplied by their trait values. For any locus, the mean phenotype is:
\[E(x) = P_{AA}(1) + P_{Aa}(0.5) + P_{aa}(0)\] Under HWE:
\[= p^2(1) + 2pq(0.5) + q^2(0)\] which so elegantly simplifies to \(p\)
Prove this to yourself by substituting \(q = 1-p\).
Take a minute to think about this. For an additive model under HWE with two alleles and phenotype value \(1\), the average phenotype is the fraction of \(A\) alleles in the population, which is \(p\).
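A quick numerical check in R (the allele frequency is arbitrary):
p<-0.3 # arbitrary allele frequency
q<-1-p
p^2*1 + 2*p*q*0.5 + q^2*0 # expected mean phenotype; should equal p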
Under HWE, each locus is independent. In an additive model, we can predict the mean phenotype for any number of two-allele loci by adding their proportions!
\[E(x) = \sum p_i\]
So simple! Now you can see why population genetics modellers like to make assumptions. Breaking those assumptions adds some complexity (a.k.a. realism).
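Before moving on, here is a small simulation sketch of this prediction under HWE with three additive loci (allele frequencies chosen arbitrarily); the observed mean phenotype should be close to the sum of the \(p_i\):
set.seed(123)
p<-c(0.2,0.5,0.8) # allele frequencies at three loci (arbitrary)
Nind<-10000 # number of diploid individuals
# For each locus, sample two allele copies per individual (1 = 'A', 0 = 'a')
# and score the genotype as the mean of its two allele values (AA=1, Aa=0.5, aa=0)
Pheno<-sapply(p,function(p_i){
  (rbinom(Nind,1,p_i)+rbinom(Nind,1,p_i))/2
})
mean(rowSums(Pheno)) # observed mean phenotype
sum(p) # predicted mean under the additive HWE model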
This term/concept is often confusing for students, but Linkage Disequilibrium (LD) is just a fancy term meaning that alleles at two or more loci do not assort independently, as predicted under HWE. One way this can happen is if loci do not segregate independently during meiosis. This is called gametic disequilibrium (GD), and mechanisms of GD can include physical linkage on a chromosome or non-additive effects on gamete fitness.
TERMINOLOGY NOTE: GD is a type of LD, but we will see there are other forms of LD that don't involve GD. Be careful not to confuse these.
Remember that if two variables are non-independent, then they have a non-zero covariance. Consider the covariance between two loci, which you know as the second central moment:
\[C_{AB}=E[(x_{Ai}-\bar x_A)(x_{Bi} -\bar x_B)]\]
\[= E(x_{Ai}x_{Bi})-\bar x_A \bar x_B\]
We just showed above that the mean of a population trait value for locus \(L\) is equal to the proportion of the \(A\) allele at that locus \(\bar x_L = p_L\), so:
\[E(x_{Ai}x_{Bi})= p_Ap_B + C_{AB}\]
When loci are independent, then covariance is zero by definition, and
\[E(x_{Ai}x_{Bi})=E(x_{Ai})E(x_{Bi})=p_Ap_B\]
Therefore, we can think of linkage disequilibrium as a departure from the assumption that loci are independent.
Another way to think of it is in terms of the probability of producing homozygotes vs heterozygotes. If both loci are independent then:
\[P_{AB} P_{ab} - P_{Ab} P_{aB} = 0\]
Prove this to yourself by converting gamete probabilities to allele sampling (\(P_{AB} = p_Ap_B\), \(P_{Ab} = p_Aq_B\), etc.)
If loci are not independent, then this equation no longer holds, so we can define linkage disequilibrium as a deviation from this expectation:
\[D_{AB} = P_{AB} P_{ab} - P_{Ab} P_{aB}\]
We can re-arrange to define frequencies of each gamete by substituting allele probabilities and simplifying to get:
\[P_{AB} = p_Ap_B + D_{AB}\] \[P_{Ab} = p_Aq_B - D_{AB}\] \[P_{aB} = q_Ap_B - D_{AB}\] \[P_{ab} = q_Aq_B + D_{AB}\]
Compare the equation for \(P_{AB}\) with the covariance equation, and you can see that the gametic disequilibrium parameter \(D_{AB}\) is the same as the covariance \(C_{AB}\). That's right, LD is just a non-zero covariance between loci!
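A small R sketch (the gamete frequencies are made-up numbers) showing that \(D_{AB}\) computed from gamete frequencies matches the covariance form \(P_{AB} - p_Ap_B\):
# Made-up gamete (haplotype) frequencies that sum to 1
P_AB<-0.4; P_Ab<-0.1; P_aB<-0.2; P_ab<-0.3
pA<-P_AB+P_Ab # frequency of the A allele
pB<-P_AB+P_aB # frequency of the B allele
D<-P_AB*P_ab - P_Ab*P_aB # linkage disequilibrium
D
P_AB - pA*pB # covariance form; should match D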
You may be used to thinking about covariances and correlations as a bivariate plot of two continuous variables. This can be a good analogy if you think of your plot in terms of four quadrants corresponding to the four combinations of 2 x 2 alleles. For example, imagine quadrants from the Punnett square with \(a < A\) along the x-axis and \(b < B\) along the y-axis.
. | . | . |
---|---|---|
B | aB | AB |
b | ab | Ab |
. | a | A |
When traits are uncorrelated, there is an equal probability of sampling homozygotes and heterozygotes.
library(ggplot2) # qplot() and the geom_* functions come from ggplot2
X<-rnorm(100)
Y<-rnorm(100)
qplot(X,Y) + geom_vline(xintercept=0,colour="red") + geom_hline(yintercept=0,colour="red")
Furthermore, you can think of a change in allele frequency as a change in the mean along the x or y axis.
When traits are correlated, there is an unequal probability of sampling homozygotes vs heterozygotes, with the difference defined by \(C_{x,y} = D_{x,y}\).
When \(D > 0\) then homozygotes are more common than random:
X<-rnorm(100)
Y<-rnorm(100)+X
qplot(X,Y) + geom_vline(xintercept=0,colour="red")+ geom_hline(yintercept=0,colour="red")
When \(D < 0\) then heterozygotes are more common than random:
Y<-rnorm(100)-X
qplot(X,Y) + geom_vline(xintercept=0,colour="red")+ geom_hline(yintercept=0,colour="red")
Let's go back yet again to the finding that the mean phenotype for a single locus model is \(p_A\). In the probability lecture, we saw that a covariance of a trait with itself is the variance. So let's consider the covariance between two alleles at the same locus but on homologous chromosomes, where the \(A\) allele has a trait value of \(1\) and the \(a\) allele has a value of 0:
\[C_{xy} = \sum P(x_i,y_i)(x_i-\bar x)(y_i-\bar y)\]
In this one-locus, two-allele model, \(P(x_i,y_i)\) is the genotype frequency in the population, \(x_i-\bar x\) is the deviation for the allele on chromosome 1 and \(y_i-\bar y\) is the same on chromosome 2:
\[C_{AA} = P_{AA}(1-p_A)(1-p_A) + P_{aa}(0-p_A)(0-p_A) + P_{Aa}(1-p_A)(0-p_A) + P_{aA}(0-p_A)(1-p_A)\]
You can see how the moment deviations \(1-p_A\) and \(0-p_A\) are multiplied (see covariance equation), and then multiplied by their frequency \(P_{..}\) to generate the mean expectation -- the covariance. This simplifies to:
\[C_{A,A} = P_{AA}q_A^2 + P_{aa}p_A^2 - P_{Aa}p_Aq_A - P_{aA}q_Ap_A\]
and rearranges to give the genotype frequencies:
\[P_{A,A} = p_A^2 + C_{A,A}\] \[P_{a,a} = q_A^2 + C_{A,A}\] \[P_{A,a} = P_{a,A} = p_Aq_A - C_{A,A}\]
Here again, the covariance is the linkage disequilibrium, and it defines the frequency of homozygotes relative to heterozygotes in the population. If LD is positive, then there are more homozygotes than expected. If LD is negative, then there is an excess of heterozygotes.
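A small numerical check of this one-locus result (the genotype frequencies are made up): compute \(C_{A,A}\) from the formula above and compare it to the homozygote excess \(P_{AA} - p_A^2\):
# Made-up genotype frequencies with an excess of homozygotes
P_AA<-0.5; P_Aa<-0.2; P_aa<-0.3 # P_Aa here is the TOTAL heterozygote frequency
pA<-P_AA+P_Aa/2 # frequency of the A allele
qA<-1-pA
# Covariance between the two allele copies (allele values A=1, a=0)
C_AA<-P_AA*qA^2 + P_aa*pA^2 - P_Aa*pA*qA
C_AA
P_AA - pA^2 # should match: excess of AA homozygotes relative to HWE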
Wright's inbreeding coefficient \(f\) defines the rate of inbreeding in a population -- the probability that an offspring is produced by self-fertilization.
What is the probability of producing an offspring by cross-fertilization?
Given the probabilities for selfing and outcrossing, we can define expected genotype frequencies under inbreeding:
\[P_{AA} = (1-f)p_A^2 + fp_A\] \[P_{aa} = (1-f)q_A^2 + fq_A\]
\[P_{Aa} = P_{aA} = (1-f)p_Aq_A\] From the equation you can see intuitively that \(f\) quantifies the reduction of heterozygosity of a population relative to HWE (i.e. random mating = no inbreeding).
Another way to think about this is to compare the observed heterozygosity to what would be expected under HWE, resulting in another formulation of \(f\):
\[f = \frac{H_E-H_O}{H_E} = 1-\frac{H_O}{H_E}\]
Prove this equation to yourself by substituting observed and expected heterozygote frequencies in the diploid model.
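Here is a minimal R sketch computing \(f\) from genotype counts at one locus (the counts are made up):
# Made-up genotype counts at one locus
Counts<-c(AA=60,Aa=20,aa=20)
N<-sum(Counts)
p<-as.numeric((2*Counts["AA"]+Counts["Aa"])/(2*N)) # frequency of A
H_O<-as.numeric(Counts["Aa"]/N) # observed heterozygosity
H_E<-2*p*(1-p) # expected heterozygosity under HWE
f<-1-H_O/H_E
f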
In empirical studies that use genetic markers, Wright's F-statistic is usually capitalized, but it is equivalent and can be generalized beyond the one-locus, two-allele model shown above. Also known as fixation indices, Wright's F-statistics come in a few flavours based on the type of inbreeding. To understand these, think of a large study where you sample many individual genotypes at several locations across the species' range. Inbreeding can be measured at three levels:
\(F_{IS}\) compares the average heterozygosity of individuals to the heterozygosity expected under random mating within a subpopulation. This is often used as a proxy for selfing rates in plants, but it can also be elevated by non-random mating among relatives.
\[F_{IS} = 1 - \frac{H_I}{E(H_S)}\]
You can see from the equation that high homozygosity of individual genotypes (i.e. low heterozygosity \(H_I\)) reduces the ratio \(H_I/E(H_S)\), so that inbreeding increases \(F_{IS}\) towards its theoretical maximum of \(1\).
\(F_{IT}\) compares the average heterozygosity within individuals to the total expectation by randomly selecting gametes from across the metapopulation.
\[F_{IT} = 1 - \frac{H_I}{E(H_T)}\]
\(F_{ST}\) is a measure of population structure, because it compares the expected heterozygosity of subpopulations with the heterozygosity expected under random mating across the entire metapopulation:
\[F_{ST} = 1 - \frac{E(H_S)}{E(H_T)}\]
Note how these equations are related to each other. Rearranging you can see that:
\[(1-F_{IT}) = (1-F_{IS})(1-F_{ST})\] \[=\frac{H_I}{E(H_T)}\]
\[=\frac{H_I}{E(H_S)}\times \frac{E(H_S)}{E(H_T)}\]
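A quick numerical check of this identity in R, using made-up heterozygosities (\(H_I < H_S < H_T\)):
# Made-up average heterozygosities at the three levels
H_I<-0.20 # observed within individuals
H_S<-0.30 # expected within subpopulations
H_T<-0.40 # expected in the total metapopulation
F_IS<-1-H_I/H_S
F_IT<-1-H_I/H_T
F_ST<-1-H_S/H_T
1-F_IT
(1-F_IS)*(1-F_ST) # should match 1-F_IT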
NOTE: All three of these statistics measure inbreeding. Often we think of inbreeding as selfing or mating between close relatives, but these are only two types of inbreeding. Most species show some degree of inbreeding because individuals who are geographically closer to each other are more likely to mate than random, and they are more likely to be related than random. Population structure is caused by some form of genetic inbreeding.
How do we calculate the expected heterozygosity for a locus with 2 or more alleles?
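One common approach: under random mating, the expected heterozygosity is one minus the probability of drawing two copies of the same allele, \(H_E = 1 - \sum_i p_i^2\). A quick R sketch with made-up allele frequencies:
# Made-up frequencies for a locus with 4 alleles
p<-c(0.4,0.3,0.2,0.1)
H_E<-1-sum(p^2) # expected heterozygosity under HWE
H_E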
Wright's F-statistics quantify different levels of inbreeding, as defined above. Don't confuse these with Fisher's F-statistics, which appear in ANOVA and other linear models. Fisher's F is a value that follows an F-distribution. Wright's F are population parameters. However, the similarities are interesting to note: both use fractions comparing variation within vs among populations.