library(ggplot2) # plotting library
source("http://bit.ly/theme_pub") # Set custom plotting theme
theme_set(theme_pub())
Central moments (CM) characterize the shape of a distribution. We’ve already seen two of these: the mean and the variance. Here are the first through fourth central moments:
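The definitions themselves are not reproduced in this text. As a sketch only, assuming the same \(n-1\) denominator that appears in the coefficient formulas further down, and counting the mean as the first moment (as the text does), they can be written as:

\[ Mean = \bar x = \frac{\sum x_i}{n} \]
\[ Var = \sigma^2 = \frac{\sum(x_i - \bar x)^2}{n-1} \]
\[ Skew = \frac{\sum(x_i - \bar x)^3}{n-1} \]
\[ Kurt = \frac{\sum(x_i - \bar x)^4}{n-1} \]

The only thing that changes from variance to skew to kurtosis is the power applied to each deviation \((x_i - \bar x)\).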
Notice that skew is like variance but with the deviations raised to the third power, meaning that individual terms can be negative or positive. If we have a few very small values (relative to the mean) then our skew will tend to be more negative. If we have a few very large values it will tend to be more positive. The magnitude of skew also scales with the spread of the distribution, so we often standardize by dividing by the standard deviation raised to the same power:
\[ C_{Skew} = Skew/\sigma^3 = \frac{\sum(x_i - \bar x)^3}{\sigma^3(n-1)}\]
A normal distribution will typically have a coefficient of skew near zero:
Negative \(C_{Skew}\) values characterize distributions with left skew
Positive \(C_{Skew}\) values characterize distributions with right skew
Values of \(C_{Skew}\) near zero characterize symmetrical distributions (e.g. ‘normal’)
Skew examples from Wikipedia.
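The coefficient of skew can be sketched in a few lines of R. This is a minimal illustration that follows the formula above with its \(n-1\) denominator; the function name `coef_skew` and the seed are hypothetical choices, not part of the original tutorial.

```r
# Coefficient of skew: sum of cubed deviations, standardized by sd^3 * (n - 1)
coef_skew <- function(x) {
  n <- length(x)
  dev <- x - mean(x)                 # deviations from the sample mean
  sum(dev^3) / (sd(x)^3 * (n - 1))
}

set.seed(1)                          # arbitrary seed, only for reproducibility
coef_skew(rnorm(1000))               # normal data: should be close to zero
coef_skew(exp(rnorm(1000)))          # long right tail: should be positive
```

Cubing keeps the sign of each deviation, which is why a long left tail pulls the coefficient below zero and a long right tail pushes it above.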
Kurtosis uses the 4th power, so every term will be positive, and values far from the mean (outliers) are weighted much more heavily than values close to the mean. As with skew, kurtosis will scale with the variance, so we usually look at the coefficient of kurtosis, standardized in an analogous way:
\[ C_{Kurt} = Kurt/\sigma^4 - 3 = \frac{\sum(x_i - \bar x)^4}{\sigma^4(n-1)} -3\]
We subtract 3 because, for a normal distribution, the standardized fourth moment \(Kurt/\sigma^4\) will typically be near 3, which puts the coefficient near zero:
Positive \(C_{Kurt}\) values characterize Leptokurtic distributions
Negative \(C_{Kurt}\) values characterize Platykurtic distributions
Values of \(C_{Kurt}\) near zero are Mesokurtic (i.e. ‘normal’)
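The coefficient of kurtosis can be sketched the same way. This is a minimal illustration of the formula above; the function name `coef_kurt`, the seed, and the use of a uniform sample as a flat-topped comparison are assumptions made for the example.

```r
# Coefficient of kurtosis: standardized fourth moment minus 3
coef_kurt <- function(x) {
  n <- length(x)
  dev <- x - mean(x)                 # deviations from the sample mean
  sum(dev^4) / (sd(x)^4 * (n - 1)) - 3
}

set.seed(1)                          # arbitrary seed, only for reproducibility
coef_kurt(rnorm(1000))               # normal data: should be close to zero
coef_kurt(runif(1000))               # flat, short-tailed data: should be negative
```

Because of the fourth power, a handful of extreme values can dominate the sum, so this coefficient tends to be noisy in small samples.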
Now that we know how to characterize non-normal distributions, let’s look at a few examples and then see how we can handle these in our statistical models.
We’ll start with a normal distribution for comparison:
X <- NormDat <- rnorm(1000) # 1000 draws from a standard normal
qplot(NormDat)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What if we exponentiate each value: \(e^{x_i}\)
ExpDat <- exp(rnorm(1000)) # exponentiated normal values
qplot(ExpDat)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Or multiply two randomly chosen values: \(x_i \times y_i\)
multNorm <- rnorm(1000) * rnorm(1000) # product of two independent normal draws
qplot(multNorm)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Or take the reciprocal: \(1/x_i\)
invNorm <- 1 / rnorm(1000) # reciprocal of normal draws
qplot(invNorm)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s try two more using different distributions that we know are not normal:
BinDat <- rbinom(n = 1000, size = 5, prob = 0.2) # binomial counts out of 5 trials
qplot(BinDat)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
PoisDat <- rpois(n = 1000, lambda = 2) # Poisson counts with mean 2
qplot(PoisDat)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
EXERCISE: Calculate the standard deviation and the skew and kurtosis coefficients for the data shown in each figure.
The first thing we can try is a data transformation. A transformation is just an equation that we apply to the data to try to make it look more normal. The log-transformation is very common and often useful. There is a good mathematical reason for this, if you remember that:
\[\log(x \times y) = \log(x) + \log(y)\]
So if our data are on an exponential (multiplicative) scale, then a log-transformation puts them on a linear (additive) scale.
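A quick numerical check of this identity, as a sketch: the variable names `x` and `y` and the seed are chosen for illustration only and are not part of the original tutorial.

```r
set.seed(1)                     # arbitrary seed, only for reproducibility
x <- exp(rnorm(1000))           # values on a multiplicative scale
y <- exp(rnorm(1000))

# The log of a product equals the sum of the logs (up to floating-point error)
all.equal(log(x * y), log(x) + log(y))

# Back on the log scale, x is just the original normal sample again
c(mean = mean(log(x)), sd = sd(log(x)))   # should be roughly 0 and 1
```

This is why the log works so well for data like `ExpDat`: taking the log simply undoes the exponentiation that produced the skew.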
qplot(log(ExpDat))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(log(multNorm))
## Warning in log(multNorm): NaNs produced
## Warning in log(multNorm): NaNs produced
## Warning in log(multNorm): NaNs produced
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 470 rows containing non-finite values (stat_bin).
qplot(log(invNorm))
## Warning in log(invNorm): NaNs produced
## Warning in log(invNorm): NaNs produced
## Warning in log(invNorm): NaNs produced
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 523 rows containing non-finite values (stat_bin).
qplot(log(BinDat))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 373 rows containing non-finite values (stat_bin).
qplot(log(PoisDat))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 132 rows containing non-finite values (stat_bin).
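We still have a bit of skew in a couple of these but overall this is a big improvement. Even with the skew the data are much closer to normal than without the transformation. Note the warnings, though: the log of a negative number is undefined (NaN) and the log of zero is -Inf, so those values are dropped from the histograms.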
qplot(sqrt(ExpDat))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(sqrt(multNorm))
## Warning in sqrt(multNorm): NaNs produced
## Warning in sqrt(multNorm): NaNs produced
## Warning in sqrt(multNorm): NaNs produced
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 470 rows containing non-finite values (stat_bin).
qplot(sqrt(invNorm))
## Warning in sqrt(invNorm): NaNs produced
## Warning in sqrt(invNorm): NaNs produced
## Warning in sqrt(invNorm): NaNs produced
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 523 rows containing non-finite values (stat_bin).
Other common transformations include the square root (shown above) and the arcsine.
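The arcsine transformation is named here but not shown. As a minimal sketch, assuming it refers to the arcsine square-root transformation that is commonly applied to proportions (the original does not say), and using a hypothetical variable `p` built from binomial counts like those in `BinDat`:

```r
set.seed(1)                                         # arbitrary seed, only for reproducibility
p <- rbinom(n = 1000, size = 5, prob = 0.2) / 5     # proportions between 0 and 1

# Arcsine square-root transformation: defined only for values in [0, 1]
asin_p <- asin(sqrt(p))

range(asin_p)                                       # stays within 0 and pi/2
```

Unlike the log, this transformation is defined at zero, so no observations are dropped; it does require the data to be proportions.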
Another option is to use a non-parametric model that doesn’t make assumptions about the distribution of the population. There are many examples of these, including those covered in another tutorial on resampling and permutation.
Permutation tests don’t assume any particular form for the underlying distribution; they assume only that the observations are exchangeable under the null hypothesis.
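To give a flavour of the idea, here is a minimal sketch of a two-sample permutation test on skewed data. The groups `a` and `b`, their sizes, the number of shuffles, and the choice of the difference in means as the test statistic are all assumptions made for illustration; the resampling tutorial mentioned above may do this differently.

```r
set.seed(1)                              # arbitrary seed, only for reproducibility
a <- exp(rnorm(30))                      # skewed sample, hypothetical group A
b <- exp(rnorm(30, mean = 0.5))          # skewed sample, hypothetical group B

obs <- mean(b) - mean(a)                 # observed difference in means
pooled <- c(a, b)
n_a <- length(a)

# Shuffle the group labels many times and recompute the difference each time
perm_diff <- replicate(5000, {
  idx <- sample(length(pooled), n_a)     # random positions assigned to "group A"
  mean(pooled[-idx]) - mean(pooled[idx])
})

# Two-sided p-value: share of shuffled differences at least as extreme as observed
p_value <- mean(abs(perm_diff) >= abs(obs))
p_value
```

The null distribution here is built from the data themselves, so no normal approximation is needed even though both samples are strongly right-skewed.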