Data Science is a relatively new field of study that merges computer science and statistics to answer questions in other domains (e.g. business, medicine, biology, psychology). Data Science as a discipline has grown in popularity in response to the rapid rate of increase in data collection and publication.
Data Science often involves ‘Big Data’, which doesn’t have a strict quantitative definition but will usually have one or more of the following characteristics:
High Volume – large file sizes with many observations.
Wide Variety – many different types of data.
High Velocity – data accumulates at a high rate.
Compromised Veracity – data quality issues must be addressed otherwise downstream analyses will be compromised.
Question: What are some examples of ‘big data’ in Biology?
Answer: There are many types of biological data that could be listed, but medical records, remote sensing data, and ‘omics data are common examples of ’big data’ in biology.
In biology, it can be helpful to think of Data Science as a continuous life-cycle with multiple stages:
Data Science Life-Cycle
Hypothesize – Make initial observations about the natural world, or insights from other data, that lead to testable hypotheses. Your core biology training is crucial here.
Collect – This may involve taking measurements yourself, manually entering data that is not yet in electronic format, requesting data from authors of published studies, or importing data from online sources. Collecting data is a crucial step that is often done poorly.
Correct – Investigate the data for quality assurance, to identify and fix potential errors. Start to visualize the data to look for outliers, odd frequency distributions, or nonsensical relationships among variables.
Explore – Try to understand the data, where they come from, and potential limitations on their use. Continue visualizing data; this may cause you to modify your hypotheses slightly.
Model – Now that hypotheses are clearly defined, apply statistical tests of their validity.
Report – Use visualizations along with the results of your statistical tests to summarise your findings.
Repeat – Return to step 1.
In this chapter, we focus mainly on coding in R for steps 2, 3, and 6. Step 5 requires a firm understanding of statistics, which is the focus of the book R STATS Crash Course for Biologists. Visualizations in Steps 3 & 4 were covered in earlier chapters. Step 1 requires a good understanding of the study system, as covered in a typical university degree in the biological sciences.
Data collection and management are crucial steps in the Data Science Life-Cycle. Read the paper by Wu et al. (2022), baRcodeR with PyTrackDat: Open-source labelling and tracking of biological samples for repeatable science (https://doi.org/10.1111/2041-210X.13405). Pay particular attention to the ‘Data Standards’ section. The baRcodeR and PyTrackDat programs and their application to current projects may also be of interest.
Setup
The tidyverse library is a set of packages created by the same developers responsible for RStudio. It includes a number of packages and functions that improve on base R. The name is based on the idea of living in a data universe that is neat and tidy.
The book R for Data Science by Hadley Wickham & Garrett Grolemund is an excellent resource for learning about the tidyverse. In this chapter we’ll touch on a few of the main packages and functions.
Protip In general, any book by Hadley Wickham that you come across is worth reading if you want to be proficient in R.
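If you haven’t used the tidyverse before, it has to be installed once before it can be loaded:

install.packages("tidyverse")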
While we wait for the tidyverse to install, let’s consider a few general principles of data science.
2D Data Wrangling
The dplyr library in R has many useful features for importing and reorganizing your data for steps 2, 3 and 4 in the Data Science Life-Cycle.
library(dplyr)
Warning: package 'dplyr' was built under R version 4.3.2
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Note: This message (not an error) informs us that the dplyr package uses function names that are the same as functions in the base or stats packages in R. These base/stats functions are ‘masked’, meaning that when you call one of them (e.g. filter()), R will run the dplyr version rather than the stats version.
library(tidyr)
We’ll work with our FallopiaData.csv dataset and remind ourselves of the structure of the data.
tibbles and readr()
We looked at data.frame objects in the first chapter as an expansion of matrices with a few additional features like column and row names. A tibble is the tidyverse version of the data.frame object and includes a few more useful features. To import a dataset to a tibble instead of a data.frame object, we use read_csv instead of read.csv.
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.3.2
Warning: package 'readr' was built under R version 4.3.2
Warning: package 'forcats' was built under R version 4.3.2
Warning: package 'lubridate' was built under R version 4.3.2
You may also see a message about conflicts here, with the filter() and lag() functions from dplyr masking the functions from the stats package.
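Now we can import the dataset with read_csv(), which prints a summary of the column types. The file path below is an assumption; adjust it to wherever FallopiaData.csv is saved:

Fallo<-read_csv("FallopiaData.csv")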
Rows: 123 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Scenario, Nutrients, Taxon
dbl (10): PotNum, Symphytum, Silene, Urtica, Geranium, Geum, All_Natives, Fa...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This file is an example of a 2-dimensional data set, which is common in biology. 2D datasets have the familiar row x column layout used by spreadsheet programs like Microsoft Excel or Google Sheets. There are some exceptions, but data in this format should typically follow 3 rules:
Each cell contains a single value
Each variable must have its own column
Each observation must have its own row
Making sure your data are arranged this way will usually make it much easier to work with.
filter() Rows
The filter() function will subset observations based on values of interest within particular columns of our dataset. For example, we may want to filter the rows (i.e. pots) that had at least 70 \(g\) of total biomass.
Pot1<-filter(Fallo, Total>=70)
head(Pot1)
# A tibble: 6 × 13
PotNum Scenario Nutrients Taxon Symphytum Silene Urtica Geranium Geum
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 60 high high bohem 7.77 51.4 5.13 10.1 0.37
2 67 gradual high japon 2.92 25.2 19.1 23.1 0.06
3 70 gradual high japon 10.1 47.0 18.6 0.64 0.82
4 86 gradual high bohem 2.93 60.9 4.11 6.67 1.27
5 95 extreme high japon 4.92 25.9 40.3 4.92 0.07
6 103 extreme high japon 6.92 49.4 0 10.3 0.42
# ℹ 4 more variables: All_Natives <dbl>, Fallopia <dbl>, Total <dbl>,
# Pct_Fallopia <dbl>
rename() Columns
There are different options to change the names of columns in your data. In base R you can use the names() function with the square bracket index []:
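A minimal sketch using the Fallo data (assuming Taxon is the 4th column, as in the output above; the new name is just for illustration):

# Base R: rename the 4th column with names() and the square bracket index
names(Fallo)[4]<-"Species"
# dplyr: rename() uses the form new_name = old_name; here we change the name back
Fallo<-rename(Fallo, Taxon = Species)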
Use the arrange() function to sort the rows of your data based on the values in one or more columns. For example, let’s re-arrange our FallopiaData.csv dataset based on Taxon (a string denoting the species of Fallopia used) and Total (a float denoting the total biomass in each pot).
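Here is a minimal sketch, assuming the Fallo tibble created above. Note that arrange() returns a re-ordered copy rather than changing Fallo itself, and that select() (covered next) keeps only the named columns for easier viewing:

# Sort by species, then by total biomass within each species
Sorted<-arrange(Fallo, Taxon, Total)
# Keep five of the 13 columns
X<-select(Fallo, PotNum, Scenario, Nutrients, Taxon, Total)
head(X)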
# A tibble: 6 × 5
PotNum Scenario Nutrients Taxon Total
<dbl> <chr> <chr> <chr> <dbl>
1 1 low low japon 67.1
2 2 low low japon 50.2
3 3 low low japon 61.1
4 5 low low japon 41.7
5 6 low low japon 41.8
6 7 low low japon 48.3
You can also use the colon : to select a range of columns:
X<-select(Fallo, PotNum:Taxon, Total)
head(X)
# A tibble: 6 × 5
PotNum Scenario Nutrients Taxon Total
<dbl> <chr> <chr> <chr> <dbl>
1 1 low low japon 67.1
2 2 low low japon 50.2
3 3 low low japon 61.1
4 5 low low japon 41.7
5 6 low low japon 41.8
6 7 low low japon 48.3
Exclude columns with -
X<-select(Fallo, -PotNum:Taxon, -Total)
Warning in x:y: numerical expression has 12 elements: only the first used
Oops, what generated that warning? Take a careful look at the warning message and see if you can figure it out.
The problem is we are using the range of columns between PotNum and Taxon, but in one case we are excluding and the other we are including. We need to be consistent:
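One consistent version, as a sketch, excludes the whole range along with Total:

# Exclude the range of columns from PotNum to Taxon, and also exclude Total
X<-select(Fallo, -(PotNum:Taxon), -Total)
head(X)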
Suppose we want to add a new column to our data frame or tibble. For example, we can calculate the sum of the biomass of Urtica and Silene only. In base R, we could use $ to select the columns from the data frame.
X<-Fallo
X$UrtSil<-X$Urtica+X$Silene
In the dplyr package we can use the mutate() function.
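A sketch of the mutate() version, assuming the column names from the dataset above:

# mutate() adds the new UrtSil column without needing the $ notation
X<-mutate(Fallo, UrtSil = Urtica + Silene)

The related group_by() and summarise() functions calculate summary statistics within groups of rows. The grouped output below can be reproduced along these lines (a sketch, assuming Total is the variable being summarised):

Grouped<-group_by(Fallo, Taxon, Scenario, Nutrients)
summarise(Grouped, Mean=mean(Total), SD=sd(Total))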
`summarise()` has grouped output by 'Taxon', 'Scenario'. You can override using
the `.groups` argument.
# A tibble: 10 × 5
# Groups: Taxon, Scenario [10]
Taxon Scenario Nutrients Mean SD
<chr> <chr> <chr> <dbl> <dbl>
1 bohem extreme high 58.3 7.34
2 bohem fluctuations high 58.4 9.20
3 bohem gradual high 57.5 9.34
4 bohem high high 60.3 8.68
5 bohem low low 48.0 8.86
6 japon extreme high 57.2 10.9
7 japon fluctuations high 56.4 13.7
8 japon gradual high 59.7 9.57
9 japon high high 56.4 8.20
10 japon low low 52.0 8.29
Weighted Mean
In our dataset, the Taxon column shows which of two species of Fallopia were used in the competition experiments. We might want to take the mean total biomass for each of the two Fallopia species.
However, there are other factors in our experiment that may affect biomass. The Nutrients column tells us whether pots received high or low nutrients, and this also affects biomass:
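A sketch of this summary (again assuming Total as the response):

summarise(group_by(Fallo, Nutrients), Mean=mean(Total), SD=sd(Total))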
# A tibble: 2 × 3
Nutrients Mean SD
<chr> <dbl> <dbl>
1 high 58.0 9.61
2 low 49.9 8.66
We can see that the Nutrients treatment had a large effect on the mean biomass of a pot. Now imagine that our sampling design is ‘unbalanced’. For example, maybe we had some plant mortality or lost some plants to a tornado. Because nutrients promote growth, if one of the two species in the Taxon column had more high-nutrient pots left over, then it would have a higher mean even if the two species did not differ biologically; that difference would just be an artifact of unbalanced replication of the high-nutrient treatment. We can simulate this effect by re-shuffling the species names:
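A minimal sketch of this simulation (the reshuffling code and random seed are assumptions, so your numbers will differ slightly from the output below):

set.seed(123)
RFallo<-Fallo
# Randomly reassign each pot to one of two 'species', coded 0 and 1
RFallo$Taxon<-sample(0:1, nrow(RFallo), replace=TRUE)
summarise(group_by(RFallo, Taxon), Mean=mean(Total))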
# A tibble: 2 × 2
Taxon Mean
<int> <dbl>
1 0 56.1
2 1 56.5
In this example, the difference is less than \(1\%\) of the mean (i.e., \(0.4 / 56.1 = 0.0071 = 0.71\%\)), but the imbalance could also be offsetting real biological differences between the taxa, and in other scenarios the artifact could be much larger. To fix this problem, we may want to take a weighted mean. In this case, we want to weight the mean of each taxon by the unbalanced Scenario and Nutrients categories.
Step 1: Group by all three columns and calculate group means
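A sketch of the calculation, using the reshuffled RFallo object from above:

# Step 1: group means within each Taxon x Scenario x Nutrients combination
Grouped<-group_by(RFallo, Taxon, Scenario, Nutrients)
GroupMeans<-summarise(Grouped, GrpMean=mean(Total))
# Step 2: average the group means within each Taxon, so that each
# Scenario x Nutrients combination contributes equally
summarise(group_by(GroupMeans, Taxon), Mean=mean(GrpMean))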
# A tibble: 2 × 2
Taxon Mean
<int> <dbl>
1 1 56.6
2 0 55.8
Now there is about a \(1.4\%\) difference (i.e., \(0.8 / 55.8 \approx 0.014\)) between the taxa, roughly double the effect that we calculated with unbalanced samples!
%>% (pipe)
The dplyr package includes a special command (%>%) called a pipe. The pipe is a very handy tool for complex data wrangling. It allows us to string together multiple functions and then pipe our data from one function to the next. In principle, this is similar to the multi-function plotting commands we made with ggplot() in earlier chapters.
The pipe is useful to combine operations without creating a whole bunch of new objects. This can save on memory use and run speed because you are not making new objects for every single function.
Pro-tip: Use Ctrl + Shift + M to add a pipe quickly
For example, we can re-write the weighted mean example using pipes.
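A sketch of the piped version, using the same RFallo object and grouping columns as above:

RFallo %>%
  group_by(Taxon, Scenario, Nutrients) %>%
  summarise(GrpMean=mean(Total)) %>%
  group_by(Taxon) %>%
  summarise(Mean=mean(GrpMean))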
# A tibble: 2 × 2
Taxon Mean
<int> <dbl>
1 1 56.6
2 0 55.8
We declare the input data set in the first line, and then pipe to group_by() in line 2. The output of group_by() is then piped to summarise(), and so on. There are two important things to note here:
We do not declare the data object inside of each function. The pipe command does this for us.
We still have to assign the output to an object if we want to retain it. For example, we might change the first line to wtMean<-RFallo %>%. In the example above we didn’t assign the output to an object, so it was simply printed to the R console.
Compared to the earlier example, it is clear that the pipe can make for cleaner, more concise code with fewer objects added to the environment. It also reduces the potential for bugs in our program. Imagine if we mis-spelled ‘Taxon’ in our second line by accidentally pressing ‘s’ along with ‘x’. Compare the output:
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `Tasxon` is not found.
In both cases we get an error, but in the step-by-step version we still calculate the means and SD of the two species.
A bug that produces no output is much less dangerous than an error that gives an output. Why?
Join Datasets
The dplyr package has some handy join_ tools to combine data sets. There are four main ways to join data sets. I find it helpful to think of Venn diagrams when I’m trying to remember what all of these functions do. You can get more information on these with the help command ?join after loading the dplyr library, but here is a quick overview. For each of these, imagine a Venn diagram with two datasets: \(X\) as a circle on the left side and \(Y\) as a circle on the right side. The rows we choose to combine from the two datasets depend on one or more identifying columns that we can define (e.g. sample ID, date, time).
left_join() - Keep all rows of X and add matching rows from Y. Any rows in Y that don’t match X are excluded.
right_join() - The reverse of left_join()
inner_join() - Only keep rows that are common to both X AND Y. Remove any rows that don’t match.
full_join() - Keep any rows that are in either X OR Y. Add NA for missing data in either of the columns for rows that don’t match.
To try these out, we can create a couple of quick data sets and compare the output. For each case, note the addition of NA for missing data.
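A sketch of two such datasets, with values chosen to match the full_join() output shown later:

X<-data.frame(ID=c("A","C"), Xdat=c(1,3))
Y<-data.frame(ID=c("C","B"), Ydat=c(3,2))
left_join(X, Y, by="ID")   # all rows of X; NA where Y has no match
right_join(X, Y, by="ID")  # all rows of Y; NA where X has no match
inner_join(X, Y, by="ID")  # only the shared row (ID C)
full_join(X, Y, by="ID")   # all rows of both, with NA for missing values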
Most of the data examples we’ve looked at are in the ‘wide’ format, where we have a single individual as a row, and multiple measurements as separate columns.
However, there are many cases where we may want to reorganize our data into the ‘long’ format, where each row is an individual observation. Many statistical models use this format, and it’s also useful for visualizations.
Here is one very handy example that I use all the time. If we have a bunch of columns of observations and we want to generate plots quickly (e.g. frequency distributions), we can convert the data to the long format and then use the facet_ functions from ggplot2 to create separate plots for each of the original columns containing the observations.
pivot_longer
The pivot_longer() function in the tidyr library converts data from wide to long. We use the cols= parameter to specify the data columns, then names_to= specifies the name of the new column containing the original column names, and values_to= specifies the name of the new column containing the values.
This is a bit confusing to read, but it’s easier to understand if you compare the output with the full_join function in the previous section.
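A sketch, starting from the joined example datasets from the previous section:

# Combine the two datasets, then pivot the measurement columns into long format
XY<-full_join(X, Y, by="ID")
pivot_longer(XY, cols=c(Xdat, Ydat), names_to="Measurement", values_to="Value")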
# A tibble: 6 × 3
ID Measurement Value
<chr> <chr> <dbl>
1 A Xdat 1
2 A Ydat NA
3 C Xdat 3
4 C Ydat 3
5 B Xdat NA
6 B Ydat 2
This is the long format. Compare to the wide format:
full_join(X,Y,by="ID")
ID Xdat Ydat
1 A 1 NA
2 C 3 3
3 B NA 2
Note how there is only one column of data values, with the Measurement column indicating which measurement the value belongs to, and the ID column is repeated for each measurement.
This is why the long data format is sometimes called the repeated measures data format.
pivot_wider
The pivot_wider() function does the reverse. This time, we specify the column that contains the values with values_from= and the corresponding column names with names_from=. This should recover the original data set:
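A sketch, reversing the pivot_longer() call from above:

Long<-pivot_longer(XY, cols=c(Xdat, Ydat), names_to="Measurement", values_to="Value")
pivot_wider(Long, names_from=Measurement, values_from=Value)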
# A tibble: 3 × 3
ID Xdat Ydat
<chr> <dbl> <dbl>
1 A 1 NA
2 C 3 3
3 B NA 2
Note the slight difference in output.
Question: Can you explain why this output is different than the original?
Answer: The original was a data.frame object, but the output of pivot_longer() and pivot_wider() is a tibble.
Missing Data
So far we have worked on a pristine data set that has already been edited for errors. More often datasets will contain missing values.
NA and na.rm()
As we have already seen, the R language uses a special object NA to denote missing data.
Vec<-c(1,2,3,NA,5,6,7)
Vec
[1] 1 2 3 NA 5 6 7
When a function is run on a vector or other object containing NA, the function will often return NA or give an error message:
mean(Vec)
[1] NA
This is by design, because it is not always clear what NA means. Should these data be removed from consideration, or is it a zero to be included in calculations and statistical models? Many functions in R include an na.rm parameter that is set to FALSE by default. Setting it to TRUE tells the function to ignore the NA (i.e., remove it from the calculation).
mean(Vec, na.rm=T)
[1] 4
NA vs 0
A common mistake students make is to put 0 for missing data. This can be a big problem when analyzing the data since the calculations are very different.
Vec1<-c(1,2,3,NA,5,6,7)
mean(Vec1, na.rm=T)
[1] 4
Vec2<-c(1,2,3,0,5,6,7)
mean(Vec2, na.rm=T)
[1] 3.428571
However, there are many cases where NA does represent a zero for purposes of calculation or statistical analysis. There is no simple rule to follow here. The best decision will depend on the specific details of the biological data and the assumptions of the particular calculation or statistic.
is.na()
In large datasets you might want to check for missing values. Let’s simulate this in our FallopiaData.csv dataset.
To set up a test data frame, randomly select 10 rows and replace the value for ‘Total’ with NA.
Question: Can you remember how to do this, from the R Fundamentals Chapter?
Answer: There are many ways that you could approach this. One way is to first randomly select from a vector representing the row number in our data frame. Then, replace the values of these particular observations with NA.
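A sketch of this setup (the rows chosen depend on the random draw, so your PotNum values will differ from the output shown below):

set.seed(123)
# Randomly choose 10 row indices, then overwrite 'Total' for those rows with NA
MissingRows<-sample(1:nrow(Fallo), 10)
Fallo$Total[MissingRows]<-NA
is.na(Fallo$Total)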
Note that the output is a vector of TRUE/FALSE values. Each cell corresponds to a value of ‘Total’, with TRUE indicating missing values. This is an example of a boolean variable, which has some handy properties in R.
First, we can use it as an index. For example, let’s see which pots have missing ‘Total’ values:
Missing<-is.na(Fallo$Total)
Fallo$PotNum[Missing]
[1] 9 20 36 55 68 70 78 79 126 138
Another handy trick to check for missing values is to sum the vector:
sum(is.na(Fallo$Total))
[1] 10
This takes advantage of the fact that the Boolean TRUE/FALSE variable is equivalent to the binary 1/0 values. By default, R treats TRUE as 1 and FALSE as 0 if used in any mathematical operation.
Naughty Data
Naughty data contain the same information as a standard row x column (i.e. 2-dimensional) data frame but break one or more of the following best practices:
Each cell contains a single value
Each variable must have its own column
Each observation must have its own row
Naughty data are very common in biology. They often occur when the collector(s) were not familiar with these best practices, or when decisions were made to favour human readability and usability over computer interpretability.
Naughty data can be very time consuming to fix, but regular expressions can make this a bit easier, as discussed in the Regular Expressions Chapter.
Dates
As biologists, we often work with dates or time. We may want to analyze the date a sample was collected, or a measurement was taken. But dates are often encoded in formats that don’t fit neatly into the usual data types that we’ve worked with so far. The lubridate package provides a convenient framework for switching between human-readable dates and mathematical relationships among them – for example, the number of days, minutes, or seconds between two time points.
library(lubridate)
It can be convenient to automatically include the current date or time, especially when you are producing reports in R.
We can get the date as a date object in the year-month-day format,
today()
[1] "2024-01-29"
And we can get the date along with the time as a datetime object, in the year-month-day hour:minute:second timezone format.
now()
[1] "2024-01-29 20:45:14 EST"
This is an important distinction, because the datetime object extends on the date object to include hours, minutes, seconds, and time zone.
Run Time
We can use a datetime object to track the run time of our code. The run time is how long it takes to run a particular program. We first create an object to store the computer clock before the program runs, then we subtract that object from the computer clock after the program finishes:
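A minimal sketch, with a toy loop standing in for a larger program (the loop itself is just a hypothetical time-filler):

StartTime<-now()
Result<-0
for(i in 1:1000000){
  Result<-Result+i
}
now()-StartTime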
Note that your run time may be different, depending on the specifications of your computer and other processes that are running on it.
This technique can be useful for estimating the run time of a large program. Specifically, we run a fraction of the data and/or loop iterations and multiply to get an estimate the run time for the full data set.
Question: If you add a print() function to print out the result in each iteration of the for loop, how much does this slow down the run time?
Answer: Try it to find out!
For a small program like this, it’s not a problem to add a few fractions of a second. But this can scale up dramatically in ‘big data’ applications. For example, imagine you are sequencing a human genome at 100× coverage, which might typically involve a billion sequences, each 300 bases long. You want to write a program to assemble all of the sequences into a full genome. How much time would it add to your program run time if you add a 1 millisecond step for each sequence, or for each base pair in the experiment?
For each sequence, it would be \(10^9 \times 0.001\ s = 10^6\ s\), or about 11.5 days! For each base pair, multiply by 300 – it would take nearly 10 years!
These are problems of scale that don’t matter so much when you are starting out, but developing an efficiency mindset early can pay off when we move on to bigger data projects.
Next, we’ll explore the date and datetime objects in more detail.
date Objects
Human-readable dates come in many different forms, which we can encode as strings. Here are some examples that we might see for encoding the end-date of the most recent 5,126-year-long Mesoamerican calendar used by the ancient Maya civilization:
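For illustration, here are some hypothetical strings for that date. Date2 matches the string inspected with str() further below; the other formats are assumptions chosen to show different separators:

Date2<-"2012-12-21"
Date3<-"21 December, 2012"
Date4<-"December 21, 2012"
Date5<-"21.12.2012"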
The lubridate package has a number of related functions that correspond to the order of entry – d for day, m for month, and y for year:
ymd(Date2)
[1] "2012-12-21"
dmy(Date3)
[1] "2012-12-21"
mdy(Date4)
[1] "2012-12-21"
dmy(Date5)
[1] "2012-12-21"
Notice the flexibility here! Some dates have dashes, dots, spaces, and commas! Yet, all are easily converted to a common object type. On the surface, these objects look like simple strings, but compare the structure of the date object with its original input string:
str(Date2)
chr "2012-12-21"
str(ymd(Date2))
Date[1:1], format: "2012-12-21"
Notice how the first is a simple chr character object, whereas the second is a Date object. The Date object can be treated as a numeric variable that prints as a human-readable date. For example, what if we want to know the date 90 days before or after?
c(ymd(Date2)-90,ymd(Date2)+90)
[1] "2012-09-22" "2013-03-21"
When using date objects, R will even account for the different number of days in each month, and adjust for leap years!
As with any function in R, we can apply ymd() to a column in a data frame or any other vector of strings:
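For example (a hypothetical vector; any year-month-day strings would work):

DateVec<-c("2012-12-21", "2013/01/10", "2013 Feb 14", "2013-02-15")
ymd(DateVec)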
The elements must have the same order as the function, but surprisingly they don’t have to have the same format:
MixVec<-c("2012-12-11","12-20-2012","2012, December 21", "2013, Nov-11")
ymd(MixVec)
Warning: 1 failed to parse.
[1] "2012-12-11" NA "2012-12-21" "2013-11-11"
Note the warning and the resulting output. The first, third, and fourth elements are converted, even though they have different formats. The second element is replaced with NA because the order is not year, month, day, as required by ymd().
Because these objects are numeric, we can also use them for plotting:
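A sketch of such a plot, assuming the DateVec object defined above (ggplot2 loads with the tidyverse):

library(ggplot2)
# Frequency histogram of the parsed dates
ggplot(data.frame(Date=ymd(DateVec)), aes(x=Date)) +
  geom_histogram()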
Question: What do you notice about the x-axis? How would it differ if we used strings instead?
Answer: Try to run the above with DateVec as a string instead of date.
You should get an error, because you can’t bin values of a character variable to generate a frequency histogram. This shows us that R is able to treat the date object as a continuous variable, spacing the bins based on the number of days separating them.
datetime
The datetime object adds a time element to the date. Just as there are different functions to translate different date formats, there are different datetime functions. Each datetime function starts with one of the three date functions we used earlier, but then adds time elements after an underscore _ to define nine different functions. Here are a few examples:
mdy_h("Dec 21, 2012 -- 10am")
[1] "2012-12-21 10:00:00 UTC"
ymd_hm("2012-12-21, 08:30")
[1] "2012-12-21 08:30:00 UTC"
dmy_hms("21 December, 2012; 11:59:59")
[1] "2012-12-21 11:59:59 UTC"
Extracting Components
We can extract components from date and datetime objects. For example, we can extract only the year:
year(Date2)
[1] 2012
or the month:
month(Date2)
[1] 12
We have several options for days.
First, we can get the day of the year, also known as the Julian day:
yday(Date2)
[1] 356
Or the day of the month
mday(Date2)
[1] 21
or the day of the week (rather than month) using wday()
wday(Date2)
[1] 6
We can use the label=T parameter to get the name of the specific month or day, rather than the number:
month(Date2, label=T)
[1] Dec
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
wday(Date2, label=T)
[1] Fri
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
Categorical Dates
We’ve seen above how dates are more like numeric objects rather than strings, but we can also treat dates as categorical data.
One example that is becoming more and more common in biology is the analysis of data from data loggers, which automatically save observations over time. Think of climate stations that measure temperature and precipitation as a common example. Another example might be location information of a study organism using image analysis or PIT tags (i.e. Passive Integrated Transponders).
In many cases, the timescale of collection is much shorter than what we need for our analysis. We end up with too much data! Luckily, dplyr with lubridate offer a great combination for summarizing these data.
lubridate + dplyr
Here’s an example that takes advantage of our ability to treat components of datetime objects as categorical variables as well as continuous variables.
Imagine we have observations taken every minute, and we just want to calculate the average for each hour. To see how to do this, we will generate a toy data set in R using the tibble function, and then assign column names: DayTime for the day-time object and Obs for the observation (i.e. measurement):
First, we create the imaginary dataset, using the replicate function and a random number generator:
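A sketch of such a dataset, with one observation per minute over a single day (the start date, mean, and standard deviation are arbitrary choices):

set.seed(123)
Minutes<-24*60  # one observation per minute for one day
ToyData<-tibble(
  DayTime=ymd_hm("2024-01-01 00:00") + minutes(0:(Minutes-1)),
  Obs=replicate(Minutes, rnorm(1, mean=20, sd=2))
)
head(ToyData)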