Flow Control

Introduction

Think of your data analysis as a stream flowing from the raw data at the headwaters down to the river mouth, exiting as a full analysis with graphics, statistical analyses, and biological interpretation.

There are different ways we can control the flow of our code. The simplest is just to write a sequence of lines of code, with the output of one line of code forming the input of the next. A pseudo-code example might be:

A<-functionA()
B<-functionB(A)
C<-functionC(B)

But sometimes we may want to do the same function or analysis only if the input meets certain criteria. Or we may want to reiterate the same analysis multiple times on different inputs. This is where more advanced flow control comes in handy.

To start, let’s make up a couple of objects to play with:

X<-21
Xvec<-c(1:10,"string")

if(){}

The if(){} statement uses an operator (see above) to asses whether the value is TRUE or FALSE:

if(X > 100){ # Is X greater than 100?
  print("X > 100") # If TRUE
} else { 
  print("X <= 100") # If FALSE
}
[1] "X <= 100"

A common ‘rookie’ mistake is to leave out a bracket or use the wrong type of bracket. Use regular brackets for the if function if() followed by two sets of curly brackets containing the code to run {run if true}else{run if false}.

Break up across multiple lines to improve readability. Note that you don’t need an else{} part if you just want to do nothing when FALSE.

if(X > 0){print ("yup")}
[1] "yup"

ifelse()

The ifelse() is a more compact version for simple comparisons. The following code does the same as above.

ifelse(X > 100,"X > 100", "X <= 100")
[1] "X <= 100"

Nested if

You can also nest if and ifelse statements to account for more outcomes. Conceptually think of it as a bifurcating tree, starting at the top (root) and then splitting in two for every if statement.

if(X > 100){
  print("X > 100")
  if(X > 200){
    print("X > 200")
  }
} else {
  if(X == 100){
    print("X = 100")
  } else {
    print("X < 100")
  }
}
[1] "X < 100"

Don’t get intimidated. It just takes time to work through all of the possibilities. Try to draw a bifurcating diagram to represent each true/false outcome for the above code.

for(){} Loop

A loop does the same thing over and over again until some condition is met. In the case of a for loop, we set a ‘counter’ variable and loop through each value of the counter variable. Here are a few examples:

  1. Loop through numbers from 1 to 5
for (i in 1:5){
  print(paste(X,i,sep=":"))
}
[1] "21:1"
[1] "21:2"
[1] "21:3"
[1] "21:4"
[1] "21:5"
  1. Loop through the elements of a vector directly
for (i in Xvec){
  print(i)
}
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "10"
[1] "string"
  1. Use an index object to index the elements of a vector
for (i in 1:length(Xvec)){
  print(Xvec[i])
}
[1] "1"
[1] "2"
[1] "3"
[1] "4"
[1] "5"
[1] "6"
[1] "7"
[1] "8"
[1] "9"
[1] "10"
[1] "string"

Note that in each case there is a vector and the loop goes through each cell in the vector. The i variable is an object that gets replaced with a new number in each iteration of the loop.

Loops can be tricky, and the only way to really learn them is to practice as much as possible. Whenever you find yourself writing similar code more than 2 or 3 times, challenge yourself to try to re-write it as a loop.

In addition to looping through a vector, it can often be useful to include a counter variable. This can be especially useful for more complicated loops, but be careful to decide where in your code to update the counter variable. USUALLY it will be either

  1. At the end, setting the initial value to 1 before the loop begins.
count1<-1
count10<-1

for(i in 1:5){
  print(paste("count1 =",count1))
  print(paste("count10 =",count10))
  count1<-count1+1
  count10<-count10*10
}
[1] "count1 = 1"
[1] "count10 = 1"
[1] "count1 = 2"
[1] "count10 = 10"
[1] "count1 = 3"
[1] "count10 = 100"
[1] "count1 = 4"
[1] "count10 = 1000"
[1] "count1 = 5"
[1] "count10 = 10000"
  1. At the start, setting the initial value to 0 before the loop begins.
countbefore<-0
countafter<-0

for(i in 1:5){
  countbefore<-countbefore+1
  print(paste("before =",countbefore))
  print(paste("after =",countafter))
  countafter<-countafter+1
}
[1] "before = 1"
[1] "after = 0"
[1] "before = 2"
[1] "after = 1"
[1] "before = 3"
[1] "after = 2"
[1] "before = 4"
[1] "after = 3"
[1] "before = 5"
[1] "after = 4"

This is yet another example of how two different coding approaches can produce the same result.

Read through the outputs above carefully to make sure you understand how the loops work. When you are confident you understand, then write a new loop and write down the predicted output. Run the loop to check if you were right.

Nested Loops

Counters are particularly valuable when you have a nested loop, which is just one loop inside of another. But note that this can complicate decisions about where to place your counter variable.

In the example below, we are first looping through a vector of length 3, tracked with i. Then for each i we do a second loop, tracked by j.

This time, try to predict the output BEFORE you run the loop. Write it down, then run the loop to check your answer.

LoopCount<-0
for(i in 1:3){
  for(j in 1:2){
    LoopCount<-LoopCount+1
    print(paste("Loop =",LoopCount))
    print(paste("i = ",i))
    print(paste("j = ",j))
  }
}
[1] "Loop = 1"
[1] "i =  1"
[1] "j =  1"
[1] "Loop = 2"
[1] "i =  1"
[1] "j =  2"
[1] "Loop = 3"
[1] "i =  2"
[1] "j =  1"
[1] "Loop = 4"
[1] "i =  2"
[1] "j =  2"
[1] "Loop = 5"
[1] "i =  3"
[1] "j =  1"
[1] "Loop = 6"
[1] "i =  3"
[1] "j =  2"

while(){} Loop

The while function is another kind of loop, but instead of looping through a predefined set of variables, we iterate until some condition is met inside of the loop. This is called the exit condition.

In biology, the while loop is often used in optimization simulations, where many calculations are run until some optimum or threshold value is reached. Examples may include equilibrium simulations like the Evolutionarily Stable Strategy (ESS), population growth trajectories, or mutation-selection equilibrium. Other examples may include advanced statistical analyses based on maximizing the likelihood or fit of particular statistical models.

One common coding error associated with while loops is that the exit condition is never reached, causing your computer to run an infinite loop.

Here’s a simple while loop, which will continue until count is greater than or equal to X.

count<-0
while(count < X){
  print(count)
  count<-count+1
}
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] 18
[1] 19
[1] 20

Modulo

Earlier in this chapter, we saw the modulo (%%) function, which returns the remainder of a division equation. This comes in handy for loops. For example, if you want to do something every 2, or 3, or N iterations of a loop, you can divide by N and determine if the dividend is zero. Here’s an example:

for(i in 1:9){
  if(i %% 3 ==0 ){
    print(paste("Iteration:",i))
  }
}
[1] "Iteration: 3"
[1] "Iteration: 6"
[1] "Iteration: 9"

Question: What will the output look like?

Before you run the code, take time to work through the for loop and the if statement to predict the output. This will help you develop a better understanding of these tricky but highly useful coding tools.

Faster Loops

There are many cases where you may want to add or change vectors, arrays, or data frame objects inside of a loop. By default, R will create a new object each time you add a new element. This can make loops very slow.

Slow loops are usually fine, especially when you are starting out. Does it really matter if your code takes 0.2 seconds or 0.12 seconds? Even if it takes 5 minutes to run, that might be a better use of your time than spending 2 hours finding a faster version.

However, as you advance to code larger projects, you will find that these time differences start to become important. There are some good tricks in R for faster.

This subsection is going to get advanced pretty quickly. Don’t worry if you struggle to understand it. Just give it a shot and if you need to move on, file it away for the future. Then, come back to this part of the book if find yourself struggling with loops that are taking too long to run.

Here is a slow loop example demonstrating the central limit theorem. First, we want to sample 1000 numbers from a random normal distribution and calculate the average. Second, we want to repeat this process for 5000 iterations. Finally, we want to calculate the average of these 500 iterations. That is, the mean of means.

Iters<-500 # Number of iterations
OutVector<-NA
Start<-Sys.time()
for(i in 1:Iters){
  TempMean<-NA
  for(j in 1:1000){ # One loop per sample
    TempMean[j]<-rnorm(1)
  }
  OutVector[i]<-mean(TempMean)
}
Sys.time()-Start
Time difference of 1.103925 secs
paste("Mean of means =",mean(OutVector))
[1] "Mean of means = -0.000636993195912495"

Note the use of Sys.time() to keep track of the loop run time. Also, note that the Time difference for your computer will depend on the specifications of your computer and memory use. This is a handy technique for large bioinfomatics projects. For example, imagine looping through millions or billions of DNA sequences. How long will the loop take? Try running it on a subset of a few thousand and then multiply to estimate the total run time.

Here is an example of the same calculation in a fast loop. Note the only changes are the addition of the vector() functions.

Iters<-500 # Number of iterations
OutVector<-vector("numeric",Iters)
Start<-Sys.time()
for(i in 1:Iters){
  TempMean<-vector("numeric",1000)
  for(j in 1:100){ # One loop per sample
    TempMean[j]<-rnorm(1)
  }
  OutVector[i]<-mean(TempMean)
}
Sys.time()-Start
Time difference of 0.1123781 secs
paste("Mean of means =",mean(OutVector))
[1] "Mean of means = 0.000421493909076394"

The reason this is faster is a bit technical, but the key is that we are pre-defining the size of our output vectors before we run the loop. This allows R to assign an appropriate amount of computer memory to keep track of changes to the output vector. If we don’t do this, R has to constantly update the output memory by creating new objects each time.

But there is an even faster way to do the same calculation. Often if we are outputting to vectors in a for loop, then there is a way to re-write the same function by using sapply() or tapply().

Here is an example of an even faster loop.

Iters<-500 # Number of iterations
OutVector<-vector("numeric",Iters)
OutMean<-function(x){
  return(mean(rnorm(1000)))
}
Start<-Sys.time()
OutVector<-sapply(OutVector,FUN=OutMean)
Sys.time()-Start
Time difference of 0.02665401 secs
paste(mean(OutVector))
[1] "0.000901234129891329"

We’ve used three tricks here. First, we simply generate a vector of 1000 and take the mean with mean(rnorm(1000)), instead of making a nested loop with a vector to hold each of our 1000 random numbers. Second, we put this into a custom function called OutMean, which allows us to apply the same function along a vector. We’ll look at custom functions in more detail in a later chapter. Finally, we use the sapply function to apply our custom functionOutMean for each element of OutVector.