Regular Expressions I

Introduction

Regular Expressions, also known as regex or regexp, are special text patterns that let you run complex find-and-replace operations. I didn’t learn regular expressions until I was a postdoc working at Duke University, but I wish I had learned about them much earlier! They remain one of the most useful programming tools I have ever used, and they are absolutely essential for working with large text files or large data sets. I’ll explain.

A lot of programming tools in biology use input text files that require very specific formatting (e.g. .txt, .csv, .fasta, .nex). Sometimes, you might need to reorganize or recode data in a large text file or in many separate text files. This can be a big time sink, it can introduce errors, and it’s not reproducible if you do it manually. But regular expressions can automate the process.

Here’s one example. As a PhD student I co-founded a project called the Global Garlic Mustard Field Survey (GGMFS) with collaborator Dr. Oliver Bossdorf at the University of Tübingen (yes, the same Dr. Bossdorf mentioned in the Quick Visualizations Chapter). We were fortunate to have over 100 collaborators across Europe and North America who helped to collect samples for the project. Details of the project were published in the journal NeoBiota: https://neobiota.pensoft.net/article/1270/. One BIG problem was the way that each of these 100+ collaborators entered their data online. For example, latitudes and longitudes were entered in a variety of different formats. Regular expressions allowed me to write a small program to automatically convert all of these different formats to a common decimal format that we could use for the analysis. This saved a huge amount of time and prevented errors that could have been introduced if we had tried to edit these values by hand.

Often when you work with large datasets, you will need to automate some of your error correction, and regular expressions can be a big help here. For example, imagine a simple online survey that includes a place for people to simply type “yes” or “no” in response to a question. This should be coded as a binary variable (1 or 0) for analysis, but you might find a variety of inputs such as: “YES”, “Y”, “yes”, and “Yes”. These all mean the same thing, yet if you try to analyze the raw output, R will treat these as different categories. Here again, regular expressions can be used to quickly change all the different examples to a common “Y” or to a Boolean variable TRUE.
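As a rough preview of where we are headed, here is a minimal sketch of that clean-up. The responses vector is made up for illustration, and the pattern uses a few symbols (^, $, ? and parentheses) that are explained later in this chapter.

```r
# Hypothetical survey responses (invented for this example)
responses <- c("YES", "Y", "yes", "Yes", "no", "N", "No")

# TRUE when the whole entry is "y" or "yes", ignoring upper/lower case
is_yes <- grepl("^y(es)?$", responses, ignore.case = TRUE)
print(is_yes)
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
```

The result is a logical vector that can be analyzed directly, or converted to 1 and 0 with as.numeric(is_yes).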

One final example is pattern matching, which is common in the analysis of DNA, proteins, or other long strings of data. You may want to find a particular sequence, possibly with a few variable sites: e.g. TCTA or TCAA or TCGA. This is another area where regular expressions can help.
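A small sketch of that DNA search is shown here. The sequences are invented for illustration, and the square brackets (a ‘set’, covered later in the chapter) stand for the one variable site.

```r
# Made-up sequences for illustration only
seqs <- c("GGTCTAGG", "ATCAAT", "TTCGAC", "TCCATT")

# TC, then one of T, A or G, then A
has_motif <- grepl("TC[TAG]A", seqs)
print(has_motif)
# [1]  TRUE  TRUE  TRUE FALSE
```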

Universal Syntax

Regular expressions are a nearly universal language that extends to many other programming languages and environments, including C/C#/C++, Python, Unix/Linux, and Perl. We focus here on R, but most of the syntax is maintained across programming languages.

Steep Learning Curve

WARNING! There is a very steep learning curve here, and the only way to really learn this is to drown yourself in examples. There are lots of exercises you can do for practice online. You should also try to apply these whenever you can, just like you should with all of your other R skills.

Functions

There are four main functions that use regular expressions in R.

grep() and grepl() are equivalent to ‘find’ in your favorite word processor. They have the general form:

grep("find", in.this.object)

grep() outputs a vector with all of the address locations (i.e. numbers) that match. Thus the output length is equal to the number of matches.

grepl() outputs a vector of TRUE (match) and FALSE (no match). Thus, the output length is equal to the length of the input object.

sub() and gsub() are equivalent to ‘find and replace’. They have the general form:

gsub("find", "replace", in.this.object)

sub() replaces only the first match, whereas gsub() replaces all of the matches.

Some specific examples are provided below to help you understand these similarities and differences. As always, you should take the time to try these out and make sure you get the same output. If you don’t, then it’s a good learning opportunity to find out what you did differently!

There are two other more advanced functions in R. These aren’t covered in this tutorial, but may be of use once you are more comfortable with the above functions.

regexpr() provides more detailed information about the first match.

gregexpr() provides more detailed results about all matches.

See ?regexpr and ?gregexpr for more info
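For the curious, here is a very short sketch of what these two functions return, using the same species names that appear in the examples below. It is only meant as a taste; the help pages describe the full output.

```r
Species <- c("petiolata", "verticillatus", "salicaria", "minor")

# Position of the first "l" in each name; -1 means no match
first_l <- as.integer(regexpr("l", Species))
print(first_l)
# [1]  6  8  3 -1

# gregexpr() returns a list; element 2 holds every "l" in "verticillatus"
all_l <- gregexpr("l", Species)
second_name <- as.integer(all_l[[2]])
print(second_name)
# [1] 8 9
```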

Examples

Some examples can help to understand the differences among the four main functions. Let’s start with a simple vector of species names.

Species<-c("petiolata", "verticillatus", "salicaria", "minor")
print(Species)
[1] "petiolata"     "verticillatus" "salicaria"     "minor"        

grep()

This returns cell addresses matching the query string.

grep("a",Species)
[1] 1 2 3

Note the vector length compared to the input vector. Instead of the cell number, we can get R to return the specific values in each matching cell with the value=T parameter:

grep("a",Species, value=T)
[1] "petiolata"     "verticillatus" "salicaria"    

grepl()

This returns a vector of TRUE (match) and FALSE (no match). Compare this output with the same parameters in the grep() function.

grepl("a",Species)
[1]  TRUE  TRUE  TRUE FALSE
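Either output can be used inside square brackets to subset the original vector, which is a common reason to use these functions. The sketch below assumes the Species vector defined above; the object names are just for illustration.

```r
Species <- c("petiolata", "verticillatus", "salicaria", "minor")

# Subset with index numbers (grep) or with TRUE/FALSE (grepl)
by_index   <- Species[grep("a", Species)]
by_logical <- Species[grepl("a", Species)]
print(by_index)
# [1] "petiolata"     "verticillatus" "salicaria"

# The two outputs have different lengths
n_grep  <- length(grep("a", Species))   # 3 matches
n_grepl <- length(grepl("a", Species))  # 4, same as the input

# grepl() is convenient for finding the names that do NOT match
no_a <- Species[!grepl("a", Species)]
print(no_a)
# [1] "minor"
```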

sub()

This replaces the first match (in each cell)

sub("l","L",Species)
[1] "petioLata"     "verticiLlatus" "saLicaria"     "minor"        

gsub()

This replaces all matches (in each cell). Compare this output to sub().

gsub("l","L",Species)
[1] "petioLata"     "verticiLLatus" "saLicaria"     "minor"        

Question: Did you see the difference?

Hint: Look at “verticillatus”.

Wildcards

Escape Character \

The backslash is a special character. It’s called the ‘escape’ character because it is used to escape from the literal interpretation of the next character to the right. For example, \. applies the escape to the period character. The specific meaning depends on the context, which is much easier to understand by examples, as shown below.

\\ in R

In the introduction, we discussed the universality of regular expressions in the sense that a similar syntax is used by many different programming languages. But here is one exception. In R, the double-escape is usually needed, whereas other programming languages typically use just one. The reason is a bit meta: we are writing regular expressions inside an R character string. The first \ is used to escape special characters in R, applying it to the second \, which is itself the special character that needs to be escaped to pass through to the function. The second backslash is followed by the ‘escaped’ character. Some examples are provided below.

If that isn’t clear, just remember that you need two backslashes when writing regular expressions in R, but just one backslash for most other languages.
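One way to convince yourself is to look at what R actually stores. In this sketch, the object name pattern is arbitrary, and the last line assumes R version 4.0.0 or newer, which added ‘raw’ strings that do not need the extra backslash.

```r
pattern <- "\\d"

# R stores only two characters: one backslash and the letter d
n_chars <- nchar(pattern)
print(n_chars)
# [1] 2

# writeLines() shows the string as the regex engine receives it
writeLines(pattern)
# \d

# Raw string syntax (assumes R >= 4.0.0): same pattern, one backslash
same_pattern <- identical(pattern, r"(\d)")
print(same_pattern)
# [1] TRUE
```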

\\w

Instead of finding the letter w, the \\w is a wildcard character that represents any letter or digit. It also includes underscore _ for some reason.

sub("w","X","...which 1-100 words get replaced?")
[1] "...Xhich 1-100 words get replaced?"
gsub("w","X","...which 1-100 words get replaced?")
[1] "...Xhich 1-100 Xords get replaced?"
sub("\\w","X","...which 1-100 words get replaced?")
[1] "...Xhich 1-100 words get replaced?"
gsub("\\w","X","...which 1-100 words get replaced?")
[1] "...XXXXX X-XXX XXXXX XXX XXXXXXXX?"

Again, note the differences between the sub() and gsub() functions. We’ll stick to gsub() for the remainder of the examples in this chapter, but you should also run sub() yourself. Each time, take a moment to try to predict how the output will differ before running it. This will help you develop an understanding of regular expressions much more quickly.

\\W

The capital W is the inverse of \\w: it finds any character that is NOT a letter, digit, or underscore.

gsub("\\W","X","...which 1-100  words get replaced?")
[1] "XXXwhichX1X100XXwordsXgetXreplacedX"

\\s

This represents a whitespace character: a space, but also a tab or a new line.

gsub("\\s","X","...which 1-100  words get replaced?")
[1] "...whichX1-100XXwordsXgetXreplaced?"
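It is worth checking how \\s differs from typing a literal space in the search pattern. A small sketch, using a made-up string that contains a space, a tab and a new line:

```r
x <- "a b\tc\nd"

# \\s matches the space, the tab and the new line
all_white <- gsub("\\s", "X", x)
print(all_white)
# [1] "aXbXcXd"

# A literal space in the pattern matches only the space
only_space <- gsub(" ", "X", x)
print(only_space)
# [1] "aXb\tc\nd"
```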

\\t

This is a tab character. A lot of data files stored as text are tab-delimited (.tsv) as well as comma-delimited (.csv)

gsub("\\t","X","...which 1-100 \t words get replaced?")
[1] "...which 1-100 X words get replaced?"

Remember that \t is a tab character.

cat("A\t\t\tB C")
A           B C

\\d

d for digits. This is the wild card for numeric characters.

gsub("\\d","X","...which 1-100  words get replaced?")
[1] "...which X-XXX  words get replaced?"

\\D

Non-digit characters

gsub("\\D","X","...which 1-100  words get replaced?")
[1] "XXXXXXXXX1X100XXXXXXXXXXXXXXXXXXXXX"
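A practical use of \\D is to strip everything except the digits from messy labels, by replacing with an empty string. The plot_ids vector below is invented for illustration.

```r
# Hypothetical labels that mix text and numbers
plot_ids <- c("plot_12", "site3", "Q7b")

# Delete every non-digit, then convert what is left to numbers
plot_num <- as.numeric(gsub("\\D", "", plot_ids))
print(plot_num)
# [1] 12  3  7
```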

New Lines

There are two special characters that indicate new lines in a text file.

\\r

This is the ‘carriage return’ special character

\\n

This is the ‘newline’ special character

Big Problem

One or both of these may be generated when you press the ‘enter’ key while writing a text file. The difference depends on which operating system you are using. These also add a source of headache and confusion when working with text files because:

  1. Unix and macOS text files use lines that end with \n only
  2. Windows and DOS text files use lines that end with \r\n

Question: Do you know how this difference originated?

Answer: The reason goes back to the early days of programming, when programmers were moving from mechanical typewriters to computer programs. Mechanical typewriters are hard to find these days, but they would hold a piece of paper in place on a cylinder called a carriage.

The \n stands for ‘new line’, and the \r stands for ‘carriage return’. When you reach the end of a line of text on a typewriter, you would typically return the carriage back to the starting position, and then move to the next line, thus the \r\n. The Unix operating system decided that the \r wasn’t needed, whereas the DOS operating system decided to include it.

This difference can cause problems when moving text files across operating systems. Programs like FileZilla will automatically translate these end-of-line characters when moving across systems.
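You can also convert line endings yourself once the text is in R. This is only a sketch on a made-up string. Note that it uses single backslashes, because here R itself turns \r and \n into the real characters, and fixed = TRUE tells gsub() to treat the search text literally.

```r
# A made-up string with Windows-style line endings
win_text <- "line one\r\nline two\r\n"

# Replace each \r\n pair with \n alone
unix_text <- gsub("\r\n", "\n", win_text, fixed = TRUE)
print(unix_text)
# [1] "line one\nline two\n"

# Check that no carriage returns remain
any_cr <- grepl("\r", unix_text, fixed = TRUE)
print(any_cr)
# [1] FALSE
```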

Special Characters

In addition to special characters that use the escape \\, there are a number of other special characters that don’t use the escape, but have a special meaning.

Note that if you want to search for the characters below you would have to use the escape character. E.g., use \\. to search for the period character (.).

Any Character .

The period is a wild card that means ‘anything’. This includes all of the \\w characters but also other characters like punctuation marks.

gsub(".","X","...which 1-100  words get replaced?")
[1] "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

So how do we search for a period (.)? As noted above, we have to use the escape character:

gsub("\\.","X","...which 1-100  words get replaced?")
[1] "XXXwhich 1-100  words get replaced?"
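An alternative, when you only need to find plain text, is the fixed = TRUE argument of gsub(), which switches off all special characters in the search pattern. A short sketch with the same example string:

```r
x <- "...which 1-100  words get replaced?"

# With fixed = TRUE the period is just a period
fixed_dot <- gsub(".", "X", x, fixed = TRUE)
print(fixed_dot)
# [1] "XXXwhich 1-100  words get replaced?"

# Same result as escaping the period
same_result <- identical(fixed_dot, gsub("\\.", "X", x))
print(same_result)
# [1] TRUE
```

The trade-off is that none of the wildcards in this chapter work when fixed = TRUE.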

Or |

This is sometimes called the pipe character, and it simply means ‘or’. For example, we can search for w or e.

gsub("w|e","X","...which 1-100  words get replaced?")
[1] "...Xhich 1-100  Xords gXt rXplacXd?"

Special Searches *, ?, +, {}

These special characters refer to details about the kind of search that we are trying to conduct. Look at these examples carefully, and remember that sub replaces the first match while gsub replaces all of the matches.

sub("\\w","X","...which 1-100 words get replaced?")
[1] "...Xhich 1-100 words get replaced?"
gsub("\\w","X","...which 1-100 words get replaced?")
[1] "...XXXXX X-XXX XXXXX XXX XXXXXXXX?"

Now let’s apply some of these special characters to see how they work.

Find One or More Matches +

Finds ‘one or more’ matches (i.e. at least one match)

sub("\\w+","X","...which 1-100 words get replaced?")
[1] "...X 1-100 words get replaced?"
gsub("\\w+","X","...which 1-100 words get replaced?")
[1] "...X X-X X X X?"

Compare this match to the one above. Notice how we have replaced groups of letters instead of single letters. The algorithm works like this:

  1. Start at the left and move to the right, one character at a time
  2. Check if the character is a letter or number (\\w).
  3. If NO, move to the next character
  4. If YES, check the next character. If it is also a \\w then go to the next character. Repeat until the next character is not \\w, and replace the entire string of characters.

When run in the sub() function, the algorithm does the above and then stops. When run with the gsub() function, it continues to the next character, and then starts over.
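To see exactly which pieces of text the algorithm finds, we can borrow regexpr() and gregexpr() together with the base R function regmatches(), which pulls out the matched text instead of replacing it. This is just a sketch for checking our understanding of \\w+.

```r
x <- "...which 1-100 words get replaced?"

# The single match that sub() would replace
first_match <- regmatches(x, regexpr("\\w+", x))
print(first_match)
# [1] "which"

# Every match that gsub() would replace
all_matches <- regmatches(x, gregexpr("\\w+", x))[[1]]
print(all_matches)
# [1] "which"    "1"        "100"      "words"    "get"      "replaced"
```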

Greedy Matches *

This is a greedy search that matches zero or more in a row. Again, this is easier to understand by exploring examples.

sub("\\w*","X","...which 1-100 words get replaced?")
[1] "X...which 1-100 words get replaced?"
gsub("\\w*","X","...which 1-100 words get replaced?")
[1] "X.X.X.X X-X X X X?X"

In the sub() function, it detects a period (.) as the first character, which is not a \\w match. Because zero matches are allowed, it replaces the ‘null’ or 0 match at the beginning, which has the effect of adding a character. In the gsub() function it repeats this again before each period (.). It then continues until it finds the letter w. Then it finds a group of \\w matches, replacing all of them with a single X. The space that follows is left unchanged, then the 1 is replaced, the - is left unchanged, the 100 is replaced, and so on for each word. Finally, one more null match at the very end of the string adds the last X after the question mark.
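Those inserted characters are usually not what you want. In practice, * tends to be more useful when it follows something that must be present. The sketch below uses a made-up string of numbers and compares * with +.

```r
x <- "1 10 100 1000"

# A 1 followed by zero or more 0s: every number matches
star_out <- gsub("10*", "X", x)
print(star_out)
# [1] "X X X X"

# A 1 followed by one or more 0s: the lone 1 is left alone
plus_out <- gsub("10+", "X", x)
print(plus_out)
# [1] "1 X X X"
```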

Restrained Match ?

This is the restrained search, which matches zero or one time.

sub("\\w?","X","...which 1-100 words get replaced?")
[1] "X...which 1-100 words get replaced?"
gsub("\\w?","X","...which 1-100 words get replaced?")
[1] "X.X.X.XXXXX X-XXX XXXXX XXX XXXXXXXX?X"

Compare this to the * above. The ? character behaves in a similar way, except it is constrained in the sense that each letter is replaced individually, instead of replacing entire words.
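A common use of ? is to make a single character optional, for example to catch two spellings of the same word. The spellings vector here is made up for illustration.

```r
spellings <- c("color", "colour", "colouur")

# The u may appear zero times or one time, but not twice
is_colour <- grepl("colou?r", spellings)
print(is_colour)
# [1]  TRUE  TRUE FALSE
```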

Lazy, Restrained +?

This is the lazy version of +

sub("\\w+?","X","...which 1-100 words get replaced?")
[1] "...Xhich 1-100 words get replaced?"
gsub("\\w+?","X","...which 1-100 words get replaced?")
[1] "...XXXXX X-XXX XXXXX XXX XXXXXXXX?"

Note the difference in sub(), which replaces only the first letter here but the whole word when + is used alone in the earlier example. In the gsub() example we end up replacing every letter instead of whole words. Remember, sub() runs the algorithm once and then stops, while gsub() cycles through the algorithm until it reaches the end of the line.
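The lazy version becomes much more useful when the pattern has a clear end point. Here is a sketch with a made-up string containing HTML-style tags, where we want to replace each tag separately.

```r
x <- "<b>bold</b> and <i>italic</i>"

# Greedy: runs from the first < all the way to the last >
greedy_out <- gsub("<.+>", "X", x)
print(greedy_out)
# [1] "X"

# Lazy: stops at the first > it can find, so each tag is separate
lazy_out <- gsub("<.+?>", "X", x)
print(lazy_out)
# [1] "XboldX and XitalicX"
```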

Lazy, Greedy *?

Similarly, we can combine these characters for the ‘lazy’ version of *

sub("\\w*?","X","...which 1-100 words get replaced?")
[1] "X...which 1-100 words get replaced?"
gsub("\\w*?","X","...which 1-100 words get replaced?")
[1] "X.X.X.XwXhXiXcXhX X1X-X1X0X0X XwXoXrXdXsX XgXeXtX XrXeXpXlXaXcXeXdX?X"

Try using +*.

Question: Why do you get an error message?

Answer: The + and * are both quantifiers applied to the same character, but they set competing rules for how many matches are required (one or more versus zero or more), so R rejects the expression.

Range {}

Curly brackets are used to specify a number of matches, expanding on the options even further.

{n,m}

Find between \(n\) and \(m\) matches

gsub("\\w{3,4}","X","...which 1-100 words get replaced?")
[1] "...Xh 1-X Xs X XX?"

{n}

Find exactly \(n\) matches

gsub("\\w{3}","X","...which 1-100 words get replaced?")
[1] "...Xch 1-X Xds X XXed?"

{n,}

Find \(n\) or more matches

gsub("\\w{4,}","X","...which 1-100 words get replaced?")
[1] "...X 1-100 X get X?"

{}?

As above, we can use ? for the ‘lazy’ versions of these searches

gsub("\\w{4,}?","X","...which 1-100 words get replaced?")
[1] "...Xh 1-100 Xs get XX?"

Set []

Square brackets allow us to define a set, which is a group of characters from which we want to match any. Within a set, we can use the dash - to specify a range of numbers or letters.

gsub("[aceihw-z]","X","...which 1-100 words get replaced?")
[1] "...XXXXX 1-100 Xords gXt rXplXXXd?"

In the above example, we search for 1 of any of the listed letters: a, c, e, i, h, w, x, y, z. Note that x and y are included in the w-z statement.

Question: What if we want to find 1 or more of these characters in a row to replace with X?

gsub("[aceihw-z]+","X","...which 1-100 words get replaced?")
[1] "...X 1-100 Xords gXt rXplXd?"
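Ranges can also be combined inside one set. As a quick sketch, [A-Za-z] covers all upper and lower case letters and [0-9] covers the digits, which lets us split up what \\w lumps together.

```r
x <- "...which 1-100 words get replaced?"

# Letters only: the numbers are left alone
letters_out <- gsub("[A-Za-z]+", "X", x)
print(letters_out)
# [1] "...X 1-100 X X X?"

# Digits only: the words are left alone
digits_out <- gsub("[0-9]+", "X", x)
print(digits_out)
# [1] "...which X-X words get replaced?"
```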

Negate or Start ^ and End $

Use these characters to specify searches at the start ^ or end $ of the input string.

Start ^

How do we find which species start with the letter a? Use the start character. None of the names in our vector start with a, so R returns an empty vector, integer(0):

grep("^a",Species)
integer(0)

Negate ^

The same character (^) has a different meaning if used with a set []. In those cases, it negates, or finds the opposite.

For example, find species containing any character other than a:

grep("[^a]",Species)
[1] 1 2 3 4

Replace every letter except a or l

gsub("[^al]","X",Species)
[1] "XXXXXlaXa"     "XXXXXXXllaXXX" "XalXXaXXa"     "XXXXX"        

End $

Find species that end with a

grep("a$",Species)
[1] 1 3
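Using ^ and $ together forces the pattern to match the entire string, which is handy for checking that data were entered in the expected format. This sketch uses a made-up vector of ‘year’ entries and the {4} range from earlier in the chapter.

```r
# Hypothetical data entry values
years <- c("1999", "20", "20210", "year 2001")

# Exactly four digits and nothing else
whole_string <- grepl("^\\d{4}$", years)
print(whole_string)
# [1]  TRUE FALSE FALSE FALSE

# Without ^ and $, four digits anywhere in the string will match
anywhere <- grepl("\\d{4}", years)
print(anywhere)
# [1]  TRUE FALSE  TRUE  TRUE
```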

Capture ()

Regular parentheses are used to ‘capture’ text, which can then be specified in the replacement string using \\1. You can also capture multiple pieces of text and reorganize them by using the corresponding number: \\1 for the first set of (), \\2 for the second set of (), etc. Some examples should help.

Replace each word with its first letter

gsub("(\\w)\\w+","\\1",
     "...which 1-100 words get replaced?")
[1] "...w 1-1 w g r?"

Pull out only the numbers and reverse their order

gsub(".*([0-9]+)-([0-9]+).*",
     "\\2-\\1","...which 1-100 words get replaced?")
[1] "100-1"

Reverse first two letters of each word

gsub("(\\w)(\\w)(\\w+)","\\2\\1\\3",
     "...which 1-100 words get replaced?")
[1] "...hwich 1-010 owrds egt erplaced?"
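Here is one more sketch that is closer to a real data-cleaning task: abbreviating the genus in a species name. The full_names vector is made up for illustration, and the pattern assumes each name is exactly two words separated by one space.

```r
# Illustrative names: genus, one space, species
full_names <- c("Alliaria petiolata", "Lythrum salicaria")

# \\1 is the first letter of the genus, \\2 is the species name
short_names <- sub("^(\\w)\\w+ (\\w+)$", "\\1. \\2", full_names)
print(short_names)
# [1] "A. petiolata" "L. salicaria"
```

Names that do not fit the assumed two-word format are returned unchanged, because sub() only edits strings that match.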