Regular Expressions II

Introduction

In this chapter, we look at specific applications and examples of regular expressions. This includes some tips for practicing regular expressions in R Studio, gathering data from online websites, and practice exercises with DNA data.

Regex in R Studio

As noted above, there is a steep learning curve to understanding regular expressions. Getting clean Regex code to do what you want can often come down to trial-and-error. Sometimes, it’s impractical to run code and look at the output to see if your Regex code is doing what you want it to do. In R Studio, you test your regular expression in a text file, using the Find command from the Edit menu in R Studio (or Ctl + F on Windows or Cmd + F on MacOS).

Try opening a text file in R Studio and select Find from the Edit Menu. You’ll see some check boxes at the bottom of the search bar. One of these boxes is Regex. If you check this box (and uncheck all other boxes) you can use the Find command to highlight the text that is selected for your Regex command. Just make sure to use one escape character (\) for the search bar.

Recall that the double escape characters (\\) are specific to the R search string parameter of the regex functions (e.g., sub, gsub)

Scraping

Scraping is a method for collecting data from online sources. In R, we can use the functions readLines and curl(), both from the curl library, to copy data from websites. We can do this because websites with the .html or .xml extension are a special kind of text files.

For more advanced applications, we might want to use the rvest package, which is designed for html and xml files. However, we’ll focus here on curl() because we can apply regular expressions to any text files to extract information of interest. Here’s an example where we will scrape a record for the Green Fluorescent Protein (GFP) from the Protein Data Bank (PDB). Note that this is a file with the extension .pdb but this is a human-readable text file that can be opened in any text editor

First, we’ll import the text into an R object.

library(curl)

Using libcurl 7.84.0 with Schannel

You will have to use install.packages("curl") to download this package to your computer. You only need to do this once but you will have to use library(curl) whenever you want to use the functions, as explained in the R Fundamentals Chapter.

Now we can download a file to play with.

Prot<-readLines(curl("http://www.rcsb.org/pdb/files/1ema.pdb"))

Download this link to your computer and open with a text file to see what it looks like.

This hint is a simple trick to understand what kind of file(s) you are working with.

This is a tab-delimited file, which we could import as a data frame using read.delim but we’ll keep it this way to practice our regular expressions.

The Prot object we have made is a simple vector of strings, with each cell corresponding to a different row of text:

length(Prot)

[1] 2363

grep("TITLE",Prot)

[1] 2

We can pull out the amino acid sequences, which are rows that start with the word ‘ATOM’

AAseq<-Prot[grep("^ATOM",Prot)]
length(AAseq)

[1] 1717

AAseq[1]

[1] "ATOM      1  N   SER A   2      28.888   9.409  52.301  1.00 85.05           N  "

Challenge: Try to apply what you have learned about regular expressions to isolate the 3-letter amino acid code.

There are several ways we could do this. Take the time to think about it and give it a try.

Here’s one good option, since we know it’s a tab-delimited file with the amino acid in the 4th column:

gsub("ATOM\\t\\w+\\t\\w+\\t(\\w+).*","\\1",AAseq[1])

[1] "ATOM      1  N   SER A   2      28.888   9.409  52.301  1.00 85.05           N  "

That didn’t work. Sometimes the ‘tabs’ are actually just multiple ‘spaces’

AAchain<-gsub("ATOM\\s+\\w+\\s+\\w+\\s+(\\w+).*","\\1",AAseq)
AAchain[1:100]

  [1] "SER" "SER" "SER" "SER" "SER" "SER" "LYS" "LYS" "LYS" "LYS" "LYS" "LYS"
 [13] "LYS" "LYS" "LYS" "GLY" "GLY" "GLY" "GLY" "GLU" "GLU" "GLU" "GLU" "GLU"
 [25] "GLU" "GLU" "GLU" "GLU" "GLU" "GLU" "GLU" "GLU" "GLU" "GLU" "LEU" "LEU"
 [37] "LEU" "LEU" "LEU" "LEU" "LEU" "LEU" "PHE" "PHE" "PHE" "PHE" "PHE" "PHE"
 [49] "PHE" "PHE" "PHE" "PHE" "PHE" "THR" "THR" "THR" "THR" "THR" "THR" "THR"
 [61] "GLY" "GLY" "GLY" "GLY" "VAL" "VAL" "VAL" "VAL" "VAL" "VAL" "VAL" "VAL"
 [73] "VAL" "VAL" "VAL" "VAL" "VAL" "VAL" "PRO" "PRO" "PRO" "PRO" "PRO" "PRO"
 [85] "PRO" "ILE" "ILE" "ILE" "ILE" "ILE" "ILE" "ILE" "ILE" "LEU" "LEU" "LEU"
 [97] "LEU" "LEU" "LEU" "LEU"

Now we have a handy vector of amino acids representing our protein.

Examples

Let’s try practising with a couple of examples.

Transect Data

Regular expressions are also useful with data objects

Imagine you have a repeated measures design. 3 transects (A-C) and 3 positions along each transect (1-3). We can simulate this data by generating dandom numbers in a data frame.

Transect<-data.frame(Species=letters[1:20],
                     A1=rnorm(20), A2=rnorm(20), A3=rnorm(20),
                     B1=rnorm(20), B2=rnorm(20) ,B3=rnorm(20),
                     C1=rnorm(20), C2=rnorm(20), C3=rnorm(20))
head(Transect)

  Species         A1         A2         A3          B1          B2          B3
1       a -0.4762919 -0.8771969 -2.4134562  1.09793390 -0.22686803  0.93908458
2       b  0.5931594  0.2868677 -1.0057155 -0.48245954 -1.26291078 -0.13815786
3       c  0.5233956  0.3531411 -0.3127597  0.28584890 -0.03140898 -0.17422685
4       d -0.1398728 -0.1236981 -0.1956281 -0.01232568  0.41402493  0.02370992
5       e  0.8294952 -0.5536721  0.7487573 -0.43119837  0.90706270  2.10488473
6       f  1.0231336  0.2370539 -0.4418742 -0.97054591 -0.49717235  0.38836569
          C1         C2         C3
1 -1.3588022 -0.3143200 -1.1155445
2 -0.7981804  0.2417494  0.2025241
3 -0.3029387 -1.2293938 -0.2851082
4  0.2644057 -1.3592029 -0.5699135
5 -0.1300611  0.3638732 -1.7787850
6 -0.7249635 -0.2156055 -0.9796530

Tip: the object letters in the code above contains lower-case letter, while LETTERS contains upper case.

Challenge

Use your knowledge f regular expressions with subsetting data outlined in the R Fundamentals Chapter to do the following:

Subset only the columns that have an “A” in their name
Subset the rows of data for species “d”

Take the time to do this on your own. It will take you a while and you might make a lot of mistakes along the way. That’s all part of the learning process. The longer you struggle, the faster you will learn.

Now here is a more challenging example:

Genbank

Here is a line of code to import DNA from genbank. (The one line is broken up into three physical lines to make it easier to read)

Lythrum_18S<-scan(
  "https://colauttilab.github.io/RCrashCourse/sequence.gb",
  what="character",sep="\n")

This is the sequence of the 18S subunit from the ribosome gene of Lythrum salicaria (from Genbank)

print(Lythrum_18S)

 [1] "LOCUS       AF206955                1740 bp    DNA     linear   PLN 18-APR-2003"
 [2] "DEFINITION  Lythrum salicaria 18S ribosomal RNA gene, complete sequence."       
 [3] "ACCESSION   AF206955"                                                           
 [4] "VERSION     AF206955.1"                                                         
 [5] "KEYWORDS    ."                                                                  
 [6] "SOURCE      Lythrum salicaria"                                                  
 [7] "  ORGANISM  Lythrum salicaria"                                                  
 [8] "            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;" 
 [9] "            Spermatophyta; Magnoliopsida; eudicotyledons; Gunneridae;"          
[10] "            Pentapetalae; rosids; malvids; Myrtales; Lythraceae; Lythrum."      
[11] "REFERENCE   1  (bases 1 to 1740)"                                               
[12] "  AUTHORS   Soltis,P.S., Soltis,D.E. and Chase,M.W."                            
[13] "  TITLE     Direct Submission"                                                  
[14] "  JOURNAL   Submitted (19-NOV-1999) School of Biological Sciences, Washington"  
[15] "            State University, Pullman, WA 99164-4236, USA"                      
[16] "FEATURES             Location/Qualifiers"                                       
[17] "     source          1..1740"                                                   
[18] "                     /organism=\"Lythrum salicaria\""                           
[19] "                     /mol_type=\"genomic DNA\""                                 
[20] "                     /db_xref=\"taxon:13129\""                                  
[21] "                     /note=\"Lythrum salicaria L.\""                            
[22] "     rRNA            1..1740"                                                   
[23] "                     /product=\"18S ribosomal RNA\""                            
[24] "ORIGIN      "                                                                   
[25] "        1 gtcatatgct tgtctcaaag attaagccat gcatgtgtaa gtatgaacaa attcagactg"    
[26] "       61 tgaaactgcg aatggctcat taaatcagtt atagtttgtt tgatggtatc tgctactcgg"    
[27] "      121 ataaccgtag taattctaga gctaatacgt gcaacaaacc ccgacttctg gaagggacgc"    
[28] "      181 atttattaga taaaaggtcg acgcgggctt tgcccgatgc tctgatgatt catgataact"    
[29] "      241 tgacggatcg cacggccatc gtgccggcga cgcatcattc aaatttctgc cctatcaact"    
[30] "      301 ttcgatggta ggatagtggc ctaccatggt gtttacgggt aacggagaat tagggttcga"    
[31] "      361 ttccggagag ggagcctgag aaacggctac cacatccaag gaaggcagca ggcgcgcaaa"    
[32] "      421 ttacccaatc ctgacacggg gaggtagtga caataaataa caatactggg ctctttgagt"    
[33] "      481 ctggtaattg gaatgagtac aatctaaatc ccttaacgag gatccattgg agggcaagtc"    
[34] "      541 tggtgccagc agccgcggta attccagctc caatagcgta tatttaagtt gttgcagtta"    
[35] "      601 aaaagctcgt agttggacct tgggttgggt cgaccggtcc gcctttggtg tgcaccgatc"    
[36] "      661 ggctcgtccc ttctaccggc gatgcgcgcc tggccttaat tggccgggtc gttcctccgg"    
[37] "      721 tgctgttact ttgaagaaat tagagtgctc aaagcaagca ttagctatga atacattagc"    
[38] "      781 atgggataac attataggat tccgatccta ttatgttggc cttcgggatc ggagtaatga"    
[39] "      841 ttaacaggga cagtcggggg cattcgtatt tcatagtcag aggtgaaatt cttggattta"    
[40] "      901 tgaaagacga acaactgcga aagcatttgc caaggatgtt ttcattaatc aagaacgaaa"    
[41] "      961 gttgggggct cgaagacgat cagataccgt cctagtctca accataaacg atgccgacca"    
[42] "     1021 gggatcagcg aatgttactt ttaggacttc gctggcacct tatgagaaat caaagttttt"    
[43] "     1081 gggttccggg gggagtatgg tcgcaaggct gaaacttaaa ggaattgacg gaagggcacc"    
[44] "     1141 accaggagtg gagcctgcgg cttaatttga ctcaacacgg ggaaacttac caggtccaga"    
[45] "     1201 catagtaagg attgacagac tgagagctct ttcttgattc tatgggtggt ggtgcatggc"    
[46] "     1261 cgttcttagt tggtggagcg atttgtctgg ttaattccgt taacgaacga gacctcagcc"    
[47] "     1321 tgctaactag ctatgtggag gtacacctcc acggccagct tcttagaggg actatggccg"    
[48] "     1381 cttaggccaa ggaagtttga ggcaataaca ggtctgtgat gcccttagat gttctgggcc"    
[49] "     1441 gcacgcgcgc tacactgatg tattcaacga gtctatagcc ttggccgaca ggcccgggta"    
[50] "     1501 atctttgaaa tttcatcgtg atggggatag atcattgcaa ttgttggtct tcaacgagga"    
[51] "     1561 attcctagta agcgcgagtc atcagctcgc gttgactacg tccctgccct ttgtacacac"    
[52] "     1621 cgcccgtcgc tcctaccgat tgaatggtcc ggtgaaatgt tcggatcgcg gcgacgtggg"    
[53] "     1681 cgcttcgtcg ccgacgacgt cgcgagaagt ccattgaacc ttatcattta gaggaaggag"    
[54] "//"

Notice that each line is read in as a separate cell in a vector, with sequences beginning with a number ending with 1. We can take advantage of this to extract just the sequence data

Challenge

Before we move on, try to do the following:

Isolate only the rows containing DNA sequences. This should include

Removing all of the characters that are not a, t, g, or c.
Combining separate cells/lines into a single string. You can do this with using the paste() function with the collapse="" parameter

Convert lower-case to upper-case. To do this, you can use:

gsub("([actg])","\\U\\1",Seq,perl=T)

The \\U\\\1 means “paste brackets as upper-case”, and is only available as a Perl command, which is accessible in gsub() with the perl=T parameter.

Replace start codons (ATG) with -->START-->ATG
Insert >--STOP--| after any stop codons (TAA or TAG or TGA).

Take the time to struggle with this and try different combinations until you find a way through. The more you struggle, the faster you will learn.

A cool thing about regular expressions is that there is rarely a single right answer, especially for complicated problems. When you are ready, Continue on to see one possible solution.

Solutions

Transects

Subset only transect A for the first 3 species:

Transect[1:3,grep("A",names(Transect))]

          A1         A2         A3
1 -0.4762919 -0.8771969 -2.4134562
2  0.5931594  0.2868677 -1.0057155
3  0.5233956  0.3531411 -0.3127597

Subset the data for the species "d".

Transect[grep("d",Transect$Species),]

  Species         A1         A2         A3          B1        B2         B3
4       d -0.1398728 -0.1236981 -0.1956281 -0.01232568 0.4140249 0.02370992
         C1        C2         C3
4 0.2644057 -1.359203 -0.5699135

Genbank

First, isolate the DNA sequence:

Use .* with () to delete everything before the DNA sequence

Seq<-gsub(".*(1 [gatc])","",Lythrum_18S)

Use the .* and space with + to eliminate all text before the sequence

Seq<-gsub(".*ORIGIN +","",paste(Seq,collapse=""))

Eliminate spaces and the two // at the end

Seq<-gsub(" |//","",Seq)

Capital letters look nicer, but requires a PERL qualifier \\U in the replacement string. This is not standard in R regular expressions, but can be accessed with perl=T.

Seq<-gsub("([actg])","\\U\\1",Seq,perl=T)

Now that we have our formatted DNA sequence string, we can highlight start codons.

ORFs<-gsub("ATG","\n <ATG>-->START-->",Seq)

And stop codons too.

ORFs<-gsub("(TAA|TAG|TGA)","<\\1>--STOP--| \n",Seq)

Note the addition of the newline character (\n) to make the output more readable. To see the final modified string, use the cat() function to print the string directly, rather than creating an object containing the string:

cat(ORFs)

(Output not shown)

More Exercises

Here are some more exercises to practice your skills. No solutions are given, but you will know if you are correct if you get the desired output.

Email Spammer: Consider a vector of email addresses scraped from the internet.

robert ‘dot’ colautti ‘at’ queensu ‘dot’ ca
chris.eckert[at]queensu.ca
lonnie.aarssen at queensu.ca

Use regular expressions to convert all email addresses to the standard format: name@queensu.ca

Genetic Simulation: Start by creating a random sequence of DNA. Think way back to the R Fundamentals Chapter to find a function that can randomly sample from 4 base pairs to create vector of 1,000 bases.

Once you have your vector, try collapsing it into a single element (i.e., a string with 1,000 characters).

Hint: You can do this with the paste() command. You just need one special parameter.

Now, try each of the following challenges:

Replace T with U.
Find all start codons (AUG) and stop codons (UAA, UAG, UGA).
Find all open reading frames (hint: consider each sequence beginning with AUG and ending with a stop codon; how do you know if both sequences are in the same reading frame?).
Count the length (number of bases) for all open reading frames.

Regex Golf

Have fun! https://alf.nu/RegexGolf