library(curl)
Using libcurl 7.84.0 with Schannel
In this chapter, we look at specific applications and examples of regular expressions. This includes some tips for practicing regular expressions in R Studio, gathering data from online websites, and practice exercises with DNA data.
As noted above, there is a steep learning curve to understanding regular expressions. Getting clean Regex code to do what you want can often come down to trial-and-error. Sometimes, it’s impractical to run code and look at the output to see if your Regex code is doing what you want it to do. In R Studio, you test your regular expression in a text file, using the Find command from the Edit menu in R Studio (or Ctl + F on Windows or Cmd + F on MacOS).
Try opening a text file in R Studio and select Find from the Edit Menu. You’ll see some check boxes at the bottom of the search bar. One of these boxes is Regex. If you check this box (and uncheck all other boxes) you can use the Find command to highlight the text that is selected for your Regex command. Just make sure to use one escape character (\
) for the search bar.
Recall that the double escape characters (
\\
) are specific to the R search string parameter of the regex functions (e.g.,sub
,gsub
)
Scraping is a method for collecting data from online sources. In R, we can use the functions readLines
and curl()
, both from the curl
library, to copy data from websites. We can do this because websites with the .html
or .xml
extension are a special kind of text files.
For more advanced applications, we might want to use the rvest
package, which is designed for html
and xml
files. However, we’ll focus here on curl()
because we can apply regular expressions to any text files to extract information of interest. Here’s an example where we will scrape a record for the Green Fluorescent Protein (GFP) from the Protein Data Bank (PDB). Note that this is a file with the extension .pdb
but this is a human-readable text file that can be opened in any text editor
First, we’ll import the text into an R object.
You will have to use install.packages("curl")
to download this package to your computer. You only need to do this once but you will have to use library(curl)
whenever you want to use the functions, as explained in the R Fundamentals Chapter.
Now we can download a file to play with.
Download this link to your computer and open with a text file to see what it looks like.
This hint is a simple trick to understand what kind of file(s) you are working with.
This is a tab-delimited file, which we could import as a data frame using read.delim
but we’ll keep it this way to practice our regular expressions.
The Prot
object we have made is a simple vector of strings, with each cell corresponding to a different row of text:
We can pull out the amino acid sequences, which are rows that start with the word ‘ATOM’
[1] "ATOM 1 N SER A 2 28.888 9.409 52.301 1.00 85.05 N "
Challenge: Try to apply what you have learned about regular expressions to isolate the 3-letter amino acid code.
There are several ways we could do this. Take the time to think about it and give it a try.
Here’s one good option, since we know it’s a tab-delimited file with the amino acid in the 4th column:
[1] "ATOM 1 N SER A 2 28.888 9.409 52.301 1.00 85.05 N "
That didn’t work. Sometimes the ‘tabs’ are actually just multiple ‘spaces’
[1] "SER" "SER" "SER" "SER" "SER" "SER" "LYS" "LYS" "LYS" "LYS" "LYS" "LYS"
[13] "LYS" "LYS" "LYS" "GLY" "GLY" "GLY" "GLY" "GLU" "GLU" "GLU" "GLU" "GLU"
[25] "GLU" "GLU" "GLU" "GLU" "GLU" "GLU" "GLU" "GLU" "GLU" "GLU" "LEU" "LEU"
[37] "LEU" "LEU" "LEU" "LEU" "LEU" "LEU" "PHE" "PHE" "PHE" "PHE" "PHE" "PHE"
[49] "PHE" "PHE" "PHE" "PHE" "PHE" "THR" "THR" "THR" "THR" "THR" "THR" "THR"
[61] "GLY" "GLY" "GLY" "GLY" "VAL" "VAL" "VAL" "VAL" "VAL" "VAL" "VAL" "VAL"
[73] "VAL" "VAL" "VAL" "VAL" "VAL" "VAL" "PRO" "PRO" "PRO" "PRO" "PRO" "PRO"
[85] "PRO" "ILE" "ILE" "ILE" "ILE" "ILE" "ILE" "ILE" "ILE" "LEU" "LEU" "LEU"
[97] "LEU" "LEU" "LEU" "LEU"
Now we have a handy vector of amino acids representing our protein.
Let’s try practising with a couple of examples.
Regular expressions are also useful with data objects
Imagine you have a repeated measures design. 3 transects (A-C) and 3 positions along each transect (1-3). We can simulate this data by generating dandom numbers in a data frame.
Transect<-data.frame(Species=letters[1:20],
A1=rnorm(20), A2=rnorm(20), A3=rnorm(20),
B1=rnorm(20), B2=rnorm(20) ,B3=rnorm(20),
C1=rnorm(20), C2=rnorm(20), C3=rnorm(20))
head(Transect)
Species A1 A2 A3 B1 B2 B3
1 a -0.4762919 -0.8771969 -2.4134562 1.09793390 -0.22686803 0.93908458
2 b 0.5931594 0.2868677 -1.0057155 -0.48245954 -1.26291078 -0.13815786
3 c 0.5233956 0.3531411 -0.3127597 0.28584890 -0.03140898 -0.17422685
4 d -0.1398728 -0.1236981 -0.1956281 -0.01232568 0.41402493 0.02370992
5 e 0.8294952 -0.5536721 0.7487573 -0.43119837 0.90706270 2.10488473
6 f 1.0231336 0.2370539 -0.4418742 -0.97054591 -0.49717235 0.38836569
C1 C2 C3
1 -1.3588022 -0.3143200 -1.1155445
2 -0.7981804 0.2417494 0.2025241
3 -0.3029387 -1.2293938 -0.2851082
4 0.2644057 -1.3592029 -0.5699135
5 -0.1300611 0.3638732 -1.7787850
6 -0.7249635 -0.2156055 -0.9796530
Tip: the object
letters
in the code above contains lower-case letter, whileLETTERS
contains upper case.
Use your knowledge f regular expressions with subsetting data outlined in the R Fundamentals Chapter to do the following:
Take the time to do this on your own. It will take you a while and you might make a lot of mistakes along the way. That’s all part of the learning process. The longer you struggle, the faster you will learn.
Now here is a more challenging example:
Here is a line of code to import DNA from genbank. (The one line is broken up into three physical lines to make it easier to read)
This is the sequence of the 18S subunit from the ribosome gene of Lythrum salicaria (from Genbank)
[1] "LOCUS AF206955 1740 bp DNA linear PLN 18-APR-2003"
[2] "DEFINITION Lythrum salicaria 18S ribosomal RNA gene, complete sequence."
[3] "ACCESSION AF206955"
[4] "VERSION AF206955.1"
[5] "KEYWORDS ."
[6] "SOURCE Lythrum salicaria"
[7] " ORGANISM Lythrum salicaria"
[8] " Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;"
[9] " Spermatophyta; Magnoliopsida; eudicotyledons; Gunneridae;"
[10] " Pentapetalae; rosids; malvids; Myrtales; Lythraceae; Lythrum."
[11] "REFERENCE 1 (bases 1 to 1740)"
[12] " AUTHORS Soltis,P.S., Soltis,D.E. and Chase,M.W."
[13] " TITLE Direct Submission"
[14] " JOURNAL Submitted (19-NOV-1999) School of Biological Sciences, Washington"
[15] " State University, Pullman, WA 99164-4236, USA"
[16] "FEATURES Location/Qualifiers"
[17] " source 1..1740"
[18] " /organism=\"Lythrum salicaria\""
[19] " /mol_type=\"genomic DNA\""
[20] " /db_xref=\"taxon:13129\""
[21] " /note=\"Lythrum salicaria L.\""
[22] " rRNA 1..1740"
[23] " /product=\"18S ribosomal RNA\""
[24] "ORIGIN "
[25] " 1 gtcatatgct tgtctcaaag attaagccat gcatgtgtaa gtatgaacaa attcagactg"
[26] " 61 tgaaactgcg aatggctcat taaatcagtt atagtttgtt tgatggtatc tgctactcgg"
[27] " 121 ataaccgtag taattctaga gctaatacgt gcaacaaacc ccgacttctg gaagggacgc"
[28] " 181 atttattaga taaaaggtcg acgcgggctt tgcccgatgc tctgatgatt catgataact"
[29] " 241 tgacggatcg cacggccatc gtgccggcga cgcatcattc aaatttctgc cctatcaact"
[30] " 301 ttcgatggta ggatagtggc ctaccatggt gtttacgggt aacggagaat tagggttcga"
[31] " 361 ttccggagag ggagcctgag aaacggctac cacatccaag gaaggcagca ggcgcgcaaa"
[32] " 421 ttacccaatc ctgacacggg gaggtagtga caataaataa caatactggg ctctttgagt"
[33] " 481 ctggtaattg gaatgagtac aatctaaatc ccttaacgag gatccattgg agggcaagtc"
[34] " 541 tggtgccagc agccgcggta attccagctc caatagcgta tatttaagtt gttgcagtta"
[35] " 601 aaaagctcgt agttggacct tgggttgggt cgaccggtcc gcctttggtg tgcaccgatc"
[36] " 661 ggctcgtccc ttctaccggc gatgcgcgcc tggccttaat tggccgggtc gttcctccgg"
[37] " 721 tgctgttact ttgaagaaat tagagtgctc aaagcaagca ttagctatga atacattagc"
[38] " 781 atgggataac attataggat tccgatccta ttatgttggc cttcgggatc ggagtaatga"
[39] " 841 ttaacaggga cagtcggggg cattcgtatt tcatagtcag aggtgaaatt cttggattta"
[40] " 901 tgaaagacga acaactgcga aagcatttgc caaggatgtt ttcattaatc aagaacgaaa"
[41] " 961 gttgggggct cgaagacgat cagataccgt cctagtctca accataaacg atgccgacca"
[42] " 1021 gggatcagcg aatgttactt ttaggacttc gctggcacct tatgagaaat caaagttttt"
[43] " 1081 gggttccggg gggagtatgg tcgcaaggct gaaacttaaa ggaattgacg gaagggcacc"
[44] " 1141 accaggagtg gagcctgcgg cttaatttga ctcaacacgg ggaaacttac caggtccaga"
[45] " 1201 catagtaagg attgacagac tgagagctct ttcttgattc tatgggtggt ggtgcatggc"
[46] " 1261 cgttcttagt tggtggagcg atttgtctgg ttaattccgt taacgaacga gacctcagcc"
[47] " 1321 tgctaactag ctatgtggag gtacacctcc acggccagct tcttagaggg actatggccg"
[48] " 1381 cttaggccaa ggaagtttga ggcaataaca ggtctgtgat gcccttagat gttctgggcc"
[49] " 1441 gcacgcgcgc tacactgatg tattcaacga gtctatagcc ttggccgaca ggcccgggta"
[50] " 1501 atctttgaaa tttcatcgtg atggggatag atcattgcaa ttgttggtct tcaacgagga"
[51] " 1561 attcctagta agcgcgagtc atcagctcgc gttgactacg tccctgccct ttgtacacac"
[52] " 1621 cgcccgtcgc tcctaccgat tgaatggtcc ggtgaaatgt tcggatcgcg gcgacgtggg"
[53] " 1681 cgcttcgtcg ccgacgacgt cgcgagaagt ccattgaacc ttatcattta gaggaaggag"
[54] "//"
Notice that each line is read in as a separate cell in a vector, with sequences beginning with a number ending with 1. We can take advantage of this to extract just the sequence data
Before we move on, try to do the following:
paste()
function with the collapse=""
parameterThe \\U\\\1
means “paste brackets as upper-case”, and is only available as a Perl command, which is accessible in gsub()
with the perl=T
parameter.
Replace start codons (ATG) with -->START-->ATG
Insert >--STOP--|
after any stop codons (TAA or TAG or TGA).
Take the time to struggle with this and try different combinations until you find a way through. The more you struggle, the faster you will learn.
A cool thing about regular expressions is that there is rarely a single right answer, especially for complicated problems. When you are ready, Continue on to see one possible solution.
Subset only transect A for the first 3 species:
A1 A2 A3
1 -0.4762919 -0.8771969 -2.4134562
2 0.5931594 0.2868677 -1.0057155
3 0.5233956 0.3531411 -0.3127597
Subset the data for the species "d"
.
First, isolate the DNA sequence:
.*
with ()
to delete everything before the DNA sequence.*
and space with +
to eliminate all text before the sequence//
at the end\\U
in the replacement string. This is not standard in R regular expressions, but can be accessed with perl=T
.Now that we have our formatted DNA sequence string, we can highlight start codons.
And stop codons too.
Note the addition of the newline character (\n
) to make the output more readable. To see the final modified string, use the cat()
function to print the string directly, rather than creating an object containing the string:
(Output not shown)
Here are some more exercises to practice your skills. No solutions are given, but you will know if you are correct if you get the desired output.
Use regular expressions to convert all email addresses to the standard format: name@queensu.ca
Once you have your vector, try collapsing it into a single element (i.e., a string with 1,000 characters).
Hint: You can do this with the
paste()
command. You just need one special parameter.
Now, try each of the following challenges:
Have fun! https://alf.nu/RegexGolf