Regular Expressions
aka ‘regex’ and ‘regexp’
a sort of find-and-replace for nerds
one of the most powerful data tools I have ever learned
requires patience and lots of practice
The re module in Python contains functions for regular expressions
re.match() - match pattern at the start of target string
re.search() - find first match of pattern within target string
re.findall() - find all non-overlapping matches of pattern within target string
re.split() - split target string according to pattern
re.sub() - replace occurrences of pattern within string
re.match()
Returns None if the pattern does not match at the start of the string
import re
m1=re.match("salicaria","salicaria")
print(m1)## <_sre.SRE_Match object; span=(0, 9), match='salicaria'>
m2=re.match("a","salicaria")
print(m2)## None
m3=re.match("z","salicaria")
print(m3)## None
re.search()
Contrast output with re.match()
import re
m1=re.search("salicaria","salicaria")
print(m1)## <_sre.SRE_Match object; span=(0, 9), match='salicaria'>
m2=re.search("a","salicaria")
print(m2)## <_sre.SRE_Match object; span=(1, 2), match='a'>
m3=re.search("z","salicaria")
print(m3)## None
What is different?
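One way to see the difference is to run the same pattern through both functions. This is a minimal sketch using only the standard re module; re.fullmatch() is not covered in these notes and is included only for comparison.

```python
import re

target = "salicaria"

# re.match() only succeeds if the pattern matches at position 0
at_start = re.match("a", target)

# re.search() scans forward and returns the first match anywhere
anywhere = re.search("a", target)

# re.fullmatch() requires the pattern to cover the entire string
prefix_only = re.fullmatch("sal", target)
whole = re.fullmatch("salicaria", target)

print(at_start)          # None
print(anywhere.span())   # (1, 2)
print(prefix_only)       # None
print(whole.group())     # salicaria
```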
re.findall()
Note [] instead of None when no match is found
import re
m1=re.findall("salicaria","salicaria")
print(m1)## ['salicaria']
m2=re.findall("a","salicaria")
print(m2)## ['a', 'a', 'a']
m3=re.findall("z","salicaria")
print(m3)## []
re.split()
Compare this output with previous functions
import re
m1=re.split("salicaria","salicaria")
print(m1)## ['', '']
m2=re.split("a","salicaria")
print(m2)## ['s', 'lic', 'ri', '']
m3=re.split("z","salicaria")
print(m3)## ['salicaria']
What is different?
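re.split() returns the pieces of the string between matches, so the matched text itself disappears from the output. A short sketch of two standard options that the examples above do not use, a capturing group and the maxsplit argument:

```python
import re

target = "salicaria"

# Plain split: the matched "a" characters are dropped
parts = re.split("a", target)

# Wrapping the pattern in () keeps the matched text in the result
kept = re.split("(a)", target)

# maxsplit limits the number of splits that are made
limited = re.split("a", target, maxsplit=1)

print(parts)    # ['s', 'lic', 'ri', '']
print(kept)     # ['s', 'a', 'lic', 'a', 'ri', 'a', '']
print(limited)  # ['s', 'licaria']
```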
re.sub()
For this function, we have to add a replacement string parameter
import re
m1=re.sub("salicaria","X","salicaria")
print(m1)## X
m2=re.sub("a","X","salicaria")
print(m2)## sXlicXriX
m3=re.sub("z","X","salicaria")
print(m3)## salicaria
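By default re.sub() replaces every match. If only the first few should change, the optional count argument can be used, as in this small sketch:

```python
import re

# count limits how many matches are replaced (0, the default, means all)
first_only = re.sub("a", "X", "salicaria", count=1)
first_two = re.sub("a", "X", "salicaria", count=2)

print(first_only)  # sXlicaria
print(first_two)   # sXlicXria
```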
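\
The escape character tells the regex engine that the next character is not to be interpreted literally. Unlike regular expressions in R, Python needs only a single backslash; writing the pattern as a raw string (for example r"\w") keeps Python from treating the backslash as a string escape first.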
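To make the backslash behaviour concrete, here is a minimal sketch comparing a raw string with a normal string. The variable names are only for illustration.

```python
import re

text = "...which 1-100 words get replaced?"

# In a raw string the backslash reaches the regex engine unchanged
raw_pattern = r"\d"
# In a normal string the backslash itself has to be escaped
escaped_pattern = "\\d"

print(raw_pattern == escaped_pattern)   # True
print(re.sub(raw_pattern, "*", text))   # ...which *-*** words get replaced?

# "\t" in a normal string is already a real tab; r"\t" is backslash + t
print(len("\t"), len(r"\t"))            # 1 2
```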
\w
All letters, digits and the underscore (aka ‘word’ characters)
from re import sub
m1=sub("w","*","...which 1-100 words get replaced?")
print(m1)## ...*hich 1-100 *ords get replaced?
m2=sub(r"\w","*","...which 1-100 words get replaced?")
print(m2)## ...***** *-*** ***** *** ********?
\W
Upper-case \W is the inverse of lower-case \w (anything that is not a letter, digit or underscore)
m1=sub(r"\W","*","...which 1-100 words get replaced?")
print(m1)## ***which*1*100*words*get*replaced*
\s
Whitespace (space, tab, newline)
m1=sub(r"\s","*","...which 1-100 words get replaced?")
print(m1)## ...which*1-100*words*get*replaced?
\t
Tab character (useful for tab-delimited data files)
m1=sub("\t","*","...which 1-100 words get replaced?")
print(m1)## ...which 1-100 words get replaced?
\d
Digits (numbers)
m1=sub(r"\d","*","...which 1-100 words get replaced?")
print(m1)## ...which *-*** words get replaced?
\D
Inverse of \d (non-digits)
m1=sub(r"\D","*","...which 1-100 words get replaced?")
print(m1)## *********1*100********************
. = any character except newline
m1=sub(".","*","...which 1-100 words get replaced?")
print(m1)## **********************************
So how to search for a period? Use the escape character
m1=sub(r"\.","*","...which 1-100 words get replaced?")
print(m1)## ***which 1-100 words get replaced?
\r
Carriage return
\n
Newline character
Unix/Mac files – lines usually end with \n only
Windows/DOS files – lines usually end with \r\n
FUN FACT: Where does the term ‘Carriage return’ come from?
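When text comes from files written on different systems, the two line-ending styles can be mixed. A minimal sketch of normalising them with re.sub(); the sample string is made up for illustration.

```python
import re

# Hypothetical file contents with mixed line endings
mixed = "first\r\nsecond\nthird\r\n"

# Replace Windows-style \r\n with Unix-style \n
unix_style = re.sub(r"\r\n", "\n", mixed)

print(repr(unix_style))        # 'first\nsecond\nthird\n'
print(unix_style.split("\n"))  # ['first', 'second', 'third', '']
```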
|
The vertical bar or ‘pipe’ is often located above the backslash (shift-backslash). It is the regex equivalent of ‘or’
For example, look for w or e
m1=sub("w|e","*","...which 1-100 words get replaced?")
print(m1)## ...*hich 1-100 *ords g*t r*plac*d?
*, ?, + and {}
Use *, ?, + and {} for more complicated searches
Look at these examples carefully
m1=sub(r"\w","*","...which 1-100 words get replaced?")
print(m1)## ...***** *-*** ***** *** ********?
+
1 or more occurrences
m1=sub(r"\w+","*","...which 1-100 words get replaced?")
print(m1)## ...* *-* * * *?
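?
Zero or one occurrence of the preceding item. (Placed after another quantifier, as in \w+?, it instead makes that quantifier ‘lazy’, so it matches as few characters as possible.) Because \w? can also match nothing, asterisks also appear where no word character is found, here before each period and at the end of the string. The output shown is from Python 3.6 or earlier; from Python 3.7 on, re.sub() also replaces the empty match that directly follows a non-empty match, so extra asterisks appear.
m1=sub(r"\w?","*","...which 1-100 words get replaced?")
print(m1)## *.*.*.***** *-*** ***** *** ********?*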
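*
Zero or more occurrences of the preceding item. Quantifiers such as * and + are ‘greedy’: they match as many characters as possible. The output shown is from Python 3.6 or earlier; Python 3.7 and later also replace the empty match directly after each non-empty match, which adds further asterisks.
m1=sub(r"\w*","*","...which 1-100 words get replaced?")
print(m1)## *.*.*.* *-* * * *?*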
{n,m}
Find between n and m matches
m1=sub(r"\w{3,4}","*","...which 1-100 words get replaced?")
print(m1)## ...*h 1-* *s * **?
{n} = exactly n matches
m1=sub(r"\w{3}","*","...which 1-100 words get replaced?")
print(m1)## ...*ch 1-* *ds * **ed?
{n,} = n or more matches
m1=sub(r"\w{4,}","*","...which 1-100 words get replaced?")
print(m1)## ...* 1-100 * get *?
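The difference between greedy and lazy quantifiers is easiest to see side by side. A minimal sketch on the same example string, using re.findall() and re.search() from earlier in these notes:

```python
import re

text = "...which 1-100 words get replaced?"

# Greedy: \w+ takes the longest possible run of word characters
greedy_words = re.findall(r"\w+", text)
# Lazy: \w+? stops as soon as one character satisfies the pattern
lazy_words = re.findall(r"\w+?", text)

print(greedy_words)     # ['which', '1', '100', 'words', 'get', 'replaced']
print(lazy_words[:5])   # ['w', 'h', 'i', 'c', 'h']

# The difference matters most when something must follow the match
greedy_span = re.search(r"g.*e", text).group()
lazy_span = re.search(r"g.*?e", text).group()

print(greedy_span)      # get replace
print(lazy_span)        # ge
```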
[]
Use square brackets to match any one of the characters listed inside them.
m1=sub("which","*","...which 1-100 words get replaced?")
print(m1)## ...* 1-100 words get replaced?
m2=sub("[which]","*","...which 1-100 words get replaced?")
print(m2)## ...***** 1-100 *ords get repla*ed?
-
Use dash for a range of numbers
m1=sub("[1-4]","*","1234567890")
print(m1)## ****567890
or letters
m1=sub("[a-z]","*","AaBbCcDd")
print(m1)## A*B*C*D*
m2=sub("[A-Z]","*","AaBbCcDd")
print(m2)## *a*b*c*d
m3=sub("[A-z]","*","AaBbCcDd")
print(m3)## ********
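A word of caution about [A-z]: the range follows ASCII order, and a few punctuation characters sit between Z and a. The sketch below uses a made-up sample string to show the difference from listing both letter ranges explicitly.

```python
import re

sample = "Aa_Bb^Cc"

# [A-z] spans ASCII codes 65-122, which includes [ \ ] ^ _ and `
too_wide = re.sub(r"[A-z]", "*", sample)
# [A-Za-z] lists the two letter ranges explicitly
letters_only = re.sub(r"[A-Za-z]", "*", sample)

print(too_wide)       # ********
print(letters_only)   # **_**^**
```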
^Start and end of line$
^
Start of line
Find species starting with “s”
m1=sub("^s","*","start of lines")
print(m1)## *tart of lines
[^]
IMPORTANT: ^ also ‘negates’ when used within []
Replace every character other than s
m1=sub("[^s]","*","start of lines")
print(m1)## s************s
$
End of line
Find species ending with “s”
m1=sub("s$","*","start of lines")
print(m1)## start of line*
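By default ^ and $ anchor to the start and end of the whole string rather than to each line. For text with several lines, the standard re.MULTILINE flag changes that. A minimal sketch with a made-up two-line string:

```python
import re

# Hypothetical two-line string
text = "start of lines\nsecond lines"

# Default: ^ only matches at the very start of the string
default_start = re.sub(r"^s", "*", text)
# With re.MULTILINE, ^ and $ match at the start and end of every line
every_start = re.sub(r"^s", "*", text, flags=re.MULTILINE)
every_end = re.sub(r"s$", "*", text, flags=re.MULTILINE)

print(repr(default_start))  # '*tart of lines\nsecond lines'
print(repr(every_start))    # '*tart of lines\n*econd lines'
print(repr(every_end))      # 'start of line*\nsecond line*'
```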
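()
Capture text using () and reprint using \\1, \\2, etc. (or \1, \2 inside a raw string such as r"\1")
Keep only the captured text; because the leading .* is greedy, just the last two word characters are left for the group
m1=sub(r".*(\w\w+).*","\\1","...which 1-100 words get replaced?")
print(m1)## ed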
print(m1)## ed
Pull out only the numbers and reverse their order
m1=sub(".*([0-9]+)-([0-9]+).*","\\2-\\1","...which 1-100 words get replaced?")
print(m1)## 100-1
Reverse first two letters of each ‘word’ containing 3+ characters
m1=sub(r"(\w)(\w)(\w+)","\\2\\1\\3","...which 1-100 words get replaced?")
print(m1)## ...hwich 1-010 owrds egt erplaced?
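The number-swapping pattern above works because the first number has a single digit. With a longer first number, the greedy .* takes every digit except the last. A sketch with a made-up input string, comparing it with a lazy .*? and using raw strings for the backreferences:

```python
import re

text = "pages 21-100"   # hypothetical input with a two-digit first number

# The leading .* is greedy, so it swallows the "2" and leaves only "1"
greedy = re.sub(r".*([0-9]+)-([0-9]+).*", r"\2-\1", text)
# A lazy .*? gives the digits back to the first capture group
lazy = re.sub(r".*?([0-9]+)-([0-9]+).*", r"\2-\1", text)

print(greedy)  # 100-1
print(lazy)    # 100-21
```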
group() - find matching group
As you can see above, re.match() and re.search() return an object if there is a match. What if we want to see what was matched? Use .group()
import re
m1=re.match("salicaria","salicaria")
print(m1.group())## salicaria
But group() produces an error if no match was generated:
import re
m2=re.match("z","salicaria")
print(m2)## None
print(m2.group())## AttributeError: 'NoneType' object has no attribute 'group'
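A common way to avoid that error is to test the result before calling group(). A minimal sketch; first_match is a hypothetical helper name, not part of the re module.

```python
import re

def first_match(pattern, target):
    """Return the matched text, or None when the pattern is not found."""
    m = re.search(pattern, target)
    # A match object is always truthy; None is falsy
    if m:
        return m.group()
    return None

print(first_match("a", "salicaria"))  # a
print(first_match("z", "salicaria"))  # None
```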
groups() - find matching subgroups
Parentheses can be used to define subgroups within a pattern, and groups() then returns the text matched by each one.
import re
m1=re.match("sal","salicaria")
print(m1)## <_sre.SRE_Match object; span=(0, 3), match='sal'>
print(m1.groups())## ()
m2=re.match("(s)(a)(l)","salicaria")
print(m2)## <_sre.SRE_Match object; span=(0, 3), match='sal'>
print(m2.groups())## ('s', 'a', 'l')
print(m2.group(1))## s
print(m2.group(2))## a
print(m2.group(3))## l
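The same methods work on the result of re.search(). As a further sketch on the earlier example string, group(0) returns the whole match, group() accepts several group numbers at once, and span() reports where a group was found:

```python
import re

m = re.search(r"([0-9]+)-([0-9]+)", "...which 1-100 words get replaced?")

print(m.group(0))     # 1-100 (the whole match, same as m.group())
print(m.groups())     # ('1', '100')
print(m.group(2, 1))  # ('100', '1')
print(m.span(2))      # (11, 14)
```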
Use regular expressions to convert all email addresses to the standard format: name@queensu.ca
MySeq="ATGTGTGATAGATATAGTTTATAG"
* Replace T with U
* Find all start codons (AUG) and stop codons (UAA, UAG, UGA)
* Find all open reading frames (hint: consider each sequence beginning with AUG and ending with a stop codon; how do you know if both sequences are in the same reading frame?)
* Count the length in bp of each open reading frame
Have fun!