Regular Expressions
aka ‘regex’ and ‘regexp’
a sort of find-and-replace for nerds
one of the most powerful data tools I have ever learned
requires patience and lots of practice
The re module in Python contains functions for regular expressions
re
re.match() - match pattern at the start of the target string
re.search() - find first match of pattern anywhere within target string
re.findall() - find all non-overlapping matches of pattern within target string
re.split() - split target string according to pattern
re.sub() - replace occurrences of pattern within string
re.match()
Returns None if the pattern does not match at the start of the string
import re
m1=re.match("salicaria","salicaria")
print(m1)
## <_sre.SRE_Match object; span=(0, 9), match='salicaria'>
m2=re.match("a","salicaria")
print(m2)
## None
m3=re.match("z","salicaria")
print(m3)
## None
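re.match() only needs the pattern to fit at the start of the string, so it can succeed even when the pattern covers just a prefix. As a minimal sketch (re.fullmatch() is not used elsewhere in these notes; it is part of the standard re module since Python 3.4), the following contrasts the two:

```python
import re

# re.match() succeeds as long as the pattern fits at the START of the string
prefix = re.match("sal", "salicaria")
print(prefix is not None)        # True: "sal" is a prefix of "salicaria"

# re.fullmatch() only succeeds if the pattern covers the WHOLE string
whole_partial = re.fullmatch("sal", "salicaria")
whole_exact = re.fullmatch("salicaria", "salicaria")
print(whole_partial)             # None
print(whole_exact is not None)   # True
```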
re.search()
Contrast output with re.match()
import re
m1=re.search("salicaria","salicaria")
print(m1)
## <_sre.SRE_Match object; span=(0, 9), match='salicaria'>
m2=re.search("a","salicaria")
print(m2)
## <_sre.SRE_Match object; span=(1, 2), match='a'>
m3=re.search("z","salicaria")
print(m3)
## None
What is different?
re.findall()
Note [] instead of None when no match is found
import re
m1=re.findall("salicaria","salicaria")
print(m1)
## ['salicaria']
m2=re.findall("a","salicaria")
print(m2)
## ['a', 'a', 'a']
m3=re.findall("z","salicaria")
print(m3)
## []
re.split()
Compare this output with previous functions
import re
m1=re.split("salicaria","salicaria")
print(m1)
## ['', '']
m2=re.split("a","salicaria")
print(m2)
## ['s', 'lic', 'ri', '']
m3=re.split("z","salicaria")
print(m3)
## ['salicaria']
What is different?
re.sub()
re.sub() takes an extra parameter, the replacement string, placed between the pattern and the target string
import re
m1=re.sub("salicaria","X","salicaria")
print(m1)
## X
m2=re.sub("a","X","salicaria")
print(m2)
## sXlicXriX
m3=re.sub("z","X","salicaria")
print(m3)
## salicaria
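By default re.sub() replaces every occurrence of the pattern. A small sketch, assuming you only want the first occurrence replaced, uses the optional count argument of re.sub():

```python
import re

# Default: every "a" is replaced
all_replaced = re.sub("a", "X", "salicaria")
# count=1: stop after the first replacement
first_only = re.sub("a", "X", "salicaria", count=1)

print(all_replaced)  # sXlicXriX
print(first_only)    # sXlicaria
```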
\
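The escape character tells the regex engine that the next character is not to be interpreted literally. Unlike regular expressions in R, Python uses only a single backslash. Writing patterns as raw strings, e.g. r"\w", stops Python itself from processing the backslash before the regex engine sees it.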
\w
All letters, digits and the underscore (aka ‘word’ characters)
from re import sub
m1=sub("w","*","...which 1-100 words get replaced?")
print(m1)
## ...*hich 1-100 *ords get replaced?
m2=sub("\w","*","...which 1-100 words get replaced?")
print(m2)
## ...***** *-*** ***** *** ********?
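Patterns containing backslashes are usually safer written as raw strings (the r"..." prefix), so that Python passes the backslash through to the regex engine untouched. A short sketch of the difference, using a made-up test string:

```python
import re

# In a raw string the backslash is kept as-is, so the regex engine sees \w
pattern = r"\w"
print(len(pattern))            # 2 characters: backslash and w

# "\n" is ONE character (a newline); r"\n" is TWO (backslash, n)
print(len("\n"), len(r"\n"))   # 1 2

# \w also matches the underscore, not only letters and digits
masked = re.sub(pattern, "*", "a-1 b_2")
print(masked)                  # *-* ***
```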
\W
Upper case \W means the inverse of lower-case \w (any character that is not a letter, digit or underscore)
m1=sub("\W","*","...which 1-100 words get replaced?")
print(m1)
## ***which*1*100*words*get*replaced*
\s
Whitespace (spaces, tabs and newlines)
m1=sub("\s","*","...which 1-100 words get replaced?")
print(m1)
## ...which*1-100*words*get*replaced?
\t
Tab character (useful for tab-delimited data files)
m1=sub("\t","*","...which 1-100 words get replaced?")
print(m1)
## ...which 1-100 words get replaced?
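The string above contains no tab, so nothing is replaced. As a sketch of the tab-delimited case, here is a hypothetical record (the field values are invented for illustration) split on \t:

```python
import re

# A tab-delimited record; "\t" inside a normal string is a real tab character
record = "Lythrum\tsalicaria\t42"
fields = re.split(r"\t", record)
print(fields)  # ['Lythrum', 'salicaria', '42']
```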
\d
Digits (numbers)
m1=sub("\d","*","...which 1-100 words get replaced?")
print(m1)
## ...which *-*** words get replaced?
\D
Inverse of \d (non-digits)
m1=sub("\D","*","...which 1-100 words get replaced?")
print(m1)
## *********1*100********************
.
= any character except newline
m1=sub(".","*","...which 1-100 words get replaced?")
print(m1)
## **********************************
So how to search for a period? Use the escape character
m1=sub("\.","*","...which 1-100 words get replaced?")
print(m1)
## ***which 1-100 words get replaced?
\r
Carriage return
\n
Newline character
Unix/Mac files – lines usually end with \n only
Windows/DOS files – lines usually end with \r\n
FUN FACT: Where does the term ‘Carriage return’ come from?
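Files that mix both line-ending styles can be handled with a single pattern. A minimal sketch, using an invented three-line string, makes the carriage return optional:

```python
import re

# Hypothetical text mixing Windows (\r\n) and Unix (\n) line endings
text = "line one\r\nline two\nline three"
# \r? makes the carriage return optional, so both styles are handled
lines = re.split(r"\r?\n", text)
print(lines)  # ['line one', 'line two', 'line three']
```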
|
The vertical bar or ‘pipe’ is often located above the backslash (shift-backslash). It is the regex equivalent of or
For example, look for w or e
m1=sub("w|e","*","...which 1-100 words get replaced?")
print(m1)
## ...*hich 1-100 *ords g*t r*plac*d?
*, ?, + and {}
Use *, ?, + and {} for more complicated searches
Look at these examples carefully
m1=sub("\w","*","...which 1-100 words get replaced?")
print(m1)
## ...***** *-*** ***** *** ********?
+
1 or more occurrences
m1=sub("\w+","*","...which 1-100 words get replaced?")
print(m1)
## ...* *-* * * *?
?
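0 or 1 occurrences (the preceding item is optional). Placed after another quantifier, as in +? or *?, it instead makes that quantifier ‘lazy’ (match as few characters as possible)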
m1=sub("\w?","*","...which 1-100 words get replaced?")
print(m1)
## *.*.*.***** *-*** ***** *** ********?*
*
0 or more occurrences. Like +, it is ‘greedy’: it matches as many characters as possible
m1=sub("\w*","*","...which 1-100 words get replaced?")
print(m1)
## *.*.*.* *-* * * *?*
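The two roles of ? are easy to confuse: on its own it makes the preceding item optional, while after another quantifier it switches that quantifier from greedy to lazy. A short sketch with made-up strings:

```python
import re

# ? as a quantifier: the "u" is optional (0 or 1 occurrences)
spellings = re.findall(r"colou?r", "color and colour")
print(spellings)  # ['color', 'colour']

tags = "<a><b>"
# Greedy: .+ grabs as much as possible, so there is one long match
greedy = re.findall(r"<.+>", tags)
# Lazy: .+? grabs as little as possible, so there is one match per tag
lazy = re.findall(r"<.+?>", tags)
print(greedy)  # ['<a><b>']
print(lazy)    # ['<a>', '<b>']
```

Separately, the exact output of the \w? and \w* examples above depends on the Python version: since Python 3.7, re.sub() also replaces an empty match that directly follows a non-empty one, so newer versions print extra * characters.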
{n,m}
Find between n and m repetitions of the preceding item
m1=sub("\w{3,4}","*","...which 1-100 words get replaced?")
print(m1)
## ...*h 1-* *s * **?
{n}
= exactly n matches
m1=sub("\w{3}","*","...which 1-100 words get replaced?")
print(m1)
## ...*ch 1-* *ds * **ed?
{n,}
= n or more matches
m1=sub("\w{4,}","*","...which 1-100 words get replaced?")
print(m1)
## ...* 1-100 * get *?
[]
Use square brackets to match any one of the characters listed inside them.
m1=sub("which","*","...which 1-100 words get replaced?")
print(m1)
## ...* 1-100 words get replaced?
m2=sub("[which]","*","...which 1-100 words get replaced?")
print(m2)
## ...***** 1-100 *ords get repla*ed?
-
Use dash for a range of numbers
m1=sub("[1-4]","*","1234567890")
print(m1)
## ****567890
or letters
m1=sub("[a-z]","*","AaBbCcDd")
print(m1)
## A*B*C*D*
m2=sub("[A-Z]","*","AaBbCcDd")
print(m2)
## *a*b*c*d
m3=sub("[A-z]","*","AaBbCcDd")
print(m3)
## ********
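The [A-z] range works on the string above, but it is wider than it looks: in ASCII, several punctuation characters sit between "Z" and "a". A small sketch with an invented test string shows the difference from [A-Za-z]:

```python
import re

# ASCII places [ \ ] ^ _ ` between "Z" and "a", so [A-z] matches them too
sample = "A_z^"
too_wide = re.findall(r"[A-z]", sample)
letters_only = re.findall(r"[A-Za-z]", sample)
print(too_wide)      # ['A', '_', 'z', '^']
print(letters_only)  # ['A', 'z']
```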
^ and $
Start and end of line
^
Start of line
Find text starting with “s”
m1=sub("^s","*","start of lines")
print(m1)
## *tart of lines
[^]
IMPORTANT: ^ also ‘negates’ when used within []
Replace every character other than s
m1=sub("[^s]","*","start of lines")
print(m1)
## s************s
$
End of line
Find text ending with “s”
m1=sub("s$","*","start of lines")
print(m1)
## start of line*
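In a string that contains several lines, ^ and $ by default anchor only to the start and end of the whole string. As a sketch (the three-line text is invented for illustration), the re.MULTILINE flag makes them apply to every line:

```python
import re

# Hypothetical multi-line text, one species epithet per line
text = "salicaria\nvulgaris\nsylvatica"

# Without a flag, ^ only matches at the very start of the string
default = re.findall(r"^s\w+", text)
# With re.MULTILINE, ^ also matches right after each newline
per_line = re.findall(r"^s\w+", text, flags=re.MULTILINE)
print(default)   # ['salicaria']
print(per_line)  # ['salicaria', 'sylvatica']
```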
()
Capture text using () and reprint using \\1, \\2, etc
Keep only the last two word characters (the greedy .* consumes everything before them)
m1=sub(".*(\w\w+).*","\\1","...which 1-100 words get replaced?")
print(m1)
## ed
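The pattern above collapses the whole string to "ed" because the leading .* is greedy and the match spans the entire string. One possible way to keep the first letter of every word instead (a sketch, not necessarily the pattern originally intended) is to drop the .* and capture one character per word:

```python
import re

text = "...which 1-100 words get replaced?"
# (\w) captures the first word character; \w* consumes the rest of the word
initials = re.sub(r"(\w)\w*", r"\1", text)
print(initials)  # ...w 1-1 w g r?
```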
Pull out only the numbers and reverse their order
m1=sub(".*([0-9]+)-([0-9]+).*","\\2-\\1","...which 1-100 words get replaced?")
print(m1)
## 100-1
Reverse first two letters of each ‘word’ containing 3+ characters
m1=sub("(\w)(\w)(\w+)","\\2\\1\\3","...which 1-100 words get replaced?")
print(m1)
## ...hwich 1-010 owrds egt erplaced?
group() - find matching group
As you can see above, re.match() and re.search() return an object if there is a match. What if we want to see what was matched? Use .group()
import re
m1=re.match("salicaria","salicaria")
print(m1.group())
## salicaria
But group() produces an error if no match was generated:
import re
m2=re.match("z","salicaria")
print(m2)
## None
print(m2.group())
## AttributeError: 'NoneType' object has no attribute 'group'
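A common way to avoid that error is to test for None before calling .group(). The helper below is a hypothetical illustration (matched_text is not part of the re module):

```python
import re

def matched_text(pattern, target):
    """Return the matched text, or None when the pattern does not match.

    Hypothetical helper for illustration; not part of the re module.
    """
    m = re.match(pattern, target)
    if m is None:          # no match: avoid calling .group() on None
        return None
    return m.group()

print(matched_text("sal", "salicaria"))  # sal
print(matched_text("z", "salicaria"))    # None
```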
groups() - find matching subgroups
Parentheses can be used to capture parts of a pattern, and then groups() returns the captured groups.
import re
m1=re.match("sal","salicaria")
print(m1)
## <_sre.SRE_Match object; span=(0, 3), match='sal'>
print(m1.groups())
## ()
m2=re.match("(s)(a)(l)","salicaria")
print(m2)
## <_sre.SRE_Match object; span=(0, 3), match='sal'>
print(m2.groups())
## ('s', 'a', 'l')
print(m2.group(1))
## s
print(m2.group(2))
## a
print(m2.group(3))
## l
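Groups become more useful when each one captures a meaningful piece of the text. A minimal sketch, assuming a two-word species name as input (the string is chosen for illustration):

```python
import re

# Hypothetical example string: a genus and a species epithet
m = re.match(r"(\w+) (\w+)", "Lythrum salicaria")
genus, species = m.groups()   # one item per pair of parentheses
print(m.group(0))  # Lythrum salicaria  (group 0 is the whole match)
print(genus)       # Lythrum
print(species)     # salicaria
```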
Use regular expressions to convert all email addresses to the standard format: name@queensu.ca
MySeq="ATGTGTGATAGATATAGTTTATAG"
* Replace T with U
* Find all start codons (AUG) and stop codons (UAA, UAG, UGA)
* Find all open reading frames (hint: consider each sequence beginning with AUG and ending with a stop codon; how do you know if both sequences are in the same reading frame?)
* Count the length in bp of each open reading frame
Have fun!