1. Overview

Regular Expressions

  • aka ‘regex’ and ‘regexp’

  • a sort of find-and-replace for nerds

  • one of the most powerful data tools I have ever learned

  • requires patience and lots of practice

2. Basic Regex Functions in Python

The re module in Python contains functions for regular expressions

Handy regex functions in re

Searching

re.match() - match pattern at the start of the target string

re.search() - find first match of pattern within target string

re.findall() - find all non-overlapping matches of pattern within target string

Modifying

re.split() - split target string according to pattern

re.sub() - replace occurrences of pattern within string

re.match()

Returns None if the pattern does not match at the start of the target string

import re
m1=re.match("salicaria","salicaria")
print(m1)
## <_sre.SRE_Match object; span=(0, 9), match='salicaria'>
m2=re.match("a","salicaria")
print(m2)
## None
m3=re.match("z","salicaria")
print(m3)
## None

re.search()

Contrast output with re.match()

import re
m1=re.search("salicaria","salicaria")
print(m1)
## <_sre.SRE_Match object; span=(0, 9), match='salicaria'>
m2=re.search("a","salicaria")
print(m2)
## <_sre.SRE_Match object; span=(1, 2), match='a'>
m3=re.search("z","salicaria")
print(m3)
## None

What is different?
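One way to see the difference is to compare match spans. The sketch below uses the same target string; re.fullmatch() is not covered above and is included only for contrast, as the function that requires the pattern to cover the whole target.

```python
import re

target = "salicaria"

# re.match() only tries the pattern at position 0 of the target
at_start = re.match("sal", target)
not_at_start = re.match("caria", target)

# re.search() scans forward and returns the first match anywhere
anywhere = re.search("caria", target)

# re.fullmatch() succeeds only if the pattern covers the entire target
whole = re.fullmatch("sal", target)

print(at_start.span())   # (0, 3)
print(not_at_start)      # None
print(anywhere.span())   # (4, 9)
print(whole)             # None
```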

re.findall()

Note [] instead of None when no match is found

import re
m1=re.findall("salicaria","salicaria")
print(m1)
## ['salicaria']
m2=re.findall("a","salicaria")
print(m2)
## ['a', 'a', 'a']
m3=re.findall("z","salicaria")
print(m3)
## []
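re.findall() returns only the matched text. If the positions are needed too, re.finditer() (not listed above) is one option; a minimal sketch:

```python
import re

# finditer() yields one match object per non-overlapping match,
# so the start position is available as well as the matched text
positions = [(m.start(), m.group()) for m in re.finditer("a", "salicaria")]
print(positions)  # [(1, 'a'), (5, 'a'), (8, 'a')]
```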

re.split()

Compare this output with previous functions

import re
m1=re.split("salicaria","salicaria")
print(m1)
## ['', '']
m2=re.split("a","salicaria")
print(m2)
## ['s', 'lic', 'ri', '']
m3=re.split("z","salicaria")
print(m3)
## ['salicaria']

What is different?
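re.split() also accepts a maxsplit argument that limits how many splits are made. A short sketch with the same target string:

```python
import re

# Split only at the first "a"; the rest of the string is left intact
once = re.split("a", "salicaria", maxsplit=1)
print(once)   # ['s', 'licaria']

# Split at the first two occurrences
twice = re.split("a", "salicaria", maxsplit=2)
print(twice)  # ['s', 'lic', 'ria']
```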

re.sub()

For this function, we have to add a replacement string as the second argument

import re
m1=re.sub("salicaria","X","salicaria")
print(m1)
## X
m2=re.sub("a","X","salicaria")
print(m2)
## sXlicXriX
m3=re.sub("z","X","salicaria")
print(m3)
## salicaria
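By default re.sub() replaces every match. Two related options, sketched below on the same string: the count argument limits the number of replacements, and re.subn() also reports how many were made.

```python
import re

# Replace only the first match
first_only = re.sub("a", "X", "salicaria", count=1)
print(first_only)  # sXlicaria

# subn() returns a (new_string, number_of_replacements) tuple
result, n = re.subn("a", "X", "salicaria")
print(result, n)   # sXlicXriX 3
```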

3. Wildcards

Escape character: \

The escape character tells the regex engine that the next character is not to be interpreted literally. Unlike regular expressions in R, Python uses only a single backslash.
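Python string literals have their own backslash escapes (for example "\n" is a single newline character), so a common convention is to write patterns as raw strings, prefixed with r. Patterns such as "\w" still work in an ordinary string because Python leaves unrecognised escapes alone, but recent Python versions emit a warning for them. A minimal sketch comparing the two spellings:

```python
import re

text = "...which 1-100 words get replaced?"

# Ordinary string: the backslash must be doubled to reach the regex engine safely
escaped = re.sub("\\d", "*", text)

# Raw string: backslashes are passed through unchanged
raw = re.sub(r"\d", "*", text)

print(raw)                    # ...which *-*** words get replaced?
print(escaped == raw)         # True
print(len(r"\n"), len("\n"))  # 2 1
```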

\w

All letters, digits and the underscore (aka ‘word’ characters)

from re import sub
m1=sub("w","*","...which 1-100 words get replaced?")
print(m1)
## ...*hich 1-100 *ords get replaced?
m2=sub("\w","*","...which 1-100 words get replaced?")
print(m2)
## ...***** *-*** ***** *** ********?

\W

Upper case \W means the inverse of lower-case \w (anything that is not a letter, digit or underscore)

m1=sub("\W","*","...which 1-100  words get replaced?")
print(m1)
## ***which*1*100**words*get*replaced*

\s

Whitespace (spaces, tabs and newlines)

m1=sub("\s","*","...which 1-100  words get replaced?")
print(m1)
## ...which*1-100**words*get*replaced?

\t

Tab character (useful for tab-delimited data files)

m1=sub("\t","*","...which 1-100  words get replaced?")
print(m1)
## ...which 1-100  words get replaced?

\d

Digits (numbers)

m1=sub("\d","*","...which 1-100  words get replaced?")
print(m1)
## ...which *-***  words get replaced?

\D

Inverse of \d (non-digits)

m1=sub("\D","*","...which 1-100  words get replaced?")
print(m1)
## *********1*100*********************

. = any character except new line

m1=sub(".","*","...which 1-100  words get replaced?")
print(m1)
## ***********************************

So how to search for a period? Use the escape character

m1=sub("\.","*","...which 1-100  words get replaced?")
print(m1)
## ***which 1-100  words get replaced?

Two more special characters:

\r

Carriage return

\n

Newline character

Unix/Mac files – lines usually end with \n only

Windows/DOS files – lines usually end with \r\n
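As a small illustration, Windows line endings can be converted to Unix ones with re.sub(). The tab-delimited text below is made up for the example.

```python
import re

# Hypothetical tab-delimited data with Windows (\r\n) line endings
windows_text = "site\tcount\r\nA\t12\r\nB\t7\r\n"

# Replace each \r\n pair with a single \n
unix_text = re.sub("\r\n", "\n", windows_text)

print(repr(unix_text))        # 'site\tcount\nA\t12\nB\t7\n'
print(unix_text.count("\r"))  # 0
```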

FUN FACT: Where does the term ‘Carriage return’ come from?

4. Special characters:

Or: |

The vertical bar or ‘pipe’ is often located above the backslash (shift-backslash). It is the regex equivalent of or

For example, look for w or e

m1=sub("w|e","*","...which 1-100  words get replaced?")
print(m1)
## ...*hich 1-100  *ords g*t r*plac*d?
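The alternatives on either side of | do not have to be single characters. A quick sketch with whole words, using the same sentence:

```python
from re import sub

# Each side of | is a complete alternative: match "which" or "words"
m1 = sub("which|words", "*", "...which 1-100  words get replaced?")
print(m1)  # ...* 1-100  * get replaced?
```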

*, ?, + and {}

Use *, ?, + and {} for more complicated searches

Look at these examples carefully

m1=sub("\w","*","...which 1-100 words get replaced?")
print(m1)
## ...***** *-*** ***** *** ********?

+

1 or more occurrences

m1=sub("\w+","*","...which 1-100 words get replaced?")
print(m1)
## ...* *-* * * *?

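?

0 or 1 occurrences (the preceding item is optional). Placed after another quantifier (*?, +?, ??), it instead makes that quantifier ‘lazy’, so it matches as few characters as possible

m1=sub("\w?","*","...which 1-100 words get replaced?")
print(m1)
## *.*.*.***** *-*** ***** *** ********?*

*

0 or more occurrences. Like + and {}, it is ‘greedy’ by default, so it matches as many characters as possible

m1=sub("\w*","*","...which 1-100 words get replaced?")
print(m1)
## *.*.*.* *-* * * *?*

Both patterns can match an empty string, which is why * also appears where there are no word characters. The output shown is from Python 3.6 or earlier; Python 3.7 and later also replace an empty match that directly follows a non-empty one, so extra * appear after each word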

{n,m}

Between n and m occurrences

m1=sub("\w{3,4}","*","...which 1-100 words get replaced?")
print(m1)
## ...*h 1-* *s * **?

{n} = exactly n matches

m1=sub("\w{3}","*","...which 1-100 words get replaced?")
print(m1)
## ...*ch 1-* *ds * **ed?

{n,} = n or more matches

m1=sub("\w{4,}","*","...which 1-100 words get replaced?")
print(m1)
## ...* 1-100 * get *?
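Quantifiers are greedy by default: they take as many characters as they can. Adding ? after a quantifier makes it lazy, so it stops at the earliest point where the rest of the pattern can match. A minimal sketch, using a made-up string with italic tags:

```python
import re

# Hypothetical example text
tagged = "<i>Lythrum</i> <i>salicaria</i>"

# Greedy: .+ runs to the last ">" in the string
greedy = re.findall(r"<.+>", tagged)
print(greedy)  # ['<i>Lythrum</i> <i>salicaria</i>']

# Lazy: .+? stops at the first ">" it can reach
lazy = re.findall(r"<.+?>", tagged)
print(lazy)    # ['<i>', '</i>', '<i>', '</i>']
```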

6. ^Start and end of line$

^ Start of line

Replace “s” at the start of the string

m1=sub("^s","*","start of lines")
print(m1)
## *tart of lines

[^]

IMPORTANT: ^ also ‘negates’ when used within []

Replace every character other than “s”

m1=sub("[^s]","*","start of lines")
print(m1)
## s************s

$ End of line

Replace “s” at the end of the string

m1=sub("s$","*","start of lines")
print(m1)
## start of line*
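In Python, ^ and $ anchor to the start and end of the whole string by default. To make them work line by line, pass the re.MULTILINE flag. A sketch with a made-up three-line string:

```python
import re

# Hypothetical list of names, one per line
names = "salicaria\nstricta\nalatum"

# Default: ^ matches only at the very start of the string
default = re.sub("^s", "*", names)
print(repr(default))  # '*alicaria\nstricta\nalatum'

# MULTILINE: ^ also matches just after each newline
starts = re.sub("^s", "*", names, flags=re.MULTILINE)
print(repr(starts))   # '*alicaria\n*tricta\nalatum'

# MULTILINE: $ also matches just before each newline
ends = re.sub("a$", "*", names, flags=re.MULTILINE)
print(repr(ends))     # 'salicari*\nstrict*\nalatum'
```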

7. Capture text: ()

Capture text using () and reprint using \\1, \\2, etc

Keep only the captured text. Because the leading .* is greedy, the group captures just the last two word characters

m1=sub(".*(\w\w+).*","\\1","...which 1-100 words get replaced?")
print(m1)
## ed

Pull out only the numbers and reverse their order

m1=sub(".*([0-9]+)-([0-9]+).*","\\2-\\1","...which 1-100 words get replaced?")
print(m1)
## 100-1

Reverse first two letters of each ‘word’ containing 3+ characters

m1=sub("(\w)(\w)(\w+)","\\2\\1\\3","...which 1-100 words get replaced?")
print(m1)
## ...hwich 1-010 owrds egt erplaced?
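With raw strings, backreferences can be written with a single backslash (r"\1"). Groups can also be given names, which some find easier to read in longer patterns. A short sketch; the group names low and high are chosen here purely for illustration:

```python
import re

text = "...which 1-100 words get replaced?"

# Numbered groups, raw strings: swap the two numbers in place
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", text)
print(swapped)  # ...which 100-1 words get replaced?

# Named groups: (?P<name>...) in the pattern, \g<name> in the replacement
named = re.sub(r"(?P<low>\d+)-(?P<high>\d+)", r"\g<high>-\g<low>", text)
print(named == swapped)  # True
```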

group() - find matching group

As you can see above, re.match() and re.search() return an object if there is a match. What if we want to see what was matched? Use .group()

import re
m1=re.match("salicaria","salicaria")
print(m1.group())
## salicaria

But calling group() raises an error if there was no match, because re.match() returned None:

import re
m2=re.match("z","salicaria")
print(m2)
## None
print(m2.group())
## AttributeError: 'NoneType' object has no attribute 'group'
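A common way to avoid this error is to check for None before calling group(). The helper name first_match below is hypothetical, just a sketch of the pattern:

```python
import re

def first_match(pattern, target):
    """Return the first matched text, or None if the pattern is not found."""
    m = re.search(pattern, target)
    # Only call .group() when a match object was actually returned
    return m.group() if m is not None else None

print(first_match("a", "salicaria"))  # a
print(first_match("z", "salicaria"))  # None
```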

groups() - find matching subgroups

Parentheses can be used to capture parts of a match, and groups() returns the captured groups as a tuple.

import re
m1=re.match("sal","salicaria")
print(m1)
## <_sre.SRE_Match object; span=(0, 3), match='sal'>
print(m1.groups())
## ()
m2=re.match("(s)(a)(l)","salicaria")
print(m2)
## <_sre.SRE_Match object; span=(0, 3), match='sal'>
print(m2.groups())
## ('s', 'a', 'l')
print(m2.group(1))
## s
print(m2.group(2))
## a
print(m2.group(3))
## l
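Capture groups also change what re.findall() returns, which can be surprising. A brief sketch with the same target string:

```python
import re

# No groups: findall() returns the whole matches
whole = re.findall("sal", "salicaria")
print(whole)   # ['sal']

# One group: findall() returns only the captured part of each match
single = re.findall(r"(\w)a", "salicaria")
print(single)  # ['s', 'c', 'i']

# Several groups: findall() returns one tuple per match
multi = re.findall("(s)(a)(l)", "salicaria")
print(multi)   # [('s', 'a', 'l')]
```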

PRACTICE EXERCISES

1. Consider a list of email addresses scraped from the internet:

  • robert ‘dot’ colautti ‘at’ queensu ‘dot’ ca
  • chris.eckert[at]queensu.ca
  • lonnie.aarssen at queensu.ca

Use regular expressions to convert all email addresses to the standard format: name@queensu.ca

2. Create a random sequence of DNA:

MySeq="ATGTGTGATAGATATAGTTTATAG"
  • Replace T with U

  • Find all start codons (AUG) and stop codons (UAA, UAG, UGA)

  • Find all open reading frames (hint: consider each sequence beginning with AUG and ending with a stop codon; how do you know if both codons are in the same reading frame?)

  • Count the length in bp of each open reading frame

3. Regex Golf

Have fun! LINK