1. Overview

Regular Expressions

aka ‘regex’ and ‘regexp’
a sort of find-and-replace for nerds
one of the most powerful data tools I have ever learned
requires patience and lots of practice

2. Basic Regex Functions in Python

The re module in Python contains functions for regular expressions

Handy regex functions in `re`

Searching

re.match() - match pattern at the beginning of the target string

re.search() - find first match of pattern within target string

re.findall() - find all non-overlapping matches of pattern within target string

Modifying

re.split() - split target string according to pattern

re.sub() - replace occurrences of pattern within string

`re.match()`

Returns ‘None’ if no match

import re
m1=re.match("salicaria","salicaria")
print(m1)

## <_sre.SRE_Match object; span=(0, 9), match='salicaria'>

m2=re.match("a","salicaria")
print(m2)

## None

m3=re.match("z","salicaria")
print(m3)

## None
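A minimal sketch of where `re.match()` anchors, compared with the standard-library `re.fullmatch()`. The variable names are illustrative only.

```python
import re

# re.match() only requires the pattern to match at the START of the string
start_only = re.match("sal", "salicaria")
print(start_only.group())   # prints: sal

# re.fullmatch() requires the pattern to cover the WHOLE string
whole = re.fullmatch("sal", "salicaria")
print(whole)                # prints: None
```

So `re.match()` succeeds when the pattern is only a prefix of the target, while `re.fullmatch()` does not.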

`re.search()`

Contrast output with re.match()

import re
m1=re.search("salicaria","salicaria")
print(m1)

## <_sre.SRE_Match object; span=(0, 9), match='salicaria'>

m2=re.search("a","salicaria")
print(m2)

## <_sre.SRE_Match object; span=(1, 2), match='a'>

m3=re.search("z","salicaria")
print(m3)

## None

What is different?

`re.findall()`

Note [] instead of None when no match is found

import re
m1=re.findall("salicaria","salicaria")
print(m1)

## ['salicaria']

m2=re.findall("a","salicaria")
print(m2)

## ['a', 'a', 'a']

m3=re.findall("z","salicaria")
print(m3)

## []
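To illustrate what 'non-overlapping' means, here is a small sketch with an example string ("banana") that is not part of the original examples:

```python
import re

# "ana" occurs at positions 1-3 and 3-5 of "banana", but the two
# occurrences share a character, so findall() reports only the first
# and resumes scanning after it
hits = re.findall("ana", "banana")
print(hits)   # prints: ['ana']
```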

`re.split()`

Compare this output with previous functions

import re
m1=re.split("salicaria","salicaria")
print(m1)

## ['', '']

m2=re.split("a","salicaria")
print(m2)

## ['s', 'lic', 'ri', '']

m3=re.split("z","salicaria")
print(m3)

## ['salicaria']

What is different?
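`re.split()` also accepts an optional `maxsplit` argument that limits the number of splits. A short sketch (variable name is illustrative):

```python
import re

# Split on the first "a" only; the rest of the string is left intact
first_split = re.split("a", "salicaria", maxsplit=1)
print(first_split)   # prints: ['s', 'licaria']
```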

`re.sub()`

For this function, we have to add a replacement string as the second argument

import re
m1=re.sub("salicaria","X","salicaria")
print(m1)

## X

m2=re.sub("a","X","salicaria")
print(m2)

## sXlicXriX

m3=re.sub("z","X","salicaria")
print(m3)

## salicaria
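Two related options may be handy here: the `count` argument of `re.sub()` limits the number of replacements, and `re.subn()` also reports how many were made. A brief sketch with illustrative variable names:

```python
import re

# Replace only the first occurrence
first_only = re.sub("a", "X", "salicaria", count=1)
print(first_only)            # prints: sXlicaria

# subn() returns a tuple: (new string, number of replacements)
replaced, n_changes = re.subn("a", "X", "salicaria")
print(replaced, n_changes)   # prints: sXlicXriX 3
```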

3. Wildcards

Escape character: `\`

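The escape character tells the regular expression engine that the next character is not to be interpreted literally. Unlike regular expressions in R, Python needs only a single backslash, but it is safest to write patterns as raw strings (e.g. r"\w") so that Python itself does not process the backslash first; recent Python versions warn about unrecognized escapes such as "\w" in ordinary strings.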
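A small sketch of the difference between ordinary and raw string literals, which is why patterns are commonly written with an `r` prefix. The variable names are illustrative.

```python
import re

# In an ordinary string "\n" is ONE character (a newline);
# in a raw string r"\n" is TWO characters (a backslash and an n)
plain_len = len("\n")
raw_len = len(r"\n")
print(plain_len, raw_len)   # prints: 1 2

# Both spellings hand the same two-character pattern \d to the regex engine
doubled = re.sub("\\d", "*", "1-100")
raw = re.sub(r"\d", "*", "1-100")
print(doubled, raw)         # prints: *-*** *-***
```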

`\w`

All letters, digits and the underscore (aka ‘word’ characters)

from re import sub
m1=sub("w","*","...which 1-100 words get replaced?")
print(m1)

## ...*hich 1-100 *ords get replaced?

m2=sub(r"\w","*","...which 1-100 words get replaced?")
print(m2)

## ...***** *-*** ***** *** ********?

`\W`

Upper case \W means the inverse of lower-case \w (anything that is not a letter, digit or underscore)

m1=sub(r"\W","*","...which 1-100  words get replaced?")
print(m1)

## ***which*1*100**words*get*replaced*

`\s`

Whitespace (spaces, tabs and newlines)

m1=sub(r"\s","*","...which 1-100  words get replaced?")
print(m1)

## ...which*1-100**words*get*replaced?

`\t`

Tab character (useful for tab-delimited data files)

m1=sub("\t","*","...which 1-100  words get replaced?")
print(m1)

## ...which 1-100  words get replaced?

`\d`

Digits (numbers)

m1=sub(r"\d","*","...which 1-100  words get replaced?")
print(m1)

## ...which *-***  words get replaced?

`\D`

Inverse of \d (non-digits)

m1=sub(r"\D","*","...which 1-100  words get replaced?")
print(m1)

## *********1*100*********************

`.` = any character except new line

m1=sub(".","*","...which 1-100  words get replaced?")
print(m1)

## ***********************************

So how to search for a period? Use the escape character

m1=sub(r"\.","*","...which 1-100  words get replaced?")
print(m1)

## ***which 1-100  words get replaced?
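When a literal string contains several special characters, the standard-library `re.escape()` can add the backslashes for you. A sketch, with illustrative variable names:

```python
import re

text = "...which 1-100  words get replaced?"

# re.escape() backslash-escapes regex metacharacters in a literal string
pattern = re.escape("...")
print(pattern)     # prints: \.\.\.

# The escaped pattern now matches three literal periods
stripped = re.sub(pattern, "", text)
print(stripped)    # prints: which 1-100  words get replaced?
```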

Two more special wildcards:

`\r`

Carriage return

`\n`

Newline character

Unix/Mac files – lines usually end with \n only

Windows/DOS files – lines usually end with \r\n
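As a sketch of why this matters, the following converts Windows line endings to Unix ones. The sample string and variable names are made up for illustration.

```python
import re

# A made-up string mixing Windows (\r\n) and Unix (\n) line endings
mixed = "line1\r\nline2\nline3\r\n"

# Replace each carriage return + newline pair with a single newline
unix = re.sub("\r\n", "\n", mixed)
print(repr(unix))   # prints: 'line1\nline2\nline3\n'
```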

FUN FACT: Where does the term ‘Carriage return’ come from?

4. Special characters:

Or: `|`

The vertical bar or ‘pipe’ is often located above the backslash (shift-backslash). It is the regex equivalent of or

For example, look for w or e

m1=sub("w|e","*","...which 1-100  words get replaced?")
print(m1)

## ...*hich 1-100  *ords g*t r*plac*d?
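The alternatives on either side of `|` can be whole words, not just single characters. A short sketch (the variable name is illustrative):

```python
import re

# Replace either of two whole words
either = re.sub("words|get", "*", "...which 1-100  words get replaced?")
print(either)   # prints: ...which 1-100  * * replaced?
```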

`*`, `?`, `+` and `{}`

Use *, ?, + and {} for more complicated searches

Look at these examples carefully

m1=sub(r"\w","*","...which 1-100 words get replaced?")
print(m1)

## ...***** *-*** ***** *** ********?

`+`

1 or more occurrences

m1=sub(r"\w+","*","...which 1-100 words get replaced?")
print(m1)

## ...* *-* * * *?

`?`

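0 or 1 occurrences (the preceding item is optional). Placed after another quantifier (e.g. +? or *?), it instead makes that quantifier ‘lazy’ (match as few characters as possible). Note that \w? can match an empty string, which is why asterisks appear before the periods. The output below is from Python 3.6 or earlier; Python 3.7 and later also replace empty matches that directly follow a non-empty match, which adds extra asterisks.

m1=sub(r"\w?","*","...which 1-100 words get replaced?")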
print(m1)

## *.*.*.***** *-*** ***** *** ********?*

`*`

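0 or more occurrences. Like +, this quantifier is ‘greedy’ (it matches as many characters as possible). Because \w* can match an empty string, the output below is from Python 3.6 or earlier; Python 3.7 and later also replace empty matches that directly follow a non-empty match.

m1=sub(r"\w*","*","...which 1-100 words get replaced?")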
print(m1)

## *.*.*.* *-* * * *?*

`{n,m}`

Find between n and m occurrences

m1=sub(r"\w{3,4}","*","...which 1-100 words get replaced?")
print(m1)

## ...*h 1-* *s * **?

`{n}` = exactly n matches

m1=sub(r"\w{3}","*","...which 1-100 words get replaced?")
print(m1)

## ...*ch 1-* *ds * **ed?

`{n,}`= n or more matches

m1=sub(r"\w{4,}","*","...which 1-100 words get replaced?")
print(m1)

## ...* 1-100 * get *?
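The quantifiers above are greedy by default; adding `?` after one makes it lazy. A sketch of both uses of `?`, with example strings that are not from the original examples:

```python
import re

tags = "<b>bold</b>"

# Greedy: .+ grabs as much as possible, so the match runs to the LAST ">"
greedy = re.findall("<.+>", tags)
print(greedy)     # prints: ['<b>bold</b>']

# Lazy: .+? grabs as little as possible, so each match stops at the FIRST ">"
lazy = re.findall("<.+?>", tags)
print(lazy)       # prints: ['<b>', '</b>']

# On its own, ? makes the preceding character optional
optional = re.findall("colou?r", "color colour")
print(optional)   # prints: ['color', 'colour']
```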

5. Multiple search: `[]`

Use square brackets to match any one of the enclosed characters.

m1=sub("which","*","...which 1-100 words get replaced?")
print(m1)

## ...* 1-100 words get replaced?

m2=sub("[which]","*","...which 1-100 words get replaced?")
print(m2)

## ...***** 1-100 *ords get repla*ed?

Range of characters `-`

Use dash for a range of numbers

m1=sub("[1-4]","*","1234567890")
print(m1)

## ****567890

or letters

m1=sub("[a-z]","*","AaBbCcDd")
print(m1)

## A*B*C*D*

m2=sub("[A-Z]","*","AaBbCcDd")
print(m2)

## *a*b*c*d

m3=sub("[A-z]","*","AaBbCcDd")
print(m3)

## ********
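One caution about `[A-z]`: in ASCII the characters `[`, `\`, `]`, `^`, `_` and the backtick sit between `Z` and `a`, so this range matches them as well. A sketch with a made-up test string:

```python
import re

# [A-z] spans every ASCII code from "A" to "z", including "_" and "^"
wide = re.findall("[A-z]", "A_z^9")
print(wide)      # prints: ['A', '_', 'z', '^']

# [A-Za-z] matches letters only
letters = re.findall("[A-Za-z]", "A_z^9")
print(letters)   # prints: ['A', 'z']
```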

6. Start `^` and end `$` of line

`^` Start of line

Find an “s” at the start of the line

m1=sub("^s","*","start of lines")
print(m1)

## *tart of lines

`[^]`

IMPORTANT: ^ also ‘negates’ when used as the first character within []

Replace every character other than “s”

m1=sub("[^s]","*","start of lines")
print(m1)

## s************s

`$` End of line

Find an “s” at the end of the line

m1=sub("s$","*","start of lines")
print(m1)

## start of line*
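By default `^` and `$` anchor to the start and end of the whole string. With the `re.MULTILINE` flag they also match at each line break. A sketch using a made-up two-line string:

```python
import re

# Two lines in one string (illustrative data)
names = "salicaria\nsylvestris"

# Without the flag, ^ matches only at the very start of the string
default_hits = re.findall(r"^s\w+", names)
print(default_hits)   # prints: ['salicaria']

# With re.MULTILINE, ^ also matches right after each newline
line_hits = re.findall(r"^s\w+", names, flags=re.MULTILINE)
print(line_hits)      # prints: ['salicaria', 'sylvestris']

# Likewise, $ then matches just before each newline
ends_in_a = re.findall(r"\w+a$", names, flags=re.MULTILINE)
print(ends_in_a)      # prints: ['salicaria']
```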

7. Capture text: `()`

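Capture text using () and reuse it in the replacement string with \1, \2, etc. (written "\\1" in an ordinary string or r"\1" in a raw string)

Keep only the captured text. Because the leading .* is greedy, the group captures just the last two word characters

m1=sub(r".*(\w\w+).*",r"\1","...which 1-100 words get replaced?")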
print(m1)

## ed

Pull out only the numbers and reverse their order

m1=sub(r".*([0-9]+)-([0-9]+).*",r"\2-\1","...which 1-100 words get replaced?")
print(m1)

## 100-1

Reverse first two letters of each ‘word’ containing 3+ characters

m1=sub(r"(\w)(\w)(\w+)",r"\2\1\3","...which 1-100 words get replaced?")
print(m1)

## ...hwich 1-010 owrds egt erplaced?
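As one more capture example, here is a sketch of one possible way to replace each word with its first letter. The pattern is a suggestion, not taken from the original examples.

```python
import re

text = "...which 1-100 words get replaced?"

# (\w) captures the first word character; \w* consumes the rest of the word,
# and the replacement keeps only the captured character
initials = re.sub(r"(\w)\w*", r"\1", text)
print(initials)   # prints: ...w 1-1 w g r?
```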

`group()` - find matching group

As you can see above, re.match() and re.search() return an object if there is a match. What if we want to see what was matched? Use .group()

import re
m1=re.match("salicaria","salicaria")
print(m1.group())

## salicaria

But group() raises an error if no match was generated, because re.match() returned None:

import re
m2=re.match("z","salicaria")
print(m2)

## None

print(m2.group())

## AttributeError: 'NoneType' object has no attribute 'group'
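A common way to avoid that error is to check for None before calling group(). A minimal sketch; `matched_text` is a hypothetical helper name, not part of the `re` module.

```python
import re

def matched_text(pattern, target):
    """Return the matched text, or None if the pattern does not match."""
    m = re.match(pattern, target)
    # Only call .group() when a match object was actually returned
    return m.group() if m is not None else None

found = matched_text("sal", "salicaria")
missing = matched_text("z", "salicaria")
print(found)     # prints: sal
print(missing)   # prints: None
```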

`groups()` - find matching subgroups

Parentheses can be used to divide a pattern into subgroups, and then groups() returns the text matched by each subgroup.

import re
m1=re.match("sal","salicaria")
print(m1)

## <_sre.SRE_Match object; span=(0, 3), match='sal'>

print(m1.groups())

## ()

m2=re.match("(s)(a)(l)","salicaria")
print(m2)

## <_sre.SRE_Match object; span=(0, 3), match='sal'>

print(m2.groups())

## ('s', 'a', 'l')

print(m2.group(1))

## s

print(m2.group(2))

## a

print(m2.group(3))

## l
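Subgroups become more useful when each one captures a meaningful piece of text. A sketch, assuming a two-word species name as input (the string "Lythrum salicaria" is an illustrative choice):

```python
import re

# Two subgroups separated by a space: genus and species
m = re.match(r"(\w+) (\w+)", "Lythrum salicaria")

# groups() returns a tuple that can be unpacked directly
genus, species = m.groups()
print(genus)        # prints: Lythrum
print(species)      # prints: salicaria

# group(0), like group(), is the entire match
print(m.group(0))   # prints: Lythrum salicaria
```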

PRACTICE EXERCISES

1. Consider a list of email addresses scraped from the internet:

robert ‘dot’ colautti ‘at’ queensu ‘dot’ ca
chris.eckert[at]queensu.ca
lonnie.aarssen at queensu.ca

Use regular expressions to convert all email addresses to the standard format: name@queensu.ca

2. Create a random sequence of DNA:

MySeq="ATGTGTGATAGATATAGTTTATAG"

* Replace T with U
* Find all start codons (AUG) and stop codons (UAA, UAG, UGA)
* Find all open reading frames (hint: consider each sequence beginning with AUG and ending with a stop codon; how do you know if both sequences are in the same reading frame?)
* Count the length of bp for all open reading frames

3. More online examples

http://regex.sketchengine.co.uk/extra_regexps.html

4. Regex Golf

Have fun! LINK

Python Crash Course Part 2 – Intro to Regex

Rob Colautti
