Lecture Notes on 20 Nov 2009

Regular Expressions in Python

A regular expression is a compact notation for describing a pattern
in text. It is used to find or replace text that matches the pattern,
or to split a piece of text wherever the pattern occurs.

# Import the regular expression module
import re

# Sample text
text = "Come down to Kew in lilac time, in lilac time, in lilac time;"

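# Patterns are normally written as raw strings so that backslashes reach
# the regular expression engine unchanged. Compiling a pattern with
# re.compile() is optional but convenient when the pattern is reused.
# The flag re.I indicates that the pattern is case insensitive.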
p = re.compile(r'lilac', re.I)

# match() only matches the pattern at the beginning of the text.
# It returns a match object if the text starts with the pattern, otherwise None.
m = p.match(text)

print(m)  # prints 'None'

if m:
  print("Found")
else:
  print("Not found")  # prints "Not found"
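
As a small illustration of the note above, match() succeeds only when the
pattern sits at position 0, while search() looks anywhere in the text. The
pattern 'come' and the variable names below are made up for this sketch and
are not part of the lecture code.

```python
import re

text = "Come down to Kew in lilac time, in lilac time, in lilac time;"

# 'come' occurs at the very start of the text (re.I ignores the capital C)
p_start = re.compile(r'come', re.I)
# 'lilac' occurs only later in the text
p_later = re.compile(r'lilac', re.I)

m_start = p_start.match(text)    # match object: the text begins with "Come"
m_later = p_later.match(text)    # None: the text does not begin with "lilac"
s_later = p_later.search(text)   # match object: search() scans the whole text

print(m_start.group())           # prints 'Come'
print(m_later)                   # prints 'None'
print(s_later.span())            # prints '(20, 25)'
```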

# search() scans the text for the first occurrence of the pattern.
# It returns a match object, or None if the pattern is not found.
s = p.search(text)

# To print the actual string it found, use the method group()
print(s.group())  # prints 'lilac'

# To print the location in the text
print("%d %d" % (s.start(), s.end()))  # prints '20 25'
print(s.span())  # prints '(20, 25)'

# To obtain all the matches, use the function findall(), which returns a list.
a = p.findall(text)
print(a)  # prints "['lilac', 'lilac', 'lilac']"

# To obtain the location of all the above matches, get an iterator
# with the function finditer() and loop through the match objects it yields
iterObj = p.finditer(text)
for m in iterObj:
  print(m.span())  # prints '(20, 25)', then '(35, 40)', then '(50, 55)'
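
The spans returned by finditer() can be used to slice the text directly. The
sketch below (the names spans and pieces are illustrative only) shows that
the end position is exclusive, so text[start:end] is exactly the matched
string.

```python
import re

text = "Come down to Kew in lilac time, in lilac time, in lilac time;"
p_lilac = re.compile(r'lilac', re.I)

# Collect the (start, end) pair of every match
spans = [m.span() for m in p_lilac.finditer(text)]
print(spans)    # prints '[(20, 25), (35, 40), (50, 55)]'

# end is exclusive, so text[start:end] is exactly the matched string
pieces = [text[start:end] for (start, end) in spans]
print(pieces)   # prints "['lilac', 'lilac', 'lilac']"
```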

# To replace or substitute the pattern, use sub(), which returns a new string
new_text = p.sub('tulip', text)
print(new_text)  # prints "Come down to Kew in tulip time, in tulip time, in tulip time;"
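
A short sketch of two details of sub(): the original string is left
unchanged, and the optional count argument limits how many matches are
replaced. The sample sentence and variable names are made up for
illustration.

```python
import re

line = "Lilac time, lilac time, LILAC time"
p_lilac = re.compile(r'lilac', re.I)

# Replace every match, regardless of case
all_replaced = p_lilac.sub('tulip', line)

# Replace only the first match by passing count=1
first_replaced = p_lilac.sub('tulip', line, count=1)

print(all_replaced)     # prints 'tulip time, tulip time, tulip time'
print(first_replaced)   # prints 'tulip time, lilac time, LILAC time'
print(line)             # prints 'Lilac time, lilac time, LILAC time'
```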

# Use split() to split a piece of text according to a delimiter.
text2 = "asd:qwe:zxc"

# Specify delimiter
p = re.compile(r':')

# split() returns a list of strings not including the delimiter
a = p.split(text2)
print(a)  # prints "['asd', 'qwe', 'zxc']"
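
A single fixed delimiter such as ':' could also be handled by the string
method split(). A regular expression becomes useful when the delimiter
varies. The following sketch, with a made-up sample string, splits on a
comma, semicolon, or colon followed by optional whitespace, using
meta-characters described in the list further down.

```python
import re

record = "asd: qwe;zxc,  rty"

# Delimiter: one of , ; : followed by zero or more whitespace characters
p_delim = re.compile(r'[,;:]\s*')

fields = p_delim.split(record)
print(fields)   # prints "['asd', 'qwe', 'zxc', 'rty']"
```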

There are meta-characters that have special meaning in a regular
expression:

. (period) matches any character other than newline

[aeiou] matches any one character listed in the square brackets

[^aeiou] matches any one character NOT listed in the square brackets

a{3} matches exactly 3 a's.

a{3,5} matches 3 to 5 a's.

a* matches 0 or more a's.

a+ matches 1 or more a's.

a? matches 0 or 1 a's.

(a|b|c) matches a or b or c.

\d matches any digit, same as [0-9] for ASCII text

\D matches any character that is not a digit, same as [^0-9] for ASCII text

\w matches any alphanumeric character or the underscore '_'

\W matches any character that is not alphanumeric or an underscore

\s matches any whitespace character (space, tab, newline)

\S matches any character that is not whitespace
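
Several of the meta-characters above can be tried out with findall(),
search() and split(). The sample strings and variable names below are made
up for illustration; the expected results are shown as comments.

```python
import re

# . matches any single character except a newline
dot = re.findall(r'l.l', "lilac lol l\nl")
print(dot)          # prints "['lil', 'lol']"

# [aeiou] and [^aeiou]: one character in, or not in, the set
vowels = re.findall(r'[aeiou]', "lilac")
others = re.findall(r'[^aeiou]', "lilac")
print(vowels)       # prints "['i', 'a']"
print(others)       # prints "['l', 'l', 'c']"

# a{3,5}: between 3 and 5 a's; a run of six yields five, the sixth is left over
counted = re.findall(r'a{3,5}', "a aa aaa aaaaaa")
print(counted)      # prints "['aaa', 'aaaaa']"

# a* (0 or more), a+ (1 or more), u? (0 or 1)
star = re.findall(r'ba*', "b ba baa")
plus = re.findall(r'ba+', "b ba baa")
optional = re.findall(r'colou?r', "color colour")
print(star)         # prints "['b', 'ba', 'baa']"
print(plus)         # prints "['ba', 'baa']"
print(optional)     # prints "['color', 'colour']"

# (a|b|c): alternation between whole alternatives
pet = re.search(r'(cat|dog|bird)', "I have a dog").group()
print(pet)          # prints 'dog'

# \d, \w and \s shorthand classes
digits = re.findall(r'\d+', "Kew 2009-11-20")
words = re.findall(r'\w+', "lilac_time, 20 Nov")
tokens = re.split(r'\s+', "in lilac\ttime\nagain")
print(digits)       # prints "['2009', '11', '20']"
print(words)        # prints "['lilac_time', '20', 'Nov']"
print(tokens)       # prints "['in', 'lilac', 'time', 'again']"
```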