Lecture Notes on 20 Nov 2009

Regular Expressions in Python

A regular expression is a way of expressing a pattern. A regular expression
is used to find or replace a pattern, or to split a piece of text according
to the pattern.

# Import the regular expression module
import re

# Sample text
text = "Come down to Kew in lilac time, in lilac time, in lilac time;"

# Patterns are best written as raw strings so that backslashes reach the
# regular expression engine unchanged. re.compile() returns a pattern
# object whose methods are used below.
# The flag re.I indicates that the pattern is case insensitive.
p = re.compile(r'lilac', re.I)

# match() only matches the pattern at the beginning of the text.
# It returns a match object if it finds the pattern, or None otherwise.
m = p.match(text)
print(m)                      # prints 'None'
if m:
    print("Found")
else:
    print("Not found")        # prints "Not found"

# search() finds the first occurrence of the pattern anywhere in the text.
# It returns a match object, or None if the pattern does not occur.
s = p.search(text)

# To print the actual string it found, use the function group()
print(s.group())              # prints 'lilac'

# To print the location in the text
print(s.start(), s.end())     # prints '20 25'
print(s.span())               # prints '(20, 25)'

# To obtain all the matches, use the function findall(), which returns a list.
a = p.findall(text)
print(a)                      # prints "['lilac', 'lilac', 'lilac']"

# To obtain the location of all the above matches, get an iterator
# with the function finditer() and loop through the match objects.
iterObj = p.finditer(text)
for m in iterObj:
    print(m.span())           # prints '(20, 25)', '(35, 40)', '(50, 55)'

# To replace or substitute the pattern, use sub(), which returns a new string.
new_text = p.sub('tulip', text)
print(new_text)

# Use split() to split a piece of text according to a delimiter.
text2 = "asd:qwe:zxc"

# Specify the delimiter
p = re.compile(r':')

# split() returns a list of strings, not including the delimiter
a = p.split(text2)
print(a)                      # prints "['asd', 'qwe', 'zxc']"

There are meta-characters that have special meaning in a regular expression:

.          (period) matches any character other than newline
[aeiou]    matches any one character in the square brackets
[^aeiou]   matches any one character NOT in the square brackets
a{3}       matches exactly 3 a's
a{3,5}     matches 3 to 5 a's
a*         matches 0 or more a's
a+         matches 1 or more a's
a?         matches 0 or 1 a
(a|b|c)    matches a or b or c
\d         matches any digit, same as [0-9] for ASCII text
\D         matches any character that is not a digit, same as [^0-9] for ASCII text
\w         matches any alphanumeric character or the underscore '_'
\W         matches any character that is not alphanumeric or an underscore
\s         matches any whitespace character, including spaces, tabs, and newlines
\S         matches any character that is not whitespace
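
As a small illustration of the meta-characters listed above, the sketch below applies a few of them with re.findall() and re.split(). The sample strings and variable names are made up for this example and are not part of the lecture notes.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "The year 2009 had 365 days; call 555-1234 now."

# \d+ : one or more digits in a row
digit_runs = re.findall(r'\d+', sample)

# \d{3}-\d{4} : exactly 3 digits, a hyphen, then exactly 4 digits
phones = re.findall(r'\d{3}-\d{4}', sample)

# [aeiou] and [^aeiou] : one character in, or not in, the set
vowels = re.findall(r'[aeiou]', "lilac time")
non_vowels = re.findall(r'[^aeiou]', "lilac")

# a{2,3} : 2 to 3 a's, taking as many as possible each time
a_runs = re.findall(r'a{2,3}', "a aa aaa aaaa")

# u? : 0 or 1 u, so both spellings match
spellings = re.findall(r'colou?r', "color colour")

# (cat|dog) : either alternative
animals = re.findall(r'(cat|dog)', "cat dog cow")

# . : any character other than newline
dotted = re.findall(r'l.l', "lilac lol l\nl")

# \w+ : runs of alphanumeric characters and underscores
words = re.findall(r'\w+', "in_lilac time!")

# \s+ : split on any run of whitespace (space, tab, newline)
pieces = re.split(r'\s+', "Come down\tto\nKew")

print(digit_runs)   # prints "['2009', '365', '555', '1234']"
print(phones)       # prints "['555-1234']"
print(a_runs)       # prints "['aa', 'aaa', 'aaa']"
print(pieces)       # prints "['Come', 'down', 'to', 'Kew']"
```

Note that findall() returns the text captured by the parentheses when the pattern contains a single group, which is why the alternation example yields the animal names themselves.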