5A - Regex

Regular expressions (regex) are a powerful tool in computing for matching and parsing strings of text using a formal language. They utilize special characters to define patterns, allowing for flexible string manipulation and extraction. The document provides an overview of regex syntax, usage in Python, and examples of matching and extracting data.

Uploaded by

d3v1lsr

Regular Expressions

Unit - 5A

Python for Everybody: Chapter 11


Regular Expressions
In computing, a regular expression, also
referred to as “regex” or “regexp”, provides a
concise and flexible means for matching
strings of text, such as particular characters,
words, or patterns of characters. A regular
expression is written in a formal language
that can be interpreted by a regular
expression processor.
http://en.wikipedia.org/wiki/Regular_expression
Regular Expressions
Really clever “wild card” expressions for
matching and parsing strings

http://en.wikipedia.org/wiki/Regular_expression
Really smart “Find” or “Search”
Understanding Regular Expressions
• Very powerful and quite cryptic
• Fun once you understand them
• Regular expressions are a language unto themselves
• A language of “marker characters” - programming with
characters
• It is kind of an “old school” language - compact
http://xkcd.com/208/
Regular Expression Quick Guide
^ Matches the beginning of a line
$ Matches the end of the line
. Matches any character (except a newline, by default)
\s Matches whitespace
\S Matches any non-whitespace character
* Repeats a character zero or more times
*? Repeats a character zero or more times (non-greedy)
+ Repeats a character one or more times
+? Repeats a character one or more times (non-greedy)
[aeiou] Matches a single character in the listed set
[^XYZ] Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
( Indicates where string extraction is to start
) Indicates where string extraction is to end
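
A few entries from the quick guide can be tried directly in Python. This is only a minimal sketch: the sample strings and the address alice42@example.com are made up for illustration and do not come from the slides.

```python
import re

# Hypothetical sample line, invented for this illustration
line = 'From: alice42@example.com sent 3 files'

# ^ anchors the pattern at the beginning of the line
starts_with_from = re.search('^From:', line) is not None

# $ anchors the pattern at the end of the line
ends_with_files = re.search('files$', line) is not None

# [aeiou]+ matches runs of one or more listed characters
vowel_runs = re.findall('[aeiou]+', 'queue up')

# [0-9]+ uses a range inside the set
digits = re.findall('[0-9]+', line)

# \S+ matches runs of non-whitespace characters
tokens = re.findall(r'\S+', 'a  b\tc')

print(starts_with_from, ends_with_files)
print(vowel_runs, digits, tokens)
```

The raw-string prefix r'...' keeps Python from interpreting the backslash itself before the regular expression processor sees it.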

Check out other handouts of Regex shared with you


The Regular Expression Module
• Before you can use regular expressions in your program,
you must import the library using “import re”
• You can use re.search() to see if a string matches a
regular expression, similar to using the find() method for
strings
• You can use re.findall() to extract portions of a string
that match your regular expression, similar to a
combination of find() and slicing: var[5:10]
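
The comparison in the bullets above can be sketched as follows. The string 'Total: 42 items' is a made-up example, not taken from the slides.

```python
import re

# Hypothetical sample text, invented for this illustration
text = 'Total: 42 items'

# find() returns an index (or -1 when nothing is found)
idx = text.find('42')

# re.search() returns a match object (or None when nothing matches)
m = re.search('[0-9]+', text)
matched_text = m.group() if m else None

# With slicing, the positions have to be worked out by hand...
by_slice = text[7:9]

# ...while re.findall() returns the matching pieces directly
by_regex = re.findall('[0-9]+', text)

print(idx, matched_text, by_slice, by_regex)
```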
Using re.search() Like find()

Using find():

    hand = open('mbox-short.txt')
    for line in hand:
        line = line.rstrip()
        if line.find('From:') >= 0:
            print(line)

Using re.search():

    import re
    hand = open('mbox-short.txt')
    for line in hand:
        line = line.rstrip()
        if re.search('From:', line) :
            print(line)
Using re.search() Like startswith()

Using startswith():

    hand = open('mbox-short.txt')
    for line in hand:
        line = line.rstrip()
        if line.startswith('From:') :
            print(line)

Using re.search():

    import re
    hand = open('mbox-short.txt')
    for line in hand:
        line = line.rstrip()
        if re.search('^From:', line) :
            print(line)

We fine-tune what is matched by adding special characters to the
string
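
To try the find() / startswith() comparison without the mbox-short.txt file, here is a small sketch that uses an in-memory list instead. The three lines and their addresses are invented stand-ins, not real file contents.

```python
import re

# Hypothetical stand-in for a few lines of mbox-short.txt
lines = [
    'From: stephen@example.org',
    'Subject: notes',
    '    From: quoted@example.org',
]

# "Contains" test: find() and an unanchored re.search() agree
contains_find = [ln for ln in lines if ln.find('From:') >= 0]
contains_re = [ln for ln in lines if re.search('From:', ln)]

# "Starts with" test: startswith() and the ^ anchor agree
starts_method = [ln for ln in lines if ln.startswith('From:')]
starts_re = [ln for ln in lines if re.search('^From:', ln)]

print(contains_re)
print(starts_re)
```

The indented third line contains 'From:' but does not start with it, so only the ^ version filters it out.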
Wild-Card Characters
• The dot character matches any character
• If you add the asterisk character, the character is “any
number of times”

^X.*:

^    Match the start of the line
.    Match any character
*    Many times

Lines that match:
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-DSPAM-Confidence: 0.8475
X-Content-Type-Message-Body: text/plain
Fine-Tuning Your Match
Depending on how “clean” your data is and the purpose of
your application, you may want to narrow your match down
a bit

^X.*:

^    Match the start of the line
.    Match any character
*    Many times

This pattern matches all four of these lines, including the
last two, which are not real header lines:
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-Plane is behind schedule: two weeks
X-: Very short

^X-\S+:

^     Match the start of the line
\S    Match any non-whitespace character
+     One or more times

This narrower pattern matches only the first two lines:
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-: Very Short
X-Plane is behind schedule: two weeks
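
The difference between the loose and the narrower pattern can be checked with a short sketch that runs both over the four sample lines from the slides (the variable names are only illustrative):

```python
import re

lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-Plane is behind schedule: two weeks',
    'X-: Very short',
]

# ^X.*:  an X at the start, then any characters, then a colon
loose = [ln for ln in lines if re.search('^X.*:', ln)]

# ^X-\S+:  'X-' at the start, one or more non-whitespace characters, then a colon
strict = [ln for ln in lines if re.search(r'^X-\S+:', ln)]

print(len(loose), 'lines match the loose pattern')
print(strict)
```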
Matching and Extracting Data
• re.search() returns a match object (which counts as True)
if the string matches the regular expression, and None
(which counts as False) if it does not
• If we actually want the matching strings to be extracted,
we use re.findall()

[0-9]+    One or more digits

>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+',x)
>>> print(y)
['2', '19', '42']
Matching and Extracting Data
When we use re.findall(), it returns a list of zero or more
sub-strings that match the regular expression

>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+',x)
>>> print(y)
['2', '19', '42']
>>> y = re.findall('[AEIOU]+',x)
>>> print(y)
[]
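
As a small sketch of the difference between the two calls, using the same sentence as the slides (the variable names are only illustrative):

```python
import re

x = 'My 2 favorite numbers are 19 and 42'

# re.search() stops at the first match and returns a match object
m = re.search('[0-9]+', x)
first = m.group() if m else None

# When nothing matches, re.search() returns None
miss = re.search('[AEIOU]+', x)

# re.findall() returns every match as a list of strings,
# which can then be converted to numbers
numbers = [int(s) for s in re.findall('[0-9]+', x)]

print(first, miss, numbers)
```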
Warning: Greedy Matching
The repeat characters (* and +) push outward in both
directions (greedy) to match the largest possible string

^F.+:

^F    First character in the match is an F
.+    One or more characters
:     Last character in the match is a :

>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+:', x)
>>> print(y)
['From: Using the :']

Why not 'From:' ? Because .+ is greedy, the match extends
to the last : in the string.
Non-Greedy Matching
Not all regular expression repeat codes are
greedy! If you add a ? character, the + and *
chill out a bit...

^F.+?:

^F     First character in the match is an F
.+?    One or more characters but not greedy
:      Last character in the match is a :

>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+?:', x)
>>> print(y)
['From:']
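
Running the greedy and non-greedy patterns side by side on the same string makes the difference easy to see. This sketch also records where each match ends (the variable names are only illustrative):

```python
import re

x = 'From: Using the : character'

# Greedy: .+ grabs as much as it can, so the match ends at the LAST colon
greedy = re.findall('^F.+:', x)
greedy_end = re.search('^F.+:', x).end()

# Non-greedy: .+? grabs as little as it can, so the match ends at the FIRST colon
lazy = re.findall('^F.+?:', x)
lazy_end = re.search('^F.+?:', x).end()

print(greedy, greedy_end)
print(lazy, lazy_end)
```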
Fine-Tuning String Extraction
You can refine the match for re.findall() and separately determine
which portion of the match is to be extracted by using parentheses

From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM

\S+@\S+

\S+    At least one non-whitespace character

>>> import re
>>> x = 'From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM'
>>> y = re.findall(r'\S+@\S+',x)
>>> print(y)
['dean.fyb@bmsce.ac.in']
Fine-Tuning String Extraction
Parentheses are not part of the match - but they tell where
to start and stop what string to extract

From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM

>>> x = 'From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM'
>>> y = re.findall(r'\S+@\S+',x)
>>> print(y)
['dean.fyb@bmsce.ac.in']

^From (\S+@\S+)

>>> y = re.findall(r'^From (\S+@\S+)',x)
>>> print(y)
['dean.fyb@bmsce.ac.in']
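
To see what the parentheses change, this sketch runs the same pattern with no group, one group, and two groups. The sample address is the one implied by the split output shown in these slides ('dean.fyb' and 'bmsce.ac.in'); the two-group pattern is an extra illustration, not part of the slides.

```python
import re

x = 'From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM'

# No parentheses: findall() returns the whole matched text, 'From ' included
whole = re.findall(r'^From \S+@\S+', x)

# One pair of parentheses: only the part inside them is returned
captured = re.findall(r'^From (\S+@\S+)', x)

# Two pairs (illustrative extension): each match becomes a tuple
parts = re.findall(r'^From (\S+)@(\S+)', x)

print(whole)
print(captured)
print(parts)
```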
String Parsing Examples…
Extracting a host name - using find and string slicing

From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM
(the @ is at index 13, the space after the host name at index 25)

>>> data = 'From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM'
>>> atpos = data.find('@')
>>> print(atpos)
13
>>> sppos = data.find(' ',atpos)
>>> print(sppos)
25
>>> host = data[atpos+1 : sppos]
>>> print(host)
bmsce.ac.in
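
The find-and-slice steps can be wrapped in a small helper. This is only a sketch: the function name host_with_find and its handling of lines without an @ or without a trailing space are assumptions added here, not part of the slides.

```python
def host_with_find(data):
    """Return the text between '@' and the next space, or None if there is no '@'."""
    atpos = data.find('@')
    if atpos == -1:            # find() returns -1 when nothing is found
        return None
    sppos = data.find(' ', atpos)
    if sppos == -1:            # no space after the host: take the rest of the string
        sppos = len(data)
    return data[atpos + 1:sppos]


print(host_with_find('From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM'))
```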
The Double Split Pattern
Sometimes we split a line one way, and then grab one of
the pieces of the line and split that piece again

From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM

line = 'From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM'
words = line.split()
email = words[1]              # 'dean.fyb@bmsce.ac.in'
pieces = email.split('@')     # ['dean.fyb', 'bmsce.ac.in']
print(pieces[1])              # prints bmsce.ac.in
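
A minimal sketch of the double split as a reusable function follows. The name host_with_split and the guard conditions for malformed lines are assumptions added for illustration.

```python
def host_with_split(line):
    """Split on whitespace, then split the second word on '@'."""
    words = line.split()
    # Guard against lines that are too short or are not 'From' lines
    if len(words) < 2 or words[0] != 'From':
        return None
    pieces = words[1].split('@')
    # Expect exactly one '@' in the address
    if len(pieces) != 2:
        return None
    return pieces[1]


print(host_with_split('From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM'))
```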
The Regex Version
From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM

import re
lin = 'From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM'
y = re.findall('@([^ ]*)',lin)
print(y)

['bmsce.ac.in']

'@([^ ]*)'

@       Look through the string until you find an at sign
[^ ]    Match non-blank character
*       Match many of them
( )     Extract the non-blank characters
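
Because '@([^ ]*)' is not anchored, it picks up every at sign in the string. The sketch below tries it on the sample line and on two made-up strings (the example.* addresses and the 'meet @ noon' text are invented for illustration).

```python
import re

lin = 'From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM'
hosts = re.findall('@([^ ]*)', lin)

# Hypothetical line with two addresses: both hosts are returned
two = re.findall('@([^ ]*)', 'cc: a@one.example b@two.example')

# * allows zero characters, so a lone '@' yields an empty string
bare_star = re.findall('@([^ ]*)', 'meet @ noon')

# + requires at least one character, so the lone '@' is skipped
bare_plus = re.findall('@([^ ]+)', 'meet @ noon')

print(hosts, two, bare_star, bare_plus)
```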


Even Cooler Regex Version
From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM

import re
lin = 'From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM'
y = re.findall('^From .*@([^ ]*)',lin)
print(y)

['bmsce.ac.in']

'^From .*@([^ ]*)'

^From    Starting at the beginning of the line, look for the string 'From '
.*@      Skip a bunch of characters, looking for an at sign
(        Start extracting
[^ ]     Match non-blank character
*        Match many of them
)        Stop extracting
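
Anchoring the pattern with '^From ' means lines that are not 'From ' lines return nothing. A small sketch follows; the function name sender_host and the example.* test lines are assumptions made for illustration.

```python
import re

def sender_host(line):
    """Return the host after the '@' on a 'From ' line, or None for any other line."""
    found = re.findall('^From .*@([^ ]*)', line)
    return found[0] if found else None


print(sender_host('From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM'))

# A line that does not start with 'From ' is ignored
print(sender_host('To: someone@other.example'))

# .* is greedy, so with two at signs the LAST one is used
print(sender_host('From a@b.example via c@d.example'))
```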
Spam Confidence
Sample line: X-DSPAM-Confidence: 0.8475

import re
hand = open('mbox-short.txt')
numlist = list()
for line in hand:
    line = line.rstrip()
    stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1 : continue
    num = float(stuff[0])
    numlist.append(num)
print('Maximum:', max(numlist))

python ds.py
Maximum: 0.9907
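
The same loop can be tried without the mbox-short.txt file by feeding it a short in-memory list. In this sketch only the 0.8475 line comes from the slides; the other lines and values are invented stand-ins.

```python
import re

# Hypothetical stand-in for a few lines of mbox-short.txt
lines = [
    'X-DSPAM-Result: Innocent',
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-DSPAM-Confidence: 0.6178',
]

numlist = []
for line in lines:
    # The parentheses extract just the number after the header name
    stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1:
        continue
    numlist.append(float(stuff[0]))

# Guard against an empty list, since max() of an empty list raises ValueError
maximum = max(numlist) if numlist else None
print('Maximum:', maximum)
```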
Escape Character
If you want a special regular expression character to just
behave normally (most of the time) you prefix it with '\'

\$[0-9.]+

\$        A real dollar sign
[0-9.]    A digit or period
+         At least one or more

>>> import re
>>> x = 'We just received $10.00 for cookies.'
>>> y = re.findall(r'\$[0-9.]+',x)
>>> print(y)
['$10.00']
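
To see why the backslash matters, this sketch compares the escaped and unescaped patterns on the same sentence. The re.escape() line is an extra illustration, not part of the slides.

```python
import re

x = 'We just received $10.00 for cookies.'

# Escaped: \$ means a literal dollar sign
escaped = re.findall(r'\$[0-9.]+', x)

# Unescaped: $ means "end of the line", so digits can never follow it
unescaped = re.findall('$[0-9.]+', x)

# re.escape() adds the backslashes needed to match a string literally
auto = re.findall(re.escape('$10.00'), x)

print(escaped, unescaped, auto)
```

Writing the pattern as a raw string (r'...') avoids Python warning about '\$' as an unrecognized string escape.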
Summary
• Regular expressions are a cryptic but powerful
language for matching strings and extracting elements
from those strings
• Regular expressions have special characters that
indicate intent
