5A - Regex

Regular expressions (regex) are a powerful tool in computing for matching and parsing strings of text using a formal language. They utilize special characters to define patterns, allowing for flexible string manipulation and extraction. The document provides an overview of regex syntax, usage in Python, and examples of matching and extracting data.

Uploaded by

d3v1lsr

Regular Expressions

Unit - 5A

Python for Everybody: Chapter 11

Regular Expressions
In computing, a regular expression, also
referred to as “regex” or “regexp”, provides a
concise and flexible means for matching
strings of text, such as particular characters,
words, or patterns of characters. A regular
expression is written in a formal language
that can be interpreted by a regular
expression processor.
http://en.wikipedia.org/wiki/Regular_expression
Regular Expressions
Really clever “wild card” expressions for matching and parsing strings

http://en.wikipedia.org/wiki/Regular_expression
Really smart “Find” or “Search”
Understanding Regular Expressions
• Very powerful and quite cryptic
• Fun once you understand them
• Regular expressions are a language unto themselves
• A language of “marker characters” - programming with characters
• It is kind of an “old school” language - compact
http://xkcd.com/208/
Regular Expression Quick Guide
^ Matches the beginning of a line
$ Matches the end of the line
. Matches any character
\s Matches whitespace
\S Matches any non-whitespace character
* Repeats a character zero or more times
*? Repeats a character zero or more times (non-greedy)
+ Repeats a character one or more times
+? Repeats a character one or more times (non-greedy)
[aeiou] Matches a single character in the listed set
[^XYZ] Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
( Indicates where string extraction is to start
) Indicates where string extraction is to end
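
A few of the quick-guide symbols can be tried directly in Python. This is only a minimal sketch: the sample text is made up for illustration, and the patterns are written as raw strings (r'...') so that backslashes reach the re module unchanged.

```python
import re

# Hypothetical sample text, chosen only to exercise the quick-guide symbols.
text = 'From: alice 42 apples'

# ^ anchors the pattern to the start of the string.
starts_with_from = re.search(r'^From:', text) is not None

# $ anchors the pattern to the end of the string.
ends_with_apples = re.search(r'apples$', text) is not None

# \S+ is a run of one or more non-whitespace characters.
words = re.findall(r'\S+', text)

# [0-9]+ is a run of one or more digits; [aeiou] is a single lowercase vowel.
numbers = re.findall(r'[0-9]+', text)
vowels = re.findall(r'[aeiou]', text)

# [^a-z ] is a single character that is neither a lowercase letter nor a space.
others = re.findall(r'[^a-z ]', text)

print(starts_with_from, ends_with_apples)  # True True
print(words)    # ['From:', 'alice', '42', 'apples']
print(numbers)  # ['42']
print(vowels)   # ['o', 'a', 'i', 'e', 'a', 'e']
print(others)   # ['F', ':', '4', '2']
```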

Check out other handouts of Regex shared with you

The Regular Expression Module
• Before you can use regular expressions in your program, you must import the library using “import re”
• You can use re.search() to see if a string matches a regular expression, similar to using the find() method for strings
• You can use re.findall() to extract portions of a string that match your regular expression, similar to a combination of find() and slicing: var[5:10]
Using re.search() Like find()

Using find():

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if line.find('From:') >= 0:
        print(line)

Using re.search():

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line) :
        print(line)
Using re.search() Like startswith()

Using startswith():

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if line.startswith('From:') :
        print(line)

Using re.search():

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line) :
        print(line)

We fine-tune what is matched by adding special characters to the string
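
The two comparisons can also be run without the mbox-short.txt file. The sketch below assumes a small in-memory list of made-up lines in place of the file and checks that each re.search() call selects the same lines as the string method it mirrors.

```python
import re

# Hypothetical lines standing in for the contents of mbox-short.txt.
lines = [
    'From: alice@example.com',
    'Subject: Re: From: the archive',
    'X-DSPAM-Result: Innocent',
]

# 'From:' anywhere in the line, like line.find('From:') >= 0.
anywhere_find = [ln for ln in lines if ln.find('From:') >= 0]
anywhere_regex = [ln for ln in lines if re.search('From:', ln)]

# 'From:' only at the start of the line, like line.startswith('From:').
start_method = [ln for ln in lines if ln.startswith('From:')]
start_regex = [ln for ln in lines if re.search('^From:', ln)]

print(anywhere_regex)  # the first two lines
print(start_regex)     # the first line only
```

Adding the single ^ character is all it takes to turn the "anywhere" test into a "starts with" test.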
Wild-Card Characters
• The dot character matches any character
• If you add the asterisk character, the character is “any number of times”

X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-DSPAM-Confidence: 0.8475
X-Content-Type-Message-Body: text/plain

^X.*:
  ^   Match the start of the line
  .   Match any character
  *   Many times
Fine-Tuning Your Match
Depending on how “clean” your data is and the purpose of your application, you may want to narrow your match down a bit

X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-Plane is behind schedule: two weeks
X-: Very short

^X.*:
  ^   Match the start of the line
  .   Match any character
  *   Many times

This pattern matches all four lines, including the last two, which are not real headers.
Fine-Tuning Your Match
Depending on how “clean” your data is and the purpose of your application, you may want to narrow your match down a bit

X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-: Very Short
X-Plane is behind schedule: two weeks

^X-\S+:
  ^     Match the start of the line
  \S    Match any non-whitespace character
  +     One or more times

This narrower pattern matches only the first two lines.
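
The difference between the loose and the narrowed pattern can be checked on the four sample lines from the slide. This is a sketch only; the variable names are chosen for illustration.

```python
import re

# The four sample lines shown on the slide.
lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-Plane is behind schedule: two weeks',
    'X-: Very short',
]

# Loose: an X, then any characters, then a colon somewhere later in the line.
loose = [ln for ln in lines if re.search(r'^X.*:', ln)]

# Narrow: 'X-', then at least one non-whitespace character, then a colon.
narrow = [ln for ln in lines if re.search(r'^X-\S+:', ln)]

print(len(loose))  # 4
print(narrow)      # ['X-Sieve: CMU Sieve 2.3', 'X-DSPAM-Result: Innocent']
```

'X-Plane is behind schedule: two weeks' fails the narrow pattern because a space appears before the colon, and 'X-: Very short' fails because nothing sits between the hyphen and the colon.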
Matching and Extracting Data
• re.search() returns a match object if the string matches the regular expression and None if it does not, so it can be used as a True/False test
• If we actually want the matching strings to be extracted, we use re.findall()

>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+',x)
>>> print(y)
['2', '19', '42']

[0-9]+   One or more digits
Matching and Extracting Data
When we use re.findall(), it returns a list of zero or more sub-strings that match the regular expression

>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+',x)
>>> print(y)
['2', '19', '42']
>>> y = re.findall('[AEIOU]+',x)
>>> print(y)
[]
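
To make the contrast between the two functions concrete, the sketch below runs both on the same string. It relies only on standard re behaviour: re.search() gives back a match object (or None), while re.findall() gives back a list.

```python
import re

x = 'My 2 favorite numbers are 19 and 42'

# re.search() stops at the first match and returns a match object, or None.
m = re.search('[0-9]+', x)
first = m.group() if m else None   # the text of the first match
position = m.start() if m else -1  # index where the first match begins

# re.findall() returns every non-overlapping match as a list of strings.
everything = re.findall('[0-9]+', x)

# No uppercase vowels occur in x, so search gives None and findall gives [].
missing = re.search('[AEIOU]+', x)
none_found = re.findall('[AEIOU]+', x)

print(first, position)      # 2 3
print(everything)           # ['2', '19', '42']
print(missing, none_found)  # None []
```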
Warning: Greedy Matching
The repeat characters (* and +) push outward in both directions (greedy) to match the largest possible string

>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+:', x)
>>> print(y)
['From: Using the :']

^F.+:
  F     First character in the match is an F
  .+    One or more characters
  :     Last character in the match is a :

Why not 'From:' ?
Non-Greedy Matching
Not all regular expression repeat codes are greedy! If you add a ? character, the + and * chill out a bit...

>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+?:', x)
>>> print(y)
['From:']

^F.+?:
  F     First character in the match is an F
  .+?   One or more characters, but not greedy
  :     Last character in the match is a :
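
The greedy and non-greedy patterns can be compared side by side. The third pattern is not from the slides: it is a common alternative, shown here as an assumption-free illustration, that excludes the colon from the repeated set so the match cannot run past the first one.

```python
import re

x = 'From: Using the : character'

# Greedy: .+ grabs as much as it can, so the match ends at the LAST colon.
greedy = re.findall('^F.+:', x)

# Non-greedy: .+? grabs as little as it can, so the match ends at the FIRST colon.
non_greedy = re.findall('^F.+?:', x)

# Alternative (not in the slides): repeat "anything except a colon" instead.
negated = re.findall('^F[^:]+:', x)

print(greedy)      # ['From: Using the :']
print(non_greedy)  # ['From:']
print(negated)     # ['From:']
```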
Fine-Tuning String Extraction
You can refine the match for re.findall() and separately determine which portion of the match is to be extracted by using parentheses

From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM

>>> x = 'From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM'
>>> y = re.findall('\S+@\S+',x)
>>> print(y)
['dean.fyb@bmsce.ac.in']

\S+@\S+
  \S+   At least one non-whitespace character
Fine-Tuning String Extraction
Parentheses are not part of the match - but they mark where the extracted string starts and stops

From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM

>>> x = 'From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM'
>>> y = re.findall('\S+@\S+',x)
>>> print(y)
['dean.fyb@bmsce.ac.in']

^From (\S+@\S+)

>>> y = re.findall('^From (\S+@\S+)',x)
>>> print(y)
['dean.fyb@bmsce.ac.in']
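
The effect of the parentheses is easier to see when the pattern contains more than one pair. The sketch below uses the sample line from the slides (the address is the one spelled out piece by piece in the double split example); the two-group pattern is an illustrative extension, not part of the original deck.

```python
import re

x = 'From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM'

# No parentheses: findall returns the whole match.
whole = re.findall(r'\S+@\S+', x)

# One pair: the line must start with 'From ', but only the
# parenthesized part is returned.
address = re.findall(r'^From (\S+@\S+)', x)

# Two pairs: findall returns one tuple per match, one item per pair.
parts = re.findall(r'^From (\S+)@(\S+)', x)

print(whole)    # ['dean.fyb@bmsce.ac.in']
print(address)  # ['dean.fyb@bmsce.ac.in']
print(parts)    # [('dean.fyb', 'bmsce.ac.in')]
```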
String Parsing Examples…
Extracting a host name - using find and string slicing

From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM
(the @ is at position 13; the next space is at position 25)

>>> data = 'From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM'
>>> atpos = data.find('@')
>>> print(atpos)
13
>>> sppos = data.find(' ',atpos)
>>> print(sppos)
25
>>> host = data[atpos+1 : sppos]
>>> print(host)
bmsce.ac.in
The Double Split Pattern
Sometimes we split a line one way, and then grab one of the pieces of the line and split that piece again

From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM

words = line.split()
email = words[1]              # 'dean.fyb@bmsce.ac.in'
pieces = email.split('@')     # ['dean.fyb', 'bmsce.ac.in']
print(pieces[1])              # 'bmsce.ac.in'
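
Both non-regex approaches can be wrapped in small functions and compared. This is a sketch: the function names are hypothetical, and the checks for a missing @ or a missing trailing space are additions that the slides do not include.

```python
def host_by_slicing(line):
    """Find the @, find the next space, and slice out what lies between."""
    atpos = line.find('@')
    if atpos < 0:              # no address on this line (added safety check)
        return None
    sppos = line.find(' ', atpos)
    if sppos < 0:              # the address runs to the end of the line
        sppos = len(line)
    return line[atpos + 1:sppos]


def host_by_double_split(line):
    """Split on whitespace, take the second word, then split that on @."""
    words = line.split()
    if len(words) < 2 or '@' not in words[1]:   # added safety check
        return None
    return words[1].split('@')[1]


data = 'From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM'
print(host_by_slicing(data))       # bmsce.ac.in
print(host_by_double_split(data))  # bmsce.ac.in
```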
The Regex Version
From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM

import re
lin = 'From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM'
y = re.findall('@([^ ]*)',lin)
print(y)

['bmsce.ac.in']

'@([^ ]*)'
  @      Look through the string until you find an at sign
  [^ ]   Match non-blank character
  *      Match many of them
  ( )    Extract the non-blank characters

Even Cooler Regex Version
From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM

import re
lin = 'From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM'
y = re.findall('^From .*@([^ ]*)',lin)
print(y)

['bmsce.ac.in']

'^From .*@([^ ]*)'
  ^From    Starting at the beginning of the line, look for the string 'From '
  .*@      Skip a bunch of characters, looking for an at sign
  (        Start extracting
  [^ ]     Match non-blank character
  *        Match many of them
  )        Stop extracting
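
The extra '^From ' prefix matters once the input contains other lines that also have an at sign. The sketch below applies both patterns to a few lines; the second and third lines are made up for illustration.

```python
import re

lines = [
    'From dean.fyb@bmsce.ac.in Wed Jan 4 2023 3:58 PM',
    'Subject: write to help@example.org',   # hypothetical line
    'From: someone@example.com',            # hypothetical line
]

# Plain version: any at sign on any line yields a host name.
plain = [re.findall('@([^ ]*)', ln) for ln in lines]

# Stricter version: only lines that begin with 'From ' (with the space) count.
strict = [re.findall('^From .*@([^ ]*)', ln) for ln in lines]

print(plain)   # [['bmsce.ac.in'], ['example.org'], ['example.com']]
print(strict)  # [['bmsce.ac.in'], [], []]
```

The third line is rejected by the stricter pattern because 'From' is followed by a colon rather than a space.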
Spam Confidence
Sample line: X-DSPAM-Confidence: 0.8475

import re
hand = open('mbox-short.txt')
numlist = list()
for line in hand:
    line = line.rstrip()
    stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1 : continue
    num = float(stuff[0])
    numlist.append(num)
print('Maximum:', max(numlist))

python ds.py
Maximum: 0.9907
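
The same loop can be tried without the mbox-short.txt file. The sketch below wraps it in a function (the name max_confidence is hypothetical) and feeds it a short list of made-up lines; returning None for input with no matching lines is an added assumption, since max() fails on an empty list.

```python
import re


def max_confidence(lines):
    """Return the largest X-DSPAM-Confidence value found, or None if there is none."""
    numlist = []
    for line in lines:
        line = line.rstrip()
        # The parentheses extract only the number that follows the header name.
        stuff = re.findall(r'^X-DSPAM-Confidence: ([0-9.]+)', line)
        if len(stuff) != 1:
            continue
        numlist.append(float(stuff[0]))
    return max(numlist) if numlist else None


# Hypothetical lines standing in for the file contents.
sample = [
    'X-DSPAM-Confidence: 0.8475\n',
    'X-DSPAM-Probability: 0.0000\n',
    'X-DSPAM-Confidence: 0.9907\n',
    'Subject: hello\n',
    'X-DSPAM-Confidence: 0.6178\n',
]

print('Maximum:', max_confidence(sample))  # Maximum: 0.9907
```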
Escape Character
If you want a special regular expression character to just behave normally (most of the time) you prefix it with '\'

>>> import re
>>> x = 'We just received $10.00 for cookies.'
>>> y = re.findall('\$[0-9.]+',x)
>>> print(y)
['$10.00']

\$[0-9.]+
  \$       A real dollar sign
  [0-9.]   A digit or period
  +        At least one or more
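
What goes wrong without the backslash can be seen by running both forms. The sketch also shows two things the slides do not cover: re.escape(), a standard-library helper that escapes special characters for you, and a character class as another way to write a literal dollar sign. Patterns are written as raw strings so the backslash is passed to re as typed.

```python
import re

x = 'We just received $10.00 for cookies.'

# Unescaped, $ means "end of the line", so nothing can follow it.
unescaped = re.findall(r'$[0-9.]+', x)

# Escaped, \$ is a real dollar sign.
escaped = re.findall(r'\$[0-9.]+', x)

# Inside square brackets, $ is also taken literally.
in_class = re.findall(r'[$][0-9.]+', x)

# re.escape() builds a pattern that matches a piece of text exactly.
literal_pattern = re.escape('$10.00')
exact = re.findall(literal_pattern, x)

print(unescaped)        # []
print(escaped)          # ['$10.00']
print(in_class)         # ['$10.00']
print(literal_pattern)  # \$10\.00
print(exact)            # ['$10.00']
```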
Summary
• Regular expressions are a cryptic but powerful
language for matching strings and extracting elements
from those strings
• Regular expressions have special characters that
indicate intent
