RegEx in Python (4)

Uploaded by

Yash Verma

REGULAR EXPRESSIONS (REGEX) IN PYTHON:

Regular Expressions (RegEx) are a powerful tool for pattern matching and text manipulation. In Python, regex
functionality is implemented through the re module.

APPLICATIONS OF REGEX
● Data validation
● Data extraction
● Input sanitization (data cleaning)

This document covers regex basics, syntax, the main functions of the re module, and practical examples.

What is a Regular Expression?


A Regular Expression is a sequence of characters that defines a search pattern. It can be used to match strings,
validate formats, or extract information.

COMMON USE CASES OF REGEX (SOME OF THESE ARE WORKED THROUGH IN DETAIL LATER IN THIS ARTICLE):

● Extracting email addresses
● Extracting timestamps from logs
● Extracting URLs
● Validating phone numbers or dates
● Searching for words or patterns in text
● Validating passwords

Regex Syntax in Python


To use regex, you define a pattern (a regular expression) made up of ordinary characters, special characters, and
sequences. Together they describe what to look for in a text.
Here are some of the most common components of regex syntax:

1. SPECIAL CHARACTERS
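Character   Description
.           Matches any single character except a newline (by default).
^           Matches the start of the string.
$           Matches the end of the string.
*           Matches 0 or more repetitions of the preceding element.
+           Matches 1 or more repetitions of the preceding element.
?           Matches 0 or 1 occurrence of the preceding element.
{n}         Matches exactly n occurrences of the preceding element.
{n,}        Matches n or more occurrences of the preceding element.
{n,m}       Matches between n and m occurrences of the preceding element.
\           Escapes a special character so that it is matched literally.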
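The entries in the table above can be tried out with re.search. The following is a minimal sketch; the sample strings are made up for illustration, and each case records whether a match is expected.

```python
import re

# Each case: (pattern, text, whether re.search is expected to find a match)
cases = [
    (r"c.t", "cat", True),              # . matches any single character
    (r"^Hello", "Hello world", True),   # ^ anchors the match at the start
    (r"world$", "Hello world", True),   # $ anchors the match at the end
    (r"ab*c", "ac", True),              # * allows zero occurrences of b
    (r"ab+c", "ac", False),             # + needs at least one b
    (r"colou?r", "color", True),        # ? makes the u optional
    (r"^\d{3}$", "1234", False),        # {3} means exactly three digits
    (r"^\d{2,}$", "1234", True),        # {2,} means two or more digits
    (r"^\d{2,3}$", "1234", False),      # {2,3} means two or three digits
    (r"3\.14", "3.14", True),           # \. matches a literal dot
    (r"3\.14", "3x14", False),          # ...and nothing else
]

results = [bool(re.search(pattern, text)) for pattern, text, _ in cases]

for (pattern, text, expected), found in zip(cases, results):
    print(f"{pattern!r:14} on {text!r:15} -> match: {found}")
```

Every printed result agrees with the expected value stored in its case.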

Created by: Anjali Garg | Data Scientist | Aspiring ML Engineer | https://www.linkedin.com/in/anjali-garg-2a7747222/


2. CHARACTER CLASSES
Syntax      Description
[arn]       Matches one character that is a, r, or n.
[a-n]       Matches any one lowercase letter from a to n.
[^arn]      Matches any one character that is NOT a, r, or n.
[0123]      Matches one of the digits 0, 1, 2, or 3.
[0-9]       Matches any one digit from 0 to 9.
[0-5][0-9]  Matches any two-digit number from 00 to 59.
[a-zA-Z]    Matches any one letter, lowercase or uppercase.
[+]         Inside a set, most special characters (such as +, *, . and ?) lose their special meaning, so this
            matches a literal '+'. The exceptions are ^ at the start of the set, - between two characters
            (which forms a range), ] and \.
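As a quick check of the character classes listed above, the sketch below runs re.findall with a few of them. The input strings are illustrative only.

```python
import re

# [arn]: any one of the characters a, r, n
in_set = re.findall(r"[arn]", "banana")

# [^arn]: any character except a, r, n
not_in_set = re.findall(r"[^arn]", "banana")

# [a-n]: a lowercase letter in the range a to n (uppercase P is excluded)
in_range = re.findall(r"[a-n]", "Python")

# [0-5][0-9]: a two-digit number from 00 to 59
minutes = re.findall(r"[0-5][0-9]", "07 42 59 60 99")

# [+]: a literal plus sign, because + is not a quantifier inside a set
plus_signs = re.findall(r"[+]", "1+1=2")

print(in_set)      # ['a', 'n', 'a', 'n', 'a']
print(not_in_set)  # ['b']
print(in_range)    # ['h', 'n']
print(minutes)     # ['07', '42', '59']
print(plus_signs)  # ['+']
```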

3. PREDEFINED SEQUENCES
Sequence    Description
\A          Matches only at the start of the string.
\b          Matches at a word boundary, i.e., at the beginning or at the end of a word.
\B          Matches at a position that is NOT a word boundary.
\d          Matches a digit (0-9).
\D          Matches any character that is not a digit.
\s          Matches a white space character.
\S          Matches any character that is not a white space character.
\w          Matches a word character, i.e., a-z, A-Z, 0-9, or underscore.
\W          Matches any character that is not a word character.
\Z          Matches only at the end of the string.
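The predefined sequences are easiest to understand by running them. Here is a small sketch with invented sample strings, showing how \d, \w, \s, \b, \B, \A and \Z behave.

```python
import re

# \d+ : runs of digits
digits = re.findall(r"\d+", "Order 66 shipped in 3 days")

# \w+ : runs of word characters (the hyphen is not a word character)
words = re.findall(r"\w+", "snake_case, kebab-case")

# \s+ : split on any run of white space (spaces, tabs, newlines)
tokens = re.split(r"\s+", "a  b\tc\nd")

# \b : "cat" as a whole word only
whole_word = re.findall(r"\bcat\b", "cat concat catalog cat.")

# \B : "cat" strictly inside a word (only "educate" qualifies)
inside_word = re.findall(r"\Bcat\B", "cat concat educate")

# \A and \Z : anchors for the start and end of the whole string
starts_with_the = bool(re.search(r"\AThe", "The end"))
ends_with_end = bool(re.search(r"end\Z", "The end"))

print(digits)       # ['66', '3']
print(words)        # ['snake_case', 'kebab', 'case']
print(tokens)       # ['a', 'b', 'c', 'd']
print(whole_word)   # ['cat', 'cat']
print(inside_word)  # ['cat']
print(starts_with_the, ends_with_end)  # True True
```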

4. GROUPING AND CAPTURING

Parentheses () are used to group parts of a regex pattern and capture matches. Capturing groups save the matched
content for later use, while non-capturing groups allow grouping without saving the matched content.

CAPTURING GROUP
A capturing group matches the specified pattern and saves the matched content for reference. For example:

import re

pattern = r"(\d{3})-(\d{2})-(\d{4})"
text = "123-45-6789"
match = re.match(pattern, text)
print(match.groups()) # Output: ('123', '45', '6789')

NON-CAPTURING GROUP
A non-capturing group groups the pattern without saving the matched content. Use (?:...) to create a
non-capturing group. For example:

import re

pattern = r"(?:\d{3})-(\d{2})-(\d{4})"
text = "123-45-6789"
match = re.match(pattern, text)
print(match.groups()) # Output: ('45', '6789')
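To illustrate what "saved for later use" means, the sketch below reads individual groups from a match object and then reuses a group in a replacement string. The "ID: ..." text and the masking format are made up for illustration.

```python
import re

pattern = r"(\d{3})-(\d{2})-(\d{4})"
text = "ID: 123-45-6789"

# re.search scans the whole string; re.match would only try position 0
match = re.search(pattern, text)

whole = match.group(0)   # the entire match
first = match.group(1)   # content of the first capturing group
last = match.group(3)    # content of the third capturing group

# In a replacement string, \3 refers back to the third captured group
masked = re.sub(pattern, r"***-**-\3", text)

print(whole)   # 123-45-6789
print(first)   # 123
print(last)    # 6789
print(masked)  # ID: ***-**-6789
```

A non-capturing group (?:...) cannot be referenced this way, which is why it is the better choice when the parentheses are needed only for grouping.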



PRACTICAL EXAMPLES
1. MATCHING EMAIL ADDRESSES
Example: [email protected]
● The username part, i.e., the part before @:
Can contain letters a-z and A-Z, digits 0-9, dot ., and hyphen -. Some email providers, unlike Gmail, also allow
underscore _ and other special characters such as +.
[email protected] : "[a-zA-Z0-9._+-]+" : one or more occurrences of these characters. The hyphen is placed
last in the set so that it is read as a literal hyphen and not as a range.
● The domain part, i.e., the part after @:
Can contain subdomains, the domain, and domain extensions, and must end with an extension that contains at
least 2 letters.
[email protected] : "[a-zA-Z0-9.-]+"
[email protected] : "\.[a-zA-Z]{2,}"
# Complete regex:
r"[a-zA-Z0-9._+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
# Roughly equivalent regex:
r"[\w.+-]+@[\w.-]+\.[a-zA-Z]{2,}"
# (\w: any letter, digit, or underscore; {2,} means 2 or more occurrences)
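The following sketch shows one way to use an email pattern for validation. It assumes a character class with the hyphen placed last (so that it is a literal hyphen), and it uses re.fullmatch so that the whole string has to be an address. The addresses and the helper name is_email are hypothetical, and the pattern is a simplification, not a full implementation of the email address standard.

```python
import re

# Simplified email pattern; the hyphen is last in each set so it is literal
EMAIL_RE = re.compile(r"[a-zA-Z0-9._+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_email(candidate):
    """Return True if the whole string looks like an email address."""
    return EMAIL_RE.fullmatch(candidate) is not None

# Hypothetical test addresses
checks = {
    "jane.doe+news@mail.example.com": is_email("jane.doe+news@mail.example.com"),
    "john.doe@com": is_email("john.doe@com"),                      # no dot extension
    "noatsymbol.com": is_email("noatsymbol.com"),                  # no @ sign
    "first last@example.com": is_email("first last@example.com"),  # contains a space
}

for address, ok in checks.items():
    print(f"{address!r}: {ok}")
```

Only the first address is accepted.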

2. MATCHING QUESTIONS
Examples:
- Is this your final answer?
- "Python is a snake" - is this statement correct?
- Why is the sky blue during the day?
● Start of a question: can be alphanumeric or a quotation mark: r"[a-zA-Z0-9\"']+"
● Middle part of a question: r"[a-zA-Z0-9\"' ,_+-]*"
(the hyphen is placed last so that it is a literal hyphen; written as ,-_ it would form a range that also covers
characters such as . / : and ?. You can include more special characters if they are allowed in the questions, or
you can use [^?\n] to match every character except a question mark and a new line)
● End of a question: r"\?"

# Complete regex:
r"[\w\"']+[\w\"',_+ -]*\?"
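A short sketch of the question pattern in use, assuming the hyphen is written last in the set so that it stays a literal hyphen. The last two lines demonstrate why its position matters: between a comma and an underscore it defines a range that happens to include the question mark.

```python
import re

# Question pattern with the hyphen at the end of the set (literal hyphen)
QUESTION_RE = re.compile(r"[\w\"']+[\w\"',_+ -]*\?")

lines = [
    'Is this your final answer?',
    '"Python is a snake" - is this statement correct?',
    'Is this even correct..',
]

# fullmatch requires the whole line to be a question
is_question = [bool(QUESTION_RE.fullmatch(line)) for line in lines]
print(is_question)  # [True, True, False]

# Looser alternative from the text: anything except ? and newline, then ?
LOOSE_RE = re.compile(r"[^?\n]+\?")
loose = [bool(LOOSE_RE.fullmatch(line)) for line in lines]
print(loose)  # [True, True, False]

# Hyphen position matters: ,-_ is a range from ',' to '_' that includes '?'
range_hits_qmark = bool(re.fullmatch(r"[,-_]", "?"))
literal_hits_qmark = bool(re.fullmatch(r"[,_-]", "?"))
print(range_hits_qmark, literal_hits_qmark)  # True False
```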



3. MATCHING URLS
Example:
- https://www.example.com?query_param1=value1&query_param2=value2

Many special characters are allowed in a URL, but some are not. For example, white space is encoded as %20, and
non-ASCII characters are percent-encoded as well, so an encoded URL consists only of ASCII characters.

Components of a URL:
● Scheme (http/https) of the URL followed by ://: r"https?:\/\/"
● Subdomain, domain, top-level domain: r"(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}"
● Port number's non-capturing group: r"(?::[0-9]{1,5})?"
● Path's non-capturing group: r"(?:\/[^\s?#]*)?"
● Query separator and parameters' non-capturing group: r"(?:\?[a-zA-Z0-9%._\-~+=&]*)?"
● Fragment's non-capturing group: r"(?:#[^\s]*)?"

# Complete regex:
r"https?:\/\/(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}(?::[0-9]{1,5})?(?:\/[^\s?#]*)?(?:\?[a-zA-Z0-9%._\-~+=&]*)?(?:#[^\s]*)?"
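Since the complete URL regex is long, one option is to build it from named pieces so that each component stays readable. The sketch below does that; the variable names and the sample text are assumptions made for illustration. The forward slash is left unescaped because it has no special meaning in Python's re syntax.

```python
import re

# One variable per URL component, concatenated into the full pattern
scheme = r"https?://"
host = r"(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}"
port = r"(?::[0-9]{1,5})?"
path = r"(?:/[^\s?#]*)?"
query = r"(?:\?[a-zA-Z0-9%._\-~+=&]*)?"
fragment = r"(?:#[^\s]*)?"

URL_RE = re.compile(scheme + host + port + path + query + fragment)

text = (
    "Docs: https://www.example.com:8080/a/b?x=1&y=2#top "
    "and ftp://wrong.protocol.com"
)

# All groups are non-capturing, so findall returns the full matches
found = URL_RE.findall(text)
print(found)  # ['https://www.example.com:8080/a/b?x=1&y=2#top']
```

The ftp address is skipped because the scheme part accepts only http and https.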

4. MATCHING IPV4 ADDRESSES


An IPv4 address consists of four octets, separated by dots (.), where each octet is a number between 0 and 255.
Logic behind the regex to match a number between 0-255:
● Number between 0-9: [0-9]
● Number between 10-99: [1-9][0-9]
● Number between 0-99: [1-9]?[0-9]
● Number between 100-199: 1[0-9][0-9]
● Number between 200-249: 2[0-4][0-9]
● Number between 250-255: 25[0-5]

Regex for a number between 0-255: r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"

# Complete regex:
r"(?:(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
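The sketch below validates IPv4 addresses with an octet pattern that covers the whole 0-255 range by splitting it into 250-255, 200-249, 100-199 and 0-99. This split is an assumption chosen so that values such as 209 or 249 are accepted. The helper name is_ipv4 and the test addresses are illustrative.

```python
import re

# One octet: 250-255 | 200-249 | 100-199 | 0-99 (no leading zeros)
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"

# Three octets each followed by a dot, then a final octet.
# In an rf-string, {{3}} produces the literal quantifier {3}.
IPV4_RE = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}")

def is_ipv4(candidate):
    """Return True if the whole string is a dotted IPv4 address."""
    return IPV4_RE.fullmatch(candidate) is not None

valid = ["192.168.1.1", "127.0.0.1", "255.255.255.255", "209.85.231.104"]
invalid = ["256.256.256.256", "999.999.999.999", "1.2.3"]

valid_results = [is_ipv4(address) for address in valid]
invalid_results = [is_ipv4(address) for address in invalid]

print(valid_results)    # [True, True, True, True]
print(invalid_results)  # [False, False, False]
```

The alternatives are ordered from the longest numbers to the shortest, so that a three-digit octet is not cut short when the pattern is used with re.search or re.findall.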



Python’s re Module
The re module provides built-in functions for regex operations.

COMMON FUNCTIONS
Each entry lists: Function, Description, Syntax, Return Value (x)

re.findall
  Description: Returns a list containing all matches in the order they are found. If no match, empty list.
  Syntax: x = re.findall("regex_expression", text)
  Return Value (x): List of all matched strings

re.search
  Description: Returns a match object for the first match found. Returns None if no match is found.
  Syntax: x = re.search("regex_expression", text)
  Return Value (x): Match object (if found) or None

re.split
  Description: Splits a string into a list at each match. Optionally, limit the splits with maxsplit.
  Syntax: x = re.split("regex_expression", text, [maxsplit])
  Return Value (x): List of separated strings

re.sub
  Description: Replaces one or more matches with a given string. Optionally limit replacements with count.
  Syntax: x = re.sub("regex_expression", "replacement_string", text, count)
  Return Value (x): A new string with substitutions applied
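The four functions can be compared side by side on one string. This is a minimal sketch with an invented sample text; maxsplit and count are passed as keyword arguments, which keeps the calls readable.

```python
import re

text = "apple, banana;cherry  date"

# re.findall: every match, in order, as a list of strings
all_words = re.findall(r"[a-z]+", text)

# re.search: a match object for the first match, or None
first_b = re.search(r"b[a-z]+", text)
no_digits = re.search(r"\d+", text)

# re.split: cut the string at every run of separators
parts = re.split(r"[,;\s]+", text)
first_split_only = re.split(r"[,;\s]+", text, maxsplit=1)

# re.sub: replace matches; count limits how many are replaced
replaced = re.sub(r"[,;\s]+", " | ", text, count=2)

print(all_words)         # ['apple', 'banana', 'cherry', 'date']
print(first_b.group())   # banana
print(no_digits)         # None
print(parts)             # ['apple', 'banana', 'cherry', 'date']
print(first_split_only)  # ['apple', 'banana;cherry  date']
print(replaced)          # apple | banana | cherry  date
```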

CODE:
import re

# Sample text with correct and incorrect examples
sample_text = """
Correct Examples:
[email protected]
[email protected]
Is this your final answer?
"Python is a snake" - is this statement correct?
https://www.example.com?query_param1=value1&query_param2=value2
http://example.org/resource
192.168.1.1
127.0.0.1

Incorrect Examples:
john.doe@com
noatsymbol.com
Is this even correct..
ftp://wrong.protocol.com
256.256.256.256
999.999.999.999
"""

# Regex patterns
patterns = {
    "Email Address": r"[a-zA-Z0-9._+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    # The hyphen is last in the set, so it is a literal hyphen and not a range.
    # The pattern is unanchored, so findall also picks up the fragment "com?"
    # from the first URL in the sample text.
    "Question": r"[a-zA-Z0-9\"'][a-zA-Z0-9\"',_+ -]*\?",
    "URL": r"https?:\/\/(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}(?::[0-9]{1,5})?(?:\/[^\s?#]*)?(?:\?[a-zA-Z0-9%._\-~+=&]*)?(?:#[^\s]*)?",
    # Each octet: 250-255, 200-249, 100-199, or 0-99
    "IPv4 Address": r"(?:(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])",
}

def test_regex(pattern_name, pattern, text):
    """Print every match of the pattern found in the text."""
    print(f"\nTesting: {pattern_name}")
    matches = re.findall(pattern, text)
    print("Matches:")
    for match in matches:
        print(f" - {match}")

# Testing all patterns
for name, regex in patterns.items():
    test_regex(name, regex, sample_text)

OUTPUT:
Testing: Email Address
Matches:
 - [email protected]
 - [email protected]

Testing: Question
Matches:
 - Is this your final answer?
 - "Python is a snake" - is this statement correct?
 - com?

Testing: URL
Matches:
 - https://www.example.com?query_param1=value1&query_param2=value2
 - http://example.org/resource

Testing: IPv4 Address
Matches:
 - 192.168.1.1
 - 127.0.0.1

Theory References:
https://www.w3schools.com/python/python_regex.asp
https://www.geeksforgeeks.org/components-of-a-url/
