RegEx
Study Notes
REGEX in Data Analytics
Introduction to Data Analytics
Data Analytics, more commonly known as Data Science, is one of the
fastest-growing fields in today’s world. As technology advances and
more users adopt it, a huge amount of data flows in to service
providers, and this data needs to be processed and understood in order
to provide better service.
This is done using various tools and techniques such as data slicing,
manipulation, mining and algorithmic processing. The most popular
languages for running these analytic applications are Python and R.
We shall discuss these techniques one by one. The first one is REGEX.
What is REGEX:
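REGEX stands for Regular Expressions. A regular expression is a pattern that describes a set
of strings; it is used to search for, or replace, specific text in a large data set.
The concept was introduced in the 1950s by the American mathematician Stephen Cole Kleene.
In a large set of data, it helps by finding the strings that match a specific pattern.
For example, suppose you want to find data where the name of a person starts with the letter
‘i’ and ends with ‘e’. In Python's re module, the search can be written as
re.search("^i.*e$", txt)
Here ^i requires the string to start with ‘i’, .* allows any characters in between, and e$
requires it to end with ‘e’. The call returns a match object if txt matches and None
otherwise; how the result is displayed depends on the analyst's code.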
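The name search described above can be sketched in Python as follows. This is a minimal illustration, not the only way to write it: the list `names` and its contents are hypothetical sample data, and the pattern assumes each name is a single lower-case string with nothing before or after it.

```python
import re

# Hypothetical sample names, used only for illustration.
names = ["irene", "Isabelle", "ian", "eddie", "ike"]

# ^i : the string starts with "i"
# .* : any characters (zero or more) in between
# e$ : the string ends with "e"
pattern = r"^i.*e$"

# re.search returns a match object on success and None otherwise.
matches = [name for name in names if re.search(pattern, name)]
print(matches)  # ['irene', 'ike']

# Matching is case-sensitive by default; re.IGNORECASE also accepts "Isabelle".
matches_any_case = [n for n in names if re.search(pattern, n, re.IGNORECASE)]
print(matches_any_case)  # ['irene', 'Isabelle', 'ike']
```

Anchoring with ^ and $ matters here: without them, the pattern would also match names that merely contain an ‘i’ followed later by an ‘e’.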
REGEX Functions:
The main REGEX functions in Python's re module, including search, are as follows:
Function  Description
findall   Returns a list containing all the matches
search    Returns a match object if there is a match anywhere in the string
split     Returns a list where the string has been split at each match
sub       Replaces one or many matches with a string
These functions help in doing the necessary REGEX search and replace operations.
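As a rough illustration of the four functions listed above, the sketch below applies each of them to one short string using Python's `re` module. The sample text `txt` and the variable names are assumptions made for the example.

```python
import re

txt = "The rain in Spain"  # hypothetical sample text

# findall: a list of every non-overlapping match of the pattern.
all_ai = re.findall("ai", txt)
print(all_ai)  # ['ai', 'ai']

# search: a match object for the first match, or None if there is none.
first_space = re.search(r"\s", txt)
first_space_pos = first_space.start()
print(first_space_pos)  # 3

# split: the string cut into pieces at each match.
words = re.split(r"\s", txt)
print(words)  # ['The', 'rain', 'in', 'Spain']

# sub: every match replaced with the given string.
underscored = re.sub(r"\s", "_", txt)
print(underscored)  # The_rain_in_Spain
```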
Metacharacters in REGEX:
Character  Description                                                       Example
[]         A set of characters                                               "[a-m]"
\          Signals a special sequence (can also escape special characters)   "\d"
.          Any single character except newline; one dot per character        "he..o"
^          The string starts with                                            "^hello"
$          The string ends with                                              "world$"
*          Zero or more occurrences of the preceding element                 "aix*"
+          One or more occurrences of the preceding element                  "aix+"
{}         Exactly the specified number of occurrences                       "al{2}"
|          Either or                                                         "falls|stays"
()         Capture and group
These metacharacters are the building blocks of search patterns. By combining them, a
complex search pattern can be formed to extract data from the text in a data set.
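The behaviour of each metacharacter can be checked with a short Python sketch such as the one below. The input strings are hypothetical and were chosen only so that each pattern from the table has something to match; the `(\d+)-(\d+)` pattern is an added illustration of grouping, since the table gives no example for ().

```python
import re

# Hypothetical sample strings, chosen only to exercise each metacharacter.
in_set = re.findall("[a-m]", "hello")            # letters between a and m
any_char = re.findall("he..o", "hello world")    # "he", two characters, "o"
starts = re.search("^hello", "hello world") is not None
ends = re.search("world$", "hello world") is not None
star = re.findall("aix*", "ai aix aixx")         # "ai" then zero or more "x"
plus = re.findall("aix+", "ai aix aixx")         # "ai" then one or more "x"
exact = re.findall("al{2}", "al all")            # "a" then exactly two "l"
either = re.findall("falls|stays", "rain falls, sun stays")
# Parentheses capture the parts of the match they enclose.
groups = re.search(r"(\d+)-(\d+)", "pages 10-20").groups()

print(in_set)        # ['h', 'e', 'l', 'l']
print(any_char)      # ['hello']
print(starts, ends)  # True True
print(star)          # ['ai', 'aix', 'aixx']
print(plus)          # ['aix', 'aixx']
print(exact)         # ['all']
print(either)        # ['falls', 'stays']
print(groups)        # ('10', '20')
```

Comparing `star` with `plus` shows the practical difference between * and +: only * accepts the bare "ai" with no "x" after it.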
Special Sequences in REGEX:
A special sequence is a ‘\’ followed by one of the characters in the list below. Each
sequence has a special meaning, as described in the list.
Character  Description                                                                     Example
\A         Matches if the specified characters are at the beginning of the string          "\AThe"
\b         Matches if the specified characters are at the beginning or end of a word       r"\bain", r"ain\b"
\B         Matches if the specified characters are NOT at the beginning or end of a word   r"\Bain", r"ain\B"
\d         Matches any digit (0-9)                                                         "\d"
\D         Matches any character that is NOT a digit                                       "\D"
\s         Matches any white space character                                               "\s"
\S         Matches any character that is NOT white space                                   "\S"
\w         Matches any word character (a-z, A-Z, 0-9, and the underscore _)                "\w"
\W         Matches any character that is NOT a word character                              "\W"
\Z         Matches if the specified characters are at the end of the string                "Spain\Z"
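The special sequences can be tried out with a sketch like the following. The sample text `txt` is hypothetical. Raw strings (the r prefix) are used throughout because in an ordinary Python string "\b" means a backspace character rather than a word boundary.

```python
import re

txt = "The rain in Spain costs 25 euros"  # hypothetical sample text

at_start = re.search(r"\AThe", txt) is not None   # "The" at the very beginning
word_start = re.findall(r"\bain", txt)            # "ain" at the start of a word
word_end = re.findall(r"ain\b", txt)              # "ain" at the end of a word
inside_word = re.findall(r"\Bain", txt)           # "ain" NOT at the start of a word
digits = re.findall(r"\d", txt)                   # every digit
space_count = len(re.findall(r"\s", txt))         # number of white space characters
non_word = re.findall(r"\W", "a_b c!")            # everything except letters, digits, _
at_end = re.search(r"euros\Z", txt) is not None   # "euros" at the very end

print(at_start)     # True
print(word_start)   # []
print(word_end)     # ['ain', 'ain']
print(inside_word)  # ['ain', 'ain']
print(digits)       # ['2', '5']
print(space_count)  # 6
print(non_word)     # [' ', '!']
print(at_end)       # True
```

No word in the sample begins with "ain", so `\bain` finds nothing, while `ain\b` and `\Bain` both find the endings of "rain" and "Spain".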
Sets in REGEX:
A set is a group of characters placed inside square brackets [ ]. It matches any one of the
characters it contains.
Set         Description
[arn]       Matches where one of the specified characters (a, r, or n) is present
[a-n]       Matches any lower case character alphabetically between a and n
[^arn]      Matches any character EXCEPT a, r, and n
[0123]      Matches where any of the specified digits (0, 1, 2, or 3) is present
[0-9]       Matches any digit between 0 and 9
[0-5][0-9]  Matches any two-digit number from 00 to 59
[a-zA-Z]    Matches any character alphabetically between a and z, lower case OR upper case
In sets, the characters +, *, ., |, (), $ and {} have no special meaning, so [+] matches
any + character in the string.
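The sets above can be illustrated with a few calls to `re.findall`. This is only a sketch; the input strings are hypothetical and were picked so that each set has both matching and non-matching characters.

```python
import re

# Hypothetical inputs, one per set from the table.
any_of = re.findall("[arn]", "rain")                # a, r or n
none_of = re.findall("[^arn]", "rain")              # anything except a, r, n
letters = re.findall("[a-zA-Z]", "R2-D2")           # letters of either case
two_digit = re.findall("[0-5][0-9]", "08:59 and 75")  # numbers 00 to 59
literal_plus = re.findall("[+]", "1+1=2, a+b")      # + is a plain character in a set

print(any_of)        # ['r', 'a', 'n']
print(none_of)       # ['i']
print(letters)       # ['R', 'D']
print(two_digit)     # ['08', '59']
print(literal_plus)  # ['+', '+']
```

The value 75 is skipped by `[0-5][0-9]` because its first digit falls outside the range 0 to 5.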