RegularExpression Notes

This document describes common regular expression syntax elements and their meanings. It provides a table listing regular expression syntax characters and elements, along with brief descriptions of their functions. These include elements for matching characters, whitespace, digits, word boundaries, line anchors, and quantifiers like repetition, optional elements, and capturing groups. Special characters require escaping with a backslash. The document is a reference for understanding the elements that make up regular expressions.

Uploaded by

megha
Copyright
© All Rights Reserved

[abc] A single character of: a, b, or c

[^abc] Any single character except: a, b, or c


[a-z] Any single character in the range a-z
[a-zA-Z] Any single character in the range a-z or A-Z
^ Start of line
$ End of line
\A Start of string
\z End of string
. Any single character
\s Any whitespace character
\S Any non-whitespace character
\d Any digit
\D Any non-digit
\w Any word character (letter, number, underscore)
\W Any non-word character
\b Any word boundary
(...) Capture everything enclosed
(a|b) a or b
a? Zero or one of a
a* Zero or more of a
a+ One or more of a
a{3} Exactly 3 of a
a{3,} 3 or more of a
a{3,6} Between 3 and 6 of a

Literal Characters
Note that regex engines are case sensitive by default. cat does not match Cat, unless you tell the
regex engine to ignore differences in case.
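As a small illustration of case sensitivity, here is a minimal sketch using Python's built-in re module. The choice of Python is an assumption made for these examples; the notes themselves are not tied to one language. In that module, the re.IGNORECASE flag is how you tell the engine to ignore differences in case.

```python
import re

# By default, matching is case sensitive: cat does not match Cat.
default_match = re.search(r"cat", "Cat")

# re.IGNORECASE tells the engine to ignore differences in case.
ignorecase_match = re.search(r"cat", "Cat", re.IGNORECASE)

print(default_match)             # None
print(ignorecase_match.group())  # Cat
```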

Special Characters
There are 12 characters with special meanings: the backslash \, the caret ^, the dollar sign $, the
period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus
sign +, the opening parenthesis (, the closing parenthesis ), the opening square bracket [, and the
opening curly brace {. These special characters are often called "metacharacters". Most of them
are errors when used alone.
If you want to use any of these characters as a literal in a regex, you need to escape them with a
backslash. If you want to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign has a
special meaning.
Most regular expression flavors treat the brace { as a literal character, unless it is part of a
repetition operator like a{1,3}. So you generally do not need to escape it with a backslash, though
you can do so if you want. An exception to this rule is Java, which requires all literal braces to
be escaped.

All other characters should not be escaped with a backslash. That is because the backslash is also
a special character. The backslash in combination with a literal character can create a regex token
with a special meaning. E.g. \d is a shorthand that matches a single digit from 0 to 9.
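The 1+1=2 example can be checked with a short sketch, again assuming Python's re module. The variable names are illustrative only. re.escape is that module's helper for adding the backslashes automatically when a piece of text is meant literally.

```python
import re

text = "1+1=2"

# Unescaped, + is a quantifier: the pattern means "one or more 1, then 1=2",
# so it matches 11=2 but not the literal text 1+1=2.
unescaped = re.compile(r"1+1=2")

# Escaped, \+ matches a literal plus sign.
escaped = re.compile(r"1\+1=2")

# re.escape adds the needed backslashes for text that is meant literally.
auto_escaped = re.compile(re.escape(text))

# A backslash before a literal character can create a token: \d is a digit.
digits = re.findall(r"\d", "a1b22")

print(unescaped.search(text))             # None
print(unescaped.search("11=2").group())   # 11=2
print(escaped.search(text).group())       # 1+1=2
print(auto_escaped.search(text).group())  # 1+1=2
print(digits)                             # ['1', '2', '2']
```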

Syntax Feature

Any character except [\^$.|?*+() Literal character

\ followed by any of [\^$.|?*+(){} Backslash escapes a metacharacter

. Any character

| Alternation

\| Alternation

? Greedy quantifier

\? Greedy quantifier

?? Lazy quantifier

?+ Possessive quantifier

* Greedy quantifier

*? Lazy quantifier

*+ Possessive quantifier

+ Greedy quantifier
\+ Greedy quantifier

+? Lazy quantifier

++ Possessive quantifier

{ and } Literal curly braces

{n} where n is an integer >= 1 Fixed quantifier

{n,m} where n >= 0 and m >= n Greedy quantifier

{n,} where n >= 0 Greedy quantifier

{,m} where m >= 1 Greedy quantifier

\{n\} where n is an integer >= 1 Fixed quantifier

\{n,m\} where n >= 0 and m >= n Greedy quantifier

\{n,\} where n >= 0 Greedy quantifier

\{,m\} where m >= 1 Greedy quantifier

{n,m}? where n >= 0 and m >= n Lazy quantifier

{n,}? where n >= 0 Lazy quantifier

{,m}? where m >= 1 Lazy quantifier

{n,m}+ where n >= 0 and m >= n Possessive quantifier

{n,}+ where n >= 0 Possessive quantifier

^ String anchor
^ Line anchor

$ String anchor

$ Line anchor

\a Character escape

\A String anchor

\A Attempt anchor

\b Backspace character

\b Word boundary

\B Backslash character

\c XML shorthand

\ca through \cz Control character escape

\cA through \cZ Control character escape

\C XML shorthand

\B Word boundary

\d Digits shorthand

\D Non-digits shorthand

\e Escape character
\f Form feed character

\g{name} Named backreference

\g-1, \g-2, etc. Relative backreference

\g{-1}, \g{-2}, etc. Relative backreference

\g1 through \g99 Backreference

\g{1} through \g{99} Backreference

\g<name> where "name" is the name of a capturing group Named subroutine call

\g<name> where "name" is the name of a capturing group Named backreference

\g'name' where "name" is the name of a capturing group Named subroutine call

\g'name' where "name" is the name of a capturing group Named backreference

\g<0> Recursion

\g'0' Recursion

\g<1> where 1 is the number of a capturing group Subroutine call

\g<1> where 1 is the number of a capturing group Backreference

\g'1' where 1 is the number of a capturing group Subroutine call


\g'1' where 1 is the number of a capturing group Backreference

\g<-1> where -1 is a negative integer Relative subroutine call

\g<-1> where -1 is a negative integer Relative backreference

\g'-1' where -1 is a negative integer Relative subroutine call

\g'-1' where -1 is a negative integer Relative backreference

\g<+1> where +1 is a positive integer Forward subroutine call

\g'+1' where +1 is a positive integer Forward subroutine call

\G Attempt anchor

\G Match anchor

\h Hexadecimal digit shorthand

\h Horizontal whitespace shorthand

\H Non-hexadecimal digit shorthand

\H Non-horizontal whitespace shorthand

\i XML shorthand
\I XML shorthand

\k<name> Named backreference

\k'name' Named backreference

\k{name} Named backreference

\k<1> through \k<99> Backreference

\k'1' through \k'99' Backreference

\k<-1>, \k<-2>, etc. Relative backreference

\k'-1', \k'-2', etc. Relative backreference

\K Keep text out of the regex match

\l Lowercase shorthand

\L Non-lowercase shorthand

\m Tcl word boundary

\M Tcl word boundary

\n Line feed character

\N Not a line break


Literal CRLF, LF, or CR line break Line break

\o{7777} where 7777 is any octal number Octal escape

\pL where L is a Unicode category Unicode category

\PL where L is a Unicode category Unicode category

\p{L} where L is a Unicode category Unicode category

\p{IsL} where L is a Unicode category Unicode category

\p{Category} Unicode category

\p{IsCategory} Unicode category

\p{Script} Unicode script

\p{IsScript} Unicode script

\p{Block} Unicode block

\p{InBlock} Unicode block

\p{IsBlock} Unicode block

\P{Property} Negated Unicode property

\p{^Property} Negated Unicode property

\P{^Property} Unicode property

\Q…\E Escape sequence


\r Carriage return character

\R Line break

\s Whitespace shorthand

\S Non-whitespace shorthand

\t Tab character

\u Uppercase shorthand

\uFFFF where FFFF are 4 hexadecimal digits Unicode code point

\u{FFFF} where FFFF are 1 to 4 hexadecimal digits Unicode code point

\U Non-uppercase shorthand

\v Vertical tab character

\v Vertical whitespace shorthand

\V Non-vertical whitespace shorthand

\w Word character shorthand

\W Non-word character shorthand

\xFF where FF are 2 hexadecimal digits Hexadecimal escape


\xFFFF where FFFF are 4 hexadecimal digits Unicode code point

\x{FFFF} where FFFF are 1 to 4 hexadecimal digits Unicode code point

\X Unicode grapheme

\y Tcl word boundary

\Y Tcl word boundary

\Z String anchor

\z String anchor

\0 NULL escape

\1 through \7 Octal escape

\1 through \9 Backreference

\10 through \77 Octal escape

\10 through \99 Backreference

\100 through \377 Octal escape

\01 through \0377 Octal escape

\` String anchor

\` Attempt anchor

\' String anchor

\< GNU word boundary


\> GNU word boundary

[[:<:]] POSIX word boundary

[[:>:]] POSIX word boundary

(regex) Capturing group

\(regex\) Capturing group

(?:regex) Non-capturing group

(?<name>regex) Named capturing group

(?'name'regex) Named capturing group

(?#comment) Comment

(?|regex) Branch reset group

(?>regex) Atomic group

(?=regex) Positive lookahead

(?!regex) Negative lookahead

(?<=regex) Positive lookbehind

(?<!regex) Negative lookbehind

(?(?=regex)then|else) where (?=regex) is any valid lookaround and then and else are any valid regexes Lookaround conditional

(?(regex)then|else) where regex, then, and else are any valid regexes and regex is not the name of a capturing group Implicit lookahead conditional

(?(name)then|else) where name is the name of a capturing group and then and else are any valid regexes Named conditional

(?(<name>)then|else) where name is the name of a capturing group and then and else are any valid regexes Named conditional

(?('name')then|else) where name is the name of a capturing group and then and else are any valid regexes Named conditional

(?(1)then|else) where 1 is the number of a capturing group and then and else are any valid regexes Conditional

(?(-1)then|else) where -1 is a negative integer and then and else are any valid regexes Relative conditional

(?(+1)then|else) where +1 is a positive integer and then and else are any valid regexes Forward conditional

(?(+1)then|else) where 1 is the number of a capturing group and then and else are any valid regexes Conditional

(?<capture-subtract>regex) where "capture" and "subtract" are group names and "regex" is any regex Balancing group

(?'capture-subtract'regex) where "capture" and "subtract" are group names and "regex" is any regex Balancing group

(?&name) where "name" is the name of a capturing group Named subroutine call

(?(DEFINE)regex) where "regex" is any regex Subroutine definitions

(?P<name>regex) Named capturing group

(?P=name) Named backreference

(?P=1) through (?P=99) Backreference

(?P>name) where "name" is the name of a capturing group Named subroutine call

(?R) Recursion

(?0) Recursion

(?1) where 1 is the number of a capturing group Subroutine call

(?-1) where -1 is a negative integer Relative subroutine call

(?+1) where +1 is a positive integer Forward subroutine call

Character Class Syntax Feature

Any character except ^-]\ Literal character

\ (backslash) followed by any of ^-]\ Backslash escapes a metacharacter

\ Literal backslash

- between two tokens that each specify a single character Range

^ immediately after the opening [ Negated character class

[ Literal opening bracket

[ Nested character class


[base-[subtract]] Character class subtraction

[base&&[intersect]] Character class intersection

[base&&intersect] Character class intersection

[:alpha:] POSIX class

[:^alpha:] Negated POSIX class

\p{Alpha} POSIX class

\p{IsAlpha} POSIX class

[.span-ll.] POSIX collation sequence

[=x=] POSIX character equivalence

The Dot Matches (Almost) Any Character


In regular expressions, the dot or period is one of the most commonly used metacharacters.
Unfortunately, it is also the most commonly misused metacharacter.

The dot matches a single character, without caring what that character is. The only exceptions are
line break characters. In all regex flavors discussed in this tutorial, the dot does not match line
breaks by default.

Line Break Characters


All flavors treat the newline \n as a line break.

\N Never Matches Line Breaks

Use The Dot Sparingly


The dot is a very powerful regex metacharacter. It allows you to be lazy. Put in a dot, and everything
matches just fine when you test the regex on valid data. The problem is that the regex also matches
in cases where it should not match. If you are new to regular expressions, some of these cases
may not be so obvious at first.

Let's illustrate this with a simple example. Say we want to match a date in mm/dd/yy format, but
we want to leave the user the choice of date separators. The quick solution is \d\d.\d\d.\d\d.
Seems fine at first. It matches a date like 02/12/03 just fine. Trouble is: 02512703 is also
considered a valid date by this regular expression. In this match, the first dot matched 5, and the
second matched 7. Obviously not what we intended.

\d\d[- /.]\d\d[- /.]\d\d is a better solution. This regex allows a dash, space, dot and
forward slash as date separators. Remember that the dot is not a metacharacter inside a character
class, so we do not need to escape it with a backslash.

This regex is still far from perfect. It matches 99/99/99 as a valid date.
[01]\d[- /.][0-3]\d[- /.]\d\d is a step ahead, though it still matches 19/39/99. How perfect you want your
regex to be depends on what you want to do with it. If you are validating user input, it has to be
perfect. If you are parsing data files from a known source that generates its files in the same way
every time, our last attempt is probably more than sufficient to parse the data without errors. You
can find a better regex to match dates in the example section.
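The three date regexes above can be compared side by side. The following is a minimal sketch, assuming Python's re module; the helper name accepts and the pattern variable names are hypothetical and exist only for this illustration.

```python
import re

# The three attempts from the text, from loosest to strictest.
dot_pattern = re.compile(r"\d\d.\d\d.\d\d")
class_pattern = re.compile(r"\d\d[- /.]\d\d[- /.]\d\d")
stricter_pattern = re.compile(r"[01]\d[- /.][0-3]\d[- /.]\d\d")


def accepts(pattern, text):
    """Return True if the whole text matches the pattern (hypothetical helper)."""
    return pattern.fullmatch(text) is not None


for sample in ["02/12/03", "02512703", "99/99/99", "19/39/99"]:
    print(
        sample,
        accepts(dot_pattern, sample),
        accepts(class_pattern, sample),
        accepts(stricter_pattern, sample),
    )
# 02/12/03 True True True
# 02512703 True False False
# 99/99/99 True True False
# 19/39/99 True True True
```

Each tightening rejects one more class of bad input, but even the strictest pattern still accepts 19/39/99, as noted above.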

Use Negated Character Classes Instead of the Dot


Suppose you want to match a double-quoted string. Sounds easy. We can have any number of
any character between the double quotes, so ".*" seems to do the trick just fine. The dot matches
any character, and the star allows the dot to be repeated any number of times, including zero. If
you test this regex on Put a "string" between double quotes, it matches "string" just
fine. Now go ahead and test it on Houston, we have a problem with "string one" and
"string two". Please respond.

Ouch. The regex matches "string one" and "string two". Definitely not what we intended.
The reason for this is that the star is greedy.

In the date-matching example, we improved our regex by replacing the dot with a character class.
Here, we do the same with a negated character class. Our original definition of a double-quoted
string was faulty. We do not want any number of any character between the quotes. We want any
number of characters that are not double quotes or newlines between the quotes. So the proper
regex is "[^"\r\n]*".
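To see the difference on the Houston test string, here is a small sketch, assuming Python's re module. The variable names are illustrative.

```python
import re

text = 'Houston, we have a problem with "string one" and "string two". Please respond.'

# Greedy dot-star runs from the first quote to the last quote on the line.
greedy = re.findall(r'".*"', text)

# The negated character class cannot run past a closing quote or a line break.
negated = re.findall(r'"[^"\r\n]*"', text)

print(greedy)   # ['"string one" and "string two"']
print(negated)  # ['"string one"', '"string two"']
```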

Character Classes or Character Sets

Alternation with The Vertical Bar or Pipe Symbol


If you want to search for the literal text cat or dog, separate both options with a vertical bar or pipe
symbol: cat|dog. If you want more options, simply expand the list: cat|dog|mouse|fish.

The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex
engine to match either everything to the left of the vertical bar, or everything to the right of the
vertical bar. If you want to limit the reach of the alternation, you need to use parentheses for
grouping. If we want to improve the first example to match whole words only, we would need to
use \b(cat|dog)\b. This tells the regex engine to find a word boundary, then either cat or dog,
and then another word boundary. If we had omitted the parentheses then the regex engine would
have searched for a word boundary followed by cat, or for dog followed by a word boundary.
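The effect of the parentheses shows up on words that merely contain cat or dog. This is a minimal sketch, assuming Python's re module; the test words category and bulldog are chosen only for illustration.

```python
import re

with_group = re.compile(r"\b(cat|dog)\b")   # whole words cat or dog
without_group = re.compile(r"\bcat|dog\b")  # \bcat, or dog\b

# With grouping, both word boundaries apply to both alternatives.
grouped_whole = [m.group() for m in with_group.finditer("a cat and a dog")]
grouped_category = with_group.search("category")
grouped_bulldog = with_group.search("bulldog")

# Without grouping, each alternative keeps only one of the boundaries.
ungrouped_category = without_group.search("category")
ungrouped_bulldog = without_group.search("bulldog")

print(grouped_whole)               # ['cat', 'dog']
print(grouped_category)            # None
print(grouped_bulldog)             # None
print(ungrouped_category.group())  # cat
print(ungrouped_bulldog.group())   # dog
```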

Optional Items
The question mark makes the preceding token in the regular expression
optional. colou?r matches both colour and color. The question mark is called a quantifier.

You can make several tokens optional by grouping them together using parentheses, and placing
the question mark after the closing parenthesis. E.g.: Nov(ember)? matches Nov and November.

You can write a regular expression that matches many alternatives by including more than one
question mark. Feb(ruary)? 23(rd)? matches February 23rd, February 23, Feb
23rd and Feb 23.

You can also use curly braces to make something optional. colou{0,1}r is the same
as colou?r. POSIX BRE and GNU BRE do not support either syntax. These flavors require
backslashes to give curly braces their special meaning: colou\{0,1\}r.
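The optional-item patterns in this section can be tried out as follows. This is a sketch assuming Python's re module, which uses unescaped ? and {0,1}; the variable names are illustrative.

```python
import re

colour = re.compile(r"colou?r")
month = re.compile(r"Nov(ember)?")
date = re.compile(r"Feb(ruary)? 23(rd)?")
braces = re.compile(r"colou{0,1}r")  # same meaning as colou?r

# Which spellings each pattern accepts as a complete match.
colour_ok = [w for w in ["colour", "color", "colouur"] if colour.fullmatch(w)]
month_ok = [w for w in ["Nov", "November", "Novem"] if month.fullmatch(w)]
date_ok = [
    w
    for w in ["February 23rd", "February 23", "Feb 23rd", "Feb 23"]
    if date.fullmatch(w)
]
braces_ok = [w for w in ["colour", "color", "colouur"] if braces.fullmatch(w)]

print(colour_ok)  # ['colour', 'color']
print(month_ok)   # ['Nov', 'November']
print(date_ok)    # ['February 23rd', 'February 23', 'Feb 23rd', 'Feb 23']
print(braces_ok)  # ['colour', 'color']
```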

Important Regex Concept: Greediness


The question mark is the first metacharacter introduced by this tutorial that is greedy. The question
mark gives the regex engine two choices: try to match the part the question mark applies to, or do
not try to match it. The engine always tries to match that part. Only if this causes the entire regular
expression to fail, will the engine try ignoring the part the question mark applies to.

The effect is that if you apply the regex Feb 23(rd)? to the string Today is Feb 23rd, 2003,
the match is always Feb 23rd and not Feb 23. You can make the question mark lazy (i.e. turn off
the greediness) by putting a second question mark after the first.
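The greedy and lazy forms of the question mark can be compared directly. A minimal sketch, assuming Python's re module, which supports the ?? syntax:

```python
import re

text = "Today is Feb 23rd, 2003"

# Greedy: the engine first tries to match the optional (rd).
greedy = re.search(r"Feb 23(rd)?", text).group()

# Lazy: the engine first tries to skip the optional (rd); because nothing
# follows it in the regex, the match succeeds without it.
lazy = re.search(r"Feb 23(rd)??", text).group()

print(greedy)  # Feb 23rd
print(lazy)    # Feb 23
```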

Looking Inside The Regex Engine


Let's apply the regular expression colou?r to the string The colonel likes the color green.
The first token in the regex is the literal c. The first position where it matches successfully is
the c in colonel. The engine continues, and finds that o matches o, l matches l and
another o matches o. Then the engine checks whether u matches n. This fails. However, the
question mark tells the regex engine that failing to match u is acceptable. Therefore, the engine
skips ahead to the next regex token: r. But this fails to match n as well. Now, the engine can only
conclude that the entire regular expression cannot be matched starting at the c in colonel.
Therefore, the engine starts again trying to match c to the first o in colonel.

After a series of failures, c matches the c in color, and o, l and o match the following characters.
Now the engine checks whether u matches r. This fails. Again: no problem. The question mark
allows the engine to continue with r. This matches r and the engine reports that the regex
successfully matched color in our string.
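The outcome of this walkthrough can be confirmed with a short sketch, assuming Python's re module. The second check forces an attempt at the c in colonel by passing a start position to the compiled pattern's match method.

```python
import re

text = "The colonel likes the color green."
pattern = re.compile(r"colou?r")

# The overall search skips past colonel and reports color.
found = pattern.search(text)

# An attempt anchored at the c in colonel fails, as described above.
colonel_start = text.index("colonel")
attempt_at_colonel = pattern.match(text, colonel_start)

print(found.group(), found.span())  # color (22, 27)
print(attempt_at_colonel)           # None
```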

Using Atomic Grouping Instead of Possessive Quantifiers


Technically, possessive quantifiers are a notational convenience to place an atomic group around
a single quantifier. All regex flavors that support possessive quantifiers also support atomic
grouping. But not all regex flavors that support atomic grouping support possessive quantifiers.
With those flavors, you can achieve the exact same results using an atomic group.

Basically, instead of X*+, write (?>X*). It is important to notice that both the quantified token X
and the quantifier are inside the atomic group. Even if X is a group, you still need to put an extra
atomic group around it to achieve the same effect. (?:a|b)*+ is equivalent to (?>(?:a|b)*) but
not to (?>a|b)*. The latter is a valid regular expression, but it won't have the same effect when
used as part of a larger regular expression.

To illustrate, (?:a|b)*+b and (?>(?:a|b)*)b both fail to match b. a|b matches the b. The star
is satisfied, and the fact that it's possessive or the atomic group will cause the star to forget all its
backtracking positions. The second b in the regex has nothing left to match, and the overall match
attempt fails.

In the regex (?>a|b)*b, the atomic group forces the alternation to give up its backtracking
positions. This means that if an a is matched, it won't come back to try b if the rest of the regex
fails. Since the star is outside of the group, it is a normal, greedy star. When the second b fails, the
greedy star backtracks to zero iterations. Then, the second b matches the b in the subject string.
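The three behaviours can be reproduced even in an engine without atomic groups. The sketch below assumes Python's re module and deliberately does not use the (?>...) syntax, which Python only added in version 3.11. Instead it swaps in a well-known emulation: a lookahead captures what the group would match, and a backreference then consumes it, so no backtracking positions remain inside. This emulation is an assumption of the example, not something the notes above prescribe.

```python
import re

subject = "b"

# Ordinary greedy star: after the final b fails, the star backtracks to
# zero iterations and the final b then matches the subject.
plain = re.search(r"(?:a|b)*b", subject)

# Emulation of (?>(?:a|b)*)b: the lookahead captures everything the star
# matches, and \1 consumes it as one unit that cannot be given back.
emulated_atomic = re.search(r"(?=((?:a|b)*))\1b", subject)

# Emulation of (?>a|b)*b: only the alternation is atomic; the star outside
# it is still a normal greedy star that can give up iterations.
emulated_inner = re.search(r"(?:(?=(a|b))\1)*b", subject)

print(plain.group())           # b
print(emulated_atomic)         # None
print(emulated_inner.group())  # b
```

The extra capturing group that the emulation introduces shifts the numbering of any later groups, which is one reason to prefer native atomic groups where the flavor offers them.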

Start of String and End of String Anchors


Anchors are a different breed. They do not match any character at all. Instead, they match a
position before, after, or between characters. They can be used to "anchor" the regex match at a
certain position. The caret ^ matches the position before the first character in the string.
Applying ^a to abc matches a. ^b does not match abc at all, because the b cannot be matched
right after the start of the string, matched by ^. See below for the inside view of the regex engine.
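The two anchor examples can be checked quickly. A minimal sketch, assuming Python's re module, where ^ matches only at the start of the string unless multiline mode is switched on:

```python
import re

# ^a: the a sits right after the start of the string, so this matches.
caret_a = re.search(r"^a", "abc")

# ^b: the b is not at the start of the string, so there is no match.
caret_b = re.search(r"^b", "abc")

# Without the anchor, b is found at index 1.
plain_b = re.search(r"b", "abc")

print(caret_a.group(), caret_a.start())  # a 0
print(caret_b)                           # None
print(plain_b.start())                   # 1
```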
