Regular Expression Notes
Literal Characters
Note that regex engines are case sensitive by default. cat does not match Cat, unless you tell the
regex engine to ignore differences in case.
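As a minimal sketch of the point above, assuming Python's standard re module as the regex engine (the original does not name a flavor), the following compares a default search with one that ignores case:

```python
import re

# By default matching is case sensitive: "cat" does not match "Cat".
default_match = re.search(r"cat", "Cat")

# re.IGNORECASE tells the engine to ignore differences in case.
ignore_case_match = re.search(r"cat", "Cat", re.IGNORECASE)

print(default_match)              # None
print(ignore_case_match.group())  # Cat
```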
Special Characters
There are 12 characters with special meanings: the backslash \, the caret ^, the dollar sign $, the
period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus
sign +, the opening parenthesis (, the closing parenthesis ), the opening square bracket [, and the
opening curly brace {. These special characters are often called "metacharacters". Most of them
are errors when used alone.
If you want to use any of these characters as a literal in a regex, you need to escape them with a
backslash. If you want to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign has a
special meaning.
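The difference between the escaped and the unescaped plus sign can be checked with a short sketch. It assumes Python's re module; the variable names are illustrative only. re.escape is a standard helper that escapes metacharacters for you.

```python
import re

text = "1+1=2"

# Unescaped: "+" is a quantifier, so the pattern reads
# "one or more 1s, followed by 1=2". It does not match the text.
unescaped = re.search(r"1+1=2", text)

# Escaped: "\+" matches a literal plus sign.
escaped = re.search(r"1\+1=2", text)

# The unescaped pattern matches a different string instead.
surprise = re.search(r"1+1=2", "111=2")

# re.escape builds a pattern that matches the given text literally.
literal = re.fullmatch(re.escape(text), text)

print(unescaped)         # None
print(escaped.group())   # 1+1=2
print(surprise.group())  # 111=2
```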
Most regular expression flavors treat the brace { as a literal character, unless it is part of a
repetition operator like a{1,3}. So you generally do not need to escape it with a backslash, though
you can do so if you want. An exception to this rule is Java, which requires all literal braces to
be escaped.
All other characters should not be escaped with a backslash. That is because the backslash is also
a special character. The backslash in combination with a literal character can create a regex token
with a special meaning. E.g. \d is a shorthand that matches a single digit from 0 to 9.
Syntax                  Feature
\ followed by any of [\^$.|?*+(){}  Backslash escapes a metacharacter
.                       Any character
|                       Alternation
\|                      Alternation
?                       Greedy quantifier
\?                      Greedy quantifier
??                      Lazy quantifier
?+                      Possessive quantifier
*                       Greedy quantifier
*?                      Lazy quantifier
*+                      Possessive quantifier
+                       Greedy quantifier
\+                      Greedy quantifier
+?                      Lazy quantifier
++                      Possessive quantifier
^                       String anchor
^                       Line anchor
$                       String anchor
$                       Line anchor
\a                      Character escape
\A                      String anchor
\A                      Attempt anchor
\b                      Backspace character
\b                      Word boundary
\B                      Backslash character
\c                      XML shorthand
\ca through \cz         Control character escape
\cA through \cZ         Control character escape
\C                      XML shorthand
\B                      Word boundary
\d                      Digits shorthand
\D                      Non-digits shorthand
\e                      Escape character
\f                      Form feed character
\g{name}                Named backreference
\g-1, \g-2, etc.        Relative backreference
\g{-1}, \g{-2}, etc.    Relative backreference
\g<name> where "name" is the name of a capturing group  Named subroutine call
\g<name> where "name" is the name of a capturing group  Named backreference
\g'name' where "name" is the name of a capturing group  Named subroutine call
\g'name' where "name" is the name of a capturing group  Named backreference
\g<0>                   Recursion
\g'0'                   Recursion
\g<-1> where -1 is a negative integer  Relative subroutine call
\g<-1> where -1 is a negative integer  Relative backreference
\g'-1' where -1 is a negative integer  Relative subroutine call
\g'-1' where -1 is a negative integer  Relative backreference
\g<+1> where +1 is a positive integer  Forward subroutine call
\g'+1' where +1 is a positive integer  Forward subroutine call
\G                      Attempt anchor
\G                      Match anchor
\h                      Hexadecimal digit shorthand
\h                      Horizontal whitespace shorthand
\H                      Non-hexadecimal digit shorthand
\H                      Non-horizontal whitespace shorthand
\i                      XML shorthand
\I                      XML shorthand
\k<name>                Named backreference
\k'name' through \k'99'  Named backreference
\k{name}                Named backreference
\k<-1>, \k<-2>, etc.    Relative backreference
\k'-1', \k'-2', etc.    Relative backreference
\l                      Lowercase shorthand
\L                      Non-lowercase shorthand
\P{Property}            Negated Unicode property
\p{^Property}           Negated Unicode property
\R                      Line break
\s                      Whitespace shorthand
\S                      Non-whitespace shorthand
\t                      Tab character
\u                      Uppercase shorthand
\U                      Non-uppercase shorthand
\v                      Vertical whitespace shorthand
\V                      Non-vertical whitespace shorthand
\w                      Word character shorthand
\W                      Non-word character shorthand
\X                      Unicode grapheme
\Z                      String anchor
\z                      String anchor
\0                      NULL escape
\1 through \9           Backreference
\`                      String anchor
\`                      Attempt anchor
[[:<:]]                 POSIX word boundary
[[:>:]]                 POSIX word boundary
(?<name>regex)          Named capturing group
(?'name'regex)          Named capturing group
(?#comment)             Comment
(?&name) where "name" is the name of a capturing group  Named subroutine call
(?P<name>regex)         Named capturing group
(?P=name)               Named backreference
(?P>name) where "name" is the name of a capturing group  Named subroutine call
(?R)                    Recursion
(?0)                    Recursion
(?-1) where -1 is a negative integer  Relative subroutine call
(?+1) where +1 is a positive integer  Forward subroutine call
\                       Literal backslash
The dot matches a single character, without caring what that character is. The only exceptions are
line break characters. In all regex flavors discussed in this tutorial, the dot does not match line
breaks by default.
Let's illustrate this with a simple example. Say we want to match a date in mm/dd/yy format, but
we want to leave the user the choice of date separators. The quick solution is \d\d.\d\d.\d\d.
Seems fine at first. It matches a date like 02/12/03 just fine. Trouble is: 02512703 is also
considered a valid date by this regular expression. In this match, the first dot matched 5, and the
second matched 7. Obviously not what we intended.
\d\d[- /.]\d\d[- /.]\d\d is a better solution. This regex allows a dash, space, dot and
forward slash as date separators. Remember that the dot is not a metacharacter inside a character
class, so we do not need to escape it with a backslash.
This regex is still far from perfect. It matches 99/99/99 as a valid date.
[01]\d[- /.][0-3]\d[- /.]\d\d is a step ahead, though it still matches 19/39/99. How perfect you want your
regex to be depends on what you want to do with it. If you are validating user input, it has to be
perfect. If you are parsing data files from a known source that generates its files in the same way
every time, our last attempt is probably more than sufficient to parse the data without errors. You
can find a better regex to match dates in the example section.
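The three date regexes discussed above can be compared side by side. This is a small sketch assuming Python's re module; the names loose, separators, ranged and accepts are illustrative and not part of the original text. Each sample is tested as a whole string with fullmatch.

```python
import re

loose = re.compile(r"\d\d.\d\d.\d\d")                          # dot as separator
separators = re.compile(r"\d\d[- /.]\d\d[- /.]\d\d")           # character class
ranged = re.compile(r"[01]\d[- /.][0-3]\d[- /.]\d\d")          # restricted first digits

samples = ["02/12/03", "02512703", "99/99/99", "19/39/99"]

def accepts(pattern, text):
    """Return True if the whole text matches the compiled pattern."""
    return pattern.fullmatch(text) is not None

# Map each sample to (loose, separators, ranged) results.
results = {
    s: (accepts(loose, s), accepts(separators, s), accepts(ranged, s))
    for s in samples
}

for sample, outcome in results.items():
    print(sample, outcome)
# 02/12/03 (True, True, True)
# 02512703 (True, False, False)
# 99/99/99 (True, True, False)
# 19/39/99 (True, True, True)
```

The last row shows why the third regex is only a step ahead: 19/39/99 still passes all three patterns.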
Suppose you want to match a double-quoted string. Sounds easy. We can have any number of
any character between the double quotes, so ".*" seems to do the trick just fine. The dot matches
any character, and the star allows the dot to be repeated any number of times, including zero. If
you test this regex on Put a "string" between double quotes, it matches "string" just
fine. Now go ahead and test it on Houston, we have a problem with "string one" and
"string two". Please respond.
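Ouch. The regex matches "string one" and "string two" as a single match, running from the first
double quote to the last. Definitely not what we intended. The reason for this is that the star is
greedy: it repeats the dot as many times as it can, and gives characters back only until a closing
quote is found.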
In the date-matching example, we improved our regex by replacing the dot with a character class.
Here, we do the same with a negated character class. Our original definition of a double-quoted
string was faulty. We do not want any number of any character between the quotes. We want any
number of characters that are not double quotes or newlines between the quotes. So the proper
regex is "[^"\r\n]*".
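To see the greedy star and the negated character class side by side, here is a short sketch that assumes Python's re module and uses the test sentence from the text:

```python
import re

text = 'Houston, we have a problem with "string one" and "string two". Please respond.'

# Greedy dot-star: runs from the first double quote to the last one.
greedy = re.search(r'".*"', text).group()

# Negated character class: each match stops at the next double quote.
strings = re.findall(r'"[^"\r\n]*"', text)

print(greedy)   # "string one" and "string two"
print(strings)  # ['"string one"', '"string two"']
```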
The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex
engine to match either everything to the left of the vertical bar, or everything to the right of the
vertical bar. If you want to limit the reach of the alternation, you need to use parentheses for
grouping. If we want to improve the first example to match whole words only, we would need to
use \b(cat|dog)\b. This tells the regex engine to find a word boundary, then either cat or dog,
and then another word boundary. If we had omitted the parentheses then the regex engine would
have searched for a word boundary followed by cat, or dog followed by a word boundary.
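The effect of the parentheses can be checked with a small sketch, assuming Python's re module. The test strings are made up for illustration.

```python
import re

grouped = re.compile(r"\b(cat|dog)\b")   # boundary, cat or dog, boundary
ungrouped = re.compile(r"\bcat|dog\b")   # (boundary + cat) or (dog + boundary)

# Whole words are found by the grouped regex.
whole_word = grouped.search("a cat sat").group()

# Neither "category" nor "hotdog" contains cat or dog as a whole word.
grouped_partial = grouped.search("category hotdog")

# Without parentheses, each alternative carries only one of the boundaries.
ungrouped_partial = ungrouped.findall("category hotdog")

print(whole_word)         # cat
print(grouped_partial)    # None
print(ungrouped_partial)  # ['cat', 'dog']
```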
Optional Items
The question mark makes the preceding token in the regular expression
optional. colou?r matches both colour and color. The question mark is called a quantifier.
You can make several tokens optional by grouping them together using parentheses, and placing
the question mark after the closing parenthesis. E.g.: Nov(ember)? matches Nov and November.
You can write a regular expression that matches many alternatives by including more than one
question mark. Feb(ruary)? 23(rd)? matches February 23rd, February 23, Feb
23rd and Feb 23.
You can also use curly braces to make something optional. colou{0,1}r is the same
as colou?r. POSIX BRE and GNU BRE do not support either syntax. These flavors require
backslashes to give curly braces their special meaning: colou\{0,1\}r.
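The optional-item examples above can be tried out with a brief sketch. It assumes Python's re module, which uses the unescaped ? and {0,1} syntax; the helper matches_fully and the extra non-matching test words are illustrative additions.

```python
import re

def matches_fully(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

words = ("color", "colour", "colouur")
spellings = [w for w in words if matches_fully(r"colou?r", w)]
braces = [w for w in words if matches_fully(r"colou{0,1}r", w)]

months = [w for w in ("Nov", "November", "Novem")
          if matches_fully(r"Nov(ember)?", w)]

candidates = ("February 23rd", "February 23", "Feb 23rd", "Feb 23", "Febr 23")
dates = [d for d in candidates if matches_fully(r"Feb(ruary)? 23(rd)?", d)]

print(spellings)  # ['color', 'colour']
print(braces)     # ['color', 'colour']
print(months)     # ['Nov', 'November']
print(dates)      # ['February 23rd', 'February 23', 'Feb 23rd', 'Feb 23']
```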
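The question mark is greedy: the engine tries to match the optional item first, and only leaves it
out when including it does not lead to a match. The effect is that if you apply the regex
Feb 23(rd)? to the string Today is Feb 23rd, 2003, the match is always Feb 23rd and not Feb 23.
You can make the question mark lazy (i.e. turn off the greediness) by putting a second question
mark after the first.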
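A quick sketch of the greedy and the lazy question mark, assuming Python's re module and the example string from the text:

```python
import re

text = "Today is Feb 23rd, 2003"

# Greedy: the optional group is matched whenever it can be.
greedy = re.search(r"Feb 23(rd)?", text).group()

# Lazy: the optional group is skipped unless the rest of the regex needs it.
lazy = re.search(r"Feb 23(rd)??", text).group()

print(greedy)  # Feb 23rd
print(lazy)    # Feb 23
```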
When colou?r is applied to a string containing color, the engine proceeds as follows. After a
series of failures, c matches the c in color, and o, l and o match the following characters.
Now the engine checks whether u matches r. This fails. Again: no problem. The question mark
allows the engine to continue with r. This matches r and the engine reports that the regex
successfully matched color in our string.
Basically, instead of X*+, write (?>X*). It is important to notice that both the quantified token X
and the quantifier are inside the atomic group. Even if X is a group, you still need to put an extra
atomic group around it to achieve the same effect. (?:a|b)*+ is equivalent to (?>(?:a|b)*) but
not to (?>a|b)*. The latter is a valid regular expression, but it won't have the same effect when
used as part of a larger regular expression.
To illustrate, (?:a|b)*+b and (?>(?:a|b)*)b both fail to match b. a|b matches the b. The star
is satisfied, and because it is possessive or enclosed in the atomic group, it forgets all its
backtracking positions. The second b in the regex has nothing left to match, and the overall match
attempt fails.
In the regex (?>a|b)*b, the atomic group forces the alternation to give up its backtracking
positions. This means that if an a is matched, it won't come back to try b if the rest of the regex
fails. Since the star is outside of the group, it is a normal, greedy star. When the second b fails, the
greedy star backtracks to zero iterations. Then, the second b matches the b in the subject string.
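The behavior of the regexes discussed above can be reproduced with a hedged sketch in Python. Python's re module only supports (?>...) and *+ natively from version 3.11, so this sketch does not use them. Instead it swaps in a well-known emulation: a lookahead never gives back what it matched, so (?=(X))\1 behaves like (?>X). The extra capturing group and the backreference are part of the emulation, not of the original regexes, and the variable names are illustrative.

```python
import re

# Ordinary greedy star: backtracks to zero iterations, then b matches.
normal = re.compile(r"(?:a|b)*b")

# Stands in for (?>(?:a|b)*)b, which is equivalent to (?:a|b)*+b.
# The whole repetition sits inside the lookahead, so it cannot backtrack.
atomic_around_star = re.compile(r"(?=((?:a|b)*))\1b")

# Stands in for (?>a|b)*b. Only the alternation is atomic;
# the star outside it is a normal, greedy star.
atomic_inside_star = re.compile(r"(?:(?=(a|b))\1)*b")

normal_result = normal.fullmatch("b")
around_result = atomic_around_star.search("b")
inside_result = atomic_inside_star.fullmatch("b")

print(normal_result is not None)  # True
print(around_result)              # None
print(inside_result is not None)  # True
```

On an engine with native support, the same comparison could be written directly with (?>(?:a|b)*)b and (?>a|b)*b.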