Jan Goyvaerts - All About Regular Expressions-Https - WWW - Regular-Expressions - Info - (2019)
REGULAR EXPRESSIONS
Jan Goyvaerts
Regular Expression Quick Start
This quick start will quickly get you up to speed with regular expressions. Obviously, this brief introduction cannot explain
everything there is to know about regular expressions. For detailed information, consult the regular expression tutorial. Each
topic in the quick start corresponds with a topic in the tutorial, so you can easily go back and forth between the two.
This first example is actually a perfectly valid regex. It is the most basic pattern, simply matching the literal text regex.
Matches are highlighted in blue in this tutorial.
We will use the term "string" to indicate the text that we are applying the regular expression to. We will highlight them in
green.
Literal Characters
The most basic regular expression consists of a single literal character, e.g.: a. It will match the first occurrence of that
character in the string. If the string is Jack is a boy, it will match the a after the J.
This regex can match the second a too. It will only do so when you tell the regex engine to start searching through the string
after the first match. In a text editor, you can do so by using its "Find Next" or "Search Forward" function. In a programming
language, there is usually a separate function that you can call to continue searching through the string after the previous
match.
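As a minimal sketch of "continue searching after the previous match", assuming Python's standard re module (the variable names are only illustrative):

```python
import re

text = "Jack is a boy"
pattern = re.compile(r"a")

# search() reports only the first match: the "a" after the "J".
first = pattern.search(text)
first_pos = first.start()

# Passing a start position continues the search after the previous match.
second = pattern.search(text, first.end())
second_pos = second.start()

# finditer() walks through all matches from left to right.
all_positions = [m.start() for m in pattern.finditer(text)]
print(first_pos, second_pos, all_positions)
```

Other languages offer an equivalent call; only the function names differ.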
Twelve characters have special meanings: the opening square bracket [, the closing square bracket ], the backslash \,
the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *,
the plus sign +, the opening parenthesis ( and the closing parenthesis ). These special characters are often called
"metacharacters".
If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to
match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign will have a special meaning.
You can use a hyphen inside a character class to specify a range of characters. [0-9] matches a single digit between 0 and
9. You can use more than one range. [0-9a-fA-F] matches a single hexadecimal digit, case insensitively. You can combine
ranges and single characters. [0-9a-fxA-FX] matches a hexadecimal digit or the letter X.
Typing a caret after the opening square bracket will negate the character class. The result is that the character class will
match any character that is not in the character class. q[^x] matches qu in question. It does not match Iraq since there is
no character after the q for the negated character class to match.
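The character class examples above can be tried out with a short sketch, here assuming Python's re module; the sample strings are made up for illustration:

```python
import re

# A range-based class: one hexadecimal digit per match.
hex_digits = re.findall(r"[0-9a-fA-F]", "0x1Fg")

# A negated class still has to match one character.
qu = re.search(r"q[^x]", "question")
iraq = re.search(r"q[^x]", "Iraq")

print(hex_digits)   # x and g are not hexadecimal digits
print(qu.group())   # the q plus the character after it
print(iraq)         # None: nothing follows the q
```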
Shorthand Character Classes
\d matches a single character that is a digit, \w matches a "word character" (alphanumeric characters plus underscore), and
\s matches a whitespace character (includes tabs and line breaks). The actual characters matched by the shorthands
depend on the software you're using. Usually, non-English letters and numbers are included.
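How much the shorthands match depends on the engine and its options. The sketch below assumes Python 3's re module, where string patterns are Unicode-aware by default and the re.ASCII flag restricts the shorthands to ASCII; other flavors may behave differently:

```python
import re

digits_text = "3 \u0663"            # ASCII 3 and ARABIC-INDIC DIGIT THREE
words_text = "caf\u00e9 na\u00efve_1"

# Default (Unicode) behavior of \d and \w in Python 3.
unicode_digits = re.findall(r"\d", digits_text)
unicode_words = re.findall(r"\w+", words_text)

# With re.ASCII only ASCII digits, letters, and the underscore qualify.
ascii_digits = re.findall(r"\d", digits_text, re.ASCII)
ascii_words = re.findall(r"\w+", words_text, re.ASCII)

# \s covers spaces, tabs, and line breaks.
fields = re.split(r"\s+", "a\tb\nc d")
print(unicode_digits, ascii_digits, unicode_words, ascii_words, fields)
```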
Non-Printable Characters
You can use special character sequences to put non-printable characters in your regular expression. Use \t to match a tab
character (ASCII 0x09), \r for carriage return (0x0D) and \n for line feed (0x0A). More exotic non-printables are \a (bell,
0x07), \e (escape, 0x1B), \f (form feed, 0x0C) and \v (vertical tab, 0x0B). Remember that Windows text files use \r\n to
terminate lines, while UNIX text files use \n.
Use \xFF to match a specific character by its hexadecimal index in the character set. E.g. \xA9 matches the copyright
symbol in the Latin-1 character set.
If your regular expression engine supports Unicode, use \uFFFF to insert a Unicode character. E.g. \u20AC matches the
euro currency sign.
All non-printable characters can be used directly in the regular expression, or as part of a character class.
gr.y matches gray, grey, gr%y, etc. Use the dot sparingly. Often, a character class or negated character class is faster and
more precise.
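A small comparison of the dot with a character class, sketched with Python's re module (an assumption; any flavor would do). The re.DOTALL flag is Python's name for the option that lets the dot match line breaks:

```python
import re

text = "gray grey gr%y"

with_dot = re.findall(r"gr.y", text)        # the dot accepts any character
with_class = re.findall(r"gr[ae]y", text)   # the class accepts only a or e

# By default the dot does not match a line break.
no_newline = re.search(r"gr.y", "gr\ny")
dotall = re.search(r"gr.y", "gr\ny", re.DOTALL)
print(with_dot, with_class, no_newline, dotall is not None)
```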
Anchors
Anchors do not match any characters. They match a position. ^ matches at the start of the string, and $ matches at the end of
the string. Most regex engines have a "multi-line" mode that makes ^ match after any line break, and $ before any line break.
E.g. ^b matches only the first b in bob.
\b matches at a word boundary. A word boundary is a position between a character that can be matched by \w and a
character that cannot be matched by \w. \b also matches at the start and/or end of the string if the first and/or last characters
in the string are word characters. \B matches at every position where \b cannot match.
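The following sketch illustrates anchors and word boundaries, assuming Python's re module, where multi-line mode is enabled with re.MULTILINE; the sample strings are invented for the example:

```python
import re

# ^ matches only at the start of the string by default.
start_only = [m.start() for m in re.finditer(r"^b", "bob")]

# In multi-line mode, ^ also matches after each line break.
each_line = [m.start() for m in re.finditer(r"^b", "bob\nbill", re.MULTILINE)]

# \b requires a word boundary on both sides of "is".
whole_word = [m.start() for m in re.finditer(r"\bis\b", "This island is")]

# \B matches where \b cannot: the "is" inside "This".
inside_word = [m.start() for m in re.finditer(r"\Bis\b", "This island is")]
print(start_only, each_line, whole_word, inside_word)
```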
Alternation
Alternation is the regular expression equivalent of "or". cat|dog will match cat in About cats and dogs. If the regex is
applied again, it will match dog. You can add as many alternatives as you want, e.g.: cat|dog|mouse|fish.
Alternation has the lowest precedence of all regex operators. cat|dog food matches cat or dog food. To create a regex
that matches cat food or dog food, you need to group the alternatives: (cat|dog) food.
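The precedence of alternation is easy to check in code. This is a minimal sketch assuming Python's re module; fullmatch is used so that the whole test string must match:

```python
import re

# Applying the regex repeatedly finds cat first, then dog.
found = re.findall(r"cat|dog", "About cats and dogs")

# Without grouping, the alternatives are "cat" and "dog food".
ungrouped_cat = re.fullmatch(r"cat|dog food", "cat")
ungrouped_cat_food = re.fullmatch(r"cat|dog food", "cat food")

# With grouping, " food" applies to both alternatives.
grouped_cat_food = re.fullmatch(r"(cat|dog) food", "cat food")
grouped_dog_food = re.fullmatch(r"(cat|dog) food", "dog food")
print(found)
```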
Repetition
The question mark makes the preceding token in the regular expression optional. E.g.: colou?r matches colour or color.
The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to
attempt to match the preceding token once or more. <[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any
attributes. <[A-Za-z0-9]+> is easier to write but matches invalid tags such as <1>.
Use curly braces to specify a specific amount of repetition. Use \b[1-9][0-9]{3}\b to match a number between 1000 and
9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999.
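The quantifier examples above can be sketched as follows, assuming Python's re module; the test strings are chosen only to show the edge cases:

```python
import re

# ? makes the u optional.
spellings = re.findall(r"colou?r", "color colour")

# The stricter tag regex rejects <1>, the shorter one accepts it.
strict_h1 = re.fullmatch(r"<[A-Za-z][A-Za-z0-9]*>", "<h1>")
strict_bad = re.fullmatch(r"<[A-Za-z][A-Za-z0-9]*>", "<1>")
loose_bad = re.fullmatch(r"<[A-Za-z0-9]+>", "<1>")

# Curly braces: exactly 3, or between 2 and 4, further digits.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000")
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")
print(spellings, four_digits, three_to_five)
```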
The regex <.+> will match <EM>first</EM> in: This is a <EM>first</EM> test.
Place a question mark after the quantifier to make it lazy. <.+?> will match <EM> in the above string.
A better solution is to follow my advice to use the dot sparingly. Use <[^<>]+> to quickly match an HTML tag without regard
to attributes. The negated character class is more specific than the dot, which helps the regex engine find matches quickly.
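The difference between greedy, lazy, and negated-class matching can be seen side by side in this sketch, which assumes Python's re module:

```python
import re

text = "This is a <EM>first</EM> test."

greedy = re.search(r"<.+>", text).group()       # expands as far as possible
lazy = re.search(r"<.+?>", text).group()        # stops at the first >
negated = re.search(r"<[^<>]+>", text).group()  # cannot run past a bracket
print(greedy, lazy, negated)
```

The lazy and negated-class versions find the same text here; the negated class simply gives the engine fewer paths to try.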
Parentheses create a capturing group. The regex Set(Value)? has one group. After the match, group number one will contain
nothing if Set was matched or Value if SetValue was matched. How to access the group's contents depends on the
software or programming language you're using. Group zero always contains the entire regex match.
Use the special syntax Set(?:Value)? to group tokens without creating a capturing group. This is more efficient if you don't
plan to use the group's contents. Do not confuse the question mark in the non-capturing group syntax with the quantifier.
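How group contents are accessed varies by language. As one illustration, assuming Python's re module, a group that did not take part in the match is reported as None:

```python
import re

set_only = re.fullmatch(r"Set(Value)?", "Set")
set_value = re.fullmatch(r"Set(Value)?", "SetValue")

whole = set_value.group(0)        # group zero: the entire match
empty_group = set_only.group(1)   # group one did not participate
filled_group = set_value.group(1)

# The non-capturing version matches the same text but stores no group.
capturing_count = re.compile(r"Set(Value)?").groups
non_capturing_count = re.compile(r"Set(?:Value)?").groups
print(whole, empty_group, filled_group, capturing_count, non_capturing_count)
```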
Backreferences
Within the regular expression, you can use the backreference \1 to match the same text that was matched by the capturing
group. ([abc])=\1 matches a=a, b=b, and c=c. It does not match anything else. If your regex has multiple capturing
groups, they are numbered counting their opening parentheses from left to right.
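A short sketch of backreferences and group numbering, assuming Python's re module; the nested-group regex is a made-up example to show the numbering rule:

```python
import re

# \1 must repeat exactly what group one captured.
matches = [m.group() for m in re.finditer(r"([abc])=\1", "a=a b=c c=c")]
mismatch = re.fullmatch(r"([abc])=\1", "a=b")

# Groups are numbered by their opening parentheses, left to right.
numbered = re.fullmatch(r"((a)(b))", "ab").groups()
print(matches, mismatch, numbered)
```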
Unicode Properties
\p{L} matches a single character that has a given Unicode property. L stands for letter. \P{L} matches a single character
that does not have the given Unicode property. You can find a complete list of Unicode properties in the tutorial.
Lookaround
Lookaround is a special kind of group. The tokens inside the group are matched normally, but then the regex engine makes
the group give up its match and keeps only the result. Lookaround matches a position, just like anchors. It does not expand
the regex match.
q(?=u) matches the q in question, but not in Iraq. This is positive lookahead. The u is not part of the overall regex match.
The lookahead matches at each position in the string before a u.
q(?!u) matches q in Iraq but not in question. This is negative lookahead. The tokens inside the lookahead are attempted,
their match is discarded, and the result is inverted.
To look backwards, use lookbehind. (?<=a)b matches the b in abc. This is positive lookbehind. (?<!a)b fails to match abc.
You can use a full-fledged regular expression inside the lookahead. Most regular expression engines only allow literal
characters and alternation inside lookbehind, since they cannot apply regular expressions backwards.
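The lookaround examples above can be checked with this sketch, assuming Python's re module (which, like most flavors, restricts what may appear inside lookbehind):

```python
import re

# Positive lookahead: the u is required but not part of the match.
ahead = re.search(r"q(?=u)", "question")
ahead_span = ahead.span()
ahead_iraq = re.search(r"q(?=u)", "Iraq")

# Negative lookahead inverts the result.
not_ahead = re.search(r"q(?!u)", "Iraq").start()
not_ahead_question = re.search(r"q(?!u)", "question")

# Lookbehind checks the text before the current position.
behind = re.search(r"(?<=a)b", "abc").start()
not_behind = re.search(r"(?<!a)b", "abc")
print(ahead_span, not_ahead, behind)
```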
Free-Spacing Syntax
Many applications have an option that may be labeled "free-spacing" or "ignore whitespace" or "comments" that makes the
regular expression engine ignore unescaped spaces and line breaks and that makes the # character start a comment that
runs until the end of the line. This allows you to use whitespace to format your regular expression in a way that makes it easier
for humans to read and thus makes it easier to maintain.
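In Python's re module, for example, this option is the re.VERBOSE flag. The sketch below rewrites the HTML tag regex from the repetition section in free-spacing form; the name tag is only illustrative:

```python
import re

tag = re.compile(
    r"""
    <               # opening angle bracket
    [A-Za-z]        # first character must be a letter
    [A-Za-z0-9]*    # followed by any number of letters or digits
    >               # closing angle bracket
    """,
    re.VERBOSE,
)

is_tag = tag.fullmatch("<h1>") is not None
is_not_tag = tag.fullmatch("<1>") is None
print(is_tag, is_not_tag)
```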
Regular Expression Tutorial
This tutorial teaches you all you need to know to be able to craft powerful time-saving regular expressions. It starts with the
most basic concepts, so that you can follow this tutorial even if you know nothing at all about regular expressions yet.
The tutorial doesn't stop there. It also explains how a regular expression engine works on the inside, and alerts you to the
consequences. This helps you to quickly understand why a particular regex does not do what you initially expected. It will save
you lots of guesswork and head scratching when you need to write more complex regexes.
This first example is actually a perfectly valid regex. It is the most basic pattern, simply matching the literal text regex. A
"match" is the piece of text, or sequence of bytes or characters, that the pattern was found to correspond to by the regex
processing software. Matches are highlighted in blue on this site.
With the above regular expression pattern, you can search through a text file to find email addresses, or verify if a given string
looks like an email address. In this tutorial, we will use the term "string" to indicate the text that we're applying the regular
expression to. This tutorial will highlight them in green. The term "string" or "character string" is used by programmers to
indicate a sequence of characters. In practice, you can use regular expressions with whatever data you can access using the
application or programming language you are working with.
As usual in the software world, different regular expression engines are not fully compatible with each other. The syntax and
behavior of a particular engine is called a regular expression flavor. This tutorial covers all the popular regular expression
flavors, including Perl, PCRE, PHP, .NET, Java, JavaScript, XRegExp, Python, Ruby, Delphi, R, Tcl, POSIX, and many others.
The tutorial alerts you when these flavors require different syntax or show different behavior. Even if your application is not
explicitly covered by the tutorial, it likely uses a regex flavor that is covered, as most applications are developed using one of
the programming environments or regex libraries just mentioned.
If you are a programmer, your software will run faster since even a simple regex engine will outperform a state-of-the-art plain
text search algorithm searching through the data multiple times. Regular expressions also reduce development time. With a
regex engine, it takes only one line (e.g. in Perl, PHP, Python, Ruby, Java, or .NET) or a couple of lines (e.g. in C using PCRE)
of code to, say, check if the user's input looks like a valid email address.
Regex Tutorial Table of Contents
This regular expressions tutorial teaches you every aspect of regular expressions. Each topic assumes you have read and
understood all previous topics. If you are new to regular expressions, you should read the topics in the order presented.
Introduction
The introduction indicates the scope of the tutorial and which regex flavors will be discussed. It also introduces basic
terminology.
The simplest regex consists of only literal characters. Certain characters have special meanings in a regex and have to be
escaped. Escaping rules may get a bit complicated when using regexes in software source code.
Non-printable Characters
Non-printable characters such as control characters and special spacing or line break characters are easier to enter using
control character escapes or hexadecimal escapes.
First look at the regular expression engine's internals. Later topics will build on this information. Knowing the
engine's internals will greatly help you to craft regexes that match what you intended, and not match what you do not want.
A character class or character set matches a single character out of several possible characters, consisting of individual
characters and/or ranges of characters. A negated character class matches a single character not in the character class.
Shorthand character classes allow you to use common sets of characters quickly. You can use shorthands on their own or as
part of character classes.
Character class subtraction allows you to match one character that is present in one set of characters but not present in
another set of characters.
Character class intersection allows you to match one character that is present in one set of characters and also present in
another set of characters.
The Dot
The dot matches any character, though usually not line break characters unless you change an option.
Anchors
Anchors are zero-width. They do not match any characters, but rather a position. The caret and the dollar sign match at the
start and the end of the string. Depending on your regex flavor and its options, they can match at the start and the end of a
line as well.
Word Boundaries
Word boundaries are like anchors, but match at the start of a word and the end of a word. However, most regex flavors define
the concept of a "word" differently than your English teacher in grade school.
Alternation
By separating different sub-regexes with vertical bars, you can tell the regex engine to attempt them from left to right, and
return success as soon as one of them can be matched.
Optional Items
Putting a question mark after an item tells the regex engine to match the item if possible, but continue anyway (rather than
admit defeat) if it cannot be matched.
Three styles of operators, the star, the plus, and curly braces, allow you to repeat an item zero or more times, once or more,
or an arbitrary number of times. It is important to understand that these quantifiers are "greedy" by default, unless you
explicitly make them "lazy".
Grouping
By placing parentheses around part of the regex, you tell the engine to treat that part as a single item when applying
quantifiers or to group alternatives together. Parentheses also create capturing groups allowing you to reuse the text matched
by part of the regex.
Backreferences
Backreferences to capturing groups match the same text that was previously matched by that capturing group, allowing you to
match patterns of repeated text.
Regular expressions that have multiple groups are much easier to read and maintain if you use named capturing groups and
named backreferences.
Free-Spacing Mode
Splitting a regular expression into multiple lines, adding comments and whitespace, makes it even more readable.
When using alternation to match different variants of the same thing, you can put the alternatives in a branch reset group.
Then all the alternatives share the same capturing groups. This allows you to use backreferences or retrieve part of the
matched text without having to check which of the alternatives captured it.
If your regular expression flavor supports Unicode, then you can use special Unicode regex tokens to match specific Unicode
characters, or to match any character that has a certain Unicode property or is part of a particular Unicode script or block.
Mode Modifiers
Change matching modes such as "case insensitive" for specific parts of the regular expression.
Nested quantifiers can cause an exponentially increasing amount of backtracking that brings the regex engine to a grinding
halt. Atomic grouping and possessive quantifiers provide a solution.
Lookahead and lookbehind (collectively lookaround) are zero-width. With positive lookaround, you can specify multiple
requirements (sub-regexes) to be applied to the same part of the string. With negative lookaround, you can invert the result of
a regex match (i.e. match something that does not match something else).
Keep The Text Matched So Far out of The Overall Regex Match
Keeping the text matched so far out of the overall regex match allows you to find matches that are preceded by certain text,
without having that preceding text included in the overall regex match. This method is primarily of interest with regex flavors
that have no or limited support for lookbehind.
Conditionals
A conditional is a special construct that will first evaluate a lookaround, and then execute one sub-regex if the lookaround
succeeds, and another sub-regex if the lookaround fails.
Recursion
Recursion matches the whole regex again at a particular point inside the regex, which makes it possible to match balanced
constructs.
Subroutine Calls
Subroutine calls allow you to write regular expressions that match the same constructs in multiple places without having to
duplicate parts of your regular expression.
Capturing groups inside recursion and subroutine calls are handled differently by the regex flavors that support them.
Special backreferences match the text stored by a capturing group at a particular recursion level, instead of the text most
recently matched by that capturing group.
The regex flavors that support recursion and subroutine calls backtrack differently after a recursion or subroutine call fails.
If you are using a POSIX-compliant regular expression engine, you can use POSIX bracket expressions to match locale-
dependent characters.
When a regex can find zero-length matches, regex engines use different strategies to avoid getting stuck on a zero-length
match when you want to iterate over all matches in a string. This may lead to different match results.
Forcing a regex match to start at the end of a previous match provides an efficient way to parse text data.
Literal Characters
The most basic regular expression consists of a single literal character, e.g.: a. It will match the first occurrence of that
character in the string. If the string is Jack is a boy, it will match the a after the J. The fact that this a is in the middle of
the word does not matter to the regex engine. If it matters to you, you will need to tell that to the regex engine by using word
boundaries. We will get to that later.
This regex can match the second a too. It will only do so when you tell the regex engine to start searching through the string
after the first match. In a text editor, you can do so by using its "Find Next" or "Search Forward" function. In a programming
language, there is usually a separate function that you can call to continue searching through the string after the previous
match.
Similarly, the regex cat will match cat in About cats and dogs. This regular expression consists of a series of three
literal characters. This is like saying to the regex engine: find a c, immediately followed by an a, immediately followed by a t.
Note that regex engines are case sensitive by default. cat does not match Cat, unless you tell the regex engine to ignore
differences in case.
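A minimal sketch of literal matching and case sensitivity, assuming Python's re module, where case-insensitive matching is requested with re.IGNORECASE:

```python
import re

# c, then a, then t: found inside "cats".
span = re.search(r"cat", "About cats and dogs").span()

# Case sensitive by default.
sensitive = re.search(r"cat", "Cat")
insensitive = re.search(r"cat", "Cat", re.IGNORECASE)
print(span, sensitive, insensitive.group())
```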
Special Characters
Because we want to do more than simply search for literal pieces of text, we need to reserve certain characters for special
use. In the regex flavors discussed in this tutorial, there are 12 characters with special meanings: the opening square bracket
[, the closing square bracket ], the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe
symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening parenthesis (, and the closing parenthesis
). These special characters are often called "metacharacters". Most of them are errors when used alone.
If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to
match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign will have a special meaning.
Note that 1+1=2, with the backslash omitted, is a valid regex. So you will not get an error message. But it will not match
1+1=2. It would match 111=2 in 123+111=234, due to the special meaning of the plus character.
If you forget to escape a special character where its use is not allowed, such as in +1, then you will get an error message.
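The three cases just described (escaped plus, unescaped plus, and a plus with nothing to repeat) can be reproduced with this sketch. It assumes Python's re module, which reports an invalid regex by raising re.error:

```python
import re

# Escaped: the plus is a literal character.
escaped = re.search(r"1\+1=2", "1+1=2")

# Unescaped: valid regex, but + now means "one or more 1s".
unescaped_literal = re.search(r"1+1=2", "1+1=2")
unescaped_other = re.search(r"1+1=2", "123+111=234").group()

# A quantifier with nothing to repeat is an error.
try:
    re.compile(r"+1")
    plus_first_is_error = False
except re.error:
    plus_first_is_error = True

print(escaped.group(), unescaped_literal, unescaped_other, plus_first_is_error)
```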
Most regular expression flavors treat the brace { as a literal character, unless it is part of a repetition operator like {1,3}. So
you generally do not need to escape it with a backslash, though you can do so if you want. But there are a few exceptions.
Java requires literal opening braces to be escaped. Boost and std::regex require all literal braces to be escaped.
] is a literal outside character classes. Different rules apply inside character classes. Those are discussed in the topic about
character classes. Again, there are exceptions. std::regex and Ruby require closing square brackets to be escaped even
outside character classes.
All other characters should not be escaped with a backslash. That is because the backslash is also a special character. The
backslash in combination with a literal character can create a regex token with a special meaning. E.g. \d is a shorthand that
matches a single digit from 0 to 9.
Escaping a single metacharacter with a backslash works in all regular expression flavors. Many flavors also support the
\Q…\E escape sequence. All the characters between the \Q and the \E are interpreted as literal characters. E.g. \Q*\d+*\E
matches the literal text *\d+*. The \E may be omitted at the end of the regex, so \Q*\d+* is the same as \Q*\d+*\E. This
syntax is supported by Perl, PCRE, PHP, Delphi, and Java, both inside and outside character classes. Java 4 and 5 have
bugs that cause \Q…\E to misbehave, however, so you shouldn't use this syntax with Java. Boost supports it outside
character classes, but not inside.
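Not every flavor has \Q…\E. Python's re module, for instance, does not support it. A comparable effect can be had there by escaping the literal text with re.escape before compiling, as in this sketch (the sample subject string is invented):

```python
import re

literal = r"*\d+*"   # the text we want to find verbatim

# re.escape() puts a backslash before every character that needs one.
pattern = re.compile(re.escape(literal))

found = pattern.search(r"x*\d+*y").group()
no_digit_match = pattern.search("*123*")   # \d is not a shorthand here
print(found, no_digit_match)
```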
Special Characters and Programming Languages
If you are a programmer, you may be surprised that characters like the single quote and double quote are not special
characters. That is correct. When using a regular expression or grep tool, or the search function of a text editor, you should
not escape or repeat the quote characters like you do in a programming language.
In your source code, you have to keep in mind which characters get special treatment inside strings by your programming
language. That is because those characters will be processed by the compiler, before the regex library sees the string. So the
regex 1\+1=2 must be written as "1\\+1=2" in C++ code. The C++ compiler will turn the escaped backslash in the source
code into a single backslash in the string that is passed on to the regex library. To match c:\temp, you need to use the regex
c:\\temp. As a string in C++ source code, this regex becomes "c:\\\\temp". Four backslashes to match a single one
indeed.
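The same doubling of backslashes happens in other languages. As a sketch in Python (chosen here for illustration, the text above uses C++), ordinary string literals need the extra escaping while raw string literals pass backslashes through unchanged:

```python
import re

# Both literals produce the same 6-character regex 1\+1=2.
plain = "1\\+1=2"
raw = r"1\+1=2"

# The regex c:\\temp as an ordinary literal and as a raw literal.
path_plain = "c:\\\\temp"
path_raw = r"c:\\temp"

subject = "c:\\temp"   # the 7-character text c:\temp
path_match = re.fullmatch(path_raw, subject)
print(plain == raw, path_plain == path_raw, len(subject), path_match.group())
```

Raw string literals are a convenience of the programming language, not of the regex engine; the regex library sees the same pattern either way.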
Non-Printable Characters
You can use special character sequences to put non-printable characters in your regular expression. Use \t to match a tab
character (ASCII 0x09), \r for carriage return (0x0D) and \n for line feed (0x0A). More exotic non-printables are \a (bell,
0x07), \e (escape, 0x1B), \f (form feed, 0x0C) and \v (vertical tab, 0x0B). Remember that Windows text files use \r\n to
terminate lines, while UNIX text files use \n.
In some flavors, \v matches the vertical tab (ASCII 0x0B). In other flavors, \v is a shorthand that matches any vertical
whitespace character. That includes the vertical tab, form feed, and all line break characters. Perl 5.10, PCRE 7.2, PHP 5.2.4,
R, Delphi XE, and later versions treat it as a shorthand. Earlier versions treated it as a needlessly escaped literal v.
Many regex flavors also support the tokens \cA through \cZ to insert ASCII control characters. The letter after the backslash
is always a lowercase c. The second letter is an uppercase letter A through Z, to indicate Control+A through Control+Z. These
are equivalent to \x01 through \x1A (26 decimal). E.g. \cM matches a carriage return, just like \r, \x0D, and \u000D.
Most flavors allow the second letter to be lowercase, with no difference in meaning. Only Java requires the A to Z to be
uppercase.
Using characters other than letters after \c is not recommended because the behavior is inconsistent between applications.
Some allow any character after \c while others allow only ASCII characters. The application may take the last 5 bits of that character's
index in the code page or its Unicode code point to form an ASCII control character. Or the application may just flip bit 0x40.
Either way \c@ through \c_ would match control characters 0x00 through 0x1F. But \c* might match a line feed or the letter
j. The asterisk is character 0x2A in the ASCII table, so the lower 5 bits are 0x0A while flipping bit 0x40 gives 0x6A.
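The bit arithmetic behind these two strategies can be worked through in a few lines of Python. This is plain integer arithmetic for illustration, not a call into any regex engine:

```python
# The asterisk is 0x2A in ASCII.
star = ord("*")
low_five_bits = star & 0x1F   # keep the last 5 bits
flipped = star ^ 0x40         # flip bit 0x40

# For @ (0x40) through _ (0x5F) both strategies give the same result.
agree = all((c & 0x1F) == (c ^ 0x40) for c in range(0x40, 0x60))
controls = [c ^ 0x40 for c in range(0x40, 0x60)]

# \cM: M is 0x4D, which maps to carriage return.
ctrl_m = ord("M") ^ 0x40

print(hex(low_five_bits), hex(flipped), chr(flipped), agree, hex(ctrl_m))
```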
Metacharacters indeed lose their meaning immediately after \c in applications that support \cA through \cZ for matching
control characters. .NET and XRegExp are more sensible. They treat anything other than a letter after \c as an error.
In XML Schema regular expressions and XPath, \c is a shorthand character class that matches any character allowed in an
XML name.
If your regular expression engine supports Unicode, you can use \uFFFF or \x{FFFF} to insert a Unicode character. The
euro currency sign occupies Unicode code point U+20AC. If you cannot type it on your keyboard, you can insert it into a
regular expression with \u20AC or \x{20AC}. See the tutorial section on Unicode for more details on matching Unicode code
points.
If your regex engine works with 8-bit code pages instead of Unicode, you can include any character in your regular expression
if you know its position in the character set that you are working with. In the Latin-1 character set, the copyright symbol is
character 0xA9. So to search for the copyright symbol, you can use \xA9. Another way to search for a tab is to use \x09.
Note that the leading zero is required. In Tcl 8.5 and prior you have to be careful with this syntax, because Tcl used to eat up
all hexadecimal characters after \x and treat the last 4 as a Unicode code point. So \xA9ABC20AC would match the euro
symbol. Tcl 8.6 only takes the first two hexadecimal digits as part of the \x, as all other regex flavors do, so \xA9ABC20AC
matches ©ABC20AC.
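A sketch of hexadecimal and Unicode escapes, assuming Python 3.3 or later, whose re module works on Unicode strings and accepts \xFF and \uFFFF in patterns. Like Tcl 8.6, it takes exactly two hexadecimal digits after \x:

```python
import re

copyright_sign = re.search(r"\xA9", "\u00a9 2019")
tab = re.search(r"\x09", "a\tb")
euro = re.search(r"\u20AC", "Price: \u20ac5")

# Only A9 belongs to the escape; ABC20AC are literal characters.
two_digits_only = re.fullmatch(r"\xA9ABC20AC", "\u00a9ABC20AC")
print(copyright_sign.start(), tab.start(), euro.start(), two_digits_only is not None)
```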
Line Breaks
\R is a special escape that matches any line break, including Unicode line breaks. What makes it special is that it treats CRLF
pairs as indivisible. If the match attempt of \R begins before a CRLF pair in the string, then a single \R matches the whole
CRLF pair. \R will not backtrack to match only the CR in a CRLF pair. So while \R can match a lone CR or a lone LF, \R{2} or
\R\R cannot match a single CRLF pair. The first \R matches the whole CRLF pair, leaving nothing for the second one to
match.
Or at least, that is how \R should work. It works like that in Ruby 2.0 and later, Java 8, and PCRE 8.13 and later. Java 9
introduced a bug that allows \R\R to match a single CRLF pair. PCRE 7.0 through 8.12 had a bug that allows \R{2} to match
a single CRLF pair. Perl has a different bug with the same result.
Note that \R only looks forward to match CRLF pairs. The regex \r\R can match a single CRLF pair. After \r has consumed
the CR, the remaining lone LF is a valid line break for \R to match. This behavior is consistent across all flavors.
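Python's re module has no \R token, but the indivisible-CRLF rule can still be illustrated there. The sketch below is only an emulation under that assumption: a plain alternation lets the engine backtrack and split a CRLF pair, while wrapping the alternation in a lookahead and consuming it through a backreference keeps the pair whole, because the engine does not backtrack into a completed lookahead.

```python
import re

# Naive emulation: the engine may fall back from "\r\n" to a lone "\r".
naive = re.compile(r"(?:\r\n|\r|\n){2}")

# Indivisible emulation: capture inside a lookahead, consume with \1.
indivisible = re.compile(r"(?:(?=(\r\n|\r|\n))\1){2}")

crlf = "\r\n"
naive_splits_pair = naive.fullmatch(crlf) is not None
indivisible_splits_pair = indivisible.fullmatch(crlf) is not None

# Two real line breaks still match.
two_breaks = indivisible.fullmatch("\r\n\n") is not None

# Looking forward only: \r followed by one line break matches CRLF.
cr_then_break = re.fullmatch(r"\r(?=(\r\n|\r|\n))\1", crlf) is not None

print(naive_splits_pair, indivisible_splits_pair, two_breaks, cr_then_break)
```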
Octal Escapes
Many applications also support octal escapes in the form of \0377 or \377, where 377 is the octal representation of the
character's position in the character set (255 decimal in this case). There is a lot of variation between regex flavors as to the
number of octal digits allowed or required after the backslash, whether the leading zero is required or not allowed, and
whether \0 without additional digits matches a NULL byte. In some flavors this causes complications as \1 to \77 can be
octal escapes 1 to 63 (decimal) or backreferences 1 to 77 (decimal), depending on how many capturing groups there are in
the regex. Therefore, using these octal escapes in regexes is strongly discouraged. Use hexadecimal escapes instead.
Perl 5.14, PCRE 8.34, PHP 5.5.10, and R 3.0.3 support a new syntax \o{377} for octal escapes. You can have any number
of octal digits between the curly braces, with or without leading zero. There is no confusion with backreferences and literal
digits that follow are cleanly separated by the closing curly brace. Do be careful to only put octal digits between the curly
braces. In Perl, \o{whatever} is not an error but matches a NULL byte.
A similar issue exists in Python 3.2 and prior with the Unicode escape \uFFFF. Python has supported this syntax as part of
(Unicode) string literals ever since Unicode support was added to Python. But Python's re module only supports \uFFFF
starting with Python 3.3. In Python 3.2 and earlier, \uFFFF works when you add your regex as a literal (Unicode) string to
your Python code. But when your Python 3.2 script reads the regex from a file or user input, \uFFFF matches uFFFF literally
as the regex engine sees \u as an escaped literal u.
First Look at How a Regex Engine Works Internally
Knowing how the regex engine works will enable you to craft better regexes more easily. It will help you understand quickly
why a particular regex does not do what you initially expected. This will save you lots of guesswork and head scratching when
you need to write more complex regexes.
While there are many implementations of regular expressions that differ sometimes slightly and sometimes significantly in
syntax and behavior, there are basically only two kinds of regular expression engines: text-directed engines and regex-directed engines, also called DFA
and NFA engines, respectively. All the regex flavors treated in this tutorial are based on regex-directed engines. This is
because certain very useful features, such as lazy quantifiers and backreferences, can only be implemented in regex-directed
engines. No surprise that this kind of engine is more popular.
Notable tools that use text-directed engines are awk, egrep, flex, lex, MySQL and Procmail. For awk and egrep, there are a
few versions of these tools that use a regex-directed engine.
You can easily find out whether the regex flavor you intend to use has a text-directed or regex-directed engine. If
backreferences and/or lazy quantifiers are available, you can be certain the engine is regex-directed. You can do the test by
applying the regex regex|regex not to the string regex not. If the resulting match is only regex, the engine is regex-
directed. If the result is regex not, then it is text-directed. The reason behind this is that the regex-directed engine is
"eager".
A regex-directed engine walks through the regex, attempting to match the next token in the regex to the next character. If a
match is found, the engine advances through the regex and the subject string. If a token fails to match, the engine backtracks
to a previous position in the regex and the subject string where it can try a different path through the regex. This tutorial will
talk a lot more about backtracking later on. Modern regex flavors using regex-directed engines have lots of features such as
atomic grouping and possessive quantifiers that allow you to control this backtracking.
A text-directed engine walks through the subject string, attempting all permutations of the regex before advancing to the next
character in the string. A text-directed engine never backtracks. Thus, there isn't much to discuss about the matching process
of a text-directed engine. In most cases, a text-directed engine finds the same matches as a regex-directed engine.
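The regex|regex not test described earlier in this section can be run directly. This sketch assumes Python's re module, which is a regex-directed engine:

```python
import re

result = re.search(r"regex|regex not", "regex not").group()

# An eager, regex-directed engine stops at the first alternative that works.
engine_kind = "regex-directed" if result == "regex" else "text-directed"
print(result, engine_kind)
```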
After introducing a new regex token, this tutorial explains step by step how the regex engine actually processes that token.
This inside look may seem a bit long-winded at certain times. But understanding how the regex engine works will enable you
to use its full power and help you avoid common mistakes.
When applying cat to He captured a catfish for his cat., the engine will try to match the first token in the regex c
to the first character in the string H. This fails. There are no other possible permutations of this regex, because it merely
consists of a sequence of literal characters. So the regex engine tries to match the c with the e. This fails too, as does
matching the c with the space. Arriving at the 4th character in the string, c matches c. The engine will then try to match the
second token a to the 5th character, a. This succeeds too. But then, t fails to match p. At that point, the engine knows the
regex cannot be matched starting at the 4th character in the string. So it will continue with the 5th: a. Again, c fails to match
here and the engine carries on. At the 15th character in the string, c again matches c. The engine then proceeds to attempt
to match the remainder of the regex at character 15 and finds that a matches a and t matches t.
The entire regular expression could be matched starting at character 15. The engine is "eager" to report a match. It will
therefore report the first three letters of catfish as a valid match. The engine never proceeds beyond this point to see if there
are any "better" matches. The first match is considered good enough.
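The walk-through above can be reproduced with a short Python sketch. Python's re module is used only as an illustration; the variable names are arbitrary.

```python
import re

subject = "He captured a catfish for his cat."
match = re.search(r"cat", subject)

# Python counts offsets from 0, so offset 14 is the 15th character.
first_span = match.span()
print(first_span)  # (14, 17)

# Asking the engine to continue after each match also finds the final "cat".
all_starts = [m.start() for m in re.finditer(r"cat", subject)]
print(all_starts)  # [14, 30]
```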
In this first example of the engine's internals, our regex engine simply appears to work like a regular text search routine. A
text-directed engine would have returned the same result too. However, it is important that you can follow the steps the engine
takes in your mind. In following examples, the way the engine works will have a profound impact on the matches it will find.
Some of the results may be surprising. But they are always logical and predetermined, once you know how the engine works.
A character class matches only a single character. gr[ae]y will not match graay, graey or any such thing. The order of the
characters inside a character class does not matter. The results are identical.
You can use a hyphen inside a character class to specify a range of characters. [0-9] matches a single digit between 0 and
9. You can use more than one range. [0-9a-fA-F] matches a single hexadecimal digit, case insensitively. You can combine
ranges and single characters. [0-9a-fxA-FX] matches a hexadecimal digit or the letter X. Again, the order of the characters
and the ranges does not matter.
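These properties of character classes can be checked with a small Python sketch. The test words are chosen only for illustration.

```python
import re

# A class matches exactly one character, so doubled letters do not match.
one_char = [bool(re.fullmatch(r"gr[ae]y", w)) for w in ("gray", "grey", "graay", "graey")]

# The order inside the class does not matter: both spellings accept the same words.
same_order = all(
    bool(re.fullmatch(r"gr[ae]y", w)) == bool(re.fullmatch(r"gr[ea]y", w))
    for w in ("gray", "grey", "groy")
)

# Ranges, and ranges combined with single characters.
hex_digits = re.findall(r"[0-9a-fA-F]", "0x1F")
hex_or_x = re.findall(r"[0-9a-fxA-FX]", "0x1F")
```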
Useful Applications
Find a word, even if it is misspelled, such as sep[ae]r[ae]te or li[cs]en[cs]e.
It is important to remember that a negated character class still must match a character. q[^u] does not mean: "a q not
followed by a u". It means: "a q followed by a character that is not a u". It will not match the q in the string Iraq. It will match
the q and the space after the q in Iraq is a country. Indeed: the space will be part of the overall match, because it is the
"character that is not a u" that is matched by the negated character class in the above regexp. If you want the regex to match
the q, and only the q, in both strings, you need to use negative lookahead: q(?!u). But we will get to that later.
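The difference between the negated class and the lookahead can be seen in a short Python sketch (Python's re module supports negative lookahead):

```python
import re

# The negated class must consume a character, so it fails at the end of the string.
no_match = re.search(r"q[^u]", "Iraq")

# Here it succeeds, but the space becomes part of the match.
with_space = re.search(r"q[^u]", "Iraq is a country")

# Negative lookahead checks the next character without consuming it.
lookahead_end = re.search(r"q(?!u)", "Iraq")
lookahead_mid = re.search(r"q(?!u)", "Iraq is a country")
```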
To include a backslash as a character without any special meaning inside a character class, you have to escape it with
another backslash. [\\x] matches a backslash or an x. The closing bracket (]), the caret (^) and the hyphen (-) can be
included by escaping them with a backslash, or by placing them in a position where they do not take on their special meaning.
I recommend the latter method, since it improves readability. The POSIX and GNU flavors are an exception. They treat
backslashes in character classes as literal characters. So with these flavors, you can't escape anything in character classes.
To include an unescaped caret, place it anywhere except right after the opening bracket. [x^] matches an x or a caret. This
works with all flavors discussed in this tutorial.
You can include an unescaped closing bracket right after the opening bracket, or right after the negating caret. []x] matches
a closing bracket or an x. [^]x] matches any character that is not a closing bracket or an x. The hyphen can be included
right after the opening bracket, or right before the closing bracket, or right after the negating caret. Both [-x] and [x-] match
an x or a hyphen.
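The placement rules above can be tried in Python, which accepts all of these forms. This is a sketch with an arbitrary test string; other flavors may differ as noted in the text.

```python
import re

# Test string containing a backslash, a caret, a closing bracket and a hyphen.
text = r"a\b^c]d-x"

backslash_or_x = re.findall(r"[\\x]", text)   # escaped backslash
caret_or_x = re.findall(r"[x^]", text)        # caret not in first position
bracket_or_x = re.findall(r"[]x]", text)      # closing bracket right after the opening one
not_bracket_or_x = re.findall(r"[^]x]", "]x!")  # negated: anything but ] and x
hyphen_first = re.findall(r"[-x]", text)      # hyphen right after the opening bracket
hyphen_last = re.findall(r"[x-]", text)       # hyphen right before the closing bracket
```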
Many regex tokens that work outside character classes can also be used inside character classes. This includes character
escapes, octal escapes, and hexadecimal escapes for non-printable characters. E.g. [$\u20AC] matches a dollar or euro
sign, assuming your regex flavor supports Unicode.
Perl and PCRE also support the \Q…\E sequence inside character classes to escape a string of characters. E.g. [\Q[-]\E]
matches [, - or ].
If you want to repeat the matched character, rather than the class, you will need to use backreferences. ([0-9])\1+ will
match 222 but not 837. When applied to the string 833337, it will match 3333 in the middle of this string. If you do not want
that, you need to use lookahead and lookbehind.
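A short Python sketch contrasting a repeated class with a repeated character via a backreference:

```python
import re

# Repeating the class accepts any mix of digits.
repeat_class = re.fullmatch(r"[0-9]+", "837")

# Repeating the matched character requires the same digit each time.
repeat_char = re.search(r"([0-9])\1+", "837")
triple = re.fullmatch(r"([0-9])\1+", "222")
middle = re.search(r"([0-9])\1+", "833337")
```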
But I digress. I did not yet explain how character classes work inside the regex engine. Let us take a look at that first.
Nothing noteworthy happens for the first twelve characters in the string. The engine will fail to match g at every step, and
continue with the next character in the string. When the engine arrives at the 13th character, g is matched. The engine will
then try to match the remainder of the regex with the text. The next token in the regex is the literal r, which matches the next
character in the text. So the third token, [ae] is attempted at the next character in the text (e). The character class gives the
engine two options: match a or match e. It will first attempt to match a, and fail.
But because we are using a regex-directed engine, it must continue trying to match all the other permutations of the regex
pattern before deciding that the regex cannot be matched with the text starting at character 13. So it will continue with the
other option, and find that e matches e. The last regex token is y, which can be matched with the following character as well.
The engine has found a complete match with the text starting at character 13. It will return grey as the match result, and look
no further. Again, the leftmost match was returned, even though we put the a first in the character class, and gray could have
been matched in the string. But the engine simply did not get that far, because another equally valid match was found to the
left of it.
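The outcome of this walk-through can be checked in Python. The subject string below is an assumption chosen so that the g of grey is the 13th character, as in the description above.

```python
import re

# Hypothetical subject string consistent with the walk-through.
subject = "Is his hair grey or gray?"

match = re.search(r"gr[ae]y", subject)
found = match.group()
start = match.start()  # 0-based, so 12 is the 13th character

# Swapping the order inside the class does not change which match comes first.
swapped = re.search(r"gr[ea]y", subject).group()
```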
\w stands for "word character", usually [A-Za-z0-9_]. Notice the inclusion of the underscore and digits.
\s stands for "whitespace character". Again, which characters this actually includes, depends on the regex flavor. In all flavors
discussed in this tutorial, it includes [ \t\r\n]. That is: \s will match a space, a tab or a line break. Some flavors include
additional, rarely used non-printable characters such as vertical tab and form feed. In flavors that support Unicode, \s
normally includes all characters from the Unicode "separator" category. Java and PCRE are exceptions once again. But
JavaScript does match all Unicode whitespace with \s.
The flavor comparison shows "ascii only" for flavors that match only the ASCII characters listed in the previous paragraphs.
With flavors marked as "YES", letters, digits and space characters from other languages or Unicode are also included in the
shorthand classes. In the screen shot, you can see the characters matched by \w using various scripts. Notice that JavaScript
uses ASCII for \d and \w, but Unicode for \s. XML does it the other way around. Python offers flags to control what the
shorthands should match.
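As an illustration of the Python flags mentioned above, the following sketch compares the default behavior for text strings in Python 3 with the re.ASCII flag. The sample text is arbitrary.

```python
import re

# "naïve café", written with escapes to avoid source-encoding issues.
text = "na\u00efve caf\u00e9"

# Default for str patterns in Python 3: \w includes non-ASCII letters.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [A-Za-z0-9_].
ascii_words = re.findall(r"\w+", text, re.ASCII)
```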
Shorthand character classes can be used both inside and outside the square brackets. \s\d matches a whitespace character
followed by a digit. [\s\d] matches a single character that is either whitespace or a digit. When applied to 1 + 2 = 3, the
former regex will match 2 (space two), while the latter matches 1 (one). [\da-fA-F] matches a hexadecimal digit, and is
equivalent to [0-9a-fA-F] if your flavor only matches ASCII characters with \d.
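The two regexes from this paragraph, applied in Python:

```python
import re

subject = "1 + 2 = 3"

# Outside brackets: a whitespace character followed by a digit.
sequence = re.search(r"\s\d", subject).group()

# Inside brackets: one character that is either whitespace or a digit.
single = re.search(r"[\s\d]", subject).group()

# A shorthand combined with ranges inside one class.
hex_chars = re.findall(r"[\da-fA-F]", "9fZ")
```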
Be careful when using the negated shorthands inside square brackets. [\D\S] is not the same as [^\d\s]. The latter will
match any character that is not a digit or whitespace. So it will match x, but not 8. The former, however, will match any
character that is either not a digit, or is not whitespace. Because a digit is not whitespace, and whitespace is not a digit,
[\D\S] will match any character, digit, whitespace or otherwise.
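A minimal Python check of the difference, using an arbitrary four-character sample:

```python
import re

sample = "x 8\n"

# "Not a digit OR not whitespace" is true for every character.
either = re.findall(r"[\D\S]", sample)

# "Neither a digit nor whitespace" leaves only the x.
neither = re.findall(r"[^\d\s]", sample)
```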
PCRE also supports \h and \v starting with version 7.2. PHP does as of version 5.2.2, and Java as of version 8. Boost
supports \h starting with version 1.42. No version of Boost supports \v as a shorthand.
In many other regex flavors, \v matches only the vertical tab character. Perl, PCRE, and PHP never supported this, so they
were free to give \v a different meaning. Java 4 to 7 did use \v to match only the vertical tab. Java 8 changed the meaning of
this token anyway. The vertical tab is also a vertical whitespace character. To avoid confusion, the above paragraph uses \cK
to represent the vertical tab.
Ruby 1.9 and later have their own version of \h. It matches a single hexadecimal digit just like [0-9a-fA-F]. \v is a vertical
tab in Ruby.
You can use these four shorthands both inside and outside character classes using the bracket notation. They're very useful
for validating XML references and values in your XML schemas. The regular expression \i\c* matches an XML name like
xml:schema.
The regex <\i\c*\s*> matches an opening XML tag without any attributes. </\i\c*\s*> matches any closing tag.
<\i\c*(\s+\i\c*\s*=\s*("[^"]*"|'[^']*'))*\s*> matches an opening tag with any number of attributes.
No other regex flavors discussed in this tutorial support XML character classes. If your XML files are plain ASCII, you can use
[_:A-Za-z] for \i and [-._:A-Za-z0-9] for \c. If you want to allow all Unicode characters that the XML standard allows,
then you will end up with some pretty long regexes. You would have to use (all on one line):
[:A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF
\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD] instead of \i and:
[-.0-9:A-Z_a-z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-
\u200D\u203F\u2040\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]
instead of \c.
The character class [a-z-[aeiuo]] matches a single letter that is not a vowel. In other words: it matches a single
consonant. Without character class subtraction or intersection, the only way to do this would be to list all consonants:
[b-df-hj-np-tv-z].
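Python's re module does not support class subtraction, so the sketch below only verifies the equivalence claimed above: the explicit consonant class matches exactly the letters a to z minus the vowels. The lookahead form is an assumed workaround for flavors without subtraction, not part of the subtraction syntax.

```python
import re
import string

letters = string.ascii_lowercase

# Letters accepted by the explicit list of consonant ranges.
explicit = {c for c in letters if re.fullmatch(r"[b-df-hj-np-tv-z]", c)}

# The same set computed by subtracting the vowels from a-z.
subtracted = set(letters) - set("aeiuo")

# Workaround: a negative lookahead rejects vowels before [a-z] consumes a letter.
emulated = {c for c in letters if re.fullmatch(r"(?![aeiuo])[a-z]", c)}
```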
The character class [\p{Nd}-[^\p{IsThai}]] matches any single Thai digit. The base class matches any Unicode digit.
All non-Thai characters are subtracted from that class.
The class subtraction must always be the last element in the character class. [0-9-[4-6]a-f] is not a valid regular
expression. It should be rewritten as [0-9a-f-[4-6]]. The subtraction works on the whole class. E.g.
[\p{Ll}\p{Lu}-[\p{IsBasicLatin}]] matches all uppercase and lowercase Unicode letters, except any ASCII letters.
The \p{IsBasicLatin} is subtracted from the combination of \p{Ll}\p{Lu} rather than from \p{Lu} alone. This regex
will not match abc.
While you can use nested character class subtraction, you cannot subtract two classes sequentially. To subtract ASCII
characters and Greek characters from a class with all Unicode letters, combine the ASCII and Greek characters into one
class, and subtract that, as in [\p{L}-[\p{IsBasicLatin}\p{IsGreek}]].
Strictly speaking, this means that the character class subtraction syntax is incompatible with Perl and the majority of other
regex flavors. But in practice there's no difference. Using non-alphanumeric characters in character class ranges is very bad
practice because it relies on the order of characters in the ASCII character table. That makes the regular expression hard to
understand for the programmer who inherits your work. While [A-[] would match any upper case letter or an opening square
bracket in Perl, this regex is much clearer when written as [A-Z[]. The former regex would cause an error with the XML and
.NET flavors because they interpret -[] as an empty subtracted class, leaving an unbalanced [.
If the intersected class does not need a negating caret, then Java and Ruby allow you to omit the nested square brackets:
[class&&intersect].
The character class [a-z&&[^aeiuo]] matches a single letter that is not a vowel. In other words: it matches a single
consonant. Without character class subtraction or intersection, the only way to do this would be to list all consonants:
[b-df-hj-np-tv-z].
If you do not use square brackets around the right hand part of the intersection, then there is no confusion that the entire
remainder of the character class is the right hand part of the intersection. If you do use the square brackets, you could write
something like [0-9&&[12]56]. In Ruby, this is the same as [0-9&&1256]. But Java has bugs that cause it to treat this as
[0-9&&56], completely ignoring the nested brackets.
You also shouldn't put && at the very start or very end of the character class. Ruby treats [0-9&&] and [&&0-9] as intersections with
an empty class, which matches no characters at all. Java ignores leading and trailing && operators.
If you want to negate the right hand side of the intersection, then you must use square brackets. Those automatically control
precedence. So Java and Ruby both read [1234&&[^3456]] as "1234 and (not 3456)". Thus this regex is the same as
[12].
Strictly speaking, this means that the character class intersection syntax is incompatible with the majority of other regex
flavors. But in practice there's no difference, because there is no point in using two ampersands in a character class when you
just want to add a literal ampersand. A single ampersand is still treated as a literal by Java and Ruby.
The dot matches a single character, without caring what that character is. The only exceptions are newline characters. In all
regex flavors discussed in this tutorial, the dot will not match a newline character by default. So by default, the dot is short for
the negated character class [^\n] (UNIX regex flavors) or [^\r\n] (Windows regex flavors).
This exception exists mostly because of historic reasons. The first tools that used regular expressions were line-based. They
would read a file line by line, and apply the regular expression separately to each line. The effect is that with these tools, the
string could never contain newlines, so the dot could never match them.
Modern tools and languages can apply regular expressions to very large strings or even entire files. All regex flavors
discussed here have an option to make the dot match all characters, including newlines.
In Perl, the mode where the dot also matches newlines is called "single-line mode". This is a bit unfortunate, because it is
easy to mix up this term with "multi-line mode". Multi-line mode only affects anchors, and single-line mode only affects the dot.
You can activate single-line mode by adding an s after the regex code, like this: m/^regex$/s;.
Other languages and regex libraries have adopted Perl's terminology. When using the regex classes of the .NET framework,
you activate this mode by specifying RegexOptions.Singleline, such as in
Regex.Match("string", "regex", RegexOptions.Singleline).
JavaScript and VBScript do not have an option to make the dot match line break characters. In those languages, you can use
a character class such as [\s\S] to match any character. This character class matches a character that is either a whitespace
character (including line break characters), or a character that is not a whitespace character. Since all characters are either
whitespace or non-whitespace, this character class matches any character.
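Both approaches can be tried in Python, where the equivalent of single-line mode is the re.DOTALL flag. This is a sketch with an arbitrary two-line string.

```python
import re

text = "first line\nsecond line"

# By default the dot stops at the newline.
default_dot = re.search(r"first.*", text).group()

# With re.DOTALL the dot also matches the newline.
dotall = re.search(r"first.*", text, re.DOTALL).group()

# The [\s\S] class matches any character without needing a flag.
any_char_class = re.search(r"first[\s\S]*", text).group()
```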
In all of Boost's regex grammars the dot matches line breaks by default. Boost's ECMAScript grammar allows you to turn this
off with regex_constants::no_mod_m.
std::regex, XML Schema and XPath also treat the carriage return \r as a line break character. JavaScript adds the Unicode
line separator \u2028 and paragraph separator \u2029 on top of that. Java includes these plus the Latin-1 next line control
character \u0085. Boost adds the form feed \f to the list. Only Delphi supports all Unicode line breaks, completing the mix
with the vertical tab.
.NET is notably absent from the list of flavors that treat characters other than \n as line breaks. Unlike scripting languages
that have their roots in the UNIX world, .NET is a Windows development framework that does not automatically strip carriage
return characters from text files that it reads. If you read a Windows text file as a whole into a string, it will contain carriage
returns. If you use the regex abc.* on that string, without setting RegexOptions.Singleline, then it will match abc plus
all characters that follow on the same line, plus the carriage return at the end of the line, but without the newline after that.
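Python's re module also excludes only \n from the dot by default, so the same effect can be observed there. This sketch uses Python rather than .NET, with a made-up Windows-style string.

```python
import re

# Text with Windows line endings (CR LF), as read from a file without translation.
windows_text = "abc one\r\ndef two\r\n"

# The dot stops at \n but happily consumes the \r before it.
matched = re.search(r"abc.*", windows_text).group()
print(repr(matched))  # 'abc one\r'
```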
Some flavors allow you to control which characters should be treated as line breaks. Java has the UNIX_LINES option which
makes it treat only \n as a line break. PCRE has options that allow you to choose between \n only, \r only, \r\n, or all
Unicode line breaks.
On POSIX systems, the POSIX locale determines which characters are line breaks. The C locale treats only the newline \n as
a line break. Unicode locales support all Unicode line breaks.
PCRE's options that control which characters are treated as line breaks affect \N in exactly the same way as they affect the
dot.
PHP 5.3.4 and R 2.14.0 also support \N as their regex support is based on PCRE 8.10 or later.
\d\d[- /.]\d\d[- /.]\d\d is a better solution. This regex allows a dash, space, dot and forward slash as date
separators. Remember that the dot is not a metacharacter inside a character class, so we do not need to escape it with a
backslash. This regex is still far from perfect. It matches 99/99/99 as a valid date.
[0-1]\d[- /.][0-3]\d[- /.]\d\d is a step ahead, though it will still match 19/39/99 as a valid date. How perfect you
want your regex to be depends on what you want to do with it. If you are validating user input, it has to be perfect. If you are
parsing data files from a known source that generates its files in the same way every time, our last attempt is probably more
than sufficient to parse the data without errors. You can find a better regex to match dates in the examples section.
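The two date regexes can be compared in Python. The test values are taken from the text plus two extra examples.

```python
import re

loose = r"\d\d[- /.]\d\d[- /.]\d\d"
stricter = r"[0-1]\d[- /.][0-3]\d[- /.]\d\d"

def accepts(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

loose_bad_date = accepts(loose, "99/99/99")        # accepted by the loose regex
stricter_bad_date = accepts(stricter, "99/99/99")  # rejected by the stricter one
still_wrong = accepts(stricter, "19/39/99")        # still accepted
mixed_separators = accepts(stricter, "12.31 99")   # separators need not agree
```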
Suppose you want to match a double-quoted string. Sounds easy. We can have any number of any character between the
double quotes, so ".*" seems to do the trick just fine. The dot matches any character, and the star allows the dot to be
repeated any number of times, including zero. If you test this regex on Put a "string" between double quotes, it will
match "string" just fine.
Ouch. The regex matches "string one" and "string two". Definitely not what we intended. The reason for this is that
the star is greedy.
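The greedy behavior is easy to reproduce in Python. The subject string below is a hypothetical example containing two quoted strings; a negated character class that excludes quotes and line breaks is shown for contrast.

```python
import re

# Hypothetical subject with two double-quoted strings.
subject = 'Say "string one" and "string two" now.'

# The greedy star runs to the last quote in the string.
greedy = re.search(r'".*"', subject).group()

# A class that cannot match a quote stops at the nearest closing quote.
with_class = re.findall(r'"[^"\r\n]*"', subject)
```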
In the date-matching example, we improved our regex by replacing the dot with a character class. Here, we will do the same.
Our original definition of a double-quoted string was faulty. We do not want any number of any character between the quotes.
We want any number of characters that are not double quotes or newlines between the quotes. So the proper regex is
"[^"\r\n]*".
Anchors are a different breed. They do not match any character at all. Instead, they match a position before, after or between
characters. They can be used to "anchor" the regex match at a certain position. The caret ^ matches the position before the
first character in the string. Applying ^a to abc matches a. ^b will not match abc at all, because the b cannot be matched
right after the start of the string, matched by ^. See below for the inside view of the regex engine.
Similarly, $ matches right after the last character in the string. c$ matches c in abc, while a$ does not match at all.
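The four anchor examples above, checked in Python:

```python
import re

start_a = re.search(r"^a", "abc")  # a is at the start of the string
start_b = re.search(r"^b", "abc")  # b is not at the start
end_c = re.search(r"c$", "abc")    # c is at the end of the string
end_a = re.search(r"a$", "abc")    # a is not at the end
```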
A regex that consists solely of an anchor can only find zero-length matches. This can be useful, but can also create
complications that are explained near the end of this tutorial.
Useful Applications
When using regular expressions in a programming language to validate user input, using anchors is very important. If you use
the code if ($input =~ m/\d+/) in a Perl script to see if the user entered an integer number, it will accept the input even
if the user entered qsdf4ghjk, because \d+ matches the 4. The correct regex to use is ^\d+$. Because "start of string"
must be matched before the match of \d+, and "end of string" must be matched right after it, the entire string must consist of
digits for ^\d+$ to be able to match.
It is easy for the user to accidentally type in a space. When Perl reads a line from a text file, the line break will also be
stored in the variable. So before validating input, it is good practice to trim leading and trailing whitespace. ^\s+ matches
leading whitespace and \s+$ matches trailing whitespace. In Perl, you could use $input =~ s/^\s+|\s+$//g. Handy
use of alternation and /g allows us to do this in a single line of code.
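The same validation and trimming steps can be sketched in Python. The function names are illustrative; re.sub replaces all matches by default, which corresponds to Perl's /g.

```python
import re

def is_integer(text):
    """Return True if the text consists only of digits, using anchors."""
    return re.search(r"^\d+$", text) is not None

def trim(text):
    """Remove leading and trailing whitespace with a single substitution."""
    return re.sub(r"^\s+|\s+$", "", text)

unanchored = re.search(r"\d+", "qsdf4ghjk") is not None  # finds the 4
anchored = is_integer("qsdf4ghjk")                         # rejects the input
trimmed = trim("  42\n")
```

Trimming first and validating afterwards keeps the two concerns separate: is_integer(trim(user_input)).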
In text editors like GNU Emacs, and regex tools, the caret and dollar always match at the start and end of each line. This
makes sense because those applications are designed to work with entire files, rather than short strings.
In all programming languages and libraries discussed on this website, except Ruby, you have to explicitly activate this
extended functionality. It is traditionally called "multi-line mode". In Perl, you do this by adding an m after the regex code, like
this: m/^regex$/m; In .NET, the anchors match before and after newlines when you specify RegexOptions.Multiline,
such as in:
Regex.Match("string", "regex", RegexOptions.Multiline).
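In Python, the corresponding option is the re.MULTILINE flag. A small sketch with an arbitrary three-line string:

```python
import re

text = "first\nsecond\nthird"

# Without the flag, ^ and $ only match at the start and end of the whole string.
single_mode = re.findall(r"^\w+$", text)

# With re.MULTILINE they also match after and before each line break.
multi_mode = re.findall(r"^\w+$", text, re.MULTILINE)
```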
For anchors there's an additional consideration when CR and LF occur as a pair and the regex flavor treats both these
characters as line breaks. Delphi and Java treat CRLF as an indivisible pair. ^ matches after CRLF and $ matches before
CRLF, but neither match in the middle of a CRLF pair. JavaScript and XPath treat CRLF pairs as two line breaks. ^ matches in
the middle of and after CRLF, while $ matches before and in the middle of CRLF.
JavaScript, POSIX, XML, and XPath do not support \A and \Z. You're stuck with using the caret and dollar for this purpose.
The GNU extensions to POSIX regular expressions use \` (backtick) to match the start of the string, and \' (single quote) to
match the end of the string.
Strings Ending with a Line Break
Because Perl returns a string with a newline at the end when reading a line from a file, Perl's regex engine matches $ at the
position before the line break at the end of the string even when multi-line mode is turned off. Perl also matches $ at the very
end of the string, regardless of whether that character is a line break. So ^\d+$ matches 123 whether the subject string is
123 or 123\n.
Most modern regex flavors have copied this behavior. That includes .NET, Java, PCRE, Delphi, PHP, and Python. This
behavior is independent of any settings such as "multi-line mode".
In all these flavors except Python, \Z also matches before the final line break. If you only want a match at the absolute very
end of the string, use \z (lower case z instead of upper case Z). \A\d+\z does not match 123\n. \z matches after the line
break, which is not matched by the shorthand character class.
In Python, \Z matches only at the very end of the string. Python does not support \z.
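The Python behavior described here can be confirmed with a short sketch:

```python
import re

# $ matches before a final line break, even without multi-line mode.
dollar = re.search(r"^\d+$", "123\n")

# In Python, \Z matches only at the very end, so the trailing newline blocks the match.
upper_z_newline = re.search(r"\A\d+\Z", "123\n")
upper_z_plain = re.search(r"\A\d+\Z", "123")
```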
Boost is the only exception. In Boost, \Z can match before any number of trailing line breaks as well as at the very end of the
string. So if the subject string ends with three line breaks, Boost's \Z has four positions that it can match at. Like in all other
flavors, Boost's \Z is independent of multi-line mode. Boost's $ only matches at the very end of the string when you turn off
multi-line mode (which is on by default in Boost).
Then, the regex engine arrives at the second 4 in the string. The ^ can match at the position before the 4, because it is
preceded by a newline character. Again, the regex engine advances to the next regex token, 4, but does not advance the
character position in the string. 4 matches 4, and the engine advances both the regex token and the string character. Now the
engine attempts to match $ at the position before (indeed: before) the 8. The dollar cannot match here, because this position
is followed by a character, and that character is not a newline.
Yet again, the engine must try to match the first token again. Previously, it was successfully matched at the second 4, so the
engine continues at the next character, 8, where the caret does not match. Same at the six and the newline.
Finally, the regex engine tries to match the first token at the third 4 in the string. With success. After that, the engine
successfully matches 4 with 4. The current regex token is advanced to $, and the current character is advanced to the very
last position in the string: the void after the string. No regex token that needs a character to match can match here. Not even a
negated character class. However, we are trying to match a dollar sign, and the mighty dollar is a strange beast. It is
zero-width, so it will try to match the position before the current character. It does not matter that this "character" is the void
after the string. In fact, the dollar will check the current character. It must be either a newline, or the void after the string, for $
to match the position before the current character. Since that is the case after the example, the dollar matches successfully.
Since $ was the last token in the regex, the engine has found a successful match: the last 4 in the string.
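The result of this walk-through can be reproduced in Python. The regex ^4$ and the subject string below are assumptions reconstructed from the steps described: three lines, with a 4 in the middle of the first, a 4 followed by 8 and 6 on the second, and a lone 4 on the last.

```python
import re

# Hypothetical subject consistent with the walk-through above.
subject = "749\n486\n4"

match = re.search(r"^4$", subject, re.MULTILINE)

# Only the third 4 is both preceded by a line start and followed by the end.
span = match.span()
print(span)  # (8, 9)
```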
Word Boundaries
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary".
This match is zero-length. There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A
"word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word
characters".
In all flavors, the characters [a-zA-Z0-9_] are word characters. These are also matched by the short-hand character class
\w. Flavors showing "ASCII" for word boundaries in the flavor comparison recognize only these as word characters. Flavors
showing "Unicode" also recognize letters and digits from other languages or all of Unicode as word characters. Notice that Java
supports Unicode for \b but not for \w. Python offers flags to control which characters are word characters (affecting both \b
and \w).
Most flavors, except the ones discussed below, have only one metacharacter that matches both before a word and after a
word. This is because any position between characters can never be both at the start and at the end of a word. Using only one
operator makes things easier for you.
Since digits are considered to be word characters, \b4\b can be used to match a 4 that is not part of a larger number. This
regex will not match 44 sheets of a4. So saying "\b matches before and after an alphanumeric sequence" is more exact
than saying "before and after a word".
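A quick Python check of \b4\b. The second test string is an arbitrary addition for contrast.

```python
import re

# Neither 44 nor a4 contains a 4 that stands alone between word boundaries.
no_hits = re.findall(r"\b4\b", "44 sheets of a4")

# A standalone 4 is matched; the digits of 44 are not.
one_hit = re.findall(r"\b4\b", "page 4 of 44")
```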
\b cannot match at the position between the T and the h. It cannot match between the h and the i either, and neither
between the i and the s.
The next character in the string is a space. \b matches here because the space is not a word character, and the preceding
character is. Again, the engine continues with the i which does not match with the space.
Advancing a character and restarting with the first regex token, \b matches between the space and the second i in the string.
Continuing, the regex engine finds that i matches i and s matches s. Now, the engine tries to match the second \b at the
position before the l. This fails because this position is between two word characters. The engine reverts to the start of the
regex and advances one character to the s in island. Again, the \b fails to match and continues to do so until the second
space is reached. It matches there, but matching the i fails.
But \b matches at the position before the third i in the string. The engine continues, and finds that i matches i and s
matches s. The last token in the regex, \b, also matches at the position before the third space in the string because the space
is not a word character, and the character before it is.
The engine has successfully matched the word is in our string, skipping the two earlier occurrences of the characters i and s.
If we had used the regular expression is, it would have matched the is in This.
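The outcome can be verified in Python. The regex \bis\b and the subject string are assumptions reconstructed from the walk-through above (the words This and island, then is followed by a space).

```python
import re

# Hypothetical subject consistent with the steps described above.
subject = "This island is beautiful"

whole_word = re.search(r"\bis\b", subject).span()  # the standalone word "is"
plain = re.search(r"is", subject).span()           # the "is" inside "This"
```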
In Tcl, \b matches a backspace character, just like \x08 in most regex flavors (including Tcl's). \B matches a single backslash
character in Tcl, just like \\ in all other regex flavors (and Tcl too).
Tcl uses the letter "y" instead of the letter "b" to match word boundaries. \y matches at any word boundary position, while \Y
matches at any position that is not a word boundary. These Tcl regex tokens match exactly the same as \b and \B in
Perl-style regex flavors. They don't discriminate between the start and the end of a word.
Tcl has two more word boundary tokens that do discriminate between the start and end of a word. \m matches only at the start
of a word. That is, it matches at any position that has a non-word character to the left of it, and a word character to the right of
it. It also matches at the start of the string if the first character in the string is a word character. \M matches only at the end of a
word. It matches at any position that has a word character to the left of it, and a non-word character to the right of it. It also
matches at the end of the string if the last character in the string is a word character.
In most situations, the lack of \m and \M tokens is not a problem. \yword\y finds "whole words only" occurrences of "word"
just like \mword\M would. \Mword\m could never match anywhere, since \M never matches at a position followed by a word
character, and \m never at a position preceded by one. If your regular expression needs to match characters before or after
\y, you can easily specify in the regex whether these characters should be word characters or non-word characters. E.g. if
you want to match any word, \y\w+\y will give the same result as \m.+\M. Using \w instead of the dot automatically restricts
the first \y to the start of a word, and the second \y to the end of a word. Note that \y.+\y would not work. This regex
matches each word, and also each sequence of non-word characters between the words in your subject string. That said, if
your flavor supports \m and \M, the regex engine could apply \m\w+\M slightly faster than \y\w+\y, depending on its internal
optimizations.
If your regex flavor supports lookahead and lookbehind, you can use (?<!\w)(?=\w) to emulate Tcl's \m and
(?<=\w)(?!\w) to emulate \M. Though quite a bit more verbose, these lookaround constructs match exactly the same as
Tcl's word boundaries.
If your flavor has lookahead but not lookbehind, and also has Perl-style word boundaries, you can use \b(?=\w) to emulate
Tcl's \m and \b(?!\w) to emulate \M. \b matches at the start or end of a word, and the lookahead checks if the next
character is part of a word or not. If it is, we're at the start of a word. Otherwise, we're at the end of a word.
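Python has no \m or \M, but it supports lookahead, lookbehind and \b, so both emulations can be compared there. A sketch that lists the positions where each zero-length pattern matches in an arbitrary string:

```python
import re

text = "one, two"

def positions(pattern):
    """Return the offsets at which a zero-length pattern matches in text."""
    return [m.start() for m in re.finditer(pattern, text)]

# Lookaround-only emulation of start-of-word and end-of-word.
starts = positions(r"(?<!\w)(?=\w)")
ends = positions(r"(?<=\w)(?!\w)")

# Emulation with \b plus lookahead only.
starts_b = positions(r"\b(?=\w)")
ends_b = positions(r"\b(?!\w)")
```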
If you want to search for the literal text cat or dog, separate both options with a vertical bar or pipe symbol: cat|dog. If you
want more options, simply expand the list: cat|dog|mouse|fish.
The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex engine to match either
everything to the left of the vertical bar, or everything to the right of the vertical bar. If you want to limit the reach of the
alternation, you will need to use parentheses for grouping. If we want to improve the first example to match whole words only,
we would need to use \b(cat|dog)\b. This tells the regex engine to find a word boundary, then either "cat" or "dog", and
then another word boundary. If we had omitted the parentheses, the regex engine would have searched for "a word boundary
followed by cat", or, "dog" followed by a word boundary.
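A small sketch of the precedence rule, using Python's re module. The sample string and the helper all_matches are assumptions made for illustration.

```python
import re

text = "catalog hotdog dog cat"  # assumed sample string

def all_matches(pattern, subject):
    """Return the text of every match, regardless of capturing groups."""
    return [m.group(0) for m in re.finditer(pattern, subject)]

# Parentheses limit the alternation, so both word boundaries apply to both options.
grouped = all_matches(r"\b(cat|dog)\b", text)

# Without parentheses the regex reads as: (\bcat) or (dog\b).
ungrouped = all_matches(r"\bcat|dog\b", text)

print(grouped)    # ['dog', 'cat']
print(ungrouped)  # ['cat', 'dog', 'dog', 'cat']
```

The ungrouped version also finds cat at the start of catalog and dog at the end of hotdog, because each option carries only one of the two word boundaries.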
The obvious solution is Get|GetValue|Set|SetValue. Let's see how this works out when the string is SetValue:
The regex engine starts at the first token in the regex, G, and at the first character in the string, S. The match fails. However,
the regex engine studied the entire regular expression before starting. So it knows that this regular expression uses
alternation, and that the entire regex has not failed yet. So it continues with the second option, being the second G in the
regex. The match fails again. The next token is the first S in the regex. The match succeeds, and the engine continues with
the next character in the string, as well as the next token in the regex. The next token in the regex is the e after the S that just
successfully matched. e matches e. The next token, t matches t.
At this point, the third option in the alternation has been successfully matched. Because the regex engine is eager, it considers
the entire alternation to have been successfully matched as soon as one of the options has. In this example, there are no
other tokens in the regex outside the alternation, so the entire regex has successfully matched Set in SetValue.
Contrary to what we intended, the regex did not match the entire string. There are several solutions. One option is to take into
account that the regex engine is eager, and change the order of the options. If we use GetValue|Get|SetValue|Set,
SetValue will be attempted before Set, and the engine will match the entire string. We could also combine the four options
into two and use the question mark to make part of them optional: Get(Value)?|Set(Value)?. Because the question mark
is greedy, SetValue will be attempted before Set.
The best option is probably to express the fact that we only want to match complete words. We do not want to match Set or
SetValue if the string is SetValueFunction. So the solution is \b(Get|GetValue|Set|SetValue)\b or
\b(Get(Value)?|Set(Value)?)\b. Since all options have the same end, we can optimize this further to:
\b(Get|Set)(Value)?\b
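The eagerness of the engine and the proposed fixes can be tried out with a sketch like the following. It uses Python's re module, which is a regex-directed engine; the helper first_match is a hypothetical name.

```python
import re

def first_match(pattern, subject):
    """Return the text of the first match, or None if there is none."""
    m = re.search(pattern, subject)
    return m.group(0) if m else None

# The eager engine stops at the first alternative that succeeds.
eager = first_match(r"Get|GetValue|Set|SetValue", "SetValue")

# Listing the longer options first changes the outcome.
reordered = first_match(r"GetValue|Get|SetValue|Set", "SetValue")

# The greedy question mark tries to match Value before skipping it.
optional = first_match(r"Get(Value)?|Set(Value)?", "SetValue")

# Word boundaries force a complete word.
whole_word = first_match(r"\b(Get|Set)(Value)?\b", "SetValue")
inside_longer_word = first_match(r"\b(Get|Set)(Value)?\b", "SetValueFunction")

print(eager, reordered, optional, whole_word, inside_longer_word)
```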
Optional Items
The question mark makes the preceding token in the regular expression optional. E.g.: colou?r matches both colour and
color.
You can make several tokens optional by grouping them together using parentheses, and placing the question mark after the
closing parenthesis. E.g.: Nov(ember)? will match Nov and November.
You can write a regular expression that matches many alternatives by including more than one question mark.
Feb(ruary)? 23(rd)? matches February 23rd, February 23, Feb 23rd and Feb 23.
You can also use curly braces to make something optional. colou{0,1}r is the same as colou?r. POSIX BRE and GNU
BRE do not support either syntax. These flavors require backslashes to give curly braces their special meaning:
colou\{0,1\}r.
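As a quick check of the examples above, here is a minimal Python sketch. The list of test strings, including the misspelling colouur, is an assumption added for illustration.

```python
import re

# Two optional parts give four accepted spellings.
date = re.compile(r"Feb(ruary)? 23(rd)?")
date_variants = ["February 23rd", "February 23", "Feb 23rd", "Feb 23"]
date_results = [date.fullmatch(s) is not None for s in date_variants]

# The question mark and {0,1} should accept and reject the same strings.
question_mark = re.compile(r"colou?r")
braces = re.compile(r"colou{0,1}r")
spellings = ["color", "colour", "colouur"]
same_behaviour = all(
    (question_mark.fullmatch(s) is None) == (braces.fullmatch(s) is None)
    for s in spellings
)

print(date_results, same_behaviour)
```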
The effect is that if you apply the regex Feb 23(rd)? to the string Today is Feb 23rd, 2013, the match will always be
Feb 23rd and not Feb 23. You can make the question mark lazy (i.e. turn off the greediness) by putting a second question
mark after the first.
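The difference between the greedy and the lazy question mark can be seen with the string from the text. This is a sketch in Python, whose re module supports the ?? syntax.

```python
import re

subject = "Today is Feb 23rd, 2013"

# Greedy: the optional group is matched whenever possible.
greedy = re.search(r"Feb 23(rd)?", subject).group(0)

# Lazy: the optional group is skipped unless the rest of the regex needs it.
lazy = re.search(r"Feb 23(rd)??", subject).group(0)

print(greedy)  # Feb 23rd
print(lazy)    # Feb 23
```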
The discussion about the other repetition operators has more details on greedy and lazy quantifiers.
The first token in the regex is the literal c. The first position where it matches successfully is the c in colonel. The engine
continues, and finds that o matches o, l matches l and another o matches o. Then the engine checks whether u matches n.
This fails. However, the question mark tells the regex engine that failing to match u is acceptable. Therefore, the engine will
skip ahead to the next regex token: r. But this fails to match n as well. Now, the engine can only conclude that the entire
regular expression cannot be matched starting at the c in colonel. Therefore, the engine starts again trying to match c to the
first o in colonel.
After a series of failures, c will match with the c in color, and o, l and o match the following characters. Now the engine
checks whether u matches r. This fails. Again: no problem. The question mark allows the engine to continue with r. This
matches r and the engine reports that the regex successfully matched color in our string.
Repetition with Star and Plus
We already introduced one repetition operator or quantifier: the question mark. It tells the engine to attempt to match the
preceding token zero times or once, in effect making it optional.
The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to
attempt to match the preceding token once or more. <[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any
attributes. The sharp brackets are literals. The first character class matches a letter. The second character class matches a
letter or digit. The star repeats the second character class. Because we used the star, it's OK if the second character class
matches nothing. So our regex will match a tag like <p>. When matching <html>, the first character class will match h. The
star will cause the second character class to be repeated three times, matching t, m and l with each step.
We could also have used <[A-Za-z0-9]+>. We did not, because this regex would match <1>, which is not a valid HTML
tag. But this regex may be sufficient if you know the string you are searching through does not contain any such invalid tags.
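A short Python sketch comparing the two tag regexes. The sample text is an assumption chosen to include the invalid tag <1>.

```python
import re

strict_tag = re.compile(r"<[A-Za-z][A-Za-z0-9]*>")
loose_tag = re.compile(r"<[A-Za-z0-9]+>")

text = "<p> <html> <h1> <1>"  # assumed sample string

strict_found = strict_tag.findall(text)
loose_found = loose_tag.findall(text)

print(strict_found)  # ['<p>', '<html>', '<h1>']
print(loose_found)   # ['<p>', '<html>', '<h1>', '<1>']
```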
Limiting Repetition
There is an additional quantifier that allows you to specify how many times a token can be repeated. The syntax is
{min,max}, where min is zero or a positive integer number indicating the minimum number of matches, and max is an integer equal
to or greater than min indicating the maximum number of matches. If the comma is present but max is omitted, the maximum
number of matches is infinite. So {0,} is the same as *, and {1,} is the same as +. Omitting both the comma and max tells
the engine to repeat the token exactly min times.
You could use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999.
\b[1-9][0-9]{2,4}\b matches a number between 100 and 99999. Notice the use of the word boundaries.
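The following Python sketch applies both regexes to a few numbers. The sample strings are assumptions picked to sit just inside and just outside each range.

```python
import re

four_digits = re.compile(r"\b[1-9][0-9]{3}\b")
three_to_five_digits = re.compile(r"\b[1-9][0-9]{2,4}\b")

# The word boundaries keep the regex from matching part of a longer number.
found_four = four_digits.findall("999 1000 9999 10000 0123")
found_range = three_to_five_digits.findall("99 100 99999 100000")

# {0,} behaves like the star and {1,} like the plus.
star_like = re.fullmatch(r"a{0,}", "") is not None
plus_like = re.fullmatch(r"a{1,}", "") is None

print(found_four)   # ['1000', '9999']
print(found_range)  # ['100', '99999']
```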
Most people new to regular expressions will attempt to use <.+>. They will be surprised when they test it on a string like
This is a <em>first</em> test. You might expect the regex to match <em> and when continuing after that match,
</em>.
But it does not. The regex will match <em>first</em>. Obviously not what we wanted. The reason is that the plus is greedy.
That is, the plus causes the regex engine to repeat the preceding token as often as possible. Only if that causes the entire
regex to fail, will the regex engine backtrack. That is, it will go back to the plus, make it give up the last iteration, and proceed
with the remainder of the regex. Let's take a look inside the regex engine to see in detail how this works and why this causes
our regex to fail. After that, I will present you with two possible solutions.
Like the plus, the star and the repetition using curly braces are greedy.
So the match of .+ is reduced to em>first</em> tes. The next token in the regex is still >. But now the next character in
the string is the last t. Again, these cannot match, causing the engine to backtrack further. The total match so far is reduced to
<em>first</em> te. But > still cannot match. So the engine continues backtracking until the match of .+ is reduced to
em>first</em. Now, > can match the next character in the string. The last token in the regex has been matched. The
engine reports that <em>first</em> has been successfully matched.
Remember that the regex engine is eager to return a match. It will not continue backtracking further to see if there is another
possible match. It will report the first valid match it finds. Because of greediness, this is the leftmost longest match.
Again, < matches the first < in the string. The next token is the dot, this time repeated by a lazy plus. This tells the regex
engine to repeat the dot as few times as possible. The minimum is one. So the engine matches the dot with e. The
requirement has been met, and the engine continues with > and m. This fails. Again, the engine will backtrack. But this time,
the backtracking will force the lazy plus to expand rather than reduce its reach. So the match of .+ is expanded to em, and the
engine tries again to continue with >. Now, > is matched successfully. The last token in the regex has been matched. The
engine reports that <em> has been successfully matched. That's more like it.
An Alternative to Laziness
In this case, there is a better option than making the plus lazy. We can use a greedy plus and a negated character class:
<[^>]+>. The reason why this is better is because of the backtracking. When using the lazy plus, the engine has to backtrack
for each character in the HTML tag that it is trying to match. When using the negated character class, no backtracking occurs
at all when the string contains valid HTML code. Backtracking slows down the regex engine. You will not notice the difference
when doing a single search in a text editor. But you will save plenty of CPU cycles when using such a regex repeatedly in a
tight loop in a script that you are writing.
Finally, remember that this tutorial only talks about regex-directed engines. Text-directed engines do not backtrack. They do
not get the speed penalty, but they also do not support lazy repetition operators.
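The three approaches discussed above can be compared side by side. This Python sketch only shows what each regex matches; it does not measure the speed difference.

```python
import re

subject = "This is a <em>first</em> test"

greedy = re.findall(r"<.+>", subject)          # greedy plus
lazy = re.findall(r"<.+?>", subject)           # lazy plus
negated_class = re.findall(r"<[^>]+>", subject)  # greedy plus, negated class

print(greedy)         # ['<em>first</em>']
print(lazy)           # ['<em>', '</em>']
print(negated_class)  # ['<em>', '</em>']
```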
Note that only parentheses can be used for grouping. Square brackets define a character class, and curly braces are used by
a special repetition operator.
The regex Set(Value)? matches Set or SetValue. In the first case, the first (and only) capturing group remains empty. In
the second case, the first capturing group matches Value.
Non-Capturing Groups
If you do not need the group to capture its match, you can optimize this regular expression into Set(?:Value)?.
The question mark and the colon after the opening parenthesis are the special syntax that you can use to tell the regex engine
that this pair of brackets should not create a backreference. Note the question mark after the opening bracket is unrelated to
the question mark at the end of the regex. That question mark is the regex operator that makes the previous token optional.
This operator cannot appear after an opening parenthesis, because an opening bracket by itself is not a valid regex token.
Therefore, there is no confusion between the question mark as an operator to make a token optional, and the question mark
as a character to change the properties of a pair of parentheses. The colon indicates that the change we want to make is to
turn off capturing the backreference.
color=(?:red|green|blue) is another regex with a non-capturing group. This regex has no quantifiers.
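A minimal Python sketch of capturing versus non-capturing groups. Note one flavor-specific detail: Python reports a group that did not take part in the match as None rather than as an empty string. The sample string for the color regex is an assumption.

```python
import re

capturing = re.compile(r"Set(Value)?")
non_capturing = re.compile(r"Set(?:Value)?")

# Group 1 did not participate when only Set was matched.
group_for_set = capturing.fullmatch("Set").group(1)
group_for_setvalue = capturing.fullmatch("SetValue").group(1)

# Pattern.groups is the number of capturing groups in the pattern.
capturing_count = capturing.groups
non_capturing_count = non_capturing.groups

color = re.compile(r"color=(?:red|green|blue)")
color_match = color.search("background color=green;")  # assumed sample string

print(group_for_set, group_for_setvalue, capturing_count, non_capturing_count)
```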
Regex flavors that support named capture often have an option to turn all unnamed groups into non-capturing groups.
To figure out the number of a particular backreference, scan the regular expression from left to right and count the opening
parentheses. The first bracket starts backreference number one, the second number two, etc. Non-capturing parentheses are
not counted. This fact means that non-capturing parentheses have another benefit: you can insert them into a regular
expression without changing the numbers assigned to the backreferences. This can be very useful when modifying a complex
regular expression.
You can reuse the same backreference more than once. ([a-c])x\1x\1 will match axaxa, bxbxb and cxcxc.
Most regex flavors support up to 99 capturing groups and double-digit backreferences. So \99 is a valid backreference if your
regex has 99 capturing groups.
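The reuse of a backreference, and the fact that non-capturing parentheses are not counted, can be sketched in Python as follows. The rejected strings and the second regex are assumptions added for contrast.

```python
import re

triple = re.compile(r"([a-c])x\1x\1")
candidates = ["axaxa", "bxbxb", "cxcxc", "axbxc", "dxdxd"]
accepted = [s for s in candidates if triple.fullmatch(s)]

# The non-capturing group is skipped when numbering,
# so \1 refers to ([a-c]) and not to (?:x).
numbering = re.fullmatch(r"(?:x)([a-c])\1", "xbb")

print(accepted)            # ['axaxa', 'bxbxb', 'cxcxc']
print(numbering.group(1))  # b
```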
The first token in the regex is the literal <. The regex engine will traverse the string until it can match at the first < in the string.
The next token is [a-z]. The regex engine also takes note that it is now inside the first pair of capturing parentheses. [a-z]
matches b. The engine advances to [a-z0-9] and >. This match fails. However, because of the star, that's perfectly fine.
The position in the string remains at >. The position in the regex is advanced to [^>].
This step crosses the closing bracket of the first pair of capturing parentheses. This prompts the regex engine to store what
was matched inside them into the first backreference. In this case, b is stored.
After storing the backreference, the engine proceeds with the match attempt. [^>] does not match >. Again, because of
another star, this is not a problem. The position in the string remains at >, and position in the regex is advanced to >. These
obviously match. The next token is a dot, repeated by a lazy star. Because of the laziness, the regex engine will initially skip
this token, taking note that it should backtrack in case the remainder of the regex fails.
The engine has now arrived at the second < in the regex, and the second < in the string. These match. The next token is /.
This does not match i, and the engine is forced to backtrack to the dot. The dot matches the second < in the string. The star
is still lazy, so the engine again takes note of the available backtracking position and advances to < and i. These do not
match, so the engine again backtracks.
The backtracking continues until the dot has consumed <i>bold italic. At this point, < matches the third < in the string,
and the next token is / which matches /. The next token is \1. Note that the token is the backreference, and not b. The
engine does not substitute the backreference in the regular expression. Every time the engine arrives at the backreference, it
will read the value that was stored. This means that if the engine had backtracked beyond the first pair of capturing
parentheses before arriving the second time at \1, the new value stored in the first backreference would be used. But this did
not happen here, so b it is. This fails to match at i, so the engine backtracks again, and the dot consumes the third < in the
string.
Backtracking continues again until the dot has consumed <i>bold italic</i>. At this point, < matches < and / matches
/. The engine arrives again at \1. The backreference still holds b. b matches b. The last token in the regex, > matches >. A
complete match has been found: <b><i>bold italic</i></b>.
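The outcome of this walkthrough can be reproduced with a short Python sketch. It assumes the regex under discussion is <([a-z][a-z0-9]*)\b[^>]*>.*?</\1> and uses a sample string chosen to fit the steps above; neither is confirmed by the surrounding text.

```python
import re

paired_tag = re.compile(r"<([a-z][a-z0-9]*)\b[^>]*>.*?</\1>")
subject = "Testing <b><i>bold italic</i></b> text"  # assumed sample string

m = paired_tag.search(subject)
whole = m.group(0)     # the complete match
tag_name = m.group(1)  # what the first capturing group stored

print(whole)     # <b><i>bold italic</i></b>
print(tag_name)  # b
```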
Let's take the regex <([a-z][a-z0-9]*)[^>]*>.*?</\1> without the word boundary and look inside the regex engine at
the point where \1 fails the first time. First, .*? continues to expand until it has reached the end of the string, and </\1> has
failed to match each time .*? matched one more character.
Then the regex engine backtracks into the capturing group. [a-z0-9]* has matched oo, but would just as happily match o or
nothing at all. When backtracking, [a-z0-9]* is forced to give up one character. The regex engine continues, exiting the
capturing group a second time. Since [a-z][a-z0-9]* has now matched bo, that is what is stored into the capturing group,
overwriting boo that was stored before. [^>]* matches the second o in the opening tag. >.*?</ matches >bold<. \1 fails
again.
The regex engine does all the same backtracking once more, until [a-z0-9]* is forced to give up another character, causing
it to match nothing, which the star allows. The capturing group now stores just b. [^>]* now matches oo. >.*?</ once again
matches >bold<. \1 now succeeds, as does > and an overall match is found. But not the one we wanted.
There are several solutions to this. One is to use the word boundary. When [a-z0-9]* backtracks the first time, reducing the
capturing group to bo, \b fails to match between o and o. This forces [a-z0-9]* to backtrack again immediately. The
capturing group is reduced to b and the word boundary fails between b and o. There are no further backtracking positions, so
the whole match attempt fails.
The reason we need the word boundary is that we're using [^>]* to skip over any attributes in the tag. If your paired tags
never have any attributes, you can leave that out, and use <([a-z][a-z0-9]*)>.*?</\1>. Each time [a-z0-9]*
backtracks, the > that follows it will fail to match, quickly ending the match attempt.
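The backtracking into the capturing group can be observed with a Python sketch. The subject string <boo>bold</b> is reconstructed from the steps described above and should be treated as an assumption.

```python
import re

subject = "<boo>bold</b>"  # reconstructed sample string

# Without the word boundary the group shrinks to b and [^>]* absorbs oo.
no_boundary = re.search(r"<([a-z][a-z0-9]*)[^>]*>.*?</\1>", subject)

# With the word boundary every shortened capture is rejected.
with_boundary = re.search(r"<([a-z][a-z0-9]*)\b[^>]*>.*?</\1>", subject)

# Without [^>]* the closing > rejects every shortened capture.
no_attributes = re.search(r"<([a-z][a-z0-9]*)>.*?</\1>", subject)

print(no_boundary.group(0), no_boundary.group(1))  # <boo>bold</b> b
print(with_boundary, no_attributes)                # None None
```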
If you didn't expect the regex engine to backtrack into capturing groups, you can use an atomic group. The regex engine
always backtracks into capturing groups, and never captures atomic groups. You can put the capturing group inside an atomic
group to get an atomic capturing group: (?>(atomic capture)). In this case, we can put the whole opening tag into the
atomic group: (?><([a-z][a-z0-9]*)[^>]*>).*?</\1>. The tutorial section on atomic grouping has all the details.
This also means that ([abc]+)=\1 will match cab=cab, and that ([abc])+=\1 will not. The reason is that when the
engine arrives at \1, it holds b which fails to match c. Obvious when you look at a simple example like this one, but a
common cause of difficulty with regular expressions nonetheless. When using backreferences, always double check that you
are really capturing what you want.
Backreferences also cannot be used inside a character class. The \1 in regex like (a)[\1b] will be interpreted as an octal
escape in most regex flavors. So this regex will match an a followed by either \x01 or a b.
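Both pitfalls can be demonstrated in Python, which is one of the flavors that treats \1 inside a character class as an octal escape. The extra test strings are assumptions added for contrast.

```python
import re

# The quantifier inside the group: the group captures the whole run.
inside = re.fullmatch(r"([abc]+)=\1", "cab=cab")

# The quantifier outside the group: only the last iteration (b) is kept.
outside = re.fullmatch(r"([abc])+=\1", "cab=cab")
outside_last_only = re.fullmatch(r"([abc])+=\1", "cab=b")

# Inside a character class, \1 is the character with code 1, not a backreference.
octal_hit = re.fullmatch(r"(a)[\1b]", "a\x01")
literal_b = re.fullmatch(r"(a)[\1b]", "ab")
not_backref = re.fullmatch(r"(a)[\1b]", "aa")

print(inside.group(1), outside, outside_last_only.group(1))
```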
Backreferences to Failed Groups
The previous section applies to all regex flavors, except those few that don't support capturing groups at all. Flavors behave
differently when you start doing things that don't fit the "match the text matched by a previous capturing group" job description.
There is a difference between a backreference to a capturing group that matched nothing, and one to a capturing group that
did not participate in the match at all. The regex (q?)b\1 will match b. q? is optional and matches nothing, causing (q?) to
successfully match and capture nothing. b matches b and \1 successfully matches the nothing captured by the group.
The regex (q)?b\1 however will fail to match b. (q) fails to match at all, so the group never gets to capture anything at all.
Because the whole group is optional, the engine does proceed to match b. However, the engine now arrives at \1 which
references a group that did not participate in the match attempt at all. This causes the backreference to fail to match at all,
mimicking the result of the group. Since there's no ? making \1 optional, the overall match attempt fails.
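Python is one of the flavors in which a backreference to a non-participating group fails, so the difference can be sketched with its re module. The string qbq is an assumption added to show the case where the group does participate.

```python
import re

# The group participates and captures nothing, so \1 matches nothing.
empty_capture = re.fullmatch(r"(q?)b\1", "b")

# The group does not participate at all, so \1 fails.
no_participation = re.fullmatch(r"(q)?b\1", "b")

# When the group does participate, \1 matches what it captured.
participation = re.fullmatch(r"(q)?b\1", "qbq")

print(empty_capture is not None, no_participation, participation is not None)
```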
One of the few exceptions is JavaScript. According to the official ECMA standard, a backreference to a non-participating
capturing group must successfully match nothing just like a backreference to a participating group that captured nothing does.
In other words, in JavaScript, (q?)b\1 and (q)?b\1 both match b. XPath also works this way.
Dinkumware's implementation of std::regex handles backreferences like JavaScript for all its grammars that support
backreferences. Boost did so too until version 1.46. As of version 1.47, Boost fails backreferences to non-participating groups
when using the ECMAScript grammar, but still lets them successfully match nothing when using the basic and grep grammars.
Java treats backreferences to groups that don't exist as backreferences to groups that exist but never participate in the match.
They are not an error, but simply never match anything.
.NET is a little more complicated. .NET supports single-digit and double-digit backreferences as well as double-digit octal
escapes without a leading zero. Backreferences trump octal escapes. So \12 is a line feed (octal 12 = decimal 10) in a regex
with fewer than 12 capturing groups. It would be a backreference to the 12th group in a regex with 12 or more capturing
groups. .NET does not support single-digit octal escapes. So \7 is an error in a regex with fewer than 7 capturing groups.
Forward References
Modern flavors, notably .NET, Java, Perl, PCRE, PHP, Delphi, and Ruby allow forward references. That is: you can use a
backreference to a group that appears later in the regex. Forward references are obviously only useful if they're inside a
repeated group. Then there can be situations in which the regex engine evaluates the backreference after the group has
already matched. Before the group is attempted, the backreference will fail like a backreference to a failed group does.
If forward references are supported, the regex (\2two|(one))+ will match oneonetwo. At the start of the string, \2 fails.
Trying the other alternative, one is matched by the second capturing group, and subsequently by the first group. The first
group is then repeated. This time, \2 matches one as captured by the second group. two then matches two. With two
repetitions of the first group, the regex has matched the whole subject string.
JavaScript does not support forward references, but does not treat them as an error. In JavaScript, forward references always
find a zero-length match, just as backreferences to non-participating groups do in JavaScript. Because this is not particularly
useful, XRegExp makes them an error. In std::regex, Boost, Python, Tcl, and VBScript forward references are an error.
Nested References
A nested reference is a backreference inside the capturing group that it references. Like forward references, nested
references are only useful if they're inside a repeated group, as in (\1two|(one))+. When nested references are
supported, this regex also matches oneonetwo. At the start of the string, \1 fails. Trying the other alternative, one is matched
by the second capturing group, and subsequently by the first group. The first group is then repeated. This time, \1 matches
one as captured by the last iteration of the first group. It doesn't matter that the regex engine has re-entered the first group.
The text matched by the group was stored into the backreference when the group was previously exited. two then matches
two. With two repetitions of the first group, the regex has matched the whole subject string. If you retrieve the text from the
capturing groups after the match, the first group stores onetwo while the second group captured the first occurrence of one in
the string.
.NET, Java, Perl, and VBScript flavors all support nested references. PCRE does too, but had bugs with backtracking into
capturing groups with nested backreferences. Instead of fixing the bugs, PCRE 8.01 worked around them by forcing capturing
groups with nested references to be atomic. So in PCRE, (\1two|(one))+ is the same as (?>(\1two|(one)))+. This
affects languages with regex engines based on PCRE, such as PHP, Delphi, and R.
JavaScript and Ruby do not support nested references, but treat them as backreferences to non-participating groups instead
of as errors. In JavaScript that means they always match a zero-length string, while in Ruby they always fail to match. In
std::regex, Boost, Python, and Tcl, nested references are an error.
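Since Python treats both forward and nested references as errors, the following sketch only shows that compilation fails. The helper compile_error is a hypothetical name, and the exact error messages are not relied upon.

```python
import re

def compile_error(pattern):
    """Return the error message if the pattern fails to compile, else None."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

forward_error = compile_error(r"(\2two|(one))+")  # \2 precedes group 2
nested_error = compile_error(r"(\1two|(one))+")   # \1 sits inside group 1
ordinary = compile_error(r"(one)\1")              # a regular backreference

print(forward_error)
print(nested_error)
print(ordinary)
```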
Python's re module was the first to offer a solution: named capture. By assigning a name to a capturing group, you can
easily reference it by name. (?P<name>group) captures the match of group into the backreference "name". name must be
an alphanumeric sequence starting with a letter. group can be any regular expression. You can reference the contents of the
group with the named backreference (?P=name). The question mark, P, angle brackets, and equals signs are all part of the
syntax. Though the syntax for the named backreference uses parentheses, it's just a backreference that doesn't do any
capturing or grouping.
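A minimal sketch of the Python syntax, using a doubled-word search as the example. The group name word and the sample string are assumptions.

```python
import re

# (?P<word>...) captures; (?P=word) matches the same text again.
doubled = re.compile(r"\b(?P<word>\w+) (?P=word)\b")

m = doubled.search("it is is fine")  # assumed sample string
whole = m.group(0)
by_name = m.group("word")
by_number = m.group(1)  # named groups also receive a number

print(whole)    # is is
print(by_name)  # is
```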
The regular expression classes of the .NET framework also support named capture. Unfortunately, the Microsoft developers
decided to invent their own syntax, rather than follow the one pioneered by Python.
Here is an example with two capturing groups in .NET style: (?<first>group)(?'second'group). As you can see, .NET
offers two syntaxes to create a capturing group: one using sharp brackets, and the other using single quotes. The first syntax
is preferable in strings, where single quotes may need to be escaped. The second syntax is preferable when adding your
regex to an XML file, as this minimizes the amount of escaping you have to do to format your regex as a literal string or as
XML content.
To reference a capturing group inside the regex, use \k<name> or \k'name'. Again, you can use the two syntactic variations
interchangeably.
Because Python and .NET introduced their own syntax, we refer to these two variants as the "Python syntax" and the ".NET
syntax" for named capture and named backreferences. Today, many other regex flavors have copied this syntax.
Perl 5.10 added support for both the Python and .NET syntax for named capture and backreferences. It also added two more
syntactic variants for named backreferences: \k{one} and \g{two}. There's no difference between the five syntaxes for
named backreferences in Perl. All can be used interchangeably. In the replacement text, you can interpolate the variable
$+{name} to insert the text matched by a named capturing group.
PCRE 7.2 and later support all the syntax for named capture and backreferences that Perl 5.10 supports. Old versions of
PCRE supported the Python syntax, even though that was not "Perl-compatible" at the time. Languages like PHP, Delphi, and
R that implement their regex support using PCRE also support all this syntax. Unfortunately, neither PHP nor R supports named
references in the replacement text. You'll have to use numbered references to the named groups. PCRE does not support
search-and-replace at all.
Java 7 and XRegExp copied the .NET syntax, but only the variant with angle brackets. Ruby 1.9 and later supports both
variants of the .NET syntax. Boost 1.42 and later support named capturing groups using the .NET syntax with angle brackets
or quotes and named backreferences using the \g syntax with curly braces from Perl 5.10. Boost 1.47 additionally supports
backreferences using the \k syntax with angle brackets and quotes from .NET. It also allowed these variants to multiply:
named and numbered backreferences can be specified with \g or \k and with curly braces, angle brackets,
or quotes. So Boost 1.47 and later have six variations of the backreference syntax on top of the basic \1 syntax. This puts
Boost in conflict with Ruby, PCRE, PHP, and R which treat \g with angle brackets or quotes as a subroutine call.
Most flavors number both named and unnamed capturing groups by counting their opening parentheses from left to right.
Adding a named capturing group to an existing regex still upsets the numbers of the unnamed groups. In .NET, however,
unnamed capturing groups are assigned numbers first, counting their opening parentheses from left to right, skipping all
named groups. After that, named groups are assigned the numbers that follow by counting the opening parentheses of the
named groups from left to right.
As an example, the regex (a)(?P<x>b)(c)(?P<y>d) matches abcd as expected. If you do a search-and-replace with this
regex and the replacement \1\2\3\4 or $1$2$3$4 (depending on the flavor), you will get abcd. All four groups were
numbered from left to right, from one to four.
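The left-to-right numbering can be verified in Python, which numbers named and unnamed groups together. The reversed replacement text is an assumption added to make the numbering visible.

```python
import re

pattern = re.compile(r"(a)(?P<x>b)(c)(?P<y>d)")

# All four groups are numbered 1 to 4 in order of their opening parentheses.
replaced = pattern.sub(r"\1\2\3\4", "abcd")
reversed_text = pattern.sub(r"\4\3\2\1", "abcd")

# groupindex maps each group name to its number.
name_to_number = dict(pattern.groupindex)

print(replaced)        # abcd
print(reversed_text)   # dcba
print(name_to_number)  # {'x': 2, 'y': 4}
```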
Things are a bit more complicated with the .NET framework. The regex (a)(?<x>b)(c)(?<y>d) again matches abcd.
However, if you do a search-and-replace with $1$2$3$4 as the replacement, you will get acbd. First, the unnamed groups
(a) and (c) got the numbers 1 and 2. Then the named groups "x" and "y" got the numbers 3 and 4.
In all other flavors that copied the .NET syntax the regex (a)(?<x>b)(c)(?<y>d) still matches abcd. But in all those
flavors, the replacement \1\2\3\4 or $1$2$3$4 (depending on the flavor) gets you abcd. All four groups were numbered
from left to right.
Perl and Ruby also allow groups with the same name. But these flavors only use smoke and mirrors to make it look like all
the groups with the same name act as one. In reality, the groups are separate. In Perl, a backreference matches the text
captured by the leftmost group in the regex with that name that matched something. In Ruby, a backreference matches the
text captured by any of the groups with that name. Backtracking makes Ruby try all the groups.
So in Perl and Ruby, you can only meaningfully use groups with the same name if they are in separate alternatives in the
regex, so that only one of the groups with that name could ever capture any text. Then backreferences to that group sensibly
match the text captured by the group.
For example, if you want to match a followed by a digit 0..5, or b followed by a digit 4..7, and you only care about the digit,
you could use the regex a(?<digit>[0-5])|b(?<digit>[4-7]). In these four flavors, the group named "digit" will
then give you the digit 0..7 that was matched, regardless of the letter. If you want this match to be followed by c and the
exact same digit, you could use:
(?:a(?<digit>[0-5])|b(?<digit>[4-7]))c\k<digit>
PCRE does not allow duplicate named groups by default. PCRE 6.7 and later allow them if you turn on that option or use the
mode modifier (?J). But prior to PCRE 8.36 that wasn't very useful as backreferences always pointed to the first capturing
group with that name in the regex regardless of whether it participated in the match. Starting with PCRE 8.36 (and thus PHP
5.6.9 and R 3.1.3) and also in PCRE2, backreferences point to the first group with that name that actually participated in the
match. Though PCRE and Perl handle duplicate groups in opposite directions, the end result is the same if you follow the
advice to only use groups with the same name in separate alternatives.
Boost allows duplicate named groups. Prior to Boost 1.47 that wasn't useful as backreferences always pointed to the last
group with that name that appears before the backreference in the regex. In Boost 1.47 and later backreferences point to the
first group with that name that actually participated in the match just like in PCRE 8.36 and later.
Python, Java, and XRegExp 3 do not allow multiple groups to use the same name. Doing so will give a regex compilation
error. XRegExp 2 allowed them, but did not handle them correctly.
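In Python the compilation error is easy to reproduce. The sketch below also shows one possible workaround, which is an assumption and not a technique from the text: give each alternative its own group name (digit_a and digit_b are hypothetical) and read whichever group took part in the match.

```python
import re

# Reusing a group name is rejected when the regex is compiled.
try:
    re.compile(r"a(?P<digit>[0-5])|b(?P<digit>[4-7])")
    duplicate_allowed = True
except re.error:
    duplicate_allowed = False

# Possible workaround: distinct names, one per alternative.
either = re.compile(r"a(?P<digit_a>[0-5])|b(?P<digit_b>[4-7])")

def matched_digit(subject):
    """Return the digit captured by whichever named group participated."""
    m = either.fullmatch(subject)
    # lastgroup is the name of the last group that matched.
    return m.group(m.lastgroup) if m else None

print(duplicate_allowed)    # False
print(matched_digit("a3"))  # 3
print(matched_digit("b6"))  # 6
```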
In Perl 5.10, PCRE 8.00, PHP 5.2.14, and Boost 1.42 (or later versions of these) it is best to use a branch reset group when
you want groups in different alternatives to have the same name, as in:
(?|a(?<digit>[0-5])|b(?<digit>[4-7]))c\k<digit>.
With this special syntax, where the group is opened with (?| instead of (?:, the two groups named "digit" really are one and the
same group. Then backreferences to that group are always handled correctly and consistently between these flavors. (Older
versions of PCRE and PHP may support branch reset groups, but don't correctly handle duplicate names in branch reset
groups.)
Relative Backreferences
Some applications support relative backreferences. These use a negative number to reference a group preceding the
backreference. To find the group that the relative backreference refers to, take the absolute number of the backreference and
count that many opening parentheses of (named or unnamed) capturing groups starting at the backreference and going from
right to left through the regex. So (a)(b)(c)\k<-1> matches abcc and (a)(b)(c)\k<-3> matches abca. If the
backreference is inside a capturing group, then you also need to count that capturing group's opening parenthesis. So
(a)(b)(c\k<-2>) matches abcb. (a)(b)(c\k<-1>) either fails to match or is an error depending on whether your
application allows nested backreferences.
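The counting rule above can be sketched in Python. Python's re module has no relative backreferences, so this sketch rewrites \k<-n> into an absolute numbered backreference by counting the opening parentheses of capturing groups seen before it. The helper name resolve_relative_backrefs and its simplified tokenizer are assumptions made for illustration; they are not part of any library, and the tokenizer does not cover every regex construct.

```python
import re

# Simplified tokenizer for this sketch only.
_TOKEN = re.compile(
    r"""
      (?P<rel>\\k<-(?P<n>\d+)>)                   # relative backreference such as \k<-2>
    | (?P<esc>\\.)                                # any other escaped character
    | (?P<cls>\[\^?\]?(?:\\.|[^\]\\])*\])         # character class, skipped as a whole
    | (?P<open>\((?!\?)|\(\?P?<(?![=!])|\(\?')    # opening parenthesis of a capturing group
    | (?P<other>.)                                # anything else
    """,
    re.VERBOSE | re.DOTALL,
)

def resolve_relative_backrefs(pattern):
    """Rewrite each \\k<-n> as the absolute backreference \\N (hypothetical helper)."""
    out = []
    groups_opened = 0  # capturing groups opened to the left of the current token
    for tok in _TOKEN.finditer(pattern):
        if tok.group("open"):
            groups_opened += 1
            out.append(tok.group())
        elif tok.group("rel"):
            # Counting n opening parentheses from right to left lands on this group.
            absolute = groups_opened - int(tok.group("n")) + 1
            if absolute < 1:
                raise ValueError("relative backreference points before the first group")
            out.append("\\%d" % absolute)
        else:
            out.append(tok.group())
    return "".join(out)

print(resolve_relative_backrefs(r"(a)(b)(c)\k<-1>"))
print(resolve_relative_backrefs(r"(a)(b)(c\k<-2>)"))
```

The rewritten patterns can then be passed to re.fullmatch() to check them against abcc, abca, and abcb.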
The syntax for relative backreferences varies widely. It is generally an extension of the syntax for named backreferences. Ruby
1.9 and later supports \k<-1> and \k'-1'. Though this looks like the .NET syntax for named capture, .NET itself does not
support relative backreferences.
Perl 5.10, PCRE 7.0, PHP 5.2.2, and R support \g{-1} and \g-1.
Boost supports the Perl syntax starting with Boost 1.42. Boost adds the Ruby syntax starting with Boost 1.47. To complicate
matters, Boost 1.47 allowed these variants to multiply. Boost 1.47 and later allow relative backreferences to be specified with
\g or \k and with curly braces, angle brackets, or quotes. That makes six variations plus \g-1 for a total of seven variations.
This puts Boost in conflict with Ruby, PCRE, PHP, and R which treat \g with angle brackets or quotes and a negative number
as a relative subroutine call.
Free-Spacing Regular Expressions
Most modern regex flavors support a variant of the regular expression syntax called free-spacing mode. This mode allows for
regular expressions that are much easier for people to read. Of the flavors discussed in this tutorial, only XML Schema and
the POSIX and GNU flavors don't support it. Plain JavaScript doesn't either, but XRegExp does. The mode is usually enabled
by setting an option or flag outside the regex. With flavors that support mode modifiers, you can put (?x) at the very start of the
regex to make the remainder of the regex free-spacing.
In free-spacing mode, whitespace between regular expression tokens is ignored. Whitespace includes spaces, tabs and line
breaks. Note that only whitespace between tokens is ignored. E.g. a b c is the same as abc in free-spacing mode, but \ d
and \d are not the same. The former matches d, while the latter matches a digit. \d is a single regex token composed of a
backslash and a "d". Breaking up the token with a space gives you an escaped space (which matches a space), and a literal
"d".
Likewise, grouping modifiers cannot be broken up. (?>atomic) is the same as (?> ato mic ) and as ( ?>ato mic).
They all match the same atomic group. They're not the same as (? >atomic). In fact, the latter will cause a syntax error.
The ?> grouping modifier is a single element in the regex syntax, and must stay together. This is true for all such constructs,
including lookaround, named groups, etc.
Exactly which spaces and line breaks are ignored depends on the regex flavor. All flavors discussed in this tutorial ignore the
ASCII space, tab, line feed, carriage return, and form feed characters. Boost is the only flavor that ignores all Unicode spaces
and line breaks. Perl always treats non-ASCII spaces as literals. Perl 5.22 and later ignore non-ASCII line breaks. Perl 5.16
and prior treat them as literals. Perl 5.18 and 5.20 treated unescaped non-ASCII line breaks as errors in free-spacing mode to
give developers a transition period.
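Most flavors treat a character class as a single token in free-spacing mode, so whitespace inside it stays literal. This means that in free-spacing mode, you can use \ or [ ] to match a single space. Use whichever you find more readable.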
The hexadecimal escape \x20 also works, of course.
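As a small illustration in Python, where the re.VERBOSE flag enables free-spacing mode, the rules above can be checked directly. The helper name free_spacing_fullmatch is an assumption made for this sketch.

```python
import re

def free_spacing_fullmatch(pattern, text):
    """Return True if pattern, compiled in free-spacing mode, matches all of text."""
    return re.fullmatch(pattern, text, re.VERBOSE) is not None

print(free_spacing_fullmatch(r"a b c", "abc"))  # whitespace between tokens is ignored
print(free_spacing_fullmatch(r"\ d", " d"))     # escaped space followed by a literal d
print(free_spacing_fullmatch(r"\ d", "7"))      # so it is not the same as the \d token
print(free_spacing_fullmatch(r"[ ]", " "))      # a space inside a character class stays literal
print(free_spacing_fullmatch(r"\x20", " "))     # hexadecimal escape for a space
```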
Java, however, does not treat a character class as a single token in free-spacing mode. Java does ignore whitespace and
comments inside character classes. So in Java's free-spacing mode, [abc] is identical to [ a b c ] and \ is the only way
to match a space. However, even in free-spacing mode, the negating caret must appear immediately after the opening
bracket. [ ^ a b c ] matches any of the four characters ^, a, b or c just like [abc^] would. With the negating caret in the
proper place, [^ a b c ] matches any character that is not a, b or c.
Perl 5.26 offers limited free-spacing within character classes as an option. The /x flag enables free-spacing outside character
classes only, as in previous versions of Perl. The double /xx flag additionally makes Perl 5.26 treat unescaped spaces and
tabs inside character classes as free whitespace. Line breaks are still literals inside character classes. PCRE2 10.30 supports
the same /xx mode as Perl 5.26 if you pass the flag PCRE2_EXTENDED_MORE to pcre2_compile().
Perl 5.26 and PCRE2 10.30 also add a new mode modifier (?xx) which enables free-spacing both inside and outside
character classes. (?x) turns on free-spacing outside character classes like before, but also turns off free-spacing inside
character classes. (?-x) and (?-xx) both completely turn off free-spacing.
Java treats the ^ in [ ^ a ] as a literal. Even when spaces are ignored they still break the special meaning of the caret in
Java. Perl 5.26 and PCRE2 10.30 treat ^ in [ ^ a ] as a negation caret in /xx mode. They totally ignore free whitespace, so
they still consider the caret to be at the start of the character class.
Comments in Free-Spacing Mode
Another feature of free-spacing mode is that the # character starts a comment. The comment runs until the end of the line.
Everything from the # until the next line break character is ignored. Most flavors do not recognize any other line break
characters as the end of a comment, even if they recognize other line breaks as free whitespace or allow anchors to match at
other line breaks.
XPath and Oracle do not support comments within the regular expression, even though they have a free-spacing mode.
The # is always treated as a literal character.
Putting it all together, the regex to match a valid date can be clarified by writing it across multiple lines:
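One way this can look, as a minimal sketch in Python (whose re.VERBOSE flag turns on free-spacing mode with # comments): the pattern below, the name DATE, and the yyyy-mm-dd layout are assumptions chosen for illustration.

```python
import re

DATE = re.compile(
    r"""
    (19|20)\d\d                 # century (group 1) followed by two more year digits
    [- /.]                      # separator; the space inside the class is literal
    (0[1-9]|1[012])             # month (group 2)
    [- /.]                      # separator
    (0[1-9]|[12][0-9]|3[01])    # day (group 3)
    """,
    re.VERBOSE,
)

match = DATE.fullmatch("2019-03-31")
print(match.groups())
```

Each comment runs to the end of its line, so the regex itself is unchanged by the annotations.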
Of the flavors discussed in this tutorial, all flavors that support comments in free-spacing mode, except Java and Tcl, also
support (?#comment). The flavors that don't support comments in free-spacing mode or don't support free-spacing mode at
all don't support (?#comment) either.
Branch Reset Groups
Alternatives inside a branch reset group share the same capturing groups. The syntax is (?|regex) where (?| opens the
group and regex is any regular expression. If you don't use any alternation or capturing groups inside the branch reset group,
then its special function doesn't come into play. It then acts as a non-capturing group.
The regex (?|(a)|(b)|(c)) consists of a single branch reset group with three alternatives. This regex matches either a, b,
or c. The regex has only a single capturing group with number 1 that is shared by all three alternatives. After the match, $1
holds a, b, or c.
Compare this with the regex (a)|(b)|(c) that lacks the branch reset group. This regex also matches a, b, or c. But it has
three capturing groups. After the match, $1 holds a or nothing at all, $2 holds b or nothing at all, while $3 holds c or nothing at
all.
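The comparison case is easy to try in Python. Python's re module does not support branch reset groups, so this sketch only shows the regex without one; a group that did not participate in the match is reported as None. The name SEPARATE_GROUPS is an assumption of this sketch.

```python
import re

# Three alternatives, three separate capturing groups.
SEPARATE_GROUPS = re.compile(r"(a)|(b)|(c)")

for subject in ("a", "b", "c"):
    # Only the group belonging to the alternative that matched holds text.
    print(subject, SEPARATE_GROUPS.fullmatch(subject).groups())
```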
Backreferences to capturing groups inside branch reset groups work like you'd expect. (?|(a)|(b)|(c))\1 matches aa,
bb, or cc. Since only one of the alternatives inside the branch reset group can match, the alternative that participates in the
match determines the text stored by the capturing group and thus the text matched by the backreference.
The alternatives in the branch reset group don't need to have the same number of capturing groups.
(?|abc|(d)(e)(f)|g(h)i) has three capturing groups. When this regex matches abc, all three groups are empty. When
def is matched, $1 holds d, $2 holds e and $3 holds f. When ghi is matched, $1 holds h while the other two are empty.
You can have capturing groups before and after the branch reset group. Groups before the branch reset group are numbered
as usual. Groups in the branch reset group are numbered continuing from the groups before the branch reset group, with
each alternative resetting the number. Groups after the branch reset group are numbered continuing from the alternative with
the most groups, even if that is not the last alternative. So (x)(?|abc|(d)(e)(f)|g(h)i)(y) defines five capturing
groups. (x) is group 1, (d) and (h) are group 2, (e) is group 3, (f) is group 4, and (y) is group 5.
If you omit the names in some alternatives, the groups will still share the names with the other alternatives. In the regex
(?'before'x)(?|abc|(?'left'd)(?'middle'e)(?'right'f)|g(h)i)(?'after'y) the group (h) is still named
'left' because the branch reset group makes it share the name and number of (?'left'd).
In Perl, PCRE, and Boost, it is best to use a branch reset group when you want groups in different alternatives to have the
same name. That's the only way in Perl, PCRE, and Boost to make sure that groups with the same name really are one and
the same group.
^(?:(0?[13578]|1[02])/(3[01]|[12][0-9]|0?[1-9]) # 31 days
| (0?[469]|11)/(30|[12][0-9]|0?[1-9]) # 30 days
| (0?2)/([12][0-9]|0?[1-9]) # 29 days
)$
The first version uses a non-capturing group (?:…) to group the alternatives. It has six separate capturing groups. $1 and $2
hold the month and the day for months with 31 days. $3 and $4 hold them for months with 30 days. $5 and $6 are only used
for February.
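In a flavor without branch reset groups, such as Python's re module, the six-group version still works, but the caller has to find out which pair of groups took part in the match. The sketch below uses re.VERBOSE for the free-spacing layout; the names MONTH_DAY and month_and_day are assumptions made for illustration.

```python
import re

MONTH_DAY = re.compile(
    r"""
    ^(?:(0?[13578]|1[02])/(3[01]|[12][0-9]|0?[1-9])   # 31 days
    |   (0?[469]|11)/(30|[12][0-9]|0?[1-9])           # 30 days
    |   (0?2)/([12][0-9]|0?[1-9])                     # 29 days
    )$
    """,
    re.VERBOSE,
)

def month_and_day(text):
    """Return (month, day) from whichever alternative matched, or None."""
    match = MONTH_DAY.match(text)
    if match is None:
        return None
    # Exactly one alternative matched, so exactly two groups are not None.
    month, day = [group for group in match.groups() if group is not None]
    return month, day

print(month_and_day("4/30"))
print(month_and_day("2/30"))
```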
^(?|(0?[13578]|1[02])/(3[01]|[12][0-9]|0?[1-9]) # 31 days
| (0?[469]|11)/(30|[12][0-9]|0?[1-9]) # 30 days
| (0?2)/([12][0-9]|0?[1-9]) # 29 days
)$
The second version uses a branch reset group (?|…) to group the alternatives and merge their capturing groups. The 4th
character is the only difference between these two regexes. Now there are only two capturing groups. These are shared
between the three alternatives. When a match is found, $1 always holds the month and $2 always holds the day, regardless of
the number of days in the month.
Unicode Regular Expressions
Unicode is a character set that aims to define all characters and glyphs from all human languages, living and dead. With more
and more software being required to support multiple languages, or even just any language, Unicode has been strongly
gaining popularity in recent years. Using different character sets for different languages is simply too cumbersome for
programmers and users.
Unfortunately, Unicode brings its own requirements and pitfalls when it comes to regular expressions. Of the regex flavors
discussed in this tutorial, Java, XML and the .NET framework use Unicode-based regex engines. Perl supports Unicode
starting with version 5.6. PCRE can optionally be compiled with Unicode support. Note that PCRE is far less flexible in what it
allows for the \p tokens, despite its name "Perl-compatible". The PHP preg functions, which are based on PCRE, support
Unicode when the /u option is appended to the regular expression.
All Unicode regex engines discussed in this tutorial treat any single Unicode code point as a single character. When this
tutorial tells you that the dot matches any single character, this translates into Unicode parlance as "the dot matches any
single Unicode code point". In Unicode, à can be encoded as two code points: U+0061 (a) followed by U+0300 (grave
accent). In this situation, . applied to à will match a without the accent. ^.$ will fail to match, since the string consists of two
code points. ^..$ matches à.
The Unicode code point U+0300 (grave accent) is a combining mark. Any code point that is not a combining mark can be
followed by any number of combining marks. This sequence, like U+0061 U+0300 above, is displayed as a single grapheme
on the screen.
Unfortunately, à can also be encoded with the single Unicode code point U+00E0 (a with grave accent). The reason for this
duality is that many historical character sets encode "a with grave accent" as a single character. Unicode's designers thought
it would be useful to have a one-on-one mapping with popular legacy character sets, in addition to the Unicode way of
separating marks and base letters (which makes arbitrary combinations not supported by legacy character sets possible).
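The two encodings can be compared in Python, whose re module also treats each code point as one character. This is a minimal sketch; the variable names are assumptions, and unicodedata.normalize() is used only to show that the two strings are canonically equivalent.

```python
import re
import unicodedata

decomposed = "a\u0300"    # U+0061 followed by the combining grave accent U+0300
precomposed = "\u00e0"    # single code point U+00E0

dot_on_decomposed = re.match(r".", decomposed)            # matches only the base letter
one_dot_anchored = re.fullmatch(r".", decomposed)         # fails: two code points
two_dots_anchored = re.fullmatch(r"..", decomposed)       # matches both code points
one_dot_precomposed = re.fullmatch(r".", precomposed)     # matches the single code point

print(len(decomposed), len(precomposed))
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```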
In .NET, Java 8 and prior, and Ruby 1.9 you can use \P{M}\p{M}*+ or (?>\P{M}\p{M}*) as a reasonably close
substitute. To match any number of graphemes, use (?>\P{M}\p{M}*)+ as a substitute for \X+.
Perl and PCRE do not support the \uFFFF syntax. They use \x{FFFF} instead. You can omit leading zeros in the
hexadecimal number between the curly braces. Since \x by itself is not a valid regex token, \x{1234} can never be confused
to match \x 1234 times. It always matches the Unicode code point U+1234. \x{1234}{5678} will try to match code point
U+1234 exactly 5678 times.
In Java, the regex token \uFFFF only matches the specified code point, even when you turned on canonical equivalence.
However, the same syntax \uFFFF is also used to insert Unicode characters into literal strings in the Java source code.
Pattern.compile("\u00E0") will match both the single-code-point and double-code-point encodings of à, while
Pattern.compile("\\u00E0") matches only the single-code-point version. Remember that when writing a regex as a
Java string literal, backslashes must be escaped. The former Java code compiles the regex à, while the latter compiles
\u00E0. Depending on what you're doing, the difference may be significant.
JavaScript, which does not offer any Unicode support through its RegExp class, does support \uFFFF for matching a single
Unicode code point as part of its string syntax.
XML Schema and XPath do not have a regex token for matching Unicode code points. However, you can easily use XML
entities like &#xFFFF; to insert literal code points into your regular expression.
Unicode Categories
In addition to complications, Unicode also brings new possibilities. One is that each Unicode character belongs to a certain
category. You can match a single character belonging to a particular category with \p{}. You can match a single character not
belonging to a particular category with \P{}.
Again, "character" really means "Unicode code point". \p{L} matches a single code point in the category "letter". If your input
string is à encoded as U+0061 U+0300, it matches a without the accent. If the input is à encoded as U+00E0, it matches à
with the accent. The reason is that both the code points U+0061 (a) and U+00E0 (à) are in the category "letter", while U+0300
is in the category "mark".
You should now understand why \P{M}\p{M}* is the equivalent of \X. \P{M} matches a code point that is not a combining
mark, while \p{M}* matches zero or more code points that are combining marks. To match a letter including any diacritics,
use \p{L}\p{M}*. This last regex will always match à, regardless of how it is encoded.
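Python's standard re module has no \p token, so the same idea has to be expressed with the unicodedata module there. The sketch below splits a string into pieces that each consist of one non-mark code point followed by its combining marks, which is the logic behind \P{M}\p{M}*. The function names are assumptions made for illustration.

```python
import unicodedata

def is_mark(char):
    """True if the code point is in one of the Unicode mark categories (Mn, Mc, Me)."""
    return unicodedata.category(char).startswith("M")

def split_base_plus_marks(text):
    """Group each non-mark code point with the combining marks that follow it."""
    pieces = []
    for char in text:
        if is_mark(char) and pieces:
            pieces[-1] += char      # attach the mark to the preceding piece
        else:
            # A new base character. A mark at the very start of the string also
            # lands here, which \P{M}\p{M}* would not match.
            pieces.append(char)
    return pieces

print(split_base_plus_marks("a\u0300b"))
print(split_base_plus_marks("\u00e0b"))
```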
PCRE, PHP, and .NET are case sensitive when they check the part between curly braces of a \p token. \p{Zs} will match any
kind of space character, while \p{zs} will throw an error. All other regex engines described in this tutorial will match the
space in both cases, ignoring the case of the property between the curly braces. Still, it is recommended that you make a
habit of using the same uppercase and lowercase combination as is shown in the list of properties below. This will make your
regular expressions work with all Unicode regex engines.
In addition to the standard notation, \p{L}, Java, Perl and PCRE allow you to use the shorthand \pL. The shorthand only
works with single-letter Unicode properties. \pLl is not the equivalent of \p{Ll}. It is the equivalent of \p{L}l which
matches Al or àl or any Unicode letter followed by a literal l.
Perl also supports the longhand \p{Letter}. You can find a complete list of all Unicode properties below. You may omit the
underscores or use hyphens or spaces instead.
Unicode Scripts
The Unicode standard places each assigned code point (character) into one script. A script is a group of code points used by
a particular human writing system. Some scripts like Thai correspond with a single human language. Other scripts like Latin
span multiple languages.
Some languages are composed of multiple scripts. There is no Japanese Unicode script. Instead, Unicode offers the
Hiragana, Katakana, Han and Latin scripts that Japanese documents are usually composed of.
A special script is the Common script. This script contains all sorts of characters that are common to a wide range of scripts. It
includes all sorts of punctuation, whitespace and miscellaneous symbols.
All assigned Unicode code points (those matched by \P{Cn}) are part of exactly one Unicode script. All unassigned Unicode
code points (those matched by \p{Cn}) are not part of any Unicode script at all.
1. \p{Common}
2. \p{Arabic}
3. \p{Armenian}
4. \p{Bengali}
5. \p{Bopomofo}
6. \p{Braille}
7. \p{Buhid}
8. \p{CanadianAboriginal}
9. \p{Cherokee}
10. \p{Cyrillic}
11. \p{Devanagari}
12. \p{Ethiopic}
13. \p{Georgian}
14. \p{Greek}
15. \p{Gujarati}
16. \p{Gurmukhi}
17. \p{Han}
18. \p{Hangul}
19. \p{Hanunoo}
20. \p{Hebrew}
21. \p{Hiragana}
22. \p{Inherited}
23. \p{Kannada}
24. \p{Katakana}
25. \p{Khmer}
26. \p{Lao}
27. \p{Latin}
28. \p{Limbu}
29. \p{Malayalam}
30. \p{Mongolian}
31. \p{Myanmar}
32. \p{Ogham}
33. \p{Oriya}
34. \p{Runic}
35. \p{Sinhala}
36. \p{Syriac}
37. \p{Tagalog}
38. \p{Tagbanwa}
39. \p{TaiLe}
40. \p{Tamil}
41. \p{Telugu}
42. \p{Thaana}
43. \p{Thai}
44. \p{Tibetan}
45. \p{Yi}
The above scripts can be matched by Perl, PCRE, PHP, Ruby 1.9, Delphi, and XRegExp.
Perl allows you to use \p{IsLatin} instead of \p{Latin}. The "Is" syntax is useful for distinguishing between scripts
and blocks, as explained in the next section. PCRE, PHP, and XRegExp do not support the "Is" prefix.
Java 7 adds support for Unicode scripts. Unlike the other flavors, Java 7 requires the "Is" prefix.
Unicode Blocks
The Unicode standard divides the Unicode character map into different blocks or ranges of code points. Each block is used to
define characters of a particular script like "Tibetan" or belonging to a particular group like "Braille Patterns". Most blocks
include unassigned code points, reserved for future expansion of the Unicode standard.
Note that Unicode blocks do not correspond 100% with scripts. An essential difference between blocks and scripts is that a
block is a single contiguous range of code points, as listed below. Scripts consist of characters taken from all over the Unicode
character map. Blocks may include unassigned code points (i.e. code points matched by \p{Cn}). Scripts never include
unassigned code points. Generally, if you're not sure whether to use a Unicode script or Unicode block, use the script.
E.g. the Currency block does not include the dollar and yen symbols. Those are found in the Basic_Latin and
Latin-1_Supplement blocks instead, for historical reasons, even though both are currency symbols, and the yen symbol is not a
Latin character. You should not blindly use any of the blocks listed below based on their names. Instead, look at the ranges of
characters they actually match. E.g. the Unicode property \p{Sc} or \p{Currency_Symbol} would be a better choice than
the Unicode block \p{InCurrency} when trying to find all currency symbols.
Not all Unicode regex engines use the same syntax to match Unicode blocks. Java, Ruby 2.0, and XRegExp use the
\p{InBlock} syntax as listed above. .NET and XML use \p{IsBlock} instead. Perl supports both notations. I recommend
you use the "In" notation if your regex engine supports it. "In" can only be used for Unicode blocks, while "Is" can also be used
for Unicode properties and scripts, depending on the regular expression flavor you're using. By using "In", it's obvious you're
matching a block and not a similarly named property or script.
In .NET and XML, you must omit the underscores but keep the hyphens in the block names.
Use \p{IsLatinExtended-A} instead of \p{InLatin_Extended-A}. In Java, you must omit the hyphens. .NET and
XML also compare the names case sensitively, while Perl and Ruby compare them case insensitively. Java 4 is case sensitive.
Java 5 and later are case sensitive for the "Is" prefix but not for the block names themselves.
The actual names of the blocks are the same in all regular expression engines. The block names are defined in the Unicode
standard. PCRE does not support Unicode blocks.
If you type the à key on the keyboard, all word processors that I know of will insert the code point U+00E0 into the file. So if
you're working with text that you typed in yourself, any regex that you type in yourself will match in the same way.
Sometimes, the tool or language does not provide the ability to specify matching options. The handy String.matches()
method in Java does not take a parameter for matching options like Pattern.compile() does. Or, the regex flavor may
support matching modes that aren't exposed as external flags. The regex functions in R have ignore.case as their only
option, even though the underlying PCRE library has more matching modes than any other discussed in this tutorial.
In those situations, you can add the following mode modifiers to the start of the regex. To specify multiple modes, simply put
them together as in (?ismx).
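A short Python sketch of modifiers at the start of the regex, with no flags passed to the compile call. The constant names are assumptions made for illustration; i turns on case insensitivity and s makes the dot match line breaks.

```python
import re

CASELESS = re.compile(r"(?i)caseless")      # one mode
CASELESS_DOTALL = re.compile(r"(?is)a.b")   # two modes combined in one modifier
PLAIN = re.compile(r"a.b")                  # no modes, for comparison

print(bool(CASELESS.fullmatch("CaseLESS")))
print(bool(CASELESS_DOTALL.fullmatch("A\nB")))
print(bool(PLAIN.fullmatch("A\nB")))
```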
Flavors that can't apply modifiers to only part of the regex treat a modifier in the middle of the regex as an error. Python is an
exception to this. In Python, putting a modifier in the middle of the regex affects the whole regex. So in Python,
(?i)caseless and caseless(?i) are both case insensitive. In all other flavors, the trailing mode modifier either has no
effect or is an error.
You can quickly test how the regex flavor you're using handles mode modifiers. The regex (?i)te(?-i)st should match
test and TEst, but not teST or TEST.
Modifier Spans
Instead of using two modifiers, one to turn an option on and one to turn it off, you can use a modifier span:
(?i)ignorecase(?-i:casesensitive)ignorecase.
You have probably noticed the resemblance between the modifier span and the non-capturing group (?:group). Technically,
the non-capturing group is a modifier span that does not change any modifiers. But there are flavors, like JavaScript, Python,
and Tcl that support non-capturing groups even though they do not support modifier spans. Like a non-capturing group, the
modifier span does not create a backreference.
Modifier spans are supported by all regex flavors that allow you to use mode modifiers in the middle of the regular expression,
and by those flavors only. These include .NET, Java, Perl and PCRE, PHP, Delphi, and R.
Atomic Grouping
An atomic group is a group that, when the regex engine exits from it, automatically throws away all backtracking positions
remembered by any tokens inside the group. Atomic groups are non-capturing. The syntax is (?>group). Lookaround
groups are also atomic. Atomic grouping is supported by most modern regular expression flavors, including Java, PCRE,
.NET, Perl, Boost, and Ruby. All of these except .NET also support possessive quantifiers, which are essentially a notational
convenience for atomic grouping.
An example will make the behavior of atomic groups clear. The regular expression a(bc|b)c (capturing group) matches
abcc and abc. The regex a(?>bc|b)c (atomic group) matches abcc but not abc.
When applied to abc, both regexes will match a to a, bc to bc, and then c will fail to match at the end of the string. Here their
paths diverge. The regex with the capturing group has remembered a backtracking position for the alternation. The group will
give up its match, b then matches b and c matches c. Match found!
The regex with the atomic group, however, exited from an atomic group after bc was matched. At that point, all backtracking
positions for tokens inside the group are discarded. In this example, the alternation's option to try b at the second position in
the string is discarded. As a result, when c fails, the regex engine has no alternatives left to try.
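The difference can be reproduced in Python. To keep the sketch independent of whether the installed re module accepts (?>group), the atomic group is emulated with a lookahead plus a backreference, (?=(group))\1, relying on the fact stated above that lookaround is atomic. This emulation and the constant names are assumptions of this sketch, not the syntax discussed in the text.

```python
import re

CAPTURING = re.compile(r"a(bc|b)c")
# Emulated a(?>bc|b)c: the lookahead captures what the group would match and
# discards its backtracking positions; \1 then consumes exactly that text.
ATOMIC_EMULATED = re.compile(r"a(?=(bc|b))\1c")

for subject in ("abcc", "abc"):
    print(
        subject,
        CAPTURING.fullmatch(subject) is not None,
        ATOMIC_EMULATED.fullmatch(subject) is not None,
    )
```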
Of course, the above example isn't very useful. But it does illustrate very clearly how atomic grouping eliminates certain
matches. Or more importantly, it eliminates certain match attempts.
Regex Optimization Using Atomic Grouping
Consider the regex \b(integer|insert|in)\b and the subject integers. Obviously, because of the word boundaries,
these don't match. What's not so obvious is that the regex engine will spend quite some effort figuring this out.
\b matches at the start of the string, and integer matches integer. The regex engine makes note that there are two more
alternatives in the group, and continues with \b. This fails to match between the r and s. So the engine backtracks to try the
second alternative inside the group. The second alternative matches in, but then fails to match s. So the engine backtracks
once more to the third alternative. in matches in. \b fails between the n and t this time. The regex engine has no more
remembered backtracking positions, so it declares failure.
This is quite a lot of work to figure out integers isn't in our list of words. We can optimize this by telling the regular
expression engine that if it can't match \b after it matched integer, then it shouldn't bother trying any of the other words.
The word we've encountered in the subject string is a longer word, and it isn't in our list.
We can do this by turning the capturing group into an atomic group: \b(?>integer|insert|in)\b. Now, when integer
matches, the engine exits from an atomic group, and throws away the backtracking positions it stored for the alternation.
When \b fails, the engine gives up immediately. This savings can be significant when scanning a large file for a long list of
keywords. This savings will be vital when your alternatives contain repeated tokens (not to mention repeated groups) that lead
to catastrophic backtracking.
Don't be too quick to make all your groups atomic. As we saw in the first example above, atomic grouping can exclude valid
matches too. Compare how \b(?>integer|insert|in)\b and \b(?>in|integer|insert)\b behave when applied to
insert. The former regex matches, while the latter fails. If the groups weren't atomic, both regexes would match. Remember
that alternation tries its alternatives from left to right. If the second regex matches in, it won't try the two other alternatives due
to the atomic group.
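The effect of the order of alternatives can be checked with a small Python sketch. As an assumption of this sketch, the atomic groups are emulated with (?=(group))\1 so that the code does not depend on native atomic group support; the constant names are likewise made up for illustration.

```python
import re

PLAIN = re.compile(r"\b(in|integer|insert)\b")
# Emulation of \b(?>integer|insert|in)\b
LONGEST_FIRST = re.compile(r"\b(?=(integer|insert|in))\1\b")
# Emulation of \b(?>in|integer|insert)\b
SHORTEST_FIRST = re.compile(r"\b(?=(in|integer|insert))\1\b")

for name, regex in (("plain", PLAIN),
                    ("longest first", LONGEST_FIRST),
                    ("shortest first", SHORTEST_FIRST)):
    found = regex.search("insert")
    print(name, found.group() if found else None)
```

Listing longer words before their prefixes keeps the atomic version from locking onto in too early.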
Possessive Quantifiers
The topic on repetition operators or quantifiers explains the difference between greedy and lazy repetition. Greediness and
laziness determine the order in which the regex engine tries the possible permutations of the regex pattern. A greedy
quantifier will first try to repeat the token as many times as possible, and gradually give up matches as the engine backtracks
to find an overall match. A lazy quantifier will first repeat the token as few times as required, and gradually expand the match
as the engine backtracks through the regex to find an overall match.
Because greediness and laziness change the order in which permutations are tried, they can change the overall regex match.
However, they do not change the fact that the regex engine will backtrack to try all possible permutations of the regular
expression in case no match can be found.
Possessive quantifiers are a way to prevent the regex engine from trying all permutations. This is primarily useful for
performance reasons. You can also use possessive quantifiers to eliminate certain matches.
Of the regex flavors discussed in this tutorial, possessive quantifiers are supported by Java and PCRE. That includes
languages with regex support based on PCRE such as PHP, Delphi, and R. Ruby supports possessive quantifiers starting with
Ruby 1.9, Perl supports them starting with Perl 5.10, and Boost starting with Boost 1.42.
The performance increase can be significant in situations where the regex fails. If the subject is "abc (no closing quote), the
above matching process will happen in the same way, except that the second " fails. When using a possessive quantifier,
there are no steps to backtrack to. The regular expression does not have any alternation or non-possessive quantifiers that
can give up part of their match to try a different permutation of the regular expression. So the match attempt fails immediately
when the second " fails.
Had we used a greedy quantifier instead, the engine would have backtracked. After the " failed at the end of the string, the
[^"]* would give up one match, leaving it with ab. The " would then fail to match c. [^"]* backtracks to just a, and " fails
to match b. Finally, [^"]* backtracks to match zero characters, and " fails to match a. Only at this point have all backtracking
positions been exhausted, and does the engine give up the match attempt. Essentially, this regex performs as many needless
steps as there are characters following the unmatched opening quote.
Now, linear backtracking, as performed by a regex with a single quantifier, is pretty fast. It's unlikely you'll notice the speed difference.
However, when you're nesting quantifiers, a possessive quantifier may save your day. Nesting quantifiers means that you
have one or more repeated tokens inside a group, and the group is also repeated. That's when catastrophic backtracking
often rears its ugly head. In such cases, you'll depend on possessive quantifiers and/or atomic grouping to save the day.
In both regular expressions, the first " will match the first " in the string. The repeated dot then matches the remainder of the
string abc"x. The second " then fails to match at the end of the string.
Now, the paths of the two regular expressions diverge. The possessive dot-star wants it all. No backtracking is done. Since the
" failed, there are no permutations left to try, and the overall match attempt fails. The greedy dot-star, while initially grabbing
everything, is willing to give back. It will backtrack one character at a time. Backtracking to abc", " fails to match x.
Backtracking to abc, " matches ". An overall match "abc" was found.
Essentially, the lesson here is that when using possessive quantifiers, you need to make sure that whatever you're applying
the possessive quantifier to should not be able to match what should follow it. The problem in the above example is that the
dot also matches the closing quote. This prevents us from using a possessive quantifier. The negated character class in the
previous section cannot match the closing quote, so we can make it possessive.
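The lesson can be tried out in Python on the subject "abc"x. As an assumption of this sketch, the possessive quantifiers are emulated with a capturing lookahead followed by a backreference, (?=(token*))\1, so the code does not rely on native possessive quantifier support; the constant names are made up for illustration.

```python
import re

SUBJECT = '"abc"x'

GREEDY_DOT = re.compile(r'".*"')
POSSESSIVE_DOT = re.compile(r'"(?=(.*))\1"')          # emulates ".*+"
POSSESSIVE_CLASS = re.compile(r'"(?=([^"]*))\1"')     # emulates "[^"]*+"

for name, regex in (("greedy dot", GREEDY_DOT),
                    ("possessive dot", POSSESSIVE_DOT),
                    ("possessive class", POSSESSIVE_CLASS)):
    found = regex.search(SUBJECT)
    print(name, found.group() if found else None)
```

The possessive dot swallows the closing quote and cannot give it back, while the negated character class never reaches it in the first place.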
To illustrate: (?:a|b)*+b and (?>(?:a|b)*)b both fail to match b. a|b will match the b. The star is satisfied, and the fact
that it's possessive or the atomic group will cause the star to forget all its backtracking positions. The second b in the regex
has nothing left to match, and the overall match attempt fails.
In the regex (?>a|b)*b, the atomic group forces the alternation to give up its backtracking positions. I.e. if an a is matched, it
won't come back to try b if the rest of the regex fails. Since the star is outside of the group, it is a normal, greedy star. When
the second b fails, the greedy star will backtrack to zero iterations. Then, the second b matches the b in the subject string.
This distinction is particularly important when converting a regular expression written by somebody else using possessive
quantifiers to a regex flavor that doesn't have possessive quantifiers.
Positive lookahead works just the same. q(?=u) matches a q that is followed by a u, without making the u part of the match.
The positive lookahead construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an
equals sign.
You can use any regular expression inside the lookahead (but not lookbehind, as explained below). If it contains
capturing groups then those groups will capture as normal and
backreferences to them will work normally, even outside the lookahead. (The only exception is Tcl, which treats all groups
inside lookahead as non-capturing.) The lookahead itself is not a capturing group. It is not included in the count towards
numbering the backreferences. If you want to store the match of the regex inside a lookahead, you have to put capturing
parentheses around the regex inside the lookahead, like this: (?=(regex)). The other way around will not work, because
the lookahead will already have discarded the regex match by the time the capturing group is to store its match.
Let's try applying the same regex to quit. q matches q. The next token is the u inside the lookahead. The next character is
the u. These match. The engine advances to the next character: i. However, it is done with the regex inside the lookahead.
The engine notes success, and discards the regex match. This causes the engine to step back in the string to u.
Because the lookahead is negative, the successful match inside it causes the lookahead to fail. Since there are no other
permutations of this regex, the engine has to start again at the beginning. Since q cannot match anywhere else, the engine
reports failure.
Let's take one more look inside, to make sure you understand the implications of the lookahead. Let's apply q(?=u)i to
quit. Now the lookahead is positive, and there is a token after it. Again, q matches q and u matches u. Again, the match from
the lookahead must be discarded, so the engine steps back from i in the string to u. The lookahead was successful, so the
engine continues with i. But i cannot match u. So this match attempt fails. All remaining attempts will fail as well, because
there are no more q's in the string.
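Both walkthroughs can be reproduced in Python. This is only a sketch: the negative-lookahead regex is assumed to be q(?!u), and the extra subject string Iraq is added here to show a case where the negative lookahead succeeds.

```python
import re

# Negative lookahead: the q in quit is followed by u, so there is no match.
negative_on_quit = re.search(r"q(?!u)", "quit")

# Positive lookahead followed by a token: after the lookahead succeeds, the
# engine is back at the u, where the literal i cannot match.
positive_then_i = re.search(r"q(?=u)i", "quit")

# A q that is not followed by u satisfies the negative lookahead.
negative_on_iraq = re.search(r"q(?!u)", "Iraq")

print(negative_on_quit, positive_then_i, negative_on_iraq.span())
```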
The construct for positive lookbehind is (?<=text): a pair of parentheses, with the opening parenthesis followed by a
question mark, "less than" symbol and an equals sign. Negative lookbehind is written as (?<!text), using an exclamation
point instead of an equals sign.
The lookbehind continues to fail until the regex reaches the m in the string. The engine again steps back one character, and
notices that the a can be matched there. The positive lookbehind matches. Because it is zero-width, the current position in the
string remains at the m. The next token is b, which cannot match here. The next character is the second a in the string. The
engine steps back, and finds out that the m does not match a.
The next character is the first b in the string. The engine steps back and finds out that a satisfies the lookbehind. b matches b,
and the entire regex has been matched successfully. It matches one character: the first b in the string.
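The trace above can be replayed in Python. Neither the regex nor the subject string survives in this passage, so the sketch assumes they are (?<=a)b and thingamabob, which is consistent with the m, a, and b characters mentioned in the steps.

```python
import re

subject = "thingamabob"  # assumed subject string, see the note above

# (?<=a)b: a b that is immediately preceded by an a.
m = re.search(r"(?<=a)b", subject)
first_match_span = m.span()

# Only the first b qualifies; the last b is preceded by o.
all_spans = [hit.span() for hit in re.finditer(r"(?<=a)b", subject)]

print(first_match_span, all_spans)
```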
The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot
apply a regular expression backwards. Therefore, the regular expression engine needs to be able to figure out how many
steps to step back before checking the lookbehind.
Many regex flavors, including those used by Perl, Python, and Boost, only allow fixed-length strings. You can use literal text,
character escapes, Unicode escapes other than \X, and character classes. You cannot use quantifiers or backreferences. You
can use alternation, but only if all alternatives have the same length. These flavors evaluate lookbehind by first stepping back
through the subject string for as many characters as the lookbehind needs, and then attempting the regex inside the
lookbehind from left to right.
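Python's re module is one of the fixed-length flavors, and it rejects other lookbehinds when the pattern is compiled. The helper name is_accepted below is made up for this sketch.

```python
import re

def is_accepted(pattern):
    """Return True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

literal_text = is_accepted(r"(?<=abc)d")
same_length_alternatives = is_accepted(r"(?<=abc|xyz)d")
different_length_alternatives = is_accepted(r"(?<=ab|xyz)d")
star_quantifier = is_accepted(r"(?<=a*)d")

print(literal_text, same_length_alternatives,
      different_length_alternatives, star_quantifier)
```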
PCRE is not fully Perl-compatible when it comes to lookbehind. While Perl requires alternatives inside lookbehind to have the
same length, PCRE allows alternatives of variable length. PHP, Delphi, R, and Ruby also allow this. Each alternative still has
to be fixed-length. Each alternative is treated as a separate fixed-length lookbehind.
Java takes things a step further by allowing finite repetition. You still cannot use the star or plus, but you can use the question
mark and the curly braces with the max parameter specified. Java determines the minimum and maximum possible lengths of
the lookbehind. The lookbehind in the regex (?<!ab{2,4}c{3,5}d)test has 5 possible lengths. It can be from 7 through
11 characters long. When Java (version 6 or later) tries to match the lookbehind, it first steps back the minimum number of
characters (7 in this example) in the string and then evaluates the regex inside the lookbehind as usual, from left to right. If it
fails, Java steps back one more character and tries again. If the lookbehind continues to fail, Java continues to step back until
the lookbehind either matches or it has stepped back the maximum number of characters (11 in this example). This repeated
stepping back through the subject string kills performance when the number of possible lengths of the lookbehind grows. Keep
this in mind. Don't choose an arbitrarily large maximum number of repetitions to work around the lack of infinite quantifiers
inside lookbehind. Java 4 and 5 have bugs that cause lookbehind with alternation or variable quantifiers to fail when it should
succeed in some situations. These bugs were fixed in Java 6.
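The figures quoted for (?<!ab{2,4}c{3,5}d) follow from adding up the minimum and maximum repetitions of each token. The short Python calculation below only illustrates that arithmetic; it does not model how Java evaluates the lookbehind.

```python
# (minimum, maximum) number of characters matched by each token
# in the lookbehind ab{2,4}c{3,5}d.
token_ranges = [(1, 1), (2, 4), (3, 5), (1, 1)]

shortest = sum(low for low, _ in token_ranges)
longest = sum(high for _, high in token_ranges)
possible_lengths = list(range(shortest, longest + 1))

print(shortest, longest, len(possible_lengths))
```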
The only regex engine that allows you to use a full regular expression inside lookbehind, including infinite repetition and
backreferences, is the .NET framework RegEx classes. This regex engine really applies the regex inside the lookbehind
backwards, going through the regex inside the lookbehind and through the subject string from right to left. It only needs to
evaluate the lookbehind once, regardless of how many different possible lengths it has.
Finally, flavors like std::regex and Tcl do not support lookbehind at all, even though they do support lookahead. JavaScript was
like that for the longest time since its inception. But now lookbehind is part of the ECMAScript 2018 specification. As of this
writing (late 2018), Google's Chrome browser is the only popular JavaScript implementation that supports lookbehind. So if
cross-browser compatibility matters, you can't use lookbehind in JavaScript.
Lookaround Is Atomic
The fact that lookaround is zero-width automatically makes it atomic. As soon as the lookaround condition is satisfied, the
regex engine forgets about everything inside the lookaround. It will not backtrack inside the lookaround to try different
permutations.
The only situation in which this makes any difference is when you use capturing groups inside the lookaround. Since the regex
engine does not backtrack into the lookaround, it will not try different permutations of the capturing groups.
For this reason, the regex (?=(\d+))\w+\1 will never match 123x12. First the lookaround captures 123 into \1. \w+ then
matches the whole string and backtracks until it matches only 1. Finally, \w+ fails since \1 cannot be matched at any position.
Now, the regex engine has nothing to backtrack to, and the overall regex fails. The backtracking steps created by \d+ have
been discarded. It never gets to the point where the lookahead captures only 12.
Obviously, the regex engine does try further positions in the string. If we change the subject string, the regex
(?=(\d+))\w+\1 will match 56x56 in 456x56.
If you don't use capturing groups inside lookaround, then all this doesn't matter. Either the lookaround condition can be
satisfied or it cannot be. In how many ways it can be satisfied is irrelevant.
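Python's lookahead behaves the same way, so both subject strings from this section can be tried directly. A minimal sketch:

```python
import re

pattern = re.compile(r"(?=(\d+))\w+\1")

# The lookahead captures 123 and never retries with 12, so nothing matches.
no_match = pattern.search("123x12")

# Starting at the 5, the lookahead captures 56, which \1 can find again.
found = pattern.search("456x56")

print(no_match, found.group(0), found.group(1))
```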
Testing The Same Part of a String for More Than One Requirement
Lookaround, which was introduced in detail in the previous topic, is a very powerful concept. Unfortunately, it is often
underused by people new to regular expressions, because lookaround is a bit confusing. The confusing part is that the
lookaround is zero-width. So if you have a regex in which a lookahead is followed by another piece of regex, or a lookbehind
is preceded by another piece of regex, then the regex will traverse part of the string twice.
A more practical example makes this clear. Let's say we want to find a word that is six letters long and contains the three
consecutive letters cat. Actually, we can match this without lookaround. We just specify all the options and lump them
together using alternation: cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat. Easy enough. But this method gets unwieldy
if you want to find any word between 6 and 12 letters long containing either "cat", "dog" or "mouse".
Matching a 6-letter word is easy with \b\w{6}\b. Matching a word containing "cat" is equally easy: \b\w*cat\w*\b.
Combining the two, we get: (?=\b\w{6}\b)\b\w*cat\w*\b Easy! Here's how this works. At each character position in the
string where the regex is attempted, the engine will first attempt the regex inside the positive lookahead. This sub-regex, and
therefore the lookahead, matches only when the current character position in the string is at the start of a 6-letter word in the
string. If not, the lookahead will fail, and the engine will continue trying the regex from the start at the next character position in
the string.
The lookahead is zero-width. So when the regex inside the lookahead has found the 6-letter word, the current position in the
string is still at the beginning of the 6-letter word. At this position, the regex engine will attempt the remainder of the regex.
Because we already know that a 6-letter word can be matched at the current position, we know that \b matches and that the
first \w* will match 6 times. The engine will then backtrack, reducing the number of characters matched by \w*, until cat can
be matched. If cat cannot be matched, the engine has no other choice but to restart at the beginning of the regex, at the next
character position in the string. This is at the second letter in the 6-letter word we just found, where the lookahead will fail,
causing the engine to advance character by character until the next 6-letter word.
If cat can be successfully matched, the second \w* will consume the remaining letters, if any, in the 6-letter word. After that,
the last \b in the regex is guaranteed to match where the second \b inside the lookahead matched. Our double-
requirement-regex has matched successfully.
You can discover these optimizations by yourself if you carefully examine the regex and follow how the regex engine applies it,
as I did above. I said the third and last \b are guaranteed to match. Since they are zero-width, and therefore do not change the
result returned by the regex engine, we can remove them, leaving: (?=\b\w{6}\b)\w*cat\w*. Though the last \w* is also
guaranteed to match, we cannot remove it because it adds characters to the regex match. Remember that the lookahead
discards its match, so it does not contribute to the match returned by the regex engine. If we omitted the \w*, the resulting
match would be the start of a 6-letter word containing "cat", up to and including "cat", instead of the entire word.
But we can optimize the first \w*. As it stands, it will match 6 letters and then backtrack. But we know that in a successful
match, there can never be more than 3 letters before "cat". So we can optimize this to \w{0,3}. Note that making the asterisk
lazy would not have optimized this sufficiently. The lazy asterisk would find a successful match sooner, but if a 6-letter word
does not contain "cat", it would still cause the regex engine to try matching "cat" at the last two letters, at the last single letter,
and even at one character beyond the 6-letter word.
So we have (?=\b\w{6}\b)\w{0,3}cat\w*. One last, minor, optimization involves the first \b. Since it is zero-width itself,
there's no need to put it inside the lookahead. So the final regex is: \b(?=\w{6}\b)\w{0,3}cat\w*.
You could replace the final \w* with \w{0,3} too. But it wouldn't make any difference. The lookahead has already checked
that we're at a 6-letter word, and \w{0,3}cat has already matched 3 to 6 letters of that word. Whether we end the regex with
\w* or \w{0,3} doesn't matter, because either way, we'll be matching all the remaining word characters. Because the
resulting match and the speed at which it is found are the same, we may just as well use the version that is easier to type.
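To check that the optimizations leave the result unchanged, the first and the final version of the regex can be run side by side in Python. The sample sentence is invented for this sketch.

```python
import re

text = "a bobcat can locate cattle in a category"  # made-up sample text

first_version = re.compile(r"(?=\b\w{6}\b)\b\w*cat\w*\b")
final_version = re.compile(r"\b(?=\w{6}\b)\w{0,3}cat\w*")

first_hits = first_version.findall(text)
final_hits = final_version.findall(text)

print(first_hits)
print(final_hits)
```

Both lists should hold the same three 6-letter words; category is skipped because the lookahead rejects an 8-letter word before cat is ever tried.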
\b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w* Very easy, once you get the hang of it. This regex will also put "cat",
"dog" or "mouse" into the first backreference.
To overcome the limitations of lookbehind, Perl 5.10, PCRE 7.2, Ruby 2.0, and Boost 1.42 introduce a new feature that can be
used instead of lookbehind for its most common purpose. \K keeps the text matched so far out of the overall regex match.
h\Kd matches only the second d in adhd.
The engine advances one character through the string and attempts the match again. h fails to match d.
Advancing again, h matches h. The engine advances through the regex. The regex has now reached \K in the regex and the
position between h and the second d in the string. \K does nothing other than to tell the engine that if this match attempt ends up
succeeding, the regex engine should pretend that the match attempt started at the present position between h and d, rather
than between the first d and h where it really started.
The engine advances through the regex. d matches the second d in the string. An overall match is found. Because of the
position saved by \K, the second d in the string is returned as the overall match.
\K only affects the position returned after a successful match. It does not move the start of the match attempt during the
matching process. The regex hhh\Kd matches the d in hhhhd. This regex first matches hhh at the start of the string. Then \K
notes the position between hhh and hd in the string. Then d fails to match the fourth h in the string. The match attempt at the
start of the string has failed.
Now the engine must advance one character in the string before starting the next match attempt. It advances from the actual
start of the match attempt, which was at the start of the string. The position stored by \K does not change this. So the second
match attempt begins at the position after the first h in the string. Starting there, hhh matches hhh, \K notes the position, and
d matches d. Now, the position remembered by \K is taken into account, and d is returned as the overall match.
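Python's re module does not implement \K, so the following sketch only emulates it: the part after \K is wrapped in a capturing group, and the span of that group stands in for the match that \K would report. The start of the overall match shows where the attempt really began.

```python
import re

# Emulates h\Kd on adhd.
m = re.search(r"h(d)", "adhd")
attempt_start = m.start(0)   # where the successful attempt really started
reported_span = m.span(1)    # what \K would return as the overall match

# Emulates hhh\Kd on hhhhd. The first attempt at position 0 fails, so the
# successful attempt starts one character further on.
m2 = re.search(r"hhh(d)", "hhhhd")
attempt_start_2 = m2.start(0)
reported_span_2 = m2.span(1)

print(attempt_start, reported_span)
print(attempt_start_2, reported_span_2)
```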
\K Can Be Used Anywhere
You can use \K pretty much anywhere in any regular expression. You should only avoid using it inside lookbehind. You can
use it inside groups, even when they have quantifiers. You can have as many instances of \K in your regex as you like.
(ab\Kc|d\Ke)f matches cf when preceded by ab. It also matches ef when preceded by d.
\K does not affect capturing groups. When (ab\Kc|d\Ke)f matches cf, the capturing group captures abc as if the \K
weren't there. When the regex matches ef, the capturing group stores de.
Limitations of \K
Because \K does not affect the way the regex engine goes through the matching process, it offers a lot more flexibility than
lookbehind in Perl, PCRE, and Ruby. You can put anything to the left of \K, but you're limited to what you can put inside
lookbehind.
But this flexibility does come at a cost. Lookbehind really goes backwards through the string. This allows lookbehind to check for
a match before the start of the match attempt. When the match attempt was started at the end of the previous match,
lookbehind can match text that was part of the previous match. \K cannot do this, precisely because it does not affect the way
the regex engine goes through the matching process.
If you iterate over all matches of (?<=a)a in the string aaaa, you will get three matches: the second, third, and fourth a in the
string. The first match attempt begins at the start of the string and fails because the lookbehind fails. The second match
attempt begins between the first and second a, where the lookbehind succeeds and the second a is matched. The third match
attempt begins after the second a that was just matched. Here the lookbehind succeeds too. It doesn't matter that the
preceding a was part of the previous match. Thus the third match attempt matches the third a. Similarly, the fourth match
attempt matches the fourth a. The fifth match attempt starts at the end of the string. The lookbehind still succeeds, but there
are no characters left for a to match. The match attempt fails. The engine has reached the end of the string and the iteration
stops. Five match attempts have found three matches.
Things are different when you iterate over a\Ka in the string aaaa. You will get only two matches: the second and the fourth
a. The first match attempt begins at the start of the string. The first a in the regex matches the first a in the string. \K notes the
position. The second a matches the second a in the string, which is returned as the first match. The second match attempt
begins after the second a that was just matched. The first a in the regex matches the third a in the string. \K notes the
position. The second a matches the fourth a in the string, which is returned as the second match. The third match attempt begins
at the end of the string. a fails. The engine has reached the end of the string and the iteration stops. Three match attempts
have found two matches.
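The two iterations can be compared in Python. Since re has no \K, a\Ka is emulated here with a(a), taking the span of the group as the reported match; this is an approximation for illustration, not a feature of the module.

```python
import re

subject = "aaaa"

# (?<=a)a: each attempt may look back into the previous match.
lookbehind_spans = [m.span() for m in re.finditer(r"(?<=a)a", subject)]

# a(a) stands in for a\Ka: the first a is consumed, so the next attempt
# can only begin after the a that was just reported.
keep_out_spans = [m.span(1) for m in re.finditer(r"a(a)", subject)]

print(lookbehind_spans)
print(keep_out_spans)
```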
Basically, you'll run into this issue when the part of the regex before the \K can match the same text as the part of the regex
after the \K. If those parts can't match the same text, then a regex using \K will find the same matches as the same regex
rewritten using lookbehind. In that case, you should use \K instead of lookbehind as that will give you better performance in
Perl, PCRE, and Ruby.
Another limitation is that while lookbehind comes in positive and negative variants, \K does not provide a way to negate
anything. (?<!a)b matches the string b entirely, because it is a "b" not preceded by an "a". [^a]\Kb does not match the
string b at all. When attempting the match, [^a] matches b. The regex has now reached the end of the string. \K notes this
position. But now there is nothing left for b to match. The match attempt fails. [^a]\Kb is the same as (?<=[^a])b, which
are both different from (?<!a)b.
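The same comparison in Python, again with a capturing group standing in for \K because re does not support it:

```python
import re

# Negative lookbehind: a b that is not preceded by an a. This also holds
# at the very start of the string, where nothing precedes the b.
negative_lookbehind = re.search(r"(?<!a)b", "b")

# Positive lookbehind on a negated class: some character other than a
# must actually be present before the b.
positive_lookbehind = re.search(r"(?<=[^a])b", "b")

# [^a](b) stands in for [^a]\Kb: [^a] consumes the b, leaving nothing.
keep_out_emulation = re.search(r"[^a](b)", "b")

print(negative_lookbehind.group(0), positive_lookbehind, keep_out_emulation)
```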
If-Then-Else Conditionals in Regular Expressions
A special construct (?ifthen|else) allows you to create conditional regular expressions. If the if part evaluates to true,
then the regex engine will attempt to match the then part. Otherwise, the else part is attempted instead. The syntax consists of
a pair of parentheses. The opening bracket must be followed by a question mark, immediately followed by the if part,
immediately followed by the then part. This part can be followed by a vertical bar and the else part. You may omit the else
part, and the vertical bar with it.
For the if part, you can use the lookahead and lookbehind constructs. Using positive lookahead, the syntax becomes
(?(?=regex)then|else). Because the lookahead has its own parentheses, the if and then parts are clearly separated.
Remember that the lookaround constructs do not consume any characters. If you use a lookahead as the if part, then the
regex engine will attempt to match the then or else part (depending on the outcome of the lookahead) at the same position
where the if was attempted.
Alternatively, you can check in the if part whether a capturing group has taken part in the match thus far. Place the number of
the capturing group inside parentheses, and use that as the if part. Note that although the syntax for a conditional check on a
backreference is the same as a number inside a capturing group, no capturing group is created. The number and the brackets
are part of the if-then-else syntax started with (?.
For the then and else, you can use any regular expression. If you want to use alternation, you will have to group the then or
else together using parentheses, like in:
(?(?=condition)(then1|then2|then3)|(else1|else2|else3)).
Otherwise, there is no need to use parentheses around the then and else parts.
When applied to bd, a fails to match. Since the capturing group containing a is optional, the engine continues with b at the
start of the subject string. Since the whole group was optional, the group did not take part in the match. Any subsequent
backreference to it like \1 will fail. Note that (a)? is very different from (a?). In the former regex, the capturing group does
not take part in the match if a fails, and backreferences to the group will fail. In the latter regex, the capturing group always
takes part in the match, capturing either a or nothing. Backreferences to a capturing group that took part in the match and
captured nothing always succeed. Conditionals evaluating such groups execute the then part. In short: if you want to use a
reference to a group in a conditional, use (a)? instead of (a?).
Continuing with our regex, b matches b. The regex engine now evaluates the conditional. The first capturing group did not
take part in the match at all, so the else part or d is attempted. d matches d and an overall match is found.
Moving on to our second subject string abc, a matches a, which is captured by the capturing group. Subsequently, b matches
b. The regex engine again evaluates the conditional. The capturing group took part in the match, so the then part or c is
attempted. c matches c and an overall match is found.
Our third subject bc does not start with a, so the capturing group does not take part in the match attempt, like we saw with the
first subject string. b still matches b, and the engine moves on to the conditional. The first capturing group did not take part in
the match at all, so the else part or d is attempted. d does not match c and the match attempt at the start of the string fails.
The engine does try again starting at the second character in the string, but fails since b does not match c.
The fourth subject abd is the most interesting one. Like in the second string, the capturing group grabs the a and the b
matches. The capturing group took part in the match, so the then part or c is attempted. c fails to match d, and the match
attempt fails. Note that the else part is not attempted at this point. The capturing group took part in the match, so only the then
part is used. However, the regex engine isn't done yet. It will restart the regular expression from the beginning, moving ahead
one character in the subject string.
Starting at the second character in the string, a fails to match b. The capturing group does not take part in the second match
attempt which started at the second character in the string. The regex engine moves beyond the optional group, and attempts
b, which matches. The regex engine now arrives at the conditional in the regex, and at the third character in the subject string.
The first capturing group did not take part in the current match attempt, so the else part or d is attempted. d matches d and an
overall match bd is found.
If you want to avoid this last match result, you need to use anchors. ^(a)?b(?(1)c|d)$ does not find any matches in the
last subject string. The caret will fail to match at the second and third characters in the string.
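Python supports conditionals on numbered groups, so the four subject strings can be tested directly. A short sketch:

```python
import re

pattern = re.compile(r"(a)?b(?(1)c|d)")

found = {}
for subject in ["bd", "abc", "bc", "abd"]:
    m = pattern.search(subject)
    # Store the matched text, or None when there is no match at all.
    found[subject] = m.group(0) if m else None

# With anchors, the bd inside abd is no longer accepted.
anchored = re.compile(r"^(a)?b(?(1)c|d)$")
anchored_abd = anchored.search("abd")

print(found)
print(anchored_abd)
```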
All these flavors also support named capturing groups. You can use the name of a capturing group instead of its number as
the if test. The syntax is slightly inconsistent between regex flavors. In Python and .NET, you simply specify the name of the
group between parentheses. (?<test>a)?b(?(test)c|d) is the regex from the previous section using named capture. In
Perl or Ruby, you have to put angle brackets or quotes around the name of the group, and put that between the conditional's
parentheses: (?<test>a)?b(?(<test>)c|d) or (?'test'a)?b(?('test')c|d). PCRE supports all three variants.
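A sketch of the named variant in Python. Note that Python's re module declares a named group as (?P<test>...) rather than (?<test>...); the conditional itself uses the bare name between parentheses, as described above.

```python
import re

# Named group declared with Python's (?P<name>...) syntax,
# tested by the conditional (?(test)c|d).
pattern = re.compile(r"(?P<test>a)?b(?(test)c|d)")

found = {}
for subject in ["bd", "abc", "bc", "abd"]:
    m = pattern.search(subject)
    found[subject] = m.group(0) if m else None

print(found)
```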
PCRE 7.2 and later also supports relative conditionals. The syntax is the same as that of a conditional that references a
numbered capturing group with an added plus or minus sign before the group number. The conditional then counts the
opening parentheses to the left (minus) or to the right (plus) starting at the (?( that opens the conditional.
(a)?b(?(-1)c|d) is another way of writing the above regex. The benefit is that this regex won't break if you add capturing
groups at the start or the end of the regex.
Python supports conditionals using a numbered or named capturing group. Python does not support conditionals using
lookaround, even though Python does support lookaround outside conditionals. Instead of a conditional like
(?(?=regex)then|else), you can alternate two opposite lookarounds: (?=regex)then|(?!regex)else.
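A minimal Python sketch of this workaround. The condition (is the next character a digit?), the then part \d+, and the else part [a-z]+ are all made up for the example.

```python
import re

def is_accepted(pattern):
    """Return True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A lookaround as the if part is rejected by Python's re module.
lookaround_conditional_ok = is_accepted(r"(?(?=\d)\d+|[a-z]+)")

# Two opposite lookarounds: digits if a digit comes next, letters otherwise.
alternated = re.compile(r"(?=\d)\d+|(?!\d)[a-z]+")
tokens = alternated.findall("abc 123 x9")

print(lookaround_conditional_ok, tokens)
```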
The second part of the pattern is the if-then-else conditional (?(2)\w+@\w+\.[a-z]+|.+). The if part checks whether the
second capturing group took part in the match thus far. It will have taken part if the header is the From or To header. In that
case, the then part of the conditional \w+@\w+\.[a-z]+ tries to match an email address. To keep the example simple, we
use an overly simple regex to match the email address, and we don't try to match the display name that is usually also part of
the From or To header.
If the second capturing group did not participate in the match thus far, the else part .+ is attempted instead. This simply
matches the remainder of the line, allowing for any test subject.
Finally, we place an extra pair of parentheses around the conditional. This captures the contents of the email header matched
by the conditional into the third backreference. The conditional itself does not capture anything. When implementing this
regular expression, the first capturing group will store the name of the header ("From", "To", or "Subject"), and the third
capturing group will store the value of the header.
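The pattern can be tried in Python. The first part of the regex does not appear in this passage, so ^((From|To)|Subject):  below is an assumption reconstructed from the description of the capturing groups; the sample header lines are invented.

```python
import re

# Assumed first part: group 1 is the header name, group 2 is set only for
# From or To. Group 3 wraps the conditional described above.
header = re.compile(r"^((From|To)|Subject): ((?(2)\w+@\w+\.[a-z]+|.+))")

from_line = header.search("From: alice@example.com")
subject_line = header.search("Subject: Hello world")
bad_to_line = header.search("To: not an address")

print(from_line.group(1), from_line.group(3))
print(subject_line.group(1), subject_line.group(3))
print(bad_to_line)
```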
You could try to match even more headers by putting another conditional into the else part. E.g.
As you can see, regular expressions using conditionals quickly become unwieldy. I recommend that you only use them if one
regular expression is all your tool allows you to use. When programming, you're far better off using the regex
^(From|To|Date|Subject): (.+) to capture one header with its unvalidated contents. In your source code, check the
name of the header returned in the first capturing group, and then use a second regular expression to validate the contents of
the header returned in the second capturing group of the first regex. Though you'll have to write a few lines of extra code, this
code will be much easier to understand and maintain. If you precompile all the regular expressions, using multiple regular
expressions will be just as fast, if not faster, than the one big regex stuffed with conditionals.
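One way to structure this in Python is sketched below. The function name parse_header and the VALIDATORS table are hypothetical, and the email regex is the deliberately simple one used earlier, so treat this as an outline rather than a real header parser.

```python
import re

# First regex: capture one header with its unvalidated contents.
HEADER = re.compile(r"^(From|To|Date|Subject): (.+)")

# Second regexes: one precompiled validator per header that needs checking.
EMAIL = re.compile(r"\w+@\w+\.[a-z]+")
VALIDATORS = {"From": EMAIL, "To": EMAIL}

def parse_header(line):
    """Return (name, value) for a valid header line, otherwise None."""
    m = HEADER.match(line)
    if m is None:
        return None
    name, value = m.group(1), m.group(2)
    validator = VALIDATORS.get(name)
    # Headers without a validator are accepted as they are.
    if validator is not None and validator.fullmatch(value) is None:
        return None
    return name, value

print(parse_header("From: bob@example.org"))
print(parse_header("To: nobody"))
print(parse_header("Subject: Hello world"))
```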
The name 'subtract' must be the name of another group in the regex. When the regex engine enters the balancing group,
it subtracts one match from the group "subtract". If the group 'subtract' did not match yet, or if all its matches were already
subtracted, then the balancing group fails to match. You could think of a balancing group as a conditional that tests the group
'subtract', with 'regex' as the 'then' part and an 'else' part that always fails to match. The difference is that the
balancing group has the added feature of subtracting one match from the group 'subtract', while a conditional leaves the
group untouched.
If the balancing group succeeds and it has a name ('capture' in this example), then the group captures the text between
the end of the match that was subtracted from the group 'subtract' and the start of the match of the balancing group itself
('regex' in this example).
The reason this works in .NET is that capturing groups in .NET keep a stack of everything they captured during the matching
process that wasn't backtracked or subtracted. Most other regex engines only store the most recent match of each capturing
group. When (\w)+ matches abc then Match.Groups[1].Value returns c as with other regex engines, but
Match.Groups[1].Captures stores all three iterations of the group: a, b, and c.
The balancing group too has + as its quantifier. The engine again finds that the subtracted group 'open' captured something,
namely the first o. The regex enters the balancing group, leaving the group 'open' without any matches. c matches the
second c in the string. The group 'between' captures oc which is the text between the match subtracted from 'open' (the
first o) and the second c just matched by the balancing group.
The balancing group is repeated again. But this time, the regex engine finds that the group 'open' has no matches left. The
balancing group fails to match. The group 'between' is unaffected, retaining its most recent capture.
The + is satisfied with two iterations. The engine has reached the end of the regex. It returns oocc as the overall match.
Match.Groups['open'].Success will return false, because all the captures of that group were subtracted.
Match.Groups['between'].Value returns "oc".
But the regex ^(?'open'o)+(?'-open'c)+$ still matches ooc. The matching process is again the same until the
balancing group has matched the first c and left the group 'open' with the first o as its only capture. The quantifier makes
the engine attempt the balancing group again. The engine again finds that the subtracted group 'open' captured something.
The regex enters the balancing group, leaving the group 'open' without any matches. But now, c fails to match because the
regex engine has reached the end of the string.
The regex engine must now backtrack out of the balancing group. When backtracking a balancing group, .NET also
backtracks the subtraction. Since the capture of the first o was subtracted from 'open' when entering the balancing
group, this capture is now restored while backtracking out of the balancing group. The repeated group (?'-open'c)+ is now
reduced to a single iteration. But the quantifier is fine with that, as + means "once or more" as it always does. Still at the end of
the string, the regex engine reaches $ in the regex, which matches. The whole string ooc is returned as the overall match.
Match.Groups['open'].Captures will hold the first o in the string as the only item in the CaptureCollection. That's
because, after backtracking, the second o was subtracted from the group, but the first o was not.
To make sure the regex matches oc and oocc but not ooc, we need to check that the group 'open' has no captures left
when the matching process reaches the end of the regex. We can do this with a conditional. (?(open)(?!)) is a conditional
that checks whether the group 'open' matched something. In .NET, having matched something means still having captures
on the stack that weren't backtracked or subtracted. If the group has captured something, the "then" part of the conditional is
evaluated. In this case that is the empty negative lookahead (?!).
The empty string inside this lookahead always matches. Because the lookahead is negative, this causes the lookahead to
always fail. Thus the conditional always fails if the group has captured something. If the group has not captured anything, the
"else" part of the conditional is evaluated. In this case there is no "else" part. This means that the conditional always
succeeds if the group has not captured something. This makes (?(open)(?!)) a proper test to verify that the group
'open' has no captures left.
The regex ^(?'open'o)+(?'-open'c)+(?(open)(?!))$ fails to match ooc. When c fails to match because the regex
engine has reached the end of the string, the engine backtracks out of the balancing group, leaving 'open' with a single
capture. The regex engine now reaches the conditional, which fails to match. The regex engine will backtrack trying different
permutations of the quantifiers, but they will all fail to match. No match can be found.
The regex ^(?'open'o)+(?'-open'c)+(?(open)(?!))$ does match oocc. After (?'-open'c)+ has matched cc, the
regex engine cannot enter the balancing group a third time, because 'open' has no captures left. The engine advances to
the conditional. The conditional succeeds because 'open' has no captures left and the conditional does not have an "else"
part. Now $ matches at the end of the string.
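Python's re module has no balancing groups, so the regex itself cannot be run there. The sketch below instead emulates its bookkeeping with an explicit list used as the capture stack of the group 'open'. It is an illustration of the idea, not a description of how .NET implements it, and the function name is made up.

```python
def matches_balanced(subject):
    """Emulate ^(?'open'o)+(?'-open'c)+(?(open)(?!))$ with an explicit stack."""
    stack = []
    pos = 0
    # (?'open'o)+ : push one capture per o; at least one o is required.
    while pos < len(subject) and subject[pos] == "o":
        stack.append(pos)
        pos += 1
    if not stack:
        return False
    # (?'-open'c)+ : each c subtracts one capture and needs one to subtract.
    closed = 0
    while pos < len(subject) and subject[pos] == "c" and stack:
        stack.pop()
        closed += 1
        pos += 1
    if closed == 0:
        return False
    # (?(open)(?!))$ : no capture may be left, and the string must end here.
    return not stack and pos == len(subject)

for word in ["oc", "oocc", "ooc", "occ"]:
    print(word, matches_balanced(word))
```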
This is the generic solution for matching balanced constructs using .NET's balancing groups or capturing group subtraction
feature. You can replace o, m, and c with any regular expression, as long as no two of these three can match the same text.
Let's see how (?'x'[ab]){2}(?'-x')\k'x' matches aba. The first iteration of (?'x'[ab]) captures a. The second
iteration captures b. Now the regex engine reaches the balancing group (?'-x'). It checks whether the group 'x' has
matched, which it has. The engine enters the balancing group, subtracting the match b from the stack of group 'x'. There are
no regex tokens inside the balancing group. It matches without advancing through the string. Now the regex engine reaches
the backreference \k'x'. The match at the top of the stack of group 'x' is a. The next character in the string is also an a
which the backreference matches. aba is found as an overall match.
When you apply this regex to abb, the matching process is the same, except that the backreference fails to match the second
b in the string. Since the regex has no other permutations that the regex engine can try, the match attempt fails.
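The steps of this walk-through can be traced with an ordinary list as the capture stack. This is only a sketch in Python: the name match_at_start is hypothetical, and it attempts the regex at the start of the string only, where a real engine would also try later positions.

```python
def match_at_start(text):
    stack = []  # capture stack of group 'x'
    pos = 0
    # (?'x'[ab]){2} : two iterations, each pushing one letter
    for _ in range(2):
        if pos < len(text) and text[pos] in "ab":
            stack.append(text[pos])
            pos += 1
        else:
            return None
    # (?'-x') : pop the most recent capture without consuming any text
    stack.pop()
    # \k'x' : must match the capture that is now on top of the stack
    top = stack[-1]
    if text.startswith(top, pos):
        return text[:pos + len(top)]
    return None
```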
Matching Palindromes
^(?'letter'[a-z])+[a-z]?(?:\k'letter'(?'-letter'))+(?(letter)(?!))$ matches palindrome words of any
length. This regular expression takes advantage of the fact that backreferences and capturing group subtraction work well
together. It also uses an empty balancing group, like the regex in the previous section.
Let's see how this regex matches the palindrome radar. ^ matches at the start of the string. Then (?'letter'[a-z])+
iterates five times. The group 'letter' ends up with five matches on its stack: r, a, d, a, and r. The regex engine is now at
the end of the string and at [a-z]? in the regex. It doesn't match, but that's fine, because the quantifier makes it optional. The
engine now reaches the backreference \k'letter'. The group 'letter' has r at the top of its stack. This fails to match
the void after the end of the string.
The regex engine backtracks. (?'letter'[a-z])+ is reduced to four iterations, leaving r, a, d, and a on the stack of the
group 'letter'. [a-z]? matches r. The backreference again fails to match the void after the end of the string. The engine
backtracks, forcing [a-z]? to give up its match. Now 'letter' has a at the top of its stack. This causes the backreference
to fail to match r.
More backtracking follows. (?'letter'[a-z])+ is reduced to three iterations, leaving d at the top of the stack of the group
'letter'. The engine again proceeds with [a-z]?. It fails again because there is no d for the backreference to match.
Backtracking once more, the capturing stack of group 'letter' is reduced to r and a. Now the tide turns. [a-z]? matches
d. The backreference matches a which is the most recent match of the group 'letter' that wasn't backtracked. The engine
now reaches the empty balancing group (?'-letter'). This matches, because the group 'letter' has a match a to
subtract.
The backreference and balancing group are inside a repeated non-capturing group, so the engine tries them again. The
backreference matches r and the balancing group subtracts it from the stack of 'letter', leaving the capturing group without any
matches. Iterating once more, the backreference fails, because the group 'letter' has no matches left on its stack. This
makes the group act as a non-participating group. Backreferences to non-participating groups always fail in .NET, as they do
in most regex flavors.
(?:\k'letter'(?'-letter'))+ has successfully matched two iterations. Now, the conditional (?(letter)(?!))
succeeds because the group 'letter' has no matches left. The anchor $ also matches. The palindrome radar has been
matched.
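The successful path found above (push the first half, skip an optional middle letter, then pop while the letters agree) can be written down directly. This Python sketch is an assumption-laden shortcut, not the engine's algorithm: it jumps straight to the split that the regex engine only reaches after backtracking, and the name is_palindrome_by_stack is hypothetical.

```python
def is_palindrome_by_stack(word):
    # [a-z] in the regex only accepts lowercase ASCII letters
    if not all("a" <= ch <= "z" for ch in word):
        return False
    half = len(word) // 2
    if half == 0:
        return False  # the regex needs at least one push and one pop
    stack = list(word[:half])    # (?'letter'[a-z])+ pushes the first half
    pos = half + len(word) % 2   # [a-z]? consumes the middle letter of odd-length words
    # (?:\k'letter'(?'-letter'))+ : compare with the top of the stack, then pop
    while stack and pos < len(word) and word[pos] == stack[-1]:
        stack.pop()
        pos += 1
    # (?(letter)(?!)) needs an empty stack; $ needs the end of the string
    return not stack and pos == len(word)
```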
While Ruby 1.9 does not have any syntax for regex recursion, it does support capturing group recursion. So you could recurse
the whole regex in Ruby 1.9 if you wrap the whole regex in a capturing group. .NET does not support recursion, but it supports
balancing groups that can be used instead of recursion to match balanced constructs.
As we'll see later, there are differences in how Perl, PCRE, and Ruby deal with backreferences and backtracking during
recursion. While they copied each other's syntax, they did not copy each other's behavior. Boost 1.42 copied the syntax from
Perl. But its implementation is marred by bugs. Boost 1.60 attempted to fix the behavior of quantifiers on recursion, but it's still
quite different from other flavors and incompatible with previous versions of Boost. Boost 1.64 finally stopped crashing upon
infinite recursion. But recursion of the whole regex still attempts only the first alternative.
Simple Recursion
The regexes a(?R)?z, a(?0)?z, and a\g<0>?z all match one or more letters a followed by exactly the same number of
letters z. Since these regexes are functionally identical, we'll use the syntax with R for recursion to see how this regex
matches the string aaazzz.
First, a matches the first a in the string. Then the regex engine reaches (?R). This tells the engine to attempt the whole regex
again at the present position in the string. Now, a matches the second a in the string. The engine reaches (?R) again. On the
second recursion, a matches the third a. On the third recursion, a fails to match the first z in the string. This causes (?R) to
fail. But the regex uses a quantifier to make (?R) optional. So the engine continues with z which matches the first z in the
string.
Now, the regex engine has reached the end of the regex. But since it's two levels deep in recursion, it hasn't found an overall
match yet. It only has found a match for (?R). Exiting the recursion after a successful match, the engine also reaches z. It
now matches the second z in the string. The engine is still one level deep in recursion, from which it exits with a successful
match. Finally, z matches the third z in the string. The engine is again at the end of the regex. This time, it's not inside any
recursion. Thus, it returns aaazzz as the overall regex match.
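Regex recursion maps naturally onto a recursive function. The following Python sketch mirrors a(?R)?z by hand, since Python's re module has no recursion syntax. The function name match_azr is hypothetical. It returns the position where the match starting at pos ends, or None if there is no match.

```python
def match_azr(text, pos=0):
    # a
    if not text.startswith("a", pos):
        return None
    pos += 1
    # (?R)? : try the whole regex again at the current position
    inner = match_azr(text, pos)
    if inner is not None and text.startswith("z", inner):
        return inner + 1
    # The recursion is optional: skip it and try z directly
    if text.startswith("z", pos):
        return pos + 1
    return None
```

Called on aaazzz, the function nests two levels deep before the first z is matched, just like the walk-through above.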
A common real-world use is to match a balanced set of parentheses. \((?>[^()]|(?R))*\) matches a single pair of
parentheses with any text in between, including an unlimited number of parentheses, as long as they are all properly paired. If
the subject string contains unbalanced parentheses, then the first regex match is the leftmost pair of balanced parentheses,
which may occur after unbalanced opening parentheses. If you want a regex that does not find any matches in a string that
contains unbalanced parentheses, then you need to use a subroutine call instead of recursion. If you want to find a sequence
of multiple pairs of balanced parentheses as a single match, then you also need a subroutine call.
This regular expression does not work correctly in Boost. If a regex has alternation that is not inside a group then recursion of
the whole regex in Boost only attempts the first alternative. So \((?R)*\)|[^()]+ in Boost matches any number of
balanced parentheses nested arbitrarily deep with no text in between, or any text that does not contain any parentheses at all.
If you flip the alternatives then [^()]+|\((?R)*\) in Boost matches any text without any parentheses or a single pair of
parentheses with any text without parentheses in between. In all other flavors these two regexes find the same matches.
The solution for Boost is to put the alternation inside a group. (?:\((?R)*\)|[^()]+) and (?:[^()]+|\((?R)*\)) find
the same matches in all flavors discussed in this tutorial that support recursion.
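In a language without regex recursion, the "leftmost pair of balanced parentheses" behavior described above can be approximated with a depth counter. This Python sketch swaps recursion for plain counting, so it is an illustration of the result and not of how a recursive regex engine works. The name first_balanced is hypothetical.

```python
def first_balanced(text):
    # Try every opening parenthesis from left to right as a starting point.
    for start, ch in enumerate(text):
        if ch != "(":
            continue
        depth = 0
        for end in range(start, len(text)):
            if text[end] == "(":
                depth += 1
            elif text[end] == ")":
                depth -= 1
                if depth == 0:
                    return text[start:end + 1]
        # This opening parenthesis is never closed; try the next one.
    return None
```

With the subject ((a) the sketch skips the unbalanced first parenthesis and returns (a).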
Regular Expression Subroutines
Perl 5.10, PCRE 4.0, and Ruby 1.9 support regular expression subroutine calls. These are very similar to regular expression
recursion. Instead of matching the entire regular expression again, a subroutine call only matches the regular expression
inside a capturing group. You can make a subroutine call to any capturing group from anywhere in the regex. If you place a
call inside the group that it calls, you'll have a recursive capturing group.
As with regex recursion, there is a wide variety of syntax that you can use for exactly the same thing. Perl uses (?1) to call a
numbered group, (?+1) to call the next group, (?-1) to call the preceding group, and (?&name) to call a named group. You
can use all of these to reference the same group. (?+1)(?'name'[abc])(?1)(?-1)(?&name) matches a string that is
five letters long and consists only of the first three letters of the alphabet. This regex is exactly the same as
[abc](?'name'[abc])[abc][abc][abc].
PCRE was the first regex engine to support subroutine calls. (?P<name>[abc])(?1)(?P>name) matches three letters like
(?P<name>[abc])[abc][abc] does. (?1) is a call to a numbered group and (?P>name) is a call to a named group. The
latter is called the "Python syntax" in the PCRE man page. While this syntax mimics the syntax Python uses for named
capturing groups, it is a PCRE invention. Python does not support subroutine calls or recursion. PCRE 7.2 added (?+1) and
(?-1) for relative calls. PCRE 7.7 adds all the syntax used by Perl 5.10 and Ruby 2.0. Recent versions of PHP, Delphi, and R
also support all this syntax, as their regex functions are based on PCRE.
The syntax used by Ruby 1.9 and later looks more like that of backreferences. \g<1> and \g'1' call a numbered group,
\g<name> and \g'name' call a named group, while \g<-1> and \g'-1' call the preceding group. Ruby 2.0 adds
\g<+1> and \g'+1' to call the next group.
As we'll see later, there are differences in how Perl, PCRE, and Ruby deal with capturing, backreferences, and backtracking
during subroutine calls. While they copied each other's syntax, they did not copy each other's behavior. Boost 1.42 copied the
syntax from Perl but its implementation is marred by bugs, which are still not all fixed in version 1.62. Most significantly,
quantifiers other than * or {0,} cause subroutine calls to misbehave. This is partially fixed in Boost 1.60 which correctly
handles ? and {0,1} too.
Boost does not support the Ruby syntax for subroutine calls. In Boost \g<1> is a backreference — not a subroutine call — to
capturing group 1. So ([ab])\g<1> can match aa and bb but not ab or ba. In Ruby the same regex would match all four
strings. No other flavor discussed in this tutorial uses this syntax for backreferences.
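The difference between repeating the captured text and repeating the pattern is easy to check in any flavor. Python's re module has neither the \g<1> syntax in patterns nor subroutine calls, so this sketch uses \1 for the backreference and writes out [ab] by hand as the expanded equivalent of a call. The variable names are illustrative.

```python
import re

backreference = re.compile(r"([ab])\1")    # repeats the text captured by group 1
expanded_call = re.compile(r"([ab])[ab]")  # repeats the pattern of group 1

words = ["aa", "bb", "ab", "ba"]
backreference_hits = [w for w in words if backreference.fullmatch(w)]
expanded_call_hits = [w for w in words if expanded_call.fullmatch(w)]
```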
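\A(\((?>[^()]|(?1))*\))\z matches a string that consists of nothing but a correctly balanced pair of parentheses,
possibly with text between them. \A[^()]*+(\((?>[^()]|(?1))*+\)[^()]*+)++\z matches a string that contains one or
more correctly balanced pairs of parentheses, possibly with text before, between, and after them.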
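For comparison, whole-string checks of this kind can be written without subroutine calls. The two Python functions below are a sketch that uses a depth counter in place of the regex. The first accepts a string that is exactly one balanced pair, the second accepts a string in which all parentheses are balanced and at least one pair is present. Both names are hypothetical.

```python
def is_one_balanced_pair(text):
    if not text.startswith("("):
        return False
    depth = 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                # The outermost pair must close on the very last character.
                return i == len(text) - 1
    return False


def has_only_balanced_pairs(text):
    depth = 0
    pairs = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                return False  # closing parenthesis without an opener
            depth -= 1
            if depth == 0:
                pairs += 1
    return depth == 0 and pairs >= 1
```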
Matching The Same Construct More Than Once
A regex that needs to match the same kind of construct (but not the exact same text) more than once in different parts of the
regex can be shorter and more concise when using subroutine calls. Suppose you need a regex to match patient records like
these:
Name: John Doe
Born: 17-Jan-1964
Admitted: 30-Jul-2013
Released: 3-Aug-2013
Further suppose that you need to match the date format rather accurately so the regex can filter out valid records, leaving
invalid records for human inspection. In most regex flavors you could easily do this with this regex, using free-spacing syntax:
^Name:\ (.*)\r?\n
Born:\ (?:3[01]|[12][0-9]|[1-9])
-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
-(?:19|20)[0-9][0-9]\r?\n
Admitted:\ (?:3[01]|[12][0-9]|[1-9])
-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
-(?:19|20)[0-9][0-9]\r?\n
Released:\ (?:3[01]|[12][0-9]|[1-9])
-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
-(?:19|20)[0-9][0-9]$
With subroutine calls you can make this regex much shorter, easier to read, and easier to maintain:
^Name:\ (.*)\r?\n
Born:\ (?'date'(?:3[01]|[12][0-9]|[1-9])
-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
-(?:19|20)[0-9][0-9])\r?\n
Admitted:\ \g'date'\r?\n
Released:\ \g'date'$
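In a flavor without subroutine calls, a similar saving can come from the host language. The sketch below assumes Python's re module, which has no subroutine calls, and reuses the date pattern by string interpolation in free-spacing mode. The names DATE, RECORD, and sample are illustrative. Unlike a subroutine call, each interpolated copy here is wrapped in its own capturing group.

```python
import re

# The date pattern is written once and reused three times below.
DATE = (r"(?:3[01]|[12][0-9]|[1-9])"
        r"-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
        r"-(?:19|20)[0-9][0-9]")

RECORD = re.compile(
    rf"""^Name:\ (.*)\r?\n
    Born:\ ({DATE})\r?\n
    Admitted:\ ({DATE})\r?\n
    Released:\ ({DATE})$""",
    re.VERBOSE | re.MULTILINE,
)

sample = ("Name: John Doe\n"
          "Born: 17-Jan-1964\n"
          "Admitted: 30-Jul-2013\n"
          "Released: 3-Aug-2013")
match = RECORD.search(sample)
```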
Perl and PCRE let you define subroutines separately using the special DEFINE group: (?(DEFINE)(?'subroutine'regex)). While this looks like a conditional that references the non-existent group DEFINE
containing a single named group 'subroutine', the DEFINE group is a special syntax. The fixed text (?(DEFINE) opens
the group. A parenthesis closes the group. This special group tells the regex engine to ignore its contents, other than to parse
it for named and numbered capturing groups. You can put as many capturing groups inside the DEFINE group as you like.
The DEFINE group itself never matches anything, and never fails to match. It is completely ignored. The regex
foo(?(DEFINE)(?'subroutine'skipped))bar matches foobar. The DEFINE group is completely superfluous in this
regex, as there are no calls to any of the groups inside of it.
(?(DEFINE)(?'date'(?:3[01]|[12][0-9]|[1-9])
-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
-(?:19|20)[0-9][0-9]))
^Name:\ (.*)\r?\n
Born:\ (?P>date)\r?\n
Admitted:\ (?P>date)\r?\n
Released:\ (?P>date)$
Quantifiers On Subroutine Calls
Quantifiers on subroutine calls work just like a quantifier on recursion. The call is repeated as many times in sequence as
needed to satisfy the quantifier. ([abc])(?1){3} matches abcb and any other four-letter combination of the
first three letters of the alphabet. First the group matches once, and then the call matches three times. This regex is equivalent
to ([abc])[abc]{3}.
Quantifiers on the group are ignored by the subroutine call. ([abc]){3}(?1) also matches abcb. First, the group matches
three times, because it has a quantifier. Then the subroutine call matches once, because it has no quantifier.
([abc]){3}(?1){3} matches six letters, such as abbcab, because now both the group and the call are repeated 3 times.
These two regexes are equivalent to ([abc]){3}[abc] and ([abc]){3}[abc]{3}.
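The expanded equivalents given above contain no subroutine calls, so they can be tried in any flavor. This short Python sketch only checks those expanded forms, because Python's re module has no (?1) syntax. The variable names are illustrative.

```python
import re

once_then_three = re.compile(r"([abc])[abc]{3}")      # stands in for ([abc])(?1){3}
three_then_once = re.compile(r"([abc]){3}[abc]")      # stands in for ([abc]){3}(?1)
three_then_three = re.compile(r"([abc]){3}[abc]{3}")  # stands in for ([abc]){3}(?1){3}

four_letters = "abcb"
six_letters = "abbcab"
```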
While Ruby does not support subroutine definition groups, it does support subroutine calls to groups that are repeated zero
times. (a){0}\g<1>{3} matches aaa. The group itself is skipped because it is repeated zero times. Then the subroutine call
matches three times, according to its quantifier. This also works in PCRE 7.7 and later. It doesn't work (reliably) in older
versions of PCRE or in any version of Perl because of bugs.
The Ruby version of the patient record example can be further cleaned up as:
(?'date'(?:3[01]|[12][0-9]|[1-9])
-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
-(?:19|20)[0-9][0-9]){0}
^Name:\ (.*)\r?\n
Born:\ \g'date'\r?\n
Admitted:\ \g'date'\r?\n
Released:\ \g'date'$
Infinite Recursion
Regular expressions such as (?R)?z or a?(?R)?z or a|(?R)z that use recursion without having anything that must be
matched in front of the recursion can result in infinite recursion. If the regex engine reaches the recursion without having
advanced through the text then the next recursion will again reach the recursion without having advanced through the text.
With the first regex this happens immediately at the start of the match attempt. With the other two this happens as soon as
there are no further letters a to be matched.
Boost 1.64 treats the first two regexes as a syntax error because they always lead to infinite recursion. It allows the third regex
because that one can match a. Ruby 1.9 and later, all versions of PCRE, and PCRE2 10.20 and prior treat all three forms of
potential infinite recursion as a syntax error. Perl, PCRE2 10.21 and later, and Boost 1.63 and prior allow all three forms.
But subroutine calls that are not recursive by themselves may end up being recursive if the group they call has another
subroutine call that calls a parent group of the first subroutine call. When subroutine calls are forced to go around in a circle,
that too leads to infinite recursion. Detecting such circular calls when compiling a regex is more complicated than checking for
straight infinite recursion. Only Ruby 1.9 and later is able to detect this and treat it as a syntax error. All other flavors allow
these regexes.
Errors and Crashes
When infinite recursion does occur, whether it's straight recursion or subroutine calls going in circles, Perl and PCRE2 treat it
as a matching error that aborts the entire match attempt. Boost 1.64 handles this by not attempting the recursion and acting
as if the recursion failed. If the recursion is optional then Boost 1.64 may find matches where other flavors throw errors.
Boost 1.63 and prior and PCRE 8.12 and prior crash when infinite recursion occurs. This also affects Delphi up to version XE6
and PHP up to version 5.4.8 as they are based on older PCRE versions.
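The two non-crashing strategies described above can be modeled with a hand-written matcher for a|(?R)z. This Python sketch is a simplified model and makes no claim about how any real engine detects the loop. It assumes a loop is recognized when the regex is re-entered at a position where it is already running, and the names RecursionLoop, match_a_or_rz, and on_loop are all hypothetical.

```python
class RecursionLoop(Exception):
    """Raised when a recursion is re-entered without advancing through the text."""


def match_a_or_rz(text, pos=0, active=frozenset(), on_loop="error"):
    # First alternative: a
    if text.startswith("a", pos):
        return pos + 1
    # Second alternative: (?R)z
    if pos in active:
        # This invocation was itself reached by recursion at the same position,
        # so recursing again cannot make progress.
        if on_loop == "error":
            raise RecursionLoop(f"recursion loop at position {pos}")
        return None  # act as if the recursion failed
    inner = match_a_or_rz(text, pos, active | {pos}, on_loop)
    if inner is not None and text.startswith("z", inner):
        return inner + 1
    return None


def loops_on(text):
    # Helper for demonstration: reports whether the "error" strategy aborts.
    try:
        match_a_or_rz(text)
    except RecursionLoop:
        return True
    return False
```

With on_loop="error" the attempt on zz is aborted, in the spirit of a matching error. With on_loop="fail" the same attempt simply finds no match.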
Endless Recursion
A regex such as a(?R)z that has a recursion token that is not optional and does not have an alternative without the same
recursion leads to endless recursion. Such a regular expression can never find a match. When a matches the regex engine
attempts the recursion. If it can match another a then it has to attempt the recursion again. Eventually a will run out of letters
to match. The recursion then fails. Because it's not optional the regex fails to match.
Ruby detects this situation when compiling your regular expression. It flags endless recursion as a syntax error. Perl, PCRE,
PCRE2, and Boost do not detect endless recursion. They simply go through the matching process which finds no matches.
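Written as a function, the problem with a(?R)z is that the recursion has no base case that succeeds. The Python sketch below (the name match_endless is hypothetical) returns the end position of a match or None, and by construction it can only ever return None.

```python
def match_endless(text, pos=0):
    # a
    if not text.startswith("a", pos):
        return None  # the only way out of the recursion is a failure
    # (?R) is mandatory: a failed recursion fails the whole attempt
    inner = match_endless(text, pos + 1)
    if inner is None:
        return None
    # z (never reached, because the innermost call always fails)
    if text.startswith("z", inner):
        return inner + 1
    return None
```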
Quantifiers On Recursion
The introduction to recursion shows how a(?R)?z matches aaazzz. The quantifier ? makes the preceding token optional. In
other words, it repeats the token zero or one times. In a(?R)?z the (?R) is made optional by the ? that follows it.
You may wonder why the regex attempted the recursion three times, instead of once or not at all.
The reason is that upon recursion, the regex engine takes a fresh start in attempting the whole regex. All quantifiers and
alternatives behave as if the matching process prior to the recursion had never happened at all, other than that the engine
advanced through the string. The regex engine restores the states of all quantifiers and alternatives when it exits from a
recursion, whether the recursion matched or failed. Basically, the matching process continues normally as if the recursion
never happened, other than that the engine advanced through the string.
If you're familiar with procedural programming languages, regex recursion is basically a recursive function call and the
quantifiers are local variables in the function. Each recursion of the function gets its own set of local variables that don't affect
and aren't affected by the same local variables in recursions higher up the stack. Quantifiers on recursion work this way in all
flavors, except Boost.
Let's see how a(?R){3}z|q behaves (Boost excepted). The simplest possible match is q, found by the second alternative in
the regex.
The simplest match in which the first alternative matches is aqqqz. After a is matched, the regex engine begins a recursion. a
fails to match q. Still inside the recursion, the engine attempts the second alternative. q matches q. The engine exits from the
recursion with a successful match. The engine now notes that the quantifier {3} has successfully repeated once. It needs two
more repetitions, so the engine begins another recursion. It again matches q. On the third iteration of the quantifier, the third
recursion matches q. Finally, z matches z and an overall match is found.
This regex does not match aqqz or aqqqqz. aqqz fails because during the third iteration of the quantifier, the recursion fails
to match z. aqqqqz fails because after a(?R){3} has matched aqqq, z fails to match the fourth q.
The regex can match longer strings such as aqaqqqzqz. With this string, during the second iteration of the quantifier, the
recursion matches aqqqz. Since each recursion tracks the quantifier separately, the recursion needs three consecutive
recursions of its own to satisfy its own instance of the quantifier. This can lead to arbitrarily long matches such as
aaaqqaqqqzzaqqqzqzqaqqaaqqqzqqzzz.
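The function-call analogy can be made concrete. In this Python sketch of a(?R){3}z|q, the loop counter plays the role of the quantifier and is local to each call, so every level of recursion counts to three on its own. The name match_quantified is hypothetical. The function returns the position where the match starting at pos ends, or None.

```python
def match_quantified(text, pos=0):
    # First alternative: a(?R){3}z
    if text.startswith("a", pos):
        end = pos + 1
        for _ in range(3):  # a fresh count of three for every call
            end = match_quantified(text, end)
            if end is None:
                break
        if end is not None and text.startswith("z", end):
            return end + 1
    # Second alternative: q
    if text.startswith("q", pos):
        return pos + 1
    return None
```

Each call can match in at most one way at a given position, because a match must start with either a or q, so this sketch needs no backtracking into the recursion.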
How Boost Handles Quantifiers on Recursion
Boost has its own ideas about how quantifiers should work on recursion. Recursion only works the same in Boost as in other
flavors if the recursion operator either has no quantifier at all or if it has * as its quantifier. Any other quantifier may lead to
very different matches (or lack thereof) in Boost 1.59 or prior versus Boost 1.60 and later versus other regex flavors. Boost
1.60 attempted to fix some of the differences between Boost and other flavors but it only resulted in a different incompatible
behavior.
In Boost 1.59 and prior, quantifiers on recursion count both iteration and recursion throughout the entire recursion stack. So
possible matches for a(?R){3}z|q in Boost 1.59 include aaaazzzz, aaaqzzz, aaqqzz, aaqzqz, and aqaqzzz. In all
these matches the number of recursions and iterations add up to 3. No other flavor would find these matches because they
require 3 iterations during each recursion. So other flavors can match things like aaqqqzaqqqzaqqqzz or aqqaqqqzz. Boost
1.59 would match only aqqqz within these strings.
Boost 1.60 attempts to iterate quantifiers at each recursion level like other flavors, but does so incorrectly. Any quantifier that
makes the recursion optional allows for infinite repetition. So Boost 1.60 and later treat a(?R)?z the same as a(?R)*z.
While this fixes the problem that a(?R)?z could not match aaazzz entirely in Boost 1.59, it also allows matches such as
aazazz that other flavors won't find with this regex. If the quantifier is not optional, then Boost 1.60 only allows it to match
during the first recursion. So a(?R){3}z|q could only ever match q or aqqqz.
Boost's issues with quantifiers on recursion also affect quantifiers on parent groups of the recursion token. They also affect
quantifiers on subroutine calls and quantifiers on groups that contain a subroutine call to a parent group of the group with the
quantifier.
Quantifiers like these that are inside the recursion but do not repeat the recursion itself do work correctly in Boost.
^Name:\ (.*)\n
Born:\ (?'date'(?:3[01]|[12][0-9]|[1-9])
-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
-(?:19|20)[0-9][0-9])\n
Admitted:\ \g'date'\n
Released:\ \g'date'$
^Name:\ (.*)\n
Born:\ (?'date'(?:3[01]|[12][0-9]|[1-9])
-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
-(?:19|20)[0-9][0-9])\n
Admitted:\ (?&date)\n
Released:\ (?&date)$
Unfortunately, there are differences in how these three regex flavors treat subroutine calls beyond their syntax. First of all, in
Ruby a subroutine call makes the capturing group store the text matched during the subroutine call. In Perl, PCRE, and Boost
a subroutine call does not affect the group that is called.
When the Ruby solution matches the sample above, retrieving the contents of the capturing group "date" will get you
3-Aug-2013 which was matched by the last subroutine call to that group. When the Perl solution matches the same,
retrieving $+{date} will get you 17-Jan-1964. In Perl, the subroutine calls did not capture anything at all. But the "Born"
date was matched with a normal named capturing group which stored the text that it matched normally. Any subroutine calls to
the group don't change that. PCRE behaves as Perl in this case, even when you use the Ruby syntax with PCRE.
If you want to extract the dates from the match, the best solution is to add another capturing group for each date. Then you
can ignore the text stored by the "date" group and this particular difference between these flavors. In Ruby or PCRE:
^Name:\ (.*)\n
Born:\ (?'born'(?'date'(?:3[01]|[12][0-9]|[1-9])
-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
-(?:19|20)[0-9][0-9]))\n
Admitted:\ (?'admitted'\g'date')\n
Released:\ (?'released'\g'date')$
^Name:\ (.*)\n
Born:\ (?'born'(?'date'(?:3[01]|[12][0-9]|[1-9])
-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
-(?:19|20)[0-9][0-9]))\n
Admitted:\ (?'admitted'(?&date))\n
Released:\ (?'released'(?&date))$
PCRE and Boost back up and restore capturing groups when entering and exiting recursion. When the regex engine enters
recursion, it internally makes a copy of all capturing groups. This does not affect the capturing groups. Backreferences inside
the recursion match text captured prior to the recursion unless and until the group they reference captures something during
the recursion. After the recursion, all capturing groups are replaced with the internal copy that was made at the start of the
recursion. Text captured during the recursion is discarded. This means you cannot use capturing groups to retrieve parts of
the text that were matched during recursion.
Perl 5.10, the first version to have recursion, through version 5.18, isolated capturing groups between each level of recursion.
When Perl 5.10's regex engine enters recursion, all capturing groups appear as if they have not participated in the match yet.
Initially, all backreferences will fail. During the recursion, capturing groups capture as normal. Backreferences match text
captured during the same recursion as normal. When the regex engine exits from the recursion, all capturing groups revert to
the state they were in prior to the recursion. Perl 5.20 changed Perl's behavior to back up and restore capturing groups the
way PCRE does.
For most practical purposes, however, you'll only use backreferences after their corresponding capturing groups. Then the
difference between the way Perl 5.10 through 5.18 deal with capturing groups during recursion and the way PCRE and later
versions of Perl do is academic.
Ruby's behavior is completely different. When Ruby's regex engine enters or exits recursion, it makes no changes to the text
stored by capturing groups at all. Backreferences match the text stored by the capturing group during the group's most recent
match, irrespective of any recursion that may have happened. After an overall match is found, each capturing group still stores
the text of its most recent match, even if that was during a recursion. This means you can use capturing groups to retrieve part
of the text matched during the last recursion.
Let's see how \b(?'word'(?'letter'[a-z])(?&word)\k'letter'|[a-z])\b matches radar. The word boundary \b matches at the start of the string. The regex engine enters
the two capturing groups. [a-z] matches r which is then stored in the capturing group "letter". Now the regex engine
enters the first recursion of the group "word". At this point, Perl forgets that the "letter" group matched r. PCRE does
not. But this does not matter. (?'letter'[a-z]) matches and captures a. The regex enters the second recursion of the
group "word". (?'letter'[a-z]) captures d. During the next two recursions, the group captures a and r. The fifth
recursion fails because there are no characters left in the string for [a-z] to match. The regex engine must backtrack.
Because (?&word) failed to match, (?'letter'[a-z]) must give up its match. The group reverts to a, which was the text
the group held at the start of the recursion. (It becomes empty in Perl 5.18 and prior.) Again, this does not matter because the
regex engine must now try the second alternative inside the group "word", which contains no backreferences. The second
[a-z] matches the final r in the string. The engine now exits from a successful recursion. The text stored by the group
"letter" is restored to what it had captured prior to entering the fourth recursion, which is a.
After matching (?&word) the engine reaches \k'letter'. The backreference fails because the regex engine has already
reached the end of the subject string. So it backtracks once more, making the capturing group give up the a. The second
alternative now matches the a. The regex engine exits from the third recursion. The group "letter" is restored to the d
matched during the second recursion.
The regex engine has again matched (?&word). The backreference fails again because the group stores d while the next
character in the string is r. Backtracking again, the second alternative matches d and the group is restored to the a matched
during the first recursion.
Now, \k'letter' matches the second a in the string. That's because the regex engine has arrived back at the first recursion
during which the capturing group matched the first a. The regex engine exits the first recursion. The capturing group is restored to the r
which it matched prior to the first recursion.
Finally, the backreference matches the second r. Since the engine is not inside any recursion any more, it proceeds with the
remainder of the regex after the group. \b matches at the end of the string. The end of the regex is reached and radar is
returned as the overall match. If you query the groups "word" and "letter" after the match you'll get radar and r. That's the
text matched by these groups outside of all recursion.
\b(?'word'(?'letter'[a-z])\g'word'\k'letter'|[a-z])\b
Ruby will not complain. But it will not match palindromes longer than three letters either. Instead this regex matches things like
a, dad, radaa, raceccc, and rediviiii.
Let's see why this regex does not match radar in Ruby. Ruby starts out like Perl and PCRE, entering the recursions until
there are no characters left in the string for [a-z] to match.
Because \g'word' failed to match, (?'letter'[a-z]) must give up its match. Ruby reverts it to a, which was the text the
group most recently matched. The second [a-z] matches the final r in the string. The engine now exits from a successful
recursion. The group "letter" continues to hold its most recent match a.
After matching \g'word' the engine reaches \k'letter'. The backreference fails because the regex engine has already
reached the end of the subject string. So it backtracks once more, reverting the group to the previously matched d. The
second alternative now matches the a. The regex engine exits from the third recursion.
The regex engine has again matched \g'word'. The backreference fails again because the group stores d while the next
character in the string is r. Backtracking again, the group reverts to a and the second alternative matches d.
Now, \k'letter' matches the second a in the string. The regex engine exits the first recursion which successfully matched
ada. The capturing group continues to hold a which is its most recent match that wasn't backtracked.
The regex engine is now at the last character in the string. This character is r. The backreference fails because the group still
holds a.
The engine can backtrack once more, forcing (?'letter'[a-z])\g'word'\k'letter' to give up the rada it matched
so far. The regex engine is now back at the start of the string. It can still try the second alternative in the group. This matches
the first r in the string. Since the engine is not inside any recursion any more, it proceeds with the remainder of the regex after
the group. \b fails to match after the first r. The regex engine has no further permutations to try. The match attempt has
failed.
If the subject string is radaa, Ruby's engine goes through nearly the same matching process as described above. Only the
events described in the last paragraph change. When the regex engine reaches the last character in the string, that character
is now a. This time, the backreference matches. Since the engine is not inside any recursion any more, it proceeds with the
remainder of the regex after the group. \b matches at the end of the string. The end of the regex is reached and radaa is
returned as the overall match. If you query the groups "word" and "letter" after the match you'll get radaa and a. Those
are the most recent matches of these groups that weren't backtracked.
Basically, in Ruby this regex matches any word that is an odd number of letters long and in which all the characters to the right
of the middle letter are identical to the character just to the left of the middle letter. That's because Ruby only restores
capturing groups when they backtrack, but not when it exits from recursion.
The solution, specific to Ruby, is to use a backreference that specifies a recursion level instead of the normal backreference
used in the regex on this page.
Perl, PCRE, and Boost restore capturing groups when they exit from recursion. This means that backreferences in Perl,
PCRE, and Boost match the same text that was matched by the capturing group at the same recursion level. This makes it
possible to do things like matching palindromes.
Ruby does not restore capturing groups when it exits from recursion. Normal backreferences match the text that is the same
as the most recent match of the capturing group that was not backtracked, regardless of whether the capturing group found its
match at the same or a different recursion level as the backreference. Basically, normal backreferences in Ruby don't pay any
attention to recursion.
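The two behaviors can be compared side by side with a small backtracking matcher for the palindrome regex. This Python sketch is a simplified model, not either engine's implementation. It assumes the subject is a single word, so that the \b anchors amount to matching the whole string, and it allows backtracking into the recursion. The names word, matches_whole_word, and restore are hypothetical.

```python
def word(text, pos, letter, restore):
    """Yield (end, letter) for each way the group 'word' can match at pos.

    letter is the text currently held by the group 'letter'. restore selects
    whether that value is put back when a recursion exits.
    """
    # First alternative: (?'letter'[a-z]) <call to word> \k'letter'
    if pos < len(text) and "a" <= text[pos] <= "z":
        captured = text[pos]
        for end, after_call in word(text, pos + 1, captured, restore):
            # Restoring puts back this level's capture. Otherwise the
            # backreference sees whatever the recursion left behind.
            held = captured if restore else after_call
            if text.startswith(held, end):
                yield end + 1, held
    # Second alternative: [a-z]. The capture made by the first alternative
    # has been backtracked, so the group holds its incoming value again.
    if pos < len(text) and "a" <= text[pos] <= "z":
        yield pos + 1, letter


def matches_whole_word(text, restore):
    return any(end == len(text) for end, _ in word(text, 0, None, restore))
```

With restore=True the model accepts radar and rejects radaa. With restore=False it rejects radar and accepts radaa, which agrees with the matches listed for Ruby above.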
But while the normal capturing group storage in Ruby does not get any special treatment for recursion, Ruby actually stores a
full stack of matches for each capturing group at all recursion levels. This stack even includes recursion levels that the regex
engine has already exited from.
Backreferences in Ruby can match the same text as was matched by a capturing group at any recursion level relative to the
recursion level that the backreference is evaluated at. You can do this with the same syntax for named backreferences by
adding a sign and a number after the name. In most situations you will use +0 to specify that you want the backreference to
reuse the text from the capturing group at the same recursion level. You can specify a positive number to reference the
capturing group at a deeper level of recursion. This would be a recursion the regex engine has already exited from. You can
specify a negative number to reference the capturing group at a level that is less deep. This would be a recursion that is still in
progress.
Let's see how this regex matches radar. The word boundary \b matches at the start of the string. The regex engine enters
the capturing group "word". [a-z] matches r which is then stored in the stack for the capturing group "letter" at
recursion level zero. Now the regex engine enters the first recursion of the group "word". (?'letter'[a-z]) matches and
captures a at recursion level one. The regex enters the second recursion of the group "word". (?'letter'[a-z])
captures d at recursion level two. During the next two recursions, the group captures a and r at levels three and four. The fifth
recursion fails because there are no characters left in the string for [a-z] to match. The regex engine must backtrack.
The regex engine must now try the second alternative inside the group "word". The second [a-z] in the regex matches the
final r in the string. The engine now exits from a successful recursion, going one level back up to the third recursion.
After matching \g'word' the engine reaches \k'letter+0'. The backreference fails because the regex engine has
already reached the end of the subject string. So it backtracks once more. The second alternative now matches the a. The
regex engine exits from the third recursion.
The regex engine has again matched \g'word' and needs to attempt the backreference again. The backreference specifies
+0 or the present level of recursion, which is 2. At this level, the capturing group matched d. The backreference fails because
the next character in the string is r. Backtracking again, the second alternative matches d.
Now, \k'letter+0' matches the second a in the string. That's because the regex engine has arrived back at the first
recursion during which the capturing group matched the first a. The regex engine exits the first recursion.
The regex engine is now back outside all recursion. At this level, the capturing group stored r. The backreference can now
match the final r in the string. Since the engine is not inside any recursion any more, it proceeds with the remainder of the
regex after the group. \b matches at the end of the string. The end of the regex is reached and radar is returned as the
overall match.
\b(?'word'(?'letter'[a-z])\g'word'(?:\k'letter-1'|z)|[a-z])\b.
The backreference here wants to match the text one level less deep on the capturing group's stack. It is alternated with the
letter z so that something can be matched when the backreference fails to match.
The new regex matches things like abcdefdcbaz. After a whole bunch of matching and backtracking, the second [a-z]
matches f. The regex engine exits from a successful fifth recursion. The capturing group "letter" has stored the matches a, b,
c, d, and e at recursion levels zero to four. Other matches by that group were backtracked and thus not retained.
Now the engine evaluates the backreference \k'letter-1'. The present level is 4 and the backreference specifies -1. Thus
the engine attempts to match d, which succeeds. The engine exits from the fourth recursion.
The backreference continues to match c, b, and a until the regex engine has exited the first recursion. Now, outside all
recursion, the regex engine again reaches \k'letter-1'. The present level is 0 and the backreference specifies -1. Since
recursion level -1 never happened, the backreference fails to match. This is not an error but simply a backreference to a
non-participating capturing group. But the backreference has an alternative. z matches z and \b matches at the end of the
string. abcdefdcbaz was matched successfully.
Again, after a whole bunch of matching and backtracking, the second [a-z] matches f, the regex engine is back at recursion
level 4, and the group "letter" has a, b, c, d, and e at recursion levels zero to four on its stack.
Now the engine evaluates the backreference \k'letter+1'. The present level is 4 and the backreference specifies +1. The
capturing group was backtracked at recursion level 5. This means we have a backreference to a non-participating group,
which fails to match. The alternative z does match. The engine exits from the fourth recursion.
At recursion level 3, the backreference points to recursion level 4. Since the capturing group successfully matched at
recursion level 4, it still has that match on its stack, even though the regex engine has already exited from that recursion. Thus
\k'letter+1' matches e. Recursion level 3 is exited successfully.
The backreference continues to match d and c until the regex engine has exited the first recursion. Now, outside all recursion,
the regex engine again reaches \k'letter+1'. The present level is 0 and the backreference specifies +1. The capturing
group still retains all its previous successful recursion levels. So the backreference can still match the b that the group
captured during the first recursion. Now \b matches at the end of the string. abcdefzedcb was matched successfully.
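The stack of captures per recursion level can be modelled in the same hand-written way. The sketch below is only a model of the behavior described above, not Ruby's actual implementation. The names word, matches, stack, and offset are hypothetical. stack maps a recursion level to the text that "letter" captured at that level, and offset stands for the number after the sign in \k'letter-1', \k'letter+0', or \k'letter+1'. A level that is missing from the stack plays the role of a non-participating group.

```python
def word(s, pos, level, stack, offset):
    """Yield (end, stack) for every way the group "word" can match at pos."""
    if pos < len(s) and "a" <= s[pos] <= "z":
        mine = {**stack, level: s[pos]}                               # (?'letter'[a-z])
        for end, seen in word(s, pos + 1, level + 1, mine, offset):   # \g'word'
            ref = seen.get(level + offset)       # capture at the requested level, if any
            if ref is not None and s.startswith(ref, end):            # \k'letter+N'
                yield end + 1, seen
            if s.startswith("z", end):                                # alternative z
                yield end + 1, seen
        yield pos + 1, stack                                          # second alternative [a-z]


def matches(s, offset):
    """True if the modelled regex can match the whole word s."""
    return any(end == len(s) for end, _ in word(s, 0, 0, {}, offset))


print(matches("radar", 0))          # True: +0 gives a proper palindrome match
print(matches("abcdefdcbaz", -1))   # True: each level reuses the capture one level up
print(matches("abcdefzedcb", +1))   # True: each level reuses the capture one level down
```

Copying the dictionary at each level means that captures made on a path that is later backtracked simply disappear, which mirrors the statement that backtracked matches are not retained.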
You can take this as far as you like in this direction too:
Perl and Ruby backtrack into recursion if the remainder of the regex after the recursion fails. They try all permutations of the
recursion as needed to allow the remainder of the regex to match. PCRE treats recursion as atomic. PCRE backtracks
normally during the recursion, but once the recursion has matched, it does not try any further permutations of the recursion,
even when the remainder of the regex fails to match. The result is that Perl and Ruby may find regex matches that PCRE
cannot find, or that Perl and Ruby may find different regex matches.
Consider the regular expression aa$|a(?R)a|a in Perl or the equivalent aa$|a\g'0'a|a in Ruby 2.0. PCRE supports
either syntax. Let's see how Perl, Ruby, and PCRE go through the matching process of this regex when aaa is the subject
string.
The first alternative aa$ fails because the anchor cannot be matched between the second and third a in the string. Attempting
the second alternative at the start of the string, a matches a. Now the regex engine enters the first recursion.
Inside the recursion, the first alternative matches the second and third a in the string. The regex engine exits a successful
recursion. But now, the a that follows (?R) or \g'0' in the regex fails to match because the regex engine has already
reached the end of the string. Thus the regex engine must backtrack. Here is where PCRE behaves differently than Perl or
Ruby.
Perl and Ruby remember that inside the recursion the regex matched the first alternative and that there are three possible
alternatives. Perl and Ruby backtrack into the recursion. The first alternative inside the recursion is backtracked, reducing
the match so far to the first a in the string. The second alternative cannot lead to a match either, so the third alternative is
attempted. a matches the second a in the string. The
regex engine again exits successfully from the same recursion. This time, the a that follows (?R) or \g'0' in the regex
matches the third a in the string. aaa is found as the overall match.
PCRE, on the other hand, remembers nothing about the recursion other than that it matched aa at the end of the string. PCRE
does backtrack over the recursion, reducing the match so far to the first a in the string. But this leaves the second alternative
in the regex without any further permutations to try. Thus the a at the start of the second alternative is also backtracked,
reducing the match so far to nothing. PCRE continues the match attempt at the start of the string with the third alternative and
finds that a matches a at the start of the string. In PCRE, this is the overall match.
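The two outcomes can be reproduced with a small model of the regex aa$|a(?R)a|a. This is a hand-written sketch, not a real engine. The names regex, first_match, and atomic are hypothetical. With atomic=False the caller may ask the recursion for further ways to match, as Perl and Ruby do. With atomic=True only the first way the recursion matched is kept, as in PCRE. The model only attempts a match at the start of the string.

```python
from itertools import islice


def regex(s, pos, atomic):
    """Yield every end position at which aa$|a(?R)a|a can match from pos."""
    if s.startswith("aa", pos) and pos + 2 == len(s):    # first alternative: aa$
        yield pos + 2
    if s.startswith("a", pos):                           # second alternative: a(?R)a
        inner = regex(s, pos + 1, atomic)
        if atomic:
            inner = islice(inner, 1)    # forget all but the first match of the recursion
        for end in inner:
            if s.startswith("a", end):
                yield end + 1
    if s.startswith("a", pos):                           # third alternative: a
        yield pos + 1


def first_match(s, atomic):
    """Return the first match found at the start of s, or None."""
    end = next(regex(s, 0, atomic), None)
    return None if end is None else s[:end]


print(first_match("aaa", atomic=False))   # aaa (Perl and Ruby model)
print(first_match("aaa", atomic=True))    # a   (PCRE model)
```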
You can make recursion in Perl and Ruby atomic by adding an atomic group. aa$|a(?>(?R))a|a in Perl and
aa$|a(?>\g'0')a|a in Ruby behave the same as the original regex does in PCRE.
Boost is of two minds. Recursion of the whole regex is atomic in Boost, like in PCRE. But Boost will backtrack into subroutine
calls and into recursion of capturing groups, like Perl. So you can do non-atomic recursion in Boost by wrapping the whole
regex into a capturing group and then calling that.
PCRE2 originally behaved like PCRE, treating all recursion and subroutine calls as atomic. PCRE2 10.30 changed this, trying
to be more like Perl, but ending up like Boost. PCRE2 10.30 will backtrack into subroutine calls and recursion of capturing
groups like Perl does. But PCRE2 is still not able to backtrack into recursion of the whole regex. In the examples below,
"PCRE" means the original PCRE only. For PCRE2 10.22 and prior, follow the PCRE example. For PCRE2 10.30 and later,
follow the Perl example.
Let's see how these regexes match or fail to match deed. PCRE starts off the same as Perl and Ruby, just as in the original
regex. The group "letter" matches d. During three consecutive recursions, the group captures e, e, and d. The fourth
recursion fails, because there are no characters left to match. Back in the third recursion, the first alternative is backtracked
and the second alternative matches d at the end of the string. The engine exits the third recursion with a successful match.
Back in the second recursion, the backreference fails because there are no characters left in the string.
Here the behavior diverges. Perl and Ruby backtrack into the third recursion and backtrack the quantifier ? that makes the
second alternative optional. In the third recursion, the second alternative gives up the d that it matched at the end of the string.
The engine exits the third recursion again, this time with a successful zero-length match. Back in the second recursion, the
backreference still fails because the group stored e for the second recursion but the next character in the string is d. Thus the
first alternative is backtracked and the second alternative matches the second e in the string. The second recursion is exited
with success.
In the first recursion, the backreference again fails. The group stored e for the first recursion but the next character in the
string is d. Again, Perl and Ruby backtrack into the second recursion to try the permutation where the second alternative finds
a zero-length match. Back in the first recursion again, the backreference now matches the second e in the string. The engine
leaves the first recursion with success. Back in the overall match attempt, the backreference matches the final d in the string.
The word boundary succeeds and an overall match is found.
PCRE, however, does not backtrack into the third recursion. It does backtrack over the third recursion when it backtracks the
first alternative in the second recursion. Now, the second alternative in the second recursion matches the second e in the
string. The second recursion is exited with success.
In the first recursion, the backreference again fails. The group stored e for the first recursion but the next character in the
string is d. Again, PCRE does not backtrack into the second recursion, but immediately fails the first alternative in the first
recursion. The second alternative in the first recursion now matches the first e in the string. PCRE exits the first recursion with
success. Back in the overall match attempt, the backreference fails, because the group captured d prior to the recursion, and
the next character is the second e in the string. Backtracking again, the second alternative in the overall regex match now
matches the first d in the string. Then the word boundary fails. PCRE did not find any matches.
\b(?'word'
(?'oddword' (?'oddletter' [a-z])(?P>oddword) \k'oddletter' |[a-z])
| (?'evenword'(?'evenletter'[a-z])(?P>evenword)?\k'evenletter')
)\b
Basically, this is two copies of the original regex combined with alternation. The first alternative has the groups 'word' and
'letter' renamed to 'oddword' and 'oddletter'. The second alternative has the groups 'word' and 'letter'
renamed to 'evenword' and 'evenletter'. The call (?P>evenword) is now made optional with a question mark instead
of the alternative |[a-z]. A new group 'word' combines the two groups 'oddword' and 'evenword' so that the word
boundaries still apply to the whole regex.
The first alternative 'oddword' in this regex matches a palindrome of odd length like radar in exactly the same way as the
regex discussed in the topic about recursion and capturing groups does. The second alternative in the new regex is never
attempted.
When the string is a palindrome of even length like deed, the new regex first tries all permutations of the first alternative. The
second alternative 'evenword' is attempted only after the first alternative fails to find a match.
In the second alternative group 'evenletter' matches d. During three consecutive recursions, the group captures e, e, and
d. The fourth recursion fails, because there are no characters left to match. Back in the third recursion, the regex engine
notes that recursive call (?P>evenword)? is optional. It proceeds to the backreference \k'evenletter'. The
backreference fails because there are no characters left in the string. Since the recursion has no further alternatives to try, it is
backtracked. The group 'evenletter' must give up its most recent match and PCRE exits from the failed third recursion.
In the second recursion, the backreference fails because the capturing group matched e during that recursion but the next
character in the string is d. The group gives up another match and PCRE exits from the failed second recursion.
Back in the first recursion, the backreference succeeds. The group matched the first e in the string during that recursion and
the backreference matches the second. PCRE exits from the successful first recursion.
Back in the overall match attempt, the backreference succeeds again. The group matched the d at the start of the string
during the overall match attempt, and the backreference matches the final d. Exiting the groups 'evenword' and 'word',
the word boundary matches at the end of the string. deed is the overall match.
One key syntactic difference is that the backslash is NOT a metacharacter in a POSIX bracket expression. So in POSIX, the
regular expression [\d] matches a \ or a d. To match a ], put it as the first character after the opening [ or the negating ^.
To match a -, put it right before the closing ]. To match a ^, put it before the final literal - or the closing ]. Put together,
[]\d^-] matches ], \, d, ^ or -.
The main purpose of the bracket expressions is that they adapt to the user's or application's locale. A locale is a collection of
rules and settings that describe language and cultural conventions, like sort order, date format, etc. The POSIX standard also
defines these locales.
Generally, only POSIX-compliant regular expression engines have proper and full support for POSIX bracket expressions.
Some non-POSIX regex engines support POSIX character classes, but usually don't support collating sequences and
character equivalents. Regular expression engines that support Unicode use Unicode properties and scripts to provide
functionality similar to POSIX bracket expressions. In Unicode regex engines, shorthand character classes like \w normally
match all relevant Unicode characters, alleviating the need to use locales.
Character Classes
Don't confuse the POSIX term "character class" with what is normally called a regular expression character class. [x-z0-9]
is an example of what we call a "character class" and POSIX calls a "bracket expression". [:digit:] is a POSIX character
class, used inside a bracket expression like [x-z[:digit:]]. The POSIX character class names must be written all
lowercase.
When used on ASCII strings, these two regular expressions find exactly the same matches: a single character that is either x,
y, z, or a digit. When used on strings with non-ASCII characters, the [:digit:] class may include digits in other scripts,
depending on the locale.
POSIX bracket expressions can be negated. [^x-z[:digit:]] matches a single character that is not x, y, z or a digit. A
major difference between POSIX bracket expressions and the character classes in other regex flavors is that POSIX bracket
expressions treat the backslash as a literal character. This means you can't use backslashes to escape the closing bracket
(]), the caret (^) and the hyphen (-). To include a caret, place it anywhere except right after the opening bracket. [x^]
matches an x or a caret. You can put the closing bracket right after the opening bracket, or the negating caret. []x] matches
a closing bracket or an x. [^]x] matches any character that is not a closing bracket or an x. The hyphen can be included
right after the opening bracket, or right before the closing bracket, or right after the negating caret. Both [-x] and [x-] match
an x or a hyphen.
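For contrast, here is how a non-POSIX flavor treats the same characters. The sketch uses Python's re module, which does not support POSIX bracket expressions and treats the backslash inside a character class as an escape. The subject string is made up for illustration.

```python
import re

subject = r"a\d7]^-"   # the characters a, \, d, 7, ], ^ and -

# In Python, \d inside a class is still the digit shorthand.
print(re.findall(r"[\d]", subject))        # ['7']

# To match what POSIX [\d] matches (a backslash or a d), escape the backslash.
print(re.findall(r"[\\d]", subject))       # ['\\', 'd']

# A Python counterpart of POSIX []\d^-]: the bracket and backslash are escaped,
# the caret is not first and the hyphen is last, so both are literals.
print(re.findall(r"[\]\\d^-]", subject))   # ['\\', 'd', ']', '^', '-']
```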
The POSIX standard defines 12 character classes. The table below lists all 12, plus the [:ascii:] and [:word:] classes
that some regex flavors also support. The table also shows equivalent character classes that you can use in ASCII and
Unicode regular expressions if the POSIX classes are unavailable. The ASCII equivalents correspond exactly to what is defined
in the POSIX standard. The Unicode equivalents correspond to what most Unicode regex engines match. The POSIX
standard does not define a Unicode locale. Some classes also have Perl-style shorthand equivalents.
Java does not support POSIX bracket expressions, but does support POSIX character classes using the \p operator. Though
the \p syntax is borrowed from the syntax for Unicode properties, the POSIX classes in Java only match ASCII characters as
indicated below. The class names are case sensitive. Unlike the POSIX syntax which can only be used inside a bracket
expression, Java's \p can be used inside and outside bracket expressions.
In Java 8 and prior, it does not matter whether you use the Is prefix with the \p syntax or not. So in Java 8, \p{Alnum} and
\p{IsAlnum} are identical. In Java 9 and later there is a difference. Without the Is prefix, the behavior is exactly the same
as in previous versions of Java. The syntax with the Is prefix now matches Unicode characters too. For \p{IsPunct} this
also means that it no longer matches the ASCII characters that are in the Symbol Unicode category.
Other than POSIX-compliant engines that are part of a POSIX-compliant system, none of the regex flavors discussed in this tutorial
support collating sequences.
Note that a fully POSIX-compliant regex engine treats ch as a single character when the locale is set to Czech. This means
that [^x]emie also matches chemie. [^x] matches a single character that is not an x, which includes ch in the Czech
POSIX locale.
In any other regular expression engine, or in a POSIX engine using a locale that does not treat ch as a digraph, [^x]emie
matches the misspelled word cemie but not chemie, as [^x] cannot match the two characters ch.
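Python's re module is one such engine without collating sequences, so it can be used to check the second case. In this small sketch, fullmatch is used to require that the regex matches the whole word, which is an assumption about how the comparison above is meant.

```python
import re

pattern = re.compile(r"[^x]emie")

# [^x] consumes exactly one character, so only the five-letter word fits.
print(pattern.fullmatch("cemie") is not None)    # True
print(pattern.fullmatch("chemie") is not None)   # False: ch counts as two characters
```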
Finally, note that not all regex engines claiming to implement POSIX regular expressions actually have full support for collating
sequences. Sometimes, these engines use the regular expression syntax defined by POSIX, but don't have full locale support.
You may want to try the above matches to see if the engine you're using does. E.g. Tcl's regexp command supports collating
sequences, but Tcl only supports the Unicode locale, which does not define any collating sequences. The result is that in Tcl, a
collating sequence specifying a single character will match just that character, and all other collating sequences will result in
an error.
Character Equivalents
A POSIX locale can define character equivalents that indicate that certain characters should be considered as identical for
sorting. E.g. in French, accents are ignored when ordering words. élève comes before être which comes before
événement. é, è and ê are all the same as e, but l comes before t which comes before v. With the locale set to French, a
POSIX-compliant regular expression engine will match e, é, è and ê when you use the character equivalent [=e=] in the
bracket expression [[=e=]].
If a character does not have any equivalents, the character equivalence token simply reverts to the character itself. E.g.
[[=x=][=z=]] is the same as [xz] in the French locale.
Like collating sequences, POSIX character equivalents are not available in any regex engine that I know of, other than those
following the POSIX standard. And those that do may not have the necessary POSIX locale support. Here too Tcl's regexp
command supports character equivalents, but the Unicode locale, the only one Tcl supports, does not define any character
equivalents. This effectively means that [[=x=]] and [x] are exactly the same in Tcl, and will only match x, for any character
you may try instead of "x".
In email, for example, it is common to prepend a "greater than" symbol and a space to each line of the quoted message. In
VB.NET, we can easily do this with:
We are using multi-line mode, so the regex ^ matches at the start of the quoted message, and after each newline. The
Regex.Replace method removes the regex match from the string, and inserts the replacement string (greater than symbol and
a space). Since the match does not include any characters, nothing is deleted. However, the match does include a starting
position. The replacement string is inserted there, just like we want it.
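The same technique works in other languages. As a rough Python counterpart (not the VB.NET code referred to above), the sketch below uses re.sub with the MULTILINE flag. The message text is made up for illustration.

```python
import re

message = "Hello,\nhow are you?\nBye"

# ^ matches at the start of the string and after each newline in multi-line mode.
# Each match is zero-length, so "> " is inserted and nothing is deleted.
quoted = re.sub(r"^", "> ", message, flags=re.MULTILINE)
print(quoted)
```

Note that if the message ends with a newline, Python also finds a match after that final newline, so a trailing "> " is added.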
Using ^\d*$ to test if the user entered a number would give undesirable results. It causes the script to accept an empty string
as a valid input. Let's see why.
There is only one "character" position in an empty string: the void after the string. The first token in the regex is ^. It matches
the position before the void after the string, because it is preceded by the void before the string. The next token is \d*. One of
the star's effects is that it makes the \d, in this case, optional. The engine tries to match \d with the void after the string. That
fails. But the star turns the failure of the \d into a zero-length success. The engine proceeds with the next regex token, without
advancing the position in the string. So the engine arrives at $, and the void after the string. These match. At this point, the
entire regex has matched the empty string, and the engine reports success.
The solution is to use the regex ^\d+$ with the proper quantifier to require at least one digit to be entered. If you always make
sure that your regexes cannot find zero-length matches, other than special cases such as matching the start or end of each
line, then you can save yourself the headache you'll get from reading the remainder of this topic.
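A quick check in Python shows the difference between the two quantifiers. The helper names accepts_star and accepts_plus are hypothetical.

```python
import re


def accepts_star(text):
    """Validate with ^\\d*$, which also accepts the empty string."""
    return re.search(r"^\d*$", text) is not None


def accepts_plus(text):
    """Validate with ^\\d+$, which requires at least one digit."""
    return re.search(r"^\d+$", text) is not None


print(accepts_star(""))     # True: the zero-length match is accepted
print(accepts_plus(""))     # False
print(accepts_plus("42"))   # True
```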
Things get tricky when a regex can find zero-length matches at any position as well as certain non-zero-length matches. Say
we have the regex \d*|x, the subject string x1, and a regex engine that allows zero-length matches. Which and how many
matches do we get when iterating over all matches? The answer depends on how the regex engine advances after
zero-length matches. The answer is tricky either way.
The first match attempt begins at the start of the string. \d fails to match x. But the * makes \d optional. The first alternative
finds a zero-length match at the start of the string. Until here, all regex engines that allow zero-length matches do the same.
Now the regex engine is in a tricky situation. We're asking it to go through the entire string to find all non-overlapping regex
matches. The first match ended at the start of the string, where the first match attempt began. The regex engine needs a way
to avoid getting stuck in an infinite loop that forever finds the same zero-length match at the start of the string.
The simplest solution, which is used by most regex engines, is to start the next match attempt one character after the end of
the previous match, if the previous match was zero-length. In this case, the second match attempt begins at the position
between the x and the 1 in the string. \d matches 1. The end of the string is reached. The quantifier * is satisfied with a
single repetition. 1 is returned as the overall match.
The other solution, which is used by Perl, is to always start the next match attempt at the end of the previous match,
regardless of whether it was zero-length or not. If it was zero-length, the engine makes note of that, as it must not allow a
zero-length match at the same position. Thus Perl begins the second match attempt also at the start of the string. The first
alternative again finds a zero-length match. But this is not a valid match, so the engine backtracks through the regular
expression. \d* is forced to give up its zero-length match. Now the second alternative in the regex is attempted. x matches x
and the second match is found. The third match attempt begins at the position after the x in the string. The first alternative
matches 1 and the third match is found.
But the regex engine isn't done yet. After 1 is matched, it makes one more match attempt starting at the end of the string.
Here too \d* finds a zero-length match. So depending on how the engine advances after zero-length matches, it finds either
three or four matches.
Python 3.6 and prior advance after zero-length matches. The sub() function to search-and-replace skips zero-length
matches at the position where the previous non-zero-length match ended, but the finditer() function returns those
matches. Listing all matches adds the zero-length match at the end of the string.
Python 3.7 changed all this. It handles zero-length matches like Perl. sub() now replaces zero-length matches that are
adjacent to another match. This means regular expressions that can find zero-length matches are not compatible between
Python 3.7 and prior versions of Python.
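You can see the Perl-like behavior by running the example regex from this topic in Python 3.7 or later. The outputs shown in the comments assume Python 3.7 or later. Older versions return fewer matches.

```python
import re

# Four matches: the empty match at the start, x, 1, and the empty match at the end.
found = re.findall(r"\d*|x", "x1")
print(found)      # ['', 'x', '1', '']

# Search-and-replace substitutes every one of those four matches.
replaced = re.sub(r"\d*|x", "-", "x1")
print(replaced)   # ----
```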
PCRE 8.00 and later and PCRE2 handle zero-length matches like Perl by backtracking. They no longer advance one
character after a zero-length match like PCRE 7.9 used to do.
The regexp functions in R and PHP are based on PCRE, so they avoid getting stuck on a zero-length match by backtracking
like PCRE does. But the gsub() function to search-and-replace in R also skips zero-length matches at the position where the
previous non-zero-length match ended, like sub() in Python 3.6 and prior does. The other regexp functions in R and all the
functions in PHP do allow zero-length matches immediately adjacent to non-zero-length matches, just like PCRE itself.
What you have to watch out for is that String[Regex.MatchPosition] may cause an access violation or segmentation
fault, because MatchPosition can point to the void after the string. This can also happen with ^ and ^$ in multi-line mode if the
last character in the string is a newline.
Applying \G\w to the string test string matches t. Applying it again matches e. The 3rd attempt yields s and the 4th
attempt matches the second t in the string. The fifth attempt fails. During the fifth attempt, the only place in the string where
\G matches is after the second t. But that position is not followed by a word character, so the match fails.
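Python's re module has no \G token, but the same sequence of attempts can be approximated by passing a start position to the match() method of a compiled pattern, which only matches at that position. This is a sketch of the idea rather than an exact equivalent of \G.

```python
import re

word_char = re.compile(r"\w")
subject = "test string"

found = []
pos = 0
while True:
    # match() with a pos argument only succeeds exactly at pos, much like \G\w.
    m = word_char.match(subject, pos)
    if m is None:
        break              # the fifth attempt fails at the space
    found.append(m.group())
    pos = m.end()          # continue where the previous match ended

print(found)   # ['t', 'e', 's', 't']
```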
If a match attempt fails, the stored position for \G is reset to the start of the string. To avoid this, specify the continuation
modifier /c.
All this is very useful to make several regular expressions work together. E.g. you could parse an HTML file in the following
fashion:
while ($string =~ m/</g)
{
    if ($string =~ m/\Gstrong>/c)
    {
        # Bold
    }
    elsif ($string =~ m/\Gem>/c)
    {
        # Italics
    }
    else
    {
        # ...etc...
    }
}
The regex in the while loop searches for the tag's opening bracket, and the regexes inside the loop check which tag we found.
This way you can parse the tags in the file in the order they appear in the file, without having to write a single big regex that
matches all tags you are interested in.
In this tutorial, replacement strings are shown as replace like you would enter them in the Replace box of an application.
Table of Contents
Literal Characters and Special Characters
The simplest replacement text consists of only literal characters. Certain characters have special meanings in replacement
strings and have to be escaped. Escaping rules may get a bit complicated when using replacement strings in software source
code.
Non-Printable Characters
Non-printable characters such as control characters and special spacing or line break characters are easier to enter using
control character escapes or hexadecimal escapes.
Matched Text
Reinserting the entire regex match into the replacement text allows a search-and-replace to insert text before and after regular
expression matches without really replacing anything.
Backreferences
Backreferences to named and numbered capturing groups in the regular expression allow the replacement text to reuse parts
of the text matched by the regular expression.
Match Context
Some applications support special tokens in replacement strings that allow you to insert the subject string or the part of the
subject string before or after the regex match. This can be useful when the replacement text syntax is used to collect search
matches and their context instead of making replacements in the subject string.
Case Conversion
Some applications can insert the text matched by the regex or by capturing groups converted to uppercase or lowercase.
Conditionals
Some applications can use one replacement or another replacement depending on whether a capturing group participated in
the match. This allows you to use different replacements for different matches of the regular expression.
Special Characters
The most basic replacement string consists only of literal characters. The replacement string "replacement" simply replaces each
regex match with the text "replacement".
Because we want to be able to do more than simply replace each regex match with the exact same text, we need to reserve
certain characters for special use. In most replacement text flavors, two characters tend to have special meanings: the
backslash \ and the dollar sign $. Whether and how to escape them depends on the application you're using. In some
applications, you always need to escape them when you want to use them as literal characters. In other applications, you only
need to escape them if they would form a replacement text token with the character that follows.
In Delphi, you can use a backslash to escape the backslash and the dollar, and you can use a dollar to escape the dollar. \\
replaces with a literal backslash, while \$ and $$ replace with a literal dollar sign. You only need to escape them to suppress
their special meaning in combination with other characters. In \! and $!, the backslash and dollar are literal characters
because they don't have a special meaning in combination with the exclamation point. You can't and needn't escape the
exclamation point or any other character except the backslash and dollar, because they have no special meaning in Delphi
replacement strings.
In .NET, JavaScript, VBScript, XRegExp, PCRE2, and std::regex you can escape the dollar sign with another dollar sign. $$
replaces with a literal dollar sign. XRegExp and PCRE2 require you to escape all literal dollar signs. They treat unescaped
dollar signs that don't form valid replacement text tokens as errors. In .NET, JavaScript (without XRegExp), and VBScript you
only need to escape the dollar sign to suppress its special meaning in combination with other characters. In $\ and $!, the
dollar is a literal character because it doesn't have a special meaning in combination with the backslash or exclamation point.
You can't and needn't escape the backslash, exclamation point, or any other character except dollar, because they have no
special meaning in .NET, JavaScript, VBScript, and PCRE2 replacement strings.
In Java, an unescaped dollar sign that doesn't form a token is an error. You must escape the dollar sign with a backslash or
another dollar sign to use it as a literal character. $! is an error because the dollar sign is not escaped and has no special
meaning in combination with the exclamation point. A backslash always escapes the character that follows. \! replaces with a
literal exclamation point, and \\ replaces with a single backslash. A single backslash at the end of the replacement text is an
error.
In Python and Ruby, the dollar sign has no special meaning. You can use a backslash to escape the backslash. You only need
to escape the backslash to suppress its special meaning in combination with other characters. In \!, the backslash is a literal
character because it doesn't have a special meaning in combination with the exclamation point. You can't and needn't escape
the exclamation point or any other character except the backslash, because they have no special meaning in Python and
Ruby replacement strings. An unescaped backslash at the end of the replacement text, however, is an error in Python but a
literal backslash in Ruby.
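As a small illustration of the Python rules just described, the sketch below uses the standard re module. The helper name escape_replacement is hypothetical, and the behavior shown assumes Python 3.7 or later.

```python
import re

def escape_replacement(text):
    """Double every backslash so re.sub() inserts `text` literally.

    Hypothetical helper; the dollar sign needs no escaping in Python.
    """
    return text.replace("\\", "\\\\")

# The dollar sign is an ordinary character in Python replacement text.
dollar = re.sub(r"\d+", "$1", "a1b")

# An escaped backslash inserts a single backslash.
backslash = re.sub("x", r"\\", "axb")

# A backslash before the exclamation point stays a literal backslash.
bang = re.sub("x", r"\!", "axb")

# Arbitrary text, such as a Windows path, survives once it is escaped.
path = re.sub("x", escape_replacement(r"C:\temp\new"), "x")

# A lone backslash at the end of the replacement text is rejected.
try:
    re.sub("x", "\\", "axb")
    trailing_ok = True
except re.error:
    trailing_ok = False
```

Without the helper, the \t and \n in the path would be read as a tab and a line feed.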
In PHP's preg_replace, you can use a backslash to escape the backslash and the dollar. \\ replaces with a literal backslash,
while \$ replaces with a literal dollar sign. You only need to escape them to suppress their special meaning in combination
with other characters. In \!, the backslash is a literal character because it doesn't have a special meaning in combination with
the exclamation point. You can't and needn't escape the exclamation point or any other character except the backslash and
dollar, because they have no special meaning in PHP replacement strings.
In Boost, a backslash always escapes the character that follows. \! replaces with a literal exclamation point, and \\ replaces
with a single backslash. A single backslash at the end of the replacement text is ignored. An unescaped dollar sign is a literal
dollar sign if it doesn't form a replacement string token. You can escape dollar signs with a backslash or with another dollar
sign. So $, $$, and \$ all replace with a single dollar sign.
In R, the dollar sign has no special meaning. A backslash always escapes the character that follows. \! replaces with a literal
exclamation point, and \\ replaces with a single backslash. A single backslash at the end of the replacement text is ignored.
In Tcl, the ampersand & has a special meaning, and must be escaped with a backslash if you want a literal ampersand in your
replacement text. You can use a backslash to escape the backslash. You only need to escape the backslash to suppress its
special meaning in combination with other characters. In \!, the backslash is a literal character because it doesn't have a
special meaning in combination with the exclamation point. You can't and needn't escape the exclamation point or any other
character except the backslash and ampersand, because they have no special meaning in Tcl replacement strings. An
unescaped backslash at the end of the replacement text is a literal backslash.
In XPath, an unescaped backslash is an error. An unescaped dollar sign that doesn't form a token is also an error. You must
escape backslashes and dollars with a backslash to use them as literal characters. The backslash has no special meaning
other than to escape another backslash or a dollar sign.
Perl is a special case. Perl doesn't really have a replacement text syntax. So it doesn't have escape rules for replacement
texts either. In Perl source code, replacement strings are simply double-quoted strings. What looks like backreferences in
replacement text are really interpolated variables. You could interpolate them in any other double-quoted string after a regex
match, even when not doing a search-and-replace.
If you specify the replacement text as a string constant in your source code, then you have to keep in mind which characters
are given special treatment inside string constants by your programming language. That is because those characters are
processed by the compiler, before the replacement text function sees the string. So in Java, for example, to replace all regex
matches with a single dollar sign, you need to use the replacement text \$, which you need to enter in your source code as
"\\$". The Java compiler turns the escaped backslash in the source code into a single backslash in the string that is passed
on to the replaceAll() function. That function then sees the single backslash and the dollar sign as an escaped dollar sign.
See the tools and languages section for more information on how to use replacement strings in various programming
languages.
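The same two layers of escaping exist in Python, which is used here only as an illustration alongside the Java case. The sketch assumes the standard re module; the variable names are made up for the example.

```python
import re

# To insert one literal backslash, re.sub() must receive the
# two-character replacement text \\ (an escaped backslash).
as_plain_literal = "\\\\"  # the compiler halves the four backslashes
as_raw_literal = r"\\"     # a raw string skips the compiler's layer

same_text = as_plain_literal == as_raw_literal
seen_by_engine = len(as_raw_literal)  # characters re.sub() receives
result = re.sub("/", as_raw_literal, "a/b")
```

Raw string literals remove the first layer, so only the replacement text syntax itself has to be considered.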
Non-Printable Characters
Most applications and programming languages do not support any special syntax in the replacement text to make it easier to
enter non-printable characters. If you are the end user of an application, that means you'll have to use an application such as
the Windows Character Map to help you enter characters that you cannot type on your keyboard. If you are programming, you
can specify the replacement text as a string constant in your source code. Then you can use the syntax for string constants in
your programming language to specify non-printable characters.
Python is an exception. It allows you to use special escape sequences to enter a few common control characters. Use \t to
replace with a tab character (ASCII 0x09), \r for carriage return (0x0D), and \n for line feed (0x0A). Remember that
Windows text files use \r\n to terminate lines, while UNIX text files use \n.
Python supports the above escape sequences in replacement text, in addition to supporting them in string constants. Python
and Boost also support these more exotic non-printables: \a (bell, 0x07), \f (form feed, 0x0C) and \v (vertical tab, 0x0B).
Boost also supports hexadecimal escapes. You can use \uFFFF or \x{FFFF} to insert a Unicode character. The euro
currency sign occupies Unicode code point U+20AC. If you cannot type it on your keyboard, you can insert it into the
replacement text with \x{20AC}. For the 127 ASCII characters, you can use \x00 through \x7F. If you are using Boost with
8-bit character strings, you can also use \x80 through \xFF to insert characters from those 8-bit code pages.
Python does not support hexadecimal escapes in the replacement text syntax, even though it supports \xFF and \uFFFF in
string constants.
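A short sketch of these Python rules, assuming the standard re module on Python 3.7 or later, where an unsupported escape raises re.error:

```python
import re

# \t and \n are interpreted by re.sub() itself, even in a raw string.
tabbed = re.sub(",", r"\t", "a,b,c")
lines = re.sub(";", r"\n", "x;y")

# Hexadecimal escapes are not part of the replacement text syntax.
try:
    re.sub(",", r"\x09", "a,b")
    hex_supported = True
except re.error:
    hex_supported = False

# In an ordinary string constant, Python converts \x09 to a tab
# before re.sub() ever sees the replacement text.
via_literal = re.sub(",", "\x09", "a,b")
```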
Regex Syntax versus String Syntax
Many programming languages support escapes for non-printable characters in their syntax for literal strings in source code.
Then such escapes are translated by the compiler into their actual characters before the string is passed to the search-
and-replace function. If the search-and-replace function does not support the same escapes, this can cause an apparent
difference in behavior when a regex is specified as a literal string in source code compared with a regex that is read from a file
or received from user input. For example, JavaScript's string.replace() function does not support any of these escapes.
But the JavaScript language does support escapes like \n, \x0A, and \u000A in string literals. So when developing an
application in JavaScript, \n is only interpreted as a newline when you add the replacement text as a string literal to your
source code. The JavaScript interpreter then translates \n and the string.replace() function sees an actual newline
character. If your code reads the same replacement text from a file, then the string.replace() function sees \n, which it
treats as a literal backslash and a literal n.
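For comparison, Python behaves differently here because re.sub() interprets \n on its own. The sketch below simulates replacement text read from a file by building the two-character string by hand; the variable names are illustrative.

```python
import re

from_source = "\n"      # one character: the compiler produced a newline
from_file = "\\" + "n"  # two characters, as if read from a file

# Python's re.sub() interprets the two-character form as well.
a = re.sub(",", from_source, "a,b")
b = re.sub(",", from_file, "a,b")

# A function replacement bypasses replacement text parsing entirely,
# so the backslash and the n are inserted as literal characters.
c = re.sub(",", lambda m: from_file, "a,b")
```

The last call mirrors the JavaScript behavior described above, where the text from the file is inserted as a backslash followed by an n.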
Matched Text
Reinserting the entire regex match into the replacement text allows a search-and-replace to insert text before and after regular
expression matches without really replacing anything. It also allows you to replace the regex match with something that
contains the regex match. For example, you could replace all matches of the regex http://\S+ with
<a href="$&">$&</a> to turn all URLs in a file into HTML anchors that link to those URLs.
$& is substituted with the whole regex match in the replacement text in Delphi, .NET, JavaScript, and VBScript. It is also the
variable that holds the whole regex match in Perl. \& works in Delphi and Ruby. In Tcl, & all by itself represents the whole
regex match, while $& is a literal dollar sign followed by the whole regex match, and \& is a literal ampersand.
In Boost and std::regex your choice of replacement format changes the meaning of & and $&. When using the sed
replacement format, & represents the whole regex match and $& is a literal dollar sign followed by the whole regex match.
When using the default (ECMAScript) or "all" replacement format, & is a literal ampersand and $& represents the whole
regex match.
The overall regex match is usually treated as an implied capturing group number zero. In many applications you can use the
syntax for backreferences in the replacement text to reference group number zero and thus insert the whole regex match in
the replacement text. You can do this with $0 in Delphi, .NET, Java, XRegExp, PCRE2, PHP, and XPath. \0 works in
Delphi, Ruby, PHP, and Tcl.
Python does not support any of the above. In Python you can reinsert the whole regex match by using the syntax for named
backreferences with group number zero: \g<0>. Delphi does not support this, even though it supports named backreferences
using this syntax.
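The URL example from the start of this section could be written in Python as follows. This is a minimal sketch using the standard re module; the sample text is invented.

```python
import re

text = "See http://example.com/docs and http://example.org/ for details."

# \g<0> reinserts the whole regex match, here twice per replacement.
linked = re.sub(r"http://\S+", r'<a href="\g<0>">\g<0></a>', text)
```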
Numbered Backreferences
If your regular expression has named or numbered capturing groups, then you can reinsert the text matched by any of those
capturing groups in the replacement text. Your replacement text can reference as many groups as you like, and can even
reference the same group more than once. This makes it possible to rearrange the text matched by a regular expression in
many different ways. As a simple example, the regex \*(\w+)\* matches a single word between asterisks, storing the word
in the first (and only) capturing group. The replacement text <em>\1</em> replaces each regex match with the text stored by
the capturing group between em tags. Effectively, this search-and-replace replaces the asterisks with em tags, leaving the
word between the asterisks in place. This technique using backreferences is important to understand. Replacing *word* as a
whole with <em>word</em> is far easier and far more efficient than trying to come up with a way to correctly replace the
asterisks separately.
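A minimal Python sketch of this search-and-replace, using the standard re module. The sample strings and the second regex are invented for illustration.

```python
import re

# Replace the asterisks around a word with em tags, keeping the word.
emphasized = re.sub(r"\*(\w+)\*", r"<em>\1</em>",
                    "This is *very* important, *really*.")

# Backreferences can also rearrange the matched text.
reordered = re.sub(r"(\w+), (\w+)", r"\2 \1", "Goyvaerts, Jan")
```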
The \1 syntax for backreferences in the replacement text is borrowed from the syntax for backreferences in the regular
expression. \1 through \9 are supported by Delphi, Perl (though deprecated), Python, Ruby, PHP, R, Boost, and Tcl.
Double-digit backreferences \10 through \99 are supported by Delphi, Python, and Boost. If there are not enough capturing
groups in the regex for the double-digit backreference to be valid, then all these flavors treat \10 through \99 as a single-digit
backreference followed by a literal digit. The flavors that support single-digit backreferences but not double-digit
backreferences also do this.
$1 through $99 for single-digit and double-digit backreferences are supported by Delphi, .NET, Java, JavaScript, VBScript,
PCRE2, PHP, Boost, std::regex, and XPath. These are also the variables that hold text matched by capturing groups in Perl.
If there are not enough capturing groups in the regex for a double-digit backreference to be valid, then $10 through $99 are
treated as a single-digit backreference followed by a literal digit by all these flavors except .NET, Perl, PCRE2, and std::regex.
Putting curly braces around the digit ${1} isolates the digit from any literal digits that follow. This works in Delphi, .NET, Perl,
PCRE2, PHP, Boost, and XRegExp.
Named Backreferences
If your regular expression has named capturing groups, then you should use named backreferences to them in the
replacement text. The regex (?'name'group) has one group called 'name'. You can reference this group with ${name} in
Delphi, .NET, PCRE2, Java 7, and XRegExp. PCRE2 also supports $name without the curly braces. In Perl 5.10 and later you
can interpolate the variable $+{name}. Boost too uses $+{name} in replacement strings. ${name} does not work in any
version of Perl. $name is unique to PCRE2.
In Python and Delphi, if you have the regex (?P<name>group) then you can use its match in the replacement text with
\g<name>. Python, but not Delphi, also supports numbered backreferences using this syntax. In Python this is the only way to
have a numbered backreference immediately followed by a literal digit.
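Both uses of \g<...> in Python can be sketched as follows, assuming the standard re module. The group names and sample strings are invented.

```python
import re

# Named backreferences: swap a year-month pair.
swapped = re.sub(r"(?P<year>\d{4})-(?P<month>\d{2})",
                 r"\g<month>/\g<year>", "2019-05")

# Numbered backreference followed by a literal digit: \g<1>0 means
# "group 1, then the digit 0", which \10 could not express.
padded = re.sub(r"(\d)", r"\g<1>0", "5 7")
```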
PHP and R support named capturing groups and named backreferences in regular expressions. But they do not support
named backreferences in replacement texts. You'll have to use numbered backreferences in the replacement text to reinsert
text matched by named groups. To determine the numbers, count the opening parentheses of all capturing groups (named
and unnamed) in the regex from left to right.
In most applications, there is no difference between a backreference in the replacement string to a group that matched the
empty string or a group that did not participate. Both are replaced with an empty string. Two exceptions are Python and
PCRE2. They do allow backreferences in the replacement string to optional capturing groups. But the search-and-replace will
return an error code in PCRE2 if the capturing group happens not to participate in one of the regex matches. The same
situation raises an exception in Python 3.4 and prior. Python 3.5 no longer raises the exception.
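The Python 3.5 behavior can be checked with a short sketch like the one below, which assumes Python 3.5 or later. The lambda is one possible way, not the only one, to treat a non-participating group differently from an empty one.

```python
import re

# Group 1 does not participate when the regex matches "b".
# Python 3.5 and later substitute an empty string for it.
result = re.sub(r"(a)|b", r"[\1]", "ab")

# With a function replacement, a non-participating group shows up as
# None, so it can be told apart from a group that matched nothing.
marked = re.sub(r"(a)|b",
                lambda m: "<none>" if m.group(1) is None else m.group(1),
                "ab")
```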
Boost 1.42 added additional syntax of its own invention for either meaning of highest-numbered group. $^N,
$LAST_SUBMATCH_RESULT, and ${^LAST_SUBMATCH_RESULT} all insert the text matched by the highest-numbered group
that actually participated in the match. $LAST_PAREN_MATCH and ${^LAST_PAREN_MATCH} both insert the text matched by
the highest-numbered group regardless of whether it participated in the match.
Match Context
Some applications support special tokens in replacement strings that allow you to insert the subject string or the part of the
subject string before or after the regex match. This can be useful when the replacement text syntax is used to collect search
matches and their context instead of making replacements in the subject string.
In the replacement text, $` is substituted with the part of the subject string to the left of the regex match in Delphi, .NET,
JavaScript, VBScript, Boost, and std::regex. It is also the variable that holds the part of the subject string to the left of the
regex match in Perl. \` works in Delphi and Ruby.
In the same applications, you can use $' or \' to insert the part of the subject string to the right of the regex match.
In the replacement text, $_ is substituted with the entire subject string in Delphi and .NET. In Perl, $_ is the default variable
that the regex is applied to if you use a regular expression without the matching operator =~. \_ is just an escaped
underscore. It has no special meaning in any application.
Boost 1.42 added some alternative syntax of its own invention. $PREMATCH and ${^PREMATCH} are synonyms for $`.
$POSTMATCH and ${^POSTMATCH} are synonyms for $'.
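Python's re module has no replacement tokens for match context, but the same information can be read from the match object. The sketch below is one way to do that; the function name match_context is hypothetical.

```python
import re

def match_context(pattern, subject):
    """Return (before, match, after) for the first match, or None."""
    m = re.search(pattern, subject)
    if m is None:
        return None
    before = m.string[:m.start()]  # the text $` would insert
    after = m.string[m.end():]     # the text $' would insert
    return before, m.group(0), after

context = match_context(r"\d+", "abc123def")
```

Here m.string holds the entire subject string, which is what $_ inserts in the flavors that support it.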
Perl's case conversion escapes also work in replacement texts. The most common use is to change the case of an
interpolated variable. \U converts everything up to the next \L or \E to uppercase. \L converts everything up to the next \U
or \E to lowercase. \u converts the next character to uppercase. \l converts the next character to lowercase. You can
combine these into \l\U to make the first character lowercase and the remainder uppercase, or \u\L to make the first
character uppercase and the remainder lowercase. \E turns off case conversion. You cannot use \u or \l after \U or \L
unless you first stop the sequence with \E.
When the regex (?i)(hello) (world) matches HeLlO WoRlD the replacement text \l\U$1\E \u\L$2 becomes
hELLO World. Literal text is also affected. \U$1 Dear $2 becomes HELLO DEAR WORLD.
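Python's re module does not support these case conversion escapes in replacement text. The first example above can be emulated with a replacement function, as in this sketch; the function name perl_style is hypothetical.

```python
import re

def perl_style(m):
    """Emulate the Perl replacement text \\l\\U$1\\E \\u\\L$2."""
    first, second = m.group(1), m.group(2)
    # \l\U$1\E: first character lowercase, remainder uppercase.
    part1 = first[:1].lower() + first[1:].upper()
    # \u\L$2: first character uppercase, remainder lowercase.
    part2 = second[:1].upper() + second[1:].lower()
    return part1 + " " + part2

result = re.sub(r"(?i)(hello) (world)", perl_style, "HeLlO WoRlD")
```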
Perl's case conversion works in regular expressions too. But it doesn't work the way you might expect. Perl applies case
conversion when it parses a string in your script and interpolates variables. That works great with backreferences in
replacement texts, because those are really interpolated variables in Perl. But backreferences in the regular expression are
regular expression tokens rather than variables. (?-i)(a)\U\1 matches aa but not aA. \1 is converted to uppercase while
the regex is parsed, not during the matching process. Since \1 does not include any letters, this has no effect. In the regex
\U\w, \w is converted to uppercase while the regex is parsed. This means that \U\w is the same as \W, which matches any
character that is not a word character.
Boost's Replacement String Case Conversion
Boost supports case conversion in replacement strings when using the default replacement format or the "all" replacement
format. \U converts everything up to the next \L or \E to uppercase. \L converts everything up to the next \U or \E to
lowercase. \u converts the next character to uppercase. \l converts the next character to lowercase. \E turns off case
conversion. As in Perl, the case conversion affects both literal text in your replacement string and the text inserted by
backreferences.
Where Boost differs from Perl is that combining these needs to be done the other way around. \U\l makes the first character
lowercase and the remainder uppercase. \L\u makes the first character uppercase and the remainder lowercase. Boost also
allows \l inside a \U sequence and a \u inside a \L sequence. So when (?i)(hello) (world) matches HeLlO WoRlD
you can use \L\u\1 \u\2 to replace the match with Hello World.
Unlike in Perl, in PCRE2 \U, \L, \u, and \l all stop any preceding case conversion. So you cannot combine \L and \u, for
example, to make the first character uppercase and the remainder lowercase. \L\u makes the first character uppercase and
leaves the rest unchanged, just like \u. \u\L makes all characters lowercase, just like \L.
In PCRE2, case conversion runs through conditionals. Any case conversion in effect before the conditional also applies to the
conditional. If the conditional contains its own case conversion escapes in the part of the conditional that is actually used, then
those remain in effect after the conditional. So you could use ${1:+\U:\L}${2} to insert the text matched by the second
capturing group in uppercase if the first group participated, and in lowercase if it didn't.
When the regex (?i)(Hello) (World) matches HeLlO WoRlD the replacement string \U$1 \L$2 becomes
HELLO world. Literal text is not affected. \U$1 Dear $2 becomes HELLO Dear WORLD.
For conditionals to work in Boost, you need to pass regex_constants::format_all to regex_replace. For them to
work in PCRE2, you need to pass PCRE2_SUBSTITUTE_EXTENDED to pcre2_substitute.
The matched and unmatched parts can be blank. You can omit the colon if the unmatched part is blank. So (?1matched:)
and (?1matched) replace with matched when the group participates. They replace the match with nothing when the group
does not participate.
You can use the full replacement string syntax in matched and unmatched. This means you can nest conditionals inside
other conditionals. So (?1one(?2two):(?2two:none)) replaces with onetwo when both groups participate, with one or
two when group 1 or 2 participates and the other doesn't, and with none when neither group participates. With Boost
?1one(?2two):?2two:none does exactly the same but omits parentheses that aren't needed.
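For readers who think more easily in code, the logic of the nested conditional can be emulated in Python with a replacement function. This is only a sketch of the decision logic, not of Boost's syntax. The regex x(a)?(b)? and the function name describe are invented for the example.

```python
import re

def describe(m):
    """Mirror (?1one(?2two):(?2two:none)) for groups 1 and 2."""
    one = "one" if m.group(1) is not None else ""
    two = "two" if m.group(2) is not None else ""
    # Both parts are empty only when neither group participated.
    return (one + two) or "none"

result = re.sub(r"x(a)?(b)?", describe, "xab xa xb x")
```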
Boost treats conditionals that reference a non-existing group number as conditionals to a group that never participates in the
match. So (?12twelve:not twelve) always replaces with not twelve when there are fewer than 12 capturing groups in
the regex.
You can avoid the ambiguity between single digit and double digit conditionals by placing curly braces around the number.
(?{1}1:0) replaces with 1 when group 1 participates and with 0 when it doesn't, even if there are 11 or more capturing
groups in the regex. (?{12}twelve:not twelve) is always a conditional that references group 12, even if there are fewer
than 12 groups in the regex (which may make the conditional invalid).
The syntax with curly braces also allows you to reference named capturing groups by their names.
(?{name}matched:unmatched) replaces with matched when the group "name" participates in the match and with
unmatched when it doesn't. If the group does not exist, Boost treats conditionals that reference a non-existing group name as
literals. So (?{nonexisting}matched:unmatched) uses ?{nonexisting}matched:unmatched as a literal
replacement.
matched is used as the replacement for matches in which the capturing group participated. unmatched is used for matches
in which the group did not participate. :+ delimits the group number or name from the first part of the conditional. The second
colon delimits the two parts. If you want a literal colon in the matched part, then you need to escape it with a backslash. If you
want a literal closing curly brace anywhere in the conditional, then you need to escape that with a backslash too. Plus signs
have no special meaning beyond the :+ that starts the conditional, so they don't need to be escaped.
You can use the full replacement string syntax in matched and unmatched. This means you can nest conditionals inside
other conditionals. So ${1:+one${2:+two}:${2:+two:none}} replaces with onetwo when both groups participate, with
one or two when group 1 or 2 participates and the other doesn't, and with none when neither group participates.
PCRE2 treats conditionals that reference non-existing capturing groups as an error.
Escaping Question Marks, Colons, Parentheses, and Curly Braces
As explained above, you need to use backslashes to escape colons that you want to use as literals when used in the
matched part of the conditional. You also need to escape literal closing parentheses (Boost) or curly braces (PCRE2) with
backslashes inside conditionals.
In replacement string flavors that support conditionals, you can escape colons, parentheses, curly braces, and even question
marks with backslashes to make sure they are interpreted as literals anywhere in the replacement string. But generally there is
no need to.
The colon does not have any special meaning in the unmatched part or outside conditionals. So you don't need to escape it
there. The question mark does not have any special meaning if it is not followed by a digit or a curly brace. In PCRE2 it never
has a special meaning. So you only need to escape question marks with backslashes if you want to use a literal question
mark followed by a literal digit or curly brace as the replacement in Boost.
Boost always uses parentheses for grouping. An unescaped opening parenthesis always opens a group. Groups can be
nested. An unescaped closing parenthesis always closes a group. An unescaped closing parenthesis that does not have a
matching opening parenthesis effectively truncates the replacement string. So Boost requires you to always escape literal
parentheses with backslashes.
Programming Languages and Libraries
If you are a programmer, you can save a lot of coding time by using regular expressions. With a regular expression, you can
do powerful string parsing in only a handful of lines of code, or maybe even just a single line. A regex is faster to write and easier
to debug and maintain than dozens or hundreds of lines of code to achieve the same by hand.
Boost - Free C++ source libraries with comprehensive regex support that was later standardized by C++11. But there are
significant differences in Boost's regex flavors and the flavors in std::regex implementations.
Delphi - Delphi XE and later ship with RegularExpressions and RegularExpressionsCore units that wrap the PCRE
library. For older Delphi versions, you can use the TPerlRegEx component, which is the regex unit that the actual
RegularExpressionsCore unit is based on.
Gnulib - Gnulib or the GNU Portability Library includes many modules, including a regex module. It implements
both POSIX flavors, as well as these two flavors with added GNU extensions.
Groovy - Groovy uses Java's java.util.regex package for regular expressions support. Groovy adds only a few language
enhancements that allow you to instantiate the Pattern and Matcher classes with far fewer keystrokes.
Java - Java 4 and later include an excellent regular expressions library in the java.util.regex package.
JavaScript - If you use JavaScript to validate user input on a web page at the client side, using JavaScript's built-in regular
expression support will greatly reduce the amount of code you need to write.
.NET (dot net) - Microsoft's development framework includes a poorly documented but very powerful regular expression
package that you can use in any .NET-based programming language such as C# (C sharp) or VB.NET.
PCRE - Popular open source regular expression library written in ANSI C that you can link directly into your C and C++
applications, or use through an .so (UNIX/Linux) or a .dll (Windows).
Perl - The text-processing language that gave regular expressions a second life, and introduced many new features. Regular
expressions are an essential part of Perl.
PHP - Popular language for creating dynamic web pages, with three sets of regex functions. Two implement POSIX ERE,
while the third is based on PCRE.
POSIX - The POSIX standard defines two regular expression flavors that are implemented in many applications, programming
languages and systems.
PowerShell - Windows PowerShell is a programming language from Microsoft that is primarily designed for system
administration. Since PowerShell is built on top of .NET, its built-in regex operators -match and -replace use the .NET
regex flavor. PowerShell can also access the .NET Regex classes directly.
Python - Popular high-level scripting language with a comprehensive built-in regular expression library.
R - The R Language is the programming language used in the R Project for statistical computing. It has built-in support for
regular expressions based on POSIX and PCRE.
Ruby - Another popular high-level scripting language with comprehensive regular expression support as a language feature.
std::regex - Regex support that is part of the standard C++ library defined in C++11 and previously in TR1.
Tcl - Tcl, a popular "glue" language, offers three regex flavors: two POSIX-compatible flavors and an "advanced" Perl-style
flavor.
VBScript - Microsoft scripting language used in ASP (Active Server Pages) and Windows scripting, with a built-in RegExp
object implementing the regex flavor defined in the JavaScript standard.
Visual Basic 6 - Last version of Visual Basic for Win32 development. You can use the VBScript RegExp object in your VB6
applications.
wxWidgets - Popular open source windowing toolkit. The wxRegEx class encapsulates the "Advanced Regular Expression"
engine originally developed for Tcl.
XML Schema - The W3C XML Schema standard defines its own regular expression flavor for validating simple types using
pattern facets.
Xojo - Cross-platform development tool formerly known as REALbasic, with a built-in RegEx class based on PCRE.
XQuery and XPath - The W3C standard for XQuery 1.0 and XPath 2.0 Functions and Operators extends the XML Schema
regex flavor to make it suitable for full text search.
XRegExp - Open source JavaScript library that enhances the regex syntax and eliminates many cross-browser
inconsistencies and bugs.
Databases
Modern databases often offer built-in regular expression features that can be used in SQL statements to filter columns using a
regular expression. With some databases you can also use regular expressions to extract the useful part of a column, or to
modify columns using a search-and-replace.
PostgreSQL - PostgreSQL provides matching operators and extraction and substitution functions using the "Advanced
Regular Expression" engine also used by Tcl.
MySQL - MySQL's REGEXP operator works just like the LIKE operator, except that it uses a POSIX Extended Regular
Expression.
Oracle - Oracle Database 10g adds 4 regular expression functions that can be used in SQL and PL/SQL statements to filter
rows and to extract and replace regex matches. Oracle implements POSIX Extended Regular Expressions.
C++ Regular Expressions with Boost
Boost is a free source code library for C++. After downloading and unzipping, you need to run the bootstrap batch file or
script and then run b2 --with-regex to compile Boost's regex library. Then add the folder into which you unzipped Boost to
the include path of your C++ compiler. Add the stage\lib subfolder of that folder to your linker's library path. Then you can
add #include <boost/regex.hpp> to your C++ code to make use of Boost regular expressions.
If you use C++Builder, you should download the Boost libraries for your specific version of C++Builder from Embarcadero. The
version of Boost you get depends on your version of C++Builder and whether you're targeting Win32 or Win64. The Win32
compiler in XE3 through XE8, and the classic Win32 compiler in C++Builder 10 Seattle through 10.1 Berlin are all stuck on
Boost 1.39. The Win64 compiler in XE3 through XE6 uses Boost 1.50. The Win64 compiler in XE7 through 10.1 Berlin uses
Boost 1.55. The new C++11 Win32 compiler in C++Builder 10 and later uses the same version of Boost as the Win64
compiler.
This guide covers Boost 1.38, 1.39, and 1.42 through the latest 1.62. Boost 1.40 introduced many new regex features
borrowed from Perl 5.10. But it also introduced some serious bugs that weren't fixed until Boost 1.42. So we completely ignore
Boost 1.40 and 1.41. We still cover Boost 1.38 and 1.39 (which have identical regex features) because the classic Win32
C++Builder compiler is stuck on this version. If you're using another compiler, you should definitely use Boost 1.42 or later to
avoid what are now old bugs. You should preferably use Boost 1.47 or later as this version changes certain behaviors
involving backreferences that may change how some of your regexes behave if you later upgrade from pre-1.47 to post-1.47.
In practice, you'll mostly use Boost's ECMAScript grammar. It's the default grammar and offers far more features than the
other grammars. Whenever the tutorial on this website mentions Boost without mentioning any grammars then what is written
applies to the ECMAScript grammar and may or may not apply to any of the other grammars. You'll really only use the other
grammars if you want to reuse existing regular expressions from old POSIX code or UNIX scripts.
But when you run your C++ application then it can make a big difference whether it is Dinkumware or Boost that is interpreting
your regular expressions. Though both offer the same six grammars, their syntax and behavior are not the same between the
two libraries. Boost defines regex_constants::perl which is not part of the C++11 standard. This is not actually an
additional grammar but simply a synonym to ECMAScript and JavaScript. There are major differences in the regex flavors
used by actual JavaScript and actual Perl. So it's obvious that a library treating these as one flavor or grammar can't be
compatible with either. Boost's ECMAScript grammar is a cross between the actual JavaScript and Perl flavors, with a bunch
of Boost-specific features and peculiarities thrown in. Dinkumware's ECMAScript grammar is closer to actual JavaScript, but
still has significant behavioral differences. Dinkumware didn't borrow any features from Perl that JavaScript doesn't have.
The table below highlights the most important differences between the ECMAScript grammars in std::regex and Boost and
actual JavaScript and Perl. Some are obvious differences in feature sets. But others are subtle differences in behavior.
Feature                 std::regex       Boost          JavaScript       Perl
Empty character class   Fails to match   not possible   Fails to match   not possible
Internally the RegularExpressions unit uses the RegularExpressionsCore unit which defines the TPerlRegEx class.
TPerlRegEx, developed by the author of this tutorial, is a wrapper around the open source PCRE library. Thus both the
RegularExpressions and RegularExpressionsCore units use the PCRE regex flavor.
In situations where performance is critical, you may want to use TPerlRegEx directly. The PCRE library is based on UTF-8
while the Delphi VCL uses UTF-16 (UnicodeString). TPerlRegEx is also based on UTF-8, giving you full control over the
conversion between UTF-16 and UTF-8. You can avoid the conversions if your own code also uses UTF8String. The
RegularExpressions unit uses UnicodeString just like the rest of the VCL, and handles the UTF-16 to UTF-8
conversion automatically.
For new code written in Delphi XE, you should definitely use the RegularExpressions unit that is part of Delphi rather than
one of the many 3rd party units that may be available. If you're dealing with UTF-8 data, use the RegularExpressionsCore
unit to avoid needless UTF-8 to UTF-16 to UTF-8 conversions.
If you have old code written using TPerlRegEx that you're migrating to Delphi XE, replace the PerlRegEx unit with
RegularExpressionsCore, and update the changed property and method names as described in the section about older
versions of Delphi.
Delphi XE RegularExpressions unit
The RegularExpressions unit defines TRegEx, TMatch, TMatchCollection, TGroup, and TGroupCollection as
records rather than as classes. That means you don't need to call Create and Free to allocate and deallocate memory.
TRegEx does have a Create constructor that you can call if you want to use the same regular expression more than once.
That way TRegEx doesn't compile the same regex twice. If you call the constructor, you can then call any of the non-static
methods that do not take the regular expression as a parameter. If you don't call the constructor, you can only call the static
(class) methods that take the regular expression as a parameter. All TRegEx methods have static and non-static overloads.
Which ones you use solely depends on whether you want to make more than one call to TRegEx using the same regular
expression.
The IsMatch method takes a string and returns True or False indicating whether the regular expression matches (part of)
the string.
The Match method takes a string and returns a TMatch record with the details of the first match. If the match fails, it returns a
TMatch record with the Success property set to False. The non-static overload of Match() takes an optional starting position
and an optional length parameter that you can use to search through only part of the input string.
The Matches method takes a string and returns a TMatchCollection record. The default Item[] property of this record
holds a TMatch for each match the regular expression found in the string. If there are no matches, the Count property of the
returned TMatchCollection record is zero.
Use the Replace method to search-and-replace all matches in a string. You can pass a TMatchEvaluator which is nothing
more than a method that takes one parameter called Match of type TMatch and returns a string. The string returned by your
method is used as a literal replacement string. If you want backreferences in your string to be replaced when using the
TMatchEvaluator overload, call the Result method on the provided Match parameter before returning the string.
Use the Split method to split a string along its regex matches. The result is returned as a dynamic array of strings. As in
.NET, text matched by capturing groups in the regular expression is also included in the returned array. If you don't like this,
remove all named capturing groups from your regex and pass the roExplicitCapture option to disable numbered
capturing groups. The non-static overload of Split() takes an optional Count parameter to indicate the maximum number
of elements that the returned array may have. In other words, the string is split at most Count-1 times. Capturing group
matches are not included in the count. So if your regex has capturing groups, the returned array may have more than Count
elements. If you pass Count, you can pass a second optional parameter to indicate the position in the string at which to start
splitting. The part of the string before the starting position is returned unsplit in the first element of the returned array.
The TMatch record provides several properties with details about the match. Success indicates if a match was found. If this
is False, all other properties and methods are invalid. Value returns the matched string. Index and Length indicate the
position in the input string and the length of the match. Groups returns a TGroupCollection record that stores a TGroup
record in its default Item[] property for each capturing group. You can use a numeric index to Item[] for numbered
capturing groups, and a string index for named capturing groups.
TMatch also provides two methods. NextMatch returns the next match of the regular expression after this one. If your
TMatch is part of a TMatchCollection you should not use NextMatch to get the next match but use
TMatchCollection.Item[] instead, in order to avoid repeating the search. TMatch.Result takes one parameter with
the replacement text as a string. It returns the string that this match would have been replaced with if you had used this
replacement text with TRegEx.Replace.
The TGroup record has Success, Value, Index and Length properties that work just like those of the TMatch.
Regular Expressions Classes for Older Versions of Delphi
TPerlRegEx was available long before Embarcadero licensed a copy for inclusion with Delphi XE. Depending on your
needs, you can download one of two versions for use with Delphi 2010 and earlier.
The latest release of TPerlRegEx is fully compatible with the RegularExpressionsCore unit in Delphi XE. For new code
written in Delphi 2010 or earlier, using the latest release of TPerlRegEx is strongly recommended. If you later migrate your
code to Delphi XE, all you have to do is replace PerlRegEx with RegularExpressionsCore in the uses clause of your
units.
The older versions of TPerlRegEx are non-visual components. This means you can put TPerlRegEx on the component
palette and drop it on a form. The original TPerlRegEx was developed when Borland's goal was to have a component for
everything on the component palette.
If you want to migrate from an older version of TPerlRegEx to the latest TPerlRegEx, start with removing any TPerlRegEx
components you may have placed on forms or data modules and instantiate the objects at runtime instead. When instantiating
at runtime, you no longer need to pass an owner component to the Create() constructor. Simply remove the parameter.
Some of the property and method names in the original TPerlRegEx were a bit unwieldy. These have been renamed in the
latest TPerlRegEx. Essentially, in all identifiers SubExpression was replaced with Group and MatchedExpression was
replaced with Matched. Here is a complete list of the changed identifiers:
StoreSubExpression StoreGroups
NamedSubExpression NamedGroup
MatchedExpression MatchedText
MatchedExpressionLength MatchedLength
MatchedExpressionOffset MatchedOffset
SubExpressionCount GroupCount
SubExpressions Groups
SubExpressionLengths GroupLengths
SubExpressionOffsets GroupOffsets
GNU's implementation of these tools follows the POSIX standard, with added GNU extensions. The effect of the GNU
extensions is that both the Basic Regular Expressions flavor and the Extended Regular Expressions flavor provide exactly the
same functionality. The only difference is that BREs use backslashes to give various characters a special meaning, while
EREs use backslashes to take away the special meaning of the same characters.
A BRE supports POSIX bracket expressions, which are similar to character classes in other regex flavors, with a few special
features. Other features using the usual metacharacters are the dot to match any character except a line break, the caret and
dollar to match the start and end of the string, and the star to repeat the token zero or more times. To match any of these
characters literally, escape them with a backslash.
The other BRE metacharacters require a backslash to give them their special meaning. The reason is that the oldest versions
of UNIX grep did not support these. The developers of grep wanted to keep it compatible with existing regular expressions,
which may use these characters as literal characters. The BRE a{1,2} matches a{1,2} literally, while a\{1,2\} matches a
or aa. Tokens can be grouped with \( and \). Backreferences are the usual \1 through \9. Only up to 9 groups are
permitted. E.g. \(ab\)\1 matches abab, while (ab)\1 is invalid since there's no capturing group corresponding to the
backreference \1. Use \\1 to match \1 literally.
On top of what POSIX BRE provides as described above, the GNU extension provides \? and \+ as an alternative syntax to
\{0,1\} and \{1,\}. It adds alternation via \|, something sorely missed in POSIX BREs. These extensions in fact mean
that GNU BREs have exactly the same features as GNU EREs, except that +, ?, |, braces and parentheses need
backslashes to give them a special meaning instead of taking it away.
All metacharacters have their meaning without backslashes, just like in modern regex flavors. You can use backslashes to
suppress the meaning of all metacharacters. Escaping a character that is not a metacharacter is an error.
The quantifiers ?, +, {n}, {n,m} and {n,} repeat the preceding token zero or once, once or more, n times, between n and m
times, and n or more times, respectively. Alternation is supported through the usual vertical bar |. Unadorned parentheses
create a group, e.g. (abc){2} matches abcabc.
POSIX ERE does not support backreferences. The GNU Extension adds them, using the same \1 through \9 syntax.
Gnulib
GNU wouldn't be GNU if you couldn't use their regular expression implementation in your own (open source) applications. To
do so, you'll need to download Gnulib. Use the included gnulib-tool to copy the regex module to your application's source
tree.
The regex module provides the standard POSIX functions regcomp() for compiling a regular expression, regerror() for
handling compilation errors, regexec() to run a search using a compiled regex, and regfree() to clean up a regex you're
done with.
Using verbose Java code to work with regular expressions in Groovy wouldn't be very groovy. Groovy has a bunch of
language features that make code using regular expressions a lot more concise. You can mix the Groovy-specific syntax with
regular Java code. It's all based on the java.util.regex package, which you'll need to import regardless.
Groovy Strings
Java has only one string style. Strings are placed between double quotes. Double quotes and backslashes in strings must be
escaped with backslashes. That yields a forest of backslashes in literal regular expressions.
Groovy has five string styles. Strings can be placed between single quotes, double quotes, triple single quotes, and triple
double quotes. Using triple single or double quotes allows the string to span multiple lines, which is handy for free-spacing
regular expressions. Unfortunately, all four of these string styles require backslashes to be escaped.
The fifth string style is provided specifically for regular expressions. The string is placed between forward slashes, and only
forward slashes (not backslashes) in the string need to be escaped. This is indeed a string style. Both /hello/ and "hello"
are literal instances of java.lang.String. Unfortunately, strings delimited with forward slashes cannot span across lines,
so you can't use them for free-spacing regular expressions.
To create a Pattern instance, simply place a tilde before the string with your regular expression. The string can use any of
Groovy's five string styles. When assigning this pattern to a variable, make sure to leave a space between the assignment
operator and the tilde.
Finally, the ==~ operator is a quick way to test whether a regex can match a string entirely. myString ==~ /regex/ is
equivalent to myString.matches(/regex/). To find partial matches, you need to use the Matcher.
Java 5 fixes some bugs and adds support for Unicode blocks. Java 6 fixes a few more bugs but doesn't add any features.
Java 7 adds named capture and Unicode scripts.
myString.matches("regex") returns true or false depending on whether the string can be matched entirely by the
regular expression. It is important to remember that String.matches() only returns true if the entire string can be
matched. In other words: "regex" is applied as if you had written "^regex$" with start and end of string anchors. This is
different from most other regex libraries, where the "quick match test" method returns true if the regex can be matched
anywhere in the string. If myString is abc then myString.matches("bc") returns false. bc matches bc, but ^bc$ (which
is really being used here) does not.
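The difference between a whole-string test and a find-anywhere test can be sketched as follows. This is a minimal illustration; the class and method names are hypothetical and are not part of the Java library.

```java
import java.util.regex.Pattern;

final class MatchesVersusFind {
    // True only if the regex matches the whole subject (String.matches semantics).
    static boolean matchesEntirely(String subject, String regex) {
        return subject.matches(regex);
    }

    // True if the regex matches anywhere in the subject (Matcher.find semantics).
    static boolean matchesSomewhere(String subject, String regex) {
        return Pattern.compile(regex).matcher(subject).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesEntirely("abc", "bc"));    // false
        System.out.println(matchesSomewhere("abc", "bc"));   // true
        System.out.println(matchesEntirely("abc", ".*bc"));  // true
    }
}
```

If a partial match is what you want, either use a Matcher as shown in matchesSomewhere(), or pad the regex so that it can consume the whole string.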
myString.replaceAll("regex", "replacement") replaces all regex matches inside the string with the replacement
string you specified. No surprises here. All parts of the string that match the regex are replaced. You can use the contents of
capturing parentheses in the replacement text via $1, $2, $3, etc. $0 (dollar zero) inserts the entire regex match. $12 is
replaced with the 12th backreference if it exists, or with the 1st backreference followed by the literal "2" if there are fewer than
12 backreferences. If there are 12 or more backreferences, it is not possible to insert the first backreference immediately
followed by the literal "2" in the replacement text.
In the replacement text, a dollar sign not followed by a digit causes an IllegalArgumentException to be thrown. If there
are less than 9 backreferences, a dollar sign followed by a digit greater than the number of backreferences throws an
IndexOutOfBoundsException. So be careful if the replacement string is a user-specified string. To insert a dollar sign as
literal text, use \$ in the replacement text. When coding the replacement text as a literal string in your source code, remember
that the backslash itself must be escaped too: "\\$".
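The replacement rules above can be tried out with a short sketch. The class and method names are hypothetical. Matcher.quoteReplacement() is a standard library method that this text does not mention; it is used here as one possible way to insert a user-specified string literally.

```java
import java.util.regex.Matcher;

final class ReplacementTextDemo {
    // Swaps "key=value" into "value=key" using the backreferences $1 and $2.
    static String swapPairs(String subject) {
        return subject.replaceAll("(\\w+)=(\\w+)", "$2=$1");
    }

    // Inserts untrusted text literally; quoteReplacement() escapes $ and \ for us.
    static String replaceDigitsLiterally(String subject, String userText) {
        return subject.replaceAll("\\d+", Matcher.quoteReplacement(userText));
    }

    public static void main(String[] args) {
        System.out.println(swapPairs("a=1, b=2"));                       // 1=a, 2=b
        System.out.println(replaceDigitsLiterally("cost: 42", "$9.99")); // cost: $9.99
        // A hand-escaped dollar sign: the replacement text is \$5
        System.out.println("cost: 42".replaceAll("\\d+", "\\$5"));       // cost: $5
    }
}
```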
myString.split("regex") splits the string at each regex match. The method returns an array of strings where each
element is a part of the original string between two regex matches. The matches themselves are not included in the array. Use
myString.split("regex", n) to get an array containing at most n items. The result is that the string is split at most n-1
times. The last item in the array is the unsplit remainder of the original string.
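A small sketch of splitting with and without a limit. The class and method names are made up for illustration.

```java
import java.util.Arrays;

final class SplitDemo {
    // Splits at every comma.
    static String[] splitAll(String subject) {
        return subject.split(",");
    }

    // Splits at most n-1 times, so the result has at most n items.
    static String[] splitAtMost(String subject, int n) {
        return subject.split(",", n);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitAll("a,b,c,d")));       // [a, b, c, d]
        System.out.println(Arrays.toString(splitAtMost("a,b,c,d", 2))); // [a, b,c,d]
    }
}
```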
Using The Pattern Class
In Java, you compile a regular expression by using the Pattern.compile() class factory. This factory returns an object of
type Pattern. E.g.: Pattern myPattern = Pattern.compile("regex"); You can specify certain options as an
optional second parameter.
If you will be using the same regular expression often in your source code, you should create a Pattern object to increase
performance. Creating a Pattern object also allows you to pass matching options as a second parameter to the
Pattern.compile() class factory. If you use one of the String methods above, the only way to specify options is to
embed a mode modifier into the regex. Putting (?i) at the start of the regex makes it case insensitive. (?m) is the equivalent of
Pattern.MULTILINE, (?s) equals Pattern.DOTALL and (?u) is the same as Pattern.UNICODE_CASE. Unfortunately,
Pattern.CANON_EQ does not have an embedded mode modifier equivalent.
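The equivalence between option flags and embedded mode modifiers can be checked with a sketch like the one below. The helper names are hypothetical; only Pattern.compile() and the flag constants come from the Java library.

```java
import java.util.regex.Pattern;

final class MatchingOptionsDemo {
    // Compiles the regex with the given option flags and searches the subject.
    static boolean findsWithFlags(String regex, int flags, String subject) {
        return Pattern.compile(regex, flags).matcher(subject).find();
    }

    // Compiles the regex without flags; any options must be embedded in the regex.
    static boolean findsWithoutFlags(String regex, String subject) {
        return Pattern.compile(regex).matcher(subject).find();
    }

    public static void main(String[] args) {
        String lines = "a\nb\nc";
        System.out.println(findsWithFlags("hello", Pattern.CASE_INSENSITIVE, "HELLO")); // true
        System.out.println(findsWithoutFlags("(?i)hello", "HELLO"));                    // true
        System.out.println(findsWithFlags("^b$", Pattern.MULTILINE, lines));            // true
        System.out.println(findsWithoutFlags("(?m)^b$", lines));                        // true
        System.out.println(findsWithoutFlags("^b$", lines));                            // false
        System.out.println(findsWithFlags("a.b", Pattern.DOTALL, "a\nb"));              // true
        System.out.println(findsWithoutFlags("(?s)a.b", "a\nb"));                       // true
    }
}
```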
Use myPattern.split("subject") to split the subject string using the compiled regular expression. This call has exactly
the same results as myString.split("regex"). The difference is that the former is faster since the regex was already
compiled.
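As an illustration, a pattern compiled once can be reused to split many subject strings. The class name, constant name, and the particular regex are assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class CompiledSplitDemo {
    // Compiled once and reused for every call to fields().
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    // Splits a line at commas, discarding whitespace around each comma.
    static String[] fields(String line) {
        return COMMA.split(line);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields("a , b,c"))); // [a, b, c]
        System.out.println(Arrays.toString(fields("x,y")));     // [x, y]
    }
}
```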
To create a Matcher object, simply call the matcher() method of your Pattern object like this: myMatcher = myPattern.matcher("subject").
If you already created a Matcher object from the same pattern, call myMatcher.reset("newsubject") instead of
creating a new matcher object, for reduced garbage and increased performance. Either way, myMatcher is now ready for
duty.
To find the first match of the regex in the subject string, call myMatcher.find(). To find the next match, call
myMatcher.find() again. When myMatcher.find() returns false, indicating there are no further matches, the next call
to myMatcher.find() will find the first match again. The Matcher is automatically reset to the start of the string when
find() fails.
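A typical loop that collects all matches, reusing one Matcher for several subject strings via an explicit reset(), might look like this. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindAllDemo {
    // Returns, for each subject, the list of all regex matches found in it.
    static List<List<String>> findAllInEach(String regex, String... subjects) {
        // One Matcher is created up front and pointed at each subject in turn.
        Matcher matcher = Pattern.compile(regex).matcher("");
        List<List<String>> all = new ArrayList<>();
        for (String subject : subjects) {
            matcher.reset(subject);
            List<String> found = new ArrayList<>();
            while (matcher.find()) {
                found.add(matcher.group());
            }
            all.add(found);
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(findAllInEach("\\d+", "10 apples, 20 pears", "7 plums"));
        // [[10, 20], [7]]
    }
}
```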
The Matcher object holds the results of the last match. Call its methods start(), end() and group() to get details about
the entire regex match and the matches between capturing parentheses. Each of these methods accepts a single int
parameter indicating the number of the backreference. Omit the parameter to get information about the entire regex match.
start() is the index of the first character in the match. end() is the index of the first character after the match. Both are
relative to the start of the subject string. So the length of the match is end() - start(). group() returns the string matched
by the regular expression or pair of capturing parentheses.
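The following sketch reads the position and text of the overall match and of a capturing group. The helper names are invented for this example; group number 0 stands for the entire regex match.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchDetailsDemo {
    // Returns {start, end} of the given group in the first match, or null if there is no match.
    static int[] firstMatchSpan(String regex, String subject, int group) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(group), m.end(group) };
    }

    // Returns the text of the given group in the first match, or null if there is no match.
    static String firstMatchText(String regex, String subject, int group) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String regex = "(\\w+)@(\\w+)";
        String subject = "mail bob@example now";
        System.out.println(Arrays.toString(firstMatchSpan(regex, subject, 0))); // [5, 16]
        System.out.println(firstMatchText(regex, subject, 0));                  // bob@example
        System.out.println(Arrays.toString(firstMatchSpan(regex, subject, 2))); // [9, 16]
        System.out.println(firstMatchText(regex, subject, 1));                  // bob
    }
}
```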
The Matcher class allows you to do a search-and-replace and compute the replacement text for each regex match in your
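own code. You can do this with the appendReplacement() and appendTail() methods. Here is how:
// Assumes myPattern is a compiled java.util.regex.Pattern.
StringBuffer myStringBuffer = new StringBuffer();
myMatcher = myPattern.matcher("subject");
while (myMatcher.find())
{
    // Matches skipped here are copied over unchanged by the next
    // appendReplacement() call or by appendTail().
    if (checkIfThisMatchShouldBeReplaced())
    {
        myMatcher.appendReplacement(myStringBuffer, computeReplacementString());
    }
}
// Copy the text that follows the last replaced match.
myMatcher.appendTail(myStringBuffer);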
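A self-contained variant of this loop is sketched below. The class name, the method name, and the rule "double every number of 10 or more" are assumptions chosen only to have something concrete to compute. Matcher.quoteReplacement() is used so the computed text is inserted literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ComputedReplacementDemo {
    // Doubles every integer >= 10 in the subject and leaves smaller ones alone.
    static String doubleLargeNumbers(String subject) {
        Matcher matcher = Pattern.compile("\\d+").matcher(subject);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            int value = Integer.parseInt(matcher.group());
            if (value >= 10) {
                String replacement = String.valueOf(value * 2);
                // Appends the text since the previous replacement, then the new text.
                matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
            }
        }
        // Appends whatever follows the last replaced match.
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleLargeNumbers("3 cats and 12 dogs")); // 3 cats and 24 dogs
    }
}
```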
The regex \w matches a word character. As a Java string, this is written as "\\w".
The same backslash-mess occurs when providing replacement strings for methods like String.replaceAll() as literal
Java strings in your Java code. In the replacement text, a dollar sign must be encoded as \$ and a backslash as \\ when you
want to replace the regex match with an actual dollar sign or backslash. However, backslashes must also be escaped in literal
Java strings. So a single dollar sign in the replacement text becomes "\\$" when written as a literal Java string. The single
backslash becomes "\\\\". Right again: 4 backslashes to insert a single one.
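The escaping rules above can be verified with a few one-line helpers. The class and method names are hypothetical; each comment shows the regex or replacement text as the regex engine sees it.

```java
final class BackslashEscapingDemo {
    // Regex: \w+   (one or more word characters)
    static boolean isWord(String subject) {
        return subject.matches("\\w+");
    }

    // Replacement text: \\   (inserts one literal backslash)
    static String slashesToBackslashes(String subject) {
        return subject.replaceAll("/", "\\\\");
    }

    // Regex: \\   (matches one literal backslash)
    static String backslashesToSlashes(String subject) {
        return subject.replaceAll("\\\\", "/");
    }

    // Replacement text: \$   (inserts one literal dollar sign)
    static String usdToDollarSign(String subject) {
        return subject.replaceAll(" USD", "\\$");
    }

    public static void main(String[] args) {
        System.out.println(isWord("abc_1"));               // true
        System.out.println(slashesToBackslashes("a/b"));   // a\b
        System.out.println(backslashesToSlashes("a\\b"));  // a/b
        System.out.println(usdToDollarSign("10 USD"));     // 10$
    }
}
```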
/g enables "global" matching. When using the replace() method, specify this modifier to replace all matches, rather than
only the first one.
/i makes the regex match case insensitive.
/m enables "multi-line mode". In this mode, the caret and dollar match before and after newlines in the subject string.
You can combine multiple modifiers by stringing them together as in /regex/gim. Notably absent is an option to make the
dot match line break characters.
Since forward slashes delimit the regular expression, any forward slashes that appear in the regex need to be escaped. E.g.
the regex 1/2 is written as /1\/2/ in JavaScript.
There is indeed no /s modifier to make the dot match all characters, including line breaks. To match absolutely any character,
you can use a character class that contains a shorthand class and its negated version, such as [\s\S].
JavaScript implements Perl-style regular expressions. However, it lacks quite a number of advanced features available in Perl
and other modern regular expression flavors:
No \A or \Z anchors to match the start or end of the string. Use a caret or dollar instead.
Lookbehind is not supported at all. Lookahead is fully supported.
No atomic grouping or possessive quantifiers.
No Unicode support, except for matching single characters with \uFFFF.
No named capturing groups. Use numbered capturing groups instead.
No mode modifiers to set matching options within the regular expression.
No conditionals.
No regular expression comments. Describe your regular expression with JavaScript // comments instead, outside the
regular expression string.
Many of these features are available in the XRegExp library for JavaScript.
Lookbehind was a major omission in JavaScript's regex syntax for the longest time. Lookbehind is part of the ECMAScript
2018 specification. As of this writing (late 2018), Google's Chrome browser is the only popular JavaScript implementation that
supports lookbehind. So if cross-browser compatibility matters, you can't use lookbehind in JavaScript.
if (myString.match(/regex/))
{
/*Success!*/
}
If you want to verify user input, you should use anchors to make sure that you are testing against the entire string. To test if
the user entered a number, use: myString.match(/^\d+$/). /\d+/ matches any string containing one or more digits, but
/^\d+$/ matches only strings consisting entirely of digits.
To do a search and replace with regexes, use the string's replace() method: myString.replace(/replaceme/g,
"replacement"). Using the /g modifier makes sure that all occurrences of "replaceme" are replaced. The second
parameter is a normal string with the replacement text.
Using a string's split() method allows you to split the string into an array of strings using a regular expression to determine
the positions at which the string is split. E.g. myArray = myString.split(/,/) splits a comma-delimited list into an
array. The commas themselves are not included in the resulting array of strings.
It is recommended that you do not use the RegExp constructor with a literal string, because in literal strings, backslashes must
be escaped. The regular expression \w+ can be created as re = /\w+/ or as re = new RegExp("\\w+"). The latter
is definitely harder to read. The regular expression \\ matches a single backslash. In JavaScript, this becomes re = /\\/
or re = new RegExp("\\\\").
Whichever way you create myregexp, you can pass it to the String methods explained above instead of a literal regular
expression: myString.replace(myregexp, "replacement").
If you want to retrieve the part of the string that was matched, call the exec() function of the RegExp object that you created,
e.g.: mymatch = myregexp.exec("subject"). This function returns an array. The zero item in the array will hold the text
that was matched by the regular expression. The following items contain the text matched by the capturing parentheses in the
regexp, if any. mymatch.length indicates the length of the mymatch[] array, which is one more than the number of capturing
groups in your regular expression. mymatch.index indicates the character position in the subject string at which the regular
expression matched. mymatch.input keeps a copy of the subject string.
Calling the exec() function also changes the lastIndex property of the RegExp object. It stores the index in the subject
string at which the next match attempt will begin. You can modify this value to change the starting position of the next call to
exec().
The test() function of the RegExp object is a shortcut to exec() != null. It takes the subject string as a parameter and
returns true or false depending on whether the regex matches part of the string or not.
You can call these methods on literal regular expressions too. /\d/.test(subject) is a quick way to test whether there
are any digits in the subject string.
$+ is not part of the standard but is supported by some browsers nonetheless. In Internet Explorer and Firefox, $+ inserts the
text matched by the highest-numbered capturing group in the regex. If the highest-numbered group did not participate in the
match, $+ is replaced with nothing. This is not the same as $+ in Perl, which inserts the text matched by the highest-
numbered capturing group that actually participated in the match.
While things like $& are actually variables in Perl that work anywhere, in JavaScript these only exist as placeholders in the
replacement string passed to the replace() function.
There are no differences in the regex flavor supported by .NET versions 2.0 through 4.6.2. There are no differences between
this flavor and the flavor supported by .NET Core either. There are a few differences between the regex flavor in .NET 1.x
compared with 2.0 and later. In .NET 2.0 a few bugs were fixed. The Unicode categories \p{Pi} and \p{Pf} are no longer
reversed, and Unicode blocks with hyphens in their names are now handled correctly. One feature was added in .NET 2.0:
character class subtraction. It works exactly the way it does in XML Schema regular expressions. The XML Schema standard
first defined this feature and its syntax.
You can then call RegexObj.IsMatch("subject") to check whether the regular expression matches the subject string.
The Regex constructor allows an optional second parameter of type RegexOptions. You could specify RegexOptions.IgnoreCase
as the final parameter to make the regex case insensitive. Other options are RegexOptions.Singleline which causes the
dot to match newlines and RegexOptions.Multiline which causes the caret and dollar to match at embedded newlines in
the subject string.
Call RegexObj.Replace("subject", "replacement") to perform a search-and-replace using the regex on the subject
string, replacing all matches with the replacement string. In the replacement string, you can use $& to insert the entire regex
match into the replacement text. You can use $1, $2, $3, etc... to insert the text matched between capturing parentheses into
the replacement text. Use $$ to insert a single dollar sign into the replacement text. To replace with the first backreference
immediately followed by the digit 9, use ${1}9. If you type $19, and there are less than 19 backreferences, the $19 will be
interpreted as literal text, and appear in the result string as such. To insert the text from a named capturing group, use
${name}. Improper use of the $ sign may produce an undesirable result string, but will never cause an exception to be raised.
RegexObj.Split("subject") splits the subject string along regex matches, returning an array of strings. The array
contains the text between the regex matches. If the regex contains capturing parentheses, the text matched by them is also
included in the array. If you want the entire regex matches to be included in the array, simply place round brackets around the
entire regular expression when instantiating RegexObj.
The Regex class also contains several static methods that allow you to use regular expressions without instantiating a Regex
object. This reduces the amount of code you have to write, and is appropriate if the same regular expression is used only
once or reused seldom. Note that member overloading is used a lot in the Regex class. All the static methods have the
same names (but different parameter lists) as other non-static methods.
Regex.IsMatch("subject", "regex") checks if the regular expression matches the subject string.
Regex.Split("subject", "regex") splits the subject string into an array of strings as described above. All these
methods accept an optional additional parameter of type RegexOptions, like the constructor.
Either way, you will get an object of class Match that holds the details about the first regex match in the subject string.
MatchObj.Success indicates if there actually was a match. If so, use MatchObj.Value to get the contents of the match,
MatchObj.Length for the length of the match, and MatchObj.Index for the start of the match in the subject string. The
start of the match is zero-based, so it effectively counts the number of characters in the subject string to the left of the match.
If the regular expression contains capturing parentheses, use the MatchObj.Groups collection. MatchObj.Groups.Count
indicates the number of capturing parentheses. The count includes the zeroth group, which is the entire regex match.
MatchObj.Groups(3).Value gets the text matched by the third pair of round brackets. MatchObj.Groups(3).Length
and MatchObj.Groups(3).Index get the length of the text matched by the group and its index in the subject string,
relative to the start of the subject string. MatchObj.Groups("name") gets the details of the named group "name".
To find the next match of the regular expression in the same subject string, call MatchObj.NextMatch() which returns a
new Match object containing the results for the second match attempt. You can continue calling MatchObj.NextMatch()
until MatchObj.Success is False.
Note that after calling RegexObj.Match(), the resulting Match object is independent from RegexObj. This means you can
work with several Match objects created by the same Regex object simultaneously.
To make your code more readable, you should use C# verbatim strings. In a verbatim string, a backslash is an ordinary
character. This allows you to write the regular expression in your C# code as the user would type it into your application. The
regex to match a backslash is written as @"\\" when using C# verbatim strings. The backslash is still an escape character in
the regular expression, so you still need to double it. But doubling is better than quadrupling. To match a word character, use
the verbatim string @"\w".
RegexOptions.ECMAScript
Passing RegexOptions.ECMAScript to the Regex() constructor changes the behavior of certain regex features to follow
the behavior prescribed in the ECMA-262 standard. This standard defines the ECMAScript language, which is better known as
JavaScript. The table below compares the differences between canonical .NET (without the ECMAScript option) and .NET in
ECMAScript mode. For reference the table also compares how JavaScript in modern browsers behaves in these areas.
Feature or Syntax                                 Canonical .NET   .NET in ECMAScript mode          JavaScript
Escaped digit that is not a valid backreference   Error            Octal escape or literal 8 or 9   Octal escape or literal 8 or 9
Backreference to non-participating group          Fails to match   Zero-length match                Zero-length match
Backreference to group 0                          Fails to match   Zero-length match                Syntactically not possible
\d                                                Unicode          ASCII                            ASCII
\w                                                Unicode          ASCII                            ASCII
\b                                                Unicode          ASCII                            ASCII
Though RegexOptions.ECMAScript brings the .NET regex engine a little bit closer to JavaScript's, there are still significant
differences between the .NET regex flavor and the JavaScript regex flavor. When creating web pages using ASP.NET on the
server and JavaScript on the client, you cannot assume the same regex to work in the same way both on the client side and the
server side even when setting RegexOptions.ECMAScript. The next table lists the more important differences between
.NET and JavaScript. RegexOptions.ECMAScript has no impact on any of these.
The table also compares the XRegExp library for JavaScript. You can use this library to bring JavaScript's regex flavor a little
bit closer to .NET's.
Feature                      .NET                                                        JavaScript
Anchors in multi-line mode   Treat only \n as a line break                               Treat \n, \r, \u2028, and \u2029 as line breaks
$ without multi-line mode    Matches before final line break and at very end of string   Matches at very end of string
Though PCRE claims to be Perl-compatible, there are more than enough differences between contemporary versions of Perl
and PCRE to consider them distinct regex flavors. Recent versions of Perl have even copied features from PCRE that PCRE
had copied from other programming languages before Perl had them, in an attempt to make Perl more PCRE-compatible.
Today PCRE is used more widely than Perl because PCRE is part of so many libraries and applications.
Philip Hazel has recently released a new library called PCRE2. The first PCRE2 release was given version number 10.00 to
make a clear break with the previous PCRE 8.36. Future PCRE releases will be limited to bug fixes. New features will go into
PCRE2 only. If you're taking on a new development project, you should consider using PCRE2 instead of PCRE. But for
existing projects that already use PCRE, it's probably best to stick with PCRE. Moving from PCRE to PCRE2 requires
significant changes to your source code (but not to your regular expressions).
Using PCRE
Using PCRE is very straightforward. Before you can use a regular expression, it needs to be converted into a binary format for
improved efficiency. To do this, simply call pcre_compile() passing your regular expression as a null-terminated string.
The function will return a pointer to the binary format. You cannot do anything with the result except pass it to the other pcre
functions.
To use the regular expression, call pcre_exec() passing the pointer returned by pcre_compile(), the character array
you want to search through, and the number of characters in the array (which need not be null-terminated). You also need to
pass a pointer to an array of integers where pcre_exec() will store the results, as well as the length of the array expressed
in integers. The length of the array should equal the number of capturing groups you want to support, plus one (for the entire
regex match), multiplied by three (!). The function will return -1 if no match could be found. Otherwise, it will return the number
of capturing groups filled plus one. If there are more groups than fit into the array, it will return 0. The first two integers in the
array with results contain the start of the regex match (counting bytes from the start of the array) and the offset of the first
byte after the regex match, respectively. The following pairs of integers contain the start and end offsets of the backreferences.
So array[n*2] is the start of capturing group n, and array[n*2+1] is the offset just past the end of capturing group n, with
capturing group 0 being the entire regex match. The length of a group is the difference between the two. The last third of the
array is workspace for pcre_exec() and holds no results.
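The sizing rule and the pair layout can be pictured with a small Python sketch. It does not call PCRE at all. It uses Python's re module to build a flat list laid out the way PCRE stores its results (start offset, then the offset just past the end). The names ovector_size and flat_offsets are made up for this illustration.

```python
import re

def ovector_size(group_count):
    # (capturing groups + 1 for the overall match) * 3 integers
    return (group_count + 1) * 3

def flat_offsets(match):
    # Flatten a Python match object into PCRE-style pairs:
    # index 2*n is the start of group n, index 2*n+1 is its end offset.
    # Groups that did not participate are reported as -1, -1.
    flat = []
    for n in range(match.re.groups + 1):
        flat.extend(match.span(n))
    return flat

m = re.search(r"(\d+)-(\d+)", "call 555-0199 now")
offsets = flat_offsets(m)
size_needed = ovector_size(m.re.groups)
```

Here offsets[0] and offsets[1] delimit the overall match, offsets[2] and offsets[3] the first group, and so on. A regex with two groups needs room for nine integers even though only six carry results.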
When you are done with a regular expression, call pcre_free() with the pointer returned by pcre_compile() to
prevent memory leaks.
The original PCRE library only supports regex matching, a job it does rather well. It provides no support for search-and-replace,
splitting of strings, etc. This may not seem like a major issue because you can easily do these things in your own
code. The unfortunate consequence, however, is that all the programming languages and libraries that use PCRE for regex
matching have their own replacement text syntax and their own idiosyncrasies when splitting strings. The new PCRE2 library
does support search-and-replace.
To compile PCRE with Unicode support, you need to define the SUPPORT_UTF8 and SUPPORT_UCP conditional defines. If
PCRE's configuration script works on your system, you can easily do this by running
./configure --enable-unicode-properties before running make.
PCRE Callout
A feature unique to PCRE is the "callout". If you put (?C1) through (?C255) anywhere in your regex, PCRE calls the
pcre_callout function when it reaches the callout during the match attempt.
If you have PCRE 8.30 or later, you can enable UTF-16 support by passing --enable-pcre16 to the configure script before
running make. Then you can pass PCRE_UTF16 to pcre16_compile() and then do the matching with pcre16_exec() if
your regular expression and subject strings are stored as UTF-16. UTF-16 uses two bytes for code points up to U+FFFF, and
four bytes for higher code points. In Visual C++, wchar_t strings use UTF-16. It's important to make sure that you do not
mix the pcre_ and pcre16_ functions. The PCRE_UTF8 and PCRE_UTF16 constants are actually the same. You need to use
the pcre16_ functions to get the UTF-16 version.
If you have PCRE 8.32 or later, you can enable UTF-32 support by passing --enable-pcre32 to the configure script before
running make. Then you can pass PCRE_UTF32 to pcre32_compile() and then do the matching with pcre32_exec() if
your regular expression and subject strings are stored as UTF-32. UTF-32 uses four bytes per character and is common for
in-memory Unicode strings on Linux. It's important to make sure that you do not mix the pcre32_ functions with the pcre16_
or pcre_ sets. Again, the PCRE_UTF8 and PCRE_UTF32 constants are the same. You need to use the pcre32_ functions to
get the UTF-32 version.
The PCRE2 Open Source Regex Library
PCRE2 is short for Perl Compatible Regular Expressions, version 2. It is the successor to the widely popular PCRE library.
Both are open source libraries written in C by Philip Hazel.
The first PCRE2 release was given version number 10.00 to make a clear break with the preceding PCRE 8.36. PCRE 8.37
through 8.39 and any future PCRE releases are limited to bug fixes. New features are added to PCRE2 only. If you're taking
on a new development project, you should consider using PCRE2 instead of PCRE. But for existing projects that already use
PCRE, it's probably best to stick with PCRE. Moving from PCRE to PCRE2 requires significant changes to your source code.
The only real reason to do so would be to use the new search-and-replace feature.
The regex syntax supported by PCRE2 10.00 through 10.21 and PCRE 8.36 through 8.39 is pretty much the same. Because
of this, the regex tutorial does not specifically mention PCRE2. Everything it says about PCRE (or PCRE versions 8.36
through 8.39 in particular) also applies to PCRE2.
The only significant new feature is a new version check conditional. The syntax looks much like a conditional that checks a
named backreference, but the inclusion of the equals sign (and other symbols not allowed in capturing group names) makes it
a syntax error in the original PCRE. In all versions of PCRE2, (?(VERSION>=10.00)yes|no) matches yes in the string
yesno. You can use any valid regex for the "yes" and "no" parts. If the version check succeeds the "yes" part is attempted.
Otherwise the "no" part is attempted. This is exactly like a normal conditional that evaluates the part before or after the vertical
bar depending on whether a capturing group participated in the match or not.
You can use >= to check for a minimum version, or = to check for a specific version. (?(VERSION=10.00)yes|no) matches
yes in PCRE2 10.00. It matches no in PCRE2 10.10 and all later versions. Omitting the minor version number is the same as
specifying .00. So (?(VERSION>=10)yes|no) matches yes in all versions of PCRE2, but (?(VERSION=10)yes|no)
only matches yes in PCRE2 10.00. If you specify the minor version number you should use two digits after the decimal point.
Three or more digits are an error as of version 10.21. Version 10.21 also changes the interpretation of single digit numbers,
including those specified with a leading zero. Since the first release was 10.00 and the second release was 10.10 there should
be no need to check for single digit numbers. You cannot omit the dot if you specify the minor version number.
(?(VERSION>=1000)yes|no) checks for version 1000.00 or greater.
This version check conditional is mainly intended for people who use PCRE2 indirectly, via an application that provides regex
support based on PCRE2 or a programming language that embeds PCRE2 but does not expose all its function calls. It allows
them to find out which version of PCRE2 the application uses. If you are developing an application with the PCRE2 C library
then you should use a function call to determine the PCRE2 version:
char version[255];
pcre2_config(PCRE2_CONFIG_VERSION, version);
8-bit, 16-bit, or 32-bit code units means that PCRE2 interprets your string as consisting of single byte characters, double byte
characters, or quad byte characters. To work with UTF-8, UTF-16, or UTF-32, you need to use the functions with the
corresponding code unit size, and pass the PCRE2_UTF option to pcre2_compile to allow characters to consist of multiple
code units. UTF-8 characters consist of 1 to 4 bytes. UTF-16 characters consist of 1 or 2 words.
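To make the difference between characters and code units concrete, here is a minimal Python sketch that counts the code units one character needs in each encoding. It only uses Python's own codecs, not PCRE2, and the helper name code_units is invented for the example.

```python
def code_units(text):
    # Number of code units needed to store `text` in each UTF encoding.
    # The "-le" codecs are used so that no byte order mark is added.
    return {
        "utf8": len(text.encode("utf-8")),            # 1 unit = 1 byte
        "utf16": len(text.encode("utf-16-le")) // 2,  # 1 unit = 2 bytes
        "utf32": len(text.encode("utf-32-le")) // 4,  # 1 unit = 4 bytes
    }

samples = {
    "a": code_units("a"),                  # ASCII letter
    "e-acute": code_units("\u00e9"),       # Latin-1 range
    "euro": code_units("\u20ac"),          # Basic Multilingual Plane
    "emoji": code_units("\U0001F600"),     # beyond U+FFFF
}
```

The lengths and offsets that you pass to the PCRE2 functions are counts of this kind, not counts of characters.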
If you want to call the PCRE2 functions without any suffix, as they are shown below, then you need to define
PCRE2_CODE_UNIT_WIDTH as 8, 16, or 32 to make the functions without a suffix use 8-bit, 16-bit, or 32-bit code units. Do so
before including the library, like this:
#define PCRE2_CODE_UNIT_WIDTH 8
#include "pcre2.h"
The functions without a suffix always use the code unit size you've defined. The functions with suffixes remain available. So
your application can use regular expressions with all three code unit sizes. But it is important not to mix them up. If the same
regex needs to be matched against UTF-8 and UTF-16 strings, then you need to compile it twice using pcre2_compile_8
and pcre2_compile_16 and then use the compiled regexes with the corresponding pcre2_match_8 and pcre2_match_16
functions.
Using PCRE2
Using PCRE2 is a bit more complicated than using PCRE. With PCRE2 you have to use various types of context to pass
certain compilation or matching options, such as line break handling. In PCRE these options could be passed directly as
option bits when compiling or matching.
Before you can use a regular expression, it needs to be converted into a binary format for improved efficiency. To do this,
simply call pcre2_compile() passing your regular expression as a string. If the string is null-terminated, you can pass
PCRE2_ZERO_TERMINATED as the second parameter. Otherwise, pass the length in code units as the second parameter. For
UTF-8 this is the length in bytes, while for UTF-16 or UTF-32 this is the length in bytes divided by 2 or 4. The 3rd parameter is
a set of options combined with binary OR. You should include PCRE2_UTF for proper UTF-8, UTF-16, or UTF-32 support. If
you omit it, you get pure 8-bit, or UCS-2, or UCS-4 character handling. Other common options are PCRE2_CASELESS,
PCRE2_DOTALL, PCRE2_MULTILINE, and so on. The 4th and 5th parameters receive error conditions. The final parameter is
a context. Pass NULL unless you need special line break handling. The function returns a pointer to memory it allocated. You
must free this with pcre2_code_free() when you're done with the regular expression.
If you need non-default line break handling, you need to call pcre2_compile_context_create(NULL) to create a new
compile context. Then call pcre2_set_newline() passing that context and one of the options like PCRE2_NEWLINE_LF or
PCRE2_NEWLINE_CRLF. Then pass this context as the final parameter to pcre2_compile(). You can reuse the same
context for as many regex compilations as you like. Call pcre2_compile_context_free() when you're done with it.
Note that in the original PCRE, you could pass PCRE_NEWLINE_LF and the like directly to pcre_compile(). This does not
work with PCRE2. PCRE2 will not complain if you pass PCRE2_NEWLINE_LF and the like to pcre2_compile(). But doing
so has no effect. You have to use the compile context.
Before you can use the compiled regular expression to find matches in a string, you need to call
pcre2_match_data_create_from_pattern() to allocate memory to store the match results. Pass the compiled regex
as the first parameter, and NULL as the second. The function returns a pointer to memory it allocated. You must free this with
pcre2_match_data_free() when you're done with the match data. You can reuse the same match data for multiple calls
to pcre2_match().
To find a match, call pcre2_match() and pass the compiled regex, the subject string, the length of the entire string, the
offset of the character where the match attempt must begin, matching options, the pointer to the match data object, and NULL
for the context. The length and starting offset are in code units, not characters. The function returns a positive number when
the match succeeds. PCRE2_ERROR_NOMATCH indicates no match was found. Any other non-positive return value indicates
an error. Error messages can be obtained with pcre2_get_error_message().
To find out which part of the string matched, call pcre2_get_ovector_pointer(). This returns a pointer to an array of
PCRE2_SIZE values. You don't need to free this pointer. It will become invalid when you call pcre2_match_data_free().
The number of offset pairs filled in is the value returned by pcre2_match(). The first two values in the array are the start and end of the
overall match. The second pair is the match of the first capturing group, and so on. Call
pcre2_substring_number_from_name() to get group numbers if your regex has named capturing groups.
If you just want to get the matched text, you can use convenience functions like pcre2_substring_get_bynumber() or
pcre2_substring_get_byname(). Pass the number or name of a capturing group, or zero for the overall match. These functions
allocate a zero-terminated copy of the matched text. Free the result with pcre2_substring_free(). If you would rather supply
your own buffer, use pcre2_substring_copy_bynumber() and pcre2_substring_copy_byname() instead. They copy the matched
text into your buffer, so there is nothing to free.
PCRE2 does not provide a function that gives you all the matches of a regex in a string. It never returns more than the first
match. To get the second match, call pcre2_match() again and pass ovector[1] (the end of the first match) as the
starting position for the second match attempt. If the first match was zero-length, include PCRE2_NOTEMPTY_ATSTART with
the options passed to pcre2_match() in order to avoid finding the same zero-length match again. This is not the same as
incrementing the starting position before the call. Passing the end of the previous match with PCRE2_NOTEMPTY_ATSTART
may result in a non-zero-length match being found at the same position.
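The difference is easier to see with the simpler strategy written out. The following Python sketch is not PCRE2 code. It restarts the search at the end of each match and, after a zero-length match, simply increments the starting position. The function name find_all_by_increment is invented for this example.

```python
import re

def find_all_by_increment(pattern, subject):
    """Collect match spans by restarting the search after each match.

    After a zero-length match the start position is bumped by one.
    This is the plain increment strategy, not a retry at the same
    position with empty matches disallowed.
    """
    spans = []
    pos = 0
    while pos <= len(subject):
        m = pattern.search(subject, pos)
        if m is None:
            break
        spans.append(m.span())
        # Continue at the end of the match; step past zero-length matches.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return spans

spans = find_all_by_increment(re.compile(r"\b|\w+"), "ab")
```

With the regex \b|\w+ and the string ab, the first match is the zero-length word boundary at position 0. Because the loop then moves on to position 1, \w+ only gets to match b. A retry at position 0 that refuses empty matches would give \w+ the chance to match ab as a whole, which is the non-zero-length match at the same position that the paragraph above describes.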
Substituting Matches
The biggest item that was forever on PCRE's wishlist was likely the ability to search-and-replace. PCRE2 finally delivers.
The replacement string syntax is fairly simple. There is nothing Perl-compatible about it, though. Backreferences can be
specified as $group or ${group} where "group" is either the name or the number of a group. The overall match is group
number zero. To add a literal dollar sign to the replacement, you need to double it up. Any single dollar sign that is not part of a
valid backreference is an error. Like Python, but unlike Perl, PCRE2 treats backreferences to non-existent groups and
backreferences to non-participating groups as errors. Backslashes are literals.
Before you can substitute matches, you need to compile your regular expression with pcre2_compile(). You can use the
same compiled regex for both regex matching and regex substitution. You don't need a match data object for substitution.
Call pcre2_substitute() and pass it the compiled regex, the subject string, the length of the subject string, the position in
the string where regex matching should begin, and the matching options. You can include PCRE2_SUBSTITUTE_GLOBAL with
the matching options to replace all matches after the starting position, instead of just the first match. The next parameters are
for match data and match context which can both be set to NULL. Then pass the replacement string and the length of the
replacement string. Finally pass a pointer to a buffer where the result string should be stored, along with a pointer to a variable
that holds the size of the buffer. The buffer needs to have space for a terminating zero. All lengths and offsets are in code
units, not characters.
pcre2_substitute() returns the number of regex matches that were substituted. Zero means no matches were found.
This function never returns PCRE2_ERROR_NOMATCH. A negative number indicates an error occurred.
Call pcre2_get_error_message() to get the error message. The variable that holds the size of the buffer is updated to
indicate the length of the string that is written to the buffer, excluding the terminating zero. (The terminating zero is written.)
\a, \e, \f, \n, \r, and \t are the usual ASCII control character escapes. Notably missing are \b and \v. \x0 through \xF
and \x00 through \xFF are hexadecimal escapes. \x{0} through \x{10FFFF} insert a Unicode code point. \o{0} through
\o{177777} are octal escapes.
Case conversion is also supported. The syntax is the same as Perl, but the behavior is not. Perl allows you to combine \u or
\l with \L or \U to make one character uppercase or lowercase and the remainder the opposite. With PCRE2, any case
conversion escape cancels the preceding escape. So you can't combine them and even \u or \l will end a run of \U or \L.
Conditionals are supported using a newly invented syntax that extends the syntax for backreferences.
${group:+matched:unmatched} inserts matched when the group participated and unmatched when it didn't. You can
use the full replacement string syntax in the two alternatives, including other conditionals.
Case conversion runs through conditionals. Any case conversion in effect before the conditional also applies to the
conditional. If the conditional contains its own case conversion escapes in the part of the conditional that is actually used, then
those remain in effect after the conditional. So you could use ${1:+\U:\L}${2} to insert the text matched by the second
capturing group in uppercase if the first group participated, and in lowercase if it didn't.
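For readers who know Python better than the PCRE2 replacement syntax, the effect of ${1:+\U:\L}${2} can be approximated with a callback. This is only a rough analogue under the assumption that the regex has an optional first group. The regex, the sample string and the function name are invented for the illustration.

```python
import re

def conditional_replacement(match):
    # Rough analogue of ${1:+\U:\L}${2}: insert group 2 in uppercase
    # when group 1 participated in the match, in lowercase otherwise.
    text = match.group(2)
    if match.group(1) is not None:
        return text.upper()
    return text.lower()

# Group 1 is optional, so it participates only when "!" is present.
pattern = re.compile(r"(!)?(\w+)")
converted = pattern.sub(conditional_replacement, "!hello World")
```

The word preceded by an exclamation point comes back in uppercase and the other word in lowercase. The exclamation point itself is dropped because the replacement only inserts the second group.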
Because of Perl's focus on managing and mangling text, regular expression text patterns are an integral part of the Perl
language. This is in contrast with most other languages, where regular expressions are available as add-on libraries. In Perl, you
can use the m// operator to test if a regex can match a string, e.g.:
if ($string =~ m/regex/)
{
print 'match';
}
else
{
print 'no match';
}
$string =~ s/regex/replacement/g;
The "g" after the last forward slash stands for "global", which tells Perl to replace all matches, and not just the first one.
Options are typically indicated including the slash, like /g, even though you do not add an extra slash, and even though you
could use any non-word character instead of slashes. If your regex contains slashes, use another character, like
s!regex!replacement!g;
You can add an "i" to make the regex match case insensitive. You can add an "s" to make the dot match newlines. You can
add an "m" to make the dollar and caret match at newlines embedded in the string, as well as at the start and end of the string.
@- is an array of match-start indices into the string. $-[0] holds the start of the entire regex match, $-[1] the start of the first
backreference, etc. Likewise, @+ holds match-end indices (ends, not lengths).
In Perl 5.10 and later you can use the associative array %+ to get the text matched by named capturing groups. For example,
$+{name} holds the text matched by the group "name". Perl does not provide a way to get match positions of capturing
groups by referencing their names. Since named groups are also numbered, you can use @- and @+ for named groups, but
you have to figure out the group's number by yourself.
$' (dollar followed by an apostrophe or single quote) holds the part of the string after (to the right of) the regex match. $`
(dollar backtick) holds the part of the string before (to the left of) the regex match. Using these variables is not recommended
in scripts when performance matters, as it causes Perl to slow down all regex matches in your entire script.
All these variables are read-only, and persist until the next regex match is attempted. They are dynamically scoped, as if they
had an implicit 'local' at the start of the enclosing scope. Thus if you do a regex match, and call a sub that does a regex match,
when that sub returns, your variables are still set as they were for the first match.
The pos() function retrieves the position where the next attempt begins. The first character in the string has position zero.
You can modify this position by using the function as the left side of an assignment, like in pos($string) = 123;.
The most important set of regex functions start with preg. These functions are a PHP wrapper around the PCRE library (Perl-
Compatible Regular Expressions). Anything said about the PCRE regex flavor in the regular expression tutorial on this website
applies to PHP's preg functions. You should use the preg functions for all new PHP code that uses regular expressions. PHP
includes PCRE by default as of PHP 4.2.0 (April 2002).
The oldest set of regex functions are those that start with ereg. They implement POSIX Extended Regular Expressions, like
the traditional UNIX egrep command. These functions are mainly for backward compatibility with PHP 3, and officially
deprecated as of PHP 5.3.0. Many of the more modern regex features such as lazy quantifiers, lookaround and Unicode are
not supported by the ereg functions. Don't let the "extended" moniker fool you. The POSIX standard was defined in 1986, and
regular expressions have come a long way since then.
The last set is a variant of the ereg set, prefixing mb_ for "multibyte" to the function names. While ereg treats the regex and
subject string as a series of 8-bit characters, mb_ereg can work with multi-byte characters from various code pages. If you
want your regex to treat Far East characters as individual characters, you'll either need to use the mb_ereg functions, or the
preg functions with the /u modifier. mb_ereg is available in PHP 4.2.0 and later. It uses the same POSIX ERE flavor.
Regex matching options, such as case insensitivity, are specified in the same way as in Perl. '/regex/i' applies the regex
case insensitively. '/regex/s' makes the dot match all characters. '/regex/m' makes the start and end of line anchors
match at embedded newlines in the subject string. '/regex/x' turns on free-spacing mode. You can specify multiple letters
to turn on several options. '/regex/misx' turns on all four options.
A special option is the /u which turns on the Unicode matching mode, instead of the default 8-bit matching mode. You should
specify /u for regular expressions that use \x{FFFF}, \X or \p{L} to match Unicode characters, graphemes, properties or
scripts. PHP will interpret '/regex/u' as a UTF-8 string rather than as an ASCII string.
Like the ereg function, int preg_match(string pattern, string subject[, array groups]) returns 1, which evaluates to
TRUE, if the regular expression pattern matches the subject string or part of the subject string, and 0 otherwise. If you specify
the third parameter, preg_match will store the substring matched by the first capturing group in $groups[1]. $groups[2] will
contain the text matched by the second group, and so on. If the regex pattern uses named capture, you can access the groups
by name with $groups['name']. $groups[0] will hold the overall match.
int preg_match_all(string pattern, string subject, array matches, int flags) fills the array
"matches" with all the matches of the regular expression pattern in the subject string. If you specify PREG_SET_ORDER as the
flag, then $matches[0] is an array containing the match and backreferences of the first match, just like the $groups array
filled by preg_match. $matches[1] holds the results for the second match, and so on. If you specify
PREG_PATTERN_ORDER, then $matches[0] is an array with full consecutive regex matches, $matches[1] an array with
the first backreference of all matches, $matches[2] an array with the second backreference of each match, etc.
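The two orderings are transposes of each other. The Python sketch below mimics both layouts with re.finditer() to show the shape of the data. It is not PHP code, and the function names set_order and pattern_order are simply borrowed from the PHP flag names for this illustration.

```python
import re

def set_order(pattern, subject):
    # One list per match: [overall match, group 1, group 2, ...]
    return [[m.group(0), *m.groups()] for m in re.finditer(pattern, subject)]

def pattern_order(pattern, subject):
    # One list per group: all overall matches, all group 1 texts, ...
    # (with no matches at all this sketch returns an empty list)
    rows = set_order(pattern, subject)
    return [list(column) for column in zip(*rows)]

by_match = set_order(r"(\w)(\d)", "a1 b2 c3")
by_group = pattern_order(r"(\w)(\d)", "a1 b2 c3")
```

by_match[0] holds everything about the first match, while by_group[0] holds the overall text of every match.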
array preg_grep(string pattern, array subjects) returns an array that contains all the strings in the array
"subjects" that can be matched by the regular expression pattern.
mixed preg_replace(mixed pattern, mixed replacement, mixed subject[, int limit]) returns a string
with all matches of the regex pattern in the subject string replaced with the replacement string. At most limit replacements
are made. One key difference is that all parameters, except limit, can be arrays instead of strings. In that case,
preg_replace does its job multiple times, iterating over the elements in the arrays simultaneously. You can also use strings
for some parameters, and arrays for others. Then the function will iterate over the arrays, and use the same strings for each
iteration. Using arrays for the pattern and replacement allows you to perform a sequence of search-and-replace
operations on a single subject string. Using an array for the subject string allows you to perform the same search-and-replace
operation on many subject strings.
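The two array use cases can be sketched in Python as two small loops. This is an illustration of the idea rather than of PHP's implementation. It assumes the pattern and replacement lists have the same length, and the function names are made up.

```python
import re

def replace_sequence(patterns, replacements, subject):
    # Apply each pattern/replacement pair in turn. Each step works on
    # the output of the previous step.
    for pattern, replacement in zip(patterns, replacements):
        subject = re.sub(pattern, replacement, subject)
    return subject

def replace_many(pattern, replacement, subjects):
    # Apply the same search-and-replace to every subject string.
    return [re.sub(pattern, replacement, s) for s in subjects]

# First collapse runs of whitespace, then trim the ends.
cleaned = replace_sequence([r"\s+", r"^ | $"], [" ", ""], "  too   many  spaces ")
masked = replace_many(r"\d", "#", ["a1", "b22"])
```

Because the steps run in order, the second regex in the sequence only has to deal with single spaces.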
Callbacks allow you to do powerful search-and-replace operations that you cannot do with regular expressions alone. E.g. if
you search for the regex (\d+)\+(\d+), you can replace 2+3 with 5, using the callback:
function regexadd($groups)
{
return $groups[1] + $groups[2];
}
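The same idea carries over to other languages that accept a function as the replacement. As a point of comparison, here is a minimal Python version of the callback using re.sub(). Python hands the callback a match object rather than an array, and the result has to be converted back to a string explicitly.

```python
import re

def regexadd(match):
    # Convert both captured numbers, add them, and return the sum as text.
    return str(int(match.group(1)) + int(match.group(2)))

result = re.sub(r"(\d+)\+(\d+)", regexadd, "2+3 and 10+20")
```

Each sum in the subject string is replaced with its outcome. The rest of the string is left alone.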
array preg_split(string pattern, string subject[, int limit]) works just like split, except that it uses the
Perl syntax for the regex pattern.
int ereg(string pattern, string subject[, array groups]) returns the length of the match if the regular
expression pattern matches the subject string or part of the subject string, or zero otherwise. Since zero evaluates to False
and non-zero evaluates to True, you can use ereg in an if statement to test for a match. If you specify the third parameter,
ereg will store the substring matched by the part of the regular expression between the first pair of round brackets in
$groups[1]. $groups[2] will contain the second pair, and so on. Note that grouping-only round brackets are not supported
by ereg. ereg is case sensitive. eregi is the case insensitive equivalent.
string ereg_replace(string pattern, string replacement, string subject) replaces all matches of the
regex pattern in the subject string with the replacement string. You can use backreferences in the replacement string. \\0 is
the entire regex match, \\1 is the first backreference, \\2 the second, etc. The highest possible backreference is \\9.
ereg_replace is case sensitive. eregi_replace is the case insensitive equivalent.
array split(string pattern, string subject[, int limit]) splits the subject string into an array of strings
using the regular expression pattern. The array will contain the substrings between the regular expression matches. The text
actually matched is discarded. If you specify a limit, the resulting array will contain at most that many substrings. The subject
string will be split at most limit-1 times, and the last item in the array will contain the unsplit remainder of the subject string.
split is case sensitive. spliti is the case insensitive equivalent.
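The relation between the limit and the number of splits is easy to get wrong by one. The Python sketch below expresses a limit of N substrings in terms of re.split(), which counts splits instead. The wrapper name split_with_limit is invented, and the handling of limits below 1 is a choice made for this sketch only.

```python
import re

def split_with_limit(pattern, subject, limit=None):
    # Without a limit, split at every match.
    if limit is None:
        return re.split(pattern, subject)
    if limit < 1:
        raise ValueError("limit must be at least 1 in this sketch")
    if limit == 1:
        # One substring allowed means no splitting at all.
        # (maxsplit=0 would mean "no limit" to re.split.)
        return [subject]
    # At most `limit` substrings means at most limit - 1 splits.
    return re.split(pattern, subject, maxsplit=limit - 1)

parts = split_with_limit(r",\s*", "a, b,c, d", limit=3)
```

With a limit of 3 the string is split twice, and the last item keeps the unsplit remainder including its comma.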
Using the mb_ereg function after calling mb_regex_encoding("CP936") would yield the bytes D6D0 or the first character
中, as the result.
To make sure your regular expression uses the correct code page, call mb_regex_encoding() to set the code page. If you
don't, the code page returned by or set by mb_internal_encoding() is used instead.
If your PHP script uses UTF-8, you can use the preg functions with the /u modifier to match multi-byte UTF-8 characters
instead of individual bytes. The preg functions do not support any other code pages.
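The difference between matching bytes and matching characters can be reproduced in Python, which keeps byte strings and text strings apart. This sketch is only an analogy for the PHP behavior described above. It encodes two Chinese characters as CP936 and lets the dot match once against the bytes and once against the text.

```python
import re

text = "\u4e2d\u6587"  # two characters, the first being U+4E2D

# Encoded as CP936, each of these characters takes two bytes.
cp936_bytes = text.encode("cp936")

# A bytes pattern sees individual bytes; a str pattern sees characters.
first_byte = re.match(rb".", cp936_bytes, re.DOTALL).group(0)
first_char = re.match(r".", text).group(0)
```

The byte-oriented match returns only the first half of the first character, while the character-oriented match returns the character as a whole.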
POSIX Basic Regular Expressions
POSIX or "Portable Operating System Interface for uniX" is a collection of standards that define some of the functionality that
a (UNIX) operating system should support. One of these standards defines two flavors of regular expressions. Commands
involving regular expressions, such as grep and egrep, implement these flavors on POSIX-compliant UNIX systems. Several
database systems also use POSIX regular expressions.
The Basic Regular Expressions or BRE flavor standardizes a flavor similar to the one used by the traditional UNIX grep
command. This is pretty much the oldest regular expression flavor still in use today. One thing that sets this flavor apart is that
most metacharacters require a backslash to give the metacharacter its special meaning. Most other flavors, including POSIX ERE, use a
backslash to suppress the meaning of metacharacters. Using a backslash to escape a character that is never a metacharacter
is an error.
A BRE supports POSIX bracket expressions, which are similar to character classes in other regex flavors, with a few special
features. Shorthands are not supported. Other features using the usual metacharacters are the dot to match any character
except a line break, the caret and dollar to match the start and end of the string, and the star to repeat the token zero or more
times. To match any of these characters literally, escape them with a backslash.
The other BRE metacharacters require a backslash to give them their special meaning. The reason is that the oldest versions
of UNIX grep did not support these. The developers of grep wanted to keep it compatible with existing regular expressions,
which may use these characters as literal characters. The BRE a{1,2} matches a{1,2} literally, while a\{1,2\} matches a
or aa. Some implementations support \? and \+ as an alternative syntax to \{0,1\} and \{1,\}, but \? and \+ are not
part of the POSIX standard. Tokens can be grouped with \( and \). Backreferences are the usual \1 through \9. Only up to
9 groups are permitted. E.g. \(ab\)\1 matches abab, while (ab)\1 is invalid since there's no capturing group
corresponding to the backreference \1. Use \\1 to match \1 literally.
POSIX BRE does not support any other features. Even alternation is not supported.
The developers of egrep did not try to maintain compatibility with grep, creating a separate tool instead. Thus egrep, and
POSIX ERE, add additional metacharacters without backslashes. You can use backslashes to suppress the meaning of all
metacharacters, just like in modern regex flavors. Escaping a character that is not a metacharacter is an error.
The quantifiers ?, +, {n}, {n,m} and {n,} repeat the preceding token zero or once, once or more, n times, between n and m
times, and n or more times, respectively. Alternation is supported through the usual vertical bar |. Unadorned parentheses
create a group, e.g. (abc){2} matches abcabc.
The POSIX standard does not define backreferences. Some implementations do support \1 through \9, but these are not
part of the standard for ERE. ERE is an extension of the old UNIX grep, not of POSIX BRE.
As a side effect, the -match operator sets a special variable called $matches. This is an associative array that holds the
overall regex match and all capturing group matches. $matches[0] gives you the overall regex match, $matches[1] the
first capturing group, and $matches['name'] the text matched by the named group 'name'.
The -replace operator uses a regular expression to search-and-replace through a string. E.g. 'test' -replace '\w',
'$&$&' returns 'tteesstt'. The regex \w matches one letter. The replacement text re-inserts the regex match twice
using $&. The replacement text parameter must be specified, and the regex and replacement must be separated by a comma.
If you want to replace the regex matches with nothing, pass an empty string as the replacement.
Traditionally, regular expressions are case sensitive by default. This is true for the .NET framework too. However, it is not true
in PowerShell. -match and -replace are case insensitive, as are -imatch and -ireplace. For case sensitive matching,
use -cmatch and -creplace. I recommend that you always use the "i" or "c" prefix to avoid confusion regarding case
sensitivity.
The operators do not provide a way to pass options from .NET's RegexOptions enumeration. Instead, use mode modifiers in
the regular expression. E.g. (?m)^test$ is the same as using ^test$ with RegexOptions.MultiLine passed to the
Regex() constructor. Mode modifiers take precedence over options set externally to the regex. -cmatch '(?i)test' is
case insensitive, while -imatch '(?-i)test' is case sensitive. The mode modifier overrides the case insensitivity
preference of the -match operator.
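The interplay between options passed from outside and mode modifiers inside the regex is not specific to .NET. As a loose analogy, the Python sketch below shows an externally passed option, the same option as a mode modifier, and a modifier that overrides an external option. Python's syntax differs from .NET's in the last case: it only allows an option to be turned off for a group, as in (?-i:test).

```python
import re

subject = "first\ntest\nlast"

# Option passed externally, comparable to RegexOptions.MultiLine.
external = re.search(r"^test$", subject, re.MULTILINE)

# The same option set with a mode modifier inside the pattern.
inline = re.search(r"(?m)^test$", subject)

# A scoped modifier overrides the externally supplied option.
overridden = re.search(r"(?-i:test)", "TEST", re.IGNORECASE)
```

Both searches for ^test$ find the middle line. The last search finds nothing because the modifier makes the group case sensitive again.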
But with PowerShell, there's an extra caveat: double-quoted strings use the dollar syntax for variable interpolation. Variable
interpolation is done before the Regex.Replace() function (which -replace uses internally) parses the replacement text.
Unlike Perl, $1 is not a magical variable in PowerShell. That syntax only works in the replacement text. The -replace
operator does not set the $matches variable either. The effect is that 'test' -replace '(\w)(\w)', "$2$1" (double-
quoted replacement) returns the empty string (assuming you did not set the variables $1 and $2 in preceding PowerShell
code). Due to variable interpolation, the Replace() function never sees $2$1. To allow the Replace() function to substitute
its placeholders, use 'test' -replace '(\w)(\w)', '$2$1' (single-quoted replacement) or 'test' -replace
'(\w)(\w)', "`$2`$1" (dollars escaped with backticks) to make sure $2$1 is passed literally to Regex.Replace().
Using The System.Text.RegularExpressions.Regex Class
To use all of .NET's regex processing functionality with PowerShell, create a regular expression object by instantiating the
System.Text.RegularExpressions.Regex class. PowerShell provides a handy shortcut if you want to use the Regex()
constructor that takes a string with your regular expression as the only parameter: $regex = [regex] '\W+' compiles the
regular expression \W+ (which matches one or more non-word characters) and stores the result in the variable $regex. You
can now call all the methods of the Regex class on your $regex object. E.g. $regex.Split('this is a test') returns
an array of all the words in the string.
If you want to use another constructor, you have to resort to PowerShell's new-object cmdlet. To set the flag
RegexOptions.MultiLine, for example, you'd need this line of code:
Using mode modifiers inside the regex is a much shorter and more readable solution, though:
Python’s re Module
Python is a high level open source scripting language. Python's built-in re module provides excellent support for regular
expressions, with a modern and complete regex flavor. The only significant features missing from Python's regex syntax are
atomic grouping, possessive quantifiers and Unicode properties.
The first thing to do is to import the regexp module into your script with import re.
You can set regex matching modes by specifying a special constant as a third parameter to re.search(). re.I or
re.IGNORECASE applies the pattern case insensitively. re.S or re.DOTALL makes the dot match newlines. re.M or
re.MULTILINE makes the caret and dollar match after and before line breaks in the subject string. There is no difference
between the single-letter and descriptive options, except for the number of characters you have to type in. To specify more
than one option, "or" them together with the | operator: re.search("^a","abc", re.I | re.M).
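As a minimal sketch of combining two flags, consider the following. The subject string is invented for illustration.

```python
import re

subject = "first line\nABC second"

# Without flags, ^a only matches a lowercase "a" at the very start of the string.
no_flags = re.search("^a", subject)

# re.I ignores case; re.M lets ^ match after each line break as well.
with_flags = re.search("^a", subject, re.I | re.M)

print(no_flags)              # None
print(with_flags.group())    # A
print(with_flags.start())    # 11
```

The match is found at the start of the second line, which only succeeds when both flags are active.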
By default, Python's regex engine only considers the letters A through Z, the digits 0 through 9, and the underscore as "word
characters". Specify the flag re.L or re.LOCALE to make \w match all characters that are considered letters given the
current locale settings. Alternatively, you can specify re.U or re.UNICODE to treat all letters from all scripts as word
characters. The setting also affects word boundaries.
Do not confuse re.search() with re.match(). Both functions do exactly the same, with the important distinction that
re.search() will attempt the pattern throughout the string, until it finds a match. re.match() on the other hand, only
attempts the pattern at the very start of the string. Basically, re.match("regex", subject) is the same as
re.search("\Aregex", subject). Note that re.match() does not require the regex to match the entire string.
re.match("a", "ab") will succeed.
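The difference between the two functions can be sketched as follows. The subject strings are made up for this example.

```python
import re

subject = "say abc"

found_anywhere = re.search("abc", subject)    # scans forward through the string
found_at_start = re.match("abc", subject)     # only tries the start of the string
anchored = re.search(r"\Aabc", subject)       # behaves like re.match()
partial = re.match("a", "ab")                 # need not consume the whole string

print(found_anywhere.start())   # 4
print(found_at_start)           # None
print(anchored)                 # None
print(partial.group())          # a
```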
Python 3.4 adds a new re.fullmatch() function. This function only returns a Match object if the regex matches the string
entirely. Otherwise it returns None. re.fullmatch("regex", subject) is the same as re.search("\Aregex\Z",
subject). This is useful for validating user input. If subject is an empty string then fullmatch() evaluates to True for any
regex that can find a zero-length match.
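A minimal validation sketch using re.fullmatch() might look like this. The function name is_valid_year and the four-digit rule are hypothetical, chosen only to illustrate the idea.

```python
import re

def is_valid_year(text):
    """Return True if text consists of exactly four digits (hypothetical rule)."""
    return re.fullmatch(r"\d{4}", text) is not None

print(is_valid_year("2019"))     # True
print(is_valid_year("2019x"))    # False
print(is_valid_year(""))         # False

# A regex that can find a zero-length match does match the empty string entirely.
print(re.fullmatch(r"\d*", "") is not None)   # True
```

Comparing the result with None makes the function return a plain boolean rather than a Match object.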
To get all matches from a string, call re.findall(regex, subject). This will return an array of all non-overlapping regex
matches in the string. "Non-overlapping" means that the string is searched through from left to right, and the next match
attempt starts beyond the previous match. If the regex contains exactly one capturing group, re.findall() returns an array
with the text matched by that group. If it contains more than one capturing group, re.findall() returns an array of tuples,
with each tuple containing the text matched by all the capturing groups. The overall regex match is not included in
the tuple, unless you place the entire regex inside a capturing group.
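The effect of capturing groups on re.findall() can be sketched like this. The subject string and variable names are assumptions made for the example.

```python
import re

subject = "x=1, y=22, z=333"

# No capturing groups: a list of the overall matches.
whole = re.findall(r"\w=\d+", subject)

# Two capturing groups: a list of tuples, one tuple per match.
pairs = re.findall(r"(\w)=(\d+)", subject)

# Wrapping the entire regex in a group adds the overall match to each tuple.
with_overall = re.findall(r"((\w)=(\d+))", subject)

print(whole)          # ['x=1', 'y=22', 'z=333']
print(pairs)          # [('x', '1'), ('y', '22'), ('z', '333')]
print(with_overall)   # [('x=1', 'x', '1'), ('y=22', 'y', '22'), ('z=333', 'z', '333')]
```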
More efficient than re.findall() is re.finditer(regex, subject). It returns an iterator that enables you to loop over
the regex matches in the subject string: for m in re.finditer(regex, subject). The for-loop variable m is a Match
object with the details of the current match.
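A small sketch of looping over matches with re.finditer(), collecting the text and offsets of each match. The subject string is invented.

```python
import re

subject = "one two three"

spans = []
for m in re.finditer(r"\w+", subject):
    # m is a Match object for the current match
    spans.append((m.group(), m.start(), m.end()))

print(spans)   # [('one', 0, 3), ('two', 4, 7), ('three', 8, 13)]
```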
In current versions of Python, re.findall() and re.finditer() accept the same optional third parameter with regex matching
flags as re.search() and re.match(). Alternatively, you can use global mode modifiers at the start of the regex. E.g.
"(?i)regex" matches regex case insensitively.
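A short sketch of a global mode modifier used with re.findall(). The subject string is invented for illustration.

```python
import re

subject = "Regex, REGEX and regex"

case_sensitive = re.findall("regex", subject)
case_insensitive = re.findall("(?i)regex", subject)   # mode modifier at the start

print(case_sensitive)     # ['regex']
print(case_insensitive)   # ['Regex', 'REGEX', 'regex']
```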
Python strings also use the backslash to escape characters. The above regexes are written as Python strings as "\\\\" and
"\\w". Confusing indeed.
Fortunately, Python also has "raw strings" which do not apply special treatment to backslashes. As raw strings, the above
regexes become r"\\" and r"\w". The only limitation of using raw strings is that the delimiter you're using for the string
must not appear in the regular expression, as raw strings do not offer a means to escape it.
You can use \n and \t in raw strings. Though raw strings do not support these escapes, the regular expression engine does.
The end result is the same.
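The relationship between regular strings, raw strings and the regex engine can be checked with a few lines of Python. This is only a sketch; the subject strings are made up.

```python
import re

# The regular string "\\w" and the raw string r"\w" are the same two characters.
print("\\w" == r"\w")     # True
print("\\\\" == r"\\")    # True

# The regex \\ matches one literal backslash in the subject c:\temp.
path = "c:\\temp"
backslashes = re.findall(r"\\", path)
print(len(backslashes))   # 1

# r"\n" is a backslash followed by n; the regex engine turns it into a newline.
raw_newline = re.search(r"\n", "a\nb")
plain_newline = re.search("\n", "a\nb")
print(raw_newline.start(), plain_newline.start())   # 1 1
```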
Unicode
Prior to Python 3.3, Python's re module did not support any Unicode regular expression tokens. Python Unicode strings,
however, have always supported the \uFFFF notation. Python's re module can use Unicode strings. So you could pass the
Unicode string u"\u00E0\\d" to the re module to match à followed by a digit. The backslash for \d was escaped, while the
one for \u was not. That's because \d is a regular expression token, and a regular expression backslash needs to be
escaped. \u00E0 is a Python string token that shouldn't be escaped. The string u"\u00E0\\d" is seen by the regular
expression engine as à\d.
If you did put another backslash in front of the \u, the regex engine would see \u00E0\d. If you use this regex with Python
3.2 or earlier, it will match the literal text u00E0 followed by a digit instead.
To avoid any confusion about whether backslashes need to be escaped, just use Unicode raw strings like ur"\u00E0\d".
Then backslashes don't need to be escaped. Python does interpret Unicode escapes in raw strings.
In Python 3.0 and later, strings are Unicode by default. So the u prefix shown in the above samples is no longer necessary.
Python 3.3 also adds support for the \uFFFF notation to the regular expression engine. So in Python 3.3, you can use the
string "\\u00E0\\d" to pass the regex \u00E0\d which will match something like à0.
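For Python 3.3 and later, the three ways of writing this regex can be compared in a short sketch. The subject string is an assumption chosen for the example.

```python
import re

subject = "\u00E07"   # the two characters à and 7

# Python converts \u00E0 before the regex engine runs; the engine sees à\d.
python_escape = re.search("\u00E0\\d", subject)

# Both backslashes are escaped; the engine sees \u00E0\d and interprets
# the \u escape itself (Python 3.3 and later).
regex_escape = re.search("\\u00E0\\d", subject)

# A raw string passes the same regex \u00E0\d without doubled backslashes.
raw_escape = re.search(r"\u00E0\d", subject)

print(python_escape.group() == regex_escape.group() == raw_escape.group())   # True
```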
The re.sub() function applies the same backslash logic to the replacement text as is applied to the regular expression.
Therefore, you should use raw strings for the replacement text, as we did in the examples above. The re.sub() function will
also interpret \n and \t in raw strings. If you want c:\temp as the replacement text, either use r"c:\\temp" or
"c:\\\\temp". The 3rd backreference is r"\3" or "\\3".
Splitting Strings
re.split(regex, subject) returns an array of strings. The array contains the parts of subject between all the regex
matches in the subject. Adjacent regex matches will cause empty strings to appear in the array. The regex matches
themselves are not included in the array. If the regex contains capturing groups, then the text matched by the capturing groups
is included in the array. The capturing groups are inserted between the substrings that appeared to the left and right of the
regex match. If you don't want the capturing groups in the array, convert them into non-capturing groups. The re.split()
function does not offer an option to suppress capturing groups.
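How capturing groups change the result of re.split() can be sketched as follows. The subject string is invented.

```python
import re

subject = "a, b,,c"

# Adjacent matches (the two commas) produce an empty string in the result.
plain = re.split(r",\s*", subject)

# A capturing group inserts the captured text between the surrounding substrings.
with_group = re.split(r"(,)\s*", subject)

# A non-capturing group leaves the delimiters out again.
non_capturing = re.split(r"(?:,)\s*", subject)

print(plain)           # ['a', 'b', '', 'c']
print(with_group)      # ['a', ',', 'b', ',', '', ',', 'c']
print(non_capturing)   # ['a', 'b', '', 'c']
```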
You can specify an optional third parameter to limit the number of times the subject string is split. Note that this limit controls
the number of splits, not the number of strings that will end up in the array. The unsplit remainder of the subject is added as
the final string to the array. If there are no capturing groups, the array will contain limit+1 items.
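A minimal sketch of limiting the number of splits. The limit is passed here by its keyword name maxsplit, which is the spelling assumed for current Python versions. The subject string is made up.

```python
import re

subject = "a, b, c, d"

# Split at most twice; the unsplit remainder becomes the last item.
limited = re.split(r",\s*", subject, maxsplit=2)

print(limited)        # ['a', 'b', 'c, d']
print(len(limited))   # 3, that is limit+1
```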
The behavior of re.split() has changed between Python versions when the regular expression can find zero-length
matches. In Python 3.4 and prior, re.split() ignores zero-length matches. In Python 3.5 and 3.6 re.split() throws a
FutureWarning when it encounters a zero-length match. This warning signals the change in Python 3.7. Now re.split()
also splits on zero-length matches.
Match Details
re.search() and re.match() return a Match object, while re.finditer() returns an iterator that yields a Match object for
each match. This object holds lots of useful information about the regex match. We will use m to signify a Match object in the
discussion below.
m.group() returns the part of the string matched by the entire regular expression. m.start() returns the offset in the string
of the start of the match. m.end() returns the offset of the character beyond the match. m.span() returns a 2-tuple of
m.start() and m.end(). You can use the m.start() and m.end() to slice the subject string:
subject[m.start():m.end()].
If you want the results of a capturing group rather than the overall regex match, specify the name or number of the group as a
parameter. m.group(3) returns the text matched by the third capturing group. m.group('groupname') returns the text
matched by a named group groupname. If the group did not participate in the overall match, m.group() returns None,
while m.start() and m.end() return -1.
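The Match methods described above can be tried out with a short sketch. The subject string and the group names amount and currency are assumptions made for this example.

```python
import re

subject = "Price: 42 USD"
m = re.search(r"(?P<amount>\d+) (?P<currency>[A-Z]{3})", subject)

print(m.group())                      # 42 USD
print(m.start(), m.end(), m.span())   # 7 13 (7, 13)
print(subject[m.start():m.end()])     # 42 USD
print(m.group(1))                     # 42
print(m.group("currency"))            # USD

# In (a)|(b) matched against "b", group 1 does not participate in the match.
m2 = re.search(r"(a)|(b)", "b")
print(m2.group(1), m2.start(1), m2.end(1))   # None -1 -1
```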
If you want to do a regular expression based search-and-replace without using re.sub(), call m.expand(replacement)
to compute the replacement text. The function returns the replacement string with backreferences etc. substituted.
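A manual search-and-replace built on m.expand() could look roughly like this. The subject string and the replacement template are invented for illustration.

```python
import re

subject = "contact: jan@example.com"
m = re.search(r"(\w+)@(\w+)\.com", subject)

# expand() substitutes the backreferences \1 and \2 in the template.
replacement = m.expand(r"user=\1 host=\2")

# Splice the computed replacement into the subject by hand.
result = subject[:m.start()] + replacement + subject[m.end():]

print(replacement)   # user=jan host=example
print(result)        # contact: user=jan host=example
```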
Older versions of R used the GNU library to implement both POSIX BRE and ERE. ERE was the default. Passing the
extended=FALSE parameter allowed you to switch to BRE. This parameter was deprecated in R 2.10.0 and removed in R
2.11.0.
The best way to use regular expressions with R is to pass the perl=TRUE parameter. This tells R to use the PCRE regular
expressions library. When this guide talks about R, it assumes you're using the perl=TRUE parameter.
All the functions use case sensitive matching by default. You can pass ignore.case=TRUE to make them case insensitive.
R's functions do not have any parameters to set any other matching modes. When using perl=TRUE, as you should, you can
add mode modifiers to the start of the regex.
The grepl function takes the same arguments as the grep function, except for the value argument, which is not supported.
grepl returns a logical vector with the same length as the input vector. Each element in the returned vector indicates whether
the regex could find a match in the corresponding string element in the input vector.
The regexpr function takes the same arguments as the grep function, except for the value argument, which is not
supported. regexpr returns an integer vector with the same length as the input vector. Each element in the returned vector
indicates the character position in each corresponding string element in the input vector at which the (first) regex match was
found. A match at the start of the string is indicated with character position 1. If the regex could not find a match in a certain
string, its corresponding element in the result vector is -1. The returned vector also has a match.length attribute. This is
another integer vector with the number of characters in the (first) regex match in each string, or -1 for strings that didn't match.
gregexpr is the same as regexpr, except that it finds all matches in each string. It returns a vector with the same length as
the input vector. Each element is another vector, with one element for each match found in the string indicating the character
position at which that match was found. Each vector element in the returned vector also has a match.length attribute with
the lengths of all matches. If no matches could be found in a particular string, the element in the returned vector is still a
vector, but with just one element -1.
> regexpr("a+", c("abc", "def", "cba a", "aa"))
[1] 1 -1 3 1
attr(,"match.length")
[1] 1 -1 1 2
> gregexpr("a+", c("abc", "def", "cba a", "aa"))
[[1]]
[1] 1
attr(,"match.length")
[1] 1
[[2]]
[1] -1
attr(,"match.length")
[1] -1
[[3]]
[1] 3 5
attr(,"match.length")
[1] 1 1
[[4]]
[1] 1
attr(,"match.length")
[1] 2
Use regmatches to get the actual substrings matched by the regular expression. As the first argument, pass the same input
that you passed to regexpr or gregexpr. As the second argument, pass the vector returned by regexpr or gregexpr. If
you pass the vector from regexpr then regmatches returns a character vector with all the strings that were matched. This
vector may be shorter than the input vector if no match was found in some of the elements. If you pass the vector from
gregexpr then regmatches returns a vector with the same number of elements as the input vector. Each element is a
character vector with all the matches of the corresponding element in the input vector, or NULL if an element had no matches.
sub returns a new vector with the same length as the input vector. If a regex match could be found in a string element, it is
replaced with the replacement text. Only the first match in each string element is replaced. If no matches could be found in
some strings, those are copied into the result vector unchanged.
Use gsub instead of sub to replace all regex matches in all the string elements in your vector. Other than replacing all
matches, gsub works in exactly the same way, and takes exactly the same arguments.
You can use the backreferences \1 through \9 in the replacement text to reinsert text matched by a capturing group. There is
no replacement text token for the overall match. Place the entire regex in a capturing group and then use \1.
You can use \U and \L to change the text inserted by all following backreferences to uppercase or lowercase. You can use \E
to insert the following backreferences without any change of case. These escapes do not affect literal text.
> sub("(a+)", "z\\U\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] "zAzbc" "def" "cbzAz a" "zAAz"
> gsub("(a+)", "z\\U\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
[1] "zAzbc" "def" "cbzAz zAz" "zAAz"
A very powerful way of making replacements is to assign a new vector to the regmatches function when you call it on the
result of gregexpr. The vector you assign should have as many elements as the original input vector. Each element should
be a character vector with as many strings as there are matches in that element. The original input vector is then modified to
have all the regex matches replaced with the text from the new vector.
> x <- c("abc", "def", "cba a", "aa")
> m <- gregexpr("a+", x, perl=TRUE)
> regmatches(x, m) <- list(c("one"), character(0), c("two", "three"), c("four"))
> x
[1] "onebc" "def" "cbtwo three" "four"
In Ruby, the caret and dollar always match before and after newlines. Ruby does not have a modifier to change this. Use \A
and \Z to match at the start or the end of the string.
Since forward slashes delimit the regular expression, any forward slashes that appear in the regex need to be escaped. E.g.
the regex 1/2 is written as /1\/2/ in Ruby.
The === method allows you to compare a regexp to a string. It returns true if the regexp matches (part of) the string or
false if it does not. This allows regular expressions to be used in case statements. Do not confuse === (3 equals signs) with
== (2 equals signs). == allows you to compare one regexp to another regexp to see if the two regexes are identical and use
the same matching modes.
The =~ method returns the character position in the string of the start of the match or nil if no match was found. In a boolean
test, the character position evaluates to true and nil evaluates to false. So you can use =~ instead of === to make your
code a little easier to read as =~ is more obviously a regex matching operator. Ruby borrowed the =~ syntax from Perl.
print(/\w+/ =~ "test") prints "0". The first character in the string has index zero. Switching the order of the =~
operator's operands makes no difference.
The match() method returns a MatchData object when a match is found, or nil if no match was found. In a boolean
context, the MatchData object evaluates to true. In a string context, the MatchData object evaluates to the text that was
matched. So print(/\w+/.match("test")) prints "test".
Ruby 2.4 adds the match?() method. It returns true or false like the === method. The difference is that match?() does
not set $~ (see below) and thus doesn't need to create a MatchData object. If you don't need any match details you
should use match?() to improve performance.
Special Variables
The ===, =~, and match() methods create a MatchData object and assign it to the special variable $~. Regexp.match()
also returns this object. The variable $~ is thread-local and method-local. That means you can use this variable until your
method exits, or until the next time you use the =~ operator in your method, without worrying that another thread or another
method in your thread will overwrite it.
A number of other special variables are derived from the $~ variable. All of these are read-only. If you assign a new
MatchData instance to $~, all of these variables will change too. $& holds the text matched by the whole regular expression.
$1, $2, etc. hold the text matched by the first, second, and following capturing groups. $+ holds the text matched by the
highest-numbered capturing group that actually participated in the match. $` and $' hold the text in the subject string to the
left and to the right of the regex match.
To re-insert the regex match, use \0 in the replacement string. You can use the contents of capturing groups in the
replacement string with backreferences \1, \2, \3, etc. Note that numbers escaped with a backslash are treated as octal
escapes in double-quoted strings. Octal escapes are processed at the language level, before the sub() function sees the
parameter. To prevent this, you need to escape the backslashes in double-quoted strings. So to use the first backreference as
the replacement string, either pass '\1' or "\\1". '\\1' also works.
If your regular expression contains capturing groups, scan() returns an array of arrays. Each element in the overall array will
contain an array with the text matched by each of the capturing groups. The overall regex match is not included.
C++Builder 10 and later support the Dinkumware implementation std::regex when targeting Win32 if you disable the option
to use the classic Borland compiler. When using the classic Borland compiler in C++Builder XE3 and later, you can use
boost::regex instead of std::regex. While std::regex as defined in TR1 and C++11 defines pretty much the same
operations and classes as boost::regex, there are a number of important differences in the actual regex flavor. Most
importantly the ECMAScript regex syntax in Boost adds a number of features borrowed from Perl that aren't part of the
ECMAScript standard and that aren't implemented in the Dinkumware library.
Six Regular Expression Flavors
Six different regular expression flavors or grammars are defined in std::regex_constants:
Most C++ references talk as if C++11 implements regular expressions as defined in the ECMA-262v3 and POSIX standards.
But in reality the C++ implementation is very loosely based on these standards. The syntax is quite close. The only significant
differences are that std::regex supports POSIX classes even in ECMAScript mode, and that it is a bit peculiar about which
characters must be escaped (like curly braces and closing square brackets) and which must not be escaped (like letters).
But there are important differences in the actual behavior of this syntax. The caret and dollar always match at embedded line
breaks in std::regex, while in JavaScript and POSIX this is an option. Backreferences to non-participating groups fail to
match as in most regex flavors, while in JavaScript they find a zero-length match. In JavaScript, \d and \w are ASCII-only
while \s matches all Unicode whitespace. This is odd, but all modern browsers follow the spec. In std::regex all the
shorthands are ASCII-only when using strings of char. In Visual C++, but not in C++Builder, they support Unicode when using
strings of wchar_t. The POSIX classes also match non-ASCII characters when using wchar_t in Visual C++, but do not
consistently include all the Unicode characters that one would expect.
In practice, you'll mostly use the ECMAScript grammar. It's the default grammar and offers far more features than the other
grammars. Whenever the tutorial mentions std::regex without mentioning any grammars then what is written applies to the
ECMAScript grammar and may or may not apply to any of the other grammars. You'll really only use the other grammars if you
want to reuse existing regular expressions from old POSIX code or UNIX scripts.
Pass your regex as a string as the first parameter to the constructor. If you want to use a regex flavor other than ECMAScript,
pass the appropriate constant as a second parameter. You can OR this constant with std::regex_constants::icase to
make the regex case insensitive. You can also OR it with std::regex_constants::nosubs to turn all capturing groups
into non-capturing groups, which makes your regex more efficient if you only care about the overall regex match and don't
want to extract text matched by any of the capturing groups.
Both regex_search() and regex_match() return just true or false. To get the part of the string matched by
regex_search(), or to get the parts of the string matched by capturing groups when using either function, you need to pass
an object of the template class std::match_results as the second parameter. The regex object then becomes the third
parameter. Create this object using the default constructor of one of these four template instantiations:
std::cmatch when your subject is an array of char
std::smatch when your subject is an std::string object
std::wcmatch when your subject is an array of wchar_t
std::wsmatch when your subject is an std::wstring object
When the function call returns true, you can call the str(), position(), and length() member functions of the
match_results object to get the text that was matched, or the starting position and length of the match relative to the
subject string. Call these member functions without a parameter or with 0 as the parameter to get the overall regex match.
Call them passing 1 or greater to get the match of a particular capturing group. The size() member function indicates the
number of capturing groups plus one for the overall match. Thus you can pass a value up to size()-1 to the other three
member functions.
Putting it all together, we can get the text matched by the first capturing group like this:
// Requires <regex> and <string>
std::string subject("Name: John Doe");
std::string result;
try
{
    std::regex re("Name: (.*)");
    std::smatch match;
    // match.size() > 1 confirms that the result holds capturing group 1
    if (std::regex_search(subject, match, re) && match.size() > 1)
    {
        result = match.str(1);
    }
    else
    {
        result = std::string("");
    }
}
catch (const std::regex_error& e)
{
    // Syntax error in the regular expression
}
Construct one object by calling the constructor with three parameters: a string iterator indicating the starting position of the
search, a string iterator indicating the ending position of the search, and the regex object. If there are any matches to be
found, the object will hold the first match when constructed.
Construct another iterator object using the default constructor to get an end-of-sequence iterator. You can compare the first
object to the second to determine whether there are any further matches. As long as the first object is not equal to the second,
you can dereference the first object to get a match_results object.
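// Requires <iostream>, <regex> and <string>
std::string subject("This is a test");
try
{
    // The regex is not defined in this snippet; \w+ (one word per match) is assumed here
    std::regex re("\\w+");
    std::sregex_iterator next(subject.begin(), subject.end(), re);
    std::sregex_iterator end;
    while (next != end)
    {
        std::smatch match = *next;
        std::cout << match.str() << "\n";
        ++next;
    }
}
catch (const std::regex_error& e)
{
    // Syntax error in the regular expression
}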
The replacement string syntax is similar but not identical to that of JavaScript. The same replacement string syntax is used
regardless of which regex syntax or grammar you are using. You can use $& or $0 to insert the whole regex match and $1
through $9 to insert the text matched by the first nine capturing groups. There is no way to insert the text matched by groups
10 or higher. $10 and higher are always replaced with nothing, and $9 and lower are replaced with nothing if there are fewer
capturing groups in the regex than the requested number. $` (dollar backtick) is the part of the string to the left of the match,
and $' (dollar quote) is the part of the string to the right of the match.
Tcl's regular expression support is based on a library developed for Tcl by Henry Spencer. This library has since been used in
a number of other programming languages and applications, such as the PostgreSQL database and the wxWidgets GUI
library for C++. Everything said about Tcl in this regular expression tutorial applies to any tool that uses Henry Spencer's
Advanced Regular Expressions.
There are a number of important differences between Tcl Advanced Regular Expressions and Perl-style regular expressions.
Tcl uses \m, \M, \y and \Y for word boundaries. Perl and most other modern regex flavors use \b and \B. In Tcl, these last
two match a backspace and a backslash, respectively.
Tcl also takes a completely different approach to mode modifiers. The (?letters) syntax is the same, but the available
mode letters and their meanings are quite different. Instead of adding mode modifiers to the regular expression, you can pass
more descriptive switches like -nocase to the regexp and regsub commands for some of the modes. Mode modifier spans
in the style of (?modes:regex) are not supported. Mode modifiers must appear at the start of the regex. They affect the
whole regex. Mode modifiers in the regex override command switches. Tcl supports these modes:
(?i) or -nocase makes the regex match case insensitive.
(?c) makes the regex match case sensitive. This mode is the default.
(?x) or -expanded activates the free-spacing regexp syntax.
(?t) disables the free-spacing regexp syntax. This mode is the default. The "t" stands for "tight", the opposite of
"expanded".
(?b) tells Tcl to interpret the remainder of the regular expression as a Basic Regular Expression.
(?e) tells Tcl to interpret the remainder of the regular expression as an Extended Regular Expression.
(?q) tells Tcl to interpret the remainder of the regular expression as plain text. The "q" stands for "quoted".
(?s) selects "non-newline-sensitive matching", which is the default. The "s" stands for "single line". In this mode, the dot
and negated character classes will match all characters, including newlines. The caret and dollar will match only at the very
start and end of the subject string.
(?p) or -linestop enables "partial newline-sensitive matching". In this mode, the dot and negated character classes will
not match newlines. The caret and dollar will match only at the very start and end of the subject string.
(?w) or -lineanchor enables "inverse partial newline-sensitive matching". The "w" stands for "weird". (Don't look at me!
I didn't come up with this.) In this mode, the dot and negated character classes will match all characters, including newlines.
The caret and dollar will match after and before newlines.
(?n) or -line enables what Tcl calls "newline-sensitive matching". The dot and negated character classes will not match
newlines. The caret and dollar will match after and before newlines. Specifying (?n) or -line is the same as specifying
(?pw) or -linestop -lineanchor.
(?m) is a historical synonym for (?n).
If you use regular expressions with Tcl and other programming languages, be careful when dealing with the newline-related
matching modes. Tcl's designers found Perl's /m and /s modes confusing. They are confusing, but at least Perl has only two,
and they both affect only one thing. In Perl, /m or (?m) enables "multi-line mode", which makes the caret and dollar match
after and before newlines. By default, they match at the very start and end of the string only. In Perl, /s or (?s) enables
"single line mode". This mode makes the dot match all characters, including line breaks. By default, it doesn't match line
breaks. Perl does not have a mode modifier to exclude line breaks from negated character classes. In Perl, [^a] matches
anything except a, including newlines. The only way to exclude newlines is to write [^a\n]. Perl's default matching mode is
like Tcl's (?p), except for the difference in negated character classes.
Why compare Tcl with Perl? .NET, Java, PCRE and Python support the same (?m) and (?s) modifiers with the exact same
defaults and effects as in Perl. JavaScript lacks /s and Ruby lacks /m, but at least they don't introduce completely different
options. Negated character classes work the same in all these languages and libraries. It's unfortunate that Tcl didn't follow
Perl's standard, since Tcl's four options are just as confusing as Perl's two options. Together they make a very nice alphabet
soup.
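Since Python is one of the flavors that follows Perl here, the Perl-style behavior described above can be sketched in Python. The subject string is invented; the sketch shows that (?m) only affects the anchors, (?s) only affects the dot, and negated character classes match newlines in either mode.

```python
import re

subject = "one\ntwo"

# (?m) changes only the anchors.
default_anchors = re.findall(r"^\w+$", subject)
multiline_anchors = re.findall(r"(?m)^\w+$", subject)

# (?s) changes only the dot.
default_dot = re.search(r"one.two", subject)
dotall_dot = re.search(r"(?s)one.two", subject)

# Negated character classes match newlines unless \n is excluded explicitly.
negated = re.search(r"one[^a]two", subject)
negated_no_newline = re.search(r"one[^a\n]two", subject)

print(default_anchors, multiline_anchors)    # [] ['one', 'two']
print(default_dot is None, dotall_dot is None)            # True False
print(negated is None, negated_no_newline is None)        # False True
```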
If you ignore the fact that Tcl's options affect negated character classes, you can use the following table to translate between
Tcl's newline modes and Perl-style newline modes. Note that the defaults are different. If you don't use any switches, (?s).
and . are equivalent in Tcl, but not in Perl.
Tcl             Perl        Anchors                                    Dot
(?s) (default)  (?s)        Start and end of string only               Any character
(?p)            (default)   Start and end of string only               Any character except newlines
(?w)            (?sm)       Start and end of string, and at newlines   Any character
(?n)            (?m)        Start and end of string, and at newlines   Any character except newlines
Regular Expressions as Tcl Words
You can insert regular expressions in your Tcl source code either by enclosing them with double quotes (e.g. "regex") or by
enclosing them with curly braces (e.g. {regex}). Since the braces don't do any substitution like the quotes, they're by far the
best choice for regular expressions.
The only thing you need to worry about is that unescaped braces in the regular expression must be balanced. Escaped braces
don't need to be balanced, but the backslash used to escape the brace remains part of the regular expression. You can easily
satisfy these requirements by escaping all braces in your regular expression, except those used as a quantifier. This way your
regex will work as expected, and you don't need to change it at all when pasting it into your Tcl source code, other than putting
a pair of braces around it.
The regular expression ^\{\d{3}\\$ matches a string that consists entirely of an opening brace, three digits and one
backslash. In Tcl, this becomes {^\{\d{3}\\$}. There's no doubling of backslashes or any sort of escaping needed, as
long as you escape literal braces in the regular expression. { and \{ are both valid regular expressions to match a single
opening brace in a Tcl ARE (and any Perl-style regex flavor, for that matter). Only the latter will work correctly in a Tcl literal
enclosed with braces.
Immediately after the regexp command, you can place zero or more switches from the list above to indicate how Tcl should
apply the regular expression. The only required parameters are the regular expression and the subject string. You can specify
a literal regular expression using braces as I just explained. Or, you can reference any string variable holding a regular
expression read from a file or user input.
If you pass the name of a variable as an additional argument, Tcl will store the part of the string matched by the regular
expression into that variable. Tcl will not set the variable to an empty string if the match attempt fails. If the regular expression
has capturing groups, you can add additional variable names to capture the text matched by each group. If you specify fewer
variables than the regex has capturing groups, the text matched by the additional groups is not stored. If you specify more
variables than the regex has capturing groups, the additional variables will be set to an empty string if the overall regex match
was successful.
The regexp command returns 1 if (part of) the string could be matched, and zero if there's no match. The following script
applies the regular expression my regex case insensitively to the string stored in the variable subjectstring and displays
the result:
if {[regexp -nocase {my regex} $subjectstring matchresult]} then {
    puts $matchresult
} else {
    puts "my regex could not match the subject string"
}
The regexp command supports three more switches that aren't regex mode modifiers. The -all switch causes the
command to return a number indicating how many times the regex could be matched. The variables storing the regex and
group matches will store the last match in the string only.
The -inline switch tells the regexp command to return a list with the substring matched by the regular expression and
all substrings matched by all capturing groups. If you also specify the -all switch, the list will contain the first regex match,
all the group matches of the first match, then the second regex match, the group matches of the second match, etc.
The -start switch must be followed by a number (as a separate Tcl word) that indicates the character offset in the subject
string at which Tcl should attempt the match. Everything before the starting position will be invisible to the regex engine. This
means that \A will match at the character offset you specify with -start, even if that position is not at the start of the string.
Just like the regexp command, regsub takes zero or more switches followed by a regular expression. It supports the same
switches, except for -inline. Remember to specify -all if you want to replace all matches in the string.
The argument after the regular expression should be the subject string, followed by the replacement text. You can specify a literal replacement using the brace syntax,
or reference a string variable. The regsub command recognizes a few metacharacters in the replacement text. You can use
\0 as a placeholder for the whole regex match, and \1 through \9 for the text matched by one of the first nine capturing
groups. You can also use & as a synonym of \0. Note that there's no backslash in front of the ampersand. & is substituted with
the whole regex match, while \& is substituted with a literal ampersand. Use \\ to insert a literal backslash.
You only need to escape backslashes if they're followed by a digit, to prevent the combination from being seen as a
backreference. Again, to prevent unnecessary duplication of backslashes, you should enclose the replacement text with
braces instead of double quotes. The replacement text \1 becomes {\1} when using braces, and "\\1" when using quotes.
If you pass a variable reference as the final argument, that variable receives the string with the replacements applied, and
regsub returns an integer indicating the number of replacements made. Tcl 8.4 and later allow you to omit the final argument.
In that case regsub returns the string with the replacements applied.
Microsoft made some significant enhancements to VBScript's regular expression support in version 5.5 of Internet Explorer.
Version 5.5 implements quite a few essential regex features that were missing in previous versions of VBScript. Internet
Explorer 6.0 does not expand the regular expression functionality. Whenever this website mentions VBScript, the statements
refer to VBScript's version 5.5 regular expression support.
Basically, Internet Explorer 5.5 implements the JavaScript regular expression flavor. But IE 5.5 did not score very high on web
standards. There are quite a few differences between its implementation of JavaScript regular expressions and the actual
standard. Fortunately, most are corner cases that are not likely to affect you. Therefore, everything said about JavaScript's
regular expression flavor on this website also applies to VBScript.
Modern versions of IE still use the IE 5.5 implementation when rendering web pages in quirks mode. In standards mode,
modern versions of IE follow the JavaScript standard very closely. VBScript regular expressions also still use the IE 5.5
implementation, even when a modern version of IE is installed.
JavaScript and VBScript implement Perl-style regular expressions. However, they lack quite a number of advanced features
available in Perl and other modern regular expression flavors:
No \A or \Z anchors to match the start or end of the string. Use a caret or dollar instead.
Lookbehind is not supported at all. Lookahead is fully supported.
No atomic grouping or possessive quantifiers
No Unicode support, except for matching single characters with \uFFFF
No named capturing groups. Use numbered capturing groups instead.
No mode modifiers to set matching options within the regular expression.
No conditionals.
No regular expression comments. Describe your regular expression with VBScript apostrophe comments instead, outside
the regular expression string.
Version 1.0 of the RegExp object even lacks basic features like lazy quantifiers. This is the main reason this website does not
discuss VBScript RegExp 1.0. All versions of Internet Explorer prior to 5.5 include version 1.0 of the RegExp object. There are
no other versions than 1.0 and 5.5.
The advantage of the RegExp object's bare-bones nature is that it's very easy to use. Create one, put in a regex, and let it
match or replace. Only four properties and three methods are available.
After creating the object, assign the regular expression you want to search for to the Pattern property. If you want to use a
literal regular expression rather than a user-supplied one, simply put the regular expression in a double-quoted string. By
default, the regular expression is case sensitive. Set the IgnoreCase property to True to make it case insensitive. The caret
and dollar only match at the very start and very end of the subject string by default. If your subject string consists of multiple
lines separated by line breaks, you can make the caret and dollar match at the start and the end of those lines by setting the
Multiline property to True. VBScript does not have an option to make the dot match line break characters. Finally, if you
want the RegExp object to return or replace all matches instead of just the first one, set the Global property to True.
After setting the RegExp object's properties, you can invoke one of the three methods to perform one of three basic tasks. The
Test method takes one parameter: a string to test the regular expression on. Test returns True or False, indicating if the
regular expression matches (part of) the string. When validating user input, you'll typically want to check if the entire string
matches the regular expression. To do so, put a caret at the start of the regex, and a dollar at the end, to anchor the regex at
the start and end of the subject string.
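Because VBScript's RegExp object implements the JavaScript flavor, the difference between a partial test and a whole-string test can be sketched with JavaScript's built-in RegExp, whose test() method plays the same role as VBScript's Test. The pattern and the sample strings below are made up for illustration only.

```javascript
// Unanchored: succeeds if any part of the input consists of three digits.
const partial = /\d{3}/;
// Anchored with ^ and $: succeeds only if the whole input is three digits.
const whole = /^\d{3}$/;

const partialOnEmbedded = partial.test("abc123def"); // true
const wholeOnEmbedded = whole.test("abc123def");     // false
const wholeOnExact = whole.test("123");              // true

console.log(partialOnEmbedded, wholeOnEmbedded, wholeOnExact);
```

For input validation, the anchored form is usually the one you want.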
The Execute method also takes one string parameter. Instead of returning True or False, it returns a MatchCollection
object. If the regex could not match the subject string at all, MatchCollection.Count will be zero. If the RegExp.Global
property is False (the default), MatchCollection will contain only the first match. If RegExp.Global is True, MatchCollection
will contain all matches.
The Replace method takes two string parameters. The first parameter is the subject string, while the second parameter is
the replacement text. If the RegExp.Global property is False (the default), Replace will return the subject string with the
first regex match (if any) substituted with the replacement text. If RegExp.Global is true, Replace will return the subject
string with all regex matches replaced.
You can specify an empty string as the replacement text. This will cause the Replace method to return the subject string with
all regex matches deleted from it. To re-insert the regex match as part of the replacement, include $& in the replacement text.
E.g. to enclose each regex match in the string between square brackets, specify [$&] as the replacement text. If the regexp
contains capturing parentheses, you can use backreferences in the replacement text. $1 in the replacement text inserts the
text matched by the first capturing group, $2 the second, etc. up to $9. To include a literal dollar sign in the replacements, put
two consecutive dollar signs in the string you pass to the Replace method.
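JavaScript's String.prototype.replace accepts the same $&, $1 and $$ tokens, so the replacement syntax described above can be tried out in JavaScript. This is only a sketch: the g flag stands in for VBScript's Global property, and the sample strings are invented.

```javascript
const subject = "cost 15 and 20";

// Global replace: every match is wrapped in square brackets via $&.
const bracketed = subject.replace(/\d+/g, "[$&]");   // "cost [15] and [20]"
// Without the g flag only the first match is replaced.
const firstOnly = subject.replace(/\d+/, "[$&]");    // "cost [15] and 20"
// An empty replacement deletes the matches.
const deleted = subject.replace(/\d+/g, "");         // "cost  and "
// $1 and $2 re-insert the text matched by the capturing groups.
const swapped = "John Smith".replace(/(\w+) (\w+)/, "$2, $1"); // "Smith, John"
// $$ inserts a single literal dollar sign.
const priced = subject.replace(/\d+/g, "$$$&");      // "cost $15 and $20"

console.log(bracketed, firstOnly, swapped, priced);
```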
The easiest way to process all matches in the collection is to use a For Each construct.
The Match object has four read-only properties. The FirstIndex property indicates the number of characters in the string to
the left of the match. If the match was found at the very start of the string, FirstIndex will be zero. If the match starts at the
second character in the string, FirstIndex will be one, etc. Note that this is different from the VBScript Mid function, which
extracts the first character of the string if you set the start parameter to one. The Length property of the Match object
indicates the number of characters in the match. The Value property returns the text that was matched.
The SubMatches property of the Match object is a collection of strings. It will only hold values if your regular expression has
capturing groups. The collection will hold one string for each capturing group. The Count property indicates the number of
strings in the collection. The Item property takes an index parameter, and returns the text matched by the capturing group. The
Item property is the default member, so you can write SubMatches(7) as a shorthand to SubMatches.Item(7).
Unfortunately, VBScript does not offer a way to retrieve the match position and length of capturing groups.
Also, unfortunately, the SubMatches property does not hold the complete regex match as SubMatches(0). Instead,
SubMatches(0) holds the text matched by the first capturing group, while SubMatches(SubMatches.Count-1) holds the
text matched by the last capturing group. This is different from most other programming languages. E.g. in VB.NET,
Match.Groups(0) returns the whole regex match, and Match.Groups(1) returns the first capturing group's match. Note
that this is also different from the backreferences you can use in the replacement text passed to the RegExp.Replace
method. In the replacement text, $1 inserts the text matched by the first capturing group, just like most other regex flavors do.
$0 is not substituted with anything but inserted literally.
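For comparison, here is how the same information looks in JavaScript, where the match array keeps the whole match at index 0. The sketch assumes a made-up subject string; dropping element 0 yields a zero-based list laid out like VBScript's SubMatches collection, and the index property corresponds to FirstIndex.

```javascript
const match = /(\d{4})-(\d{2})/.exec("Date: 2019-07");

const wholeMatch = match[0];        // "2019-07"
const firstGroup = match[1];        // "2019"
// Zero-based offset of the match: six characters precede it.
const zeroBasedIndex = match.index; // 6

// Without element 0, index 0 holds the first capturing group,
// which is the layout VBScript uses for SubMatches.
const subMatches = match.slice(1);  // ["2019", "07"]

// "$0" is not a replacement token in JavaScript either; it is inserted literally.
const literalDollarZero = "abc".replace(/b/, "$0"); // "a$0c"

console.log(wholeMatch, zeroBasedIndex, subMatches, literalDollarZero);
```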
One such library is Microsoft's VBScript scripting library, which has decent regular expression capabilities starting with version
5.5. It implements the same regular expression flavor used in JavaScript, as standardized in the ECMA-262 standard for
JavaScript. This library is part of Internet Explorer 5.5 and later. It is available on all computers running Windows XP, Vista, or
7, and previous versions of Windows if the user upgraded to IE 5.5 or later. That includes almost every Windows PC that is
used to connect to the Internet.
To use this library in your Visual Basic application, select Project | References in the VB IDE's menu. Scroll down the list to
find the item "Microsoft VBScript Regular Expressions 5.5". It's immediately below the "Microsoft VBScript Regular
Expressions 1.0" item. Make sure to tick the 5.5 version, not the 1.0 version. The 1.0 version is only provided for backward
compatibility. Its capabilities are less than satisfactory.
After adding the reference, you can see which classes and class members the library provides. Select View | Object Browser
in the menu. In the Object Browser, select the "VBScript_RegExp_55" library in the drop-down list in the upper left corner. For
a detailed description, see the VBScript regular expression reference. Anything said about JavaScript's flavor of regular
expressions in the tutorial also applies to VBScript's flavor. The only exception is the character escape support for \xFF and
\uFFFF in the replacement text. JavaScript supports these in string literals, but Visual Basic does not.
The only difference between VB6 and VBScript is that you'll need to use a Dim statement to declare the objects prior to
creating them. Here's a complete code snippet. It's the two code snippets on the VBScript page put together, with three Dim
statements added.
'Prepare a regular expression object
Dim myRegExp As RegExp
Dim myMatches As MatchCollection
Dim myMatch As Match
wxRegEx(const wxString& expr, int flags = wxRE_EXTENDED) creates a wxRegEx object with a compiled
regular expression. The constructor will always create the object, even if your regular expression is invalid. Check
wxRegEx::IsValid to determine if the regular expression was compiled successfully.
To set the regex flavor, specify one of the flags wxRE_EXTENDED, wxRE_ADVANCED or wxRE_BASIC. If you don't specify a
flavor, wxRE_EXTENDED is the default. It is recommended that you always specify the wxRE_ADVANCED flag. AREs are far
more powerful than EREs. Every valid ERE is also a valid ARE, and will give identical results. The only reason to use the ERE
flavor is when your code has to work when wxWidgets is compiled without the "built-in" regular expression library (i.e. Henry
Spencer's code).
You can set three other flags in addition to the flavor. wxRE_ICASE makes the regular expression case insensitive. The default
is case sensitive. wxRE_NOSUB makes the regex engine treat all capturing groups as non-capturing. This means you won't be
able to use backreferences in the replacement text, or query the part of the regex matched by each capturing group. If you
won't be using these anyway, setting the wxRE_NOSUB flag improves performance.
As discussed in the Tcl section, Henry Spencer's "ARE" regex engine did away with the confusing "single line" (?s) and
"multi line" (?m) matching modes, replacing them with the equally confusing "non-newline-sensitive" (?s), "partial newline-
sensitive" (?p), "inverse partial newline-sensitive" (?w) and "newline-sensitive matching" (?n). Since the wxRegEx class
encapsulates the ARE engine, it supports all 4 modes when you use the mode modifiers inside the regular expression. But the
flags parameter only allows you to set two.
If you add wxRE_NEWLINE to the flags, you're turning on "newline-sensitive matching" (?n). In this mode, the dot will not
match newline characters (\n). The caret and dollar will match after and before newlines in the string, as well as at the start
and end of the subject string.
If you don't set the wxRE_NEWLINE flag, the default is "non-newline-sensitive" (?s). In this mode, the dot will match all
characters, including newline characters (\n). The caret and dollar will match only at the start and end of the subject string.
Note that this default is different from the default in Perl and every other regex engine on the planet. In Perl, by default, the dot
does not match newline characters, and the caret and dollar only match at the start and end of the subject string. The only
way to set this mode in wxWidgets is to put (?p) at the start of your regex.
Putting it all together, wxRegEx myRegEx("(?p)^[a-z].*$", wxRE_ADVANCED + wxRE_ICASE) will check if your
subject string consists of a single line that starts with a letter. The equivalent in Perl is m/^[a-z].*$/i.
wxRegEx::GetMatchCount() is rather poorly named. It does not return the number of matches found by Matches(). In
fact, you can call GetMatchCount() right after Compile(), before you call Matches. GetMatchCount() returns the
number of capturing groups in your regular expression, plus one for the overall regex match. You can use this to determine the
number of backreferences you can use in the replacement text, and the highest index you can pass to GetMatch(). If your
regex has no capturing groups, GetMatchCount() returns 1. In that case, \0 is the only valid backreference you can use in
the replacement text.
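The "capturing groups plus one" count is easy to picture with a small language-neutral sketch. The JavaScript helper below is hypothetical (it is not part of wxWidgets) and merely mimics what GetMatchCount() reports by inspecting the length of a match array.

```javascript
// Hypothetical helper: returns the number of capturing groups plus one
// for the overall match, similar in spirit to wxRegEx::GetMatchCount().
function matchCount(pattern) {
  // The added empty alternative guarantees a match against "", so exec()
  // always returns an array: one slot for the overall match, one per group.
  return new RegExp(pattern + "|").exec("").length;
}

const noGroups = matchCount("abc");           // 1
const twoGroups = matchCount("(1)(2)");       // 3
const oneCapturing = matchCount("(?:a)(b)");  // 2, non-capturing groups are not counted

console.log(noGroups, twoGroups, oneCapturing);
```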
GetMatchCount() returns 0 in case of an error. This will happen if the wxRegEx object does not hold a compiled regular
expression, or if you compiled it with wxRE_NOSUB.
Matches() returns true if the regex matches all or part of the subject string that you passed in the text parameter. Add
anchors to your regex if you want to test whether the regex matches the whole subject string.
Do not confuse the flags parameter with the one you pass to the Compile() method or the wxRegEx() constructor. All the
flavor and matching mode options can only be set when compiling the regex.
The Matches() method allows only two flags: wxRE_NOTBOL and wxRE_NOTEOL. If you set wxRE_NOTBOL, then ^ and \A
will not match at the start of the string. They will still match after embedded newlines if you turned on that matching mode.
Likewise, specifying wxRE_NOTEOL tells $ and \Z not to match at the end of the string.
wxRE_NOTBOL is commonly used to implement a "find next" routine. The wxRegEx class does not provide such a function. To
find the second match in the string, you'll need to call wxRegEx::Matches() and pass it the part of the original subject string
after the first match. Pass the wxRE_NOTBOL flag to indicate that you've cut off the start of the string you're passing.
wxRE_NOTEOL can be useful if you're processing a large set of data, and you want to apply the regex before you've read the
whole data. Pass wxRE_NOTEOL while calling wxRegEx::Matches() as long as you haven't read the entire string yet. Pass
both wxRE_NOTBOL and wxRE_NOTEOL when doing a "find next" on incomplete data.
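To see why a flag like wxRE_NOTBOL is needed when you cut off the start of the string, consider this sketch in JavaScript. JavaScript has no equivalent flag, so the example only demonstrates the problem, using an invented subject string.

```javascript
const subject = "aaa";
const startsWithA = /^a/;

// First match: found at the true start of the subject.
const first = startsWithA.exec(subject);
const firstEnd = first.index + first[0].length; // 1

// "Find next" by cutting off everything up to the end of the first match.
const remainder = subject.slice(firstEnd); // "aa"

// The caret now matches at the cut, although offset 1 is not the start
// of the original string. This is the false positive that a flag such as
// wxRE_NOTBOL is meant to suppress.
const matchesAtCut = startsWithA.test(remainder); // true

console.log(remainder, matchesAtCut);
```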
After a call to Matches() returns true, and you compiled your regex without the wxRE_NOSUB flag, you can call
GetMatch() to get details about the overall regex match, and the parts of the string matched by the capturing groups in your
regex.
bool wxRegEx::GetMatch(size_t* start, size_t* len, size_t index = 0) const retrieves the starting
position of the match in the subject string, and the number of characters in the match.
wxString wxRegEx::GetMatch(const wxString& text, size_t index = 0) const returns the text that was
matched.
For both calls, set the index parameter to zero (or omit it) to get the overall regex match. Set 1 <= index <
GetMatchCount() to get the match of a capturing group in your regular expression. To determine the number of a group,
count the opening brackets in your regular expression from left to right.
int wxRegEx::ReplaceAll(wxString* text, const wxString& replacement) const replaces all regex
matches in text with replacement.
int wxRegEx::ReplaceFirst(wxString* text, const wxString& replacement) const replaces the first
match of the regular expression in text with replacement.
All three calls return the actual number of replacements made. They return zero if the regex failed to match the subject text. A
return value of -1 indicates an error. The replacements are made directly to the wxString that you pass as the first
parameter.
wxWidgets uses the same syntax as Tcl for the replacement text. You can use \0 as a placeholder for the whole regex match,
and \1 through \9 for the text matched by one of the first nine capturing groups. You can also use & as a synonym of \0.
Note that there's no backslash in front of the ampersand. & is substituted with the whole regex match, while \& is substituted
with a literal ampersand. Use \\ to insert a literal backslash. You only need to escape backslashes if they're followed by a
digit, to prevent the combination from being seen as a backreference. When specifying the replacement text as a literal string
in C++ code, you need to double up all the backslashes, as the C++ compiler also treats backslashes as escape characters.
So if you want to replace the match with the first backreference followed by the text &co, you'll need to code that in C++ as
_T("\\1\\&co").
XML Schema Regular Expressions
The W3C XML Schema standard defines its own regular expression flavor. You can use it in the pattern facet of simple type
definitions in your XML schemas. E.g. the following defines the simple type "SSN" using a regular expression to require the
element to contain a valid US social security number.
<xsd:simpleType name="SSN">
<xsd:restriction base="xsd:token">
<xsd:pattern value="[0-9]{3}-[0-9]{2}-[0-9]{4}"/>
</xsd:restriction>
</xsd:simpleType>
Compared with other regular expression flavors, the XML schema flavor is quite limited in features. Since it's only used to
validate whether an entire element matches a pattern or not, rather than for extracting matches from large blocks of data, you
won't really miss the features often found in other flavors. The limitations allow schema validators to be implemented with
efficient text-directed engines.
Particularly noteworthy is the complete absence of anchors like the caret and dollar, word boundaries and lookaround. XML
schema always implicitly anchors the entire regular expression. The regex must match the whole element for the element to
be considered valid. If you have the pattern regexp, the XML schema validator will apply it in the same way as say Perl, Java
or .NET would do with the pattern ^regexp$.
If you want to accept all elements with regex somewhere in the middle of their contents, you'll need to use the regular
expression .*regex.*. The two .* expand the match to cover the whole element, assuming it doesn't contain line breaks. If
you want to allow line breaks, you can use something like [\s\S]*regex[\s\S]*. Combining a shorthand character class
with its negated version results in a character class that matches anything.
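The implicit anchoring can be imitated in a flavor that does have anchors. The JavaScript helper below is a hypothetical illustration, not a schema validator: it assumes the pattern only uses syntax that XML Schema and JavaScript share, and it wraps the pattern in ^(?:...)$ so that it must match the whole value.

```javascript
// Hypothetical helper: applies a pattern the way an XML Schema validator
// would, by requiring it to match the entire value.
function matchesWholeValue(pattern, value) {
  return new RegExp("^(?:" + pattern + ")$").test(value);
}

const exact = matchesWholeValue("regex", "regex");                    // true
const embedded = matchesWholeValue("regex", "a regex here");          // false
const padded = matchesWholeValue(".*regex.*", "a regex here");        // true
// The dot does not match line breaks, so .* cannot span two lines.
const dotAcrossLines = matchesWholeValue(".*regex.*", "a regex\nhere");             // false
const anyAcrossLines = matchesWholeValue("[\\s\\S]*regex[\\s\\S]*", "a regex\nhere"); // true

console.log(exact, embedded, padded, dotAcrossLines, anyAcrossLines);
```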
XML schemas do not provide a way to specify matching modes. The dot never matches line breaks, and patterns are always
applied case sensitively. If you want to apply literal case insensitively, you'll need to rewrite it as
[lL][iI][tT][eE][rR][aA][lL].
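Writing these character classes by hand is tedious, so you may want to generate them. The following JavaScript function is a hypothetical helper, sketched under the assumption that the input consists of plain ASCII letters and digits; it does not escape regex metacharacters.

```javascript
// Hypothetical helper: turns each letter into a two-character class
// holding its lowercase and uppercase forms; other characters are kept as-is.
function caseInsensitiveClasses(word) {
  return Array.from(word, (ch) => {
    const lower = ch.toLowerCase();
    const upper = ch.toUpperCase();
    return lower === upper ? ch : "[" + lower + upper + "]";
  }).join("");
}

const pattern = caseInsensitiveClasses("literal");
// pattern is "[lL][iI][tT][eE][rR][aA][lL]"

// Check the generated pattern with a whole-string match.
const acceptsMixedCase = new RegExp("^(?:" + pattern + ")$").test("LiTeRaL"); // true

console.log(pattern, acceptsMixedCase);
```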
XML regular expressions don't have any tokens like \xFF or \uFFFF to match particular (non-printable) characters. You can
use the &#xFFFF; XML syntax for this, or simply copy the character directly from a character map.
Lazy quantifiers are not available. Since the pattern is anchored at the start and the end of the subject string anyway, and only
a success/failure result is returned, the only potential difference between a greedy and lazy quantifier would be performance.
You can never make a fully anchored pattern match or fail by changing a greedy quantifier into a lazy one or vice versa.
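This claim is easy to check in a flavor that supports both kinds of quantifier. The JavaScript sketch below compares a greedy and a lazy version of the same fully anchored pattern on a handful of invented sample strings.

```javascript
const greedy = /^a.*b$/;
const lazy = /^a.*?b$/;

// Strings that should match and strings that should fail.
const samples = ["ab", "aXXb", "abXb", "aXX", "bXa", ""];

// For every sample, both versions give the same success/failure result.
const allAgree = samples.every((s) => greedy.test(s) === lazy.test(s));

console.log(allAgree);
```

The two versions may take a different number of steps to reach the result, but the result itself is the same.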
Note that the regular expression functions available in XQuery and XPath use a different regular expression flavor. This flavor
is a superset of the XML Schema flavor described here. It adds some of the features that are available in many modern regex
flavors, but not in the XML Schema flavor.
XML Character Classes
Despite its limitations, XML schema regular expressions introduce two handy features. The special short-hand character
classes \i and \c make it easy to match XML names. No other regex flavor supports these.
Character class subtraction makes it easy to match a character that is in a certain list, but not in another list. E.g.
[a-z-[aeiou]] matches an English consonant. This feature is now also available in the .NET regex engine. It is particularly
handy when working with Unicode properties. E.g. [\p{L}-[\p{IsBasicLatin}]] matches any letter that is not an
English letter.
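In flavors without character class subtraction, a negative lookahead placed in front of the character class gives the same result. The JavaScript sketch below is one such workaround for [a-z-[aeiou]]; it is an equivalent written for illustration, not XML Schema syntax.

```javascript
// Equivalent of the XML Schema class [a-z-[aeiou]]: the lookahead rejects
// vowels before [a-z] gets the chance to consume the character.
const consonant = /^(?![aeiou])[a-z]$/;

const acceptsB = consonant.test("b"); // true
const rejectsE = consonant.test("e"); // false
const acceptsZ = consonant.test("z"); // true
const rejectsUpper = consonant.test("B"); // false, the class is case sensitive

console.log(acceptsB, rejectsE, acceptsZ, rejectsUpper);
```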
Xojo uses the UTF-8 version of PCRE. This means that if you want to process non-ASCII data that you've retrieved from a file
or the network, you'll need to use Xojo's TextConverter class to convert your strings into UTF-8 before passing them to the
RegEx object. You'll also need to use the TextConverter to convert the strings returned by the RegEx class from UTF-8
back into the encoding your application is working with.
To check if a regular expression matches a particular string, call the Search method of the RegEx object, and pass the
subject string as a parameter. This method returns an instance of the RegExMatch class if a match is found, or Nil if no
match is found. To find the second match in the same subject string, call the Search method again, without any parameters.
Do not pass the subject string again, since doing so restarts the search from the beginning of the string. Keep calling Search
without any parameters until it returns Nil to iterate over all regular expression matches in the string.
The SubExpressionCount property returns the number of capturing groups in the regular expression plus one. E.g. it
returns 3 for the regex (1)(2).
The SubExpressionString property returns the substring matched by the regular expression or a capturing group.
SubExpressionString(0) returns the whole regex match, while SubExpressionString(1) through
SubExpressionString(SubExpressionCount-1) return the matches of the capturing group.
SubExpressionStartB returns the byte offset of the start of the match of the whole regex or one of the capturing groups
depending on the numeric index you pass as a parameter to the property.
The RegExOptions Class
The RegExOptions class has nine properties to set various options for your regular expression.
Set CaseSensitive (False by default) to True to treat uppercase and lowercase letters as different characters. This
option is the inverse of "case insensitive mode" or /i in other programming languages.
Set DotMatchAll (False by default) to True to make the dot match all characters, including line break characters. This
option is the equivalent of "single line mode" or /s in other programming languages.
Set Greedy (True by default) to False if you want quantifiers to be lazy, effectively making .* the same as .*?. We
strongly recommend against setting Greedy to False. Simply use the .*? syntax instead. This way, somebody reading
your source code will clearly see when you're using greedy quantifiers and when you're using lazy quantifiers when they
look only at the regular expression.
The LineEndType option is the only one that takes an Integer instead of a Boolean. This option affects which character the
caret and dollar treat as the "end of line" character. The default is 0, which accepts both \r and \n as end-of-line
characters. Set it to 1 to auto-detect the host platform, and use \n when your application runs on Windows and Linux,
and \r when it runs on a Mac. Set it to 2 for Mac (\r), 3 for Windows (\r\n) and 4 for UNIX (\n). We recommend you
leave this option as zero, which is most likely to give you the results you intended. This option is actually a modification to
the PCRE library made in Xojo. PCRE supports only option 4, which often confuses Windows developers since it causes
test$ to fail against test\r\n as Windows uses \r\n for line breaks.
Set MatchEmpty (True by default) to False if you want to skip zero-length matches.
Set ReplaceAllMatches (False by default) to True if you want the Regex.Replace method to search-and-replace all
regex matches in the subject string rather than just the first one.
Set StringBeginIsLineBegin (True by default) to False if you don't want the start of the string to be considered the
start of the line. This can be useful if you're processing a large chunk of data as several separate strings, where only the
first string should be considered as starting the (conceptual) overall string.
Similarly, set StringEndIsLineEnd (True by default) to False if the string you're passing to the Search method isn't
really the end of the whole chunk of data you're processing.
Set TreatTargetAsOneLine (False by default) to True to make the caret and dollar match at the start and the end of
the string only. By default, they will also match after and before embedded line breaks. This option is the inverse of the
"multi-line mode" or /m in other programming languages.
In the ReplacementPattern string, you can use $&, $0 or \0 to insert the whole regular expression match into the
replacement. Use $1 or \1 for the match of the first capturing group, $2 or \2 for the second, etc.
If you want more control over how the replacements are made, you can iterate over the regex matches like in the code snippet
above, and call the RegExMatch.Replace method for each match. This method is a bit of a misnomer, since it doesn't
actually replace anything. Rather, it returns the RegEx.ReplacementPattern string with all references to the match and
capturing groups substituted. You can use this result to make the replacements on your own. This method is also useful if
you want to collect a combination of capturing groups for each regex match.
Because the XML Schema flavor is only used for true/false validity tests, these features were eliminated for performance
reasons. The XQuery and XPath functions perform more complex regular expression operations, which require a more
feature-rich regular expression flavor. That said, the XQuery and XPath regex flavor is still limited by modern standards.
XQuery and XPath support the following features on top of the features in the XML Schema flavor:
^ and $ anchors that match at the start or end of the string, or the start or end of a line (see matching modes below). These
are the only two anchors supported.
Lazy quantifiers, using the familiar question mark syntax.
Backreferences and capturing groups. The XML Schema standard supports grouping, but groups were always
non-capturing. XQuery/XPath allows backreferences to be used in the regular expression. fn:replace supports
backreferences in the replacement text using the $1 notation.
While XML Schema allows no matching modes at all, the XQuery and XPath functions all accept an optional flags
parameter to set matching modes. Mode modifiers within the regular expression are not supported. These four matching
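modes are available:
i makes the regex match case insensitively.
s enables "single-line mode", in which the dot matches newlines.
m enables "multi-line mode", in which the caret and dollar match before and after newlines in the subject string.
x enables free-spacing mode, in which whitespace between regex tokens is ignored.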
The flags are specified as a string with the letters of the modes you want to turn on. E.g. "ix" turns on case insensitivity and
free-spacing. If you don't want to set any matching modes, you can pass an empty string for the flags parameter, or omit the
parameter entirely.
Three Regex Functions
fn:matches(subject, pattern, flags) takes a subject string and a regular expression as input. If the regular
expression matches any part of the subject string, the function returns true. If it cannot match at all, it returns false. You'll
need to use anchors if you only want the function to return true when the regex matches the entire subject string.
fn:replace(subject, pattern, replacement, flags) takes a subject string, a regular expression, and a
replacement string as input. It returns a new string that is the subject string with all matches of the regex pattern replaced with
the replacement text. You can use $1 to $99 to re-insert capturing groups into the replacement. $0 inserts the whole regex
match. Literal dollar signs and backslashes in the replacement must be escaped with a backslash.
fn:replace cannot replace zero-length matches. E.g. fn:replace("text", "^", "prefix") will raise an error rather
than returning "prefixtext" like regex-based search-and-replace does in most programming languages.
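For comparison, this is what the same operation does in JavaScript, one of the languages in which a zero-length match can be replaced. The second line shows the $1 style of backreference with an invented sample string.

```javascript
// The caret matches the zero-length position at the start of the string,
// and JavaScript inserts the replacement there.
const prefixed = "text".replace(/^/, "prefix"); // "prefixtext"

// Backreferences in the replacement use the same $1, $2 notation.
const reordered = "2019-07".replace(/(\d+)-(\d+)/, "$2/$1"); // "07/2019"

console.log(prefixed, reordered);
```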
fn:tokenize(subject, pattern, flags) is like the "split" function in many programming languages. It returns an
array of strings that consists of all the substrings in subject between all the regex matches. The array will not contain the regex
matches themselves. If the regex matches the first or last character in the subject string, then the first or last string in the
resulting array will be an empty string.
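JavaScript's split() method treats matches at the edges of the string in the same way for simple cases like these, so it can serve as a quick sketch of what to expect from fn:tokenize. The sample strings are invented.

```javascript
// Matches in the middle: only the text between the matches is returned.
const middle = "a,b,c".split(/,/);  // ["a", "b", "c"]

// Matches on the first and last character: the result starts and ends
// with an empty string.
const edges = ",a,b,".split(/,/);   // ["", "a", "b", ""]

console.log(middle, edges);
```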
Using the XRegExp object instead of JavaScript's built-in RegExp object provides you with a regular expression syntax with
more features and fewer cross-browser inconsistencies. Notable added features include free-spacing, named capture, mode
modifiers, and Unicode categories, blocks, and scripts. It also treats invalid escapes and non-existent backreferences as
errors.
XRegExp also provides its own replace() method with a replacement text syntax that is enhanced with named
backreferences and no cross-browser inconsistencies. It also provides a split() method that is fully compliant with the
JavaScript standard.
To use XRegExp, first create a regular expression object with var myre = XRegExp('regex', 'flags') where flags is
a combination of the letters g (global), i (case insensitive), m (anchors match at line breaks), s (dot matches line breaks), x
(free-spacing), and n (explicit capture). XRegExp 3 adds the A flag which includes Unicode characters beyond U+FFFF when
matching Unicode properties and blocks. The ECMAScript 6 flags y (sticky) and u (Unicode) can also be used in modern
browsers that support them natively, but they'll throw errors in browsers that don't have built-in support for these flags.
You can then pass the XRegExp instance you constructed to various XRegExp methods. It's important to make the calls as
shown below to get the full XRegExp functionality. The object returned by the XRegExp constructor is a native JavaScript
RegExp object. That object's methods are the browser's built-in RegExp methods. You can replace the built-in RegExp
methods with XRegExp's methods by calling XRegExp.install('natives'). Doing so also affects RegExp objects
constructed by the normal RegExp constructor or double-slashed regex literals.
XRegExp.test(str, regex, [pos=0], [sticky=false]) tests whether the regex can match part of a string. The
pos argument is a zero-based index in the string where the match attempt should begin. If you pass true or 'sticky' for
the sticky parameter, then the match is only attempted at pos. This is similar to adding the start-of-attempt anchor \G
(which XRegExp doesn't support) to the start of your regex in other flavors.
XRegExp.exec(str, regex, [pos=0], [sticky=false]) does the same as XRegExp.test() but returns null or
an array instead of false or true. Index 0 in the array holds the overall regex match. Indexes 1 and beyond hold the text
matched by capturing groups, if any. If the regex has named capturing groups, their matches are available as properties on
the array. XRegExp.exec() does not rely on the lastIndex property and thus avoids cross-browser problems with that
property.
XRegExp.forEach(str, regex, callback) makes it easy to iterate over all matches of the regex in a string. It always
iterates over all matches, regardless of the global flag and the lastIndex property. The callback is called with four arguments.
The first two are an array like the one returned by exec() and the index in the string that the match starts at. The last two are str
and regex exactly as you passed them to forEach().
XRegExp.replace(str, regex, replacement, [scope]) returns a string with the matches of regex in str replaced
with replacement. Pass 'one' or 'all' as the scope argument to replace only the first match or all matches. If you omit
the scope argument then the regex.global flag determines whether only the first or all matches are replaced.
The XRegExp.replace() method uses its own replacement text syntax. It is very similar to the native JavaScript syntax. It
is somewhat incompatible by making dollar signs that don't form valid replacement tokens an error. But the benefit is that it
eliminates all cross-browser inconsistencies. $$ inserts a single literal dollar sign. $& and $0 insert the overall regex match.
$` and $' insert the part of the subject string to the left and the right of the regex match. $n, $nn, ${n}, and ${nn} are
numbered backreferences while ${name} is a named backreference.
If you pass a function as the replacement parameter, then it will be called with three or more arguments. The first argument is
the string that was matched, with named capturing groups available through properties on that string. The second and
following arguments are the strings matched by each of the capturing groups in the regex, if any. The final two arguments are
the index in the string at which the match was found and the original subject string.
PostgreSQL
PostgreSQL Has Three Regular Expression Flavors.
PostgreSQL 7.4 and later use the exact same regular expression engine that was developed by Henry Spencer for Tcl 8.2.
This means that PostgreSQL supports the same three regular expressions flavors: Tcl Advanced Regular Expressions, POSIX
Extended Regular Expressions and POSIX Basic Regular Expressions. Just like in Tcl, AREs are the default. All my
comments on Tcl's regular expression flavor, like the unusual mode modifiers and word boundary tokens, fully apply to
PostgreSQL. You should definitely review them if you're not familiar with Tcl's AREs. Unfortunately, PostgreSQL's
regexp_replace function does not use the same syntax for the replacement text as Tcl's regsub command.
PostgreSQL versions prior to 7.4 supported POSIX Extended Regular Expressions only. If you are migrating old database
code to a new version of PostgreSQL, you can set PostgreSQL's "regex_flavor" run-time parameter to "extended"
instead of the default "advanced" to make EREs the default.
PostgreSQL also supports the traditional SQL LIKE operator, and the SQL:1999 SIMILAR TO operator. These use their own
pattern languages, which are not discussed here. AREs are far more powerful, and no more complicated if you don't use
functionality not offered by LIKE or SIMILAR TO.
While only case sensitivity can be toggled by the operator, all other options can be set using mode modifiers at the start of the
regular expression. Mode modifiers override the operator type. E.g. '(?c)regex' forces the regex to be case sensitive.
The most common use of this operator is to select rows based on whether a column matches a regular expression, e.g.:
This function is particularly useful to extract information from columns. E.g. to extract the first number from the column
mycolumn for each row, use:
With regexp_replace(subject, pattern, replacement [, flags]) you can replace regex matches in a string. If
you omit the flags parameter, the regex is applied case sensitively, and only the first match is replaced. If you set flags to
'i', the regex is applied case insensitively. The 'g' flag (for "global") causes all regex matches in the string to be replaced.
You can combine both flags as 'gi'.
You can use the backreferences \1 through \9 in the replacement text to re-insert the text matched by a capturing group into
the result. \& re-inserts the whole regex match. Remember to double up the backslashes in literal strings.
MySQL does not offer any matching modes. POSIX EREs don't support mode modifiers inside the regular expression, and
MySQL's REGEXP operator does not provide a way to specify modes outside the regular expression. The dot matches all
characters including newlines, and the caret and dollar only match at the very start and end of the string. In other words:
MySQL treats newline characters like ordinary characters. The REGEXP operator applies regular expressions case
insensitively if the collation of the table is case insensitive, which is the default. If you change the collation to be case
sensitive, the REGEXP operator becomes case sensitive.
Remember that MySQL supports C-style escape sequences in strings. While POSIX ERE does not support tokens like \n to
match non-printable characters like line breaks, MySQL does support this escape in its strings. So WHERE testcolumn
REGEXP '\n' returns all rows where testcolumn contains a line break. MySQL converts the \n in the string into a single line
break character before parsing the regular expression. This also means that backslashes need to be escaped. The regex \\
to match a single backslash becomes '\\\\' as a MySQL string, and the regex \$ to match a dollar symbol becomes
'\\$' as a MySQL string. All this is unlike other databases like Oracle, which don't support \n and don't require backslashes
to be escaped.
To return rows where the column doesn't match the regular expression, use WHERE testcolumn NOT REGEXP
'pattern'. The RLIKE operator is a synonym of the REGEXP operator. WHERE testcolumn RLIKE 'pattern' and
WHERE testcolumn NOT RLIKE 'pattern' are identical to WHERE testcolumn REGEXP 'pattern' and WHERE
testcolumn NOT REGEXP 'pattern'. It is recommended that you use REGEXP instead of RLIKE, to avoid confusion
with the LIKE operator.
LIB_MYSQLUDF_PREG
If you want more regular expression power in your database, you can consider using LIB_MYSQLUDF_PREG. This is an
open source library of MySQL user functions that imports the PCRE library. LIB_MYSQLUDF_PREG is delivered in source
code form only. To use it, you'll need to be able to compile it and install it into your MySQL server. Installing this library does
not change MySQL's built-in regex support in any way. It merely makes the following additional functions available:
PREG_CAPTURE extracts a regex match from a string. PREG_POSITION returns the position at which a regular expression
matches a string. PREG_REPLACE performs a search-and-replace on a string. PREG_RLIKE tests whether a regex matches a
string.
All these functions take a regular expression as their first parameter. This regular expression must be formatted like a Perl
regular expression operator. E.g. to test if regex matches the subject case insensitively, you'd use the MySQL code
PREG_RLIKE('/regex/i', subject). This is similar to PHP's preg functions, which also require the extra // delimiters
for regular expressions inside the PHP string.
Second, the POSIX standard states it is illegal to escape a character that is not a metacharacter with a backslash. Oracle
allows this, and simply ignores the backslash. E.g. \z is identical to z in Oracle. The result is that all POSIX ERE regular
expressions can be used with Oracle, but some regular expressions that work in Oracle may cause an error in a fully POSIX-
compliant engine. Obviously, if you only work with Oracle, these differences are irrelevant.
The third difference is more subtle. It won't cause any errors, but may result in different matches. As was explained in the topic
about the POSIX standard, it requires the regex engine to return the longest match in case of alternation. Oracle's engine
does not do this. It is a traditional NFA engine, like all non-POSIX regex flavors discussed on this website.
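The difference can be observed with any traditional NFA engine. The sketch below uses Python's re module, which is such
an engine, as a stand-in for illustration. The sample regex and subject are assumptions and this is not Oracle code:

```python
import re

# A traditional NFA engine tries the alternatives from left to right and
# stops at the first one that lets the overall regex succeed.
first = re.search(r"Set|SetValue", "SetValue").group()

# Reordering the alternatives changes the match on such an engine.
second = re.search(r"SetValue|Set", "SetValue").group()

print(first)   # Set
print(second)  # SetValue
```

An engine that follows the POSIX longest-match rule would return SetValue in both cases.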
If you've worked with regular expressions in other programming languages, be aware that POSIX does not support
non-printable character escapes like \t for a tab or \n for a newline. You can use these with a POSIX engine in a
programming language like C++, because the C++ compiler will interpret the \t and \n in string constants. In SQL
statements, you'll need to type an actual tab or line break in the string with your regular expression to make it match a tab or
line break. Oracle's regex engine will interpret the string '\t' as the regex t when passed as the regexp parameter.
Oracle 10g R2 further extends the regex syntax by adding a free-spacing mode (without support for comments), shorthand
character classes, lazy quantifiers, and the anchors \A, \Z, and \z. Oracle 11g and 12c use the same regex flavor as 10g R2.
REGEXP_LIKE(source, regexp, modes) is probably the one you'll use most. You can use it in the WHERE and HAVING
clauses of a SELECT statement. In a PL/SQL script, it returns a Boolean value. You can also use it in a CHECK constraint. The
source parameter is the string or column the regex should be matched against. The regexp parameter is a string with your
regular expression. The modes parameter is optional. It sets the matching modes.
REGEXP_SUBSTR(source, regexp, position, occurrence, modes) returns a string with the part of source
matched by the regular expression. If the match attempt fails, NULL is returned. You can use REGEXP_SUBSTR with a single
string or with a column. You can use it in SELECT clauses to retrieve only a certain part of a column. The position parameter
specifies the character position in the source string at which the match attempt should start. The first character has position 1.
The occurrence parameter specifies which match to get. Set it to 1 to get the first match.
If you specify a higher number, Oracle will continue to attempt to match the regex starting at the end of the previous match,
until it has found as many matches as you specified. The last match is then returned. If there are fewer matches, NULL is returned.
Do not confuse this parameter with backreferences. Oracle does not provide a function to return the part of the string matched
by a capturing group. The last three parameters are optional.
REGEXP_COUNT(source, regexp, position, modes) returns the number of times the regex can be matched in the
source string. It returns zero if the regex finds no matches at all. This function is only available in Oracle 11g and later.
'i': Turn on case insensitive matching. The default depends on the NLS_SORT setting.
'c': Turn on case sensitive matching. The default depends on the NLS_SORT setting.
'n': Make the dot match any character, including newlines. By default, the dot matches any character except newlines.
'm': Make the caret and dollar match at the start and end of each line (i.e. after and before line breaks embedded in the
source string). By default, these only match at the very start and the very end of the string.
Sample Regular Expressions
Below, you will find many example patterns that you can use for and adapt to your own purposes. Key techniques used in
crafting each regex are explained, with links to the corresponding pages in the tutorial where these concepts and techniques
are explained in great detail.
If you are new to regular expressions, you can take a look at these examples to see what is possible. Regular expressions are
very powerful. They do take some time to learn. But you will earn back that time quickly when using regular expressions to
automate searching or editing tasks, or when writing scripts or applications in a variety of languages.
Oh, and you definitely do not need to be a programmer to take advantage of regular expressions!
<([a-z][a-z0-9]*)\b[^>]*>(.*?)</\1> will match the opening and closing pair of any HTML tag. Be sure to turn off
case sensitivity. The key in this solution is the use of the backreference \1 in the regex. Anything between the tags is captured
into the second backreference. This solution will also not match tags nested in themselves.
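A minimal sketch of this regex in Python, with case sensitivity turned off. The sample HTML is made up for illustration:

```python
import re

tag_pair = re.compile(r"<([a-z][a-z0-9]*)\b[^>]*>(.*?)</\1>", re.IGNORECASE)

html = '<p>Some <B class="x">bold</b> text</p>'

# Searching from the start finds the outer <p> pair.
outer = tag_pair.search(html)
print(outer.group(1))  # p
print(outer.group(2))  # Some <B class="x">bold</b> text

# Searching from index 1 skips the outer tag and finds the inner pair.
# The backreference matches </b> for <B> because matching is case insensitive.
inner = tag_pair.search(html, 1)
print(inner.group(1))  # B
print(inner.group(2))  # bold
```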
Trimming Whitespace
You can easily trim unnecessary whitespace from the start and the end of a string or the lines in a text file by doing a regex
search-and-replace. Search for ^[ \t]+ and replace with nothing to delete leading whitespace (spaces and tabs). Search for
[ \t]+$ to trim trailing whitespace. Do both by combining the regular expressions into ^[ \t]+|[ \t]+$. Instead of
[ \t] which matches a space or a tab, you can expand the character class into [ \t\r\n] if you also want to strip line
breaks. Or you can use the shorthand \s instead.
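As a sketch, the combined regex can be applied to every line of a text in Python. The re.MULTILINE flag is needed so that
the anchors match at each line rather than only at the start and end of the whole string. The sample text is made up:

```python
import re

text = "  \tfirst line \n\tsecond line\t \n"

# Delete leading and trailing spaces and tabs on each line.
trimmed = re.sub(r"^[ \t]+|[ \t]+$", "", text, flags=re.MULTILINE)
print(repr(trimmed))  # 'first line\nsecond line\n'
```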
Matching a Floating Point Number. Also illustrates the common mistake of making everything in a regular expression optional.
Matching an Email Address. There's a lot of controversy about what is a proper regex to match email addresses. It's a perfect
example showing that you need to know exactly what you're trying to match (and what not), and that there's always a trade-off
between regex complexity and accuracy.
IP Addresses
Matching Valid Dates. A regular expression that matches 31-12-1999 but not 31-13-1999.
Finding or Verifying Credit Card Numbers. Validate credit card numbers entered on your order form. Find credit card numbers
in documents for a security audit.
Matching Complete Lines. Shows how to match complete lines in a text file rather than just the part of the line that satisfies a
certain requirement. Also shows how to match lines in which a particular regex does not match.
Removing Duplicate Lines or Items. Illustrates simple yet clever use of capturing parentheses or backreferences.
Regex Examples for Processing Source Code. How to match common programming language syntax such as comments,
strings, numbers, etc.
Two Words Near Each Other. Shows how to use a regular expression to emulate the "near" operator that some tools have.
Common Pitfalls
Catastrophic Backtracking. If your regular expression seems to take forever, or simply crashes your application, it has likely
contracted a case of catastrophic backtracking. The solution is usually to be more specific about what you want to match, so
the number of matches the engine has to try doesn't rise exponentially.
Making Everything Optional. If all the parts in your regex are optional, it will match a zero-width string anywhere. Your regex
will need to express the facts that different parts are optional depending on which parts are present.
Repeating a Capturing Group vs. Capturing a Repeated Group. Repeating a capturing group will capture only the last iteration
of the group. Capture a repeated group if you want to capture all iterations.
Mixing Unicode and 8-bit Character Codes. Using 8-bit character codes like \x80 with a Unicode engine and subject string
may give unexpected results.
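The difference between repeating a capturing group and capturing a repeated group can be sketched in Python. The regexes
and the subject string are illustrative assumptions:

```python
import re

subject = "abc123"

# Repeating a capturing group: the group is overwritten on each iteration,
# so only the last iteration is kept.
repeated_capture = re.fullmatch(r"(abc|123)+", subject)
print(repeated_capture.group(1))  # 123

# Capturing a repeated group: a non-capturing group is repeated inside
# the capturing group, so all iterations are captured together.
captured_repeat = re.fullmatch(r"((?:abc|123)+)", subject)
print(captured_repeat.group(1))  # abc123
```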
Since regular expressions work with text, a regular expression engine treats 0 as a single character, and 255 as three
characters. To match all numbers from 0 to 255, we'll need a regex that matches between one and three characters.
The regex [0-9] matches single-digit numbers 0 to 9. [1-9][0-9] matches double-digit numbers 10 to 99. That's the easy
part.
Matching the three-digit numbers is a little more complicated, since we need to exclude numbers 256 through 999.
1[0-9][0-9] takes care of 100 to 199. 2[0-4][0-9] matches 200 through 249. Finally, 25[0-5] adds 250 through 255.
As you can see, you need to split up the numeric range into ranges with the same number of digits, and then split each of those
into ranges that allow the same variation for each digit. In the 3-digit range in our example, numbers starting with 1 allow all 10
digits for the following two digits, while numbers starting with 2 restrict the digits that are allowed to follow.
Putting this all together using alternation gives:
[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]
This matches the numbers we want, with one caveat: regular expression searches usually allow partial matches, so our regex
would match 123 in 12345. There are two solutions to this.
If you're searching for these numbers in a larger document or input string, use word boundaries to require a non-word
character (or no character at all) to precede and to follow any valid match:
\b([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\b
Since the alternation operator has the lowest precedence of all, the round brackets are required to group the alternatives
together. This way the regex engine will try to match the first word boundary, then try all the alternatives, and then try to match
the second word boundary after the numbers it matched. Regular expression engines consider all alphanumeric characters,
as well as the underscore, as word characters.
If you're using the regular expression to validate input, you'll probably want to check that the entire input consists of a valid
number. To do this, use anchors instead of word boundaries:
^([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$
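Both variants can be checked with a short Python sketch. The brute-force loop and the sample sentence are assumptions added
for illustration:

```python
import re

alternatives = r"([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])"
anchored = re.compile("^" + alternatives + "$")
bounded = re.compile(r"\b" + alternatives + r"\b")

# Validation: only the numbers 0 to 255, without leading zeros, are accepted.
accepted = [n for n in range(1000) if anchored.match(str(n))]
print(accepted == list(range(256)))  # True
print(bool(anchored.match("007")))   # False

# Searching: word boundaries prevent a partial match inside 12345.
found = bounded.findall("12345 and 255 and 256")
print(found)  # ['255']
```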
Here are a few more common ranges that you may want to match:
000...255: ^([01][0-9][0-9]|2[0-4][0-9]|25[0-5])$
0 or 000...255: ^([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])$
0 or 000...127: ^(0?[0-9]?[0-9]|1[0-1][0-9]|12[0-7])$
0...999: ^([0-9]|[1-9][0-9]|[1-9][0-9][0-9])$
000...999: ^[0-9]{3}$
0 or 000...999: ^[0-9]{1,3}$
1...999: ^([1-9]|[1-9][0-9]|[1-9][0-9][0-9])$
001...999: ^(00[1-9]|0[1-9][0-9]|[1-9][0-9][0-9])$
1 or 001...999: ^(0{0,2}[1-9]|0?[1-9][0-9]|[1-9][0-9][0-9])$
0 or 00...59: ^[0-5]?[0-9]$
0 or 000...366: ^(0?[0-9]?[0-9]|[1-2][0-9][0-9]|3[0-5][0-9]|36[0-6])$
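Patterns like these are easy to get subtly wrong, so it can be worth testing them exhaustively. The following Python sketch
uses a hypothetical helper, accepted_strings, which is not part of any library. It collects every string of one to three digits
that a pattern accepts, so the result can be compared with the intended range:

```python
import re

def accepted_strings(pattern, max_digits=3):
    """Return every string of 1..max_digits decimal digits that the pattern matches."""
    regex = re.compile(pattern)
    hits = set()
    for width in range(1, max_digits + 1):
        for n in range(10 ** width):
            candidate = str(n).zfill(width)
            if regex.match(candidate):
                hits.add(candidate)
    return hits

zero_padded_bytes = accepted_strings(r"^([01][0-9][0-9]|2[0-4][0-9]|25[0-5])$")
one_to_999 = accepted_strings(r"^([1-9]|[1-9][0-9]|[1-9][0-9][0-9])$")
minutes = accepted_strings(r"^[0-5]?[0-9]$")

print(len(zero_padded_bytes))  # 256
print(len(one_to_999))         # 999
print(len(minutes))            # 70
```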
At first thought, the following regex seems to do the trick: [-+]?[0-9]*\.?[0-9]*. This defines a floating point number as
an optional sign, followed by an optional series of digits (integer part), followed by an optional dot, followed by another optional
series of digits (fraction part).
Spelling out the regex in words makes it obvious: everything in this regular expression is optional. This regular expression will
consider a sign by itself or a dot by itself as a valid floating point number. In fact, it will even consider an empty string as a
valid floating point number. This regular expression can cause serious trouble if it is used in a scripting language like Perl or
PHP to verify user input.
Not escaping the dot is also a common mistake. A dot that is not escaped will match any character, including a dot. If we had
not escaped the dot, 4.4 would be considered a floating point number, and 4X4 too.
When creating a regular expression, it is more important to consider what it should not match, than what it should. The above
regex will indeed match a proper floating point number, because the regex engine is greedy. But it will also match many things
we do not want, which we have to exclude.
Here is a better attempt: [-+]?([0-9]*\.[0-9]+|[0-9]+). This regular expression will match an optional sign, that is
either followed by zero or more digits followed by a dot and one or more digits (a floating point number with optional integer
part), or followed by one or more digits (an integer).
This is a far better definition. Any match will include at least one digit, because there is no way around the [0-9]+ part. We
have successfully excluded the matches we do not want: those without digits.
If you also want to match numbers with exponents, you can use: [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?.
Notice how I made the entire exponent part optional by grouping it together, rather than making each element in the exponent
optional.
Finally, if you want to validate if a particular string holds a floating point number, rather than finding a floating point number
within longer text, you'll have to anchor your regex:
^[-+]?[0-9]*\.?[0-9]+$ or ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$.
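The behavior of these regexes can be compared with a short Python sketch. It uses re.fullmatch() instead of adding the
anchors to each regex, which has the same effect for validation. The test strings are made up for illustration:

```python
import re

naive = re.compile(r"[-+]?[0-9]*\.?[0-9]*")
better = re.compile(r"[-+]?([0-9]*\.[0-9]+|[0-9]+)")
with_exponent = re.compile(r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?")

def accepts(regex, text):
    """True if the regex matches the entire text."""
    return regex.fullmatch(text) is not None

# The naive regex accepts strings that contain no digits at all.
print([accepts(naive, s) for s in ["", "+", "."]])   # [True, True, True]
print([accepts(better, s) for s in ["", "+", "."]])  # [False, False, False]

# The better regex still accepts proper numbers.
print([accepts(better, s) for s in ["3.14", "-.5", "42"]])  # [True, True, True]

# Only the last regex allows an exponent.
print(accepts(better, "1e10"), accepts(with_exponent, "1e10"))  # False True
```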
How to Find or Validate an Email Address
Before ICANN made it possible for any well-funded company to create their own top-level domains, the longest top-level
domains were the rarely used .museum and .travel which are 6 letters long. The most common top-level domains were 2
letters long for country-specific domains, and 3 or 4 letters long for general-purpose domains like .com and .info. A lot of
regexes for validating email addresses you'll find in various regex tutorials and references still assume the top-level domain to
be fairly short. Older editions of this regex tutorial mentioned \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b as the
regex for email addresses in its introduction. There's one thing to keep in mind. The 4 at the end of the regex restricts the
top-level domain to 4 characters. If you use this regex with anchors to validate the email address entered on your order form,
a customer whose address ends in .solutions has to do his shopping elsewhere. Yes, the .solutions TLD exists, and when I write
this, disaproved.solutions can be yours for $16.88 per year.
^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,63}$ is as far as you can practically go. Each part of a domain name can
be no longer than 63 characters. There are no single-letter top-level domains and none contain digits. It doesn't look like
ICANN will approve such domains either.
If you want to avoid your system choking on arbitrarily large input, you can replace the infinite quantifiers with finite ones.
^[A-Z0-9._%+-]{1,64}@(?:[A-Z0-9-]{1,63}\.){1,125}[A-Z]{2,63}$ takes into account that the local part
(before the @) is limited to 64 characters and that each part of the domain name is limited to 63 characters. There's no direct
limit on the number of subdomains. But the maximum length of an email address that can be handled by SMTP is 254
characters. So with a single-character local part, a two-letter top-level domain and single-character sub-domains, 125 is the
maximum number of sub-domains.
The previous regex does not actually limit email addresses to 254 characters. If each part is at its maximum length, the regex
can match strings up to 8128 characters in length. You can reduce that by lowering the number of allowed sub-domains from
125 to something more realistic like 8. I've never seen an email address with more than 4 subdomains. If you want to enforce
the 254 character limit, the best solution is to check the length of the input string before you even use a regex. Though this
requires a few lines of procedural code, checking the length of a string is near-instantaneous. If you can only use regexes,
^[A-Z0-9@._%+-]{6,254}$ can be used as a first pass to make sure the string doesn't contain invalid characters and isn't
too short or too long. If you need to do everything with one regex, you'll need a regex flavor that supports lookahead. The
regex ^(?=[A-Z0-9@._%+-]{6,254}$)[A-Z0-9._%+-]{1,64}@(?:[A-Z0-9-]{1,63}\.){1,8}[A-Z]{2,63}$
uses a lookahead to first check that the string doesn't contain invalid characters and isn't too short or too long. When the
lookahead succeeds, the remainder of the regex makes a second pass over the string to check for proper placement of the @
sign and the dots.
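The recommended approach of checking the length in procedural code before applying the regex can be sketched in Python.
The function name looks_like_email is a hypothetical helper, and the sample addresses are made up:

```python
import re

EMAIL = re.compile(
    r"^[A-Z0-9._%+-]{1,64}@(?:[A-Z0-9-]{1,63}\.){1,8}[A-Z]{2,63}$",
    re.IGNORECASE,
)

def looks_like_email(address):
    # Cheap length check first: between 6 and 254 characters overall.
    if not 6 <= len(address) <= 254:
        return False
    # fullmatch() avoids accepting a trailing line break before $.
    return EMAIL.fullmatch(address) is not None

# 261 characters long: the regex alone accepts it, the length check does not.
too_long = "a@" + ".".join(["b" * 63] * 4) + ".com"

print(looks_like_email("john@example.com"))   # True
print(looks_like_email("john@example"))       # False
print(EMAIL.fullmatch(too_long) is not None)  # True
print(looks_like_email(too_long))             # False
```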
All of these regexes allow the characters ._%+- anywhere in the local part. You can force the local part to begin with a letter or digit
by using ^[A-Z0-9][A-Z0-9._%+-]{0,63} instead of ^[A-Z0-9._%+-]{1,64} for the local part:
^[A-Z0-9][A-Z0-9._%+-]{0,63}@(?:[A-Z0-9-]{1,63}\.){1,125}[A-Z]{2,63}$. When using lookahead to
check the overall length of the address, the first character can be checked in the lookahead. We don't need to repeat the initial
character check when checking the length of the local part. This regex is too long to fit the width of the page, so let's turn on
free-spacing mode:
^(?=[A-Z0-9][A-Z0-9@._%+-]{5,253}$)
[A-Z0-9._%+-]{1,64}@(?:[A-Z0-9-]{1,63}\.){1,8}[A-Z]{2,63}$
Domain names can contain hyphens. But they cannot begin or end with a hyphen.
[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])? matches a domain name between 1 and 63 characters long that starts and
ends with a letter or digit. The quantifier {0,61} leaves room for the first and the last character within the 63-character limit.
The non-capturing group makes the middle of the domain and the final letter or digit optional as a
whole to ensure that we allow single-character domains while at the same time ensuring that domains with two or more
characters do not end with a hyphen. The overall regex starts to get quite complicated:
^[A-Z0-9][A-Z0-9._%+-]{0,63}@(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.){1,8}[A-Z]{2,63}$
Domain names cannot contain consecutive hyphens. [A-Z0-9]+(?:-[A-Z0-9]+)* matches a domain name that starts
and ends with a letter or digit and that contains any number of non-consecutive hyphens. This is the most efficient way. This
regex does not do any backtracking to match a valid domain name. It matches all letters and digits at the start of the domain
name. If there are no hyphens, the optional group that follows fails immediately. If there are hyphens, the group matches each
hyphen followed by all letters and digits up to the next hyphen or the end of the domain name. We can't enforce the maximum
length with a quantifier here, because each hyphen must be paired with a letter or digit while letters and digits can stand on
their own. But we can use the
lookahead technique that we used to enforce the overall length of the email address to enforce the length of the domain name
while disallowing consecutive hyphens: (?=[A-Z0-9-]{1,63}\.)[A-Z0-9]+(?:-[A-Z0-9]+)*. Notice that the
lookahead also checks for the dot that must appear after the domain name when it is fully qualified in an email address. This
is important. Without checking for the dot, the lookahead would accept longer domain names. Since the lookahead does not
consume the text it matches, the dot is not included in the overall match of this regex. When we put this regex into the overall
regex for email addresses, the dot will be matched as it was in the previous regexes:
^[A-Z0-9][A-Z0-9._%+-]{0,63}@
(?:(?=[A-Z0-9-]{1,63}\.)[A-Z0-9]+(?:-[A-Z0-9]+)*\.){1,8}[A-Z]{2,63}$
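The part of this regex that matches a single domain name can be tested on its own. In the Python sketch below, label_ok is
a hypothetical helper. It appends the dot that the lookahead requires and checks that the match covers the whole name:

```python
import re

LABEL = re.compile(r"(?=[A-Z0-9-]{1,63}\.)[A-Z0-9]+(?:-[A-Z0-9]+)*", re.IGNORECASE)

def label_ok(label):
    # The lookahead needs the dot that follows the name in a full address.
    m = LABEL.match(label + ".")
    # The match must end right before that dot.
    return m is not None and m.end() == len(label)

print(label_ok("regular-expressions"))  # True
print(label_ok("my--site"))             # False: consecutive hyphens
print(label_ok("-start"))               # False: begins with a hyphen
print(label_ok("end-"))                 # False: ends with a hyphen
print(label_ok("a" * 63))               # True
print(label_ok("a" * 64))               # False: rejected by the lookahead
```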
If we include the lookahead to check the overall length, our regex makes two passes over the local part, and three passes
over the domain names to validate everything:
^(?=[A-Z0-9][A-Z0-9@._%+-]{5,253}$)[A-Z0-9._%+-]{1,64}@
(?:(?=[A-Z0-9-]{1,63}\.)[A-Z0-9]+(?:-[A-Z0-9]+)*\.){1,8}[A-Z]{2,63}$
On a modern PC or server this regex will perform just fine when validating a single 254-character email address. Rejecting
longer input would even be faster because the regex will fail when the lookahead fails during the first pass. But I wouldn't
recommend using a regex as complex as this to search for email addresses through a large archive of documents or
correspondence. You're better off using the simple regex at the top of this page to quickly gather everything that looks like an
email address. Deduplicate the results and then use a stricter regex if you want to further filter out invalid addresses.
And speaking of backtracking, the regexes on this page that match each part of the domain name separately do not need any
backtracking to match valid email addresses. But particularly the latter ones may do a fair bit of backtracking on something
that's not quite a valid email address. If your regex flavor supports possessive quantifiers, you can eliminate that
backtracking by making the quantifiers possessive. Because no backtracking is needed to find matches, doing this does not
change what is matched by these regexes. It only allows them to fail faster when the input is not a valid email address.
The simple regexes that match the whole domain name with [A-Z0-9.-]+ are the exception. That character class also matches
the dots and the top-level domain, so the engine has to backtrack before \.[A-Z]{2,} can match. That quantifier must stay
greedy. Our simplest regex then becomes ^[A-Z0-9._%+-]++@[A-Z0-9.-]+\.[A-Z]{2,}+$ with an extra + after the other
quantifiers only. In our most complex regex, all quantifiers can be made possessive:
^(?=[A-Z0-9][A-Z0-9@._%+-]{5,253}+$)[A-Z0-9._%+-]{1,64}+@
(?:(?=[A-Z0-9-]{1,63}+\.)[A-Z0-9]++(?:-[A-Z0-9]++)*+\.){1,8}+[A-Z]{2,63}+$
An important trade-off in all these regexes is that they only allow English letters, digits, and the most commonly used special
symbols. The main reason is that I don't trust all my email software to be able to handle much else. Even though
John.O'[email protected] is a syntactically valid email address, there's a risk that some software will misinterpret the
apostrophe as a delimiting quote. Blindly inserting this email address into an SQL query, for example, will at best cause it to
fail when strings are delimited with single quotes and at worst open your site up to SQL injection attacks.
And of course, it's been many years already that domain names can include non-English characters. But most software still
sticks to the 37 characters Western programmers are used to. Supporting internationalized domains opens up a whole can of
worms of how the non-ASCII characters should be encoded. So if you use any of the regexes on this page, anyone with an
@ทีเอชนิค.ไทย address will be out of luck. But perhaps it is telling that http://ทีเอชนิค.ไทย simply redirects to
http://thnic.co.th even though they're in the business of selling .ไทย domains.
The conclusion is that to decide which regular expression to use, whether you're trying to match an email address or
something else that's vaguely defined, you need to start with considering all the trade-offs. How bad is it to match something
that's not valid? How bad is it not to match something that is valid? How complex can your regular expression be? How
expensive would it be if you had to change the regular expression later because it turned out to be too broad or too narrow?
Different answers to these questions will require a different regular expression as the solution. My email regex does what I
want, but it may not do what you want.
The same principle applies in many situations. When trying to match a valid date, it's often easier to use a bit of arithmetic to
check for leap years, rather than trying to do it in a regex. Use a regular expression to find potential matches or check if the
input uses the proper syntax, and do the actual validation on the potential matches returned by the regular expression.
Regular expressions are a powerful tool, but they're far from a panacea.
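For dates, this division of labor can be sketched in Python. The regex only checks the syntax and the standard library does the
arithmetic, including leap years. The dd-mm-yyyy format and the helper name is_valid_date are assumptions for illustration:

```python
import re
from datetime import date

# Syntax check only: two digits, two digits, four digits, separated by hyphens.
DATE_SYNTAX = re.compile(r"([0-9]{2})-([0-9]{2})-([0-9]{4})")

def is_valid_date(text):
    m = DATE_SYNTAX.fullmatch(text)
    if m is None:
        return False
    day, month, year = (int(g) for g in m.groups())
    try:
        # date() raises ValueError for impossible dates such as 29-02-1900.
        date(year, month, day)
    except ValueError:
        return False
    return True

print(is_valid_date("31-12-1999"))  # True
print(is_valid_date("31-13-1999"))  # False
print(is_valid_date("29-02-2000"))  # True
print(is_valid_date("29-02-1900"))  # False
```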
The official standard is known as RFC 5322. It describes the syntax that valid email addresses must adhere to. You can (but
you shouldn't — read on) implement it with the following regular expression. RFC 5322 leaves the domain name part open to
implementation-specific choices that won't work on the Internet today. The regex implements the "preferred" syntax from RFC
1035 which is one of the recommendations in RFC 5322:
\A(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
| "(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]
| \\[\x01-\x09\x0b\x0c\x0e-\x7f])*")
@ (?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
| \[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]
| \\[\x01-\x09\x0b\x0c\x0e-\x7f])+) \])\z
This regex has two parts: the part before the @, and the part after the @. There are two alternatives for the part before the @:
it can either consist of a series of letters, digits and certain symbols, including one or more dots. However, dots may not
appear consecutively or at the start or end of the email address. The other alternative requires the part before the @ to be
enclosed in double quotes, allowing any string of ASCII characters between the quotes. Whitespace characters, double
quotes and backslashes must be escaped with backslashes.
The part after the @ also has two alternatives. It can either be a fully qualified domain name (e.g. regular-expressions.info), or
it can be a literal Internet address between square brackets. The literal Internet address can either be an IP address, or a
domain-specific routing address.
The reason you shouldn't use this regex is that it is overly broad. Your application may not be able to handle all email
addresses this regex allows. Domain-specific routing addresses can contain non-printable ASCII control characters, which can
cause trouble if your application needs to display addresses. Not all applications support the syntax for the local part using
double quotes or square brackets. In fact, RFC 5322 itself marks the notation using square brackets as obsolete.
We get a more practical implementation of RFC 5322 if we omit IP addresses, domain-specific addresses, and the syntax using
double quotes and square brackets. It will still match 99.99% of all email addresses in actual use today.
\A[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\z
Neither of these regexes enforces length limits on the overall email address or the local part or the domain names. RFC 5322
does not specify any length limitations. Those stem from limitations in other protocols like the SMTP protocol for actually
sending email. RFC 1035 does state that domains must be 63 characters or less, but does not include that in its syntax
specification. The reason is that a true regular language cannot enforce a length limit and disallow consecutive hyphens at the
same time. But modern regex flavors aren't truly regular, so we can add length limit checks using lookahead like we did
before:
\A(?=[a-z0-9@.!#$%&'*+/=?^_`{|}~-]{6,254}\z)
(?=[a-z0-9.!#$%&'*+/=?^_`{|}~-]{1,64}@)
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
@ (?:(?=[a-z0-9-]{1,63}\.)[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
(?=[a-z0-9-]{1,63}\z)[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\z
So even when following official standards, there are still trade-offs to be made. Don't blindly copy regular expressions from
online libraries or discussion forums. Always test them on your own data and with your own applications.
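As a minimal sketch of how you might test the length-limited regex on your own data, the following Python snippet compiles it in free-spacing mode. Python is used here purely for illustration. Two assumptions: \Z is substituted for \z because older Python versions do not support \z, and the function name looks_like_email is hypothetical.

```python
import re

# The length-limited regex from the text, with \Z standing in for \z.
# re.VERBOSE ignores the whitespace used for layout; the # characters are
# safe because they all sit inside character classes.
EMAIL = re.compile(r"""
    \A(?=[a-z0-9@.!#$%&'*+/=?^_`{|}~-]{6,254}\Z)
    (?=[a-z0-9.!#$%&'*+/=?^_`{|}~-]{1,64}@)
    [a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
    @ (?:(?=[a-z0-9-]{1,63}\.)[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
    (?=[a-z0-9-]{1,63}\Z)[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\Z
    """, re.VERBOSE | re.IGNORECASE)


def looks_like_email(text):
    """Return True if text has the syntax accepted by the regex above."""
    return EMAIL.match(text) is not None
```

Run it against a sample of the addresses your application actually stores before deciding whether this trade-off suits you.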
To restrict all 4 numbers in the IP address to 0..255, you can use the following regex. It stores each of the 4 numbers of the IP
address into a capturing group. You can use these groups to further process the IP number. Free-spacing mode allows this to
fit the width of the page.
\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
The above regex allows one leading zero for numbers 10 to 99 and up to two leading zeros for numbers 0 to 9. Strictly
speaking, IP addresses with leading zeros imply octal notation. So you may want to disallow leading zeros. This requires a
slightly longer regex:
\b(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.
(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.
(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.
(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\b
Restricting The Four IP Address Numbers Without Capturing Them
If you don't need access to the individual numbers, you can shorten the above 3 regexes with a quantifier to:
\b(?:\d{1,3}\.){3}\d{1,3}\b
\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
\b(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}
(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\b
If you want to validate user input by making sure a string consists of nothing but an IP address then you need to replace the
word boundaries with start-of-string and end-of-string anchors. You can use the dedicated anchors \A and \z if your regex
flavor supports them:
\A(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\z
If not, you'll have to use ^ and $ and make sure that the option for them to match at line breaks is off:
^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
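To illustrate the difference between the two sets of anchors, here is a small Python sketch. It assumes Python's re module, where \Z plays the role of \z and where $ also matches just before a line break at the very end of the string. The names OCTET, STRICT, LOOSE and is_ipv4 are made up for this example.

```python
import re

# One number in the range 0..255, leading zeros allowed (as in the text).
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Dedicated anchors: the string must consist of nothing but the IP address.
STRICT = re.compile(r"\A(?:" + OCTET + r"\.){3}" + OCTET + r"\Z")

# Caret and dollar with multi-line mode off. In Python, $ still allows
# one trailing line break after the address.
LOOSE = re.compile(r"^(?:" + OCTET + r"\.){3}" + OCTET + r"$")


def is_ipv4(text):
    """Return True if text is exactly one dotted-quad IP address."""
    return STRICT.match(text) is not None
```

If a trailing line break matters to your application, the dedicated anchors are the safer choice.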
The month is matched by 0[1-9]|1[012], again enclosed by round brackets to keep the two options together. By using
character classes, the first option matches a number between 01 and 09, and the second matches 10, 11 or 12.
The last part of the regex consists of three options. The first matches the numbers 01 through 09, the second 10 through 29,
and the third matches 30 or 31.
Smart use of alternation allows us to exclude invalid dates such as 2000-00-00 that could not have been excluded without
using alternation. To be really perfectionist, you would have to split up the month into various options to take into account the
length of the month. The above regex still matches 2003-02-31, which is not a valid date. Making leading zeros optional could
be another enhancement.
If you want to require the delimiters to be consistent, you could use a backreference.
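A minimal sketch of the backreference idea in Python: the first delimiter is captured into its own group, and \2 then requires the second delimiter to be the same character. The exact regex and the name DATE are assumptions made for this illustration.

```python
import re

# Group 1: century, group 2: the delimiter, group 3: month, group 4: day.
# \2 must match the very same delimiter character that group 2 captured.
DATE = re.compile(r"^(19|20)\d\d([- /.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])$")

for candidate in ("2003-02-14", "2003/02/14", "2003-02/14"):
    verdict = "consistent" if DATE.match(candidate) else "rejected"
    print(candidate, verdict)
```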
Again, how complex you want to make your regular expression depends on the data you are using it on, and how big a
problem it is if an unwanted match slips through. If you are validating the user's input of a date in a script, it is probably easier
to do certain checks outside of the regex. For example, excluding February 29th when the year is not a leap year is far easier
to do in a scripting language. It is far easier to check if a year is divisible by 4 (and not divisible by 100 unless divisible by 400)
using simple arithmetic than using regular expressions.
Here is how you could check a valid date in Perl. I also added round brackets to capture the year into a backreference.
sub isvaliddate
{
  my $input = shift;
  if ($input =~ m!^((?:19|20)\d\d)[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$!)
  {
    # At this point, $1 holds the year, $2 the month and $3 the day of the date entered
    if ($3 == 31 and ($2 == 4 or $2 == 6 or $2 == 9 or $2 == 11))
    {
      return 0; # 31st of a month with 30 days
    }
    elsif ($3 >= 30 and $2 == 2)
    {
      return 0; # February 30th or 31st
    }
    elsif ($2 == 2 and $3 == 29 and not ($1 % 4 == 0 and ($1 % 100 != 0 or $1 % 400 == 0)))
    {
      return 0; # February 29th outside a leap year
    }
    else
    {
      return 1; # Valid date
    }
  }
  else
  {
    return 0; # Not a date
  }
}
You can use a slightly different regular expression to find credit card numbers, or number sequences that might be credit card
numbers, within larger documents. This can be very useful to prove in a security audit that you're not improperly exposing your
clients' financial details.
If you're wondering what the plus is for: that's for performance. If the input has consecutive non-digits, e.g. 1===2, then the
regex will match the three equals signs at once, and delete them in one replacement. Without the plus, three replacements
would be required. In this case, the savings are only a few microseconds. But it's a good habit to keep regex efficiency in the
back of your mind. Though the savings are minimal here, so is the effort of typing the extra plus.
Visa: ^4[0-9]{12}(?:[0-9]{3})?$ All Visa card numbers start with a 4. New cards have 16 digits. Old cards have 13.
MasterCard: ^(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}$ MasterCard numbers start with the numbers 51 through 55 or 2221 through 2720. All have 16 digits.
American Express: ^3[47][0-9]{13}$ American Express card numbers start with 34 or 37 and have 15 digits.
Diners Club: ^3(?:0[0-5]|[68][0-9])[0-9]{11}$ Diners Club card numbers begin with 300 through 305, 36 or 38.
All have 14 digits. There are Diners Club cards that begin with 5 and have 16 digits. These are a joint venture between
Diners Club and MasterCard, and should be processed like a MasterCard.
Discover: ^6(?:011|5[0-9]{2})[0-9]{12}$ Discover card numbers begin with 6011 or 65. All have 16 digits.
JCB: ^(?:2131|1800|35\d{3})\d{11}$ JCB cards beginning with 2131 or 1800 have 15 digits. JCB cards beginning
with 35 have 16 digits.
If you just want to check whether the card number looks valid, without determining the brand, you can combine the above six
regexes using alternation. A non-capturing group puts the anchors outside the alternation. Free-spacing allows for comments
and for the regex to fit the width of this page.
^(?:4[0-9]{12}(?:[0-9]{3})? # Visa
| (?:5[1-5][0-9]{2} # MasterCard
| 222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}
| 3[47][0-9]{13} # American Express
| 3(?:0[0-5]|[68][0-9])[0-9]{11} # Diners Club
| 6(?:011|5[0-9]{2})[0-9]{12} # Discover
| (?:2131|1800|35\d{3})\d{11} # JCB
)$
These regular expressions will easily catch numbers that are invalid because the customer entered too many or too few digits.
They won't catch numbers with incorrect digits. For that, you need to follow the Luhn Algorithm, which cannot be done with a
regex. And of course, even if the number is mathematically valid, that doesn't mean a card with this number was issued or that
there's money in the account. The benefit of the regular expression is that you can put it in a bit of JavaScript to instantly
check for obvious errors, instead of making the customer wait 30 seconds for your credit card processor to fail the order. And
if your card processor charges for failed transactions, you'll really want to implement both the regex and the Luhn validation.
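The two checks can be combined as in the following Python sketch. It is an illustration under stated assumptions, not a complete payment validator: the stripping regex [^0-9]+ and the function names luhn_ok and check_card are choices made for this example, and the Luhn check is the standard checksum implemented in plain arithmetic.

```python
import re

# The combined regex from the text, in free-spacing mode.
CARD = re.compile(r"""
    ^(?:4[0-9]{12}(?:[0-9]{3})?          # Visa
     |  (?:5[1-5][0-9]{2}                # MasterCard
         | 222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}
     |  3[47][0-9]{13}                   # American Express
     |  3(?:0[0-5]|[68][0-9])[0-9]{11}   # Diners Club
     |  6(?:011|5[0-9]{2})[0-9]{12}      # Discover
     |  (?:2131|1800|35\d{3})\d{11}      # JCB
    )$
    """, re.VERBOSE)


def luhn_ok(digits):
    """Luhn checksum: double every second digit counting from the right."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def check_card(user_input):
    """Strip spaces and dashes, then apply the regex and the Luhn check."""
    digits = re.sub(r"[^0-9]+", "", user_input)
    return CARD.match(digits) is not None and luhn_ok(digits)
```

The regex rejects numbers with the wrong length or prefix. The checksum catches most single mistyped digits that the regex lets through.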
If you're planning to search a large document server, a simpler regular expression will speed up the search. Unless your
company uses 16-digit numbers for other purposes, you'll have few false positives. The regex \b\d{13,16}\b will find any
sequence of 13 to 16 digits.
When searching a hard disk full of files, you can't strip out spaces and dashes first like you can when validating a single card
number. To find card numbers with spaces or dashes in them, use \b(?:\d[ -]*?){13,16}\b. This regex allows any
amount of spaces and dashes anywhere in the number. This is really the only way. Visa and MasterCard put digits in sets of 4,
while Amex and Discover use groups of 4, 5 and 6 digits. People typing in the numbers may have different ideas yet.
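As a rough sketch of such a search in Python, the snippet below collects every run of 13 to 16 digits, with or without spaces and dashes. The function name find_candidates is hypothetical, and the results are only candidates that still need checking.

```python
import re

# 13 to 16 digits, each optionally followed by spaces or dashes.
CANDIDATE = re.compile(r"\b(?:\d[ -]*?){13,16}\b")


def find_candidates(text):
    """Return all substrings that might be card numbers."""
    return [match.group(0) for match in CANDIDATE.finditer(text)]
```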
To keep this example simple, let's say we want to match lines containing the word "John". The regex John makes it easy
enough to locate those lines. But the software will only indicate John as the match, not the entire line containing the word.
The solution is fairly simple. To specify that we need an entire line, we will use the caret and dollar sign and turn on the option
to make them match at embedded newlines. To match the parts of the line before and after the match of our original regular
expression John, we simply use the dot and the star. Be sure to turn off the option for the dot to match newlines.
The resulting regex is: ^.*John.*$. You can use the same method to expand the match of any regular expression to an
entire line, or a block of complete lines. In some cases, such as when using alternation, you will need to group the original
regex together using round brackets.
^.*\b(one|two|three)\b.*$ matches a complete line of text that contains any of the words "one", "two" or "three". The
first backreference will contain the word the line actually contains. If it contains more than one of the words, then the last
(rightmost) word will be captured into the first backreference. This is because the star is greedy. If we make the first star lazy,
like in ^.*?\b(one|two|three)\b.*$, then the backreference will contain the first (leftmost) word.
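The difference between the greedy and the lazy star is easy to see in a short Python sketch. It assumes re.MULTILINE to make the caret and dollar match at line breaks, while the dot does not match newlines by default. The sample text is invented.

```python
import re

GREEDY = re.compile(r"^.*\b(one|two|three)\b.*$", re.MULTILINE)
LAZY = re.compile(r"^.*?\b(one|two|three)\b.*$", re.MULTILINE)

text = "one two three\nnothing here\nthree one"

# The overall match is the entire line in both cases.
lines = [m.group(0) for m in GREEDY.finditer(text)]

# Only the captured word differs: rightmost for greedy, leftmost for lazy.
rightmost = [m.group(1) for m in GREEDY.finditer(text)]
leftmost = [m.group(1) for m in LAZY.finditer(text)]
```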
If your condition is that a line should not contain something, use negative lookahead. ^((?!regexp).)*$ matches a
complete line that does not match regexp. Notice that unlike before, when using positive lookahead, we repeat both the
negative lookahead and the dot together. For the positive lookahead, we only need to find one location where it can match.
But the negative lookahead must be tested at each and every character position in the line. We must test that regexp fails
everywhere, not just somewhere.
Finally, you can combine multiple positive and negative requirements as follows:
^(?=.*?\bmust-have\b)(?=.*?\bmandatory\b)((?!avoid|illegal).)*$
When checking multiple positive requirements, the .* at the end of the regular expression full of zero-width assertions made
sure that we actually matched something. Since the negative requirement must match the entire line, it is easy to replace the
.* with the negative test.
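Here is a small Python sketch of the combined regex at work, assuming re.MULTILINE so that the anchors match at line breaks. The three sample lines are invented to exercise each requirement.

```python
import re

REQUIREMENTS = re.compile(
    r"^(?=.*?\bmust-have\b)(?=.*?\bmandatory\b)((?!avoid|illegal).)*$",
    re.MULTILINE,
)

text = (
    "this must-have item is mandatory\n"     # meets every requirement
    "must-have and mandatory but illegal\n"  # contains a forbidden word
    "only must-have here\n"                  # lacks a required word
)

accepted = [m.group(0) for m in REQUIREMENTS.finditer(text)]
```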
Deleting Duplicate Lines From a File
If you have a file in which all lines are sorted (alphabetically or otherwise), you can easily delete (consecutive) duplicate lines.
Simply open the file in your favorite text editor, and do a search-and-replace searching for ^(.*)(\r?\n\1)+$ and replacing
with \1. For this to work, the anchors need to match before and after line breaks (and not just at the start and the end of the
file or string), and the dot must not match newlines.
Here is how this works. The caret will match only at the start of a line. So the regex engine will only attempt to match the
remainder of the regex there. The dot and star combination simply matches an entire line, whatever its contents, if any. The
round brackets store the matched line into the first backreference.
Next we will match the line separator. We put the question mark into \r?\n to make this regex work with both Windows
(\r\n) and UNIX (\n) text files. So up to this point we matched a line and the following line break.
Now we need to check if this combination is followed by a duplicate of that same line. We do this simply with \1. This is the
first backreference which holds the line we matched. The backreference will match that very same text.
If the backreference fails to match, the regex match and the backreference are discarded, and the regex engine tries again at
the start of the next line. If the backreference succeeds, the plus symbol in the regular expression will try to match additional
copies of the line. Finally, the dollar symbol forces the regex engine to check if the text matched by the backreference is a
complete line. We already know the text matched by the backreference is preceded by a line break (matched by \r?\n).
Therefore, we now check if it is also followed by a line break or if it is at the end of the file using the dollar sign.
The entire match becomes line\nline (or line\nline\nline etc.). Because we are doing a search and replace, the
line, its duplicates, and the line breaks in between them, are all deleted from the file. Since we want to keep the original line,
but not the duplicates, we use \1 as the replacement text to put the original line back in.
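If you would rather script this than use a text editor, the same search-and-replace can be sketched in Python. The re.MULTILINE flag makes the anchors match at line breaks, and the dot does not match newlines by default. The function name is hypothetical.

```python
import re

DUPLICATES = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)


def delete_duplicate_lines(text):
    """Collapse each run of identical consecutive lines into one line."""
    return DUPLICATES.sub(r"\1", text)
```

Note how the dollar sign keeps ab from being treated as a duplicate of the start of abc on the next line.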
The positive lookbehind (?<=,|^) forces the regex engine to start matching at the start of the string or after a comma.
([^,]*) captures the item. (,\1)+ matches consecutive duplicate items. Finally, the positive lookahead (?=,|$) checks if
the duplicate items are complete items by checking for a comma or the end of the string.
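The parts described above can be put together and tried out in Python. One caveat, which is an assumption about the flavor rather than about the technique: Python's re module requires every alternative inside a lookbehind to have the same length, so (?<=,|^) is written here as (?:(?<=,)|^). The function name is made up.

```python
import re

# Start of string or just after a comma, then an item, then one or more
# repetitions of that item, then a comma or the end of the string.
DUPLICATE_ITEMS = re.compile(r"(?:(?<=,)|^)([^,]*)(,\1)+(?=,|$)")


def delete_duplicate_items(csv_line):
    """Keep one copy of each run of identical consecutive items."""
    return DUPLICATE_ITEMS.sub(r"\1", csv_line)
```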
Unless otherwise indicated, all examples below assume that the dot does not match newlines and that the caret and dollar do
match at embedded line breaks. In many programming languages, this means that single line mode must be off, and multi
line mode must be on.
When used by themselves, these regular expressions may not have the intended result. If a comment appears inside a string,
the comment regex will consider the text inside the string as a comment. The string regex will also match strings inside
comments. The solution is to use more than one regular expression, like in this pseudo-code:
GlobalStartPosition := 0;
while GlobalStartPosition < LengthOfText do
  GlobalMatchPosition := LengthOfText;
  MatchedRegEx := NULL;
  foreach RegEx in RegExList do
    RegEx.StartPosition := GlobalStartPosition;
    if RegEx.Match and RegEx.MatchPosition < GlobalMatchPosition then
      MatchedRegEx := RegEx;
      GlobalMatchPosition := RegEx.MatchPosition;
    endif
  endforeach
  if MatchedRegEx <> NULL then
    // At this point, MatchedRegEx indicates which regex matched
    // and you can do whatever processing you want depending on
    // which regex actually matched.
  endif
  GlobalStartPosition := GlobalMatchPosition;
endwhile
If you put a regex matching a comment and a regex matching a string in RegExList, then you can be sure that the comment
regex will not match comments inside strings, and vice versa.
An alternative solution is to combine regexes: (comment)|(string). The alternation has the same effect as the code
snippet above. Iterate over all the matches of this regex. Inside the loop, check which capturing group found the regex match.
If group 1 matched, you have a comment. If group 2 matched, you have a string. Then process the match according to that.
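A minimal Python sketch of this combined approach follows. It assumes # comments that run to the end of the line and double-quoted strings with backslash escapes. Those two patterns and the function name classify are choices made for this illustration. Each search resumes at the end of the previous match, which is what keeps a # inside a string from being seen as a comment.

```python
import re

# Group 1: single-line comment. Group 2: single-line string with escapes.
TOKEN = re.compile(
    r'(#.*$)|("[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*")',
    re.MULTILINE,
)


def classify(source):
    """Return (kind, text) pairs for every comment and string in source."""
    results = []
    for match in TOKEN.finditer(source):
        kind = "comment" if match.group(1) is not None else "string"
        results.append((kind, match.group(0)))
    return results
```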
You can use this technique to build a full parser. Add regular expressions for all lexical elements in the language or file format
you want to parse. Inside the loop, keep track of what was matched so that the following matches can be processed according
to their context. For example, if curly braces need to be balanced, increment a counter when an opening brace is matched,
and decrement it when a closing brace is matched. Raise an error if the counter goes negative at any point or if it is nonzero
when the end of the file is reached.
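The brace-counting idea might look like this in Python. The sketch assumes the same hypothetical language as before (# comments and double-quoted strings) and adds two more groups for the braces, so that braces inside comments and strings are skipped rather than counted.

```python
import re

# Groups: 1 = comment, 2 = string, 3 = opening brace, 4 = closing brace.
TOKEN = re.compile(
    r'(#.*$)|("[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*")|(\{)|(\})',
    re.MULTILINE,
)


def braces_balanced(source):
    """Return True if the curly braces outside comments and strings balance."""
    depth = 0
    for match in TOKEN.finditer(source):
        if match.group(3) is not None:
            depth += 1
        elif match.group(4) is not None:
            depth -= 1
            if depth < 0:
                return False  # closing brace without an opening brace
    return depth == 0  # nonzero means an opening brace was never closed
```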
Comments
#.*$ matches a single-line comment starting with a # and continuing until the end of the line. Similarly, //.*$ matches a
single-line comment starting with //.
If the comment must appear at the start of the line, use ^#.*$. If only whitespace is allowed between the start of the line and
the comment, use ^\s*#.*$. Compiler directives or pragmas in C can be matched this way. Note that in this last example,
any leading whitespace will be part of the regex match. Use capturing parentheses to separate the whitespace and the
comment.
/\*.*?\*/ matches a C-style multi-line comment if you turn on the option for the dot to match newlines. The general syntax
is begin.*?end. C-style comments do not allow nesting. If the "begin" part appears inside the comment, it is ignored. As
soon as the "end" part is found, the comment is closed.
If your programming language allows nested comments, there is no straightforward way to match them using a regular
expression, since regular expressions cannot count. Additional logic is required.
Strings
"[^"\r\n]*" matches a single-line string that does not allow the quote character to appear inside the string. Using the
negated character class is more efficient than using a lazy dot. "[^"]*" allows the string to span across multiple lines.
"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" matches a single-line string in which the quote character can appear if it is
escaped by a backslash. Though this regular expression may seem more complicated than it needs to be, it is much faster
than simpler solutions which can cause a whole lot of backtracking in case a double quote appears somewhere all by itself
rather than part of a string. "[^"\\]*(?:\\.[^"\\]*)*" allows the string to span multiple lines.
You can adapt the above regexes to match any sequence delimited by two (possibly different) characters. If we use b for the
starting character, e as the ending character, and x as the escape character, the version without escape becomes b[^e\r\n]*e, and
the version with escape becomes b[^ex\r\n]*(?:x.[^ex\r\n]*)*e.
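The general pattern can be turned into a small Python helper. This is only a sketch: the function name delimited is made up, and re.escape is used so that the three characters are taken literally even if they are metacharacters.

```python
import re


def delimited(begin, end, escape):
    """Build b[^ex\\r\\n]*(?:x.[^ex\\r\\n]*)*e for three single characters."""
    b, e, x = re.escape(begin), re.escape(end), re.escape(escape)
    return re.compile(f"{b}[^{e}{x}\\r\\n]*(?:{x}.[^{e}{x}\\r\\n]*)*{e}")


# Double-quoted strings with backslash escapes.
strings = delimited('"', '"', "\\").findall(r'print("say \"hi\"", "x")')

# Angle brackets as delimiters, with a tilde as the escape character.
tags = delimited("<", ">", "~").findall("<a~>b> <c>")
```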
Numbers
\b\d+\b matches a positive integer number. Do not forget the word boundaries! [-+]?\b\d+\b allows for a sign.
((\b[0-9]+)?\.)?[0-9]+\b matches an integer number as well as a floating point number with optional integer part.
(\b[0-9]+\.([0-9]+\b)?|\.[0-9]+\b) matches a floating point number with optional integer as well as optional
fractional part, but does not match an integer number.
\b[0-9]+(\.[0-9]+)?(e[+-]?[0-9]+)?\b also matches a number in scientific notation. The difference with the
previous example is that if the mantissa is a floating point number, the integer part is mandatory.
If you read through the floating point number example, you will notice that the above regexes are different from what is used
there. The above regexes are more stringent. They use word boundaries to exclude numbers that are part of other things like
identifiers. You can prepend [-+]? to all of the above regexes to include an optional sign in the regex. I did not do so above
because in programming languages, the + and - are usually considered operators rather than signs.
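To see what the word boundaries buy you, try these regexes on a line of code. The following Python sketch uses invented sample text. Digits that are part of an identifier such as x1 or abc123 are left alone.

```python
import re

INTEGER = re.compile(r"\b\d+\b")
SCIENTIFIC = re.compile(r"\b[0-9]+(\.[0-9]+)?(e[+-]?[0-9]+)?\b")

# Only the stand-alone integer is found.
integers = INTEGER.findall("abc123 456 7up")

# group(0) is the whole number; the capturing groups hold its parts.
numbers = [m.group(0) for m in SCIENTIFIC.finditer("x1 = 3.14 + 2e10 - 42 * 6.02e+23")]
```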
You can easily perform the same task with the proper regular expression.
The complete regular expression becomes \bword1\W+(?:\w+\W+){1,6}?word2\b. The quantifier {1,6}? makes the
regex require at least one word between "word1" and "word2", and allow at most six words.
If the words may also occur in reverse order, we need to specify the opposite pattern as well:
\b(?:word1\W+(?:\w+\W+){1,6}?word2|word2\W+(?:\w+\W+){1,6}?word1)\b
If you want to find any pair of two words out of a list of words, you can use:
\b(word1|word2|word3)(?:\W+\w+){1,6}?\W+(word1|word2|word3)\b.
This regex will also find a word near itself, e.g. it will match word2 near word2.
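A sketch of a Python helper that builds the one-directional regex for any two words is shown below. The function name near and its max_between parameter are hypothetical. re.escape guards against words that contain metacharacters.

```python
import re


def near(word1, word2, max_between=6):
    """Match word1 followed by word2 with 1 to max_between words in between."""
    first, second = re.escape(word1), re.escape(word2)
    return re.compile(r"\b%s\W+(?:\w+\W+){1,%d}?%s\b" % (first, max_between, second))


text = "the quick brown fox jumps over the lazy dog"
found = near("quick", "jumps").search(text)
```

Keep in mind that at least one word is required in between, so near("lazy", "dog") finds nothing in the sample text.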
Let's see what happens when you apply this regex to xxxxxxxxxxy. The first x+ will match all 10 x characters. The second
x+ fails. The first x+ then backtracks to 9 matches, and the second one picks up the remaining x. The group has now
matched once. The group repeats, but fails at the first x+. Since one repetition was sufficient, the group matches. y matches y
and an overall match is found. The regex is declared functional, the code is shipped to the customer, and his computer
explodes. Almost.
The above regex turns ugly when the y is missing from the subject string. When y fails, the regex engine backtracks. The
group has one iteration it can backtrack into. The second x+ matched only one x, so it can't backtrack. But the first x+ can
give up one x. The second x+ promptly matches xx. The group again has one iteration, fails the next one, and the y fails.
Backtracking again, the second x+ now has one backtracking position, reducing itself to match x. The group tries a second
iteration. The first x+ matches but the second is stuck at the end of the string. Backtracking again, the first x+ in the group's
first iteration reduces itself to 7 characters. The second x+ matches xxx. Failing y, the second x+ is reduced to xx and then
x. Now, the group can match a second iteration, with one x for each x+. But this (7,1),(1,1) combination fails too. So it goes to
(6,4) and then (6,2),(1,1) and then (6,1),(2,1) and then (6,1),(1,2) and then I think you start to get the drift.
If you try this regex on a 10x string in a debugger, it'll take 2558 steps to figure out the final y is missing. For an 11x string, it
needs 5118 steps. For 12, it takes 10238 steps. Clearly we have an exponential complexity of O(2^n) here. At 21x the
debugger bows out at 2.8 million steps, diagnosing a bad case of catastrophic backtracking.
Some regex engines (like .NET) will keep going forever, while others will crash with a stack overflow (like Perl, before version
5.10). Stack overflows are particularly nasty on Windows, since they tend to make your application vanish without a trace or
explanation. Be very careful if you run a web service that allows users to supply their own regular expressions. People with
little regex experience have surprising skill at coming up with exponentially complex regular expressions.
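The usual remedy is to rewrite the regex so that nested quantifiers can no longer match the same text in many different ways. The Python sketch below assumes that the regex under discussion is (x+x+)+y, which is what the walkthrough above describes. It compares that regex with the plain xx+y, which accepts exactly the same strings (two or more x followed by y) but offers only one way to divide up the x characters. The subjects are kept short on purpose so that the slow regex still finishes quickly.

```python
import re

SLOW = re.compile(r"(x+x+)+y")  # nested quantifiers: exponential when y is missing
FAST = re.compile(r"xx+y")      # same set of matches, linear backtracking


def same_verdict(subject):
    """Return True if both regexes agree on whether subject contains a match."""
    return (SLOW.search(subject) is None) == (FAST.search(subject) is None)


agreement = all(
    same_verdict("x" * count + suffix)
    for count in range(13)
    for suffix in ("", "y")
)
```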
At first sight, this regex looks like it should do the job just fine. The lazy dot and comma match a single comma-delimited field,
and the {11} skips the first 11 fields. Finally, the P checks if the 12th field indeed starts with P. In fact, this is exactly what will
happen when the 12th field indeed starts with a P.
The problem rears its ugly head when the 12th field does not start with a P. Let's assume here that the string is
1,2,3,4,5,6,7,8,9,10,11,12,13. At that point, the regex engine will backtrack. It will backtrack to the point where
^(.*?,){11} had consumed 1,2,3,4,5,6,7,8,9,10,11, giving up the last match of the comma. The next token is
again the dot. The dot matches a comma. The dot matches the comma! However, the comma does not match the 1 in the
12th field, so the dot continues until the 11th iteration of .*?, has consumed 11,12,. You can already see the root of the
problem: the part of the regex (the dot) matching the contents of the field also matches the delimiter (the comma). Because of
the double repetition (star inside {11}), this leads to a catastrophic amount of backtracking.
The regex engine now checks whether the 13th field starts with a P. It does not. Since there is no comma after the 13th field,
the regex engine can no longer match the 11th iteration of .*?,. But it does not give up there. It backtracks to the 10th
iteration, expanding the match of the 10th iteration to 10,11,. Since there is still no P, the 10th iteration is expanded to
10,11,12,. Reaching the end of the string again, the same story starts with the 9th iteration, subsequently expanding it to
9,10,, 9,10,11,, 9,10,11,12,. But between each expansion, there are more possibilities to be tried. When the 9th
iteration consumes 9,10,, the 10th could match just 11, as well as 11,12,. Continuously failing, the engine backtracks to
the 8th iteration, again trying all possible combinations for the 9th, 10th, and 11th iterations.
You get the idea: the possible number of combinations that the regex engine will try for each line where the 12th field does not
start with a P is huge. All this would take a long time if you ran this regex on a large CSV file where most rows don't have a P
at the start of the 12th field.
In our example, the solution is to be more exact about what we want to match. We want to match 11 comma-delimited fields.
The fields must not contain commas. So the regex becomes: ^([^,\r\n]*,){11}P. If the P cannot be found, the engine
will still backtrack. But it will backtrack only 11 times, and each time the [^,\r\n] is not able to expand beyond the comma,
forcing the regex engine to the previous one of the 11 iterations immediately, without trying further options.
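Applied in Python, the improved regex might be used like this to pick out the rows of a CSV file whose 12th field starts with a P. The function name and the sample rows are made up for the illustration.

```python
import re

# Eleven fields that cannot contain the delimiter, then a P.
TWELFTH_FIELD_P = re.compile(r"^([^,\r\n]*,){11}P")


def rows_with_p(csv_text):
    """Return the lines whose 12th comma-delimited field starts with P."""
    return [line for line in csv_text.splitlines() if TWELFTH_FIELD_P.match(line)]


sample = (
    "1,2,3,4,5,6,7,8,9,10,11,12,13\n"
    "a,b,c,d,e,f,g,h,i,j,k,Paris,m\n"
    "P,2,3\n"
)
selected = rows_with_p(sample)
```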
Let's see how ^(?>(.*?,){11})P is applied to 1,2,3,4,5,6,7,8,9,10,11,12,13. The caret matches at the start of
the string and the engine enters the atomic group. The star is lazy, so the dot is initially skipped. But the comma does not
match 1, so the engine backtracks to the dot. That's right: backtracking is allowed here. The star is not possessive, and is not
immediately enclosed by an atomic group. That is, the regex engine did not cross the closing round bracket of the atomic
group. The dot matches 1, and the comma matches too. {11} causes further repetition until the atomic group has matched
1,2,3,4,5,6,7,8,9,10,11,.
Now, the engine leaves the atomic group. Because the group is atomic, all backtracking information is discarded and the
group is now considered a single token. The engine now tries to match P to the 1 in the 12th field. This fails.
So far, everything happened just like in the original, troublesome regular expression. Now comes the difference. P failed to
match, so the engine backtracks. The previous token is an atomic group, so the group's entire match is discarded and the
engine backtracks further to the caret. The engine now tries to match the caret at the next position in the string, which fails.
The engine walks through the string until the end, and declares failure. Failure is declared after 30 attempts to match the
caret, and just one attempt to match the atomic group, rather than after 30 attempts to match the caret and a huge number of
attempts to try all combinations of both quantifiers in the regex.
That is what atomic grouping and possessive quantifiers are for: efficiency by disallowing backtracking. The most efficient
regex for our problem at hand would be ^(?>((?>[^,\r\n]*),){11})P, since possessive, greedy repetition of the star is
faster than a backtracking lazy dot. If possessive quantifiers are available, you can further reduce clutter by writing
^(?>([^,\r\n]*+,){11})P
Suppose you want to use a regular expression to match a complete HTML file, and extract the basic parts from the file. If you
know the structure of HTML files, the following regex is very straight-forward:
<html>.*?<head>.*?<title>.*?</title>.*?</head>.*?<body[^>]*>.*?</body>.*?</html>
With the "dot matches newlines" or "single line" matching mode turned on, it will work just fine on valid HTML files.
Unfortunately, this regular expression won't work nearly as well on an HTML file that misses some of the tags. The worst case
is a missing </html> tag at the end of the file. When </html> fails to match, the regex engine backtracks, giving up the
match for </body>.*?. It will then further expand the lazy dot before </body>, looking for a second closing </body> tag in
the HTML file. When that fails, the engine gives up <body[^>]*>.*?, and starts looking for a second opening
<body[^>]*> tag all the way to the end of the file. Since that also fails, the engine proceeds looking all the way to the end of
the file for a second closing head tag, a second closing title tag, etc.
If you run this regex in a debugger, the output will look like a sawtooth. The regex matches the whole file, backs up a little,
matches the whole file again, backs up some more, backs up yet some more, matches everything again, etc. until each of the
7 .*? tokens has reached the end of the file. The result is that this regular expression has a worst case complexity of N^7. If you double
the length of the HTML file with the missing </html> tag by appending text at the end, the regular expression will take 128
times (2^7) as long to figure out the HTML file isn't valid. This isn't quite as disastrous as the 2^N complexity of our first
example, but will lead to very unacceptable performance on larger invalid files.
In this situation, we know that each of the literal text blocks in our regular expression (the HTML tags, which function as
delimiters) will occur only once in a valid HTML file. That makes it very easy to package each of the lazy dots (the delimited
content) in an atomic group.
<html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)(?>.*?</head>)
(?>.*?<body[^>]*>)(?>.*?</body>).*?</html> (all on one line) will match a valid HTML file in the same number of
steps as the original regex. The gain is that it will fail on an invalid HTML file almost as fast as it matches a valid one. When
</html> fails to match, the regex engine backtracks, giving up the match for the last lazy dot. But then, there's nothing
further to backtrack to. Since all of the lazy dots are in an atomic group, the regex engine has discarded their backtracking
positions. The groups function as a "do not expand further" roadblock. The regex engine is forced to announce failure
immediately.
You've no doubt noticed that each atomic group also contains an HTML tag after the lazy dot. This is critical. We do allow the
lazy dot to backtrack until its matching HTML tag is found.
When .*?</body> is processing Last paragraph</p></body>, the </ regex tokens will match </ in </p>. However, b
will fail to match p. At that point, the regex engine will backtrack and expand the lazy dot to include </p>. Since the regex engine hasn't
left the atomic group yet, it is free to backtrack inside the group. Once </body> has matched, and the regex engine leaves
the atomic group, it discards the lazy dot's backtracking positions. Then it can no longer be expanded.
Essentially, what we've done is to bind a repeated regex token (the lazy dot to match HTML content) to the non-repeated
regex token that follows it (the literal HTML tag). Since anything, including HTML tags, can appear between the HTML tags in
our regular expression, we cannot use a negated character class instead of the lazy dot to prevent the delimiting HTML tags
from being matched as HTML content. But we can and did achieve the same result by combining each lazy dot and the HTML
tag following it into an atomic group. As soon as the HTML tag is matched, the lazy dot's match is locked down. This ensures
that the lazy dot will never match the HTML tag that should be matched by the literal HTML tag in the regular expression.
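A minimal Python sketch of this binding follows. Python's re module only gained atomic groups in version 3.11, so the sketch assumes a common workaround instead: a lookahead that captures, followed by a backreference, (?=(...))\1. That behaves like (?>...) because the engine does not backtrack into a lookahead once it has succeeded. The helper names and the two sample documents are made up for this illustration.

```python
import re

# Delimiting tags from the regex above, in the order they must appear.
TAGS = ["<head>", "<title>", "</title>", "</head>", "<body[^>]*>", "</body>"]

def emulated_atomic(pattern, group_number):
    # (?=(pattern))\N acts like (?>pattern): the lookahead is never
    # re-entered, and the backreference consumes exactly what it captured.
    return "(?=(" + pattern + "))\\" + str(group_number)

def build_html_regex():
    parts = ["<html>"]
    for number, tag in enumerate(TAGS, start=1):
        # Each lazy dot is locked together with the tag that follows it.
        parts.append(emulated_atomic(".*?" + tag, number))
    parts.append(".*?</html>")
    # DOTALL lets the lazy dots run across line breaks.
    return re.compile("".join(parts), re.DOTALL)

HTML_FILE = build_html_regex()

valid = ("<html><head><title>Test</title></head>"
         "<body class=\"main\"><p>Last paragraph</p></body></html>")
invalid = valid.replace("</html>", "")  # closing tag is missing

print(HTML_FILE.search(valid) is not None)    # the valid file matches
print(HTML_FILE.search(invalid) is not None)  # the invalid file is rejected
```

In a flavor with native atomic groups, the regex shown in the text is the simpler choice. The workaround only matters where (?>...) is unavailable.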
It's annoying when catastrophic backtracking happens on your PC. But when it happens in a server application with multiple
concurrent users, it can really be catastrophic. Too many users running regexes that exhibit catastrophic backtracking will
bring down the whole server. And "too many" need only be as few as the number of CPU cores in the server.
If the server accepts regexes from the user, then the user can easily provide one that exhibits catastrophic backtracking on
any subject. If the server accepts subject data from the user, then the user may be able to provide subjects that trigger
catastrophic backtracking in regexes used by the server, if those regexes are predisposed to catastrophic backtracking. When
the user can do either of those things, the server is susceptible to regular expression denial of service (ReDoS). When enough
users (or one actor masquerading as many users) provide malicious regexes and/or subjects to match against, the server will
be spending nearly all its CPU cycles on trying to match those regexes.
Handling Regexes Provided by The User
If your application allows the user to provide their own regexes, then your only real defense is to use a text-directed regex
engine. Those engines don't backtrack. But they also don't support features like backreferences that depend on backtracking
and that many users expect.
If your application uses a backtracking engine with user-provided regexes, then you can only mitigate the consequences of
catastrophic backtracking. And you'll really need to do so. It's very easy for people with limited regex skills to accidentally craft
one that degenerates into catastrophic backtracking.
You'll need to use a regex engine that aborts the match attempt when catastrophic backtracking occurs rather than running
until the OS kills the script or crashing. You can easily test this. When the regex (x\w{1,10})+y is attempted on an ever
growing string of x's there should be a reasonable limit on how long it takes for the regex engine to give up. Ideally your
engine will allow you to configure this limit for your purposes. The .NET engine, for example, allows you to pass a timeout to
the Regex() constructor. The PCRE engine allows you to set recursion limits. The lower your limits the better the protection
against ReDoS, but the more likely you'll be aborting regexes that would find a valid match given slightly more time.
Permutations occur when you give the regular expression a choice. You can do this with alternation and with quantifiers. So
these are the regex tokens you need to inspect. Possessive quantifiers are excepted, because they never backtrack.
Alternation
Alternatives must be mutually exclusive. If multiple alternatives can match the same text then the engine will try both if the
remainder of the regex fails. If the alternatives are in a group that is repeated, you have catastrophic backtracking.
A classic example is (.|\s)* to match any amount of any text when the regex flavor does not have a "dot matches line
breaks" mode. If this is part of a longer regex then a subject string with a sufficiently long run of spaces will break the regex.
The engine will try every possible combination of the spaces being matched by . or \s. For example, 3 spaces could be
matched as ..., ..\s, .\s., .\s\s, \s.., \s.\s, \s\s., or \s\s\s. That's 2^N permutations. The fix is to use (.|\n)*
to make the alternatives mutually exclusive. Even better is to be more specific about which characters are really allowed, such
as [\r\n\t\x20-\x7E] for ASCII printables, tabs, and line breaks.
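The permutation count is easy to reproduce. The following Python sketch enumerates the ways a run of spaces can be divided between the two overlapping alternatives, then checks that the two suggested replacements still match text containing line breaks. The function name and sample string are invented for the illustration, and the group is written as non-capturing purely for convenience.

```python
import re
from itertools import product

def token_permutations(run_length):
    # Each space in the run can be consumed by either alternative.
    choices = product([".", r"\s"], repeat=run_length)
    return ["".join(choice) for choice in choices]

perms = token_permutations(3)
print(len(perms))  # 2**3 = 8
print(perms)

# The mutually exclusive rewrite and the explicit character class
# both still match text that spans line breaks.
sample = "two words\nsecond line"
exclusive = re.compile(r"(?:.|\n)*")
explicit = re.compile(r"[\r\n\t\x20-\x7E]*")
print(exclusive.fullmatch(sample) is not None)
print(explicit.fullmatch(sample) is not None)
```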
It is acceptable for two alternatives to partially match the same text. [0-9]*\.[0-9]+|[0-9]+ is perfectly fine to match a
floating point number with optional integer part and optional fraction. Though a subject that consists of only digits is initially
matched by [0-9]* and does cause some backtracking when \. fails, this backtracking never becomes catastrophic. Even if
you put this inside a group in a longer regex, the group only does a minimal amount of backtracking. (But the group mustn't
have a quantifier or it will fall foul of the rule for nested quantifiers.)
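As a quick check of that floating point regex, this small Python sketch runs it against a few made-up inputs. The digits-only input is the case that backtracks once, when \. fails, before the second alternative takes over.

```python
import re

number = re.compile(r"[0-9]*\.[0-9]+|[0-9]+")

# Integer part and fraction, fraction only, integer only.
for text in ["3.14", ".5", "42"]:
    print(text, number.fullmatch(text) is not None)

# A trailing dot without digits is not a complete number.
print("7.", number.fullmatch("7.") is not None)
```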
Quantifiers in Sequence
Quantified tokens that are in sequence must either be mutually exclusive with each other or be mutually exclusive with what
comes between them. Otherwise both can match the same text and all combinations of the two quantifiers will be tried when
the remainder of the regex fails to match. A token inside a group with alternation is still in sequence with any token before or
after the group.
A classic example is a.*?b.*?c to match 3 things with "anything" between them. When c can't be matched the first .*?
expands character by character until the end of the line or file. For each expansion the second .*? expands character by
character to match the remainder of the line or file. The fix is to realize that you can't have "anything" between them. The first
run needs to stop at b and the second run needs to stop at c. With single characters a[^b]*+b[^c]*+c is an easy solution.
Since we now stop at the delimiter, we can use possessive quantifiers to further increase performance.
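The difference can be sketched in Python as follows. Possessive quantifiers are only available in Python's re module from version 3.11, so this sketch assumes the plain greedy form a[^b]*b[^c]*c, which already stops each run at its delimiter. The subject strings are invented for the example.

```python
import re

lazy = re.compile(r"a.*?b.*?c")
exclusive = re.compile(r"a[^b]*b[^c]*c")

# Both find the same match when the delimiters are present.
subject = "a--b--c"
print(lazy.search(subject).group())
print(exclusive.search(subject).group())

# Without a c, the lazy version tries every combination of the two
# lazy dots, while the negated classes cannot trade characters.
no_c = "a" + "b" * 300
print(lazy.search(no_c))
print(exclusive.search(no_c))
```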
For a more complex example and solution, see matching a complete HTML file in the previous topic. This explains how you
can use atomic grouping to prevent backtracking in more complex situations.
Nested Quantifiers
A group that contains a token with a quantifier must not have a quantifier of its own unless the quantified token inside the
group can only be matched with something else that is mutually exclusive with it. That ensures that there is no way that fewer
iterations of the outer quantifier with more iterations of the inner quantifier can match the same text as more iterations of the
outer quantifier with fewer iterations of the inner quantifier.
The regex (x\w{1,10})+y matches a sequence of one or more codes that start with an x followed by 1 to 10 word
characters, all followed by a y. All is well as long as the y can be matched. When the y is missing, backtracking occurs. If the
string doesn't have too many x's then backtracking happens very quickly. Things only turn catastrophic when the subject
contains a long sequence of x's. x and \w are not mutually exclusive. So the repeated group can match xxxx in one iteration as
x\w\w\w or in two iterations as x\wx\w.
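To get a feel for how fast the number of combinations grows, this Python sketch counts the ways the repeated group can divide a run of x's into iterations. It is a simplified model: it counts only complete divisions of the run, not every partial attempt a real engine makes, and the function name is made up.

```python
def ways_to_split(run_length):
    # ways[n] = number of ways to cut a run of n x's into iterations.
    # Each iteration is x plus 1 to 10 more characters: 2 to 11 in total.
    ways = [0] * (run_length + 1)
    ways[0] = 1  # nothing left to split
    for n in range(1, run_length + 1):
        ways[n] = sum(ways[n - size] for size in range(2, 12) if size <= n)
    return ways[run_length]

# A run of 4 x's has the two divisions named in the text.
print(ways_to_split(4))

# The count keeps growing rapidly with the length of the run.
for length in (8, 16, 32):
    print(length, ways_to_split(length))
```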
To solve this, you first need to consider whether x and y should be allowed in the 1 to 10 characters that follow it. Excluding
the x eliminates most backtracking. What's left won't be catastrophic. You could exclude it with character class subtraction as
in (x[\w-[x]]{1,10})+y or with character class intersection as in (x[\w&&[^x]]{1,10})+y. If you don't have those
features you'll need to spell out the characters you want to allow: (x[a-wyz0-9_]{1,10})+y.
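Python's re module has neither class subtraction nor class intersection, so it makes a convenient test bed for the spelled-out class. This sketch, with invented subject strings, shows that the regex still matches a valid sequence and gives up quickly on a long run of x's.

```python
import re

codes = re.compile(r"(x[a-wyz0-9_]{1,10})+y")

# Two codes followed by the closing y.
print(codes.search("xabcxdefy").group())

# A long run of x's fails fast: after each x the class needs at
# least one character that is not an x, and there is none.
print(codes.search("x" * 5000))
```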
If the x should be allowed then your only solution is to disallow the y in the same way. Then you can make the group atomic or
the quantifier possessive to eliminate the backtracking.
If both x and y should be allowed in the sequences of 1 to 10 characters, then there is no regex-only solution. You can't make
the group atomic or the quantifier possessive as then \w{1,10} matches the final y which causes y to fail.
Make groups that contain alternatives atomic as much as you can. Use \b(?>one|two|three)\b to match a list of words.
Make quantifiers possessive as much as you can. If a repeated token is mutually exclusive with what follows, enforce that with
a possessive quantifier.
Use (negated) character classes instead of the dot. It's rare that you really want to allow "anything". A double-quoted string,
for example, can't contain "anything". It can't contain unescaped double quotes. So use "[^"\n]*+" instead of ".*?".
Though both find exactly the same matches when used on their own, the latter can lead to catastrophic backtracking when
pasted into a longer regex. The former never backtracks regardless of anything else the regex needs to match.
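The claim that both regexes find the same matches on their own can be checked with a short Python sketch. The possessive *+ needs Python 3.11 or later, so this sketch assumes the plain greedy quantifier, which does not change the matches found. The subject string is made up.

```python
import re

subject = 'abc "def" "ghi" jkl'

lazy_dot = re.compile(r'".*?"')
negated = re.compile(r'"[^"\n]*"')

print(lazy_dot.findall(subject))
print(negated.findall(subject))
```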
Why Use Regexes at All?
Some would certainly argue that the above only shows that regexes are dangerous and that they should not be used. They'll
then force developers to do the job with procedural code. Procedural code to match non-trivial patterns quickly becomes long
and complicated, increasing the chance of bugs and the cost to develop and maintain the code. Many pattern matching
problems are naturally solved with recursion. And when a large subject string can't be matched, runaway recursion leads to
stack overflows that crash the application.
Developers need to learn to correctly use their tools. This is no different for regular expressions than for anything else.
Let's say you want to match a tag like !abc! or !123!. Only these two are possible, and you want to capture the abc or 123
to figure out which tag you got. That's easy enough: !(abc|123)! will do the trick.
Now let's say that the tag can contain multiple sequences of abc and 123, like !abc123! or !123abcabc!. The quick and
easy solution is !(abc|123)+!. This regular expression will indeed match these tags. However, it no longer meets our
requirement to capture the tag's label into the capturing group. When this regex matches !abc123!, the capturing group
stores only 123. When it matches !123abcabc!, it only stores abc.
This is easy to understand if we look at how the regex engine applies !(abc|123)+! to !abc123!. First, ! matches !. The
engine then enters the capturing group. It makes note that capturing group #1 was entered when the engine reached the
position between the first and second character in the subject string. The first token in the group is abc, which matches abc.
A match is found, so the second alternative isn't tried. (The engine does store a backtracking position, but this won't be used
in this example.) The engine now leaves the capturing group. It makes note that capturing group #1 was exited when the
engine reached the position between the 4th and 5th characters in the string.
After having exited from the group, the engine notices the plus. The plus is greedy, so the group is tried again. The engine
enters the group again, and takes note that capturing group #1 was entered between the 4th and 5th characters in the string.
It also makes note that since the plus is not possessive, it may be backtracked. That is, if the group cannot be matched a
second time, that's fine. In this backtracking note, the regex engine also saves the entrance and exit positions of the group
during the previous iteration of the group.
abc fails to match 123, but 123 succeeds. The group is exited again. The exit position between characters 7 and 8 is stored.
The plus allows for another iteration, so the engine tries again. Backtracking info is stored, and the new entrance position for
the group is saved. But now, both abc and 123 fail to match !. The group fails, and the engine backtracks. While
backtracking, the engine restores the capturing positions for the group. Namely, the group was entered between characters 4
and 5, and exited between characters 7 and 8.
The engine proceeds with !, which matches !. An overall match is found. The overall match spans the whole subject string.
The capturing group spans characters 5, 6 and 7, or 123. Backtracking information is discarded when a match is found, so
there's no way to tell after the fact that the group had a previous iteration that matched abc. (The only exception to this is the
.NET regex engine, which does preserve backtracking information for capturing groups after the match attempt.)
The solution to capturing abc123 in this example should be obvious now: the regex engine should enter and leave the group
only once. This means that the plus should be inside the capturing group rather than outside. Since we do need to group the
two alternatives, we'll need to place a second capturing group around the repeated group: !((abc|123)+)!. When this
regex matches !abc123!, capturing group #1 will store abc123, and group #2 will store 123. Since we're not interested in
the inner group's match, we can optimize this regular expression by making the inner group non-capturing:
!((?:abc|123)+)!.
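The three variants behave as described in Python's re module, which serves here only as one convenient engine to try them in:

```python
import re

subject = "!abc123!"

# Repeated capturing group: only the last iteration is kept.
repeated = re.fullmatch(r"!(abc|123)+!", subject)
print(repeated.group(1))  # 123

# Capturing group around the repeated group: the whole label is kept.
wrapped = re.fullmatch(r"!((abc|123)+)!", subject)
print(wrapped.group(1), wrapped.group(2))  # abc123 123

# Same, with the inner group made non-capturing.
optimized = re.fullmatch(r"!((?:abc|123)+)!", "!123abcabc!")
print(optimized.group(1))  # 123abcabc
```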
Mixing Unicode and 8-bit Character Codes
Internally, computers deal with numbers, not with characters. When you save a text file, each character is mapped to a
number, and the numbers are stored on disk. When you open a text file, the numbers are read and mapped back to
characters. When processing text with a regular expression, the regular expression needs to use the same mapping as you
used to create the file or string you want the regex to process.
When you simply type in all the characters in your regular expression, you normally don't have anything to worry about. The
application or programming library that provides the regular expression functionality will know which text encoding your
subject string uses, and process it accordingly. So if you want to search for the euro currency symbol, and you have a
European keyboard, just press AltGr+E. Your regex € will find all euro symbols just fine.
But you can't press AltGr+E on a US keyboard. Or perhaps you like your source code to be 7-bit clean (i.e. plain ASCII). In
those cases, you'll need to use a character escape in your regular expression.
If your regular expression engine supports Unicode, simply use the Unicode escape \u20AC (most Unicode flavors) or
\x{20AC} (Perl and PCRE). U+20AC is the Unicode code point for the euro symbol. It will always match the euro symbol,
whether your subject string is encoded in UTF-8, UTF-16, UCS-2 or whatever. Even when your subject string is encoded with
a legacy 8-bit code page, there's no confusion. You may need to tell the application or regex engine what encoding your file
uses. But \u20AC is always the euro symbol.
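As a small illustration, Python's re module accepts the \u20AC escape in text patterns, so the regex itself can stay plain ASCII. The subject string is invented for this sketch.

```python
import re

# The pattern source contains only ASCII characters.
euro = re.compile(r"\u20AC")
print(all(ord(ch) < 128 for ch in euro.pattern))

# It still matches the euro symbol in a Unicode subject string.
match = euro.search("Price: 5 \u20ac")
print(match is not None)
```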
Most Unicode regex engines also support the 8-bit character escape \xFF. However, its use is not recommended. For
characters \x00 through \x7F, there's usually no trouble. The first 128 Unicode code points are identical to the ASCII table
that most 8-bit code pages are based on.
But the interpretation of \x80 and above may vary. A pure Unicode engine will treat this identically to \u0080, which represents
a Latin-1 control code. But what most people expect is that \x80 matches the euro symbol, as that occupies position 80h in
all Windows code pages. And it will when using an 8-bit regex engine if your text file is encoded using a Windows code page.
Since most people expect \x80 to be treated as an 8-bit character rather than the Unicode code point \u0080, some
Unicode regex engines do exactly that. Some are hard-wired to use a particular code page, say Windows 1252 or your
computer's default code page, to interpret 8-bit character codes.
Other engines will let it depend on the input string. Unicode code point U+0080 is a Latin-1 control code, while Windows 1252
character index 80h is the euro symbol. In reverse, if you type in the euro symbol in a text editor, saving it as UTF-16LE will
save two bytes AC 20, while saving as Windows 1252 will give you one byte 80.
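The byte values are easy to verify with Python's built-in codecs. This sketch assumes that the codec names utf-16-le, cp1252 and latin-1 correspond to the encodings discussed here.

```python
euro = "\u20ac"

# Saving the euro symbol: two bytes in UTF-16LE, one in Windows 1252.
print(euro.encode("utf-16-le").hex())  # ac20
print(euro.encode("cp1252").hex())     # 80

# Reading byte 80h: the euro in Windows 1252, a control code in Latin-1.
print(b"\x80".decode("cp1252") == euro)
print(b"\x80".decode("latin-1") == "\u0080")
```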
If you find the above confusing, simply don't use \x80 through \xFF with a regex engine that supports Unicode.
When crafting a regular expression for an 8-bit engine, you'll have to take into account which character set or code page you'll
be working with. 8-bit regex engines just don't care. If you type \x80 into your regex, it will match any byte 80h, regardless of
what that byte represents. That'll be the euro symbol in a Windows 1252 text file, a control code in a Latin-1 file, and the digit
zero in an EBCDIC file.
Even for literal characters in your regex, you'll have to match up the encoding you're using in the regular expression with the
subject encoding. If your application is using the Latin-1 code page, and you use the regex À, it'll match Ŕ when you search
through a Latin-2 text file. The application would duly display this as À on the screen, because it's using the wrong code page.
This problem is not really specific to regular expressions. You'll encounter it any time you're working with files and applications
that use different 8-bit encodings.
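The Latin-1 versus Latin-2 mismatch can be reproduced with a bytes pattern in Python, which behaves like an 8-bit engine in this respect. The variable names are made up, and the codec name iso8859-2 is assumed to stand for Latin-2.

```python
import re

# The regex À as an application using Latin-1 would encode it: byte C0.
pattern_bytes = "\u00c0".encode("latin-1")
pattern = re.compile(re.escape(pattern_bytes))

# A Latin-2 file containing Ŕ holds the very same byte.
latin2_file = "\u0154".encode("iso8859-2")
print(pattern_bytes.hex(), latin2_file.hex())
print(pattern.search(latin2_file) is not None)
```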
So when working with 8-bit data, open the actual data you're working with in a hex editor. See the bytes being used, and
specify those in your regular expression.
Where it gets really hairy is if you're processing Unicode files with an 8-bit engine. Let's go back to our text file with just a euro
symbol. When saved as little endian UTF-16 (called "Unicode" on Windows), an 8-bit regex engine will see two bytes AC 20
(remember that little endian reverses the bytes). When saved as UTF-8 (which has no endianness), our 8-bit engine will see
three bytes E2 82 AC. You'd need \xE2\x82\xAC to match the euro symbol in a UTF-8 file with an 8-bit regex engine.
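Python's re module works as an 8-bit engine when given bytes patterns, which makes it a convenient way to try this out. The sample text is invented for the sketch.

```python
import re

euro_utf8 = re.compile(rb"\xE2\x82\xAC")

# The euro symbol in a UTF-8 file is the three bytes E2 82 AC.
utf8_bytes = "Total: 10 \u20ac".encode("utf-8")
print(euro_utf8.search(utf8_bytes) is not None)

# In a UTF-16LE file it is AC 20, so the UTF-8 regex finds nothing.
utf16_bytes = "\u20ac".encode("utf-16-le")
print(euro_utf8.search(utf16_bytes))
print(re.search(rb"\xAC\x20", utf16_bytes) is not None)
```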
Regular Expression Quick Syntax Reference
This quick reference is a summary of all the regex syntax that is listed in the full reference tables, without any explanation. You
can use this table if you've seen some syntax in somebody else's regex and you have no idea what feature that syntax is for.
Check the corresponding tutorial section to learn more about the syntax. Since the full reference tables cover a variety of
regex flavors, this quick reference may have multiple entries for the same syntax with references to different sections in the
tutorial if different regex flavors use the same syntax for different features.
If you already know the feature you want but forgot which syntax to use, look up the feature in the full regex reference section
instead.
Syntax: \o{7777} where 7777 is any octal number
Feature: Octal escape
Syntax: \uFFFF where FFFF are 4 hexadecimal digits
Feature: Unicode code point
Syntax: \pL where L is a Unicode category
Feature: Unicode category
Syntax: \u{FFFF} where FFFF are 1 to 4 hexadecimal digits
Feature: Unicode code point
Syntax: \p{Block}
Feature: Unicode block
Syntax: \xFFFF where FFFF are 4 hexadecimal digits
Feature: Unicode code point
Syntax: \p{InBlock}
Feature: Unicode block
Syntax: \x{FFFF} where FFFF are 1 to 4 hexadecimal digits
Feature: Unicode code point
Syntax: (?(+1)then|else) where +1 is a positive integer and then and else are any valid regexes
Feature: Forward conditional
Syntax: (?P>name) where name is the name of a capturing group
Feature: Named subroutine call
Syntax: (?&name) where name is the name of a capturing group
Feature: Named subroutine call
Syntax: (?-1) where -1 is a negative integer
Feature: Relative subroutine call
Syntax: Any character except ^-]\
Feature: Literal character
Syntax: [base&&[intersect]]
Feature: Character class intersection
The syntax reference tables have three columns that explain each regular expression feature:
Syntax: The actual regex syntax for this feature. If the syntax is fixed, it is simply shown as such. If the syntax has variable elements, the syntax is described.
Flavor Comparison
The flavor comparison tables indicate whether various regex flavors support a particular feature. There are
many possible indicators.
3.0: Version 3.0 and all later versions of this flavor support this feature. Earlier versions do not support it.
4.0 only: Only version 4.0 supports this feature. Earlier and later versions do not support it.
2.0 - 2.9: Only versions 2.0 through 2.9 support this feature. Earlier and later versions do not support it.
Unicod: This feature works with Unicode characters in all versions of this flavor.
code: This feature works with the characters in the active code page in all versions of this flavor.
ASCII: This feature works with ASCII characters only in all versions of this flavor.
3.0 Unicod: This feature works with Unicode characters in versions 3.0 and later of this flavor. Earlier versions do not support it at all.
2.0 code: This feature works with the characters in the active code page in versions 2.0 and later of this flavor. Earlier versions do not support it at all.
string: The regex flavor does not support this syntax. But string literals in the programming language that this regex flavor is normally used with do support this syntax.
option: All versions of this regex flavor support this feature if you set a particular option or precede it with a particular mode modifier.
3.0 option: Version 3.0 and all later versions of this regex flavor support this feature if you set a particular option or precede it with a particular mode modifier. Earlier versions either do not support the syntax at all or do not support the mode modifier to change the behavior of the syntax to what the feature describes.
n/a: This feature is not applicable to this regex flavor. Features that describe the behavior of certain syntax introduced earlier in the reference table show n/a for flavors that do not support that syntax at all.
N: No version of this flavor supports this feature. No indication is given as to what this syntax actually does. The same syntax may be used for a different feature which is indicated elsewhere in the reference table. Or the syntax may trigger an error or it may be interpreted as plain text.
fail: The syntax is recognized by the flavor and regular expressions using it work, but this particular regex token always fails to match. The regex can only find matches if this token is made optional by alternation or a quantifier.
3.0 fail: Version 3.0 and all later versions of this regex flavor recognize the syntax, but always fail to match this regex token. Earlier versions either don't recognize the syntax or treat it as a syntax error.
2.4-3.4 fail: Versions 2.4 through 3.4 recognize the syntax, but always fail to match this regex token. Earlier and later versions either don't recognize the syntax or treat it as a syntax error.
ign: The syntax is recognized by the flavor but it does not do anything useful. This particular regex token always finds a zero-length match.
error: The syntax is recognized by the flavor but it is treated as a syntax error.
For the .NET flavor, some features are indicated with "ECMA" or "non ECMA". That means the feature is only
supported when RegexOptions.ECMAScript is set or is not set. Everything that applies to .NET 2.0 or later
also applies to any version of .NET Core. The Visual Studio IDE uses the non-ECMA .NET flavor starting with
VS 2012.
For the std::regex and boost::regex flavors there are additional indicators ECMA, basic, extend, grep, egrep,
and awk. When one or more of these appear, that means that the feature is only supported if you specify one of
these grammars when compiling your regular expression. Features with Unicod indicators match Unicode
characters when using std::wregex or boost::wregex on wide character strings. In the replacement string
reference, the additional indicators are sed and default. When either one appears, the feature is only supported
when you either pass or don't pass match_flag_type::format_sed to regex_replace(). For boost,
there is one more replacement indicator "all" that indicates the feature is only supported when you pass
match_flag_type::format_all to regex_replace().
When this legend says "all versions" or "no version", that means all or none of the versions of each flavor that
are covered by the reference tables:
.NET 1.0–4.7.2
Java 4–8
Perl 5.8–5.26
PCRE 4.0–8.42
PCRE2 10.00–10.23
PHP 5.0.0–7.1.17
Delphi XE–XE8 & 10–10.2; TRegEx only; also applies to C++Builder XE–XE8 & 10–10.2
R 2.14.0–3.4.4
XRegExp 2.0.0–3.0.0
Python 2.4–3.6
Ruby 1.8–2.5
GNU BRE
GNU ERE
XML 1.0–1.1
XPath 2.0–3.1
Characters
Syntax: Any character except [\^$.|?*+()
Description: All characters except the listed special characters match a single instance of themselves.
Example: a matches a
Syntax: { and }
Description: { and } are literal characters, unless they're part of a valid regular expression token such as a quantifier {3}
Example: { matches {
Syntax: \ (backslash) followed by any of [\^$.|?*+(){}
Description: A backslash escapes special characters to suppress their special meaning.
Example: \+ matches +
Syntax: \Q...\E
Description: Matches the characters between \Q and \E literally, suppressing the meaning of special characters.
Example: \Q+-*/\E matches +-*/
Syntax: \xFF where FF are 2 hexadecimal digits
Description: Matches the character with the specified ASCII/ANSI value, which depends on the code page used. Can be used in character classes.
Example: \xA9 matches © when using the Latin-1 code page.
Syntax: \n, \r and \t
Description: Match an LF character, CR character and a tab character respectively. Can be used in character classes.
Example: \r\n matches a DOS/Windows CRLF line break.
Syntax: \R
Description: Matches any line break, including CRLF as a pair, CR only, LF only, form feed, vertical tab, and any Unicode line break
Example: \R{2} and \R\R cannot match \r\n
Syntax: \a, \b, \e, \f and \v
Description: Match a bell character (\x07), backspace character (\x08), escape character (\x1B), form feed (\x0C) and vertical tab (\x0B) respectively. Can be used in character classes.
Syntax: \cA through \cZ
Description: Match an ASCII character Control+A through Control+Z, equivalent to \x01 through \x1A. Can be used in character classes.
Example: \cM\cJ matches a DOS/Windows CRLF line break.
Syntax: \ca through \cz
Description: Match an ASCII character Control+A through Control+Z, equivalent to \x01 through \x1A. Can be used in character classes.
Example: \cm\cj matches a DOS/Windows CRLF line break.
Syntax: \o{7777} where 7777 is any octal number
Description: Matches the character at the specified position in the active code page
Example: \o{20254} matches € when using Unicode
Syntax: \1 through \7 and \01 through \07 (octal escape)
Description: Matches the character at the specified position in the ASCII table
Example: \7 and \07 matches the "bell" character
Syntax: \10 through \77 and \010 through \077 (octal escape)
Description: Matches the character at the specified position in the ASCII table
Example: \77 and \077 matches ?
Syntax: \100 through \177 and \0100 through \0177 (octal escape)
Description: Matches the character at the specified position in the ASCII table
Example: \100 and \0100 matches @
Syntax: \200 through \377 and \0200 through \0377 (octal escape)
Description: Matches the character at the specified position in the active code page
Example: \377 and \0377 matches ÿ when using the Latin-1 code page
Syntax: \400 through \777 (octal escape)
Description: Matches the character at the specified position in the active code page
Example: \777 matches ǿ when using Unicode
Syntax: [ (opening square bracket)
Description: Starts a character class. A character class matches a single character out of all the possibilities offered by the character class. Inside a character class, different rules apply. The rules in this section are only valid inside character classes. The rules outside this section are not valid in character classes, except for a few character escapes that are indicated with "can be used inside character classes".
Syntax: Any character except ^-]\
Description: All characters except the listed special characters add that character to the possible matches for the character class.
Example: [abc] matches a, b or c
Syntax: \ (backslash) followed by any of ^-]\
Description: A backslash escapes special characters to suppress their special meaning.
Example: [\^\]] matches ^ or ]
Syntax: - (hyphen) except immediately after the opening [
Description: Specifies a range of characters. (Specifies a hyphen if placed immediately after the opening [)
Example: [a-zA-Z0-9] matches any letter or digit
Syntax: ^ (caret) immediately after the opening [
Description: Negates the character class, causing it to match a single character not listed in the character class. (Specifies a caret if placed anywhere except after the opening [)
Example: [^a-d] matches x (any character except a, b, c or d)
Syntax: [base-[subtract]]
Description: Removes all characters in the subtract class from the base class.
Example: [a-z-[aeiuo]] matches a single letter that is not a vowel.
Syntax: [base&&[intersect]]
Description: Reduces the character class to the characters present in both base and intersect.
Example: [a-z&&[^aeiuo]] matches a single letter that is not a vowel.
Syntax: Any supported \p{…} syntax
Description: \p{…} syntax can be used inside character classes.
Example: [\p{Digit}\p{Lower}] matches one of 0 through 9 or a through z
Shorthand Character Classes
Syntax: \d, \w and \s
Description: Shorthand character classes matching digits, word characters (letters, digits, and underscores), and whitespace (spaces, tabs, and line breaks). Can be used inside and outside character classes.
Example: [\d\s] matches a character that is a digit or whitespace
Syntax: \D, \W and \S
Description: Negated versions of the above. Should be used only outside character classes. (Can be used inside, but that is confusing.)
Example: \D matches a character that is not a digit
Syntax: \l and \u
Description: Shorthand character classes matching all lowercase letters or all uppercase letters. Can be used inside and outside character classes.
Example: [\u\l] matches Aa but not aA.
Syntax: \v
Description: Shorthand character class that matches all vertical whitespace. Can be used inside and outside character classes.
Example: [\v] matches any single vertical whitespace character.
Syntax: \h
Description: Shorthand character class that matches all horizontal whitespace. Can be used inside and outside character classes.
Example: [\h] matches any single horizontal whitespace character.
Dot
Syntax: . (dot)
Description: Matches any single character except line break characters \r and \n. Most regex flavors have an option to make the dot match line break characters too.
Example: . matches x or (almost) any other character
Syntax: \N
Description: Matches any single character except line break characters, like the dot, but is not affected by any options that make the dot match all characters including line breaks.
Example: \N matches x or any other character that is not a line break
Word Boundaries
Syntax: \b
Description: Matches at the position between a word character (anything matched by \w) and a non-word character (anything matched by [^\w] or \W) as well as at the start and/or end of the string if the first and/or last characters in the string are word characters.
Example: .\b matches c in abc
Syntax: \B
Description: Matches at the position between two word characters (i.e. the position between \w\w) as well as at the position between two non-word characters (i.e. \W\W).
Example: \B.\B matches b in abc
Alternation
Syntax: | (pipe)
Description: Causes the regex engine to match either the part on the left side, or the part on the right side. Can be strung together into a series of options.
Example: abc|def|xyz matches abc, def or xyz
Syntax: | (pipe)
Description: The pipe has the lowest precedence of all operators. Use grouping to alternate only part of the regular expression.
Example: abc(def|xyz) matches abcdef or abcxyz
Anchors
^ (caret) Matches at the start of the string the regex pattern is applied to. ^. matches a in abc\ndef. Also
Matches a position rather than a character. Most regex flavors matches d in "multi-line" mode.
have an option to make the caret match after line breaks (i.e. at
the start of a line in a file) as well.
$ (dollar) Matches at the end of the string the regex pattern is applied to. .$ matches f in abc\ndef. Also
Matches a position rather than a character. Most regex flavors matches c in "multi-line" mode.
have an option to make the dollar match before line breaks (i.e.
at the end of a line in a file) as well. Also matches before the
very last line break if the string ends with a line break.
\A Matches at the start of the string the regex pattern is applied to. \A. matches a in abc
Matches a position rather than a character. Never matches after
line breaks.
\Z Matches at the end of the string the regex pattern is applied to. .\Z matches f in abc\ndef
Matches a position rather than a character. Never matches
before line breaks, except for the very last line break if the string
ends with a line break.
\z Matches at the end of the string the regex pattern is applied to. .\z matches f in abc\ndef
Matches a position rather than a character. Never matches
before line breaks.
\G Matches at the start of the match attempt. \G\w matches a, b, and c when
iterating over all matches in
abc def
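The caret and dollar examples above can be reproduced as follows. This is a sketch assuming Python's re module, where "multi-line" mode is selected with the re.MULTILINE flag.

```python
import re

text = "abc\ndef"

# Default mode: ^ and $ only match at the start and end of the string.
first_default = re.findall(r"^.", text)
last_default = re.findall(r".$", text)

# Multi-line mode: ^ also matches after, and $ before, each line break.
first_multiline = re.findall(r"^.", text, re.MULTILINE)
last_multiline = re.findall(r".$", text, re.MULTILINE)

# \A never matches after a line break, even in multi-line mode.
start_only = re.findall(r"\A.", text, re.MULTILINE)

print(first_default, last_default)      # ['a'] ['f']
print(first_multiline, last_multiline)  # ['a', 'd'] ['c', 'f']
print(start_only)                       # ['a']
```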
Quantifiers
? (question mark) Makes the preceding item optional. Greedy, so the optional item abc? matches ab or abc
is included in the match if possible.
?? Makes the preceding item optional. Lazy, so the optional item is abc?? matches ab or abc
excluded in the match if possible. This construct is often
excluded from documentation because of its limited use.
* (star) Repeats the previous item zero or more times. Greedy, so as ".*" matches "def" "ghi" in
many items as possible will be matched before trying abc "def" "ghi" jkl
permutations with less matches of the preceding item, up to the
point where the preceding item is not matched at all.
*? (lazy star) Repeats the previous item zero or more times. Lazy, so the ".*?" matches "def" in
engine first attempts to skip the previous item, before trying abc "def" "ghi" jkl
permutations with ever increasing matches of the preceding
item.
+ (plus) Repeats the previous item once or more. Greedy, so as many ".+" matches "def" "ghi" in
items as possible will be matched before trying permutations abc "def" "ghi" jkl
with less matches of the preceding item, up to the point where
the preceding item is matched only once.
+? (lazy plus) Repeats the previous item once or more. Lazy, so the engine ".+?" matches "def" in
first matches the previous item only once, before trying abc "def" "ghi" jkl
permutations with ever increasing matches of the preceding
item.
++ (possessive plus) Repeats the previous item once or more. Possessive, so as ".++" can never match anything
many items as possible will be matched, without trying any
permutations with less matches even if the remainder of the
regex fails.
{n} where n is an integer >= 1 Repeats the previous item exactly n times. a{3} matches aaa
{n,m} where n >= 0 and m >= Repeats the previous item between n and m times. Greedy, so a{2,4} matches aaaa, aaa or
n repeating m times is tried before reducing the repetition to n aa
times.
{n,m}? where n >= 0 and m Repeats the previous item between n and m times. Lazy, so a{2,4}? matches aa, aaa or
>= n repeating n times is tried before increasing the repetition to m aaaa
times.
{n,} where n >= 0 Repeats the previous item at least n times. Greedy, so as many a{2,} matches aaaaa in aaaaa
items as possible will be matched before trying permutations
with less matches of the preceding item, up to the point where
the preceding item is matched only n times.
{n,}? where n >= 0 Repeats the previous item n or more times. Lazy, so the engine a{2,}? matches aa in aaaaa
first matches the previous item n times, before trying
permutations with ever increasing matches of the preceding
item.
{n,m}+ where n >= 0 and m Repeats the previous item between n and m times. Possessive, a{2,4}+a matches aaaaa but
>= n so as many items as possible up to m will be matched, without not aaaa
trying any permutations with less matches even if the remainder
of the regex fails.
{n,}+ where n >= 0 Repeats the previous item n or more times. Possessive, so as a{2,}+a never matches
many items as possible will be matched, without trying any anything
permutations with less matches even if the remainder of the
regex fails.
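The difference between greedy and lazy quantifiers is easiest to see side by side. The sketch below uses Python's re module (an assumption; possessive quantifiers are left out because not every Python version supports them).

```python
import re

text = 'abc "def" "ghi" jkl'

# Greedy star: expands as far as possible, then backtracks to the last quote.
greedy = re.findall(r'".*"', text)

# Lazy star: expands only as far as needed to reach the next quote.
lazy = re.findall(r'".*?"', text)

run = "aaaaa"
greedy_range = re.match(r"a{2,4}", run).group(0)  # tries 4 repetitions first
lazy_range = re.match(r"a{2,4}?", run).group(0)   # tries 2 repetitions first

print(greedy)        # ['"def" "ghi"']
print(lazy)          # ['"def"', '"ghi"']
print(greedy_range)  # aaaa
print(lazy_range)    # aa
```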
(regex) Parentheses group the regex between them. They capture the (abc){3} matches abcabcabc.
text matched by the regex inside them that can be reused in a First group matches abc.
backreference, and they allow you to apply regex operators to
the entire grouped regex.
(?:regex) Non-capturing parentheses group the regex so you can apply (?:abc){3} matches
regex operators, but do not capture anything and do not create abcabcabc. No groups.
backreferences.
\1 through \9 Substituted with the text matched between the 1st through 9th (abc|def)=\1 matches
numbered capturing group. Some regex flavors allow more than abc=abc or def=def, but not
9 backreferences. abc=def or def=abc.
\g-1, \g-2, etc. Substituted with the text matched by the capturing group that (a)(b)(c)(d)\g-3 matches
can be found by counting as many opening parentheses of abcdb.
named or numbered capturing groups as specified by the
number from right to left starting at the backreference.
Any numbered backreference Backreferences to groups that did not participate in the match (a)?\1 matches aa but fails to
attempt fail to match. match b.
Any numbered backreference Some regex flavors allow backreferences to be used inside the (a\1?){3} matches aaaaaa.
group they reference.
Any numbered backreference In some regex flavors backreferences can be used before the
group they reference. Example: (\2?(a)){3} matches aaaaa.
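A numbered backreference matches the text that the group captured, not the group's pattern. The following sketch checks the (abc|def)=\1 example with Python's re module; the list of candidate strings is illustrative.

```python
import re

# \1 must repeat exactly what the first capturing group matched.
pattern = re.compile(r"(abc|def)=\1")

candidates = ["abc=abc", "def=def", "abc=def", "def=abc"]
accepted = [s for s in candidates if pattern.fullmatch(s)]

print(accepted)  # ['abc=abc', 'def=def']
```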
Grouping and Backreferences
(?<name>regex) Captures the text matched by regex into the group name. The (?<x>abc){3} matches
name can contain letters and numbers but must start with a abcabcabc. The group x
letter. matches abc.
\k<name> Substituted with the text matched by the named group name. (?<x>abc|def)=\k<x>
matches abc=abc or def=def,
but not abc=def or def=abc.
Modifiers
(?i) Turn on case insensitivity for the remainder of the regular te(?i)st matches teST but not
expression. (Older regex flavors may turn it on for the entire TEST.
regex.)
(?-i) Turn off case insensitivity for the remainder of the regular (?i)te(?-i)st matches TEst
expression. but not TEST.
(?s) Turn on "dot matches newline" for the remainder of the regular
expression. (Older regex flavors may turn it on for the entire
regex.)
(?-s) Turn off "dot matches newline" for the remainder of the regular
expression.
(?m) Caret and dollar match after and before newlines for the
remainder of the regular expression. (Older regex flavors may
apply this to the entire regex.)
(?-m) Caret and dollar only match at the start and end of the string for
the remainder of the regular expression.
(?i-sm) Turns on the option "i" and turns off "s" and "m" for the
remainder of the regular expression. (Older regex flavors may
apply this to the entire regex.)
(?i-sm:regex) Matches the regex inside the span with the option "i" turned on (?i:te)st matches TEst but
and "m" and "s" turned off. not TEST.
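A mode modifier span limits an option to part of the pattern. The sketch below tries the (?i:te)st example in Python's re module, assuming a Python version that accepts scoped inline flags (3.6 or later should do).

```python
import re

# Case insensitivity applies only to "te"; "st" must still be lowercase.
scoped = re.compile(r"(?i:te)st")

matches_mixed = scoped.fullmatch("TEst") is not None
matches_upper = scoped.fullmatch("TEST") is not None
matches_lower = scoped.fullmatch("test") is not None

print(matches_mixed, matches_upper, matches_lower)  # True False True
```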
Atomic Grouping and Possessive Quantifiers
(?>regex) Atomic groups prevent the regex engine from backtracking back x(?>\w+)x is more efficient than
into the group (forcing the group to discard part of its match) x\w+x if the second x cannot be
after a match has been found for the group. Backtracking can matched.
occur inside the group before it has matched completely, and the
engine can backtrack past the entire group, discarding its match
entirely. Eliminating needless backtracking provides a speed
increase. Atomic grouping is often indispensable when nesting
quantifiers to prevent a catastrophic amount of backtracking as
the engine needlessly tries pointless permutations of the nested
quantifiers.
?+, *+, ++ and {m,n}+ Makes the preceding item optional. Possessive, so if the optional abc?+c matches abcc but not
item can be matched, then the quantifier won't give up its match abc
even if the remainder of the regex fails.
Lookaround
(?=regex) Zero-width positive lookahead. Matches at a position where the t(?=s) matches the second t in
pattern inside the lookahead can be matched. Matches only the streets.
position. It does not consume any characters or expand the
match. In a pattern like one(?=two)three, both two and
three have to match at the position where the match of one
ends.
(?!regex) Zero-width negative lookahead. Identical to positive lookahead, t(?!s) matches the first t in
except that the overall match will only succeed if the regex inside streets.
the lookahead fails to match.
(?<=regex) Zero-width positive lookbehind. Matches at a position if the pattern inside the
lookbehind can be matched ending at that position (i.e. to the left of that position).
Depending on the regex flavor you're using, you may not be able to use quantifiers and/or
alternation inside lookbehind. Example: (?<=s)t matches the first t in streets.
(?<!regex) Zero-width negative lookbehind. Matches at a position if the pattern inside the
lookbehind cannot be matched ending at that position. Example: (?<!s)t matches the second
t in streets.
\K The text matched by the part of the regex to the left of the \K is s\Kt matches only the first t in
omitted from the overall regex match. Other than that the regex streets.
is matched normally from left to right. Capturing groups to the
left of the \K capture as usual.
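The four lookaround examples on the word streets can be verified by recording where each pattern matches. This is a sketch with Python's re module; match_starts is a hypothetical helper, and \K is omitted because Python's re does not provide it.

```python
import re

word = "streets"  # indexes: s=0 t=1 r=2 e=3 e=4 t=5 s=6


def match_starts(pattern):
    """Return the start index of every match of pattern in word."""
    return [m.start() for m in re.finditer(pattern, word)]


followed_by_s = match_starts(r"t(?=s)")       # second t
not_followed_by_s = match_starts(r"t(?!s)")   # first t
preceded_by_s = match_starts(r"(?<=s)t")      # first t
not_preceded_by_s = match_starts(r"(?<!s)t")  # second t

print(followed_by_s, not_followed_by_s)  # [5] [1]
print(preceded_by_s, not_preceded_by_s)  # [1] [5]
```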
Continuing from The Previous Match
\G Matches at the position where the previous match ended, or the \G[a-z] first matches a, then
position where the current match attempt started (depending on matches b and then fails to match
the tool or regex flavor). Matches at the start of the string during in ab_cd.
the first match attempt.
Conditionals
(?(?=regex)then|else) If the lookahead succeeds, the then part must match for the (?(?<=a)b|c) matches the
where (?=regex) is any valid overall regex to match. If the lookahead fails, the else part must second b and the first c in
lookaround and then and match for the overall regex to match. Not just positive babxcac
else are any valid regexes lookahead, but all four lookarounds can be used. Note that the
lookahead is zero-width, so the then and else parts need to
match and consume the part of the text matched by the
lookahead as well.
(?(1)then|else) where 1 If the first capturing group took part in the match attempt thus far, (a)?(?(1)b|c) matches ab,
is the number of a capturing the then part must match for the overall regex to match. If the the first c and the second c in
group and then and else are first capturing group did not take part in the match, the else part babxcac
any valid regexes must match for the overall regex to match.
(?(<name>)then|else) If the capturing group with the given name took part in the match (?<one>a)?(?(<one>)b|c)
where name is the name of a attempt thus far, the then part must match for the overall regex matches ab, the first c, and the
capturing group and then and to match. If the capturing group did not take part in the match second c in babxcac
else are any valid regexes thus far, the else part must match for the overall regex to
match.
(?|regex) If the regex inside the branch reset group has multiple (x)(?|(a)|(bc)|(def))\2
alternatives with capturing groups, then the capturing group matches xaa, xbcbc, or
numbers are the same in all the alternatives. xdefdef with the first group
capturing x and the second group
capturing a, bc, or def
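The numbered conditional (?(1)then|else) described above can be tried in Python's re module, which supports this form but not conditionals on a lookaround. A minimal sketch of the (a)?(?(1)b|c) example:

```python
import re

# If group 1 matched "a", the "b" branch is required; otherwise the "c" branch.
pattern = re.compile(r"(a)?(?(1)b|c)")

matches = [m.group(0) for m in pattern.finditer("babxcac")]

print(matches)  # ['ab', 'c', 'c']
```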
Comments
(?#comment) Everything between (?# and ) is ignored by the regex engine. a(?#foobar)b matches ab
Unicode Characters
\uFFFF where FFFF are 4 Matches a specific Unicode code point. Can be used inside \u00E0 matches à encoded as
hexadecimal digits character classes. U+00E0 only. \u00A9 matches ©
\x{FFFF} where FFFF are 1 Perl syntax to match a specific Unicode code point. Can be used \x{E0} matches à encoded as
to 4 hexadecimal digits inside character classes. U+00E0 only. \x{A9} matches ©
\p{L} or \p{Letter} Matches a single Unicode code point that has the property \p{L} matches à encoded as
Letter. See Unicode Character Properties in the tutorial for a U+00E0; \p{S} matches ©
complete list of properties. Each Unicode code point has exactly
one property. Can be used inside character classes.
\p{Script} Matches a single Unicode code point that is part of the specified \p{Greek} matches Ω
Unicode script. See Unicode Scripts in the tutorial for a complete
list of scripts. Each Unicode code point is part of exactly one
script. Can be used inside character classes.
\p{InBlock} Matches a single Unicode code point that is part of the specified Unicode
block. See Unicode Blocks in the tutorial for a complete list of blocks. Each Unicode code
point is part of exactly one block. Blocks may contain unassigned code points. Can be used
inside character classes. Example: \p{InCyrillic} matches any of the code points in the
block U+0400 until U+04FF (Ѐ until ӿ).
\P{L} or \P{Letter} Matches a single Unicode code point that does not have the \P{L} matches ©
property Letter. You can also use \P to match a code point
that is not part of a particular Unicode block or script. Can be
used inside character classes.
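The phrase "encoded as U+00E0 only" means that \u00E0 matches the precomposed code point but not the same letter written as a base letter plus a combining mark. The sketch below shows this with Python's re module, which accepts \uFFFF escapes but not \p{...}; unicodedata is used only to build the decomposed form.

```python
import re
import unicodedata

precomposed = "\u00E0"                                  # à as a single code point
decomposed = unicodedata.normalize("NFD", precomposed)  # a followed by U+0300

pattern = re.compile(r"\u00E0")

matches_precomposed = pattern.fullmatch(precomposed) is not None
matches_decomposed = pattern.search(decomposed) is not None
copyright_found = re.search(r"\u00A9", "(c) vs ©") is not None

print(matches_precomposed)  # True
print(matches_decomposed)   # False
print(copyright_found)      # True
```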
Python Syntax for Named Capture and Backreferences
(?P<name>regex) Round brackets group the regex between them. They capture
the text matched by the regex inside them that can be
referenced by the name between the sharp brackets. The name
may consist of letters and digits.
(?P=name) Substituted with the text matched by the capturing group with the (?P<set>abc|def)=(?P=set)
given name. Not a group, despite the syntax using round matches abc=abc or def=def,
brackets. but not abc=def or def=abc.
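The (?P<name>regex) and (?P=name) syntax can be exercised directly in Python's re module. A short sketch of the example above:

```python
import re

# (?P<set>...) captures into the group named "set"; (?P=set) must repeat that text.
pattern = re.compile(r"(?P<set>abc|def)=(?P=set)")

match = pattern.fullmatch("def=def")
captured = match.group("set")
mismatch = pattern.fullmatch("abc=def")

print(captured)  # def
print(mismatch)  # None
```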
.NET Syntax for Named Capture and Backreferences
(?<name>regex) Round brackets group the regex between them. They capture
the text matched by the regex inside them that can be
referenced by the name between the sharp brackets. The name
may consist of letters and digits.
(?'name'regex) Round brackets group the regex between them. They capture
the text matched by the regex inside them that can be
referenced by the name between the single quotes. The name
may consist of letters and digits.
\k<name> Substituted with the text matched by the capturing group with the given name.
Example: (?<set>abc|def)=\k<set> matches abc=abc or def=def, but not abc=def or def=abc.
\k'name' Substituted with the text matched by the capturing group with the (?'set'abc|def)=\k'set'
given name. matches abc=abc or def=def,
but not abc=def or def=abc.
(?(name)then|else) If the capturing group name took part in the match attempt thus (?<set>a)?(?(set)b|c)
far, the then part must match for the overall regex to match. If matches ab, the first c and the
the capturing group name did not take part in the match, the second c in babxcac
else part must match for the overall regex to match.
\c \c matches any character that may occur after the first \i\c* matches an XML name
character in an XML name, i.e. [-._:A-Za-z0-9] like xml:schema
[abc-[xyz]] Subtracts character class xyz from character class abc. The [a-z-[aeiou]] matches any
result matches any single character that occurs in the character letter that is not a vowel (i.e. a
class abc but not in the character class xyz. consonant).
POSIX Bracket Expressions
[:alpha:] Matches one character from a POSIX character class. Can only [[:digit:][:lower:]]
be used in a bracket expression. matches one of 0 through 9 or a
through z
[.span-ll.] Matches a POSIX collation sequence. Can only be used in a [[.span-ll.]] matches ll in
bracket expression. the Spanish locale
[=x=] Matches a POSIX character equivalence. Can only be used in a [[=e=]] matches e, é, è and ê
bracket expression. in the French locale
Recursion
(?R) Recursion of the entire regular expression. a(?R)?z matches az, aazz,
aaazzz, etc.
(?1) where 1 is the number of Recursion of a capturing group or subroutine call to a capturing a(b(?1)?y)z matches abyz,
a capturing group group. abbyyz, abbbyyyz, etc.
Characters, and Non-printable characters
Feature .NET Java Perl PCRE PCRE2 PHP Delphi R ECMA VBScript XRegExp Python Ruby std:: Boost TCL POSIX POSIX GNU GNU XML XPath
regex ARE BRE ERE BRE ERE
basic ECMA
Backslash escapes one metacharacter Y N 5.8 - 5.20 Y Y Y Y Y Y Y Y Y 1.9 Y Y Y Y N Y N
grep basic
ECMA
\Q…\E escapes a string of metacharacters N Y Y Y Y Y Y Y N N N N N N N N N N N N N
extend
ECMA ECMA
\n (LF), \r (CR) and \t (tab) Y Y Y Y Y Y Y Y Y Y Y Y Y Y string string string string Y Y
awk extend
\R
N 8 5.10 7.0 Y 5.2.2 Y Y N N N N 2.0 N ECMA N N N N N N N
(linebreak)
CRLF, LF or CR
N N N N N N N N N N N N N N N N N N N N Y Y
(Literal line break)
\a ECMA
Y Y Y Y Y Y Y Y N N N Y Y awk Y N N N N N N
(bell) extend
\b
N N N N N N N N N N N N N awk N Y N N N N N N
(backspace)
\B
N N N N N N N N N N N N N N N Y N N N N N N
(backslash)
\e ECMA
Y Y Y Y Y Y Y Y N N N N Y N Y N N N N N N
(escape) extend
\f ECMA ECMA
Y Y Y Y Y Y Y Y Y Y Y Y Y Y N N N N N N
(form feed) awk extend
\v ECMA ECMA
Y 4-7 N N N N N N Y Y Y Y Y Y N N N N N N
(vertical tab) awk extend
\0
Y N Y Y Y Y Y Y Y Y Y Y Y ECMA Y Y N N N N N N
(NULL character)
\o{7777}
N N 5.14 8.34 Y 5.5.10 XE7 3.0.3 N N N N N N N N N N N N N N
(octal escape)
\1 – \7
ECMA N N N N N N N Y Y N N N awk N N N N N N N N
(octal escape)
\10 – \77
Y N Y Y Y Y Y Y Y Y N N Y awk N Y N N N N N N
(octal escape)
\100 – \177
Y N Y Y Y Y Y Y Y Y N Y Y awk N Y N N N N N N
(octal escape)
\200 – \377
Y N Y Y Y Y Y Y Y Y N Y fail awk N Y N N N N N N
(octal escape)
Feature .NET Java Perl PCRE PCRE2 PHP Delphi R ECMA VBScript XRegExp Python Ruby std:: Boost TCL POSIX POSIX GNU GNU XML XPath
regex ARE BRE ERE BRE ERE
[abc]
Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y
character class
[^abc]
Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y
negated character class
[a-z]
Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y
character class range
\
Y Y Y Y Y Y Y Y Y Y Y Y Y ECMA ECMA Y N N N N Y Y
(backslash escapes special characters)
\ basic basic
N N N N N N N N N N N N N N Y Y Y Y N N
(literal backslash) extend extend
[ab[cd]ef]
N Y N N N N N N N N N N 1.9 N N N N N N N N N
(nested character class)
Character class
2.0 - 4.7 N N N N N N N N N N N N N N N N N N N Y Y
subtraction
Character class
N Y N N N N N N N N N N 1.9 N N N N N N N N N
intersection
ECMA ECMA
\n (LF), \r (CR) and \t (tab) Y Y Y Y Y Y Y Y Y Y Y Y Y Y string string string string Y Y
awk awk
\a ECMA
Y Y Y Y Y Y Y Y N N N Y Y awk Y N N N N N N
(bell) awk
\b ECMA ECMA
Y N Y Y Y Y Y Y Y Y Y Y Y Y N N N N Y Y
(backspace) awk awk
\B
N N N N N N N N N N N N N N N Y N N N N N N
(backslash)
\e ECMA
Y Y Y Y Y Y Y Y N N N N Y N Y N N N N N N
(escape) awk
\f ECMA ECMA
Y Y Y Y Y Y Y Y Y Y Y Y Y Y N N N N N N
(form feed) awk awk
\v ECMA ECMA
Y 4-7 N N N N N N Y Y Y Y Y Y N N N N N N
(vertical tab) awk awk
[:alpha:]
N N Unicod ASCII ASCII Unicod ASCII code N N N N Unicod Unicod Unicod Unicod ASCII ASCII ASCII ASCII N N
(POSIX character class)
[:^alpha:] 3.7
N N Y Y Y Y Y Y N N N 1.9 error Y error error error error error N N
(negated POSIX character class) error
[:d:], [:s:], [:w:]
N N N N N N N N N N N N N Unicode Unicode N N N N N N N
(POSIX shorthand class)
extend
Any supported \p{…} syntax N 9 Y N N N N N N N N N 1.9 N N N N N N N N
egrep
Feature .NET Java Perl PCRE PCRE2 PHP Delphi R ECMA VBScript XRegExp Python Ruby std:: Boost TCL POSIX POSIX GNU GNU XML XPath
regex ARE BRE ERE BRE ERE
Any shorthand
Y Y Y Y Y Y Y Y Y Y Y Y Y ECMA Y Y N N N N Y Y
inside a character class
\d non ECMA
ASCII Unicod ASCII ASCII Unicod ASCII ASCII ASCII ASCII ASCII Unicod ASCII Unicod Unicod N N N N Unicod Unicod
(digits) ECMA Unicod
\h
N N N N N N N N N N N N 1.9 ASCII N N N N N N N N N
(hexadecimal digits)
\i and \c
N N N N N N N N N N N N N N N N N N N N Y Y
(XML name)
\I and \C
N N N N N N N N N N N N N N N N N N N N Y Y
(negated variants of \i and \c)
\c
Anchors
Feature .NET Java Perl PCRE PCRE2 PHP Delphi R ECMA VBScript XRegExp Python Ruby std:: Boost TCL POSIX POSIX GNU GNU XML XPath
regex ARE BRE ERE BRE ERE
^
Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y N Y
(start of string/line)
$
Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y N Y
(end of string/line)
$
Y Y Y Y Y Y Y Y N N N Y N N N N N N N N n/a N
(end of string; before final line break)
^ and $
option option option option option option option option option option option option Y Y option option option option option option n/a option
match after each line break
\A ECMA
Y Y Y Y Y Y Y Y N N N Y Y N N N N N N N N
(start of string) extend
\A
N N N N N N N N N N N N N N N Y N N N N N N
(start of match attempt)
\G ECMA
Y Y Y 8.00 Y N N N N N N N N N N N N N N N N
(start of match attempt) extend
\z ECMA
Y Y Y Y Y Y Y Y N N N \Z Y N \Z N N N N N N
(end of string) extend
\Z
Y Y Y Y Y Y Y Y N N N N Y N N N N N N N N N
(end of string, before final line break)
\`
N N N N N N N N N N N N N N N N N N Y Y N N
(start of string)
\'
N N N N N N N N N N N N N N N N N N Y Y N N
(end of string)
The Dot
Feature .NET Java Perl PCRE PCRE2 PHP Delphi R ECMA VBScript XRegExp Python Ruby std:: Boost TCL POSIX POSIX GNU GNU XML XPath
regex ARE BRE ERE BRE ERE
.
Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y
(dot; any character except line break)
\N
N N 5.12 8.10 Y 5.3.4 XE7 Y N N N N N N N N N N N N N N
(not a line break)
Alternation
Feature .NET Java Perl PCRE PCRE2 PHP Delphi R ECMA VBScript XRegExp Python Ruby std:: Boost TCL POSIX POSIX GNU GNU XML XPath
regex ARE BRE ERE BRE ERE
Word Boundaries
Feature .NET Java Perl PCRE PCRE2 PHP Delphi R ECMA VBScript XRegExp Python Ruby std:: Boost TCL POSIX POSIX GNU GNU XML XPath
regex ARE BRE ERE BRE ERE
\y
N N N N N N N N N N N N N N N Unicode N N N N N N
(beginning or end of a word)
\Y
N N N N N N N N N N N N N N N Unicode N N N N N N
(NOT beginning or end of a word)
\m
N N N N N N N N N N N N N N N Unicode N N N N N N
(beginning of a word)
\M
N N N N N N N N N N N N N N N Unicode N N N N N N
(end of a word)
\< ECMA
N N N N N N N N N N N N N N N N N Y Y N N
(beginning of a word) extend
\> ECMA
N N N N N N N N N N N N N N N N N Y Y N N
(end of a word) extend
Quantifiers
Feature .NET Java Perl PCRE PCRE2 PHP Delphi R ECMA VBScript XRegExp Python Ruby std:: Boost TCL POSIX POSIX GNU GNU XML XPath
regex ARE BRE ERE BRE ERE
? ECMA ECMA
Y Y Y Y Y Y Y Y Y Y Y Y Y Y N Y \? Y Y Y
(0 or 1) extend extend
*
Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y
(0 or more)
+ ECMA ECMA
Y Y Y Y Y Y Y Y Y Y Y Y Y Y N Y \+ Y Y Y
(1 or more) extend extend
{,m}
N N N N N N N N N N N Y 1.9 N N N N N \{,m\} Y N N
(between 0 and m)
Feature .NET Java Perl PCRE PCRE2 PHP Delphi R ECMA VBScript XRegExp Python Ruby std:: Boost TCL POSIX POSIX GNU GNU XML XPath
regex ARE BRE ERE BRE ERE
\X ECMA
N 9 Y 5.0 Y 5.0.5 Y Y N N N N 2.0 N N N N N N N N
(Unicode grapheme) extend
\pL
N Y Y 5.0 Y 5.0.5 Y Y N N 3 N N N N N N N N N N N
(Unicode category)
\p{L}
Y Y Y 5.0 Y 5.0.5 Y Y N N Y N 1.9 N N N N N N N Y Y
(Unicode category)
\p{IsL}
N Y Y N N N N N N N N N N N N N N N N N N N
(Unicode category)
\p{Category}
N N Y N N N N N N N Y N 1.9 N N N N N N N N N
(Unicode category)
\p{IsCategory}
N N Y N N N N N N N N N N N N N N N N N N N
(Unicode category)
\p{Script}
N N Y 6.5 Y 5.1.3 Y Y N N Y N 1.9 N N N N N N N N N
(Unicode script)
\p{IsScript}
N 7 Y N N N N N N N N N N N N N N N N N N N
(Unicode script)
\p{Block}
N N Y N N N N N N N N N N N N N N N N N N N
(Unicode block)
\p{InBlock}
N Y Y N N N N N N N Y N 2.0 N N N N N N N N N
(Unicode block)
\p{IsBlock}
Y N Y N N N N N N N N N N N N N N N N N Y Y
(Unicode block)
Feature .NET Java Perl PCRE PCRE2 PHP Delphi R ECMA VBScript XRegExp Python Ruby std:: Boost TCL POSIX POSIX GNU GNU XML XPath
regex ARE BRE ERE BRE ERE
(regex)
Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y \( \) Y \( \) Y Y Y
(numbered capturing group)
(?:regex)
Y Y Y Y Y Y Y Y Y Y Y Y Y ECMA ECMA Y N N N N N Y
(non-capturing group)
\1 - \9 ECMA ECMA
Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y N Y Y N Y
(numbered backreference) basic basic
\10 - \99
Y Y Y Y Y Y Y Y Y Y Y Y Y Y N Y N N N N N Y
(numbered backreference)
\g-1, \g-2, etc.
N N 5.10 7.0 Y 5.2.2 Y Y N N N N N N ECMA N N N N N N N
(\g{-1}, \g{-2}, etc.)
Feature .NET Java Perl PCRE PCRE2 PHP Delphi R ECMA VBScript XRegExp Python Ruby std:: Boost TCL POSIX POSIX GNU GNU XML XPath
regex ARE BRE ERE BRE ERE
\k<-1>, \k<-2>, etc.
N N N N N N N N N N N N 1.9 N ECMA N N N N N N N
\k'-1', \k'-2', etc.
\g<-1>, \g<-2>, etc.
N N N N N N N N N N N N N N ECMA N N N N N N N
\g'-1', \g'-2', etc.
non ECMA
Backreferences to failed groups also fail Y Y Y Y Y Y Y ign ign ign Y Y ECMA Y Y n/a Y Y n/a ign
ECMA basic
Feature .NET Java Perl PCRE PCRE2 PHP Delphi R ECMA VBScript XRegExp Python Ruby std:: Boost TCL POSIX POSIX GNU GNU XML XPath
regex ARE BRE ERE BRE ERE
(?<name>regex)
Y 7 5.10 7.0 Y 5.2.2 Y Y N N Y N 1.9 N ECMA N N N N N N N
(Named capturing group)
(?'name'regex)
Y N 5.10 7.0 Y 5.2.2 Y Y N N N N 1.9 N ECMA N N N N N N N
(Named capturing group)
(?P<name>regex)
N N 5.10 Y Y Y Y Y N N Y Y N N N N N N N N N N
(Named capturing group)
\k<name>
Y 7 5.10 7.0 Y 5.2.2 Y Y N N Y N 1.9 N ECMA N N N N N N N
(named backreference)
\k'name'
Y N 5.10 7.0 Y 5.2.2 Y Y N N N N 1.9 N ECMA N N N N N N N
(named backreference)
(?P=name)
N N 5.10 7.2 Y 5.2.4 Y Y N N N Y N N N N N N N N N N
(named backreference)
non
Backreferences to failed groups also fail 7 5.10 Y Y Y Y Y n/a n/a ign Y 1.9 n/a ECMA n/a n/a n/a n/a n/a n/a n/a
ECMA
A number is a valid name for a capturing 7 5.10 5.0.0 - 2.14.0 - 1.9 ECMA
Y 4.0 - 8.33 error XE - XE6 n/a n/a error error n/a n/a n/a n/a n/a n/a n/a n/a
group. error error 5.1.2 3.0.2 error error
Modifiers
Feature .NET Java Perl PCRE PCRE2 PHP Delphi R ECMA VBScript XRegExp Python Ruby std:: Boost TCL POSIX POSIX GNU GNU XML XPath
regex ARE BRE ERE BRE ERE
(?letters)
Y Y Y Y Y Y Y Y N N Y Y Y N ECMA Y N N N N N N
(mode modifiers)
(?-letters)
Y Y Y Y Y Y Y Y n/a n/a Y 3.7 Y n/a ECMA Y n/a n/a n/a n/a n/a n/a
(turn off mode modifiers)
(?-letters:regex)
Y Y Y Y Y Y Y Y n/a n/a N 3.7 Y n/a ECMA N n/a n/a n/a n/a n/a n/a
(mode modifiers local to group)
(?^)
N N 5.14 N N N N N N/A N/A N N N N/A N N N/A N/A N/A N/A N/A N/A
Turn off all options
(?c)
N N N N N N N N n/a n/a N N N n/a N Y n/a n/a n/a n/a n/a n/a
(case insensitivity)
Feature .NET Java Perl PCRE PCRE2 PHP Delphi R ECMA VBScript XRegExp Python Ruby std:: Boost TCL POSIX POSIX GNU GNU XML XPath
regex ARE BRE ERE BRE ERE
(?i)
Y Y Y Y Y Y Y Y n/a n/a Y Y Y n/a ECMA Y n/a n/a n/a n/a n/a n/a
(case insensitivity)
(?x)
Y Y Y Y Y Y Y Y n/a n/a Y Y Y n/a ECMA Y n/a n/a n/a n/a n/a n/a
(free spacing mode)
(?m)
Y Y Y Y Y Y Y Y n/a n/a Y Y N n/a ECMA (?s) n/a n/a n/a n/a n/a n/a
(^ and $ match at line breaks)
(?n)
Y N 5.22 N 10.30 N N N n/a n/a Y N N n/a N N n/a n/a n/a n/a n/a n/a
(unnamed groups are non-capturing)
(?J)
N N N 6.7 Y 5.2.0 Y Y n/a n/a N N N n/a N N n/a n/a n/a n/a n/a n/a
(allow duplicate group names)
(?d)
N Y N N N N N N n/a n/a N N N n/a N N n/a n/a n/a n/a n/a n/a
(anchors treat only \n as line break)
(?e) (interpret the regex as a POSIX BRE) N N N N N N N N n/a n/a N N N n/a N Y n/a n/a n/a n/a n/a n/a
(?b) (interpret the regex as a POSIX ERE) N N N N N N N N n/a n/a N N N n/a N Y n/a n/a n/a n/a n/a n/a
(?q)
N N N N N N N N n/a n/a N N N n/a N Y n/a n/a n/a n/a n/a n/a
(interpret the regex as literal string)
Lookaround
Feature .NET Java Perl PCRE PCRE2 PHP Delphi R ECMA VBScript XRegExp Python Ruby std:: Boost TCL POSIX POSIX GNU GNU XML XPath
regex ARE BRE ERE BRE ERE
(?=regex)
Y Y Y Y Y Y Y Y Y Y Y Y Y ECMA ECMA Y N N N N N N
(positive lookahead)
(?!regex)
Y Y Y Y Y Y Y Y Y Y Y Y Y ECMA ECMA Y N N N N N N
(negative lookahead)
(?<=regex)
Y Y Y Y Y Y Y Y N N N Y 1.9 N ECMA N N N N N N N
(positive lookbehind)
(?<!regex)
Y Y Y Y Y Y Y Y N N N Y 1.9 N ECMA N N N N N N N
(negative lookbehind)
(?<=regex{n,m})
Y 6 N N N N N N n/a n/a n/a N N n/a N n/a n/a n/a n/a n/a n/a n/a
(finite repetition allowed in lookbehind)
Feature .NET Java Perl PCRE PCRE2 PHP Delphi R ECMA VBScript XRegExp Python Ruby std:: Boost TCL POSIX POSIX GNU GNU XML XPath
regex ARE BRE ERE BRE ERE
(?>regex)
Y Y Y Y Y Y Y Y N N N N Y N ECMA N N N N N N N
(atomic group)
(?|regex)
N N 5.10 7.2 Y 5.2.4 Y Y N N N N N N ECMA N N N N N N N
(branch reset group)
(?#comment)
Y N Y Y Y Y Y Y N N Y Y Y N ECMA Y N N N N N N
(comment ignored by regex engine)
Recursion
Feature .NET Java Perl PCRE PCRE2 PHP Delphi R ECMA VBScript XRegExp Python Ruby std:: Boost TCL POSIX POSIX GNU GNU XML XPath
regex ARE BRE ERE BRE ERE
(?1
(?1) 4
N 5.10 Y Y Y Y Y N N N N N N ECMA N N N N N N N
(subroutine call) only
\g<1> and \g'1'
N N N 7.7 Y 5.2.7 Y Y N N N N 1.9 N N N N N N N N N
(subroutine call)
(?-1)
N N 5.10 7.2 Y 5.2.4 Y Y N N N N N N ECMA N N N N N N N
(relative subroutine call)
\g<-1> and \g'-1'
N N N 7.7 Y 5.2.7 Y Y N N N N 1.9 N N N N N N N N N
(relative subroutine call)
(?+1)
N N 5.10 7.2 Y 5.2.4 Y Y N N N N N N ECMA N N N N N N N
(forward subroutine call)
\g<+1> and \g'+1'
N N N 7.7 Y 5.2.7 Y Y N N N N 2.0 N N N N N N N N N
(forward subroutine call)
(?&name) and (?P>name)
N N 5.10 7.2 Y 5.2.4 Y Y N N N N N N ECMA N N N N N N N
(named subroutine call)
\g<name> and \g'name'
N N N 7.7 Y 5.2.7 Y Y N N N N 2.0 N N N N N N N N N
(named subroutine call)
(?(DEFINE)regex)
N N 5.10 7.0 Y 5.2.2 Y Y N N N N N N ECMA N N N N N N N
(separate subroutine definitions)
Recursion is atomic
n/a n/a N 6.5 Y 5.1.3 Y Y n/a n/a n/a n/a N n/a ECMA n/a n/a n/a n/a n/a n/a n/a
(using syntax other than (?P>0))
Feature .NET Java Perl PCRE PCRE2 PHP Delphi R ECMA VBScript XRegExp Python Ruby std:: Boost TCL POSIX POSIX GNU GNU XML XPath
regex ARE BRE ERE BRE ERE
(?(?=regex)then|else)
Y N Y Y Y Y Y Y N N N N N N ECMA N N N N N N N
(using any lookaround)
(?(regex)then|else)
Y N N N N N N N N N N N N N N N N N N N N N
(implicit lookahead conditional)
(?(name)then|else)
Y N N 6.7 Y 5.2.0 Y Y N N N Y N N N N N N N N N N
(named conditional)
(?(<name>)then|else)
N N 5.10 7.0 Y 5.2.2 Y Y N N N N 2.0 N ECMA N N N N N N N
(named conditional)
(?('name')then|else)
N N 5.10 7.0 Y 5.2.2 Y Y N N N N 2.0 N ECMA N N N N N N N
(named conditional)
(?(1)then|else)
Y N Y Y Y Y Y Y N N N Y 2.0 N ECMA N N N N N N N
(where 1 is the number of a capturing group)
(?(-1)then|else)
N N N 7.2 Y 5.2.4 Y Y N N N N N N N N N N N N N N
(relative conditional)
(?(+1)then|else)
N N N 7.2 Y 5.2.4 Y Y N N N N N N N N N N N N N N
(forward conditional)
Regular Expressions Replacement Reference
The regular expressions replacement reference in this section functions both as a reference to all available replacement
syntax and as a comparison of the features supported by the regular expression flavors discussed in the tutorial. The
reference tables pack an incredible amount of information. To get the most out of them, follow this legend to learn how to read
them.
The replacement reference tables have three columns that explain each replacement feature:
Token The actual regex syntax for this token. If the syntax is fixed, it is simply shown as such. If the syntax has variable
elements, the syntax is described.
Flavor Comparison
A flavor can have four levels of support (or non-support) for a particular token:
3.0 The token will be substituted in version 3.0 and all later versions of this flavor. Earlier versions do not support it.
2.0 - 2.9 The token will be substituted in versions 2.0 through 2.9 of this flavor. Earlier and later versions do not support it.
N The token will remain in the replacement as literal text. Note that languages that use variable interpolation in strings may
still replace tokens indicated as unsupported below, if the syntax of the token corresponds with the variable interpolation
syntax. E.g. in Perl, $0 is replaced with the name of the script.
string The regex flavor does not support this replacement token. But string literals in the programming language that this regex
flavor is normally used with do support this token.
error The token is recognized by the flavor but it is treated as a syntax error.
n/a This feature is not applicable to this regex flavor. Features that describe the behavior of certain tokens introduced earlier
in the reference table show n/a for flavors that do not support that token at all.
For the .NET flavor, some tokens are indicated with "ECMA". That means the token is only supported when
RegexOptions.ECMAScript is set. Everything that applies to .NET 2.0 or later also applies to any version of .NET Core.
The Visual Studio IDE uses the non-ECMA .NET flavor starting with VS 2012.
For the std::regex and boost::regex flavors, there are two additional indicators: sed and default. The sed indicator means the feature is only supported when you pass match_flag_type::format_sed to regex_replace(). The default indicator means the feature is only supported when you do not pass that flag.
For Boost, there is one more replacement indicator, all, which means the feature is only supported when you pass match_flag_type::format_all to regex_replace().
For the PCRE2 flavor, some replacement string features are indicated with extend. This means the feature is only supported when you pass PCRE2_SUBSTITUTE_EXTENDED to pcre2_substitute().
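As a rough illustration of the N level described in the legend, here is a minimal sketch using Python's re module. It assumes a flavor in which $0 is not a replacement token, as the comparison tables in this reference indicate for Python. The sample strings are invented for the example.

```python
import re

# "$0" is not a replacement token in Python's re module, so it is
# copied into the result as literal text (the "N" level of support).
literal = re.sub(r"\d+", "[$0]", "1a2b")
print(literal)      # [$0]a[$0]b

# The token Python does support for the whole match is \g<0>.
substituted = re.sub(r"\d+", r"[\g<0>]", "1a2b")
print(substituted)  # [1]a[2]b
```

The same replacement text can therefore give different results in different flavors, which is why the tables list each flavor separately.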
Characters and Non-Printable Characters
\ (backslash followed by any character that does not form a token)
  Description: A backslash that is followed by any character that does not form a replacement string token in combination with the backslash inserts the escaped character literally.
  Example: Replacing with \! yields !
\ (a backslash that does not form a token)
  Description: A backslash that is not part of a replacement string token is a literal backslash.
  Example: Replacing with \! yields \!
\ (trailing backslash)
  Description: A backslash at the end of the replacement string is a literal backslash.
  Example: Replacing with \ yields \
$ (a dollar that does not form a token)
  Description: A dollar sign that does not form a replacement string token is a literal dollar sign.
  Example: Replacing with $! yields $!
$ (trailing dollar)
  Description: A dollar sign at the end of the replacement string is a literal dollar sign.
  Example: Replacing with $ yields $
\xFF where FF are 2 hexadecimal digits
  Description: Inserts the character at the specified position in the code page.
  Example: \xA9 inserts © when using the Latin-1 code page
\uFFFF where FFFF are 4 hexadecimal digits
  Description: Inserts a specific Unicode code point.
  Example: \u00E0 inserts à encoded as U+00E0 only. \u00A9 inserts ©
\u{FFFF} and \x{FFFF} where FFFF are 1 to 4 hexadecimal digits
  Description: Inserts a specific Unicode code point.
  Example: \u{E0} and \x{E0} insert à encoded as U+00E0 only.
\n, \r and \t
  Description: Insert an LF character, CR character and a tab character respectively.
  Example: \r\n inserts a Windows CRLF line break
\a, \b, \e, \f and \v
  Description: Insert the bell character (\x07), backspace character (\x08), escape character (\x1B), form feed (\x0C) and vertical tab (\x0B) respectively.
\cA through \cZ
  Description: Insert an ASCII character Control+A through Control+Z, equivalent to \x01 through \x1A.
  Example: \cM\cJ inserts a Windows CRLF line break
\ca through \cz
  Description: Insert an ASCII character Control+A through Control+Z, equivalent to \x01 through \x1A.
  Example: \cm\cj inserts a Windows CRLF line break
\o{7777} where 7777 is any octal number
  Description: Inserts the character at the specified position in the active code page.
  Example: \o{20254} inserts € when using Unicode
\01 through \07
  Description: Inserts the character at the specified position in the ASCII table.
  Example: \07 inserts the "bell" character
\10 through \77 and \010 through \077
  Description: Inserts the character at the specified position in the ASCII table.
  Example: \77 and \077 insert ?
\100 through \177
  Description: Inserts the character at the specified position in the ASCII table.
  Example: \100 inserts @
\200 through \377
  Description: Inserts the character at the specified position in the active code page.
  Example: \377 inserts ÿ when using the Latin-1 code page
\400 through \777
  Description: Inserts the character at the specified position in the active code page.
  Example: \777 inserts ǿ when using Unicode
Matched Text and Backreferences
\&
  Description: Insert the whole regex match.
  Example: Replacing \d+ with [$&] in 1a2b yields [1]a[2]b
\1 through \99
  Description: Insert the text matched by one of the first 99 capturing groups.
  Example: Replacing (a)(b)(c) with \3\3\1 in abc yields cca
\10 through \99
  Description: When there are fewer capturing groups than the 2-digit number, treat this as a single-digit backreference followed by a literal number instead of as an invalid backreference.
  Example: Replacing (a)(b)(c) with \39\38\17 in abc yields c9c8a7
${name}
  Description: Insert the text matched by the named capturing group name.
  Example: Replacing (?'one'a)(?'two'b) with ${two}${one} in ab yields ba
$+
  Description: Insert the text matched by the highest-numbered capturing group that participated in the match.
  Example: Replacing (a)(z)? with [$+] in ab yields [a]b
Match Context
$` and \`
  Description: Insert the part of the subject string to the left of the regex match.
  Example: Replacing b with $` or \` in abc yields aac
$' and \'
  Description: Insert the part of the subject string to the right of the regex match.
  Example: Replacing b with $' or \' in abc yields acc
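Python's re module has no match context tokens, so the $` and $' examples can only be approximated there. The following is a minimal sketch that uses a replacement function instead of a replacement string. It mirrors the abc examples and is an emulation, not the syntax of any flavor in the tables.

```python
import re

subject = "abc"

# Emulates $` : insert the part of the subject to the left of the match.
left = re.sub(r"b", lambda m: m.string[:m.start()], subject)

# Emulates $' : insert the part of the subject to the right of the match.
right = re.sub(r"b", lambda m: m.string[m.end():], subject)

print(left)   # aac
print(right)  # acc
```

The match object's string attribute holds the whole subject, so the same approach also covers a token that inserts the entire subject text.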
Case Conversion
\U0 and \U1 through \U99
  Description: Insert the whole regex match or the 1st through 99th backreference with all letters in the matched text converted to uppercase.
  Example: Replacing .+ with \U0 in HeLlO WoRlD yields HELLO WORLD
\L0 and \L1 through \L99
  Description: Insert the whole regex match or the 1st through 99th backreference with all letters in the matched text converted to lowercase.
  Example: Replacing .+ with \L0 in HeLlO WoRlD yields hello world
\F0 and \F1 through \F99
  Description: Insert the whole regex match or the 1st through 99th backreference with the first letter in the matched text converted to uppercase and the remaining letters converted to lowercase.
  Example: Replacing .+ with \F0 in HeLlO WoRlD yields Hello world
\I0 and \I1 through \I99
  Description: Insert the whole regex match or the 1st through 99th backreference with the first letter of each word in the matched text converted to uppercase and the remaining letters converted to lowercase.
  Example: Replacing .+ with \I0 in HeLlO WoRlD yields Hello World
\U
  Description: All literal text and all text inserted by replacement text tokens after \U up to the next \E or \L is converted to uppercase.
  Example: Replacing (\w+) (\w+) with \U$1 CrUeL \E$2 in HeLlO WoRlD yields HELLO CRUEL WoRlD
\L
  Description: All literal text and all text inserted by replacement text tokens after \L up to the next \E or \U is converted to lowercase.
  Example: Replacing (\w+) (\w+) with \L$1 CrUeL \E$2 in HeLlO WoRlD yields hello cruel WoRlD
\u
  Description: The first character after \u that is inserted into the replacement text as a literal or by a token is converted to uppercase.
  Example: Replacing (\w+) (\w+) with \u$1 \ucRuEl \u$2 in hElLo wOrLd yields HElLo CRuEl WOrLd
\l
  Description: The first character after \l that is inserted into the replacement text as a literal or by a token is converted to lowercase.
  Example: Replacing (\w+) (\w+) with \l$1 \lCrUeL \l$2 in HeLlO WoRlD yields heLlO crUeL woRlD
\u\L
  Description: The first character after \u\L that is inserted into the replacement text as a literal or by a token is converted to uppercase and the following characters up to the next \E or \U are converted to lowercase.
  Example: Replacing (\w+) (\w+) with \u\L$1 \uCrUeL \E\u$2 in HeLlO wOrLd yields Hello Cruel WOrLd
\l\U
  Description: The first character after \l\U that is inserted into the replacement text as a literal or by a token is converted to lowercase and the following characters up to the next \E or \L are converted to uppercase.
  Example: Replacing (\w+) (\w+) with \l\U$1 \lCrUeL \E\l$2 in HeLlO WoRlD yields hELLO cRUEL woRlD
\L\u
  Description: The first character after \L\u that is inserted into the replacement text as a literal or by a token is converted to uppercase and the following characters up to the next \E or \U are converted to lowercase.
  Example: Replacing (\w+) (\w+) with \L\u$1 \uCrUeL \E\u$2 in HeLlO wOrLd yields Hello Cruel WOrLd
\U\l
  Description: The first character after \U\l that is inserted into the replacement text as a literal or by a token is converted to lowercase and the following characters up to the next \E or \L are converted to uppercase.
  Example: Replacing (\w+) (\w+) with \U\l$1 \lCrUeL \E\l$2 in HeLlO WoRlD yields hELLO cRUEL woRlD
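The comparison table in this reference lists case conversion tokens as unsupported in Python (an error from version 3.7). A replacement function can approximate the effect instead. The sketch below is such an emulation in Python's re module and reuses the HeLlO WoRlD examples. The variable names are illustrative only.

```python
import re

subject = "HeLlO WoRlD"

# Approximations of \U0, \L0, \F0 and \I0 using string methods.
upper = re.sub(r".+", lambda m: m.group(0).upper(), subject)
lower = re.sub(r".+", lambda m: m.group(0).lower(), subject)
first = re.sub(r".+", lambda m: m.group(0).capitalize(), subject)
initial = re.sub(r".+", lambda m: m.group(0).title(), subject)

# Approximation of \U$1 CrUeL \E$2: uppercase everything up to the
# point where \E would appear, then append group 2 unchanged.
mixed = re.sub(
    r"(\w+) (\w+)",
    lambda m: (m.group(1) + " CrUeL ").upper() + m.group(2),
    subject,
)

print(upper)    # HELLO WORLD
print(lower)    # hello world
print(first)    # Hello world
print(initial)  # Hello World
print(mixed)    # HELLO CRUEL WoRlD
```

Python's str.title() uses its own definition of word boundaries, so it may differ from \I0 on text that contains apostrophes or digits.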
Conditionals
(?{1}yes:no) through (?{99}yes:no)
  Description: Conditional referencing a numbered capturing group. Inserts the yes part if the group participated or the no part if it didn't.
  Example: Replacing all matches of (y)?|n in yyn! with (?1yes:no) yields yesyesno!
${1:+yes:no} through ${99:+yes:no}
  Description: Conditional referencing a numbered capturing group. Inserts the yes part if the group participated or the no part if it didn't.
  Example: Replacing all matches of (y)?|n in yyn! with ${1:+yes:no} yields yesyesno!
${1:-no} through ${99:-no}
  Description: Conditional referencing a numbered capturing group. Inserts the text captured by the group if it participated or the contents of the conditional if it didn't.
  Example: Replacing all matches of (y)?|n in yyn! with ${1:-no} yields yyno!
(?{name}yes:no)
  Description: Conditional referencing a named capturing group. Inserts the yes part if the group participated or the no part if it didn't.
  Example: Replacing all matches of (?'one'y)?|n in yyn! with (?{one}yes:no) yields yesyesno!
${name:+yes:no}
  Description: Conditional referencing a named capturing group. Inserts the yes part if the group participated or the no part if it didn't.
  Example: Replacing all matches of (?'one'y)?|n in yyn! with ${one:+yes:no} yields yesyesno!
${name:-no}
  Description: Conditional referencing a named capturing group. Inserts the text captured by the group if it participated or the contents of the conditional if it didn't.
  Example: Replacing all matches of (?'one'y)?|n in yyn! with ${one:-no} yields yyno!
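Replacement conditionals are not available in Python's re module, but a replacement function can test whether a group participated. The sketch below emulates ${1:+yes:no} and ${1:-no} in Python. It uses the pattern (y)|n, which is an assumption made for this example. It differs from the (y)?|n in the tables so that the pattern can never match an empty string, which keeps the Python result easy to follow.

```python
import re

subject = "yyn!"

# Assumed pattern for this sketch: group 1 participates only when "y" matched.
pattern = r"(y)|n"

# Emulates ${1:+yes:no}: "yes" if group 1 participated, otherwise "no".
yes_no = re.sub(
    pattern,
    lambda m: "yes" if m.group(1) is not None else "no",
    subject,
)

# Emulates ${1:-no}: the text of group 1 if it participated, otherwise "no".
fallback = re.sub(
    pattern,
    lambda m: m.group(1) if m.group(1) is not None else "no",
    subject,
)

print(yes_no)    # yesyesno!
print(fallback)  # yyno!
```

A group that did not participate is reported as None by m.group(), which is what distinguishes it from a group that matched an empty string.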
Characters and Non-Printable Characters
Feature | .NET | Java | Perl | PCRE2 | PHP | Delphi | R | ECMA | VBScript | XRegExp | Python | Ruby | std::regex | Boost | TCL ARE | XPath
\ (unescaped backslash is literal) | Y | N | N | default | Y | Y | N | Y | Y | Y | Y | Y | default | N | Y | error
\ (trailing backslash is literal) | Y | error | error | default | Y | Y | N | Y | Y | Y | error | Y | default | Y | Y | error
\\ (backslash escapes itself) | N | Y | Y | extend | Y | Y | Y | N | N | N | Y | Y | sed | Y | Y | Y
Trailing $ is literal | Y | error | error | error | Y | Y | Y | Y | Y | Y | Y | Y | VC15 default | Y | Y | error
$ escapes itself | Y | error | error | Y | N | Y | N | Y | Y | Y | N | N | default | all default | N | error
A backslash escapes $ | N | Y | Y | extend | Y | Y | Y | N | N | N | 3.7 error | N | sed | Y | N | Y
\xFF (hexadecimal escape) | N | N | Y | extend | string | N | string | string | N | string | string | string | string | Y | string | error
\u{FFFF} (Unicode character) | N | N | N | extend error | 7.0 string | N | string | N | N | N | 3.7 error | 1.9 string | N | N | N | error
\x{FFFF} (Unicode character) | N | N | Y | extend | N | N | N | N | N | N | 3.7 error | N | N | Y | N | error
\n, \r and \t (character escape) | N | string | Y | extend | string | N | string | string | N | string | Y | string | string | Y | string | error
\a (bell) | N | N | Y | extend | N | N | N | N | N | N | Y | N | N | Y | N | error
\b (backspace) | N | N | Y | extend error | N | N | N | N | N | N | Y | N | N | N | N | error
\e (escape) | N | N | Y | extend | N | N | N | N | N | N | 3.7 error | N | N | Y | N | error
\f (form feed) | N | N | Y | extend | N | N | N | N | N | N | Y | N | N | Y | N | error
\v (vertical tab) | N | N | N | extend error | N | N | N | N | N | N | Y | N | N | Y | N | error
\0 (NULL character) | N | N | Y | extend error | N | N | N | N | N | N | Y | N | N | all default | N | error
\o{7777} (octal escape) | N | N | 5.14 | extend | N | N | N | N | N | N | 3.7 error | N | N | N | N | error
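To make a few Python entries in the table above concrete, here is a minimal sketch using Python's re module. It assumes Python 3.7 or later. The helper replacement_error is a hypothetical name introduced only for this example.

```python
import re

# \t is a character escape handled by re.sub itself; the raw string
# passes the backslash through to the regex module unchanged.
tabbed = re.sub(r"b", r"\t", "abc")

# A backslash followed by a non-letter does not form a token,
# so both characters stay in the result as literal text.
kept = re.sub(r"b", r"\!", "abc")


def replacement_error(repl):
    """Return True if re.sub rejects the replacement string (hypothetical helper)."""
    try:
        re.sub(r"b", repl, "abc")
    except re.error:
        return True
    return False


# Unknown escapes made of a backslash and an ASCII letter are errors in 3.7+.
unknown_escape = replacement_error(r"\e")

# A trailing backslash in the replacement string is also an error.
trailing = replacement_error("\\")

print(repr(tabbed))    # 'a\tc'
print(repr(kept))      # 'a\\!c'
print(unknown_escape)  # True
print(trailing)        # True
```

Catching re.error separates the error entries from the N entries, where the token is simply copied into the output.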
Matched Text and Backreferences
Feature | .NET | Java | Perl | PCRE2 | PHP | Delphi | R | ECMA | VBScript | XRegExp | Python | Ruby | std::regex | Boost | TCL ARE | XPath
\& (literal ampersand) | N | Y | Y | extend | N | N | Y | N | N | N | N | N | sed | Y | Y | error
\& (whole regex match) | N | N | N | N | N | Y | N | N | N | N | N | Y | N | N | N | error
$& (whole regex match) | Y | error | Y | error | N | Y | N | Y | Y | Y | N | N | default | all | N | error
& (whole regex match) | N | N | N | N | N | N | N | N | N | N | N | N | sed | sed | Y | N
\0 (whole regex match) | N | N | N | N | Y | Y | N | N | N | N | N | Y | sed | sed | Y | error
$0 (whole regex match) | Y | Y | error | Y | Y | Y | N | Y | N | Y | N | N | default | all | N | Y
\g<0> (whole regex match) | N | N | N | N | N | N | N | N | N | N | Y | N | N | N | N | error
\1 through \9 (backreference) | N | N | Y | N | Y | Y | Y | N | N | N | Y | Y | sed | Y | Y | error
${name} (named backreference) | Y | 7 | error | Y | N | Y | N | N | N | Y | N | N | N | N | N | error
$+{name} (named backreference) | N | error | 5.10 | error | N | N | N | N | N | error | N | N | N | all | N | error
$name (named backreference) | N | error | error | Y | N | N | N | N | N | error | N | N | N | N | N | error
\g<name> (named backreference) | N | N | N | N | N | Y | N | N | N | N | Y | N | N | N | N | error
Backreference to non-participating group replaced with empty string | Y | Y | Y | error | Y | Y | Y | Y | Y | Y | 3.5 | Y | Y | Y | Y | Y
\+ (highest-numbered participating group) | N | N | N | N | N | Y | N | N | N | N | N | Y | N | N | N | error
\+ (highest-numbered group, regardless of match participation) | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | error
$+ (highest-numbered participating group) | N | error | 5.18 | error | N | Y | N | N | N | error | N | N | N | N | N | error
$^N (highest-numbered participating group) | N | error | Y | error | N | N | N | N | N | error | N | N | N | all | N | error
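The Python entries for backreferences can be tried with a short sketch using Python's re module. It assumes Python 3.5 or later, since that is the version the table gives for replacing a non-participating group with an empty string. Note that Python names groups with (?P<name>...) rather than the (?'name'...) syntax used in the examples earlier in this reference.

```python
import re

# \1 through \9: numbered backreferences.
numbered = re.sub(r"(a)(b)(c)", r"\3\3\1", "abc")

# \g<name>: named backreference.
named = re.sub(r"(?P<one>a)(?P<two>b)", r"\g<two>\g<one>", "ab")

# \g<0>: the whole regex match.
whole = re.sub(r"\d+", r"[\g<0>]", "1a2b")

# Group 2 does not participate in the match, so \2 inserts nothing.
empty = re.sub(r"(a)(z)?", r"[\2]", "ab")

print(numbered)  # cca
print(named)     # ba
print(whole)     # [1]a[2]b
print(empty)     # []b
```

Raw strings keep the backslashes intact so that the re module, not the Python string literal, interprets the tokens.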
Case Conversion
Feature | .NET | Java | Perl | PCRE2 | PHP | Delphi | R | ECMA | VBScript | XRegExp | Python | Ruby | std::regex | Boost | TCL ARE | XPath
\U, \L, \u, \l (case conversion) | N | N | Y | extend | N | N | N | N | N | N | 3.7 error | N | N | all default | N | error
Match Context
Feature | .NET | Java | Perl | PCRE2 | PHP | Delphi | R | ECMA | VBScript | XRegExp | Python | Ruby | std::regex | Boost | TCL ARE | XPath
\_ (whole subject text) | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | error
$_ (whole subject text) | Y | error | error | error | N | Y | N | Y | Y | error | N | N | N | N | N | error
Conditionals
Feature | .NET | Java | Perl | PCRE2 | PHP | Delphi | R | ECMA | VBScript | XRegExp | Python | Ruby | std::regex | Boost | TCL ARE | XPath
?1yes:no through ?99yes:no (numbered conditional) | N | N | N | N | N | N | N | N | N | N | N | N | N | all | N | N
(?1yes:no) through (?99yes:no) (numbered conditional) | N | N | N | N | N | N | N | N | N | N | N | N | N | all | N | N
(?10yes:no) through (?99yes:no) (numbered conditional and literal) | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | N | n/a | n/a
?{1}yes:no through ?{99}yes:no (numbered conditional) | N | N | N | N | N | N | N | N | N | N | N | N | N | all | N | N
(?{1}yes:no) through (?{99}yes:no) (numbered conditional) | N | N | N | N | N | N | N | N | N | N | N | N | N | all | N | N
${1:+yes:no} through ${99:+yes:no} (numbered conditional) | N | N | N | extend | N | N | N | N | N | N | N | N | N | N | N | N
${1:-no} through ${99:-no} (numbered conditional) | N | N | N | extend | N | N | N | N | N | N | N | N | N | N | N | N
$?{name}yes:no (named conditional) | N | N | N | N | N | N | N | N | N | N | N | N | N | all | N | N
(?{name}yes:no) (named conditional) | N | N | N | N | N | N | N | N | N | N | N | N | N | all | N | N
${name:+yes:no} (named conditional) | N | N | N | extend | N | N | N | N | N | N | N | N | N | N | N | N
${name:-no} (named conditional) | N | N | N | extend | N | N | N | N | N | N | N | N | N | N | N | N