Another regexp performance improvement: skip useless paren-captures

Lists:	pgsql-hackers

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc:	"Joel Jacobson" <joel(at)compiler(dot)org>
Subject:	Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-04 22:15:34
Message-ID:	[email protected]

Here's a little finger exercise that improves a case that's bothered me
for awhile. In a POSIX regexp, parentheses cause capturing by default;
you have to write the very non-obvious "(?:...)" if you don't want the
matching substring to be reported by the regexp engine. That'd be fine
if capturing were cheap, but with our engine it is not particularly
cheap. In many situations, the initial DFA check is sufficient to
tell whether there is an overall match, but it does not tell where any
subexpression match boundaries are. To identify exactly which substring
is deemed to match a parenthesized subexpression, we have to recursively
break down the match, which takes at the very least a few more DFA
invocations; and with an uncooperative regex, it can easily result in
O(N^2) behavior where there was none at the DFA stage.

Therefore, we really ought to expend some effort to not capture
subexpressions if the sub-match data is not actually needed, which in
many invocations we know that it isn't. Spencer's original code has
a REG_NOSUB option that looks like it ought to be good for this ... but
on closer inspection it's basically useless, because it turns *all*
parens into non-capturing ones. That breaks back-references, so unless
you know that the regexp contains no back-refs, you can't use it.
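For readers who have not met the flag, here is a minimal sketch of REG_NOSUB as exposed by the standard POSIX <regex.h> interface. It uses the system regex library, not PostgreSQL's internal engine, and the helper name matches_nosub is invented for illustration. It only shows the "yes/no answer, no sub-match locations" calling convention that the flag is meant for.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if text matches pattern, 0 if it does not,
 * and -1 if the pattern fails to compile.  This calls the system's POSIX
 * regex library, not PostgreSQL's regex engine.
 */
int
matches_nosub(const char *pattern, const char *text)
{
    regex_t     re;
    int         rc;

    /* REG_NOSUB: the caller promises not to ask for sub-match locations */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    /* nmatch = 0 and pmatch = NULL: only the overall answer is wanted */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    return (rc == 0) ? 1 : 0;
}
```

A call such as matches_nosub("a(b|c)d", "xxacdxx") asks only whether a match exists, so the library has no need to report where the parenthesized part matched.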

The attached proposed patch redefines REG_NOSUB as being a regexp-
compile-time promise that the caller doesn't care about sub-match
locations, but not a promise that no backrefs exist. (If the
caller passes a match-locations array at execution anyway, it will
just get back -1 values, as if no sub-match had been identified.)
If that flag is passed, we run through the completed sub-regexp
tree and remove the "capture" markers on any subREs that are
not actually referenced by some backref. This typically causes
some parent subREs to no longer be deemed "messy", so that their
separate child subREs can be thrown away entirely, saving memory
space as well as runtime.
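The pruning idea can be sketched on a toy tree. This is not the patch's code: ToyNode, toy_mark_referenced and toy_remove_captures are hypothetical names, and the real sub-RE tree carries far more state. The sketch assumes a first pass that notes which group numbers some backref uses, then a second pass that clears the capture marker on every other group while leaving the surviving group numbers untouched.

```c
#include <stddef.h>

#define MAX_GROUPS 10

/* Toy stand-in for a sub-RE tree node; not PostgreSQL's struct subre. */
typedef struct ToyNode
{
    int         capno;          /* > 0: capturing group number; 0: none */
    int         backno;         /* > 0: node is a backref to that group */
    struct ToyNode *child;      /* first child, or NULL */
    struct ToyNode *sibling;    /* next sibling, or NULL */
} ToyNode;

/* Pass 1: record which group numbers some backref actually refers to. */
void
toy_mark_referenced(const ToyNode *t, int referenced[MAX_GROUPS])
{
    for (; t != NULL; t = t->sibling)
    {
        if (t->backno > 0 && t->backno < MAX_GROUPS)
            referenced[t->backno] = 1;
        toy_mark_referenced(t->child, referenced);
    }
}

/*
 * Pass 2: clear the capture marker on every group that no backref uses.
 * Surviving groups keep their original numbers.  Returns the number of
 * markers removed.
 */
int
toy_remove_captures(ToyNode *t, const int referenced[MAX_GROUPS])
{
    int         removed = 0;

    for (; t != NULL; t = t->sibling)
    {
        if (t->capno > 0 && t->capno < MAX_GROUPS && !referenced[t->capno])
        {
            t->capno = 0;
            removed++;
        }
        removed += toy_remove_captures(t->child, referenced);
    }
    return removed;
}
```

For a tree shaped like (a)(b)\2, the first group loses its marker while the second is still group 2, so the backref keeps its meaning.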

(I'd originally thought that a much more complex patch would be
needed to do this, because I assumed that re-optimizing the subRE
tree would be much more complicated than this. However, as far
as I can see this is sufficient; this change doesn't expose any
cases where additional tree restructuring would be helpful.)

Testing with Joel's handy little corpus of web regexps, there's a
useful improvement of the speed of ~ operators (a/k/a regexp_like()).
I see the total time to apply regexp_like() to all 4474520 entries
dropping from 10:17 to 5:46. Interesting statistics include

regexp=# select max(duration),avg(duration) from headresults;
       max       |       avg
-----------------+-----------------
 00:00:00.939389 | 00:00:00.000138
(1 row)

regexp=# select max(duration),avg(duration) from patchresults;
       max       |       avg
-----------------+-----------------
 00:00:00.918549 | 00:00:00.000077
(1 row)

The lower percentiles don't move much, but upper ones do:

regexp=# select percentile_cont(array[0.5,0.75,0.8,0.9]) within group(order by duration) from headresults;
                          percentile_cont
-------------------------------------------------------------------
 {00:00:00.000027,00:00:00.000059,00:00:00.000067,00:00:00.000108}
(1 row)

regexp=# select percentile_cont(array[0.5,0.75,0.8,0.9]) within group(order by duration) from patchresults;
                          percentile_cont
-------------------------------------------------------------------
 {00:00:00.000025,00:00:00.000042,00:00:00.000048,00:00:00.000065}
(1 row)

This isn't terribly surprising, because regexps that were already
really cheap probably have no capturing parens to dispense with.

Of course, there's no benefit with functions that do need sub-match
data, such as regexp_match. But the added overhead in such cases
should be quite negligible. The only downside I can see is that
if you use the "same" regexp in both submatches-needed and
non-submatches-needed contexts, you'll end up with two separate
compiled regexp cache entries. That doesn't seem like a big
problem though.

regards, tom lane

Attachment	Content-Type	Size
optimize-useless-captures-1.patch	text/x-diff	13.3 KB

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc:	Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-05 13:42:47
Message-ID:	[email protected]

On 8/4/21 6:15 PM, Tom Lane wrote:
> Here's a little finger exercise that improves a case that's bothered me
> for awhile. In a POSIX regexp, parentheses cause capturing by default;
> you have to write the very non-obvious "(?:...)" if you don't want the
> matching substring to be reported by the regexp engine.

It's not obscure to perl programmers :-)

> That'd be fine
> if capturing were cheap, but with our engine it is not particularly
> cheap. In many situations, the initial DFA check is sufficient to
> tell whether there is an overall match, but it does not tell where any
> subexpression match boundaries are. To identify exactly which substring
> is deemed to match a parenthesized subexpression, we have to recursively
> break down the match, which takes at the very least a few more DFA
> invocations; and with an uncooperative regex, it can easily result in
> O(N^2) behavior where there was none at the DFA stage.
>
> Therefore, we really ought to expend some effort to not capture
> subexpressions if the sub-match data is not actually needed, which in
> many invocations we know that it isn't. Spencer's original code has
> a REG_NOSUB option that looks like it ought to be good for this ... but
> on closer inspection it's basically useless, because it turns *all*
> parens into non-capturing ones. That breaks back-references, so unless
> you know that the regexp contains no back-refs, you can't use it.

In perl you can use the 'n' modifier for this effect (since 5.22)

I would expect to know if a back-ref were present.

I'm a bit worried about how you'll keep track of back-ref numbering
since back-refs only count capturing groups, and you're silently turning
a capturing group into a non-capturing group.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-05 14:36:21
Message-ID:	[email protected]

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> I'm a bit worried about how you'll keep track of back-ref numbering
> since back-refs only count capturing groups, and you're silently turning
> a capturing group into a non-capturing group.

They're already numbered at this point, and we aren't changing the numbers
of the capturing groups that remain live. There will be unused entries in
the regmatch_t array at runtime (corresponding to the zapped groups), but
that doesn't cost anything worth mentioning.

Now that you mention it, I am not sure whether there are any regression
test cases that specifically cover still being able to match \2 when
the first capture group went away. Probably should add more cases...

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-05 14:39:21
Message-ID:	CA+TgmoYH9QJu8G14tW7oY+aomEu0nbet=iLwiAW8fFN5NAQT=w@mail.gmail.com

On Thu, Aug 5, 2021 at 9:43 AM Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
> On 8/4/21 6:15 PM, Tom Lane wrote:
> > Here's a little finger exercise that improves a case that's bothered me
> > for awhile. In a POSIX regexp, parentheses cause capturing by default;
> > you have to write the very non-obvious "(?:...)" if you don't want the
> > matching substring to be reported by the regexp engine.
> It's not obscure to perl programmers :-)

Well, I consider myself a pretty fair perl programmer, and I know
there's a way to do that, but I never do it, and I would have had to
look up the exact syntax. So +1 from me for anything automatic that
avoids paying the overhead in some cases.

--
Robert Haas
EDB: http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-05 15:01:29
Message-ID:	[email protected]

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Well, I consider myself a pretty fair perl programmer, and I know
> there's a way to do that, but I never do it, and I would have had to
> look up the exact syntax. So +1 from me for anything automatic that
> avoids paying the overhead in some cases.

That's my feeling about it too --- I never really think of this
point when writing a regexp. It seems like something the engine
ought to handle gracefully, so this patch is an attempt to do so.

regards, tom lane

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-05 15:41:38
Message-ID:	[email protected]

On 8/5/21 10:39 AM, Robert Haas wrote:
> On Thu, Aug 5, 2021 at 9:43 AM Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>> On 8/4/21 6:15 PM, Tom Lane wrote:
>>> Here's a little finger exercise that improves a case that's bothered me
>>> for awhile. In a POSIX regexp, parentheses cause capturing by default;
>>> you have to write the very non-obvious "(?:...)" if you don't want the
>>> matching substring to be reported by the regexp engine.
>> It's not obscure to perl programmers :-)
> Well, I consider myself a pretty fair perl programmer,

I also consider you one :-)

Perhaps I should have said "many perl programmers".

> and I know
> there's a way to do that, but I never do it, and I would have had to
> look up the exact syntax. So +1 from me for anything automatic that
> avoids paying the overhead in some cases.

Yeah, I'm not arguing against the idea. I also have to look it up,
mainly because there is such a huge amount of stuff that can follow
"(?", so "perldoc perlre" happens a lot when I'm doing that sort of work.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

From:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-05 21:37:21
Message-ID:	[email protected]

> On Aug 5, 2021, at 7:36 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Probably should add more cases...

The patch triggers an assertion that master does not:

+select 'azrlfkjbjgidgryryiglcabkgqluflu' !~ '(.(.)((.)))((?:(\3)))';
+server closed the connection unexpectedly
+ This probably means the server terminated abnormally
+ before or while processing the request.
+connection to server was lost

The relevant portion of the stack trace:

frame #3: 0x00000001043bcf6d postgres`ExceptionalCondition(conditionName=<unavailable>, errorType=<unavailable>, fileName=<unavailable>, lineNumber=<unavailable>) at assert.c:69:2 [opt]
frame #4: 0x000000010410168b postgres`cdissect(v=0x00007ffeebdd2ad8, t=0x00007f863cd055b0, begin=0x00007f863d821528, end=0x00007f863d82152c) at regexec.c:767:4 [opt]
frame #5: 0x000000010410129b postgres`cdissect [inlined] ccondissect(v=<unavailable>, t=<unavailable>, begin=0x00007f863d821524, end=<unavailable>) at regexec.c:835:10 [opt]
frame #6: 0x000000010410123d postgres`cdissect(v=0x00007ffeebdd2ad8, t=0x00007f863cd05430, begin=0x00007f863d821524, end=0x00007f863d82152c) at regexec.c:752 [opt]
frame #7: 0x000000010410129b postgres`cdissect [inlined] ccondissect(v=<unavailable>, t=<unavailable>, begin=0x00007f863d821520, end=<unavailable>) at regexec.c:835:10 [opt]
frame #8: 0x000000010410123d postgres`cdissect(v=0x00007ffeebdd2ad8, t=0x00007f863cd050f0, begin=0x00007f863d821520, end=0x00007f863d82152c) at regexec.c:752 [opt]
frame #9: 0x0000000104101282 postgres`cdissect [inlined] ccondissect(v=<unavailable>, t=<unavailable>, begin=0x00007f863d821520, end=<unavailable>) at regexec.c:832:9 [opt]
frame #10: 0x000000010410123d postgres`cdissect(v=0x00007ffeebdd2ad8, t=0x00007f863cd04ff0, begin=0x00007f863d821520, end=0x00007f863d821530) at regexec.c:752 [opt]
frame #11: 0x00000001040ff508 postgres`pg_regexec [inlined] cfindloop(v=<unavailable>, cnfa=<unavailable>, cm=<unavailable>, d=0x00007ffeebdd6d68, s=0x00007ffeebdd2b48, coldp=<unavailable>) at regexec.c:600:10 [opt]
frame #12: 0x00000001040ff36b postgres`pg_regexec [inlined] cfind(v=0x000000010459c5f8, cnfa=<unavailable>, cm=<unavailable>) at regexec.c:515 [opt]
frame #13: 0x00000001040ff315 postgres`pg_regexec(re=<unavailable>, string=<unavailable>, len=140732855577960, search_start=<unavailable>, details=<unavailable>, nmatch=0, pmatch=0x0000000000000000, flags=0) at regexec.c:293 [opt]
frame #14: 0x0000000104244d61 postgres`RE_wchar_execute(re=<unavailable>, data=<unavailable>, data_len=<unavailable>, start_search=<unavailable>, nmatch=<unavailable>, pmatch=<unavailable>) at regexp.c:274:19 [opt]
frame #15: 0x0000000104242c80 postgres`textregexne [inlined] RE_execute(dat=<unavailable>, dat_len=31, nmatch=0, pmatch=0x0000000000000000) at regexp.c:322:10 [opt]
frame #16: 0x0000000104242c50 postgres`textregexne [inlined] RE_compile_and_execute(text_re=<unavailable>, dat=<unavailable>, dat_len=31, cflags=19, collation=<unavailable>, nmatch=0, pmatch=<unavailable>) at regexp.c:357 [opt]

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-08 17:04:38
Message-ID:	[email protected]

Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
> The patch triggers an assertion that master does not:

> +select 'azrlfkjbjgidgryryiglcabkgqluflu' !~ '(.(.)((.)))((?:(\3)))';

On looking into this, it's pretty simple: regexec.c has an assertion
that a pure-capture subre node ought to be doing some capturing.

        case '(':               /* no-op capture node */
            assert(t->child != NULL);
            assert(t->capno > 0);

That's fine as of HEAD, but with the proposed patch, we may notice
that the node isn't actually referenced by any backref, and remove
its capture marker, allowing this assertion to fire. Nothing's
really wrong though.

There seem to be three things we could do about that:

1. Extend removecaptures() so that it can actually remove no-op
capture nodes if it's removed their capture markings. This would
substantially complicate that function, and I judge that it's not
worth the trouble. We'll only have such nodes in cases of
capturing parentheses immediately surrounding capturing parentheses,
which doesn't seem like a case worth expending sweat for.

2. Just drop the "t->capno > 0" assertion in regexec.c.

3. Weaken said assertion, perhaps by also checking the BRUSE flag bit.

Not sure that I see any point to #3, so I just dropped the
assertion in the attached.

I've also rebased over the bug fixes from the other thread,
and added a couple more test cases.

regards, tom lane

Attachment	Content-Type	Size
optimize-useless-captures-2.patch	text/x-diff	14.4 KB

From:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-08 18:22:24
Message-ID:	[email protected]

> On Aug 8, 2021, at 10:04 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> I've also rebased over the bug fixes from the other thread,
> and added a couple more test cases.
>
> regards, tom lane

Hmm. This changes the behavior when applied against master (c1132aae336c41cf9d316222e525d8d593c2b5d2):

select regexp_split_to_array('uuuzkodphfbfbfb', '((.))(\1\2)', 'ntw');
regexp_split_to_array
-----------------------
- {"",zkodphfbfbfb}
+ {uuuzkodphfbfbfb}
(1 row)

The string starts with three "u" characters. The first of them is doubly-matched, meaning \1 and \2 refer to the first "u" character. The (\1\2) that follows matches the next two "u" characters. When the extra "useless" capture group is skipped, apparently this doesn't work anymore. I haven't looked at your patch, so I'm not sure why, but I'm guessing that \2 doesn't refer to anything.

That analysis is consistent with the next change:

select regexp_split_to_array('snfwbvxeesnzqabixqbixqiumpgxdemmxvnsemjxgqoqknrqessmcqmfslfspskqpqxe', '((((?:.))))\3');
- regexp_split_to_array
----------------------------------------------------------------------
- {snfwbvx,snzqabixqbixqiumpgxde,xvnsemjxgqoqknrqe,mcqmfslfspskqpqxe}
+ regexp_split_to_array
+------------------------------------------------------------------------
+ {snfwbvxeesnzqabixqbixqiumpgxdemmxvnsemjxgqoqknrqessmcqmfslfspskqpqxe}
(1 row)

The pattern matches any double character. I would expect it to match the "ee", the "mm" and the "ss" in the text. With the patched code, it matches nothing.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-08 20:25:00
Message-ID:	[email protected]

Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
> Hmm. This changes the behavior when applied against master (c1132aae336c41cf9d316222e525d8d593c2b5d2):

> select regexp_split_to_array('uuuzkodphfbfbfb', '((.))(\1\2)', 'ntw');
> regexp_split_to_array
> -----------------------
> - {"",zkodphfbfbfb}
> + {uuuzkodphfbfbfb}
> (1 row)

Ugh. The regex engine is finding the match correctly, but it's failing to
tell the caller where it is :-(. I was a little too cute in optimizing
the regmatch_t result-vector copying in pg_regexec, and forgot to ensure
that the overall match position would be reported.

Thanks for the testing!

regards, tom lane

Attachment	Content-Type	Size
optimize-useless-captures-3.patch	text/x-diff	14.6 KB

From:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-08 20:39:46
Message-ID:	[email protected]

> On Aug 8, 2021, at 1:25 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Ugh. The regex engine is finding the match correctly, but it's failing to
> tell the caller where it is :-(. I was a little too cute in optimizing
> the regmatch_t result-vector copying in pg_regexec, and forgot to ensure
> that the overall match position would be reported.
>
> Thanks for the testing!

Sure! Thanks for improving the regular expression engine!

I have applied your latest patch and do not see any problems with it. All my tests pass with no asserts and with no differences in results vs. master. This is a test suite of nearly 1.5 million separate regular expressions.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-08 22:28:13
Message-ID:	[email protected]

Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
> I have applied your latest patch and do not see any problems with it. All my tests pass with no asserts and with no differences in results vs. master. This is a test suite of nearly 1.5 million separate regular expressions.

Cool, thanks. I also tried your millions-of-random-regexps script
and didn't find any difference between the results from HEAD and
those from the v3 patch.

regards, tom lane

From:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-08 22:33:06
Message-ID:	[email protected]

> On Aug 8, 2021, at 3:28 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Cool, thanks. I also tried your millions-of-random-regexps script
> and didn't find any difference between the results from HEAD and
> those from the v3 patch.

The patch looks ready to commit. I don't expect to test it any further unless you have something in particular you'd like me to focus on.

Thanks again for working on this.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-09 19:14:45
Message-ID:	[email protected]

Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
> The patch looks ready to commit. I don't expect to test it any further unless you have something in particular you'd like me to focus on.

Pushed, but while re-reading it before commit I noticed that there's
some more fairly low-hanging fruit in regexp_replace(). As I had it
in that patch, it never used REG_NOSUB because of the possibility
that the replacement string uses "\N". However, we're already
pre-checking the replacement string to see if it has backslashes
at all, so while we're at it we can check for \N to discover if we
actually need any subexpression match data or not. We do need to
refactor a little to postpone calling pg_regcomp until after we
know that, but I think that makes replace_text_regexp's API less
ugly not more so.

While I was at it, I changed the search-for-backslash loops to
use memchr rather than handwritten looping. Their use of
pg_mblen was pretty unnecessary given we only need to find
backslashes, and we can assume the backend encoding is ASCII-safe.
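A rough sketch of the kind of pre-check described above, under stated assumptions: the function name replacement_uses_submatch is hypothetical, this is not the code in the attached patch, and it assumes (as noted above) an ASCII-safe encoding so that a backslash byte never occurs inside a multibyte character. It uses memchr to hop from backslash to backslash and reports whether any of them is followed by a digit 1-9.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: does the replacement string contain \1 .. \9 ?
 * If it does not, the caller would have no use for sub-match locations.
 */
bool
replacement_uses_submatch(const char *repl, size_t len)
{
    const char *p = repl;
    const char *end = repl + len;

    while (p < end)
    {
        /* Jump straight to the next backslash, if there is one */
        p = memchr(p, '\\', (size_t) (end - p));
        if (p == NULL || p + 1 >= end)
            return false;       /* no backslash left, or a trailing one */

        if (p[1] >= '1' && p[1] <= '9')
            return true;        /* found \N */

        /* Skip the backslash and the character it escapes (\\ or \&) */
        p += 2;
    }
    return false;
}
```

Skipping two bytes after a non-digit escape means that an escaped backslash followed by a digit is not mistaken for \N.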

Using a bunch of random cases generated by your little perl
script, I see maybe 10-15% speedup on test cases that don't
use \N in the replacement string, while it's about a wash
on cases that do. (If I'd been using a multibyte encoding,
maybe the memchr change would have made a difference, but
I didn't try that.)

regards, tom lane

Attachment	Content-Type	Size
let-regexp_replace-use-NOSUB.patch	text/x-diff	8.0 KB

From:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-09 21:01:13
Message-ID:	[email protected]

Tom,

I can still trigger the old bug for which we thought we'd pushed a fix. The test case below crashes on master (e12694523e7e4482a052236f12d3d8b58be9a22c), and also on the fixed version "Make regexp engine's backref-related compilation state more bulletproof." (cb76fbd7ec87e44b3c53165d68dc2747f7e26a9a).

Can you test if it crashes for you, too? I'm not sure I see why this one fails when millions of others pass.

The backtrace is still complaining about regc_nfa.c:1265:

+select regexp_split_to_array('', '(?:((?:q+))){0}(\1){0,0}?*[^]');
+server closed the connection unexpectedly
+ This probably means the server terminated abnormally
+ before or while processing the request.
+connection to server was lost

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-09 21:26:16
Message-ID:	[email protected]

Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
> I can still trigger the old bug for which we thought we'd pushed a fix. The test case below crashes on master (e12694523e7e4482a052236f12d3d8b58be9a22c), and also on the fixed version "Make regexp engine's backref-related compilation state more bulletproof." (cb76fbd7ec87e44b3c53165d68dc2747f7e26a9a).

> Can you test if it crashes for you, too? I'm not sure I see why this one fails when millions of others pass.

> The backtrace is still complaining about regc_nfa.c:1265:

> +select regexp_split_to_array('', '(?:((?:q+))){0}(\1){0,0}?*[^]');
> +server closed the connection unexpectedly

Hmmm ... yeah, I see it too. This points up something I'd wondered
about before, which is whether the code that "cancels everything"
after detecting {0} is really OK. It throws away the outer subre
*and children* without worrying about what might be inside, and
here we see that that's not good enough --- there's still a v->subs
pointer to the first capturing paren set, which we just deleted,
so that the \1 later on messes up. I'm not sure why the back
branches are managing not to crash, but that might just be a memory
management artifact.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-09 22:23:36
Message-ID:	[email protected]

I wrote:
> Hmmm ... yeah, I see it too. This points up something I'd wondered
> about before, which is whether the code that "cancels everything"
> after detecting {0} is really OK. It throws away the outer subre
> *and children* without worrying about what might be inside, and
> here we see that that's not good enough --- there's still a v->subs
> pointer to the first capturing paren set, which we just deleted,
> so that the \1 later on messes up. I'm not sure why the back
> branches are managing not to crash, but that might just be a memory
> management artifact.

... yeah, it is. For me, this variant hits the assertion in all
branches:

regression=# select regexp_split_to_array('', '((.)){0}(\2){0}');
server closed the connection unexpectedly

So that's a pre-existing (and very long-standing) bug. I'm not
sure if it has any serious impact in non-assert builds though.
Failure to clean out some disconnected arcs probably has no
real effect on the regex's behavior later.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-09 23:31:41
Message-ID:	[email protected]

Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
> +select regexp_split_to_array('', '(?:((?:q+))){0}(\1){0,0}?*[^]');
> +server closed the connection unexpectedly

Here's a quick draft patch for this. Basically it moves the
responsibility for clearing v->subs[] pointers into the freesubre()
recursion, so that it will happen for contained capturing parens
not only the top level.
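The shape of that fix can be illustrated with a toy model. ToySubre, ToyVars, toy_newsubre and toy_freesubre are hypothetical stand-ins, not the structures or functions in the regex code; only the idea is taken from the description above. Each node clears its own entry in the subs[] registry as it is freed, so nested capture groups leave no dangling pointer behind.

```c
#include <stdlib.h>

#define TOY_MAX_SUBS 10

/* Toy stand-ins; not PostgreSQL's struct subre or struct vars. */
typedef struct ToySubre
{
    int         capno;          /* > 0: capturing group number */
    struct ToySubre *child;     /* first child, or NULL */
    struct ToySubre *sibling;   /* next sibling, or NULL */
} ToySubre;

typedef struct ToyVars
{
    ToySubre   *subs[TOY_MAX_SUBS]; /* subs[n] points at capture group n */
} ToyVars;

/* Allocate a node and register it in subs[] if it is a capture group. */
ToySubre *
toy_newsubre(ToyVars *v, int capno, ToySubre *child)
{
    ToySubre   *t = malloc(sizeof(ToySubre));

    if (t == NULL)
        return NULL;
    t->capno = capno;
    t->child = child;
    t->sibling = NULL;
    if (capno > 0 && capno < TOY_MAX_SUBS)
        v->subs[capno] = t;
    return t;
}

/* Free a node and its children, unregistering each one on the way. */
void
toy_freesubre(ToyVars *v, ToySubre *t)
{
    ToySubre   *c;
    ToySubre   *next;

    if (t == NULL)
        return;
    for (c = t->child; c != NULL; c = next)
    {
        next = c->sibling;
        toy_freesubre(v, c);
    }
    /* The cleanup happens inside the recursion, not only at the top */
    if (t->capno > 0 && t->capno < TOY_MAX_SUBS && v->subs[t->capno] == t)
        v->subs[t->capno] = NULL;
    free(t);
}
```

With a tree modelling ((.)){0}, freeing the outer group clears both subs[1] and subs[2], so a later \2 finds an empty slot rather than a pointer to freed memory.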

There is a potentially interesting definitional question:
what exactly ought this regexp do?

((.)){0}\2

Because the capturing paren sets are zero-quantified, they will
never be matched to any characters, so the backref can never
have any defined referent. I suspect that study of the POSIX
spec would lead to the conclusion that this is a legal regexp
but it will never match anything. Implementing that would be
tedious though, and what's more it seems very unlikely that
the user wanted any such behavior. So I think throwing an
error is an appropriate response. The existing code will
throw such an error for

((.)){0}\1

so I guess Spencer did think about this to some extent -- he
just forgot about the possibility of nested parens.

This patch should work OK in HEAD and v14, but it will need
a bit of fooling-about for older branches I think, given that
they fill v->subs[] a little differently.

regards, tom lane

Attachment	Content-Type	Size
fix-zero-quantified-nested-parens.patch	text/x-diff	3.2 KB

From:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-09 23:37:18
Message-ID:	[email protected]
Lists:	pgsql-hackers

> On Aug 9, 2021, at 12:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Pushed, but while re-reading it before commit I noticed that there's
> some more fairly low-hanging fruit in regexp_replace(). As I had it
> in that patch, it never used REG_NOSUB because of the possibility
> that the replacement string uses "\N". However, we're already
> pre-checking the replacement string to see if it has backslashes
> at all, so while we're at it we can check for \N to discover if we
> actually need any subexpression match data or not. We do need to
> refactor a little to postpone calling pg_regcomp until after we
> know that, but I think that makes replace_text_regexp's API less
> ugly not more so.
>
> While I was at it, I changed the search-for-backslash loops to
> use memchr rather than handwritten looping. Their use of
> pg_mblen was pretty unnecessary given we only need to find
> backslashes, and we can assume the backend encoding is ASCII-safe.
>
> Using a bunch of random cases generated by your little perl
> script, I see maybe 10-15% speedup on test cases that don't
> use \N in the replacement string, while it's about a wash
> on cases that do. (If I'd been using a multibyte encoding,
> maybe the memchr change would have made a difference, but
> I didn't try that.)
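
A rough sketch of the pre-check described in the quoted message, written in Python rather than C. needs_submatch_data() is a made-up name, and the handling of escapes other than \1 through \9 is an assumption of this sketch, not something taken from the patch. bytes.find() plays the part that memchr plays in the C code. Searching bytewise relies on the same assumption stated above: in an ASCII-safe encoding a backslash byte never appears inside a multibyte character.

```python
def needs_submatch_data(replacement: bytes) -> bool:
    """Return True if the replacement text contains \\1 .. \\9."""
    pos = 0
    while True:
        # Jump straight to the next backslash instead of stepping char by char.
        pos = replacement.find(b"\\", pos)
        if pos < 0 or pos + 1 >= len(replacement):
            return False            # no backslash left, or only a trailing one
        nxt = replacement[pos + 1:pos + 2]
        if b"1" <= nxt <= b"9":
            return True             # \N: sub-match locations are required
        pos += 2                    # assumed: skip any other escaped byte


for text in (b"plain", b"<\\1>", b"a\\\\1", b"[\\&]"):
    print(text, needs_submatch_data(text))
```

Only when such a check comes back false would the caller be free to compile the pattern with REG_NOSUB, which is why the compile step has to wait until the replacement string has been examined.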

I've been reviewing and testing this (let-regexp_replace-use-NOSUB.patch) since you sent it 4 hours ago, and I can't seem to break it. There are pre-existing problems in the regex code, but this doesn't seem to add any new breakage.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-10 00:14:29
Message-ID:	[email protected]
Lists:	pgsql-hackers

> On Aug 9, 2021, at 4:31 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> There is a potentially interesting definitional question:
> what exactly ought this regexp do?
>
> ((.)){0}\2
>
> Because the capturing paren sets are zero-quantified, they will
> never be matched to any characters, so the backref can never
> have any defined referent.

Perl regular expressions are not POSIX, but if there is a principled reason POSIX should differ from perl on this, we should be clear what that is:

#!/usr/bin/perl

use strict;
use warnings;

our $match;
if ('foo' =~ m/((.)(??{ die; })){0}(..)/)
{
    print "captured 1 $1\n" if defined $1;
    print "captured 2 $2\n" if defined $2;
    print "captured 3 $3\n" if defined $3;
    print "captured 4 $4\n" if defined $4;
    print "match = $match\n" if defined $match;
}

This will print "captured 3 fo", proving that although the regular expression is parsed with the (..) bound to the third capture group, the first two capture groups never run. If you don't believe that, change the {0} to {1} and observe that the script dies.

> So I think throwing an
> error is an appropriate response. The existing code will
> throw such an error for
>
> ((.)){0}\1
>
> so I guess Spencer did think about this to some extent -- he
> just forgot about the possibility of nested parens.

Ugg. That means our code throws an error where perl does not, pretty well negating my point above. If we're already throwing an error for this type of thing, I agree we should be consistent about it. My personal preference would have been to do the same thing as perl, but it seems that ship has already sailed.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-10 00:18:27
Message-ID:	[email protected]
Lists:	pgsql-hackers

> On Aug 9, 2021, at 5:14 PM, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
>
> our $match;
> if ('foo' =~ m/((.)(??{ die; })){0}(..)/)

I left in a stray variable. A prior version of this script was assigning to $match where it now has die. Sorry for any confusion.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-10 00:57:31
Message-ID:	[email protected]
Lists:	pgsql-hackers

Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
>> On Aug 9, 2021, at 12:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Pushed, but while re-reading it before commit I noticed that there's
>> some more fairly low-hanging fruit in regexp_replace().

> I've been reviewing and testing this (let-regexp_replace-use-NOSUB.patch) since you sent it 4 hours ago, and I can't seem to break it. There are pre-existing problems in the regex code, but this doesn't seem to add any new breakage.

Pushed that bit, thanks for testing!

I plan to not do anything about the (()){0} bug until after the release
window, since that will need to be back-patched. That bug's gotta be
twenty years old, so while I kinda wish we'd found it a few days
earlier, waiting another 3 months won't make much difference.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-10 01:11:14
Message-ID:	[email protected]
Lists:	pgsql-hackers

Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
>> On Aug 9, 2021, at 4:31 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> There is a potentially interesting definitional question:
>> what exactly ought this regexp do?
>> ((.)){0}\2
>> Because the capturing paren sets are zero-quantified, they will
>> never be matched to any characters, so the backref can never
>> have any defined referent.

> Perl regular expressions are not POSIX, but if there is a principled reason POSIX should differ from perl on this, we should be clear what that is:

> if ('foo' =~ m/((.)(??{ die; })){0}(..)/)
> {
> print "captured 1 $1\n" if defined $1;
> print "captured 2 $2\n" if defined $2;
> print "captured 3 $3\n" if defined $3;
> print "captured 4 $4\n" if defined $4;
> print "match = $match\n" if defined $match;
> }

Hm. I'm not sure that this example proves anything about Perl's handling
of the situation, since you didn't use a backref. I tried both

if ('foo' =~ m/((.)){0}\1/)

if ('foo' =~ m/((.)){0}\2/)

and while neither throws an error, they don't succeed either.
So AFAICS Perl is acting in the way I'm attributing to POSIX.
But maybe we should actually read POSIX ...

>> ... I guess Spencer did think about this to some extent -- he
>> just forgot about the possibility of nested parens.

> Ugg. That means our code throws an error where perl does not, pretty
> well negating my point above. If we're already throwing an error for
> this type of thing, I agree we should be consistent about it. My
> personal preference would have been to do the same thing as perl, but it
> seems that ship has already sailed.

Removing an error case is usually an easier sell than adding one.
However, the fact that the simplest case (viz, '(.){0}\1') has always
thrown an error and nobody has complained in twenty-ish years suggests
that nobody much cares.

regards, tom lane

From:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-10 01:17:40
Message-ID:	[email protected]
Lists:	pgsql-hackers

> On Aug 9, 2021, at 6:11 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Hm. I'm not sure that this example proves anything about Perl's handling
> of the situation, since you didn't use a backref.

Well, this doesn't die either:

if ('foo' =~ m/((??{ die; })(.)(??{ die $1; })){0}((??{ die "got here"; })|\2)/)
{
    print "matched\n";
}

The point is that the regex engine never walks the part of the pattern that {0} qualifies. I thought it was more clear in the prior example, because that example proves that the engine does get as far as capturing. This example also does that, and with a backref, because it dies with "got here".

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-10 01:24:35
Message-ID:	[email protected]
Lists:	pgsql-hackers

> On Aug 9, 2021, at 6:17 PM, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
>
> Well, this doesn't die either:

Meaning it doesn't die in the part of the pattern qualified by {0} either. It does die in the other part. Sorry again for the confusion.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-10 01:38:33
Message-ID:	[email protected]
Lists:	pgsql-hackers

> On Aug 9, 2021, at 4:31 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> This patch should work OK in HEAD and v14, but it will need
> a bit of fooling-about for older branches I think, given that
> they fill v->subs[] a little differently.

Note that I tested your patch *before* master, so the diffs look backwards: the "-" lines are the patched behavior and the "+" lines are master.

I tested this fix and it seems good to me. Some patterns that used to work (by chance?) now raise an error, such as:

select 'bpgouiwcquu' ~ '(((([e])*?)){0,0}?(\3))';
-ERROR: invalid regular expression: invalid backreference number
+ ?column?
+----------
+ f
+(1 row)

I ran a lot of tests with the patch, and the asserts have all cleared up, but I don't know how to think about the user facing differences. If we had a good reason for raising an error for these back-references, maybe that'd be fine, but it seems to just be an implementation detail.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-10 02:20:33
Message-ID:	[email protected]
Lists:	pgsql-hackers

Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
> I ran a lot of tests with the patch, and the asserts have all cleared up, but I don't know how to think about the user facing differences. If we had a good reason for raising an error for these back-references, maybe that'd be fine, but it seems to just be an implementation detail.

I thought about this some more, and I'm coming around to the idea that
throwing an error is the wrong thing. As a contrary example, consider

(.)|(\1\1)

We don't throw an error for this, and neither does Perl, even though
the capturing parens can never be defined in the branch where the
backrefs are. So it seems hard to argue that this is okay but the
other thing isn't. Another interesting example is

(.){0}(\1){0}

I think that the correct interpretation is that this is a valid
regexp matching an empty string (i.e., zero repetitions of each
part), even though neither capture group will be defined.
That's different from

(.){0}(\1)

which can never match.
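
For comparison only, the same three patterns can be tried in Python's re module. That is a different engine from PostgreSQL's and proves nothing about Spencer's code; it merely happens to follow the interpretation described above, treating a backref to an undefined group as a failed match instead of a compile-time error.

```python
import re

# A backref in a branch where its group can never be set: this compiles, and
# the first branch still matches.
print(re.search(r"(.)|(\1\1)", "foo"))

# Both parts zero-quantified: matches the empty string, no group is defined.
m = re.fullmatch(r"(.){0}(\1){0}", "")
print(m is not None, m.groups() if m else None)

# Here the backref itself has to match, but group 1 is never set, so the
# pattern cannot match at any position.
print(re.search(r"(.){0}(\1)", "foo"))
```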

So I took another look at the code, and it doesn't seem that hard
to make it act this way. The attached passes regression, but
I've not beat on it with random strings.

regards, tom lane

Attachment	Content-Type	Size
alternate-fix-zero-quantified-nested-parens.patch	text/x-diff	2.8 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-10 15:15:22
Message-ID:	[email protected]
Lists:	pgsql-hackers

I wrote:
> So AFAICS Perl is acting in the way I'm attributing to POSIX.
> But maybe we should actually read POSIX ...

I went to look at the POSIX spec, and was reminded that it lacks
backrefs altogether. (POSIX specifies the "BRE" and "ERE" regex
flavors as described in our docs, but not "ARE".) So there's no
help to be had there. The fact that Perl doesn't throw an error
is probably the most useful precedent available.

regards, tom lane

From:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-10 15:16:35
Message-ID:	[email protected]
Lists:	pgsql-hackers

> On Aug 9, 2021, at 7:20 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> So I took another look at the code, and it doesn't seem that hard
> to make it act this way. The attached passes regression, but
> I've not beat on it with random strings.

I expect to get back around to testing this in a day or so.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-24 17:02:13
Message-ID:	[email protected]
Lists:	pgsql-hackers

I've beaten on this with random patterns and it seems to hold up just fine. I have also reviewed the diffs and, for the patterns where the output changes, everything looks correct. I can't find anything wrong with this patch.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Joel Jacobson <joel(at)compiler(dot)org>
Subject:	Re: Another regexp performance improvement: skip useless paren-captures
Date:	2021-08-24 18:29:01
Message-ID:	[email protected]
Lists:	pgsql-hackers

Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> writes:
> I've beaten on this with random patterns and it seems to hold up just fine. I have also reviewed the diffs and, for the patterns where the output changes, everything looks correct. I can't find anything wrong with this patch.

Thanks for testing! I'll push it in a bit.

regards, tom lane