Add support for (Var op Var) clause in extended MCV statistics

Lists: pgsql-hackers
From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Add support for (Var op Var) clause in extended MCV statistics
Date: 2024-08-12 10:42:07
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi hackers,

I'd like to submit a patch that improves the estimated rows for queries
containing (Var op Var) clauses by applying extended MCV statistics.

*New functions:*

* mcv_clauselist_selectivity_var_op_var() - calculates the selectivity
for (Var op Var) clauses.
* is_opclause_var_op_var() - Checks whether a clause is of the (Var op
Var) form.

*Implementation Details:*

* A new 'if' statement was added to the 'clause_selectivity_ext()'
function to handle (Var op Var) clauses. This allows the process to
locate matching MCV extended statistics and calculate selectivity
using the newly introduced function.
* Additionally, I added 'if' statement
in statext_is_compatible_clause_internal() function to determine
which columns are included in the clause, find matching extended
statistics, and then calculate selectivity through the new function.
I did the same in mcv_get_match_bitmap() to check what values are
true for (Var op Var).
* To support this, I created a new enum type to differentiate between
OR/AND and (Var op Var) clauses.

*Examples:*

create table t (a int, b int);
insert into t select mod(i,10), mod(i,10)+1 from
generate_series(1,100000) s(i);
analyze t;
explain select * from t where a < b;
`
    Estimated:   33333
    Actual:       100000

explain select * from t where a > b;
`
    Estimated:   33333
    Actual:       100000

create statistics s (mcv) on a,b from t;
analyze t;
explain select * from t where a < b;
`
    Estimated without patch:  33333
    Estimated with patch:     100000
    Actual:                             100000

explain select * from t where a > b;
`
    Estimated without patch:  33333
    Estimated with patch:     100000
    Actual:                             100000

If you want to see more examples, see regress tests in the patch.

*Previous thread:*

This feature was originally developed two years ago in [1], and at that
time, the approach was almost the same. My implementation uses dedicated
functions and 'if' statements directly for better readability and
maintainability. Additionally, there was a bug in the previous approach
that has been resolved with my patch. Here’s an example of the bug and
its fix:

CREATE TABLE foo (a int, b int);
INSERT INTO foo SELECT x/10+1, x FROM generate_series(1,10000) g(x);
ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE a = 1 OR (b > 0 AND b < 10);
`
    Estimated:   18
    Actual:           9

CREATE STATISTICS foo_s (mcv) ON a,b FROM foo;
ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE a = 1 OR (b > 0 AND b < 10);
`
    Estimated previous patch:  18
    Estimated current patch:      9
    Actual:                                  9

[1]:
https://www.postgresql.org/message-id/flat/9e0a12e0-c05f-b193-ed3d-fe88bc1e5fe1%40enterprisedb.com

I look forward to any feedback or suggestions from the community.

Best regars,
Ilia Evdokimov
Tantor Labs LLC.

Attachment Content-Type Size
v1-Add-support-for-Var-op-Var-clause-in-extended-MCV-st.patch text/x-patch 48.4 KB

From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Add support for (Var op Var) clause in extended MCV statistics
Date: 2024-08-12 10:59:24
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Another issue mentioned in [1] involves cases where the clause is in the
form (A op A). In my view, this isn't related to the current patch, as
it can be addressed by rewriting the clause, similar to transforming A =
A into A IS NOT NULL. This adjustment would result in more accurate
estimation.

[1]:
https://www.postgresql.org/message-id/7C0F91B5-8A43-428B-8D31-556458720305%40enterprisedb.com

Ilia Evdokimov,
Tantor Labs LLC.


From: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Add support for (Var op Var) clause in extended MCV statistics
Date: 2024-08-12 11:44:05
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi! I think your work is important)

I started reviewing it and want to suggest some changes to better code:
I think we should consider the case where the expression is not neither
an OpExpr and VarOpVar expression.

Have you tested this code with any benchmarks?

--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
diff1.no-cfbot text/plain 1.9 KB

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Add support for (Var op Var) clause in extended MCV statistics
Date: 2024-08-12 11:53:32
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 8/12/24 13:44, Alena Rybakina wrote:
> Hi! I think your work is important)
>

I agree, and I'm grateful someone picked up the original patch. I'll try
to help to keep it moving forward. If the thread gets stuck, feel free
to ping me to take a look.

> I started reviewing it and want to suggest some changes to better code:
> I think we should consider the case where the expression is not neither
> an OpExpr and VarOpVar expression.
>

Do you have some specific type of clauses in mind? Most of the extended
statistics only really handles this type of clauses, so I'm not sure
it's feasible to extend that - at least not in this patch.

> Have you tested this code with any benchmarks?
>
FWIW I think we need to test two things - that it (a) improves the
estimates and (b) does not have significant overhead.

regards

--
Tomas Vondra


From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>, Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Add support for (Var op Var) clause in extended MCV statistics
Date: 2024-08-12 15:57:19
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 12.8.24 14:53, Tomas Vondra wrote:

> I agree, and I'm grateful someone picked up the original patch. I'll try
> to help to keep it moving forward. If the thread gets stuck, feel free
> to ping me to take a look.
Good. Thank you!
>> I started reviewing it and want to suggest some changes to better code:
>> I think we should consider the case where the expression is not neither
>> an OpExpr and VarOpVar expression.
>>
> Do you have some specific type of clauses in mind? Most of the extended
> statistics only really handles this type of clauses, so I'm not sure
> it's feasible to extend that - at least not in this patch.

I agree with Alena that we need to consider the following clauses: (Expr
op Var), (Var op Expr) and (Expr op Expr). And we need to return false
in these cases because we did it before my patch in

        /* Check if the expression has the right shape */
        if (!examine_opclause_args(expr->args, &clause_expr, NULL, NULL))
            return false;

In is_opclause_var_op_var() function it is really useless local Node
*expr_left, *expr_right variables. However, we can't assign them NULL at
the begin because if I passed not-null pointers I have to return the
values. Otherwise remain them NULL.

Nevertheless, thank you for review, Alena.

>> Have you tested this code with any benchmarks?
>>
> FWIW I think we need to test two things - that it (a) improves the
> estimates and (b) does not have significant overhead.
Yes, but only TPC-B. And the performance did not drop. In general, it'd
be better to do more tests and those listed by Tomas with new attached
patch.

Attachment Content-Type Size
v2-Add-support-for-Var-op-Var-clause-in-extended-MCV-st.patch text/x-patch 48.6 KB

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Add support for (Var op Var) clause in extended MCV statistics
Date: 2024-08-12 16:25:34
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 8/12/24 17:57, Ilia Evdokimov wrote:
> On 12.8.24 14:53, Tomas Vondra wrote:
>
>> I agree, and I'm grateful someone picked up the original patch. I'll try
>> to help to keep it moving forward. If the thread gets stuck, feel free
>> to ping me to take a look.
> Good. Thank you!
>>> I started reviewing it and want to suggest some changes to better code:
>>> I think we should consider the case where the expression is not neither
>>> an OpExpr and VarOpVar expression.
>>>
>> Do you have some specific type of clauses in mind? Most of the extended
>> statistics only really handles this type of clauses, so I'm not sure
>> it's feasible to extend that - at least not in this patch.
>
> I agree with Alena that we need to consider the following clauses: (Expr
> op Var), (Var op Expr) and (Expr op Expr). And we need to return false
> in these cases because we did it before my patch in
>
>         /* Check if the expression has the right shape */
>         if (!examine_opclause_args(expr->args, &clause_expr, NULL, NULL))
>             return false;
>
> In is_opclause_var_op_var() function it is really useless local Node
> *expr_left, *expr_right variables. However, we can't assign them NULL at
> the begin because if I passed not-null pointers I have to return the
> values. Otherwise remain them NULL.
>
> Nevertheless, thank you for review, Alena.
>

Ah, right. I agree we should handle clauses with expressions.

I don't recall why I wrote is_opclause_var_op_var() like this, but I
believe this was before we allowed extended statistics on expressions
(which was added in 2021, the patch is from 2020). I don't see why it
could not return expressions, but I haven't tried.

>>> Have you tested this code with any benchmarks?
>>>
>> FWIW I think we need to test two things - that it (a) improves the
>> estimates and (b) does not have significant overhead.
> Yes, but only TPC-B. And the performance did not drop. In general, it'd
> be better to do more tests and those listed by Tomas with new attached
> patch.

Is TPC-B really interesting/useful for this patch? The queries are super
simple, with only a single clause (so it may not even get to the code
handling extended statistics). Did you create any extended stats?

I think you'll need to construct a custom test, with queries that have
multiple (var op var) clauses, extended stats created, etc. And
benchmark that.

FWIW I don't think it makes sense to benchmark the query execution - if
the estimate improves, it's possible to get arbitrary speedup, but
that's expected and mostly mostly irrelevant I think.

What I'd focus on is benchmarking just the query planning - we need the
overhead to be negligible (or at least small) so that it does not hurt
people with already good plans.

BTW can you elaborate why you are interested in this patch? Do you just
think it's interesting/useful, or do you have a workload where it would
actually help? I'm asking because me being uncertain how beneficial this
is in practice (not just being nice in theory) was one of the reasons
why I didn't do more work on this in 2021.

regards

--
Tomas Vondra


From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>, Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Add support for (Var op Var) clause in extended MCV statistics
Date: 2024-08-19 15:30:15
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 12.8.24 19:25, Tomas Vondra wrote:

> Is TPC-B really interesting/useful for this patch? The queries are super
> simple, with only a single clause (so it may not even get to the code
> handling extended statistics). Did you create any extended stats?

No, it's not the case. I simply wanted to verify that other queries are
not slowed down after applying my patch.
>
> I think you'll need to construct a custom test, with queries that have
> multiple (var op var) clauses, extended stats created, etc. And
> benchmark that.

I used the test generator from a previous thread [1] and ran it with
|default_statistics_target = 1000| to achieve more accurate estimates
for 3000 rows. It would also be beneficial to run tests with 10,000 and
100,000 rows for a broader perspective. I've attached both the Python
test and the results, including the data. Here’s a breakdown of the issues:

1. (A op A) Clause: Before applying my patch, there were poor estimates
for expressions like |(A op A)|. Currently, we only have correct
estimates for the |(A = A)| clause, which transforms into |A IS NOT
NULL|. Should I address this in this thread? I believe we should
extend the same correction to clauses like |(A != A)|, |(A < A)|,
and similar conditions. However, this issue is not for current thread.
2. AND Clauses: The estimates for AND clauses were inaccurate before my
patch. I noticed code segments where I could add something specific
for the |(Var op Var)| clause, but I'm unsure if I'm missing
anything crucial. If my understanding is incorrect, I'd appreciate
any guidance or corrections.

> FWIW I don't think it makes sense to benchmark the query execution - if
> the estimate improves, it's possible to get arbitrary speedup, but
> that's expected and mostly mostly irrelevant I think.
>
> What I'd focus on is benchmarking just the query planning - we need the
> overhead to be negligible (or at least small) so that it does not hurt
> people with already good plans.
>
> BTW can you elaborate why you are interested in this patch? Do you just
> think it's interesting/useful, or do you have a workload where it would
> actually help? I'm asking because me being uncertain how beneficial this
> is in practice (not just being nice in theory) was one of the reasons
> why I didn't do more work on this in 2021.

I have two reasons for pursuing this. Firstly, I've encountered some of
these queries in practice, although they are quite rare. While it might
be easy to dismiss these cases due to their infrequency, I believe that
we shouldn't overlook the opportunity to develop better handling for
them, regardless of how seldom they occur.

Secondly, I see that you're working on improving estimates for JOIN
clauses in thread [2]. I believe that enhancing estimates for these rare
cases could also benefit future work on JOIN queries, particularly those
with multiple |ON (T1.column = T2.column)| conditions, which are
essentially |(Var op Var)| clauses. My idea is to start with non-JOIN
queries, and then apply the same approach to improve JOIN estimates. Of
course, I might be wrong, but I think this approach has potential.

[1]:
https://www.postgresql.org/message-id/ecc0b08a-518d-7ad6-17ed-a5e962fc4f5f%40enterprisedb.com

[2]:
https://www.postgresql.org/message-id/flat/c8c0ff31-3a8a-7562-bbd3-78b2ec65f16c%40enterprisedb.com

--
Regards,
Ilia Evdokimov,
Tantor Labs LCC.

Attachment Content-Type Size
tests.zip application/zip 3.7 MB

From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>, Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Add support for (Var op Var) clause in extended MCV statistics
Date: 2024-08-19 16:04:40
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 12.8.24 19:25, Tomas Vondra wrote:

> Is TPC-B really interesting/useful for this patch? The queries are super
> simple, with only a single clause (so it may not even get to the code
> handling extended statistics). Did you create any extended stats?

No, it's not the case. I simply wanted to verify that other queries are
not slowed down after applying my patch.
>
> I think you'll need to construct a custom test, with queries that have
> multiple (var op var) clauses, extended stats created, etc. And
> benchmark that.

I used the test generator from a previous thread [1] and ran it with
|default_statistics_target = 1000| to achieve more accurate estimates
for 3000 rows. It would also be beneficial to run tests with 10,000 and
100,000 rows for a broader perspective. I've attached the python test.
Here’s a breakdown of the issues:

1. (A op A) Clause: Before applying my patch, there were poor estimates
for expressions like |(A op A)|. Currently, we only have correct
estimates for the |(A = A)| clause, which transforms into |A IS NOT
NULL|. Should I address this in this thread? I believe we should
extend the same correction to clauses like |(A != A)|, |(A < A)|,
and similar conditions. However, this issue is not for current thread.
2. AND Clauses: The estimates for AND clauses were inaccurate before my
patch. I noticed code segments where I could add something specific
for the |(Var op Var)| clause, but I'm unsure if I'm missing
anything crucial. If my understanding is incorrect, I'd appreciate
any guidance or corrections.

> FWIW I don't think it makes sense to benchmark the query execution - if
> the estimate improves, it's possible to get arbitrary speedup, but
> that's expected and mostly mostly irrelevant I think.
>
> What I'd focus on is benchmarking just the query planning - we need the
> overhead to be negligible (or at least small) so that it does not hurt
> people with already good plans.
>
> BTW can you elaborate why you are interested in this patch? Do you just
> think it's interesting/useful, or do you have a workload where it would
> actually help? I'm asking because me being uncertain how beneficial this
> is in practice (not just being nice in theory) was one of the reasons
> why I didn't do more work on this in 2021.
>
>
> regards
>

I have two reasons for pursuing this. Firstly, I've encountered some of
these queries in practice, although they are quite rare. While it might
be easy to dismiss these cases due to their infrequency, I believe that
we shouldn't overlook the opportunity to develop better handling for
them, regardless of how seldom they occur.

Secondly, I see that you're working on improving estimates for JOIN
clauses in thread [2]. I believe that enhancing estimates for these rare
cases could also benefit future work on JOIN queries, particularly those
with multiple |ON (T1.column = T2.column)| conditions, which are
essentially |(Var op Var)| clauses. My idea is to start with non-JOIN
queries, and then apply the same approach to improve JOIN estimates. Of
course, I might be wrong, but I think this approach has potential.

P.S. If I sent this mail twice I'm sorry. I wanted to sent results of
the test, and it was not sent to hackers because of big size of attached
file. Now I sent only test.

[1]:
https://www.postgresql.org/message-id/ecc0b08a-518d-7ad6-17ed-a5e962fc4f5f%40enterprisedb.com

[2]:
https://www.postgresql.org/message-id/flat/c8c0ff31-3a8a-7562-bbd3-78b2ec65f16c%40enterprisedb.com

--
Regards,
Ilia Evdokimov,
Tantor Labs LCC.

Attachment Content-Type Size
test_generator.py text/x-python 4.2 KB

From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>, Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Add support for (Var op Var) clause in extended MCV statistics
Date: 2024-09-09 10:43:02
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi everyone,

I've taken a closer look at the patch and realized that we don't need
the new function 'mcv_clause_selectivity_var_op_var()' we can use
'mcv_clause_selectivity()' instead.

I'm attaching the updated patch and test generator.

--
Regards,
Ilia Evdokimov,
Tantor Labs LCC.

Attachment Content-Type Size
test_generator.py text/x-python 4.2 KB
v4-0001-Add-support-for-Var-op-Var-clause-in-extended-MCV.patch text/x-patch 46.6 KB