RFC: Allow EXPLAIN to Output Page Fault Information

From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2024-12-24 08:53:06
Message-ID: [email protected]
Lists: pgsql-hackers

Hi,

When reading the output of EXPLAIN (ANALYZE) to diagnose slow queries
for our customers, I often want to know how many page faults occurred,
especially major page faults, which involve disk access.

Currently, the BUFFERS option in EXPLAIN reports whether a page was
found in shared buffers, but it doesn't show whether a page missed
there was found in the OS cache or required actual disk I/O.

Since page faults, especially major ones, hurt performance far more
than shared buffer or OS cache hits, it would be helpful to track
these events.

I have attached a PoC patch that modifies EXPLAIN to include page fault
information during both the planning and execution phases of a query.
The output would look like this:

=# EXPLAIN (ANALYZE, PAGEFAULTS)
SELECT * FROM pgbench_branches b
JOIN pgbench_accounts a ON b.bid = a.bid ORDER BY a.aid;

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.58..335386.98 rows=4999917 width=197) (actual time=3.785..5590.294 rows=5000000 loops=1)
 ... (omitted)
 Planning:
   Buffers: shared hit=50 read=48 dirtied=4 written=2
   Page Faults: minor=30 major=19    <-ADDED
 Planning Time: 22.080 ms
 Execution:
   Page Faults: minor=49 major=5     <-ADDED
 Execution Time: 5794.356 ms

The patch has not yet added this functionality to auto_explain, but I
believe this feature would be more useful in auto_explain than in the
EXPLAIN command itself. I plan to add it to auto_explain if there's
interest in including page fault information in EXPLAIN.

While GUCs like log_statement_stats allow logging of getrusage(2)
information, including page faults, always enabling this can lead to
excessive log output. It would be better to log this data only when
queries are slow. Therefore, adding this feature to auto_explain seems
like a good solution.

The patch introduces a new option, PAGEFAULTS, but it may be more
appropriate to include the page fault information in another option,
such as SUMMARY, especially if there are other useful resources that can
be obtained from getrusage(2).

The patch cannot be applied on Windows because PostgreSQL's Windows
port of getrusage() currently tracks only CPU times. I'm not sure
whether major and minor page fault counts are accessible on Windows,
but it might be acceptable to treat this like log_statement_stats and
exclude it from Windows support.

I also tried adding page fault information to each plan node, similar
to the BUFFERS option, but I decided against this approach due to the
performance impact: calling getrusage(2) frequently, i.e. once per row
fetched, created significant overhead.

Any feedback would be greatly appreciated.

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.

Attachment Content-Type Size
v1-0001-PoC-Allow-EXPLAIN-to-output-page-fault-info.patch text/x-diff 10.6 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2024-12-24 15:52:08
Message-ID: [email protected]
Lists: pgsql-hackers

torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> writes:
> I have attached a PoC patch that modifies EXPLAIN to include page fault
> information during both the planning and execution phases of a query.

Surely these numbers would be too unstable to be worth anything.

regards, tom lane


From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2024-12-26 04:15:59
Message-ID: [email protected]
Lists: pgsql-hackers

On 2024-12-25 00:52, Tom Lane wrote:
> torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> writes:
>> I have attached a PoC patch that modifies EXPLAIN to include page
>> fault
>> information during both the planning and execution phases of a query.
>
> Surely these numbers would be too unstable to be worth anything.

Thanks for your comment!

Hmm, I didn't know these numbers were unstable. If there are reasons,
I'd like to hear about them.

I would like to explore alternative methods for measuring resource
usage, but I am not aware of other approaches.
(IIUC pg_stat_kcache[1], which is said to provide information about
filesystem-layer I/O usage, also gets its metrics from getrusage(2).)

[1] https://github.com/powa-team/pg_stat_kcache

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.


From: Jeremy Schneider <schneider(at)ardentperf(dot)com>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2024-12-26 19:52:11
Message-ID: 20241226115211.676ba09a@jeremy-ThinkPad-T430s
Lists: pgsql-hackers

On Thu, 26 Dec 2024 13:15:59 +0900
torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> wrote:

> On 2024-12-25 00:52, Tom Lane wrote:
> > torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> writes:
> >> I have attached a PoC patch that modifies EXPLAIN to include page
> >> fault
> >> information during both the planning and execution phases of a
> >> query.
> >
> > Surely these numbers would be too unstable to be worth anything.
>
> Thanks for your comment!
>
> Hmm, I didn't know these are unstable. If there are any reasons, I'd
> like to know about them.
>
> I would like to explore alternative methods for measuring resource
> usage, but
> I am not aware of other approaches.
> (IIUC pg_stat_kcache[1], which is said to provide information about
> filesystem layer I/O usage also gets metrics from getrusage(2))

What I'd really like to see is a column added to pg_stat_database
called blks_read_majflts.

It would be great if we could calculate a cache hit ratio that took OS
major page faults into account.

Yes, this could also be done in pg_stat_kcache, but why not do it
right in pg_stat_database? I think it would be pretty widely
appreciated and used.

-Jeremy


From: "Jelte Fennema-Nio" <postgres(at)jeltef(dot)nl>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "torikoshia" <torikoshia(at)oss(dot)nttdata(dot)com>
Cc: "Pgsql Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2024-12-27 14:15:40
Message-ID: [email protected]
Lists: pgsql-hackers

On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:
> torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> writes:
>> I have attached a PoC patch that modifies EXPLAIN to include page fault
>> information during both the planning and execution phases of a query.
>
> Surely these numbers would be too unstable to be worth anything.

What makes you think that? I'd expect them to be similarly stable to
the numbers we get for BUFFERS. Sure, they won't be completely stable,
but I expect them to be quite helpful when debugging performance
issues, because large numbers indicate that the query is disk-bound
and small numbers indicate that it is not.

These numbers seem especially useful for setups where shared_buffers is
significantly smaller than the total memory available to the system. In
those cases the output from BUFFERS might give the impression that
you're disk-bound, but if your working set still fits into the OS cache
then the number of page faults is likely still low, telling you that
the numbers you get back from BUFFERS are not as big of a problem as
they might seem.


From: Jeremy Schneider <schneider(at)ardentperf(dot)com>
To: "Jelte Fennema-Nio" <postgres(at)jeltef(dot)nl>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "torikoshia" <torikoshia(at)oss(dot)nttdata(dot)com>, "Pgsql Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2024-12-27 18:32:43
Message-ID: 20241227103243.39922304@jeremy-ThinkPad-T430s
Lists: pgsql-hackers

On Fri, 27 Dec 2024 15:15:40 +0100
"Jelte Fennema-Nio" <postgres(at)jeltef(dot)nl> wrote:

> On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:
> > torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> writes:
> >> I have attached a PoC patch that modifies EXPLAIN to include page
> >> fault information during both the planning and execution phases of
> >> a query.
> >
> > Surely these numbers would be too unstable to be worth anything.
>
> What makes you think that? I'd expect them to be similarly stable to
> the numbers we get for BUFFERS. i.e. Sure they won't be completely
> stable, but I expect them to be quite helpful when debugging perf
> issues, because large numbers indicate that the query is disk-bound
> and small numbers indicate that it is not.
>
> These numbers seem especially useful for setups where shared_buffers
> is significantly smaller than the total memory available to the
> system. In those cases the output from BUFFERS might give the
> impression that that you're disk-bound, but if your working set still
> fits into OS cache then the number of page faults is likely still
> low. Thus telling you that the numbers that you get back from BUFFERS
> are not as big of a problem as they might seem.

We'd probably need to combine both pg_buffercache_evict() and
/proc/sys/vm/drop_caches to get stable numbers - which is something I
have done in the past for testing.

Another thought would be splitting out the IO timing information into
two values - IO timing for reads that triggered major faults, versus
IO timing for reads that did not.

And system views like pg_stat_database seem worth considering too.

-Jeremy


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2024-12-30 16:19:47
Message-ID: [email protected]
Lists: pgsql-hackers

On Fri, Dec 27, 2024 at 03:15:40PM +0100, Jelte Fennema-Nio wrote:
> On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:
> > torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> writes:
> > > I have attached a PoC patch that modifies EXPLAIN to include page
> > > fault information during both the planning and execution phases of a
> > > query.
> >
> > Surely these numbers would be too unstable to be worth anything.
>
> What makes you think that? I'd expect them to be similarly stable to the
> numbers we get for BUFFERS. i.e. Sure they won't be completely stable,
> but I expect them to be quite helpful when debugging perf issues,
> because large numbers indicate that the query is disk-bound and small
> numbers indicate that it is not.
>
> These numbers seem especially useful for setups where shared_buffers is
> significantly smaller than the total memory available to the system. In
> those cases the output from BUFFERS might give the impression that that
> you're disk-bound, but if your working set still fits into OS cache then
> the number of page faults is likely still low. Thus telling you that the
> numbers that you get back from BUFFERS are not as big of a problem as
> they might seem.

I certainly would love to see storage I/O numbers as distinct from
kernel read I/O numbers.

--
Bruce Momjian <bruce(at)momjian(dot)us> https://momjian.us
EDB https://enterprisedb.com

Do not let urgent matters crowd out time for investment in the future.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2024-12-30 16:39:24
Message-ID: [email protected]
Lists: pgsql-hackers

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> On Fri, Dec 27, 2024 at 03:15:40PM +0100, Jelte Fennema-Nio wrote:
>> On Tue Dec 24, 2024 at 4:52 PM CET, Tom Lane wrote:
>>> torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> writes:
>>>> I have attached a PoC patch that modifies EXPLAIN to include page
>>>> fault information during both the planning and execution phases of a
>>>> query.

> I certainly would love to see storage I/O numbers as distinct from
> kernel read I/O numbers.

Me too, but I think it is 100% wishful thinking to imagine that
page fault counts match up with that. Maybe there are filesystems
where a read that we request maps one-to-one with a subsequent
page fault, but it hardly seems likely to me that that's
universal. Also, you can't tell page faults for reading program
code apart from those for data, and you won't get any information
at all about writes.

regards, tom lane


From: "Jelte Fennema-Nio" <postgres(at)jeltef(dot)nl>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Bruce Momjian" <bruce(at)momjian(dot)us>
Cc: "torikoshia" <torikoshia(at)oss(dot)nttdata(dot)com>, "Pgsql Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2024-12-30 22:57:14
Message-ID: [email protected]
Lists: pgsql-hackers

On Mon Dec 30, 2024 at 5:39 PM CET, Tom Lane wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
>> I certainly would love to see storage I/O numbers as distinct from
>> kernel read I/O numbers.
>
> Me too, but I think it is 100% wishful thinking to imagine that
> page fault counts match up with that.

Okay I played around with this patch a bit, in hopes of proving you
wrong. But I now agree with you. I cannot seem to get any numbers out of
this that make sense.

The major page fault numbers are always zero, even after running:

echo 1 > /proc/sys/vm/drop_caches

If Takahori has a way to get some more useful insights from this patch,
I'm quite interested in the steps he took (I might very well have missed
something obvious).

**However, I think the general direction has merit**: Changing this patch to
use `ru_inblock`/`ru_oublock` gives very useful insights. `ru_inblock`
is 0 when everything is in page cache, and it is very high when stuff is
not. I was only hacking around and basically did this:

s/ru_minflt/ru_inblock/g
s/ru_majflt/ru_oublock/g

Obviously more is needed. We'd probably want to show these numbers in
useful units like MB or something. Also, maybe there's some better way
of getting read/write numbers for the current process than
ru_inblock/ru_oublock (but this one seems to work at least reasonably
well).

One other thing that I noticed when playing around with this, which
would need to be addressed: Parallel workers need to pass these values
to the main process somehow, otherwise the IO from those processes gets lost.

For the record, the queries I used to test this patch were:

create table t_big(a int, b text);
insert into t_big SELECT i, repeat(i::text, 200) FROM generate_series(1, 3000000) i;
explain (ANALYZE, PAGEFAULTS) select max(a), max(b) from t_big;
explain (analyze, PAGEFAULTS) insert into t_big SELECT i, repeat(i::text, 200) FROM generate_series(1, 3000000) i;

And then seeing if there was any difference in the explain analyze
output after running the following (as root):

echo 1 > /proc/sys/vm/drop_caches


From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-01-06 09:49:06
Message-ID: [email protected]
Lists: pgsql-hackers


On Tue, Dec 31, 2024 at 1:39 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
>
>> I certainly would love to see storage I/O numbers as distinct from
>> kernel read I/O numbers.
>
> Me too, but I think it is 100% wishful thinking to imagine that
> page fault counts match up with that. Maybe there are filesystems
> where a read that we request maps one-to-one with a subsequent
> page fault, but it hardly seems likely to me that that's
> universal. Also, you can't tell page faults for reading program
> code apart from those for data, and you won't get any information
> at all about writes.

Thanks for the explanation.

On Tue, Dec 31, 2024 at 7:57 AM Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
wrote:
> On Mon Dec 30, 2024 at 5:39 PM CET, Tom Lane wrote:
>> Bruce Momjian <bruce(at)momjian(dot)us> writes:
>>> I certainly would love to see storage I/O numbers as distinct from
>>> kernel read I/O numbers.
>>
>> Me too, but I think it is 100% wishful thinking to imagine that
>> page fault counts match up with that.
>
> Okay I played around with this patch a bit, in hopes of proving you
> wrong. But I now agree with you. I cannot seem to get any numbers out
> of
> this that make sense.
>
> The major page fault numbers are always zero, even after running:
>
> echo 1 > /proc/sys/vm/drop_caches
>
> If Takahori has a way to get some more useful insights from this patch,
> I'm quite interested in the steps he took (I might very well have
> missed
> something obvious).

Thanks for testing.

In addition to your step, I also ran pg_ctl restart to clear the
buffer cache, and then saw many major faults again.
However, when I replaced the restart with pg_buffercache_evict(), I
again observed too few major faults.
I now feel that majflt from getrusage() is not an appropriate metric
for measuring storage I/O.

> **However, I think the general direction has merit**: Changing this
> patch to
> use `ru_inblock`/`ru_oublock` gives very useful insights. `ru_inblock`
> is 0 when everything is in page cache, and it is very high when stuff
> is
> not. I was only hacking around and basically did this:
>
> s/ru_minflt/ru_inblock/g
> s/ru_majflt/ru_oublock/g

Great!
I had misunderstood; I thought these metrics included page-cached I/O.

As far as I inspected, they come from read_bytes/write_bytes in
task_io_accounting, and the comments suggest they are what we want,
i.e. storage I/O:

--
/usr/src/linux-headers-5.15.0-127/include/linux/task_io_accounting.h

struct task_io_accounting {
..(snip)..
#ifdef CONFIG_TASK_IO_ACCOUNTING
	/*
	 * The number of bytes which this task has caused to be read from
	 * storage.
	 */
	u64 read_bytes;

	/*
	 * The number of bytes which this task has caused, or shall cause to be
	 * written to disk.
	 */
	u64 write_bytes;

> Obviously more is needed. We'd probably want to show these numbers in
> useful units like MB or something. Also, maybe there's some better way
> of getting read/write numbers for the current process than
> ru_inblock/ru_oublock (but this one seems to work at least reasonably
> well).

Updated the PoC patch to report these numbers in KB:

=# EXPLAIN (ANALYZE, STORAGEIO) SELECT * FROM pgbench_accounts;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on pgbench_accounts  (cost=0.00..263935.35 rows=10000035 width=97) (actual time=1.447..3900.279 rows=10000000 loops=1)
   Buffers: shared hit=2587 read=161348
 Planning Time: 0.367 ms
 Execution:
   Storage I/O: read=1291856 KB write=0 KB
 Execution Time: 4353.253 ms
(6 rows)

> Also, maybe there's some better way
> of getting read/write numbers for the current process than
> ru_inblock/ru_oublock (but this one seems to work at least reasonably
> well).

Maybe, but as far as getrusage() is concerned, ru_inblock and
ru_oublock seem the best.

> One other thing that I noticed when playing around with this, which
> would need to be addressed: Parallel workers need to pass these values
> to the main process somehow, otherwise the IO from those processes gets
> lost.

Yes.
I haven't implemented it yet, but I believe we can pass them the same
way buffer and WAL usage are passed.

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.

Attachment Content-Type Size
v1-0001-PoC-Allow-EXPLAIN-to-output-storage-I-O-informati.patch text/x-diff 10.5 KB

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
Cc: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, tgl(at)sss(dot)pgh(dot)pa(dot)us, Bruce Momjian <bruce(at)momjian(dot)us>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-01-06 09:59:02
Message-ID: Z3upZkA5Bi6lYS6a@jrouhaud
Lists: pgsql-hackers

Hi,

On Mon, Jan 06, 2025 at 06:49:06PM +0900, torikoshia wrote:
>
> > **However, I think the general direction has merit**: Changing this
> > patch to
> > use `ru_inblock`/`ru_oublock` gives very useful insights. `ru_inblock`
> > is 0 when everything is in page cache, and it is very high when stuff is
> > not. I was only hacking around and basically did this:
> >
> > s/ru_minflt/ru_inblock/g
> > s/ru_majflt/ru_oublock/g
> > [...]
> > Also, maybe there's some better way
> > of getting read/write numbers for the current process than
> > ru_inblock/ru_oublock (but this one seems to work at least reasonably
> > well).
>
> Maybe, but as far as using getrusage(), ru_inblock and ru_outblock seem the
> best.

FWIW, those are the counters we've been using in pg_stat_kcache / powa
to compute disk I/O and the "postgres shared buffers hit vs. OS cache
hit vs. disk read" hit ratio for many years, and we haven't received
any problem reports about them.


From: "Jelte Fennema-Nio" <postgres(at)jeltef(dot)nl>
To: "torikoshia" <torikoshia(at)oss(dot)nttdata(dot)com>, <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Bruce Momjian" <bruce(at)momjian(dot)us>, "Pgsql Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-01-06 20:09:04
Message-ID: [email protected]
Lists: pgsql-hackers

On Mon Jan 6, 2025 at 10:49 AM CET, torikoshia wrote:
> On Tue, Dec 31, 2024 at 7:57 AM Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
> Updated the PoC patch to calculate them by KB:
>
> =# EXPLAIN (ANALYZE, STORAGEIO) SELECT * FROM pgbench_accounts;
> QUERY PLAN
>
> ---------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on pgbench_accounts (cost=0.00..263935.35 rows=10000035
> width=97) (actual time=1.447..3900.279 rows=10000000 loops=1)
> Buffers: shared hit=2587 read=161348
> Planning Time: 0.367 ms
> Execution:
> Storage I/O: read=1291856 KB write=0 KB
> Execution Time: 4353.253 ms
> (6 rows)
>
>

The core functionality works well in my opinion. I think it makes sense
to spend the effort to move this from PoC quality to something
committable. Below are some of the things necessary to do that, after
an initial pass over the code (and trying it out):

>> One other thing that I noticed when playing around with this, which
>> would need to be addressed: Parallel workers need to pass these values
>> to the main process somehow, otherwise the IO from those processes gets
>> lost.
>
> Yes.
> I haven't developed it yet but I believe we can pass them like
> buffer/WAL usage.

1. Yeah, makes sense to do this the same way as we do for buffers. Let's do
this now.

> + if (es->storageio)
> + {
> + getrusage(RUSAGE_SELF, &rusage);
> +
> + storageio.inblock = rusage.ru_inblock - storageio_start.inblock;
> + storageio.outblock = rusage.ru_oublock - storageio_start.outblock;
> +
> + if (es->format == EXPLAIN_FORMAT_TEXT)
> + {
> + ExplainIndentText(es);
> + appendStringInfoString(es->str, "Execution:\n");
> + es->indent++;
> + }
> + show_storageio(es, &storageio);
> +
> + if (es->format == EXPLAIN_FORMAT_TEXT)
> + es->indent--;
> + ExplainCloseGroup("Execution", "Execution", true, es);
> + }

2. The current code always shows "Execution: " in the explain analyze
output, even if no storageio is done. I think this should use
peek_storageio() to check if any storageio was done and only show the
"Execution: " line if that is the case.

3. FORMAT JSON seems to be broken with this patch. With the following
simple query:

explain (ANALYZE, STORAGEIO, FORMAT JSON) select max(a), max(b) from t_big;

I get this assert failure:

TRAP: failed Assert("es->indent == 0"), File: "../src/backend/commands/explain.c", Line: 375, PID: 199034
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(ExceptionalCondition+0x74)[0x5ad72872b464]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(ExplainQuery+0x75b)[0x5ad7283c87bb]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(standard_ProcessUtility+0x595)[0x5ad7285e97f5]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(+0x4daadf)[0x5ad7285e7adf]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(+0x4dafc4)[0x5ad7285e7fc4]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(PortalRun+0x32d)[0x5ad7285e834d]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(+0x4d70a2)[0x5ad7285e40a2]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(PostgresMain+0x16e9)[0x5ad7285e5b39]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(BackendMain+0x5f)[0x5ad7285e02df]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(postmaster_child_launch+0xe1)[0x5ad72853cde1]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(+0x433758)[0x5ad728540758]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(PostmasterMain+0xddd)[0x5ad72854223d]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(main+0x1d0)[0x5ad72828b600]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x714aa222a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x714aa222a28b]
postgres: jelte postgres 127.0.0.1(49034) EXPLAIN(_start+0x25)[0x5ad72828bc05]

FORMAT JSON should obviously not crash the server, but apart from
that it should also actually show this new data in the json output.

4. I think this setting should be enabled by default for ANALYZE, just
like BUFFERS is now since c2a4078e[1].

5. I'm wondering whether this option deserves its own EXPLAIN option, or
if it should simply be made part of BUFFERS.

6. Windows compilation is currently failing on the CFbot. Looking at the
output, that's because rusage does not contain these fields there. I
think you'll need some #ifdefs

7. The result of getrusage() should be checked for errors and we should
report the error (even though it's very unlikely to ever fail).

8. This needs docs

> + appendStringInfo(es->str, " read=%ld KB", (long) usage->inblock / 2);
> + appendStringInfo(es->str, " write=%ld KB", (long) usage->outblock / 2);

9. I think this division by 2 could use some explanation in a comment. I
understand that you're doing this because Linux counts these values in
512-byte units[2], and your additional factor of 2 gets that to
1024 bytes. But that's not immediately clear from the code.

I'm also not convinced that 512 is the block size on every platform,
or that this logic is even correct everywhere. I'm wondering if maybe
we should simply show the blocks after all.

[1]: https://github.com/postgres/postgres/commit/c2a4078ebad71999dd451ae7d4358be3c9290b07
[2]: https://github.com/torvalds/linux/blob/fbfd64d25c7af3b8695201ebc85efe90be28c5a3/include/linux/task_io_accounting_ops.h#L16-L23


From: Atsushi Torikoshi <torikoshia(dot)tech(at)gmail(dot)com>
To: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, rjuju123(at)gmail(dot)com
Cc: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, tgl(at)sss(dot)pgh(dot)pa(dot)us, Bruce Momjian <bruce(at)momjian(dot)us>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-01-07 13:41:47
Message-ID: CAM6-o=BE=oewSsdNKrbbNjR2muzQR49STXsuta+Wpq6CSyxTag@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 6, 2025 at 6:59 PM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:

> Hi,
>
> On Mon, Jan 06, 2025 at 06:49:06PM +0900, torikoshia wrote:
> >
> > > **However, I think the general direction has merit**: Changing this
> > > patch to
> > > use `ru_inblock`/`ru_oublock` gives very useful insights. `ru_inblock`
> > > is 0 when everything is in page cache, and it is very high when stuff
> is
> > > not. I was only hacking around and basically did this:
> > >
> > > s/ru_minflt/ru_inblock/g
> > > s/ru_majflt/ru_oublock/g
> > > [...]
> > > Also, maybe there's some better way
> > > of getting read/write numbers for the current process than
> > > ru_inblock/ru_oublock (but this one seems to work at least reasonably
> > > well).
> >
> > Maybe, but as far as using getrusage(), ru_inblock and ru_outblock seem
> the
> > best.
>
> FWIW that's the counters we've been using in pg_stat_kcache / powa to
> compute
> disk IO and "postgres shared buffers hit VS OS cache hit VS disk read" hit
> ratio for many years and we didn't get any problem report for that.

Thanks for sharing the information!

On Tue, Jan 7, 2025 at 5:09 AM Jelte Fennema-Nio <postgres(at)jeltef(dot)nl> wrote:

> On Mon Jan 6, 2025 at 10:49 AM CET, torikoshia wrote:
> > On Tue, Dec 31, 2024 at 7:57 AM Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
> > Updated the PoC patch to calculate them by KB:
> >
> > =# EXPLAIN (ANALYZE, STORAGEIO) SELECT * FROM pgbench_accounts;
> > QUERY PLAN
> >
> >
> ---------------------------------------------------------------------------------------------------------------------------------
> > Seq Scan on pgbench_accounts (cost=0.00..263935.35 rows=10000035
> > width=97) (actual time=1.447..3900.279 rows=10000000 loops=1)
> > Buffers: shared hit=2587 read=161348
> > Planning Time: 0.367 ms
> > Execution:
> > Storage I/O: read=1291856 KB write=0 KB
> > Execution Time: 4353.253 ms
> > (6 rows)
> >
> >
>
> The core functionality works well in my opinion. I think it makes sense
> to spend the effort to move this from PoC quality to something
> committable.

Thanks for the comment and review!
If there are no other opinions, I will make a patch based on the direction
of the current PoC patch.

> 4. I think this setting should be enabled by default for ANALYZE, just
> like BUFFERS is now since c2a4078e[1].
>
> 5. I'm wondering whether this option deserves its own EXPLAIN option, or
> if it should simply be made part of BUFFERS.

Although this is not information about PostgreSQL buffers, I now feel that
making this addition part of BUFFERS is attractive, since people using
BUFFERS would also sometimes want to know about the storage I/O.

Since BUFFERS is now ON by default with EXPLAIN ANALYZE, I am concerned
about the performance impact.
However, if the counters are read only twice (once at the start and once at
the end) for each of the planning phase, the execution phase, and each
parallel worker, I believe the impact is relatively small.
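To illustrate why the overhead is bounded, the two-snapshot pattern only reads the counters at phase boundaries and reports the difference. Below is a minimal Python sketch of that pattern using getrusage(2); the actual patch does this in C inside the planner/executor, and the function names here are illustrative only:

```python
import resource  # Unix-only: exposes getrusage(2)

def io_snapshot():
    """Read the process-wide block I/O counters from getrusage(2)."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return (ru.ru_inblock, ru.ru_oublock)

def io_diff(start, end):
    """Blocks read/written between two snapshots."""
    return (end[0] - start[0], end[1] - start[1])

start = io_snapshot()
# ... the phase being measured (planning or execution) runs here ...
end = io_snapshot()
reads, writes = io_diff(start, end)
```

Because only two syscalls are added per measured phase, the cost is independent of how much I/O the query itself performs.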

> 9. I think this division by 2 could use some explanation in a comment. I
> understand that you're doing this because linux divides its original
> bytes using 512 bytes[2] and your additional factor of 2 gets that to
> 1024 bytes. But that's not clear immediately from the code.
>
> I'm also not convinced that 512 is the blocksize if this logic is
> even correct on every platform. I'm wondering if maybe we should
> simply show the blocks after all.
>

Maybe so. I'll look into this and then decide the unit.

For the other comments, I plan to address them as you suggested.

Regards,
Atsushi Torikoshi


From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, rjuju123(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-01-27 09:04:59
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On Tue, Jan 7, 2025 at 5:09 AM Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
wrote:
> The core functionality works well in my opinion. I think it makes sense
> to spend the effort to move this from PoC quality to something
> committable. Below some of the things that are necessary to do that
> after an initial pass over the code (and trying it out):

Attached a patch.

On Tue, Jan 7, 2025 at 10:42 PM Atsushi Torikoshi
<torikoshia(dot)tech(at)gmail(dot)com> wrote:
>> 9. I think this division by 2 could use some explanation in a comment. I
>> understand that you're doing this because linux divides its original
>> bytes using 512 bytes[2] and your additional factor of 2 gets that to
>> 1024 bytes. But that's not clear immediately from the code.
>>
>> I'm also not convinced that 512 is the blocksize if this logic is
>> even correct on every platform. I'm wondering if maybe we should
>> simply show the blocks after all.
>
> Maybe so. I'll look into this and then decide the unit.

I looked up the manuals for the following operating systems, as
documented in [1], and it seems that all of them—except Windows—support
getrusage(2) and return ru_inblock/ru_oublock:
Linux, Windows, FreeBSD, OpenBSD, NetBSD, DragonFlyBSD, macOS, AIX,
Solaris, and illumos.

However, I’m unsure whether the unit of these values is consistently 512
bytes across all operating systems.
Additionally, I’m concerned that the timing of when these metrics are
incremented might vary between OSs.
For example, on Linux, ru_oublock appears to be incremented when a page is
dirtied, as it is derived from write_bytes [2] by a right shift of 9 bits
(i.e., division by 512).
I’m not sure if other operating systems behave the same way.

That said, they all represent the number of storage I/O operations
performed or to be performed.
Therefore, I believe it would be reasonable to report the raw values
as-is, as they should still be useful for understanding storage I/O
activity.

Example output:
=# explain analyze select max(a), max(b) from t_big_ul;
(..snip..)
Planning:
Buffers: shared hit=31 read=54
Storage I/O: read=2744 times write=0 times
Planning Time: 30.685 ms
Execution:
Storage I/O: read=2563600 times write=0 times
Execution Time: 1685.272 ms
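If byte figures are wanted for output like the above, the raw counts can be converted under the assumption (true on Linux, but not verified for every platform listed earlier) that one accounting block is 512 bytes. A hedged sketch of the conversion:

```python
LINUX_RU_BLOCK_SIZE = 512  # assumed unit of ru_inblock/ru_oublock on Linux

def blocks_to_kb(blocks, block_size=LINUX_RU_BLOCK_SIZE):
    """Convert getrusage block counts to kilobytes."""
    return blocks * block_size // 1024

# e.g. the execution-phase reads above: 2563600 blocks -> 1281800 KB
```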

[1] https://www.postgresql.org/docs/devel/supported-platforms.html
[2] https://www.kernel.org/doc/Documentation/filesystems/proc.txt

--
Regards,

Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.

Attachment Content-Type Size
v1-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch text/x-diff 33.8 KB

From: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-02-08 13:54:14
Message-ID: CAGECzQRvcLx44N3zd_DGCjY02XX4AqXX8mq4BiS8C9Froy+Jhg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 27 Jan 2025 at 10:05, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> wrote:
> Therefore, I believe it would be reasonable to report the raw values
> as-is, as they should still be useful for understanding storage I/O
> activity.

Sounds reasonable.

Below some feedback on the patch. It's all really minor. The patch
looks great. I'll play around with it a bit next week.

meta: it's confusing that this one is called v1 again, it would be
clearer if it was called v2.

nit: at line 528 the "if (es->buffers)" check can simply be merged
with the if block above (which does the exact same check)

> if (usage->inblock <= 0 && usage->outblock <= 0)
> return false;
>
> else
> return true;

nit: You can replace that if-else with:
return usage->inblock > 0 || usage->outblock > 0;

> StorageIOUsageAccumDiff(StorageIOUsage *dst, const StorageIOUsage *add, const StorageIOUsage *sub)

Missing a function comment
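For reference, PostgreSQL's existing BufferUsageAccumDiff accumulates dst += add - sub for each counter, and presumably StorageIOUsageAccumDiff follows the same convention for the two getrusage counters. A hedged Python sketch of that contract (the field names are assumptions, not the patch's actual struct members):

```python
def storage_io_usage_accum_diff(dst, add, sub):
    """Accumulate into dst the delta between two counter snapshots."""
    dst["inblock"] += add["inblock"] - sub["inblock"]
    dst["outblock"] += add["outblock"] - sub["outblock"]
```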


From: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-02-09 11:51:40
Message-ID: CAGECzQQdc-k=M2fMCKa98kVZntc=6d3rpd6edt8Qs45cayfUeQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, 8 Feb 2025 at 14:54, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl> wrote:
> I'll play around with it a bit next week.

Okay, I played around with it and couldn't find any issues. I marked
the patch as "ready for committer" in the commitfest app[1], given
that all feedback in my previous email was very minor.

[1]: https://commitfest.postgresql.org/52/5526/


From: Andres Freund <andres(at)anarazel(dot)de>
To: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Cc: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-02-09 17:59:27
Message-ID: myw76agwrlmisvamwbfl6ibxgwh5glzitydiwnfmtb5aui232i@274yxxtnbnsp
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2025-02-09 12:51:40 +0100, Jelte Fennema-Nio wrote:
> On Sat, 8 Feb 2025 at 14:54, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl> wrote:
> > I'll play around with it a bit next week.
>
> Okay, I played around with it and couldn't find any issues. I marked
> the patch as "ready for committer" in the commitfest app[1], given
> that all feedback in my previous email was very minor.

I'm somewhat against this patch, as it's fairly fundamentally incompatible
with AIO. There's no real way to get information in this manner if the IO
isn't executed synchronously in process context...

Greetings,

Andres


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-02-09 18:05:42
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> I'm somewhat against this patch, as it's fairly fundamentally incompatible
> with AIO. There's no real way to get information in this manner if the IO
> isn't executed synchronously in process context...

Even without looking ahead to AIO, there's bgwriter, walwriter, and
checkpointer processes that all take I/O load away from foreground
processes. I don't really believe that this will produce useful
numbers.

regards, tom lane


From: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-02-09 20:06:02
Message-ID: CAGECzQTm6oEEY4yO_FO0ZBtUuJX+pYiXhw2GPPRMzq_5DP5_fQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 9 Feb 2025 at 19:05, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > I'm somewhat against this patch, as it's fairly fundamentally incompatible
> > with AIO. There's no real way to get information in this manner if the IO
> > isn't executed synchronously in process context...

Hmm, I had not considered how this would interact with your AIO work.
I agree that getting this info would be hard/impossible to do
efficiently, when IOs are done by background IO processes that
interleave IOs from different queries. But I'd expect that AIOs that
are done using iouring would be tracked correctly without having to
change this code at all (because I assume those are done from the
query backend process).

One other thought: I think the primary benefit of this feature is
being able to see how many read IOs actually hit the disk, as opposed
to hitting OS page cache. That benefit disappears when using Direct
IO, because then there's no OS page cache.

How many years away do you think that widespread general use of
AIO+Direct IO is, though? I think that for the N years from now until
then, it would be very nice to have this feature to help debug query
performance problems. Then once the numbers become too
inaccurate/useless at some point, we could simply remove them again.

> Even without looking ahead to AIO, there's bgwriter, walwriter, and
> checkpointer processes that all take I/O load away from foreground
> processes. I don't really believe that this will produce useful
> numbers.

The bgwriter, walwriter, and checkpointer should only take away
*write* IOs. For read IOs the numbers should be very accurate and as
explained above read IOs is where I think the primary benefit of this
feature is.

But even for write IOs I think the numbers would be useful when
looking at them with the goal of finding out why a particular query is
slow: If the bgwriter or checkpointer do the writes, then the query
should be roughly as fast as if no writes to the disk had taken place
at all, but if the query process does the writes then those writes are
probably blocking further execution of the query and thus slowing it
down.


From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, andres(at)anarazel(dot)de, tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-02-10 13:23:06
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 2025-02-10 05:06, Jelte Fennema-Nio wrote:

Thanks for reviewing the patch and comments!
Fixed issues you pointed out and attached v2 patch.

> On Sun, 9 Feb 2025 at 19:05, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>
>> Andres Freund <andres(at)anarazel(dot)de> writes:
>> > I'm somewhat against this patch, as it's fairly fundamentally incompatible
>> > with AIO. There's no real way to get information in this manner if the IO
>> > isn't executed synchronously in process context...
>
> Hmm, I had not considered how this would interact with your AIO work.
> I agree that getting this info would be hard/impossible to do
> efficiently, when IOs are done by background IO processes that
> interleave IOs from different queries. But I'd expect that AIOs that
> are done using iouring would be tracked correctly without having to
> change this code at all (because I assume those are done from the
> query backend process).
>
> One other thought: I think the primary benefit of this feature is
> being able to see how many read IOs actually hit the disk, as opposed
> to hitting OS page cache. That benefit disappears when using Direct
> IO, because then there's no OS page cache.
>
> How many years away do you think that widespread general use of
> AIO+Direct IO is, though? I think that for the N years from now until
> then, it would be very nice to have this feature to help debug query
> performance problems. Then once the numbers become too
> inaccurate/useless at some point, we could simply remove them again.

AIO efforts are something I haven't fully grasped yet, but Jelte's
comments seem reasonable to me.
Of course, as someone proposing this, I'm naturally biased toward
thinking it’s beneficial.
What do you think?

>> Even without looking ahead to AIO, there's bgwriter, walwriter, and
>> checkpointer processes that all take I/O load away from foreground
>> processes. I don't really believe that this will produce useful
>> numbers.
>
> The bgwriter, walwriter, and checkpointer should only take away
> *write* IOs. For read IOs the numbers should be very accurate and as
> explained above read IOs is where I think the primary benefit of this
> feature is.
>
> But even for write IOs I think the numbers would be useful when
> looking at them with the goal of finding out why a particular query is
> slow: If the bgwriter or checkpointer do the writes, then the query
> should be roughly as fast as if no writes to the disk had taken place
> at all, but if the query process does the writes then those writes are
> probably blocking further execution of the query and thus slowing it
> down.

I agree with this as well.

For example, in a SELECT query executed immediately after a large number
of INSERTs, we can observe that writes to storage occur due to WAL
writes for hint bits.
This makes the query take longer compared to a scenario where these
writes do not occur.
I think we can guess what is happening from the output:

postgres=# insert into t1 (select i, repeat('a', 1000) from
generate_series(1, 1000000) i);
INSERT 0 1000000

postgres=# explain analyze table t1;
QUERY PLAN

-------------------------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..382665.25 rows=21409025 width=36) (actual
time=1.926..11035.531 rows=1000100 loops=1)
Buffers: shared read=168575 dirtied=142858 written=142479
Planning:
Buffers: shared hit=3 read=3 written=1
Storage I/O: read=48 times write=16 times
Planning Time: 4.472 ms
Execution:
Storage I/O: read=2697272 times write=4480096 times // many writes
Execution Time: 11099.424 ms // slow
(9 rows)

postgres=# explain analyze table t1;
QUERY PLAN

------------------------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..382665.25 rows=21409025 width=36) (actual
time=2.066..2926.394 rows=1000100 loops=1)
Buffers: shared read=168575 written=14
Planning Time: 0.295 ms
Execution:
Storage I/O: read=2697200 times write=224 times // few writes
Execution Time: 3016.257 ms // fast
(6 rows)

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.

Attachment Content-Type Size
v2-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch text/x-diff 33.9 KB

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-02-10 13:31:38
Message-ID: edvo7bpgi7rcthj7btsduleep4j6dcnsx3a2aqrwmybetttmkw@g5chr66cckg6
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2025-02-09 21:06:02 +0100, Jelte Fennema-Nio wrote:
> On Sun, 9 Feb 2025 at 19:05, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >
> > Andres Freund <andres(at)anarazel(dot)de> writes:
> > > I'm somewhat against this patch, as it's fairly fundamentally incompatible
> > > with AIO. There's no real way to get information in this manner if the IO
> > > isn't executed synchronously in process context...
>
> Hmm, I had not considered how this would interact with your AIO work.
> I agree that getting this info would be hard/impossible to do
> efficiently, when IOs are done by background IO processes that
> interleave IOs from different queries. But I'd expect that AIOs that
> are done using iouring would be tracked correctly without having to
> change this code at all (because I assume those are done from the
> query backend process).
>
> One other thought: I think the primary benefit of this feature is
> being able to see how many read IOs actually hit the disk, as opposed
> to hitting OS page cache. That benefit disappears when using Direct
> IO, because then there's no OS page cache.
>
> How many years away do you think that widespread general use of
> AIO+Direct IO is, though?

I think it'll always be a subset of use. It doesn't make sense to use DIO for
small databases or untuned databases. Or a system that's deliberately
overcommitted.

But this will also not work with AIO w/ Buffered IO. Which we hope to use much
more commonly.

If suddenly I have to reimplement something like this to work with worker
based IO, it'll certainly take longer to get to AIO.

Greetings,

Andres Freund


From: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-02-10 22:52:17
Message-ID: CAGECzQTydZ33miLd=KdbfrV17RowRGBwQrS28hZ0i6+YhetYgg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 10 Feb 2025 at 14:31, Andres Freund <andres(at)anarazel(dot)de> wrote:
> I think it'll always be a subset of use. It doesn't make sense to use DIO for
> a small databases or untuned databases. Or a system that's deliberately
> overcommmitted.

Thanks, that's useful context.

> But this will also not work with AIO w/ Buffered IO. Which we hope to use much
> more commonly.

To be clear, here you mean worker based AIO right? Because it would
work with io_uring based AIO, right?

> If suddenly I have to reimplement something like this to work with worker
> based IO, it'll certainly take longer to get to AIO.

I totally understand. But in my opinion it would be completely fine to
decide that these new IO stats are simply not available for worker
based IO. Just like they're not available for Windows either with this
patch.

I think it would be a shame to make perfect be the enemy of good here
(as often seems to happen with PG patches). I'd rather have this
feature for some setups, than for no setups at all.


From: Andres Freund <andres(at)anarazel(dot)de>
To: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-02-10 23:30:56
Message-ID: mzjytadtjqv4knvwukppaol5zx2qzt2bkuqvjlo4rjawvhn4ql@qmdf64qkmxjo
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2025-02-10 23:52:17 +0100, Jelte Fennema-Nio wrote:
> On Mon, 10 Feb 2025 at 14:31, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > But this will also not work with AIO w/ Buffered IO. Which we hope to use much
> > more commonly.
>
> To be clear, here you mean worker based AIO right? Because it would
> work with io_uring based AIO, right?

I mostly meant worker based AIO, yes. I haven't checked how accurately these
are kept for io_uring. I would hope they are...

> > If suddenly I have to reimplement something like this to work with worker
> > based IO, it'll certainly take longer to get to AIO.
>
> I totally understand. But in my opinion it would be completely fine to
> decide that these new IO stats are simply not available for worker
> based IO. Just like they're not available for Windows either with this
> patch.

The thing is that you'd often get completely misleading stats. Some of the IO
will still be done by the backend itself, so there will be a non-zero
value. But it will be a significant undercount, because the asynchronously
executed IO won't be tracked (if worker mode is used).

Greetings,

Andres Freund


From: Andres Freund <andres(at)anarazel(dot)de>
To: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-02-10 23:53:01
Message-ID: dfnb7me4hsbdb6mxlosxcfvh4xjdk5qoy42piuu2ahblwmocyf@ma3bh552myoe
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2025-02-10 18:30:56 -0500, Andres Freund wrote:
> On 2025-02-10 23:52:17 +0100, Jelte Fennema-Nio wrote:
> > On Mon, 10 Feb 2025 at 14:31, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > But this will also not work with AIO w/ Buffered IO. Which we hope to use much
> > > more commonly.
> >
> > To be clear, here you mean worker based AIO right? Because it would
> > work with io_uring based AIO, right?
>
> I mostly meant worker based AIO, yes. I haven't checked how accurately these
> are kept for io_uring. I would hope they are...

It does look like it is tracked.

> > > If suddenly I have to reimplement something like this to work with worker
> > > based IO, it'll certainly take longer to get to AIO.
> >
> > I totally understand. But in my opinion it would be completely fine to
> > decide that these new IO stats are simply not available for worker
> > based IO. Just like they're not available for Windows either with this
> > patch.
>
> The thing is that you'd often get completely misleading stats. Some of the IO
> will still be done by the backend itself, so there will be a non-zero
> value. But it will be a significant undercount, because the asynchronously
> executed IO won't be tracked (if worker mode is used).

<clear cache>

postgres[985394][1]=# SHOW io_method ;
┌───────────┐
│ io_method │
├───────────┤
│ worker │
└───────────┘
(1 row)

postgres[985394][1]=# EXPLAIN ANALYZE SELECT count(*) FROM manyrows ;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate (cost=17906.00..17906.01 rows=1 width=8) (actual time=199.494..199.494 rows=1 loops=1) │
│ Buffers: shared read=5406 │
│ I/O Timings: shared read=57.906 │
│ -> Seq Scan on manyrows (cost=0.00..15406.00 rows=1000000 width=0) (actual time=0.380..140.671 rows=1000000 loops=1) │
│ Buffers: shared read=5406 │
│ I/O Timings: shared read=57.906 │
│ Planning: │
│ Buffers: shared hit=41 read=12 │
│ Storage I/O: read=192 times write=0 times │
│ Planning Time: 1.869 ms │
│ Execution Time: 199.554 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

<clear cache>

postgres[1014152][1]=# SHOW io_method ;
┌───────────┐
│ io_method │
├───────────┤
│ io_uring │
└───────────┘
(1 row)

postgres[1014152][1]=# EXPLAIN ANALYZE SELECT count(*) FROM manyrows ;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate (cost=17906.00..17906.01 rows=1 width=8) (actual time=111.591..111.593 rows=1 loops=1) │
│ Buffers: shared read=5406 │
│ I/O Timings: shared read=14.342 │
│ -> Seq Scan on manyrows (cost=0.00..15406.00 rows=1000000 width=0) (actual time=0.161..70.843 rows=1000000 loops=1) │
│ Buffers: shared read=5406 │
│ I/O Timings: shared read=14.342 │
│ Planning: │
│ Buffers: shared hit=41 read=12 │
│ Storage I/O: read=192 times write=0 times │
│ Planning Time: 1.768 ms │
│ Execution: │
│ Storage I/O: read=86496 times write=0 times │
│ Execution Time: 111.670 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Independent of this, it's probably not good that we're tracking shared
buffer hits after IO combining, if I interpret this correctly... That looks to
be an issue in master, not just the AIO branch.

Greetings,

Andres Freund


From: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-02-11 08:59:43
Message-ID: CAGECzQTQ+QxR3__5vrHoAoKHs8Uv+c=ZY3_o1FsvXXx4AaRJJQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 11 Feb 2025 at 00:53, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I mostly meant worker based AIO, yes. I haven't checked how accurately these
> > are kept for io_uring. I would hope they are...
>
> It does look like it is tracked.

nice!

> > The thing is that you'd often get completely misleading stats. Some of the IO
> > will still be done by the backend itself, so there will be a non-zero
> > value. But it will be a significant undercount, because the asynchronously
> > executed IO won't be tracked (if worker mode is used).

Yeah, makes sense. Like I said, I would be completely fine with not
showing these numbers at all/setting them to 0 for setups where we
cannot easily get useful numbers (and this bgworker AIO would be one
of those setups).

> Independent of this, it's probably not good that we're tracking shared
> buffer hits after IO combining, if I interpret this correctly... That looks to
> be an issue in master, not just the AIO branch.

You mean that e.g. a combined IO for 20 blocks still counts only as 1
"shared read"? Yeah, that sounds like a bug.
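To make the suspected undercount concrete: a counter bumped once per issued IO and a counter bumped once per block diverge as soon as IOs are combined. An illustrative sketch with made-up block counts:

```python
# block counts of three combined reads issued for one scan (hypothetical)
combined_reads = [20, 1, 5]

per_io_count = len(combined_reads)     # what a counter bumped once per IO reports
per_block_count = sum(combined_reads)  # what "shared read" is expected to report
```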


From: Andres Freund <andres(at)anarazel(dot)de>
To: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-02-11 15:36:55
Message-ID: 3gz6ymrgnua75aviagsl4d4traoqxo2g2rzzykqa3yl4jyts3y@gj6lcc6aziil
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2025-02-11 09:59:43 +0100, Jelte Fennema-Nio wrote:
> On Tue, 11 Feb 2025 at 00:53, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > The thing is that you'd often get completely misleading stats. Some of the IO
> > > will still be done by the backend itself, so there will be a non-zero
> > > value. But it will be a significant undercount, because the asynchronously
> > > executed IO won't be tracked (if worker mode is used).
>
> Yeah, makes sense. Like I said, I would be completely fine with not
> showing these numbers at all/setting them to 0 for setups where we
> cannot easily get useful numbers (and this bgworker AIO would be one
> of those setups).

Shrug. It means that it'll not work in what I hope will be the default
mechanism before long. I just can't get excited for that. In all likelihood
it'll result in bug reports that I'll then be on the hook to fix.

> > Independent of this, it's probably not good that we're tracking shared
> > buffer hits after IO combining, if I interpret this correctly... That looks to
> > be an issue in master, not just the AIO branch.
>
> You mean that e.g. a combined IO for 20 blocks still counts only as 1
> "shared read"? Yeah, that sounds like a bug.

Yep.

Greetings,

Andres Freund


From: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-02-11 16:00:36
Message-ID: CAGECzQRkmJRyFqH-OVJc7CriLfCcdnj1LC9Bav2e7q7pTDVVgw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 11 Feb 2025 at 16:36, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Shrug. It means that it'll not work in what I hope will be the default
> mechanism before long. I just can't get excited for that. In all likelihood
> it'll result in bug reports that I'll then be on the hook to fix.

My assumption was that io_uring would become the default mechanism on
Linux. And that bgworker based AIO would generally only be used by
platforms without similar functionality. Is that assumption incorrect?


From: Andres Freund <andres(at)anarazel(dot)de>
To: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-02-11 16:19:14
Message-ID: zluvsuokej5tpqirkrmhwdikg5cwhkzu2dvlpkdizlukitu7vl@yaglatfcccho
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2025-02-11 17:00:36 +0100, Jelte Fennema-Nio wrote:
> On Tue, 11 Feb 2025 at 16:36, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Shrug. It means that it'll not work in what I hope will be the default
> > mechanism before long. I just can't get excited for that. In all likelihood
> > it'll result in bug reports that I'll then be on the hook to fix.
>
> My assumption was that io_uring would become the default mechanism on
> Linux. And that bgworker based AIO would generally only be used by
> platforms without similar functionality. Is that assumption incorrect?

Yes, at least initially:
1) it's not enabled on the kernel level everywhere
2) it requires an optional build dependency
3) it requires tuning the file descriptor ulimit, unless we can convince Tom
that it's ok to do that ourselves

Greetings,

Andres Freund


From: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-02-11 17:45:13
Message-ID: CAGECzQRUcy+=R1bpiX+VrK2cuGekdHS1xSdN_esRch2JCKOi3w@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 11 Feb 2025 at 17:19, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Yes, at least initially:

Ah, then I understand your point of view much better. Still I think we
could easily frame it as: If you enable io_uring, you also get these
additional fancy stats.

Also, afaict the items don't have to mean that:

> 1) it's not enabled on the kernel level everywhere

Is that really a common thing to do? From a quick internet search, I
can "only" find that Google does this. (Google is definitely a big
cloud player, so I don't want to suggest that that is not important,
but if it's really the only one, the bulk of systems would still have
io_uring support.)

> 2) it requires an optional build dependency

What build dependency is this? In any case, can't we choose the
default at build time based on the available build dependencies? And
if we cannot, I think we could always add an "auto" default that would
mean the best available AIO implementation (where io_uring is better
than bgworkers).

> 3) it requires tuning the file descriptor ulimit, unless we can convince Tom
> that it's ok to do that ourselves

I think we should just do this, given the reasoning in the blog[1]
from the systemd author I linked in the AIO thread. I agree that a
response/explicit approval from Tom would be nice though.

[1]: https://0pointer.net/blog/file-descriptor-limits.html


From: Andres Freund <andres(at)anarazel(dot)de>
To: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-02-11 18:54:40
Message-ID: cr4v2kb77h3l4v2i26p4ro4wggj2wmpmwkjdlm6ctx2b32kixl@w7sgnvad6zaj
Lists: pgsql-hackers

Hi,

On 2025-02-11 18:45:13 +0100, Jelte Fennema-Nio wrote:
> On Tue, 11 Feb 2025 at 17:19, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Yes, at least initially:
>
> Ah, then I understand your point of view much better. Still I think we
> could easily frame it as: If you enable io_uring, you also get these
> additional fancy stats.

> Also afaict the items don't have to mean that
>
> > 1) it's not enabled on the kernel level everywhere
>
> Is that really a common thing to do? From a quick internet search, I
> can "only" find that Google does this. (Google is definitely a big
> cloud player, so I don't want to suggest that that is not important,
> but if it's really the only one, the bulk of systems would still have
> io_uring support.)

RHEL had it disabled for quite a while, not sure if that's still the case.

> > 2) it requires an optional build dependency
>
> What build dependency is this?

Liburing.

> In any case, can't we choose the default at build time based on the
> available build dependencies? And if we cannot, I think we could always add
> an "auto" default that would mean the best available AIO implementation
> (where io_uring is better than bgworkers).

We could, but because of 3) I don't want to do that right now.

> > 3) it requires tuning the file descriptor ulimit, unless we can convince Tom
> > that it's ok to do that ourselves
>
> I think we should just do this, given the reasoning in the blog[1]
> from the systemd author I linked in the AIO thread. I agree that a
> response/explicit approval from Tom would be nice though.

I think it's the right path, but that's a fight to fight after AIO has been
merged, not before.

Greetings,

Andres Freund


From: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
Cc: andres(at)anarazel(dot)de, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-03-17 23:52:05
Message-ID: CAGECzQT7661fFrdvwLq2tYjL3ovnbomWRqLmwO1vOrHvd=LwKA@mail.gmail.com
Lists: pgsql-hackers

On Mon, 10 Feb 2025 at 14:23, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> wrote:
> Thanks for reviewing the patch and comments!
> Fixed issues you pointed out and attached v2 patch.

This patch needs a rebase, because it's failing to compile currently.
So I marked this as "Waiting on Author" in the commitfest app.


From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Cc: andres(at)anarazel(dot)de, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-03-19 13:15:32
Message-ID: [email protected]
Lists: pgsql-hackers

Hi,

On 2025-03-18 08:52, Jelte Fennema-Nio wrote:
> On Mon, 10 Feb 2025 at 14:23, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
> wrote:
>> Thanks for reviewing the patch and comments!
>> Fixed issues you pointed out and attached v2 patch.
>
> This patch needs a rebase, because it's failing to compile currently.
> So I marked this as "Waiting on Author" in the commitfest app.

Thanks! I've attached an updated patch.

BTW based on your discussion, I thought this patch could not be merged
anytime soon. Does that align with your understanding?

- With bgworker-based AIO, this patch could mislead users into
underestimating the actual storage I/O load, which is undesirable.
- With io_uring-based AIO, this patch could provide meaningful values,
but it may take some time before io_uring sees widespread adoption.

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.

Attachment Content-Type Size
v3-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch text/x-diff 35.5 KB

From: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
To: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
Cc: andres(at)anarazel(dot)de, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-03-22 11:23:26
Message-ID: CAGECzQSV7eS86=mBMr8JcN5ghORckEHUmyqAFJLr7Y+P7NVXBg@mail.gmail.com
Lists: pgsql-hackers

On Wed, 19 Mar 2025 at 14:15, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> wrote:
> BTW based on your discussion, I thought this patch could not be merged
> anytime soon. Does that align with your understanding?

Yeah, that aligns with my understanding. I don't think it's realistic
to get this merged before the code freeze, but I think both of the
below issues could be resolved.

> - With bgworker-based AIO, this patch could mislead users into
> underestimating the actual storage I/O load, which is undesirable.

To resolve this, I think the patch would need to change to not report
anything if bgworker-based AIO is used. So I moved this patch to the
next commitfest, and marked it as "waiting for author" there.

> - With io_uring-based AIO, this patch could provide meaningful values,
> but it may take some time before io_uring sees widespread adoption.

I submitted this patch to help make io_uring-based AIO more of a reality:
https://commitfest.postgresql.org/patch/5570/


From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Cc: andres(at)anarazel(dot)de, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-03-25 01:27:31
Message-ID: [email protected]
Lists: pgsql-hackers

On 2025-03-22 20:23, Jelte Fennema-Nio wrote:

> On Wed, 19 Mar 2025 at 14:15, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
> wrote:
>> BTW based on your discussion, I thought this patch could not be merged
>> anytime soon. Does that align with your understanding?
>
> Yeah, that aligns with my understanding. I don't think it's realistic
> to get this merged before the code freeze, but I think both of the
> below issues could be resolved.
>
>> - With bgworker-based AIO, this patch could mislead users into
>> underestimating the actual storage I/O load, which is undesirable.
>
> To resolve this, I think the patch would need to change to not report
> anything if bgworker-based AIO is used.

Agreed.
I feel the new GUC io_method can be used to determine whether
bgworker-based AIO is being used.

> So I moved this patch to the
> next commitfest, and marked it as "waiting for author" there.

Thanks for moving it.

>> - With io_uring-based AIO, this patch could provide meaningful values,
>> but it may take some time before io_uring sees widespread adoption.
>
> I submitted this patch to help make io_uring-based AIO more of a
> reality:
> https://commitfest.postgresql.org/patch/5570/

Thanks for working on that, too.

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.


From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Cc: andres(at)anarazel(dot)de, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-04-11 13:18:06
Message-ID: [email protected]
Lists: pgsql-hackers

On 2025-03-25 10:27, torikoshia wrote:
> On 2025-03-22 20:23, Jelte Fennema-Nio wrote:
>
>> On Wed, 19 Mar 2025 at 14:15, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
>> wrote:
>>> BTW based on your discussion, I thought this patch could not be
>>> merged
>>> anytime soon. Does that align with your understanding?
>>
>> Yeah, that aligns with my understanding. I don't think it's realistic
>> to get this merged before the code freeze, but I think both of the
>> below issues could be resolved.
>>
>>> - With bgworker-based AIO, this patch could mislead users into
>>> underestimating the actual storage I/O load, which is undesirable.
>>
>> To resolve this, I think the patch would need to change to not report
>> anything if bgworker-based AIO is used.
>
> Agreed.
> I feel the new GUC io_method can be used to determine whether
> bgworker-based AIO is being used.

I took this approach and when io_method=worker, no additional output is
shown in the attached patch.

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.

Attachment Content-Type Size
v4-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch text/x-diff 71.3 KB

From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
Cc: andres(at)anarazel(dot)de, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org, rjuju123(at)gmail(dot)com, Bruce Momjian <bruce(at)momjian(dot)us>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: 2025-05-08 13:51:38
Message-ID: [email protected]
Lists: pgsql-hackers

On 2025-04-11 22:18, torikoshia wrote:
> On 2025-03-25 10:27, torikoshia wrote:
>> On 2025-03-22 20:23, Jelte Fennema-Nio wrote:
>>
>>> On Wed, 19 Mar 2025 at 14:15, torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
>>> wrote:
>>>> BTW based on your discussion, I thought this patch could not be
>>>> merged
>>>> anytime soon. Does that align with your understanding?
>>>
>>> Yeah, that aligns with my understanding. I don't think it's realistic
>>> to get this merged before the code freeze, but I think both of the
>>> below issues could be resolved.
>>>
>>>> - With bgworker-based AIO, this patch could mislead users into
>>>> underestimating the actual storage I/O load, which is undesirable.
>>>
>>> To resolve this, I think the patch would need to change to not report
>>> anything if bgworker-based AIO is used.
>>
>> Agreed.
>> I feel the new GUC io_method can be used to determine whether
>> bgworker-based AIO is being used.
>
> I took this approach and when io_method=worker, no additional output
> is shown in the attached patch.

Rebased the patch.

--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.

Attachment Content-Type Size
v5-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch text/x-diff 71.1 KB