subscription worker signalling wal writer too much

Lists: pgsql-hackers
From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: subscription worker signalling wal writer too much
Date: 2017-06-14 18:55:38
Message-ID: CAMkU=1xD=FN5PSp_yQTKrJ+3aEDg4BUFSvDXzc74UC6kDGPy_g@mail.gmail.com

If I publish a pgbench workload and subscribe to it, the subscription
worker is signalling the wal writer thousands of times a second, once for
every async commit. This has a noticeable performance cost.

I don't think it is ever necessary to signal the wal writer here, unless
the wal writer is taking the long sleep for power-saving purposes. If up
to wal_writer_delay's worth of "committed" transactions get lost on a
crash, it doesn't really matter, because they will be replayed again once
replication resumes. However, might delaying the advance of
hot_standby_feedback by up to wal_writer_delay be a problem? I would think
any set-up which depends on the standby never being a few dozen
milliseconds behind is already doomed.

If I want to suppress signalling, what would be the right way to
communicate to XLogSetAsyncXactLSN that it is being called in a
subscription worker?

Another approach would be to make XLogSetAsyncXactLSN signalling less
aggressive for everyone, not just subscription workers. There is no point
in signalling it more than once for a given WriteRqstPtr page boundary. So
each backend could locally remember the last boundary for which it
signalled wal writer, and not signal again for the same boundary. This
would be especially effective for a subscription worker, as it should be
pretty common for almost all the async commits to be coming from the same
process. Or, the last boundary could be kept in shared memory (XLogCtl) so
all backends can share it, at the expense of additional work done under a
spinlock.
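
Roughly, the per-backend idea is the following (a standalone sketch with made-up names and simplified LSN arithmetic, not the actual XLogSetAsyncXactLSN code):

```c
#include <assert.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192        /* WAL page size, as in PostgreSQL */
typedef uint64_t XLogRecPtr;

/* Per-backend memory of the last page boundary we signalled for.
 * Initialized to an impossible value so the first call always signals. */
static XLogRecPtr lastSignalledBoundary = UINT64_MAX;
static int wakeups = 0;         /* stand-in for SetLatch() on the walwriter */

static void
maybe_wake_walwriter(XLogRecPtr asyncXactLSN)
{
    /* round the async commit's LSN down to its WAL page boundary */
    XLogRecPtr boundary = asyncXactLSN - asyncXactLSN % XLOG_BLCKSZ;

    if (boundary == lastSignalledBoundary)
        return;                 /* already signalled for this page: skip */

    lastSignalledBoundary = boundary;
    wakeups++;                  /* here the real code would wake the walwriter */
}
```

With many async commits landing on the same WAL page, the wake-ups collapse to one per page boundary crossed, which is the consolidation being aimed for.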

Other ideas?

Thanks,

Jeff


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: subscription worker signalling wal writer too much
Date: 2017-06-14 22:08:49
Message-ID: CAMkU=1yz3RYqPta7kzv0dyPQNaGoutkYKmTWBLFKpsPc1DO9nQ@mail.gmail.com

On Wed, Jun 14, 2017 at 11:55 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> If I publish a pgbench workload and subscribe to it, the subscription
> worker is signalling the wal writer thousands of times a second, once for
> every async commit. This has a noticeable performance cost.
>

I've used a local variable to avoid waking up the wal writer more than once
for the same page boundary. This reduces the number of wake-ups by about
7/8.

I'm testing it by doing 1e6 transactions over 8 clients while replication
is in effect, then waiting for the logical replica to catch up. This cycle
takes 183.1 seconds in HEAD, and 162.4 seconds with the attached patch.
N=14, p-value for difference of the means 6e-17.

If I suppress all wake-ups just to see what would happen, it further
reduces the runtime to 153.7.

Cheers,

Jeff

Attachment Content-Type Size
wake_wal_writer_less_aggressively.patch application/octet-stream 877 bytes

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: subscription worker signalling wal writer too much
Date: 2017-06-14 22:20:19
Message-ID: [email protected]

On 2017-06-14 15:08:49 -0700, Jeff Janes wrote:
> On Wed, Jun 14, 2017 at 11:55 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
> > If I publish a pgbench workload and subscribe to it, the subscription
> > worker is signalling the wal writer thousands of times a second, once for
> > every async commit. This has a noticeable performance cost.
> >
>
> I've used a local variable to avoid waking up the wal writer more than once
> for the same page boundary. This reduces the number of wake-ups by about
> 7/8.

Maybe I'm missing something here, but isn't that going to reduce our
guarantees about when asynchronously committed xacts are flushed out?
You can easily fit a number of commits into the same page... As this
isn't specific to logical-rep, I don't think that's ok.

Have you chased down why there's that many wakeups? Normally I'd have
expected that a number of the SetLatch() calls get consolidated
together, but I guess walwriter is "too quick" in waking up and
resetting the latch?

Greetings,

Andres Freund


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: subscription worker signalling wal writer too much
Date: 2017-06-14 23:24:27
Message-ID: CAMkU=1xycMxkVV3ccwdxSF+HgJ1d7YwHf4Y52-A+iDJ5Cmg8Cg@mail.gmail.com

On Wed, Jun 14, 2017 at 3:20 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On 2017-06-14 15:08:49 -0700, Jeff Janes wrote:
> > On Wed, Jun 14, 2017 at 11:55 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
> wrote:
> >
> > > If I publish a pgbench workload and subscribe to it, the subscription
> > > worker is signalling the wal writer thousands of times a second, once
> for
> > > every async commit. This has a noticeable performance cost.
> > >
> >
> > I've used a local variable to avoid waking up the wal writer more than
> once
> > for the same page boundary. This reduces the number of wake-ups by about
> > 7/8.
>
> Maybe I'm missing something here, but isn't that going to reduce our
> guarantees about when asynchronously committed xacts are flushed out?
> You can easily fit a number of commits into the same page... As this
> isn't specific to logical-rep, I don't think that's ok.
>

The guarantee is based on wal_writer_delay, not on SIGUSR1, so I don't think
this changes that. (Also, it isn't really a guarantee; the fsync can take
many seconds to complete once we do initiate it, and there is absolutely
nothing we can do about that, other than doing the fsync synchronously in the
first place.)

The reason for kicking the wal writer at page boundaries is so that hint
bits can get set earlier than they otherwise could. But I don't think
kicking it multiple times per page boundary can help in that effort.

>
> Have you chased down why there's that many wakeups? Normally I'd have
> expected that a number of the SetLatch() calls get consolidated
> together, but I guess walwriter is "too quick" in waking up and
> resetting the latch?
>

I'll have to dig into that some more. The 7/8 reduction I cited was just
in calls to SetLatch from that part of the code; I didn't measure whether
the SetLatch actually called kill(owner_pid, SIGUSR1) or not when I
determined that reduction, so it wasn't truly wake-ups I measured. Actual
wake-ups were measured only indirectly via the impact on performance. I'll
need to figure out how to instrument that without distorting the
performance too much in the process.

Cheers,

Jeff


From: Andres Freund <andres(at)anarazel(dot)de>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: subscription worker signalling wal writer too much
Date: 2017-06-14 23:29:22
Message-ID: [email protected]

On 2017-06-14 16:24:27 -0700, Jeff Janes wrote:
> On Wed, Jun 14, 2017 at 3:20 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> > On 2017-06-14 15:08:49 -0700, Jeff Janes wrote:
> > > On Wed, Jun 14, 2017 at 11:55 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
> > wrote:
> > >
> > > > If I publish a pgbench workload and subscribe to it, the subscription
> > > > worker is signalling the wal writer thousands of times a second, once
> > for
> > > > every async commit. This has a noticeable performance cost.
> > > >
> > >
> > > I've used a local variable to avoid waking up the wal writer more than
> > once
> > > for the same page boundary. This reduces the number of wake-ups by about
> > > 7/8.
> >
> > Maybe I'm missing something here, but isn't that going to reduce our
> > guarantees about when asynchronously committed xacts are flushed out?
> > You can easily fit a number of commits into the same page... As this
> > isn't specific to logical-rep, I don't think that's ok.
> >
>
> The guarantee is based on wal_writer_delay not on SIGUSR1, so I don't think
> this changes that. (Also, it isn't really a guarantee, the fsync can take
> many seconds to complete once we do initiate it, and there is absolutely
> nothing we can do about that, other than do the fsync synchronously in the
> first place).

Well, wal_writer_delay doesn't work if walwriter is in sleep mode, and
this afaics would allow wal writer to go into sleep mode with half a
page filled, and it'd not be woken up again until the page is filled.
No?

> > Have you chased down why there's that many wakeups? Normally I'd have
> > expected that a number of the SetLatch() calls get consolidated
> > together, but I guess walwriter is "too quick" in waking up and
> > resetting the latch?

> I'll have to dig into that some more. The 7/8 reduction I cited was just
> in calls to SetLatch from that part of the code, I didn't measure whether
> the SetLatch actually called kill(owner_pid, SIGUSR1) or not when I
> determined that reduction, so it wasn't truly wake ups I measured. Actual
> wake ups were measured only indirectly via the impact on performance. I'll
> need to figure out how to instrument that without distorting the
> performance too much in the process..

I'd suspect that just measuring the number of kill() calls should be
doable, if measured via perf or something like that.

Greetings,

Andres Freund


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: subscription worker signalling wal writer too much
Date: 2017-06-15 22:06:43
Message-ID: CAMkU=1z6_K4mE12vNirgV5qnu58+cZQdNxD+Pb5bzxnvOVVWoA@mail.gmail.com

On Wed, Jun 14, 2017 at 4:29 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On 2017-06-14 16:24:27 -0700, Jeff Janes wrote:
> > On Wed, Jun 14, 2017 at 3:20 PM, Andres Freund <andres(at)anarazel(dot)de>
> wrote:
> >
> > > On 2017-06-14 15:08:49 -0700, Jeff Janes wrote:
> > > > On Wed, Jun 14, 2017 at 11:55 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
> > > wrote:
> > > >
> > > > > If I publish a pgbench workload and subscribe to it, the
> subscription
> > > > > worker is signalling the wal writer thousands of times a second,
> once
> > > for
> > > > > every async commit. This has a noticeable performance cost.
> > > > >
> > > >
> > > > I've used a local variable to avoid waking up the wal writer more
> than
> > > once
> > > > for the same page boundary. This reduces the number of wake-ups by
> about
> > > > 7/8.
> > >
> > > Maybe I'm missing something here, but isn't that going to reduce our
> > > guarantees about when asynchronously committed xacts are flushed out?
> > > You can easily fit a number of commits into the same page... As this
> > > isn't specific to logical-rep, I don't think that's ok.
> > >
> >
> > The guarantee is based on wal_writer_delay not on SIGUSR1, so I don't
> think
> > this changes that. (Also, it isn't really a guarantee, the fsync can take
> > many seconds to complete once we do initiate it, and there is absolutely
> > nothing we can do about that, other than do the fsync synchronously in
> the
> > first place).
>
> Well, wal_writer_delay doesn't work if walwriter is in sleep mode, and
> this afaics would allow wal writer to go into sleep mode with half a
> page filled, and it'd not be woken up again until the page is filled.
> No?
>

If it is taking the big sleep, then we always wake it up, since
acd4c7d58baf09f.

I didn't change that part. I only changed what we do if it is not hibernating.

> > > Have you chased down why there's that many wakeups? Normally I'd have
> > > expected that a number of the SetLatch() calls get consolidated
> > > together, but I guess walwriter is "too quick" in waking up and
> > > resetting the latch?
>
> > I'll have to dig into that some more. The 7/8 reduction I cited was just
> > in calls to SetLatch from that part of the code, I didn't measure whether
> > the SetLatch actually called kill(owner_pid, SIGUSR1) or not when I
> > determined that reduction, so it wasn't truly wake ups I measured.
> Actual
> > wake ups were measured only indirectly via the impact on performance.
> I'll
> > need to figure out how to instrument that without distorting the
> > performance too much in the process..
>
> I'd suspect that just measuring the number of kill() calls should be
> doable, if measured via perf or something like hta.t
>

It looks like only limited consolidation was happening: the number of kills
invoked was more than half of the number of SetLatch calls. I think the
reason is that when kill(owner_pid, SIGUSR1) is called, the kernel
immediately suspends the calling process and gives the signalled process a
chance to run in that time slice. The WAL writer gets woken up, sees that
it only has a partial page to write and decides not to do anything, but
resets the latch. It does this fast enough that the subscription worker
hasn't migrated CPUs yet.

But, if the WAL Writer sees it has already flushed up to the highest
eligible page boundary, why doesn't the SetAsync code see the same thing?
Because it isn't flushed that high, but only written that high. It looks
like a story of three git commits:

commit 4de82f7d7c50a81ec8e70e2cb0ab413ab9134c0b
Author: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Date: Sun Nov 13 09:00:57 2011 +0000

Wakeup WALWriter as needed for asynchronous commit performance.

commit db76b1efbbab2441428a9ef21f7ac9ba43c52482
Author: Andres Freund <andres(at)anarazel(dot)de>
Date: Mon Feb 15 22:15:35 2016 +0100

Allow SetHintBits() to succeed if the buffer's LSN is new enough.

commit 7975c5e0a992ae9a45e03d145e0d37e2b5a707f5
Author: Andres Freund <andres(at)anarazel(dot)de>
Date: Mon Feb 15 23:52:38 2016 +0100

Allow the WAL writer to flush WAL at a reduced rate.

The first change made the WALWriter wake up and do a write and flush
whenever an async commit occurred and there was a completed WAL page to be
written. This way the hint bits could be set while the heap page was still
hot, because they couldn't get set until the WAL covering the hinted-at
transaction commit was flushed.

The second change said we can set hint bits before the WAL covering the
hinted-at transaction is flushed, as long as the page LSN is newer than that
(so the WAL covering the hinted-at transaction commit must be flushed
before the page containing the hint bit can be written).

Then the third change makes the wal writer only write the full pages most
of the times it is woken up, not flush them. It only flushes them once
every wal_writer_delay or wal_writer_flush_after, regardless of how often
it is woken up.
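
That post-change behaviour can be modelled roughly like this (a toy sketch with made-up names; the real logic lives in the walwriter's flush routine and also flushes on the wal_writer_delay timeout, omitted here for brevity):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192
static const uint64_t WalWriterFlushAfter = 128;  /* in pages */

typedef struct
{
    uint64_t written;   /* bytes written out of wal_buffers so far */
    uint64_t flushed;   /* bytes fsync'd so far */
} ToyWal;

/* One walwriter wakeup: write any completed pages, but only flush once
 * wal_writer_flush_after pages have accumulated since the last flush. */
static bool
toy_background_flush(ToyWal *wal, uint64_t insertedUpto)
{
    uint64_t pageBoundary = insertedUpto - insertedUpto % XLOG_BLCKSZ;

    if (pageBoundary > wal->written)
        wal->written = pageBoundary;            /* write completed pages */

    if ((wal->written - wal->flushed) / XLOG_BLCKSZ >= WalWriterFlushAfter)
    {
        wal->flushed = wal->written;            /* the (rare) flush */
        return true;
    }
    return false;           /* woke up, maybe wrote, did not flush */
}
```

Note how most wakeups advance `written` but leave `flushed` alone, which is exactly why LogwrtResult.Flush stops advancing between timeouts.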

It seems like the third change rendered the first one useless. Wouldn't it
be better, and much simpler, just to have reverted the first change once the
second one was done? If that were done, would the third change still be
needed? (It did seem to add some other features as well, so I'm going to
assume we still want those...)

But now the first change is even worse than useless, it is positively
harmful. The only thing to stop it from waking the WALWriter for every
async commit is this line:

/* if we have already flushed that far, we're done */
if (WriteRqstPtr <= LogwrtResult.Flush)
return;

Since LogwrtResult.Flush doesn't advance anymore, this condition doesn't
become true as a result of us waking walwriter; it only becomes true once
the timeout expires (which it would have done anyway with no help from us),
or once wal_writer_flush_after is exceeded. So now every async commit is
waking the walwriter. Most of those wake-ups do nothing (there is not a
completely new patch to write), some write one completed page but don't
flush anything, and very few do a flush, and that one would have been done
anyway.

So for the sake of wal_writer_flush_after, I guess the simplest thing to do
would be to repeat the flushbytes calculation and wake the walwriter
only if it looks like that will be exceeded? Attached patch does that.

Maybe the computation of flushbytes should be put into a macro, rather than
repeated in the code? Or even done on the fly, I don't see any reason it
needs to be stored in a variable before testing.
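
The proposed wake condition amounts to something like this (a hypothetical standalone function; the actual patch repeats the same computation the walwriter uses for flushbytes):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192
static const uint64_t WalWriterFlushAfter = 128;  /* pages */

/* Wake the walwriter from an async commit only when enough completed
 * pages have accumulated past the last known flush point that it would
 * actually flush something, rather than waking on every page boundary. */
static bool
async_commit_should_wake(uint64_t writeRqstPtr, uint64_t flushedUpto)
{
    uint64_t flushblocks =
        writeRqstPtr / XLOG_BLCKSZ - flushedUpto / XLOG_BLCKSZ;

    return flushblocks >= WalWriterFlushAfter;
}
```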

This new patch is simpler than the previous one, and more effective at
speeding up replication. I assume it would speed up pgbench with
synchronous_commit turned off (or against unlogged tables) as well, but I
don't think I have the hardware needed to test that.

Cheers,

Jeff

Attachment Content-Type Size
change_walwriter_wakeup.patch application/octet-stream 915 bytes

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: subscription worker signalling wal writer too much
Date: 2017-06-20 18:31:48
Message-ID: CAMkU=1xBiqfLR+=h+hqKsxnZay4kwhVzu0kCShxsNggdqqdRFg@mail.gmail.com

On Thu, Jun 15, 2017 at 3:06 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

>
> This new patch is simpler than the previous one, and more effective at
> speeding up replication. I assume it would speed up pgbench with
> synchronous_commit turned off (or against unlogged tables) as well, but I
> don't think I have the hardware needed to test that.
>

If I use the 'tpcb-func' script embodied in the attached patch to pgbench,
then I can see the performance difference against unlogged tables using 8
clients on an 8-CPU virtual machine. The normal tpcb-like script has too
much communication overhead, bouncing from pgbench to the postgres backend
7 times per transaction, to see the difference. I also had to
set autovacuum_vacuum_cost_delay=0; otherwise auto-analyze holds a
snapshot long enough to bloat the HOT chains, which injects a great deal of
variability into the timings.

Commit 7975c5e0a992ae9 in the 9.6 branch causes a regression of about 10%,
and my patch from the previous email redeems that regression. It also
gives the same improvement against 10dev HEAD.

Cheers,

Jeff

Attachment Content-Type Size
pgbench_function.patch application/octet-stream 1.7 KB

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: subscription worker signalling wal writer too much
Date: 2017-06-25 00:09:52
Message-ID: [email protected]

Hi,

On 2017-06-15 15:06:43 -0700, Jeff Janes wrote:
> > Well, wal_writer_delay doesn't work if walwriter is in sleep mode, and
> > this afaics would allow wal writer to go into sleep mode with half a
> > page filled, and it'd not be woken up again until the page is filled.
> > No?
> >
>
> If it is taking the big sleep, then we always wake it up, since
> acd4c7d58baf09f.
>
> I didn't change that part. I only changed what we do if it not hibernating.

Right, I hadn't read enough of the context.

> It looks like only limited consolidation was happening, the number of kills
> invoked was more than half of the number of SetLatch. I think the reason
> is what when kill(owner_pid, SIGUSR1) is called, the kernel immediately
> suspends the calling process and gives the signalled process a chance to
> run in that time slice. The Wal Writer gets woken up, sees that it only
> has a partial page to write and decides not to do anything, but resets the
> latch. It does this fast enough that the subscription worker hasn't
> migrated CPUs yet.

I think part of the problem here is that latches signal the owning
process even if the owning process isn't actually sleeping - that's
obviously quite pointless. In cases where walwriter is busy, that
actually causes a lot of superfluous interrupted syscalls, because it
actually never ends up waiting on the latch. There's also a lot of
superfluous context switches. I think we can avoid doing so quite
easily, as e.g. with the attached patch. Could you check how much that
helps your benchmark?
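
The idea can be illustrated with a toy latch (field names are illustrative, not the real Latch struct): the owner advertises when it is actually about to sleep, and the setter only sends the signal in that case.

```c
#include <assert.h>
#include <signal.h>

typedef struct
{
    volatile sig_atomic_t is_set;        /* the latch itself */
    volatile sig_atomic_t owner_waiting; /* owner sets this around its wait */
    int signals_sent;                    /* stand-in for kill(owner_pid, SIGUSR1) */
} ToyLatch;

static void
toy_set_latch(ToyLatch *latch)
{
    latch->is_set = 1;

    /* A busy owner will notice is_set on its next loop iteration anyway;
     * only interrupt it with a signal when it is (about to be) sleeping.
     * A real implementation needs a memory barrier here to avoid racing
     * with the owner checking the latch just before it sleeps. */
    if (latch->owner_waiting)
        latch->signals_sent++;
}
```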

> The first change made the WALWriter wake up and do a write and flush
> whenever an async commit occurred and there was a completed WAL page to be
> written. This way the hint bits could be set while the heap page was still
> hot, because they couldn't get set until the WAL covering the hinted-at
> transaction commit is flushed.
>
> The second change said we can set hint bits before the WAL covering the
> hinted-at transaction is flushed, as long the page LSN is newer than that
> (so the wal covering the hinted-at transaction commit must be flushed
> before the page containing the hint bit can be written).
>
> Then the third change makes the wal writer only write the full pages most
> of the times it is woken up, not flush them. It only flushes them once
> every wal_writer_delay or wal_writer_flush_after, regardless of how often
> it is woken up.
>
> It seems like the third change rendered the first one useless.

I don't think so. Isn't the walwriter writing out the contents of the
WAL quite important, because it makes room in wal_buffers for new WAL?

I suspect we actually should wake up wal-writer *more* aggressively, to
offload wal fsyncs from individual backends, even when they're not
committing. Right now we fsync whenever a segment is finished - we
really don't want to do that in backends that could do other useful
work. I suspect it'd be a good idea to remove that logic from
XLogWrite() and replace it with
if (ProcGlobal->walwriterLatch)
SetLatch(ProcGlobal->walwriterLatch);

> Wouldn't it
> better, and much simpler, just to have reverted the first change once the
> second one was done?

I think both can actually happen independently, no? It's pretty common
for the page lsn to be *older* than the corresponding commit. In fact
you'll always hit that case unless there are concurrent writes also
touching said page.

> If that were done, would the third change still be
> needed? (It did seem to add some other features as well, so I'm going to
> assume we still want those...).

It's still useful, yes. It avoids flushing the WAL too often
(page-by-page when there's a lot of writes), which can eat up a lot of
IOPS on fast storage.

> But now the first change is even worse than useless, it is positively
> harmful. The only thing to stop it from waking the WALWriter for every
> async commit is this line:
>
> /* if we have already flushed that far, we're done */
> if (WriteRqstPtr <= LogwrtResult.Flush)
> return;
>
> Since LogwrtResult.Flush doesn't advance anymore, this condition doesn't
> becomes false due to us waking walwriter, it only becomes false once the
> timeout expires (which it would have done anyway with no help from us), or
> once wal_writer_flush_after is exceeded. So now every async commit is
> waking the walwriter. Most of those wake up do nothing (there is not a
> completely new patch to write), some write one completed page but don't
> flush anything, and very few do a flush, and that one would have been done
> anyway.

s/completely new patch/completely new page/?

In my opinion we actually *do* want to write (but not flush!) out such
pages, so I'm not sure I agree with that logic. Have to think about
this some more...

Greetings,

Andres Freund

Attachment Content-Type Size
0001-Don-t-signal-latch-owners-if-owner-not-waiting-on-la.patch text/x-patch 4.2 KB

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: subscription worker signalling wal writer too much
Date: 2017-07-04 03:26:20
Message-ID: CAMkU=1znvaWoxD6Rp6qFjmXdzCkD2_PWozJKr3_jENppB-3XRA@mail.gmail.com

On Sat, Jun 24, 2017 at 5:09 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2017-06-15 15:06:43 -0700, Jeff Janes wrote:
>
> > It looks like only limited consolidation was happening, the number of
> kills
> > invoked was more than half of the number of SetLatch. I think the reason
> > is what when kill(owner_pid, SIGUSR1) is called, the kernel immediately
> > suspends the calling process and gives the signalled process a chance to
> > run in that time slice. The Wal Writer gets woken up, sees that it only
> > has a partial page to write and decides not to do anything, but resets
> the
> > latch. It does this fast enough that the subscription worker hasn't
> > migrated CPUs yet.
>
> I think part of the problem here is that latches signal the owning
> process even if the owning process isn't actually sleeping - that's
> obviously quite pointless. In cases where walwriter is busy, that
> actually causes a lot of superflous interrupted syscalls, because it
> actually never ends up waiting on the latch. There's also a lot of
> superflous context switches. I think we can avoid doing so quite
> easily, as e.g. with the attached patch. Could you check how much that
> helps your benchmark?
>

That patch doesn't make any improvement to the pgbench --unlogged-tables
benchmark. I expected that, because this benchmark exposes the opposite
problem. The wal writer isn't busy, it is constantly being woken up but
then immediately going back to sleep.

> > The first change made the WALWriter wake up and do a write and flush
> > whenever an async commit occurred and there was a completed WAL page to
> be
> > written. This way the hint bits could be set while the heap page was
> still
> > hot, because they couldn't get set until the WAL covering the hinted-at
> > transaction commit is flushed.
> >
> > The second change said we can set hint bits before the WAL covering the
> > hinted-at transaction is flushed, as long the page LSN is newer than that
> > (so the wal covering the hinted-at transaction commit must be flushed
> > before the page containing the hint bit can be written).
> >
> > Then the third change makes the wal writer only write the full pages most
> > of the times it is woken up, not flush them. It only flushes them once
> > every wal_writer_delay or wal_writer_flush_after, regardless of how often
> > it is woken up.
> >
> > It seems like the third change rendered the first one useless.
>
> I don't think so. Isn't the walwriter writing out the contents of the
> WAL is quite important because it makes room in wal_buffers for new WAL?
>

If we wanted to do that, I think the asynchronous committer should just
issue the write directly (unless wal_sync_method is open_datasync
or open_sync). Writes are usually extremely fast, and it is not worth trying
to offload them to a background process, as your half of the signalling
takes longer than just doing the write yourself. When that is not true and
writes start blocking due to extreme I/O congestion, then everything is
going to start blocking anyway (the wal_buffers head will run into the tail,
and the tail can't advance because the writes are blocking), so while
offloading writes to a background process is then good in theory, it isn't
very effective. If everyone is stuck, it doesn't matter which process is
directly stuck and which ones are indirectly stuck.

>
> I suspect we actually should wake up wal-writer *more* aggressively, to
> offload wal fsyncs from individual backends, even when they're not
> committing. Right now we fsync whenever a segment is finished - we
> really don't want to do that in backends that could do other useful
> work. I suspect it'd be a good idea to remove that logic from
> XLogWrite() and replace it with
> if (ProcGlobal->walwriterLatch)
> SetLatch(ProcGlobal->walwriterLatch);
>
>
My understanding is that you can't start on a new file until the old file
is completely synced, because the bookkeeping can currently only handle
one file at a time. So if you signal the wal writer to do the sync, you
would just end up immediately blocking on it to finish. Getting around
that would require substantially more changes; we would need to implement a
queue of full log files not yet synced so that we could move on without
waiting. If we had such a queue, then sure, this would be a good idea to
just wake the wal writer. But currently, it only matters for large
transactions (\copy, insert ... select, bulk updates). With small
transactions, you can't fill up log files quickly enough for it to make
much of a difference. So I think this is something to do, but it is
independent of what I am currently going on about.

>
> > Wouldn't it
> > better, and much simpler, just to have reverted the first change once the
> > second one was done?
>
> I think both can actually happen independently, no? It's pretty common
> for the page lsn to be *older* than the corresponding commit. In fact
> you'll always hit that case unless there's also concurrent writes also
> touching said page.
>

That is true, but I don't see any residual regression when going from
pre-db76b1efbbab2441428 to 7975c5e0a992ae9a4-with-my-patch applied. So I
think the hint bit issue only shows up when hitting the same page
aggressively, in which case the LSN does get advanced quickly enough that
db76b1efbba solves it. Perhaps a different test case could show that
issue. I also didn't see any improvement from the original 4de82f7d7c50a81e,
so maybe 8 CPUs just isn't enough to detect the problem with setting
hint bits on permanent tables with synchronous_commit=false.

>
> > If that were done, would the third change still be
> > needed? (It did seem to add some other features as well, so I'm going to
> > assume we still want those...).
>
> It's still useful, yes. It avoids flushing the WAL too often
> (page-by-page when there's a lot of writes), which can eat up a lot of
> IOPS on fast storage.
>

But the only reason it would do that is because we keep waking it up. If
we stop waking it up, it would stop flushing each page, even if it still
had the logic to flush for each page there.

>
>
> > But now the first change is even worse than useless, it is positively
> > harmful. The only thing to stop it from waking the WALWriter for every
> > async commit is this line:
> >
> > /* if we have already flushed that far, we're done */
> > if (WriteRqstPtr <= LogwrtResult.Flush)
> > return;
> >
> > Since LogwrtResult.Flush doesn't advance anymore, this condition doesn't
> > becomes false due to us waking walwriter, it only becomes false once the
> > timeout expires (which it would have done anyway with no help from us),
> or
> > once wal_writer_flush_after is exceeded. So now every async commit is
> > waking the walwriter. Most of those wake up do nothing (there is not a
> > completely new patch to write), some write one completed page but don't
> > flush anything, and very few do a flush, and that one would have been
> done
> > anyway.
>
> s/completely new patch/completely new page/?
>

Yes. My mental tab completion sometimes goes awry.

>
> In my opinion we actually *do* want to write (but not flush!) out such
> pages, so I'm not sure I agree with that logic. Have to think about
> this some more...
>

I think if we want to write each page, we should do it in the foreground.
Conceptually, it could just be another enum value in synchronous_commit,
but synchronous_commit is getting rather unwieldy already.

Cheers,

Jeff


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: subscription worker signalling wal writer too much
Date: 2017-08-25 17:03:50
Message-ID: CA+TgmoaY88+5jjSR2RYEwOpU1jjS0tK5i2XUsMfwz+yLU7OWXQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jul 3, 2017 at 11:26 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> My understanding is that you can't start on a new file until the old file is
> completely synced, because the bookkeeping can currently only handle one
> file at a time. So if you signal the wal writer to do the sync, you would
> just end up immediately blocking on it to finish. Getting around that would
> require substantially more changes; we would need to implement a queue of
> full log files not yet synced so that we could move on without waiting.

Yeah. I bet that effort would be reasonably worthwhile. I believe
that the forced end-of-segment syncs are costing us a noticeable
amount of performance. Once or twice I took a very half-hearted run
at doing what you describe here, but gave up pretty quickly; it seems
like a somewhat tricky problem.
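At its core, the bookkeeping Jeff describes could be a small ring buffer of segment numbers awaiting fsync. This is a hypothetical sketch with invented names (nothing here exists in PostgreSQL); the genuinely tricky parts, the locking and the ordering guarantees around sync requests, are deliberately left out:

```c
#include <stdbool.h>
#include <stdint.h>

#define PENDING_SYNC_MAX 16     /* arbitrary capacity for the sketch */

typedef struct PendingSyncQueue
{
    uint64_t segs[PENDING_SYNC_MAX];
    int      head;              /* next slot to drain */
    int      count;             /* entries currently queued */
} PendingSyncQueue;

/* Backend side: remember a just-completed segment and move on to the
 * next one.  Returns false when full, forcing the caller back to the
 * current "block until the sync finishes" behaviour. */
bool pending_sync_enqueue(PendingSyncQueue *q, uint64_t segno)
{
    if (q->count == PENDING_SYNC_MAX)
        return false;
    q->segs[(q->head + q->count) % PENDING_SYNC_MAX] = segno;
    q->count++;
    return true;
}

/* Wal writer side: pick up the next segment to fsync in the background.
 * Returns false when nothing is pending. */
bool pending_sync_dequeue(PendingSyncQueue *q, uint64_t *segno)
{
    if (q->count == 0)
        return false;
    *segno = q->segs[q->head];
    q->head = (q->head + 1) % PENDING_SYNC_MAX;
    q->count--;
    return true;
}
```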

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: subscription worker signalling wal writer too much
Date: 2017-08-26 21:45:20
Message-ID: CAMkU=1zP931=v-ypFZMDxHAGGQwuVNUPONZE5_ggV+onkXEdvQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jul 3, 2017 at 8:26 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Sat, Jun 24, 2017 at 5:09 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
>> On 2017-06-15 15:06:43 -0700, Jeff Janes wrote:
>>
>>
>>
>
>> > Wouldn't it
>> > better, and much simpler, just to have reverted the first change once
>> the
>> > second one was done?
>>
>> I think both can actually happen independently, no? It's pretty common
>> for the page lsn to be *older* than the corresponding commit. In fact
>> you'll always hit that case unless there's also concurrent writes also
>> touching said page.
>>
>
> That is true, but I don't see any residual regression when going from
> pre-db76b1efbbab2441428 to 7975c5e0a992ae9a4-with-my-patch applied. So I
> think the hint bit issue only shows up when hitting the same page
> aggressively, in which case the LSN does get advanced quickly enough that
> db76b1efbba solves it. Perhaps a different test case could show that
> issue. I also didn't see any improvement from the original 4de82f7d7c50a81e,
> so maybe 8 CPUs just isn't enough to detect the problem with setting
> hint-bits with permanent tables with synchronous_commit=false.
>

I've refined my benchmarking, using a smaller scaling factor (8, same as my
CPU count) and running the tests for longer. The smaller size means there
are more intensive updates on individual pgbench_accounts pages, and the
longer run times allow unset hint bits to build up for longer (usually
I like short test runs with a large number of randomized repetitions, but
that doesn't work well here). Since the scale factor is the same as the
number of CPUs, I've bumped the thread count so that it is more likely
someone will choose a non-contended value of pgbench_branches.bid to update
at any given moment.

pgbench -c16 -j16 -T1200 -M prepared -b tpcb-func --startup='set
synchronous_commit=false'

This testing did show the importance of waking up the wal writer so that
hint bits can be set:

commit TPS
cfafd8beadcd6f 22200.48
cfafd8beadcd6f_nowake 30766.83
median of 14 runs reported, p-val on difference of means is 2e-22.

Where nowake is a hackish disabling of the wake-up introduced in
4de82f7d7c50a8,
forward ported to the 9.6 branch. (I still wake it if it is doing the big
sleep; I just took out the part that wakes it up even from small sleeps.)

So my revised benchmark is capable of detecting this effect, even with only
8 CPUs.

When I move to the next commit, where we set hint bits as long as the page
was re-dirtied afterward, I get these numbers:

db76b1efbbab24 30742.69
db76b1efbbab24_nowake 31072.97

The difference between these is not statistically significant, nor is the
difference between them and cfafd8beadcd6f.

So I think the logic you introduced in db76b1efbbab24 captures all the gain
there is to be captured, and 4de82f7d7c50a8 can be reverted. The only
reason to wake up the wal writer is if it is taking the big sleep.

Of course, I can't prove a negative. There could always be some test case
which demonstrates that 4de82f7d7c50a8 still matters in spite of
db76b1efbbab24.
So to be ultra-conservative, attached is a patch which should preserve all
wake-ups other than the ones known to be useless.

7975c5e0a 27810.66
7975c5e0a_patched 30821.41

So 7975c5 causes a significant regression (10% regression, p-val 7e-16 on
difference of the means). It repeatedly wakes the wal-writer up to flush a
page which the wal-writer simply refuses to flush. This can't serve any
purpose. My patch still wakes it up to write the page if it has not been
written yet, and also still wakes it to flush a range of pages if
WalWriterFlushAfter is met. But it just doesn't wake it up to flush a page
which has already been written and will not get flushed.
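Boiled down to a single predicate, the conservative rule amounts to something like the following. This is a sketch with invented names: the real patch works against LogwrtResult inside XLogSetAsyncXactLSN, writtenUpto/flushedUpto stand in for LogwrtResult.Write/.Flush, and 1 MB is just the wal_writer_flush_after default (a GUC in reality).

```c
#include <stdbool.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192
typedef uint64_t XLogRecPtr;

/* 1 MB, the wal_writer_flush_after default; a GUC in reality. */
static const uint64_t WalWriterFlushAfter = 1024 * 1024;

bool should_wake_walwriter(XLogRecPtr asyncXactLSN,
                           XLogRecPtr writtenUpto,   /* LogwrtResult.Write */
                           XLogRecPtr flushedUpto)   /* LogwrtResult.Flush */
{
    /* A completed page is still unwritten: wake to write it, which is
     * what lets hint bits be set promptly. */
    if (asyncXactLSN - (asyncXactLSN % XLOG_BLCKSZ) > writtenUpto)
        return true;

    /* Enough written-but-unflushed WAL has accumulated to flush. */
    if (writtenUpto - flushedUpto >= WalWriterFlushAfter)
        return true;

    /* The page is already written and below the flush threshold; the
     * wal writer would refuse to flush it, so the wakeup is useless. */
    return false;
}
```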

I'd prefer the code-simplifying change of just not waking it up anymore
except for when it is doing the big sleep, but I can't reliably detect a
performance difference between that and the more conservative change posted
here.

Cheers,

Jeff

Attachment Content-Type Size
change_walwriter_wakeup_v2.patch application/octet-stream 1013 bytes

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: jeff(dot)janes(at)gmail(dot)com
Cc: andres(at)anarazel(dot)de, pgsql-hackers(at)postgresql(dot)org
Subject: Re: subscription worker signalling wal writer too much
Date: 2017-09-19 12:48:16
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

At Sat, 26 Aug 2017 14:45:20 -0700, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote in <CAMkU=1zP931=v-ypFZMDxHAGGQwuVNUPONZE5_ggV+onkXEdvQ(at)mail(dot)gmail(dot)com>
> On Mon, Jul 3, 2017 at 8:26 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
> > On Sat, Jun 24, 2017 at 5:09 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> >> On 2017-06-15 15:06:43 -0700, Jeff Janes wrote:
> >>
> >>
> >>
> >
> >> > Wouldn't it
> >> > better, and much simpler, just to have reverted the first change once
> >> the
> >> > second one was done?
> >>
> >> I think both can actually happen independently, no? It's pretty common
> >> for the page lsn to be *older* than the corresponding commit. In fact
> >> you'll always hit that case unless there's also concurrent writes also
> >> touching said page.
> >>
> >
> > That is true, but I don't see any residual regression when going from
> > pre-db76b1efbbab2441428 to 7975c5e0a992ae9a4-with-my-patch applied. So I
> > think the hint bit issue only shows up when hitting the same page
> > aggressively, in which case the LSN does get advanced quickly enough that
> > db76b1efbba solves it. Perhaps a different test case could show that
> > issue. I also didn't see any improvement from the original 4de82f7d7c50a81e,
> > so maybe 8 CPUs just isn't enough to detect the problem with setting
> > hint-bits with permanent tables with synchronous_commit=false.
> >
>
> I've refined my benchmarking, using a smaller scaling factor (8, same as my
> CPU count) and running the tests for longer. The smaller size means there
> are more intensive updates on individual pgbench_accounts pages, and the
> longer run times allow unset hint bits to build up for longer (usually
> I like short test runs with a large number of randomized repetitions, but
> that doesn't work well here). Since the scale factor is the same as the
> number of CPUs, I've bumped the thread count so that it is more likely
> someone will choose a non-contended value of pgbench_branches.bid to update
> at any given moment.
>
> pgbench -c16 -j16 -T1200 -M prepared -b tpcb-func --startup='set
> synchronous_commit=false'
>
> This testing did show the importance of waking up the wal writer so that
> hint bits can be set:
>
> commit TPS
> cfafd8beadcd6f 22200.48
> cfafd8beadcd6f_nowake 30766.83
> median of 14 runs reported, p-val on difference of means is 2e-22.
>
> Where nowake is a hackish disabling of the wake up introduced in
> 4de82f7d7c50a8,
> forward ported to the 9.6 branch. (I still wake it if it is doing the big
> sleep; I just took out the part that wakes it up even from small sleeps.)
>
> So my revised benchmark is capable of detecting this effect, even with only
> 8 CPUs.
>
> When I move to the next commit, where we set hint bits as long as the page
> was re-dirtied afterward, I get these numbers:
>
> db76b1efbbab24 30742.69
> db76b1efbbab24_nowake 31072.97
>
> The difference between these is not statistically significant, nor is the
> difference between them and cfafd8beadcd6f.
>
> So I think the logic you introduced in db76b1efbbab24 captures all the gain
> there is to be captured, and 4de82f7d7c50a8 can be reverted. The only
> reason to wake up the wal writer is if it is taking the big sleep.
>
> Of course, I can't prove a negative. There could always be some test case
> which demonstrates that 4de82f7d7c50a8 still matters in spite of
> db76b1efbbab24.
> So to be ultra-conservative, attached is a patch which should preserve all
> wake-ups other than the ones known to be useless.
>
> 7975c5e0a 27810.66
> 7975c5e0a_patched 30821.41
>
> So 7975c5 causes a significant regression (10% regression, p-val 7e-16 on
> difference of the means). It repeatedly wakes the wal-writer up to flush a
> page which the wal-writer simply refuses to flush. This can't serve any
> purpose. My patch still wakes it up to write the page if it has not been
> written yet, and also still wakes it to flush a range of pages if
> WalWriterFlushAfter is met. But it just doesn't wake it up to flush a page
> which has already been written and will not get flushed.
>
> I'd prefer the code-simplifying change of just not waking it up anymore
> except for when it is doing the big sleep, but I can't reliably detect a
> performance difference between that and the more conservative change posted
> here.

FWIW I tried this patch, taking the following numbers using
pgbench --unlogged-tables -T 60 -j 8 -c 8. All numbers are from
the 2nd of two successive runs.

TA: # of times XLogSetAsyncXactLSN was called while the wal writer was awake.
TB: # of times XLogSetAsyncXactLSN skipped SetLatch in the TA case.
WA: # of times WalWriterMain called XLogBackgroundFlush().
WB: # of times XLogBackgroundFlush() skipped flushing in the WA case.

A: without this patch : 1745tps(wal_writer_flush_after = 1MB (default))

WB/WA = 9/141100 : 0.9% of all XLogBackgroundFlush() did nothing.
(woke by SetLatch 141091 times)
TB/TA =27300/103000 : 26% of SetLatch is avoided. (12899 times called)

B: with this patch : 2041 tps (wal_writer_flush_after = 1MB (default))

WB/WA = 46/ 1600 : 2.8% of all XLogBackgroundFlush() did nothing.
(woke by SetLatch 1553 times)
TB/TA =115822/118000: 97% of SetLatch is avoided. (4178 times called)

C: with this patch : 2033 tps (But wal_writer_flush_after = 0)

WB/WA = 117/1700 : 6.9% of all XLogBackgroundFlush() did nothing.
(woke by SetLatch 1582 times)
TB/TA =110615/118000: 93% of SetLatch is avoided. (7785 times called)

The condition for skipping SetLatch in the latest patch (v2)
looks reasonable, and the effect is obvious, at least in this
test. The "needless" SetLatch calls are very effectively reduced,
which actually improves performance. The increased number of bytes
flushed at once seems to be the main cause of the improvement.

Though I am a bit uneasy about having similar code on both sides
(XLogBackgroundFlush and XLogSetAsyncXactLSN), +1 to this from
me.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: subscription worker signalling wal writer too much
Date: 2017-10-05 13:59:53
Message-ID: CA+TgmoZhw_nyrxYUqSuDh-XEJya-YzfwShki5NuDS=UfgvodrA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 19, 2017 at 8:48 AM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Though I am a bit uneasy about having similar code on both sides
> (XLogBackgroundFlush and XLogSetAsyncXactLSN), +1 to this from
> me.

This patch wasn't formatted very nicely; attached is a version that
pgindent likes better and doesn't bust past 80 columns.

I think this patch needs significantly better comments. There's no
explanation of the rather complex formula the patch proposed to use.
Maybe it's just copied from someplace else that does have a comment,
but in that case, the comments here should mention that. And the
comment in the other place should probably be updated too, like:

/* Make sure to keep this in sync with the similar logic in
XLogDoAwesomeStuff() */

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
change_walwriter_wakeup_v3.patch application/octet-stream 833 bytes

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] subscription worker signalling wal writer too much
Date: 2017-11-21 06:44:48
Message-ID: CAB7nPqSH_OiiTveDsBdgP07C5_nrahTQwauUBLnpGN6SWdq7ew@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 5, 2017 at 10:59 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> This patch wasn't formatted very nicely; attached is a version that
> pgindent likes better and doesn't bust past 80 columns.
>
> I think this patch needs significantly better comments. There's no
> explanation of the rather complex formula the patch proposed to use.
> Maybe it's just copied from someplace else that does have a comment,
> but in that case, the comments here should mention that. And the
> comment in the other place should probably be updated too, like:
>
> /* Make sure to keep this in sync with the similar logic in
> XLogDoAwesomeStuff() */

I have just gone through this thread, and reducing the number of times
the WAL writer is woken up is something worth doing. The last version
is v3, which Robert sent in
https://www.postgresql.org/message-id/CA+TgmoZhw_nyrxYUqSuDh-XEJya-YzfwShki5NuDS=UfgvodrA@mail.gmail.com.
As far as I can see, the smarter formula presented here comes from
XLogBackgroundFlush(). Jeff, could you update the patch to reflect
that the new condition you are adding comes from there? I am switching
the patch as "waiting on author" for now.
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] subscription worker signalling wal writer too much
Date: 2017-11-28 01:46:06
Message-ID: CAB7nPqQXTLzVtuTozyuiTDfB30iGrBPoeCJMXKuD-J2Lopn4jA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 21, 2017 at 3:44 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> I have just gone through this thread, and reducing the number of times
> the WAL writer is woken up is something worth doing. The last version
> is v3, which Robert sent in
> https://www.postgresql.org/message-id/CA+TgmoZhw_nyrxYUqSuDh-XEJya-YzfwShki5NuDS=UfgvodrA@mail.gmail.com.
> As far as I can see, the smarter formula presented here comes from
> XLogBackgroundFlush(). Jeff, could you update the patch to reflect
> that the new condition you are adding comes from there? I am switching
> the patch as "waiting on author" for now.

It has been one week since this update; in consequence I am marking
this patch as returned with feedback for now. Please make sure to
register a new CF entry if you can get a new version out.
--
Michael