Lists: | pgsql-hackers |
---|
From: | Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Subject: | logical copy_replication_slot issues |
Date: | 2020-02-09 16:28:59 |
Message-ID: | 871rr3ohbo.fsf@ars-thinkpad |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
While jumping around partically decoded xacts questions [1], I've read
through the copy replication slots code (9f06d79ef) and found a couple
of issues.
1) It seems quite reckless to me to dive into
DecodingContextFindStartpoint without actual WAL reservation (donors
slot restart_lsn is used, but it is not acquired). Why risking erroring
out with WAL removal error if the function advances new slot position to
updated donors one in the end anyway?
2) In the end, restart_lsn of new slot is set to updated donors
one. However, confirmed_flush field is not updated. This is just wrong
-- we could start decoding too early and stream partially decoded
transaction.
I'd probably avoid doing DecodingContextFindStartpoint at all. Its only
purpose is to assemble consistent snapshot (and establish corresponding
<restart_lsn, confirmed_flush_lsn> pair), but donor slot must have
already done that and we could use it as well. Was this considered?
--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> |
---|---|
To: | Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Subject: | Re: logical copy_replication_slot issues |
Date: | 2020-02-10 05:01:44 |
Message-ID: | CA+fd4k70BXLTm-N6q18LrL=GbKtwY3-2++UVFw05SvFTkZgTyQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, 10 Feb 2020 at 01:29, Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> wrote:
>
> Hi,
>
> While jumping around partically decoded xacts questions [1], I've read
> through the copy replication slots code (9f06d79ef) and found a couple
> of issues.
>
> 1) It seems quite reckless to me to dive into
> DecodingContextFindStartpoint without actual WAL reservation (donors
> slot restart_lsn is used, but it is not acquired). Why risking erroring
> out with WAL removal error if the function advances new slot position to
> updated donors one in the end anyway?
Good catch. It's possible that DecodingContextFindStartpoint could
fail when the restart_lsn of the source slot is advanced and removed
required WAL.
>
> 2) In the end, restart_lsn of new slot is set to updated donors
> one. However, confirmed_flush field is not updated. This is just wrong
> -- we could start decoding too early and stream partially decoded
> transaction.
I think you are right.
>
> I'd probably avoid doing DecodingContextFindStartpoint at all. Its only
> purpose is to assemble consistent snapshot (and establish corresponding
> <restart_lsn, confirmed_flush_lsn> pair), but donor slot must have
> already done that and we could use it as well. Was this considered?
Skipping doing DecodingContextFindStartpoint while creating a new
destination logical slot seems sensible to me.
I've attached the draft patch fixing this issue but I'll continue
investigating it more deeply.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
fix_copy_slot.patch | application/octet-stream | 2.9 KB |
From: | Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> |
---|---|
To: | Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Subject: | Re: logical copy_replication_slot issues |
Date: | 2020-02-10 14:01:25 |
Message-ID: | 87v9oelex6.fsf@ars-thinkpad |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> writes:
> I've attached the draft patch fixing this issue but I'll continue
> investigating it more deeply.
There also should be a check that source slot itself has consistent
snapshot (valid confirmed_flush) -- otherwise it might be possible to
create not initialized slot which is probably not an error, but weird
and somewhat meaningless. Paranoically, this ought to be checked in both
src slot lookups.
With this patch it seems like the only thing
create_logical_replication_slot does is ReplicationSlotCreate, which
questions the usefulness of this function. On the second look,
CreateInitDecodingContext checks plugin sanity (ensures it exists), so
probably it's fine.
-- cheers, arseny
From: | Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> |
---|---|
To: | Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Subject: | Re: logical copy_replication_slot issues |
Date: | 2020-02-19 07:59:40 |
Message-ID: | CA+fd4k6yyvS0UY5mvBMg9s__Lo05RYremSBPPNzkqPvKp4rsYg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, 10 Feb 2020 at 23:01, Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> wrote:
>
>
> Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> writes:
>
> > I've attached the draft patch fixing this issue but I'll continue
> > investigating it more deeply.
>
> There also should be a check that source slot itself has consistent
> snapshot (valid confirmed_flush) -- otherwise it might be possible to
> create not initialized slot which is probably not an error, but weird
> and somewhat meaningless. Paranoically, this ought to be checked in both
> src slot lookups.
>
> With this patch it seems like the only thing
> create_logical_replication_slot does is ReplicationSlotCreate, which
> questions the usefulness of this function. On the second look,
> CreateInitDecodingContext checks plugin sanity (ensures it exists), so
> probably it's fine.
>
Thank you for reviewing this patch.
I've attached the updated version patch that incorporated your
comments. I believe we're going in the right direction for fixing this
bug. I'll register this item to the next commit fest so as not to
forget.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
fix_copy_slot_v2.patch | application/octet-stream | 4.2 KB |
From: | Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Subject: | Re: logical copy_replication_slot issues |
Date: | 2020-03-04 15:36:44 |
Message-ID: | 87k1406sj7.fsf@ars-thinkpad |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> writes:
> I've attached the updated version patch that incorporated your
> comments. I believe we're going in the right direction for fixing this
> bug. I'll register this item to the next commit fest so as not to
> forget.
I've moved confirmed_flush check to the second lookup out of paranoic
considerations (e.g. slot could have been recreated and creation hasn't
finished yet) and made some minor stylistic adjustments. It looks good
to me now.
Attachment | Content-Type | Size |
---|---|---|
fix_copy_slot_v3.patch | text/x-diff | 4.1 KB |
From: | Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> |
---|---|
To: | Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Subject: | Re: logical copy_replication_slot issues |
Date: | 2020-03-06 11:02:53 |
Message-ID: | 87imjh7nky.fsf@ars-thinkpad |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
I wrote:
> It looks good to me now.
After lying for some time in my head it reminded me that
CreateInitDecodingContext not only pegs the LSN, but also xmin, so
attached makes a minor comment correction.
While taking a look at the nearby code it seemed weird to me that
GetOldestSafeDecodingTransactionId checks PGXACT->xid, not xmin. Don't
want to investigate this at the moment though, and not for this thread.
Also not for this thread, but I've noticed
pg_copy_logical_replication_slot doesn't allow to change plugin name
which is an omission in my view. It would be useful and trivial to do.
-- cheers, arseny
Attachment | Content-Type | Size |
---|---|---|
fix_copy_slot_v4.patch | text/x-diff | 4.1 KB |
From: | Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> |
---|---|
To: | Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Subject: | Re: logical copy_replication_slot issues |
Date: | 2020-03-09 05:47:06 |
Message-ID: | CA+fd4k5qWBLTqomDxfeEeLR-yBf9_+EHxfFcpnP8hc7Bg15OEg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Fri, 6 Mar 2020 at 20:02, Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> wrote:
>
> I wrote:
>
> > It looks good to me now.
>
> After lying for some time in my head it reminded me that
> CreateInitDecodingContext not only pegs the LSN, but also xmin, so
> attached makes a minor comment correction.
>
> While taking a look at the nearby code it seemed weird to me that
> GetOldestSafeDecodingTransactionId checks PGXACT->xid, not xmin. Don't
> want to investigate this at the moment though, and not for this thread.
>
> Also not for this thread, but I've noticed
> pg_copy_logical_replication_slot doesn't allow to change plugin name
> which is an omission in my view. It would be useful and trivial to do.
>
Thank you for updating the patch. The patch looks basically good to me
but I have a few questions:
/*
- * Create logical decoding context, to build the initial snapshot.
+ * Create logical decoding context to find start point or, if we don't
+ * need it, to 1) bump slot's restart_lsn and xmin 2) check plugin sanity.
*/
Do we need to numbering that despite not referring them?
ctx = CreateInitDecodingContext(plugin, NIL,
- false, /* do not build snapshot */
+ false, /* do not build data snapshot */
restart_lsn,
logical_read_local_xlog_page, NULL, NULL,
NULL);
I'm not sure this change makes the comment better. Could you elaborate
on the motivation of this change?
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> |
---|---|
To: | Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Subject: | Re: logical copy_replication_slot issues |
Date: | 2020-03-09 12:46:39 |
Message-ID: | 87h7yx7l1s.fsf@ars-thinkpad |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> writes:
> /*
> - * Create logical decoding context, to build the initial snapshot.
> + * Create logical decoding context to find start point or, if we don't
> + * need it, to 1) bump slot's restart_lsn and xmin 2) check plugin sanity.
> */
>
> Do we need to numbering that despite not referring them?
No, it just seemed clearer to me this way. I don't mind removing the
numbers if you feel this is better.
> ctx = CreateInitDecodingContext(plugin, NIL,
> - false, /* do not build snapshot */
> + false, /* do not build data snapshot */
> restart_lsn,
> logical_read_local_xlog_page, NULL, NULL,
> NULL);
> I'm not sure this change makes the comment better. Could you elaborate
> on the motivation of this change?
Well, DecodingContextFindStartpoint always builds a snapshot allowing
historical *catalog* lookups. This bool controls whether the snapshot
should additionally be suitable for looking at the actual data, this is
e.g. used by initial data sync in the native logical replication.
-- cheers, arseny
From: | Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> |
---|---|
To: | Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Subject: | Re: logical copy_replication_slot issues |
Date: | 2020-03-10 05:58:01 |
Message-ID: | CA+fd4k52caSHzjjPfcPmQF7YFfo7zuhkUuDabQsoNTdfTe+ugA@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Mon, 9 Mar 2020 at 21:46, Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> wrote:
>
>
> Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> writes:
>
> > /*
> > - * Create logical decoding context, to build the initial snapshot.
> > + * Create logical decoding context to find start point or, if we don't
> > + * need it, to 1) bump slot's restart_lsn and xmin 2) check plugin sanity.
> > */
> >
> > Do we need to numbering that despite not referring them?
>
> No, it just seemed clearer to me this way. I don't mind removing the
> numbers if you feel this is better.
>
Okay.
> > ctx = CreateInitDecodingContext(plugin, NIL,
> > - false, /* do not build snapshot */
> > + false, /* do not build data snapshot */
> > restart_lsn,
> > logical_read_local_xlog_page, NULL, NULL,
> > NULL);
> > I'm not sure this change makes the comment better. Could you elaborate
> > on the motivation of this change?
>
> Well, DecodingContextFindStartpoint always builds a snapshot allowing
> historical *catalog* lookups. This bool controls whether the snapshot
> should additionally be suitable for looking at the actual data, this is
> e.g. used by initial data sync in the native logical replication.
Okay.
Anyway, since the patch looks good to me I've marked this patch as
"Ready for Committer". I think we can defer these things to
committers.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> |
Cc: | Arseny Sher <a(dot)sher(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: logical copy_replication_slot issues |
Date: | 2020-03-17 19:24:48 |
Message-ID: | [email protected] |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
Thanks Arseny and Masahiko, I pushed this patch just now. I changed
some comments while at it, hopefully they are improvements.
On 2020-Mar-09, Masahiko Sawada wrote:
> ctx = CreateInitDecodingContext(plugin, NIL,
> - false, /* do not build snapshot */
> + false, /* do not build data snapshot */
> restart_lsn,
> logical_read_local_xlog_page, NULL, NULL,
> NULL);
>
> I'm not sure this change makes the comment better. Could you elaborate
> on the motivation of this change?
I addressed this issue by adding a comment in CreateInitDecodingContext
to explain the parameter, and then reference that comment's terminology
in this call. I think it ends up clearer overall -- not that this whole
area is at all particularly clear.
Thanks again.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
Cc: | Arseny Sher <a(dot)sher(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: logical copy_replication_slot issues |
Date: | 2020-03-19 04:20:57 |
Message-ID: | CA+fd4k6WcgKi+M5CMBmFpN=mHPXu2jMasLUhrs91L9_dRCio_g@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, 18 Mar 2020 at 04:24, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
wrote:
>
> Thanks Arseny and Masahiko, I pushed this patch just now. I changed
> some comments while at it, hopefully they are improvements.
>
> On 2020-Mar-09, Masahiko Sawada wrote:
>
> > ctx = CreateInitDecodingContext(plugin, NIL,
> > - false, /* do not build snapshot */
> > + false, /* do not build data
snapshot */
> > restart_lsn,
> > logical_read_local_xlog_page, NULL,
NULL,
> > NULL);
> >
> > I'm not sure this change makes the comment better. Could you elaborate
> > on the motivation of this change?
>
> I addressed this issue by adding a comment in CreateInitDecodingContext
> to explain the parameter, and then reference that comment's terminology
> in this call. I think it ends up clearer overall -- not that this whole
> area is at all particularly clear.
>
> Thanks again.
>
Thank you for committing the patch! That changes look good to me.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services