Lists: | pgsql-hackers |
---|
From: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Failed recovery with new faster 2PC code |
Date: | 2017-04-15 22:37:15 |
Message-ID: | CAMkU=1xBP8cqdS5eK8APHL=X6RHMMM2vG5g+QamduuTsyCwv9g@mail.gmail.com |
After this commit, I get crash recovery failures when using prepared
transactions.
commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
Author: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Date: Tue Apr 4 15:56:56 2017 -0400
Speedup 2PC recovery by skipping two phase state files in normal path
After the induced crash, I get this failure in recovery:
FATAL: could not access status of transaction 334419347
DETAIL: Could not open file "pg_xact/013E": No such file or directory.
LOG: startup process (PID 60106) exited with exit code 1
LOG: aborting startup due to startup process failure
LOG: database system is shut down
The earliest file which exists in pg_xact is 0176
Other examples:
FATAL: could not access status of transaction 121729737
DETAIL: Could not open file "pg_xact/0074": No such file or directory.
LOG: startup process (PID 23720) exited with exit code 1
FATAL: could not access status of transaction 181325554
DETAIL: Could not open file "pg_xact/00AC": No such file or directory.
LOG: startup process (PID 8375) exited with exit code 1
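The file names in the errors above can be derived from the transaction IDs. The following is a minimal sketch of that arithmetic, assuming the layout constants of a default build (8 kB pages, 2 status bits per transaction, 32 pages per segment file); these constants and the helper name `pg_xact_segment` are assumptions for illustration and do not come from this report.

```python
# Assumed layout constants for a default build (not stated in this thread):
# 8 kB pages, 4 transaction statuses per byte, 32 pages per segment file.
BLCKSZ = 8192
CLOG_XACTS_PER_BYTE = 4
SLRU_PAGES_PER_SEGMENT = 32
XACTS_PER_SEGMENT = BLCKSZ * CLOG_XACTS_PER_BYTE * SLRU_PAGES_PER_SEGMENT


def pg_xact_segment(xid: int) -> str:
    """Return the pg_xact file name expected to hold the status of xid."""
    return format(xid // XACTS_PER_SEGMENT, "04X")


# The three transaction IDs from the FATAL messages above.
for xid in (334419347, 121729737, 181325554):
    print(xid, "->", "pg_xact/" + pg_xact_segment(xid))
# 334419347 -> pg_xact/013E
# 121729737 -> pg_xact/0074
# 181325554 -> pg_xact/00AC
```

Under these assumptions each reported XID maps to exactly the file named in its DETAIL line, and 013E sorts below 0176, the earliest file still present, which is consistent with the needed segment having already been truncated away.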
I experience this in about 1 out of 15 crash-recovery cycles on 8 CPUs.
The patch Pavan posted here did not make any difference:
I've attached the test harness, which I think will look familiar to y'all.
It is the usual injection of torn-page-write crashes with consistency
checks after recovery (which makes no difference, as the issue is that
recovery does not happen), modified to include a very crude transaction
manager to make use of 2PC.
Cheers,
Jeff
Attachment | Content-Type | Size |
---|---|---|
count.pl | application/octet-stream | 11.2 KB |
do.sh | application/x-sh | 5.4 KB |
crash_REL10.patch | application/octet-stream | 12.8 KB |
From: | Simon Riggs <simon(at)2ndquadrant(dot)com> |
---|---|
To: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Nikhil Sontakke <nikhil(dot)sontakke(at)2ndquadrant(dot)com>, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> |
Subject: | Re: Failed recovery with new faster 2PC code |
Date: | 2017-04-17 09:14:41 |
Message-ID: | CANP8+jJDSSo_7UmBdnGqcdeH+GGbhN5kZWVmqR5sD1tnMgDXsg@mail.gmail.com |
On 15 April 2017 at 23:37, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> After this commit, I get crash recovery failures when using prepared
> transactions.
>
> commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
> Author: Simon Riggs <simon(at)2ndQuadrant(dot)com>
> Date: Tue Apr 4 15:56:56 2017 -0400
>
> Speedup 2PC recovery by skipping two phase state files in normal path
Thanks Jeff for your tests.
So that's now two crash bugs in as many days and lack of clarity about
how to fix it.
Stas, I thought this patch was very important to you, yet two releases
in a row we are too-late-and-buggy.
If anybody has a reason why I shouldn't revert this, please say so
fairly soon.
Any further attempts to fix must run Jeff's tests.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
---|---|
To: | Simon Riggs <simon(at)2ndquadrant(dot)com> |
Cc: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Nikhil Sontakke <nikhil(dot)sontakke(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> |
Subject: | Re: Failed recovery with new faster 2PC code |
Date: | 2017-04-17 09:25:42 |
Message-ID: | [email protected] |
> On 17 Apr 2017, at 12:14, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
> On 15 April 2017 at 23:37, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> After this commit, I get crash recovery failures when using prepared
>> transactions.
>>
>> commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
>> Author: Simon Riggs <simon(at)2ndQuadrant(dot)com>
>> Date: Tue Apr 4 15:56:56 2017 -0400
>>
>> Speedup 2PC recovery by skipping two phase state files in normal path
>
> Thanks Jeff for your tests.
>
> So that's now two crash bugs in as many days and lack of clarity about
> how to fix it.
>
> Stas, I thought this patch was very important to you, yet two releases
> in a row we are too-late-and-buggy.
I'm looking at the pgstat issue in a nearby thread right now and will switch
to this shortly.
If possible, I'd ask that the revert be delayed for several days.
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From: | Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> |
Subject: | Re: Failed recovery with new faster 2PC code |
Date: | 2017-04-17 09:32:01 |
Message-ID: | CAMGcDxd21qbBLQn-Fxwp1AYVtnceztBY+3sQdTYQSKbkw0PfLQ@mail.gmail.com |
> >> commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
> >> Author: Simon Riggs <simon(at)2ndQuadrant(dot)com>
> >> Date: Tue Apr 4 15:56:56 2017 -0400
> >>
> >> Speedup 2PC recovery by skipping two phase state files in normal path
> >
> > Thanks Jeff for your tests.
> >
> > So that's now two crash bugs in as many days and lack of clarity about
> > how to fix it.
> >
>
I am trying to reproduce and debug it from my end as well.
Regards,
Nikhils
From: | Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> |
Subject: | Re: Failed recovery with new faster 2PC code |
Date: | 2017-04-18 06:39:45 |
Message-ID: | CAMGcDxeka9UssT4GuE-4eHmCoYuJ92OcqebcnYzPzTNOB3YS6g@mail.gmail.com |
On 17 April 2017 at 15:02, Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com> wrote:
>
>
>> >> commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
>> >> Author: Simon Riggs <simon(at)2ndQuadrant(dot)com>
>> >> Date: Tue Apr 4 15:56:56 2017 -0400
>> >>
>> >> Speedup 2PC recovery by skipping two phase state files in normal
>> path
>> >
>> > Thanks Jeff for your tests.
>> >
>> > So that's now two crash bugs in as many days and lack of clarity about
>> > how to fix it.
>> >
>>
>
The issue seems to be that a prepared transaction is yet to be committed,
but autovacuum comes in and causes the clog to be truncated beyond this
prepared transaction ID in one of the runs.
We only add the corresponding pgproc entry for a surviving 2PC transaction
on completion of recovery. So there could be a race condition here. Digging in
further.
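To make the suspected failure mode concrete, here is a deliberately simplified toy model of how a tracked prepared transaction holds back clog truncation. It is a sketch under stated assumptions, not PostgreSQL's actual logic: the function names are hypothetical, XID wraparound is ignored, the segment size assumes a default build, and the value used for the next XID is made up so that it falls in segment 0176.

```python
# Assumed default: 8192 bytes/page * 4 xacts/byte * 32 pages/segment.
XACTS_PER_SEGMENT = 1048576


def segment(xid):
    """Segment number of the clog file that would hold this XID's status."""
    return xid // XACTS_PER_SEGMENT


def oldest_needed_xid(next_xid, running_xids, prepared_xids):
    """Oldest XID whose status may still be looked up (toy model)."""
    return min([next_xid, *running_xids, *prepared_xids])


def removable_segments(existing_segments, cutoff_xid):
    """Segments lying entirely before the cutoff are candidates for removal."""
    return [s for s in existing_segments if s < segment(cutoff_xid)]


existing = list(range(0x13E, 0x177))   # segments 013E .. 0176
next_xid = 392500000                   # hypothetical, falls in segment 0176
prepared = [334419347]                 # XID from the first reported failure

# Prepared transaction is tracked: nothing may be removed.
print(removable_segments(existing, oldest_needed_xid(next_xid, [], prepared)))

# Prepared transaction is not tracked: 013E becomes removable.
lost = removable_segments(existing, oldest_needed_xid(next_xid, [], []))
print(format(lost[0], "04X"), "...", format(lost[-1], "04X"))
```

In this model, losing track of the prepared transaction lets the cutoff jump forward to the next XID, so every segment below 0176 becomes removable, including the one a later recovery would need for the prepared transaction.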
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
From: | Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> |
Subject: | Re: Failed recovery with new faster 2PC code |
Date: | 2017-04-18 08:17:28 |
Message-ID: | CAMGcDxeFVRi2uqhp8OD2nPAX99ChikHbKXcJTAun1mPpCx+Awg@mail.gmail.com |
Hi,
There was a bug in the redo 2PC remove code path, because of which autovac
would think that the 2PC is gone and cause removal of the corresponding
clog entry earlier than needed.
Please find attached, the bug fix: 2pc_redo_remove_bug.patch.
I have been testing this on top of Michael's 2pc-restore-fix.patch and
things seem to be ok for the past hour or more. I will keep it running for
longer.
Jeff, thanks for these very useful scripts. I am going to make a habit of
running these scripts on my side from now on. Do you have any other script
that I could try against these patches? Please let me know.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachment | Content-Type | Size |
---|---|---|
2pc_redo_remove_bug.patch | application/octet-stream | 403 bytes |
2pc-restore-fix.patch | application/octet-stream | 5.0 KB |
From: | Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com> |
---|---|
To: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> |
Subject: | Re: Failed recovery with new faster 2PC code |
Date: | 2017-04-18 08:57:10 |
Message-ID: | CAMGcDxeykkrKCk0FY9Pzt5JusLWw4woKXs8NoqjbOZfQQZ-i2Q@mail.gmail.com |
Please find attached a second version of my bug fix which is stylistically
better and clearer than the first one.
Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachment | Content-Type | Size |
---|---|---|
2pc_redo_remove_bug_v2.0.patch | application/octet-stream | 786 bytes |
From: | Simon Riggs <simon(at)2ndquadrant(dot)com> |
---|---|
To: | Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com> |
Cc: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> |
Subject: | Re: Failed recovery with new faster 2PC code |
Date: | 2017-04-18 10:54:30 |
Message-ID: | CANP8+jK_PF6O3CGkomUbk_hj6-P-GHbphL_wtnCXtVv8y1w7TA@mail.gmail.com |
On 18 April 2017 at 09:57, Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com> wrote:
> Please find attached a second version of my bug fix which is stylistically
> better and clearer than the first one.
Yeh, this is better. Pushed.
The bug was that the loop set gxact to be the last entry in the array,
causing the exit condition to fail and us then to remove the last
gxact from memory even when it didn't match the xid, removing a valid
entry too early. That then allowed xmin to move forwards, which causes
autovac to remove pg_xact entries earlier than needed.
Well done for finding that one, thanks for the patch.
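The loop bug described above can be modelled in a few lines. This is an illustrative sketch only, not the PostgreSQL source or the committed patch: `GXact`, `remove_buggy` and `remove_fixed` are hypothetical names, and a Python list stands in for the shared-memory array of prepared-transaction entries.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GXact:
    """Hypothetical stand-in for one in-memory prepared-transaction entry."""
    xid: int


def remove_buggy(entries: List[GXact], xid: int) -> None:
    """Model of the bug: the cursor keeps pointing at the last entry visited."""
    idx: Optional[int] = None
    for i in range(len(entries)):
        idx = i                      # overwritten on every iteration
        if entries[i].xid == xid:
            break
    if idx is None:                  # only true when the array is empty
        return
    del entries[idx]                 # drops the last entry even without a match


def remove_fixed(entries: List[GXact], xid: int) -> None:
    """Model of the fix: record a position only when the XID really matches."""
    found: Optional[int] = None
    for i in range(len(entries)):
        if entries[i].xid == xid:
            found = i
            break
    if found is None:
        return                       # no match: leave every entry in place
    del entries[found]


buggy = [GXact(100), GXact(200), GXact(300)]
remove_buggy(buggy, 999)             # no entry has XID 999
print([g.xid for g in buggy])        # [100, 200]

fixed = [GXact(100), GXact(200), GXact(300)]
remove_fixed(fixed, 999)
print([g.xid for g in fixed])        # [100, 200, 300]
```

In the buggy variant a lookup for an XID that is not present still deletes the last entry, which in the real system would let xmin advance past a prepared transaction that is still valid.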
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Simon Riggs <simon(at)2ndquadrant(dot)com> |
Cc: | Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> |
Subject: | Re: Failed recovery with new faster 2PC code |
Date: | 2017-04-18 12:12:43 |
Message-ID: | CAB7nPqQ_RnV8QTYxtm7=hudY56jjtG1tbZgrFOuYF8AuDdFiZA@mail.gmail.com |
On Tue, Apr 18, 2017 at 7:54 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Yeh, this is better. Pushed.
I have been outraced on this one, the error is obvious once you see it ;)
Thanks for the investigation and the fix! I have spent a couple of
hours reviewing the interactions between the shmem entries of 2PC
state data created at the beginning of recovery and all the lookups in
procarray.c and varsup.c, noticing nothing by the way.
> The bug was that the loop set gxact to be the last entry in the array,
> causing the exit condition to fail and us then to remove the last
> gxact from memory even when it didn't match the xid, removing a valid
> entry too early. That then allowed xmin to move forwards, which causes
> autovac to remove pg_xact entries earlier than needed.
>
> Well done for finding that one, thanks for the patch.
Running Jeff's test suite, I can confirm that there are no problems now.
--
Michael
From: | Simon Riggs <simon(at)2ndquadrant(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> |
Subject: | Re: Failed recovery with new faster 2PC code |
Date: | 2017-04-18 12:38:50 |
Message-ID: | CANP8+j+yUu=58fygS9_aJOXAgUzau5zaQo8HiWdX2z3b9uN6Dw@mail.gmail.com |
On 18 April 2017 at 13:12, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote:
> On Tue, Apr 18, 2017 at 7:54 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Yeh, this is better. Pushed.
>
> I have been outraced on this one, the error is obvious once you see it ;)
Didn't realise you were working on it, nothing competitive about it.
It's clear this needed fixing, whether or not it fixes Jeff's report.
I do think it explains the report, so I'm hopeful.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com> |
---|---|
To: | Simon Riggs <simon(at)2ndquadrant(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> |
Subject: | Re: Failed recovery with new faster 2PC code |
Date: | 2017-04-19 01:48:45 |
Message-ID: | CAMkU=1zmu_gFJ5rfg+0=oUDv6f7Am_gp64_GTOVaEpgrAFEojg@mail.gmail.com |
On Tue, Apr 18, 2017 at 5:38 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 18 April 2017 at 13:12, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
> wrote:
> > On Tue, Apr 18, 2017 at 7:54 PM, Simon Riggs <simon(at)2ndquadrant(dot)com>
> wrote:
> >> Yeh, this is better. Pushed.
> >
> > I have been outraced on this one, the error is obvious once you see it ;)
>
> Didn't realise you were working on it, nothing competitive about it.
>
> It's clear this needed fixing, whether or not it fixes Jeff's report.
>
> I do think it explains the report, so I'm hopeful.
>
The git HEAD code (c727f12) has been surviving so far, with both test cases.
Thanks,
Jeff
From: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com> |
---|---|
To: | Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com> |
Cc: | Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> |
Subject: | Re: Failed recovery with new faster 2PC code |
Date: | 2017-04-19 02:09:00 |
Message-ID: | CAMkU=1y98=hMk=giv8LDszkZqGgTkk2yYWeHPiz+4SN6m7RL5g@mail.gmail.com |
On Tue, Apr 18, 2017 at 1:17 AM, Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>
wrote:
> Hi,
>
> There was a bug in the redo 2PC remove code path. Because of which,
> autovac would think that the 2PC is gone and cause removal of the
> corresponding clog entry earlier than needed.
>
> Please find attached, the bug fix: 2pc_redo_remove_bug.patch.
>
> I have been testing this on top of Michael's 2pc-restore-fix.patch and
> things seem to be ok for the past one+ hour. Will keep it running for long.
>
> Jeff, thanks for these very useful scripts. I am going to make a habit to
> run these scripts on my side from now on. Do you have any other script that
> I could try against these patches? Please let me know.
>
This script is the only one I have that specifically targets 2PC. I wrote
it last year when the previous round of speed-up code (which avoided
writing the files upon "PREPARE" by delaying them until the next
checkpoint) was developed. I just decided to dust that test off to try
again here. I don't know how to change it to make it more targeted towards
this set of patches. Would this bug have been seen in a replica server in
the absence of crashes, or was it only vulnerable during crash recovery
rather than streaming replication?
Cheers,
Jeff
From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com> |
Cc: | Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com> |
Subject: | Re: Failed recovery with new faster 2PC code |
Date: | 2017-04-19 02:29:08 |
Message-ID: | CAB7nPqQUzRQrSxycnFsWX8=ztyQrt_MXTPjN=R-864Yhe8M=Ww@mail.gmail.com |
On Wed, Apr 19, 2017 at 11:09 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> Would this bug have been seen in a replica server in the absence of crashes,
> or was it only vulnerable during crash recovery rather than streaming
> replication?
This issue could have been seen on a streaming standby as well: an
incorrectly handled TwoPhaseState entry also affects its linked PGPROC, so
CLOG truncation would have been messed up there too. That's also the
case for the first issue, which involved incorrect XID updates,
though the chances of seeing it are, I think, lower, as only the beginning
of recovery was impacted.
--
Michael