From: Amit K. <ami...@en...> - 2012-06-29 11:42:20
For utility statements in general, the coordinator propagates SQL statements to all the required nodes, and most of these statements run on the datanodes inside a transaction block. So when a statement fails on at least one of the nodes, it gets rolled back on all the nodes thanks to the two-phase commit taking place, and the cluster therefore returns to a consistent state. But there are some statements which cannot be run inside a transaction block. Here are some important ones:

- CREATE/DROP DATABASE
- CREATE/DROP TABLESPACE
- ALTER DATABASE SET TABLESPACE
- ALTER TYPE ADD ... (for enum types)
- CREATE INDEX CONCURRENTLY
- REINDEX DATABASE
- DISCARD ALL

Such statements run on the datanodes in auto-commit mode, and so create problems if they succeed on some nodes and abort on others. Take CREATE DATABASE, for example: if a datanode d1 returns an error after another datanode d2 has already reported success to the coordinator, the coordinator cannot undo d2's work, because it is already committed. Or if the coordinator itself crashes after the datanodes commit but before the coordinator commits, we again have the same problem: the database cannot be recreated from the coordinator, since it has already been created on some of the other nodes. In such a cluster state, the administrator needs to connect to the datanodes and do the needed cleanup. The committed statements can be followed by statements that undo the operation, e.g. a DROP DATABASE for a CREATE DATABASE. But this undo statement can itself fail for some reason, and typically the undo counterparts of such statements cannot be run inside a transaction block either. So this is not a guaranteed way to bring the cluster back to a consistent state. To find out how we can get around this issue, let's see why these statements are required to run outside a transaction block in the first place. There are two reasons:

1.
Typically such statements modify OS files and directories, which cannot be rolled back. For DMLs, a rollback does not have to be explicitly undone; MVCC takes care of it. But there is no automatic undo for OS file operations, so such operations cannot be rolled back. Hence, if a CREATE DATABASE in a transaction block were followed by 10 other SQL statements before the commit, and one of those statements threw an error, the database would ultimately not be created, but its files would be left taking up disk space, just because the user wrote the script wrongly. By restricting such a statement to run outside a transaction block, an unrelated error cannot cause garbage files to be created. The statement itself still gets committed eventually as usual, and it can also get rolled back in the end. But maximum care has been taken in the statement's function (e.g. createdb) so that the chance of an error occurring *after* the files are created is minimal. For this, that code segment runs inside PG_ENSURE_ERROR_CLEANUP() with an error callback (createdb_failure_callback) which tries to clean up the files created. So the end result is that the window between files-created and error-occurred is minimized; it is not that such statements can never create cleanup issues when run outside a transaction block.

Possible solution: Regarding Postgres-XC, if we let such statements run inside a transaction block, but only on the remote nodes, what are the consequences? This of course prevents the issue of the statement being committed on one node and not on another, and the end user is still prevented from running the statement inside a transaction. Moreover, for such a statement, say CREATE DATABASE, the database will be created on all nodes or on none, even if one of the nodes returns an error. The only issue is that if the CREATE DATABASE is aborted, it will leave disk space wasted on the nodes where it succeeded.
But that would be caused by configuration issues like lack of disk space, the network being down, etc. The issue of other, unrelated operations in the same transaction causing a rollback of the CREATE DATABASE does not arise anyway, because we still don't allow it in a transaction block for the end user. So the end result is that we have solved the inconsistent-cluster issue, leaving some chance of a disk cleanup issue, although not one caused by aborted user queries. So maybe, when such statements error out, we display a notice that files need to be cleaned up. We can go further to reduce this window by splitting the create-database operation: begin a transaction block, then let the datanodes perform the non-file operations first (inserting the pg_database row, etc.) through a new function call, without committing yet. Then fire the last part, the file system operations, through another function call, and finally commit. The file operations would run under PG_ENSURE_ERROR_CLEANUP(). By synchronizing these individual tasks, we reduce the window further.

2. Some statements do internal commits. For example, movedb() calls TransactionCommit() after copying the files and then removes the original files, so that if it crashes while removing the files, the database with the new tablespace is already committed and intact, and we just leave some old files behind. Statements doing internal commits cannot be rolled back when run inside a transaction block, because they have already committed something. For such statements, the above solution does not work; we need to find a separate way. A few of these statements are:

- ALTER DATABASE SET TABLESPACE
- CLUSTER
- CREATE INDEX CONCURRENTLY

One similar solution is to split the individual tasks that get internally committed into different functions, one per task, and run the individual functions on all the nodes synchronously, so that the second task does not start until the first one is committed on all the nodes. Whether it is feasible to split the tasks is a question, and it depends on the particular command. As of now, I am not sure whether we can make some common changes in the way transactions are implemented to find a generic solution that does not require changes to individual commands. But I will investigate more. Comments and suggestions welcome!

Thanks
-Amit
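The failure mode described in this mail can be illustrated with a small, purely hypothetical simulation (none of these names exist in Postgres-XC; the sketch only contrasts auto-commit propagation with a prepare/commit scheme):

```python
# Hypothetical sketch: why propagating a utility statement in auto-commit
# mode can leave the cluster inconsistent, while a two-phase scheme cannot.

class Node:
    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail            # simulate a node-local error
        self.databases = set()
        self.pending = None

    def prepare(self, db):          # 2PC phase 1: do the work, hold the commit
        if self.fail:
            raise RuntimeError(self.name + ": CREATE DATABASE failed")
        self.pending = db

    def commit(self):               # 2PC phase 2
        if self.pending:
            self.databases.add(self.pending)
            self.pending = None

    def rollback(self):
        self.pending = None

    def autocommit(self, db):       # auto-commit: work and commit in one step
        if self.fail:
            raise RuntimeError(self.name + ": CREATE DATABASE failed")
        self.databases.add(db)

def run_autocommit(nodes, db):
    for n in nodes:
        try:
            n.autocommit(db)
        except RuntimeError:
            return                  # too late: earlier nodes already committed

def run_2pc(nodes, db):
    try:
        for n in nodes:
            n.prepare(db)
    except RuntimeError:
        for n in nodes:
            n.rollback()            # nothing was committed yet
        return
    for n in nodes:
        n.commit()

d1, d2 = Node("d1"), Node("d2", fail=True)
run_autocommit([d1, d2], "mydb")
print(d1.databases, d2.databases)   # {'mydb'} set() -> inconsistent cluster

e1, e2 = Node("e1"), Node("e2", fail=True)
run_2pc([e1, e2], "mydb")
print(e1.databases, e2.databases)   # set() set() -> consistent rollback
```

In the auto-commit run, d1 ends up with the database while d2 does not, which is exactly the state the administrator would have to clean up by hand.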
From: Michael P. <mic...@gm...> - 2012-06-29 05:57:28
On Fri, Jun 29, 2012 at 2:29 PM, Amit Khandekar < ami...@en...> wrote: > > > On 28 June 2012 10:56, Michael Paquier <mic...@gm...> wrote: > >> >>> The COPY TO results from the datanode are already in the required >>> format for COPY FROM, so the data is ready to be sent back to datanode >>> as-is. So if possible, we should avoid any input-output conversion when >>> storing in tuplestore. >>> >> Do you mean that we can store the results from COPY TO as-is to >> tuplestore, meaning that we can use tuplestore as-is? >> Or do you mean that we shouldn't use tuplestore? >> > I guess tuplestore will contain just one column of type text, which > represents the complete record. So anyway the conversion won't be there. > And that data is ready to be pushed in for COPY FROM. So again there won't > be any conversion. > I had a look at the code and was thinking also about using the tuple descriptor of tuple store as having a single char column. Btw, i recall that tuplestore writes to disk the data it cannot put into shared memory. > >> >> Also, please check if we can avoid storing the complete data in >>> tuplestore, instead we should transfer data from COPY TO to COPY FROM in >>> chunks. >>> >> Would be nice indeed. >> >>> Also I am not sure if we can truncate immediately after COPY TO is >>> fired. Will that affect the data that is being fetched from COPY? >>> >> Yes it will, we need to finish COPY TO to process to launch the TRUNCATE >> on Datanodes, or we won't be able to fetch all the data. >> Hence it looks necessary to store at some point all the data in >> tuplestore of Coordinator, and using chunks is complicated with this way of >> doing. > > > Hmm. I am overall a bit concerned with the huge amount of data being > handled, and the fact that we are going to transfer the *complete* data > over the network, and also now to be stored in memory in full. The data > would have been drastically reduced if were to fetch and send only the > required records. 
I am not sure it will change that much. The worst scenario is when the distribution is changed from distributed to replicated among all of the nodes. Directly comparing the two methods: your method moves 1 - 1/N of the data through the network, while mine moves all of it. If the cluster uses 20 nodes, the gain is only 5%. Considering that RETURNING is not yet supported and will only be available in a couple of weeks, I prefer to move on and implement this method, which is more general and easily supports all distribution types and all subsets of nodes. This will also let us move faster on the 9.2 merge.

--
Michael Paquier
http://michael.otacoo.com
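The traffic comparison in this mail is easy to check numerically (a throwaway calculation, not project code):

```python
# Normalized table size = 1. The selective method only moves the rows that
# change node, i.e. 1 - 1/N of the data; the COPY TO / COPY FROM method
# routes everything through the coordinator, i.e. 1.
def selective_traffic(n_nodes):
    return 1 - 1 / n_nodes

def full_copy_traffic(n_nodes):
    return 1.0

gain = full_copy_traffic(20) - selective_traffic(20)
print(round(gain, 2))   # with 20 nodes the selective method saves only 5%
```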
From: Amit K. <ami...@en...> - 2012-06-29 05:30:09
On 28 June 2012 10:56, Michael Paquier <mic...@gm...> wrote: > >> The COPY TO results from the datanode are already in the required format >> for COPY FROM, so the data is ready to be sent back to datanode as-is. So >> if possible, we should avoid any input-output conversion when storing in >> tuplestore. >> > Do you mean that we can store the results from COPY TO as-is to > tuplestore, meaning that we can use tuplestore as-is? > Or do you mean that we shouldn't use tuplestore? > I guess tuplestore will contain just one column of type text, which represents the complete record. So anyway the conversion won't be there. And that data is ready to be pushed in for COPY FROM. So again there won't be any conversion. > > Also, please check if we can avoid storing the complete data in >> tuplestore, instead we should transfer data from COPY TO to COPY FROM in >> chunks. >> > Would be nice indeed. > >> Also I am not sure if we can truncate immediately after COPY TO is fired. >> Will that affect the data that is being fetched from COPY? >> > Yes it will, we need to finish COPY TO to process to launch the TRUNCATE > on Datanodes, or we won't be able to fetch all the data. > Hence it looks necessary to store at some point all the data in tuplestore > of Coordinator, and using chunks is complicated with this way of doing. Hmm. I am overall a bit concerned with the huge amount of data being handled, and the fact that we are going to transfer the *complete* data over the network, and also now to be stored in memory in full. The data would have been drastically reduced if were to fetch and send only the required records. > > -- > Michael Paquier > http://michael.otacoo.com > |
From: Michael P. <mic...@gm...> - 2012-06-29 03:04:26
I just had a look at this patch. Thanks for the clean-up, and I now see how you are using the merge I did yesterday. One comment though: if the plan is to remove the functions that are not needed, why surround them with flags like TODO_NOT_USED or TODO_NOT_NEEDED? You should simply delete them directly; their history will be kept in git, so it is no big deal to take them out. Btw, the renaming looks good and regressions are not showing any problems. Once the clean-up is completed properly and you feel satisfied with it, it looks OK for commit.

On Thu, Jun 28, 2012 at 9:04 PM, Ashutosh Bapat <ash...@en...> wrote:
> Hi All,
> The functions is_foreign_expr() and pgxc_is_query_shippable() were using
> different walkers to check whether an expression or a query, respectively,
> is shippable. The logic deciding whether an expression or a query is
> shippable needs to be the same, except for a few differences arising from
> the information available in each. If we want to code a new shippability
> rule, we need to add it to both of these functions, which becomes a
> maintenance burden. The patch unifies the walkers into one and makes the
> appropriate renaming. Regressions do not show any extra failures.
>
> --
> Best Wishes,
> Ashutosh Bapat
> EntepriseDB Corporation
> The Enterprise Postgres Company
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Postgres-xc-developers mailing list
> Pos...@li...
> https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers

--
Michael Paquier
http://michael.otacoo.com
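The kind of unification described in this thread can be sketched generically (a Python illustration only, not the actual C walkers; the node kinds and rules below are invented for the example): a single recursive walker serves both entry points when the caller-specific differences are carried in a context object.

```python
# Sketch: one shippability walker shared by the expression-level and the
# query-level entry points. Node kinds and rules are illustrative only.
class ShippabilityContext:
    def __init__(self, for_query):
        self.for_query = for_query   # the few query-only rules key off this
        self.shippable = True

UNSHIPPABLE_NODES = {"VolatileFunc", "ParamRef"}   # hypothetical rule set
QUERY_ONLY_UNSHIPPABLE = {"SortBy"}                # hypothetical query-only rule

def shippability_walker(node, ctx):
    kind, children = node            # a node is ("Kind", [children])
    if kind in UNSHIPPABLE_NODES:
        ctx.shippable = False
    if ctx.for_query and kind in QUERY_ONLY_UNSHIPPABLE:
        ctx.shippable = False
    for child in children:
        shippability_walker(child, ctx)

def is_foreign_expr(expr):
    ctx = ShippabilityContext(for_query=False)
    shippability_walker(expr, ctx)
    return ctx.shippable

def pgxc_is_query_shippable(query):
    ctx = ShippabilityContext(for_query=True)
    shippability_walker(query, ctx)
    return ctx.shippable

print(is_foreign_expr(("OpExpr", [("Var", []), ("Const", [])])))   # True
print(pgxc_is_query_shippable(("Query", [("SortBy", [])])))        # False
```

A new shippability rule then only needs to be added in one place, which is the maintenance win the patch is after.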
From: Ashutosh B. <ash...@en...> - 2012-06-28 06:35:15
Thanks Michael. On Thu, Jun 28, 2012 at 12:03 PM, Michael Paquier <mic...@gm... > wrote: > Just to justify a little bit more about this unreadable email you > received, you can refer here: > > https://github.com/postgres-xc/postgres-xc/commit/a871778ca44886721b05a64982b5e5d81c7590a3 > > While working on the planner improvements for remote query path > determination, Ashutosh has noticed that he needed a functionality which > was already implemented in Postgres master (pull_var_clause filtering > aggregate Var). This was just a little bit ahead of XC master. So the > decision has been taken to merge XC code up to commit c1d9579 which is in > the middle of Postgres 9.2 dev. > The code of XC will be merged up to the intersection of Postgres master > and 9.2 stable branch in a couple of weeks (for easy backport with 9.2 > stable branch or postgres master), and we are not planning to release any > stable releases until this moment, so this merge has been made to > facilitate the development of the new XC features. > > Thanks, > > > On Thu, Jun 28, 2012 at 3:22 PM, Michael Paquier < > mic...@us...> wrote: > >> Project "Postgres-XC". 
>> >> The branch, master has been updated >> via 2a32c0ae0e2d01f3cc82384b24f610bd11a23755 (commit) >> via 67ab404afa3ac68f58f586ce889f116b8ff65e3b (commit) >> via 6ba0c48349fd21904822b43a2ea3241a6d0968a9 (commit) >> via a871778ca44886721b05a64982b5e5d81c7590a3 (commit) >> via c1d9579dd8bf3c921ca6bc2b62c40da6d25372e5 (commit) >> via 846af54dd5a77dc02feeb5e34283608012cfb217 (commit) >> via fd6913a18955b0f89ca994b5036c103bcea23f28 (commit) >> via 912bc4f038b3daaea4477c4b4e79fbd8c15e67a0 (commit) >> via afc9635c600ace716294a12d78abd37f65abd0ea (commit) >> via 3315020a091f64c8d08c3b32a2abd46431dcf857 (commit) >> via 75726307e6164673c48d6ce1d143a075b8ce18fa (commit) >> via 4240e429d0c2d889d0cda23c618f94e12c13ade7 (commit) >> via 9d522cb35d8b4f266abadd0d019f68eb8802ae05 (commit) >> via 89fd72cbf26f5d2e3d86ab19c1ead73ab8fac0fe (commit) >> via 9598afa3b0f7a7fdcf3740173346950b2bd5942c (commit) >> > > > -- > Michael Paquier > http://michael.otacoo.com > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Postgres-xc-developers mailing list > Pos...@li... > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers > > -- Best Wishes, Ashutosh Bapat EntepriseDB Corporation The Enterprise Postgres Company |
From: Michael P. <mic...@gm...> - 2012-06-28 06:33:44
To give a little more context on the unreadable email you received, you can refer here: https://github.com/postgres-xc/postgres-xc/commit/a871778ca44886721b05a64982b5e5d81c7590a3

While working on the planner improvements for remote query path determination, Ashutosh noticed that he needed functionality already implemented in Postgres master (pull_var_clause filtering aggregate Vars), just a little ahead of XC master. So the decision was taken to merge the XC code up to commit c1d9579, which is in the middle of Postgres 9.2 development. The XC code will be merged up to the intersection of Postgres master and the 9.2 stable branch in a couple of weeks (for easy backporting from the 9.2 stable branch or Postgres master), and we are not planning any stable releases before then, so this merge was made to facilitate the development of the new XC features.

Thanks,

On Thu, Jun 28, 2012 at 3:22 PM, Michael Paquier <mic...@us...> wrote:
> Project "Postgres-XC".
> > The branch, master has been updated > via 2a32c0ae0e2d01f3cc82384b24f610bd11a23755 (commit) > via 67ab404afa3ac68f58f586ce889f116b8ff65e3b (commit) > via 6ba0c48349fd21904822b43a2ea3241a6d0968a9 (commit) > via a871778ca44886721b05a64982b5e5d81c7590a3 (commit) > via c1d9579dd8bf3c921ca6bc2b62c40da6d25372e5 (commit) > via 846af54dd5a77dc02feeb5e34283608012cfb217 (commit) > via fd6913a18955b0f89ca994b5036c103bcea23f28 (commit) > via 912bc4f038b3daaea4477c4b4e79fbd8c15e67a0 (commit) > via afc9635c600ace716294a12d78abd37f65abd0ea (commit) > via 3315020a091f64c8d08c3b32a2abd46431dcf857 (commit) > via 75726307e6164673c48d6ce1d143a075b8ce18fa (commit) > via 4240e429d0c2d889d0cda23c618f94e12c13ade7 (commit) > via 9d522cb35d8b4f266abadd0d019f68eb8802ae05 (commit) > via 89fd72cbf26f5d2e3d86ab19c1ead73ab8fac0fe (commit) > via 9598afa3b0f7a7fdcf3740173346950b2bd5942c (commit) > -- Michael Paquier http://michael.otacoo.com |
From: Michael P. <mic...@gm...> - 2012-06-28 06:16:46
On Thu, Jun 28, 2012 at 3:03 PM, Ashutosh Bapat < ash...@en...> wrote: > Hi Michael, > You need to take care of visibility of rows when you use COPY mechanism. > > While you do all the stuff to copy from old relation and to new relation, > you need to simulate the behaviour in AtRewriteTable(). In this function, > we take the latest snapshot and then copy the rows over to new relation > storage. You will need to simulate same behaviour here. > The first version of the patch already does that. When running COPY FROM, TRUNCATE and COPY TO a new snapshot is automatically popped at each step. You can refer to distrib.c on the latest version of the patch. > > On Thu, Jun 28, 2012 at 10:56 AM, Michael Paquier < > mic...@gm...> wrote: > >> >>> The COPY TO results from the datanode are already in the required >>> format for COPY FROM, so the data is ready to be sent back to datanode >>> as-is. So if possible, we should avoid any input-output conversion when >>> storing in tuplestore. >>> >> Do you mean that we can store the results from COPY TO as-is to >> tuplestore, meaning that we can use tuplestore as-is? >> Or do you mean that we shouldn't use tuplestore? >> >> Also, please check if we can avoid storing the complete data in >>> tuplestore, instead we should transfer data from COPY TO to COPY FROM in >>> chunks. >>> >> Would be nice indeed. >> >>> Also I am not sure if we can truncate immediately after COPY TO is >>> fired. Will that affect the data that is being fetched from COPY? >>> >> Yes it will, we need to finish COPY TO to process to launch the TRUNCATE >> on Datanodes, or we won't be able to fetch all the data. >> Hence it looks necessary to store at some point all the data in >> tuplestore of Coordinator, and using chunks is complicated with this way of >> doing. 
>> >> -- >> Michael Paquier >> http://michael.otacoo.com >> >> >> ------------------------------------------------------------------------------ >> Live Security Virtual Conference >> Exclusive live event will cover all the ways today's security and >> threat landscape has changed and how IT managers can respond. Discussions >> will include endpoint security, mobile security and the latest in malware >> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >> _______________________________________________ >> Postgres-xc-developers mailing list >> Pos...@li... >> https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers >> >> > > > -- > Best Wishes, > Ashutosh Bapat > EntepriseDB Corporation > The Enterprise Postgres Company > > -- Michael Paquier http://michael.otacoo.com |
From: Ashutosh B. <ash...@en...> - 2012-06-28 06:03:43
Hi Michael, You need to take care of visibility of rows when you use COPY mechanism. While you do all the stuff to copy from old relation and to new relation, you need to simulate the behaviour in AtRewriteTable(). In this function, we take the latest snapshot and then copy the rows over to new relation storage. You will need to simulate same behaviour here. On Thu, Jun 28, 2012 at 10:56 AM, Michael Paquier <mic...@gm... > wrote: > >> The COPY TO results from the datanode are already in the required format >> for COPY FROM, so the data is ready to be sent back to datanode as-is. So >> if possible, we should avoid any input-output conversion when storing in >> tuplestore. >> > Do you mean that we can store the results from COPY TO as-is to > tuplestore, meaning that we can use tuplestore as-is? > Or do you mean that we shouldn't use tuplestore? > > Also, please check if we can avoid storing the complete data in >> tuplestore, instead we should transfer data from COPY TO to COPY FROM in >> chunks. >> > Would be nice indeed. > >> Also I am not sure if we can truncate immediately after COPY TO is fired. >> Will that affect the data that is being fetched from COPY? >> > Yes it will, we need to finish COPY TO to process to launch the TRUNCATE > on Datanodes, or we won't be able to fetch all the data. > Hence it looks necessary to store at some point all the data in tuplestore > of Coordinator, and using chunks is complicated with this way of doing. > > -- > Michael Paquier > http://michael.otacoo.com > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. 
http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Postgres-xc-developers mailing list > Pos...@li... > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers > > -- Best Wishes, Ashutosh Bapat EntepriseDB Corporation The Enterprise Postgres Company |
From: Michael P. <mic...@gm...> - 2012-06-28 05:26:37
> > > The COPY TO results from the datanode are already in the required format > for COPY FROM, so the data is ready to be sent back to datanode as-is. So > if possible, we should avoid any input-output conversion when storing in > tuplestore. > Do you mean that we can store the results from COPY TO as-is to tuplestore, meaning that we can use tuplestore as-is? Or do you mean that we shouldn't use tuplestore? Also, please check if we can avoid storing the complete data in tuplestore, > instead we should transfer data from COPY TO to COPY FROM in chunks. > Would be nice indeed. > Also I am not sure if we can truncate immediately after COPY TO is fired. > Will that affect the data that is being fetched from COPY? > Yes it will, we need to finish COPY TO to process to launch the TRUNCATE on Datanodes, or we won't be able to fetch all the data. Hence it looks necessary to store at some point all the data in tuplestore of Coordinator, and using chunks is complicated with this way of doing. -- Michael Paquier http://michael.otacoo.com |
From: Amit K. <ami...@en...> - 2012-06-28 05:21:09
On 28 June 2012 09:05, Michael Paquier <mic...@gm...> wrote: > > >> Something like this: >> 1. Launch copy to stdout >> 2. Launch truncate >> 3. Get the results of step 1 and use copy Apis to redirect rows to >> correct nodes. >> Now the copy to data is put into a file, a tuple store if you want. >> > Let me bring more details here. I had a closer look at postgres > functionalities and there are several possibilities to send a copy output, > the one I would like to use instead of the file currently being used is > DestTuplestore. By using that, it would be possible to store all the tuples > being redistributed without having to use an intermediate file and postgres > would do all the storage work. > So, assuming that TupleStore is used, here are how the redistribution > steps would work by default: > 1. launch copy to and output result to tuplestore > 2. launch truncate > 3. update catalogs > 4. use tuplestore data and relaunch a copy from with execRemote.c APIs. The COPY TO results from the datanode are already in the required format for COPY FROM, so the data is ready to be sent back to datanode as-is. So if possible, we should avoid any input-output conversion when storing in tuplestore. Also, please check if we can avoid storing the complete data in tuplestore, instead we should transfer data from COPY TO to COPY FROM in chunks. Also I am not sure if we can truncate immediately after COPY TO is fired. Will that affect the data that is being fetched from COPY? > > -- > Michael Paquier > http://michael.otacoo.com > |
From: Michael P. <mic...@gm...> - 2012-06-28 03:35:19
> > Something like this: > 1. Launch copy to stdout > 2. Launch truncate > 3. Get the results of step 1 and use copy Apis to redirect rows to correct > nodes. > Now the copy to data is put into a file, a tuple store if you want. > Let me bring more details here. I had a closer look at postgres functionalities and there are several possibilities to send a copy output, the one I would like to use instead of the file currently being used is DestTuplestore. By using that, it would be possible to store all the tuples being redistributed without having to use an intermediate file and postgres would do all the storage work. So, assuming that TupleStore is used, here are how the redistribution steps would work by default: 1. launch copy to and output result to tuplestore 2. launch truncate 3. update catalogs 4. use tuplestore data and relaunch a copy from with execRemote.c APIs. -- Michael Paquier http://michael.otacoo.com |
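The four redistribution steps laid out in this mail can be sketched minimally (plain Python, with a list standing in for the coordinator's tuplestore; the hash function and node layout are placeholders, not the backend's distribution functions):

```python
# Non-authoritative sketch of the redistribution sequence:
#   1. COPY TO  -> tuplestore on the coordinator
#   2. TRUNCATE on the datanodes
#   3. catalog update (distribution type, node list)
#   4. COPY FROM, rerouting each row by the new distribution
def redistribute(old_nodes, new_nodes, hash_row):
    tuplestore = []
    for node in old_nodes:                 # 1. COPY TO -> tuplestore
        tuplestore.extend(node["rows"])
    for node in old_nodes:                 # 2. TRUNCATE
        node["rows"] = []
    # 3. the catalog update would happen here
    for row in tuplestore:                 # 4. COPY FROM, rerouted by hash
        target = new_nodes[hash_row(row) % len(new_nodes)]
        target["rows"].append(row)

old = [{"rows": [1, 2, 3, 4]}, {"rows": [5, 6]}]
new = [{"rows": []}, {"rows": []}, {"rows": []}]
redistribute(old, new, hash_row=lambda r: r)
print([n["rows"] for n in new])   # [[3, 6], [1, 4], [2, 5]]
```

Note that the TRUNCATE can only run once the COPY TO has fully drained into the tuplestore, which matches the point made in the thread about why chunked transfer is complicated here.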
From: Michael P. <mic...@gm...> - 2012-06-27 13:10:22
On 2012/06/27, at 21:19, Amit Khandekar <ami...@en...> wrote: > > > On 27 June 2012 17:36, Michael Paquier <mic...@gm...> wrote: > > > On Wed, Jun 27, 2012 at 4:18 PM, Amit Khandekar <ami...@en...> wrote: > > > On 26 June 2012 19:15, Michael Paquier <mic...@gm...> wrote: > > On 2012/06/26, at 21:20, Amit Khandekar <ami...@en...> wrote: > >> >> >> On 20 June 2012 11:16, Michael Paquier <mic...@gm...> wrote: >> Please find attached a far more performant patch. >> This new mechanism uses a redistribution based on method 2: >> - COPY TO >> - TRUNCATE >> - catalog update >> - COPY FROM >> >> The advantages of this method are: >> - network traffic is halved as all the data is located temporarily on Coordinator >> - TRUNCATE speeds up data deletion >> - COPY improves the performance of data transfer by ~30 times >> - Time necessary to redistribute is the time necessary to process COPY FROM + COPY TO >> - Critical diminution of XLOG, there are not anymore a multitude of xlogs generated at Datanode level when distribution is changed to hash/modulo. >> >> I did a couple of tests up with tables of hundreds of MB of data. >> For a table of 10Mrows (300MB with pg_relation_size), redistribution took 30s from a replication to hash. >> With the previous patch, redistribution took 30mins for a 1Mrow table when table was altered from replication to hash distribution. >> I am also conducting tests with tables of GB size. However this mechanism is really fast. >> >> >> Hi Michael, >> >> While looking at the other remote-copy-API related patch that you sent in different thread, I ultimately ended up with this patch in order to know the basic ALTER TABLE implementation. Sorry to comment on this patch so late. >> >> I was checking if it is even possible to pull only the required rows from tables for COPY. 
If somehow we could build an expression based on the hash function, we might be able to use it as a qualifier in the select query to the datanode, so that only those rows will be retrieved that need to be deleted from the table: > This assumes that datanodes have knowledge of the other datanodes, and this is not the case, so this idea is difficult to put into practice. > > Write a function get_node_index(distcol_val, nodeList) which returns a node index given a distcol value and node index list. Mark this as immutable so that it can be shipped to the datanode. > Then use this query to execute from the coordinator: > delete from tab where get_node_index(distcol, nodeList) = $1 returning * > Execute it from the coordinator, targeting datanodes one by one, replacing $1 with the particular datanode index. Since exec-direct does not work for DELETE, we need to somehow set the exec_nodes list to include only the specific datanode. > I am honestly a little bit skeptical about this approach... It depends on too many things to my mind, and DELETE can be really costly on a large table. RETURNING is not supported, and this makes the table redistribution depend on the complete node list. > > RETURNING is not going to be supported very soon, so we can defer this for now. Using COPY and using RETURNING are two independent things we can do, so we can continue using COPY w/o RETURNING. Right now COPY will get all rows of the table, right? > > > My idea is to use, as you said, a pipe between COPY FROM/COPY TO and to run a TRUNCATE in the middle, which is way faster on large tables. This would use the APIs I extracted with my first patch and the other functions in execRemote.c. > > You are not using an actual system pipe, are you? Just making sure. I meant using the results fetched from COPY TO STDOUT directly as input to COPY FROM STDIN. That's what I would like to do. > > Please also mention the steps of the COPY and TRUNCATE part. Would you be able to use TRUNCATE while piping COPY is being done?
Will it not error out? Let me know the steps. Something like this: 1. Launch COPY TO STDOUT 2. Launch TRUNCATE 3. Get the results of step 1 and use the COPY APIs to redirect rows to the correct nodes. Now the COPY TO data is put into a file, a tuple store if you want. > > If you have a look at how those functions are processed you would notice that COPY TO redirects all the data to a FILE *, and COPY FROM uses tuple-based strings and sends them to Datanodes with a dedicated set of connections. > Regarding performance and portability, I think this looks like the better approach. Also, the COPY APIs in execRemote.c are pretty robust. > > Once this is done, I think it would be OK to commit. Then, there are also some improvements that can be done with replicated tables. For example, when reducing or increasing the set of nodes of a replicated table, you do not need the TRUNCATE phase. > -- > Michael Paquier > http://michael.otacoo.com > |
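The three steps discussed in the message above can be sketched as follows. This is an illustrative model only, not XC's actual execRemote.c code: datanodes are modelled as plain Python lists, the coordinator-side tuple store is an ordinary list, and modulo distribution on the new node set is assumed.

```python
# Sketch of the COPY TO -> TRUNCATE -> catalog update -> COPY FROM
# redistribution flow (assumes the source table is distributed, not
# replicated, so each row exists on exactly one datanode).

def redistribute(datanodes, dist_col, new_num_nodes):
    # 1. COPY TO: pull every row from every datanode into a
    #    coordinator-side tuple store (the FILE*/row-string buffer in XC).
    tuple_store = [row for node in datanodes for row in node]

    # 2. TRUNCATE: wipe the table on all datanodes (fast, minimal xlog
    #    compared to per-row DELETE or INSERT).
    for node in datanodes:
        node.clear()

    # 3. The catalog update (pgxc_class) would happen here, then
    # 4. COPY FROM: re-route each buffered row to its new target node,
    #    here by modulo on the distribution column.
    new_nodes = [[] for _ in range(new_num_nodes)]
    for row in tuple_store:
        new_nodes[row[dist_col] % new_num_nodes].append(row)
    return new_nodes
```

Because the data makes only one round trip through the coordinator, network traffic is halved compared to the earlier storage-table approach described further down the thread.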
From: Amit K. <ami...@en...> - 2012-06-27 12:20:05
|
On 27 June 2012 17:36, Michael Paquier <mic...@gm...> wrote: > > > On Wed, Jun 27, 2012 at 4:18 PM, Amit Khandekar < > ami...@en...> wrote: > >> >> >> On 26 June 2012 19:15, Michael Paquier <mic...@gm...> wrote: >> >>> >>> On 2012/06/26, at 21:20, Amit Khandekar <ami...@en...> >>> wrote: >>> >>> >>> >>> On 20 June 2012 11:16, Michael Paquier <mic...@gm...>wrote: >>> >>>> Please find attached a far more performant patch. >>>> This new mechanism uses a redistribution based on method 2: >>>> - COPY TO >>>> - TRUNCATE >>>> - catalog update >>>> - COPY FROM >>>> >>>> The advantages of this method are: >>>> - network traffic is halved as all the data is located temporarily on >>>> Coordinator >>>> - TRUNCATE speeds up data deletion >>>> - COPY improves the performance of data transfer by ~30 times >>>> - Time necessary to redistribute is the time necessary to process COPY >>>> FROM + COPY TO >>>> - Critical diminution of XLOG, there are not anymore a multitude of >>>> xlogs generated at Datanode level when distribution is changed to >>>> hash/modulo. >>>> >>>> I did a couple of tests up with tables of hundreds of MB of data. >>>> For a table of 10Mrows (300MB with pg_relation_size), redistribution >>>> took 30s from a replication to hash. >>>> With the previous patch, redistribution took 30mins for a 1Mrow table >>>> when table was altered from replication to hash distribution. >>>> I am also conducting tests with tables of GB size. However this >>>> mechanism is really fast. >>>> >>>> >>> Hi Michael, >>> >>> While looking at the other remote-copy-API related patch that you sent >>> in different thread, I ultimately ended up with this patch in order to know >>> the basic ALTER TABLE implementation. Sorry to comment on this patch so >>> late. >>> >>> I was checking if it is even possible to pull only the required rows >>> from tables for COPY. 
If somehow we could build an expression based on the >>> hash function we might be able to use it as a qualifier in the select query >>> to the datanode so that only those rows will be retrieved that need to be >>> deleted from the table: >>> >>> This assumes that datanodes have the knowledge of the other datanodes, >>> and this is not the case, so this idea is difficult to put in practice. >>> >> >> Write a function get_node_index(discol_val, nodeList) which returns node >> index given a distcol value and node index list. Mark this as immutable so >> that it can be shipped on datanode. >> Then use this query to execute from coordinator: >> delete from tab where get_node_index(distcol, nodeList) = $1 returning * >> Execute it from coordinator, targeting data nodes one by one, replacing >> $1 with the particular datanode index. Since exec-direct does not work for >> DELETE, we need to somehow set the exec_nodes list to include only the >> specific datanode. >> > I am honestly a little bit sceptic about this approach... It depends on > too many things to my mind, and DELETE can be really costly on large table. > Returning is not supported and this makes the table redistribution > depending on the complete node list. Returning is not supported very soon, so we can defer this for now. Using COPY and using RETURNING are two independent things we can do, so we can continue using COPY w/o returning. Right now COPY will get all rows of the table, right ? > My idea is to use, as you said, a pipe between COPY FROM/COPY TO and to > run in the middle a TRUNCATE which is way faster on large tables. This > would use the APIs I extracted with my first patch and the other functions > in execRemote.c. > You are not using an actual system pipe, are you? Just making sure. I meant using the results fetched from COPY TO STDOUT directy as input to COPY FROM STDIN> Please also mention the steps of the COPY and TRUNCATE part. 
Would you be able to use TRUNCATE while piping COPY is being done? Will it not error out? Let me know the steps. If you have a look at how those functions are processed you would notice > that COPY TO redirects all the data to a FILE *, and COPY FROM uses > tuple-based strings and sends them to Datanodes with a dedicated set of > connections. > Regarding performance and portability, I think this looks as the better > approach. Also the COPY APIs in execRemote.c are pretty robust. > > Once this is done, I think it would be OK to commit. Then, there are also > some improvements that can be done with replicated tables. For example, > when reducing or increasing the set of nodes of a replicated table, you do > not need the TRUNCATE phase. > -- > Michael Paquier > http://michael.otacoo.com > |
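The get_node_index() qualifier proposed in the exchange above can be modelled like this. This is a hypothetical sketch of Amit's suggestion, not an existing XC function; modulo distribution is assumed, and a hash distribution would apply the hash function to the value first.

```python
# get_node_index: which node a row with this distribution-column value
# belongs on.  In SQL this would be declared IMMUTABLE so the qualifier
# could be shipped down to the datanodes.

def get_node_index(distcol_val, node_list):
    return node_list[distcol_val % len(node_list)]

def misplaced_rows(rows, my_node, dist_col, node_list):
    # Rows on this datanode whose correct home is some other node: these
    # are the only rows the proposed DELETE ... RETURNING (or a plain
    # SELECT) would need to touch and transfer; correctly placed rows
    # never leave the node.
    return [r for r in rows
            if get_node_index(r[dist_col], node_list) != my_node]
```

The appeal of the idea is that only misplaced rows travel over the network; the objection raised in the thread is that it ties the query to the complete node list and relies on RETURNING, which XC did not support at the time.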
From: Michael P. <mic...@gm...> - 2012-06-27 12:06:45
|
On Wed, Jun 27, 2012 at 4:18 PM, Amit Khandekar < ami...@en...> wrote: > > > On 26 June 2012 19:15, Michael Paquier <mic...@gm...> wrote: > >> >> On 2012/06/26, at 21:20, Amit Khandekar <ami...@en...> >> wrote: >> >> >> >> On 20 June 2012 11:16, Michael Paquier <mic...@gm...> wrote: >> >>> Please find attached a far more performant patch. >>> This new mechanism uses a redistribution based on method 2: >>> - COPY TO >>> - TRUNCATE >>> - catalog update >>> - COPY FROM >>> >>> The advantages of this method are: >>> - network traffic is halved as all the data is located temporarily on >>> Coordinator >>> - TRUNCATE speeds up data deletion >>> - COPY improves the performance of data transfer by ~30 times >>> - Time necessary to redistribute is the time necessary to process COPY >>> FROM + COPY TO >>> - Critical diminution of XLOG, there are not anymore a multitude of >>> xlogs generated at Datanode level when distribution is changed to >>> hash/modulo. >>> >>> I did a couple of tests up with tables of hundreds of MB of data. >>> For a table of 10Mrows (300MB with pg_relation_size), redistribution >>> took 30s from a replication to hash. >>> With the previous patch, redistribution took 30mins for a 1Mrow table >>> when table was altered from replication to hash distribution. >>> I am also conducting tests with tables of GB size. However this >>> mechanism is really fast. >>> >>> >> Hi Michael, >> >> While looking at the other remote-copy-API related patch that you sent in >> different thread, I ultimately ended up with this patch in order to know >> the basic ALTER TABLE implementation. Sorry to comment on this patch so >> late. >> >> I was checking if it is even possible to pull only the required rows from >> tables for COPY. 
If somehow we could build an expression based on the hash >> function we might be able to use it as a qualifier in the select query to >> the datanode so that only those rows will be retrieved that need to be >> deleted from the table: >> >> This assumes that datanodes have the knowledge of the other datanodes, >> and this is not the case, so this idea is difficult to put in practice. >> > > Write a function get_node_index(discol_val, nodeList) which returns node > index given a distcol value and node index list. Mark this as immutable so > that it can be shipped on datanode. > Then use this query to execute from coordinator: > delete from tab where get_node_index(distcol, nodeList) = $1 returning * > Execute it from coordinator, targeting data nodes one by one, replacing $1 > with the particular datanode index. Since exec-direct does not work for > DELETE, we need to somehow set the exec_nodes list to include only the > specific datanode. > I am honestly a little bit sceptic about this approach... It depends on too many things to my mind, and DELETE can be really costly on large table. Returning is not supported and this makes the table redistribution depending on the complete node list. My idea is to use, as you said, a pipe between COPY FROM/COPY TO and to run in the middle a TRUNCATE which is way faster on large tables. This would use the APIs I extracted with my first patch and the other functions in execRemote.c. If you have a look at how those functions are processed you would notice that COPY TO redirects all the data to a FILE *, and COPY FROM uses tuple-based strings and sends them to Datanodes with a dedicated set of connections. Regarding performance and portability, I think this looks as the better approach. Also the COPY APIs in execRemote.c are pretty robust. Once this is done, I think it would be OK to commit. Then, there are also some improvements that can be done with replicated tables. 
For example, when reducing or increasing the set of nodes of a replicated table, you do not need the TRUNCATE phase. -- Michael Paquier http://michael.otacoo.com |
From: Amit K. <ami...@en...> - 2012-06-27 08:30:10
|
I am overall OK with the patch. Maybe later on I will do a final revisit once you start using the API for ALTER TABLE. On 27 June 2012 12:57, Michael Paquier <mic...@gm...> wrote: > Hi all, > > Please find an updated patch attached. > There are 2 modifications: the number of arguments of > RemoteCopy_BuildStatement and RemoteCopy_GetRelationLoc is reduced by 1 as > TupleDesc depends on Relation. > > Regards, > -- > Michael Paquier > http://michael.otacoo.com > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Postgres-xc-developers mailing list > Pos...@li... > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers > > |
From: Ahsan H. <ahs...@en...> - 2012-06-27 08:22:57
|
On Wed, Jun 27, 2012 at 12:26 PM, Michael Paquier <mic...@gm... > wrote: > > > On Wed, Jun 27, 2012 at 4:18 PM, Amit Khandekar < > ami...@en...> wrote: > >> >> >> On 26 June 2012 19:15, Michael Paquier <mic...@gm...> wrote: >> >>> >>> On 2012/06/26, at 21:20, Amit Khandekar <ami...@en...> >>> wrote: >>> >>> >>> >>> On 20 June 2012 11:16, Michael Paquier <mic...@gm...>wrote: >>> >>>> Please find attached a far more performant patch. >>>> This new mechanism uses a redistribution based on method 2: >>>> - COPY TO >>>> - TRUNCATE >>>> - catalog update >>>> - COPY FROM >>>> >>>> The advantages of this method are: >>>> - network traffic is halved as all the data is located temporarily on >>>> Coordinator >>>> - TRUNCATE speeds up data deletion >>>> - COPY improves the performance of data transfer by ~30 times >>>> - Time necessary to redistribute is the time necessary to process COPY >>>> FROM + COPY TO >>>> - Critical diminution of XLOG, there are not anymore a multitude of >>>> xlogs generated at Datanode level when distribution is changed to >>>> hash/modulo. >>>> >>>> I did a couple of tests up with tables of hundreds of MB of data. >>>> For a table of 10Mrows (300MB with pg_relation_size), redistribution >>>> took 30s from a replication to hash. >>>> With the previous patch, redistribution took 30mins for a 1Mrow table >>>> when table was altered from replication to hash distribution. >>>> I am also conducting tests with tables of GB size. However this >>>> mechanism is really fast. >>>> >>>> >>> Hi Michael, >>> >>> While looking at the other remote-copy-API related patch that you sent >>> in different thread, I ultimately ended up with this patch in order to know >>> the basic ALTER TABLE implementation. Sorry to comment on this patch so >>> late. >>> >>> I was checking if it is even possible to pull only the required rows >>> from tables for COPY. 
If somehow we could build an expression based on the >>> hash function we might be able to use it as a qualifier in the select query >>> to the datanode so that only those rows will be retrieved that need to be >>> deleted from the table: >>> >>> This assumes that datanodes have the knowledge of the other datanodes, >>> and this is not the case, so this idea is difficult to put in practice. >>> >> >> Write a function get_node_index(discol_val, nodeList) which returns node >> index given a distcol value and node index list. Mark this as immutable so >> that it can be shipped on datanode. >> Then use this query to execute from coordinator: >> delete from tab where get_node_index(distcol, nodeList) = $1 returning * >> > returning is not yet supported in XC. > Yes Abbas is working on this as part of this work for WCO cursors. > > >> Execute it from coordinator, targeting data nodes one by one, replacing >> $1 with the particular datanode index. Since exec-direct does not work for >> DELETE, we need to somehow set the exec_nodes list to include only the >> specific datanode. >> > I am really afraid that this mechanism is too slow as we may have to go > through the tuples one by one. > Btw, sending data to dedicated remote nodes is what I am plannning to do > with the remote COPY APIs. > > > -- > Michael Paquier > http://michael.otacoo.com > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Postgres-xc-developers mailing list > Pos...@li... 
> https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers > > -- Ahsan Hadi Snr Director Product Development EnterpriseDB Corporation The Enterprise Postgres Company Phone: +92-51-8358874 Mobile: +92-333-5162114 Website: www.enterprisedb.com EnterpriseDB Blog: http://blogs.enterprisedb.com/ Follow us on Twitter: http://www.twitter.com/enterprisedb This e-mail message (and any attachment) is intended for the use of the individual or entity to whom it is addressed. This message contains information from EnterpriseDB Corporation that may be privileged, confidential, or exempt from disclosure under applicable law. If you are not the intended recipient or authorized to receive this for the intended recipient, any use, dissemination, distribution, retention, archiving, or copying of this communication is strictly prohibited. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and delete this message. |
From: Amit K. <ami...@en...> - 2012-06-27 07:37:45
|
On 27 June 2012 12:56, Michael Paquier <mic...@gm...> wrote: > > > On Wed, Jun 27, 2012 at 4:18 PM, Amit Khandekar < > ami...@en...> wrote: > >> >> >> On 26 June 2012 19:15, Michael Paquier <mic...@gm...> wrote: >> >>> >>> On 2012/06/26, at 21:20, Amit Khandekar <ami...@en...> >>> wrote: >>> >>> >>> >>> On 20 June 2012 11:16, Michael Paquier <mic...@gm...>wrote: >>> >>>> Please find attached a far more performant patch. >>>> This new mechanism uses a redistribution based on method 2: >>>> - COPY TO >>>> - TRUNCATE >>>> - catalog update >>>> - COPY FROM >>>> >>>> The advantages of this method are: >>>> - network traffic is halved as all the data is located temporarily on >>>> Coordinator >>>> - TRUNCATE speeds up data deletion >>>> - COPY improves the performance of data transfer by ~30 times >>>> - Time necessary to redistribute is the time necessary to process COPY >>>> FROM + COPY TO >>>> - Critical diminution of XLOG, there are not anymore a multitude of >>>> xlogs generated at Datanode level when distribution is changed to >>>> hash/modulo. >>>> >>>> I did a couple of tests up with tables of hundreds of MB of data. >>>> For a table of 10Mrows (300MB with pg_relation_size), redistribution >>>> took 30s from a replication to hash. >>>> With the previous patch, redistribution took 30mins for a 1Mrow table >>>> when table was altered from replication to hash distribution. >>>> I am also conducting tests with tables of GB size. However this >>>> mechanism is really fast. >>>> >>>> >>> Hi Michael, >>> >>> While looking at the other remote-copy-API related patch that you sent >>> in different thread, I ultimately ended up with this patch in order to know >>> the basic ALTER TABLE implementation. Sorry to comment on this patch so >>> late. >>> >>> I was checking if it is even possible to pull only the required rows >>> from tables for COPY. 
If somehow we could build an expression based on the >>> hash function we might be able to use it as a qualifier in the select query >>> to the datanode so that only those rows will be retrieved that need to be >>> deleted from the table: >>> >>> This assumes that datanodes have the knowledge of the other datanodes, >>> and this is not the case, so this idea is difficult to put into practice. >>> >> >> Write a function get_node_index(distcol_val, nodeList) which returns a node >> index given a distribution-column value and a node index list. Mark this as immutable so >> that it can be shipped to the datanodes. >> Then execute this query from the coordinator: >> delete from tab where get_node_index(distcol, nodeList) = $1 returning * >> > returning is not yet supported in XC. > > >> Execute it from the coordinator, targeting the datanodes one by one, replacing >> $1 with the particular datanode index. Since EXECUTE DIRECT does not work for >> DELETE, we need to somehow set the exec_nodes list to include only the >> specific datanode. >> > I am really afraid that this mechanism is too slow as we may have to go > through the tuples one by one. > The point of the function get_node_index() is to move only the required data, and not touch the data that is already at the right location. So you can even just select the filtered rows, as against using DELETE. I did not get what you meant by going through the tuples one by one. > Btw, sending data to dedicated remote nodes is what I am planning to do > with the remote COPY APIs. > OK. So your remote copy extraction patch in the other mail will have these changes? Also, please consider piping the COPY TO STDOUT output into COPY FROM STDIN, if that makes it possible to avoid dumping the data to an intermediate file. > > > -- > Michael Paquier > http://michael.otacoo.com > |
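The piping idea raised here, feeding COPY TO STDOUT output straight into COPY FROM STDIN without an intermediate file, can be shaped as a generator pipeline. A sketch only: copy_to and copy_from below are hypothetical stand-ins for the real per-connection COPY protocol calls, not existing XC functions.

```python
# Stream rows from the source through a generator so that COPY FROM
# consumes COPY TO output as it arrives, with no intermediate file.

def copy_to(source_rows):
    # Generator: yields one COPY-format (tab-separated) line at a time,
    # playing the role of COPY TO STDOUT.
    for row in source_rows:
        yield "\t".join(str(v) for v in row)

def copy_from(lines, route):
    # Consume lines as they arrive (COPY FROM STDIN), routing each
    # decoded row to its target node via the supplied route function.
    targets = {}
    for line in lines:
        row = line.split("\t")
        targets.setdefault(route(row), []).append(row)
    return targets
```

Because copy_to is lazy, only one row is in flight at a time; buffering the whole table, as the tuple-store variant does, trades memory for the ability to run TRUNCATE between the two phases.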
From: Michael P. <mic...@gm...> - 2012-06-27 07:26:15
|
On Wed, Jun 27, 2012 at 4:18 PM, Amit Khandekar < ami...@en...> wrote: > > > On 26 June 2012 19:15, Michael Paquier <mic...@gm...> wrote: > >> >> On 2012/06/26, at 21:20, Amit Khandekar <ami...@en...> >> wrote: >> >> >> >> On 20 June 2012 11:16, Michael Paquier <mic...@gm...> wrote: >> >>> Please find attached a far more performant patch. >>> This new mechanism uses a redistribution based on method 2: >>> - COPY TO >>> - TRUNCATE >>> - catalog update >>> - COPY FROM >>> >>> The advantages of this method are: >>> - network traffic is halved as all the data is located temporarily on >>> Coordinator >>> - TRUNCATE speeds up data deletion >>> - COPY improves the performance of data transfer by ~30 times >>> - Time necessary to redistribute is the time necessary to process COPY >>> FROM + COPY TO >>> - Critical diminution of XLOG, there are not anymore a multitude of >>> xlogs generated at Datanode level when distribution is changed to >>> hash/modulo. >>> >>> I did a couple of tests up with tables of hundreds of MB of data. >>> For a table of 10Mrows (300MB with pg_relation_size), redistribution >>> took 30s from a replication to hash. >>> With the previous patch, redistribution took 30mins for a 1Mrow table >>> when table was altered from replication to hash distribution. >>> I am also conducting tests with tables of GB size. However this >>> mechanism is really fast. >>> >>> >> Hi Michael, >> >> While looking at the other remote-copy-API related patch that you sent in >> different thread, I ultimately ended up with this patch in order to know >> the basic ALTER TABLE implementation. Sorry to comment on this patch so >> late. >> >> I was checking if it is even possible to pull only the required rows from >> tables for COPY. 
If somehow we could build an expression based on the hash >> function we might be able to use it as a qualifier in the select query to >> the datanode so that only those rows will be retrieved that need to be >> deleted from the table: >> >> This assumes that datanodes have the knowledge of the other datanodes, >> and this is not the case, so this idea is difficult to put in practice. >> > > Write a function get_node_index(discol_val, nodeList) which returns node > index given a distcol value and node index list. Mark this as immutable so > that it can be shipped on datanode. > Then use this query to execute from coordinator: > delete from tab where get_node_index(distcol, nodeList) = $1 returning * > returning is not yet supported in XC. > Execute it from coordinator, targeting data nodes one by one, replacing $1 > with the particular datanode index. Since exec-direct does not work for > DELETE, we need to somehow set the exec_nodes list to include only the > specific datanode. > I am really afraid that this mechanism is too slow as we may have to go through the tuples one by one. Btw, sending data to dedicated remote nodes is what I am plannning to do with the remote COPY APIs. -- Michael Paquier http://michael.otacoo.com |
From: Amit K. <ami...@en...> - 2012-06-27 07:19:17
|
On 26 June 2012 19:15, Michael Paquier <mic...@gm...> wrote: > > On 2012/06/26, at 21:20, Amit Khandekar <ami...@en...> > wrote: > > > > On 20 June 2012 11:16, Michael Paquier <mic...@gm...> wrote: > >> Please find attached a far more performant patch. >> This new mechanism uses a redistribution based on method 2: >> - COPY TO >> - TRUNCATE >> - catalog update >> - COPY FROM >> >> The advantages of this method are: >> - network traffic is halved as all the data is located temporarily on >> Coordinator >> - TRUNCATE speeds up data deletion >> - COPY improves the performance of data transfer by ~30 times >> - Time necessary to redistribute is the time necessary to process COPY >> FROM + COPY TO >> - Critical diminution of XLOG, there are not anymore a multitude of xlogs >> generated at Datanode level when distribution is changed to hash/modulo. >> >> I did a couple of tests up with tables of hundreds of MB of data. >> For a table of 10Mrows (300MB with pg_relation_size), redistribution took >> 30s from a replication to hash. >> With the previous patch, redistribution took 30mins for a 1Mrow table >> when table was altered from replication to hash distribution. >> I am also conducting tests with tables of GB size. However this mechanism >> is really fast. >> >> > Hi Michael, > > While looking at the other remote-copy-API related patch that you sent in > different thread, I ultimately ended up with this patch in order to know > the basic ALTER TABLE implementation. Sorry to comment on this patch so > late. > > I was checking if it is even possible to pull only the required rows from > tables for COPY. 
If somehow we could build an expression based on the hash > function we might be able to use it as a qualifier in the select query to > the datanode so that only those rows will be retrieved that need to be > deleted from the table: > > This assumes that datanodes have the knowledge of the other datanodes, and > this is not the case, so this idea is difficult to put in practice. > Write a function get_node_index(discol_val, nodeList) which returns node index given a distcol value and node index list. Mark this as immutable so that it can be shipped on datanode. Then use this query to execute from coordinator: delete from tab where get_node_index(distcol, nodeList) = $1 returning * Execute it from coordinator, targeting data nodes one by one, replacing $1 with the particular datanode index. Since exec-direct does not work for DELETE, we need to somehow set the exec_nodes list to include only the specific datanode. > SELECT from tab1 where expr(dist_col) is true; > And then supply the results into COPY (or directly to insert ?) > > The first version of my patch was using an INSERT-based mechanism. This > has a low performance, and we may finish by generating one INSERT query per > row, generating 1 xlog record for each query. If millions of rows are moved > between nodes, you might need to replay unnecessary records after for > example a node crash, slowing the bode recovery by that much. > > So the COPY will look like: > COPY (SELECT from tab1 where expr(dist_col) is true) > or still better: > COPY ( DELETE from tab1 where expr(dist_col) is true RETURNING *) > This will get rid of a separate step to delete rows. But I thingk COPY ... > DELETE is not valid syntax. > > No it is not. But you can use a select clause with a WITH having a DELETE > returning. However, you need to know with this method on a local datanode > if the tuples of this node are correctly located in cluster, but as a > Datanode has no view of the other XC nodes, this is not directly applicable. 
> > May be we can write a set returning function that runs delete...returning > and then use COPY (select such_func('table_name') ) > > If we manage to delete and retrieve only required rows, it may be possible > to skip the COPY part and do a straight: > insert into tab1 DELETE from tab1 where expr(dist_col) is true RETURNING > *. I am not sure if it is performant to do this w/ or w/o copy. COPY may or > may not write into WAL logs. > > Copy only does on WAL record I think. It is a single query. However COPY > protocol is the fastest solution for data transfer, so we should definitely > avoid solutions with INSERT as it would finish using a slower extended > query protocol, and increase the number of xlogs on datanodes. > > > This is all a quick thought without much thought on the implications, but > still wanted to put that out here. For e.g., is it possible to create such > a expression expr(dist_col) ? This should be correctly built up according > to whether it is hash or modulo or round robin distribution. For round > robin may be it should always be true because we anyways want to delete > them and reinsert, or do we? If the table is to be altered from hash to > round robin, the expression might be needed to always be true may be; and > so on. > > Regards, >> >> >> On Wed, Jun 20, 2012 at 9:29 AM, Michael Paquier < >> mic...@gm...> wrote: >> >>> In this patch, the SQL/catalog management and the distribution mechanism >>> use really separated APIs. >>> So even if I do not think it is necessary to change the SQL part, the >>> redistribution mechanism can be changed at will. >>> >>> For the time being, the redistribution mechanism is not really >>> performant (well, it was not the goal of this prototype), because it uses >>> the following model. 
>>> 1) Creation of a storage table (unlogged) with default distribution >>> (CTAS) >>> 2) Take necessary locks on storage and redistributed table >>> 3) Delete all data on redistributed table >>> 4) Update catalogs with new distribution information >>> 5) Perform INSERT SELECT from storage table to redistributed table >>> 6) DROP storage table >>> As mentionned by Ashutosh, the data needs to travel 4 times through >>> network so it can really take a lot of time for tables with lots of gigs of >>> data. >>> Network in itself is not the bottleneck, it is the usage of the >>> framework of postgres. This is especially true when queries used by >>> redistribution mechanism cannot be pushed down. >>> The worst case being when a table is redistributed to a hash/modulo on >>> multiple nodes, as in this case it is necessary to plan each INSERT for the >>> redistribution. As mentionned also by Ashutosh this can create a huge deal >>> of xlogs on remote Datanodes, not really welcome after a crash recovery. >>> >>> There are several ways possible to improve this dramatically improve the >>> redistribution mechanism. >>> Here are 3 ideas. >>> 1) Create storage table with data on a single node >>> Here we reduce the load on network, but it cannot solve the problem of >>> tables redistributed to modulo/hash on multiple nodes. It will create a lot >>> of INSERT queries for a slow result. >>> 2) Use a COPY mechanism >>> One of the simple solutions. Instead of using a costly storage table in >>> cluster, store the data on Coordinator during the redistribution >>> a) COPY the data of table being redistributed to a file in $PGDATA of >>> Coordinator. Why not $PGDATA/pg_distrib/oid? >>> b) DELETE all the data on table being redistributed >>> c) update catalogs >>> d) COPY FROM file to table with new distribution type >>> Network load is halved. COPY is also really faster. 
>>> Servers of Coordinator are not chosen for there disk I/O but the folder >>> $PGDATA/pg_distrib could be linked to a folder where a faster disk is >>> mounted >>> This also gets rid of the storage table. The only thing to care of is >>> the deletion of the temporary data file once redistribution transaction >>> commits or aborts. >>> Data file could also be compressed to reduce space consumed and I/O on >>> disk. >>> >>> 3) Use a batching process to communicate only necessary tuples from >>> Datanodes to Coordinator. >>> Suggested by Ashutosh, this can use COPY protocol to redistribute in a >>> batch way the tuples being redistributed. >>> The idea is to send from Datanodes to Coordinator only the tuples that >>> need to be redistributed, and then let Coordinator redistribute correctly >>> all the data depending on the new distribution. This avoids to have to >>> store temporarily the data redistributed and all the transfer is managed by >>> cache on Coordinator. >>> This idea has a couple of limitations though: >>> - a Datanode is not aware of the existence of the other nodes in >>> cluster. Now distribution data is only available at Coordinator on catalog >>> pgxc_class, and this distribution data contains the list of nodes where >>> data is distributed. This is directly dependant on catalog pgxc_node. So a >>> Datanode cannot know if a tuple will be at the correct place or not. >>> This could be countered by allowing the run of node DDLs on Datanodes, >>> but this adds an additional constraint on cluster setting as it forces the >>> cluster designer to update all the pgxc_node catalogs on all the nodes. >>> Having a pgxc_node catalog on Datanode would make sense if it communicates >>> with other nodes through the same pooler as Coordinator, but this also >>> raises issues with multiple backends open on one node for the same session, >>> which is dangerous for transaction handling. >>> - visibility concerns. 
What insures that a tuple has been only selected >>> once. As redistribution is a cluster-based mechanism. What can insure that >>> a scan on a Datanode is not taking into account some tuples that have >>> already been redistributed. >>> >>> Method 1 looks useless from the point of performance. >>> Method 2 should have a good performance. This only point is that data >>> has to be located on Coordinator server temporarily while redistribution is >>> being done. We could also use some GUC parameter to allow DBA to customize >>> the way redistribution data folder is stored (compression type, file name >>> format...). >>> I have some concerns about method 3 as explained above. I might not take >>> into account all the potential problems or have a limited view on this >>> mechanism, but it introduces some new dependencies with cluster setting >>> which may not be necessary. However any discussion on the subject is >>> welcome. >>> >>> Suggestions are welcome. >>> >>> >>> On Wed, Jun 20, 2012 at 8:40 AM, Michael Paquier < >>> mic...@gm...> wrote: >>> >>>> >>>> >>>> On Wed, Jun 20, 2012 at 4:19 AM, Abbas Butt < >>>> abb...@en...> wrote: >>>> >>>>> You forgot to attach the patch. >>>>> >>>> Sorry here is the patch. >>>> >>>> >>>> >>>>> >>>>> On Tue, Jun 19, 2012 at 10:58 AM, Michael Paquier < >>>>> mic...@gm...> wrote: >>>>> >>>>>> Hi all, >>>>>> >>>>>> Please find attached an improved patch. I corrected the following >>>>>> points: >>>>>> - Storage table uses an access exclusive lock meaning it cannot be >>>>>> accessed by other sessions in cluster >>>>>> - The table redistributed uses an exclusive lock, it can be accessed >>>>>> by the other sessions in cluster with SELECT while redistribution is running >>>>>> - Addition of an API to manage table locking >>>>>> - Correction of bugs regarding session concurrency. An update in >>>>>> pgxc_class (update of distribution data) was not seen by concurrent >>>>>> sessions in cluster. 
>>>>>> - doc correction and completion >>>>>> - regression fixes due to grammar change for node list in CTAS, >>>>>> CREATE TABLE, EXECUTE DIRECT and CLEAN CONNECTION >>>>>> - Fix of system functions using EXECUTE direct >>>>>> - Fix for CTAS query generation >>>>>> - update index of catalog pgxc_class updated >>>>>> - Correct update for relation cache when location data is updated >>>>>> >>>>>> Questions are welcome. >>>>>> This patch can be applied on master and works as expected. >>>>>> >>>>>> On Mon, Jun 18, 2012 at 5:25 PM, Michael Paquier < >>>>>> mic...@gm...> wrote: >>>>>> >>>>>>> Hi all, >>>>>>> >>>>>>> Based on the design above, I went to the end of my idea and took a >>>>>>> day to write a prototype for online redistribution based on ALTER TABLE. >>>>>>> It uses the grammar written in previous mail with ADD NODE/DELETE >>>>>>> NODE/DISTRIBUTE BY/TO NODE | GROUP. >>>>>>> >>>>>>> The main idea is the use of what I call a "storage" table which is >>>>>>> used as a temporary location for the data being distributed in cluster. >>>>>>> This table is created as unlogged >>>>>>> >>>>>>> The patch sticks with the design invocated before; >>>>>>> - Cached plans are dropped when redistribution is invocated >>>>>>> - Vacuum is not necessary, this mechanism uses transaction-safe >>>>>>> queries >>>>>>> - for the time being, this implementation uses an exclusive lock, >>>>>>> but as the redistribution is done, a ShareUpdateExclusive lock is not to >>>>>>> exclude. >>>>>>> - tables are reindexed if necessary. >>>>>>> - redistribution cannot be done inside a transaction block >>>>>>> - redistribution is not authorized with all the other commands as >>>>>>> they are locally-safe on each node. >>>>>>> - no restrictions on the distribution types, table types or >>>>>>> subclusters >>>>>>> >>>>>>> This feature can be really improved for example in the case of >>>>>>> replicated tables in particular, when the list of nodes of the table is >>>>>>> changed. 
>>>>>>> It is one of the things I would like to improve as it would really >>>>>>> increase performance >>>>>>> >>>>>>> Regards, >>>>>>> >>>>>>> -- >>>>>>> Michael Paquier >>>>>>> http://michael.otacoo.com >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Michael Paquier >>>>>> http://michael.otacoo.com >>>>>> >>>>>> >>>>>> ------------------------------------------------------------------------------ >>>>>> Live Security Virtual Conference >>>>>> Exclusive live event will cover all the ways today's security and >>>>>> threat landscape has changed and how IT managers can respond. >>>>>> Discussions >>>>>> will include endpoint security, mobile security and the latest in >>>>>> malware >>>>>> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >>>>>> _______________________________________________ >>>>>> Postgres-xc-developers mailing list >>>>>> Pos...@li... >>>>>> https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> -- >>>>> Abbas >>>>> Architect >>>>> EnterpriseDB Corporation >>>>> The Enterprise PostgreSQL Company >>>>> >>>>> Phone: 92-334-5100153 >>>>> >>>>> >>>>> Website: www.enterprisedb.com >>>>> EnterpriseDB Blog: http://blogs.enterprisedb.com/ >>>>> Follow us on Twitter: http://www.twitter.com/enterprisedb >>>>> >>>>> This e-mail message (and any attachment) is intended for the use of >>>>> the individual or entity to whom it is addressed. This message >>>>> contains information from EnterpriseDB Corporation that may be >>>>> privileged, confidential, or exempt from disclosure under applicable >>>>> law. 
If you are not the intended recipient or authorized to receive >>>>> this for the intended recipient, any use, dissemination, distribution, >>>>> retention, archiving, or copying of this communication is strictly >>>>> prohibited. If you have received this e-mail in error, please notify >>>>> the sender immediately by reply e-mail and delete this message. |
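To make the method-2 flow discussed above concrete, here is a rough SQL sketch of the sequence as it could run on the Coordinator. The staging file path, table name, and column name are hypothetical, and `DISTRIBUTE BY` is the grammar proposed earlier in this thread, so treat this as an illustration of the design rather than the actual implementation:

```sql
-- User-visible command (proposed XC grammar):
ALTER TABLE tab1 DISTRIBUTE BY HASH (a);

-- What method 2 does under the hood, expressed as an equivalent
-- manual sequence (staging file path is hypothetical):
COPY tab1 TO '/pgdata/pg_distrib/16384.dat';    -- stage all rows in a file on the Coordinator
TRUNCATE tab1;                                  -- fast deletion of the old copies on every node
-- ...pgxc_class is updated here with the new distribution information...
COPY tab1 FROM '/pgdata/pg_distrib/16384.dat';  -- reload; the Coordinator routes each row
                                                -- according to the new distribution
```

This also shows why the traffic is halved relative to the storage-table approach: each row crosses the network once on the way in and once on the way out, with no intermediate cluster-distributed table.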
From: Michael P. <mic...@gm...> - 2012-06-27 06:34:39
|
On Wed, Jun 27, 2012 at 3:23 PM, Ashutosh Bapat < ash...@en...> wrote: > Hi Michael, > That's a good catch and good that we came upon it before finalising the > fix. > > While creating the view definition, it tries to parse the SELECT statement > and while doing so, tries to resolve the aggregate function. On datanode, > aggregates have transition type as return type, which in this case is > polymorphic (not acceptable as return type). We manage such aggregates by > not pushing the aggregates down to the datanodes, but in this case don't > look at what can be pushed or not inside the view definition. > Yes, OK. > > What we may want to do, and is hard to do, is to dis-assemble the view > definition at coordinator and send the relevant information (the one stored > in catalogs?) to the datanode to be stored directly (without involving > parsing etc.). The same may need to be done with all the utilities, but > this is a massive change, and something which needs to be thought through > properly. > This is... well... not simple. And not completely related to this fix. > > > On Wed, Jun 27, 2012 at 11:05 AM, Michael Paquier < > mic...@gm...> wrote: > >> Hi, >> >> I wrote a patch enabling the creation of views on Datanodes to get rid of >> this function problem. The fix is attached. >> However, while digging into this issue, I found a problem with types and >> views, for example: >> create table aa (a int); >> create type aa_type as enum ('1','2','3','4','5','6'); >> create view aa_v as select max(a::aa_type) from aa; -- created on all the >> nodes >> ERROR: column "max" has pseudo-type anyenum >> >> This error comes from heap.c, where a check is done on the type of the >> column. >> The problem is that in the case of aggregates, we use the transition type >> on Datanodes, which is a pseudo-type and is by definition forbidden for as >> a column type. 
>> The aggregate modification comes from here: >> --- a/src/backend/parser/parse_agg.c >> +++ b/src/backend/parser/parse_agg.c >> @@ -209,6 +209,7 @@ transformAggregateCall(ParseState *pstate, Aggref >> *agg, >> aggform = (Form_pg_aggregate) GETSTRUCT(aggTuple); >> agg->aggtrantype = aggform->aggtranstype; >> agg->agghas_collectfn = OidIsValid(aggform->aggcollectfn); >> + //Error comes from this one: >> if (IS_PGXC_DATANODE) >> agg->aggtype = agg->aggtrantype; >> >> Associating a transition type on Datanodes for aggregates is correct, but >> until now we have never created views on Datanodes. >> Btw, a fix for this second issue is included in the patch attached. What >> I simply did was bypassing the error on Datanodes as we may have a >> pseudo-type in the case of an aggregate. Ashutosh, comments on that? >> >> >> >> On Wed, Jun 20, 2012 at 2:59 PM, Ashutosh Bapat < >> ash...@en...> wrote: >> >>> >>> >>> On Wed, Jun 20, 2012 at 10:25 AM, Michael Paquier < >>> mic...@gm...> wrote: >>> >>>> >>>> >>>> On Wed, Jun 20, 2012 at 12:58 PM, Ashutosh Bapat < >>>> ash...@en...> wrote: >>>> >>>>> >>>>> >>>>> On Wed, Jun 20, 2012 at 9:18 AM, Michael Paquier < >>>>> mic...@gm...> wrote: >>>>> >>>>>> >>>>>> >>>>>> On Wed, Jun 20, 2012 at 12:46 PM, Ashutosh Bapat < >>>>>> ash...@en...> wrote: >>>>>> >>>>>>> One fix, I can think of is to create volatile functions only on >>>>>>> coordinator. Although, I would still take a step back, and find out why we >>>>>>> took the decision not to store views on the datanodes. >>>>>>> >>>>>> View => projection of table data => need distribution type of table >>>>>> => distribution data only available on Coordinator for data distribution => >>>>>> no sense to define views on Datanodes >>>>>> >>>>> >>>>> In the case, where a view type is used as function argument or return >>>>> type, it does make sense to have the view definition on the datanodes. 
The >>>>> implication behind my question is whether there is any correctness problem >>>>> by creating view and related definitions at the datanodes. >>>>> >>>> By taking this question from another angle: >>>> Are there any problems to push down clauses using views to Datanodes? >>>> >>> >>> Having view definitions on the datanode does not imply that we have to >>> push the clauses using views to the datanodes. In fact, even if we want to, >>> we won't be able to do so, as the view resolution happens even before we >>> take into consideration the distribution. >>> >>> >>>> Just based on correctness, the answer is no problem. Btw, the function >>>> using a view should be volatile as it reads data, so it will not be used on >>>> Datanodes at all... >>>> >>> >>> We are not using view here, we are using datatype which corresponds to >>> the view result. Using such datatype does not necessarily mean that we >>> touch any of the data. For example, see the function (modified version of >>> the example given by Dimitrije) below >>> >>> CREATE OR REPLACE FUNCTION some_function() RETURNS SETOF some_view AS >>> $body$ >>> BEGIN >>> return (1, 1); >>> END; >>> $body$ >>> LANGUAGE 'plpgsql' >>> COST 100; >>> >>> This function is certainly immutable (certainly not volatile), and thus >>> pushable to the datanodes. For such functions, it having view definitions >>> at the datanodes will be helpful. >>> >>> >>>> -- >>>> Michael Paquier >>>> http://michael.otacoo.com >>>> >>> >>> >>> >>> -- >>> Best Wishes, >>> Ashutosh Bapat >>> EntepriseDB Corporation >>> The Enterprise Postgres Company >>> >>> >> >> >> -- >> Michael Paquier >> http://michael.otacoo.com >> > > > > -- > Best Wishes, > Ashutosh Bapat > EntepriseDB Corporation > The Enterprise Postgres Company > > -- Michael Paquier http://michael.otacoo.com |
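The pseudo-type issue described above can be seen directly in the catalogs: for an aggregate such as max over an enum, the transition type stored in pg_aggregate is the polymorphic anyenum, which is exactly the type a Datanode would try to use for the view column. The catalog query below is a plain PostgreSQL lookup; that `max(a::aa_type)` resolves to `max(anyenum)` is inferred from the error message quoted in the mail:

```sql
-- Inspect the transition type of the aggregate picked for an enum argument.
SELECT aggfnoid::regprocedure,
       aggtranstype::regtype    -- anyenum: a pseudo-type, rejected by heap.c
FROM   pg_aggregate
WHERE  aggfnoid = 'max(anyenum)'::regprocedure;
```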
From: Ashutosh B. <ash...@en...> - 2012-06-27 06:23:25
|
Hi Michael, That's a good catch and good that we came upon it before finalising the fix. While creating the view definition, it tries to parse the SELECT statement and while doing so, tries to resolve the aggregate function. On datanode, aggregates have transition type as return type, which in this case is polymorphic (not acceptable as return type). We manage such aggregates by not pushing the aggregates down to the datanodes, but in this case don't look at what can be pushed or not inside the view definition. What we may want to do, and is hard to do, is to dis-assemble the view definition at coordinator and send the relevant information (the one stored in catalogs?) to the datanode to be stored directly (without involving parsing etc.). The same may need to be done with all the utilities, but this is a massive change, and something which needs to be thought through properly. On Wed, Jun 27, 2012 at 11:05 AM, Michael Paquier <mic...@gm... > wrote: > Hi, > > I wrote a patch enabling the creation of views on Datanodes to get rid of > this function problem. The fix is attached. > However, while digging into this issue, I found a problem with types and > views, for example: > create table aa (a int); > create type aa_type as enum ('1','2','3','4','5','6'); > create view aa_v as select max(a::aa_type) from aa; -- created on all the > nodes > ERROR: column "max" has pseudo-type anyenum > > This error comes from heap.c, where a check is done on the type of the > column. > The problem is that in the case of aggregates, we use the transition type > on Datanodes, which is a pseudo-type and is by definition forbidden for as > a column type. 
> The aggregate modification comes from here: > --- a/src/backend/parser/parse_agg.c > +++ b/src/backend/parser/parse_agg.c > @@ -209,6 +209,7 @@ transformAggregateCall(ParseState *pstate, Aggref *agg, > aggform = (Form_pg_aggregate) GETSTRUCT(aggTuple); > agg->aggtrantype = aggform->aggtranstype; > agg->agghas_collectfn = OidIsValid(aggform->aggcollectfn); > + //Error comes from this one: > if (IS_PGXC_DATANODE) > agg->aggtype = agg->aggtrantype; > > Associating a transition type on Datanodes for aggregates is correct, but > until now we have never created views on Datanodes. > Btw, a fix for this second issue is included in the patch attached. What I > simply did was bypassing the error on Datanodes as we may have a > pseudo-type in the case of an aggregate. Ashutosh, comments on that? > > > > On Wed, Jun 20, 2012 at 2:59 PM, Ashutosh Bapat < > ash...@en...> wrote: > >> >> >> On Wed, Jun 20, 2012 at 10:25 AM, Michael Paquier < >> mic...@gm...> wrote: >> >>> >>> >>> On Wed, Jun 20, 2012 at 12:58 PM, Ashutosh Bapat < >>> ash...@en...> wrote: >>> >>>> >>>> >>>> On Wed, Jun 20, 2012 at 9:18 AM, Michael Paquier < >>>> mic...@gm...> wrote: >>>> >>>>> >>>>> >>>>> On Wed, Jun 20, 2012 at 12:46 PM, Ashutosh Bapat < >>>>> ash...@en...> wrote: >>>>> >>>>>> One fix, I can think of is to create volatile functions only on >>>>>> coordinator. Although, I would still take a step back, and find out why we >>>>>> took the decision not to store views on the datanodes. >>>>>> >>>>> View => projection of table data => need distribution type of table => >>>>> distribution data only available on Coordinator for data distribution => no >>>>> sense to define views on Datanodes >>>>> >>>> >>>> In the case, where a view type is used as function argument or return >>>> type, it does make sense to have the view definition on the datanodes. The >>>> implication behind my question is whether there is any correctness problem >>>> by creating view and related definitions at the datanodes. 
>>>> >>> By taking this question from another angle: >>> Are there any problems to push down clauses using views to Datanodes? >>> >> >> Having view definitions on the datanode does not imply that we have to >> push the clauses using views to the datanodes. In fact, even if we want to, >> we won't be able to do so, as the view resolution happens even before we >> take into consideration the distribution. >> >> >>> Just based on correctness, the answer is no problem. Btw, the function >>> using a view should be volatile as it reads data, so it will not be used on >>> Datanodes at all... >>> >> >> We are not using view here, we are using datatype which corresponds to >> the view result. Using such datatype does not necessarily mean that we >> touch any of the data. For example, see the function (modified version of >> the example given by Dimitrije) below >> >> CREATE OR REPLACE FUNCTION some_function() RETURNS SETOF some_view AS >> $body$ >> BEGIN >> return (1, 1); >> END; >> $body$ >> LANGUAGE 'plpgsql' >> COST 100; >> >> This function is certainly immutable (certainly not volatile), and thus >> pushable to the datanodes. For such functions, it having view definitions >> at the datanodes will be helpful. >> >> >>> -- >>> Michael Paquier >>> http://michael.otacoo.com >>> >> >> >> >> -- >> Best Wishes, >> Ashutosh Bapat >> EntepriseDB Corporation >> The Enterprise Postgres Company >> >> > > > -- > Michael Paquier > http://michael.otacoo.com > -- Best Wishes, Ashutosh Bapat EntepriseDB Corporation The Enterprise Postgres Company |
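As a side note, the example function above would not actually run as written: a plpgsql function declared RETURNS SETOF cannot return a row with a plain RETURN expression. A corrected sketch (assuming some_view exposes two integer columns) would be:

```sql
-- Set-returning plpgsql functions must use RETURN NEXT or RETURN QUERY.
CREATE OR REPLACE FUNCTION some_function() RETURNS SETOF some_view AS $body$
BEGIN
    RETURN QUERY SELECT 1, 1;       -- one row shaped like some_view
END;
$body$ LANGUAGE plpgsql IMMUTABLE;  -- marked immutable, hence a push-down candidate
```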
From: Michael P. <mic...@gm...> - 2012-06-26 13:46:14
|
On 2012/06/26, at 21:20, Amit Khandekar <ami...@en...> wrote: > > > On 20 June 2012 11:16, Michael Paquier <mic...@gm...> wrote: > Please find attached a far more performant patch. > This new mechanism uses a redistribution based on method 2: > - COPY TO > - TRUNCATE > - catalog update > - COPY FROM > > The advantages of this method are: > - network traffic is halved as all the data is located temporarily on Coordinator > - TRUNCATE speeds up data deletion > - COPY improves the performance of data transfer by ~30 times > - Time necessary to redistribute is the time necessary to process COPY FROM + COPY TO > - Critical diminution of XLOG, there are not anymore a multitude of xlogs generated at Datanode level when distribution is changed to hash/modulo. > > I did a couple of tests up with tables of hundreds of MB of data. > For a table of 10Mrows (300MB with pg_relation_size), redistribution took 30s from a replication to hash. > With the previous patch, redistribution took 30mins for a 1Mrow table when table was altered from replication to hash distribution. > I am also conducting tests with tables of GB size. However this mechanism is really fast. > > > Hi Michael, > > While looking at the other remote-copy-API related patch that you sent in different thread, I ultimately ended up with this patch in order to know the basic ALTER TABLE implementation. Sorry to comment on this patch so late. > > I was checking if it is even possible to pull only the required rows from tables for COPY. If somehow we could build an expression based on the hash function we might be able to use it as a qualifier in the select query to the datanode so that only those rows will be retrieved that need to be deleted from the table: This assumes that datanodes have the knowledge of the other datanodes, and this is not the case, so this idea is difficult to put in practice. > SELECT from tab1 where expr(dist_col) is true; > And then supply the results into COPY (or directly to insert ?) 
The first version of my patch used an INSERT-based mechanism. This has low performance, and we may end up generating one INSERT query per row, with one xlog record for each query. If millions of rows are moved between nodes, you might need to replay unnecessary records after, for example, a node crash, slowing the node recovery by that much. > So the COPY will look like: > COPY (SELECT from tab1 where expr(dist_col) is true) > or still better: > COPY ( DELETE from tab1 where expr(dist_col) is true RETURNING *) > This will get rid of a separate step to delete rows. But I think COPY ... DELETE is not valid syntax. No, it is not. But you can use a SELECT clause with a WITH containing a DELETE ... RETURNING. However, with this method a local Datanode needs to know whether its tuples are correctly located in the cluster, and as a Datanode has no view of the other XC nodes, this is not directly applicable. > Maybe we can write a set-returning function that runs the DELETE ... RETURNING and then use COPY (select such_func('table_name') ) > > If we manage to delete and retrieve only the required rows, it may be possible to skip the COPY part and do a straight: > insert into tab1 DELETE from tab1 where expr(dist_col) is true RETURNING *. I am not sure whether it is more performant to do this with or without COPY. COPY may or may not write into WAL logs. COPY generates only one WAL record, I think, as it is a single query. However, the COPY protocol is the fastest solution for data transfer, so we should definitely avoid INSERT-based solutions, as they would end up using the slower extended query protocol and increase the number of xlogs on the Datanodes. > > This is all a quick thought without much analysis of the implications, but I still wanted to put it out here. For example, is it even possible to create such an expression expr(dist_col)? It should be built up according to whether the distribution is hash, modulo, or round robin.
For round robin may be it should always be true because we anyways want to delete them and reinsert, or do we? If the table is to be altered from hash to round robin, the expression might be needed to always be true may be; and so on. > > Regards, > > > On Wed, Jun 20, 2012 at 9:29 AM, Michael Paquier <mic...@gm...> wrote: > In this patch, the SQL/catalog management and the distribution mechanism use really separated APIs. > So even if I do not think it is necessary to change the SQL part, the redistribution mechanism can be changed at will. > > For the time being, the redistribution mechanism is not really performant (well, it was not the goal of this prototype), because it uses the following model. > 1) Creation of a storage table (unlogged) with default distribution (CTAS) > 2) Take necessary locks on storage and redistributed table > 3) Delete all data on redistributed table > 4) Update catalogs with new distribution information > 5) Perform INSERT SELECT from storage table to redistributed table > 6) DROP storage table > As mentionned by Ashutosh, the data needs to travel 4 times through network so it can really take a lot of time for tables with lots of gigs of data. > Network in itself is not the bottleneck, it is the usage of the framework of postgres. This is especially true when queries used by redistribution mechanism cannot be pushed down. > The worst case being when a table is redistributed to a hash/modulo on multiple nodes, as in this case it is necessary to plan each INSERT for the redistribution. As mentionned also by Ashutosh this can create a huge deal of xlogs on remote Datanodes, not really welcome after a crash recovery. > > There are several ways possible to improve this dramatically improve the redistribution mechanism. > Here are 3 ideas. > 1) Create storage table with data on a single node > Here we reduce the load on network, but it cannot solve the problem of tables redistributed to modulo/hash on multiple nodes. 
It will create a lot of INSERT queries for a slow result. > 2) Use a COPY mechanism > One of the simple solutions. Instead of using a costly storage table in cluster, store the data on Coordinator during the redistribution > a) COPY the data of table being redistributed to a file in $PGDATA of Coordinator. Why not $PGDATA/pg_distrib/oid? > b) DELETE all the data on table being redistributed > c) update catalogs > d) COPY FROM file to table with new distribution type > Network load is halved. COPY is also really faster. > Servers of Coordinator are not chosen for there disk I/O but the folder $PGDATA/pg_distrib could be linked to a folder where a faster disk is mounted > This also gets rid of the storage table. The only thing to care of is the deletion of the temporary data file once redistribution transaction commits or aborts. > Data file could also be compressed to reduce space consumed and I/O on disk. > > 3) Use a batching process to communicate only necessary tuples from Datanodes to Coordinator. > Suggested by Ashutosh, this can use COPY protocol to redistribute in a batch way the tuples being redistributed. > The idea is to send from Datanodes to Coordinator only the tuples that need to be redistributed, and then let Coordinator redistribute correctly all the data depending on the new distribution. This avoids to have to store temporarily the data redistributed and all the transfer is managed by cache on Coordinator. > This idea has a couple of limitations though: > - a Datanode is not aware of the existence of the other nodes in cluster. Now distribution data is only available at Coordinator on catalog pgxc_class, and this distribution data contains the list of nodes where data is distributed. This is directly dependant on catalog pgxc_node. So a Datanode cannot know if a tuple will be at the correct place or not. 
> This could be countered by allowing the run of node DDLs on Datanodes, but this adds an additional constraint on cluster setting as it forces the cluster designer to update all the pgxc_node catalogs on all the nodes. Having a pgxc_node catalog on Datanode would make sense if it communicates with other nodes through the same pooler as Coordinator, but this also raises issues with multiple backends open on one node for the same session, which is dangerous for transaction handling. > - visibility concerns. What insures that a tuple has been only selected once. As redistribution is a cluster-based mechanism. What can insure that a scan on a Datanode is not taking into account some tuples that have already been redistributed. > > Method 1 looks useless from the point of performance. > Method 2 should have a good performance. This only point is that data has to be located on Coordinator server temporarily while redistribution is being done. We could also use some GUC parameter to allow DBA to customize the way redistribution data folder is stored (compression type, file name format...). > I have some concerns about method 3 as explained above. I might not take into account all the potential problems or have a limited view on this mechanism, but it introduces some new dependencies with cluster setting which may not be necessary. However any discussion on the subject is welcome. > > Suggestions are welcome. > > > On Wed, Jun 20, 2012 at 8:40 AM, Michael Paquier <mic...@gm...> wrote: > > > On Wed, Jun 20, 2012 at 4:19 AM, Abbas Butt <abb...@en...> wrote: > You forgot to attach the patch. > Sorry here is the patch. > > > > On Tue, Jun 19, 2012 at 10:58 AM, Michael Paquier <mic...@gm...> wrote: > Hi all, > > Please find attached an improved patch. 
I corrected the following points: > - Storage table uses an access exclusive lock meaning it cannot be accessed by other sessions in cluster > - The table redistributed uses an exclusive lock, it can be accessed by the other sessions in cluster with SELECT while redistribution is running > - Addition of an API to manage table locking > - Correction of bugs regarding session concurrency. An update in pgxc_class (update of distribution data) was not seen by concurrent sessions in cluster. > - doc correction and completion > - regression fixes due to grammar change for node list in CTAS, CREATE TABLE, EXECUTE DIRECT and CLEAN CONNECTION > - Fix of system functions using EXECUTE direct > - Fix for CTAS query generation > - update index of catalog pgxc_class updated > - Correct update for relation cache when location data is updated > > Questions are welcome. > This patch can be applied on master and works as expected. > > On Mon, Jun 18, 2012 at 5:25 PM, Michael Paquier <mic...@gm...> wrote: > Hi all, > > Based on the design above, I went to the end of my idea and took a day to write a prototype for online redistribution based on ALTER TABLE. > It uses the grammar written in previous mail with ADD NODE/DELETE NODE/DISTRIBUTE BY/TO NODE | GROUP. > > The main idea is the use of what I call a "storage" table which is used as a temporary location for the data being distributed in cluster. > This table is created as unlogged > > The patch sticks with the design invocated before; > - Cached plans are dropped when redistribution is invocated > - Vacuum is not necessary, this mechanism uses transaction-safe queries > - for the time being, this implementation uses an exclusive lock, but as the redistribution is done, a ShareUpdateExclusive lock is not to exclude. > - tables are reindexed if necessary. > - redistribution cannot be done inside a transaction block > - redistribution is not authorized with all the other commands as they are locally-safe on each node. 
> - no restrictions on the distribution types, table types or subclusters > > This feature can be really improved for example in the case of replicated tables in particular, when the list of nodes of the table is changed. > It is one of the things I would like to improve as it would really increase performance > > Regards, > > -- > Michael Paquier > http://michael.otacoo.com > > > > -- > Michael Paquier > http://michael.otacoo.com > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Postgres-xc-developers mailing list > Pos...@li... > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers > > > > > -- > -- > Abbas > Architect > EnterpriseDB Corporation > The Enterprise PostgreSQL Company > > Phone: 92-334-5100153 > > > Website: www.enterprisedb.com > EnterpriseDB Blog: http://blogs.enterprisedb.com/ > Follow us on Twitter: http://www.twitter.com/enterprisedb > > This e-mail message (and any attachment) is intended for the use of > the individual or entity to whom it is addressed. This message > contains information from EnterpriseDB Corporation that may be > privileged, confidential, or exempt from disclosure under applicable > law. 
If you are not the intended recipient or authorized to receive > this for the intended recipient, any use, dissemination, distribution, > retention, archiving, or copying of this communication is strictly > prohibited. If you have received this e-mail in error, please notify > the sender immediately by reply e-mail and delete this message. |
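The "WITH having a DELETE returning" form mentioned in the exchange above could look like the sketch below. Everything here is hypothetical: hashint4 stands in for whatever hash the XC locator really uses, 3 is an assumed number of nodes, and 1 is the bucket this Datanode is supposed to keep — a piece of information a Datanode does not currently have, which is exactly the limitation Michael points out:

```sql
-- Delete the rows that no longer belong to this node and hand them back
-- in the same pass, avoiding a separate DELETE step.
WITH moved AS (
    DELETE FROM tab1
    WHERE abs(hashint4(dist_col)) % 3 <> 1   -- expr(dist_col): hash-bucket test
    RETURNING *
)
SELECT * FROM moved;
```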
From: Amit K. <ami...@en...> - 2012-06-26 12:21:16
|
On 20 June 2012 11:16, Michael Paquier <mic...@gm...> wrote: > Please find attached a far more performant patch. > This new mechanism uses a redistribution based on method 2: > - COPY TO > - TRUNCATE > - catalog update > - COPY FROM > > The advantages of this method are: > - network traffic is halved as all the data is located temporarily on > Coordinator > - TRUNCATE speeds up data deletion > - COPY improves the performance of data transfer by ~30 times > - Time necessary to redistribute is the time necessary to process COPY > FROM + COPY TO > - Critical diminution of XLOG, there are not anymore a multitude of xlogs > generated at Datanode level when distribution is changed to hash/modulo. > > I did a couple of tests up with tables of hundreds of MB of data. > For a table of 10Mrows (300MB with pg_relation_size), redistribution took > 30s from a replication to hash. > With the previous patch, redistribution took 30mins for a 1Mrow table when > table was altered from replication to hash distribution. > I am also conducting tests with tables of GB size. However this mechanism > is really fast. > > Hi Michael, While looking at the other remote-copy-API related patch that you sent in different thread, I ultimately ended up with this patch in order to know the basic ALTER TABLE implementation. Sorry to comment on this patch so late. I was checking if it is even possible to pull only the required rows from tables for COPY. If somehow we could build an expression based on the hash function we might be able to use it as a qualifier in the select query to the datanode so that only those rows will be retrieved that need to be deleted from the table: SELECT from tab1 where expr(dist_col) is true; And then supply the results into COPY (or directly to insert ?) So the COPY will look like: COPY (SELECT from tab1 where expr(dist_col) is true) or still better: COPY ( DELETE from tab1 where expr(dist_col) is true RETURNING *) This will get rid of a separate step to delete rows. 
But I think COPY ... DELETE is not valid syntax. Maybe we can write a set-returning function that runs DELETE ... RETURNING and then use COPY (select such_func('table_name') ) If we manage to delete and retrieve only the required rows, it may be possible to skip the COPY part and do a straight: insert into tab1 DELETE from tab1 where expr(dist_col) is true RETURNING *. I am not sure if it is performant to do this w/ or w/o copy. COPY may or may not write into WAL logs. This is all a quick thought without much thought on the implications, but I still wanted to put it out here. For e.g., is it possible to create such an expression expr(dist_col)? It should be correctly built up according to whether it is hash or modulo or round robin distribution. For round robin maybe it should always be true, because we anyway want to delete them and reinsert, or do we? If the table is to be altered from hash to round robin, the expression might need to always be true; and so on. Regards, > > > On Wed, Jun 20, 2012 at 9:29 AM, Michael Paquier < > mic...@gm...> wrote: > >> In this patch, the SQL/catalog management and the distribution mechanism >> use really separate APIs. >> So even if I do not think it is necessary to change the SQL part, the >> redistribution mechanism can be changed at will. >> >> For the time being, the redistribution mechanism is not really performant >> (well, it was not the goal of this prototype), because it uses the >> following model. >> 1) Creation of a storage table (unlogged) with default distribution (CTAS) >> 2) Take necessary locks on storage and redistributed table >> 3) Delete all data on redistributed table >> 4) Update catalogs with new distribution information >> 5) Perform INSERT SELECT from storage table to redistributed table >> 6) DROP storage table >> As mentioned by Ashutosh, the data needs to travel 4 times through the >> network, so it can really take a lot of time for tables with lots of gigs of >> data. 
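[Editor's note] To make the expr(dist_col) idea above concrete, here is a small sketch in Python. It is purely illustrative: the hash function, node names, and helper functions are hypothetical stand-ins, not XC's actual backend hashing, but they show how such a qualifier would pick out only the rows whose target node changes when the node list is altered:

```python
# Hypothetical sketch of expr(dist_col): Python's hash() stands in
# for the backend's per-datatype hash function.

def target_node(value, nodes):
    # Pick the node that owns `value` under hash/modulo distribution.
    return nodes[hash(value) % len(nodes)]

def needs_move(value, old_nodes, new_nodes):
    # True if the row owning `value` must be shipped elsewhere once
    # the distribution changes from old_nodes to new_nodes; this is
    # the predicate a qualifier like expr(dist_col) would evaluate.
    return target_node(value, old_nodes) != target_node(value, new_nodes)

# Going from 2 to 3 datanodes, only rows whose target changes need
# to be deleted and reinserted; the rest can stay in place.
old_nodes = ["dn1", "dn2"]
new_nodes = ["dn1", "dn2", "dn3"]
moving = [v for v in range(1000) if needs_move(v, old_nodes, new_nodes)]
```

With this toy hash, roughly two thirds of the rows change owner; a real implementation would push the `needs_move` test down as a WHERE qualifier so the Datanode scans return only those rows.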
>> Network in itself is not the bottleneck, it is the usage of the framework >> of postgres. This is especially true when queries used by the redistribution >> mechanism cannot be pushed down. >> The worst case is when a table is redistributed to hash/modulo on >> multiple nodes, as in this case it is necessary to plan each INSERT for the >> redistribution. As also mentioned by Ashutosh, this can create a huge deal >> of xlogs on remote Datanodes, not really welcome after a crash recovery. >> >> There are several possible ways to dramatically improve the >> redistribution mechanism. >> Here are 3 ideas. >> 1) Create storage table with data on a single node >> Here we reduce the load on the network, but it cannot solve the problem of >> tables redistributed to modulo/hash on multiple nodes. It will create a lot >> of INSERT queries for a slow result. >> 2) Use a COPY mechanism >> One of the simple solutions. Instead of using a costly storage table in >> cluster, store the data on the Coordinator during the redistribution: >> a) COPY the data of the table being redistributed to a file in $PGDATA of >> Coordinator. Why not $PGDATA/pg_distrib/oid? >> b) DELETE all the data on the table being redistributed >> c) update catalogs >> d) COPY FROM file to table with new distribution type >> Network load is halved. COPY is also much faster. >> Servers of Coordinator are not chosen for their disk I/O, but the folder >> $PGDATA/pg_distrib could be linked to a folder where a faster disk is >> mounted. >> This also gets rid of the storage table. The only thing to take care of is the >> deletion of the temporary data file once the redistribution transaction commits >> or aborts. >> The data file could also be compressed to reduce space consumed and I/O on >> disk. >> >> 3) Use a batching process to communicate only necessary tuples from >> Datanodes to Coordinator. >> Suggested by Ashutosh, this can use the COPY protocol to redistribute in a >> batch way the tuples being redistributed. 
>> The idea is to send from Datanodes to Coordinator only the tuples that >> need to be redistributed, and then let the Coordinator correctly redistribute >> all the data depending on the new distribution. This avoids having to >> store the redistributed data temporarily, and all the transfer is managed by >> cache on the Coordinator. >> This idea has a couple of limitations though: >> - a Datanode is not aware of the existence of the other nodes in the cluster. >> Now distribution data is only available at the Coordinator in catalog >> pgxc_class, and this distribution data contains the list of nodes where >> data is distributed. This is directly dependent on catalog pgxc_node. So a >> Datanode cannot know if a tuple will be at the correct place or not. >> This could be countered by allowing the run of node DDLs on Datanodes, >> but this adds an additional constraint on cluster setting as it forces the >> cluster designer to update all the pgxc_node catalogs on all the nodes. >> Having a pgxc_node catalog on a Datanode would make sense if it communicates >> with other nodes through the same pooler as the Coordinator, but this also >> raises issues with multiple backends open on one node for the same session, >> which is dangerous for transaction handling. >> - visibility concerns. What ensures that a tuple has been selected only >> once? As redistribution is a cluster-wide mechanism, what can ensure that >> a scan on a Datanode is not taking into account some tuples that have >> already been redistributed? >> >> Method 1 looks useless from the point of view of performance. >> Method 2 should have good performance. The only point is that data has >> to be located on the Coordinator server temporarily while redistribution is >> being done. We could also use some GUC parameter to allow the DBA to customize >> the way the redistribution data folder is stored (compression type, file name >> format...). >> I have some concerns about method 3 as explained above. 
I might not take >> into account all the potential problems or have a limited view on this >> mechanism, but it introduces some new dependencies with cluster setting >> which may not be necessary. However, any discussion on the subject is >> welcome. >> >> Suggestions are welcome. >> >> >> On Wed, Jun 20, 2012 at 8:40 AM, Michael Paquier < >> mic...@gm...> wrote: >> >>> >>> >>> On Wed, Jun 20, 2012 at 4:19 AM, Abbas Butt <abb...@en... >>> > wrote: >>> >>>> You forgot to attach the patch. >>>> >>> Sorry, here is the patch. >>> >>> >>> >>>> >>>> On Tue, Jun 19, 2012 at 10:58 AM, Michael Paquier < >>>> mic...@gm...> wrote: >>>> >>>>> Hi all, >>>>> >>>>> Please find attached an improved patch. I corrected the following >>>>> points: >>>>> - Storage table uses an access exclusive lock, meaning it cannot be >>>>> accessed by other sessions in cluster >>>>> - The table redistributed uses an exclusive lock, it can be accessed >>>>> by the other sessions in cluster with SELECT while redistribution is running >>>>> - Addition of an API to manage table locking >>>>> - Correction of bugs regarding session concurrency. An update in >>>>> pgxc_class (update of distribution data) was not seen by concurrent >>>>> sessions in cluster. >>>>> - doc correction and completion >>>>> - regression fixes due to grammar change for node list in CTAS, CREATE >>>>> TABLE, EXECUTE DIRECT and CLEAN CONNECTION >>>>> - Fix of system functions using EXECUTE DIRECT >>>>> - Fix for CTAS query generation >>>>> - Index of catalog pgxc_class updated >>>>> - Correct update for relation cache when location data is updated >>>>> >>>>> Questions are welcome. >>>>> This patch can be applied on master and works as expected. >>>>> >>>>> On Mon, Jun 18, 2012 at 5:25 PM, Michael Paquier < >>>>> mic...@gm...> wrote: >>>>> >>>>>> Hi all, >>>>>> >>>>>> Based on the design above, I went to the end of my idea and took a >>>>>> day to write a prototype for online redistribution based on ALTER TABLE. 
>>>>>> It uses the grammar written in the previous mail with ADD NODE/DELETE >>>>>> NODE/DISTRIBUTE BY/TO NODE | GROUP. >>>>>> >>>>>> The main idea is the use of what I call a "storage" table, which is >>>>>> used as a temporary location for the data being distributed in the cluster. >>>>>> This table is created as unlogged. >>>>>> >>>>>> The patch sticks with the design described before: >>>>>> - Cached plans are dropped when redistribution is invoked >>>>>> - Vacuum is not necessary, this mechanism uses transaction-safe >>>>>> queries >>>>>> - for the time being, this implementation uses an exclusive lock, but >>>>>> once the redistribution is done, a ShareUpdateExclusive lock is not >>>>>> ruled out. >>>>>> - tables are reindexed if necessary. >>>>>> - redistribution cannot be done inside a transaction block >>>>>> - redistribution is not authorized with all the other commands as >>>>>> they are locally-safe on each node. >>>>>> - no restrictions on the distribution types, table types or >>>>>> subclusters >>>>>> >>>>>> This feature can be really improved, for example in the case of >>>>>> replicated tables in particular, when the list of nodes of the table is >>>>>> changed. >>>>>> It is one of the things I would like to improve as it would really >>>>>> increase performance. >>>>>> >>>>>> Regards, >>>>>> >>>>>> -- >>>>>> Michael Paquier >>>>>> http://michael.otacoo.com >>>>> >>>>> >>>>> >>>>> -- >>>>> Michael Paquier >>>>> http://michael.otacoo.com >>>>> >>>>> >>>>> ------------------------------------------------------------------------------ >>>>> Live Security Virtual Conference >>>>> Exclusive live event will cover all the ways today's security and >>>>> threat landscape has changed and how IT managers can respond. >>>>> Discussions >>>>> will include endpoint security, mobile security and the latest in >>>>> malware >>>>> threats. 
http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >>>>> _______________________________________________ >>>>> Postgres-xc-developers mailing list >>>>> Pos...@li... >>>>> https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers >>>>> >>>>> >>>> >>>> >>>> -- >>>> -- >>>> Abbas >>>> Architect >>>> EnterpriseDB Corporation >>>> The Enterprise PostgreSQL Company >>>> >>>> Phone: 92-334-5100153 >>>> >>>> >>>> Website: www.enterprisedb.com >>>> EnterpriseDB Blog: http://blogs.enterprisedb.com/ >>>> Follow us on Twitter: http://www.twitter.com/enterprisedb >>>> >>>> This e-mail message (and any attachment) is intended for the use of >>>> the individual or entity to whom it is addressed. This message >>>> contains information from EnterpriseDB Corporation that may be >>>> privileged, confidential, or exempt from disclosure under applicable >>>> law. If you are not the intended recipient or authorized to receive >>>> this for the intended recipient, any use, dissemination, distribution, >>>> retention, archiving, or copying of this communication is strictly >>>> prohibited. If you have received this e-mail in error, please notify >>>> the sender immediately by reply e-mail and delete this message. >>>> >>> >>> >>> >>> -- >>> Michael Paquier >>> http://michael.otacoo.com >>> >> >> >> >> -- >> Michael Paquier >> http://michael.otacoo.com >> > > > > -- > Michael Paquier > http://michael.otacoo.com > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. 
Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Postgres-xc-developers mailing list > Pos...@li... > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers > > |
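[Editor's note] For readers skimming the archive, the COPY TO / TRUNCATE / catalog update / COPY FROM flow settled on in the thread above can be modelled with a short Python sketch. The node names, row values, and the stand-in hash are all hypothetical; the point is only to show how every row travels through the Coordinator exactly once before being rerouted under the new distribution:

```python
# Toy model of redistribution method 2 (COPY-based); not XC internals.

def redistribute(datanodes, new_node_list):
    # a) COPY TO: pull every row from the Datanodes into a buffer held
    #    on the Coordinator (in XC, a file under $PGDATA).
    coordinator_buffer = [row for rows in datanodes.values() for row in rows]
    # b) TRUNCATE: wipe the table on every Datanode.
    for rows in datanodes.values():
        rows.clear()
    # c) Catalog update (pgxc_class) happens here in the real mechanism.
    # d) COPY FROM: reroute each buffered row by the new hash/modulo
    #    distribution over the new node list.
    new_nodes = {name: [] for name in new_node_list}
    for row in coordinator_buffer:
        new_nodes[new_node_list[hash(row) % len(new_node_list)]].append(row)
    return new_nodes

# A table hash-distributed on 2 nodes gets a third node added.
before = {"dn1": [0, 2, 4, 6], "dn2": [1, 3, 5, 7]}
after = redistribute(before, ["dn1", "dn2", "dn3"])
```

Each row crosses the network twice (Datanode to Coordinator, then Coordinator to Datanode), which is half of the four hops needed when an intermediate storage table sits in the cluster.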
From: Peng-Chong L. <li...@gm...> - 2012-06-26 06:37:34
|
First of all, thank you guys for the great xc project, which somehow represents the future of relational databases. However, I personally think xc at its current stage is more academic than practical. It is hard to configure and manage for a production system. It will be a complicated cluster system, and even more complicated with HA configuration, in the eyes of an application developer/architect instead of a database expert or DBA. I would like to suggest something, as I already talked a little with Mr. Suzuki. (1) Centralized Management If you look at pacemaker/corosync, you will find the crm interesting. Every configuration change made on any node will be broadcast to the cluster. For xc, the nodes in the cluster do not know anything about each other. Actually, GTM may know the whole thing, although I have to repeat each configuration on every gtm-proxy, coordinator, and datanode. Do you think it is possible to centralize configuration, monitoring, tuning, management, etc.? The first idea that comes to my mind is to improve the gtm module, which already communicates with every other node. (2) High Availability I know currently xc-HA depends on HA of each node, either gtm-standby or streaming replication of coordinator and data nodes. However, the solution is too complicated and freaks me out. I have tried to create an xc-HA cluster by myself and it is painful. Not only are we short of a heartbeat resource agent for streaming-replicated postgresql, but I also think the system is too fragile to be a production system without reliable fail-over and fail-back. Do you think we can make the xc-cluster itself fault-tolerant? For example, create another type of node - 'shadow/backup' nodes. Each would be identical to its corresponding coordinator/data node and SQL statement-level replicated. There would be no single point of failure in the cluster. (3) Disaster Backup All data in this share-nothing cluster has to be shipped to the backup facility. 
If we just use streaming replication of each node, the backup cluster will never function if replication of any node fails or the replication streams are not all synchronized. So far I have to use pgpool as the front-end of the two clusters. I think it would be better if there were a cluster replication solution rather than per-node replication. Does any DML go through gtm? If so, gtm would be the best module to replicate all data changes to the backup cluster. I am really looking forward to "dynamic node adding/removing" on the xc roadmap. It will make xc not a set of machines, but a real cluster. Furthermore, it is the starting point of all these features. Thanks in advance. Liu |
From: Peng-Chong L. <li...@gm...> - 2012-06-26 03:43:55
|
After cold reboots and re-creation of the xc cluster and database, I achieved 1184 tps from the 3-node xc cluster. It makes more sense. I will dig further. Thank you guys! 2012/6/26 Michael Paquier <mic...@gm...> > > > On Tue, Jun 26, 2012 at 11:52 AM, Peng-Chong LIU <li...@gm...> wrote: > >> Thanks for your instructions. It is exactly the same as Mr. Suzuki's. >> >> My xc cluster was configured on 3 PC servers/Gigabit Ethernet. I read >> some xc documentation, which suggests the number of coordinators should be >> 1/3 of that of datanodes for OLTP cases. However, I tried more >> possibilities. >> >> (1) 1 coordinator (server #3) and 1 datanode (server #3): 666 tps >> (2) 1 coordinator (server #3) and 2 datanodes (server #1 & #3): 542 tps >> (3) 1 coordinator (server #3) and 3 datanodes (server #1, #2, & #3): 554 >> tps >> (4) 3 coordinators (server #1, #2, & #3), and 3 datanodes (server #1, #2, >> & #3): 426 tps >> (5) Same as (4), but with "preferred" datanode settings: 20 tps!!! >> > 20 TPS is weird, it's too low. > > >> >> For (4) and (5), I cannot run pgxc_test_launcher.sh with multiple servers >> in the same appServerList.data, since the multiple dbdriver processes write >> the same log files. It seems the log files are corrupted in this situation. >> I have to run pgxc_test_launcher.sh with a single dbdriver process on >> different servers at the same time, then add the results together. >> >> Do other conditions (network speed among nodes, think_time, etc.) >> matter? Sometimes, when I raised the workload, for example, think_time = 0, >> the cluster crashed (either process gtm or postgres). >> > A crash should not happen on XC side, that's kind of surprising. > > >> >> Since both of you mentioned the preferred node thing, I will re-do test >> case (5) to make sure I did not do something wrong. >> >> >> 2012/6/26 Michael Paquier <mic...@gm...> >> >>> It looks like you have been able to set up a cluster, that is already a good >>> step. 
>>> >>> What is the cluster structure you are using with those 3 servers? >>> Is it 1 Coordinator and 1 Datanode per server? >>> >>> We are able to get performant results here by grouping Datanode and >>> Coordinator on the same server, and then using a feature called the preferred >>> node to maximize the reads of replicated tables to the local nodes, hence >>> heavily reducing the network traffic. >>> For example, assuming that in your case you have 3 Coordinators, 3 >>> Datanodes on those 3 servers, each server having 1 Coordinator and 1 >>> Datanode, you need to define the preferred Datanode of Coordinator 1 as >>> Datanode 1, preferred Datanode of Coordinator 2 as Datanode 2, same for >>> Coordinator 3/Datanode 3. >>> >>> You can define a preferred node by using CREATE NODE or ALTER NODE: >>> http://postgres-xc.sourceforge.net/docs/1_0/sql-createnode.html >>> http://postgres-xc.sourceforge.net/docs/1_0/sql-alternode.html >>> For example, to create a Datanode as a preferred node on a Coordinator, you >>> just need to do: >>> CREATE NODE certain_dn (PORT = $port, PREFERRED); >>> or ALTER NODE certain_dn (PREFERRED); >>> Once defined, all the reads of replicated tables will go to this node >>> (here certain_dn) in priority when an SQL reaches the Coordinator where the >>> preferred node is defined. >>> This really improves the performance of DBT-1. >>> >>> Just by reading your email, I can say that there is no problem with >>> the DBT-1 setting. >>> Give the preferred node feature a try :) >>> >>> >>> On Tue, Jun 26, 2012 at 10:50 AM, Peng-Chong LIU <li...@gm...> wrote: >>>> Hi there, >>>> >>>> I would like to reproduce the DBT-1 performance test on an xc cluster, so that >>>> I can understand its mechanism and limitations better. However, I cannot >>>> get the expected results. >>>> >>>> I used the benchmark utility from the xc git repository. Single-node xc cluster >>>> reached ca. 
70% tps of PostgreSQL, which is reasonable. However, >>>> performance of 2-node and 3-node clusters dropped to only ca. 60% of >>>> PostgreSQL. >>>> >>>> With the kind help of Mr. Suzuki of the xc project team, I adjusted some >>>> cluster configuration. However, there was little improvement in the >>>> benchmark results. >>>> >>>> Do you have an internal dbt-1 test procedure or any clue to this >>>> problem (xc optimization/dbt-1 test parameters)? >>>> >>>> Thanks and regards, >>>> Liu >>>> >>>> Test Results: >>>> Pure PostgreSQL: node1 846 tps, node2 837 tps, node3 921 tps >>>> Single node xc: node3 666 tps >>>> 2-node xc: 542 tps >>>> 3-node xc (1 coordinator): 554 tps >>>> 3-node xc (3 coordinators): 426 tps >>>> >>>> Test Procedure: >>>> >>>> # download source >>>> git clone git:// >>>> postgres-xc.git.sourceforge.net/gitroot/postgres-xc/dbt1 >>>> >>>> # build >>>> cd dbt1 >>>> make clean >>>> autoconf >>>> autoheader >>>> ./configure --with-postgresql=/opt/pgxc >>>> make >>>> make install >>>> >>>> # generate test data >>>> mkdir ~/test_data >>>> ./datagen/datagen -i 10000 -u 100 -p ~/test_data -T i >>>> ./datagen/datagen -i 10000 -u 100 -p ~/test_data -T c >>>> ./datagen/datagen -i 10000 -u 100 -p ~/test_data -T a >>>> >>>> # create database >>>> psql postgres -c "create database dbt1;" >>>> psql dbt1 -f "./scripts/pgsql/create_tables.sql" >>>> psql dbt1 -f "./scripts/pgsql/create_indexes.sql" >>>> psql dbt1 -f "./scripts/pgsql/create_sequence.sql" >>> >>>> # load test data >>>> psql dbt1 -c "COPY address FROM '/tmp/address.data' DELIMITER '>';" >>>> psql dbt1 -c "COPY author FROM '/tmp/author.data' DELIMITER '>';" >>>> psql dbt1 -c "COPY cc_xacts FROM '/tmp/cc_xacts.data' DELIMITER '>';" >>>> psql dbt1 -c "COPY country FROM '/tmp/country.data' DELIMITER '>';" >>>> psql dbt1 -c "COPY customer FROM '/tmp/customer.data' DELIMITER '>';" >>>> psql dbt1 -c "COPY item FROM '/tmp/item.data' DELIMITER '>';" >>>> psql dbt1 -c "COPY order_line FROM 
'/tmp/order_line.data' DELIMITER >>>> '>';" >>>> psql dbt1 -c "COPY orders FROM '/tmp/orders.data' DELIMITER '>';" >>>> psql dbt1 -c "COPY stock FROM '/tmp/stock.data' DELIMITER '>';" >>>> >>> >>>> # setup test program >>>> #cp appServerList.data.sample appServerList.data >>>> echo "127.0.0.1;5432;9992" > appServerList.data >>>> >>>> cp pgxc_stats_param.data.sample pgxc_stats_param.data >>>> sed -i "s/28800/288000/" pgxc_stats_param.data # customers >>>> sed -i "s/4000/300/" pgxc_stats_param.data # duration >>>> sed -i "s/7.2/0.1/" pgxc_stats_param.data # think time >>>> sed -i "s/500/100/" pgxc_stats_param.data # eu & eu/min >>>> >>>> # perform test >>>> export PGUSER=pgxc # export PGUSER=postgres >>>> export SID1=dbt1 >>>> chmod 755 pgxc_test_launcher.sh >>>> rm -f *.log >>>> ./pgxc_test_launcher.sh & >>>> >>>> # see results >>>> rm -f ~/BT ~/ips.csv >>>> ./tools/results --mixfile mix.log --outputdir ~/ >>>> cat ~/BT >>>> >>> >>> >>> -- >>> Michael Paquier >>> http://michael.otacoo.com >>> >> >> > > > -- > Michael Paquier > http://michael.otacoo.com > |