From: Amit K. <ami...@en...> - 2012-06-29 11:42:20
For utility statements in general, the coordinator propagates SQL statements to all the required nodes, and most of these statements run on the datanodes inside a transaction block. So when a statement fails on at least one of the nodes, it gets rolled back on all the nodes thanks to the two-phase commit taking place, and the cluster therefore returns to a consistent state. But there are some statements which cannot be run inside a transaction block. Here are some important ones:

- CREATE/DROP DATABASE
- CREATE/DROP TABLESPACE
- ALTER DATABASE SET TABLESPACE
- ALTER TYPE ADD ... (for enum types)
- CREATE INDEX CONCURRENTLY
- REINDEX DATABASE
- DISCARD ALL

Such statements run on the datanodes in auto-commit mode, and so create problems if they succeed on some nodes and abort on others. Take CREATE DATABASE, for example: if a datanode d1 returns an error after another datanode d2 has already reported success to the coordinator, the coordinator cannot undo d2's work, because it is already committed. Or if the coordinator itself crashes after the datanodes commit but before the coordinator commits, we again have the same problem: the database cannot be recreated from the coordinator, since it has already been created on some of the other nodes. In such a cluster state, the administrator needs to connect to the datanodes and do the needed cleanup. The committed statements can be followed by statements that undo the operation, e.g. a DROP DATABASE for a CREATE DATABASE. But this undo statement can itself fail for some reason, and typically the undo counterparts of such statements cannot be run inside a transaction block either. So this is not a guaranteed way to bring the cluster back to a consistent state. To find out how we can get around this issue, let's see why these statements are required to run outside a transaction block in the first place. There are two reasons:

1.
Typically such statements modify OS files and directories, which cannot be rolled back. For DMLs, a rollback does not have to be explicitly undone; MVCC takes care of it. But there is no automatic undo for OS file operations, so such operations cannot be rolled back. Hence, if a CREATE DATABASE in a transaction block were followed by 10 other SQL statements before the commit, and one of those statements threw an error, the database would ultimately not be created, but its files would be left taking up disk space, just because the user wrote the script wrongly. By restricting such a statement to run outside a transaction block, an unrelated error cannot cause garbage files to be created. The statement itself still gets committed eventually as usual, and it can also get rolled back in the end. But maximum care has been taken in the statement's function (e.g. createdb) so that the chance of an error occurring *after* the files are created is minimal. For this, that code segment runs inside PG_ENSURE_ERROR_CLEANUP() with an error callback (createdb_failure_callback) which tries to clean up the files created. So the end result is that the window between files-created and error-occurred is minimized; it is not that such statements can never create cleanup issues when run outside a transaction block.

Possible solution: Regarding Postgres-XC, if we let such statements run inside a transaction block, but only on the remote nodes, what are the consequences? This of course prevents the issue of the statement being committed on one node and not on another, and the end user is still prevented from running the statement inside a transaction. Moreover, for such a statement, say CREATE DATABASE, the database will be created on all nodes or on none, even if one of the nodes returns an error. The only issue is that if the CREATE DATABASE is aborted, it will leave disk space wasted on the nodes where it succeeded.
But that would be caused by configuration issues like lack of disk space, the network being down, etc. The issue of other, unrelated operations in the same transaction causing a rollback of the CREATE DATABASE does not arise anyway, because we still don't allow it in a transaction block for the end user. So the end result is that we have solved the inconsistent-cluster issue, leaving some chance of a disk cleanup issue, although not one caused by aborted user queries. So maybe, when such statements error out, we display a notice that files need to be cleaned up. We can go further to reduce this window by splitting the create-database operation: begin a transaction block, then let the datanodes perform the non-file operations first (inserting the pg_database row, etc.) through a new function call, without committing yet. Then fire the last part, the file system operations, through another function call, and finally commit. The file operations would run under PG_ENSURE_ERROR_CLEANUP(). By synchronizing these individual tasks, we reduce the window further.

2. Some statements do internal commits. For example, movedb() calls TransactionCommit() after copying the files and then removes the original files, so that if it crashes while removing the files, the database with the new tablespace is already committed and intact, and we just leave some old files behind. Statements doing internal commits cannot be rolled back when run inside a transaction block, because they have already committed something. For such statements, the above solution does not work; we need to find a separate way. A few of these statements are:

- ALTER DATABASE SET TABLESPACE
- CLUSTER
- CREATE INDEX CONCURRENTLY

One similar solution is to split the individual tasks that get internally committed into different functions, one per task, and run the individual functions on all the nodes synchronously, so that the second task does not start until the first one is committed on all the nodes. Whether it is feasible to split the tasks is a question, and it depends on the particular command. As of now, I am not sure whether we can make some common changes in the way transactions are implemented to find a generic solution that does not require changes to individual commands. But I will investigate more. Comments and suggestions welcome!

Thanks
-Amit
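The failure mode described in this mail can be illustrated with a small, purely hypothetical simulation (none of these names exist in Postgres-XC; the sketch only contrasts auto-commit propagation with a prepare/commit scheme):

```python
# Hypothetical sketch: why propagating a utility statement in auto-commit
# mode can leave the cluster inconsistent, while a two-phase scheme cannot.

class Node:
    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail            # simulate a node-local error
        self.databases = set()
        self.pending = None

    def prepare(self, db):          # 2PC phase 1: do the work, hold the commit
        if self.fail:
            raise RuntimeError(self.name + ": CREATE DATABASE failed")
        self.pending = db

    def commit(self):               # 2PC phase 2
        if self.pending:
            self.databases.add(self.pending)
            self.pending = None

    def rollback(self):
        self.pending = None

    def autocommit(self, db):       # auto-commit: work and commit in one step
        if self.fail:
            raise RuntimeError(self.name + ": CREATE DATABASE failed")
        self.databases.add(db)

def run_autocommit(nodes, db):
    for n in nodes:
        try:
            n.autocommit(db)
        except RuntimeError:
            return                  # too late: earlier nodes already committed

def run_2pc(nodes, db):
    try:
        for n in nodes:
            n.prepare(db)
    except RuntimeError:
        for n in nodes:
            n.rollback()            # nothing was committed yet
        return
    for n in nodes:
        n.commit()

d1, d2 = Node("d1"), Node("d2", fail=True)
run_autocommit([d1, d2], "mydb")
print(d1.databases, d2.databases)   # {'mydb'} set() -> inconsistent cluster

e1, e2 = Node("e1"), Node("e2", fail=True)
run_2pc([e1, e2], "mydb")
print(e1.databases, e2.databases)   # set() set() -> consistent rollback
```

In the auto-commit run, d1 ends up with the database while d2 does not, which is exactly the state the administrator would have to clean up by hand.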
From: Michael P. <mic...@gm...> - 2012-06-29 05:57:28
On Fri, Jun 29, 2012 at 2:29 PM, Amit Khandekar < ami...@en...> wrote: > > > On 28 June 2012 10:56, Michael Paquier <mic...@gm...> wrote: > >> >>> The COPY TO results from the datanode are already in the required >>> format for COPY FROM, so the data is ready to be sent back to datanode >>> as-is. So if possible, we should avoid any input-output conversion when >>> storing in tuplestore. >>> >> Do you mean that we can store the results from COPY TO as-is to >> tuplestore, meaning that we can use tuplestore as-is? >> Or do you mean that we shouldn't use tuplestore? >> > I guess tuplestore will contain just one column of type text, which > represents the complete record. So anyway the conversion won't be there. > And that data is ready to be pushed in for COPY FROM. So again there won't > be any conversion. > I had a look at the code and was thinking also about using the tuple descriptor of tuple store as having a single char column. Btw, i recall that tuplestore writes to disk the data it cannot put into shared memory. > >> >> Also, please check if we can avoid storing the complete data in >>> tuplestore, instead we should transfer data from COPY TO to COPY FROM in >>> chunks. >>> >> Would be nice indeed. >> >>> Also I am not sure if we can truncate immediately after COPY TO is >>> fired. Will that affect the data that is being fetched from COPY? >>> >> Yes it will, we need to finish COPY TO to process to launch the TRUNCATE >> on Datanodes, or we won't be able to fetch all the data. >> Hence it looks necessary to store at some point all the data in >> tuplestore of Coordinator, and using chunks is complicated with this way of >> doing. > > > Hmm. I am overall a bit concerned with the huge amount of data being > handled, and the fact that we are going to transfer the *complete* data > over the network, and also now to be stored in memory in full. The data > would have been drastically reduced if were to fetch and send only the > required records. 
I am not sure it will change that much. The worst scenario is when the distribution is changed from distributed to replicated among all of the nodes. Directly comparing the two methods: your method moves 1 - 1/N of the data through the network, while mine moves all of it. If the cluster uses 20 nodes, the gain is only 5%. Considering that RETURNING is not yet supported and will only be available in a couple of weeks, I prefer to move on and implement this method, which is more general and easily supports all distribution types and all subsets of nodes. This will also let us move faster on the 9.2 merge.

--
Michael Paquier
http://michael.otacoo.com
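The traffic comparison in this mail is easy to check numerically (a throwaway calculation, not project code):

```python
# Normalized table size = 1. The selective method only moves the rows that
# change node, i.e. 1 - 1/N of the data; the COPY TO / COPY FROM method
# routes everything through the coordinator, i.e. 1.
def selective_traffic(n_nodes):
    return 1 - 1 / n_nodes

def full_copy_traffic(n_nodes):
    return 1.0

gain = full_copy_traffic(20) - selective_traffic(20)
print(round(gain, 2))   # with 20 nodes the selective method saves only 5%
```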
From: Amit K. <ami...@en...> - 2012-06-29 05:30:09
On 28 June 2012 10:56, Michael Paquier <mic...@gm...> wrote: > >> The COPY TO results from the datanode are already in the required format >> for COPY FROM, so the data is ready to be sent back to datanode as-is. So >> if possible, we should avoid any input-output conversion when storing in >> tuplestore. >> > Do you mean that we can store the results from COPY TO as-is to > tuplestore, meaning that we can use tuplestore as-is? > Or do you mean that we shouldn't use tuplestore? > I guess tuplestore will contain just one column of type text, which represents the complete record. So anyway the conversion won't be there. And that data is ready to be pushed in for COPY FROM. So again there won't be any conversion. > > Also, please check if we can avoid storing the complete data in >> tuplestore, instead we should transfer data from COPY TO to COPY FROM in >> chunks. >> > Would be nice indeed. > >> Also I am not sure if we can truncate immediately after COPY TO is fired. >> Will that affect the data that is being fetched from COPY? >> > Yes it will, we need to finish COPY TO to process to launch the TRUNCATE > on Datanodes, or we won't be able to fetch all the data. > Hence it looks necessary to store at some point all the data in tuplestore > of Coordinator, and using chunks is complicated with this way of doing. Hmm. I am overall a bit concerned with the huge amount of data being handled, and the fact that we are going to transfer the *complete* data over the network, and also now to be stored in memory in full. The data would have been drastically reduced if were to fetch and send only the required records. > > -- > Michael Paquier > http://michael.otacoo.com > |
From: Michael P. <mic...@gm...> - 2012-06-29 03:04:26
I just had a look at this patch. Thanks for the clean-up, and I now see how you are using the merge I did yesterday. One comment though: if the plan is to remove the functions that are not needed, why surround them with flags like TODO_NOT_USED or TODO_NOT_NEEDED? You should simply delete them directly; their history will be kept in git, so it is no big deal to take them out. Btw, the renaming looks good and regressions are not showing any problems. Once the clean-up is completed properly and you feel satisfied with it, it looks OK for commit.

On Thu, Jun 28, 2012 at 9:04 PM, Ashutosh Bapat <ash...@en...> wrote:
> Hi All,
> The functions is_foreign_expr() and pgxc_is_query_shippable() were using
> different walkers to check whether an expression or a query, respectively,
> is shippable. The logic deciding whether an expression or a query is
> shippable needs to be the same, except for a few differences arising from
> the information available in each. If we want to code a new shippability
> rule, we need to add it to both of these functions, which becomes a
> maintenance burden. The patch unifies the walkers into one and makes the
> appropriate renaming. Regressions do not show any extra failures.
>
> --
> Best Wishes,
> Ashutosh Bapat
> EntepriseDB Corporation
> The Enterprise Postgres Company
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Postgres-xc-developers mailing list
> Pos...@li...
> https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers

--
Michael Paquier
http://michael.otacoo.com
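The kind of unification described in this thread can be sketched generically (a Python illustration only, not the actual C walkers; the node kinds and rules below are invented for the example): a single recursive walker serves both entry points when the caller-specific differences are carried in a context object.

```python
# Sketch: one shippability walker shared by the expression-level and the
# query-level entry points. Node kinds and rules are illustrative only.
class ShippabilityContext:
    def __init__(self, for_query):
        self.for_query = for_query   # the few query-only rules key off this
        self.shippable = True

UNSHIPPABLE_NODES = {"VolatileFunc", "ParamRef"}   # hypothetical rule set
QUERY_ONLY_UNSHIPPABLE = {"SortBy"}                # hypothetical query-only rule

def shippability_walker(node, ctx):
    kind, children = node            # a node is ("Kind", [children])
    if kind in UNSHIPPABLE_NODES:
        ctx.shippable = False
    if ctx.for_query and kind in QUERY_ONLY_UNSHIPPABLE:
        ctx.shippable = False
    for child in children:
        shippability_walker(child, ctx)

def is_foreign_expr(expr):
    ctx = ShippabilityContext(for_query=False)
    shippability_walker(expr, ctx)
    return ctx.shippable

def pgxc_is_query_shippable(query):
    ctx = ShippabilityContext(for_query=True)
    shippability_walker(query, ctx)
    return ctx.shippable

print(is_foreign_expr(("OpExpr", [("Var", []), ("Const", [])])))   # True
print(pgxc_is_query_shippable(("Query", [("SortBy", [])])))        # False
```

A new shippability rule then only needs to be added in one place, which is the maintenance win the patch is after.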
From: Ashutosh B. <ash...@en...> - 2012-06-28 06:35:15
Thanks Michael. On Thu, Jun 28, 2012 at 12:03 PM, Michael Paquier <mic...@gm... > wrote: > Just to justify a little bit more about this unreadable email you > received, you can refer here: > > https://github.com/postgres-xc/postgres-xc/commit/a871778ca44886721b05a64982b5e5d81c7590a3 > > While working on the planner improvements for remote query path > determination, Ashutosh has noticed that he needed a functionality which > was already implemented in Postgres master (pull_var_clause filtering > aggregate Var). This was just a little bit ahead of XC master. So the > decision has been taken to merge XC code up to commit c1d9579 which is in > the middle of Postgres 9.2 dev. > The code of XC will be merged up to the intersection of Postgres master > and 9.2 stable branch in a couple of weeks (for easy backport with 9.2 > stable branch or postgres master), and we are not planning to release any > stable releases until this moment, so this merge has been made to > facilitate the development of the new XC features. > > Thanks, > > > On Thu, Jun 28, 2012 at 3:22 PM, Michael Paquier < > mic...@us...> wrote: > >> Project "Postgres-XC". 
>> >> The branch, master has been updated >> via 2a32c0ae0e2d01f3cc82384b24f610bd11a23755 (commit) >> via 67ab404afa3ac68f58f586ce889f116b8ff65e3b (commit) >> via 6ba0c48349fd21904822b43a2ea3241a6d0968a9 (commit) >> via a871778ca44886721b05a64982b5e5d81c7590a3 (commit) >> via c1d9579dd8bf3c921ca6bc2b62c40da6d25372e5 (commit) >> via 846af54dd5a77dc02feeb5e34283608012cfb217 (commit) >> via fd6913a18955b0f89ca994b5036c103bcea23f28 (commit) >> via 912bc4f038b3daaea4477c4b4e79fbd8c15e67a0 (commit) >> via afc9635c600ace716294a12d78abd37f65abd0ea (commit) >> via 3315020a091f64c8d08c3b32a2abd46431dcf857 (commit) >> via 75726307e6164673c48d6ce1d143a075b8ce18fa (commit) >> via 4240e429d0c2d889d0cda23c618f94e12c13ade7 (commit) >> via 9d522cb35d8b4f266abadd0d019f68eb8802ae05 (commit) >> via 89fd72cbf26f5d2e3d86ab19c1ead73ab8fac0fe (commit) >> via 9598afa3b0f7a7fdcf3740173346950b2bd5942c (commit) >> > > > -- > Michael Paquier > http://michael.otacoo.com > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Postgres-xc-developers mailing list > Pos...@li... > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers > > -- Best Wishes, Ashutosh Bapat EntepriseDB Corporation The Enterprise Postgres Company |
From: Michael P. <mic...@gm...> - 2012-06-28 06:33:44
To give a little more context on the unreadable email you received, you can refer here: https://github.com/postgres-xc/postgres-xc/commit/a871778ca44886721b05a64982b5e5d81c7590a3

While working on the planner improvements for remote query path determination, Ashutosh noticed that he needed functionality already implemented in Postgres master (pull_var_clause filtering aggregate Vars), just a little ahead of XC master. So the decision was taken to merge the XC code up to commit c1d9579, which is in the middle of Postgres 9.2 development. The XC code will be merged up to the intersection of Postgres master and the 9.2 stable branch in a couple of weeks (for easy backporting from the 9.2 stable branch or Postgres master), and we are not planning any stable releases before then, so this merge was made to facilitate the development of the new XC features.

Thanks,

On Thu, Jun 28, 2012 at 3:22 PM, Michael Paquier <mic...@us...> wrote:
> Project "Postgres-XC".
> > The branch, master has been updated > via 2a32c0ae0e2d01f3cc82384b24f610bd11a23755 (commit) > via 67ab404afa3ac68f58f586ce889f116b8ff65e3b (commit) > via 6ba0c48349fd21904822b43a2ea3241a6d0968a9 (commit) > via a871778ca44886721b05a64982b5e5d81c7590a3 (commit) > via c1d9579dd8bf3c921ca6bc2b62c40da6d25372e5 (commit) > via 846af54dd5a77dc02feeb5e34283608012cfb217 (commit) > via fd6913a18955b0f89ca994b5036c103bcea23f28 (commit) > via 912bc4f038b3daaea4477c4b4e79fbd8c15e67a0 (commit) > via afc9635c600ace716294a12d78abd37f65abd0ea (commit) > via 3315020a091f64c8d08c3b32a2abd46431dcf857 (commit) > via 75726307e6164673c48d6ce1d143a075b8ce18fa (commit) > via 4240e429d0c2d889d0cda23c618f94e12c13ade7 (commit) > via 9d522cb35d8b4f266abadd0d019f68eb8802ae05 (commit) > via 89fd72cbf26f5d2e3d86ab19c1ead73ab8fac0fe (commit) > via 9598afa3b0f7a7fdcf3740173346950b2bd5942c (commit) > -- Michael Paquier http://michael.otacoo.com |
From: Michael P. <mic...@gm...> - 2012-06-28 06:16:46
On Thu, Jun 28, 2012 at 3:03 PM, Ashutosh Bapat < ash...@en...> wrote: > Hi Michael, > You need to take care of visibility of rows when you use COPY mechanism. > > While you do all the stuff to copy from old relation and to new relation, > you need to simulate the behaviour in AtRewriteTable(). In this function, > we take the latest snapshot and then copy the rows over to new relation > storage. You will need to simulate same behaviour here. > The first version of the patch already does that. When running COPY FROM, TRUNCATE and COPY TO a new snapshot is automatically popped at each step. You can refer to distrib.c on the latest version of the patch. > > On Thu, Jun 28, 2012 at 10:56 AM, Michael Paquier < > mic...@gm...> wrote: > >> >>> The COPY TO results from the datanode are already in the required >>> format for COPY FROM, so the data is ready to be sent back to datanode >>> as-is. So if possible, we should avoid any input-output conversion when >>> storing in tuplestore. >>> >> Do you mean that we can store the results from COPY TO as-is to >> tuplestore, meaning that we can use tuplestore as-is? >> Or do you mean that we shouldn't use tuplestore? >> >> Also, please check if we can avoid storing the complete data in >>> tuplestore, instead we should transfer data from COPY TO to COPY FROM in >>> chunks. >>> >> Would be nice indeed. >> >>> Also I am not sure if we can truncate immediately after COPY TO is >>> fired. Will that affect the data that is being fetched from COPY? >>> >> Yes it will, we need to finish COPY TO to process to launch the TRUNCATE >> on Datanodes, or we won't be able to fetch all the data. >> Hence it looks necessary to store at some point all the data in >> tuplestore of Coordinator, and using chunks is complicated with this way of >> doing. 
>> >> -- >> Michael Paquier >> http://michael.otacoo.com >> >> >> ------------------------------------------------------------------------------ >> Live Security Virtual Conference >> Exclusive live event will cover all the ways today's security and >> threat landscape has changed and how IT managers can respond. Discussions >> will include endpoint security, mobile security and the latest in malware >> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >> _______________________________________________ >> Postgres-xc-developers mailing list >> Pos...@li... >> https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers >> >> > > > -- > Best Wishes, > Ashutosh Bapat > EntepriseDB Corporation > The Enterprise Postgres Company > > -- Michael Paquier http://michael.otacoo.com |
From: Ashutosh B. <ash...@en...> - 2012-06-28 06:03:43
Hi Michael, You need to take care of visibility of rows when you use COPY mechanism. While you do all the stuff to copy from old relation and to new relation, you need to simulate the behaviour in AtRewriteTable(). In this function, we take the latest snapshot and then copy the rows over to new relation storage. You will need to simulate same behaviour here. On Thu, Jun 28, 2012 at 10:56 AM, Michael Paquier <mic...@gm... > wrote: > >> The COPY TO results from the datanode are already in the required format >> for COPY FROM, so the data is ready to be sent back to datanode as-is. So >> if possible, we should avoid any input-output conversion when storing in >> tuplestore. >> > Do you mean that we can store the results from COPY TO as-is to > tuplestore, meaning that we can use tuplestore as-is? > Or do you mean that we shouldn't use tuplestore? > > Also, please check if we can avoid storing the complete data in >> tuplestore, instead we should transfer data from COPY TO to COPY FROM in >> chunks. >> > Would be nice indeed. > >> Also I am not sure if we can truncate immediately after COPY TO is fired. >> Will that affect the data that is being fetched from COPY? >> > Yes it will, we need to finish COPY TO to process to launch the TRUNCATE > on Datanodes, or we won't be able to fetch all the data. > Hence it looks necessary to store at some point all the data in tuplestore > of Coordinator, and using chunks is complicated with this way of doing. > > -- > Michael Paquier > http://michael.otacoo.com > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. 
http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Postgres-xc-developers mailing list > Pos...@li... > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers > > -- Best Wishes, Ashutosh Bapat EntepriseDB Corporation The Enterprise Postgres Company |
From: Michael P. <mic...@gm...> - 2012-06-28 05:26:37
> > > The COPY TO results from the datanode are already in the required format > for COPY FROM, so the data is ready to be sent back to datanode as-is. So > if possible, we should avoid any input-output conversion when storing in > tuplestore. > Do you mean that we can store the results from COPY TO as-is to tuplestore, meaning that we can use tuplestore as-is? Or do you mean that we shouldn't use tuplestore? Also, please check if we can avoid storing the complete data in tuplestore, > instead we should transfer data from COPY TO to COPY FROM in chunks. > Would be nice indeed. > Also I am not sure if we can truncate immediately after COPY TO is fired. > Will that affect the data that is being fetched from COPY? > Yes it will, we need to finish COPY TO to process to launch the TRUNCATE on Datanodes, or we won't be able to fetch all the data. Hence it looks necessary to store at some point all the data in tuplestore of Coordinator, and using chunks is complicated with this way of doing. -- Michael Paquier http://michael.otacoo.com |
From: Amit K. <ami...@en...> - 2012-06-28 05:21:09
On 28 June 2012 09:05, Michael Paquier <mic...@gm...> wrote: > > >> Something like this: >> 1. Launch copy to stdout >> 2. Launch truncate >> 3. Get the results of step 1 and use copy Apis to redirect rows to >> correct nodes. >> Now the copy to data is put into a file, a tuple store if you want. >> > Let me bring more details here. I had a closer look at postgres > functionalities and there are several possibilities to send a copy output, > the one I would like to use instead of the file currently being used is > DestTuplestore. By using that, it would be possible to store all the tuples > being redistributed without having to use an intermediate file and postgres > would do all the storage work. > So, assuming that TupleStore is used, here are how the redistribution > steps would work by default: > 1. launch copy to and output result to tuplestore > 2. launch truncate > 3. update catalogs > 4. use tuplestore data and relaunch a copy from with execRemote.c APIs. The COPY TO results from the datanode are already in the required format for COPY FROM, so the data is ready to be sent back to datanode as-is. So if possible, we should avoid any input-output conversion when storing in tuplestore. Also, please check if we can avoid storing the complete data in tuplestore, instead we should transfer data from COPY TO to COPY FROM in chunks. Also I am not sure if we can truncate immediately after COPY TO is fired. Will that affect the data that is being fetched from COPY? > > -- > Michael Paquier > http://michael.otacoo.com > |
From: Michael P. <mic...@gm...> - 2012-06-28 03:35:19
> > Something like this: > 1. Launch copy to stdout > 2. Launch truncate > 3. Get the results of step 1 and use copy Apis to redirect rows to correct > nodes. > Now the copy to data is put into a file, a tuple store if you want. > Let me bring more details here. I had a closer look at postgres functionalities and there are several possibilities to send a copy output, the one I would like to use instead of the file currently being used is DestTuplestore. By using that, it would be possible to store all the tuples being redistributed without having to use an intermediate file and postgres would do all the storage work. So, assuming that TupleStore is used, here are how the redistribution steps would work by default: 1. launch copy to and output result to tuplestore 2. launch truncate 3. update catalogs 4. use tuplestore data and relaunch a copy from with execRemote.c APIs. -- Michael Paquier http://michael.otacoo.com |
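The four redistribution steps laid out in this mail can be sketched minimally (plain Python, with a list standing in for the coordinator's tuplestore; the hash function and node layout are placeholders, not the backend's distribution functions):

```python
# Non-authoritative sketch of the redistribution sequence:
#   1. COPY TO  -> tuplestore on the coordinator
#   2. TRUNCATE on the datanodes
#   3. catalog update (distribution type, node list)
#   4. COPY FROM, rerouting each row by the new distribution
def redistribute(old_nodes, new_nodes, hash_row):
    tuplestore = []
    for node in old_nodes:                 # 1. COPY TO -> tuplestore
        tuplestore.extend(node["rows"])
    for node in old_nodes:                 # 2. TRUNCATE
        node["rows"] = []
    # 3. the catalog update would happen here
    for row in tuplestore:                 # 4. COPY FROM, rerouted by hash
        target = new_nodes[hash_row(row) % len(new_nodes)]
        target["rows"].append(row)

old = [{"rows": [1, 2, 3, 4]}, {"rows": [5, 6]}]
new = [{"rows": []}, {"rows": []}, {"rows": []}]
redistribute(old, new, hash_row=lambda r: r)
print([n["rows"] for n in new])   # [[3, 6], [1, 4], [2, 5]]
```

Note that the TRUNCATE can only run once the COPY TO has fully drained into the tuplestore, which matches the point made in the thread about why chunked transfer is complicated here.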
From: Michael P. <mic...@gm...> - 2012-06-27 13:10:22
On 2012/06/27, at 21:19, Amit Khandekar <ami...@en...> wrote: > > > On 27 June 2012 17:36, Michael Paquier <mic...@gm...> wrote: > > > On Wed, Jun 27, 2012 at 4:18 PM, Amit Khandekar <ami...@en...> wrote: > > > On 26 June 2012 19:15, Michael Paquier <mic...@gm...> wrote: > > On 2012/06/26, at 21:20, Amit Khandekar <ami...@en...> wrote: > >> >> >> On 20 June 2012 11:16, Michael Paquier <mic...@gm...> wrote: >> Please find attached a far more performant patch. >> This new mechanism uses a redistribution based on method 2: >> - COPY TO >> - TRUNCATE >> - catalog update >> - COPY FROM >> >> The advantages of this method are: >> - network traffic is halved as all the data is located temporarily on Coordinator >> - TRUNCATE speeds up data deletion >> - COPY improves the performance of data transfer by ~30 times >> - Time necessary to redistribute is the time necessary to process COPY FROM + COPY TO >> - Critical diminution of XLOG, there are not anymore a multitude of xlogs generated at Datanode level when distribution is changed to hash/modulo. >> >> I did a couple of tests up with tables of hundreds of MB of data. >> For a table of 10Mrows (300MB with pg_relation_size), redistribution took 30s from a replication to hash. >> With the previous patch, redistribution took 30mins for a 1Mrow table when table was altered from replication to hash distribution. >> I am also conducting tests with tables of GB size. However this mechanism is really fast. >> >> >> Hi Michael, >> >> While looking at the other remote-copy-API related patch that you sent in different thread, I ultimately ended up with this patch in order to know the basic ALTER TABLE implementation. Sorry to comment on this patch so late. >> >> I was checking if it is even possible to pull only the required rows from tables for COPY. 
If somehow we could build an expression based on the hash function, we might be able to use it as a qualifier in the select query to the datanode, so that only those rows will be retrieved that need to be deleted from the table: > This assumes that datanodes have knowledge of the other datanodes, and this is not the case, so this idea is difficult to put into practice. > > Write a function get_node_index(distcol_val, nodeList) which returns a node index given a distcol value and node index list. Mark this as immutable so that it can be shipped to the datanode. > Then use this query to execute from the coordinator: > delete from tab where get_node_index(distcol, nodeList) = $1 returning * > Execute it from the coordinator, targeting datanodes one by one, replacing $1 with the particular datanode index. Since exec-direct does not work for DELETE, we need to somehow set the exec_nodes list to include only the specific datanode. > I am honestly a little bit skeptical about this approach... It depends on too many things to my mind, and DELETE can be really costly on a large table. RETURNING is not supported, and this makes the table redistribution depend on the complete node list. > > RETURNING is not going to be supported very soon, so we can defer this for now. Using COPY and using RETURNING are two independent things we can do, so we can continue using COPY w/o RETURNING. Right now COPY will get all rows of the table, right? > > > My idea is to use, as you said, a pipe between COPY FROM/COPY TO and to run a TRUNCATE in the middle, which is way faster on large tables. This would use the APIs I extracted with my first patch and the other functions in execRemote.c. > > You are not using an actual system pipe, are you? Just making sure. I meant using the results fetched from COPY TO STDOUT directly as input to COPY FROM STDIN. That's what I would like to do. > > Please also mention the steps of the COPY and TRUNCATE part. Would you be able to use TRUNCATE while piping COPY is being done?
Will it not error out? Let me know the steps. Something like this: 1. Launch COPY TO STDOUT 2. Launch TRUNCATE 3. Get the results of step 1 and use the COPY APIs to redirect rows to the correct nodes. Now the COPY TO data is put into a file, a tuple store if you want. > > If you have a look at how those functions are processed you would notice that COPY TO redirects all the data to a FILE *, and COPY FROM uses tuple-based strings and sends them to Datanodes with a dedicated set of connections. > Regarding performance and portability, I think this looks like the better approach. Also, the COPY APIs in execRemote.c are pretty robust. > > Once this is done, I think it would be OK to commit. Then, there are also some improvements that can be done with replicated tables. For example, when reducing or increasing the set of nodes of a replicated table, you do not need the TRUNCATE phase. > -- > Michael Paquier > http://michael.otacoo.com > |
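The three steps discussed in the message above can be sketched as follows. This is an illustrative model only, not XC's actual execRemote.c code: datanodes are modelled as plain Python lists, the coordinator-side tuple store is an ordinary list, and modulo distribution on the new node set is assumed.

```python
# Sketch of the COPY TO -> TRUNCATE -> catalog update -> COPY FROM
# redistribution flow (assumes the source table is distributed, not
# replicated, so each row exists on exactly one datanode).

def redistribute(datanodes, dist_col, new_num_nodes):
    # 1. COPY TO: pull every row from every datanode into a
    #    coordinator-side tuple store (the FILE*/row-string buffer in XC).
    tuple_store = [row for node in datanodes for row in node]

    # 2. TRUNCATE: wipe the table on all datanodes (fast, minimal xlog
    #    compared to per-row DELETE or INSERT).
    for node in datanodes:
        node.clear()

    # 3. The catalog update (pgxc_class) would happen here, then
    # 4. COPY FROM: re-route each buffered row to its new target node,
    #    here by modulo on the distribution column.
    new_nodes = [[] for _ in range(new_num_nodes)]
    for row in tuple_store:
        new_nodes[row[dist_col] % new_num_nodes].append(row)
    return new_nodes
```

Because the data makes only one round trip through the coordinator, network traffic is halved compared to the earlier storage-table approach described further down the thread.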
From: Amit K. <ami...@en...> - 2012-06-27 12:20:05
|
On 27 June 2012 17:36, Michael Paquier <mic...@gm...> wrote: > > > On Wed, Jun 27, 2012 at 4:18 PM, Amit Khandekar < > ami...@en...> wrote: > >> >> >> On 26 June 2012 19:15, Michael Paquier <mic...@gm...> wrote: >> >>> >>> On 2012/06/26, at 21:20, Amit Khandekar <ami...@en...> >>> wrote: >>> >>> >>> >>> On 20 June 2012 11:16, Michael Paquier <mic...@gm...>wrote: >>> >>>> Please find attached a far more performant patch. >>>> This new mechanism uses a redistribution based on method 2: >>>> - COPY TO >>>> - TRUNCATE >>>> - catalog update >>>> - COPY FROM >>>> >>>> The advantages of this method are: >>>> - network traffic is halved as all the data is located temporarily on >>>> Coordinator >>>> - TRUNCATE speeds up data deletion >>>> - COPY improves the performance of data transfer by ~30 times >>>> - Time necessary to redistribute is the time necessary to process COPY >>>> FROM + COPY TO >>>> - Critical diminution of XLOG, there are not anymore a multitude of >>>> xlogs generated at Datanode level when distribution is changed to >>>> hash/modulo. >>>> >>>> I did a couple of tests up with tables of hundreds of MB of data. >>>> For a table of 10Mrows (300MB with pg_relation_size), redistribution >>>> took 30s from a replication to hash. >>>> With the previous patch, redistribution took 30mins for a 1Mrow table >>>> when table was altered from replication to hash distribution. >>>> I am also conducting tests with tables of GB size. However this >>>> mechanism is really fast. >>>> >>>> >>> Hi Michael, >>> >>> While looking at the other remote-copy-API related patch that you sent >>> in different thread, I ultimately ended up with this patch in order to know >>> the basic ALTER TABLE implementation. Sorry to comment on this patch so >>> late. >>> >>> I was checking if it is even possible to pull only the required rows >>> from tables for COPY. 
If somehow we could build an expression based on the >>> hash function we might be able to use it as a qualifier in the select query >>> to the datanode so that only those rows will be retrieved that need to be >>> deleted from the table: >>> >>> This assumes that datanodes have the knowledge of the other datanodes, >>> and this is not the case, so this idea is difficult to put in practice. >>> >> >> Write a function get_node_index(discol_val, nodeList) which returns node >> index given a distcol value and node index list. Mark this as immutable so >> that it can be shipped on datanode. >> Then use this query to execute from coordinator: >> delete from tab where get_node_index(distcol, nodeList) = $1 returning * >> Execute it from coordinator, targeting data nodes one by one, replacing >> $1 with the particular datanode index. Since exec-direct does not work for >> DELETE, we need to somehow set the exec_nodes list to include only the >> specific datanode. >> > I am honestly a little bit sceptic about this approach... It depends on > too many things to my mind, and DELETE can be really costly on large table. > Returning is not supported and this makes the table redistribution > depending on the complete node list. Returning is not supported very soon, so we can defer this for now. Using COPY and using RETURNING are two independent things we can do, so we can continue using COPY w/o returning. Right now COPY will get all rows of the table, right ? > My idea is to use, as you said, a pipe between COPY FROM/COPY TO and to > run in the middle a TRUNCATE which is way faster on large tables. This > would use the APIs I extracted with my first patch and the other functions > in execRemote.c. > You are not using an actual system pipe, are you? Just making sure. I meant using the results fetched from COPY TO STDOUT directy as input to COPY FROM STDIN> Please also mention the steps of the COPY and TRUNCATE part. 
Would you be able to use TRUNCATE while piping COPY is being done? Will it not error out? Let me know the steps. If you have a look at how those functions are processed you would notice > that COPY TO redirects all the data to a FILE *, and COPY FROM uses > tuple-based strings and sends them to Datanodes with a dedicated set of > connections. > Regarding performance and portability, I think this looks as the better > approach. Also the COPY APIs in execRemote.c are pretty robust. > > Once this is done, I think it would be OK to commit. Then, there are also > some improvements that can be done with replicated tables. For example, > when reducing or increasing the set of nodes of a replicated table, you do > not need the TRUNCATE phase. > -- > Michael Paquier > http://michael.otacoo.com > |
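The get_node_index() qualifier proposed in the exchange above can be modelled like this. This is a hypothetical sketch of Amit's suggestion, not an existing XC function; modulo distribution is assumed, and a hash distribution would apply the hash function to the value first.

```python
# get_node_index: which node a row with this distribution-column value
# belongs on.  In SQL this would be declared IMMUTABLE so the qualifier
# could be shipped down to the datanodes.

def get_node_index(distcol_val, node_list):
    return node_list[distcol_val % len(node_list)]

def misplaced_rows(rows, my_node, dist_col, node_list):
    # Rows on this datanode whose correct home is some other node: these
    # are the only rows the proposed DELETE ... RETURNING (or a plain
    # SELECT) would need to touch and transfer; correctly placed rows
    # never leave the node.
    return [r for r in rows
            if get_node_index(r[dist_col], node_list) != my_node]
```

The appeal of the idea is that only misplaced rows travel over the network; the objection raised in the thread is that it ties the query to the complete node list and relies on RETURNING, which XC did not support at the time.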
From: Michael P. <mic...@gm...> - 2012-06-27 12:06:45
|
On Wed, Jun 27, 2012 at 4:18 PM, Amit Khandekar < ami...@en...> wrote: > > > On 26 June 2012 19:15, Michael Paquier <mic...@gm...> wrote: > >> >> On 2012/06/26, at 21:20, Amit Khandekar <ami...@en...> >> wrote: >> >> >> >> On 20 June 2012 11:16, Michael Paquier <mic...@gm...> wrote: >> >>> Please find attached a far more performant patch. >>> This new mechanism uses a redistribution based on method 2: >>> - COPY TO >>> - TRUNCATE >>> - catalog update >>> - COPY FROM >>> >>> The advantages of this method are: >>> - network traffic is halved as all the data is located temporarily on >>> Coordinator >>> - TRUNCATE speeds up data deletion >>> - COPY improves the performance of data transfer by ~30 times >>> - Time necessary to redistribute is the time necessary to process COPY >>> FROM + COPY TO >>> - Critical diminution of XLOG, there are not anymore a multitude of >>> xlogs generated at Datanode level when distribution is changed to >>> hash/modulo. >>> >>> I did a couple of tests up with tables of hundreds of MB of data. >>> For a table of 10Mrows (300MB with pg_relation_size), redistribution >>> took 30s from a replication to hash. >>> With the previous patch, redistribution took 30mins for a 1Mrow table >>> when table was altered from replication to hash distribution. >>> I am also conducting tests with tables of GB size. However this >>> mechanism is really fast. >>> >>> >> Hi Michael, >> >> While looking at the other remote-copy-API related patch that you sent in >> different thread, I ultimately ended up with this patch in order to know >> the basic ALTER TABLE implementation. Sorry to comment on this patch so >> late. >> >> I was checking if it is even possible to pull only the required rows from >> tables for COPY. 
If somehow we could build an expression based on the hash >> function we might be able to use it as a qualifier in the select query to >> the datanode so that only those rows will be retrieved that need to be >> deleted from the table: >> >> This assumes that datanodes have the knowledge of the other datanodes, >> and this is not the case, so this idea is difficult to put in practice. >> > > Write a function get_node_index(discol_val, nodeList) which returns node > index given a distcol value and node index list. Mark this as immutable so > that it can be shipped on datanode. > Then use this query to execute from coordinator: > delete from tab where get_node_index(distcol, nodeList) = $1 returning * > Execute it from coordinator, targeting data nodes one by one, replacing $1 > with the particular datanode index. Since exec-direct does not work for > DELETE, we need to somehow set the exec_nodes list to include only the > specific datanode. > I am honestly a little bit sceptic about this approach... It depends on too many things to my mind, and DELETE can be really costly on large table. Returning is not supported and this makes the table redistribution depending on the complete node list. My idea is to use, as you said, a pipe between COPY FROM/COPY TO and to run in the middle a TRUNCATE which is way faster on large tables. This would use the APIs I extracted with my first patch and the other functions in execRemote.c. If you have a look at how those functions are processed you would notice that COPY TO redirects all the data to a FILE *, and COPY FROM uses tuple-based strings and sends them to Datanodes with a dedicated set of connections. Regarding performance and portability, I think this looks as the better approach. Also the COPY APIs in execRemote.c are pretty robust. Once this is done, I think it would be OK to commit. Then, there are also some improvements that can be done with replicated tables. 
For example, when reducing or increasing the set of nodes of a replicated table, you do not need the TRUNCATE phase. -- Michael Paquier http://michael.otacoo.com |
From: Amit K. <ami...@en...> - 2012-06-27 08:30:10
|
I am overall OK with the patch. Maybe later on I will do a final revisit once you start using the API for ALTER TABLE. On 27 June 2012 12:57, Michael Paquier <mic...@gm...> wrote: > Hi all, > > Please find an updated patch attached. > There are 2 modifications: the number of arguments of > RemoteCopy_BuildStatement and RemoteCopy_GetRelationLoc is reduced by 1 as > TupleDesc depends on Relation. > > Regards, > -- > Michael Paquier > http://michael.otacoo.com > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Postgres-xc-developers mailing list > Pos...@li... > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers > > |
From: Ahsan H. <ahs...@en...> - 2012-06-27 08:22:57
|
On Wed, Jun 27, 2012 at 12:26 PM, Michael Paquier <mic...@gm... > wrote: > > > On Wed, Jun 27, 2012 at 4:18 PM, Amit Khandekar < > ami...@en...> wrote: > >> >> >> On 26 June 2012 19:15, Michael Paquier <mic...@gm...> wrote: >> >>> >>> On 2012/06/26, at 21:20, Amit Khandekar <ami...@en...> >>> wrote: >>> >>> >>> >>> On 20 June 2012 11:16, Michael Paquier <mic...@gm...>wrote: >>> >>>> Please find attached a far more performant patch. >>>> This new mechanism uses a redistribution based on method 2: >>>> - COPY TO >>>> - TRUNCATE >>>> - catalog update >>>> - COPY FROM >>>> >>>> The advantages of this method are: >>>> - network traffic is halved as all the data is located temporarily on >>>> Coordinator >>>> - TRUNCATE speeds up data deletion >>>> - COPY improves the performance of data transfer by ~30 times >>>> - Time necessary to redistribute is the time necessary to process COPY >>>> FROM + COPY TO >>>> - Critical diminution of XLOG, there are not anymore a multitude of >>>> xlogs generated at Datanode level when distribution is changed to >>>> hash/modulo. >>>> >>>> I did a couple of tests up with tables of hundreds of MB of data. >>>> For a table of 10Mrows (300MB with pg_relation_size), redistribution >>>> took 30s from a replication to hash. >>>> With the previous patch, redistribution took 30mins for a 1Mrow table >>>> when table was altered from replication to hash distribution. >>>> I am also conducting tests with tables of GB size. However this >>>> mechanism is really fast. >>>> >>>> >>> Hi Michael, >>> >>> While looking at the other remote-copy-API related patch that you sent >>> in different thread, I ultimately ended up with this patch in order to know >>> the basic ALTER TABLE implementation. Sorry to comment on this patch so >>> late. >>> >>> I was checking if it is even possible to pull only the required rows >>> from tables for COPY. 
If somehow we could build an expression based on the >>> hash function we might be able to use it as a qualifier in the select query >>> to the datanode so that only those rows will be retrieved that need to be >>> deleted from the table: >>> >>> This assumes that datanodes have the knowledge of the other datanodes, >>> and this is not the case, so this idea is difficult to put in practice. >>> >> >> Write a function get_node_index(discol_val, nodeList) which returns node >> index given a distcol value and node index list. Mark this as immutable so >> that it can be shipped on datanode. >> Then use this query to execute from coordinator: >> delete from tab where get_node_index(distcol, nodeList) = $1 returning * >> > returning is not yet supported in XC. > Yes Abbas is working on this as part of this work for WCO cursors. > > >> Execute it from coordinator, targeting data nodes one by one, replacing >> $1 with the particular datanode index. Since exec-direct does not work for >> DELETE, we need to somehow set the exec_nodes list to include only the >> specific datanode. >> > I am really afraid that this mechanism is too slow as we may have to go > through the tuples one by one. > Btw, sending data to dedicated remote nodes is what I am plannning to do > with the remote COPY APIs. > > > -- > Michael Paquier > http://michael.otacoo.com > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Postgres-xc-developers mailing list > Pos...@li... 
> https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers > > -- Ahsan Hadi Snr Director Product Development EnterpriseDB Corporation The Enterprise Postgres Company Phone: +92-51-8358874 Mobile: +92-333-5162114 Website: www.enterprisedb.com EnterpriseDB Blog: http://blogs.enterprisedb.com/ Follow us on Twitter: http://www.twitter.com/enterprisedb This e-mail message (and any attachment) is intended for the use of the individual or entity to whom it is addressed. This message contains information from EnterpriseDB Corporation that may be privileged, confidential, or exempt from disclosure under applicable law. If you are not the intended recipient or authorized to receive this for the intended recipient, any use, dissemination, distribution, retention, archiving, or copying of this communication is strictly prohibited. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and delete this message. |
From: Amit K. <ami...@en...> - 2012-06-27 07:37:45
|
On 27 June 2012 12:56, Michael Paquier <mic...@gm...> wrote: > > > On Wed, Jun 27, 2012 at 4:18 PM, Amit Khandekar < > ami...@en...> wrote: > >> >> >> On 26 June 2012 19:15, Michael Paquier <mic...@gm...> wrote: >> >>> >>> On 2012/06/26, at 21:20, Amit Khandekar <ami...@en...> >>> wrote: >>> >>> >>> >>> On 20 June 2012 11:16, Michael Paquier <mic...@gm...>wrote: >>> >>>> Please find attached a far more performant patch. >>>> This new mechanism uses a redistribution based on method 2: >>>> - COPY TO >>>> - TRUNCATE >>>> - catalog update >>>> - COPY FROM >>>> >>>> The advantages of this method are: >>>> - network traffic is halved as all the data is located temporarily on >>>> Coordinator >>>> - TRUNCATE speeds up data deletion >>>> - COPY improves the performance of data transfer by ~30 times >>>> - Time necessary to redistribute is the time necessary to process COPY >>>> FROM + COPY TO >>>> - Critical diminution of XLOG, there are not anymore a multitude of >>>> xlogs generated at Datanode level when distribution is changed to >>>> hash/modulo. >>>> >>>> I did a couple of tests up with tables of hundreds of MB of data. >>>> For a table of 10Mrows (300MB with pg_relation_size), redistribution >>>> took 30s from a replication to hash. >>>> With the previous patch, redistribution took 30mins for a 1Mrow table >>>> when table was altered from replication to hash distribution. >>>> I am also conducting tests with tables of GB size. However this >>>> mechanism is really fast. >>>> >>>> >>> Hi Michael, >>> >>> While looking at the other remote-copy-API related patch that you sent >>> in different thread, I ultimately ended up with this patch in order to know >>> the basic ALTER TABLE implementation. Sorry to comment on this patch so >>> late. >>> >>> I was checking if it is even possible to pull only the required rows >>> from tables for COPY. 
If somehow we could build an expression based on the >>> hash function we might be able to use it as a qualifier in the select query >>> to the datanode so that only those rows will be retrieved that need to be >>> deleted from the table: >>> >>> This assumes that datanodes have the knowledge of the other datanodes, >>> and this is not the case, so this idea is difficult to put into practice. >>> >> >> Write a function get_node_index(distcol_val, nodeList) which returns a node >> index given a distribution-column value and a node index list. Mark this as immutable so >> that it can be shipped to the datanodes. >> Then execute this query from the coordinator: >> delete from tab where get_node_index(distcol, nodeList) = $1 returning * >> > returning is not yet supported in XC. > > >> Execute it from the coordinator, targeting the datanodes one by one, replacing >> $1 with the particular datanode index. Since EXECUTE DIRECT does not work for >> DELETE, we need to somehow set the exec_nodes list to include only the >> specific datanode. >> > I am really afraid that this mechanism is too slow as we may have to go > through the tuples one by one. > The point of the function get_node_index() is to move only the required data, and not touch the data that is already at the right location. So you can even just select the filtered rows, as against using DELETE. I did not get what you meant by going through the tuples one by one. > Btw, sending data to dedicated remote nodes is what I am planning to do > with the remote COPY APIs. > OK. So your remote copy extraction patch in the other mail will have these changes? Also, please consider piping the COPY TO STDOUT output into COPY FROM STDIN, if that makes it possible to avoid dumping the data to an intermediate file. > > > -- > Michael Paquier > http://michael.otacoo.com > |
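The piping idea raised here, feeding COPY TO STDOUT output straight into COPY FROM STDIN without an intermediate file, can be shaped as a generator pipeline. A sketch only: copy_to and copy_from below are hypothetical stand-ins for the real per-connection COPY protocol calls, not existing XC functions.

```python
# Stream rows from the source through a generator so that COPY FROM
# consumes COPY TO output as it arrives, with no intermediate file.

def copy_to(source_rows):
    # Generator: yields one COPY-format (tab-separated) line at a time,
    # playing the role of COPY TO STDOUT.
    for row in source_rows:
        yield "\t".join(str(v) for v in row)

def copy_from(lines, route):
    # Consume lines as they arrive (COPY FROM STDIN), routing each
    # decoded row to its target node via the supplied route function.
    targets = {}
    for line in lines:
        row = line.split("\t")
        targets.setdefault(route(row), []).append(row)
    return targets
```

Because copy_to is lazy, only one row is in flight at a time; buffering the whole table, as the tuple-store variant does, trades memory for the ability to run TRUNCATE between the two phases.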
From: Michael P. <mic...@gm...> - 2012-06-27 07:26:15
|
On Wed, Jun 27, 2012 at 4:18 PM, Amit Khandekar < ami...@en...> wrote: > > > On 26 June 2012 19:15, Michael Paquier <mic...@gm...> wrote: > >> >> On 2012/06/26, at 21:20, Amit Khandekar <ami...@en...> >> wrote: >> >> >> >> On 20 June 2012 11:16, Michael Paquier <mic...@gm...> wrote: >> >>> Please find attached a far more performant patch. >>> This new mechanism uses a redistribution based on method 2: >>> - COPY TO >>> - TRUNCATE >>> - catalog update >>> - COPY FROM >>> >>> The advantages of this method are: >>> - network traffic is halved as all the data is located temporarily on >>> Coordinator >>> - TRUNCATE speeds up data deletion >>> - COPY improves the performance of data transfer by ~30 times >>> - Time necessary to redistribute is the time necessary to process COPY >>> FROM + COPY TO >>> - Critical diminution of XLOG, there are not anymore a multitude of >>> xlogs generated at Datanode level when distribution is changed to >>> hash/modulo. >>> >>> I did a couple of tests up with tables of hundreds of MB of data. >>> For a table of 10Mrows (300MB with pg_relation_size), redistribution >>> took 30s from a replication to hash. >>> With the previous patch, redistribution took 30mins for a 1Mrow table >>> when table was altered from replication to hash distribution. >>> I am also conducting tests with tables of GB size. However this >>> mechanism is really fast. >>> >>> >> Hi Michael, >> >> While looking at the other remote-copy-API related patch that you sent in >> different thread, I ultimately ended up with this patch in order to know >> the basic ALTER TABLE implementation. Sorry to comment on this patch so >> late. >> >> I was checking if it is even possible to pull only the required rows from >> tables for COPY. 
If somehow we could build an expression based on the hash >> function we might be able to use it as a qualifier in the select query to >> the datanode so that only those rows will be retrieved that need to be >> deleted from the table: >> >> This assumes that datanodes have the knowledge of the other datanodes, >> and this is not the case, so this idea is difficult to put in practice. >> > > Write a function get_node_index(discol_val, nodeList) which returns node > index given a distcol value and node index list. Mark this as immutable so > that it can be shipped on datanode. > Then use this query to execute from coordinator: > delete from tab where get_node_index(distcol, nodeList) = $1 returning * > returning is not yet supported in XC. > Execute it from coordinator, targeting data nodes one by one, replacing $1 > with the particular datanode index. Since exec-direct does not work for > DELETE, we need to somehow set the exec_nodes list to include only the > specific datanode. > I am really afraid that this mechanism is too slow as we may have to go through the tuples one by one. Btw, sending data to dedicated remote nodes is what I am plannning to do with the remote COPY APIs. -- Michael Paquier http://michael.otacoo.com |
From: Amit K. <ami...@en...> - 2012-06-27 07:19:17
|
On 26 June 2012 19:15, Michael Paquier <mic...@gm...> wrote: > > On 2012/06/26, at 21:20, Amit Khandekar <ami...@en...> > wrote: > > > > On 20 June 2012 11:16, Michael Paquier <mic...@gm...> wrote: > >> Please find attached a far more performant patch. >> This new mechanism uses a redistribution based on method 2: >> - COPY TO >> - TRUNCATE >> - catalog update >> - COPY FROM >> >> The advantages of this method are: >> - network traffic is halved as all the data is located temporarily on >> Coordinator >> - TRUNCATE speeds up data deletion >> - COPY improves the performance of data transfer by ~30 times >> - Time necessary to redistribute is the time necessary to process COPY >> FROM + COPY TO >> - Critical diminution of XLOG, there are not anymore a multitude of xlogs >> generated at Datanode level when distribution is changed to hash/modulo. >> >> I did a couple of tests up with tables of hundreds of MB of data. >> For a table of 10Mrows (300MB with pg_relation_size), redistribution took >> 30s from a replication to hash. >> With the previous patch, redistribution took 30mins for a 1Mrow table >> when table was altered from replication to hash distribution. >> I am also conducting tests with tables of GB size. However this mechanism >> is really fast. >> >> > Hi Michael, > > While looking at the other remote-copy-API related patch that you sent in > different thread, I ultimately ended up with this patch in order to know > the basic ALTER TABLE implementation. Sorry to comment on this patch so > late. > > I was checking if it is even possible to pull only the required rows from > tables for COPY. 
If somehow we could build an expression based on the hash > function we might be able to use it as a qualifier in the select query to > the datanode so that only those rows will be retrieved that need to be > deleted from the table: > > This assumes that datanodes have the knowledge of the other datanodes, and > this is not the case, so this idea is difficult to put in practice. > Write a function get_node_index(discol_val, nodeList) which returns node index given a distcol value and node index list. Mark this as immutable so that it can be shipped on datanode. Then use this query to execute from coordinator: delete from tab where get_node_index(distcol, nodeList) = $1 returning * Execute it from coordinator, targeting data nodes one by one, replacing $1 with the particular datanode index. Since exec-direct does not work for DELETE, we need to somehow set the exec_nodes list to include only the specific datanode. > SELECT from tab1 where expr(dist_col) is true; > And then supply the results into COPY (or directly to insert ?) > > The first version of my patch was using an INSERT-based mechanism. This > has a low performance, and we may finish by generating one INSERT query per > row, generating 1 xlog record for each query. If millions of rows are moved > between nodes, you might need to replay unnecessary records after for > example a node crash, slowing the bode recovery by that much. > > So the COPY will look like: > COPY (SELECT from tab1 where expr(dist_col) is true) > or still better: > COPY ( DELETE from tab1 where expr(dist_col) is true RETURNING *) > This will get rid of a separate step to delete rows. But I thingk COPY ... > DELETE is not valid syntax. > > No it is not. But you can use a select clause with a WITH having a DELETE > returning. However, you need to know with this method on a local datanode > if the tuples of this node are correctly located in cluster, but as a > Datanode has no view of the other XC nodes, this is not directly applicable. 
> > May be we can write a set returning function that runs delete...returning > and then use COPY (select such_func('table_name') ) > > If we manage to delete and retrieve only required rows, it may be possible > to skip the COPY part and do a straight: > insert into tab1 DELETE from tab1 where expr(dist_col) is true RETURNING > *. I am not sure if it is performant to do this w/ or w/o copy. COPY may or > may not write into WAL logs. > > Copy only does on WAL record I think. It is a single query. However COPY > protocol is the fastest solution for data transfer, so we should definitely > avoid solutions with INSERT as it would finish using a slower extended > query protocol, and increase the number of xlogs on datanodes. > > > This is all a quick thought without much thought on the implications, but > still wanted to put that out here. For e.g., is it possible to create such > a expression expr(dist_col) ? This should be correctly built up according > to whether it is hash or modulo or round robin distribution. For round > robin may be it should always be true because we anyways want to delete > them and reinsert, or do we? If the table is to be altered from hash to > round robin, the expression might be needed to always be true may be; and > so on. > > Regards, >> >> >> On Wed, Jun 20, 2012 at 9:29 AM, Michael Paquier < >> mic...@gm...> wrote: >> >>> In this patch, the SQL/catalog management and the distribution mechanism >>> use really separated APIs. >>> So even if I do not think it is necessary to change the SQL part, the >>> redistribution mechanism can be changed at will. >>> >>> For the time being, the redistribution mechanism is not really >>> performant (well, it was not the goal of this prototype), because it uses >>> the following model. 
>>> 1) Creation of a storage table (unlogged) with default distribution >>> (CTAS) >>> 2) Take necessary locks on storage and redistributed table >>> 3) Delete all data on redistributed table >>> 4) Update catalogs with new distribution information >>> 5) Perform INSERT SELECT from storage table to redistributed table >>> 6) DROP storage table >>> As mentionned by Ashutosh, the data needs to travel 4 times through >>> network so it can really take a lot of time for tables with lots of gigs of >>> data. >>> Network in itself is not the bottleneck, it is the usage of the >>> framework of postgres. This is especially true when queries used by >>> redistribution mechanism cannot be pushed down. >>> The worst case being when a table is redistributed to a hash/modulo on >>> multiple nodes, as in this case it is necessary to plan each INSERT for the >>> redistribution. As mentionned also by Ashutosh this can create a huge deal >>> of xlogs on remote Datanodes, not really welcome after a crash recovery. >>> >>> There are several ways possible to improve this dramatically improve the >>> redistribution mechanism. >>> Here are 3 ideas. >>> 1) Create storage table with data on a single node >>> Here we reduce the load on network, but it cannot solve the problem of >>> tables redistributed to modulo/hash on multiple nodes. It will create a lot >>> of INSERT queries for a slow result. >>> 2) Use a COPY mechanism >>> One of the simple solutions. Instead of using a costly storage table in >>> cluster, store the data on Coordinator during the redistribution >>> a) COPY the data of table being redistributed to a file in $PGDATA of >>> Coordinator. Why not $PGDATA/pg_distrib/oid? >>> b) DELETE all the data on table being redistributed >>> c) update catalogs >>> d) COPY FROM file to table with new distribution type >>> Network load is halved. COPY is also really faster. 
>>> Servers of Coordinator are not chosen for there disk I/O but the folder >>> $PGDATA/pg_distrib could be linked to a folder where a faster disk is >>> mounted >>> This also gets rid of the storage table. The only thing to care of is >>> the deletion of the temporary data file once redistribution transaction >>> commits or aborts. >>> Data file could also be compressed to reduce space consumed and I/O on >>> disk. >>> >>> 3) Use a batching process to communicate only necessary tuples from >>> Datanodes to Coordinator. >>> Suggested by Ashutosh, this can use COPY protocol to redistribute in a >>> batch way the tuples being redistributed. >>> The idea is to send from Datanodes to Coordinator only the tuples that >>> need to be redistributed, and then let Coordinator redistribute correctly >>> all the data depending on the new distribution. This avoids to have to >>> store temporarily the data redistributed and all the transfer is managed by >>> cache on Coordinator. >>> This idea has a couple of limitations though: >>> - a Datanode is not aware of the existence of the other nodes in >>> cluster. Now distribution data is only available at Coordinator on catalog >>> pgxc_class, and this distribution data contains the list of nodes where >>> data is distributed. This is directly dependant on catalog pgxc_node. So a >>> Datanode cannot know if a tuple will be at the correct place or not. >>> This could be countered by allowing the run of node DDLs on Datanodes, >>> but this adds an additional constraint on cluster setting as it forces the >>> cluster designer to update all the pgxc_node catalogs on all the nodes. >>> Having a pgxc_node catalog on Datanode would make sense if it communicates >>> with other nodes through the same pooler as Coordinator, but this also >>> raises issues with multiple backends open on one node for the same session, >>> which is dangerous for transaction handling. >>> - visibility concerns. 
What insures that a tuple has been only selected >>> once. As redistribution is a cluster-based mechanism. What can insure that >>> a scan on a Datanode is not taking into account some tuples that have >>> already been redistributed. >>> >>> Method 1 looks useless from the point of performance. >>> Method 2 should have a good performance. This only point is that data >>> has to be located on Coordinator server temporarily while redistribution is >>> being done. We could also use some GUC parameter to allow DBA to customize >>> the way redistribution data folder is stored (compression type, file name >>> format...). >>> I have some concerns about method 3 as explained above. I might not take >>> into account all the potential problems or have a limited view on this >>> mechanism, but it introduces some new dependencies with cluster setting >>> which may not be necessary. However any discussion on the subject is >>> welcome. >>> >>> Suggestions are welcome. >>> >>> >>> On Wed, Jun 20, 2012 at 8:40 AM, Michael Paquier < >>> mic...@gm...> wrote: >>> >>>> >>>> >>>> On Wed, Jun 20, 2012 at 4:19 AM, Abbas Butt < >>>> abb...@en...> wrote: >>>> >>>>> You forgot to attach the patch. >>>>> >>>> Sorry here is the patch. >>>> >>>> >>>> >>>>> >>>>> On Tue, Jun 19, 2012 at 10:58 AM, Michael Paquier < >>>>> mic...@gm...> wrote: >>>>> >>>>>> Hi all, >>>>>> >>>>>> Please find attached an improved patch. I corrected the following >>>>>> points: >>>>>> - Storage table uses an access exclusive lock meaning it cannot be >>>>>> accessed by other sessions in cluster >>>>>> - The table redistributed uses an exclusive lock, it can be accessed >>>>>> by the other sessions in cluster with SELECT while redistribution is running >>>>>> - Addition of an API to manage table locking >>>>>> - Correction of bugs regarding session concurrency. An update in >>>>>> pgxc_class (update of distribution data) was not seen by concurrent >>>>>> sessions in cluster. 
>>>>>> - doc correction and completion >>>>>> - regression fixes due to grammar change for node list in CTAS, >>>>>> CREATE TABLE, EXECUTE DIRECT and CLEAN CONNECTION >>>>>> - Fix of system functions using EXECUTE direct >>>>>> - Fix for CTAS query generation >>>>>> - update index of catalog pgxc_class updated >>>>>> - Correct update for relation cache when location data is updated >>>>>> >>>>>> Questions are welcome. >>>>>> This patch can be applied on master and works as expected. >>>>>> >>>>>> On Mon, Jun 18, 2012 at 5:25 PM, Michael Paquier < >>>>>> mic...@gm...> wrote: >>>>>> >>>>>>> Hi all, >>>>>>> >>>>>>> Based on the design above, I went to the end of my idea and took a >>>>>>> day to write a prototype for online redistribution based on ALTER TABLE. >>>>>>> It uses the grammar written in previous mail with ADD NODE/DELETE >>>>>>> NODE/DISTRIBUTE BY/TO NODE | GROUP. >>>>>>> >>>>>>> The main idea is the use of what I call a "storage" table which is >>>>>>> used as a temporary location for the data being distributed in cluster. >>>>>>> This table is created as unlogged >>>>>>> >>>>>>> The patch sticks with the design invocated before; >>>>>>> - Cached plans are dropped when redistribution is invocated >>>>>>> - Vacuum is not necessary, this mechanism uses transaction-safe >>>>>>> queries >>>>>>> - for the time being, this implementation uses an exclusive lock, >>>>>>> but as the redistribution is done, a ShareUpdateExclusive lock is not to >>>>>>> exclude. >>>>>>> - tables are reindexed if necessary. >>>>>>> - redistribution cannot be done inside a transaction block >>>>>>> - redistribution is not authorized with all the other commands as >>>>>>> they are locally-safe on each node. >>>>>>> - no restrictions on the distribution types, table types or >>>>>>> subclusters >>>>>>> >>>>>>> This feature can be really improved for example in the case of >>>>>>> replicated tables in particular, when the list of nodes of the table is >>>>>>> changed. 
>>>>>>> It is one of the things I would like to improve as it would really >>>>>>> increase performance >>>>>>> >>>>>>> Regards, >>>>>>> >>>>>>> -- >>>>>>> Michael Paquier >>>>>>> http://michael.otacoo.com >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Michael Paquier >>>>>> http://michael.otacoo.com >>>>>> >>>>>> >>>>>> ------------------------------------------------------------------------------ >>>>>> Live Security Virtual Conference >>>>>> Exclusive live event will cover all the ways today's security and >>>>>> threat landscape has changed and how IT managers can respond. >>>>>> Discussions >>>>>> will include endpoint security, mobile security and the latest in >>>>>> malware >>>>>> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >>>>>> _______________________________________________ >>>>>> Postgres-xc-developers mailing list >>>>>> Pos...@li... >>>>>> https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> -- >>>>> Abbas >>>>> Architect >>>>> EnterpriseDB Corporation >>>>> The Enterprise PostgreSQL Company >>>>> >>>>> Phone: 92-334-5100153 >>>>> >>>>> >>>>> Website: www.enterprisedb.com >>>>> EnterpriseDB Blog: http://blogs.enterprisedb.com/ >>>>> Follow us on Twitter: http://www.twitter.com/enterprisedb >>>>> >>>>> This e-mail message (and any attachment) is intended for the use of >>>>> the individual or entity to whom it is addressed. This message >>>>> contains information from EnterpriseDB Corporation that may be >>>>> privileged, confidential, or exempt from disclosure under applicable >>>>> law. 
If you are not the intended recipient or authorized to receive >>>>> this for the intended recipient, any use, dissemination, distribution, >>>>> retention, archiving, or copying of this communication is strictly >>>>> prohibited. If you have received this e-mail in error, please notify >>>>> the sender immediately by reply e-mail and delete this message. |
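To make the method-2 flow discussed above concrete, here is a rough SQL sketch of the sequence as it could run on the Coordinator. The staging file path, table name, and column name are hypothetical, and `DISTRIBUTE BY` is the grammar proposed earlier in this thread, so treat this as an illustration of the design rather than the actual implementation:

```sql
-- User-visible command (proposed XC grammar):
ALTER TABLE tab1 DISTRIBUTE BY HASH (a);

-- What method 2 does under the hood, expressed as an equivalent
-- manual sequence (staging file path is hypothetical):
COPY tab1 TO '/pgdata/pg_distrib/16384.dat';    -- stage all rows in a file on the Coordinator
TRUNCATE tab1;                                  -- fast deletion of the old copies on every node
-- ...pgxc_class is updated here with the new distribution information...
COPY tab1 FROM '/pgdata/pg_distrib/16384.dat';  -- reload; the Coordinator routes each row
                                                -- according to the new distribution
```

This also shows why the traffic is halved relative to the storage-table approach: each row crosses the network once on the way in and once on the way out, with no intermediate cluster-distributed table.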
From: Michael P. <mic...@gm...> - 2012-06-27 06:34:39
|
On Wed, Jun 27, 2012 at 3:23 PM, Ashutosh Bapat < ash...@en...> wrote: > Hi Michael, > That's a good catch and good that we came upon it before finalising the > fix. > > While creating the view definition, it tries to parse the SELECT statement > and while doing so, tries to resolve the aggregate function. On datanode, > aggregates have transition type as return type, which in this case is > polymorphic (not acceptable as return type). We manage such aggregates by > not pushing the aggregates down to the datanodes, but in this case don't > look at what can be pushed or not inside the view definition. > Yes, OK. > > What we may want to do, and is hard to do, is to dis-assemble the view > definition at coordinator and send the relevant information (the one stored > in catalogs?) to the datanode to be stored directly (without involving > parsing etc.). The same may need to be done with all the utilities, but > this is a massive change, and something which needs to be thought through > properly. > This is... well... not simple. And not completely related to this fix. > > > On Wed, Jun 27, 2012 at 11:05 AM, Michael Paquier < > mic...@gm...> wrote: > >> Hi, >> >> I wrote a patch enabling the creation of views on Datanodes to get rid of >> this function problem. The fix is attached. >> However, while digging into this issue, I found a problem with types and >> views, for example: >> create table aa (a int); >> create type aa_type as enum ('1','2','3','4','5','6'); >> create view aa_v as select max(a::aa_type) from aa; -- created on all the >> nodes >> ERROR: column "max" has pseudo-type anyenum >> >> This error comes from heap.c, where a check is done on the type of the >> column. >> The problem is that in the case of aggregates, we use the transition type >> on Datanodes, which is a pseudo-type and is by definition forbidden for as >> a column type. 
>> The aggregate modification comes from here: >> --- a/src/backend/parser/parse_agg.c >> +++ b/src/backend/parser/parse_agg.c >> @@ -209,6 +209,7 @@ transformAggregateCall(ParseState *pstate, Aggref >> *agg, >> aggform = (Form_pg_aggregate) GETSTRUCT(aggTuple); >> agg->aggtrantype = aggform->aggtranstype; >> agg->agghas_collectfn = OidIsValid(aggform->aggcollectfn); >> + //Error comes from this one: >> if (IS_PGXC_DATANODE) >> agg->aggtype = agg->aggtrantype; >> >> Associating a transition type on Datanodes for aggregates is correct, but >> until now we have never created views on Datanodes. >> Btw, a fix for this second issue is included in the patch attached. What >> I simply did was bypassing the error on Datanodes as we may have a >> pseudo-type in the case of an aggregate. Ashutosh, comments on that? >> >> >> >> On Wed, Jun 20, 2012 at 2:59 PM, Ashutosh Bapat < >> ash...@en...> wrote: >> >>> >>> >>> On Wed, Jun 20, 2012 at 10:25 AM, Michael Paquier < >>> mic...@gm...> wrote: >>> >>>> >>>> >>>> On Wed, Jun 20, 2012 at 12:58 PM, Ashutosh Bapat < >>>> ash...@en...> wrote: >>>> >>>>> >>>>> >>>>> On Wed, Jun 20, 2012 at 9:18 AM, Michael Paquier < >>>>> mic...@gm...> wrote: >>>>> >>>>>> >>>>>> >>>>>> On Wed, Jun 20, 2012 at 12:46 PM, Ashutosh Bapat < >>>>>> ash...@en...> wrote: >>>>>> >>>>>>> One fix, I can think of is to create volatile functions only on >>>>>>> coordinator. Although, I would still take a step back, and find out why we >>>>>>> took the decision not to store views on the datanodes. >>>>>>> >>>>>> View => projection of table data => need distribution type of table >>>>>> => distribution data only available on Coordinator for data distribution => >>>>>> no sense to define views on Datanodes >>>>>> >>>>> >>>>> In the case, where a view type is used as function argument or return >>>>> type, it does make sense to have the view definition on the datanodes. 
The >>>>> implication behind my question is whether there is any correctness problem >>>>> by creating view and related definitions at the datanodes. >>>>> >>>> By taking this question from another angle: >>>> Are there any problems to push down clauses using views to Datanodes? >>>> >>> >>> Having view definitions on the datanode does not imply that we have to >>> push the clauses using views to the datanodes. In fact, even if we want to, >>> we won't be able to do so, as the view resolution happens even before we >>> take into consideration the distribution. >>> >>> >>>> Just based on correctness, the answer is no problem. Btw, the function >>>> using a view should be volatile as it reads data, so it will not be used on >>>> Datanodes at all... >>>> >>> >>> We are not using view here, we are using datatype which corresponds to >>> the view result. Using such datatype does not necessarily mean that we >>> touch any of the data. For example, see the function (modified version of >>> the example given by Dimitrije) below >>> >>> CREATE OR REPLACE FUNCTION some_function() RETURNS SETOF some_view AS >>> $body$ >>> BEGIN >>> return (1, 1); >>> END; >>> $body$ >>> LANGUAGE 'plpgsql' >>> COST 100; >>> >>> This function is certainly immutable (certainly not volatile), and thus >>> pushable to the datanodes. For such functions, it having view definitions >>> at the datanodes will be helpful. >>> >>> >>>> -- >>>> Michael Paquier >>>> http://michael.otacoo.com >>>> >>> >>> >>> >>> -- >>> Best Wishes, >>> Ashutosh Bapat >>> EntepriseDB Corporation >>> The Enterprise Postgres Company >>> >>> >> >> >> -- >> Michael Paquier >> http://michael.otacoo.com >> > > > > -- > Best Wishes, > Ashutosh Bapat > EntepriseDB Corporation > The Enterprise Postgres Company > > -- Michael Paquier http://michael.otacoo.com |
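The pseudo-type issue described above can be seen directly in the catalogs: for an aggregate such as max over an enum, the transition type stored in pg_aggregate is the polymorphic anyenum, which is exactly the type a Datanode would try to use for the view column. The catalog query below is a plain PostgreSQL lookup; that `max(a::aa_type)` resolves to `max(anyenum)` is inferred from the error message quoted in the mail:

```sql
-- Inspect the transition type of the aggregate picked for an enum argument.
SELECT aggfnoid::regprocedure,
       aggtranstype::regtype    -- anyenum: a pseudo-type, rejected by heap.c
FROM   pg_aggregate
WHERE  aggfnoid = 'max(anyenum)'::regprocedure;
```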
From: Ashutosh B. <ash...@en...> - 2012-06-27 06:23:25
|
Hi Michael, That's a good catch and good that we came upon it before finalising the fix. While creating the view definition, it tries to parse the SELECT statement and while doing so, tries to resolve the aggregate function. On datanode, aggregates have transition type as return type, which in this case is polymorphic (not acceptable as return type). We manage such aggregates by not pushing the aggregates down to the datanodes, but in this case don't look at what can be pushed or not inside the view definition. What we may want to do, and is hard to do, is to dis-assemble the view definition at coordinator and send the relevant information (the one stored in catalogs?) to the datanode to be stored directly (without involving parsing etc.). The same may need to be done with all the utilities, but this is a massive change, and something which needs to be thought through properly. On Wed, Jun 27, 2012 at 11:05 AM, Michael Paquier <mic...@gm... > wrote: > Hi, > > I wrote a patch enabling the creation of views on Datanodes to get rid of > this function problem. The fix is attached. > However, while digging into this issue, I found a problem with types and > views, for example: > create table aa (a int); > create type aa_type as enum ('1','2','3','4','5','6'); > create view aa_v as select max(a::aa_type) from aa; -- created on all the > nodes > ERROR: column "max" has pseudo-type anyenum > > This error comes from heap.c, where a check is done on the type of the > column. > The problem is that in the case of aggregates, we use the transition type > on Datanodes, which is a pseudo-type and is by definition forbidden for as > a column type. 
> The aggregate modification comes from here: > --- a/src/backend/parser/parse_agg.c > +++ b/src/backend/parser/parse_agg.c > @@ -209,6 +209,7 @@ transformAggregateCall(ParseState *pstate, Aggref *agg, > aggform = (Form_pg_aggregate) GETSTRUCT(aggTuple); > agg->aggtrantype = aggform->aggtranstype; > agg->agghas_collectfn = OidIsValid(aggform->aggcollectfn); > + //Error comes from this one: > if (IS_PGXC_DATANODE) > agg->aggtype = agg->aggtrantype; > > Associating a transition type on Datanodes for aggregates is correct, but > until now we have never created views on Datanodes. > Btw, a fix for this second issue is included in the patch attached. What I > simply did was bypassing the error on Datanodes as we may have a > pseudo-type in the case of an aggregate. Ashutosh, comments on that? > > > > On Wed, Jun 20, 2012 at 2:59 PM, Ashutosh Bapat < > ash...@en...> wrote: > >> >> >> On Wed, Jun 20, 2012 at 10:25 AM, Michael Paquier < >> mic...@gm...> wrote: >> >>> >>> >>> On Wed, Jun 20, 2012 at 12:58 PM, Ashutosh Bapat < >>> ash...@en...> wrote: >>> >>>> >>>> >>>> On Wed, Jun 20, 2012 at 9:18 AM, Michael Paquier < >>>> mic...@gm...> wrote: >>>> >>>>> >>>>> >>>>> On Wed, Jun 20, 2012 at 12:46 PM, Ashutosh Bapat < >>>>> ash...@en...> wrote: >>>>> >>>>>> One fix, I can think of is to create volatile functions only on >>>>>> coordinator. Although, I would still take a step back, and find out why we >>>>>> took the decision not to store views on the datanodes. >>>>>> >>>>> View => projection of table data => need distribution type of table => >>>>> distribution data only available on Coordinator for data distribution => no >>>>> sense to define views on Datanodes >>>>> >>>> >>>> In the case, where a view type is used as function argument or return >>>> type, it does make sense to have the view definition on the datanodes. The >>>> implication behind my question is whether there is any correctness problem >>>> by creating view and related definitions at the datanodes. 
>>>> >>> By taking this question from another angle: >>> Are there any problems to push down clauses using views to Datanodes? >>> >> >> Having view definitions on the datanode does not imply that we have to >> push the clauses using views to the datanodes. In fact, even if we want to, >> we won't be able to do so, as the view resolution happens even before we >> take into consideration the distribution. >> >> >>> Just based on correctness, the answer is no problem. Btw, the function >>> using a view should be volatile as it reads data, so it will not be used on >>> Datanodes at all... >>> >> >> We are not using view here, we are using datatype which corresponds to >> the view result. Using such datatype does not necessarily mean that we >> touch any of the data. For example, see the function (modified version of >> the example given by Dimitrije) below >> >> CREATE OR REPLACE FUNCTION some_function() RETURNS SETOF some_view AS >> $body$ >> BEGIN >> return (1, 1); >> END; >> $body$ >> LANGUAGE 'plpgsql' >> COST 100; >> >> This function is certainly immutable (certainly not volatile), and thus >> pushable to the datanodes. For such functions, it having view definitions >> at the datanodes will be helpful. >> >> >>> -- >>> Michael Paquier >>> http://michael.otacoo.com >>> >> >> >> >> -- >> Best Wishes, >> Ashutosh Bapat >> EntepriseDB Corporation >> The Enterprise Postgres Company >> >> > > > -- > Michael Paquier > http://michael.otacoo.com > -- Best Wishes, Ashutosh Bapat EntepriseDB Corporation The Enterprise Postgres Company |
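As a side note, the example function above would not actually run as written: a plpgsql function declared RETURNS SETOF cannot return a row with a plain RETURN expression. A corrected sketch (assuming some_view exposes two integer columns) would be:

```sql
-- Set-returning plpgsql functions must use RETURN NEXT or RETURN QUERY.
CREATE OR REPLACE FUNCTION some_function() RETURNS SETOF some_view AS $body$
BEGIN
    RETURN QUERY SELECT 1, 1;       -- one row shaped like some_view
END;
$body$ LANGUAGE plpgsql IMMUTABLE;  -- marked immutable, hence a push-down candidate
```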
From: Michael P. <mic...@gm...> - 2012-06-26 13:46:14
|
On 2012/06/26, at 21:20, Amit Khandekar <ami...@en...> wrote: > > > On 20 June 2012 11:16, Michael Paquier <mic...@gm...> wrote: > Please find attached a far more performant patch. > This new mechanism uses a redistribution based on method 2: > - COPY TO > - TRUNCATE > - catalog update > - COPY FROM > > The advantages of this method are: > - network traffic is halved as all the data is located temporarily on Coordinator > - TRUNCATE speeds up data deletion > - COPY improves the performance of data transfer by ~30 times > - Time necessary to redistribute is the time necessary to process COPY FROM + COPY TO > - Critical diminution of XLOG, there are not anymore a multitude of xlogs generated at Datanode level when distribution is changed to hash/modulo. > > I did a couple of tests up with tables of hundreds of MB of data. > For a table of 10Mrows (300MB with pg_relation_size), redistribution took 30s from a replication to hash. > With the previous patch, redistribution took 30mins for a 1Mrow table when table was altered from replication to hash distribution. > I am also conducting tests with tables of GB size. However this mechanism is really fast. > > > Hi Michael, > > While looking at the other remote-copy-API related patch that you sent in different thread, I ultimately ended up with this patch in order to know the basic ALTER TABLE implementation. Sorry to comment on this patch so late. > > I was checking if it is even possible to pull only the required rows from tables for COPY. If somehow we could build an expression based on the hash function we might be able to use it as a qualifier in the select query to the datanode so that only those rows will be retrieved that need to be deleted from the table: This assumes that datanodes have the knowledge of the other datanodes, and this is not the case, so this idea is difficult to put in practice. > SELECT from tab1 where expr(dist_col) is true; > And then supply the results into COPY (or directly to insert ?) 
The first version of my patch used an INSERT-based mechanism. This has low performance, and we may end up generating one INSERT query per row, with one xlog record for each query. If millions of rows are moved between nodes, you might need to replay unnecessary records after, for example, a node crash, slowing the node recovery by that much. > So the COPY will look like: > COPY (SELECT from tab1 where expr(dist_col) is true) > or still better: > COPY ( DELETE from tab1 where expr(dist_col) is true RETURNING *) > This will get rid of a separate step to delete rows. But I think COPY ... DELETE is not valid syntax. No, it is not. But you can use a SELECT clause with a WITH containing a DELETE ... RETURNING. However, with this method a local Datanode needs to know whether its tuples are correctly located in the cluster, and as a Datanode has no view of the other XC nodes, this is not directly applicable. > Maybe we can write a set-returning function that runs the DELETE ... RETURNING and then use COPY (select such_func('table_name') ) > > If we manage to delete and retrieve only the required rows, it may be possible to skip the COPY part and do a straight: > insert into tab1 DELETE from tab1 where expr(dist_col) is true RETURNING *. I am not sure whether it is more performant to do this with or without COPY. COPY may or may not write into WAL logs. COPY generates only one WAL record, I think, as it is a single query. However, the COPY protocol is the fastest solution for data transfer, so we should definitely avoid INSERT-based solutions, as they would end up using the slower extended query protocol and increase the number of xlogs on the Datanodes. > > This is all a quick thought without much analysis of the implications, but I still wanted to put it out here. For example, is it even possible to create such an expression expr(dist_col)? It should be built up according to whether the distribution is hash, modulo, or round robin.
For round robin may be it should always be true because we anyways want to delete them and reinsert, or do we? If the table is to be altered from hash to round robin, the expression might be needed to always be true may be; and so on. > > Regards, > > > On Wed, Jun 20, 2012 at 9:29 AM, Michael Paquier <mic...@gm...> wrote: > In this patch, the SQL/catalog management and the distribution mechanism use really separated APIs. > So even if I do not think it is necessary to change the SQL part, the redistribution mechanism can be changed at will. > > For the time being, the redistribution mechanism is not really performant (well, it was not the goal of this prototype), because it uses the following model. > 1) Creation of a storage table (unlogged) with default distribution (CTAS) > 2) Take necessary locks on storage and redistributed table > 3) Delete all data on redistributed table > 4) Update catalogs with new distribution information > 5) Perform INSERT SELECT from storage table to redistributed table > 6) DROP storage table > As mentionned by Ashutosh, the data needs to travel 4 times through network so it can really take a lot of time for tables with lots of gigs of data. > Network in itself is not the bottleneck, it is the usage of the framework of postgres. This is especially true when queries used by redistribution mechanism cannot be pushed down. > The worst case being when a table is redistributed to a hash/modulo on multiple nodes, as in this case it is necessary to plan each INSERT for the redistribution. As mentionned also by Ashutosh this can create a huge deal of xlogs on remote Datanodes, not really welcome after a crash recovery. > > There are several ways possible to improve this dramatically improve the redistribution mechanism. > Here are 3 ideas. > 1) Create storage table with data on a single node > Here we reduce the load on network, but it cannot solve the problem of tables redistributed to modulo/hash on multiple nodes. 
It will create a lot of INSERT queries for a slow result. > 2) Use a COPY mechanism > One of the simple solutions. Instead of using a costly storage table in cluster, store the data on Coordinator during the redistribution > a) COPY the data of table being redistributed to a file in $PGDATA of Coordinator. Why not $PGDATA/pg_distrib/oid? > b) DELETE all the data on table being redistributed > c) update catalogs > d) COPY FROM file to table with new distribution type > Network load is halved. COPY is also really faster. > Servers of Coordinator are not chosen for there disk I/O but the folder $PGDATA/pg_distrib could be linked to a folder where a faster disk is mounted > This also gets rid of the storage table. The only thing to care of is the deletion of the temporary data file once redistribution transaction commits or aborts. > Data file could also be compressed to reduce space consumed and I/O on disk. > > 3) Use a batching process to communicate only necessary tuples from Datanodes to Coordinator. > Suggested by Ashutosh, this can use COPY protocol to redistribute in a batch way the tuples being redistributed. > The idea is to send from Datanodes to Coordinator only the tuples that need to be redistributed, and then let Coordinator redistribute correctly all the data depending on the new distribution. This avoids to have to store temporarily the data redistributed and all the transfer is managed by cache on Coordinator. > This idea has a couple of limitations though: > - a Datanode is not aware of the existence of the other nodes in cluster. Now distribution data is only available at Coordinator on catalog pgxc_class, and this distribution data contains the list of nodes where data is distributed. This is directly dependant on catalog pgxc_node. So a Datanode cannot know if a tuple will be at the correct place or not. 
> This could be countered by allowing the run of node DDLs on Datanodes, but this adds an additional constraint on cluster setting as it forces the cluster designer to update all the pgxc_node catalogs on all the nodes. Having a pgxc_node catalog on Datanode would make sense if it communicates with other nodes through the same pooler as Coordinator, but this also raises issues with multiple backends open on one node for the same session, which is dangerous for transaction handling. > - visibility concerns. What insures that a tuple has been only selected once. As redistribution is a cluster-based mechanism. What can insure that a scan on a Datanode is not taking into account some tuples that have already been redistributed. > > Method 1 looks useless from the point of performance. > Method 2 should have a good performance. This only point is that data has to be located on Coordinator server temporarily while redistribution is being done. We could also use some GUC parameter to allow DBA to customize the way redistribution data folder is stored (compression type, file name format...). > I have some concerns about method 3 as explained above. I might not take into account all the potential problems or have a limited view on this mechanism, but it introduces some new dependencies with cluster setting which may not be necessary. However any discussion on the subject is welcome. > > Suggestions are welcome. > > > On Wed, Jun 20, 2012 at 8:40 AM, Michael Paquier <mic...@gm...> wrote: > > > On Wed, Jun 20, 2012 at 4:19 AM, Abbas Butt <abb...@en...> wrote: > You forgot to attach the patch. > Sorry here is the patch. > > > > On Tue, Jun 19, 2012 at 10:58 AM, Michael Paquier <mic...@gm...> wrote: > Hi all, > > Please find attached an improved patch. 
I corrected the following points: > - Storage table uses an access exclusive lock meaning it cannot be accessed by other sessions in cluster > - The table redistributed uses an exclusive lock, it can be accessed by the other sessions in cluster with SELECT while redistribution is running > - Addition of an API to manage table locking > - Correction of bugs regarding session concurrency. An update in pgxc_class (update of distribution data) was not seen by concurrent sessions in cluster. > - doc correction and completion > - regression fixes due to grammar change for node list in CTAS, CREATE TABLE, EXECUTE DIRECT and CLEAN CONNECTION > - Fix of system functions using EXECUTE direct > - Fix for CTAS query generation > - update index of catalog pgxc_class updated > - Correct update for relation cache when location data is updated > > Questions are welcome. > This patch can be applied on master and works as expected. > > On Mon, Jun 18, 2012 at 5:25 PM, Michael Paquier <mic...@gm...> wrote: > Hi all, > > Based on the design above, I went to the end of my idea and took a day to write a prototype for online redistribution based on ALTER TABLE. > It uses the grammar written in previous mail with ADD NODE/DELETE NODE/DISTRIBUTE BY/TO NODE | GROUP. > > The main idea is the use of what I call a "storage" table which is used as a temporary location for the data being distributed in cluster. > This table is created as unlogged > > The patch sticks with the design invocated before; > - Cached plans are dropped when redistribution is invocated > - Vacuum is not necessary, this mechanism uses transaction-safe queries > - for the time being, this implementation uses an exclusive lock, but as the redistribution is done, a ShareUpdateExclusive lock is not to exclude. > - tables are reindexed if necessary. > - redistribution cannot be done inside a transaction block > - redistribution is not authorized with all the other commands as they are locally-safe on each node. 
> - no restrictions on the distribution types, table types or subclusters > > This feature can be really improved for example in the case of replicated tables in particular, when the list of nodes of the table is changed. > It is one of the things I would like to improve as it would really increase performance > > Regards, > > -- > Michael Paquier > http://michael.otacoo.com > > > > -- > Michael Paquier > http://michael.otacoo.com > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Postgres-xc-developers mailing list > Pos...@li... > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers > > > > > -- > -- > Abbas > Architect > EnterpriseDB Corporation > The Enterprise PostgreSQL Company > > Phone: 92-334-5100153 > > > Website: www.enterprisedb.com > EnterpriseDB Blog: http://blogs.enterprisedb.com/ > Follow us on Twitter: http://www.twitter.com/enterprisedb > > This e-mail message (and any attachment) is intended for the use of > the individual or entity to whom it is addressed. This message > contains information from EnterpriseDB Corporation that may be > privileged, confidential, or exempt from disclosure under applicable > law. 
If you are not the intended recipient or authorized to receive > this for the intended recipient, any use, dissemination, distribution, > retention, archiving, or copying of this communication is strictly > prohibited. If you have received this e-mail in error, please notify > the sender immediately by reply e-mail and delete this message. |
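The "WITH having a DELETE returning" form mentioned in the exchange above could look like the sketch below. Everything here is hypothetical: hashint4 stands in for whatever hash the XC locator really uses, 3 is an assumed number of nodes, and 1 is the bucket this Datanode is supposed to keep — a piece of information a Datanode does not currently have, which is exactly the limitation Michael points out:

```sql
-- Delete the rows that no longer belong to this node and hand them back
-- in the same pass, avoiding a separate DELETE step.
WITH moved AS (
    DELETE FROM tab1
    WHERE abs(hashint4(dist_col)) % 3 <> 1   -- expr(dist_col): hash-bucket test
    RETURNING *
)
SELECT * FROM moved;
```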
From: Amit K. <ami...@en...> - 2012-06-26 12:21:16
|
On 20 June 2012 11:16, Michael Paquier <mic...@gm...> wrote: > Please find attached a far more performant patch. > This new mechanism uses a redistribution based on method 2: > - COPY TO > - TRUNCATE > - catalog update > - COPY FROM > > The advantages of this method are: > - network traffic is halved as all the data is located temporarily on > Coordinator > - TRUNCATE speeds up data deletion > - COPY improves the performance of data transfer by ~30 times > - Time necessary to redistribute is the time necessary to process COPY > FROM + COPY TO > - Critical diminution of XLOG, there are not anymore a multitude of xlogs > generated at Datanode level when distribution is changed to hash/modulo. > > I did a couple of tests up with tables of hundreds of MB of data. > For a table of 10Mrows (300MB with pg_relation_size), redistribution took > 30s from a replication to hash. > With the previous patch, redistribution took 30mins for a 1Mrow table when > table was altered from replication to hash distribution. > I am also conducting tests with tables of GB size. However this mechanism > is really fast. > > Hi Michael, While looking at the other remote-copy-API related patch that you sent in different thread, I ultimately ended up with this patch in order to know the basic ALTER TABLE implementation. Sorry to comment on this patch so late. I was checking if it is even possible to pull only the required rows from tables for COPY. If somehow we could build an expression based on the hash function we might be able to use it as a qualifier in the select query to the datanode so that only those rows will be retrieved that need to be deleted from the table: SELECT from tab1 where expr(dist_col) is true; And then supply the results into COPY (or directly to insert ?) So the COPY will look like: COPY (SELECT from tab1 where expr(dist_col) is true) or still better: COPY ( DELETE from tab1 where expr(dist_col) is true RETURNING *) This will get rid of a separate step to delete rows. 
But I think COPY ... DELETE is not valid syntax. Maybe we can write a set-returning function that runs DELETE ... RETURNING and then use COPY (select such_func('table_name') ) If we manage to delete and retrieve only the required rows, it may be possible to skip the COPY part and do a straight: insert into tab1 DELETE from tab1 where expr(dist_col) is true RETURNING *. I am not sure if it is performant to do this w/ or w/o copy. COPY may or may not write into WAL logs. This is all a quick thought without much thought on the implications, but I still wanted to put it out here. For e.g., is it possible to create such an expression expr(dist_col)? It should be correctly built up according to whether it is hash or modulo or round robin distribution. For round robin maybe it should always be true, because we anyway want to delete them and reinsert, or do we? If the table is to be altered from hash to round robin, the expression might need to always be true; and so on. Regards, > > > On Wed, Jun 20, 2012 at 9:29 AM, Michael Paquier < > mic...@gm...> wrote: > >> In this patch, the SQL/catalog management and the distribution mechanism >> use really separate APIs. >> So even if I do not think it is necessary to change the SQL part, the >> redistribution mechanism can be changed at will. >> >> For the time being, the redistribution mechanism is not really performant >> (well, it was not the goal of this prototype), because it uses the >> following model. >> 1) Creation of a storage table (unlogged) with default distribution (CTAS) >> 2) Take necessary locks on storage and redistributed table >> 3) Delete all data on redistributed table >> 4) Update catalogs with new distribution information >> 5) Perform INSERT SELECT from storage table to redistributed table >> 6) DROP storage table >> As mentioned by Ashutosh, the data needs to travel 4 times through the >> network, so it can really take a lot of time for tables with lots of gigs of >> data. 
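[Editor's note] To make the expr(dist_col) idea above concrete, here is a small sketch in Python. It is purely illustrative: the hash function, node names, and helper functions are hypothetical stand-ins, not XC's actual backend hashing, but they show how such a qualifier would pick out only the rows whose target node changes when the node list is altered:

```python
# Hypothetical sketch of expr(dist_col): Python's hash() stands in
# for the backend's per-datatype hash function.

def target_node(value, nodes):
    # Pick the node that owns `value` under hash/modulo distribution.
    return nodes[hash(value) % len(nodes)]

def needs_move(value, old_nodes, new_nodes):
    # True if the row owning `value` must be shipped elsewhere once
    # the distribution changes from old_nodes to new_nodes; this is
    # the predicate a qualifier like expr(dist_col) would evaluate.
    return target_node(value, old_nodes) != target_node(value, new_nodes)

# Going from 2 to 3 datanodes, only rows whose target changes need
# to be deleted and reinserted; the rest can stay in place.
old_nodes = ["dn1", "dn2"]
new_nodes = ["dn1", "dn2", "dn3"]
moving = [v for v in range(1000) if needs_move(v, old_nodes, new_nodes)]
```

With this toy hash, roughly two thirds of the rows change owner; a real implementation would push the `needs_move` test down as a WHERE qualifier so the Datanode scans return only those rows.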
>> Network in itself is not the bottleneck, it is the usage of the framework >> of postgres. This is especially true when queries used by the redistribution >> mechanism cannot be pushed down. >> The worst case is when a table is redistributed to hash/modulo on >> multiple nodes, as in this case it is necessary to plan each INSERT for the >> redistribution. As also mentioned by Ashutosh, this can create a huge deal >> of xlogs on remote Datanodes, not really welcome after a crash recovery. >> >> There are several possible ways to dramatically improve the >> redistribution mechanism. >> Here are 3 ideas. >> 1) Create storage table with data on a single node >> Here we reduce the load on the network, but it cannot solve the problem of >> tables redistributed to modulo/hash on multiple nodes. It will create a lot >> of INSERT queries for a slow result. >> 2) Use a COPY mechanism >> One of the simple solutions. Instead of using a costly storage table in >> cluster, store the data on the Coordinator during the redistribution: >> a) COPY the data of the table being redistributed to a file in $PGDATA of >> Coordinator. Why not $PGDATA/pg_distrib/oid? >> b) DELETE all the data on the table being redistributed >> c) update catalogs >> d) COPY FROM file to table with new distribution type >> Network load is halved. COPY is also much faster. >> Servers of Coordinator are not chosen for their disk I/O, but the folder >> $PGDATA/pg_distrib could be linked to a folder where a faster disk is >> mounted. >> This also gets rid of the storage table. The only thing to take care of is the >> deletion of the temporary data file once the redistribution transaction commits >> or aborts. >> The data file could also be compressed to reduce space consumed and I/O on >> disk. >> >> 3) Use a batching process to communicate only necessary tuples from >> Datanodes to Coordinator. >> Suggested by Ashutosh, this can use the COPY protocol to redistribute in a >> batch way the tuples being redistributed. 
>> The idea is to send from Datanodes to Coordinator only the tuples that >> need to be redistributed, and then let the Coordinator correctly redistribute >> all the data depending on the new distribution. This avoids having to >> store the redistributed data temporarily, and all the transfer is managed by >> cache on the Coordinator. >> This idea has a couple of limitations though: >> - a Datanode is not aware of the existence of the other nodes in the cluster. >> Now distribution data is only available at the Coordinator in catalog >> pgxc_class, and this distribution data contains the list of nodes where >> data is distributed. This is directly dependent on catalog pgxc_node. So a >> Datanode cannot know if a tuple will be at the correct place or not. >> This could be countered by allowing the run of node DDLs on Datanodes, >> but this adds an additional constraint on cluster setting as it forces the >> cluster designer to update all the pgxc_node catalogs on all the nodes. >> Having a pgxc_node catalog on a Datanode would make sense if it communicates >> with other nodes through the same pooler as the Coordinator, but this also >> raises issues with multiple backends open on one node for the same session, >> which is dangerous for transaction handling. >> - visibility concerns. What ensures that a tuple has been selected only >> once? As redistribution is a cluster-wide mechanism, what can ensure that >> a scan on a Datanode is not taking into account some tuples that have >> already been redistributed? >> >> Method 1 looks useless from the point of view of performance. >> Method 2 should have good performance. The only point is that data has >> to be located on the Coordinator server temporarily while redistribution is >> being done. We could also use some GUC parameter to allow the DBA to customize >> the way the redistribution data folder is stored (compression type, file name >> format...). >> I have some concerns about method 3 as explained above. 
I might not take >> into account all the potential problems or have a limited view on this >> mechanism, but it introduces some new dependencies with cluster setting >> which may not be necessary. However, any discussion on the subject is >> welcome. >> >> Suggestions are welcome. >> >> >> On Wed, Jun 20, 2012 at 8:40 AM, Michael Paquier < >> mic...@gm...> wrote: >> >>> >>> >>> On Wed, Jun 20, 2012 at 4:19 AM, Abbas Butt <abb...@en... >>> > wrote: >>> >>>> You forgot to attach the patch. >>>> >>> Sorry, here is the patch. >>> >>> >>> >>>> >>>> On Tue, Jun 19, 2012 at 10:58 AM, Michael Paquier < >>>> mic...@gm...> wrote: >>>> >>>>> Hi all, >>>>> >>>>> Please find attached an improved patch. I corrected the following >>>>> points: >>>>> - Storage table uses an access exclusive lock, meaning it cannot be >>>>> accessed by other sessions in cluster >>>>> - The table redistributed uses an exclusive lock, it can be accessed >>>>> by the other sessions in cluster with SELECT while redistribution is running >>>>> - Addition of an API to manage table locking >>>>> - Correction of bugs regarding session concurrency. An update in >>>>> pgxc_class (update of distribution data) was not seen by concurrent >>>>> sessions in cluster. >>>>> - doc correction and completion >>>>> - regression fixes due to grammar change for node list in CTAS, CREATE >>>>> TABLE, EXECUTE DIRECT and CLEAN CONNECTION >>>>> - Fix of system functions using EXECUTE DIRECT >>>>> - Fix for CTAS query generation >>>>> - Index of catalog pgxc_class updated >>>>> - Correct update for relation cache when location data is updated >>>>> >>>>> Questions are welcome. >>>>> This patch can be applied on master and works as expected. >>>>> >>>>> On Mon, Jun 18, 2012 at 5:25 PM, Michael Paquier < >>>>> mic...@gm...> wrote: >>>>> >>>>>> Hi all, >>>>>> >>>>>> Based on the design above, I went to the end of my idea and took a >>>>>> day to write a prototype for online redistribution based on ALTER TABLE. 
>>>>>> It uses the grammar written in the previous mail with ADD NODE/DELETE >>>>>> NODE/DISTRIBUTE BY/TO NODE | GROUP. >>>>>> >>>>>> The main idea is the use of what I call a "storage" table, which is >>>>>> used as a temporary location for the data being distributed in the cluster. >>>>>> This table is created as unlogged. >>>>>> >>>>>> The patch sticks with the design described before: >>>>>> - Cached plans are dropped when redistribution is invoked >>>>>> - Vacuum is not necessary, this mechanism uses transaction-safe >>>>>> queries >>>>>> - for the time being, this implementation uses an exclusive lock, but >>>>>> once the redistribution is done, a ShareUpdateExclusive lock is not >>>>>> ruled out. >>>>>> - tables are reindexed if necessary. >>>>>> - redistribution cannot be done inside a transaction block >>>>>> - redistribution is not authorized with all the other commands as >>>>>> they are locally-safe on each node. >>>>>> - no restrictions on the distribution types, table types or >>>>>> subclusters >>>>>> >>>>>> This feature can be really improved, for example in the case of >>>>>> replicated tables in particular, when the list of nodes of the table is >>>>>> changed. >>>>>> It is one of the things I would like to improve as it would really >>>>>> increase performance. >>>>>> >>>>>> Regards, >>>>>> >>>>>> -- >>>>>> Michael Paquier >>>>>> http://michael.otacoo.com >>>>> >>>>> >>>>> >>>>> -- >>>>> Michael Paquier >>>>> http://michael.otacoo.com >>>>> >>>>> >>>>> ------------------------------------------------------------------------------ >>>>> Live Security Virtual Conference >>>>> Exclusive live event will cover all the ways today's security and >>>>> threat landscape has changed and how IT managers can respond. >>>>> Discussions >>>>> will include endpoint security, mobile security and the latest in >>>>> malware >>>>> threats. 
http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >>>>> _______________________________________________ >>>>> Postgres-xc-developers mailing list >>>>> Pos...@li... >>>>> https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers >>>>> >>>>> >>>> >>>> >>>> -- >>>> -- >>>> Abbas >>>> Architect >>>> EnterpriseDB Corporation >>>> The Enterprise PostgreSQL Company >>>> >>>> Phone: 92-334-5100153 >>>> >>>> >>>> Website: www.enterprisedb.com >>>> EnterpriseDB Blog: http://blogs.enterprisedb.com/ >>>> Follow us on Twitter: http://www.twitter.com/enterprisedb >>>> >>>> This e-mail message (and any attachment) is intended for the use of >>>> the individual or entity to whom it is addressed. This message >>>> contains information from EnterpriseDB Corporation that may be >>>> privileged, confidential, or exempt from disclosure under applicable >>>> law. If you are not the intended recipient or authorized to receive >>>> this for the intended recipient, any use, dissemination, distribution, >>>> retention, archiving, or copying of this communication is strictly >>>> prohibited. If you have received this e-mail in error, please notify >>>> the sender immediately by reply e-mail and delete this message. >>>> >>> >>> >>> >>> -- >>> Michael Paquier >>> http://michael.otacoo.com >>> >> >> >> >> -- >> Michael Paquier >> http://michael.otacoo.com >> > > > > -- > Michael Paquier > http://michael.otacoo.com > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. 
Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Postgres-xc-developers mailing list > Pos...@li... > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers > > |
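[Editor's note] For readers skimming the archive, the COPY TO / TRUNCATE / catalog update / COPY FROM flow settled on in the thread above can be modelled with a short Python sketch. The node names, row values, and the stand-in hash are all hypothetical; the point is only to show how every row travels through the Coordinator exactly once before being rerouted under the new distribution:

```python
# Toy model of redistribution method 2 (COPY-based); not XC internals.

def redistribute(datanodes, new_node_list):
    # a) COPY TO: pull every row from the Datanodes into a buffer held
    #    on the Coordinator (in XC, a file under $PGDATA).
    coordinator_buffer = [row for rows in datanodes.values() for row in rows]
    # b) TRUNCATE: wipe the table on every Datanode.
    for rows in datanodes.values():
        rows.clear()
    # c) Catalog update (pgxc_class) happens here in the real mechanism.
    # d) COPY FROM: reroute each buffered row by the new hash/modulo
    #    distribution over the new node list.
    new_nodes = {name: [] for name in new_node_list}
    for row in coordinator_buffer:
        new_nodes[new_node_list[hash(row) % len(new_node_list)]].append(row)
    return new_nodes

# A table hash-distributed on 2 nodes gets a third node added.
before = {"dn1": [0, 2, 4, 6], "dn2": [1, 3, 5, 7]}
after = redistribute(before, ["dn1", "dn2", "dn3"])
```

Each row crosses the network twice (Datanode to Coordinator, then Coordinator to Datanode), which is half of the four hops needed when an intermediate storage table sits in the cluster.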
From: Peng-Chong L. <li...@gm...> - 2012-06-26 06:37:34
|
First of all, thank you guys for the great xc project, which somehow represents the future of relational databases. However, I personally think xc at its current stage is more academic than practical. It is hard to configure and manage for a production system. It will be a complicated cluster system, and even more complicated with HA configuration, in the eyes of an application developer/architect instead of a database expert or DBA. I would like to suggest something, as I already talked a little with Mr. Suzuki. (1) Centralized Management If you look at pacemaker/corosync, you will find the crm interesting. Every configuration change made on any node will be broadcast to the cluster. For xc, the nodes in the cluster do not know anything about each other. Actually, GTM may know the whole thing, although I have to repeat each configuration on every gtm-proxy, coordinator, and datanode. Do you think it is possible to centralize configuration, monitoring, tuning, management, etc.? The first idea that comes to my mind is to improve the gtm module, which already communicates with every other node. (2) High Availability I know currently xc-HA depends on HA of each node, either gtm-standby or streaming replication of coordinator and data nodes. However, the solution is too complicated and freaks me out. I have tried to create an xc-HA cluster by myself and it is painful. Not only are we short of a heartbeat resource agent for streaming-replicated postgresql, but I also think the system is too fragile to be a production system without reliable fail-over and fail-back. Do you think we can make the xc-cluster itself fault-tolerant? For example, create another type of node - 'shadow/backup' nodes. Each would be identical to its corresponding coordinator/data node and SQL statement-level replicated. There would be no single point of failure in the cluster. (3) Disaster Backup All data in this share-nothing cluster has to be shipped to the backup facility. 
If we just use streaming replication of each node, the backup cluster will never function if replication of any node fails or the replication streams are not all synchronized. So far I have to use pgpool as the front-end of the two clusters. I think it would be better if there were a cluster replication solution rather than per-node replication. Does any DML go through gtm? If so, gtm would be the best module to replicate all data changes to the backup cluster. I am really looking forward to "dynamic node adding/removing" on the xc roadmap. It will make xc not a set of machines, but a real cluster. Furthermore, it is the starting point of all these features. Thanks in advance. Liu |
From: Peng-Chong L. <li...@gm...> - 2012-06-26 03:43:55
|
After cold reboots and re-creation of the xc cluster and database, I achieved 1184 tps from the 3-node xc cluster. It makes more sense. I will dig further. Thank you guys! 2012/6/26 Michael Paquier <mic...@gm...> > > > On Tue, Jun 26, 2012 at 11:52 AM, Peng-Chong LIU <li...@gm...> wrote: > >> Thanks for your instructions. It is exactly the same as Mr. Suzuki's. >> >> My xc cluster was configured on 3 PC servers/Gigabit Ethernet. I read >> some xc documentation, which suggests the number of coordinators should be >> 1/3 of that of datanodes for OLTP cases. However, I tried more >> possibilities. >> >> (1) 1 coordinator (server #3) and 1 datanode (server #3): 666 tps >> (2) 1 coordinator (server #3) and 2 datanodes (server #1 & #3): 542 tps >> (3) 1 coordinator (server #3) and 3 datanodes (server #1, #2, & #3): 554 >> tps >> (4) 3 coordinators (server #1, #2, & #3), and 3 datanodes (server #1, #2, >> & #3): 426 tps >> (5) Same as (4), but with "preferred" datanode settings: 20 tps!!! >> > 20 TPS is weird, it's too low. > > >> >> For (4) and (5), I cannot run pgxc_test_launcher.sh with multiple servers >> in the same appServerList.data, since the multiple dbdriver processes write >> the same log files. It seems the log files are corrupted in this situation. >> I have to run pgxc_test_launcher.sh with a single dbdriver process on >> different servers at the same time, then add the results together. >> >> Do other conditions (network speed among nodes, think_time, etc.) >> matter? Sometimes, when I raised the workload, for example, think_time = 0, >> the cluster crashed (either process gtm or postgres). >> > A crash should not happen on XC side, that's kind of surprising. > > >> >> Since both of you mentioned the preferred node thing, I will re-do test >> case (5) to make sure I did not do something wrong. >> >> >> 2012/6/26 Michael Paquier <mic...@gm...> >> >>> It looks like you have been able to set up a cluster, that is already a good >>> step. 
>>> >>> What is the cluster structure you are using with those 3 servers? >>> Is it 1 Coordinator and 1 Datanode per server? >>> >>> We are able to get performant results here by grouping Datanode and >>> Coordinator on the same server, and then using a feature called the preferred >>> node to maximize the reads of replicated tables to the local nodes, hence >>> heavily reducing the network traffic. >>> For example, assuming that in your case you have 3 Coordinators, 3 >>> Datanodes on those 3 servers, each server having 1 Coordinator and 1 >>> Datanode, you need to define the preferred Datanode of Coordinator 1 as >>> Datanode 1, preferred Datanode of Coordinator 2 as Datanode 2, same for >>> Coordinator 3/Datanode 3. >>> >>> You can define a preferred node by using CREATE NODE or ALTER NODE: >>> http://postgres-xc.sourceforge.net/docs/1_0/sql-createnode.html >>> http://postgres-xc.sourceforge.net/docs/1_0/sql-alternode.html >>> For example, to create a Datanode as a preferred node on a Coordinator, you >>> just need to do: >>> CREATE NODE certain_dn (PORT = $port, PREFERRED); >>> or ALTER NODE certain_dn (PREFERRED); >>> Once defined, all the reads of replicated tables will go to this node >>> (here certain_dn) in priority when an SQL reaches the Coordinator where the >>> preferred node is defined. >>> This really improves the performance of DBT-1. >>> >>> Just by reading your email, I can say that there is no problem with >>> the DBT-1 setting. >>> Give the preferred node feature a try :) >>> >>> >>> On Tue, Jun 26, 2012 at 10:50 AM, Peng-Chong LIU <li...@gm...> wrote: >>>> Hi there, >>>> >>>> I would like to reproduce the DBT-1 performance test on an xc cluster, so that >>>> I can understand its mechanism and limitations better. However, I cannot >>>> get the expected results. >>>> >>>> I used the benchmark utility from the xc git repository. Single-node xc cluster >>>> reached ca. 
70% tps of PostgreSQL, which is reasonable. However, >>>> performance of 2-node and 3-node clusters dropped to only ca. 60% of >>>> PostgreSQL. >>>> >>>> With the kind help of Mr. Suzuki of the xc project team, I adjusted some >>>> cluster configuration. However, there was little improvement in the >>>> benchmark results. >>>> >>>> Do you have an internal dbt-1 test procedure or any clue to this >>>> problem (xc optimization/dbt-1 test parameters)? >>>> >>>> Thanks and regards, >>>> Liu >>>> >>>> Test Results: >>>> Pure PostgreSQL: node1 846 tps, node2 837 tps, node3 921 tps >>>> Single node xc: node3 666 tps >>>> 2-node xc: 542 tps >>>> 3-node xc (1 coordinator): 554 tps >>>> 3-node xc (3 coordinators): 426 tps >>>> >>>> Test Procedure: >>>> >>>> # download source >>>> git clone git:// >>>> postgres-xc.git.sourceforge.net/gitroot/postgres-xc/dbt1 >>>> >>>> # build >>>> cd dbt1 >>>> make clean >>>> autoconf >>>> autoheader >>>> ./configure --with-postgresql=/opt/pgxc >>>> make >>>> make install >>>> >>>> # generate test data >>>> mkdir ~/test_data >>>> ./datagen/datagen -i 10000 -u 100 -p ~/test_data -T i >>>> ./datagen/datagen -i 10000 -u 100 -p ~/test_data -T c >>>> ./datagen/datagen -i 10000 -u 100 -p ~/test_data -T a >>>> >>>> # create database >>>> psql postgres -c "create database dbt1;" >>>> psql dbt1 -f "./scripts/pgsql/create_tables.sql" >>>> psql dbt1 -f "./scripts/pgsql/create_indexes.sql" >>>> psql dbt1 -f "./scripts/pgsql/create_sequence.sql" >>> >>>> # load test data >>>> psql dbt1 -c "COPY address FROM '/tmp/address.data' DELIMITER '>';" >>>> psql dbt1 -c "COPY author FROM '/tmp/author.data' DELIMITER '>';" >>>> psql dbt1 -c "COPY cc_xacts FROM '/tmp/cc_xacts.data' DELIMITER '>';" >>>> psql dbt1 -c "COPY country FROM '/tmp/country.data' DELIMITER '>';" >>>> psql dbt1 -c "COPY customer FROM '/tmp/customer.data' DELIMITER '>';" >>>> psql dbt1 -c "COPY item FROM '/tmp/item.data' DELIMITER '>';" >>>> psql dbt1 -c "COPY order_line FROM 
'/tmp/order_line.data' DELIMITER >>>> '>';" >>>> psql dbt1 -c "COPY orders FROM '/tmp/orders.data' DELIMITER '>';" >>>> psql dbt1 -c "COPY stock FROM '/tmp/stock.data' DELIMITER '>';" >>>> >>> >>>> # setup test program >>>> #cp appServerList.data.sample appServerList.data >>>> echo "127.0.0.1;5432;9992" > appServerList.data >>>> >>>> cp pgxc_stats_param.data.sample pgxc_stats_param.data >>>> sed -i "s/28800/288000/" pgxc_stats_param.data # customers >>>> sed -i "s/4000/300/" pgxc_stats_param.data # duration >>>> sed -i "s/7.2/0.1/" pgxc_stats_param.data # think time >>>> sed -i "s/500/100/" pgxc_stats_param.data # eu & eu/min >>>> >>>> # perform test >>>> export PGUSER=pgxc # export PGUSER=postgres >>>> export SID1=dbt1 >>>> chmod 755 pgxc_test_launcher.sh >>>> rm -f *.log >>>> ./pgxc_test_launcher.sh & >>>> >>>> # see results >>>> rm -f ~/BT ~/ips.csv >>>> ./tools/results --mixfile mix.log --outputdir ~/ >>>> cat ~/BT >>>> >>> >>> >>> -- >>> Michael Paquier >>> http://michael.otacoo.com >>> >> >> > > > -- > Michael Paquier > http://michael.otacoo.com > |