From: Michael P. <mic...@gm...> - 2011-11-24 06:47:54
|
Hi all, Please find attached a patch adding a new system function called pgxc_pool_check, in charge of checking the consistency of the connection data cached in the pooler against the node catalogs. The function returns a boolean indicating whether the cached data is consistent. If it is not, the user may need to restart the Coordinator. I think this is a necessary prerequisite before working on a function to change the pooler cache dynamically. Here are some examples of how to use this function:

postgres=# \dfS+ pgxc_pool_check
                                                                 List of functions
   Schema   |      Name       | Result data type | Argument data types |  Type  | Volatility |  Owner  | Language |   Source code   |                    Description
------------+-----------------+------------------+---------------------+--------+------------+---------+----------+-----------------+-----------------------------------------------------
 pg_catalog | pgxc_pool_check | boolean          |                     | normal | volatile   | michael | internal | pgxc_pool_check | check connection information consistency in pooler
(1 row)

postgres=# select oid,* from pgxc_node;
  oid  | node_name | node_type | node_related | node_port | node_host | nodeis_primary | nodeis_preferred
-------+-----------+-----------+--------------+-----------+-----------+----------------+------------------
 11133 | coord1    | C         |            0 |      5432 | localhost | f              | f
 11134 | coord2    | C         |            0 |      5452 | localhost | f              | f
 11135 | dn1       | D         |            0 |     15451 | localhost | f              | t
 11136 | dn2       | D         |            0 |     15452 | localhost | t              | f
(4 rows)

postgres=# select pgxc_pool_check();
 pgxc_pool_check
-----------------
 t
(1 row)

postgres=# create node dn3 with (node master);
CREATE NODE
postgres=# select pgxc_pool_check();
 pgxc_pool_check
-----------------
 f
(1 row)

postgres=# drop node dn3;
DROP NODE
postgres=# select pgxc_pool_check();
 pgxc_pool_check
-----------------
 t
(1 row)

postgres=# alter node dn2 set nodeport=10000;
ALTER NODE
postgres=# select pgxc_pool_check();
 pgxc_pool_check
-----------------
 f
(1 row)

postgres=# alter node dn2 set nodeport=15452;
ALTER NODE
postgres=# select pgxc_pool_check();
 pgxc_pool_check
-----------------
 t
(1 row)

Regards,
--
Michael Paquier
http://michael.otacoo.com |
From: Michael P. <mic...@gm...> - 2011-11-21 06:48:15
|
Hi all, I have some ideas for the management of the connection data cached in the pooler. Since the node DDL commit, the information cached in the pooler is created when the first session on the server begins and cannot be updated afterwards except by restarting the Coordinator. My idea is to implement the 2 following functions:

- pgxc_pool_cache_check, which is in charge of checking whether the data cached in the pooler is consistent with the catalog pgxc_node. This function returns a boolean: false if the cached data is not consistent, true if it is. The question here is that, for the time being, the pooler has no resource owner that would allow it to look at the system catalogs. In order to simplify the whole interface between the pooler and the postgres children, do you think it is OK to create a resource owner for the pooler process when it starts? Of course this resource would be released when the pooler process shuts down.

- pgxc_pool_cache_reload, which is in charge of reloading the data cached in the pooler. The main issue here is not the reload in itself, it is the fact that it is necessary to drop all the connections currently held by the other sessions as well as abort all the existing transactions on the server. I found that there is a mechanism allowing a child to signal the postmaster with SendPostmasterSignal. So here is the scenario I am thinking about when the pooler cache is reloaded:
1) Take a lock in the pooler so that no new connections are handed out.
2) The session that received pgxc_pool_cache_reload cleans up all the other sessions with a CLEAN CONNECTION FORCE. With that, all the transactions being held on the server are aborted and the pooler connections are completely cleaned up.
3) The pooler cache is updated.
4) The session that received pgxc_pool_cache_reload then signals all the other sessions with SendProcSignal, using a new ProcSignalReason type, called let's say PGXC_SESSION_RELOAD_HANDLES.
5) All the sessions receive the reload signal and then update their cache with InitMultinodeExecutor.

An important point: pgxc_pool_cache_reload only gets executed on the local Coordinator. Any comments are welcome.

Regards,
--
Michael Paquier
http://michael.otacoo.com |
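A minimal usage sketch of the two functions proposed above, as they might be called from psql once implemented. Nothing here exists yet: the names pgxc_pool_cache_check and pgxc_pool_cache_reload and their behavior are taken from the proposal and may change.

-- proposed: check whether the pooler cache still matches catalog pgxc_node
select pgxc_pool_cache_check();
-- proposed: rebuild the pooler cache and the session node handles on the local
-- Coordinator, following steps 1) to 5) described above
select pgxc_pool_cache_reload();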
From: Michael P. <mic...@gm...> - 2011-11-17 05:17:36
|
Hi all, While testing more complex types of joins with node subsets, I found a test case that was not working. It concerns replicated tables whose node subsets only partially overlap: queries are not pushed down to the correct node. You need at least 3 datanodes to reproduce the problem.

create table rep12 (a int) distribute by replication to node dn1,dn2;
create table rep13 (a int) distribute by replication to node dn1,dn3;
create table rep23 (a int) distribute by replication to node dn2,dn3;
insert into rep12 values (1),(2),(3);
insert into rep13 values (1),(2),(3);
insert into rep23 values (1),(2),(3);
explain verbose select a from rep23 join rep12 using (a);
explain verbose select a from rep13 join rep12 using (a);
drop table rep12,rep13,rep23;

The attached patch fixes the issue when the 2 replicated tables used in the join have overlapping node subsets. Comments are welcome.

Regards,
--
Michael Paquier
http://michael.otacoo.com |
From: Michael P. <mic...@gm...> - 2011-11-14 06:50:28
|
In examine_conditions_walker of planner.c, there is a part dealing with subqueries:

if (IsA(expr_node, SubLink))

This part creates an execution node list for the subquery and goes into get_plan_nodes_walker to get the node list, and at this point is_subcluster_mapping is checked too. So subqueries are indeed taken into account. For example, for those tables:

create table tb_rep (a int) distribute by replication;
create table tb_rep1 (a int) distribute by replication to node dn1;
create table tb_rep2 (a int) distribute by replication to node dn2;
create table tb_hash (a int) distribute by hash(a);
create table tb_hash1 (a int) distribute by hash(a) to node dn1;
create table tb_hash2 (a int) distribute by hash(a) to node dn2;
create table tb_rr (a int) distribute by round robin;
create table tb_rr1 (a int) distribute by round robin to node dn1;
create table tb_rr2 (a int) distribute by round robin to node dn2;

I do the following joins:

explain verbose select * from tb_rr join tb_rep using (a) where a IN (select * from tb_hash join tb_rep using (a));
explain verbose select * from tb_rr1 join tb_rep1 using (a) where a IN (select * from tb_hash2 join tb_rep2 using (a));

However, there is still a limitation regarding subquery push down. It is not related to my patch, as I can also reproduce it on master, but I noticed that the subquery join on tb_hash and tb_rep could be pushed down to the nodes but it is not the case.

On Mon, Nov 14, 2011 at 3:30 PM, Ashutosh Bapat <ash...@en...> wrote:
> Hi Michael,
> Sorry for stretching this far. But, now it seems we do not consider
> subqueries at all and that can lead to wrong results. Instead we should
> return false for subqueries, thus standard_planner() can take care of those.
>
> On Mon, Nov 14, 2011 at 5:34 AM, Michael Paquier <mic...@gm...> wrote:
>> OK, here is the latest patch ready for commit.
>> Any comments/questions before committing that?
>>
>> --
>> Michael Paquier
>> http://michael.otacoo.com
>
> --
> Best Wishes,
> Ashutosh Bapat
> EntepriseDB Corporation
> The Enterprise Postgres Company

--
Michael Paquier
http://michael.otacoo.com |
From: Ashutosh B. <ash...@en...> - 2011-11-14 06:30:56
|
Hi Michael, Sorry for stretching this far. But, now it seems we do not consider subqueries at all and that can lead to wrong results. Instead we should return false for subqueries, thus standard_planner() can take care of those. On Mon, Nov 14, 2011 at 5:34 AM, Michael Paquier <mic...@gm...>wrote: > OK, here is the latest patch ready for commit. > Any comments/questions before committing that? > > -- > Michael Paquier > http://michael.otacoo.com > -- Best Wishes, Ashutosh Bapat EntepriseDB Corporation The Enterprise Postgres Company |
From: Michael P. <mic...@gm...> - 2011-11-14 00:04:59
|
OK, here is the latest patch ready for commit. Any comments/questions before committing that? -- Michael Paquier http://michael.otacoo.com |
From: Michael P. <mic...@gm...> - 2011-11-09 12:53:12
|
On Wed, Nov 9, 2011 at 5:46 PM, Ashutosh Bapat <ash...@en...> wrote:
> I have few comments about style (Somebody still needs it to be looked from
> CREATE TABLE angle)
>
Regarding create table, it uses the structure introduced in the create ddl commit, so I'm not sure there is anything that needs to be added.

> Instead of using "!= LOCATOR_TYPE_REPLICATED", we should rather check
> whether it's one of MODULO, HASH, CUSTOM etc. and throw error for unhandled
> cases. That will protect us from any crashes caused by adding new
> LOCATOR_TYPE, which is not REPLICATED or distributed.
>
OK, I will change the code accordingly; it is not a big matter, but it will make the if checks a bit heavier.

> + /* Look at the subqueries */
> + if (rte->rtekind == RTE_SUBQUERY &&
> + !is_subcluster_mapping(rte->subquery->rtable))
> + return false;
> This piece of code looks tricky. Where is the nodelist and locator_type
> for a subquery stored? The global_locator_type and inter_nodelist are
> variables local to function and will not be updated across the recursive
> calls to the function.

To be honest, I had a look at this part of the code and saw that get_plan_nodes_walker is also used for the subqueries themselves. So it is not necessary, and the subquery node list is evaluated automatically with its own rtable, separately.

--
Michael Paquier
http://michael.otacoo.com |
From: Ashutosh B. <ash...@en...> - 2011-11-09 08:46:24
|
I have a few comments about style (somebody still needs to look at it from the CREATE TABLE angle).

Instead of using "!= LOCATOR_TYPE_REPLICATED", we should rather check whether it's one of MODULO, HASH, CUSTOM etc. and throw an error for unhandled cases. That will protect us from any crashes caused by adding a new LOCATOR_TYPE which is neither REPLICATED nor distributed.

+ /* Look at the subqueries */
+ if (rte->rtekind == RTE_SUBQUERY &&
+ !is_subcluster_mapping(rte->subquery->rtable))
+ return false;

This piece of code looks tricky. Where are the nodelist and locator_type for a subquery stored? The global_locator_type and inter_nodelist variables are local to the function and will not be updated across the recursive calls to the function.

On Wed, Nov 9, 2011 at 12:01 PM, Michael Paquier <mic...@gm...> wrote:
> Please find attached the updated patch based on your comments.
> This time also is a simple O(n) and passes all the tests.
>
> --
> Michael Paquier
> http://michael.otacoo.com

--
Best Wishes,
Ashutosh Bapat
EntepriseDB Corporation
The Enterprise Postgres Company |
From: Michael P. <mic...@gm...> - 2011-11-09 06:32:04
|
Please find attached the updated patch based on your comments. This time also is a simple O(n) and passes all the tests. -- Michael Paquier http://michael.otacoo.com |
From: Ashutosh B. <ash...@en...> - 2011-11-08 07:59:46
|
On Tue, Nov 8, 2011 at 1:00 PM, Michael Paquier <mic...@gm...>wrote: > > > On Tue, Nov 8, 2011 at 4:01 PM, Ashutosh Bapat < > ash...@en...> wrote: > >> Hi Michael, >> I haven't reviewed the patch completely and somebody needs to take a look >> at it from table creation/data distribution perspective. I have one comment >> related to is_subcluster_mapping(), >> >> We probably don't need to compare each rtable entry with other (that's >> n^2). Instead, we can have algorithm (needs a bit refinement) written in >> the file attached. >> > OK I see. So you take the successive node list intersections and check up > to the point where the node mapping is not respected. > I may unfortunately not have that much time to work on that this week but > I understood the idea and will provide a modified patch at the beginning of > next week. > Based on you idea, only is_subcluster_mapping needs to be modified. > Yes. > Thanks. > -- > Michael Paquier > http://michael.otacoo.com > -- Best Wishes, Ashutosh Bapat EntepriseDB Corporation The Enterprise Postgres Company |
From: Michael P. <mic...@gm...> - 2011-11-08 07:31:03
|
On Tue, Nov 8, 2011 at 4:01 PM, Ashutosh Bapat <ash...@en...> wrote:
> Hi Michael,
> I haven't reviewed the patch completely and somebody needs to take a look
> at it from table creation/data distribution perspective. I have one comment
> related to is_subcluster_mapping(),
>
> We probably don't need to compare each rtable entry with other (that's
> n^2). Instead, we can have algorithm (needs a bit refinement) written in
> the file attached.
>
OK, I see. So you take the successive node list intersections and check up to the point where the node mapping is not respected. I may unfortunately not have that much time to work on it this week, but I understood the idea and will provide a modified patch at the beginning of next week. Based on your idea, only is_subcluster_mapping needs to be modified.

Thanks.
--
Michael Paquier
http://michael.otacoo.com |
From: Ashutosh B. <ash...@en...> - 2011-11-08 07:01:20
|
nodelist = NULL;
nodelist_type = NONE;
for every rtable do
{
    if (rtable is RELATION)
    {
        if (nodelist_type == NONE)
        {
            nodelist_type = distribution type of rtable;
            nodelist = nodelist of the rtable;
        }
        else
        {
            if (nodelist_type == REPLICATED and distribution type of rtable is REPLICATED)
            {
                nodelist = common_nodes(nodelist, nodelist of the rtable);
                if (nodelist == NULL)
                    bail out; run through standard planner;
                nodelist_type = REPLICATED;
            }
            else if ((distribution type of rtable is DISTRIBUTED and nodelist_type == REPLICATED) OR
                     (distribution type of rtable is REPLICATED and nodelist_type == DISTRIBUTED))
            {
                /*
                 * for replicated and distributed join we need the
                 * replicated table to be replicated on all nodes where the
                 * distributed table is distributed
                 */
                nodelist = common_nodes(nodelist, nodelist of the rtable);
                if (nodelist != nodelist of the rtable)
                    bail out; run through standard planner;
                nodelist_type = DISTRIBUTED;
            }
            else if (distribution type of rtable is DISTRIBUTED and nodelist_type == DISTRIBUTED)
            {
                nodelist = common_nodes(nodelist, nodelist of rtable);
                if (common_nodes != nodelist or common_nodes != nodelist of rtable)
                    bail out; run through standard planner;
                nodelist_type = DISTRIBUTED;
            }
            else
            {
                ERROR;
            }
        }
    }
    else if (relation is Subquery)
        find the nodelist and nodelist_type for the subquery using this algorithm
        recursively and apply the same logic for merging those with the previously
        calculated nodelist and nodelist_type;
} |
From: Michael P. <mic...@gm...> - 2011-11-08 02:53:44
|
Hi all, Please find attached a patch that finalizes the possibility to distribute data only among a subset of nodes in the cluster. Up to now, XC has been limited by the fact that table data is distributed or replicated among all the nodes. To define a table on only a subset of nodes, a new extension called TO NODE/TO GROUP has been added to CREATE TABLE. For example, with a cluster of 2Co/2Dn:

postgres=# select oid,* from pgxc_node;
  oid  | node_name | node_type | node_related | node_port | node_host | nodeis_primary | nodeis_preferred
-------+-----------+-----------+--------------+-----------+-----------+----------------+------------------
 11133 | coord1    | C         |            0 |      5432 | localhost | f              | f
 11134 | coord2    | C         |            0 |      5452 | localhost | f              | f
 11135 | dn1       | D         |            0 |     15451 | localhost | f              | f
 11136 | dn2       | D         |            0 |     15452 | localhost | f              | f
(4 rows)

postgres=# create table aa1 (a int) distribute by replication to node dn1;
CREATE TABLE
postgres=# create table bb (a int) to node dn1,dn2;
CREATE TABLE

You can also create node groups, used as aliases for node lists when creating tables:

postgres=# create node group group_all with dn1,dn2;
CREATE NODE GROUP
postgres=# create node group group_1 with dn1;
CREATE NODE GROUP
postgres=# create node group group_2 with dn2;
CREATE NODE GROUP
postgres=# select oid,* from pgxc_group;
  oid  | group_name | group_members
-------+------------+---------------
 21073 | group_all  | 11135 11136
 21074 | group_1    | 11135
 21075 | group_2    | 11136
(3 rows)

postgres=# create table aa_group2 (a int) to group group_2;
CREATE TABLE
postgres=# create table aa_groupall (a int) to group group_all;
CREATE TABLE

This is one of those fancy features that make managing a database cluster simpler. Also attached are the test cases I used to check this with a 2Co/2Dn cluster.

Regards,
--
Michael Paquier
http://michael.otacoo.com |
From: Koichi S. <koi...@gm...> - 2011-11-03 15:16:59
|
I decided to review the current DDL spec. I found at least a couple if issues which could be a problem with HA feature. I think we need to refactor DDL with HA and node reconfiguration, although I don't think it has major impact to the current implementation. Regards; ---------- Koichi Suzuki 2011/11/3 Mason <ma...@us...>: > On Wed, Nov 2, 2011 at 9:22 PM, Michael Paquier > <mic...@gm...> wrote: >> >>>> There are still 2 functionalities that are missing to achieve this: >>>> 1) the possibility to define node subsets when creating a table, which is >>>> blocked for the time being >>>> 2) the possibility to redistribute data among a certain number of nodes. >>>> >>> >>> OK, an example. You have a table on two nodes, named "jane" and "laura". >>> These are the only nodes in the cluster and when session starts they got >>> indexes 1 and 2 respectively in the local datanode list, the session will >>> send these indexes to the pooler to access the table data. Then you create a >>> node named "kate". Now a new session will create own local datanode list >>> where "jane" will get index 1, "laura" - 3 and "kate" - 2. To access the >>> same table session sends 1 and 3 to the pooler, pooler returns error or >>> even die, because it knows only two datanodes. >>> To avoid this, you should either synchronously update pooler and local >>> caches of all existing sessions, or use something more stable then index in >>> alphabetical list to address datanodes. >> >> Yes, and the first solution is envisaged. >> Creating a new node from a session is a point, but the important point is to >> update the local cache inside pooler on all the existing sessions and decide >> a policy about what to do when this cache is changed with session on pooler >> that have active connections to nodes as in pooler, postmaster childs can be >> easily identified in agent with the process ID that has been saved. >> So the first model envisaged was not to update the pooler cache at the >> moment of pgxc_node modification, but using a separate command that would be >> in charge of that for all the sessions. >> Updating pooler cache is not sufficient, you also need to update the node >> handle cache for each session and perhaps abort existing transactions in >> other sessions of each coordinator. >> Btw, those operations could be performed by commands like: >> - "POOLER REINITIALIZE FORCE", to force all the sessions to update their >> cache for node handles and abort existing transactions. >> - "POOLER REINITIALIZE", force all the sessions in local node to update >> their cache and do not abort existing transactions >> - "POOLER CHECK", this command check if data cached in pool is consistent >> with catalog pgxc_node >> The good point of such functionalities is that an external application could >> issue the necessary commands via a client like psql to facilitate the >> operations. > > I also think some other new command can be used to indicate that we > want to repoint to a the standby (managed externally) for some > failover scenario. Perhaps it is ALTER NODE ... MASTER, indicating > that the standby node is now the master, and automatically making > changing the old master with the same segment id to a slave. This > would imply having the pools get refreshed. 
> > >> -- >> Michael Paquier >> http://michael.otacoo.com >> > > ------------------------------------------------------------------------------ > RSA(R) Conference 2012 > Save $700 by Nov 18 > Register now > http://p.sf.net/sfu/rsa-sfdev2dev1 > _______________________________________________ > Postgres-xc-developers mailing list > Pos...@li... > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers > |
From: Mason <ma...@us...> - 2011-11-03 12:56:27
|
On Wed, Nov 2, 2011 at 9:22 PM, Michael Paquier <mic...@gm...> wrote: > >>> There are still 2 functionalities that are missing to achieve this: >>> 1) the possibility to define node subsets when creating a table, which is >>> blocked for the time being >>> 2) the possibility to redistribute data among a certain number of nodes. >>> >> >> OK, an example. You have a table on two nodes, named "jane" and "laura". >> These are the only nodes in the cluster and when session starts they got >> indexes 1 and 2 respectively in the local datanode list, the session will >> send these indexes to the pooler to access the table data. Then you create a >> node named "kate". Now a new session will create own local datanode list >> where "jane" will get index 1, "laura" - 3 and "kate" - 2. To access the >> same table session sends 1 and 3 to the pooler, pooler returns error or >> even die, because it knows only two datanodes. >> To avoid this, you should either synchronously update pooler and local >> caches of all existing sessions, or use something more stable then index in >> alphabetical list to address datanodes. > > Yes, and the first solution is envisaged. > Creating a new node from a session is a point, but the important point is to > update the local cache inside pooler on all the existing sessions and decide > a policy about what to do when this cache is changed with session on pooler > that have active connections to nodes as in pooler, postmaster childs can be > easily identified in agent with the process ID that has been saved. > So the first model envisaged was not to update the pooler cache at the > moment of pgxc_node modification, but using a separate command that would be > in charge of that for all the sessions. > Updating pooler cache is not sufficient, you also need to update the node > handle cache for each session and perhaps abort existing transactions in > other sessions of each coordinator. > Btw, those operations could be performed by commands like: > - "POOLER REINITIALIZE FORCE", to force all the sessions to update their > cache for node handles and abort existing transactions. > - "POOLER REINITIALIZE", force all the sessions in local node to update > their cache and do not abort existing transactions > - "POOLER CHECK", this command check if data cached in pool is consistent > with catalog pgxc_node > The good point of such functionalities is that an external application could > issue the necessary commands via a client like psql to facilitate the > operations. I also think some other new command can be used to indicate that we want to repoint to a the standby (managed externally) for some failover scenario. Perhaps it is ALTER NODE ... MASTER, indicating that the standby node is now the master, and automatically making changing the old master with the same segment id to a slave. This would imply having the pools get refreshed. > -- > Michael Paquier > http://michael.otacoo.com > |
From: Mason <ma...@us...> - 2011-11-03 12:48:34
|
On Thu, Nov 3, 2011 at 6:25 AM, Andrei Martsinchyk <and...@gm...> wrote: > > > 2011/11/3 Koichi Suzuki <koi...@gm...> >> >> I tried to respond to this thread about the node name, addition and >> removal lf the node name and maser/slave control and found new >> "segment id" is also an issue as proposed by Andrei. >> >> Could anybody give what "segment id" menas? I don't understand what >> is it for and why it is needed. >> > > OK, I will try to. > When DBA is planning how to partition a table to stored in distributed > environment he may not know details of the target distributed environment, > ispecially if these details may be changed over time. > So we may think the table distribution is a function taking table row as an > argument and returning a number from 1 to N where N is the number of parts > the table is split. > We call these parts "logical parts", or "segments". They are "logical" > because we now considering abstract distributed table. Term "logical part" > is better describing nature of thing, while "segment" is more brief, I > prefer to use "segment". Probably better term can be chosen, I am open for > suggestions. > To make things real, engine of a distributed database should map segments to > datanodes, means if distributed function is evaluated to some number on some > row, engine should pick up a datanode according to the mapping where it > should store the row or where the row can be found. The mapping should be > defined for each table in distributed database. The mapping should be hard, > means if engine have stored some segment on a datanode, it should always > refer the same datanode to read or update the row. The mapping should be > flexible at the same time - if there are multiple copies of some datanodes > engine should allow to replace a datanode with a copy, if it is failed or > going off for maintenance. > The current implementation in XC is neither hard nor flexible. Each datanode > is identified by name and OID, name is presented to user and is > user-assigned, OID is automatically assigned and used internally, each > uniquely identifies a datanode. > The mapping for a table is build as follows: > 1. Node list is read from the catalog and sorted by node name. There are > multiple copies of the list, different modules build own on startup and > never update. > 2. List of table segment is read from the catalog, in this list for each > segment specified OID of the datanode where segment is. > 3. The mapping is created as an array, where index is the result of > distribution function and value is the order number in the list of datanodes > built in 1. So for each datanode OID in the list of segments engine looks up > entry in the list of datanodes and stores index in the mapping table. The > mapping is never updates once it is built. > The XC engine passes over datanode references as indexes in the datanode > list, but different modules may have different lists and engine may finally > pick up wrong node for a segment. Thus the "hard mapping" principle is > violated. > A table segment is linked to datanode by OID, at the same time а physical > server associate itself with the same OID according to its configuration. So > flexibility is carried out by changing datanode configuration: the copy of > the datanote is pretending it is the datanode which it is replacing. That is > completely out of engine's control and there is no way to verify if > replacement is really the copy of former datanode. > Segment id would target all these issues. 
It should be stored in 3 places: > table definition should reference it instead of datanode oid; > it should be stored with the datanode definition in the catalog; > it should be stored as an element of physical server configuration, and once > assigned it should never be changed. > New segment id should be allocated when new master datanode is created in > the catalog. It should be different from already existing segment ids in > cluster. > When new slave datanode is created segment id is assigned from its master. > When physical server is wired up to cluster it should verify if segment id > in the configuration is matching to the segment id in the catalog. > If segment id in the configuration is empty it is assigned and written out. > When table is created the engine looks up the catalog for segment ids and > store them. The mapping array will contain segment ids as values and engine > will pass segment ids between modules to get datanode connections. > >> Present design assumes that "node name" is everything, which is given >> local oid in each node and such oid is listed in the order of node >> name in the catalog to define what node the table is >> distributed/replicated. The order of this oid's in the catalog is >> static even when a node is added. If we change the set of nodes for >> specific table, we have to re-distribute each tuple based on the new >> distribution definition, which will be one of the main challenge of >> the next FY. Asutosh is working on this. >> >> It is helpful if anybody summarizes what will be an issue. >> I think Andrei described it pretty well. I can try and summarize and reiterate more simply. - My original main concern was about the danger of node names leading to a misconfigured cluster and corrupt data. - Segment id separates out segments as a logical layer separate from physical node names. We may not need to be so concerned with changing node names in HA scenarios. - We can add additional checks to make sure when connecting that the segment id is correct, avoiding potential misconfigurations and corruption. - Standbys have the same segment id as their master, even while their cluster-wide node names may differ, and that these names don't need to change, leading to less confusion with failover and management. - It gives us flexibility and makes cluster management easier, including dynamically adding nodes. In my previous email I described how we can dynamically add data nodes without needing to name it locally and remove the requirement of running the special SQL file at initdb time. Right now it is kind of unwieldy. By "dynamically adding data nodes" I just mean make the cluster aware about them, not repartition anything. I think Michael mentioned more work needing to be done for subsets of nodes, but the data node would be available for use at createdb time at least. Thanks, Mason |
From: Andrei M. <and...@gm...> - 2011-11-03 10:25:25
|
2011/11/3 Koichi Suzuki <koi...@gm...> > I tried to respond to this thread about the node name, addition and > removal lf the node name and maser/slave control and found new > "segment id" is also an issue as proposed by Andrei. > > Could anybody give what "segment id" menas? I don't understand what > is it for and why it is needed. > > OK, I will try to. When DBA is planning how to partition a table to stored in distributed environment he may not know details of the target distributed environment, ispecially if these details may be changed over time. So we may think the table distribution is a function taking table row as an argument and returning a number from 1 to N where N is the number of parts the table is split. We call these parts "logical parts", or "segments". They are "logical" because we now considering abstract distributed table. Term "logical part" is better describing nature of thing, while "segment" is more brief, I prefer to use "segment". Probably better term can be chosen, I am open for suggestions. To make things real, engine of a distributed database should map segments to datanodes, means if distributed function is evaluated to some number on some row, engine should pick up a datanode according to the mapping where it should store the row or where the row can be found. The mapping should be defined for each table in distributed database. The mapping should be hard, means if engine have stored some segment on a datanode, it should always refer the same datanode to read or update the row. The mapping should be flexible at the same time - if there are multiple copies of some datanodes engine should allow to replace a datanode with a copy, if it is failed or going off for maintenance. The current implementation in XC is neither hard nor flexible. Each datanode is identified by name and OID, name is presented to user and is user-assigned, OID is automatically assigned and used internally, each uniquely identifies a datanode. The mapping for a table is build as follows: 1. Node list is read from the catalog and sorted by node name. There are multiple copies of the list, different modules build own on startup and never update. 2. List of table segment is read from the catalog, in this list for each segment specified OID of the datanode where segment is. 3. The mapping is created as an array, where index is the result of distribution function and value is the order number in the list of datanodes built in 1. So for each datanode OID in the list of segments engine looks up entry in the list of datanodes and stores index in the mapping table. The mapping is never updates once it is built. The XC engine passes over datanode references as indexes in the datanode list, but different modules may have different lists and engine may finally pick up wrong node for a segment. Thus the "hard mapping" principle is violated. A table segment is linked to datanode by OID, at the same time а physical server associate itself with the same OID according to its configuration. So flexibility is carried out by changing datanode configuration: the copy of the datanote is pretending it is the datanode which it is replacing. That is completely out of engine's control and there is no way to verify if replacement is really the copy of former datanode. Segment id would target all these issues. 
It should be stored in 3 places: table definition should reference it instead of datanode oid; it should be stored with the datanode definition in the catalog; it should be stored as an element of physical server configuration, and once assigned it should never be changed. New segment id should be allocated when new master datanode is created in the catalog. It should be different from already existing segment ids in cluster. When new slave datanode is created segment id is assigned from its master. When physical server is wired up to cluster it should verify if segment id in the configuration is matching to the segment id in the catalog. If segment id in the configuration is empty it is assigned and written out. When table is created the engine looks up the catalog for segment ids and store them. The mapping array will contain segment ids as values and engine will pass segment ids between modules to get datanode connections. Present design assumes that "node name" is everything, which is given > local oid in each node and such oid is listed in the order of node > name in the catalog to define what node the table is > distributed/replicated. The order of this oid's in the catalog is > static even when a node is added. If we change the set of nodes for > specific table, we have to re-distribute each tuple based on the new > distribution definition, which will be one of the main challenge of > the next FY. Asutosh is working on this. > > It is helpful if anybody summarizes what will be an issue. > > Sorry so many emails were posted while I'm traveling form Tokyo to Sao > Paulo. > > Regards; > ---------- > Koichi Suzuki > > > > 2011/11/2 Mason <ma...@us...>: > > On Wed, Nov 2, 2011 at 3:46 AM, Andrei Martsinchyk > > <and...@gm...> wrote: > >> > >> > >> 2011/11/2 Michael Paquier <mic...@gm...> > >>> > >>> > >>> On Wed, Nov 2, 2011 at 12:47 AM, Andrei Martsinchyk > >>> <and...@gm...> wrote: > >>>> > >>>> I have recently taken a look at the code and want to feedback on this. > >>>> I like the idea, it should simplify initial cluster setup and it is a > >>>> promising base for HA features. However I see some issues, pitfalls > and want > >>>> to point them out, and suggest solutions. > >>>> Currently table partitions are associated with nodes by node oids. > This > >>>> could be a problem if you want to failover. Assume you have working > cluster, > >>>> and each master datanode in the cluster has a slave. Eventually one > data > >>>> node went down and you want to fail it over by promoting a slave. If > you run > >>>> ALTER NODE command and change type of respective DATANODE_SLAVE to > >>>> DATANODE_MASTER the node won't be wired up correctly. Server will be > looking > >>>> for the node by oid and won't find oid specified in pgxc_class in the > list > >>>> of master datanodes. To correctly failover, you would have to > use ALTER > >>>> NODE to change connection info of failed node to point to the > promoted slave > >>>> and change node name of the slave so it associate itself with the > master > >>>> datanode in the catalog. You mentioned it is planned that node names > remain > >>>> constant and consistent in the cluster, so that is not desired > approach. > >>> > >>> About this point, I mentionned that node names remain constant in the > >>> cluster. But I meant that for the master nodes. When it may be > necessary to > >>> fail over a slave in case of a master failure, the node name of the > slave > >>> should be changed to the one of the master. 
> >>> I agree this is not the most flexible approach though. > >>> > >>>> > >>>> Potential vulnerability of renaming nodes is that you can mess up > nodes, > >>>> and run node which already contains data of some logical partition as > a node > >>>> associated with different partition. That may cause data corruption, > and > >>>> even if it does not you very likely will have to dump/restore entire > >>>> database to bring it in order. > >>> > >>> This is not much desired :) > >>>> > >>>> Even without node renaming, it is not reliable to address logical > >>>> partitions by index, which is the number of the node name > in alphabetical > >>>> order. If you add data node to the cluster with name between names of > >>>> existing nodes you must synchronously update node lists in pooler and > in all > >>>> sessions, otherwise you risk to corrupt the data. > >>> > >>> Just about that, there is an additional column in pgxc_class which is > the > >>> list of nodes on which the table data is present. > >>> Even if you add a new node with CREATE NODE, the tables present until > now > >>> won't use the new node as it is not listed in their list. > >>> There are still 2 functionalities that are missing to achieve this: > >>> 1) the possibility to define node subsets when creating a table, which > is > >>> blocked for the time being > >>> 2) the possibility to redistribute data among a certain number of > nodes. > >>> > >> > >> OK, an example. You have a table on two nodes, named "jane" and "laura". > >> These are the only nodes in the cluster and when session starts they got > >> indexes 1 and 2 respectively in the local datanode list, the session > will > >> send these indexes to the pooler to access the table data. Then you > create a > >> node named "kate". Now a new session will create own local datanode list > >> where "jane" will get index 1, "laura" - 3 and "kate" - 2. To access the > >> same table session sends 1 and 3 to the pooler, pooler returns error or > >> even die, because it knows only two datanodes. > >> To avoid this, you should either synchronously update pooler and local > >> caches of all existing sessions, or use something more stable then > index in > >> alphabetical list to address datanodes. > >> That's why I am proposing segment id - an identifier which is assigned > to > >> logical partition once it is created and remains unchanged over cluster > >> lifetime. > >> > >>>> > >>>> All mentioned problems could be targeted if you introduce new concept > >>>> let's name it "segment". Segment is id of logical partition, it is > similar > >>>> to former nodeid and current nodeindex, but it can not be changed once > >>>> assigned to a datanode. Assigned segment id can be stored in the > pgxc_node > >>>> instead of "related" field: each segment may have at most one master > node, > >>>> and all slaves of the segment are related to the segment master. > Also, if > >>>> node is first time started under a name associated with some segment, > the > >>>> segment id can be stored in the data directory, probably in a > separate file, > >>>> like catalog version and validated on startup. It would make it > impossible > >>>> to mess up partitions. Tables may refer segment ids instead of node > oids. > >>> > >>> Is this logical ID some kind of random-generated integer ID created at > the > >>> same time as a new node master? > >> > >> It is not necessary random. You can assign sequential numbers. 
They > should > >> be unique in cluster, and if a data node worked in cluster under > >> one sequential number, it can not work under different one. > >> > >>> > >>> The problem here is to find a viable solution to target unique > datanodes > >>> when choosing a node for a replicated table for example. The > uniqueness of > >>> master node names and ordering them by node names insures that datanode > >>> target order is safe even there are new nodes added thanks to the node > >>> subset feature. > >>> By using a logical ID, is there any way to insure that a partition ID > is > >>> associated to a constant integer between 1 and n for n datanodes in > cluster? > >>> > >>> Regarding the related field approach... > >>> In Postgres 9.1, syncrep uses only one layer of slaves. This makes the > >>> slaves having only one degree of dependency with a master it is > related. > >>> However 9.2 introduces cascade replication, meaning that you could set > up > >>> a cluster with slaves of slaves. > >>> So if you have a master under one partition ID and slaves in this > >>> partition, which are themselves masters of other slaves, is there some > way > >>> to set up a partition of a partition? > >> > >> OK, you can keep the "related" field and create new one to store > sequence > >> id. > >> > >>> > > > > > > Adding some more thoughts-- > > > > For adding a new data node, one could connect to a coordinator and > > execute CREATE NODE (for a data node master). It checks if it can > > successfully communicate with the node, verify that it has not yet > > been assigned a segment (only initdb was run), sends down a modified > > CREATE NODE command that includes the segment id (which can be > > assigned sequentially) in a new clause, and its cluster-wide > > recognized name. > > > > At this point all Coordinators agree that the data node has a certain > > segment id, and what its name is. Such assignments can take place when > > setting up the cluster, perhaps by connecting to template1. > > > > When CREATE DATABSE is run, a subset of data nodes can optionally be > > specified or created on all data nodes by default. > > > > We could also consider allowing CREATE TABLE to refer to the node > > names we have given, as is the case now, or even the segment ids, > > which should be queryable. This would be somewhat similar to the > > nodeids. > > > > We could also require that when a coordinator connects to a data node, > > it is expecting a certain segment id (it could be an additional > > connection time parameter). If the segment id is different than > > expected, we could have the connection fail. This should importantly > > help prevent accidental data corruption which may be more likely when > > scrambling in the heat of a failure or a late night maintenance > > window. > > > > Standby data nodes are really clones of a master data node, so each > > would know its particular segment. > > > > Souce code should refer to the segment ids and then we have > > connections be obtained from the appropriate pool based on the > > segment. > > > > Thanks, > > > > Mason > > > > > ------------------------------------------------------------------------------ > > RSA® Conference 2012 > > Save $700 by Nov 18 > > Register now! > > http://p.sf.net/sfu/rsa-sfdev2dev1 > > _______________________________________________ > > Postgres-xc-developers mailing list > > Pos...@li... 
> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers > > > -- Best regards, Andrei Martsinchyk mailto:and...@gm... |
From: Ashutosh B. <ash...@en...> - 2011-11-03 10:12:27
|
On Wed, Oct 26, 2011 at 11:25 AM, Michael Paquier <mic...@gm... > wrote: > I saw that the regression test xc_having is failing with this patch, > because of false explain reports. > This is not a big matter and please see updated patch attached. > However, the keyword "__FOREIGN_QUERY__" is printed from time to time in > tests. Everybody, especially Ashutosh, do you think it's OK like this? __FOREIGN_QUERY__ is dummy alias for RTEs corresponding to the RemoteQuery nodes generated by reducing JOINs, GROUP BY plans etc. So this entry is OK. > > -- > Michael Paquier > http://michael.otacoo.com > > > ------------------------------------------------------------------------------ > The demand for IT networking professionals continues to grow, and the > demand for specialized networking skills is growing even more rapidly. > Take a complimentary Learning@Cisco Self-Assessment and learn > about Cisco certifications, training, and career opportunities. > http://p.sf.net/sfu/cisco-dev2dev > _______________________________________________ > Postgres-xc-developers mailing list > Pos...@li... > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers > > -- Best Wishes, Ashutosh Bapat EntepriseDB Corporation The Enterprise Postgres Company |
From: Koichi S. <koi...@gm...> - 2011-11-03 03:52:27
|
I tried to respond to this thread about the node name, addition and removal lf the node name and maser/slave control and found new "segment id" is also an issue as proposed by Andrei. Could anybody give what "segment id" menas? I don't understand what is it for and why it is needed. Present design assumes that "node name" is everything, which is given local oid in each node and such oid is listed in the order of node name in the catalog to define what node the table is distributed/replicated. The order of this oid's in the catalog is static even when a node is added. If we change the set of nodes for specific table, we have to re-distribute each tuple based on the new distribution definition, which will be one of the main challenge of the next FY. Asutosh is working on this. It is helpful if anybody summarizes what will be an issue. Sorry so many emails were posted while I'm traveling form Tokyo to Sao Paulo. Regards; ---------- Koichi Suzuki 2011/11/2 Mason <ma...@us...>: > On Wed, Nov 2, 2011 at 3:46 AM, Andrei Martsinchyk > <and...@gm...> wrote: >> >> >> 2011/11/2 Michael Paquier <mic...@gm...> >>> >>> >>> On Wed, Nov 2, 2011 at 12:47 AM, Andrei Martsinchyk >>> <and...@gm...> wrote: >>>> >>>> I have recently taken a look at the code and want to feedback on this. >>>> I like the idea, it should simplify initial cluster setup and it is a >>>> promising base for HA features. However I see some issues, pitfalls and want >>>> to point them out, and suggest solutions. >>>> Currently table partitions are associated with nodes by node oids. This >>>> could be a problem if you want to failover. Assume you have working cluster, >>>> and each master datanode in the cluster has a slave. Eventually one data >>>> node went down and you want to fail it over by promoting a slave. If you run >>>> ALTER NODE command and change type of respective DATANODE_SLAVE to >>>> DATANODE_MASTER the node won't be wired up correctly. Server will be looking >>>> for the node by oid and won't find oid specified in pgxc_class in the list >>>> of master datanodes. To correctly failover, you would have to use ALTER >>>> NODE to change connection info of failed node to point to the promoted slave >>>> and change node name of the slave so it associate itself with the master >>>> datanode in the catalog. You mentioned it is planned that node names remain >>>> constant and consistent in the cluster, so that is not desired approach. >>> >>> About this point, I mentionned that node names remain constant in the >>> cluster. But I meant that for the master nodes. When it may be necessary to >>> fail over a slave in case of a master failure, the node name of the slave >>> should be changed to the one of the master. >>> I agree this is not the most flexible approach though. >>> >>>> >>>> Potential vulnerability of renaming nodes is that you can mess up nodes, >>>> and run node which already contains data of some logical partition as a node >>>> associated with different partition. That may cause data corruption, and >>>> even if it does not you very likely will have to dump/restore entire >>>> database to bring it in order. >>> >>> This is not much desired :) >>>> >>>> Even without node renaming, it is not reliable to address logical >>>> partitions by index, which is the number of the node name in alphabetical >>>> order. If you add data node to the cluster with name between names of >>>> existing nodes you must synchronously update node lists in pooler and in all >>>> sessions, otherwise you risk to corrupt the data. 
>>> >>> Just about that, there is an additional column in pgxc_class which is the >>> list of nodes on which the table data is present. >>> Even if you add a new node with CREATE NODE, the tables present until now >>> won't use the new node as it is not listed in their list. >>> There are still 2 functionalities that are missing to achieve this: >>> 1) the possibility to define node subsets when creating a table, which is >>> blocked for the time being >>> 2) the possibility to redistribute data among a certain number of nodes. >>> >> >> OK, an example. You have a table on two nodes, named "jane" and "laura". >> These are the only nodes in the cluster and when session starts they got >> indexes 1 and 2 respectively in the local datanode list, the session will >> send these indexes to the pooler to access the table data. Then you create a >> node named "kate". Now a new session will create own local datanode list >> where "jane" will get index 1, "laura" - 3 and "kate" - 2. To access the >> same table session sends 1 and 3 to the pooler, pooler returns error or >> even die, because it knows only two datanodes. >> To avoid this, you should either synchronously update pooler and local >> caches of all existing sessions, or use something more stable then index in >> alphabetical list to address datanodes. >> That's why I am proposing segment id - an identifier which is assigned to >> logical partition once it is created and remains unchanged over cluster >> lifetime. >> >>>> >>>> All mentioned problems could be targeted if you introduce new concept >>>> let's name it "segment". Segment is id of logical partition, it is similar >>>> to former nodeid and current nodeindex, but it can not be changed once >>>> assigned to a datanode. Assigned segment id can be stored in the pgxc_node >>>> instead of "related" field: each segment may have at most one master node, >>>> and all slaves of the segment are related to the segment master. Also, if >>>> node is first time started under a name associated with some segment, the >>>> segment id can be stored in the data directory, probably in a separate file, >>>> like catalog version and validated on startup. It would make it impossible >>>> to mess up partitions. Tables may refer segment ids instead of node oids. >>> >>> Is this logical ID some kind of random-generated integer ID created at the >>> same time as a new node master? >> >> It is not necessary random. You can assign sequential numbers. They should >> be unique in cluster, and if a data node worked in cluster under >> one sequential number, it can not work under different one. >> >>> >>> The problem here is to find a viable solution to target unique datanodes >>> when choosing a node for a replicated table for example. The uniqueness of >>> master node names and ordering them by node names insures that datanode >>> target order is safe even there are new nodes added thanks to the node >>> subset feature. >>> By using a logical ID, is there any way to insure that a partition ID is >>> associated to a constant integer between 1 and n for n datanodes in cluster? >>> >>> Regarding the related field approach... >>> In Postgres 9.1, syncrep uses only one layer of slaves. This makes the >>> slaves having only one degree of dependency with a master it is related. >>> However 9.2 introduces cascade replication, meaning that you could set up >>> a cluster with slaves of slaves. 
>>> So if you have a master under one partition ID and slaves in this >>> partition, which are themselves masters of other slaves, is there some way >>> to set up a partition of a partition? >> >> OK, you can keep the "related" field and create new one to store sequence >> id. >> >>> > > > Adding some more thoughts-- > > For adding a new data node, one could connect to a coordinator and > execute CREATE NODE (for a data node master). It checks if it can > successfully communicate with the node, verify that it has not yet > been assigned a segment (only initdb was run), sends down a modified > CREATE NODE command that includes the segment id (which can be > assigned sequentially) in a new clause, and its cluster-wide > recognized name. > > At this point all Coordinators agree that the data node has a certain > segment id, and what its name is. Such assignments can take place when > setting up the cluster, perhaps by connecting to template1. > > When CREATE DATABSE is run, a subset of data nodes can optionally be > specified or created on all data nodes by default. > > We could also consider allowing CREATE TABLE to refer to the node > names we have given, as is the case now, or even the segment ids, > which should be queryable. This would be somewhat similar to the > nodeids. > > We could also require that when a coordinator connects to a data node, > it is expecting a certain segment id (it could be an additional > connection time parameter). If the segment id is different than > expected, we could have the connection fail. This should importantly > help prevent accidental data corruption which may be more likely when > scrambling in the heat of a failure or a late night maintenance > window. > > Standby data nodes are really clones of a master data node, so each > would know its particular segment. > > Souce code should refer to the segment ids and then we have > connections be obtained from the appropriate pool based on the > segment. > > Thanks, > > Mason > > ------------------------------------------------------------------------------ > RSA® Conference 2012 > Save $700 by Nov 18 > Register now! > http://p.sf.net/sfu/rsa-sfdev2dev1 > _______________________________________________ > Postgres-xc-developers mailing list > Pos...@li... > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers > |
From: Michael P. <mic...@gm...> - 2011-11-03 01:24:42
|
On Tue, Nov 1, 2011 at 11:38 PM, Mason <ma...@us...>wrote: > On Tue, Nov 1, 2011 at 1:25 AM, Michael Paquier > <mic...@gm...> wrote: > > Hi all, > > > > Please find attached a patch that adds a new option -C/--clusterfile > > allowing to specify a customized SQL file at initdb for node > initialization > > with NODE DDL. > > The new option works as follows: > > initdb -C ./custom_file.sql -D datadir > > initdb --clusterfile ./custom_file.sql -D datadir > > > > Any comments are welcome. > > I think we should slow down this line of development and re-think how > to do it without any SQL initialization file and find a better > solution. This is getting into the overall architecture of the system > and does not feel very clean to me. To be clear, I like the DDL > approach you advocate and the potential for flexibility it gives XC, > and that you put something together for it, but so far it seems to be > making things a bit unwieldy. > > I prefer that we think about how to do this more cleanly in the > architecture and then develop supporting commands around that, before > we get too wedded to the current approach. > > I will think about this some more, likely involving an abstraction > layer of (implicit) segments/partitions associated with nodes instead > of explicit named nodes directly. > Ideas are welcome, and designs, patches are even more. Regards, -- Michael Paquier http://michael.otacoo.com |
From: Michael P. <mic...@gm...> - 2011-11-03 01:22:39
|
> There are still 2 functionalities that are missing to achieve this: >> 1) the possibility to define node subsets when creating a table, which is >> blocked for the time being >> 2) the possibility to redistribute data among a certain number of nodes. >> >> > > OK, an example. You have a table on two nodes, named "jane" and "laura". > These are the only nodes in the cluster and when a session starts they get > indexes 1 and 2 respectively in the local datanode list, and the session will > send these indexes to the pooler to access the table data. Then you create > a node named "kate". Now a new session will create its own local datanode list > where "jane" will get index 1, "laura" - 3 and "kate" - 2. To access the > same table the session sends 1 and 3 to the pooler, and the pooler returns an error or > even dies, because it knows only two datanodes. > To avoid this, you should either synchronously update the pooler and local > caches of all existing sessions, or use something more stable than an index in an > alphabetical list to address datanodes. > Yes, and the first solution is the one envisaged. Creating a new node from a session is one point, but the important point is to update the local cache inside the pooler for all the existing sessions, and to decide a policy about what to do when this cache is changed while sessions still hold active connections to nodes; in the pooler, postmaster children can easily be identified in their agent with the process ID that has been saved. So the first model envisaged was not to update the pooler cache at the moment of the pgxc_node modification, but to use a separate command that would be in charge of that for all the sessions. Updating the pooler cache is not sufficient; you also need to update the node handle cache for each session and perhaps abort existing transactions in the other sessions of each coordinator. Btw, those operations could be performed by commands like: - "POOLER REINITIALIZE FORCE", to force all the sessions to update their cache of node handles and abort existing transactions. - "POOLER REINITIALIZE", to force all the sessions on the local node to update their cache without aborting existing transactions. - "POOLER CHECK", to check if the data cached in the pool is consistent with the catalog pgxc_node. The good point of such functionalities is that an external application could issue the necessary commands via a client like psql to facilitate the operations. -- Michael Paquier http://michael.otacoo.com |
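To make the proposed commands concrete, a session run after a node change could look like the following; none of the three commands exists yet, so the behavior in the comments is only what the proposal above implies:

POOLER CHECK;               -- is the pooler cache still consistent with pgxc_node?
POOLER REINITIALIZE;        -- refresh node handles in all local sessions, keep transactions
POOLER REINITIALIZE FORCE;  -- refresh handles and also abort existing transactions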
From: Mason <ma...@us...> - 2011-11-02 13:18:51
|
On Wed, Nov 2, 2011 at 3:46 AM, Andrei Martsinchyk <and...@gm...> wrote: > > > 2011/11/2 Michael Paquier <mic...@gm...> >> >> >> On Wed, Nov 2, 2011 at 12:47 AM, Andrei Martsinchyk >> <and...@gm...> wrote: >>> >>> I have recently taken a look at the code and want to feedback on this. >>> I like the idea, it should simplify initial cluster setup and it is a >>> promising base for HA features. However I see some issues, pitfalls and want >>> to point them out, and suggest solutions. >>> Currently table partitions are associated with nodes by node oids. This >>> could be a problem if you want to failover. Assume you have working cluster, >>> and each master datanode in the cluster has a slave. Eventually one data >>> node went down and you want to fail it over by promoting a slave. If you run >>> ALTER NODE command and change type of respective DATANODE_SLAVE to >>> DATANODE_MASTER the node won't be wired up correctly. Server will be looking >>> for the node by oid and won't find oid specified in pgxc_class in the list >>> of master datanodes. To correctly failover, you would have to use ALTER >>> NODE to change connection info of failed node to point to the promoted slave >>> and change node name of the slave so it associate itself with the master >>> datanode in the catalog. You mentioned it is planned that node names remain >>> constant and consistent in the cluster, so that is not desired approach. >> >> About this point, I mentionned that node names remain constant in the >> cluster. But I meant that for the master nodes. When it may be necessary to >> fail over a slave in case of a master failure, the node name of the slave >> should be changed to the one of the master. >> I agree this is not the most flexible approach though. >> >>> >>> Potential vulnerability of renaming nodes is that you can mess up nodes, >>> and run node which already contains data of some logical partition as a node >>> associated with different partition. That may cause data corruption, and >>> even if it does not you very likely will have to dump/restore entire >>> database to bring it in order. >> >> This is not much desired :) >>> >>> Even without node renaming, it is not reliable to address logical >>> partitions by index, which is the number of the node name in alphabetical >>> order. If you add data node to the cluster with name between names of >>> existing nodes you must synchronously update node lists in pooler and in all >>> sessions, otherwise you risk to corrupt the data. >> >> Just about that, there is an additional column in pgxc_class which is the >> list of nodes on which the table data is present. >> Even if you add a new node with CREATE NODE, the tables present until now >> won't use the new node as it is not listed in their list. >> There are still 2 functionalities that are missing to achieve this: >> 1) the possibility to define node subsets when creating a table, which is >> blocked for the time being >> 2) the possibility to redistribute data among a certain number of nodes. >> > > OK, an example. You have a table on two nodes, named "jane" and "laura". > These are the only nodes in the cluster and when session starts they got > indexes 1 and 2 respectively in the local datanode list, the session will > send these indexes to the pooler to access the table data. Then you create a > node named "kate". Now a new session will create own local datanode list > where "jane" will get index 1, "laura" - 3 and "kate" - 2. 
To access the > same table session sends 1 and 3 to the pooler, pooler returns error or > even die, because it knows only two datanodes. > To avoid this, you should either synchronously update pooler and local > caches of all existing sessions, or use something more stable then index in > alphabetical list to address datanodes. > That's why I am proposing segment id - an identifier which is assigned to > logical partition once it is created and remains unchanged over cluster > lifetime. > >>> >>> All mentioned problems could be targeted if you introduce new concept >>> let's name it "segment". Segment is id of logical partition, it is similar >>> to former nodeid and current nodeindex, but it can not be changed once >>> assigned to a datanode. Assigned segment id can be stored in the pgxc_node >>> instead of "related" field: each segment may have at most one master node, >>> and all slaves of the segment are related to the segment master. Also, if >>> node is first time started under a name associated with some segment, the >>> segment id can be stored in the data directory, probably in a separate file, >>> like catalog version and validated on startup. It would make it impossible >>> to mess up partitions. Tables may refer segment ids instead of node oids. >> >> Is this logical ID some kind of random-generated integer ID created at the >> same time as a new node master? > > It is not necessary random. You can assign sequential numbers. They should > be unique in cluster, and if a data node worked in cluster under > one sequential number, it can not work under different one. > >> >> The problem here is to find a viable solution to target unique datanodes >> when choosing a node for a replicated table for example. The uniqueness of >> master node names and ordering them by node names insures that datanode >> target order is safe even there are new nodes added thanks to the node >> subset feature. >> By using a logical ID, is there any way to insure that a partition ID is >> associated to a constant integer between 1 and n for n datanodes in cluster? >> >> Regarding the related field approach... >> In Postgres 9.1, syncrep uses only one layer of slaves. This makes the >> slaves having only one degree of dependency with a master it is related. >> However 9.2 introduces cascade replication, meaning that you could set up >> a cluster with slaves of slaves. >> So if you have a master under one partition ID and slaves in this >> partition, which are themselves masters of other slaves, is there some way >> to set up a partition of a partition? > > OK, you can keep the "related" field and create new one to store sequence > id. > >> Adding some more thoughts-- For adding a new data node, one could connect to a coordinator and execute CREATE NODE (for a data node master). It checks if it can successfully communicate with the node, verify that it has not yet been assigned a segment (only initdb was run), sends down a modified CREATE NODE command that includes the segment id (which can be assigned sequentially) in a new clause, and its cluster-wide recognized name. At this point all Coordinators agree that the data node has a certain segment id, and what its name is. Such assignments can take place when setting up the cluster, perhaps by connecting to template1. When CREATE DATABSE is run, a subset of data nodes can optionally be specified or created on all data nodes by default. 
We could also consider allowing CREATE TABLE to refer to the node names we have given, as is the case now, or even the segment ids, which should be queryable. This would be somewhat similar to the nodeids. We could also require that when a coordinator connects to a data node, it is expecting a certain segment id (it could be an additional connection time parameter). If the segment id is different than expected, we could have the connection fail. This should importantly help prevent accidental data corruption, which may be more likely when scrambling in the heat of a failure or a late night maintenance window. Standby data nodes are really clones of a master data node, so each would know its particular segment. Source code should refer to the segment ids, and then we have connections obtained from the appropriate pool based on the segment. Thanks, Mason |
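A minimal sketch of the DDL flow described above; the SEGMENT clause and the TO SEGMENT list are hypothetical syntax used only for illustration:

-- Run against one Coordinator, which checks that the node is freshly
-- initdb'ed, assigns the next segment id and propagates the command
-- to the other Coordinators.
CREATE NODE dn3 WITH (HOST = 'dnhost3', PORT = 15453) SEGMENT 3;
-- Tables could then target segment ids rather than node names.
CREATE TABLE t1 (id int, val text) DISTRIBUTE BY HASH (id) TO SEGMENT (1, 2, 3);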
From: Andrei M. <and...@gm...> - 2011-11-02 07:46:27
|
2011/11/2 Michael Paquier <mic...@gm...> > > > On Wed, Nov 2, 2011 at 12:47 AM, Andrei Martsinchyk < > and...@gm...> wrote: > >> I have recently taken a look at the code and want to feedback on this. >> I like the idea, it should simplify initial cluster setup and it is a >> promising base for HA features. However I see some issues, pitfalls and >> want to point them out, and suggest solutions. >> Currently table partitions are associated with nodes by node oids. This >> could be a problem if you want to failover. Assume you have working >> cluster, and each master datanode in the cluster has a slave. Eventually >> one data node went down and you want to fail it over by promoting a slave. >> If you run ALTER NODE command and change type of respective DATANODE_SLAVE >> to DATANODE_MASTER the node won't be wired up correctly. Server will be >> looking for the node by oid and won't find oid specified in pgxc_class in >> the list of master datanodes. To correctly failover, you would have to >> use ALTER NODE to change connection info of failed node to point to the >> promoted slave and change node name of the slave so it associate itself >> with the master datanode in the catalog. You mentioned it is planned >> that node names remain constant and consistent in the cluster, so that is >> not desired approach. >> > About this point, I mentionned that node names remain constant in the > cluster. But I meant that for the master nodes. When it may be necessary to > fail over a slave in case of a master failure, the node name of the slave > should be changed to the one of the master. > I agree this is not the most flexible approach though. > > >> Potential vulnerability of renaming nodes is that you can mess up nodes, >> and run node which already contains data of some logical partition as a >> node associated with different partition. That may cause data corruption, >> and even if it does not you very likely will have to dump/restore entire >> database to bring it in order. >> > This is not much desired :) > >> Even without node renaming, it is not reliable to address logical >> partitions by index, which is the number of the node name in alphabetical >> order. If you add data node to the cluster with name between names of >> existing nodes you must synchronously update node lists in pooler and in >> all sessions, otherwise you risk to corrupt the data. >> > Just about that, there is an additional column in pgxc_class which is the > list of nodes on which the table data is present. > Even if you add a new node with CREATE NODE, the tables present until now > won't use the new node as it is not listed in their list. > There are still 2 functionalities that are missing to achieve this: > 1) the possibility to define node subsets when creating a table, which is > blocked for the time being > 2) the possibility to redistribute data among a certain number of nodes. > > OK, an example. You have a table on two nodes, named "jane" and "laura". These are the only nodes in the cluster and when session starts they got indexes 1 and 2 respectively in the local datanode list, the session will send these indexes to the pooler to access the table data. Then you create a node named "kate". Now a new session will create own local datanode list where "jane" will get index 1, "laura" - 3 and "kate" - 2. To access the same table session sends 1 and 3 to the pooler, pooler returns error or even die, because it knows only two datanodes. 
To avoid this, you should either synchronously update the pooler and local caches of all existing sessions, or use something more stable than an index in an alphabetical list to address datanodes. That's why I am proposing the segment id - an identifier which is assigned to a logical partition once it is created and remains unchanged over the cluster lifetime. > All the problems mentioned could be addressed if you introduce a new concept, >> let's name it "segment". A segment is the id of a logical partition; it is similar >> to the former nodeid and the current nodeindex, but it cannot be changed once >> assigned to a datanode. The assigned segment id can be stored in pgxc_node >> instead of the "related" field: each segment may have at most one master node, >> and all slaves of the segment are related to the segment master. Also, when a >> node is first started under a name associated with some segment, the >> segment id can be stored in the data directory, probably in a separate >> file, like the catalog version, and validated on startup. It would make it >> impossible to mess up partitions. Tables may refer to segment ids instead of >> node oids. >> > Is this logical ID some kind of randomly-generated integer ID created at the > same time as a new node master? > It does not need to be random. You can assign sequential numbers. They should be unique in the cluster, and if a data node has worked in the cluster under one sequential number, it cannot work under a different one. > The problem here is to find a viable solution to target unique datanodes > when choosing a node for a replicated table, for example. The uniqueness of > master node names and ordering them by node names ensures that the datanode > target order is safe even when new nodes are added, thanks to the node > subset feature. > By using a logical ID, is there any way to ensure that a partition ID is > associated with a constant integer between 1 and n for n datanodes in the cluster? > > Regarding the related field approach... > In Postgres 9.1, syncrep uses only one layer of slaves. This means the > slaves have only one degree of dependency on the master they are related to. > However 9.2 introduces cascading replication, meaning that you could set up > a cluster with slaves of slaves. > So if you have a master under one partition ID and slaves in this > partition, which are themselves masters of other slaves, is there some way > to set up a partition of a partition? > OK, you can keep the "related" field and create a new one to store the sequence id. > Regards, > -- > Michael Paquier > http://michael.otacoo.com > -- Best regards, Andrei Martsinchyk mailto:and...@gm... |
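Written out as DDL, the example above reads as follows; the WITH options are illustrative, only the node names and the resulting alphabetical indexes matter:

-- Initial cluster: the sorted master list is (jane, laura) -> indexes (1, 2),
-- which is what existing sessions send to the pooler for the table.
CREATE NODE kate WITH (HOST = 'katehost', PORT = 15453);
-- New sessions now sort the list as (jane, kate, laura), so the same table
-- resolves to indexes (1, 3), while the pooler and the old sessions still
-- know only two datanodes: hence the error, or worse, described above.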
From: Michael P. <mic...@gm...> - 2011-11-01 23:27:28
|
On Wed, Nov 2, 2011 at 12:47 AM, Andrei Martsinchyk < and...@gm...> wrote: > I have recently taken a look at the code and want to feedback on this. > I like the idea, it should simplify initial cluster setup and it is a > promising base for HA features. However I see some issues, pitfalls and > want to point them out, and suggest solutions. > Currently table partitions are associated with nodes by node oids. This > could be a problem if you want to failover. Assume you have working > cluster, and each master datanode in the cluster has a slave. Eventually > one data node went down and you want to fail it over by promoting a slave. > If you run ALTER NODE command and change type of respective DATANODE_SLAVE > to DATANODE_MASTER the node won't be wired up correctly. Server will be > looking for the node by oid and won't find oid specified in pgxc_class in > the list of master datanodes. To correctly failover, you would have to > use ALTER NODE to change connection info of failed node to point to the > promoted slave and change node name of the slave so it associate itself > with the master datanode in the catalog. You mentioned it is planned > that node names remain constant and consistent in the cluster, so that is > not desired approach. > About this point, I mentionned that node names remain constant in the cluster. But I meant that for the master nodes. When it may be necessary to fail over a slave in case of a master failure, the node name of the slave should be changed to the one of the master. I agree this is not the most flexible approach though. > Potential vulnerability of renaming nodes is that you can mess up nodes, > and run node which already contains data of some logical partition as a > node associated with different partition. That may cause data corruption, > and even if it does not you very likely will have to dump/restore entire > database to bring it in order. > This is not much desired :) > Even without node renaming, it is not reliable to address logical > partitions by index, which is the number of the node name in alphabetical > order. If you add data node to the cluster with name between names of > existing nodes you must synchronously update node lists in pooler and in > all sessions, otherwise you risk to corrupt the data. > Just about that, there is an additional column in pgxc_class which is the list of nodes on which the table data is present. Even if you add a new node with CREATE NODE, the tables present until now won't use the new node as it is not listed in their list. There are still 2 functionalities that are missing to achieve this: 1) the possibility to define node subsets when creating a table, which is blocked for the time being 2) the possibility to redistribute data among a certain number of nodes. > All mentioned problems could be targeted if you introduce new concept > let's name it "segment". Segment is id of logical partition, it is similar > to former nodeid and current nodeindex, but it can not be changed once > assigned to a datanode. Assigned segment id can be stored in the pgxc_node > instead of "related" field: each segment may have at most one master node, > and all slaves of the segment are related to the segment master. Also, if > node is first time started under a name associated with some segment, the > segment id can be stored in the data directory, probably in a separate > file, like catalog version and validated on startup. It would make it > impossible to mess up partitions. Tables may refer segment ids instead of > node oids. 
> Is this logical ID some kind of randomly-generated integer ID created at the same time as a new node master? The problem here is to find a viable solution to target unique datanodes when choosing a node for a replicated table, for example. The uniqueness of master node names and ordering them by node names ensures that the datanode target order is safe even when new nodes are added, thanks to the node subset feature. By using a logical ID, is there any way to ensure that a partition ID is associated with a constant integer between 1 and n for n datanodes in the cluster? Regarding the related field approach... In Postgres 9.1, syncrep uses only one layer of slaves. This means the slaves have only one degree of dependency on the master they are related to. However 9.2 introduces cascading replication, meaning that you could set up a cluster with slaves of slaves. So if you have a master under one partition ID and slaves in this partition, which are themselves masters of other slaves, is there some way to set up a partition of a partition? Regards, -- Michael Paquier http://michael.otacoo.com |
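Spelled out, the failover path discussed in this exchange (keep the master's name and re-point it at the promoted slave) would be roughly the following; the ALTER NODE option names are illustrative since the node DDL grammar is still moving:

-- dn2's master host has died; its slave on slavehost has been promoted.
ALTER NODE dn2 SET nodehost = 'slavehost';
ALTER NODE dn2 SET nodeport = 15462;
-- pgxc_class keeps resolving dn2 by oid, but the promoted node must now
-- also answer to the name dn2, which is exactly the renaming problem
-- raised in this thread.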
From: Andrei M. <and...@gm...> - 2011-11-01 15:47:14
|
I have recently taken a look at the code and want to give some feedback on this. I like the idea, it should simplify initial cluster setup and it is a promising base for HA features. However, I see some issues and pitfalls, and want to point them out and suggest solutions. Currently table partitions are associated with nodes by node oids. This could be a problem if you want to fail over. Assume you have a working cluster, and each master datanode in the cluster has a slave. Eventually one data node goes down and you want to fail it over by promoting a slave. If you run an ALTER NODE command and change the type of the respective DATANODE_SLAVE to DATANODE_MASTER, the node won't be wired up correctly. The server will look the node up by oid and won't find the oid specified in pgxc_class in the list of master datanodes. To fail over correctly, you would have to use ALTER NODE to change the connection info of the failed node to point to the promoted slave, and change the node name of the slave so it associates itself with the master datanode in the catalog. You mentioned it is planned that node names remain constant and consistent in the cluster, so that is not the desired approach. A potential vulnerability of renaming nodes is that you can mix up nodes, and run a node which already contains data of some logical partition as a node associated with a different partition. That may cause data corruption, and even if it does not, you will very likely have to dump/restore the entire database to bring it back in order. Even without node renaming, it is not reliable to address logical partitions by index, that is, by the position of the node name in alphabetical order. If you add a data node to the cluster with a name that sorts between the names of existing nodes, you must synchronously update the node lists in the pooler and in all sessions, otherwise you risk corrupting the data. All the problems mentioned could be addressed if you introduce a new concept, let's name it "segment". A segment is the id of a logical partition; it is similar to the former nodeid and the current nodeindex, but it cannot be changed once assigned to a datanode. The assigned segment id can be stored in pgxc_node instead of the "related" field: each segment may have at most one master node, and all slaves of the segment are related to the segment master. Also, when a node is first started under a name associated with some segment, the segment id can be stored in the data directory, probably in a separate file like the catalog version, and validated on startup. It would make it impossible to mess up partitions. Tables may refer to segment ids instead of node oids. I guess Mason suggested something similar in the initial email; I have just given a bit more detail. 2011/11/1 Michael Paquier <mic...@gm...> > > I was not sure of the overall design, just trying to point out the >> theoretical danger, and again, because DDL was added for slaves, I was >> originally not sure if the intention was to allow them to retain their >> names and yet be able to be promoted. That is why I mentioned naming >> the partitions and using that order. Alternatively, perhaps one could >> use an internal id/OID behind the scenes for the partitions, sort by >> that to determine hash & modulo buckets, and have a mapping of the >> partitions and node instances. Each master and standby should know >> what its partition id/oid is, perhaps returned at connection time when >> the connection comes from a coordinator. This might do away with the >> node renaming issue. Just something to mull over. Or, maybe some >> standby naming convention will help. We should just think ahead a >> little bit about possible HA management scenarios for flexibility. >> > That's always a good reminder that there may be future issues regarding that. > > >> > Just about that.... >> > The first meaning of registering nodes on GTM was to keep track of >> all the >> > node information in the cluster. >> > But now that node names are the same and have to remain constant in the >> > cluster, is this really necessary? >> > Removing that will also allow to remove pgxc_node_name from the guc params. >> > Then identifying a node itself as Coordinator/Datanode could be done by >> > initdb with a SELF/NOT SELF keyword as an extension of the CREATE DDL. >> >> Maybe it is a good idea. How would a node rename itself later then? >> ALTER NODE oldname RENAME newname? Then if it sees that its own name >> matches oldname, change itself to newname? Otherwise just update >> catalog info? >> > The initial thought was that a node could not rename itself in the current > state and that node names remain constant and consistent in the cluster. > That's why no SQL extensions have been made for this purpose yet. > In case a slave has to be changed to a former master, it should though be renamed > to the old master automatically. > Once again, one of the possible extensions would be to have a node > partition system as you mentioned inside pgxc_node, for example. > -- > Michael Paquier > http://michael.otacoo.com -- Best regards, Andrei Martsinchyk mailto:and...@gm... |
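One way to picture the catalog side of the proposal, assuming a hypothetical node_segment column and on-disk marker file (neither exists today; the rows shown are invented):

SELECT node_name, node_type, node_segment FROM pgxc_node ORDER BY node_segment;
--  node_name | node_type | node_segment
--  dn1       | D         | 1
--  dn1_slave | S         | 1
--  dn2       | D         | 2
--  dn2_slave | S         | 2
-- Each data node would additionally keep its assigned segment id in a small
-- file under its data directory (like the catalog version file) and refuse
-- to start, or to accept Coordinator connections, if it does not match.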