From: Koichi S. <koi...@gm...> - 2012-07-02 05:25:30
Attachments:
20120702_02_xc_watchdog.patch
|
Hi,

Enclosed is a WIP patch for xc_watchdog, covering coordinator/datanode/gtm and gtm_proxy. It is against the current master as of July 2nd, 2:00PM JST. I've tested it with gdb and found that the watchdog timer is incremented as expected. I will write the timeout detector and continue testing.

Regards;
----------
Koichi Suzuki |
From: Michael P. <mic...@gm...> - 2012-07-03 05:04:45
|
Hum, I am honestly not a fan of this approach. I cannot see the point of touching the core code for a feature that only does monitoring. We should discuss this with the Postgres community, get feedback about the possible solutions we could use here, and think hard before touching code paths we haven't modified yet, since there may be side effects on PostgreSQL itself.

What I cannot get is why we would add an internal chronometer when there are already options available:
- using a simple "SELECT 1" on the database
- using pg_ctl status

This implementation makes the core code dependent on monitoring features when it should definitely be the opposite: database server monitoring shouldn't touch the core, but only use its functionality. And even if PostgreSQL needed such a feature, XC should extend what already exists in Postgres in a cluster-aware way. So why reinvent the wheel??

Also, this patch adds a total of 6 GUC parameters: 2 for GTM, 2 for GTM-proxy and 2 for Coordinator/Datanode. That overcomplicates the feature.

Instead of creating so many dependencies on Postgres code, why not create a simple system function that returns a confirmation message to the client at a given time interval? Let's imagine the system function pgxc_watchdog(interval, cycles). This function could be like this:

    Datum
    pgxc_watchdog(interval time, int cycles)
    {
        int i;

        for (i = 0; i < cycles; i++)
        {
            sleep(interval);
            /* Send back a result or something */
            send_back('SELECT 1 result');
        }
    }

There are a lot of benefits in doing that:
- it does not touch the core for monitoring purposes (really, really important to my mind);
- it removes the 6 GUC parameters;
- it is portable and easy to maintain, and you could create a similar function to check GTM status from an XC node;
- you do not need an additional external module to read the monitoring pulse: you connect to a PostgreSQL server with a given client and launch it through a driver, whatever it is, so it can easily be adapted to all kinds of implementations and applications.

Are there people with a similar opinion to mine???

--
Michael Paquier
http://michael.otacoo.com |
From: Nikhil S. <ni...@st...> - 2012-07-03 23:02:43
|
> Are there people with a similar opinion to mine???

+1

IMO we should not be making invasive internal changes to support monitoring. It would be better to allow commands which can be scripted and which work against each of the components.

For example, for the coordinator/datanode, periodic "SELECT 1" commands should be good enough. Even doing an EXECUTE DIRECT via a coordinator to the datanodes will help.

For the GTM/GTM_Standby/GTM_Proxy components we should introduce "gtm_ctl ping" kinds of commands which will basically connect to them and check that they are responding ok.

Such interfaces make it really easy for monitoring solutions like nagios, zabbix etc. to monitor them. These tools have been used for a while now to monitor Postgres, and it should be a natural evolution for users to see them used for PG XC.

Regards,
Nikhils
--
StormDB - http://www.stormdb.com
The Database Cloud |
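Nikhil's scripted-check idea could be sketched as a small Nagios-style probe. The host, port, and exit-code conventions below are illustrative assumptions, not anything shipped with XC:

```shell
#!/bin/sh
# Hypothetical Nagios-style probe for a coordinator or datanode:
# exit 0 (OK) if the node answers a trivial query, 2 (CRITICAL) otherwise.
check_node() {
    host=$1; port=$2
    if psql -h "$host" -p "$port" -Atc 'SELECT 1' >/dev/null 2>&1; then
        echo "OK - $host:$port responding"
        return 0
    else
        echo "CRITICAL - $host:$port not responding"
        return 2    # Nagios CRITICAL status
    fi
}

rc=0
check_node "${PGXC_HOST:-localhost}" "${PGXC_PORT:-5432}" || rc=$?
echo "probe status: $rc"
```

A cron job or a Nagios/zabbix plugin wrapper can then act on the exit status alone, which is exactly what makes the scripted approach easy to integrate.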
From: Michael P. <mic...@gm...> - 2012-07-03 23:38:15
|
On Wed, Jul 4, 2012 at 8:02 AM, Nikhil Sontakke <ni...@st...> wrote:

> IMO too we should not be making any too invasive internal changes to
> support monitoring. What would be better would be to maybe allow
> commands which can be scripted and which can work against each of the
> components.

This could be managed more easily by creating new system functions for monitoring, written in C as an EXTENSION and pluggable as a contrib module.

> For GTM/GTM_Standby/GTM_Proxy components we should introduce "gtm_ctl
> ping" kinds of commands which will basically connect to them and see
> that they are responding ok.

That is an interesting idea. Btw, we should definitely avoid any additional GUC parameters inside GTM core code. This keeps cluster settings simple, and users may use a different monitoring solution than the one proposed.

> Such interfaces make it really easy for monitoring solutions like
> nagios, zabbix etc. to monitor them.

Completely agreed; we do not need to reinvent solutions that already exist and have proved sufficient.

--
Michael Paquier
http://michael.otacoo.com |
From: Koichi S. <koi...@gm...> - 2012-07-04 00:31:59
|
The background of xc_watchdog is to provide a quicker means to detect node faults. I understand that it is not compatible with what we do in conventional PG applications, which are mostly based on psql -c 'select 1'. That can take up to 60 sec to detect the error (the TCP timeout value). Some applications will be satisfied with this and some may not. This was raised at the clustering summit in Ottawa, and the suggestion was to have this kind of means (a watchdog).

I don't know if PG people are interested in this now. Maybe we should wait until such fault detection is a more pressing issue. The implementation is very straightforward.

For the datanode, I don't like asking applications to connect to it directly using psql, because that is a tricky use and may imply that we allow applications to connect to datanodes directly. So I think we should encapsulate this with a dedicated command such as xc_monitor. Xc_ping sounds good too, but "ping" suggests continuous monitoring, while the current practice needs only a single check. So I'd prefer xc_monitor (or node_monitor).

A command like 'xc_monitor -Z nodetype -h host -p port' will not need any modification to the core. It will be submitted soon as a contrib module.

Regards;
----------
Koichi Suzuki |
From: Michael P. <mic...@gm...> - 2012-07-04 00:56:16
|
On Wed, Jul 4, 2012 at 9:31 AM, Koichi Suzuki <koi...@gm...> wrote:

> The background of xc_watchdog is to provide quicker means to detect
> node fault. [...] This is raised at the clustering summit in Ottawa and
> the suggestion was to have this kind of means (watchdog).

It is also possible to set keepalive options in the connection string used to connect to the database server. The "SELECT 1" approach is just one option. You could also use the hooks implemented in vanilla Postgres to write some monitoring activity back to the logs. Those hooks could be set up in an extension module pluggable directly into XC, which avoids giving core any dependencies on monitoring.

> I don't know if PG people are interested in this now. Maybe we
> should wait until such fault detection is more realistic issue.

We should definitely discuss that with them. They may have reasons not to do it yet; if so, I don't know which ones. What is sure, however, is that core code should not depend on monitoring.

> Command like 'xc_monitor -Z nodetype -h host -p port' will not need
> any modification to the core. Will be submitted soon as contrib
> module.

This is OK I think. The main point here is how to check whether the database is alive, and there are already solutions ready to be used. We just cannot force a solution into our core code that may impact all users, including people who would like to monitor their databases with solutions like nagios.

--
Michael Paquier
http://michael.otacoo.com |
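The keepalive options Michael mentions are standard libpq connection parameters. A sketch of how they would be used (host and database names are placeholders):

```shell
# libpq connection string with TCP keepalive tuning; keepalives,
# keepalives_idle, keepalives_interval and keepalives_count are real
# libpq options.  With these values a dead peer is noticed in roughly
# keepalives_idle + keepalives_interval * keepalives_count seconds,
# far sooner than the default TCP timeout.
conninfo="host=coord1 port=5432 dbname=postgres \
keepalives=1 keepalives_idle=5 keepalives_interval=2 keepalives_count=3"

# A monitoring client would then connect with, e.g.:
#   psql "$conninfo" -c 'SELECT 1'
echo "$conninfo"
```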
From: Koichi S. <koi...@gm...> - 2012-07-04 01:03:31
|
I also found that it is quite difficult to have both gtm_client.h and libpq-fe.h in the same binary: gtm_client.h includes gtm/libpq-fe.h, which has many naming conflicts. I remember this was a major issue when I tried the first pgxc_clean implementation; in the end I modified the core to make pgxc_clean a complete XC application.

This time, I don't think we should depend on some coordinator, because it could fail while the whole cluster is in operation. I need direct access to gtm. Because it is neither simple nor practical to resolve the conflicts between gtm/libpq-fe.h and libpq-fe.h, I'd like to write separate monitor commands, gtm_monitor and node_monitor.

To discourage direct connections from applications to datanodes, I'd still like to provide a dedicated monitoring command for datanodes (and coordinators). I also believe it's not a good idea to monitor a datanode through a coordinator using EXECUTE DIRECT, because the latter may fail while the whole cluster is in operation.

Regards;
----------
Koichi Suzuki |
From: Nikhil S. <ni...@st...> - 2012-07-04 02:50:03
|
> I also believe it's not a good idea to monitor a datanode through a
> coordinator using EXECUTE DIRECT because the latter may be failed
> while the whole cluster is in operation.

Well, if there are multiple failures we ought to know about them anyway. So if this particular coordinator fails, the monitor tells us about it first; we fix it and then move on to datanode failure detection. Since the datanodes have to be reachable via coordinators, and we have multiple coordinators around to load-balance anyway, I still think EXECUTE DIRECT via a coordinator node is a decent idea. If we can round-robin the calls across all the coordinators, that would be even better I think.

Regards,
Nikhils
--
StormDB - http://www.stormdb.com
The Database Cloud |
From: Koichi S. <koi...@gm...> - 2012-07-04 03:15:19
|
We don't have to do such a round robin. As Michael suggested, pg_ctl status works well even with datanodes. It doesn't issue any query but checks whether the postmaster is running, and I think that is sufficient. The only restriction is that pg_ctl status reports a zombie process as running.

Regards;
----------
Koichi Suzuki |
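Scripting the pg_ctl status check above amounts to testing its exit code; the data directory path here is a placeholder:

```shell
# pg_ctl status only checks that a postmaster process exists for the
# given data directory, so (as noted above) a zombie or hung postmaster
# still counts as "running".
rc=0
pg_ctl status -D /var/lib/pgxc/datanode1 >/dev/null 2>&1 || rc=$?
if [ "$rc" -eq 0 ]; then
    echo "postmaster is running"
else
    echo "postmaster is not running (pg_ctl exit code $rc)"
fi
```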
From: Koichi S. <koi...@gm...> - 2012-07-04 03:53:09
|
I'd like to post one more thing on this.

Yes, pg_ctl status and gtm_ctl status (I've corrected the latter and will submit it soon) check that the postmaster and the main thread are alive, but they don't show whether each node responds correctly. For this purpose, we can use psql -c 'select 1' only against a coordinator. Yes, we could use the same thing against a datanode, but in general we should restrict direct psql use against datanodes. We also need a gtm/gtm_proxy counterpart.

So please let me implement the pgxc_monitor command as in the previous post, which really interacts with each node and reports whether it is running. It may need a small extension to gtm/gtm_proxy to add a message that just responds with an ack. No other core extension will be needed.

Regards;
----------
Koichi Suzuki |
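A rough sketch of how the proposed pgxc_monitor might dispatch on node type. pgxc_monitor itself and the GTM "ack" message did not exist yet at this point, so every command, flag, and path below is an assumption standing in for the future contrib module:

```shell
# Hypothetical dispatch logic for the proposed pgxc_monitor command:
# query-level check for coordinator/datanode, process-level check
# (a stand-in for the future ack exchange) for gtm/gtm_proxy.
pgxc_monitor_sketch() {
    nodetype=$1; host=$2; port=$3
    case $nodetype in
        coordinator|datanode)
            # One-shot liveness check: does the node answer a query?
            psql -h "$host" -p "$port" -Atc 'SELECT 1' >/dev/null 2>&1
            ;;
        gtm|gtm_proxy)
            # Placeholder until gtm/gtm_proxy grow an ack message;
            # gtm_ctl needs a data directory, illustrative path here.
            gtm_ctl status -Z "$nodetype" -D "/var/lib/pgxc/$nodetype" \
                >/dev/null 2>&1
            ;;
        *)
            echo "unknown node type: $nodetype" >&2
            return 64
            ;;
    esac
}

pgxc_monitor_sketch badtype localhost 0 2>/dev/null || echo "usage error caught"
# prints "usage error caught"
```

The single exit status per invocation matches Koichi's point that current practice needs only one check, not continuous pinging.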
From: Amit K. <ami...@en...> - 2012-07-04 05:39:05
|
On 4 July 2012 09:23, Koichi Suzuki <koi...@gm...> wrote:

> Yes, pg_ctl status and gtm_ctl status (I've corrected this and will
> submit soon) check that postmaster and main thread is alive but they
> don't show if each node responds correctly. For this purpose, we can
> use psql -c 'select 1' only for coordinator. [...]

Maybe we can use the PQping() call to test the connectivity. That's what pg_ctl start does: after starting up, it waits for a response from the server and only then returns.

> So please let me implement pgxc_monitor command as in the previous
> post which really interacts with each node and reports if they're
> running. It may need some extension to gtm/gtm_proxy to add a
> message just to respond ack. No other core extension will be
> needed. |
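PQping() is a real libpq call (available since PostgreSQL 9.1) that checks server reachability without authenticating. A minimal probe along the lines Amit suggests might look like this; the conninfo values are placeholders, and the program must be linked with -lpq:

```c
/* Minimal sketch of probing a node with PQping().  The four PGPing
 * enum values below are the ones libpq actually defines. */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    const char *conninfo = "host=localhost port=5432 connect_timeout=3";

    switch (PQping(conninfo))
    {
        case PQPING_OK:
            puts("node is accepting connections");
            return 0;
        case PQPING_REJECT:
            puts("server alive but rejecting connections (e.g. startup)");
            return 1;
        case PQPING_NO_RESPONSE:
            puts("no response from server");
            return 2;
        default:                /* PQPING_NO_ATTEMPT: bad parameters */
            puts("no connection attempt was made");
            return 3;
    }
}
```

Unlike "SELECT 1", this needs no credentials or database, which suits a monitoring command that must work against datanodes without encouraging direct application access.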
From: Michael P. <mic...@gm...> - 2012-07-04 05:04:04
|
On Wed, Jul 4, 2012 at 12:53 PM, Koichi Suzuki <koi...@gm...> wrote:

> So please let me implement pgxc_monitor command as in the previous
> post which really interacts with each node and reports if they're
> running. It may need some extension to gtm/gtm_proxy to add a
> message just to respond ack. No other core extension will be
> needed.

This additional message for GTM/proxy makes sense; it can be considered an equivalent of the simple "SELECT 1" on nodes. I imagine it can also be used for other purposes. This should be written as an independent patch.

--
Michael Paquier
http://michael.otacoo.com |
From: Koichi S. <koi...@gm...> - 2012-07-04 05:08:03
|
Yes, I agree to make it separate. It will be submitted soon.

----------
Koichi Suzuki |