The previous article, MySQL High Availability Master-Replica Cluster (Part 2), covered deploying the ProxySQL middleware to proxy a MySQL master-replica cluster and implement read/write splitting, which greatly improves the cluster's performance. One problem remains, however: when the cluster fails, for example when the master node goes down, it can no longer serve traffic. This is where Orchestrator comes in: it monitors the MySQL cluster, and when the master fails it elects a new master from the remaining replicas, keeping the cluster highly available.
1 Introduction to Orchestrator
GitHub: https://github.com/openark/orchestrator
orchestrator is a MySQL high availability and replication management tool. It runs as a service and provides command-line access, an HTTP API, and a web interface. orchestrator supports the following:
Discovery
orchestrator actively crawls your topologies and maps them. It reads basic MySQL information such as replication status and configuration, and gives you a clean visualization of your topologies, including replication problems, even in the face of failures.
Refactoring
orchestrator understands replication rules. It knows about binlog file:position, GTID, Pseudo-GTID, and binlog servers. Refactoring a replication topology can be as simple as dragging a replica under another master. Moving replicas is safe: orchestrator rejects illegal refactoring attempts. Fine-grained control is available via various command-line options.
Recovery
orchestrator takes a holistic approach to detecting master and intermediate-master failures. Based on information gained from the topology itself, it recognizes a variety of failure scenarios.
It is configurable and can choose to perform automated recovery (or let the user choose the type of manual recovery). Intermediate-master recovery is handled internally by orchestrator; master failover is supported by pre/post failure hooks.
The recovery process leverages orchestrator's understanding of the topology and its ability to refactor it. It is state-based rather than configuration-based: orchestrator picks the best recovery method by investigating and evaluating the topology at recovery time.
2 Deploying the orchestrator cluster
Deploy a three-node orchestrator cluster backed by raft: when the leader node goes down, a new leader is elected from the remaining nodes, so orchestrator itself stays highly available and keeps monitoring the MySQL cluster.
2.1 Download orchestrator
wget https://github.com/openark/orchestrator/releases/download/v3.2.2/orchestrator-client-3.2.2-1.x86_64.rpm
wget https://github.com/openark/orchestrator/releases/download/v3.2.2/orchestrator-cli-3.2.2-1.x86_64.rpm
wget https://github.com/openark/orchestrator/releases/download/v3.2.2/orchestrator-3.2.2-1.x86_64.rpm
2.2 Install orchestrator with yum
Change into the directory holding the downloaded packages:
yum localinstall orchestrator-3.2.2-1.x86_64.rpm
yum localinstall orchestrator-client-3.2.2-1.x86_64.rpm
yum localinstall orchestrator-cli-3.2.2-1.x86_64.rpm
2.3 Modify the orchestrator configuration
2.3.1 Modify orchestrator.service
Point ExecStart in the systemd unit at the configuration file used below:
sudo sed -i 's@ExecStart=/usr/local/orchestrator/orchestrator http@ExecStart=/usr/local/orchestrator/orchestrator --config=/usr/local/orchestrator/orchestrator.conf.json http@g' /etc/systemd/system/orchestrator.service
## Reload systemd unit files
systemctl daemon-reload
## Create the orchestrator data directory (used below as RaftDataDir and for the SQLite backend file)
mkdir -p /orchestrator
2.3.2 Configure DB hosts
Add the MySQL cluster's hostnames to the system hosts file on every orchestrator node, and add the same entries on every MySQL node:
sudo tee -a /etc/hosts > /dev/null << EOF
10.0.0.1 mysql-0001
10.0.0.2 mysql-0002
10.0.0.3 mysql-0003
EOF
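To confirm the entries took effect, resolution can be checked on each node (hostnames as defined above):

```shell
# Each command should print the IP mapped in /etc/hosts
getent hosts mysql-0001 mysql-0002 mysql-0003
```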
Create orchestrator's host configuration file:
sudo tee /usr/local/orchestrator/host.cnf > /dev/null << EOF
10.0.0.1 mysql-0001
10.0.0.2 mysql-0002
10.0.0.3 mysql-0003
EOF
2.3.3 Configure orchestrator.conf.json
A few settings must be adjusted per deployment: MySQLTopologyUser/MySQLTopologyPassword are the credentials orchestrator uses to reach the MySQL cluster; BackendDB selects the backend store (sqlite here; when using mysql, the MySQLOrchestrator* fields must be filled in instead); RaftBind is this node's own IP; DefaultRaftPort is the port the orchestrator nodes use to talk to each other; RaftNodes lists all orchestrator cluster members. JSON does not allow comments, so keep the file free of them:
sudo tee /usr/local/orchestrator/orchestrator.conf.json > /dev/null << 'EOF'
{
"Debug": false,
"EnableSyslog": false,
"ListenAddress": ":3000",
"MySQLTopologyUser": "orch_monitor",
"MySQLTopologyPassword": "123456",
"MySQLTopologyUseMutualTLS": false,
"BackendDB": "sqlite",
"SQLite3DataFile": "/orchestrator/orchestrator.sqlite3",
"MySQLOrchestratorHost": "",
"MySQLOrchestratorPort": 3306,
"MySQLOrchestratorDatabase": "",
"MySQLOrchestratorUser": "",
"MySQLOrchestratorPassword": "",
"UseSuperReadOnly": true,
"MySQLConnectTimeoutSeconds": 1,
"DefaultInstancePort": 3306,
"DiscoverByShowSlaveHosts": false,
"InstancePollSeconds": 5,
"RaftEnabled": true,
"RaftDataDir": "/orchestrator",
"RaftBind": "10.0.0.6",
"DefaultRaftPort": 10008,
"RaftNodes": ["10.0.0.6","10.0.0.7","10.0.0.8"],
"UnseenInstanceForgetHours": 240,
"SnapshotTopologiesIntervalHours": 0,
"InstanceBulkOperationsWaitTimeoutSeconds": 20,
"HostnameResolveMethod": "default",
"MySQLHostnameResolveMethod": "@@hostname",
"SkipBinlogServerUnresolveCheck": true,
"ExpiryHostnameResolvesMinutes": 120,
"RejectHostnameResolvePattern": "",
"ReasonableReplicationLagSeconds": 10,
"ProblemIgnoreHostnameFilters": [""],
"VerifyReplicationFilters": false,
"ReasonableMaintenanceReplicationLagSeconds": 20,
"CandidateInstanceExpireMinutes": 60,
"AuditLogFile": "",
"AuditToSyslog": false,
"RemoveTextFromHostnameDisplay": ".mydomain.com:3306",
"ReadOnly": false,
"AuthenticationMethod": "",
"HTTPAuthUser": "",
"HTTPAuthPassword": "",
"AuthUserHeader": "",
"PowerAuthUsers": [
"*"
],
"ClusterNameToAlias": {
"127.0.0.1": "orch local"
},
"ReplicationLagQuery": "",
"DetectClusterAliasQuery": "SELECT SUBSTRING_INDEX(@@hostname, '.', 1)",
"DetectClusterDomainQuery": "",
"DetectInstanceAliasQuery": "",
"DetectPromotionRuleQuery": "",
"DataCenterPattern": "[.]([^.]+)[.][^.]+[.]mydomain[.]com",
"PhysicalEnvironmentPattern": "[.]([^.]+[.][^.]+)[.]mydomain[.]com",
"PromotionIgnoreHostnameFilters": [],
"DetectSemiSyncEnforcedQuery": "",
"ServeAgentsHttp": false,
"AgentsServerPort": ":3001",
"AgentsUseSSL": false,
"AgentsUseMutualTLS": false,
"AgentSSLSkipVerify": false,
"AgentSSLPrivateKeyFile": "",
"AgentSSLCertFile": "",
"AgentSSLCAFile": "",
"AgentSSLValidOUs": [],
"UseSSL": false,
"UseMutualTLS": false,
"SSLSkipVerify": false,
"SSLPrivateKeyFile": "",
"SSLCertFile": "",
"SSLCAFile": "",
"SSLValidOUs": [],
"URLPrefix": "",
"StatusEndpoint": "/api/status",
"StatusSimpleHealth": true,
"StatusOUVerify": false,
"AgentPollMinutes": 60,
"UnseenAgentForgetHours": 6,
"StaleSeedFailMinutes": 60,
"SeedAcceptableBytesDiff": 8192,
"PseudoGTIDPattern": "",
"PseudoGTIDPatternIsFixedSubstring": false,
"PseudoGTIDMonotonicHint": "asc:",
"DetectPseudoGTIDQuery": "",
"BinlogEventsChunkSize": 10000,
"SkipBinlogEventsContaining": [],
"ReduceReplicationAnalysisCount": true,
"FailureDetectionPeriodBlockMinutes": 30,
"FailMasterPromotionOnLagMinutes": 0,
"RecoveryPeriodBlockSeconds": 3600,
"RecoveryIgnoreHostnameFilters": [],
"RecoverMasterClusterFilters": [
"*"
],
"RecoverIntermediateMasterClusterFilters": [
"_intermediate_master_pattern_"
],
"OnFailureDetectionProcesses": [
"echo '1 Detected {failureType} on {failureCluster}. Affected replicas: {countSlaves}' >> /tmp/recovery.log"
],
"PreGracefulTakeoverProcesses": [
"echo 'Planned takeover about to take place on {failureCluster}. Master will switch to read_only' >> /tmp/recovery.log"
],
"PreFailoverProcesses": [
"/usr/bin/prefailover.sh {failedHost} {failedPort} >> /tmp/recovery.log"
],
"PostFailoverProcesses": [
"/usr/bin/postfailover.sh {failedHost} {failedPort} {successorHost} {successorPort} >> /tmp/recovery.log"
],
"PostUnsuccessfulFailoverProcesses": [],
"PostMasterFailoverProcesses": [
"echo '4 Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Promoted: {successorHost}:{successorPort}' >> /tmp/recovery.log"
],
"PostIntermediateMasterFailoverProcesses": [
"echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
],
"PostGracefulTakeoverProcesses": [
"echo '5 Planned takeover complete' >> /tmp/recovery.log"
],
"CoMasterRecoveryMustPromoteOtherCoMaster": true,
"DetachLostSlavesAfterMasterFailover": true,
"ApplyMySQLPromotionAfterMasterFailover": true,
"PreventCrossDataCenterMasterFailover": false,
"PreventCrossRegionMasterFailover": false,
"MasterFailoverDetachReplicaMasterHost": false,
"MasterFailoverLostInstancesDowntimeMinutes": 0,
"PostponeReplicaRecoveryOnLagMinutes": 0,
"DelayMasterPromotionIfSQLThreadNotUpToDate": true,
"OSCIgnoreHostnameFilters": [],
"GraphiteAddr": "",
"GraphitePath": "",
"GraphiteConvertHostnameDotsToUnderscores": true,
"ConsulAddress": "",
"ConsulAclToken": ""
}
EOF
2.3.4 Configure the orchestrator recovery hook scripts
Create the prefailover.sh script referenced in orchestrator.conf.json; orchestrator runs it immediately before performing a MySQL cluster recovery.
Create an exec_prefailover.sh script, invoked from prefailover.sh, that carries out the actual pre-failover work.
Create the postfailover.sh script referenced in orchestrator.conf.json; orchestrator runs it at the end of a successful MySQL cluster recovery.
Create an exec_postfailover.sh script, invoked from postfailover.sh, that carries out the actual post-failover work.
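The hook scripts themselves are not listed here. As a rough sketch (the ProxySQL repointing is an assumption carried over from the previous article, not something this setup prescribes), postfailover.sh might look like:

```shell
#!/bin/bash
# /usr/bin/postfailover.sh -- sketch only.
# orchestrator passes the PostFailoverProcesses placeholders as arguments:
#   {failedHost} {failedPort} {successorHost} {successorPort}
FAILED_HOST=$1
FAILED_PORT=$2
SUCCESSOR_HOST=$3
SUCCESSOR_PORT=$4

echo "$(date '+%F %T') failover: ${FAILED_HOST}:${FAILED_PORT} -> ${SUCCESSOR_HOST}:${SUCCESSOR_PORT}" >> /tmp/recovery.log

# Assumed follow-up: repoint ProxySQL's write hostgroup (hostgroup id 10 is a
# placeholder) at the newly promoted master via the ProxySQL admin interface:
# mysql -h 127.0.0.1 -P 6032 -u admin -padmin -e "
#   UPDATE mysql_servers SET hostname='${SUCCESSOR_HOST}' WHERE hostgroup_id=10;
#   LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;"
```

prefailover.sh can follow the same pattern, with only {failedHost} and {failedPort} available as arguments.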
## Reload systemd units
systemctl daemon-reload
2.3.5 Start orchestrator
## Create the directory for the persistence file (must match SQLite3DataFile in the config)
mkdir -p /orchestrator
## Manage the service (restart/stop/status shown for reference)
systemctl start orchestrator
systemctl restart orchestrator
systemctl stop orchestrator
systemctl status orchestrator
## Follow the service log
tail -f -n 200 /var/log/messages
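Once the service is up on all three nodes, health can be checked over the HTTP API. /api/status matches StatusEndpoint in the config above, and /api/leader-check returns HTTP 200 only on the current raft leader:

```shell
# Simple liveness check on one node
curl -s http://10.0.0.6:3000/api/status
# Find the raft leader: only the leader answers 200 here
for node in 10.0.0.6 10.0.0.7 10.0.0.8; do
  echo "$node: $(curl -s -o /dev/null -w '%{http_code}' "http://$node:3000/api/leader-check")"
done
```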
2.3.6 Configure orchestrator cluster environment variables
sudo tee /etc/profile.d/orchestrator.sh > /dev/null << EOF
export ORCHESTRATOR_API="http://10.0.0.6:3000/api http://10.0.0.7:3000/api http://10.0.0.8:3000/api"
export ORCHESTRATOR_AUTH_USER=admin
export ORCHESTRATOR_AUTH_PASSWORD=123456
export TMOUT=900
EOF
## Apply the environment variables
source /etc/profile
2.3.7 Open firewall ports
firewall-cmd --zone=public --add-port=3000/tcp --permanent
firewall-cmd --zone=public --add-port=10008/tcp --permanent
firewall-cmd --reload
2.4 Configure MySQL cluster discovery
2.4.1 Create the orchestrator monitoring user on the MySQL master
Log in to the MySQL cluster's master and run:
CREATE USER 'orch_monitor'@'%' IDENTIFIED BY '123456';
GRANT RELOAD, PROCESS, SUPER, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'orch_monitor'@'%';
GRANT SELECT ON meta.* TO 'orch_monitor'@'%';
ALTER USER 'orch_monitor'@'%' IDENTIFIED WITH mysql_native_password BY '123456';
FLUSH PRIVILEGES;
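Before pointing orchestrator at the cluster, it is worth confirming the new user can log in from an orchestrator node (hostnames and password as configured above):

```shell
# Should print each node's hostname and report_host if grants and hosts entries are in place
for h in mysql-0001 mysql-0002 mysql-0003; do
  mysql -h "$h" -u orch_monitor -p123456 -e "SELECT @@hostname, @@report_host;"
done
```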
2.4.2 Append configuration on every MySQL node (my.cnf)
For orchestrator to manage the cluster, add the following to each managed MySQL node's my.cnf. report_host and report_port are not dynamic variables, so restart mysqld afterwards:
# this MySQL node's own IP
report_host=10.0.0.1
report_port=3306
2.4.3 Run discovery
On the orchestrator leader node, run the discovery command against any cluster member:
orchestrator-client -c discover -i 10.0.0.1:3306
orchestrator then automatically discovers the full master-replica topology.
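The discovered topology can also be printed from the command line:

```shell
# Print the replication tree as orchestrator sees it, rooted at the master
orchestrator-client -c topology -i mysql-0001:3306
```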
2.4.4 Set the promotion rule
On the orchestrator leader node, set the promotion rule. Here one replica of the one-master, two-replica business cluster is marked as the preferred failover target: when the MySQL master goes down, orchestrator will preferentially promote this replica to master.
/usr/bin/orchestrator-client -c register-candidate -i mysql-0002:3306 --promotion-rule prefer
2.4.5 Re-register the candidate periodically
Candidate registrations expire (CandidateInstanceExpireMinutes is 60 in the config above), so refresh the rule from cron on every orchestrator node:
crontab -e
# add this job
*/2 * * * * /usr/bin/perl -le 'sleep rand 10' && /usr/bin/orchestrator-client -c register-candidate -i mysql-0002:3306 --promotion-rule prefer
2.5 Access the orchestrator web UI
http://10.0.0.6:3000/
Check whether the MySQL cluster shows any anomalies.
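For scripted checks, the same information is available through orchestrator's HTTP API:

```shell
# List all clusters orchestrator knows about
curl -s http://10.0.0.6:3000/api/clusters
# List outstanding replication problems, if any
curl -s http://10.0.0.6:3000/api/problems
```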