Enabling NameNode HA with Ambari 2.7.4: steps and pitfalls

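This post records the problems encountered while enabling HDFS high availability (HA) on Hadoop: a NameNode that failed to start because connections to port 8020 were refused, and a ZKFC that failed to start. It walks through the error logs, the steps taken, and how fixing permissions and restarting services resolved the problems.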

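  1. On the HDFS page, click Actions.

  2. Click Enable NameNode HA.

  3. Enter the Nameservice ID.

  4. Select the hosts to install on.

  5. Review the configuration.

  6. Run the commands the wizard shows (this step is a pitfall, see Pitfall 1).

  7. Install the required components.

  8. Initialize the JournalNodes directories.
    This step reports the error "Could not initialize shared edits dir" because the corresponding directory is not empty. The fix suggested online is to format the NameNode; I ignored the error and clicked Next.

  9. Start the components on the primary node.

  10. On the primary node, run hdfs zkfc -formatZK. On the host of the additional NameNode, run hdfs namenode -bootstrapStandby.

  11. The steps after this one contain a few pitfalls.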


Pitfall 1: the newly added NameNode fails to start; connections to port 8020 are refused and there is no active NameNode

2021-01-28 15:20:56,127 - Getting jmx metrics from NN failed. URL: http://aisino-slave01.test.com:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem
Traceback (most recent call last):
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/functions/jmx.py", line 38, in get_value_from_jmx
    _, data, _ = get_user_call_output(cmd, user=run_user, quiet=False)
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/functions/get_user_call_output.py", line 62, in get_user_call_output
    raise ExecutionFailed(err_msg, code, files_output[0], files_output[1])
ExecutionFailed: Execution of 'curl -s 'http://aisino-slave01.test.com:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' 1>/tmp/tmpCiS8Ie 2>/tmp/tmpU6EWZq' returned 7. 

21/01/28 15:49:40 INFO ipc.Client: Retrying connect to server: aisino-slave01.test.com/10.70.12.104:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
Operation failed: Call From aisino-slave01.test.com/10.70.12.104 to aisino-slave01.test.com:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
2021-01-28 15:49:40,626 - call returned (255, '21/01/28 15:49:40 INFO ipc.Client: Retrying connect to server: aisino-slave01.test.com/10.70.12.104:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)\nOperation failed: Call From aisino-slave01.test.com/10.70.12.104 to aisino-slave01.test.com:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused')
2021-01-28 15:49:40,627 - NameNode HA states: active_namenodes = [], standby_namenodes = [(u'nn1', 'aisino-master.test.com:50070')], unknown_namenodes = [(u'nn2', 'aisino-slave01.test.com:50070')]
2021-01-28 15:49:40,627 - Will retry 3 time(s), caught exception: No active NameNode was found.. Sleeping for 5 sec(s)
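
The "NameNode HA states" line in this log is the quickest way to see which NameNode is in which state. As a rough illustration, the sketch below pulls the three lists out of such a line. `parse_ha_states` is a hypothetical helper and not part of Ambari, and it assumes the line has exactly the format shown in the log above.

```python
import ast
import re


def parse_ha_states(line):
    """Extract the active/standby/unknown NameNode lists from one
    Ambari agent 'NameNode HA states' log line.

    Assumes each list is printed as a Python literal such as
    [(u'nn1', 'host:50070')], as in the log above.
    """
    states = {}
    for key in ("active", "standby", "unknown"):
        match = re.search(key + r"_namenodes = (\[[^\]]*\])", line)
        if match is None:
            raise ValueError("no %s_namenodes list found in line" % key)
        # literal_eval safely turns the printed list back into Python objects.
        states[key] = ast.literal_eval(match.group(1))
    return states
```

For the line logged at 15:49:40,627 this gives an empty `active` list, `nn1` under `standby`, and `nn2` under `unknown`, which matches the "No active NameNode was found" message that follows it.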

At first I also checked the NameNode log:

2021-01-28 15:19:45,675 ERROR namenode.NameNode (NameNode.java:main(1715)) - Failed to start namenode.
java.io.FileNotFoundException: /hadoop/hdfs/namenode/current/VERSION (Permission denied)
	at java.io.RandomAccessFile.open0(Native Method)
	at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
	at org.apache.hadoop.hdfs.server.common.StorageInfo.readPropertiesFile(StorageInfo.java:250)
	at org.apache.hadoop.hdfs.server.namenode.NNStorage.readProperties(NNStorage.java:660)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:388)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:227)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1090)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)

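The log shows a permission error on the VERSION file, so I reset the ownership of the HDFS directory:

chown -R hdfs:hadoop /hadoop/hdfs/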
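
To confirm that the ownership change took effect before retrying, something like the following sketch can list whatever is still owned by a different user. `paths_not_owned_by` is a hypothetical helper; the `hdfs` user and the `/hadoop/hdfs` path in the usage comment are taken from the command above, and the check needs to run as a user that can read the whole tree.

```python
import os


def paths_not_owned_by(root, uid):
    """Return every directory and file under root whose owner is not uid."""
    wrong = []
    for dirpath, _dirnames, filenames in os.walk(root):
        candidates = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in candidates:
            # lstat so that a symlink is checked itself, not its target.
            if os.lstat(path).st_uid != uid:
                wrong.append(path)
    return wrong


# Usage on a cluster node (assumes a local 'hdfs' account exists):
#   import pwd
#   print(paths_not_owned_by("/hadoop/hdfs", pwd.getpwnam("hdfs").pw_uid))
```

An empty list means every path under the directory belongs to the expected user.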

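Solution
The real cause is the pitfall in step 6: the NameNode has to be taken out of safe mode. Change enter to leave in the safe mode command (that is, run something like hdfs dfsadmin -safemode leave), then click Retry.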
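
The "Connection refused" messages above mean that nothing was listening on the NameNode RPC port at the time. Before clicking Retry, a quick reachability check along these lines can tell whether the port is accepting connections yet. This is a minimal sketch: `port_is_open` is a hypothetical helper, and the host name and port in the usage comment are the ones from the log.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


# Usage:
#   port_is_open("aisino-slave01.test.com", 8020)
```

A `True` result only shows that something accepts connections on the port; it does not say whether that NameNode is active or standby.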

Pitfall 2: the ZKFailoverController cannot be started from the Ambari page

Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/HDFS/package/scripts/zkfc_slave.py", line 201, in <module>
    ZkfcSlave().execute()
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 352, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/HDFS/package/scripts/zkfc_slave.py", line 75, in start
    ZkfcSlaveDefault.start_static(env, upgrade_type)
  File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/HDFS/package/scripts/zkfc_slave.py", line 100, in start_static
    create_log_dir=True
  File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/HDFS/package/scripts/utils.py", line 261, in service
    Execute(daemon_cmd, not_if=process_id_exists_command, environment=hadoop_env_exports)
  File "/usr/lib/ambari-agent/lib/resource_management/core/base.py", line 166, in __init__
    self.env.run()
  File "/usr/lib/ambari-agent/lib/resource_management/core/environment.py", line 160, in run
    self.run_action(resource, action)
  File "/usr/lib/ambari-agent/lib/resource_management/core/environment.py", line 124, in run_action
    provider_action()
  File "/usr/lib/ambari-agent/lib/resource_management/core/providers/system.py", line 263, in action_run
    returns=self.resource.returns)
  File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 72, in inner
    result = function(command, **kwargs)
  File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 102, in checked_call
    tries=tries, try_sleep=try_sleep, timeout_kill_strategy=timeout_kill_strategy, returns=returns)
  File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 150, in _call_wrapper
    result = _call(command, **kwargs_copy)
  File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 314, in _call
    raise ExecutionFailed(err_msg, code, out, err)
resource_management.core.exceptions.ExecutionFailed: Execution of 'ambari-sudo.sh su hdfs -l -s /bin/bash -c 'ulimit -c unlimited ;  /usr/hdp/3.1.4.0-315/hadoop/bin/hdfs --config /usr/hdp/3.1.4.0-315/hadoop/conf --daemon start zkfc'' returned 1.

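Solution:
On the node where ZKFC is deployed, run

hdfs --daemon stop zkfc

ZKFC was in fact already running, but Ambari does not seem to recognize an instance that was not started from its page. After stopping it, click Retry.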
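
One way to see whether a ZKFC is already running on the node is to look at the output of the JDK's `jps` tool, where the process normally appears under its main class name, `DFSZKFailoverController`. The sketch below only parses text in the usual `<pid> <MainClass>` layout; `zkfc_pids` is a hypothetical helper, and how the `jps` output is captured (for example as the `hdfs` user) is left as an assumption.

```python
def zkfc_pids(jps_output):
    """Return the PIDs of ZKFC processes found in `jps` output text.

    Assumes the default jps layout: one '<pid> <MainClass>' pair per line.
    """
    pids = []
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1] == "DFSZKFailoverController":
            pids.append(int(parts[0]))
    return pids


# Usage (assumes jps is on PATH and can see the hdfs user's JVMs):
#   import subprocess
#   out = subprocess.run(["jps"], capture_output=True, text=True).stdout
#   print(zkfc_pids(out))
```

If the list is not empty before Ambari tries to start ZKFC, a stray instance is a likely reason for the failed start.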

Finally, wait for the wizard to finish.
