Versions used:
Hadoop 2.9.2
Spark 2.4.8
Scala 2.11.12
Linux: CentOS 7.4
The four machines' hostnames are set as follows:
ambari.master.hadoop
ambari.node1.hadoop
ambari.node2.hadoop
ambari.node3.hadoop
ambari.master.hadoop acts as the master node; the other three are compute nodes.
Environment preparation
1. Set up passwordless SSH trust between the machines:
On every machine, create the directory (mkdir -p ~/.ssh) and generate a key pair with ssh-keygen -t rsa, pressing Enter at every prompt.
On ambari.master.hadoop, run:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys ambari.node1.hadoop:~/.ssh/
On ambari.node1.hadoop, run:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys ambari.node2.hadoop:~/.ssh/
On ambari.node2.hadoop, run:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys ambari.node3.hadoop:~/.ssh/
On ambari.node3.hadoop, run:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys ambari.master.hadoop:~/.ssh/
scp ~/.ssh/authorized_keys ambari.node1.hadoop:~/.ssh/
scp ~/.ssh/authorized_keys ambari.node2.hadoop:~/.ssh/
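As an optional check, sshd requires strict permissions on the key files; a minimal sketch to tighten them and confirm passwordless login from any node:
# Run on each node: restrict key file permissions, which sshd insists on
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
# Each hostname should print without a password prompt
for h in ambari.master.hadoop ambari.node1.hadoop ambari.node2.hadoop ambari.node3.hadoop; do
  ssh "$h" hostname
done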
2. Disable the firewall:
systemctl stop firewalld
systemctl disable firewalld
3. Install the JDK (required on every node):
export JAVA_HOME=/usr/local/jdk1.8.0_221
export JRE_HOME=/usr/local/jdk1.8.0_221/jre
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
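A minimal install sketch, assuming a JDK 8 tarball named jdk-8u221-linux-x64.tar.gz (the exact archive name depends on where you download it) and a hypothetical /etc/profile.d/java.sh holding the exports above:
tar -xzvf jdk-8u221-linux-x64.tar.gz -C /usr/local/   # yields /usr/local/jdk1.8.0_221
vim /etc/profile.d/java.sh                            # add the four export lines above
. /etc/profile.d/java.sh
java -version                                         # should report 1.8.0_221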
4. Install Scala. Spark depends on Scala, so it must be installed on every node.
export SCALA_HOME=/usr/local/scala-2.11.12
export PATH=$PATH:$SCALA_HOME/bin
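A sketch along the same lines, assuming the scala-2.11.12.tgz tarball from scala-lang.org and a hypothetical /etc/profile.d/scala.sh for the exports above:
tar -xzvf scala-2.11.12.tgz -C /usr/local/   # yields /usr/local/scala-2.11.12
vim /etc/profile.d/scala.sh                  # add the two export lines above
. /etc/profile.d/scala.sh
scala -version                               # should report 2.11.12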
I. Hadoop installation
Download Hadoop:
https://hadoop.apache.org/releases.html
Extract and set permissions:
tar -xzvf hadoop-2.9.2.tar.gz
mv hadoop-2.9.2 /usr/local/server/hadoop-2.9.2
chmod -R 755 /usr/local/server/hadoop-2.9.2
Set environment variables:
vim /etc/profile.d/hadoop.sh
export HADOOP_HOME=/usr/local/server/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
Edit the configuration files, all located under /usr/local/server/hadoop-2.9.2/etc/hadoop:
vim hadoop-env.sh
export JAVA_HOME=/usr/local/jdk1.8.0_221
vim core-site.xml (HDFS core configuration)
<configuration>
    <!-- RPC address of the NameNode -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://ambari.master.hadoop:9000</value>
    </property>
    <!-- Base directory for files Hadoop generates at runtime -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/server/hadoop/tmp</value>
    </property>
</configuration>
vim hdfs-site.xml (addresses and storage paths)
<configuration>
    <!-- HTTP address of the NameNode -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>ambari.master.hadoop:50070</value>
    </property>
    <!-- HTTP address of the SecondaryNameNode -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>ambari.node1.hadoop:50090</value>
    </property>
    <!-- NameNode metadata directory -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/usr/local/server/hadoop/name</value>
    </property>
    <!-- HDFS replication factor -->
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <!-- DataNode data directory -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/usr/local/server/hadoop/data</value>
    </property>
</configuration>
vim mapred-site.xml
<configuration>
    <!-- Tell the framework to run MapReduce on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
vim yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>ambari.master.hadoop:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>ambari.master.hadoop:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>ambari.master.hadoop:8035</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>ambari.master.hadoop:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>ambari.master.hadoop:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <!-- Adjust the following values for your own environment; YARN does not detect hardware resources automatically -->
    <property>
        <description>Amount of physical memory, in MB, that can be allocated for containers.</description>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>3036</value>
    </property>
    <property>
        <description>The minimum allocation for every container request at the RM, in MBs. Memory requests lower than this won't take effect, and the specified value will get allocated at minimum.</description>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>128</value>
    </property>
    <property>
        <description>The maximum allocation for every container request at the RM, in MBs. Memory requests higher than this won't take effect, and will get capped to this value.</description>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>2560</value>
    </property>
</configuration>
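A note on the three memory values: yarn.nodemanager.resource.memory-mb is the total memory YARN may hand out on one node and must leave room for the OS and the Hadoop daemons, and yarn.scheduler.maximum-allocation-mb must not exceed it. A quick way to see how much RAM a node actually has before choosing your own values:
free -m   # the "total" column is physical RAM in MB; the 3036 MB setting above assumes nodes with roughly 4 GB of RAM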
vim masters
ambari.master.hadoop
vim slaves
ambari.node1.hadoop
ambari.node2.hadoop
ambari.node3.hadoop
Copy the Hadoop directory to every node (a distribution sketch follows the commands below).
Distribute /etc/profile.d/hadoop.sh to every node as well, then source it:
. /etc/profile.d/hadoop.sh
cd /usr/local/server/
ln -s /usr/local/server/hadoop-2.9.2 hadoop
mkdir -pv /usr/local/server/hadoop/{data,name,tmp}
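A distribution sketch run from ambari.master.hadoop, assuming the same /usr/local/server layout on every node and the passwordless SSH set up earlier; adjust paths if yours differ:
for h in ambari.node1.hadoop ambari.node2.hadoop ambari.node3.hadoop; do
  ssh "$h" mkdir -p /usr/local/server                              # make sure the target directory exists
  scp -r /usr/local/server/hadoop-2.9.2 "$h":/usr/local/server/    # copy the Hadoop tree
  scp /etc/profile.d/hadoop.sh "$h":/etc/profile.d/                # copy the environment script
  ssh "$h" "ln -s /usr/local/server/hadoop-2.9.2 /usr/local/server/hadoop && mkdir -pv /usr/local/server/hadoop/{data,name,tmp}"
done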
Format the NameNode on the master node:
hdfs namenode -format
Start HDFS from the master node (start-dfs.sh starts the daemons on all nodes over SSH):
start-dfs.sh
Start YARN, also from the master node:
start-yarn.sh
Check the cluster status:
hdfs dfsadmin -report
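Another quick check is to confirm which daemons run on each host. With this configuration, ambari.master.hadoop should show NameNode and ResourceManager, ambari.node1.hadoop additionally SecondaryNameNode, and the three worker nodes DataNode and NodeManager:
for h in ambari.master.hadoop ambari.node1.hadoop ambari.node2.hadoop ambari.node3.hadoop; do
  echo "== $h =="
  ssh "$h" /usr/local/jdk1.8.0_221/bin/jps   # full path, since non-interactive ssh does not read /etc/profile.d
done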
At this point the installation is complete; open the YARN web UI to check:
http://ambari.master.hadoop:8088/cluster
For reference, the commands to shut down the cluster, in order:
stop-yarn.sh
stop-dfs.sh
II. Spark installation
Download Spark:
http://spark.apache.org/downloads.html
Extract:
tar -xzvf spark-2.4.8-bin-without-hadoop.tgz
mv spark-2.4.8-bin-without-hadoop /usr/local/server/
Add environment variables:
vim /etc/profile.d/hadoop.sh
export SPARK_HOME=/usr/local/server/spark
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:$SPARK_HOME/sbin
Create HDFS directories:
hadoop fs -mkdir -p /tmp/spark/lib_jars/
hadoop fs -mkdir -p /eventLogs
hadoop fs -mkdir -p /user/hive/warehouse
Edit the configuration files:
vim spark-env.sh
export LD_LIBRARY_PATH=/usr/local/server/hadoop-2.9.2/lib/native
export JAVA_HOME=/usr/local/jdk1.8.0_221
export HADOOP_CONF_DIR=/usr/local/server/hadoop/etc/hadoop
export YARN_CONF_DIR=/usr/local/server/hadoop/etc/hadoop
export SPARK_CONF_DIR=/usr/local/server/spark/conf
export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)
vim spark-defaults.conf
spark.yarn.jars            hdfs://ambari.master.hadoop:9000/tmp/spark/lib_jars/*.jar
spark.eventLog.enabled     true
spark.eventLog.dir         hdfs://ambari.master.hadoop:9000/eventLogs
spark.eventLog.compress    true
spark.serializer           org.apache.spark.serializer.KryoSerializer
spark.master               yarn
To speed up Spark job startup, upload all jars from the Spark installation directory to HDFS:
hadoop fs -put $SPARK_HOME/jars/* /tmp/spark/lib_jars
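Optionally confirm the upload; spark.yarn.jars only helps if the jars are actually there:
hadoop fs -ls /tmp/spark/lib_jars | head
hadoop fs -count /tmp/spark/lib_jars   # prints directory count, file count, and total size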
Copy the Spark directory to every node (see the sketch after the symlink step below).
Distribute /etc/profile.d/hadoop.sh to every node.
Set up the symlink:
cd /usr/local/server/
ln -s /usr/local/server/spark-2.4.8-bin-without-hadoop spark
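As with Hadoop, a distribution sketch run from the master, assuming identical paths on every node:
for h in ambari.node1.hadoop ambari.node2.hadoop ambari.node3.hadoop; do
  scp -r /usr/local/server/spark-2.4.8-bin-without-hadoop "$h":/usr/local/server/   # copy the Spark tree
  scp /etc/profile.d/hadoop.sh "$h":/etc/profile.d/                                 # updated environment script
  ssh "$h" "ln -s /usr/local/server/spark-2.4.8-bin-without-hadoop /usr/local/server/spark"
done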
Verify the installation by launching the Spark shell:
spark-shell
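Beyond the local shell, one way to exercise Spark on YARN end to end is to submit the bundled SparkPi example; the jar name below is the one that ships with the 2.4.8 / Scala 2.11 distribution, so adjust it if yours differs:
spark-submit \
  --master yarn \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.8.jar 100
# In client mode (the default), the driver prints a line like "Pi is roughly 3.14..." when the job finishes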
The steps above set up Spark on YARN. To deploy Spark in standalone mode instead, follow these additional steps.
Spark standalone mode:
vim spark-env.sh
export SPARK_MASTER_IP=ambari.master.hadoop
vim spark-defaults.conf (comment out the YARN master setting)
#spark.master               yarn
vim slaves
ambari.node1.hadoop
ambari.node2.hadoop
ambari.node3.hadoop
Distribute the modified configuration files to every node.
Start Spark from $SPARK_HOME/sbin (Hadoop also ships a start-all.sh, so run it from Spark's own directory):
./start-all.sh
Verify:
spark-shell --master spark://ambari.master.hadoop:7077
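If standalone mode came up correctly, the master's web UI (port 8080 by default) should list the three workers; a quick check from the command line:
curl -s http://ambari.master.hadoop:8080 | grep -o "ambari.node[0-9].hadoop" | sort -u   # should print the three worker hostnames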