Deploying a Highly Available Hadoop + Spark + Hive Cluster

This article walks through building a highly available, distributed Hadoop + Spark + Hive cluster: cluster planning, environment preparation, installation and configuration of each component, and tuning. It focuses on a fully distributed Hadoop cluster, a highly available Spark cluster on YARN, and a Hive data warehouse.




Installing the Highly Available Hadoop + Spark + Hive Distributed Cluster

Cluster Planning

| Node IP | Alias | ZooKeeper | JournalNode | NodeManager | DataNode | ZKFC | NameNode | ResourceManager |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 192.168.99.61 | spark01 | zookeeper | JournalNode | NodeManager | DataNode | DFSZKFailoverController | NameNode | |
| 192.168.88.221 | spark02 | zookeeper | JournalNode | NodeManager | DataNode | DFSZKFailoverController | NameNode | ResourceManager |
| 192.168.99.98 | spark03 | zookeeper | JournalNode | NodeManager | DataNode | | | ResourceManager |

I. Fully Distributed Hadoop Cluster Installation

1. Environment Preparation

(1) Three servers

192.168.99.61, 192.168.88.221, 192.168.99.98

(2) Hostname resolution and passwordless SSH

Configure hostname resolution in /etc/hosts and passwordless SSH between all nodes (a sketch of the SSH setup follows the host list).

192.168.99.61   hadoop01 spark01 zookeeper01 hive01
192.168.88.221  hadoop02 spark02 zookeeper02 metastore01
192.168.99.98   hadoop03 spark03 zookeeper03 mysql01
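
A minimal sketch of the passwordless-SSH setup referenced above (assumes the root account and a standard OpenSSH installation; run on every node):

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for h in spark01 spark02 spark03; do ssh-copy-id root@$h; done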
(3) Paths

Installation path: /usr/local/spark

JAVA_HOME: /usr/java/jdk1.8.0_162 (default path of the RPM installation)

(4) Software versions
ZooKeeper: zookeeper-3.4.14.tar.gz
Hadoop: hadoop-2.7.0.tar.gz
JDK: jdk-8u162-linux-x64.rpm
Scala: scala-2.11.8.zip
Spark: spark-2.2.0-bin-hadoop2.7.tgz
Hive: apache-hive-2.1.1-bin.tar.gz
MySQL: mysql-community-5.6.16-1.el7 (RPM packages)

2. ZooKeeper Installation (Cluster Coordination)

Install ZooKeeper on all three nodes; the leader is elected automatically.

Installation directory: /usr/local/spark/zookeeper-3.4.14

Installation steps:

(1) Extract and enter the directory
tar -xf zookeeper-3.4.14.tar.gz -C /usr/local/spark/
cd /usr/local/spark/zookeeper-3.4.14/conf/
cp  zoo_sample.cfg   zoo.cfg
(2) Edit the configuration file
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/usr/local/spark/zookeeper-3.4.14/data/
dataLogDir=/usr/local/spark/zookeeper-3.4.14/logs/
clientPort=2181
server.1=zookeeper01:2888:3888
server.2=zookeeper02:2888:3888
server.3=zookeeper03:2888:3888
(3) Create the data directories and myid
mkdir -p /usr/local/spark/zookeeper-3.4.14/{data,logs}

Write a different value to myid on each node:

[root@spark01 hadoop]#	echo 1 > /usr/local/spark/zookeeper-3.4.14/data/myid
[root@spark02 hadoop]#	echo 2 > /usr/local/spark/zookeeper-3.4.14/data/myid
[root@spark03 hadoop]#	echo 3 > /usr/local/spark/zookeeper-3.4.14/data/myid
(4) Installing the remaining nodes

The steps are identical; only the myid value differs.

(5) Add the ZooKeeper environment variables

In /etc/profile:

export ZOOKEEPER_HOME=/usr/local/spark/zookeeper-3.4.14
export PATH=$PATH:$ZOOKEEPER_HOME/bin

Apply the changes: source /etc/profile

(6) Start the ZooKeeper cluster

Run on all three nodes: zkServer.sh start

Check the local ZooKeeper status: zkServer.sh status

As the status output shows, the ZooKeeper instance on spark02 is the cluster leader and the other nodes are followers; if the leader goes down, the remaining nodes elect a new leader.
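
The role can also be checked from the command line with ZooKeeper's four-letter commands (a quick check, assuming nc is installed):

echo stat | nc zookeeper02 2181 | grep Mode
# Expected output on the current leader:
# Mode: leader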

3. Hadoop Cluster Installation

(1) Extract and enter the directory
tar -xf hadoop-2.7.0.tar.gz -C /usr/local/spark/
cd /usr/local/spark/hadoop-2.7.0/etc/hadoop
(2) Set the Java environment variable in hadoop-env.sh, mapred-env.sh, and yarn-env.sh
JAVA_HOME=/usr/java/jdk1.8.0_162
(3) Edit core-site.xml
<configuration>
         <property>
 <!-- Address of the HDFS NameNodes -->
 <!-- The two NameNode addresses are grouped into the nameservice hadoop-cluster -->
         <name>fs.defaultFS</name>
         <value>hdfs://hadoop-cluster</value>
         </property>
 <!-- Storage directory for files generated by Hadoop at runtime -->
         <property>
         <name>hadoop.tmp.dir</name>
         <value>file:/usr/local/spark/hadoop-2.7.0/data/tmp/</value>
         </property>
 <!-- ZooKeeper quorum used by ZKFC for automatic failover -->
 <property>
      <name>ha.zookeeper.quorum</name>
      <value>zookeeper01:2181,zookeeper02:2181,zookeeper03:2181</value>
 </property>
 </configuration>
(4) Edit hdfs-site.xml
<configuration>
 <!-- HDFS replication factor (defaults to 3 if unset) -->
          <property>
              <name>dfs.replication</name>
              <value>2</value>
          </property>
 <!-- Nameservice ID of the HA cluster -->
          <property>
              <name>dfs.nameservices</name>
              <value>hadoop-cluster</value>
          </property>
 <!-- NameNode data directory -->
          <property>
              <name>dfs.namenode.name.dir</name>
              <value>file:/usr/local/spark/hadoop-2.7.0/data/dfs/name</value>
          </property>
 <!-- DataNode data directory -->
          <property>
              <name>dfs.datanode.data.dir</name>
              <value>file:/usr/local/spark/hadoop-2.7.0/data/dfs/data</value>
          </property>
 <!-- NameNodes that make up the nameservice -->
          <property>
              <name>dfs.ha.namenodes.hadoop-cluster</name>
              <value>nn01,nn02</value>
          </property>
 <!-- RPC address of nn01 -->
          <property>
              <name>dfs.namenode.rpc-address.hadoop-cluster.nn01</name>
              <value>hadoop01:8020</value>
          </property>
 <!-- RPC address of nn02 -->
          <property>
              <name>dfs.namenode.rpc-address.hadoop-cluster.nn02</name>
              <value>hadoop02:8020</value>
          </property>
 <!-- HTTP address of nn01 -->
          <property>
              <name>dfs.namenode.http-address.hadoop-cluster.nn01</name>
              <value>hadoop01:50070</value>
          </property>
 <!-- HTTP address of nn02 -->
          <property>
              <name>dfs.namenode.http-address.hadoop-cluster.nn02</name>
              <value>hadoop02:50070</value>
          </property>
 <!-- Location of the shared NameNode edit log on the JournalNodes -->
          <property>
              <name>dfs.namenode.shared.edits.dir</name>
              <value>qjournal://hadoop01:8485;hadoop02:8485;hadoop03:8485/hadoop-cluster</value>
          </property>
 <!-- Fencing method: only one NameNode may serve clients at a time -->
          <property>
              <name>dfs.ha.fencing.methods</name>
              <value>sshfence</value>
          </property>
 <!-- sshfence requires passwordless SSH -->
          <property>
              <name>dfs.ha.fencing.ssh.private-key-files</name>
              <value>/root/.ssh/id_rsa</value>
          </property>
 <!-- JournalNode storage directory -->
          <property>
              <name>dfs.journalnode.edits.dir</name>
              <value>/usr/local/spark/hadoop-2.7.0/data/journalnode</value>
          </property>
 <!-- Disable permission checking -->
          <property>       
              <name>dfs.permissions.enabled</name>
              <value>false</value>
          </property>
 <!-- Failover proxy provider used by clients to find the active NameNode -->
          <property>
              <name>dfs.client.failover.proxy.provider.hadoop-cluster</name>
              <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
          </property>
 <!-- Enable automatic failover -->
          <property>
              <name>dfs.ha.automatic-failover.enabled</name>
              <value>true</value>
          </property> 
 </configuration>
(5) Edit mapred-site.xml
<configuration>
 <!-- Run MapReduce on YARN -->
          <property>
                <name>mapreduce.framework.name</name>
              <value>yarn</value>
          </property>
 <!-- MapReduce JobHistory server host and port -->
          <property>   
              <name>mapreduce.jobhistory.address</name>   
              <value>hadoop01:10020</value>   
          </property> 
 <!-- JobHistory server web UI host and port -->
          <property>   
              <name>mapreduce.jobhistory.webapp.address</name>   
              <value>hadoop01:19888</value>   
          </property>
 <!-- Show at most 20000 historical jobs in the JobHistory web UI -->
          <property>
              <name>mapreduce.jobhistory.joblist.cache.size</name>
              <value>20000</value>
          </property>
 <!-- Job history log locations -->
          <property>
              <name>mapreduce.jobhistory.done-dir</name>
              <value>${yarn.app.mapreduce.am.staging-dir}/history/done</value>
          </property>
          <property>
              <name>mapreduce.jobhistory.intermediate-done-dir</name>
              <value>${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate</value>
          </property>
          <property>
              <name>yarn.app.mapreduce.am.staging-dir</name>
              <value>/usr/local/spark/hadoop-2.7.0/data/hadoop-yarn/staging</value>
          </property>
 </configuration>
(6) Edit yarn-site.xml
<configuration>
 <!-- Shuffle service used by reducers to fetch data -->
          <property>
              <name>yarn.nodemanager.aux-services</name>
             <value>mapreduce_shuffle</value>
         </property>
 <!-- Enable ResourceManager HA -->
         <property>
             <name>yarn.resourcemanager.ha.enabled</name>
             <value>true</value>
         </property>
 <!-- Declare the two ResourceManagers -->
         <property>
             <name>yarn.resourcemanager.cluster-id</name>
             <value>rmCluster</value>
         </property>
         <property>
             <name>yarn.resourcemanager.ha.rm-ids</name>
             <value>rm01,rm02</value>
         </property>
         <property>
             <name>yarn.resourcemanager.hostname.rm01</name>
             <value>hadoop02</value>
         </property>
         <property>
             <name>yarn.resourcemanager.hostname.rm02</name>
             <value>hadoop03</value>
          </property>
 <!-- ZooKeeper quorum address -->
          <property>
             <name>yarn.resourcemanager.zk-address</name>
             <value>zookeeper01:2181,zookeeper02:2181,zookeeper03:2181</value>
          </property>
 <!-- Enable automatic recovery -->
          <property>
             <name>yarn.resourcemanager.recovery.enabled</name>
             <value>true</value>
         </property>
 <!-- Store ResourceManager state in the ZooKeeper cluster -->
         <property>
             <name>yarn.resourcemanager.store.class</name>    
             <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
         </property>
 </configuration>
(7) Edit the slaves file
[root@spark01 hadoop]# cat slaves
 hadoop01
 hadoop02
 hadoop03
(8) Deploy to the other nodes
scp -r /usr/local/spark/hadoop-2.7.0/ spark02:/usr/local/spark/
scp -r /usr/local/spark/hadoop-2.7.0/ spark03:/usr/local/spark/
(9) Configure the Hadoop environment variables on every node
export HADOOP_HOME=/usr/local/spark/hadoop-2.7.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Apply the changes: source /etc/profile

(10) Start the components
[1] Start the JournalNodes
hadoop-daemon.sh start journalnode
[root@spark01 hadoop]# hadoop-daemon.sh start journalnode
[root@spark02 hadoop]# hadoop-daemon.sh start journalnode
[root@spark03 hadoop]# hadoop-daemon.sh start journalnode
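
On a brand-new cluster, nn01 must be formatted once before it is started; this formatting step is assumed here and should be skipped if the NameNode metadata directory already exists:

[root@spark01 hadoop]# hdfs namenode -format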
[2] Start the NameNodes
hadoop-daemon.sh start namenode

Start nn01 on spark01 first:

[root@spark01 hadoop]# hadoop-daemon.sh start namenode

On spark02, bootstrap nn02 by copying nn01's metadata from spark01:

[root@spark02 hadoop]# hdfs namenode -bootstrapStandby

Then start nn02 on spark02:

[root@spark02 hadoop]#  hadoop-daemon.sh start namenode
[3] Start the DataNodes

Run on any node:

hadoop-daemons.sh start datanode
[root@spark01 hadoop]#     hadoop-daemons.sh start datanode
[4] Start ZKFC and set the NameNode states

At this point both NameNodes are in standby state, and one of them needs to become active. Start the ZKFC component on spark01 and spark02; whichever node starts it first becomes active:

hadoop-daemon.sh start zkfc
[root@spark01 hadoop]#  hadoop-daemon.sh start zkfc
[root@spark02 hadoop]#  hadoop-daemon.sh start zkfc

After ZKFC starts on both nodes, one NameNode becomes active and the other remains standby.

If needed, a NameNode state can also be forced manually:

hdfs haadmin -transitionToActive nn01 --forcemanual 
hdfs haadmin -transitionToStandby nn01 --forcemanual
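
The current state of each NameNode can be checked at any time with haadmin (using the IDs defined in hdfs-site.xml):

hdfs haadmin -getServiceState nn01
hdfs haadmin -getServiceState nn02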
[5] Automatic NameNode failover

Automatic failover requires initializing the HA state in ZooKeeper with hdfs zkfc -formatZK:

[root@spark01 hadoop]# hdfs zkfc -formatZK

[6] Start YARN

Run start-yarn.sh on any node; this launches the NodeManager processes (visible in jps).

[root@spark01 ~]# start-yarn.sh
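
Note that start-yarn.sh does not launch the MapReduce JobHistory server configured in mapred-site.xml; if it is needed, it can be started separately with the standard Hadoop daemon script (a sketch, run on hadoop01):

[root@spark01 ~]# mr-jobhistory-daemon.sh start historyserver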
[7] Start the ResourceManagers

Run on spark02 and spark03:

yarn-daemon.sh start resourcemanager
[root@spark02 hadoop]# yarn-daemon.sh start resourcemanager

[root@spark03 hadoop]# yarn-daemon.sh start resourcemanager

Check the ResourceManager state.

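One way to check from the command line, using the rm IDs defined in yarn-site.xml (one ResourceManager should report active and the other standby):

yarn rmadmin -getServiceState rm01
yarn rmadmin -getServiceState rm02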

[8] Check the running processes with jps


(11) Testing automatic NameNode failover

The NameNode on spark01 is currently active. Kill that process and watch the state change.
Kill the NameNode process on spark01.
After the kill, the NameNode web UI on hadoop01 is no longer reachable, and the active role fails over to the NameNode on hadoop02.

II. Spark High-Availability Distributed Cluster on YARN

1. Scala Installation

(1) Extract
unzip -d /usr/local/spark/	scala-2.11.8.zip
(2) Configure environment variables

Edit /etc/profile:

SCALA_HOME=/usr/local/spark/scala-2.11.8
PATH=$PATH:$SCALA_HOME/bin

2. Spark Installation

(1) Extract and enter the directory
[root@spark01 ~]# tar -xf spark-2.2.0-bin-hadoop2.7.tgz -C  /usr/local/spark/
[root@spark01 ~]# cd  /usr/local/spark/spark-2.2.0-bin-hadoop2.7/conf/
(2) Edit spark-env.sh
[root@spark01 conf]# cp spark-env.sh.template spark-env.sh
[root@spark01 conf]# vim spark-env.sh
#!/usr/bin/env bash
# Hostname or IP of the default master
export SPARK_MASTER_HOST=spark01
# Default port (7077) on which the master accepts submissions
export SPARK_MASTER_PORT=7077
# Web UI port of the master
export SPARK_MASTER_WEBUI_PORT=8080
# Memory available to each worker
export SPARK_WORKER_MEMORY=1g
# Total number of cores Spark applications may use on this machine (default: all available cores)
export SPARK_WORKER_CORES=1
# Number of worker instances per node (optional)
export SPARK_WORKER_INSTANCES=1
# Directory containing the Hadoop (client-side) configuration; required when running on YARN
export HADOOP_CONF_DIR=/usr/local/spark/hadoop-2.7.0/etc/hadoop
# Keep cluster state (including recovery) in ZooKeeper
export SPARK_DAEMON_JAVA_OPTS="      
-Dspark.deploy.recoveryMode=ZOOKEEPER 
-Dspark.deploy.zookeeper.url=zookeeper01:2181,zookeeper02:2181,zookeeper03:2181
-Dspark.deploy.zookeeper.dir=/spark/data"
(3) Edit the slaves file
[root@spark01 conf]# grep -Pv "^$|#" slaves
spark01
spark02
spark03
(4) Deploy to the other nodes
[root@spark01 conf]# scp spark-env.sh slaves spark02:/usr/local/spark/spark-2.2.0-bin-hadoop2.7/conf/
[root@spark01 conf]# scp spark-env.sh slaves spark03:/usr/local/spark/spark-2.2.0-bin-hadoop2.7/conf/

On spark02, edit spark-env.sh and change the master address to spark02:

export SPARK_MASTER_HOST=spark02
(5) Configure environment variables
SPARK_HOME=/usr/local/spark/spark-2.2.0-bin-hadoop2.7
PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
[root@spark01 conf]# source /etc/profile
(6) Start the nodes

Start the masters:

[root@spark01 spark-2.2.0-bin-hadoop2.7]# start-master.sh
[root@spark02 spark-2.2.0-bin-hadoop2.7]# start-master.sh

Start the workers:

Start a worker on each node, pointing it at the master:

[root@spark01 spark-2.2.0-bin-hadoop2.7]# start-slave.sh spark://spark01:7077
[root@spark02 spark-2.2.0-bin-hadoop2.7]# start-slave.sh spark://spark01:7077
[root@spark03 spark-2.2.0-bin-hadoop2.7]# start-slave.sh spark://spark01:7077

Or start all workers at once:

[root@spark01 spark-2.2.0-bin-hadoop2.7]# start-slaves.sh

Check the processes with jps:

[root@spark01 spark-2.2.0-bin-hadoop2.7]# jps
3445 NodeManager
3142 DFSZKFailoverController
16775 Jps
2811 JournalNode
3707 QuorumPeerMain
16495 Master
16299 Worker
16637 NameNode
3037 DataNode
[root@spark02 spark-2.2.0-bin-hadoop2.7]# jps
27556 Jps
2029 JournalNode
4769 NodeManager
4516 NameNode
2284 DataNode
4042 DFSZKFailoverController
27050 Master
27205 Worker
27371 QuorumPeerMain
6095 ResourceManager
 [root@spark03 spark-2.2.0-bin-hadoop2.7]# jps
14625 QuorumPeerMain
513 JournalNode
7770 Worker
8923 NodeManager
9356 Jps
13453 ResourceManager
2302 DataNode

Check the status in the web UI.

(7) High-availability test

Kill the master process on spark01; the workers re-register with the standby master on spark02.

[root@spark01 spark-2.2.0-bin-hadoop2.7]# jps
3445 NodeManager
3142 DFSZKFailoverController
2811 JournalNode
3707 QuorumPeerMain
12795 Worker
13084 Jps
3037 DataNode
12687 Master
[root@spark01 spark-2.2.0-bin-hadoop2.7]# kill -9 12687

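When connecting to the standalone HA masters, both master addresses can be listed so the client follows whichever one is currently active (an illustrative example):

spark-shell --master spark://spark01:7077,spark02:7077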

(8) Configure the history server

Edit the configuration file spark-2.2.0-bin-hadoop2.7/conf/spark-defaults.conf:

# Enable event logging
spark.eventLog.enabled          true
# Compress the event logs
spark.eventLog.compress         true
# Directory where event logs are stored
spark.eventLog.dir              hdfs://spark01:8020/spark/applicationHistory
# Location from which the HistoryServer loads event logs
spark.history.fs.logDirectory   hdfs://spark01:8020/spark/applicationHistory

Create the HDFS directory:

[root@spark01 conf]# hdfs dfs -mkdir -p /spark/applicationHistory
[root@spark01 conf]# hdfs dfs -ls /spark
Found 1 items
drwxr-xr-x   - root supergroup          0 2019-12-27 15:13 /spark/applicationHistory
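
With the directory in place, the history server can be started with Spark's standard sbin script (assumed to be on PATH via $SPARK_HOME/sbin):

[root@spark01 conf]# start-history-server.sh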

Check jps:

[root@spark01 hadoop]# jps 
3445 NodeManager
3142 DFSZKFailoverController
2811 JournalNode
3707 QuorumPeerMain
16299 Worker
19820 HistoryServer
20412 Jps
3037 DataNode
16637 NameNode
16495 Master

Check the history server web UI.


III. Hive Data Warehouse Setup

Node assignment

192.168.99.61: Hive, remote metastore server
192.168.88.221: Hive client
192.168.99.98: MySQL server

1. MySQL Installation

(1) Remove MariaDB and prepare the downloaded packages
[root@spark03 mysql]# rpm -qa |grep mariadb
[root@spark03 mysql]# yum remove -y mariadb-libs-5.5.52-1.el7.x86_64
[root@spark03 mysql]# ll
total 82644
-rw-r--r--. 1 root root 19730692 Feb 13 2019 mysql-community-client-5.6.16-1.el7.x86_64.rpm
-rw-r--r--. 1 root root   252116 Feb 13 2019 mysql-community-common-5.6.16-1.el7.x86_64.rpm
-rw-r--r--. 1 root root  3486556 Feb 13 2019 mysql-community-devel-5.6.16-1.el7.x86_64.rpm
-rw-r--r--. 1 root root  2070536 Feb 13 2019 mysql-community-libs-5.6.16-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 59068576 Feb 13 2019 mysql-community-server-5.6.16-1.el7.x86_64.rpm
(2) Install MySQL
[root@spark03 mysql]# rpm -ivh mysql-community-common-5.6.16-1.el7.x86_64.rpm
[root@spark03 mysql]# rpm -ivh mysql-community-libs-5.6.16-1.el7.x86_64.rpm
[root@spark03 mysql]# rpm -ivh mysql-community-devel-5.6.16-1.el7.x86_64.rpm
[root@spark03 mysql]# rpm -ivh mysql-community-client-5.6.16-1.el7.x86_64.rpm
[root@spark03 mysql]# yum install -y mysql-community-server-5.6.16-1.el7.x86_64.rpm
[root@spark03 mysql]# systemctl start mysqld
[root@spark03 mysql]# mysqladmin -uroot  password 123456
(3) Grant privileges
[root@spark03 mysql]# mysql -uroot -p123456
mysql> grant all privileges on *.* to root@'%' identified by '123456';
mysql> grant all privileges on *.* to root@'localhost' identified by '123456';
mysql> grant all privileges on *.* to root@'192.168.99.98' identified by '123456';
mysql> flush privileges;
mysql> \q
Bye
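
A quick check from the metastore node that remote access works (assumes the mysql client package is installed on spark01):

[root@spark01 ~]# mysql -h mysql01 -uroot -p123456 -e "select version();"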

2. Hive Installation: Remote Metastore Server and Client

(1) Extract and enter the directory
[root@spark01 ~]# tar -xf apache-hive-2.1.1-bin.tar.gz -C /usr/local/spark/
[root@spark01 ~]# cd /usr/local/spark/apache-hive-2.1.1-bin/conf
(2) Configure environment variables

Edit /etc/profile:

HIVE_HOME=/usr/local/spark/apache-hive-2.1.1-bin
PATH=$PATH:$HIVE_HOME/bin

Apply the changes:

[root@spark01 ~]# source /etc/profile
(3) Edit hive-env.sh
[root@spark01 conf]# cp hive-env.sh.template hive-env.sh
[root@spark01 conf]# vim hive-env.sh
HADOOP_HOME=/usr/local/spark/hadoop-2.7.0
HIVE_CONF_DIR=/usr/local/spark/apache-hive-2.1.1-bin/conf
(4) Edit hive-log4j2.properties
[root@spark01 conf]# cp hive-log4j2.properties.template hive-log4j2.properties
[root@spark01 conf]# vim hive-log4j2.properties
Change the following property to set the Hive log directory:
property.hive.log.dir = /usr/local/spark/apache-hive-2.1.1-bin/logs

Create the Hive log directory:

[root@spark01 apache-hive-2.1.1-bin]# mkdir logs
(5) Edit the server-side configuration file
[root@spark01 conf]# cp hive-default.xml.template hive-site.xml
[root@spark01 conf]# echo > hive-site.xml
[root@spark01 conf]# vim hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- HDFS scratch directory for Hive jobs -->
<property>
    <name>hive.exec.scratchdir</name>
    <value>/spark/hive/data</value>
</property>
<!-- Permissions for the Hive scratch directory on HDFS -->
<property>
    <name>hive.scratch.dir.permission</name>
    <value>733</value>
</property>
<!-- Hive warehouse directory on HDFS -->
<property>  
  <name>hive.metastore.warehouse.dir</name>  
  <value>/spark/hive/warehouse</value>   
</property>
<!-- JDBC URL of the metastore database -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysql01:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>  
</property>  
<!-- JDBC driver -->
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>  
  <value>com.mysql.jdbc.Driver</value>  
</property> 
<!-- Database user -->
<property>  
  <name>javax.jdo.option.ConnectionUserName</name>  
  <value>root</value>
</property> 
<!-- Database password -->
<property>  
  <name>javax.jdo.option.ConnectionPassword</name>  
  <value>123456</value>
</property>
<!-- Show column headers in the CLI -->
 <property>
  <name>hive.cli.print.header</name>
  <value>true</value>
</property>
<!-- Show the current database name in the CLI -->
<property>
  <name>hive.cli.print.current.db</name>
  <value>true</value>
</property>
<!-- Log locations -->
<property>
    <name>hive.exec.local.scratchdir</name>
    <value>/usr/local/spark/apache-hive-2.1.1-bin/logs/HiveJobsLog</value>
</property>
<property>
    <name>hive.downloaded.resources.dir</name>
    <value>/usr/local/spark/apache-hive-2.1.1-bin/logs/ResourcesLog</value>
</property>
<property>
    <name>hive.querylog.location</name>
    <value>/usr/local/spark/apache-hive-2.1.1-bin/logs/HiveRunLog</value>
</property>
<property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/usr/local/spark/apache-hive-2.1.1-bin/logs/OpertitionLog</value>
</property>
</configuration>
(6) Edit the client-side configuration file

First repeat steps (1) through (5) on the client node, then edit the client's hive-site.xml:

[root@spark02 conf]# cp hive-default.xml.template hive-site.xml
[root@spark02 conf]# echo > hive-site.xml
[root@spark02 conf]# vim hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- HDFS scratch directory for Hive jobs -->
<property>
    <name>hive.exec.scratchdir</name>
    <value>/spark/hive/data</value>
</property>
<!-- Permissions for the Hive scratch directory on HDFS -->
<property>
    <name>hive.scratch.dir.permission</name>
    <value>733</value>
</property>
<!-- Hive warehouse directory on HDFS (default) -->
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/spark/hive/warehouse</value>
</property>
<!-- URI of the remote metastore service -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://hive01:9083</value>
</property>
<!-- Show column headers in the CLI -->
 <property>
  <name>hive.cli.print.header</name>
  <value>true</value>
</property>
<!-- Show the current database name in the CLI -->
<property>
  <name>hive.cli.print.current.db</name>
  <value>true</value>
</property>
</configuration>
(7) Install the MySQL JDBC connector

Server:

[root@spark01 ~]# unzip mysql-connector-java-5.1.46.zip
[root@spark01 ~]# cd mysql-connector-java-5.1.46/
[root@spark01 mysql-connector-java-5.1.46]# cp mysql-connector-java-5.1.46-bin.jar /usr/local/spark/apache-hive-2.1.1-bin/lib/

Client:

[root@spark01 mysql-connector-java-5.1.46]# scp mysql-connector-java-5.1.46-bin.jar  spark02:/usr/local/spark/apache-hive-2.1.1-bin/lib/
(8) Start Hive

Initialize the metastore schema on the server:

[root@spark01 apache-hive-2.1.1-bin]# schematool -dbType mysql -initSchema

Start the Hive metastore service:

[root@spark01 apache-hive-2.1.1-bin]# hive --service metastore &>/usr/local/spark/apache-hive-2.1.1-bin/logs/hive-start.log &

Connect to Hive from the client:

[root@spark02 conf]# hive
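
A short smoke test from the client to confirm that the metastore and the HDFS warehouse are reachable (the database and table names below are illustrative):

[root@spark02 conf]# hive -e "create database if not exists test_db; use test_db; create table t1 (id int, name string); show tables;"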

IV. Tuning

1. YARN Tuning

Edit yarn-site.xml:

<!-- Minimum memory allocation per container, in MB -->
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>750</value>
    </property>
<!-- Maximum memory allocation per container, in MB -->
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>4096</value>
    </property>
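
To validate the whole Spark-on-YARN setup end to end, the bundled SparkPi example can be submitted (jar path as shipped in spark-2.2.0-bin-hadoop2.7; adjust if your layout differs):

spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar 100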


