2021-02-11 Big Data Course Notes, Day 22

Rich Dad

License: CC 4.0 BY-SA

Article link: https://blog.csdn.net/qq_44745905/article/details/113789430

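Day 3, Offline Project Part 3: Processing New-User Data

Time dimension
Browser dimension
Platform dimension
KPI, a utility dimension
The new-user metric is computed for the various combinations of these four dimensions.

Course outline

Module design approach
New-user metric: mapper development
New-user metric: reducer development
New-user metric: Runner development
Storing MapReduce results in MySQL
New-user metric: run results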

hbase

uuid,servertime,browser,platform,kpi


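Event: launch

	Time
	Browser
	Platform
	KPI module
	

Time      r
Time, browser    r
Time, platform      r
Time, browser, platform   r


Time: group by
Time, browser: group by

Phones

Brands: two brands
Models: three models

Questions:
1. How many phones are in stock in total?
2. How many phones of a given brand are in stock?
3. How many phones of a given brand and model are in stock?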

phone1   brand1  type1
phone2   brand2  type2
phone3   brand2  type3
phone4   brand1  type2

phone1   brand1  type1

phone1   brand1   type1
phone1   brand2   type1
phone2   brand1   type3

phone1   brand1  type1

phone1   
phone1   type1
phone1  brand1  type1

brand1  phone1
type1    phone1
brand1 type1  phone1

phone1

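Class diagram of the dimension combinations: (image not included)

Module design approach

map
Data fan-out
Dimension combination

reduce
Aggregation and counting

1. Read the data from HBase
2. Combine the dimensions
3. Aggregate in the reducer
4. Write the data to MySQL
a) TableMapReduceUtil.initTableReducerJob();
b) Implement your own OutputFormat that writes to MySQL

New-user metric: mapper development

TransformerOutputFormat.class is the custom class used to insert data into MySQL.

Four dimensions: time, browser, platform, module
Filter the data down to LAUNCH_EVENT records.
Combine the four dimensions and emit the key-value pairs.
How many dimension combinations are there?
Multiply together the number of distinct values of each dimension.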
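
The map-side fan-out can be pictured with a small Python sketch. It is not the project's mapper: the event field names are taken from the HBase column list earlier in these notes, and the "all" placeholder (meaning "this dimension is not restricted") is an assumption made for illustration. Each launch event is emitted once per dimension combination, so one event yields 2 x 2 = 4 keys here.

```python
from itertools import product

ALL = "all"  # assumed placeholder: the dimension is left unrestricted


def fan_out(event):
    """Yield one (dimension key, uuid) pair per dimension combination."""
    time, kpi = event["servertime"], event["kpi"]
    # Each of browser and platform is either unrestricted or the concrete value.
    for browser, platform in product((ALL, event["browser"]), (ALL, event["platform"])):
        yield (time, browser, platform, kpi), event["uuid"]


# Hypothetical launch event used only to exercise the sketch.
event = {"uuid": "u1", "servertime": "2021-02-11", "browser": "chrome",
         "platform": "website", "kpi": "new_user"}
pairs = list(fan_out(event))
for key, uuid in pairs:
    print(key, uuid)
```

The four keys correspond to the groupings listed earlier: time only, time plus browser, time plus platform, and time plus browser plus platform.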

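New-user metric: reducer development

Because the metric counts users, the reducer must deduplicate the log records by uuid: the same person may have clicked several times.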
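
A minimal Python sketch of the deduplication, assuming the reducer receives (dimension key, uuid) pairs; the function name and the sample data are hypothetical, and the real reducer works on Hadoop writables rather than tuples.

```python
from collections import defaultdict


def count_new_users(pairs):
    """Group uuids by dimension key and count the distinct uuids per key."""
    seen = defaultdict(set)
    for key, uuid in pairs:
        seen[key].add(uuid)  # a set silently drops repeated uuids
    return {key: len(uuids) for key, uuids in seen.items()}


pairs = [
    (("2021-02-11", "chrome"), "u1"),
    (("2021-02-11", "chrome"), "u1"),  # the same user launched twice
    (("2021-02-11", "chrome"), "u2"),
    (("2021-02-11", "firefox"), "u3"),
]
counts = count_new_users(pairs)
print(counts)
```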

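New-user metric: Runner development

1. Add filters to the scan and set startKey and stopKey
2. Restrict the scan to events with en=e_l
3. Specify the column names to fetch
MultipleColumnPrefixFilter
4. Specify the table name
scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, EventLogContants.HBASE_NAME_EVENT_LOGS.getBytes());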

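TestDataMaker generates the test data in HBase.

In service.impl, change the MySQL connection string
In transformer-env.xml, change the MySQL connection string
In com.sxt.transformer.service.impl.DimensionConverterImpl, change the database connection string

Storing MapReduce results in MySQL

The RecordWriter in the TransformerOutputFormat class writes the output to MySQL.

New-user metric: run results

Active-user statistics

About the architecture:
The nginx log files need to be rolled, with flume monitoring a directory, so that the log file on the nginx side does not grow too large.
ETL: already packaged by the company.

1. Active users are counted from PV events

2. Dimensions needed:
time, platform, module, browser
3. Use StatsUserDimension as the key and TimeOutputValue as the output value
Where appropriate, define separate dimension classes and a wrapper class that bundles the dimension information.
4. Configure output-collector.xml and query-mapping.xml
output-collector.xml names your own implementation of how to set values on the PreparedStatement and call addBatch().
query-mapping.xml holds the SQL statements: how to query or insert parent-table rows, how to obtain the main-table ID, and how to insert the data into MySQL.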
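
The division of labour between the two files can be sketched in Python. Everything here is an assumption for illustration: the KPI name, the table and its columns are invented, a dict stands in for query-mapping.xml, a plain function stands in for the collector class, and SQLite replaces MySQL only so the sketch runs without a database server.

```python
import sqlite3

# Assumed stand-in for query-mapping.xml: KPI name -> parameterised SQL.
QUERY_MAPPING = {
    "new_install_user":
        "INSERT INTO stats_user (date_id, platform_id, new_users) VALUES (?, ?, ?)",
}


def collect(key, value):
    """Assumed stand-in for a collector: turn one reducer record into SQL parameters."""
    date_id, platform_id = key
    return (date_id, platform_id, value)


def write_batch(conn, kpi, records, batch_size=2):
    """Buffer parameters and flush them in batches, in the spirit of addBatch()."""
    sql, batch = QUERY_MAPPING[kpi], []
    for key, value in records:
        batch.append(collect(key, value))
        if len(batch) >= batch_size:
            conn.executemany(sql, batch)
            batch.clear()
    if batch:  # flush whatever is left over
        conn.executemany(sql, batch)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats_user (date_id INTEGER, platform_id INTEGER, new_users INTEGER)")
write_batch(conn, "new_install_user", [((1, 1), 5), ((1, 2), 3), ((2, 1), 7)])
rows = conn.execute(
    "SELECT date_id, platform_id, new_users FROM stats_user ORDER BY date_id, platform_id"
).fetchall()
print(rows)
```

Keeping the SQL in a lookup table and the parameter binding in a separate function is what lets a new KPI be added without touching the writer itself.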

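How do you add a new module to the project?

1. Develop the Runner, Mapper and Reducer
2. Add any extra single-dimension objects
3. Add the dimension-combination object
4. Add the corresponding collector class
5. Modify the configuration files
a) Add the corresponding SQL statement to query-mapping.xml
b) Add the corresponding reflectively loaded class to output-collector.xml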

Day 5: Offline Analysis Project for Big-Data Website Logs

1 Integrating Hive and HBase
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration

Notes:

Version information
As of Hive 0.9.0 the HBase integration requires at least HBase 0.92, earlier versions of Hive were working with HBase 0.89/0.90

That is, Hive 0.9.0 and later need HBase 0.92 or newer.

Version information
Hive 1.x will remain compatible with HBase 0.98.x and lower versions. Hive 2.x will be compatible with HBase 1.x and higher. (See HIVE-10990 for details.) Consumers wanting to work with HBase 1.x using Hive 1.x will need to compile Hive 1.x stream code themselves.

That is, Hive 1.x remains compatible with HBase 0.98.x and lower.

HIVE-705 introduced native support for integrating Hive and HBase. Hive QL statements, including SELECT and INSERT, can access HBase tables, and Hive can even join or union Hive tables with HBase tables.

Required jar (ships with Hive):

hive-hbase-handler-x.y.z.jar

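On the Hive server side (Hive remote mode: a metastore server plus the hive client): (image not included)

Then start the metastore as usual: hive --service metastore
Start the client CLI: hive

To operate on an HBase table from Hive, the columns must be mapped.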

CREATE external TABLE hbase_table_1(key string, value string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz",
 "hbase.mapred.output.outputtable" = "xyz");

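The hbase.columns.mapping property is required.
The hbase.table.name property is optional. It names the corresponding table in HBase, which lets the Hive table use a different name. In the example above, the table is hbase_table_1 in Hive and xyz in HBase. If it is omitted, the Hive and HBase table names are the same.
The hbase.mapred.output.outputtable property is optional, but it is required when inserting data into the table. Its value is passed to hbase.mapreduce.TableOutputFormat.

After the table is created with cf1:val in hbase.columns.mapping, HBase shows only cf1 and not val: the qualifier val exists only in the rows that store it, whereas the column family cf1 is part of the table-level metadata in HBase.

Walkthrough:
hive: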

CREATE TABLE hbase_table_1(key int, value string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz", "hbase.mapred.output.outputtable" = "xyz");

hbase:

list
desc 'xyz'

In hive:

insert into hbase_table_1 values(1,'zhangsan');

In hbase:

scan 'xyz'

Creating an external table requires that the corresponding table already exists in HBase.

In hbase:

create 'tb_user', 'info'

In hive:

create external table hive_tb_user1 (
key int,
name string,
age int,
sex string,
likes array<string>
)
row format
delimited
collection items terminated by '-'
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping"=":key,info:name,info:age,info:sex,info:likes")
tblproperties("hbase.table.name"="tb_user", "hbase.mapred.output.outputtable"="tb_user");

from tb_test
insert into table hive_tb_user1
select 1,'zhangsan',25,'female',array('climbing','reading','shopping') limit 1;

In hbase:

scan 'tb_user'
put 'tb_user', 1, 'info:likes', 'like1-like2-like3-like4'

Hive and HBase
The following configuration must be added on the Hive server side,
in hive-site.xml:

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>node2,node3,node4</value>
</property>

hive --service metastore

On the client, simply start hive:
hive

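1. Creating a Hive managed (internal) table requires that no corresponding table exists in HBase
2. Creating a Hive external table requires that the corresponding table already exists in HBase
3. The column mapping is declared with

a)	WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:id,cf:username,cf:age")

4. stored by names the class that handles storage for the Hive table: it writes the data into HBase, and when Hive reads, it translates between the HBase data and the Hive columns
a) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
5. tblproperties states which HBase table the Hive table corresponds to; outputtable states which HBase table receives the data when Hive runs an insert.

TBLPROPERTIES ("hbase.table.name" = "my_table", "hbase.mapred.output.outputtable" = "my_table");

6. If the HBase table has the same name as the Hive table, tblproperties can be omitted.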
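
The positional nature of hbase.columns.mapping is easy to get wrong, so here is a small Python sketch of how the entries line up with the Hive columns. The helper parse_columns_mapping is hypothetical and only mimics the pairing; it is not how the storage handler is implemented.

```python
def parse_columns_mapping(hive_columns, mapping):
    """Pair each Hive column with its HBase target by position."""
    entries = [entry.strip() for entry in mapping.split(",")]
    if len(entries) != len(hive_columns):
        raise ValueError("mapping must have one entry per Hive column")
    result = {}
    for column, entry in zip(hive_columns, entries):
        if entry == ":key":
            result[column] = ("row key", None)
        else:
            family, _, qualifier = entry.partition(":")
            result[column] = (family, qualifier)
    return result


# Columns and mapping string copied from the tb_user example in these notes.
mapping = parse_columns_mapping(
    ["key", "name", "age", "sex", "likes"],
    ":key,info:name,info:age,info:sex,info:likes",
)
# Only the column families need to exist in the HBase table definition.
families = sorted({fam for fam, qual in mapping.values() if qual is not None})
print(mapping)
print(families)
```

For the tb_user example this yields a single family, info, which matches the create 'tb_user', 'info' statement run in the HBase shell.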

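2 Sqoop: introduction, installation, and data import

Sqoop: a tool for transferring data between relational databases (oracle, mysql, postgresql, etc.) and Hadoop
Website: http://sqoop.apache.org/
Versions (the two lines are completely incompatible with each other; sqoop1 is the most widely used):
sqoop1: 1.4.x
sqoop2: 1.99.x
Similar products
DataX: Alibaba's data exchange tool
Sqoop's architecture is very simple, one of the simplest in the Hadoop ecosystem.
In sqoop1 the client connects to Hadoop directly; each task is parsed and turned into a corresponding MapReduce job for execution.

Within the MapReduce job, the input and output are configured through InputFormat and OutputFormat. (image not included)
2.1 sqoop import:
2.2 sqoop export
2.3 sqoop installation and testing

Unpack the archive
Configure the environment variables

SQOOP_HOME
PATH

Add the database driver jar
sqoop-env.sh: no changes needed
Comment out lines 134-147 of bin/configure-sqoop to suppress unnecessary warnings.
Test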

sqoop version
	sqoop list-databases --connect jdbc:mysql://node4:3306/ --username root --password 123456
	sqoop help
	sqoop help command

Run directly on the command line:

sqoop list-databases --connect jdbc:mysql://node1:3306 --username hive --password hive123

Put the sqoop arguments in a file:
sqoop1.txt
######################
list-databases
--connect
jdbc:mysql://node4:3306
--username
hive
--password
hive123
######################
Then run on the command line:
sqoop --options-file sqoop1.txt
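
The options file holds the tool name and each argument on its own line. As a rough illustration, the following Python sketch converts a command line into that layout; the helper to_options_file is hypothetical, and it does not handle values containing spaces, which would need quoting in a real options file.

```python
import shlex


def to_options_file(command_line):
    """Turn a sqoop command line into options-file text, one token per line."""
    tokens = shlex.split(command_line)
    if tokens and tokens[0] == "sqoop":
        tokens = tokens[1:]  # the file holds the tool and its arguments, not the program name
    return "\n".join(tokens) + "\n"


text = to_options_file(
    "sqoop list-databases --connect jdbc:mysql://node4:3306 --username hive --password hive123"
)
print(text)
```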

[root@node4 sqoop-1.4.6]# sqoop help list-databases
usage: sqoop list-databases [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
   --connect <jdbc-uri>                         Specify JDBC connect
                                                string
   --connection-manager <class-name>            Specify connection manager
                                                class name
   --connection-param-file <properties-file>    Specify connection
                                                parameters file
   --driver <class-name>                        Manually specify JDBC
                                                driver class to use
   --hadoop-home <hdir>                         Override
                                                $HADOOP_MAPRED_HOME_ARG
   --hadoop-mapred-home <dir>                   Override
                                                $HADOOP_MAPRED_HOME_ARG
   --help                                       Print usage instructions
-P                                              Read password from console
   --password <password>                        Set authentication
                                                password
   --password-alias <password-alias>            Credential provider
                                                password alias
   --password-file <password-file>              Set authentication
                                                password file path
   --relaxed-isolation                          Use read-uncommitted
                                                isolation for imports
   --skip-dist-cache                            Skip copying jars to
                                                distributed cache
   --username <username>                        Set authentication
                                                username
   --verbose                                    Print more information
                                                while working

Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

To export from Hive to MySQL, sqoop must be installed on a Hive host (for example, the machine where the Hive client runs).

$CONDITIONS

[root@server3 ~]# sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
   --connect <jdbc-uri>                         Specify JDBC connect
                                                string
   --connection-manager <class-name>            Specify connection manager
                                                class name
   --connection-param-file <properties-file>    Specify connection
                                                parameters file
   --driver <class-name>                        Manually specify JDBC
                                                driver class to use
   --hadoop-home <hdir>                         Override
                                                $HADOOP_MAPRED_HOME_ARG
   --hadoop-mapred-home <dir>                   Override
                                                $HADOOP_MAPRED_HOME_ARG
   --help                                       Print usage instructions
-P                                              Read password from console
   --password <password>                        Set authentication
                                                password
   --password-alias <password-alias>            Credential provider
                                                password alias
   --password-file <password-file>              Set authentication
                                                password file path
   --relaxed-isolation                          Use read-uncommitted
                                                isolation for imports
   --skip-dist-cache                            Skip copying jars to
                                                distributed cache
   --username <username>                        Set authentication
                                                username
   --verbose                                    Print more information
                                                while working

Import control arguments:
   --append                                                   Imports data
                                                              in append
                                                              mode
   --as-avrodatafile                                          Imports data
                                                              to Avro data
                                                              files
   --as-parquetfile                                           Imports data
                                                              to Parquet
                                                              files
   --as-sequencefile                                          Imports data
                                                              to
                                                              SequenceFile
                                                              s
   --as-textfile                                              Imports data
                                                              as plain
                                                              text
                                                              (default)
   --autoreset-to-one-mapper                                  Reset the
                                                              number of
                                                              mappers to
                                                              one mapper
                                                              if no split
                                                              key
                                                              available
   --boundary-query <statement>                               Set boundary
                                                              query for
                                                              retrieving
                                                              max and min
                                                              value of the
                                                              primary key
   --columns <col,col,col...>                                 Columns to
                                                              import from
                                                              table
   --compression-codec <codec>                                Compression
                                                              codec to use
                                                              for import
   --delete-target-dir                                        Imports data
                                                              in delete
                                                              mode
   --direct                                                   Use direct
                                                              import fast
                                                              path
   --direct-split-size <n>                                    Split the
                                                              input stream
                                                              every 'n'
                                                              bytes when
                                                              importing in
                                                              direct mode
-e,--query <statement>                                        Import
                                                              results of
                                                              SQL
                                                              'statement'
   --fetch-size <n>                                           Set number
                                                              'n' of rows
                                                              to fetch
                                                              from the
                                                              database
                                                              when more
                                                              rows are
                                                              needed
   --inline-lob-limit <n>                                     Set the
                                                              maximum size
                                                              for an
                                                              inline LOB
-m,--num-mappers <n>                                          Use 'n' map
                                                              tasks to
                                                              import in
                                                              parallel
   --mapreduce-job-name <name>                                Set name for
                                                              generated
                                                              mapreduce
                                                              job
   --merge-key <column>                                       Key column
                                                              to use to
                                                              join results
   --split-by <column-name>                                   Column of
                                                              the table
                                                              used to
                                                              split work
                                                              units
   --table <table-name>                                       Table to
                                                              read
   --target-dir <dir>                                         HDFS plain
                                                              table
                                                              destination
   --validate                                                 Validate the
                                                              copy using
                                                              the
                                                              configured
                                                              validator
   --validation-failurehandler <validation-failurehandler>    Fully
                                                              qualified
                                                              class name
                                                              for
                                                              ValidationFa
                                                              ilureHandler
   --validation-threshold <validation-threshold>              Fully
                                                              qualified
                                                              class name
                                                              for
                                                              ValidationTh
                                                              reshold
   --validator <validator>                                    Fully
                                                              qualified
                                                              class name
                                                              for the
                                                              Validator
   --warehouse-dir <dir>                                      HDFS parent
                                                              for table
                                                              destination
   --where <where clause>                                     WHERE clause
                                                              to use
                                                              during
                                                              import
-z,--compress                                                 Enable
                                                              compression

Incremental import arguments:
   --check-column <column>        Source column to check for incremental
                                  change
   --incremental <import-type>    Define an incremental import of type