Hive函数

原创已于 2024-06-13 22:21:01 修改 · 771 阅读

12 ·

CC 4.0 BY-SA版权

文章标签：

#hive #hadoop #数据仓库

于 2024-06-12 23:16:46 首次发布

sql

1、count(*)、count(1) 、count('字段名') 区别

从执行结果来看：

count(*)包括了所有的列，相当于行数，不会忽略列值为NULL，执行速度最慢。
count(1)包括了忽略所有列，用1代表代码行，在统计结果的时候，不会忽略列值NULL ，最快
count(列名)只包括列名那一列，会忽略列值为空（这里的空不是指空字符串或者0，而是表示null）的计数，即某个字段值为NULL时，不统计。执行速度仅次于count(1)。

从执行效率来看：

如果列为主键，count(列名)效率优于count(1)
如果列不为主键，count(1)效率优于count(列名)
如果表中存在主键，count(主键列名)效率最优
如果表中只有一列，则count(*)效率最优
如果表有多列，且不存在主键，则count(1)效率优于count(*)

如果没有特殊要求，就使用count(1)进行计数

2、hive语句执行顺序

from->join on 或 lateral view explode(炸裂的列) ->where->group by
->聚合函数(sum)...->having->over()开窗函数->distinct->order by
->limit(数据量庞大，加limit查询)

3、hive中大小写不敏感（列名无所谓大小写）

4、在hive中，数据中如果有null字符串，加载到表中的时候会变成 null （不是字符串）

如果需要判断 null，使用某个字段名 is null 这样的方式来判断

不能直接某个字段名 == null

5、使用explain查看SQL执行计划

explain select * from emp;

面试题：hive中一条sql语句如何解析成MapReduce作业执行的？

解析器：分析语法是否完整，元数据从哪来，sql是否有误
编译器：生成逻辑执行计划。
优化器：对逻辑执行计划进行优化。
执行器：生产一个物理的执行计划，就是 MR/Spark/flink

Hive 常用函数

关系运算

// 等值比较 = == < = >
// 不等值比较 != <>
// 区间比较： select * from default.students where age between 15 and 22;
// 空值/非空值判断：isnull、isnotnull、nvl()、isnull()
// like、rlike、regexp用法

数值计算

取整函数(四舍五入)：round
向上取整：ceil
向下取整：floor

条件函数

if(表达式,表达式成立的返回值,表达式不成立的返回值) (重点)

select sal,if(sal<2000,'低薪',if(sal>=2000 and sal<3000,'中等','高薪'))
 as level from emp;

select COALESCE('1',null,'2'); // 1  //从左往右 依次匹配 直到非空为止

case when(重点)
select sal,case when sal<2000 then '低薪' 
                when sal>=2000 and sal<3000 then '中等薪资'
                else '高薪' end as level
from emp;

日期函数(重点)

select from_unixtime(1718204485,'YYYY年MM月dd日 hh时mm分ss秒');

select from_unixtime(unix_timestamp('2024年06月12日','yyyy年MM月dd日'),'yyyy-MM-dd');  //2024-06-12

字符串函数

select concat('666','999'); // 666999

select concat_ws('?','a','b','c'); // a?b?c

select explode(split("shihao,huolinger",",")); //shihao 										 
                                              // huolinger

Hive中的wordCount

create table words(
    words string
)row format delimited fields terminated by '\n';

// 数据
hello,java,hello,java,scala,python
hbase,hadoop,hadoop,hdfs,hive,hive
hbase,hadoop,hadoop,hdfs,hive,hive

select word,count(*) from (select explode(split(words,',')) word from words) ws group by ws.word;

1.1 hive窗口函数

普通的聚合函数每组(Group by)只返回一个值，而开窗函数窗口中的每行都返回一个值。就是对查询的结果多出一列，这一列可以是聚合值，也可以是排序值。开窗函数一般就是over()函数，其窗口是由一个 OVER 子句定义的多行记录开窗函数一般分为聚合开窗函数和排序开窗函数。

-- 聚合格式
select sum(字段名) over([partition by 字段名] [ order by 字段名]) as 别名,
    max(字段名) over() as 别名 
from 表名;

-- 排序窗口格式
select rank() over([partition by 字段名] [ order by 字段名]) as 别名 from 表名;

注意点：

over()函数中的分区、排序、指定窗口范围可组合使用也可以不指定，根据不同的业务需求结合使用
over()函数中如果不指定分区，窗口大小是针对查询产生的所有数据，如果指定了分区，窗口大小是针对每个分区的数据

测试数据

-- 创建表
create table grade(
name string,
subject string, 
score int) 
row format delimited fields terminated by ","
lines terminated by '\n';

-- 测试数据 grade.txt
张三,语文,10
张三,数学,73
张三,英语,15
李四,语文,10
李四,数学,73
李四,英语,11
麻七,语文,22
麻七,数学,70
麻七,英语,31
王五,语文,21
王五,数学,81
王五,英语,23

-- 上传数据
load data local inpath '/usr/local/soft/bigdata30/grade.txt' into table grade;

1.1.1 聚合开窗函数

sum(求和)

min(最小)

max(最大)

avg(平均值)

count(计数)

lag(获取当前行上一行的数据）

-- 
select name,subject,score,sum(score) over() as sumover from grade;


select name,subject,score,
sum(score) over() as sum1,
sum(score) over(partition by subject) as sum2,
sum(score) over(partition by subject order by score) as sum3, 
-- 由起点到当前行的窗口聚合，和sum3一样
sum(score) over(partition by subject order by score rows between unbounded preceding and current row) as sum4, 
-- 当前行和前面一行的窗口聚合
sum(score) over(partition by subject order by score rows between 1 preceding and current row) as sum5,
-- 当前行的前面一行到后面一行的窗口聚合  前一行+当前行+后一行
sum(score) over(partition by subject order by score rows between 1 preceding and 1 following) as sum6,
-- 当前行与后一行之和
sum(score) over(partition by subject order by score rows between current row and 1 following) as sum6,
-- 当前和后面所有的行
sum(score) over(partition by subject order by score rows between current row and unbounded following) as sum7
from grade;
+-------+----------+--------+-------+-------+-------+-------+-------+-------+-------+-------+
| name  | subject  | score  | sum1  | sum2  | sum3  | sum4  | sum5  | sum6  | sum6  | sum7  |
+-------+----------+--------+-------+-------+-------+-------+-------+-------+-------+-------+
| 麻七    | 数学       | 70     | 440   | 297   | 70    | 70    | 70    | 143   | 143   | 297   |
| 李四    | 数学       | 73     | 440   | 297   | 216   | 143   | 143   | 216   | 146   | 227   |
| 张三    | 数学       | 73     | 440   | 297   | 216   | 216   | 146   | 227   | 154   | 154   |
| 王五    | 数学       | 81     | 440   | 297   | 297   | 297   | 154   | 154   | 81    | 81    |
| 李四    | 英语       | 11     | 440   | 80    | 11    | 11    | 11    | 26    | 26    | 80    |
| 张三    | 英语       | 15     | 440   | 80    | 26    | 26    | 26    | 49    | 38    | 69    |
| 王五    | 英语       | 23     | 440   | 80    | 49    | 49    | 38    | 69    | 54    | 54    |
| 麻七    | 英语       | 31     | 440   | 80    | 80    | 80    | 54    | 54    | 31    | 31    |
| 李四    | 语文       | 10     | 440   | 63    | 20    | 10    | 10    | 20    | 20    | 63    |
| 张三    | 语文       | 10     | 440   | 63    | 20    | 20    | 20    | 41    | 31    | 53    |
| 王五    | 语文       | 21     | 440   | 63    | 41    | 41    | 31    | 53    | 43    | 43    |
| 麻七    | 语文       | 22     | 440   | 63    | 63    | 63    | 43    | 43    | 22    | 22    |
+-------+----------+--------+-------+-------+-------+-------+-------+-------+-------+-------+


rows：行
unbounded preceding：起点
unbounded following：终点
n preceding：前 n 行
n following：后 n 行
current row：当前行

rows必须跟在Order by 子句之后，对排序的结果进行限制

LAG(col,n,default_val)：往前第n行数据，col是列名，n是往上的行数，当第n行为null的时候取default_val

LEAD(col,n, default_val)：往后第n行数据，col是列名，n是往下的行数，当第n行为null的时候取default_val

聚合开窗函数：

1、Hive用户购买明细数据分析

创建表和加载数据

name，orderdate，cost
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

建表加载数据

vim business.txt
create table business
(
name string, 
orderdate date,
cost int
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

load data local inpath "/usr/local/soft/bigdata30/business.txt" into table business;

需求：

需求1：查询在2017年4月份购买过的顾客及总人数

# 分析：按照日期过滤、分组count求总人数

select t1.name
,t1.orderdate,count(1) over() as counts_04 
from (select name,orderdate from business 
where month(orderdate)='04') t1;

+----------+---------------+------------+
| t1.name | t1.orderdate | counts_04 |
+----------+---------------+------------+
| mart | 2017-04-13 | 5 |
| mart | 2017-04-11 | 5 |
| mart | 2017-04-09 | 5 |
| mart | 2017-04-08 | 5 |
| jack | 2017-04-06 | 5 |
+----------+---------------+------------+

需求2：查询顾客的购买明细及月购买总额

# 分析：按照顾客分组、sum购买金额
select name,orderdate,cost,sum(cost) over(partition by name,month(orderdate)) from business;

需求3：上述的场景,要将cost按照日期进行累加

# 分析：按照顾客分组、日期升序排序、组内每条数据将之前的金额累加
select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between unbounded preceding and current row)  from business;

1.1.2 排序开窗函数(重点)

RANK() 排序相同时会重复，总数不会变
DENSE_RANK() 排序相同时会重复，总数会减少
ROW_NUMBER() 会根据顺序计算
PERCENT_RANK()计算给定行的百分比排名。

select name,subject,
score,
rank() over(partition by subject order by score desc) rp,
dense_rank() over(partition by subject order by score desc) drp,
row_number() over(partition by subject order by score desc) rnp,
percent_rank() over(partition by subject order by score) as percent_rank 
from t_fraction;

2、Hive分析学生成绩信息

创建表语加载数据

name  subject score
红猫  语文  87
红猫  数学  95
红猫  英语  68
蓝兔  语文  94
蓝兔  数学  56
蓝兔  英语  84
黑小虎  语文  64
黑小虎  数学  86
黑小虎  英语  84
猪无戒 语文  65
猪无戒 数学  85
猪无戒 英语  78

建表加载数据
vim score.txt

create table score2
(
name string,
subject string, 
score int
) row format delimited fields terminated by "\t";

load data local inpath '/usr/local/soft/bigdata30/score.txt' into table score2;

需求1：每门学科学生成绩排名(是否并列排名、空位排名三种实现)

select *, rank() over(partition by subject order by score desc) rp,--重复不连续

dense_rank() over(partition by subject order by score desc) drp,--重复且连续

row_number() over(partition by subject order by score desc) rmp--不重复且连续

from score2;

需求2：每门学科成绩排名top 2的学生
select * from(select * ,rank() over(partition by subject order by score desc )as rm from score2)t1 where t1.rm<3;

Hive 行转列

lateral view explode

create table testarray(
    name string,
    weight array<string>
)
 row format delimited fields terminated by '\t',
 COLLECTION ITEMS terminated by ',';

张三  "110","120","130"
李四  "130","100","160"

select name,col1  from testarray lateral view explode(weight) t1 as col1;

Hive 列转行

create table testLie(
    name string,
    col1 int
)row format delimited 
fields terminated by '\t';

select name,collect_list(col1) from testLie group by name;