desc function extended from_unixtime;  -- view a built-in function's full documentation
Usage of Hive's split, coalesce, and collect_list functions (with examples)
split converts a string into an array:
split('a,b,c,d' , ',') ==> ["a","b","c","d"]
collect_list returns every value of the column within a group, without deduplication: select collect_list(id) from table;
Handling nulls
NVL(expr1, expr2)
If expr1 is null, NVL returns the value of expr2; otherwise it returns expr1.
The point of the function is to turn a null into an actual value. The expressions may be numeric, character, or date typed, but expr1 and expr2 must be of the same type.
hive> select nvl(1,0);
1
hive> select nvl(null,"hello");
hello
COALESCE(T v1, T v2, …) returns the first non-null argument; if every argument is NULL, it returns NULL.
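For example:
select coalesce(null, null, 'abc', 'def');  -- abc, the first non-null argument
select coalesce(null, null);                -- NULL, since every argument is null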
1. Date and time functions
Converting MongoDB timestamps from the UTC time zone to GMT+8, an 8-hour offset:
date_format(from_utc_timestamp(CONCAT_WS(' ', substring(updatetime,1,10), substring(updatetime,12,8)), 'GMT+8'), 'yyyy-MM-dd HH:mm:ss') updatetime
select date_format(from_utc_timestamp(create_time,"UTC"),'yyyy-MM-dd HH:mm:ss') as local_time
select date_format(from_utc_timestamp(create_time,"GMT+8"),'yyyy-MM-dd HH:mm:ss') as local_time
A Unix timestamp in seconds (s) has 10 digits; in milliseconds (ms), 13 digits.
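A 13-digit millisecond value therefore has to be divided by 1000 before it is fed to the second-based functions below; a minimal sketch (the literal is an arbitrary millisecond timestamp):
select from_unixtime(cast(1597382400000 / 1000 as bigint), 'yyyy-MM-dd HH:mm:ss');  -- divide ms by 1000 to get the 10-digit seconds value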
1. Converting a timestamp to a formatted date
from_unixtime(bigint unixtime, string format): converts a seconds value to the given format (format can be "yyyy-MM-dd HH:mm:ss", "yyyy-MM-dd HH", "yyyy-MM-dd HH:mm", and so on)
select from_unixtime(unix_timestamp(), 'yyyy-MM-dd HH:mm:ss');
<==>
select from_unixtime(unix_timestamp())
date_format(from_unixtime(cast(h.updatetime as int)), 'yyyy-MM-dd HH:mm:ss')
Today:
select date_format(current_timestamp, 'yyyy-MM-dd');
The day before:
select date_sub(current_date, 1);
2. Converting a date to a timestamp
select unix_timestamp();  -- current Unix timestamp for the session time zone, 10 digits, in seconds
select unix_timestamp('2020-12-18 12', 'yyyy-MM-dd HH');
select unix_timestamp('2020-12-18 09:42:30');
<==>
unix_timestamp('2020-12-18 09:42:30', 'yyyy-MM-dd HH:mm:ss');
to_date(string timestamp): returns the date part of a time string. The argument must be a STRING/TIMESTAMP/DATEWRITABLE type; passing a LONG fails with "...STRING/TIMESTAMP/DATEWRITABLE types, got LONG".
select to_date('2020-09-10 10:31:31');
-> 2020-09-10
year(string date): returns the year part of a time string
month(string date): returns the month part of a time string
day(string date): returns the day part of a time string
hour(string str): returns the hour; str must be in yyyy-MM-dd HH:mm:ss format
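A combined check of the four extractors (outputs shown inline):
select year('2020-09-10 10:31:31');   -- 2020
select month('2020-09-10 10:31:31');  -- 9
select day('2020-09-10 10:31:31');    -- 10
select hour('2020-09-10 10:31:31');   -- 10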
-----------------------------------------------------------------------------------------------------------------
1) date_format (reformat a date by a pattern)
select date_format('2020-06-14 12:20:15', 'yyyy-MM-dd HH:mm:ss');
2020-06-14 12:20:15
select date_format('2020-06-14 12:20:15', 'yyyy-MM-dd HH');
2020-06-14 12
) date_add 函数(加减日期)
date_add(string startdate,
int
days) 从开始时间startdate加上days
date_sub(string startdate,
int
days) 从开始时间startdate减去days
select date_add('2020-06-14', -1);  -- equivalent to: select date_sub('2020-06-14', 1);
2020-06-13
select date_add('2020-06-14', 2);
2020-06-16
3) next_day
(1) The next Monday after a given day:
select next_day('2020-06-10', 'MO');  -- equivalent to: select next_day('2020-06-10', 'Monday')
2020-06-15
Note: the English day names are Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday (two-letter abbreviations such as MO also work).
(2) The Monday of the current week:
select date_add(next_day('2020-06-14', 'MO'), -7);
2020-06-08
4) last_day (last day of the month)
select last_day('2020-06-14');
2020-06-30
2. Concatenating fields
1) concat
concat returns NULL as soon as any one of the strings being joined is NULL:
hive> select concat('a','b','c');
abc
hive> select concat('a','b',null);
NULL
hive> select concat(payid,' ',carlogid)  -- a space ' ' separates the two fields =>
01a893092b914703b75941b713767ebf 408693
hive> select order_id,concat(order_status,'=',operate_time)...
1101 1001=2020-01-01 11:20:30
1102 1002=2020-02-02 12:30:40
concat(sum(total_amount_pur),'&&',sum(total_amount_sig),'&&',sum(total_amount_jump))
0.0&&16665.0&&0.0
concat(substr(summary_time,9,2),'',substr(summary_time,6,2),'',substr(summary_time,1,4),'_',concat(market_id),'_' ,concat(mid))
09022019_108_0
12022019_108_0
21022019_108_0
2) concat_ws
concat_ws returns NULL only if every string is NULL; as long as one argument is non-NULL, you get a result. concat_ws takes an explicit separator.
hive> select concat_ws('-','a','b','c');
a-b-c
hive> select concat_ws('-','a','b',null);
a-b
CONCAT_WS("/",r.province,r.city,a.area) channel_address => 北京/北京市/朝阳区 ; the fields must be strings
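A non-string column can be cast first; for example:
select concat_ws('-', cast(108 as string), 'mid');  -- 108-mid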
concat_ws("_" ,substr(summary_time,9,2),substr(summary_time,6,2),substr(summary_time,1,4),concat(market_id),concat(mid))
09_03_2019_108_0
13_03_2019_108_0
21_03_2019_108_0
concat_ws('',array('a', 'b', 'c'))
abc
3. Substring functions: substr, substring
Syntax: substr(string A, int start), substring(string A, int start)
substr(string A, int start, int len), substring(string A, int start, int len)
Return type: string
Description: returns the substring of A from position start to the end, or the substring of A of length len starting at position start.
select substring('abcde',3); select substr('abcde',3);      -- returns cde
select substring('abcde',3,2); select substr('abcde',3,2);  -- returns cd
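substr/substring also accept a negative start position, counting back from the end of the string:
select substr('abcde', -2);    -- returns de
select substr('abcde', -3, 2); -- returns cd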
substring(h.id, 10, 24) extracts 59409d1d2cdcc90b91c62be5 from ObjectId(59409d1d2cdcc90b91c62be5)
4. Cumulative aggregation
1. Requirements
The difference between group by and sum() over (partition by ...):
select channel_type ,sum(num) from record group by channel_type;
channel_type _c1
select channel_type,dt,sum(num) over (partition by channel_type) from record;
select channel_type,dt,num, sum(num) over (partition by channel_type order by dt) from record;
channel_type dt sum_window_0
A 2015-01-02 30
A 2015-01-01 30
A 2015-01-05 30
A 2015-01-04 30
A 2015-01-02 30
A 2015-01-03 30
B 2015-01-03 15
B 2015-01-01 15
B 2015-01-02 15
B 2015-01-02 15
C 2015-02-01 19
C 2015-01-30 19
C 2015-01-30 19
C 2015-02-02 19
channel_type dt num sum_window_0
A 2015-01-01 8 8
A 2015-01-02 4 17
A 2015-01-02 5 17
A 2015-01-03 2 19
A 2015-01-04 5 24
A 2015-01-05 6 30
B 2015-01-01 1 1
B 2015-01-02 9 13
B 2015-01-02 3 13
B 2015-01-03 2 15
C 2015-01-30 8 15
C 2015-01-30 7 15
C 2015-02-01 1 16
C 2015-02-02 3 19
The Hive table record holds the data below. Field meanings: channel_type(string), dt(string), num(int) are the channel type, the date, and that day's count. Sample data:
select * from record;
channel_type dt num
A 2015-01-01 8
A 2015-01-02 4
A 2015-01-02 5
C 2015-02-01 1
A 2015-01-04 5
A 2015-01-05 6
B 2015-01-03 2
B 2015-01-02 3
A 2015-01-03 2
C 2015-01-30 8
C 2015-01-30 7
B 2015-01-02 9
B 2015-01-01 1
C 2015-02-02 3
For each channel, compute the maximum single-day count so far and the running total up to each day:
-- First compute each channel's daily total:
create table record_nj as
select
channel_type,dt,sum(num) as new_join
from record
group by channel_type,dt;
channel_type dt new_join
A 2015-01-01 8
A 2015-01-02 9
A 2015-01-03 2
A 2015-01-04 5
A 2015-01-05 6
B 2015-01-01 1
B 2015-01-02 12
B 2015-01-03 2
C 2015-01-30 15
C 2015-02-01 1
C 2015-02-02 3
Method 1: the window functions max() over / sum() over
select
channel_type, dt, new_join,
sum(new_join) over(partition by channel_type order by dt) as sum_count,
max(new_join) over(partition by channel_type order by dt) as max_count
from record_nj;
Method 2: self-join with group by
select
t1.channel_type,t1.dt,t1.new_join,
sum(t2.new_join) sum_count, max(t2.new_join) max_count
from record_nj t1 join record_nj t2 on t1.channel_type = t2.channel_type
where t1.dt >= t2.dt
group by t1.channel_type,t1.dt,t1.new_join order by t1.channel_type,t1.dt;
Results:
channel_type dt new_join sum_count max_count
A 2015-01-01 8 8 8
A 2015-01-02 9 17 9
A 2015-01-03 2 19 9
A 2015-01-04 5 24 9
A 2015-01-05 6 30 9
B 2015-01-01 1 1 1
B 2015-01-02 12 13 12
B 2015-01-03 2 15 12
C 2015-01-30 15 15 15
C 2015-02-01 1 16 15
C 2015-02-02 3 19 15
Running sum, running product, and running max:
select
channel_type,
new_join,
sum(new_join) over(partition by channel_type order by dt) as sum_count, -- running sum
sum(new_join) over(partition by channel_type order by dt rows between unbounded preceding and current row) sum_count2, -- running sum, frame spelled out
round(power(10, sum(log(10, new_join)) over(partition by channel_type order by dt rows between unbounded preceding and current row))) as tired, -- running product via log/power
max(new_join) over(partition by channel_type order by dt) as max_count -- running max
from record_nj;
select
id,order_id,order_price,
sum(order_price) over(partition by order_id), -- total order_price per order_id
sum(order_price) over(partition by order_id order by id), -- running sum of order_price by id
row_number() over(partition by order_id order by id ) -- number the rows of each order_id by ascending id
from ods_order_detail where order_id = 3276;
id order_id order_price _c3 sum_window_1 row_number_window_2
8449 3276 235.00 7534.00 235.00 1
8450 3276 7299.00 7534.00 7534.00 2
5. Converting between rows and columns
Source data:
select * from score;
name subject score
孙悟空 语文 87
孙悟空 数学 95
孙悟空 英语 68
大海 语文 94
大海 数学 56
大海 英语 84
kris 语文 64
kris 数学 86
kris 英语 84
婷婷 语文 65
婷婷 数学 85
婷婷 英语 78
Find the students whose Chinese (语文) score is higher than their math (数学) score
Method 1: join
select
s1.name,s1.subject, s1.score
from score s1 inner join score s2 on s1.name = s2.name
where s1.score > s2.score and s1.subject = '语文' and s2.subject = '数学';
Method 2: pivot with case when; first build t1:
create table t1 as
select
name,
case subject when '语文' then score else 0 end as chinese_score,
case subject when '数学' then score else 0 end as math_score
from score;
name chinese_score math_score
孙悟空 87 0
孙悟空 0 95
孙悟空 0 0
大海 94 0
大海 0 56
大海 0 0
kris 64 0
kris 0 86
kris 0 0
婷婷 65 0
婷婷 0 85
婷婷 0 0
create table t2 as
select
name,
max(chinese_score) chinese_score,
max(math_score) math_score
from t1
group by name;
name chinese_score math_score
kris 64 86
大海 94 56
婷婷 65 85
孙悟空 87 95
select
name, chinese_score, math_score
from t2
where chinese_score > math_score;
Or fold the three HQL statements into one:
select
name, chinese_score, math_score
from (
select
name,
max(chinese_score) chinese_score,
max(math_score) math_score
from (
select
name,
case subject when '语文' then score else 0 end as chinese_score,
case subject when '数学' then score else 0 end as math_score
from score) t1
group by t1.name
) t2
where chinese_score > math_score;
name chinese_score math_score
大海 94 56
Converting columns to rows
Data:
id sname math computer english
1 Jed 34 58 58
2 Tony 45 87 45
3 Tom 76 34 89
select id, sname, 'math' as course, math as score from score
union
select id, sname, 'computer' as course, computer as score from score
union
select id, sname, 'english' as course, english as score from score
order by id, sname, course;
Result:
id sname course score
1 Jed computer 58
1 Jed english 58
1 Jed math 34
2 Tony computer 87
2 Tony english 45
2 Tony math 45
3 Tom computer 34
3 Tom english 89
3 Tom math 76
6. Window functions
I. Window aggregate functions: sum, avg, max, min
Create the table and load the data:
create table cookie(
cookie_id string,
create_time string,
pv int
)row format delimited fields terminated by ",";
a_cookie,2019-06-10,1
a_cookie,2019-06-11,9
a_cookie,2019-06-12,7
a_cookie,2019-06-13,3
a_cookie,2019-06-14,2
a_cookie,2019-06-15,4
a_cookie,2019-06-16,4
b_cookie,2019-08-17,6
b_cookie,2019-08-18,9
b_cookie,2019-08-19,5
b_cookie,2019-08-17,2
load data local inpath "/opt/module/datas/cookie.txt" into table cookie;
select
cookie_id,create_time,pv,
sum(pv) over(partition by cookie_id) as sum1, -- total pv within the group
sum(pv) over(partition by cookie_id order by create_time) as accu_sum -- running sum from the first row to the current row
from cookie;
Output:
a_cookie 2019-06-10 1 30 1
a_cookie 2019-06-11 9 30 10
a_cookie 2019-06-12 7 30 17
a_cookie 2019-06-13 3 30 20
a_cookie 2019-06-14 2 30 22
a_cookie 2019-06-15 4 30 26
a_cookie 2019-06-16 4 30 30
b_cookie 2019-08-17 2 22 8
b_cookie 2019-08-17 6 22 8
b_cookie 2019-08-18 9 22 17
b_cookie 2019-08-19 5 22 22
select
cookie_id,create_time,pv,
sum(pv) over(partition by cookie_id) as sum1, -- total pv within the group
sum(pv) over(partition by cookie_id order by create_time) as accu_sum, -- running sum up to the current row
sum(pv) over(partition by cookie_id order by create_time rows between unbounded preceding and current row) as sum3,
-- the default frame runs from the partition start to the current row, so "between unbounded preceding and current row" can be omitted
sum(pv) over(partition by cookie_id order by create_time rows between current row and unbounded following) as sum4,
-- from the current row to the end: the mirror image, i.e. a descending running sum
sum(pv) over(partition by cookie_id order by create_time rows between 3 preceding and current row) as sum5,
-- the current row plus the 3 rows before it: 4 rows summed in total
sum(pv) over(partition by cookie_id order by create_time rows between 3 preceding and 1 following) as sum6
-- 3 rows before, the current row, and 1 row after
from cookie;
Output:
a_cookie 2019-06-10 1 30 1 1 30 1 10
a_cookie 2019-06-11 9 30 10 10 29 10 17
a_cookie 2019-06-12 7 30 17 17 20 17 20
a_cookie 2019-06-13 3 30 20 20 13 20 22
a_cookie 2019-06-14 2 30 22 22 10 21 25
a_cookie 2019-06-15 4 30 26 26 8 16 20
a_cookie 2019-06-16 4 30 30 30 4 13 13
b_cookie 2019-08-17 2 22 8 2 22 2 8
b_cookie 2019-08-17 6 22 8 8 20 8 17
b_cookie 2019-08-18 9 22 17 17 14 17 22
b_cookie 2019-08-19 5 22 22 22 5 22 22
sum, avg, max, and min all share this same window syntax.
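For instance, the other aggregates drop straight into the same frame (same cookie table as above):
select
cookie_id, create_time, pv,
avg(pv) over(partition by cookie_id order by create_time) as avg_pv, -- running average up to the current row
max(pv) over(partition by cookie_id order by create_time) as max_pv, -- running max up to the current row
min(pv) over(partition by cookie_id order by create_time) as min_pv  -- running min up to the current row
from cookie;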
Task: the sales volume over the last 3 (strictly consecutive calendar) days
sum(paid_cnt) over (partition by org_id,goods_id order by stats_date rows BETWEEN 3 PRECEDING AND 1 PRECEDING) 3_precePre_paid_cnt, -- excludes the current row
rows counts backwards over physical rows; to work in calendar time, use range over a timestamp:
sum(paid_cnt) over (partition by org_id,goods_id order by unix_timestamp(stats_date, 'yyyy-MM-dd') RANGE BETWEEN 86400 * 3 PRECEDING AND 1 PRECEDING) 3_precePre_paid_cnt, -- excludes the current date (row)
select
stats_date,
org_id,
goods_id,
paid_cnt,
sum(paid_cnt) over (partition by org_id,goods_id order by unix_timestamp(stats_date, 'yyyy-MM-dd') RANGE BETWEEN 86400 * 3 PRECEDING AND 1 PRECEDING) 3_precePre_paid_cnt, -- excludes the current date (row)
sum(paid_cnt) over (partition by org_id,goods_id order by unix_timestamp(stats_date, 'yyyy-MM-dd') RANGE BETWEEN 86400 * 3 PRECEDING AND 1 FOLLOWING) 3_preceFllow_paid_cnt, -- includes the current date (row)
sum(paid_cnt) over (partition by org_id,goods_id order by unix_timestamp(stats_date, 'yyyy-MM-dd') RANGE BETWEEN 86400 * 3 PRECEDING AND -1 FOLLOWING) 3_prece_Fllow_paid_cnt, -- excludes the current date (row)
sum(paid_cnt) over(partition by stats_date,org_id,goods_id) as paid_sum
from dws_sales_org_sku where org_id = 'xxx' and goods_id = xxx
order by 1 desc
Output:
+-------------+----------+-----------+-----------+----------------------+------------------------+-------------------------+-----------+--+
| stats_date | org_id | goods_id | paid_cnt | 3_precePre_paid_cnt | 3_preceFllow_paid_cnt | 3_prece_Fllow_paid_cnt | paid_sum |
+-------------+----------+-----------+-----------+----------------------+------------------------+-------------------------+-----------+--+
| 2022-06-15 | JM00026 | 12266 | 2 | 19 | 21 | 19 | 2 |
| 2022-06-14 | JM00026 | 12266 | 1 | 19 | 20 | 19 | 1 |
| 2022-06-13 | JM00026 | 12266 | 4 | 19 | 23 | 19 | 4 |
| 2022-06-12 | JM00026 | 12266 | 14 | 5 | 19 | 5 | 14 |
| 2022-06-11 | JM00026 | 12266 | 1 | 5 | 6 | 5 | 1 |
| 2022-06-10 | JM00026 | 12266 | 4 | 12 | 16 | 12 | 4 |
| 2022-06-08 | JM00026 | 12266 | 1 | 16 | 17 | 16 | 1 |
| 2022-06-07 | JM00026 | 12266 | 11 | 11 | 22 | 11 | 11 |
| 2022-06-06 | JM00026 | 12266 | 3 | 23 | 26 | 23 | 3 |
| 2022-06-05 | JM00026 | 12266 | 2 | 29 | 31 | 29 | 2 |
| 2022-06-04 | JM00026 | 12266 | 6 | 25 | 31 | 25 | 6 |
| 2022-06-03 | JM00026 | 12266 | 15 | 12 | 27 | 12 | 15 |
| 2022-06-02  | JM00026  | 12266     | 8         | 5                    | 13                     | 5                       | 8         |
| 2022-06-01 | JM00026 | 12266 | 2 | 32 | 34 | 32 | 2 |
| 2022-05-31 | JM00026 | 12266 | 2 | 50 | 52 | 50 | 2 |
| 2022-05-30 | JM00026 | 12266 | 1 | 49 | 50 | 49 | 1 |
| 2022-05-29 | JM00026 | 12266 | 29 | 20 | 49 | 20 | 29 |
| 2022-05-28 | JM00026 | 12266 | 20 | 2 | 22 | 2 | 20 |
| 2022-05-25 | JM00026 | 12266 | 2 | 30 | 32 | 30 | 2 |
| 2022-05-24 | JM00026 | 12266 | 8 | 34 | 42 | 34 | 8 |
| 2022-05-23 | JM00026 | 12266 | 8 | 32 | 40 | 32 | 8 |
| 2022-05-22 | JM00026 | 12266 | 14 | 18 | 32 | 18 | 14 |
| 2022-05-21 | JM00026 | 12266 | 12 | 6 | 18 | 6 | 12 |
| 2022-05-20 | JM00026 | 12266 | 6 | NULL | 6 | NULL | 6 |
| 2022-05-15 | JM00026 | 12266 | 2 | 4 | 6 | 4 | 2 |
| 2022-05-14 | JM00026 | 12266 | 2 | 2 | 4 | 2 | 2 |
| 2022-05-13 | JM00026 | 12266 | 1 | 2 | 3 | 2 | 1 |
| 2022-05-12  | JM00026  | 12266     | 1         | 7                    | 8                      | 7                       | 1         |
| 2022-05-10 | JM00026 | 12266 | 1 | 10 | 11 | 10 | 1 |
| 2022-05-09 | JM00026 | 12266 | 6 | 6 | 12 | 6 | 6 |
| 2022-05-08 | JM00026 | 12266 | 2 | 5 | 7 | 5 | 2 |
| 2022-05-07 | JM00026 | 12266 | 2 | 7 | 9 | 7 | 2 |
| 2022-05-06 | JM00026 | 12266 | 2 | 12 | 14 | 12 | 2 |
| 2022-05-05 | JM00026 | 12266 | 1 | 15 | 16 | 15 | 1 |
| 2022-05-04 | JM00026 | 12266 | 4 | 15 | 19 | 15 | 4 |
| 2022-05-03 | JM00026 | 12266 | 7 | 17 | 24 | 17 | 7 |
| 2022-05-02 | JM00026 | 12266 | 4 | 17 | 21 | 17 | 4 |
| 2022-05-01  | JM00026  | 12266     | 4         | 15                   | 19                     | 15                      | 4         |
II. ntile
ntile(n) splits a group's ordered rows into n buckets and returns the bucket number of the current row. ntile does not support rows between; when the rows do not divide evenly, the earlier buckets each take one extra row.
Split the grouped rows into 1, 2, 3, and 4 buckets:
select
cookie_id,create_time,pv,
ntile(1) over(partition by cookie_id order by create_time) as nt1,
ntile(2) over(partition by cookie_id order by create_time) as nt2,
ntile(3) over(partition by cookie_id order by create_time) as nt3,
ntile(4) over(partition by cookie_id order by create_time) as nt4
from cookie;
Result: the a_cookie group has 7 rows, so the first bucket takes the extra row by default
cookie_id create_time pv nt1 nt2 nt3 nt4
a_cookie 2019-06-10 1 1 1 1 1
a_cookie 2019-06-11 9 1 1 1 1
a_cookie 2019-06-12 7 1 1 1 2
a_cookie 2019-06-13 3 1 1 2 2
a_cookie 2019-06-14 2 1 2 2 3
a_cookie 2019-06-15 4 1 2 3 3
a_cookie 2019-06-16 4 1 2 3 4
b_cookie 2019-08-17 6 1 1 1 1
b_cookie 2019-08-17 2 1 1 1 2
b_cookie 2019-08-18 9 1 2 2 3
b_cookie 2019-08-19 5 1 2 3 4
Use case: for each cookie, the top 1/3 of days by pv
select
cookie_id,create_time,pv
from(
select
cookie_id,create_time,pv,
ntile(3) over(partition by cookie_id order by pv desc) as nt1 -- sort by pv descending; the rows with nt1 = 1 are the answer
from cookie
)t where nt1 = 1;
a_cookie 2019-06-11 9
a_cookie 2019-06-12 7
a_cookie 2019-06-16 4
b_cookie 2019-08-18 9
b_cookie 2019-08-17 6
III. Window ranking functions
All three rank rows within a group; they differ in how they treat ties:
row_number: ties still receive distinct, consecutive ranks: 1, 2, 3, 4, 5
rank: ties share a rank and leave a gap after it: 1, 2, 3, 3, 5
dense_rank: ties share a rank and leave no gap: 1, 2, 3, 3, 4
select
cookie_id,create_time,pv,
row_number() over(partition by cookie_id order by pv desc) as row_index,
rank() over(partition by cookie_id order by pv desc) as rank_index,
dense_rank() over(partition by cookie_id order by pv desc) as dense_index
from cookie;
cookie_id create_time pv row_index rank_index dense_index
a_cookie 2019-06-11 9 1 1 1
a_cookie 2019-06-12 7 2 2 2
a_cookie 2019-06-16 4 3 3 3
a_cookie 2019-06-15 4 4 3 3
a_cookie 2019-06-13 3 5 5 4
a_cookie 2019-06-14 2 6 6 5
a_cookie 2019-06-10 1 7 7 6
Common use: top-N per group, e.g. each cookie's top 3 pv values
create table tmp as
select cookieid, createtime, pv,
row_number() over(partition by cookieid order by pv desc) as `index`
from cookie2;
select * from tmp where `index` <= 3;
IV. lag | lead | first_value | last_value
LAG(col, n, DEFAULT) returns, for the current row, the value n rows above it within the window.
The first parameter is the column name.
The second is how many rows up (optional; default 1).
The third is the default used when the row n above is NULL or missing (optional; NULL if unspecified).
select
cookie_id,create_time,pv,
lag(pv,1) over (partition by cookie_id order by create_time) as last_1_pv -- a default of 0 could be set with lag(pv,1,0) over(...)
from cookie;
Output: with no default set, rows that have no preceding row show NULL
cookie_id create_time pv last_1_pv
a_cookie 2019-06-10 1 NULL
a_cookie 2019-06-11 9 1
a_cookie 2019-06-12 7 9
a_cookie 2019-06-13 3 7
a_cookie 2019-06-14 2 3
a_cookie 2019-06-15 4 2
a_cookie 2019-06-16 4 4
b_cookie 2019-08-17 6 NULL
b_cookie 2019-08-17 2 6
b_cookie 2019-08-18 9 2
b_cookie 2019-08-19 5 9
lead does the opposite of lag.
LEAD(col, n, DEFAULT) returns the value n rows below the current row; commonly used to fill row values or to diff against a specific row.
The first parameter is the column name.
The second is how many rows down (optional; default 1).
The third is the default used when the row n below is NULL or missing (optional; NULL if unspecified).
select
cookie_id,create_time,pv,
lead(pv,1) over(partition by cookie_id order by create_time) as next_1_pv,
lead(create_time,1,'1970-01-01') over(partition by cookie_id order by create_time) as next_1_time
from cookie;
Output:
cookie_id create_time pv next_1_pv next_1_time
a_cookie 2019-06-10 1 9 2019-06-11
a_cookie 2019-06-11 9 7 2019-06-12
a_cookie 2019-06-12 7 3 2019-06-13
a_cookie 2019-06-13 3 2 2019-06-14
a_cookie 2019-06-14 2 4 2019-06-15
a_cookie 2019-06-15 4 4 2019-06-16
a_cookie 2019-06-16 4 NULL 1970-01-01
b_cookie 2019-08-17 6 2 2019-08-17
b_cookie 2019-08-17 2 9 2019-08-18
b_cookie 2019-08-18 9 5 2019-08-19
b_cookie 2019-08-19 5 NULL 1970-01-01
first_value: after ordering within the group, the first value from the start of the window up to the current row
select
cookie_id,create_time,pv,
first_value(pv) over (partition by cookie_id order by pv) as first_pv
from cookie;
Output:
cookie_id create_time pv first_pv
a_cookie 2019-06-10 1 1
a_cookie 2019-06-14 2 1
a_cookie 2019-06-13 3 1
a_cookie 2019-06-16 4 1
a_cookie 2019-06-15 4 1
a_cookie 2019-06-12 7 1
a_cookie 2019-06-11 9 1
b_cookie 2019-08-17 2 2
b_cookie 2019-08-19 5 2
b_cookie 2019-08-17 6 2
b_cookie 2019-08-18 9 2
last_value: after ordering within the group, the last value up to the current row (i.e. the current row itself)
select
cookie_id,create_time,pv,
last_value(pv) over (partition by cookie_id order by pv) as last_pv
from cookie;
cookie_id create_time pv last_pv
a_cookie 2019-06-10 1 1
a_cookie 2019-06-14 2 2
a_cookie 2019-06-13 3 3
a_cookie 2019-06-16 4 4
a_cookie 2019-06-15 4 4
a_cookie 2019-06-12 7 7
a_cookie 2019-06-11 9 9
b_cookie 2019-08-17 2 2
b_cookie 2019-08-19 5 5
b_cookie 2019-08-17 6 6
b_cookie 2019-08-18 9 9
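A caveat on that default frame: since the window ends at the current row, last_value above just echoes each row. To get the true last value of the whole group, widen the frame explicitly; a sketch on the same table:
select
cookie_id, create_time, pv,
last_value(pv) over(partition by cookie_id order by pv
rows between unbounded preceding and unbounded following) as group_last_pv -- last value over the entire partition
from cookie;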
For grouping sets, grouping__id, cube, rollup,
and cume_dist, percent_rank and the like, see:
https://www.jianshu.com/nb/19948302
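As a taste of the last two: cume_dist returns the fraction of partition rows with a value less than or equal to the current row's, and percent_rank returns (rank - 1) / (rows in partition - 1). A sketch on the cookie table above:
select
cookie_id, pv,
cume_dist() over(partition by cookie_id order by pv) as cd,   -- cumulative distribution in (0, 1]
percent_rank() over(partition by cookie_id order by pv) as pr -- relative rank in [0, 1]
from cookie;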
7. collect_set | collect_list
The input maps one ID to several names; the required output has one row per unique ID, with any one of its names.
Paired with concat_ws, collect_set can mimic MySQL's group_concat: gather the other column's values per group and join them with commas into one final string.
Both collect the values of a column across the rows of a group and return an array: collect_set deduplicates, collect_list keeps every value.
collect_set(named_struct('sku_id', sku_id, 'sku_num', sku_num, 'order_count', order_count, 'order_amount', order_amount)) order_stats
select
sid, collect_set(class_id)
from table1
group by sid;
1 [11,12,13]
2 [11,14]
3 [12,15]
The second column can feed Hive aggregates such as count and sum. For the goal above, grabbing the first element of the array is enough:
select
sid, collect_set(class_id)[0]
from table1
group by sid;
select
course, collect_set(area), avg(score)
from stud
group by course;  -- collect_set deduplicates
chinese ["sh","bj"] 79.0
math ["bj"] 93.5
select
course, collect_list(area), avg(score)
from stud
group by course;  -- collect_list keeps duplicates
chinese ["sh","sh","bj"] 79.0
math ["bj","bj"] 93.5
An index picks out a single element:
select
course, collect_set(area)[0], avg(score)
from stud
group by course;
chinese sh 79.0
math bj 93.5
collect_set combined with named_struct yields an array of structs:
select
course, collect_set(named_struct("name",name, "area",area)), avg(score)
from default.stud
group by course;
chinese [{"name":"wang5","area":"sh"},{"name":"zhao6","area":"sh"},{"name":"tian7","area":"bj"}] 79.0
math [{"name":"zhang3","area":"bj"},{"name":"li4","area":"bj"}] 93.5
collect_set followed by array_contains:
select
sum(if(array_contains(pages, 'home'), 1, 0)) home_count,
sum(if(array_contains(pages, 'good_detail'), 1, 0)) good_detail_count
from (
select
collect_set(page_id) pages
from dwd_page_log
where page_id in ('home', 'good_detail')
group by mid_id) tmp
--which specific days each user visited
select
cookie_id,
collect_set(create_time) over(partition by cookie_id) as create_time_set
from cookie
Output:
cookie_id create_time_set
a_cookie ["2019-06-10","2019-06-14","2019-06-13","2019-06-15","2019-06-16","2019-06-12","2019-06-11"]
a_cookie ["2019-06-10","2019-06-14","2019-06-13","2019-06-15","2019-06-16","2019-06-12","2019-06-11"]
a_cookie ["2019-06-10","2019-06-14","2019-06-13","2019-06-15","2019-06-16","2019-06-12","2019-06-11"]
a_cookie ["2019-06-10","2019-06-14","2019-06-13","2019-06-15","2019-06-16","2019-06-12","2019-06-11"]
a_cookie ["2019-06-10","2019-06-14","2019-06-13","2019-06-15","2019-06-16","2019-06-12","2019-06-11"]
a_cookie ["2019-06-10","2019-06-14","2019-06-13","2019-06-15","2019-06-16","2019-06-12","2019-06-11"]
a_cookie ["2019-06-10","2019-06-14","2019-06-13","2019-06-15","2019-06-16","2019-06-12","2019-06-11"]
sort_array:
Sorts an array; usually paired with collect_set or collect_list.
Taking collect_list as the example, the dates above are not combined in order; suppose they must be combined in ascending time order.
--combine in ascending time order
select
cookie_id,
sort_array(collect_list(create_time) over(partition by cookie_id)) as create_time_list
from cookie
Output:
a_cookie ["2019-06-10","2019-06-11","2019-06-12","2019-06-13","2019-06-14","2019-06-15","2019-06-16"]
a_cookie ["2019-06-10","2019-06-11","2019-06-12","2019-06-13","2019-06-14","2019-06-15","2019-06-16"]
a_cookie ["2019-06-10","2019-06-11","2019-06-12","2019-06-13","2019-06-14","2019-06-15","2019-06-16"]
a_cookie ["2019-06-10","2019-06-11","2019-06-12","2019-06-13","2019-06-14","2019-06-15","2019-06-16"]
a_cookie ["2019-06-10","2019-06-11","2019-06-12","2019-06-13","2019-06-14","2019-06-15","2019-06-16"]
a_cookie ["2019-06-10","2019-06-11","2019-06-12","2019-06-13","2019-06-14","2019-06-15","2019-06-16"]
a_cookie ["2019-06-10","2019-06-11","2019-06-12","2019-06-13","2019-06-14","2019-06-15","2019-06-16"]
To combine in descending time order instead, sort_array is not actually needed: just order by create_time desc inside collect_list's window. sort_array above was only a demonstration.
--combine in descending time order
select
cookie_id,
collect_list(create_time) over(partition by cookie_id order by create_time desc) as create_time_set
from cookie
Output:
a_cookie ["2019-06-16"]
a_cookie ["2019-06-16","2019-06-15"]
a_cookie ["2019-06-16","2019-06-15","2019-06-14"]
a_cookie ["2019-06-16","2019-06-15","2019-06-14","2019-06-13"]
a_cookie ["2019-06-16","2019-06-15","2019-06-14","2019-06-13","2019-06-12"]
a_cookie ["2019-06-16","2019-06-15","2019-06-14","2019-06-13","2019-06-12","2019-06-11"]
a_cookie ["2019-06-16","2019-06-15","2019-06-14","2019-06-13","2019-06-12","2019-06-11","2019-06-10"]
8. STR_TO_MAP
(1) Syntax
STR_TO_MAP(VARCHAR text, VARCHAR listDelimiter, VARCHAR keyValueDelimiter)
(2) Description
Splits text into K-V pairs with listDelimiter, splits each pair with keyValueDelimiter, and assembles the pairs into a MAP. Defaults: listDelimiter is ',' and keyValueDelimiter is '='.
(3) Example
str_to_map('1001=2020-06-14,1002=2020-06-14', ',' , '=')
{"1001":"2020-06-14","1002":"2020-06-14"}
select
order_id, concat_ws(',', collect_set(concat(order_status,'=',operate_time)))
from ods_order_status_log where dt='2020-06-14' group by order_id;
3261 1001=2020-06-14 08:37:45.0
3263 1001=2020-06-14 08:37:45.0,1002=2020-06-14 08:37:45.0,1004=2020-06-14 08:37:45.0
3270 1001=2020-06-14 08:37:45.0,1002=2020-06-14 08:37:45.0,1005=2020-06-14 08:37:45.0
select
order_id, str_to_map(concat_ws(',',collect_set(concat(order_status,'=',operate_time))), ',' , '=')
from ods_order_status_log where dt='2020-06-14' group by order_id;
3261 {"1001":"2020-06-14 08:37:45.0"}
3263 {"1001":"2020-06-14 08:37:45.0","1002":"2020-06-14 08:37:45.0","1004":"2020-06-14 08:37:45.0"}
3270 {"1001":"2020-06-14 08:37:45.0","1002":"2020-06-14 08:37:45.0","1005":"2020-06-14 08:37:45.0"}
select
order_id,
collect_set(concat(order_status,'=',operate_time)),
--["1001=2020-06-14 08:37:45.0","1002=2020-06-14 08:37:45.0","1004=2020-06-14 08:37:45.0"]
concat_ws(',', collect_set(concat(order_status,'=',operate_time))),
--1001=2020-06-14 08:37:45.0,1002=2020-06-14 08:37:45.0,1004=2020-06-14 08:37:45.0
str_to_map(concat_ws(',',collect_set(concat(order_status,'=',operate_time))),',','=') tms
--{"1001":"2020-06-14 08:37:45.0","1002":"2020-06-14 08:37:45.0","1004":"2020-06-14 08:37:45.0"}
from ods_order_status_log -- order status table
where dt='2020-06-14' and order_id = 3268
group by order_id
9. JSON functions
get_json_object extracts a key from a JSON string; it returns null when the key does not exist.
select get_json_object('[{"name":"kris","sex":"男","age":"25"},{"name":"shero","sex":"女","age":"18"}]', '$[0]');
{"name":"kris","sex":"男","age":"25"}
select get_json_object('[{"name":"kris","sex":"男","age":"25"},{"name":"shero","sex":"女","age":"18"}]', '$[0].age');
25
explode combined with get_json_object
explode:
Converts a column to rows, typically unpacking the elements of an array into one row per element.
--简单例子
select explode(array(1,2,3,4,5))
+------+--+
| col |
+------+--+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
+------+--+
--combined with lateral view
select
get_json_object(user, '$.user_id')
from (
select
distinct collect_set(concat('{"user_id":"',user_id,'"}')) over(partition by year(visit_date)) as user_list
from url_info
) t lateral view explode(user_list) user_list as user
10. URL processing
select * from url_info;
user_id visit_url visit_cnt visit_time visit_date
user1 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 10 2020-09-12 02:20:02 2020-09-12
user1 https://blog.csdn.net/qq_28680977/article/details/108298276?k1=v1&k2=v2#Ref1 2 2020-09-11 11:20:12 2020-09-11
user1 https://blog.csdn.net/qq_28680977/article/details/108295053?k1=v1&k2=v2#Ref1 4 2020-09-10 08:19:22 2020-09-10
user1 https://blog.csdn.net/qq_28680977/article/details/108460523?k1=v1&k2=v2#Ref1 5 2020-08-12 19:20:22 2020-08-12
user2 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 29 2020-04-04 12:23:22 2020-04-04
user2 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 30 2020-05-15 12:34:23 2020-05-15
user2 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 30 2020-05-15 13:34:23 2020-05-15
user2 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 19 2020-05-16 19:03:32 2020-05-16
user2 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 10 2020-05-17 06:20:22 2020-05-17
user3 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 43 2020-04-12 08:02:22 2020-04-12
user3 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 5 2020-08-02 08:10:22 2020-08-02
user3 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 6 2020-08-02 10:10:22 2020-08-02
user3 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 50 2020-08-12 12:23:22 2020-08-12
user4 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 10 2020-04-12 11:20:22 2020-04-12
user4 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 30 2020-03-12 10:20:22 2020-03-12
user4 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 20 2020-02-12 20:26:43 2020-02-12
user2 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 10 2020-04-12 19:12:36 2020-04-12
user2 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 40 2020-05-12 18:24:31 2020-05-12
parse_url: extracts the components of a URL
select
visit_url,
parse_url(visit_url, 'HOST') as url_host,           -- host
parse_url(visit_url, 'PATH') as url_path,           -- path
parse_url(visit_url, 'QUERY') as url_query,         -- query string
parse_url(visit_url, 'REF') as url_ref,             -- fragment (ref)
parse_url(visit_url, 'PROTOCOL') as url_protocol,   -- protocol
parse_url(visit_url, 'AUTHORITY') as url_authority, -- authority
parse_url(visit_url, 'FILE') as url_file,           -- file (path plus query)
parse_url(visit_url, 'USERINFO') as url_user_info   -- userinfo
from url_info
Output:
visit_url url_host url_path url_query url_ref url_protocol url_authority url_file url_user_info
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 blog.csdn.net /qq_28680977/article/details/108161655 k1=v1&k2=v2 Ref1 https blog.csdn.net /qq_28680977/article/details/108161655?k1=v1&k2=v2 NULL
https://blog.csdn.net/qq_28680977/article/details/108298276?k1=v1&k2=v2#Ref1 blog.csdn.net /qq_28680977/article/details/108298276 k1=v1&k2=v2 Ref1 https blog.csdn.net /qq_28680977/article/details/108298276?k1=v1&k2=v2 NULL
https://blog.csdn.net/qq_28680977/article/details/108295053?k1=v1&k2=v2#Ref1 blog.csdn.net /qq_28680977/article/details/108295053 k1=v1&k2=v2 Ref1 https blog.csdn.net /qq_28680977/article/details/108295053?k1=v1&k2=v2 NULL
https://blog.csdn.net/qq_28680977/article/details/108460523?k1=v1&k2=v2#Ref1 blog.csdn.net /qq_28680977/article/details/108460523 k1=v1&k2=v2 Ref1 https blog.csdn.net /qq_28680977/article/details/108460523?k1=v1&k2=v2 NULL
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 blog.csdn.net /qq_28680977/article/details/108161655 k1=v1&k2=v2 Ref1 https blog.csdn.net /qq_28680977/article/details/108161655?k1=v1&k2=v2 NULL
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 blog.csdn.net /qq_28680977/article/details/108161655 k1=v1&k2=v2 Ref1 https blog.csdn.net /qq_28680977/article/details/108161655?k1=v1&k2=v2 NULL
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 blog.csdn.net /qq_28680977/article/details/108161655 k1=v1&k2=v2 Ref1 https blog.csdn.net /qq_28680977/article/details/108161655?k1=v1&k2=v2 NULL
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 blog.csdn.net /qq_28680977/article/details/108161655 k1=v1&k2=v2 Ref1 https blog.csdn.net /qq_28680977/article/details/108161655?k1=v1&k2=v2 NULL
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 blog.csdn.net /qq_28680977/article/details/108161655 k1=v1&k2=v2 Ref1 https blog.csdn.net /qq_28680977/article/details/108161655?k1=v1&k2=v2 NULL
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 blog.csdn.net /qq_28680977/article/details/108161655 k1=v1&k2=v2 Ref1 https blog.csdn.net /qq_28680977/article/details/108161655?k1=v1&k2=v2 NULL
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 blog.csdn.net /qq_28680977/article/details/108161655 k1=v1&k2=v2 Ref1 https blog.csdn.net /qq_28680977/article/details/108161655?k1=v1&k2=v2 NULL
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 blog.csdn.net /qq_28680977/article/details/108161655 k1=v1&k2=v2 Ref1 https blog.csdn.net /qq_28680977/article/details/108161655?k1=v1&k2=v2 NULL
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 blog.csdn.net /qq_28680977/article/details/108161655 k1=v1&k2=v2 Ref1 https blog.csdn.net /qq_28680977/article/details/108161655?k1=v1&k2=v2 NULL
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 blog.csdn.net /qq_28680977/article/details/108161655 k1=v1&k2=v2 Ref1 https blog.csdn.net /qq_28680977/article/details/108161655?k1=v1&k2=v2 NULL
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 blog.csdn.net /qq_28680977/article/details/108161655 k1=v1&k2=v2 Ref1 https blog.csdn.net /qq_28680977/article/details/108161655?k1=v1&k2=v2 NULL
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 blog.csdn.net /qq_28680977/article/details/108161655 k1=v1&k2=v2 Ref1 https blog.csdn.net /qq_28680977/article/details/108161655?k1=v1&k2=v2 NULL
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 blog.csdn.net /qq_28680977/article/details/108161655 k1=v1&k2=v2 Ref1 https blog.csdn.net /qq_28680977/article/details/108161655?k1=v1&k2=v2 NULL
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 blog.csdn.net /qq_28680977/article/details/108161655 k1=v1&k2=v2 Ref1 https blog.csdn.net /qq_28680977/article/details/108161655?k1=v1&k2=v2 NULL
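parse_url also accepts a third argument naming a single query parameter, which saves a regexp when only one key is needed:
select parse_url('https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1', 'QUERY', 'k1');  -- v1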
reflect: calls Java methods through reflection; used here for URL encoding and decoding.
--URL encoding
select
visit_url,
reflect("java.net.URLEncoder", "encode", visit_url, "UTF-8") as visit_url_encode
from url_info
visit_url visit_url_encode
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
https://blog.csdn.net/qq_28680977/article/details/108298276?k1=v1&k2=v2#Ref1 https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108298276%3Fk1%3Dv1%26k2%3Dv2%23Ref1
https://blog.csdn.net/qq_28680977/article/details/108295053?k1=v1&k2=v2#Ref1 https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108295053%3Fk1%3Dv1%26k2%3Dv2%23Ref1
https://blog.csdn.net/qq_28680977/article/details/108460523?k1=v1&k2=v2#Ref1 https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108460523%3Fk1%3Dv1%26k2%3Dv2%23Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
--URL decoding
select
visit_url,
reflect("java.net.URLDecoder", "decode", visit_url_encode, "UTF-8") as visit_url_decode
from (
select
visit_url,
reflect("java.net.URLEncoder", "encode", visit_url, "UTF-8") as visit_url_encode
from url_info
) t
visit_url visit_url_decode
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
https://blog.csdn.net/qq_28680977/article/details/108298276?k1=v1&k2=v2#Ref1 https://blog.csdn.net/qq_28680977/article/details/108298276?k1=v1&k2=v2#Ref1
https://blog.csdn.net/qq_28680977/article/details/108295053?k1=v1&k2=v2#Ref1 https://blog.csdn.net/qq_28680977/article/details/108295053?k1=v1&k2=v2#Ref1
https://blog.csdn.net/qq_28680977/article/details/108460523?k1=v1&k2=v2#Ref1 https://blog.csdn.net/qq_28680977/article/details/108460523?k1=v1&k2=v2#Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1 https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
11. Complex data type definitions
1) map
map<string,string>
2) array
array<string>
3) struct
struct<id:int,name:string,age:int>
4) struct nested in array
array<struct<id:int,name:string,age:int>>
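A minimal DDL sketch combining all four (the table and column names are illustrative):
create table user_stats (
user_id string,
tags array<string>,                               -- array
props map<string,string>,                         -- map
profile struct<id:int,name:string,age:int>,       -- struct
orders array<struct<id:int,name:string,age:int>>  -- struct nested in array
);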
Exploding an array of structs with lateral view:
user_id order_detail_stats
419 [{"sku_id":"10","sku_num":4,"order_count":2,"order_amount":894}]
329 [{"sku_id":"5","sku_num":3,"order_count":1,"order_amount":658}]
122 [{"sku_id":"4","sku_num":2,"order_count":1,"order_amount":2902}]
278 [{"sku_id":"16","sku_num":2,"order_count":1,"order_amount":451}]
select
user_id,
order_stats_struct.sku_id sku_id,
order_stats_struct.order_count order_count
from dws_user_action_daycount lateral view explode(order_detail_stats) tmp as order_stats_struct
user_id sku_id order_count
419 10 2
329 5 1
122 4 1
278 16 1
12. Math functions
round(DOUBLE a): returns a rounded to the nearest BIGINT
round(DOUBLE a, INT d): returns a rounded to d decimal places, as a DOUBLE
floor(DOUBLE a): rounds down to the largest integer not greater than a
hex(BIGINT a) / hex(STRING a) / hex(BINARY a):
returns the hexadecimal representation as a STRING; a STRING argument is converted character by character into its hex codes.
Rarely used; it once proved handy for scrubbing dirty data containing emoji characters.
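For example:
select hex(17);     -- 11
select hex('ABC');  -- 414243, one hex byte per character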
round() rounds a given number to an approximate value:
select round(3.146,2);
select round(3.145,2);
select round(3.144,2);
sqrt() returns the square root of a given number:
SELECT sqrt(4);
2.0
SELECT sqrt(5);
2.23606797749979
Type conversion functions
Convert a value into a type that is easier to query or aggregate, e.g. cast(): its basic form is cast(value as TYPE), which converts the given value to TYPE.
SELECT CAST("5" AS INT);
String functions
instr:
Returns the position of substring substr inside string str; 0 if not found, NULL if either argument is NULL. Positions start at 1. Often used as a fuzzy-match predicate.
select visit_time,instr(visit_time,'10') from url_info;
visit_time _c1
2020-09-10 08:19:22 9
2020-08-02 08:10:22 15
2020-08-02 10:10:22 12
2020-08-12 12:23:22 0
--with size + distinct
select
distinct size(collect_set(user_id) over(partition by year(visit_date))) as user_cnt
from url_info
+-----------+--+
| user_cnt |
+-----------+--+
| 4 |
+-----------+--+
1 row selected (0.268 seconds)
--with distinct and then count
select
count(1)
from (
select
distinct user_id
from url_info
) t
+-----------+--+
| count(1) |
+-----------+--+
| 4 |
+-----------+--+
1 row selected (0.661 seconds)
--even with only 19 rows the latency gap is visible; the difference comes down to shuffle
regexp_replace:
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)
Replaces the parts of the string matching the Java regular expression PATTERN with REPLACEMENT;
if REPLACEMENT is empty, the matching parts are simply removed.
--strip everything after the ? in the url
select
distinct regexp_replace(visit_url,'\\?(.*)','') as visit_url
from url_info
visit_url
https://blog.csdn.net/qq_28680977/article/details/108161655
https://blog.csdn.net/qq_28680977/article/details/108295053
https://blog.csdn.net/qq_28680977/article/details/108298276
https://blog.csdn.net/qq_28680977/article/details/108460523
regexp_extract:
regexp_extract(string subject, string pattern, int index)
Extracts the index-th capture group of the regular expression pattern from subject; mind the escaping of predefined character classes.
Comparable to xpath in a Python crawler: it pulls out exactly the piece you specify.
--extract the CSDN article number
select
distinct regexp_extract(visit_url,'/details/([0-9]+)',1) as visit_url
from url_info
visit_url
108161655
108295053
108298276
108460523
substring_index:
substring_index(string A, string delim, int count)
Returns the substring before the count-th occurrence of delim; a positive count cuts from the left, a negative one from the right.
--e.g. take the first 2 users from 2020's combined user set; this sql strings together several of the functions above
select
user_set,
substring_index(user_set, ',', 2) as user_id
from (
select
distinct concat_ws(',', collect_set(user_id) over(partition by year(visit_date))) as user_set
from url_info
) t
Grouping Sets in Hive
When using these advanced grouping aggregations, check whether vectorized execution is enabled in Hive:
set hive.vectorized.execution.enabled = true;
GROUPING SETS: aggregates by several dimension combinations at once; equivalent to UNION ALL over the individual GROUP BY result sets.
GROUPING__ID: a virtual column telling which grouping set a result row belongs to.
CUBE: aggregates over every combination of the GROUP BY dimensions.
ROLLUP: a subset of CUBE; aggregates hierarchically, anchored on the leftmost dimension.
In short: in a single GROUP BY query, aggregating by different dimension combinations is equivalent to UNION ALL of the separate GROUP BY results.
GROUPING SETS is simply a convenient way to write several GROUP BY/UNION blocks in one HiveQL statement. For each grouping set, any column not participating in that GROUP BY comes back as NULL in the result.
Usage:
Suppose a table has fields a, b, num, and we must aggregate over three combinations: (a,b), (a), (b). The conventional version might be:
SELECT
a, b, sum(num) AS total_num
FROM table_name
GROUP BY a, b
UNION ALL
SELECT a, null AS b, sum(num) AS total_num
FROM table_name
GROUP BY a
UNION ALL
SELECT null AS a, b, sum(num) AS total_num
FROM table_name
GROUP BY b
Rewritten with GROUPING SETS:
SELECT
a, b,
sum(num) AS total_num,
grouping__id
FROM table_name
GROUP BY a, b
GROUPING SETS ((a, b), (a), (b))
Much more concise, and it launches fewer jobs and computes faster (UNION ALL scans the table several times). But what is that grouping__id column for?
grouping__id indicates, in binary, which grouping set a result row belongs to. The encoding rules differ between Hive SQL and Spark SQL; the specifics:
1. Hive's grouping__id numbering rule
For example:
select a, b, c, grouping__id from test group by a, b, c
grouping sets ((a,b), (a,c), (a,b,c))
Rule: walk the group by columns from low to high; a column that appears in the grouping set contributes a 1, an absent one a 0.
So (a,b) corresponds to 011
and (a,c) corresponds to 101
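A sketch that surfaces grouping__id next to each grouping set (test, a, b, c as above); under the rule just described, (a,b) rows would carry 011 = 3, (a,c) rows 101 = 5, and (a,b,c) rows 111 = 7. Note that newer Hive versions invert the encoding to follow the SQL standard, so verify on your own version:
select a, b, c, count(1) as cnt, grouping__id
from test
group by a, b, c
grouping sets ((a,b), (a,c), (a,b,c));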