Tip: view a built-in function's documentation and examples with desc function extended, e.g.:

desc function extended from_unixtime;

Usage of Hive's split, coalesce, and collect_list functions (with examples)

split converts a string into an array.

split('a,b,c,d' , ',') ==> ["a","b","c","d"]
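An element can be taken from the resulting array by 0-based index:

split('a,b,c,d' , ',')[0] ==> "a"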

collect_list collects every value of the column without deduplication:  select collect_list(id) from table;

Handling NULLs

NVL(expr1, expr2)

If expr1 is NULL, NVL returns the value of expr2; otherwise it returns expr1.

Its purpose is to turn a NULL into an actual value. The expressions can be numeric, character, or date typed, but expr1 and expr2 must be of the same type.

hive> select nvl(1,0);
  1
hive> select nvl(null,"hello");
  hello

COALESCE(T v1, T v2, …) returns the first non-NULL argument; if every argument is NULL, it returns NULL.
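For example:

hive> select coalesce(null, 'a', 'b');
  a
hive> select coalesce(null, null, 'b');
  b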

1. Date and time functions

Converting the timezone of MongoDB data: from UTC to GMT+8, an 8-hour offset:

date_format(from_utc_timestamp(concat_ws(' ', substring(updatetime,1,10), substring(updatetime,12,8)), 'GMT+8'), 'yyyy-MM-dd HH:mm:ss') updatetime

select date_format(from_utc_timestamp(create_time, 'UTC'),   'yyyy-MM-dd HH:mm:ss') as local_time;
select date_format(from_utc_timestamp(create_time, 'GMT+8'), 'yyyy-MM-dd HH:mm:ss') as local_time;

Unix timestamps: seconds are 10 digits; milliseconds are 13 digits.

1) Timestamp to date: from_unixtime(bigint unixtime, string format) converts a value in seconds into the given format ('yyyy-MM-dd HH:mm:ss', 'yyyy-MM-dd HH', 'yyyy-MM-dd HH:mm', and so on).

select from_unixtime(unix_timestamp(), 'yyyy-MM-dd HH:mm:ss');  -- equivalent to: select from_unixtime(unix_timestamp())
date_format(from_unixtime(cast(h.updatetime as int)), 'yyyy-MM-dd HH:mm:ss')

Today:
select date_format(current_timestamp, 'yyyy-MM-dd');
The previous day:
select date_sub(current_date, 1);

2) Date to timestamp:
select unix_timestamp();  -- current UNIX timestamp in seconds (10 digits)
select unix_timestamp('2020-12-18 12', 'yyyy-MM-dd HH');
select unix_timestamp('2020-12-18 09:42:30');  -- equivalent to: unix_timestamp('2020-12-18 09:42:30', 'yyyy-MM-dd HH:mm:ss')

to_date(string timestamp) returns the date part of a timestamp string; it accepts STRING / TIMESTAMP / DATEWRITABLE types (passing a LONG fails).
select to_date('2020-09-10 10:31:31');  -- 2020-09-10

year(string date)  returns the year part of a date string
month(string date) returns the month part
day(string date)   returns the day part
hour(string str)   returns the hour; str must be in 'yyyy-MM-dd HH:mm:ss' format

1) date_format (format a date):
select date_format('2020-06-14 12:20:15', 'yyyy-MM-dd HH:mm:ss');  -- 2020-06-14 12:20:15
select date_format('2020-06-14 12:20:15', 'yyyy-MM-dd HH');        -- 2020-06-14 12

2) date_add / date_sub (shift a date by days):
date_add(string startdate, int days) adds days to startdate
date_sub(string startdate, int days) subtracts days from startdate
select date_add('2020-06-14', -1);  -- same as: select date_sub('2020-06-14', 1)  => 2020-06-13
select date_add('2020-06-14', 2);   -- 2020-06-16

3) next_day:
(1) the next Monday after a given date:
select next_day('2020-06-10', 'MO');  -- same as: select next_day('2020-06-10', 'Monday')  => 2020-06-15
Day names: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday.
(2) the Monday of the current week:
select date_add(next_day('2020-06-14', 'MO'), -7);  -- 2020-06-08

4) last_day (last day of the month):
select last_day('2020-06-14');  -- 2020-06-30

2. Concatenating fields

1) concat
concat returns NULL as soon as any one of its arguments is NULL:
hive> select concat('a','b','c');
   abc
hive> select concat('a','b',null);
   NULL
hive> select concat(payid,' ',carlogid)  -- a space ' ' separates the two fields =>
   01a893092b914703b75941b713767ebf 408693
hive> select order_id,concat(order_status,'=',operate_time)...
   1101 1001=2020-01-01 11:20:30
   1102 1002=2020-02-02 12:30:40
concat(sum(total_amount_pur),'&&',sum(total_amount_sig),'&&',sum(total_amount_jump))
   0.0&&16665.0&&0.0
concat(substr(summary_time,9,2),'',substr(summary_time,6,2),'',substr(summary_time,1,4),'_',concat(market_id),'_',concat(mid))
   09022019_108_0
   12022019_108_0
   21022019_108_0

2) concat_ws
concat_ws does not return NULL as long as at least one string argument is non-NULL; it takes the separator as its first argument:
hive> select concat_ws('-','a','b','c');
    a-b-c
hive> select concat_ws('-','a','b',null);
    a-b
concat_ws("/",r.province,r.city,a.area) channel_address  =>  北京/北京市/朝阳区   -- the fields must be strings
concat_ws("_", substr(summary_time,9,2), substr(summary_time,6,2), substr(summary_time,1,4), concat(market_id), concat(mid))
   09_03_2019_108_0
   13_03_2019_108_0
   21_03_2019_108_0

concat_ws('',array('a', 'b', 'c'))
  abc

3. String slicing: substr, substring

Syntax: substr(string A, int start), substring(string A, int start)
      substr(string A, int start, int len), substring(string A, int start, int len)
Return type: string
Description: returns the substring of A from position start to the end, or the substring of A starting at start with length len.
select substring('abcde',3);   select substr('abcde',3);     -- returns cde 
select substring('abcde',3,2); select substr('abcde',3,2);   -- returns cd 
substring(h.id, 10, 24)  =>  59409d1d2cdcc90b91c62be5   -- extracts the hex id out of ObjectId(59409d1d2cdcc90b91c62be5)

4. Running totals

1. Requirement analysis

The difference between group by and sum() over (partition by ...):

select  channel_type ,sum(num) from  record group by channel_type;
channel_type       _c1
A    30
B    15
C    19
select channel_type,dt,sum(num) over (partition by channel_type) from  record;  
select channel_type,dt,num, sum(num) over (partition by channel_type order by dt) from record; 
channel_type    dt    sum_window_0
A    2015-01-02    30
A    2015-01-01    30
A    2015-01-05    30
A    2015-01-04    30
A    2015-01-02    30
A    2015-01-03    30
B    2015-01-03    15
B    2015-01-01    15
B    2015-01-02    15
B    2015-01-02    15
C    2015-02-01    19
C    2015-01-30    19
C    2015-01-30    19
C    2015-02-02    19
channel_type    dt    num    sum_window_0
A    2015-01-01    8    8
A    2015-01-02    4    17
A    2015-01-02    5    17
A    2015-01-03    2    19
A    2015-01-04    5    24
A    2015-01-05    6    30
B    2015-01-01    1    1
B    2015-01-02    9    13
B    2015-01-02    3    13
B    2015-01-03    2    15
C    2015-01-30    8    15
C    2015-01-30    7    15
C    2015-02-01    1    16
C    2015-02-02    3    19

We have a Hive table record with columns channel_type(string), dt(string), num(int), meaning channel type, date, and the count for that day. Sample data:

select * from record;
channel_type  dt     num
A    2015-01-01    8
A    2015-01-02    4
A    2015-01-02    5
C    2015-02-01    1
A    2015-01-04    5
A    2015-01-05    6
B    2015-01-03    2
B    2015-01-02    3
A    2015-01-03    2
C    2015-01-30    8
C    2015-01-30    7
B    2015-01-02    9
B    2015-01-01    1
C    2015-02-02    3

For each channel, compute the maximum single-day count so far and the cumulative total up to each day:

-- First compute each channel's daily total:
create table record_nj as  
select
channel_type,dt,sum(num) as new_join
from record
group by channel_type,dt;
channel_type    dt    new_join
A    2015-01-01    8
A    2015-01-02    9
A    2015-01-03    2
A    2015-01-04    5
A    2015-01-05    6
B    2015-01-01    1
B    2015-01-02    12
B    2015-01-03    2
C    2015-01-30    15
C    2015-02-01    1
C    2015-02-02    3
Method 1: Hive window functions over with max() and sum()
select 
    channel_type, dt, new_join, 
    sum(new_join) over(partition by channel_type order by dt) as sum_count,
    max(new_join) over(partition by channel_type order by dt) as max_count
from record_nj;
Method 2: group by with a self-join
select 
    t1.channel_type,t1.dt,t1.new_join, 
    sum(t2.new_join) sum_count, max(t2.new_join) max_count 
from record_nj t1 join record_nj t2 on t1.channel_type = t2.channel_type 
where t1.dt >= t2.dt 
group by t1.channel_type,t1.dt,t1.new_join order by t1.channel_type,t1.dt;
The result:
channel_type    dt    new_join    sum_count    max_count
A    2015-01-01    8    8    8
A    2015-01-02    9    17    9
A    2015-01-03    2    19    9
A    2015-01-04    5    24    9
A    2015-01-05    6    30    9
B    2015-01-01    1    1    1
B    2015-01-02    12    13    12
B    2015-01-03    2    15    12
C    2015-01-30    15    15    15
C    2015-02-01    1    16    15
C    2015-02-02    3    19    15

Running sum, running product, and running max:

select 
    channel_type,
    new_join, 
    sum(new_join) over(partition by channel_type order by dt) as sum_count,  -- running sum
    sum(new_join) over(partition by channel_type order by dt rows between unbounded preceding and current row) sum_count2,  -- running sum
    round(power(10, sum(log(10, new_join)) over(partition by channel_type order by dt rows between unbounded preceding and current row))) as tired,  -- running product via log/power
    max(new_join) over(partition by channel_type order by dt) as max_count  -- running max
from record_nj;
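Note that the log/power trick only works for positive values: log(10, 0) yields NULL, and sum() silently skips NULLs, so zeros would just drop out of the product. A hedged variant that guards against non-positive values (assuming a zero or negative value should force the running product to 0):

select channel_type, dt, new_join,
    case when min(new_join) over(partition by channel_type order by dt) <= 0 then 0
         else round(power(10, sum(log(10, new_join)) over(partition by channel_type order by dt)))
    end as tired
from record_nj;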
select 
  id,order_id,order_price,
  sum(order_price) over(partition by order_id),             -- total order_price per order_id 
  sum(order_price) over(partition by order_id order by id), -- running sum of order_price ordered by id
  row_number() over(partition by order_id order by id )     -- number each order_id's rows by ascending id 
from ods_order_detail where order_id = 3276; 
id      order_id    order_price    _c3         sum_window_1    row_number_window_2
8449    3276    235.00        7534.00     235.00                1
8450    3276    7299.00        7534.00     7534.00             2

5. Row-column conversion

The source data:

select * from score;
name    subject    score
孙悟空    语文    87
孙悟空    数学    95
孙悟空    英语    68
大海    语文    94
大海    数学    56
大海    英语    84
kris    语文    64
kris    数学    86
kris    英语    84
婷婷    语文    65
婷婷    数学    85
婷婷    英语    78

Find the students whose Chinese (语文) score is higher than their math (数学) score

Method 1: join 
select
s1.name,s1.subject, s1.score from score s1 inner join score s2 on s1.name = s2.name where s1.score > s2.score and s1.subject = '语文' and s2.subject = '数学';
Method 2: case when, then aggregate
create table t1 as select
name,
case subject when '语文' then score else 0 end as chinese_score, case subject when '数学' then score else 0 end as math_score from score;
name    chinese_score    math_score
孙悟空    87    0
孙悟空    0    95
孙悟空    0    0
大海    94    0
大海    0    56
大海    0    0
kris    64    0
kris    0    86
kris    0    0
婷婷    65    0
婷婷    0    85
婷婷    0    0
create table t2 as select name, max(chinese_score) chinese_score, max(math_score) math_score from t1 group by name;
name    chinese_score    math_score
kris    64    86
大海    94    56
婷婷    65    85
孙悟空    87    95
select name, chinese_score, math_score from t2 where chinese_score > math_score;
Or merge the three statements into one:
====> select name, chinese_score, math_score from (
  select name, max(chinese_score) chinese_score, max(math_score) math_score from (
    select name,
           case subject when '语文' then score else 0 end as chinese_score,
           case subject when '数学' then score else 0 end as math_score
    from score) t1
  group by t1.name
) t2
where chinese_score > math_score;
name    chinese_score    math_score
大海    94    56

Column-to-row conversion

The data:

id  sname   math    computer    english
1   Jed     34      58          58
2   Tony    45      87          45
3   Tom     76      34          89
select id, sname, 'math' as course, math as score from score
union 
select id, sname, 'computer' as course, computer as score from score
union 
select id, sname, 'english' as course, english as score from score
order by id, sname, course;

The result:

id  sname   course      score
1   Jed     computer    58
1   Jed     english     58
1   Jed     math        34
2   Tony    computer    87
2   Tony    english     45
2   Tony    math        45
3   Tom     computer    34
3   Tom     english     89
3   Tom     math        76

6. Window functions

I. Window aggregate functions: sum, avg, max, min

Create the table and load the data:
create table cookie(
  cookie_id  string,
  create_time string,
  pv int
)row format delimited fields terminated by ",";
a_cookie,2019-06-10,1
a_cookie,2019-06-11,9
a_cookie,2019-06-12,7
a_cookie,2019-06-13,3
a_cookie,2019-06-14,2
a_cookie,2019-06-15,4
a_cookie,2019-06-16,4
b_cookie,2019-08-17,6
b_cookie,2019-08-18,9
b_cookie,2019-08-19,5
b_cookie,2019-08-17,2
load data local inpath  "/opt/module/datas/cookie.txt" into table cookie;
select 
    cookie_id,create_time,pv,
    sum(pv) over(partition by cookie_id) as sum1, -- total pv of the whole partition
    sum(pv) over(partition by  cookie_id order by create_time) as accu_sum  -- running sum from the first row to the current row
from cookie;

The result:

a_cookie    2019-06-10    1    30    1
a_cookie    2019-06-11    9    30    10
a_cookie    2019-06-12    7    30    17
a_cookie    2019-06-13    3    30    20
a_cookie    2019-06-14    2    30    22
a_cookie    2019-06-15    4    30    26
a_cookie    2019-06-16    4    30    30
b_cookie    2019-08-17    2    22    8
b_cookie    2019-08-17    6    22    8
b_cookie    2019-08-18    9    22    17
b_cookie    2019-08-19    5    22    22
select
    cookie_id,create_time,pv,
    sum(pv) over(partition by cookie_id) as sum1, -- total pv of the whole partition
    sum(pv) over(partition by  cookie_id order by create_time) as accu_sum, -- running sum from the first row to the current row
    sum(pv) over(partition by cookie_id order by create_time rows between unbounded preceding and current row) as sum3,
                         -- the default frame already runs from the start of the partition to the current row, so "between unbounded preceding and current row" can be omitted
    sum(pv) over(partition by  cookie_id order by create_time rows between current row and unbounded following) as sum4,
                       -- sums from the current row to the end of the partition; the reverse of the previous one, i.e. a descending running sum
    sum(pv) over(partition by  cookie_id order by create_time rows between 3 preceding and current row) as sum5,
                       -- sums the current row plus the 3 preceding rows, 4 rows in total
    sum(pv) over(partition by  cookie_id order by create_time rows between 3 preceding and 1 following) as sum6
                        -- sums the 3 preceding rows, the current row, and 1 following row
from cookie;

The result:

a_cookie    2019-06-10    1    30    1    1    30    1    10
a_cookie    2019-06-11    9    30    10    10    29    10    17
a_cookie    2019-06-12    7    30    17    17    20    17    20
a_cookie    2019-06-13    3    30    20    20    13    20    22
a_cookie    2019-06-14    2    30    22    22    10    21    25
a_cookie    2019-06-15    4    30    26    26    8    16    20
a_cookie    2019-06-16    4    30    30    30    4    13    13
b_cookie    2019-08-17    2    22    8    2    22    2    8
b_cookie    2019-08-17    6    22    8    8    20    8    17
b_cookie    2019-08-18    9    22    17    17    14    17    22
b_cookie    2019-08-19    5    22    22    22    5    22    22

sum, avg, max, and min all share the same window syntax as above.
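For instance, on the same cookie table, the other three aggregates just swap the function name (a minimal sketch):

select
    cookie_id, create_time, pv,
    avg(pv) over(partition by cookie_id order by create_time) as avg_pv,  -- running average
    max(pv) over(partition by cookie_id order by create_time) as max_pv,  -- running max
    min(pv) over(partition by cookie_id order by create_time) as min_pv   -- running min
from cookie;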

Task: the sales count of the last 3 days (the days must be consecutive)

sum(paid_cnt) over (partition by org_id,goods_id order by stats_date rows BETWEEN 3 PRECEDING AND 1 PRECEDING) 3_precePre_paid_cnt, -- excludes the current row

rows counts window bounds by physical row; to bound by actual time instead, use range over a unix timestamp:

sum(paid_cnt) over (partition by org_id,goods_id order by unix_timestamp(stats_date, 'yyyy-MM-dd') RANGE BETWEEN 86400 * 3 PRECEDING AND 1 PRECEDING) 3_precePre_paid_cnt, -- excludes the current date (row)

select
        stats_date,
        org_id,
        goods_id,
        paid_cnt,
        sum(paid_cnt) over (partition by org_id,goods_id order by unix_timestamp(stats_date, 'yyyy-MM-dd') RANGE BETWEEN 86400 * 3 PRECEDING AND 1 PRECEDING) 3_precePre_paid_cnt,    -- excludes the current date (row)
        sum(paid_cnt) over (partition by org_id,goods_id order by unix_timestamp(stats_date, 'yyyy-MM-dd') RANGE BETWEEN 86400 * 3 PRECEDING AND 1 FOLLOWING) 3_preceFllow_paid_cnt,  -- includes the current date (row)
        sum(paid_cnt) over (partition by org_id,goods_id order by unix_timestamp(stats_date, 'yyyy-MM-dd') RANGE BETWEEN 86400 * 3 PRECEDING AND -1 FOLLOWING) 3_prece_Fllow_paid_cnt, -- excludes the current date (row)
        sum(paid_cnt) over(partition by stats_date,org_id,goods_id) as paid_sum
from dws_sales_org_sku where org_id = 'xxx' and goods_id = xxx
order by 1 desc

The result:

+-------------+----------+-----------+-----------+----------------------+------------------------+-------------------------+-----------+--+
| stats_date  |  org_id  | goods_id  | paid_cnt  | 3_precePre_paid_cnt  | 3_preceFllow_paid_cnt  | 3_prece_Fllow_paid_cnt  | paid_sum  |
+-------------+----------+-----------+-----------+----------------------+------------------------+-------------------------+-----------+--+
| 2022-06-15  | JM00026  | 12266     | 2         | 19                   | 21                     | 19                      | 2         |
| 2022-06-14  | JM00026  | 12266     | 1         | 19                   | 20                     | 19                      | 1         |
| 2022-06-13  | JM00026  | 12266     | 4         | 19                   | 23                     | 19                      | 4         |
| 2022-06-12  | JM00026  | 12266     | 14        | 5                    | 19                     | 5                       | 14        |
| 2022-06-11  | JM00026  | 12266     | 1         | 5                    | 6                      | 5                       | 1         |
| 2022-06-10  | JM00026  | 12266     | 4         | 12                   | 16                     | 12                      | 4         |
| 2022-06-08  | JM00026  | 12266     | 1         | 16                   | 17                     | 16                      | 1         |
| 2022-06-07  | JM00026  | 12266     | 11        | 11                   | 22                     | 11                      | 11        |
| 2022-06-06  | JM00026  | 12266     | 3         | 23                   | 26                     | 23                      | 3         |
| 2022-06-05  | JM00026  | 12266     | 2         | 29                   | 31                     | 29                      | 2         |
| 2022-06-04  | JM00026  | 12266     | 6         | 25                   | 31                     | 25                      | 6         |
| 2022-06-03  | JM00026  | 12266     | 15        | 12                   | 27                     | 12                      | 15        |
| 2022-06-02  | JM00026  | 12266     | 8         | 5                    | 13                     | 5                       | 8         |
| 2022-06-01  | JM00026  | 12266     | 2         | 32                   | 34                     | 32                      | 2         |
| 2022-05-31  | JM00026  | 12266     | 2         | 50                   | 52                     | 50                      | 2         |
| 2022-05-30  | JM00026  | 12266     | 1         | 49                   | 50                     | 49                      | 1         |
| 2022-05-29  | JM00026  | 12266     | 29        | 20                   | 49                     | 20                      | 29        |
| 2022-05-28  | JM00026  | 12266     | 20        | 2                    | 22                     | 2                       | 20        |
| 2022-05-25  | JM00026  | 12266     | 2         | 30                   | 32                     | 30                      | 2         |
| 2022-05-24  | JM00026  | 12266     | 8         | 34                   | 42                     | 34                      | 8         |
| 2022-05-23  | JM00026  | 12266     | 8         | 32                   | 40                     | 32                      | 8         |
| 2022-05-22  | JM00026  | 12266     | 14        | 18                   | 32                     | 18                      | 14        |
| 2022-05-21  | JM00026  | 12266     | 12        | 6                    | 18                     | 6                       | 12        |
| 2022-05-20  | JM00026  | 12266     | 6         | NULL                 | 6                      | NULL                    | 6         |
| 2022-05-15  | JM00026  | 12266     | 2         | 4                    | 6                      | 4                       | 2         |
| 2022-05-14  | JM00026  | 12266     | 2         | 2                    | 4                      | 2                       | 2         |
| 2022-05-13  | JM00026  | 12266     | 1         | 2                    | 3                      | 2                       | 1         |
| 2022-05-12  | JM00026  | 12266     | 1         | 7                    | 8                      | 7                       | 1         |
| 2022-05-10  | JM00026  | 12266     | 1         | 10                   | 11                     | 10                      | 1         |
| 2022-05-09  | JM00026  | 12266     | 6         | 6                    | 12                     | 6                       | 6         |
| 2022-05-08  | JM00026  | 12266     | 2         | 5                    | 7                      | 5                       | 2         |
| 2022-05-07  | JM00026  | 12266     | 2         | 7                    | 9                      | 7                       | 2         |
| 2022-05-06  | JM00026  | 12266     | 2         | 12                   | 14                     | 12                      | 2         |
| 2022-05-05  | JM00026  | 12266     | 1         | 15                   | 16                     | 15                      | 1         |
| 2022-05-04  | JM00026  | 12266     | 4         | 15                   | 19                     | 15                      | 4         |
| 2022-05-03  | JM00026  | 12266     | 7         | 17                   | 24                     | 17                      | 7         |
| 2022-05-02  | JM00026  | 12266     | 4         | 17                   | 21                     | 17                      | 4         |
| 2022-05-01  | JM00026  | 12266     | 4         | 15                   | 19                     | 15                      | 4         |
+-------------+----------+-----------+-----------+----------------------+------------------------+-------------------------+-----------+--+
II. ntile

ntile(n) splits each partition's ordered rows into n buckets and returns the bucket number of the current row; ntile does not support rows between. When the split is uneven, the first bucket gets the extra row by default.
Split the partitions into 1, 2, 3, and 4 buckets:
select 
    cookie_id,create_time,pv,
    ntile(1) over(partition by  cookie_id order by create_time) as nt1,
    ntile(2) over(partition by  cookie_id order by create_time) as nt2,
    ntile(3) over(partition by  cookie_id order by create_time) as nt3,
    ntile(4) over(partition by  cookie_id order by create_time) as nt4
from cookie;
Result: partition a_cookie has 7 rows, so by default the first bucket gets one extra row
cookie_id    create_time pv nt1 nt2 nt3 nt4
a_cookie    2019-06-10    1    1    1    1    1
a_cookie    2019-06-11    9    1    1    1    1
a_cookie    2019-06-12    7    1    1    1    2
a_cookie    2019-06-13    3    1    1    2    2
a_cookie    2019-06-14    2    1    2    2    3
a_cookie    2019-06-15    4    1    2    3    3
a_cookie    2019-06-16    4    1    2    3    4
b_cookie    2019-08-17    6    1    1    1    1
b_cookie    2019-08-17    2    1    1    1    2
b_cookie    2019-08-18    9    1    2    2    3
b_cookie    2019-08-19    5    1    2    3    4

Use case: for each cookie, find the top 1/3 of days by pv

select 
    cookie_id,create_time,pv
from(
select 
    cookie_id,create_time,pv,
    ntile(3) over(partition by  cookie_id order by pv desc) as nt1 -- order by pv descending; the rows in bucket nt1 = 1 are the answer
from cookie
)t where nt1 = 1;
a_cookie    2019-06-11    9
a_cookie    2019-06-12    7
a_cookie    2019-06-16    4
b_cookie    2019-08-18    9
b_cookie    2019-08-17    6

III. Window ranking functions

What they have in common: all of them rank rows within a partition.

row_number: even tied values get distinct ranks, which simply keep incrementing; the sequence is consecutive and never repeats:  1, 2, 3, 4, 5

rank: tied values share a rank and leave a gap after the tie; the same rank can appear within a partition, but the sequence is not consecutive:  1, 2, 3, 3, 5

dense_rank: tied values share a rank with no gap after the tie; the same rank can appear, and the sequence stays consecutive:  1, 2, 3, 3, 4

select 
    cookie_id,create_time,pv,
    row_number() over(partition by  cookie_id order by pv desc) as row_index,
    rank() over(partition by  cookie_id order by pv desc) as rank_index,
    dense_rank() over(partition by  cookie_id order by pv desc) as dense_index
from cookie;
cookie_id    create_time    pv    row_index    rank_index    dense_index
a_cookie    2019-06-11    9    1    1    1
a_cookie    2019-06-12    7    2    2    2
a_cookie    2019-06-16    4    3    3    3
a_cookie    2019-06-15    4    4    3    3
a_cookie    2019-06-13    3    5    5    4
a_cookie    2019-06-14    2    6    6    5
a_cookie    2019-06-10    1    7    7    6

Common use case: top-N per group, e.g. the top 3 pv values for each cookie

create table tmp as
select cookie_id, create_time, pv,
row_number() over(partition by cookie_id order by pv desc) as rn
from cookie;
select * from tmp where rn <= 3;
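If tied pv values should share a rank (so two days tied for third place both survive), a variant swapping in rank() on the same cookie table:

select * from (
  select cookie_id, create_time, pv,
         rank() over(partition by cookie_id order by pv desc) as rk
  from cookie
) t where rk <= 3;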

IV. lag | lead | first_value | last_value

LAG(col, n, DEFAULT) returns the value from the n-th row above the current row within the window

  • first argument: the column name
  • second argument: how many rows up (optional; defaults to 1)
  • third argument: the default value (returned when the n-th row above is NULL; if unspecified, NULL)
  • select 
        cookie_id,create_time,pv, 
        lag(pv,1) over (partition by cookie_id order by create_time) as last_1_pv -- a default of 0 could be set with lag(pv,1,0) over(...)
     from cookie;

    The result (no default was set, so a row with no row above shows NULL):
    cookie_id  create_time  pv  last_1_pv
    a_cookie    2019-06-10    1    NULL
    a_cookie    2019-06-11    9    1
    a_cookie    2019-06-12    7    9
    a_cookie    2019-06-13    3    7
    a_cookie    2019-06-14    2    3
    a_cookie    2019-06-15    4    2
    a_cookie    2019-06-16    4    4
    b_cookie    2019-08-17    6    NULL
    b_cookie    2019-08-17    2    6
    b_cookie    2019-08-18    9    2
    b_cookie    2019-08-19    5    9
    lead does the opposite of lag.
    LEAD(col, n, DEFAULT) returns the value from the n-th row below the current row; commonly used to fill values downward or to compute a difference against a specific row
  • first argument: the column name
  • second argument: how many rows down (optional; defaults to 1)
  • third argument: the default value (returned when the n-th row below is NULL; if unspecified, NULL)
  • select 
        cookie_id,create_time,pv, 
        lead(pv,1) over(partition by cookie_id order by create_time) as next_1_pv,
        lead(create_time,1,'1970-01-01') over(partition by cookie_id order by create_time) as next_1_time 
    from cookie;
    The result:
    cookie_id create_time pv next_1_pv next_1_time 
    a_cookie    2019-06-10    1    9    2019-06-11
    a_cookie    2019-06-11    9    7    2019-06-12
    a_cookie    2019-06-12    7    3    2019-06-13
    a_cookie    2019-06-13    3    2    2019-06-14
    a_cookie    2019-06-14    2    4    2019-06-15
    a_cookie    2019-06-15    4    4    2019-06-16
    a_cookie    2019-06-16    4    NULL    1970-01-01
    b_cookie    2019-08-17    6    2    2019-08-17
    b_cookie    2019-08-17    2    9    2019-08-18
    b_cookie    2019-08-18    9    5    2019-08-19
    b_cookie    2019-08-19    5    NULL    1970-01-01

    first_value: after ordering within the partition, the first value from the start of the partition up to the current row

    select 
        cookie_id,create_time,pv, 
        first_value(pv) over (partition by cookie_id  order by pv) as first_pv 
    from cookie;
    The result:
    cookie_id create_time  pv first_pv 
    a_cookie    2019-06-10    1    1
    a_cookie    2019-06-14    2    1
    a_cookie    2019-06-13    3    1
    a_cookie    2019-06-16    4    1
    a_cookie    2019-06-15    4    1
    a_cookie    2019-06-12    7    1
    a_cookie    2019-06-11    9    1
    b_cookie    2019-08-17    2    2
    b_cookie    2019-08-19    5    2
    b_cookie    2019-08-17    6    2
    b_cookie    2019-08-18    9    2

    last_value: after ordering within the partition, the last value up to the current row (i.e. the current row itself)

    select 
        cookie_id,create_time,pv, 
        last_value(pv) over (partition by cookie_id  order by pv) as last_pv 
    from cookie;
    cookie_id create_time  pv last_pv 
    a_cookie    2019-06-10    1    1
    a_cookie    2019-06-14    2    2
    a_cookie    2019-06-13    3    3
    a_cookie    2019-06-16    4    4
    a_cookie    2019-06-15    4    4
    a_cookie    2019-06-12    7    7
    a_cookie    2019-06-11    9    9
    b_cookie    2019-08-17    2    2
    b_cookie    2019-08-19    5    5
    b_cookie    2019-08-17    6    6
    b_cookie    2019-08-18    9    9
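    To get the true last value of the whole partition rather than the current row, widen the frame explicitly; with an order by present, the default frame stops at the current row. A minimal sketch:

    select 
        cookie_id, create_time, pv, 
        last_value(pv) over (partition by cookie_id order by pv
                             rows between unbounded preceding and unbounded following) as last_pv 
    from cookie;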

    grouping sets, grouping__id, cube, rollup (see the Grouping Sets section below);

    cume_dist, percent_rank, and similar functions:

    https://www.jianshu.com/nb/19948302
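    A quick sketch of the last two on the cookie table: cume_dist is (rows ordered at or before the current row) / (total rows); percent_rank is (rank − 1) / (total rows − 1).

    select 
        cookie_id, pv,
        cume_dist() over (partition by cookie_id order by pv) as cd,
        percent_rank() over (partition by cookie_id order by pv) as pr
    from cookie;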

    7. collect_set | collect_list

    The input has multiple names per ID; the output must have one row per unique ID, keeping any one of the names.

    collect_set fills the role of MySQL's group_concat (which joins each group's values of another column into one comma-separated string), except that it gathers them into an array:

    To collect values from different rows of the same group: collect_set (deduplicates), collect_list (keeps everything).

    collect_set puts a group's values into a set, deduplicating; the return type is array, i.e. a deduplicated collection.

    collect_list works the same way but without deduplication.

    collect_set(named_struct('sku_id', sku_id, 'sku_num', sku_num, 'order_count', order_count, 'order_amount', order_amount)) order_stats

    select sid, collect_set(class_id) from table1 group by sid;
    1    [11,12,13]
    2    [11,14]
    3    [12,15]
    The second column can feed counting or summing via Hive's count and sum aggregates.
    For the goal above, taking the first element of the array is enough:
    select sid, collect_set(class_id)[0] from table1 group by sid;

    select course, collect_set(area), avg(score) from stud group by course;   -- collect_set deduplicates
    chinese    ["sh","bj"]    79.0
    math    ["bj"]    93.5
    select course, collect_list(area), avg(score) from stud group by course;  -- collect_list keeps duplicates
    chinese    ["sh","sh","bj"]    79.0
    math    ["bj","bj"]    93.5
    A subscript takes a single element:
    select course, collect_set(area)[0], avg(score) from stud group by course;
    chinese    sh    79.0
    math    bj    93.5

    collect_set combined with named_struct forms an array of structs:
    select course, collect_set(named_struct("name",name, "area",area)), avg(score) from default.stud group by course;
    chinese    [{"name":"wang5","area":"sh"},{"name":"zhao6","area":"sh"},{"name":"tian7","area":"bj"}]    79.0
    math    [{"name":"zhang3","area":"bj"},{"name":"li4","area":"bj"}]    93.5

    collect_set followed by array_contains:
    select
      sum(if(array_contains(pages, 'home'), 1, 0)) home_count,
      sum(if(array_contains(pages, 'good_detail'), 1, 0)) good_detail_count
    from
      (select collect_set(page_id) pages from dwd_page_log where page_id in ('home', 'good_detail') group by mid_id) tmp
    -- which days each user visited
    select
      cookie_id, collect_set(create_time) over(partition by cookie_id) as create_time_set
    from cookie

    The result:
    cookie_id    create_time_set
    a_cookie    ["2019-06-10","2019-06-14","2019-06-13","2019-06-15","2019-06-16","2019-06-12","2019-06-11"]
    a_cookie    ["2019-06-10","2019-06-14","2019-06-13","2019-06-15","2019-06-16","2019-06-12","2019-06-11"]
    a_cookie    ["2019-06-10","2019-06-14","2019-06-13","2019-06-15","2019-06-16","2019-06-12","2019-06-11"]
    a_cookie    ["2019-06-10","2019-06-14","2019-06-13","2019-06-15","2019-06-16","2019-06-12","2019-06-11"]
    a_cookie    ["2019-06-10","2019-06-14","2019-06-13","2019-06-15","2019-06-16","2019-06-12","2019-06-11"]
    a_cookie    ["2019-06-10","2019-06-14","2019-06-13","2019-06-15","2019-06-16","2019-06-12","2019-06-11"]
    a_cookie    ["2019-06-10","2019-06-14","2019-06-13","2019-06-15","2019-06-16","2019-06-12","2019-06-11"]

    sort_array:

    Sorts the elements inside an array; usually combined with collect_set or collect_list.

    Taking collect_list as the example: the dates above are not combined in order, and here the requirement is to combine them in ascending time order.

    -- combine in ascending time order
    select
      cookie_id,
      sort_array(collect_list(create_time) over(partition by cookie_id)) as create_time_list 
    from cookie
    The result:
    a_cookie    ["2019-06-10","2019-06-11","2019-06-12","2019-06-13","2019-06-14","2019-06-15","2019-06-16"]
    a_cookie    ["2019-06-10","2019-06-11","2019-06-12","2019-06-13","2019-06-14","2019-06-15","2019-06-16"]
    a_cookie    ["2019-06-10","2019-06-11","2019-06-12","2019-06-13","2019-06-14","2019-06-15","2019-06-16"]
    a_cookie    ["2019-06-10","2019-06-11","2019-06-12","2019-06-13","2019-06-14","2019-06-15","2019-06-16"]
    a_cookie    ["2019-06-10","2019-06-11","2019-06-12","2019-06-13","2019-06-14","2019-06-15","2019-06-16"]
    a_cookie    ["2019-06-10","2019-06-11","2019-06-12","2019-06-13","2019-06-14","2019-06-15","2019-06-16"]
    a_cookie    ["2019-06-10","2019-06-11","2019-06-12","2019-06-13","2019-06-14","2019-06-15","2019-06-16"]

    To combine in descending time order instead, how should the sql above change?
    sort_array is actually unnecessary here: just order by create_time desc inside the collect_list window
    (sort_array above is only for demonstration). Note that once the window has an order by, each row only
    sees the frame up to itself, so the arrays grow row by row:

    -- combine in descending time order
    select
      cookie_id,
      collect_list(create_time) over(partition by cookie_id order by create_time desc) as create_time_set 
    from cookie
    The result:
    a_cookie    ["2019-06-16"]
    a_cookie    ["2019-06-16","2019-06-15"]
    a_cookie    ["2019-06-16","2019-06-15","2019-06-14"]
    a_cookie    ["2019-06-16","2019-06-15","2019-06-14","2019-06-13"]
    a_cookie    ["2019-06-16","2019-06-15","2019-06-14","2019-06-13","2019-06-12"]
    a_cookie    ["2019-06-16","2019-06-15","2019-06-14","2019-06-13","2019-06-12","2019-06-11"]
    a_cookie    ["2019-06-16","2019-06-15","2019-06-14","2019-06-13","2019-06-12","2019-06-11","2019-06-10"]

    8. STR_TO_MAP

    The STR_TO_MAP function
    (1) Syntax
      STR_TO_MAP(VARCHAR text, VARCHAR listDelimiter, VARCHAR keyValueDelimiter)
    (2) Description
      Splits text into K-V pairs with listDelimiter, then splits each pair with keyValueDelimiter, assembling the result into a MAP. listDelimiter defaults to ',' and keyValueDelimiter to '='.
    (3) Example
      str_to_map('1001=2020-06-14,1002=2020-06-14', ',' , '=')
        {"1001":"2020-06-14","1002":"2020-06-14"}
    select 
        order_id, concat_ws(',', collect_set(concat(order_status,'=',operate_time))) 
    from ods_order_status_log where dt='2020-06-14' group by order_id;
    3261    1001=2020-06-14 08:37:45.0
    3263    1001=2020-06-14 08:37:45.0,1002=2020-06-14 08:37:45.0,1004=2020-06-14 08:37:45.0
    3270    1001=2020-06-14 08:37:45.0,1002=2020-06-14 08:37:45.0,1005=2020-06-14 08:37:45.0
    select 
        order_id, str_to_map(concat_ws(',',collect_set(concat(order_status,'=',operate_time))), ','  ,  '=') 
    from ods_order_status_log where dt='2020-06-14' group by order_id;
    3261    {"1001":"2020-06-14 08:37:45.0"}
    3263    {"1001":"2020-06-14 08:37:45.0","1002":"2020-06-14 08:37:45.0","1004":"2020-06-14 08:37:45.0"}
    
    
    
    
    
        
    3270    {"1001":"2020-06-14 08:37:45.0","1002":"2020-06-14 08:37:45.0","1005":"2020-06-14 08:37:45.0"}
    select
        order_id,
        collect_set(concat(order_status,'=',operate_time)), 
        --["1001=2020-06-14 08:37:45.0","1002=2020-06-14 08:37:45.0","1004=2020-06-14 08:37:45.0"]
        concat_ws(',', collect_set(concat(order_status,'=',operate_time))), 
        --1001=2020-06-14 08:37:45.0,1002=2020-06-14 08:37:45.0,1004=2020-06-14 08:37:45.0
        str_to_map(concat_ws(',',collect_set(concat(order_status,'=',operate_time))),',','=') tms 
        --{"1001":"2020-06-14 08:37:45.0","1002":"2020-06-14 08:37:45.0","1004":"2020-06-14 08:37:45.0"}
    from ods_order_status_log  -- order status table 
    where dt='2020-06-14' and order_id = 3268
    group by order_id
    

    9. JSON functions

    get_json_object extracts a field from a JSON string by path, returning NULL if it does not exist

    select get_json_object('[{"name":"kris","sex":"男","age":"25"},{"name":"shero","sex":"女","age":"18"}]', '$[0]'); 
      {"name":"kris","sex":"男","age":"25"}
    select get_json_object('[{"name":"kris","sex":"男","age":"25"},{"name":"shero","sex":"女","age":"18"}]', '$[0].age'); 
    

     explode combined with get_json_object

    explode:
    turns columns into rows, typically unpacking the elements of an array into multiple rows
    -- a simple example
    select  explode(array(1,2,3,4,5))
    +------+--+
    | col  |
    +------+--+
    | 1    |
    | 2    |
    | 3    |
    | 4    |
    | 5    |
    +------+--+
    -- combined with lateral view
    select 
      get_json_object(user, '$.user_id')
    from (
      select 
        distinct collect_set(concat('{"user_id":"', user_id, '"}')) over(partition by year(visit_date)) as user_list
      from url_info
    ) t lateral view explode(user_list) user_list as user

    10. URL handling

    select * from url_info;
    user_id                                    visit_url                                    visit_cnt    visit_time        visit_date
    user1    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    10    2020-09-12 02:20:02    2020-09-12
    user1    https://blog.csdn.net/qq_28680977/article/details/108298276?k1=v1&k2=v2#Ref1    2    2020-09-11 11:20:12    2020-09-11
    user1    https://blog.csdn.net/qq_28680977/article/details/108295053?k1=v1&k2=v2#Ref1    4    2020-09-10 08:19:22    2020-09-10
    user1    https://blog.csdn.net/qq_28680977/article/details/108460523?k1=v1&k2=v2#Ref1    5    2020-08-12 19:20:22    2020-08-12
    user2    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    29    2020-04-04 12:23:22    2020-04-04
    user2    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    30    2020-05-15 12:34:23    2020-05-15
    user2    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    30    2020-05-15 13:34:23    2020-05-15
    user2    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    19    2020-05-16 19:03:32    2020-05-16
    user2    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    10    2020-05-17 06:20:22    2020-05-17
    user3    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    43    2020-04-12 08:02:22    2020-04-12
    user3    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    5    2020-08-02 08:10:22    2020-08-02
    user3    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    6    2020-08-02 10:10:22    2020-08-02
    user3    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    50    2020-08-12 12:23:22    2020-08-12
    user4    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    10    2020-04-12 11:20:22    2020-04-12
    user4    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    30    2020-03-12 10:20:22    2020-03-12
    user4    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    20    2020-02-12 20:26:43    2020-02-12
    user2    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    10    2020-04-12 19:12:36    2020-04-12
    user2    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    40    2020-05-12 18:24:31    2020-05-12
    

    parse_url: parses a URL's components

    select 
    visit_url,
    parse_url(visit_url, 'HOST') as url_host,           -- host
    parse_url(visit_url, 'PATH') as url_path,           -- path
    parse_url(visit_url, 'QUERY') as url_query,         -- query string
    parse_url(visit_url, 'REF') as url_ref,             -- ref (fragment)
    parse_url(visit_url, 'PROTOCOL') as url_protocol,   -- protocol
    parse_url(visit_url, 'AUTHORITY') as url_authority, -- authority
    parse_url(visit_url, 'FILE') as url_file,           -- file path
    parse_url(visit_url, 'USERINFO') as url_user_info   -- user info
    from url_info
    The result:
    visit_url                                                                            url_host                url_path                     url_query    url_ref    url_protocol  url_authority        url_file                              url_user_info
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    blog.csdn.net    /qq_28680977/article/details/108161655    k1=v1&k2=v2    Ref1    https    blog.csdn.net    /qq_28680977/article/details/108161655?k1=v1&k2=v2    NULL
    https://blog.csdn.net/qq_28680977/article/details/108298276?k1=v1&k2=v2#Ref1    blog.csdn.net    /qq_28680977/article/details/108298276    k1=v1&k2=v2    Ref1    https    blog.csdn.net    /qq_28680977/article/details/108298276?k1=v1&k2=v2    NULL
    https://blog.csdn.net/qq_28680977/article/details/108295053?k1=v1&k2=v2#Ref1    blog.csdn.net    /qq_28680977/article/details/108295053    k1=v1&k2=v2    Ref1    https    blog.csdn.net    /qq_28680977/article/details/108295053?k1=v1&k2=v2    NULL
    https://blog.csdn.net/qq_28680977/article/details/108460523?k1=v1&k2=v2#Ref1    blog.csdn.net    /qq_28680977/article/details/108460523    k1=v1&k2=v2    Ref1    https    blog.csdn.net    /qq_28680977/article/details/108460523?k1=v1&k2=v2    NULL
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    blog.csdn.net    /qq_28680977/article/details/108161655    k1=v1&k2=v2    Ref1    https    blog.csdn.net    /qq_28680977/article/details/108161655?k1=v1&k2=v2    NULL
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    blog.csdn.net    /qq_28680977/article/details/108161655    k1=v1&k2=v2    Ref1    https    blog.csdn.net    /qq_28680977/article/details/108161655?k1=v1&k2=v2    NULL
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    blog.csdn.net    /qq_28680977/article/details/108161655    k1=v1&k2=v2    Ref1    https    blog.csdn.net    /qq_28680977/article/details/108161655?k1=v1&k2=v2    NULL
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    blog.csdn.net    /qq_28680977/article/details/108161655    k1=v1&k2=v2    Ref1    https    blog.csdn.net    /qq_28680977/article/details/108161655?k1=v1&k2=v2    NULL
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    blog.csdn.net    /qq_28680977/article/details/108161655    k1=v1&k2=v2    Ref1    https    blog.csdn.net    /qq_28680977/article/details/108161655?k1=v1&k2=v2    NULL
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    blog.csdn.net    /qq_28680977/article/details/108161655    k1=v1&k2=v2    Ref1    https    blog.csdn.net    /qq_28680977/article/details/108161655?k1=v1&k2=v2    NULL
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    blog.csdn.net    /qq_28680977/article/details/108161655    k1=v1&k2=v2    Ref1    https    blog.csdn.net    /qq_28680977/article/details/108161655?k1=v1&k2=v2    NULL
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    blog.csdn.net    /qq_28680977/article/details/108161655    k1=v1&k2=v2    Ref1    https    blog.csdn.net    /qq_28680977/article/details/108161655?k1=v1&k2=v2    NULL
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    blog.csdn.net    /qq_28680977/article/details/108161655    k1=v1&k2=v2    Ref1    https    blog.csdn.net    /qq_28680977/article/details/108161655?k1=v1&k2=v2    NULL
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    blog.csdn.net    /qq_28680977/article/details/108161655    k1=v1&k2=v2    Ref1    https    blog.csdn.net    /qq_28680977/article/details/108161655?k1=v1&k2=v2    NULL
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    blog.csdn.net    /qq_28680977/article/details/108161655    k1=v1&k2=v2    Ref1    https    blog.csdn.net    /qq_28680977/article/details/108161655?k1=v1&k2=v2    NULL
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    blog.csdn.net    /qq_28680977/article/details/108161655    k1=v1&k2=v2    Ref1    https    blog.csdn.net    /qq_28680977/article/details/108161655?k1=v1&k2=v2    NULL
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    blog.csdn.net    /qq_28680977/article/details/108161655    k1=v1&k2=v2    Ref1    https    blog.csdn.net    /qq_28680977/article/details/108161655?k1=v1&k2=v2    NULL
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    blog.csdn.net    /qq_28680977/article/details/108161655    k1=v1&k2=v2    Ref1    https    blog.csdn.net    /qq_28680977/article/details/108161655?k1=v1&k2=v2    NULL

    reflect: calls Java static methods via reflection, handy for URL encoding and decoding

    -- URL encoding

    select 
        visit_url,
        reflect("java.net.URLEncoder", "encode", visit_url, "UTF-8") as visit_url_encode
    from url_info
    visit_url    visit_url_encode
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
    https://blog.csdn.net/qq_28680977/article/details/108298276?k1=v1&k2=v2#Ref1    https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108298276%3Fk1%3Dv1%26k2%3Dv2%23Ref1
    https://blog.csdn.net/qq_28680977/article/details/108295053?k1=v1&k2=v2#Ref1    https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108295053%3Fk1%3Dv1%26k2%3Dv2%23Ref1
    https://blog.csdn.net/qq_28680977/article/details/108460523?k1=v1&k2=v2#Ref1    https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108460523%3Fk1%3Dv1%26k2%3Dv2%23Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https%3A%2F%2Fblog.csdn.net%2Fqq_28680977%2Farticle%2Fdetails%2F108161655%3Fk1%3Dv1%26k2%3Dv2%23Ref1
    -- URL decoding
    select 
      visit_url,
      reflect("java.net.URLDecoder", "decode", visit_url_encode, "UTF-8") as visit_url_decode
    from (
      select 
        visit_url,
        reflect("java.net.URLEncoder", "encode", visit_url, "UTF-8") as visit_url_encode
      from url_info
    ) t
    
    visit_url    visit_url_decode
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
    https://blog.csdn.net/qq_28680977/article/details/108298276?k1=v1&k2=v2#Ref1    https://blog.csdn.net/qq_28680977/article/details/108298276?k1=v1&k2=v2#Ref1
    https://blog.csdn.net/qq_28680977/article/details/108295053?k1=v1&k2=v2#Ref1    https://blog.csdn.net/qq_28680977/article/details/108295053?k1=v1&k2=v2#Ref1
    https://blog.csdn.net/qq_28680977/article/details/108460523?k1=v1&k2=v2#Ref1    https://blog.csdn.net/qq_28680977/article/details/108460523?k1=v1&k2=v2#Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1
    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1    https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1

    11. Complex data type definitions

    1) map
      map<string,string>

    2) array
      array<string>

    3) struct
      struct<id:int,name:string,age:int>

    4) struct nested in array
      array<struct<id:int,name:string,age:int>>
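    A sketch of how these types look in a table definition (the table and column names here are illustrative, not from the source):

    create table user_stats(
      user_id string,
      ext     map<string,string>,                            -- read with ext['some_key']
      tags    array<string>,                                 -- read with tags[0]
      details array<struct<id:int,name:string,age:int>>      -- unpacked with lateral view explode, as below
    );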

     lateral view explode over an array:

    user_id  order_detail_stats
    419         [{"sku_id":"10","sku_num":4,"order_count":2,"order_amount":894}]
    329         [{"sku_id":"5","sku_num":3,"order_count":1,"order_amount":658}]
    122         [{"sku_id":"4","sku_num":2,"order_count":1,"order_amount":2902}]
    278         [{"sku_id":"16","sku_num":2,"order_count":1,"order_amount":451}]
     select
            user_id,
            order_stats_struct.sku_id sku_id,
            order_stats_struct.order_count order_count
        from dws_user_action_daycount lateral view explode(order_detail_stats) tmp as order_stats_struct
    user_id    sku_id    order_count
    419        10        2
    329        5         1
    122        4         1
    278        16        1

    12. Math functions

    round(DOUBLE a): returns a rounded to a BIGINT

    round(DOUBLE a, INT d): returns a rounded to d decimal places, as a DOUBLE

    floor(DOUBLE a): rounds down

    hex(BIGINT a) / hex(STRING a) / hex(BINARY a)

    Returns a as a hexadecimal STRING; for a STRING argument, each character is converted to its hex representation

    This function is rarely needed; it once proved useful for inspecting dirty data containing emoji characters
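    For instance ('h' is 0x68 and 'i' is 0x69 in ASCII):

    select hex('hi');
      6869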

    round() rounds a number to the given precision:
    select round(3.146,2);
    select round(3.145,2);
    select round(3.144,2);
    sqrt() returns the square root:
    SELECT sqrt(4);
      2.0
    SELECT sqrt(5);
      2.23606797749979

    Type conversion functions

    Convert values to a type that is easier to query or aggregate, e.g. cast(), in the form cast(value as TYPE), which converts the given value to TYPE:

    SELECT CAST("5" AS INT);
      5
    

     String processing functions

    instr:
    Returns the position of the substring substr within the string str; returns 0 if not found, and NULL if either argument is NULL.
    Positions are 1-based. Often used as a fuzzy-match filter:
    select visit_time,instr(visit_time,'10') from url_info;
    visit_time            _c1
    2020-09-10 08:19:22    9
    2020-08-02 08:10:22    15
    2020-08-02 10:10:22    12
    2020-08-12 12:23:22    0
    -- using size over a distinct collect_set
    select 
      distinct size(collect_set(user_id) over(partition by year(visit_date))) as user_cnt
    from url_info
    +-----------+--+
    | user_cnt  |
    +-----------+--+
    | 4         |
    +-----------+--+
    1 row selected (0.268 seconds)
    -- using distinct, then count
    select 
      count(1)
    from (
      select 
        distinct user_id
      from url_info 
    ) t
    +-----------+--+
    | count(1)  |
    +-----------+--+
    | 4         |
    +-----------+--+
    1 row selected (0.661 seconds)
    -- with only 19 records the time difference is already noticeable; this comes down to the extra shuffle
    regexp_replace:
    regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)
    Replaces the parts of the string matching the Java regular expression PATTERN with REPLACEMENT;
    if REPLACEMENT is empty, the matching parts are removed
    -- strip everything after the ? in the url
      select 
        distinct regexp_replace(visit_url,'\\?(.*)','') as visit_url
      from url_info
    visit_url
    https://blog.csdn.net/qq_28680977/article/details/108161655
    https://blog.csdn.net/qq_28680977/article/details/108295053
    https://blog.csdn.net/qq_28680977/article/details/108298276
    https://blog.csdn.net/qq_28680977/article/details/108460523
    regexp_extract:
    regexp_extract(string subject, string pattern, int index)
    Extracts the substring of subject matched by the index-th group of the regular expression pattern; mind the predefined character classes.
    Comparable to xpath in Python crawlers: it pulls out a specific piece of content
    -- extract the csdn article number
    select 
        distinct regexp_extract(visit_url,'/details/([0-9]+)',1) as visit_url
      from url_info 
    visit_url
    108161655
    108295053
    108298276
    108460523
    substring_index:
    substring_index(string A, string delim, int count)
    Returns the substring before the count-th occurrence of the delimiter; a positive count cuts from the left, a negative count from the right
    -- e.g. take the first 2 users of the 2020 user set; the sql below combines several of the functions covered above
    select 
      user_set,
      substring_index(user_set, ',', 2) as user_id
    from (
      select 
        distinct concat_ws(',', collect_set(user_id) over(partition by year(visit_date))) as user_set
      from url_info 
    ) t

    Hive's Grouping Sets

    When using this advanced grouping-aggregation syntax, check whether Hive's vectorized execution is enabled:

            set hive.vectorized.execution.enabled = true;

    GROUPING SETS: aggregates over different dimension combinations; equivalent to a UNION ALL over the GROUP BY result sets of each combination

    GROUPING__ID: a virtual column indicating which grouping set a result row belongs to

    CUBE: aggregates over all combinations of the GROUP BY dimensions

    ROLLUP: a subset of CUBE; aggregates hierarchically, anchored on the leftmost dimension

    In short: within a single GROUP BY query, aggregating over different dimension combinations is equivalent to UNION ALL over the per-combination GROUP BY results.

    GROUPING SETS is a convenient way to write several GROUP BY UNIONs in one Hive SQL statement. For each grouping set, the columns not participating in that GROUP BY are set to NULL, so in the aggregated output the non-grouped columns show as NULL

    Usage:

    Suppose a table has columns a, b, num, and we need aggregates over three combinations: (a,b), (a), (b). The conventional version:

    SELECT a, b, sum(num) AS total_num FROM table_name GROUP BY a, b
    UNION ALL
    SELECT a, null, sum(num) AS total_num FROM table_name GROUP BY a
    UNION ALL
    SELECT null, b, sum(num) AS total_num FROM table_name GROUP BY b

    Rewritten with GROUPING SETS:

    SELECT 
          a, b,
          sum(num) AS total_num,
          grouping__id
      FROM table_name
     GROUP BY a, b
     GROUPING SETS ((a,b), (a), (b))

    The code is much more concise, fewer jobs are generated, and the computation is more efficient (UNION ALL scans the table multiple times). But what is that grouping__id column for?

    grouping__id indicates which grouping set a result row belongs to, expressed in binary. The catch is that HiveQL and SparkSQL generate the binary value with different rules. The differences:

    1. Hive's grouping__id numbering rule

    select a, b, c, grouping__id from test group by a, b, c
    grouping sets ((a,b), (a,c), (a,b,c))
    Rule: following the group by column order from the low bit to the high bit, a column present in the grouping set is 1 and an absent one is 0.
    So (a,b) maps to 011
    (a,c) maps to 101