Hive练习 —— 月访问次数，学生课程成绩，年最大气温，学生选课情况，月销售额_随笔

Hive练习 —— 月访问次数，学生课程成绩，年最大气温，学生选课情况，月销售额

题目和数据均来自于网络，本篇博客用来记录自己写的答案和解析

这里写目录标题

求单月访问次数和总访问次数学生课程成绩每一年出现过的最大气温的日期+温度求学生选课情况求月销售额和总销售额

求单月访问次数和总访问次数

用户名，访问日期，访问次数
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
A,2015-03,16
A,2015-03,22
B,2015-03,23
B,2015-03,10
B,2015-03,1

create external table test1(name string,month string,count int)
row format delimited fields terminated by ',';

load data local inpath '/usr/local/data/test1.txt' into table test1;

首先针对name和month进行去重，相同name和month的count进行求和

select name,
       month,
       sum(count)
from test1
group by name,month;

然后求出，每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数

select name,
       month,
       max(sumCount) over (partition by name order by month) as maxMonth,
       sum(sumCount) over (partition by name order by month) as sumMonth,
       sumCount
from (select name,
       month,
       sum(count) as sumCount
      from test1
      group by name,month)t
order by name,month;

这里要注意，max，sum等聚合函数，加上order by后就是累计计算，不加order by就是在分区内全局计算。比如这里的sum()，如果后面是over (partition by name)，没有order by，那么结果是按照name相同的一组sumCount进行全局求和

举一下例子

select name,
       month,
       max(sumCount) over (partition by name ) as maxMonth,
       sum(sumCount) over (partition by name ) as sumMonth,
       sumCount
from (select name,
       month,
       sum(count) as sumCount
      from test1
      group by name,month)t
order by name,month;

结果为

A	2015-01		38	81	33
A	2015-02		38	81	10
A	2015-03		38	81	38
B	2015-01		34	79	30
B	2015-02		34	79	15
B	2015-03		34	79	34

这是直接按照name分组后，对一组的count进行求和

正确的sql语句为

select name,
       month,
       max(sumCount) over (partition by name order by month) as maxMonth,
       sum(sumCount) over (partition by name order by month) as sumMonth,
       sumCount
from (select name,
       month,
       sum(count) as sumCount
      from test1
      group by name,month)t
order by name,month;

结果为

A	2015-01		33	33	33
A	2015-02		33	43	10
A	2015-03		38	81	38
B	2015-01		30	30	30
B	2015-02		30	45	15
B	2015-03		34	79	34

学生课程成绩

id，学号sid，课程名course，成绩score
1	1	yuwen	43
2	1	shuxue	55
3	2	yuwen	77
4	2	shuxue	88
5	3	yuwen	98
6	3	shuxue	65

求所有数学课程成绩大于语文课程成绩的学生的学号

select a.sid
from (select sid,
       course,
       score
      from course
      where course='yuwen'
      ) a
left join
(select sid,
        course,
        score
from course
where course='shuxue') b
on a.sid = b.sid
where a.score < b.score;

结果为

1
2

每一年出现过的最大气温的日期+温度

数据含义：数据为 2014010216，表示在2014年01月02日的气温为16度

建表语句和加载数据

hive> create external table test3(tem string)
    > row format delimited fields terminated by ',';
    
hive> load data local inpath '/usr/local/test3.txt' into table test3;

首先按照年，日期，温度进行切分，切分成三列数据

select substr(tem,1,4) as year,
       substr(tem,1,8) as d,
       substr(tem,-2) as temp
from test3;

结果为

2014	20140102	16
2014	20140104	10
2012	20120106	09
2012	20120108	12
2012	20120110	23
2001	20010102	12
2001	20010104	11
2013	20130106	19
2013	20130108	12
2013	20130110	23
2008	20080102	16
2008	20080104	14
2007	20070106	19
2007	20070108	12
2007	20070110	23
2010	20100102	16
2010	20100104	10
2015	20150106	49
2015	20150108	12
2015	20150110	23

然后求取每一年的温度最大值，这里不要用group by，用开窗函数，否则取不到日期字段

select d,
       temp,
       max(temp) over (partition by year) as maxTem
from (
       select substr(tem,1,4) as year,
              substr(tem,1,8) as d,
              substr(tem,-2) as temp
       from test3
       ) t

结果为

20010102	12	12
20010104	11	12
20070106	19	23
20070110	23	23
20070108	12	23
20080104	14	16
20080102	16	16
20100104	10	16
20100102	16	16
20120106	09	23
20120110	23	23
20120108	12	23
20130110	23	23
20130108	12	23
20130106	19	23
20140102	16	16
20140104	10	16
20150108	12	49
20150110	23	49
20150106	49	49

然后判断每一天的温度，是否等于当年的最高温度，如果相等则输出

select d,maxTem
from (select d,
             temp,
             max(temp) over (partition by year) as maxTem
       from (
            select substr(tem,1,4) as year,
                   substr(tem,1,8) as d,
                   substr(tem,-2) as temp
            from test3
            ) t
       )t2
where temp = maxTem;

结果为

20010102	12
20070110	23
20080102	16
20100102	16
20120110	23
20130110	23
20140102	16
20150106	49

求学生选课情况

id,course
1,a
1,b
1,c
1,e
2,a
2,c
2,d
2,f
3,a
3,b
3,c
3,e

首先将去重后的课程列表添加在每一行数据的后面

create table stu_course as 
select t1.id as id,
       t1.course as cs,
       t2.course courses 
from (
    select id as id,
            collect_set(course) as course 
    from test4 group by id ) t1 
join 
(select sort_array(collect_set(course)) as course from test4) t2;

其中，t2 表的结果为 [“a”,“b”,“c”,“e”,“d”,“f”]

没有on语句，即没有连接字段的 join语句，作用是将第二张表的结果添加在第一张表每一行数据的后面，这里就是将 [“a”,“b”,“c”,“e”,“d”,“f”] 添加在 t1表每一行数据的后面

collect_set(course) 的作用是将course的全部列值转变成一个没有重复元素的数组
但是collect_set转换的数组是无序的，所以需要用 sort_array 进行排序

结果为

1	["a","b","c","e"]	["a","b","c","d","e","f"]
2	["a","c","d","f"]	["a","b","c","d","e","f"]
3	["a","b","c","e"]	["a","b","c","d","e","f"]

array_contains可以用于判断一张表中同一个id的多条记录中的同一字段是否包含指定的一个或多个值。需要注意字段类型保持一致，若不一致则需要进行强制类型转换

这里就是判断同一个id的cs字段，是否包含course(0)，course(1)等值

case when 会返回多个列，每一个case when对应一个列

select id,
case when array_contains(cs,courses[0]) then 1 else 0 end as a,
case when array_contains(cs,courses[1]) then 1 else 0 end as b,
case when array_contains(cs,courses[2]) then 1 else 0 end as c,
case when array_contains(cs,courses[3]) then 1 else 0 end as d,
case when array_contains(cs,courses[4]) then 1 else 0 end as e,
case when array_contains(cs,courses[5]) then 1 else 0 end as f
from stu_course;

结果为
最左边的一列为id，后面的六列对应六门课程，a、b、c、d、e、f

1	1	1	1	0	1	0
2	1	0	1	1	0	1
3	1	1	1	0	1	0

求月销售额和总销售额

a,01,150
a,01,200
b,01,1000
b,01,800
c,01,250
c,01,220
b,01,6000
a,02,2000
a,02,3000
b,02,1000
b,02,1500
c,02,350
c,02,280
a,03,350
a,03,250

create table test5(store string,month int,money int)
row format delimited fields terminated by ',';

需求是求出每个店铺的当月销售额和累计到当月的总销售额

select store,
       month,
       m1,
       m2
from (select store,
             month,
             sum(money) over (partition by store,month) as m1,
             sum(money) over (partition by store order by month) as m2,
             row_number() over (partition by store,month) as rank
       from test5) t
where rank = 1;

这里的两个sum开窗函数的用法在之前说过了，不加order by是分组内全局聚合，加上order by是累加

这里主要要注意的是，利用row_number()完成最终数据的去重

最后的结果为

欢迎分享，转载请注明来源：内存溢出

原文地址: http://outofmemory.cn/zaji/5705286.html

Hive练习 —— 月访问次数，学生课程成绩，年最大气温，学生选课情况，月销售额

发表评论

评论列表（0条）