题目和数据均来自于网络,本篇博客用来记录自己写的答案和解析
这里写目录标题求单月访问次数和总访问次数学生课程成绩每一年出现过的最大气温的日期+温度求学生选课情况求月销售额和总销售额
求单月访问次数和总访问次数用户名,访问日期,访问次数 A,2015-01,5 A,2015-01,15 B,2015-01,5 A,2015-01,8 B,2015-01,25 A,2015-01,5 A,2015-02,4 A,2015-02,6 B,2015-02,10 B,2015-02,5 A,2015-03,16 A,2015-03,22 B,2015-03,23 B,2015-03,10 B,2015-03,1
create external table test1(name string,month string,count int) row format delimited fields terminated by ','; load data local inpath '/usr/local/data/test1.txt' into table test1;
首先针对name和month进行去重,相同name和month的count进行求和
select name, month, sum(count) from test1 group by name,month;
然后求出,每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数
select name, month, max(sumCount) over (partition by name order by month) as maxMonth, sum(sumCount) over (partition by name order by month) as sumMonth, sumCount from (select name, month, sum(count) as sumCount from test1 group by name,month)t order by name,month;
这里要注意,max,sum等聚合函数,加上order by后就是累计计算,不加order by就是在分区内全局计算。比如这里的sum(),如果后面是over (partition by name),没有order by,那么结果是按照name相同的一组sumCount进行全局求和
举一下例子
select name, month, max(sumCount) over (partition by name ) as maxMonth, sum(sumCount) over (partition by name ) as sumMonth, sumCount from (select name, month, sum(count) as sumCount from test1 group by name,month)t order by name,month;
结果为
A 2015-01 38 81 33 A 2015-02 38 81 10 A 2015-03 38 81 38 B 2015-01 34 79 30 B 2015-02 34 79 15 B 2015-03 34 79 34
这是直接按照name分组后,对一组的count进行求和
正确的sql语句为
select name, month, max(sumCount) over (partition by name order by month) as maxMonth, sum(sumCount) over (partition by name order by month) as sumMonth, sumCount from (select name, month, sum(count) as sumCount from test1 group by name,month)t order by name,month;
结果为
A 2015-01 33 33 33 A 2015-02 33 43 10 A 2015-03 38 81 38 B 2015-01 30 30 30 B 2015-02 30 45 15 B 2015-03 34 79 34学生课程成绩
id,学号sid,课程名course,成绩score 1 1 yuwen 43 2 1 shuxue 55 3 2 yuwen 77 4 2 shuxue 88 5 3 yuwen 98 6 3 shuxue 65
求 所有数学课程成绩 大于 语文课程成绩的学生的学号
select a.sid from (select sid, course, score from course where course='yuwen' ) a left join (select sid, course, score from course where course='shuxue') b on a.sid = b.sid where a.score < b.score;
结果为
1 2每一年出现过的最大气温的日期+温度
2014010216 2014010410 2012010609 2012010812 2012011023 2001010212 2001010411 2013010619 2013010812 2013011023 2008010216 2008010414 2007010619 2007010812 2007011023 2010010216 2010010410 2015010649 2015010812 2015011023
数据含义: 数据为 2014010216,表示在2014年01月02日的气温为16度
建表语句和加载数据
hive> create external table test3(tem string) > row format delimited fields terminated by ','; hive> load data local inpath '/usr/local/test3.txt' into table test3;
首先按照年,日期,温度进行切分,切分成三列数据
select substr(tem,1,4) as year, substr(tem,1,8) as d, substr(tem,-2) as temp from test3;
结果为
2014 20140102 16 2014 20140104 10 2012 20120106 09 2012 20120108 12 2012 20120110 23 2001 20010102 12 2001 20010104 11 2013 20130106 19 2013 20130108 12 2013 20130110 23 2008 20080102 16 2008 20080104 14 2007 20070106 19 2007 20070108 12 2007 20070110 23 2010 20100102 16 2010 20100104 10 2015 20150106 49 2015 20150108 12 2015 20150110 23
然后求取每一年的温度最大值,这里不要用group by,用开窗函数,否则取不到日期字段
select d, temp, max(temp) over (partition by year) as maxTem from ( select substr(tem,1,4) as year, substr(tem,1,8) as d, substr(tem,-2) as temp from test3 ) t
结果为
20010102 12 12 20010104 11 12 20070106 19 23 20070110 23 23 20070108 12 23 20080104 14 16 20080102 16 16 20100104 10 16 20100102 16 16 20120106 09 23 20120110 23 23 20120108 12 23 20130110 23 23 20130108 12 23 20130106 19 23 20140102 16 16 20140104 10 16 20150108 12 49 20150110 23 49 20150106 49 49
然后判断每一天的温度,是否等于当年的最高温度,如果相等则输出
select d,maxTem from (select d, temp, max(temp) over (partition by year) as maxTem from ( select substr(tem,1,4) as year, substr(tem,1,8) as d, substr(tem,-2) as temp from test3 ) t )t2 where temp = maxTem;
结果为
20010102 12 20070110 23 20080102 16 20100102 16 20120110 23 20130110 23 20140102 16 20150106 49求学生选课情况
id,course 1,a 1,b 1,c 1,e 2,a 2,c 2,d 2,f 3,a 3,b 3,c 3,e
首先将去重后的课程列表添加在每一行数据的后面
create table stu_course as select t1.id as id, t1.course as cs, t2.course courses from ( select id as id, collect_set(course) as course from test4 group by id ) t1 join (select sort_array(collect_set(course)) as course from test4) t2;
其中,t2 表的结果为 [“a”,“b”,“c”,“e”,“d”,“f”]
没有on语句,即没有连接字段的 join语句,作用是将第二张表的结果添加在第一张表每一行数据的后面,这里就是将 [“a”,“b”,“c”,“e”,“d”,“f”] 添加在 t1表 每一行数据的后面
collect_set(course) 的作用是将course的全部列值转变成一个没有重复元素的数组
但是collect_set转换的数组是无序的,所以需要用 sort_array 进行排序
结果为
1 ["a","b","c","e"] ["a","b","c","d","e","f"] 2 ["a","c","d","f"] ["a","b","c","d","e","f"] 3 ["a","b","c","e"] ["a","b","c","d","e","f"]
array_contains可以用于判断一张表中同一个id的多条记录中的同一字段是否包含指定的一个或多个值。需要注意字段类型保持一致,若不一致则需要进行强制类型转换
这里就是判断同一个id的cs字段,是否包含course(0),course(1)等值
case when 会返回多个列,每一个case when对应一个列
select id, case when array_contains(cs,courses[0]) then 1 else 0 end as a, case when array_contains(cs,courses[1]) then 1 else 0 end as b, case when array_contains(cs,courses[2]) then 1 else 0 end as c, case when array_contains(cs,courses[3]) then 1 else 0 end as d, case when array_contains(cs,courses[4]) then 1 else 0 end as e, case when array_contains(cs,courses[5]) then 1 else 0 end as f from stu_course;
结果为
最左边的一列为id,后面的六列对应六门课程,a、b、c、d、e、f
1 1 1 1 0 1 0 2 1 0 1 1 0 1 3 1 1 1 0 1 0求月销售额和总销售额
a,01,150 a,01,200 b,01,1000 b,01,800 c,01,250 c,01,220 b,01,6000 a,02,2000 a,02,3000 b,02,1000 b,02,1500 c,02,350 c,02,280 a,03,350 a,03,250
create table test5(store string,month int,money int) row format delimited fields terminated by ',';
需求是求出每个店铺的当月销售额和累计到当月的总销售额
select store, month, m1, m2 from (select store, month, sum(money) over (partition by store,month) as m1, sum(money) over (partition by store order by month) as m2, row_number() over (partition by store,month) as rank from test5) t where rank = 1;
这里的两个sum开窗函数的用法在之前说过了,不加order by是分组内全局聚合,加上order by是累加
这里主要要注意的是,利用row_number()完成最终数据的去重
最后的结果为
欢迎分享,转载请注明来源:内存溢出
评论列表(0条)