- I. Window functions
- 1. window as
- 2. with as
- 3. Common functions
- 4. Window frames
- II. User-defined functions
- 1. UDF: one row in, one row out
- 2. UDTF: one row in, multiple rows out
I. Window functions
1. window as
select *,rank() over(partition by clazz order by score desc) as scorerank from students;
The contents of over() are often repeated across a query, so they can be pulled out and referred to by a name:
window o as (partition by clazz order by score desc)
The original statement can then be shortened (the window clause goes at the end of the query):
select *,rank() over o as scorerank from students window o as (partition by clazz order by score desc);
2. with as
A subquery can likewise be referred to by a name.
select id,max(continuity) from (select id,count(origin) as continuity from (select id,day,row1,date_sub(day,row1) as origin from (select id,day,row_number() over(partition by id order by day) as row1 from login) as t1 ) as t2 group by id,origin )as t3 group by id;
Peeling the nesting apart layer by layer:
with t1 as (select id,day,row_number() over(partition by id order by day) as row1 from login)
with t2 as (select id,day,row1,date_sub(day,row1) as origin from t1)
with t3 as (select id,count(origin) as continuity from t2 group by id,origin)
The final simplified form
(multiple with-as clauses are separated by commas, and the later with keywords are omitted):
with t1 as (select id,day,row_number() over(partition by id order by day) as row1 from login), t2 as (select id,day,row1,date_sub(day,row1) as origin from t1), t3 as (select id,count(origin) as continuity from t2 group by id,origin) select id,max(continuity) from t3 group by id;
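The date_sub(day, row1) trick works because subtracting a row's rank from its date yields the same anchor value for every day in a consecutive run. A minimal Java sketch of the same idea, using integer day numbers instead of dates (names and data are illustrative, not from the original post):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ConsecutiveDays {
    // Longest run of consecutive values in a sorted array of login days,
    // mirroring the date_sub(day, row_number) grouping trick.
    static int maxConsecutive(int[] days) {
        Map<Integer, Integer> runs = new HashMap<>();
        for (int i = 0; i < days.length; i++) {
            int origin = days[i] - (i + 1);       // day - row_number
            runs.merge(origin, 1, Integer::sum);  // count rows per run
        }
        return Collections.max(runs.values());
    }

    public static void main(String[] args) {
        // days 1,2,3 form one run; 5,6 form another
        System.out.println(maxConsecutive(new int[]{1, 2, 3, 5, 6})); // 3
    }
}
```

Every day in a consecutive run gets the same origin, so counting rows per origin gives the run lengths, just like the group by id,origin in t3.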
3. Common functions
row_number:
no tied ranks
1 2 3
dense_rank:
tied ranks, and the rank increases by one after each group of ties
1 1 2 2 3 3
rank:
tied ranks, and the rank skips ahead after ties
1 1 3 3 5 5
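The three ranking functions differ only in how ties advance the counter. A small Java sketch over an already-sorted score list makes the difference concrete (illustrative code, not Hive internals):

```java
import java.util.Arrays;

public class Ranks {
    // scores must already be sorted in ranking order (e.g. descending)
    static int[] rank(int[] scores, boolean dense, boolean rowNumber) {
        int[] out = new int[scores.length];
        for (int i = 0; i < scores.length; i++) {
            if (i == 0) out[i] = 1;
            else if (rowNumber) out[i] = i + 1;                       // ties ignored
            else if (scores[i] == scores[i - 1]) out[i] = out[i - 1]; // tie keeps rank
            else out[i] = dense ? out[i - 1] + 1 : i + 1;             // dense vs. skipping
        }
        return out;
    }

    public static void main(String[] args) {
        int[] s = {90, 90, 80, 80, 70, 70};
        System.out.println(Arrays.toString(rank(s, false, true)));  // [1, 2, 3, 4, 5, 6]
        System.out.println(Arrays.toString(rank(s, true, false)));  // [1, 1, 2, 2, 3, 3]
        System.out.println(Arrays.toString(rank(s, false, false))); // [1, 1, 3, 3, 5, 5]
    }
}
```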
4. Window frames
Window functions can read a specific range of rows at a time; this is implemented through window frames.
The frame slides over the data row by row, selecting the range of rows to read according to its bounds.
Hive provides two ways of defining a window frame: ROWS and RANGE.
Frames can only be used with these window functions: max, min, avg, count, sum, FIRST_VALUE, LAST_VALUE.
1. The previous row, the current row and the next row:
over (rows between 1 preceding and 1 following)
2. From the first row to the current row:
over (rows between unbounded preceding and current row)
3. A value range [current value - 3, current value + 11]:
e.g. if the current value is 22, rows whose value lies in [19, 33] are included
over (range between 3 preceding and 11 following)
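A ROWS frame can be pictured as a sliding index range clamped to the partition boundaries. A Java sketch of sum over rows between N preceding and M following (illustrative, not Hive code):

```java
import java.util.Arrays;

public class Frames {
    // sum(x) over (rows between `preceding` preceding and `following` following)
    static int[] movingSum(int[] x, int preceding, int following) {
        int[] out = new int[x.length];
        for (int i = 0; i < x.length; i++) {
            int lo = Math.max(0, i - preceding);            // clamp at first row
            int hi = Math.min(x.length - 1, i + following); // clamp at last row
            for (int j = lo; j <= hi; j++) out[i] += x[j];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] x = {1, 2, 3, 4};
        // rows between 1 preceding and 1 following
        System.out.println(Arrays.toString(movingSum(x, 1, 1))); // [3, 6, 9, 7]
    }
}
```

Note how the first and last rows have smaller frames, because the window is clipped at the edges rather than padded.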
II. User-defined functions
Hive allows users to define and use their own functions.
They fall roughly into three categories: UDF, UDTF and UDAF.
First, import the dependency:
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
</dependency>
1. UDF: one row in, one row out
Write a class that extends org.apache.hadoop.hive.ql.exec.UDF
and implement an evaluate method containing your own logic.
Only one object enters at a time, and the result comes back through Java's return,
hence "one in, one out".
import org.apache.hadoop.hive.ql.exec.UDF;

public class HiveUDF extends UDF {
    // hadoop => #hadoop$
    public String evaluate(String col1) {
        // add # to the left and $ to the right of the incoming value
        String result = "#" + col1 + "$";
        return result;
    }
}
Package the jar in Java, upload it to the Linux machine, then load it into Hive:
add jar /usr/local/data/hive-1.0-SNAPSHOT.jar;
Register a function name, pointing it at the fully qualified name of the packaged class:
create temporary function funudf as 'day47.UDF.Test';
Try it out:
select funudf(name) from students;
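Before packaging the jar, the evaluate logic can be exercised locally as plain Java; this quick check is my own addition, not part of the original post:

```java
public class UdfCheck {
    // same logic as HiveUDF.evaluate: wrap the input in # ... $
    static String evaluate(String col1) {
        return "#" + col1 + "$";
    }

    public static void main(String[] args) {
        System.out.println(evaluate("hadoop")); // #hadoop$
    }
}
```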
2. UDTF: one row in, multiple rows out
A UDTF emits output rows itself through its own method (forward),
so one input row can produce multiple output rows.
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.ArrayList;

public class Test extends GenericUDTF {

    // declare the output column names and types
    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        ArrayList<String> filedNames = new ArrayList<>();
        ArrayList<ObjectInspector> filedObj = new ArrayList<>();
        filedNames.add("col1");
        filedObj.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        filedNames.add("col2");
        filedObj.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(filedNames, filedObj);
    }

    // processing logic: my_udtf("key1:value1,key2:value2,key3:value3")
    public void process(Object[] objects) throws HiveException {
        // objects holds the N input columns
        String col = objects[0].toString();
        // key1:value1 key2:value2 key3:value3
        String[] splits = col.split(",");
        for (String str : splits) {
            String[] cols = str.split(":");
            // emit one output row
            forward(cols);
        }
    }

    public void close() throws HiveException {
    }
}
Load the jar and register the function name the same way as before, then try it out:
select funudtf('key1:value1,key2:value2,key3:value3');
Output:
key1	value1
key2	value2
key3	value3
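The row expansion inside process() is just a split on "," followed by a split on ":". The same logic can be checked as standalone Java before wiring it into the UDTF:

```java
import java.util.ArrayList;
import java.util.List;

public class UdtfSplit {
    // emulate the UDTF: each key:value pair becomes one two-column row
    static List<String[]> explode(String col) {
        List<String[]> rows = new ArrayList<>();
        for (String pair : col.split(",")) {
            rows.add(pair.split(":"));
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : explode("key1:value1,key2:value2,key3:value3")) {
            System.out.println(row[0] + "\t" + row[1]);
        }
        // key1	value1
        // key2	value2
        // key3	value3
    }
}
```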