day 47: hive window functions (II) and custom functions


Table of Contents
  • I. Window Functions
    • 1. window as
    • 2. with as
    • 3. Common functions
    • 4. Window frames
  • II. Custom Functions
    • 1. UDF: one in, one out
    • 2. UDTF: one in, many out


I know, I know
you're with me from the other side of the Earth




I. Window Functions
1. window as
select *,rank() over(partition by clazz order by score desc) as scorerank
from students;

The contents of over() are often repeated across a query, so the window definition can be pulled out on its own and referred to by name:

window o as (partition by clazz order by score desc)

The original query can then be rewritten as follows (the named window is referenced with over o, and the window clause goes after the from clause):

select *, rank() over o as scorerank
from students
window o as (partition by clazz order by score desc);
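
Several window functions can share one named window, which is where the shorthand pays off. A minimal sketch, again assuming the students table:

select *,
       rank()       over o as scorerank,
       dense_rank() over o as denserank
from students
window o as (partition by clazz order by score desc);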

2. with as

A subquery can likewise be pulled out and referred to by name. Take this nested query, which finds each user's longest streak of consecutive login days:

select id,max(continuity)
from
    (select id,count(origin) as continuity
    from
        (select id,day,row1,date_sub(day,row1) as origin
        from
            (select 
            id,day,row_number() over(partition by id order by day) as row1
            from login) as t1
        ) as t2
    group by id,origin
    ) as t3
group by id;

Peeling it apart layer by layer:

with t1 as 
(select id,day,row_number() over(partition by id order by day) as row1
from login)
with t2 as
(select id,day,row1,date_sub(day,row1) as origin from t1)
with t3 as
(select id,count(origin) as continuity from t2 group by id,origin)

The final simplified form
(multiple with...as clauses are separated by commas, and the with keyword is not repeated):

with t1 as
(select id,day,row_number() over(partition by id order by day) as row1 from login),
t2 as
(select id,day,row1,date_sub(day,row1) as origin from t1),
t3 as
(select id,count(origin) as continuity from t2 group by id,origin)

select id,max(continuity) from t3 group by id;
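
Why grouping by origin works: within a streak of consecutive days, day and row1 advance in lockstep, so their difference is constant; any gap shifts the difference. A worked trace on hypothetical login rows:

-- id  day         row1  origin = date_sub(day, row1)
-- 1   2022-01-01  1     2021-12-31
-- 1   2022-01-02  2     2021-12-31   <- same origin: same streak
-- 1   2022-01-04  3     2022-01-01   <- gap in days: new streak
--
-- group by id, origin  => count(origin) is each streak's length
-- group by id          => max(continuity) picks the longest streak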

3. Common functions

row_number:
no tied ranks; every row gets a distinct, consecutive number
1 2 3

dense_rank:
tied rows share a rank, and the next rank follows without a gap
1 1 2 2 3 3

rank:
tied rows share a rank, and a gap is left after each tie
1 1 3 3 5 5
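
A minimal side-by-side sketch, assuming the students table from earlier; for three scores 90, 90, 80 the columns would read 1 2 3, 1 1 2, and 1 1 3:

select name, score,
       row_number() over (order by score desc) as rn,
       dense_rank() over (order by score desc) as dr,
       rank()       over (order by score desc) as rk
from students;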

4. Window frames

Window functions can read a specific range of rows at a time; the range is defined by a window frame. The frame slides over the data row by row, and its bounds decide which rows are read.

Hive provides two forms for defining a window frame: ROWS and RANGE.

Frames can only be used with the window functions max, min, avg, count, sum, FIRST_VALUE, and LAST_VALUE.

1) the previous row, the current row, and the next row

over (... rows between 1 preceding and 1 following)

2) from the first row of the partition to the current row

over (... rows between unbounded preceding and current row)

3) a value range: [current value minus 3, current value plus 11]
   e.g. if the current value is 22, the rows read are those whose value falls in [19, 33]

over (... range between 3 preceding and 11 following)

(The frame specification goes inside over(), after the order by; the ... stands for the partition by / order by part.)
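
A minimal sketch of both frame styles in one query, again assuming the students table:

select name, score,
       -- centered three-row window: previous, current, and next row
       avg(score) over (partition by clazz order by score
                        rows between 1 preceding and 1 following) as moving_avg,
       -- running total from the start of the partition to the current row
       sum(score) over (partition by clazz order by score
                        rows between unbounded preceding and current row) as running_sum
from students;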

II. Custom Functions

Hive lets users write and use their own functions.
They fall roughly into three categories: UDF, UDTF, and UDAF.

First, add the dependency:

        
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>1.2.1</version>
        </dependency>

1. UDF: one in, one out

Write a class that extends org.apache.hadoop.hive.ql.exec.UDF
and implement an evaluate method containing your own logic.

One value goes in at a time and one value comes out via Java's return,
hence "one in, one out".

import org.apache.hadoop.hive.ql.exec.UDF;

public class HiveUDF extends UDF {
    // hadoop => #hadoop$
    public String evaluate(String col1) {
        // wrap the incoming value: # on the left, $ on the right
        String result = "#" + col1 + "$";
        return result;
    }
}

Package the project into a jar, upload it to the Linux machine, then add it to Hive:

add jar /usr/local/data/hive-1.0-SNAPSHOT.jar;

Register a function name, naming the class to use (here the HiveUDF class above, assuming a day47.UDF package):

create temporary function funudf as 'day47.UDF.HiveUDF';

Try it out:

select funudf(name) from students;
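
With the class above, each name comes back wrapped; hypothetical output for names zhangsan and lisi:

	#zhangsan$
	#lisi$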

2. UDTF: one in, many out

Output rows are emitted through the forward method rather than a return value,
so a single input row can produce many output rows.

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.ArrayList;

public class Test extends GenericUDTF {

    // declare the output column names and their types
    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs)
            throws UDFArgumentException {
        ArrayList<String> fieldNames = new ArrayList<>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<>();

        fieldNames.add("col1");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        fieldNames.add("col2");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    // processing logic, e.g. my_udtf("key1:value1,key2:value2,key3:value3")
    @Override
    public void process(Object[] objects) throws HiveException {
        // objects holds the N arguments passed in
        String col = objects[0].toString();
        // split into key1:value1 / key2:value2 / key3:value3
        String[] splits = col.split(",");
        for (String str : splits) {
            String[] cols = str.split(":");
            // emit one output row (col1 = key, col2 = value)
            forward(cols);
        }
    }

    @Override
    public void close() throws HiveException {
    }
}

The jar must be added and a function name registered in the same way.
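
A hedged sketch of the registration, assuming the Test class above sits in a hypothetical day47.UDTF package:

add jar /usr/local/data/hive-1.0-SNAPSHOT.jar;
create temporary function funudtf as 'day47.UDTF.Test';

Try it out: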

select funudtf('key1:value1,key2:value2,key3:value3');

Output:

	key1	value1
	key2	value2
	key3	value3
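
Called bare like this, a UDTF cannot be mixed with other select columns; for that, Hive pairs it with lateral view. A minimal sketch, assuming a hypothetical table pairs with a string column kv holding such key:value lists:

select p.id, t.col1, t.col2
from pairs p
lateral view funudtf(p.kv) t as col1, col2;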
