day 47: hive window functions (II) and custom functions


Table of Contents
  • I. Window Functions
    • 1. window as
    • 2. with as
    • 3. Common functions
    • 4. Window frames
  • II. Custom Functions
    • 1. UDF: one in, one out
    • 2. UDTF: one in, many out


I know, I know
you're with me from the other side of the Earth




I. Window Functions
1. window as
select *,rank() over(partition by clazz order by score desc) as scorerank
from students;

The contents of over() are often repeated across a query, so the window definition can be pulled out on its own and referred to by name:

window o as (partition by clazz order by score desc)

The original query can then be rewritten as follows (the named window is referenced with over o, and the window clause goes after the from clause):

select *, rank() over o as scorerank
from students
window o as (partition by clazz order by score desc);
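
Several window functions can share one named window, which is where the shorthand pays off. A minimal sketch, again assuming the students table:

select *,
       rank()       over o as scorerank,
       dense_rank() over o as denserank
from students
window o as (partition by clazz order by score desc);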

2. with as

A subquery can likewise be pulled out and referred to by name. Take this nested query, which finds each user's longest streak of consecutive login days:

select id,max(continuity)
from
    (select id,count(origin) as continuity
    from
        (select id,day,row1,date_sub(day,row1) as origin
        from
            (select 
            id,day,row_number() over(partition by id order by day) as row1
            from login) as t1
        ) as t2
    group by id,origin
    ) as t3
group by id;

Peeling it apart layer by layer:

with t1 as 
(select id,day,row_number() over(partition by id order by day) as row1
from login)
with t2 as
(select id,day,row1,date_sub(day,row1) as origin from t1)
with t3 as
(select id,count(origin) as continuity from t2 group by id,origin)

The final simplified form
(multiple with...as clauses are separated by commas, and the with keyword is not repeated):

with t1 as
(select id,day,row_number() over(partition by id order by day) as row1 from login),
t2 as
(select id,day,row1,date_sub(day,row1) as origin from t1),
t3 as
(select id,count(origin) as continuity from t2 group by id,origin)

select id,max(continuity) from t3 group by id;
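
Why grouping by origin works: within a streak of consecutive days, day and row1 advance in lockstep, so their difference is constant; any gap shifts the difference. A worked trace on hypothetical login rows:

-- id  day         row1  origin = date_sub(day, row1)
-- 1   2022-01-01  1     2021-12-31
-- 1   2022-01-02  2     2021-12-31   <- same origin: same streak
-- 1   2022-01-04  3     2022-01-01   <- gap in days: new streak
--
-- group by id, origin  => count(origin) is each streak's length
-- group by id          => max(continuity) picks the longest streak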

3. Common functions

row_number:
no tied ranks; every row gets a distinct, consecutive number
1 2 3

dense_rank:
tied rows share a rank, and the next rank follows without a gap
1 1 2 2 3 3

rank:
tied rows share a rank, and a gap is left after each tie
1 1 3 3 5 5
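
A minimal side-by-side sketch, assuming the students table from earlier; for three scores 90, 90, 80 the columns would read 1 2 3, 1 1 2, and 1 1 3:

select name, score,
       row_number() over (order by score desc) as rn,
       dense_rank() over (order by score desc) as dr,
       rank()       over (order by score desc) as rk
from students;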

4. Window frames

Window functions can read a specific range of rows at a time; the range is defined by a window frame. The frame slides over the data row by row, and its bounds decide which rows are read.

Hive provides two forms for defining a window frame: ROWS and RANGE.

Frames can only be used with the window functions max, min, avg, count, sum, FIRST_VALUE, and LAST_VALUE.

1) the previous row, the current row, and the next row

over (... rows between 1 preceding and 1 following)

2) from the first row of the partition to the current row

over (... rows between unbounded preceding and current row)

3) a value range: [current value minus 3, current value plus 11]
   e.g. if the current value is 22, the rows read are those whose value falls in [19, 33]

over (... range between 3 preceding and 11 following)

(The frame specification goes inside over(), after the order by; the ... stands for the partition by / order by part.)
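
A minimal sketch of both frame styles in one query, again assuming the students table:

select name, score,
       -- centered three-row window: previous, current, and next row
       avg(score) over (partition by clazz order by score
                        rows between 1 preceding and 1 following) as moving_avg,
       -- running total from the start of the partition to the current row
       sum(score) over (partition by clazz order by score
                        rows between unbounded preceding and current row) as running_sum
from students;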

II. Custom Functions

Hive lets users write and use their own functions.
They fall roughly into three categories: UDF, UDTF, and UDAF.

First, add the dependency:

        
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>1.2.1</version>
        </dependency>

1. UDF: one in, one out

Write a class that extends org.apache.hadoop.hive.ql.exec.UDF
and implement an evaluate method containing your own logic.

One value goes in at a time and one value comes out via Java's return,
hence "one in, one out".

import org.apache.hadoop.hive.ql.exec.UDF;

public class HiveUDF extends UDF {
    // hadoop => #hadoop$
    public String evaluate(String col1) {
        // wrap the incoming value: # on the left, $ on the right
        String result = "#" + col1 + "$";
        return result;
    }
}

Package the project into a jar, upload it to the Linux machine, then add it to Hive:

add jar /usr/local/data/hive-1.0-SNAPSHOT.jar;

Register a function name, naming the class to use (here the HiveUDF class above, assuming a day47.UDF package):

create temporary function funudf as 'day47.UDF.HiveUDF';

Try it out:

select funudf(name) from students;
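
With the class above, each name comes back wrapped; hypothetical output for names zhangsan and lisi:

	#zhangsan$
	#lisi$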

2. UDTF: one in, many out

Output rows are emitted through the forward method rather than a return value,
so a single input row can produce many output rows.

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.ArrayList;

public class Test extends GenericUDTF {

    // declare the output column names and their types
    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs)
            throws UDFArgumentException {
        ArrayList<String> fieldNames = new ArrayList<>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<>();

        fieldNames.add("col1");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        fieldNames.add("col2");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    // processing logic, e.g. my_udtf("key1:value1,key2:value2,key3:value3")
    @Override
    public void process(Object[] objects) throws HiveException {
        // objects holds the N arguments passed in
        String col = objects[0].toString();
        // split into key1:value1 / key2:value2 / key3:value3
        String[] splits = col.split(",");
        for (String str : splits) {
            String[] cols = str.split(":");
            // emit one output row (col1 = key, col2 = value)
            forward(cols);
        }
    }

    @Override
    public void close() throws HiveException {
    }
}

The jar must be added and a function name registered in the same way.
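
A hedged sketch of the registration, assuming the Test class above sits in a hypothetical day47.UDTF package:

add jar /usr/local/data/hive-1.0-SNAPSHOT.jar;
create temporary function funudtf as 'day47.UDTF.Test';

Try it out: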

select funudtf('key1:value1,key2:value2,key3:value3');

Output:

	key1	value1
	key2	value2
	key3	value3
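
Called bare like this, a UDTF cannot be mixed with other select columns; for that, Hive pairs it with lateral view. A minimal sketch, assuming a hypothetical table pairs with a string column kv holding such key:value lists:

select p.id, t.col1, t.col2
from pairs p
lateral view funudtf(p.kv) t as col1, col2;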
