hive sql 中怎么用变量_教程

第一步先准山亩备测试表和测试数据用于后续测试：

hive>create database test

Time taken: 2.606 seconds

然后执行让胡建表坦唯拦和导入数据的sql文件：

[czt@www.crazyant.net testHivePara]$ hive -f student.sql

Hive history file=/tmp/crazyant.net/hive_job_log_czt_201309131615_1720869864.txt

Time taken: 2.131 seconds

Time taken: 0.878 seconds

Copying data from file:/home/users/czt/testdata_student

Copying file: file:/home/users/czt/testdata_student

Loading data to table test.student

hive函数：

1、根据指定条件返回结果：case when then else end as

2、基本类型转换：CAST（）

3、nvl：处理空字段：三个str时，是否为空可以指定返回不同的值

4、sql通配符： https://www.w3school.com.cn/sql/sql_wildcards.asp

5、count(1)与COUNT(*)：返回行数

如果表没有主键，那么count（1）比count（*）快；

如果有主键，那么count（主键，联合主键）比count（*）快；

count(1)跟count(主键)一样，只扫描主键。count(*)跟count(非主键)一样，扫描整个表。明显前者更快一些。

性能问题：

1.任何情况下SELECT COUNT(*) FROM tablename是最优选择,(指没有where的情况）；

2.尽量减少SELECT COUNT(*) FROM tablename WHERE COL = ‘value’ 这种查询；

3.杜绝SELECT COUNT(COL) FROM tablename WHERE COL2 = ‘value’ 的出现。

count(expression)：查询 is_reply=0 的数量： SELECT COUNT(IF(is_reply=0,1,NULL)) count FROM t_iov_help_feedback

6、distinct与group by

distinct去重所有distinct之后所有的字段，如果有一个字段值不一致就不作为一条

group by是根据某一字段分组，然后查询出该条数据的所需字段，可以搭配 where max(time)或者Row_Number函数使用，求出最大的一条数据

7、使用with 临时表名 as() 的形式,简单的临时表直接嵌套进枝段sql中，复杂的和需要复用的表写到临猛行誉时表中，关联的时候先找到关联字段，过滤条件最好在临时表中先过滤后关联

处理json的函数：

split(json_array_string(schools), '\\|\\|') AS schools

get_json_object(school, '$.id') AS school_id,

字符串函数：

1、instr（’源字符串’ , ‘目标字符串’ ,’开始位置’,’第几次出现’）

instr(sourceString,destString,start,appearPosition)

1.sourceString代表源字符串； destString代表要从源字符串中查找的子串；

2.start代表查找的开始位置，这个参数可选的，默认为1；

3.appearPosition代表想从源字符中查找出第几次出现的destString，这个参数也是可选的，默认为1

4.如果start的值为负数，则代表从右往左进行查找，但是位置数据仍然从左向右计带枯算。

5.返回值为：查找到的字符串的位置。如果没有查找到，返回0。

最简单例子：在abcd中查找a的位置，从第一个字母开始查，查找第一次出现时的位置

select instr(‘abcd’,’a’,1,1) from dual—1

应用于模糊查询：instr(字段名/列名, ‘查找字段’)

select code,name,dept,occupation from staff where instr(code, ‘001’)>0

等同于 select code, name, dept, occupation from staff where code like ‘%001%’

应用于判断包含关系：

select ccn,mas_loc from mas_loc where instr(‘FH,FHH,FHM’,ccn)>0

等同于 select ccn,mas_loc from mas_loc where ccn in (‘FH’,’FHH’,’FHM’)

2、substr（string A，int start，int len）和 substring（string A，int start，int len），用法一样

substr(time,1,8) 表示将time从第1位开始截取，截取的长度为8位

第一种用法：

substr（string A，int start）和 substring（string A，int start），用法一样

功效：返回字符串A从下标start位置到结尾的字符串

第二种用法：

substr（string A，int start，int len）和 substring（string A，int start，int len），用法一样

功效：返回字符串A从下标start位置开始，长度为len的字符串

3、get_json_object(form_data,'$.学生姓名') as student_name

json_tuple 函数的作用：用来解析json字符串中的多个字段

4、split(full_name, '\\.') [5] AS zq 取的是数组里的第六个

日期(时间)函数：

1、to_date(event_time) 返回日期部分

2、date_sub：返回当前日期的相对时间

当前日期：select curdate()

当前日期前一天：select date_sub(curdate(),interval 1 day)

当前日期后一天：select date_sub(curdate(),interval -1 day)

date_sub(from_unixtime(unix_timestamp(), 'yyyy-MM-dd HH:mm:ss'), 14) 将现在的时间总秒数转为标准格式时间，返回14天之前的时间

时间戳>>>>日期：

from_unixtime(unix_timestamp(), 'yyyy-MM-dd HH:mm:ss') 将现在的时间总秒数转为标准格式时间

from_unixtime(get_json_object(get_json_object(form_data,'$.挽单时间'),'$.$date')/1000) as retain_time

unix_timestamp('2019-08-15 16:40:00','yyyy-MM-dd HH:mm:ss') --1565858400

日期>>>>时间戳：unix_timestamp()

date_format：yyyy-MM-dd HH:mm:ss 时间转格式化时间

select date_format('2019-10-07 13:24:20', 'yyyyMMdd000000')-- 20191007000000select date_format('2019-10-07', 'yyyyMMdd000000')-- 20191007000000

1.日期比较函数: datediff语法: datediff(string enddate,string startdate)

返回值: int

说明: 返回结束日期减去开始日期的天数。

举例： hive>select datediff('2016-12-30','2016-12-29') 1

2.日期增加函数: date_add语法: date_add(string startdate, intdays)

返回值: string

说明: 返回开始日期startdate增加days天后的日期。

举例： hive>select date_add('2016-12-29',10) 2017-01-08

3.日期减少函数: date_sub语法: date_sub (string startdate,int days)

返回值: string

说明: 返回开始日期startdate减少days天后的日期。

举例： hive>select date_sub('2016-12-29',10) 2016-12-19

4.查询近30天的数据

select * from table where datediff(current_timestamp,create_time)<=30；

create_time 为table里的字段，current_timestamp 返回当前时间 2018-06-01 11:00:00

3、trunc()函数的用法：当前日期的各种第一天,或者对数字进行不四舍五入的截取

日期：

1.select trunc(sysdate) from dual --2011-3-18 今天的日期为2011-3-18

2.select trunc(sysdate, 'mm') from dual --2011-3-1 返回当月第一天.

上月1号 trunc(add_months(current_date(),-1),'MM')

3.select trunc(sysdate,'yy') from dual --2011-1-1 返回当年第一天

4.select trunc(sysdate,'dd') from dual --2011-3-18 返回当前年月日

5.select trunc(sysdate,'yyyy') from dual --2011-1-1 返回当年第一天

6.select trunc(sysdate,'d') from dual --2011-3-13 (星期天)返回当前星期的第一天

7.select trunc(sysdate, 'hh') from dual --2011-3-18 14:00:00 当前时间为14:41

8.select trunc(sysdate, 'mi') from dual --2011-3-18 14:41:00 TRUNC()函数没有秒的精确

数字：TRUNC（number,num_digits） Number 需要截尾取整的数字。Num_digits 的默认值为 0。TRUNC()函数截取时不进行四舍五入

11.select trunc(123.458,1) from dual --123.4

12.select trunc(123.458,-1) from dual --120

4、round()：四舍五入：

select round(1.455, 2) #结果是：1.46，即四舍五入到十分位，也就是保留两位小数

select round(1.5) #默认四舍五入到个位，结果是：2

select round(255, -1) #结果是：260，即四舍五入到十位，此时个位是5会进位

floor()：地板数

ceil()天花板数

5、

6.日期转年函数: year语法: year(string date)

返回值: int

说明: 返回日期中的年。

举例：

hive> select year('2011-12-08 10:03:01') from dual

2011

hive> select year('2012-12-08') fromdual

2012

7.日期转月函数: month语法: month (string date)

返回值: int

说明: 返回日期中的月份。

举例：

hive> select month('2011-12-08 10:03:01') from dual

hive> select month('2011-08-08') fromdual

8.日期转天函数: day语法: day (string date)

返回值: int

说明: 返回日期中的天。

举例：

hive> select day('2011-12-08 10:03:01') from dual

hive> select day('2011-12-24') fromdual

9.日期转小时函数: hour语法: hour (string date)

返回值: int

说明: 返回日期中的小时。

举例：

hive> select hour('2011-12-08 10:03:01') from dual

10.日期转分钟函数: minute语法: minute (string date)

返回值: int

说明: 返回日期中的分钟。

举例：

hive> select minute('2011-12-08 10:03:01') from dual

11.日期转秒函数: second语法: second (string date)

返回值: int

说明: 返回日期中的秒。

举例：

hive> select second('2011-12-08 10:03:01') from dual

12.日期转周函数: weekofyear语法: weekofyear (string date)

返回值: int

说明: 返回日期在当前的周数。

举例：

hive> select weekofyear('2011-12-08 10:03:01') from dual

查看hive表在hdfs中的位置：show create table 表名

在hive中hive2hive，hive2hdfs：

HDFS、本地、hive ----->Hive：使用 insert into | overwrite、loaddata local inpath "" into table student

Hive ---->Hdfs、本地：使用：insert overwrite | local

网站访问量统计：

uv：每用户访问次数

ip：每ip(可能很多人)访问次数

PV:是指页面的浏览次数

VV:是指你访问网站的次数

sql：

基本函数：

count、max、min、sum、avg、like、rlike（'2%'、'_2%'、%2%'、'[2]'）（java正则）

and、or、not、in

where、group by、having、{ join on 、full join} 、order by（desc降序）

sort by需要与distribut by集合结合使用：

hive (default)>set mapreduce.job.reduces=3 //先设置reduce的数量

insert overwrite local directory '/opt/module/datas/distribute-by'

row format delimited fields terminated by '\t'

先按照部门编号分区，再按照员工编号降序排序。

select * from emp distribute by deptno sort by empno desc

外部表 create external table if not exists dept

分区表：create table dept_partition ( deptno int, dname string, loc string ) partitioned by ( month string )

load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201809')

alter table dept_partition add/drop partition(month='201805') ,partition(month='201804')

多分区联合查询：union

select * from dept_partition2 where month='201809' and day='10'

show partitions dept_partition

desc formatted dept_partition

二级分区表：create table dept_partition2 ( deptno int, dname string, loc string ) partitioned by (month string, day string) row format delimited fields terminated by '\t'

分桶抽样查询：分区针对的是数据的存储路径；分桶针对的是数据文件

create table stu_buck(id int, name string) clustered by(id) into 4 bucketsrow format delimited fields terminated by '\t'

设置开启分桶与reduce为1：

set hive.enforce.bucketing=true

set mapreduce.job.reduces=-1

分桶抽样：select * from stu_bucktablesample(bucket x out of y on id)

抽取，桶数/y，x是从哪个桶开始抽取，y越大抽样数越少，y与抽样数成反比，x必须小于y

给空字段赋值：

如果员工的comm为NULL，则用-1代替或用其他字段代替：select nvl(comm,-1) from emp

case when:如何符合记为1，用于统计、分组统计

select dept_id, sum(case sex when '男' then 1 else 0 end) man , sum(case sex when '女' then 1 else 0 end) woman from emp_sex group by dept_id

用于组合归类汇总(行转列)：UDAF：多转一

concat：拼接查询结果

collect_set(col)：去重汇总，产生array类型字段，类似于distinct

select t.base, concat_ws('|',collect_set(t.name)) from (select concat_ws(',',xingzuo,blood_type) base,name from person_info) t group by t.base

解释：先第一次查询得到一张没有按照（星座血型）分组的表，然后分组，使用collect_set将名字组合成数组，然后使用concat将数组变成字符串

用于拆分数据：(列转行)：UDTF：一转多

explode(col)：将hive一列中复杂的array或者map结构拆分成多行。

lateral view 侧面显示：用于和UDTF一对多函数搭配使用

用法：lateral view udtf(expression) tablealias as cate

cate：炸开之后的列别名

temptable ：临时表表名

解释：用于和split, explode等UDTF一起使用，它能够将一列数据拆成多行数据，在此基础上可以对拆分后的数据进行聚合。

开窗函数：

Row_Number,Rank，Dense_Rank over：针对统计查询使用

Row_Number：返回从1开始的序列

Rank:生成分组中的排名序号，会在名词s中留下空位。3 3 5

dense_rank：生成分组中的排名序号，不会在名词中留下空位。3 3 4

over：主要是分组排序，搭配窗口函数使用

结果：

SUM、AVG、MIN、MAX、count

preceding：往前

following：往后

current row：当前行

unbounded：unbounded preceding 从前面的起点， unbounded following：到后面的终点

sum：直接使用sum是总的求和，结合over使用可统计至每一行的结果、总的结果、当前行+之前多少行/之后多少行、当前行到往后所有行的求和。

over(rowsbetween 3/current rowprecedingandunboundedfollowing ) 当前行到往后所有行的求和

ntile：分片，结合over使用，可以给数据分片，返回分片号

使用场景：统计出排名前百分之或n分之一的数据。

lead,lag,FIRST_VALUE,LAST_VALUE

lag与lead函数可以返回上下行的数据

lead(col,n,dafault) 用于统计窗口内往下第n行值

第一个参数为列名，第二个参数为往下第n行（可选，默认为1），第三个参数为默认值（当往下第n行为NULL时候，取默认值，如不指定，则为NULL）

LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值

第一个参数为列名，第二个参数为往上第n行（可选，默认为1），第三个参数为默认值（当往上第n行为NULL时候，取默认值，如不指定，则为NULL）

使用场景：通常用于统计某用户在某个网页上的停留时间

FIRST_VALUE:取分组内排序后，截止到当前行，第一个值

LAST_VALUE:取分组内排序后，截止到当前行,最后一个值

范围内求和: https://blog.csdn.net/happyrocking/article/details/105369558

cume_dist，percent_rank

–CUME_DIST :小于等于当前值的行数 / 分组内总行数

–比如，统计小于等于当前薪水的人数，占总人数的比例

percent_rank:分组内当前行的RANK值-1/分组内总行数-1

总结：

在Spark中使用spark sql与hql一致，也可以直接使用sparkAPI实现。

HiveSql窗口函数主要应用于求TopN，分组排序TopN、TopN求和，前多少名前百分之几。

与Flink窗口函数不同。

Flink中的窗口是用于将无线数据流切分为有限块处理的手段。

window分类：

CountWindow：按照指定的数据条数生成一个 Window，与时间无关。

TimeWindow：按照时间生成 Window。

1. 滚动窗口（Tumbling Windows）：时间对齐，窗口长度固定，不重叠：：常用于时间段内的聚合计算

2.滑动窗口（Sliding Windows）：时间对齐，窗口长度固定，可以有重叠：：适用于一段时间内的统计（某接口最近 5min 的失败率来报警）

3. 会话窗口（Session Windows）无时间对齐，无长度，不重叠：：设置session间隔，超过时间间隔则窗口关闭。

在通过终端查询hive时，终端结果显示为乱码，想必大家都遇到过这种情况。这种情况出现的原因是：hive在将晌雀数据写入hdfs时候，会把数据格式转换为utf-8格式的。如果你导入hive表的源数据不是utf-8格式的，hive在进行写hdfs转换格式的时候会出现乱码，所有你查询出来的中文也是乱码。

解决办法如下：

1、把源文件，用editplus等编辑软戚晌件打开，将文件转换为urf-8格式，保存。再重新导入高谨锋到hive表中，问题解决。

2、源文件只要是文本格式，如csv，txt，log等文本格式，均可用此种方法转化。前提是你终端也要设置为utf-8格式。

欢迎分享，转载请注明来源：内存溢出

原文地址: http://outofmemory.cn/tougao/12230693.html

hive sql 中怎么用变量

发表评论

评论列表（0条）

hive sql 中 怎么用变量

发表评论

评论列表（0条）

hive sql 中怎么用变量