hive实现用户连续登陆的最大天数_框架

假设说我们现在有这样一张表

连续登陆，也就是在连续登陆的期间内，后一天和前一天的差值为1，不能为大于1的值，直到间断。那么在这里其实我们可以设置一列序号，如果是连续的话，这列序号也是会随着日期同步增长的，那么日期减去这个序号，应该都是一个确定的日期。比如说9月16号对应的序号是1，9月17号对应的序号是2，9月18号对应的序号是3，那么9月16号-1=9月15号，同理，9月17号-2=9月15号，都是同样的日期。那么我们根据这个日期和user_id和日期标示进行分组即可。

我们可以看到用ROW_NUMBER()函数已经成功加上了一列。

我们用log_date这一列减去row_number这一列得到一个时间标志

在这里做的事情是把之前的rn从string的格式改为了int的格式，这样才能够传入到date_sub函数里面做减法。这里要注意，做subquery的时候要制定子表格的名字，比如我在这里就指定了子表格名字为c。

接下来我们要根据user_id和log_date进行分组然后统计。

现在我们已经得到了每个日期的分组的数目统计，这里的天数实际上是开始连续的那个日期-1（因为不连续的都不是同一个标志日，不会分到一组）但是还有两个问题，一个是要求出每个id的最大值，另外就是要将天数加1回到一开始连续开始的日期。

以上解决了找到最大值，但是最大值的日期没有返回。

由于嵌套写得太多了，接下来还是新建一张表格，用连接的方式找出最大日期

如果原始数据表有连续的日期，应该考虑用dense_number()，同样的 *** 作后，用select distinct的方法来选出user_id, date以及symbol_date都不一样的一行。

关于这个问题可以参考链接：

hivesql sql — 获取指定hive表或指定文件所hive表DDL按区则默认执行近7区DDL同table支持符合sql语则表达式表匹配则提示用户选择(使用file则自关闭该交互功能)

在过去几年中，主要受到围绕Stinger计划的Hive社区创新的推动，Hive查询时间得到了显着改善，使Hive能够以速度和规模支持批量和交互式工作负载。

但是，许多使用者仍然不熟悉以最快速度运行Hive查询的基本技术和最佳实践。本文中，将重点介绍一些常使用的简单技术，以提高HIVE查询的性能。

Hive可以使用Apache Tez执行引擎而不是Map-reduce引擎。不会详细介绍这里提到的使用Tez的许多好处; 相反，提出一个简单的建议：如果在您的环境中默认情况下没有打开它，请在Hive查询的开头使用Tez设置为“true”

Hive支持ORCfile，这是一种新的表存储格式，通过谓词下推，压缩等技术实现极佳的速度提升。

对每个HIVE表使用ORCFile应该是一个明智的选择，对于获得HIVE查询的快速响应时间非常有益。

作为一个例子，考虑两个大表A和B（存储为文本文件，这里没有指定一些列），以及一个简单的查询 :

此查询可能需要很长时间才能执行，因为表A和B都存储为TEXT。将这些表转换为ORCFile格式通常会显着缩短查询时间：

ORC支持压缩存储（使用ZLIB或如上所示使用SNAPPY），但也支持未压缩存储。

将基表转换为ORC通常是取决于所在团队获取数据的职责，由于其他优先级，可能需要一些时间来更改完整的获取数据过程。ORCFile的好处是如此明显，以至于推荐如上所示的自助式方法 - 将A转换为A_ORC，将B转换为B_ORC并以此方式进行连接，以便立即从更快的查询中受益，而不依赖于其他团队。

矢量化查询执行通过一次批量执行1024行而不是每行一行来提高扫描，聚合，过滤器和连接等 *** 作的性能。

这个功能在Hive 013中引入，显着缩短了查询执行时间，并且可以通过两个参数设置轻松启用：

在提交最终执行之前，Hive会优化每个查询的逻辑和物理执行计划。这些优化不是基于查询的成本 - 也就是说，直到运行时。

最近添加到Hive，基于成本的优化，基于查询成本执行进一步优化，从而导致可能不同的决策：如何订购联接，执行哪种类型的联接，并行度等。

要使用基于成本的优化（也称为CBO），请在查询开头设置以下参数

然后，通过运行Hive的“analyze”命令为CBO准备数据，以收集我们想要使用CBO的表的各种统计信息。

例如，在tweet数据表中，希望收集有关该表的统计信息以及大约2列：“sender”和“topic”：

使用HIVE 014（在HDP 22上），analyze命令的工作速度要快得多，而且您不需要指定每一列，因此只需如下：

现在使用此表执行查询应该会导致不同的执行计划由于成本计算和Hive创建的不同执行计划而更快。

SQL是一种强大的声明性语言。与其他声明性语言一样，编写SQL语句的方法不止一种。尽管每个语句的功能都相同，但它可能具有截然不同的性能特征

每条记录代表一次点击事件，希望找到每个sessionID的最新网址。

有人使用如下方式：

在上面的查询中，构建一个子查询来收集每个会话中最新事件的时间戳，然后使用内部联接来过滤掉其余的事件。

虽然查询是一个合理的解决方案 - 从功能的角度来看 - 事实证明，有一种更好的方法来重写这个查询，如下所示

在这里，使用Hive的OLAP功能（OVER和RANK）来实现相同的功能，但没有使用表连接。

显然，删除不必要的连接几乎总能带来更好的性能，而且当使用大数据时，这比以往任何时候都更重要。在很多情况下查询不是最优的 - 所以仔细查看每个查询并考虑重写是否可以使它更好更快。

更多内容信息 >

笔记：

1窗口函数边界：

复购率: 当前月份购买2次及以上的客户占所有客户比例

回购率：当前月份购买且上个月份也购买的客户占当月所有月份客户比例

思路：

复购率

1、对当月（2月份）的客户分组，计数购买次数

2、筛选购买次数为2以上的，认为是复购群体

回购率

1、筛选当月及上月部分

2、利用客户id进行当月连上月，推荐左连

3、对同一条客户id均有购买记录的，认为是回购群体

注意：

1连接条件中，条件1为客户key字段，表示同一个客户；条件2为两个月的信息连接（本月日期 = 上月日期 +1个月）

2ADD_MONTHS：别忘了有个's'，add_months只能对完整日期格式进行月份加减(add_months（‘yyyy-MM-dd hh:mm:ss’，N），add_months（‘yyyy-MM-dd’，N）)，所以这里先用concat给'年月'增加一个'日'字段，其中N为正则增加，N为负则减少

第一步：根据用户分组，求出用户购买的产品及顺序

cpzl_zw1是cpzl_zw的上一个购买的产品

第三步：为相邻的产品添加排序（即上面lag函数添加的产品）

若相邻产品的排序为3，则表示第三件产品，题目要求前两件产品，则可以根据小于3进行筛选

步骤：

第一步：将两表进行连接（customer为主表）

第二步：按城市、产品进行聚合分组，求出销量

第三步：窗口函数对销量进行排序

第四步：where进行筛选（where不能直接对窗口函数结果进行筛选，所以要再次使用表子连接）

提示：只用到订单表 ods_sales_orders，并列排序 dense_rank 窗口函数

日期变换：

(1)dt转日期

to_date(from_unixtime(unix_timestamp('${dt}','yyyyMMdd')))

(2)日期转dt

regexp_replace('${date}','-','')

(3)dt转当月1号日期

to_date(from_unixtime(unix_timestamp(concat(substr('${dt}',1,6),'01'),'yyyyMMdd')))

trunc(to_date(from_unixtime(unix_timestamp('${dt}','yyyyMMdd'))),'MM')

-- 下月1号日期

trunc(add_months(to_date(from_unixtime(unix_timestamp('${dt}','yyyyMMdd'))),1),'MM')

(4)dt转当周星期一日期

next_day(date_add(to_date(from_unixtime(unix_timestamp('${dt}','yyyyMMdd'))), -7), 'Mo')

date_sub(next_day(to_date(from_unixtime(unix_timestamp('${dt}','yyyyMMdd'))),'MO'),7)

-- 下周星期一日期

next_day(to_date(from_unixtime(unix_timestamp('${dt}','yyyyMMdd'))),'MO')

(5)dt前六天日期(dt为星期天时得到的是本周周一的日期)

date_add(to_date(from_unixtime(unix_timestamp('${dt}','yyyyMMdd'))), -6)

(5)dt转当季第一天日期

if(length(floor(substr('${dt}',5,2)/31)3+1)=1,concat(substr('${dt}',1,4),'-0',floor(substr('${dt}',5,2)/31)3+1,'-01'),concat(substr('${dt}',1,4),'-',floor(substr('${dt}',5,2)/31)3+1,'-01'))

(6)dt转半年第一天日期

if(length(floor(substr('${dt}',5,2)/61)6+1)=1,concat(substr('${dt}',1,4),'-0',floor(substr('${dt}',5,2)/61)6+1,'-01'),concat(substr('${dt}',1,4),'-',floor(substr('${dt}',5,2)/61)6+1,'-01'))

(7)dt转当年1号日期

concat(substr('${dt}',1,4),'-01-01')(8)在同时有日周月粒度时要注意数据的时间范围，有时每月的第一个自然周会跨月，比如2019年3月的第一周的日期是20190225-20190303where agent_business_date between date_add_day('${dt}',-31) and to_date(from_unixtime(unix_timestamp('${dt}','yyyyMMdd')))where dt between regexp_replace(date_add_day('${dt}',-31),'-','') and '${dt}'

------------------------------------------------------------------------------------------

-- 日期维度表表结构edw_publicdim_esf_edw_pub_date

------------------------------------------------------------------------------------------

col_name data_type comment

------------------------------------------------------------------------

calendar_date string 日期，格式为"YYYY-MM-DD"

week_english_name string 星期英文名

week_chinese_name string 星期中文名

day_of_week_number int 所属一周当中的第几天

calendar_month_code string 日期所属月份，格式为"YYYY-MM"

calendar_month_number int 所属月份数字

month_english_name string 月份英文名

month_chinese_name string 月份中文名

day_of_month_number int 所属月份当中的第几天

calendar_quater_code string 日期所属季度，格式为"YYYY-QT"

calendar_quater_number int 所属季度数字

day_of_quater_number int 所属季度当中的第几天

calendar_half_year_code string 日期所属半年，格式为"YYYY-HY"

calendar_half_year_number int 所属半年数字，1为上半年，2为下半年

calendar_year_code string 日期所属年份，格式为"YYYY"

day_of_year_number int 所属年份当中的第几天

work_day_flag string 工作日标志: Y - 是/ N - 否

holiday_flag string 节假日标志: Y - 是/ N - 否

-- 日期维度表的使用

-- 当天日期

SELECT

calendar_date

FROM

edw_publicdim_esf_edw_pub_date

WHERE

calendar_date = regexp_replace('${dt}','(\\d{4})(\\d{2})(\\d{2})','$1-$2-$3')

-- Finereport中日周月季半年年各周期末日期的算法

select

${if(粒度 == 1," case when date(max(calendar_date))>=date(date_add('day',-1,current_date)) then date(date_add('day',-1,current_date)) else date(max(calendar_date)) end as period_end_date","")}

${if(粒度 == 2," distinct case when day_of_week_number = 1 and date_add('day',6,date(calendar_date)) >=date(date_add('day',-1,current_date)) then date(date_add('day',-1,current_date)) when day_of_week_number = 7 and date(calendar_date) >=date(date_add('day',-1,current_date)) then date(date_add('day',-1,current_date)) when day_of_week_number = 1 then date_add('day',6,date(calendar_date)) when day_of_week_number = 7 then date(calendar_date) else date(calendar_date) end as period_end_date ","")}

${if(粒度 == 3," case when date(max(calendar_date))>=date(date_add('day',-1,current_date)) then date(date_add('day',-1,current_date)) else date(max(calendar_date)) end as period_end_date ","")}

${if(粒度 == 4," case when date(max(calendar_date))>=date(date_add('day',-1,current_date)) then date(date_add('day',-1,current_date)) else date(max(calendar_date)) end as period_end_date ","")}

${if(粒度 == 5," case when date(max(calendar_date))>=date(date_add('day',-1,current_date)) then date(date_add('day',-1,current_date)) else date(max(calendar_date)) end as period_end_date ","")}

${if(粒度 == 6," case when date(max(calendar_date))>=date(date_add('day',-1,current_date)) then date(date_add('day',-1,current_date)) else date(max(calendar_date)) end as period_end_date ","")}

from

edw_publicdim_esf_edw_pub_date

where calendar_date >= '${开始时间}' and calendar_date <= '${结束时间}'

${if(粒度 == 1," group by calendar_date ","")}

${if(粒度 == 2," and day_of_week_number in (1,7) ","")}

${if(粒度 == 3," group by calendar_month_code ","")}

${if(粒度 == 4," group by calendar_quater_code ","")}

${if(粒度 == 5," group by calendar_year_code ","")}

${if(粒度 == 6," group by calendar_half_year_code ","")}

-- Finereport中日周月季半年年各周期期初期末日期的算法(这种计算方法当前日期是20190330，输入的日期范围是2019-03-01至2091-03-28则输出的月日期范围是2019-03-29)

select

${if(粒度 == 1,"date(calendar_date) as period_start_date, date(calendar_date) as period_end_date ","")}

${if(粒度 == 2,"case when day_of_week_number = 1 then date(calendar_date) when day_of_week_number = 7 then date_add('day',-6, date(calendar_date)) end as period_start_date, case when day_of_week_number = 1 and date_add('day',6, date(calendar_date)) >=date(date_add('day',-1,current_date)) then date(date_add('day',-1,current_date)) when day_of_week_number = 7 and date(calendar_date)>=date(date_add('day',-1,current_date)) then date(date_add('day',-1,current_date)) when day_of_week_number = 1 then date_add('day',6, date(calendar_date)) when day_of_week_number = 7 then date(calendar_date) end as period_end_date ","")}

${if(粒度 == 3,"date(calendar_date) as period_start_date, case when date_add('day',-day(date(calendar_date)),date_add('month',1,(date(calendar_date))))>=date(date_add('day',-1,current_date)) then date(date_add('day',-1,current_date)) else date_add('day',-day(date(calendar_date)),date_add('month',1,(date(calendar_date)))) end as period_end_date ","")}

${if(粒度 == 4,"calendar_date as period_start_date,date_add('day',-1,date_add('month',1,date(substr(calendar_date,1,4)||'-'||cast(cast(floor(cast(substr(calendar_date,6,2) as int)/31)3+3 as int) as varchar)||'-01'))) as period_end_date ","")}

${if(粒度 == 5,"date(concat(substr(calendar_date,1,4),'-01','-01')) as period_start_date,case when date(concat(substr(calendar_date,1,4),'-12','-31'))>= date(date_add('day',-1,current_date)) then date(date_add('day',-1,current_date)) else date(concat(substr(calendar_date,1,4),'-12','-31')) end as period_end_date","")}

${if(粒度 == 6,"date(min(calendar_date)) as period_start_date,case when date(max(calendar_date))>= date(date_add('day',-1,current_date)) then date(date_add('day',-1,current_date)) else date(max(calendar_date)) end as period_end_date","")}

from

edw_publicdim_esf_edw_pub_date

where calendar_date >= '${开始时间}' and calendar_date <= '${结束时间}'

${if(粒度 == 1," and 1 = 1 ","")}

${if(粒度 == 2," and day_of_week_number in (1,7) ","")}

${if(粒度 == 3," and day_of_month_number = 1","")}

${if(粒度 == 4," and day_of_quater_number = 1","")}

${if(粒度 == 5," and day_of_year_number = 1","")}

${if(粒度 == 6," group by calendar_half_year_code ","")}

------------------------------------------------------------------------------------------------

-- 根据输入的时间范围计算期末日期

------------------------------------------------------------------------------------------------

select t1

from

-- 日周月季年半年不同粒度的统计数据各存为了一张表

edw_reportsadm_xf_edw_house_sub_project_report_00${dtype}ly_di t1--日报

join

(

-- 日

SELECT

calendar_date

FROM

edw_publicdim_esf_edw_pub_date

WHERE

calendar_date BETWEEN '${bdt}' AND '${edt}'

AND '${dtype}' = '1_dai'

UNION

-- 月

SELECT

MAX(calendar_date) AS calendar_date

FROM

edw_publicdim_esf_edw_pub_date

WHERE

calendar_date BETWEEN '${bdt}' AND '${edt}'

AND '${dtype}' = '2_dai'

GROUP BY

calendar_month_number

UNION

-- 周

SELECT

calendar_date

FROM

edw_publicdim_esf_edw_pub_date

WHERE

calendar_date BETWEEN '${bdt}' AND '${edt}'

AND day_of_week_number = 7

AND '${dtype}' = '3_dai'

UNION

-- 季

SELECT

MAX(calendar_date) AS calendar_date

FROM

edw_publicdim_esf_edw_pub_date

WHERE

calendar_date BETWEEN '${bdt}' AND '${edt}'

AND '${dtype}' = '4_dai'

GROUP BY

calendar_quater_code

UNION

-- 年

SELECT

MAX(calendar_date) AS calendar_date

FROM

edw_publicdim_esf_edw_pub_date

WHERE

calendar_date BETWEEN '${bdt}' AND '${edt}'

AND '${dtype}' = '5_dai'

GROUP BY

calendar_year_code

UNION

-- 半年

SELECT

MAX(calendar_date) AS calendar_date

FROM

edw_publicdim_esf_edw_pub_date

WHERE

calendar_date BETWEEN '${bdt}' AND '${edt}'

AND '${dtype}' = '6_dai'

GROUP BY

calendar_half_year_code

UNION

SELECT

MAX(calendar_date) AS calendar_date

FROM

edw_publicdim_esf_edw_pub_date

WHERE

calendar_date BETWEEN '${bdt}' AND '${edt}'

ORDER BY

calendar_date

) t2

on t1statistic_date = t2calendar_date

where

statistic_date between '${bdt}' and '${edt}'

${if(len(tenant_name) == 0,"","and house_sub_project_organization_short_name = '" + tenant_name + "'")}

${if(len(status) == 0,"","and house_sub_project_cooperation_status_code = " + status)}

${if(len(tenant_type) == 0,"","and house_sub_project_organization_business_type_code= " + tenant_type)}

${if(len(project_type) == 0,"","and house_sub_project_cooperation_type_code= " + project_type)}

order by statistic_date

在hive中，我们经常需要进行日期的计算，可是，经常会出现这样一种情况，hive仓库中日期的存储格式是yyyyMMdd,例如20200520，在进行日期的计算时，hive支持的函数例如，add_months 仅识别yyyy-MM-dd格式的函数，如果输入yyyyMMdd的参数，返回结果是null。所以我们在对日期的字段进行计算时，优先需要进行格式的转换，将yyyyMMdd转换成yyyy-MM-dd格式

转换方式如下：亲测可用

DATE(from_unixtime(unix_timestamp('20171205','yyyymmdd'),'yyyy-mm-dd'))

以上就是关于hive实现用户连续登陆的最大天数全部的内容，包括:hive实现用户连续登陆的最大天数、用HIVESQL怎么获取上一个月的月份、5种让Hive查询变快的方法等相关内容解答，如果想了解更多相关内容，可以关注我们，你们的支持是我们更新的动力！

欢迎分享，转载请注明来源：内存溢出

原文地址: http://outofmemory.cn/web/9339897.html

hive实现用户连续登陆的最大天数

发表评论

评论列表（0条）