Hive: optimizing inserts into dynamic partition tables or HDFS dynamic partition directories with DISTRIBUTE BY partition sorting


Inserting data into dynamic partitions can be a resource hog: the map tasks may generate a very large number of partitions in a short time (more than the number of distinct values of the partition columns), consuming excessive resources. To protect yourself, Hive defines three parameters (a configuration sketch follows this list):
  • hive.exec.max.dynamic.partitions.pernode (default value 100) is the maximum number of dynamic partitions that can be created by each mapper or reducer. If one mapper or reducer creates more than that threshold, a fatal error is raised from the mapper/reducer (through a counter) and the whole job is killed.
  • hive.exec.max.dynamic.partitions (default value 1000) is the total number of dynamic partitions that can be created by one DML statement. If no single mapper/reducer exceeds its limit but the total number of dynamic partitions does, an exception is raised at the end of the job, before the intermediate data are moved to the final destination.
  • hive.exec.max.created.files (default value 100000) is the maximum total number of files created by all mappers and reducers. This is implemented by having each mapper/reducer update a Hadoop counter whenever a new file is created. If the total exceeds hive.exec.max.created.files, a fatal error is thrown and the job is killed.
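
A minimal configuration sketch for raising these safeguards before a heavy dynamic-partition insert; the values below are illustrative assumptions, not recommendations:

    -- sketch only: enable dynamic partitioning and raise the three limits above their defaults
    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    set hive.exec.max.dynamic.partitions.pernode=200;
    set hive.exec.max.dynamic.partitions=2000;
    set hive.exec.max.created.files=150000;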

Troubleshooting and best practices:

  • As stated above, when too many dynamic partitions are created by a particular mapper/reducer, a fatal error is raised and the job is killed. The error message looks something like this:
    beeline> set hive.exec.dynamic.partition.mode=nonstrict;
    beeline> FROM page_view_stg pvs
          INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
                 SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
                        from_unixtime(pvs.viewTime, 'yyyy-MM-dd') dt, pvs.country;
...
2010-05-07 11:10:19,816 Stage-1 map = 0%,  reduce = 0%
[Fatal Error] Operator FS_28 (id=41): fatal error. Killing the job.
Ended Job = job_201005052204_28178 with errors
...
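
Before looking at the fix, a rough pre-check (an added sketch, not part of the original tutorial) can estimate how many distinct (dt, country) pairs the staging data contains and how close they come to the limits above:

    -- sketch: count the distinct dynamic partitions this load would create
    SELECT count(DISTINCT from_unixtime(pvs.viewTime, 'yyyy-MM-dd'), pvs.country) AS distinct_partitions
    FROM page_view_stg pvs;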

The problem with this is that one mapper will take a random set of rows, so it is very likely that the number of distinct (dt, country) pairs it sees will exceed the limit of hive.exec.max.dynamic.partitions.pernode. One way around it is to group the rows by the dynamic partition columns in the mapper and distribute them to the reducers where the dynamic partitions will be created. In this case the number of distinct dynamic partitions handled by each reducer is significantly reduced. The above example query could be rewritten to:

beeline> set hive.exec.dynamic.partition.mode=nonstrict;
beeline> FROM page_view_stg pvs
      INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
             SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
                    from_unixtime(pvs.viewTime, 'yyyy-MM-dd') dt, pvs.country
             DISTRIBUTE BY dt, country;

This query generates a MapReduce job rather than a map-only job. The SELECT clause is converted into a plan for the mappers, and their output is distributed to the reducers based on the value of the (dt, country) pairs. The INSERT clause becomes part of the plan in the reducers, which write into the dynamic partitions.
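
One way to verify this (a sketch, assuming the MapReduce execution engine and the same tables as above) is to run EXPLAIN on the rewritten statement and check that the plan now contains a reduce stage keyed on dt and country:

beeline> EXPLAIN
      FROM page_view_stg pvs
      INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
             SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
                    from_unixtime(pvs.viewTime, 'yyyy-MM-dd') dt, pvs.country
             DISTRIBUTE BY dt, country;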

In my actual work the situation is not this complex and DISTRIBUTE BY is not needed for this optimization, because a scheduled job runs every day to process the previous day's data for a single province, with date and province as the two partition columns. Each run only ever writes to one partition, so the mappers and reducers cannot create too many partitions.
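
For illustration only (all table and column names here are hypothetical), such a daily job can use a static-partition insert, which writes exactly one partition per run and never approaches the dynamic-partition limits:

    -- sketch with hypothetical names: each run writes a single (dt, province) partition
    INSERT OVERWRITE TABLE dws_page_view PARTITION (dt='2022-12-16', province='guangdong')
    SELECT view_time, user_id, page_url, referrer_url, ip
    FROM stg_page_view
    WHERE dt='2022-12-16' AND province='guangdong';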

References

https://cwiki.apache.org/confluence/display/Hive/Tutorial
