Hive Notes 2

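1. The concept of a partitioned table:

2. Create-table statement:

3. Viewing partition information

4. Inserting data into a partitioned table

5. Querying data

6. Dynamic partitioning

7. Partitioning by two columns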

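Partitioned tables:

1. The concept of a partitioned table:

A partitioned table is a table whose partition columns are declared when it is created. In practice, each partition is a subdirectory under the table's directory on HDFS. When a query names the partitions it needs, Hive reads only those partitions, which avoids a full table scan and improves query efficiency.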

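2. Create-table statement:

(1) Adding partitions

First create a table partitioned by one column, then add partitions to it.

Note: no regular column may share a name with a partition column, otherwise the create statement fails.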

-- Create-table statement for a partitioned table
create table students_dt
(id bigint
,name string
,age int  
,gender string
,clazz string
) partitioned by (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Add a partition: alter table table_name add partition(partition_column='value');
alter table students_dt add partition(dt='20210101');
alter table students_dt add partition(dt='20210102');
alter table students_dt add partition(dt='20210103');
-- Drop a partition: alter table table_name drop partition(partition_column='value');
alter table students_dt drop partition(dt='20210101');
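If these statements may be run more than once, the idempotent forms can be handy. The following is a minimal sketch using the same table and partition value as above; `if not exists` and `if exists` turn a repeated add or drop into a no-op rather than an error.

```sql
-- Add the partition only when it is not already registered
alter table students_dt add if not exists partition(dt='20210101');

-- Drop the partition only when it is registered
alter table students_dt drop if exists partition(dt='20210101');
```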

After the add partition statements run, the students_dt directory on HDFS does contain three new folders: a partition splits the table's files by folder.
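The folders can also be listed without leaving the Hive CLI. This sketch assumes the table lives in the default database under the default warehouse directory `/user/hive/warehouse`; that path is an assumption and depends on `hive.metastore.warehouse.dir` in your installation.

```sql
-- List the partition folders of students_dt (path is an assumed default)
dfs -ls /user/hive/warehouse/students_dt;
```

Each partition appears as a folder named after the column and value, for example `dt=20210102`.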

3. Viewing partition information

Method 1 for viewing partition values:

-- select distinct partition_column from table_name;
hive> select distinct dt  from students_dt;

This query still runs as a MapReduce job to compute the result.

Method 2 for viewing partition values:

hive> show partitions students_dt;
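To inspect a single partition rather than the whole list, a partition spec can be added. A short sketch, assuming the partition `dt='20210102'` added earlier is still registered:

```sql
-- Show only the partitions matching the given spec
show partitions students_dt partition(dt='20210102');

-- Show metadata for one partition, including its location on HDFS
describe formatted students_dt partition(dt='20210102');
```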

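4. Inserting data into a partitioned table

Add the target partition to an ordinary load statement: partition(partition_column='value')

-- load data local inpath 'path' into table table_name partition(partition_column='value');
-- The partition is created automatically if it does not exist
load data local inpath '/usr/local/module/students_dt.txt' into table students_dt partition(dt='20211111');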

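A new folder does appear on HDFS.

The data can now be queried:

select * from students_dt;

In the result, the partition column of every row is 20211111.

In the file I loaded, the last column ranged from 20210101 to 20210110, but all data loaded into one partition necessarily shares that partition's value.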
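One way to confirm which partition the loaded rows ended up in is to count rows per partition value. A minimal sketch against the same table (`row_count` is just an illustrative alias):

```sql
-- Count rows in each partition of students_dt
select dt, count(*) as row_count
from students_dt
group by dt;
```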

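5. Querying data

The benefit of partitioning is that a query does not have to scan the whole table: filtering on the partition column first speeds the query up.

-- Query pattern: where partition_column='partition_value'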
hive> select count(*) from students_dt where dt='20211111';
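To check that the filter really limits the input to one partition, the query plan can be inspected. The sketch below assumes a Hive version that supports `explain dependency`, which reports the tables and partitions a query would read:

```sql
-- List the input tables and partitions of the query without running it
explain dependency
select count(*) from students_dt where dt='20211111';
```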

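6. Dynamic partitioning

After creating the partitioned table above, we still had to add partitions one by one and load data into each subdirectory. Isn't that a bit tedious?

Hive's dynamic partitioning feature solves this problem.

-- Dynamic partitioning may be off by default (it was before Hive 0.9.0), so enable it explicitly
hive> set hive.exec.dynamic.partition=true;

-- Dynamic partition mode: nonstrict allows every partition column to be dynamic (strict requires at least one static partition column); partitions can still be added manually
hive> set hive.exec.dynamic.partition.mode=nonstrict;

-- Maximum number of dynamic partitions that each mapper or reducer node may create
hive> set hive.exec.max.dynamic.partitions.pernode=1000;

Set all three before inserting, or at least the first two; otherwise the insert statement can fail with an error instead of running.

Note that these set statements apply only to the current session, so they must be run again in each new session.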
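Because the settings are per session, it can help to check what the current session actually has. In the Hive CLI, `set` followed by a property name without a value prints the current value. The last line is an addition not in the original notes: `hive.exec.max.dynamic.partitions` caps the total number of dynamic partitions one statement may create, and the value 1000 here is only illustrative.

```sql
-- Print the current values for this session
set hive.exec.dynamic.partition;
set hive.exec.dynamic.partition.mode;
set hive.exec.max.dynamic.partitions.pernode;

-- Optional: total limit across the whole statement (illustrative value)
set hive.exec.max.dynamic.partitions=1000;
```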

-- Source table holding the rows to insert: an ordinary, unpartitioned table that lists every column of the partitioned table, including the partition column
create table message(
id bigint
,name string
,age int
,gender string
,clazz string
,dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

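-- The same dt-partitioned table as at the start (if not exists skips creation when the table already exists)
create table if not exists students_dt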
(id bigint
,name string
,age int  
,gender string
,clazz string
) partitioned by (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

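-- Note: the source table contains every column of the target table, and it is an ordinary table with no partitions!
-- insert into table partitioned_table partition(partition_column) select * from source_table;
insert into table students_dt partition(dt) select * from message;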
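Since `select *` depends on the column order of the source table, listing the columns explicitly makes the mapping visible. A minimal sketch with the same two tables; the partition column `dt` is placed last because dynamic partition values are taken from the trailing columns of the select list:

```sql
-- Regular columns first, dynamic partition column last
insert into table students_dt partition(dt)
select id, name, age, gender, clazz, dt
from message;
```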

On HDFS, the data is indeed split into folders by dt.

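7. Partitioning by two columns

First create a partitioned table with two partition columns.

Note: partition columns are matched by position, not by name (they take the last n columns of the select result).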

create table students_pt
(id bigint
,name string
,age int  
,gender string
,clazz string
) PARTITIONED BY (year string, month string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

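Now create a source table that contains every column of the partitioned table, partition columns included.

-- Ordinary table that stores the data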
create table message2
(id bigint
,name string
,age int  
,gender string
,clazz string
, year string
, month string 
) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- First load the data into the source table
hive> load data local inpath '/usr/local/module/studnets_pt.txt' into table message2;

-- Enable dynamic partitioning
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

-- Insert the data into the partitioned table
hive> insert into table students_pt partition(year,month) select * from message2;

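Check the contents of the partitioned table's directory on HDFS.

One level down, the data is indeed split by the two columns in turn: the first level is split by year, and inside each year folder the data is split by month.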
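Static and dynamic partition columns can also be combined in one insert, as long as the static ones come first in the partition clause. The sketch below fixes `year` and lets `month` be dynamic; the value '2021' is hypothetical and only for illustration. Because matching is positional, the select list ends with the single dynamic column `month`.

```sql
-- year is static, month is dynamic; '2021' is an illustrative value
insert into table students_pt partition(year='2021', month)
select id, name, age, gender, clazz, month
from message2
where year='2021';
```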

Feel free to share; when reposting, please credit the source: 内存溢出

Original URL: http://outofmemory.cn/zaji/5605099.html
