Data Lake Iceberg in Practice, Lesson 1: Getting started
Data Lake Iceberg in Practice, Lesson 2: Iceberg's underlying data layout on hadoop
Data Lake Iceberg in Practice, Lesson 3: Reading from kafka into iceberg with SQL in sql-client
Data Lake Iceberg in Practice, Lesson 4: Reading from kafka into iceberg with SQL in sql-client (upgraded to flink 1.12.7)
Data Lake Iceberg in Practice, Lesson 5: Characteristics of the hive catalog
Data Lake Iceberg in Practice, Lesson 6: Fixing failed writes from kafka into iceberg
Data Lake Iceberg in Practice, Lesson 7: Real-time writes into iceberg
Data Lake Iceberg in Practice, Lesson 8: Integrating hive with iceberg
Data Lake Iceberg in Practice, Lesson 9: Merging small files
Data Lake Iceberg in Practice, Lesson 10: Deleting snapshots
Data Lake Iceberg in Practice, Lesson 11: Testing the full partitioned-table workflow (generate data, create table, compact, expire snapshots)
Data Lake Iceberg in Practice, Lesson 12: What is a catalog
Table of Contents
Series articles
Preface
1. Generating data
  1.1 Generate the data
  1.2 log -> flume -> kafka
2. Create and test the partitioned table
  2.1 Start flink-sql
  2.2 Create the hiveCatalog
  2.3 Create the partitioned table
3. Write the Kafka behavior data into the partitioned table
  3.1 Define the Kafka table as csv (failed)
  3.2 Define the Kafka table as raw (succeeded)
  3.3 SQL to load the Kafka table into the partitioned table
  3.4 Full SQL actually executed in flink-sql
4. Repeat steps 1 and 3 to generate data for multiple days
5. Merge small files and observe
6. Merge, delete old snapshots, and observe
Summary
Preface
This lesson tests small-file compaction and snapshot expiry on a partitioned table, and observes their effect on the table, simulating a production environment.
Test architecture: log -> kafka -> flink -> iceberg
1. Generating data
1.1 Generate the data
Requirements: the generated id range, the generation rate, and the date carried by the data must all be configurable.
package org.example;

import java.util.Calendar;
import java.util.Random;

public class GenerateLogWithDate {
    public static void main(String[] args) throws InterruptedException {
        if (args.length != 5) {
            System.out.println("usage: <id range 1-?> <pause per record, ms> <year> <month, 0-based> <day>");
            System.exit(0);
        }
        int len = Integer.valueOf(args[0]);
        int sleepMillis = Integer.valueOf(args[1]);
        // note: Calendar months are 0-based, which is why January is passed as 0 below
        Calendar cal = Calendar.getInstance();
        cal.set(Integer.valueOf(args[2]), Integer.valueOf(args[3]), Integer.valueOf(args[4]));

        Random random = new Random();
        for (int i = 0; i < 10000000; i++) {
            // emit: sequence, random id, event time derived from the configured date
            // (the original printed Calendar.getInstance().getTimeInMillis(), i.e. the
            //  current wall-clock time, which ignored the configured date)
            System.out.println(i + "," + random.nextInt(len) + "," + cal.getTimeInMillis());
            cal.add(Calendar.MILLISECOND, sleepMillis);
            Thread.sleep(sleepMillis);
        }
    }
}
Package it and run; the output looks like this:
[root@hadoop101 software]# java -jar log-generater-1.0-SNAPSHOT.jar 100000 1000 2022 0 20
0,54598,1643341060767
1,71915,1643341061768
2,69469,1643341062768
3,7125,1643341063768
java -jar /opt/software/log-generater-1.0-SNAPSHOT.jar 100000 1000 2022 0 20 > /opt/module/logs/withDateFileToKakfa.log
[root@hadoop102 ~]# java -jar /opt/software/log-generater-1.0-SNAPSHOT.jar 100000 10 2022 0 27 > /opt/module/logs/withDateFileToKakfa.log

Switching dates:

[root@hadoop102 ~]# java -jar /opt/software/log-generater-1.0-SNAPSHOT.jar 1000000 1 2022 0 27 >> /opt/module/logs/withDateFileToKakfa.log
[root@hadoop102 ~]# java -jar /opt/software/log-generater-1.0-SNAPSHOT.jar 1000000 1 2022 0 26 >> /opt/module/logs/withDateFileToKakfa.log
[root@hadoop102 ~]# java -jar /opt/software/log-generater-1.0-SNAPSHOT.jar 1000000 1 2022 0 29 >> /opt/module/logs/withDateFileToKakfa.log

1.2 log -> flume -> kafka
First create the topic behavior_with_date_log in Kafka.
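The topic-creation command itself isn't shown in the original; a typical invocation would look like the sketch below. The partition and replication counts are assumptions, and older Kafka versions take `--zookeeper hadoop101:2181` instead of `--bootstrap-server`:

```
kafka-topics.sh --create \
  --bootstrap-server hadoop101:9092,hadoop102:9092,hadoop103:9092 \
  --topic behavior_with_date_log \
  --partitions 3 --replication-factor 2
```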
Flume agent configuration:
Kafka acts as the channel; there is no sink.
cd $FLUME_HOME/conf
vim with-date-file-to-kafka.conf
a1.sources = r1
a1.channels = c1

a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /var/log/flume/withDate_taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/logs/withDateFileToKakfa.log

a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = hadoop101:9092,hadoop102:9092,hadoop103:9092
a1.channels.c1.kafka.topic = behavior_with_date_log
a1.channels.c1.kafka.consumer.group.id = flume-consumer2

# bind the source to the channel (no sink is configured)
a1.sources.r1.channels = c1
Prepare a start/stop script for the agent:
[root@hadoop102 bin]# cat flumeWithDate.sh
#!/bin/bash
case $1 in
"start"){
    for i in hadoop102
    do
        echo " -------- starting $i flume with-date-file-to-kafka.conf -------"
        ssh $i "source /etc/profile; cd /opt/module/flume; nohup flume-ng agent --conf conf --conf-file conf/with-date-file-to-kafka.conf --name a1 -Dflume.root.logger=INFO,console >/opt/module/flume/logs/with-date-file-to-kafka.conf 2>&1 &"
    done
};;
"stop"){
    for i in hadoop102
    do
        echo " -------- stopping $i flume with-date-file-to-kafka.conf -------"
        # \$2 must be escaped here; an unescaped $2 is expanded by the local shell
        # before ssh runs, leaving awk printing the whole line
        ssh $i "ps -ef | grep with-date-file-to-kafka.conf | grep -v grep | awk '{print \$2}' | xargs -n1 kill"
    done
};;
esac
Start the agent:

[root@hadoop102 bin]# flumeWithDate.sh start
 -------- starting hadoop102 flume with-date-file-to-kafka.conf -------
/etc/profile: line 49: HISTCONTROL: readonly variable

Check whether the data made it into Kafka. It did.
Consume it to verify:
[root@hadoop103 ~]# kafka-console-consumer.sh --topic behavior_with_date_log --bootstrap-server hadoop101:9092,hadoop102:9092,hadoop103:9092 --from-beginning
4129299,24977,1643354586164
4129302,18826,1643354586465
4129255,98763,1643354581760
4129258,68045,1643354582060
4129261,42309,1643354582361
4129264,25737,1643354582661
4129267,63120,1643354582961
4129270,62009,1643354583261
4129273,36358,1643354583562
4129276,18414,1643354583862
4129279,73156,1643354584162
4129282,80180,1643354584463

2. Create and test the partitioned table
2.1 Start flink-sql
[root@hadoop101 ~]# sql-client.sh embedded -j /opt/software/iceberg-flink-runtime-0.12.1.jar -j /opt/software/flink-sql-connector-hive-2.3.6_2.12-1.12.7.jar -j /opt/software/flink-sql-connector-kafka_2.12-1.12.7.jar shell
2.2 Create the hiveCatalog
Keep using hive_catalog6.
The catalog must be recreated in every new session.
Catalog creation script:
CREATE CATALOG hive_catalog6 WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'uri'='thrift://hadoop101:9083',
  'clients'='5',
  'property-version'='1',
  'warehouse'='hdfs:///user/hive/warehouse/hive_catalog6'
);
Databases are shared across the catalogs.
use catalog hive_catalog6;
create database iceberg_db6;
CREATE TABLE hive_catalog6.iceberg_db6.behavior_with_date_log_ib (
  i STRING,
  id STRING,
  otime STRING,
  dt STRING
) PARTITIONED BY (dt);
Check the generated directory:
[root@hadoop103 ~]# hadoop fs -ls -R /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/
drwxr-xr-x   - root supergroup          0 2022-01-28 16:19 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/metadata
-rw-r--r--   2 root supergroup       1793 2022-01-28 16:19 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/metadata/00000-cbb96f75-6438-4cfd-8c09-9da953af8519.metadata.json
Notes on drop table:
It deletes the files under metadata, but the table's metadata directory itself is kept!
The directory that remains:
/user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/metadata

3. Write the Kafka behavior data into the partitioned table
3.1 Define the Kafka table as csv (failed)
Define the table with the csv format:
create table behavior_with_date_log (
  i STRING,
  ID STRING,
  OTIME STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'behavior_with_date_log',
  'properties.bootstrap.servers' = 'hadoop101:9092,hadoop102:9092,hadoop103:9092',
  'properties.group.id' = 'rickGroup6',
  'scan.startup.mode' = 'earliest-offset',
  'csv.field-delimiter' = ',',
  'csv.ignore-parse-errors' = 'true',
  'format' = 'csv'
);

Every field is defined as STRING, yet the first field comes out garbled and the last two fields are NULL. Why?
Without 'csv.ignore-parse-errors' = 'true', the result is still NULL.
Is the delimiter wrong? I copied the delimiter straight from the log into the DDL and recreated the table; same result.
Consuming directly with kafka-console-consumer.sh looks fine:
kafka-console-consumer.sh --topic behavior_with_date_log --bootstrap-server hadoop101:9092,hadoop102:9092,hadoop103:9092 --from-beginning
4132950,38902,1643354951608
4132903,27044,1643354946904
4132906,17342,1643354947204
4132909,42787,1643354947504
4132912,70169,1643354947805
4132915,68492,1643354948105
4132918,38690,1643354948405
4132921,53191,1643354948705
4132924,82465,1643354949006
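A plausible cause, not verified in the original: Flume's KafkaChannel serializes each record as an Avro-wrapped Flume event by default, which prepends a few binary header bytes to the message body — which matches the "garbled first field" symptom (the console consumer may simply not render those bytes visibly). If that is what's happening here, setting the channel property below makes Flume write the raw body instead:

```
a1.channels.c1.parseAsFlumeEvent = false
```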
Drop the table before recreating it:
drop table behavior_with_date_log;
Alternatively, the table can be defined with the raw format, but flink-sql has no built-in split function. It does have a built-in split_index function — so why doesn't show functions list split_index?
3.2 Define the Kafka table as raw (succeeded)
Flink SQL> create table behavior_with_date_log
> (
>   log STRING
> ) WITH (
>   'connector' = 'kafka',
>   'topic' = 'behavior_with_date_log',
>   'properties.bootstrap.servers' = 'hadoop101:9092,hadoop102:9092,hadoop103:9092',
>   'properties.group.id' = 'rickGroup6',
>   'scan.startup.mode' = 'earliest-offset',
>   'format' = 'raw'
> );
[INFO] Table has been created.
SQL syntax checks:

select split_index('4135009,73879,1643355157701',',',0),
       split_index('4135009,73879,1643355157701',',',1),
       split_index('4135009,73879,1643355157701',',',2);
SELECT TO_TIMESTAMP(FROM_UNIXTIME(1513135677000 / 1000, 'yyyy-MM-dd'));
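The same split-and-convert logic can be sanity-checked outside Flink. A small Java sketch of my own (not from the original) that mirrors what split_index and from_unixtime do to one log line — the Asia/Shanghai time zone is an assumption about the cluster's session time zone:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class EtlCheck {
    // mirror of the Flink SQL ETL: split the raw line and derive dt (yyyyMMdd)
    static String[] etl(String log) {
        String[] f = log.split(",");      // i, id, otime
        long ts = Long.parseLong(f[2]);   // epoch millis
        String dt = DateTimeFormatter.ofPattern("yyyyMMdd")
                .withZone(ZoneId.of("Asia/Shanghai"))  // assumed cluster time zone
                .format(Instant.ofEpochMilli(ts));
        return new String[]{f[0], f[1], f[2], dt};
    }

    public static void main(String[] args) {
        // prints 4135009|73879|1643355157701|20220128
        System.out.println(String.join("|", etl("4135009,73879,1643355157701")));
    }
}
```

If the derived dt disagrees with what Flink produces, the difference is almost certainly the session time zone used by from_unixtime.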
ETL the raw Kafka log and add a dt field; the SQL:
select split_index(log,',',0) as i,
       split_index(log,',',1) as id,
       split_index(log,',',2) as otime,
       from_unixtime(cast(cast(split_index(log,',',2) as bigint)/1000 as bigint),'yyyyMMdd') as dt
from behavior_with_date_log;

3.3 SQL to load the Kafka table into the partitioned table
insert into hive_catalog6.iceberg_db6.behavior_with_date_log_ib
select split_index(log,',',0) as i,
       split_index(log,',',1) as id,
       split_index(log,',',2) as otime,
       from_unixtime(cast(cast(split_index(log,',',2) as bigint)/1000 as bigint),'yyyyMMdd') as dt
from behavior_with_date_log;

3.4 Full SQL actually executed in flink-sql
[root@hadoop101 ~]# sql-client.sh embedded -j /opt/software/iceberg-flink-runtime-0.12.1.jar -j /opt/software/flink-sql-connector-hive-2.3.6_2.12-1.12.7.jar -j /opt/software/flink-sql-connector-kafka_2.12-1.12.7.jar shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/flink-1.12.7/lib/log4j-slf4j-impl-2.16.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
No default environment specified. Searching for '/opt/module/flink-1.12.7/conf/sql-client-defaults.yaml'...found.
Reading default environment from: file:/opt/module/flink-1.12.7/conf/sql-client-defaults.yaml
No session environment specified.
Command history file path: /root/.flink-sql-history

Welcome! Enter 'HELP;' to list all available commands. 'QUIT;' to exit.

Flink SQL> CREATE CATALOG hive_catalog6 WITH (
>   'type'='iceberg',
>   'catalog-type'='hive',
>   'uri'='thrift://hadoop101:9083',
>   'clients'='5',
>   'property-version'='1',
>   'warehouse'='hdfs:///user/hive/warehouse/hive_catalog6'
> );
2022-01-28 16:20:20,351 INFO  org.apache.hadoop.hive.conf.HiveConf [] - Found configuration file file:/opt/module/hive/conf/hive-site.xml
2022-01-28 16:20:20,574 WARN  org.apache.hadoop.hive.conf.HiveConf [] - HiveConf of name hive.metastore.event.db.notification.api.auth does not exist
[INFO] Catalog has been created.

Flink SQL> create table behavior_with_date_log
> (
>   log STRING
> ) WITH (
>   'connector' = 'kafka',
>   'topic' = 'behavior_with_date_log',
>   'properties.bootstrap.servers' = 'hadoop101:9092,hadoop102:9092,hadoop103:9092',
>   'properties.group.id' = 'rickGroup6',
>   'scan.startup.mode' = 'earliest-offset',
>   'format' = 'raw'
> );
[INFO] Table has been created.

Flink SQL> insert into hive_catalog6.iceberg_db6.behavior_with_date_log_ib
> select split_index(log,',',0) as i, split_index(log,',',1) as id, split_index(log,',',2) as otime, from_unixtime(cast(cast(split_index(log,',',2) as bigint)/1000 as bigint),'yyyyMMdd') as dt from behavior_with_date_log;
[INFO] Submitting SQL update statement to the cluster...
[INFO] Table update statement has been successfully submitted to the cluster:
Job ID: 78930f941991e19112d3917fd4dd4cb2

4. Repeat steps 1 and 3 to generate data for multiple days
The output files:
One minute later:
Why two files per minute?
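A likely explanation, judging from the file names in the listing further down (the 00000- and 00001- prefixes are writer-subtask indexes): the Iceberg sink commits data files on every Flink checkpoint, so with sink parallelism 2 and a one-minute checkpoint interval, each checkpoint flushes one file per writer — two files per minute. The interval is governed by the checkpoint setting in flink-conf.yaml; the value below is an example, not taken from the original:

```yaml
# flink-conf.yaml — the Iceberg sink commits once per checkpoint
execution.checkpointing.interval: 60000   # ms; one commit (one file per writer) per minute
```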
First, count the small files.
[root@hadoop103 ~]# hadoop fs -ls /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data
Found 6 items
drwxr-xr-x   - root supergroup          0 2022-01-28 17:36 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=19700101
drwxr-xr-x   - root supergroup          0 2022-01-28 16:36 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220104
drwxr-xr-x   - root supergroup          0 2022-01-28 17:30 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220126
drwxr-xr-x   - root supergroup          0 2022-01-28 17:29 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220127
drwxr-xr-x   - root supergroup          0 2022-01-28 16:29 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128
drwxr-xr-x   - root supergroup          0 2022-01-28 17:37 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220129
Data files for the 28th:
[root@hadoop103 ~]# hadoop fs -ls /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128
Found 18 items
-rw-r--r--   2 root supergroup     441875 2022-01-28 16:22 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128/00000-0-0c8e94fd-5c8e-48df-ac03-4fe0e109940f-00001.parquet
-rw-r--r--   2 root supergroup       2842 2022-01-28 16:23 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128/00000-0-0c8e94fd-5c8e-48df-ac03-4fe0e109940f-00003.parquet
-rw-r--r--   2 root supergroup       2819 2022-01-28 16:24 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128/00000-0-0c8e94fd-5c8e-48df-ac03-4fe0e109940f-00004.parquet
-rw-r--r--   2 root supergroup       2833 2022-01-28 16:25 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128/00000-0-0c8e94fd-5c8e-48df-ac03-4fe0e109940f-00005.parquet
-rw-r--r--   2 root supergroup       2801 2022-01-28 16:26 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128/00000-0-0c8e94fd-5c8e-48df-ac03-4fe0e109940f-00006.parquet
-rw-r--r--   2 root supergroup       2824 2022-01-28 16:27 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128/00000-0-0c8e94fd-5c8e-48df-ac03-4fe0e109940f-00007.parquet
-rw-r--r--   2 root supergroup       2802 2022-01-28 16:28 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128/00000-0-0c8e94fd-5c8e-48df-ac03-4fe0e109940f-00008.parquet
-rw-r--r--   2 root supergroup       2781 2022-01-28 16:29 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128/00000-0-0c8e94fd-5c8e-48df-ac03-4fe0e109940f-00009.parquet
-rw-r--r--   2 root supergroup       2000 2022-01-28 16:30 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128/00000-0-0c8e94fd-5c8e-48df-ac03-4fe0e109940f-00010.parquet
-rw-r--r--   2 root supergroup     882110 2022-01-28 16:22 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128/00001-0-29418337-5d27-40bb-98cb-94b3b5007b99-00001.parquet
-rw-r--r--   2 root supergroup       4385 2022-01-28 16:23 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128/00001-0-29418337-5d27-40bb-98cb-94b3b5007b99-00003.parquet
-rw-r--r--   2 root supergroup       4364 2022-01-28 16:24 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128/00001-0-29418337-5d27-40bb-98cb-94b3b5007b99-00004.parquet
-rw-r--r--   2 root supergroup       4383 2022-01-28 16:25 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128/00001-0-29418337-5d27-40bb-98cb-94b3b5007b99-00005.parquet
-rw-r--r--   2 root supergroup       4340 2022-01-28 16:26 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128/00001-0-29418337-5d27-40bb-98cb-94b3b5007b99-00006.parquet
-rw-r--r--   2 root supergroup       4357 2022-01-28 16:27 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128/00001-0-29418337-5d27-40bb-98cb-94b3b5007b99-00007.parquet
-rw-r--r--   2 root supergroup       4347 2022-01-28 16:28 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128/00001-0-29418337-5d27-40bb-98cb-94b3b5007b99-00008.parquet
-rw-r--r--   2 root supergroup       4340 2022-01-28 16:29 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128/00001-0-29418337-5d27-40bb-98cb-94b3b5007b99-00009.parquet
-rw-r--r--   2 root supergroup       2688 2022-01-28 16:30 /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128/00001-0-29418337-5d27-40bb-98cb-94b3b5007b99-00010.parquet
[root@hadoop103 ~]# hadoop fs -ls /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128 | wc
     19     147    3813
In the wc output above, subtract the header line: 19 - 1 = 18 files.
[root@hadoop103 ~]# hadoop fs -ls /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=19700101 | wc
     43     339    8877
[root@hadoop103 ~]# hadoop fs -ls /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220104 | wc
     17     131    3391
[root@hadoop103 ~]# hadoop fs -ls /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220126 | wc
      5      35     858
[root@hadoop103 ~]# hadoop fs -ls /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220127 | wc
    109     867   22804
[root@hadoop103 ~]# hadoop fs -ls /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220128 | wc
     19     147    3813
[root@hadoop103 ~]# hadoop fs -ls /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220129 | wc
     21     163    4235
Data for the 29th is still being written:
[root@hadoop103 ~]# hadoop fs -ls /user/hive/warehouse/hive_catalog6/iceberg_db6.db/behavior_with_date_log_ib/data/dt=20220129 | wc
     23     179    4657

6. Merge and delete old snapshots, observe
Didn't get through this part; to be continued after the New Year.
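The compaction and snapshot expiry planned for this step can be sketched against the Iceberg 0.12 API. This is a sketch only — it was not run in the original, the target file size and retention window are arbitrary, and the catalog properties are carried over from the CREATE CATALOG statement above:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.flink.actions.Actions;
import org.apache.iceberg.hive.HiveCatalog;

public class CompactAndExpire {
    public static void main(String[] args) {
        // load the table through the same hive catalog flink-sql uses
        Map<String, String> props = new HashMap<>();
        props.put("uri", "thrift://hadoop101:9083");
        props.put("warehouse", "hdfs:///user/hive/warehouse/hive_catalog6");
        HiveCatalog catalog = new HiveCatalog();
        catalog.setConf(new Configuration());
        catalog.initialize("hive_catalog6", props);
        Table table = catalog.loadTable(
                TableIdentifier.of("iceberg_db6", "behavior_with_date_log_ib"));

        // 1) rewrite (merge) small data files toward a 128 MB target
        Actions.forTable(table)
                .rewriteDataFiles()
                .targetSizeInBytes(128L * 1024 * 1024)
                .execute();

        // 2) expire snapshots older than one hour so the pre-compaction
        //    data files become eligible for deletion
        long olderThan = System.currentTimeMillis() - TimeUnit.HOURS.toMillis(1);
        table.expireSnapshots()
                .expireOlderThan(olderThan)
                .commit();
    }
}
```

After running something like this, re-listing the dt=... directories should show the per-checkpoint small files replaced by a few large ones once the old snapshots referencing them are expired.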
The attempt died with the metastore error below — most likely the datanucleus jars are missing from the classpath of the process involved:

Caused by: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
	at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
	at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
	at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:521)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:550)
	at org.apache.hadoop.hive.metastore.ObjectStore.initializeHelper(ObjectStore.java:405)
	at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:342)
	at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:303)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:76)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStoreForConf(HiveMetaStore.java:628)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMSForConf(HiveMetaStore.java:594)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:588)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:659)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:431)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148)
	... 24 more
Caused by: java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at javax.jdo.JDOHelper$18.run(JDOHelper.java:2018)
	at javax.jdo.JDOHelper$18.run(JDOHelper.java:2016)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.jdo.JDOHelper.forName(JDOHelper.java:2015)
	at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162)
	... 45 more
Process finished with exit code 1
Summary