ClickHouse vs. Presto vs. Hive performance comparison (700 million rows)

Data volume

About 760 million rows in total, all machine (equipment) data.

Data in Hive

DROP TABLE IF EXISTS dwd_ipqc_online;
CREATE EXTERNAL TABLE dwd_ipqc_online
(
    MACH_ID       string COMMENT 'machine ID',
    MACH_IP       string COMMENT 'machine IP',
    CREATE_TIME   string COMMENT 'creation time',
    IPQC_ONLINEID string COMMENT 'ID',
    INS_TIME      string COMMENT 'insert time',
    PROD_SN       string COMMENT 'product SN',
    DOT_ID        string COMMENT 'measurement point',
    DOT_VALUE     string COMMENT 'value'
) COMMENT 'IPQC online measurement records'
    PARTITIONED BY (`dt` string)
    STORED AS PARQUET
    LOCATION '/warehouse/xx/dwd/dwd_ipqc_online/'
    TBLPROPERTIES ("parquet.compression" = "lzo");

I am running Hive on Spark here.

set mapreduce.job.queuename=hive;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set spark.executor.instances=12;
set spark.executor.cores=4;
set spark.executor.memory=4G;
set spark.default.parallelism=100;
select count(*) from dwd_ipqc_online;

Took 90 s, using 8 CPU cores and 38 GB of memory.

Presto

Presto connects directly to the corresponding data in Hive.

select count(dt) from cut."dwd_ipqc_online";
select count(*) from cut."dwd_ipqc_online";

The first run took 54 s and the next run 38 s, using 4 CPU cores and 3.5 GB of memory.

ClickHouse

Import the Hive data into ClickHouse.

drop table if exists dwd_ipqc_online;
create table dwd_ipqc_online
(
    mach_id       String comment 'machine ID',
    mach_ip       String comment 'machine IP',
    create_time   DateTime comment 'creation time',
    ipqc_onlineid String comment 'ID',
    ins_time      DateTime comment 'insert time',
    prod_sn       String comment 'product SN',
    dot_id        String comment 'measurement point',
    dot_value     String comment 'value'
) engine = MergeTree
    partition by toYYYYMMDD(create_time)
    order by (create_time)
;
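The original does not show how the data was loaded. One possible approach, sketched here under assumptions, is to read the Hive table's Parquet files with ClickHouse's hdfs() table function. The NameNode address is hypothetical, the path is the LOCATION from the Hive DDL, and the column names are assumed to be stored in lowercase in the Parquet files. Whether ClickHouse's Parquet reader accepts the LZO codec declared on the Hive table has not been verified here; if it does not, the data would need to be exported in another format first.

```sql
-- Sketch only: 'namenode:8020' is a hypothetical NameNode address.
-- 'xx' in the path is kept exactly as it appears in the Hive DDL.
INSERT INTO dwd_ipqc_online
SELECT
    mach_id,
    mach_ip,
    -- The Hive columns are strings; convert them to DateTime.
    -- The OrZero variant returns a zero DateTime for empty or unparseable values.
    parseDateTimeBestEffortOrZero(create_time),
    ipqc_onlineid,
    parseDateTimeBestEffortOrZero(ins_time),
    prod_sn,
    dot_id,
    dot_value
FROM hdfs(
    'hdfs://namenode:8020/warehouse/xx/dwd/dwd_ipqc_online/dt=*/*',
    'Parquet',
    'mach_id String, mach_ip String, create_time String, ipqc_onlineid String, ins_time String, prod_sn String, dot_id String, dot_value String'
);
```

Because the table is partitioned by day, a single INSERT that spans more than 100 days can hit the default max_partitions_per_insert_block limit; loading one dt directory at a time avoids that. Once the load finishes, the same row count used for Hive and Presto can be run against this table.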

Took 0.038 s; the CPU was barely touched, and memory use totaled only 0.6 GB.
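A likely reason the ClickHouse count returns this quickly: for an unfiltered count on a MergeTree table, ClickHouse can answer from the row counts kept in part metadata instead of reading column data. To time a query that actually reads data, that shortcut can be switched off for the session. A sketch, assuming the timed query was the plain row count (the original does not show it):

```sql
-- Assumed form of the query timed above (not shown in the original).
SELECT count(*) FROM dwd_ipqc_online;

-- Disable the metadata shortcut so the count has to read data.
SET optimize_trivial_count_query = 0;
SELECT count(*) FROM dwd_ipqc_online;

-- Alternatively, a filter on a regular column also forces a scan.
SELECT count(*) FROM dwd_ipqc_online WHERE dot_id != '';
```

Timings from the last two queries are a closer match for a scan-based comparison than the metadata-only count.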

Summary

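Should ClickHouse be connected directly to the source database from now on?
Offline data: given ClickHouse's weakness with table joins, the likely approach is to keep offline historical data in Hive and run join analysis in Presto first. When faster queries are needed, join the tables into a wide table and export it to ClickHouse.
Real-time data: batch-load into ClickHouse through Kafka plus Spark or Flink. This still needs further investigation.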
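The offline path described above can be sketched in Presto as a CREATE TABLE AS SELECT that pre-joins the measurement data with a dimension table. The names dim_machine, line_name, and dws_ipqc_online_wide are hypothetical and only illustrate the shape of such a wide table.

```sql
-- Hypothetical names: dim_machine, line_name, dws_ipqc_online_wide.
-- The join runs once in Presto, so ClickHouse never has to perform it.
CREATE TABLE cut.dws_ipqc_online_wide
WITH (format = 'PARQUET')
AS
SELECT
    o.mach_id,
    o.mach_ip,
    o.create_time,
    o.ipqc_onlineid,
    o.ins_time,
    o.prod_sn,
    o.dot_id,
    o.dot_value,
    m.line_name
FROM cut."dwd_ipqc_online" AS o
LEFT JOIN cut.dim_machine AS m
    ON o.mach_id = m.mach_id;
```

The resulting table could then be exported into a ClickHouse MergeTree table with matching columns and queried there without joins.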


Original article: http://outofmemory.cn/zaji/5624530.html
