import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;          // Result, Scan
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.protobuf.ProtobufUtil;
import org.apache.hadoop.hbase.protobuf.generated.ClientProtos;
import org.apache.hadoop.hbase.util.Base64;
import org.apache.spark.api.java.JavaPairRDD;
String tableName = "testTable";
Scan scan = new Scan();
scan.setCaching(10000);       // rows fetched per RPC round trip
scan.setCacheBlocks(false);   // skip the block cache on a full-table scan
Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, tableName);
ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
String scanToString = Base64.encodeBytes(proto.toByteArray());
conf.set(TableInputFormat.SCAN, scanToString);
JavaPairRDD<ImmutableBytesWritable, Result> myRDD = sc  // sc: an existing JavaSparkContext
    .newAPIHadoopRDD(conf, TableInputFormat.class,
        ImmutableBytesWritable.class, Result.class);
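As a quick sanity check, one might force the scan and decode a single cell per row. This is only a sketch; the column family "cf" and qualifier "col" are placeholders, not names from the original table:

import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaRDD;
long rows = myRDD.count();  // materializes the full-table scan
JavaRDD<String> firstCol = myRDD.map(t ->
    Bytes.toString(t._2().getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));
firstCol.take(5).forEach(System.out::println);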
Reading an HBase table from Spark through the standard Hadoop interface shown above (a full-table scan) takes 20+ minutes for roughly 500 million rows, while reading the same data stored in Hive takes under 1 minute, so the performance gap is enormous.
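For reference, the Hive-side read in that comparison would look roughly like the sketch below; the post does not show this code, and the Hive table name testTable and reuse of the JavaSparkContext sc from above are assumptions:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;
HiveContext hiveCtx = new HiveContext(sc.sc());            // reuse the JavaSparkContext from above
DataFrame hiveDF = hiveCtx.sql("SELECT * FROM testTable"); // assumed Hive table holding the same rows
System.out.println(hiveDF.count());                        // drives a full read through Hive

The usual explanation for a gap of this size is that the Hive path streams the table's underlying HDFS files directly, while the TableInputFormat path pays RegionServer RPC and per-row deserialization costs.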
Let's go straight to the code:
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * I like writing code.
 * Created by wangtuntun on 16-5-7.
 */
object clean {
  def main(args: Array[String]) {
    // Set up the environment
    val conf = new SparkConf().setAppName("tianchi").setMaster("local")
    val sc = new SparkContext(conf)
    val sqc = new SQLContext(sc)
    // (the rest of the original snippet is cut off here)
  }
}