大家平时可能都是在本集群上读取本地的HDFS文件,那如果我有两套集群呢?这个时候该如何读取另外一套集群上面的HDFS文件呢?废话不多说,直接上代码,如果代码有关于一些nameservices等这些信息不知道去哪里看的人,可以翻翻我之前的博客,或者私信我。
public static void main(String[] args) { SparkConf conf = new SparkConf().setAppName("bigdata-spark-etl-task") .setMaster("local[*]"); SparkSession sparkSession = SparkSession.builder() .config(conf) .getOrCreate(); MapnameNodeMap = new HashMap<>(); nameNodeMap.put("namenode33", "incubator-dc-006:8020"); nameNodeMap.put("namenode35", "incubator-dc-007:8020"); hdfsReader(sparkSession, "test-ns1", nameNodeMap, "hdfs://test-ns1/dw/public/test/ads_mkt_td_user_ds"); } static Dataset hdfsReader(SparkSession sparkSession, String nameServices, Map
nameNodeMap, String hdfsDir) { SparkContext sparkContext = sparkSession.sparkContext(); sparkContext.hadoopConfiguration().set("fs.defaultFs", "hdfs://" + nameServices); sparkContext.hadoopConfiguration().set("dfs.nameservices", nameServices); List nameNodesLists = new ArrayList<>(); StringBuilder haNameNodes = new StringBuilder(); int i = 0; for (Map.Entry nameNodes : nameNodeMap.entrySet()) { nameNodesLists.add(nameNodes.getKey()); if (i == 0) { haNameNodes = new StringBuilder(nameNodes.getKey()); } else { haNameNodes.append(",").append(nameNodes.getKey()); } i++; } sparkContext.hadoopConfiguration().set("dfs.ha.namenodes." + nameServices, haNameNodes.toString()); for (String nameNodesList : nameNodesLists) { sparkContext.hadoopConfiguration().set("dfs.namenode.rpc-address." + nameServices + "." + nameNodesList, nameNodeMap.get(nameNodesList)); } sparkContext.hadoopConfiguration().set("dfs.ha.automatic-failover.enabled." + nameServices, "true"); sparkContext.hadoopConfiguration().set("dfs.client.failover.proxy.provider." + nameServices, "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"); sparkSession.read().load(hdfsDir).show(); return sparkSession.read().load(hdfsDir); }
欢迎分享,转载请注明来源:内存溢出
评论列表(0条)