How to Run a Hadoop MapReduce Program on Windows

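1 First, log in to a node of the Hadoop cluster and create a Java source file. To keep things short, the code is taken almost verbatim from the official word count example (the goal of this article is to show how to write and run a MapReduce program quickly, not how to write a full-featured one).

The contents are as follows: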

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class myword {

  // Mapper: emits (word, 1) for every whitespace-separated token of a line.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts of one word.
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Strip generic Hadoop options; what remains is <in> and <out>.
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    // new Job(conf, name) is deprecated in Hadoop 2.x;
    // Job.getInstance(conf, "word count") is the replacement there.
    Job job = new Job(conf, "word count");
    job.setJarByClass(myword.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

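Compared with the official version, there are two changes:

1) For simplicity, the leading package org.apache.hadoop.examples; declaration was removed.

2) The class was renamed from WordCount to myword, to show that it is our own work :)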
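One detail of the mapper is easy to miss: StringTokenizer with its default delimiters splits on whitespace only, so punctuation stays attached to words and case is preserved. The following is a minimal sketch that runs without Hadoop; the class name TokenizerDemo and the sample sentence are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerDemo {

    // Splits a line the same way the mapper does: on the default
    // delimiters of StringTokenizer (space, tab, newline, CR, form feed).
    public static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical input line, not taken from the article.
        System.out.println(tokens("Hello, hello world\tworld."));
    }
}
```

With this input the four tokens are "Hello,", "hello", "world" and "world.", so the job would count them as four different words. If that is not wanted, normalizing the token (lower-casing, stripping punctuation) before word.set(...) is one possible change.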

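2 Get the class path that Hadoop runs with; it is needed for compilation

Run the command

hadoop classpath

Save the printed result. The Hadoop distribution used in this article is Pivotal Hadoop from Pivotal; example output: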

/etc/gphd/hadoop/conf:/usr/lib/gphd/hadoop/lib/*:/usr/lib/gphd/hadoop/.//*:/usr/lib/gphd/hadoop-hdfs/./:/usr/lib/gphd/hadoop-hdfs/lib/*:/usr/lib/gphd/hadoop-hdfs/.//*:/usr/lib/gphd/hadoop-yarn/lib/*:/usr/lib/gphd/hadoop-yarn/.//*:/usr/lib/gphd/hadoop-mapreduce/lib/*:/usr/lib/gphd/hadoop-mapreduce/.//*::/etc/gphd/pxf/conf::/usr/lib/gphd/pxf/pxf-core.jar:/usr/lib/gphd/pxf/pxf-api.jar:/usr/lib/gphd/publicstage:/usr/lib/gphd/gfxd/lib/gemfirexd.jar::/usr/lib/gphd/zookeeper/zookeeper.jar:/usr/lib/gphd/hbase/lib/hbase-common.jar:/usr/lib/gphd/hbase/lib/hbase-protocol.jar:/usr/lib/gphd/hbase/lib/hbase-client.jar:/usr/lib/gphd/hbase/lib/hbase-thrift.jar:/usr/lib/gphd/hbase/lib/htrace-core-2.01.jar:/etc/gphd/hbase/conf::/usr/lib/gphd/hive/lib/hive-service.jar:/usr/lib/gphd/hive/lib/libthrift-0.9.0.jar:/usr/lib/gphd/hive/lib/hive-metastore.jar:/usr/lib/gphd/hive/lib/libfb303-0.9.0.jar:/usr/lib/gphd/hive/lib/hive-common.jar:/usr/lib/gphd/hive/lib/hive-exec.jar:/usr/lib/gphd/hive/lib/postgresql-jdbc.jar:/etc/gphd/hive/conf::/usr/lib/gphd/sm-plugins/*:
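The class path is printed as one long colon-separated line, which is hard to check by eye. As a small aid, the sketch below splits such a string into one entry per line and drops the empty entries produced by "::". It assumes a Unix-style ':' separator; the class name ClasspathEntries and the fallback sample path are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ClasspathEntries {

    // Splits a Unix-style class path on ':' and skips empty entries.
    public static List<String> split(String classpath) {
        List<String> entries = new ArrayList<>();
        for (String entry : classpath.split(":")) {
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Pass the output of "hadoop classpath" as the first argument;
        // the fallback below is a made-up sample, not a real installation.
        String cp = args.length > 0 ? args[0] : "/etc/hadoop/conf::/usr/lib/hadoop/lib/*:";
        for (String entry : split(cp)) {
            System.out.println(entry);
        }
    }
}
```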

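3 Compile

Run the command

javac -classpath xxx ./myword.java

where xxx is the class path obtained in the previous step.

After this command finishes, several class files are generated in the current directory, for example:

myword.class myword$IntSumReducer.class myword$TokenizerMapper.class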

4 Package the class files into a jar file

Run the command

jar -cvf myword.jar ./*.class

The target jar file is now built.

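5 Prepare some text files and upload them to HDFS as input for the word count

Example:

Create a few arbitrary text files and save them in a folder named mapred_test.

Run the command

hadoop fs -put ./mapred_test/

Make sure the folder has been uploaded to the current user's home directory on HDFS (for example, by listing it with hadoop fs -ls).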

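6 Run our program

Run the command

hadoop jar ./myword.jar myword mapred_test output

If all goes well, the command starts a MapReduce job, and the results are saved in the output folder under the current user's home directory on HDFS.

That's it!

If more functionality is needed, the source file above can be modified into a genuinely useful MapReduce job.

The principle stays much the same, though, and for practice this is basically enough.

This is only a simple example to get the discussion started; criticism is welcome.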
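To use the job's results elsewhere, the files in the output folder need to be read back. With the default text output format each line is a key, a TAB, and a value, so a word count line looks like word, TAB, count. The sketch below parses such lines; it assumes they have already been fetched to the local side (for example with hadoop fs -cat on the part files), and the class name OutputParser and the sample lines are made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OutputParser {

    // Parses lines of the form "<word>\t<count>" into a map that keeps
    // the order of the input. Lines without a TAB are skipped.
    public static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // blank or malformed line
            }
            String word = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, not real job output.
        System.out.println(parse(Arrays.asList("hello\t3", "world\t1")));
    }
}
```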

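How MapReduce works: MapReduce is a programming model for parallel computation over large data sets. It follows a "divide and conquer" approach: operations on a large data set are distributed to the worker nodes managed by a master node, and the intermediate results from those nodes are then combined into the final result.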

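What is MapReduce?

MapReduce amounts to "breaking a task apart and aggregating the results". It makes it much easier for programmers with no experience in distributed parallel programming to run their programs on a distributed system.

MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB). Its central concepts, "Map" and "Reduce", are borrowed from functional programming languages, along with features borrowed from vector programming languages.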
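The "split the task, then aggregate the results" idea can be traced in a single process. The sketch below is a plain-Java illustration of the three conceptual steps (map, group by key, reduce) for word counting. It is not how Hadoop is implemented: on a cluster the map calls run on different nodes and the grouping is done by the framework. The class and method names are made up for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {

    // Map step: turn one input line into (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle step: group all values that share a key, sorted by key.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce step: collapse the values of one key into a single sum.
    public static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three steps in order over all input lines.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {a=2, b=2, c=1}
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));
    }
}
```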

Steps for writing a MapReduce program with Eclipse:

1. Install the Hadoop plugin for Eclipse. Note: the plugin version must match the Hadoop version.

Download: hadoop-eclipse-plugin-2.5.2.jar

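This article mainly covers three questions:

When writing a MapReduce program in Java, how to pass parameters to the map and reduce functions.

When writing a MapReduce program with Streaming (C/C++, Shell, Python), how to pass parameters to the map and reduce scripts.

When writing a MapReduce program with Streaming (C/C++, Shell, Python), how to pass files or directories to the map and reduce scripts.

(1) Streaming: loading a single local file

(2) Streaming: loading multiple local files

(3) Streaming: loading a local directory

(4) Streaming: reading an HDFS file from within the MapReduce script

(5) Streaming: reading HDFS from within the MapReduce script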

Sharing is welcome; when reposting, please credit the source: 内存溢出

Original URL: http://outofmemory.cn/zz/9873985.html
