How to Set Up Python in a Hadoop Environment? _ Software Operations

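The steps for setting up a Python environment on Hadoop are as follows:

Install Hadoop: install Hadoop on your machine.

Install Python: make sure Python is installed on your machine. On a cluster, the interpreter needs to be available on every node that runs tasks.

Configure the Hadoop environment: edit Hadoop's configuration files so that Hadoop can work with Python.

Install the required modules: install the Python modules your jobs need so that they can be used in the Hadoop environment.

Test the Python installation: run a few test scripts to confirm that Python works correctly in the Hadoop environment.

These steps will help you set up Python in a Hadoop environment. Note that the exact steps may vary with the Hadoop version and environment, so consult the relevant documentation carefully.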
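The testing step above could be sketched as a small check script. This is only a minimal sketch and not an official Hadoop tool: the function name `check_environment`, the default minimum version, and the module list are assumptions chosen for illustration. It reports the interpreter version and whether each required module can be found, which is the kind of check worth running on each node that will execute Python tasks.

```python
import importlib.util
import sys


def check_environment(required_modules, min_version=(3, 0)):
    """Return (ok, report) for the current interpreter.

    required_modules and min_version are illustrative assumptions;
    replace them with whatever your own jobs actually need.
    """
    version = sys.version_info[:2]
    version_ok = version >= min_version
    ok = version_ok
    report = ["python %d.%d: %s" % (version[0], version[1],
                                    "ok" if version_ok else "too old")]
    for name in required_modules:
        try:
            # find_spec returns None when a top-level module is not installed
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        ok = ok and found
        report.append("%s: %s" % (name, "ok" if found else "missing"))
    return ok, report


if __name__ == "__main__":
    # Hypothetical module list: only standard-library modules here.
    ok, report = check_environment(["operator", "itertools"])
    print("\n".join(report))
    print("environment ok" if ok else "environment incomplete")
```

Running the same script on every worker node gives a quick way to spot a node whose Python installation differs from the others.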

Michael G. Noll describes on his blog how to write MapReduce programs for Hadoop in Python, and gogamza from Korea describes on his blog how to write them in C (I modified his original program slightly, because his Map split words on the tab character). I have combined their two articles so that Hadoop users in China can also write MapReduce programs in languages other than Java. First you need a working Hadoop cluster; there are plenty of guides online, and here is one link: "Hadoop学习笔记二 安装部署" (Hadoop Study Notes 2: Installation and Deployment).

Hadoop Streaming lets us use MapReduce from non-Java programming languages. Streaming exchanges data with our Map and Reduce programs through STDIN (standard input) and STDOUT (standard output). Any language that can read STDIN and write STDOUT can be used to write MapReduce programs, for example through sys.stdin and sys.stdout in Python, or stdin and stdout in C.

We again use Hadoop's WordCount example to show how to write MapReduce programs. The problem is to compute how often each word occurs in a set of documents. The Map program receives the documents one line at a time and splits each line on spaces into an array of words. It then iterates over the array and writes each word with a "1" to standard output, meaning the word occurred once. The Reduce program then totals the occurrences of each word.

Python Code

Map: mapper.py

    #!/usr/bin/env python3
    import sys

    # input comes from STDIN (standard input)
    for line in sys.stdin:
        # split the line into words; split() with no arguments also drops
        # leading/trailing whitespace and empty strings
        for word in line.split():
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for reducer.py
            #
            # tab-delimited; the trivial word count is 1
            print('%s\t%s' % (word, 1))

Reduce: reducer.py

    #!/usr/bin/env python3
    import sys

    # maps words to their counts
    word2count = {}

    # input comes from STDIN
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        try:
            # parse the input we got from mapper.py
            word, count = line.split('\t', 1)
            # convert count (currently a string) to int
            count = int(count)
        except ValueError:
            # the line was malformed or count was not a number,
            # so silently ignore/discard this line
            continue
        word2count[word] = word2count.get(word, 0) + count

    # sort the words lexicographically;
    # this step is NOT required, we just do it so that our
    # final output will look more like the official Hadoop
    # word count examples
    for word, count in sorted(word2count.items()):
        # write the results to STDOUT (standard output)
        print('%s\t%s' % (word, count))

C Code

Map: Mapper.c

    #include <stdio.h>
    #include <string.h>

    #define BUF_SIZE 2048

    int main(int argc, char *argv[])
    {
        char buffer[BUF_SIZE];

        while (fgets(buffer, BUF_SIZE - 1, stdin)) {
            size_t len = strlen(buffer);
            /* strip the trailing newline */
            if (len > 0 && buffer[len - 1] == '\n')
                buffer[len - 1] = 0;
            /* split on spaces and emit "word<TAB>1" for each word */
            char *query = strtok(buffer, " ");
            while (query) {
                printf("%s\t1\n", query);
                query = strtok(NULL, " ");
            }
        }
        return 0;
    }

Reduce: Reducer.c

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BUFFER_SIZE 1024
    #define DELIM "\t"

    int main(int argc, char *argv[])
    {
        char strLastKey[BUFFER_SIZE];
        char strLine[BUFFER_SIZE];
        int count = 0;

        *strLastKey = '\0';
        *strLine = '\0';

        while (fgets(strLine, BUFFER_SIZE - 1, stdin)) {
            char *strCurrKey = strtok(strLine, DELIM);
            char *strCurrNum = strtok(NULL, DELIM);

            /* skip malformed lines that lack a key or a count */
            if (strCurrKey == NULL || strCurrNum == NULL)
                continue;

            if (strLastKey[0] == '\0')
                strcpy(strLastKey, strCurrKey);

            if (strcmp(strCurrKey, strLastKey)) {
                /* the key changed: flush the previous key's total */
                printf("%s\t%d\n", strLastKey, count);
                count = atoi(strCurrNum);
            } else {
                count += atoi(strCurrNum);
            }
            strcpy(strLastKey, strCurrKey);
        }

        /* flush the count for the last key */
        if (strLastKey[0] != '\0')
            printf("%s\t%d\n", strLastKey, count);
        return 0;
    }

First, let's test the code locally:

    chmod +x mapper.py
    chmod +x reducer.py
    echo "foo foo quux labs foo bar quux" | ./mapper.py | ./reducer.py
    bar     1
    foo     3
    labs    1
    quux    2

    g++ Mapper.c -o Mapper
    g++ Reducer.c -o Reducer
    chmod +x Mapper
    chmod +x Reducer
    echo "foo foo quux labs foo bar quux" | ./Mapper | ./Reducer
    bar     1
    foo     2
    labs    1
    quux    1
    foo     1
    quux    1

You may notice that the C output differs from the Python output. That is because the Python reducer collects the counts in a dictionary, while the C reducer only merges identical words that arrive one after another. When the job runs in Hadoop, the map output is sorted first, so identical words arrive consecutively on the reducer's standard input.

Running the programs in Hadoop

First we need to download our test documents: wget […]

[…] Below is a MapReduce program written in PHP, excerpted from that page, for the reference of PHP programmers:

Map: mapper.php

    #!/usr/bin/php
    <?php
    $word2count = array();

    // input comes from STDIN (standard input)
    while (($line = fgets(STDIN)) !== false) {
        // remove leading and trailing whitespace and lowercase
        $line = strtolower(trim($line));
        // split the line into words while removing any empty string
        $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
        // increase counters
        foreach ($words as $word) {
            if (!isset($word2count[$word])) {
                $word2count[$word] = 0;
            }
            $word2count[$word] += 1;
        }
    }

    // write the results to STDOUT (standard output);
    // what we output here will be the input for the
    // Reduce step, i.e. the input for reducer.php
    foreach ($word2count as $word => $count) {
        // tab-delimited
        echo $word, chr(9), $count, PHP_EOL;
    }
    ?>

Reduce: reducer.php

    #!/usr/bin/php
    <?php
    $word2count = array();

    // input comes from STDIN
    while (($line = fgets(STDIN)) !== false) {
        // remove leading and trailing whitespace
        $line = trim($line);
        // skip lines that do not contain a tab-separated count
        if (strpos($line, chr(9)) === false) {
            continue;
        }
        // parse the input we got from mapper.php
        list($word, $count) = explode(chr(9), $line);
        // convert count (currently a string) to int
        $count = intval($count);
        // sum counts
        if ($count > 0) {
            if (!isset($word2count[$word])) {
                $word2count[$word] = 0;
            }
            $word2count[$word] += $count;
        }
    }

    // sort the words lexicographically;
    // this step is NOT required, we just do it so that our
    // final output will look more like the official Hadoop
    // word count examples
    ksort($word2count);

    // write the results to STDOUT (standard output)
    foreach ($word2count as $word => $count) {
        echo $word, chr(9), $count, PHP_EOL;
    }
    ?>

Author: 马士华. Posted: 2008-03-05
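The difference between the C and Python test outputs comes down to the sort that happens between the map and reduce phases. The following is a minimal sketch that imitates that pipeline locally in Python so the effect can be seen without a cluster. It is not how Hadoop itself is implemented, and the function names `map_words`, `reduce_sorted`, and `run_local` are hypothetical names chosen for illustration. The reducer uses `itertools.groupby`, which, like the C reducer, only merges identical keys that are adjacent.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Map step: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_sorted(pairs):
    """Reduce step: sum the counts of each run of adjacent identical keys.

    This is only a correct word count when the pairs are sorted by key,
    because groupby starts a new group whenever the key changes.
    """
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines, sort_between=True):
    """Imitate map -> (optional sort) -> reduce on a list of input lines."""
    pairs = list(map_words(lines))
    if sort_between:
        # stands in for the sort applied to the map output
        pairs.sort(key=itemgetter(0))
    return list(reduce_sorted(pairs))


if __name__ == "__main__":
    text = ["foo foo quux labs foo bar quux"]
    for word, count in run_local(text):
        print("%s\t%d" % (word, count))
```

Calling `run_local(text, sort_between=False)` skips the sort, and the counts for a word such as foo are then split across separate runs, which is the same kind of split seen in the unsorted local test of the C programs.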

A Guide to Python Frameworks for Hadoop - Technical Translations - OSChina (开源中国社区)

https://www.oschina.net/translate/a-guide-to-python-frameworks-for-hadoop

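I recently joined Cloudera after working in computational biology/genomics for almost 10 years. My analysis work was done mainly in Python and its excellent scientific computing stack. But most of the Apache Hadoop ecosystem is implemented in Java and intended for Java, which annoyed me. So my first order of business became finding some Hadoop frameworks that can be used from Python.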

In this post I will write down my own unscientific opinions of these frameworks, which include:

Hadoop Streaming

mrjob

dumbo

hadoopy

pydoop

Others


Original article: http://outofmemory.cn/yw/11721160.html
