Pseudo-Distributed Hadoop + Spark Installation and Configuration, with WordCount

When learning, we usually work in a pseudo-distributed environment. The following covers the installation and configuration of pseudo-distributed Hadoop + Spark:

  • centos:7.4
  • jdk:1.8
  • hadoop:2.7.2
  • scala:2.12.13
  • spark:3.0.1
1. Configure the virtual machine

Download CentOS 7 and install it in a virtual machine.

1. Configure a static IP
vi /etc/sysconfig/network-scripts/ifcfg-ens33

TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
# change to static
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ens33
UUID=aec3fd78-3c06-4a77-8773-2667fe034ef4
DEVICE=ens33
# change to yes
ONBOOT=yes
# add the IP address, gateway, and DNS
IPADDR=192.168.75.120
GATEWAY=192.168.75.2
DNS1=192.168.75.2

Restart the network:

service network restart

2. Disable the firewall

# check the firewall status
systemctl status firewalld.service
active (running)

# stop the firewall
systemctl stop firewalld.service

# check the status again
systemctl status firewalld.service
inactive (dead)

# permanently disable the firewall
systemctl disable firewalld.service
3. Test that yum works

[root@localhost ~]# yum install vim

The download fails:

已加载插件:fastestmirror
Determining fastest mirrors
Could not retrieve mirrorlist http://mirrorlist.centos.org/?release=7&arch=x86_64&repo=os&infra=stock error was
14: curl#6 - "Could not resolve host: mirrorlist.centos.org; Unknown error"


 One of the configured repositories failed (Unknown),
 and yum doesn't have enough cached data to continue. At this point the only
 safe thing yum can do is fail. There are a few ways to work "fix" this:

     1. Contact the upstream for the repository and get them to fix the problem.

     2. Reconfigure the baseurl/etc. for the repository, to point to a working
        upstream. This is most often useful if you are using a newer
        distribution release than is supported by the repository (and the
        packages for the previous distribution release still work).

     3. Run the command with the repository temporarily disabled
            yum --disablerepo= ...

     4. Disable the repository permanently, so yum won't use it by default. Yum
        will then just ignore the repository until you permanently enable it
        again or use --enablerepo for temporary usage:

            yum-config-manager --disable 
        or
            subscription-manager repos --disable=

     5. Configure the failing repository to be skipped, if it is unavailable.
        Note that yum will try to contact the repo. when it runs most commands,
        so will have to try and fail each time (and thus. yum will be be much
        slower). If it is a very temporary problem though, this is often a nice
        compromise:

            yum-config-manager --save --setopt=.skip_if_unavailable=true

Cannot find a valid baseurl for repo: base/7/x86_64

Then test with ping:

[root@seckillmysql ~]# ping 114.114.114.114
PING 114.114.114.114 (114.114.114.114) 56(84) bytes of data.
64 bytes from 114.114.114.114: icmp_seq=1 ttl=128 time=36.6 ms
64 bytes from 114.114.114.114: icmp_seq=2 ttl=128 time=36.9 ms
[root@seckillmysql ~]# ping www.baidu.com
ping: www.baidu.com: Name or service not known

Baidu cannot be resolved, so edit the DNS configuration:

[root@seckillmysql ~]# vi /etc/resolv.conf
# add these two lines
nameserver 223.5.5.5
nameserver 223.6.6.6

After that, pinging Baidu works and yum downloads succeed.

2. Install Hadoop

1. Install the JDK

1. Upload the jdk-8u152-linux-x64.tar.gz package to /opt

2. Extract it

tar -zxvf jdk-8u152-linux-x64.tar.gz

3. Rename the directory

mv jdk1.8.0_152 jdk

4. Edit the environment variables

vim /etc/profile

5. Add JAVA_HOME to the PATH

# jdk
export JAVA_HOME=/opt/jdk
export PATH=$PATH:$JAVA_HOME/bin

6. Reload the profile

source /etc/profile
2. Passwordless SSH login

1. Generate a key pair
ssh-keygen

Just keep pressing Enter at every prompt.

Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:O5cniA6gS5ovHr7MtIC65ED1pcJoEw463taWcUJF4m4 root@localhost.localdomain
The key's randomart image is:
+---[RSA 2048]----+
|     ..o         |
|    . o          |
|.. . o .         |
|+ = + o          |
|o*.o E .S        |
|=oo.+ =. o .     |
|=* o.+. + + .    |
|#o+ .o   o o     |
|B%o   .          |
+----[SHA256]-----+

2. Copy the public key
[root@localhost sbin]# cd /root/.ssh/
[root@localhost .ssh]# ssh-copy-id root@localhost
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@localhost's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'root@localhost'"
and check to make sure that only the key(s) you wanted were added.
3. Hadoop

1. Extract the archive

1. Upload the hadoop-2.7.2.tar.gz archive to /opt

2. Extract it

tar -zxvf hadoop-2.7.2.tar.gz
2. Configure the environment

1. Edit the environment variables

vim /etc/profile

2. Add HADOOP_HOME to the PATH

# hadoop
export HADOOP_HOME=/opt/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

3. Reload the environment variables

source /etc/profile

4. Configure Hadoop

Edit hadoop-env.sh:

vim /opt/hadoop-2.7.2/etc/hadoop/hadoop-env.sh
echo $JAVA_HOME
/opt/jdk

Set the JAVA_HOME path:

export JAVA_HOME=/opt/jdk

Edit core-site.xml:

vim /opt/hadoop-2.7.2/etc/hadoop/core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop-2.7.2/data/tmp</value>
    </property>
</configuration>

5. Configure HDFS

Edit hdfs-site.xml:

vim /opt/hadoop-2.7.2/etc/hadoop/hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>192.168.75.120:50090</value>
    </property>
</configuration>

192.168.75.120 is your virtual machine's IP address.

6. Start and test

Format the NameNode:

[root@localhost hadoop-2.7.2] bin/hdfs namenode -format

Start the NameNode:

[root@localhost hadoop-2.7.2] sbin/hadoop-daemon.sh start namenode

Start the DataNode:

[root@localhost hadoop-2.7.2] sbin/hadoop-daemon.sh start datanode

Open http://192.168.75.120:50070 in a browser.

To stop them:

[root@localhost hadoop-2.7.2] sbin/hadoop-daemon.sh stop namenode
[root@localhost hadoop-2.7.2] sbin/hadoop-daemon.sh stop datanode

If you only want to practice Spark, the following steps can be skipped.

7. Configure YARN

Edit yarn-env.sh:

vim /opt/hadoop-2.7.2/etc/hadoop/yarn-env.sh

Set JAVA_HOME:

export JAVA_HOME=/opt/jdk

Edit yarn-site.xml:

vim /opt/hadoop-2.7.2/etc/hadoop/yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>192.168.75.120</value>
    </property>
</configuration>

8. Configure MapReduce

Edit mapred-env.sh:

vim /opt/hadoop-2.7.2/etc/hadoop/mapred-env.sh

Set JAVA_HOME:

export JAVA_HOME=/opt/jdk

Edit mapred-site.xml (if only mapred-site.xml.template exists, copy it to mapred-site.xml first):

vim /opt/hadoop-2.7.2/etc/hadoop/mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

9. Start

[root@localhost hadoop-2.7.2] sbin/start-dfs.sh
3. Install Spark

1. Extract the archive

1. Upload spark-3.0.1-bin-hadoop2.7.tgz to /opt

2. Extract it

tar -zxvf spark-3.0.1-bin-hadoop2.7.tgz

3. Rename the directory

mv spark-3.0.1-bin-hadoop2.7 spark
2. Configure the environment

1. Configure the system environment variables
vim /etc/profile

Add the following:

# spark
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:$SPARK_HOME/sbin

Reload the configuration:

source /etc/profile
2. Edit the Spark configuration

1. In the conf directory under the extracted path, rename slaves.template to slaves:

mv slaves.template slaves

2. Rename spark-env.sh.template to spark-env.sh:

mv spark-env.sh.template spark-env.sh

3. Edit spark-env.sh and add the JAVA_HOME environment variable and the cluster's master node:

export JAVA_HOME=/opt/jdk
SPARK_MASTER_HOST=192.168.75.120
SPARK_MASTER_PORT=7077

Note: port 7077 plays a role similar to port 8020, Hadoop 3's internal communication port; make sure the port here matches your own Hadoop configuration.

3. Test Spark

Start the Master and Worker:

sbin/start-all.sh

Check that it works:

bin/spark-shell
21/12/13 04:15:13 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 192.168.75.120 instead (on interface ens33)
21/12/13 04:15:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/12/13 04:15:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.75.120:4040
Spark context available as 'sc' (master = local[*], app id = local-1639386925730).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/
         
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 
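
As a quick sanity check, a tiny word count can be pasted at the scala> prompt above. This is only a sketch; the two sample lines are made up and simply confirm that the shell and sc are working:

val sample = sc.parallelize(Seq("hello spark", "hello hadoop"))
sample.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect().foreach(println)
// expected output, in some order: (hello,2), (spark,1), (hadoop,1)
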
4. Run a Spark program

1. Create the project

1. Create a Maven project with JDK 1.8 and Scala 2.12.13, using the following pom.xml:



<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.manster</groupId>
    <artifactId>spark_demo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <spark.version>3.0.1</spark.version>
        <scala.version>2.12</scala.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>com.huaban</groupId>
            <artifactId>jieba-analysis</artifactId>
            <version>1.0.2</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.1</version>
                <configuration>
                    <createDependencyReducedPom>false</createDependencyReducedPom>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <!-- assumed: the original listing is truncated here;
                                             excluding META-INF signature files is the usual shade filter -->
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

2. Write the WordCount code:

package com.manster.spark.demo

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCountDemo {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setMaster("local").setAppName("WordCountDemo")
    val sparkContext = new SparkContext(sparkConf)
    // read the input file
    val file: RDD[String] = sparkContext.textFile("word.txt")
    val split: RDD[String] = file.flatMap(_.split(" "))
    val map: RDD[(String, Int)] = split.map(word => (word, 1))
    val reduce: RDD[(String, Int)] = map.reduceByKey(_ + _)
    val res: Array[(String, Int)] = reduce.collect()
    res.foreach(println)
    sparkContext.stop()
  }
}
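
For reference, word.txt is not shown in this article; any small space-separated text file will do. A made-up sample whose word frequencies match the output in the next step would be:

hello spark scala
hello spark scala
hadoop flume hbase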

3. Run results

(scala,2)
(spark,2)
(hadoop,1)
(flume,1)
(hello,2)
(hbase,1)
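
As an aside, the pom above already pulls in spark-sql, so the same counts could also be produced with the DataFrame API. The sketch below assumes that dependency; the object name WordCountSql is made up for illustration and reads the same local word.txt:

import org.apache.spark.sql.SparkSession

object WordCountSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("WordCountSql").getOrCreate()
    import spark.implicits._
    // read word.txt as a Dataset[String], split into words, group by word, and count
    spark.read.textFile("word.txt")
      .flatMap(_.split(" "))
      .groupBy("value")   // the single column of a Dataset[String] is named "value"
      .count()
      .show()
    spark.stop()
  }
}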

A lot of log output is printed by default; we can create a log4j.properties file under resources to configure it:

log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Set the default spark-shell log level to ERROR. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=ERROR
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=ERROR
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
3. Package the jar

For better testing, we add a command-line argument to the code before packaging:

package com.manster.spark.demo

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}


object WordCountDemo {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setMaster("local").setAppName("WordCountDemo")
    val sparkContext = new SparkContext(sparkConf)
    // read the input file passed as the first argument
    val file: RDD[String] = sparkContext.textFile(args(0))
    val split: RDD[String] = file.flatMap(_.split(" "))
    val map: RDD[(String, Int)] = split.map(word => (word, 1))
    val reduce: RDD[(String, Int)] = map.reduceByKey(_ + _)
    val res: Array[(String, Int)] = reduce.collect()
    res.foreach(println)
    sparkContext.stop()
  }
}
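
One caveat: properties set directly on SparkConf generally take precedence over flags passed to spark-submit, so the hard-coded setMaster("local") above may keep the job in local mode even when it is submitted to the standalone master. A minimal sketch of a cluster-friendly variant (same logic, with the master left to spark-submit) looks like this:

package com.manster.spark.demo

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCountDemo {
  def main(args: Array[String]): Unit = {
    // no setMaster here: the master URL comes from the spark-submit command line
    val sparkConf: SparkConf = new SparkConf().setAppName("WordCountDemo")
    val sparkContext = new SparkContext(sparkConf)
    val counts: RDD[(String, Int)] = sparkContext
      .textFile(args(0))              // input path passed as the first program argument
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)
    sparkContext.stop()
  }
}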

In the Maven plugins panel, double-click package; spark_demo-1.0-SNAPSHOT.jar is generated under the target directory.

4. Upload

1. First use Xftp or FileZilla to upload the word.txt file to the /opt/hadoop/ directory.

2. Upload this file to HDFS (start Hadoop's start-dfs.sh first).

Create a directory:

hdfs dfs -mkdir /wordcount

Upload the file to HDFS:

hdfs dfs -put word.txt /wordcount

3. Upload spark_demo-1.0-SNAPSHOT.jar to the spark directory.

4. Run a WordCount (start Spark's start-all.sh first):

bin/spark-submit \
--class com.manster.spark.demo.WordCountDemo \
--master spark://192.168.75.120:7077 \
./spark_demo-1.0-SNAPSHOT.jar \
hdfs://192.168.75.120:9000/wordcount
  1. --class is the main class of the application to run (here the WordCountDemo object)
  2. --master spark://192.168.75.120:7077 means standalone deploy mode, connecting to the Spark cluster
  3. spark_demo-1.0-SNAPSHOT.jar is the jar that contains the class to run
  4. hdfs://192.168.75.120:9000/wordcount is the program's input argument, i.e. the file(s) to run the word count on

While the job runs, several Java processes are spawned.

By default, the job uses the total number of cores across the cluster nodes, with 1024 MB of memory per node.
