When learning, we mostly work in a pseudo-distributed environment. Below is the installation and configuration of pseudo-distributed Hadoop + Spark.
- centos:7.4
- jdk:1.8
- hadoop:2.7.2
- scala:2.12.13
- spark:3.0.1
Download CentOS 7 and install it as a virtual machine.
1. Configure a static IP

```
vi /etc/sysconfig/network-scripts/ifcfg-ens33
```

```
TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
# change to static
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ens33
UUID=aec3fd78-3c06-4a77-8773-2667fe034ef4
DEVICE=ens33
# change to yes
ONBOOT=yes
# add the IP, gateway, and DNS
IPADDR=192.168.75.120
GATEWAY=192.168.75.2
DNS1=192.168.75.2
```
Restart the network:

```
service network restart
```
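Before restarting the network, the edited file can be sanity-checked with `grep`. This is a sketch, not part of the original tutorial: a temporary sample stands in for `/etc/sysconfig/network-scripts/ifcfg-ens33`, and the values are the ones used above.

```shell
# Hypothetical sanity check: verify the key fields of the edited ifcfg file.
# A sample copy stands in for /etc/sysconfig/network-scripts/ifcfg-ens33.
CFG=$(mktemp)
cat > "$CFG" <<'EOF'
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.75.120
GATEWAY=192.168.75.2
DNS1=192.168.75.2
EOF

# Every required key must appear exactly as expected.
for kv in BOOTPROTO=static ONBOOT=yes IPADDR=192.168.75.120; do
    grep -qx "$kv" "$CFG" || { echo "missing: $kv"; exit 1; }
done
echo "static IP settings look OK"
```

On the real VM you would point `CFG` at the actual ifcfg file instead of the sample.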
2. Turn off the firewall

```
# check the firewall status
systemctl status firewalld.service     # active (running)
# stop the firewall
systemctl stop firewalld.service       # inactive (dead)
# check the status again
systemctl status firewalld.service
# disable the firewall permanently
systemctl disable firewalld.service
```

3. Test that yum works
[root@localhost ~]# yum install vim
The download fails:
```
Loaded plugins: fastestmirror
Determining fastest mirrors
Could not retrieve mirrorlist http://mirrorlist.centos.org/?release=7&arch=x86_64&repo=os&infra=stock
error was 14: curl#6 - "Could not resolve host: mirrorlist.centos.org; Unknown error"

 One of the configured repositories failed (unknown),
 and yum doesn't have enough cached data to continue. At this point the only
 safe thing yum can do is fail. There are a few ways to work "fix" this:

     1. Contact the upstream for the repository and get them to fix the problem.
     2. Reconfigure the baseurl/etc. for the repository, to point to a working
        upstream. This is most often useful if you are using a newer
        distribution release than is supported by the repository (and the
        packages for the previous distribution release still work).
     3. Run the command with the repository temporarily disabled
            yum --disablerepo=...
     4. Disable the repository permanently, so yum won't use it by default. Yum
        will then just ignore the repository until you permanently enable it
        again or use --enablerepo for temporary usage:
            yum-config-manager --disable or subscription-manager repos --disable=
     5. Configure the failing repository to be skipped, if it is unavailable.
        Note that yum will try to contact the repo. when it runs most commands,
        so will have to try and fail each time (and thus yum will be much
        slower). If it is a very temporary problem though, this is often a nice
        compromise:
            yum-config-manager --save --setopt= .skip_if_unavailable=true

Cannot find a valid baseurl for repo: base/7/x86_64
```
Then test with ping:

```
[root@seckillmysql ~]# ping 114.114.114.114
PING 114.114.114.114 (114.114.114.114) 56(84) bytes of data.
64 bytes from 114.114.114.114: icmp_seq=1 ttl=128 time=36.6 ms
64 bytes from 114.114.114.114: icmp_seq=2 ttl=128 time=36.9 ms
[root@seckillmysql ~]# ping www.baidu.com
ping: www.baidu.com: Name or service not known
```
The raw IP is reachable but hostnames are not, so DNS resolution is the problem. Add nameservers:

```
[root@seckillmysql ~]# vi /etc/resolv.conf
# add these two lines
nameserver 223.5.5.5
nameserver 223.6.6.6
```
After that, pinging Baidu works and yum downloads succeed.
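The fix can be verified mechanically as well. The sketch below is an illustration only: it checks a sample copy of `resolv.conf` (on the VM you would point `RESOLV` at `/etc/resolv.conf`) and counts the `nameserver` entries.

```shell
# Quick check that resolv.conf now lists nameservers.
# A sample copy is used here; point RESOLV at /etc/resolv.conf on the VM.
RESOLV=$(mktemp)
printf 'nameserver 223.5.5.5\nnameserver 223.6.6.6\n' > "$RESOLV"

# Count lines that start with "nameserver ".
NS_COUNT=$(grep -c '^nameserver ' "$RESOLV")
echo "nameserver entries: $NS_COUNT"
```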
2. Install Hadoop

1. Install the JDK

1. Upload the jdk-8u152-linux-x64.tar.gz package to /opt/
2. Extract it:

```
tar -zxvf jdk-8u152-linux-x64.tar.gz
```
3. Rename it:

```
mv jdk1.8.0_152 jdk
```
4. Edit the environment variables:

```
vim /etc/profile
```
5. Add JAVA_HOME to the path:

```
# jdk
export JAVA_HOME=/opt/jdk
export PATH=$PATH:$JAVA_HOME/bin
```
6. Reload the profile:

```
source /etc/profile
```

2. Passwordless SSH login

1. Generate a key pair:
```
ssh-keygen
```

Just press Enter at every prompt.
```
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:O5cniA6gS5ovHr7MtIC65ED1pcJoEw463taWcUJF4m4 [email protected]
The key's randomart image is:
+---[RSA 2048]----+
| ..o             |
| . o             |
|.. . o .         |
|+ = + o          |
|o*.o E .S        |
|=oo.+ =. o .     |
|=* o.+. + + .    |
|#o+ .o o o       |
|B%o .            |
+----[SHA256]-----+
```

2. Copy the public key:
```
[root@localhost sbin]# cd /root/.ssh/
[root@localhost .ssh]# ssh-copy-id root@localhost
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@localhost's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'root@localhost'"
and check to make sure that only the key(s) you wanted were added.
```

3. Hadoop

1. Extract the files
1. Upload the hadoop-2.7.2.tar.gz archive to /opt/
2. Extract it:

```
tar -zxvf hadoop-2.7.2.tar.gz
```

2. Configure the environment
1. Edit the environment variables:

```
vim /etc/profile
```
2. Add HADOOP_HOME to the path:

```
# hadoop
export HADOOP_HOME=/opt/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
```
3. Reload the environment variables:

```
source /etc/profile
```
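After sourcing the profile, it is easy to confirm the exports took effect. This sketch uses a temporary file standing in for `/etc/profile` (the paths are the ones used above) so the check itself is self-contained:

```shell
# Sketch: confirm the JAVA_HOME/HADOOP_HOME exports land on PATH after sourcing.
# A temporary file stands in for /etc/profile.
PROFILE=$(mktemp)
cat > "$PROFILE" <<'EOF'
export JAVA_HOME=/opt/jdk
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/opt/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
EOF
. "$PROFILE"

# Both the bin and sbin directories should now be on PATH.
case ":$PATH:" in
    *":$HADOOP_HOME/bin:"*) echo "hadoop bin is on PATH" ;;
    *) echo "hadoop bin missing from PATH"; exit 1 ;;
esac
```

On the VM itself, `which hdfs` after `source /etc/profile` gives the same assurance.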
4. Configure Hadoop

Edit hadoop-env.sh:

```
vim /opt/hadoop-2.7.2/etc/hadoop/hadoop-env.sh
```
Check where the JDK lives:

```
echo $JAVA_HOME
/opt/jdk
```
Set the JAVA_HOME path:

```
export JAVA_HOME=/opt/jdk
```
Edit core-site.xml:

```
vim /opt/hadoop-2.7.2/etc/hadoop/core-site.xml
```
```xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop-2.7.2/data/tmp</value>
    </property>
</configuration>
```
5. Configure HDFS

Edit hdfs-site.xml:

```
vim /opt/hadoop-2.7.2/etc/hadoop/hdfs-site.xml
```
```xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>192.168.75.120:50090</value>
    </property>
</configuration>
```
Here 192.168.75.120 is this VM's address; substitute your own.
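To double-check what replication factor the file actually sets, the value can be pulled back out with `sed`. This is an illustrative sketch; it runs against a sample copy of hdfs-site.xml built from the same properties as above.

```shell
# Extract dfs.replication from a sample hdfs-site.xml as a sanity check.
HDFS_SITE=$(mktemp)
cat > "$HDFS_SITE" <<'EOF'
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>192.168.75.120:50090</value>
    </property>
</configuration>
EOF

# Print the <value> on the line after the matching <name>.
REPL=$(sed -n '/<name>dfs.replication<\/name>/{n;s/.*<value>\(.*\)<\/value>.*/\1/p;}' "$HDFS_SITE")
echo "dfs.replication = $REPL"
```

A replication factor of 1 is what makes this a pseudo-distributed (single-node) setup.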
6. Start and test

Format the NameNode:

```
[root@localhost hadoop-2.7.2]# bin/hdfs namenode -format
```
Start the NameNode:

```
[root@localhost hadoop-2.7.2]# sbin/hadoop-daemon.sh start namenode
```
Start the DataNode:

```
[root@localhost hadoop-2.7.2]# sbin/hadoop-daemon.sh start datanode
```
Open http://192.168.75.120:50070 in a browser.
Shut down:

```
[root@localhost hadoop-2.7.2]# sbin/hadoop-daemon.sh stop namenode
[root@localhost hadoop-2.7.2]# sbin/hadoop-daemon.sh stop datanode
```
If you only want to practice Spark, the following steps can be skipped.
7. Configure YARN

Edit yarn-env.sh:

```
vim /opt/hadoop-2.7.2/etc/hadoop/yarn-env.sh
```
Set JAVA_HOME:

```
export JAVA_HOME=/opt/jdk
```
Edit yarn-site.xml:

```
vim /opt/hadoop-2.7.2/etc/hadoop/yarn-site.xml
```
```xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>192.168.75.120</value>
    </property>
</configuration>
```
8. Configure MapReduce

Edit mapred-env.sh:

```
vim /opt/hadoop-2.7.2/etc/hadoop/mapred-env.sh
```
Set JAVA_HOME:

```
export JAVA_HOME=/opt/jdk
```
Edit mapred-site.xml (Hadoop 2.7.2 ships only mapred-site.xml.template; copy it to mapred-site.xml first):

```
vim /opt/hadoop-2.7.2/etc/hadoop/mapred-site.xml
```
```xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
```
9. Start

```
[root@localhost hadoop-2.7.2]# sbin/start-dfs.sh
```

3. Install Spark

1. Extract the files
1. Upload spark-3.0.1-bin-hadoop2.7.tgz to /opt
2. Extract it:

```
tar -zxvf spark-3.0.1-bin-hadoop2.7.tgz
```
3. Rename it:

```
mv spark-3.0.1-bin-hadoop2.7 spark
```

2. Configure the environment

1. Configure system variables:
```
vim /etc/profile
```

Add the following:

```
# spark
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:$SPARK_HOME/sbin
```
Reload the configuration:

```
source /etc/profile
```

2. Modify the Spark configuration

1. In the conf directory of the extracted path, rename slaves.template to slaves:
```
mv slaves.template slaves
```
2. Rename spark-env.sh.template to spark-env.sh:

```
mv spark-env.sh.template spark-env.sh
```
3. Edit spark-env.sh, adding JAVA_HOME and the cluster's master node:

```
export JAVA_HOME=/opt/jdk
SPARK_MASTER_HOST=192.168.75.120
SPARK_MASTER_PORT=7077
```
**Note: port 7077 here plays roughly the role that port 8020 plays for internal communication in Hadoop 3; confirm the port against your own Hadoop configuration.**
3. Test Spark

Start the Master and Worker:

```
sbin/start-all.sh
```
Check that it works:

```
bin/spark-shell
```
```
21/12/13 04:15:13 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 192.168.75.120 instead (on interface ens33)
21/12/13 04:15:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/12/13 04:15:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.75.120:4040
Spark context available as 'sc' (master = local[*], app id = local-1639386925730).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```

4. Run a Spark program

1. Create the project
1. Create a Maven project with JDK 1.8 and Scala 2.12.13.
pom.xml:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.manster</groupId>
    <artifactId>spark_demo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <spark.version>3.0.1</spark.version>
        <scala.version>2.12</scala.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>com.huaban</groupId>
            <artifactId>jieba-analysis</artifactId>
            <version>1.0.2</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.1</version>
                <configuration>
                    <createDependencyReducedPom>false</createDependencyReducedPom>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>meta-INF</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
```

2. Write the code

```scala
package com.manster.spark.demo

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCountDemo {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setMaster("local").setAppName("WordCountDemo")
    val sparkContext = new SparkContext(sparkConf)
    // read the file
    val file: RDD[String] = sparkContext.textFile("word.txt")
    val split: RDD[String] = file.flatMap(_.split(" "))
    val map: RDD[(String, Int)] = split.map(word => (word, 1))
    val reduce: RDD[(String, Int)] = map.reduceByKey(_ + _)
    val res: Array[(String, Int)] = reduce.collect()
    res.foreach(println)
    sparkContext.stop()
  }
}
```
3. The run result:

```
(scala,2)
(spark,2)
(hadoop,1)
(flume,1)
(hello,2)
(hbase,1)
```
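The same counts can be reproduced with plain shell tools, which is a handy cross-check of the word-count logic. The contents of word.txt below are an assumption made to match the counts shown above:

```shell
# Cross-check: flatMap(split) -> map((w,1)) -> reduceByKey(_+_)
# becomes: split into lines, sort, count duplicates.
cd "$(mktemp -d)"
echo "hello spark hello scala spark scala hadoop flume hbase" > word.txt

tr ' ' '\n' < word.txt | sort | uniq -c | awk '{printf "(%s,%d)\n", $2, $1}'
```

The shell pipeline is sequential, but the grouping-then-summing structure is exactly what `reduceByKey` distributes across the cluster.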
3. Packaging

By default a lot of log output is printed; we can create a log4j.properties file under resources to configure it:
```properties
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to ERROR. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=ERROR

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=ERROR
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
```
To make the test easier to vary, we add a command-line argument to the code before packaging:
```scala
package com.manster.spark.demo

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCountDemo {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setMaster("local").setAppName("WordCountDemo")
    val sparkContext = new SparkContext(sparkConf)
    // read the file given as the first argument
    val file: RDD[String] = sparkContext.textFile(args(0))
    val split: RDD[String] = file.flatMap(_.split(" "))
    val map: RDD[(String, Int)] = split.map(word => (word, 1))
    val reduce: RDD[(String, Int)] = map.reduceByKey(_ + _)
    val res: Array[(String, Int)] = reduce.collect()
    res.foreach(println)
    sparkContext.stop()
  }
}
```

Note that the hard-coded `setMaster("local")` takes precedence over any `--master` flag passed to spark-submit; remove it if you want the job to actually run on the standalone cluster.
Double-click package in Maven's plugin panel; spark_demo-1.0-SNAPSHOT.jar is generated under the target directory.
4. Upload

1. First upload the word.txt file to the /opt/hadoop/ folder using Xftp or FileZilla.
2. Put the file into HDFS (start Hadoop's start-dfs.sh first).

Create a directory:

```
hdfs dfs -mkdir /wordcount
```
Upload the file to HDFS:

```
hdfs dfs -put word.txt /wordcount
```
3. Upload spark_demo-1.0-SNAPSHOT.jar to the spark folder.
4. Run a wordcount (start Spark's start-all.sh first):

```
bin/spark-submit --class com.manster.spark.demo.WordCountDemo --master spark://192.168.75.120:7077 ./spark_demo-1.0-SNAPSHOT.jar hdfs://192.168.75.120:9000/wordcount
```
- --class: the main class of the program to execute
- --master spark://192.168.75.120:7077: standalone deploy mode, connecting to the Spark cluster
- spark_demo-1.0-SNAPSHOT.jar: the jar containing the class to run
- hdfs://192.168.75.120:9000/wordcount: the program's input argument, the file(s) to count words in
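The flags above can be assembled in a small wrapper script, which makes each piece explicit and easy to reuse. This is a sketch: the master URL, jar name, and input path are the values assumed above, and the script only builds and prints the command rather than running it.

```shell
# Assemble the spark-submit invocation from named pieces (values assumed
# from the tutorial above); echo it for review instead of executing it.
MAIN_CLASS=com.manster.spark.demo.WordCountDemo
MASTER_URL=spark://192.168.75.120:7077
APP_JAR=./spark_demo-1.0-SNAPSHOT.jar
INPUT=hdfs://192.168.75.120:9000/wordcount

CMD="bin/spark-submit --class $MAIN_CLASS --master $MASTER_URL $APP_JAR $INPUT"
echo "$CMD"
```

On the VM you would run `$CMD` from the spark directory once the printed command looks right.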
While the task executes, several Java processes are spawned. By default the task uses the total number of cores across the cluster's nodes, with 1024 MB of memory per node.