- NameNode's role: manages metadata through two files. The Fsimage file is a snapshot that holds almost all of the metadata but is not updated in real time. The Edits file is a log that holds recent metadata changes; because its format must be replayed entry by entry, reading it is slower.
- SecondaryNameNode assists with metadata management: at intervals it copies the fsimage and edits files to its own host and merges them into a new fsimage.ckpt file, which replaces the old fsimage; meanwhile the NameNode starts a fresh edits.new file, which becomes the new edits. Trigger conditions: every hour, or when the edits file exceeds 64 MB.
- When merging edits and fsimage, the SecondaryNameNode needs roughly as much memory as the NameNode, so the NameNode and SecondaryNameNode are generally not placed on the same machine.
- NameNode metadata recovery: the metadata can be restored from the SecondaryNameNode's checkpoint copy.
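A minimal recovery sketch, assuming the checkpoint and name directories shown below (these paths are illustrative; the real ones are whatever fs.checkpoint.dir and dfs.namenode.name.dir point to in your configuration):

```shell
# Stop the NameNode before touching its metadata directory
sbin/hadoop-daemon.sh stop namenode

# Copy the SecondaryNameNode's latest checkpoint back into the
# NameNode's metadata directory (hostnames and paths are examples)
scp -r node2:/opt/server/hadoop-2.7.5/data/dfs/snn/name/* \
       /opt/server/hadoop-2.7.5/data/dfs/nn/name/

# Restart the NameNode; it loads the restored fsimage on startup
sbin/hadoop-daemon.sh start namenode
```

Any edits written after the last checkpoint are lost with this method, which is why the checkpoint interval matters.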
- Configuring Hadoop on Windows
1.1 Step 1: Unpack the pre-built Windows version of Hadoop into a path that contains no Chinese characters and no spaces.
1.2 Step 2: Configure the Hadoop environment variables on Windows: set HADOOP_HOME, and add %HADOOP_HOME%\bin to Path.
1.3 Step 3: Copy the hadoop.dll file from the bin directory of the hadoop-2.7.5 folder into the system directory C:\Windows\System32.
1.4 Step 4: Restart Windows.
- Import the Maven dependencies
```xml
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
</dependencies>
```
- Accessing data through the FileSystem API
3.1 Main classes involved: Configuration encapsulates the client or server configuration; FileSystem is a file-system object whose methods operate on files, obtained via the static get() method.
```java
package com.hlzq.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.junit.Test;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class TestDemo1 {
    @Test
    public void meth01GetFileSystem() throws IOException {
        // 1. Create the Configuration object
        Configuration configuration = new Configuration();
        // 2. Specify which file system to create
        configuration.set("fs.defaultFS", "hdfs://node1:8020");
        // 3. Get the specified file system
        FileSystem fileSystem = FileSystem.get(configuration);
        System.out.println(fileSystem);
    }

    @Test
    public void meth02GetFileSystem() throws IOException, URISyntaxException {
        // Steps 1-3 in one call: pass the URI and Configuration directly
        FileSystem fileSystem = FileSystem.get(
                new URI("hdfs://node1:8020"), new Configuration());
        System.out.println(fileSystem);
    }
}
```
Traversing files on HDFS:

```java
@Test
public void bianLi() throws URISyntaxException, IOException {
    // Get the FileSystem object
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    // Recursively list all files under the given directory
    RemoteIterator<LocatedFileStatus> locatedFileStatusRemoteIterator =
            fileSystem.listFiles(new Path("/"), true);
    // Iterate and print each file's details
    while (locatedFileStatusRemoteIterator.hasNext()) {
        LocatedFileStatus next = locatedFileStatusRemoteIterator.next();
        Path path = next.getPath();
        System.out.println(path);
        // Block information for each file
        BlockLocation[] blockLocations = next.getBlockLocations();
        System.out.println(blockLocations.length); // how many blocks the file was split into
        for (BlockLocation blockLocation : blockLocations) {
            for (String host : blockLocation.getHosts()) {
                System.out.println(host);
            }
            System.out.println("#########################################################");
        }
    }
    // Close the FileSystem object
    fileSystem.close();
}
```
Creating a directory:

```java
@Test
public void mkdir() throws URISyntaxException, IOException {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    // exists() checks whether the directory is already there
    boolean exists = fileSystem.exists(new Path("/xx/yy/zz"));
    if (!exists) {
        System.out.println("Does not exist; creating");
        fileSystem.mkdirs(new Path("/xx/yy/zz"));
    } else {
        System.out.println("Already exists; not creating");
    }
    fileSystem.close();
}
```
Download:

```java
// Download, method 1: manual stream copy
@Test
public void dowlo() throws URISyntaxException, IOException {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    // Input stream for the source file on HDFS
    FSDataInputStream inputStream = fileSystem.open(new Path("/anaconda-ks.cfg"));
    // Output stream for the local target file
    FileOutputStream outputStream = new FileOutputStream("F:\\dnxx");
    // Copy the file with the commons-io IOUtils helper
    IOUtils.copy(inputStream, outputStream);
    // Close the streams
    outputStream.close();
    inputStream.close();
    fileSystem.close();
}

// Download, method 2: copyToLocalFile
@Test
public void dowlo2() throws URISyntaxException, IOException {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    fileSystem.copyToLocalFile(new Path("/anaconda-ks.cfg"), new Path("F:\\dnxx"));
    fileSystem.close();
}
```
Upload:

```java
// File upload
@Test
public void Sha() throws URISyntaxException, IOException {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    fileSystem.copyFromLocalFile(new Path("F:\\dnxx"), new Path("/xx/yy/zz"));
    fileSystem.close();
}
```
Merging small files (append local small files into one HDFS file):

```java
@Test
public void Shaheb() throws URISyntaxException, IOException {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    // Create the target file on HDFS
    FSDataOutputStream outputStream = fileSystem.create(new Path("/a.txt"));
    // Get the local file system
    LocalFileSystem local = FileSystem.getLocal(new Configuration());
    // List the local files to merge
    FileStatus[] fileStatuses = local.listStatus(new Path("file:///G:\\dd"));
    for (FileStatus fileStatus : fileStatuses) {
        FSDataInputStream open = local.open(fileStatus.getPath());
        IOUtils.copy(open, outputStream);
        IOUtils.closeQuietly(open);
    }
    IOUtils.closeQuietly(outputStream);
    local.close();
    fileSystem.close();
}
```

III. HDFS access permission control
- Permission master switch: edit hdfs-site.xml (vim hdfs-site.xml), set the dfs.permissions property to true to enable permission checking, then distribute the file to every node and restart HDFS.
- Impersonating a user:
1. Copying data between clusters:
1.1 Intra-cluster file copy with scp: scp (-r for a directory) filename hostname:$PWD
1.2 Remote copy to local: scp root@hostname:remote_file_path local_destination
scp -r root@node2:zookeeper.out dir33/
1.3 Data copy across clusters: hadoop distcp hdfs://node1:8020/jdk-8u hdfs://cluster2:8020/
1.4 Using Archive files: HDFS is poor at storing small files. Every file occupies at least one block, and every block consumes NameNode memory, so a large number of small files eats up a large amount of NameNode memory. Archive packs many files into a single archive file, and after archiving each file inside can still be accessed transparently.
1.4.1 Creating an Archive
Example: to archive all the files under the directory /config:
hadoop archive -archiveName &lt;name&gt; -p /config (the directory to pack) /&lt;destination directory&gt;
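Concretely, the subsections below assume an archive created like this (the name test.har and the destination /output are this example's choices):

```shell
# Pack everything under /config into test.har, stored under /output
hadoop archive -archiveName test.har -p /config /output
```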
1.4.2 Viewing the raw Archive data: hadoop fs -cat /output/test.har/part-0
1.4.3 Listing the small files inside: hadoop fs -ls har://hdfs-node1:8020/output/test.har
1.4.4 Accessing an individual small file: hadoop fs -cat har://hdfs-node1:8020/output/test.har/core-site.xml
1.4.5 Archives do not compress their contents, and once created they are immutable.
1.4.6 Unpacking an Archive: hadoop fs -cp har:///output/test.har/* /config2
- Snapshots: data backup and disaster recovery from mis-operations
- Enable snapshots on a directory: hdfs dfsadmin -allowSnapshot <path>
- Disable snapshots on a path: hdfs dfsadmin -disallowSnapshot <path>
- Create a snapshot of a path: hdfs dfs -createSnapshot <path>
- Create a snapshot with a given name: hdfs dfs -createSnapshot <path> <name>
- Rename a snapshot: hdfs dfs -renameSnapshot <path> <old name> <new name>
- List all snapshottable directories for the current user: hdfs lsSnapshottableDir
- Restore from a snapshot: hdfs dfs -cp -ptopax <snapshot path> <restore path>
- Delete a snapshot: hdfs dfs -deleteSnapshot <snapshot directory> <snapshot name>
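Putting these commands together, a typical workflow might look like this (the directory /user/data, the snapshot name s1, and the file important.txt are illustrative):

```shell
# Allow snapshots on the directory, then take one named s1
hdfs dfsadmin -allowSnapshot /user/data
hdfs dfs -createSnapshot /user/data s1

# Snapshots live under the hidden .snapshot subdirectory
hdfs dfs -ls /user/data/.snapshot/s1

# After an accidental delete, copy the file back; -ptopax preserves
# timestamps, ownership, permissions, ACLs and XAttrs
hdfs dfs -cp -ptopax /user/data/.snapshot/s1/important.txt /user/data/

# Remove the snapshot once it is no longer needed
hdfs dfs -deleteSnapshot /user/data s1
```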
- Deleted files land in the /user/<username>/.Trash directory
- Trash retention is controlled by a parameter (fs.trash.interval, in minutes)
- Forced delete (skip the trash): hadoop fs -rm -skipTrash /dir1/a.txt
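A file deleted without -skipTrash can simply be moved back out of the trash (the user name root and the paths here are illustrative):

```shell
# A normal delete moves the file into the current trash checkpoint
hadoop fs -rm /dir1/a.txt

# Restore it by moving it back out of the trash
hadoop fs -mv /user/root/.Trash/Current/dir1/a.txt /dir1/
```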
2. HDFS high-availability cluster
2.1 Edit core-site.xml (one property per line below: name, then value; each pair goes into a &lt;property&gt;&lt;name&gt;…&lt;/name&gt;&lt;value&gt;…&lt;/value&gt; element)
```
ha.zookeeper.quorum   node1:2181,node2:2181,node3:2181
fs.defaultFS          hdfs://ns
hadoop.tmp.dir        /opt/server/hadoop-2.7.5/data/tmp
fs.trash.interval     10080
```
2.2 Edit hdfs-site.xml
```
dfs.nameservices                          ns
dfs.ha.namenodes.ns                       nn1,nn2
dfs.namenode.rpc-address.ns.nn1           node1:8020
dfs.namenode.rpc-address.ns.nn2           node2:8020
dfs.namenode.servicerpc-address.ns.nn1    node1:8022
dfs.namenode.servicerpc-address.ns.nn2    node2:8022
dfs.namenode.http-address.ns.nn1          node1:50070
dfs.namenode.http-address.ns.nn2          node2:50070
dfs.namenode.shared.edits.dir             qjournal://node1:8485;node2:8485;node3:8485/ns1
dfs.client.failover.proxy.provider.ns     org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.ha.fencing.methods                    sshfence
dfs.ha.fencing.ssh.private-key-files      /root/.ssh/id_rsa
dfs.journalnode.edits.dir                 /opt/server/hadoop-2.7.5/data/dfs/jn
dfs.ha.automatic-failover.enabled         true
dfs.namenode.name.dir                     file:///opt/server/hadoop-2.7.5/data/dfs/nn/name
dfs.namenode.edits.dir                    file:///opt/server/hadoop-2.7.5/data/dfs/nn/edits
dfs.datanode.data.dir                     file:///opt/server/hadoop-2.7.5/data/dfs/dn
dfs.permissions                           false
dfs.blocksize                             134217728
```
2.3 Edit yarn-site.xml; note that node3's configuration differs from node2's
```
yarn.log-aggregation-enable                        true
yarn.resourcemanager.ha.enabled                    true
yarn.resourcemanager.cluster-id                    mycluster
yarn.resourcemanager.ha.rm-ids                     rm1,rm2
yarn.resourcemanager.hostname.rm1                  node2
yarn.resourcemanager.hostname.rm2                  node3
yarn.resourcemanager.address.rm1                   node2:8032
yarn.resourcemanager.scheduler.address.rm1         node2:8030
yarn.resourcemanager.resource-tracker.address.rm1  node2:8031
yarn.resourcemanager.admin.address.rm1             node2:8033
yarn.resourcemanager.webapp.address.rm1            node2:8088
yarn.resourcemanager.address.rm2                   node3:8032
yarn.resourcemanager.scheduler.address.rm2         node3:8030
yarn.resourcemanager.resource-tracker.address.rm2  node3:8031
yarn.resourcemanager.admin.address.rm2             node3:8033
yarn.resourcemanager.webapp.address.rm2            node3:8088
yarn.resourcemanager.recovery.enabled              true
yarn.resourcemanager.ha.id                         rm1      # needed to launch more than one RM on a single node
yarn.resourcemanager.store.class                   org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
yarn.resourcemanager.zk-address                    node2:2181,node3:2181,node1:2181   # separate multiple zk services with commas
yarn.resourcemanager.ha.automatic-failover.enabled true     # enabled by default when HA is enabled
yarn.client.failover-proxy-provider                org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider
yarn.nodemanager.resource.cpu-vcores               2
yarn.nodemanager.resource.memory-mb                2048
yarn.scheduler.minimum-allocation-mb               1024
yarn.scheduler.maximum-allocation-mb               2048
yarn.log-aggregation.retain-seconds                2592000
yarn.nodemanager.log.retain-seconds                604800
yarn.nodemanager.log-aggregation.compression-type  gz
yarn.nodemanager.local-dirs                        /opt/server/hadoop-2.7.5/yarn/local
yarn.resourcemanager.max-completed-applications    1000
yarn.nodemanager.aux-services                      mapreduce_shuffle
yarn.resourcemanager.connect.retry-interval.ms     2000
```
2.4 Edit mapred-site.xml (first copy it from mapred-site.xml.template)
```
mapreduce.framework.name                 yarn
mapreduce.jobhistory.address             node3:10020
mapreduce.jobhistory.webapp.address      node3:19888
mapreduce.jobtracker.system.dir          /opt/server/hadoop-2.7.5/data/system/jobtracker
mapreduce.map.memory.mb                  1024
mapreduce.reduce.memory.mb               1024
mapreduce.task.io.sort.mb                100
mapreduce.task.io.sort.factor            10
mapreduce.reduce.shuffle.parallelcopies  15
yarn.app.mapreduce.am.command-opts       -Xmx1024m
yarn.app.mapreduce.am.resource.mb        1536
mapreduce.cluster.local.dir              /opt/server/hadoop-2.7.5/data/system/local
```
2.5 Edit slaves
```
node1
node2
node3
```
2.6 Edit hadoop-env.sh
export JAVA_HOME=/export/server/jdk1.8.0_241
2.7 Distribute the installation
cd /opt/server
scp -r hadoop-2.7.5/ node2:$PWD
scp -r hadoop-2.7.5/ node3:$PWD
2.8 Run the following commands on all three machines:
```
mkdir -p /opt/server/hadoop-2.7.5/data/dfs/nn/name
mkdir -p /opt/server/hadoop-2.7.5/data/dfs/nn/edits
```
2.9 On node3, change yarn.resourcemanager.ha.id to rm2
vim yarn-site.xml
2.10 Start the cluster
Run on node1:
```
cd /opt/server/hadoop-2.7.5
bin/hdfs zkfc -formatZK
sbin/hadoop-daemons.sh start journalnode
bin/hdfs namenode -format
bin/hdfs namenode -initializeSharedEdits -force
sbin/start-dfs.sh
```
Run on node2:
```
cd /opt/server/hadoop-2.7.5
bin/hdfs namenode -bootstrapStandby
sbin/hadoop-daemon.sh start namenode
sbin/start-yarn.sh
bin/yarn rmadmin -getServiceState rm1    # check the ResourceManager state
```
Run on node3:
```
cd /export/servers/hadoop-2.7.5
sbin/start-yarn.sh
bin/yarn rmadmin -getServiceState rm2
sbin/mr-jobhistory-daemon.sh start historyserver
```