大数据高级开发工程师——Hadoop学习笔记（6）_随笔

大数据高级开发工程师——Hadoop学习笔记（6）

文章目录

Hadoop进阶篇
- MapReduce：Hadoop分布式并行计算框架
- - 自定义OutputFormat
  - - 1. 需求
    - 2. 分析
    - 3. 代码实现
  - shuffle中数据压缩
  - - 1. hadoop 支持的压缩算法比较
    - 2. 如何开启压缩
  - 计数器与累加器
  - - 1. hadoop 内置计数器
    - 2. 自定义计数器
  - MapReduce的join
  - - 1. reduce join
    - - 需求
      - 实现方案
    - 2. map join
  - mapTask工作机制
  - - 1. Read 阶段
    - 2. Map 阶段
    - 3. Collect 收集阶段
    - 4. Spill 溢写阶段
    - 5. 合并阶段
  - reduceTask工作机制
  - - 1. reduce流程
    - 2. 设置ReduceTask并行度(个数)
    - 3. 测试ReduceTask多少合适
  - MapReduce完整流程
  - - 1. map 简图
    - 2. reduce 简图
    - 3. mapreduce简略步骤

Hadoop进阶篇 MapReduce：Hadoop分布式并行计算框架自定义OutputFormat 1. 需求

现在有一些订单的评论数据，需求，将订单的好评与其他评论（中评、差评）进行区分开来，将最终的数据分开到不同的文件夹下面去
数据内容参见资料文件夹，其中数据第九个字段表示好评，中评，差评。0：好评，1：中评，2：差评

2. 分析

程序的关键点是要在一个mapreduce程序中根据数据的不同输出两类结果到不同目录，这类灵活的输出需求可以通过自定义outputformat来实现。
实现要点：
- 在mapreduce中访问外部资源
- 自定义outputformat，改写其中的recordwriter，改写具体输出数据的方法write()

3. 代码实现

自定义一个 OutputFormat：

public class MyOutputFormat extends FileOutputFormat {
    static class MyRecordWriter extends RecordWriter {
        private FSDataOutputStream goodStream;
        private FSDataOutputStream badStream;

        public MyRecordWriter(FSDataOutputStream goodStream, FSDataOutputStream badStream) {
            this.goodStream = goodStream;
            this.badStream = badStream;
        }

        @Override
        public void write(Text key, NullWritable value) throws IOException, InterruptedException {
            if (key.toString().split("t")[9].equals("0")) {// 好评
                goodStream.write(key.toString().getBytes());
                goodStream.write("rn".getBytes());
            } else { // 中评或差评
                badStream.write(key.toString().getBytes());
                badStream.write("rn".getBytes());
            }
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException, InterruptedException {
            if (badStream != null) badStream.close();
            if (goodStream != null) goodStream.close();
        }

    }

    @Override
    public RecordWriter getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path goodCommentPath = new Path("/Volumes/F/MyGitHub/bigdata/hadoop-demo/src/main/resources/output/good.txt");
        Path badCommentPath = new Path("/Volumes/F/MyGitHub/bigdata/hadoop-demo/src/main/resources/output/bad.txt");

        FSDataOutputStream goodOutputStream = fs.create(goodCommentPath);
        FSDataOutputStream badOutputStream = fs.create(badCommentPath);

        return new MyRecordWriter(goodOutputStream, badOutputStream);
    }
}

定义程序入口类：

public class OutputFormatMain extends Configured implements Tool {
    static class OutputFormatMapper extends Mapper {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            context.write(value, NullWritable.get());
        }
    }
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(super.getConf(), OutputFormatMain.class.getSimpleName());
        job.setJarByClass(OutputFormatMain.class);
        
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        
        job.setMapperClass(OutputFormatMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        
        // 使用默认的reduce类的逻辑
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        
        job.setOutputFormatClass(MyOutputFormat.class);
        MyOutputFormat.setOutputPath(job, new Path(args[1]));
        
        job.setNumReduceTasks(2);
        
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int run = ToolRunner.run(new Configuration(), new OutputFormatMain(), args);
        System.exit(run);
    }
}

运行查看输出结果

shuffle中数据压缩

shuffle 阶段，可以看到数据通过大量的拷贝，从map阶段输出的数据，都要通过网络拷贝，发送到reduce阶段，这一过程中，涉及到大量的网络IO，如果数据能够进行压缩，那么数据的发送量就会少得多。
从map阶段输出的数据，都要通过网络拷贝，发送到reduce阶段，这一过程中，涉及到大量的网络IO，如果数据能够进行压缩，那么数据的发送量就会少得多
MapReduce的执行流程：

MapReduce
input
mapper
shuffle
partitioner、sort、combiner、【compress】、group
reducer
output

文件压缩有两个好处：节约磁盘空间，加速数据在网络和磁盘上的传输。
查看 hadoop 支持的压缩算法：

hadoop checknative

1. hadoop 支持的压缩算法比较压缩格式工具算法文件扩展名是否可切分对应使用的Java类DEFLATE无DEFLATE.deflate否org.apache.hadoop.io.compress.DefaultCodecGzipgzipDEFLATE.gz否org.apache.hadoop.io.compress.GzipCodecbzip2bzip2bzip2.bz2是org.apache.hadoop.io.compress.BZip2CodecLZOlzopLZO.lzo否com.hadoop.compression.lzo.LzopCodecLZ4无LZ4.lz4否org.apache.hadoop.io.compress.Lz4CodecSnappy无Snappy.snappy否org.apache.hadoop.io.compress.SnappyCodec

压缩速率比较

压缩算法原始文件大小压缩后的文件大小压缩速率解压缩速度gzip8.3GB1.8GB17.5MB/s58MB/sbzip28.3GB1.1GB2.4MB/s9.5MB/sLZO-bset8.3GB2GB4MB/s60.6MB/sLZO8.3GB2.9GB135MB/s410MB/sSnappy8.3GB1.8GB172MB/s409MB/s

常用的压缩算法主要有LZO和snappy

2. 如何开启压缩

在代码中设置压缩：

Configuration configuration = new Configuration();
// 设置 map 阶段压缩
configuration.set("mapreduce.map.output.compress", "true");
configuration.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");
// 设置 reduce 阶段的压缩
configuration.set("mapreduce.output.fileoutputformat.compress", "true");
configuration.set("mapreduce.output.fileoutputformat.compress.type", "RECORD");
configuration.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");

修改 mapred-site.xml 进行 MapReduce 压缩：所有节点都要修改mapred-site.xml，修改完成之后记得重启集群


	mapreduce.map.output.compress
	true


	mapreduce.map.output.compress.codec
	org.apache.hadoop.io.compress.SnappyCodec



       
	mapreduce.output.fileoutputformat.compress
	true

        
	mapreduce.output.fileoutputformat.compress.type
	RECORD

        
	mapreduce.output.fileoutputformat.compress.codec
	org.apache.hadoop.io.compress.SnappyCodec

计数器与累加器

计数器是收集作业统计信息的有效手段之一，用于质量控制或应用级统计。计数器还可辅助诊断系统故障。如果需要将日志信息传输到map 或reduce 任务，更好的方法通常是看能否用一个计数器值来记录某一特定事件的发生。对于大型分布式作业而言，使用计数器更为方便。除了因为获取计数器值比输出日志更方便，还有根据计数器值统计特定事件的发生次数要比分析一堆日志文件容易得多。

1. hadoop 内置计数器计数器对应的Java类MapReduce任务计数器org.apache.hadoop.mapreduce.TaskCounter文件系统计数器org.apache.hadoop.mapreduce.FileSystemCounterFileInputFormat计数器org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounterFileOutputFormat计数器org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter作业计数器org.apache.hadoop.mapreduce.JobCounter

每次mapreduce执行完成之后，我们都会看到一些日志记录出来，其中最重要的一些日志记录如下截图

2. 自定义计数器

利用之前排序及序列化的案例，统计 map 端接收到的数据的条数。
Mapper 中，通过 context 上下文对象可以获取我们的计数器，并进行记录，通过 context 上下文对象，在 map 端使用计数器进行统计：

Reducer 中，通过 enum 枚举类型来定义计数器，统计 reduce 端数据的输入 key 有多少个，对应的 value 有多少个：

运行程序查看输出结果

MapReduce的join 1. reduce join 需求

现在有两张表如下

订单数据表t_order：

iddatepidamount100120150710P00012100220150710P00023100220150710P00033

商品信息表t_product

idpnamecategory_idpriceP0001小米510002000P0002锤子T110003000

假如数据量巨大，两表的数据是以文件的形式存储在HDFS中，需要用 MapReduce 程序来实现一下 SQL 的 join 查询运算：

select a.id,a.date,b.name,b.category_id,b.price from t_order a join t_product  b on a.pid = b.id

实现方案

通过将关联的条件作为map输出的key，将两表满足join条件的数据并携带数据所来源的文件信息，发往同一个reduce task，在reduce中进行数据的串联
定义 Mapper 类：

public class ReduceJoinMapper extends Mapper {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        

        // 方式二：因为t_product表，都是以p开头，所以可以作为判断的依据
        String[] slices = value.toString().split(",");
        if (value.toString().startsWith("p")) {
            // 样例数据: p0001,小米5,1000,2000
            context.write(new Text(slices[0]), value);
        } else {
            // order: 1001,20150710,p0001,2
            context.write(new Text(slices[2]), value);
        }
    }
}

定义 Reducer 类：

public class ReduceJoinReducer extends Reducer {
    @Override
    protected void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException {
        // 订单数据有多条
        List orders = new ArrayList<>();
        // 保存商品信息
        String product = "";
        
        for (Text value : values) {
            if (value.toString().startsWith("p")) {
                product = value.toString();
            } else {
                orders.add(value.toString());
            }
        }
        for (String order : orders) {
            context.write(new Text(order + "t" + product), NullWritable.get());
        }
    }
}

定义 main 程序入口：

public class ReduceJoinMain extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(super.getConf(), ReduceJoinMain.class.getSimpleName());
        job.setJarByClass(ReduceJoinMain.class);

        // 第一步：读取文件
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));

        // 第二步：设置自定义 Mapper 逻辑
        job.setMapperClass(ReduceJoinMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // 第三步到第六步：分区、排序、规约、分组 省略

        // 第七步：设置自定义 Reducer 逻辑
        job.setReducerClass(ReduceJoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // 第八步：设置输出数据路径
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int run = ToolRunner.run(new Configuration(), new ReduceJoinMain(), args);
        System.exit(run);
    }
}

运行程序，查看输出结果：

2. map join

适用于关联表中有小表的情形，可以将小表分发到所有的map节点，这样，map节点就可以在本地对自己所读到的大表数据进行join并输出最终结果，可以大大提高join *** 作的并发度，加快处理速度。
先在mapper类中预先定义好小表，进行join。
自定义 Mapper 类：

public class MapJoinMapper extends Mapper {
    
    private Map productMap;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        productMap = new HashMap<>();
        // 获取到所有的缓存文件
        // 方式一：
        Configuration configuration = context.getConfiguration();
        URI[] cacheFiles = Job.getInstance(configuration).getCacheFiles();
//        // 方式二：
//        URI[] cacheFiles = DistributedCache.getCacheFiles(configuration);

        // 本例只有一个缓存文件放进了分布式缓存
        URI cacheFile = cacheFiles[0];

        FileSystem fileSystem = FileSystem.get(configuration);
        FSDataInputStream fsdis = fileSystem.open(new Path(cacheFile));

        try (BufferedReader br = new BufferedReader(new InputStreamReader(fsdis))){
            String line = null;
            while ((line = br.readLine()) != null) {
                String[] slices = line.split(",");
                productMap.put(slices[0], line);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] slices = value.toString().split(",");
        // 获取订单的商品id
        String pid = slices[2];
        // 获取商品表的数据
        String pdtsLine = productMap.get(pid);
        context.write(new Text(value.toString() + "t" + pdtsLine), NullWritable.get());
    }
}

定义 main 程序入口：

public class MapJoinMain extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // 分布式缓存的hdfs路径
        URI uri = new URI("hdfs://node01:8020/cache/pdts.txt");

        Job job = Job.getInstance(super.getConf(), MapJoinMain.class.getSimpleName());
        job.setJarByClass(MapJoinMain.class);
        // 添加缓存文件
        job.addCacheFile(uri);
        
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        
        job.setMapperClass(MapJoinMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        
        job.setNumReduceTasks(2);
        
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int run = ToolRunner.run(new Configuration(), new MapJoinMain(), args);
        System.exit(run);
    }
}

将程序打成 jar 包，并提交到集群运行，查看输出结果：

hadoop jar hadoop-demo-1.0.jar com.yw.hadoop.mr.p13_map_join.MapJoinMain /order.txt /map_join_out

github 源代码地址：https://github.com/shouwangyw/bigdata/tree/master/hadoop-demo

mapTask工作机制

1. Read 阶段

MapTask 通过用户编写的 RecordReader，从输入 InputSplit 中解析出一个个 key、value。

2. Map 阶段

该节点主要是将解析出的 key、value 交给用户编写的 map() 函数处理，并产生一系列新的 key、value。

3. Collect 收集阶段

在用户编写 map() 函数中，当数据处理完成后，一般会调用 OutputCollector.collect() 输出结果。在该函数内部，它会将生成的 key、value 分区（调用 Partitioner），并写入一个环形内存缓冲区中。

4. Spill 溢写阶段

当环形缓冲区满 80% 后，MapReduce 会将数据写到本地磁盘上，生成一个临时文件。需要注意的是，将数据写入本地磁盘之前，先要对数据进行一次本地排序，并在必要时对数据进行合并，压缩等 *** 作。
溢写阶段详情：
- 步骤一：利用快速排序算法对缓存区内的数据进行排序，排序方式是：先按照分区编号 Partition 进行排序，然后按照 key 进行排序。这样，经过排序后，数据以分区为单位聚集在一起，且同一分区内所有数据按照 key 有序。
- 步骤二：按照分区编号有小到大依次将每个分区中的数据写入任务工作目录下的临时文件 output/spillN.out （N表示当前溢写次数）中。如果用户设置了 Combiner，则写入文件之前，对每个分区中的数据进行一次聚集 *** 作。
- 步骤三：将分区数据的元信息写到内存索引数据结构 SpillRecord 中，其中每个分区的元信息包括：在临时文件中的偏移量、压缩前数据大小和压缩后数据大小。如果当前内存索引大小超过 1MB，则将内存索引写到文件 output/spillN.out.index 中。

5. 合并阶段

当所有数据处理完成后，MapTask对所有临时文件进行一次合并，以确保最终只会生成一个数据文件。
当所有数据处理完成后，MapTask 会将所有临时文件合并成一个大文件，并保存到文件 output/file.out 中，同时生成相应的索引文件 output/file.out.index。
当进行文件合并过程中，MapTask 以分区为单位进行合并。对于某个分区，它将采用多轮递归合并的方式。每轮合并 io.sort.factor（默认10）个文件，并将产生的文件重新加入待合并列表中，对文件排序后，重复以上过程，知道最终得到一个大文件。
让每个 MapTask 最终只生成一个数据文件，可避免同时打开大量文件和同时读取大量小文件产生的随机读取带来的开销。

reduceTask工作机制

1. reduce流程

Copy 阶段：ReduceTask 从各个 MapTask 上远程拷贝一片数据，并针对某一片数据，如果其大小超过一定阈值，则写到磁盘，否则直接放到内存中。
Merge 阶段：在远程拷贝数据的同时，ReduceTask 启动了两个后台线程对内存和磁盘上的文件进行合并，以防止内存使用过多或磁盘上文件过多。
Sort阶段：当所有 MapTask 的分区数据全部拷贝完，按照 MapReduce 语义，用户编写 reduce() 函数输入数据是按 key 进行聚集的一组数据。为了将 key 相同的数据聚在一起，Hadoop 采用了基于排序的策略。由于各个 MapTask 已经实现对自己的处理结果进行了局部排序，因此，ReduceTask 只需对所有数据进行一次归并排序即可。
Reduce 阶段：reduce() 函数将计算结果写到 HDFS 上。

2. 设置ReduceTask并行度(个数)

ReduceTask的并行度同样影响整个Job的执行并发度和执行效率，但与MapTask的并发数由切片数决定不同，ReduceTask数量的决定是可以直接手动设置：

// 默认值是1，手动设置 4
job.setNumReduceTasks(4);

3. 测试ReduceTask多少合适

实验环境：1个Master节点，16个Slave节点，CPU 8GHZ，内存 2G，数据量 1GB
实验结论：

MapTask =16ReduceTask151015162025304560总时间8921461109288100128101145104 MapReduce完整流程 1. map 简图

2. reduce 简图

3. mapreduce简略步骤

第一步：读取文件，解析成为key，value对
第二步：自定义map逻辑接受k1,v1，转换成为新的k2,v2输出；写入环形缓冲区
第三步：分区：写入环形缓冲区的过程，会给每个kv加上分区Partition index。（同一分区的数据，将来会被发送到同一个reduce里面去）
第四步：排序：当缓冲区使用80%，开始溢写文件
- 先按partition进行排序，相同分区的数据汇聚到一起；
- 然后，每个分区中的数据，再按key进行排序
第五步：combiner。调优过程，对数据进行map阶段的合并（注意：并非所有mr都适合combine）
第六步：将环形缓冲区的数据进行溢写到本地磁盘小文件
第七步：归并排序，对本地磁盘溢写小文件进行归并排序
第八步：等待reduceTask启动线程来进行拉取数据
第九步：reduceTask启动线程，从各map task拉取属于自己分区的数据
第十步：从mapTask拉取回来的数据继续进行归并排序
第十一步：进行groupingComparator分组 *** 作
第十二步：调用reduce逻辑，写出数据
第十三步：通过outputFormat进行数据输出，写到文件，一个reduceTask对应一个结果文件

欢迎分享，转载请注明来源：内存溢出

原文地址: http://outofmemory.cn/zaji/5654463.html

大数据高级开发工程师——Hadoop学习笔记（6）

发表评论

评论列表（0条）