Hadoop-MapReduce (Part 4)

Multiple join implementations

  • Map join (DistributedCache)

    • Usage scenario: one table is very small and the other is very large.

    • Solution

      Cache the small table(s) on the map side and process the join logic there in advance. This adds work to the map side, reduces the data pressure on the reduce side, and minimizes data skew as much as possible.

    • Specific method: use DistributedCache

      (1) In the mapper's setup() phase, read the cached file into an in-memory collection.

      (2) Load the cache in the driver:

      job.addCacheFile(new URI("file:/e:/mapjoincache/pd.txt")); // cache an ordinary file to the task's running node
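
      A minimal sketch of the map side, assuming pd.txt holds tab-separated (pid, pname) records and the big table holds (orderId, pid, amount); class and field names are illustrative:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private Map<String, String> pdMap = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // (1) Read the cached small table into an in-memory map.
        URI[] cacheFiles = context.getCacheFiles();
        FileSystem fs = FileSystem.get(cacheFiles[0], context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(cacheFiles[0])), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t"); // assumed layout: pid \t pname
                pdMap.put(fields[0], fields[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // (2) Join each big-table record against the cached map; no reducer is needed.
        String[] fields = value.toString().split("\t"); // assumed layout: orderId \t pid \t amount
        String pname = pdMap.getOrDefault(fields[1], "NULL");
        context.write(new Text(fields[0] + "\t" + pname + "\t" + fields[2]), NullWritable.get());
    }
}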

  • Reduce join

  • Principle

    The main task of the map side is to tag key/value pairs coming from the different tables (files) so that records from different sources can be told apart. The join field is used as the key, and the remaining fields plus the new tag are used as the value, which is then emitted.

    The main work of the reduce side: grouping by the join field (the key) has already been done by the shuffle, so we only need to separate the records from the different files (using the tags added in the map stage) within each group and then merge them.

  • Disadvantage of this method

    The obvious disadvantage is that a large amount of data is transferred between the map and reduce sides (the shuffle stage), so efficiency is low. A sketch of both sides follows.
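
    A minimal sketch under the same assumed schemas (order.txt: orderId \t pid \t amount; pd.txt: pid \t pname); class names are illustrative and each class would live in its own file:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String fileName;

    @Override
    protected void setup(Context context) {
        // Remember which file this split came from, to tag the records.
        fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fileName.startsWith("order")) {
            // join field pid as the key; tag + remaining fields as the value
            context.write(new Text(fields[1]), new Text("order\t" + fields[0] + "\t" + fields[2]));
        } else {
            context.write(new Text(fields[0]), new Text("pd\t" + fields[1]));
        }
    }
}

// (in a separate file)
class ReduceJoinReducer extends Reducer<Text, Text, Text, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String pname = "NULL";
        List<String> orders = new ArrayList<>();
        for (Text v : values) {
            String[] fields = v.toString().split("\t", 2);
            if ("pd".equals(fields[0])) {
                pname = fields[1];       // record from pd.txt
            } else {
                orders.add(fields[1]);   // record from order.txt
            }
        }
        for (String order : orders) {    // merge the two sides
            context.write(new Text(order + "\t" + pname), NullWritable.get());
        }
    }
}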

Data cleaning (ETL)

  • Overview

Before running the core business MapReduce program, it is often necessary to clean the data first, removing records that do not meet requirements. The cleaning process usually only needs a mapper program, not a reducer.
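
A minimal mapper-only cleaning sketch; the field count threshold and separator are assumptions for illustration:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogCleanMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Drop records that do not meet the (assumed) minimum field count.
        String[] fields = value.toString().split(" ");
        if (fields.length < 11) {
            return;
        }
        context.write(value, NullWritable.get());
    }
}

// In the driver: job.setNumReduceTasks(0); // mapper-only job, no reduce phase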

Counter application

Hadoop maintains several built-in counters for each job to report various metrics. For example, some counters record the number of bytes and records processed, which lets users monitor the volume of input data consumed and output data produced.

  • API
    • Count by enumeration

enum MyCounter { MALFORMED, NORMAL }

// Increment the custom counter defined by the enum by 1
context.getCounter(MyCounter.MALFORMED).increment(1);

    • Count by counter group name and counter name

context.getCounter("counterGroup", "countera").increment(1);

  • The group name and counter name can be chosen freely, but meaningful names are preferable.
  • The counter results can be viewed on the console after the program finishes.

MapReduce development summary

Several aspects need to be considered when writing a MapReduce program:

  • Input data interface: InputFormat

    The default implementation class is TextInputFormat.

    TextInputFormat reads one line of text at a time, then returns the starting byte offset of the line as the key and the line content as the value.

    KeyValueTextInputFormat treats each line as one record, split by a separator into key and value. The default separator is tab (\t).

    NLineInputFormat creates splits of a specified number of lines N.

    CombineTextInputFormat can merge multiple small files into one split, improving processing efficiency.

    Users can also implement a custom InputFormat.
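
    A driver sketch for switching the InputFormat (the split size and line count are illustrative):

// Merge many small files into fewer splits:
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024); // 4 MB per split

// Or create one split per N lines:
// job.setInputFormatClass(NLineInputFormat.class);
// NLineInputFormat.setNumLinesPerSplit(job, 3);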
  • Logical processing interface: Mapper

    The user implements three methods according to business requirements: map(), setup(), cleanup().

  • Partitioner

    There is a default implementation, HashPartitioner, whose logic is to derive the partition number from the hash of the key and the number of reducers: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks

    If the business has special needs, you can implement a custom partitioner, as sketched below.
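
    A minimal custom partitioner sketch; the key-prefix rule is an illustrative assumption:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class PhonePartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Keys starting with "136" go to partition 0, everything else to partition 1.
        return key.toString().startsWith("136") ? 0 : 1;
    }
}

// In the driver (the number of reduce tasks must match the partition count):
// job.setPartitionerClass(PhonePartitioner.class);
// job.setNumReduceTasks(2);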

  • Comparable sorting

    When we use a custom object as the key, it must implement the WritableComparable interface and override the compareTo() method.

    Partial sort: sort within each file of the final output.

    Total sort: sort all of the data, which usually means using only one reducer.

    Secondary sort: sort on two conditions, as in the sketch below.
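
    A minimal WritableComparable sketch with a secondary sort: descending by amount, then ascending by order id (the field names are illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class OrderBean implements WritableComparable<OrderBean> {

    private String orderId;
    private double amount;

    public String getOrderId() {
        return orderId;
    }

    @Override
    public int compareTo(OrderBean o) {
        int cmp = Double.compare(o.amount, this.amount);           // first: amount, descending
        return cmp != 0 ? cmp : this.orderId.compareTo(o.orderId); // second: orderId, ascending
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(orderId);
        out.writeDouble(amount);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        orderId = in.readUTF();
        amount = in.readDouble();
    }
}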

  • Combiner

    A combiner can improve execution efficiency and reduce IO transfer. However, it must not affect the original business result when used (a sum is safe to combine, an average is not).
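
    When the reduce logic is a pure aggregation such as a sum, the reducer class itself can usually be reused as the combiner in the driver; WordCountReducer here is an illustrative name:

job.setCombinerClass(WordCountReducer.class); // runs locally per map task before the shuffle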

  • Reduce-side grouping: GroupingComparator

    After a ReduceTask receives its input data (all of the data in one partition), it first groups the data. The default grouping rule is "same key". The reduce() method is then called once per group of kv pairs: the key of the first kv in the group is passed as the key parameter of reduce(), and an iterator over the group's values is passed as the values parameter.

    Using this mechanism, we can implement efficient group-wise maximum logic.

    Define a custom bean object to encapsulate the data and override its compareTo() method to sort in descending order. Then define a GroupingComparator that groups bean objects by the business grouping id (such as the order id). This way, the maximum we want is exactly the key passed into the reduce() method. A sketch follows.
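
    A minimal GroupingComparator sketch, reusing the illustrative OrderBean from the sorting sketch above: sorting puts the largest amount first within each order id, so reduce() receives that maximum as its key:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class OrderGroupingComparator extends WritableComparator {

    protected OrderGroupingComparator() {
        super(OrderBean.class, true); // true: instantiate keys for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Group only by the business id, ignoring the sort field.
        return ((OrderBean) a).getOrderId().compareTo(((OrderBean) b).getOrderId());
    }
}

// In the driver:
// job.setGroupingComparatorClass(OrderGroupingComparator.class);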

  • Logical processing interface: Reducer

    Users implement three methods according to business needs: reduce(), setup(), cleanup().

  • Output data interface: OutputFormat

    The default implementation class is TextOutputFormat; its functional logic is to write each KV pair as one line to the target text file.

    SequenceFileOutputFormat writes its output as a SequenceFile. If the output will be the input of a subsequent MapReduce job, this is a good format, because it is compact and compresses easily.

    Users can also implement a custom OutputFormat.
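
    A driver sketch for switching the OutputFormat (the output path is illustrative):

job.setOutputFormatClass(SequenceFileOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("/output"));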

Origin: blog.csdn.net/qq_45092505/article/details/105419171