Reduce-side join: partition, group, and aggregate

1.1.1 Reduce-side join: partition, group, and aggregate

A reduce-side join uses the partitioning function to send all records with the same stationId to the same reduce task, where they are grouped and aggregated. Here, the weather station data and the temperature records with the same stationId are gathered into one group; the reduce function reads the first record of the group (the weather station name) and combines it with each of the remaining records to produce the joined output. As an example, consider the following station data set and temperature record data set. Only a few records are used for illustration; real data sets are of course much larger.

Station data set (station id and station name):

StationId StationName

1~hangzhou

2~shanghai

3~beijing

Temperature record data set (station id, timestamp, temperature):

StationId  TimeStamp Temperature

3~20200216~6

3~20200215~2

3~20200217~8

1~20200211~9

1~20200210~8

2~20200214~3

2~20200215~4

Objective: join the two data sets above so that the station name is added to each temperature record, with the output keyed by station id:

1~hangzhou ~20200211~9

1~hangzhou ~20200210~8

2~shanghai ~20200214~3

2~shanghai ~20200215~4

3~beijing ~20200216~6

3~beijing ~20200215~2

3~beijing ~20200217~8

The detailed steps are as follows.

(1)   Two mappers read the two data sets and merge their map output

Because the two data sets have different formats, two different mappers must be created to read them into a common map output. MultipleInputs is used to register the two input paths, each with its own mapper class.

(2)   Create a composite key <stationId, mark> used to sort the map output

The composite key makes the map output sort by stationId in ascending order and, within the same stationId, by the second field (mark) in ascending order. mark takes only two values: 0 for records read from the station data set and 1 for records read from the temperature record data set. This guarantees that, within each stationId group, the first record is the weather station name and the remaining records are temperature data. The composite key TextPair is defined as follows.

package Temperature;


import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class TextPair implements WritableComparable<TextPair> {
    private Text first;
    private Text second;

    // no-argument constructor required by Hadoop's reflection-based deserialization
    public TextPair() {
        this(new Text(), new Text());
    }

    public TextPair(Text first, Text second) {
        this.first = first;
        this.second = second;
    }

    public int compareTo(TextPair o) {
        int cmp = first.compareTo(o.getFirst());
        if (cmp != 0) // different first fields: ascending by first field
        {
            return cmp;
        }
        // same first field: ascending by second field
        return second.compareTo(o.getSecond());
    }

    public void write(DataOutput dataOutput) throws IOException {
        first.write(dataOutput);
        second.write(dataOutput);
    }

    public void readFields(DataInput dataInput) throws IOException {
        first.readFields(dataInput);
        second.readFields(dataInput);
    }

    public Text getFirst() {
        return first;
    }

    public void setFirst(Text first) {
        this.first = first;
    }

    public Text getSecond() {
        return second;
    }
    public void setSecond(Text second) {
        this.second = second;
    }
}
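To see the resulting sort order without running a Hadoop job, here is a minimal plain-Java sketch (no Hadoop dependency, using a `record` as a stand-in for TextPair): keys compare by the first field, then by the second, so for each station the mark=0 record always sorts ahead of the mark=1 records.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TextPairDemo {
    // plain-Java analog of TextPair's compareTo: first field, then second field
    record Pair(String first, String second) implements Comparable<Pair> {
        public int compareTo(Pair o) {
            int cmp = first.compareTo(o.first);                  // ascending by first field
            return cmp != 0 ? cmp : second.compareTo(o.second);  // then by second field
        }
    }

    public static void main(String[] args) {
        List<Pair> keys = new ArrayList<>(List.of(
                new Pair("3", "1"), new Pair("1", "1"),
                new Pair("1", "0"), new Pair("3", "0")));
        Collections.sort(keys);
        // after sorting: <1,0>, <1,1>, <3,0>, <3,1>
        System.out.println(keys);
    }
}
```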

The mapper output is shown below, with the composite key in front and the value behind.

<1,0>    hangzhou

<1,1>    20200211~9

<1,1>    20200210~8

<2,0>    shanghai

<2,1>    20200214~3

<2,1>    20200215~4

<3,0>    beijing

<3,1>    20200216~6

<3,1>    20200215~2

<3,1>    20200217~8

(3)   The map results are passed to reduce, partitioned by stationId, and regrouped

In the map output, records are sorted by the first key field (stationId) in ascending order and, within the same stationId, by the second field in ascending order, so station-name records and temperature records are interleaved. During the shuffle, as map output moves to the reducers it passes through the partitioner: records with the same stationId are assigned to the same reduce task, and within each reduce task records with the same stationId are grouped together. Assuming two reduce tasks and partitioning by stationId % 2, the partition result is:

Partition 1

<1,0>    hangzhou

<1,1>    20200211~9

<1,1>    20200210~8

<3,0>    beijing

<3,1>    20200216~6

<3,1>    20200215~2

<3,1>    20200217~8

Partition 2

<2,0>    shanghai

<2,1>    20200214~3

<2,1>    20200215~4
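The partition assignment above can be checked with a small plain-Java sketch (no Hadoop dependency) that mirrors the FirstPartitioner logic, assuming two reduce tasks:

```java
public class PartitionDemo {
    // mirror of FirstPartitioner: stationId modulo the number of reduce tasks
    static int getPartition(String stationId, int numReduceTasks) {
        return Integer.parseInt(stationId) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String id : new String[]{"1", "2", "3"}) {
            System.out.println("station " + id + " -> partition " + getPartition(id, 2));
        }
        // stations 1 and 3 go to partition 1; station 2 goes to partition 0,
        // matching the two partitions listed above
    }
}
```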

(4)   Within each partition, the data is then grouped and aggregated by stationId

Partition 1

Group 1

<1,0>    <Hangzhou, 20200211~9, 20200210~8>

Group 2

<3,0>    <Beijing, 20200216~6, 20200215~2, 20200217~8>

Partition 2

<2,0> <shanghai, 20200214~3, 20200215~4>

(5)   The grouped data is passed to the reduce function, which appends the station name to each temperature record and writes the output

Because the data within each group is sorted by the mark field in ascending order, the first value of every group is the station name and the remaining values are temperature records; this is exactly the effect the mark field is meant to guarantee. The reduce function therefore reads the first value as the station name and combines it with each of the remaining values for output, which completes the join of the two data sets.

1~hangzhou ~20200211~9

1~hangzhou ~20200210~8

2~shanghai ~20200214~3

2~shanghai ~20200215~4

3~beijing ~20200216~6

3~beijing ~20200215~2

3~beijing ~20200217~8
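The reduce logic just described can be sketched in plain Java (no Hadoop dependency): within a group the first value is the station name, which is read once and then prepended to every remaining temperature record.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ReduceJoinDemo {
    // sortedValues is the grouped value list as it would arrive at reduce:
    // station name first (mark 0), then the temperature records (mark 1)
    static List<String> join(String stationId, List<String> sortedValues) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = sortedValues.iterator();
        String stationName = it.next();   // first value: the station name
        while (it.hasNext()) {            // remaining values: temperature records
            out.add(stationId + "~" + stationName + "~" + it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(join("1", List.of("hangzhou", "20200211~9", "20200210~8")));
        // [1~hangzhou~20200211~9, 1~hangzhou~20200210~8]
    }
}
```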

(6)   Complete code example

package Temperature;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;
import java.util.Iterator;

public class JoinRecordWithStationId extends Configured implements Tool {
    // mapper for the weather station name data set
    public static class StationMapper extends Mapper<LongWritable, Text, TextPair, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // e.g. 1~hangzhou
            String[] values = value.toString().split("~");
            if (values.length != 2) {
                return;
            }
            // composite key: first field is stationId, second field "0" marks station-name data
            context.write(new TextPair(new Text(values[0]), new Text("0")), new Text(values[1]));
        }
    }

    // mapper for the temperature record data set
    public static class TemperatureRecordMapper extends Mapper<LongWritable, Text, TextPair, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // e.g. 3~20200216~6
            String[] values = value.toString().split("~");
            if (values.length != 3) {
                return;
            }
            // composite key: first field is stationId, second field "1" marks temperature record data
            String outputValue = values[1] + "~" + values[2];
            context.write(new TextPair(new Text(values[0]), new Text("1")), new Text(outputValue));
        }
    }

    // partitioner class: partition by stationId only
    public static class FirstPartitioner extends Partitioner<TextPair, Text> {
        @Override
        public int getPartition(TextPair textPair, Text text, int numPartitions) {
            // stationId modulo the number of reduce tasks gives the partition id
            return Integer.parseInt(textPair.getFirst().toString()) % numPartitions;
        }
    }

    // grouping comparator class: group values by stationId only
    public static class GroupingComparator extends WritableComparator {
        public GroupingComparator() {
            super(TextPair.class, true); // required so WritableComparator can create TextPair instances
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            TextPair pairA = (TextPair) a;
            TextPair pairB = (TextPair) b;
            // same stationId compares equal, so those records fall into one group
            return pairA.getFirst().compareTo(pairB.getFirst());
        }
    }

    // reducer: the first value in each group is the station name; append it to every temperature record
    public static class JoinReducer extends Reducer<TextPair, Text, Text, Text> {
        @Override
        protected void reduce(TextPair key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            Iterator<Text> it = values.iterator();
            String stationName = it.next().toString();
            while (it.hasNext()) {
                String outputValue = stationName + "~" + it.next().toString();
                context.write(key.getFirst(), new Text(outputValue));
            }
        }
    }
    public int run(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
       if (args.length!=3)
       {
           return -1;
       }
        Job job=Job.getInstance(getConf(),"joinStationTemperatueRecord");
       if (job==null)
       {
           return -1;
       }
       job.setJarByClass(this.getClass());
      //set the two input paths and the one output path
       Path StationPath=new Path(args[0]);
       Path TemperatureRecordPath= new Path(args[1]);
       Path outputPath=new Path(args[2]);
       MultipleInputs.addInputPath(job,StationPath, TextInputFormat.class,StationMapper.class);
       MultipleInputs.addInputPath(job,TemperatureRecordPath,TextInputFormat.class,TemperatureRecordMapper.class);
       FileOutputFormat.setOutputPath(job,outputPath);
       //set the partitioner class, grouping comparator class, and reducer class
       job.setPartitionerClass(FirstPartitioner.class);
       job.setGroupingComparatorClass(GroupingComparator.class);
       job.setReducerClass(JoinReducer.class);
       //set the output types
       job.setOutputKeyClass(Text.class);
       job.setOutputValueClass(Text.class);
       job.setMapOutputKeyClass(TextPair.class);
       job.setMapOutputValueClass(Text.class);
       return job.waitForCompletion(true)? 0:1;
    }
    public static void main(String[] args) throws Exception
    {
        // three arguments: 1) station data set path, 2) temperature record data set path, 3) output path
       int exitCode = ToolRunner.run(new JoinRecordWithStationId(), args);
       System.exit(exitCode);
    }

}

 

Command to run the job:

% hadoop jar temperature-example.jar JoinRecordWithStationId input/station/all input/ncdc/all output



Origin www.cnblogs.com/bclshuai/p/12319490.html