MapReduce: "Attack on Big Data" tutorial series

1. MapReduce installation

(1) Overview of distributed computing

 

Visit master:8088 to check whether YARN has started successfully.

(2) Verify that mapreduce is installed successfully

Run the MapReduce grep (regular-expression matching) example that ships with the Hadoop installation package.
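For example, the grep example can be launched like this (a sketch; the jar path assumes Hadoop 2.7.5 under $HADOOP_HOME, and the HDFS input/output directories are placeholders, not taken from the original post):

yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar grep /user/hadoop-twq/mr/input /user/hadoop-twq/mr/grep-output 'dfs[a-z.]+'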

The console output shows the MapReduce job running, and the job's execution record can be seen on the YARN monitoring interface.

2. Hadoop serialization mechanism


Use Hadoop's Writable interface to implement serialization. First add the hadoop-client dependency:


<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.7.5</version>
</dependency>
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.ToString;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableFactories;

import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Lombok generates the constructors, getters/setters and toString used below
@Data
@NoArgsConstructor
@AllArgsConstructor
@ToString
public class BlockWritable implements Writable {

    private long blockId;

    private long numBytes;

    private long generationStamp;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(blockId);
        out.writeLong(numBytes);
        out.writeLong(generationStamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.blockId = in.readLong();
        this.numBytes = in.readLong();
        this.generationStamp = in.readLong();
    }

    public static void main(String[] args) throws IOException {
        // serialize to a local file
        BlockWritable blockWritable = new BlockWritable(34234L, 234324345L, System.currentTimeMillis());
        DataOutputStream dataOutputStream = new DataOutputStream(new FileOutputStream("D:/block.txt"));
        blockWritable.write(dataOutputStream);
        dataOutputStream.close();
        // deserialize from the same file
        Writable writable = WritableFactories.newInstance(BlockWritable.class);
        DataInputStream dataInputStream = new DataInputStream(new FileInputStream("D:/block.txt"));
        writable.readFields(dataInputStream);
        dataInputStream.close();
        System.out.println((BlockWritable) writable);
    }
}

Writable is Hadoop's own serialization mechanism. The serialized output is much smaller than what Java's built-in serialization produces, which significantly improves performance when the data volume is large.

3. Use MapReduce to count text lines in a distributed way

(1) Counting the number of text lines in a distributed way

(2) Add the MapReduce dependency to the project


<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce-client-core</artifactId>
  <version>2.7.5</version>
</dependency>

(3) Write the MapReduce code

package com.dzx.hadoopdemo.mapred;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author duanzhaoxu
 * @date 2020-12-24 14:28:59
 */
public class DistributeCount {

    public static class ToOneMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable ONE = new IntWritable(1);
        private Text text = new Text();

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // every input line is mapped to the same key "count" with the value 1
            this.text.set("count");
            context.write(this.text, ONE);
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable(0);

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // sum all the 1s emitted by the mappers to get the total line count
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // create the job
        Job job = Job.getInstance(configuration, "distribute-count");
        // set the driver class
        job.setJarByClass(DistributeCount.class);
        // set the mapper class
        job.setMapperClass(ToOneMapper.class);
//        job.setCombinerClass(IntSumReducer.class);
        // set the reducer class
        job.setReducerClass(IntSumReducer.class);
        // set the output key type
        job.setOutputKeyClass(Text.class);
        // set the output value type
        job.setOutputValueClass(IntWritable.class);
        // set the input file path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // set the output file path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // wait for the job to finish; passing true prints progress logs
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

(4) Execute job

Package the MapReduce code into mapreduce-course-1.0-SNAPSHOT.jar.

Prepare a reasonably large text file big.txt and upload it to HDFS:

hadoop  fs  -mkdir  -p  /user/hadoop-twq/mr/count/input

hadoop  fs  -put  big.txt   /user/hadoop-twq/mr/count/input/

yarn jar  mapreduce-course-1.0-SNAPSHOT.jar  com.dzx.hadoopdemo.mapred.DistributeCount   /user/hadoop-twq/mr/count/input/big.txt     /user/hadoop-twq/mr/count/output 

After the job finishes, a result file is generated under the output directory. Its content shows count 21000104, meaning the big.txt text file contains more than 21 million lines.

If the job is run again, it fails with an error that the output directory already exists; the old output directory must be deleted first.

4. The relationship between blocks and map input splits

One block -> one input split

A file smaller than one block -> one input split

Assuming a block size of 256 MB, a 326 MB big.txt file is stored as two blocks, so while the job runs you can see two corresponding map tasks on the YARN monitoring interface. The same can be seen in the job's log output.
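The split size the framework actually uses follows the well-known FileInputFormat rule; a simplified sketch with illustrative values:

// split size used by FileInputFormat (simplified sketch):
// splitSize = max(minSize, min(maxSize, blockSize))
long blockSize = 256L * 1024 * 1024;   // HDFS block size, e.g. 256 MB
long minSize = 1L;                      // mapreduce.input.fileinputformat.split.minsize
long maxSize = Long.MAX_VALUE;          // mapreduce.input.fileinputformat.split.maxsize
long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
// a 326 MB file therefore yields two splits: roughly 256 MB and 70 MB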

5. How MapReduce runs on YARN

// set the number of reduce tasks
job.setNumReduceTasks(2);

RM refers to YARN's ResourceManager.

6. MapReduce memory and CPU resource configuration

Add the following configuration to mapred-site.xml.
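The original configuration listing is not reproduced here; a sketch of such settings is shown below. The property names are standard MapReduce properties, and the values mirror the ones set programmatically in the WordCount driver later in this article:

<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>512</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx250m</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>400</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx200m</value>
</property>
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>1</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>400</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx200m</value>
</property>
<property>
  <name>mapreduce.reduce.cpu.vcores</name>
  <value>1</value>
</property>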

Then synchronize the above configuration to slave1 and slave2 

scp  mapred-site.xml  hadoop-twq@slave1:~/bigdata/hadoop-2.7.5/etc/hadoop/

scp  mapred-site.xml  hadoop-twq@slave2:~/bigdata/hadoop-2.7.5/etc/hadoop/

7. Combiner in MapReduce

(1) Combiner explained

A combiner pre-aggregates the map output on each machine, which reduces the amount of data transferred over the network to the reducers and improves performance.

In code: job.setCombinerClass(IntSumReducer.class);

8. Use MapReduce to implement WordCount

(1) Code writing

package com.dzx.hadoopdemo.mapred;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;

/**
 * @author duanzhaoxu
 * @date 2020-12-25 11:06:53
 */
public class WordCount {

    public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
        private Text text = new Text();
        private final static IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // split each line into words and emit (word, 1)
            StringTokenizer stringTokenizer = new StringTokenizer(value.toString());
            while (stringTokenizer.hasMoreTokens()) {
                text.set(stringTokenizer.nextToken());
                context.write(text, ONE);
            }
        }
    }

    public static class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable res = new IntWritable(0);

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // sum the counts for each word
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            res.set(sum);
            context.write(key, res);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();

        // delete the output directory if it already exists, otherwise the job fails
        Path outputPath = new Path(args[1]);
        FileSystem fileSystem = FileSystem.get(configuration);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }

        Job job = Job.getInstance(configuration, "word-count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReduce.class);
        job.setReducerClass(WordCountReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // per-job memory and CPU settings for the AM, map tasks and reduce tasks
        job.getConfiguration().set("yarn.app.mapreduce.am.resource.mb", "512");
        job.getConfiguration().set("yarn.app.mapreduce.am.command-opts", "-Xmx250m");
        job.getConfiguration().set("yarn.app.mapreduce.am.resource.cpu-vcores", "1");
        job.getConfiguration().set("mapreduce.map.memory.mb", "400");
        job.getConfiguration().set("mapreduce.map.java.opts", "-Xmx200m");
        job.getConfiguration().set("mapreduce.map.cpu.vcores", "1");
        job.getConfiguration().set("mapreduce.reduce.memory.mb", "400");
        job.getConfiguration().set("mapreduce.reduce.java.opts", "-Xmx200m");
        job.getConfiguration().set("mapreduce.reduce.cpu.vcores", "1");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, outputPath);
        System.out.println(job.waitForCompletion(true) ? 0 : 1);
    }
}

Package the code into mapreduce-wordcount.jar, upload it to the server, and run the following command:

hadoop jar  mapreduce-wordcount.jar  com.dzx.hadoopdemo.mapred.WordCount  /user/hadoop-twq/mr/count/input/big_word.txt     /user/hadoop-twq/mr/count/output  

After the job completes, check the result file in the output directory; its content shows that the word counts have been computed.

Increase the virtual memory configuration.
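The original post does not reproduce the exact setting, but the behavior described next (virtual memory quadrupled) is what raising yarn.nodemanager.vmem-pmem-ratio in yarn-site.xml achieves; a sketch, assuming a ratio of 4:

<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>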

After restarting YARN, you can see that the allowed virtual memory is 4 times the physical memory.

(2) Detailed explanation of the word count program: shuffle

When the job sets the number of reduce tasks to 2:

In the combine phase of the map task, the map output is sorted by key in natural (alphabetical) order.

(3) Custom partitioner

When the number of reduce tasks is set to 2, the job produces two result files. How the data is divided between them is determined by the partitioning rule.

By default, Hadoop partitions records according to the hash value of the key.

In effect, each word's hash value is taken modulo 2 (the number of reduce tasks).
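This is what the default HashPartitioner does; a minimal sketch of its logic (types match the word count job):

// logic of Hadoop's default HashPartitioner (sketch)
public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}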

Custom partitioner

package com.dzx.hadoopdemo.mapred;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * @author duanzhaoxu
 * @date 2020-12-25 14:34:47
 */
public class CustomPartitiner extends Partitioner<Text, IntWritable> {

    // custom partitioner: keys containing "s" go to partition 0, all other keys go to partition 1
    @Override
    public int getPartition(Text text, IntWritable intWritable, int numPartitions) {
        if (text.toString().contains("s")) {
            return 0;
        }
        return 1;
    }
}
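For the partitioner to take effect it has to be registered on the job together with two reduce tasks; a sketch based on the driver code shown earlier:

job.setPartitionerClass(CustomPartitiner.class);
job.setNumReduceTasks(2);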

Repackage the jar, upload it to the server, and run the job again: keys containing "s" are written to the part0 output file, and keys without "s" are written to the part1 output file.

(4) MapReduce applications

1. The distinct problem

Deduplication falls out naturally from MapReduce keys: emit each map input value as the map output key, and the shuffle groups identical keys together so the reducer outputs each distinct value only once.
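A minimal sketch of this idea (class names are illustrative, not from the original post):

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class Distinct {

    public static class DistinctMapper extends Mapper<Object, Text, Text, NullWritable> {
        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // emit the value itself as the key; duplicates collapse during the shuffle
            context.write(value, NullWritable.get());
        }
    }

    public static class DistinctReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            // each distinct key is written exactly once
            context.write(key, NullWritable.get());
        }
    }
}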

2. distcp

Copy data from the HDFS cluster at nn1 to the cluster at nn2:

hadoop distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target

9. Hadoop compression mechanism

Compression encodes data with an algorithm so that it occupies less storage space; the reverse process is decompression.

Every compression tool trades off time against space; in big data, the splittability of compressed files also has to be considered.

Compression codecs supported by Hadoop include DEFLATE (DefaultCodec), gzip, bzip2, and Snappy.
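For example, job output compression can be enabled in the driver; a sketch, assuming job is the Job instance from the earlier driver code and using gzip purely as an illustration:

// compress the final job output with gzip
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, org.apache.hadoop.io.compress.GzipCodec.class);

// compress the intermediate map output as well (reduces shuffle traffic)
job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);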

10. Avro row storage and Parquet column storage (not yet updated)

11. Reading and writing Avro and Parquet files (important) (not yet updated)

12. Reading and writing SequenceFile files (not yet updated)

Origin blog.csdn.net/qq_31905135/article/details/111608474