Hadoop Learning: In-Depth Analysis of MapReduce Data Compression (4)

4.1 Overview

1) Advantages and disadvantages of compression

Advantages of compression: reduces disk IO and saves disk storage space.
Disadvantages of compression: increases CPU overhead.

2) Principles for using compression

(1) Use compression sparingly for compute-intensive jobs.
(2) Use compression more aggressively for IO-intensive jobs.

4.2 Compression codecs supported by MR

1) Comparison of compression algorithms
[Figure: comparison of the compression formats supported by Hadoop]

2) Comparison of compression performance
[Figure: compression ratio and compression/decompression speed of the common codecs]

4.3 Selection of compression method

Key considerations when choosing a compression method: compression/decompression speed, compression ratio (the size of the data after compression), and whether the compressed file can still be split for parallel processing.

4.3.1 Gzip compression

Advantages: relatively high compression ratio.
Disadvantages: does not support splitting (Split); average compression/decompression speed.

4.3.2 Bzip2 compression

Advantages: high compression ratio; supports splitting (Split).
Disadvantages: slow compression/decompression speed.

4.3.3 Lzo compression

Advantages: relatively fast compression/decompression speed; supports splitting (Split).
Disadvantages: average compression ratio; an additional index must be built to support splitting.

4.3.4 Snappy compression

Advantages: fast compression and decompression.
Disadvantages: does not support splitting (Split); average compression ratio.

4.3.5 Compression location selection

Compression can be enabled at any of the three stages of a MapReduce job: the input data, the map output (shuffle data), and the reduce output.
[Figure: where compression can be applied in the MapReduce data flow]

4.4 Compression parameter configuration

1) To support multiple compression/decompression algorithms, Hadoop provides codec (coder/decoder) classes.
[Figure: compression formats and their corresponding codec classes]
2) To enable compression in Hadoop, configure the following parameters (a per-job example is sketched below).
[Figure: compression-related configuration parameters and their defaults]
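
These parameters can be set cluster-wide in core-site.xml / mapred-site.xml or per job on the Configuration object. Below is a minimal per-job sketch; the parameter names are the ones used by Hadoop 2.x/3.x, and the codec choices here are only examples, not recommendations.

// Set on the Configuration before calling Job.getInstance(conf) in the driver
Configuration conf = new Configuration();

// Map output (shuffle) compression; defaults: false / DefaultCodec
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);

// Final (reduce) output compression; defaults: false / DefaultCodec
conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
conf.setClass("mapreduce.output.fileoutputformat.compress.codec",
        BZip2Codec.class, CompressionCodec.class);

// Input files need no extra setting: the decompression codec is chosen
// automatically from the file extension (the io.compression.codecs property
// in core-site.xml can register additional codec classes).

The codec classes referenced above live in the org.apache.hadoop.io.compress package (SnappyCodec, BZip2Codec, CompressionCodec).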

4.5 Compression Practical Cases

4.5.1 Using compression on the Map output

Even if the input and output files of your MapReduce job are uncompressed, you can still compress the intermediate output of the Map tasks. Because this intermediate data is written to local disk and transferred over the network to the Reduce nodes, compressing it can improve performance significantly. Enabling it only requires setting two properties; the code below shows how.
1) Codecs shipped with Hadoop that can be used here include BZip2Codec and DefaultCodec.

package com.atguigu.mapreduce.compress;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        Configuration conf = new Configuration();

        // Enable compression of the map output
        conf.setBoolean("mapreduce.map.output.compress", true);

        // Set the codec used for the map output
        conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf);

        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean result = job.waitForCompletion(true);

        System.exit(result ? 0 : 1);
    }
}
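
Note that map-side compression is transparent to the rest of the job: only the intermediate spill files and the shuffle traffic are compressed, and the final output written to args[1] is still uncompressed plain text.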

2) Mapper remains unchanged

package com.atguigu.mapreduce.compress;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // 1 Read one line
        String line = value.toString();

        // 2 Split it into words
        String[] words = line.split(" ");

        // 3 Write out each word with a count of 1
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}

3) Reducer remains unchanged

package com.atguigu.mapreduce.compress;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        int sum = 0;

        // 1 Sum the counts for this word
        for (IntWritable value : values) {
            sum += value.get();
        }

        v.set(sum);

        // 2 Write out the total
        context.write(key, v);
    }
}

4.5.2 Using compression on the Reduce output

This builds on the WordCount case.
1) Modify the driver

package com.atguigu.mapreduce.compress;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.Lz4Codec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);

        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Enable compression of the reduce (job) output
        FileOutputFormat.setCompressOutput(job, true);

        // Set the output codec (alternatives shown commented out)
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
        // FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        // FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);

        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
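
A quick way to verify the setting: with BZip2Codec the reduce output files carry a .bz2 extension (for example part-r-00000.bz2); GzipCodec would produce .gz files and DefaultCodec .deflate files. The compressed output can still be consumed directly by a follow-up MapReduce job, or inspected with hdfs dfs -text.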

2) Mapper and Reducer remain unchanged (see 4.5.1 for details)

Common Errors and Solutions

1) Importing the wrong package is a common mistake, especially for Text (it must be org.apache.hadoop.io.Text) and CombineTextInputFormat.

2) The Mapper's first input type parameter must be LongWritable or NullWritable, not IntWritable; otherwise a type conversion exception (ClassCastException) is reported.

3) java.lang.Exception: java.io.IOException: Illegal partition for 13926435656 (4) indicates that the number of partitions and the number of ReduceTasks do not match; adjust the number of ReduceTasks.

4) If the custom Partitioner defines more than one partition but only one ReduceTask is configured, is the partitioning step executed? No: in the MapTask source code, the user's Partitioner is only used when the number of ReduceTasks is greater than 1; otherwise partitioning is skipped, as shown in the sketch below.
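
The relevant check looks roughly like the sketch below (paraphrased from the output-collector setup in MapTask, not the verbatim source; jobContext, job and partitioner are fields of the surrounding class):

int partitions = jobContext.getNumReduceTasks();
if (partitions > 1) {
    // Only here is the user-defined (or default hash) Partitioner instantiated and used.
    partitioner = (Partitioner<K, V>) ReflectionUtils.newInstance(jobContext.getPartitionerClass(), job);
} else {
    // With a single ReduceTask every record is assigned partition 0; the custom Partitioner is never called.
    partitioner = new Partitioner<K, V>() {
        @Override
        public int getPartition(K key, V value, int numPartitions) {
            return partitions - 1; // always 0 when partitions == 1
        }
    };
}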

5) A jar built in a Windows environment fails when run in a Linux environment. Running

hadoop jar wc.jar com.atguigu.mapreduce.wordcount.WordCountDriver /user/atguigu/ /user/atguigu/output

reports the following error:

Exception in thread "main" java.lang.UnsupportedClassVersionError: com/atguigu/mapreduce/wordcount/WordCountDriver : Unsupported major.minor version 52.0

Cause: class file version 52.0 corresponds to JDK 1.8, so the jar was compiled with JDK 1.8 on Windows but is being run on an older JDK (1.7) in the Linux environment.
Solution: use the same JDK version in both environments.
6) In the case that caches the small pd.txt file, an error reports that pd.txt cannot be found.
Reason: usually the path is written incorrectly. Also check whether the file is actually named pd.txt.txt. On a few machines a relative path fails to locate pd.txt; switching to an absolute path fixes it.

7) A type conversion exception is reported.
This is usually caused by mismatched Map output or final output types set in the driver. It is also reported if the Map output key type cannot be sorted (does not implement WritableComparable).

8) When running wc.jar on the cluster, the input file cannot be read.
Reason: the input file for the WordCount case must not be placed in the root directory of the HDFS cluster.
9) The following exception occurs:

Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
 at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
 at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:609)
 at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
 at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:356)
 at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:371)
 at org.apache.hadoop.util.Shell.<clinit>(Shell.java:364)

Solution 1: copy hadoop.dll into the Windows directory C:\Windows\System32. On some machines the Hadoop source also needs to be modified.
Solution 2: create a package named org.apache.hadoop.io.nativeio in your own project and copy NativeIO.java into it.
10) When writing a custom OutputFormat, note that the close method of the RecordWriter must close the stream resources; otherwise the output files will be empty.

@Override
public void close(TaskAttemptContext context) throws IOException, InterruptedException {
    if (atguigufos != null) {
        atguigufos.close();
    }
    if (otherfos != null) {
        otherfos.close();
    }
}
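
An equivalent version can use Hadoop's IOUtils helper, which is null-safe and logs (rather than propagates) failures on close. This assumes atguigufos and otherfos are the FSDataOutputStream fields of this RecordWriter:

@Override
public void close(TaskAttemptContext context) throws IOException, InterruptedException {
    // org.apache.hadoop.io.IOUtils.closeStream ignores null streams and swallows close errors
    IOUtils.closeStream(atguigufos);
    IOUtils.closeStream(otherfos);
}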

Origin blog.csdn.net/m0_66106755/article/details/132365080