Hadoop Learning: In-Depth Analysis of MapReduce Data Compression (4)
4.1 Overview
1) The advantages and disadvantages of compression
Advantages of compression: reduced disk IO and reduced disk storage space.
Disadvantage of compression: increased CPU overhead.
2) Compression principle
(1) For compute-intensive jobs, use compression sparingly.
(2) For IO-intensive jobs, use compression liberally.
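To make the IO-versus-CPU tradeoff concrete, here is a minimal standalone sketch using only the JDK's java.util.zip (no Hadoop required; Hadoop's DefaultCodec uses the same DEFLATE algorithm). It compresses repetitive, log-like text, the kind of input where spending CPU on compression pays off in saved IO. The class name and sample data are made up for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

public class CompressionTradeoff {
    // Compress a byte array with gzip (the same DEFLATE algorithm used by
    // Hadoop's DefaultCodec) and return the compressed size in bytes
    static int gzipSize(byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.size();
    }

    public static void main(String[] args) {
        // Repetitive, log-like data, typical of IO-intensive MapReduce inputs
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("2023-01-01 INFO request handled ok\n");
        }
        byte[] data = sb.toString().getBytes();
        // The CPU time spent compressing buys a large reduction in disk/network IO
        System.out.println("original bytes:   " + data.length);
        System.out.println("compressed bytes: " + gzipSize(data));
    }
}
```

The compressed size is a tiny fraction of the original here; on already-compressed or random data the savings would shrink while the CPU cost stays, which is exactly the tradeoff the two principles above describe.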
4.2 Compression codecs supported by MR
1) Comparison of compression algorithms
2) Comparison of compression performance
4.3 Selection of compression method
Key considerations when choosing a compression method: compression/decompression speed, compression ratio (the size of the data after compression), and whether the compressed file can still be split.
4.3.1 Gzip compression
Advantages: relatively high compression ratio.
Disadvantages: does not support Split; average compression/decompression speed.
4.3.2 Bzip2 compression
Advantages: very high compression ratio; supports Split.
Disadvantages: slow compression/decompression speed.
4.3.3 Lzo compression
Advantages: relatively fast compression/decompression speed; supports Split.
Disadvantages: average compression ratio; an additional index must be created to support splitting.
4.3.4 Snappy compression
Advantages: fast compression and decompression speed.
Disadvantages: does not support Split; average compression ratio.
4.3.5 Compression location selection
Compression can be enabled at any stage of the MapReduce process: on the input files before the Map phase, on the intermediate Map output, and on the final Reduce output.
4.4 Compression parameter configuration
1) To support multiple compression/decompression algorithms, Hadoop introduces codecs (coder/decoder classes).
2) To enable compression in Hadoop, configure the following parameters.
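As a sketch of what such site-wide configuration might look like in mapred-site.xml (the parameter names are the standard Hadoop 2.x/3.x ones; the codec choices below are illustrative, not defaults):

```xml
<!-- Enable compression of intermediate map output (default: false) -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<!-- Codec for map output (default: org.apache.hadoop.io.compress.DefaultCodec) -->
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<!-- Enable compression of the final job output (default: false) -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<!-- Codec for the final job output -->
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
```

The same properties can also be set per job on the Configuration object, as the driver code in 4.5 does.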
4.5 Compression Practical Cases
4.5.1 Map output using compression
Even if the input and output files of your MapReduce job are uncompressed, you can still compress the intermediate output of the Map tasks: that output must be written to local disk and transmitted over the network to the Reduce nodes, so compressing it can greatly improve performance. Enabling this requires setting only two properties; the code below shows how.
1) The compression formats supported by the Hadoop source code out of the box are BZip2Codec and DefaultCodec.
package com.atguigu.mapreduce.compress;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCountDriver {
public static void main(String[] args) throws IOException,
ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
// Enable map-side output compression
conf.setBoolean("mapreduce.map.output.compress", true);
// Set the map-side output compression codec
conf.setClass("mapreduce.map.output.compress.codec",
BZip2Codec.class,CompressionCodec.class);
Job job = Job.getInstance(conf);
job.setJarByClass(WordCountDriver.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}
}
2) Mapper remains unchanged
package com.atguigu.mapreduce.compress;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCountMapper extends Mapper<LongWritable, Text, Text,
IntWritable>{
Text k = new Text();
IntWritable v = new IntWritable(1);
@Override
protected void map(LongWritable key, Text value, Context
context)throws IOException, InterruptedException {
// 1 Get one line
String line = value.toString();
// 2 Split into words
String[] words = line.split(" ");
// 3 Write out each word
for(String word:words){
k.set(word);
context.write(k, v);
}
}
}
3) Reducer remains unchanged
package com.atguigu.mapreduce.compress;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text, IntWritable, Text,
IntWritable>{
IntWritable v = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
// 1 Sum the counts
for(IntWritable value:values){
sum += value.get();
}
v.set(sum);
// 2 Write the result
context.write(key, v);
}
}
4.5.2 Compression is used at the output of Reduce
Based on WordCount case processing.
1) Modify the driver
package com.atguigu.mapreduce.compress;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.Lz4Codec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCountDriver {
public static void main(String[] args) throws IOException,
ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(WordCountDriver.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// Enable compression of the reduce (job) output
FileOutputFormat.setCompressOutput(job, true);
// Set the compression codec
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
// Alternatives:
// FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
// FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
boolean result = job.waitForCompletion(true);
System.exit(result?0:1);
}
}
2) Mapper and Reducer remain unchanged (see 4.5.1 for details)
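When GzipCodec is chosen instead, the part-r-00000.gz files Hadoop writes use standard gzip framing, so they can be inspected with gunzip or plain JDK streams. A minimal round-trip sketch (JDK only, no Hadoop on the classpath; the file name and records are made up for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ReadGzipOutput {
    // Write a gzip-compressed text file, as Hadoop's GzipCodec would
    static void writeGzipFile(Path p, String text) {
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(p)))) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read an entire gzip-compressed text file back
    static String readGzipFile(Path p) {
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(p))))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = r.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a small gzip-compressed reduce output file
        Path p = Files.createTempFile("part-r-00000", ".gz");
        writeGzipFile(p, "atguigu\t3\nhadoop\t2\n");
        System.out.print(readGzipFile(p)); // the two tab-separated records
        Files.delete(p);
    }
}
```

Bzip2 output, by contrast, is not readable with java.util.zip; it needs the Hadoop codec classes or an external bzip2 tool.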
Common Errors and Solutions
1) Package imports are error-prone, especially for Text and CombineTextInputFormat.
2) The first input key type in Mapper must be LongWritable or NullWritable, not IntWritable; otherwise a type conversion (ClassCastException) error is reported.
3) java.lang.Exception: java.io.IOException: Illegal partition for 13926435656 (4) indicates that the number of partitions and the number of ReduceTasks do not match; adjust the number of ReduceTasks.
4) If the number of partitions is greater than 1 but there is only one ReduceTask, is the partitioning step executed? No: in the MapTask source code, partitioning runs only when the number of reducers is greater than 1; with a single reducer it is never executed.
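The check described in point 4 can be sketched as follows (a simplified illustration of the MapTask collector logic, not the actual Hadoop source):

```java
public class PartitionSketch {
    // Simplified model of how MapTask chooses a partition for a record;
    // this mirrors the behavior described above, not the real Hadoop code
    static int choosePartition(Object key, int numReduceTasks) {
        if (numReduceTasks > 1) {
            // Only in this branch does the (custom or default) Partitioner run
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
        // With a single ReduceTask the partitioner is skipped entirely
        return numReduceTasks - 1; // always partition 0
    }

    public static void main(String[] args) {
        System.out.println(choosePartition("13926435656", 1)); // prints 0
        System.out.println(choosePartition("13926435656", 4)); // 0..3, via hashing
    }
}
```

This is why a buggy custom Partitioner may go unnoticed until the ReduceTask count is raised above 1.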
5) A jar compiled in the Windows environment, when imported and run in the Linux environment with
hadoop jar wc.jar com.atguigu.mapreduce.wordcount.WordCountDriver /user/atguigu/ /user/atguigu/output
reports the following error:
Exception in thread "main" java.lang.UnsupportedClassVersionError: com/atguigu/mapreduce/wordcount/WordCountDriver : Unsupported major.minor version 52.0
The reason is that the Windows environment uses JDK 1.7 while the Linux environment uses JDK 1.8.
Solution: unify the JDK version.
6) In the case of caching small pd.txt files, it is reported that the pd.txt file cannot be found.
Reason: usually the path is written incorrectly. Also check whether the file is actually named pd.txt.txt. On a few machines a relative path fails to locate pd.txt; switching to an absolute path fixes it.
7) A type conversion exception is reported.
Usually this is a mistake when setting the Map output types and the final output types in the driver. A type conversion exception is also reported if the keys output by Map cannot be sorted.
8) When running wc.jar in the cluster, the input file cannot be obtained.
Reason: The input file of the WordCount case cannot be placed in the root directory of the HDFS cluster.
9) The following related exceptions occur:
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:609)
at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:356)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:371)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:364)
Solution: copy the hadoop.dll file to the Windows directory C:\Windows\System32. Some machines additionally need the Hadoop source code modified.
Solution 2: create a package named org.apache.hadoop.io.nativeio in the project and copy NativeIO.java into it.
10) When customizing an OutputFormat, note that the close() method of the RecordWriter must close the stream resources; otherwise the output files end up empty.
@Override
public void close(TaskAttemptContext context) throws IOException, InterruptedException {
if (atguigufos != null) {
atguigufos.close();
}
if (otherfos != null) {
otherfos.close();
}
}
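Why closing matters can be demonstrated with plain JDK buffered streams (no Hadoop needed): until close() or flush() is called, buffered bytes never reach the file, which is exactly why an unclosed RecordWriter yields empty output files. The class and file names below are made up for illustration:

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CloseDemo {
    // Returns {file size before close, file size after close}
    static long[] demo() {
        try {
            Path out = Files.createTempFile("part-r-00000", ".txt");
            BufferedOutputStream fos =
                    new BufferedOutputStream(new FileOutputStream(out.toFile()));
            fos.write("atguigu\t1\n".getBytes()); // 10 bytes, still in the buffer
            long before = Files.size(out);        // file is still empty
            fos.close();                          // flushes the buffer to disk
            long after = Files.size(out);         // now the 10 bytes are on disk
            Files.delete(out);
            return new long[] {before, after};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] sizes = demo();
        System.out.println("before close: " + sizes[0] + " bytes");
        System.out.println("after close:  " + sizes[1] + " bytes");
    }
}
```

In a RecordWriter, the framework calls close(TaskAttemptContext) for you at the end of the task; the bug in point 10 is simply forgetting to close the wrapped streams inside it.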