1. MapReduce installation
(1) Overview of distributed computing
Visit master:8088 to check whether YARN has started successfully.
(2) Verify that MapReduce is installed successfully
Run the regular-expression matching (grep) example included in the Hadoop installation package.
The console shows the following output, indicating that the MapReduce job is running; the job's execution record also appears on the YARN monitoring interface.
2. Hadoop serialization mechanism
Use Hadoop's Writable interface to implement serialization
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.5</version>
</dependency>
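import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.ToString;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableFactories;
// The Lombok annotations generate the constructors, getters/setters and toString()
@Data
@NoArgsConstructor
@AllArgsConstructor
@ToString
public class BlockWritable implements Writable {
    private long blockId;
    private long numBytes;
    private long generationStamp;
    // Serialize: write the fields in a fixed order
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(blockId);
        out.writeLong(numBytes);
        out.writeLong(generationStamp);
    }
    // Deserialize: read the fields back in exactly the same order
    @Override
    public void readFields(DataInput in) throws IOException {
        this.blockId = in.readLong();
        this.numBytes = in.readLong();
        this.generationStamp = in.readLong();
    }
    public static void main(String[] args) throws IOException {
        // Serialization
        BlockWritable blockWritable = new BlockWritable(34234L, 234324345L, System.currentTimeMillis());
        try (DataOutputStream dataOutputStream = new DataOutputStream(new FileOutputStream("D:/block.txt"))) {
            blockWritable.write(dataOutputStream);
        }
        // Deserialization (newInstance requires the no-argument constructor)
        Writable writable = WritableFactories.newInstance(BlockWritable.class);
        try (DataInputStream dataInputStream = new DataInputStream(new FileInputStream("D:/block.txt"))) {
            writable.readFields(dataInputStream);
        }
        System.out.println((BlockWritable) writable);
    }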
}
Hadoop provides its own serialization mechanism. A Writable writes only the raw field values, so the serialized file is much smaller than with Java serialization. With large amounts of data, this improves performance considerably.
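To make the size difference concrete, here is a minimal sketch that serializes the same three long fields both ways and compares the byte counts. The class Block is a hypothetical stand-in for BlockWritable that mimics its write method with plain JDK streams, so the example runs without the Hadoop dependency; it is an illustration, not a benchmark.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class SerializationSizeSketch {

    // Hypothetical stand-in for BlockWritable; it does not depend on Hadoop.
    static final class Block implements Serializable {
        private static final long serialVersionUID = 1L;
        final long blockId;
        final long numBytes;
        final long generationStamp;

        Block(long blockId, long numBytes, long generationStamp) {
            this.blockId = blockId;
            this.numBytes = numBytes;
            this.generationStamp = generationStamp;
        }

        // Same layout as BlockWritable.write: three raw longs and no metadata.
        void write(DataOutputStream out) throws IOException {
            out.writeLong(blockId);
            out.writeLong(numBytes);
            out.writeLong(generationStamp);
        }
    }

    // Size in bytes when only the field values are written (Writable style).
    static int writableStyleSize(Block block) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            block.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Size in bytes with standard Java serialization, which also writes
    // a stream header and a description of the class.
    static int javaSerializationSize(Block block) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(block);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        Block block = new Block(34234L, 234324345L, System.currentTimeMillis());
        System.out.println("Writable-style bytes: " + writableStyleSize(block));
        System.out.println("Java serialization bytes: " + javaSerializationSize(block));
    }
}
```

The Writable-style output is exactly 3 x 8 = 24 bytes, while Java serialization adds class metadata on top of the same 24 bytes of field data.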
3. Use MapReduce to implement distributed text line counting
(1) Calculation of the number of distributed text lines
(2) Add mapReduce dependency to the project
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>2.7.5</version>
</dependency>
(3) Write mapReduce code
package com.dzx.hadoopdemo.mapred;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
/**
* @author duanzhaoxu
* @ClassName:
* @Description:
* @date 2020-12-24 14:28:59
*/
public class DistributeCount {
    // Emits ("count", 1) for every input line, so the sum per key is the line count
    public static class ToOneMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private Text text = new Text();
        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            this.text.set("count");
            context.write(this.text, ONE);
        }
    }
    // Sums all values that share the same key
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable(0);
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Create the job
        Job job = Job.getInstance(configuration, "distribute-count");
        // Set the main class used to locate the jar
        job.setJarByClass(DistributeCount.class);
        // Set the mapper class
        job.setMapperClass(ToOneMapper.class);
        // job.setCombinerClass(IntSumReducer.class);
        // Set the reducer class
        job.setReducerClass(IntSumReducer.class);
        // Set the key type of the output
        job.setOutputKeyClass(Text.class);
        // Set the value type of the output
        job.setOutputValueClass(IntWritable.class);
        // Set the input path; it must be set on the job itself, using the
        // org.apache.hadoop.mapreduce.lib classes that match the Job API
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Set the output path for the result files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Wait for the job to finish before exiting; passing true prints progress logs
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
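To see why the job's output is a single count, the same map, shuffle and reduce steps can be traced in plain Java. This is only a local sketch of the data flow with hypothetical method names; it does not use the Hadoop API and runs in a single JVM.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class LocalLineCountSketch {

    // Map phase: like ToOneMapper, emit the pair ("count", 1) for every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            pairs.add(new AbstractMap.SimpleEntry<>("count", 1));
        }
        return pairs;
    }

    // Shuffle phase: group all values that share a key (keys end up sorted).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: like IntSumReducer, sum the values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            result.put(entry.getKey(), sum);
        }
        return result;
    }

    // Runs the three phases in order and returns the line count.
    static int countLines(List<String> lines) {
        return reduce(shuffle(map(lines))).getOrDefault("count", 0);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("first line", "second line", "third line");
        System.out.println("count " + countLines(lines));
    }
}
```

Because every line is mapped to the same key, one reduce call receives all the ones and their sum is the number of lines.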
(4) Execute job
Package the written mapReduce code into mapreduce-course-1.0-SNAPSHOT.jar
Prepare a larger text file big.txt and upload it to hdfs
hadoop fs -mkdir -p /user/hadoop-twq/mr/count/input
hadoop fs -put big.txt /user/hadoop-twq/mr/count/input/
yarn jar mapreduce-course-1.0-SNAPSHOT.jar com.dzx.hadoopdemo.mapred.DistributeCount /user/hadoop-twq/mr/count/input/big.txt /user/hadoop-twq/mr/count/output
As shown in the figure, after the job finishes, a result file is generated under the output directory. Its content is count 21000104, which means that big.txt contains 21,000,104 lines (more than 21 million).
If the job is executed again, Hadoop reports an error that the output directory already exists, so the previous output must be deleted first (for example with hadoop fs -rm -r /user/hadoop-twq/mr/count/output).
4. The relationship between blocks and map input splits
One block -> one input split
A file smaller than one block -> one input split
Assuming a block size of 256 MB, a 326 MB big.txt file is stored as two blocks, so when the job runs there are two corresponding map tasks on the YARN monitoring interface. This can also be seen in the job's log output.
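The split count for the example above can be checked with a small calculation. The sketch below is a simplified model of how FileInputFormat divides one non-empty file, assuming the split size equals the block size (the default when no minimum or maximum split size is configured). One detail beyond the rule of thumb above: the last split is allowed to be up to 10% larger than the split size, so a file only slightly bigger than one block still yields a single split. The class and method names are hypothetical.

```java
final class SplitCountSketch {

    // The last split may be up to 10% larger than the split size.
    private static final double SPLIT_SLOP = 1.1;

    // Number of input splits for one non-empty file, assuming splitBytes == block size.
    static int countSplits(long fileBytes, long splitBytes) {
        if (fileBytes <= 0 || splitBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        int splits = 0;
        long remaining = fileBytes;
        // Cut full-size splits while clearly more than one split's worth remains.
        while ((double) remaining / splitBytes > SPLIT_SLOP) {
            splits++;
            remaining -= splitBytes;
        }
        // Whatever is left becomes the final split.
        if (remaining != 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println("326 MB file, 256 MB blocks: " + countSplits(326 * mb, 256 * mb) + " splits");
        System.out.println("100 MB file, 256 MB blocks: " + countSplits(100 * mb, 256 * mb) + " splits");
    }
}
```

For the 326 MB file this gives two splits (256 MB and 70 MB), which matches the two map tasks seen on the monitoring interface.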
5. How MapReduce runs on YARN
// Set the number of reduce tasks
job.setNumReduceTasks(2);
RM refers to YARN's ResourceManager
6. MapReduce memory and CPU resource configuration
Add the following configuration in mapred-site.xml
Then synchronize the above configuration to slave1 and slave2
scp mapred-site.xml hadoop-twq@slave1:~/bigdata/hadoop-2.7.5/etc/hadoop/
scp mapred-site.xml hadoop-twq@slave2:~/bigdata/hadoop-2.7.5/etc/hadoop/
7. Combiner in MapReduce
(1) Combiner explained
A combiner runs the reduce logic on each map task's local output before the shuffle, so fewer records are sent over the network to the reducers and performance improves.
Implementation in the code: job.setCombinerClass(IntSumReducer.class);
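The effect of a combiner can be illustrated by counting how many records leave the map side with and without it. The following is a plain-Java sketch with hypothetical names, not Hadoop code: each inner list stands for the keys emitted by one map task, and every emitted key carries the value 1, as in the line count job.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CombinerSketch {

    // Local reduce on one map task's output: one (key, partial sum) record per key.
    static Map<String, Integer> combine(List<String> emittedKeys) {
        Map<String, Integer> partialSums = new TreeMap<>();
        for (String key : emittedKeys) {
            partialSums.merge(key, 1, Integer::sum);
        }
        return partialSums;
    }

    // Without a combiner, every emitted record is sent to the reducers.
    static int shuffledWithoutCombiner(List<List<String>> mapOutputs) {
        int records = 0;
        for (List<String> output : mapOutputs) {
            records += output.size();
        }
        return records;
    }

    // With a combiner, each map task sends one record per distinct key.
    static int shuffledWithCombiner(List<List<String>> mapOutputs) {
        int records = 0;
        for (List<String> output : mapOutputs) {
            records += combine(output).size();
        }
        return records;
    }

    // Final reduce over the combined outputs; the totals are the same as without a combiner.
    static Map<String, Integer> reduce(List<List<String>> mapOutputs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (List<String> output : mapOutputs) {
            combine(output).forEach((key, sum) -> totals.merge(key, sum, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<List<String>> mapOutputs = Arrays.asList(
                Arrays.asList("count", "count", "count"),
                Arrays.asList("count", "count"));
        System.out.println("records without combiner: " + shuffledWithoutCombiner(mapOutputs));
        System.out.println("records with combiner: " + shuffledWithCombiner(mapOutputs));
        System.out.println("final result: " + reduce(mapOutputs));
    }
}
```

Reusing the reducer as the combiner works here because addition is associative and commutative; Hadoop may apply the combiner zero, one, or several times, so the result must not depend on it.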
8. Use mapReduce to implement wordCount
(1) Code writing
package com.dzx.hadoopdemo.mapred;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.util.StringTokenizer;
/**
* @author duanzhaoxu
* @ClassName:
* @Description:
* @date 2020-12-25 11:06:53
*/
public class WordCount {
    public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
        private Text text = new Text();
        private final static IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Split the line on whitespace and emit (word, 1) for each token
            // (an alternative is value.toString().split(" "))
            StringTokenizer stringTokenizer = new StringTokenizer(value.toString());
            while (stringTokenizer.hasMoreTokens()) {
                text.set(stringTokenizer.nextToken());
                context.write(text, ONE);
            }
        }
    }
    public static class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable res = new IntWritable(0);
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Sum the occurrences of one word
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            res.set(sum);
            context.write(key, res);
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Delete a previous output directory so the job can be rerun. The path is
        // resolved against the configured file system (HDFS on the cluster), not the local disk.
        Path outputPath = new Path(args[1]);
        FileSystem fileSystem = outputPath.getFileSystem(configuration);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }
        Job job = Job.getInstance(configuration, "word-count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReduce.class);
        job.setReducerClass(WordCountReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Memory and CPU settings for the ApplicationMaster, map tasks and reduce tasks
        Configuration jobConfiguration = job.getConfiguration();
        jobConfiguration.set("yarn.app.mapreduce.am.resource.mb", "512");
        jobConfiguration.set("yarn.app.mapreduce.am.command-opts", "-Xmx250m");
        jobConfiguration.set("yarn.app.mapreduce.am.resource.cpu-vcores", "1");
        jobConfiguration.set("mapreduce.map.memory.mb", "400");
        jobConfiguration.set("mapreduce.map.java.opts", "-Xmx200m");
        jobConfiguration.set("mapreduce.map.cpu.vcores", "1");
        jobConfiguration.set("mapreduce.reduce.memory.mb", "400");
        jobConfiguration.set("mapreduce.reduce.java.opts", "-Xmx200m");
        jobConfiguration.set("mapreduce.reduce.cpu.vcores", "1");
        // Input and output paths must be set on the job itself
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, outputPath);
        // Exit with 0 on success and 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Package the code into mapreduce-wordcount.jar, upload it to the server, and execute the following command:
hadoop jar mapreduce-wordcount.jar com.dzx.hadoopdemo.mapred.WordCount /user/hadoop-twq/mr/count/input/big_word.txt /user/hadoop-twq/mr/count/output
After the job completes, check the result output file. It contains the following content, which shows that the word counts have been computed.
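Increase the virtual memory configuration
After restarting YARN, you can see that the virtual memory limit has been enlarged by 4 times. (In YARN, the ratio of virtual to physical memory allowed per container is controlled by yarn.nodemanager.vmem-pmem-ratio in yarn-site.xml.)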
(2) Detailed explanation of the word count program: shuffle
When the job sets the number of reduce tasks to 2
As you can see in the figure, in the combine phase of each map task the map output is sorted by key in natural alphabetical order.
(3) Custom partitioner
When the number of reduce tasks is set to 2, the job produces two result files. How records are distributed between them is determined by the partitioning rule.
By default, Hadoop partitions records by the hash value of the key.
With two reduce tasks, this amounts to taking the hash value of each word modulo 2.
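The default rule can be written down in a few lines. The sketch below mirrors the formula used by Hadoop's HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks, but as an assumption for illustration it hashes a plain String; Hadoop's Text computes its own hash code, so the partition numbers on a real cluster can differ from the ones printed here.

```java
final class HashPartitionSketch {

    // Mask off the sign bit so the result is never negative, then take the
    // remainder by the number of reduce tasks.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] words = {"hadoop", "spark", "yarn", "hdfs"};
        for (String word : words) {
            System.out.println(word + " -> partition " + partitionFor(word, 2));
        }
    }
}
```

The same key always lands in the same partition, which is what guarantees that all counts for one word reach the same reduce task.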
Custom partitioner
package com.dzx.hadoopdemo.mapred;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
/**
* @author duanzhaoxu
* @ClassName:
* @Description:
* @date 2020-12-25 14:34:47
*/
public class CustomPartitiner extends Partitioner<Text, IntWritable> {
    // Custom partitioner: keys containing "s" go to partition 0, all other keys to partition 1.
    // This assumes the job runs with exactly two reduce tasks.
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.toString().contains("s")) {
            return 0;
        }
        return 1;
    }
}
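Register the partitioner with job.setPartitionerClass(CustomPartitiner.class), repackage the jar, upload it to the server, and run the job. The keys containing s are written to the output file of partition 0 (part-r-00000), and the keys that do not contain s are written to the output file of partition 1 (part-r-00001).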
(4) MapReduce application
1. The distinct problem
MapReduce groups records by key, so de-duplication comes almost for free: emit each input value as the map output key, and the reducer sees every distinct value exactly once.
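As a rough local illustration of this idea (plain Java with hypothetical names, not a Hadoop job), the sorted map below plays the role of the shuffle, which groups identical keys together, and the "reducer" simply outputs each key once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DistinctSketch {

    static List<String> distinct(List<String> inputValues) {
        // Map + shuffle: each input value becomes a key; identical keys are grouped.
        Map<String, Integer> grouped = new TreeMap<>();
        for (String value : inputValues) {
            grouped.merge(value, 1, Integer::sum);
        }
        // Reduce: emit every key once and ignore the grouped values.
        return new ArrayList<>(grouped.keySet());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("b", "a", "b", "c", "a");
        System.out.println(distinct(input));
    }
}
```

In a real job the mapper would write the value as the key with an empty value such as NullWritable, and the reducer would write the key only.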
2. distcp
Copy HDFS data from the cluster of NameNode nn1 to the cluster of NameNode nn2
hadoop distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
9. Hadoop compression mechanism
Compression encodes data with an algorithm so that it occupies less storage space; the reverse process is decompression.
Every compression tool trades time against space. In the field of big data you must also consider whether the compressed files are splittable.
The compression formats supported by Hadoop include DEFAULT (DEFLATE), bzip2, and Snappy
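The time and space trade-off can be tried out locally. The sketch below uses the JDK's built-in gzip streams (DEFLATE-based) as a stand-in; it does not use Hadoop's codec classes, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class CompressionSketch {

    // Compress a byte array with gzip.
    static byte[] compress(byte[] data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Restore the original bytes from gzip-compressed data.
    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                bytes.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Repetitive text compresses well; random data would not.
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            text.append("hadoop mapreduce yarn hdfs\n");
        }
        byte[] original = text.toString().getBytes(StandardCharsets.UTF_8);

        long start = System.nanoTime();
        byte[] compressed = compress(original);
        long elapsedMicros = (System.nanoTime() - start) / 1000;

        System.out.println("original bytes:   " + original.length);
        System.out.println("compressed bytes: " + compressed.length);
        System.out.println("compression time (microseconds): " + elapsedMicros);
        System.out.println("round trip ok: " + java.util.Arrays.equals(original, decompress(compressed)));
    }
}
```

Note that a plain gzip file is not splittable, so a large gzip-compressed input file is processed by a single map task, whereas bzip2 files can be split.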
10. Avro row storage and Parquet column storage (to be updated)
11. Reading and writing Avro files and Parquet files (important) (to be updated)
12. Reading and writing SequenceFile files (to be updated)