1. Overview of MapReduce
- Hadoop MapReduce is a distributed computing framework for writing batch applications. Programs written with it can be submitted to a Hadoop cluster for parallel processing of large-scale data sets.
- A MapReduce job works by splitting the input dataset into independent chunks, which are processed in parallel by the map tasks; the framework sorts the map output, which is then fed into the reduce tasks.
- The MapReduce framework works exclusively on <key, value> pairs: it treats the input of a job as a set of <key, value> pairs and produces a set of <key, value> pairs as output.
- The key and value types of both input and output must implement the Writable interface.
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
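Hadoop ships its own serializable types (Text, IntWritable, LongWritable, …) that satisfy this contract. To show what the contract amounts to, here is a minimal stand-alone sketch using only java.io; the Writable interface is re-declared locally and the WordCount value type is hypothetical for illustration, not part of Hadoop:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// A local stand-in for org.apache.hadoop.io.Writable, shown for illustration only
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical value type: a word plus a count, serialized field by field
class WordCount implements Writable {
    String word = "";
    int count;

    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);   // fields are written in a fixed order...
        out.writeInt(count);
    }

    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();  // ...and read back in the same order
        count = in.readInt();
    }
}
```

Hadoop calls write when shipping a record between nodes and readFields on the receiving side; keys additionally implement a compareTo so the framework can sort them.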
2. Brief description of MapReduce programming model
- input: read the text file;
- splitting: split the file by line; here K1 is the line offset and V1 is the text content of the corresponding line;
- mapping: split each line on spaces in parallel to obtain a List(K2, V2), where K2 is a word; since this is word-frequency counting, V2 is 1, meaning one occurrence;
- shuffling: since mapping may run in parallel on different machines, records with the same key must be distributed to the same node and merged; this yields, for each word, K2 paired with List(V2), an iterable collection whose elements are the V2 values produced in mapping;
- reducing: the goal here is to count the total occurrences of each word, so reducing sums List(V2) and outputs the result.
In the MapReduce programming model, splitting and shuffling are implemented by the framework; only mapping and reducing need to be implemented by us, which is the origin of the name MapReduce.
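The stages above can be sketched with plain Java collections. This is a model of the data flow only, not Hadoop code; the class and method names are made up for illustration:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountModel {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // mapping: emit a (word, 1) pair for every word on every line
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        // shuffling: group the 1s by word, so each word gets an iterable List(V2)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapped) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        // reducing: sum each word's list of 1s
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }
}
```

In real MapReduce the three phases run on different machines and the intermediate pairs cross the network; here they are just three loops over in-memory collections.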
3. MapReduce word frequency statistics case
3.1 Project Introduction
Here is a classic case of word frequency statistics: count the number of occurrences of each word in the following sample data.
Spark HBase
Hive Flink Storm Hadoop HBase Spark
Flink
HBase Storm
HBase Hadoop Hive Flink
HBase Flink Hive Storm
Hive Flink Hadoop
HBase Hive
Hadoop Spark HBase Storm
HBase Hadoop Hive Flink
HBase Flink Hive Storm
Hive Flink Hadoop
HBase Hive
For convenience, a utility class WordCountDataUtils is included in the project source code; it generates sample data for word-frequency counting, and the generated file can be written locally or directly to HDFS.
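WordCountDataUtils itself is not listed here; a minimal local-file version of such a generator might look like the following. The class name, the WORD_LIST contents, and the file layout are assumptions based on the sample shown above:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class WordCountDataUtilsSketch {
    // Word pool matching the sample data shown above
    static final List<String> WORD_LIST =
            Arrays.asList("Spark", "HBase", "Hive", "Flink", "Storm", "Hadoop");

    // Write `lines` lines, each holding 1..maxWordsPerLine random words, to a local file
    public static void generate(Path target, int lines, int maxWordsPerLine) throws IOException {
        Random random = new Random();
        List<String> out = new ArrayList<>();
        for (int i = 0; i < lines; i++) {
            int n = 1 + random.nextInt(maxWordsPerLine);
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < n; j++) {
                if (j > 0) sb.append(' ');
                sb.append(WORD_LIST.get(random.nextInt(WORD_LIST.size())));
            }
            out.add(sb.toString());
        }
        Files.write(target, out);
    }
}
```

Writing directly to HDFS instead would swap java.nio for Hadoop's FileSystem API, as shown later in WordCountApp.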
3.2 Project dependencies
To program with MapReduce, you need to import the hadoop-client dependency:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
</dependency>
3.3 WordCountMapper
WordCountMapper splits each row of data by the specified separator. Note that Hadoop's own types must be used in MapReduce, because they are serializable and comparable: they all implement the WritableComparable interface.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on the tab separator and emit (word, 1) for every token
        String[] words = value.toString().split("\t");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
WordCountMapper corresponds to the Mapping operation in the following figure:
WordCountMapper inherits from the Mapper class, which is a generic class defined as follows:
WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
......
}
- KEYIN: the type of the mapping input key, i.e. the offset of each line (the position of the line's first character within the whole text); a long, corresponding to Hadoop's LongWritable type;
- VALUEIN: the type of the mapping input value, i.e. one line of data; a String, corresponding to Hadoop's Text type;
- KEYOUT: the type of the key output by mapping, i.e. each word; a String, corresponding to Hadoop's Text type;
- VALUEOUT: the type of the value output by mapping, i.e. the number of occurrences of each word; an int is used here, corresponding to Hadoop's IntWritable type.
3.4 WordCountReducer
Count the occurrences of words in Reduce:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the 1s emitted for this word and write out the total
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}
As shown in the figure below, the output of shuffling is the input of reduce. The key here is each word, and the values form an iterable collection, such as (1, 1, 1, …).
3.5 WordCountApp
Assemble the MapReduce job and submit it to the server for running. The code is as follows:
/**
 * Assemble the job and submit it to the cluster to run
 */
public class WordCountApp {

    // Hard-coded here so the parameters are easy to see; in real development they can be passed in externally
    private static final String HDFS_URL = "hdfs://192.168.0.107:8020";
    private static final String HADOOP_USER_NAME = "root";

    public static void main(String[] args) throws Exception {
        // The input and output paths are passed in as external arguments
        if (args.length < 2) {
            System.out.println("Input and output paths are necessary!");
            return;
        }
        // The Hadoop user name must be specified, otherwise creating directories on HDFS may throw a permission exception
        System.setProperty("HADOOP_USER_NAME", HADOOP_USER_NAME);
        Configuration configuration = new Configuration();
        // Specify the HDFS address
        configuration.set("fs.defaultFS", HDFS_URL);
        // Create a Job
        Job job = Job.getInstance(configuration);
        // Set the main class to run
        job.setJarByClass(WordCountApp.class);
        // Set the Mapper and Reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // Set the Combiner (optional)
        job.setCombinerClass(WordCountReducer.class);
        // Set a custom partitioning rule (optional)
        job.setPartitionerClass(CustomPartitioner.class);
        // Set the types of the Mapper's output key and value
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Set the types of the Reducer's output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // If the output directory already exists it must be deleted first, otherwise rerunning the program throws an exception
        FileSystem fileSystem = FileSystem.get(new URI(HDFS_URL), configuration, HADOOP_USER_NAME);
        Path outputPath = new Path(args[1]);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }
        // Set the job's input and output file paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, outputPath);
        // Submit the job to the cluster and wait for it to finish; passing true prints the progress
        boolean result = job.waitForCompletion(true);
        // Close the FileSystem created earlier
        fileSystem.close();
        // Exit the JVM with a status code based on the job result
        System.exit(result ? 0 : -1);
    }
}
Note that if the Mapper's output types are not set explicitly, the framework assumes they are the same as the Reducer's output types.
3.6 Submit to the server to run
In actual development you can set up a local Hadoop development environment and run the job directly from the IDE for testing. Here we focus on packaging the job and submitting it to the server. Since this project uses no third-party dependencies other than Hadoop, it can be packaged directly:
mvn clean package
Submit the job with the following command:
hadoop jar /usr/appjar/hadoop-word-count-1.0.jar \
com.heibaiying.WordCountApp \
/wordcount/input.txt /wordcount/output/WordCountApp
Check the generated directory on HDFS after the job is completed:
# list the output directory
hadoop fs -ls /wordcount/output/WordCountApp
# view the statistics
hadoop fs -cat /wordcount/output/WordCountApp/part-r-00000
4. Word frequency statistics case, advanced: Combiner
4.1 Code implementation
To use the combiner function, just add the following line of code when assembling the job:
// Set the Combiner
job.setCombinerClass(WordCountReducer.class);
4.2 Execution Results
After adding the combiner the statistical results do not change, but its effect can be seen in the printed logs:
Log without the combiner:
Log after adding the combiner:
Here there is only one input file and it is smaller than 128 MB, so only one map task processes it. The logs show that with the combiner the number of shuffled records drops from 3519 to 6 (the sample contains only 6 distinct words). In this use case the combiner greatly reduces the amount of data to transfer.
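The idea can be modeled with plain Java (not Hadoop code; the class name is made up): a combiner pre-aggregates each map task's output locally, so at most one record per distinct word has to leave the node.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerEffect {
    // Without a combiner, every (word, 1) pair produced by a map task is shuffled
    // across the network. With one, the task first sums its own pairs locally,
    // so the number of shuffled records equals the number of distinct words.
    public static Map<String, Integer> combine(List<String> mapOutputWords) {
        Map<String, Integer> local = new TreeMap<>();
        for (String word : mapOutputWords) {
            local.merge(word, 1, Integer::sum); // local pre-aggregation
        }
        return local;
    }
}
```

This also explains why WordCountReducer can double as the combiner here: summing partial sums gives the same final counts as summing raw 1s, which holds because addition is associative and commutative.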
5. Word frequency statistics case, advanced: Partitioner
5.1 Default Partitioner
Suppose there is a requirement: output the statistics for different words to different files. Such a need is actually quite common; for example, when counting product sales you may need to split the results by product category. To achieve this you need a custom Partitioner.
First, the default partitioning rule of MapReduce: when building a job, if no partitioner is specified, HashPartitioner is used. It hashes the key and takes the remainder modulo numReduceTasks. It is implemented as follows:
public class HashPartitioner<K, V> extends Partitioner<K, V> {

    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
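The `& Integer.MAX_VALUE` clears the sign bit, so the result stays non-negative even when hashCode() returns a negative value. The arithmetic needs nothing from Hadoop and can be tried on its own (the demo class is made up for illustration):

```java
public class HashPartitionDemo {
    // Same arithmetic as Hadoop's HashPartitioner: clear the sign bit,
    // then take the remainder modulo the number of reduce tasks
    public static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Two properties follow directly: equal keys always land in the same partition (so all 1s for a word reach the same reducer), and every partition number is in [0, numReduceTasks).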
5.2 Custom Partitioner
Here we extend Partitioner to define a custom partitioning rule, partitioning by word:
public class CustomPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text text, IntWritable intWritable, int numPartitions) {
        // Note: this assumes every word appears in WORD_LIST;
        // indexOf returns -1 for unknown words, which is not a valid partition number
        return WordCountDataUtils.WORD_LIST.indexOf(text.toString());
    }
}
Specify our own partitioning rule when building the job, and set the number of reduce tasks:
// Set the custom partitioning rule
job.setPartitionerClass(CustomPartitioner.class);
// Set the number of reduce tasks
job.setNumReduceTasks(WordCountDataUtils.WORD_LIST.size());
5.3 Execution Results
The execution results are as follows: 6 files are generated, each containing the statistics for the corresponding word:
6. Case 2 Introduction
The author of the linked article cleans the text content, uploads it to HDFS manually, analyzes it there, imports the results into a Hive database for sorting and merging, uses the Sqoop tool to import them into MySQL, and finally uses the SSM framework to return them to the front-end page for display.
In fact, these steps can be automated:
- the code in this article can clean the text content and upload it to HDFS automatically;
- the analysis step runs Linux commands, which can be executed from Java via the Runtime class;
- code can read the contents of the output directory and import the resulting data into the Hive database;
- the Hive data can then be moved into MySQL;
- finally, the SSM framework displays the data on a page.
I do not know whether all of this is feasible; most of the write-ups online do these steps manually. I am only offering the idea here, and readers can implement it themselves.
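For the "execute the script from Java" step, here is a hedged sketch using only the standard library; the class name is made up and the command you would actually run (a Hive or HDFS script) is a placeholder:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class ShellRunner {
    // Run an external command and return its combined output; a sketch of the
    // "use the Runtime class to execute the script" idea from the list above
    public static String run(String... command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true) // merge stderr into stdout
                .start();
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        process.waitFor();
        return output.toString();
    }
}
```

ProcessBuilder is generally preferred over Runtime.exec because it makes argument quoting, working-directory, and stream redirection explicit; reading the output stream fully (as above) also avoids the child process blocking on a full pipe buffer.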
Using Java code to import Hive table data into MySQL