Chapter 1 MapReduce Overview
1.1 MapReduce defined
MapReduce is a programming framework for distributed computing; it is the core framework through which users develop "Hadoop-based data analysis applications."
The core function of MapReduce is to integrate the user's business-logic code with its own built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.
1.2 MapReduce advantages and disadvantages
1.2.1 Advantages
- MapReduce is easy to program
By simply implementing a few interfaces, you can complete a distributed program that can then be run on a large number of cheap PC machines. Writing a distributed program this way is just like writing a simple serial program, and this feature is what has made MapReduce programming so popular.
- Good scalability
When your computing resources are no longer sufficient, you can extend the computing power simply by adding machines.
1.2.2 Disadvantages
- Not good at real-time computation
Unlike MySQL, MapReduce cannot return results within milliseconds or seconds.
- Not good at stream computation
In stream computation the input data arrives dynamically, while a MapReduce input data set is static and cannot change. This is determined by MapReduce's own design: the data source must be static.
- Not good at DAG (directed acyclic graph) computation
Here multiple applications depend on each other, with the output of one application serving as the input of the next. It is not that MapReduce cannot do this, but the output of every MapReduce job is written to disk, which causes a large amount of disk I/O and very poor performance.
1.3 MapReduce core idea
The core ideas of MapReduce programming are:
- A distributed computing program often needs to be divided into at least two phases.
- The concurrent MapTask instances of the first phase run fully in parallel and are independent of each other.
- The concurrent ReduceTask instances of the second phase are also independent of each other, but their input depends on the output of all the concurrent MapTask instances of the previous phase.
- The MapReduce programming model can contain only one Map phase and one Reduce phase; if the user's business logic is very complex, the only option is to run multiple MapReduce programs serially.
Summary: analyze the WordCount data flow to gain a deep understanding of the core idea of MapReduce.
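The two-phase data flow can be illustrated with a plain-Java simulation (a sketch only: real MapReduce distributes the map and reduce work across a cluster, and the class and method names here are illustrative, not Hadoop APIs). The `map` step emits a `(word, 1)` pair for every word; the `reduce` step groups pairs by key and sums the 1s, which reproduces the expected output of the WordCount case in section 1.8:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalWordCount {
    // "Map" phase: emit a (word, 1) pair for every word in every line
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\W+")) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // "Shuffle + Reduce" phase: group pairs by key and sum the 1s
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "zhangsan lisi wanger maizi",
                "xiangming zhangsan wanger lisi",
                "xiaoha mazi zhangsan");
        // Prints {lisi=2, maizi=1, mazi=1, wanger=2, xiangming=1, xiaoha=1, zhangsan=3}
        System.out.println(reduce(map(lines)));
    }
}
```

In the real framework the pairs emitted by `map` are partitioned, sorted, and shipped across the network to the ReduceTasks; this sketch only shows the logical grouping.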
1.4 MapReduce process
1.5 Official WordCount source code
Decompiling the official WordCount example with a decompiler shows that it consists of a Map class, a Reduce class, and a driver class, and that the data types used are all serialization types that ship with Hadoop itself.
1.6 Common data serialization types
Commonly used Java types and the corresponding Hadoop data serialization types:

| Java type | Hadoop Writable type |
|---|---|
| Boolean | BooleanWritable |
| Byte | ByteWritable |
| Int | IntWritable |
| Float | FloatWritable |
| Long | LongWritable |
| Double | DoubleWritable |
| String | Text |
| Map | MapWritable |
| Array | ArrayWritable |
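Hadoop uses these Writable types instead of standard Java serialization because its on-the-wire encoding is far more compact. A stdlib-only comparison sketch (illustrative: `DataOutputStream.writeInt` mimics the 4-byte fixed-width encoding that `IntWritable.write()` produces, while `ObjectOutputStream` shows the overhead of standard Java serialization; the class and method names are my own):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class SerializationSize {
    // Compact fixed-width encoding, comparable to IntWritable.write(): 4 bytes
    static int compactSize(int value) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        new DataOutputStream(out).writeInt(value);
        return out.size();
    }

    // Standard Java object serialization of the same value: much larger,
    // because the stream carries class metadata along with the data
    static int javaSerializedSize(int value) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
            oos.writeObject(Integer.valueOf(value));
        }
        return out.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("compact encoding:      " + compactSize(42) + " bytes");
        System.out.println("java serialization:    " + javaSerializedSize(42) + " bytes");
    }
}
```

On a shuffle that moves billions of key/value pairs between MapTasks and ReduceTasks, this per-record difference is why Hadoop defines its own serialization framework.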
1.7 MapReduce programming specification
A user-written program is divided into three parts: the Mapper, the Reducer, and the Driver.
1.8 WordCount hands-on case
1. Requirement
Count the total number of occurrences of each word in a given text file and output the result.
(1) Input data
Write the text file date.txt:
zhangsan lisi wanger maizi
xiangming zhangsan wanger lisi
xiaoha mazi zhangsan
(2) Expected output data
lisi 2
maizi 1
mazi 1
wanger 2
xiangming 1
xiaoha 1
zhangsan 3
2. Requirement analysis
Following the MapReduce programming specification, write the Mapper, Reducer, and Driver.
3. Environment preparation
(1) Create a Maven project in IDEA.
(2) Add the following dependencies to the pom.xml file:
<dependencies>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.1.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.1.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>3.1.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-yarn-common</artifactId>
    <version>3.1.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-yarn-client</artifactId>
    <version>3.1.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-yarn-server-resourcemanager</artifactId>
    <version>3.1.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>3.1.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
    <version>3.1.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-common</artifactId>
    <version>3.1.2</version>
  </dependency>
  <dependency>
    <groupId>net.minidev</groupId>
    <artifactId>json-smart</artifactId>
    <version>2.3</version>
  </dependency>
  <dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.12.1</version>
  </dependency>
  <dependency>
    <groupId>org.anarres.lzo</groupId>
    <artifactId>lzo-hadoop</artifactId>
    <version>1.0.6</version>
  </dependency>
</dependencies>
(3) In the project's src/main/resources directory, create a file named "log4j.properties" with the following content:
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
4. Write the program
(1) Write the Mapper class
package com.zhangyong.mapreduce;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
 * @Author zhangyong
 * @Date 2020/3/4 16:35
 * @Version 1.0
 * Mapper class
 * Type parameter 1: the type of the input key (the byte offset at which each line starts)
 * Type parameter 2: the type of the input value (the content of the line)
 * Type parameter 3: the type of the output key
 * Type parameter 4: the type of the output value
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    /**
     * key: the byte offset of the current line
     * value: the content of the line
     * context: the context used to emit output
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        System.out.println(key.get() + " " + value.toString());
        String line = value.toString();
        // Split the line on runs of non-word characters
        String[] split = line.split("\\W+");
        // Emit a (word, 1) pair for every word in the line
        for (String s : split) {
            context.write(new Text(s), new IntWritable(1));
        }
    }
}
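A note on the `split("\\W+")` call in the Mapper: the regex matches one or more non-word characters, so spaces, commas, and tabs all act as word delimiters. A small standalone check (the class name `SplitDemo` is just for illustration):

```java
import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        // "\\W+" matches one or more non-word characters, so spaces,
        // commas and tabs are all treated as delimiters between words
        String line = "zhangsan  lisi,wanger\tmaizi";
        System.out.println(Arrays.toString(line.split("\\W+")));
        // -> [zhangsan, lisi, wanger, maizi]

        // Caveat: a line that starts with a delimiter yields a leading
        // empty string, which this WordCount would count as a "word"
        System.out.println(Arrays.toString(" hello".split("\\W+")));
        // -> [, hello]
    }
}
```

For the sample input above this behavior is harmless, but for real data a production Mapper would usually skip empty tokens before emitting them.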
(2) Write the Reducer class
package com.zhangyong.mapreduce;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
 * @Author zhangyong
 * @Date 2020/3/4 16:35
 * @Version 1.0
 * Reducer class
 * Type parameter 1: the type of the key passed from the Mapper
 * Type parameter 2: the type of the values passed from the Mapper
 * Type parameter 3: the type of the output key
 * Type parameter 4: the type of the output value
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        System.out.println(key + " : " + values);
        int sum = 0;
        // Sum all the 1s emitted by the Mapper for this word
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
(3) Write the Driver class
package com.zhangyong.mapreduce;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import java.io.IOException;
/**
 * @Author zhangyong
 * @Date 2020/3/4 16:35
 * @Version 1.0
 * Driver class: the Hadoop entry point
 */
public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration cfg = new Configuration();
        // Force local mode (even if a core-site.xml is on the project classpath, local mode is still used)
        cfg.set("mapreduce.framework.name", "local");
        cfg.set("fs.defaultFS", "file:///");
        Job job = Job.getInstance(cfg);
        job.setJarByClass(WordCountDriver.class);
        // The next two lines are the defaults and may be omitted
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Set the Mapper and Reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // Set the Mapper output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Set the Reducer (final) output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // If the output path already exists, delete it
        Path out = new Path("src/resources/output");
        FileSystem fs = FileSystem.get(cfg);
        if (fs.exists(out)) {
            fs.delete(out, true);
        }
        // Set the input directory to analyze and the output directory
        FileInputFormat.addInputPath(job, new Path("src/resources/input"));
        FileOutputFormat.setOutputPath(job, new Path("src/resources/output"));
        // Submit the job and exit with its success status
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
5. Project directory structure
6. Local test
(1) A local environment with Hadoop 3.1.2 and Java 1.8 must be configured.
(2) Run the program in IDEA. When the run completes, the output files are generated in the output directory.