Hadoop ecology
HDFS
HDFS data writing process
(1) The client, through the DistributedFileSystem module, asks the NameNode to upload a file; the NameNode checks whether the target file already exists and whether its parent directory exists
(2) The NameNode responds whether the upload can proceed; if not, it returns an exception
(3) If the upload is allowed, the client asks the NameNode which DataNode servers the first block should be uploaded to
(4) The NameNode returns 3 DataNodes, say dn1, dn2, and dn3
(5) The client requests dn1, through the FSDataOutputStream module, to upload data; on receiving the request, dn1 calls dn2, and dn2 in turn calls dn3, establishing the communication pipeline
(6) dn1, dn2, and dn3 acknowledge back to the client, step by step
(7) The client starts uploading the first block to dn1 (the data is first read from disk into a local in-memory cache), in units of packets (64 KB); dn1 receives a packet and forwards it to dn2, and dn2 forwards it to dn3; each packet dn1 sends is placed in an acknowledgement queue to await a reply
(8) When the transmission of one block completes, the client again asks the NameNode for the servers for the second block (steps 3-7 repeat)
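The block-by-block, packet-by-packet structure of the write path can be sketched with some simple arithmetic. This is a plain-Java illustration assuming the default 128 MB HDFS block size (an assumption; clusters can configure it differently) — the 64 KB packet size comes from step (7) above:

```java
public class HdfsWriteMath {
    // Assumed default block size of 128 MB; 64 KB packet size is from step (7)
    static final long BLOCK_SIZE = 128L * 1024 * 1024;
    static final long PACKET_SIZE = 64L * 1024;

    // Number of blocks the client must request from the NameNode (steps 3-4 repeat once per block)
    static long blocksFor(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    // Number of packets streamed through the dn1 -> dn2 -> dn3 pipeline for one full block (step 7)
    static long packetsPerFullBlock() {
        return BLOCK_SIZE / PACKET_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(blocksFor(300L * 1024 * 1024)); // a 300 MB file needs 3 blocks
        System.out.println(packetsPerFullBlock());         // 2048 packets per full 128 MB block
    }
}
```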
HDFS read data process
(1) The client first calls the FileSystem.open() method to obtain a DistributedFileSystem instance
(2) DistributedFileSystem issues an RPC (remote procedure call) to the NameNode to obtain the block list for the beginning of the file, or for all of it. For each block returned, the NameNode includes the addresses of the DataNodes holding that block. These DataNodes are sorted by their distance from the client according to the cluster topology defined by Hadoop. If the client is itself a DataNode holding a replica, it reads the block locally
(3) DistributedFileSystem returns to the client an FSDataInputStream, an input stream object that supports file positioning, for reading the data. FSDataInputStream wraps a DFSInputStream object, which manages the I/O with the DataNodes and the NameNode
(4) The client calls the read() method; DFSInputStream finds the DataNode closest to the client and connects to it
(5) The DFSInputStream object holds the DataNode addresses for the blocks at the beginning of the file. It first connects to the nearest DataNode holding the first block, then calls read() repeatedly on the stream until that block has been read completely; it then closes the connection to that DataNode and moves on to the next block
(6) When the first batch of blocks has been read, DFSInputStream goes back to the NameNode to get the locations of the next batch of blocks, and reading continues. When all blocks have been read, all streams are closed
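The "sort DataNodes by distance from the client" behavior in step (2) can be illustrated with a small plain-Java sketch. The Replica type and the distance values here (0 = local node, 2 = same rack, 4 = remote rack) are hypothetical stand-ins for Hadoop's topology distances, not real HDFS API calls:

```java
import java.util.Comparator;
import java.util.List;

public class NearestReplica {
    // Hypothetical replica location: host name plus its topology distance from the client
    record Replica(String host, int distance) {}

    // Mirrors steps (2) and (4): pick the closest replica of a block and read from it
    static String pickClosest(List<Replica> replicas) {
        return replicas.stream()
                .min(Comparator.comparingInt(Replica::distance))
                .map(Replica::host)
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<Replica> replicas = List.of(
                new Replica("dn1", 0),  // same node as the client -> local read
                new Replica("dn2", 2),  // same rack
                new Replica("dn3", 4)); // remote rack
        System.out.println(pickClosest(replicas)); // dn1
    }
}
```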
MapReduce
(1) The map task starts and reads the input using the input format specified by the client
(2) The RecordReader reads the input one line at a time; each line is one record
(3) The line's byte offset becomes the key k and the line's content the value v, forming a key-value pair (k, v)
(4) Mapper's map() method processes each (k, v) pair and produces new key-value pairs; for word count, k is a word and v is its count. The pairs are output to the OutputCollector via the context.write() method
(5) The OutputCollector writes the key-value pairs into a ring buffer; shuffle begins here. The ring buffer defaults to 100 MB, and when it is 80% full its contents spill to disk
(6) Before spilling, a partition number is computed for each key by hashing; within each partition, records are sorted by key
(7) If a Combiner is configured, key-value pairs with the same key are merged at this point; the purpose of this merge is to reduce network transmission
(8) The ring buffer spills to the map task's local disk, producing small files that are partitioned and sorted within each partition; if the data volume is large, multiple spill files are produced
(9) The small files are merge-sorted into one large file, partitioned and sorted within each partition; the map task then ends
(10) The reduce tasks start; each reduce task copies, from every map task, the files whose partition number matches its own to its local disk
(11) Each reduce task merge-sorts the copied files into one large file whose key-value pairs are sorted by key; the shuffle phase ends here
(12) GroupingComparator(k, nextK) groups the data in the large file by key; each call retrieves one group of key-value pairs (k, values) from the file
(13) Reducer's reduce() method processes each group and forms a new key-value pair (k, v); for word count, k is a word and v is that word's total count. The context.write(k, v) method passes the result to the OutputFormat, which writes it to disk
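The map → partition → sort → reduce pipeline above can be sketched in plain Java without a Hadoop cluster. This is a minimal simulation: partition() uses the same formula as Hadoop's default HashPartitioner, and a TreeMap per partition stands in for the shuffle's sort-by-key and the per-key merge:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class MiniWordCount {
    // Same formula as Hadoop's default HashPartitioner (step 6)
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Simulates map -> partition -> sort -> reduce on a single line of input;
    // each TreeMap is one partition, kept sorted by key like the shuffle output
    static List<TreeMap<String, Integer>> wordCount(String line, int numReduceTasks) {
        List<TreeMap<String, Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) partitions.add(new TreeMap<>());
        for (String word : line.split("\\s+")) {                          // map: emit (word, 1)
            partitions.get(partition(word, numReduceTasks))
                      .merge(word, 1, Integer::sum);                      // reduce: sum the 1s per key
        }
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("a b a c b a", 2));
    }
}
```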
WordMapper class
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Writable instances are reused across records to avoid per-record object creation
    private Text word = new Text();
    private IntWritable count = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Strip common punctuation, then split on runs of whitespace
        // ("\\s+" rather than " " so consecutive spaces do not produce empty tokens)
        String[] words = value.toString().trim()
                .replaceAll(",|\\.|!|\\?|;|\"", "").split("\\s+");
        for (String word : words) {
            this.word.set(word);
            context.write(this.word, count);
        }
    }
}
WordReducer class
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable count = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum all counts emitted for this word
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        count.set(sum);
        context.write(key, count);
    }
}
WordPatritioner class
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class WordPatritioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text text, IntWritable intWritable, int numberReduceTasks) {
        // Route each word by its length so words of the same length land in the same reducer
        return text.toString().length() % numberReduceTasks;
    }
}
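A quick sanity check of the length-based routing above, in plain Java (the getPartition logic is copied from WordPatritioner): with two reduce tasks, even-length words go to reducer 0 and odd-length words to reducer 1.

```java
public class PartitionCheck {
    // Same logic as WordPatritioner.getPartition: route by word length modulo the reducer count
    static int getPartition(String word, int numberReduceTasks) {
        return word.length() % numberReduceTasks;
    }

    public static void main(String[] args) {
        // With job.setNumReduceTasks(2):
        System.out.println(getPartition("word", 2));   // 4 % 2 = 0 -> reducer 0
        System.out.println(getPartition("hadoop", 2)); // 6 % 2 = 0 -> reducer 0
        System.out.println(getPartition("map", 2));    // 3 % 2 = 1 -> reducer 1
    }
}
```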
Start class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class WordTest {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Configure the HDFS access path
        Configuration config = new Configuration();
        config.set("fs.defaultFS", "hdfs://192.168.247.130:9000");
        // Create the MapReduce job
        Job job = Job.getInstance(config, "wordCount");
        // Set the main class so the external jar resource can be located
        job.setJarByClass(WordTest.class);
        // Set the number of reduce tasks (related to the number of input splits and of DataNodes storing them)
        job.setNumReduceTasks(2);
        // Set the Partitioner
        job.setPartitionerClass(WordPatritioner.class);
        // Set the Combiner
        job.setCombinerClass(WordReducer.class);
        // Set the Mapper
        job.setMapperClass(WordMapper.class);
        // Set the Reducer
        job.setReducerClass(WordReducer.class);
        // Set the Mapper's output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Set the Reducer's output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Bind the HDFS input file and output directory to the job
        FileInputFormat.setInputPaths(job, new Path("/kb10/Pollyanna.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/kb10/wc01"));
        System.out.println(job.waitForCompletion(true));
    }
}
Put the test article into HDFS
[root@single ~]# hdfs dfs -put Pollyanna.txt /kb10/
Run the driver class WordTest
2020-11-22 21:06:51,283 INFO [org.apache.hadoop.conf.Configuration.deprecation] - session.id is deprecated. Instead, use dfs.metrics.session-id
2020-11-22 21:06:51,284 INFO [org.apache.hadoop.metrics.jvm.JvmMetrics] - Initializing JVM Metrics with processName=JobTracker, sessionId=
2020-11-22 21:07:00,797 WARN [org.apache.hadoop.mapreduce.JobSubmitter] - Submitting tokens for job: job_local188587393_0001
[org.apache.hadoop.mapred.LocalJobRunner] - reduce task executor complete.
...
...
...
2020-11-22 21:07:02,037 INFO [org.apache.hadoop.mapreduce.Job] - Job job_local188587393_0001 running in uber mode : false
2020-11-22 21:07:02,038 INFO [org.apache.hadoop.mapreduce.Job] - map 100% reduce 100%
2020-11-22 21:07:02,038 INFO [org.apache.hadoop.mapreduce.Job] - Job job_local188587393_0001 completed successfully
2020-11-22 21:07:02,053 INFO [org.apache.hadoop.mapreduce.Job] - Counters: 35
File System Counters
FILE: Number of bytes read=37065
FILE: Number of bytes written=1359678
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=30364
HDFS: Number of bytes written=9124
HDFS: Number of read operations=38
HDFS: Number of large read operations=0
HDFS: Number of write operations=16
Map-Reduce Framework
Map input records=35
Map output records=1365
Map output bytes=12733
Map output materialized bytes=6687
Input split bytes=111
Combine input records=1365
Combine output records=528
Reduce input groups=528
Reduce shuffle bytes=6687
Reduce input records=528
Reduce output records=528
Spilled Records=1056
Shuffled Maps =3
Failed Shuffles=0
Merged Map outputs=3
GC time elapsed (ms)=0
Total committed heap usage (bytes)=1025507328
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=7591
File Output Format Counters
Bytes Written=4580
true
View the results in HDFS.
Because two reduce tasks were specified with setNumReduceTasks(2), the output directory contains one part file per reducer, plus the _SUCCESS marker.
View wordcount results
[root@single ~]# hdfs dfs -cat /kb10/wc01/part-r-*