WordCount in Hadoop

Hadoop ecosystem


HDFS

HDFS data write process
(1) The client, through the DistributedFileSystem module, asks the NameNode to upload a file; the NameNode checks whether the target file already exists and whether its parent directory exists.
(2) The NameNode replies whether the upload is allowed; if not, it returns an exception.
(3) If the upload is allowed, the client asks which DataNode servers the first block should be uploaded to.
(4) The NameNode returns three DataNode nodes, say dn1, dn2 and dn3.
(5) The client asks dn1 to upload data through the FSDataOutputStream module. When dn1 receives the request it calls dn2, and dn2 in turn calls dn3, completing the setup of the transfer pipeline.
(6) dn1, dn2 and dn3 acknowledge the client stage by stage.
(7) The client starts to upload the first block to dn1 (the data is first read from disk into a local memory cache), one packet (64 KB) at a time. dn1 receives a packet and forwards it to dn2, and dn2 forwards it to dn3; every packet dn1 sends is placed in an acknowledgement queue to wait for the reply.
(8) After one block has been transmitted, the client again asks the NameNode which servers to upload the second block to, and steps 3-7 are repeated (a minimal client-side code sketch follows this list).
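From the client's point of view, the whole pipeline above is hidden behind a couple of API calls. Below is a minimal sketch of the client side, assuming the NameNode URI used by the WordCount driver later in this post and a made-up target path /kb10/demo.txt:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // obtain the DistributedFileSystem through the FileSystem factory;
        // the NameNode URI matches the one configured in WordTest below
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.247.130:9000"), conf);

        // create() performs steps (1)-(4): it asks the NameNode for permission and target DataNodes;
        // writing to the returned FSDataOutputStream streams packets through the DataNode pipeline, steps (5)-(7)
        try (FSDataOutputStream out = fs.create(new Path("/kb10/demo.txt"))) {
            out.write("hello hdfs\n".getBytes("UTF-8"));
        }
        fs.close();
    }
}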


HDFS data read process

(1) The client first calls the FileSystem.open() method to obtain a DistributedFileSystem instance.
(2) DistributedFileSystem sends an RPC (remote procedure call) request to the NameNode to obtain the locations of the first blocks of the file, or of all of them. For every block returned, the NameNode includes the addresses of the DataNodes that hold it; these DataNodes are sorted by their distance to the client according to the cluster topology defined by Hadoop. If the client itself is a DataNode holding the block, it reads the data locally.
(3) DistributedFileSystem returns to the client an FSDataInputStream, an input stream object that supports seeking, for the client to read data. FSDataInputStream wraps a DFSInputStream object, which manages the I/O between the DataNodes and the NameNode.
(4) The client calls the read() method; DFSInputStream finds the DataNode closest to the client and connects to it.
(5) The DFSInputStream object holds the DataNode addresses of the blocks at the beginning of the file. It first connects to the nearest DataNode that holds the first block, then read() is called repeatedly on the stream until the block has been read completely. When the data of the first block has been read, the connection to that DataNode is closed and the next block is read.
(6) When the first batch of blocks has been read, DFSInputStream goes back to the NameNode to get the locations of the next batch of blocks and continues reading. When all blocks have been read, all streams are closed (see the read sketch below).
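Again, the client only sees FileSystem.open() and the returned FSDataInputStream; the block lookups and DataNode switching happen inside DFSInputStream. A minimal read sketch, reusing the NameNode URI and the Pollyanna.txt path that appear later in this post:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.net.URI;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.247.130:9000"), conf);

        // open() triggers the RPC to the NameNode for block locations, step (2);
        // reading the stream then pulls block after block from the nearest DataNodes, steps (4)-(6)
        try (FSDataInputStream in = fs.open(new Path("/kb10/Pollyanna.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}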


MapReduce


(1) The map task process starts and reads the data with the input format specified by the client.
(2) The RecordReader reads the input one line at a time.
(3) The line's offset in the file becomes the key k and the line content becomes the value v, forming a key-value pair (k, v).
(4) Each key-value pair (k, v) is processed by the Mapper's map() logic to form new key-value pairs map(k, v), where k is now a word and v is its count; they are written to the OutputCollector through the context.write() method.
(5) The OutputCollector writes the key-value pairs map(k, v) into a ring buffer, and the shuffle begins. The default size of the ring buffer is 100 MB; when it is 80% full, it spills to disk.
(6) Before a spill, a partition number is hashed out for every key in the buffer, and records are sorted by key within each partition (the default hash rule is sketched after this list).
(7) If a Combiner is configured, key-value pairs with the same key are merged at this point, producing a large file that is partitioned and sorted within each partition. The purpose of this merging is to reduce network transfer.
(8) If no Combiner is configured, the ring buffer spills to the map task's local disk as partitioned, sorted small files; if the data volume is large, several spill files are produced.
(9) The small spill files are merged and sorted into one large file that is partitioned and sorted within each partition; at this point the map task ends.
(10) The reduce tasks start. Each reduce task copies, from every map task, the data whose partition number matches its own to the reduce task's local disk.
(11) Each reduce task merge-sorts the copied files of its partition into one large file whose key-value pairs are sorted by key; the shuffle phase ends.
(12) GroupingComparator(k, nextk) groups the data in the large file by key; each call fetches one group of key-value pairs (k, values) from the file, which is processed by the Reducer's reduce() logic into a new key-value pair reduce(k, v). At this point k is a word and v is the count of that word.
(13) context.write(k, v) hands the result to the OutputFormat, which writes the results to disk.
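For reference, when no custom Partitioner is set, the "hash out a partition value" in step (6) is done by Hadoop's default HashPartitioner, whose logic boils down to the sketch below (shown here specialized to the <Text, IntWritable> pairs of this WordCount; the class name is only for illustration). The WordPartitioner further down replaces it with a word-length based rule.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class DefaultHashPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // mask off the sign bit so the result is non-negative, then take the remainder
        // by the number of reducers: equal keys always land in the same reduce task
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}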


WordMapper class

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // reusable output key/value objects: the key is a word, the value is always 1
    private Text word = new Text();
    private IntWritable count = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // strip common punctuation from the line, then split it into words on spaces
        String[] words = value.toString().trim()
                .replaceAll(",|\\.|!|\\?|;|\"", "").split(" ");
        for (String word : words) {
            // emit (word, 1) for every word in the line
            this.word.set(word);
            context.write(this.word, count);
        }
    }
}

WordReducer class

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // reusable output value holding the total count for a word
    private IntWritable count = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // sum all counts emitted for this word (this class is also used as the Combiner in the driver)
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        count.set(sum);
        context.write(key, count);
    }
}

WordPartitioner class

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text text, IntWritable intWritable, int numReduceTasks) {
        // partition by word length, so words of the same length go to the same reduce task
        return text.toString().length() % numReduceTasks;
    }
}

Startup class

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordTest {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // create the configuration with the HDFS access path
        Configuration config = new Configuration();
        config.set("fs.defaultFS", "hdfs://192.168.247.130:9000");

        // create the MapReduce job
        Job job = Job.getInstance(config, "wordCount");
        // set the main class, used to locate the external jar resource
        job.setJarByClass(WordTest.class);
        // set the number of reduce tasks: related to the number of input splits and the number of DataNodes storing the data
        job.setNumReduceTasks(2);
        // set the custom Partitioner
        job.setPartitionerClass(WordPartitioner.class);
        // set the Combiner (the Reducer can be reused because summing counts is associative)
        job.setCombinerClass(WordReducer.class);

        // set the Mapper
        job.setMapperClass(WordMapper.class);
        // set the Reducer
        job.setReducerClass(WordReducer.class);

        // output key/value types of the Mapper
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // output key/value types of the Reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // bind the HDFS input file and the output directory to the job
        FileInputFormat.setInputPaths(job, new Path("/kb10/Pollyanna.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/kb10/wc01"));

        System.out.println(job.waitForCompletion(true));
    }
}

Put the test article into HDFS

[root@single ~]# hdfs dfs -put Pollyanna.txt /kb10/

Run the startup class WordTest; the console output looks like this:

2020-11-22 21:06:51,283 INFO [org.apache.hadoop.conf.Configuration.deprecation] - session.id is deprecated. Instead, use dfs.metrics.session-id
2020-11-22 21:06:51,284 INFO [org.apache.hadoop.metrics.jvm.JvmMetrics] - Initializing JVM Metrics with processName=JobTracker, sessionId=
2020-11-22 21:07:00,797 WARN [org.apache.hadoop.mapreduce.JobSubmitter] - Submitting tokens for job: job_local188587393_0001
[org.apache.hadoop.mapred.LocalJobRunner] - reduce task executor complete.
...
...
...
2020-11-22 21:07:02,037 INFO [org.apache.hadoop.mapreduce.Job] - Job job_local188587393_0001 running in uber mode : false
2020-11-22 21:07:02,038 INFO [org.apache.hadoop.mapreduce.Job] -  map 100% reduce 100%
2020-11-22 21:07:02,038 INFO [org.apache.hadoop.mapreduce.Job] - Job job_local188587393_0001 completed successfully
2020-11-22 21:07:02,053 INFO [org.apache.hadoop.mapreduce.Job] - Counters: 35
	File System Counters
		FILE: Number of bytes read=37065
		FILE: Number of bytes written=1359678
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=30364
		HDFS: Number of bytes written=9124
		HDFS: Number of read operations=38
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=16
	Map-Reduce Framework
		Map input records=35
		Map output records=1365
		Map output bytes=12733
		Map output materialized bytes=6687
		Input split bytes=111
		Combine input records=1365
		Combine output records=528
		Reduce input groups=528
		Reduce shuffle bytes=6687
		Reduce input records=528
		Reduce output records=528
		Spilled Records=1056
		Shuffled Maps =3
		Failed Shuffles=0
		Merged Map outputs=3
		GC time elapsed (ms)=0
		Total committed heap usage (bytes)=1025507328
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=7591
	File Output Format Counters 
		Bytes Written=4580
true

View the result files in HDFS.
Because three reduce tasks were specified for this run, three output files are generated (one per reduce task).
View the WordCount results

[root@single ~]# hdfs dfs -cat /kb10/wc01/part-r-*

