The MapReduce process under distributed Hadoop and simple use of HDFS

In the previous article we set up a pseudo-distributed configuration and, through several examples, saw how to run MapReduce jobs in that mode. Today's article looks mainly at how MapReduce works in a distributed environment, and then, through a small example, at how to operate HDFS from Java.

The figure above shows the execution flow of MapReduce in a distributed deployment. As you can see, there are many steps besides map and reduce. These steps can, to a greater or lesser extent, be controlled through configuration or code, and they are also our main levers for tuning MapReduce performance.


  1. After an input file is uploaded to HDFS, it is divided into splits. By default a Hadoop cluster keeps three copies of each split's data (we set this to a single copy under pseudo-distribution), so the figure shows three copies each of split1, split2 and split3. This redundancy makes storage reliable and allows data to be recovered when problems occur. The split size defaults to the HDFS block size (128 MB), which generally gives the best performance.


  2. Hadoop assigns one map task to each split, so the number of splits determines the number of map tasks. The assignment follows certain principles: first, if a node storing the split's data is idle, the map task runs on that node; if the nodes holding the data are busy with other work, Hadoop looks for the closest idle node in the same rack as the data; finally, if neither condition can be met, the map task has to run across racks. The overall goal is to reduce data transfer as much as possible and so optimize performance.


  3. The map tasks then run concurrently on their respective nodes, and the output of each map task is sorted. Note that since each split contains only part of the file's data, the sorting after the map phase applies only to each task's own output; in other words, the result is a partial sort, not a sort of all the input data.


    If you need the entire file to be sorted, there are ways to do it: one is to prevent Hadoop from splitting the file, and the other is to gather the split data and hand it to a single map task for processing (see the sketch after this list for how to make an input format non-splittable).


  4. To optimize performance, map results are sorted in memory, and at that point they do not need to be written to disk. However, reduce obtains its data from disk, so the write to disk is deferred to the merge stage: the output of each map task is copied to the node running the reduce task, written to disk during the merge phase, and then the reduce operation runs and writes out the result. The number of reduce tasks is configurable and defaults to 1; if there are multiple reduce tasks, each produces its own output partition on its own node.
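For reference, the knobs mentioned in steps 1, 3 and 4 can be set through the Job API. The following is only a rough sketch, not code from this article's project: the class name, paths and values are placeholders, and the mapper/reducer setup is omitted.

package com.yjp.mapred;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobTuningSketch {
	
	// An input format that refuses to split files, so each file is
	// processed by a single map task (one way to avoid partial sorting, step 3)
	public static class NonSplittableTextInputFormat extends TextInputFormat {
		@Override
		protected boolean isSplitable(JobContext context, Path file) {
			return false;
		}
	}
	
	public static Job configureJob(Configuration conf) throws Exception {
		Job job = Job.getInstance(conf, "tuning-sketch");
		job.setJarByClass(JobTuningSketch.class);
		
		// step 1: cap the split size, here at one HDFS block (128 MB)
		FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
		
		// step 3: use the non-splittable input format instead of the default
		job.setInputFormatClass(NonSplittableTextInputFormat.class);
		
		// step 4: the number of reduce tasks (the default is 1)
		job.setNumReduceTasks(2);
		
		// placeholder paths; setMapperClass/setReducerClass etc. omitted
		FileInputFormat.addInputPath(job, new Path("input"));
		FileOutputFormat.setOutputPath(job, new Path("output"));
		return job;
	}
}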


The above is the running process of MapReduce in the distributed case. Now let's look at how to operate HDFS. In the previous article we used the command line; today let's see how to do it with Java code. We will use a simple example: write a string to HDFS, then read it back and print it to the terminal.

package com.yjp.hdfs;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HDFSEcho {
	
	// replace yjp with your username
	private static final URI TMP_URI =
			URI.create("hdfs://localhost/user/yjp/temp");

	public static void main(String[] args) throws Exception {
		if (args.length != 1) {
			System.out.println("Usage: HDFSEcho string");
			System.exit(1);
		}
		
		String echoString = args[0] + "\n";
		
		// Create a FileSystem from the URI; because the scheme is hdfs, this returns an instance representing HDFS
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(TMP_URI, conf);
		FSDataOutputStream out = null;
		FSDataInputStream in = null;
		
		try {
			// write to file
			Path path = new Path(TMP_URI);
			out = fs.create(path);
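			// writeUTF writes a two-byte length prefix before the string bytes (DataOutput contract)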
			out.writeUTF(echoString);
			IOUtils.closeStream(out);
			
			// read and output from file
			in = fs.open(path);
			IOUtils.copyBytes(in, System.out, conf);
			IOUtils.closeStream(in);
			
			// delete the file
			fs.delete(path, true);
		} finally {
			IOUtils.closeStream(out);
			IOUtils.closeStream(in);
		}
	}
	
}
Package it into a jar file, remember to start the Hadoop daemons, and then try it out.

hadoop jar your-jar.jar com.yjp.hdfs.HDFSEcho "Hello HDFS!"


The output is

Hello HDFS!


This part of the code is in the hdfs package.

