Touge big data homework three: MapReduce and executing wordcount

Extracurricular homework three: MapReduce and executing wordcount

  • Job details

Content

  • Launch your own ECS instance with a specification of at least 2 vCPUs and 4 GiB of memory to run MapReduce.
  • Set up Hadoop in pseudo-distributed mode and start Hadoop.
  • Create an input data file for the wordcount program and enter some content.
  • Create the /input path on HDFS and upload the input data file of the wordcount program to the /input path of HDFS.
  • Run the wordcount program; after it completes successfully, check the word-frequency statistics in the /output/part-r-00000 file on HDFS (see the sketch after this list).
  • Briefly answer the "Classroom Assessment" questions below.
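In practice the preparation steps above are usually done with the `hdfs dfs` command line (`-mkdir`, `-put`). The sketch below does the same thing through Hadoop's Java FileSystem API, assuming a local input file named `wordcount_input.txt` (the file name is only an illustration).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: create /input on HDFS and upload the wordcount input file.
public class UploadInput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // client for the file system named by fs.defaultFS
        fs.mkdirs(new Path("/input"));              // create the /input directory
        // upload the local data file (name assumed) into /input
        fs.copyFromLocalFile(new Path("wordcount_input.txt"), new Path("/input/"));
        fs.close();
    }
}
```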
  1. After starting Hadoop, which nodes execute HDFS? What do they do?

Answer: After starting Hadoop, HDFS is run by the NameNode and the DataNodes. The NameNode stores the file system metadata: it maintains the directory structure and file identifiers and controls access to the file system. The DataNodes store the actual file blocks and handle read and write requests for those blocks.

  2. What are the configurations of hadoop.tmp.dir and fs.defaultFS in the configuration file core-site.xml?

Answer: hadoop.tmp.dir in the configuration file core-site.xml configures the storage path of Hadoop temporary files, and fs.defaultFS configures the URI of the NameNode, which is used to indicate the Hadoop file system that the client wants to access.
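As a sketch, the same two properties can also be set programmatically on a `Configuration` object; the path and port below are typical pseudo-distributed values chosen for illustration, not taken from the assignment.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class CoreSiteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.tmp.dir", "/usr/local/hadoop/tmp"); // base directory for Hadoop temporary files (assumed path)
        conf.set("fs.defaultFS", "hdfs://localhost:9000");   // NameNode URI that clients connect to (assumed port)
        FileSystem fs = FileSystem.get(conf);                // resolves fs.defaultFS to an HDFS client
        System.out.println(fs.getUri());                     // prints hdfs://localhost:9000
        fs.close();
    }
}
```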

  3. What are the configurations of dfs.namenode.name.dir and dfs.datanode.data.dir in the configuration file hdfs-site.xml?

Answer: dfs.namenode.name.dir in the configuration file hdfs-site.xml configures the local file system path where the NameNode stores its metadata (such as the directory structure and file identifiers), and dfs.datanode.data.dir configures the local file system path where the DataNode stores file blocks.
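A matching sketch for the two hdfs-site.xml properties; the directories are placeholders under the hadoop.tmp.dir used above, not values from the assignment.

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsSiteSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // local directory where the NameNode keeps its metadata (assumed path)
        conf.set("dfs.namenode.name.dir", "file:///usr/local/hadoop/tmp/dfs/name");
        // local directory where the DataNode keeps the actual file blocks (assumed path)
        conf.set("dfs.datanode.data.dir", "file:///usr/local/hadoop/tmp/dfs/data");
    }
}
```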

  4. Where are your Hadoop NameNode, DataNode and SecondaryNameNode nodes? Why is the SecondaryNameNode not configured?

Answer: Hadoop's NameNode, DataNode and SecondaryNameNode all run on the same local server, since this is a pseudo-distributed deployment. The SecondaryNameNode simply uses its default settings, so it does not need to be configured separately.

  5. After starting Hadoop, which nodes execute MapReduce? What do they do?

Answer: After starting Hadoop, MapReduce is executed by the NameNode, DataNode, ResourceManager and NodeManager nodes. The NameNode stores the file system metadata and the DataNodes store the file data, the ResourceManager is responsible for job scheduling and resource management, and the NodeManagers are responsible for running the tasks on their own nodes.

  6. Why are the configuration files mapred-site.xml and yarn-site.xml not configured? If you wanted to configure them, what would need to be configured for the ResourceManager and NodeManager nodes?

Answer: The configuration files mapred-site.xml and yarn-site.xml were left with their default settings. If they are configured, mapred-site.xml needs the mapreduce.framework.name property set to indicate that MapReduce uses the YARN framework, and yarn-site.xml needs the parameters of the ResourceManager and NodeManager, such as yarn.resourcemanager.hostname and yarn.nodemanager.resource.memory-mb.
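A sketch of those settings expressed programmatically; the hostname, memory size and auxiliary shuffle service below are placeholder values for a single-node setup, to be adjusted for a real cluster.

```java
import org.apache.hadoop.conf.Configuration;

public class YarnConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");                   // mapred-site.xml: run MapReduce on YARN
        conf.set("yarn.resourcemanager.hostname", "localhost");         // yarn-site.xml: where the ResourceManager runs
        conf.set("yarn.nodemanager.resource.memory-mb", "3072");        // yarn-site.xml: memory a NodeManager can hand out
        conf.set("yarn.nodemanager.aux-services", "mapreduce_shuffle"); // yarn-site.xml: shuffle service for MapReduce
    }
}
```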

  7. What is the workflow of running the wordcount program using the above nodes?

Answer: The workflow of the wordcount program is: the client's application is first received by the ResourceManager node, which divides the job into tasks according to the configured capacity; each task is processed by a NodeManager node, and when processing is complete the results are reported back; finally the results are merged into the job output, which is returned to the client.
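A sketch of a typical WordCount driver that performs the submission described above; the /input and /output paths match the assignment, while TokenizerMapper and IntSumReducer refer to the mapper and reducer sketched under the Map/Reduce-function question further down.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");            // the job that gets submitted to the ResourceManager
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);                // Map phase
        job.setReducerClass(IntSumReducer.class);                 // Reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));    // input data uploaded earlier
        FileOutputFormat.setOutputPath(job, new Path("/output")); // results end up in /output/part-r-00000
        System.exit(job.waitForCompletion(true) ? 0 : 1);         // block until the job finishes
    }
}
```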

  8. When the wordcount program is running, what functions do you see running?

Answer: During the running of the wordcount program, you can see that the following functions are called: Map function, Reduce function, shuffle function, sort function, merge function and output function.

  • 7.7 Exercises
  1. Describe the relationship between MapReduce and Hadoop.

Answer: Google first proposed the distributed parallel programming model MapReduce, and Hadoop MapReduce is its open-source implementation. Google's MapReduce runs on the distributed file system GFS; similarly, Hadoop MapReduce runs on the distributed file system HDFS. Relatively speaking, Hadoop MapReduce has a much lower barrier to use than Google's MapReduce: programmers can easily develop distributed programs and deploy them to computer clusters even if they have no experience with distributed programming.

  2. MapReduce is a powerful tool for processing big data, but not every task can be processed using MapReduce. Describe the requirements that tasks or data sets suitable for processing with MapReduce need to meet.

Answer: Data sets suitable for processing with MapReduce need to meet a prerequisite: the data set to be processed can be decomposed into many small data sets, and each small data set can be processed completely in parallel.

  3. The core of the MapReduce computing model is the Map function and the Reduce function. Describe the input, output and processing of these two functions.

Answer: The input of the Map function is a (key, value) pair, and its output is a set of intermediate results, usually written as (key', value') pairs, where key' is the key of an intermediate result and value' is its value. The Map function processes the input data and generates this set of intermediate key-value pairs. The input of the Reduce function is a set of intermediate results sharing the same key, and its output is a set of final results, usually written as (key'', value'') pairs, where key'' is the key of a final result and value'' is its value. The Reduce function performs an aggregation operation on the intermediate results to produce the final results.
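A sketch of the two functions for wordcount, modelled on the classic Hadoop example: the map emits one (word, 1) pair per token, and the reduce sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (offset, line of text) -> a set of intermediate (word, 1) pairs
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // intermediate result (key', value') = (word, 1)
        }
    }
}

// Reduce: (word, [1, 1, ...]) -> final (word, total count) pair
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();           // aggregate all counts for the same word
        }
        result.set(sum);
        context.write(key, result);     // final result (key'', value'') = (word, total)
    }
}
```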

  4. Describe the workflow of MapReduce (including the processes of submitting a task, Map, Shuffle and Reduce).

Answer:

  • Submit a task: the user submits a MapReduce task to Hadoop.
  • Map phase: each Map task reads a data block, performs specific processing on it and generates a set of intermediate results.
  • Shuffle phase: the intermediate results are sorted, grouped and partitioned, then passed to the corresponding Reduce tasks.
  • Reduce phase: each Reduce task performs an aggregation operation on the set of intermediate results assigned to it to generate a set of final results.

  5. The Shuffle process is the core of the MapReduce workflow, also known as the place where "magic happens". Analyze the role of the Shuffle process.

Answer: The Shuffle process is the core of the MapReduce workflow. It sorts, groups and partitions the intermediate results of the Map tasks by key and passes them to the corresponding Reduce tasks. In doing so it provides each Reduce task with the correct input and ensures that the Reduce function can perform its aggregation on the right set of intermediate results.

  6. Describe the Shuffle process on the Map side and the Reduce side respectively (including the processes of spilling to disk, sorting, merging and "receiving").

Answer: The Shuffle process on the Map side includes partitioning the key-value pairs output by the Map function according to their keys, sorting the data within each partition, and writing the partitioned data to temporary files on the local disk. The Shuffle process on the Reduce side includes fetching, sorting and merging the data coming from different Map tasks according to their keys, and then handing the merged data to the corresponding Reduce task for processing. "Receiving" means that each Reduce task fetches the data belonging to it from the temporary disk files on the nodes where the Map tasks ran. Spilling (overflow writing) means that when a Map task's output exceeds a certain threshold of the in-memory buffer, part of the data is written to disk instead of being kept in memory, to prevent memory overflow.
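As a sketch, the spill behaviour described above is controlled by two job properties; the values shown are the usual defaults, treated here as assumptions to verify against your Hadoop version.

```java
import org.apache.hadoop.conf.Configuration;

public class SpillSettingsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("mapreduce.task.io.sort.mb", "100");         // size (MB) of the in-memory buffer that holds map output
        conf.set("mapreduce.map.sort.spill.percent", "0.80"); // buffer fill ratio at which a spill to disk is triggered
    }
}
```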

  7. There is a principle in MapReduce: moving computation is cheaper than moving data. Describe what local computation is and analyze why local computation is used.

Answer: One of the design ideas of MapReduce is to "move the computation to the data" rather than "move the data to the computation", because moving data incurs a lot of network transfer overhead; in a large-scale data environment this overhead is especially high, so moving computation is cheaper than moving data. Local computation means that, whenever possible, the MapReduce framework runs a Map task on the node where its HDFS input data is stored, so that the compute node and the storage node are the same, which reduces the cost of moving data between nodes.

  8. Try to explain what factors determine the number of Map tasks and the number of Reduce tasks started by a MapReduce program during operation.

Answer: The number of Map tasks is determined by how the input data is split: one Map task is started per input split, which by default follows the HDFS block size, so the size of the input data determines how many Map tasks are needed. The number of Reduce tasks is set by the job configuration. The computing resources available in the Hadoop cluster limit how many Map and Reduce tasks can run simultaneously.
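A sketch under those assumptions: the Reduce count is set explicitly on the job, while the Map count follows from the number of input splits, which can be influenced through the split-size limit (the values below are illustrative).

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class TaskCountSketch {
    public static void configure(Job job) {
        job.setNumReduceTasks(2);                                      // number of Reduce tasks (illustrative value)
        FileInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L); // cap the split size: smaller splits mean more Map tasks
    }
}
```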

  9. Do all MapReduce programs need to go through the two processes of Map and Reduce? If not, please give an example.

Answer: No. For example, the selection operation on a relation can be implemented with the Map process alone: for each tuple t in the relation R, check whether it satisfies the condition; if it does, output the key-value pair <t, t>, that is, both the key and the value are t. The Reduce function in this case would just be the identity, outputting its input without any transformation.
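A sketch of such a map-only selection job; the condition (lines containing "xmu") is only an illustrative assumption, and setting the number of Reduce tasks to zero skips the Reduce phase entirely.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class SelectionMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().contains("xmu")) { // selection condition on tuple t (assumed)
            context.write(value, value);        // emit <t, t>: both key and value are the tuple itself
        }
    }

    public static void configure(Job job) {
        job.setMapperClass(SelectionMapper.class);
        job.setNumReduceTasks(0);               // map-only job: the Reduce phase is skipped
    }
}
```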

  10. Try to analyze why using a Combiner can reduce the amount of data transmitted. Can all MapReduce programs use a Combiner? Why?

Answer: For all the key-value pairs in each partition, a background thread sorts them in memory by key; sorting is a default operation of MapReduce. After sorting, an optional merge (combine) operation can follow. If the user has not defined a Combiner function, no merge is performed; if a Combiner is defined, the merge runs at this point, reducing the amount of data that needs to be spilled to disk. "Merging" here means adding up the values of <key, value> pairs with the same key: for example, the two key-value pairs <"xmu", 1> and <"xmu", 1> become the single pair <"xmu", 2> after the merge, which reduces the number of key-value pairs. However, a Combiner cannot be used in all situations, because the output of the Combiner is the input of the Reduce task, and the Combiner must not change the final calculation result of the Reduce task. Generally speaking, the merge operation can be used for aggregations such as sums and maximums.
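For wordcount this is a one-line sketch: the reducer itself can serve as the Combiner, because summing partial counts on each Map node does not change the final totals (IntSumReducer refers to the reducer sketched earlier).

```java
import org.apache.hadoop.mapreduce.Job;

public class CombinerSetupSketch {
    public static void configure(Job job) {
        job.setCombinerClass(IntSumReducer.class); // merge (word, 1) pairs locally before the shuffle
    }
}
```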

  11. The input files and output files of a MapReduce program are stored in HDFS, while the intermediate results produced when a Map task completes are stored on the local disk. Try to analyze the advantages and disadvantages of storing intermediate results on the local disk instead of in HDFS.

Answer: Advantages: 1. Writing the intermediate results to the local disk avoids the replication and network overhead of writing them to HDFS, which effectively reduces the time spent writing them out. 2. It makes the Map task's access to its intermediate data faster and reduces network I/O overhead.

Disadvantages: 1. Because the intermediate results are stored on local disks, they and the files on HDFS are distributed across different nodes, so the data in the system is not kept in one place. 2. Storing intermediate results on the local disk brings data security and reliability issues: if the node fails, the intermediate results are lost and must be recomputed.

  12. The default block size of early HDFS is 64 MB, while the default block size of newer versions is 128 MB. What are the impacts, advantages and disadvantages of using large blocks?

Answer: Advantages: 1. A larger block size reduces the number of blocks stored in HDFS, thereby reducing the NameNode's metadata overhead and improving overall read and write performance. 2. Larger blocks allow larger sequential data streams, which makes better use of network bandwidth and increases the speed of data transfer. 3. The data is split fewer times, which reduces the cost of splitting and improves the efficiency of data processing.

Disadvantages: 1. If a block contains only a small file, reading the file still carries the overhead of a whole block, which wastes storage space and network bandwidth. 2. If the data within a block is not evenly distributed, it can lead to data skew and underutilization of resources. 3. If a task only needs a small part of the data in a block, the entire block still has to be read, which increases the overhead of data transfer and the latency of reading.


Origin blog.csdn.net/qq_50530107/article/details/131260885