Hadoop Frequently Asked Questions and Summary

1. Why does a map task write its output to local disk rather than to HDFS?

A: Because the map output is an intermediate result: it is processed by the reduce tasks to produce the final result.

Once the job completes, the map output can be discarded, so storing it in HDFS with replication would be overkill.

2. Why should the optimal split size be the same as the block size (128 MB by default in Hadoop 2.x, 64 MB in Hadoop 1.x)?

A: Because it guarantees that the split is the largest amount of input that can be stored on a single node. If a split spanned two blocks, it would be very unlikely that any HDFS node stored both of them, so part of the split would have to be transferred across the network to the node running the map task. Compared with running the whole map task on local data, this is clearly less efficient.

3. What is the role of the Combiner?

A: The Combiner is an optimization. It merges map output before it is passed on as input to the reducers, which reduces the amount of data transferred between the map and reduce tasks. However many times the combiner is called, the reducer's output must stay the same.
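For illustration, here is a minimal sketch (not from the original article) of the classic word count with the reducer reused as the combiner. Summing is commutative and associative, so the combiner can be the same class as the reducer; class and path names are just placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);              // emit (word, 1) for every token
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();                        // partial or final sum, same logic either way
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count with combiner");
    job.setJarByClass(WordCountWithCombiner.class);
    job.setMapperClass(TokenizerMapper.class);
    // Summing is order-independent, so the reducer doubles as the combiner:
    // it pre-aggregates map output locally and shrinks the shuffle.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```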

4. Why is HDFS unsuitable for storing small files?

A: The namenode keeps the entire file system's metadata in memory, so the number of files that can be stored is limited by the namenode's memory. As a rule of thumb, the metadata for each file, directory, and block takes about 150 bytes; one million small files, each occupying its own block, would therefore need on the order of 300 MB of namenode memory. A large number of small files thus consumes a disproportionate amount of namenode memory, which is why HDFS is not suited to storing them.

5. What are the advantages and disadvantages of the large block size in HDFS?

A: First, the benefits: HDFS blocks are larger than disk blocks in order to minimize the cost of seeks. If the block is large enough, the time needed to transfer the data from disk is significantly longer than the time needed to seek to the start of the block, so the time to transfer a file made up of multiple blocks depends mainly on the disk transfer rate.

Disadvantages: if the block size is too large, many readers requesting the same data at the same time will aim a large number of IO requests at a single block, which can cause congestion.

6. What two mechanisms does Hadoop provide for namenode fault tolerance?

A: The first mechanism is to back up the files that make up the persistent state of the file system metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple file systems. These writes are synchronous and atomic. The usual configuration is to write to the local disk and to a remote NFS mount at the same time.

The second method is to run a secondary namenode, which, despite its name, cannot act as a namenode. Its important role is to periodically merge the namespace image with the edit log, to keep the edit log from growing too large. The secondary namenode usually runs on a separate physical machine, because the merge needs plenty of CPU time and as much memory as the namenode itself. It keeps a copy of the merged namespace image, which can be used if the namenode fails. However, the state saved by the secondary namenode always lags behind the primary, so if the primary fails completely, some data loss is almost inevitable. The usual course of action in that case is to copy the namenode's metadata files stored on NFS to the secondary namenode and run it as the new primary.

7. What is the HDFS replica placement strategy?

A: Hadoop's default placement strategy is to put the first replica on the node where the client is running (if the client runs outside the cluster, a node is chosen at random, although the system avoids nodes that are too full or too busy). The second replica is placed on a randomly chosen node on a rack different from the first. The third replica is placed on the same rack as the second, on another randomly chosen node. Further replicas are placed on randomly chosen nodes in the cluster, although the system tries to avoid placing too many replicas on the same rack.

8. What are the benefits of compressing files in HDFS?

A: Compression reduces the disk space needed to store the files and speeds up data transfer across the network and to or from disk.

9. Which compression format should you use?

A: Use a container file format, such as sequence files, RCFiles, or Avro data files, all of which support both compression and splitting. It is usually best to combine them with a fast compression tool such as LZO, LZ4, or Snappy.

Use a compression format that supports splitting, such as bzip2, or one that can be indexed to support splitting, such as LZO.

Split the file into chunks in the application and compress each chunk separately with any compression format (whether or not it supports splitting). In this case, choose the chunk size so that the compressed chunks are roughly the size of an HDFS block.

For large files, however, do not use a compression format that does not support splitting on the whole file, because that loses data locality and makes the MapReduce application very inefficient.
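As a minimal sketch of the first recommendation (assumed paths and a map-only identity job, not from the original article): write the output as a sequence file, a splittable container format, block-compressed with Snappy. Note that the Snappy codec needs the Hadoop native libraries to be available.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressedSequenceFileJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "compressed container output");
    job.setJarByClass(CompressedSequenceFileJob.class);
    // Map-only identity job: the default Mapper passes keys and values through.
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Write a sequence file (a splittable container format)...
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    // ...compressed with a fast codec; block compression keeps the output splittable.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```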

10. What are the four desirable properties of an RPC serialization format?

A: Compact --- a compact format makes the best use of network bandwidth, the scarcest resource in a data center.

Fast --- interprocess communication forms the backbone of a distributed system, so the serialization and deserialization overhead must be kept as small as possible; this is the most basic requirement.

Extensible --- protocols change over time, so it must remain possible to read data written in the old format.

Interoperable --- it should be possible to read and write persistent data from different programming languages.

11. What are the steps to install Hadoop, and which configuration files are involved?

A: Installation steps:

1) Configure the hosts file.

2) Create a dedicated hadoop user account.

3) Set up passwordless SSH between the nodes.

4) Download and unpack the Hadoop installation package.

5) Configure the NameNode by editing the site files (core-site.xml, hdfs-site.xml, mapred-site.xml).

6) Set the JDK path in hadoop-env.sh.

7) Configure the masters and slaves files.

8) Copy the Hadoop installation to every node.

9) Format the namenode.

10) Start Hadoop.

11) Use jps to check that the daemons on every node started successfully.

12) Check the state of the cluster through the web UI.

The main configuration files to modify are core-site.xml, hdfs-site.xml, and mapred-site.xml.
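To give an idea of what each site file carries, here is a hedged sketch that sets the same properties programmatically through the Hadoop Configuration API; the hostname, port, and values are assumptions chosen only for illustration, and in a real cluster they live in the XML files on every node.

```java
import org.apache.hadoop.conf.Configuration;

public class SiteConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // core-site.xml: the default file system URI (hostname and port assumed).
    conf.set("fs.defaultFS", "hdfs://master:9000");
    // hdfs-site.xml: the block replication factor.
    conf.set("dfs.replication", "3");
    // mapred-site.xml: run MapReduce on YARN (Hadoop 2.x).
    conf.set("mapreduce.framework.name", "yarn");
    System.out.println("default FS = " + conf.get("fs.defaultFS"));
  }
}
```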

12. What methods are there for tuning the performance of a MapReduce program?

A: The number of mappers; the number of reducers; whether a combiner is used; compressing intermediate values (map output compression); custom serialization; tuning the shuffle parameters (enlarging the ring buffer to reduce the number of spills, raising io.sort.factor so that more spill files are merged per round, and so on); reduce-side tuning; speculative execution (enabled by default); and the choice of partitioner (partition function).
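A minimal sketch of a few of these knobs, using the Hadoop 1.x era property names that this article uses (newer releases renamed them to mapreduce.task.io.sort.* and mapreduce.map.output.compress*); the specific values are illustrative assumptions, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class TuningSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Shuffle tuning: a bigger ring buffer means fewer spills,
    // and a bigger merge factor means fewer merge rounds.
    conf.setInt("io.sort.mb", 200);
    conf.setInt("io.sort.factor", 50);
    // Compress intermediate map output to cut shuffle traffic.
    conf.setBoolean("mapred.compress.map.output", true);
    conf.setClass("mapred.map.output.compression.codec",
        SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "tuned job");
    job.setNumReduceTasks(8);        // choose the reducer count explicitly
    // job.setCombinerClass(...) would go here when the reduce function allows it.
  }
}
```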

13. What failures can occur in classic MapReduce?

A: Task failure --- the code of a map or reduce task throws an exception, or the child JVM exits unexpectedly.

Tasktracker failure --- the tasktracker crashes, runs too slowly, fails too many times, and so on.

Jobtracker failure --- the most serious case, since the jobtracker is a single point of failure; if it fails, the job is bound to fail. YARN improves on this situation: one of its design goals is to eliminate this single point of failure.

14. What failures can occur under YARN?

A: Task failure --- the code of a map or reduce task throws an exception, or the child JVM exits unexpectedly.

Application master failure.

Node manager failure.

Resource manager failure.

15. When do the spill to disk and the combiner happen in MapReduce?

A: Each map task has a circular memory buffer in which it stores its output; the default size is 100 MB, adjustable through the io.sort.mb property. Once the buffer content reaches a threshold (io.sort.spill.percent, 0.8 by default), a background thread starts to spill the contents to disk. Map output continues to be written to the buffer while the spill is in progress, but if the buffer fills up in the meantime, the map blocks until the spill completes. Before writing to disk, the thread first divides the data into partitions corresponding to the reducers the data will ultimately be sent to. Within each partition, the background thread sorts by key, and if there is a combiner it is run on the sorted output. The combiner's output is more compact, so less data is written to disk and transferred to the reducers.
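The partitioning step mentioned above is pluggable. As a hedged illustration (a hypothetical partitioner, not Hadoop's default HashPartitioner), the sketch below routes each map output record to a reducer based on the first character of its key.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides, before the sort and optional combiner run, which reducer (partition)
// each (key, value) pair belongs to.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    int first = Character.toLowerCase(key.toString().charAt(0));
    return (first % numPartitions + numPartitions) % numPartitions;  // always in [0, numPartitions)
  }
}
// Registered on the job with: job.setPartitionerClass(FirstLetterPartitioner.class);
```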

16. How does a reducer know which machines to fetch map output from?

A: When a map task completes, it notifies its parent tasktracker of the status update, and the tasktracker in turn notifies the jobtracker (in Hadoop 2, the task notifies its application master directly). These notifications travel over the heartbeat mechanism. A thread in the reducer periodically asks the jobtracker for map output locations until it has learned all of them.

17. What is "speculative execution"?

A: When a job consists of hundreds or thousands of tasks, some tasks may run slowly. When a task runs slower than the average, another identical task is launched as a backup (but only after all of the job's tasks have been started). Once one of the two copies finishes, any duplicate still running is killed. Speculative execution is enabled by default.
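If the backup tasks are not wanted (for example, when tasks have side effects), speculative execution can be switched off per job. A hedged sketch with the Hadoop 1.x property names used elsewhere in this article (Hadoop 2.x uses mapreduce.map.speculative and mapreduce.reduce.speculative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationOffSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Speculative execution is on by default; turn it off for both phases.
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
    Job job = Job.getInstance(conf, "no speculation");
    // ... mapper/reducer setup as usual ...
  }
}
```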

18. What is Hadoop's skipping mode? How does it work?

A: Large data sets often contain a few bad records, for example records with missing fields. These few bad records do not affect the final result, but they cause exceptions and failures when the MapReduce job runs. To deal with this, skipping mode makes the task report the records it is processing back to the tasktracker; when the task fails, the tasktracker reruns it and skips the records that caused the previous attempt to fail. Because of the extra network traffic and the bookkeeping needed to maintain the ranges of failed records, skipping mode is only enabled after a task has already failed twice.

The mechanism works as follows:

a) The task fails.

b) The task fails again.

c) Skipping mode is turned on. The task fails, but the failed record is stored by the tasktracker.

d) Skipping mode is still enabled. The task succeeds, skipping the bad record that failed in the previous attempt.

By default, skipping mode is off. In each task attempt, skipping mode can detect only one bad record, so the mechanism is only suitable for catching occasional bad records. To give skipping mode enough attempts to detect and skip every bad record in an input split, increase the maximum number of task attempts (via mapred.map.max.attempts and mapred.reduce.max.attempts); see the sketch below.

Hadoop stores the bad records it detects as sequence files in the job's output directory, under the _logs/skip subdirectory.
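A hedged configuration sketch using the old (org.apache.hadoop.mapred) API, which is where the SkipBadRecords helper lives; the thresholds are illustrative assumptions.

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkippingModeSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf(SkippingModeSketch.class);
    // Enter skipping mode only after two failed attempts of the same task.
    SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
    // Tolerate at most one skipped map input record per bad range.
    SkipBadRecords.setMapperMaxSkipRecords(conf, 1L);
    // Give the task enough attempts to narrow down and skip all bad records.
    conf.setMaxMapAttempts(10);
  }
}
```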

19. How do you choose the number of reducers?

A: For a Hadoop beginner, the default configuration of a single reducer is very convenient, but in real applications a big job should set the number higher; otherwise all the intermediate data is funnelled into one reducer and the job becomes very inefficient. When a job runs locally, only zero or one reducer is supported.

The optimal number of reducers is related to the number of available reducer task slots in the cluster. The total number of slots is the number of nodes in the cluster multiplied by the number of slots per node, and the number of slots per node is determined by the mapred.tasktracker.reduce.tasks.maximum property.

A common approach is to set the number of reducers slightly below the total number of slots, leaving some headroom for the reducer tasks. If the reduce tasks are very large, it is wiser to use more reducers, so that each task is smaller and the failure of one task does not significantly affect the job's performance.
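A minimal sketch of that rule of thumb with made-up cluster figures (the node and slot counts are assumptions for illustration only):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountSketch {
  public static void main(String[] args) throws Exception {
    int nodes = 10;                       // hypothetical cluster size
    int reduceSlotsPerNode = 2;           // mapred.tasktracker.reduce.tasks.maximum
    int totalSlots = nodes * reduceSlotsPerNode;
    int reducers = totalSlots - 1;        // slightly below the total, leaving headroom

    Job job = Job.getInstance(new Configuration(), "reducer count example");
    job.setNumReduceTasks(reducers);
  }
}
```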

20. What is the formula for the split size?

A: max(minimumSize, min(maximumSize, blockSize))

By default, minimumSize < blockSize < maximumSize.

However, the split size can be forced by changing these parameters. Normally the split size ends up equal to the block size.
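For illustration, the sketch below forces a larger split by raising the minimum split size above the block size; the 256 MB and 512 MB figures are arbitrary assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "split size example");
    // splitSize = max(minimumSize, min(maximumSize, blockSize))
    // With a 128 MB block, a 256 MB minimum forces 256 MB splits.
    FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);
  }
}
```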

21. Why is Hadoop better suited to a small number of large files?

A: One reason is that FileInputFormat generates splits from whole files or parts of files. If the files are small (much smaller than an HDFS block) and numerous, then each map task processes only a tiny amount of input (one small file), and there are a great many map tasks, each of which adds extra overhead. For example, a 1 GB file stored as 16 blocks of 64 MB needs 16 map tasks, while ten thousand files of roughly 100 KB each need ten thousand map tasks, and that job can run tens or even hundreds of times slower than the single-file job with 16 map tasks.

22. What are the ways to prevent a file from being split?

A: First: set the minimumSize parameter larger than the size of the file. Second: use a concrete subclass of FileInputFormat and override the isSplitable() method to return false (see the sketch below).
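A minimal sketch of the second technique, assuming text input; the class name is a placeholder.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// An input format whose files are never split: each file becomes exactly one split.
public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}
// Used with: job.setInputFormatClass(NonSplittableTextInputFormat.class);
```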

23. What are the two ways to run a MapReduce job?

A: job.submit();   ToolRunner.run();
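A hedged sketch of the ToolRunner route (input and output paths come from the command line; the mapper and reducer are left as the identity defaults for brevity):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ToolRunnerDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "toolrunner driver");
    job.setJarByClass(ToolRunnerDriver.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // job.submit() would return immediately; waitForCompletion() blocks and prints progress.
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic options (-D, -files, -libjars, -archives) before run().
    System.exit(ToolRunner.run(new Configuration(), new ToolRunnerDriver(), args));
  }
}
```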

24. How do you produce a globally sorted file with Hadoop?

A: The simplest way is to use a single partition (a single reducer), but this is inefficient for large files. Another approach is to first create a set of sorted files, then merge them, and finally produce the globally sorted file. The main idea is to use a partitioner that reflects the global order of the output. The key point is how to divide the partitions; usually a sampler is used when building them. The core idea of sampling is to look at only a small fraction of the keys in order to get an approximate key distribution, and to build the partitions from that.
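A hedged sketch of the sampler-plus-partitioner idea, assuming the input is a sequence file keyed by Text with an identity map; the sampling rates and the partition file path are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class GlobalSortSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "global sort");
    job.setJarByClass(GlobalSortSketch.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);  // Text keys, Text values assumed
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(4);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Sample a small fraction of the keys to approximate their distribution,
    // write the partition boundaries, and let TotalOrderPartitioner route keys so
    // that reducer i only receives keys smaller than those of reducer i+1.
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
        new Path("/tmp/_partitions"));                       // path chosen for illustration
    InputSampler.Sampler<Text, Text> sampler =
        new InputSampler.RandomSampler<>(0.1, 10000, 10);
    InputSampler.writePartitionFile(job, sampler);
    job.setPartitionerClass(TotalOrderPartitioner.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```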

25. How does the distributed cache mechanism work in MapReduce?

A: When you launch a job, Hadoop copies the files specified by the -files, -archives, and -libjars options into the distributed file system (normally HDFS). Then, before a task runs, the tasktracker copies the files from the distributed file system to the local disk (the cache) so that the task can access them. At this point the files are said to have been "localized".
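A hedged sketch of the programmatic equivalent of -files (the lookup-file path is a made-up example): GenericOptionsParser handles the command-line options, while Job.addCacheFile() adds a file to the cache directly.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.GenericOptionsParser;

public class DistributedCacheSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // GenericOptionsParser is what interprets -files, -archives and -libjars.
    String[] jobArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

    Job job = Job.getInstance(conf, "distributed cache example");
    // Programmatic route: the file is copied to HDFS and then "localized"
    // onto each node's local disk before the tasks start.
    job.addCacheFile(new URI("/lookup/stopwords.txt"));  // HDFS path assumed for illustration
    // Inside a task, context.getCacheFiles() lists the localized files.
    System.out.println("remaining args: " + jobArgs.length);
  }
}
```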

26. Why is Hadoop not suited to running on non-Unix platforms?

A: Hadoop is mainly written in Java and can run on any platform with a JVM installed. However, some of its code (the control scripts, for example) must be executed in a Unix environment, so Hadoop is not suitable for running as a production system on non-Unix platforms.

27. How does YARN's memory model differ from memory management in early MapReduce?

A: Early MapReduce configured a fixed number of map slots and a fixed number of reduce slots, but in YARN the node manager allocates memory from a pool, which means the number of tasks that can run concurrently depends on the total memory demand rather than on a fixed number of slots. This removes the problem of the computation needing many map slots early on and many reduce slots later while the fixed split sits idle; the finer-grained control makes memory allocation more rational and the computation more efficient.
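For illustration, the properties involved can be sketched as follows; the sizes are made-up examples, and in practice they belong in yarn-site.xml and mapred-site.xml rather than in code.

```java
import org.apache.hadoop.conf.Configuration;

public class YarnMemorySketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Under YARN there are no fixed map/reduce slots: each node manager advertises
    // a memory pool, and every task asks for a container of a given size.
    conf.setInt("yarn.nodemanager.resource.memory-mb", 8192); // memory pool per node
    conf.setInt("mapreduce.map.memory.mb", 1024);             // container per map task
    conf.setInt("mapreduce.reduce.memory.mb", 2048);          // container per reduce task
  }
}
```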

28. How was a checkpoint created in early Hadoop?

1) The secondary namenode asks the primary namenode to stop using its edits file, so that new edits temporarily go to a new file.

2) The secondary namenode fetches fsimage and edits from the primary namenode (using HTTP GET).

3) The secondary namenode loads fsimage into memory, applies the operations from edits one by one, and creates a new fsimage file.

4) The secondary namenode sends the new fsimage back to the primary namenode (using HTTP POST).

5) The primary namenode replaces its old fsimage with the fsimage received from the secondary namenode, and replaces the old edits file with the new one created in step 1. It also updates the fstime file to record the time the checkpoint was taken.

Note: the checkpoint trigger is controlled by two configuration parameters. Normally a checkpoint is created every hour (controlled by the fs.checkpoint.period property, in seconds). In addition, when the edit log reaches 64 MB (set by the fs.checkpoint.size property, in bytes), a checkpoint is created even if an hour has not passed; the size of the edit log is checked every five minutes.

29. What are the commands for entering and leaving safe mode?

A: hadoop dfsadmin -safemode get;

hadoop dfsadmin -safemode wait;

hadoop dfsadmin -safemode enter;

hadoop dfsadmin -safemode leave;

30. How do you add a new node to and remove a node from a Hadoop cluster?

Adding a node:

1) Add the network address of the new node to the include file.

2) Update the namenode with the new set of permitted datanodes: hadoop dfsadmin -refreshNodes.

3) Update the jobtracker with the new set of permitted tasktrackers: hadoop mradmin -refreshNodes.

4) Update the slaves file with the new node.

5) Start the new datanode and tasktracker.

6) Check that the new datanode and tasktracker appear in the web UI.

Removing a node:

1) Add the network address of the node to be decommissioned to the exclude file; do not update the include file at this point.

2) Update the namenode with the new set of permitted datanodes: hadoop dfsadmin -refreshNodes.

3) Update the jobtracker with the new set of permitted tasktrackers: hadoop mradmin -refreshNodes.

4) Go to the web UI and check whether the admin state of the datanodes being decommissioned has changed to "Decommission In Progress"; at this point those datanodes are replicating their blocks to other datanodes.

5) When the state of all the decommissioned datanodes changes to "Decommissioned", all their blocks have been replicated. Shut down the decommissioned nodes.

6) Remove the nodes from the include file and run: hadoop dfsadmin -refreshNodes; hadoop mradmin -refreshNodes.

7) Remove the nodes from the slaves file.
