Interview Questions and Answers on the Hadoop Big Data Framework

What is Hadoop? Hadoop is an open-source software framework from Apache for developing and running large-scale data processing. It is implemented in Java and distributes the processing of huge amounts of data across computing clusters made up of large numbers of machines. Let's take a look at the questions that are commonly asked about Hadoop in interviews, and how to answer them.

1. Briefly describe how to install and configure the open-source Apache version of Hadoop. A description alone is enough and you do not need to list every step in full, but being able to list the steps is even better.

1) Install the JDK and configure its environment variables (/etc/profile)

2) Turn off the firewall

3) Configure the hosts file so the Hadoop nodes can be reached by host name (/etc/hosts)

4) Set up password-free SSH login

5) Unpack the Hadoop installation package and configure its environment variables

6) Modify the configuration files ($HADOOP_HOME/conf):

hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml

7) Format the HDFS file system (hadoop namenode -format)

8) Start Hadoop ($HADOOP_HOME/bin/start-all.sh)

9) Check the running processes with jps

2. Please list all the processes that must be started for a Hadoop cluster to operate normally, and describe the role of each one; be as comprehensive as possible.

1) NameNode: the HDFS master daemon, responsible for recording how files are split into data blocks and which DataNodes those blocks are stored on; its main function is the centralized management of metadata in memory and of I/O.

2) Secondary NameNode: an auxiliary daemon that communicates with the NameNode and periodically saves snapshots of the HDFS metadata.

3) DataNode: responsible for reading and writing HDFS data blocks on the local file system.

4) JobTracker: responsible for assigning tasks and monitoring all running tasks.

5) TaskTracker: responsible for executing individual tasks and interacting with the JobTracker.

3. Please list the Hadoop schedulers you know and briefly describe how each one works.

The three most popular schedulers are: the default FIFO scheduler, the Capacity Scheduler, and the Fair Scheduler.

1) Default FIFO scheduler

Hadoop's default scheduler; jobs are executed on a first-in, first-out basis.

2) Capacity Scheduler

Jobs that occupy fewer resources and have higher priority are executed first.

3) Fair Scheduler

All jobs in the same queue share that queue's resources fairly.

4. In what ways can Hive store its metadata, and what are the characteristics of each?

1) Embedded Derby database: small, rarely used in practice

2) Local MySQL: the most common approach

3) Remote MySQL: less common

5. Describe how secondary sort is implemented in Hadoop.

In Hadoop, records are sorted by key by default. What do you do if you also want to sort by value?

There are two methods for secondary sort: buffer and in-memory sort, and value-to-key conversion.

Buffer and in-memory sort

The main idea is: in the reduce() function, buffer all the values belonging to a key and then sort them. The biggest drawback of this method is that it may cause an out-of-memory error.
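As a rough sketch of this approach (assuming Text keys and IntWritable values, which the article does not specify), a reducer that buffers and sorts might look like this:

```java
// A minimal sketch of buffer-and-in-memory sort; buffering every value in a
// list is exactly what can cause the out-of-memory problem noted above.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class BufferSortReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Buffer every value for this key in memory...
        List<Integer> buffered = new ArrayList<>();
        for (IntWritable value : values) {
            buffered.add(value.get());
        }
        // ...then sort and emit; very large value lists will not fit in the heap.
        Collections.sort(buffered);
        for (int value : buffered) {
            context.write(key, new IntWritable(value));
        }
    }
}
```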

Value-to-key conversion

The main idea is: splice part of the value into the key to form a composite key (implementing WritableComparable, or setting a comparator via setSortComparatorClass()), so that the data arriving at reduce is sorted first by key and then by value. Note that the user must implement their own Partitioner so that the data is partitioned by the natural key only. Hadoop explicitly supports secondary sort: the Job class provides a setGroupingComparatorClass() method that can be used to group the values belonging to the same natural key.
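Below is a minimal sketch of value-to-key conversion under illustrative assumptions: input lines already have the form naturalKey<TAB>value, and values are compared lexicographically. The class and field names are not from the article.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SecondarySortSketch {

    // The value is spliced into the key ("naturalKey\tvalue") so the shuffle
    // sorts by value as well as by the natural key.
    public static class SortMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(line), NullWritable.get());
        }
    }

    // Partition on the natural key only, so all composite keys that share a
    // natural key reach the same reducer.
    public static class NaturalKeyPartitioner extends Partitioner<Text, NullWritable> {
        @Override
        public int getPartition(Text key, NullWritable value, int numPartitions) {
            String naturalKey = key.toString().split("\t")[0];
            return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Group on the natural key only, so one reduce() call sees all composite
    // keys of a natural key in value-sorted order.
    public static class NaturalKeyGroupingComparator extends WritableComparator {
        protected NaturalKeyGroupingComparator() {
            super(Text.class, true);
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String ka = a.toString().split("\t")[0];
            String kb = b.toString().split("\t")[0];
            return ka.compareTo(kb);
        }
    }

    public static class SortReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text compositeKey, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // As we iterate, the framework advances compositeKey through the
            // value-sorted composite keys of this natural-key group.
            for (NullWritable ignored : values) {
                context.write(compositeKey, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "secondary sort sketch");
        job.setJarByClass(SecondarySortSketch.class);
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);
        job.setPartitionerClass(NaturalKeyPartitioner.class);
        job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```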

6. Describe the ways in which joins can be implemented in Hadoop.

1) Reduce-side join

The reduce-side join is one of the simplest ways to do a join. The main idea is as follows:

In the map phase, the map function reads the two files File1 and File2. To distinguish which source each key/value pair came from, every record is given a tag, for example tag = 0 for records from File1 and tag = 2 for records from File2. In other words, the main task of the map phase is to tag the data from the different files.

In the reduce phase, the reduce function receives, for each key, the value list gathered from File1 and File2, and then joins (takes the Cartesian product of) the File1 data and the File2 data for that key. In other words, the reduce phase performs the actual join.
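A minimal sketch of this idea, assuming comma-separated records whose first field is the join key; the file names and tags follow the example above, everything else is illustrative:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceSideJoinSketch {

    // Map phase: tag each record with the file it came from.
    public static class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            String tag = fileName.startsWith("File1") ? "0" : "2";
            String[] fields = line.toString().split(",", 2);
            // key = join key, value = tag + rest of the record
            context.write(new Text(fields[0]), new Text(tag + "\t" + fields[1]));
        }
    }

    // Reduce phase: separate the two sources and emit their Cartesian product.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> fromFile1 = new ArrayList<>();
            List<String> fromFile2 = new ArrayList<>();
            for (Text value : values) {
                String[] parts = value.toString().split("\t", 2);
                if ("0".equals(parts[0])) {
                    fromFile1.add(parts[1]);
                } else {
                    fromFile2.add(parts[1]);
                }
            }
            for (String left : fromFile1) {
                for (String right : fromFile2) {
                    context.write(key, new Text(left + "," + right));
                }
            }
        }
    }
}
```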

2) Map-side join

The reduce-side join exists because all of the fields required for the join usually cannot be gathered in the map phase: the fields belonging to the same key may sit in different maps. The reduce-side join is very inefficient because a large amount of data must be transferred during the shuffle phase.

The map-side join is an optimization for the following scenario: of the two tables to be joined, one is very large while the other is small enough to be held directly in memory. We can then replicate the small table so that every map task keeps a copy in memory (for example in a hash table) and scan only the large table: for each key/value record of the large table, look the key up in the hash table and, if a matching record exists, output the joined result.

To support copying the file, Hadoop provides the DistributedCache class, which is used as follows:

A: The user calls the static method DistributedCache.addCacheFile() to specify the file to be copied; its parameter is the file's URI (for a file on HDFS this looks like hdfs://namenode:9000/home/XXX/file, where 9000 is the NameNode port configured by the user). The JobTracker obtains this list of URIs before the job starts and copies the corresponding files to the local disk of every TaskTracker.

B: The user calls DistributedCache.getLocalCacheFiles() to obtain the local file paths, and then reads the corresponding files with the standard file read/write API.
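A minimal sketch combining both steps. The cache URI follows the article's example; the CSV layout and class names are assumptions, and the sketch uses the older DistributedCache API described above (newer releases expose the same idea through Job.addCacheFile() and context.getCacheFiles()):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinSketch {

    // Driver side (step A): register the small table so it is copied to every
    // TaskTracker's local disk before the tasks start.
    public static void addSmallTable(Job job) throws Exception {
        DistributedCache.addCacheFile(
                new URI("hdfs://namenode:9000/home/XXX/small_table.csv"),
                job.getConfiguration());
    }

    // Map side (step B): load the cached small table into a hash table, then
    // join each big-table record against it while scanning.
    public static class MapJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> smallTable = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            try (BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(",", 2);
                    smallTable.put(fields[0], fields[1]);
                }
            }
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",", 2);
            String match = smallTable.get(fields[0]);
            if (match != null) {
                // Emit the joined record; no reduce phase is needed.
                context.write(new Text(fields[0]), new Text(fields[1] + "," + match));
            }
        }
    }
}
```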

3) Semi-join

SemiJoin, also called a semi-join, is a method borrowed from distributed databases. Its motivation: in a reduce-side join, the amount of data transferred across machines is very large and becomes the bottleneck of the join operation; if the data that will not take part in the join can be filtered out already on the map side, network I/O can be reduced dramatically.

The implementation is very simple: take the small table, say File1, extract the keys that take part in the join and save them to a file File3. File3 is usually small and can be loaded into memory. In the map phase, DistributedCache copies File3 to every TaskTracker, and the records of File2 whose key is not in File3 are filtered out; the remaining reduce phase works the same way as in the reduce-side join.
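A minimal sketch of the map-side filtering step, assuming File3 holds one join key per line and has been distributed via DistributedCache; the names and CSV layout are illustrative:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Filters File2 in the map phase: only records whose key appears in File3 are
// emitted, so only they are shuffled to the reduce-side join.
public class SemiJoinFilterMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Set<String> joinKeys = new HashSet<>();

    @Override
    protected void setup(Context context) throws IOException {
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()))) {
            String key;
            while ((key = reader.readLine()) != null) {
                joinKeys.add(key.trim());
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",", 2);
        if (joinKeys.contains(fields[0])) {
            context.write(new Text(fields[0]), new Text(fields[1]));
        }
    }
}
```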

4) Reduce-side join + BloomFilter

In some cases, the set of keys extracted from the small table in the semi-join is still too large to fit in memory; a BloomFilter can then be used to save space.

The most common use of a BloomFilter is to determine whether an element is in a set. Its two most important methods are add() and contains(). Its defining property is that it never produces a false negative, that is: if contains() returns false, the element is definitely not in the set. It can, however, produce some false positives, that is: if contains() returns true, the element may or may not be in the set.

So the keys of the small table can be stored in a BloomFilter and used to filter the large table in the map phase. Some large-table records whose key is not actually in the small table may slip through the filter (but records whose key is in the small table are never filtered out); that does not matter, it merely adds a small amount of extra network I/O.
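A minimal sketch using Hadoop's built-in org.apache.hadoop.util.bloom.BloomFilter, whose membership method is called membershipTest() rather than contains(); the sizing parameters, key-file layout, and class names are illustrative assumptions:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomFilterJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Vector size and number of hash functions trade memory against the
    // false-positive rate.
    private final BloomFilter smallTableKeys =
            new BloomFilter(10_000_000, 6, Hash.MURMUR_HASH);

    @Override
    protected void setup(Context context) throws IOException {
        // Build the filter from the small table's join keys, assumed here to be
        // one key per line in a DistributedCache file (as in the semi-join above).
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()))) {
            String key;
            while ((key = reader.readLine()) != null) {
                smallTableKeys.add(new Key(key.trim().getBytes()));
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",", 2);
        // membershipTest() never returns false for a key that was added (no
        // false negatives), but may return true for one that was not (false
        // positives), which only costs a little extra shuffle traffic.
        if (smallTableKeys.membershipTest(new Key(fields[0].getBytes()))) {
            context.write(new Text(fields[0]), new Text(fields[1]));
        }
    }
}
```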

7. Describe the roles of the combiner and the partition in MapReduce.

Combiner:

Sometimes a map produces a large amount of output. The role of the combiner is to first merge that output on the map side, so as to reduce the amount of data transferred over the network to the reducer.

Note: the output of the mapper is the input of the combiner, and the output of the combiner is the input of the reducer.

Partition:

The intermediate results output by the map tasks are divided into R parts according to key ranges (R is the predefined number of reduce tasks). A hash function is generally used for the division, for example: hash(key) mod R.

This guarantees that the keys in a given range are all handled by the same reduce task.
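A minimal word-count-style sketch showing where the combiner and a hash(key) mod R partitioner plug in; the class names and R = 4 are illustrative:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinerPartitionerSketch {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String word : line.toString().split("\\s+")) {
                context.write(new Text(word), ONE);
            }
        }
    }

    // Used both as the combiner (merging map output locally) and as the
    // reducer, which is valid because summing is commutative and associative.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Partition: hash(key) mod R, so every key in a given range is handled by
    // exactly one of the R reduce tasks.
    public static class HashModPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combiner and partitioner sketch");
        job.setJarByClass(CombinerPartitionerSketch.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // merge output on the map side
        job.setReducerClass(SumReducer.class);
        job.setPartitionerClass(HashModPartitioner.class);
        job.setNumReduceTasks(4);                 // R = 4
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```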

