Big data development engineer interview questions

1. Multiple choice questions

1. Which program is responsible for HDFS data storage?

Datanode

2. How many copies of a block in HDFS are saved by default?

Default 3 copies

3. Which program is usually started on the same node as NameNode?

Jobtracker

4. What is the default Block Size of HDFS?

64MB

5. What is usually the main bottleneck of a cluster?

Disk IO

6. What is the purpose of the SecondaryNameNode?

Its purpose is to help NameNode merge edit logs and reduce NameNode startup time

7. What tools can be used for cluster management?

Puppet, Pdsh, Zookeeper

8. The process when the client uploads files

The Client initiates a file-write request to the NameNode. Based on the file size and the block configuration, the NameNode returns to the Client information about a subset of the DataNodes it manages. The Client splits the file into blocks and, following the returned DataNode addresses, writes the blocks to the DataNodes in order. (NameNode -> Client -> split into blocks -> DataNodes)

9. What is the core configuration of Hadoop?

Hadoop's core configuration used to be done through two XML files: hadoop-default.xml and hadoop-site.xml. Both are in XML format, with each property defined by a name and a value, but these files are no longer used.

10. How to configure it now?

Hadoop now has three main configuration files: core-site.xml, hdfs-site.xml, and mapred-site.xml. These files are kept in the conf/ subdirectory.
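
As a small illustration (not part of the original answer, and assuming a Hadoop 2.x-style client library on the classpath), constructing a Configuration object is what actually loads these files:

import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
    public static void main(String[] args) {
        // new Configuration() loads core-default.xml and core-site.xml from the classpath
        Configuration conf = new Configuration();
        // hdfs-site.xml / mapred-site.xml can be added explicitly if they are not on the classpath
        conf.addResource("hdfs-site.xml");
        // the property is called fs.defaultFS in Hadoop 2.x (fs.default.name in older releases)
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
    }
}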

11. What is the use of jps command?

The jps command lists the Java processes running on the machine, so it can be used to check whether the NameNode, DataNode, TaskTracker, and JobTracker daemons are running properly.

12. What is the principle of mapreduce?        

Map stage:

  • Splitting: First, the input data set is divided into multiple small chunks, each containing a portion of the data records. These chunks are called input splits.
  • Mapping: In this phase, the user of the MapReduce job supplies a map function that converts each record of the input data set into a set of key-value pairs. These key-value pairs are produced independently and in parallel; each Map task corresponds to one input split, so the Map stage has a high degree of parallelism. The keys and values emitted by the map function are usually chosen to reflect the structure of the problem.
  • Grouping and sorting: The generated key-value pairs are grouped so that values with the same key are passed to the same Reduce task. They are also sorted by key to make processing in the Reduce stage easier.

Reduce phase:

  • Reduce: In the Reduce phase, the user specifies a reduce function that receives all the values sharing the same key and combines them into one or more output results. Each Reduce task processes a distinct group of keys, and Reduce tasks can also run in parallel.
  • Result output: Finally, the output of the Reduce tasks is written to persistent storage (such as a distributed file system) for further analysis and use. (A WordCount sketch illustrating both phases follows.)
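
To make the two phases concrete, here is a minimal WordCount sketch (an illustration only, written against the org.apache.hadoop.mapreduce API; the input and output paths are assumed to come from the command line):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each input line is tokenized into (word, 1) pairs
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after the shuffle/sort, all counts for one word arrive together
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}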

13. HDFS writing and reading process

  • The client connects to the namenode to request storing data.
  • The namenode records the data's location information (metadata) and tells the client where to store it.
  • The client uses the HDFS API to store the data (default block size 64 MB) on the datanodes.
  • The datanode replicates the data to other datanodes and reports back to the client once replication is complete.
  • The client notifies the namenode that the block has been stored.
  • The namenode synchronizes the metadata into memory.
  • Each subsequent block repeats the process above.

   Reading process

  • The client connects to the namenode, checks the metadata, and finds the storage location of the data.
  • The client concurrently reads data through the HDFS API.
  • Close the connection.
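
As a rough illustration of both flows through the Java FileSystem API (the file path below is a placeholder and the namenode address is assumed to come from core-site.xml):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS (e.g. hdfs://namenode:8020) is read from core-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/demo.txt"); // placeholder path

        // Write: the client asks the namenode where to put blocks, then streams to datanodes
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // Read: the client gets block locations from the namenode and reads from datanodes
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            System.out.println(reader.readLine());
        }
        fs.close();
    }
}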

14. What is the role of Hadoop’s Combiner?

A Combiner is a reduce-style operation that runs on the map side: it pre-aggregates the map output locally, so the map side emits and shuffles less data.

Its purpose is optimization, but a combiner can only be used when the combiner's input and output types match the map output types and partial aggregation does not change the final reduce result (for example sum or count, but not average).
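
A hedged sketch of wiring a combiner into a job driver, reusing the TokenizerMapper and IntSumReducer classes from the WordCount sketch above (assumed to be on the classpath); summation qualifies because partial sums do not change the final result:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CombinerWiring {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(CombinerWiring.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        // The combiner runs on the map side and folds (word, 1) pairs into (word, partialSum),
        // so far less data is shuffled. Its input and output types both match the map output
        // types (Text, IntWritable), which is what makes this legal.
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // ... input/output paths and job.waitForCompletion(true) as in the WordCount sketch
    }
}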

15. Briefly describe hadoop installation

  • Create hadoop account
  • Set up the machine and configure its IP address
  • Install java, modify the /etc/profile file, and configure java environment variables
  • Modify the host file domain name
  • Install SSH and configure passwordless (key-based) login
  • Unpack Hadoop
  • Configure hadoop-env.sh, core-site.xml, mapred-site.xml, and hdfs-site.xml under the conf/ directory
  • Configure hadoop environment variables
  • Format the namenode: hadoop namenode -format
  • Start the cluster: start-all.sh

16. Please list the hadoop process name

  • namenode: manages the cluster and records the metadata for the files stored on the datanodes
  • secondarynamenode: merges the fsimage and edit logs; it can serve as a cold standby, providing snapshot-style backups of the metadata within a certain window
  • datanode: stores data
  • Jobtracker: Manage tasks and assign tasks to tasktracker
  • Tasktracker: task executor

17. Write the following commands

  • Kill a job
  • Delete /tmp/aaa on hdfs
  • Cluster state commands needed after adding a new storage node or removing a compute node

hadoop job -list                 # get the job-id
hadoop job -kill <job-id>
hadoop fs -rmr /tmp/aaa

# when adding a new node:
hadoop-daemon.sh start datanode
hadoop-daemon.sh start tasktracker

# when removing a node:
hadoop mradmin -refreshNodes
hadoop dfsadmin -refreshNodes

18. Briefly describe the scheduler of hadoop

  • FIFO Scheduler: the default; jobs are scheduled on a first-in, first-out basis.
  • Capacity Scheduler: the computing-capacity scheduler; it picks the queue with the lowest resource usage and runs its highest-priority job first, and so on.
  • Fair Scheduler: fair scheduling; all jobs share the cluster resources equally on average.

19. The role of combiner and partition 

  • The combiner is a map-side implementation of reduce: it runs the aggregation on the map side and shrinks the map output before it is shuffled. Its role is optimization, and it can only be used when the combiner's input and output match the map output and the reduce input, i.e. partial aggregation does not change the final result.
  • The default partitioner is HashPartitioner. The map side partitions its output by key according to the number of reducers, and each reducer copies over its own partition. The role of the partition step is to split the data across different reducers so that the reduce work runs in parallel and finishes faster (see the sketch below).
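
For illustration, a sketch of a custom Partitioner (the routing rule and class name are invented); the default HashPartitioner does essentially (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each map output record to one of numPartitions reducers.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        // Hypothetical rule: all words starting with the same letter go to the same reducer
        char first = Character.toLowerCase(key.toString().charAt(0));
        return first % numPartitions;
    }
}
// Wired into the driver with: job.setPartitionerClass(FirstLetterPartitioner.class);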

20. The difference between hive internal tables and external tables

  • Internal (managed) table: the data is loaded into Hive's warehouse directory on HDFS. When the table is dropped, both the metadata and the data files are deleted.
  • External table: the data is not moved into Hive's warehouse directory. When the table is dropped, only the table definition (metadata) is deleted; the data files remain.

21. How to create Hbase rowkey? What is the best way to create a column family?

  • HBase stores data in lexicographic (byte) order of the row key, so the key must be designed to make full use of this sort order.
  • This storage property means rows that are often read together should be given adjacent keys (locality of reference). Each column family is stored in its own files at the storage layer, so columns that are frequently queried together should go into the same column family, and the number of column families should be kept as small as possible to reduce file seek time (see the sketch below).
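
As a hedged illustration (the salt bucket count, field names, and layout are all invented), one common way to build such a rowkey is to combine a small salt with fixed-width, sortable fields:

// Sketch of a rowkey builder: a small salt bucket spreads writes across regions,
// the zero-padded userId keeps one user's rows adjacent in byte order, and a
// reversed timestamp makes that user's newest rows sort first.
public class RowKeyDesign {
    private static final int SALT_BUCKETS = 16; // assumption: 16 pre-split regions

    public static String buildRowKey(long userId, long eventTimeMillis) {
        int salt = (int) (userId % SALT_BUCKETS);           // 0..15
        long reversedTs = Long.MAX_VALUE - eventTimeMillis; // newest first
        // Fixed-width, zero-padded fields keep byte order consistent with numeric order
        return String.format("%02d|%012d|%019d", salt, userId, reversedTs);
    }

    public static void main(String[] args) {
        System.out.println(buildRowKey(42L, System.currentTimeMillis()));
        // A single column family (e.g. "cf") would then hold the frequently co-queried columns.
    }
}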

22. How to deal with data skew problem using mapreduce?

Data skew: when a map/reduce program runs, most of the reduce tasks finish quickly, but one or a few run very slowly, so the whole job takes a long time. The cause is that some keys occur far more often than others (sometimes hundreds or thousands of times more), so the reduce task handling those keys processes far more data than the rest and finishes late. This is called data skew.

Solution: data skew is often encountered when doing data joins with Hadoop programs. One solution is to implement the partitioner yourself and compute the hash over the combined key and value, so that records sharing a hot key are spread across several reducers.
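
A minimal sketch of that idea (the class name is invented, and the follow-up job that merges the per-reducer partial results is not shown):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partitioning on the key alone sends every record of a hot key to one reducer.
// Hashing on key + value spreads a skewed key across many reducers instead;
// a second job then has to merge the partial results for each key.
public class SkewAwarePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        int hash = (key.toString() + value.toString()).hashCode();
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }
}
// Wired into the driver with: job.setPartitionerClass(SkewAwarePartitioner.class);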

 23. How to optimize in the hadoop framework

  • Optimize from an application perspective. Since mapreduce iteratively parses data files line by line, how to write efficient applications under iteration is an optimization idea.
  • Tuning hadoop parameters. The current hadoop system has more than 190 configuration parameters. How to adjust these parameters to make the hadoop job run as fast as possible is also an optimization idea.
  • Optimization from the perspective of system implementation is the most difficult. It is to discover shortcomings in the current Hadoop design and implementation from the perspective of Hadoop implementation mechanism, and then make source code level modifications. Although this method is difficult, it is often effective.
  • Linux kernel parameter adjustment

24. When we develop a job, can we remove the reduce phase?

Yes; just set the number of reduce tasks to 0, which turns it into a map-only job.
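
For illustration, a minimal map-only driver (the pass-through mapper is just a placeholder):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    // Placeholder mapper: just passes each input line through
    public static class PassThroughMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map only");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(PassThroughMapper.class);
        // Zero reducers: the shuffle/sort is skipped and mapper output goes straight to HDFS
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}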

25. Under what circumstances will datanode not be backed up?

Data on a datanode is not backed up when the node is forcibly shut down or suffers an abnormal power outage.

26. HDFS architecture

HDFS consists of the namenode, secondarynamenode, and datanodes. It follows an n+1 model.

  • The namenode is responsible for managing the datanode and recording metadata
  • secondarynamenode is responsible for merging the edit logs
  • datanode is responsible for storing data

27. What will happen if one of the three datanodes has an error?

The data of this datanode will be backed up again on other datanodes.

28. Describe where the caching mechanism is used in Hadoop and what are its functions?

When a MapReduce job is submitted, files placed in the distributed cache are copied to each task node, so that all map and reduce tasks of the job can share them (for example, small lookup tables for map-side joins).
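
A hedged sketch of using it with the Hadoop 2.x-style Job.addCacheFile (older releases use the DistributedCache class instead; the HDFS path and the lookup logic are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {
    public static class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Set<String> lookup = new HashSet<>();

        @Override
        protected void setup(Context context) throws IOException {
            // The cached file is copied to each task node; "#lookup" below names the local link
            try (BufferedReader r = new BufferedReader(new FileReader("lookup"))) {
                String line;
                while ((line = r.readLine()) != null) {
                    lookup.add(line.trim());
                }
            }
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (lookup.contains(value.toString())) {
                context.write(value, new Text("matched"));
            }
        }
    }

    // In the job driver (placeholder HDFS path):
    public static void addLookupFile(Job job) throws Exception {
        job.addCacheFile(new URI("/data/lookup.txt#lookup"));
    }
}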

29. How to determine the health status of Hadoop cluster

Through the web monitoring pages (e.g. the NameNode and JobTracker UIs) and through monitoring/control scripts (e.g. hadoop dfsadmin -report).

30. Why is it recommended to use external tables in a production environment?

  • Because external tables do not move the data into Hive's warehouse directory, data transfer is reduced and the data can be shared with other systems.
  • hive will not modify the data, so there is no need to worry about data damage
  • When deleting a table, only the table structure is deleted without deleting the table data.
