[Interview] Basics of Big Data (2)

0. Question outline

1. Basic Principles of Big Data

1. NameNode && DataNode

1. The roles of NameNode, DataNode, and Secondary NameNode (*2); can the SNN take over for the NN?
2. HDFS HA architecture? How is NameNode HA implemented (*3)? How is the HA election implemented, e.g., the NameNode election?
3. What happens when an HDFS DataNode dies, and what changes on the NameNode?
4. The HDFS EditLog has been written but the NameNode metadata in memory has not been updated; what about the data inconsistency?

2. HDFS read and write process

1. Introduction to HDFS, its features, and the file formats it can store
2. HDFS read and write process (*3)
Follow-up 1: What happens when too many small files are written?
Follow-up 2: How is correctness guaranteed when writing files?
Follow-up 3: What is zero-copy?

3. Shuffle

1. The Shuffle process
 - Follow-up 1: How many sorts happen in the whole process?
 - Follow-up 2: What is the difference between combiner and reduce?
 - Follow-up 3: What are the differences between the MapReduce and Spark shuffles?
2. How do the Map side and Reduce side correspond, how are their numbers determined, and how is the number of reducers set?
3. Use WordCount as an example to explain the MR execution mechanism.

4. Other

1. The difference between ES and HDFS
2. Counting over massive data (single machine): if the large file is hashed into smaller files and a small file still cannot hold all the data for some key, what should you do?

1. Basic Principles of Big Data

1. NameNode && DataNode

1. The roles of NameNode, DataNode, and Secondary NameNode (*2); can the SNN take over for the NN?

Node and role:

  • NameNode: manages the metadata of the entire file system and records, for each block, the DataNodes it is stored on and the file path it belongs to.
  • DataNode: stores the user's file data blocks and periodically reports its list of stored blocks to the NameNode.
  • Secondary NameNode: merges the NameNode's edit logs into the fsimage.

Description:

1、fsimage && edit logs

  • fsimage: A snapshot of the entire file system when the NameNode is started.
  • edit logs: The sequence of changes to the file system after the NameNode is started.

2. Abstract structure of the NameNode (figure omitted).

3. How the NameNode persists its information to disk (figure omitted).

4. How the Secondary NameNode works (figure omitted):
After the NameNode has been running for a long time, the edit logs become very large, and merging all of those changes into the fsimage would take a long time. The Secondary NameNode is therefore set up as an assistant node that periodically fetches the NameNode's edit logs, merges the changes into the fsimage, and copies the updated fsimage back to the NameNode. For this reason it is regarded as a checkpoint node, not a backup of the NameNode.

2. HDFS HA architecture? How is NameNode HA implemented (*3)? How is the HA election implemented, e.g., the NameNode election?

2.1 Main idea of NameNode HA

  • 1) The NNs compete to register on ZooKeeper, i.e., each tries to create an ephemeral node; the one whose write succeeds becomes Active;

  • 2) After registration succeeds, through the watcher mechanism set up after the create, the FailoverController sends commands to each NN and determines their respective states and responsibilities;

  • 3) The HealthMonitor periodically checks the NN's state; if there is a problem, it may trigger a re-election, i.e., steps 1-2 are repeated;

  • 4) The FailoverController keeps a heartbeat with ZooKeeper; if the registered ephemeral node disappears, a re-election is triggered as well.

Main components:
1. DFSZKFailoverController (extends the abstract class ZKFailoverController): ZKFC, the failover controller daemon responsible for the overall failover and for issuing commands;
2. ActiveStandbyElector: implements the Active/Standby election;
3. HealthMonitor: the health monitor, which periodically checks the NN's state.


2.2 How is the HA election implemented, e.g., the NameNode election?

Answer: ZooKeeper provides a simple mechanism for electing the Active node. If the current Active fails, the Standby nodes try to acquire a specific exclusive lock (a znode), and the node that acquires the lock becomes the next Active.

Additional notes:

1. The concrete NameNode election process:
The active/standby election of the NameNode (and of the YARN ResourceManager) is carried out by ActiveStandbyElector, which mainly relies on ZooKeeper's write consistency and its ephemeral-node mechanism. The process is as follows (see the sketch after this list):

Step 1. Create the lock node.

1) The HealthMonitor checks the NameNode; if its state is healthy, it is eligible to take part in the ZooKeeper election.
2) If no election has been initiated yet, ActiveStandbyElector tries to create an ephemeral node on ZooKeeper;
   ZooKeeper's write consistency guarantees that only one create succeeds.
3) The NameNode whose ActiveStandbyElector succeeded becomes the primary NameNode, and a ZKFailoverController callback switches it to the Active state; the one that failed is switched to Standby.

Step 2. Register a Watcher.
Whether or not creating the ephemeral node succeeded, ActiveStandbyElector then registers a Watcher with ZooKeeper to listen for state-change events on that node, mainly its NodeDeleted event.

Step 3. Automatically trigger a new election.
When the HealthMonitor detects that the NameNode's state is abnormal, ZKFailoverController actively deletes the ephemeral node it created on ZooKeeper. The watcher registered by the Standby NameNode then receives the NodeDeleted event for that node and immediately starts the create-ephemeral-node flow again; if the create succeeds, that node is elected as the Active NameNode.
Of course, if the machine hosting the Active NameNode goes down entirely, the ephemeral node is removed automatically thanks to ZooKeeper's ephemeral-node semantics, which likewise triggers an active/standby switch.

2. Preventing split brain (two Actives):
A ZooKeeper "false death" may lead to two Active NameNodes both serving clients, in which case data consistency can no longer be guaranteed.

ActiveStandbyElector prevents split brain with a fencing mechanism.

When a NameNode wins the election and successfully creates the ephemeral ActiveStandbyElectorLock node, it also creates a persistent node named ActiveBreadCrumb that stores the NameNode's address information. Normally, when the ActiveStandbyElectorLock node is deleted, ActiveBreadCrumb is actively deleted along with it.

If, however, the ZooKeeper session is closed because of an abnormal situation, the ephemeral ActiveStandbyElectorLock node is removed but the persistent ActiveBreadCrumb node is not. When a new NameNode later wins the election, it finds the ActiveBreadCrumb node left behind by the old NameNode and notifies ZKFC, which fences the previous Active (for example, forcing it to Standby) before the new one takes over.
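To make the ephemeral-lock election in Steps 1-3 concrete, here is a minimal, illustrative sketch using the plain ZooKeeper Java client. The znode path, connection string, and class name are invented for the example; in real HDFS this logic lives inside ActiveStandbyElector/ZKFC, not in user code, and fencing is handled separately.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch of the "create ephemeral lock node + watch for NodeDeleted + retry" pattern.
// Assumes the parent znode /election already exists; path and ids are hypothetical.
public class LeaderElectionSketch implements Watcher {
    private static final String LOCK_PATH = "/election/ActiveStandbyElectorLock";
    private final ZooKeeper zk;
    private final String myId;

    public LeaderElectionSketch(String connectString, String myId) throws Exception {
        this.zk = new ZooKeeper(connectString, 5000, this);
        this.myId = myId;
    }

    /** Try to become Active by creating the ephemeral lock node; otherwise watch it. */
    public void elect() throws Exception {
        try {
            zk.create(LOCK_PATH, myId.getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println(myId + " is Active");         // ZKFC would now transition the NN to Active
        } catch (KeeperException.NodeExistsException e) {
            System.out.println(myId + " is Standby, watching the lock node");
            Stat stat = zk.exists(LOCK_PATH, this);           // registers a NodeDeleted watcher
            if (stat == null) elect();                        // lock vanished in between, retry
        }
    }

    @Override
    public void process(WatchedEvent event) {
        // When the Active's session dies, its ephemeral node is deleted and this event fires.
        if (event.getType() == Event.EventType.NodeDeleted) {
            try { elect(); } catch (Exception e) { e.printStackTrace(); }
        }
    }

    public static void main(String[] args) throws Exception {
        new LeaderElectionSketch("localhost:2181", "nn1").elect();
        Thread.sleep(Long.MAX_VALUE);                         // keep the ZooKeeper session alive
    }
}
```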
3. What happens when an HDFS DataNode dies, and what changes on the NameNode?
  1. The remaining 2 healthy DataNodes send an acknowledgement to the NameNode, and the NameNode removes the failed node from the original pipeline;
  2. Data continues to be written to the two healthy nodes through the newly constructed pipeline;
  3. The next time HDFS uses this block, it detects that only 2 replicas exist and replicates it again so that the replication factor returns to 3, and the block's state is set back to normal. (A small sketch for inspecting replication follows below.)
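As a side note (not part of the original answer), the replication factor and current block placement can be observed from any client through the public FileSystem API; the file path below is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: print a file's replication factor and the DataNodes holding each block.
// After a DataNode dies, re-running this shows the NameNode restoring the factor.
public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());   // picks up the cluster config on the classpath
        Path file = new Path("/data/example.txt");              // hypothetical file

        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication factor: " + status.getReplication());

        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block@" + loc.getOffset() + " on " + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}
```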
4. The HDFS EditLog has been written, but the NameNode's metadata in memory has not been updated; what about the data inconsistency?

The Secondary NameNode merges the EditLog changes into the fsimage. When the NameNode starts, it first loads the fsimage file into memory to build the file-system image, and then replays the remaining EditLog entries, so the in-memory state catches up with what was logged.

2. HDFS read and write process

1. Introduction to HDFS, features, and file formats that can be stored

1) Introduction: HDFS, the Hadoop Distributed File System, is a distributed file system.

2) Features:

  • Massive data: HDFS scales horizontally, and the files it stores can reach PB scale or beyond.
  • High fault tolerance: multiple replicas of the data are kept, and lost replicas are restored automatically.
  • Commodity hardware: reliability, safety, and high availability are built into the design, so the hardware requirements are low.

3) Limitations:

  • Higher latency. To process large data sets, HDFS pays for high throughput with high latency; it cannot serve accesses in under roughly 10 milliseconds. (HBase can compensate for this.)
  • Inefficient with large numbers of small files. All file-system metadata (about 150 bytes per object) is held in the NameNode's memory, so the number of files is limited.
  • No support for concurrent writers or in-place modification. Multiple users cannot operate on the same file at once, and writes can only go to the end of a file, i.e., appends.

4) File formats that can be stored:

  • Row-oriented storage: SequenceFile, MapFile, Avro data file
  • Column-oriented storage: RCFile, ORCFile, Parquet
2. HDFS read and write process (*3)

Read the file:

  1. The client calls the open method to open the file.
  2. The NN returns the block locations (DNs) ordered by network topology, closest to the client first.
  3. The client reads the block data directly from the DN and verifies the checksum; the data stream never passes through the NN.
  4. After one block has been read, the next block is read.
  5. After all blocks have been read, the file is closed.
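For reference, the read flow above is only a few lines on the client side; the per-block DataNode selection and checksum verification happen inside the HDFS client library. A minimal sketch (the file path is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Client-side view of the read flow: open -> stream the blocks -> close.
public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = fs.open(new Path("/data/example.txt"))) { // step 1: open
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {   // steps 2-4: blocks are fetched DataNode by DataNode
                System.out.write(buf, 0, n);
            }
            System.out.flush();
        }                                       // step 5: close
        fs.close();
    }
}
```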

Write file:

  1. The client calls the create method, and the NN performs various checks for creating a new file (for example, whether the file already exists and whether the client has permission).
  2. If the checks pass, the client starts writing data to the DNs (DataNodes). The file is split into blocks according to the block size, 128 MB by default (configurable), and the DataNodes form a pipeline. The client writes data to an output-stream object; on the wire it is transmitted in packets, which are smaller than a block and are further split into chunks, each chunk carrying its own checksum information.
  3. Each DataNode returns a confirmation after a whole block has been written; it does not return a confirmation for every packet that is written successfully.
  4. After the data has been written, the file is closed.
Supplement:
1. For the detailed write flow, see reference 14.
2. When the DNs send the completion signal to the NN depends on whether the cluster favors strong consistency or eventual consistency:
 - Strong consistency: the NameNode is notified only after all DataNodes have finished writing (the usual HDFS behavior);
 - Eventual consistency: any DataNode may notify the NameNode as soon as it has finished writing.
A minimal client-side write sketch follows.
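The sketch below shows the write flow from the client side, under the same caveat as the read sketch: splitting into blocks, packets, and chunks, and the DataNode pipeline, are all handled by the client library below this API. The path is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Client-side view of the write flow: create -> write through the pipeline -> close.
public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path out = new Path("/data/output.txt");              // hypothetical target file
        try (FSDataOutputStream os = fs.create(out, true)) {  // step 1: NN checks existence/permissions
            os.writeBytes("hello hdfs\n");                    // steps 2-3: data flows through the DN pipeline
        }                                                      // step 4: close, the file is finalized on the NN
        fs.close();
    }
}
```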

Follow-up 1: What happens if there are too many small files when writing?

  1. General solutions:
  • 1) Hadoop Archive: a file archiving tool that packs small files into HAR files stored in HDFS blocks; it reduces NN memory usage while still allowing transparent access to the files.
  • 2) SequenceFile: a container of binary key/value pairs; with the small file's name as the key and its content as the value, a large number of small files can be merged into one large file (see the sketch below).
  2. Underlying (HDFS-level) solutions:
  • 1) HDFS-8998: the DN sets aside a dedicated small-file area, and one block is filled up before the next one is used.
  • 2) HDFS-8286: move the metadata from the NN's memory into a third-party key/value store.
  • 3) HDFS-7240: Apache Hadoop Ozone, a Hadoop sub-project created to extend HDFS.
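As an illustration of the SequenceFile option in 2) above, a sketch that packs a directory of small files into one SequenceFile, with the file name as key and the raw bytes as value (directory and file names are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Merge many small files into a single SequenceFile: key = file name, value = file content.
public class SmallFileMerger {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path srcDir = new Path("/data/small-files");   // hypothetical input directory
        Path target = new Path("/data/merged.seq");    // hypothetical output file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(target),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus st : fs.listStatus(srcDir)) {
                if (st.isDirectory()) continue;
                byte[] content = new byte[(int) st.getLen()];
                try (FSDataInputStream in = fs.open(st.getPath())) {
                    IOUtils.readFully(in, content, 0, content.length);
                }
                writer.append(new Text(st.getPath().getName()), new BytesWritable(content));
            }
        }
    }
}
```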

Follow-up 2: How is correctness guaranteed when writing files?

  1. WAL (write-ahead log): write the log first, then memory. The EditLog records the latest write operation; the operation is written to the EditLog first, so if a failure happens afterwards, it can be recovered from that record.

  2. The client writes packets to the DNs through the pipeline allocated by the NN, and each packet is passed directly on to the second and third DN in the pipeline. Every chunk inside a packet carries its own checksum information (see the toy sketch below).
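The per-chunk checksum idea in 2. can be illustrated with a toy example: split a buffer into 512-byte chunks (the HDFS default for dfs.bytes-per-checksum) and attach a CRC to each. HDFS itself uses CRC32C through its own DataChecksum class; plain java.util.zip.CRC32 is used here only because it ships with the JDK.

```java
import java.util.zip.CRC32;

// Toy illustration of per-chunk checksums, the way each 512-byte chunk in a packet
// carries its own check value that the receiving DataNode verifies.
public class ChunkChecksumDemo {
    static final int BYTES_PER_CHECKSUM = 512;   // HDFS default chunk size

    public static void main(String[] args) {
        byte[] data = "some packet payload ... ".repeat(100).getBytes();
        CRC32 crc = new CRC32();
        for (int off = 0; off < data.length; off += BYTES_PER_CHECKSUM) {
            int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
            crc.reset();
            crc.update(data, off, len);
            System.out.printf("chunk@%d len=%d crc=%08x%n", off, len, crc.getValue());
        }
    }
}
```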

Follow-up 3: What is zero-copy?
Answer: "Zero-copy" is a system-call mechanism that skips the copy into the user-space buffer: a direct mapping is established between the disk data and memory, so the data is no longer copied through the user buffer.
The number of context switches drops to 2, which can roughly double throughput.
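One concrete way to see zero-copy from Java is FileChannel.transferTo, which maps to sendfile on Linux; HDFS DataNodes take a similar path when streaming block data to clients. A minimal file-to-file sketch (file names are placeholders; a socket channel target works the same way):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Zero-copy sketch: transferTo lets the kernel move the bytes (sendfile on Linux),
// so the data never passes through a user-space buffer.
public class ZeroCopyDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel src = FileChannel.open(Paths.get("input.bin"), StandardOpenOption.READ);
             FileChannel dst = FileChannel.open(Paths.get("output.bin"),
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long pos = 0, size = src.size();
            while (pos < size) {
                pos += src.transferTo(pos, size - pos, dst);  // kernel-to-kernel copy
            }
        }
    }
}
```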

3. Shuffle

The MapReduce computing model is mainly composed of three stages: Map, Shuffle, and Reduce.

  • Map: reads the input data, does preliminary processing, and outputs intermediate results as <key, value> pairs;
  • Shuffle: sorts and merges the intermediate results by partition and key, and hands them to the reduce tasks;
  • Reduce: does the final processing of all values that share the same key and writes the result to a file.
1. Shuffle process


Map side: partition, sort, and merge the map results.

  • 1) Partitioner: the Partitioner interface hashes the key and takes the modulus by the number of reduce tasks, which determines the target reduce task. The data (serialized key/value together with its partition result) is then written into an in-memory buffer; the buffer's job is to collect map output in batches and reduce disk IO.
  • 2) Spill (sort & combine): when the map task's output reaches the configured threshold of the buffer, a spill to disk starts. Before writing, the data is sorted first by partition and then by key. The combiner is a map-side reduce pass; it shrinks the intermediate data transferred between the map and reduce tasks and reduces the number of partition index records.
  • 3) Merge on disk: when the data volume is large, several spill files are produced. After the map task finishes processing, all of the intermediate data files are merged and sorted once more, so in the end each map task produces a single intermediate data file.

Reduce side: pull results from the map side and merge them continuously to produce the final result:

  • copy: the Fetcher threads obtain the map-task output and copy it locally. The data is first kept in an in-memory buffer and written to disk when usage reaches the threshold.
  • merge & sort: the data pulled from the individual maps is merged into larger ordered files. Merging and sorting go together, and this phase overlaps with the copy phase rather than being completely separate.
  • reduce: the merge finally produces one input (most of it on disk); once the reducer's input files are fixed, the whole shuffle phase ends. (The Reducer then runs and writes its result to HDFS.)

Summary: the Map side does partial processing and scatters the data, while the Reduce side aggregates it. (The sketch below shows where the related knobs are configured.)
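The shuffle knobs described above (combiner, partitioner, number of reduce tasks) are all configured on the Job object. Below is a WordCount-style driver sketch with placeholder input/output paths; the mapper and reducer are the usual word-count classes and are included only to keep the example self-contained.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

// WordCount-style driver showing where the shuffle knobs are configured.
public class ShuffleKnobsDriver {

    public static class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);               // map output enters the shuffle buffer
            }
        }
    }

    public static class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));   // the same logic is reused as the combiner
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-shuffle-demo");
        job.setJarByClass(ShuffleKnobsDriver.class);
        job.setMapperClass(WcMapper.class);
        job.setCombinerClass(WcReducer.class);          // runs on the map side during spill/merge
        job.setPartitionerClass(HashPartitioner.class); // default: hash(key) mod numReduceTasks
        job.setReducerClass(WcReducer.class);
        job.setNumReduceTasks(3);                       // number of partitions == number of reducers
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/in"));    // placeholder paths
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Reusing the reducer as the combiner only works here because summing counts is associative; in general the combiner must not change the final result.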

Follow-up 1: How many sorts are there in the whole process?
Answer: 3 times.

  • 1) Quick sort. In the map stage, when the ring buffer spills to disk, the spilled file is partitioned and sorted by key, giving order within each partition.
  • 2) Merge sort. In the map stage, while the spill files are being combined, the small spilled files are merged and sorted.
    Since the spill files were already sorted the first time, the merge only needs one more sorting pass to make the single output file ordered overall.
  • 3) Merge sort. In the reduce stage, the map-task output files are copied to the reduce task and merged.
    Since they were sorted in the second pass, only one more sorting pass is needed to make the merged input ordered overall.

Follow-up 2: What is the difference between combiner and reduce?
Answer: the combiner runs on the map side and processes the data of a single task; it cannot work across map tasks. Reduce can receive and process the output of multiple map tasks.

Follow-up 3: What is the difference between MapReduce and Spark's Shuffle?

  • 1) In Hadoop, as soon as one Map finishes, Reduce can start fetching its data without waiting for all map tasks to complete; in Spark, fetching only starts after the parent stage has completed, i.e., after all of the parent stage's map operations are done.
  • 2) Hadoop's shuffle is sort-based, so both the map output and the reduce input are ordered within each partition; Spark does not require this by default.
  • 3) Hadoop's Reduce waits for the fetch to finish before passing the data into the reduce function for aggregation, whereas Spark aggregates while it fetches.
    ...
2. How do the Map side and Reduce side correspond, how are their numbers determined, and how is the number of reducers set?
  • 1) Number of maps: usually determined by the block size of the input data (i.e., the total number of blocks of the input files) in the Hadoop cluster. A normal degree of parallelism is roughly 10-100 maps per node.

  • 2) Number of reduces: normally 0.95 or 1.75 × (number of nodes × number of CPU cores per node).
    The number of reducers can be set manually with job.setNumReduceTasks.

  • 3) How the numbers correspond: the partition is determined by the PartitionerClass. The default HashPartitioner takes the key's hash value modulo reducerNum, so the number of partitions is determined by, and equal to, the number of reducers.
    If the configured number of reducers is smaller than the number of partitions the partitioner produces, the job fails with an error;
    if it is larger, some reducers simply receive no input, which does not affect the result. (An equivalent partitioner sketch follows.)
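The default HashPartitioner behavior mentioned in 3) boils down to a modulo over the reducer count; an equivalent, illustrative partitioner (not the Hadoop source) looks like this and would be registered with job.setPartitionerClass:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Equivalent of the default behavior: partition = (hash & MAX_VALUE) % numReduceTasks,
// so the number of partitions always equals the configured number of reduce tasks.
public class ModuloPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```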
3. Take WordCount as an example to illustrate the execution mechanism of MR.

……

4. Other

1. The difference between ES and HDFS

ES: a distributed search engine; stores documents in JSON, offers a RESTful interface, supports full-text search, and is built on an inverted index.
HDFS: together with the tools built on top of it, it is better suited to heavy, complex processing and analysis of massive data.
...

2. Counting over massive data (single machine). If the large file is hashed into different small files and a small file still cannot hold all the data for a certain key, what should you do?

……

2. References

1. NameNode, DataNode, and Secondary NameNode in Hadoop
2. HDFS explained (1): how the NameNode and DataNode work
3. The role of the Secondary NameNode
4. What happens when a DataNode fails while Hadoop is writing a file
5. Overview of the HDFS NameNode HA implementation
6. Big data: the Hadoop HDFS HA architecture
7. HDFS NameNode high availability (interview version)
8. Introduction to Hadoop HDFS
9. Main features and architecture of HDFS
10. Gang Ge talks architecture (7): big data file storage
11. Understanding Hadoop file formats
12. Hadoop small file handling
13. [Hadoop] the problem of large numbers of small files and its solutions
14. The HDFS file write process (detailed, must read)
15. Zero copy
16. Understand "zero copy" with 8 original figures
17. In-depth analysis of Linux IO principles and several zero-copy implementations
18. The shuffle process in Hadoop
19. The MapReduce shuffle process in detail
20. The shuffle process in Hadoop in detail
21. Spark in detail (04): the Spark shuffle process
22. How many sorts are there in a MapReduce program? Three
23. How many sorts does the MapReduce shuffle go through?
24. Analysis of the differences between the Hadoop and Spark shuffle processes
25. Spark topics (2): Hadoop shuffle vs. Spark shuffle
26. The difference between MapReduce shuffle and Spark shuffle: this article is enough
27. The relationship between Mapper, Partition, and the number of Reducers in MapReduce
28. The relationship between the number of maps and reduces in MapReduce
29. Advantages, disadvantages, and application scenarios of the four popular databases MongoDB, Elasticsearch, Redis, and HBase
30. The characteristics and differences of MySQL, HBase, and ES
31. Comparison of Elasticsearch, MongoDB, and Hadoop
32. Taking WordCount as an example, describe the MR execution process in detail


Origin blog.csdn.net/HeavenDan/article/details/112310648