Big Data - interview questions summary

Summary ...

https://juejin.im/post/5b5ac91051882519a62f72e5
https://zhuanlan.zhihu.com/p/35591010


  1. HDFS file upload and read process (a client-side code sketch follows this list)
    a. The Client sends an upload request to the NameNode
    b. The NameNode returns to the Client a list of hosts that can store the data: host1, host2 ...
    c. The Client splits the file into blocks (the default block size is 128 MB)
    d. The Client streams Block1 to host1; while it is being written,
    host1 (a DataNode) forwards Block1 to host2 (a DataNode),
    and host2 forwards Block1 to host3 (a DataNode)
    e. host1 (a DataNode) notifies the Client that the transfer is finished,
    and host1, host2 and host3 report the stored block to the NameNode
    f. The Client tells the NameNode that the block is finished, and the process repeats for the remaining blocks
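
The write path above is what happens under the hood of the Hadoop FileSystem client API. A minimal client-side sketch, assuming a configured Hadoop classpath (the local and HDFS paths are made-up examples):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml (fs.defaultFS, dfs.blocksize = 128 MB by default)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The client splits the file into 128 MB blocks and streams each block
        // through the DataNode pipeline (host1 -> host2 -> host3) described above.
        fs.copyFromLocalFile(new Path("/tmp/local-data.txt"),   // hypothetical local file
                             new Path("/user/demo/data.txt"));  // hypothetical HDFS path

        fs.close();
    }
}
```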

  2. When uploading a file to HDFS, what should be done if one block is suddenly corrupted?

I did not find a definitive answer.
If one replica of a block goes bad, then as long as other replicas of that block exist, HDFS automatically detects the corruption (via checksums) and restores the missing replica.
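
Corruption is detected through the per-block checksums kept by the DataNodes; when a replica fails verification, the NameNode schedules re-replication from a healthy copy. A small sketch for inspecting a file's replication factor and replica locations with the standard FileSystem API (the path is a made-up example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReplicas {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/data.txt")); // hypothetical path

        System.out.println("replication factor: " + status.getReplication());

        // One BlockLocation per block, listing the DataNodes that hold a replica of it.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(loc);  // prints the offset, length and hosts of this block
        }
        fs.close();
    }
}
```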


  3. The role of the NameNode
    3.1 Namespace management
    3.2 Metadata management
    3.3 Managing the block replication strategy: the default replication factor is 3
    3.4 Handling client read and write requests and assigning tasks to DataNodes

  4. The role of the DataNode
    4.1 Slave (worker) node
    4.2 Stores the data blocks together with their checksums
    4.3 Executes the read and write operations sent by clients
    4.4 Reports its operating state to the NameNode through a periodic heartbeat (every 3 seconds by default), together with its block list
    4.5 When the cluster starts, the DataNode provides its block information to the NameNode

5. What operations does the NameNode perform at startup?
A:
It loads the FsImage and builds the entire namespace in memory, writing every BlockID into the BlockMap; at this point the DataNode list for each BlockMap entry is still empty. Once the FsImage has been loaded, the HDFS directory structure in memory is fully initialized.
The missing DataNode information has to come from the block reports sent by the DataNodes, so after the FsImage is loaded the NameNode's RPC handling enters a waiting state (safe mode) until all DataNodes have sent their block reports.
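
While the NameNode is still waiting for block reports it stays in safe mode and rejects writes. A small sketch for checking that state from a client, assuming the HDFS-specific DistributedFileSystem API of Hadoop 2.x+:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SafeModeCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Only HDFS (DistributedFileSystem) exposes the safe-mode state.
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // true while the NameNode is still collecting block reports after startup
            System.out.println("in safe mode: " + dfs.isInSafeMode());
        }
        fs.close();
    }
}
```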


6. Hadoop job submission workflow (a driver-code sketch follows the steps)
A:
Step one:
The client submits the MR jar package through JobClient
(submission command: hadoop jar)

Step two:
JobClient holds a proxy object of the RM (ResourceManager) and sends it an RPC (Remote Procedure Call) request announcing that a job is starting; the RM then returns a JobID and a path for storing the jar package to the client

Step three:
The client splices a new HDFS path using the returned jar storage path as the prefix and the JobID as the suffix (path = address on HDFS + JobID), and then stores the jar package on HDFS through the FileSystem API, by default with 10 replicas (this involves the usual NameNode and DataNode operations)

Step four:
The client submits the task by returning the job description information (the jar storage path spliced with the JobID) to the RM via RPC

Step five:
The RM initializes the task and then puts it into a scheduler

Step six:
The RM reads the file to be processed on HDFS and splits it; each split corresponds to one MapTask, and the amount of data determines how many mappers and reducers are started

Step seven:
The NodeManagers receive tasks (task descriptions) from the ResourceManager through the heartbeat mechanism

Step eight:
The NodeManager that receives a task downloads the jar package and configuration files from HDFS

Step nine:
The NodeManager starts the corresponding child process YarnChild, which runs MapReduce, i.e. a MapTask or a ReduceTask

Step ten:
Map reads data from HDFS and passes its output to Reduce; Reduce writes the output data back to HDFS
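
The hadoop jar submission in step one boils down to a driver class like the sketch below; waitForCompletion() is what triggers steps two to five. The class name and paths are hypothetical, and the identity Mapper/Reducer stand in for a real implementation:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DemoJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "demo-job");
        job.setJarByClass(DemoJobDriver.class);   // tells the framework which jar to ship

        // Identity map and reduce keep the sketch self-contained;
        // a real job plugs in its own Mapper / Reducer implementations.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);  // matches TextInputFormat's key/value types
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Ships the jar to HDFS, obtains a JobID from the RM and blocks
        // until the map and reduce tasks have finished.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, this would be submitted with something like "hadoop jar demo-job.jar DemoJobDriver" (jar and class names are placeholders).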


How are InnoDB transactions implemented?


1. HDFS high-availability (HA) mechanism

  1. Active/standby switchover between the Active NN and the Standby NN
  2. Using QJM to keep the metadata highly available
    QJM mechanism: as long as a quorum (a majority) of the JournalNodes acknowledge an operation, the operation is considered successful (see the sketch after this list)
    QJM is the shared storage system
  3. Using ZooKeeper to elect the Active node
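
The quorum rule itself is just majority logic. A toy illustration (not the actual QJM code), assuming the usual deployment of an odd number of JournalNodes such as 3 or 5:

```java
public class QuorumRule {
    // An edit-log write is considered durable once a majority
    // of the JournalNodes have acknowledged it.
    static boolean quorumReached(int acks, int journalNodes) {
        return acks > journalNodes / 2;
    }

    public static void main(String[] args) {
        System.out.println(quorumReached(2, 3)); // true: 2 of 3 JournalNodes acked
        System.out.println(quorumReached(1, 3)); // false: the write is not yet durable
    }
}
```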
    

2. Please describe the scheduling strategies that can be used on YARN in the TDH platform, and explain the characteristics of each scheduling policy (a small configuration-lookup sketch follows the list).

  1. FIFO scheduler
    All tasks go into a single queue; tasks at the front of the queue get resources first, and those at the back have to wait
  2. Capacity scheduler
    Core idea: make a budget in advance, and share cluster resources under the guidance of that budget
  3. Fair scheduler
    Core idea: all running jobs share the cluster resources fairly; resources are rebalanced dynamically as jobs start and finish
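
Which of the three schedulers is active is controlled by the yarn.resourcemanager.scheduler.class property in yarn-site.xml. A small lookup sketch, assuming the Hadoop YARN client libraries are on the classpath (the queue layouts themselves live in capacity-scheduler.xml or fair-scheduler.xml, not shown here):

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SchedulerCheck {
    public static void main(String[] args) {
        // Reads yarn-site.xml (and yarn-default.xml) from the classpath.
        YarnConfiguration conf = new YarnConfiguration();

        // yarn.resourcemanager.scheduler.class selects the FIFO, Capacity or Fair scheduler;
        // the Apache Hadoop default is the CapacityScheduler.
        System.out.println(conf.get(YarnConfiguration.RM_SCHEDULER));
    }
}
```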

3. BulkLoad data loading

Meaning: because HBase stores its data as HFile files on HDFS, we can bypass the HBase API, process the data directly into HFile files, and then load them into HBase, completing large-scale data loading quickly

The basic flow of HBase BulkLoad (a code sketch of the Load step follows the list):

  1. Extract: extract the data from the data source
    - to export data from MySQL, run the mysqldump command

  2. Transform: use MapReduce to convert the data files into HFile files
    - for TSV or CSV files, use HBase's ImportTsv tool to convert them into HFile files - a folder of HFile files is created for each region of the output
    - the available disk space on HDFS should be at least twice the size of the original input file; for example, for a 100 GB mysqldump export file, reserve no less than 200 GB of HDFS disk space; the original input file can be deleted after the job finishes

  3. Load: load the HFile files into HBase
    - use HBase's CompleteBulkLoad tool to move the HFile files into the appropriate directories of the HBase table and complete the load
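
A sketch of the Load step in Java, assuming the HBase 1.x client API (org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles; HBase 2.x deprecates this class in favour of BulkLoadHFiles). The table name and HFile directory are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadStep3 {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName name = TableName.valueOf("demo_table");          // hypothetical table

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin();
             Table table = conn.getTable(name);
             RegionLocator locator = conn.getRegionLocator(name)) {

            // Moves the HFiles produced in step 2 (e.g. ImportTsv with -Dimporttsv.bulk.output)
            // into the table's region directories; this is what the CompleteBulkLoad tool does.
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(new Path("/tmp/hfile_output"), admin, table, locator);  // hypothetical dir
        }
    }
}
```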


