Big Data Series - Hadoop (2): HDFS Read and Write Process

Before introducing the HDFS read and write processes, let's first look at the Block replica placement strategy.

Block replica placement strategy

  • First replica: placed on the DataNode that uploads the file; if the write comes from outside the cluster, a random node whose disk is not too full and whose CPU is not too busy is chosen.
  • Second replica: placed on a node in a different rack from the first replica.
  • Third replica: placed on another node in the same rack as the second replica.
  • Further replicas: placed on random nodes.

[Figure: HDFS replica placement strategy]
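To see this policy in action, the client API can ask the NameNode where each Block's replicas actually landed. A minimal sketch in Java (the file path is hypothetical, and fs.defaultFS is assumed to be configured in core-site.xml):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/zhangyongli/fileA"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // Ask the NameNode for the replica locations of every block
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```

With the default replication factor of 3, each printed line should list three hosts, distributed across racks as described above.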

HDFS write process

[Figure: HDFS write process]

  • The client sends a request to the NameNode: it wants to store file A. The NameNode records a marker for the file, identified as A_copy (file not yet available).
  • Following the replica placement strategy, the NameNode returns a list of three DataNodes on which to place the replicas, already sorted by default.
  • The client actively connects to the nearest DataNode (call it DN1); DN1 then connects to DN2, and DN2 connects to DN3, forming a serial pipeline.
  • The client reads the source file and cuts each Block into smaller packets:
    • First: the client sends the first packet of the Block to DN1.
    • Second: the client sends the second packet to DN1; at the same time, DN1 forwards the first packet to DN2.
    • Third: the client sends the third packet to DN1; at the same time, DN1 forwards the second packet to DN2, and DN2 forwards the first packet to DN3.
    • And so on.

(Cutting a Block into smaller packets is what makes the transfer times overlap. Without cutting, e.g. sending the whole 64 MB at once, the data would wait while being sent to DN1, wait again while DN1 sends it to DN2, and wait again for DN3, wasting time at every hop. A further benefit: if more nodes are added to the pipeline, the total time is barely affected.)
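To put rough, assumed numbers on this: suppose every link carries 100 MB/s, a Block is 64 MB, and packets are 64 KB. Sending the whole Block link by link would take about 3 × 0.64 s ≈ 1.92 s for three replicas. With packet pipelining the three transfers overlap, so the total is roughly one Block time plus two packet times: 0.64 s + 2 × 0.00064 s ≈ 0.64 s. Each additional replica in the pipeline adds only another ~0.64 ms, which is why adding nodes has little effect on the total time.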

  • Finally, the DataNodes report to the NameNode via heartbeat whether the file transfer completed successfully, and the NameNode fills in the Block location information in its metadata.
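From the client's point of view, all of the pipeline mechanics above are hidden inside the output stream returned by the HDFS client library. A minimal write sketch in Java (the local and HDFS paths are hypothetical):

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Local source file and HDFS destination (hypothetical paths)
        try (InputStream in = new FileInputStream("/tmp/fileA");
             FSDataOutputStream out = fs.create(new Path("/user/zhangyongli/fileA"))) {
            // The output stream cuts the data into packets and pushes them
            // down the DN1 -> DN2 -> DN3 pipeline; we only copy bytes into it.
            IOUtils.copyBytes(in, out, 4096, false);
        }
        fs.close();
    }
}
```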

HDFS read process

[Figure: HDFS read process]

  • The client sends a request to the NameNode; the NameNode looks up the file's metadata and returns it to the client (for example, file A was cut into five Blocks, and the metadata records Block1: DN1, DN2, DN3; Block2: DN1, DN4, DN5; and so on).
  • The client requests the Block data directly from the DataNodes (nearest replica first).
  • After all Blocks have been downloaded locally, the client performs MD5 verification against each Block's meta-information. If every Block is correct and undamaged, the Blocks are stitched together and the original file is restored.
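On the client side this whole sequence is again wrapped in a single stream: opening the file fetches the metadata from the NameNode, and reading pulls each Block from a nearby DataNode, with checksum verification happening inside the client as the data arrives. A minimal read sketch (hypothetical path):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Open the HDFS file; the stream fetches each Block from the
        // nearest DataNode and verifies checksums as it reads.
        try (FSDataInputStream in = fs.open(new Path("/user/zhangyongli/fileA"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```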

HDFS file permissions

  • Similar to Linux file permissions:
    • r: read; w: write; x: execute
    • x is ignored for files; for a directory it indicates whether access to its contents is allowed
  • If the Linux user zhangyongli creates a file using a Hadoop command, the owner of that file in HDFS is zhangyongli.
  • The purpose of HDFS permissions is to stop good people from doing the wrong thing, not to stop bad people from doing bad things. HDFS trusts you: you tell me who you are, and I take you to be who you say you are.

Explanation:

  • Stopping good people from doing the wrong thing: for example, with two users A and B, user A creates file X and user B creates file Y; user B cannot delete user A's file X.
  • Not stopping bad people from doing bad things: if one of the two users A and B is malicious, they can install a fresh Linux system, create users A and B on it, deploy the Hadoop client program, and then, as user A from the new system, ask the NameNode to delete file X. Because the NameNode passively trusts the reported identity, Kerberos needs to be integrated later to prevent this kind of operation.
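A minimal sketch of exercising this permission model through the client API (the path, user, and group names are hypothetical; changing the owner normally requires superuser rights, and without Kerberos the NameNode simply trusts whatever identity the client reports):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissions {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/zhangyongli/fileX"); // hypothetical path
        // rw-r--r--: the owner may write, others may only read
        fs.setPermission(file, new FsPermission((short) 0644));
        // Changing owner/group normally requires superuser privileges
        fs.setOwner(file, "zhangyongli", "supergroup");

        FileStatus status = fs.getFileStatus(file);
        System.out.println(status.getPermission() + " "
                + status.getOwner() + " " + status.getGroup());
        fs.close();
    }
}
```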
    (When reposting, please credit the source: http://www.cnblogs.com/zhangyongli2011/. If anything is wrong, please leave a comment, thank you.)
