Getting Started with Hadoop - A First Look at Hadoop

One. What is Hadoop

Hadoop is recognized as the industry-standard open-source Big Data software; it provides the power to process huge amounts of data in a distributed environment. Almost all major vendors develop tools, open-source software, commercial products, and technical services around Hadoop. This year, large IT companies such as EMC, Microsoft, Intel, Teradata, and Cisco have significantly increased their investment in the Hadoop area.

 

Two. What Hadoop can do

Hadoop excels at log analysis. Facebook uses Hive for log analysis; as early as 2009, 30% of Facebook's non-programmers were using HiveQL for data analysis. The custom filters in Taobao search also use Hive. Pig can be used for advanced data processing, including the "people you may know" feature on Twitter and LinkedIn, and it can achieve a recommendation effect similar to Amazon.com's collaborative filtering; Taobao's product recommendations work the same way. At Yahoo!, 40% of Hadoop jobs are run with Pig, including spam identification and filtering as well as user-feature modeling. (Updated August 25, 2012: Tmall's recommendation system uses Hive, with a small amount of experimentation with Mahout!)

 

Three. The Hadoop core

1. HDFS (Hadoop Distributed File System): the distributed file system

2. YARN (Yet Another Resource Negotiator): the resource management scheduler

3. MapReduce: the distributed computing framework
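
To make the MapReduce component concrete, here is the classic word-count example as a minimal sketch against the org.apache.hadoop.mapreduce API. The input and output paths come from the command line and are assumptions about your cluster layout, not values from this article.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

It would typically be packaged into a jar and launched with something like hadoop jar wordcount.jar WordCount <input> <output>.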

 

Four. HDFS architecture

Master-slave structure

• Master node: the NameNode

• Slave nodes, of which there are many: the DataNodes

The NameNode is responsible for:

• receiving user operation requests

• maintaining the directory structure of the file system

• managing the relationship between files and blocks, and between blocks and DataNodes

The DataNode is responsible for:

• storing files

• files are divided into blocks and stored on disk

• to ensure data safety, files have multiple replicas

 

The Secondary NameNode is responsible for:

• merging the fsimage and edits files to update the NameNode's metadata
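
As a small illustration of "receiving user operation requests", the sketch below lists a directory through the public HDFS client API. Listing is a pure metadata operation, so only the NameNode is contacted; the NameNode URI hdfs://namenode:9000 and the path are assumptions, not values from this article.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The NameNode address is an assumption; substitute your cluster's URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // A metadata request served by the NameNode: no DataNode is involved
        // because no file content is read.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```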

 

Five. Hadoop features

Scalable: reliably stores and processes petabytes (PB) of data.

Economical: data can be distributed and processed across clusters built from ordinary servers; these clusters can total up to thousands of nodes.

Efficient: by distributing the data, Hadoop can process it in parallel on the very nodes where the data resides, which makes processing very fast.

Reliable: Hadoop automatically maintains multiple copies of the data and can automatically redeploy computing tasks after a failure.

 

Six. NameNode

1. Introduction

The NameNode is the management node of the entire file system. It maintains the file directory tree of the whole file system, the meta-information of each file and directory, and the list of data blocks corresponding to each file. It also receives user operation requests.

Its files include:

fsimage: the metadata image file; it stores the NameNode's in-memory metadata for a certain period of time.

edits: the operation log file.

fstime: saves the time of the last checkpoint.

 

2. NameNode working characteristics

The NameNode always keeps the metadata in memory for handling "read requests". When a "write request" arrives, the NameNode first writes the edit log to disk, that is, it writes to the edits log file; only after that write returns successfully does it modify the in-memory metadata and return to the client.

Hadoop maintains an fsimage file, which is a mirror of the NameNode's metadata. However, the fsimage is not kept consistent with the NameNode's in-memory metadata at every moment; instead, it is updated from time to time by merging in the edits files. The Secondary NameNode is the component that merges the fsimage and edits files to update the NameNode's metadata.
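
The "write edits to disk first, then modify memory" rule is a write-ahead log. Below is a minimal, generic sketch of that idea in plain Java; it is not the NameNode's actual code, and every name in it (TinyWriteAheadLog, the SET record format) is invented for illustration.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the write-ahead-log rule described above;
// the real NameNode implementation is far more involved.
public class TinyWriteAheadLog {
    private final Map<String, String> metadataInMemory = new HashMap<>();
    private final FileWriter edits;

    public TinyWriteAheadLog(String editsPath) throws IOException {
        this.edits = new FileWriter(editsPath, true); // append mode, like an edits log
    }

    public synchronized void write(String key, String value) throws IOException {
        // 1. Persist the operation to the edits log first...
        edits.write("SET " + key + "=" + value + "\n");
        edits.flush();
        // 2. ...and only then apply it to the in-memory state and return.
        metadataInMemory.put(key, value);
    }

    public synchronized String read(String key) {
        // Reads are served straight from memory; no disk access is needed.
        return metadataInMemory.get(key);
    }
}
```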

 

3. When a checkpoint happens

fs.checkpoint.period specifies the maximum time interval between two checkpoints; the default is 3600 seconds.
fs.checkpoint.size specifies the maximum size of the edits file; once this value is exceeded, a checkpoint is forced regardless of whether the maximum time interval has been reached. The default size is 64 MB.
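
For illustration, both parameters can also be set programmatically through Hadoop's Configuration class; the sketch below simply restates the defaults quoted above (in practice they would normally live in the cluster's configuration files).

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Checkpoint at most every 3600 seconds (the default)...
        conf.setLong("fs.checkpoint.period", 3600);
        // ...or as soon as the edits file exceeds 64 MB, whichever comes first.
        conf.setLong("fs.checkpoint.size", 64L * 1024 * 1024);

        System.out.println("fs.checkpoint.period = " + conf.get("fs.checkpoint.period"));
        System.out.println("fs.checkpoint.size   = " + conf.get("fs.checkpoint.size"));
    }
}
```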

 

 

Seven. SecondaryNameNode

1. Introduction

A solution toward HA, but it does not support hot standby. It just needs to be configured.
Execution: it downloads the metadata information (fsimage and edits) from the NameNode, merges the two to generate a new fsimage, stores it locally, and pushes it back to the NameNode, replacing the old fsimage.
By default it is installed on the same node as the NameNode, but that way... it is not safe!

2. Workflow

(1) The Secondary notifies the NameNode to switch to a new edits file;
(2) The Secondary obtains the fsimage and edits from the NameNode (via HTTP);
(3) The Secondary loads the fsimage into memory and then begins merging the edits into it;
(4) The Secondary sends the new fsimage back to the NameNode;
(5) The NameNode replaces the old fsimage with the new one.
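
A hypothetical sketch of this five-step cycle, with every type and method name invented for illustration (this is not Hadoop's real internal API):

```java
// Invented names throughout; shown only to make the five steps above concrete.
public class CheckpointCycleSketch {
    interface NameNodeClient {
        void rollEditLog();                  // step 1: switch to a new edits file
        byte[] fetchFsImage();               // step 2: download fsimage (via HTTP)
        byte[] fetchEdits();                 //         download edits   (via HTTP)
        void uploadFsImage(byte[] newImage); // step 4: push the merged image back
    }

    static void runCheckpoint(NameNodeClient nn) {
        nn.rollEditLog();                      // (1)
        byte[] fsimage = nn.fetchFsImage();    // (2)
        byte[] edits = nn.fetchEdits();
        byte[] merged = merge(fsimage, edits); // (3) replay edits onto the image
        nn.uploadFsImage(merged);              // (4) + (5): NameNode swaps images
    }

    static byte[] merge(byte[] fsimage, byte[] edits) {
        // Placeholder: the real merge loads fsimage into memory and replays
        // each logged operation from edits on top of it.
        return fsimage;
    }
}
```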

 

Eight. DataNode

Provides storage services for the actual file data.
File block (block): the most basic storage unit. Given a file of a certain length, the file is divided, starting from offset 0, into fixed-size, sequentially numbered pieces; each such piece is called a block. The default HDFS block size is 128 MB (controlled by dfs.block.size), so a 256 MB file occupies 256/128 = 2 blocks.
Unlike an ordinary file system, in HDFS a file smaller than one data block does not occupy the whole block's storage space.
Replication: multiple replicas. The default is 3.
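
The block arithmetic above can be checked with a few lines of plain Java; the sizes used here are the ones quoted in the text.

```java
public class BlockCountExample {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // 128 MB, the HDFS default
        long fileSize = 256L * 1024 * 1024;  // a 256 MB file, as in the text

        // Number of blocks = ceiling(fileSize / blockSize); 256/128 = 2 here.
        long blocks = (fileSize + blockSize - 1) / blockSize;
        System.out.println(blocks + " block(s)"); // prints: 2 block(s)

        // A 1 MB file still occupies one block entry, but unlike an ordinary
        // file system it only consumes 1 MB of actual disk space.
        long smallFile = 1L * 1024 * 1024;
        System.out.println((smallFile + blockSize - 1) / blockSize + " block(s)");
    }
}
```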

 

Nine. HDFS read and write flow

(1) Reading process

1. The client initializes the FileSystem, then calls FileSystem's open() function to open the file.

2. The FileSystem calls the metadata node (NameNode) via RPC to obtain the block information of the file; for each data block, the metadata node returns the addresses of the data nodes that store it.

3. The FileSystem returns an FSDataInputStream to the client for reading data; the client calls the stream's read() function to start reading.

4. The DFSInputStream connects to the nearest data node holding the first data block of the file and reads the data from the data node back to the client.

5. When a data block has been read completely, the DFSInputStream closes the connection to that data node and then connects to the nearest data node holding the next data block of the file.

6. When the client has finished reading the data, it calls the FSDataInputStream's close() function.

7. While reading data, if the client encounters an error communicating with a data node, it tries to connect to the next data node that contains this data block.

8. Failed data nodes are recorded and are no longer connected to afterwards.
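
A minimal client-side sketch of this read path using the public HDFS API; the NameNode URI and file path are assumptions. The DFSInputStream mechanics of steps 4-8 happen inside the stream returned by open().

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Step 1: initialize the FileSystem (URI and path are assumptions).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Steps 2-3: open() asks the NameNode for block locations over RPC
        // and returns an FSDataInputStream for reading.
        FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"));
        try {
            // Steps 4-5: read() pulls bytes from the nearest DataNode holding
            // each block in turn; here we just copy everything to stdout.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            in.close(); // step 6
            fs.close();
        }
    }
}
```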

 

(2) Writing process

1. The client initializes the FileSystem and calls create() to create a file.

2. The FileSystem calls the metadata node via RPC to create a new file in the file system's namespace; the metadata node first verifies that the file does not already exist and that the client has permission to create it, and then creates the new file.

3. The FileSystem returns a DFSOutputStream to the client for writing data, and the client starts writing.

4. The DFSOutputStream divides the data into blocks and writes them to the data queue. The data queue is read by the Data Streamer, which asks the metadata node to allocate data nodes for storing the blocks (3 replicas per block by default). The allocated data nodes are placed in a pipeline. The Data Streamer writes a block to the first data node in the pipeline, the first data node forwards the block to the second data node, and the second data node forwards it to the third.

5. The DFSOutputStream keeps an ack queue for the blocks it has sent out, waiting for the data nodes in the pipeline to report that the data has been written successfully.

6. When the client has finished writing data, it calls the stream's close() function. This flushes all remaining blocks in the data queue to the pipeline of data nodes, waits for the ack queue to report success, and finally notifies the metadata node that the write is complete.

7. If a data node fails during the write, the pipeline is closed and the blocks in the ack queue are put back at the front of the data queue. The current block is given a new identity by the metadata node on the data nodes where it has already been written, so that when the failed node recovers it can detect that its copy of the block is stale and delete it. The failed data node is removed from the pipeline, and the rest of the block's data is written to the other two data nodes in the pipeline. The metadata node is notified that this block is under-replicated, and a third replica will be created later.
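
And a matching minimal sketch of the write path; again the NameNode URI and file path are assumptions, and the pipeline mechanics of steps 4-7 happen inside the stream returned by create().

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Step 1: initialize the FileSystem (URI and path are assumptions).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Step 2: create() asks the NameNode over RPC to create the file entry
        // after checking that it does not exist and that we have permission.
        FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"));
        try {
            // Steps 3-5: bytes written here are split into blocks and streamed
            // through the DataNode pipeline (3 replicas by default).
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        } finally {
            out.close(); // step 6: flush remaining data, wait for acks,
                         // then tell the NameNode the write is complete.
            fs.close();
        }
    }
}
```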

 


Origin: www.cnblogs.com/cuihongyu3503319/p/11592764.html