The Road to Learning Big Data -- HDFS

Copyright: https://blog.csdn.net/power_k/article/details/91397660

Definition:
As commonly understood, big data is a large amount of varied, valuable information produced in a short time.
Its characteristics (the five Vs):
Volume (large scale), Velocity (high speed), Variety (diversity), Value (low value density), Veracity (authenticity)
Big data summary
To solve the problem of having too much data, two kinds of support are needed.
Hardware support
1) Vertical scaling. For example: when a computer's storage fills up, you can add hard disks, but only up to some maximum. What do you do when that is not enough?
2) Horizontal scaling: scale out; if the machines you have are not enough, add a few more.
Simple, inexpensive PCs or servers are sufficient.
Networks: 2G -> 3G -> 4G -> 5G ...
Software support
1. How database capacity has evolved
a. Access: tens of thousands to hundreds of thousands of rows
b. SQL Server: generally one million to ten million rows
c. MySQL: millions or even billions of rows, but performance drops significantly beyond about 20 million rows
d. Oracle, the pinnacle of structured (relational) databases, can handle millions of rows
2. Based on Google's three papers, the following were developed:
GFS --------------> HDFS, a distributed file system (distributed storage)
MapReduce ---------> MapReduce, distributed processing
BigTable ----------> HBase
Because traditional databases could less and less keep up with demand, Hadoop emerged. http://hadoop.apache.org/ ----------- the Hadoop official website
Hadoop modules
Hadoop Common: the common utilities that support the other Hadoop modules
Hadoop Distributed File System (HDFS): a distributed file system providing high-throughput access to application data
Hadoop YARN: a framework for job scheduling and cluster resource management
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets (batch processing)
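
To make the MapReduce module concrete, here is the classic word-count job, a minimal sketch against the org.apache.hadoop.mapreduce API (input and output paths are supplied on the command line):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every token in a line of input.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String tok : value.toString().split("\\s+")) {
                if (!tok.isEmpty()) { word.set(tok); ctx.write(word, ONE); }
            }
        }
    }
    // Reduce phase: sum the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : vals) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```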
Related projects include:
Spark: a fast, general-purpose compute engine for Hadoop data.
Modules: Spark Core
Spark SQL - data can be processed with SQL
Spark Streaming - stream processing
MLlib - machine learning library
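
A minimal Spark SQL sketch in Java (assuming the spark-sql dependency is present; the file people.json and the local master are illustrative), showing the "process with SQL" idea:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("demo").master("local[*]").getOrCreate();
        // Load a JSON file into a DataFrame and query it with plain SQL.
        Dataset<Row> df = spark.read().json("people.json");
        df.createOrReplaceTempView("people");
        spark.sql("SELECT name FROM people WHERE age > 21").show();
        spark.stop();
    }
}
```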
3. Technologies Hadoop needs for support
a. Kafka
Kafka's purpose is to unify online and offline message processing through Hadoop's parallel loading mechanism, and also to provide real-time messages through the cluster.
b. ZooKeeper
ZooKeeper is a distributed, open-source coordination service for distributed applications and an open-source implementation of Google's Chubby. It is a cluster manager: it monitors the status of each node in the cluster and, based on the feedback the nodes submit, carries out the next reasonable step. In the end it provides users with an easy-to-use interface and a system with efficient performance and stable functionality.
ZooKeeper provides (see the sketch after this list):
1) a file system (a hierarchical namespace)
2) a notification mechanism
c. Flume
d. Hive ---- the big-data warehouse
e. Flink
f. Storm
g. HBase
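
To make ZooKeeper's two offerings concrete, here is a minimal sketch, assuming a ZooKeeper server at localhost:2181 and the org.apache.zookeeper client library on the classpath (the address and the znode path /demo are illustrative). It creates a znode in the hierarchical namespace and registers a one-shot watch that fires when the znode's data changes.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkWatchDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect to the ensemble; wait until the session is established.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // The "file system": znodes form a hierarchical namespace, like paths.
        if (zk.exists("/demo", false) == null) {
            zk.create("/demo", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.PERSISTENT);
        }

        // The notification mechanism: a one-shot watch fires on the next change.
        zk.getData("/demo", event ->
            System.out.println("watch fired: " + event.getType() + " on " + event.getPath()),
            null);

        zk.setData("/demo", "v2".getBytes(), -1); // triggers the watch above
        Thread.sleep(1000);
        zk.close();
    }
}
```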
4. HDFS architecture
Master-slave architecture
Master node: NameNode (the boss)
NameNode: the metadata node. It manages the namespace of the file system and acts as the master. It maintains the file system tree and the metadata of all files and directories in it; this information is stored on disk as two files: the namespace image and the edit log (mentioned again later). The NameNode also records, for each file, which data blocks it consists of and which data nodes those blocks are distributed over. This block-location information, however, is not stored on disk; it is collected from the data nodes when the system starts.
Slave nodes: DataNode (the employees)
DataNode: the data node, the place where HDFS really stores data. Both the client and the metadata node (NameNode) can request that data blocks be written to or read from a data node. In addition, DataNodes must periodically report the block information they store back to the metadata node.
Client (the secretary)
Saves files, reads files --------- HDFS's read/write mechanism
Keeps backups --------- to solve the data-safety problem
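
As a minimal sketch of the client's role, assuming a NameNode reachable at hdfs://namenode:9000 and the hadoop-client library on the classpath (host, port, and paths are illustrative): listing a directory is a pure metadata operation, answered by the NameNode without touching any DataNode.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder URI; point this at your NameNode.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Directory listing: metadata served entirely by the NameNode.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "  " + status.getLen()
                + " bytes  replication=" + status.getReplication());
        }
        fs.close();
    }
}
```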
5. File transfer
Files are stored in the form of blocks
A file is linearly cut into blocks (block), each with an offset (in bytes)
The blocks are stored spread across the nodes of the cluster
All blocks of a single file have the same size; different files may use different block sizes
The number of block replicas can be set; the replicas are spread over different nodes
The replica count must not exceed the number of nodes
For files already uploaded, the block replica count can be adjusted, but the block size stays the same
Write once, read many: only one writer is supported at a time
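
A sketch of these knobs with the Hadoop FileSystem API (host and paths are illustrative; 64 MB and the replica counts are just example values): replication and block size are chosen per file at creation time, and the replica count can still be changed afterwards.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettingsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"),
                                       new Configuration());
        Path file = new Path("/demo/data.txt");

        // Per-file settings at create time: 2 replicas, 64 MB blocks.
        short replication = 2;
        long blockSize = 64L * 1024 * 1024;
        try (FSDataOutputStream out =
                 fs.create(file, true, 4096, replication, blockSize)) {
            out.writeUTF("hello hdfs");
        }

        // The replica count can still be adjusted after upload; block size cannot.
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}
```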
6. Read and write operations
Read: [figure: HDFS read flow]
Write: [figure: HDFS write flow]
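
A minimal read/write sketch with the standard stream API (host and path are illustrative):

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWriteDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"),
                                       new Configuration());
        Path file = new Path("/demo/hello.txt");

        // Write: the client asks the NameNode for target DataNodes,
        // then streams the data to them through a pipeline.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client gets the block locations from the NameNode,
        // then reads the block data directly from a DataNode.
        byte[] buf = new byte[32];
        try (FSDataInputStream in = fs.open(file)) {
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```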
7. The backup (replica placement) mechanism
a. If the file is submitted from a node inside the cluster, the first replica is placed on the submitting node;
if it is submitted from outside the cluster, a node whose storage load is not high is selected
b. The first backup is placed on any node in a different rack
c. The second backup is placed on a different node of the same rack as the first backup
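
To check where the replicas actually landed, a sketch (host and path again illustrative) that asks the NameNode for a file's block locations:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"),
                                       new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/demo/data.txt"));

        // One BlockLocation per block; getHosts() lists the DataNodes
        // holding that block's replicas.
        for (BlockLocation loc :
                 fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + loc.getOffset()
                + " length=" + loc.getLength()
                + " hosts=" + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}
```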
8. Writing safely: the pipeline
Pipeline:
a. The NameNode returns information about a set of DataNodes to the client
b. The client and those DataNodes form a pipeline, and the block is cut into packets (ackPackage, 64 KB each)
c. Each DataNode takes the data meant for it from the pipeline and stores it
d. When storage is complete, the DataNodes report back to the NameNode
[Figure: HDFS write pipeline]
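
The 64 KB packet size mentioned above corresponds to the HDFS client setting dfs.client-write-packet-size; a short sketch (values and paths illustrative) of where it would be set:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PacketSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Packet granularity of the write pipeline; 64 KB is the default.
        conf.setInt("dfs.client-write-packet-size", 64 * 1024);
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        try (FSDataOutputStream out = fs.create(new Path("/demo/pipelined.txt"))) {
            out.write(new byte[256 * 1024]); // streamed to DataNodes in 64 KB packets
        }
        fs.close();
    }
}
```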
