What I know about big data

Today I want to talk about my preliminary understanding and awareness of the term "big data";
(what follows is just a brief summary of my own learning; if anything is wrong, please point it out)

My understanding of big data: a large volume and wide variety of valuable information generated rapidly in a short time;
in the past, data was generated slowly and the pace was slow, but now the rapid technological development of society is plain to see, and of course we really feel it as ever more advanced technologies keep being released; for this data-overload problem there are two solutions:
1: Vertical scaling (scale up): increase the capacity of the machine itself, for example by adding more disks to your computer;
2: Horizontal scaling (scale out): link multiple servers together to expand; (this only needs ordinary, inexpensive servers, even PCs will do)

Three papers from Google should be mentioned here; they are regarded as the origin of big data:
GFS ==========> from it the distributed file system HDFS was developed
MapReduce ==========> distributed processing
BigTable ==========> HBase

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on general-purpose (commodity) hardware. It has a lot in common with existing distributed file systems, but at the same time its differences from other distributed file systems are also evident. HDFS is a highly fault-tolerant system suitable for deployment on low-cost machines, and it provides high-throughput data access, which makes it ideal for use on large data sets.
------- from the official Hadoop website

HDFS
YARN ------- resource and task scheduling
YARN is the newer resource manager in Hadoop, a general-purpose resource management system that provides unified resource management and scheduling for the applications above it. Its introduction brought great benefits in cluster utilization, unified resource management, and data sharing.

MapReduce ------- batch processing

Spark ---------- Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. It has now formed a rapidly developing ecosystem with a wide range of applications (a small word-count sketch follows the component list below).

Spark Core
Spark SQL ---- process data using SQL
Spark Streaming ----- stream processing
MLlib ----- machine learning library
GraphX -------- graph processing (no longer maintained by Spark)
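To make the Spark entry a bit more concrete, here is a minimal word-count sketch with the Spark Java API; it is only an illustration under assumptions not in the original post (Spark run in local mode, a made-up input path), not part of the author's notes:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;

public class SparkWordCountSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Made-up input path for illustration only.
            JavaRDD<String> lines = sc.textFile("file:///tmp/demo.txt");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split into words
                    .mapToPair(word -> new Tuple2<>(word, 1))                      // (word, 1)
                    .reduceByKey(Integer::sum);                                    // sum counts per word
            counts.collect().forEach(t -> System.out.println(t._1() + " : " + t._2()));
        }
    }
}
```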

Master-slave architecture
Master node: NameNode
Slave node: DataNode
Client
HDFS read/write mechanism: ① storing (writing) a file
② reading a file
Replication (backup) ----------> solves the data-safety problem

Files are stored in the cluster in the form of blocks.
The default block size is 128 MB.

When stored, a file is linearly cut into blocks; each block has an offset (in bytes).
The blocks are stored scattered across the nodes of the cluster.
Within a single file all blocks have the same size, but the block size may differ from file to file:
the blocks a file is split into must use one unified size, so in the 128 MB case every block of that file is 128 MB,
but two different files can differ, one may use 128 MB while the other uses 64 MB.

For example: if linearly splitting a file gives 13.1 blocks, then 14 blocks are actually needed (the last block is simply smaller than the others).
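A tiny sketch of this ceiling-division rule (128 MB is the HDFS default block size; the file size here is just a made-up example):

```java
public class BlockCount {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;        // 128 MB default block size
        long fileSize  = (long) (13.1 * blockSize); // a file of "13.1 blocks"
        // Ceiling division: any remainder means one extra (smaller) block.
        long blocks = (fileSize + blockSize - 1) / blockSize;
        System.out.println(blocks);                 // prints 14
    }
}
```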

Replicas (copies) can be set for a block, and the replicas are scattered on different nodes.
The number of replicas should not exceed the number of nodes:
when creating backups, if you set more replicas than there are nodes, the duplicate backups on the same node are useless, because when that node is lost they are all lost.

When uploading a file you can set its block size and number of replicas (see the sketch below).
For a file that has already been uploaded, the number of replicas can still be adjusted, but the block size cannot be changed.
HDFS only supports write-once-read-many: at any one time there is only a single writer.
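A minimal sketch of these settings with the Hadoop Java FileSystem API, assuming a reachable cluster picked up from the default Configuration; the paths and values are made-up examples, not from the original post:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSettingsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.blocksize", "67108864");   // 64 MB block size for files written by this client
        conf.set("dfs.replication", "2");        // 2 replicas for files written by this client
        FileSystem fs = FileSystem.get(conf);

        // Upload a local file; it is written with the settings above (made-up paths).
        fs.copyFromLocalFile(new Path("/tmp/local.log"), new Path("/data/local.log"));

        // The replica count of an already uploaded file can still be changed ...
        fs.setReplication(new Path("/data/local.log"), (short) 3);
        // ... but its block size is fixed once the file has been written.
        fs.close();
    }
}
```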

"" "" As another example
NameNode: the company boss, abbreviated NN
DataNode: the employees, abbreviated DN
Client: equivalent to the secretary
The boss (NN) controls the global picture and manages the metadata of the DNs == metadata: data that describes the data;
source data: the data itself
The secretary (Client) receives the read and write requests
and also handles the corresponding communication with the employees (the DNs)

DN: does the actual work ===> stores the data
reports its own status
accepts the assignments from the secretary

When the boss issues a work order it goes to the secretary, and the secretary then assigns it to the employees.

First, the write operation:
-------------------------->
A large file needs to be stored on the servers.
Size of the large file / 128 MB = number of blocks.
The secretary (the client) cuts the large file into blocks and then reports to the NN: when it was cut, the upload permissions and owner of the file, the number of blocks, and the size of the file.

After the cut, the client applies to the NN for resources ---- the DN information.
The NN returns to the client a number of DNs whose load is not high.
The client starts sending blocks to those DNs, and the DNs make the backup replicas.
After storing a block, a DN reports the situation back to the NN.
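From the client's point of view, the whole write flow above is hidden behind a single API call; a minimal sketch with the Hadoop Java API (path and contents are made-up examples), where the block cutting, DN selection and replication all happen inside HDFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() goes through the NN; the returned stream writes the data to the DNs.
        try (FSDataOutputStream out = fs.create(new Path("/data/demo.txt"))) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```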

Pipeline:

If a whole block were stored directly in one piece, the pipeline would get blocked and efficiency would be low.
So the NN returns the information of some DNs to the client, the client then forms a pipeline with those DNs, and the block is cut into small packets (ackPackage, 64 KB each; a 128 MB block corresponds to 2048 such packets).
Each DN picks up the data it needs from the pipeline and stores it.
After storage succeeds, the DN reports back to the NN.

Read request:
A request is sent to the client saying which data is to be read; after the client receives the request, it applies to the NN for the block information (blockID).
The NN sends the node information back to the client.
The client then goes to those DN nodes to fetch the data, picking nodes by the principle of proximity.
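A minimal sketch of the read side with the same Java API (made-up path): open() asks the NN for the block locations, and the returned stream then reads the data from nearby DNs:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = fs.open(new Path("/data/demo.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);  // print the file contents
        }
        fs.close();
    }
}
```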

=========================================================================

Backup (replication) mechanism
Cluster demonstration
Two cases:
1: If the upload is submitted from a node inside the cluster, the block is placed on the submitting node.
2: If it is submitted from outside the cluster, a node whose load is not high is selected to store it.

It should be noted: the first replica can be placed on any node; the second replica
is placed on a different rack (for safety).


Source: blog.csdn.net/sincere_love/article/details/91400217