Big data basics, ideal for complete beginners

This article covers big data basics and is ideal for complete beginners; readers who already have some foundation can also use it to consolidate the basics!

First, what is big data

Big data refers to data collections whose acquisition, storage, management, and analysis far exceed the capabilities of traditional database software tools. It has four defining features: massive data volume, fast data flow, diverse data types, and low value density. Big data requires special techniques to efficiently process large amounts of data within a tolerable elapsed time. Technologies suited to big data include massively parallel processing (MPP) databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.

Second, the basic characteristics of big data

Data volume (Volume): The first characteristic is sheer volume; the amounts of data collected, stored, and computed are all very large.

Type diversity (Variety): The second characteristic is diversity of types and sources, spanning structured, semi-structured, and unstructured data, concretely web data, logs, audio, video, pictures, location information, and more. The many data types place higher demands on data-processing capability.

Low value density (Value): The third characteristic is relatively low value density: extracting value is like panning for gold in sand, yet what is found is precious. With the widespread use of the Internet and the Internet of Things, information is sensed everywhere and floods in, but its value density is low. How to combine business logic with powerful data-mining algorithms to extract the value hidden in the data is the most pressing problem of the big data era.

High velocity (Velocity): The fourth characteristic is that data grows fast and must be processed fast, with demanding timeliness requirements.

Online (Online): Data is always online and can be accessed for computation at any time; this is the biggest feature distinguishing big data from traditional data.

Third, units of measurement in big data

All units in ascending order: bit, Byte, KB, MB, GB, TB, PB, EB, ZB, YB, BB, NB, DB. (From Byte upward, each unit is 2^10 times the previous one; 1 Byte = 8 bits.)
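
A small arithmetic sketch of this unit ladder, assuming a step factor of 2^10 from Byte upward (`bytes_to_human` is a hypothetical helper written for this illustration, not a standard library function):

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def bytes_to_human(n: float) -> str:
    """Convert a raw byte count to the largest unit giving a value below 1024."""
    for unit in UNITS[:-1]:
        if n < 1024:
            return f"{n:.2f} {unit}"
        n /= 1024  # each step up the ladder is a factor of 2**10
    return f"{n:.2f} {UNITS[-1]}"

print(bytes_to_human(1_500_000_000))  # -> 1.40 GB
```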

Fourth, the data structures of big data

Structured data: row data, stored in databases, whose logic can be expressed with a two-dimensional table structure.

Unstructured data: data whose structure is irregular or incomplete, with no predefined data model, and which is not easily represented by a database's two-dimensional logical tables. It includes office documents of all formats, text, images, XML, HTML, various reports, and image and audio/video information.

Semi-structured data: data that has structure, but an irregular one; because the structure varies so much, one cannot simply build a table to match it. It sits between fully structured data and completely unstructured data (such as sound and image files); HTML documents, for example, are semi-structured. It is generally self-describing, with structure and content mixed together and no clear distinction between them.
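
The contrast can be made concrete with a small Python sketch (the sample records below are invented for illustration):

```python
import json

# Structured: row data with a fixed, predefined schema (a two-dimensional table).
structured_row = {"id": 1, "name": "alice", "age": 30}

# Semi-structured: self-describing, structure and content mixed; fields may vary per record.
semi_structured = json.loads('{"id": 2, "name": "bob", "tags": ["hadoop", "spark"]}')

# Unstructured: no predefined model; free text, images, audio, video, etc.
unstructured = "Server restarted at 03:12 after an unexpected power loss."

print(structured_row["name"], semi_structured["tags"], len(unstructured))
```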

Fifth, the computing models of big data

Batch computing (MapReduce, Spark): The computing model best suited to batch processing of big data is MapReduce. First, MapReduce applies the "divide and conquer" idea of parallel processing to large-scale data whose relationships are simple and easy to partition. Second, it abstracts the repetitive processing and summarizing of large numbers of data records into two operations, Map and Reduce. Finally, MapReduce provides a unified parallel-computing framework that pushes the many details of the parallel computing system down to the framework layer, greatly reducing the programmer's parallel-programming burden.
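
The classic word-count example makes the Map and Reduce abstractions concrete. The following is a minimal in-process Python sketch of the idea, not the Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

documents = ["big data is big", "data needs parallel processing"]

# Map: emit a (word, 1) pair for every word in every input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the intermediate pairs by key.
mapped.sort(key=itemgetter(0))

# Reduce: sum the counts for each word.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}

print(counts)  # {'big': 2, 'data': 2, ...}
```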

Stream computing (Scribe, Flume, Storm, S4, Spark Streaming): Stream computing is a computing model with high real-time requirements. It completes real-time processing of newly generated data within a time window set by the application, preventing data from piling up and being lost.
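
A toy time-window aggregation conveys the idea. This is a pure-Python sketch, not the API of Storm or Spark Streaming, and the event timestamps are invented:

```python
from collections import deque

WINDOW_SECONDS = 10
window = deque()  # (timestamp, value) pairs inside the current window

def on_event(ts: float, value: float) -> float:
    """Process each record as it arrives; evict records that left the window."""
    window.append((ts, value))
    while window and window[0][0] <= ts - WINDOW_SECONDS:
        window.popleft()
    return sum(v for _, v in window)  # running windowed aggregate

for ts, v in [(1, 5.0), (4, 3.0), (12, 2.0)]:
    print(ts, on_event(ts, v))  # at ts=12 the ts=1 event has expired
```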

Iterative computing (HaLoop, iMapReduce, Twister, Spark): To overcome Hadoop MapReduce's inability to support iterative computation, industry and academia have studied many improvements to it. HaLoop moves control of iterative MapReduce job execution inside the framework, and its loop-sensitive scheduler ensures that the Reduce output of the previous iteration and the Map input of the current iteration land on the same physical machine, reducing the overhead of data transfer between iterations.
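
A tiny PageRank loop shows why iteration matters: the link structure is loop-invariant input re-read on every iteration, exactly the data HaLoop tries to keep local. This is a self-contained sketch on an invented three-page graph:

```python
# Toy PageRank: `links` is the loop-invariant input reused each iteration.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 / len(links) for page in links}
DAMPING = 0.85

for _ in range(20):  # fixed iteration count instead of a convergence test
    contrib = {page: 0.0 for page in links}
    for page, outs in links.items():
        share = ranks[page] / len(outs)  # spread rank over outgoing links
        for out in outs:
            contrib[out] += share
    ranks = {p: (1 - DAMPING) / len(links) + DAMPING * c
             for p, c in contrib.items()}

print({p: round(r, 3) for p, r in ranks.items()})
```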

Interactive computing

Graph computing (Pregel, PowerGraph, GraphX)

In-memory computing (Dremel, HANA, Redis)

Sixth, the big data workflow

1. Acquisition and pre-processing

Data is collected from the data sources and, through data fusion and data integration, new datasets are generated, providing a unified data view for subsequent query, analysis, and processing.
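
As a toy illustration of integrating two sources into one unified view (a hedged sketch; the source names and fields are invented):

```python
# Two hypothetical sources describing the same users with different schemas.
crm = [{"uid": 1, "name": "alice"}, {"uid": 2, "name": "bob"}]
weblog = [{"user_id": 1, "visits": 17}, {"user_id": 2, "visits": 4}]

# Integration step: normalize the keys and join into a single unified view.
visits_by_uid = {row["user_id"]: row["visits"] for row in weblog}
unified = [{**person, "visits": visits_by_uid.get(person["uid"], 0)}
           for person in crm]

print(unified)  # one consolidated dataset ready for query and analysis
```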

2. Storage management

Distributed File System

Distributed databases (NewSQL, NoSQL)

3. Computing models

Including batch, interactive, stream, iterative, graph, and in-memory computing.

4. Analysis and mining

5. Visualization

Seventh, an introduction to CDH

CDH was the first Hadoop distribution that is 100% open source and released under the Apache license, developed from Apache Hadoop and related projects. It supports batch processing, interactive SQL queries, real-time queries, and role-based access control, and it is the most widely used Hadoop distribution in the enterprise.

Eighth, the CAP theorem of distributed architectures

● Consistency (C): whether all replicas of the data in the distributed system hold the same value at the same moment (equivalently, every node accesses the same latest copy of the data). In other words, at any time, any application accessing the data obtains the same data.

● Availability (A): whether the cluster as a whole can still respond to clients' read and write requests when a subset of its nodes fails (including high availability for data updates). In other words, at any time, any application can read and write the data.

● Partition tolerance (P): in practical terms, a partition corresponds to a time requirement on communication. If the system cannot achieve data consistency within the time limit, a partition has occurred, and it must then choose between C and A for the current operation. In other words, the system can scale linearly across network partitions.
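
As a toy illustration of the C-versus-A choice during a partition (a hedged sketch; the Replica class and write function are invented for demonstration, not any real system's API):

```python
# Toy two-replica store under a network partition: choosing C rejects writes
# that cannot replicate; choosing A accepts them and lets replicas diverge.
class Replica:
    def __init__(self):
        self.value = 0

def write(primary, backup, value, partitioned, prefer_consistency):
    if partitioned and prefer_consistency:
        raise RuntimeError("CP choice: refuse the write, stay consistent")
    primary.value = value
    if not partitioned:
        backup.value = value  # replication succeeds, replicas agree
    # AP choice: write accepted, but the backup now serves stale data

a, b = Replica(), Replica()
write(a, b, 42, partitioned=True, prefer_consistency=False)
print(a.value, b.value)  # 42 0 -> available but inconsistent
```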

Source: blog.csdn.net/kangshufu/article/details/92703893