Big Data Technologies: Open Source Framework Components

Table of Contents

(I) General Framework Overview

(II) Data Collection Layer

(III) Data Storage Layer

(IV) Resource Management and Service Coordination Layer

(V) Computing Engine Layer

(VI) Data Analysis Layer

(VII) Data Visualization Layer

 

(I) General Framework Overview

 

From the bottom up, much like the OSI model, the general framework of a big data system consists of seven layers: the data source, the data collection layer, the data storage layer, the resource management and service coordination layer, the computing engine layer, the data analysis layer, and the data visualization layer. Illustrated as follows:

[Figure: the seven-layer architecture of a big data system]

 

(II) Data Collection Layer

 

The data collection layer interfaces directly with the data sources and is responsible for collecting the logs generated while the product is in use; it needs to be distributed and general-purpose. In practice, most data sources are scattered and large in volume, which makes them hard to gather in one place, so the design of this layer should have the following characteristics:

Scalability: it can be configured with a variety of different data sources and does not become a system bottleneck when traffic peaks are encountered;

Reliability: data must not be lost during transmission (especially financial data);

Security: sensitive data (passwords, monetary amounts, etc.) must be encrypted during transmission;

Low latency: data sources usually produce logs at large scale, so data should reach the storage system as quickly as possible without building up a backlog.

 

In the open source ecosystem represented by Hadoop/Spark, the data collection layer typically offers the following options:

Sqoop: a fairly general tool for full (bulk) imports from relational databases;

Canal: a fairly general tool for incremental imports from relational databases;

Flume: a common choice for collecting non-relational logs, such as text logs;

Kafka: a distributed message queue, conceptually a data channel, with strong distributed fault tolerance.
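
To make the collection path concrete, here is a minimal sketch using the third-party kafka-python client. The broker address, the "app-logs" topic, and the log payload are assumptions invented for the example, not details from the original article.

    # Minimal sketch: shipping one log record into Kafka.
    # Assumes a broker at localhost:9092 and a topic named "app-logs";
    # both are illustrative placeholders.
    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # Kafka stores bytes, so the log line is sent as an encoded payload.
    producer.send("app-logs", value=b'{"event": "page_view", "user": 42}')

    # Block until buffered records are delivered to the broker, so the
    # script does not exit before the send completes.
    producer.flush()
    producer.close()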

 

(III) Data Storage Layer

 

Traditional relational databases have bottlenecks in distribution, scalability, and high availability, making them a poor fit for big data scenarios; they are not recommended as the primary storage and computing system. The data storage layer is responsible for landing and persisting data, both relational and non-relational, under a centralized scheduling system. It mainly has the following characteristics:

Scalability: as the main carrier on which data lands, this layer faces long-term data growth, so the carrying capacity of the cluster will always hit a bottleneck after some period of time. The data storage layer therefore needs to allow for expanding capacity by adding machines.

Fault tolerance: for cost reasons, the data storage layer typically spans many machines and has to be built on relatively inexpensive hardware, which requires the system itself to be highly fault-tolerant; the failure of one or more machines must not cause data loss.

Storage model: because data is diverse, the data storage layer needs to support both structured and unstructured data, and therefore must support text, columnar, and other storage models.

 

Since Google proposed the concept of distributed computing, it has typically relied internally on technical solutions such as GFS, BigTable, MegaStore, and Spanner. In the open source ecosystem represented by Hadoop/Spark, the data storage layer typically offers the following options:

HDFS: a distributed file system and the open source implementation of GFS, with very good scalability and fault tolerance, well suited to running on inexpensive hardware;

HBase: a distributed database built on top of HDFS and the open source implementation of BigTable; it stores structured and semi-structured data and supports virtually unlimited growth in rows and columns;

Kudu: a columnar storage system open-sourced by Cloudera, offering scalability and high availability.
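
As a small illustration of landing data on this layer, here is a sketch using the third-party Python hdfs package, a WebHDFS client. The NameNode URL, the "hadoop" user, and the file path are assumptions chosen for the example.

    # Minimal sketch: writing and reading a file on HDFS over WebHDFS.
    # Assumes a NameNode with WebHDFS enabled at http://namenode:9870;
    # the address, user, and path are illustrative placeholders.
    from hdfs import InsecureClient  # pip install hdfs

    client = InsecureClient("http://namenode:9870", user="hadoop")

    # Land a small text file; overwrite=True replaces any prior version.
    client.write("/data/raw/events.txt", data=b"event_a\nevent_b\n",
                 overwrite=True)

    # Read the file back to verify that it landed correctly.
    with client.read("/data/raw/events.txt") as reader:
        print(reader.read())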

 

(IV) Resource Management and Service Coordination Layer

 

As Internet systems grow in scale, mixing different technologies and frameworks on the same infrastructure has become increasingly common, which creates huge challenges for operations, development, and resource utilization. To deploy all of these frameworks on a unified platform that shares machine resources, a resource management and service coordination layer is introduced. Doing so brings the following advantages:

High resource utilization: the platform can effectively balance the number of running programs against machine resources, making full use of the cluster;

Low operations cost: the operation of every framework is consolidated onto a unified platform, so fewer operations staff are needed;

Data sharing: the same data can be served to different computing frameworks, and computation results can be shared in storage, reducing computation cost.

 

Google uses three systems for resource management: Borg, Omega, and Chubby. In the open source ecosystem represented by Hadoop/Spark, the resource management and service coordination layer typically offers the following options:

YARN: the unified resource management and scheduling system of the Hadoop framework; it centrally manages machine resources (CPU, memory, etc.) and schedules tasks through queues;

ZooKeeper: a distributed coordination service built on a Paxos-style consensus protocol, providing primitives such as distributed queues and distributed locks for complex scenarios; a small lock sketch follows.
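
To make the distributed lock idea concrete, here is a hedged sketch using the third-party kazoo client for ZooKeeper. The ensemble address, lock path, and worker identifier are assumptions invented for the example.

    # Minimal sketch: acquiring a distributed lock via ZooKeeper.
    # Assumes an ensemble reachable at 127.0.0.1:2181; the lock path
    # and identifier are illustrative placeholders.
    from kazoo.client import KazooClient  # pip install kazoo

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # The Lock recipe creates ephemeral sequential znodes under the
    # path; the client with the lowest sequence number holds the lock.
    lock = zk.Lock("/locks/nightly-report", "worker-1")

    with lock:  # blocks until this client owns the lock
        print("only one worker runs this section at a time")

    zk.stop()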

 

(V) Computing Engine Layer


Computing engines address two scenarios, batch processing and stream processing. When the data volume is large and real-time requirements are loose, or when the computation logic is complex, data is processed in batch mode, which pursues high throughput; when the data volume is moderate, real-time requirements are strict, and the computation logic is relatively simple, data is processed in streaming mode, which pursues low latency. No real-time framework currently exists that can handle complex logic or very large data. Beyond these two scenarios, interactive engines, which organize and query data through standardized OLAP-style interfaces, have grown increasingly popular in recent years thanks to a huge ease-of-use advantage. The applicable scenarios of the three kinds of engines are as follows (a small streaming sketch follows the list):

Batch: index building, data mining, large-scale complex data analysis, machine learning;

Streaming: ad recommendation, real-time reporting, anti-fraud;

Interactive: ad hoc data queries, report computation.
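
As a hedged illustration of the streaming scenario, here is a minimal running word count with Spark Structured Streaming. The socket source on localhost:9999 is an assumption chosen purely so the example is self-contained; a real pipeline would more likely read from Kafka.

    # Minimal sketch: a streaming word count with Structured Streaming.
    # For testing, `nc -lk 9999` can feed lines into the socket source.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Split each line into words and maintain a running count per word.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    query = (counts.writeStream
             .outputMode("complete")  # re-emit the full updated table
             .format("console")
             .start())
    query.awaitTermination()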

 

Google published the principles behind two frameworks, MapReduce and Dremel, both of which have been widely adopted in open source scenarios; Pregel, Percolator, and MillWheel have also been taken up by open source projects. In the open source ecosystem represented by Hadoop/Spark, the common options are currently as follows:

MapReduce: the classic batch engine, with very good scalability and fault tolerance;

Impala / Presto / Drill: open-sourced by Cloudera, Facebook, and Apache respectively; they run standard SQL over data stored on HDFS and are open source implementations modeled on Google's Dremel;

Spark: a DAG-based engine that provides the RDD abstraction over data and relies mainly on memory for fast data mining;

Storm / Spark Streaming / Flink: streaming systems that all offer good fault tolerance and scalability, though their implementation details vary.
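
For contrast with the streaming sketch above, here is a minimal batch example on Spark's RDD API. The HDFS input path is an illustrative assumption, and word count stands in for a real analysis job.

    # Minimal sketch: a batch word count on the Spark RDD API.
    # The input path is a placeholder; any HDFS or local text file works.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("batch-sketch").getOrCreate()
    sc = spark.sparkContext

    counts = (sc.textFile("hdfs:///data/raw/events.txt")
              .flatMap(lambda line: line.split())  # one record per word
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))    # sum counts per word

    for word, n in counts.take(10):
        print(word, n)

    spark.stop()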

 

(VI) Data Analysis Layer

 

The computing frameworks can output results directly, and for simplicity this layer is often replaced by the interactive frameworks of the computing engine layer. In practice, for platform-side technical reasons, this layer more commonly uses relational databases such as MySQL, Oracle, and PostgreSQL. The usual categories are summarized as follows (an MLlib sketch follows the list):

Impala / Presto / Drill: implementations based on interactive computing engines;

MySQL / Oracle / PostgreSQL: relational database implementations;

Hive / Pig: implementations for computing over massive data;

Mahout / MLlib: libraries of common machine learning and data mining algorithms, originally implemented on MapReduce and now mostly implemented on Spark;

Beam / Cascading: frameworks that unify batch and stream computation, providing a higher-level API for expressing computational logic.
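
As a hedged illustration of this layer, here is a minimal MLlib sketch that clusters a toy dataset with k-means. The four 2-D points and the choice of k=2 are invented purely for the example.

    # Minimal sketch: k-means clustering with Spark MLlib (DataFrame API).
    # The data points and k=2 are toy values chosen for illustration.
    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

    data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
            (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
    df = spark.createDataFrame(data, ["features"])

    # Fit two clusters; the seed is fixed so the toy run is reproducible.
    model = KMeans(k=2, seed=1).fit(df)
    print(model.clusterCenters())  # centers near (0.5,0.5) and (8.5,8.5)

    spark.stop()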

 

(VII) Data Visualization Layer

 

In big data scenarios, this layer is typically implemented with front-end plugins such as ECharts, which offer a wide range of chart options. Common representations include line charts, bar charts, pie charts, scatter plots, candlestick (K-line) charts, radar charts, heat maps, and path diagrams.

The data visualization layer draws on related disciplines such as computer graphics and image processing, and touches on multiple technologies including interactive processing, computer-aided design, computer vision, and human-computer interaction.
