Big data architecture design (Lambda architecture, Kappa architecture)


Other related recommendations:
Microservice architecture of system architecture
Microkernel architecture of system architecture design
Hongmeng Operating system architecture

Column:System Architect

1. Big data technology ecosystem

  1. Storage: mainly includes HDFS and Kafka
    HDFS is the distributed storage framework provided by Hadoop; it can store massive amounts of data.

  2. Computing: mainly includes MapReduce, Spark, and Flink
    MapReduce is the distributed computing framework provided by Hadoop; it can be used to compute statistics over and analyze the massive data stored on HDFS.

  3. Online query (OLAP): includes Kylin, Impala, etc.

  4. Random query (NoSQL): includes HBase, Cassandra, etc.

  5. Data mining, machine learning, and deep learning: includes TensorFlow, Caffe, and Mahout
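
To make the MapReduce computing model in item 2 concrete, here is a minimal single-process sketch of its map, shuffle, and reduce phases applied to a word count. It is plain Python, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input; in Hadoop these lines would be blocks of a file on HDFS.
sample_lines = ["big data on HDFS", "big data with Spark"]
counts = word_count(sample_lines)
```

In a real cluster the map and reduce calls run in parallel on many machines and the shuffle moves data over the network; the sketch only shows the data flow between the three phases.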

2. Big data layered architecture

[Figure: big data layered architecture diagram]

[Figure: big data architecture diagram]

HDFS (Hadoop Distributed File System): a distributed file system that can store massive amounts of data and is designed to run on general-purpose (commodity) hardware. HDFS is highly fault-tolerant and suited to deployment on inexpensive machines. It provides high-throughput data access and is well suited to applications with large data sets [usually used for offline data storage].

HBase: A distributed, column-oriented open source database suited to storing unstructured data [supports both real-time and offline data].

Flume: A highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive amounts of log data. Flume supports customizable data senders in the logging system for collecting data; it can also do simple processing on the data and write it to various customizable data receivers.

Kafka: A high-throughput distributed publish-subscribe messaging system that can handle the stream data of all consumer actions on a website.
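
The publish-subscribe model described for Kafka can be illustrated with a small in-memory sketch. This is not the Kafka client API: the `TopicLog` class, its methods, and the group names are hypothetical. It only shows the idea that records are appended to a log and each consumer group tracks its own read offset, so several groups can read the same records independently.

```python
from collections import defaultdict
from typing import Any, Dict, List


class TopicLog:
    """Hypothetical in-memory stand-in for one topic partition (not the Kafka API)."""

    def __init__(self) -> None:
        self._records: List[Any] = []                     # append-only record log
        self._offsets: Dict[str, int] = defaultdict(int)  # next offset per consumer group

    def publish(self, record: Any) -> int:
        """Append a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group: str, max_records: int = 10) -> List[Any]:
        """Return the next unread records for this group and advance its offset."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


clicks = TopicLog()
for action in ["view:/home", "click:/buy", "view:/cart"]:
    clicks.publish(action)

analytics_batch = clicks.poll("analytics")         # reads all three records
audit_first = clicks.poll("audit", max_records=2)  # reads the first two records
audit_rest = clicks.poll("audit", max_records=2)   # reads the remaining record
```

Because reading does not delete records, a new group can start from offset 0 at any time and see the full history that the log still retains.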

ZooKeeper: An open source distributed application coordination service and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming services, distributed synchronization, and group services.

3. Lambda architecture

3.1 Lambda architecture is decomposed into three layers

[Figure: the three layers of the Lambda architecture]

  • Batch Layer: Has two core functions: storing the data set and generating the Batch View.
  • Speed Layer (acceleration layer): Stores the Real-time View and processes incoming data streams to update it.
  • Serving Layer (service layer): Responds to user query requests by merging the results from the Batch View and the Real-time View into the final data set.
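
The way the three layers cooperate can be sketched in a few lines of single-process Python. This is an illustrative model under simplifying assumptions, not a production design: the `LambdaPipeline` class, the `(page, views)` event shape, and the choice to clear the real-time view after each batch run are all hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Event = Tuple[str, int]  # hypothetical event shape: (page, number of views)


class LambdaPipeline:
    """Toy model of the batch, speed, and serving layers in one process."""

    def __init__(self) -> None:
        self.master_dataset: List[Event] = []  # batch layer: append-only data set
        self.batch_view: Counter = Counter()   # batch layer: precomputed view
        self.realtime_view: Counter = Counter()  # speed layer: incremental view

    def ingest(self, event: Event) -> None:
        """Store the event for later batch runs and update the real-time view now."""
        self.master_dataset.append(event)
        page, views = event
        self.realtime_view[page] += views

    def run_batch(self) -> None:
        """Recompute the batch view from the whole data set.

        Every event seen so far is then covered by the batch view,
        so the real-time view is cleared to avoid double counting.
        """
        view: Counter = Counter()
        for page, views in self.master_dataset:
            view[page] += views
        self.batch_view = view
        self.realtime_view = Counter()

    def query(self, page: str) -> int:
        """Serving layer: merge the batch view and the real-time view."""
        return self.batch_view[page] + self.realtime_view[page]


pipeline = LambdaPipeline()
pipeline.ingest(("home", 3))
pipeline.ingest(("cart", 1))
pipeline.run_batch()          # batch view now covers the first two events
pipeline.ingest(("home", 2))  # arrives after the batch run, only in the real-time view
home_total = pipeline.query("home")
```

Note that the aggregation logic appears twice, once in `ingest` and once in `run_batch`. Even in this toy version the two code paths must be kept consistent, which is the maintenance cost usually attributed to Lambda.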

3.2 Advantages and Disadvantages

Advantages:
 Good fault tolerance, high query flexibility, and easy scaling and extension.

Disadvantages:
 Covering every scenario carries a high coding overhead. Re-running offline training for a specific scenario brings little benefit. Redeployment and migration are expensive.

3.3 Actual cases

[Figure: Lambda architecture case]

4. Kappa architecture

4.1 Structure diagram

[Figure: Kappa architecture structure diagram]

  • Input data is processed directly by the real-time data processing engine in the real-time layer, which handles the continuous stream of source data;
  • The results are then further processed by the service backend in the service layer to serve upper-layer business queries;
  • All intermediate results must be stored. This includes historical data and result data, which are stored together in a unified storage medium.
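
The flow above can be sketched as a single processing function that runs over a retained event log. When the logic changes, the same log is replayed through the new logic instead of running a separate batch job. This is a minimal sketch under stated assumptions: `process_stream`, the two update functions, and the `(sensor, reading)` event shape are hypothetical, and a Python list stands in for the message middleware.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Event = Tuple[str, int]  # hypothetical event shape: (sensor id, reading)


def process_stream(events: Iterable[Event],
                   update: Callable[[int, int], int]) -> Dict[str, int]:
    """Single processing path: fold each event into a per-key result view."""
    view: Dict[str, int] = {}
    for key, value in events:
        view[key] = update(view.get(key, 0), value)
    return view


def running_sum(current: int, value: int) -> int:
    """Version 1 of the logic: total of all readings per sensor."""
    return current + value


def running_max(current: int, value: int) -> int:
    """Version 2 of the logic: largest reading per sensor (assumes readings >= 0)."""
    return max(current, value)


# The retained log holds the full history of source data.
event_log: List[Event] = [("s1", 4), ("s2", 7), ("s1", 9)]

view_v1 = process_stream(event_log, running_sum)

# The logic changes: replay the same log from the start with the new function.
view_v2 = process_stream(event_log, running_max)
```

There is only one code path to maintain, but every recomputation has to read the whole retained log again, so the cost of replay grows with the amount of history kept.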

4.2 Advantages and Disadvantages

Advantages:
 Unifies the real-time and offline code, which eases maintenance and keeps data definitions consistent; avoids the Lambda architecture's problem of merging real-time results with offline data.

Disadvantages:
 (1) The volume of data cached in the message middleware and the replay (backtracking) of historical data create performance bottlenecks.
 (2) In real-time processing, correlating a large number of different real-time streams depends heavily on the capabilities of the real-time computing system, and problems with the ordering of the data streams may cause data loss.
 (3) By abandoning the offline data processing module, Kappa also gives up the greater stability and reliability of offline computing.
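
One common way ordering problems turn into data loss is late arrival in windowed computations: an event that shows up after its time window has been closed is discarded. The sketch below models this with a simple watermark, defined here as the highest event time seen so far minus an allowed lateness. It is a simplified assumption for illustration, not the API or exact semantics of Flink or any other engine; `windowed_counts` and the sample events are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str]  # hypothetical event shape: (event time, key)


def windowed_counts(events: Iterable[Event], window_size: int,
                    allowed_lateness: int) -> Tuple[Dict[int, int], List[Event]]:
    """Count events per tumbling window, dropping events whose window is closed.

    Events are consumed in arrival order. A window is treated as closed once
    the watermark has reached the end of the window.
    """
    counts: Dict[int, int] = defaultdict(int)
    dropped: List[Event] = []
    watermark = float("-inf")
    for event_time, key in events:
        watermark = max(watermark, event_time - allowed_lateness)
        window_start = (event_time // window_size) * window_size
        window_end = window_start + window_size
        if window_end <= watermark:
            dropped.append((event_time, key))  # too late: this result is lost
        else:
            counts[window_start] += 1
    return dict(counts), dropped


# Arrival order differs from event-time order: times 3 and 8 arrive late.
arrivals: List[Event] = [(1, "a"), (12, "b"), (3, "c"), (25, "d"), (8, "e")]

strict_counts, strict_dropped = windowed_counts(arrivals, window_size=10, allowed_lateness=0)
lenient_counts, lenient_dropped = windowed_counts(arrivals, window_size=10, allowed_lateness=5)
```

With no allowed lateness both late events are dropped; with a lateness of 5 the event at time 3 is still counted but the one at time 8 is not. Tolerating more lateness reduces loss at the price of keeping windows open longer.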

4.3 Actual cases

[Figure: Kappa architecture case]

  • The real-time log analysis platform is built on the Kappa architecture;
  • Flink serves as the unified data processing engine and processes all data in real time;
  • The results are stored in ElasticSearch and OpenTSDB.

5. Comparison between Lambda architecture and Kappa architecture

| Comparison item | Lambda architecture | Kappa architecture |
| --- | --- | --- |
| Complexity | Two systems (engines) must be maintained; high complexity | Only one system (engine) must be maintained; low complexity |
| Development and maintenance costs | High | Low |
| Computational overhead | Batch processing and real-time computation both run all the time; overhead is high | Full recomputation is run only when necessary; overhead is relatively low |
| Real-time capability | Meets real-time requirements | Meets real-time requirements |
| Historical data processing | Full batch processing with high throughput; strong historical data processing | Full stream processing with relatively low throughput; relatively weak historical data processing |
| Usage scenarios | Directly supports batch processing, so it is better suited to analyzing and querying historical data; when analysis results are wanted as soon as possible, batch processing meets that need more directly and efficiently | Not a replacement for Lambda but a simplification of it. Kappa gives up batch-processing support and is better suited to analysis needs where the business data itself is written incrementally |


Origin blog.csdn.net/qq_41273999/article/details/134107430