Big data architecture
Other related recommendations:
Microservice architecture of system architecture
Microkernel architecture of system architecture design
Hongmeng Operating system architecture
Column:System Architect
1. Big data technology ecosystem
- Storage: mainly HDFS and Kafka. HDFS is the distributed storage framework provided by Hadoop and can store massive amounts of data.
- Computing: mainly MapReduce, Spark, and Flink. MapReduce is the distributed computing framework provided by Hadoop and can be used to aggregate and analyze the massive data stored on HDFS.
- Online query (OLAP): Kylin, Impala, etc.
- Random query (NoSQL): HBase, Cassandra, etc.
- Mining, machine learning, and deep learning: TensorFlow, Caffe, Mahout.
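To make the MapReduce model mentioned above more concrete, here is a minimal single-process sketch of a word count in plain Python. It does not use Hadoop at all; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and a real job would run the map and reduce steps in parallel across many machines reading from HDFS.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all emitted values by key, as a framework would between the two phases."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data on HDFS", "big data with Spark"]
    print(word_count(sample))
```

The point of the split is that `map_phase` and `reduce_phase` only ever see one line or one key at a time, which is what lets a framework distribute them.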
2. Big data layered architecture
Big data layered architecture diagram:
Big data architecture diagram:
HDFS (Hadoop Distributed File System): a distributed file system that stores massive amounts of data and is designed to run on general-purpose hardware. HDFS is highly fault-tolerant, suited to deployment on inexpensive machines, and provides high-throughput data access, which makes it a good fit for applications with large data sets. It is usually used for offline data storage.
HBase: a distributed, column-oriented open source database suited to unstructured data storage. It supports both real-time and offline data.
Flume: a highly available, highly reliable distributed system for collecting, aggregating, and transporting massive volumes of log data. Flume supports custom data senders in the logging system for collecting data. It can also apply simple processing to the data and write it to a variety of customizable data receivers.
Kafka: a high-throughput distributed publish-subscribe messaging system that can handle the stream data generated by all consumer actions on a website.
ZooKeeper: an open source coordination service for distributed applications and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming, distributed synchronization, and group services.
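The publish-subscribe model described for Kafka can be pictured as an append-only log that several consumer groups read independently, each at its own position. The sketch below is a toy in-memory model of that idea, not the Kafka client API; the class `TopicLog` and its methods are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class TopicLog:
    """Toy in-memory topic: an append-only list plus one read offset per consumer group."""

    def __init__(self) -> None:
        self._messages: list[str] = []
        self._offsets: dict[str, int] = defaultdict(int)

    def publish(self, message: str) -> int:
        """Append a message and return its offset in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, group: str, max_messages: int = 10) -> list[str]:
        """Return unread messages for a group and advance that group's offset."""
        start = self._offsets[group]
        batch = self._messages[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group: str, offset: int) -> None:
        """Move a group's read position, for example back to 0 to replay the log."""
        self._offsets[group] = max(0, min(offset, len(self._messages)))


if __name__ == "__main__":
    log = TopicLog()
    for event in ["click:home", "click:cart", "view:item42"]:
        log.publish(event)
    print(log.poll("analytics"))          # the analytics group reads everything so far
    print(log.poll("audit", max_messages=2))  # the audit group reads at its own pace
```

Because reading never removes messages, one group falling behind or rewinding does not affect any other group.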
3. Lambda architecture
3.1 Lambda architecture is decomposed into three layers
- Batch Layer: has two core functions, storing the data set and generating the Batch View.
- Speed Layer: stores the real-time views and processes incoming data streams to update them.
- Serving Layer: responds to user queries by merging the results from the Batch View and the Real-time View into the final data set.
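The interaction between the three layers can be sketched in a few lines of Python. This is a minimal single-process illustration under stated assumptions, not a production design: the event format (a `(user_id, page)` tuple), the page-count metric, and the names `build_batch_view`, `SpeedLayer`, and `serve_query` are all hypothetical.

```python
from collections import Counter

# Hypothetical event format: (user_id, page) click events.
Event = tuple[str, str]


def build_batch_view(master_dataset: list[Event]) -> Counter:
    """Batch layer: recompute page counts from the full stored data set."""
    return Counter(page for _, page in master_dataset)


class SpeedLayer:
    """Speed layer: incrementally count events not yet covered by a batch view."""

    def __init__(self) -> None:
        self.realtime_view: Counter = Counter()

    def on_event(self, event: Event) -> None:
        """Update the real-time view for one incoming event."""
        _, page = event
        self.realtime_view[page] += 1

    def reset(self) -> None:
        """Discard the real-time view once a new batch view covers these events."""
        self.realtime_view.clear()


def serve_query(page: str, batch_view: Counter, realtime_view: Counter) -> int:
    """Serving layer: merge both views to answer a query."""
    return batch_view[page] + realtime_view[page]


if __name__ == "__main__":
    stored = [("u1", "/home"), ("u2", "/home"), ("u1", "/cart")]
    batch_view = build_batch_view(stored)
    speed = SpeedLayer()
    speed.on_event(("u3", "/home"))  # arrived after the last batch run
    print(serve_query("/home", batch_view, speed.realtime_view))
```

Note that the counting logic appears twice, once in the batch function and once in the speed layer. That duplication is a small-scale version of the maintenance cost that the two-engine design carries.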
3.2 Advantages and Disadvantages
Advantages:
Good fault tolerance, high query flexibility, and easy scaling and extension.
Disadvantages:
Covering every scenario adds coding overhead; retraining offline for specific scenarios brings little benefit; redeployment and migration are expensive.
3.3 Actual cases
4. Kappa architecture
4.1 Structure diagram
- Input data is processed directly by the real-time processing engine in the real-time layer, which handles the continuous source data.
- The results are then further processed by the service backend in the service layer to serve upper-layer business queries.
- All intermediate results must be stored. This includes historical data and result data, which are kept together in a unified storage medium.
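A rough way to picture this flow: there is a single processing path, and historical results are rebuilt by feeding the stored log through that same path again. The Python sketch below is an assumption-laden illustration, not a real streaming engine; the record format (`(sensor_id, reading)`) and the names `run_stream_job`, `max_per_sensor`, and `count_per_sensor` are hypothetical.

```python
from typing import Callable, Iterable

# Hypothetical record format: (sensor_id, reading).
Record = tuple[str, float]


def run_stream_job(
    log: Iterable[Record],
    update: Callable[[dict, Record], None],
) -> dict:
    """Feed every record in the log, in order, through one processing function."""
    result_store: dict = {}
    for record in log:
        update(result_store, record)
    return result_store


def max_per_sensor(store: dict, record: Record) -> None:
    """Version 1 of the job: keep the maximum reading per sensor."""
    sensor, reading = record
    store[sensor] = max(store.get(sensor, reading), reading)


def count_per_sensor(store: dict, record: Record) -> None:
    """Version 2 of the job: count readings per sensor instead."""
    sensor, _ = record
    store[sensor] = store.get(sensor, 0) + 1


if __name__ == "__main__":
    stored_log = [("a", 1.0), ("b", 5.0), ("a", 3.0)]
    print(run_stream_job(stored_log, max_per_sensor))
    # Changing the logic means replaying the same log through the new function.
    print(run_stream_job(stored_log, count_per_sensor))
```

There is no separate batch code to keep in sync here, but every change to the logic requires replaying the whole log, which is where the storage and replay cost of this design comes from.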
4.2 Advantages and Disadvantages
Advantages:
Unifies the real-time and offline code, which eases maintenance and keeps data definitions consistent; avoids the Lambda architecture's problem of merging real-time results with offline data.
Disadvantages:
(1) The volume of data buffered in the message middleware and the replay (backtracking) of historical data are performance bottlenecks.
(2) When many different real-time streams must be joined, processing depends heavily on the capabilities of the real-time computing system, and problems with the ordering of the data streams can lead to data loss.
(3) By dropping the offline processing module, Kappa also gives up the greater stability and reliability of offline computation.
4.3 Actual cases
- A real-time log analysis platform built on the Kappa architecture.
- Flink serves as the unified data processing engine and processes all data in real time.
- The results are stored in Elasticsearch and OpenTSDB.
5. Comparison between Lambda architecture and Kappa architecture
Comparison item | Lambda architecture | Kappa architecture |
---|---|---|
Complexity | Two systems (engines) must be maintained; high complexity | Only one system (engine) must be maintained; low complexity |
Development and maintenance costs | High | Low |
Computational overhead | Batch and real-time computation both run continuously; high overhead | Full recomputation runs only when needed; relatively low overhead |
Real-time capability | Meets real-time requirements | Meets real-time requirements |
Historical data processing | Full batch processing with high throughput; strong historical data processing | Full stream processing with relatively low throughput; relatively weak historical data processing |
Usage scenarios | Supports batch processing directly, so it suits historical data analysis and query scenarios. When analysis results are needed as quickly as possible, batch processing meets that need more directly and efficiently. | Not a replacement for Lambda but a simplification of it. Kappa gives up batch processing support and is better suited to analyzing the business's own data in incremental-write scenarios. |