Big Data Series: Framework Introduction

Introduction

Hadoop

  • Hadoop combines a distributed file system (HDFS) with an offline processing framework (the MapReduce execution framework). It is mainly used to store massive data files and to run non-real-time computations over massive data.
  • Its upper-level API is not very friendly and the MapReduce framework is relatively slow, so today it is used mostly as a file system.
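
The map, shuffle, and reduce phases of the MapReduce model can be sketched in plain Python. This is only an illustration of the programming model on a single machine, not Hadoop's API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in order over an in-memory list of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

In a real cluster each phase runs on many machines and intermediate results are written to disk, which is one reason the model is slow compared with in-memory engines.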

Spark

  • Spark is an execution engine that does not store data itself; it needs an external storage system, and in many deployments the data is kept in Hadoop.
  • During computation Spark keeps data in memory as much as possible. It also offers friendly upper-level interfaces, including SQL statements (Spark SQL), which makes data processing convenient. It is many times faster than the disk-based MapReduce framework, and it is now the usual choice for offline data processing.

Storm

  • Storm is a real-time data processing framework that provides only the most basic building blocks and interfaces for moving data through a stream. Users have to write the processing programs and processing logic themselves.
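
What "writing the processing logic yourself" looks like can be sketched with Python generators. Storm calls its stream sources spouts and its processing steps bolts; the code below only mimics that wiring in one process and is not Storm's API. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Source of the stream: emits one tuple (here a sentence) at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream: Iterable[str]) -> Iterator[str]:
    """User-written logic: split each incoming sentence into words."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream: Iterable[str]) -> Iterator[Dict[str, int]]:
    """User-written logic: update a running count and emit it after every word."""
    counts: Counter = Counter()
    for word in stream:
        counts[word] += 1
        yield dict(counts)


def run_topology(sentences: Iterable[str]) -> Dict[str, int]:
    """Wire spout -> split -> count and return the last emitted counts."""
    result: Dict[str, int] = {}
    for result in count_bolt(split_bolt(sentence_spout(sentences))):
        pass  # each iteration is one record flowing through the whole chain
    return result
```

Every record travels through the whole chain as soon as it is emitted, which is the one-record-at-a-time behaviour the framework is known for.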

Flink

  • Flink is a real-time data processing system with a complete ecosystem. Its upper layer provides many data processing operators (interface functions), which makes it more user-friendly and convenient. Many companies now use it for real-time data processing.

Hadoop, Spark, Flink

Other URL

Hadoop Comparison: Comparison of the three frameworks of Hadoop, Spark, and Flink

1. Data processing comparison

Hadoop: Designed for batch processing. A large data set is fed in all at once, processed, and the results are produced.
Spark: Defined as a batch processing system, but it also supports stream processing.
Flink: Supports both batch processing and stream processing on the same runtime.

2. Comparison of stream engines

Hadoop: Hadoop's default MapReduce is only for batch processing.
Spark: Spark Streaming processes data streams in micro-batches, which gives quasi-real-time stream processing on top of a batch engine.
Flink: Flink is a true real-time streaming engine that uses streams for all workloads, including streaming, SQL, micro-batch processing, and batch processing.

3. Data flow comparison

Hadoop: The MapReduce data flow contains no loops. It is a chain of stages in which each stage consumes the output of the previous stage and produces input for the next.
Spark: Although machine learning algorithms are cyclic data flows, Spark represents them as a directed acyclic graph (DAG).
Flink: Flink supports controlled cyclic dependency graphs at runtime, which lets it support machine learning algorithms very efficiently.
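
The difference between a loop and an acyclic plan can be illustrated with Python's standard `graphlib` module (Python 3.9+). The sketch below unrolls a fixed number of iterations into separate stages so that the plan stays acyclic, which is the general idea behind expressing an iterative algorithm as a DAG. The stage names and helper functions are hypothetical and do not reflect how any of the three engines builds its plans internally.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def unrolled_plan(iterations: int) -> Dict[str, Set[str]]:
    """Build a stage graph where each loop iteration becomes its own stage.

    Keys are stages; values are the stages they depend on.
    """
    plan: Dict[str, Set[str]] = {"load": set()}
    previous = "load"
    for i in range(1, iterations + 1):
        stage = f"iter_{i}"
        plan[stage] = {previous}
        previous = stage
    plan["save"] = {previous}
    return plan


def execution_order(plan: Dict[str, Set[str]]) -> List[str]:
    """Return one valid stage order; raises CycleError if the plan has a loop."""
    return list(TopologicalSorter(plan).static_order())


def is_acyclic(plan: Dict[str, Set[str]]) -> bool:
    """True if the plan can be scheduled as a DAG."""
    try:
        execution_order(plan)
        return True
    except CycleError:
        return False
```

A plan in which a stage feeds back into an earlier one, such as `{"a": {"b"}, "b": {"a"}}`, is rejected by `is_acyclic`; an engine with native iteration support can run such a feedback edge directly instead of unrolling it.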

4. Comparison of calculation models

Hadoop: MapReduce uses a batch-oriented model that processes static data in batches.
Spark: Spark uses micro-batch processing, which is essentially a "collect first, then process" computation model.
Flink: Flink uses a continuous streaming model that processes each record as it arrives, without first waiting to collect data into batches.
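
The latency effect of "collect first, then process" can be shown with a small calculation. The sketch below is a simplified model, not a benchmark: it assumes a micro-batch is processed exactly at the end of its interval and ignores processing time there, while the per-record model charges only a fixed processing cost. Times are integers in milliseconds, and both function names are hypothetical.

```python
from typing import List


def micro_batch_latencies(arrivals_ms: List[int], interval_ms: int) -> List[int]:
    """Each event waits until the end of the batch interval it arrived in."""
    latencies = []
    for t in arrivals_ms:
        batch_end = (t // interval_ms + 1) * interval_ms
        latencies.append(batch_end - t)
    return latencies


def per_record_latencies(arrivals_ms: List[int], cost_ms: int) -> List[int]:
    """Each event is handled on arrival, so latency is only the processing cost."""
    return [cost_ms for _ in arrivals_ms]
```

With a 1000 ms interval, events arriving at 100, 400, 900, and 1200 ms wait 900, 600, 100, and 800 ms in this model, whereas the per-record model reports the same small cost for every event.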

5. Performance comparison

Hadoop: Supports only batch processing, not streaming data. Its performance is lower than that of Spark and Flink.
Spark: Supports micro-batch processing, but its stream processing is less efficient than Apache Flink's.
Flink: Performance is very strong. Flink uses native closed-loop iteration operators, which helps especially with machine learning and graph processing.

6. Memory management comparison

Hadoop: Provides configurable memory management, which can be performed dynamically or statically.
Spark: Provides configurable memory management. Since Spark 1.6, it has been moving towards automatic memory management.
Flink: Has its own memory management system, providing automatic memory management.

Spark VS Storm

Other URL

Comparison of Spark and Storm - whispering singing - cnblogs (博客园)

Point of contrast | Spark Streaming | Storm
Real-time calculation model | Quasi real-time: data from a time interval is collected into an RDD and then processed | Pure real-time: each record is processed as soon as it arrives
Real-time calculation latency | Seconds | Milliseconds
Throughput | High | Low
Transaction mechanism | Supported, but incomplete | Supported, complete
Robustness / fault tolerance | Checkpoint, WAL; average | ZooKeeper, Acker; very strong
Dynamic adjustment of parallelism | Not supported | Supported

Storm usage scenarios

1. Storm is recommended for scenarios that need pure real-time processing and cannot tolerate delays of more than 1 second, such as real-time financial systems that require pure real-time transactions and analysis.
2. If the real-time computation needs a reliable transaction mechanism and reliability guarantees, that is, the data must be processed exactly, with no record counted twice or dropped, Storm is also worth considering.
3. If the parallelism of the real-time program must be adjusted dynamically between peak and off-peak periods to make the most of cluster resources (typically in small companies where cluster resources are tight), Storm is also worth considering.
4. If the big data application is pure real-time computing and does not need interactive SQL queries, complex transformation operators, and the like in the middle of the pipeline, Storm is the better choice.

Spark Streaming usage scenarios

1. If none of the first three Storm points above applies, that is, the scenario does not require pure real-time processing, a strong and reliable transaction mechanism, or dynamic adjustment of parallelism, then consider Spark Streaming.
2. The most important factor in choosing Spark Streaming is the project as a whole. If a project includes offline batch processing, interactive queries, and other functions in addition to real-time computing, and the real-time computation itself may involve high-latency batch processing, interactive queries, and similar functions, then the Spark ecosystem should be preferred: Spark Core for offline batch processing, Spark SQL for interactive queries, and Spark Streaming for real-time computing. The three integrate seamlessly and give the system very high scalability.
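
The selection rules in the two lists above can be condensed into a small helper. This is only one possible reading of those rules: the field names are hypothetical, and the precedence (project-wide need for batch or SQL wins over the Storm criteria) is an assumption, since the text does not say what to do when both apply.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical checklist derived from the usage scenarios above."""
    needs_sub_second_latency: bool = False
    needs_strict_transactions: bool = False
    needs_dynamic_parallelism: bool = False
    needs_batch_or_sql: bool = False


def recommend(req: Requirements) -> str:
    """Return the framework the scenarios above would point to."""
    # Assumption: a project-wide need for batch or interactive SQL
    # favours the Spark ecosystem even when a Storm criterion also holds.
    if req.needs_batch_or_sql:
        return "Spark Streaming"
    if (
        req.needs_sub_second_latency
        or req.needs_strict_transactions
        or req.needs_dynamic_parallelism
    ):
        return "Storm"
    return "Spark Streaming"
```

For example, `recommend(Requirements(needs_sub_second_latency=True))` returns `"Storm"`, while a project with no special real-time requirements falls through to `"Spark Streaming"`.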

Analysis of the pros and cons of Spark Streaming and Storm

        Spark Streaming is not categorically better than Storm. Both frameworks are excellent in the real-time computing field, but they are good at different sub-scenarios.

        Spark Streaming is better than Storm only in throughput, and throughput is exactly what those who praise Spark Streaming and disparage Storm have always emphasized. But does every real-time computing scenario care that much about throughput? Not really. So it is unreliable to claim, on throughput alone, that Spark Streaming is stronger than Storm.

        Storm is much better than Spark Streaming in real-time latency: the former is pure real-time and the latter is quasi-real-time. Storm's transaction mechanism, robustness/fault tolerance, and dynamic adjustment of parallelism are also all better than Spark Streaming's.

        Spark Streaming has one advantage that Storm cannot match: it is part of the Spark technology stack, so it integrates seamlessly with Spark Core and Spark SQL. This means that intermediate data from real-time processing can immediately be fed, within the same program, into delayed batch processing, interactive queries, and other operations. This greatly strengthens Spark Streaming's advantages and capabilities.

Spark VS Flink

Other URL

How does Spark compare with Flink? What is the current situation in China? - Zhihu

Item | Spark | Flink
Maturity | High (earlier start) | Low
Focus | Batch processing | Stream processing
Real-time capability | Near real-time | True real-time
Popularity | Used by more companies | Used by fewer companies (later start)
Performance | Weaker than Flink | Stronger than Spark
Development trend | Not as good as Flink | Better than Spark
Strengths | Batch processing | Time mechanism, state management, unified stream and batch processing
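
The "time mechanism" and "state management" strengths listed for Flink can be made concrete with a sketch of an event-time tumbling window: each record is assigned to a window by its own timestamp, and a per-key, per-window sum is kept as state. This is a plain-Python illustration, not Flink's API; the `Event` layout and the function name are hypothetical, and it leaves out watermarks and late-data handling entirely.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (key, event_time_ms, value); the layout is an assumption for this sketch.
Event = Tuple[str, int, int]


def tumbling_window_sums(
    events: Iterable[Event], window_ms: int
) -> Dict[Tuple[str, int], int]:
    """Sum values per (key, window start) using each event's own timestamp."""
    state: Dict[Tuple[str, int], int] = defaultdict(int)
    for key, event_time_ms, value in events:
        # The window is chosen by event time, so arrival order does not matter.
        window_start = (event_time_ms // window_ms) * window_ms
        state[(key, window_start)] += value
    return dict(state)
```

Because the window is derived from the event's timestamp rather than its arrival time, an out-of-order record still lands in the window it belongs to.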

 


Origin blog.csdn.net/feiying0canglang/article/details/113957602