Introduction
Hadoop
- Hadoop combines a distributed file system (HDFS) with an offline processing framework (the MapReduce execution framework). It is mainly used to store massive data files and to run non-real-time computations over massive data.
- Its upper-level API is not very friendly and the MapReduce framework is relatively slow, so today it is used mostly as a file system.
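To make the MapReduce model mentioned above concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to a word count. It is only an illustration in plain Python, not Hadoop's API; the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `run_job` are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["spark reads hadoop files", "hadoop stores files"]))
```

In a real cluster each phase runs on many machines and the intermediate pairs are written to disk between phases, which is one reason the framework is described as relatively slow.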
Spark
- Spark is an execution engine that does not store data itself. It relies on an external file system, and in many deployments the data is stored on Hadoop.
- Spark keeps data in memory as much as possible during computation. It also offers a friendly interface to upper-level users, including SQL statements (Spark SQL), which makes data processing convenient. It is many times faster than the disk-based MapReduce framework, and it is now the usual choice for offline data processing.
Storm
- Storm is a real-time data processing framework that provides only the most basic stream transmission elements and stream interfaces. Users have to write their own processing procedures and processing logic.
Flink
- Flink is a real-time data processing system with a complete ecosystem. Its upper layer provides many data processing operators (interface functions), which makes it more user-friendly and convenient. Many companies now use it for real-time data processing.
Hadoop, Spark, Flink
Reference: Comparison of the three frameworks Hadoop, Spark, and Flink
1. Data processing comparison
Hadoop: Designed for batch processing. A large data set is fed in at once, processed, and turned into results.
Spark: Defined as a batch processing system, but it also supports stream processing.
Flink: Supports both batch processing and stream processing, and the two can run separately or together.
2. Comparison of stream engines
Hadoop: The default MapReduce engine handles batch processing only.
Spark: Spark Streaming splits the data stream into micro-batches, which gives quasi-real-time processing that covers both batch and stream workloads.
Flink: Flink is a true real-time streaming engine that treats every workload as a stream, including streaming, SQL, micro-batch processing, and batch processing.
3. Data flow comparison
Hadoop: A MapReduce data flow contains no loops. It is a chain of stages in which each stage consumes the output of the previous stage and produces the input for the next.
Spark: Although machine learning algorithms are cyclic data flows, Spark represents them as a directed acyclic graph (DAG).
Flink: Flink supports controlled cyclic dependency graphs at runtime, which makes it an efficient fit for machine learning algorithms.
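The difference between an acyclic and a cyclic data flow can be sketched with a toy stage graph. The example below is plain Python using the standard `graphlib` module (Python 3.9+), not Spark's or Flink's scheduler. The stage names and the helpers `unrolled_iteration_dag` and `execution_order` are hypothetical. It shows how a fixed number of iterations can be unrolled into a chain of stages that stays acyclic, while a graph with a feedback edge cannot be ordered by a plain DAG scheduler.

```python
from graphlib import CycleError, TopologicalSorter


def unrolled_iteration_dag(iterations):
    """Build a stage graph (stage -> set of predecessor stages) in which
    every iteration is its own stage that depends on the previous one."""
    graph = {"load": set()}
    previous = "load"
    for i in range(1, iterations + 1):
        stage = f"iter_{i}"
        graph[stage] = {previous}
        previous = stage
    graph["result"] = {previous}
    return graph


def execution_order(graph):
    """Return a valid stage order, or None if the graph contains a cycle."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None


if __name__ == "__main__":
    # Three iterations unrolled into a chain: still a DAG.
    print(execution_order(unrolled_iteration_dag(3)))

    # A feedback edge (step -> feedback -> step) makes the graph cyclic.
    cyclic = {"load": set(), "step": {"load", "feedback"}, "feedback": {"step"}}
    print(execution_order(cyclic))
```

An engine that supports cyclic graphs natively needs its own termination control for the feedback edge, which is the "controlled" part of the Flink description above.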
4. Comparison of calculation models
Hadoop: MapReduce uses a batch-oriented model that processes static data in batches.
Spark: Spark uses micro-batch processing. Micro-batch processing is essentially a "collect first and then process" calculation model.
Flink: Flink uses a continuous streaming model that processes each record as it arrives, without first collecting data into batches.
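The two streaming models can be contrasted with a small simulation. This is a sketch in plain Python under simplifying assumptions, not either engine's API: events are `(timestamp_ms, value)` pairs, a micro-batch is processed exactly when its interval closes, and processing itself takes no time. The function names `micro_batch` and `per_record` are made up for the example.

```python
def micro_batch(events, batch_interval_ms):
    """Collect events into fixed intervals ("collect first, then process").

    Returns one (close_time, values, max_wait_ms) tuple per batch, where
    max_wait_ms is the longest time any event waited for its batch to close.
    """
    batches = {}
    for timestamp, value in events:
        # The batch closes at the end of the interval the event arrived in.
        close_time = (timestamp // batch_interval_ms + 1) * batch_interval_ms
        batches.setdefault(close_time, []).append((timestamp, value))
    return [
        (close, [v for _, v in items], max(close - t for t, _ in items))
        for close, items in sorted(batches.items())
    ]


def per_record(events):
    """Handle each event on arrival, so the collection wait is always 0."""
    return [(timestamp, value, 0) for timestamp, value in events]


if __name__ == "__main__":
    events = [(200, "a"), (700, "b"), (1100, "c")]
    print(micro_batch(events, 1000))
    print(per_record(events))
```

With a 1000 ms interval, the event that arrives at 200 ms waits 800 ms before it is processed, which is where the second-level latency of the micro-batch model comes from.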
5. Performance comparison
Hadoop: Supports batch processing only and cannot process streaming data, so it performs worse than Spark and Flink.
Spark: Supports micro-batch processing, but its stream processing is less efficient than Apache Flink's.
Flink: Performance is very strong. Flink uses native closed-loop iteration operators, which helps especially with machine learning and graph processing.
6. Memory management comparison
Hadoop: Provides configurable memory management, which can be performed dynamically or statically.
Spark: Provides configurable memory management. Since Spark 1.6, it has been moving towards automatic memory management.
Flink: Has its own memory management system, providing automatic memory management.
Spark VS Storm
Point of contrast | Spark Streaming | Storm |
Real-time calculation model | Quasi-real-time: data collected over a period of time is treated as an RDD and then processed | Pure real-time: each record is processed as soon as it arrives |
Real-time calculation latency | Seconds | Milliseconds |
Throughput | High | Low |
Transaction mechanism | Supported, but not complete | Supported and complete |
Robustness / fault tolerance | Checkpoint, WAL; average | ZooKeeper, Acker; very strong |
Dynamic adjustment of parallelism | Not supported | Supported |
Storm usage scenarios
1. Storm is recommended for scenarios that require pure real-time processing and cannot tolerate delays of more than 1 second, such as real-time financial systems that need pure real-time transactions and analysis.
2. If the real-time computation needs a reliable transaction mechanism and reliability mechanism, meaning every record must be processed exactly, with none lost and none duplicated, Storm is also worth considering.
3. If the parallelism of the real-time program must be adjusted dynamically between peak and off-peak periods to make the most of cluster resources (typically in small companies where cluster resources are tight), Storm is also worth considering.
4. If the big data application is pure real-time computing and does not need interactive SQL queries, complex transformation operators, and the like in the middle, Storm is the better choice.
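The reliability mechanism behind the Acker can be illustrated with a short sketch. Storm's acker tracks a whole tuple tree with a single value: the XOR of every tuple id that has been created or acknowledged. Each id is XORed in once on creation and once on acknowledgement, so the value returns to zero only when every tuple in the tree has been acked. The class below is a simplified plain-Python model of that idea, not Storm's actual code. The class name, method names, and small ids are chosen for illustration (Storm uses random 64-bit ids).

```python
class Acker:
    """Tracks one tuple tree with a single integer, in the style of Storm's acker."""

    def __init__(self):
        self.ack_val = 0

    def emitted(self, tuple_id):
        # XOR the id in when a tuple is created and anchored to the tree.
        self.ack_val ^= tuple_id

    def acked(self, tuple_id):
        # XOR the same id in again on acknowledgement; the two cancel out.
        self.ack_val ^= tuple_id

    def fully_processed(self):
        # Zero means every emitted tuple has also been acked.
        return self.ack_val == 0


if __name__ == "__main__":
    acker = Acker()
    root, child_a, child_b = 0b001, 0b010, 0b100

    acker.emitted(root)      # the spout emits the root tuple
    acker.emitted(child_a)   # a bolt emits two children anchored to the root
    acker.emitted(child_b)
    acker.acked(root)        # then acks the root
    print(acker.fully_processed())  # children are still pending

    acker.acked(child_a)
    acker.acked(child_b)
    print(acker.fully_processed())  # the whole tree is done
```

The appeal of this design is constant memory per tree: the acker never stores the tree itself, only one integer, no matter how many tuples it contains.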
Spark Streaming usage scenarios
1. If none of the Storm conditions above apply, that is, the scenario does not require pure real-time processing, a strong and reliable transaction mechanism, or dynamic adjustment of parallelism, then consider Spark Streaming.
2. The most important factor in choosing Spark Streaming is the overall view of the project. If, besides real-time calculation, the project also includes offline batch processing, interactive queries, and other business functions, and the real-time calculation itself may involve high-latency batch processing or interactive queries, then the Spark ecosystem should be preferred: Spark Core for offline batch processing, Spark SQL for interactive queries, and Spark Streaming for real-time computing. The three integrate seamlessly and give the system very high extensibility.
Analysis of the pros and cons of Spark Streaming and Storm
Spark Streaming is not simply better than Storm. Both frameworks are excellent in the real-time computing field, but each is good at different sub-scenarios.
Spark Streaming beats Storm only in throughput, and throughput is exactly what those who promote Spark Streaming and belittle Storm like to emphasize. But does every real-time computing scenario care that much about throughput? Not really. Judging Spark Streaming to be stronger than Storm on throughput alone is therefore unreliable.
Storm is much better than Spark Streaming in real-time latency: the former is pure real-time, the latter quasi-real-time. Storm's transaction mechanism, robustness/fault tolerance, and dynamic adjustment of parallelism are also all better than Spark Streaming's.
Spark Streaming does have one advantage that Storm cannot match: it sits inside the Spark technology stack, so it integrates seamlessly with Spark Core and Spark SQL. This means the intermediate data produced during real-time processing can be handed directly, within the same program, to delayed batch processing, interactive queries, and other operations. This greatly strengthens Spark Streaming's advantages and capabilities.
Spark VS Flink
Reference: How does Spark compare with Flink? What is the current situation in China? (Zhihu)
Item | Spark | Flink |
Maturity | High (earlier start) | Low |
Focus | Batch processing | Stream processing |
Real-time capability | Near real-time | True real-time |
Popularity | Used by more companies | Used by fewer companies (later start) |
Performance | Weaker than Flink | Stronger than Spark |
Development trend | Weaker momentum than Flink | Stronger momentum than Spark |
Strengths | Batch processing | Time mechanism, state management, unified stream and batch processing |