A comparison of mainstream computing engine frameworks: Flink, Storm, and Spark

Hadoop is a distributed system infrastructure developed by the Apache Foundation.
It mainly includes HDFS, MapReduce, the data warehouse tool Hive, and the distributed database HBase.

The core design of the Hadoop framework is HDFS plus MapReduce: HDFS provides storage for massive amounts of data, and MapReduce provides computation over that data.

Hive is a data warehouse tool built on Hadoop. It can map structured data files to database tables and provides full SQL query functionality; the SQL statements are converted into MapReduce tasks and run. Its advantage is a low learning cost: simple MapReduce statistics can be produced quickly through SQL-like statements, with no need to develop a dedicated MapReduce application, which makes it very suitable for statistical analysis in a data warehouse.
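To illustrate the idea (this is not Hive's actual execution engine), here is a minimal Python sketch of how a hypothetical SQL-like `GROUP BY` query could be expressed as map, shuffle, and reduce phases; the query and data are made up for the example:

```python
from collections import defaultdict

# Hypothetical Hive-style query:
#   SELECT dept, COUNT(*) FROM employees GROUP BY dept;
rows = [("alice", "eng"), ("bob", "eng"), ("carol", "sales")]

# Map phase: emit (key, 1) pairs, as a generated mapper would.
mapped = [(dept, 1) for _name, dept in rows]

# Shuffle: group values by key (the framework does this between phases).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: sum the counts for each key.
result = {key: sum(values) for key, values in groups.items()}
print(result)  # {'eng': 2, 'sales': 1}
```

One SQL-like line replaces all of the code above, which is exactly the productivity win Hive offers.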

The development relationship of the Apache computing engines.
Three landmark papers laid the foundation of big data; Hadoop's MapReduce was inspired by one of them (Google's MapReduce paper). At the same time, other technologies emerged as the field developed.
The full name of HBase is Hadoop Database: HBase is Hadoop's database, a distributed storage system. HBase uses Hadoop's HDFS as its file storage system and Hadoop's MapReduce to process the massive amounts of data it stores, with ZooKeeper as its coordination tool.

1. The first-generation computing engine: MapReduce

As the first batch-processing computing engine, MapReduce is the pioneer of computing engines. It has machine-learning support, but that library is no longer actively updated. Writing MapReduce programs is very time-consuming: development efficiency is low and the development time cost is too high, so few companies still write raw MapReduce jobs.
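To make the map/shuffle/reduce structure concrete, here is the classic word count simulated in plain Python; a real MapReduce job forces you to write each of these phases as separate classes plus job configuration, which is where the development cost comes from:

```python
from collections import defaultdict

# Input lines, standing in for text blocks read from HDFS.
lines = ["big data is big", "data is data"]

# Map: emit (word, 1) for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group pairs by key (the framework does this between phases).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```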

2. The second-generation computing engines: Pig/Hive

The second-generation engines, Pig and Hive, are layered on top of Hadoop (if you don't know Hadoop yet, it's recommended to learn it first...). Their storage is based on HDFS and their computation on MapReduce. When Hive or Pig processes a task, it first parses its own code into MapReduce jobs, which greatly reduces the cost of writing MapReduce by hand. Pig has its own scripting language and is more flexible than Hive. Hive uses a SQL-like syntax; although it is less flexible than Pig, in a world where programmers already know SQL, most people prefer Hive. Pig and Hive only support batch processing, and support machine learning (via Hivemall).
3. The third-generation computing engines: Spark/Storm

As the times developed, enterprises' demand for real-time data processing kept growing, and so Storm and Spark appeared.

The two have different computation models. Storm is true stream processing: low latency (ms-level), high throughput, with each piece of data triggering a computation. Spark converts batch processing into stream processing, meaning the streaming data is split into small batches by time and each batch is computed. Compared with Storm, its latency is higher, around 0.5 s (second-level), but its resource consumption is lower than Storm's. In this model, "streaming computation is a special case of batch computation" (the stream is computed as a series of split-up batches).
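The difference between the two models can be sketched in plain Python (the event stream and 500 ms batch interval are made-up values for illustration):

```python
# Hypothetical event stream: (timestamp_ms, value).
events = [(0, 1), (100, 2), (250, 3), (600, 4), (900, 5)]

# Storm-style: every record triggers a computation immediately,
# so results are available at ms-level latency.
storm_outputs = []
running_sum = 0
for ts, value in events:
    running_sum += value
    storm_outputs.append((ts, running_sum))

# Spark-Streaming-style: records are grouped into 500 ms micro-batches
# and the computation fires once per batch, so results wait for the
# batch boundary (second-level latency).
batch_interval_ms = 500
batches = {}
for ts, value in events:
    batches.setdefault(ts // batch_interval_ms, []).append(value)
spark_outputs = [(batch_id, sum(vals)) for batch_id, vals in sorted(batches.items())]

print(storm_outputs)  # one result per record
print(spark_outputs)  # one result per micro-batch
```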
4. The fourth-generation computing engine: Flink

Flink appeared in Apache in 2015 and was later optimized by the Alibaba technical team as Blink (as a Chinese, I'm proud of this). Flink supports batch processing as well as streaming computation. Flink was born for streaming: each piece of data triggers a computation. Its resource consumption is lower than Storm's, its throughput is higher, its latency is lower, and it is easier to program, because Storm requires you to write your own logic to implement windows, while Flink has built-in window methods.

Flink supports a variety of functions internally, including window functions and many operators (in this it is very similar to Spark, but Spark cannot match it in performance and real-time capability). Flink supports exactly-once semantics to guarantee that data is neither lost nor double-counted. Flink supports event time to control window time, and it handles out-of-order and late data (this is very powerful). As for batch processing, Flink's view can be understood as "batch processing is a special case of stream processing" (a batch computation is the combined result of a streaming computation over a bounded stream).
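The event-time idea can be sketched with a tumbling-window simulation in plain Python (the stream and 1-second window size are made-up values; real Flink also uses watermarks to decide when a window closes):

```python
from collections import defaultdict

# Hypothetical out-of-order stream: (event_time_ms, value).
# Note the 1400 ms event arrives *after* the 2100 ms event.
events = [(100, 1), (900, 2), (2100, 4), (1400, 3)]

WINDOW_MS = 1000  # 1-second tumbling windows keyed by event time

# Assign each record to a window by its event time, not its arrival
# order, so out-of-order data still lands in the correct window.
windows = defaultdict(int)
for event_time, value in events:
    window_start = (event_time // WINDOW_MS) * WINDOW_MS
    windows[window_start] += value

for start in sorted(windows):
    print(f"window [{start}, {start + WINDOW_MS}) -> sum {windows[start]}")
```

A system that windows by arrival time would have put the 1400 ms event into the wrong bucket; keying by event time is what makes Flink's out-of-order handling possible.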

Exactly once: each record affects the result exactly one time, even across failures.
At least once: each record is processed one or more times, so duplicates are possible after a failure or replay.
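One hedged way to see the difference: under at-least-once delivery a record may be replayed, and deduplicating by a record ID is a common (made-up here, but representative) way to recover exactly-once results:

```python
# Hypothetical at-least-once stream: record "id-1" is replayed
# after a simulated failure.
incoming = [("id-1", 10), ("id-2", 20), ("id-1", 10)]

seen_ids = set()
total = 0
for record_id, amount in incoming:
    if record_id in seen_ids:
        continue  # drop the duplicate delivery
    seen_ids.add(record_id)
    total += amount

print(total)  # 30, not 40: each logical record counted exactly once
```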
Compared with Storm, both Spark and Flink support windows and operators, which saves a lot of programming time. Compared with both Storm and Spark, Flink supports out-of-order and late data (in real scenarios, this feature is extremely useful); I personally think this feature alone can beat Spark. Spark's advantage is machine learning: if the scenario does not demand high real-time performance, consider Spark, but if the real-time requirement is very high, consider Flink. Take monitoring abnormal user transactions as an example: if Spark were used in this scenario, by the time the system raised the alert (0.5 s later), the criminal would have completed the transaction. You can imagine how important Flink's real-time capability is in scenarios like this.


Origin blog.csdn.net/yangshengwei230612/article/details/114483125