Apache Flink: a new contender in big data processing

Do we need yet another data processing engine? I was very skeptical when I first heard about Flink. The big data field has no shortage of data processing frameworks, yet no single framework fully meets every processing need. Ever since Apache Spark appeared, it has looked like the best framework for solving most of today's problems, so I was very skeptical of another framework solving similar problems.
But out of curiosity, I spent a few weeks trying to understand Flink. At first I looked carefully at a few Flink examples, and they felt very similar to Spark; I tended to think Flink was a framework imitating Spark. As my understanding deepened, however, the APIs revealed some novel ideas in Flink that are quite different from Spark. These ideas fascinated me, so I spent more time on them.
Many of Flink's ideas, such as custom memory management and the Dataset API, have since appeared in Spark, which proves these ideas are very reliable. An in-depth understanding of Flink may therefore help us see where distributed data processing is heading.
In the following article, I will write down my first impressions of Flink as a Spark developer. I have been working on Spark for more than two years but have only been in contact with Flink for two to three weeks, so there is bound to be some bias; please read this article with a skeptical and critical eye.
What is Apache Flink?
Flink is a new big data processing engine that aims to unify the processing of data from different sources. This goal sounds similar to Spark's. Yes, Flink is trying to solve the same problem Spark is solving: both systems are trying to build a unified platform that can run batch, streaming, interactive, graph processing, machine learning and other applications. So the goals of Flink and Spark do not differ much; the main difference lies in the implementation details.
Later I will focus on comparing the two from different angles.
Apache Spark vs Apache Flink
1. Abstraction
In Spark, we have RDDs for batch processing and DStreams for streaming, but internally a DStream is just a sequence of RDDs, so all data representations boil down to the RDD abstraction.
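That relationship is visible directly in the API. A minimal sketch, assuming a hypothetical socket source on localhost:9999 (the source and master setting are placeholders for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-rdd")
val ssc = new StreamingContext(conf, Seconds(1))
// hypothetical socket source; any receiver works the same way
val lines = ssc.socketTextStream("localhost", 9999)
// foreachRDD hands you the plain RDD that backs each micro-batch
lines.foreachRDD { rdd => println(s"events in this batch: ${rdd.count()}") }
ssc.start()
ssc.awaitTermination()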
In Flink, we have DataSets for batch processing and DataStreams for streaming. They look similar to Spark's abstractions, but they differ in two ways:
1) A DataSet is represented as a logical plan at runtime
In Spark, an RDD is represented by Java objects at runtime; with the introduction of Tungsten this has changed a bit. In Flink, a DataSet is represented as a logical plan. Sounds familiar? Yes, it is similar to DataFrames in Spark. So in Flink, the DataFrame-like API you use is optimized by the framework as a first priority, whereas there is no such optimization for Spark's RDD. In other words, Flink's DataSet corresponds to Spark's DataFrame and is optimized before it runs.
In Spark 1.6, a Dataset API was introduced to Spark, and it may eventually replace the RDD abstraction.
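A minimal sketch of that Spark 1.6 Dataset API, typed like an RDD but optimized like a DataFrame (sc is an existing SparkContext and people.json is a hypothetical input file):

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Long)

val sqlContext = new SQLContext(sc) // sc: an existing SparkContext
import sqlContext.implicits._

// as[Person] turns the untyped DataFrame into a typed Dataset
val ds = sqlContext.read.json("people.json").as[Person] // hypothetical file
ds.filter(_.age > 21).show()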
2) DataSet and DataStream are independent APIs
In Spark, all the different APIs such as DStream and DataFrame are built on top of the RDD abstraction. In Flink, DataSet and DataStream are two independent abstractions on top of the same common engine, so you cannot combine the two kinds of behavior in one program. The Flink community is currently working in this direction (https://issues.apache.org/jira/browse/FLINK-2320), but the final outcome cannot be easily predicted yet.
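A minimal sketch of that split, using the two separate entry points of the classic Flink Scala API (the element values are arbitrary):

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

// Batch side: ExecutionEnvironment produces DataSet[T]
val batchEnv = ExecutionEnvironment.getExecutionEnvironment
val dataSet: DataSet[String] = batchEnv.fromElements("hi", "hello")

// Streaming side: StreamExecutionEnvironment produces DataStream[T]
val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment
val dataStream: DataStream[String] = streamEnv.fromElements("hi", "hello")

// The two abstractions do not mix: there is no operator that joins a
// DataSet with a DataStream in a single program.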
2. Memory management
Until version 1.5, Spark used Java heap memory for data caching, which easily leads to OOM errors or GC pauses. Starting from 1.5, Spark has moved toward precise control of memory usage; this is the Tungsten project.
Flink has insisted on managing memory itself since day one; this is also one of the ideas that inspired Spark to take the same path. Besides storing data in its own managed memory, Flink also operates directly on binary data. In Spark, starting from 1.5, all DataFrame operations are performed directly on Tungsten binary data.
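On the Spark side this shows up as configuration. A minimal sketch, assuming Spark 1.6+ where these memory settings exist (the size is an arbitrary example value):

import org.apache.spark.SparkConf

// Tungsten's explicitly managed off-heap memory is opt-in via configuration;
// the size below (1 GB, expressed in bytes) is an arbitrary example.
val conf = new SparkConf()
  .setAppName("offheap-demo")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", (1L << 30).toString)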

3. Language implementation
Spark is implemented in Scala and provides programming interfaces for Java, Python, and R.
Flink is implemented in Java and also provides a Scala API,
so from a language point of view, Spark is richer. Since I moved to Scala long ago, I do not know much about the Java API implementations of the two.
4. API
Both Spark and Flink imitate Scala's collection API, so on the surface they look similar. The following is word count implemented with the RDD and DataSet APIs respectively:

// Spark wordcount
import org.apache.spark.SparkContext

object WordCount {

  def main(args: Array[String]): Unit = {
    // Local SparkContext for the example
    val env = new SparkContext("local", "wordCount")
    val data = List("hi", "how are you", "hi")
    val dataSet = env.parallelize(data)
    // Split each line into words
    val words = dataSet.flatMap(value => value.split("\\s+"))
    // Pair each word with a count of 1, then sum the counts per word
    val mappedWords = words.map(value => (value, 1))
    val sum = mappedWords.reduceByKey(_ + _)
    sum.collect().foreach(println)
    env.stop()
  }
}

// Flink wordcount
import org.apache.flink.api.scala._

object WordCount {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = List("hi", "how are you", "hi")
    val dataSet = env.fromCollection(data)
    // Split each line into words
    val words = dataSet.flatMap(value => value.split("\\s+"))
    // Pair each word with a count of 1
    val mappedWords = words.map(value => (value, 1))
    // Group by the word (tuple field 0) and sum the counts (field 1)
    val grouped = mappedWords.groupBy(0)
    val sum = grouped.sum(1)
    sum.collect().foreach(println)
  }
}
I don't know whether it is accidental or intentional, but the two APIs look very similar, which makes it easy for developers to switch from one engine to the other. I feel that this kind of collection API will become the standard for writing data pipelines in the future.
5. Streaming
Spark regards streaming as faster batch processing, while Flink regards batch processing as a special case of streaming. These two ideas determine their respective directions. The differences between the two are as follows:

Real-time vs. near real-time perspective
Flink provides a processing model based on individual events, so it can be considered true stream computing, very similar to Storm's model.
Spark, on the other hand, does not operate at event granularity; it uses small batches, each a collection of multiple events, to simulate streaming. So Spark is considered a near-real-time processing system.

Spark Streaming is faster batch processing, while Flink's batch processing is stream computing over bounded data.
While near real-time is acceptable for most applications, there are still many applications that require event-level stream computing. Those applications used to prefer Storm over Spark Streaming; now, Flink may be a better choice.

Representation of stream computing and batch computing
Spark uses the same abstraction, the RDD, for both batch and stream computing, which makes it easy to combine the two kinds of computation. Flink splits them into DataSet and DataStream, which, compared with Spark, looks like a weaker design.

Support for windowing
Because of Spark's micro-batch mechanism, its support for windowing is very limited: windows can only be based on processing time, and only on batch boundaries.
Flink's windowing support is much more powerful, allowing windows based on processing time, event time, and record count.
I am not sure whether Spark can introduce similar APIs, but so far Flink's windowing support is better than Spark's.
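A minimal sketch against the Flink DataStream Scala API of that era; keyBy(0) and sum(1) address tuple fields by position, as in the word count above, and the input pairs are arbitrary:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment
// arbitrary (word, count) pairs standing in for a real stream
val pairs: DataStream[(String, Int)] =
  env.fromElements(("hi", 1), ("hello", 1), ("hi", 1))

// time-based window: sum the counts per key over 10-second windows
val byTime = pairs.keyBy(0).timeWindow(Time.seconds(10)).sum(1)

// count-based window: sum the counts per key every 100 records
val byCount = pairs.keyBy(0).countWindow(100).sum(1)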
In this streaming part, Flink wins over Spark.

6. SQL interface
At present, Spark SQL is one of the most active components in Spark. Spark provides both a Hive-like SQL and a DataFrame DSL to query structured data. The API is very mature and widely used, and it is expected to spread rapidly into streaming computing.
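A minimal sketch of the two query styles in the Spark 1.x API (sc is an existing SparkContext and people.json is a hypothetical input file):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // sc: an existing SparkContext
val people = sqlContext.read.json("people.json") // hypothetical file

// DataFrame DSL
people.filter(people("age") > 21).select("name").show()

// Hive-like SQL
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()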
As for Flink, so far the Flink Table API only supports a DataFrame-like DSL, and it is still in beta. The community has plans to add a SQL interface, but it is not yet certain when it will land in the framework.
So in this part, Spark wins.

7. Data source integration

Spark's data source API is the best part of the whole framework. Supported data sources include NoSQL databases, Parquet, ORC, and more, and it supports advanced operations such as predicate pushdown.
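A small sketch of what predicate pushdown looks like from the user's side, continuing with the sqlContext from the SQL sketch above (events.parquet is a hypothetical file):

// With a Parquet source, the filter below can be evaluated at the source
// instead of after loading all the data into memory.
val events = sqlContext.read.parquet("events.parquet") // hypothetical file
val us = events.filter(events("country") === "US")
us.explain() // the physical plan shows which filters were pushed down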
Flink, by contrast, still relies on the map/reduce InputFormat mechanism for data source integration.
In this game, Spark wins.

8. Iterative processing
Spark has better support for machine learning, because its in-memory cache can be used to accelerate machine learning algorithms.
However, most machine learning algorithms are actually cyclic data flows, yet in Spark they are represented as acyclic graphs, and general distributed processing engines do not encourage cyclic graphs.
Flink is a bit different here: it supports cyclic data flows in its runtime, which makes machine learning algorithms more effective and efficient.
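A minimal sketch of a native Flink iteration, estimating pi with the DataSet API's iterate operator (the iteration count of 10000 is arbitrary):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val initial = env.fromElements(0)

// Each step throws one random dart and counts hits inside the unit circle;
// the loop itself runs inside the Flink runtime, not in the driver program.
val count = initial.iterate(10000) { iterationInput =>
  iterationInput.map { i =>
    val x = Math.random()
    val y = Math.random()
    i + (if (x * x + y * y < 1) 1 else 0)
  }
}

val pi = count.map(c => c / 10000.0 * 4)
pi.print()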
This is where Flink wins.

9. Stream as platform vs. batch as platform
Spark was born in the Map/Reduce era, when data was stored on disk as files, which is very convenient for fault tolerance.
Flink brings pure stream computing into the big data era, which undoubtedly brings a breath of fresh air to the industry; the idea is very similar to akka-streams.
10. Maturity
At present, some early adopters have already used Flink in production, but from my point of view, Flink is still developing and will take time to mature.
Conclusion
At present, Spark is a more mature computing framework than Flink, but many of Flink's ideas are very good, and the Spark community is aware of this and has gradually adopted Flink's good design ideas. So learning Flink can help you see some of the more fascinating ideas on the streaming side.
