A Comparison of Five Big Data Processing Frameworks: Hadoop, Storm, Samza, Spark, and Flink

Big data is an umbrella term for the non-traditional strategies and techniques required to collect, organize, and process large data sets and derive insights from them. While the problem of working with data that exceeds the computing power or storage capacity of a single computer is not new, the ubiquity, scale, and value of this type of computing have expanded massively in recent years.

  This article introduces one of the most fundamental components of a big data system: the processing framework. Processing frameworks are responsible for performing computations on the data in the system, whether that data is read from non-volatile storage or has just been ingested. Computing over data here means extracting information and insights from large numbers of individual data points.

  These frameworks are described below:

  Batch-only frameworks:

  • Apache Hadoop

  Stream-only frameworks:

  • Apache Storm

  • Apache Samza

  Hybrid frameworks:

  • Apache Spark

  • Apache Flink

What Are Big Data Processing Frameworks?

  Processing frameworks and processing engines are responsible for performing computations over the data in a data system. Although there is no authoritative definition of the difference between "engine" and "framework", the former can usually be defined as the component that actually handles operations on data, and the latter as a set of components designed to perform a similar role.

  For example, Apache Hadoop can be seen as a processing framework with MapReduce as the default processing engine. Engines and frameworks can often be used interchangeably or concurrently. For example, another framework, Apache Spark, can be incorporated into Hadoop and replace MapReduce. This interoperability between components is one of the reasons why big data systems are so flexible.

  While the systems responsible for this stage of the data life cycle are often complex, their goals are broadly the same: to improve understanding by performing operations on the data, to reveal the patterns hidden within it, and to gain insight into complex interactions.

  To simplify the discussion of these components, we will categorize the different processing frameworks by the state of the data they are designed to handle. Some systems process data in batches, while others process data in a continuous stream as it flows into the system. Still others can handle both types of data.

  Before diving into the metrics and conclusions of the different implementations, a brief introduction to the concept of different processing types is required.

Batch Processing Systems

  Batch processing has a long history in the big data world. Batch processing operates on large, static data sets and returns the results once the computation is complete.

  Data sets used in batch processing typically have the following characteristics:

  • Bounded: Batch data sets represent a finite collection of data

  • Persistent: Data is usually stored in some type of permanent storage location

  • Bulk: Batch operations are often the only way to handle extremely large datasets

  Batch processing is ideal for computations that require access to a complete set of records. For example, when calculating totals and averages, the data set must be treated as a whole rather than as a collection of individual records. These operations require the data to be held as state for the duration of the computation.

  Tasks that require processing large amounts of data are often best handled with batch operations. Whether the dataset is processed directly from persistent storage, or loaded into memory first, batch systems are designed with the amount of data in mind and provide ample processing resources. Because batch processing is extremely good at handling large amounts of persistent data, it is often used for analysis of historical data.

  Processing large volumes of data takes considerable time, so batch processing is not appropriate for use cases where processing time matters and low latency is required.

  Apache Hadoop

  Apache Hadoop is a processing framework dedicated to batch processing. Hadoop was the first big data framework to gain significant traction in the open-source community. Based on Google's papers and experience with massive-scale data processing, Hadoop reimplemented the relevant algorithms and component stack to make large-scale batch processing technology more accessible.

  The new version of Hadoop consists of multiple components, or layers, that work together to process batches of data:

  • HDFS: HDFS is the distributed file system layer that coordinates storage and replication across the cluster nodes. HDFS ensures that data remains available despite inevitable node failures. It is used as the source of data, to store intermediate processing results, and to persist the final computed results.

  • YARN: YARN, short for Yet Another Resource Negotiator, is the cluster coordination component of the Hadoop stack. It is responsible for coordinating and managing the underlying resources and scheduling the jobs to be run. By acting as an interface to the cluster's resources, YARN makes it possible to run many more types of workloads, including interactive and iterative ones, on a Hadoop cluster than was possible before.

  • MapReduce: MapReduce is Hadoop's native batch processing engine.

Batch Processing Model

  Hadoop's processing capabilities come from the MapReduce engine. MapReduce's processing technique follows the map, shuffle, reduce algorithm using key-value pairs. The basic procedure involves:

  • Read dataset from HDFS file system

  • Split the dataset into small chunks and distribute to all available nodes

  • Perform the computation on the subset of data on each node (intermediate results are written back to HDFS)

  • Redistribute the intermediate results, grouping them by key

  • "Reducing" the value of each key by aggregating and combining the results computed by each node

  • Write the final computed results back to HDFS
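  To make these steps concrete, here is a minimal sketch following the canonical word-count example from the Hadoop documentation; the input and output HDFS paths are placeholders. The map phase emits (word, 1) pairs, the shuffle groups them by word, and the reduce phase sums the counts:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split read from HDFS.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: after the shuffle groups values by key, sum the counts per word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result); // final results go back to HDFS
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input path
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```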

Strengths and Limitations

  Since this approach leans heavily on persistent storage, with each task requiring multiple reads and writes, it is relatively slow. On the other hand, since disk space is typically the most plentiful resource on a server, MapReduce can handle enormous data sets. It also means that Hadoop's MapReduce can usually run on cheaper hardware than comparable technologies, because it does not need to hold everything in memory. MapReduce has extremely high scaling potential, and production deployments with tens of thousands of nodes exist.

  MapReduce has a steep learning curve, and while other peripheral technologies in the Hadoop ecosystem can greatly reduce the impact of this problem, it can still be an impediment to quickly implementing ideas on a Hadoop cluster.

  A vast ecosystem has been formed around Hadoop, and the Hadoop cluster itself is often used as a component of other software. Many other processing frameworks and engines can also use HDFS and YARN resource managers through integration with Hadoop.

  Summary

  Apache Hadoop and its MapReduce processing engine provide a well-tested batch processing model best suited for very large data sets where time is not a significant factor. A fully functional Hadoop cluster can be built from very inexpensive components, making this cheap and effective processing technology applicable to many use cases. Compatibility and integration with other frameworks and engines also allow Hadoop to serve as the underlying foundation for processing platforms that combine multiple workloads and technologies.

Stream Processing Systems

  Stream processing systems compute over data as it enters the system. Compared with the batch paradigm, this is a very different way of processing: rather than operating on an entire data set, stream processing performs operations on each individual item as it passes through the system.

  Data sets in stream processing are considered "unbounded", which has several important implications:

  • The complete dataset can only represent the total amount of data that has entered the system so far.

  • The working data set is perhaps more relevant, and is limited to a single item at a time.

  • Processing work is event based and has no "end" unless explicitly stopped. Processing results are immediately available and will continue to be updated as new data arrives.

  Stream processing systems can handle a nearly unlimited amount of data, but they process only one item (true stream processing) or very few items (micro-batch processing) at a time, maintaining minimal state between records. While most systems provide methods for maintaining some state, stream processing is primarily optimized for more functional processing with few side effects.

  Functional operations focus on discrete steps with limited state or side effects. Performing the same operation on the same data produces the same output, independent of other factors. This kind of processing fits streams well, because state shared between items is typically some combination of difficult, limiting, and in some cases undesirable. So while some form of state management is usually possible, these frameworks are much simpler and more efficient without it.

  This type of processing lends itself to certain kinds of workloads. Tasks with near real-time requirements are a natural fit for the stream processing model. Analytics, server or application error logging, and other time-based metrics are among the most suitable, because reacting to changes in these areas can be critical to business functions. Stream processing is a good fit for data where you must respond to changes or spikes, and where you are interested in trends over time.

  Apache Storm

  Apache Storm is a stream processing framework focused on very low latency, perhaps the best choice for workloads requiring near real-time processing. The technology can handle very large amounts of data, delivering results with lower latency than other solutions.

  Stream Processing Model

  Storm's stream processing works by orchestrating DAGs (Directed Acyclic Graphs) called topologies within the framework. These topologies describe the various transformations or steps that will be performed on each incoming piece of data as it enters the system.

  A topology is composed of:

  • Stream: A conventional data stream; unbounded data that continuously arrives at the system.

  • Spout: A source of data streams at the edge of the topology, such as an API or a queue, which produces the data to be processed.

  • Bolt: A bolt represents a processing step that consumes streams, applies an operation to the data, and emits the result as a new stream. Bolts connect to spouts or to other bolts, together making up all of the necessary processing. At the end of the topology, the output of the final bolts can be used as input to other connected systems.

  The idea behind Storm is to define many small, discrete operations with the above components and then compose them into the desired topology. By default Storm provides "at-least-once" processing guarantees, meaning every message is guaranteed to be processed at least once, but may be processed multiple times in some failure scenarios. Storm does not guarantee that messages are processed in any particular order.
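  As a sketch of how such a composition looks in code (assuming Storm's org.apache.storm Java API; the spout and bolt classes here are hypothetical user-defined components, not Storm built-ins):

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();

    // Spout: the source of the stream at the edge of the topology.
    builder.setSpout("sentences", new SentenceSpout(), 1);

    // Bolts: small, discrete processing steps connected to the spout and to
    // each other to make up the full processing graph.
    builder.setBolt("split", new SplitSentenceBolt(), 2)
           .shuffleGrouping("sentences");                // distribute tuples randomly
    builder.setBolt("count", new WordCountBolt(), 2)
           .fieldsGrouping("split", new Fields("word")); // same word -> same task

    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("word-count", new Config(), builder.createTopology());
    Thread.sleep(10_000); // let the topology run briefly in local mode
    cluster.shutdown();
  }
}
```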

  To achieve exactly-once, stateful processing, an abstraction called Trident is available. Strictly speaking, Storm without Trident is often referred to as Core Storm. Trident significantly changes Storm's processing dynamics: it increases latency, adds state to the processing, and implements a micro-batching model in place of the item-by-item pure streaming model.

  To avoid these issues, Storm users are generally advised to use Core Storm whenever possible. That said, Trident's exactly-once guarantee is useful in cases where the system cannot intelligently handle duplicate messages. Trident is also your only choice within Storm when you need to maintain state between items, for example when counting how many users clicked a link within an hour. While it does not play to the framework's natural strengths, Trident increases Storm's flexibility.

  A Trident topology is composed of:

  • Stream batches: Micro-batches of stream data that are chunked to provide batch processing semantics.

  • Operations: Batch procedures that can be performed on the data.
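  The sketch below is adapted from the word-count example in the Storm documentation: sentences arrive in small stream batches, and persistentAggregate maintains the per-word counts as Trident-managed state, updated with the exactly-once semantics described above:

```java
import org.apache.storm.generated.StormTopology;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.trident.testing.Split;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TridentWordCount {
  public static StormTopology buildTopology() {
    // A test spout that replays fixed sentences in micro-batches of up to 3 tuples.
    FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
        new Values("the cow jumped over the moon"),
        new Values("how many apples can you eat"));
    spout.setCycle(true);

    TridentTopology topology = new TridentTopology();
    topology.newStream("sentences", spout)
        // Operation: split each sentence into individual words.
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        // Keep running counts as Trident-managed state; updates are applied
        // exactly once across batches.
        .persistentAggregate(new MemoryMapState.Factory(), new Count(),
            new Fields("count"));
    return topology.build();
  }
}
```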

Strengths and Limitations

  Storm is probably the best solution for near real-time processing right now. This technology can process data with extremely low latency and can be used for workloads where the lowest latency is desired. If the processing speed directly affects the user experience, for example, the processing result needs to be provided directly to the website page opened by the visitor, then Storm will be a good choice.

  Storm with Trident gives users the option of micro-batches instead of pure stream processing. While this gives users greater flexibility to shape the tool to their needs, it also tends to negate one of the technology's biggest advantages over other solutions. That said, having an additional way to handle streams never hurts.

  Core Storm cannot guarantee the processing order of messages. Core Storm offers "at-least-once" processing guarantees, meaning every message is guaranteed to be processed, but duplicates may occur. Trident offers exactly-once guarantees and can guarantee ordering between batches, but not within a batch.

  In terms of interoperability, Storm integrates with Hadoop's YARN resource negotiator, making it easy to hook into an existing Hadoop deployment. Storm also supports a wider range of languages than most processing frameworks, giving users many options for defining their topologies.

  Summary

  For pure stream processing workloads with very strict latency requirements, Storm is probably the best fit. The technology guarantees that every message is processed and can be used with a large number of programming languages. Since Storm cannot do batch processing, additional software may be required if those capabilities are needed. If exactly-once processing guarantees are a hard requirement, Trident can provide them, although other stream processing frameworks may be better suited at that point.

  Apache Samza

  Apache Samza is a stream processing framework tightly bound to the Apache Kafka messaging system. Although Kafka can be used in many stream processing systems, Samza is designed to better leverage Kafka's unique architectural advantages and guarantees. This technology provides fault tolerance, buffering, and state storage through Kafka.

  Samza can use YARN as a resource manager. This means that a Hadoop cluster (at least HDFS and YARN) is required by default, but it also means that Samza can directly use YARN's rich built-in features.

  Stream Processing Model

  Samza relies on Kafka's semantics to define how streams are processed. Kafka involves the following concepts when processing data:

  • Topic: Each data stream entering the Kafka system can be called a topic. A topic is basically a stream of related information that consumers can subscribe to.

  • Partition: To spread a topic across multiple nodes, Kafka divides incoming messages into multiple partitions. Partitioning is based on a key, which guarantees that all messages with the same key land in the same partition. Ordering within a partition is guaranteed.

  • Broker: Each node that makes up a Kafka cluster is also called a broker.

  • Producer: Any component that writes data to a Kafka topic can be called a producer. The producer provides the keys needed to divide the topic into partitions.

  • Consumer: Any component that reads topics from Kafka can be called a consumer. The consumer is responsible for keeping track of its own offset, so that after a failure it knows which records have already been processed.

  Since Kafka represents an immutable log, Samza deals in immutable streams. This means that any transformation creates a new stream that can be consumed by other components without affecting the original stream.
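  As a sketch of Samza's low-level API (the task class, topic names, and filter logic here are hypothetical), a job implements the StreamTask interface, consuming messages from one Kafka topic and writing a new, derived stream back to Kafka while the original stream remains untouched:

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class ErrorFilterTask implements StreamTask {
  // Output goes to a new Kafka topic; any number of downstream consumers can
  // subscribe to it independently, without coordinating with this job.
  private static final SystemStream OUTPUT = new SystemStream("kafka", "error-events");

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // Called once per incoming message from the subscribed input topic.
    String message = (String) envelope.getMessage();
    if (message != null && message.contains("ERROR")) {
      collector.send(new OutgoingMessageEnvelope(OUTPUT, message));
    }
  }
}
```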

  Strengths and Limitations

  At first glance, Samza's reliance on a Kafka-like queuing system may seem like a limitation. However, it also gives the system some unique guarantees and capabilities that other stream processing systems lack.

  For example, Kafka already provides replicated copies of the data that can be accessed with low latency, along with a very easy-to-use and inexpensive multi-subscriber model for each data partition. All output, including intermediate results, can be written to Kafka and consumed independently by downstream stages.

  This tight reliance on Kafka mirrors, in many ways, the MapReduce engine's reliance on HDFS. While relying on HDFS between each computation causes serious performance problems for batch processing, the reliance on Kafka avoids many of the problems that otherwise afflict stream processing.

  The close relationship between Samza and Kafka allows the processing steps themselves to be very loosely coupled. Any number of subscribers can be added at any step of the output without prior coordination, which is useful for organizations with multiple teams that need access to similar data. Multiple teams can all subscribe to the data topics that enter the system, or subscribe to topics created by other teams after some processing of the data. All this without putting additional stress on load-intensive infrastructure such as databases.

  Writing directly to Kafka also eliminates the problem of backpressure. Backpressure occurs when load spikes cause data to flow in faster than components can process it in real time, which can lead to processing stalls and potential data loss. Kafka is designed to hold data for long periods of time, which means components can process at their own pace and can be restarted without consequence.

  Samza can store state using a fault-tolerant checkpointing system implemented as a local key-value store. This gives Samza an at-least-once delivery guarantee, but because data may be delivered multiple times after a failure, it cannot provide accurate recovery of aggregated state (such as counts).

  The high-level abstractions Samza provides are in many ways easier to work with than the primitives offered by systems like Storm. Samza currently supports only JVM languages, which means it is not as flexible as Storm in terms of language support.

  Summary

  For environments where Hadoop and Kafka are already available or easy to set up, Apache Samza is a good choice for stream processing workloads. Samza itself is well suited to organizations where multiple teams need to consume (but not necessarily closely coordinate around) data streams at various stages of processing. Samza greatly simplifies many parts of stream processing and offers low-latency performance. However, it may not be a good fit if the deployment requirements are incompatible with your current systems, if extremely low-latency processing is needed, or if strong exactly-once processing semantics are required.

Hybrid Processing Systems: Batch and Streaming

  Some processing frameworks handle both batch and streaming workloads. These frameworks can process both types of data with the same or related components and APIs, thereby simplifying different processing needs.

  As you will see, the way this capability is achieved differs significantly between Spark and Flink, the two frameworks introduced below. It largely comes down to how the two processing paradigms are unified and what assumptions are made about the relationship between fixed and unfixed data sets.

  While projects focused on one processing type fit specific use cases closely, hybrid frameworks attempt to offer a general solution for data processing. They not only provide the methods needed to process the data, but also supply their own integrations, libraries, and tools for tasks such as graph analysis, machine learning, and interactive querying.

  Apache Spark

  Apache Spark is a next-generation batch processing framework that also includes stream processing capabilities. Built on many of the same principles as Hadoop's MapReduce engine, Spark focuses primarily on accelerating batch workloads through full in-memory computation and processing optimization.

  Spark can be deployed as a standalone cluster (with the appropriate storage layer), or it can be integrated with Hadoop and replace the MapReduce engine.

  Batch Processing Model

  Unlike MapReduce, Spark performs all data processing in memory, interacting with the storage layer only to load the data initially and to persist the final results. All intermediate results are managed in memory.

  While in-memory processing contributes substantially to its speed, Spark is also faster on disk-related tasks because of the holistic optimization it can achieve by analyzing the complete set of tasks ahead of time. To do this, Spark builds a Directed Acyclic Graph, or DAG, representing all of the operations to be performed, the data to be operated on, and the relationships between them, giving the processor greater ability to intelligently coordinate work.

  To implement in-memory batch computation, Spark uses a model called Resilient Distributed Datasets, or RDDs, to work with data. These are immutable structures, held in memory, that represent collections of data. Operations on RDDs produce new RDDs, and each RDD can trace its lineage back through its parent RDDs and, ultimately, to the data on disk. Through RDDs, Spark achieves fault tolerance without needing to write the result of every operation back to disk.
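  A minimal RDD sketch, assuming the Spark 2.x Java API (the paths and the local master setting are placeholders): each transformation below produces a new immutable RDD whose lineage traces back to the file on disk, and storage is touched only at the initial load and the final save:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("word-count").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load once from storage; everything after this happens in memory.
    JavaRDD<String> lines = sc.textFile("hdfs:///input/text");

    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);

    counts.saveAsTextFile("hdfs:///output/counts"); // persist only the final result
    sc.stop();
  }
}
```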

  Stream Processing Model

  Stream processing capability is provided by Spark Streaming. Spark itself is designed primarily for batch workloads; to bridge the gap between the engine's design and the characteristics of streaming workloads, Spark implements a concept called micro-batches. This strategy treats the incoming data stream as a series of very small "batches" that can be handled with the native semantics of the batch engine.

  Spark Streaming buffers the stream in sub-second increments, and these buffers are then processed as small fixed data sets. In practice this approach works quite well, but it still falls short of true stream processing frameworks in terms of performance.
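  A sketch of this micro-batch model with Spark Streaming (host, port, and filter logic are placeholders); the one-second batch interval is the buffering window that trades latency for throughput:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingExample {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("micro-batch").setMaster("local[2]");
    // Each batch covers one second of incoming data.
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

    JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
    JavaDStream<String> errors = lines.filter(line -> line.contains("ERROR"));
    errors.print(); // each micro-batch is processed with the batch engine's semantics

    jssc.start();
    jssc.awaitTermination();
  }
}
```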

  Strengths and Limitations

  The main reason to use Spark over Hadoop MapReduce is speed. With the help of mechanisms such as in-memory computing strategies and advanced DAG scheduling, Spark can process the same dataset faster.

  Another important advantage of Spark is its versatility. It can be deployed as a standalone cluster or integrated with an existing Hadoop cluster, and it can perform both batch and stream processing, allowing a single cluster to handle multiple types of workloads.

  Beyond the capabilities of the engine itself, an ecosystem of libraries has been built around Spark that provides better support for tasks such as machine learning and interactive querying. Compared to MapReduce, Spark tasks are famously easy to write, which can greatly improve productivity.

  Using a batch-oriented method for stream processing requires buffering data as it enters the system. The buffer allows the technology to handle a very high volume of incoming data, increasing overall throughput, but waiting for the buffer to flush also leads to increased latency. This means Spark Streaming may not be appropriate for workloads with strict low-latency requirements.

  Since memory is generally more expensive than disk space, Spark can cost more to run than disk-based systems. However, faster processing means tasks finish sooner, and in environments where resources are billed by the hour, this often offsets the added cost.

  Another consequence of this design of Spark in-memory computing is that if deployed in a shared cluster, it may run into resource shortages. Compared to Hadoop MapReduce, Spark is more resource-intensive and may have an impact on other tasks that need to use the cluster at the same time. By its very nature, Spark is less suited to co-existing with other components of the Hadoop stack.

  Summary

  Spark is the best choice for processing tasks with diverse workloads. Spark batch processing capabilities provide unparalleled speed benefits at the cost of higher memory footprint. For workloads where throughput is more important than latency, Spark Streaming is more suitable as a stream processing solution.

  Apache Flink

  Apache Flink is a stream processing framework that can also handle batch tasks. It treats batch processing as a subset of stream processing by regarding batch data simply as a bounded data stream. This stream-first approach to all processing has a number of interesting side effects.

  This stream-first approach is also known as the Kappa architecture, in contrast to the more widely known Lambda architecture (in which batch is the primary processing method and streams supplement it with early, unrefined results). In the Kappa architecture, everything is a stream, which simplifies the model and has only recently become feasible as stream processing engines have matured.

  Stream Processing Model

  Flink's stream processing model handles incoming data item by item, as a true stream. Flink's DataStream API is used to work with unbounded streams of data. The basic components Flink works with are:

  • Stream: An immutable, unbounded data set that flows through the system

  • Operator: A function that operates on data streams to produce other data streams

  • Source: The entry point for streams into the system

  • Sink: The place where streams arrive after leaving the Flink system; a sink can be a database or a connector to another system

  To be able to recover from problems during computation, stream processing tasks take snapshots of their state at set points in time. For state storage, Flink can work with a number of state backends, depending on the required sophistication and persistence.

  In addition, Flink's stream processing understands the concept of "event time", meaning the time at which an event actually occurred, and it can handle sessions as well. This means ordering and grouping can be guaranteed in some interesting ways.
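  A minimal DataStream sketch in Java (the socket source is a placeholder input): the stream is transformed operator by operator, and the rolling count is updated as each new item arrives, illustrating the item-by-item model:

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Source: the entry point of the unbounded stream into the system.
    DataStream<String> text = env.socketTextStream("localhost", 9999);

    // Operators: transform one stream into another, item by item.
    DataStream<Tuple2<String, Integer>> counts = text
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
          @Override
          public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split("\\s+")) {
              out.collect(new Tuple2<>(word, 1));
            }
          }
        })
        .keyBy(value -> value.f0) // group the unbounded stream by word
        .sum(1);                  // a rolling count, updated per arriving item

    counts.print();               // sink: where the data leaves the Flink system
    env.execute("streaming-word-count");
  }
}
```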

  Batch Processing Model

  Flink's batch processing model is largely just an extension of its stream processing model. Instead of reading from a continuous stream, it reads a bounded data set from persistent storage as a stream. Flink uses exactly the same runtime for both processing models.
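  A minimal sketch using Flink's DataSet API (the input path is a placeholder): a bounded data set is read from persistent storage and processed by the same runtime as the streaming example above:

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class BatchWordCount {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // A bounded data set read from persistent storage, treated as a finite stream.
    DataSet<String> text = env.readTextFile("hdfs:///input/text");

    DataSet<Tuple2<String, Integer>> counts = text
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
          @Override
          public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split("\\s+")) {
              out.collect(new Tuple2<>(word, 1));
            }
          }
        })
        .groupBy(0) // group by the word field of the tuple
        .sum(1);

    counts.print(); // print() triggers execution for DataSet programs
  }
}
```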

  Flink offers some optimizations for batch workloads. For example, since batch operations are backed by persistent storage, Flink can forgo snapshotting batch workloads. Data is still recoverable, but normal processing completes faster.

  Another optimization involves breaking batch tasks apart so that stages and components are only involved when needed. This helps Flink coexist with other users of the cluster. Analyzing tasks ahead of time gives Flink visibility into all of the operations to be performed, the size of the data set, and the steps coming downstream, enabling further optimization.

  Strengths and Limitations

  Flink currently occupies a unique position in the processing framework landscape. While Spark also performs batch and stream processing, the micro-batch architecture of its streaming makes it unsuitable for many use cases. Flink's stream-first approach offers low latency, high throughput, and true item-by-item processing.

  Flink manages many things itself. Somewhat unconventionally, it manages its own memory instead of relying on the native Java garbage collection mechanisms, for performance reasons. Unlike Spark, Flink does not require manual optimization and tuning when the characteristics of the data it processes change, and it handles data partitioning and caching automatically as well.

  Flink analyzes its work and optimizes tasks in a number of ways. Part of this analysis is similar to what SQL query planners do for relational databases: mapping out the most effective way to implement a given task. Flink can execute stages in parallel where possible while bringing data together for blocking tasks. For iterative tasks, Flink attempts, for performance reasons, to run the computation on the nodes where the data is stored. It can also perform "delta iterations", iterating only over the portions of the data that have changed.

  In terms of user tooling, Flink provides a web-based scheduling view for easily managing tasks and inspecting the system. Users can also display the optimization plan for submitted tasks to see how they will actually be executed on the cluster. For analytical tasks, Flink offers SQL-style querying, graph processing, and machine learning libraries, and supports in-memory computation.

  Flink works well with other components. When used in conjunction with the Hadoop stack, the technology fits well into the entire environment, consuming only the necessary resources at all times. The technology integrates easily with YARN, HDFS, and Kafka. With the help of compatibility packages, Flink can also run tasks written for other processing frameworks such as Hadoop and Storm.

  One of Flink's biggest limitations at the moment is that it is still a very young project. Large-scale real-world deployments are not yet as common as for other processing frameworks, and there has been little in-depth study of Flink's scaling limitations. With its rapid development cycle and features such as the compatibility packages, more Flink deployments are likely to appear as more organizations begin to experiment with it.

  Summary

  Flink provides low-latency stream processing while also supporting traditional batch tasks. Flink is perhaps best suited for organizations with heavy stream processing needs and some batch tasks. Its compatibility with native Storm and Hadoop programs and its ability to run on YARN-managed clusters make it easy to evaluate. Its fast-moving development makes it worth keeping an eye on.

Conclusion

  Big data systems can use a variety of processing techniques.

  For batch-only workloads that are not time-sensitive, Hadoop, which is less expensive to implement than other solutions, would be a good choice.

  For stream-only workloads, Storm offers broad language support and can deliver very low-latency processing, but its default configuration can produce duplicate results and cannot guarantee ordering. Samza's tight integration with YARN and Kafka provides greater flexibility, easier multi-team usage, and simpler replication and state management.

  For mixed workloads, Spark provides high-speed batch processing and micro-batch processing for streams. It enjoys broad support, with various integrated libraries and tools available for flexible integration. Flink provides true stream processing with batch processing support. It is deeply optimized, can run tasks written for other platforms, and offers low-latency processing, but it is still in the early days of practical adoption.

  The best fit depends largely on the state of the data to be processed, the time constraints on processing, and the kinds of results you are interested in. There is a trade-off, worth careful consideration, between using a full-featured solution and one narrowly focused on a particular type of project. Similar questions will need to be asked when evaluating any new and innovative solution as it matures and gains acceptance.
