Summary of Big Data Architecture

 

Introduction

Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, and process large data sets and to glean insights from them. While the problem of working with data that exceeds the computing power or storage of a single computer is not new, the pervasiveness, scale, and value of this type of computing have greatly expanded in recent years.

In a previous article, we introduced the general concepts, processing stages, and terminology of big data systems. This article introduces one of the most essential components of a big data system: the processing framework. Processing frameworks compute over the data in the system, whether by reading from non-volatile storage or by processing data as it is ingested. Computing over data is the process of extracting information and insight from large quantities of individual data points.

These frameworks are described below:

  • Batch-only frameworks:
    • Apache Hadoop
  • Stream-only frameworks:
    • Apache Storm
    • Apache Samza
  • Hybrid frameworks:
    • Apache Spark
    • Apache Flink

What are big data processing frameworks?

 

Processing frameworks and processing engines are responsible for computing over data in a data system. While there is no authoritative definition of the difference between "engine" and "framework", it is often useful to define the former as the component actually responsible for operating on data and the latter as a set of components designed to do the same.

For instance, Apache Hadoop can be considered a processing framework with MapReduce as its default processing engine. Engines and frameworks can often be swapped out or used in tandem. For example, Apache Spark, another framework, can hook into Hadoop and replace MapReduce. This interoperability between components is one reason that big data systems are so flexible.

Although the systems that handle this stage of the data life cycle can be complex, their goals on a broad level are very much the same: operate over data in order to increase understanding, surface the patterns the data contains, and gain insight into complex interactions.

 

To simplify the discussion of these components, we will categorize the frameworks by the state of the data they are designed to handle. Some systems handle data in batches, while others process data as it continuously flows into the system as a stream. Still others can handle data in both of these ways.

Before diving deep into the metrics and conclusions for the different implementations, a brief introduction to the different processing paradigms is in order.

Batch processing system

Batch processing has a long history within the big data world. Batch processing operates over a large, static data set and returns the result at a later time, once the computation is complete.

Data sets used in batch processing typically exhibit the following characteristics:

  • Bounded: The batch data set represents a limited set of data
  • Persistent: Data is usually always stored in some type of persistent storage location
  • Large volume: batch operations are usually the only way to process extremely large data sets

Batch processing is ideal for calculations that require access to a complete set of records. For example, when calculating totals and averages, the data set must be treated as a whole rather than as a collection of individual records. These operations require that state be maintained for the duration of the calculation.
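As a small worked example of the totals-and-averages point above, here is a sketch in plain Python (the data is hypothetical) showing that the final answer only exists once the whole bounded data set has been seen:

```python
# Sketch: computing a total and an average requires the full data set;
# a per-record view cannot produce the final answer until every record
# in the bounded batch has been processed.
records = [4, 8, 15, 16, 23, 42]  # hypothetical bounded batch data set

total = sum(records)
average = total / len(records)
# total == 108, average == 18.0
```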

Tasks that need to process large amounts of data are usually most suitable for processing with batch operations. Whether directly processing the data set from the persistent storage device or loading the data set into the memory first, the batch processing system fully considers the amount of data during the design process and can provide sufficient processing resources. Because batch processing is excellent in dealing with large amounts of persistent data, it is often used to analyze historical data.

Processing large volumes of data takes considerable time, so batch processing is not appropriate in situations where processing time is especially significant.

Apache Hadoop

Apache Hadoop is a processing framework dedicated to batch processing. Hadoop is the first big data framework that has gained a lot of attention in the open source community. Based on Google's many papers and experience on massive data processing, Hadoop has re-implemented related algorithms and component stacks, making large-scale batch processing technology easier to use.

The new version of Hadoop contains multiple components, that is, multiple layers, which can be used in conjunction to process batch data:

  • HDFS: HDFS is the distributed file system layer that coordinates storage and replication across the cluster nodes. HDFS ensures that data remains available despite inevitable node failures. It is used as the source of data, to store intermediate processing results, and to persist the final calculated results.
  • YARN : YARN is the abbreviation of Yet Another Resource Negotiator (another resource manager), which can act as a cluster coordination component of the Hadoop stack. This component is responsible for coordinating and managing the operation of underlying resources and scheduling jobs. By acting as an interface for cluster resources, YARN enables users to run more types of workloads in a Hadoop cluster than in previous iterations.
  • MapReduce : MapReduce is the native batch processing engine of Hadoop.

Batch mode

Hadoop's processing functionality comes from the MapReduce engine. MapReduce's processing technique follows the map, shuffle, reduce workflow using key-value pairs. The basic procedure involves:

  • Read data set from HDFS file system
  • Split the data set into small pieces and distribute them to all available nodes
  • Calculate the data subset on each node (the intermediate state result of the calculation will be rewritten into HDFS)
  • Redistribute intermediate results and group them by key
  • "Reducing" the value of each key by summarizing and combining the results calculated by each node
  • Rewrite the calculated final result into HDFS
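The steps above can be sketched in miniature with plain Python, using word count as the classic example. Real MapReduce distributes these phases across cluster nodes with HDFS in between; here each phase is simply a function over in-memory data:

```python
# A minimal, illustrative map -> shuffle -> reduce flow (not Hadoop's API).
from collections import defaultdict

def map_phase(lines):
    # Emit a (key, value) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Group intermediate values by key (the "redistribute and group" step).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine the values for each key into a final result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big deal", "data at rest"]
result = reduce_phase(shuffle_phase(map_phase(lines)))
# result == {"big": 2, "data": 2, "deal": 1, "at": 1, "rest": 1}
```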

Advantages and limitations

Because this method relies heavily on persistent storage, each task requires multiple read and write operations, so the speed is relatively slow. But on the other hand, since disk space is usually the most abundant resource on the server, this means that MapReduce can handle very large data sets. It also means that compared to other similar technologies, Hadoop's MapReduce can usually run on cheap hardware, because the technology does not need to store everything in memory. MapReduce has extremely high scaling potential, and applications containing tens of thousands of nodes have appeared in the production environment.

MapReduce has a steep learning curve. Although other peripheral technologies in the Hadoop ecosystem can greatly reduce the impact of this problem, it is still necessary to pay attention to this problem when quickly implementing certain applications through a Hadoop cluster.

A vast ecosystem has been formed around Hadoop, and the Hadoop cluster itself is often used as a component of other software. Many other processing frameworks and engines can also use HDFS and YARN resource managers through integration with Hadoop.

Summary

Apache Hadoop and its MapReduce processing engine offer a well-tested batch processing model best suited for handling very large data sets where time is not a significant factor. A full-featured Hadoop cluster can be built from very low-cost components, making this cheap and effective processing technology applicable in many cases. Compatibility and integration with other frameworks and engines also make Hadoop the underlying foundation for multiple-workload processing platforms built on different technologies.

Stream Processing System

Stream processing systems compute over data as it enters the system. This is a completely different processing model from the batch paradigm. Instead of operating over an entire data set, stream processing operates on each individual data item as it passes through the system.

The data set in stream processing is "boundless", which has several important effects:

  • The total data set is only defined as the amount of data that has entered the system so far.
  • The working data set is perhaps more relevant, and is limited to a single item at a time.
  • Processing is event-based and does not "end" until explicitly stopped. Results are immediately available and are continually updated as new data arrives.
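The "results are immediately available and continually updated" point can be illustrated with a tiny sketch: a running count that emits an up-to-date result after every arriving item, instead of a single answer at the end the way a batch job would:

```python
# Illustrative sketch of continuously updated stream results.
def running_count(stream):
    count = 0
    for _ in stream:
        count += 1
        yield count   # a fresh, current result after each arriving item

snapshots = list(running_count(["a", "b", "c"]))
# snapshots == [1, 2, 3]: each value was a valid "latest result" in its moment
```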

Stream processing systems can handle a nearly unlimited amount of data, but they process only one item at a time (true stream processing) or very few items at a time (micro-batch processing), with minimal state maintained between records. While most systems provide methods of maintaining some state, stream processing is highly optimized for more functional processing with few side effects.

Functional operations focus on discrete steps that have limited state or side effects. Performing the same operation on the same data will produce the same output independent of other factors. This kind of processing fits well with streams because state between items is typically some combination of difficult, limiting, and undesirable. So while some types of state management are usually possible, these frameworks are much simpler and more efficient in their absence.

This type of processing lends itself to certain kinds of workloads. Tasks with near-real-time requirements are well served by the streaming model. Analytics, server or application error logs, and other time-based metrics are a natural fit, because reacting to changes in these areas is critical to business functions. Stream processing is well suited to data where you must respond to changes or spikes, and where you are interested in trends over time.

Apache Storm

Apache Storm is a stream processing framework that focuses on extremely low latency, and may be the best choice for workloads that require near real-time processing. This technology can handle very large amounts of data and provide results with lower latency than other solutions.

Stream processing mode

Storm stream processing works by orchestrating DAGs (Directed Acyclic Graphs) called topologies within the framework. These topologies describe the various transformations or steps that will be applied to each incoming piece of data as it enters the system.

The topology includes:

  • Stream: an ordinary data stream; unbounded data that will continuously arrive at the system.
  • Spout: a source of data streams at the edge of the topology, such as an API or queue, from which the data to be processed emerges.
  • Bolt: a processing step that consumes streams, applies an operation to them, and outputs the result as a new stream. Bolts connect to spouts, and then to each other, to form all of the necessary processing. At the end of the topology, the final bolt output can be used as input to a connected external system.

The idea behind Storm is to define small, discrete operations using the above components and then compose them into the desired topology. By default, Storm offers at-least-once processing guarantees, meaning each message is guaranteed to be processed at least once, but may be processed more than once in failure scenarios. Storm does not guarantee that messages will be processed in a specific order.
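The spout-and-bolt chaining idea can be sketched with plain Python generators (this is only an analogy, not Storm's API): each stage consumes a stream, applies one small discrete operation, and emits a new stream, mirroring how bolts are wired together in a topology:

```python
# Hypothetical topology: spout -> parse bolt -> filter bolt.
def spout():
    # Data source at the edge of the topology (events are made up).
    for event in ["click:home", "click:cart", "view:home"]:
        yield event

def parse_bolt(stream):
    # First processing step: split each raw event into a (type, page) tuple.
    for event in stream:
        kind, page = event.split(":")
        yield (kind, page)

def filter_bolt(stream):
    # Second step: keep only click events, emitting the page as output.
    for kind, page in stream:
        if kind == "click":
            yield page

clicked_pages = list(filter_bolt(parse_bolt(spout())))
# clicked_pages == ["home", "cart"]
```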

To achieve exactly-once, stateful processing, an abstraction called Trident is available. Strictly speaking, Storm without Trident is often referred to as Core Storm. Trident significantly alters Storm's processing characteristics: it increases latency, adds state to processing, and uses a micro-batch model instead of the item-by-item pure stream processing model.

Storm users generally recommend using Core Storm whenever possible to avoid these penalties. With that said, Trident's guarantee of processing items exactly once is useful in cases where the system cannot intelligently handle duplicate messages. Trident is also the only choice within Storm when you need to maintain state between items, such as when counting how many users click a link within an hour. Trident gives Storm flexibility, even though it does not play to the framework's natural strengths.

Trident topology includes:

  • Stream batch : This refers to micro-batch of streaming data, which can provide batch processing semantics through chunking.
  • Operation : Refers to a batch process that can be performed on data.

Advantages and limitations

At present, Storm may be the best solution in the field of near real-time processing. This technology can process data with very low latency and can be used for workloads that want the lowest latency. If the processing speed directly affects the user experience, for example, the processing result needs to be directly provided to the website page opened by the visitor, then Storm will be a good choice.

Storm with Trident gives users the option to use micro-batches instead of pure stream processing. While this gives users greater flexibility to shape the tool to their requirements, it also tends to negate one of the technology's biggest advantages over other solutions. That said, having a choice of stream processing styles is still helpful.

Core Storm offers no ordering guarantees for messages. Core Storm provides at-least-once processing guarantees, meaning every message is guaranteed to be processed, but duplicates may occur. Trident offers exactly-once guarantees and can provide ordering between batches, but not within a batch.

In terms of interoperability, Storm integrates with Hadoop's YARN resource negotiator, making it easy to hook into an existing Hadoop deployment. More than most processing frameworks, Storm has very broad language support, giving users many options for defining topologies.

Summary

For pure stream processing workloads with very strict latency requirements, Storm is probably the most suitable mature option. It can guarantee message processing and can be used with a large number of programming languages. Because Storm does not do batch processing, you will have to use additional software if you require those capabilities. If you have a strong need for exactly-once processing guarantees, Trident is worth considering; however, other stream processing frameworks may also be a better fit at that point.

Apache Samza

Apache Samza is a stream processing framework closely tied to the Apache Kafka messaging system. Although Kafka can be used in many stream processing systems, according to the design, Samza can better utilize Kafka's unique architectural advantages and guarantees. This technology can provide fault tolerance, buffering, and state storage through Kafka.

Samza can use YARN as a resource manager. This means that a Hadoop cluster (at least HDFS and YARN) is required by default, but it also means that Samza can directly use the rich built-in features of YARN.

Stream processing mode

Samza relies on Kafka's semantics to define how streams are processed. Kafka involves the following concepts when processing data:

  • Topic : Each data stream entering the Kafka system can be called a topic. A topic is basically a data stream composed of related information that can be subscribed to by consumers.
  • Partition: to distribute a topic among multiple nodes, Kafka divides incoming messages into partitions. Partitioning is based on a key (Key), so that every message with the same key is routed to the same partition. Ordering is guaranteed within a partition.
  • Broker: Each node that forms a Kafka cluster is also called a broker.
  • Producer : Any component that writes data to a Kafka topic can be called a producer. The producer can provide the keys needed to divide the topic into partitions.
  • Consumer: any component that reads a topic from Kafka can be called a consumer. Consumers are responsible for maintaining information about their own offset, so that they know which records have already been processed if a failure occurs.
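The key-based partitioning described above can be sketched in a few lines of Python. The hash function and partition count here are illustrative, not Kafka's actual implementation; the point is that identical keys deterministically land in the same partition, which is what makes per-partition ordering possible:

```python
# Sketch of Kafka-style key-based partition assignment (illustrative only).
NUM_PARTITIONS = 3

def partition_for(key):
    # A stable hash: identical keys always map to the same partition.
    return sum(key.encode()) % NUM_PARTITIONS

partitions = {p: [] for p in range(NUM_PARTITIONS)}
for key, value in [("user-a", "login"), ("user-b", "login"), ("user-a", "logout")]:
    partitions[partition_for(key)].append((key, value))

# Both "user-a" messages land in the same partition, in arrival order,
# so their relative ordering is preserved within that partition.
user_a_partition = partition_for("user-a")
```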

Because Kafka is essentially an immutable log, Samza deals in immutable streams. This means that any transformation creates a new stream that can be consumed by other components without affecting the initial stream.

Advantages and limitations

At first glance, Samza's reliance on a Kafka-like queuing system may seem like a restriction, but it also affords the system some unique guarantees and features not common among other stream processing systems.

For example, Kafka already provides replicated copies of the data that can be accessed with low latency. It also offers a very easy and inexpensive multi-subscriber model for each data partition. All output, including intermediate results, is written to Kafka and can be independently consumed by downstream stages.

This close dependence on Kafka is similar in many ways to the MapReduce engine's dependence on HDFS. Although the reliance on HDFS between each calculation of batch processing causes some serious performance problems, it also avoids many other problems encountered by stream processing.

The close relationship between Samza and Kafka allows the processing steps themselves to be very loosely coupled. Without prior coordination, you can add any number of subscribers in any step of the output. This feature is very useful for organizations that have multiple teams that need to access similar data. Multiple teams can all subscribe to the data topics that enter the system, or arbitrarily subscribe to topics created after other teams have processed the data. All this does not put additional pressure on load-intensive infrastructures such as databases.

Writing directly to Kafka can also avoid backpressure problems. Back pressure refers to the situation when the load peak causes the data flow rate to exceed the real-time processing capacity of the component. This situation may cause the processing work to stop and may lose data. According to the design, Kafka can save data for a long time, which means that components can continue processing at a convenient time and can be restarted directly without worrying about any consequences.

Samza stores state using a fault-tolerant checkpointing system implemented as a local key-value store. This gives Samza an at-least-once delivery guarantee, but it does not provide accurate recovery of aggregated state (such as counts) in the event of a failure, since data may be delivered more than once.

The high-level abstractions Samza offers are in many ways easier to work with than the primitives provided by systems like Storm. At the moment Samza only supports JVM languages, meaning it does not have the same language flexibility as Storm.

Summary

For environments where Hadoop and Kafka are either already available or practical to implement, Apache Samza is a good choice for streaming workloads. Samza itself is a good fit for organizations with multiple teams that need to use (but not necessarily tightly coordinate around) data streams at various stages of processing. Samza greatly simplifies many parts of stream processing and offers low-latency performance. It may be a poor fit if the deployment requirements are incompatible with your current systems, if extremely low-latency processing is required, or if strong exactly-once processing semantics are needed.

Hybrid processing system: batch and stream processing

Some processing frameworks can handle batch and stream processing workloads at the same time. These frameworks can process two types of data with the same or related components and APIs, thereby simplifying different processing requirements.

As you can see, this feature is mainly implemented by Spark and Flink, which will be introduced below. The key to achieving such a function lies in how the two different processing modes are unified, and what assumptions should be made about the relationship between fixed and non-fixed data sets.

Although projects that focus on a certain type of processing will better meet the requirements of specific use cases, the hybrid framework is intended to provide a general solution for data processing. This framework not only provides the methods needed to process data, but also provides its own integrated items, libraries, and tools, which can be competent for various tasks such as graph analysis, machine learning, and interactive query.

Apache Spark

Apache Spark is a next-generation batch processing framework with stream processing capabilities. Built using many of the same principles as Hadoop's MapReduce engine, Spark focuses primarily on speeding up batch processing workloads through full in-memory computation and processing optimizations.

Spark can be deployed as an independent cluster (requires the cooperation of the corresponding storage layer), or can be integrated with Hadoop and replace the MapReduce engine.

Batch mode

Unlike MapReduce, Spark processes all data in memory, interacting with the storage layer only to initially load the data into memory and at the end to persist the final results. All intermediate results are managed in memory.

While in-memory processing contributes substantially to speed, Spark is also faster for disk-related tasks because it can perform holistic optimization by analyzing the complete set of tasks ahead of time. To achieve this, Spark creates Directed Acyclic Graphs, or DAGs, which represent all of the operations that must be performed, the data to be operated on, and the relationships between them, giving the processor a greater ability to intelligently coordinate work.

To implement in-memory batch computation, Spark uses a model called Resilient Distributed Datasets, or RDDs, to work with data. These are immutable structures representing data sets that exist only in memory. Operations on RDDs produce new RDDs, and each RDD can trace its lineage back through its parent RDDs and ultimately to the data on disk. Through RDDs, Spark maintains fault tolerance without needing to write the result of each operation back to disk.
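The lineage idea can be sketched with a toy class (this is an illustration of the concept, not Spark's actual API): each data set is immutable, records the transformation that produced it, and can rebuild its contents by replaying its lineage from the source instead of reading a disk checkpoint:

```python
# Illustrative sketch of RDD lineage and lazy recomputation (not Spark's API).
class SketchRDD:
    def __init__(self, parent=None, transform=None, source=None):
        self.parent = parent        # lineage pointer back toward the source
        self.transform = transform  # function applied to the parent's output
        self.source = source        # base data, only set on the root RDD

    def map(self, fn):
        # Transformations return a new immutable RDD; nothing runs yet.
        return SketchRDD(parent=self, transform=lambda data: [fn(x) for x in data])

    def collect(self):
        # Materialize by replaying the lineage from the root.
        if self.parent is None:
            return list(self.source)
        return self.transform(self.parent.collect())

root = SketchRDD(source=[1, 2, 3])
doubled = root.map(lambda x: x * 2)
result = doubled.collect()
# result == [2, 4, 6]; if `doubled` were lost, it could be recomputed
# from `root` via its recorded lineage rather than from a disk checkpoint.
```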

Stream processing mode

Stream processing capabilities are supplied by Spark Streaming. Spark itself is designed with batch-oriented workloads in mind. To bridge the gap between the engine design and the characteristics of streaming workloads, Spark implements a concept called micro-batches. This strategy treats streams of data as a series of very small batches that can be handled using the native semantics of the batch engine.

Spark Streaming will buffer the stream in sub-second increments, and then these buffers will be batched as a small fixed data set. The actual effect of this method is very good, but compared with the real stream processing framework, there are still shortcomings in terms of performance.
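The micro-batch idea can be sketched in a few lines of Python. Here a fixed item count stands in for Spark Streaming's sub-second time window (an assumption made to keep the sketch deterministic); each full buffer is handed to a batch-style function:

```python
# Sketch of micro-batching: buffer a stream into small fixed batches.
def micro_batches(stream, batch_size):
    buffer = []
    for item in stream:
        buffer.append(item)
        if len(buffer) == batch_size:
            yield buffer          # hand the small "batch" to the batch engine
            buffer = []
    if buffer:
        yield buffer              # flush whatever remains at the end

stream = iter([3, 1, 4, 1, 5, 9, 2])
batch_sums = [sum(batch) for batch in micro_batches(stream, batch_size=3)]
# batch_sums == [8, 15, 2]; each result waits for its buffer to fill,
# which is exactly the latency cost described above.
```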

Advantages and limitations

The main reason to use Spark instead of Hadoop MapReduce is speed. With the help of memory computing strategies and advanced DAG scheduling mechanisms, Spark can process the same data set faster.

Another important advantage of Spark is its versatility. It can be deployed as a standalone cluster or integrated with an existing Hadoop cluster. It can perform both batch and stream processing, letting a single cluster handle multiple types of tasks.

Beyond the capabilities of the engine itself, an ecosystem of libraries has been built around Spark that provides strong support for machine learning, interactive queries, and more. Spark tasks are widely acknowledged to be easier to write than their MapReduce equivalents, which can have significant implications for productivity.

Because it adopts a batch methodology for stream processing, Spark must buffer data as it enters the system. The buffer allows it to handle a high volume of incoming data and increases overall throughput, but waiting to flush the buffer also leads to an increase in latency. This means that Spark Streaming may not be appropriate for workloads where low latency is imperative.

Since memory is usually more expensive than disk space, Spark is more expensive than disk-based systems. However, the increase in processing speed means that tasks can be completed more quickly, and in environments where resources need to be paid for by the number of hours, this feature can usually offset the increased costs.

Another consequence of Spark's in-memory design is that resource scarcity can be an issue when it is deployed on a shared cluster. Compared with Hadoop MapReduce, Spark consumes more resources and can affect other tasks that need the cluster at the same time. In essence, Spark may be a less natural fit alongside other components of the Hadoop stack.

Summary

Spark is a great option for those with diverse processing workloads. Spark batch processing offers incredible speed advantages at the cost of high memory usage. For workloads that value throughput over latency, Spark Streaming is a good stream processing solution.

Apache Flink

Apache Flink is a stream processing framework that can also handle batch tasks. It considers batches to simply be data streams with finite boundaries, and thus treats batch processing as a subset of stream processing. This stream-first approach to all processing has a number of interesting side effects.

This stream processing-first approach is also called the Kappa architecture , as opposed to the more well-known Lambda architecture (in which batch processing is used as the main processing method, and streams are used as a supplement and provide early unrefined results). In the Kappa architecture, everything is streamed to simplify the model, and all of this is only feasible after the recent stream processing engine has gradually matured.

Stream processing model

Flink's stream processing model treats each item as a real data stream when processing incoming data. The DataStream API provided by Flink can be used to process endless data streams. The basic components that can be used with Flink include:

  • Stream: an immutable, unbounded data set that flows through the system
  • Operator: a function that operates on data streams to produce other streams
  • Source: the entry point for streams into the system
  • Sink: the point where streams leave the Flink system; a sink can be a database or a connector to another system
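The four roles above can be sketched with plain Python pieces: a source generator, an operator function that turns one stream into another, and a sink that receives records as they leave the pipeline. This illustrates the roles only; it is not Flink's DataStream API:

```python
# Illustrative source -> operator -> sink pipeline (values are made up).
def source():
    # Entry point of the data stream into the system.
    yield from [2, 7, 1, 8]

def square_operator(stream):
    # An operator consumes one stream and produces another.
    for value in stream:
        yield value * value

def sink(stream, destination):
    # The sink is where records land after leaving the system,
    # e.g. a database; here it is simply a list.
    for value in stream:
        destination.append(value)

out = []
sink(square_operator(source()), out)
# out == [4, 49, 1, 64]
```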

In order to be able to recover after a problem is encountered in the calculation process, the stream processing task will create a snapshot at a predetermined point in time. In order to achieve state storage, Flink can be used with a variety of state backend systems, depending on the complexity and durability level required to be implemented.

Additionally, Flink's stream processing understands the concept of "event time", meaning the time that an event actually occurred, and it can also handle sessions. This means that ordering and grouping can be guaranteed in some interesting ways.
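The event-time idea can be shown with a small sketch: records are grouped by the timestamp carried in each event (when it actually occurred), not by arrival order, so a late-arriving record still lands in the correct window. The window size and events here are hypothetical:

```python
# Sketch of event-time windowing: bucket events by their own timestamps.
WINDOW_SECONDS = 60

def window_for(event_timestamp):
    # Start of the fixed window that contains this event time.
    return event_timestamp - (event_timestamp % WINDOW_SECONDS)

# (event time, payload); "c" occurred early but arrives late, after "b".
events = [(10, "a"), (130, "b"), (50, "c")]

windows = {}
for ts, payload in events:
    windows.setdefault(window_for(ts), []).append(payload)
# windows == {0: ["a", "c"], 120: ["b"]}: the late record "c" still
# joins the window where its event actually happened.
```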

Batch model

Flink's batch processing model is to a large extent only an extension of the stream processing model. At this point, the model no longer reads data from the continuous stream, but reads the bounded data set in the form of a stream from the persistent storage. Flink uses exactly the same runtime for these processing models.

Flink is able to apply some optimizations to batch workloads. For example, since batch operations are backed by persistent storage, Flink can skip snapshotting for batch workloads. Data is still recoverable, but normal processing completes faster.

Another optimization is to decompose batch processing tasks so that different stages and components can be called when needed. In this way, Flink can better coexist with other users in the cluster. Analyzing tasks in advance allows Flink to view all the operations that need to be performed, the size of the data set, and the operation steps that need to be performed downstream, thereby achieving further optimization.

Advantages and limitations

Flink is currently unique in the processing framework world. Although Spark performs batch and stream processing, its streaming is not appropriate for many use cases because of its micro-batch architecture. Flink's stream-first approach offers low latency, high throughput, and real entry-by-entry processing.

Flink manages many things itself. Though somewhat unconventional, for performance reasons it manages its own memory rather than relying on native Java garbage collection. Unlike Spark, Flink does not require manual optimization and adjustment when the characteristics of the data it processes change, and it handles data partitioning and caching automatically as well.

Flink divides the work in a variety of ways to optimize tasks. This kind of analysis is similar in part to the optimization of the relational database by the SQL query planner, which can determine the most efficient implementation method for a specific task. The technology also supports multi-stage parallel execution, and at the same time, the data of blocked tasks can be gathered together. For iterative tasks, for performance considerations, Flink will try to perform corresponding computing tasks on the nodes where the data is stored. In addition, "incremental iteration" can be performed, or only the changed parts of the data can be iterated.

In terms of user tools, Flink provides a web-based scheduling view to easily manage tasks and view system status. Users can also view the optimization plan of the submitted task to understand how the task is finally implemented in the cluster. For analytical tasks, Flink provides SQL-like queries, graphical processing, and machine learning libraries, and also supports memory computing.

Flink works well with other components. If used with the Hadoop stack, the technology can be well integrated into the entire environment, occupying only the necessary resources at any time. This technology can be easily integrated with YARN, HDFS and Kafka. With the help of compatibility packages, Flink can also run tasks written for other processing frameworks, such as Hadoop and Storm.

One of the biggest limitations of Flink at present is that this is still a very "young" project. The large-scale deployment of the project in the real environment is not as common as other processing frameworks, and there is no more in-depth research on the limitations of Flink in terms of scaling capabilities. With the advancement of the rapid development cycle and the improvement of compatibility packages and other functions, more and more Flink deployments may appear when more and more organizations start to try.

Summary

Flink provides low-latency stream processing, while supporting traditional batch processing tasks. Flink may be most suitable for organizations with extremely high stream processing requirements and a small number of batch processing tasks. This technology is compatible with native Storm and Hadoop programs, and can run on a cluster managed by YARN, so it can be easily evaluated. The rapid progress of development work makes it worthy of everyone's attention.

Conclusion

There are many options for processing technologies within a big data system.

For workloads that only require batch processing, if it is not time-sensitive, Hadoop, which is less expensive to implement than other solutions, will be a good choice.

For stream-only workloads, Storm offers wide language support and can deliver very low-latency processing, but its default configuration may produce duplicate results and cannot guarantee ordering. Samza's tight integration with YARN and Kafka provides flexibility, easy multi-team usage, and straightforward replication and state management.

For mixed workloads, Spark provides high-speed batch processing and micro-batch stream processing. It has broad support, integrated libraries, and tooling, and allows for flexible integrations. Flink provides true stream processing with batch processing support. It is heavily optimized, can run tasks written for other platforms, and provides low-latency processing, but it is still in the early days of adoption.

The best fit for your situation depends heavily on the state of the data to be processed, how time-bound your requirements are, and what kinds of results you are interested in. There are trade-offs between implementing an all-in-one solution and working with tightly focused projects; similar considerations apply when evaluating any new and innovative solution against its mature and well-tested counterparts.

 


Origin blog.csdn.net/dualvencsdn/article/details/111621859