Big data essentials: the open source technologies you need to know!

Big data is a general term for the non-traditional strategies and technologies needed to collect, organize, and process very large data sets and derive insights from them. Although the computing power and storage capacity required to handle such data long ago exceeded the limits of a single computer, the ubiquity, scale, and value of this type of computing have only undergone massive expansion in recent years.

What is a big data processing framework?

Processing frameworks and processing engines are responsible for computing over the data in a data system. Although there is no authoritative definition distinguishing an "engine" from a "framework", most of the time the former can be defined as the component actually responsible for operating on the data, while the latter can be defined as a set of components serving a similar role.

For example, Apache Hadoop can be seen as a processing framework with MapReduce as its default processing engine. Engines and frameworks can often be swapped for one another or used together. For instance, Apache Spark, another framework, can be incorporated into Hadoop and substituted for MapReduce. This interoperability between components is one reason big data systems are so flexible.

To simplify the discussion of these components, we will classify the different processing frameworks by design intent, according to the state of the data they are built to handle. Some systems process data in batch mode; some process data as a continuous stream flowing into the system; and some can handle both types of data.

This article introduces some of the most practical big data frameworks:

  • Batch-only frameworks:

Apache Hadoop

  • Stream-only frameworks:

Apache Storm, Apache Samza

  • Hybrid frameworks:

Apache Spark, Apache Flink

Batch processing frameworks

Batch processing has a long history in the big data world. Batch processing mainly operates on large, static data sets and returns the results once the computation is complete. The data sets used in batch processing generally have the following characteristics:

  • Bounded: a batch data set represents a finite collection of data

  • Persistent: the data is usually always stored in some type of persistent storage

  • Large: batch operations are usually the only way to process extremely large data sets

Representative framework: Apache Hadoop

Apache Hadoop is a processing framework dedicated to batch processing. Hadoop was the first big data framework to gain major attention in the open source community. Drawing on Google's published papers on massive-scale data processing and on its accumulated experience, Hadoop re-implements the relevant algorithms and component stack, making large-scale batch processing easier to use. Modern versions of Hadoop comprise multiple components, that is, multiple layers, which work together to process batch data:

  • HDFS: HDFS is the distributed file system layer, which coordinates storage and replication across the cluster nodes. HDFS ensures that data remains available after inevitable node failures; it can be used as the source of data, to store intermediate processing results, and to store the final computed results (a minimal usage sketch follows this list).

  • YARN: YARN, short for Yet Another Resource Negotiator, acts as the cluster-coordination component of the Hadoop stack. It is responsible for coordinating and managing the underlying resources and for scheduling the jobs to be run. By acting as an interface to the cluster's resources, YARN makes it possible to run far more types of workloads on a Hadoop cluster than in earlier iterations.

  • MapReduce: MapReduce is Hadoop's native batch processing engine.
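As a minimal illustration of HDFS as a storage layer, the sketch below uses Hadoop's Java file system API to copy a local file into HDFS; the NameNode address and the paths are hypothetical:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
        try (FileSystem fs = FileSystem.get(conf)) {
            // Copy a local file into HDFS, where it is replicated across cluster nodes.
            fs.copyFromLocalFile(new Path("/tmp/events.log"), new Path("/data/raw/events.log"));
            System.out.println("stored: " + fs.exists(new Path("/data/raw/events.log")));
        }
    }
}
```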

Processing mode

Hadoop's processing capability comes from the MapReduce engine. MapReduce's processing technique follows the map, shuffle, reduce algorithm built on key-value pairs. The basic process includes the following steps (a word-count sketch follows the list):

  • Read the data set from the HDFS file system

  • Split the data set into small chunks and distribute them across the available nodes

  • Apply the computation to the subset of data on each node (the intermediate results are written back to HDFS)

  • Redistribute the intermediate results and group them by key

  • "Reduce" the value of each key by summarizing and combining the results computed on the individual nodes

  • Write the final computed results back to HDFS
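To make these steps concrete, here is a minimal word-count sketch written against the Hadoop MapReduce Java API; the class name and the input/output paths passed on the command line are illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit a (word, 1) pair for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts grouped for each word after the shuffle.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input read from HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // final results written back to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The map phase emits (word, 1) pairs, the shuffle groups them by key, and the reduce phase sums each group; the combiner performs the same reduction locally on each node to cut shuffle traffic.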

Advantages and limitations

Since this approach relies heavily on persistent storage, with each task performing multiple read and write operations, it is relatively slow. On the other hand, because disk space is usually the most plentiful resource on a server, MapReduce can handle truly massive data sets. It also means that, compared with similar technologies, Hadoop's MapReduce can typically run on inexpensive hardware, since the technology does not require everything to be held in memory. MapReduce has huge scaling potential; production deployments spanning tens of thousands of nodes exist.

Stream processing frameworks

Stream processing systems compute on data as soon as it enters the system. Compared with the batch paradigm, this is a very different approach. Rather than performing operations on an entire data set, stream processing performs operations on each data item as it passes through the system. The data set in stream processing is "boundless", which has several important consequences:

  • The complete data set can only represent the amount of data that has entered the system so far.

  • The working data set is perhaps more relevant, and at any given moment can only represent a single data item.

  • Processing is event-based, and has no "end" unless it is explicitly stopped. Results are available immediately and are continually updated as new data arrives.

Stream processing frameworks can handle a virtually unlimited amount of data, but at any given time they process only one item (true stream processing) or a very small number of items (micro-batch processing), maintaining only minimal state between records. Although most systems provide ways to maintain some state, stream processing is primarily optimized for more functional processing (functional processing) with fewer side effects.

Representative framework: Apache Storm

Apache Storm is a stream processing framework focused on extremely low latency, and is perhaps the best option for workloads that demand near-real-time processing. It can handle very large volumes of data and deliver results with lower latency than other solutions.

Processing mode

Storm stream processing is organized through DAGs (Directed Acyclic Graphs) called topologies. These topologies describe the various transformations or steps to be performed on each incoming fragment of data as it enters the system. A topology includes the following (a sketch follows the list):

  • Stream: an ordinary data stream, i.e. boundless data that continuously arrives at the system.

  • Spout: a source of data streams located at the edge of the topology; this may be, for example, an API or a queue, from which the data to be processed is produced.

  • Bolt: a Bolt represents a processing step that consumes data streams, applies an operation to them, and outputs the result as a new stream. Bolts connect to Spouts, and then to one another, to form all the necessary processing. At the end of the topology, the output of the final Bolts can be used as input to other connected systems.
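The following sketch, assuming the Storm 2.x Java API, wires a trivial Spout that emits sentences to a Bolt that splits them into words; the component names and sample data are invented for illustration:

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordTopology {
    // Spout: the edge of the topology; here it simply emits sentences forever.
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] sentences = {"the quick brown fox", "jumped over the lazy dog"};
        private int i = 0;

        public void open(Map<String, Object> conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            collector.emit(new Values(sentences[i++ % sentences.length]));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("sentence"));
        }
    }

    // Bolt: consumes the sentence stream and emits one tuple per word.
    public static class SplitBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            for (String word : input.getStringByField("sentence").split(" ")) {
                collector.emit(new Values(word));
            }
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout());
        builder.setBolt("split", new SplitBolt(), 2).shuffleGrouping("sentences");

        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("demo", new Config(), builder.createTopology());
            Thread.sleep(10_000); // let the unbounded stream run briefly, then shut down
        }
    }
}
```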

Storm also offers a higher-level abstraction called Trident, which replaces pure stream processing with micro-batches. A Trident topology includes (see the sketch after this list):

  • Stream batches: micro-batches of stream data, chunked to provide batch processing semantics.

  • Operations: batch procedures that can be performed on the data.
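A minimal Trident sketch, adapted from the pattern in Storm's Trident documentation, groups a micro-batched sentence stream into per-word counts; the spout data and component names are illustrative:

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TridentWordCount {
    // A Trident operation: splits each sentence tuple into word tuples.
    public static class Split extends BaseFunction {
        public void execute(TridentTuple tuple, TridentCollector collector) {
            for (String word : tuple.getString(0).split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // A test spout that feeds the stream in fixed-size micro-batches.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
            new Values("the cow jumped over the moon"),
            new Values("the man went to the store"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();
        topology.newStream("sentences", spout)
            .each(new Fields("sentence"), new Split(), new Fields("word"))
            .groupBy(new Fields("word"))
            // Batch semantics: counts are aggregated per micro-batch,
            // then merged into (in-memory) persistent state.
            .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("trident-demo", new Config(), topology.build());
            Thread.sleep(10_000);
        }
    }
}
```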

Advantages and limitations

For near-real-time processing, Storm may currently be the best solution. It can process data with extremely low latency and suits workloads that demand the lowest possible latency. If processing speed directly affects the user experience, for example when results are delivered straight to a web page the visitor has open, Storm is a good choice.

Pairing Storm with Trident lets users replace pure stream processing with micro-batches. Although this gives users greater flexibility and more tools for meeting their requirements, it also erodes the technology's biggest advantage over other solutions. That said, having one more stream processing option never hurts. In terms of interoperability, Storm can integrate with Hadoop's YARN resource manager, so it can be slotted easily into existing Hadoop deployments. Unlike most processing frameworks, Storm also supports topologies written in multiple languages, giving users more choices when defining them.

Hybrid frameworks: batch and stream processing

Some processing frameworks can handle both batch and stream processing workloads. These frameworks process both types of data with the same or related components and APIs, thereby simplifying differing processing needs. As you will see, this capability is realized mainly by Spark and Flink, which are introduced below.

The key to this capability is how the two different processing modes are unified, and what assumptions are made about the relationship between fixed and unfixed data sets. Although projects focused on a single processing type will fit specific use cases more closely, hybrid frameworks aim to provide a general-purpose solution for data processing. Such frameworks not only provide methods for processing data, but also ship their own integrations, libraries, and tools for graph analysis, machine learning, interactive queries, and many other tasks.

Representative framework: Apache Spark

Apache Spark is a next-generation batch processing framework that also includes stream processing capabilities. Developed on many of the same principles as Hadoop's MapReduce engine, Spark focuses primarily on speeding up batch workloads through full in-memory computation and processing optimizations. Spark can be deployed as a standalone cluster (paired with a capable storage layer), or it can integrate with Hadoop and replace the MapReduce engine.

Batch processing mode

Unlike MapReduce, Spark processes all data in memory, interacting with the storage layer only to read the data into memory at the outset and to persist the final results. All intermediate processing results are kept in memory. Although in-memory processing contributes substantially to performance, Spark is also much faster on disk-related tasks, because holistic optimization can be achieved by analyzing the complete set of tasks ahead of time.

To achieve this, Spark builds a Directed Acyclic Graph (DAG) that represents all the operations to be performed, the data to be operated on, and the relationships between the operations and the data, which lets the processor coordinate tasks more intelligently. To implement in-memory batch computation, Spark processes data with a model called a Resilient Distributed Dataset (RDD): an immutable structure, existing only in memory, that represents a data set.

Operations performed on an RDD produce a new RDD. Each RDD can trace its lineage back through its parent RDDs, and ultimately back to the data on disk. Through RDDs, Spark achieves fault tolerance without having to write the result of every operation back to disk.
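The sketch below, assuming Spark's Java RDD API, shows the model in miniature: each transformation yields a new immutable RDD, and nothing executes until an action triggers the whole DAG at once. The HDFS paths are hypothetical:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class RddWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Each transformation below produces a new, immutable RDD; Spark only
            // plans and runs the DAG when the action (saveAsTextFile) is called.
            JavaRDD<String> lines = sc.textFile("hdfs:///input/text"); // hypothetical path
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);
            counts.saveAsTextFile("hdfs:///output/counts"); // hypothetical path
        }
    }
}
```

If a partition is lost, Spark can recompute it from the lineage (textFile, then flatMap, and so on) instead of restoring it from checkpointed disk state.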

Streaming mode

The stream processing capability is provided by Spark Streaming. Spark itself was designed primarily with batch workloads in mind; to bridge the gap between the engine's design and the characteristics of stream processing workloads, Spark implements a concept called micro-batches. This strategy treats the data stream as a series of very small "batches", which can then be handled through the native semantics of the batch engine. Spark Streaming buffers the stream in sub-second increments, and each buffer is then processed as a small fixed data set. This approach works well in practice, but it still falls short of true stream processing frameworks in terms of latency.
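A minimal Spark Streaming sketch, assuming the Java DStream API, illustrates the micro-batch idea; the socket source and port are hypothetical, and each one-second buffer is processed as a small batch:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]");
        // Every 1-second buffer of the stream becomes a small RDD processed with batch semantics.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999); // hypothetical source
        lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator())
             .mapToPair(word -> new Tuple2<>(word, 1))
             .reduceByKey(Integer::sum)
             .print(); // results update as each micro-batch completes

        jssc.start();
        jssc.awaitTermination();
    }
}
```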

Advantages and limitations

Thanks to its in-memory computing strategy and the advanced scheduling enabled by the DAG, Spark can process the same data sets much faster. Another important advantage of Spark is its versatility: it can be deployed as a standalone cluster or integrated with an existing Hadoop cluster, and it can run both batch and stream processing, letting a single cluster handle different types of tasks.

Conclusion

Of course, there are many other good big data frameworks beyond the few briefly introduced above. However, the ones covered here are among the most popular today, and developers and companies alike can choose among them according to their own needs. A good framework will certainly help developers extract the maximum value from their data in the shortest possible time.


Source: www.cnblogs.com/1994jinnan/p/11955399.html