The future of data architecture: on stream processing architecture

Data architecture design is undergoing a revolution whose impact reaches beyond real-time processing. Stream-based processing can be viewed as the core of the whole architecture, rather than as real-time computation used by a single project. This article compares traditional data architecture with stream processing architecture and describes how stream processing architecture applies to microservices and to the overall system.

Traditional data architecture

Traditional data architecture is a centralized data system that can be divided into business data systems and big data systems.

Business data systems store transactional data in SQL or NoSQL databases. This data must be accurate and needs to be updated; it backs systems such as user services and payment services and is the core that supports the whole business.

Big data systems mainly store data that does not need to be updated frequently. Because the data volume is too large, frameworks such as Hadoop are used to process it. These systems compute results on a schedule, for example tallying the previous day's user visits at midnight, and the results are typically written to an SQL database to complete the statistics.
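
As a rough illustration of such a scheduled job, the sketch below counts one day's visits and writes the total to an SQL table. It uses Python's built-in sqlite3 module in place of a real big data framework; the visit records, the table name daily_visits and the function run_daily_batch are all made up for this example.

```python
import sqlite3

# Hypothetical raw visit log as (date, user_id) pairs. In a real system this
# data would sit in a big data store, not in a Python list.
VISITS = [
    ("2024-05-01", "alice"),
    ("2024-05-01", "bob"),
    ("2024-05-01", "alice"),
    ("2024-05-02", "carol"),
]

def run_daily_batch(conn, visits, day):
    """Count the visits for one day and store the total in an SQL table."""
    total = sum(1 for visit_day, _user in visits if visit_day == day)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_visits "
        "(day TEXT PRIMARY KEY, visits INTEGER)"
    )
    # INSERT OR REPLACE keeps the job safe to rerun for the same day.
    conn.execute("INSERT OR REPLACE INTO daily_visits VALUES (?, ?)", (day, total))
    conn.commit()
    return total

conn = sqlite3.connect(":memory:")
run_daily_batch(conn, VISITS, "2024-05-01")
rows = conn.execute("SELECT day, visits FROM daily_visits").fetchall()
print(rows)  # [('2024-05-01', 3)]
```

The result only becomes available after the job has run, which is the latency that the rest of this article contrasts with stream processing.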

Real-time data systems are often used only for individual projects, such as real-time log alerting systems or real-time recommendation systems.

The reason for this design is the limit on data processing performance and accuracy. As mentioned in the earlier article "Streaming: the future of big data", event time is uncontrollable, so real-time data could not be treated as an accurate and reliable data source. Low-latency requirements also consume a large share of system performance.
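
The effect of uncontrollable event time can be shown with a small sketch. It assumes each event carries both the time it happened and the time it arrived, and counts events per 60-second window in both ways. The timestamps and the window length are invented for illustration.

```python
from collections import Counter

# Each event is (event_time_seconds, arrival_time_seconds). The third event
# happened in the first minute but arrived late, during the second minute.
EVENTS = [(5, 6), (20, 21), (58, 65), (70, 71)]

WINDOW = 60  # window length in seconds (an assumed value)

def count_per_window(events, use_event_time):
    """Count events per window, keyed by the window start time."""
    counts = Counter()
    for event_time, arrival_time in events:
        ts = event_time if use_event_time else arrival_time
        counts[ts // WINDOW * WINDOW] += 1
    return dict(counts)

by_arrival = count_per_window(EVENTS, use_event_time=False)
by_event_time = count_per_window(EVENTS, use_event_time=True)
print(by_arrival)     # {0: 2, 60: 2}
print(by_event_time)  # {0: 3, 60: 1}
```

Counting by arrival time puts the late event in the wrong window, so the result depends on network delays rather than on what actually happened.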

This traditional architecture served successfully for decades, but as large-scale distributed systems have grown more complex, it has become overwhelmed. Many companies encounter the following problems.
• In many projects, the workflow from data arrival to the desired analysis is too complex and too slow.

• Traditional data architecture is too monolithic: the database is the only source of correct data, and every application has to access the database to obtain the data it needs.

• Systems built this way have very complex failure modes. When a failure occurs, it is difficult to keep the system running well.

In addition, as systems scale, keeping the stored state consistent with the actual data becomes more and more difficult, because the global state has to be constantly updated and maintained.

Stream processing architecture

As a new option, stream processing architecture addresses many of the problems that enterprises encounter in large-scale systems. In a stream-based architecture, data records flow continuously from the sources to the applications and between the applications. There is no centralized database that stores global state. Instead, a shared, never-ending stream of data serves as the only source of correct data and records the history of the business data. Each application keeps its own data in a local database or in distributed file storage.
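
The idea that the stream is the source of truth and each application derives its own data from it can be sketched as follows. The order events and the two applications (billing and analytics) are hypothetical, and a plain list stands in for the shared stream.

```python
# Shared, append-only stream of business events (hypothetical order events).
STREAM = [
    {"type": "order_placed", "user": "alice", "amount": 30},
    {"type": "order_placed", "user": "bob", "amount": 15},
    {"type": "order_cancelled", "user": "bob", "amount": 15},
    {"type": "order_placed", "user": "alice", "amount": 20},
]

def build_revenue_view(stream):
    """Billing application: local state is the net revenue per user."""
    revenue = {}
    for event in stream:
        sign = 1 if event["type"] == "order_placed" else -1
        revenue[event["user"]] = revenue.get(event["user"], 0) + sign * event["amount"]
    return revenue

def build_activity_view(stream):
    """Analytics application: local state is the number of events per type."""
    counts = {}
    for event in stream:
        counts[event["type"]] = counts.get(event["type"], 0) + 1
    return counts

revenue_view = build_revenue_view(STREAM)
activity_view = build_activity_view(STREAM)
print(revenue_view)   # {'alice': 50, 'bob': 0}
print(activity_view)  # {'order_placed': 3, 'order_cancelled': 1}
```

Neither application writes to a shared database. If one of them loses its local state, it can rebuild it by reading the stream again.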

Previously this approach was impractical. It requires a messaging system that can consume messages repeatedly while maintaining high performance, and it requires accurate handling of event time. Now that we have Kafka and Flink, all of this becomes much simpler.

A stream processing architecture consists mainly of two parts: the messaging layer and the stream processing layer. The data sources are continuous streams of messages, such as logs, clickstream events and Internet of Things (IoT) data. The output data can flow to a variety of destinations.

The messaging layer continuously collects the events generated by various data sources (producers) and delivers them to the applications and services that subscribe to this data (consumers). This design decouples producers from consumers. Through the concept of a topic, the layer accepts data from multiple sources and serves multiple consumers, and a consumer neither has to be running when a message arrives nor has to process it immediately. The messaging layer needs high performance and durability. It acts as a buffer that can retain event data for a short period. Kafka provides both high performance and durability: its offset mechanism and persisted messages allow messages to be replayed and recomputed, and its disk-based reads and writes combined with caching achieve high throughput.
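
The offset and replay behaviour described above can be sketched with a toy in-memory topic. The Topic class below is an illustration only and is not Kafka's API; the method names publish, poll and seek and the group names are assumptions chosen for this example.

```python
class Topic:
    """Toy in-memory stand-in for a message topic. It is NOT Kafka's API."""

    def __init__(self):
        self._log = []       # append-only list of messages
        self._offsets = {}   # next offset to read, per consumer group

    def publish(self, message):
        self._log.append(message)
        return len(self._log) - 1  # offset of the stored message

    def poll(self, group, max_messages=10):
        """Return the next messages for a group and advance its offset."""
        start = self._offsets.get(group, 0)
        batch = self._log[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Move a group's offset so that messages can be replayed."""
        self._offsets[group] = offset

topic = Topic()
for line in ["login alice", "click bob", "login carol"]:
    topic.publish(line)

first_read = topic.poll("alerts")        # reads all three messages
late_read = topic.poll("dashboard", 2)   # another group reads at its own pace
topic.seek("alerts", 0)                  # rewind the first group
replayed = topic.poll("alerts")          # the same messages again
print(first_read == replayed, late_read)
```

Because each group tracks its own offset, producers never need to know who reads the data, and a consumer that starts late or rewinds does not disturb the others.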

The stream processing layer serves three purposes: ① continuously moving data between applications and systems; ② aggregating and processing events; ③ maintaining application state locally.

Flink provides all of these capabilities. Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink runs in all common cluster environments and performs computations at in-memory speed and at any scale.
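
What "stateful computation over a stream" means can be sketched in plain Python. This is not Flink code: the function running_average and the sensor readings are hypothetical, and a generator stands in for a stream operator that keeps its state locally.

```python
from collections import defaultdict

def running_average(events):
    """Consume (key, value) events one by one and emit (key, average so far).

    The per-key count and total are the operator's local state; nothing is
    looked up in an external database.
    """
    state = defaultdict(lambda: [0, 0.0])  # key -> [count, total]
    for key, value in events:
        entry = state[key]
        entry[0] += 1
        entry[1] += value
        yield key, entry[1] / entry[0]

# A bounded list stands in for the stream here. Any iterator would work,
# including an endless one, because events are processed one at a time.
readings = [("sensor-a", 10.0), ("sensor-b", 4.0), ("sensor-a", 20.0)]
results = list(running_average(readings))
print(results)  # [('sensor-a', 10.0), ('sensor-b', 4.0), ('sensor-a', 15.0)]
```

A real engine additionally has to distribute this state across machines and recover it after failures, which the sketch does not attempt.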

Applying stream processing architecture to microservices and the overall system

Applied to microservices

As described above, in a stream processing architecture the stream data flows out of the Kafka message queue. Flink subscribes to data from the message queue and processes it. The processed data can flow on to another message queue, so all applications can share the stream data.

A streaming architecture based on microservices also gives flexibility to the developers of, for example, a fraud detection system. The cost of adding a new data consumer is almost negligible, and where appropriate, historical data can be saved in any format using any database service. Messages can be reused, processed and persisted, which lets them deliver their greatest value.

Applied to the overall system

In fact, the role of stream processing architecture goes well beyond this. It is not limited to real-time applications that consume data streams, although these are a very important use.

[Figure: categories of consumers that benefit from the stream processing architecture]

The figure shows the categories of consumers that benefit from the stream processing architecture. Group A consumers may perform all kinds of real-time analysis, including dashboards that are updated in real time. Group B consumers record the current state of the data, which may also be stored in a database or in searchable files.

For example, a power monitoring system needs real-time alarms for power failures, real-time monitoring of current and voltage data, and persistent data for historical analysis, forecasting and so on.
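
A minimal sketch of this example, assuming a single stream of readings that three independent consumers read, might look like the following. The reading format, the rule that a voltage of 0 means a power failure, and the consumer names are assumptions made for illustration.

```python
# Hypothetical readings; a voltage of 0 is treated as a power failure.
READINGS = [
    {"line": "L1", "voltage": 220.0, "current": 5.1},
    {"line": "L2", "voltage": 0.0, "current": 0.0},
    {"line": "L1", "voltage": 221.5, "current": 5.3},
]

def alarm_consumer(readings):
    """Real-time alarm: report the lines whose voltage dropped to zero."""
    return [r["line"] for r in readings if r["voltage"] == 0.0]

def monitor_consumer(readings):
    """Live monitoring: keep only the latest reading per line."""
    latest = {}
    for r in readings:
        latest[r["line"]] = (r["voltage"], r["current"])
    return latest

def archive_consumer(readings, store):
    """Persistence: append every reading for later historical analysis."""
    store.extend(readings)
    return len(store)

history = []
alarms = alarm_consumer(READINGS)
latest = monitor_consumer(READINGS)
archived = archive_consumer(READINGS, history)
print(alarms)    # ['L2']
print(latest)    # {'L1': (221.5, 5.3), 'L2': (0.0, 0.0)}
print(archived)  # 3
```

All three consumers read the same stream and none depends on another, so a further consumer, such as a forecasting job, could be added without changing the existing ones.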

This article has briefly compared traditional data architecture with stream processing architecture and outlined the advantages of the latter. Such a system also faces many challenges because of its complexity, and a deep understanding of Kafka and Flink makes them easier to handle.

References:
• Flink basics tutorial
• Flink official documentation
• Flink Quick Start
• Streaming: the future of big data
• What is Kafka?
