What is stream computing?

I. Background of stream computing

In everyday work we usually store data in tables first and then process and analyze it, which raises a question of timeliness. If the data is handled in units of years or months, the real-time requirement is not high; but if it is handled in units of days, hours, or even minutes, the timeliness requirement becomes relatively high. In that second scenario, the traditional approach of collecting all the data, storing it in a database, and only then analyzing it may not meet the timeliness requirement.

II. Stream computing and batch computing

Big data computing modes include batch computing, stream computing, interactive computing, graph computing, and others. Of these, stream computing and batch computing are the two main modes, each suited to different big data scenarios.
A data stream is an unbounded, dynamic collection of data that is unlimited in distribution and volume over time. The value of the data decreases as time passes, so it must be computed in real time, with a response given within seconds. Stream computing, as the name suggests, processes the data stream and computes on it in real time. Batch computing collects the data, stores it in a database, and then processes it in batches. The two differ mainly in the following aspects:
1. Timeliness: stream computing is real-time with low latency; batch computing is non-real-time with high latency.
2. Data characteristics: stream data is generally dynamic and unbounded, while batch data is typically static.
3. Application scenarios: stream computing is used where timeliness requirements are relatively high, such as real-time recommendation and operational monitoring. Batch computing is generally used where real-time requirements are low, such as offline computation, data analysis, and offline reporting.
4. Mode of operation: a batch computing task runs once and finishes, while a stream computing task runs continuously.
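
The contrast between the two modes can be sketched in a few lines of Python. This is a minimal illustration, not the API of any framework named in this article; `batch_average` and `stream_average` are hypothetical names. The batch version needs the whole, bounded data set before it returns one answer, while the streaming version keeps a small running state, emits an updated answer after every record, and never needs the input to end.

```python
from typing import Iterable, Iterator, List


def batch_average(stored: List[float]) -> float:
    """Batch style: the data is already collected; compute once over all of it."""
    return sum(stored) / len(stored)


def stream_average(records: Iterable[float]) -> Iterator[float]:
    """Stream style: keep a running count and total, emit a result per record."""
    count = 0
    total = 0.0
    for value in records:  # `records` may be unbounded
        count += 1
        total += value
        yield total / count


# Illustrative readings (made-up values).
readings = [10.0, 20.0, 30.0, 40.0]
batch_result = batch_average(readings)
stream_results = list(stream_average(readings))
print(batch_result)    # 25.0
print(stream_results)  # [10.0, 15.0, 20.0, 25.0]
```

The last streamed value matches the batch result; the difference is that intermediate answers were available as soon as each record arrived.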

III. Stream computing frameworks, platforms, and related products

The first category is commercial stream computing platforms (IBM InfoSphere Streams, TIBCO StreamBase, etc.);
the second category is open-source stream computing frameworks (Twitter Storm, S4, etc.);
the third category is stream computing frameworks that companies develop in-house to support their own business.
Storm: Twitter's first-generation stream processing system.
Heron: Twitter's second-generation stream processing system.
Spark Streaming: an extension of the core Spark API that provides high-throughput, fault-tolerant, real-time processing of streaming data.
Flink: a distributed processing engine for both streaming data and batch data.
Apache Kafka: written in Scala and Java. The project's goal is to provide a unified, high-throughput, low-latency platform for handling real-time data.

IV. Main application scenarios of stream computing

Stream computing can be applied in two different kinds of scenarios: event streams and continuous computation.
1. Event streams
Event streams continuously produce large amounts of data. Such data first appeared in traditional banking and stock trading, and also arises in Internet monitoring and wireless communication networks. It requires complex analysis over continuously updated data streams in near real time, such as trend analysis, forecasting, and monitoring. Put simply, in event-stream processing the query statement stays fixed while the data keeps changing.
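
The idea of a fixed query over changing data can be sketched as a standing filter applied to each arriving event. This is only an illustration: the trade records, the `price_alerts` function, and the threshold are all made up, and a real system would read from a live feed rather than a list.

```python
from typing import Any, Dict, Iterable, Iterator

Event = Dict[str, Any]


def price_alerts(events: Iterable[Event], symbol: str, limit: float) -> Iterator[Event]:
    """A standing query: the condition never changes, the events keep arriving."""
    for event in events:
        if event["symbol"] == symbol and event["price"] > limit:
            yield event


# Hypothetical trade events standing in for a live feed.
trades = [
    {"symbol": "ABC", "price": 99.5},
    {"symbol": "XYZ", "price": 250.0},
    {"symbol": "ABC", "price": 101.2},
    {"symbol": "ABC", "price": 100.0},
]
alerts = list(price_alerts(trades, "ABC", 100.0))
print(alerts)  # [{'symbol': 'ABC', 'price': 101.2}]
```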
2. Continuous computation
One example is streaming data from large websites: page views (PV) and unique visitors (UV), what content users visit, and what they search for. Computing and analyzing this data in real time lets a site refresh user access figures dynamically, display real-time changes in traffic, and analyze daily traffic and user distribution hour by hour.
Another example is the financial sector, where millisecond-level latency is crucial. Other scenarios that require real-time data processing can also use Storm, for example analyzing user behavior in real time from the log files users generate and making real-time product recommendations.
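
As a rough sketch of the PV/UV example, the counter below updates page views and unique visitors per hour as each access record arrives. The record layout `(hour, user_id)` and the class name `TrafficCounter` are assumptions made for illustration; a production system would typically run this logic inside a stream framework and might use an approximate distinct counter instead of in-memory sets.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Set, Tuple


class TrafficCounter:
    """Keeps per-hour page views (PV) and unique visitors (UV) up to date."""

    def __init__(self) -> None:
        self.pv: DefaultDict[int, int] = defaultdict(int)
        self.visitors: DefaultDict[int, Set[str]] = defaultdict(set)

    def update(self, hour: int, user_id: str) -> None:
        """Process one access record as soon as it arrives."""
        self.pv[hour] += 1
        self.visitors[hour].add(user_id)

    def snapshot(self) -> Dict[int, Tuple[int, int]]:
        """Current (PV, UV) per hour; can be read at any moment."""
        return {h: (self.pv[h], len(self.visitors[h])) for h in sorted(self.pv)}


# Hypothetical access log: (hour of day, user id).
access_log = [(9, "u1"), (9, "u2"), (9, "u1"), (10, "u3")]
counter = TrafficCounter()
for hour, user_id in access_log:
    counter.update(hour, user_id)
traffic = counter.snapshot()
print(traffic)  # {9: (3, 2), 10: (1, 1)}
```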

V. The value of stream computing

Big data processing lets us extract value from data, but is that value constant? Clearly not. Some data is most valuable shortly after the event occurs, and that value decreases rapidly over time. The key advantage of stream computing is that it delivers insight faster, typically within milliseconds to seconds.
The value of stream computing lies in extracting the value of business data within a shorter time and turning that low latency into a competitive advantage. For example, in a recommendation engine built on stream computing, a user's behavior and preferences are reflected in the recommendation model sooner; the model captures those preferences with lower latency and can therefore provide more accurate and timely recommendations.
Stream computing can do this because traditional batch computing must wait until the data has accumulated to a certain amount before processing it in bulk, whereas stream computing processes each piece of data as it arrives, which effectively reduces processing delay.
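
The effect of accumulation on delay can be made concrete with a small simulation. This is a simplified sketch under stated assumptions: arrival times are made-up numbers in seconds, computation itself is treated as instantaneous, and `batch_delays` and `stream_delays` are hypothetical helper names. It measures only how long each record waits before processing begins.

```python
from typing import List


def batch_delays(arrival_times: List[float], batch_size: int) -> List[float]:
    """Waiting time per record when processing starts only once a batch is full."""
    delays: List[float] = []
    for start in range(0, len(arrival_times), batch_size):
        batch = arrival_times[start:start + batch_size]
        if len(batch) < batch_size:
            break  # an incomplete batch is still waiting for more data
        processed_at = batch[-1]  # the batch fills when its last record arrives
        delays.extend(processed_at - t for t in batch)
    return delays


def stream_delays(arrival_times: List[float]) -> List[float]:
    """Waiting time per record when each record is processed on arrival."""
    return [0.0 for _ in arrival_times]


# Assumed arrivals: one record per second.
arrivals = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
waits_batch = batch_delays(arrivals, batch_size=3)
waits_stream = stream_delays(arrivals)
print(waits_batch)   # [2.0, 1.0, 0.0, 2.0, 1.0, 0.0]
print(waits_stream)  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

In this toy setting the first record of every batch waits the longest, and a larger batch size increases that wait.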

Origin blog.51cto.com/13945147/2436907