Stream processing: an introduction

What is stream processing

Stream processing is a big data processing technique. It lets users query a continuous data stream and detect conditions quickly, within a short period of receiving the data. The detection period varies from a few milliseconds to a few minutes. For example, with stream processing you can query a data stream coming from a temperature sensor and receive an alert when the temperature reaches the freezing point.
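To make the idea concrete, here is a minimal sketch in Python (an illustration added here, not from the original post): a generator stands in for the never-ending sensor stream, and the "query" is just a condition checked as each reading arrives.

```python
def freezing_alerts(temperature_stream):
    """Emit an alert event for every reading at or below 0 (freezing)."""
    for reading in temperature_stream:   # conceptually never-ending
        if reading <= 0.0:
            yield {"alert": "freezing", "temp": reading}

# A short, finite stand-in for a real sensor's endless stream of readings.
alerts = list(freezing_alerts([4.2, 1.0, -0.5, 3.3, -2.1]))
print(alerts)  # [{'alert': 'freezing', 'temp': -0.5}, {'alert': 'freezing', 'temp': -2.1}]
```

The key property is that an alert is produced as soon as the matching reading arrives, not after the whole data set has been collected.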

Stream processing goes by many names: real-time analytics, streaming analytics, complex event processing, real-time streaming analytics, and event processing. Although some of these terms historically carried differences, the tools (frameworks) have now converged under the term stream processing. (For the history and a list of the relevant frameworks mentioned in this post, see the Quora question.)

Stream processing caught on as a big data technology after Apache Storm, promoted as a Hadoop-like framework that delivers results in real time, was widely adopted. It now has many competitors.

Why use stream processing

Insights derived from processing data are valuable, but not all insights are created equal. Some insights have high value shortly after an event occurs, and that value decays rapidly over time. Stream processing targets such scenarios. Its key advantage is that it can provide insights faster, typically within milliseconds to seconds.

Here are some secondary reasons for using stream processing.

Reason 1: Some data naturally arrives as a never-ending stream of events. To batch-process it, you need to store it, stop collecting at some point, and process the accumulated data. Then you have to do the next batch, and worry about aggregating across multiple batches. In contrast, stream processing handles never-ending data flows gracefully and naturally. You can detect patterns, inspect results, look at multiple levels of focus, and also easily look at data from multiple streams at the same time.

Stream processing naturally fits time-series data and detecting patterns over time. For example, consider detecting the length of a web session in a never-ending stream (this is an example of trying to detect a sequence). Doing this with batches is hard, because some sessions will fall into two batches. Stream processing handles this easily.
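As an illustration (a hypothetical sketch, not from the original post), sessionization can be done in a single pass by keeping per-user state and closing a session after a period of inactivity; no batch boundary can cut a session in half, because state carries across time:

```python
def session_lengths(events, gap=30 * 60):
    """Group a time-ordered stream of (user, timestamp) clicks into sessions.

    A session ends when the user is idle for longer than `gap` seconds,
    and its length is emitted as soon as the session closes.
    """
    open_sessions = {}  # user -> (session_start, last_seen)
    for user, ts in events:
        start, last = open_sessions.get(user, (ts, ts))
        if ts - last > gap:                 # idle too long: close the old session
            yield user, last - start
            start = ts                      # and open a new one
        open_sessions[user] = (start, ts)
    for user, (start, last) in open_sessions.items():
        yield user, last - start            # flush sessions still open at shutdown

events = [("alice", 0), ("alice", 600), ("alice", 4000), ("alice", 4100)]
sessions = list(session_lengths(events))
print(sessions)  # [('alice', 600), ('alice', 100)]
```

Here alice's first session (0 to 600 seconds) is closed the moment her next click arrives after more than a 30-minute gap.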

If you take a step back, the most continuous data series are time-series data: traffic sensors, health sensors, transaction logs, activity logs. Almost all IoT data are time-series data. Hence it makes sense to use a programming model that fits them naturally.

Reason 2: Batch processing lets the data build up and tries to process it all at once, while stream processing processes data as it comes in, spreading the work out over time. Hence stream processing can run on much less hardware than batch processing. Furthermore, stream processing also enables approximate query processing by systematically shedding load. Hence it fits naturally into use cases where an approximate answer is sufficient.
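A running aggregate illustrates the point (an illustrative sketch with made-up numbers): the streaming version keeps constant-size state per stream instead of storing a whole batch, and an answer is available after every event rather than at the end of the batch.

```python
def running_average(stream):
    """Maintain an average over an endless stream with O(1) state,
    instead of storing the whole batch and averaging at the end."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count    # an up-to-date answer after every event

averages = list(running_average([10.0, 20.0, 30.0, 40.0]))
print(averages)  # [10.0, 15.0, 20.0, 25.0]
```

The final value matches what a batch job would compute, but the memory footprint stays fixed no matter how long the stream runs.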

Reason 3: Sometimes there is so much data that it cannot even be stored. Stream processing lets you handle such firehose-scale data and retain only the useful bits.

Reason 4: Finally, a lot of streaming data is already available (for example, customer transactions, activities, website visits), and it will grow faster with IoT use cases (all kinds of sensors). Streaming is a more natural model for thinking about and programming those use cases.

However, stream processing is not a tool for every use case. A good rule of thumb is that if the processing needs multiple passes through the data, or needs random access to the full data set (think of a graph data set), then it is hard to do with streaming. One big missing use case in streaming is training models with machine learning algorithms. On the other hand, if the processing can be done in a single pass over the data, or has temporal locality (processing tends to access recent data), then it is a good fit for streaming.

How to do stream processing

If you want to build an app that handles streaming data and makes real-time decisions, you can either use a tool or build it yourself. The answer depends on how much complexity you plan to handle, how far you want to scale, and how much reliability and fault tolerance you need.

If you want to build the app yourself, place events in a message broker topic (e.g., ActiveMQ, RabbitMQ, or Kafka), write code to receive events from topics in the broker (they become your stream), and then publish results back to the broker. Such code is called an actor.
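Here is a rough sketch of that setup (a hypothetical illustration: in-memory queues stand in for broker topics, where a real app would use an ActiveMQ, RabbitMQ, or Kafka client):

```python
import queue

def overheat_actor(in_topic, out_topic, limit=90.0):
    """Receive events from the input topic and publish alerts to the output topic."""
    while True:
        event = in_topic.get()           # blocks until the broker delivers an event
        if event is None:                # sentinel marking the end of the simulation
            break
        if event["temp"] > limit:
            out_topic.put({"alert": "overheat", "temp": event["temp"]})

# Two in-memory queues stand in for two broker topics.
sensor_topic, alert_topic = queue.Queue(), queue.Queue()
for t in [72.0, 95.5, 88.1, 101.0]:
    sensor_topic.put({"temp": t})
sensor_topic.put(None)

overheat_actor(sensor_topic, alert_topic)
alerts = [alert_topic.get() for _ in range(alert_topic.qsize())]
print(alerts)  # [{'alert': 'overheat', 'temp': 95.5}, {'alert': 'overheat', 'temp': 101.0}]
```

In a real deployment the actor would run as a long-lived process, and the broker would handle delivery, buffering, and fan-out to multiple actors.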

However, instead of coding the above scenario from scratch, you can use a stream processing framework to save time. An event stream processor lets you write the logic for each actor, wire the actors up, and hook up the edges to the data sources. You can either send events directly to the stream processor or send them via a broker.

An event stream processor does the hard work: collecting data, delivering it to each actor, making sure the actors run in the right order, collecting results, scaling if the load is high, and handling failures. Examples include Storm, Flink, and Samza. If you want to build an app this way, check the respective user guides.

Since 2016, a new idea called streaming SQL has emerged (see the article Streaming SQL 101). We call a language a "streaming SQL" language if it lets users write SQL-like queries over streaming data. Many streaming SQL languages are on the rise.

Projects such as WSO2 Stream Processor and SQLStreams have supported SQL for more than five years.

With a streaming SQL language, developers can rapidly incorporate streaming queries into their apps. By 2018, most stream processors supported processing data streams via an SQL language.

Let's look at how SQL maps to streams. A stream is a table of data in motion. Think of a never-ending table where new data appears as time goes on; a stream is such a table. One record or row in a stream is called an event, but it has a schema and behaves just like a database row. To understand these ideas, Tyler Akidau's talk at Strata is a great resource.

When you write ordinary SQL queries, you query data stored in a database. In contrast, when you write a streaming SQL query, you write it over the data that is there now as well as the data that will arrive in the future. Hence, streaming SQL queries never end. Is that a problem? No, because the output of the query is a stream. The moment an event matches, it is placed in the output stream and becomes immediately available.

A stream represents a logical channel that all events of one kind flow through, and it never ends. For example, if we have a temperature sensor in a boiler, we can represent the output of that sensor as a stream. Classic SQL, however, ingests data stored in a database table, processes it, and writes the result to a database table. A streaming query instead ingests a stream of data as it arrives and produces a stream of data as output. For example, suppose the boiler stream emits one event every ten minutes. When an event matches the filter, the filter query immediately produces an event in the result stream.
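The filter semantics can be sketched as follows (an illustrative Python analogue added here, not any particular streaming SQL dialect): the query runs forever in principle, and each matching input event appears in the output stream immediately.

```python
def filter_query(boiler_stream, threshold=100.0):
    """Rough analogue of: SELECT * FROM BoilerStream WHERE temp > threshold.
    The query never finishes on its own; it emits an output event the
    moment an input event matches the filter."""
    for event in boiler_stream:          # conceptually infinite
        if event["temp"] > threshold:
            yield event                  # appears immediately in the output stream

# A finite stand-in for the boiler's (conceptually endless) event stream.
boiler_stream = iter([{"temp": 93.0}, {"temp": 101.5}, {"temp": 97.2}, {"temp": 350.0}])
matched = list(filter_query(boiler_stream))
print(matched)  # [{'temp': 101.5}, {'temp': 350.0}]
```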

So you can build your app as follows: send events to the stream processor, directly or via a broker; write the streaming part of the app using "streaming SQL"; and finally configure the stream processor to act on the results. The last step is done either by invoking a service when the stream processor triggers, or by publishing events to a broker topic and listening on that topic.

Many stream processing frameworks are available. (See the Quora question: What is the best streaming solution?)

I recommend the one I helped build: WSO2 Stream Processor (WSO2 SP). It can ingest data from Kafka, HTTP requests, and message brokers, and you can query its data streams using a "streaming SQL" language. WSO2 SP is open source under the Apache license. With just two commodity servers it provides high availability and can handle a throughput of 100K+ TPS. It can scale up to millions of TPS on top of Kafka, and it supports multi-datacenter deployments.

Who uses stream processing

In general, stream processing is useful in use cases where we can detect a problem and have a reasonable response that improves the outcome. It also plays a key role in data-driven organizations.

Here are some use cases:

  • Algorithmic trading, market surveillance
  • Intelligent Patient Care
  • Monitoring the production line
  • Supply chain optimization
  • Intrusion, surveillance, and fraud detection (e.g., Uber)
  • Most smart device applications: smart car, smart home
  • Smart grid (e.g., load prediction and outlier plug detection; see the smart grid example of handling 4 billion events at 100K+ TPS)
  • Traffic monitoring, geofencing, vehicle and wildlife tracking (e.g., TFL London's transport management system)
  • Sports analytics: augmenting sports with real-time analytics (e.g., overlaying real-time analytics on a football broadcast, which we have done with a real football game)
  • Context-aware promotion and advertising
  • Computer Systems and Network Monitoring
  • Predictive maintenance (e.g., machine learning techniques for predictive maintenance)
  • Geospatial data processing

For more discussion of how to use stream processing, see 13 Stream Processing Patterns for Building Streaming and Real-time Applications.

History of stream processing frameworks

Stream processing has a long history, starting with active databases, which provided conditional queries over data stored in the database. One of the first stream processing frameworks was TelegraphCQ, which was built on top of PostgreSQL.

The field then grew in two branches.

The first branch is called stream processing. These frameworks let users create a query graph connecting the user's code, and they run the query graph across many machines. Examples include Aurora, PIPES, STREAM, Borealis, and Yahoo S4. These frameworks focused on scalability.

The second branch is called complex event processing. These frameworks supported query languages (like the streaming SQL we use now) and focused on processing events efficiently for the given queries, usually running on one or two nodes. Examples include ODE, SASE, Esper, Cayuga, and Siddhi. These frameworks focused on efficient streaming algorithms.

Both branches of stream processing frameworks were limited to academic research or niche applications such as the stock market. Stream processing came back into the spotlight with Yahoo S4 and Apache Storm, which was introduced as "Hadoop, but in real time", and it became part of the big data movement.

Over the past five years, these two branches have merged. I discussed this in detail in an earlier article.

If you want to learn more about the history of stream processing frameworks, read Recent Advancements in Event Processing and Processing Flows of Information: From Data Stream to Complex Event Processing.

I hope this was useful. If you enjoyed this post, you may also like Stream Processing 101 and Patterns for Streaming Realtime Analytics.


Author: willcat
Link: https://juejin.im/post/5c1a3eb8f265da61137f3a01
Source: Juejin
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please cite the source.
