From RxJS to Flink: How to deal with data flow?

One: What is front-end development actually developing?

During front-end development, you may have asked yourself: what exactly are we developing? In my opinion, the essence of front-end development is to make web views respond correctly to related events. There are three keywords in that sentence: "web view", "respond correctly" and "related events".

"Related events" may include page clicks, mouse slides, timers, server requests, etc. "Correct response" means that we have to modify some states according to related events, and "web view" is the most important aspect of our front-end development. The familiar part.

From this point of view, we can write the formula view = response function(event):

View = reactionFn(Event)

In front-end development, events that need to be processed can be classified into the following three types:

  • User actions on the page, such as click and mousemove events.
  • Interaction between the remote server and local data, such as fetch and websocket.
  • Local asynchronous events, such as setTimeout and setInterval.

In this way, our formula can be further derived as:

View = reactionFn(UserEvent | Timer | Remote API)

Two: Logic processing in applications

To better understand the relationship between this formula and front-end development, let's take a news website as an example. It has the following three requirements:

  • Click refresh: click a button to refresh the data.
  • Check refresh: refresh automatically while a checkbox is checked, and stop when it is unchecked.
  • Pull-down refresh: refresh the data when the user pulls down from the top of the screen.

Analyzed from the front-end perspective, these three requirements correspond to:

  • Click to refresh: click -> fetch
  • Check refresh: change -> (setInterval + clearInterval) -> fetch
  • Pull down to refresh: (touchstart + touchmove + touchend) -> fetch

1 MVVM

In the MVVM pattern, the response function (reactionFn) above runs between the Model and the ViewModel or between the View and the ViewModel, while events (Event) are handled between the View and the ViewModel.

MVVM abstracts the view layer and the data layer very well, but the response function (reactionFn) is scattered across different conversion steps, which makes it hard to track accurately how data is assigned and collected. In addition, because event handling in this pattern is tightly coupled to the view, it is difficult to reuse event-handling logic between the View and the ViewModel.

2 Redux

In the simplest Redux model, a combination of several events (Event) corresponds to an Action, and the reducer function can be regarded as the response function (reactionFn) mentioned above.
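
As a minimal sketch of that correspondence (the action type and state shape here are hypothetical, not taken from the original sample):

type State = { news: string[] };
type Action = { type: 'REFRESH_SUCCESS'; payload: string[] };

// The reducer plays the role of reactionFn: (state, action) -> new state.
function reducer(state: State = { news: [] }, action: Action): State {
  switch (action.type) {
    case 'REFRESH_SUCCESS':
      return { ...state, news: action.payload };
    default:
      return state;
  }
}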

But in Redux:

  • State can only describe intermediate states, not intermediate processes.
  • The relationship between Action and Event is not one-to-one, which makes it difficult to trace a State change back to its actual source.

3 Reactive programming and RxJS

Wikipedia defines reactive programming as follows:

In computing, reactive programming is a declarative programming paradigm oriented around data streams and the propagation of change. This means that static or dynamic data streams can be expressed easily in the programming language, and the underlying computation model automatically propagates changed values through the data streams.

Let's reconsider how a user interacts with the application along the data-flow dimension:

  • Click button -> trigger refresh event -> send request -> update view
  • Check auto refresh
  • Finger touches the screen
  • Auto refresh interval -> trigger refresh event -> send request -> update view
  • Finger slides on the screen
  • Auto refresh interval -> trigger refresh event -> send request -> update view
  • Finger stops sliding -> trigger pull-down refresh event -> send request -> update view
  • Auto refresh interval -> trigger refresh event -> send request -> update view
  • Turn off auto refresh

Represented as a Marbles diagram:

Splitting up the logic in the diagram above gives the three steps of developing this news application with reactive programming:

  • Define source data flow
  • Combine/transform data streams
  • Consume the data stream and update the view

Let's describe each of them in detail.

Define source data flow

Using RxJS, we can easily define various Event data streams.
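
The snippets that follow are fragments of an Angular component. As a sketch, they assume imports along these lines (standard RxJS entry points):

import { fromEvent, interval, merge, EMPTY } from 'rxjs';
import { fromFetch } from 'rxjs/fetch';
import { map, filter, switchMap, debounceTime, takeUntil, take, repeat } from 'rxjs/operators';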

1) Click operation

This involves the click data stream.

click$ = fromEvent<MouseEvent>(document.querySelector('button'), 'click');

2) Check operation

Involves the change data stream.

change$ = fromEvent(document.querySelector('input'), 'change');

3) Pull down operation

It involves three data streams: touchstart, touchmove and touchend.

touchstart$ = fromEvent<TouchEvent>(document, 'touchstart');
touchend$ = fromEvent<TouchEvent>(document, 'touchend');
touchmove$ = fromEvent<TouchEvent>(document, 'touchmove');

4) Regular refresh

interval$ = interval(5000);

5) Server request

fetch$ = fromFetch('https://randomapi.azurewebsites.net/api/users');

Combine/transform data streams

1) Click to refresh the event stream

For click refresh, we want multiple clicks within a short period to trigger only the last one, which can be achieved with RxJS's debounceTime operator.

clickRefresh$ = this.click$.pipe(debounceTime(300));

2) Automatically refresh the stream

Use RxJS's switchMap together with the interval$ stream defined earlier, mapping the change event to the checkbox's checked state.

autoRefresh$ = change$.pipe(
  // map the change event to whether the checkbox is currently checked
  map(event => (event.target as HTMLInputElement).checked),
  switchMap(enabled => (enabled ? interval$ : EMPTY))
);

3) Pull down to refresh the stream

Combine the previously defined touchstart$, touchmove$ and touchend$ data streams.

pullRefresh$ = touchstart$.pipe(
  switchMap(touchStartEvent =>
    touchmove$.pipe(
      map(touchMoveEvent => touchMoveEvent.touches[0].pageY - touchStartEvent.touches[0].pageY),
      takeUntil(touchend$)
    )
  ),
  filter(position => position >= 300),
  take(1),
  repeat()
);

Finally, we merge clickRefresh$, autoRefresh$ and pullRefresh$ with the merge function to get the refresh data stream.

refresh$ = merge(clickRefresh$, autoRefresh$, pullRefresh$);

Consume the data stream and update the view

The refresh stream is flattened with switchMap into the fetch$ stream defined in the first step, giving us the view data stream.
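
A minimal sketch of this step (assuming the API returns a JSON array and that each refresh should start a fresh request, cancelling the previous one):

view$ = refresh$.pipe(
  // start a new request for every refresh signal and drop any in-flight request
  switchMap(() => fromFetch('https://randomapi.azurewebsites.net/api/users')),
  // parse the HTTP response body as JSON
  switchMap(response => response.json())
);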

In Angular, the view stream can be mapped onto the view directly with the async pipe:

<div *ngFor="let user of view$ | async">
</div>

In other frameworks, you can get the actual data out of the stream through subscribe and then update the view.
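
For example, a minimal sketch outside Angular (renderUserList is a hypothetical rendering helper):

view$.subscribe(users => {
  renderUserList(users); // update the DOM in whatever way your framework prefers
});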

So far, we have built the news application with reactive programming. The sample code [1] is written in Angular and takes no more than 160 lines.

Let's summarize how the three steps of developing a front-end application with reactive programming correspond to the formula from the first section:

View = reactionFn(UserEvent | Timer | Remote API)

1) Define the source data flow

This corresponds to the events UserEvent | Timer | Remote API; the corresponding RxJS functions are:

  • UserEvent: fromEvent
  • Timer: interval, timer
  • Remote API: fromFetch, webSocket

2) Combine/transform the data streams

This corresponds to the response function (reactionFn); the corresponding RxJS operators include:

  • COMBINING: merge, combineLatest, zip
  • MAPPING: map
  • FILTERING: filter
  • REDUCING: reduce, max, count, scan
  • TAKING: take, takeWhile
  • SKIPPING: skip, skipWhile, takeLast, last
  • TIME: delay, debounceTime, throttleTime

3) Consume the data stream and update the view

This corresponds to the View; in RxJS and Angular you can use:

  • subscribe
  • async pipe

What are the advantages of reactive programming over MVVM or Redux?

  • It describes the events themselves, not the calculation process or intermediate states.
  • It provides a way to combine and transform data streams, which means we have a way to reuse continuously changing data.
  • Since every data stream is obtained by combining and transforming other streams layer by layer, we can accurately trace the source of events and data changes.

If we blur the timeline of the RxJS Marbles diagram and take a vertical slice at every view update, we find two interesting things:

  • An Action is a simplification of an event stream.
  • A State is a snapshot of the stream at a particular moment.

No wonder the Redux official website says that if you have used RxJS, it is likely you no longer need Redux:

The question is: do you really need Redux if you already use Rx? Maybe not. It's not hard to re-implement Redux in Rx. Some say it's a two-liner using Rx.scan() method. It may very well be!
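
The "two-liner" refers to something like the following sketch, which folds actions into state with scan (reducer and State are the hypothetical definitions from the Redux sketch above; action$ is a hypothetical subject of dispatched actions):

// assuming: import { Subject } from 'rxjs'; import { scan } from 'rxjs/operators';
action$ = new Subject<Action>();
state$ = action$.pipe(scan(reducer, { news: [] } as State));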

At this point, can we further abstract the sentence "make the web view respond correctly to related events"?

All events --find--> related events --make--> response

Events that occur in chronological order are essentially a data stream, so this can be expanded further into:

Source data stream --transform--> intermediate data stream --subscribe--> consumed data stream

This is the basic idea that makes reactive programming work so well on the front end. But does this idea apply only to front-end development?

The answer is no. This idea can be applied not only in front-end development, but also in back-end development and even real-time computing.

Three: Breaking the wall of information

Front-end and back-end developers are usually separated by an information wall called the REST API. The REST API separates the responsibilities of front-end and back-end developers and improves development efficiency, but it also keeps the two sides apart. Let's try to push down this wall and get a glimpse of how the same idea applies to real-time computing.

1 Real-time computing and Apache Flink

Before starting the next part, let's introduce Flink. Apache Flink is an open-source stream processing framework developed by the Apache Software Foundation for stateful computation over unbounded and bounded data streams. Its dataflow programming model provides event-at-a-time processing on both finite and infinite data sets.

In practice, Flink is usually used to develop the following three kinds of applications:

  • Event-driven applications: extract data from one or more event streams and trigger computations, state updates, or external actions based on incoming events. Scenarios include rule-based alerting, anomaly detection, anti-fraud, and so on.
  • Data analysis applications: extract valuable information and metrics from raw data, for example computing Double Eleven turnover or monitoring network quality.
  • Data pipeline (ETL) applications: extract-transform-load (ETL) is a common way to convert and migrate data between storage systems. ETL jobs are usually triggered periodically, for example to copy data from a transactional database into an analytical database or data warehouse.

Let's take computing the hourly turnover of Double Eleven on an e-commerce platform as an example and see whether the approach from the previous chapter still applies.

In this scenario, we first need to obtain the user's purchase order data, then compute the hourly transaction totals from it, then write the hourly results to a database and cache them in Redis, and finally fetch them through an API and display them on a page.

The data-flow processing logic in this pipeline is:

User order data stream --transform--> hourly transaction data stream --subscribe--> write to database

As described in the previous chapter:

Source data stream --transform--> intermediate data stream --subscribe--> consumed data stream

The idea is exactly the same.

If we describe this process with a Marbles diagram, we get a very similar picture. It looks simple; it even seems the same job could be done with RxJS's window operator. But is that really the case?
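
For intuition, here is a hedged sketch of that naive RxJS version, with a hypothetical order$ stream of { amount } objects and fixed one-hour windows (the rest of this section explains why real-time computing is not this simple):

// assuming: import { Subject } from 'rxjs'; import { windowTime, mergeMap, reduce } from 'rxjs/operators';
order$ = new Subject<{ amount: number }>();   // hypothetical stream of purchase orders

hourlyTurnover$ = order$.pipe(
  windowTime(60 * 60 * 1000),                  // fixed one-hour windows, by processing time
  mergeMap(window$ =>
    window$.pipe(reduce((total, order) => total + order.amount, 0)) // sum amounts in each window
  )
);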

2 Hidden complexity

Real-world real-time computing is far more complex than reactive programming on the front end. Here are a few examples.

Out of order

In front-end development we also run into out-of-order events. The classic case is that a request sent earlier receives its response later, which can be represented by a Marbles diagram. There are many ways to handle this on the front end, so we will skip them here.

What we want to introduce today is the time disorder that data processing has to face. Front-end development rests on an important premise that greatly reduces its complexity: the time at which a front-end event occurs and the time at which it is processed are the same.

Imagine if user page actions such as click and mousemove became asynchronous events with unknown response times; the complexity of front-end development would grow enormously.

In real-time computing, however, the time an event occurs and the time it is processed are different, and this is an important premise of the field. Taking the hourly turnover calculation as an example again, after the original data stream has been relayed through several layers, the data is very likely to arrive at the computing node out of order.

If we still divide windows by the arrival time of the data, the final result will be wrong:

To make window2's result correct, we need to wait for the late event to arrive before computing, but this puts us in a dilemma:

  • Wait indefinitely: the late event may have been lost in transit, and window2 will never produce any output.
  • Wait too briefly: the late event has not arrived yet, and the result is wrong.

Flink introduces the Watermark mechanism to solve this problem. A Watermark defines when to stop waiting for late events; essentially it is a trade-off between the accuracy and the latency of real-time computation.

There is a vivid analogy for Watermarks: at school, the teacher closes the classroom door and says, "Anyone who arrives after this moment is late and will be punished." In Flink, the Watermark plays the role of the teacher closing the door.
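
To make the idea concrete, here is a toy sketch of the mechanism (not Flink's actual API): the watermark trails the largest event time seen so far by an allowed lateness, and a window is only emitted once the watermark passes its end.

interface OrderEvent { eventTime: number; amount: number; }

const allowedLateness = 5_000;              // wait at most 5 seconds for late events
let maxEventTime = 0;
const windows = new Map<number, number>();  // window start -> running turnover

function onEvent(e: OrderEvent, windowSize: number) {
  maxEventTime = Math.max(maxEventTime, e.eventTime);
  const watermark = maxEventTime - allowedLateness;

  // add the event to the window its event time belongs to, even if it arrives late
  const windowStart = Math.floor(e.eventTime / windowSize) * windowSize;
  windows.set(windowStart, (windows.get(windowStart) ?? 0) + e.amount);

  // emit every window whose end the watermark has passed; stop waiting for its late events
  for (const [start, total] of windows) {
    if (start + windowSize <= watermark) {
      console.log(`window [${start}, ${start + windowSize}) turnover = ${total}`);
      windows.delete(start);
    }
  }
}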

Data back pressure

When using RxJS in the browser, have you ever considered this situation: if an observable produces values faster than the operators or observers can consume them, a large amount of unconsumed data piles up in memory. This situation is called back pressure. Fortunately, back pressure on the front end only causes the browser's memory usage to grow heavily; there are no more serious consequences.
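
A small sketch of how this happens in RxJS (the timings are invented for illustration): the producer emits every millisecond while each value takes 100 ms to process, so concatMap keeps buffering the unprocessed values in memory.

// assuming: import { interval, of } from 'rxjs'; import { concatMap, delay } from 'rxjs/operators';
interval(1).pipe(
  // process one value at a time; everything arriving in the meantime queues up in memory
  concatMap(value => of(value).pipe(delay(100)))
).subscribe(value => console.log('processed', value));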

But in real-time computing, what should be done when data is produced faster than the intermediate nodes can process it, or faster than downstream consumers can absorb it?

For many streaming applications, data loss is unacceptable. To guarantee this, Flink designed the following mechanism:

  • In the ideal case, data is buffered in a persistent channel.
  • When data is produced faster than an intermediate node can process it, or faster than downstream consumers can absorb it, the slower receiver slows the sender down as soon as the queue's buffering capacity is exhausted. A more vivid analogy: when one part of the pipeline slows down, back pressure propagates from the sink all the way up to the source, and the source is throttled so that the whole pipeline settles at the speed of its slowest part, reaching a steady state.

Checkpoint

In real-time computing there may be billions of records to process every second, far more than a single machine can handle on its own. In Flink, an operator's logic is executed by different subtasks on different TaskManagers. This raises another problem: when one machine fails, how should the overall job logic and state be handled to guarantee the correctness of the final result?

Flink introduces the checkpoint mechanism to ensure that a job's state and processing position can be restored. Checkpoints give Flink's state good fault tolerance. Flink uses a variant of the Chandy-Lamport algorithm called asynchronous barrier snapshotting.

When a checkpoint is started, every source records its offset and inserts a numbered checkpoint barrier into its stream. As these barriers flow through each operator, they mark the part of the stream that comes before and after each checkpoint.

When an error occurs, Flink restores the state saved in the checkpoint, ensuring the correctness of the final result.
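
As a toy illustration of the barrier idea described above (not Flink's implementation), a stateful operator could snapshot its running state whenever a barrier marker flows past:

type Element =
  | { kind: 'record'; amount: number }
  | { kind: 'barrier'; checkpointId: number };

let runningTotal = 0;
const snapshots = new Map<number, number>(); // checkpointId -> state at that barrier

function process(element: Element) {
  if (element.kind === 'barrier') {
    // everything received before this barrier belongs to checkpoint `checkpointId`
    snapshots.set(element.checkpointId, runningTotal);
    // ...the barrier would then be forwarded downstream
  } else {
    runningTotal += element.amount;
  }
}

// On failure, the job restarts from the latest snapshot instead of from zero:
function restore(checkpointId: number) {
  runningTotal = snapshots.get(checkpointId) ?? 0;
}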

The tip of the iceberg

Due to space constraints, what we have introduced today is only the tip of the iceberg. But the model

Source data stream --transform--> intermediate data stream --subscribe--> consumed data stream

is universal in both reactive programming and real-time computing. I hope this article gives you more to think about on the idea of data flow.


This article is original content from Alibaba Cloud and may not be reproduced without permission.


Source: blog.csdn.net/weixin_43970890/article/details/113930002