Stream Processing System Optimization Papers

AJoin: ad-hoc stream joins at scale VLDB2019

Background: Existing stream processing systems such as Flink are mainly designed to run a single long-standing query over a data stream, and such queries are optimized and executed in isolation. The AStream framework 1 instead targets ad-hoc queries: besides long-running queries, the stream is also processed by thousands of short-lived, temporary queries. To support this efficiently, resources and computation must be shared across queries in a multi-tenant environment, and AStream introduces shared operators to avoid redundant computation. AJoin 2 builds on AStream with further optimizations, such as reducing the overhead of the shared operators, re-optimizing query plans at runtime, and scaling up and out at runtime; a toy sketch of the sharing idea appears after the example below.

Consider, for example, the following scenario:

V = {vID, length, geo, lang, time}: the stream of videos shown on users' walls

W = {usrID, vID, duration, geo, time}: the stream of users' video views

C = {usrID, comment, length, photo, emojis, time}: the stream of user comments

R = {usrID, reaction, time}: the stream of user reactions

Q1 (machine learning team, learning user preferences): retrieves the English-language videos that German users like to watch.

Q2 (editorial team, detecting organized fake-engagement accounts, i.e. an "internet water army"): finds users from the United States who commented within 10 seconds of watching a video and whose comment length is greater than 5.

Q3 (quality assurance team, analyzing user feedback on recommended videos): finds videos that received negative feedback from European users and whose comments contain at least one emoji.
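
To make the sharing idea concrete, here is a minimal Python sketch (my illustration, not AJoin's actual implementation) of a single hash-join operator on usrID shared by several ad-hoc queries, so the per-key indexes are built and probed once rather than once per query. The predicates are loosely modeled on Q2 and Q3 above and are purely hypothetical.

```python
from collections import defaultdict

class SharedStreamJoin:
    """One hash join on usrID shared by many ad-hoc queries (toy sketch)."""

    def __init__(self):
        self.w_index = defaultdict(list)   # usrID -> viewing tuples
        self.c_index = defaultdict(list)   # usrID -> comment tuples
        self.queries = {}                  # qid -> per-query predicate

    def register(self, qid, predicate):
        """Attach an ad-hoc query; it reuses the already-built indexes."""
        self.queries[qid] = predicate

    def deregister(self, qid):
        self.queries.pop(qid, None)

    def on_w(self, w):
        self.w_index[w["usrID"]].append(w)
        # Probe the comment index once; fan the joined tuple out to all queries.
        for c in self.c_index[w["usrID"]]:
            self._emit(w, c)

    def on_c(self, c):
        self.c_index[c["usrID"]].append(c)
        for w in self.w_index[c["usrID"]]:
            self._emit(w, c)

    def _emit(self, w, c):
        for qid, pred in self.queries.items():
            if pred(w, c):
                print(f"{qid}: usr={w['usrID']} vID={w['vID']} comment={c['comment']!r}")

# Hypothetical predicates loosely modeled on Q2/Q3 above.
join = SharedStreamJoin()
join.register("Q2", lambda w, c: c["time"] - w["time"] <= 10 and c["length"] > 5)
join.register("Q3", lambda w, c: c["emojis"] >= 1)

join.on_w({"usrID": 1, "vID": 7, "duration": 95, "geo": "US", "time": 100})
join.on_c({"usrID": 1, "comment": "nice video!", "length": 11, "photo": False,
           "emojis": 0, "time": 105})
```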

[Figure: AJoin experimental results (image omitted)]

This is one of the experimental graphs from the paper. When many queries run concurrently (the purple portion), AJoin's performance advantage is very clear.


Shared Arrangements: practical inter-query sharing for streaming dataflows VLDB2020

Background: Current systems for data-parallel, incremental processing and view maintenance over high-rate streams isolate the execution of independent queries. With many concurrent incrementally maintained queries this causes unnecessary redundancy and overhead: each query independently maintains the same indexed state over the same input stream, and a new query must rebuild that state from scratch before it can produce its first result. This paper 3 introduces shared arrangements: maintained, indexed views of state that let concurrent queries reuse the same in-memory state without sacrificing data-parallel performance or scalability. The authors implement shared arrangements in a modern stream processor and, for incremental, interactive queries over high-throughput streams, show order-of-magnitude improvements in query response time and resource consumption.

The AJoin line of work above is multi-query optimization (MQO), which shares the state and processing of common sub-expressions. Shared arrangements instead share maintained historical indexes, enabling post-hoc sharing: a new query can attach immediately to the in-memory arrangements of existing queries and quickly begin producing correct output that reflects all prior events.
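
A minimal sketch of the post-hoc sharing idea, assuming a toy in-memory "arrangement" (the real system maintains multiversioned indexes inside a dataflow engine; this toy ignores timestamps and consistency): one index over the input is maintained once, and a query that arrives later replays the existing state immediately and then subscribes to future updates.

```python
from collections import defaultdict

class Arrangement:
    """A shared, maintained index: key -> list of values (toy version)."""
    def __init__(self):
        self.index = defaultdict(list)
        self.subscribers = []            # queries that want future updates

    def update(self, key, value):
        self.index[key].append(value)
        for callback in self.subscribers:
            callback(key, value)

    def attach(self, callback):
        """New query: replay existing state, then subscribe to new updates."""
        for key, values in self.index.items():
            for value in values:
                callback(key, value)     # correct output for all past events
        self.subscribers.append(callback)

arr = Arrangement()
arr.update("usr1", ("video7", 95))
arr.update("usr2", ("video3", 12))

# A query submitted later reuses the same memory-resident index: no rebuild.
arr.attach(lambda k, v: print("late query sees", k, v))
arr.update("usr1", ("video9", 40))
```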

Incremental processing: unlike an ordinary SQL query, an incremental SQL query expresses each change to a table as a delta table containing inserted and retracted rows; these delta tables are propagated to downstream operators for processing. SQL queries in Flink are executed in this incremental fashion.
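
A small sketch of delta-style processing (illustrative only; the names are not Flink's API): a downstream count aggregate consumes delta tables of inserted (+1) and retracted (-1) rows instead of recomputing over the full table.

```python
from collections import Counter

class IncrementalCount:
    """Maintains COUNT(*) GROUP BY key from insert/retract deltas."""
    def __init__(self):
        self.counts = Counter()

    def apply_delta(self, delta):
        """delta: list of (key, +1 | -1) rows describing the change."""
        out = []
        for key, sign in delta:
            self.counts[key] += sign
            out.append((key, self.counts[key]))   # emit updated result rows
        return out

agg = IncrementalCount()
print(agg.apply_delta([("DE", +1), ("DE", +1), ("US", +1)]))  # inserts
print(agg.apply_delta([("DE", -1)]))                          # a retraction
```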

[Figures: TPC-H experimental results (images omitted)]

The figure above shows experimental results on the TPC-H benchmark: the leftmost plot is the query-latency distribution, the middle plot is update-processing latency, and the right plot is memory usage. The approach clearly improves performance and reduces memory consumption. The middle plot is a complementary cumulative distribution (CCDF), i.e. for each latency value it shows the fraction of updates delayed beyond that value.

Related work on sharing in OLTP workloads appears in 4 .

Analyzing efficient stream processing on modern hardware VLDB2019

Background: Modern stream processing engines (SPEs) process large volumes of data under tight latency constraints. Many SPEs execute processing pipelines via message passing on a shared-nothing architecture and use partition-based scale-out strategies to handle high-rate input streams. In addition, many recent SPEs rely on the Java virtual machine for platform independence and faster development by abstracting away the underlying hardware, but as a result cannot fully exploit modern high-performance hardware. This paper 5 focuses on the data-transfer and synchronization overhead introduced by scale-out and argues for fully exploiting high-performance hardware, making scale-up another viable way to add capacity. Saber 6 is one example, using GPU resources for computational acceleration. The paper comprehensively analyzes which optimizations existing stream processing systems can adopt to keep up with hardware trends.

  • Hardware trends considered:
    • CPU: each socket of a multi-socket machine has multiple cores and its own memory controller, so cross-socket memory accesses have higher latency (NUMA)
    • Memory: capacity keeps growing (up to terabytes of RAM)
    • Network: bandwidth is growing even faster than memory bandwidth, enabling remote direct memory access (RDMA)
  • Contributions of this paper:
    • Compares C++ and Java performance, e.g. for RDMA access (data ingestion) and queue access (data exchange)
    • Compares parallelization strategies: Upfront Partitioning (UP, widely used in existing scale-out systems such as Flink) and two scale-up strategies, Late Local Merge (LM) and Late Global Merge (GM); a sketch of the late-merge idea follows after this list
    • Implements a lock-free windowing mechanism that minimizes contention between worker threads
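
Here is a rough sketch of the late-merge idea referenced above (my simplification, not the paper's implementation): instead of routing every tuple to the worker that owns its key up front, each worker aggregates whatever slice of the window it ingests, and the partial results are merged once, when the window closes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partial_sum(batch):
    """Each worker aggregates its local slice of the window (no key routing)."""
    acc = Counter()
    for key, value in batch:
        acc[key] += value
    return acc

def late_global_merge(batches):
    """Merge per-worker partials once, when the window closes."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_sum, batches))
    result = Counter()
    for p in partials:
        result.update(p)
    return result

# Two workers ingest disjoint slices of the same window.
window_slices = [
    [("adA", 1), ("adB", 1), ("adA", 1)],
    [("adB", 1), ("adA", 1)],
]
print(late_global_merge(window_slices))   # Counter({'adA': 3, 'adB': 2})
```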

[Figure: throughput on YSB/LRB/NYT under the UP, LM and GM strategies (image omitted)]

The plots above show several of the paper's experiments; YSB, LRB and NYT are different benchmarks, and UP, LM and GM are the parallelization strategies. The C++-based implementation uses memory far more effectively, reaching throughput close to memory bandwidth, while the Java-based implementation falls well short. The GM strategy proposed by the authors also performs best in the single-machine case.

Reference 7 covers efficient distributed memory management with RDMA.

[Figure: scale-out experiment on a 1 Gb/s network (image omitted)]

The figure above is an additional experiment under a 1 Gb/s network. For a network-intensive benchmark such as YSB, performance stops improving beyond 4 nodes. Reference 8 shows that once network bandwidth increases, the CPU becomes the bottleneck again, and systems such as Apache Spark and Flink would require design changes to benefit.

Parallel index-based stream join on a multicore CPU SIGMOD2020

Background: the goal of this paper is to index the sliding-window contents of a stream to speed up window joins. Traditional index structures cannot cope with the highly dynamic nature of streaming data. The paper 9 proposes an index based on a partitioned in-memory merge tree (PIM-Tree), together with concurrency-control mechanisms that let the index sustain frequent updates under multithreading. On this basis, the paper also designs an index-based parallel stream-join algorithm that exploits the compute power of multi-core processors. The figure below compares the proposed PIM-Tree against IM-Tree and a B+-tree; the vertical axis is throughput and the horizontal axis is window size. On the right is the throughput improvement of the join when using multiple threads. A toy sketch of the index-based join idea follows the figures.
[Figures: PIM-Tree vs. IM-Tree vs. B+-tree throughput over window size; multithreaded join speedup (images omitted)]
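
A rough sketch of the general index-based window-join idea (mine, not the paper's PIM-Tree): keep the sliding window of one stream in an ordered index on the join key, and answer each probe from the other stream with a range lookup. The paper's contribution is making such an index cheap to update concurrently, which the simple sorted list below is not.

```python
import bisect
from collections import deque

class IndexedWindowJoin:
    """Sliding window of stream R kept in a sorted index; stream S probes it."""
    def __init__(self, window_size, band):
        self.window_size = window_size
        self.band = band              # |r.key - s.key| <= band matches
        self.window = deque()         # arrival order, for eviction
        self.index = []               # sorted (key, tuple_id) pairs

    def insert_r(self, key, tuple_id):
        if len(self.window) == self.window_size:
            old_key, old_id = self.window.popleft()
            self.index.pop(bisect.bisect_left(self.index, (old_key, old_id)))
        self.window.append((key, tuple_id))
        bisect.insort(self.index, (key, tuple_id))

    def probe_s(self, key):
        lo = bisect.bisect_left(self.index, (key - self.band, -1))
        hi = bisect.bisect_right(self.index, (key + self.band, float("inf")))
        return [tid for _, tid in self.index[lo:hi]]

join = IndexedWindowJoin(window_size=4, band=1)
for i, k in enumerate([10, 12, 11, 30]):
    join.insert_r(k, i)
print(join.probe_s(11))   # R tuples with keys in [10, 12] -> ids [0, 2, 1]
```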

Learning to optimize join queries with deep reinforcement learning

Traditional join order optimization relies on dynamic programming or heuristics (Zig-Zag, QuickPick-1000, etc.). When the cost model is non-linear (for example under memory limits or when materialized intermediates are reused), the plans these methods produce are often far from optimal. This paper 10 formulates join ordering as a Markov decision process (MDP) and builds an optimizer around a deep Q-network (DQN) to order joins effectively. The method is evaluated on the Join Order Benchmark (designed specifically to stress-test join optimization): the RL-based optimizer produces plans whose cost stays within roughly 2x of the optimal plan across all cost models and improves on the best current heuristics by up to 3x.
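
To make the MDP formulation concrete, here is a tiny self-contained sketch (illustrative only, not DQ's code): a state is the set of relations joined so far, an action joins in one more relation, and the reward is the negative estimated cost of that step. The cardinalities and selectivity below are made up, and exhaustive enumeration over left-deep orders stands in for the learned Q-network.

```python
import itertools

# Toy cardinalities and a uniform join selectivity (made-up numbers).
CARD = {"V": 1000, "W": 5000, "C": 3000, "R": 2000}
SEL = 0.001   # deliberately crude cost model

def join_cost(left_card, right_card):
    """Estimated cost of one join step; the output cardinality feeds the next step."""
    out_card = left_card * right_card * SEL
    return left_card + right_card + out_card, out_card

def episode(order):
    """One MDP episode: state = relations joined so far, action = next relation."""
    state = {order[0]}                         # state shown only to make the MDP explicit
    card = CARD[order[0]]
    total_cost = 0.0
    for rel in order[1:]:                      # each action joins in one more relation
        step_cost, card = join_cost(card, CARD[rel])
        total_cost += step_cost                # the RL reward would be -step_cost
        state.add(rel)
    return total_cost

# Exhaustive search over left-deep orders stands in for the learned policy.
best = min(itertools.permutations(CARD), key=episode)
print(best, episode(best))
```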

[Figure: plan cost of DQ vs. other optimizers under three cost models (image omitted)]

The experiment above compares DQ (the method proposed in the paper) against other optimizers under three cost models. The plan found by exhaustive (brute-force) enumeration serves as the baseline, and the other results are shown relative to it. The first cost model assumes all data fits in memory; the second accounts for a memory limit, with spilling to disk; the third allows reusing hash tables built by upstream operators. As the cost model becomes more complex (non-linear), DQ's relative performance improves.
[Figure: optimizer latency vs. number of joined relations (image omitted)]

The figure above shows how the optimizer's own latency changes as the number of joined relations grows (the vertical axis is on a log scale). As the number of relations increases, DQ's advantage becomes more pronounced, and with GPU or TPU acceleration the gap would be even larger.

Chi: A Scalable and Programmable Control Plane for Distributed Stream Processing Systems VLDB2018

Background: stream processing workloads and shared cloud environments are highly variable and unpredictable. Combined with a large parameter space and diverse SLOs, this makes it hard to tune a stream processing system statically. This paper 11 proposes a scalable and programmable control plane for distributed stream processing systems that supports continuous monitoring, feedback and dynamic reconfiguration. Chi embeds control-plane messages in the data-plane channels, giving the stream processing system a low-latency, flexible control plane. Chi also introduces a reactive programming model and design mechanisms that execute control policies asynchronously, avoiding global synchronization.
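
A toy sketch of the embedded control-message idea (not Chi's actual API): control messages travel through the same FIFO channels as data, so an operator applies a reconfiguration exactly at the point in the stream where the controller injected it, without pausing the pipeline globally.

```python
from queue import Queue

class Data:
    def __init__(self, value): self.value = value

class Control:
    def __init__(self, apply_fn): self.apply_fn = apply_fn   # reconfiguration

def map_operator(in_q, out_q):
    scale = 1                                   # a piece of operator configuration
    while True:
        msg = in_q.get()
        if msg is None:                         # end of stream
            out_q.put(None); break
        if isinstance(msg, Control):
            scale = msg.apply_fn(scale)         # reconfigure in-band, no global pause
            out_q.put(msg)                      # propagate the control message downstream
        else:
            out_q.put(Data(msg.value * scale))

in_q, out_q = Queue(), Queue()
for m in [Data(1), Data(2), Control(lambda s: s * 10), Data(3), None]:
    in_q.put(m)
map_operator(in_q, out_q)
while (m := out_q.get()) is not None:
    print(m.value if isinstance(m, Data) else "control")
```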

[Figure: dynamic scaling experiment (image omitted)]

The figure above shows a dynamic scaling experiment. At t = 40 s the ingestion rate is doubled, and each system then applies its own scaling strategy. Flink takes a savepoint and restarts the dataflow, so its throughput drops to zero (the dip on the right) and only recovers after several seconds. Chi's spike is much smaller, and it returns to steady state roughly 6x faster than Flink.

[Figure: failure recovery experiment (image omitted)]

In this experiment the system takes a checkpoint every 10 s; a checkpoint completes at t = 35 s and a virtual machine is suspended at t = 40 s, after which each system's recovery mechanism kicks in. Because Flink must redeploy the dataflow from the previous checkpoint, its throughput again drops to zero, whereas Chi's throughput does not drop and its latency returns to steady state about 3x faster than Flink's.

Optimal and General Out-of-Order Sliding-Window Aggregation VLDB2019

Sliding-window aggregation is very common in stream processing applications. For some aggregations, such as sum and mean, arrival order does not matter: each element can be folded in as it arrives, so the cost per update is O(1). For others, such as max and min, the result depends on exactly which elements are in the window, so out-of-order arrival matters. In practice, network jitter, batching, failure recovery and so on introduce delays, and data often does not arrive in order. There are then two main strategies: buffer the data, sort it, and then aggregate; or insert every element into a data structure that can be queried at any time, typically an augmented red-black tree with O(log n) cost per update, where n is the window size. This paper 12 proposes a finger B-tree aggregator that achieves O(1) cost when data arrives in order, close to O(1) for slightly out-of-order data, and at worst O(log n) in extreme cases.
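
The gap the paper addresses can be seen in a small sketch (mine, not the paper's finger B-tree): an invertible aggregate such as sum costs O(1) per insert/evict, while max needs an ordered structure; the sorted-list stand-in below shows the O(log n)-per-update search behavior of the tree-based baseline.

```python
import bisect
from collections import deque

class WindowSum:
    """Invertible aggregate: O(1) per insert/evict regardless of value order."""
    def __init__(self, size):
        self.size, self.buf, self.total = size, deque(), 0.0
    def insert(self, x):
        if len(self.buf) == self.size:
            self.total -= self.buf.popleft()
        self.buf.append(x)
        self.total += x
        return self.total

class WindowMax:
    """Non-invertible aggregate: needs an ordered structure. The sorted list
    gives O(log n) search per update (its O(n) shifts are what a balanced
    tree, or the paper's finger B-tree, avoids)."""
    def __init__(self, size):
        self.size, self.buf, self.sorted = size, deque(), []
    def insert(self, x):
        if len(self.buf) == self.size:
            old = self.buf.popleft()
            self.sorted.pop(bisect.bisect_left(self.sorted, old))
        self.buf.append(x)
        bisect.insort(self.sorted, x)
        return self.sorted[-1]

s, m = WindowSum(3), WindowMax(3)
for x in [5, 1, 9, 2]:
    print(s.insert(x), m.insert(x))
```
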
[Figure: out-of-order sliding-window aggregation performance (image omitted)]

As the figure shows, Flink's aggregation performance is relatively poor in out-of-order scenarios.

Other papers

MVCC

  1. Böttcher, Jan, et al. “Scalable garbage collection for in-memory MVCC systems.” Proceedings of the VLDB Endowment 13.2 (2019): 128-141.
  2. Sun, Yihan, et al. “On supporting efficient snapshot isolation for hybrid workloads with multi-versioned indexes.” Proceedings of the VLDB Endowment 13.2 (2019): 211-225.

Top-k

  1. Zois, Vasileios, Vassilis J. Tsotras, and Walid A. Najjar. “Efficient main-memory top-K selection for multicore architectures.” Proceedings of the VLDB Endowment 13.2 (2019): 114-127.

SparkSQL compilation

Schiavio, Filippo, Daniele Bonetta, and Walter Binder. “Dynamic speculative optimizations for SQL compilation in Apache Spark.” Proceedings of the VLDB Endowment 13.5 (2020): 754-767.

AI4DB

  1. Sun, Ji, and Guoliang Li. “An end-to-end learning-based cost estimator.” Proceedings of the VLDB Endowment 13.3 (2019): 307-319.
  2. Li, G., et al. (2019). “Qtune: A query-aware database tuning system with deep reinforcement learning.” Proceedings of the VLDB Endowment 12(12): 2118-2130.
  3. Tan, J., et al. (2019). “ibtune: Individualized buffer tuning for large-scale cloud databases.” Proceedings of the VLDB Endowment 12(10): 1221-1234.
  4. Van Aken, D., et al. (2017). Automatic database management system tuning through large-scale machine learning. Proceedings of the 2017 ACM International Conference on Management of Data.
  5. Zhu, Y., et al. (2017). Bestconfig: tapping the performance potential of systems via automatic configuration tuning. Proceedings of the 2017 Symposium on Cloud Computing.
  6. Zhang, J., et al. (2019). An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning. Proceedings of the 2019 International Conference on Management of Data - SIGMOD '19: 415-432.

References


  1. Karimov, J., et al. (2019). Astream: Ad-hoc shared stream processing. Proceedings of the 2019 International Conference on Management of Data. ↩︎

  2. Karimov, J., et al. (2019). “AJoin: ad-hoc stream joins at scale.” Proceedings of the VLDB Endowment 13(4): 435-448. ↩︎

  3. McSherry, F., et al. (2020). “Shared Arrangements: practical inter-query sharing for streaming dataflows.” Proceedings of the VLDB Endowment 13(10): 1793-1806. ↩︎

  4. Rehrmann, Robin, et al. “Oltpshare: the case for sharing in OLTP workloads.” Proceedings of the VLDB Endowment 11.12 (2018): 1769-1780. ↩︎

  5. Zeuch, S., et al. (2019). “Analyzing efficient stream processing on modern hardware.” Proceedings of the VLDB Endowment 12(5): 516-530. ↩︎

  6. A. Koliousis, M. Weidlich, R. Castro Fernandez, A. L. Wolf, P. Costa, and P. Pietzuch. Saber: Window-based hybrid stream processing for heterogeneous architectures. In SIGMOD, pages 555–569. ACM, 2016. ↩︎

  7. Cai, Qingchao, et al. “Efficient distributed memory management with RDMA and caching.” Proceedings of the VLDB Endowment 11.11 (2018): 1604-1617. ↩︎

  8. A. Trivedi, P. Stuedi, J. Pfefferle, R. Stoica, B. Metzler, I. Koltsidas, and N. Ioannou. On the [ir]relevance of network performance for data processing. In USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 16). USENIX Association, 2016. ↩︎

  9. Shahvarani, A. and H.-A. Jacobsen (2020). Parallel index-based stream join on a multicore cpu. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. ↩︎

  10. Krishnan, Sanjay, et al. “Learning to optimize join queries with deep reinforcement learning.” arXiv preprint arXiv:1808.03196 (2018). ↩︎

  11. Mai, Luo, et al. “Chi: a scalable and programmable control plane for distributed stream processing systems.” Proceedings of the VLDB Endowment 11.10 (2018): 1303-1316. ↩︎

  12. Tangwongsan, Kanat, Martin Hirzel, and Scott Schneider. “Optimal and general out-of-order sliding-window aggregation.” Proceedings of the VLDB Endowment 12.10 (2019): 1167-1180. ↩︎
