Flink OLAP helps ByteHTAP debut at VLDB, a top database conference

VLDB 2022 will be held in Sydney, Australia, from September 5 to September 9, 2022. The paper from ByteDance's infrastructure team, "ByteHTAP: ByteDance's HTAP System with High Data Freshness and Strong Data Consistency", was accepted by VLDB 2022, and the authors were invited to present it on site.
VLDB stands for the International Conference on Very Large Data Bases. It is one of the three long-established top conferences in the database field (SIGMOD, VLDB, ICDE) and a venue where outstanding research and development results are presented promptly, reflecting the cutting-edge directions of current database research, the latest industry technology, and the level of research and development in various countries. Since its founding in 1975, VLDB has attracted submissions from top research institutions around the world every year, and it sets extremely high standards for system innovation, completeness, and experimental design.

Core contributions of the paper

"ByteHTAP: ByteDance's HTAP System with High Data Freshness and Strong Data Consistency" describes the HTAP system that ByteDance built for its business scenarios, with high data freshness and strong data consistency.
 
  • ByteHTAP adopts an independent engine and shared storage architecture, and its modular system design makes full use of ByteDance's existing OLTP system and OLAP system.
  • ByteHTAP provides high data freshness, with a lag of less than 1 second, which opens up many new business opportunities for customers; customers can also configure different data freshness thresholds according to business needs.
  • ByteHTAP provides strong data consistency through the global timestamp of its OLTP and OLAP systems, freeing developers from having to deal with complex data consistency issues in the system.
  • ByteHTAP uses Flink as its OLAP computing engine and introduces several important performance optimizations in computing and storage, such as refactoring the Flink job scheduling process to improve query QPS, pushing computation down to the storage layer, and using delete bitmaps to handle deletes efficiently.
  • The paper closes by sharing ByteDance's lessons and best practices from developing and running ByteHTAP in production, including cross-database OLAP query capabilities, efficient data import, and Flink development enhancements.
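
The delete-bitmap idea mentioned in the list above can be pictured with a small sketch. This is only an illustration under stated assumptions, not ByteHTAP's implementation: the class `ColumnBlock` and its methods `mark_deleted` and `scan` are hypothetical names, and the bitmap is a plain Python integer. The point is that a delete only flips one bit while the stored rows stay untouched, and the scan skips flagged rows. The optional predicate also hints at how a filter can be evaluated next to the data instead of in the compute engine.

```python
from dataclasses import dataclass


@dataclass
class ColumnBlock:
    """Hypothetical immutable block of rows with a delete bitmap (illustration only)."""
    rows: list
    # One bit per row, packed into an int; bit i set means row i is deleted.
    delete_bitmap: int = 0

    def mark_deleted(self, row_index: int) -> None:
        """Record a delete by setting one bit; the row data is not rewritten."""
        if not 0 <= row_index < len(self.rows):
            raise IndexError(row_index)
        self.delete_bitmap |= 1 << row_index

    def scan(self, predicate=lambda row: True):
        """Yield live rows that satisfy the predicate, skipping deleted ones."""
        for i, row in enumerate(self.rows):
            if (self.delete_bitmap >> i) & 1:
                continue  # row i was deleted
            if predicate(row):
                yield row


block = ColumnBlock(rows=[("a", 1), ("b", 2), ("c", 3), ("d", 4)])
block.mark_deleted(1)  # delete ("b", 2)

live = list(block.scan())
filtered = list(block.scan(lambda row: row[1] > 2))
```

Here `live` contains every row except the deleted one, and `filtered` shows the same scan with a predicate applied at the block level. A production system would typically use a compressed bitmap rather than an integer; that choice is outside what the article describes.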

Core computing engine Flink OLAP

As the OLAP computing engine of the ByteHTAP system, Flink is already used by multiple businesses within the company. The ByteDance Flink technical team has made many in-depth optimizations to the Flink engine to support OLAP computing, effectively improving Flink OLAP performance. At present, on a 1600-core cluster with a small data volume, scheduling QPS at 128 concurrent queries exceeds 1000 for simple queries and 100 for complex queries; the latency of WordCount queries at 1000 concurrency is about 100 ms. Going forward, the team will focus on https://issues.apache.org/jira/browse/FLINK-25318 and gradually contribute its internal optimizations to the community. The optimizations cover the following areas.
  1. Query optimizer. Supports pushdown of operators such as TopN and Aggregate; supports Plan Cache and parallel DAG construction; supports a Cached Catalog. TPC-DS SF100 performance improves by more than 20%.
  2. Query execution optimization. Supports ClassLoader reuse and a cross-job Codegen Cache, reducing CPU usage and Metaspace consumption in the execution phase; implements Runtime Filter to speed up Join computation; adds asynchronous data reading and concurrency optimizations.
  3. Resource management and job scheduling. Simplifies the process of requesting and releasing query resources, optimizes the interaction between the JobMaster and the ResourceManager/TaskManager nodes, and allocates job resources at TaskManager granularity to improve resource request performance; supports batch deployment of computing tasks and optimizes the deployment structure and serialization/deserialization to improve task deployment performance.
  4. Query result management. Queries are submitted over the WebSocket protocol, and result delivery changes from Pull mode to Push mode, avoiding the time spent waiting on Pull polling; reusing Dispatcher connections removes unnecessary connections and interactions that the JobMaster and TaskManager would otherwise create when queries and computing tasks are initialized, reducing query latency.
  5. Memory management optimization. Optimizes how MemoryManager and NetworkBufferPool request and release memory, reducing the number of memory interactions and lock acquisitions when computing tasks start and stop; removes unnecessary metrics and adds parallel GC and other optimizations to reduce FGC/YGC on JobManager/TaskManager nodes, improving query execution performance and production cluster stability.
  6. Network management optimization. Reuses network connections across jobs within a TaskManager and optimizes the Partition Request interaction between upstream and downstream computing tasks, reducing the cost of frequent network-layer initialization and the number of messages between upstream and downstream tasks, and improving task initialization performance.
  7. Resource isolation management. Supports managing resource groups at the TaskManager level to physically isolate query jobs of different tenants; implements fine-grained scheduling and execution of computing tasks within a TaskManager and supports a strategy that prioritizes small queries under high load.
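
To make the Runtime Filter item in the list above more concrete, here is a minimal Python sketch of the general technique, not of Flink's or ByteDance's implementation. All function names (`build_runtime_filter`, `filtered_scan`, `hash_join`) and the sample data are hypothetical. The sketch assumes the filter is an exact set of build-side join keys; real engines often use a Bloom filter or a min/max range instead, which may let some non-matching rows through but never drops a matching one.

```python
from collections import defaultdict


def build_runtime_filter(build_rows, key):
    """Collect the distinct join keys seen on the (small) build side."""
    return {row[key] for row in build_rows}


def filtered_scan(probe_rows, key, runtime_filter):
    """Probe-side scan that drops rows whose key cannot match any build row."""
    for row in probe_rows:
        if row[key] in runtime_filter:
            yield row


def hash_join(build_rows, probe_rows, key):
    """Plain inner hash join: index the build side, then look up each probe row."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    for probe in probe_rows:
        for build in table.get(probe[key], []):
            yield {**build, **probe}


# Hypothetical sample data: a small dimension table and a larger fact table.
dims = [{"id": 1, "name": "x"}, {"id": 3, "name": "z"}]
facts = [{"id": i % 5, "value": i} for i in range(10)]

rf = build_runtime_filter(dims, "id")
survivors = list(filtered_scan(facts, "id", rf))  # rows that reach the join
result = list(hash_join(dims, survivors, "id"))
```

In this example only the fact rows whose `id` appears in `dims` reach the join, and the join result is the same as it would be without the filter. The saving comes from the rows that never have to be shuffled or probed.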

ByteDance Best Practices

Within ByteDance, ByteHTAP currently supports User Growth, e-commerce, Xingfuli, Feishu, and other businesses, running 11 clusters in total with more than 6,000 cores of AP resources and more than 500,000 queries per day.
Flink OLAP, the core computing engine of ByteHTAP, is gradually being launched in Volcano Engine's commercial product, the Flink version of streaming computing. As an enterprise-level unified computing engine that integrates and optimizes ByteDance's internal cloud-native big data solutions, the Flink version of streaming computing is ready to use out of the box and offers flexible deployment, unified stream and batch processing, and OLAP multi-modal computing.
 

Origin my.oschina.net/u/5941630/blog/5577578