ApacheCon: Apache Project Practices in Cloud-Native Big Data

The first in-person China summit of CommunityOverCode Asia (formerly ApacheCon Asia), the official global conference series of the Apache Software Foundation, will be held at the Park Plaza Beijing on August 18-20, 2023. The conference features 17 forums covering hundreds of cutting-edge topics.

The ByteDance Cloud-Native Computing team is deeply involved in this CommunityOverCode Asia summit: eight engineers will present six talks across four tracks, sharing hands-on experience applying Apache open-source projects in ByteDance's business. In addition, Li Benchao, Apache Calcite PMC member and Apache Flink committer, will deliver a keynote on his experience and takeaways from contributing to open source.

 

Keynote Speech

Is it difficult to contribute to open source?

Many developers have considered contributing to open source to build their technical skills and influence. There is usually a gap between the ideal and the reality: work leaves no spare time; the barrier to entry of open-source projects seems too high and it is unclear where to start; or early contributions drew little response from the community, so the effort was not sustained. In this keynote, Li Benchao will draw on his own experience to share stories and reflections from contributing to the open-source community: how to overcome these difficulties, eventually make a breakthrough in the community, and strike a balance between work and open-source contribution.

Li Benchao

ByteDance, Flink SQL Tech Lead

Apache Calcite PMC member and Apache Flink committer. A graduate of Peking University, he currently works on the ByteDance streaming computing team as the tech lead for Flink SQL.


Special Topic: Data Lakes and Data Warehouses

Building a Real-Time Data Lake Based on Flink

Wang Zheng, Volcano Engine Cloud-Native Computing R&D Engineer

Min Zhongyuan, Volcano Engine Cloud-Native Computing R&D Engineer

Talk abstract: A real-time data lake is a core component of modern data architecture, allowing enterprises to analyze and query large volumes of data in real time. This talk first covers the current pain points of real-time data lakes, such as demands on data timeliness, diversity, consistency, and accuracy. It then describes how ByteDance builds a real-time data lake on Flink and Iceberg, in two parts: ingesting data into the lake in real time, and using Flink for ad-hoc OLAP queries. Finally, it presents some of the practical benefits ByteDance has gained from real-time data lakes.
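The combination described above, streaming ingestion plus ad-hoc OLAP queries on the same table, hinges on snapshot isolation, which Iceberg provides through immutable table snapshots. As a toy illustration (pure Python, not the Iceberg API or ByteDance's implementation), the sketch below models a table where each committed write batch produces a new snapshot, so a reader can query a consistent snapshot while the writer keeps appending:

```python
from dataclasses import dataclass, field

@dataclass
class LakeTable:
    """Toy model of an Iceberg-style table: every commit creates an immutable snapshot."""
    snapshots: list = field(default_factory=list)  # each snapshot is a tuple of rows

    def commit(self, batch):
        # A commit produces a new immutable snapshot containing all rows so far.
        prev = self.snapshots[-1] if self.snapshots else ()
        self.snapshots.append(prev + tuple(batch))
        return len(self.snapshots) - 1  # snapshot id

    def scan(self, snapshot_id=None):
        # Readers see one consistent snapshot, unaffected by later commits.
        if not self.snapshots:
            return ()
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return self.snapshots[snapshot_id]

table = LakeTable()
s0 = table.commit([("user1", "click"), ("user2", "view")])  # streaming writer
pinned = table.scan(s0)                                     # OLAP reader pins snapshot 0
table.commit([("user3", "click")])                          # writer keeps appending
assert len(pinned) == 2        # the pinned snapshot is unchanged
assert len(table.scan()) == 3  # the latest snapshot sees all rows
```

This is why the writer's real-time ingestion and Flink's OLAP scans do not interfere: a query binds to one snapshot for its whole lifetime.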

Speaker bio: Wang Zheng joined ByteDance in 2021 and works on the infrastructure open platform team, mainly responsible for R&D in Serverless Flink and related directions;

Min Zhongyuan joined ByteDance in 2021 and works on the infrastructure open platform team, mainly responsible for R&D in Serverless Flink and Flink OLAP.

Special Topic: Artificial Intelligence / Machine Learning

ByteDance's Unified Batch-Stream Deep Learning Training Practice

Mao Hongyue, ByteDance Infrastructure Engineer

Talk abstract: As the company's business grows, algorithm complexity keeps increasing, and more and more models are exploring real-time training on top of offline updates to improve model quality. To orchestrate complex offline and real-time training flexibly, switch freely between them, and schedule offline compute resources over a wider range, machine-learning training is gradually converging on a unified batch-stream architecture. This talk covers the architectural evolution of ByteDance's machine-learning training scheduling framework, unified batch-stream training practice, and heterogeneous elastic training. It focuses on practical experience with multi-stage, multi-source hybrid orchestration, global shuffle of streaming samples, full-link native execution, and training-data insight in the MFTC (unified batch-stream collaborative training) scenario.
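The "global shuffle of streaming samples" mentioned above addresses a concrete problem: streaming samples arrive in temporal order, which correlates adjacent training examples. A common remedy, sketched below in plain Python as an assumed approach rather than ByteDance's actual implementation, is a bounded shuffle buffer: fill a buffer, then for each new arrival emit a randomly chosen buffered element in its place:

```python
import random

def shuffle_stream(stream, buffer_size, rng):
    """Approximate a global shuffle over an unbounded stream with a bounded buffer."""
    buf = []
    for sample in stream:
        if len(buf) < buffer_size:
            buf.append(sample)      # warm-up: fill the buffer first
            continue
        # Swap a random buffered sample out for the new arrival.
        i = rng.randrange(buffer_size)
        out, buf[i] = buf[i], sample
        yield out
    rng.shuffle(buf)                # drain the remainder in random order
    yield from buf

rng = random.Random(42)
shuffled = list(shuffle_stream(range(1000), buffer_size=100, rng=rng))
assert sorted(shuffled) == list(range(1000))  # nothing lost or duplicated
assert shuffled != list(range(1000))          # temporal order is broken up
```

A larger buffer gives a shuffle closer to uniform at the cost of memory and sample latency, which is exactly the trade-off a streaming trainer has to tune.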

Speaker bio: Joined ByteDance in 2022 to work on machine-learning training R&D, mainly responsible for the large-scale cloud-native unified batch-stream AI model training engine, supporting businesses such as Douyin video recommendation, Toutiao recommendation, Pangle advertising, and Qianchuan graphic advertising.

ByteDance Spark in Support of 10,000-GPU Model Inference

Liu Chang, ByteDance Infrastructure Engineer

Zhang Yongqiang, ByteDance Machine Learning System Engineer

Talk abstract: With the development of cloud native, thanks to its strong ecosystem and influence, more and more kinds of workloads, including big data and AI, have begun migrating to Kubernetes. Internally, ByteDance has explored migrating Spark from Hadoop to Kubernetes so that jobs run cloud-natively. At the same time, demand grew sharply for offline batch tasks that consume large numbers of GPUs, and as these tidal tasks increased, a series of problems surfaced: a large remaining gap in GPU compute supply (card-hours), single-datacenter resource pools unable to match the growing compute needs of individual tasks, wasted compute in the online resource pool, and the lack of a unified platform entry point. Through GPU sharing, mixed GPU scheduling, Spark engine enhancements, and improvements to the platform and surrounding ecosystem, Spark and AML (Applied Machine Learning) jointly support offline mixed-GPU model inference at 10,000-card scale. This supports model-scoring data cleaning over more than 8 billion items of dynamic training data, completed in 7.5 hours on a mixed pool of 7,000 GPUs, with significant improvements in resource efficiency and stability.
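GPU sharing, one of the techniques named above, generally means packing several inference tasks that each need only a fraction of a card onto the same physical GPU. The sketch below is a deliberately minimal first-fit packing illustration (hypothetical, not ByteDance's scheduler) to show the basic mechanics:

```python
def pack_tasks(demands, num_gpus):
    """First-fit: place each fractional-GPU demand on the first card with room.

    demands: fractions of one GPU per task (e.g. 0.25 means a quarter card).
    Returns a list mapping each task index to a GPU index; raises if pool is full.
    """
    free = [1.0] * num_gpus          # remaining capacity per card
    placement = []
    for task, need in enumerate(demands):
        for gpu, cap in enumerate(free):
            if need <= cap + 1e-9:   # tolerance for float rounding
                free[gpu] -= need
                placement.append(gpu)
                break
        else:
            raise RuntimeError(f"task {task} (needs {need}) does not fit")
    return placement

# Six fractional tasks totalling 2.0 cards fit exactly on two GPUs.
placement = pack_tasks([0.25, 0.25, 0.5, 0.25, 0.25, 0.5], num_gpus=2)
assert placement == [0, 0, 0, 1, 1, 1]
```

A production system layers isolation (compute and memory limits per task) on top of placement; the packing decision itself is the part this sketch shows.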

Speaker bio: Liu Chang joined ByteDance in 2020 and works on the infrastructure batch computing team, mainly responsible for Spark cloud-native work and R&D directions such as Spark on Kubernetes;

Zhang Yongqiang joined ByteDance in 2022 and works on the AML machine-learning systems team, participating in building the large-scale machine-learning platform.

Special Topic: Data Storage and Computing

ByteDance's MapReduce to Spark Smooth Migration Practice

Wei Zhongjia, ByteDance Infrastructure Engineer

Talk abstract: As the business has developed, ByteDance now runs about 1.2 million Spark jobs online every day, while roughly 20,000 to 30,000 MapReduce jobs still run daily. As a batch framework with a long history, the MapReduce engine poses a series of operational problems from the platform's perspective: low ROI on framework iteration, poor adaptability to newer compute-scheduling frameworks, and so on. From the user's perspective there are also problems: poor compute performance, and the need for extra pipeline tools to manage serially dependent jobs. Users want to migrate to Spark, but there is a large stock of existing jobs, many of which rely on scripts that Spark does not natively support. Against this background, the ByteDance batch team designed and implemented a scheme for smoothly migrating MapReduce jobs to Spark: users complete the migration by adding only a few parameters or environment variables to existing jobs, greatly reducing migration cost and yielding good cost benefits.
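The key idea in the migration scheme above is a compatibility layer: the user's mapper and reducer code stays untouched while a different engine executes it. The toy shim below (pure Python, hypothetical; not ByteDance's actual translation layer) shows the shape of such a layer by running unmodified word-count mapper/reducer functions through a generic driver:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    """Generic driver: the user's mapper/reducer run unchanged on any engine."""
    mapped = [kv for rec in records for kv in mapper(rec)]
    mapped.sort(key=itemgetter(0))  # stands in for the engine's shuffle phase
    return {key: reducer(key, [v for _, v in group])
            for key, group in groupby(mapped, key=itemgetter(0))}

# Unmodified "legacy" MapReduce-style user code:
def wc_mapper(line):
    return [(word, 1) for word in line.split()]

def wc_reducer(word, counts):
    return sum(counts)

result = run_mapreduce(["a b a", "b c"], wc_mapper, wc_reducer)
assert result == {"a": 2, "b": 2, "c": 1}
```

Because only the driver changes, the user-visible migration cost collapses to configuration, which is the effect the talk describes.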

Speaker bio: Joined ByteDance in 2018 and is currently a big-data development engineer in ByteDance infrastructure. He focuses on big-data distributed computing and is mainly responsible for Spark kernel development and ByteDance's in-house Shuffle Service.

ByteDance's 100-Billion-File HDFS Cluster Practice

Xiongmu, Volcano Engine Big Data Storage R&D Engineer

Talk abstract: As big-data technology develops, data scale and usage complexity keep growing, and Apache HDFS faces new challenges. At ByteDance, HDFS is not only the storage layer for traditional Hadoop data-warehouse workloads, but also the foundation for compute engines in a storage-compute-separated architecture and for machine-learning model training. ByteDance has built storage scheduling that serves large-scale compute resource scheduling across multiple regions, improving the stability of compute tasks; it also provides integrated capabilities including user-side caching, conventional three-replica storage, cold-data identification, and hot/cold data placement. This talk introduces how ByteDance understands the new requirements that emerging scenarios place on traditional big-data storage, and how it supports system stability across scenarios through technical evolution and the construction of an operations system.
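Cold-data identification, mentioned above, is at heart a policy over access metadata. As a simplified sketch (the threshold and paths are invented for illustration; this is not ByteDance's policy), files untouched for longer than a cutoff are selected for the cold tier:

```python
import time

SECONDS_PER_DAY = 86400

def tier_files(files, now, cold_after_days=30):
    """Split files into hot/cold tiers by last access time.

    files: dict of path -> last access timestamp (seconds since epoch).
    Returns (hot_paths, cold_paths) as sets.
    """
    cutoff = now - cold_after_days * SECONDS_PER_DAY
    hot = {path for path, atime in files.items() if atime >= cutoff}
    cold = set(files) - hot
    return hot, cold

now = time.time()
files = {
    "/warehouse/events/2023-08-01": now - 5 * SECONDS_PER_DAY,    # recently read
    "/warehouse/events/2022-01-01": now - 400 * SECONDS_PER_DAY,  # stale
}
hot, cold = tier_files(files, now)
assert hot == {"/warehouse/events/2023-08-01"}
assert cold == {"/warehouse/events/2022-01-01"}
```

At HDFS scale, the interesting engineering is collecting access metadata cheaply and moving replicas without disturbing readers; the classification rule itself stays this simple.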

Speaker bio: Mainly responsible for the evolution of HDFS metadata services in big-data storage and for supporting the upper-layer compute ecosystem.

Special Topic: Cloud Native

ByteDance's Cloud-Native YARN Practice

Shao Kaiyang, Volcano Engine Cloud-Native Computing R&D Engineer

Talk abstract: ByteDance's internal offline business runs at enormous scale: hundreds of thousands of nodes and millions of tasks run online every day, consuming resources on the order of tens of millions of cores daily. Internally, separate offline and online scheduling systems have been responsible for offline and online workloads respectively. As business scale has grown, this split has exposed shortcomings. With two separate systems, major-event scenarios require converting resources between online and offline pools through manual operations, a heavy burden with a long turnaround; the inconsistent resource pools keep overall utilization low, and functions such as quota control and machine operations cannot be reused; and big-data jobs cannot enjoy the benefits of cloud native, such as reliable, stable isolation and convenient operations. The two systems urgently need to be unified, but traditional big-data engines were not designed for cloud native and are hard to deploy directly on the cloud: each compute engine and task would need deep modification to replicate the features native YARN provides, at huge cost. Against this background, ByteDance proposes a cloud-native YARN solution, Serverless YARN, which is 100% compatible with the Hadoop YARN protocol. Big-data jobs in the Hadoop ecosystem can migrate transparently to the cloud-native system without modification, online and offline resources can be converted efficiently and flexibly with time-division multiplexing, and overall cluster resource utilization has improved significantly.
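Time-division multiplexing of online and offline resources, as described above, can be pictured as a schedule that shifts capacity between pools by time of day: offline batch jobs borrow capacity during the online traffic trough. A deliberately simplified sketch (the hours and split ratios are invented for illustration, not ByteDance's actual policy):

```python
def pool_split(hour, total_nodes):
    """Return (online_nodes, offline_nodes) for a given hour of day.

    Illustrative assumption: online traffic peaks 08:00-23:00, so offline
    batch workloads borrow most of the capacity overnight.
    """
    if 8 <= hour < 23:                  # daytime / evening peak
        online = int(total_nodes * 0.8)
    else:                               # overnight trough
        online = int(total_nodes * 0.3)
    return online, total_nodes - online

assert pool_split(12, 1000) == (800, 200)  # peak: online holds most nodes
assert pool_split(3, 1000) == (300, 700)   # trough: offline borrows capacity
```

In a real scheduler the split is driven by live load signals rather than a fixed clock, and conversions happen gradually with preemption safeguards; the clock-based rule above only shows the shape of the idea.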

Speaker bio: Responsible for offline scheduling work in ByteDance infrastructure, with many years of experience in engineering architecture.

 


Source: blog.csdn.net/weixin_46399686/article/details/132227993