TKE User Story | Jobbang's Compute-Storage Separation Practice for Its Retrieval Service, Based on Fluid

Authors

Lu Yalin joined Jobbang in 2019 and leads the company's infrastructure-architecture R&D team. At Jobbang he has driven the evolution of the cloud-native architecture, including containerization, service governance, the Go microservice framework, and the adoption of DevOps.

Zhang Haoran joined Jobbang in 2019 and is a senior architect on the Jobbang infrastructure team. He has driven the evolution of Jobbang's cloud-native architecture and is responsible for multi-cloud Kubernetes cluster construction, Kubernetes component R&D, Linux kernel optimization and tuning, and the containerization of underlying services.

Background

Large-scale retrieval systems are a cornerstone of many companies' platform businesses. They typically run as very large clusters of thousands of bare-metal servers, hold enormous volumes of data, and face strict requirements for performance, throughput, and stability, with very little tolerance for faults. Beyond day-to-day operation, data iteration and service governance at this scale pose major challenges of their own: the efficiency of incremental and full data distribution, the tracking of short-term and long-term hot data, and so on all demand careful study. This article introduces the Fluid-based compute-storage separation architecture that Jobbang designed and implemented, which significantly reduces the operational complexity of a large-scale retrieval system and lets it be managed as smoothly as an ordinary online service.

Problems faced by large-scale retrieval systems

Jobbang's intelligent analysis and search features for a large body of learning materials rely on a large-scale data retrieval system. The cluster comprises more than 1,000 machines, with a total data volume of over 100 TB. The system is organized into shards, and each shard's data set is loaded by several servers. Operationally, we require P99 latency of 1.x ms, peak throughput of 100 GB/s, and stability above 99.999%.

In the earlier environment, to maximize read efficiency and stability, the design favored localized data storage. Our retrieval system generates new index items every day, requiring terabyte-scale data updates; this data is produced by an offline index-building service and must then be pushed to each of the corresponding shards. This mode brought several challenges, chiefly around data iteration and scalability:

  1. Discrete data distribution: every node in a shard must hold a full copy of that shard's data, which makes synchronized data delivery difficult. In practice, pushing data to individual server nodes requires hierarchical distribution: first to a first tier (tens of machines), then from the first tier to a second tier (hundreds), and finally to a third tier (thousands). The distribution cycle is long and requires layer-by-layer verification to guarantee data accuracy.

  2. Weak elasticity of business resources: the original architecture tightly couples compute and storage, binding data storage to compute resources, so resources cannot flex easily and scaling out capacity for peak traffic is slow and costly.

  3. Limited scalability of single-shard data: the amount of data a single shard can hold is capped by the storage limit of an individual machine in the sharded cluster. When that limit is reached, the data set has to be split, even though the split is driven by hardware constraints rather than business needs.

Together, these problems of data iteration and scalability bring cost pressure and weaken the automation of our processes.

Analyzing how the retrieval system runs and how its data is updated, we found that the key problem stems from the coupling of compute and storage, so we considered how to decouple them. Only by introducing a compute-storage separation architecture can the complexity be solved at its root. The essence of the separation is to stop having every node store a full copy of its shard's data, and instead keep each shard's data on logically remote machines. Separation brings problems of its own, of course: stability, how (and how fast) to read large volumes of data, how intrusive the change is to the business, and so on. But these problems are all solvable, and readily so with existing technology, which convinced us that compute-storage separation was the best solution for this scenario and could fundamentally resolve the system's complexity.

Solving the Complexity Problem with a Compute-Storage Separation Architecture

To solve the problems above, the new compute-storage separation architecture had to meet the following goals:

  1. Stable reads: compute-storage separation ultimately replaces plain local file reads with the cooperation of several components. The data loading mechanism may change, but read stability must remain at the same level as before.

  2. Fast reads under mass updates: when thousands of nodes across the shards update data at the same time, read speed must be maximized while keeping the pressure on the network under control.

  3. POSIX data access: POSIX is the interface that adapts to the widest range of business scenarios; it shields upstream applications from downstream changes without intruding on business code.

  4. A controllable data iteration process: for an online business, data iteration should be treated as a CD process on a par with service iteration, so its controllability is essential; it is itself part of the CD pipeline.

  5. Scalable data sets: the new architecture needed to be a replicable, easily extended pattern, so that it copes well with growth in both data sets and cluster scale.

To achieve these goals, we ultimately chose the open source Fluid project as the key link in the whole new architecture.

Component introduction

Fluid is an open source, Kubernetes-native distributed data set orchestration and acceleration engine. It mainly serves data-intensive applications in cloud-native scenarios, such as big data and AI workloads. Through the data-layer abstraction it provides on Kubernetes, data can be moved, replicated, evicted, transformed, and managed flexibly and efficiently between storage sources such as HDFS, OSS, and Ceph and the upper-layer cloud-native application compute on Kubernetes. The concrete data operations are transparent to users, who no longer need to worry about the efficiency of accessing remote data or the convenience of managing data sources, nor about how to help Kubernetes make operations and scheduling decisions.

Users simply access the abstracted data through the most natural Kubernetes-native data volume mode; all remaining tasks and underlying details are handed over to Fluid. The Fluid project currently focuses on two important scenarios: dataset orchestration and application orchestration.

Dataset orchestration caches the data of a specified dataset onto Kubernetes nodes with specified characteristics, while application orchestration schedules applications onto nodes that can store, or have already cached, the specified dataset. The two can also be combined into collaborative orchestration, where node resources are scheduled with the requirements of both the dataset and the application taken into account.
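To make the dataset side concrete, here is a minimal sketch of a dataset declaration; the names and the OSS endpoint are hypothetical, not Jobbang's actual configuration. A Fluid Dataset points at the remote storage source, and an AlluxioRuntime describes the cache that backs it:

```yaml
# Minimal Fluid dataset declaration (illustrative names and endpoint).
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: index-data                 # hypothetical dataset name
spec:
  mounts:
    - mountPoint: oss://index-bucket/shard-0/   # hypothetical remote storage source
      name: shard-0
---
# The cache runtime backing the dataset: three cache workers with a memory tier.
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: index-data                 # must match the Dataset name
spec:
  replicas: 3                      # number of cache workers
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 20Gi                # cache capacity per worker
        high: "0.95"               # watermark above which eviction starts
        low: "0.7"                 # watermark down to which eviction proceeds
```

Once the Dataset and the Runtime are bound, Fluid creates a PV and a PVC named after the dataset, and that PVC is what application pods mount.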

Why we chose Fluid

  1. The retrieval service was already containerized, making it a natural fit for Fluid.

  2. As a data orchestration system, Fluid lets the upper layer consume data without knowing its concrete distribution; at the same time, its data-aware scheduling can place business workloads close to the data, accelerating data access.

  3. Fluid implements the PVC interface, so business pods can mount the data transparently and use it as if it were a local disk (a sketch follows this list).

  4. Fluid provides distributed, tiered caching of both metadata and data, as well as efficient file retrieval.

  5. Fluid plus Alluxio comes with multiple cache modes (back-to-source mode, full-cache mode), different cache strategies (optimizations for small-file scenarios, etc.), and storage media (disk, memory). It adapts well to different scenarios and serves a variety of business cases without much modification.
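As point 3 above notes, a business pod consumes the dataset through the standard PVC interface. A minimal sketch, with hypothetical pod and image names: the PVC carries the same name as the Dataset, and the container reads the data through ordinary POSIX file operations under the mount path.

```yaml
# A business pod mounting the Fluid dataset like any other volume.
apiVersion: v1
kind: Pod
metadata:
  name: retrieval-server                 # hypothetical pod name
spec:
  containers:
    - name: server
      image: retrieval-server:latest     # hypothetical image
      volumeMounts:
        - name: index-data
          mountPath: /data/index         # the service reads index files here via POSIX
  volumes:
    - name: index-data
      persistentVolumeClaim:
        claimName: index-data            # PVC created by Fluid, named after the Dataset
```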

Landing practice

  1. Separating cache nodes from compute nodes: although co-deploying the FUSE client with cache workers achieves better data locality, for the online scenario we ultimately chose to separate cache nodes from compute nodes. Trading somewhat longer startup time for better elasticity is worthwhile, and we did not want the stability of business nodes to be entangled with the stability of cache nodes. Fluid supports schedulable datasets, in other words, schedulable cache nodes: by specifying nodeAffinity on a dataset we place its cache nodes so that they provide cache service efficiently and flexibly (see the first sketch after this list).

  2. Meeting the demands of the online scenario: because the system has high requirements for data access speed, integrity, and consistency, partial data updates and unexpected back-to-source requests must never occur; the choice of caching and update strategy is therefore critical.

    • An appropriate caching strategy: based on the requirements above, we adopted Fluid's full-cache mode. In full-cache mode, all requests are served from the cache and never fall back to the data source, which avoids unexpectedly slow requests. The dataload process is in turn driven by the data update workflow, making it safer and more standardized.

    • An update flow combined with approval: data updates for an online business are a form of CD and need the same process controls. By combining dataload with an approval workflow, online data releases become safer and more standardized (see the DataLoad sketch below).

    • Atomic data updates: since a model consists of many files, a complete model is usable only after all of its files are cached. Under full cache with no back-to-source, the dataload process must therefore be atomic: the new data version must not be readable while it is still loading, and may only be read once loading has finished.
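For the cache-node separation described in point 1 above, the dataset's nodeAffinity pins cache workers to a dedicated node group. A minimal sketch, assuming a hypothetical node label cache-group=index-cache on the cache machines:

```yaml
# Scheduling a dataset's cache workers onto dedicated cache nodes.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: index-data
spec:
  mounts:
    - mountPoint: oss://index-bucket/shard-0/    # hypothetical remote storage source
      name: shard-0
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: cache-group               # hypothetical node label
              operator: In
              values:
                - index-cache                # only these nodes run cache workers
```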

These solutions and strategies, combined with our automated index-building and data version management functions, greatly improve the security and stability of the overall system, and make the end-to-end flow more intelligent and automated.
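The cache warm-up step itself is expressed as a Fluid DataLoad. A minimal sketch of how the approval-gated update pipeline might trigger it, with hypothetical names and paths: the pipeline waits for the DataLoad to complete before pointing readers at the new version, which gives the atomicity described above.

```yaml
# Preloading a new data version into the cache before readers switch to it.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: index-data-load-v2        # hypothetical; one DataLoad per data release
spec:
  dataset:
    name: index-data              # the dataset to warm up
    namespace: default
  loadMetadata: true              # also sync file metadata for the new version
  target:
    - path: /shard-0/v2           # hypothetical path of the new data version
      replicas: 1                 # number of cache replicas to preload
```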

Summary

With the Fluid-based compute-storage separation architecture, we have achieved:

  1. Minute-level distribution of terabyte-scale data.

  2. Data version management and atomic data updates, turning data distribution and updating into a manageable, smarter, automated process.

  3. Retrieval services that behave like ordinary stateless services, scaling horizontally with ease through TKE HPA; faster scaling brings greater stability and availability.
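Because the retrieval pods are now stateless, stock Kubernetes autoscaling applies to them directly. A minimal sketch of an HPA for a hypothetical retrieval Deployment, using the standard autoscaling/v2 API that TKE supports:

```yaml
# Horizontal autoscaling for the now-stateless retrieval service.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: retrieval-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: retrieval-server        # hypothetical deployment name
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60  # scale out when average CPU exceeds 60%
```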

Outlook

The compute-storage separation model has shown us that even highly specialized services can be made stateless and brought into the DevOps system like ordinary services, and a Fluid-based data orchestration and acceleration system is a practical entry point for this separation. Beyond the retrieval system, we are also exploring Fluid-based model training and model distribution for our OCR system.

Looking ahead, we plan to keep optimizing the scheduling strategy and execution mode of upper-layer jobs on Fluid, and to further expand model training and distribution to improve overall training speed and resource utilization. We also hope to help the community continue to evolve Fluid's observability and high availability, to benefit more developers.

About us

For more cloud-native cases and knowledge, follow the [Tencent Cloud Native] WeChat official account.

Benefits:

① Reply [Manual] in the official account to get the "Tencent Cloud Native Roadmap Manual" & "Tencent Cloud Native Best Practices".

② Reply [Series] in the official account to get the "100+ super practical cloud native original articles across 15 series", covering Kubernetes cost reduction and efficiency improvement, K8s performance optimization practices, best practices, and more.

③ Reply [White Paper] in the official account to get the "Tencent Cloud Container Security White Paper" & "The Source of Cost Reduction - Cloud Native Cost Management White Paper v1.0".

[Tencent Cloud Native]: news on new products, new technologies, new activities, and cloud insights. Scan the code to follow the official account of the same name and get more useful content in time!
