The practice of Fluid-based compute-storage separation in the Jobbang retrieval service

Authors | Dong Xiaocong, Zhang Haoran

Introduction: Fluid is a cloud-native data orchestration and acceleration project hosted by the Cloud Native Computing Foundation (CNCF), jointly initiated and open sourced by Nanjing University, Alibaba Cloud, and the Alluxio community. This article introduces the application of Fluid in the compute-storage-separated architecture behind the Jobbang retrieval service. With this Fluid-based architecture, Jobbang significantly reduces the operational complexity of its large-scale retrieval system, achieves minute-level distribution of 100 TB-scale data along with atomic data version management and data updates, and enables the retrieval service to scale horizontally through Kubernetes HPA like an ordinary stateless service, bringing higher stability and availability.

Large-scale retrieval systems have long been the underlying cornerstone of many companies' platform businesses. They typically run as ultra-large clusters of thousands of bare-metal servers, hold enormous volumes of data, and face extremely strict requirements on performance, throughput, and stability, with very little tolerance for failure.

Beyond day-to-day operations, data iteration and service governance in ultra-large clusters with massive data pose huge challenges: the efficiency of incremental and full data distribution, the tracking of short-term and long-term hot data, and similar problems all demand in-depth study.

This article introduces the Fluid-based compute-storage separation architecture designed and implemented at Jobbang, which significantly reduces the complexity of the large-scale retrieval system so that it can be managed as smoothly as an ordinary online service.

Problems faced by large-scale retrieval systems

The intelligent analysis and search features for Jobbang's many learning materials rely on a large-scale data retrieval system. The cluster comprises more than 1,000 machines, with a total data volume above 100 TB. The system is divided into several shards, and each shard's data set is loaded by several servers that hold identical copies. Operationally, we require P99 latency of 1.x ms, peak throughput at the 100 GB level, and availability above 99.999%.

In the past, to improve the efficiency and stability of data reads, more attention was paid to localized data storage. Our retrieval system produces index items every day, requiring terabyte-level data updates; these data are generated by an offline index-building service and must then be pushed to the corresponding shards. This mode brings many challenges, centered on data iteration and scalability:

1. Discreteness of the data set:

In practice, every node in a shard must hold a full copy of that shard's data, which makes synchronized data delivery difficult. To push data to every server node, we have to distribute it hierarchically: first to a first tier (tens of machines), then from the first tier to a second tier (hundreds of machines), and finally to a third tier (thousands of machines). This distribution cycle is long and requires layer-by-layer verification to guarantee data accuracy.

2. Weak elastic scaling of business resources:

The original architecture tightly couples computing and storage: data storage and compute resources are bound together, so the ability to scale resources elastically is limited.

3. Insufficient scalability of single-shard data:

The data volume of a single shard is capped by the storage limit of a single machine in the sharded cluster. When that limit is reached, the data set has to be split, even though the split is not driven by any business need.

These data-iteration and scalability problems in turn create cost pressure and weaken our automated processes.

Analyzing how the retrieval system runs and how its data is updated, we concluded that the key problem stems from the coupling of computing and storage. We therefore considered how to decouple them: only by introducing a compute-storage-separated architecture can the complexity be resolved at its root.

The essence of compute-storage separation is to stop storing a full copy of the shard's data on every node and instead keep each shard's data on logically remote machines. Separation brings its own problems, such as stability, how to read large volumes of data and at what speed, and how intrusive the change is to the business. However, these problems are all solvable, and solvable at reasonable cost. On this basis we concluded that compute-storage separation is the right remedy in this scenario, one that can fundamentally resolve the system's complexity.

A compute-storage separation architecture to solve the complexity problem

To solve the problems described above, the new compute-storage-separated architecture must achieve the following goals:

1. Stable reads. Compute-storage separation ultimately replaces plain local file reads with the cooperation of multiple components. The data loading path can change, but read stability must stay at the same level as before.

2. Fast loading under concurrent updates. When the thousands of nodes across the shards update data simultaneously, read speed must be maximized while the pressure on the network is kept under control.

3. Data access through the POSIX interface. POSIX is the access method most compatible with diverse business scenarios; using it avoids intruding on the business and shields upstream services from downstream changes.

4. Controllability of the data iteration process. For an online business, data iteration should be treated as a CD process on par with service iteration, so its controllability is extremely important: it is itself part of the CD pipeline.

5. Scalability of the data set. The new architecture must be a replicable, easy-to-extend pattern so that it can cope well with growth in both data set size and cluster scale.

To achieve these goals, we finally chose the open source Fluid project as the key link in the whole new architecture.

Component introduction

Fluid is an open source, Kubernetes-native distributed data set orchestration and acceleration engine, jointly open sourced by Nanjing University, Alibaba Cloud, and the Alluxio community. It mainly serves data-intensive applications in cloud-native scenarios, such as big-data and AI applications.

Through the abstraction of the data layer that Fluid provides on Kubernetes, data can be flexibly and efficiently moved, copied, evicted, transformed, and managed between storage sources such as HDFS, OSS, and Ceph and the upper-layer cloud-native applications computing on Kubernetes.

The concrete data operations are transparent to users, who no longer need to worry about the efficiency of accessing remote data, the convenience of managing data sources, or how to help Kubernetes make operational scheduling decisions. Users simply access the abstracted data through the most natural Kubernetes-native data volumes; all remaining tasks and low-level details are handed over to Fluid.

The Fluid project currently focuses on two important scenarios: dataset orchestration and application orchestration. Dataset orchestration caches the data of a specified dataset onto Kubernetes nodes with specified characteristics, while application orchestration schedules an application onto nodes that can store, or already store, the specified dataset. The two can also be combined into collaborative orchestration, i.e. node resource scheduling that considers the dataset and the application's requirements together.
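To make this concrete, a data set in Fluid is declared with the Dataset CRD and backed by a cache runtime such as JindoRuntime. The following is a minimal sketch, not our production configuration: the OSS path, cache medium, quota, and replica count are hypothetical placeholders, and access options (endpoint, credentials) are omitted.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-index                 # Fluid creates a PVC with this same name
spec:
  mounts:
    - mountPoint: oss://demo-bucket/index/   # hypothetical remote data source
      name: index
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: demo-index                 # must match the Dataset name
spec:
  replicas: 3                      # number of cache worker nodes
  tieredstore:
    levels:
      - mediumtype: SSD            # cache on local SSD; MEM (memory) also works
        path: /mnt/cache
        quota: 100Gi               # cache capacity per worker
        high: "0.95"               # high/low watermarks for cache usage
        low: "0.7"
```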

Why we chose Fluid

1. The retrieval service has already been containerized, which makes it a natural fit for Fluid.

2. As a data orchestration system, Fluid lets the upper layer use data directly without knowing how the data is distributed. At the same time, its data-aware scheduling capability allows services to be scheduled close to the data, accelerating data access.

3. Fluid implements the PVC interface, so a business Pod can mount the data set transparently and use it as if it were a local disk (see the sketch after this list). Fluid also provides distributed tiered caching of metadata and data, plus efficient file retrieval.

4. Fluid combined with JindoRuntime has multiple built-in cache modes (back-to-source mode, full-cache mode), different cache strategies (e.g. optimizations for small-file scenarios), and storage media (disk, memory). It adapts well to different scenarios and can satisfy a variety of business needs without much modification.
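The PVC-style access mentioned in point 3 looks roughly like the sketch below. Fluid creates a PV/PVC pair named after the Dataset, so the business Pod only declares an ordinary volume; the image name and mount path here are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: retrieval-server
spec:
  containers:
    - name: server
      image: retrieval-server:latest    # hypothetical business image
      volumeMounts:
        - name: index-vol
          mountPath: /data              # index data appears as ordinary local files
  volumes:
    - name: index-vol
      persistentVolumeClaim:
        claimName: demo-index           # PVC created by Fluid, same name as the Dataset
```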

Landing practice

1. Separation of cache nodes and compute nodes: Although co-deploying the FUSE clients and the cache workers can deliver better data locality, for the online scenario we ultimately chose to separate cache nodes from compute nodes. Trading a somewhat longer startup time for better elasticity is worthwhile, and we do not want the stability of business nodes to be entangled with that of cache nodes.

2. Fluid supports schedulability of datasets, in other words, schedulability of cache nodes. We place dataset cache nodes by specifying the nodeAffinity of the Dataset, ensuring that cache nodes can provide the cache service efficiently and elastically (see the sketch after this list).
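Pinning cache workers to a dedicated node pool can be expressed directly on the Dataset. A minimal sketch, assuming the cache machines carry a hypothetical label node-pool=fluid-cache:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo-index
spec:
  mounts:
    - mountPoint: oss://demo-bucket/index/   # hypothetical remote data source
      name: index
  nodeAffinity:                    # schedule cache workers onto dedicated nodes,
    required:                      # keeping them apart from business compute nodes
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-pool
              operator: In
              values:
                - fluid-cache
```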

1. High requirements for online scenarios

For online business scenarios, the system has high requirements on the speed, integrity, and consistency of data access; partial data updates and unexpected back-to-source requests must not occur. The choice of caching and update strategy is therefore critical.

2. Appropriate data caching strategy

Based on these requirements, we chose Fluid's full-cache mode. In full-cache mode, all requests are served from the cache and never fall back to the data source, which avoids unexpectedly long requests. At the same time, the dataload process is driven by the data update workflow, making it more secure and standardized.
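Cache warm-up in Fluid is driven by the DataLoad CRD, so it can be triggered by the update pipeline instead of happening lazily on first access. A minimal sketch with placeholder names:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: demo-index-load
spec:
  dataset:
    name: demo-index      # the Dataset to warm up
    namespace: default
  loadMetadata: true      # sync metadata before loading the data itself
  target:
    - path: /             # preload the entire data set into the cache
      replicas: 2         # number of cache replicas to load
```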

3. An update process combined with approval flow

For an online business, data updates are themselves a form of CD and likewise need to be managed and controlled by an update process. Combining the dataload mode with the approval process makes online data releases more secure and standardized.

4. Atomicity of data updates

Since a model consists of many files, it can be used only after all of its files are cached. Under the premise of full cache with no back-to-source, the dataload process therefore has to be atomic: while data is being loaded, the new version must not be accessible, and it becomes readable only after loading completes.
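One way to realize this, sketched here purely as an assumption about the workflow rather than a confirmed implementation detail, is to publish each data version into its own directory, warm only that directory, and switch the version pointer the service reads only after the DataLoad's status.phase reports Complete; the version tag and path are hypothetical.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: demo-index-load-v20211201   # hypothetical version tag
spec:
  dataset:
    name: demo-index
    namespace: default
  loadMetadata: true
  target:
    - path: /v20211201    # warm only the new version's directory; the service
      replicas: 2         # keeps serving the old version until this load completes
```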

The above solutions and strategies, combined with our automated index-building and data version management functions, greatly improve the security and stability of the overall system, and make the whole workflow more intelligent and automated.

Summary

With the Fluid-based compute-storage separation architecture, we have successfully achieved:

1. Minute-level distribution of 100 TB-scale data.

2. Atomic data version management and data updates, making data distribution and updating a manageable, smarter, automated process.

3. The retrieval service now behaves like an ordinary stateless service, scaling horizontally with ease through Kubernetes HPA; faster scaling brings higher stability and availability (see the sketch below).
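Once the index comes from the Fluid-backed volume rather than node-local disks, scaling the retrieval service is plain Kubernetes HPA. A minimal sketch with hypothetical names and thresholds:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: retrieval-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: retrieval-server       # hypothetical retrieval Deployment
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # hypothetical scaling threshold
```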

Outlook

Compute-storage separation lets us see that even highly specialized services can be made stateless and brought into the DevOps system like ordinary services, and the Fluid-based data orchestration and acceleration system is a practical entry point for such separation. Beyond the retrieval system, we are also exploring Fluid-based model training and model distribution in our OCR system.

For future work, we plan to keep optimizing the scheduling strategy and execution mode of upper-layer jobs based on Fluid, and to extend it further to model training and distribution to improve overall training speed and resource utilization. We also hope to help the community continuously evolve Fluid's observability and high availability, benefiting more developers.

About the authors

Dong Xiaocong: head of infrastructure at Jobbang, mainly responsible for architecture R&D, operations, DBA, security, and related work. He previously led architecture and technical management at companies such as Baidu and Didi, and excels at building and iterating business, technology, and R&D platforms.

Zhang Haoran: joined Jobbang in 2019 as a senior architect on the infrastructure team. At Jobbang, he has driven the evolution of the cloud-native architecture and is responsible for multi-cloud Kubernetes cluster construction, Kubernetes component development, Linux kernel optimization and tuning, and work related to containerizing underlying services.
