Storage-Compute Separation in Practice: JuiceFS in China Telecom's PB-per-Day Data Scenario

01- Challenges of big data operations & upgrade thinking

Challenges of Big Data Operations

China Telecom's big data clusters take in a huge volume of data every day; a single business line alone can generate data at the PB level per day. Large amounts of expired (cold) data and redundant data put heavy pressure on storage. Each provincial company runs its own cluster, and several group-level big data clusters also collect business data from the provinces nationwide, so data is scattered and duplicated. Provincial clusters and group clusters cannot share data, and cross-regional tasks suffer high latency.

China Telecom began building clusters as early as 2012. Internal clusters were deployed by various vendors or other internal teams, the businesses running on them were operated by those vendors, and the O&M teams were likewise provided by the vendors. As a result, the clusters span many distributions, including Apache, CDH, and HDP. As cluster sizes keep growing, O&M pressure keeps mounting, and vendors have to be called in to locate and fix problems. This is not a sustainable path.

To address these pain points in the production network, strengthen cluster security, focus on reducing costs and improving efficiency, and meet internal and external support needs, China Telecom set up an in-house PaaS team in 2021. Over two years, the PaaS team optimized the existing clusters and kept tens of thousands of machines running smoothly.

At the beginning of 2022, the PaaS team began independently developing the TDP (Telecom Data Platform) big data platform to gradually replace the existing clusters and move toward productization. In the first half of 2022, two new clusters were deployed on the Hadoop 2 based TDP platform and put into production. In the second half of 2022, the Hadoop 3 based TDP platform was developed, and we began to face the question of how to use this self-developed platform to upgrade the large number of Hadoop 2 clusters in the production network.

Cluster upgrade thinking

In upgrading the clusters, we want the new cluster design to solve the existing pain points, incorporate advanced industry features, and lay the groundwork for subsequent technical iteration.

The following are the problems we hope to solve during the cluster upgrade process:

Split into small clusters

We plan to split large clusters into smaller clusters for the following reasons:

From a machine-resource perspective, we cannot spare thousands of machines at once to migrate the existing business. In addition, some very important businesses require high SLA guarantees and cannot simply be upgraded in place from Hadoop 2 to Hadoop 3 in production.

Each large cluster hosts many different businesses. Splitting it into small clusters lets us partition by business, minimizing mutual interference and reducing the pressure and risk of business migration. It also contains the instability that individual tasks might otherwise spread across the whole cluster, making stability easier to control.

For example, some machine learning jobs are not written with Spark or a machine learning framework; they call Python libraries directly from their own programs and place no limit on thread usage. Even if such a task requests only 2 cores and 10 GB of memory, the actual load on the machine may exceed 100. Splitting into small clusters therefore reduces interference between tasks, which matters especially when very important jobs run on the platform. With smaller node counts, operation and maintenance also becomes easier.

In addition, splitting the clusters keeps NameNode and Hive metadata from ballooning and reduces overall O&M pressure. Therefore, where the business allows it, we plan to split the large clusters into small clusters for the upgrade.

Make the upgrade process as smooth as possible

Splitting off small clusters involves two dimensions: data and computation. Data migration takes a long time, and if the business is complex, migrating the computation may take a long time as well. We therefore need a way to decouple data migration from compute migration and maximize the period during which the two clusters run in parallel.

Data access across multiple clusters

Once a large cluster is split into small clusters, we have to consider how data is accessed across them. At the same time, our internal systems span tens of thousands of machines and massive amounts of data, and we have long faced problems with moving different types of data around, with redundancy, and with hot and cold data.

Combining big data and AI needs

Our PaaS platform is gradually taking on various AI workloads, and one of the biggest needs is storage for unstructured data. Integrating this requirement with the existing storage for structured and semi-structured data is also a cutting-edge direction in the industry.

Cost reduction

After a large cluster is split into small clusters, resources actually become tighter. Usage differs from cluster to cluster: some clusters may only be busy during holidays, weekends, or daily batch runs. We therefore need to ensure that idle resources are fully utilized.

All of our existing machines are high-performance machines with ample storage, memory, and CPU. Does every business need to buy such machines in future purchases? For a small cluster, for example, could we stand up a cluster quickly while saving part of the storage and compute cost? In addition, in upgrading from Hadoop 2 to Hadoop 3, erasure coding (EC) can save roughly 50% of storage (for example, RS-6-3 erasure coding stores about 1.5x the raw data, versus 3x for triple replication), and we hope to push overall storage cost down further.

Based on the above considerations, the following four strategies are summarized:

• Storage-compute separation: decouple storage from computing.
• Object storage: use object storage to hold structured, semi-structured, and unstructured data.
• Elastic computing: use elastic computing to address the under-utilization of cluster resources after splitting into small clusters.
• Containerization: use container technology to handle deep learning workloads and resource management, achieving more effective cost reduction and efficiency gains.

02- Storage-computing separation architecture design & construction process

Storage-compute separation: component selection

The early big data architecture was a storage-compute coupled cluster based on Hadoop 2.0, with the same high-performance machines handling both computing and storage. The current architecture separates storage from computing: more disks are devoted to object storage, forming an object storage pool with a corresponding metadata acceleration layer on top, and all HDFS-style access reaches the underlying object storage through that metadata acceleration layer.

Technology selection for storage-compute separation

Object storage

When evaluating object storage, we mainly compared MinIO, Ceph, and Curve; public cloud object storage was out of scope. Among the three, we finally chose Ceph, which is the most widely used in the industry and supports Kubernetes container environments, S3, and underlying block storage.

Interfacing with HDFS

The main goal of storage-compute separation is to connect object storage to HDFS. This work covered both our self-developed Hadoop 2 and Hadoop 3 platforms. Initially we adopted the S3 connector code contributed by Amazon; Alibaba Cloud, Tencent Cloud, and Huawei Cloud have also built their own implementations and contributed them to the Hadoop community. But these solutions all lack metadata acceleration.

In recent years, metadata acceleration and data caching technologies have matured. They address the loss of data locality that Yarn workloads suffer once storage and computing are separated. For this integration, we wanted not only to connect object storage to HDFS but also to reach an industry-leading level of performance.

Object storage can interface with HDFS in several ways, for example through Hadoop's native connectors, Ceph's CephFileSystem, or open source products such as JuiceFS. Cloud vendors offer similar products, such as Alibaba Cloud's JindoFS and Tencent Cloud's GooseFS, which provide metadata acceleration and caching.

Although the cloud vendors' products are mature and operate at scale, their source code is not available and they are tied to the vendors' cloud resources. We therefore chose the open source JuiceFS: it is currently the most mature open source option, has a highly active community, and is compatible with commercial Hadoop distributions such as CDH and HDP. In the end, we settled on Hadoop 3 + JuiceFS + TiKV + Ceph as our storage-compute separation stack.
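For reference, below is a minimal sketch of how a compute cluster might be wired to this stack through the JuiceFS Hadoop Java SDK. The class names and keys (fs.jfs.impl, juicefs.meta, the cache settings) follow the public JuiceFS Hadoop SDK documentation; the TiKV PD endpoints and the volume name are placeholders, and the Ceph RADOS bucket itself is bound to the volume when it is formatted rather than in this configuration, so treat this as an illustration rather than our exact production settings.

```java
import org.apache.hadoop.conf.Configuration;

// Minimal sketch: wiring the JuiceFS Hadoop SDK into a Hadoop 3 compute cluster.
// In production these keys live in core-site.xml; the hostnames and the volume
// name ("dpi-vol") are placeholders, not the real cluster's values.
public class JuiceFsWiringSketch {
    public static Configuration juiceFsConf() {
        Configuration conf = new Configuration();
        // Register the JuiceFS Hadoop client for the jfs:// scheme.
        conf.set("fs.jfs.impl", "io.juicefs.JuiceFileSystem");
        conf.set("fs.AbstractFileSystem.jfs.impl", "io.juicefs.JuiceFS");
        // Metadata engine: a TiKV cluster (PD endpoints) plus the volume name.
        conf.set("juicefs.meta", "tikv://pd1:2379,pd2:2379,pd3:2379/dpi-vol");
        // Local data cache on compute nodes (cache size in MiB).
        conf.set("juicefs.cache-dir", "/data/jfs-cache");
        conf.set("juicefs.cache-size", "102400");
        return conf;
    }
}
```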

The value brought by the storage-computing separation architecture

  1. Less data per cluster, lower metadata pressure. Once computing and storage are decoupled, each can be scaled elastically and independently, enabling unified data storage shared across computing clusters. This significantly reduces the amount of data held by any single cluster and relieves overall metadata pressure.

  2. Metadata bottlenecks and single-point performance problems are resolved. Metadata can be scaled horizontally with no single-point bottleneck: the metadata pressure is carried by the metadata acceleration layer, which scales out, eliminating the original metadata bottleneck and single-point performance issues.

  3. Federation imbalance is resolved. Before moving to Ceph, the clusters had many HDFS federation imbalance problems: if a business used namespace3 (ns3), all of its data landed on ns3, leaving ns3 and the other namespaces unbalanced in both data volume and load.

  4. The overall scaling bottleneck is removed. The new clusters use erasure coding to cut storage cost, and the object storage scales out horizontally, greatly improving the cluster's ability to expand.

Storage-compute separation in practice: migrating the traffic trajectory project

Traffic trajectory data is mainly DPI data: the various traffic records generated as users access the Internet over 3G, 4G, and 5G. Telecom customer service can use a query page to check whether a user's data consumption over a historical period matches the fees deducted.

As the number of 5G users grows, 5G traffic data keeps pouring into the existing clusters, and storage pressure keeps rising. All of the data is gathered from 31 provinces through the collection system; the total volume has reached the PB level and is still growing, with roughly 800-900 TB processed per day. The business scenario itself is relatively simple; the challenge is the sheer scale of the data.

We chose this business scenario for migration because its SLA requirements are not especially strict and the workload itself runs hourly, so an hour of downtime has relatively little impact.

Given the data volume, we run hourly batch jobs that consume substantial resources to compute each user's hourly traffic consumption and distribution, and we store the results in HBase and Hive.

The existing collection system uploads all of the data to the Hadoop 2 cluster. The migration required bridging the Hadoop 2 cluster and the object storage, and JuiceFS played a key role here: with JuiceFS, the object storage can be mounted without restarting core services such as Yarn and HDFS. As new data arrives, the collection system writes it to object storage, and compute tasks read it from object storage directly. For existing tasks, only the data path needs to change, as illustrated in the sketch below; nothing else has to be modified.
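As a hedged illustration of that point, the sketch below reads a file through the standard Hadoop FileSystem API; switching from the old HDFS location to the JuiceFS-backed object storage is just a change of URI. The volume name and paths are invented for the example, not our real layout.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: a task that used to read from HDFS now reads from JuiceFS-backed
// object storage; only the URI changes. "dpi-vol" and the paths are examples.
public class PathSwitchSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml, including the JuiceFS keys
        // Old location on the Hadoop 2 cluster:
        // Path input = new Path("hdfs://hadoop2-ns/dpi/2023/03/01/00/part-00000");
        // New location after migration; same API, different scheme:
        Path input = new Path("jfs://dpi-vol/dpi/2023/03/01/00/part-00000");

        FileSystem fs = input.getFileSystem(conf);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(input), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine()); // read the first record as a smoke test
        }
    }
}
```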

Project Migration Practice

The rollout of storage-compute separation iterated very quickly; the whole project was implemented in only two months. The ease of use of JuiceFS was an important precondition for delivering on time and in full under heavy pressure, and in practice JuiceFS also played a very important role in solving several key problems.

First: PB-scale support capability

  1. Solving metadata storage and connection pressure

In earlier JuiceFS tests we used Redis as the metadata engine, with the metadata stored in Redis, and overall performance was very good. However, with hundreds of machines, every container started by every Yarn task accesses the metadata, and Redis collapsed under the load, so we replaced it with TiKV.

  2. Timestamp write contention

In clusters this large, even when the time windows are aligned, timestamp conflicts and write contention can still occur simply because of the number of machines. To address this, we applied patches to optimize the timing and contention handling and relaxed the corresponding configuration parameters.

  3. Garbage cleanup speed bottleneck

We found that the amount of data held in Ceph kept growing and was not being fully released. The root cause is that the DPI business generates a very large volume of data and retains only a few days of it, so every day we write PB-level data, consume PB-level data, and must also delete PB-level data.

  4. Recycle bin cleanup thread leak

After deployment and monitoring, we found that TiKV and Ceph became unstable at certain specific times. Investigation traced the problem to a thread leak in the client-side recycle bin cleanup.

Second: Improve performance under high load

In the pilot of the traffic trajectory project, machines with relatively high performance were chosen to meet the requirement of 32 TB of aggregate memory for computing and 10 PB of storage. However, the memory and CPU consumed by Ceph itself were not taken into account when sizing, so each machine's throughput, network bandwidth, and disk bandwidth were essentially saturated, an environment resembling a permanent stress test. Under such high load, Ceph had to be tuned to stop the downtime caused by memory exhaustion, and JuiceFS was optimized to speed up Ceph's data deletion and write performance.

Project planning

We have the following plans for the Hadoop 3 upgrade in 2023:

At the bottom layer, we will rely entirely on JuiceFS for metadata storage and acceleration, and split the object storage into different pools or clusters by business.

At the computing resource layer, each cluster will have its own compute resource pool, and we will add an elastic resource pool for scheduling across multiple pools.

At the unified access layer, a set of unified management tools will be provided: tasks are submitted through a task gateway, and multiple clusters are linked through metadata. The clusters will also be partitioned into different pods or clusters by domain, such as DPI clusters, location clusters, and texture clusters. Some clusters may additionally keep hot data in their own HDFS and improve performance through databases and MPP engines.

In addition, a unified set of cluster management tools will be provided, including storage profiling, task profiling, cluster log collection, and log persistence, so that the clusters can be monitored and managed more effectively.

In short, we hope to improve performance by splitting into small clusters and separating storage from computing, to accelerate metadata through JuiceFS while scheduling compute resources elastically, and finally to simplify operations with unified management tools.

03- Operation and maintenance experience sharing

How to use high-performance machines for hybrid deployments

In principle, cluster planning should avoid heterogeneous models and use machines of the same type wherever possible, so that the vcore-to-memory ratio stays constant. In practice, because the company is very cautious about approving machine requests, the traffic trajectory project obtained only about 180 older high-performance machines for the replacement. These machines are powerful, but they were not purpose-built as dedicated compute or storage nodes. To make full use of them, we co-located compute and storage on the same machines (hybrid deployment) to work around the planning constraints.

In total, these machines provide 10 PB of storage, 8,100 vcores (45 cores x 180 machines), and about 32 TB of memory (180 GB x 180 machines). On each node, Ceph reserves 48 GB of memory (4 GB x 12 OSDs), and the rest goes to Yarn.

Machine CPU and memory planning was unreasonable

During planning, the CPU and memory consumed by Ceph were not accounted for, which led to exhausted machine memory, high load, server downtime, and task failures. The Ceph nodes and the Hadoop cluster also share the same network cards: when a node goes down, OSD data migration is triggered, and the combination of compute-task shuffle traffic and migration traffic saturates the NICs. After some practice, we optimized the configuration as follows:

• All nodes: two SSDs in RAID 1 as the root disk, to improve stability.
• Compute nodes: a CPU-thread-to-memory ratio of roughly 1:4 to 1:5 is recommended, with Ceph's resources reserved in hybrid deployments.
• Storage nodes: about 6 GB of memory per OSD (one OSD per disk) is recommended, along with a dual network plane; if conditions allow, separate the internal and external networks so that Ceph's internal data replication is isolated from external access. This is the more ideal setup.
• Metadata nodes: high-performance NVMe disks are recommended. This conclusion came out of repeated discussions with PingCAP: with 180 compute machines in constant heavy use, the disks holding TiKV are under very high load, reaching 70%-80% utilization.
• Ceph node operating system: CentOS-Stream-8.
• Other node operating systems: CentOS 7.6+ or CentOS-Stream-8.

NodeManager local directory is unreasonable

Under high load at PB scale, tasks need a lot of local disk space, but almost all of the disk space had been allocated to Ceph, leaving HDFS with only a single mechanical hard disk. During task peaks, intermediate data had to be written to that single mechanical disk, and its I/O latency became the bottleneck for task execution.

After optimization, within the machines' constraints, the Yarn local directory was moved to the root disk (SSD) and all data disks were given to the OSDs, which resolved the disk-induced performance problem. A sketch of the relevant setting follows.
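The sketch below only illustrates the knob involved: yarn.nodemanager.local-dirs is a standard YARN property that in a real deployment is set in yarn-site.xml on the NodeManagers, and the SSD mount point shown here is a placeholder rather than our actual path.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch: point the NodeManager's local (intermediate/shuffle) directories at the
// SSD root disk instead of the mechanical disk that was handed over to Ceph OSDs.
// "/ssd/yarn/local" is a placeholder; in production this lives in yarn-site.xml.
public class YarnLocalDirSketch {
    public static void main(String[] args) {
        Configuration conf = new YarnConfiguration();
        conf.set(YarnConfiguration.NM_LOCAL_DIRS, "/ssd/yarn/local");
        System.out.println("yarn.nodemanager.local-dirs = " + conf.get(YarnConfiguration.NM_LOCAL_DIRS));
    }
}
```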

JuiceFS metric reporting turned off

To reduce the load on JuiceFS, we turned off all metric reporting to Pushgateway. Metrics have to be reported continuously by each container, and if Pushgateway responds too slowly, the HDFS callback stalls and the task cannot complete. This means some basic statistics can no longer be viewed, and we hope JuiceFS metrics can be exposed in other ways in the future. A hedged illustration of the setting involved follows.
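For illustration only: as far as we can tell from the JuiceFS Hadoop SDK documentation, metric pushing is driven by the Pushgateway address configured under juicefs.push-gateway, so leaving that key unset (or removing it) keeps clients from pushing metrics. Treat the key name as an assumption to verify against the SDK version in use.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch (assumption): with no Pushgateway address configured, the JuiceFS Hadoop
// client has nowhere to push metrics, so containers no longer block on slow reporting.
public class DisableMetricsPushSketch {
    public static Configuration withoutPushGateway(Configuration conf) {
        conf.unset("juicefs.push-gateway"); // key name per JuiceFS Hadoop SDK docs; verify for your version
        return conf;
    }
}
```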

Redis connection limit problem

When Redis is used as the metadata engine, the number of connections grows with the number of Yarn containers. When the data volume and task count are large, Redis's maximum connection limit (about 4K) is filled instantly. We therefore use Redis or a relational database only when processing cold data, and recommend TiKV (with NVMe data disks) for high-performance computing. We currently run TiKV, which can sustain roughly 100,000 parallel connections. The sketch below shows the kind of metadata URL change this switch involves.
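A hedged sketch of the switch: from the client's point of view, moving from Redis to TiKV is mainly a change of the juicefs.meta URL. The endpoints and volume name below are placeholders, not our production addresses.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: swapping the JuiceFS metadata engine from a single Redis instance to a
// TiKV cluster. All endpoints and the volume name ("dpi-vol") are placeholders.
public class MetaEngineSwitchSketch {
    public static void useRedis(Configuration conf) {
        // Workable for small or cold workloads, but every Yarn container opens a
        // connection, so a single Redis (~4K connection cap) fills up instantly at scale.
        conf.set("juicefs.meta", "redis://redis-host:6379/1");
    }

    public static void useTikv(Configuration conf) {
        // A TiKV cluster (PD endpoints) scales horizontally and handles far more
        // concurrent client connections (~100K in our case).
        conf.set("juicefs.meta", "tikv://pd1:2379,pd2:2379,pd3:2379/dpi-vol");
    }
}
```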

TiKV periodically busy every 6 hours

We ran into a problem that plagued us for a long time: with TiKV, a periodic "busy" state appeared every 6 hours. Log analysis showed that JuiceFS had opened a very large number of cleanup threads. We first tried to solve it by turning off TiKV's reporting mechanism, but the problem persisted.

Further investigation found a thread-leak bug in JuiceFS that caused each NodeManager to open tens of thousands of cleanup threads, each of which triggers a cleanup pass when it calls the filesystem. At 8 o'clock on the hour, these threads all ran their purge operations at the same time, overwhelming TiKV and producing very high spikes at those peak times.

Therefore, for the storage garbage collection mechanism, a choice has to be made between HDFS and JuiceFS. Both are viable, but we lean toward turning off HDFS's garbage collection and letting JuiceFS handle garbage collection on its own.

JuiceFS delete is slow

JuiceFS garbage collection ultimately has to delete files. When we first used JuiceFS, we found that even after tuning Ceph's parameters and raising the delete and write weights to the maximum, we still could not delete PB-level data every day.

Deletion performance was very low and needed to run multi-threaded, but each JuiceFS delete also requires the client to report metrics and then scan the recycle bin for files to delete, so with a huge number of deletions a single client could not keep up. The JuiceFS community eventually provided a solution and a corresponding patch that fixes the multi-threading issue and meets the petabyte-scale deletion need: JuiceFS is mounted on a few dedicated servers used only for deletion, with the number of deletion threads raised to the maximum (see the sketch below). The patch has not yet been merged into an official release, but it is expected to be merged in the future.
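The sketch below is only a rough illustration of the knob involved: our deployment raises the deletion concurrency on the dedicated deletion mounts, and as far as we know the JuiceFS Hadoop SDK exposes the equivalent setting as juicefs.max-deletes. Both the key and the value here are assumptions to verify against the SDK/CLI version and the community patch mentioned above.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch (assumption): on the few dedicated hosts that handle trash cleanup, raise
// the JuiceFS client's object-deletion concurrency so PB-scale daily deletes keep up.
// The key name and value are illustrative, not tuned production numbers.
public class DeleteThroughputSketch {
    public static Configuration forDeletionHosts(Configuration conf) {
        conf.set("juicefs.max-deletes", "100"); // default is small; check the key and limits for your version
        return conf;
    }
}
```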

JuiceFS write conflict

JuiceFS also has a write-conflict problem. It has been alleviated by lengthening the interval at which the parent directory's modification time is updated and by reducing frequent rewrites of file attributes, but it has not been fundamentally resolved. We are actively discussing the issue with the JuiceFS team, which plans to fix it in JuiceFS 1.0.4.

04- Subsequent plan

• Deploy larger-scale storage-compute separation clusters
• Explore connecting different clusters to different object storage pools
• Explore single-cluster access to multiple object storage clusters
• Explore combining storage-compute separation with data lakes
• Build a unified storage pool for structured and unstructured data

In the long run, we hope storage-compute separation can play a greater role in the upgrade of the tens of thousands of machines in our internal systems and be validated in more scenarios; solve the cluster stability problems caused by Ceph expansion; support Kerberos and Ranger to improve security; improve performance at very large scale; and keep improving the product's stability, security, and ease of use. We will also further explore cloud-native development.

Finally, China Telecom looks forward to continuing to communicate and solve problems together with the JuiceFS community and its experts, and very much hopes that community experts will join Telecom in building the TDP product. You are also welcome to try TDP.

If this is helpful to you, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)
