How do big data clusters scale to 20,000+ nodes?

Abstract: Users' demand for multi-scenario fusion analysis means that the cluster cannot be split, nor can the data analysis business be partitioned in a way that breaks the associations between business modules. Huawei therefore set out to build a single cluster of 20,000 nodes.

On July 9, at the results conference of the Big Data Industry Summit, the China Academy of Information and Communications Technology (CAICT) issued certificates to products that passed its big data product capability evaluation. Huawei Cloud FusionInsight MRS passed the evaluation with full marks in every test item and broke through to an ultra-large single-cluster scale of 20,000 nodes, setting a new benchmark for the industry.

To cope with the rapid development of 5G and IoT, big data technology has been further strengthened on top of its distributed batch processing capabilities. As Huawei's big data product built on the Hadoop ecosystem, FusionInsight MRS has long been committed to exploring and practicing ultra-large-scale single-cluster capacity, so that as data grows exponentially, Huawei's self-developed big data products can smoothly meet user needs. With the acceleration of society's digital transformation, data volumes have surged beyond expectations. At the same time, users' demand for multi-scenario fusion analysis means that the cluster cannot be split, nor can the data analysis business be partitioned in a way that breaks the associations between business modules. Huawei's big data R&D team therefore began exploring a single cluster of 20,000 nodes.

Technical pain points of hyperscale clusters

For a distributed system, problems that are simple at small scale become extremely complicated as the cluster grows. As the number of nodes increases, even the simple heartbeat mechanism can overwhelm the master node. A 20,000-node FusionInsight MRS cluster faces many challenges:
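As a back-of-envelope illustration of why the heartbeat mechanism becomes a problem (the 3-second interval is an assumption, borrowed from the HDFS DataNode default; real per-heartbeat cost also varies with block reports and RPC overhead):

```python
# Back-of-envelope sketch: heartbeat load on a single master node.
# The 3-second interval is an assumed default; actual cost per heartbeat
# varies with block reports and RPC overhead.

def heartbeats_per_second(nodes: int, interval_s: float = 3.0) -> float:
    """Heartbeat RPCs the master must handle each second."""
    return nodes / interval_s

small = heartbeats_per_second(500)     # roughly 167 RPCs/s
huge = heartbeats_per_second(20_000)   # roughly 6,667 RPCs/s

# The master's work grows 40x while its hardware stays the same,
# which is why naive centralized designs stop scaling.
print(round(huge / small))  # 40
```

The same linear growth applies to block reports, container status updates, and every other per-node message a central master must absorb.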

1. How to schedule mixed batch, streaming, and interactive workloads efficiently in multi-tenant scenarios, scale cluster size and processing capacity linearly, and multiplex peak and trough resources across engines

An ultra-large cluster effectively solves centralized data storage, but stored data generates no value on its own; value comes only from extensive analysis. Generating fixed reports from batch jobs is a common use of big data platforms, but if hundreds of petabytes of data are used only for batch runs, both the data and massive computing resources are wasted. Time is money, and time is efficiency: T+0 data ingestion into the lake, with real-time updates, continuously accelerates the realization of data value. An ultra-large-scale cluster should support T+0 real-time ingestion, batch analysis over all the data, and interactive exploration by data analysts at the same time, maximizing the value of the platform. For example, a large cluster should simultaneously handle T+0 real-time ingestion and batch analysis while also serving ad hoc queries from a large number of analysts, with computing resources both isolated and shared. This is an important problem the scheduling system must solve.

2. How to face new challenges in storage, computing, and management, and break through the bottlenecks of multiple components

Computing: As the cluster grows, YARN's ResourceManager schedules more resources and more tasks run in parallel, which places higher demands on the central scheduling process. If scheduling speed cannot keep up, jobs accumulate at the entrance of the cluster and the cluster's computing resources cannot be used effectively.

Storage: As capacity grows, HDFS must manage more file objects on a large cluster, and the HDFS NameNode's metadata grows accordingly. The community provides the NameNode federation mechanism, but the application layer must be aware of the different NameNodes' namespaces, making use and maintenance extremely complicated, and data easily becomes unevenly distributed across namespaces. Meanwhile, as data volume increases, Hive metadata grows sharply, putting great pressure on the metadata database; SQL statements can easily pile up on the metadata query path and cause blockage.

Operation and maintenance: Beyond computing and storage, the platform's O&M capabilities also hit bottlenecks at scale. In the monitoring system, for example, growing from 5,000 to 20,000 nodes raises the number of monitoring metrics processed per second from 600,000 to more than 2 million.

3. How to improve the reliability and O&M capabilities of large-scale clusters to keep them continuously in service

Platform reliability has always been the chief concern of the O&M department. When the cluster handles unified processing and analysis of an entire group's data, it must be online 24 hours a day; yet technology keeps evolving, so the platform must also support subsequent updates and upgrades to ensure the cluster can continue to evolve in the future.

In addition, as the cluster grows, shortage of data center space becomes prominent. Simply deploying one large cluster across data centers brings greater challenges in bandwidth load and reliability. How to achieve data-center-level reliability is therefore also crucial for an ultra-large-scale cluster.

The practice of optimizing an ultra-large-scale cluster

In response to the above challenges, FusionInsight MRS was systematically optimized in version 3.0. If the earlier jump from 500 to 5,000 nodes was achieved mainly through code-level optimization, the jump from 5,000 to 20,000 could not be: many problems required architecture-level optimization to solve.

1. A self-developed Superior scheduler to solve ultra-large-scale scheduling efficiency and mixed workloads in multi-tenant scenarios

FusionInsight introduces a data virtualization engine that provides interactive query capabilities on a unified large cluster, solving the query-efficiency problem for analysts. To support diverse workloads simultaneously on ultra-large clusters, the self-developed Superior scheduler allocates both reserved and shared resources to tenants: a tenant has exclusive use of its reserved resources while still benefiting from resource sharing. For more important businesses, a fixed group of machines can be dedicated to a tenant by binding a fixed resource pool, achieving physical isolation. Through the collaboration of the computing engine and the scheduling engine, data never leaves the lake, and a true closed business loop is achieved on one large platform.

On multi-tenancy, as tenants multiply, resource isolation between them becomes users' core demand. The Hadoop community provides queue-based computing resource isolation and quota-based storage thresholds, but when tasks or read/write operations land on the same host, they still compete for resources. For this scenario, MRS provides the following finer-grained isolation:

  • Tag storage: Tag the DataNodes that carry storage resources and specify tags when writing files, achieving maximum isolation of storage resources. This feature applies well to tiered hot/cold storage and heterogeneous hardware scenarios.
  • Multi-service: Deploy multiple services of the same type on different hosts within the same cluster, so that different applications use their own service resources without interfering with each other.
  • Multi-instance: Deploy multiple independent instances of the same service on the same hosts within a cluster, making full use of host resources without sharing them with other service instances, e.g., HBase multi-instance, Elasticsearch multi-instance, and Redis multi-instance.
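The reserved-plus-shared tenant model described above can be sketched as follows. This is a toy illustration only (Superior's real algorithm is not public, and all class and method names here are hypothetical): each tenant is guaranteed its reservation, and leftover capacity is shared, but no tenant's shared use may eat into another tenant's guarantee.

```python
# Toy model of reserved + shared resource allocation for tenants.
# All names are illustrative; Superior's real algorithm is not public.

class Pool:
    def __init__(self, total: int, reservations: dict[str, int]):
        assert sum(reservations.values()) <= total
        self.total = total
        self.reserved = dict(reservations)   # guaranteed per tenant
        self.used = {t: 0 for t in reservations}

    def shared_free(self) -> int:
        # Capacity nobody reserved, minus shared capacity already in use.
        shared_total = self.total - sum(self.reserved.values())
        shared_used = sum(max(0, u - self.reserved[t])
                          for t, u in self.used.items())
        return shared_total - shared_used

    def allocate(self, tenant: str, amount: int) -> bool:
        u = self.used[tenant]
        still_reserved = max(0, self.reserved[tenant] - u)
        from_shared = max(0, amount - still_reserved)
        if from_shared > self.shared_free():
            return False          # would eat into another tenant's guarantee
        self.used[tenant] = u + amount
        return True

pool = Pool(total=100, reservations={"etl": 40, "adhoc": 20})
assert pool.allocate("adhoc", 50)      # 20 reserved + 30 from shared
assert not pool.allocate("etl", 55)    # only 40 reserved + 10 shared remain
assert pool.allocate("etl", 50)        # fits exactly
```

Binding a fixed resource pool to a tenant, as the text mentions for important businesses, corresponds to a reservation equal to the whole pool, i.e. pure physical isolation.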

2. Tackling technical difficulties to break through bottlenecks in computing, storage, and management

For compute task scheduling, a patented scheduling algorithm upgrades one-dimensional scheduling to two-dimensional scheduling, achieving a severalfold efficiency improvement over the open-source scheduler. In a real large-scale production environment, comparing the self-developed Superior with the open-source Capacity scheduler:

  • In the case of synchronous scheduling, Superior is 30 times faster than Capacity
  • In the case of asynchronous scheduling, Superior is 2 times faster than Capacity

At the same time, through deep optimization for 20,000-node clusters, Superior in FusionInsight MRS 3.0 achieves a scheduling rate of 350,000 containers per second, far exceeding users' expectations for large-scale cluster scheduling, with cluster resource utilization above 98%, nearly double that of open-source Capacity, laying a solid foundation for stable commercial use of large-scale clusters.

The following figures show the "resource utilization" monitoring views under Superior and Capacity respectively: Superior's resource utilization is close to 100%, while under Capacity the resources cannot be fully utilized.

Superior resource utilization

Capacity resource utilization

In terms of storage, the Hadoop community's federation solution addresses HDFS's bottleneck in file object management, but introducing many different namespaces directly increases the complexity of developing, managing, and maintaining the upper-layer business. To solve this, the community introduced the Router Based Federation feature; however, because a Router layer is added in front of the NameNode interaction, performance drops.

In response to the above problems, FusionInsight MRS has optimized the product solution as follows:

  • By identifying the key bottlenecks in large-cluster production environments, FusionInsight MRS merges interactions within a single read/write flow and uses an improved compression algorithm for data communication, keeping the performance degradation within 4%.
  • To solve data imbalance between namespaces, FusionInsight MRS uses DataMovementTool to automatically rebalance data across namespaces, greatly reducing cluster maintenance costs.
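The balancing idea can be sketched as follows. This is a simplified illustration only (DataMovementTool's actual policy is not public): repeatedly move data from the fullest namespace toward the emptiest until all namespaces sit within a tolerance of the average.

```python
# Simplified sketch of balancing data volume across HDFS namespaces.
# DataMovementTool's real policy is not public; this only shows the idea.

def balance(usage: dict[str, float], tolerance: float = 0.05):
    """Return a list of (src, dst, amount) moves that evens out usage."""
    moves = []
    target = sum(usage.values()) / len(usage)   # ideal load per namespace
    usage = dict(usage)
    while True:
        src = max(usage, key=usage.get)         # fullest namespace
        dst = min(usage, key=usage.get)         # emptiest namespace
        if usage[src] - usage[dst] <= 2 * tolerance * target:
            return moves                        # close enough: done
        amount = min(usage[src] - target, target - usage[dst])
        usage[src] -= amount
        usage[dst] += amount
        moves.append((src, dst, amount))

# Usage in TB per namespace; ns1 sheds data until all sit near the 500 TB mean.
moves = balance({"ns1": 900.0, "ns2": 300.0, "ns3": 300.0})
```

In practice each "move" would be a bulk copy of directories between namespaces plus an update of the client-facing mount table, which is exactly the maintenance work the tool automates.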

As data volume grows, Hive metadata also hits a severe bottleneck with massive numbers of tables and partitions. The Hive community's Metastore Cache solution does not solve cache consistency across multiple Metastores, so it cannot be used commercially on large clusters. As an alternative, FusionInsight MRS introduces the distributed cache Redis and combines it with distributed locks, cache black/white lists, and cache lifecycle management to make the Metastore cache highly available.
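A minimal sketch of that caching pattern (a plain dict stands in for Redis, a `threading.Lock` for the distributed lock, and all names are illustrative): reads consult a blacklist before the cache, and writers invalidate under the lock so that multiple Metastore instances never serve stale metadata.

```python
import threading

# Toy sketch of a consistent metadata cache shared by several Metastore
# instances. A dict stands in for Redis; threading.Lock stands in for a
# distributed lock. All names are illustrative.

class MetaCache:
    def __init__(self, backing_db: dict, blacklist=()):
        self.db = backing_db              # authoritative metadata store
        self.cache = {}                   # shared cache (Redis stand-in)
        self.lock = threading.Lock()      # distributed-lock stand-in
        self.blacklist = set(blacklist)   # tables never cached (e.g. hot DDL)

    def get(self, table: str):
        if table in self.blacklist:
            return self.db[table]         # always read through, never cache
        if table not in self.cache:
            self.cache[table] = self.db[table]
        return self.cache[table]

    def update(self, table: str, meta):
        with self.lock:                   # serialize writers across instances
            self.db[table] = meta
            self.cache.pop(table, None)   # invalidate; next read refills

db = {"sales": "v1", "audit": "v1"}
mc = MetaCache(db, blacklist=["audit"])
assert mc.get("sales") == "v1"
mc.update("sales", "v2")
assert mc.get("sales") == "v2"            # invalidated, not stale
```

The black/white list keeps frequently mutated tables out of the cache entirely, which is cheaper than holding the lock for every read on them.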

In terms of operation and maintenance, when the cluster grew to 20,000 nodes, O&M pressure rose sharply:

  • The number of monitoring metrics the system collects per second grew from 600,000+ to 2 million+
  • Concurrent alarm processing grew from 200 to 1,000 items per second
  • The total number of configuration management entries grew from 500,000 to more than 2 million

The monitoring, alarm, configuration, and metadata storage modules in FusionInsight MRS's original active/standby architecture faced huge performance challenges from this surge in data volume. To solve the problem, the new version uses mature distributed components such as Flink, HBase, Hadoop, and Elasticsearch, replacing the original tightly coupled active/standby model with a flexible, scalable distributed model. This solved the O&M management problems and laid a foundation for subsequent secondary value mining of O&M data.
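The core of that shift, from one active master to a scale-out pipeline, can be sketched as hash-partitioning the metric stream across workers (a toy illustration only; the product builds this on Flink, HBase, and Elasticsearch):

```python
from collections import defaultdict

# Toy sketch: shard a metric stream across N workers by hashing the metric
# key, so ingest capacity scales by adding workers instead of being capped
# by one active master. Illustrative only.

def route(metrics, n_workers: int):
    """Partition (name, value) pairs into per-worker shards."""
    shards = defaultdict(list)
    for name, value in metrics:
        shards[hash(name) % n_workers].append((name, value))
    return shards

stream = [(f"node{i % 100}.cpu", i) for i in range(2000)]
shards = route(stream, n_workers=8)

# Every metric lands on exactly one shard, and the same metric name always
# routes to the same worker (needed for per-metric aggregation).
assert sum(len(v) for v in shards.values()) == 2000
```

Because a metric's shard depends only on its name, each worker can aggregate its own metrics independently, and doubling the workers roughly doubles ingest capacity.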

3. Rolling upgrades/patches, task-level "resumable execution", and cross-AZ high availability to ensure continuous, stable platform operation

Rolling upgrade/patch: FusionInsight has supported rolling upgrades since version 2.7, making platform upgrades and patches invisible to the business. Over time, however, the community has not supported rolling upgrades across major versions: the jump from Hadoop 2 to Hadoop 3, for example, meant that many large clusters had to stay on the old version, which is unacceptable for the business. Through compatibility processing of community interfaces, FusionInsight MRS achieved rolling upgrades across Hadoop major versions and completed a rolling upgrade of a 10,000+ node cluster in Q2 2020. Among FusionInsight customers, rolling upgrade has become a required capability for 500+ clusters.

Task-level "resumable execution": On large clusters, some continuously running large tasks contain hundreds of thousands of containers and run for a long time. Once an individual failure occurs mid-run, the whole task may have to rerun, wasting a great deal of computing resources.

Cross-AZ high availability: FusionInsight MRS also provides multiple mechanisms to keep data and tasks reliable across availability zones (AZs), such as:

  • Storage provides an AZ-aware file placement policy: a file and its replicas are placed in different AZs. Reads and writes first look for replicas within the client's own AZ; only in the extreme scenario of an AZ failure is there cross-AZ read/write traffic.
  • Computing provides an AZ-aware task scheduling mechanism that places all of a user's submitted tasks within the same AZ, avoiding network consumption between the computing units of one task across AZs.

With this block placement policy and localized task scheduling, high availability across AZs can be achieved within a single cluster: when one AZ fails, core data and computing tasks are unaffected.
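The AZ-aware read path described above can be sketched as follows (function and AZ names are illustrative): spread replicas over distinct AZs, prefer a replica in the client's own AZ, and fall back to another AZ only when no local replica survives.

```python
# Toy sketch of AZ-aware replica placement and selection: a client reads
# from a replica in its own AZ when one exists, so cross-AZ traffic occurs
# only when the local AZ has failed. Names are illustrative.

def place_replicas(azs: list[str], n_replicas: int = 3) -> list[str]:
    """Spread replicas over distinct AZs, round-robin."""
    return [azs[i % len(azs)] for i in range(n_replicas)]

def pick_replica(replica_azs: list[str], client_az: str) -> str:
    """Prefer a local-AZ replica; cross-AZ read only as a fallback."""
    local = [az for az in replica_azs if az == client_az]
    return local[0] if local else replica_azs[0]

replicas = place_replicas(["az1", "az2", "az3"])   # one copy per AZ
assert pick_replica(replicas, "az2") == "az2"      # local read, no AZ traffic
# If az2 is lost entirely, the client falls back to a remote replica:
assert pick_replica([az for az in replicas if az != "az2"], "az2") == "az1"
```

The compute-side analogue is the same preference applied to task placement: schedule a job's containers into one AZ so shuffle traffic stays AZ-local.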

Conclusion

In July 2020, a FusionInsight MRS single cluster of 21,000 nodes was awarded the Big Data Product Capability Evaluation Certificate by the China Academy of Information and Communications Technology, becoming the industry's first commercial big data platform to exceed 20,000 nodes in a single cluster and setting a new industry benchmark. Going forward, FusionInsight MRS will continue to deepen its exploration of big data technology. Building on large-cluster technology, it will further separate storage from compute and, through unified metadata and security management, separate data (data plus metadata) from compute. This enables data sharing at a larger scale: one copy of data serving multiple computing clusters that can be flexibly deployed and elastically scaled. With this smoothly scalable architecture, it can support clusters of 100,000 nodes or even more, continuously meeting the core demand for multi-scenario fusion in enterprise big data applications.

Future architecture evolution direction

For more than ten years, FusionInsight has been committed to building enterprise-grade intelligent data lakes for 3,000+ government and enterprise customers in 60+ countries and regions, combining a platform-plus-ecosystem strategy with 800+ business partners. It is widely used in the digital transformation of finance, carriers, government, energy, healthcare, manufacturing, transportation, and other industries, releasing the value of data and helping government and enterprise customers grow rapidly. MRS is rooted in the open big data ecosystem with enterprise-grade capabilities layered on top; it remains open while providing customers with an enterprise-grade integrated big data platform that enables T+0 data ingestion into the lake and one-stop integrated analysis, letting the data speak with "wisdom".

 


Origin blog.csdn.net/devcloud/article/details/108616506