NetEase Interactive Entertainment Goes Overseas: Cloud Architecture Design and Practice for the Big Data Platform

At the beginning of 2020, driven by the growth of NetEase Interactive Entertainment's overseas business and the need for overseas data compliance, we started moving the NetEase Interactive Entertainment big data offline computing platform overseas. In the early stage, we adopted a solution of bare-metal cloud hosts plus high-performance EBS block storage. However, the storage cost of this solution was high, dozens of times that of our domestic self-built data centers.

Therefore, we decided to build a platform on the public cloud. The platform not only had to fit current business scenarios and stay compatible with historical business, but also had to be more economical than the public cloud's managed EMR offering. We carried out cost optimization mainly in three areas: storage, computing, and tiered data lifecycle management. The specific optimizations are introduced in detail below.

In the end, this project provided complete Hadoop compatibility for the downstream data business and analysis departments, avoiding rebuilding all business logic from scratch. It saved a great deal of cost for the game data business going overseas: storage cost is 50% of what it was before optimization, total computing cost is 40% of what it was before optimization, and the cost of cold data is 33% of the optimized online storage cost. In the future, as business volume grows, the cost savings are expected to grow roughly tenfold, providing strong support for data-driven operations overseas.

01. Design of the big data platform's overseas cloud solution

In 2020, we embarked on an urgent mission to go overseas. In China, our business had always been deployed and operated on self-built clusters. In order to go live overseas quickly, we rushed out a solution identical to the domestic clusters: an integrated storage-compute system built on physical nodes. We chose the m5.metal bare-metal server and used EBS gp3 as storage.

The disadvantage of this solution is its very high cost, but its advantage is that it solved a very painful problem: we needed to be compatible with all historical business and ensure it could run overseas quickly and immediately. Our upstream and downstream businesses could be migrated overseas seamlessly, supporting the scheduling of close to 300,000 jobs per day.

However, cost was always a problem that could not be ignored, so we needed to re-select the solution to get better performance at lower cost while ensuring compatibility. Based on business requirements and the characteristics of big data scenarios, we evaluated candidate solutions along the following directions:

  • Trade time/space for performance;
  • Deployment optimization based on business scenarios;
  • Add middleware to achieve compatibility integration;
  • Make full use of the characteristics of cloud resources to optimize costs.

Hadoop on the cloud

Generally, there are two common solutions for migrating Hadoop to the cloud: EMR + EMRFS and Dataproc + GCS; these are the usual routes for going overseas. Another option is to use cloud-native platforms such as BigQuery, Snowflake, or Redshift as the data query solution. We did not adopt any of these.

Why we did not use EMR

All of our businesses rely heavily on Hadoop, and the Hadoop version we currently use is an internal build customized to business needs, with backward compatibility for features from various newer versions, which EMR cannot cover. As for cloud-native solutions such as BigQuery, they represent a much bigger change and are further away from the business.

Why we did not use S3 storage directly

  1. Because of the high security requirements of our data business, we have a complex permission design that far exceeds what Amazon IAM (Identity and Access Management) roles can express.

  2. S3's performance is limited, and optimizations such as bucketing and randomized directory prefixes are required, which are not transparent to the business. Adjusting directory prefixes to fit S3 partitioning, or using more buckets, would force the business to change its existing usage patterns and cannot be adapted to our current directory design. In addition, as a file system implemented on object storage, S3 makes operations such as list and du on directories essentially unusable when there are very large numbers of files, and these are exactly the operations used heavily in big data scenarios.

Storage Selection: HDFS vs Object Storage vs JuiceFS

We evaluated storage components mainly along the following dimensions.

Business compatibility: For our situation, with a large amount of existing business that needs to go overseas, compatibility is a critical consideration. In addition, cost reduction and efficiency improvement mean not only lower storage cost but also resource and labor costs. On compatibility, the JuiceFS Community Edition works with the Hadoop ecosystem, but it requires deploying the JuiceFS Hadoop SDK on the client side.

Consistency: We researched S3 at the time; before the first quarter of 2020 it had not achieved strong consistency, and even now not all platforms provide it.

Capacity management: An important issue for our self-built clusters is the need to reserve resources; in other words, we can never use 100% of them, so on-demand usage is a very cost-effective direction.

Performance: Building on HDFS, we can reach the performance level of our domestic self-built HDFS. The SLA we provide to the business in China is p90 RPC latency within 10 milliseconds at 40,000 QPS on a single cluster. For something like S3, it is very difficult to reach that level of performance.

Permissions and authentication: In self-built clusters, Kerberos and Ranger handle authentication and permission management. S3 did not support them at the time, and the JuiceFS Community Edition does not support them either.

Data reliability: HDFS uses three replicas to ensure data reliability. When we tested JuiceFS, its metadata engine was Redis, and we found that in high-availability mode a master-node failover would cause storage to stall, which was very hard for us to accept. We therefore deploy an independent Redis metadata service on each machine; the details are expanded below.

Cost: Solutions based on block devices are expensive. Our goal was to use S3; if we could use S3 alone, the cost would of course be the lowest. With JuiceFS, the resulting architecture carries some additional cost, which is why we explain later that its cost is not the lowest.

02. Multi-cloud migration plan for Hadoop going overseas

Storage layer: storage-compute separation with Hadoop + JuiceFS + S3

Combining JuiceFS with Hadoop reduces the cost of business compatibility and lets existing business go overseas quickly. Many users who adopt JuiceFS do so through the SDK with the open-source Hadoop distribution. However, that approach has a permission-authentication problem: the JuiceFS Community Edition does not support Ranger or Kerberos. We therefore kept the entire Hadoop framework. The maintenance cost may seem high, but since we already maintain a set of self-built components in China, the extra cost for us is almost zero. As shown in the figure below, we mount JuiceFS into Hadoop via FUSE and use S3 as the underlying storage.
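
A minimal sketch of this mount, assuming the JuiceFS Community Edition CLI with a per-node Redis metadata service and an S3 bucket; the bucket name, Redis address, cache directory, and sizes are illustrative rather than our production values:

```python
# Format a JuiceFS volume backed by S3 and mount it via FUSE so the DataNode
# can use the mount point like a local directory. Credentials are taken from
# the environment; all names below are placeholders.
import subprocess

META_URL = "redis://127.0.0.1:6379/1"        # per-node Redis metadata service
MOUNT_POINT = "/jfs"                          # exposed to the HDFS DataNode
CACHE_DIR = "/mnt/gp3-cache/jfscache"         # small gp3 volume used as cache

subprocess.run([
    "juicefs", "format",
    "--storage", "s3",
    "--bucket", "https://example-dn-bucket.s3.ap-southeast-1.amazonaws.com",
    META_URL, "dn-volume",
], check=True)

subprocess.run([
    "juicefs", "mount", "-d",
    "--cache-dir", CACHE_DIR,
    "--cache-size", "102400",                 # MiB of local cache, illustrative
    META_URL, MOUNT_POINT,
], check=True)
```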

Let's briefly compare with the performance of our EBS-based self-built single cluster:

  • At 40,000 QPS, p90 RPC latency reaches 10 ms;
  • A single node can sustain 30,000 IOPS.

When we first went to the cloud, we used HDD-backed volumes, specifically the st1 type. But we soon found that with a small number of nodes the actual IOPS fell far short of our requirements, so we decided to upgrade all st1 volumes to gp3.

Each gp3 volume provides about 3,000 IOPS by default. To improve performance, we mounted 10 gp3 volumes per node, for a total of 30,000 IOPS. This lets the system meet its IOPS requirements and removes the performance bottleneck when the node count is small. The high performance and flexibility of gp3 made it an ideal answer to our IOPS problem.
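
As a rough illustration, the per-node volume layout above could be provisioned with a few boto3 calls; this is a sketch with a placeholder instance ID, region, sizes, and device names, not our actual provisioning tooling:

```python
# Create and attach ten gp3 volumes to one node; at the 3,000 IOPS baseline
# this adds up to roughly 30,000 IOPS per node.
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-1")

INSTANCE_ID = "i-0123456789abcdef0"            # hypothetical node
AZ = "ap-southeast-1a"
DEVICES = [f"/dev/sd{c}" for c in "fghijklmno"]  # ten device names

for device in DEVICES:
    vol = ec2.create_volume(
        AvailabilityZone=AZ,
        Size=500,              # GiB per volume, illustrative
        VolumeType="gp3",
        Iops=3000,             # gp3 baseline IOPS
        Throughput=125,        # MiB/s baseline
    )
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
    ec2.attach_volume(VolumeId=vol["VolumeId"], InstanceId=INSTANCE_ID, Device=device)
```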

The default bandwidth per node is currently 10 Gb/s, although bandwidth differs across instance types; we used 30,000 IOPS per node with 10 Gb/s bandwidth as the benchmark. Our goal was to integrate S3 storage, that is, to keep storage cost in check while maintaining high performance, with the data ultimately landing on S3.

Most important of all is compatibility with Hadoop access: every business should be able to move to the cloud without any modification, which solves the compatibility problem directly. Rebuilding some historical business might have business value, but we had to weigh the cost of business transformation against the cost of platform compatibility. In our scenario, the labor cost of refactoring all historical business currently exceeds the cost of platform compatibility, and it could not be completed in a short time.

The way we mount JuiceFS may differ from the official documentation: we deploy JuiceFS and Redis locally on each machine (as shown in the figure below). This maximizes JuiceFS performance and minimizes metadata overhead. We tried a Redis cluster and a TiDB cluster, but found that metadata performance was several orders of magnitude worse, so we chose local deployment from the beginning.

Another benefit is that our setup is tied to the DNO (DataNode Object). We can control the number of files per DNO, that is, per node, and keep it at a reasonable level. For example, a DNO is capped at roughly 3 to 8 million metadata entries, so a single metadata node takes about 20 GB. This means we don't have to worry much about scaling it: a large distributed Redis requirement becomes a per-node Redis requirement with a controllable amount of metadata. Stability remains a concern, though; if a single node has a stability problem, we face the risk of data loss.
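
A minimal monitoring sketch against those per-node limits (roughly 3 to 8 million entries, about 20 GB); the thresholds and the alerting hook are illustrative rather than our actual tooling:

```python
# Check the local Redis metadata service of one DNO against its caps.
import redis

MAX_KEYS = 8_000_000            # upper bound on metadata entries per DNO
MAX_MEMORY = 20 * 1024 ** 3     # ~20 GB per local metadata node

def check_local_metadata(host: str = "127.0.0.1", port: int = 6379) -> None:
    r = redis.Redis(host=host, port=port)
    keys = r.dbsize()
    used = r.info("memory")["used_memory"]
    if keys > MAX_KEYS or used > MAX_MEMORY:
        # In practice this would page the on-call team and stop placing new
        # blocks on this DataNode; here we only print.
        print(f"metadata over limit: keys={keys}, used_memory={used}")

if __name__ == "__main__":
    check_local_metadata()
```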

To handle single-node failures, we bind to the DNO and rely on HDFS's multi-replica mechanism. Our clusters run two replica modes: three replicas and EC (Erasure Coding). In either mode, high data reliability comes from the replica mechanism: under the multi-replica deployment scheme, even if a node fails completely, we can simply remove it without affecting overall operation or data reliability.
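
For reference, a sketch of how the roughly 1.5-copy EC mode can be enabled on a directory with Hadoop 3's erasure-coding CLI; the path is illustrative, and the built-in RS-6-3-1024k policy (6 data + 3 parity blocks) is one way to get the 1.5x overhead mentioned above:

```python
# Enable an erasure-coding policy and apply it to a warm-data directory.
# Files written under the path are then stored at ~1.5x overhead instead of
# three full replicas.
import subprocess

WARM_DIR = "/warehouse/warm"   # hypothetical warm-data directory

subprocess.run(["hdfs", "ec", "-enablePolicy", "-policy", "RS-6-3-1024k"], check=True)
subprocess.run(["hdfs", "ec", "-setPolicy", "-path", WARM_DIR,
                "-policy", "RS-6-3-1024k"], check=True)
```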

In practice, deploying JuiceFS with a single-node Redis locally on each machine is how we get the best performance, because we need to benchmark against the HDFS-on-EBS solution.

We achieved high-performance HDFS through HDFS-based horizontal scaling plus JuiceFS cache and read/write strategy optimization. The optimizations are as follows:

  1. Use JuiceFS to replace the gp3 data directory, with a small gp3 volume as the JuiceFS cache directory, reaching IOPS on par with gp3;
  2. Optimize the JuiceFS cache mechanism with customizations such as asynchronous deletion, asynchronous merge-upload, and pre-partitioned S3 directory TPS, reducing how often requests fall through to S3 so that low-cost S3 storage can replace gp3;
  3. Scale nodes horizontally through the HDFS cluster's distributed architecture;
  4. Use Hadoop's heterogeneous storage feature to split I/O by business characteristics and optimize performance and cost. We split HDFS storage into two types, "DISK" and "SSD": the "SSD" type corresponds to the hybrid storage that combines JuiceFS's EBS cache with S3, while the "DISK" type is configured to write to the DataNode's EBS-backed directories. Directories that are frequently overwritten, such as stage directories, are set to use DISK, because EBS suits frequent rewrites, incurs less per-operation cost than S3, and the total space these directories need is controllable, so we reserve a small amount of EBS for this scenario (see the sketch after this list).
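
A minimal configuration sketch for item 4, with illustrative paths: it tags the JuiceFS mount as SSD and the EBS directory as DISK in the DataNode configuration, and pins a stage directory to DISK via the built-in HOT storage policy (all replicas on DISK):

```python
# Heterogeneous-storage sketch: the tagged value below would go into
# dfs.datanode.data.dir in hdfs-site.xml; the storage policy is applied with
# the standard HDFS CLI.
import subprocess

# JuiceFS mount tagged as SSD, EBS-backed directory tagged as DISK.
DATA_DIRS = "[SSD]/jfs/dn-data,[DISK]/mnt/ebs/dn-data"
print(f"dfs.datanode.data.dir = {DATA_DIRS}")

# Pin a frequently rewritten stage directory to EBS-backed DISK storage.
subprocess.run(
    ["hdfs", "storagepolicies", "-setStoragePolicy",
     "-path", "/user/stage", "-policy", "HOT"],
    check=True,
)
```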

Computing layer: a mixed deployment of spot and on-demand nodes

First, when we moved the domestic self-built YARN cluster to the cloud unchanged, it could not take advantage of cloud resource characteristics for cost optimization. We therefore combined a YARN-based intelligent dynamic-scaling solution with label scheduling and adopted a mixed deployment of spot and on-demand nodes to optimize the use of computing resources.

  1. Adjust the scheduler strategy to capacity scheduling (CapacityScheduler);
  2. Divide nodes into on-demand partitions and spot partitions;
  3. Place stateful nodes in the on-demand partition, so that tasks with different states run in different partitions;
  4. Use on-demand nodes as the guaranteed baseline capacity;
  5. Reclamation notification and GracefulStop: before a spot node is reclaimed, it receives advance notice and calls GracefulStop to wind down the workload, avoiding direct failure of users' jobs (see the sketch after this list);
  6. Spark + RSS (remote shuffle service) reduces the probability that a job has to be recomputed because its shuffle data sat on a reclaimed dynamic node.
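
A minimal sketch of item 5, assuming AWS spot instances: poll the instance metadata service for the two-minute interruption notice and then trigger a graceful decommission. The decommission command and timeout are illustrative, and in a real setup the node would already have to be listed in YARN's exclude file:

```python
# Watch for the spot interruption notice and gracefully stop the NodeManager.
import subprocess
import time

import requests

SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def wait_for_interruption(poll_seconds: int = 5) -> dict:
    """Block until AWS posts a spot interruption notice for this node."""
    while True:
        resp = requests.get(SPOT_ACTION_URL, timeout=2)
        if resp.status_code == 200:      # 404 until an interruption is scheduled
            return resp.json()
        time.sleep(poll_seconds)

def graceful_stop(timeout_seconds: int = 110) -> None:
    """Ask YARN to gracefully decommission this NodeManager before reclamation."""
    subprocess.run(
        ["yarn", "rmadmin", "-refreshNodes", "-g", str(timeout_seconds), "-client"],
        check=True,
    )

if __name__ == "__main__":
    notice = wait_for_interruption()
    print(f"spot interruption scheduled: {notice}")
    graceful_stop()
```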

  1. Based on our business needs, we built an intelligent dynamic-scaling solution. Compared with native solutions, we focus more on scaling driven by the state of the business, because the cloud vendor cannot know where the business hot spots are.

  2. Intelligent scaling is based on periodic forecasting by our internal operations tool Smarttool. We take the previous three weeks of historical data, fit it, and obtain the residual sequence resid and the predicted value ymean from Smarttool's preset algorithm. This predicts what resource usage should look like at each point of the day, and we then scale dynamically based on the prediction (a simplified sketch follows this list).

  3. Scheduled scaling based on time rules, such as pre-scaling for specific times: pre-setting capacity for the monthly report run on the 1st of each month, big promotions, and so on.

  4. Dynamic scaling based on utilization: if usage stays above the upper threshold or below the lower threshold for a certain period, automatic expansion or contraction is triggered to absorb unexpected demand. The aim is to give the business stable yet relatively low-cost compute resources on the cloud.
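
A simplified stand-in for the internal Smarttool forecast in item 2 (assumptions: hourly usage samples, three weeks of history, a plain per-hour-of-week mean as the fit), producing a predicted series ymean, a residual series resid, and a threshold decision like item 4:

```python
# Toy periodic forecast plus threshold-based scaling decision.
from collections import defaultdict
from statistics import mean
from typing import List, Tuple

HOURS_PER_WEEK = 7 * 24

def fit_weekly(history: List[float]) -> Tuple[List[float], List[float]]:
    """history: last 3 weeks of hourly utilization (0..1). Returns (ymean, resid)."""
    buckets = defaultdict(list)
    for i, value in enumerate(history):
        buckets[i % HOURS_PER_WEEK].append(value)
    ymean = [mean(buckets[h]) for h in range(HOURS_PER_WEEK)]
    resid = [value - ymean[i % HOURS_PER_WEEK] for i, value in enumerate(history)]
    return ymean, resid

def target_nodes(predicted: float, current_nodes: int,
                 upper: float = 0.85, lower: float = 0.40) -> int:
    """Scale out above the upper threshold, scale in below the lower one."""
    if predicted > upper:
        return int(current_nodes * predicted / upper) + 1
    if predicted < lower:
        return max(1, int(current_nodes * predicted / lower))
    return current_nodes
```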

Lifecycle management: Data tiering for storage cost optimization

In fact, we integrated JuiceFS with S3 and rely on a replica mechanism for data reliability. Whether we use three replicas or 1.5-copy EC, there is additional storage cost, but once data passes a certain point in its lifecycle, its I/O demand is usually no longer high. We therefore introduced a single-copy Alluxio + S3 layer to hold this data. Note that if the directory structure stays unchanged, this layer's performance is much worse than JuiceFS; even so, that performance is acceptable for cold-data scenarios.
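
A minimal sketch of attaching an S3 prefix to this single-copy cold tier through the Alluxio CLI; the paths, bucket name, and credential option keys are illustrative and vary across Alluxio versions:

```python
# Mount an S3 prefix into Alluxio as the cold-data tier.
import subprocess

subprocess.run(
    ["alluxio", "fs", "mount",
     "--option", "s3a.accessKeyId=<ACCESS_KEY>",
     "--option", "s3a.secretKey=<SECRET_KEY>",
     "/cold", "s3://example-cold-bucket/warehouse/"],
    check=True,
)
```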

We therefore built our own data governance and tiering service, which manages data across its lifecycle and optimizes cost through asynchronous processing. We call this data lifecycle management tool BTS.

BTS is designed around our file database, metadata, and audit-log data, and manages the data lifecycle through tables and their access heat. Users can define custom rules in the upper-layer DAYU RuleManager or let rules be generated from data popularity; these rules specify which data counts as cold and which as hot.

Based on these rules, we perform lifecycle operations such as compression, merging, conversion, archiving, or deletion on the data and dispatch them to the scheduler for execution. BTS provides the following capabilities:

  • Data reorganization: merging small files into large files to improve EC storage efficiency and reduce NameNode pressure;
  • Conversion of table storage format and compression: asynchronously converting tables from the Text storage format to ORC or Parquet and the compression codec from None or Snappy to ZSTD, improving storage and performance efficiency; BTS supports converting tables asynchronously, partition by partition (see the sketch after this list);
  • Heterogeneous data migration: asynchronously migrating data between storage tiers of different architectures, providing the organizational capability for data tiering.
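
A minimal PySpark sketch of the per-partition format conversion described above (Text/Snappy to Parquet with ZSTD); the table name, partition, and output path are hypothetical, and ZSTD output requires a Spark/Hadoop build with ZSTD support. BTS drives this kind of job asynchronously:

```python
# Rewrite one partition of a text-format Hive table as ZSTD-compressed Parquet.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bts-partition-conversion")
         .enableHiveSupport()
         .getOrCreate())

TABLE = "ods.game_login_log"      # hypothetical source table
PARTITION = "dt='2023-01-01'"     # one partition converted per job

df = spark.sql(f"SELECT * FROM {TABLE} WHERE {PARTITION}")

(df.write
   .mode("overwrite")
   .option("compression", "zstd")
   .parquet("s3://example-warehouse/converted/game_login_log/dt=2023-01-01/"))
```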

We divide the tiered storage architecture into three simple layers:

  • The best-performing tier is HDFS on JuiceFS (hot), with 3 replicas;
  • Next is HDFS on JuiceFS in EC mode (warm), at roughly 1.5 copies;
  • Then Alluxio on S3 (low-frequency cold data), with a single copy;
  • Before data is finally deleted, it is archived to Alluxio on S3 as a single copy.

Currently, the effects of data lifecycle governance are as follows:

  • 60% cold, 30% warm, 10% hot;
  • The average number of replicas is about (70% × 1 + 20% × 1.5 + 10% × 3) = 1.3: roughly 70% of the data sits in the archived single-copy tier, which does not require high performance, about 20% uses EC replicas, and about 10% uses three replicas, keeping the overall average at about 1.3 replicas.

03. Online results of the new overseas architecture: performance and cost

In tests, JuiceFS achieved fairly high bandwidth for reading and writing large files; in particular, with multiple threads, large-file read bandwidth approached the limit of the client's network card.

In the small-file scenario, random-write IOPS is better (thanks to the gp3 disk used as cache), while random-read IOPS is relatively low, about 5x worse.

Comparing the EBS solution with the JuiceFS + S3 solution on real business workloads, with production business SQL as the test cases, JuiceFS + S3 is essentially on par with EBS, and some SQL even runs faster. So JuiceFS + S3 can fully replace EBS.

By replacing the original EBS solution with the JuiceFS-based S3 + EBS hybrid tiered storage-compute separation solution, and applying data governance and data tiering, the original three-replica Hadoop setup was reduced to an average of 1.3 replicas, saving 55% of the replica cost; overall storage cost dropped by 72.5%.

Through intelligent dynamic scaling, with 85% cluster utilization and spot instances replacing 95% of on-demand nodes, the overall computing cost improved by more than 80% compared with before optimization.

04. Summary and Outlook: Towards Cloud Native

Compared with a plain JuiceFS solution, Hadoop + JuiceFS uses additional replicas to optimize storage performance and to keep compatibility and high availability; as JuiceFS continues to iterate on reliability, the DataNode can move toward writing only a single copy.

Although we have implemented a multi-cloud-compatible solution on different clouds that outperforms EMR, hybrid multi-cloud and cloud-native solutions still need more iteration.

As for the future of cloud-native big data, the solution we currently run is not the final form but a transitional one aimed at solving compatibility and cost issues. Going forward, we plan to take the following measures:

  1. Promote migrating the business to more cloud-native solutions, decoupling from the Hadoop environment and integrating the data lake closely with cloud computing.
  2. Drive higher-level hybrid multi-cloud computing and hybrid storage solutions, achieving true integration rather than mere compatibility, which will bring more value and flexibility to upper-level business units.

We hope this content is helpful to you. If you have any other questions, please join the JuiceFS community and discuss with everyone.
