ByteDance MapReduce-Spark Smooth Migration Practice

Abstract: This article is based on the talk "ByteDance MapReduce - Spark Smooth Migration Practice" given by Wei Zhongjia, a ByteDance infrastructure engineer, at CommunityOverCode Asia 2023.
As ByteDance's business has grown, the company now runs more than one million Spark jobs online every day. In contrast, roughly 20,000 to 30,000 MapReduce jobs still run online daily. From the perspective of both big data developers and users, operating, maintaining, and using the MapReduce engine brings a series of problems. Against this background, the ByteDance Batch team designed and implemented a solution for smoothly migrating MapReduce jobs to Spark. The solution lets users complete the migration from MapReduce to Spark by adding only a small number of parameters or environment variables to existing jobs, greatly reducing migration costs and achieving good cost benefits.

Background introduction

Over the past year, the number of ByteDance Spark jobs has grown sharply from 1 million to 1.5 million per day, and daily Flink Batch jobs have increased from 200,000 to 250,000, while MapReduce usage has been in slow decline, dropping from about 14,000 to roughly 10,000 jobs per day. Given this trend, MapReduce, the long-standing batch processing framework we have been using, has completed its historical mission and will soon be taken offline.
Before officially taking it offline, we first collected statistics on the business lines and maintenance methods of MapReduce jobs.
The pie chart on the left shows the breakdown by business line. The largest share is Hadoop Streaming jobs, which account for almost 45% of all jobs; the second largest is Druid jobs at 24%, and the third is DistCp at 22%. DistCp and Hadoop Streaming are not broken down further by business line here because these two kinds of jobs use exactly the same code and can be treated as a single job type when promoting the upgrade.
The pie chart on the right shows the breakdown by maintenance method. The largest share is "Others" at 60%, meaning jobs that are not managed by any platform within ByteDance. This matches the characteristics of MapReduce well: it is a framework with a long history, and when many MapReduce jobs first went online these platforms did not yet exist, so most of the jobs are submitted directly from containers managed by the users themselves or from physical machines that can reach the YARN cluster.
 

Why we need to promote MapReduce migration to Spark

There are three reasons for pushing MapReduce offline:
The first reason is that MapReduce's execution model puts too much pressure on the throughput of the compute scheduler. In MapReduce, each Task corresponds to one Container, and when the Task finishes, the Container is released. This model is not a problem on YARN, whose throughput is very high. However, when we migrated internal business from YARN to K8s clusters, we found that MapReduce jobs often triggered API Server alarms and affected the stability of the K8s cluster. Running a single MapReduce job to completion often requires requesting more than 100,000 Pods, while a Spark job of the same scale may need only a few thousand, because Spark has another layer of scheduling inside the job: a Container that Spark requests as an Executor is not released after running one Task; instead, the Spark framework schedules new Tasks onto it for reuse.
The second reason is that MapReduce's Shuffle performance is very poor. The MapReduce we use internally is based on community version 2.6, and the Netty framework its Shuffle implementation depends on is about ten years old, a major version behind current Netty. In actual use its performance is noticeably poor, and it also creates too many connections on physical machines, affecting their stability.
The third reason is from the perspective of development engineers. We have many company-wide horizontal projects, such as the K8s migration just mentioned and IPv6 adaptation. The cost of adapting MapReduce is about the same as adapting Spark, but the number of MapReduce jobs is now only 1% of Spark's. Not only is the ROI of such adaptation very low, but without it we would still have to spend a lot of effort maintaining the History Server and Shuffle Service for MapReduce jobs. It is therefore necessary to push the migration from MapReduce to Spark.

Difficulties in upgrading to Spark

First, although the proportion of remaining MapReduce jobs is very low, currently only a little more than 10,000 per day, the absolute number is still large and involves many business parties. Many of these jobs have been running for a very long time, some for four or five years, which makes it very difficult to get users to upgrade on their own initiative.
Second, in terms of feasibility, more than half of the jobs are Hadoop Streaming jobs that run Shell, Python, and even C++ programs. Although Spark has a Pipe operator, migrating existing jobs onto the Spark Pipe operator still requires a lot of work from users.
Finally, even when users are willing to start the migration, they face many other problems. Besides the main computing logic, there are many peripheral tools that also need to be migrated; during migration, users must work out how to convert certain MapReduce parameters into equivalent Spark parameters, and how to reproduce in Spark the environment variable injection that Hadoop Streaming job scripts depend on. If these problems are left to users, not only is the workload heavy, but the failure rate is also high.
 

Overall solution

Design goals

The previous sections laid out the current situation, motivations, and difficulties. Based on this, the goals of the upgrade are:
  • Users should not need to make any code-level modifications; they should be able to complete the upgrade without touching their code, only by adding a few job parameters.
  • All types of jobs must be supported, including Hadoop Streaming, DistCp, and ordinary user-written Java jobs. Hadoop Streaming uses the old MapReduce API, while DistCp uses the new API, which means the upgrade solution needs to support all MapReduce jobs.
 

Solution breakdown

The overall solution breaks down into four parts:
  • Computing process adaptation: align MapReduce's computing logic with Spark's.
  • Configuration adaptation: automatically convert MapReduce parameters into Spark parameters for users.
  • Submission-side adaptation: the key to a truly smooth migration, so that users can complete the upgrade without modifying their submission commands.
  • Supporting tools: help users verify data correctness.
 

Computing process adaptation

The figure is taken from the original MapReduce paper (https://static.googleusercontent.com/media/research.google.com/zh-CN//archive/mapreduce-osdi04.pdf). A classic MapReduce job consists of five steps:
The first step processes and splits the input data; the second runs the user-provided Map code; the third performs the Shuffle; the fourth runs the user-provided Reduce code; the fifth writes the results of the Reduce processing to HDFS. There is also another very common MapReduce usage pattern, Map Only, which skips the two middle steps shown in the figure below.
Readers familiar with Spark will know that the entire MapReduce process can be understood as a subset of Spark, or rather as a Spark job with a specific, fixed computation flow. The pseudocode listed in the figure corresponds exactly to the MapReduce process.
The first step creates a HadoopRDD; since HadoopRDD itself relies on Hadoop's own InputFormat code, this part is fully compatible. The second step calls Spark's map operator and, inside it, invokes the user's Map function. The third step, for generality of the migration, uniformly uses the repartitionAndSortWithinPartitions method, which corresponds exactly to the Shuffle phase in MapReduce. The fourth step uses the map operator again to execute the user-provided Reduce code. The fifth step, saveAsHadoopFile, corresponds to the final write-out phase in MapReduce.
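To make that correspondence concrete, here is a minimal Scala sketch of the five-step mapping, assuming a word-count style job; the paths and the userMap/userReduce placeholders are illustrative and not the actual ByteDance adapter code.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat, TextOutputFormat}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object MapReduceOnSparkSketch {
  // Hypothetical stand-ins for the user's Map and Reduce logic (here: word count).
  def userMap(offset: LongWritable, line: Text): Iterator[(String, String)] =
    line.toString.split("\\s+").iterator.filter(_.nonEmpty).map(word => (word, "1"))

  def userReduce(records: Iterator[(String, String)]): Iterator[(String, String)] =
    records.toSeq.groupBy(_._1).iterator.map { case (key, values) => (key, values.size.toString) }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mr-on-spark-sketch"))
    val jobConf = new JobConf(sc.hadoopConfiguration)
    FileInputFormat.setInputPaths(jobConf, "hdfs:///path/to/input")
    val numReducers = 10

    // Step 1: read and split the input through Hadoop's own InputFormat (a HadoopRDD underneath).
    val input = sc.hadoopRDD(jobConf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

    // Step 2: run the user's Map code inside Spark's map/flatMap operator.
    val mapped = input.flatMap { case (key, value) => userMap(key, value) }

    // Step 3: Shuffle; repartitionAndSortWithinPartitions mirrors MapReduce's partition + sort.
    val shuffled = mapped.repartitionAndSortWithinPartitions(new HashPartitioner(numReducers))

    // Step 4: run the user's Reduce code, again via a map-style operator over each sorted partition.
    val reduced = shuffled.mapPartitions(userReduce)

    // Step 5: write the results back to HDFS through Hadoop's OutputFormat.
    reduced.saveAsHadoopFile("hdfs:///path/to/output", classOf[Text], classOf[Text],
      classOf[TextOutputFormat[Text, Text]])

    sc.stop()
  }
}
```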
This mapping is the conventional approach when a user migrates from MapReduce to Spark by hand. However, to design a general-purpose upgrade solution, it is not enough to simply express the MapReduce computation with Spark operators. By analyzing the MapReduce and Spark frameworks, we found the following:
The lowest layer is the same: both depend on a resource scheduler, YARN or K8s. The functions of the middle layer are the same or similar, but the implementations are completely different. For example, what MapReduce calls InputFormat and OutputFormat correspond to HadoopRDD and saveAsHadoopFile in Spark, and MapReduce's Counter corresponds to Spark's Accumulator. Other functions, including Shuffle, resource scheduling, History, and speculative execution, are all conceptually aligned but implemented differently, so what we need to do is replace the MapReduce implementations with Spark's.
The top layer, as stated in the design goals, is the user's implementation layer and must remain completely unchanged. The pink layer shown above cannot run directly on the Spark base, so we add an intermediate layer that adapts the user's code to Spark's computing interfaces. MapRunner and ReduceRunner adapt the Hadoop Mapper and Reducer methods so that Spark's map operator can run them. A Counter Adapter adapts the user's Counter calls: when the user increments a value through the Counter interface, it is converted into a Spark Accumulator call. There is also a corresponding Conf Translator for the Configuration: when the job is submitted, the Hadoop Configuration generated for the user is translated into the corresponding Spark parameters. This completes the adaptation of the computing process; with it, the user's code can be run directly as a Spark job.
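As a rough illustration of the MapRunner idea, the sketch below runs an old-API Hadoop Mapper inside a Spark mapPartitions call, assuming the org.apache.hadoop.mapred interfaces. The buffering OutputCollector and the use of Reporter.NULL are simplifications: the real adapter streams output and plugs in a Reporter whose incrCounter calls are forwarded to Spark Accumulators.

```scala
import org.apache.hadoop.mapred.{JobConf, Mapper, OutputCollector, Reporter}
import org.apache.hadoop.util.ReflectionUtils
import scala.collection.mutable.ArrayBuffer

// Illustrative adapter that lets Spark's mapPartitions run an old-API Hadoop Mapper.
// Note: JobConf is not serializable; a real implementation would ship a wrapped or broadcast copy.
class MapRunnerSketch[K1, V1, K2, V2](mapperClass: Class[_ <: Mapper[K1, V1, K2, V2]],
                                      jobConf: JobConf) {
  def run(records: Iterator[(K1, V1)]): Iterator[(K2, V2)] = {
    // ReflectionUtils.newInstance also calls configure(jobConf) for JobConfigurable mappers.
    val mapper = ReflectionUtils.newInstance(mapperClass, jobConf)
    val buffer = new ArrayBuffer[(K2, V2)]()
    val collector = new OutputCollector[K2, V2] {
      override def collect(key: K2, value: V2): Unit = buffer += ((key, value))
    }
    // Reporter.NULL stands in for the Counter Adapter: a real Reporter would translate
    // incrCounter(...) calls into Spark Accumulator updates.
    records.foreach { case (key, value) => mapper.map(key, value, collector, Reporter.NULL) }
    mapper.close()
    buffer.iterator
  }
}

// Hypothetical usage inside the adapted job:
//   inputRdd.mapPartitions(it => new MapRunnerSketch(classOf[MyMapper], jobConf).run(it))
```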

Configuration adaptation

Configurations fall into three main categories: configurations that need to be translated, configurations that are passed through directly, and configurations that need to be ignored.
The first category is configurations that need translation, such as job resource parameters. Both MapReduce and Spark need to tell the resource framework what kind of Container is required to process the data, but they use different parameters at submission time, so these need to be translated. The table also lists environment variables, uploaded files, job concurrency, and so on, all of which need to be translated in the same way.
The second category is configurations that are passed through directly. Spark depends on many Hadoop classes, and many of those classes also need configuration. These can be passed through directly by adding the spark.hadoop. prefix during translation.
The third category is configurations that need to be ignored: features that exist in MapReduce but not in Spark. We document these in the user manual and tell users that the feature is not supported.
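A simplified sketch of this three-way handling, assuming a tiny illustrative mapping table; the real translator covers many more keys and also converts value formats (for example megabyte numbers into Spark memory strings).

```scala
// Illustrative configuration translator; key names in `translated` and `ignored` are examples only.
object ConfTranslatorSketch {
  // Category 1: keys that must be translated into Spark equivalents.
  private val translated = Map(
    "mapreduce.map.memory.mb"    -> "spark.executor.memory",
    "mapreduce.reduce.memory.mb" -> "spark.executor.memory",
    "mapreduce.job.cache.files"  -> "spark.files"
  )
  // Category 3: keys with no Spark counterpart; dropped and documented in the user manual.
  private val ignored = Set("mapreduce.map.failures.maxpercent")

  def translate(hadoopConf: Map[String, String]): Map[String, String] =
    hadoopConf.collect {
      case (key, value) if translated.contains(key) => translated(key) -> value
      // Category 2: everything else is passed through with the spark.hadoop. prefix,
      // which Spark copies back into the Hadoop Configuration at runtime.
      case (key, value) if !ignored(key)            => s"spark.hadoop.$key" -> value
    }
}
```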

Submission side adaptation

For the sake of user experience, we want users' submission scripts to require no modification at all: jobs are still submitted with Hadoop and do not need to be switched to spark-submit. We therefore patched Hadoop so that, when a MapReduce job is submitted, the submitter recognizes a specific parameter or environment variable. Once recognized, the configuration translation described above is applied to the JobConf object; after translation, the corresponding Spark submission command is generated and a child process is started to run spark-submit.
MapReduce also has a built-in mechanism that constantly polls the job's running status. Since there is now a child process, the behavior of this Monitor changes from querying the status of an Application ID through the RM or AM interface to querying the status of the child process.
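A schematic sketch of this submission-side behavior. The switch name, the adapter class and jar, and the simplified pass-through translation are assumptions for illustration; the actual logic lives inside the patched Hadoop job submitter.

```scala
import org.apache.hadoop.mapred.JobConf
import scala.jdk.CollectionConverters._

object SmoothSubmitSketch {
  // Returns true if the job was handed over to Spark, false if it should run as plain MapReduce.
  def maybeSubmitAsSpark(jobConf: JobConf): Boolean = {
    // Hypothetical switch: a dedicated parameter or environment variable enables the migration path.
    val enabled = sys.env.get("MR_RUN_ON_SPARK").contains("true") ||
      jobConf.getBoolean("mapreduce.job.run-on-spark", false)
    if (!enabled) return false

    // Translate the JobConf (simplified here to a plain spark.hadoop.* pass-through).
    val sparkConfs = jobConf.iterator().asScala
      .map(entry => s"spark.hadoop.${entry.getKey}" -> entry.getValue).toSeq

    val cmd = Seq("spark-submit", "--class", "com.example.MapReduceOnSparkSketch") ++
      sparkConfs.flatMap { case (k, v) => Seq("--conf", s"$k=$v") } ++
      Seq("mr-on-spark-adapter.jar")

    // Launch spark-submit as a child process; the existing MapReduce job monitor is then
    // pointed at this process instead of polling the RM/AM for an Application ID.
    val child = new ProcessBuilder(cmd: _*).inheritIO().start()
    child.waitFor() == 0
  }
}
```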

Correctness verification

After completing the above three steps, smooth migration is basically achievable, but before going live we recommend that users run a dual-run verification. Dual-run verification itself is common practice; the problem we encountered is that different output types need two different comparison methods. For most OutputFormats the checksums of the outputs can be compared directly, but for a small number of OutputFormats a line-by-line comparison with the corresponding input Reader is required, because the files they generate contain timestamps or other user-related information, so the files produced by each run may differ. In that case we need to construct the corresponding Reader, read the files line by line, and compare them line by line.
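A minimal sketch of the two comparison modes. The normalize hook is where run-specific fields such as timestamps would be stripped; the names are illustrative, the line-by-line path reads plain text rather than the format's own Reader, and it materializes both outputs, which is acceptable for a sketch but not for very large files.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

object DualRunCheckerSketch {
  private val fs = FileSystem.get(new Configuration())

  // Mode 1: for deterministic OutputFormats, compare the HDFS file checksums directly.
  def sameChecksum(a: Path, b: Path): Boolean =
    fs.getFileChecksum(a) == fs.getFileChecksum(b)

  // Mode 2: for OutputFormats whose files embed timestamps or other run-specific fields,
  // read both outputs line by line, normalize away the volatile fields, and compare.
  def sameRecords(a: Path, b: Path, normalize: String => String): Boolean = {
    def lines(p: Path): List[String] =
      Source.fromInputStream(fs.open(p)).getLines().map(normalize).toList
    lines(a) == lines(b)
  }
}
```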

Problems and Solutions

Memory settings - mapping MapReduce memory one-to-one to Spark Executor memory may trigger OOM in some cases

As mentioned in the previous section, memory is translated one-to-one: a MapReduce task that originally used 4 GB still uses 4 GB after conversion to Spark. However, this caused many jobs to hit OOM. The main reason is that the memory models of MapReduce and Spark are not exactly the same. MapReduce's default Shuffle Spill buffer is 256 MB, while in Spark memory is managed by a Memory Manager whose default maximum usage is 60% of total memory. In addition, Spark Shuffle uses a different network framework from MapReduce, which creates more concurrency and uses more memory.
To solve this problem, we set spark.memory.fraction=0.4 for all smoothly migrated tasks to reduce the memory used during Shuffle Spill, and by default add 512 MB of memory per Core. After this strategy went online, the OOM problems of smoothly migrated tasks were resolved.
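A small sketch of the adjusted memory translation rule, assuming each executor core runs one of the original tasks; the constants mirror the numbers above and the helper name is made up.

```scala
// Keep the user's original per-task heap, add 512 MB of headroom per core, and shrink the
// unified memory fraction so Shuffle Spill takes a smaller share of the heap.
def translateMemoryConf(mapReduceMemoryMb: Int, coresPerExecutor: Int): Map[String, String] = Map(
  "spark.executor.cores"  -> coresPerExecutor.toString,
  "spark.executor.memory" -> s"${(mapReduceMemoryMb + 512) * coresPerExecutor}m",
  "spark.memory.fraction" -> "0.4"
)
```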

Concurrency settings - Hadoop Streaming jobs may cause directory name conflicts when using local directories

A Spark job can run multiple Tasks concurrently in one Container. Some Hadoop Streaming job scripts create a directory of their own under the local working directory; on Spark, multiple Tasks running in the same Container at the same time will conflict over that directory. In MapReduce, each Task gets a brand-new Container, so this conflict does not occur.
There are two main solutions to this problem:
First, add a parameter to control the Executor concurrency of upgraded Spark jobs. By default, users are simply given single-Core Executors, which is equivalent to one Executor running one Task at a time.
Second, advise users to modify their directory-creation logic: instead of creating a local directory with a fixed name, read the Task ID from an environment variable and include it in the directory name to avoid conflicts. Both workarounds are sketched below.
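A sketch of the two workarounds. The Spark setting in the first part is a real configuration key; the environment variable name in the second part is the one Hadoop Streaming conventionally exposes (mapred_task_id) and is an assumption about what a given script would read.

```scala
import java.nio.file.{Files, Path, Paths}

object ConcurrencyWorkaroundSketch {
  // Workaround 1: force single-core Executors so that two Tasks never share a container's
  // local working directory.
  val singleCoreExecutorConf: Map[String, String] = Map("spark.executor.cores" -> "1")

  // Workaround 2 (inside the user's logic): derive the local directory name from the task ID
  // exposed through an environment variable instead of using a fixed name.
  def createTaskLocalDir(): Path = {
    val taskId = sys.env.getOrElse("mapred_task_id", java.util.UUID.randomUUID().toString)
    Files.createDirectories(Paths.get(s"./work_$taskId"))
  }
}
```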

Class loading issues - user jobs may hit class loading problems after the upgrade

Many Jar-based jobs hit class loading problems after the upgrade. The root cause is that Spark uses a custom ClassLoader scheme that splits class loading into two parts: the Framework ClassLoader and the user-code ClassLoader. Hadoop classes are dependencies of the Spark framework, so they are loaded by the Framework ClassLoader; at the same time, the user's job code also depends on Hadoop, and some of those dependencies are loaded by the user-code ClassLoader, which leads to various class loading conflicts.
In most cases, this can be avoided by setting spark.executor.userClassPathFirst=true so that the Spark job loads the user's classes first by default. A few users still run into problems after setting this parameter; those cases can be resolved by manually setting it back to false.

Functional alignment problem - MapReduce functions are not completely aligned with Spark

In practice, users worry that some MapReduce features are not available in Spark. For example, MapReduce can tolerate partial task failure through a parameter: as long as the proportion of failed tasks does not exceed the configured ratio, the whole job is still considered successful. Spark has no such feature.
The answer is that in most cases users can work around this themselves; for the small number of cases where the user knows the upstream data contains bad files, we provide some other Spark parameters to prevent the job from failing.

Task Attempt ID problem - some users rely on the Task Attempt ID environment variable, which has no corresponding value in Spark

The Task Attempt ID problem is essentially another alignment problem. Some users, especially of Hadoop Streaming jobs, rely on the Task Attempt ID in the environment variables, and this value has no strictly corresponding concept in Spark. In MapReduce, the Task Attempt ID is the retry count of a given Task. In Spark, a Shuffle failure triggers a Stage retry rather than a Task retry, and during a Stage retry the Task's index can change, so the retry count cannot be tied to a particular Partition ID.
We implemented an approximate solution to this problem: we use another globally increasing positive integer provided by Spark's TaskContext, the task attempt ID, to distinguish different Task attempts and fill in the corresponding value.
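A sketch of that approximation. TaskContext.get().taskAttemptId() is a real Spark API whose value is unique within a SparkContext; the environment variable names handed to the user's process and the runUserScript hook are hypothetical.

```scala
import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object TaskAttemptIdSketch {
  // Wrap a partition-level user function and hand it MapReduce-style task identifiers,
  // using Spark's globally increasing task attempt ID in place of MapReduce's attempt number.
  def withTaskEnv[T, U: ClassTag](rdd: RDD[T])(
      runUserScript: (Iterator[T], Map[String, String]) => Iterator[U]): RDD[U] =
    rdd.mapPartitions { iter =>
      val ctx = TaskContext.get()
      val env = Map(
        "mapreduce_task_partition" -> ctx.partitionId().toString,
        // Unique and monotonically increasing within a SparkContext, so it still distinguishes
        // attempts even when a Stage retry changes the Task index.
        "mapred_task_id" -> ctx.taskAttemptId().toString
      )
      runUserScript(iter, env)
    }
}
```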
 

Benefits

Statistics

The smooth migration solution described above was used to push users to upgrade from MapReduce to Spark, and the overall results were very good. Take as an example the comparison between the average MapReduce resource requests (before migration) and the Spark resource requests (after migration) for all migrations completed in the past 30 days. As the figure shows, daily CPU requests are reduced by about 17,000 cores, a saving of 60%; in other words, after the upgrade these jobs run on 40% of the original resources. Daily memory requests are reduced by about 20,000 GB, a saving of roughly 17%.

Interpretation

  • As introduced above, this is a smooth migration solution: users do not rewrite their tasks in Spark. Spark is an engine that makes better use of memory, so the benefit of a smooth migration is expected to be lower than that of a manual rewrite, because the current gains do not come from Spark operators themselves. The user's processing logic is completely unchanged and the running code is still the MapReduce code; for a Hadoop Streaming job, the running code is still the user's script. So the benefit does not come from the Spark operators themselves.
  • The benefits mainly come from the Shuffle stage: Spark Shuffle is better than MapReduce Shuffle, from the network framework down to the implementation details. We have also made some deep custom optimizations to Spark Shuffle to further improve its performance; interested readers can refer to the related articles. Shuffle optimization shortens job run time. For some jobs the average Map time is 2 minutes and the average Reduce time 5 minutes, while the Shuffle time is often more than 10 minutes. After upgrading to Spark, the Shuffle time can effectively drop to zero, because Spark's Shuffle is asynchronous: data can be computed in the main thread while other threads read Shuffle data, reducing the intermediate blocking time to milliseconds.
  • Because the benefit mainly comes from Shuffle, the performance improvement for Map Only jobs is not obvious. For all upgraded Map Only and DistCp jobs, the resource request volume changed little, fluctuating between 90% and 110% of the original.
  • The reason the CPU saving is significantly higher than the memory saving is that CPU is translated one-to-one, whereas memory is not: Map and Reduce memory usually take the larger of the two values, which wastes memory, and to avoid the OOM problems of smooth migration we added 512 MB per Core, so the overall memory request actually increased. However, because the Shuffle-stage gains shortened job run times, the overall memory saving is still positive.
 