Apache Spark: 60 TB+ production use case from Facebook


Sital Kedia talks about big data

This article covers the experience and lessons Facebook accumulated while scaling Spark to replace Hive.
Translated from the original Databricks post: https://databricks.com/blog/2016/08/31/apache-spark-scale-a-60-tb-production-use-case.html.

Use case: feature preparation for entity ranking

Real-time entity ranking is used in various ways at Facebook. Some of the raw feature values for these online serving platforms are generated offline with Hive, and the data is then loaded into the real-time query system. The Hive-based infrastructure, built years ago, was computationally resource-intensive and hard to maintain because the pipeline was sharded into hundreds of smaller Hive jobs. To get fresher feature data and improve manageability, we picked one of the existing pipelines and tried to migrate it to Spark.

Original Hive implementation

The Hive-based pipeline was composed of three logical stages, where each stage corresponded to hundreds of smaller Hive jobs sharded by entity_id, since running one large Hive job per stage was less reliable and limited by the maximum number of tasks per job.

[Figure: the previous Hive pipeline, split into hundreds of sharded jobs across three stages]

The three logical stages can be summarized as follows:

  1. Filter out non-production features and noise.

  2. Aggregate on each (entity_id, target_id) pair.

  3. Divide the table into N shards and pipe each shard through a custom binary to generate a custom index file for online querying.

Building the index with the Hive-based pipeline took roughly three days. Managing it was also challenging, because the pipeline contained hundreds of sharded jobs, which made monitoring difficult, and there was no easy way to gauge the overall progress of the pipeline or compute an ETA. Given these limitations of the existing Hive pipeline, we decided to try to build a faster and more manageable pipeline with Spark.

Spark implementation

Debugging the full flow end to end would have been challenging and resource-intensive, so we started by converting the most resource-intensive part of the Hive-based pipeline: the second stage. We began with a 50 GB compressed input sample and gradually scaled up to 300 GB, 1 TB, and then 20 TB. At each size increment we resolved performance and stability issues, but testing at 20 TB is where we found our largest opportunity for improvement.

While running on 20 TB of input, we found that we were generating too many output files (each around 100 MB) because of the large number of tasks. Three of the ten hours of job runtime were spent moving files from the staging directory to the final directory in HDFS. Initially we considered two options: improve batch renaming in HDFS to support this case, or configure Spark to generate fewer output files (difficult at this stage because of the large number of tasks: 70,000). We stepped back from the problem and considered a third option. Since the tmp_table2 table produced in the second step of the pipeline is temporary and only holds the pipeline's intermediate output, we were essentially compressing, serializing, and replicating three copies of several terabytes of data for a single read workload. So we went one step further: remove the two temporary tables and combine all three Hive stages into a single Spark job that reads 60 TB of compressed data and performs a 90 TB shuffle and sort. The final Spark job is as follows:

[Figure: the final Spark job, with all three stages combined]
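To make the shape of that single job concrete, here is a minimal sketch in Spark's RDD API of how the three stages can be fused into one job: a filter, an aggregation keyed by (entity_id, target_id), and a per-shard pipe through the indexer binary. The input path, schema, filter rule, shard count, and binary name are all assumptions for illustration, not Facebook's actual code.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object EntityRankingFeaturePrep {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("entity-ranking-feature-prep"))
    val numShards = 1024  // the "N" shards of stage 3; value is hypothetical

    sc.textFile("hdfs:///warehouse/raw_features")            // hypothetical input path
      .map(_.split("\t"))
      // Stage 1: keep production features only and drop noisy rows
      // (assumed TSV schema: entity_id, target_id, feature_name, feature_value).
      .filter(cols => cols.length == 4 && cols(2).startsWith("prod_"))
      // Stage 2: aggregate feature values per (entity_id, target_id) pair.
      .map(cols => ((cols(0), cols(1)), cols(3).toDouble))
      .reduceByKey(_ + _)
      // Stage 3: shard by entity_id and pipe each shard through the custom
      // indexer binary that emits the index format used for online queries.
      .map { case ((entityId, targetId), v) => (entityId, s"$entityId\t$targetId\t$v") }
      .partitionBy(new HashPartitioner(numShards))
      .values
      .pipe("custom_index_builder")                          // hypothetical binary name
      .saveAsTextFile("hdfs:///warehouse/entity_ranking_index")

    sc.stop()
  }
}
```

Running everything as one job is what lets Spark skip materializing the temporary tables to HDFS between stages.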

How did we extend Spark for this job?

Of course, running a single Spark job over such a large pipeline did not work on the first attempt, or even on the tenth. As far as we know, this is the largest Spark job ever attempted in terms of shuffle data size (Databricks' petabyte sort was on synthetic data). We made numerous improvements and optimizations to the core Spark infrastructure and to our application to get this job to run. The upside is that many of these improvements apply to other large Spark workloads, and we were able to contribute all of the work back to the open source Apache Spark project (see the JIRA tickets for additional details). Below we highlight the major improvements that enabled one of the entity ranking pipelines to be deployed to production.

Reliability fixes

Handling frequent node restarts

To perform long-running jobs reliably, we want the system to be fault-tolerant and to recover from failures (mainly machine restarts caused by normal maintenance or software errors). Although Spark is designed to tolerate machine restarts, we found a variety of bugs and issues that needed to be addressed before it was robust enough to handle common failures.

Make PipedRDD more robust to fetch failures (SPARK-13793): The previous implementation of PipedRDD was not robust enough to handle fetch failures caused by node restarts; the job would fail whenever a fetch failure occurred. We changed PipedRDD to handle fetch failures gracefully, so the job can recover from this kind of failure.

Configurable maximum number of fetch failures (SPARK-13369): For long-running jobs like this one, the probability of fetch failures due to machine restarts increases significantly. The maximum number of fetch failures allowed per stage was hard-coded in Spark, so the job would fail once that number was reached. We made it configurable and raised it from 4 to 20 for this use case, making the job more robust.

Less disruptive cluster restarts: Long-running jobs should be able to survive a cluster restart. Spark's restartable shuffle service feature lets us preserve shuffle files across node restarts. On top of that, we implemented a feature in the Spark driver to pause task scheduling, so that the flood of task failures caused by a cluster restart does not fail the job.
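For reference, a hedged sketch of the stock-Spark settings that correspond to the ideas above: the external shuffle service keeps shuffle files served from the node after an executor restarts, and the per-stage retry limit that used to be hard-coded at 4 is exposed in later Spark releases as spark.stage.maxConsecutiveAttempts. Treat that mapping, and the values below, as assumptions rather than Facebook's exact configuration.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Serve shuffle files from the external shuffle service so they survive
  // executor restarts during maintenance.
  .set("spark.shuffle.service.enabled", "true")
  // Allow more stage retries after fetch failures than the old hard-coded 4
  // (the article raised its internal limit from 4 to 20).
  .set("spark.stage.maxConsecutiveAttempts", "20")
```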

Other reliability fixes

Unresponsive driver (SPARK-13279): The Spark driver was getting stuck due to an O(N^2) operation while adding tasks, which eventually caused the job to hang and be killed. We fixed the problem by removing the unnecessary O(N^2) operation.

Excessive driver speculation: We found that the Spark driver was spending a lot of time on speculation when managing a large number of tasks. In the short term, we disabled speculative execution for this job. Work is underway to change the Spark driver so that less time is spent on speculation.
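Disabling speculation for a job is a one-line setting (a sketch of the short-term workaround using Spark's standard spark.speculation flag):

```scala
import org.apache.spark.SparkConf

// Turn off speculative re-launching of slow tasks for this job.
val conf = new SparkConf().set("spark.speculation", "false")
```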

TimSort issue caused by integer overflow for large buffers (SPARK-13850): Testing uncovered a bug in Spark's unsafe memory operations that caused TimSort memory corruption. Thanks to the Databricks folks for fixing this issue, which enables TimSort to run on large memory buffers.

Tune the shuffle service to handle a large number of connections: During the shuffle phase, we saw many executors time out while trying to connect to the shuffle service. Increasing the number of Netty server threads (spark.shuffle.io.serverThreads) and the accept backlog (spark.shuffle.io.backLog) resolved the issue.
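A sketch of that tuning using the two keys named above; the concrete values are illustrative, not the ones used at Facebook:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // More Netty server threads so the shuffle service can serve bursts of
  // fetch connections from thousands of reducers.
  .set("spark.shuffle.io.serverThreads", "128")
  // Deeper listen backlog so connection attempts are queued instead of refused.
  .set("spark.shuffle.io.backLog", "8192")
```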

Fix Spark executor OOM (SPARK-13958): Packing more than four reducer tasks per host was a challenge at first. Spark executors were running out of memory because of a bug in the sorter that let the pointer array grow without bound. We fixed the problem by forcing the data to spill to disk when no more memory is available for the pointer array to grow. As a result, we can now run 24 tasks per host without running out of memory.

Performance improvements

After the reliability improvements above, we could run Spark jobs dependably, so we turned to performance-related work to get the most out of Spark. We used Spark's metrics and several profilers to find performance bottlenecks.

The tools we use to find performance bottlenecks

Spark UI metrics: The Spark UI gives insight into where time is spent in a particular stage. Each task's execution time is split into sub-phases, which makes it easier to find the bottleneck in the job.

Jstack: The Spark UI also provides an on-demand jstack function on executor processes, which can be used to find hotspots in the code.

Spark Linux Perf / Flame Graph support: Although the two tools above are very convenient, they do not provide an aggregated view of CPU profiling for a job running on hundreds of machines at once. On a per-job basis, we added support for enabling Perf profiling (via libperfagent for Java symbols) and can customize the sampling duration/frequency. Using our internal metric collection framework, the profiling samples are aggregated and displayed as a Flame Graph across the executors.

Performance optimizations

Fix memory leak in the sorter (SPARK-14363) (30% speedup): We found an issue where tasks released all memory pages but did not release the pointer array. As a result, large chunks of memory went unused, causing frequent spills and executor OOMs. Our fix now frees memory correctly and lets large sorts run efficiently. We observed about a 30% CPU improvement after this fix.

Snappy optimization (SPARK-14277) (10% speedup): A JNI method (Snappy.ArrayCopy) was being called for every row read or written. We raised this issue, and Snappy's behavior was changed to use the non-JNI-based System.arraycopy instead. This change alone yielded roughly a 10% CPU improvement.

Reduce shuffle write latency (SPARK-5581) (up to 50% speedup): On the map side, when writing shuffle data to disk, the map task was opening and closing the same file for every partition. We fixed it to avoid the unnecessary open/close and observed up to a 50% CPU improvement for jobs that write to a large number of shuffle partitions.

Fix duplicate task runs caused by fetch failures (SPARK-14649): The Spark driver was resubmitting tasks that were already running when a fetch failure occurred, which led to poor performance. We fixed this by avoiding re-running tasks that are still running, and we saw that the job became more stable when fetch failures happened.

Configurable buffer size for PipedRDD (SPARK-14542) (10% speedup): While using PipedRDD, we found that the default buffer size for transferring data from the sorter to the piped process was too small, and our job spent more than 10% of its time copying data. We made the buffer size configurable to avoid this bottleneck.

Cache index files for shuffle fetch speedup (SPARK-15074): We observed that the shuffle service often became a bottleneck, with reducers spending 10% to 15% of their time waiting to fetch map data. Digging into the issue, we found that the shuffle service was opening and closing the shuffle index file for every shuffle fetch. We changed it to cache the index information so that the file open/close is avoided and the index information is reused for subsequent fetches. This change reduced total shuffle fetch time by 50%.

Reduce the update frequency of shuffle bytes-written metrics (SPARK-15569) (up to 20% speedup): Using the Spark Linux Perf integration, we found that around 20% of CPU time was being spent probing and updating the shuffle bytes-written metrics.

Configurable initial buffer size for the sorter (SPARK-15958) (up to 5% speedup): The sorter's default initial buffer size (4 KB) is too small for large workloads, and as a result we wasted a significant amount of time expanding the buffer and copying its contents. We made the buffer size configurable; with a large initial buffer size of 64 MB we avoid much of that copying and speed the job up by about 5%.
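In current Spark releases the knob introduced by this change is the internal setting spark.shuffle.sort.initialBufferSize; the snippet below is a sketch, and the exact key name should be checked against your Spark version.

```scala
import org.apache.spark.SparkConf

// Start the shuffle sorter with a 64 MB buffer instead of the 4 KB default
// to avoid repeated buffer growth and copying on large workloads.
val conf = new SparkConf()
  .set("spark.shuffle.sort.initialBufferSize", "67108864")  // 64 MB, in bytes
```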

Configure the number of tasks: Since our input size is 60 TB and each HDFS block is 256 MB, we were generating more than 250,000 tasks for this job. Although we could run Spark jobs with that many tasks, we found that performance degrades significantly when there are too many. We introduced a configuration parameter to make the map input size configurable, so by setting the input split size to 2 GB we could reduce the task count by roughly a factor of 8.
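Facebook added its own parameter for this; with stock Spark, one common way to get a similar effect (a sketch, with a hypothetical path) is to raise the Hadoop input split size so each map task reads about 2 GB instead of a single 256 MB block:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("fewer-map-tasks"))
// Ask the Hadoop input format for ~2 GB splits, cutting the number of map
// tasks by roughly 8x relative to 256 MB HDFS blocks.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.minsize",
  (2L * 1024 * 1024 * 1024).toString)
val input = sc.textFile("hdfs:///warehouse/raw_features")  // hypothetical path
```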

After all of these reliability and performance improvements, we are happy to report that we built and deployed a faster and more manageable pipeline for one of our entity ranking systems, and gained the ability to run other similar jobs in Spark.

Performance comparison between Spark pipeline and Hive pipeline

We use the following performance metrics to compare the Spark pipeline with the Hive pipeline. Note that these numbers are not a direct Spark-versus-Hive comparison at the query or job level; rather, they compare building an optimized pipeline on a flexible compute engine (Spark) against a pipeline that operates only at the query/job level of the compute engine (Hive).

CPU time: This is CPU usage from the operating system's point of view. For example, if your job runs for 10 seconds on a 32-core machine using only 50% of the CPU, then your CPU time is 32 * 0.5 * 10 = 160 CPU-seconds.

[Chart: CPU time comparison between the Spark and Hive pipelines]

CPU reservation time: This is the CPU reservation from the resource management framework's point of view. For example, if we reserve a 32-core machine for 10 seconds to run the job, the CPU reservation time is 32 * 10 = 320 CPU-seconds. The ratio of CPU time to CPU reservation time reflects how well we use the reserved CPU resources on the cluster. When accurate, reservation time gives a better comparison between execution engines than CPU time when running the same workloads. For example, if a process requires 1 CPU-second to run but must reserve 100 CPU-seconds, it is less efficient by this metric than a process that requires 10 CPU-seconds but reserves only 10 CPU-seconds to do the same amount of work. We also computed memory reservation time, but it is not included here: since the experiments were run on the same hardware, the numbers were similar to the CPU reservation time, and neither the Spark run nor the Hive run cached data in memory. Spark is able to cache data in memory, but because of our cluster's memory limits we decided to work out-of-core, similar to Hive.
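Restating the two metrics and the efficiency ratio compactly (using the numbers from the examples above):

```latex
\text{CPU time} = \text{cores} \times \text{utilization} \times \text{elapsed time}
                = 32 \times 0.5 \times 10\,\text{s} = 160\ \text{CPU-seconds}

\text{CPU reservation time} = \text{reserved cores} \times \text{elapsed time}
                            = 32 \times 10\,\text{s} = 320\ \text{CPU-seconds}

\text{efficiency} = \frac{\text{CPU time}}{\text{CPU reservation time}}
```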

[Chart: CPU reservation time comparison between the Spark and Hive pipelines]

Latency: The end-to-end elapsed time of the job.

[Chart: latency comparison between the Spark and Hive pipelines]

Conclusion and future work

Facebook uses high-performance and scalable analytics to aid product development. Apache Spark offers the unique ability to unify various analytics use cases into a single API and an efficient compute engine. We replaced a pipeline that was broken into hundreds of Hive jobs with a single Spark job. Through a series of performance and reliability improvements, we were able to scale Spark to handle one of our entity ranking data processing use cases in production. In this particular use case, we showed that Spark can reliably shuffle and sort 90 TB+ of intermediate data and run 250,000 tasks in a single job. Compared with the old Hive-based pipeline, the Spark-based pipeline produced significant performance improvements (4.5-6x CPU, 3-4x resource reservation, and ~5x latency) and has been running in production for several months.


Origin: blog.51cto.com/15127544/2664901