iQIYI Big Data Acceleration: From Hive to Spark SQL

01

   Introduction

Since iQIYI launched its big data business in 2012, it has built a series of platforms on top of the open source big data ecosystem, covering the entire data pipeline from data collection and processing to data analysis and data application, and providing strong support for the company's operational decisions and its various data intelligence businesses. With data volumes growing continuously and computation becoming more complex, quickly mining the potential value of the data has become a major challenge for the big data platform.

To meet the need for near-real-time analysis of massive data, the big data team launched a big data acceleration project in 2020, using big data technology to speed up data circulation inside iQIYI and enable more timely operational decisions and more efficient information distribution. One part of the project was switching OLAP data analysis from the Hive engine to the Spark SQL engine, which has brought significant benefits: tasks run 67% faster and resource usage dropped by 50%, improving efficiency and revenue for businesses such as BI, advertising, membership, and user growth.

02

   Background

In the early stage of iQIYI's big data platform, the infrastructure and data warehouse were built on the open source Hadoop ecosystem, and Hive was the main tool for data processing and analysis. Hive is a Hadoop-based offline analysis tool that provides a rich SQL dialect for analyzing data stored in the Hadoop distributed file system: it maps structured data files to database tables, offers complete SQL query capabilities, and converts SQL statements into Hadoop MapReduce jobs, so users who are unfamiliar with MapReduce can easily query, summarize, and analyze data with SQL. However, Hive is relatively slow, especially for complex queries over large data sets.

With business growth and the surge in data volume, especially after new time-sensitive services such as intelligent advertising bidding, feed recommendation, real-time membership operations, and user growth came online, offline analysis with Hive could no longer meet the business requirements for data timeliness. We therefore introduced a series of more efficient OLAP engines such as Trino and ClickHouse, but these engines focus on the analysis layer, while the data warehouse and the upstream data cleaning and processing that analysis depends on were still built on Hive. How to improve the performance of Hive processing and analysis, and thus accelerate iQIYI's entire big data pipeline, became an urgent problem.

03

   Solution selection

We investigated several mainstream alternatives, including Hive on Tez, Hive on Spark, and Spark SQL, and systematically compared them across dimensions such as functional compatibility, performance, stability, and migration cost. In the end we chose Spark SQL.

  • Hive on Tez

This solution uses Tez as a pluggable execution engine for Hive, replacing MapReduce. Tez is an Apache open source computing framework that supports DAG jobs. Its core idea is to further split the Map and Reduce operations and recombine them into a single large DAG job. Compared with MapReduce, Tez avoids a lot of unnecessary writing and reading of intermediate data, expressing in one job what MapReduce needs several cooperating jobs to complete.

Advantage:

  • Transparent switching: the SQL syntax is still Hive SQL, and Hive's execution engine can be switched from MapReduce to Tez through configuration without modifying upper-layer applications

Disadvantages:

  • Poor performance: this solution handles large data sets with limited parallelism, which becomes especially obvious when data skew occurs

  • Inactive community: the solution is rarely adopted in the industry, and there is little discussion in the community

  • High operation and maintenance cost: when the Tez engine fails, there is little reference material for troubleshooting

  • Hive on Spark

This solution uses Spark as a pluggable execution engine for Hive, replacing MapReduce. Spark is a large-scale data processing engine based on in-memory computing. Compared with MapReduce, Spark is scalable, makes full use of memory, and has a flexible computing model, so it handles complex tasks more efficiently.

Advantage:

  • Transparent switching: the SQL syntax is still Hive SQL, and Hive's execution engine can be switched from MapReduce to Spark through configuration without modifying upper-layer applications

Disadvantages:

  • Poor version compatibility: only Spark 2.3 and below is supported, so new features of Spark 3.x and later cannot be used, which does not meet future upgrade needs

  • Unsatisfactory performance: Hive on Spark still uses Hive's Calcite-based planner to translate SQL into MapReduce-style primitives and merely executes those primitives on the Spark engine instead of MapReduce, so performance is not ideal

  • Inactive community: the solution is rarely adopted in the industry, and the community is not active

  • Inflexible resource allocation: when submitting Spark jobs, Hive on Spark can only use fixed resource settings, which is hard to apply in multi-tenant, multi-queue scenarios

  • Spark SQL

Spark SQL is Spark's solution for structured data. It provides Hive-compatible SQL syntax, supports Hive Metastore metadata, and offers complete SQL query functionality. The existing Hive-based data warehouse can therefore still be used with Spark SQL, and most existing Hive SQL tasks can be switched to Spark SQL smoothly.

Spark SQL converts SQL statements into Spark jobs and organizes computation and caching with an in-memory model. Compared with Hive on MapReduce, which writes intermediate data to disk, it has lower disk I/O overhead and higher execution efficiency.
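For illustration, the minimal sketch below (table and column names are hypothetical) shows how a Spark SQL job can query an existing Hive table through the shared Hive Metastore; this reuse of Hive metadata is what allows most Hive SQL tasks to be switched over without changing the warehouse.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: query an existing Hive table from Spark SQL.
// enableHiveSupport() makes Spark use the existing Hive Metastore and warehouse.
val spark = SparkSession.builder()
  .appName("hive-to-sparksql-demo")
  .enableHiveSupport()
  .getOrCreate()

// dw.play_log and its columns are hypothetical examples.
val df = spark.sql(
  """SELECT dt, count(*) AS pv
    |FROM dw.play_log
    |WHERE dt = '2023-06-01'
    |GROUP BY dt""".stripMargin)
df.show()
```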

  • Selection summary

The following table compares Hive on MapReduce, Hive on Tez, Hive on Spark, and Spark SQL in detail. Spark SQL is clearly the best fit for our scenario.

[Image: feature comparison of Hive on MapReduce, Hive on Tez, Hive on Spark, and Spark SQL]

04

   Technical transformation

Migrating from Hive to Spark SQL involves many challenges and a lot of engineering work, including Spark compatibility changes and performance optimization, SQL syntax adjustments, data consistency guarantees, and system integration and dependency changes.

  • Spark Compatibility Transformation

There are syntactic differences between Spark SQL and Hive SQL, and many compatibility issues were found during the migration. Using Spark's extension mechanism, we intercept and rewrite the plans produced at each stage of SQL execution, achieving compatibility in syntax, execution logic, and functions, which greatly improves the migration success rate.
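As an illustration of this extension approach (a sketch only, not iQIYI's actual patch), the example below uses Spark's SparkSessionExtensions API to inject an extra analyzer rule that rewrites references to Hive's GROUPING__ID virtual column into Spark's grouping_id() function, one of the differences listed below. The class and rule names are hypothetical, and the extension would be enabled via spark.sql.extensions.

```scala
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions, functions}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Analyzer rule: replace Hive's GROUPING__ID virtual column with grouping_id().
// Note: a real implementation would need to avoid touching genuine columns of that name.
case class RewriteGroupingId(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
    case a: UnresolvedAttribute if a.name.equalsIgnoreCase("grouping__id") =>
      functions.expr("grouping_id()").expr
  }
}

// Entry point, registered through spark.sql.extensions=<package>.HiveCompatExtensions
class HiveCompatExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit = {
    ext.injectResolutionRule(session => RewriteGroupingId(session))
  }
}
```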

Here are a few key differences:

  • UDF thread safety: UDFs that use SimpleDateFormat to process dates did not throw exceptions on Hive, but they fail on Spark because the Spark engine executes such functions from multiple threads. The fix is to modify the UDF so that the SimpleDateFormat is held in a ThreadLocal (a sketch is shown after this list).

  • Grouping ID support: Spark does not support Hive's grouping__id virtual column and provides its own grouping_id() function instead, which causes compatibility issues. We modified Spark so that grouping__id is automatically rewritten to grouping_id() during SQL parsing

  • Parameter compatibility: Hive-specific parameters need to be mapped to the corresponding Spark parameters (see the mapping below)

[Image: mapping of Hive-specific parameters to their Spark equivalents]

  • Default column aliases: in Hive, a computed column without an explicit alias is given a default name starting with _c, but Spark behaves differently. When calling functions whose return values can contain commas (such as get_json_object), this can lead to a "number of columns does not match" error. The recommended workaround is to give every column an explicit alias rather than relying on default names such as _c0.

  • Permanent functions not supported: Spark did not support permanent functions because its code did not download the UDF jar from HDFS. In addition, a temporary function does not need a database name, but a permanent function does. To promote permanent functions, we added a fallback: when a function cannot be found in the current database, Spark also looks for the permanent function in the default database.

  • The reset command not supported: some online tasks use the reset command, so we modified Spark to make Spark SQL support reset.
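A minimal sketch of the thread-safety fix mentioned in the first item above, written here as a Spark UDF (the function name fmt_date and the date pattern are hypothetical, and `spark` is assumed to be an existing SparkSession): the SimpleDateFormat instance is kept in a ThreadLocal so that concurrent invocations from different threads do not share mutable state.

```scala
import java.text.SimpleDateFormat
import java.util.Date

// SimpleDateFormat is not thread-safe; keep one instance per thread.
object DateFmt {
  private val fmt: ThreadLocal[SimpleDateFormat] =
    ThreadLocal.withInitial(() => new SimpleDateFormat("yyyy-MM-dd"))
  def format(tsMillis: Long): String = fmt.get().format(new Date(tsMillis))
}

// Register as a Spark SQL function (name is hypothetical).
spark.udf.register("fmt_date", (tsMillis: Long) => DateFmt.format(tsMillis))
```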

  • Spark new feature enablement and configuration optimization

  • Enable dynamic resource allocation (DRA): tasks automatically request or release executors according to current needs, which fixes unreasonable resource allocation. Reclaiming idle resources greatly reduces the waste of cluster resources, and by capping the maximum number of executors, large queries are prevented from blocking the queue by taking too many resources (see the configuration sketch after this list).

  • Enable adaptive query execution (AQE): statistics are collected while each stage runs, and the execution plan of subsequent stages is re-optimized based on them, for example by dynamically coalescing small shuffle partitions, dynamically choosing the appropriate join strategy, and dynamically splitting skewed partitions, which improves data processing efficiency.

  • Automatically merge small files: a Rebalance operator is inserted before the write and, combined with Spark's AQE, small partitions are merged and large partitions are split automatically, which solves the problem of large numbers of small files.
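For illustration, the sketch below shows the kind of DRA and AQE settings involved; the values are placeholders, not tuned recommendations, and in production they would usually live in spark-defaults.conf or per-engine Kyuubi configuration rather than in application code.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative DRA + AQE settings; values are placeholders.
val spark = SparkSession.builder()
  .config("spark.dynamicAllocation.enabled", "true")                  // DRA: scale executors with load
  .config("spark.dynamicAllocation.maxExecutors", "200")              // cap large queries
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")       // reclaim idle executors
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")  // needed without an external shuffle service (Spark 3.x)
  .config("spark.sql.adaptive.enabled", "true")                       // AQE
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")    // merge small shuffle partitions
  .config("spark.sql.adaptive.skewJoin.enabled", "true")              // split skewed partitions
  .enableHiveSupport()
  .getOrCreate()
```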

  • Spark Architecture Improvements

In our scenario, applications submit SQL tasks to Spark ThriftServer over JDBC, and the ThriftServer then accesses the Spark cluster. However, Spark ThriftServer runs as a single user, which limits multi-tenant access to Spark and brings problems such as low resource utilization and UDFs interfering with one another.

To overcome these problems we introduced Apache Kyuubi. Kyuubi is an open source Spark ThriftServer solution that serves SQL requests with independent SparkSessions while providing the same capabilities as Spark ThriftServer. Compared with Spark ThriftServer, Apache Kyuubi supports user, queue, and resource isolation, and offers service-oriented, platform-level capabilities.
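For illustration, a client reaches Spark through Kyuubi roughly as sketched below: Kyuubi speaks the HiveServer2 JDBC/Thrift protocol, so the standard Hive JDBC driver can be used. The host, queue, user, and table are placeholders, and the way Spark settings are passed in the URL may differ between Kyuubi versions.

```scala
import java.sql.DriverManager

// Placeholder endpoint and credentials; 10009 is Kyuubi's default frontend port.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection(
  "jdbc:hive2://kyuubi.example.com:10009/default;#spark.yarn.queue=bi_queue",
  "etl_user", "")

val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT count(*) FROM dw.play_log WHERE dt = '2023-06-01'")
while (rs.next()) println(rs.getLong(1))

rs.close(); stmt.close(); conn.close()
```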

For Apache Kyuubi, we also made some customizations to better serve our production scenarios:

  • Tag-based configuration: for different computing scenarios or platforms, we predefine tags that are bound to specific configurations. When a task is submitted with the corresponding tag, the preset configuration is automatically filled in from the configuration center. For example, ad hoc query tasks get a shared engine and large-query limits, while ETL tasks get an independent engine and small-file merge settings.

  • Concurrency limits: in some abnormal situations a single client may send a large number of requests and exhaust Kyuubi's worker threads. We implemented concurrency limits at the user and IP levels in Kyuubi to prevent one user or client from flooding the service. This feature has also been contributed to the community.

  • Event collection: Kyuubi exposes events for each stage of SQL execution. These events make SQL auditing and failure analysis easy, and provide good data support for small-file optimization and SQL optimization.

05

   Automated Migration Tool

When migrating from Hive to Spark SQL, besides the known compatibility issues above, unknown problems may also appear. We must ensure that tasks still run successfully after the engine switch, that the switch does not introduce data inconsistencies, and that tasks can automatically fall back to the original engine so that online data is not affected. The common practice is to run both engines in parallel for a period of time and switch only after their results are verified to be consistent.

Before the switch, more than 20,000 Hive tasks were running on the big data platform, and migrating them to Spark SQL one by one by hand was clearly unrealistic. To improve migration efficiency, we designed and developed a set of Pilot-based tools for automatic engine switching, dual running, and result comparison.

Pilot is an intelligent SQL engine jointly developed by iQIYI's big data team and the BI Magic Mirror team. It provides a unified entry point for OLAP data analysis, integrates OLAP engines such as Hive, Spark SQL, Impala, Trino, ClickHouse, and Kylin, and supports automatic routing, automatic fallback, rate limiting, interception, intelligent analysis and diagnosis, and auditing across clusters and engines. Pilot has already been integrated with data development and analysis platforms such as the Babel data development platform, the Gear workflow scheduler, the advertising data platform, the BI portal reporting system, Magic Mirror, Paodingjian, and the Venus log service center.

By letting Pilot switch the SQL engine automatically, we can move Hive SQL to Spark SQL without users noticing, while guaranteeing data consistency and retaining the ability to roll back:

  1. Task information collection: gather information about Hive tasks through Pilot, such as SQL statements, queues, and workflow names

  2. SQL parsing: use Spark's SQL parser (SparkParser) to analyze the SQL of each Hive task and identify the databases and tables of its inputs and outputs

  3. Build output mapping tables: create mapping tables for the output of dual-run tasks, separate from the online tables, to avoid affecting online data

  4. Engine replacement: switch the execution engine of the dual-run task to Spark SQL

  5. Simulated run: execute the SQL task with both the Hive and Spark engines, writing the results to the mapping tables above for comparison

  6. Consistency check: compare the row counts and cyclic redundancy checksums (based on the CRC32 algorithm) of the two tables.

    CRC32 is a simple, fast data checksum algorithm, and Spark provides a built-in crc32 function whose return value is a Long no larger than 10^19. In our scenario, we first concat_ws the column values of each row and compute the row's CRC32, casting it to Decimal(19, 0); we then sum the per-row CRC32 values to obtain a checksum that reflects the content of the entire table and is used for the consistency comparison. The check SQL is as follows:

    [Image: consistency check SQL]
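    As a rough reconstruction of the check described above (the actual SQL is shown in the image; the table and column names here are hypothetical string columns), the same query is run against both the Hive output table and the Spark SQL mapping table and the results are compared:

    ```scala
    // Row count plus a table-level checksum: CRC32 of each row's concatenated columns,
    // cast to Decimal(19, 0) and summed over the whole table.
    val check = spark.sql(
      """SELECT
        |  count(1) AS row_cnt,
        |  sum(cast(crc32(concat_ws('|', col1, col2, col3)) AS decimal(19, 0))) AS crc_sum
        |FROM tmp.result_mapping_table
        |""".stripMargin)
    check.show()
    ```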

    Some fields of the mapping tables are collection types such as Map and List. Two tables with identical data can still produce different CRC32 sums because the internal ordering of the collection fields differs, which distorts the consistency check. For such cases we developed a dedicated UDF that sorts the collection internally before the check:

    [Image: consistency check SQL using the collection-sorting UDF]
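    A hypothetical helper in the spirit of that UDF, sorting collection elements before serialization so the per-row CRC32 does not depend on the internal ordering of Map or List fields (names are illustrative):

    ```scala
    // Sort array elements / map entries before turning them into strings.
    spark.udf.register("sorted_array_str",
      (xs: Seq[String]) => Option(xs).map(_.sorted.mkString(",")).orNull)

    spark.udf.register("sorted_map_str",
      (m: Map[String, String]) =>
        Option(m).map(_.toSeq.sortBy(_._1).map { case (k, v) => s"$k:$v" }.mkString(",")).orNull)

    // These can then be used inside the concat_ws(...) of the check SQL for collection columns.
    ```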

    Some fields are floating-point types such as Float and Double. Because of precision issues, the CRC32 sums of the two tables may differ even when the data actually matches, causing false alarms in the consistency check. We therefore refined the check so that floating-point fields are rounded to 4 decimal places before the CRC32 is computed:

    [Image: consistency check SQL with floating-point rounding]
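    A sketch of this rounding variant (again with hypothetical table and column names): floating-point columns are rounded to 4 decimal places and cast to string before being folded into the row-level CRC32:

    ```scala
    val floatSafeCheck = spark.sql(
      """SELECT
        |  count(1) AS row_cnt,
        |  sum(cast(crc32(concat_ws('|',
        |        cast(id AS string),
        |        cast(round(score, 4) AS string),
        |        cast(round(price, 4) AS string))) AS decimal(19, 0))) AS crc_sum
        |FROM tmp.result_mapping_table
        |""".stripMargin)
    floatSafeCheck.show()
    ```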

  7. Automatic fallback: if a Hive task fails after being switched to Spark SQL, Pilot automatically falls back to Hive and resubmits it, ensuring that the task always completes.

    We provide a platform to carry out the above process: users first locate the workflows belonging to their project by project name.

    [Screenshot: selecting workflows by project]

They then do a simple project configuration, entering the common parameters to be used during the simulated run.

[Screenshot: project configuration and simulation-run parameters]

During the simulated run, the running status can be monitored.

[Screenshot: monitoring the simulation run status]

After the simulated run completes, a set of tasks that meet the migration criteria is produced, and these can then be migrated with one click.

[Screenshot: one-click migration of tasks that passed the dual run]

06

   Migration results

After a period of effort, we have smoothly migrated 90% of Hive tasks to Spark SQL and achieved clear benefits: task performance improved by 67%, CPU usage dropped by 50%, and memory usage dropped by 44%.

Here are some of the business-level results:

  • Advertising: overall offline task performance improved by about 38%, computing resources were reduced by 30%, and computing efficiency increased by 20%; faster output of advertising data helped increase revenue

  • BI: total time consumption dropped by 79% and resource usage by 43%, guaranteeing the timely output of P0 tasks; core reports are now produced half an hour to an hour earlier

  • User growth: data production moved up by 2 hours, helping core user-growth reports be ready before 10:00 and improving UG operations efficiency

  • Membership: order data is produced 8 hours earlier and data analysis is more than 10 times faster, improving the efficiency of membership operations analysis

  • iQIYI: average execution time shortened by 40%, reducing total daily execution time by about 100 hours

07

   Future plans

  • Upgrade the migration tool

Some Hive tasks do not meet the conditions for a smooth migration and must first be rewritten into SQL that is compatible with Spark SQL syntax. We are improving the migration tool to extract the key error information of failed tasks, match automatic root-cause diagnosis tags, give optimization suggestions, and even rewrite the SQL automatically, in order to speed up the migration.

  • Engine optimization

At the Spark engine level, there are still some remaining issues that need to be followed up and optimized:

  • Increased storage usage: the Repartition introduced by the small-file optimization scatters the data, which reduces the compression ratio of the output written by some tasks. We plan to investigate the Z-order optimization provided by the community to optimize the data layout automatically.

  • DPP makes SQL planning too slow: during the migration we found that dynamic partition pruning (DPP) can make SQL planning of some multi-table joins very slow. For now we work around this by limiting the number of joins that DPP optimizes. Spark 3.2 and later versions speed up SQL planning and there are related patches, which we plan to analyze and apply to our current version.

  • Improve key task metrics: on the platform side we already collect some Spark SQL execution metrics, such as input and output file sizes and file counts and the running time of each Spark SQL stage, so that problematic tasks and the effects of optimizations can be seen at a glance. Going forward we will add more metrics, such as shuffle data volume, data skew, and data expansion, and explore more optimization methods to improve Spark SQL's computing efficiency.

  • Simulation test engine

In scenarios such as service version upgrades, SQL engine parameter tuning, and cluster migration, business data often needs to be re-run and tested to guarantee the accuracy and consistency of data processing. The traditional re-run testing approach relies on business staff to design and execute the tests manually, which is inefficient.

Pilot's dual-run simulation tool addresses these pain points. We plan to offer it as an independent service and evolve it into a more general simulation test engine that helps users quickly build dual-run tasks and perform automated result comparison.


Origin: blog.csdn.net/g6U8W7p06dCO99fQ3/article/details/131148490