Application Practice of Apache Kyuubi in Bilibili Big Data Scenario

01 Background

In recent years, with the rapid growth of Bilibili's business, data volume has kept increasing. Our offline computing cluster has grown from the initial two hundred machines to nearly ten thousand today, and from a single data center to a multi-data-center architecture. For offline computing we mainly use Spark, Presto, and Hive; the architecture diagram is shown below. Our BI, ad-hoc, and DQC services all go through unified SQL scheduling via a self-developed routing service called Dispatcher, which dynamically selects the currently best engine for each query based on the SQL's syntax features, the amount of HDFS data read, and the current engine load. If a user's SQL fails on one engine, Dispatcher automatically falls back to another, lowering the barrier for users. For Spark queries we initially used Spark Thrift Server (STS), but STS itself has many performance and usability problems, so we introduced Kyuubi. Through Kyuubi's multi-tenancy, multi-engine proxying, and full compatibility with the Hive Thrift protocol, we achieved resource isolation and permission verification for ad-hoc tasks across departments.

Query distribution

Currently, SparkSQL accounts for nearly half of ad-hoc query scenarios. Thanks to Kyuubi's support for Scala syntax, some advanced users submit Scala statements for execution and can switch freely between SQL and Scala modes, which greatly enriches the ad-hoc usage scenarios.

02 Kyuubi application

Kyuubi is an open source project contributed to the Apache community by NetEase Shufan's big data team. It is mainly used in big data scenarios, including offline computing, ad-hoc queries, BI, and other directions. Kyuubi is a distributed, multi-tenant, JDBC/ODBC-compatible big data processing service.

It provides SQL query services on top of currently popular computing engines such as Spark, Presto, and Flink.

Reasons why we choose Kyuubi:

1. It is fully compatible with the Hive Thrift protocol and fits Bilibili's existing technology choices.

2. High availability and resource isolation are essential for large-scale production environments.

3. It is flexible and extensible; further adaptive development can be done on top of Kyuubi.

4. It supports multi-engine proxying, laying the foundation for a future unified computing portal.

5. The implementation is high quality and the community is active.

Kyuubi's architecture can be divided into three parts:

1. Client: users submit jobs and fetch results using the JDBC or RESTful protocol.

2. Kyuubi Server: receives, manages, and schedules the Kyuubi Sessions established with clients; each Kyuubi Session is eventually routed to an actual engine for execution.

3. Kyuubi Engine: accepts and processes the tasks sent by the Kyuubi Server; different engines have different implementations.

03 Improvements based on Kyuubi

Kyuubi has been running stably in Bilibili's production environment for more than a year, and currently all ad-hoc queries reach the big data computing engines through Kyuubi. Over this year we went through two major version upgrades, from Kyuubi 1.3 to 1.4, and then from 1.4 to 1.6. Compared with the earlier STS, Kyuubi performs better in both stability and query performance. During this evolution, we partially adapted Kyuubi based on Bilibili's business and Kyuubi's functional characteristics.

3.1 Add QUEUE mode

Kyuubi Engine natively provides the CONNECTION, USER, GROUP, and SERVER isolation levels. At Bilibili, big data computing resource capacity is divided by department, and different departments correspond to different queues on Yarn. We modified the GROUP mode to achieve resource isolation and permission control at the queue level.

The mapping between users and queues is configured and managed centrally by the upper-level tool platform. Kyuubi only needs to take the user and queue information submitted by the upstream Dispatcher and dispatch the query to the Spark engine of the corresponding queue. Currently we have 20+ ad-hoc queues, each corresponding to one or more engine instances (an engine pool).
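As a rough illustration of this routing, the sketch below simulates QUEUE-mode dispatch: a platform-managed user-to-queue map, and a stable pick from the queue's engine pool. All names and the hash-based pick are hypothetical; the real mapping lives in the tool platform and the real scheduling in Kyuubi server.

```python
import hashlib

# Hypothetical user -> queue mapping, normally managed by the tool platform.
USER_QUEUE_MAP = {"alice": "queue_ai", "bob": "queue_ads"}

# Each queue corresponds to one or more engine instances (an engine pool).
ENGINE_POOLS = {
    "queue_ai": ["engine-ai-0", "engine-ai-1"],
    "queue_ads": ["engine-ads-0"],
}

def route(user: str, session_id: str) -> str:
    """Route a session to an engine inside the user's queue pool."""
    queue = USER_QUEUE_MAP[user]
    pool = ENGINE_POOLS[queue]
    # A stable hash keeps the same session on the same engine across retries.
    idx = int(hashlib.md5(session_id.encode()).hexdigest(), 16) % len(pool)
    return pool[idx]
```

A queue with a single engine always routes to that engine; pools with several instances spread sessions across them.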

3.2 Support multi-tenancy in QUEUE mode

The Kyuubi server is started by the superuser hive. In the Spark scenario, the driver and executors share the same user name; when different users submit different SQL, the driver and executors cannot tell who submitted the current task. This creates problems in data security, resource application, and access control.

To address this problem, we made improvements in the following areas.

3.2.1 Kyuubi Server side

1. The Kyuubi server starts with the hive principal.

2. Dispatcher submits SQL with the user name set as proxyUser.

3.2.2 Spark Engine side

1. The driver and executors are started as hive.

2. The driver submits SQL with the user name set as proxyUser.

3. When an executor starts a task thread, it must execute the task as proxyUser.

4. We must also ensure that the UGI information bound to all shared thread pools is correct. Take the ORC split thread pool as an example: when the number of ORC files reaches a certain amount, a thread pool is used for split calculation. This pool is shared globally, and the UGI of the user who first triggers it is bound permanently, which would otherwise mix up the UGI information of different users.
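The fix amounts to binding the effective user per call rather than per pool. A minimal sketch, simulating the idea with Python's contextvars instead of Hadoop's UGI.doAs (the names `run_as` and `whoami` are illustrative):

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Simulates binding the proxy user per task instead of once per shared pool;
# in Spark this corresponds to wrapping each task in proxyUser's UGI.doAs.
_current_user = contextvars.ContextVar("proxy_user", default="hive")

def run_as(user, fn, *args):
    """Execute fn with the proxy user bound for this call only."""
    token = _current_user.set(user)
    try:
        return fn(*args)
    finally:
        _current_user.reset(token)

def whoami():
    return _current_user.get()
```

Because the binding is scoped to the call, a globally shared pool never keeps the first caller's identity: each submitted task sees its own proxy user, and threads fall back to the superuser default afterwards.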

3.3 kyuubi engine UI display function

In daily use, we found the Kyuubi 1.3 Engine UI page not friendly enough: SQL from different users cannot be distinguished, and sessions, jobs, stages, and tasks cannot be associated with one another. This makes it difficult to troubleshoot and locate user problems. Referring to STS, we extended the Kyuubi Engine UI page in the following ways.

1. We customized a Kyuubi listener to monitor Spark job, stage, and task events as well as SparkSQL-related events: SessionCreate, SessionClose, executionStart, executionRunning, executionEnd, etc.

2. When the Engine executes SQL-related operations, it binds and emits the related SQL events, builds SQL status events from them, and analyzes, aggregates, and stores the collected events.

3. We customized a Kyuubi page to display Session and SQL status in real time.
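The three steps above can be sketched as a small event aggregator. This is a simplified stand-in, not the real listener: the event shapes are assumed from the event names the text lists, and the real implementation hooks into Spark's listener bus in Scala.

```python
from collections import defaultdict

# Minimal sketch of the custom listener: it receives the SessionCreate /
# executionStart / executionEnd events named above and keeps per-session
# SQL status for display on the custom Kyuubi UI page.
class EngineEventListener:
    def __init__(self):
        self.sessions = {}                      # session_id -> session info
        self.statements = defaultdict(dict)     # session_id -> {stmt_id: state}

    def on_event(self, event):
        kind = event["type"]
        if kind == "SessionCreate":
            self.sessions[event["session_id"]] = {"user": event["user"], "open": True}
        elif kind == "SessionClose":
            self.sessions[event["session_id"]]["open"] = False
        elif kind == "executionStart":
            self.statements[event["session_id"]][event["statement_id"]] = "RUNNING"
        elif kind == "executionEnd":
            self.statements[event["session_id"]][event["statement_id"]] = event["state"]
```

With the user carried on every session event, the UI can finally associate each SQL statement with the session (and hence the user) that issued it.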

Session Statistics information display

SQL Statistics information display

3.4 Kyuubi supports loading Engine parameters from the configuration center

Different queues have different computing resource requirements: queues with heavy workloads need more computing resources (memory, cores), while lightly used queues need fewer. To handle these differences, we moved the Engine-related resource parameters into the configuration center for unified management. Before an Engine starts for the first time, it queries the parameters of its own queue and adds them to the startup command, overriding the defaults.
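The override logic is just "defaults, then queue-specific values on top". A minimal sketch, with hypothetical config-center contents (the keys are ordinary Spark settings; the store itself would be an external service in practice):

```python
# Hypothetical per-queue engine parameters as they might come from the
# configuration center.
CONFIG_CENTER = {
    "queue_ai": {"spark.executor.memory": "16g", "spark.executor.cores": "4"},
}

DEFAULTS = {"spark.executor.memory": "8g", "spark.executor.cores": "2"}

def engine_start_conf(queue):
    """Queue-specific values from the config center override the defaults."""
    conf = dict(DEFAULTS)
    conf.update(CONFIG_CENTER.get(queue, {}))
    return conf
```

A queue with no entry in the config center simply starts with the defaults, so onboarding a new queue needs no code change.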

3.5 Engine task progress display and resource consumption reporting

During task execution, users care most about the progress and health of their tasks, while the platform cares more about the computing resource cost of the tasks. On the Engine side, we collect user, session, job, and stage information from events and store it. A scheduled task then associates this information, calculates the resource consumption cost, injects the results into the corresponding operation log, and returns them to the front end for display.

Task progress information display

Query resource consumption report display

04 Kyuubi stability construction

4.1 Spilling large result sets to disk

In ad-hoc scenarios, users often pull large result sets to the driver. Many users pulling result sets at the same time consumes a lot of memory, leading to memory shortage in the Spark engine and degraded driver performance, which directly affects user queries. We therefore optimized the driver's fetch-result path: while results are being fetched, the driver's memory usage is monitored in real time, and once it exceeds a threshold, fetched results are written directly to a local disk file. When the user requests the results, they are read back from the file and returned in batches, improving the driver's stability.
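A minimal sketch of that spill-then-batch pattern, using a row count as a stand-in for the real driver memory check (class and file layout are illustrative, not Kyuubi's actual implementation):

```python
import json, os, tempfile

class ResultBuffer:
    """Keeps fetched rows in memory until a threshold, then spills to disk."""
    def __init__(self, mem_limit_rows):
        self.mem_limit = mem_limit_rows      # stand-in for a driver memory check
        self.rows, self.spill_path = [], None

    def add(self, row):
        # First time over the limit: flush the in-memory rows to a spill file.
        if self.spill_path is None and len(self.rows) >= self.mem_limit:
            fd, self.spill_path = tempfile.mkstemp(suffix=".rows")
            with os.fdopen(fd, "w") as f:
                for r in self.rows:
                    f.write(json.dumps(r) + "\n")
            self.rows = []
        if self.spill_path:
            with open(self.spill_path, "a") as f:
                f.write(json.dumps(row) + "\n")
        else:
            self.rows.append(row)

    def fetch(self, batch_size):
        """Yield results back in batches, reading from the spill file if any."""
        if self.spill_path:
            batch = []
            with open(self.spill_path) as f:
                for line in f:
                    batch.append(json.loads(line))
                    if len(batch) == batch_size:
                        yield batch
                        batch = []
            if batch:
                yield batch
        else:
            for i in range(0, len(self.rows), batch_size):
                yield self.rows[i:i + batch_size]
```

The key property is that the driver's peak memory is bounded by the threshold plus one batch, no matter how large the result set grows.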

4.2 Limits on per-SQL task concurrency, execution time, and task count

In production we often see a single large job occupy all of an Engine's computing resources, leaving short jobs pending for a long time with no resources at all. To solve this problem, we made the following optimizations.

  • Task concurrency: by default, as long as resources are available during task scheduling, all of them are allocated, and subsequent SQL may then find no resources at all. We limit the number of tasks a single SQL can schedule; the exact limit is adjusted dynamically according to the amount of available resources.
  • Execution time of a single SQL: both the upper-layer Dispatcher and the lower-layer Engine enforce timeouts, and an ad-hoc task is killed if it runs for more than 1 hour.
  • Task count of a single stage: we also limit the number of tasks in a single stage to at most 300,000.
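The three limits above can be sketched as simple admission checks. The 1-hour and 300,000-task numbers come from the text; the per-SQL concurrency formula (a fair share with a quarter-of-cluster floor) is an assumed policy, not Bilibili's exact one:

```python
def max_concurrent_tasks(available_slots, running_sqls):
    """Dynamically cap the tasks one SQL may schedule, leaving room
    for other SQL. The 1/4 floor is an assumed policy."""
    share = available_slots // max(running_sqls, 1)
    return max(share, available_slots // 4, 1)

MAX_STAGE_TASKS = 300_000          # single-stage task ceiling from the text
ADHOC_TIMEOUT_SECONDS = 3600       # ad-hoc tasks are killed after 1 hour

def admit_stage(num_tasks):
    return num_tasks <= MAX_STAGE_TASKS
```

With such a cap, one SQL can still use the whole Engine when it runs alone, but can never starve concurrent SQL of scheduling slots.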

4.3 Limits on the number and size of files in a single table scan

To ensure Kyuubi's stability, we restrict SQL that scans too much data, via a custom external optimizer rule (TableScanLimit). TableScanLimit matches LocalLimit and collects the child Project and Filter nodes, then matches the leaf nodes HiveTableRelation and HadoopFsRelation, that is, the logical relations of Hive tables and DataSource tables, and applies a different calculation method to each kind of table.

1. HiveTableRelation:

  • For a non-partitioned table, get the table's totalSize, numFiles, and numRows values from the table meta.
  • For a partitioned table, determine whether any partition predicates are pushed down. If so, take totalSize, numFiles, and numRows of the matching partitions; if not, take the values of the whole table.

2. HadoopFsRelation: determine whether the partitionFilter contains a dynamic filter

  • If not, the partitions to scan are obtained directly from the partitionFilter.
  • If so, the partitions matched by the partitionFilter are filtered further to obtain the final partitions to scan.

After obtaining the dataSize, numFiles, and numRows of the query, we also derive the columns that will actually be scanned, based on the table's storage type, the types of the fields, whether there is a limit, and the pushed-down Project and Filter, and then estimate the required table scan size. If the estimated scan size exceeds the configured threshold, the query is rejected and the user is told why.
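A simplified sketch of that estimate-and-reject flow. The column-pruning proportionality and the 10 TiB threshold are assumptions for illustration; the real rule also weighs the storage format and per-field types, as described above:

```python
def estimate_scan_bytes(total_size, num_rows, selected_cols, total_cols, limit=None):
    """Rough scan estimate: scale table size by the fraction of columns
    actually read, then by the fraction of rows a LIMIT would keep."""
    size = total_size * selected_cols / max(total_cols, 1)
    if limit is not None and num_rows > 0:
        size = size * min(limit, num_rows) / num_rows
    return int(size)

SCAN_LIMIT_BYTES = 10 * 1024**4     # assumed 10 TiB threshold

def check_table_scan(total_size, num_rows, selected_cols, total_cols, limit=None):
    est = estimate_scan_bytes(total_size, num_rows, selected_cols, total_cols, limit)
    if est > SCAN_LIMIT_BYTES:
        raise RuntimeError(
            f"query rejected: estimated scan {est} bytes exceeds {SCAN_LIMIT_BYTES}")
    return est
```

Raising an explicit error with the estimate in the message is what lets the platform tell the user why the query was refused.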

4.4 Dangerous join condition discovery & join expansion rate limit

4.4.1 Dangerous join condition discovery

To ensure Kyuubi's stability, we also restrict SQL that hurts Engine performance. When writing SQL, users may not understand the underlying implementation of Spark's joins, which can make a query run very slowly or even OOM. If we can surface the join conditions that may slow the engine down, users can improve their queries and locate problems more easily, and we can even reject such dangerous submissions outright.

For an equi-join, Spark selects the join strategy in the order BHJ, SHJ, SMJ. If none of these is selected and the join type is InnerType, Cartesian Join is used; a Cartesian Join produces a Cartesian product and is relatively slow. If the join type is not InnerType, BNLJ is used; since BHJ was already ruled out, the broadcast side exceeds the broadcast threshold, so broadcasting it may pressure the driver's memory, perform poorly, or even OOM. We therefore define these two cases as dangerous joins.

For a non-equi-join, only BNLJ or Cartesian Join can be used. BNLJ is tried first; if no build side can be chosen, both tables exceed the broadcast threshold, and Cartesian Join is used. If the join type is not InnerType, only BNLJ can be used. So a join is dangerous when the strategy selects Cartesian Join, or when BNLJ is selected on this second pass.

4.4.2 Limitation of Join Expansion Rate

The statusScheduler in shareState collects Execution status and metrics; the metrics are reported per node by each task. We start a join-detection thread that periodically monitors the "number of output rows" of each Join node and of its two parent nodes, and computes the Join node's expansion rate from them.

Expansion detection of Join nodes:
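The check the detection thread performs boils down to one ratio. A minimal sketch, with an assumed threshold value (the text does not state the production number):

```python
def join_expansion_rate(join_output_rows, left_output_rows, right_output_rows):
    """Output rows of the Join node divided by the rows flowing in from
    its two parent nodes; a high ratio flags an exploding join."""
    inputs = left_output_rows + right_output_rows
    return join_output_rows / max(inputs, 1)

EXPANSION_THRESHOLD = 100.0        # assumed threshold, for illustration

def should_flag(join_output_rows, left_rows, right_rows):
    return join_expansion_rate(join_output_rows, left_rows, right_rows) > EXPANSION_THRESHOLD
```

Because the metrics update while the query runs, the thread can flag a runaway join long before it finishes materializing its output.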

05 New Kyuubi application scenarios

5.1 Connection mode and Scala mode for large queries

5.1.1 Use of connection mode

Large ad-hoc tasks and complex SQL can degrade a Kyuubi engine's performance for some time, seriously affecting the execution of other normal ad-hoc tasks. We therefore added a large-query mode on the ad-hoc front end that lets such complex, high-volume queries use Kyuubi's CONNECTION mode. In CONNECTION mode, a user task exclusively enjoys the resources it applies for, with its own independent driver; the task's size is determined by its own SQL characteristics and does not affect other users' SQL. At the same time, we appropriately relax the limiting factors described earlier.

Usage scenarios of CONNECTION mode at Bilibili:

  1. Table scan estimation determines that the ad-hoc task is large and its execution time will exceed 1 hour.
  2. Complex SQL tasks whose Cartesian product or join expansion exceeds the threshold.
  3. A single stage of a single SQL has more than 300,000 tasks.
  4. The user explicitly chooses CONNECTION mode.

5.1.2 Use of scala mode

SQL mode can solve 80% of big data business problems, and SQL plus Scala programming can solve 99%. SQL is a very user-friendly language: users can perform complex data processing without knowing Spark's internals. But it also has certain limitations.

SQL mode is not flexible enough to process data through the Dataset and RDD APIs, and it cannot handle more complex business logic, especially requirements that are not pure data processing. On the other hand, to run Scala code a user normally has to package the project and submit it to the computing cluster; if the code has an error, it must be repackaged and re-uploaded again and again, which is very time-consuming.

Scala mode lets users submit code directly, similar to the Spark interactive shell, simplifying the process. Combining the advantages of SQL mode and Scala mode in mixed programming covers most cases in data analysis scenarios.

5.2 Presto on spark

To ensure cluster stability, Presto limits the maximum memory of each query, and queries that exceed the configured memory are killed by Presto with an OOM error. As the business grows and data volume increases, some ETL tasks consume more and more memory, and once the threshold is exceeded the query fails.

To solve this problem, the prestodb community developed the Presto on Spark project, which handles the scaling problem of memory-heavy queries by submitting them to Spark. However, the community solution is not very friendly to existing queries: users submit via presto-cli, pyhive, and the like, but to use Presto on Spark the query must be submitted to Yarn through spark-submit.

To let users run Presto on Spark queries without noticing any difference, we made some modifications to the Presto gateway, and developed a Presto-Spark Engine in Kyuubi with the help of Kyuubi's RESTful interface and its server + engine scheduling capability. This engine submits queries to Yarn in a user-friendly way.

The main implementation details are as follows:

1. The Presto gateway saves the execution history of each query, including its resource usage and error information.

2. The Presto gateway asks the HBO service whether the current query should be submitted through Presto on Spark.

3. The Presto gateway obtains the list of available Kyuubi servers through ZooKeeper, randomly selects one, and opens a session to Kyuubi over HTTP.

4. The Presto gateway submits the statement using the obtained sessionHandle.

5. After the Kyuubi server receives the query, it starts an independent Presto-Spark Engine, builds the startup command, and runs spark-submit to submit it to Yarn.

6. The Presto gateway keeps polling the operation status over HTTP using the returned OperationHandle.

7. If the job succeeds, the result is obtained via a fetch-result request and returned to the client.
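Steps 3 through 7 form a simple open/submit/poll/fetch loop on the gateway side. The sketch below uses a fake in-memory client as a stand-in for Kyuubi's RESTful interface (all method names and payload shapes are hypothetical):

```python
# Stand-in for the Kyuubi RESTful client; the real gateway talks HTTP.
class FakeKyuubiClient:
    def open_session(self, user):
        return "session-1"

    def execute_statement(self, session, sql):
        # Simulates a statement whose state advances on each poll.
        return {"handle": "op-1", "states": iter(["PENDING", "RUNNING", "FINISHED"])}

    def operation_state(self, op):
        return next(op["states"])

    def fetch_result(self, op):
        return [("row1",), ("row2",)]

def run_presto_on_spark(client, user, sql):
    session = client.open_session(user)          # step 3: open a session
    op = client.execute_statement(session, sql)  # steps 4/5: submit; server spark-submits
    while True:                                  # step 6: poll operation state
        state = client.operation_state(op)
        if state in ("FINISHED", "ERROR"):
            break
    if state != "FINISHED":
        raise RuntimeError("query failed")
    return client.fetch_result(op)               # step 7: fetch results
```

Swapping the fake client for an HTTP client against a real Kyuubi server keeps the gateway-side control flow unchanged.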

06 Kyuubi deployment

6.1 Kyuubi server on K8S and Engine on Yarn label practice

Problems encountered in production practice:

1. Currently, the Kyuubi server/engine is deployed on a mixed cluster with a complex environment where the components' environments depend on each other. During releases, environment inconsistencies and misoperations inevitably occur, causing service errors.

2. Resource management issues. Initially the engine used client mode, and the engine drivers of different queues each used a large amount of memory, ranging from 50g to 100g. Meanwhile, AM, NM, DN, and the Kyuubi server all shared the resources of the same physical machine; when too many AMs started, they occupied the whole machine's resources, leaving insufficient memory to start an engine.

To solve this, we developed a resource allocation and scheduling implementation based on the QUEUE mode: each Kyuubi server and Spark engine records its current resource usage on a znode. Each Kyuubi server znode records: the number of Spark engines currently registered with it, the registered engine instances, the server's total memory, the server's remaining memory, and so on.

Each engine znode records: the IP/port of the Kyuubi server it belongs to, the engine's memory, the queue it belongs to, and so on. Every time a Spark engine starts or exits, it acquires the directory lock of its queue and then updates the resources of the Kyuubi server it belongs to. If a Kyuubi server goes down, on restart it traverses all engine znodes and quickly restores its resources and state.
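The bookkeeping described above can be sketched with an in-memory registry, using a per-queue lock in place of the ZooKeeper directory lock (class and field names are illustrative; the placement policy of "most remaining memory" is an assumption):

```python
import threading

# In-memory stand-in for the per-queue znode bookkeeping: a lock per queue
# guards the resource update when an engine starts or exits.
class KyuubiServerRegistry:
    def __init__(self):
        self.servers = {}        # name -> {"total": GB, "free": GB, "engines": {}}
        self.queue_locks = {}

    def add_server(self, name, total_gb):
        self.servers[name] = {"total": total_gb, "free": total_gb, "engines": {}}

    def _lock(self, queue):
        return self.queue_locks.setdefault(queue, threading.Lock())

    def start_engine(self, queue, engine, mem_gb):
        with self._lock(queue):
            # Assumed policy: place on the server with the most free memory.
            name = max(self.servers, key=lambda s: self.servers[s]["free"])
            srv = self.servers[name]
            if srv["free"] < mem_gb:
                raise RuntimeError("no server has enough free memory")
            srv["free"] -= mem_gb
            srv["engines"][engine] = {"queue": queue, "mem": mem_gb}
            return name

    def stop_engine(self, queue, engine):
        with self._lock(queue):
            for srv in self.servers.values():
                if engine in srv["engines"]:
                    srv["free"] += srv["engines"].pop(engine)["mem"]
                    return
```

Recovery after a server crash is then just replaying the engine records back into this structure, which is exactly what the znode traversal on restart achieves.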

3. The resource management functions also have problems: resource fragmentation, unfriendly extension of new features, and high maintenance costs. With client mode, an AM with too much memory occupies too many computing resources on the client machine, limiting the engine's horizontal scalability.

We addressed the above problems as follows:

1. Connect kyuubi server to k8s

We designated a batch of machines as the Kyuubi server resource pool scheduled on K8S, isolating the Kyuubi server's environment and resources. This enables rapid deployment of Kyuubi servers, improves their horizontal scalability, and reduces operation and maintenance costs.

2. Engine on yarn label

We handed Kyuubi engine resource management over to Yarn, which is responsible for engine allocation and scheduling, and adopted cluster mode so that the engine is not resource-limited when scaling horizontally. After adopting cluster mode, we hit a new problem: in QUEUE mode, an engine driver uses 50g to 100g of memory, but due to the Yarn cluster configuration, the maximum container resource that can be applied for is <28G, 10vCore>. To ensure the driver can obtain enough resources in cluster mode, we modified Yarn to fit this scenario, splitting the requirements into the following three items:

  • Place the Kyuubi driver in an independent Node Label whose servers are used exclusively by Kyuubi drivers;
  • Keep the Kyuubi executors in the adhoc leaf queue of each corresponding queue in the Default Label to process ad-hoc tasks;
  • Since the resources requested by the driver exceed MaxAllocation (<28G, 10vCore> as above), allow the queue-level MaxAllocation to be set dynamically per Node Label so the Kyuubi driver can obtain a large amount of resources.

First, we created kyuubi_label on Yarn and set up a kyuubi queue within the label, mapped against the Default Label, so that all drivers are submitted uniformly to the kyuubi queue. We use "spark.yarn.am.nodeLabelExpression=kyuubi_label" to direct the driver to kyuubi_label, and an empty "spark.yarn.executor.nodeLabelExpression=" to direct the executors to the Default Label.
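The label configuration above can be sketched as assembling the engine's spark-submit command; the queue name and memory figure below are illustrative, and only the two nodeLabelExpression settings come from the text:

```python
# Sketch: building the label-related Spark conf for a cluster-mode engine launch.
def build_engine_command(queue, driver_mem="50g"):
    conf = {
        "spark.yarn.am.nodeLabelExpression": "kyuubi_label",  # driver -> kyuubi_label
        "spark.yarn.executor.nodeLabelExpression": "",        # executors -> Default Label
        "spark.yarn.queue": queue,
        "spark.driver.memory": driver_mem,
    }
    cmd = ["spark-submit", "--deploy-mode", "cluster"]
    for k, v in conf.items():
        cmd += ["--conf", f"{k}={v}"]
    return cmd
```

Keeping the executor expression empty is what splits one application across labels: the AM container lands on the labeled nodes while executor containers stay in the default partition.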

Second, we moved Yarn's maximum-resource control from the original cluster level down to the queue + label level. By adjusting the conf of "queue name + kyuubi_label", we raised the maximum container resources for the driver to <200G, 72vCore>, while keeping the maximum for other containers at <28G, 10vCore>. Requesting a 50G driver in the default cluster still fails with an error prompt:

However, the same request runs successfully under the corresponding queue in kyuubi_label, so we not only use Yarn's resource management and control capabilities but also guarantee the amount of resources the Kyuubi driver obtains.

07 Future Planning

1. Connect small ETL tasks to Kyuubi to reduce the resource application time of ETL tasks

2. Make the Kyuubi engines (Spark and Flink) cloud native and connect them to unified K8S scheduling

3. Integrate Spark jar tasks into Kyuubi as well

That is all we have to share today. If you have any thoughts or questions, feel free to interact with us in the comments. If you like this content, please give us a thumbs up!



Origin my.oschina.net/u/4565392/blog/5586530