Linkis: Data middleware that connects upper-level applications and underlying computing engines
Linkis is a data middleware open-sourced by WeBank. It solves the problems of connecting, accessing, and reusing between the various upper-layer tools and applications and the various underlying computing and storage engines.
1. Introduction
Linkis, a data middleware that connects multiple computing and storage engines such as Spark, TiSpark, Hive, Python and HBase, provides a unified REST / WebSocket / JDBC interface to the outside world, and submits and executes SQL, Pyspark, HiveQL, Scala and other scripts.
Based on a microservice architecture, Linkis provides enterprise-level features such as financial-grade multi-tenant isolation, resource management and control, and permission isolation. It supports unified management of variables, UDFs, functions, and user resource files, and offers high-concurrency, high-performance, highly available full-lifecycle management for big data jobs and requests.
2. Background
The wide application of big data technology has spawned an endless stream of upper-level applications and lower-level computing engines.
It is a common practice for almost all enterprises at this stage to implement business needs by introducing multiple open source components, and to continuously update and enrich the big data platform architecture.
As shown in the figure, as more and more upper-level applications, tool systems, and underlying computing and storage components are added, the entire data platform becomes a mesh structure.
With the continuous introduction of new components to meet business needs, more and more pain points are also generated:
- Business requirements vary widely and each upper-level component has its own characteristics; users feel a strong sense of fragmentation when switching between them, and the learning cost is high.
- Data comes in many types, and storage and computation are complex; a component generally solves only one problem, so developers must master a complete technology stack.
- Newly introduced components are often incompatible with the existing data platform's multi-tenant isolation, user resource management, and user permission management; top-down customized development is not only a huge undertaking but also reinvents the wheel.
- Upper-layer applications connect directly to the underlying computing and storage engines, so any change in the underlying environment directly affects the normal use of business products.
3. Original design intention
How can we provide a unified data middleware that connects to upper-layer application tools and shields the underlying invocation and usage details, so that business users only need to focus on their business logic, unaffected even if the underlying platform's data center is expanded or relocated entirely? This is what Linkis was designed for from the ground up.
4. Technical Architecture
As shown in the figure above, based on Spring Cloud microservice technology, we built multiple microservice clusters to form Linkis' middleware capabilities.
Each microservice cluster undertakes part of the system's functional responsibilities, which we have divided as follows:
- Unified Job Execution Service (UJES) : A distributed REST / WebSocket service for receiving various access requests submitted by the upper system.
  - Currently supported computing engines: Spark, Python, TiSpark, Hive and Shell, etc.
  - Currently supported scripting languages: SparkSQL, Spark Scala, Pyspark, R, Python, HQL and Shell, etc.
- Resource Management Service (RM) : Supports real-time management and control of the resource usage of each system and user, limits the resource usage and concurrency of the system and users, and provides real-time resource dynamic charts to facilitate viewing and management of system and user resources.
  - Currently supported resource types: Yarn queue resources, servers (CPU and memory), concurrent users, etc.
- Unified Storage Service (Storage) : A universal IO architecture that can quickly connect to various storage systems, provides a unified entry point, supports all common data formats, is highly integrated, and is easy to use.
- Unified Context Service (CS) : Unified management of user and system resource files (user scripts, JAR, ZIP, Properties, etc.) and of the parameters and variables of users, systems, and computing engines; set once, referenced automatically everywhere.
- Material library service (BML) : System- and user-level material management that can be shared and transferred, with automatic full-lifecycle management.
- Metadata service (Database) : Real-time display of Hive database table structures and partition status.
Through the cooperation of these microservice clusters, we have improved the way and process by which the entire big data platform serves the outside world.
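To make the Resource Management Service's role concrete, here is a minimal Java sketch of per-user concurrency limiting. The class and method names are hypothetical, for illustration only; the real RM also manages Yarn queue and server (CPU and memory) resources.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical sketch of per-user concurrency control, not Linkis' actual RM code.
public class ResourceManagerSketch {
    private final int maxConcurrentPerUser;
    private final Map<String, Semaphore> userSlots = new ConcurrentHashMap<>();

    public ResourceManagerSketch(int maxConcurrentPerUser) {
        this.maxConcurrentPerUser = maxConcurrentPerUser;
    }

    /** Try to lock one execution slot for the user; returns false if the limit is hit. */
    public boolean tryAcquire(String user) {
        return userSlots
            .computeIfAbsent(user, u -> new Semaphore(maxConcurrentPerUser))
            .tryAcquire();
    }

    /** Release a previously locked slot. */
    public void release(String user) {
        Semaphore s = userSlots.get(user);
        if (s != null) s.release();
    }
}
```

A request that cannot acquire a slot would be rejected or queued, which is how a per-user concurrency cap keeps one user from monopolizing the cluster.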
5. Business Architecture
- Gateway : Based on Spring Cloud Gateway with enhanced plugin capabilities, it adds support for one front-end Client connecting to many (N) back-end WebSocket microservices, and is mainly used for parsing user requests and routing and forwarding them to the specified microservices.
- Unified entrance : The unified entrance (Entrance) is the job lifecycle manager for a given type of engine job submitted by users. Entrance manages a job's entire life cycle, from receiving the job, submitting it to the execution engine, and feeding execution information back to the user, to completing the job.
- Engine manager : The engine manager is responsible for the engine's full life cycle: applying for and locking resources from the Resource Management Service, instantiating new engines, and monitoring engine liveness.
- Execution engine : The execution engine is the microservice that actually executes user jobs, and it is started by the engine manager. To improve interaction performance, the execution engine interacts directly with Entrance, pushing execution logs, progress, status, and result sets to Entrance in real time.
- Resource management service : Controls in real time the resource usage of each system and each user, manages the engine managers' resource usage and actual load, and limits the resource usage and concurrency of systems and users.
- Eureka : Eureka is a service discovery framework developed by Netflix; Spring Cloud integrates it into its sub-project spring-cloud-netflix to implement Spring Cloud's service discovery. Each microservice has a built-in Eureka Client that can access the Eureka Server and obtain service discovery capabilities in real time.
6. Process flow
How does Linkis handle a SparkSQL job submitted by an upper-layer system?
1. A user of an upper-layer system submits a SQL script. It first passes through the Gateway, which parses the request and routes and forwards it to the appropriate Entrance.
2. Entrance first checks whether an available Spark engine service already exists for that system's user; if so, it submits the request directly to that Spark engine service.
3. If no Spark engine service is available, Entrance obtains the list of all engine managers through Eureka's service registration and discovery, and queries RM for each engine manager's real-time actual load.
4. Entrance picks the engine manager with the lowest load and asks it to start a new Spark engine service.
5. On receiving the request, the engine manager asks RM whether a new engine may be started for that system's user.
6. If it may, the engine manager requests and locks the resources; otherwise, it returns a startup-failure exception to Entrance.
7. Once the resources are locked, the new Spark engine service is started; after a successful start, the new Spark engine is returned to Entrance.
8. With the new engine in hand, Entrance submits the SQL to it.
9. The new Spark engine receives the SQL request, submits it to Yarn for execution, and pushes logs, progress, and status to Entrance in real time.
10. Entrance pushes the logs, progress, and status it receives to the Gateway in real time, and the Gateway pushes them on to the front end.
11. Once the SQL executes successfully, the engine actively pushes the result set to Entrance, and Entrance notifies the front end to fetch the result.
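The engine-selection step above (picking the least-loaded engine manager from the Eureka list, using loads reported by RM) can be sketched as follows. All class and field names here are illustrative assumptions, not Linkis' actual API.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative sketch of Entrance's engine-manager selection step.
public class EngineSelection {
    public static class EngineManagerInfo {
        public final String address;
        public final double load; // real load as reported by RM
        public EngineManagerInfo(String address, double load) {
            this.address = address;
            this.load = load;
        }
    }

    /** Choose the engine manager with the lowest reported load, if any are registered. */
    public static Optional<EngineManagerInfo> pickLowestLoad(List<EngineManagerInfo> managers) {
        return managers.stream().min(Comparator.comparingDouble(m -> m.load));
    }
}
```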
How to ensure high real-time performance
As we all know, Spring Cloud integrates Feign as the communication tool between microservices.
With Feign-based HTTP calls between microservices, an instance of microservice A can only access a randomly chosen instance of microservice B according to simple rules.
However, a Linkis execution engine can directly and proactively push logs, progress, and status to the Entrance that requested it. How does Linkis do this?
Linkis implements its own underlying RPC communication scheme on top of Feign.
As shown in the figure above, we encapsulated Sender and Receiver on top of Feign.
Sender, as the sending end, is directly available: the user can specify a particular microservice instance to access, access one at random, or broadcast.
Receiver, as the receiving end, requires the user to implement the Receiver interface to handle the actual business logic.
Sender provides three access methods, as follows:

- The ask method is a synchronous request-response method, requiring the receiver to return a response synchronously.
- The send method is a synchronous request method; it is only responsible for sending the request to the receiver synchronously and does not require a reply.
- The deliver method is an asynchronous request method; as long as the sender's process does not exit abnormally, the request will later be sent to the receiver by another thread.
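A minimal in-memory sketch of the Sender/Receiver shapes described above, assuming hypothetical interface and method names; the real Linkis RPC layer is built on Feign and additionally handles remote transport, broadcasting, and instance selection.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical shapes for the Sender/Receiver abstraction described above.
public class RpcSketch {
    /** Receiver: application code implements this to handle incoming messages. */
    public interface Receiver {
        Object receiveAndReply(Object message); // used by ask
        void receive(Object message);           // used by send / deliver
    }

    /** Sender: the three access styles, over a direct in-memory call for illustration. */
    public static class Sender {
        private final Receiver target;
        public Sender(Receiver target) { this.target = target; }

        /** Synchronous request-response: blocks until the receiver replies. */
        public Object ask(Object message) { return target.receiveAndReply(message); }

        /** Synchronous one-way send: no reply expected. */
        public void send(Object message) { target.receive(message); }

        /** Asynchronous one-way send: delivered later on another thread. */
        public CompletableFuture<Void> deliver(Object message) {
            return CompletableFuture.runAsync(() -> target.receive(message));
        }
    }
}
```

In this framing, the execution engine holds a Sender pointing back at the Entrance that asked it to run the job, which is how logs and progress can be pushed proactively rather than polled.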
7. How to support high concurrency
Linkis designed 5 large asynchronous message queues and thread pools. Each job occupies a thread for less than 1 millisecond, ensuring that each unified entrance can sustain more than 10000 TPS of resident job requests.
- How to improve the request throughput of the upper layer?
  - Entrance's WebSocket processor has a built-in processing thread pool and processing queue to receive upper-layer requests forwarded by the Spring Cloud Gateway route.
- How to ensure that execution requests of different users in different systems are isolated from each other?
  - In Entrance's job scheduling pool, each user of each system has a dedicated thread, ensuring isolation.
- How to ensure efficient job execution?
  - The job execution pool is only used to submit jobs. Once a job is submitted to the Engine side, it is immediately placed in the job execution queue, ensuring that no job occupies an execution-pool thread for more than 1 millisecond.
  - The RPC request receiving pool receives and processes the logs, progress, status, and result sets pushed from the Engine side, and updates the job's information in real time.
- How to push a job's logs, progress, and status to the upper system in real time?
  - The WebSocket sending pool is dedicated to processing jobs' logs, progress, and status, and pushes this information to the upper system.
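The per-system, per-user isolation idea can be sketched with one bounded queue per group, so that one user's burst of jobs cannot starve another user's. The names and capacity here are illustrative assumptions, not Linkis' actual implementation.

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of per-(system, user) isolation: each group gets its own bounded job queue.
public class GroupedJobQueues {
    private final int capacity;
    private final Map<String, BlockingQueue<Runnable>> queues = new ConcurrentHashMap<>();

    public GroupedJobQueues(int capacityPerGroup) { this.capacity = capacityPerGroup; }

    private String groupKey(String system, String user) { return system + "/" + user; }

    /** Enqueue a job into its group's queue; returns false if the group is saturated. */
    public boolean offer(String system, String user, Runnable job) {
        return queues
            .computeIfAbsent(groupKey(system, user), k -> new ArrayBlockingQueue<>(capacity))
            .offer(job);
    }

    /** Current backlog for one group; other groups are unaffected by it. */
    public int depth(String system, String user) {
        BlockingQueue<Runnable> q = queues.get(groupKey(system, user));
        return q == null ? 0 : q.size();
    }
}
```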
8. User-level isolation and scheduling timeliness
To realize this high-concurrency capability, Linkis designed the Scheduler module: a group-based scheduling and consumption module that can intelligently monitor and scale itself.
Each user of each system will be grouped separately to ensure system-level and user-level isolation.
Each consumer has an independent monitoring thread to count indicators such as the length of the waiting queue in the consumer, the number of events being executed, and the growth rate of execution time.
The grouping object corresponding to a consumer sets thresholds and alarm ratios for these indicators. Once an indicator exceeds its threshold, or the ratio between several indicators exceeds the allowed range (for example, if the average execution time is observed to be greater than the distribution-interval parameter, the threshold is considered exceeded), the monitoring thread immediately scales the consumer up accordingly.
When scaling up, it makes full use of the parameter-adjustment process above, increasing one parameter in a targeted manner while the other parameters scale automatically in step.
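The monitoring rule described above (compare an indicator against its threshold and expand the consumer when it is crossed) can be sketched as follows. The threshold value and the doubling growth factor are illustrative assumptions, not Linkis' actual parameters.

```java
// Sketch of one consumer's monitoring-and-expansion loop body.
public class ConsumerMonitorSketch {
    private int capacity;
    private final int queueThreshold;

    public ConsumerMonitorSketch(int initialCapacity, int queueThreshold) {
        this.capacity = initialCapacity;
        this.queueThreshold = queueThreshold;
    }

    /** One monitoring tick: expand the consumer when the metric crosses its threshold. */
    public void check(int currentQueueDepth) {
        if (currentQueueDepth > queueThreshold) {
            capacity *= 2; // expand; related parameters would scale accordingly
        }
    }

    public int capacity() { return capacity; }
}
```

In the real module this check would run on the consumer's dedicated monitoring thread, fed by the queue-length, in-flight-count, and execution-time statistics the article lists.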
9. Summary
As a data middleware, Linkis has made many attempts and efforts to shield the details of lower-level calls.
For example: How does Linkis implement unified storage services? How does Linkis unify UDF, functions and user variables?
Due to limited space, this article will not discuss it in detail. If you are interested, you are welcome to visit the official website: https://linkis.apache.org
Is there a data middleware that is truly open-source based, has been refined through financial-grade production environments and scenarios and then given back to the open-source community, so that people can use it in production with relative confidence, support financial-grade business, and enjoy enterprise-level feature guarantees?
We hope Linkis is the answer.