GF Securities builds an "efficient and controllable" big data enablement layer on Apache Kyuubi

In November 2023, GF Securities became one of the first securities firms to be assessed at the Quantitative Management level (Level 4) of the DCMM (Data Management Capability Maturity Model). Today, tens of thousands of Kyuubi jobs form a core part of GF Securities' comprehensive data governance and key data systems.

The author, Liang Bowen, is a big data platform architect at GF Securities who took part in the incubation of the Apache Kyuubi project and became a PMC member. This article introduces GF Securities' big data platform, starting from its digital-intelligence middle-platform strategy and the challenges of the agile data era. Around the four transformation goals summarized as "efficient and controllable", it presents the architectural analysis, implementation strategy, full-link efficiency gains, and integrated permission control ideas behind building a big data enablement layer on Apache Kyuubi. As both a user of and contributor to Apache Kyuubi, we also look ahead to its future in the industry and the community.

This article mainly covers the following content:

1. Goal analysis and architectural benefits of the big data enablement layer and Apache Kyuubi

2. Implementation strategy and scenario construction of the Apache Kyuubi big data enablement layer

3. Overall approach to integrated data permission control

4. The Kyuubi Authz plug-in's access control capabilities for Spark and new features when integrating with Ranger

01 Goals and architecture of the big data enablement layer and Apache Kyuubi

For the GF Securities big data platform, we start from its mission and the expectations placed on the architecture, track the big data ecosystem accordingly, and from there determine the direction of the data enablement transformation together with the goals, benefits, and decision points of the big data enablement layer.

GF Securities has been building its big data platform system since 2014 and is one of the first large securities firms to practice and evolve an integrated big data platform. As an important part of the "digital middle platform", the company-level integrated data middle platform must, within a unified architecture, face strong regulation and strong permission control while emphasizing stable yet flexible data forms and responsiveness to scenario-level data needs. On the architecture side, we follow the principle of "evaluate actively, adopt prudently, evolve continuously", combining in-house development, integration, and adoption to build an integrated data service system aligned with our own data strategy. Concretely, on the premise of a unified core base, we want to keep the initiative and the ability to integrate, from the architecture level down to the source code level, in the selection and adaptation of key engines, key services, and key links. While staying closely engaged with the big data open-source ecosystem, we continue to strengthen support for the data forms and scenarios we face. This is a mission with ongoing challenges.

While absorbing the industry's technological progress, we also contribute back to the big data ecosystem based on our own needs. We have one PMC member on the Apache Kyuubi project and rank among the top four code contributors, and we have participated in other leading big data projects, including the Apache Spark engine, the Iceberg data lake format, and OLAP data warehouses. We have also made core contributions to the Spark ecosystem, in particular Spark permission control compatible with multiple data lake formats such as Iceberg, and took part in the adaptation work for Spark 3.1 through 3.5 and Scala 2.13.

This diagram is for schematic purposes only and does not serve as a representation of the specific big data platform architecture.

Before considering the direction of the data enablement transformation, let's first review the current state and potential bottlenecks of the overall big data platform.

The core of the unified big data platform base consists of the Hadoop stack and the Hive Metastore service. HDFS and YARN provide reliable distributed storage and computing and are the foundation of the whole ecosystem, while Hive Metastore acts as the unified warehouse standard, serving warehouse metadata to the various engines and underpinning all data scenarios. On top of this base we introduced three engines over time — Hive MR, Trino, and Spark — each playing a different role. Hive MR provides the final distributed execution path for queries and processing, with HiveSQL as the logical interface. To meet the speed requirements of query scenarios, Trino/PrestoSQL was introduced as the OLAP engine for ad-hoc and BI queries; because of its potential single-point architectural limitations and past calculation-correctness incidents, it is used for queries only, and its independent syntax means Trino queries become an asset separate from HiveSQL. Spark was also introduced in some scenarios as an ETL engine compatible with HiveSQL, but SparkSQL lacked a suitable service form and could not be used at scale in a HiveServer-like manner.

The existing architecture creates several bottlenecks for data enablement:

1. The engine interfaces for query and processing are inconsistent. Because Trino delivers a large performance gain during data exploration, data developers usually explore with Trino first; however, Trino syntax is incompatible with HiveSQL, so the logic has to be rewritten and re-verified before it can be assembled into data processing jobs. The round trip is time-consuming and laborious.

2. Access parties have to deal with too many component details. When connecting to the big data platform, developing a big data job requires attention to many factors, from technology stack versions to configuration details: the Hive Metastore service, the engine (Spark, etc.), storage formats, table formats, and so on. This creates a string of development, debugging, and operation costs, greatly reduces productivity, and prevents developers from focusing on business data logic.

3. The big data platform cannot evolve its service components as a whole. When jobs pin down specific component combinations and details, the platform is reduced to a resource substrate: it cannot uniformly shield and evolve engine versions, introduce new coordination services, or control underlying resources according to the latest evaluations and business needs.

4. Permission control is tied to specific channels and cannot be opened up as a general service; it is limited to particular application-layer services. Strong regulation in the financial industry requires allow-list permission checks, SQL rewriting for filtering confidential content, centralized auditing, and similar controls. Implementing these in application layers such as data portals can satisfy the requirements, but it significantly limits the possibility of connecting to the big data platform in general scenarios.

Architecturally, we need a unified big data enablement layer that solves these problems while meeting the consistency requirements of the financial industry.

To position the big data enablement layer, besides reviewing the current state and the burden of historical evolution, we also need to understand the overall challenges from a longer-term perspective so that our direction and method of evolution stay sound.

At the macro level, the integrated data middle platform brings challenges across every stage of the data life cycle, all of which the big data enablement layer must meet in a unified way:

1. In — data ingestion: ingest heterogeneous data from multiple sources into the data lake, including streaming and batch, structured and unstructured sources, which requires multi-catalog capabilities.

2. Out — data output: push effective data assets out to scenarios to form concrete data support capabilities, adapting to different schemas, unifying first and transforming afterwards.

3. With — collaborative computing: the logical data lake approach brings different catalogs and heterogeneous data sources into on-the-spot computation, so computed access replaces traditional data copying; this requires read and write capabilities over heterogeneous sources and a unified logical view.

4. Over — complex correlation: serve ad-hoc queries and data exploration with multi-level, complex join capabilities, make full use of CBO and RBO to maximize query response over massive data, and output results directly to BI, ad-hoc, and other scenarios.

5. Of — data governance: comprehensive data governance places new demands on metadata and lineage; the data middle platform and big data platform must proactively perceive data changes and fine-grained relationships across the data life cycle, especially in data processing, and feed governance results back into the whole life cycle.

6. By — systematic processing: layered modeling, scheduling dependencies, high-performance ETL, and the challenges of distributed computing power and storage.

At the micro level, the ways data is processed and delivered have also changed dramatically. The unified base must meet the new requirements brought by metric/tag-oriented delivery, hybrid data forms, and elastic scenarios:

1. Metric- and tag-oriented delivery (indicatorization/labeling): data is no longer delivered as a single traditional wide table but as metrics, often in a logical form that combines complex joins and is assembled dynamically at runtime. This requires row filtering, data control, and execution plan optimization, support for both multi-level narrow tables and complex wide tables, and the concurrency to run distributed, parallel processing over the same source data from multiple points.

2. Hybrid form: an integrated solution for the challenges of multiple storage architectures, including logical data lakes and heterogeneous storage, with multi-catalog collaborative computing covering warehouse tables, streaming tables, dimension tables, and the other logical and physical resource entities that take part in computation.

3. Elastic scenarios: face elastic scenarios with elastic resource strategies, serving both bursty, complex ad-hoc queries and large-scale read/write processing. Resources must be scheduled dynamically during execution, expanding and reclaiming runtime resources in coordination with the base, and execution plans must be fully optimized by continuously applying the industry's CBO and RBO capabilities with all the information available at runtime.

Taken together, from both the macro and the micro perspective, the traditional 4V characteristics of big data — Volume, Variety, Velocity, and Value — no longer fully capture the current challenges.

"Agile data" (AgileData) is GF Securities' big data platform's understanding and abstraction of this new data era. Agile data calls for building the data middle platform in a data-enabling way: with higher data activity and more effective data maturity, the data itself can be exposed to and embedded into scenarios, supporting data scenarios of all shapes and sizes, while internally a mix of data forms answers the computing and storage challenges. The four characteristics of agile data can be summarized as "HEAD": Hybrid (hybrid data forms), Enabling (effective data maturity), Active (ready data activity), and Dynamic (elastic computing power).

1. Hybrid — hybrid data forms: provide a unified access form upwards while comprehensively using multiple data forms downwards, including lakehouse integration, unified streaming and batch, and heterogeneous multi-catalog access. Data and engines are examined and managed as general requirements rather than chasing the characteristics of any single engine.

2. Enabling — effective data maturity: drive data scenarios with mature, effective data that has passed through sound data governance and data control and reached reliable quality against data specifications. The data platform itself must enable the elements these processes require; more importantly, through continuous iteration of data maturity, the platform and the data business reinforce each other.

3. Active — ready data activity: data products, metrics and tags, flexible logical/physical input and output — instant, ready, and effective.

4. Dynamic — elastic computing power: a fine-grained, dynamically scalable computing platform with a globally integrated adaptive data engine, large-scale computing power, and multi-tenancy.

Combining the analysis above with the data middle-platform strategy, the overall data platform can transform into an agile data platform and export the capabilities of the big data platform outwards in the form of active enablement.

In the past, the big data platform itself was the bottleneck of data enablement, and upstream and downstream parties found it increasingly unresponsive to enablement demands. Facing diverse data scenarios, data developers lacked a systematic view of platform capabilities; they could only raise vertical, link-by-link demands on the platform and then satisfy them with scattered technical shortcuts. Operations and maintenance, responsible for runtime resource control and system management, raised its own demands on resources, permissions, availability, and stability. The big data platform struggled to cut into the data life cycle and could only satisfy these demands at the lowest cost while giving up the possibility and path of evolution.

After the transformation, the big data enablement layer itself provides low-cost unified access, a unified high-performance elastic compute engine, unified data warehouse management, and unified resource control, while at the bottom layer it aligns metadata, rules, and standards bidirectionally with data governance and data control as an internal means of improving maturity. Data developers can then invest their standardized skills on demand into all kinds of data application scenarios, from processing jobs that need massive computing power to data presentation and metric delivery with complex logical joins and tight latency requirements. Operations demands and controls are implemented directly in the enablement layer, enabling fine-grained control over permissions, storage, and computation for data development. The platform architecture itself can focus on absorbing more technologies suited to agile data scenarios and unifying them into the data middle platform.

At this point, the four goals of the big data enablement layer take shape; together they spell out "efficient and controllable":

  • Controllable: controllability is the lifeline of financial scenarios, covering data permissions, resource quotas, auditing, and monitoring. Fine-grained data control capabilities make refined data governance possible, are compatible with control over heterogeneous multi-domain data sources, and are the precondition for opening data up in a controlled way. On the resource side, the platform must distinguish the resource and permission differences of every end user on every data line and finely control runtime tuning parameters, while providing integrated audit collection and better capabilities for locating service operations and queries.
  • Sustainable: sustainable iteration, sustainable evolution, and availability. Under a unified computing access interface, the base keeps upgrading engine versions and the overall big data stack, stays compatible with heterogeneous data sources and storage details, and makes full use of industry progress in CBO, RBO, and related optimizations so that existing data and existing jobs keep getting faster. Sustainable evolution means the ability to extend with more big data technologies and key life-cycle coordination services and to integrate standardized data and services with more flexible capabilities. Availability means further consolidating high availability at every level, including the engine layer and execution layer, on top of the existing distributed storage and compute, giving data scenarios reliable operation and guaranteed computing power.
  • Efficient: improve efficiency across the whole life cycle — beforehand (access preparation), during execution (job efficiency, permission checks and row filtering applied in place), and afterwards (auditing, lineage collection, supplementary metadata). Improve execution efficiency so that the same data and logic continuously benefit from execution plan optimization. Improve the productivity of data development: with SQL as the abstraction, developers focus on data logic for query and processing, drastically reducing environment requirements and preparation.
  • Low-friction: take existing data assets and jobs as the baseline, minimize destructive changes to them, and at the same time greatly improve data responsiveness and efficiency. Unify a low-cost SQL access method and programming interface, shield the base infrastructure details from connecting jobs, provide lightweight access with minimal language environment requirements, and unify the access method across usage scenarios so that refined tuning benefits every scenario.

With "efficient and controllable" as the overall goal of the big data enablement layer, Apache Kyuubi entered our architecture evaluation. Apache Kyuubi is often first adopted as an optional SparkSQL service gateway; Spark and Spark SQL, as mature, stable, and continuously evolving compute engines, perform and evolve well across different data volumes. But Apache Kyuubi's overall positioning as a distributed, multi-tenant, serverless lakehouse data gateway gives it a much larger potential mission.

First, it focuses on serving SQL (though not only SQL), letting access parties submit interactive, abstract logic rather than concrete jobs at low cost, which matches the needs of evolution and efficiency. Second, its core architecture separates the server from the engine, strengthening the multi-tenant model so that sessions can be isolated effectively and fine-grained resource and configuration control becomes possible. Third, its lakehouse-compatible positioning means engine features and cross-domain underlying storage forms can be combined flexibly according to actual scenarios, accommodating continuously evolving base combinations such as warehouse tables, object storage, data lake tables, and unified streaming/batch tables. Fourth, it reaches deep into the submission and execution of data logic, adapting to each step of computation to exploit engine characteristics, and through a flexible combination of plug-ins provides the hooks and on-site support needed for data permissions, data lineage, monitoring, and management.

GF Securities has been tracking the Apache Kyuubi project since 2021 as part of the overall base upgrade design, with formal evaluation starting from version 1.2. Many of Kyuubi's capabilities stem from its core architecture: the session–namespace–server–engine multi-layer framework has remained stable since version 1.0 and keeps being enriched. On this foundation Kyuubi offers a unified access method and entry point, multi-tenancy and session isolation, engine adaptation that follows Spark's evolution, fine-grained resource control, and multiple sharing/isolation modes — all of which match the target positioning and construction direction of the data middle platform's enablement layer.

We build the big data enablement layer in the data middle platform to provide services and capabilities upstream and downstream in a turnkey manner, so that the platform and the data business succeed together through a well-matched architecture. Taking Kyuubi as one of the unified architecture choices for the enablement layer, we map our expectations and goals onto it as follows:

1. Unified access: standardized use of the JDBC driver with the Hive Thrift protocol allows low-cost access from multiple language environments. The namespace selects the appropriate service, and service discovery on the JDBC driver provides high availability without extra work from the user (a connection sketch follows this list).

2. Unified SQL capabilities: SparkSQL is exposed externally and is compatible with HiveSQL syntax, giving existing data jobs and assets a smooth transition path, while additional syntax combined with the Iceberg data lake format provides key extended processing such as UPDATE/DELETE/MERGE. The underlying layer centrally optimizes read/write configuration and feature switches, such as enabling AQE, according to scenario needs.

3. Shielding underlying details: (1) transparent compatibility with Iceberg tables, covering CRUD operations; (2) base engine versions introduced and evolved according to platform needs — currently the latest stable Spark 3.3 release, with the option of also managing Flink and Trino engines.

4. Permission control: full coverage of authentication and authorization. End users are authenticated at the server layer when they connect, with a flexible combination of JDBC, token verification, and other methods. Authorization reaches into the execution engine: Spark is wired to Ranger so that fine-grained database/table/column permissions, row filtering, column masking, and other rules are perceived and enforced directly in the execution plan.

5. Full life-cycle support: fine-grained compute resource control and unified parameter tuning, with per-user job resource caps adjustable at any time. Column-level lineage can be extracted from the engine's execution plans and aligned with comprehensive data governance. The data life cycle covers the components each access scenario needs, and key behaviors such as small-file write optimization and read volume limits are controlled centrally.
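
As a minimal illustration of points 1 and 2, the sketch below connects to Kyuubi over the HiveServer2 Thrift protocol with PyHive and submits both an ordinary HiveSQL query and an Iceberg MERGE INTO through the same entrance. The host, port, credentials, and table names are assumptions for illustration, not our production settings.

```python
# Minimal access sketch (illustrative host/credentials/tables; assumes a reachable Kyuubi
# server and an Iceberg-enabled Spark engine behind it).
from pyhive import hive

conn = hive.Connection(
    host="kyuubi.example.com",  # Kyuubi frontend speaking the HiveServer2 Thrift protocol
    port=10009,                 # Kyuubi's default binary Thrift port
    username="data_dev",
    password="******",
    auth="LDAP",                # password-based authentication over the Hive Thrift protocol
)
cur = conn.cursor()

# Existing HiveSQL runs unchanged on the Spark engine behind Kyuubi.
cur.execute("SELECT trade_date, count(*) FROM dw.ods_trade GROUP BY trade_date")
print(cur.fetchall())

# Iceberg tables additionally gain UPDATE/DELETE/MERGE through the same entrance.
cur.execute("""
    MERGE INTO dw.dim_customer t
    USING dw.stg_customer s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

cur.close()
conn.close()

# On the JVM side the same access is a single JDBC URL; with an HA setup it can point at
# ZooKeeper for service discovery, e.g.
# jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=kyuubi
```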

With Apache Kyuubi as the unified SQL service gateway, we could finally embrace a series of key Spark 3 features we had long watched, enable and tune them for users by default in a standardized way, and let every connected data scenario benefit, concentrating core capabilities to meet the requirements of the big data enablement layer. These include the following (a configuration sketch follows the list):

  • Adaptive Query Execution (AQE):
    • AQE has been one of the most anticipated capabilities since Spark 2.4 and is the headline feature of Spark 3. By using Spark 3.3 we benefit from it being enabled by default (since Spark 3.2).
    • Data skew and imbalance are the norm, and AQE significantly improves performance in exactly those situations by deciding partition counts dynamically during execution and re-balancing processing against the actual data distribution.
    • This lets query and processing work truly focus on business logic and cuts the time traditionally spent on task-level tuning.
  • Whole-stage code generation (WholeStageCodegen):
    • In code-generation mode, the operators of a stage are fused into generated runtime code, so the same data assets and the same query logic keep benefiting from engine optimization, improving performance and shortening execution time.
  • Dynamic Resource Allocation (DRA):
    • During execution, executor instances are requested and released dynamically within a controllable range, maximizing the use of distributed resources. It covers both the ad-hoc scenario, where a resident service needs elastic resources on demand, and data processing, where runtime resources should be released dynamically — significantly improving overall availability and resource controllability.
  • DRA with shuffle tracking (no dependency on ESS):
    • Shuffle tracking avoids the External Shuffle Service (ESS). For a long time DRA required ESS for executor reallocation, which meant every node in every cluster had to run an independent ESS and expose communication ports, greatly increasing operational cost and constraining version evolution.
    • Using shuffle tracking uniformly requires no separately deployed service and follows the job and engine versions dynamically, significantly reducing footprint and operational cost. It is provided by Spark 3.x as an experimental capability.
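
As an illustration, the settings below are the standard Spark 3.x configuration keys behind these features; the values are assumptions, not our production tuning. In a Kyuubi deployment they would typically live in the centrally managed defaults so that every connected session benefits.

```python
# Illustrative Spark 3.x defaults for the features above (values are assumptions).
# Typically placed in centrally managed spark-defaults.conf / kyuubi-defaults.conf;
# shown here as a Python dict so it could also be passed per session, e.g. via PyHive's
# `configuration=` argument.
spark_feature_defaults = {
    # Adaptive Query Execution (enabled by default since Spark 3.2)
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # merge small shuffle partitions at runtime
    "spark.sql.adaptive.skewJoin.enabled": "true",            # split skewed join partitions automatically

    # Whole-stage code generation (enabled by default)
    "spark.sql.codegen.wholeStage": "true",

    # Dynamic Resource Allocation without an External Shuffle Service
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "50",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}
```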

Another important reason for choosing Kyuubi as the big data enablement layer is that, behind a single facade — services exposed through the server — it keeps a flexible combination of diverse frontend protocols and backend engines. A suitable frontend protocol can be chosen per integration need, while the possibility of adding engines of other types and designs is preserved.

Currently, Apache Kyuubi has successfully implemented this architectural idea, specifically:

Frontend protocol support includes the Hive Thrift protocol, a REST HTTP interface, the MySQL protocol (experimental), and FlightSQL (prototype). On the backend, it deeply supports Spark 3.0+ (each component is compatible with all mainline versions from 3.0 to 3.3), Flink, Trino, Doris (since Kyuubi 1.6), and more, and has begun to provide deep connectors for other underlying lakehouse storage under Spark and other engines, such as the Spark DataSource V2 Hive connector.

02 Implementation strategy and scenario construction of the Kyuubi big data enablement layer

Having positioned the big data enablement layer and analyzed Apache Kyuubi as one of its core services, we turned to how to land it effectively and smoothly. The evolution strategy and scenario construction are the key to adding Apache Kyuubi to the big data platform: adoption must be progressive, avoid destructive changes, never affect existing data and logic, and still transition smoothly into the key channel for core query and processing traffic.

We therefore advanced the adoption of Apache Kyuubi in four stages, following the principles of easy before hard, read before write, and closed before open, with clear goals and verification points at each stage.

1. Smooth incubation: ad-hoc, read-only queries. Kyuubi was first introduced as a query engine in a controlled application-layer area, using SparkSQL as a read-only engine for data exploration, while we focused on verifying the feasibility of multi-tenant use in the globally shared SERVER mode and the effectiveness of Spark dynamic resource allocation (DRA).

2. Pilot construction: read-only batch jobs to test the water and verify system integration, for example data quality batch runs. This tested the ability to take existing HiveSQL processing jobs into Kyuubi and parse and execute them on SparkSQL's compatibility layer.

3. Mature rollout: extension to large-scale data processing jobs that write data. All intended base combinations were enabled, including the full-stack baseline of Kyuubi, Spark 3.3, and Iceberg, and fine-grained resource isolation and configuration control at the USER and CONNECTION levels were verified in processing scenarios.

4. Controlled openness: with permission control wired in automatically as the precondition, Kyuubi is opened up in a controlled way for data processing and data exploration. Ranger data permission rules are applied — database, table, column, and row-filter permissions — for flexible data development scenarios, while data processing and data exploration are isolated from each other with different resources and control strategies.

 

 

During architecture research and rollout, we realized that unifying the big data enablement layer on a single choice of Apache Kyuubi does not mean exposing one identical service through one port; instead, combinations should be chosen flexibly by scenario to balance resources and isolation. Each sharing mode should be used to its strength — reducing start/stop overhead, integrating the Kyuubi UI for locating problems on the engine, and fully meeting resource isolation and configuration isolation needs. In particular, we divide Apache Kyuubi's sharing modes according to actual needs, for example (a configuration sketch follows):

  • Ad-hoc queries, BI tools, and similar scenarios: consider SERVER mode to avoid the repeated round trips of engine job submission — packaging and upload, start and stop, resource allocation — while preserving end-user identity authentication and data authorization in this mode.
  • Data processing, ETL, data collection, and inference scenarios: consider dedicating engine instances per user or per session, using USER mode and CONNECTION mode respectively.
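
A minimal sketch of how this split could be expressed, assuming sessions are opened through PyHive and that the sharing mode is settable per session via `kyuubi.engine.share.level`; the names and numbers are illustrative.

```python
# Per-scenario session settings (illustrative values); kyuubi.engine.share.level selects
# the Kyuubi sharing mode discussed above.
adhoc_conf = {
    "kyuubi.engine.share.level": "SERVER",      # one long-lived shared engine: no per-query start/stop
    "spark.dynamicAllocation.maxExecutors": "30",
}

etl_conf = {
    "kyuubi.engine.share.level": "CONNECTION",  # dedicated engine per connection: strong isolation for batch jobs
    "spark.executor.memory": "8g",
    "spark.dynamicAllocation.maxExecutors": "100",
}

# e.g. hive.Connection(host="kyuubi.example.com", port=10009,
#                      username="etl_user", configuration=etl_conf)
```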

In data processing scenarios, the big data enablement layer first changed how data processing and development connect to platform capabilities through Kyuubi. Data development productivity can be expected to improve by more than 100%: developing a new class of data drops from several weeks to a few days, with preparation possible within a day at the fastest and full-link development and verification within a week. Concretely, developing a processing job no longer requires detailed knowledge of languages, engines, versions, feature switches, or key tuning; developers focus on the core logic, describe the query and processing intent abstractly in SQL, and submit it to Kyuubi through the JDBC driver. On the JVM, JDBC is available out of the box — referencing the driver library is all the preparation needed — and other languages such as Python can also connect to the driver in a variety of ways.

In ad-hoc and self-service query scenarios for data exploration, even as a non-default engine, about 20% of HiveSQL queries now actively use Kyuubi (SparkSQL). Concretely, at zero additional usage cost, experience and assets at several levels carry over smoothly: existing productized HiveSQL queries, Hive UDFs, developers' HiveSQL skills and experience, and the original Hive warehouse assets are all reused. Data exploration and data development can continue in the same HiveSQL syntax, avoiding the time-consuming manual back-and-forth conversion between Trino and Hive. For common external data processing and data retrieval scenarios, we keep Kyuubi as the unified read/write entrance to the data warehouse; here PySpark/Spark is used as the link client as an example (a code sketch follows the list):

  • On the access side, PySpark is complemented by the Spark Hive Dialect plug-in, which resolves the mismatch between SparkSQL's JDBC column-identifier quoting and HiveSQL's style and fixes JDBC-to-RDD conversion issues.
  • SQL for data preparation and processing is submitted to Kyuubi over JDBC and handled uniformly inside the warehouse.
  • Data extraction over JDBC has to address performance and the accumulation of intermediate results: Spark pushes as much of the operation as possible down to the Spark engine behind Kyuubi, and the result set is fetched sequentially in micro-batches through the ResultSet. To keep the Spark driver under Kyuubi from accumulating the full result, the incremental collection switch is turned on so that result partitions are fetched batch by batch and the pressure is spread out.
  • All data connections run under user authentication and data permission checks; data permissions follow the authorization rules perceived from the Ranger service and stay aligned with comprehensive data governance and data control.
  • As a client, PySpark/Spark retains its distributed execution capability and can run on a self-built standalone Spark cluster or, under controlled conditions, share YARN.
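
The sketch below shows this link-client pattern, assuming the Kyuubi Hive JDBC driver (and the Spark Hive dialect extension mentioned above) are available on the PySpark driver's classpath; the URL, user, and query are illustrative.

```python
# PySpark as the link client: push the query down to the Spark engine behind Kyuubi over
# JDBC and fetch only the result set (illustrative URL/user/query).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kyuubi-jdbc-extract").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("driver", "org.apache.kyuubi.jdbc.KyuubiHiveDriver")
    .option("url", "jdbc:hive2://kyuubi.example.com:10009/dw")
    .option("user", "etl_user")
    .option("password", "******")
    # The heavy lifting runs in the warehouse; only the aggregated rows travel back.
    .option("query", "SELECT customer_id, SUM(amount) AS amt FROM dw.dwd_trade GROUP BY customer_id")
    .load()
)
df.show()

# Without the Hive dialect extension, Spark's generic JDBC dialect quotes identifiers with
# double quotes, which HiveSQL/SparkSQL reads as string literals -- the column-naming
# mismatch mentioned above. On the server side, kyuubi.operation.incremental.collect=true
# keeps the engine from accumulating the whole result on its driver.
```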

The PySpark access documentation, PyHive access documentation, the Spark Hive Dialect plug-in, and the other pieces involved above have all been contributed to the community.

Overall, the big data enablement layer built on Apache Kyuubi has significantly improved development efficiency, execution efficiency, and the level of management and control:

Execution efficiency up by 50%: compared with the Hive MR engine in typical data processing scenarios, end-to-end execution time is 30–50% shorter, and some scenarios reach 60–70%, fully reflecting the engine's execution plan optimization.

Development efficiency up by 100%: productivity in developing new classes of data improved from weekly to daily granularity, significantly shortening the data scenario exploration cycle and the time to wire up data logic.

Operations and management capability up by 100%: control and adaptability improved — what you see is what you manage. From job resources to data content permissions to configuration tuning, control connects directly to the data scenario, enabling fine-grained full life-cycle management.

03 Overall approach to integrated data permission control

Among the goals of the "efficient and controllable" big data enablement layer, controllability is the real priority. In the rollout, controlled openness is the final step, and it involves additional elements and components that must be further aligned with the data life cycle and turned into an implementation strategy through integrated data permission control.

In the financial industry, whether for regulatory reasons or business isolation, data permission control is a precondition for data enablement and data openness. The data middle-platform strategy therefore requires integrated permission control to be in place:

1. Treating data permissions as a cross-cutting concern, the same policy must apply to every engine in the enablement layer, such as Spark and Trino.

2. Data must be authorizable at different dimensions and granularities, with database/table/column-level permissions in place.

3. Refined control requirements must be handled effectively, including in-table row filtering and column masking.

4. The accumulated results of data governance, such as classification and grading definitions and data standards, must feed fully into data control.

5. Different scenarios need differentiated permission strategies: on the one hand, the unified warehouse is accessed with different permissions from different perspectives in different scenarios and controlled through different processes; on the other hand, separating rules by purpose reduces the number of rules in any single scenario and improves runtime rule-matching efficiency.

Accordingly, the big data enablement layer must answer the needs that integrated data permission control brings:

1. Rely on a unified big data permission service, such as Ranger, to manage permission rules for different scenarios centrally.

2. Management, control, and auditing are carried out centrally in the integrated permission control service.

3. While keeping access low-cost, users' identities are authenticated effectively according to the scenario and method.

4. Each engine in the enablement layer uses a dedicated plug-in that connects to the data permission rules, applies them during authorization and in execution plans, and intercepts unauthorized operations.

5. The enablement layer must support column-level lineage extraction, metadata awareness, and so on, providing change-awareness information about warehouse metadata for comprehensive data governance.

6. Integrated data governance in turn synchronizes governance rules and specifications, such as classification and grading, into the integrated permission control service.

In today's big data ecosystem, Apache Ranger is arguably the only big data permission control service still under active development, especially since the Apache Sentry project stopped being maintained after 2018. Ranger provides integrations with many engines but does not ship support for the Spark engine or a corresponding rule definition system. The Apache Kyuubi Authz plug-in is currently the only way to apply Apache Ranger access control policies in the Spark ecosystem; its rule definitions follow Ranger's Hive rule style and fully support Ranger's key features, covering database/table/column permissions, row filtering, column masking, and other control rules. GF Securities takes part in the continuous evolution of the Kyuubi Authz plug-in:

● First adaptation to the steadily maturing DataSource V2 API in Spark 3.x, covering 20+ commands and their execution plans while remaining compatible with the detailed differences across Spark 3.0–3.3.

● First support for Iceberg's own execution plans, covering MERGE INTO, UPDATE and other commands rewritten into Iceberg-specific V2 plans by the Iceberg plug-in.

● A permission-generic DataSource V2 adaptation mode, making it easy to cover more data lake plug-in commands and to keep evolving with engine version changes.

● Other authorization handling related to temporary/permanent views and temporary/permanent functions.

The key to integrated data permission control has always been "one warehouse, one permission system": the former requires unified base and warehouse metadata, the latter requires that data permission rules take effect in the different engines of the enablement layer with their differences smoothed out.

The same data permission policy takes effect in every engine, and we satisfy this with a combination of means. First, HiveServer2 remains an available engine, using the Hive plug-in shipped with Ranger and intervening on permissions through HiveAuthorizer. Second, Kyuubi connects the Spark engine through the Kyuubi Authz plug-in. Third, the Trino engine is retained to keep serving existing Trino-syntax queries; on top of Ranger's Trino plug-in we did secondary development so that specific catalogs are governed by the same permission policies used for Hive and Spark. For the streaming storage part of the unified warehouse, we adapted Ranger's Kafka plug-in to meet SASL/SCRAM connection and authentication requirements.

With this, the integrated permission control system comes together.

04 The Kyuubi Authz authorization plug-in for Spark and new features with Ranger 2.3

 

As mentioned above, the Ranger service itself does not provide a Spark plug-in. Kent Yao had earlier provided a Spark-to-Ranger security plug-in in Apache Submarine, which was refactored and moved into the Kyuubi project after Apache Kyuubi was open-sourced. Since version 1.6.0, Apache Kyuubi ships the Authz plug-in — as noted above, currently the only way to apply Apache Ranger access control policies in the Spark ecosystem, with rule definitions in Ranger's Hive style and full support for database/table/column permissions, row filtering, and column masking.

The above picture is for illustration only and is not used as a reference for specific code structure.

The overall mechanism of Apache Kyuubi's Authz plug-in is as follows (a configuration sketch follows the list):

1. The Authz plug-in is enabled through Spark's SQL extension mechanism, injecting the access control optimizer rules, including RuleAuthorization and others.

2. SparkSQL commands and their execution plans are mapped and their resources extracted; the PrivilegeBuilder parses V1, V2 and other commands and constructs the corresponding permission requests.

3. The Authz plug-in embeds RangerBasePlugin, the core Ranger client component. Thanks to this, access control rules, policies, and user information are pulled periodically from the Ranger Admin REST interface, loaded into memory for later use, and cached locally to keep the service running if Ranger Admin is unavailable.

4. Rule matching and authorization are performed on the entity resources involved: AccessRequest checks whether access is permitted, row-level filter rules are applied, and data masking is applied at the column level.
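
As a minimal sketch, enabling the plug-in on a Spark engine comes down to one extension setting, assuming the kyuubi-spark-authz jar and the ranger-spark-security.xml / ranger-spark-audit.xml client configs are already on the engine's classpath; in a Kyuubi deployment this normally goes into the engine's defaults rather than user code.

```python
# Sketch: enabling the Kyuubi Authz plug-in on a Spark engine (assumes the authz jar and
# Ranger client config files are already on the classpath).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("authz-enabled-engine")
    # Injects RuleAuthorization plus the row-filter / data-masking rules into the optimizer.
    .config("spark.sql.extensions",
            "org.apache.kyuubi.plugin.spark.authz.ranger.RangerSparkExtension")
    .getOrCreate()
)

# Every query is now checked against the Ranger policies pulled by RangerBasePlugin;
# an unauthorized table access is rejected before it reaches the data.
spark.sql("SELECT * FROM dw.sensitive_table LIMIT 10").show()
```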

Specifically, a Ranger policy implements user- and role-based (RBAC) permission control through three core elements: it grants specific permissions on resources to users or roles.

1. Resources: resources are defined along the standard database/table/column dimensions, with fuzzy matching supported; databases, tables, columns, UDFs, and so on are managed under this framework.

2. Users and roles: users, roles, and user groups are defined and bound to one another, and authorization is then granted at the chosen granularity. Users to roles, user groups to roles, and roles to roles can all be one-to-many relationships. The difference between user groups and roles is that user groups can carry their own attributes but cannot be nested within themselves.

3. Permissions: the concrete permissions granted — Access grants operation permissions on the corresponding resources, row-level filtering defines data row filtering rules, and data masking masks specified columns.

Ranger abstracts these three rule types further and adds general capabilities such as validity periods, enable/disable switches, and descriptions. Access rules grant operation permissions on resources, such as SELECT/UPDATE on tables. Row-level filtering rules define data row filters; different filters can be specified for different grantees, and UDFs can be used. Data masking masks the specified columns.

In practice, authorizations on different resources are often combined flexibly. For example, to expose specific data externally, you can grant Access on a permanent view so that the join logic stays inside what the end user perceives as a table, apply row-level filtering on the view to control the range of data involved, and further apply data masking on the view to implement unified masking rules for sensitive data — all without touching the underlying tables directly, which would break the association.
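
A sketch of this view-based pattern, reusing the PyHive cursor from the earlier access sketch; the database, table, and column names are made up, and the three Ranger rules themselves are defined in Ranger Admin against the view, not in SQL.

```python
# The view is created once through Kyuubi; Ranger rules are then attached to the view.
cur.execute("""
    CREATE VIEW IF NOT EXISTS dm.v_customer_position AS
    SELECT c.customer_id, c.branch_id, c.id_number, p.symbol, p.market_value
    FROM dw.dim_customer c
    JOIN dw.dwd_position p ON c.customer_id = p.customer_id
""")
# Conceptually, in Ranger Admin:
#   1. Access rule: SELECT on database=dm, table=v_customer_position for the analyst role.
#   2. Row-level filter on the view, e.g. restricting rows to the user's own branch_id.
#   3. Data masking on id_number, e.g. a show-last-4 mask, so the raw value never leaves the view.
```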

In the big data enablement layer, different scenarios call for different "authentication + authorization" combinations.

The first is handling different kinds of users: end users, for example, complete authentication through JDBC or LDAP with account plus token, while batch-run users authenticate through JDBC with account and password.

Second, different permission strategies apply to different scenarios. Self-service queries by end users are governed by the most fine-grained rules, while batch-run users are granted coarser permissions matched to their processing needs, which reduces the number of rules, the memory occupied at runtime, and the time spent on authorization checks.

GF Securities contributed a JDBC authentication plug-in to Apache Kyuubi. On one hand it is easy to use: it connects quickly to an RDBMS to complete authentication, meets the common requirement of keeping authentication details in an ordinary database, and both encryption and signing can be done in SQL. On the other hand, for scenarios with dynamically generated tokens, it can just as easily connect to an in-memory database such as H2, avoiding a separate database deployment while completing signing, encryption, validity checks, and other authentication steps in SQL. The component and its documentation have been contributed to the community.
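
An illustrative sketch of what such a configuration can look like, written as a dict for readability; the exact keys, placeholder syntax, and SQL are assumptions and should be checked against the Kyuubi documentation that the component was contributed with.

```python
# Illustrative kyuubi-defaults entries for JDBC-based authentication (keys/placeholders/SQL
# are assumptions; consult the Kyuubi docs for the authoritative form).
jdbc_authn_conf = {
    "kyuubi.authentication": "JDBC",
    "kyuubi.authentication.jdbc.driver.class": "org.mariadb.jdbc.Driver",
    "kyuubi.authentication.jdbc.url": "jdbc:mysql://authdb.example.com:3306/auth",
    "kyuubi.authentication.jdbc.user": "kyuubi_authn",
    "kyuubi.authentication.jdbc.password": "******",
    # The check itself is plain SQL, so hashing/signing and validity logic stay in the query.
    "kyuubi.authentication.jdbc.query":
        "SELECT 1 FROM t_user WHERE username = ${user} "
        "AND password_hash = MD5(CONCAT(salt, ${password}))",
}
```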

Screenshots are for illustration only

Once the various permission rules are wired in, the Ranger audit page gives us a unified view of the data resources accessed by end users across all engines — databases, tables, UDFs, and so on — along with the engine used, the timing of operations, and more.

A key new feature in Apache Ranger 2.3 is row-filter conditions that support macros for dynamically matching user attributes, which is particularly valuable: it sharply reduces the number of row-filter rules and reuses user attributes efficiently, while retaining the ability to authorize users at different granularities.

With row-level filtering in earlier versions, a complete filter expression had to be specified separately for each authorization. For example, to filter a table by a set of indicator IDs specific to each user, the IDs had to be written out one by one, such as index_id in ("1","7","9"), and if each person or role needed slightly different conditions, many near-duplicate row-filter rules had to be defined and could not be reused.

With the support added in RANGER-3605 and RANGER-3550, the row-filter condition can instead be defined as index_id in ${{USER.allowIndexIds}}, decoupling the filter rule from the authorization: the concrete values are stored as a user attribute and reused.
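
Conceptually, using the expressions given above (the attribute name and values are examples):

```python
# Before Ranger 2.3: one dedicated row-filter rule per grantee, e.g.
#   user_a -> index_id in ("1","7","9")
#   user_b -> index_id in ("2","3")
# With macro support, a single reusable rule references a user attribute instead:
row_filter_condition = 'index_id in ${{USER.allowIndexIds}}'
# The allowed IDs move into each user's `allowIndexIds` attribute (served via the
# UserStore described below), so the filter rule and the authorization are decoupled.
```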

Ranger also supports other macros, for example for reading user group attributes or testing whether a user belongs to a specific group; refer to the Ranger issues above for macro usage.

Closely related in Apache Ranger 2.3 is the newly introduced UserStore interface and its features, covering user attribute retrieval, user-to-group relationships, and more. The benefit is that user groups can be authorized directly, greatly reducing the number of permission rules; and since user groups can carry their own attribute definitions, this complements the row-filter macro capability.

We will contribute the relevant capability switches to the Kyuubi documentation for easy reference.

05 Outlook and follow-up work

As a top-level open-source big data project that completed incubation quickly, Apache Kyuubi occupies its ecological niche precisely, integrates well architecturally, and keeps evolving with clear business value. We will continue to explore and apply Apache Kyuubi in more scenarios of the big data enablement layer: introducing the Flink engine to explore streaming compute scenarios and, combined with FlinkSQL, break through the bottlenecks of streaming data exploration; enhancing JDBC features not yet implemented by the Kyuubi Hive JDBC driver, so that Spark/PySpark write scenarios are no longer limited to overwrite; and extending more effective frontend features to remove data access bottlenecks and broaden integration — for example improving the MySQL protocol and adding PostgreSQL protocol support — along with more efficient data retrieval methods such as vectorized and staged fetching, for example by improving FlightSQL.

In the Authz access control plug-in, where we are deeply involved, we will keep pushing its key capabilities forward. First, to address the current limitation of a single set of permission policies, allowing different catalogs to use different permission rules has become critical now that DataSource V2 and multi-catalog support have matured. Second, decoupling command definitions from the permission analysis will strengthen adaptability to the different execution plans of different components, provide dynamic plug-ins based on SPI definitions, and extend coverage to the private command wrappers and execution plans of other plug-ins and components, allowing coexistence with, for example, the private commands and plans of Paimon, Hudi, Delta Lake, and Amoro.

Apache Kyuubi keeps deepening across many fields and has now been adopted by hundreds of companies worldwide — Internet and IT vendors, securities firms, and other industries — including GF Securities and Huatai Securities. In contributing to Apache Kyuubi, we at GF Securities submitted and drove contributions and revisions based on our own scenarios and data strategy, and actively communicated and collaborated with the community, receiving a great deal of help, guidance, and support from committers, contributors, and users along the way.

 