Improving data security and controllability: DataStack's practice of Spark SQL permission control based on Ranger

In enterprise-level applications, data security and privacy protection are extremely important. As one of the underlying computing engines of DataStack, Spark must ensure that data can only be accessed by authorized users, to prevent data leakage and abuse. To achieve fine-grained data management in Spark SQL and improve data security and controllability, DataStack implements permission control for Spark SQL data processing based on Apache Ranger.

This article explains the underlying principles based on Apache Spark 2.4.8 and Apache Ranger 2.2, and discusses the practical exploration of Ranger-based Spark SQL permission control in the "Kangaroo Cloud one-stop big data basic software stack" (DataStack).

Implementing Spark SQL permission control based on Ranger

Apache Ranger is an open-source permission management framework that provides secure access control for the Hadoop ecosystem. Ranger gives developers an extensible framework for unified data security management, with built-in access control for Hadoop, Hive, HBase, Kafka and other components.

Ranger does not provide a built-in Spark permission control plugin; developers have to implement one themselves. Based on Ranger, DataStack has implemented three aspects of Spark SQL permission control: access control over databases, tables, columns and UDFs; row-level permission control; and data masking. Next, we explain the implementation in two parts: the custom Ranger plugin and the Spark SQL Extensions mechanism.

Custom Ranger plugin

Adding permission verification for a new service in Ranger consists of two parts: the first is adding a new service module to Ranger; the second is adding a Ranger permission verification plugin to the new service.

● Adding a new service module to Ranger

Adding a new service module to Ranger means adding the corresponding service module on the Ranger Admin Web UI, through which authorization policies for the service's resources can then be configured. Adding a new service module can be divided into the following three steps:

• Define a description file for the new service, named ranger-servicedef-<serviceName>.json. The description file defines the service name, the name displayed on the Ranger Admin web UI, the implementation class used to access the new service, the list of resources whose permissions need to be checked, the list of access types that need to be checked, and so on.

The main fields of ranger-servicedef-<serviceName>.json are explained below:

{
  "id": "service id, must be unique",
  "name": "service name",
  "displayName": "service name displayed on the Ranger Admin Web UI",
  "implClass": "implementation class used inside Ranger Admin to access the new service",
  // list of resources the new service checks permissions against, e.g. database and table in Hive
  "resources": [
    {
      "itemId": "resource id, incrementing from 1",
      "name": "resource name",
      "type": "resource type, usually string or path",
      "level": "resource level; resources at the same level are shown in one drop-down box",
      "mandatory": "whether the resource is required",
      "lookupSupported": "whether lookup (auto-completion) is supported",
      "recursiveSupported": false,
      "excludesSupported": true,
      "matcher": "org.apache.ranger.plugin.resourcematcher.RangerDefaultResourceMatcher",
      "validationRegEx": "",
      "validationMessage": "",
      "uiHint": "hint message",
      "label": "Hive Database",
      "description": "resource description"
    }
  ],
  // list of access types to check for the resources, e.g. select and create
  "accessTypes": [
    {
      "itemId": "access type id, incrementing from 1",
      "name": "access type name",
      "label": "display name of the access type on the Web UI"
    }
  ],
  "configs": [
    {
      "itemId": "config parameter id, incrementing from 1",
      "name": "config parameter name",
      "type": "parameter type",
      "mandatory": "whether required",
      "validationRegEx": "",
      "validationMessage": "",
      "uiHint": "hint message",
      "label": "display name on the Web UI"
    }
  ]
}

• Develop the implementation class of the new service module in Ranger, and fill its class name into the implClass field of ranger-servicedef-<serviceName>.json. The implementation class of the new service module must extend the abstract class RangerBaseService. RangerBaseService is the base class of all services in Ranger; it defines a set of common methods and properties that all services share and inherit, providing basic functionality such as access control, resource management and audit trails.

Developing the implementation class of a new service module is relatively easy: simply extend RangerBaseService and implement the validateConfig and lookupResource methods. The validateConfig method verifies whether the service configuration is correct, and the lookupResource method defines how resources are looked up, as sketched below.
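A minimal sketch of such an implementation class in Scala (the class name and the "connectivityStatus" key are illustrative assumptions; the overridden signatures follow RangerBaseService as of Ranger 2.2):

import java.util.{Collections, HashMap => JHashMap, List => JList}

import org.apache.ranger.plugin.service.{RangerBaseService, ResourceLookupContext}

// Hypothetical implementation class for a new "sparksql" service type.
class RangerServiceSparkSQL extends RangerBaseService {

  // Verify that the service configuration entered on the Admin UI is usable,
  // e.g. by testing connectivity to the underlying system.
  override def validateConfig(): JHashMap[String, AnyRef] = {
    val ret = new JHashMap[String, AnyRef]()
    ret.put("connectivityStatus", java.lang.Boolean.TRUE)
    ret
  }

  // Return candidate resource names for the policy editor's lookup
  // (auto-completion); a real implementation would query the metastore.
  override def lookupResource(context: ResourceLookupContext): JList[String] =
    Collections.emptyList[String]()
}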

• After the first two steps are complete, put the description file ranger-servicedef-<serviceName>.json and the jar containing the implementation class of the new service module into the CLASSPATH of Ranger Admin, and register the newly defined service type with Ranger through the REST API provided by Ranger Admin (see the sketch below). The new service module then appears on the Ranger Admin UI, and the corresponding permission policies can be configured through the interface.
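The registration call is an ordinary REST request. A hedged sketch using only the JDK (host, port, credentials and file name are placeholders; the endpoint shown is the one commonly used for servicedef registration, so verify it against your Ranger version):

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import java.util.Base64

object RegisterServiceDef {
  def main(args: Array[String]): Unit = {
    // Read the servicedef JSON prepared in the first step.
    val body = Files.readAllBytes(Paths.get("ranger-servicedef-sparksql.json"))
    val conn = new URL("http://ranger-admin:6080/service/plugins/definitions")
      .openConnection().asInstanceOf[HttpURLConnection]
    val token = Base64.getEncoder.encodeToString(
      "admin:password".getBytes(StandardCharsets.UTF_8))
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Authorization", s"Basic $token")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(body)
    conn.getOutputStream.close()
    println(s"Ranger Admin responded with HTTP ${conn.getResponseCode}")
  }
}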

● Adding the Ranger permission verification plugin to the new service

To enforce Ranger permission verification in the new service, a corresponding permission control plugin needs to be developed and registered in the new service. When implementing the plugin, an entry point must be found in the service where resource access requests can be intercepted, so that the Ranger API can be called to authorize the access. Four important classes are involved in developing a Ranger permission verification plugin:

• RangerBasePlugin: the core class of Ranger permission verification, mainly responsible for pulling policies, refreshing the policy cache, and performing the actual verification of resource access

• RangerAccessResourceImpl: the implementation class that wraps the resource to be verified; an instance of it must be constructed when calling the verification interface

• RangerAccessRequestImpl: the implementation class representing a resource access request, wrapping the resource to be verified, the user, the user groups, the access type and other information; it is passed as the parameter when calling the verification interface isAccessAllowed

• RangerDefaultAuditHandler: the audit log handler

Implementing the Ranger permission verification plugin involves the following steps:

• Write a target class that extends RangerBasePlugin. Usually it only needs to do two things: call the parent constructor from its own constructor, passing in the corresponding service type name, and override the init method of RangerBasePlugin, calling the parent's init inside the overridden method.

The init method of RangerBasePlugin pulls the policies and starts a background thread that periodically refreshes the locally cached policies.
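A minimal sketch of such a target class (the class name is hypothetical; "sparksql" stands for the service type name defined in the servicedef, and the second constructor argument is the appId used to tag audit logs):

import org.apache.ranger.plugin.service.RangerBasePlugin

class RangerSparkSQLPlugin extends RangerBasePlugin("sparksql", "sparksql") {

  override def init(): Unit = {
    // Pull policies from Ranger Admin and start the background thread that
    // periodically refreshes the local policy cache.
    super.init()
  }
}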

• Write a connecting class that hooks into the target service: it intercepts all resource access requests of the target service and calls the isAccessAllowed method of RangerBasePlugin to verify each request (a sketch of such a check follows below). For Spark SQL, we implement Ranger permission verification based on Spark SQL's Extensions mechanism (explained later): a custom extension is registered into Spark, and permission verification of resource access is performed by traversing the abstract syntax tree generated during the SQL parsing stage.
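To make the four classes above concrete, here is a hedged sketch of a single authorization check, reusing the RangerSparkSQLPlugin sketched earlier (the resource keys "database" and "table" follow the Hive-style policy model; user and resource names are examples):

import java.util.Collections

import org.apache.ranger.plugin.audit.RangerDefaultAuditHandler
import org.apache.ranger.plugin.policyengine.{RangerAccessRequestImpl, RangerAccessResourceImpl}

object AuthorizeExample {
  def main(args: Array[String]): Unit = {
    val plugin = new RangerSparkSQLPlugin()
    plugin.init()

    // Wrap the resource being accessed: table t1 in database db1.
    val resource = new RangerAccessResourceImpl()
    resource.setValue("database", "db1")
    resource.setValue("table", "t1")

    // Wrap the request: which user, which groups, which access type.
    val request = new RangerAccessRequestImpl()
    request.setResource(resource)
    request.setAccessType("select")
    request.setUser("user1")
    request.setUserGroups(Collections.emptySet[String]())

    // Authorize, writing an audit record through the default audit handler.
    val result = plugin.isAccessAllowed(request, new RangerDefaultAuditHandler())
    if (result == null || !result.getIsAllowed) {
      throw new SecurityException("user1 has no select permission on db1.t1")
    }
  }
}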

Spark SQL Extensions mechanism

Spark SQL Extensions was introduced in SPARK-18127. It provides a flexible mechanism that allows Spark users to add custom extensions to the Parser, Analyzer, Optimizer and Planner stages of SQL processing, including custom SQL syntax parsing, new data sources, and so on.


SparkSessionExtensions is the core class of the Spark SQL Extensions mechanism. It stores the user-defined extension rules and provides the following methods:

• buildResolutionRules: builds the extension rules added to the resolution phase of the Analyzer

• injectResolutionRule: registers an extension rule generator for the resolution phase of the Analyzer

• buildPostHocResolutionRules: builds the extension rules added to the post-hoc resolution phase of the Analyzer

• injectPostHocResolutionRule: registers an extension rule generator for the post-hoc resolution phase of the Analyzer

• buildCheckRules: builds the extended check rules, which run after the analysis phase to check the LogicalPlan for problems

• injectCheckRule: registers an extended check rule generator

• buildOptimizerRules: builds the extended optimization rules, which are invoked during the optimizer phase

• injectOptimizerRule: registers an extended optimization rule generator

• buildPlannerStrategies: builds the extended physical planning strategies, used to convert a LogicalPlan into an executable physical plan

• injectPlannerStrategy: registers an extended physical planning strategy generator

• buildParser: builds the extended parser rules

• injectParser: registers an extended parser rule generator

Implementing custom rules based on the Spark SQL Extensions mechanism is easy. First, write a class that implements Function1[SparkSessionExtensions, Unit]: the SparkSessionExtensions comes in as the function argument, and its inject methods are called to register the custom rules into the corresponding SQL parsing stages. Then register the written class with Spark through the spark.sql.extensions configuration parameter.
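A minimal sketch of the whole pattern, with a do-nothing rule standing in for real logic (all names are hypothetical):

import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A no-op optimizer rule, standing in for real authorization logic.
case class NoopRule(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// The entry point Spark instantiates: a Function1[SparkSessionExtensions, Unit].
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit =
    ext.injectOptimizerRule(NoopRule)
}

The class is then activated through the configuration parameter, for example spark-submit --conf spark.sql.extensions=com.example.MyExtensions; Spark instantiates it and applies it when the SparkSession is created.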

Practice of Spark SQL permission control in DataStack

In DataStack, Spark is mainly used in offline data warehouse scenarios for batch processing of offline data. In most scenarios the data lives in business databases such as MySQL and Oracle. On DataStack, ChunJun first collects the data and synchronizes it from the business databases into the ODS layer of the Hive warehouse; the data is then batch-processed by the Hive or Spark engine, and finally the result data is synchronized back to the corresponding business databases through ChunJun.


Most of these business databases are relational databases, each of which already has a mature permission management mechanism. The early DataStack, however, lacked security control over the data in Hive, so the data could be obtained and viewed by any user, leaving data privacy unprotected.

To solve the problem of Hive data security, we chose Ranger to control permissions on Hive.

Ranger is a very comprehensive data security management framework. It provides a Web UI for setting permissions and policies, which makes it easy to use, and it offers rich security features with fine-grained control: it supports database- and table-level permission management as well as very practical capabilities such as row-level filtering and data masking. Ranger is also quite flexible to extend, and implementing permission control for a new service on Ranger is easy.

On DataStack, Spark is used to process data in Hive, and Hive's data permissions are controlled by Ranger. Therefore, to keep data secure, DataStack developed a Ranger-based Spark SQL permission control plugin.

We mentioned above that customizing a Ranger permission control plugin for a new service is divided into two parts, the first being adding the corresponding service module on the Ranger Admin Web UI. Since Spark here is only used to process data in Hive, its permission policies should stay consistent with Hive's. So when implementing the Ranger-based permission control plugin for Spark SQL, we did not reinvent the wheel: we directly reuse the existing HADOOP SQL service module and share the same set of policies with Hive. As a result, only the Ranger permission verification plugin on the Spark side needs to be developed.
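In plugin code, this reuse comes down to a single choice: constructing the Spark-side plugin with the Hive service type, so that it pulls exactly the policy set managed by the HADOOP SQL module. A plausible sketch (the class name and appId are assumptions):

import org.apache.ranger.plugin.service.RangerBasePlugin

// "hive" is the service type behind the HADOOP SQL module, so this plugin
// downloads and enforces the same policies that Hive uses.
class RangerSparkPlugin extends RangerBasePlugin("hive", "sparkSql")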


Based on the Spark SQL Extensions mechanism, we wrote the class RangerSparkSQLExtension. In this class, the implemented authorization rule, row-level filtering rule and data masking rule are registered into the Optimizer stage of SQL parsing by calling SparkSessionExtensions.injectOptimizerRule.

Take the data masking rule as an example. When the rule matches, it adds a Project node to the logical plan carrying the masking function call. Taking select name from t1 where id = 1 as an example, the following figure shows the plan before and after the data masking rule is matched:

[Figure: logical plan before and after the data masking rule, for select name from t1 where id = 1]
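As a hedged illustration of the idea (not DataStack's actual code), a minimal masking rule could look as follows; to stay self-contained it replaces a masked column with a constant literal instead of calling Ranger's real masking functions, and the policy lookup is stubbed:

import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Literal, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

// Stubbed policy lookup; a real rule would ask the Ranger plugin whether a
// masking policy applies to this column for the current user.
object MaskingPolicy {
  def shouldMask(column: Attribute): Boolean = column.name == "name"
}

case class RangerDataMaskingRule() extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    case p @ Project(projectList, _) if projectList.exists(needsMask) =>
      // Rewrite the project list so masked columns go through the masking
      // expression (a constant "****" here, for illustration only).
      val masked: Seq[NamedExpression] = projectList.map {
        case a: Attribute if MaskingPolicy.shouldMask(a) =>
          // Keep the original exprId so parent operators still resolve.
          Alias(Literal("****"), a.name)(exprId = a.exprId)
        case other => other
      }
      p.copy(projectList = masked)
  }

  private def needsMask(e: NamedExpression): Boolean = e match {
    case a: Attribute => MaskingPolicy.shouldMask(a)
    case _            => false
  }
}

After the rewrite, the projected name column is no longer a bare Attribute, so the rule no longer matches and stays idempotent even though optimizer rules can run more than once.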

Summary

DataStack has always been committed to data security and privacy protection. Implementing Ranger-based permission control for Spark SQL is one of DataStack's explorations in data security. With Ranger as the foundation, Spark SQL gains stronger control and richer capabilities in permission management.

In the future, on the premise of ensuring security, DataStack will further optimize performance. For example, the permission verification rules are currently registered in the SQL optimizer, whose rules may be executed multiple times, which introduces some unnecessary repeated verification; eliminating this overhead is a planned improvement. We look forward to your continued attention to DataStack.

"Dutstack Product White Paper": https://www.dtstack.com/resources/1004?src=szsm

"Data Governance Industry Practice White Paper" download address: https://www.dtstack.com/resources/1001?src=szsm If you want to know or consult more about Kangaroo Cloud big data products, industry solutions, and customer cases, visit Kangaroo Cloud official website: https://www.dtstack.com/?src=szkyzg

Students interested in big data open source projects are also welcome to join the "Kangaroo Cloud Open Source Framework DingTalk Technical Group" to exchange the latest open source technology information. Group number: 30537511, project address: https://github.com/DTStack
