Big data permissions and security

1. Overview of permissions

1.1. Current status of permission management and control on big data platforms

Permission management and control has always been one of the thorniest issues on big data platforms. If controls are too strict, business work is obstructed and users are unhappy; if they are too loose, security cannot be guaranteed. Moreover, a big data platform has many components and services and a complex architecture and set of processes, so even when you want to enforce control, you may not actually be able to.

How much permission control to do, how to do it, and how much it will cost all depend on the goal. The goal of permission control is to limit the scope of users' routine business behaviors, control access to sensitive data, and constrain business logic and processes. By removing unnecessary permissions from users, you shrink the attack surface, reduce potential business risk, and make it easier to clarify each user's rights and responsibilities.

1.2. Technical solutions for authority management and control

The technical solutions involved include Kerberos, LDAP, Ranger, Sentry, and ACLs, covering the permission management and control options for each component as well as the goals of permission control.

1.3. Permission management and control steps

There are two steps in permission management: authentication and authorization. The former authenticates the identity, and the latter grants permissions based on the identity.

In the authorization process, the key questions are how to manage permissions centrally and uniformly, how to let users apply for permissions on their own, how to hand permission management over to the responsible business owners rather than platform administrators, and how to establish permission relationships between users across different components.

In the user identity authentication process, it is necessary to analyze the key business processes that permission construction currently targets and to select appropriate permission technology solutions.

2. Permission scheme

2.1. Overview of technical solutions for rights management

Work related to permission management can be divided into two parts:

  • Managing user identities, that is, user identity authentication (Authentication)
  • Mapping relationship management of user identities and permissions, that is, authorization (Authorization)

For user identity authentication, a common open-source solution in the Hadoop ecosystem is Kerberos+LDAP; for authorization, common solutions include Ranger and Sentry, as well as gateway-proxy solutions such as Knox.

2.2. Kerberos

Kerberos is the most widely used centralized unified user authentication management framework in the Hadoop ecosystem.

2.2.1. Workflow

Kerberos provides a centralized authentication server. Backend services do not authenticate a user's identity directly; instead they delegate authentication to the third-party Kerberos service. Users' identity and key information are managed uniformly within the Kerberos framework, so individual backend services neither need to manage this information nor perform authentication themselves, and users do not need to register identities and passwords on multiple systems.

2.2.2. Principle
  1. Kerberos implements identity authentication based on Tickets rather than passwords. If the client cannot use the local key to decrypt the encrypted Ticket returned by the KDC, the authentication will fail.
  2. The client will interact with the Authentication Service, Ticket Granting Service and target Service in sequence, for a total of three interactions.
  3. When the client interacts with other components, it will obtain two pieces of information, one of which can be decrypted using the local key, and the other cannot be decrypted.
  4. The target service that the client wants to access will not interact directly with the KDC, but will be authenticated by whether the client's request can be correctly decrypted.
  5. The KDC Database contains the passwords corresponding to all principals.
  6. The information encryption method in Kerberos is generally symmetric encryption (can be configured as asymmetric encryption).
2.2.3. Core idea

The core idea of Kerberos is key-based consensus. Only the central server knows the key information of every user and service; if you trust the central server, you can trust the authentication results it gives.
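To make the key-based consensus idea concrete, here is a minimal, purely illustrative sketch in Python. It only stands in for real Kerberos: HMACs replace encryption, the names ToyKDC and ToyService are invented, and only a single ticket-issuing step is shown rather than the full AS/TGS/Service exchange.

```python
import hashlib
import hmac
import json
import time

def seal(key: bytes, payload: dict) -> str:
    """Stand-in for symmetric encryption: an HMAC over the serialized payload."""
    data = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(key, data, hashlib.sha256).hexdigest()

class ToyKDC:
    """Only the KDC knows every principal's key (the consensus anchor)."""
    def __init__(self, principal_keys: dict):
        self.keys = principal_keys

    def issue_ticket(self, client: str, service: str) -> dict:
        payload = {"client": client, "service": service, "expires": time.time() + 600}
        # Sealed with the *service's* key, so only that service (and the KDC) can verify it.
        return {"payload": payload, "proof": seal(self.keys[service], payload)}

class ToyService:
    def __init__(self, name: str, key: bytes):
        self.name, self.key = name, key

    def accept(self, ticket: dict) -> bool:
        # The service never contacts the KDC; it only checks the ticket with its own key.
        valid = hmac.compare_digest(seal(self.key, ticket["payload"]), ticket["proof"])
        return valid and ticket["payload"]["expires"] > time.time()

keys = {"alice": b"alice-secret", "hive/server": b"hive-secret"}
kdc = ToyKDC(keys)
hive_service = ToyService("hive/server", keys["hive/server"])
print(hive_service.accept(kdc.issue_ticket("alice", "hive/server")))  # True
```

This mirrors points 4 to 6 of the principle list above: the target service never talks to the KDC directly, it simply checks whether the ticket validates against its own key, and everything rests on the keys the KDC holds for every principal.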

2.2.4. Application difficulties

Kerberos is sound in principle, but it is cumbersome to deploy and operate.

  • Every backend service must be explicitly integrated with the Kerberos framework, and every client must be adapted as well. Ideally each backend service provides a client SDK that encapsulates the access; otherwise the client itself has to be modified to follow the Kerberos authentication flow.
  • For user identity authentication to be genuinely effective, identity must be authenticated and propagated across the entire business chain. A client connecting directly to a single service is not a big problem, but in scenarios with layered service proxies and multi-node cluster deployments, chaining identity authentication through every link is far from simple.
  • For example, a user submits a Hive script task through the development platform; the platform submits it to the scheduling system, the scheduler submits it to HiveServer, and HiveServer submits it to the Hadoop cluster for execution. Each upstream component needs to assert the user's identity to the component downstream.
  • For an MR task running on the Hadoop cluster, this chain of authentication relationships must be carried along. If every link is to support Kerberos-based authentication, it must either handle key transmission correctly or implement a proxy (impersonation) mechanism for the user (a hedged sketch of the proxy approach follows this list).
  • There are also authentication timeout issues and problems of storing and protecting key material. For example, what happens if a key or token expires while an MR task is halfway through? The task cannot simply be interrupted.
  • As for performance, centralized management implies a degree of single-point dependence; if every RPC request has to complete the full Kerberos authentication flow, response latency, concurrency, and throughput become significant problems.
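As a hedged illustration of the proxy (impersonation) approach mentioned above, the sketch below shows how a platform component might call WebHDFS on behalf of an end user: it authenticates with its own Kerberos ticket via SPNEGO (using the third-party requests-kerberos package) and names the end user in the doas query parameter. The NameNode address, the end-user name, and the assumption that the hadoop.proxyuser.* settings allow this service to impersonate users are all illustrative.

```python
# Sketch only: assumes a valid Kerberos ticket for the platform's own service
# principal already exists (for example, obtained from a keytab with kinit),
# and that the cluster's hadoop.proxyuser.<service>.hosts/groups settings
# allow this service to impersonate end users.
import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL  # third-party: requests-kerberos

NAMENODE = "http://namenode.example.com:9870"   # assumed WebHDFS endpoint
END_USER = "alice"                              # the user being impersonated

resp = requests.get(
    f"{NAMENODE}/webhdfs/v1/user/{END_USER}",
    params={"op": "LISTSTATUS", "doas": END_USER},   # WebHDFS proxy-user parameter
    auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL),
)
resp.raise_for_status()
print(resp.json()["FileStatuses"]["FileStatus"])
```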
2.2.5. Usage scenarios

Generally speaking, Kerberos is currently the most effective and complete unified identity authentication framework, but implementing it end to end is costly. User identity authentication is only a small part of permission management; although it is the technically difficult part, in terms of real-world impact a reasonable permission model and standardized management processes are usually what matter most for data security.

  • User authentication and single sign-on in corporate networks.
  • Implement cross-domain user authentication and authorization in distributed systems.
  • Ensure secure communication between users and services in cloud environments.

2.3. Ranger

2.3.1. Overview

Apache Ranger provides a centralized security management framework and addresses authorization and auditing. It can perform fine-grained data access control on Hadoop ecosystem components such as HDFS, YARN, Hive, and HBase. Through the Ranger console, administrators can easily configure policies to control user access permissions.

2.3.2. Ranger architecture

Ranger is mainly composed of the following three components:

  • Ranger Admin: the core module of Ranger. It has a built-in web management page, and users can define security policies through this web interface or through its REST API (a hedged REST sketch follows this list).
  • Agent Plugin: a plugin embedded in each Hadoop ecosystem component. It periodically pulls policies from Ranger Admin and enforces them, while recording operations for auditing.
  • User Sync: synchronizes user and group data from the operating system (or LDAP) into the Ranger database.
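To illustrate the policy-definition path through Ranger Admin, here is a hedged sketch that creates a simple Hive table policy via Ranger's public REST API. The admin URL, credentials, Ranger service name, and group are placeholders, and the exact JSON fields can vary between Ranger versions.

```python
import requests

RANGER_ADMIN = "http://ranger-admin.example.com:6080"   # assumed Ranger Admin address
AUTH = ("admin", "admin-password")                      # illustrative credentials

policy = {
    "service": "hadoopdev_hive",            # assumed name of the Hive service in Ranger
    "name": "analyst_read_sales_orders",
    "isEnabled": True,
    "resources": {
        "database": {"values": ["sales"]},
        "table":    {"values": ["orders"]},
        "column":   {"values": ["*"]},
    },
    "policyItems": [{
        "groups":   ["analyst"],
        "accesses": [{"type": "select", "isAllowed": True}],
    }],
}

resp = requests.post(f"{RANGER_ADMIN}/service/public/v2/api/policy", json=policy, auth=AUTH)
resp.raise_for_status()
print("created policy id:", resp.json().get("id"))
```

The Agent Plugin embedded in HiveServer2 would then pick up this policy on its next periodic refresh and enforce it locally, which is also why enforcement keeps working if Ranger Admin is briefly unreachable.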
2.3.3. Ranger workflow

(Figure: Ranger workflow.)

2.3.4. Usage scenarios
1. HDP

Apache Ranger, included with Hortonworks Data Platform, uses policies to provide fine-grained access control and auditing for Hadoop components such as Hive, HBase, and HDFS.

2. Apache Ranger

The Apache Ranger official website distributes only a source package and does not provide a binary installation package, so it must be compiled with Maven and then deployed and installed by yourself.

2.4. Sentry

2.4.1. Overview

Apache Sentry is an open-source Hadoop component released by Cloudera. It provides fine-grained, role-based authorization and a multi-tenant management model.
Sentry gives authenticated users and applications on a Hadoop cluster precisely controlled and enforced levels of access to data. It currently works with Apache Hive, Hive Metastore/HCatalog, Apache Solr, Impala, and HDFS (limited to Hive table data).

2.4.2. Roles in Sentry
  • object: the protected object (such as a server, database, table, or URI)
  • privilege: an access right on an object (such as SELECT or INSERT)
  • role: a named collection of privileges
  • user: an authenticated end user
  • group: a collection of users; roles are granted to groups (a minimal sketch of how these fit together follows this list)
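A minimal, purely illustrative sketch (not Sentry's actual API) of how these five concepts relate: privileges attach to roles, roles are granted to groups, and a user's effective privileges are resolved through group membership.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Privilege:
    action: str   # e.g. "SELECT"
    obj: str      # e.g. "db=sales->table=orders"

@dataclass
class Role:
    name: str
    privileges: set = field(default_factory=set)

roles = {"read_sales": Role("read_sales", {Privilege("SELECT", "db=sales->table=orders")})}
role_grants = {"analysts": {"read_sales"}}   # group -> roles granted to it
memberships = {"alice": {"analysts"}}        # user  -> groups the user belongs to

def allowed(user: str, action: str, obj: str) -> bool:
    """Resolve user -> groups -> roles -> privileges, Sentry-style."""
    return any(
        Privilege(action, obj) in roles[r].privileges
        for g in memberships.get(user, set())
        for r in role_grants.get(g, set())
    )

print(allowed("alice", "SELECT", "db=sales->table=orders"))  # True
print(allowed("alice", "INSERT", "db=sales->table=orders"))  # False
```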
2.4.3. Usage scenarios
1. CDH


2.5. Knox

2.5.1. Overview

Apache Knox Gateway is an application gateway for interacting with the REST API and UI of Apache Hadoop deployments. Knox Gateway provides a single access point for all REST and HTTP interactions with the Apache Hadoop cluster.

2.5.2. Services provided

Knox provides three sets of user-facing services:

  • Proxy Service: The main goal of the Apache Knox project is to provide access to Apache Hadoop by proxying HTTP resources.
  • Authentication Service: authenticates REST API access as well as WebSSO flows for UIs. LDAP/AD, header-based pre-authentication, Kerberos, SAML, and OAuth are all available options.
  • Client Service: client development can be done with a scripting DSL or by using the Knox Shell classes directly as an SDK.
2.5.3. Usage scenarios
  • All users' REST/HTTP requests to the cluster are forwarded through the Knox proxy. Since it sits in the middle, identity authentication and some permission-verification work can be done during forwarding. Because it only covers REST/HTTP services, it is not a complete permission management framework.
  • The gateway model has clear limitations (single point of failure, performance, process constraints, and so on), but it is a good fit for REST/HTTP scenarios. Its advantage is that it can hide the topology of the Hadoop cluster by closing off direct entrances to Hadoop services; in addition, for services that do not support authentication or authorization themselves, the gateway can layer its own permission control on top (a hedged proxied-access sketch follows this list).
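As a hedged sketch of proxied access, the example below lists an HDFS directory through a Knox gateway instead of contacting the NameNode directly. The gateway URL, the topology name "default", and the LDAP-backed basic-auth credentials are assumptions about a typical Knox deployment.

```python
import requests
from requests.auth import HTTPBasicAuth

# All traffic goes to the gateway; the cluster topology behind it stays hidden.
KNOX = "https://knox.example.com:8443/gateway/default"   # assumed gateway + topology
AUTH = HTTPBasicAuth("alice", "alice-password")          # Knox authenticates, e.g. against LDAP

resp = requests.get(
    f"{KNOX}/webhdfs/v1/tmp",
    params={"op": "LISTSTATUS"},
    auth=AUTH,
    verify=False,   # illustration only; trust the gateway's certificate in practice
)
resp.raise_for_status()
print(resp.json())
```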

3. Permission model

3.1. Overview

  • Permission control can be understood as restricting authority: different people see and use different things because they hold different authority. In an application system, this corresponds to a user having different data permissions (what can be seen) and operation permissions (what can be used).
  • In essence, whatever the permission management model, three basic elements can be abstracted: user, system/application, and policy.
  • Common permission model concepts in open source projects: RBAC/ACL/POSIX/SQL Standard.

3.2. RBAC model

3.2.1. Overview

The permission logic of "user-role-permission" is the RBAC (Role-Based Access Control) permission model commonly used in the industry. Its core is to introduce the concept of roles and use roles as an intermediary to make user and permission configuration more flexible.
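A minimal illustrative sketch (not tied to any particular product) of the user-role-permission indirection; the role and permission names are made up.

```python
# Toy RBAC: permissions hang off roles, users hold only role names,
# so redefining one role updates every user assigned to it.
role_permissions = {
    "data_analyst": {"hive:select"},
    "etl_engineer": {"hive:select", "hive:insert", "hdfs:write"},
}
user_roles = {"alice": {"data_analyst"}, "bob": {"etl_engineer"}}

def has_permission(user: str, permission: str) -> bool:
    return any(permission in role_permissions.get(role, set())
               for role in user_roles.get(user, set()))

print(has_permission("alice", "hive:insert"))        # False
role_permissions["data_analyst"].add("hive:insert")  # adjust the role once...
print(has_permission("alice", "hive:insert"))        # ...and alice gains it immediately
```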

3.2.2. Application of RBAC model
  • RBAC is role-based access control, which implements access control to system resources by associating user roles with permissions.
  • RBAC has the advantages of flexibility, scalability, and security, but it is difficult to implement and requires administrators to have a high technical level.
  • Implementing RBAC requires defining roles, permissions, resources, and access control policies, and then applying them according to the specification.

3.3. POSIX model

3.3.1. Overview

The POSIX permission model is a file-based model, similar to the filesystem permissions of Linux systems: each file has an OWNER and a GROUP, permissions can only be set for the owner, the group, and other users, and the grantable permissions are limited to read, write, and execute.
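For illustration, the snippet below reads the POSIX mode bits of a local file with Python's standard os and stat modules (the path is arbitrary). It shows exactly the owner/group/other plus read/write/execute vocabulary described above.

```python
import os
import stat

path = "/tmp/posix_demo.txt"   # any writable path will do
open(path, "a").close()        # make sure the file exists for the demo

mode = os.stat(path).st_mode
print(stat.filemode(mode))                        # e.g. "-rw-r--r--"
print("owner can write:", bool(mode & stat.S_IWUSR))
print("group can write:", bool(mode & stat.S_IWGRP))
print("others can read:", bool(mode & stat.S_IROTH))

# Three classes (owner, group, other) and three bits each (r/w/x):
# that is the entire vocabulary the POSIX model offers.
```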

3.3.2. Application of POSIX model

This model is not well suited to enterprise use. An obvious disadvantage is that each file has only one GROUP, so different groups cannot be given different permissions and fine-grained permission management is impossible. Authorization can only happen at the file level, and the grantable permissions are limited to read, write, and execute.

3.4. ACL model

3.4.1. Overview

An ACL (Access Control List) is an access control mechanism built around three key elements: user, resource, and operation. When a user requests an operation on a resource, the resource's permission list is checked; if the user's operation appears in that list the request is allowed, otherwise it is denied.
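A minimal illustrative sketch of that check: the resource carries its own list of (user, operation) entries, and a request is allowed only when a matching entry exists.

```python
# Toy ACL: each resource owns a list of (user, operation) entries.
acl = {
    "/warehouse/sales/orders": [("alice", "read"), ("bob", "read"), ("bob", "write")],
}

def check(user: str, operation: str, resource: str) -> bool:
    return (user, operation) in acl.get(resource, [])

print(check("alice", "read",  "/warehouse/sales/orders"))   # True: entry exists, allowed
print(check("alice", "write", "/warehouse/sales/orders"))   # False: no entry, denied
```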

3.4.2. Application of ACL model

The ACL model makes up for the shortcomings of the POSIX model and allows more fine-grained permission management: with an access control list you can grant multiple permissions to one user, or different permissions to different users. However, ACLs also have an obvious weakness: when the number of users is large, the lists become large and hard to maintain, a problem that is especially visible in large enterprises.

3.5. SQL standard permission model

3.5.1. Overview

The SQL Standard model is one of the permission models used by Hive/Spark. It essentially manages permissions through SQL authorization syntax. Hive's permission model is itself built on the ACL and RBAC models, which means individual users can be authorized either directly or through roles.

3.5.2. Application of SQL Standard model

From a modeling perspective, the SQL standard permission model is not fundamentally different from the ACL model; it simply imitates the standard authorization syntax of traditional databases such as MySQL, so that users interact with the system through SQL-like statements.
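As a hedged sketch of what that interaction looks like, the example below issues SQL-standard authorization statements through HiveServer2 using the third-party pyhive package. The host, user, database objects, and role name are placeholders, the cluster is assumed to run with SQL-standard authorization enabled, and the exact GRANT syntax can differ slightly between Hive versions.

```python
from pyhive import hive  # third-party package: PyHive

# Assumed HiveServer2 endpoint and an admin user allowed to manage roles.
conn = hive.connect(host="hiveserver2.example.com", port=10000, username="admin")
cursor = conn.cursor()

for statement in [
    "CREATE ROLE analyst",
    "GRANT SELECT ON TABLE sales.orders TO ROLE analyst",  # privilege -> role
    "GRANT analyst TO USER alice",                         # role -> user
]:
    cursor.execute(statement)

cursor.close()
conn.close()
```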

4. Data security

4.1. Risks and pressures faced by data security

4.1.1. Internal supervision of enterprises

At present, many enterprises lack both the technical means and an effective management system for data security, which increases the risk of data leakage. Another factor is leakage caused by insufficient security awareness among internal employees.

4.1.2. External legal and compliance requirements

As governments and industries at home and abroad attach increasing importance to information security, they have introduced laws, regulations, and management systems that continually demand stronger data security and ever more detailed requirements. Examples include China's Cybersecurity Law of the People's Republic of China, which took effect in June 2017; the EU's General Data Protection Regulation (GDPR), which took effect in May 2018; and China's GB/T 35273 Information Security Technology: Personal Information Security Specification, which took effect in May 2018.

4.1.3. Data leakage risk

As IT technology continues to evolve, the number of pathways that can lead to data leakage keeps growing, and with it the risk of leakage. The growing threat of malicious attacks is another factor.

4.1.4. Data security status and problems

This subsection looks at some of the data security issues faced by various industries and companies.

1. Data asset management issues

Data asset management issues are mainly reflected in the following three aspects:

  • Asset status unclear
  • Access status unclear
  • Permission status unclear

Sorting out data assets is a continuous process: data and business are constantly changing, so automated tools are needed for data asset management. Accurately grasping the security status of data assets, such as storage location, manager, owning department, classification, and sensitivity grading, is the basic precondition for building a data security system.
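Purely as an illustration of what an automated asset inventory might record for each asset (the field names are invented, not a standard), a single catalog entry could look like this:

```python
from dataclasses import dataclass, asdict

@dataclass
class DataAssetRecord:
    name: str          # e.g. a Hive table or an HDFS path
    location: str      # where the asset is stored
    owner: str         # accountable manager
    department: str    # owning business department
    category: str      # business classification
    sensitivity: str   # security grading, e.g. "public" / "internal" / "confidential"

record = DataAssetRecord(
    name="sales.orders",
    location="hdfs://nameservice1/warehouse/sales.db/orders",
    owner="alice",
    department="sales-analytics",
    category="transaction data",
    sensitivity="confidential",
)
print(asdict(record))
```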

2. Data management responsibility issues

Data management responsibility issues are mainly reflected in the following two aspects:

  • Data assets lack accountable owners
  • The boundaries of management roles are blurred

Data security management duties are generally spread across R&D, operations and maintenance, security, and operations personnel, with no independent or virtual team, resulting in unclear rights and responsibilities; this is not conducive to improving data security protection as a whole. It is crucial to establish dedicated data security management roles: data asset administrator, database administrator, security auditor, security detection engineer, data operations engineer, permission administrator, and so on.

3. The problem of imperfect data system

The problem of imperfect data systems is mainly reflected in the following two aspects:

  • System specifications are not implemented or are difficult to implement
  • Lack of audit means

Through data security consulting and planning, the data management system should establish a set of practical specifications and formulate data security control measures and SLA evaluation indicators, so that the data security management department is not left unable to track implementation status for lack of audit methods.

4. The problem of confusing data exchange management

The problem of chaotic data exchange management is mainly reflected in the following two aspects:

  • Exchange and sharing methods and interfaces are not standard
  • Data management and control pressure on operation and maintenance personnel and application system managers is high

Data is exchanged and shared internally, externally, and with partners. As more and more interfaces are opened, exchange relationships become increasingly complex. Standardizing the methods and interfaces for exchange and sharing avoids duplicated functionality, convoluted call chains, and repeated logins, without hindering the development of data applications.

5. Scattered safety technical measures

The scattered issues of safety technical measures are mainly reflected in the following two aspects:

  • Data security products have fragmented functions
  • Security capabilities are isolated in silos

Data security capabilities should be built on a unified organizational basis, avoiding scattered construction by individual organizations and establishing a defense system across the entire data life cycle.

6. Insufficient data audit capabilities

The problem of insufficient data audit capabilities is mainly reflected in the following two aspects:

  • Security rules vary in effectiveness
  • Neither violations nor compliant operations are audited

Security risks can be discovered by auditing operation trails and attack patterns, and corresponding dynamic trust mechanisms can be established on that basis.

4.2. Risk points in data security

4.2.1. Risk points of data security

(Figure: risk points in data security.)

4.3. Data security life cycle management

4.3.1. Data security life cycle management

(Figure: data security life cycle management.)

4.4. Data security life cycle capability model

4.4.1. Data security life cycle capability model

(Figure: data security life cycle capability model.)

4.5. Data security governance

4.5.1. Multi-dimensional data security governance
  • Organization management construction

    Based on the organizational structure of your own company, define the relevant responsibilities of management, business departments, implementation departments, compliance monitoring and auditing departments, operating departments, etc. from top to bottom.

  • Standard system and specification construction

    Establish or improve the overall strategy, management measures, emergency procedures, and specific operating procedures for data leakage prevention, supporting the work at the institutional level.

  • Technical tool construction

    Use professional, mature technology to implement the detailed policies approved by management, detect data leakage through the platform, and record, alert on, and block it, achieving the goal of leakage prevention at the technical level.

  • Overall implementation of core technologies

    Data asset management, classification and grading, data rights management and auditing, KMS+CA, zero trust, data security gateways, data profiling, DLP, blockchain-based privacy protection, watermarking, TEE, federated learning, and homomorphic encryption.

4.5.2. Data security platform

(Figure: data security platform.)
