0 Description
This article builds on "CDH Data Warehouse Project (1) - CDH Installation, Deployment and Construction Detailed Process" and "CDH Data Warehouse Project (2) - User Behavior Data Warehouse and Business Data Warehouse Construction", which set up the CDH data warehouse. This chapter introduces Kerberos authentication and Sentry permission management on top of that CDH data warehouse.
1 Kerberos security authentication
1.1 Overview of Kerberos
Kerberos is a computer network authentication protocol that allows parties communicating over a non-secure network to prove their identities to one another in a secure manner. It follows a client/server design and supports mutual authentication: the client and the server can each verify the other's identity. It protects against eavesdropping and replay attacks and safeguards data integrity, and it manages keys using a symmetric-key cryptosystem.
1.2 Kerberos concept
There are some concepts in Kerberos that need to be understood:
1) KDC: Key Distribution Center, responsible for issuing tickets and recording authorizations.
2) Realm: the administrative domain managed by Kerberos.
3) Principal: when a user or service is added, a principal must be added to the KDC. A principal has the form: primary/instance@REALM.
4) Primary: the primary can be a user name or a service name, identifying the principal that provides a given network service (such as hdfs, yarn, or hive).
5) Instance: the instance can simply be understood as the host name.
1.3 Principles of Kerberos authentication
2 Kerberos installation
2.1 Server node installs kerberos related software
Only one node needs the server packages; here they are installed on the chen102 node
yum install -y krb5-server krb5-workstation krb5-libs
View installation results
rpm -qa | grep krb5
2.2 client node installation
The client packages can be installed on any number of nodes
yum install -y krb5-workstation krb5-libs
2.3 Configure Kerberos
There are two files that need to be configured: kdc.conf and krb5.conf. The kdc configuration only requires the Server service node configuration, namely chen102.
1) kdc configuration
vim /var/kerberos/krb5kdc/kdc.conf
Explanation:
CHEN.COM: Realm name, Kerberos supports multiple realms, generally all capitalized.
acl_file: admin user rights.
admin_keytab: The keytab for KDC verification.
supported_enctypes: supported encryption types. Note that aes256-cts is removed here: using aes256-cts from Java requires installing additional JCE policy jar packages, so it is not used.
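For reference, a hypothetical kdc.conf matching the settings described here might look as follows. In this sketch it is written to a local sample file; on the server the real file is /var/kerberos/krb5kdc/kdc.conf:

```shell
# Hypothetical kdc.conf contents (CHEN.COM realm, aes256-cts removed);
# written to ./kdc.conf.sample here for illustration only
cat > kdc.conf.sample <<'EOF'
[kdcdefaults]
 kdc_ports = 88
 kdc_tcp_ports = 88

[realms]
 CHEN.COM = {
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  dict_file = /usr/share/dict/words
  admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  # aes256-cts removed: using it from Java needs extra JCE policy jars
  supported_enctypes = aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal
 }
EOF
grep supported_enctypes kdc.conf.sample
```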
2) krb5 file configuration
vim /etc/krb5.conf
The modified content is synchronized to other nodes in the cluster
default_realm: The default realm, which sets the default realm of the Kerberos application and must be consistent with the name of the realm to be configured.
ticket_lifetime: how long a ticket remains valid, usually 24 hours.
renew_lifetime: the maximum period over which a ticket can be renewed, usually one week. Once the ticket expires, subsequent access to securely authenticated services fails.
udp_preference_limit = 1: disables UDP, which works around a known Hadoop bug.
realms: configure the realm used, if there are multiple realms, just add other statements to the [realms] section.
domain_realm: The mapping relationship between the cluster domain name and the Kerberos realm. In the case of a single realm, it can be ignored.
2.4 Generate Kerberos database
Execute on the server node
kdb5_util create -s
After the creation is complete, the corresponding files will be generated in the /var/kerberos/krb5kdc directory
ls /var/kerberos/krb5kdc/
2.5 Grant Kerberos administrators all privileges
vim /var/kerberos/krb5kdc/kadm5.acl
# Modify to the following:
*/[email protected] *
Explanation:
*/admin: all principals of admin instance
@CHEN.COM: the realm
*: all permissions
This authorization grants every principal whose instance is admin all permissions in the CHEN.COM realm. In other words, any Kerberos principal created with admin as its instance has full permissions over the CHEN.COM realm; for example, the principal user1/admin has all permissions in the CHEN.COM realm.
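The matching behavior of the ACL pattern can be sketched with a shell case statement (for illustration only; the real matching is done by the Kerberos admin daemon):

```shell
# Sketch: how the ACL pattern "*/admin" matches principals (glob-style match)
acl_matches() {
  case "$1" in
    */admin@CHEN.COM) echo "full privileges" ;;
    *)                echo "no match" ;;
  esac
}
acl_matches "user1/[email protected]"   # -> full privileges
acl_matches "test/[email protected]"      # -> no match
```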
2.6 Start Kerberos service (executed by server node)
start krb5kdc
systemctl start krb5kdc
start kadmin
systemctl start kadmin
Set up autostart
systemctl enable krb5kdc
systemctl enable kadmin
Check if it is set to boot automatically
systemctl is-enabled krb5kdc
systemctl is-enabled kadmin
Note: When the startup fails, you can check it through /var/log/krb5kdc.log and /var/log/kadmind.log.
2.7 Create administrator principal/instance
kadmin.local -q "addprinc admin/admin"
2.8 kinit administrator verification
kinit admin/admin
klist
Try the same authentication on other nodes
2.9 Kerberos database operation
2.9.1 Log in to the kerberos database
1) Log in locally (no authentication required)
kadmin.local
2) Remote login (requires principal authentication; first authenticate as the administrator principal just created)
kadmin
2.9.2 Create kerberos principal
kadmin.local -q "addprinc test/test"
2.9.3 Change a principal's password
kadmin.local -q "cpw test/test"
2.9.4 View all principals
kadmin.local -q "list_principals"
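The principal-management commands above can be combined into a small idempotent helper. This is only a sketch: the KADMIN variable is indirection for illustration (real scripts would call kadmin.local directly), and -randkey is used to skip the interactive password prompt:

```shell
# Sketch: create a principal only if it does not exist yet.
# KADMIN is indirection for illustration/testing only.
KADMIN=${KADMIN:-kadmin.local}
ensure_principal() {
  if "$KADMIN" -q "list_principals" | grep -q "^$1@"; then
    echo "principal $1 already exists"
  else
    "$KADMIN" -q "addprinc -randkey $1"   # -randkey: no interactive password prompt
  fi
}
```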
2.10 Kerberos Principal Authentication
Kerberos provides two authentication methods: password authentication and keytab key-file authentication. The two cannot be used at the same time, because exporting a keytab regenerates the principal's key and invalidates the old password.
2.10.1 Password authentication
kinit test/test
klist
2.10.2 Keytab key file authentication
1) Generate the keytab file for the principal test/test at the specified path /root/test.keytab
kadmin.local -q "xst -k /root/test.keytab test/[email protected]"
2) Use keytab for authentication
kinit -kt /root/test.keytab test/test
View the principals contained in the keytab
klist -ekt /root/test.keytab
3) View the ticket cache
klist
Note that after the keytab is exported, the principal's key is regenerated, so the old password can no longer be used to authenticate.
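Because keytab authentication is non-interactive, it is what scheduled scripts typically use. A common pattern is to re-authenticate only when the cached ticket is no longer valid; klist -s is silent and exits non-zero in that case. A minimal sketch:

```shell
# Sketch: kinit with the keytab only when there is no valid cached ticket
ensure_ticket() {
  keytab=$1; principal=$2
  # klist -s exits 0 only when a valid (non-expired) ticket is cached
  if ! klist -s 2>/dev/null; then
    kinit -kt "$keytab" "$principal"
  fi
}
```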
2.11 Destroy the ticket
kdestroy
klist
3 CDH install Kerberos
3.1 CDH enables Kerberos security authentication
Create an administrator principal for Cloudera Manager (CM)
kadmin.local -q "addprinc cloudera-scm/admin"
3.2 CDH page starts kerberos
3.3 Environmental Confirmation
3.4 Fill in the configuration
Kerberos encryption type: aes128-cts, des3-hmac-sha1, arcfour-hmac
3.5 Fill in the subject name and password
3.6 Wait for the KDC import
3.7 Wait for the cluster restart
3.8 View principals
4 Kerberos security environment practice
After Kerberos is enabled, system-to-system communication (such as Flume to Kafka) and user-to-system communication (such as a user accessing HDFS) must be authenticated before it can proceed.
Therefore, the scripts used by the data warehouse need an additional security-authentication step in order to work normally.
4.1 User access service authentication
After Kerberos security authentication is enabled, everyday service access (accessing HDFS, consuming Kafka topics, and so on) must first pass security authentication
1) Create a user principal/instance in the Kerberos database
kadmin.local -q "addprinc hive/[email protected]"
2) Perform user authentication
kinit hive/[email protected]
3) Access HDFS
hadoop fs -ls /
4) hive query
hive
5) Consume Kafka topic
(1) Modify Kafka configuration
① Search for "security.inter.broker.protocol" in the Kafka configuration items and set it to SASL_PLAINTEXT.
② Search for "ssl.client.auth" in the Kafka configuration item and set it to none.
(2) Create the jaas.conf file
vim /var/lib/hive/jaas.conf
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useTicketCache=true;
};
(3) Create a consumer.properties file
vim /etc/kafka/conf/consumer.properties
The content of the file is as follows
security.protocol=SASL_PLAINTEXT
sasl.kerberos.service.name=kafka
(4) Declare the jaas.conf file path
export KAFKA_OPTS="-Djava.security.auth.login.config=/var/lib/hive/jaas.conf"
(5) Use kafka-console-consumer to consume Kafka topic data
kafka-console-consumer --bootstrap-server chen102:9092 --topic topic_start --from-beginning --consumer.config /etc/kafka/conf/consumer.properties
Data can now be consumed normally
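Steps (2) through (5) can be collected into one small wrapper. This is a sketch: it assumes the jaas.conf and consumer.properties files created above exist, and the KAFKA_CONSUMER variable is only indirection for illustration (normally the command is invoked directly):

```shell
# Sketch combining steps (2)-(5): set the JAAS config and consume a topic
KAFKA_CONSUMER=${KAFKA_CONSUMER:-kafka-console-consumer}
consume_topic() {
  export KAFKA_OPTS="-Djava.security.auth.login.config=/var/lib/hive/jaas.conf"
  "$KAFKA_CONSUMER" --bootstrap-server chen102:9092 \
    --topic "$1" --from-beginning \
    --consumer.config /etc/kafka/conf/consumer.properties
}
```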
4.2 Windows Web UI browser authentication
After CDH is configured for Kerberos, the situation shown in the figure below appears:
the NameNode web UI on port 9870 opens, but directories and files cannot be browsed. This is because our local environment has not been authenticated.
Next we set up local authentication.
(1) Download Firefox
(2) Set up the browser
① Open the Firefox browser, enter: about:config in the address bar, and enter the setting page.
② Search for "network.negotiate-auth.trusted-uris", modify the value to your own server host name
③ Search for "network.auth.use-sspi", double-click to change the value to false
(3) Install kfw
① Install the provided kfw-4.1-amd64.msi.
② Copy the contents of the /etc/krb5.conf file of the cluster to C:\ProgramData\MIT\Kerberos5\krb.ini, and delete the configuration related to the path.
Link: https://pan.baidu.com/s/1sMmqTbVcVhNQubjQR5CrCQ
Extraction code: amo6
File content:
[logging]
[libdefaults]
default_realm = CHEN.COM
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
udp_preference_limit = 1
[realms]
CHEN.COM = {
kdc = chen102
admin_server = chen102
}
[domain_realm]
③ Open MIT Kerberos and enter the principal name and password:
④ Test
4.3 User Behavior Data Warehouse
4.3.1 Log collection Flume configuration
Logs are collected by Flume and sent to Kafka, so this Flume acts as a Kafka producer and needs the Kafka client security authentication described above. No manual configuration is needed here: once Kerberos is enabled, CM configures it automatically.
4.3.2 Consume Kafka Flume configuration
The consuming Flume transfers data from Kafka to HDFS, so this Flume acts as a Kafka consumer and likewise needs the Kafka client security authentication above (no manual configuration is needed; CM configures it automatically). In addition, it needs HDFS client security authentication, which must be configured manually.
(1) Generate the keytab file of the hive user.
There are two user authentication methods: entering a password and using a keytab key file. Keytab-file authentication is required here.
kadmin.local -q "xst -k /var/lib/hive/hive.keytab hive/[email protected]"
(2) Increase read permission
chmod +r /var/lib/hive/hive.keytab
(3) Distribute keytab files
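A sketch of the distribution step. The target host names and passwordless root ssh are assumptions, and the SCP/SSH variables are indirection for illustration only:

```shell
# Sketch for step (3): copy the keytab to the other nodes and make it readable
SCP=${SCP:-scp}; SSH=${SSH:-ssh}
distribute_keytab() {
  for host in chen103 chen104; do
    "$SCP" /var/lib/hive/hive.keytab "$host":/var/lib/hive/
    "$SSH" "$host" chmod +r /var/lib/hive/hive.keytab
  done
}
```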
(4) Modify flume agent configuration files
## Components
a1.sources=r1 r2
a1.channels=c1 c2
a1.sinks=k1 k2
## source1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = chen102:9092,chen103:9092,chen104:9092
a1.sources.r1.kafka.topics=topic_start
## source2
a1.sources.r2.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r2.batchSize = 5000
a1.sources.r2.batchDurationMillis = 2000
a1.sources.r2.kafka.bootstrap.servers = chen102:9092,chen103:9092,chen104:9092
a1.sources.r2.kafka.topics=topic_event
## channel1
a1.channels.c1.type=memory
a1.channels.c1.capacity=100000
a1.channels.c1.transactionCapacity=10000
## channel2
a1.channels.c2.type=memory
a1.channels.c2.capacity=100000
a1.channels.c2.transactionCapacity=10000
## sink1
a1.sinks.k1.type = hdfs
#a1.sinks.k1.hdfs.proxyUser=hive
a1.sinks.k1.hdfs.kerberosPrincipal=hive/[email protected]
a1.sinks.k1.hdfs.kerberosKeytab=/var/lib/hive/hive.keytab
a1.sinks.k1.hdfs.path = /origin_data/gmall/log/topic_start/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = logstart-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = second
##sink2
a1.sinks.k2.type = hdfs
#a1.sinks.k2.hdfs.proxyUser=hive
a1.sinks.k2.hdfs.kerberosPrincipal=hive/[email protected]
a1.sinks.k2.hdfs.kerberosKeytab=/var/lib/hive/hive.keytab
a1.sinks.k2.hdfs.path = /origin_data/gmall/log/topic_event/%Y-%m-%d
a1.sinks.k2.hdfs.filePrefix = logevent-
a1.sinks.k2.hdfs.round = true
a1.sinks.k2.hdfs.roundValue = 10
a1.sinks.k2.hdfs.roundUnit = second
## Avoid generating large numbers of small files
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k2.hdfs.rollInterval = 10
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollCount = 0
## Output file type
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k2.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = lzop
a1.sinks.k2.hdfs.codeC = lzop
## Wiring: bind sources and sinks to channels
a1.sources.r1.channels = c1
a1.sinks.k1.channel= c1
a1.sources.r2.channels = c2
a1.sinks.k2.channel= c2
Once configured, restart Flume and run the collection script to verify that collection works normally. Checking HDFS should show the following page, indicating that collection is working
4.3.3 ods layer
Add the following content to ods_log.sh:
kinit -kt /var/lib/hive/hive.keytab hive/hive
beeline -u "jdbc:hive2://chen102:10000/;principal=hive/[email protected]" -n hive -e "$sql"
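Every layer script below repeats the same kinit + beeline pair, so it can be factored into one helper. This is a sketch: the KINIT/BEELINE variables are only indirection for illustration, and real scripts would call the tools directly:

```shell
# Sketch: the kinit + beeline pair shared by every layer script
KINIT=${KINIT:-kinit}
BEELINE=${BEELINE:-beeline}
run_hive_sql() {
  "$KINIT" -kt /var/lib/hive/hive.keytab hive/hive
  "$BEELINE" -u "jdbc:hive2://chen102:10000/;principal=hive/[email protected]" -n hive -e "$1"
}
```

Each script then reduces to defining $sql and calling run_hive_sql "$sql".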
4.3.4 dwd layer
dwd_start_log.sh
kinit -kt /var/lib/hive/hive.keytab hive/hive
beeline -u "jdbc:hive2://chen102:10000/;principal=hive/[email protected]" -n hive -e "$sql"
4.3.5 dws layer
dws_log.sh
kinit -kt /var/lib/hive/hive.keytab hive/hive
beeline -u "jdbc:hive2://chen102:10000/;principal=hive/[email protected]" -n hive -e "$sql"
4.3.6 ads layer
ads_uv_log.sh
kinit -kt /var/lib/hive/hive.keytab hive/hive
beeline -u "jdbc:hive2://chen102:10000/;principal=hive/[email protected]" -n hive -e "$sql"
4.4 Business Data Warehouse
Add kerberos authentication to the business data warehouse respectively
4.4.1 sqoop import
sqoop_import.sh
kinit -kt /var/lib/hive/hive.keytab hive/hive
4.4.2 ods layer
ods_db.sh
kinit -kt /var/lib/hive/hive.keytab hive/hive
beeline -u "jdbc:hive2://chen102:10000/;principal=hive/[email protected]" -n hive -e "$sql"
4.4.3 dwd layer
dwd_db.sh
kinit -kt /var/lib/hive/hive.keytab hive/hive
beeline -u "jdbc:hive2://chen102:10000/;principal=hive/[email protected]" -n hive -e "$sql"
4.4.4 dws layer
dws_db_wide.sh
kinit -kt /var/lib/hive/hive.keytab hive/hive
beeline -u "jdbc:hive2://chen102:10000/;principal=hive/[email protected]" -n hive -e "$sql"
4.4.5 ads layer
ads_db_gmv.sh
kinit -kt /var/lib/hive/hive.keytab hive/hive
beeline -u "jdbc:hive2://chen102:10000/;principal=hive/[email protected]" -n hive -e "$sql"
4.4.6 sqoop export
kinit -kt /var/lib/hive/hive.keytab hive/hive
4.5 Test run
4.5.1 hdfs new directory
hadoop fs -mkdir /user/hive/bin_kerberos
4.5.2 Upload the latest script to the hdfs directory
4.5.3 Create data
4.5.4 Hue page new scheduling task
4.5.5 View execution results
5 Sentry
5.1 Sentry overview
In the CDH distribution of Hadoop, data security is usually handled with the Kerberos + Sentry combination: Kerberos is responsible for user authentication on the platform, while Sentry is responsible for data permission management.
5.2 What is Sentry
Apache Sentry is a Hadoop open source component released by Cloudera, which provides fine-grained, role-based authorization and multi-tenant management mode.
Sentry provides the ability to control and enforce fine-grained privileges on data for authenticated users and applications on a Hadoop cluster. Sentry can currently be used with Apache Hive, Hive Metastore/HCatalog, Apache Solr, Impala, and HDFS (limited to Hive table data).
Sentry is designed as a pluggable authorization engine for Hadoop components. It allows custom authorization rules to validate access requests by users or applications to Hadoop resources. Sentry is highly modular and can support authorization for various data models in Hadoop.
5.3 Sentry installation and deployment
5.3.1 Add Sentry service
5.3.2 Custom Sentry role
5.3.3 Configure database connection
5.3.4 Complete sentry service addition
5.4 Sentry integrates with Hive
5.4.1 Modify configuration parameters
(1) Cancel the HiveServer2 user simulation
Search for "HiveServer2 Enable Impersonation" in the Hive configuration items and uncheck it
(2) Ensure that hive users can submit MR tasks
Search for "allowed system users" in the YARN configuration items and make sure "hive" is included.
(3) Search for "Enable storage notifications in the database" in the Hive configuration item and check it.
(4) Search for "Sentry" in the Hive configuration item, and check Sentry.
5.5 Sentry and Impala configuration
Search for "Sentry" in the Impala configuration item and select it.
5.6 HDFS configuration sentry
1) Search for "Enable Access Control List" in the HDFS configuration item and check it.
2) Search for "Enable Sentry Synchronization" in the HDFS configuration item, and make the modification as shown in the figure below.
5.7 Sentry and Hue
1) Configure HUE to support Sentry
Search for "Sentry" in the HUE configuration item and check Sentry.
6 Sentry permission authentication test
6.1 Operation via HUE
1) View the administrator groups in Sentry permission management.
Search for "Admin Groups" in Sentry's configuration items; it includes hive and impala. Only a user belonging to one of these groups can grant permissions to other users.
2) Create two users reader and writer on all nodes of the Hive cluster to prepare for the permission test.
useradd reader
passwd reader
useradd writer
passwd writer
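A sketch of creating both users on every node in one loop. The host names, the example password, and passwordless root ssh are all assumptions, and the SSH variable is indirection for illustration:

```shell
# Sketch for step 2): create the reader and writer users on every node
SSH=${SSH:-ssh}
create_test_users() {
  for host in chen102 chen103 chen104; do
    for u in reader writer; do
      # 'changeme' is a placeholder password for illustration only
      "$SSH" "$host" "useradd $u && echo 'changeme' | passwd --stdin $u"
    done
  done
}
```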
3) Use the hive user to log in to HUE, create two user groups, reader and writer, and create the reader and writer users under the corresponding groups to prepare for the permission test.
4) Sentry work interface (HUE users need to be granted permission to access Sentry)
5) Click the Roles button and click the Add button
6) Edit Role
admin_role (first add administrator privileges for hive users)
reader_role
writer_role
7) Permission test
(1) Use reader and writer users to log in to HUE respectively
(2) Query any table in the gmall database: only the reader user can query it; the writer user cannot. This shows that permission control has taken effect.
The reader user has the query permission.
The reader user does not have the data insertion permission.
The writer user does not have the query permission.
The writer user has insert permission
6.2 Command line operation
1) Create two users reader_cmd and writer_cmd on all nodes of the Hive cluster
useradd reader_cmd
passwd reader_cmd
useradd writer_cmd
passwd writer_cmd
2) Use the Sentry administrator user hive to connect to HiveServer2 through the beeline client
kinit -kt /var/lib/hive/hive.keytab hive/[email protected]
beeline -u "jdbc:hive2://chen102:10000/;principal=hive/[email protected]"
① Create roles (reader_role_cmd, writer_role_cmd)
create role reader_role_cmd;
create role writer_role_cmd;
② Assign privilege to role
GRANT select ON DATABASE gmall TO ROLE reader_role_cmd;
GRANT insert ON DATABASE gmall TO ROLE writer_role_cmd;
③ Grant the role to the user group
GRANT ROLE reader_role_cmd TO GROUP reader_cmd;
GRANT ROLE writer_role_cmd TO GROUP writer_cmd;
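Steps ① through ③ can also be run non-interactively with beeline -e. A sketch (the BEELINE variable is only indirection for illustration):

```shell
# Sketch: run the role creation and grants (steps ①-③) in one beeline call
BEELINE=${BEELINE:-beeline}
setup_roles() {
  "$BEELINE" -u "jdbc:hive2://chen102:10000/;principal=hive/[email protected]" -e "
    create role reader_role_cmd;
    create role writer_role_cmd;
    GRANT select ON DATABASE gmall TO ROLE reader_role_cmd;
    GRANT insert ON DATABASE gmall TO ROLE writer_role_cmd;
    GRANT ROLE reader_role_cmd TO GROUP reader_cmd;
    GRANT ROLE writer_role_cmd TO GROUP writer_cmd;"
}
```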
After executing the above commands, the newly created roles can also be viewed on the Hue page
④ Check the authorization status
(1) View all roles (administrator only)
SHOW ROLES;
(2) View the roles of a specified user group (administrator only)
SHOW ROLE GRANT GROUP reader_cmd;
(3) View the role of the current authenticated user
SHOW CURRENT ROLES;
(4) View the specific permissions of a specified role (administrator only)
SHOW GRANT ROLE reader_role_cmd;
⑤ Permission test
(1) Create Kerberos principals for reader_cmd and writer_cmd
kadmin.local -q "addprinc reader_cmd/[email protected]"
kadmin.local -q "addprinc writer_cmd/[email protected]"
(2) Use reader_cmd to log in to HiveServer2, and query any table under the gmall library
kinit reader_cmd/[email protected]
beeline -u "jdbc:hive2://chen102:10000/;principal=hive/[email protected]"
Has query permission
but does not have insert permission
(3) Use writer_cmd to log in to HiveServer2, and query any table under the gmall library
kinit writer_cmd/[email protected]
beeline -u "jdbc:hive2://chen102:10000/;principal=hive/[email protected]"
writer_cmd does not have the query table permission
writer_cmd has the insert table permission
(4) Query result
reader_cmd can query tables in the gmall database, while writer_cmd cannot, indicating that the authorization has taken effect.
The next article covers performance testing of the CDH data warehouse project and clearing the CDH cluster; for details, see
"CDH Data Warehouse Project (4) - Cluster Performance Test/Resource Management/Clear CDH Cluster"