Ideas and methods for emergency recovery and troubleshooting of SELinux faults on EC2

Overview

SELinux (Security-Enhanced Linux) is a security module that provides a mandatory access control mechanism for the system. An operating system with SELinux installed and enabled labels every process and system resource with a special security label, called the SELinux context, and allows or denies access based on that context.


According to the basic national requirements for network security level protection, a level 3 system "should set security labels on important subjects and objects in the secure computing environment, and control subjects' access to information resources that carry security labels". Any system that has to pass the national level 3 assessment therefore needs SELinux enabled.

Note: Amazon's official system image, Amazon Linux 2022, has SELinux enabled in enforcing mode by default.

As SELinux becomes more widely used, system failures caused by improper SELinux configuration are also increasing; in serious cases the system may fail to boot. This article first describes how to perform emergency recovery on an EC2 host with an SELinux failure, and then focuses on "label failure", one of the three most common root causes of SELinux problems, to introduce methods and ideas for troubleshooting on AWS.

Two methods for EC2 emergency recovery

EC2 serial console access

If the faulty host meets the prerequisites for serial console access and serial console access is enabled, you can use the EC2 serial console directly for emergency recovery and troubleshooting. The serial console does not require the EC2 instance to have any networking capability. With it, you can enter commands as if a keyboard and monitor were connected directly to the instance's serial port. Serial console sessions persist across instance restarts and stops, so during a restart you can view all boot messages from the beginning. The serial console is not enabled by default and must be explicitly authorized before use, as follows:

1. Grant the account permission to execute the serial console. The recommended IAM policy is as follows:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:GetSerialConsoleAccessStatus",
                "ec2:EnableSerialConsoleAccess",
                "ec2:DisableSerialConsoleAccess"
            ],
            "Resource": "*"
        }
    ]
}


After this policy is configured, the account can see the "EC2 Serial Console Access" option when connecting to an EC2 instance. The initial default is "Prohibited". Click "Manage" and set it to "Allow". Once this is done, EC2 serial console access is available to the account.

2. Before logging in to the EC2 server through the serial console, you also need a user on the instance that has a password set and can log in with it on the serial console. Log in to the EC2 server remotely over SSH in the usual way, and then use passwd to set the password. The following uses root as an example:

[ec2-user ~]$ sudo passwd root  

3. Disable SELinux for emergency recovery. For system failures caused by SELinux, you can first disable SELinux to recover the system and then troubleshoot. After logging in to the server through the serial console, edit /etc/selinux/config, set SELINUX=disabled, and restart the server for the change to take effect.
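
The edit in step 3 can also be scripted. The following is a minimal sketch, not the author's method: set_selinux_mode is a hypothetical helper name, and it assumes the config file contains a line beginning with SELINUX= as in the default /etc/selinux/config. It keeps a .bak copy so the original setting can be restored after troubleshooting.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the SELINUX= line of an SELinux config file.
# Usage: set_selinux_mode <config-file> <enforcing|permissive|disabled>
set_selinux_mode() {
    config_file=$1
    mode=$2
    case $mode in
        enforcing|permissive|disabled) ;;
        *) echo "invalid mode: $mode" >&2; return 1 ;;
    esac
    # Keep a backup of the original file next to it.
    cp -p "$config_file" "$config_file.bak" || return 1
    # Only the SELINUX= line is changed; SELINUXTYPE= and comments are kept.
    # Writing back with '>' reuses the existing file instead of replacing it.
    sed "s/^SELINUX=.*/SELINUX=$mode/" "$config_file.bak" > "$config_file"
}

# Example (run as root on the affected system, then reboot):
#   set_selinux_mode /etc/selinux/config disabled
```

The same helper can be pointed at the config file on a mounted rescue volume by passing that path instead.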

Rescue instance

For servers that do not support serial console access, or that support it but did not have it enabled beforehand, you can use a rescue instance to perform emergency recovery on the host with the SELinux failure. The steps are as follows:

  1. Launch a new Amazon EC2 instance in the Virtual Private Cloud (VPC) using the same Amazon Machine Image (AMI) as the impaired instance and in the same Availability Zone. The new instance will become your "rescue" instance. Alternatively, you can use an existing instance that you have access to, but only if it uses the same AMI as the impaired instance and is in the same Availability Zone.
  2. Detach the Amazon Elastic Block Store (Amazon EBS) root volume (/dev/xvda or /dev/sda1) from the impaired instance. Make a note of the device name so that you can use the same name when you reattach the volume later.
  3. Attach the EBS volume as a secondary device (/dev/sdf) to the rescue instance.
  4. Use SSH to connect to your rescue instance.
  5. Become root, use lsblk to identify the correct device name, and save it for use throughout the process:
$ sudo -i
# lsblk
# rescuedev=/dev/xvdf1

NOTE: The volume may be attached to the rescue instance under a device name other than /dev/xvdf1. Use the lsblk command to view the available disk devices and their mount points to determine the correct device name.

6. Choose a temporary mount point and make sure it exists. Use /mnt unless that mount point is already in use:

# rescuemnt=/mnt
# mkdir -p $rescuemnt

7. Mount the root file system from the attached volume:

# mount $rescuedev $rescuemnt

Note: If mounting the volume fails, check the output of dmesg | tail. If the log shows a UUID conflict, mount with the option -o nouuid.

8. Modify the SELinux configuration file on the mounted volume:

# cd $rescuemnt
# vi ./etc/selinux/config

9. When finished, leave the mount point and unmount the secondary device:

# cd /
# umount $rescuemnt
# exit

10. Detach the secondary volume (/dev/sdf) from the rescue EC2 instance and attach it to the original instance as the root volume (/dev/xvda or /dev/sda1). Make sure the device name is the same as the one noted in step 2.

11. Start the EC2 instance, and then verify that it boots normally.
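
The volume moves in steps 2, 3 and 10 can also be done from the AWS CLI rather than the console. The sketch below is only an illustration under stated assumptions: run and move_volume are hypothetical helper names, the AWS CLI is assumed to be installed and configured, and the instance that currently holds the volume is assumed to be stopped. Setting DRY_RUN=1 prints the commands instead of executing them.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: detach an EBS volume from wherever it is attached,
# wait until it is available, then attach it to another instance.
# Usage: move_volume <volume-id> <target-instance-id> <device-name>
move_volume() {
    volume_id=$1
    target_instance=$2
    device=$3
    run aws ec2 detach-volume --volume-id "$volume_id" &&
    run aws ec2 wait volume-available --volume-ids "$volume_id" &&
    run aws ec2 attach-volume --volume-id "$volume_id" \
        --instance-id "$target_instance" --device "$device"
}

# Example with placeholder IDs (replace with your own):
#   move_volume vol-EXAMPLE i-RESCUE /dev/sdf      # steps 2 and 3
#   move_volume vol-EXAMPLE i-ORIGINAL /dev/xvda   # step 10
```

Running it first with DRY_RUN=1 is a cheap way to check the volume ID, instance ID and device name before anything is detached.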

SELinux log analysis

After the system is restored, the first step is to analyze the SELinux denial log to find out which processes SELinux blocked and thereby caused the system exception. By default, SELinux denials are recorded in the /var/log/audit/audit.log and /var/log/messages files. In both files, the records of SELinux denials can be filtered out with the keyword "AVC" (Access Vector Cache). Below is an example of an AVC denial record and its associated system call record (for the meaning of each field, refer to the official Red Hat documentation):

type=AVC msg=audit(1226874073.147:96): avc: denied { getattr } for pid=2465 comm="httpd" 
path="/var/www/html/file1" dev=dm-0 ino=284133 scontext=unconfined_u:system_r:httpd_t:s0 
tcontext=unconfined_u:object_r:samba_share_t:s0 tclass=file 

type=SYSCALL msg=audit(1226874073.147:96): arch=40000003 syscall=196 success=no exit=-13 
a0=b98df198 a1=bfec85dc a2=54dff4 a3=2008171 items=0 ppid=2463 pid=2465 auid=502 uid=48 
gid=48 euid=48 suid=48 fsuid=48 egid=48 sgid=48 fsgid=48 tty=(none) ses=6 comm="httpd" 
exe="/usr/sbin/httpd" subj=unconfined_u:system_r:httpd_t:s0 key=(null) 

By analyzing the SELinux denial log, you can identify the process and the file path that SELinux blocked when the fault occurred.
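
As a rough illustration of this kind of filtering, the sketch below counts AVC denials per process, source context and target context using only grep and sed. summarize_avc is a hypothetical helper name, and the script assumes each audit record occupies a single line, as it does in /var/log/audit/audit.log (the example above is wrapped only for display).

```shell
#!/bin/sh
# Hypothetical helper: count AVC denials by process name (comm),
# source context (scontext) and target context (tcontext).
# Usage: summarize_avc <audit-log-file>
summarize_avc() {
    log_file=$1
    # Keep only AVC denial records, whatever the spacing after "avc:".
    grep 'avc: *denied' "$log_file" |
        # Extract the three fields of interest from each record.
        sed -n 's/.*comm="\([^"]*\)".*scontext=\([^ ]*\).*tcontext=\([^ ]*\).*/\1 \2 \3/p' |
        # Count identical combinations, most frequent first.
        sort | uniq -c | sort -rn
}

# Example (run as root):
#   summarize_avc /var/log/audit/audit.log
```

Each output line is a count followed by the process name and the two contexts, so the most frequently denied combination appears first.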

Note: If the abnormal blocked process cannot be identified directly from the log, you can take advantage of the fact that both audit and messages logs are archived by default: compare the denial log from the time of the fault with the denial log from a normal period to identify the processes that were additionally blocked when the fault occurred.
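
The comparison described in this note could look roughly like the following. This is a sketch under assumptions: denied_procs and new_denied_procs are hypothetical helper names, and the two arguments are uncompressed log files, one from a normal period and one covering the fault.

```shell
#!/bin/sh
# Hypothetical helper: list the distinct process names (comm) that appear
# in AVC denial records of one log file.
denied_procs() {
    grep 'avc: *denied' "$1" | sed -n 's/.*comm="\([^"]*\)".*/\1/p' | sort -u
}

# Hypothetical helper: print processes denied in the fault-time log
# that were not denied in the baseline log.
# Usage: new_denied_procs <baseline-log> <fault-log>
new_denied_procs() {
    baseline_list=$(mktemp)
    fault_list=$(mktemp)
    denied_procs "$1" > "$baseline_list"
    denied_procs "$2" > "$fault_list"
    # comm -13 keeps only the lines unique to the second (fault-time) list.
    comm -13 "$baseline_list" "$fault_list"
    rm -f "$baseline_list" "$fault_list"
}

# Example (the archived file name is a placeholder):
#   new_denied_procs /var/log/audit/audit.log.1 /var/log/audit/audit.log
```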

Process context label reset

There are two ways to analyze context labels. One is to analyze directly the context fields of the subject and object in the denial log, including the SELinux user, role, type, and level. This requires administrators to have a very deep understanding of SELinux, so the bar is high.

The other is simpler and more practical. For systems that only use the default SELinux policy, you can use the rhel-autorelabel service to restore the system's context labels to their defaults, and then compare the file label attributes before and after the reset to identify the abnormal context labels.

There are two ways to compare file label attributes. One is manual comparison: use the system command "ls -Z file" to view and record the context label of the file before and after the reset. This is suitable for comparing a small number of files. The other is to install the AIDE file integrity checking tool on the system, add the files that need to be compared to the aide.conf file, and let AIDE perform the comparison automatically. AIDE is used as follows:

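cp /etc/aide.conf /etc/aide.conf_bak   # optional: back up the original conf file
vi /etc/aide.conf   # optional: add the files to compare on top of the defaults; these can come from one or more abnormal processes identified in the log analysis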
aide --init 
mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz
aide --check
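
For the manual comparison, the before and after labels can be saved to files and compared with diff. The following is a minimal sketch: snapshot_labels and diff_labels are hypothetical helper names, and the snapshot step assumes an ls that supports the -Z option on an SELinux-enabled system.

```shell
#!/bin/sh
# Hypothetical helper: print the SELinux context of each given path,
# one per line (directories are listed themselves, not their contents).
# Usage: snapshot_labels <path>... > labels.before
snapshot_labels() {
    ls -dZ -- "$@"
}

# Hypothetical helper: show the label lines that differ between two snapshots.
# diff exits with 1 when the files differ, which is treated as success here.
# Usage: diff_labels <before-file> <after-file>
diff_labels() {
    diff "$1" "$2" || [ $? -eq 1 ]
}

# Example (paths are placeholders taken from the log example):
#   snapshot_labels /var/www/html/file1 > /root/labels.before
#   ... relabel and reboot ...
#   snapshot_labels /var/www/html/file1 > /root/labels.after
#   diff_labels /root/labels.before /root/labels.after
```

Lines prefixed with "<" show the label before the reset and lines prefixed with ">" show the label after it.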

The commands to enable the rhel-autorelabel service are as follows:

systemctl enable rhel-autorelabel
touch /.autorelabel
reboot
# After the system reboots, files are relabeled with their SELinux contexts

Failure cause verification

After label reset and comparison have identified the specific cause of the label failure, you can use the system command chcon to change the context label of the abnormal process or file to a correct or an incorrect value, and check whether the fault disappears or reappears. This confirms the root cause of the failure.
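
A verification round might look like the sketch below. It is an illustration only: run, apply_test_label and restore_default_label are hypothetical helper names, the path and type in the example come from the log excerpt earlier in this article, and restorecon is used here as an assumed way to return a file to the policy default (the article itself only mentions chcon). Setting DRY_RUN=1 prints the commands instead of executing them.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: show the current label, then set the SELinux type
# of a file to the value under test.
# Usage: apply_test_label <path> <selinux-type>
apply_test_label() {
    run ls -Z "$1" &&
    run chcon -t "$2" "$1"
}

# Hypothetical helper: return the file to the label defined by the policy.
# Usage: restore_default_label <path>
restore_default_label() {
    run restorecon -v "$1"
}

# Example (run as root):
#   apply_test_label /var/www/html/file1 samba_share_t   # fault should reappear
#   restore_default_label /var/www/html/file1            # fault should disappear
```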

Summary

This article focused on "label failure", a common source of SELinux faults, and discussed ideas and methods for analyzing SELinux faults. In a real environment, however, system faults have many possible causes, and SELinux faults alone can arise in many ways, so you need to consult the system log files and the official documentation and analyze each case in detail.

The author of this article


Wang Junfeng

Security consultant on the Amazon Cloud Technology Professional Services team, responsible for the consulting, design, and implementation of cloud security compliance and cloud security solutions. Committed to providing customers with security best practices for migrating to the cloud and to addressing their security needs during migration.

Article source: https://dev.amazoncloud.cn/column/article/63186bc260678178b03b8104?sc_medium=regulartraffic&sc_campaign=crossplatform&sc_channel=CSDN 
