Paper study: On Training Robust PDF Malware Classifiers

Source: 29th USENIX Security Symposium 2020

Link: https://www.usenix.org/conference/usenixsecurity20/presentation/chen-yizheng

Knowledge points

Reference: https://blog.csdn.net/Shall_ByeBye/article/details/106883218

Article content

The paper points out that malicious PDF files can easily evade existing detectors, while current models blindly pursue high detection accuracy and low false-positive rates. This work instead trains a robust malicious-PDF detection model by enforcing robustness properties during training. Compared with ordinary detection models, it is more robust and holds up better against common evasion attacks.

This work proposes a new robust training method for PDF malware classifiers, based on Verifiably Robust Training. It exploits the fact that a valid PDF must parse into a tree structure, defines a new distance metric over PDF trees, and uses this metric to specify two kinds of robustness properties: subtree insertion and subtree deletion. As long as an attack stays within a specified robustness property, no attacker, however strong, can produce a variant that evades the classifier. For example, if the robustness property is subtree insertion, then no PDF malware variant generated by inserting a subtree can escape detection.
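To make the two mutation operations concrete, here is a minimal sketch (not the paper's implementation) that models a parsed PDF as a tree of nested dicts and applies subtree insertion/deletion directly under the root; the object names (`/Catalog`, `/OpenAction`, etc.) are illustrative only.

```python
# Sketch: a parsed PDF as a tree (dict of name -> subtree), with the two
# mutation operations the robustness properties are defined over.
import copy

def insert_subtree(pdf_tree, name, subtree):
    """Return a variant with one extra subtree under the root."""
    variant = copy.deepcopy(pdf_tree)
    variant[name] = subtree
    return variant

def delete_subtree(pdf_tree, name):
    """Return a variant with one subtree removed from under the root."""
    variant = copy.deepcopy(pdf_tree)
    variant.pop(name, None)
    return variant

# Hypothetical toy malware tree (names chosen for illustration).
malware = {
    "/Catalog": {"/Pages": {}, "/OpenAction": {"/JS": {}}},
    "/Info": {},
}

# An insertion attack pads the file with benign-looking structure while
# leaving the malicious payload (/OpenAction) untouched.
variant = insert_subtree(malware, "/Metadata", {"/XML": {}})
```

A robustly trained classifier must assign `variant` the same (malicious) label as `malware`, because the variant differs only by one inserted subtree.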

Difficulties:

① Past studies have shown that when adversarial examples are mixed into the training set of a malware classifier, the false-positive rate of the trained model becomes very high.

Solution: propose a new metric over the PDF structure tree, which tightly bounds the robust region and thereby reduces the model's false-positive rate.

② The algorithms traditionally popular for malicious-PDF detection, such as random forests, are not suitable for training a robust classifier.

Solution: use a neural network, which supports verifiably robust training.

③ To evaluate the robustness of the proposed model, 7 kinds of attacks were used to test it against 12 baseline models.

Innovation:

①: Although machine learning offers many performance metrics for judging model quality, none of them is suitable for evaluating the robustness of a detector against adaptive evasion attacks. The paper therefore proposes a new metric specifically for PDF files, which also keeps the model's FPR low.

The authors observe that any PDF malware variant that preserves its malicious functionality must satisfy correct PDF syntax, i.e., it must still parse into a tree structure. To generate variants systematically and efficiently, an attacker must rely on subtree insertion and subtree deletion operations; as long as the classifier is robust against these two operations, it is robust against such evasion attacks.
Based on this, the authors propose the subtree distance as the distance metric: the subtree distance between two PDFs is the number of subtrees under their root nodes that differ. No matter what subtree is inserted under the root of x, the subtree distance between x and the resulting variant x̃ is 1. This tightly bounds the robust region and reduces the FPR.
Using this distance, the authors specify two basic robustness properties at subtree distance 1: subtree insertion and subtree deletion.
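The subtree distance can be sketched in a few lines. This is an assumed reading of the definition above (trees as nested dicts, comparing only the children directly under the root), not the paper's code:

```python
def subtree_distance(t1, t2):
    """Number of root-level subtrees that differ between two PDF trees.

    A root-level subtree counts as different if it exists in only one
    tree, or exists in both but with different contents.
    """
    names = set(t1) | set(t2)
    return sum(1 for n in names if t1.get(n) != t2.get(n))

# Inserting one subtree under the root yields a variant at distance 1,
# regardless of how large the inserted subtree is.
x = {"/Catalog": {"/Pages": {}}, "/Info": {}}
x_variant = {"/Catalog": {"/Pages": {}}, "/Info": {}, "/Padding": {"/A": {}, "/B": {}}}
d = subtree_distance(x, x_variant)  # 1
```

This is why the metric bounds the robust region so tightly: an attacker can add arbitrarily much content inside one subtree, yet the variant stays at distance 1 from the original.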

Concretely, the robustness property states: for any variant at subtree distance 1, generated by performing arbitrary subtree insertions (or deletions) on a malware sample, the classifier will not classify it as benign. These properties can all be extended to subtree distance N.

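To show how such a property can be checked for every possible insertion at once, here is a minimal sketch of interval bound propagation, the general idea behind verifiably robust training; it is not the paper's code, and the tiny two-layer network and feature layout are assumptions for illustration. Features are binary indicators of PDF structure, and a subtree insertion can only turn features on, so the insertion property corresponds to the input box between x and the vector with all insertable features set to 1.

```python
# Sketch: verified robustness to subtree insertion via interval bound
# propagation through a 2-layer ReLU network (illustrative, not the
# paper's implementation).
import numpy as np

def interval_forward(W1, b1, W2, b2, lo, hi):
    """Propagate an input box [lo, hi] through the network, returning
    sound lower/upper bounds on the output logit (malicious score)."""
    Wp, Wn = np.maximum(W1, 0), np.minimum(W1, 0)
    h_lo = Wp @ lo + Wn @ hi + b1   # worst-case low pre-activation
    h_hi = Wp @ hi + Wn @ lo + b1   # worst-case high pre-activation
    h_lo, h_hi = np.maximum(h_lo, 0), np.maximum(h_hi, 0)  # ReLU
    wp, wn = np.maximum(W2, 0), np.minimum(W2, 0)
    out_lo = wp @ h_lo + wn @ h_hi + b2
    out_hi = wp @ h_hi + wn @ h_lo + b2
    return out_lo, out_hi

def verified_insertion_robust(W1, b1, W2, b2, x, insertable):
    """True if every insertion variant of x is still scored malicious
    (logit > 0), covering all inserted subtrees in one pass."""
    lo = x.copy()
    hi = np.where(insertable, 1.0, x)  # insertion only adds features
    out_lo, _ = interval_forward(W1, b1, W2, b2, lo, hi)
    return bool(out_lo > 0)

# Toy check: a 2-feature network whose malicious score stays positive
# no matter what the attacker inserts into the second feature.
W1, b1 = np.eye(2), np.zeros(2)
W2, b2 = np.array([1.0, -0.5]), 0.0
x = np.array([1.0, 0.0])                 # feature 0: malicious payload
ins = np.array([False, True])            # feature 1: insertable padding
robust = verified_insertion_robust(W1, b1, W2, b2, x, ins)  # True
```

During verifiably robust training, a loss on the worst-case bound `out_lo` (rather than on a single point) pushes the network to satisfy the property over the whole insertion box.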
Origin: blog.csdn.net/qq_40742077/article/details/108898146