Behavioral Analysis in Malware Detection


Malware has long been a major threat to information security. Methods for analyzing and defending against this type of attack vary. There are generally two approaches: static and dynamic analysis.

The purpose of static analysis is to look for patterns of malicious content in files or process memory. These may be strings, fragments of encoded or compressed data, or sequences of compiled code. Not only individual patterns can be searched for, but also combinations of patterns with additional conditions (e.g. binding a pattern to a specific location in the signature, or checking the relative distance between the patterns' locations).

Dynamic analysis is the analysis of a program's behavior. The program can be run in so-called emulation mode, in which its actions are safely interpreted without causing damage to the OS. Another option is to run the program in a virtual environment (sandbox). In this case its actions are actually performed on the system, and each call is logged. The level of detail recorded is a balance between the depth of observation and the performance of the analysis system. The output is a log (behavior trace) of the program's actions on the operating system, which can be analyzed further.

A key advantage of dynamic, or behavioral, analysis is that no matter how hard an attacker tries to obfuscate the code and its malicious intent, the malicious activity itself will be visible to virus analysts. Reducing the malware detection task to the analysis of actions also lets us make assumptions about the robustness of the resulting detection algorithms. Moreover, since the initial state of the analysis environment is always the same (a snapshot of the virtual machine state), the repeatability of behavior simplifies the task of classifying legitimate and malicious behavior.

Typically, approaches to behavioral analysis are based on rule sets. Expert knowledge is encoded into signatures, on the basis of which malware detection tools draw their conclusions. However, a problem arises: only attacks that strictly match the written rules are detected, while attacks that do not meet these conditions but are still malicious may be missed. The same problem occurs when a piece of malware changes. This can be addressed with softer trigger criteria, i.e. one more general rule can be written, or a large number of rules per malware family. In the first case we risk many false positives, while in the second case a serious time investment is required, which can lead to a lag in necessary updates.

It is necessary to extend the knowledge we already have to other, similar cases: those we have not encountered before or have not covered with rules, but where the similarity of certain characteristics lets us conclude that the activity may be malicious. This is where machine learning algorithms come into play.

ML models, when trained correctly, generalize. This means that instead of simply memorizing the examples it was trained on, the trained model is able to make decisions about new examples based on the patterns in the training data.

However, for generalizability to work, two main factors must be considered during the training phase:

  • The feature set should be as complete as possible (so that the model can see as many patterns as possible and thus better extend its knowledge to new examples), but not redundant (so as not to store and process features that carry no useful information for the model).
  • Datasets should be representative, balanced and regularly updated.

Since we were able to gather the amount of data we needed, and assumed that machine learning could extend existing solutions, we decided to carry out this research: generate a feature set, train a model on it, and verify that we can trust the accuracy of the model's conclusions about the maliciousness of a file.

How expert knowledge can be brought into machine learning models

In the context of malware analysis, the raw data are the files themselves, while the intermediate data are the processes spawned when those files are executed. These processes in turn make system calls. These call sequences are the data we need to convert into feature sets.

The creation of the dataset starts on the expert side. Attributes that experts consider meaningful for malware detection are selected. All of these attributes can be reduced to n-grams over the system call sequence. The model is then used to estimate which attributes contribute the most to detection, redundant attributes are discarded, and the final version of the dataset is obtained.

Source data:

{"count":1,"PID":"764","Method":"NtQuerySystemInformation","unixtime":"1639557419.628073","TID":"788","plugin":"syscall","PPID":"416","Others":"REST: ,Module=\"nt\",vCPU=1,CR3=0x174DB000,Syscall=51,NArgs=4,SystemInformationClass=0x53,SystemInformation=0x23BAD0,SystemInformationLength=0x10,ReturnLength=0x0","ProcessName":"windows\\system32\\svchost.exe"}  

{"Key":"\\registry\\machine","GraphKey":"\\REGISTRY\\MACHINE","count":1,"plugin":"regmon","Method":"NtQueryKey","unixtime":"1639557419.752278","TID":"3420","ProcessName":"users\\john\\desktop\\e95b20e76110cb9e3ecf0410441e40fd.exe","PPID":"1324","PID":"616"}  

{"count":1,"PID":"616","Method":"NtQueryKey","unixtime":"1639557419.752278","TID":"3420","plugin":"syscall","PPID":"1324","Others":"REST: ,Module=\"nt\",vCPU=0,CR3=0x4B7BF000,Syscall=19,NArgs=5,KeyHandle=0x1F8,KeyInformationClass=0x7,KeyInformation=0x20CD88,Length=0x4,ResultLength=0x20CD98","ProcessName":"users\\john\\desktop\\e95b20e76110cb9e3ecf0410441e40fd.exe"}  

Intermediate data (sequence):

syscall_NtQuerySystemInformation*regmon_NtQueryKey*syscall_NtQueryKey
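
As an illustration, here is a minimal sketch (in Python) of how log records like the ones above could be turned into a token sequence and n-gram counts. The field names and the plugin_Method token format follow the example records; the function names and the choice of n are assumptions made for the example, not a description of the production pipeline.

import json
from collections import Counter

def parse_events(log_lines):
    """Parse raw JSON log records into a time-ordered list of tokens,
    where a token is 'plugin_Method' as in the sequence shown above."""
    events = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        token = f"{record['plugin']}_{record['Method']}"
        events.append((float(record["unixtime"]), token))
    # Sort by time to reconstruct the call sequence (stable for equal timestamps).
    events.sort(key=lambda event: event[0])
    return [token for _, token in events]

def ngram_features(sequence, n=2):
    """Count n-grams of consecutive tokens; the counts serve as features."""
    grams = zip(*(sequence[i:] for i in range(n)))
    return Counter("*".join(gram) for gram in grams)

Applied to the three records above, parse_events yields the sequence syscall_NtQuerySystemInformation, regmon_NtQueryKey, syscall_NtQueryKey, and ngram_features with n=2 produces two bigram counts from it.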

How the model's knowledge is accumulated, how this process changes over time, and why it is necessary to stop accumulating data at the right moment

As mentioned above, the basic requirements for the data are representativeness, balance and regular updating. Let us clarify what these three points mean for the behavioral analysis of malicious files:

1. Representativeness. The feature distribution of the data should be close to the distribution encountered in real life.

2. Balance. The raw data on which the model is trained is labeled "legitimate" or "malicious", and this information is passed to the model; in other words, we solve a supervised classification problem, and the number of malicious examples should be close to the number of clean ones.

3. Regular updating. This is closely related to representativeness: since trends in malicious files are constantly changing, the model's knowledge needs to be updated regularly.

Taking into account all of the above requirements, the following data accumulation process was established (a short sketch of the core logic follows the list):

1. The data are divided into two types: the main data stream and benchmark (reference) cases. Benchmark cases are manually checked by experts, so their labels are guaranteed to be correct; they are used to validate models and to curate the training samples through the addition of new benchmarks. The main stream is labeled by rules and automatic checks; it is used to enrich the samples with a variety of real-life examples.

2. All benchmarks are immediately added to the training samples.

3. Additionally, data from the stream is added until the amount required for training is reached. The required volume is understood here as the number of training samples that has proven to be sufficiently complete (in terms of data diversity) and representative. Since benchmark cases are manually verified by experts, it is not feasible to collect tens of thousands of them, so the variety of data is increased from the data stream.

4. The model is periodically tested on new data from the stream.

5. Above all, accuracy on the benchmark examples must be preserved. If a contradiction arises, the benchmark data are given priority and are kept in the training samples in any case.
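
The following is a minimal sketch of the accumulation logic from steps 2 and 3. The names and the idea of a single fixed required_volume are illustrative assumptions; the point is only the priority order: benchmark cases always go in, and stream cases fill the sample up to the required volume.

def build_training_set(benchmark_cases, stream_cases, required_volume):
    """Benchmark cases are always included; stream cases are added only
    until the required training volume is reached."""
    training = list(benchmark_cases)   # benchmarks are kept unconditionally
    for case in stream_cases:
        if len(training) >= required_volume:
            break
        training.append(case)          # stream data adds diversity
    return training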

Over time, enough data is accumulated from the stream that this automatic accumulation should be stopped in favor of a more controlled training sample:

1. The accumulated training samples are fixed from this point on;

2. Data from the stream is now used only to test the model; no new instances are added to the training samples;

3. The training samples are updated only when the benchmark set itself is updated.

Thus, we were able to achieve the following:

1. We verified that the trained and fixed model remains sufficiently robust to data drift;

2. Every new instance added to the training samples is controlled (benchmark cases are checked manually by experts);

3. We can track every change and guarantee accuracy on the benchmark datasets.

How to ensure the quality of the model improves with each update

After the data accumulation process described above, a legitimate question might arise: why are we so sure that each update improves the model?

The answer is, once again, the benchmark sample. We consider it the most reliable, since its examples were manually checked and labeled by experts, and every time an update is made, the first thing we verify is that we still achieve 100% accuracy on this sample. Tests performed "in the wild" confirmed the increasing accuracy.

This is achieved by cleaning contradictory data from the training samples. By contradictory data we mean examples accumulated from the stream that are close enough, in terms of vector distance, to traces from the benchmark sample, but carry the opposite label.

Our experiments show that such examples are outliers even from the perspective of the data stream: after removing them from the training samples in order to preserve accuracy on the benchmark sample, accuracy on the stream improves as well.
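
Below is a minimal sketch of such a cleaning step, assuming the traces have already been turned into numeric feature vectors (NumPy arrays). The Euclidean distance metric and the threshold are illustrative assumptions; the article does not specify which metric or threshold is used in practice.

import numpy as np

def remove_contradictions(stream_X, stream_y, bench_X, bench_y, threshold):
    """Drop stream examples that lie within `threshold` of a benchmark
    example but carry the opposite label."""
    keep = np.ones(len(stream_X), dtype=bool)
    for i, (x, y) in enumerate(zip(stream_X, stream_y)):
        # Distance from this stream example to every benchmark example.
        distances = np.linalg.norm(bench_X - x, axis=1)
        close = distances <= threshold
        if np.any(close & (bench_y != y)):
            keep[i] = False   # contradicts a nearby benchmark label
    return stream_X[keep], stream_y[keep]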

How ML methods and behavioral detection complement each other when used in combination

ML models perform very well when combined with behavioral detection in the form of correlation rules. The emphasis on combination is important: the model generalizes well in situations where a solution needs to be scaled to detect similar and related cases, but it does not replace signatures for standard, well-described cases.

Examples where ML methods can really scale solutions are:

- Unusual chains of child processes. By itself, a large number of spawned processes is a legitimate phenomenon. But anomalies in the number of nodes, the degree of nesting, and the repetition (or absence of repetition) of certain process names are all noticed by the model, while a human is unlikely to work out in advance which of these combinations are malicious (a small sketch of such features is given after this list).

- Non-standard values for default call parameters. In most cases analysts are interested in the meaningful function parameters, looking for something malicious in them. The remaining parameters, which roughly speaking hold default values, are of little interest. But at some point a sixth value shows up where, say, the usual five defaults are expected. An analyst might never have guessed this was possible, but the model notices it.

- Atypical sequences of function calls. This is the situation when each function in isolation does nothing malicious, and no single call looks suspicious on its own, but the particular sequence simply does not occur in legitimate software. It takes enormous experience for an analyst to spot such patterns, yet the model noticed them (more than one), solving the classification problem using a feature that was never intended to be an indicator of maliciousness.
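
As an illustration of the first point, here is a minimal sketch of features that could describe a chain of child processes (number of nodes, nesting depth, repeated names). The feature choice and the parent/child representation are assumptions made for the example, not the feature set actually used by the model.

from collections import Counter

def process_chain_features(edges):
    """edges: list of (parent_pid, child_pid, child_name) tuples
    describing the observed process tree."""
    parents = {child: parent for parent, child, _ in edges}
    names = [name for _, _, name in edges]

    def depth(pid):
        # Length of the parent chain from this process up to the root.
        d = 0
        while pid in parents:
            pid = parents[pid]
            d += 1
        return d

    return {
        "node_count": len(edges),
        "max_depth": max((depth(child) for _, child, _ in edges), default=0),
        "repeated_names": sum(1 for c in Counter(names).values() if c > 1),
    }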

Examples where signature-based behavioral analysis remains important:

- Malicious behavior expressed as a single use of a specific object. The system uses hundreds of objects with varying frequency, and a model is unlikely to pick out the use of one particular object against the background of a million others: the granularity of such an anomaly remains too low.

- Proactive detection based on a threat model. Experts can decide in advance that certain actions on certain system objects are unacceptable, even if observed only once. The first time a model encounters such an event, it may not recognize its importance, so there is a risk of errors or uncertain decisions at the classification stage.

- Blurring of a sequence of actions. For example, it may be known that 3-4 actions performed in a certain order are malicious, and what happens in between does not matter. If garbage actions are inserted between those 3-4 key steps, the model can be confused into an incorrect decision. And the dimensionality of the feature space makes it impossible to compensate for this by storing all combinations of call sequences rather than just aggregate counts.
