The paper “Detection of Malicious Code Variants Based on Deep Learning”: a personal summary

This is just something interesting I found, and maybe this summary will be useful to readers like me.

——————————————————————————————————————————————

Background

Malicious code attacks have increased exponentially, with malicious code variants ranking as a key threat to Internet security.

Related work

A. Malware Detection Based on Feature Analysis （detecting the malicious behavior of applications）
（1）Static analysis
1）Kernel behavior analysis:
Disadvantage: Easily deceived by obfuscation techniques.
2）Trace semantics:
Advantage: Effective in combating the obfuscation of instructions.
Disadvantage: Limited to feature extraction and analysis at the instruction level. Moreover, the pattern matching is complex.
（2）Dynamic analysis
Disadvantages:
A variety of countermeasures have been developed that cause dynamic analysis to generate unreliable results.
It is time consuming because of the large computational overhead, leading to low efficiency on large datasets.

B. Malicious Code Visualization
（1）Self-organizing maps
（2）Tree maps and thread graphs （better than the one above）
Disadvantage of the two studies above: the malicious code may carry much more information, but it is not fully utilized.
Two later studies improved on this:
（3）Binary texture analysis
Implementation method: converted the malware executable file into grayscale images, and then identified malware according to the texture features of these images.
（4）Color image matrices
Implementation method: classified malware families by using an image processing method.

C. Image Processing Techniques for Malware Detection
（1）GIST algorithm
Disadvantage: time consuming
（2）Bio-inspired parallel implementation
（3）An image fusion algorithm based on shearlets and a genetic algorithm

D. Malware Detection Based on Deep Learning
（1）An online malware detection prototype system
（2）Deep belief network （DBN）
Advantage of the two studies above: demonstrated increased accuracy in detecting new malware variants.
Disadvantage of the two studies above: because they are based on static and dynamic analysis, they remain subject to the limitations of feature extraction.

What is Deep Learning?
Deep learning uses a deep neural network to simulate the human brain's learning processes. Neural networks can approximate complex functions by learning a deep nonlinear network structure, which lets them solve complex problems.

MALWARE DETECTION BASED ON A CNN
CNN （convolutional neural network）

How CNN-based malware detection works: First, the binary files of malicious code are transformed into grayscale images. Next, a convolutional neural network is used to identify and classify the images. The image classification results then give the automatic recognition and classification of malicious software.

A. Binary Malware to Gray Image

Principle: A malware binary bit string can be split into substrings that are 8 bits in length. Each substring can be treated as a pixel, because 8 bits can be interpreted as an unsigned integer in the range 0–255. After this conversion, the bit string becomes a 1-D vector of decimal numbers. Given a specified width, this 1-D vector can be reshaped into a 2-D matrix of that width. Finally, the malicious code matrix is interpreted as a grayscale image.

Example:
0110000010101100 → 01100000, 10101100 → 96, 172
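
The conversion can be sketched in a few lines of Python. This is only an illustration, not the paper's code: the function names `bits_to_pixels` and `pixels_to_matrix` are my own, and padding an incomplete last row with zeros is an assumption.

```python
def bits_to_pixels(bit_string):
    """Split a bit string into 8-bit chunks and read each as an unsigned integer (0-255)."""
    if len(bit_string) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    return [int(bit_string[i:i + 8], 2) for i in range(0, len(bit_string), 8)]


def pixels_to_matrix(pixels, width):
    """Reshape a 1-D pixel vector into rows of `width` pixels.

    Assumption: an incomplete last row is padded with zeros (black pixels).
    """
    if width <= 0:
        raise ValueError("width must be positive")
    rows = [pixels[i:i + width] for i in range(0, len(pixels), width)]
    if rows and len(rows[-1]) < width:
        rows[-1] = rows[-1] + [0] * (width - len(rows[-1]))
    return rows


pixels = bits_to_pixels("0110000010101100")
print(pixels)                       # [96, 172]
print(pixels_to_matrix(pixels, 2))  # [[96, 172]]
```

For a real file read in binary mode, calling `list(data)` on the raw `bytes` object yields the same kind of pixel vector directly.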

The paper then uses a convolutional neural network (CNN) to recognize the malware images.

B. Malware Image Classification Based on CNN
Structure of the CNN:

input layer

the convolution and subsampling layers

the convolution layer: enhances signal characteristics and reduces noise

subsampling layer: reduces the amount of data to process while retaining useful information

Advantages:

Local perception and weight sharing reduce the complexity of the network model and the number of weights.
The image can be used directly as the input of the network.
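
To make the two layer types concrete, here is a minimal pure-Python sketch of one convolution step followed by one subsampling step. It assumes a "valid" convolution (no padding, stride 1) and 2x2 max pooling. This summary does not say which subsampling variant the paper uses, so treat both as illustrative choices. The function names are my own.

```python
def conv2d_valid(image, kernel):
    """Slide one kernel over the image (no padding, stride 1).

    The same kernel weights are reused at every position (weight sharing),
    and each output value only sees a small patch (local perception).
    As in most CNN libraries, this is technically a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Subsample by keeping the maximum of each non-overlapping size x size block."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [[1, 0], [0, -1]]  # hypothetical 2x2 kernel: only 4 weights to learn

print(conv2d_valid(image, kernel))  # [[-5, -5, -5], [-5, -5, -5], [-5, -5, -5]]
print(max_pool(image))              # [[6, 8], [14, 16]]
```

The kernel has four weights no matter how large the image is, which is where the reduction in the number of weights comes from.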

MALWARE IMAGE DATA EQUILIBRIUM

Advantage: improves data quality through dynamic resampling based on a bat algorithm.

A. Image Data Augmentation (IDA) Technology
Advantage: If the data sample is relatively small, data augmentation can be used to enlarge it, thereby limiting the influence of imbalanced data.
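
As a rough illustration of image data augmentation, the sketch below derives extra samples from one grayscale matrix. The specific transformations (a horizontal flip and a 90-degree rotation) are assumptions chosen because they are common; this summary does not list the transformations the paper actually applies, and the function names are my own.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the matrix by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def augment(image):
    """Return the original image plus two transformed copies."""
    return [image, flip_horizontal(image), rotate_90_clockwise(image)]


sample = [[1, 2], [3, 4]]
for variant in augment(sample):
    print(variant)
# [[1, 2], [3, 4]]
# [[2, 1], [4, 3]]
# [[3, 1], [4, 2]]
```

Applying this only to under-represented malware families would raise their sample counts relative to the larger families.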

B. Data Equalization Based on a Bat Algorithm （the bat algorithm is a novel swarm intelligence algorithm）

Advantage: optimizes the sampling weights of multiple malware families more efficiently.
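
To give a feel for the bat algorithm itself, here is a minimal sketch of its standard form: a frequency-tuned velocity update plus a local random walk around the best solution found so far. It is simplified, since loudness and pulse rate stay fixed here whereas the full algorithm adapts them over time. The objective `toy_objective`, the target weights, and all parameter values are hypothetical and are not the paper's sampling-weight formulation.

```python
import random


def bat_algorithm(objective, dim, n_bats=20, n_iter=200, f_min=0.0, f_max=2.0,
                  loudness=0.9, pulse_rate=0.5, lower=0.0, upper=1.0, seed=0):
    """Minimize `objective` over [lower, upper]^dim with a simplified bat algorithm."""
    rng = random.Random(seed)

    def clip(value):
        return min(max(value, lower), upper)

    positions = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(n_bats)]
    velocities = [[0.0] * dim for _ in range(n_bats)]
    fitness = [objective(p) for p in positions]
    best_index = min(range(n_bats), key=lambda i: fitness[i])
    best, best_fit = positions[best_index][:], fitness[best_index]

    for _ in range(n_iter):
        for i in range(n_bats):
            # Standard update: v <- v + (x - best) * f, with a random frequency f.
            freq = f_min + (f_max - f_min) * rng.random()
            velocities[i] = [v + (x - b) * freq
                             for v, x, b in zip(velocities[i], positions[i], best)]
            candidate = [clip(x + v) for x, v in zip(positions[i], velocities[i])]

            # Local search: with probability (1 - pulse_rate), walk randomly near the best.
            if rng.random() > pulse_rate:
                candidate = [clip(b + 0.01 * rng.gauss(0.0, 1.0)) for b in best]

            candidate_fit = objective(candidate)
            # A bat accepts a non-worse candidate with probability equal to the loudness.
            if candidate_fit <= fitness[i] and rng.random() < loudness:
                positions[i], fitness[i] = candidate, candidate_fit
            # The global best is always updated when a better candidate appears.
            if candidate_fit < best_fit:
                best, best_fit = candidate[:], candidate_fit

    return best, best_fit


# Hypothetical objective: squared distance to made-up target sampling weights.
TARGET_WEIGHTS = [0.2, 0.5, 0.8]


def toy_objective(weights):
    return sum((w - t) ** 2 for w, t in zip(weights, TARGET_WEIGHTS))


best_weights, best_score = bat_algorithm(toy_objective, dim=3)
print(best_weights, best_score)
```

In a data-equalization setting, each dimension would presumably stand for the sampling weight of one malware family, and the objective would score how well a classifier trained with those weights performs.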
