[When Artificial Intelligence Meets Security] 8. A detailed example of malware family classification based on API sequences and machine learning

As you may know, the author will be sharing fewer and fewer articles about network security in the future. But if you want to learn how artificial intelligence is applied to security, this series should be useful. The author is starting a new blog series, "When Artificial Intelligence Meets Security", which introduces papers and practice related to AI and security in detail and shares cases involving malicious code detection, malicious request identification, intrusion detection, adversarial examples, and more. I simply want to help beginners and share new knowledge more systematically. This series will be more focused, more academic, and more in-depth, and it is also a record of the author's slow growth. Changing majors is indeed difficult, and system security is a tough nut to crack, but I will see how far I can get over the next four years. Enjoy the process, and let's keep going together~

The previous article introduced security-related data sets for readers to download and experiment with, covering malicious URLs, traffic analysis, domain name detection, malware, image classification, spam, and so on. This article explains how to learn from extracted API sequence features and build machine learning models that classify malware families, a typical task in the security field. It is a basic article; I hope it helps~

As a newcomer to network security, the author shares some basic self-study tutorials, mainly online notes, and hopes you like them. I also hope you will try the experiments yourself and make progress together with me. In the future I will study more AI security and system security and share related experiments. In short, I hope this series is helpful to fellow bloggers. Writing articles is not easy; if you don't like it, please don't flame it, thank you! If the article helps you, that is my biggest motivation to keep writing. Likes, comments, and private messages are all welcome. Let's work hard together!


1. Malware Analysis

Malware (malicious code) analysis usually comprises static analysis and dynamic analysis. Accordingly, features are divided into static features and dynamic features, depending on whether the malicious code is actually run in a user environment or a simulated environment.

So how are the static and dynamic features of malware extracted? This first part briefly introduces both kinds.

1. Static features

Static features are obtained without actually running the sample. They usually include:

  • Bytecode: the binary is converted into a byte sequence, a relatively primitive feature with no further processing
  • IAT (import address table): an important part of the PE structure that declares imported functions and their locations so that they can be resolved when the program runs; the table entries are closely tied to the functions the program uses
  • Android permission table: if an app declares permissions that its functionality does not use, it may have malicious purposes, such as collecting phone information
  • Printable characters: the binary is converted to ASCII and statistics are computed over the printable strings
  • IDA disassembly jump blocks: the jump blocks obtained when the sample is analyzed with IDA, processed as sequence data or graph data
  • Commonly used API functions
  • Malware images (the binary rendered as an image)
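
Two of the static features above, printable characters and raw byte statistics, are simple enough to sketch with the standard library. The snippet below is a minimal illustration, not the author's extraction code: the function names `printable_strings` and `byte_histogram`, the minimum string length of 4, and the sample bytes are all assumptions made for the example.

```python
import re
from collections import Counter
from typing import List

def printable_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of printable ASCII characters at least min_len long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]

def byte_histogram(data: bytes) -> List[int]:
    """Count how often each of the 256 possible byte values occurs."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]

# Made-up bytes standing in for the contents of a real binary.
sample = b"\x00\x01MZ\x90\x00kernel32.dll\x00\xff\xfeCreateFileA\x00ab\x00"
strings = printable_strings(sample)
hist = byte_histogram(sample)
print(strings)   # the two strings long enough to pass the length filter
print(hist[0])   # number of zero bytes in the sample
```

The histogram is a fixed-length vector (256 values) regardless of file size, which is why this kind of feature can be fed to a classifier directly, whereas the string list usually needs a further vectorization step.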


2. Dynamic features

Compared with static features, dynamic features are more time-consuming to obtain because the code must actually be executed. They usually include:

  • API call relationships: a fairly distinctive feature that records which APIs are called and reflects the corresponding functionality
  • Control flow graph: commonly used in software engineering; for machine learning it is represented as a vector for classification
  • Data flow graph: commonly used in software engineering; for machine learning it is represented as a vector for classification
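
To make the API call feature concrete, the sketch below turns an ordered list of API names into two simple representations: a space-separated "document" (the form a text vectorizer can consume) and a count of adjacent call pairs. The trace and the helper name `api_ngrams` are hypothetical; a real trace would come from a sandbox report.

```python
from collections import Counter
from typing import List

def api_ngrams(calls: List[str], n: int = 2) -> Counter:
    """Count sliding-window n-grams over an ordered list of API names."""
    grams = (tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))
    return Counter(grams)

# Hypothetical trace of one sample, in call order.
trace = ["CreateFileA", "WriteFile", "CloseHandle", "CreateFileA", "WriteFile"]

document = " ".join(trace)          # bag-of-words style input
bigrams = api_ngrams(trace, n=2)    # keeps some ordering information
print(document)
print(bigrams.most_common(1))
```

Single API names ignore ordering entirely, while n-grams keep local order at the cost of a larger vocabulary; which trade-off works better depends on the data set.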


2. Malicious family detection based on logistic regression

Earlier articles in this series detailed how to extract static and dynamic features of malware, including API sequences. Next we build a machine learning model that learns from API sequences to classify families. The basic process is: load the data set, convert the API sequences into TF-IDF vectors, encode the labels, and then train and evaluate the classifier.

1. Dataset

The data set contains samples from five malware families. For every sample, the dynamic API sequence was successfully extracted beforehand with the CAPE tool. The distribution of the data set is as follows (readers are advised to extract samples from their own data sets, such as BIG2015 and BODMAS):

Malware family   Category   Quantity   Training set   Test set
AAAA             class1     352        242            110
BBBB             class2     335        235            100
CCCC             class3     363        243            120
DDDD             class4     293        163            130
EEEE             class5     548        358            190

The data set is split into a training set (1,241 samples) and a test set (650 samples).

Each record has four fields: serial number, malware family category, MD5 value, and API sequence (the feature).

Note that feature extraction involves a large amount of data preprocessing and cleaning, which readers need to carry out according to their own needs. For example, the following code filters out samples whose extracted API feature is empty.

#coding:utf-8
#By:Eastmount CSDN 2023-05-31
import csv

# API sequences can be very long, so raise the per-field size limit
csv.field_size_limit(500 * 1024 * 1024)
filename = "AAAA_result.csv"
writename = "AAAA_result_final.csv"

# Both files are closed automatically when the with block exits
with open(filename, encoding='utf-8') as fr, \
     open(writename, mode="w", newline="") as fw:
    reader = csv.reader(fr)
    writer = csv.writer(fw)
    writer.writerow(['no', 'type', 'md5', 'api'])
    no = 1
    for row in reader: #['no','type','md5','api']
        tt = row[1]
        md5 = row[2]
        api = row[3]
        #print(no,tt,md5,api)
        # Skip rows whose API field is empty, and the header row of the input
        if api == "" or api == "api":
            continue
        writer.writerow([str(no), tt, md5, api])
        no += 1
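
Filtering empty rows is only one possible cleaning step. Another step that is sometimes applied to API sequences is merging runs of the same call into a single call. The sketch below is an assumption-laden illustration, not part of the author's pipeline: the helper name `collapse_repeats` and the `;` separator are hypothetical, and the real separator depends on how the sequences were exported.

```python
from itertools import groupby

def collapse_repeats(api_field: str, sep: str = ";") -> str:
    """Drop empty tokens and merge consecutive repeats of the same API."""
    tokens = [token.strip() for token in api_field.split(sep) if token.strip()]
    return sep.join(name for name, _ in groupby(tokens))

# Hypothetical raw field with a repeated call and an empty token.
raw = "NtClose;NtClose;;RegOpenKeyExW;NtClose; NtClose"
print(collapse_repeats(raw))  # NtClose;RegOpenKeyExW;NtClose
```

Collapsing repeats shortens long sequences but discards call frequency, which TF-IDF relies on, so whether it helps is something to check experimentally rather than assume.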

2. Model Construction

Since the machine learning algorithms are relatively simple, only the key code is given here. Commonly used feature representations include TF-IDF and Word2Vec; here TF-IDF is used to compute the feature vectors, and readers can try Word2Vec themselves. The resulting family classifier obtains an accuracy (Acc) of 0.6215.
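
Before the full script, it may help to see what the TF-IDF step computes. The sketch below is a pure-Python approximation written for illustration: it uses raw term counts, the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, and L2 normalisation per document, which to the best of my knowledge matches the default behaviour of scikit-learn's `TfidfTransformer`. The function name `tfidf_vectors` and the three toy documents are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List

def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """TF-IDF with smoothed IDF and L2 normalisation per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Toy API "documents"; real ones are the API sequences of each sample.
docs = [
    ["createfilea", "writefile", "closehandle"],
    ["createfilea", "regopenkeyexw", "regopenkeyexw"],
    ["connect", "send", "closesocket"],
]
vectors = tfidf_vectors(docs)
for vector in vectors:
    print({term: round(weight, 3) for term, weight in vector.items()})
```

In the first document `createfilea` gets a lower weight than `writefile` because it also occurs in the second document; this down-weighting of widely shared API calls is the point of the IDF factor.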

# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2023-06-01
import csv
import time
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# time.clock() was removed in Python 3.8; use perf_counter() instead
start = time.perf_counter()
csv.field_size_limit(500 * 1024 * 1024)

#--------------------------- Step 1: load the data set ------------------------
# Training set
file = "train_dataset.csv"
label_train = []
content_train = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for row in csv_reader:
        label_train.append(row[1])
        value = str(row[3])
        content_train.append(value)
print(label_train[:2])
print(content_train[:2])

# Test set
file = "test_dataset.csv"
label_test = []
content_test = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for row in csv_reader:
        label_test.append(row[1])
        value = str(row[3])
        content_test.append(value)
print(len(label_train),len(label_test))
print(len(content_train),len(content_test)) #1241 650

#--------------------------- Step 2: vectorization ------------------------
contents = content_train + content_test
labels = label_train + label_test

# Term counts (min_df and max_df can be tuned here)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(contents)
# get_feature_names() was removed in scikit-learn 1.2
words = vectorizer.get_feature_names_out().tolist()
print(words[:10])
print("Number of feature words:", len(words))

# Compute TF-IDF
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)
weights = tfidf.toarray()

#--------------------------- Step 3: label encoding and split ------------------------
le = LabelEncoder()
y = le.fit_transform(labels)
n_train = len(content_train)  # the 1241 training rows come first
X_train, X_test = weights[:n_train], weights[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

#--------------------------- Step 4: classification ------------------------
clf = LogisticRegression(solver='liblinear')
clf.fit(X_train, y_train)
pre = clf.predict(X_test)
print(clf)
print(classification_report(y_test, pre, digits=4))
print("accuracy:")
print(metrics.accuracy_score(y_test, pre))

# Elapsed time
elapsed = (time.perf_counter() - start)
print("Time used:", elapsed)

The output is as follows:

1241 650
1241 650
['__anomaly__', 'accept', 'bind', 'changewindowmessagefilter', 'closesocket', 'clsidfromprogid', 'cocreateinstance', 'cocreateinstanceex', 'cogetclassobject', 'colescript_parsescripttext']
Number of feature words: 269
LogisticRegression(solver='liblinear')
              precision    recall  f1-score   support

           0     0.5398    0.5545    0.5471       110
           1     0.6526    0.6200    0.6359       100
           2     0.6596    0.5167    0.5794       120
           3     0.8235    0.5385    0.6512       130
           4     0.5665    0.7842    0.6578       190

    accuracy                         0.6215       650
   macro avg     0.6484    0.6028    0.6143       650
weighted avg     0.6438    0.6215    0.6199       650

accuracy:
0.6215384615384615
Time used: 2.2597622
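
The two summary rows of the report can be reproduced from the per-class rows, which makes their meaning easier to see. The sketch below copies the recall and support columns from the logistic regression report above; only the variable names are mine. The macro average treats every family equally, while the weighted average weights each family by its number of test samples, and for recall the weighted average coincides with overall accuracy.

```python
# Per-class (recall, support) pairs copied from the report above.
report = {
    0: (0.5545, 110),
    1: (0.6200, 100),
    2: (0.5167, 120),
    3: (0.5385, 130),
    4: (0.7842, 190),
}

total = sum(support for _, support in report.values())
macro_recall = sum(recall for recall, _ in report.values()) / len(report)
weighted_recall = sum(recall * support for recall, support in report.values()) / total

print(total, round(macro_recall, 4), round(weighted_recall, 4))
# prints: 650 0.6028 0.6215
```

The gap between the two averages comes from class 4, which has both the most test samples and the highest recall, so it pulls the weighted figure up.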

3. Malicious family detection based on SVM

1. SVM model

The core idea of the SVM classification algorithm is to use a kernel function to find, in a high-dimensional space, a hyperplane that satisfies the classification requirement and keeps the training points as far from the decision surface as possible; in other words, to find the decision surface with the largest margin (blank region) on both sides. As shown in Figure 19.16, the training samples of the two classes that lie closest to the decision surface, on the hyperplanes parallel to the optimal decision surface, are called support vectors.
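
The largest-margin idea can be written compactly. For the linearly separable two-class case with labels y_i in {-1, +1}, the standard textbook hard-margin formulation (a general statement, not something specific to this article's code) is:

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w^{\top} x_i + b) \ge 1, \qquad i = 1,\dots,n
```

The margin between the two parallel hyperplanes $w^{\top}x + b = \pm 1$ has width $2/\lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin. The support vectors are exactly the training points for which the constraint holds with equality.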

In scikit-learn, the SVM classifier is implemented as svm.SVC (C-Support Vector Classification), which is based on libsvm. Its constructor, shown here with the default values of recent scikit-learn releases, is:

SVC(C=1.0,
    cache_size=200,
    class_weight=None,
    coef0=0.0,
    decision_function_shape='ovr',
    degree=3,
    gamma='scale',
    kernel='rbf',
    max_iter=-1,
    probability=False,
    random_state=None,
    shrinking=True,
    tol=0.001,
    verbose=False)

Using SVC mainly involves two steps:

  • Training: clf.fit(data, target)
  • Prediction: pre = clf.predict(data)
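
The two steps can be sketched on made-up data as follows. The feature vectors and labels are hypothetical and chosen to be easy to separate. Both `SVC` and `LinearSVC` live in `sklearn.svm`; the example shows them side by side because the script in the next section uses `LinearSVC`.

```python
from sklearn.svm import SVC, LinearSVC

# Toy, clearly separable data (hypothetical feature vectors and labels).
X_train = [[0.0, 0.2], [0.3, 0.0], [0.2, 0.4],
           [3.0, 2.8], [2.7, 3.1], [3.2, 2.9]]
y_train = [0, 0, 0, 1, 1, 1]
X_new = [[0.1, 0.1], [3.0, 3.0]]

# Step 1: training, step 2: prediction, with an RBF-kernel SVC.
rbf_clf = SVC(kernel="rbf", gamma="scale", C=1.0)
rbf_clf.fit(X_train, y_train)
rbf_pred = rbf_clf.predict(X_new)

# The same two steps with a linear SVM.
linear_clf = LinearSVC(C=1.0)
linear_clf.fit(X_train, y_train)
linear_pred = linear_clf.predict(X_new)

print(list(rbf_pred), list(linear_pred))
```

`LinearSVC` is built on liblinear rather than libsvm and tends to scale better to sparse, high-dimensional inputs such as TF-IDF vectors, which is one plausible reason to prefer it for this task.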

2. Code implementation

Only the key SVM code for malware family classification is given below; SVM is a commonly used model in many security tasks. Note that the code uses svm.LinearSVC, a linear SVM, rather than SVC, and that the prediction results are saved to files. In real experiments it is advisable to save more intermediate data, so that different methods can be compared more easily and the contribution of a paper is better reflected.

# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2023-06-01
import csv
import time
from sklearn import svm
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

# time.clock() was removed in Python 3.8; use perf_counter() instead
start = time.perf_counter()
csv.field_size_limit(500 * 1024 * 1024)

#--------------------------- Step 1: load the data set ------------------------
# Training set
file = "train_dataset.csv"
label_train = []
content_train = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for row in csv_reader:
        label_train.append(row[1])
        value = str(row[3])
        content_train.append(value)
print(label_train[:2])
print(content_train[:2])

# Test set
file = "test_dataset.csv"
label_test = []
content_test = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for row in csv_reader:
        label_test.append(row[1])
        value = str(row[3])
        content_test.append(value)
print(len(label_train),len(label_test))
print(len(content_train),len(content_test)) #1241 650

#--------------------------- Step 2: vectorization ------------------------
contents = content_train + content_test
labels = label_train + label_test

# Term counts (min_df and max_df can be tuned here)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(contents)
# get_feature_names() was removed in scikit-learn 1.2
words = vectorizer.get_feature_names_out().tolist()
print(words[:10])
print("Number of feature words:", len(words))

# Compute TF-IDF
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)
weights = tfidf.toarray()

#--------------------------- Step 3: label encoding and split ------------------------
le = LabelEncoder()
y = le.fit_transform(labels)
n_train = len(content_train)  # the 1241 training rows come first
X_train, X_test = weights[:n_train], weights[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

#--------------------------- Step 4: classification ------------------------
clf = svm.LinearSVC()
clf.fit(X_train, y_train)
pre = clf.predict(X_test)
print(clf)
print(classification_report(y_test, pre, digits=4))
print("accuracy:")
print(metrics.accuracy_score(y_test, pre))

# Save the predictions and the true labels, one label per line
with open("svm_test_pre.txt", "w") as f1:
    for n in pre:
        f1.write(str(n) + "\n")

with open("svm_test_y.txt", "w") as f2:
    for n in y_test:
        f2.write(str(n) + "\n")

# Elapsed time
elapsed = (time.perf_counter() - start)
print("Time used:", elapsed)

The experimental results are as follows:

1241 650
1241 650
['__anomaly__', 'accept', 'bind', 'changewindowmessagefilter', 'closesocket', 'clsidfromprogid', 'cocreateinstance', 'cocreateinstanceex', 'cogetclassobject', 'colescript_parsescripttext']
Number of feature words: 269
LinearSVC()
              precision    recall  f1-score   support

           0     0.6439    0.7727    0.7025       110
           1     0.8780    0.7200    0.7912       100
           2     0.7315    0.6583    0.6930       120
           3     0.9091    0.6154    0.7339       130
           4     0.6583    0.8316    0.7349       190

    accuracy                         0.7292       650
   macro avg     0.7642    0.7196    0.7311       650
weighted avg     0.7534    0.7292    0.7301       650

accuracy:
0.7292307692307692
Time used: 2.2672032
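
Saving predictions to files pays off when comparing models afterwards. As one possible follow-up, the sketch below builds a confusion matrix from two label lists using only the standard library. The helper names `read_labels` and `confusion_matrix` are mine, and the short in-memory lists stand in for the real files so the example runs on its own.

```python
from collections import Counter
from typing import List

def read_labels(path: str) -> List[int]:
    """Read one integer label per line, the format written by the script."""
    with open(path, encoding="utf-8") as handle:
        return [int(line) for line in handle if line.strip()]

def confusion_matrix(y_true: List[int], y_pred: List[int],
                     n_classes: int) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in range(n_classes)] for t in range(n_classes)]

# Small stand-ins; with real runs one would use
# read_labels("svm_test_y.txt") and read_labels("svm_test_pre.txt").
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

matrix = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = sum(matrix[i][i] for i in range(3)) / len(y_true)
for row in matrix:
    print(row)
print(round(accuracy, 4))
```

Off-diagonal cells show which families are confused with each other, which the per-class precision and recall figures alone do not reveal.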

4. Malicious family detection based on random forest

The key code for this part is as follows, with visualization code added at the end.

# -*- coding: utf-8 -*-
# By:Eastmount CSDN 2023-06-01
import csv
import time
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

# time.clock() was removed in Python 3.8; use perf_counter() instead
start = time.perf_counter()
csv.field_size_limit(500 * 1024 * 1024)

#--------------------------- Step 1: load the data set ------------------------
# Training set
file = "train_dataset.csv"
label_train = []
content_train = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for row in csv_reader:
        label_train.append(row[1])
        value = str(row[3])
        content_train.append(value)
print(label_train[:2])
print(content_train[:2])

# Test set
file = "test_dataset.csv"
label_test = []
content_test = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for row in csv_reader:
        label_test.append(row[1])
        value = str(row[3])
        content_test.append(value)
print(len(label_train),len(label_test))
print(len(content_train),len(content_test)) #1241 650

#--------------------------- Step 2: vectorization ------------------------
contents = content_train + content_test
labels = label_train + label_test

# Term counts (min_df and max_df can be tuned here)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(contents)
# get_feature_names() was removed in scikit-learn 1.2
words = vectorizer.get_feature_names_out().tolist()
print(words[:10])
print("Number of feature words:", len(words))

# Compute TF-IDF
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)
weights = tfidf.toarray()

#--------------------------- Step 3: label encoding and split ------------------------
le = LabelEncoder()
y = le.fit_transform(labels)
n_train = len(content_train)  # the 1241 training rows come first
X_train, X_test = weights[:n_train], weights[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

#--------------------------- Step 4: classification ------------------------
# No random_state is set, so the scores vary slightly between runs
clf = RandomForestClassifier(n_estimators=5)
clf.fit(X_train, y_train)
pre = clf.predict(X_test)
print(clf)
print(classification_report(y_test, pre, digits=4))
print("accuracy:")
print(metrics.accuracy_score(y_test, pre))

# Save the predictions and the true labels, one label per line
with open("rf_test_pre.txt", "w") as f1:
    for n in pre:
        f1.write(str(n) + "\n")

with open("rf_test_y.txt", "w") as f2:
    for n in y_test:
        f2.write(str(n) + "\n")

# Elapsed time
elapsed = (time.perf_counter() - start)
print("Time used:", elapsed)

#--------------------------- Step 5: visualization ------------------------
# Reduce the test vectors to two dimensions
pca = PCA(n_components=2)
pca = pca.fit(X_test)
xx = pca.transform(X_test)

# Scatter plot colored by the true family label
plt.figure()
plt.scatter(xx[:,0],xx[:,1],c=y_test, s=50)
plt.title("Malware Family Detection")
plt.show()

The output is as follows. The accuracy reaches 0.8092, which is a fairly good result.

1241 650
1241 650
['__anomaly__', 'accept', 'bind', 'changewindowmessagefilter', 'closesocket', 'clsidfromprogid', 'cocreateinstance', 'cocreateinstanceex', 'cogetclassobject', 'colescript_parsescripttext']
Number of feature words: 269
RandomForestClassifier(n_estimators=5)
              precision    recall  f1-score   support

           0     0.7185    0.8818    0.7918       110
           1     0.9000    0.8100    0.8526       100
           2     0.7963    0.7167    0.7544       120
           3     0.9444    0.7846    0.8571       130
           4     0.7656    0.8421    0.8020       190

    accuracy                         0.8092       650
   macro avg     0.8250    0.8070    0.8116       650
weighted avg     0.8197    0.8092    0.8103       650

accuracy:
0.8092307692307692
Time used: 2.1914324

The five malware families are also visualized with a 2D scatter plot. However, the overall effect is mediocre: the families are not clearly separated. The code and the dimensionality reduction need further optimization to distinguish the classes, or a 3D scatter plot could be drawn. Readers are invited to think about this themselves.
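
One of the follow-ups suggested above, a 3D scatter plot, could look roughly like the sketch below. It keeps three principal components instead of two and also reports how much variance each component explains, which gives a quick sense of how much information the plot discards. The helper names `project_3d` and `save_3d_scatter`, the output file name, and the synthetic data are assumptions; with the real script one would pass `X_test` and `y_test` instead.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_3d(features):
    """Fit PCA with three components; return points and variance ratios."""
    pca = PCA(n_components=3)
    points = pca.fit_transform(features)
    return points, pca.explained_variance_ratio_

def save_3d_scatter(points, labels, path="pca_3d.png"):
    """Draw a 3D scatter plot colored by class label and save it to disk."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(points[:, 0], points[:, 1], points[:, 2], c=labels, s=20)
    ax.set_title("Malware Family Detection (PCA, 3 components)")
    fig.savefig(path)
    plt.close(fig)

# Synthetic stand-in for X_test / y_test: 60 samples, 10 features, 3 classes.
rng = np.random.default_rng(0)
demo_y = np.repeat([0, 1, 2], 20)
demo_X = rng.normal(size=(60, 10)) + demo_y[:, None] * 2.0

demo_points, ratios = project_3d(demo_X)
print(demo_points.shape, ratios.round(3))
# save_3d_scatter(demo_points, demo_y)  # uncomment to write pca_3d.png
```

If the first few components explain only a small share of the variance, poor visual separation says more about the projection than about the classifier, and a nonlinear method such as t-SNE may be worth trying.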


5. Summary

That is the end of this article; I hope it is helpful to you. May was really busy with wrapping up a project, and I will write a few more security blog posts once the work is done. Thank you all for your support and companionship, especially my family for their encouragement and support. Keep going!

  • 1. Malware analysis
    1. Static features
    2. Dynamic features
  • 2. Malicious family detection based on logistic regression
    1. Data set
    2. Model construction
  • 3. Malicious family detection based on SVM
    1. SVM model
    2. Code implementation
  • 4. Malicious family detection based on random forest
  • 5. Summary

The author raises the following questions; additions are welcome:

  • What are the common features of malware or binaries, and what are the advantages and disadvantages of each?
  • Converting malware to a grayscale image is a common approach to family classification. What are its advantages and disadvantages compared with the method in this article?
  • How can the CFG and ICFG of malware be extracted, and how can a machine learning model learn from them after extraction?
  • What are the common vector representation methods, and what are their characteristics? Can you implement Word2Vec in code?
  • What are the connections and differences between machine learning and deep learning? If a deep learning model is built to learn API sequences, how well does it detect malware families?
  • What is the current state of malware family classification and malicious code detection? What are the characteristics and limitations of industry and academia, and how can the two be better connected to advance the field?
  • Is there a better way to innovate or make breakthroughs in the binary direction? How can robustness, semantic enhancement, and interpretability be improved?
  • How can malware from unknown families be detected, and how can high-threat malware be traced to its source?
  • How can malware detection be better integrated with the underlying hardware and compilers? And how can it withstand variants, obfuscation, and adversarial attacks?
  • Can malware variants be generated quickly with ChatGPT-style technology? And how should detection respond to the development of this technology?

The road of life is made up of one crossroads after another, one game after another, full of entanglements, gains, and losses. Different choices bring different kinds of wonderful. Although I am tired and busy, seeing Xiao Luoluo is deeply satisfying, and I am grateful to have my family by my side.
Xiao Luo: Dad, you are back from work
Me: Did you cry at the supermarket with your mother-in-law today?
Xiao Luo: Yes, I want to take the little hair cake by myself
Me: I heard that grandpa and grandma laughed at me, from now on...
Xiao Luo: What's the use of their laughing!

Yes, haha, what's the use? Xiao Luoluo has grown up, and the little cutie has turned into a little rascal. Recently I have been reluctant to take taxis and have switched to buses and shared motorbikes, yet I still pin my hopes on lottery tickets and our 5 million. Why didn't I follow the goddess and buy a house in our community in 2017? By this year it feels like that would have earned nearly 1 million yuan, enough for ten years of my teaching in Guizhou. It is all a game, all choices, all sweet and sour. I hope Xiaoluo grows up happy and healthy. I love you. Keep working hard, come on!

(By:Eastmount 2023-09-06 night in Guiyang http://blog.csdn.net/eastmount/ )

Origin blog.csdn.net/Eastmount/article/details/132708001