As regular readers may know, the author will share fewer and fewer articles purely about network security in the future. If you want to learn how artificial intelligence is applied to security, however, this series should be useful. The author is restarting the blog series "When Artificial Intelligence Meets Security", which introduces papers and practice at the intersection of AI and security in detail and shares case studies covering malicious code detection, malicious request identification, intrusion detection, adversarial examples, and more. The goal is simply to help beginners and to share new knowledge more systematically. The series will be more focused, more academic, and more in-depth, and it also records the author's slow growth. Changing majors is indeed difficult, and system security is a tough nut to crack, but I will see how far I can get in the next four years. Enjoy the process, and let's keep going together~
The previous article introduced security-related datasets for readers to download and experiment with, covering malicious URLs, traffic analysis, domain name detection, malware, image classification, spam, and so on. This article explains how to learn from extracted API sequence features and build machine learning models that classify malware families, which is a typical task in the security field. It is a basic article, and I hope it helps~
As a novice in network security, the author shares some basic self-study tutorials, mainly in the form of online notes, and hopes you like them. I also hope you will try the experiments yourself so that we can make progress together. In the future I will study more AI security and system security topics and share the related experiments. In short, I hope this series is helpful to readers. Writing articles is not easy, so if you don't like it, please be kind. If the article helps you, that is my biggest motivation to keep writing. Likes, comments, and private messages are all welcome. Let's work together!
Previous articles in this series:
- [When Artificial Intelligence Meets Security] 1. Is artificial intelligence really safe? The Zhejiang University team's talk on AI adversarial example techniques at the Bund Conference
- [When Artificial Intelligence Meets Security] 2. Zhang Chao of Tsinghua University - GreyOne: Discover Vulnerabilities with Data Flow Sensitive Fuzzing
- [When Artificial Intelligence Meets Security] 3. Machine learning in the security field, with a case study on malicious request identification
- [When Artificial Intelligence Meets Security] 4. Malicious code detection techniques based on machine learning, explained in detail
- [When Artificial Intelligence Meets Security] 5. Host malicious code identification based on machine learning algorithms
- [When Artificial Intelligence Meets Security] 6. Intrusion detection and attack identification based on machine learning, using the KDD CUP99 dataset as an example
- [When Artificial Intelligence Meets Security] 7. A summary of security datasets for machine learning
- [When Artificial Intelligence Meets Security] 8. Malware family classification based on API sequences and machine learning, with detailed examples
Author's github resources:
1. Malware Analysis
Malware (malicious code) analysis usually comprises static analysis and dynamic analysis. Accordingly, features are divided into static features and dynamic features, depending on whether the malicious code is actually run in a user environment or a simulated environment.
So how are the static and dynamic features of malware extracted? This first part briefly introduces both.
1. Static features
Static features are extracted without actually executing the sample. They usually include:
- Bytecode: the binary is converted into a byte sequence, a relatively primitive feature with no further processing
- IAT (Import Address Table): an important part of the PE structure that declares the imported functions and their locations so that they can be resolved when the program runs; the table is closely tied to the functionality the program uses
- Android permission table: if an app declares permissions that its functionality does not use, it may have a malicious purpose, such as collecting phone information
- Printable characters: the binary is converted to ASCII and statistics are computed over the printable strings
- IDA disassembly jump blocks: the jump blocks produced when IDA disassembles the sample, processed as sequence data or graph data
- Commonly used API functions
- Malware images
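As a rough illustration of two of the static features listed above (raw byte statistics and printable characters), the following minimal sketch uses only the Python standard library. The `sample` bytes and both function names are hypothetical stand-ins chosen for this example, not part of the author's pipeline.

```python
import re
from collections import Counter

def byte_histogram(data):
    """Count how often each of the 256 possible byte values occurs."""
    counts = Counter(data)
    return [counts.get(b, 0) for b in range(256)]

def printable_strings(data, min_len=4):
    """Return runs of printable ASCII characters at least min_len long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]

# Hypothetical stand-in for the bytes of a real sample
sample = b"MZ\x90\x00\x03kernel32.dll\x00\x00CreateFileA\x00\xff\xfeab\x00"

hist = byte_histogram(sample)        # 256-dimensional count vector
strings = printable_strings(sample)  # short runs such as "MZ" are dropped
print(hist[0], strings)
```

On a real sample the bytes would come from `open(path, "rb").read()`, and the histogram would normally be normalized by file length before being fed to a classifier.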
Static feature extraction methods:
- CAPA
  - https://github.com/mandiant/capa
- IDA Pro
- Security vendor sandboxes
2. Dynamic features
Compared with static features, dynamic features are more time-consuming to obtain because the code must actually be executed. They usually include:
- API call relationships: a fairly distinctive feature that records which APIs are called and thus expresses the corresponding functionality
- Control flow graph: commonly used in software engineering; for machine learning it is represented as a vector for classification
- Data flow graph: commonly used in software engineering; for machine learning it is represented as a vector for classification
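To make the API call feature more concrete, here is a minimal sketch of how a recorded call trace might be turned into a space-joined sequence string and into n-gram counts. The `trace` list and the `api_ngrams` helper are hypothetical; a real sandbox report has its own format that would need to be parsed first.

```python
from collections import Counter

def api_ngrams(calls, n=2):
    """Slide a window of n consecutive API names over the call trace."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]

# Hypothetical trace, in the order a sandbox might record the calls
trace = ["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"]

# Space-joined, lower-cased form: the kind of string a text vectorizer can consume
sequence = " ".join(name.lower() for name in trace)

# Bigram counts keep some ordering information that single-API counts lose
bigram_counts = Counter(api_ngrams(trace, n=2))
print(sequence)
print(bigram_counts.most_common(1))
```

Single API names treat the trace as a bag of words, while n-grams preserve short-range call order at the cost of a larger vocabulary.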
Dynamic feature extraction methods:
- Cuckoo
  - https://github.com/cuckoosandbox/cuckoo
- CAPE
  - https://github.com/kevoreilly/CAPEv2
  - https://capev2.readthedocs.io/en/latest/
- Security vendor sandboxes
2. Malicious family detection based on logistic regression
The previous articles in this series detailed how to extract static and dynamic features of malware, including API sequences. Next, we build a machine learning model that learns from API sequences to perform classification. The basic process is: load the dataset, convert the API sequences into TF-IDF vectors, encode the labels, then train a classifier and evaluate it.
1. Dataset
The dataset contains samples from 5 malware families. Each sample is represented by the dynamic API sequence successfully extracted with the CAPE tool described earlier. The distribution of the dataset is shown below. (Readers are advised to extract samples from their own datasets, such as BIG2015 or BODMAS.)
| Malware family | Category | Quantity | Training set | Test set |
|---|---|---|---|---|
| AAAA | class1 | 352 | 242 | 110 |
| BBBB | class2 | 335 | 235 | 100 |
| CCCC | class3 | 363 | 243 | 120 |
| DDDD | class4 | 293 | 163 | 130 |
| EEEE | class5 | 548 | 358 | 190 |
The dataset is divided into a training set and a test set, as shown in the following figure:
Each file contains four fields: serial number, malware family category, MD5 value, and API sequence (the feature).
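The article does not say how its training/test split was produced. As one possible approach, the sketch below performs a per-family (stratified) split with the standard library only. The rows, the 0.3 ratio, and the seed are all hypothetical values chosen for illustration.

```python
import random
from collections import defaultdict

def stratified_split(rows, test_ratio=0.3, seed=42):
    """Split rows into train/test while keeping each family's share."""
    by_family = defaultdict(list)
    for row in rows:
        by_family[row["type"]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for family in sorted(by_family):
        members = by_family[family][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_ratio)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Hypothetical rows carrying the four fields: no, type, md5, api
rows = [{"no": i, "type": fam, "md5": f"{fam}-{i:04d}", "api": "createfilew writefile"}
        for fam in ("AAAA", "BBBB") for i in range(10)]

train, test = stratified_split(rows, test_ratio=0.3)
print(len(train), len(test))
```

Splitting per family keeps small families from ending up entirely in one partition, which matters when class sizes differ as much as they do in the table above.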
Note that feature extraction involves a large amount of data preprocessing and cleaning, which readers need to carry out according to their own needs. For example, the following code filters out samples whose extracted API feature is empty.
# coding: utf-8
# By: Eastmount CSDN 2023-05-31
import csv

# API sequences can be very long, so raise the CSV field size limit
csv.field_size_limit(500 * 1024 * 1024)

filename = "AAAA_result.csv"
writename = "AAAA_result_final.csv"

# Both files are closed automatically when the with block ends
with open(filename, encoding="utf-8") as fr, \
     open(writename, mode="w", newline="", encoding="utf-8") as fw:
    reader = csv.reader(fr)
    writer = csv.writer(fw)
    writer.writerow(['no', 'type', 'md5', 'api'])
    no = 1
    for row in reader:  # ['no', 'type', 'md5', 'api']
        tt = row[1]
        md5 = row[2]
        api = row[3]
        # print(no, tt, md5, api)
        # Skip rows whose API field is empty, as well as the header row
        if api == "" or api == "api":
            continue
        writer.writerow([str(no), tt, md5, api])
        no += 1
2. Model Construction
Since the machine learning algorithm is relatively simple, only the key code is given here. Commonly used feature representations include TF-IDF and Word2Vec. TF-IDF is used here to compute the feature vectors, and readers can try Word2Vec themselves. The resulting family classification reaches an accuracy (Acc) of 0.6215.
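To show what the TF-IDF weighting does on its own, here is a small hand-rolled sketch using only the standard library. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is my understanding of scikit-learn's default `TfidfTransformer` settings. Tokenization is simplified to whitespace splitting, and the two documents are hypothetical API sequences.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF with smoothed IDF and L2 normalization, on whitespace tokens."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        vectors.append([v / norm for v in row])
    return vocab, vectors

# Hypothetical API sequences, one string per sample
docs = ["createfilew writefile writefile closehandle",
        "socket connect send closehandle"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print([round(v, 4) for v in vectors[0]])
```

In this toy corpus `closehandle` appears in both documents and therefore gets the lowest IDF, while `writefile` is both repeated and specific to the first document, so it dominates that document's vector.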
# -*- coding: utf-8 -*-
# By: Eastmount CSDN 2023-06-01
import csv
import time
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# time.clock() was removed in Python 3.8; perf_counter() replaces it
start = time.perf_counter()
csv.field_size_limit(500 * 1024 * 1024)

#---------------------------Step 1: load the dataset------------------------
# Training set
file = "train_dataset.csv"
label_train = []
content_train = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for row in csv_reader:
        label_train.append(row[1])
        value = str(row[3])
        content_train.append(value)
print(label_train[:2])
print(content_train[:2])

# Test set
file = "test_dataset.csv"
label_test = []
content_test = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for row in csv_reader:
        label_test.append(row[1])
        value = str(row[3])
        content_test.append(value)
print(len(label_train), len(label_test))
print(len(content_train), len(content_test))  # 1241 650

#---------------------------Step 2: vectorization------------------------
# Note: the vocabulary and IDF weights are fitted on train + test together,
# as in the original experiment; fitting on the training set only would
# avoid leaking test-set information.
contents = content_train + content_test
labels = label_train + label_test

# Term frequencies (min_df / max_df can be tuned here)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(contents)
# get_feature_names() was removed in scikit-learn 1.2
words = vectorizer.get_feature_names_out().tolist()
print(words[:10])
print("Number of feature words:", len(words))

# TF-IDF
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)
weights = tfidf.toarray()

#---------------------------Step 3: label encoding------------------------
le = LabelEncoder()
y = le.fit_transform(labels)
n_train = len(content_train)  # 1241
X_train, X_test = weights[:n_train], weights[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

#---------------------------Step 4: classification------------------------
clf = LogisticRegression(solver='liblinear')
clf.fit(X_train, y_train)
pre = clf.predict(X_test)
print(clf)
print(classification_report(y_test, pre, digits=4))
print("accuracy:")
print(metrics.accuracy_score(y_test, pre))

# Elapsed time
elapsed = (time.perf_counter() - start)
print("Time used:", elapsed)
The output is as follows:
1241 650
1241 650
['__anomaly__', 'accept', 'bind', 'changewindowmessagefilter', 'closesocket', 'clsidfromprogid', 'cocreateinstance', 'cocreateinstanceex', 'cogetclassobject', 'colescript_parsescripttext']
Number of feature words: 269
LogisticRegression(solver='liblinear')
              precision    recall  f1-score   support
           0     0.5398    0.5545    0.5471       110
           1     0.6526    0.6200    0.6359       100
           2     0.6596    0.5167    0.5794       120
           3     0.8235    0.5385    0.6512       130
           4     0.5665    0.7842    0.6578       190
    accuracy                         0.6215       650
   macro avg     0.6484    0.6028    0.6143       650
weighted avg     0.6438    0.6215    0.6199       650
accuracy:
0.6215384615384615
Time used: 2.2597622
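As a sanity check on how the summary rows of the report relate to the per-class rows, the short sketch below recomputes the accuracy, macro average, and weighted average from the logistic regression figures printed above. Only the numbers are taken from the report; the variable names are mine.

```python
# Per-class recall, F1 and support copied from the logistic regression report
recall = [0.5545, 0.6200, 0.5167, 0.5385, 0.7842]
f1 = [0.5471, 0.6359, 0.5794, 0.6512, 0.6578]
support = [110, 100, 120, 130, 190]

total = sum(support)
# recall * support recovers the number of correctly classified samples per class
correct = [round(r * s) for r, s in zip(recall, support)]
accuracy = sum(correct) / total

# Macro average: every class counts equally
macro_f1 = sum(f1) / len(f1)
# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / total

print(correct, round(accuracy, 4), round(macro_f1, 4), round(weighted_f1, 4))
```

Because class 4 has the largest support, it pulls the weighted average toward its own F1, which is why the macro and weighted rows differ slightly.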
3. Malicious family detection based on SVM
1. SVM model
The core idea of the SVM classification algorithm is to find, with the help of a kernel function, a hyperplane in a high-dimensional space that satisfies the classification requirements and keeps the training points as far from the decision surface as possible, that is, a decision surface with the largest margin on both sides. As shown in Figure 19.16, the training samples of the two classes that lie closest to the decision surface, on the hyperplanes parallel to the optimal decision surface, are called support vectors.
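For readers who prefer a formula, the maximum-margin idea described above is usually written as the following optimization problem. This is the standard textbook formulation rather than something taken from the article; $\phi$ denotes the feature map induced by the kernel, $y_i \in \{-1, +1\}$ the class label, and $\xi_i$ the slack variables of the soft-margin variant.

```latex
% Hard margin: the margin width is 2 / \lVert w \rVert, so minimizing \lVert w \rVert maximizes it
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^{2}
\quad \text{s.t.} \quad y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1, \qquad i = 1,\dots,n

% Soft margin: C trades margin width against training errors
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.} \quad y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i, \quad \xi_i \ge 0
```

The support vectors are exactly the training points for which the constraint is active, and the penalty $C$ in the soft-margin form corresponds to the `C` parameter of `svm.SVC`.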
The SVM classification algorithm is implemented in scikit-learn as svm.SVC, namely C-Support Vector Classification, which is based on libsvm. The constructor is shown below. (The defaults listed come from an older scikit-learn release; recent releases default to gamma='scale' and decision_function_shape='ovr'.)
SVC(C=1.0,
    cache_size=200,
    class_weight=None,
    coef0=0.0,
    decision_function_shape=None,
    degree=3,
    gamma='auto',
    kernel='rbf',
    max_iter=-1,
    probability=False,
    random_state=None,
    shrinking=True,
    tol=0.001,
    verbose=False)
Using SVC mainly involves two steps:
- Training:
clf.fit(data, target)
- Prediction:
pre = clf.predict(data)
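The two steps can be tried on a toy example before running them on the real dataset. The sketch below is a minimal illustration under assumptions of my own: the API sequences and labels are invented, a linear kernel is chosen for simplicity, and the vectorizer is placed in a pipeline so that it is fitted on the training documents only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical API sequences and family labels, for illustration only
train_docs = [
    "createfile writefile closehandle",
    "createfile writefile writefile closehandle",
    "socket connect send recv",
    "socket connect send send closesocket",
]
train_labels = ["AAAA", "AAAA", "BBBB", "BBBB"]
test_docs = [
    "createfile writefile closehandle closehandle",
    "socket connect recv",
]

# The pipeline learns the vocabulary and IDF weights from the training data only
clf = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", C=1.0))
clf.fit(train_docs, train_labels)   # step 1: training
pre = clf.predict(test_docs)        # step 2: prediction
print(list(pre))
```

Wrapping the vectorizer and the classifier in one pipeline means `predict` applies exactly the transformation learned during `fit`, so no statistics from the test documents influence the model.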
2. Code implementation
Only the key SVM code for malware family classification is given below. SVM is a commonly used model in many security tasks. Note that the code uses svm.LinearSVC, the linear variant, rather than svm.SVC, and that it saves the prediction results to files. In real experiments, it is recommended to save more of the intermediate experimental data so that you can compare performance more thoroughly and better demonstrate the contribution of a paper.
# -*- coding: utf-8 -*-
# By: Eastmount CSDN 2023-06-01
import csv
import time
from sklearn import svm
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

# time.clock() was removed in Python 3.8; perf_counter() replaces it
start = time.perf_counter()
csv.field_size_limit(500 * 1024 * 1024)

#---------------------------Step 1: load the dataset------------------------
# Training set
file = "train_dataset.csv"
label_train = []
content_train = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for row in csv_reader:
        label_train.append(row[1])
        value = str(row[3])
        content_train.append(value)
print(label_train[:2])
print(content_train[:2])

# Test set
file = "test_dataset.csv"
label_test = []
content_test = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for row in csv_reader:
        label_test.append(row[1])
        value = str(row[3])
        content_test.append(value)
print(len(label_train), len(label_test))
print(len(content_train), len(content_test))  # 1241 650

#---------------------------Step 2: vectorization------------------------
# Note: fitted on train + test together, as in the original experiment
contents = content_train + content_test
labels = label_train + label_test

# Term frequencies (min_df / max_df can be tuned here)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(contents)
# get_feature_names() was removed in scikit-learn 1.2
words = vectorizer.get_feature_names_out().tolist()
print(words[:10])
print("Number of feature words:", len(words))

# TF-IDF
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)
weights = tfidf.toarray()

#---------------------------Step 3: label encoding------------------------
le = LabelEncoder()
y = le.fit_transform(labels)
n_train = len(content_train)  # 1241
X_train, X_test = weights[:n_train], weights[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

#---------------------------Step 4: classification------------------------
clf = svm.LinearSVC()
clf.fit(X_train, y_train)
pre = clf.predict(X_test)
print(clf)
print(classification_report(y_test, pre, digits=4))
print("accuracy:")
print(metrics.accuracy_score(y_test, pre))

# Save the predictions and the true labels, one value per line
with open("svm_test_pre.txt", "w") as f1:
    for n in pre:
        f1.write(str(n) + "\n")
with open("svm_test_y.txt", "w") as f2:
    for n in y_test:
        f2.write(str(n) + "\n")

# Elapsed time
elapsed = (time.perf_counter() - start)
print("Time used:", elapsed)
The experimental results are as follows:
1241 650
1241 650
['__anomaly__', 'accept', 'bind', 'changewindowmessagefilter', 'closesocket', 'clsidfromprogid', 'cocreateinstance', 'cocreateinstanceex', 'cogetclassobject', 'colescript_parsescripttext']
Number of feature words: 269
LinearSVC()
              precision    recall  f1-score   support
           0     0.6439    0.7727    0.7025       110
           1     0.8780    0.7200    0.7912       100
           2     0.7315    0.6583    0.6930       120
           3     0.9091    0.6154    0.7339       130
           4     0.6583    0.8316    0.7349       190
    accuracy                         0.7292       650
   macro avg     0.7642    0.7196    0.7311       650
weighted avg     0.7534    0.7292    0.7301       650
accuracy:
0.7292307692307692
Time used: 2.2672032
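One use of the saved prediction files is to build a confusion matrix afterwards, which shows which families get confused with each other. The sketch below is a standard-library illustration: the two miniature files it writes to a temporary directory are hypothetical stand-ins for svm_test_y.txt and svm_test_pre.txt, and the helper names are mine.

```python
import os
import tempfile

def read_labels(path):
    """Read one integer label per line, skipping blank lines."""
    with open(path) as f:
        return [int(line) for line in f if line.strip()]

def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Hypothetical miniature files standing in for the saved results
workdir = tempfile.mkdtemp()
y_path = os.path.join(workdir, "svm_test_y.txt")
pre_path = os.path.join(workdir, "svm_test_pre.txt")
with open(y_path, "w") as f:
    f.write("0\n0\n1\n1\n2\n")
with open(pre_path, "w") as f:
    f.write("0\n1\n1\n1\n0\n")

cm = confusion_matrix(read_labels(y_path), read_labels(pre_path), n_classes=3)
for row in cm:
    print(row)
```

With the real files, `n_classes` would be 5, and the off-diagonal entries of each row show where a family's missed samples went.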
4. Malicious family detection based on random forest
The key code for this part is as follows, with visualization code added at the end.
# -*- coding: utf-8 -*-
# By: Eastmount CSDN 2023-06-01
import csv
import time
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

# time.clock() was removed in Python 3.8; perf_counter() replaces it
start = time.perf_counter()
csv.field_size_limit(500 * 1024 * 1024)

#---------------------------Step 1: load the dataset------------------------
# Training set
file = "train_dataset.csv"
label_train = []
content_train = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for row in csv_reader:
        label_train.append(row[1])
        value = str(row[3])
        content_train.append(value)
print(label_train[:2])
print(content_train[:2])

# Test set
file = "test_dataset.csv"
label_test = []
content_test = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for row in csv_reader:
        label_test.append(row[1])
        value = str(row[3])
        content_test.append(value)
print(len(label_train), len(label_test))
print(len(content_train), len(content_test))  # 1241 650

#---------------------------Step 2: vectorization------------------------
# Note: fitted on train + test together, as in the original experiment
contents = content_train + content_test
labels = label_train + label_test

# Term frequencies (min_df / max_df can be tuned here)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(contents)
# get_feature_names() was removed in scikit-learn 1.2
words = vectorizer.get_feature_names_out().tolist()
print(words[:10])
print("Number of feature words:", len(words))

# TF-IDF
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)
weights = tfidf.toarray()

#---------------------------Step 3: label encoding------------------------
le = LabelEncoder()
y = le.fit_transform(labels)
n_train = len(content_train)  # 1241
X_train, X_test = weights[:n_train], weights[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

#---------------------------Step 4: classification------------------------
# No random_state is set, so the results vary slightly between runs
clf = RandomForestClassifier(n_estimators=5)
clf.fit(X_train, y_train)
pre = clf.predict(X_test)
print(clf)
print(classification_report(y_test, pre, digits=4))
print("accuracy:")
print(metrics.accuracy_score(y_test, pre))

# Save the predictions and the true labels, one value per line
with open("rf_test_pre.txt", "w") as f1:
    for n in pre:
        f1.write(str(n) + "\n")
with open("rf_test_y.txt", "w") as f2:
    for n in y_test:
        f2.write(str(n) + "\n")

# Elapsed time
elapsed = (time.perf_counter() - start)
print("Time used:", elapsed)

#---------------------------Step 5: visualization------------------------
# Reduce the test vectors to 2 dimensions
pca = PCA(n_components=2)
pca = pca.fit(X_test)
xx = pca.transform(X_test)

# Scatter plot, colored by true family
plt.figure()
plt.scatter(xx[:, 0], xx[:, 1], c=y_test, s=50)
plt.title("Malware Family Detection")
plt.show()
The output is as follows. The accuracy reaches 0.8092, which is quite good.
1241 650
1241 650
['__anomaly__', 'accept', 'bind', 'changewindowmessagefilter', 'closesocket', 'clsidfromprogid', 'cocreateinstance', 'cocreateinstanceex', 'cogetclassobject', 'colescript_parsescripttext']
Number of feature words: 269
RandomForestClassifier(n_estimators=5)
              precision    recall  f1-score   support
           0     0.7185    0.8818    0.7918       110
           1     0.9000    0.8100    0.8526       100
           2     0.7963    0.7167    0.7544       120
           3     0.9444    0.7846    0.8571       130
           4     0.7656    0.8421    0.8020       190
    accuracy                         0.8092       650
   macro avg     0.8250    0.8070    0.8116       650
weighted avg     0.8197    0.8092    0.8103       650
accuracy:
0.8092307692307692
Time used: 2.1914324
The five malware families are also visualized. However, the overall effect is mediocre: the families are not clearly separated in the plot. The code and the dimensionality reduction need further optimization to distinguish the families, or a 3D scatter plot could be tried. This is left for readers to think about.
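As a starting point for the 3D idea, the sketch below projects a feature matrix onto its first three principal components. It uses a plain NumPy SVD instead of `sklearn.decomposition.PCA` (the two should agree up to the sign of each axis), and the random `X_demo` matrix is a hypothetical stand-in for the TF-IDF test matrix.

```python
import numpy as np

def pca_project(X, n_components=3):
    """Project the rows of X onto the top principal components via SVD."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Hypothetical stand-in for the TF-IDF test matrix
rng = np.random.default_rng(0)
X_demo = rng.random((20, 6))

xyz = pca_project(X_demo, n_components=3)
print(xyz.shape)
```

The three columns can then be passed to a Matplotlib 3D axis, for example `ax = plt.figure().add_subplot(projection="3d")` followed by `ax.scatter(xyz[:, 0], xyz[:, 1], xyz[:, 2], c=y_test)`. Whether the extra dimension actually separates the families better has to be checked on the real data.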
5. Summary
That is the end of this article, and I hope it is helpful to you. May was really busy. I have wrapped up a project, and I will write a few more security blogs once my work is finished. Thank you for your support and companionship, especially the encouragement and support of my family. Keep up the good work!
Article directory:
- 1. Malware analysis
  - 1. Static features
  - 2. Dynamic features
- 2. Malicious family detection based on logistic regression
  - 1. Dataset
  - 2. Model construction
- 3. Malicious family detection based on SVM
  - 1. SVM model
  - 2. Code implementation
- 4. Malicious family detection based on random forest
- 5. Summary
The author poses the following questions, and additions are welcome:
- What are the common features of malware or binaries, and what are the advantages and disadvantages of each?
- Converting malware to grayscale images is a common family classification method. What are its advantages and disadvantages compared with the method used in this article?
- How are the CFG and ICFG of malware extracted, and how can they be learned by a machine learning model after extraction?
- What are the common vector representation methods, and what are their characteristics? Can you implement the Word2Vec version of the code?
- What are the connections and differences between machine learning and deep learning? If a deep learning model is built to learn API sequences, how well does it detect malware families?
- What is the current state of malware family classification and malicious code detection? What are the characteristics and limitations of industry and academia, and how can the two be better connected to advance the field?
- Is there a better way to innovate or achieve breakthroughs in the binary direction? How can robustness, semantic enhancement, and interpretability be improved?
- How can malware from unknown families be detected, and how can high-threat malware be traced to its source?
- How can malware detection be better integrated with the underlying hardware and compilers, and how can variants, obfuscation, and adversarial attacks be countered?
- Can malware variants be generated quickly with ChatGPT-like technology, and how should detection respond to the development of this technology?
The road of life is made up of crossroads, one game after another, full of entanglements, gains and losses. Different choices bring different kinds of excitement. Although I am tired and busy, seeing Xiao Luoluo is deeply satisfying, and I thank my family for their company.
Xiao Luo: Dad, you are back from work
Me: Did you cry at the supermarket with your mother-in-law today?
Xiao Luo: Yes, I want to take the little hair cake by myself
Me: I heard that grandpa and grandma laughed at me, from now on...
Xiao Luo: What's the use of their laughing!
Yes, haha, what's the use? Little Luoluo has grown up, and the little cutie has turned into a little rascal. Recently I have been reluctant to take taxis and have switched to buses and shared motorbikes, while still pinning my hopes on lottery tickets for our 5 million. Why didn't I follow the goddess and buy a home in our community in 2017? By this year, I feel it would have earned nearly 1 million yuan, which is enough for ten years of my teaching in Guizhou. It's all a game, it's all a choice, it's all bittersweet. I hope Xiaoluo grows up happy and healthy. I love you. Keep working, come on!
(By:Eastmount 2023-09-06 night in Guiyang http://blog.csdn.net/eastmount/ )