Android Malicious Application Identification (4) (Feature Processing and Classification Model Construction) - End

Preface

The first three chapters did the preliminary data preparation:
1. Identification of Android malicious applications (1) (Python batch crawling to download Android applications)
2. Identification of Android malicious applications (2) (Android APK decompilation)
3. Android malicious application identification (3) (batch decompilation and attribute value extraction)
None of them involved machine learning. Feature selection is an important step in machine learning: the candidate features are screened so that salient features are kept and non-salient ones are discarded, which can speed up training, reduce noise, and improve model performance.

1. Method of feature generation

As for how to generate features, I found relevant information in the article "Summary of Python feature generation methods (full)".
As shown in the figure above, feature generation methods fall into two families according to how the feature is produced: group aggregation and conversion. Group aggregation computes statistics of the original data within groups, while conversion transforms the original data according to its data type to generate features.
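
To make the two families concrete, here is a minimal sketch in plain Python. The records, the field names (`app`, `size_kb`, `category`) and the values are made up for illustration and do not come from the article's figure.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw records: one row per file found inside an app.
records = [
    {"app": "benign1", "size_kb": 120, "category": "xml"},
    {"app": "benign1", "size_kb": 80, "category": "dex"},
    {"app": "benign2", "size_kb": 200, "category": "xml"},
]

# Group aggregation: group rows by a key, then compute statistics per group.
groups = defaultdict(list)
for row in records:
    groups[row["app"]].append(row["size_kb"])
aggregated = {
    app: {"count": len(sizes), "mean_size": mean(sizes), "max_size": max(sizes)}
    for app, sizes in groups.items()
}

# Conversion: transform a raw value according to its type,
# here by one-hot encoding a categorical string column.
categories = sorted({row["category"] for row in records})
one_hot = [[int(row["category"] == c) for c in categories] for row in records]

print(aggregated)
print(categories, one_hot)
```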

2. String feature generation

Since the attribute values obtained in the previous three articles are strings, this chapter only generates features for string data, with reference to the article "Data feature processing of text data (feature values)".

I first installed the package named sklearn, but running the code still failed with "No module named sklearn". After installing the package named scikit-learn instead, the error went away.
Here we only generate features for the uses-permission attribute values of the .xml file in benign1. The figure below shows the sparse matrix (feature value matrix) that is generated.
Code:

import re
from collections import Counter
from xml.dom.minidom import parse
from sklearn.feature_extraction.text import CountVectorizer
from character_handle.frequency_order import sample_get


def counter(arr):
    return Counter(arr)


list_uses_permission = []

# Load the XML and extract the attribute values
xmldom = parse(sample_get.get_file("benign1"))

for element in xmldom.getElementsByTagName("uses-permission"):
    a = element.getAttribute("android:name")
    # Keep only the last word, e.g. "android.permission.INTERNET" -> "INTERNET"
    b = re.findall(r'\w+', a)[-1]
    list_uses_permission.append(b)

count_v = CountVectorizer()
data = count_v.fit_transform(list_uses_permission)

# Print the uses-permission attribute values (in document order)
print(list_uses_permission)

# Print the feature names (sorted alphabetically)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(count_v.get_feature_names_out())

# Print the count matrix that corresponds to the feature names
print(data.toarray())

Of course, the above is only an example on benign1, where each string passed to the vectorizer is a single attribute value, so each row of the matrix contains exactly one nonzero entry. For actual training, once we have decided which representative feature values to use, we need to build and save a set of feature vectors that record whether each of these feature values appears in a given .xml file.
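
As a minimal sketch of that last step, the function below turns one app's permission list into a 0/1 presence vector over a fixed feature list. The feature list `vocabulary` and the two example apps are hypothetical and only illustrate the shape of the data.

```python
def to_presence_vector(permissions, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary entry occurs in this app's permissions."""
    present = set(permissions)
    return [1 if feature in present else 0 for feature in vocabulary]


# Hypothetical selected features and per-app permission lists.
vocabulary = ["INTERNET", "READ_CONTACTS", "SEND_SMS"]
apps = {
    "benign1": ["INTERNET", "ACCESS_NETWORK_STATE"],
    "malicious1": ["SEND_SMS", "INTERNET", "SEND_SMS"],
}

feature_vectors = {name: to_presence_vector(perms, vocabulary) for name, perms in apps.items()}
print(feature_vectors)
```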

3. Data processing

A note up front: collecting these samples currently takes a lot of disk space. The only improvement I can think of is to keep just the .xml file after decompiling each app and delete everything else, so that only the .xml files remain and little space is needed; this could be automated. This article assumes plenty of storage: I bought a 1 TB portable hard drive so that I can keep all the samples, including the installation packages, in case I need to analyze other files later. If you only want to learn the method, the earlier articles are enough.
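
The space-saving idea could be automated roughly as follows. This is only a sketch: it assumes each decompiled app is a folder with `AndroidManifest.xml` at its top level, and the function name `keep_only_manifest` and the output folder are my own choices, not part of the original workflow.

```python
import shutil
from pathlib import Path


def keep_only_manifest(decompiled_dir, output_dir):
    """Copy <decompiled_dir>/AndroidManifest.xml to <output_dir>/<app name>.xml,
    then delete the decompiled folder. Returns the new path, or None if no
    manifest was found (in that case nothing is deleted)."""
    decompiled_dir = Path(decompiled_dir)
    manifest = decompiled_dir / "AndroidManifest.xml"
    if not manifest.is_file():
        return None
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / (decompiled_dir.name + ".xml")
    shutil.copy2(manifest, target)
    # Remove the smali, res and other folders that are not needed here.
    shutil.rmtree(decompiled_dir)
    return target
```

Calling this right after each app is decompiled keeps disk usage close to the size of the manifests alone. Since the deletion is irreversible, it is worth trying on a copy of one folder first.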

3.1 Obtaining original data

This section fills in the data collection that was missing before. I continued crawling the 360 Assistant website and downloaded 928 Android applications (two of which were flagged as risky software by Tencent Butler); for now I treat all of them as benign. I then downloaded 501 malicious applications from GitHub (look for a source yourself) to use in this experiment.
Rename all malicious applications so that their names follow the same pattern as the benign ones. For the method, see "Batch renaming of Python folders".
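
For readers who do not have the renaming article at hand, a sequential rename can be sketched as below. The function name `rename_apks` and the two-pass approach are my own assumptions, not the code from that article; the result mirrors the `benign1.apk`, `benign2.apk`, ... pattern used for the benign samples.

```python
import os


def rename_apks(folder, prefix):
    """Rename every .apk in `folder` to <prefix>1.apk, <prefix>2.apk, ... in sorted order."""
    apks = sorted(name for name in os.listdir(folder) if name.lower().endswith(".apk"))
    # First pass: move to temporary names so that a final name never
    # collides with a file that has not been processed yet.
    temp_names = []
    for index, name in enumerate(apks, start=1):
        temp = "__tmp_%d.apk" % index
        os.rename(os.path.join(folder, name), os.path.join(folder, temp))
        temp_names.append(temp)
    # Second pass: assign the final sequential names.
    new_names = []
    for index, temp in enumerate(temp_names, start=1):
        final = "%s%d.apk" % (prefix, index)
        os.rename(os.path.join(folder, temp), os.path.join(folder, final))
        new_names.append(final)
    return new_names
```

For example, `rename_apks("F:\\malicious_apk", "malicious")` would produce `malicious1.apk`, `malicious2.apk`, and so on; the folder path here is hypothetical.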

3.2 Obtaining decompiled data

Similarly, batch decompile APKs of benign and malicious applications:

import datetime
import subprocess
import threading


def execCmd(cmd):
    try:
        print("Command %s started at %s" % (cmd, datetime.datetime.now()))
        subprocess.Popen(cmd, shell=True, stdout=None, stderr=None).wait()
        print("Command %s finished at %s" % (cmd, datetime.datetime.now()))
    except Exception:
        print('%s\t failed to run' % (cmd))


def batchDecompile(cmds):
    if cmds:
        # Run the next batch of up to 20 commands in parallel
        threads = []
        for cmd in cmds[0:20]:
            th = threading.Thread(target=execCmd, args=(cmd,))
            print("start !!!!!!!!")
            th.start()
            threads.append(th)

        # Wait for every thread in this batch to finish:
        # the commands within a batch run in parallel, but the
        # next batch only starts once all of them are done
        for th in threads:
            th.join()
            print("OK!!!!!!!!!!!")
        del cmds[0:20]
        return batchDecompile(cmds)


# Commands to run
cmds = ["F: & cd F:\\benign_apk & " + "apktool.bat d -f " + "benign" + str(i) + ".apk" for i in range(276, 929)]
# This covers benign276.apk through benign928.apk. Before I knew how to fully automate
# the process I handled the first 275 apps in manual batches; this program runs through to the end.
batchDecompile(cmds)

This takes a long time. Wait until both the benign and the malicious applications have been decompiled.
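
The script above works in fixed batches of 20, so each batch waits for its slowest command before the next batch starts. As an alternative sketch (not the approach used in this article), a thread pool from the standard library keeps up to `max_workers` commands running at all times. The names `run_command` and `run_all` are my own.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_command(cmd):
    """Run one command and return (cmd, exit code).
    A string is run through the shell; a list is run directly."""
    completed = subprocess.run(
        cmd,
        shell=isinstance(cmd, str),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return cmd, completed.returncode


def run_all(commands, max_workers=20):
    """Run all commands with at most `max_workers` running at the same time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns the results in the same order as the input commands
        return list(pool.map(run_command, commands))
```

The same `cmds` list of shell strings built above could be passed to `run_all(cmds)`. Commands with a nonzero exit code can then be collected and retried.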

3.3 Feature acquisition

From the batch decompilation above we take 500 benign and 500 malicious samples and extract their permission attributes (using the code in Section 2). Each sample is a list containing multiple attribute values, possibly with duplicates:

# As follows
benign_samples = [["send_msg", "call", ..., "call", ...], [sample 2], ..., [sample n]]
malicious_samples = [["send_msg", "call", ..., "call", ...], [sample 2], ..., [sample n]]
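
One way to build such a list is to wrap the extraction code from Section 2 in a function and call it once per manifest. The sketch below parses an XML string so that it runs on its own; the manifest content is made up for illustration, and the function name `extract_permissions` is my own.

```python
import re
from xml.dom.minidom import parseString


def extract_permissions(manifest_xml):
    """Return the short names of all uses-permission entries, duplicates included."""
    dom = parseString(manifest_xml)
    names = []
    for element in dom.getElementsByTagName("uses-permission"):
        full_name = element.getAttribute("android:name")
        parts = re.findall(r"\w+", full_name)
        if parts:  # skip entries without a usable name
            names.append(parts[-1])
    return names


# Hypothetical manifest content, for illustration only.
manifest = """<manifest xmlns:android="http://schemas.android.com/apk/res/android">
  <uses-permission android:name="android.permission.INTERNET"/>
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-permission android:name="android.permission.INTERNET"/>
</manifest>"""

benign_samples = [extract_permissions(manifest)]
print(benign_samples)
```

With real data, the function would be called on the text of each app's `AndroidManifest.xml` and the results appended to `benign_samples` or `malicious_samples`.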

Next, obtain the feature vectors. Here I use the improved TF-IDF algorithm [1] (see the reference at the end) to select the features:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# positive_samples and negative_samples are the two lists of per-app
# permission lists built above (malicious_samples and benign_samples)


# Compute, for each feature, the absolute difference between its average
# TF-IDF value in the positive samples and in the negative samples
def compute_feature_difference(positive_tfidf_matrix, negative_tfidf_matrix):
    # Column-wise mean over each class
    positive_avg_tfidf = positive_tfidf_matrix.mean(axis=0)
    negative_avg_tfidf = negative_tfidf_matrix.mean(axis=0)

    # Absolute difference, flattened to a 1-D array
    tfidf_difference = np.abs(np.asarray(positive_avg_tfidf - negative_avg_tfidf)).ravel()
    return tfidf_difference


# Select the features whose absolute difference is in the top 5% as the feature subset
def select_top_features(tfidf_difference):
    # Number of features in the top 5% (at least one)
    top_percentage = 0.05
    num_top_features = max(1, int(tfidf_difference.shape[0] * top_percentage))

    # Indices of the top 5% of features, largest difference first
    top_feature_indices = np.argsort(tfidf_difference)[::-1][:num_top_features]
    return top_feature_indices


# Convert the documents to a TF-IDF matrix with a single vectorizer fitted on all
# samples, so that the selected column indices refer to the same vocabulary
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([" ".join(sample) for sample in positive_samples + negative_samples])

# Rows are ordered positive samples first, then negative samples
num_positive = len(positive_samples)
tfidf_difference = compute_feature_difference(tfidf_matrix[:num_positive], tfidf_matrix[num_positive:])
selected_features = select_top_features(tfidf_difference)

# Build the TF-IDF feature matrix from the selected feature subset
tfidf_matrix_selected = tfidf_matrix[:, selected_features]

# Labels: 1 for positive samples, 0 for negative samples
labels = [1] * len(positive_samples) + [0] * len(negative_samples)
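
To show what the selection score measures without depending on scikit-learn, here is a simplified sketch. It uses the textbook definition tf = count / length and idf = log(N / df), which is an assumption made for readability: scikit-learn's `TfidfVectorizer` applies smoothing and L2 normalization by default, so the numbers will not match it. The sample data is made up.

```python
import math
from collections import Counter


def tfidf_rows(samples):
    """Textbook TF-IDF per sample: tf = count / len(sample), idf = log(N / df)."""
    n = len(samples)
    df = Counter(term for sample in samples for term in set(sample))
    rows = []
    for sample in samples:
        counts = Counter(sample)
        rows.append({t: (c / len(sample)) * math.log(n / df[t]) for t, c in counts.items()})
    return rows


def mean_difference(positive, negative):
    """Absolute difference of each term's mean TF-IDF between the two classes."""
    rows = tfidf_rows(positive + negative)
    pos_rows, neg_rows = rows[:len(positive)], rows[len(positive):]
    terms = {t for row in rows for t in row}

    def avg(class_rows, term):
        return sum(row.get(term, 0.0) for row in class_rows) / len(class_rows)

    return {t: abs(avg(pos_rows, t) - avg(neg_rows, t)) for t in terms}


# Hypothetical permission lists for two positive and two negative samples.
positive = [["send_sms", "internet"], ["send_sms", "call"]]
negative = [["internet", "camera"], ["internet"]]
scores = mean_difference(positive, negative)
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

A term that is common in one class and rare in the other gets a large score, while a term that appears about equally in both classes scores low and is a candidate for removal.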

Once the samples are built, they can be fed into a machine learning model for classification:

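import numpy as np
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Create the SVM classifier
svm_classifier = SVC(kernel='linear', random_state=42)

# Define the evaluation metric for cross-validation (accuracy here)
scoring = make_scorer(accuracy_score)

# Run ten-fold cross-validation
cross_val_scores = cross_val_score(svm_classifier, tfidf_matrix_selected, labels, cv=10, scoring=scoring)

# Print the accuracy of each fold
print("Cross-Validation Accuracy Scores:", cross_val_scores)

# Print the mean accuracy
print("Mean Accuracy:", np.mean(cross_val_scores))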

The result is the average accuracy over ten-fold cross-validation. Because I only used 200 samples (100 benign and 100 malicious) for this run, the result may not be satisfactory. You can change the number of samples yourself, and adjust the top-k percentile used as the feature subset, to improve the classifier's performance.
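
To experiment with different values of k, the cut-off can be turned into a parameter. The following is a small standalone sketch with made-up scores; `select_top_percent` is a hypothetical helper, not part of the code above.

```python
def select_top_percent(scores, percent):
    """Return the indices of the highest-scoring `percent`% of features (at least one)."""
    k = max(1, int(len(scores) * percent / 100))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical difference scores for ten features.
scores = [0.02, 0.40, 0.10, 0.35, 0.01, 0.00, 0.22, 0.05, 0.30, 0.07]
for percent in (5, 20, 50):
    print(percent, select_top_percent(scores, percent))
```

Each candidate percentage can then be evaluated with the same cross-validation call to compare the resulting mean accuracy.
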
The idea of the feature selection algorithm in Section 3.3 comes from:
[1] Pan Jianwen, Zhang Zhihua, Lin Gaoyi, et al. Malicious Android application detection method based on feature selection [J/OL]. Computer Engineering and Application: 1-10 [2023-10-25]. http://kns.cnki.net/kcms/detail/11.2127.tp.20221104.1411.008.html

Origin blog.csdn.net/weixin_44165950/article/details/132944924