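1. Data preparation
Crawling apps
Malicious apps: to be decided. Either take them from open-source projects or collect them ourselves.
Benign apps: crawl them from an app market. So far, though, the site starts throttling after four or five apps. The crawler code is as follows: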
# coding=utf-8
import os
import re
import time
import urllib.parse
import urllib.request

import requests
from bs4 import BeautifulSoup

# import socket
# timeout = 5
# socket.setdefaulttimeout(timeout)


def parser_apks(count=30):
    """Collect up to `count` APK download links, keyed by package name."""
    _root_url = "http://app.mi.com"  # home page of the app market
    res_parser = {}
    # Start at the first list page; when it is exhausted, move on to the next, and so on
    page_num = 1
    while count:
        # Fetch the app list page
        wbdata = requests.get(_root_url + "/catTopList/27?page=" + str(page_num)).text
        print("Crawling page " + str(page_num))
        # Parse the app list page
        soup = BeautifulSoup(wbdata, "html.parser")
        links = soup.find_all("a", href=re.compile(r"/details\?"), class_="", alt="")
        if not links:
            # Nothing to parse (last page, or the site is throttling): stop instead of looping forever
            break
        for link in links:
            # Build the link to the app detail page
            detail_link = urllib.parse.urljoin(_root_url, str(link["href"]))
            package_name = detail_link.split("=")[1]
            download_page = requests.get(detail_link).text
            # Parse the app detail page
            soup1 = BeautifulSoup(download_page, "html.parser")
            download_tag = soup1.find(class_="download")
            if download_tag is None:
                continue  # no download button on this page, skip it
            # Build the direct download link
            download_url = urllib.parse.urljoin(_root_url, str(download_tag["href"]))
            # Parsing yields duplicate results, so check before adding
            if download_url not in res_parser.values():
                res_parser[package_name] = download_url
                count = count - 1
            if count == 0:
                break
        if count > 0:
            page_num = page_num + 1
    print("Number of APKs collected: " + str(len(res_parser)))
    return res_parser


def craw_apks(count=30, save_path="./apk/"):
    """Download the APKs found by parser_apks into save_path."""
    res_dic = parser_apks(count)
    os.makedirs(save_path, exist_ok=True)  # urlretrieve fails if the directory is missing
    for apk in res_dic.keys():
        print("Downloading app: " + apk)
        urllib.request.urlretrieve(res_dic[apk], os.path.join(save_path, apk + ".apk"))
        print("Download finished")
        time.sleep(5)  # wait a while between downloads


if __name__ == "__main__":
    craw_apks(30)
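One possible way to cope with the throttling mentioned above is to retry a failed request with increasing pauses. The sketch below is only an illustration: the helper name retry_with_backoff, its parameters, and the delay values are assumptions made for this note, and how the app market actually signals throttling has not been verified.

```python
import time


def retry_with_backoff(action, max_tries=4, base_delay=5.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call action() until it succeeds, doubling the pause after each failure.

    Hypothetical helper: the pauses are base_delay, 2 * base_delay,
    4 * base_delay, and so on. The last failure is re-raised.
    """
    for attempt in range(max_tries):
        try:
            return action()
        except retry_on:
            if attempt == max_tries - 1:
                raise  # out of attempts, let the caller see the error
            sleep(base_delay * (2 ** attempt))


# Possible use with the crawler (assumes a throttled request raises an error,
# for example because raise_for_status() is called on the response):
#
#     def fetch_page(url):
#         response = requests.get(url, timeout=10)
#         response.raise_for_status()
#         return response.text
#
#     html = retry_with_backoff(lambda: fetch_page(detail_link))
```

The sleep function is a parameter so the helper can be tried out without actually waiting.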
IDE
Android Studio: open and decompile an APK to inspect the contents of its .xml files, the SDKs it uses, the permissions it requests, and so on.
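The same permission information can also be read programmatically. The sketch below assumes the AndroidManifest.xml has already been decoded to plain-text XML (inside the APK it is stored in a binary format, so a decoding tool is needed first). The function name list_permissions and the sample manifest are made up for illustration.

```python
import xml.etree.ElementTree as ET

# XML namespace used by the android: attributes in a manifest
ANDROID_NS = "http://schemas.android.com/apk/res/android"

# Made-up sample of a decoded manifest, for illustration only
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.demo">
    <uses-permission android:name="android.permission.INTERNET" />
    <uses-permission android:name="android.permission.READ_SMS" />
    <uses-permission android:name="android.permission.INTERNET" />
    <application android:label="Demo" />
</manifest>"""


def list_permissions(manifest_xml):
    """Return the requested permissions in document order, without duplicates."""
    root = ET.fromstring(manifest_xml)
    name_attr = "{" + ANDROID_NS + "}name"
    permissions = []
    for node in root.iter("uses-permission"):
        name = node.get(name_attr)
        if name and name not in permissions:
            permissions.append(name)
    return permissions


if __name__ == "__main__":
    for permission in list_permissions(SAMPLE_MANIFEST):
        print(permission)
```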
Xposed tutorials
Xposed plugin development, part 1: getting started with Xposed: https://blog.csdn.net/niubitianping/article/details/52571438
Xposed repository: https://github.com/rovo89/Xposed
An example of hooking on Android with the Xposed framework: https://www.jianshu.com/p/372630e37683
An Xposed tutorial that starts from the emulator: https://juejin.im/entry/5900145b0ce463006146f26b
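2. Algorithm and model
XGBoost tutorials
XGBoost from theory to practice, binary and multi-class classification: https://blog.csdn.net/HHTNAN/article/details/81079257
A step-by-step guide to writing a working XGBoost program: https://juejin.im/post/5a1bb29e51882531ba10aa49
Using the XGBoost machine learning algorithm: http://irory.me/blog/16
XGBoost usage tutorial (pure xgboost approach): https://blog.csdn.net/u011630575/article/details/79418138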
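As a rough idea of how the crawled apps could be fed to XGBoost, the sketch below turns each app's permission list into a 0/1 feature row and fits a binary classifier. Everything here is an assumption made for illustration: these notes do not fix a feature set, so using permissions as features, the helper names (build_vocabulary, encode_permissions, train_classifier), the labels (1 = malicious, 0 = benign), and the hyperparameter values are all placeholders.

```python
def build_vocabulary(permission_lists):
    """Sorted list of every permission seen in the training apps."""
    return sorted({name for perms in permission_lists for name in perms})


def encode_permissions(permission_lists, vocabulary):
    """One row per app, one 0/1 column per permission in the vocabulary."""
    index = {name: i for i, name in enumerate(vocabulary)}
    rows = []
    for perms in permission_lists:
        row = [0] * len(vocabulary)
        for name in perms:
            if name in index:  # permissions outside the vocabulary are ignored
                row[index[name]] = 1
        rows.append(row)
    return rows


def train_classifier(rows, labels):
    """Fit a binary XGBoost model (assumed labels: 1 = malicious, 0 = benign)."""
    # Imported here so the encoding helpers above work without xgboost installed
    import numpy as np
    from xgboost import XGBClassifier

    # Placeholder hyperparameters, not tuned for any dataset
    model = XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1)
    model.fit(np.asarray(rows), np.asarray(labels))
    return model
```

A typical flow would be to call build_vocabulary on the training apps only, encode both training and test apps with that same vocabulary, and pass the encoded rows to train_classifier.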
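Related projects
Microsoft Malware Classification Challenge, malware-detection: https://github.com/dchad/malware-detection
Malware detection with machine learning, using the Alibaba Cloud malware detection competition as an example: https://xz.aliyun.com/t/3704 Code: https://github.com/Rman0fCN/ML_Malware_detect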