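Because I work in the security industry, I often run into network security problems that people want to solve with machine learning, which inevitably means relying on many security-related open datasets and tools. Here I record the datasets and open-source tools that I have used myself and found useful. This is probably only a tiny fraction of what exists in the security field; I hope it serves as a starting point, and I will keep updating it.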

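1. Security datasets

Practitioners in the industry have already taken the trouble to compile security datasets quite thoroughly. Here are the collection pages I have come across:

A large collection of machine learning resources for network security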

SecRepo.com - Samples of Security Related Data

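Security Alliance (安全联盟) Exposure Platform

About the Security Alliance Exposure Platform: the Security Alliance is a third-party non-profit organization founded in 2012 by internet companies including Knownsec (知道创宇) and Tencent. Since its founding it has worked with hundreds of institutions and companies, such as the 12321 reporting center, Tencent, Sogou, and Kingsoft, and has mobilized the public to take part in network governance. It has built the largest third-party platform in China for sharing and exchanging network security data: on average 45 million data-sharing requests per day, more than 5,000 user reports received daily, and, to date, over 890 million malicious URL and phone-number records. This data is used in search engines, browsers, IM clients, social platforms, router operating systems, and other internet endpoints, delivering more than 3 billion malicious-risk warnings to users every day and greatly helping them stay away from online fraud.

These collections cover almost everything. The first time I saw them I felt I had found a treasure, a gold mine, and then I was a little stumped: how do I actually mine it? My answer is to stand on the shoulders of those who came before and make use of their experience. I suggest buying a copy of 《Web安全之机器学习入门》 (an introduction to machine learning for web security) and downloading the accompanying code. The book walks through typical network security problems solved with machine learning methods, and many of the datasets listed above can be used with it, so it helps you get started quickly and judge whether a dataset is worth deeper use and study. Even better, the book also lists some public datasets in the network security field that complement the ones in the collections above, giving you a fast way to build models when starting a machine learning project in this area.

Beyond the collections, here are the datasets I have used in my own machine learning projects: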

Malicious URLs: http://www.sysnet.ucsd.edu/projects/url/
An anonymized 120-day subset of our ICML-09 data set is available from the following links:

URL Data Set (Matlab) (470 MB)

URL Data Set (SVM-light) (234 MB)
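For readers who have not worked with the SVM-light format before, here is a minimal sketch of how such a file can be read using only the Python standard library. Each line has the shape `<label> <index>:<value> ...`, and only the non-zero features are stored. The sample lines and the helper names (`parse_svmlight_line`, `parse_svmlight`) are made up for illustration and are not taken from the actual data set.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, Dict[int, float]]


def parse_svmlight_line(line: str) -> Row:
    """Parse one SVM-light line of the form '<label> <index>:<value> ...'."""
    # Anything after '#' is treated as a trailing comment and dropped.
    body = line.split("#", 1)[0].strip()
    tokens = body.split()
    label = int(float(tokens[0]))
    features: Dict[int, float] = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def parse_svmlight(lines: Iterable[str]) -> List[Row]:
    """Parse many lines, skipping blank lines and full-line comments."""
    rows: List[Row] = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        rows.append(parse_svmlight_line(stripped))
    return rows


# Hypothetical sample lines, only to show the sparse layout.
sample = [
    "+1 4:0.5 17:1 1024:0.25",
    "-1 2:1 17:0.75  # trailing comment",
]
rows = parse_svmlight(sample)
for label, features in rows:
    print(label, features)
```

For the full files, a library loader that returns a sparse matrix is likely a better fit than Python dictionaries, since the feature space is large.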

Botnet DGA domain data: http://osint.bambenekconsulting.com/feeds/dga-feed.txt
Malware traffic analysis: http://malware-traffic-analysis.net/
Malware classification data: https://www.kaggle.com/c/malware-classification
http://www.malshare.com/index.php
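
As one illustration of how a DGA domain list can feed a machine learning model, the sketch below turns a domain name into a few simple lexical features (length, character entropy, digit ratio). This is a common starting point, not the method of any particular project listed here. It assumes the feed is comma-separated with the domain in the first column and `#` marking comment lines; that layout, the sample lines, and the function names are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, Optional


def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the characters in text."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def parse_feed_line(line: str) -> Optional[str]:
    """Return the domain from one feed line, or None for comments and blanks.

    Assumes (hypothetically) a CSV layout with the domain in column one.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    return line.split(",", 1)[0].lower()


def domain_features(domain: str) -> Dict[str, float]:
    """Lexical features of the leftmost label of a domain name."""
    label = domain.split(".", 1)[0]
    digits = sum(ch.isdigit() for ch in label)
    return {
        "length": float(len(label)),
        "entropy": shannon_entropy(label),
        "digit_ratio": digits / len(label) if label else 0.0,
    }


# Made-up feed lines, only to exercise the functions.
feed = [
    "# comment line",
    "xkqzjwpvh3a9.example,hypothetical description,2017-01-01",
    "mail.example,hypothetical description,2017-01-01",
]
for raw in feed:
    domain = parse_feed_line(raw)
    if domain is not None:
        print(domain, domain_features(domain))
```

Algorithmically generated names tend to be longer and closer to uniformly random than hand-picked ones, which is why entropy and length are often tried first; on their own they are weak signals and need to be combined with other features.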

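2. Threat intelligence

Advanced persistent threats (APTs) are increasingly widespread in the security field, and threat intelligence, as an important means of countering APTs, is receiving more and more attention from security vendors. I found a fairly good collection of threat intelligence resources on GitHub: https://github.com/hslatman/awesome-threat-intelligence

3. Open-source scanner collections

A collection of open-source scanners developed by security practitioners
An open-source scanner toolbox

4. Open-source software collection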

Stratosphere Linux IPS (slips)

a behavioral-based intrusion detection and prevention system that uses machine learning algorithms to detect malicious behaviors.

https://github.com/stratosphereips/StratosphereLinuxIps

https://github.com/stratosphereips/StratosphereTestingFramework

Learn2ban

Open source machine learning DDOS detection tool

https://github.com/equalitie/learn2ban

malware-detection

Experiments in malware detection and classification using machine learning techniques.

https://github.com/dchad/malware-detection

Use of machine learning for anomaly detection in netflow data

https://github.com/eraclitux/machine-learning-netflow

Botnet Detection using Machine Learning

https://github.com/hmishra2250/Botnet-Detection-using-Machine-Learning

Fraud_Detector

Fraud Detection using ensemble of Statistical, Network analysis and Machine learning approach.

https://github.com/kskk02/Fraud_Detector

Intrusion Detection With Machine Learning

https://github.com/slrbl/Intrusion-and-anomaly-detection-with-machine-learning

Adaptive Machine Learning for Credit Card Fraud Detection

https://github.com/dalpozz/AMLFD

time series data analysis

https://github.com/linkedin/luminol

open source and threat intelligence

https://github.com/Te-k/harpoon

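Apache Spot: a new open-source network security project

Apache Spot is an open-source solution that analyzes network traffic and packets and uses its own machine learning approach to detect potential security threats and unknown network attacks. Apache Spot currently supports traffic analysis for Netflow, sflow, DNS, and Proxy data. It relies mainly on HDFS and Hive for storage and on Spark for computation, provides unsupervised machine learning based on the LDA algorithm, and uses Jupyter for graphical interaction.

Detailed introduction: https://mp.weixin.qq.com/s/DQdcByiuMNlUMhK7uHAdCA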

https://spot.apache.org/

https://hub.docker.com/r/apachespot/spot-demo/

AIEngine (Artificial Intelligent Engine)

AIEngine is a packet inspection engine with capabilities of learning without any human intervention. AIEngine helps network/security professionals to identify traffic and develop signatures for use in NIDS, firewalls, malware analysis, traffic classifiers, and so on.

URL: https://bitbucket.org/camp0/aiengine/

Passive DNS

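Passive DNS is very important for security research because it answers three questions: which IPs a domain has been bound to, which other domains an IP has hosted, and when a domain was first and last seen. Passive DNS is also very helpful in SOC work: starting from an identified malicious domain, you can find other compromised machines. Many sites let you query their Passive DNS systems, for example VirusTotal (https://www.virustotal.com/), PassiveTotal (https://www.passivetotal.com), and CIRCL (https://www.circl.lu/services/passive-dns/). There are many such sites, but having one of your own locally is of course more convenient.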
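
To make the three questions concrete, here is a minimal in-memory sketch of the lookups a Passive DNS store supports. It is not the data model of any of the services above; the class name `PassiveDnsStore`, its methods, and the sample observations (documentation-range IPs and `.example` domains) are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple


class PassiveDnsStore:
    """Toy store of (domain, ip, timestamp) DNS resolution observations."""

    def __init__(self) -> None:
        self._ips_by_domain: Dict[str, Set[str]] = defaultdict(set)
        self._domains_by_ip: Dict[str, Set[str]] = defaultdict(set)
        # domain -> (first_seen, last_seen), timestamps as epoch seconds
        self._seen: Dict[str, Tuple[int, int]] = {}

    @staticmethod
    def _normalize(domain: str) -> str:
        # Lowercase and drop the trailing root dot so lookups are consistent.
        return domain.lower().rstrip(".")

    def observe(self, domain: str, ip: str, timestamp: int) -> None:
        """Record that `domain` resolved to `ip` at `timestamp`."""
        name = self._normalize(domain)
        self._ips_by_domain[name].add(ip)
        self._domains_by_ip[ip].add(name)
        first, last = self._seen.get(name, (timestamp, timestamp))
        self._seen[name] = (min(first, timestamp), max(last, timestamp))

    def ips_for(self, domain: str) -> Set[str]:
        """Question 1: which IPs has this domain been bound to?"""
        return set(self._ips_by_domain.get(self._normalize(domain), set()))

    def domains_for(self, ip: str) -> Set[str]:
        """Question 2: which other domains has this IP hosted?"""
        return set(self._domains_by_ip.get(ip, set()))

    def seen_range(self, domain: str) -> Optional[Tuple[int, int]]:
        """Question 3: when was this domain first and last seen?"""
        return self._seen.get(self._normalize(domain))


store = PassiveDnsStore()
store.observe("Evil.example.", "192.0.2.1", 100)
store.observe("evil.example", "198.51.100.7", 50)
store.observe("other.example", "192.0.2.1", 70)

print(store.ips_for("evil.example"))
print(store.domains_for("192.0.2.1"))
print(store.seen_range("evil.example"))
```

The pivot described above is the second lookup: from a known malicious domain, take its IPs, then list every other domain seen on those IPs. A real deployment would persist observations in a database and track first and last seen per domain and IP pair rather than per domain.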

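More detailed introductions: http://www.freebuf.com/articles/network/103815.html and https://www.farsightsecurity.com/solutions/dnsdb/

More open-source tools: PassiveDNS::Client, https://github.com/chrislee35/passivedns-client

Vulhub

Vulhub is an open-source vulnerability testing range aimed at a general audience. No Docker knowledge is required: running two commands is enough to build and run a complete vulnerable environment image.

Source code: https://github.com/Cherishao/vulhub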
