Deep Fingerprinting: Undermining Website Fingerprinting Defenses with Deep Learning

Abstract

Website fingerprinting enables a local eavesdropper to determine which websites a user is visiting over an encrypted connection. State-of-the-art website fingerprinting attacks have been shown to be effective even against Tor. Recently, lightweight website fingerprinting defenses for Tor have been proposed that substantially degrade existing attacks: WTF-PAD and Walkie-Talkie. In this work, we present Deep Fingerprinting (DF), a new website fingerprinting attack against Tor that leverages a type of deep learning called Convolutional Neural Networks (CNN) with a sophisticated architecture design, and we evaluate this attack against WTF-PAD and Walkie-Talkie. The DF attack attains over 98% accuracy on Tor traffic without defenses, better than all prior attacks, and it is also the only attack that is effective against WTF-PAD with over 90% accuracy. Walkie-Talkie remains effective, holding the attack to just 49.7% accuracy. In the more realistic open-world setting, our attack remains effective, with 0.99 precision and 0.94 recall on undefended traffic. Against traffic defended with WTF-PAD in this setting, the attack still can get 0.96 precision and 0.68 recall. These findings highlight the need for effective defenses that protect against this new attack and that could be deployed in Tor.

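Purpose

Mount the DF website fingerprinting attack to defeat the WTF-PAD defense and identify the sites a user visits anonymously. (The goal is to identify websites.)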

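Background

Prior work has shown that, under certain conditions, a local adversary can identify the pages a Tor user has visited from patterns in the network traffic.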

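To mount the attack, the attacker captures packet sequences from a series of their own visits to the sites.
From each trace, the attacker extracts features that are distinctive for each website (for example packet sizes, packet frequency, transmission time in each direction, and the number of traffic bursts).
These feature vectors are then used to train a supervised classifier.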
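
The feature-extraction step can be sketched as follows. This is a minimal illustration, not the feature set of any particular attack: the function name `extract_features`, the chosen features, and the trace layout (a list of `(timestamp, signed_size)` pairs with positive sizes for outgoing packets) are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Assumed trace layout: (timestamp in seconds, signed packet size),
# positive size = outgoing, negative size = incoming, size never 0.
Trace = List[Tuple[float, int]]

def extract_features(trace: Trace) -> Dict[str, float]:
    """Compute a few hand-crafted features from one trace (illustrative only)."""
    sizes = [size for _, size in trace]
    outgoing = [s for s in sizes if s > 0]
    incoming = [s for s in sizes if s < 0]

    # A burst is a maximal run of consecutive packets in the same direction.
    bursts = 0
    prev_sign = 0
    for s in sizes:
        sign = 1 if s > 0 else -1
        if sign != prev_sign:
            bursts += 1
            prev_sign = sign

    duration = trace[-1][0] - trace[0][0] if trace else 0.0
    return {
        "n_packets": len(sizes),
        "n_outgoing": len(outgoing),
        "n_incoming": len(incoming),
        "bytes_outgoing": sum(outgoing),
        "bytes_incoming": -sum(incoming),
        "n_bursts": bursts,
        "duration": duration,
    }
```

One such dictionary per visit, turned into a fixed-order vector, is the kind of input a classical supervised classifier would be trained on. DF itself replaces this manual step with features learned by the CNN.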

Traffic analysis based on website fingerprinting then reveals the destination site of a connection.
Defenses:

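Insert dummy packets into the traffic (padding with dummy packets or delaying packets).
Add delay to real packets (this increases page-load latency and bandwidth overhead).
Adaptive padding (WTF-PAD) (practical to deploy)
Adaptive padding adds padding only when channel utilization is low, which saves bandwidth while hiding traffic bursts and the features derived from them.
Walkie-Talkie (W-T) defense: the core idea is to have the browser communicate in half-duplex mode, so that the resulting traces can be padded to collide with the traces of other sites using less padding than full-duplex traffic would need.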
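
The idea behind adaptive padding, filling quiet periods rather than padding constantly, can be sketched as below. This is a deliberately simplified illustration under stated assumptions: it uses a single fixed threshold `max_gap`, whereas WTF-PAD samples from distributions of inter-arrival times. The function name `pad_gaps` and the packet layout are hypothetical.

```python
from typing import List, Tuple

# Assumed packet layout: (timestamp in seconds, is_dummy).
Packet = Tuple[float, bool]

def pad_gaps(timestamps: List[float], max_gap: float) -> List[Packet]:
    """Insert dummy packets wherever the gap between real packets exceeds max_gap.

    timestamps must be sorted in increasing order. No padding is added
    while the channel is busy (gaps at or below max_gap).
    """
    if max_gap <= 0:
        raise ValueError("max_gap must be positive")
    padded: List[Packet] = []
    for i, t in enumerate(timestamps):
        if i > 0:
            # Fill the silence since the previous real packet.
            fill = timestamps[i - 1] + max_gap
            while fill < t:
                padded.append((fill, True))
                fill += max_gap
        padded.append((t, False))
    return padded
```

Because dummies are sent only in long gaps, the boundaries between bursts become harder to see, while busy periods cost no extra bandwidth.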

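The attack assumes an adversary such as an eavesdropper on the local network, a local system administrator, an ISP, or an AS, who:

can only passively observe the link between the user and the entry node of the network
can record packets but cannot modify, delay, drop, or decrypt them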

Model

Details

Parameter selection

Framework

Keras as the front end
TensorFlow as the back end

Source code and dataset location

URL

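Data representation

A trace is represented as a sequence of tuples <timestamp, ±packet_size>
The sign gives the packet direction: positive is outgoing, negative is incoming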
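
Since Tor sends fixed-size cells, a trace in this representation can be reduced to a sequence of directions before it is fed to a CNN. The sketch below assumes a fixed input length of 5000 and zero padding; both the length and the function name `to_direction_sequence` are assumptions made for illustration.

```python
from typing import List, Tuple

def to_direction_sequence(trace: List[Tuple[float, int]], length: int = 5000) -> List[int]:
    """Map a trace of (timestamp, signed_size) tuples to a fixed-length list.

    +1 = outgoing, -1 = incoming, 0 = padding after the trace ends.
    Traces longer than `length` are truncated.
    """
    directions = [1 if size > 0 else -1 for _, size in trace if size != 0]
    directions = directions[:length]
    directions += [0] * (length - len(directions))
    return directions
```

For example, `to_direction_sequence([(0.0, 512), (0.1, -1500)], length=4)` gives `[1, -1, 0, 0]`.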

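Richer encoding capacity

~~The number of filters increases as we go deeper into the network.~~
~~A CNN learns features hierarchically as it processes the input.~~
~~Higher layers hold high-level abstract features built from combinations of lower-layer features.~~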

Hardware cost

8 GB RAM
NVIDIA GTX 1070
Training time: 64 minutes

Dataset

Training : validation : test = 8 : 1 : 1
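
A split in these proportions can be sketched as below. The function name `split_8_1_1` and the fixed shuffle seed are assumptions for the example. The notes do not say whether the split is stratified per site, so this sketch simply shuffles all samples together.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

def split_8_1_1(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle items and split them into training, validation, and test sets (8:1:1)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```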

closed-world dataset

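Visit the home pages of the Alexa top 100 sites
Each site is visited 1,250 times
tcpdump is used to dump the traffic generated by each visit
Ten machines perform the visits in 5 batches each, with 25 visits to each site per batch (10 × 5 × 25 = 1,250)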

Benefits:

Avoids IP bans
Captures variations of each site over time, which enriches the dataset

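Procedure:

tor-browser-crawler is used to drive Tor Browser
User behavior is not modeled; a simple user model is followed
Corrupted traces are discarded, i.e., traces with no incoming or outgoing packets or that are too short
Only sites (classes) with at least 1,000 visits are kept
The final dataset contains 95 sites, each with at least 1,000 visits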
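
The cleaning steps above can be sketched as follows. The notes do not say how short "too short" is, so the `min_packets` default of 50 is a hypothetical value, as are the function name `clean_dataset` and the input layout (a list of `(site, trace)` pairs with signed packet sizes).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Trace = List[Tuple[float, int]]  # (timestamp, signed packet size)

def clean_dataset(
    visits: List[Tuple[str, Trace]],
    min_packets: int = 50,    # hypothetical threshold, not given in the notes
    min_visits: int = 1000,   # from the notes: keep sites with >= 1,000 visits
) -> Dict[str, List[Trace]]:
    """Drop corrupted traces, then drop sites left with too few visits."""
    by_site: Dict[str, List[Trace]] = defaultdict(list)
    for site, trace in visits:
        has_outgoing = any(size > 0 for _, size in trace)
        has_incoming = any(size < 0 for _, size in trace)
        if has_outgoing and has_incoming and len(trace) >= min_packets:
            by_site[site].append(trace)
    return {
        site: traces
        for site, traces in by_site.items()
        if len(traces) >= min_visits
    }
```

Filtering traces first and sites second matters: a site that has 1,250 raw visits but many corrupted ones may fall below the 1,000-visit threshold only after cleaning.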

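open-world dataset

Visit the Alexa top 50,000 sites, excluding the top 100 sites used in the closed-world dataset
The same ten machines perform the visits; each machine collects data for 5,000 distinct sites, visits each site only once, and takes a screenshot of its home page
Dataset cleaning: remove access-denied pages, blank pages, CAPTCHA pages (many sites use Cloudflare's CDN, which presents CAPTCHAs to connections coming from Tor), and timeout error pages
The final dataset contains 40,716 traces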

Results

non-defended closed-world:

defended closed-world:

open-world:

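Precision: of all instances the model predicts as positive (TP + FP), the fraction that are truly positive, TP / (TP + FP); higher is better
Recall: of all truly positive instances (TP + FN), the fraction the model finds, TP / (TP + FN); higher is better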
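
The two definitions translate directly into code. This is a minimal sketch; the function name `precision_recall` and the convention of returning 0.0 for an empty denominator are assumptions for the example.

```python
from typing import Tuple

def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

In the open-world setting, a false positive means an unmonitored site was flagged as monitored, and a false negative means a visit to a monitored site was missed.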

Conclusion

In this study, we investigated the performance of WF using deep learning techniques in both the closed-world scenario and the more realistic open-world scenario. We proposed a WF attack called Deep Fingerprinting (DF) using a sophisticated design based on a CNN for extracting features and classification. Our closed-world results show that the DF attack outperforms other state-of-the-art WF attacks, including better than 90% accuracy on traffic defended by WTF-PAD. We also performed open-world experiments, including the first open-world evaluation of WF attacks using deep learning against defended traffic. On undefended traffic, the DF attack attains a 0.99 precision and a 0.94 recall, while against WTF-PAD, it reaches a 0.96 precision and a 0.68 recall. Finally, we provided a discussion on our results along with suggestions for further investigation.
Overall, our study reveals the need to improve WF defenses to be more robust against attacks using deep learning, as attacks only get better, and we have already identified several directions to improve the DF attack further.

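Abbreviations

website fingerprinting (WF)
deep fingerprinting (DF)
automated website fingerprinting (AWF)
true positive rate (TPR)
false positive rate (FPR)
stacked denoising autoencoders (SDAE)
- Several DAEs are stacked to form a deep architecture. Noise is added to the input only during training; once training is done, no more noise is added.
- Greedy layer-wise training: each autoencoder layer is trained separately and without supervision, with the objective of minimizing the error between its input and its reconstruction.
- AE->DAE->SDAE
- AE (autoencoder): encodes the input and passes it to a more compact hidden layer, then decodes it. Decoding tries to reconstruct the original input from the hidden layer while minimizing the error.
- The main benefit of an AE is that it extracts high-level features from the training data, which reduces dimensionality.
- DAE (denoising autoencoder): builds on the AE but adds noise to the input. The DAE tries to reconstruct the original values from the noisy input, which helps the model generalize better.
- An SDAE combines multiple DAEs by stacking them, using the hidden layer of one DAE as the input to the next.
- SDAEs have achieved lower classification error rates in image classification.
Cloudflare: a US-based multinational technology company headquartered in San Francisco. It mainly provides a reverse-proxy-based content delivery network (CDN) and anycast technology.
How a CDN works: cache servers are deployed widely in the regions and networks where user traffic is concentrated. When a user visits a site, global load balancing directs the request to the nearest healthy cache server, which responds to the user directly. A CDN has four components: distributed storage, load balancing, request redirection, and content management.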

WTF-PAD supplement

TBD

西杭
