Web Log Analysis from a Digital Footprint Perspective


“The past is never dead. It’s not even past.” - William Faulkner


Digital footprint


1. what?

1.1 What is a digital footprint?

  • Digital footprint or digital shadow refers to one’s unique set of traceable digital activities, actions, contributions and communications that are manifested on the Internet or on digital devices.
  • In short: an individual's unique, traceable trail of activity on electronic devices.

1.2 What is a digital footprint used for?

Quite a lot, as it turns out...

  • In business:
    • The closed loop takes data from the open loop and provides this as a new data input. This new data determines what the user has reacted to, or how they have been influenced.
    • The feedback then builds a digital footprint based on social data, and the controller of the social digital footprint data can determine how and why people purchase and behave. (In other words, user habits.)
    • Since the concept of big data was introduced, data analysis has become extremely popular.

As the book title below suggests, your digital footprint can become someone else's business.
My Digital Footprint: A two-sided digital business model where your privacy will be someone else’s business! (by Tony Fish)

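  • In web logs:
    • The terms digital footprint and digital shadow are quite vivid: together they amount to a copy of the user, from which a virtual portrait can be drawn. What did this user do? What do they like?
    • Shopping sites typically build such a user profile in order to make personalized recommendations.
    • From a security standpoint, web logs (web server logs and web application logs) can be used to characterize normal user behavior and build a model of it. Conversely, malicious behavior can be characterized too, for example hammering a specific page or sending a large number of requests in a short time, and modeled in the same way.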
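The "large number of requests in a short time" pattern mentioned above can be sketched as a sliding-window counter. This is only a minimal illustration: the function name `find_bursts`, the 10-second window and the threshold of 20 requests are assumptions made for the example, not values from any real deployment.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def find_bursts(events, window_seconds=10, threshold=20):
    """Return the set of client IPs that sent more than `threshold`
    requests within any `window_seconds`-long window.

    `events` is an iterable of (ip, datetime) pairs sorted by time.
    The window size and threshold are illustrative assumptions.
    """
    window = timedelta(seconds=window_seconds)
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for ip, ts in events:
        timestamps = recent[ip]
        timestamps.append(ts)
        # Drop timestamps that have fallen out of the window.
        while ts - timestamps[0] > window:
            timestamps.popleft()
        if len(timestamps) > threshold:
            flagged.add(ip)
    return flagged
```

A sensible threshold depends heavily on the site: a busy shared egress IP can legitimately exceed a limit that would be suspicious for a single home user.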

1.3 How are digital footprints classified?

  • passive
    • A passive digital footprint is created when data is collected without the owner knowing (that is, collected covertly).

Web logs should count as passive collection.

  • active:

    • Active digital footprints are created when personal data is **released deliberately by a user** for the purpose of sharing information about oneself by means of websites or social media (that is, with the user's consent).
  • Example?

    • Active collection:
    • After registering and logging in, a user shares their contact list.

2. why?


2.1 Why collect digital footprints?

In web logs:

  • Log files provide us with a precise view of the behavior of a server as well as critical information like when, how and “by whom” a server is being accessed.

  • This kind of information can help us monitor the performance, troubleshoot and debug applications, as well as help forensic investigators unfold the chain of events that may have led to a malicious activity.


3. how?


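3.1 How is a user's digital footprint collected, and what are the sources?

Digital footprints can also be stored in many ways depending on the situation; the circumstances and scenarios in which they are collected vary widely.

  • Web browsing:
    • On the World Wide Web, the internet footprint, also known as cyber shadow, electronic footprint, or digital shadow, is the information left behind as a result of a user’s web-browsing and stored as cookies. (Behind it lie the user's browser visits and the recorded cookies.)
    • The term usually applies to an individual person, but can also refer to a business, organization and corporation. (An IP address does not necessarily belong to an individual.)

While doing log analysis I nearly overlooked something that seems obvious: normal users visit through a browser... Also, an IP address may not belong to an individual.

  • user-agent:

In web logs, normal users arrive through a browser, so their user-agent should be a browser identifier.
Common examples: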

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16

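Can an abnormal user-agent be taken as grounds to suspect an attacker?
When a scanner is used, the following names commonly show up in the user-agent:

sqlmap, acunetix, Netsparker, nmap

Not visiting through a proper browser? What is this user up to?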
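The user-agent check described above could look roughly like the sketch below. The scanner names come from the list above; the browser markers, the function name `classify_user_agent` and the three labels are assumptions made for illustration. Because the User-Agent header is supplied by the client and is easy to forge, a "browser" result proves nothing and a "scanner" result is only a hint.

```python
# Substrings taken from the scanner list above (lower-cased).
SCANNER_MARKERS = ("sqlmap", "acunetix", "netsparker", "nmap")
# Hypothetical markers for mainstream browsers; extend as needed.
BROWSER_MARKERS = ("mozilla/", "chrome/", "safari/", "firefox/")


def classify_user_agent(user_agent):
    """Label a User-Agent string as 'scanner', 'browser' or 'unknown'."""
    ua = (user_agent or "").lower()
    # Check scanners first: some tools embed a browser-like prefix.
    if any(marker in ua for marker in SCANNER_MARKERS):
        return "scanner"
    if any(marker in ua for marker in BROWSER_MARKERS):
        return "browser"
    return "unknown"
```

Entries labeled "unknown" (empty or unusual user-agents) are usually worth a second look, since crawlers and scripts often fall into that group.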

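  • Pinpointing an individual

First, keep in mind that the IP address is not necessarily the real one; a proxy may be in use.

The official Apache documentation says:

The IP address reported here is not necessarily the address of the machine at which the user is sitting. If a proxy server exists between the user and the server, this address will be the address of the proxy, rather than the originating machine.

Even without a proxy, the IP may be a shared egress address, such as that of a campus network...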

The term usually applies to an individual person, but can also refer to a business, organization and corporation


4. when?


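4.1 When does collection of the digital footprint begin?

For web logs, recording starts the moment you access the web service.

A web server log is, by definition, a record of all activity while the server is running.
The resulting log file is named web-access.log, and "access" means exactly that: whether you are a thief or a normal user, a record is made as soon as you visit.

A web server log is a bit like a surveillance camera: it records around the clock.


Once you step into the camera's field of view, you are recorded.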


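4.2 The question of timing:

If you were a security guard and saw on camera that someone was in the parking lot at 3 a.m., wouldn't you suspect, at least a little, that they were a thief?
By the same logic, in most cases normal users do not sneak onto a website at 3 a.m.

Take an internal OA site: would employees really be accessing it in the middle of the night? That would be odd...

In routine offline log analysis, some customers have asked for logs to be split into normal working hours and non-working hours for analysis.

So log analysis can be customized with rules along these lines.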
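A working-hours rule of this kind might be sketched as follows. The timestamp layout matches the bracketed time field in Apache-style access logs; the 09:00 to 18:00, Monday to Friday schedule and the function name `is_working_hours` are assumptions for illustration, and each customer's real schedule would replace them.

```python
from datetime import datetime

# Layout of the bracketed time field, e.g. "10/Oct/2000:13:55:36 -0700".
# Note: %b relies on English month abbreviations (the default C locale).
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def is_working_hours(timestamp, start_hour=9, end_hour=18):
    """Return True if `timestamp` falls on a weekday between
    `start_hour` (inclusive) and `end_hour` (exclusive).

    The hour is read in the time zone recorded in the log entry.
    The default schedule is an assumed example, not a standard.
    """
    ts = datetime.strptime(timestamp, LOG_TIME_FORMAT)
    return ts.weekday() < 5 and start_hour <= ts.hour < end_hour
```

For example, `is_working_hours("10/Oct/2000:03:00:00 -0700")` is False, so that request would land in the non-working-hours bucket.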


5. where?


5.1 What are the collection scenarios?

passive digital footprint:

  • In an online environment a footprint may be stored in an online database as a “hit”. This footprint may track the user IP address, when it was created, and where they came from; with the footprint later being analyzed.
  • In an offline environment, a footprint may be stored in files, which can be accessed by administrators to view the actions performed on the machine, without being able to see who performed them.

active digital footprints:

  • In an online environment, a footprint can be stored by a user being logged into a site when making a post or change, with the registered name being connected to the edit.
  • In an offline environment a footprint may be stored in files, when the owner of the computer uses a keylogger, so logs can show the actions performed on the machine, and who performed them. One of the features of keylogger is to monitor the clipboard for any changes. This may be problematic as the user may copy passwords or take screenshots of sensitive information which will then be logged.

In web logs:

  • web server
  • web application server
  • database
  • security devices such as a WAF
  • and so on

6. who?

6.1 What kinds of users appear in a digital footprint?

  • robots
    • crawlers (spiders)
    • scanners
  • people
    • normal people
    • abnormal people

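Relating digital footprints to log analysis:

The link between digital footprints and log analysis is mostly helpful as an analogy.
Each record is like a single footprint.

In the log file, web server log entries appear one per line; each attempted request for a resource produces one record.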

127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
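To work with such records programmatically, each line first has to be split into fields. The sketch below parses the Common Log Format shown in the sample line; the function name `parse_clf_line` and the dictionary keys are hypothetical choices for this example, and logs written in other formats would need a different pattern.

```python
import re

# One Common Log Format entry:
# host ident user [time] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # The request field normally looks like "METHOD PATH PROTOCOL".
    parts = record["request"].split()
    record["method"] = parts[0] if len(parts) > 0 else None
    record["path"] = parts[1] if len(parts) > 1 else None
    record["status"] = int(record["status"])
    # A size of "-" means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record
```

Returning None for unparseable lines, rather than raising, keeps a batch job running over messy logs; such lines may themselves deserve inspection.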

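With so many scattered records, how do we identify the behavior of a malicious user among them? And going a step further, can we tell whether the attack succeeded?

First, recognize that web server logs have an inherent limitation.
Detecting Attacks on Web Applications from Log Files makes the same point:

Log files have its limits, though. Web server log files contain only a fraction of the full HTTP request and response

Knowing those limits, the majority of attacks can be recognized and acted upon to prevent further exploitation.

The limitation to keep in mind is that a web server log holds only part of the information in a full HTTP request and response.

So how do we correlate records on that basis? That is the hard part...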

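There are currently two approaches:

  • Experience-based correlation analysis
  • Machine learning and algorithms

Experience-based correlation analysis:

Some example ideas:

  • Suppose an attacker brute-forces the login page. First classify the activity as brute forcing; if the attacker is then taken to an admin back-end page that requires login (response code 200), judge the brute-force attack successful.

  • Use the Referer field to trace the attack path, combined with judgment about the actual situation
    See: Using logs to investigate a web application attack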
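The brute-force idea above can be sketched as a simple two-stage rule over already parsed records. Everything specific here is an assumption for illustration: the `/login` and `/admin` paths, the minimum of 10 attempts, the record keys, and the function name `find_successful_bruteforce`. The rule also inherits the limitation of web server logs: a 200 on an admin page suggests success but does not prove it.

```python
from collections import defaultdict


def find_successful_bruteforce(records, login_path="/login",
                               admin_prefix="/admin", min_attempts=10):
    """Return IPs that made at least `min_attempts` POSTs to the login
    page and later received a 200 on a page under `admin_prefix`.

    `records` is a chronological iterable of dicts with the keys
    'ip', 'method', 'path' and 'status' (hypothetical field names).
    """
    attempts = defaultdict(int)  # ip -> number of login POSTs seen so far
    succeeded = set()
    for rec in records:
        ip = rec["ip"]
        if rec["method"] == "POST" and rec["path"] == login_path:
            attempts[ip] += 1
        elif (rec["path"].startswith(admin_prefix)
              and rec["status"] == 200
              and attempts[ip] >= min_attempts):
            succeeded.add(ip)
    return succeeded
```

A normal administrator who logs in once and opens the back end is not flagged, because the attempt count stays below the threshold.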

Machine learning and algorithms:

  • graph algorithms
  • HMM (hidden Markov model)
  • Isolation Forest
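As a rough illustration of the sequence-modeling idea, the sketch below uses a plain first-order Markov chain over page transitions. This is a deliberately simplified stand-in, not an HMM and not Isolation Forest: it learns how often one page follows another in sessions assumed to be normal, then scores a new session by its average transition log-probability. The function names, the add-alpha smoothing and the `vocab_size` parameter are assumptions of this example.

```python
import math
from collections import defaultdict


def train_transitions(sessions):
    """Count page-to-page transitions in sessions assumed to be normal.

    `sessions` is an iterable of page-path lists, one list per session.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for pages in sessions:
        for src, dst in zip(pages, pages[1:]):
            counts[src][dst] += 1
    return counts


def session_score(counts, pages, vocab_size, alpha=1.0):
    """Average log-probability of the transitions in `pages`.

    Uses add-alpha smoothing so unseen transitions get a small,
    non-zero probability. Lower scores mean more unusual sessions.
    """
    if len(pages) < 2:
        return 0.0
    total = 0.0
    for src, dst in zip(pages, pages[1:]):
        row = counts.get(src, {})
        row_total = sum(row.values())
        prob = (row.get(dst, 0) + alpha) / (row_total + alpha * vocab_size)
        total += math.log(prob)
    return total / (len(pages) - 1)
```

Sessions scoring well below those seen in training are candidates for review; choosing the cut-off and grouping raw log lines into sessions are exactly the engineering questions that remain open.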

Still testing; engineering this for production looks somewhat difficult...


Related concepts:

Digital Age:

Remember to be careful about what you post on social networking sites such as Facebook or Twitter. You never know who might read something from your past that could impact your future.

“The past is never dead. It’s not even past.” - William Faulkner

Could what threat intelligence and IP reputation databases build up for a given IP also count as a kind of Digital Age portrait?




Open questions:

  • Do web logs contain cookies?

  • Can use of the HEAD method be treated as scanning behavior?

  • How do web logs capture the client IP?


Reposted from blog.csdn.net/qq_28921653/article/details/78945526