Experts, please feel free to skip this one ~~~
1. Background
This article takes the access logs of my blog site over a period of time as a worked example.
Knowledge used: basic data types (lists and dictionaries), regex matching with the re module, data processing with the pandas module, Excel writing with the xlwt module, and so on.
Goal: analyze the access logs to obtain the top 20 visitor ips, the top 20 accessed addresses, and a ranking of client uas, and produce an Excel report.
2. Evolution of the approach
2.1 Step 1: reading the log
To analyze nginx logs, the first step is to obtain the nginx log file to be analyzed. The log file has a fixed format: each field in each line has a specific meaning, for example:
95.143.192.110 - - [15/Dec/2019:10:22:00 +0800] "GET /post/pou-xi-he-jie-jue-python-zhong-wang-luo-nian-bao-de-zheng-que-zi-shi/ HTTP/1.1" 304 0 "https://www.ssgeek.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
In order, the fields of this line represent: the visitor's source ip, the access time, the http request (method and address), the http status code, the byte size of this request, the referer, and the client ua identifier.
So, first extract the content of a single line, matching and recording each field of that line with named groups; then extend this single-line parser to the whole log file. To match each field of the log, the re module's regex matching is needed, as follows:
import re

obj = re.compile(r'(?P<ip>.*?)- - \[(?P<time>.*?)\] "(?P<request>.*?)" (?P<status>.*?) (?P<bytes>.*?) "(?P<referer>.*?)" "(?P<ua>.*?)"')

def load_log(path):
    with open(path, mode="r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            parse(line)

def parse(line):
    # Parse a single line of the nginx log
    try:
        result = obj.match(line)
        print(result.group("ip"))
    except Exception:
        pass

if __name__ == '__main__':
    load_log("nginx_access.log")
The re module matches the named groups in order: ip, time, request, status, bytes, referer, ua. The code above ends up printing the visitor source ip of every line.
Going a step further, to output all fields at once, simply print result.groupdict(); the output is one dictionary per line, like this:
{'ip': '46.229.168.150 ', 'time': '24/Dec/2019:13:21:39 +0800', 'request': 'GET /post/zabbix-web-qie-huan-wei-nginx-ji-https HTTP/1.1', 'status': '301', 'bytes': '178', 'referer': '-', 'ua': 'Mozilla/5.0 (compatible; SemrushBot/6~bl; +http://www.semrush.com/bot.html)'}
2.2 Step 2: parsing the log
Parse each log line precisely, format the output, and add filtering for bad lines.
load_log() function: to guard against malformed log lines ("dirty data"), two empty lists are defined, lst and error_lst, to record the matching results. Each element of lst represents one successfully parsed line; lines that cannot be matched go into error_lst. Finally, the function prints the total number of lines, the number of lines that could not be matched, and the number of lines that matched.
parse() function: the parameter line is passed in, each field captured by the groups is processed in turn, and the processed values are assembled into a dictionary. The client ua identification below only covers a few common cases; for more precise matching, refer to a user-agent reference table for common PC/mobile browsers and write more accurate matching rules.
import re
import datetime

obj = re.compile(
    r'(?P<ip>.*?)- - \[(?P<time>.*?)\] "(?P<request>.*?)" (?P<status>.*?) (?P<bytes>.*?) "(?P<referer>.*?)" "(?P<ua>.*?)"')

def load_log(path):
    lst = []
    error_lst = []
    i = 0
    with open(path, mode="r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            dic = parse(line)
            if dic:  # good data is appended to the lst list
                lst.append(dic)
            else:
                error_lst.append(line)  # dirty data is appended to the error_lst list
            i += 1
    print(i)
    print(len(error_lst))
    print(len(lst))

def parse(line):
    # Parse a single line of the nginx log
    dic = {}
    try:
        result = obj.match(line)
        # ip handling
        ip = result.group("ip")
        if ip.strip() == '-' or ip.strip() == "":  # discard the line if no ip was matched
            return False
        dic['ip'] = ip.split(",")[0]  # if there are two ips, take the first one
        # status code handling
        status = result.group("status")  # status code
        dic['status'] = status
        # time handling
        time = result.group("time")  # 21/Dec/2019:21:45:31 +0800
        time = time.replace(" +0800", "")  # strip the +0800 timezone suffix
        t = datetime.datetime.strptime(time, "%d/%b/%Y:%H:%M:%S")  # parse the time into a friendly format
        dic['time'] = t
        # request handling
        request = result.group(
            "request")  # GET /post/pou-xi-he-jie-jue-python-zhong-wang-luo-nian-bao-de-zheng-que-zi-shi/ HTTP/1.1
        a = request.split()[1].split("?")[0]  # a url often carries parameters after a '?'; take the url without parameters
        dic['request'] = a
        # user_agent handling
        ua = result.group("ua")
        if "Windows NT" in ua:
            u = "windows"
        elif "iPad" in ua:
            u = "ipad"
        elif "Android" in ua:
            u = "android"
        elif "Macintosh" in ua:
            u = "mac"
        elif "iPhone" in ua:
            u = "iphone"
        else:
            u = "其他设备"
        dic['ua'] = u
        # referer handling
        referer = result.group("referer")
        dic['referer'] = referer
        return dic
    except Exception:
        return False

if __name__ == '__main__':
    load_log("nginx_access.log")
Run the code and check the printed results; the console outputs:
9692
542
9150
These numbers represent, in order: the total number of lines in the log file, the number of lines that failed to match, and the number of lines that matched correctly.
2.3 Step 3: analyzing the log
Use the pandas module to analyze the log.
analyse() function: the parsed and filtered lst list is passed in as a parameter; its data format is a list of dictionaries, like [{ip:xxx, api:xxx, status:xxxx, ua:xxx}]
df = pd.DataFrame(lst)
This converts the parsed list into a table-like structure. Printing df to the console shows each record numbered with a sequential index; the first row is the header, and the header values are the keys of the dictionaries above:
ip status ... ua referer
0 95.143.192.110 200 ... mac -
1 95.143.192.110 304 ... mac -
2 95.143.192.110 304 ... mac -
3 95.143.192.110 304 ... mac https://www.ssgeek.com/
4 203.208.60.122 200 ... android -
... ... ... ... ... ...
9145 46.4.60.249 404 ... 其他设备 -
9146 46.4.60.249 404 ... 其他设备 -
9147 46.4.60.249 404 ... 其他设备 -
9148 46.4.60.249 404 ... 其他设备 -
9149 154.223.188.124 404 ... windows -
pd.value_counts(df['ip'])
This takes the ip column and counts how many times each ip appears. In the result, the first column is the ip and the second is the count, but pandas treats that first column as the row index by default, so the whole data needs to be shifted one column to the right; calling reset_index() defines a new numeric index.
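In code, the intermediate expression that produces the table below is:
pd.value_counts(df['ip']).reset_index()
The effect takes this form: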
index ip
0 89.163.242.228 316
1 207.180.220.114 312
2 78.46.90.53 302
3 144.76.38.10 301
4 78.46.61.245 301
... ... ...
1080 203.208.60.85 1
1081 66.249.72.8 1
1082 141.8.132.13 1
1083 207.46.13.119 1
1084 203.208.60.7 1
Now there is a numeric index, but the header is still shifted relative to the columns (the counts sit under the ip header), so the headers need to be redefined with reset_index().rename(columns={"index": "ip", "ip": "count"}). The effect takes this form:
ip count
0 89.163.242.228 316
1 207.180.220.114 312
2 78.46.90.53 302
3 78.46.61.245 301
4 144.76.38.10 301
... ... ...
1080 47.103.17.71 1
1081 42.156.254.92 1
1082 220.243.136.156 1
1083 180.163.220.61 1
1084 106.14.215.243 1
Log analysis often only needs the top few visits, e.g. the top 20. pandas makes this easy with iloc slicing: iloc[:20, :] takes the first 20 rows and all columns. The final processing code:
ip_count = pd.value_counts(df['ip']).reset_index().rename(columns={"index": "ip", "ip": "count"}).iloc[:20, :]
print(ip_count)
The resulting data:
ip count
0 89.163.242.228 316
1 207.180.220.114 312
2 78.46.90.53 302
3 144.76.38.10 301
4 78.46.61.245 301
5 144.76.29.148 301
6 204.12.208.154 301
7 148.251.92.39 301
8 5.9.70.72 286
9 223.71.139.28 218
10 95.216.19.59 209
11 221.13.12.147 131
12 117.15.90.21 130
13 175.184.166.181 129
14 148.251.49.107 128
15 171.37.204.72 127
16 124.95.168.140 118
17 171.34.178.76 98
18 60.216.138.190 97
19 141.8.142.158 87
Similarly, the same operations can be applied to request, ua, and so on; see the sketch right after this paragraph.
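For reference, the analogous expressions (as used in the final code in section 2.6) are:
request_count = pd.value_counts(df['request']).reset_index().rename(columns={"index": "request", "request": "count"}).iloc[:20, :]
ua_count = pd.value_counts(df['ua']).reset_index().rename(columns={"index": "ua", "ua": "count"}).iloc[:, :]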
2.4 Step 4: generating the report
Use the xlwt module to write the data analyzed with pandas into an Excel spreadsheet. Before writing, the pandas results need to be converted back into plain data:
ip_count_values = ip_count.values
request_count_values = request_count.values
ua_count_values = ua_count.values
The type of this data is numpy.ndarray, an array of objects, like this:
[['89.163.242.228 ' 316]
['207.180.220.114 ' 312]
['78.46.90.53 ' 302]
['204.12.208.154 ' 301]
['144.76.29.148 ' 301]
['144.76.38.10 ' 301]
['78.46.61.245 ' 301]
['148.251.92.39 ' 301]
['5.9.70.72 ' 286]
['223.71.139.28 ' 218]
['95.216.19.59 ' 209]
['221.13.12.147 ' 131]
['117.15.90.21 ' 130]
['175.184.166.181 ' 129]
['148.251.49.107 ' 128]
['171.37.204.72 ' 127]
['124.95.168.140 ' 118]
['171.34.178.76 ' 98]
['60.216.138.190 ' 97]
['141.8.142.158 ' 87]]
Sheet pages are written with the xlwt module, with each processed dataset written to its own sheet page:
# Write to excel
wb = xlwt.Workbook()  # open an excel workbook
sheet = wb.add_sheet("ip访问top20")  # create a new sheet page
# Write the header row
row = 0
sheet.write(row, 0, "ip")  # write row, column, content
sheet.write(row, 1, "count")  # write row, column, content
row += 1  # advance the row number
for item in ip_count_values:
    sheet.write(row, 0, item[0])
    sheet.write(row, 1, item[1])
    row += 1
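Note that this snippet only fills in one sheet and never saves the workbook; as in the full code in section 2.6, finish with:
wb.save("abc.xls")  # xlwt writes the legacy .xls format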
2.5 Step 5: scheduled log collection and analysis
With the log analysis done, return to the original requirement: capture the log file and analyze it on a schedule. The time module can be used together with a check of the current time to achieve scheduled analysis, for example analyzing the log at 1:00 on the 3rd of every month:
import time

if __name__ == '__main__':
    while 1:
        stime = datetime.datetime.now().strftime("%d:%H:%M:%S")
        if stime == "03:01:00:00":
            lst, error_lst = load_log("nginx_access.log")
            analyse(lst)
        time.sleep(1)
Of course, the scheduled analysis can also be handed off to the server's own task-scheduling facilities, such as cron; a sketch follows.
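For example, a crontab entry could run the script at 1:00 on the 3rd of every month (the interpreter path and script name here are hypothetical placeholders):
# minute hour day-of-month month day-of-week  command
0 1 3 * * /usr/bin/python3 /path/to/log_analysis.py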
2.6 The final result
Following the evolution above, the final code is as follows:
import re
import datetime
import pandas as pd
import xlwt

obj = re.compile(
    r'(?P<ip>.*?)- - \[(?P<time>.*?)\] "(?P<request>.*?)" (?P<status>.*?) (?P<bytes>.*?) "(?P<referer>.*?)" "(?P<ua>.*?)"')

def load_log(path):
    lst = []
    error_lst = []
    i = 0
    with open(path, mode="r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            dic = parse(line)
            if dic:  # good data is appended to the lst list
                lst.append(dic)
            else:
                error_lst.append(line)  # dirty data is appended to the error_lst list
            i += 1
    return lst, error_lst

def parse(line):
    # Parse a single line of the nginx log
    dic = {}
    try:
        result = obj.match(line)
        # ip handling
        ip = result.group("ip")
        if ip.strip() == '-' or ip.strip() == "":  # discard the line if no ip was matched
            return False
        dic['ip'] = ip.split(",")[0]  # if there are two ips, take the first one
        # status code handling
        status = result.group("status")  # status code
        dic['status'] = status
        # time handling
        time = result.group("time")  # 21/Dec/2019:21:45:31 +0800
        time = time.replace(" +0800", "")  # strip the +0800 timezone suffix
        t = datetime.datetime.strptime(time, "%d/%b/%Y:%H:%M:%S")  # parse the time into a friendly format
        dic['time'] = t
        # request handling
        request = result.group(
            "request")  # GET /post/pou-xi-he-jie-jue-python-zhong-wang-luo-nian-bao-de-zheng-que-zi-shi/ HTTP/1.1
        a = request.split()[1].split("?")[0]  # a url often carries parameters after a '?'; take the url without parameters
        dic['request'] = a
        # user_agent handling
        ua = result.group("ua")
        if "Windows NT" in ua:
            u = "windows"
        elif "iPad" in ua:
            u = "ipad"
        elif "Android" in ua:
            u = "android"
        elif "Macintosh" in ua:
            u = "mac"
        elif "iPhone" in ua:
            u = "iphone"
        else:
            u = "其他设备"
        dic['ua'] = u
        # referer handling
        referer = result.group("referer")
        dic['referer'] = referer
        return dic
    except Exception:
        return False

def analyse(lst):  # [{ip:xxx, api:xxx, status:xxxx, ua:xxx}]
    df = pd.DataFrame(lst)  # convert into a table
    # print(df)
    # print(df['ip'])  # take only the ip column
    ip_count = pd.value_counts(df['ip']).reset_index().rename(columns={"index": "ip", "ip": "count"}).iloc[:20, :]
    request_count = pd.value_counts(df['request']).reset_index().rename(columns={"index": "request", "request": "count"}).iloc[:20, :]
    ua_count = pd.value_counts(df['ua']).reset_index().rename(columns={"index": "ua", "ua": "count"}).iloc[:, :]
    # Convert from pandas back to ordinary data
    ip_count_values = ip_count.values
    request_count_values = request_count.values
    ua_count_values = ua_count.values
    # print(type(ip_count_values))
    # Write to excel
    wb = xlwt.Workbook()  # open an excel workbook
    sheet = wb.add_sheet("ip访问top20")  # create a new sheet page
    # Write the header row
    row = 0
    sheet.write(row, 0, "ip")  # write row, column, content
    sheet.write(row, 1, "count")  # write row, column, content
    row += 1  # advance the row number
    for item in ip_count_values:
        sheet.write(row, 0, item[0])
        sheet.write(row, 1, item[1])
        row += 1
    sheet = wb.add_sheet("request访问top20")  # create a new sheet page
    # Write the header row
    row = 0
    sheet.write(row, 0, "request")  # write row, column, content
    sheet.write(row, 1, "count")  # write row, column, content
    row += 1  # advance the row number
    for item in request_count_values:
        sheet.write(row, 0, item[0])
        sheet.write(row, 1, item[1])
        row += 1
    sheet = wb.add_sheet("ua访问top")  # create a new sheet page
    # Write the header row
    row = 0
    sheet.write(row, 0, "ua")  # write row, column, content
    sheet.write(row, 1, "count")  # write row, column, content
    row += 1  # advance the row number
    for item in ua_count_values:
        sheet.write(row, 0, item[0])
        sheet.write(row, 1, item[1])
        row += 1
    wb.save("abc.xls")

if __name__ == '__main__':
    lst, error_lst = load_log("nginx_access.log")
    analyse(lst)
The resulting Excel report contains three sheets: the ip ranking ("ip访问top20"), the access address ranking ("request访问top20"), and the client ua ranking ("ua访问top").
2.7 Directions for extension
This article only performs entry-level log analysis. It could be extended in several directions, for example pushing the analysis report by email on a schedule, or displaying the analysis report graphically; a minimal email sketch follows.
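As one illustration of the email-push idea, here is a minimal sketch using Python's standard smtplib and email modules; the SMTP host, credentials, and addresses are all hypothetical placeholders, not part of the original post:
import smtplib
from email.message import EmailMessage

def mail_report(path="abc.xls"):
    # Build a message with the generated report attached
    msg = EmailMessage()
    msg["Subject"] = "nginx access log report"
    msg["From"] = "sender@example.com"    # placeholder address
    msg["To"] = "receiver@example.com"    # placeholder address
    msg.set_content("See the attached Excel report.")
    with open(path, "rb") as f:
        msg.add_attachment(f.read(), maintype="application",
                           subtype="vnd.ms-excel", filename=path)
    # Send over SMTP with STARTTLS (host and login are placeholders)
    with smtplib.SMTP("smtp.example.com", 587) as s:
        s.starttls()
        s.login("sender@example.com", "password")
        s.send_message(msg)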