Analyzing Nginx Logs with Python

If you're already an expert, feel free to skip this one ~~~

1. Background

This post uses the access log of my blog site over a period of time as a working example.

  • Knowledge used
    basic data types (lists and dictionaries), the re module for regular-expression matching, the pandas module for data processing, the xlwt module for writing Excel files, and so on

  • Final functionality
    analyze the access log to obtain the top 20 visitor ips, the top 20 requested addresses, and a ranking of client ua strings, then generate an Excel report

2. Evolution of the approach

2.1 Step one: reading the log

To analyze an nginx log, the first thing is to obtain the nginx log file to be analyzed. The log file follows a fixed format definition, and each field of every line has a specific meaning, for example:

95.143.192.110 - - [15/Dec/2019:10:22:00 +0800] "GET /post/pou-xi-he-jie-jue-python-zhong-wang-luo-nian-bao-de-zheng-que-zi-shi/ HTTP/1.1" 304 0 "https://www.ssgeek.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"

The fields above represent, in order: the visitor's source ip, the access time, the HTTP request method and address, the HTTP status code, the byte size of the response, the referer, and the client ua string.
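
For reference, this layout matches nginx's predefined combined log format (the default used by access_log); if the server defines a custom log_format, the regular expression below has to be adjusted to match:

log_format combined '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';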

So, start by extracting the content of a single line: match each field of that line with the re module's regular expressions, group the matches, and record the specific information of each field; then apply this single-line parser to the whole log file. As follows:

import re


obj = re.compile(r'(?P<ip>.*?)- - \[(?P<time>.*?)\] "(?P<request>.*?)" (?P<status>.*?) (?P<bytes>.*?) "(?P<referer>.*?)" "(?P<ua>.*?)"')

def load_log(path):
    with open(path, mode="r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            parse(line)

def parse(line):
    # parse a single line of the nginx log
    try:
        result = obj.match(line)
        print(result.group("ip"))
    except AttributeError:  # obj.match returns None for lines that do not match
        pass


if __name__ == '__main__':
    load_log("nginx_access.log")

The regular expression captures the fields of each line by named groups: ip, time, request, status, bytes, referer, ua.
The code above finally prints the visitor source ip of every line.

Going a step further, to output all fields, simply print result.groupdict(); the output is a sequence of dictionaries, as follows:

{'ip': '46.229.168.150 ', 'time': '24/Dec/2019:13:21:39 +0800', 'request': 'GET /post/zabbix-web-qie-huan-wei-nginx-ji-https HTTP/1.1', 'status': '301', 'bytes': '178', 'referer': '-', 'ua': 'Mozilla/5.0 (compatible; SemrushBot/6~bl; +http://www.semrush.com/bot.html)'}

2.2 Step two: parsing the log

Parse each line of the log precisely, format the output, and add some filtering:

load_log() function:
In load_log(), to guard against malformed log lines (a kind of "dirty data"), two empty lists lst and error_lst are defined to record the matching results; each element of lst represents one matched line of the log. At the end, the function prints the total number of lines, the number of lines that could not be matched (error lines), and the number of lines matched correctly.

parse() function:
parse() receives the parameter line, matches the fields of a single line group by group, processes each one, and assigns the results to a dictionary once processing is complete. The client ua detection here only lists some common cases; for more precise matching, consult a user-agent reference table for common (PC/mobile) browsers and write more accurate matching rules, for example as a lookup table like the sketch below.
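
A minimal sketch of a table-driven alternative to the if/elif chain used below; the tokens simply mirror the ones in this post and are illustrative, not a complete ua ruleset:

# illustrative only: order matters, and real ua detection needs far more patterns
UA_TOKENS = [
    ("Windows NT", "windows"),
    ("iPad", "ipad"),
    ("Android", "android"),
    ("Macintosh", "mac"),
    ("iPhone", "iphone"),
]

def classify_ua(ua):
    for token, label in UA_TOKENS:
        if token in ua:
            return label
    return "other"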

import re
import datetime

obj = re.compile(
    r'(?P<ip>.*?)- - \[(?P<time>.*?)\] "(?P<request>.*?)" (?P<status>.*?) (?P<bytes>.*?) "(?P<referer>.*?)" "(?P<ua>.*?)"')


def load_log(path):
    lst = []
    error_lst = []
    i = 0
    with open(path, mode="r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            dic = parse(line)
            if dic:  # valid data goes into lst
                lst.append(dic)
            else:
                error_lst.append(line)  # dirty data goes into error_lst
            i += 1
    print(i)
    print(len(error_lst))
    print(len(lst))

def parse(line):
    # parse a single line of the nginx log
    dic = {}
    try:
        result = obj.match(line)
        # ip field
        ip = result.group("ip")
        if ip.strip() == '-' or ip.strip() == "":  # discard the line if no ip was matched
            return False
        dic['ip'] = ip.split(",")[0]  # if there are two ips, take the first one

        # status code field
        status = result.group("status")
        dic['status'] = status

        # time field
        time = result.group("time")  # 21/Dec/2019:21:45:31 +0800
        time = time.replace(" +0800", "")  # strip the +0800 timezone suffix
        t = datetime.datetime.strptime(time, "%d/%b/%Y:%H:%M:%S")  # parse into a datetime object
        dic['time'] = t

        # request field
        request = result.group(
            "request")  # GET /post/pou-xi-he-jie-jue-python-zhong-wang-luo-nian-bao-de-zheng-que-zi-shi/ HTTP/1.1
        a = request.split()[1].split("?")[0]  # a url often carries query parameters after '?'; keep only the bare url
        dic['request'] = a

        # user agent field
        ua = result.group("ua")
        if "Windows NT" in ua:
            u = "windows"
        elif "iPad" in ua:
            u = "ipad"
        elif "Android" in ua:
            u = "android"
        elif "Macintosh" in ua:
            u = "mac"
        elif "iPhone" in ua:
            u = "iphone"
        else:
            u = "other"
        dic['ua'] = u

        # referer field
        referer = result.group("referer")
        dic['referer'] = referer

        return dic

    except (AttributeError, IndexError, ValueError):  # unmatched or malformed lines count as dirty data
        return False


if __name__ == '__main__':
    load_log("nginx_access.log")

Run the code and look at the printed results; the console outputs:

9692
542
9150

These numbers represent, in turn, the total number of lines in the log file, the number of lines that failed to match, and the number of lines matched correctly.

2.3 Step three: analyzing the log

Use the pandas module to analyze the log.

analyse() function:
The filtered lst list is passed in as the parameter; the data is a list of dictionaries of the form [{ip: xxx, request: xxx, status: xxx, ua: xxx}].

df = pd.DataFrame(lst) converts the parsed list into a table-like structure. In the console output of df, each row of data has a sequence number added, and the first row corresponds to a header whose names are the dictionary keys obtained earlier:

                    ip status  ...       ua                  referer
0      95.143.192.110     200  ...      mac                        -
1      95.143.192.110     304  ...      mac                        -
2      95.143.192.110     304  ...      mac                        -
3      95.143.192.110     304  ...      mac  https://www.ssgeek.com/
4      203.208.60.122     200  ...  android                        -
...                ...    ...  ...      ...                      ...
9145      46.4.60.249     404  ...    other                        -
9146      46.4.60.249     404  ...    other                        -
9147      46.4.60.249     404  ...    other                        -
9148      46.4.60.249     404  ...    other                        -
9149  154.223.188.124     404  ...  windows                        -
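
The ... in the middle of the output is just pandas truncating the display; optionally, the display limit can be raised to show every column:

import pandas as pd

pd.set_option("display.max_columns", None)  # show all columns when printing a DataFrame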

pd.value_counts(df['ip']) takes out the ip column and counts how many times each ip occurs. In the result, the first column is the ip and the second column is the count, but pandas uses the ip values as the row index by default, so the whole data needs to be shifted right by one column: reset_index() defines a new integer index, giving something of the form:

                 index   ip
0      89.163.242.228   316
1     207.180.220.114   312
2         78.46.90.53   302
3        144.76.38.10   301
4        78.46.61.245   301
...                ...  ...
1080    203.208.60.85     1
1081      66.249.72.8     1
1082     141.8.132.13     1
1083    207.46.13.119     1
1084     203.208.60.7     1

Now there is a proper index, but the headers no longer correspond to the columns underneath them, so they need to be renamed: reset_index().rename(columns={"index": "ip", "ip": "count"}), giving something of the form:

                    ip  count
0      89.163.242.228     316
1     207.180.220.114     312
2         78.46.90.53     302
3        78.46.61.245     301
4        144.76.38.10     301
...                ...    ...
1080     47.103.17.71       1
1081    42.156.254.92       1
1082  220.243.136.156       1
1083   180.163.220.61       1
1084   106.14.215.243       1

Log analysis usually only needs the top few visits, such as the top 20. pandas makes this very easy with iloc: iloc[:20, :] takes out the first 20 rows and all columns. The final processing code:

    ip_count = pd.value_counts(df['ip']).reset_index().rename(columns={"index": "ip", "ip": "count"}).iloc[:20, :]
    print(ip_count)

This produces the data:

                  ip  count
0    89.163.242.228     316
1   207.180.220.114     312
2       78.46.90.53     302
3      144.76.38.10     301
4      78.46.61.245     301
5     144.76.29.148     301
6    204.12.208.154     301
7     148.251.92.39     301
8         5.9.70.72     286
9     223.71.139.28     218
10     95.216.19.59     209
11    221.13.12.147     131
12     117.15.90.21     130
13  175.184.166.181     129
14   148.251.49.107     128
15    171.37.204.72     127
16   124.95.168.140     118
17    171.34.178.76      98
18   60.216.138.190      97
19    141.8.142.158      87

The same chain of operations works for request, ua, and the other fields; for example:
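
These two lines, which also appear in the final code below, apply the identical value_counts / reset_index / rename / iloc chain to the request and ua columns:

    request_count = pd.value_counts(df['request']).reset_index().rename(columns={"index": "request", "request": "count"}).iloc[:20, :]
    ua_count = pd.value_counts(df['ua']).reset_index().rename(columns={"index": "ua", "ua": "count"}).iloc[:, :]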

2.4 Step four: generating the report

Use the xlwt module to write the data produced by the pandas analysis to an Excel sheet. Before writing, the data needs to be converted from pandas objects back to ordinary data:

    ip_count_values = ip_count.values
    request_count_values = request_count.values
    ua_count_values = ua_count.values

The type of this data is numpy.ndarray, an array of objects, like this (the trailing space after each ip comes from the regular expression, which captures the space before the first "-"; it does not affect the counts):

[['89.163.242.228 ' 316]
 ['207.180.220.114 ' 312]
 ['78.46.90.53 ' 302]
 ['204.12.208.154 ' 301]
 ['144.76.29.148 ' 301]
 ['144.76.38.10 ' 301]
 ['78.46.61.245 ' 301]
 ['148.251.92.39 ' 301]
 ['5.9.70.72 ' 286]
 ['223.71.139.28 ' 218]
 ['95.216.19.59 ' 209]
 ['221.13.12.147 ' 131]
 ['117.15.90.21 ' 130]
 ['175.184.166.181 ' 129]
 ['148.251.49.107 ' 128]
 ['171.37.204.72 ' 127]
 ['124.95.168.140 ' 118]
 ['171.34.178.76 ' 98]
 ['60.216.138.190 ' 97]
 ['141.8.142.158 ' 87]]

The data is written via the xlwt module into sheet pages, one sheet per processed dataset:

# write to excel
wb = xlwt.Workbook()  # create an excel workbook
sheet = wb.add_sheet("ip top20")  # add a new sheet page
# write the header row
row = 0
sheet.write(row, 0, "ip")  # write(row, column, content)
sheet.write(row, 1, "count")
row += 1  # advance to the next row
for item in ip_count_values:
    sheet.write(row, 0, item[0])
    sheet.write(row, 1, item[1])
    row += 1
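
Once every sheet has been filled in, the workbook is saved with wb.save("abc.xls"), as shown in the complete code below.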

2.5 Step five: scheduling the analysis

With the log analysis done, recall the requirement to pick up the log file and analyze it on a schedule. The time module plus a check of the current time gives a simple timer; for example, to run the analysis at 01:00 on the 3rd of every month:

import time
import datetime

if __name__ == '__main__':
    while 1:
        stime = datetime.datetime.now().strftime("%d:%H:%M:%S")  # day:hour:minute:second
        if stime == "03:01:00:00":  # 01:00:00 on the 3rd of the month
            lst, error_lst = load_log("nginx_access.log")
            analyse(lst)
        time.sleep(1)

Of course, the script can also be invoked through the server's own scheduled-task facilities, such as cron.
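
A sketch of the equivalent crontab entry; the interpreter path and script path are placeholders:

# run the analysis at 01:00 on the 3rd of every month
0 1 3 * * /usr/bin/python3 /path/to/nginx_log_analysis.py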

2.6 Results

Following the evolution described above, the final code is as follows:

import re
import datetime
import pandas as pd
import xlwt

obj = re.compile(
    r'(?P<ip>.*?)- - \[(?P<time>.*?)\] "(?P<request>.*?)" (?P<status>.*?) (?P<bytes>.*?) "(?P<referer>.*?)" "(?P<ua>.*?)"')


def load_log(path):
    lst = []
    error_lst = []
    i = 0
    with open(path, mode="r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            dic = parse(line)
            if dic:  # valid data goes into lst
                lst.append(dic)
            else:
                error_lst.append(line)  # dirty data goes into error_lst
            i += 1

    return lst, error_lst

def parse(line):
    # parse a single line of the nginx log
    dic = {}
    try:
        result = obj.match(line)
        # ip field
        ip = result.group("ip")
        if ip.strip() == '-' or ip.strip() == "":  # discard the line if no ip was matched
            return False
        dic['ip'] = ip.split(",")[0]  # if there are two ips, take the first one

        # status code field
        status = result.group("status")
        dic['status'] = status

        # time field
        time = result.group("time")  # 21/Dec/2019:21:45:31 +0800
        time = time.replace(" +0800", "")  # strip the +0800 timezone suffix
        t = datetime.datetime.strptime(time, "%d/%b/%Y:%H:%M:%S")  # parse into a datetime object
        dic['time'] = t

        # request field
        request = result.group(
            "request")  # GET /post/pou-xi-he-jie-jue-python-zhong-wang-luo-nian-bao-de-zheng-que-zi-shi/ HTTP/1.1
        a = request.split()[1].split("?")[0]  # a url often carries query parameters after '?'; keep only the bare url
        dic['request'] = a

        # user agent field
        ua = result.group("ua")
        if "Windows NT" in ua:
            u = "windows"
        elif "iPad" in ua:
            u = "ipad"
        elif "Android" in ua:
            u = "android"
        elif "Macintosh" in ua:
            u = "mac"
        elif "iPhone" in ua:
            u = "iphone"
        else:
            u = "other"
        dic['ua'] = u

        # referer field
        referer = result.group("referer")
        dic['referer'] = referer

        return dic

    except (AttributeError, IndexError, ValueError):  # unmatched or malformed lines count as dirty data
        return False


def analyse(lst):  # [{ip: xxx, request: xxx, status: xxx, ua: xxx}]
    df = pd.DataFrame(lst)  # convert the list of dicts into a DataFrame
    # print(df)
    # print(df['ip'])  # take out only the ip column
    ip_count = pd.value_counts(df['ip']).reset_index().rename(columns={"index": "ip", "ip": "count"}).iloc[:20, :]
    request_count = pd.value_counts(df['request']).reset_index().rename(columns={"index": "request", "request": "count"}).iloc[:20, :]
    ua_count = pd.value_counts(df['ua']).reset_index().rename(columns={"index": "ua", "ua": "count"}).iloc[:, :]

    # convert from pandas objects back to plain arrays
    ip_count_values = ip_count.values
    request_count_values = request_count.values
    ua_count_values = ua_count.values
    # print(type(ip_count_values))

    # write to excel
    wb = xlwt.Workbook()  # create an excel workbook
    sheet = wb.add_sheet("ip top20")  # add a new sheet page
    # write the header row
    row = 0
    sheet.write(row, 0, "ip")  # write(row, column, content)
    sheet.write(row, 1, "count")
    row += 1  # advance to the next row
    for item in ip_count_values:
        sheet.write(row, 0, item[0])
        sheet.write(row, 1, item[1])
        row += 1

    sheet = wb.add_sheet("request top20")  # add a new sheet page
    # write the header row
    row = 0
    sheet.write(row, 0, "request")
    sheet.write(row, 1, "count")
    row += 1
    for item in request_count_values:
        sheet.write(row, 0, item[0])
        sheet.write(row, 1, item[1])
        row += 1

    sheet = wb.add_sheet("ua top")  # add a new sheet page
    # write the header row
    row = 0
    sheet.write(row, 0, "ua")
    sheet.write(row, 1, "count")
    row += 1
    for item in ua_count_values:
        sheet.write(row, 0, item[0])
        sheet.write(row, 1, item[1])
        row += 1

    wb.save("abc.xls")  # save the workbook

if __name__ == '__main__':
    lst, error_lst = load_log("nginx_access.log")
    analyse(lst)

The resulting Excel report contains the following:

  • ip ranking

  • Requested address ranking

  • Client ua ranking

2.7 Directions for extension

This post makes only an initial pass at log analysis. It could be extended further in several directions, for example pushing the analysis report by scheduled email, or displaying the analysis report graphically.
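
As a taste of the first of these, here is a minimal sketch of emailing the generated report with Python's standard library; the SMTP server, credentials, and addresses are placeholders:

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.application import MIMEApplication
from email.mime.text import MIMEText

def send_report(path="abc.xls"):
    msg = MIMEMultipart()
    msg["Subject"] = "nginx access log report"
    msg["From"] = "sender@example.com"    # placeholder sender
    msg["To"] = "receiver@example.com"    # placeholder recipient
    msg.attach(MIMEText("See the attached Excel report.", "plain", "utf-8"))
    with open(path, "rb") as f:
        part = MIMEApplication(f.read(), Name="report.xls")
    part["Content-Disposition"] = 'attachment; filename="report.xls"'
    msg.attach(part)
    # placeholder SMTP server and credentials
    with smtplib.SMTP("smtp.example.com", 25) as server:
        server.login("sender@example.com", "password")
        server.send_message(msg)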

Origin: www.cnblogs.com/ssgeek/p/12119657.html