A Practical Tutorial on Automatically Monitoring Time-Consumption Anomalies in Python Functions

Contents

1. Collecting performance data and generating visual reports

        1. Saving performance data files (cProfile)

        2. Reading and viewing detailed performance data

2. Generating a function call stack diagram (gprof2dot)

3. Performance analysis and optimization

4. Automatic monitoring of time-consumption anomalies

        1. Scheme for normally distributed data

        2. Tukey box plot scheme

This article covers Python performance visualization and analysis, logic optimization, and dynamic calculation of safety thresholds with different models, in order to automatically monitor and warn about time-consumption anomalies both in individual functions and in the program's total running time.

When analyzing and optimizing Python performance, you can use cProfile to generate performance data files, obtain the detailed time distribution through pstats, and combine this with the gprof2dot script to generate function call stack diagrams for visual analysis, which makes performance analysis much more efficient.

Then, starting from the time distribution, analyze the implementation logic of the functions that account for the bulk of the cost and optimize them step by step. At the same time, save the per-function average times from pstats as sample data for the subsequent automatic anomaly monitoring.

To automate this monitoring, the safety threshold must be adjusted dynamically by an algorithm rather than fixed by hand, so that the anomaly monitoring can calibrate itself in a self-sustaining loop.

1. Collecting performance data and generating visual reports

1. Saving performance data files (cProfile)

The first step is saving the performance data file. cProfile and profile provide deterministic profiling of Python programs. A profile is a set of statistics describing how often and for how long various parts of a program execute. Call enable at the start of the program to begin collecting performance data, and call dump_stats at the end to stop collection and save the data to a file at the specified path.

import cProfile

# Turn on data collection when the program starts
pr = cProfile.Profile()
pr.enable()

# ... run the program under test ...

# At the end of the run, dump the performance data to a file;
# profilePath is the absolute path of the output file, e.g. "/tmp/program.stats"
profilePath = "/tmp/program.stats"
pr.dump_stats(profilePath)
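
For reference, here is a minimal self-contained sketch of the collection step; sampleWorkload and the /tmp output path are placeholders standing in for your own program:

import cProfile
import time

def sampleWorkload():
    # Placeholder workload: substitute the entry point of your own program
    total = 0
    for i in range(100000):
        total += i
    time.sleep(0.1)
    return total

pr = cProfile.Profile()
pr.enable()
sampleWorkload()
pr.disable()
pr.dump_stats("/tmp/sample.stats")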

2. Reading and viewing detailed performance data

After saving the performance data to a file, you can read it back with pstats, which formats the profile statistics into a report.

import pstats

# Read the performance data
pS = pstats.Stats(profilePath)
# Sort by the function's own (internal) time
pS.sort_stats('tottime')
# Print information for every recorded function
pS.print_stats()

Example output of print_stats():
79837 function calls (75565 primitive calls) in 37.311 seconds
Ordered by: internal time
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
 2050    30.167    0.015   30.167    0.015  {time.sleep}
   16     6.579    0.411    6.579    0.411  {select.select}
    1     0.142    0.142    0.142    0.142  {method 'do_handshake' of '_ssl._SSLSocket' objects}
  434     0.074    0.000    0.074    0.000  {method 'read' of '_ssl._SSLSocket' objects}
    1     0.062    0.062    0.062    0.062  {method 'connect' of '_socket.socket' objects}
   37     0.046    0.001    0.046    0.001  {posix.read}
   14     0.024    0.002    0.024    0.002  {posix.fork}

Output field descriptions:

  • ncalls: the number of times the function was called (a single number means no recursion; when two numbers are separated by a slash, the second is the number of non-recursive calls)
  • tottime: the total running time of the function, excluding time spent in sub-function calls
  • percall: the average time per call, equal to tottime/ncalls
  • cumtime: the total running time of the function, including time spent in sub-function calls
  • percall: the average time per call, equal to cumtime/ncalls
  • filename:lineno(function): the file, line number, and name of the function (or the underlying framework function/class)

If you want to access each field of print_stats() programmatically, you can use the following approach:

# func    -- filename:lineno(function)
# cc      -- call count, the number of primitive calls
# nc      -- ncalls
# tt      -- tottime
# ct      -- cumtime
# callers -- dict of calling functions; each entry maps func to a (cc, nc, tt, ct) tuple
for func in pS.fcn_list:
    cc, nc, tt, ct, callers = pS.stats[func]
    print(cc, nc, tt, ct, func, callers)
    for caller, (cc, nc, tt, ct) in callers.items():
        print(caller, cc, nc, tt, ct)

2. Generating a function call stack diagram (gprof2dot)

The gprof2dot script converts output from profilers such as gprof, callgrind, or Python's pstats into a directed call-graph description in the DOT language, which Graphviz then renders as an image. This gives an intuitive view of the whole program's call stack, including each function's class and line number, its share of the total time, its recursion count, and its call count.

First download the gprof2dot.py script from GitHub and put it in the same directory as the script of the program being profiled. Using this script also requires Graphviz, which on macOS can be installed with the brew command (if the installation fails, follow the error messages to install the required tools):

brew install graphviz

The call stack diagram can then be generated with logic like the following, adapted to your own needs.

import os
# Path of the gprof2dot.py script, assumed to sit next to this file
gprof2dotPath = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'gprof2dot.py')
# Output path for the call stack diagram; here a PNG image is generated
dumpProfPath = profilePath.replace("stats", "png")
dumpCmd = "python %s -f pstats %s | dot -Tpng -o %s" % (gprof2dotPath, profilePath, dumpProfPath)
os.popen(dumpCmd)
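
Note that os.popen does not report failures. As a sketch, an equivalent using subprocess waits for the pipeline to finish and raises on error (assuming the same gprof2dotPath, profilePath, and dumpProfPath as above):

import subprocess

# Run gprof2dot and feed its DOT output into Graphviz's dot tool;
# check=True raises CalledProcessError if either step fails
gprof2dot = subprocess.run(
    ["python", gprof2dotPath, "-f", "pstats", profilePath],
    capture_output=True, check=True)
subprocess.run(
    ["dot", "-Tpng", "-o", dumpProfPath],
    input=gprof2dot.stdout, check=True)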

3. Performance analysis and optimization

After generating the call stack diagram, you can easily see the call relationships between functions. Each box shows the function's class and line number, its time percentage, and its call count. If the function recurses, a circular arrow on the edge of the box indicates the recursion count.

Find the time-consuming parts in the diagram, analyze the implementation logic of the specific functions, and locate the actual causes. Optimization strategies include:

  • Remove redundant logic: delete code that does unnecessary work.
  • Optimize recursive functions: add logs to print the arguments of each recursive call; if many argument combinations repeat, add a cache to avoid redundant recursion (see the sketch after this list).
  • Merge common calls: if a function calls the same sub-function several times to obtain parameters, check whether those calls can be consolidated to avoid redundant invocations.
  • Decide from the context whether test-program initialization is really necessary, and skip resetting the test environment when it is not.
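
As an illustration of the caching idea for recursive functions, here is a minimal sketch using functools.lru_cache; the fibonacci function is only a stand-in for whatever recursion shows up in your profile:

from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci(n):
    # Repeated arguments are served from the cache instead of recursing again
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)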

4. Automatic monitoring of time-consumption anomalies

Configuring the safety threshold as a historical average plus a fixed floating percentage is problematic, because function execution time is strongly affected by the device and the operating environment. A manually fixed percentage is hard to maintain, the data cannot calibrate itself iteratively, and it is impractical to re-tune the parameters by hand every time a false alarm occurs.

Monitoring covers two dimensions: the average execution time of each function, and the total execution time of the complete program. In the early stage, this historical timing data is stored in a database to provide sample data for the automated anomaly monitoring that follows.
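
As a minimal sketch of such a store, here is one possible approach assuming a plain JSON file in place of a real database; HISTORY_PATH and appendSample are illustrative names, not part of the original implementation:

import json
import os

HISTORY_PATH = "/tmp/percall_history.json"  # placeholder path for the sample store

def appendSample(funcKey, percall):
    # funcKey identifies the function, e.g. "module.py:42(myFunc)";
    # percall is the average time of one call in this run
    history = {}
    if os.path.exists(HISTORY_PATH):
        with open(HISTORY_PATH) as f:
            history = json.load(f)
    history.setdefault(funcKey, []).append(percall)
    with open(HISTORY_PATH, "w") as f:
        json.dump(history, f)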

Automatic threshold adjustment requires a conventional model algorithm. Here, only single-dimensional anomaly monitoring of execution time is implemented.

By principle, unsupervised anomaly detection models generally fall into the following categories:

  • Statistical and probabilistic models: make assumptions about the data distribution and find the "anomalies" defined under those assumptions;
  • Linear models: find a suitable low-dimensional subspace through a linear method so that abnormal points stand apart from normal points;
  • Distance-based: assume that abnormal points lie far from normal points and distinguish them by comparing distances between data points;
  • Density-based: when the data is unevenly distributed and absolute distance cannot measure the relative separation of points, use local density to characterize how anomalous a point is;
  • Clustering-based: cluster the data points; points that belong to no cluster, are far from the nearest cluster, or sit in sparse clusters are treated as outliers;
  • Tree-based: build a tree model that partitions subspaces and find outliers through it.

Execution time is fluctuating one-dimensional data, so we directly use the statistical-and-probabilistic approach: based on the saved historical data, test whether the data follows a normal distribution.

If it does, the safety threshold is computed as μ + 3σ (mean plus three times the standard deviation);

If it does not, the Tukey box plot scheme Q3 + 1.5 * IQR is used to compute the safety threshold.

In actual testing, a function's timing curve that fits a normal distribution early on may stop fitting it as the sample data grows.

The Python code for testing normality is as follows; the data is treated as normally distributed when the p-value is greater than 0.05, and dataList is the array of timing samples:

from scipy import stats
import numpy

# dataList is the array of per-call times collected earlier
percallMean = numpy.mean(dataList)   # mean
# percallVar = numpy.var(dataList)   # variance (unused here)
percallStd = numpy.std(dataList)     # standard deviation
kstestResult = stats.kstest(dataList, 'norm', (percallMean, percallStd))
# A p-value greater than 0.05 is treated as a normal distribution
if kstestResult[1] > 0.05:
    pass

1. Scheme for normally distributed data

In statistics, if a data distribution is approximately normal, then about 68% of the values lie within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations. Any data point more than three standard deviations from the mean is therefore very likely an outlier. That is, the upper safety threshold for normally distributed data is: percallMean + 3 * percallStd
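
Continuing with percallMean and percallStd from the normality test above, the threshold check itself is only a few lines; newPercall is a placeholder for the latest measurement:

# Upper safety threshold under the 3-sigma rule
threshold = percallMean + 3 * percallStd
newPercall = 0.012  # placeholder: the latest measured per-call time
if newPercall > threshold:
    print("time-consumption anomaly: %.5f > %.5f" % (newPercall, threshold))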

2. Tukey box plot scheme

Anomaly detection based on the 3σ rule or the Z-score assumes the data follows a normal distribution, but real data often does not strictly follow one, which limits the effectiveness of those methods on non-normal data. The Tukey box plot is a common way to characterize the distribution of raw data and can also be used to identify outliers; because it relies mainly on the actual data, it has advantages of its own.

The box plot gives us a criterion for identifying outliers: an outlier is a value less than Q1 - 1.5 * IQR or greater than Q3 + 1.5 * IQR. While this criterion is somewhat arbitrary, it comes from empirical judgment and has proven to work well in flagging data that deserves attention.

The Python implementation logic for calculating the box plot safety threshold is as follows:

import numpy
boxplotQ1 = numpy.percentile(dataList, 25)   # first quartile
boxplotQ3 = numpy.percentile(dataList, 75)   # third quartile
boxplotIQR = boxplotQ3 - boxplotQ1           # interquartile range
upperLimit = boxplotQ3 + 1.5 * boxplotIQR    # upper safety threshold

In the program, after a run or test case finishes, the historical data is first tested for normality. For accuracy the history should contain at least 20 samples; with fewer than 20, keep collecting data and skip the anomaly judgment. Then compute the safety threshold with the normal distribution model or the box plot model, check whether the current average time of each function, or the total time of the test case, exceeds the threshold, and raise a warning if it does.
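
A minimal sketch of that decision flow, assuming dataList holds the historical samples for one function and currentPercall is the newest measurement:

from scipy import stats
import numpy

def checkAnomaly(dataList, currentPercall):
    # With fewer than 20 historical samples, keep collecting and skip the check
    if len(dataList) < 20:
        return False
    mean = numpy.mean(dataList)
    std = numpy.std(dataList)
    # Kolmogorov-Smirnov test: p-value > 0.05 is treated as normal
    if stats.kstest(dataList, 'norm', (mean, std))[1] > 0.05:
        threshold = mean + 3 * std
    else:
        q1 = numpy.percentile(dataList, 25)
        q3 = numpy.percentile(dataList, 75)
        threshold = q3 + 1.5 * (q3 - q1)
    # Warn when the current measurement exceeds the dynamic threshold
    return currentPercall > threshold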

Comparison of the threshold ranges of the Gaussian model and the box plot model

The following code aggregates the data from the stats files and, depending on which model applies to each function, plots either its time-consumption curve with the threshold or its normal curve with the threshold. Replace the statFolder parameter with the folder containing your stats files.

# coding=utf-8
import os
import pstats
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import traceback
from scipy import stats
import numpy

"""
Aggregate the average per-call time of each function
"""
def dataSummary(array, fileName, fcn, percall):
    (funcPath, line, func) = fcn
    exists = False
    for item in array:
        if item["func"] == func and item["funcPath"] == funcPath and item["line"] == line:
            exists = True
            item["cost"].append({
                "percall": percall,
                "fileName": fileName
            })
    if not exists:
        array.append({
            "func": func,
            "funcPath": funcPath,
            "line": line,
            "cost": [{
                "percall": percall,
                "fileName": fileName
            }]
        })

"""
Compute the Y value of the Gaussian function
"""
def gaussian(x, mu, delta):
    exp = numpy.exp(- numpy.power(x - mu, 2) / (2 * numpy.power(delta, 2)))
    c = 1 / (delta * numpy.sqrt(2 * numpy.pi))
    return c * exp

"""
Read and aggregate the data of all stats files
"""
def readStatsFile(statFolder, filterData):
    for path, dir_list, file_list in os.walk(statFolder):
        for fileName in file_list:
            if fileName.find(".stats") > 0:
                fileAbsolutePath = os.path.join(path, fileName)
                pS = pstats.Stats(fileAbsolutePath)
                # Sort the timing data in descending order first
                pS.sort_stats('cumtime')
                # pS.print_stats()
                # Collect the top 100 entries (or fewer if the file is small)
                for index in range(min(100, len(pS.fcn_list))):
                    fcn = pS.fcn_list[index]
                    (funcPath, line, func) = fcn
                    # cc -- call count
                    # nc -- ncalls (a single number means no recursion; with a slash,
                    #       the second number is the count of non-recursive calls)
                    # tt -- tottime, total running time excluding sub-function calls
                    # ct -- cumtime, total running time including sub-function calls
                    cc, nc, tt, ct, callers = pS.stats[fcn]
                    percall = ct / nc
                    # Only keep functions whose single call takes at least 1 millisecond
                    if percall >= 0.001:
                        dataSummary(filterData, fileName, fcn, percall)

"""
Plot the Gaussian curves and the safety threshold
"""
def drawGaussian(func, line, percallMean, threshold, percallList, dumpFolder):
    plt.figure(figsize=(10, 8))
    plt.title(func)
    for delta in [0.2, 0.5, 1]:
        gaussY = []
        gaussX = []
        # Sort the samples so the curve is drawn left to right;
        # subtract the mean so the curves are centered at zero (presentation only)
        for item in sorted(percallList):
            gaussX.append(item - percallMean)
            gaussY.append(gaussian(item - percallMean, 0, delta))
        plt.plot(gaussX, gaussY, label='sigma={}'.format(delta))
    # Draw the threshold water line
    plt.plot([threshold - percallMean, threshold - percallMean], [0, 5 * gaussian(percallMean, 0, 1)], color='red',
             linestyle="-", label="Threshold:" + str("%.5f" % threshold))
    plt.xlabel("Time(s)", fontsize=12)
    plt.legend()
    plt.tight_layout()
    # Different classes may contain functions with the same name; include the
    # line number in the file name to avoid overwriting
    imagePath = dumpFolder + "cost_%s_%s.png" % (func, str(line))
    plt.savefig(imagePath)
    # Release the figure to avoid accumulating memory across iterations
    plt.close()

"""
Plot the time-consumption curve and the safety threshold
"""
def drawCurve(func, line, percallList, dumpFolder):
    boxplotQ1 = numpy.percentile(percallList, 25)
    boxplotQ3 = numpy.percentile(percallList, 75)
    boxplotIQR = boxplotQ3 - boxplotQ1
    upperLimit = boxplotQ3 + 1.5 * boxplotIQR
    # The data is not normally distributed, so plot the fluctuation curve
    timeArray = [i for i in range(len(percallList))]
    plt.figure(figsize=(10, 8))
    plt.title(func)
    # Draw the threshold water line
    plt.plot([0, len(percallList)], [upperLimit, upperLimit], color='red', linestyle="-",
             label="Threshold:" + str("%.5f" % upperLimit))
    plt.plot(timeArray, percallList, label=func + "_" + str(line))
    plt.ylabel("Time(s)", fontsize=12)
    plt.legend()
    plt.tight_layout()
    imagePath = dumpFolder + "cost_%s_%s.png" % (func, str(line))
    plt.savefig(imagePath)
    plt.close()

if __name__ == "__main__":
    try:
        statFolder = "/Users/chenwenguan/Downloads/2aab7e17-a1b6-1253/"
        chartFolder = statFolder + "chart/"
        if not os.path.exists(chartFolder):
            os.mkdir(chartFolder)
        filterData = []
        readStatsFile(statFolder, filterData)
        for dataItem in filterData:
            percallList = [x["percall"] for x in dataItem["cost"]]
            func = dataItem["func"]
            line = dataItem["line"]
            # Only plot when there are more than 20 samples
            if len(percallList) > 20:
                percallMean = numpy.mean(percallList)  # mean
                # percallVar = numpy.var(percallList)  # variance
                percallStd = numpy.std(percallList)    # standard deviation
                # A p-value greater than 0.05 is treated as a normal distribution
                kstestResult = stats.kstest(percallList, 'norm', (percallMean, percallStd))
                print("percallStd:%s, pvalue:%s" % (percallStd, kstestResult[1]))
                # Plot the distribution curve when the data is normally distributed
                if kstestResult[1] > 0.05:
                    threshold = percallMean + 3 * percallStd
                    drawGaussian(func, line, percallMean, threshold, percallList, chartFolder)
                else:
                    drawCurve(func, line, percallList, chartFolder)
            else:
                pass
    except Exception:
        print('exception:' + traceback.format_exc())

The rendered curves of the two timing models look as follows:

Figure: example of a function's Gaussian time-consumption curve and threshold

Source: blog.csdn.net/weixin_39842528/article/details/131583923