Drawing box plots in Python and extracting their feature values

Reference:
https://blog.csdn.net/qq_41080850/article/details/83829045
https://blog.csdn.net/aijiudu/article/details/89387328
Python draws box plot boxplot(): https://blog.csdn.net/weixin_44052055/article/details/121442449

1. Introduction

A box plot, also known as a box-and-whisker plot, is a statistical chart that shows how a set of data is spread out. It is named for its box-like shape.

The main advantage of the box plot is that the quartiles it is built on are barely affected by outliers, so it describes the spread of the data in a relatively stable way.

Five-number summary: a data set is summarized by five numbers: the minimum, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the maximum. A box plot draws essentially the same five numbers, except that by default its whiskers stop at the most extreme points within 1.5 times the interquartile range of the box, and anything beyond that is drawn as an individual outlier.
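As a minimal sketch, the five-number summary can be computed with NumPy. The sample values are the ones plotted later in this post; the variable names are only illustrative.

```python
import numpy as np

# Sample data (the same values used in the plotting examples of this post).
data = [1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100]

# np.percentile interpolates linearly between neighbouring samples by default.
q1, q2, q3 = np.percentile(data, [25, 50, 75])

five_numbers = {
    "min": float(np.min(data)),
    "Q1": float(q1),
    "median": float(q2),
    "Q3": float(q3),
    "max": float(np.max(data)),
}
print(five_numbers)
```

For these ten values the quartiles fall between samples, which is why Q1 and Q3 are not values from the list itself.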

Features of box plots:

  • The box plot judges outliers from the quartiles and the interquartile range. Quartiles are resistant: up to 25% of the data can be arbitrarily far away without greatly disturbing them. Outliers therefore do not distort the shape of the box, and the outliers identified by a box plot are relatively objective. This is the box plot's advantage in identifying outliers.
  • For a sample from a standard normal distribution, only a very small fraction of values are flagged as outliers. The more outliers there are, the heavier the tails (in the sense of a t-distribution, the smaller the degrees of freedom). Skewness shows the direction of the asymmetry: if the outliers are concentrated on the side of smaller values, the distribution is left skewed; if they are concentrated on the side of larger values, it is right skewed.
  • When the box plots of several batches of data are drawn side by side on the same axis, shape information such as the median, tail length, outliers, and spread of each batch is easy to see. Box plots are therefore useful for comparing the shapes of several batches of data.
  • Limitations: a box plot cannot precisely measure the skewness or tail weight of a distribution; for large batches of data the information it conveys becomes coarser; and using the median to represent the overall level has its own limitations.
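The outlier rule and the robustness claim above can be checked numerically. The following is a small sketch, assuming the usual 1.5 × IQR fences; the extra value 5000 is made up for illustration.

```python
import numpy as np

base = [1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100]
with_outlier = base + [5000]  # hypothetical extreme value added for illustration

q1, q3 = np.percentile(with_outlier, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are what a box plot draws as individual outliers.
outliers = [x for x in with_outlier if x < lower_fence or x > upper_fence]

# Adding the extreme value shifts the median only slightly, but the mean a lot.
median_shift = float(np.median(with_outlier) - np.median(base))
mean_shift = float(np.mean(with_outlier) - np.mean(base))

print(outliers, median_shift, round(mean_shift, 1))
```

Only the added value falls outside the fences, and the median moves by far less than the mean does.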

2. Drawing and Extraction

Python offers several ways to draw box plots, which is very convenient. The available methods are:

  • Method 1: Use Series.plot(), DataFrame.plot(), or DataFrame.boxplot() from the pandas package;
  • Method 2: Use boxplot() or catplot() from the seaborn package, where seaborn.boxplot() corresponds to seaborn.catplot() with kind='box';
  • Method 3: Use the boxplot() method of an axes object from the matplotlib package.

For details on each of these methods, see https://blog.csdn.net/qq_41080850/article/details/83829045 . Here I only show the two that I find simplest.

1. Graphic drawing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
 
data = [1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100]
 
df = pd.DataFrame(data)
df.plot.box(title="Box figure")
plt.show()


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
 
data = [1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100]

fig,ax=plt.subplots(figsize=(6, 6), dpi=100, facecolor='w')
ax.boxplot(data,sym='rd',positions=[2])
plt.show()

I prefer the latter, so let's look at the parameters of this function in detail.

2. Use of ax.boxplot function

(Someone has already documented this thoroughly, so I reuse their work rather than reinventing the wheel.)

https://blog.csdn.net/weixin_44052055/article/details/121442449

The above blog has a very detailed introduction to the parameters of this function. The commonly used parameters are listed below:

Parameter      Meaning
x              The data to plot: a single data set or a sequence of data sets.
notch          Whether to draw a notched box; default False.
sym            Marker for outlier points; a blue + by default according to the documentation (the actual look depends on the active style).
vert           Whether to draw the boxes vertically; default True.
whis           How far the whiskers may extend beyond Q1 and Q3; default 1.5 times the interquartile range.
positions      Positions of the boxes; default range(1, N+1), where N is the number of boxes.
widths         Width of each box; default 0.5.
patch_artist   Whether to fill the box with color; default False.
meanline       Whether to draw the mean as a line instead of a point; default False (point).
showmeans      Whether to show the mean; default False.
showcaps       Whether to show the caps at the ends of the whiskers; default True.
showbox        Whether to show the box itself; default True.
showfliers     Whether to show outliers; default True.
boxprops       Properties of the box, such as border color and fill color.
labels         Labels for the boxes (drawn as tick labels).
flierprops     Properties of the outliers, such as marker shape, size, and fill color.
medianprops    Properties of the median line, such as line style and width.
meanprops      Properties of the mean, such as marker size and color.
capprops       Properties of the caps, such as color and width.
whiskerprops   Properties of the whiskers, such as color, width, and line style.
manage_ticks   Whether to adjust tick locations and labels to match the box positions; default True.
autorange      Whether to extend the whiskers to the data minimum and maximum when Q1 and Q3 coincide; default False.
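To make the whis parameter concrete, here is a small sketch that draws the same data twice and reads the whisker ends back from the returned artists. The value 5000 is a made-up outlier, upper_cap is a hypothetical helper, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

data = [1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100, 5000]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(6, 4), sharey=True)
default = ax1.boxplot(data, whis=1.5)      # whiskers limited to 1.5 x IQR
full = ax2.boxplot(data, whis=(0, 100))    # whiskers at the 0th and 100th percentiles
ax1.set_title("whis=1.5")
ax2.set_title("whis=(0, 100)")

def upper_cap(result):
    """Return the y value of the upper cap of the first box."""
    # For each box the caps are stored as [lower, upper].
    return float(result["caps"][1].get_ydata()[0])

print(upper_cap(default), len(default["fliers"][0].get_ydata()))
print(upper_cap(full), len(full["fliers"][0].get_ydata()))
```

With the default setting the extreme value is drawn as an outlier; with percentile whiskers it becomes the end of the upper whisker and no outliers remain.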

ax.boxplot can also draw several box plots in one call. The following are the settings I commonly use:

# Assumes ax is an existing Axes and datagroup holds 9 data sets (one per box).
pos1 = list(range(1, 10))   # x positions of the 9 boxes
wid1 = 2                    # line width for the median and the box
wid2 = 1                    # line width for the caps and the whiskers
tmp = ax.boxplot(
    datagroup, sym='rd', showmeans=True, meanline=False,
    positions=pos1,
    medianprops={'color': 'blue', 'linewidth': wid1},   # median line
    flierprops={'marker': 'o', 'markerfacecolor': 'red', 'markersize': 2},  # outliers
    # meanprops={'color': 'blue', 'ls': '--', 'linewidth': wid1},  # for meanline=True
    meanprops={'marker': '^', 'markerfacecolor': 'green', 'markersize': 12},  # mean marker
    capprops={'linewidth': wid2},      # caps
    boxprops={'linewidth': wid1},      # box
    whiskerprops={'linewidth': wid2},  # whiskers
)
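The settings above assume that ax and datagroup already exist. A self-contained sketch with synthetic data might look like the following; the random data and the Agg backend are my own assumptions, chosen so the snippet runs anywhere.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical input: 9 groups of 50 normally distributed values each.
datagroup = [rng.normal(loc=10 + k, scale=2.0, size=50) for k in range(9)]

fig, ax = plt.subplots(figsize=(8, 4))
tmp = ax.boxplot(
    datagroup,
    positions=list(range(1, 10)),
    showmeans=True,
    medianprops={"color": "blue", "linewidth": 2},
    flierprops={"marker": "o", "markerfacecolor": "red", "markersize": 2},
    meanprops={"marker": "^", "markerfacecolor": "green", "markersize": 12},
)

# The return value is a dict of lists of artists, one list per kind of element.
for key in sorted(tmp):
    print(key, len(tmp[key]))
```

Each box contributes one entry to boxes, medians, means, and fliers, and two entries to whiskers and caps, which matters when indexing into the return value later.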

3. Feature value extraction

Drawing is simple, but the plot alone does not tell me the exact feature values (mean, median, and so on) of each box. How can I get them? Searching first, I found that someone had already written a function for this:

https://www.cnblogs.com/wangxiaobei2019/p/11719453.html

import numpy as np

def BoxFeature(input_list):
    """
    Get the feature values of a box plot.

    > @param[in] input_list:    the data series
    return:
    < @param[out] out_list:     the feature values
    < @param[out_note]:         [ave,min,Q1,Q2,Q3,max,error_number]
    """
    # Linear interpolation is np.percentile's default, so no keyword is needed.
    percentile = np.percentile(input_list, (25, 50, 75))
    Q1 = percentile[0]  # lower quartile
    Q2 = percentile[1]  # median
    Q3 = percentile[2]  # upper quartile
    IQR = Q3 - Q1       # interquartile range
    ulim = Q3 + 1.5*IQR # upper limit
    llim = Q1 - 1.5*IQR # lower limit
    # llim = 0 if llim < 0 else llim
    # out_list = [llim,Q1,Q2,Q3,ulim]
    # ------- count the number of outliers ----------
    right_list = []     # non-outlier data
    Error_Point_num = 0
    value_total = 0
    average_num = 0
    for item in input_list:
        if item < llim or item > ulim:
            Error_Point_num += 1
        else:
            right_list.append(item)
            value_total += item
            average_num += 1
    # Note: the mean is taken over the non-outlier points only.
    average_value =  value_total/average_num
    out_list = [average_value,min(right_list), Q1, Q2, Q3, max(right_list), Error_Point_num]
    return out_list

Later, I studied the return value of ax.boxplot in detail and found that I can get what I want directly from it:

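i = 0  # index of the box to inspect (0 is the first box)
ave = tmp['means'][i].get_ydata()[0]            # mean marker (needs showmeans=True, meanline=False)
med = tmp['medians'][i].get_ydata()[0]          # median line
min_value = tmp['caps'][2*i].get_ydata()[0]     # lower cap (end of the lower whisker)
Q1 = tmp['boxes'][i].get_ydata()[0]             # needs patch_artist=False and notch=False
Q3 = tmp['boxes'][i].get_ydata()[3]
max_value = tmp['caps'][2*i+1].get_ydata()[0]   # upper cap (end of the upper whisker)
error_num = len(tmp['fliers'][i].get_ydata())   # number of outliers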

Note that the two approaches give slightly different means: BoxFeature averages only the non-outlier points, while the mean drawn by ax.boxplot is computed over all points. The difference is usually small. At this point, you're done!
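If only the numbers are needed, Matplotlib also exposes the computation behind the plot as matplotlib.cbook.boxplot_stats, so nothing has to be drawn. The following is a sketch rather than the method used above; the value 5000 is a made-up outlier, and the summary layout simply mirrors the output of BoxFeature.

```python
import numpy as np
from matplotlib import cbook

data = [1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100, 5000]

# boxplot_stats returns one dict per data set; a flat list counts as one set.
stats = cbook.boxplot_stats(data, whis=1.5)[0]

summary = {
    "mean": float(stats["mean"]),      # computed over all points, outliers included
    "min": float(stats["whislo"]),     # end of the lower whisker
    "Q1": float(stats["q1"]),
    "median": float(stats["med"]),
    "Q3": float(stats["q3"]),
    "max": float(stats["whishi"]),     # end of the upper whisker
    "n_outliers": len(stats["fliers"]),
}

# Mean over the non-outlier points only, which is what BoxFeature reports.
inliers = [x for x in data if stats["whislo"] <= x <= stats["whishi"]]
mean_without_outliers = float(np.mean(inliers))

print(summary)
print(mean_without_outliers)
```

Comparing summary["mean"] with mean_without_outliers shows where the two approaches diverge: only the mean depends on whether outliers are included.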

3. Multi-box plot drawing

1. One type

This is simply many boxes on one axis: all the boxes are of the same type, but there are many groups.

2. Various types

Here several types of boxes are drawn, each with many groups. There are two ways to arrange them: put the boxes of the same group together and let the legend identify the type, or put the boxes of the same type together and let the legend identify the group. Both arrangements are controlled through the positions parameter.

The example below puts boxes of the same type together, uses different colors for the different groups, and lists the groups in the legend.

import matplotlib.pyplot as plt

def multiBox():
    # data: the accuracy (acc) values for the three boxes
    data = [
    [0.8676,0.8484,0.8293,0.8917,0.9151,0.9470,0.8935,0.8078,0.9081,0.8555,0.8897,0.9062,0.9190,0.8964,0.8520,0.8697,0.8738],
        [0.8512,0.8026,0.7911,0.8787,0.9131,0.9532,0.8656,0.8159,0.9187,0.8421,0.8758,0.9096,0.9128,0.8951,0.8748,0.8537,0.8750],
        [0.9161,0.9047,0.8635,0.9026,0.9328,0.9490,0.8911,0.8669,0.9227,0.8683,0.9114,0.9372,0.9475,0.9053,0.8839,0.9364,0.9032]]
    # data2: the F1 score values for the three boxes
    data2=[
    [0.9291,0.9180,0.9067,0.9427,0.9557,0.9728,0.9438,0.8937,0.9518,0.9221,0.9416,0.9508,0.9578,0.9454,0.9201,0.9303,0.9327],
           [0.9196,0.8905,0.8834,0.9354,0.9546,0.9760,0.9279,0.8986,0.9576,0.9143,0.9338,0.9527,0.9544,0.9447,0.9332,0.9211,0.9333],
           [0.9562,0.9500,0.9267,0.9488,0.9652,0.9738,0.9424,0.9287,0.9598,0.9295,0.9536,0.9676,0.9731,0.9503,0.9384,0.9672,0.9491]]
    # data3: the IoU values for the three boxes
    data3 = [
    [0.8733,0.8624,0.8673,0.8815,0.9363,0.9433,0.9163,0.8350,0.9094,0.8878,0.8956,0.9050,0.9238,0.9077,0.8686,0.8747,0.8877],
             [0.8563,0.8368,0.8618,0.8743,0.9406,0.9479,0.8866,0.8473,0.9195,0.8679,0.8922,0.9091,0.9225,0.9111,0.8857,0.8629,0.8910],
             [0.9172,0.9091,0.8864,0.9029,0.9503,0.9530,0.9200,0.8857,0.9211,0.9033,0.9201,0.9391,0.9430,0.9227,0.9056,0.9360,0.9145]]
    # names of the boxes
    labels = ["A", "B", "C"]
    # colors of the three boxes as RGB tuples (each component in 0~1)
    colors = [(202/255.,96/255.,17/255.), (255/255.,217/255.,102/255.), (137/255.,128/255.,68/255.)]
    # draw the box plots
    # patch_artist=True --> boxes can be filled with color; positions=(1,1.4,1.8) --> the three boxes of one cluster are 0.4 apart; widths=0.3 --> each box is 0.3 wide
    bplot = plt.boxplot(data, patch_artist=True,labels=labels,positions=(1,1.4,1.8),widths=0.3) 
    # color each of the three boxes
    for patch, color in zip(bplot['boxes'], colors):
        patch.set_facecolor(color)

    bplot2 = plt.boxplot(data2, patch_artist=True, labels=labels,positions=(2.5,2.9,3.3),widths=0.3) 
    
    for patch, color in zip(bplot2['boxes'], colors):
        patch.set_facecolor(color)

    bplot3 = plt.boxplot(data3, patch_artist=True, labels=labels,positions=(4,4.4,4.8),widths=0.3)  
    
    for patch, color in zip(bplot3['boxes'], colors):
        patch.set_facecolor(color)

    x_position=[1,2.5,4]
    x_position_fmt=["acc","F1 score","IoU"]
    plt.xticks([i + 0.8 / 2 for i in x_position], x_position_fmt)

    plt.ylabel('percent (%)')
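    plt.grid(linestyle="--", alpha=0.3)  # dashed grid lines with opacity 0.3
    plt.legend(bplot['boxes'],labels,loc='lower right')  # legend in the lower-right corner
    plt.gcf().set_size_inches(10, 10)  # savefig() has no figsize argument; resize the figure instead
    plt.savefig("pic.png")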
    plt.show()

[Figure: the resulting plot, with boxes A, B, and C drawn side by side for each of acc, F1 score, and IoU.]
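The positions passed to plt.boxplot in multiBox follow a regular pattern: each cluster starts 1.5 after the previous one, and boxes within a cluster are 0.4 apart. As a sketch, a small helper can generate them for any number of clusters. Note that grouped_positions and its parameter names are hypothetical; they are not part of Matplotlib.

```python
def grouped_positions(n_clusters, n_per_cluster, start=1.0, step=0.4, cluster_gap=1.5):
    """Return (positions, centers) for clusters of side-by-side boxes.

    positions[c] lists the x positions of the boxes in cluster c;
    centers[c] is the middle of that cluster, useful as its tick location.
    """
    positions = []
    centers = []
    for c in range(n_clusters):
        left = start + c * cluster_gap
        cluster = [left + k * step for k in range(n_per_cluster)]
        positions.append(cluster)
        centers.append(sum(cluster) / n_per_cluster)
    return positions, centers


positions, centers = grouped_positions(n_clusters=3, n_per_cluster=3)
for cluster, center in zip(positions, centers):
    print([round(p, 2) for p in cluster], round(center, 2))
```

With the default arguments this reproduces the positions hard-coded in multiBox, and the centers match the tick locations computed there with i + 0.8 / 2.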


Origin blog.csdn.net/Gou_Hailong/article/details/124769916#comments_27575294