Reference:
https://blog.csdn.net/qq_41080850/article/details/83829045
https://blog.csdn.net/aijiudu/article/details/89387328
Python draws box plot boxplot(): https://blog.csdn.net/weixin_44052055/article/details/121442449
1. Introduction
A box plot (English: Box plot), also known as a box-and-whisker plot or box-whisker plot, is a statistical chart used to display the dispersion of a set of data. It is named for its box-like shape.
The biggest advantage of the box plot is that the quartiles it is built on are barely affected by outliers, so it describes the spread of the data in a relatively stable way.
Five-number summary: summarize the data with five numbers, namely the minimum, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the maximum. A box plot is essentially a drawing of these five numbers, with outliers usually plotted as separate points.
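The five-number summary can be computed directly with NumPy. This is a minimal sketch; the helper name five_number_summary is made up for illustration, and the data is the same list used in the drawing examples later in this post.

```python
import numpy as np

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) of a 1-D sequence."""
    arr = np.asarray(values, dtype=float)
    # np.percentile uses linear interpolation between data points by default
    q1, q2, q3 = np.percentile(arr, [25, 50, 75])
    return float(arr.min()), float(q1), float(q2), float(q3), float(arr.max())

data = [1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100]
summary = five_number_summary(data)
print(summary)  # (1200.0, 1425.0, 1650.0, 1875.0, 2100.0)
```

Note that other quartile conventions exist, so a different tool may report slightly different Q1 and Q3 for the same data.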
Features of box plots:
- Outliers are judged from the quartiles and the interquartile range. Quartiles are resistant: up to 25% of the data can lie arbitrarily far away without greatly disturbing them, so outliers do not distort the shape of the box plot, and the outliers it identifies are comparatively objective. This gives the box plot an advantage in identifying outliers.
- For a sample from a standard normal distribution, only a very small number of values are flagged as outliers. The more outliers there are, the heavier the tails (for a t-distribution, the fewer the degrees of freedom). Skewness shows the direction of asymmetry: if the outliers are concentrated on the side of smaller values, the distribution is left-skewed; if they are concentrated on the side of larger values, it is right-skewed.
- When the box plots of several batches of data are drawn side by side on the same axis, shape information such as the median, tail length, outliers, and spread of each batch is visible at a glance, so box plots are well suited to comparing the shapes of several batches of data.
- Limitations: a box plot cannot measure the skewness or tail weight of a distribution precisely; for large batches of data the information it conveys is coarser; and representing the overall level by the median has its own limitations.
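The resistance of the quartiles can be checked numerically. The sketch below replaces the largest value of a small data set with a made-up extreme value (a hypothetical data-entry error) and compares the mean and standard deviation with the median and interquartile range before and after.

```python
import numpy as np

clean = np.array([1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100], dtype=float)
dirty = clean.copy()
dirty[-1] = 21000.0  # hypothetical data-entry error on the largest value

def describe(arr):
    """Classical statistics (mean, std) next to resistant ones (median, IQR)."""
    q1, med, q3 = np.percentile(arr, [25, 50, 75])
    return {"mean": arr.mean(), "std": arr.std(ddof=1), "median": med, "iqr": q3 - q1}

before, after = describe(clean), describe(dirty)
for key in before:
    print(f"{key:>6}: {before[key]:10.1f} -> {after[key]:10.1f}")
```

The mean and standard deviation move a lot, while the median and IQR stay exactly where they were, which is why the box itself keeps its shape.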
2. Drawing and Extraction
Python provides many ways to draw box plots, which is very convenient. The available methods are:
- Method 1: Use the Series.plot(), DataFrame.plot(), or DataFrame.boxplot() method in the pandas package;
- Method 2: Use the boxplot() or catplot() function in the seaborn package, where seaborn.boxplot() is the special case of seaborn.catplot() with kind='box';
- Method 3: Use the boxplot() method of the Axes object in the matplotlib package.
With so many methods, see https://blog.csdn.net/qq_41080850/article/details/83829045 for how to use each one; here I only show the two that I find simplest.
1. Graphic drawing
# Method 1: pandas DataFrame.plot.box()
import pandas as pd
import matplotlib.pyplot as plt

data = [1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100]
df = pd.DataFrame(data)
df.plot.box(title="Box figure")
plt.show()

# Method 3: matplotlib Axes.boxplot()
import matplotlib.pyplot as plt

data = [1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100]
fig, ax = plt.subplots(figsize=(6, 6), dpi=100, facecolor='w')
ax.boxplot(data, sym='rd', positions=[2])  # 'rd': outliers drawn as red diamonds
plt.show()
I prefer the latter, so let's look at the parameters of that function in detail.
2. Use of ax.boxplot function
(Someone has already documented this, so I am reusing their work instead of reinventing the wheel.)
https://blog.csdn.net/weixin_44052055/article/details/121442449
The above blog has a very detailed introduction to the parameters of this function. The commonly used parameters are listed below:
parameter | meaning |
---|---|
x | The data to plot; either a single data set or several data sets |
notch | Whether to draw notched boxes; not notched by default |
sym | The marker style of outlier points; a blue + sign by default |
vert | Whether to draw the boxes vertically; vertical by default |
whis | The maximum distance of the whiskers from the upper and lower quartiles; 1.5 times the interquartile range by default |
positions | The positions of the boxes; range(1, N+1) by default, where N is the number of boxes |
widths | The width of the boxes; 0.5 by default |
patch_artist | Whether the boxes can be filled with color; False by default |
meanline | Whether to draw the mean as a line; by default it is drawn as a point |
showmeans | Whether to show the mean; not shown by default |
showcaps | Whether to show the caps at the ends of the whiskers; shown by default |
showbox | Whether to show the box itself; shown by default |
showfliers | Whether to show outliers; shown by default |
boxprops | Properties of the box, such as border color and fill color |
labels | Labels for the boxes, similar in role to a legend (renamed tick_labels in Matplotlib 3.9) |
flierprops | Properties of the outliers, such as marker shape, size, and fill color |
medianprops | Properties of the median line, such as line style and thickness |
meanprops | Properties of the mean, such as point size and color |
capprops | Properties of the caps, such as color and thickness |
whiskerprops | Properties of the whiskers, such as color, thickness, and line style |
manage_ticks | Whether to adjust the tick locations and labels to match the boxes; True by default |
autorange | Whether to adjust the whisker range automatically; False by default |
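A few of these parameters can be tried together. The sketch below is only an illustration: the data is random placeholder data (not from this post), the Agg backend is chosen so that the script also runs without a display, and the tick labels are set on the Axes instead of through labels so that the code does not depend on the Matplotlib version.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # placeholder data, not from the article
groups = [rng.normal(loc=mu, scale=1.0, size=200) for mu in (0.0, 0.5, 1.0)]

fig, ax = plt.subplots(figsize=(6, 4))
result = ax.boxplot(
    groups,
    notch=True,             # notched boxes
    whis=1.5,               # whiskers reach at most 1.5 * IQR beyond the quartiles
    widths=0.4,             # box width in x-axis units
    patch_artist=True,      # boxes become patches that can be filled
    showmeans=True,         # mark the mean of each group
    positions=[1, 2, 3],
)
for box in result["boxes"]:
    box.set_facecolor("lightsteelblue")
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(["g1", "g2", "g3"])
fig.canvas.draw()  # render the figure; use plt.show() in an interactive session
```

The returned dictionary holds the drawn artists under the keys boxes, medians, whiskers, caps, fliers, and means, which is what makes the per-box styling in the loop possible.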
As you can see, this function can also draw multiple box plots. The following are the settings I commonly use:
# Assumes ax is a matplotlib Axes and datagroup is a list of 9 data sets (neither is defined here).
pos1 = list(range(1, 10))
wid1 = 2; wid2 = 1  # line widths are passed as numbers
tmp = ax.boxplot(datagroup, sym='rd', showmeans=True, meanline=False,
                 positions=pos1,
                 medianprops={'color': 'blue', 'linewidth': wid1},  # median line
                 flierprops={"marker": "o", "markerfacecolor": "red", "markersize": 2},  # outlier points
                 # meanprops={'color': 'blue', 'ls': '--', 'linewidth': wid1},  # for meanline=True
                 meanprops={"marker": "^", "markerfacecolor": "green", "markersize": 12},  # mean marker
                 capprops={'linewidth': wid2},      # caps
                 boxprops={'linewidth': wid1},      # box
                 whiskerprops={'linewidth': wid2},  # whiskers
                 )
3. Feature value extraction
Drawing is simple, but once I have the figure I cannot read off the exact characteristic values (mean, median, etc.) of the boxes. How can I get these values? A search turned up a function, written by someone more experienced, that meets this need:
import numpy as np

def BoxFeature(input_list):
    """
    Get the characteristic values of a box plot.
    > @param[in] input_list: the data series
    return:
    < @param[out] out_list: the characteristic values
    < @param[out_note]: [ave,min,Q1,Q2,Q3,max,error_number]
    """
    # linear interpolation is the NumPy default
    percentile = np.percentile(input_list, (25, 50, 75))
    Q1 = percentile[0]  # lower quartile
    Q2 = percentile[1]  # median
    Q3 = percentile[2]  # upper quartile
    IQR = Q3 - Q1  # interquartile range
    ulim = Q3 + 1.5*IQR  # upper limit
    llim = Q1 - 1.5*IQR  # lower limit
    # llim = 0 if llim < 0 else llim
    # out_list = [llim,Q1,Q2,Q3,ulim]
    # ------- count the number of anomalies ----------
    right_list = []  # normal (non-outlier) data
    Error_Point_num = 0
    value_total = 0
    average_num = 0
    for item in input_list:
        if item < llim or item > ulim:
            Error_Point_num += 1
        else:
            # only non-outlier points enter the average
            right_list.append(item)
            value_total += item
            average_num += 1
    average_value = value_total/average_num
    out_list = [average_value, min(right_list), Q1, Q2, Q3, max(right_list), Error_Point_num]
    return out_list
Later, I studied the return value of ax.boxplot in detail and found that I can get what I want directly from it:
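# tmp is the dict returned by ax.boxplot(..., showmeans=True) with patch_artist=False
i = 0  # index of the box to inspect
ave = tmp['means'][i].get_ydata()[0]
med = tmp['medians'][i].get_ydata()[0]
min_value = tmp['caps'][2*i].get_ydata()[0]    # lower cap
Q1 = tmp['boxes'][i].get_ydata()[0]
Q3 = tmp['boxes'][i].get_ydata()[3]
max_value = tmp['caps'][2*i+1].get_ydata()[0]  # upper cap
error_num = len(tmp['fliers'][i].get_ydata())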
Note that the mean from the small function above differs slightly from the one read from the return value of the drawing function: the function averages only the non-outlier points, whereas the mean drawn by boxplot is computed over all the data. The effect is not significant. At this point, you're done!
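If only the numbers are needed and not the figure, Matplotlib also exposes the statistics it uses for drawing through matplotlib.cbook.boxplot_stats. The following is a minimal sketch on made-up data (the value 5000 is an invented outlier); the features dictionary and its key names are my own choice, arranged to mirror the output of BoxFeature.

```python
from matplotlib import cbook

# Made-up data; 5000 lies above Q3 + 1.5 * IQR and is therefore an outlier.
data = [1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 5000]

stats = cbook.boxplot_stats(data, whis=1.5)[0]  # one dict per data set
features = {
    "mean": stats["mean"],            # mean of ALL points, outliers included
    "min": stats["whislo"],           # lowest point inside the lower limit
    "Q1": stats["q1"],
    "median": stats["med"],
    "Q3": stats["q3"],
    "max": stats["whishi"],           # highest point inside the upper limit
    "outliers": len(stats["fliers"]),
}
print(features)
```

Here the mean over all points is 1940, while the mean of the nine non-outlier points is 1600, which shows where the two approaches can disagree on the mean.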
3. Multi-box plot drawing
1. One type
This is simply many boxes: they are all of the same type, but there are many groups.
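A minimal sketch of this case, assuming nine groups of random placeholder data (not from this post) and a headless backend so that the script runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)  # placeholder data, not from the article
n_groups = 9
datagroup = [rng.normal(loc=0.85 + 0.01 * k, scale=0.03, size=17) for k in range(n_groups)]

fig, ax = plt.subplots(figsize=(8, 4))
pos = list(range(1, n_groups + 1))            # one x position per group
bp = ax.boxplot(datagroup, positions=pos, widths=0.5)
ax.set_xticks(pos)
ax.set_xticklabels([f"group {k}" for k in pos])
ax.set_ylabel("value")
fig.canvas.draw()  # render the figure; use plt.show() in an interactive session
```

Passing a list of sequences is enough: boxplot draws one box per element of the list.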
2. Various types
To draw several types of box plots, each with many groups, there are two ways: one is to put the boxes of the same group together and let the legend identify the type; the other is to put the boxes of the same type together and let the legend identify the group. The two ways are distinguished and controlled through positions.
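The two layouts differ only in how the positions are computed and indexed. The helper below is hypothetical (box_positions is not a Matplotlib function, and step and gap are made-up parameter names); it is a sketch of one way to generate evenly spaced clusters of boxes.

```python
import numpy as np

def box_positions(n_outer, n_inner, step=0.4, gap=0.7):
    """Hypothetical helper: x positions for n_outer clusters of n_inner boxes.

    Boxes inside a cluster are `step` apart, and consecutive clusters are
    separated by an extra `gap`. Returns an array of shape (n_outer, n_inner).
    """
    cluster_width = (n_inner - 1) * step
    starts = 1 + np.arange(n_outer) * (cluster_width + gap)
    return starts[:, None] + np.arange(n_inner) * step

n_types, n_groups = 3, 3

# Same type together, legend gives the group:
# pos[t] holds the positions of all groups of type t (one boxplot call per type).
pos = box_positions(n_outer=n_types, n_inner=n_groups)
centers = pos.mean(axis=1)  # where to put the tick label of each cluster

# Same group together, legend gives the type:
# pos_by_group[:, t] holds the positions of type t across all groups.
pos_by_group = box_positions(n_outer=n_groups, n_inner=n_types)
print(pos)
print(centers)
```

With the default step and gap, the first layout gives the clusters (1, 1.4, 1.8), (2.5, 2.9, 3.3), and (4, 4.4, 4.8), with tick centers at 1.4, 2.9, and 4.4.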
The example below puts boxes of the same type together, uses different colors for the different groups, and lets the legend identify the groups.
import matplotlib.pyplot as plt

def multiBox():
    # data: values for the three boxes of acc
    data = [
        [0.8676,0.8484,0.8293,0.8917,0.9151,0.9470,0.8935,0.8078,0.9081,0.8555,0.8897,0.9062,0.9190,0.8964,0.8520,0.8697,0.8738],
        [0.8512,0.8026,0.7911,0.8787,0.9131,0.9532,0.8656,0.8159,0.9187,0.8421,0.8758,0.9096,0.9128,0.8951,0.8748,0.8537,0.8750],
        [0.9161,0.9047,0.8635,0.9026,0.9328,0.9490,0.8911,0.8669,0.9227,0.8683,0.9114,0.9372,0.9475,0.9053,0.8839,0.9364,0.9032]]
    # data2: values for the three boxes of F1 score
    data2 = [
        [0.9291,0.9180,0.9067,0.9427,0.9557,0.9728,0.9438,0.8937,0.9518,0.9221,0.9416,0.9508,0.9578,0.9454,0.9201,0.9303,0.9327],
        [0.9196,0.8905,0.8834,0.9354,0.9546,0.9760,0.9279,0.8986,0.9576,0.9143,0.9338,0.9527,0.9544,0.9447,0.9332,0.9211,0.9333],
        [0.9562,0.9500,0.9267,0.9488,0.9652,0.9738,0.9424,0.9287,0.9598,0.9295,0.9536,0.9676,0.9731,0.9503,0.9384,0.9672,0.9491]]
    # data3: values for the three boxes of IoU
    data3 = [
        [0.8733,0.8624,0.8673,0.8815,0.9363,0.9433,0.9163,0.8350,0.9094,0.8878,0.8956,0.9050,0.9238,0.9077,0.8686,0.8747,0.8877],
        [0.8563,0.8368,0.8618,0.8743,0.9406,0.9479,0.8866,0.8473,0.9195,0.8679,0.8922,0.9091,0.9225,0.9111,0.8857,0.8629,0.8910],
        [0.9172,0.9091,0.8864,0.9029,0.9503,0.9530,0.9200,0.8857,0.9211,0.9033,0.9201,0.9391,0.9430,0.9227,0.9056,0.9360,0.9145]]
    # names of the boxes
    labels = ["A", "B", "C"]
    # colors of the three boxes, RGB (all values in 0~1)
    colors = [(202/255.,96/255.,17/255.), (255/255.,217/255.,102/255.), (137/255.,128/255.,68/255.)]
    # the figure size is set here, because savefig() does not accept figsize
    plt.figure(figsize=(10, 10))
    # draw the box plots
    # patch_artist=True --> boxes can be filled with color; positions=(1,1.4,1.8) --> the three boxes of one cluster are 0.4 apart; widths=0.3 --> each box is 0.3 wide
    bplot = plt.boxplot(data, patch_artist=True, labels=labels, positions=(1,1.4,1.8), widths=0.3)
    # color the three boxes one by one
    for patch, color in zip(bplot['boxes'], colors):
        patch.set_facecolor(color)
    bplot2 = plt.boxplot(data2, patch_artist=True, labels=labels, positions=(2.5,2.9,3.3), widths=0.3)
    for patch, color in zip(bplot2['boxes'], colors):
        patch.set_facecolor(color)
    bplot3 = plt.boxplot(data3, patch_artist=True, labels=labels, positions=(4,4.4,4.8), widths=0.3)
    for patch, color in zip(bplot3['boxes'], colors):
        patch.set_facecolor(color)
    x_position = [1, 2.5, 4]
    x_position_fmt = ["acc", "F1 score", "IoU"]
    # put one tick label under the middle box of each cluster
    plt.xticks([i + 0.8 / 2 for i in x_position], x_position_fmt)
    plt.ylabel('percent (%)')
    plt.grid(linestyle="--", alpha=0.3)  # dashed grid lines with transparency 0.3
    plt.legend(bplot['boxes'], labels, loc='lower right')  # legend in the lower right corner
    plt.savefig("pic.png")
    plt.show()
The drawing results are as follows: