Box Plot Meaning

A box plot is also known as a box-and-whisker plot or boxplot. You have almost certainly seen one before. How many lines are there in the graph? What does each line mean? Does the line in the middle represent the arithmetic mean, the median, or the mode?

What is a box plot for? What practical use does it have in data analysis?

Next, I will take you step by step from the concept to the analysis of the box plot and the story behind it.

1. What is a box plot?


John Tukey is the inventor of the boxplot. Mr. Tukey was born in New Bedford, Massachusetts, in 1915. He earned a master's degree in chemistry from Brown University at the age of 22 and a Ph.D. in mathematics from Princeton University. Interestingly, he did not start out in the statistical work that made him famous. During World War II he joined a fire control laboratory, where much of the weapons-related research turned out to require solving statistical problems first. From then on Tukey changed the direction of his career, and one of the great statisticians of his generation emerged.


The biggest advantage of the box plot is that it is not affected by outliers (also called abnormal values), so it describes the dispersion of the data in a relatively stable way. It is worth reading twice: the box in a boxplot is built from quartiles, and quartiles are not affected by outliers.

To make this more concrete, let's draw a picture first and then talk through it. Using RStudio, suppose we have the data set num = c(1,6,2,7,4,2,3,3,8,25,30) and draw it directly with boxplot(num):
[Figure: box plot of num]

Start with what you can see. There is a rectangular block in the middle; think of it as a box. There is a line inside the box and a T-shaped whisker on each side of it. There are also two hollow circles at the far end, which not every boxplot has. Next, let's explain these parts one by one.

2. Five elements of boxplot


There is one important point that is easy to overlook. To draw a box plot, first sort the data from small to large, that is, in ascending order, because every position used below is counted from the smallest value.
(1) Median
The median is the one-half quantile. To calculate it, sort the data, divide the ordered sequence into two equal halves, and take the number in the middle.

If the original sequence length n is odd, then the median position is (n+1)/2;

If the original sequence length n is even, then the median lies between positions n/2 and n/2+1, and its value is the arithmetic mean of the numbers in these two positions.
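The odd/even rule above can be sketched in R. This is only an illustration of the rule, not how R's built-in median() is implemented, and the helper name median_by_position is made up for this example.

```r
# Median by position, following the odd/even rule described above.
median_by_position <- function(x) {
  s <- sort(x)                        # ascending order
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                    # odd n: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2     # even n: mean of the two middle values
  }
}

num  <- c(1, 6, 2, 7, 4, 2, 3, 3, 8, 25, 30)   # n = 11, odd
test <- c(1, 2, 3, 4, 5, 6, 7, 8)              # n = 8, even

m_num  <- median_by_position(num)    # 4
m_test <- median_by_position(test)   # 4.5
```

Both results agree with R's own median() on these two sequences.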

(2) Lower quartile Q1

First determine the position of the quartiles. The position of Qi is i(n+1)/4, where i=1, 2, 3 and n is the number of items in the sequence.

It is important to emphasize that finding the quartiles means dividing the ordered sequence into four equal parts. There are two common position rules: i(n+1)/4 and 1+i(n-1)/4. The (n+1)/4 form is the one generally given in textbooks, while R's quantile() and summary() use the 1+i(n-1)/4 form by default.

This part is easier to follow with a concrete example in R. Take the ordered sequence test = c(1,2,3,4,5,6,7,8). Calling summary(test) returns the minimum, the lower quartile (Q1 = 2.75), the median (4.5), the arithmetic mean (4.5), the upper quartile (Q3 = 6.25), and the maximum.
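For reference, the call described above looks like this in R. It uses only base R and relies on the default quantile rule.

```r
test <- c(1, 2, 3, 4, 5, 6, 7, 8)

# summary() reports, in order: minimum, Q1, median, mean, Q3, maximum.
s <- summary(test)
print(s)
# Q1 = 2.75, median = 4.5, mean = 4.5, Q3 = 6.25
```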

How is this Q1=2.75 calculated? The sequence length is n=8. R's default rule puts Q1 at position 1+(n-1)/4=2.75. No such position exists in the data, but we know it lies between the second and third numbers.

Assume the values change uniformly (linearly) from the 2nd to the 3rd number. The value at position 2.75 is then three quarters of the way from the second number to the third: 2*0.25+3*0.75=0.5+2.25=2.75. Under the (n+1)/4 rule the position would instead be 2.25, and the same interpolation gives 2*0.75+3*0.25=2.25, so the two rules give slightly different quartiles on small samples.

Isn't it cool~~

(3) Upper quartile Q3
The calculation is the same as for Q1. R's default rule puts Q3 at position 1+3(n-1)/4=6.25, which lies between the sixth and seventh positions, so the value is 0.75*6+0.25*7=6.25. Under the (n+1)/4 rule the position would be 3(n+1)/4=6.75 and the value 0.25*6+0.75*7=6.75.
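The interpolation for Q1 and Q3 can be checked with a small sketch. The helper value_at_position is a made-up name for this illustration, and it assumes the position lies between 1 and n. For comparison, quantile() with type = 7 (R's default) follows the 1+i(n-1)/4 rule and type = 6 follows the i(n+1)/4 rule.

```r
# Linear interpolation between the two neighbours of a fractional position.
value_at_position <- function(x, pos) {
  s    <- sort(x)            # positions are counted in ascending order
  lo   <- floor(pos)
  hi   <- ceiling(pos)
  frac <- pos - lo           # how far past the lower neighbour we are
  s[lo] + frac * (s[hi] - s[lo])
}

test <- c(1, 2, 3, 4, 5, 6, 7, 8)
n <- length(test)

# Rule used by R's default: position 1 + i(n-1)/4
q1_r <- value_at_position(test, 1 + (n - 1) / 4)       # 2.75
q3_r <- value_at_position(test, 1 + 3 * (n - 1) / 4)   # 6.25

# Textbook rule: position i(n+1)/4
q1_np1 <- value_at_position(test, (n + 1) / 4)         # 2.25
q3_np1 <- value_at_position(test, 3 * (n + 1) / 4)     # 6.75
```

On larger samples the two rules give nearly the same values; the gap is only this visible because n is 8.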

4. Upper limit

The upper limit is the largest value still considered non-anomalous.

To compute it, you first need to know what the interquartile range is and how it is calculated.

Interquartile range: IQR = Q3-Q1. Then: upper limit = Q3+1.5IQR

5. Lower limit

The lower limit is the smallest value still considered non-anomalous.

Lower limit = Q1-1.5IQR
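The two formulas above can be sketched as one small R function. The name box_limits and its parameter k are made up for this example, and the quartiles come from R's default quantile() rule.

```r
# Lower and upper limits: Q1 - k*IQR and Q3 + k*IQR (k = 1.5 by default).
box_limits <- function(x, k = 1.5) {
  q   <- quantile(x, c(0.25, 0.75), names = FALSE)   # Q1 and Q3
  iqr <- q[2] - q[1]                                 # interquartile range
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

num <- c(1, 6, 2, 7, 4, 2, 3, 3, 8, 25, 30)
limits <- box_limits(num)   # lower = -5, upper = 15
```

Any value in num below -5 or above 15 falls outside the non-anomalous range.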


(4) Inner limit 

The two T-shaped whiskers we have seen so far mark the inner limits. The upper whisker extends to the smaller of Q3+1.5IQR (where IQR=Q3-Q1) and the maximum value left after removing outliers. The lower whisker extends to the larger of Q1-1.5IQR and the minimum value left after removing outliers.

Let's go back to the example used at the beginning, where Q1=2.5 and Q3=7.5.
IQR=Q3-Q1=7.5-2.5=5
Upper limit=Q3+1.5*IQR=7.5+1.5*5=15. The maximum value after excluding the two outliers 30 and 25 is 8. Taking the smaller of the two, the upper inner limit is 8.
Lower limit=Q1-1.5*IQR=2.5-1.5*5=-5. The minimum value after excluding the two outliers 30 and 25 is 1. Taking the larger of the two, the lower inner limit is 1.
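The worked example can be reproduced in R as a check. This sketch computes the limits by hand and then compares them with boxplot.stats(), the base R helper that boxplot() uses. Note that boxplot.stats() takes its box edges from Tukey's hinges rather than from quantile(); the two happen to coincide for this data set, but that is not guaranteed in general.

```r
num <- c(1, 6, 2, 7, 4, 2, 3, 3, 8, 25, 30)

q   <- quantile(num, c(0.25, 0.75), names = FALSE)   # Q1 = 2.5, Q3 = 7.5
iqr <- q[2] - q[1]                                   # 5
lower_limit <- q[1] - 1.5 * iqr                      # -5
upper_limit <- q[2] + 1.5 * iqr                      # 15

# Values inside the limits decide where the whiskers end.
is_inside   <- num >= lower_limit & num <= upper_limit
lower_inner <- min(num[is_inside])                   # 1
upper_inner <- max(num[is_inside])                   # 8
outliers    <- num[!is_inside]                       # 25 30

# Cross-check against what boxplot() itself would draw.
bs <- boxplot.stats(num)
bs$stats   # lower whisker, Q1, median, Q3, upper whisker: 1 2.5 4 7.5 8
bs$out     # points drawn as hollow circles: 25 30
```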

(5) Outer limit
The outer limit is calculated in the same way as the inner limit. The only difference is that 1.5IQR is replaced by 3IQR: the upper outer limit is the smaller of Q3+3IQR (where IQR=Q3-Q1) and the maximum value after removing outliers, and the lower outer limit is the larger of Q1-3IQR and the minimum value after removing outliers.
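As a sketch, the 3IQR thresholds for the opening example can be computed the same way as the 1.5IQR ones. Splitting the outliers into those between the two thresholds and those beyond the 3IQR threshold is an addition made for this illustration, and the variable names are made up.

```r
num <- c(1, 6, 2, 7, 4, 2, 3, 3, 8, 25, 30)

q   <- quantile(num, c(0.25, 0.75), names = FALSE)   # Q1 = 2.5, Q3 = 7.5
iqr <- q[2] - q[1]                                   # 5

inner_lower <- q[1] - 1.5 * iqr   # -5
inner_upper <- q[2] + 1.5 * iqr   # 15
outer_lower <- q[1] - 3 * iqr     # -12.5
outer_upper <- q[2] + 3 * iqr     # 22.5

# Outliers that lie between the 1.5*IQR and 3*IQR thresholds.
between <- num[(num > inner_upper & num <= outer_upper) |
               (num < inner_lower & num >= outer_lower)]   # none here

# Outliers that lie beyond the 3*IQR thresholds.
beyond_outer <- num[num > outer_upper | num < outer_lower] # 25 30
```

In this data set both outliers, 25 and 30, lie beyond Q3+3IQR=22.5.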

3. Box plots and outlier cleaning

The most important use of box plots is to identify outliers. It is very useful in data cleaning.


  1. Value of Box Plots

    1. Intuitively identify outliers in data batches

    We have spent a long time on identifying outliers. The boxplot's standard for judging outliers is based on the quartiles and the interquartile range. Quartiles are resistant: up to 25% of the data can move arbitrarily far away without disturbing the quartiles much, so outliers do not change the shape of the box, and the boxplot's identification of outliers is relatively objective. This is the boxplot's main advantage in identifying outliers.
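The resistance of the quartiles can be seen with a quick sketch in R. It takes the opening data set and pushes its largest value much further out; the value 3000 is an arbitrary choice for the illustration.

```r
num <- c(1, 6, 2, 7, 4, 2, 3, 3, 8, 25, 30)

# Same data, but the largest value is moved far away.
extreme <- replace(num, num == 30, 3000)

probs <- c(0.25, 0.5, 0.75)
q_num     <- quantile(num, probs)       # 2.5, 4, 7.5
q_extreme <- quantile(extreme, probs)   # unchanged: 2.5, 4, 7.5

mean_num     <- mean(num)               # about 8.27
mean_extreme <- mean(extreme)           # about 278.27
```

The quartiles, and therefore the box, stay exactly where they were, while the arithmetic mean is dragged far away by the single extreme value.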

    2. Use the boxplot to judge the skewness and tail weight of the data batch

    For a sample from a standard normal distribution, very few values are flagged as outliers (about 0.7% fall outside the 1.5IQR limits). The more outliers there are, the heavier the tail, which in terms of a t distribution corresponds to fewer degrees of freedom (that is, the number of freely varying quantities).

    Skewness describes how asymmetric the distribution is. If the outliers are concentrated on the side of the smaller values, the distribution is left-skewed; if the outliers are concentrated on the side of the larger values, the distribution is right-skewed.
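Both points can be illustrated with a short sketch in R. The first part computes the theoretical share of a normal distribution that falls outside the 1.5IQR limits. The second part counts outliers on each side of the median as a rough hint about skew; the helper name outlier_sides is made up, and the count is only a heuristic, not a skewness measure.

```r
# Share of a normal distribution outside Q1 - 1.5*IQR and Q3 + 1.5*IQR.
q1  <- qnorm(0.25)
q3  <- qnorm(0.75)
iqr <- q3 - q1
share_outside <- 2 * pnorm(q1 - 1.5 * iqr)   # roughly 0.007, both tails

# Count the outliers flagged by boxplot.stats() on each side of the median.
outlier_sides <- function(x) {
  out <- boxplot.stats(x)$out
  med <- median(x)
  c(low = sum(out < med), high = sum(out > med))
}

num   <- c(1, 6, 2, 7, 4, 2, 3, 3, 8, 25, 30)
sides <- outlier_sides(num)   # low = 0, high = 2
```

For num, both outliers sit on the high side, which is consistent with a right-skewed batch.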

    3. Use boxplots to compare the shapes of several batches of data

    When the boxplots of several batches of data are drawn side by side on the same axis, shape information such as the median, tail length, outliers, and spread of each batch can be compared at a glance. As shown in the figure above, it can be seen intuitively that the sales of each branch in the third quarter are generally declining.


    However, the box plot also has its limitations. It cannot accurately measure the skewness and tail weight of the data distribution. For large batches of data, the information it conveys is coarser, and using the median to represent the overall level has its own limitations.


