Applying quartiles: box plots and outlier rules, with income examples

Introduction:

Hello everyone, and welcome to your daily bit of analysis. This issue is part of our basic data analysis series. It covers what quartiles are and how they are used, how to calculate them, how to draw a box plot from them, and how to detect outliers with a box plot. The examples use academic performance and income, so the cases stay close to real life, and the article is aimed at newcomers to data analysis. The next issue will cover measures of central tendency. Thanks for following along.

Concept introduction:

In statistics, quartiles are the values at the three cut points that divide a data set, sorted from smallest to largest, into four equal parts, each containing 25% of the data. They are most often used to draw box plots. The middle quartile is the median, so "quartile" usually refers to the value at the 25% position and the value at the 75% position of the sorted data.

The first quartile (Q1), also known as the "lower quartile", is the 25th percentile of all values in the sample sorted in ascending order.

The second quartile (Q2), also known as the "median", is the 50th percentile of all values in the sample sorted in ascending order.

The third quartile (Q3), also known as the "upper quartile", is the 75th percentile of all values in the sample sorted in ascending order.

The difference between the third quartile and the first quartile is called the interquartile range (IQR), that is, IQR = Q3 - Q1.

The schematic box plot for this example shows the following values:

Minimum (min) = 0.5

Lower quartile (Q1) = 7

Median (Med) = 8.5 (the value in the middle position after the data are sorted from smallest to largest)

Upper quartile (Q3) = 9

Maximum (max) = 10

Mean = 8

Interquartile range = Q3 - Q1 = 2

Calculation:

Step 1: Determine the position of the quartile.

A quartile is one of the three values that divide a sorted sequence of numbers into four equal parts. Denote the lower quartile, median, and upper quartile by Q1, Q2, and Q3. Their positions in the sorted sequence are given by:

Position of Q1 = (n + 1) / 4, position of Q2 = 2(n + 1) / 4 = (n + 1) / 2, position of Q3 = 3(n + 1) / 4

where n is the number of data items.

Step 2: Determine the corresponding quartile according to the position of the quartile determined in the first step.

For example, suppose the monthly output of a certain product by the 11 workers in a workshop is 13, 13.5, 13.8, 13.9, 14, 14.6, 14.8, 15, 15.2, 15.4, 15.7 kg. The data are already sorted here; calculating quartiles always requires sorting the data first. With n = 11, the positions of the three quartiles are:

Position of Q1 = (11 + 1) / 4 = 3, position of Q2 = (11 + 1) / 2 = 6, position of Q3 = 3 × (11 + 1) / 4 = 9

That is, the outputs of the 3rd, 6th, and 9th workers in the sorted series are the lower quartile, the median, and the upper quartile, respectively:

Q1 = 13.8 kg, Q2 = 14.6 kg, Q3 = 15.2 kg
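
The two steps above can be sketched in Python. This is only a minimal sketch, not the article's own code: the function name `quartile_by_position` is made up for illustration, and the linear interpolation used when a position is not a whole number is an assumption, since the worked example only covers whole-number positions.

```python
from statistics import quantiles


def quartile_by_position(values, k):
    """Return the k-th quartile (k = 1, 2, 3) at position k * (n + 1) / 4.

    Hypothetical helper. Interpolating between neighbours for fractional
    positions is an assumption; the worked example only has whole positions.
    """
    data = sorted(values)            # step 0: quartiles need sorted data
    n = len(data)
    if n < 3:
        raise ValueError("need at least 3 values so every position is valid")
    position = k * (n + 1) / 4       # step 1: 1-based position in sorted data
    lower = int(position)            # whole part of the position
    fraction = position - lower      # fractional part, 0 for whole positions
    upper = min(lower + 1, n)        # neighbour above, capped at the last item
    low_value, high_value = data[lower - 1], data[upper - 1]
    # step 2: read the value at that position, interpolating if needed
    return low_value + fraction * (high_value - low_value)


output_kg = [13, 13.5, 13.8, 13.9, 14, 14.6, 14.8, 15, 15.2, 15.4, 15.7]
q1, q2, q3 = (quartile_by_position(output_kg, k) for k in (1, 2, 3))
print(q1, q2, q3)                  # 13.8 14.6 15.2

# Cross-check with the standard library; its default "exclusive" method
# is also based on n + 1 positions.
print(quantiles(output_kg, n=4))
```

Other tools may use a different position formula, so their quartiles can differ slightly from the ones computed here on the same data.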

Application:

Application 1: Helping parents judge performance by rank

School grades should be out by now. People usually judge results by the absolute score, which is what everyone calls "grades". The usual convention is something like this: below 60 is a fail, 60-70 is a pass, 80-90 is good, and above 90 is excellent, with 70-80 in between. This approach has a drawback. If the exam is hard, very few students will be rated excellent. If the exam is easy, most students score well, and "excellent" loses its value as a reference.

Today we introduce a way to grade scores by relative position. First sort the scores, then split them at the quartiles. This gives groups such as the top 25% and the bottom 25%. The classification can be made finer by splitting into eighths instead. This method removes the influence of exam difficulty on the evaluation. It also matches how admissions work today: the college entrance examination admits applicants by rank, so a student's relative position matters more than the absolute score.
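
Grading by relative position can be sketched as follows. The names, scores, band labels, and helper functions are all made up for illustration, and the choice to put a score that equals a cut point into the lower band is an assumption. The hard exam is simulated by lowering every score by 20 points.

```python
from statistics import quantiles


def quartile_band(score, cut_points):
    """Label a score by the quartile band it falls into (hypothetical labels).

    A score equal to a cut point is placed in the lower band (an assumption).
    """
    q1, q2, q3 = cut_points
    if score > q3:
        return "top 25%"
    if score > q2:
        return "upper-middle 25%"
    if score > q1:
        return "lower-middle 25%"
    return "bottom 25%"


def band_all(scores):
    """Map each student to a quartile band computed from the whole class."""
    cut_points = quantiles(scores.values(), n=4)   # [Q1, Q2, Q3]
    return {name: quartile_band(s, cut_points) for name, s in scores.items()}


# Made-up scores for an easy exam, and the same class on a harder exam
# where every score is 20 points lower.
easy_exam = {"An": 75, "Bo": 82, "Chen": 88, "Dai": 91, "Fang": 93,
             "Gao": 95, "Hu": 96, "Jin": 97, "Kang": 98, "Li": 99, "Ma": 100}
hard_exam = {name: score - 20 for name, score in easy_exam.items()}

print(band_all(easy_exam)["Ma"])                    # top 25%
print(band_all(easy_exam) == band_all(hard_exam))   # True
```

On the easy exam most of the class scores above 90 in absolute terms, yet the quartile bands still separate the students, and they stay the same when the exam gets harder.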

Application 2: Use quartiles to draw box plots and determine outliers

A box plot can separate normal values from outliers. It is similar to the 3-sigma rule that you often hear about. The difference is that the 3-sigma rule assumes the data follow a normal distribution, whereas the box plot rule relies only on quartiles and makes no assumption about the shape of the distribution.
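
The difference between the two rules can be seen on skewed data. The sketch below is illustrative only: the income figures are made up, the function names are hypothetical, and the box plot rule is taken to mean "more than 1.5 times the interquartile range beyond Q1 or Q3".

```python
from statistics import mean, quantiles, stdev


def three_sigma_outliers(values):
    """Flag values more than 3 sample standard deviations from the mean."""
    center, spread = mean(values), stdev(values)
    return [v for v in values if abs(v - center) > 3 * spread]


def box_plot_outliers(values, k=1.5):
    """Flag values below Q1 - k * IQR or above Q3 + k * IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


# Hypothetical right-skewed monthly incomes (not taken from the article).
incomes = [3200, 3500, 3800, 4000, 4200, 4500, 4800, 5000, 5300, 20000, 25000]

print(three_sigma_outliers(incomes))   # []
print(box_plot_outliers(incomes))      # [20000, 25000]
```

The two large incomes inflate both the mean and the standard deviation, so the 3-sigma limits widen enough to hide them. The quartiles are barely affected by the extreme values, so the box plot rule still flags both.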

A box plot is made up of the following parts. The top of the box is the upper quartile (Q3) and the bottom of the box is the lower quartile (Q1), so the length of the box is Q3 - Q1, the interquartile range. The upper edge is the upper quartile plus 1.5 times the box length; the lower edge is the lower quartile minus 1.5 times the box length. Data above the upper edge or below the lower edge are called outliers.
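
As a quick check, the edge rule can be applied to the values read from the schematic box plot earlier in the article (Q1 = 7, Q3 = 9, min = 0.5, max = 10). The helper names `box_edges` and `is_outlier` are hypothetical, chosen only for this sketch.

```python
def box_edges(q1, q3, k=1.5):
    """Return (lower_edge, upper_edge) for a box running from q1 to q3."""
    box_length = q3 - q1                 # interquartile range
    return q1 - k * box_length, q3 + k * box_length


def is_outlier(value, q1, q3, k=1.5):
    """True if the value lies below the lower edge or above the upper edge."""
    lower_edge, upper_edge = box_edges(q1, q3, k)
    return value < lower_edge or value > upper_edge


# Quartiles read from the schematic box plot earlier in the article.
lower_edge, upper_edge = box_edges(q1=7, q3=9)
print(lower_edge, upper_edge)        # 4.0 12.0
print(is_outlier(0.5, q1=7, q3=9))   # True: the minimum is below the lower edge
print(is_outlier(10, q1=7, q3=9))    # False: the maximum is inside the edges
```

So in that example the minimum value 0.5 counts as an outlier, while the maximum value 10 does not.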

For more content, follow the Haidata official account.

That's all for this issue. We update every day, so see you in the next issue. If you have suggestions, such as topics you want covered, problems with the content, materials you need, or difficulties you have run into while learning, please leave a message below. If you liked this article, please follow us.

Origin blog.csdn.net/qq_40433634/article/details/108771622