Understand various graphs of data analysis (box plots, data distribution graphs, linear regression graphs, correlation graphs) in one article (Alibaba Tianchi)

1. Box plot

1.1 Definition of box plot

A box plot, also called a box-and-whisker plot, is a statistical graph that describes data using five statistics: the minimum, the lower quartile, the median, the upper quartile and the maximum. It visually displays the outliers of the data, the degree of dispersion of the distribution, and the symmetry of the data.

 

Median: the middle value after the data is sorted from small to large. If the number of values is even, it is the average of the two middle values;

Lower quartile Q1: the value at the 25% position of the sorted data;

Upper quartile Q3: the value at the 75% position of the sorted data;

Interquartile range IQR: IQR = Q3 - Q1;

Lower edge: Q1 - 1.5*IQR;

Upper edge: Q3 + 1.5*IQR;
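As a minimal sketch of these definitions, the following standard-library Python computes the quartiles, the IQR and the two edge values for a small made-up sample. The function name `box_stats`, the sample values and the choice of the `inclusive` interpolation method are assumptions for illustration; other tools may interpolate quartiles slightly differently.

```python
from statistics import median, quantiles

def box_stats(values):
    """Return the median, quartiles, IQR and 1.5*IQR edge values of a sample."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "median": median(values),
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_edge": q1 - 1.5 * iqr,
        "upper_edge": q3 + 1.5 * iqr,
    }

# Hypothetical sample with one unusually large value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(box_stats(sample))
```

With this sample, Q1 = 3.25 and Q3 = 7.75, so IQR = 4.5 and the edges are -3.5 and 14.5; the value 100 lies outside them.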

Some people may have this question: the upper and lower edges are both 1.5*IQR away from the box, so why are the lengths of the dotted lines in the picture different?

In fact, the lower edge is drawn at the smallest data value that is not less than Q1 - 1.5*IQR, so unless some value happens to equal Q1 - 1.5*IQR, the drawn lower edge is larger than Q1 - 1.5*IQR. In the same way, the upper edge is drawn at the largest data value that is not greater than Q3 + 1.5*IQR. So in most cases the upper and lower dotted lines are not of equal length. For normally distributed data, the theoretical edges lie at about -2.698\sigma and 2.698\sigma, so the range between them is about 5.4\sigma, which is slightly smaller than 6\sigma.
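The difference between the theoretical edges and the drawn ends of the dotted lines can be sketched as follows. This is an illustrative standard-library example, not the plotting library's own code; the helper name `whisker_ends` and the sample are hypothetical. The last lines check the 2.698\sigma figure for a standard normal distribution.

```python
from statistics import NormalDist, quantiles

def whisker_ends(values):
    """Return (lower_whisker, upper_whisker, outliers) under the 1.5*IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_edge = q1 - 1.5 * iqr
    upper_edge = q3 + 1.5 * iqr
    inside = [v for v in values if lower_edge <= v <= upper_edge]
    outliers = [v for v in values if v < lower_edge or v > upper_edge]
    # The whiskers stop at the most extreme data points inside the edges
    return min(inside), max(inside), outliers

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
low, high, outliers = whisker_ends(sample)
print(low, high, outliers)

# For a standard normal distribution (sigma = 1), Q3 = -Q1 by symmetry,
# so IQR = 2 * Q3 and the upper edge is Q3 + 1.5 * IQR.
q3_normal = NormalDist().inv_cdf(0.75)
upper_edge_normal = q3_normal + 1.5 * (2 * q3_normal)
print(round(upper_edge_normal, 3))
```

In this sample the lower dotted line spans 3.25 - 1 = 2.25 while the upper one spans 9 - 7.75 = 1.25, so the two have different lengths even though both theoretical edges are 1.5*IQR away from the box.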
 

 


1.2 Characteristics of box plots

1. Outliers can be observed intuitively. If there are outliers in the data, that is, values outside the upper and lower edges, they are drawn as dots
2. When the box is very short, it means that a lot of data is concentrated in a small range
3. When the box is very long, it means that the data is relatively dispersed and the differences between the values are relatively large
4. When the median is close to the bottom, it means that most of the data values are relatively small
5. When the median is close to the top, it means that most of the data values are relatively large
6. The high or low position of the median reflects the degree of skewness of the data
7. If the upper and lower dotted lines are relatively long, it indicates that the data outside the upper and lower quartiles is widely spread



1.3 Disadvantages of box plots

1. Although the box plot can show the skewness of the data distribution, it cannot provide an accurate measure of the skewness and tail weight of the data distribution;

2. For larger data batches, the shape information reflected by the box plot becomes blurrier;

3. Using the median to represent the overall level of the data has certain limitations.

Therefore, it is best to use box plots in conjunction with other descriptive statistical tools such as mean, standard deviation, skewness, distribution function, etc. to describe the distribution shape of the data batch.
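A minimal sketch of the complementary statistics mentioned above, using only the standard library. The skewness here is the moment-based sample skewness (third central moment divided by the cube of the population standard deviation), which is one common definition chosen as an assumption; the function name `describe` and the sample data are hypothetical.

```python
from statistics import fmean, pstdev

def describe(values):
    """Return mean, population standard deviation and moment-based skewness."""
    mean = fmean(values)
    std = pstdev(values)
    # Third central moment divided by std**3; 0 for a symmetric sample
    third_moment = fmean([(v - mean) ** 3 for v in values])
    skewness = third_moment / std ** 3
    return mean, std, skewness

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(describe(symmetric))
print(describe(right_skewed))
```

A positive skewness, as in the second sample, corresponds to a box plot whose median sits near the bottom of the box with a long upper dotted line.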

1.4 Python implementation of box plot

import matplotlib.pyplot as plt
import seaborn as sns
fig = plt.figure(figsize=(4, 6))  # Specify the width and height of the figure
sns.boxplot(y=train_data['V0'], width=0.5)  # y= draws a vertical box; width is the width of the box, not the line width

 

2. Data distribution chart

2.1 Histogram

2.1.1 Definition

A histogram, also known as a quality distribution chart, is a statistical chart in which a series of vertical bars or line segments represents the distribution of data. Generally, the horizontal axis represents the data type, and the vertical axis represents the distribution.

A histogram is an accurate graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous (quantitative) variable and was first introduced by Karl Pearson. It is a kind of bar chart. To build a histogram, the first step is to divide the entire range of values into a series of intervals (bins), and then count how many values fall into each interval. The bins are typically specified as consecutive, non-overlapping intervals of the variable. They must be adjacent and are usually (but not necessarily) equal in size.

Histograms can also be normalized to show "relative" frequencies. A normalized histogram shows the proportion of cases that fall into each interval, with the heights summing to 1.
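The binning and normalization steps described above can be sketched in plain Python. This is an illustration under stated assumptions (equal-width bins, with the maximum value placed in the last bin); the function names `histogram_counts` and `relative_frequencies` are hypothetical and not part of any library.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    if width == 0:
        raise ValueError("values must not all be identical")
    counts = [0] * bins
    for v in values:
        # Intervals are half-open [start, end); the maximum goes into the last bin
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return counts, edges

def relative_frequencies(counts):
    """Normalize counts so that the bar heights sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
counts, edges = histogram_counts(data, bins=4)
print(counts, edges)
print(relative_frequencies(counts))
```

For this sample the four intervals start at 1, 3, 5 and 7 and contain 3, 5, 1 and 1 values respectively.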

2.1.2 Python implementation of histogram

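from scipy import stats

plt.figure(figsize=(10,5)) #Set the size of the figure

# distplot() is deprecated since seaborn 0.11; fit=stats.norm overlays a fitted normal curve
sns.distplot(train_data['V0'],fit=stats.norm)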

Program analysis: distplot() combines matplotlib's hist() with seaborn's kdeplot() (kernel density estimation). It can also draw a rug plot of the individual observations via rugplot(), and fit a parametric distribution from the scipy library through the fit argument. By default, it draws a histogram together with a kernel density estimate (KDE). Its signature is as follows:

seaborn.distplot(a, bins=None, hist=True, kde=True, rug=False, fit=None, hist_kws=None, kde_kws=None, rug_kws=None, fit_kws=None, color=None, vertical=False, norm_hist=False, axlabel=None, label=None, ax=None)
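To make the kernel density estimate more concrete, here is a minimal pure-Python sketch of a Gaussian KDE. It is not seaborn's implementation: the function name `gaussian_kde_at`, the sample values and the rule-of-thumb bandwidth (sample standard deviation times n^(-1/5)) are assumptions chosen for illustration.

```python
from math import exp, pi, sqrt
from statistics import stdev

def gaussian_kde_at(x, sample, bandwidth=None):
    """Estimate the density at x as an average of Gaussian bumps."""
    n = len(sample)
    if bandwidth is None:
        # Assumed rule-of-thumb bandwidth; plotting libraries may choose differently
        bandwidth = stdev(sample) * n ** (-1 / 5)
    total = 0.0
    for xi in sample:
        u = (x - xi) / bandwidth
        total += exp(-0.5 * u * u) / sqrt(2 * pi)
    return total / (n * bandwidth)

sample = [2.1, 2.4, 2.5, 3.0, 3.2, 3.3, 3.9, 4.5]
step = 0.01
grid = [i * step for i in range(-500, 1200)]  # covers -5.00 .. 11.99
density = [gaussian_kde_at(x, sample) for x in grid]

# A density should integrate to about 1 (simple Riemann sum over the grid)
area = sum(density) * step
print(round(area, 3))
```

Each observation contributes one small bell curve centered on itself; a larger bandwidth gives a smoother curve, a smaller one follows the data more closely.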

2.2 Q-Q diagram

2.2.1 Definition    

A QQ plot is a scatter plot. For the normal distribution, it plots the quantiles of the standard normal distribution on the abscissa against the sample values on the ordinate. To use the QQ plot to check whether the sample data is approximately normally distributed, you only need to see whether the points lie approximately along a straight line. If they do, the data is approximately normally distributed; the slope of the line is the standard deviation and the intercept is the mean. The QQ plot can also be used to obtain rough information about the skewness and kurtosis of the sample.

If the sample follows a normal distribution with mean m and standard deviation std, then standardizing a sample value x gives a value n that follows the standard normal distribution:
n=\frac{x-m}{std}
where m is the sample mean and std is the sample standard deviation.

Here n is the quantile of the standard normal distribution that corresponds to the sample quantile x. Since these values correspond one to one, rearranging gives:
x=n*std+m
This is a straight line with slope std and intercept m. It is the straight line along which a normal distribution falls in the Q-Q diagram.

It can be seen that the first and second graphs fit the normal distribution poorly, while the third graph basically conforms to the normal distribution.
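The straight-line relation x = n*std + m can be checked numerically with a small sketch. The code below is an illustration, not what `scipy.stats.probplot` does internally: the plotting positions (i - 0.5)/N, the generated data and the helper names `qq_points` and `fit_line` are assumptions.

```python
import random
from statistics import NormalDist, fmean

def qq_points(sample):
    """Pair standard normal quantiles (abscissa) with sorted sample values (ordinate)."""
    ordered = sorted(sample)
    count = len(ordered)
    # Assumed plotting positions (i - 0.5) / N for i = 1..N
    theoretical = [NormalDist().inv_cdf((i - 0.5) / count) for i in range(1, count + 1)]
    return theoretical, ordered

def fit_line(xs, ys):
    """Ordinary least squares fit ys ~ slope * xs + intercept."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(0)
sample = [rng.gauss(5.0, 2.0) for _ in range(2000)]  # hypothetical data: m = 5, std = 2
slope, intercept = fit_line(*qq_points(sample))
print(round(slope, 2), round(intercept, 2))
```

For normally distributed data, the fitted slope should come out close to the standard deviation and the intercept close to the mean.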

2.2.2 Python implementation of Q-Q diagram

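from scipy import stats

# Judge whether the data follows a normal distribution by comparing its quantiles with those of a normal distribution
res = stats.probplot(train_data['V0'], plot=plt)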

  

3. Linear regression diagram

3.1 Definition

A linear regression diagram draws a scatter plot of two variables x and y, fits the model y ~ x to the data, and plots the resulting straight line together with its 95% confidence interval.
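The fit and its confidence interval can be sketched in plain Python. This is a rough illustration that uses a least squares fit and a bootstrap (resampling the data points with replacement) to approximate the 95% interval of the fitted line at one x position. The helper names `fit_line` and `bootstrap_band`, the generated data and the number of resamples are assumptions, not the plotting library's code.

```python
import random
from statistics import fmean

def fit_line(xs, ys):
    """Least squares fit of y ~ x; returns (slope, intercept)."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def bootstrap_band(xs, ys, x0, n_boot=500, seed=0):
    """Approximate 95% confidence interval of the fitted line at x0."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if max(bx) == min(bx):
            continue  # degenerate resample: a line cannot be fitted
        slope, intercept = fit_line(bx, by)
        predictions.append(slope * x0 + intercept)
    predictions.sort()
    lower = predictions[int(0.025 * len(predictions))]
    upper = predictions[int(0.975 * len(predictions)) - 1]
    return lower, upper

rng = random.Random(1)
xs = [i / 10 for i in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]  # hypothetical data around y = 2x + 1

slope, intercept = fit_line(xs, ys)
lower, upper = bootstrap_band(xs, ys, x0=5.0)
print(round(slope, 2), round(intercept, 2), round(lower, 2), round(upper, 2))
```

Repeating the interval computation over a grid of x positions gives the shaded band that is usually drawn around the regression line.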

3.2 Python implementation of linear regression relationship

# marker is passed directly: regplot overrides any 'marker' entry in scatter_kws
sns.regplot(x='V0', y='target', data=train_data, marker='.',
            scatter_kws={'s':3,'alpha':0.3},
            line_kws={'color':'k'})

 

 

4. Heat map

4.1 Definition of heat map

A heat map displays differences in data through changes in color depth. Heat maps can also be used to show the correlation between different indicators, different samples, etc.

In that case, the color represents the size of the correlation coefficient. You can see that the correlation coefficient of each indicator with itself is 1, which is shown in the darkest blue. Near-white indicates a weak correlation, while deep blue (positive correlation) or deep red (negative correlation) indicates a strong correlation. Besides the correlation coefficient, we also check whether the p-value is significant. To express the p-value in the graph, we can add an * sign or the specific value to the cell. In addition, the relationship between every two different indicators is shown twice, for example symboling and normalized-losses (the second cell of the first row and the first cell of the second row), so sometimes only half of the figure is shown, that is, the half above or below the diagonal.
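The properties mentioned above (1 on the diagonal, each pair appearing twice) can be seen in a small pure-Python sketch of a Pearson correlation matrix. The column names and values are made up, and the helper names `pearson` and `correlation_matrix` are hypothetical.

```python
from math import sqrt
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return {(a, b): r} for every pair of named columns."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Hypothetical indicators: "b" tends to rise with "a", "c" falls with "a"
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 5, 4, 6],
    "c": [10, 8, 6, 4, 2],
}
corr = correlation_matrix(columns)

# Print only the lower triangle, since the matrix is symmetric
names = list(columns)
for i, row in enumerate(names):
    print(row, [round(corr[(row, col)], 2) for col in names[: i + 1]])
```

Because corr[(a, b)] equals corr[(b, a)], the lower triangle already contains every distinct pair, which is why half of the heat map is enough.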

4.2 Python implementation of heat map

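# train_corr is assumed to be the correlation matrix of the features, e.g. train_data.corr()
ax = sns.heatmap(train_corr, vmax=.8, square=True, annot=True)  # Draw the heat map; annot=True displays the coefficients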

 

 


Origin blog.csdn.net/tangxianyu/article/details/124210558