How to perform descriptive statistical analysis in Python?

Guide

When conducting data analysis, it is usually necessary to start with a descriptive statistical analysis of the data to uncover its underlying patterns before choosing further analysis methods. Descriptive statistical analysis summarizes the data for all variables of the survey population, including frequency analysis, central tendency, dispersion, the shape of the distribution, and some basic statistical graphics.

This article uses the data set classdata as an example to illustrate how to calculate various indicators during data exploration. The data set contains student information for one class, including name, gender, age, height and weight. First, we create a data frame. The code is as follows:

import pandas as pd
import numpy as np
classdata=pd.read_csv("D:/Pythondata/data/class.csv")
classdata.head()

Running the above program, the result is shown in Figure 1, showing the first 5 observation samples of the data set classdata.

 

Figure 1 The first 5 observations of the data set classdata
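If class.csv is not at hand, the steps in this article can still be tried on a small stand-in table. The sketch below builds one with made-up rows; the column names Name and Sex and all values are assumptions for illustration, not the contents of the real file.

```python
import pandas as pd

# Hypothetical rows that mimic the layout of class.csv (not the real data)
sample = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee"],
    "Sex": ["F", "M", "M", "F"],
    "Age": [12, 13, 12, 14],
    "Height": [58.0, 62.5, 60.0, 65.5],
    "Weight": [85.0, 99.5, 92.0, 110.0],
})

# Keep only the numeric columns so that statistics skip Name and Sex
numeric = sample.select_dtypes(include="number")
print(numeric.columns.tolist())
```

Selecting the numeric columns first is one way to keep text columns out of the calculations that follow.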

1. Central tendency

1. Arithmetic mean

There are two main ways to calculate the mean of variables in Pandas. One is to use the describe function, and the other is to call the mean function directly. The code is as follows:

classdata.mean(numeric_only=True)  # numeric_only=True skips text columns such as the names

Run the program, and the results are shown below. The mean of Age is about 13.32, the mean of Height is 62.34, and the mean of Weight is 100.03.

Age        13.315789
Height     62.336842
Weight    100.026316
dtype: float64

Similarly, we can call describe. The code is as follows:

classdata.describe()

Run the program. The result, shown in Figure 2 below, is consistent with the result of calling the mean function.

 

Figure 2 Mean values of variables
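As a quick check of what the mean function computes, the sketch below compares a hand calculation (sum divided by count) with the mean and describe functions. The height values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import pandas as pd

heights = pd.Series([58.0, 62.5, 60.0, 65.5], name="Height")  # hypothetical values

manual_mean = heights.sum() / len(heights)   # arithmetic mean by definition
pandas_mean = heights.mean()                 # the mean function
described_mean = heights.describe()["mean"]  # the "mean" row of describe

print(manual_mean, pandas_mean, described_mean)
```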

2. Geometric mean

To calculate the geometric mean of a variable, you need to call the Python library scipy. For example, we calculate the geometric mean of the variable Height in the data set classdata. The code is as follows:

from scipy import stats
stats.gmean(classdata['Height'])

Run the program and the results are as follows:

62.133135310943146
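The geometric mean of n positive values is the n-th root of their product, which equals the exponential of the mean of their logarithms. The following minimal sketch, using hypothetical values, checks that stats.gmean agrees with that definition.

```python
import numpy as np
from scipy import stats

values = np.array([1.0, 3.0, 9.0])  # hypothetical positive values

gm_scipy = stats.gmean(values)
# Same quantity from the definition: exp of the mean of the logs
gm_manual = np.exp(np.log(values).mean())

print(gm_scipy, gm_manual)  # both are approximately 3.0
```

Note that the geometric mean is only defined this way for positive values.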

3. Mode

In Pandas, we can call the mode function directly to calculate the mode of a variable. The function returns a Series, because a variable may have more than one mode. For example, to calculate the mode of the variable Age, the code is as follows:

classdata['Age'].mode()

After running the program, the result is as follows:

12
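Because a variable can have several modes, it can help to see how the mode function behaves in that case. The ages below are hypothetical and chosen so that two values tie.

```python
import pandas as pd

ages = pd.Series([12, 12, 13, 14, 14, 15])  # hypothetical ages with a tie

modes = ages.mode()          # a Series holding every mode, sorted ascending
print(modes.tolist())        # [12, 14]

first_mode = modes.iloc[0]   # pick one value if a single number is needed
```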

2. Degree of dispersion

1. Range and interquartile range

The range, also called the full range, is the difference between the maximum and minimum values of a set of data. The interquartile range, also known as the inner range, is the difference between the third quartile (75th percentile) and the first quartile (25th percentile). The describe function reports the maximum, minimum, and quartiles, from which the range and interquartile range can be calculated.

stat = classdata.describe()  # save the basic statistics
stat.loc['range'] = stat.loc['max'] - stat.loc['min']  # range
stat.loc['dis'] = stat.loc['75%'] - stat.loc['25%']  # interquartile range
print(stat)

Run the above program. The result is shown in Figure 3 below: the range of the variable Age is 5 and its interquartile range is 2.5; the range of Height is 20.7 and its interquartile range is 7.65; the range of Weight is 99.5 and its interquartile range is 28.

 

Figure 3 Range calculation results
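The same two measures can also be computed without going through describe, by calling max, min and quantile on a column. This is a sketch on hypothetical weights, using the default linear interpolation of quantile.

```python
import pandas as pd

weights = pd.Series([80.0, 90.0, 100.0, 110.0, 120.0])  # hypothetical weights

value_range = weights.max() - weights.min()  # 120 - 80 = 40
q1 = weights.quantile(0.25)                  # first quartile: 90
q3 = weights.quantile(0.75)                  # third quartile: 110
iqr = q3 - q1                                # 110 - 90 = 20

print(value_range, iqr)
```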

2. Average deviation

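The average deviation (mean absolute deviation) is the mean of the absolute differences between each value and the variable's mean. Older versions of Pandas provided the mad function for this; it was deprecated in Pandas 1.5 and removed in 2.0, so the code below computes the same quantity directly. For example, to calculate the average deviation of each numeric variable, the code is as follows:

num = classdata.select_dtypes('number')  # numeric columns only
(num - num.mean()).abs().mean()  # same result as classdata.mad() in Pandas < 2.0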

After running the program, the result is as follows:

Age        1.279778
Height     4.069252
Weight    17.343490
dtype: float64
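To make the definition concrete, here is a minimal sketch that computes the average deviation on a few hypothetical ages where the arithmetic can be checked by hand. The helper name mean_abs_dev is illustrative and not part of Pandas.

```python
import pandas as pd

def mean_abs_dev(values: pd.Series) -> float:
    """Mean of the absolute differences between each value and the mean."""
    return float((values - values.mean()).abs().mean())

ages = pd.Series([11, 12, 13, 14, 15])  # hypothetical ages, mean is 13

# Absolute deviations are 2, 1, 0, 1, 2, so the average deviation is 6 / 5
result = mean_abs_dev(ages)
print(result)
```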

3. Standard deviation

There are several ways to calculate the standard deviation in Pandas, including the describe and std functions. The describe function was shown earlier in this article, so we will not repeat it here. We can call the std function directly. The code is as follows:

classdata.std(numeric_only=True)  # sample standard deviation (ddof=1) of the numeric columns

Run the above program, the results are as follows:

Age        1.492672
Height     5.127075
Weight    22.773933
dtype: float64
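One detail worth knowing is that the std function in Pandas divides by n - 1 (the sample standard deviation), while numpy's std divides by n by default. The sketch below shows the difference on hypothetical values.

```python
import numpy as np
import pandas as pd

x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical values

sample_std = x.std()              # Pandas default: ddof=1, divides by n - 1
population_std = x.std(ddof=0)    # divides by n
numpy_std = np.std(x.to_numpy())  # numpy default: ddof=0, divides by n

print(sample_std, population_std, numpy_std)
```

Here the population value is exactly 2.0, and the sample value is slightly larger because of the smaller divisor.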

4. Dispersion coefficient

The dispersion coefficient is usually calculated from the standard deviation, so it is also called the standard deviation coefficient (or coefficient of variation). It is the ratio of the standard deviation of a set of data to its mean, and is a relative measure of the degree of dispersion.
We can calculate the standard deviation coefficient with the following code:

stat2 = classdata.describe()
stat2.loc['var'] = stat2.loc['std']/stat2.loc['mean']  # dispersion coefficient (std / mean), not the variance
stat2

Run the above program, the result is shown in Figure 4 below:

 

Figure 4 Dispersion coefficient results
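Because the dispersion coefficient has no unit, it allows variables measured on different scales to be compared. The sketch below uses hypothetical heights and weights to illustrate this; the variable name cv is an assumption for the example.

```python
import pandas as pd

# Hypothetical values on two different scales
df = pd.DataFrame({
    "Height": [60.0, 62.0, 64.0],
    "Weight": [90.0, 100.0, 110.0],
})

# Dispersion coefficient: sample standard deviation divided by the mean
cv = df.std() / df.mean()
print(cv)
```

In this made-up example Weight has the larger coefficient, so it is relatively more dispersed than Height.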

3. Distribution status

1. Skewness

Skewness measures the direction and degree of asymmetry of a distribution. In Pandas, you can directly call the skew function to calculate the skewness coefficient of variables. The code is as follows:

classdata.skew(numeric_only=True)  # skewness of the numeric columns

Running the above program, the results are shown below, where the skewness coefficients of the variables Age, Height and Weight are 0.06, -0.26 and 0.18, respectively.

Age        0.063612
Height    -0.259670
Weight     0.183351
dtype: float64
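To help read the sign of the coefficient, the sketch below computes the skewness of a hypothetical right-skewed sample and of a symmetric one. It also compares the result with scipy's stats.skew using bias=False, which applies the same sample-size correction as the Pandas skew function.

```python
import pandas as pd
from scipy import stats

right_skewed = pd.Series([1.0, 2.0, 2.0, 3.0, 10.0])  # hypothetical, long right tail
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])      # hypothetical, symmetric

skew_pandas = right_skewed.skew()                          # positive: tail on the right
skew_scipy = stats.skew(right_skewed.to_numpy(), bias=False)
skew_symmetric = symmetric.skew()                          # zero for symmetric data

print(skew_pandas, skew_scipy, skew_symmetric)
```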

2. Kurtosis

Kurtosis describes how peaked or heavy-tailed a distribution is around its center. In Pandas, you can directly call the kurt function to calculate the kurtosis coefficient of variables. It returns the excess kurtosis (Fisher's definition), for which a normal distribution has a value of 0. The code is as follows:

classdata.kurt(numeric_only=True)  # excess kurtosis of the numeric columns

Running the above program, the results are shown below, where the kurtosis coefficients of the variables Age, Height and Weight are -1.11, -0.14 and 0.68, respectively.

Age       -1.110926
Height    -0.138969
Weight     0.683365
dtype: float64
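As a cross-check, the sketch below compares the Pandas kurt function with scipy's stats.kurtosis on hypothetical values. With fisher=True and bias=False, scipy reports the same bias-corrected excess kurtosis.

```python
import pandas as pd
from scipy import stats

x = pd.Series([1.0, 2.0, 2.0, 3.0, 10.0])  # hypothetical values with one outlier

kurt_pandas = x.kurt()  # bias-corrected excess kurtosis
kurt_scipy = stats.kurtosis(x.to_numpy(), fisher=True, bias=False)

print(kurt_pandas, kurt_scipy)
```

A positive value indicates heavier tails than a normal distribution, and a negative value indicates lighter tails.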

4. Correlation analysis

1. Scatter chart

There are many ways to draw a scatter chart. You can call the plot.scatter function of the Pandas library directly, as in the following program:

classdata.plot.scatter(x='Age', y='Height')

Running the above program, the result is shown in Figure 5. From the data distribution of the scatter plot, it can be seen that the variable Height and Age present a strong correlation.

 

Figure 5 Scatter plot of the variables Age and Height

Similarly, we can also call the scatter function of matplotlib's pyplot module to draw a scatter plot. The code is as follows:

import matplotlib.pyplot as plt
plt.scatter(classdata['Height'], classdata['Weight'])
plt.xlabel("Height")
plt.ylabel("Weight")
plt.show()

Running the above program, the result is shown in Figure 6 below. From the data distribution of the scatter plot, it can be seen that the variables Height and Weight also show a strong correlation.

 

Figure 6 Scatter plot of variable Height and Weight

2. Correlation coefficient

In Pandas, you can directly call the corr function to calculate the correlation coefficients between variables (the Pearson coefficient by default), as follows:

classdata.corr(numeric_only=True)  # numeric_only requires Pandas 1.5 or later

After running the program, the result is shown in Figure 7:

 

Figure 7 Correlation coefficient between variables
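To show what the default coefficient measures, the sketch below computes the Pearson correlation between two hypothetical columns from its definition (the sum of products of deviations divided by the square root of the product of the sums of squared deviations) and compares it with the corr function.

```python
import numpy as np
import pandas as pd

# Hypothetical heights and weights
df = pd.DataFrame({
    "Height": [58.0, 60.0, 62.5, 65.5],
    "Weight": [85.0, 92.0, 99.5, 110.0],
})

r_pandas = df["Height"].corr(df["Weight"])  # Pearson by default

# Same coefficient from the definition
h = df["Height"] - df["Height"].mean()
w = df["Weight"] - df["Weight"].mean()
r_manual = (h * w).sum() / np.sqrt((h ** 2).sum() * (w ** 2).sum())

print(r_pandas, r_manual)
```

A value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little linear relationship.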

In addition to calculating the correlation coefficient matrix, we can also plot it as a heat map. Here we need the seaborn library. The code is as follows:

import seaborn as sns
# Jupyter/IPython magic; remove the next line when running as a plain Python script
%matplotlib inline
# calculate the correlation matrix
corr = classdata.corr(numeric_only=True)
# plot the heatmap
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns)

After running the above program, the result is shown in Figure 8:

 

Figure 8 Correlation coefficient matrix diagram

Origin blog.csdn.net/pythonxuexi123/article/details/112792642