1. Data analysis
1.1 Basic statistical analysis
1.1.1 Meaning
Basic statistical analysis (descriptive statistics) summarizes a variable by computing its minimum, first quartile, median, third quartile, and maximum.
1.1.2 Central tendency
The center of the data can be described by the mean, the median, and the mode.
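As a minimal sketch of the three measures, the snippet below computes them on a small hypothetical Series (the numbers are invented for illustration and do not come from i_nuc_utf8.csv).

```python
import pandas as pd

# Hypothetical scores, invented for this illustration
scores = pd.Series([60, 70, 70, 80, 95])

mean_value = scores.mean()      # arithmetic mean: 375 / 5
median_value = scores.median()  # middle value of the sorted data
mode_values = scores.mode()     # a Series, because several values can tie

print(mean_value, median_value, mode_values.tolist())
```

Note that mode() returns a Series rather than a single number, since more than one value can share the highest frequency.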
1.1.3 describe function
The descriptive statistics function is describe. It returns the count, mean, standard deviation, minimum, maximum, and selected quantiles of each numeric column. Optional arguments go in the parentheses; for example, percentiles=[0, 0.2, 0.4, 0.6, 0.8] reports the 0, 0.2, 0.4, 0.6, and 0.8 quantiles instead of the default 1/4, 1/2, and 3/4 quantiles.
The commonly used statistical functions are:
size: number of samples (an attribute, so it takes no parentheses).
sum(): sum.
mean(): mean.
var(): variance.
std(): standard deviation.
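The following sketch shows describe with custom percentiles together with the functions listed above. The column name score and its values are hypothetical, chosen only to keep the arithmetic easy to check.

```python
import pandas as pd

# Hypothetical data, invented for this illustration
df_demo = pd.DataFrame({"score": [50, 60, 70, 80, 90]})

# Request custom quantiles instead of the default quartiles
summary = df_demo["score"].describe(percentiles=[0.2, 0.4, 0.6, 0.8])
print(summary)

n = df_demo["score"].size        # attribute, no parentheses
total = df_demo["score"].sum()
average = df_demo["score"].mean()
variance = df_demo["score"].var()  # sample variance (divides by n - 1)
std_dev = df_demo["score"].std()   # sample standard deviation

print(n, total, average, variance, std_dev)
```

pandas computes var() and std() with n - 1 in the denominator by default, which differs from NumPy's default of n.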
1.1.4 Examples
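Load and display i_nuc_utf8.csv:
import pandas as pd
import numpy as np
np.set_printoptions(suppress=True)  # print NumPy numbers without scientific notation
# display.max_columns sets the maximum number of columns shown; display.max_rows sets the maximum number of rows
df = pd.read_csv('i_nuc_utf8.csv')  # read the file; read_csv already returns a DataFrame
df  # display the loaded table
Operation result:
df.describe()  # basic statistics for every numeric column
Operation result: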
Compute the basic statistics of a single variable.
Compute the mean (there are three ways).
Compute the median and mode of the entire table.
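The code for these three steps is not shown in the text, so the sketch below is only one plausible reading. In particular, the three ways of computing the mean are an assumption, and the column names english and sports and their values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical table, invented for this illustration
df_demo = pd.DataFrame({
    "english": [60, 70, 70, 80],
    "sports": [75, 85, 85, 95],
})

# Basic statistics of a single variable
col_stats = df_demo["english"].describe()

# Three possible ways to compute the mean (assumed, not confirmed by the text)
mean_a = df_demo["english"].mean()
mean_b = np.mean(df_demo["english"])
mean_c = df_demo["english"].sum() / df_demo["english"].size

# Median and mode of every column in the table
table_median = df_demo.median()
table_mode = df_demo.mode()

print(col_stats, mean_a, mean_b, mean_c, table_median, table_mode, sep="\n")
```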
1.1.5 Summary
The describe function computes the basic statistics of each variable.
There are three ways to compute the mean.
A variable (column) of the table is written as df['variable'].
The median function computes the median, and the mode function computes the mode.
1.2 Group analysis
1.2.1 Meaning
Divide the analysis object into different parts according to the grouping field.
1.2.2 Format
The commonly used command form is as follows:
df.groupby(by=['grouping column', ...])['column to aggregate'].agg(alias1=statistical_function1, ...)
where:
by gives the column or columns to group by;
[] selects the column to aggregate;
agg computes the statistics: each keyword is the alias (the output column name) and its value is the statistical function to apply. Passing a dict of {alias: function} in this position is an older form that pandas 1.0 and later reject. Commonly used statistical functions are: size (count), sum (total), and mean (average).
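A minimal sketch of this pattern on a hypothetical table follows. The column names class_id and score and the aliases are invented for illustration, and the keyword form of agg shown here assumes pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical data, invented for this illustration
df_demo = pd.DataFrame({
    "class_id": ["A", "A", "B", "B", "B"],
    "score": [80, 90, 70, 75, 95],
})

# Group by class_id, then aggregate the score column.
# Each keyword is the output column name; each value names the function.
result = df_demo.groupby(by=["class_id"])["score"].agg(
    count="size",
    total="sum",
    average="mean",
)
print(result)
```

The result has one row per group and one column per alias.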
1.2.3 Examples
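In a groupby, numeric columns are aggregated. Older pandas versions silently drop non-numeric columns from the result, while pandas 2.0 and later raise an error for operations such as mean unless numeric_only=True is passed, so it is safest to select the numeric columns explicitly. When there is more than one grouping column, pass them as a list.
df.groupby(by=['班级','性别'])[['军训','英语','体育']].mean()  # 班级 = class, 性别 = gender; several columns are selected with a list (double brackets)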
To compute several statistics of each group at once, such as the mean, standard deviation, and total, use agg().
df.groupby(by=['班级','性别'])['军训'].agg(
    总分='sum',     # total score
    人数='size',    # number of students
    平均值='mean',  # mean
    方差='var',     # variance
    标准差='std',   # standard deviation
    最高分='max',   # highest score
    最低分='min',   # lowest score
)
operation result:
1.2.4 Summary
Remember the pattern df.groupby(by=['grouping column', ...])['column to aggregate'].agg(alias1=statistical_function1, ...), and be clear about which columns are the grouping basis and which column is being aggregated.
1.3 Distribution analysis
1.3.1 Meaning
Distribution analysis studies how the observations are distributed across groups, typically intervals of a numeric variable. It builds on group analysis.
1.3.2 Procedure
1.3.2.1 Generate total column
df['总分']=df.英语+df.体育+df.军训+df.数分+df.高代+df.解几  # 总分 = total score, the sum of the six course scores
df['总分']  # the total for each row
operation result:
df['总分'].describe()
operation result:
1.3.2.2 Divide the interval and generate labels
bins=[min(df.总分)-1,400,450,max(df.总分)+1]
bins
labels=['400分以下','400到450','450及其以上']  # below 400, 400 to 450, 450 and above
labels
1.3.2.3 Stratify the total column using the intervals just generated
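总分分层=pd.cut(df.总分,bins,labels=labels,right=False)  # cut assigns each total to an interval; right=False makes the intervals [low, high), so 400 and 450 fall under the labels that name them
总分分层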
1.3.2.4 Generate variables
df['总分分层']=总分分层
df['总分分层']
1.3.2.5 Statistics
df.groupby(by=['总分分层'],observed=False)['总分'].agg(人数='size')  # 人数 = number of students in each interval; observed=False keeps empty intervals
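The whole procedure can be sketched end to end on a small hypothetical table. The column names total and band, the English labels, and the values are assumptions made for illustration; right=False is chosen so that a total of exactly 400 or 450 falls into the interval whose label names it.

```python
import pandas as pd

# Hypothetical totals, invented for this illustration
df_demo = pd.DataFrame({"total": [380, 400, 420, 450, 470]})

# Interval edges: pad the minimum and maximum so every row is covered
bins = [df_demo["total"].min() - 1, 400, 450, df_demo["total"].max() + 1]
labels = ["below 400", "400 to 449", "450 and above"]

# right=False gives left-closed intervals: [low, high)
df_demo["band"] = pd.cut(df_demo["total"], bins, labels=labels, right=False)

# Count the rows in each interval; observed=False keeps empty intervals
counts = df_demo.groupby("band", observed=False)["total"].size()
print(counts)
```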
1.4 Cross analysis
1.4.1 Meaning
Cross analysis is usually used to analyze the relationship between two or more grouping variables by tabulating one against the other.
1.4.2 Format and parameters
Format:
pivot_table(values, index, columns, aggfunc, fill_value)
Parameters:
values: the column or columns whose values are aggregated in the pivot table
index: the grouping basis displayed as the rows of the pivot table
columns: the grouping basis displayed as the columns of the pivot table
aggfunc: the statistical function or functions
fill_value: the value used to replace every NA in the result
The return value is a pivot table (a DataFrame).
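To make the parameters concrete, here is a small sketch on hypothetical data. The column names band, gender, and total are invented, and the table deliberately has no male row in the low band so that fill_value has something to replace.

```python
import pandas as pd

# Hypothetical data, invented for this illustration
df_demo = pd.DataFrame({
    "band": ["low", "low", "high", "high", "high"],
    "gender": ["F", "F", "F", "M", "M"],
    "total": [380, 390, 460, 470, 450],
})

# Rows are bands, columns are genders, cells hold the mean total.
# The (low, M) combination has no data, so fill_value=0 replaces its NA.
table = df_demo.pivot_table(
    values="total",
    index="band",
    columns="gender",
    aggfunc="mean",
    fill_value=0,
)
print(table)
```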
1.4.3 Examples
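# 班级 = class, 姓名 = name; with no aggfunc given, pivot_table takes the mean of each value column
df.pivot_table(values=['英语','体育','军训','数分','高代','解几','总分'],index=['班级','姓名'])
# count and mean of the total score for each score band (rows) and gender (columns)
df.pivot_table(values=['总分'],index=['总分分层'],columns=['性别'],aggfunc=[np.size,np.mean])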
operation result:
1.5 Structural analysis
1.5.1 Meaning
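Structural analysis studies the share that each part contributes to the whole. The shares are often drawn as a pie chart.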
1.5.2 Examples
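Group by class and gender, then find the total score of each group and its share within the class.
df['总分']=df.英语+df.体育+df.军训+df.数分+df.高代+df.解几
df_pt=df.pivot_table(values=['总分'],
index=['班级'],columns=['性别'],aggfunc=['sum'])  # sum of total scores per class (rows) and gender (columns)
df_pt
df_pt.div(df_pt.sum(axis=1),axis=0)  # divide each row by its row total to get each gender's share within the class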
In class 2308242, girls account for 0.332193 (about 33.2%) of the class's total score, and boys account for 0.667807 (about 66.8%).
1.5.3 Summary
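Use div to divide each value by a total and obtain its share.
The sum function adds values. With axis=1 it adds across the columns and gives one total per row; with axis=0 it adds down the rows and gives one total per column.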
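The difference between the two axes, and how div uses a row total, can be checked on a tiny hypothetical table (the class labels and numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical total scores per class (rows) and gender (columns)
totals = pd.DataFrame(
    {"F": [100, 150], "M": [300, 50]},
    index=["class_1", "class_2"],
)

row_sums = totals.sum(axis=1)  # one total per row: adds across F and M
col_sums = totals.sum(axis=0)  # one total per column: adds down the classes

# Divide every row by its own total; axis=0 aligns row_sums with the row index
shares = totals.div(row_sums, axis=0)
print(shares)
```

Each row of shares adds up to 1, which is what makes the values readable as proportions within a class.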
1.6 Correlation analysis
1.6.1 Meaning
1.6.2 The relationship between correlation coefficient and correlation degree
1.6.3 Examples
df.loc[:,['英语','体育','军训','解几','数分','高代']].corr()
Advanced Algebra (高代) and Mathematical Analysis (数分) are moderately correlated, and Analytic Geometry (解几) and Mathematical Analysis (数分) are also moderately correlated.
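As a sanity check on how to read the output of corr, the sketch below uses hypothetical columns with known linear relationships. By default corr computes the Pearson correlation coefficient, so a perfect increasing linear relationship gives 1 and a perfect decreasing one gives -1.

```python
import pandas as pd

# Hypothetical columns, invented for this illustration
df_demo = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],  # exactly 2 * x
    "z": [5, 4, 3, 2, 1],   # exactly 6 - x
})

# Select the columns with loc, then compute pairwise Pearson correlations
corr = df_demo.loc[:, ["x", "y", "z"]].corr()
print(corr)
```

The result is a symmetric matrix with 1 on the diagonal, since every column is perfectly correlated with itself.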
1.6.4 Summary
Use loc to select the columns and the corr function to compute their pairwise correlation coefficients.