Python-based big data analysis foundation and actual data analysis

1. Data analysis

1.1 Basic statistical analysis

1.1.1 Meaning

Basic statistical analysis summarizes a variable by its minimum, first quartile, median, third quartile, and maximum.

1.1.2 Central tendency of the data

The central tendency of the data can be described by the mean, the median, and the mode.

1.1.3 describe function

Descriptive statistics are computed with the describe function. It returns the count, mean, standard deviation, minimum, quantiles, and maximum. Parameters can be passed in the parentheses; for example, percentiles=[0, 0.2, 0.4, 0.6, 0.8] requests the 0, 0.2, 0.4, 0.6, and 0.8 quantiles instead of the default 1/4, 1/2, and 3/4 quantiles.
The commonly used statistical functions are:
size: number of samples (an attribute, so it is written without parentheses).
sum(): sum.
mean(): mean.
var(): variance.
std(): standard deviation.
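
The following is a minimal sketch of describe and the functions listed above. The sample data and the column name "score" are assumptions made for illustration and are not taken from i_nuc_utf8.csv.

```python
import pandas as pd

# Hypothetical sample data; the column name "score" is an assumption.
scores = pd.DataFrame({"score": [50, 60, 70, 80, 90]})

# Default summary: count, mean, std, min, 25%, 50%, 75%, max.
default_summary = scores["score"].describe()

# Custom quantiles requested through the percentiles parameter.
custom_summary = scores["score"].describe(percentiles=[0.2, 0.4, 0.6, 0.8])

# The same statistics as individual calls.
n = scores["score"].size           # attribute, no parentheses
total = scores["score"].sum()
mean = scores["score"].mean()
variance = scores["score"].var()   # sample variance (ddof=1)
std = scores["score"].std()        # sample standard deviation (ddof=1)

print(default_summary)
print(custom_summary)
print(n, total, mean, variance, std)
```

Note that pandas var() and std() use the sample formulas (division by n-1) by default.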

1.1.4 Examples

The examples use the data file i_nuc_utf8.csv.

import numpy as np
import pandas as pd

np.set_printoptions(suppress=True)  # print numbers without scientific notation
# pandas options: display.max_columns sets the maximum number of columns shown, display.max_rows the maximum number of rows
df = pd.read_csv('i_nuc_utf8.csv')  # read the file
df  # show the data that was read

df.describe()  # basic statistics for every column

The basic statistics of a single variable are obtained by calling describe on that column, for example df['variable'].describe().
The mean of a column can be calculated in three ways.
The median and mode of the entire table are calculated with df.median() and df.mode().
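
The original screenshots for these steps are not available, so the sketch below is only one plausible reading of them. The data, the column names, and in particular the choice of the three ways to compute the mean (bracket access, attribute access, and NumPy) are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the score table; column names are assumptions.
df_demo = pd.DataFrame({
    "english": [60, 70, 70, 90],
    "pe": [80, 80, 85, 95],
})

# Basic statistics of a single variable.
english_summary = df_demo["english"].describe()

# Three equivalent ways to get the mean of one column (assumed, see above).
mean_by_key = df_demo["english"].mean()       # bracket access
mean_by_attr = df_demo.english.mean()         # attribute access
mean_by_numpy = np.mean(df_demo["english"])   # NumPy function

# Median and mode of every column in the table.
medians = df_demo.median()
modes = df_demo.mode()  # a DataFrame, because a column can have several modes

print(english_summary)
print(mean_by_key, mean_by_attr, mean_by_numpy)
print(medians)
print(modes)
```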

1.1.5 Summary

The describe function calculates the basic statistics of each variable.
There are three ways to calculate the mean.
A variable (column) of the table is written as df['variable'].
The median function calculates the median, and the mode function calculates the mode.

1.2 Group analysis

1.2.1 Meaning

Group analysis divides the objects being analyzed into different parts according to a grouping field.

1.2.2 Format

The commonly used command form is as follows:
df.groupby(by=['grouping column', ...])['column to aggregate'].agg(alias1='statistical function 1', ...)
where:
by specifies the column or columns to group by;
[] selects the column to aggregate;
agg takes an alias for each result column together with the statistical function that computes it. Commonly used statistical functions are: size (count), sum (total), and mean (average).
Passing a dictionary {alias: function} to agg on a single column is no longer supported since pandas 1.0; pass the aliases as keyword arguments instead.
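
The command form can be sketched on a small table as follows. The data and the names "class_id", "gender", and "score" are hypothetical, and the aliases are passed as keyword arguments (named aggregation).

```python
import pandas as pd

# Hypothetical data; "class_id", "gender" and "score" are assumed column names.
df_demo = pd.DataFrame({
    "class_id": ["A", "A", "A", "B", "B"],
    "gender": ["F", "M", "M", "F", "F"],
    "score": [80, 70, 90, 60, 100],
})

# Group by two columns, aggregate one column, and name each result column.
summary = df_demo.groupby(by=["class_id", "gender"])["score"].agg(
    count="size",    # number of rows in the group
    total="sum",     # sum of the scores
    average="mean",  # mean of the scores
)
print(summary)
```

The result has one row per (class_id, gender) pair that occurs in the data and one column per alias.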

1.2.3 Examples

In a groupby aggregation, numeric columns are aggregated. Older pandas versions silently drop non-numeric columns from the result, while pandas 2.0 and later require selecting the numeric columns or passing numeric_only=True. When there is more than one grouping column, pass the column names as a list.

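# Column names: 班级 = class, 性别 = gender, 军训 = military training, 英语 = English, 体育 = physical education
df.groupby(by=['班级','性别'])[['军训','英语','体育']].mean()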

When you want to calculate the average, standard deviation, total, etc. of each group of data at the same time, you need to use agg().

df.groupby(by=['班级','性别'])['军训'].agg(
    总分='sum',     # total
    人数='size',    # number of students
    平均值='mean',  # mean
    方差='var',     # variance
    标准差='std',   # standard deviation
    最高分='max',   # highest score
    最低分='min',   # lowest score
)

1.2.4 Summary

Keep in mind the form df.groupby(by=['grouping column', ...])['column to aggregate'].agg(alias1='statistical function 1', ...), and be clear about which columns are the grouping keys and which column is being aggregated.

1.3 Distribution analysis

1.3.1 Meaning

Distribution analysis studies how the data are distributed across groups. It builds on group analysis.

1.3.2 Procedure

1.3.2.1 Generate total column

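# 总分 = total score; 数分 = mathematical analysis, 高代 = advanced algebra, 解几 = analytic geometry
df['总分']=df.英语+df.体育+df.军训+df.数分+df.高代+df.解几  # row-wise total of the six subjects
df['总分']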

df['总分'].describe()

1.3.2.2 Divide the interval and generate labels

bins=[min(df.总分)-1,400,450,max(df.总分)+1]
bins

labels=['400分以下','400到450','450及其以上']  # below 400, 400 to 450, 450 and above
labels

1.3.2.3 Stratification and statistics according to the total column and interval just generated

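# Bin the total scores with pd.cut. Intervals are right-closed by default, so a total of exactly 400 falls in the first bin; pass right=False for left-closed bins.
总分分层=pd.cut(df.总分,bins,labels=labels)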
总分分层

1.3.2.4 Generate variables

df['总分分层']=总分分层
df['总分分层']

1.3.2.5 Statistics

df.groupby(by=['总分分层'])['总分'].agg(人数='size')  # 人数 = number of students in each band

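
The whole procedure of section 1.3.2 can be sketched end to end on a small table. The totals, the column names "total" and "band", and the English labels are assumptions. Unlike the code above, this sketch passes right=False to pd.cut so that a total of exactly 450 lands in the "450 and above" band.

```python
import pandas as pd

# Hypothetical totals; values and English labels are assumptions.
df_demo = pd.DataFrame({"total": [380, 400, 420, 450, 470]})

# Step 1: interval edges, padded by 1 so the minimum and maximum are included.
bins = [df_demo["total"].min() - 1, 400, 450, df_demo["total"].max() + 1]
labels = ["below 400", "400 to 450", "450 and above"]

# Step 2: assign each total to a band.
# right=False makes the bins left-closed: [379, 400), [400, 450), [450, 471).
df_demo["band"] = pd.cut(df_demo["total"], bins, labels=labels, right=False)

# Step 3: count the rows in each band.
counts = df_demo.groupby("band", observed=False)["total"].agg(count="size")
print(counts)
```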

1.4 Cross analysis

1.4.1 Meaning

Cross analysis is usually used to analyze the relationship between two or more grouping variables.

1.4.2 Format and parameters

Format:
pivot_table(values, index, columns, aggfunc, fill_value)
Parameters:
values: the column or columns whose values are aggregated in the pivot table
index: the grouping key or keys shown as the rows of the pivot table
columns: the grouping key or keys shown as the columns of the pivot table
aggfunc: the statistical function
fill_value: the value that replaces all NA entries
The return value is a pivot table.
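
As a minimal sketch of these parameters, the following pivots a small hypothetical table. The data and the column names "band", "gender", and "total" are assumptions; fill_value=0 replaces the combination that has no rows.

```python
import pandas as pd

# Hypothetical data; there is deliberately no row with band "high" and gender "M".
df_demo = pd.DataFrame({
    "band": ["low", "low", "high", "high", "high"],
    "gender": ["F", "M", "F", "F", "F"],
    "total": [380, 390, 460, 470, 480],
})

table = df_demo.pivot_table(
    values="total",    # column to aggregate
    index="band",      # rows of the pivot table
    columns="gender",  # columns of the pivot table
    aggfunc="mean",    # statistical function
    fill_value=0,      # replacement for missing combinations
)
print(table)
```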

1.4.3 Examples

from pandas import pivot_table
df.pivot_table(index=['班级','姓名'])

df.pivot_table(values=['总分'],index=['总分分层'],columns=['性别'],aggfunc=[np.size,np.mean])

1.5 Structural analysis

1.5.1 Meaning

Structural analysis examines the share that each part contributes to the whole. The result is often drawn as a pie chart.

1.5.2 Examples

Group by class and gender, then compute the total score of each group.

df['总分']=df.英语+df.体育+df.军训+df.数分+df.高代+df.解几
df_pt=df.pivot_table(values=['总分'],
                    index=['班级'],columns=['性别'],aggfunc=[np.sum])
df_pt

df_pt.div(df_pt.sum(axis=1),axis=0)

In class 2308242, girls account for 0.332193 of the class's total score and boys for 0.667807.

1.5.3 Summary

Use div to compute the proportions.
The sum function computes totals: axis=1 sums across the columns of each row, and axis=0 sums down each column.
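
The interplay of sum(axis=1) and div(..., axis=0) can be checked on a small hypothetical table. The data and the column names are assumptions chosen so that the shares are easy to verify by hand.

```python
import pandas as pd

# Hypothetical data; column names are assumptions.
df_demo = pd.DataFrame({
    "class_id": ["A", "A", "A", "B", "B"],
    "gender": ["F", "M", "M", "F", "M"],
    "total": [100, 150, 150, 300, 100],
})

# Total score per class (rows) and gender (columns).
totals = df_demo.pivot_table(
    values="total", index="class_id", columns="gender", aggfunc="sum"
)

# Row totals: axis=1 adds across the gender columns of each class.
row_totals = totals.sum(axis=1)

# Divide every row by its own total: axis=0 aligns row_totals with the row index.
shares = totals.div(row_totals, axis=0)
print(shares)
```

Each row of shares sums to 1, so the values can be read directly as proportions within a class.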

1.6 Correlation analysis

1.6.1 Meaning

Insert picture description here

1.6.2 The relationship between correlation coefficient and correlation degree

Insert picture description here

1.6.3 Examples

df.loc[:,['英语','体育','军训','解几','数分','高代']].corr()

高代 (advanced algebra) and 数分 (mathematical analysis) are moderately correlated, as are 解几 (analytic geometry) and 数分.
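
The following sketch shows the same loc and corr pattern on hypothetical scores. The data and the column names are assumptions. corr computes the Pearson correlation coefficient by default, which is cross-checked here against NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; column names are assumptions.
df_demo = pd.DataFrame({
    "algebra": [60, 70, 80, 90],
    "analysis": [62, 68, 83, 87],
    "pe": [90, 70, 80, 60],
})

# Select the columns of interest and compute the pairwise correlation matrix.
corr = df_demo.loc[:, ["algebra", "analysis", "pe"]].corr()

# Cross-check one entry with NumPy's Pearson correlation.
manual = np.corrcoef(df_demo["algebra"], df_demo["analysis"])[0, 1]

print(corr)
print(manual)
```

The matrix is symmetric with ones on the diagonal; a value near 1 or -1 indicates a strong linear relationship, and a value near 0 a weak one.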

1.6.4 Summary

Select the columns with loc, then use the corr function for correlation analysis.

Origin blog.csdn.net/qq_45059457/article/details/108599311