Data analysis using Python - analysis of variance

Hello everyone. Analysis of variance (ANOVA) is a statistical method for testing whether the means of two or more groups of samples differ significantly. It can be used to decide whether several groups of observations or processed results really differ from one another.

Depending on the number of factors that affect the test conditions, ANOVA is divided into one-way (single-factor), two-way (two-factor), and multi-factor analysis of variance. One-way ANOVA considers a single factor, two-way ANOVA analyzes the effect of two factors on the test indicator, and multi-factor ANOVA extends this to more than two factors. This article uses the monthly salary of several cities, recorded month by month, as an example of testing whether multiple groups of data differ.

1. One-way analysis of variance

One-way analysis of variance only considers whether a single factor has a significant impact on the experimental indicators:

import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
data= pd.read_excel('D:/shujufenxi/jpt.xlsx',index_col=0)
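# Start with the city factor.
# melt() turns the column labels (cities) into a data column, giving the long format that ols() expects.
# Column names: 城市 = city, 月薪 = monthly salary.
df_city = data.melt(var_name='城市', value_name='月薪')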
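
To make the reshaping step concrete, here is a minimal sketch on a made-up table. The city names and salary figures are invented for illustration and are not the contents of jpt.xlsx; the sketch assumes the spreadsheet has one row per month and one column per city, which is what the melt() call above implies.

```python
import pandas as pd

# Hypothetical wide table: one row per month, one column per city.
wide = pd.DataFrame(
    {"Beijing": [10.0, 11.0, 12.0], "Shanghai": [9.0, 10.5, 11.5]},
    index=pd.Index([1, 2, 3], name="month"),
)

# melt() moves the column labels into a 'city' column and the cell
# values into a 'salary' column; the month index is dropped.
long = wide.melt(var_name="city", value_name="salary")

print(long)
```

Each row of the result is one observation labelled with its group, which is the layout a formula such as 'salary ~ C(city)' needs.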

After reshaping the data with melt(), we can draw a box plot of salary by city and see with the naked eye whether the groups differ noticeably:

import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['KaiTi', 'SimHei', 'FangSong']  # Chinese fonts: try KaiTi first, then fall back to SimHei, then FangSong
plt.rcParams['font.size'] = 12  # font size
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
import pandas as pd
import seaborn as sns
data= pd.read_excel('D:/shujufenxi/jpt.xlsx',index_col=0)
data_melt = data.melt()
data_melt.columns = ['城市', '月薪']
sns.boxplot(x = '城市', y = '月薪', data = data_melt)
plt.show()

Perform analysis of variance:

import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
data= pd.read_excel('D:/shujufenxi/jpt.xlsx',index_col=0)
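# Start with the city factor.
# melt() turns the column labels (cities) into a data column, giving the long format that ols() expects.
df_city = data.melt(var_name='城市', value_name='月薪')  # 城市 = city, 月薪 = monthly salary
model_city = ols('月薪~C(城市)', df_city).fit()  # ols() builds and fits a linear regression model
anova_table = anova_lm(model_city)  # anova_lm() produces the ANOVA table from the fitted model
print(anova_table)
# Post hoc pairwise comparison (Tukey HSD)
print(pairwise_tukeyhsd(df_city['月薪'], df_city['城市']))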

The upper part of the output is the ANOVA table: df is the degrees of freedom, sum_sq is the sum of squares, mean_sq is the mean square (sum_sq divided by df), F is the F statistic, and PR(>F) is the p-value. The lower part is the multiple-comparison table used for post hoc analysis. group1 and group2 are two levels of the factor (here, two cities), and each row tests whether those two groups differ significantly. The reject column at the end states whether the null hypothesis is rejected: True means the null hypothesis is rejected, that is, the means of the two groups differ significantly.
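
As a rough illustration of how the columns of the ANOVA table relate to each other, the sketch below computes a one-way F statistic by hand and cross-checks it against scipy.stats.f_oneway. The three salary samples are invented for illustration, and SciPy is an assumption here since the article itself only uses statsmodels.

```python
import numpy as np
from scipy import stats

# Hypothetical salary samples for three cities (invented numbers).
groups = [
    np.array([10.0, 11.0, 12.0, 11.5]),
    np.array([9.0, 10.5, 11.5, 10.0]),
    np.array([14.0, 15.0, 13.5, 14.5]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)          # number of groups
n = all_values.size      # total number of observations

# Between-group sum of squares (the factor row of the table).
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares (the Residual row of the table).
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom, mean squares, F statistic, and p-value.
df_between, df_within = k - 1, n - k
f_value = (ss_between / df_between) / (ss_within / df_within)
p_value = stats.f.sf(f_value, df_between, df_within)

# Cross-check against SciPy's one-way ANOVA.
f_ref, p_ref = stats.f_oneway(*groups)
print(f_value, p_value)
print(f_ref, p_ref)
```

The two printed lines should agree, which shows that F is the ratio of the factor's mean square to the residual mean square and that PR(>F) is the upper tail of the F distribution at that value.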

2. Two-factor analysis of variance

Two-factor ANOVA needs the data in a different layout from single-factor ANOVA: each row must carry the levels of both factors (here month and city) alongside the observed value. The code is as follows:

import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
data= pd.read_excel('D:/shujufenxi/jpt.xlsx',index_col=0)
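# stack() keeps the row index (month) and moves the city labels into a second index level;
# reset_index() then turns both levels into ordinary columns.
df_twoway = data.stack().reset_index()
df_twoway.columns = ['月份', '城市', '月薪']  # 月份 = month, 城市 = city, 月薪 = monthly salary
# Additive two-factor model: month and city as categorical main effects.
model_twoway = ols('月薪~C(月份)+C(城市)', df_twoway).fit()
anova_table = anova_lm(model_twoway)
print(anova_table)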
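
The stack().reset_index() step can be seen in isolation on a small made-up table. The names and numbers below are invented for illustration, and the sketch again assumes a layout with months as the row index and cities as columns.

```python
import pandas as pd

# Hypothetical wide table: months as the row index, cities as columns.
wide = pd.DataFrame(
    {"Beijing": [10.0, 11.0], "Shanghai": [9.0, 10.5]},
    index=pd.Index([1, 2], name="month"),
)

# stack() keeps the month index and adds the city label as a second
# index level; reset_index() turns both levels into ordinary columns.
long = wide.stack().reset_index()
long.columns = ["month", "city", "salary"]

print(long)
```

Unlike melt(), this keeps the month, so both factors are available to the model. With only one observation per month and city combination, as in this layout, an additive model of the two main effects is the natural choice, because an interaction term would leave no residual degrees of freedom.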

Origin blog.csdn.net/csdn1561168266/article/details/129216380