Case data: https://cloud.189.cn/t/aYbUv2JbEzUn
1. Data feature analysis
1.1 Central tendency: mean, median, mode, quantile
Mean | The sum of all values divided by the number of values |
Median | The middle value when the values are sorted in ascending order (the average of the two middle values when the count is even) |
Mode | The most frequently occurring value in the data set |
Quartile | With the values sorted in ascending order, three cut points divide them into four equal parts: the first cut point is the lower quartile, the second is the median, and the third is the upper quartile |
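The four measures above can be checked on a small sample. This is a minimal sketch using pandas; the values are hypothetical and chosen only so the results are easy to verify by hand.

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration
values = pd.Series([2, 4, 4, 5, 7, 9, 11])

mean = values.mean()          # sum of the values divided by their count
median = values.median()      # middle value of the sorted data
mode = values.mode().iloc[0]  # mode() returns a Series because ties are possible
# Lower quartile, median, and upper quartile (pandas interpolates linearly by default)
q1, q2, q3 = values.quantile([0.25, 0.5, 0.75])

print("mean:", mean, "median:", median, "mode:", mode)
print("quartiles:", q1, q2, q3)
```

Note that mode() returns every value tied for the highest frequency, so taking the first element is only safe when a single mode is expected.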
1.2 Dispersion: measures of how spread out the data are
Standard deviation | Measures how far the data deviate from the mean. The larger the value, the more spread out the data; the smaller the value, the more tightly clustered |
Variance | The square of the standard deviation |
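The relationship between the two dispersion measures can be illustrated with a short sketch. The sample values are hypothetical. One detail worth knowing is that pandas divides by n - 1 by default (`ddof=1`, the sample estimate); passing `ddof=0` gives the population version.

```python
import pandas as pd

# Hypothetical sample values: the mean is 5 and the squared deviations sum to 32
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_var = values.var()     # divides by n - 1 (ddof=1 is the pandas default)
sample_std = values.std()     # square root of the sample variance
pop_var = values.var(ddof=0)  # divides by n
pop_std = values.std(ddof=0)

print("sample:", sample_var, sample_std)
print("population:", pop_var, pop_std)
```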
2. Data distribution: skewness and kurtosis
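2.1 Skewness: the degree and direction of asymmetry in the data distribution
Normal: symmetric, skewness is zero
Positive skew: the tail extends to the right, and most values are smaller than the mean
Negative skew: the tail extends to the left, and most values are larger than the mean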
2.2 Kurtosis: reflects how peaked or flat a distribution is. With the normal distribution as the baseline (kurtosis of zero), positive kurtosis indicates a sharper peak and negative kurtosis indicates a flatter distribution
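The sign conventions for skewness and kurtosis can be seen on synthetic data. This is a sketch under stated assumptions: the distributions, the seed, and the sample size are arbitrary choices for illustration and do not come from the case data. pandas `kurt()` reports excess kurtosis, so a normal distribution scores close to zero.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 10_000

right_tailed = pd.Series(rng.exponential(size=n))  # long tail on the right
left_tailed = -right_tailed                        # mirrored: long tail on the left
flat = pd.Series(rng.uniform(size=n))              # flatter than a normal distribution
peaked = pd.Series(rng.laplace(size=n))            # sharper peak and heavier tails than normal

print("skew, right-tailed:", right_tailed.skew())  # expected to be positive
print("skew, left-tailed:", left_tailed.skew())    # expected to be negative
print("kurt, flat:", flat.kurt())                  # expected to be negative
print("kurt, peaked:", peaked.kurt())              # expected to be positive
```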
3. Commonly used methods
Method | Description |
mean() | Arithmetic mean |
sum() | Calculate the sum of the samples |
var() | Calculate sample variance |
std() | Calculate sample standard deviation |
corr() | Calculate the sample correlation coefficient matrix (Pearson by default; Spearman with method="spearman") |
skew() | Calculate sample skewness (third moment) |
kurt() | Calculate sample kurtosis (fourth moment) |
describe() | Basic description of the sample |
median() | Calculate the median |
quantile() | Find the quantile, q=0.25/0.5/0.75 |
mode() | Find the mode |
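Several of the methods in the table can be tried on a tiny frame. The following is a minimal sketch; `df_demo` and its columns are hypothetical and unrelated to the case data. Because `y` is the square of `x`, the relationship is monotonic but not linear, which makes the difference between the Pearson and Spearman coefficients visible.

```python
import pandas as pd

# Hypothetical two-column frame: y is x squared (monotonic but not linear)
df_demo = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 4, 9, 16, 25]})

summary = df_demo.describe()                # count, mean, std, min, quartiles, max
totals = df_demo.sum()                      # column sums
variances = df_demo.var()                   # sample variance per column
pearson = df_demo.corr()                    # Pearson is the default method
spearman = df_demo.corr(method="spearman")  # rank-based correlation

print(summary)
print("Pearson:", pearson.loc["x", "y"])
print("Spearman:", spearman.loc["x", "y"])
```

Spearman works on ranks, so any strictly increasing relationship gives a coefficient of 1, while Pearson measures only the linear part of the relationship.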
4. Single-column numerical analysis
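# Import packages
import numpy as np
import pandas as pd
import seaborn as sns
# Load the data
df = pd.read_csv("./data/HR.csv")
# Select the satisfaction_level column
s = df["satisfaction_level"]
# Show the null entries in the column
s[s.isnull()]
# Show the full rows that contain those null values
df[df["satisfaction_level"].isnull()]
# Drop the null values
s = s.dropna()
# Alternatively, fill the null values (fillna() needs a fill value, e.g. the column mean)
# s = s.fillna(s.mean())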
s.describe()
# ----------------------
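# Select the last_evaluation column
s2 = df["last_evaluation"]
# Summary statistics
s2.describe()
# Positive skew: most values are below the mean; negative skew: most values are above the mean
s2.skew()
# Positive: more peaked than the normal distribution; negative: flatter than the normal distribution
s2.kurt()
# Remove outliers (note: s2 < 1 also drops values exactly equal to 1)
s2 = s2[s2<1]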
s2
# ---------------------
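# Alternatively, remove outliers using the quartiles (IQR rule)
s3 = df['last_evaluation']
q_low = s3.quantile(q=0.25)
q_high = s3.quantile(q=0.75)
q_interval = q_high - q_low
k = 1.5
# Combine both bounds into one mask; chaining two boolean indexers misaligns the second mask
s3 = s3[(s3 < q_high + k * q_interval) & (s3 > q_low - k * q_interval)]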
s3.describe()
# -------------------
# Select the number_project column
s4 = df["number_project"]
# Summary statistics
s4.describe()
# Check skewness and kurtosis
print("skew", s4.skew())
print("kurt", s4.kurt())
# Count each value; normalize=True returns proportions instead of raw counts
s4.value_counts(normalize=True).sort_index()
# Select the average_monthly_hours column
s5 = df["average_monthly_hours"]
s5.describe()
# --------------------------------
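# Remove outliers with the IQR rule, combining both bounds into a single mask
q1, q3 = s5.quantile(0.25), s5.quantile(0.75)
iqr = q3 - q1
s5 = s5[(s5 < q3 + 1.5 * iqr) & (s5 > q1 - 1.5 * iqr)]
s5.describe()
# Count how many values fall into each of 10 equal-width bins
np.histogram(s5.values, bins=10)
# Use custom bin edges with a width of 10
np.histogram(s5, bins=np.arange(s5.min(), s5.max() + 10, 10))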
# ------------------------------
# Comparative analysis: mean last_evaluation per department
df.loc[:,["last_evaluation","department"]].groupby("department").mean()
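The quartile-based outlier removal is applied twice in the code above, so it can be wrapped in a helper. This is a sketch rather than part of the original analysis: the function name `drop_iqr_outliers` and the sample values are hypothetical, and the strict inequalities mirror the filters used above.

```python
import pandas as pd

def drop_iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Keep values strictly inside (Q1 - k*IQR, Q3 + k*IQR). Hypothetical helper."""
    q_low = series.quantile(0.25)
    q_high = series.quantile(0.75)
    iqr = q_high - q_low
    # Build one boolean mask so it stays aligned with the series index
    keep = (series > q_low - k * iqr) & (series < q_high + k * iqr)
    return series[keep]

# Hypothetical values with one obvious outlier
hours = pd.Series([10, 12, 11, 13, 12, 11, 300])
cleaned = drop_iqr_outliers(hours)
print(cleaned.describe())
```

With the case data, the same call could be written as `drop_iqr_outliers(df["average_monthly_hours"])`, assuming the column contains no values that should be kept despite falling outside the fences.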