(4) Simple single-factor analysis

Case data: https://cloud.189.cn/t/aYbUv2JbEzUn

1. Data feature analysis

1.1 Central tendency: mean, median, mode, quantile

Mean: the average of all values.
Median: with the values sorted from smallest to largest, the value in the middle.
Mode: the value that occurs most often in the data set.
Quartiles: with all values sorted from smallest to largest, the first cut point is the lower quartile, the second is the median, and the third is the upper quartile.
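These four measures map directly onto pandas methods. A minimal sketch on a made-up toy series (assumed numbers for illustration, not the case data):

import pandas as pd

s = pd.Series([1, 2, 2, 3, 4, 5, 9])   # hypothetical toy data

s.mean()            # mean, about 3.71
s.median()          # median: 3 (middle value after sorting)
s.mode()            # mode: 2 (the most frequent value)
s.quantile(0.25)    # lower quartile
s.quantile(0.5)     # second quartile, i.e. the median again
s.quantile(0.75)    # upper quartile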

1.2 Dispersion: a measure of how spread out the data are

Standard deviation: measures how far the data deviate from the mean; the larger the value, the more dispersed the data, and the smaller the value, the more concentrated the data.
Variance: the square of the standard deviation.
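A quick sketch of the two dispersion measures, reusing the same hypothetical toy series as above (not the case data):

import pandas as pd

s = pd.Series([1, 2, 2, 3, 4, 5, 9])   # hypothetical toy data

s.std()    # sample standard deviation (pandas uses ddof=1 by default)
s.var()    # sample variance, i.e. the standard deviation squared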

2. Data distribution: skewness and kurtosis

2.1 Skewness: the degree of asymmetry of the data distribution around its mean

Normal (no skew)

Positive skew (right-skewed: the long tail is on the right, so most values lie below the mean)

Negative skew (left-skewed: the long tail is on the left, so most values lie above the mean)

2.2 Kurtosis: reflects how peaked or flat the distribution is, taking the normal distribution (kurtosis of zero) as the reference; positive kurtosis means a relatively sharp (peaked) distribution, and negative kurtosis means a relatively flat one.
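A minimal sketch of both statistics in pandas. The exponential sample below is an assumption, chosen only because it is clearly right-skewed:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.exponential(scale=1.0, size=1000))   # synthetic right-skewed sample

s.skew()   # positive: long tail on the right, most values below the mean
s.kurt()   # excess kurtosis relative to the normal distribution (which gives 0)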

3. Commonly used methods

Method Description
mean() Arithmetic mean
sum() Sum of the sample
var() Sample variance
std() Sample standard deviation
corr() Sample correlation coefficient matrix (Pearson by default; Spearman and Kendall are also available)
skew() Sample skewness (third moment)
kurt() Sample kurtosis (fourth moment)
describe() Basic descriptive statistics of the sample
median() Median
quantile() Quantile, e.g. q=0.25/0.5/0.75
mode() Mode
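These are ordinary pandas Series/DataFrame methods. A minimal usage sketch on a hypothetical two-column frame (toy data, not the case data), mainly to show corr(), which does not appear in the walkthrough below:

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 1, 4, 3, 5]})

df.describe()                 # count, mean, std, min, quartiles, max for each column
df.corr()                     # Pearson correlation matrix (the default method)
df.corr(method="spearman")    # Spearman rank correlation instead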

4. Single-column numerical analysis

# Import packages
import pandas as pd
import numpy as np
import seaborn as sns

# Read the data
df = pd.read_csv("./data/HR.csv")

# Get the satisfaction_level column
s = df["satisfaction_level"]

# Check for null (missing) values
s[s.isnull()]

# View the full rows that contain the null values
df[df['satisfaction_level'].isnull()]


# Drop the missing values
s = s.dropna()
# Or fill them instead, e.g. with the mean
# s = s.fillna(s.mean())

# Summary statistics
s.describe()

# ----------------------
# Get the last_evaluation column
s2 = df["last_evaluation"]

# Summary statistics
s2.describe()

# Positive skew: most values are smaller than the mean; negative skew: most values are larger than the mean
s2.skew()

# Positive: more peaked than the normal distribution; negative: flatter than the normal distribution
s2.kurt()

# Remove abnormal values (keep only values below 1)
s2 = s2[s2 < 1]
s2

# ---------------------

# Alternatively, remove outliers with the interquartile-range (IQR) rule
s3 = df['last_evaluation']

q_low = s3.quantile(q=0.25)
q_high = s3.quantile(q=0.75)
q_interval = q_high - q_low
k = 1.5

# Keep only values within [q_low - k*IQR, q_high + k*IQR]
s3 = s3[(s3 < q_high + k * q_interval) & (s3 > q_low - k * q_interval)]
s3.describe()

# -------------------
# Get the number_project column
s4 = df["number_project"]

# Summary statistics
s4.describe()

# Check skewness and kurtosis
print("skew", s4.skew())
print("kurt", s4.kurt())

# Frequency counts; normalize=True returns the proportion of each value instead of raw counts
s4.value_counts(normalize=True).sort_index()

# Get the average_monthly_hours column
s5 = df["average_monthly_hours"]
s5.describe()

# --------------------------------
# Remove outliers with the same IQR rule
q_low = s5.quantile(0.25)
q_high = s5.quantile(0.75)
iqr = q_high - q_low
s5 = s5[(s5 < q_high + 1.5 * iqr) & (s5 > q_low - 1.5 * iqr)]
s5.describe()

# Count how many values fall into each of 10 equal-width bins
np.histogram(s5.values, bins=10)

# Custom bin edges: width-10 intervals from the minimum to the maximum
np.histogram(s5, bins=np.arange(s5.min(), s5.max() + 10, 10))

# ------------------------------
# Comparative analysis: mean last_evaluation per department
df.loc[:, ["last_evaluation", "department"]].groupby("department").mean()

 


Source: blog.csdn.net/qq_29644709/article/details/114667497