1. Data conversion
1.1 Data value replacement
In pandas, data values are replaced with the replace method. It can replace a single value, or replace several different values in one call; for multi-value replacement, the arguments can be passed as lists or as a dictionary.
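import pandas as pd

# Columns: 姓名 (name), 性别 (gender code), 籍贯 (hometown)
data = {'姓名':['李红','小明','马芳','国志'],'性别':['0','1','0','1'],
        '籍贯':['北京','甘肃','','上海']}
df = pd.DataFrame(data)
# Replace a single value: empty strings become '不详' (unknown)
df = df.replace('','不详')
print(df)
# Pass lists to replace several values at once
df = df.replace(['不详','甘肃'],['兰州','兰州'])
print(df)
# Pass a dictionary to replace several values at once
df = df.replace({'1':'男','0':'女'})
print(df)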
#-------------------------------------------------------------------
姓名 性别 籍贯
0 李红 0 北京
1 小明 1 甘肃
2 马芳 0 不详
3 国志 1 上海
姓名 性别 籍贯
0 李红 0 北京
1 小明 1 兰州
2 马芳 0 兰州
3 国志 1 上海
姓名 性别 籍贯
0 李红 女 北京
1 小明 男 兰州
2 马芳 女 兰州
3 国志 男 上海
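The dictionary form above replaces matching values in every column. When the same code appears in several columns but should only be decoded in one, a nested dictionary can restrict the replacement. The following is a minimal sketch; the second column (是否在读, "currently enrolled") is a hypothetical addition for illustration and is not part of the original data.

```python
import pandas as pd

# Hypothetical data: both columns use the codes '0' and '1',
# but only 性别 (gender) should be decoded.
df = pd.DataFrame({
    '性别': ['0', '1', '0'],      # gender code
    '是否在读': ['1', '1', '0'],  # hypothetical "currently enrolled" flag
})

# Nested dict {column: {old_value: new_value}} limits the
# replacement to the named column.
decoded = df.replace({'性别': {'1': '男', '0': '女'}})
print(decoded)
```

The flag column keeps its original '0' and '1' strings because it is not named in the outer dictionary.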
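1.2 Data transformation using functions or mappings
The map method of a Series applies a function, or looks up a dictionary-like mapping, for each element and returns the transformed values.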
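import pandas as pd

# Columns: 姓名 (name), 性别 (gender code), 籍贯 (hometown), 成绩 (score), 等级 (grade)
data = {'姓名':['李红','小明','马芳','国志'],'性别':['0','1','0','1'],
        '籍贯':['北京','兰州','兰州','上海']}
df = pd.DataFrame(data)
df['成绩'] = [58,86,91,78]
print(df)

def grade(x):
    # Convert a numeric score to a grade: 优 (excellent), 良 (good), 中 (fair), 差 (poor)
    if x>=90:
        return '优'
    elif 70<=x<90:
        return '良'
    elif 60<=x<70:
        return '中'
    else:
        return '差'

# Apply grade to every score in the column
df['等级'] = df['成绩'].map(grade)
print(df)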
#----------------------------------------------------
姓名 性别 籍贯 成绩
0 李红 0 北京 58
1 小明 1 兰州 86
2 马芳 0 兰州 91
3 国志 1 上海 78
姓名 性别 籍贯 成绩 等级
0 李红 0 北京 58 差
1 小明 1 兰州 86 良
2 马芳 0 兰州 91 优
3 国志 1 上海 78 良
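The example above passes a function to map. The same method also accepts a dictionary, which covers the "mappings" half of this section's title. The sketch below is illustrative only: the gender codes follow the earlier example, while the extra code '2' and the variable names are assumptions added to show what happens to values missing from the mapping.

```python
import pandas as pd

# Gender codes as in the earlier example; '2' is a hypothetical
# code that has no entry in the mapping.
gender = pd.Series(['0', '1', '0', '1', '2'])

gender_map = {'0': '女', '1': '男'}

# Values found in the dict are translated; values not found become NaN.
labels = gender.map(gender_map)

# Fill the unmapped entries with '不详' (unknown).
labels_filled = labels.fillna('不详')
print(labels_filled)
```

Unlike replace, which leaves unmatched values unchanged, map with a dictionary turns unmatched values into NaN, so a follow-up fillna is often useful.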
2. Data standardization
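2.1 Dispersion (min-max) standardization
Dispersion standardization, also called min-max scaling, linearly maps the original values onto [0, 1]. The conversion formula is x' = (x - min) / (max - min), where min and max are the smallest and largest values in the data. The function below takes min and max over the whole array rather than per column.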
import numpy as np

def MinMaxScale(data):
    # Scale with the global minimum and maximum of the whole array
    data = (data-data.min())/(data.max()-data.min())
    return data

x = np.array([[ 1., -1., 2.],[ 2., 0., 0.],[ 0., 1., -1.]])
print('Original data:\n',x)
x_scaled = MinMaxScale(x)
print('Matrix after standardization:\n',x_scaled,end = '\n')
#--------------------------------------------------------------
Original data:
 [[ 1. -1.  2.]
 [ 2.  0.  0.]
 [ 0.  1. -1.]]
Matrix after standardization:
 [[0.66666667 0.         1.        ]
 [1.         0.33333333 0.33333333]
 [0.33333333 0.66666667 0.        ]]
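When each column is a separate feature, it is common to scale every column with its own minimum and maximum instead of the global ones. The following is a minimal sketch of that variant; the function name min_max_scale_columns is an assumption, and the sketch assumes no column is constant (a constant column would divide by zero).

```python
import numpy as np

def min_max_scale_columns(data):
    """Min-max scale each column independently (hypothetical helper)."""
    col_min = data.min(axis=0)  # one minimum per column
    col_max = data.max(axis=0)  # one maximum per column
    # Broadcasting applies each column's min and max to that column only.
    return (data - col_min) / (col_max - col_min)

x = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
x_col_scaled = min_max_scale_columns(x)
print(x_col_scaled)
```

With this variant every column spans the full [0, 1] range, which is not guaranteed when a single global minimum and maximum are used.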
2.2 Standard deviation (z-score) standardization
Standard deviation standardization is also known as zero-mean standardization or z-score standardization. After this transformation the data has a mean of 0 and a standard deviation of 1. The conversion formula is x' = (x - mean) / std, where mean and std are the mean and standard deviation of the original data.
import numpy as np

def StandardScale(data):
    # Center and scale with the mean and standard deviation of the whole array
    data = (data-data.mean())/data.std()
    return data

x = np.array([[ 1., -1., 2.],[ 2., 0., 0.],[ 0., 1., -1.]])
print('Original data:\n',x)
x_scaled = StandardScale(x)
print('Matrix after standardization:\n',x_scaled,end='\n')
#------------------------------------------------------------------
Original data:
 [[ 1. -1.  2.]
 [ 2.  0.  0.]
 [ 0.  1. -1.]]
Matrix after standardization:
 [[ 0.52128604 -1.35534369  1.4596009 ]
 [ 1.4596009  -0.41702883 -0.41702883]
 [-0.41702883  0.52128604 -1.35534369]]
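The function above uses the mean and standard deviation of the whole array. A per-column variant, sketched below under the assumption that each column is a separate feature with nonzero spread, makes it easy to check the "mean 0, standard deviation 1" property column by column. The name standard_scale_columns is hypothetical. Note that NumPy's std computes the population standard deviation (ddof=0) by default.

```python
import numpy as np

def standard_scale_columns(data):
    """Z-score each column independently (hypothetical helper)."""
    col_mean = data.mean(axis=0)  # one mean per column
    col_std = data.std(axis=0)    # population std (ddof=0) per column
    return (data - col_mean) / col_std

x = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
x_std = standard_scale_columns(x)

# Each column should now have mean ~0 and standard deviation ~1,
# up to floating-point rounding.
print(x_std.mean(axis=0))
print(x_std.std(axis=0))
```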
3. Data discretization
3.1 Equal width method
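pandas provides the cut function for discretizing continuous data. Passing an integer as bins splits the value range into that many equal-width intervals; passing a list of bin edges, as in the example below, uses those custom intervals instead.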
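import numpy as np
import pandas as pd

np.random.seed(666)
score_list = np.random.randint(25, 100, size = 10)
print('Original data:\n',score_list)
# Custom bin edges: (0, 59], (59, 70], (70, 80], (80, 100]
bins = [0, 59, 70, 80, 100]
score_cut = pd.cut(score_list, bins)
# Count the scores in each interval (pd.value_counts is deprecated in recent pandas)
print(pd.Series(score_cut).value_counts())
#----------------------------------------------------
Original data:
 [27 70 55 87 95 98 55 61 86 76]
(80, 100]    4
(0, 59]      3
(59, 70]     2
(70, 80]     1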
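The example above uses hand-picked bin edges, so the intervals differ in width. For strictly equal-width binning, cut can be given an integer number of bins instead. The sketch below reuses the ten scores printed above as a literal list; the choice of 4 bins is an assumption made for illustration.

```python
import pandas as pd

# The ten scores shown in the output above.
scores = [27, 70, 55, 87, 95, 98, 55, 61, 86, 76]

# An integer bins argument splits [min, max] into equal-width intervals.
equal_width = pd.cut(scores, bins=4)

# Width of each interval; pandas widens the lowest edge slightly so that
# the minimum value is included, so the first width is marginally larger.
widths = equal_width.categories.right - equal_width.categories.left

# Number of scores per interval, listed in interval order.
counts = pd.Series(equal_width).value_counts().sort_index()
print(counts)
```

Equal widths do not imply equal counts: intervals covering a sparse part of the range end up with few values.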
3.2 Equal frequency method
The equal-frequency method places roughly the same number of records in each interval. Compared with the equal-width method, it avoids unevenly populated intervals, but it may split two nearly identical values into different intervals.
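In pandas, equal-frequency binning is available through the qcut function, which places the bin edges at quantiles of the data. The following sketch reuses the ten scores from section 3.1; the choice of four bins (quartiles) is an assumption for illustration.

```python
import pandas as pd

# The ten scores from the equal-width example.
scores = [27, 70, 55, 87, 95, 98, 55, 61, 86, 76]

# q=4 cuts at the quartiles, so each interval holds about a quarter of the data.
equal_freq = pd.qcut(scores, q=4)

# Number of scores per interval, listed in interval order.
counts = pd.Series(equal_freq).value_counts().sort_index()
print(counts)
```

Because ten values cannot be divided evenly by four, the counts differ by at most one here. The interval widths, by contrast, vary with how densely the scores are packed.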
3.3 Cluster analysis method
This method first groups the continuous values with a clustering algorithm (such as k-means), then assigns the same label to all values that fall into the same cluster.
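As one possible illustration, the sketch below discretizes the scores from section 3.1 with a small one-dimensional k-means written in NumPy. The source does not name a specific algorithm, so k-means, the helper name kmeans_1d, the choice of k=3, and the evenly spaced starting centers are all assumptions of this sketch.

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100):
    """Minimal 1-D k-means (Lloyd's algorithm); returns (labels, centers)."""
    values = np.asarray(values, dtype=float)
    # Deterministic initialization (an assumption of this sketch):
    # spread the starting centers evenly over the value range.
    centers = np.linspace(values.min(), values.max(), k)
    for _ in range(n_iter):
        # Assign each value to its nearest center.
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        # Move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            values[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # assignments are stable
        centers = new_centers
    return labels, centers

scores = np.array([27, 70, 55, 87, 95, 98, 55, 61, 86, 76])
labels, centers = kmeans_1d(scores, k=3)

# Values sharing a label belong to the same discrete category.
print(labels)
print(centers)
```

The cluster labels play the role of the interval labels produced by cut and qcut, except that the boundaries are determined by where the data naturally groups rather than by width or frequency.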