Data representation

Data representation: sometimes converting a data set's original features to generate new features (or feature combinations) works better than using the raw features directly. This is what we mean by data representation.

1. Converting categorical features with dummy variables

Dummy variables (also called indicator variables) are a way of converting a categorical variable into one or more binary (0/1) variables.

############################# Converting categorical features with dummy variables #############################
# import pandas
import pandas as pd
# manually create a small data table
fruits = pd.DataFrame({'feature value': [5,6,7,8,9],
                       'categorical feature': ['watermelon','banana','orange','apple','grape']})
# display the fruits data table
display(fruits)

# convert the string values in the data table to dummy variables
fruits_dum = pd.get_dummies(fruits)
# display the converted data table
display(fruits_dum)

# make the program treat the numeric values as strings as well
fruits['feature value'] = fruits['feature value'].astype(str)
# apply get_dummies to the string-typed column
pd.get_dummies(fruits, columns=['feature value'])
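One caveat worth adding (not from the book): pd.get_dummies builds its columns from whatever values it happens to see, so a training set and a test set can end up with different dummy columns. A minimal sketch of the usual fix, scikit-learn's OneHotEncoder with its handle_unknown parameter, assuming the fruits table defined above:

# Sketch: learn the category mapping once, then reuse it on new data
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse=False, categories='auto', handle_unknown='ignore')
enc.fit(fruits[['categorical feature']])
# an unseen value such as 'pear' becomes an all-zero row instead of an error
print(enc.transform(pd.DataFrame({'categorical feature': ['banana','pear']})))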

2. Binning the data

############################# Binning the data #############################
# import numpy
import numpy as np
# import the plotting tools
import matplotlib.pyplot as plt
# create a random-number generator
rnd = np.random.RandomState(38)
x = rnd.uniform(-5,5,size=50)
# generate the target values
y_no_noise = (np.cos(6*x) + x)
X = x.reshape(-1,1)
# add noise to the data
y = (y_no_noise + rnd.normal(size=len(x)))/2
# plot the data
plt.plot(X,y,'o',c='r')
# show the plot
plt.show()

# import the MLP neural-network regressor
from sklearn.neural_network import MLPRegressor
# import the KNN regressor
from sklearn.neighbors import KNeighborsRegressor
# generate an evenly spaced sequence to use as test points
line = np.linspace(-5,5,1000,endpoint=False).reshape(-1,1)
# fit the data with each of the two algorithms
mlpr = MLPRegressor().fit(X,y)
knr = KNeighborsRegressor().fit(X,y)
# plot the predictions
plt.plot(line,mlpr.predict(line),label='MLP')
plt.plot(line,knr.predict(line),label='KNN')
plt.plot(X,y,'o',c='r')
plt.legend(loc='best')
# show the plot
plt.show()

# generate 11 bin edges, i.e. 10 bins
bins = np.linspace(-5,5,11)
# bin the data
target_bin = np.digitize(X,bins=bins)
# print the bin edges
print('Bin edges:\n{}'.format(bins))
# print the feature values of the first ten data points
print('\nFeature values of the first ten data points:\n{}'.format(X[:10]))
# find the bins they fall into
print('\nBins of the first ten data points:\n{}'.format(target_bin[:10]))
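How np.digitize assigns those indices is worth spelling out: with the default right=False, a value x gets index i when bins[i-1] <= x < bins[i]. A quick sketch with hand-picked demo values (my own, not the book's):

# check a few hand-picked values against the 11 bin edges above
demo = np.array([[-4.9],[0.0],[4.9]])
# the values fall into bins 1, 6 and 10 respectively
print(np.digitize(demo,bins=bins))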

# import the one-hot encoder
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder(sparse=False,categories='auto')
onehot.fit(target_bin)
# transform the binned data with one-hot encoding
X_in_bin = onehot.transform(target_bin)
# print the results
print('Shape of the data after binning: {}'.format(X_in_bin.shape))
print('\nFirst ten data points after binning:\n{}'.format(X_in_bin[:10]))
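As a side note (an alternative the book does not use), scikit-learn's KBinsDiscretizer performs the digitize-plus-one-hot step as a single transformer; this sketch assumes scikit-learn 0.20 or later, where the class first appeared:

# equal-width binning plus dense one-hot output in one step
from sklearn.preprocessing import KBinsDiscretizer
kb = KBinsDiscretizer(n_bins=10, encode='onehot-dense', strategy='uniform')
X_kb = kb.fit_transform(X)
# same idea as np.digitize + OneHotEncoder: one 0/1 column per bin
print(X_kb.shape)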

# express the test points with the same one-hot encoding
new_line = onehot.transform(np.digitize(line,bins=bins))
# train the models on the new data representation
new_mlpr = MLPRegressor().fit(X_in_bin,y)
new_knr = KNeighborsRegressor().fit(X_in_bin,y)
# plot the predictions
plt.plot(line,new_mlpr.predict(new_line),label='New MLP')
plt.plot(line,new_knr.predict(new_line),label='New KNN')

plt.plot(X,y,'o',c='r')
# set the legend
plt.legend(loc='best')
# show the plot
plt.show()

Summary:

  Converting categorical features with dummy variables turns string-typed features into numeric ones, so that we can use the algorithms for classification and regression.

  Binning the sample features has one benefit: it can correct a model's overfitting or underfitting. Especially when linear models are used on large-scale, high-dimensional data sets, binning can greatly improve their prediction accuracy (see the sketch below).
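A minimal sketch of that last claim (my own check reusing the variables defined above, not code from the book): fit an ordinary LinearRegression on the raw feature and on the binned one-hot representation, and compare the R^2 scores.

# compare a linear model on the raw feature vs. the binned representation
from sklearn.linear_model import LinearRegression
lr_raw = LinearRegression().fit(X,y)
lr_bin = LinearRegression().fit(X_in_bin,y)
print('R^2 on the raw feature: {:.3f}'.format(lr_raw.score(X,y)))
print('R^2 on the binned feature: {:.3f}'.format(lr_bin.score(X_in_bin,y)))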

 

Quoted from: 《深入浅出python机器学习》
