Data representation: sometimes, transforming the original features of a data set to generate new features works better than using the raw features directly. This is called data representation.
1. Converting categorical features with dummy variables
Dummy variables (also called indicator variables) are a way of converting a categorical variable into one or more binary variables.
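The idea in miniature: each distinct category becomes its own 0/1 column. A minimal sketch with pandas (the column values here are made up for illustration):

```python
import pandas as pd

# a small categorical series
colors = pd.Series(['red', 'green', 'red'])
# one binary column per category
dummies = pd.get_dummies(colors)
print(dummies)
```

Each row has a 1 (or True, depending on the pandas version) in exactly one column, marking which category it belongs to.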
############################# Converting categorical features with dummy variables #############################
# import pandas
import pandas as pd
# Manually create a data table
fruits = pd.DataFrame({'numeric feature': [5, 6, 7, 8, 9],
                       'categorical feature': ['watermelon', 'banana', 'orange', 'apple', 'grape']})
# Display the fruits data table
display(fruits)
# Convert the string values in the data table
fruits_dum = pd.get_dummies(fruits)
# Display the converted data table
display(fruits_dum)
# Make the program treat the numeric feature as strings as well
fruits['numeric feature'] = fruits['numeric feature'].astype(str)
# Convert the strings with get_dummies
pd.get_dummies(fruits, columns=['numeric feature'])
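By default, `get_dummies` leaves numeric columns untouched; casting them to strings first (as the snippet above does) makes them get one-hot encoded too. A quick check on a toy frame (column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'num': [1, 2], 'cat': ['a', 'b']})
# only the string column 'cat' is expanded
dummies_default = pd.get_dummies(df)
# after casting, the numeric column is expanded as well
dummies_str = pd.get_dummies(df.astype({'num': str}))
print(list(dummies_default.columns))  # ['num', 'cat_a', 'cat_b']
print(list(dummies_str.columns))      # ['num_1', 'num_2', 'cat_a', 'cat_b']
```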
2. Binning (discretizing) the data
############################# Binning the data #############################
# import numpy
import numpy as np
# import plotting tools
import matplotlib.pyplot as plt
# Generate a sequence of random numbers
rnd = np.random.RandomState(38)
x = rnd.uniform(-5, 5, size=50)
# Add noise to the data
y_no_noise = (np.cos(6 * x) + x)
X = x.reshape(-1, 1)
y = (y_no_noise + rnd.normal(size=len(x))) / 2
# Plot the data
plt.plot(X, y, 'o', c='r')
# Show the plot
plt.show()
# import the neural network regressor
from sklearn.neural_network import MLPRegressor
# import the KNN regressor
from sklearn.neighbors import KNeighborsRegressor
# Generate an evenly spaced sequence
line = np.linspace(-5, 5, 1000, endpoint=False).reshape(-1, 1)
# Fit the data with each of the two algorithms
mlpr = MLPRegressor().fit(X, y)
knr = KNeighborsRegressor().fit(X, y)
# Plot the results
plt.plot(line, mlpr.predict(line), label='MLP')
plt.plot(line, knr.predict(line), label='KNN')
plt.plot(X, y, 'o', c='r')
plt.legend(loc='best')
# Show the plot
plt.show()
# Generate 11 bin edges (i.e. 10 bins)
bins = np.linspace(-5, 5, 11)
# Bin the data
target_bin = np.digitize(X, bins=bins)
# Print the bin edges
print('Bin edges:\n{}'.format(bins))
# Print the feature values of the first ten data points
print('\nFeature values of the first ten points:\n{}'.format(X[:10]))
# Find the bin each of them falls into
print('\nBins of the first ten points:\n{}'.format(target_bin[:10]))
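To see exactly what `np.digitize` returns: for each value it gives the index of the bin the value falls into, counting from 1 for the first bin; values below the first edge get 0 and values above the last edge get `len(bins)`. A tiny sketch with made-up edges:

```python
import numpy as np

edges = np.array([0.0, 1.0, 2.0])
vals = np.array([-0.5, 0.5, 1.5, 2.5])
# below first edge -> 0; between edges -> 1, 2; above last edge -> 3
bin_idx = np.digitize(vals, bins=edges)
print(bin_idx)  # [0 1 2 3]
```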
# import the one-hot encoder
from sklearn.preprocessing import OneHotEncoder
# (in scikit-learn >= 1.2 the parameter is named sparse_output instead of sparse)
onehot = OneHotEncoder(sparse=False, categories='auto')
onehot.fit(target_bin)
# Transform the data with one-hot encoding
X_in_bin = onehot.transform(target_bin)
# Print the results
print('Shape of the binned data: {}'.format(X_in_bin.shape))
print('\nFirst ten binned data points:\n{}'.format(X_in_bin[:10]))
# Express the data with one-hot encoding
new_line = onehot.transform(np.digitize(line, bins=bins))
# Train the models with the new data
new_mlpr = MLPRegressor().fit(X_in_bin, y)
new_knr = KNeighborsRegressor().fit(X_in_bin, y)
# Plot the results
plt.plot(line, new_mlpr.predict(new_line), label='New MLP')
plt.plot(line, new_knr.predict(new_line), label='New KNN')
plt.plot(X, y, 'o', c='r')
# Add the legend
plt.legend(loc='best')
# Show the plot
plt.show()
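As an aside, the `np.digitize` + `OneHotEncoder` steps above can also be done in one call with scikit-learn's `KBinsDiscretizer`. A sketch (note the bin edges are computed from the data's own min/max with `strategy='uniform'`, rather than fixed at -5 and 5 as above, so results will differ slightly):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rnd = np.random.RandomState(38)
X = rnd.uniform(-5, 5, size=50).reshape(-1, 1)

# 10 equal-width bins over the data range, one-hot encoded as a dense array
kbd = KBinsDiscretizer(n_bins=10, encode='onehot-dense', strategy='uniform')
X_binned = kbd.fit_transform(X)
print(X_binned.shape)  # (50, 10)
```

Each row of `X_binned` contains a single 1 marking the bin the sample fell into.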
Summary:
Converting categorical features with dummy variables turns string-valued features into numeric ones, so that algorithms can be used for classification and regression.
Binning sample features has a notable benefit: it can correct overfitting or underfitting in a model. In particular, when a linear model is applied to a large, high-dimensional data set, binning can substantially improve its prediction accuracy.
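This claim can be probed on the toy data from this section with a plain linear regression; a sketch (the exact scores depend on the random seed, so no particular gap is guaranteed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# regenerate the same kind of wavy data as above
rnd = np.random.RandomState(38)
x = rnd.uniform(-5, 5, size=50)
X = x.reshape(-1, 1)
y = (np.cos(6 * x) + x + rnd.normal(size=50)) / 2

# bin and one-hot encode the feature
bins = np.linspace(-5, 5, 11)
onehot = OneHotEncoder(categories='auto')
X_in_bin = onehot.fit_transform(np.digitize(X, bins=bins))

# compare training R^2 of a linear model on raw vs. binned features
score_raw = LinearRegression().fit(X, y).score(X, y)
score_binned = LinearRegression().fit(X_in_bin, y).score(X_in_bin, y)
print('raw: {:.2f}  binned: {:.2f}'.format(score_raw, score_binned))
```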
Source: 《深入浅出Python机器学习》