In practice we often run into datasets whose collected features are inadequate. To solve this problem, we need to expand the feature set. Two methods are commonly used (a quick sketch of the difference follows the list):
- Interaction features
- Polynomial features
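To see the difference at a glance, here is a minimal sketch on a hypothetical two-feature sample (the arrays and the `interaction_only` flag are illustrative, not part of the example below): interaction expansion adds only cross-products of existing features, while polynomial expansion also adds powers of each feature.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# a hypothetical sample with two features: a=2, b=3
sample = np.array([[2, 3]])
# interaction terms only: a, b, a*b
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(inter.fit_transform(sample))  # [[2. 3. 6.]]
# full polynomial terms: a, b, a^2, a*b, b^2
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(sample))   # [[2. 3. 4. 6. 9.]]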
1. Prepare the dataset
# import numpy
import numpy as np
# import plotting tools
import matplotlib.pyplot as plt
# import the neural network
from sklearn.neural_network import MLPRegressor
# create a random number generator
rnd = np.random.RandomState(38)
x = rnd.uniform(-5, 5, size=50)
# add noise to the data
y_no_noise = (np.cos(6 * x) + x)
X = x.reshape(-1, 1)
y = (y_no_noise + rnd.normal(size=len(x))) / 2
# set the number of bin edges to 11
bins = np.linspace(-5, 5, 11)
# bin the data
target_bin = np.digitize(X, bins=bins)
# import the one-hot encoder
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder(sparse=False, categories='auto')
onehot.fit(target_bin)
# transform the data with the one-hot encoder
X_in_bin = onehot.transform(target_bin)
# generate an evenly spaced sequence for plotting
line = np.linspace(-5, 5, 1000, endpoint=False).reshape(-1, 1)
# express the new points with the same one-hot encoding
new_line = onehot.transform(np.digitize(line, bins=bins))
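Note that `OneHotEncoder`'s `sparse` argument was renamed to `sparse_output` in scikit-learn 1.2, so on recent versions use `OneHotEncoder(sparse_output=False)`. Before expanding the features, it is also worth fitting the model on the raw one-dimensional `X` as a baseline for comparison; this is a minimal sketch assuming the code above has run:

# baseline: train the MLP on the raw single feature
mlpr = MLPRegressor().fit(X, y)
plt.plot(line, mlpr.predict(line), label='MLP on raw X')
plt.plot(X, y, 'o', c='r')
plt.legend(loc='lower right')
plt.show()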
2. Add interaction features to the dataset
Adding interaction features means adding interaction terms built from the raw features, which increases the number of features.
############################# add interaction features to the dataset #############################
# manually create two arrays
array_1 = [1, 2, 3, 4, 5]
array_2 = [6, 7, 8, 9, 0]
# stack the two arrays with hstack
array_3 = np.hstack((array_1, array_2))
# print the result
print('The array obtained by stacking arrays 1 and 2: {}'.format(array_3))
The array obtained by stacking arrays 1 and 2: [1 2 3 4 5 6 7 8 9 0]
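For 1-D inputs like these, `np.hstack` simply concatenates the arrays end to end; for the 2-D arrays used next, it joins them column-wise, so each sample keeps its row and gains extra feature columns.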
# stack the binned data onto the raw data
X_stack = np.hstack([X, X_in_bin])
print(X_stack.shape)
(50, 11)
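The 11 columns are the original feature plus the 10 one-hot bin indicators (the 11 values in `bins` define 10 intervals).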
# stack the plotting data the same way
line_stack = np.hstack([line, new_line])
# retrain the model
mlpr_interact = MLPRegressor().fit(X_stack, y)
# plot the result
plt.plot(line, mlpr_interact.predict(line_stack), label='MLP for interaction')
plt.ylim(-4, 4)
for vline in bins:
    plt.plot([vline, vline], [-5, 5], ':', c='gray')
plt.plot(X, y, 'o', c='r')
plt.legend(loc='lower right')
# show the plot
plt.show()
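`MLPRegressor` initializes its weights randomly, so the fitted curve can vary from run to run; passing `random_state` (e.g. `MLPRegressor(random_state=38)`) makes the result reproducible.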
# stack the one-hot data with its product with the raw feature
X_multi = np.hstack([X_in_bin, X * X_in_bin])
# print the result
print(X_multi.shape)
print(X_multi[0])
(50, 20)
[ 0.         0.         0.         1.         0.         0.
  0.         0.         0.         0.        -0.        -0.
 -0.        -1.1522688 -0.        -0.        -0.        -0.
 -0.        -0.       ]
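The first 10 columns are the one-hot bin indicators; the last 10 are the product of the raw feature with each indicator, so each sample carries its x value only in the product column of the bin it falls into (here -1.1522688 in the fourth one).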
# retrain the model
mlpr_multi = MLPRegressor().fit(X_multi, y)
line_multi = np.hstack([new_line, line * new_line])
# plot the result
plt.plot(line, mlpr_multi.predict(line_multi), label='MLP Regressor')
for vline in bins:
    plt.plot([vline, vline], [-5, 5], ':', c='gray')
plt.plot(X, y, 'o', c='r')
plt.legend(loc='lower right')
# show the plot
plt.show()
3. Add polynomial features to the dataset
############################# add polynomial features to the dataset #############################
# import the polynomial feature tool
from sklearn.preprocessing import PolynomialFeatures
# add polynomial features to the dataset
poly = PolynomialFeatures(degree=20, include_bias=False)
X_poly = poly.fit_transform(X)
# print the result
print(X_poly.shape)
(50, 20)
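With a single input feature, `degree=20` produces exactly the 20 columns x, x^2, ..., x^20; `include_bias=False` drops the constant column of ones that would otherwise be prepended.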
# print the results
print('Features of the first sample in the original dataset:\n{}'.format(X[0]))
print('\nFeatures of the first sample in the processed dataset:\n{}'.format(X_poly[0]))
# print the feature names
print('Feature names generated by PolynomialFeatures:\n{}'.format(poly.get_feature_names()))
Features of the first sample in the original dataset:
[-1.1522688]

Features of the first sample in the processed dataset:
[ -1.1522688    1.3277234   -1.52989425   1.76284942  -2.0312764
   2.34057643  -2.6969732    3.10763809  -3.58083443   4.1260838
  -4.75435765   5.47829801  -6.3124719    7.27366446  -8.38121665
   9.65741449 -11.12793745  12.82237519 -14.77482293  17.02456756]
Feature names generated by PolynomialFeatures:
['x0', 'x0^2', 'x0^3', 'x0^4', 'x0^5', 'x0^6', 'x0^7', 'x0^8', 'x0^9', 'x0^10', 'x0^11', 'x0^12', 'x0^13', 'x0^14', 'x0^15', 'x0^16', 'x0^17', 'x0^18', 'x0^19', 'x0^20']
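In scikit-learn 1.0+ `get_feature_names` is deprecated in favour of `get_feature_names_out`, which returns the same names here; use the latter on recent versions.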
# import linear regression
from sklearn.linear_model import LinearRegression
# train a linear regression model on the processed data
LNR_poly = LinearRegression().fit(X_poly, y)
line_poly = poly.transform(line)
# plot the result
plt.plot(line, LNR_poly.predict(line_poly), label='Linear Regressor')
plt.xlim(np.min(X) - 0.5, np.max(X) + 0.5)
plt.ylim(np.min(y) - 0.5, np.max(y) + 0.5)
plt.plot(X, y, 'o', c='r')
plt.legend(loc='lower right')
# show the plot
plt.show()
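The transform-then-fit steps can also be bundled so the expansion is applied automatically at prediction time. This is a sketch using scikit-learn's `make_pipeline` (the name `poly_model` is illustrative), assuming `X`, `y` and `line` from the code above:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# chain the polynomial expansion and the linear model into one estimator
poly_model = make_pipeline(PolynomialFeatures(degree=20, include_bias=False),
                           LinearRegression())
poly_model.fit(X, y)
# the pipeline expands `line` internally before predicting
predictions = poly_model.predict(line)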
Summary:
Linear models perform well on high-dimensional datasets but are often mediocre on low-dimensional ones. By adding interaction features or polynomial features we can expand the dataset and raise its dimensionality, which improves the accuracy of the linear model and, to some extent, alleviates the underfitting that linear models show on low-dimensional data.
Source cited: *Python Machine Learning in Layman's Terms* (《深入浅出Python机器学习》)