While re-running the code from "Classic Examples of Python Machine Learning" under Python 3.5, I kept running into warnings and errors.
Most of the warnings come from library updates. The book targets Python 2.x and older library versions, and some modules have since been merged or renamed. Remember to update the names when calling the corresponding methods, otherwise the red warnings will hurt your eyes.
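As an illustration of such a rename, here is a minimal sketch. It assumes the book's original code imported sklearn.cross_validation, which was removed in scikit-learn 0.20 in favor of sklearn.model_selection; the toy data and the names X_demo, y_demo and clf are made up for the demonstration.

```python
import numpy as np
from sklearn import model_selection  # replaces the removed sklearn.cross_validation
from sklearn.ensemble import RandomForestClassifier

# Old style (assumed to be what the Python 2.x book code used):
#   from sklearn import cross_validation
#   cross_validation.cross_val_score(...)

# Toy data, purely illustrative: 60 samples, 4 integer features, 2 classes
rng = np.random.RandomState(7)
X_demo = rng.randint(0, 4, size=(60, 4))
y_demo = (X_demo[:, 0] > 1).astype(int)

clf = RandomForestClassifier(n_estimators=10, random_state=7)
scores = model_selection.cross_val_score(clf, X_demo, y_demo, scoring='accuracy', cv=3)
print('Fold accuracies:', scores)
```

Only the module name changes here; the call to cross_val_score itself keeps the same arguments.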
The first exception centers on this line in the single-sample encoding test on page 40 (shown here with the fix already applied):
input_data_encoded[i]=int(label_encoder[i].transform([input_data[i]]))
The exception is:
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape ()
Stepping through the code shows that input_data[i] is a single string, while the transform method expects a list-like (1D) argument, so the argument is wrapped in a list: [input_data[i]]
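The empty shape "()" in the error message can be reproduced with NumPy alone. The following is a small sketch, not the book's code; the encoder and its four labels are assumptions chosen to resemble one column of the car data.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical encoder fitted on four labels; classes are stored in sorted order
encoder = preprocessing.LabelEncoder()
encoder.fit(['vhigh', 'high', 'med', 'low'])

# A bare string is a 0-dimensional input: its shape is (), as in the error message
print(np.asarray('high').shape)    # ()
# Wrapping it in a list gives the 1D input that transform() expects
print(np.asarray(['high']).shape)  # (1,)

encoded = encoder.transform(['high'])  # array holding one encoded label
value = int(encoded[0])
print(value)  # 0, because 'high' sorts first among the fitted labels
```

Indexing the result with [0] before calling int() avoids converting a whole array to a scalar.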
With that fixed, the second error comes from this line:
output_class=classifier.predict(input_data_encoded)
The error is:
ValueError: Expected 2D array, got 1D array instead: array=[ 3. 3. 0. 0. 2. 1.]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
This problem is similar to the first one: predict expects a 2D array with one row per sample, so the 1D array input_data_encoded has to be reshaped into a 2D array.
The value of input_data_encoded before reshaping is: [0 0 1 1 2 0]
The reshape code is:
input_data_encoded=input_data_encoded.reshape(1,6)
After reshaping it is: [[0 0 1 1 2 0]]
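The two reshape variants mentioned in the error message can be compared side by side. This is a minimal NumPy-only sketch using the same six values; the names sample, row and column are just for illustration.

```python
import numpy as np

sample = np.array([0, 0, 1, 1, 2, 0])
print(sample.shape)             # (6,)

# One sample with six features; -1 lets NumPy infer the second dimension
row = sample.reshape(1, -1)
print(row.shape)                # (1, 6)
print(row)                      # [[0 0 1 1 2 0]]

# Six samples with one feature each, which is not what this classifier needs
column = sample.reshape(-1, 1)
print(column.shape)             # (6, 1)
```

Writing reshape(1, -1) instead of reshape(1, 6) gives the same result here and keeps working if the number of features changes.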
Those are the tricky parts of this section. The code after all corrections is as follows:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import model_selection, preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve, validation_curve

# Use a font that can display Chinese characters when drawing
mpl.rcParams['font.sans-serif'] = ['SimHei']

input_file = r'D:\python\AI\2\car.data.txt'

# Load the data: one comma-separated sample per line
x = []
with open(input_file, 'r') as f:
    for line in f:
        line = line.strip()
        if line:  # skip empty lines
            x.append(line.split(','))
x = np.array(x)

# Convert strings to integer values, one LabelEncoder per column
label_encoder = []
x_encoded = np.empty(x.shape)
for i in range(x.shape[1]):
    label_encoder.append(preprocessing.LabelEncoder())
    x_encoded[:, i] = label_encoder[-1].fit_transform(x[:, i])
x = x_encoded[:, :-1].astype(int)
y = x_encoded[:, -1].astype(int)

# Train the classifier
params = {'n_estimators': 200, 'max_depth': 8, 'random_state': 7}
classifier = RandomForestClassifier(**params)
classifier.fit(x, y)

# Cross-validation
accuracy = model_selection.cross_val_score(classifier, x, y, scoring='accuracy', cv=3)
print('Accuracy of the classifier: ' + str(round(100 * accuracy.mean(), 2)) + '%')

# Encoding test on a single data example
input_data = ['high', 'high', '3', '4', 'small', 'high']
input_data_encoded = [-1] * len(input_data)
for i, item in enumerate(input_data):
    # transform() expects a list, so wrap the single string; [0] picks the one result
    input_data_encoded[i] = int(label_encoder[i].transform([input_data[i]])[0])
input_data_encoded = np.array(input_data_encoded)
print('Array before reshaping:', input_data_encoded)

# predict() expects a 2D array with one row per sample, so reshape first
input_data_encoded = input_data_encoded.reshape(1, 6)
print('Array after reshaping:', input_data_encoded)
output_class = classifier.predict(input_data_encoded)
print('Output class:', label_encoder[-1].inverse_transform(output_class)[0])

# Test the effect of the number of estimators on the classifier
classifier = RandomForestClassifier(max_depth=4, random_state=7)
parameter_grid = np.linspace(25, 200, 8).astype(int)
train_scores, validation_scores = validation_curve(
    classifier, x, y, param_name='n_estimators', param_range=parameter_grid, cv=5)
print('\n##### Validation curve #####')
print('\nParam: n_estimators\nTraining scores:\n', train_scores)
print('\nParam: n_estimators\nValidation scores:\n', validation_scores)

# Plot
plt.figure()
plt.plot(parameter_grid, 100 * np.average(train_scores, axis=1), color='black')
plt.title('Training curve')
plt.xlabel('Number of estimators')
plt.ylabel('Accuracy')
plt.show()

# Test the effect of the maximum depth parameter on the classifier
classifier = RandomForestClassifier(n_estimators=20, random_state=7)
parameter_grid = np.linspace(2, 10, 5).astype(int)
train_scores, validation_scores = validation_curve(
    classifier, x, y, param_name='max_depth', param_range=parameter_grid, cv=5)
print('\nParam: max_depth\nTraining scores:\n', train_scores)
print('\nParam: max_depth\nValidation scores:\n', validation_scores)

# Plot
plt.figure()
plt.plot(parameter_grid, 100 * np.average(train_scores, axis=1), color='black')
plt.title('Validation curve')
plt.xlabel('Maximum depth of the tree')
plt.ylabel('Accuracy')
plt.show()

# Generate the learning curve
classifier = RandomForestClassifier(random_state=7)
parameter_grid = np.array([200, 500, 800, 1100])
train_sizes, train_scores, validation_scores = learning_curve(
    classifier, x, y, train_sizes=parameter_grid, cv=5)
print('\n##### Learning curve #####')
print('\nTraining scores:\n', train_scores)
print('\nValidation scores:\n', validation_scores)

# Plot
plt.figure()
plt.plot(parameter_grid, 100 * np.average(train_scores, axis=1), color='black')
plt.title('Learning curve')
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.show()
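The validation-curve step can also be tried without the car data file. The sketch below uses synthetic data from make_classification as a stand-in (an assumption, not the article's dataset) and passes param_name and param_range as keyword arguments, which as far as I know newer scikit-learn releases require.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in data (an assumption; the article uses the car dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=7)

clf = RandomForestClassifier(max_depth=4, random_state=7)
param_range = np.array([5, 10, 20])
train_scores, validation_scores = validation_curve(
    clf, X_demo, y_demo, param_name='n_estimators', param_range=param_range, cv=3)

# Rows correspond to parameter values, columns to cross-validation folds
print(train_scores.shape)
mean_train = train_scores.mean(axis=1)
mean_validation = validation_scores.mean(axis=1)
for n, tr, va in zip(param_range, mean_train, mean_validation):
    print(n, round(100 * tr, 2), round(100 * va, 2))
```

Averaging over axis=1 collapses the folds, which gives one training score and one validation score per parameter value, ready to be plotted against each other.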