Python machine learning basics tutorial - supervised learning

1. Classification and Regression

  Classification: based on a set of given labels, assign new data to one of those labels

  Regression: based on some properties of a thing, determine the interval in which another property lies

    For example: based on a person's level of education, age, etc., determine the range of that person's income

  Difference: classification output is fixed and discrete, a point; regression output is continuous, an interval.
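
To make the difference concrete, here is a minimal sketch (using the forge and wave datasets introduced in section 3.1 below) contrasting a classifier's discrete output with a regressor's continuous output:

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
import mglearn

# classification: the prediction is one of a fixed set of labels
X_c, y_c = mglearn.datasets.make_forge()
clf = KNeighborsClassifier(n_neighbors=3).fit(X_c, y_c)
print(clf.predict(X_c[:2]))   # discrete output, e.g. array([1, 0])

# regression: the prediction is a continuous number
X_r, y_r = mglearn.datasets.make_wave(n_samples=40)
reg = KNeighborsRegressor(n_neighbors=3).fit(X_r, y_r)
print(reg.predict(X_r[:2]))   # continuous output, e.g. array([-0.41, 0.93])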

2. Generalization, overfitting and underfitting

  Generalization: if a model can make accurate predictions on data it has not seen before, we say it can generalize from the training set to the test set.

  Overfitting: if the constructed model is too complex for the available data, i.e., it focuses too much on the details of the training set, it will perform well on the training set but fail to generalize to new data.

  Underfitting: a model that performs poorly even on the training set is underfitting.
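
A simple way to check for both is to compare the score on the training set with the score on the test set; a rough sketch (using the forge dataset and the k-NN classifier introduced below):

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import mglearn

X, y = mglearn.datasets.make_forge()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
# a high training score with a much lower test score points to overfitting;
# low scores on both point to underfitting
print("training set accuracy:", clf.score(X_train, y_train))
print("test set accuracy:", clf.score(X_test, y_test))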

3. Supervised learning algorithms

3.1 Some sample datasets

Datasets in scikit-learn are Bunch objects, which contain the actual data along with some information about the dataset (similar to dictionaries)
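
For example, the entries of a Bunch can be read either dictionary-style or as attributes (a small illustration using the cancer dataset shown later in this section):

from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
# dictionary-style and attribute access give the same result
print(cancer['target_names'])  # ['malignant' 'benign']
print(cancer.target_names)     # same array, accessed as an attribute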

3.1.1 The forge dataset

Import the necessary packages
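
# packages used throughout this tutorial
import numpy as np
import matplotlib.pyplot as plt
import mglearn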

 

The forge dataset has two features

# Generate the dataset
X, y = mglearn.datasets.make_forge()  # forge returns X (the features) and y (the labels)
print("X.shape: {}".format(X.shape))  # X.shape: (26, 2)
# Plot the dataset
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)  # first column of X on the x-axis, second column on the y-axis
plt.legend(["Class 0", "Class 1"], loc=4)  # set the class names in the legend
plt.xlabel("First feature")  # set the x-axis label
plt.ylabel("Second feature")  # set the y-axis label

3.1.2 The wave dataset

The wave dataset has only one input feature and a continuous target variable (the x-axis shows the feature, the y-axis shows the target)

X, y = mglearn.datasets.make_wave(n_samples=40)  # generate 40 data points; X.shape is (40, 1), y.shape is (40,)
plt.plot(X, y, 'o')  # plot the points as circles
plt.ylim(-3, 3)  # set the display range of the y-axis
plt.xlabel("Feature")
plt.ylabel("Target")

3.1.3 The cancer dataset

The cancer dataset records clinical measurements of breast tumors; each tumor is labeled as benign or malignant (i.e., there are only two labels)

from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print(cancer.keys())
>>>dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

print(cancer.data.shape)
print(cancer.target.shape)
print(cancer.target_names)
print(cancer.feature_names)
>>>(569, 30)
>>>(569,)
>>>['malignant' 'benign']
>>>['mean radius' 'mean texture' 'mean perimeter' 'mean area' 'mean smoothness' 'mean compactness' 'mean concavity' 'mean concave points' 'mean symmetry' 'mean fractal dimension' 'radius error' 'texture error' 'perimeter error' 'area error' 'smoothness error' 'compactness error' 'concavity error' 'concave points error' 'symmetry error' 'fractal dimension error' 'worst radius' 'worst texture' 'worst perimeter' 'worst area' 'worst smoothness' 'worst compactness' 'worst concavity' 'worst concave points' 'worst symmetry' 'worst fractal dimension']
{n: v for n, v in zip(cancer.target_names, np.bincount(cancer.target))}
>>>{'malignant': 212, 'benign': 357}

np.bincount: counts the number of occurrences of each value

x = np.array([0, 1, 1, 3, 2, 1, 7])
a = np.bincount(x)
>>>array([1, 3, 1, 1, 0, 0, 0, 1])
# As in the example above, x ranges over 0-7, so bincount returns an array a of length np.max(x)+1
# For the first entry of a (index 0), count how many times 0 appears in x: once in this example, so append 1
# For the second entry (index 1), count how many times 1 appears in x: 3 times in this example
# Continue in this way up to 7, i.e., the maximum of x

The weights parameter of np.bincount

w = np.array([0.3, 0.5, 0.2, 0.7, 1., -0.6])
# The maximum value in x is 4, so there are 5 bins with indices 0 -> 4
x = np.array([2, 1, 3, 4, 4, 3])
# index 0 -> 0
# index 1 -> w[1] = 0.5
# index 2 -> w[0] = 0.3
# index 3 -> w[2] + w[5] = 0.2 - 0.6 = -0.4
# index 4 -> w[3] + w[4] = 0.7 + 1 = 1.7
np.bincount(x, weights=w)
# Output: array([ 0. ,  0.5,  0.3, -0.4,  1.7])

 

3.1.4 The Boston Housing dataset

This dataset contains 506 data points, each with 13 features

from sklearn.datasets import load_boston
boston = load_boston()
print(boston.data.shape)
>>>(506, 13)

 

3.2 k-nearest neighbors

3.2.1 The k-nearest neighbors classifier (k-NN algorithm)

With one neighbor, a new point takes the class of its single nearest neighbor as its own class

mglearn.plots.plot_knn_classification(n_neighbors=1)

With three neighbors, a new point takes the majority class among its three nearest neighbors as its own class

mglearn.plots.plot_knn_classification(n_neighbors=3)

 

3.2.2 Applying k-nearest neighbors

Import the dataset and split it into a training set and a test set

from sklearn.model_selection import train_test_split
X, y = mglearn.datasets.make_forge()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Import the class, instantiate the k-NN algorithm, and set the number of neighbors as a parameter

from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)

Fit the classifier using the training set (for k-NN, fitting just stores the training dataset, so that neighbors can be computed at prediction time)

clf.fit(X_train, y_train)
>>>KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=3, p=2, weights='uniform')

Call the predict method to make predictions

clf.predict(X_test)
>>>array([1, 0, 1, 0, 1, 0, 0])
clf.score(X_test, y_test)
>>>0.8571428571428571
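
The score method returns the mean accuracy on the test data; it can be checked by hand:

import numpy as np
# fraction of test samples predicted correctly
print(np.mean(clf.predict(X_test) == y_test))  # 0.8571428571428571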

3.2.3 Analyzing KNeighborsClassifier

Plot decision boundaries: draw the decision boundaries for 1, 3 and 9 neighbors

fig, axes = plt.subplots(1, 3, figsize=(10,3)) # create 3 subplots in 1 row and 3 columns; fig is the figure, axes are the subplots
for n_neighbors, ax in zip([1, 3, 9], axes):
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
    mglearn.plots.plot_2d_separator(clf, X, fill=True, eps=0.5, ax=ax, alpha=.4)
    mglearn.discrete_scatter(X[:,0], X[:,1], y, ax=ax)
    ax.set_title("{} neighbor(s)".format(n_neighbors))
    ax.set_xlabel("feature 0")
    ax.set_ylabel("feature 1")
axes[0].legend(loc=3)    # loc sets the legend position (e.g., lower left, lower right, or top)

### To be updated
