Machine Learning No.3

Task four -- KNN decision boundaries + cross-validation + feature standardization

I. KNN decision boundaries and the influence of K

This part mainly centers on how to choose the right K.

First, decision boundaries can be divided into linear and non-linear ones. As K increases, the decision boundary becomes smoother and the model more stable.
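To see this concretely, here is a minimal sketch (my own addition, not from the course): it trains sklearn's KNeighborsClassifier on toy two-moons data and colors the plane by the predicted class for K = 1, 5, 25.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D data; the noise makes the two class regions overlap a little.
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# A dense grid over the plane, so each grid point can be colored by its prediction.
xx, yy = np.meshgrid(np.linspace(X[:,0].min()-1, X[:,0].max()+1, 200),
                     np.linspace(X[:,1].min()-1, X[:,1].max()+1, 200))

plt.figure(figsize=(9,3))
for idx, k in enumerate([1, 5, 25]):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.subplot(1, 3, idx+1)
    plt.contourf(xx, yy, zz, alpha=0.3)    # the decision regions
    plt.scatter(X[:,0], X[:,1], c=y, s=10)
    plt.title('K = %d' % k)
plt.show()

With K = 1 the boundary is jagged and chases every noisy point; with K = 25 it is much smoother.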

II. Cross-validation

"Parameter adjustment"

The commonly used cross-validation technique is K-fold cross-validation. We first split the training data into K folds; each fold in turn is held out as a validation set while the model is trained on the remaining folds, and the accuracies on the held-out folds are averaged to assess the model.
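As a minimal sketch of the procedure (my own, using sklearn's built-in iris data rather than the course data), cross_val_score handles the fold splitting and scoring:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several K values; each gets a 5-fold cross-validated accuracy.
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print('K=%d  mean accuracy=%.3f' % (k, scores.mean()))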

III. Feature standardization

The KNN algorithm depends heavily on distances, so feature standardization is very important. Besides the method in the teacher's slides, this semester's class covered several more, so let me go over a few of the others below.

Decimal scaling normalization: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.

Z-score normalization: v' = (v - μ) / σ (the mean μ is subtracted, and the standard deviation σ is the denominator).
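A small numpy sketch of both methods (the sample values are my own toy numbers):

import numpy as np

v = np.array([120.0, -350.0, 45.0, 980.0])

# Decimal scaling: divide by 10^j, with j the smallest integer making max|v'| < 1.
j = int(np.ceil(np.log10(np.abs(v).max())))   # here j = 3
v_decimal = v / 10**j                         # values now lie in (-1, 1)
# (j would need one more bump if max|v| were an exact power of 10)

# Z-score: subtract the mean, then divide by the standard deviation.
v_zscore = (v - v.mean()) / v.std()

print(v_decimal)   # [ 0.12  -0.35   0.045  0.98 ]
print(v_zscore)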

Task five -- learning image recognition

I. Reading and representing images

An image can be read with Python's matplotlib library.

Example image:

import matplotlib.pyplot as plt

img = plt.imread(r'd:\Lena.jpg')  # raw string, so the backslash is not treated as an escape
print(img.shape)                  # (height, width, color channels)
plt.imshow(img)
plt.show()

The output:

(512L, 512L, 3L)

II. Image features

The methods mentioned here construct features from the image, so that KNN is not thrown off when the image is occluded, rotated, has its brightness changed, and the like.

Commonly used image features:

(1) SIFT is a local feature that is invariant to scale; even if the rotation angle, image brightness, or shooting angle changes, detection can still get good results. It is a very stable local feature.

(2) HOG operates on local grid cells of an image, so it maintains good invariance to both photometric and geometric deformations of the image (a small extraction sketch follows below).
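As a rough illustration of the HOG side (my own sketch, using scikit-image; the parameter values are common defaults, not numbers from the course):

import matplotlib.pyplot as plt
from skimage.color import rgb2gray
from skimage.feature import hog

img = plt.imread(r'd:\Lena.jpg')    # the same test image as above
gray = rgb2gray(img)                # HOG is usually computed on grayscale

features = hog(gray,
               orientations=9,          # gradient-direction bins per cell
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))
print(features.shape)                   # one long 1-D descriptor vector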

III. Dimensionality reduction for image features

In this part I mainly got to know PCA, that is, principal component analysis. (No wonder PCA felt so familiar: I think I came across it while hunting for material for the big assignment, but the dimensionality reduction we actually did there used SVD, singular value decomposition, so this part still left me rather lost... I should go refresh my probability-theory maths; I've forgotten some of the formulas.)

Regarding principal component analysis, I found this explanation online:

Plainly put, it maps high-dimensional vectors into a low-dimensional space. Let's try calling it through sklearn:

import numpy as np
import matplotlib.pyplot as plt

x = np.empty((100, 2))
x[:, 0] = np.random.uniform(0.0, 100.0, size=100)
x[:, 1] = 0.75 * x[:, 0] + 3.0 * np.random.normal(0, 3, size=100)
plt.figure()
plt.scatter(x[:, 0], x[:, 1])
plt.show()

from sklearn.decomposition import PCA      # the PCA algorithm from sklearn

pca = PCA(n_components=1)                  # number n of principal components to keep
pca.fit(x)                                 # compute the principal components of the data
print(pca.components_)                     # unit direction vectors of the n components
x_reduction = pca.transform(x)             # project the data down to n dimensions
x_restore = pca.inverse_transform(x_reduction)  # map the reduced data back to 2-D
plt.figure()
plt.scatter(x[:, 0], x[:, 1], color="g")
plt.scatter(x_restore[:, 0], x_restore[:, 1], color="r")
plt.show()
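One small addition of my own: sklearn's PCA also exposes explained_variance_ratio_, which reports how much of the total variance each kept component captures, a quick sanity check on the reduction.

print(pca.explained_variance_ratio_)  # fraction of the variance kept by the single component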

 

Task six -- image recognition with KNN

I. Reading the files, visualization, and sampling

Displaying the images:

classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)  # number of classes
samples_per_class = 5       # randomly pick 5 samples from each class

# TODO the image display code has to be completed here.
# Hint: plt.subplot and plt.imshow are used to display the images.
plt.figure(figsize=(10,10)) # figure size

for i in range(num_classes):
    indexs = np.where(y_train==i)[0]  # indices of class i in the training set
    np.random.shuffle(indexs)         # shuffle in place (see the note on random below)
    for j in range(samples_per_class):
        ax = plt.subplot(7,10,10*j+i+1)
        if j==0:
            ax.set_title(classes[i])

        plt.axis('off')
        plt.imshow(X_train[indexs[j]]/255)

(The 5×5 figure size was a bit hard for me to read, so I changed it to 10×10.)

First get the indices of each class's images within the training set, shuffle those indices, and take the first samples_per_class of them at random. The spacing of ax = plt.subplot(7,10,10*j+i+1) really does look more comfortable than ax = plt.subplot(5,10,10*j+i+1).

-------- divider --------

Next step: hmm, if I hadn't been following the teacher's example, I certainly wouldn't have thought of counting how many images of each class appear, to check whether the sample is balanced or imbalanced.

# count and display the number of occurrences of each class

for i in range(num_classes):
    print("%s:%d"%(classes[i],len(np.where(y_train==i)[0])))

The result:

plane:5000
car:5000
bird:5000
cat:5000
deer:5000
dog:5000
frog:5000
horse:5000
ship:5000
truck:5000

For the random sampling here, be careful: do not use Python's random functions on a numpy array; shuffle numpy arrays with numpy.random!

numpy.random.shuffle(x) shuffles in place: it changes x directly and returns nothing. For a multi-dimensional array, only the first axis is shuffled; for example, for a 3×3 array only the rows are shuffled among themselves, while the contents within each row stay unchanged.

When I ran it, I tried using Python's random directly on the numpy array, and sure enough it went wrong.
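A tiny demo of the behavior described above (my own sketch):

import numpy as np

a = np.arange(9).reshape(3, 3)
np.random.shuffle(a)    # in place, returns None; only the rows are reordered
print(a)                # the contents within each row are unchanged

idx = np.arange(10)
np.random.shuffle(idx)  # the same pattern as the sampling code above
print(idx[:5])          # take the first 5 as a random sample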

II. Recognizing images with the KNN algorithm

I used:

params_k = [1,3,5]  # candidate values of K
params_p = [1,2]    # candidate values of p (the Minkowski distance parameter)

But it still didn't finish running...

Then I went and learned a bit about GridSearchCV. It is a hyperparameter-tuning technique that guards against the overfitting or underfitting caused by badly chosen parameters. The object GridSearchCV returns can both be fit() and report the best parameters.
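A minimal sketch of the idea (my own, run on sklearn's small digits dataset; CIFAR-10 would be far too slow to grid-search with KNN like this):

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
param_grid = {'n_neighbors': [1, 3, 5], 'p': [1, 2]}  # the same candidates as above

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)                # the returned object supports fit() ...
print(search.best_params_)      # ... and reports the best parameters
print('best CV accuracy: %.3f' % search.best_score_)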

III. Extracting image features and recognizing images with KNN

This accuracy didn't finish running either...

Finally

As for using numpy, I'm still not all that fluent; when I use it I still need to consult the docs and other references more.


Origin www.cnblogs.com/Ygrittee/p/12153709.html