The KNN algorithm in detail (dating site prediction): some of my own notes

This post walks through the whole prediction process of the KNN algorithm.

First step

We start from a large raw data set. The text has to be parsed: separate the feature values from the labels, strip the useless symbols, and assemble the features into a data matrix with a matching label vector.

# Parse the text file
from numpy import *
def file2matrix(filename):  # convert the text records into a NumPy matrix
    fr = open(filename)
    arrayOLines = fr.readlines()  # read each line (newline included) as a string; returns a list like ['123\n', 'siz']
    numberOfLines = len(arrayOLines)  # length of that list, i.e. the number of lines in the file
    returnMat = zeros((numberOfLines, 3))  # zeros builds a numberOfLines x 3 NumPy array of zeros, effectively a matrix
    classLabelVector = []
    index = 0
    for line in arrayOLines:  # walk the list from start to end
        line = line.strip()  # by default strip removes leading and trailing whitespace ('\n', '\r', '\t', ' '); with an argument it strips those characters instead
        listFromLine = line.split('\t')  # split the line on '\t'; returns a list
        returnMat[index, :] = listFromLine[0:3]
        # assign columns 1 to 3 of the line to row index of returnMat, since each sample has only 3 features
        classLabelVector.append(int(listFromLine[-1]))  # the line is already laid out: the first 3 items are features, the last is the label
        # append that label to classLabelVector
        index += 1
    return returnMat, classLabelVector
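
For reference, each line of datingTestSet2.txt holds three tab-separated feature values followed by an integer label from 1 to 3. A minimal usage sketch, using the same file path as the rest of this post:

# A minimal usage sketch of file2matrix
datingDataMat, datingLabels = file2matrix('E:/machinelearingdatas/machinelearninginaction-master/Ch02'
                                          '/datingTestSet2.txt')
print(datingDataMat[0])   # the three feature values of the first sample
print(datingLabels[0:5])  # the first five labels, each an int from 1 to 3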

Second step

Normalize every feature value in the data matrix to the range 0 to 1, so that all features carry equal weight: newValue = (oldValue - min) / (max - min).
The weights do not have to be uniform, but here we make them uniform.


# Normalize the feature values
def autoNorm(dataSet):  # input: the data set
    minVals = dataSet.min(0)
    # min(0) returns the minimum of every column; min(1) returns the minimum of every row
    maxVals = dataSet.max(0)
    # max(0) returns the maximum of every column; max(1) returns the maximum of every row
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))  # build a matrix the same size as dataSet
    # shape returns the number of rows first, then the number of columns
    m = dataSet.shape[0]  # number of rows; shape[1] would give the number of columns
    normDataSet = dataSet - tile(minVals, (m, 1))  # tile stacks minVals m times along the rows; the columns already match
    normDataSet = normDataSet / tile(ranges, (m, 1))  # divide (value - min) by (max - min)
    # this maps every feature into the interval from 0 to 1
    return normDataSet, ranges, minVals
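
A quick sanity check of autoNorm on a small two-feature array (the numbers below are made up purely for illustration):

# Sanity check; the demo values are made up for illustration
demo = array([[0.0, 10.0],
              [5.0, 20.0],
              [10.0, 30.0]])
normDemo, demoRanges, demoMins = autoNorm(demo)
print(normDemo)    # [[0.  0. ] [0.5 0.5] [1.  1. ]]
print(demoRanges)  # [10. 20.]
print(demoMins)    # [ 0. 10.]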

Third step

KNN principle: measure the distance between the input and every training sample over the three feature values (here with the Euclidean distance formula), take the k samples with the smallest distances, and return the label that appears most often among them.
The watermelon book calls this "lazy learning": there is no training phase, the model learns at prediction time. The approach is rather crude, but it is simple to use and easy to understand.
PS: the parameters of classify0 are the user's input (the data to predict), the already-processed data matrix and its labels, and k, the number of nearest neighbors whose majority label decides the class.

import operator  # needed for itemgetter below

def classify0(inX, dataSet, labels, k):
    # inX: the input vector to classify
    # dataSet: the learner's training samples
    # labels: the label of each sample; len(labels) equals the number of rows of dataSet
    # k: the number of nearest neighbors to consider
    dataSetSize = dataSet.shape[0]  # number of rows of the dataSet matrix
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet  # "diff" as in difference: subtract element-wise
    # tile pads inX into a matrix the same size as dataSet; if inX is [1,2], tile(inX, 3) gives [1,2,1,2,1,2]
    # here tile repeats inX dataSetSize times along the rows and leaves the columns alone
    sqDiffMat = diffMat ** 2  # "sq": square the differences
    sqDistances = sqDiffMat.sum(axis=1)  # add up the squared differences
    # axis=0 sums down the columns, axis=1 sums across the rows
    # for each training sample: subtract each feature of inX, square, sum, then take the square root
    distances = sqDistances ** 0.5  # square root: the Euclidean distance formula
    # distances holds the distance from inX to each of the n training samples
    sortedDistIndicies = distances.argsort()  # argsort returns the element indices in ascending order of value
    # "Indicies" should really be "indices" (the plural of index); the book's misspelling is kept here
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]  # label of the i-th closest sample
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
        # get returns the count stored for this label, or 0 if the label is not in the dict yet;
        # adding 1 records one more vote for that label
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # sort the labels by vote count, largest first
    return sortedClassCount[0][0]
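
A minimal check of classify0 on a toy data set (the four points and their labels are made up for illustration):

# Toy data set, made up for illustration
group = array([[1.0, 1.1],
               [1.0, 1.0],
               [0.0, 0.0],
               [0.0, 0.1]])
toyLabels = ['A', 'A', 'B', 'B']
print(classify0([0.0, 0.0], group, toyLabels, 3))  # two of the three nearest points are 'B', so this prints 'B'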

Fourth step

The dating-site prediction function is the interface to the user: the user enters their own data,
which consists of three characteristic feature values:
1. frequent flier miles earned per year
2. percentage of time spent playing video games
3. liters of ice cream consumed per year


def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('E:/machinelearingdatas/machinelearninginaction-master/Ch02'
                                              '/datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])  # same feature order as the columns of the data file
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    # the input is normalized with the same ranges and minVals as the training data before classifying
    print("You will probably like this person : ", resultList[classifierResult - 1])
    # the labels are 1, 2, 3, so subtract 1 to index into resultList

In the end you simply call this function; it ties together all of the functions defined above.
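
A minimal way to run it as a script (the __main__ guard is my own addition, not part of the book's code):

# Run as a script; the guard is my own addition
if __name__ == '__main__':
    classifyPerson()  # prompts for the three feature values, then prints the prediction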

A supplementary figure

We can also take the processed data matrix and label vector
and show the distribution of the labeled data set as a scatter plot.

# Analyze the data: draw a scatter plot with Matplotlib
import matplotlib
import matplotlib.pyplot as plt

datingDataMat, datingLabels = file2matrix('E:/machinelearingdatas/machinelearninginaction-master/Ch02'
                                          '/datingTestSet2.txt')
fig = plt.figure()
# figure(num=None, figsize=None, dpi=None, facecolor=None, edgecolor=None, frameon=True)
# num: figure id or name (a number is an id, a string is a name)
# figsize: width and height of the figure in inches
# dpi: resolution of the figure in pixels per inch, default 80 (1 inch is about 2.5 cm; A4 paper is 21 x 30 cm)
# facecolor: background color
# edgecolor: border color
# frameon: whether to draw the frame
ax = fig.add_subplot(111)  # a 1 x 1 grid, first subplot; every add_subplot call creates one subplot
ax.scatter(datingDataMat[:, 0], datingDataMat[:, 1], 15 * array(datingLabels), array(datingLabels))
# the first column is the x coordinate and the second column the y coordinate;
# the third argument is the marker size and the fourth the color (see the documentation)
plt.show()
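
For reference, the same scatter call with the size and color passed by keyword, which reads more clearly on newer Matplotlib versions (a sketch of the call only):

# The same scatter call with keyword arguments
ax.scatter(datingDataMat[:, 0], datingDataMat[:, 1],
           s=15 * array(datingLabels),  # marker size scaled by label
           c=array(datingLabels))       # marker color mapped from the label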

[Figure: the scatter plot produced in PyCharm]
Origin: blog.csdn.net/qq_35050438/article/details/103086794