Machine Learning: Iris Classification with KNN

Copyright notice: this is an original article by the author and may not be reproduced without the author's permission (http://blog.csdn.net/napoay) https://blog.csdn.net/napoay/article/details/87904704

Introduction to KNN

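The nearest-neighbor algorithm, also called the K-nearest-neighbor (kNN, k-NearestNeighbor) classification algorithm, is one of the simplest classification methods in data mining. "K nearest neighbors" means the k closest neighbors: each sample can be represented by the k samples closest to it.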
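As a minimal sketch of what "the k closest neighbors" means, the snippet below picks the k points nearest to a query point by Euclidean distance. The function name `k_nearest` and the toy 2-D points are illustrative assumptions, not part of the original article.

```python
import heapq
import math


def euclidean(p, q):
    # Straight-line distance between two points of equal dimension
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_nearest(points, query, k):
    # Hypothetical helper: return the k points closest to `query`, nearest first
    return heapq.nsmallest(k, points, key=lambda p: euclidean(p, query))


# Toy data, made up for illustration
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
nearest = k_nearest(points, (0.0, 0.0), 2)
print(nearest)
```

`heapq.nsmallest` avoids sorting the whole list when k is small; a full sort, as used in the listings in this article, gives the same neighbors.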
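The core idea of kNN is this: if most of the k samples nearest to a given sample in feature space belong to one class, the sample is assigned to that class as well and is assumed to share that class's characteristics. The classification decision therefore depends only on the classes of the one or few nearest samples.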
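The majority vote described above can be sketched in a few lines. This is only an illustration: `majority_label` is a hypothetical helper, and the neighbor labels are made up. It relies on `collections.Counter`, whose `most_common` orders equal counts by first occurrence in Python 3.7 and later.

```python
from collections import Counter


def majority_label(labels):
    # Hypothetical helper: return the label that occurs most often.
    # With tied counts, the label encountered first wins (Python 3.7+).
    return Counter(labels).most_common(1)[0][0]


# Class labels of the k = 3 nearest neighbors, nearest first (made-up example)
neighbor_labels = ["Iris-versicolor", "Iris-virginica", "Iris-versicolor"]
predicted = majority_label(neighbor_labels)
print(predicted)
```

If the labels are passed nearest first, a tie falls back to the label of the closer neighbor, which is one reasonable tie-breaking choice among several.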
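When deciding a class, kNN involves only a very small number of neighboring samples. Because it relies on a limited set of nearby samples rather than on discriminating between class regions, kNN can be better suited than many other methods to sample sets whose class regions cross or overlap heavily.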

Python Implementation

Python 2

#!/usr/bin/python
# -*- coding: utf8 -*-

import math
import operator
import random
import csv


def distance(l1, l2):
    d = 0

    for x in range(4):
        d += pow((l1[x] - l2[x]), 2)

    return math.sqrt(d)


def getNeighbors(trainingSet, testInstance, k):
    # Pair every training row with its distance to the test instance
    distances = []
    for i in range(len(trainingSet)):
        dis = distance(testInstance, trainingSet[i])
        distances.append((trainingSet[i], dis))
    # Sort by distance, nearest first
    distances.sort(key=operator.itemgetter(1))
    # print "distances:", distances
    neighbors = []
    for i in range(k):
        neighbors.append(distances[i][0])
    return neighbors


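def getResult(neighbors):
    # Count one vote per neighbor; the class label is the last field of each row
    votes = {}
    for i in range(len(neighbors)):
        result = neighbors[i][-1]
        if result in votes:
            votes[result] += 1
        else:
            votes[result] = 1
    # Sort (label, count) pairs by count so the majority label comes first
    sortedVotes = sorted(votes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]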


if __name__ == '__main__':
    trainingset = []
    testSet = []
    dataSet = []
    splitRatio = 0.75
    filename = 'iris.data.txt'
    with open(filename, 'r') as datafile:
        lines = csv.reader(datafile)
        dataSet = list(lines)
        print("dataset len", len(dataSet))
        for x in range(len(dataSet)):
            for y in range(4):
                dataSet[x][y] = float(dataSet[x][y])
            if random.random() < splitRatio:
                trainingset.append(dataSet[x])
            else:
                testSet.append(dataSet[x])
    # print "trainingset ", trainingset
    # print "testset ", testSet

    print "trainingset len", len(trainingset)
    print "testset len", len(testSet)

    results = []
    for i in range(len(testSet)):
        neighbors = getNeighbors(trainingset, testSet[i], 3)
        result = getResult(neighbors)
        results.append(result)
        print "Expected:", testSet[i][-1], "Predicted:", result
    correct = 0
    for i in range(len(results)):
        if results[i] == testSet[i][-1]:
            correct += 1
    print "Accuracy:", correct / float(len(results))

Python 3

#!/usr/bin/env python3
# -*- coding: utf8 -*-

import math
import operator
import random
import csv


def distance(l1, l2):
    d = 0

    for x in range(4):
        d += pow((l1[x] - l2[x]), 2)

    return math.sqrt(d)


def getNeighbors(trainingSet, testInstance, k):
    # Pair every training row with its distance to the test instance
    distances = []
    for i in range(len(trainingSet)):
        dis = distance(testInstance, trainingSet[i])
        distances.append((trainingSet[i], dis))
    # Sort by distance, nearest first
    distances.sort(key=operator.itemgetter(1))
    # print("distances:", distances)
    neighbors = []
    for i in range(k):
        neighbors.append(distances[i][0])
    return neighbors


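def getResult(neighbors):
    # Count one vote per neighbor; the class label is the last field of each row
    votes = {}
    for i in range(len(neighbors)):
        result = neighbors[i][-1]
        if result in votes:
            votes[result] += 1
        else:
            votes[result] = 1
    # Sort (label, count) pairs by count so the majority label comes first
    sortedVotes = sorted(votes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]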


if __name__ == '__main__':
    trainingset = []
    testSet = []
    dataSet = []
    splitRatio = 0.75
    filename = 'iris.data.txt'
    with open(filename, 'r') as datafile:
        lines = csv.reader(datafile)
        dataSet = list(lines)
        print("dataset len", len(dataSet))
        for x in range(len(dataSet)):
            for y in range(4):
                dataSet[x][y] = float(dataSet[x][y])
            if random.random() < splitRatio:
                trainingset.append(dataSet[x])
            else:
                testSet.append(dataSet[x])
    # print("trainingset ", trainingset)
    # print("testset ", testSet)

    print("trainingset len", len(trainingset))
    print("testset len", len(testSet))

    results = []
    for i in range(len(testSet)):
        neighbors = getNeighbors(trainingset, testSet[i], 3)
        result = getResult(neighbors)
        results.append(result)
        print("Expected:", testSet[i][-1], "Predicted:", result)
    correct = 0
    for i in range(len(results)):
        if results[i] == testSet[i][-1]:
            correct += 1
    print("Accuracy:", correct / float(len(results)))

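Results

Sample output from one run. The train/test split is random, so the set sizes and the accuracy vary between runs.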

('dataset len', 150)
trainingset len 109
testset len 41
期望值: Iris-setosa 实际值: Iris-setosa
期望值: Iris-setosa 实际值: Iris-setosa
期望值: Iris-setosa 实际值: Iris-setosa
期望值: Iris-setosa 实际值: Iris-setosa
期望值: Iris-setosa 实际值: Iris-setosa
期望值: Iris-setosa 实际值: Iris-setosa
期望值: Iris-setosa 实际值: Iris-setosa
期望值: Iris-setosa 实际值: Iris-setosa
期望值: Iris-setosa 实际值: Iris-setosa
期望值: Iris-setosa 实际值: Iris-setosa
期望值: Iris-setosa 实际值: Iris-setosa
期望值: Iris-setosa 实际值: Iris-setosa
期望值: Iris-setosa 实际值: Iris-setosa
期望值: Iris-setosa 实际值: Iris-setosa
期望值: Iris-versicolor 实际值: Iris-versicolor
期望值: Iris-versicolor 实际值: Iris-versicolor
期望值: Iris-versicolor 实际值: Iris-versicolor
期望值: Iris-versicolor 实际值: Iris-virginica
期望值: Iris-versicolor 实际值: Iris-versicolor
期望值: Iris-versicolor 实际值: Iris-versicolor
期望值: Iris-versicolor 实际值: Iris-versicolor
期望值: Iris-versicolor 实际值: Iris-versicolor
期望值: Iris-versicolor 实际值: Iris-virginica
期望值: Iris-versicolor 实际值: Iris-versicolor
期望值: Iris-versicolor 实际值: Iris-versicolor
期望值: Iris-virginica 实际值: Iris-virginica
期望值: Iris-virginica 实际值: Iris-virginica
期望值: Iris-virginica 实际值: Iris-versicolor
期望值: Iris-virginica 实际值: Iris-virginica
期望值: Iris-virginica 实际值: Iris-virginica
期望值: Iris-virginica 实际值: Iris-virginica
期望值: Iris-virginica 实际值: Iris-virginica
期望值: Iris-virginica 实际值: Iris-virginica
期望值: Iris-virginica 实际值: Iris-virginica
期望值: Iris-virginica 实际值: Iris-virginica
期望值: Iris-virginica 实际值: Iris-virginica
期望值: Iris-virginica 实际值: Iris-virginica
期望值: Iris-virginica 实际值: Iris-virginica
期望值: Iris-virginica 实际值: Iris-virginica
期望值: Iris-virginica 实际值: Iris-virginica
期望值: Iris-virginica 实际值: Iris-virginica
Accuracy: 0.926829268293
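To see where the accuracy figure comes from, the sketch below tallies the 41 test rows of the listing above into (expected, predicted) pairs and recomputes the ratio. The per-class counts are read off that listing; the variable names are illustrative assumptions.

```python
from collections import Counter

# (expected, predicted) pairs summarized from the sample run above
pairs = (
    [("Iris-setosa", "Iris-setosa")] * 14
    + [("Iris-versicolor", "Iris-versicolor")] * 9
    + [("Iris-versicolor", "Iris-virginica")] * 2
    + [("Iris-virginica", "Iris-virginica")] * 15
    + [("Iris-virginica", "Iris-versicolor")] * 1
)

# Counting each pair gives a small confusion table
confusion = Counter(pairs)

# A prediction is correct when the expected and predicted labels match
correct = sum(n for (expected, predicted), n in confusion.items()
              if expected == predicted)
accuracy = correct / len(pairs)
print(correct, len(pairs), accuracy)
```

In this run all three errors are confusions between Iris-versicolor and Iris-virginica, while every Iris-setosa row is classified correctly.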
