Machine Learning in Action - Study Notes (2)

Chapter 2 - k-Nearest Neighbors


Algorithm Introduction (k-Nearest Neighbor)

k-nearest neighbor is one of the simplest and most effective algorithms for classifying data; it classifies by measuring the distances between the feature values of different samples.

Algorithm principle: we have a sample data set, also called the training sample set, and every sample in it carries a label, i.e. we know which category each sample belongs to. When new data without a label is input, each feature of the new data is compared against the corresponding feature of every sample in the set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors).

Finally, among the k most similar samples, the category that appears most often is chosen as the category of the new data.

Pseudocode of the k-nearest neighbor algorithm

For each point in the data set with an unknown class attribute, perform the following operations (a minimal code sketch follows the distance formula below):

(1) compute the distance between the current point and every point in the data set with known classes;

(2) sort the distances in ascending order;

(3) select the k points with the smallest distances to the current point;

(4) count how often each category occurs among those k points;

(5) return the most frequent category among the k points as the predicted class of the current point.

Distance calculation: the L2 (Euclidean) distance

With N feature values:

d(I_{1},I_{2})=\sqrt{\sum_{p=1}^{N}(I_{1}^{p}-I_{2}^{p})^{2}}
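
The five steps above map directly to a few lines of NumPy. Here is a minimal sketch; the function name classify0 and its argument names are illustrative, not taken from the original post:

import numpy as np

def classify0(in_x, dataset, labels, k):
    # (1) L2 distance from the query point to every known point
    dist = np.sqrt(np.sum((dataset - in_x) ** 2, axis=1))
    # (2)(3) indices of the k smallest distances
    nearest = np.argsort(dist)[:k]
    # (4)(5) return the most frequent label among the k neighbors
    votes = {}
    for i in nearest:
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    return max(votes, key=votes.get)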

General workflow of the k-nearest neighbor algorithm

(1) Collect data: any method may be used.

(2) Prepare data: the numeric values needed for the distance calculation, preferably in a structured data format.

(3) Analyze data: any method may be used.

(4) Train the algorithm: this step does not apply to k-nearest neighbor.

(5) Test the algorithm: compute the error rate.

(6) Use the algorithm: first input sample data and structured output values, then run the k-nearest neighbor algorithm to determine which category each input belongs to, and finally perform any follow-up processing on the computed classification.

Drawbacks of the algorithm

1. k-nearest neighbor must store the entire data set; if the training data set is large, a great deal of storage space is required. Furthermore, because a distance must be computed to every sample in the data set, it can be very time-consuming in practice.

2. k-nearest neighbor cannot provide any information about the underlying structure of the data, so we cannot know what features an average sample or a typical sample of each class has.

Algorithm example

import matplotlib.pyplot as plt
import numpy as np

Loading the raw data

# load the data from a txt file
def load_data(filename):
    dataset = []
    label = []
    file = open(filename)
    for line in file.readlines():  # read line by line
        lineArr = line.strip().split('\t')  # split the string on tabs
        dataset.append(lineArr[0:3])  # the first three columns are features
        label.append(lineArr[-1])  # the last column is the label
    return np.array(dataset, dtype=np.float64),\
        np.array(label, dtype=int)  # return feature and label arrays


data, label = load_data("datingTestSet2.txt")
print(data.shape, label.shape)  # print the array shapes

Plot the raw data (to build an intuitive picture for the k-nearest neighbor algorithm)

def plot(x, y):
    label1 = np.where(y.ravel() == 1)
    plt.scatter(x[label1, 0], x[label1, 1], marker='x',
                color='r', label='didnt like=1')
    label2 = np.where(y.ravel() == 2)
    plt.scatter(x[label2, 0], x[label2, 1], marker='*',
                color='b', label='smallDoses=2')
    label3 = np.where(y.ravel() == 3)
    plt.scatter(x[label3, 0], x[label3, 1], marker='.',
                color='y', label='largeDoses=3')
    plt.xlabel('pilot distance')
    plt.ylabel('game time')
    plt.legend(loc='upper left')
    plt.title("Raw data")
    plt.show()


plot(data, label)

Data preprocessing

Why normalize: with the L2 distance formula, an attribute whose values span a much larger range dominates the result of the calculation, yet the weight of each feature should be set according to the actual problem; no attribute should dominate the outcome merely because of its scale.

# The feature values have different ranges, so the data must be normalized
# Formula: newvalue = (oldvalue - min) / (max - min)
# This maps the data into [0, 1] rather than zero-centering it


def normalFeature(x):
    x_min = np.min(x, axis=0)
    x_max = np.max(x, axis=0)
    x_new = (x - x_min) / (x_max - x_min)
    return x_new, x_min, x_max


x_new, x_min, x_max = normalFeature(data)
print(x_new.shape)
print(x_min)

Implementing k-nearest neighbor by hand

class KNearestNeighbor(object):
    # store the training data set
    def train(self, X, y):
        self.X_train = X
        self.y_train = y

    # predict the classes of the test data
    def predict(self, X_test, y_test, k=1, display=True):
        dist = self.distance(X_test)
        num_test = X_test.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            closest_y = []
            closest_y = self.y_train[np.argsort(dist[i])[:k]]
            y_pred[i] = np.argmax(np.bincount(closest_y))
            # print one test result every 10 samples
            if (i % 10 == 0) and display:
                print("prediction is %d,the real is %d" %
                      (y_pred[i], y_test[i]))
        return y_pred

    # compute the L2 distance between each test sample and every training sample
    def distance(self, X_test):
        num_test = X_test.shape[0]
        num_train = self.X_train.shape[0]

        dist = np.zeros((num_test, num_train))
        for i in range(num_test):
            dist[i] = np.sqrt(
                np.sum(np.square(self.X_train - X_test[i]), axis=1))
        return dist
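
The distance method above loops over the test rows in Python. For larger data sets the same matrix can be computed without an explicit loop by expanding (a - b)^2 = a^2 - 2ab + b^2; this vectorized variant is an optional extra, not part of the original post:

def distance_vectorized(X_train, X_test):
    # squared row norms of the training and test samples
    train_sq = np.sum(X_train ** 2, axis=1)   # shape (num_train,)
    test_sq = np.sum(X_test ** 2, axis=1)     # shape (num_test,)
    cross = X_test @ X_train.T                # shape (num_test, num_train)
    d2 = test_sq[:, None] - 2 * cross + train_sq[None, :]
    return np.sqrt(np.maximum(d2, 0))  # clamp tiny negatives from rounding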

Split into training and test sets, then train and test

# split into training and test sets
rate = 0.1  # fraction of the data held out for testing
m = data.shape[0]
m_test = int(m * rate)
x_train = x_new[m_test:m, :]
y_train = label[m_test:m]
x_test = x_new[0:m_test, :]
y_test = label[0:m_test]
print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
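
Taking the first 10% of rows as the test set is only sound if the rows of the file are in random order. If that is uncertain, shuffle before splitting; a minimal sketch that would replace the four assignment lines above, not part of the original post:

idx = np.random.permutation(m)  # random permutation of the row indices
x_test, y_test = x_new[idx[:m_test]], label[idx[:m_test]]
x_train, y_train = x_new[idx[m_test:]], label[idx[m_test:]]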


# train and test it
classify = KNearestNeighbor()
classify.train(x_train, y_train)
y_test_pred = classify.predict(x_test, y_test, k=3)
# print the test accuracy
acc = np.mean(y_test == y_test_pred)
print("the test accuracy is ", acc)

Using the algorithm

result = ["didnt like", "small dose", "large dose"]
input = np.array([[10000, 10, 0.5]])
# remember to normalize the input with the training set's min and max
input_new = (input-x_min) / (x_max - x_min)
pred = classify.predict(input_new, y_test, k=3, display=False)
print(pred)
print("you will probablly like this person:", result[int(pred[0])-1])

Code and data

Link: https://pan.baidu.com/s/16G2uSqzng_uPVM96Mxp08g
Extraction code: gv19



Source: blog.csdn.net/qq_43699254/article/details/104641096