Machine Learning - Supervised Learning - Classification - KNN

Code

import numpy as np
import pandas as pd
# Load the built-in iris dataset from sklearn
from sklearn.datasets import load_iris
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
# Compute the classification accuracy of the predictions
from sklearn.metrics import accuracy_score

# TODO 0. Data loading and preprocessing
iris = load_iris()
"""
    Two-dimensional, size-mutable, potentially heterogeneous tabular data.
    Data structure also contains labeled axes (rows and columns).
    Arithmetic operations align on both row and column labels. Can be
    thought of as a dict-like container for Series objects. The primary
    pandas data structure.
"""
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['class'] = iris.target
df['class'] = df['class'].map({0: iris.target_names[0], 1: iris.target_names[1], 2: iris.target_names[2]})
# Show the first n rows of the DataFrame
# print(df.head(3))
# Summary statistics of the DataFrame: describe()
# print(df.describe())

# Extract the features x and the labels y
x = iris.data
y = iris.target.reshape(-1, 1)
print(x.shape, y.shape)
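# Expected output: (150, 4) (150, 1) -- 150 samples with 4 features each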

# Split into training and test sets
"""
Parameters
    *arrays : sequence of indexables with same length / shape[0]
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.
    test_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to the
        complement of the train size. If ``train_size`` is also None, it will
        be set to 0.25.
    train_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the
        proportion of the dataset to include in the train split. If
        int, represents the absolute number of train samples. If None,
        the value is automatically set to the complement of the test size.
    random_state : int or RandomState instance, default=None
        Controls the shuffling applied to the data before applying the split.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.
    shuffle : bool, default=True
        Whether or not to shuffle the data before splitting. If shuffle=False
        then stratify must be None.
    stratify : array-like, default=None
        If not None, data is split in a stratified fashion, using this as
        the class labels.
"""
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=35, stratify=y)
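# With 150 samples and test_size=0.3 this yields 105 training and 45 test samples;
# stratify=y keeps the three classes equally represented in both splits.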


# TODO 1. Core algorithm implementation
# Distance functions  l1: Manhattan distance  l2: Euclidean distance
def l1_distance(a, b):
    return np.sum(np.abs(a - b), axis=1)


def l2_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2, axis=1))
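
# Both functions are vectorized: a is the (n_samples, n_features) training matrix,
# b is a single test point broadcast against it, and the result is a vector of
# n_samples distances.
# Manhattan:  d(a_i, b) = sum_j |a_ij - b_j|
# Euclidean:  d(a_i, b) = sqrt(sum_j (a_ij - b_j)^2)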


# TODO 2. Classifier implementation
class kNN(object):
    # __init__ is the class constructor
    def __init__(self, n_neighbors=1, dist_func=l1_distance):
        self.n_neighbors = n_neighbors
        self.dist_func = dist_func

    # Model training method: kNN simply stores the training set
    def fit(self, x, y):
        self.x_train = x
        self.y_train = y

    # Model prediction method
    def predict(self, x):
        # Initialize the array of predicted classes
        y_pred = np.zeros((x.shape[0], 1), dtype=self.y_train.dtype)
        # Iterate over the input points, taking each point's index i and value x_test
        for i, x_test in enumerate(x):
            # Compute the distances between x_test and all training points
            distances = self.dist_func(self.x_train, x_test)
            # Sort the distances from nearest to farthest and keep the index order
            nn_index = np.argsort(distances)
            # Take the k nearest points and collect their class labels
            nn_y = self.y_train[nn_index[:self.n_neighbors]].ravel()
            # The most frequent label among the neighbors becomes y_pred[i]
            y_pred[i] = np.argmax(np.bincount(nn_y))
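            # Note: np.bincount tallies the votes per integer label and np.argmax
            # returns the most frequent one (ties go to the smaller label index)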
        return y_pred


# TODO 3. Testing
# Create a kNN instance
knn = kNN(n_neighbors=3)
# Fit the model
knn.fit(x_train, y_train)
# Predict on the test data
y_pred = knn.predict(x_test)
print(y_test.ravel())
print(y_pred.ravel())
# Compute the prediction accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Prediction accuracy: ", accuracy)

# Create a kNN instance
knn = kNN()
# Fit the model
knn.fit(x_train, y_train)
# Collect results in a list
result_list = []
# Run predictions for different parameter choices
for p in [1, 2]:
    knn.dist_func = l1_distance if p == 1 else l2_distance
    # Try different values of k, with a step of 2
    for k in range(1, 10, 2):
        knn.n_neighbors = k
        # Predict on the test data
        y_pred = knn.predict(x_test)
        print(y_test.ravel())
        print(y_pred.ravel())
        # Compute the prediction accuracy
        accuracy = accuracy_score(y_test, y_pred)
        print("Prediction accuracy: ", accuracy)
        result_list.append([k, 'l1_distance' if p == 1 else 'l2_distance', accuracy])
df = pd.DataFrame(result_list, columns=['k', 'distance function', 'accuracy'])
print(df)
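
For comparison, below is a minimal sketch (not part of the original post) that runs the same parameter grid with scikit-learn's built-in KNeighborsClassifier; the Minkowski parameter p=1 selects Manhattan distance and p=2 Euclidean distance, so the accuracies should be comparable, though neighbor tie-breaking may differ.

from sklearn.neighbors import KNeighborsClassifier

for p in [1, 2]:
    for k in range(1, 10, 2):
        clf = KNeighborsClassifier(n_neighbors=k, p=p)
        # sklearn expects a 1-D label array, hence ravel()
        clf.fit(x_train, y_train.ravel())
        print(p, k, accuracy_score(y_test, clf.predict(x_test)))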

Data display (output of df.describe())

       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count         150.000000        150.000000         150.000000        150.000000
mean            5.843333          3.057333           3.758000          1.199333
std             0.828066          0.435866           1.765298          0.762238
min             4.300000          2.000000           1.000000          0.100000
25%             5.100000          2.800000           1.600000          0.300000
50%             5.800000          3.000000           4.350000          1.300000
75%             6.400000          3.300000           5.100000          1.800000
max             7.900000          4.400000           6.900000          2.500000

Results display

   k  distance function  accuracy
0  1        l1_distance  0.933333
1  3        l1_distance  0.933333
2  5        l1_distance  0.977778
3  7        l1_distance  0.955556
4  9        l1_distance  0.955556
5  1        l2_distance  0.933333
6  3        l2_distance  0.933333
7  5        l2_distance  0.977778
8  7        l2_distance  0.977778
9  9        l2_distance  0.977778


Reprinted from blog.csdn.net/weixin_43233971/article/details/108029720