KNN - Fruit Classification

1. Dataset processing

1. Download the dataset

Collect pictures of apples, bananas, and carambolas to serve as the dataset for this experiment.

2. Unify the dataset format

Use code to unify the image size and file format.
Unify the image names as well: each image is renamed to its fruit type + underscore + number, which makes labeling the images convenient.

import os
import re
import sys

# Unify the image format
fileList = os.listdir(r"C:\Users\cx\Desktop\work\machine_learning\knn\fruit\carambola")
# Print the file names in this folder before renaming
print("Before: " + str(fileList))
# Save the process's current working directory
currentpath = os.getcwd()
# Switch the working directory to the folder being modified
os.chdir(r"C:\Users\cx\Desktop\work\machine_learning\knn\fruit\carambola")
# Counter used in the new file names
num = 1
# Iterate over every file in the folder
for fileName in fileList:
    # Regular expression matching image file names
    pat = r".+\.(jpg|jpeg|JPG)"
    # Only rename files whose names match the pattern
    if re.findall(pat, fileName):
        os.rename(fileName, "carambola_" + str(num) + ".jpg")
        # Bump the counter and continue with the next file
        num = num + 1
print("***************************************")
# Switch back to the working directory the program started in
os.chdir(currentpath)
# Flush the output
sys.stdout.flush()
# Print the file names in the folder after renaming
print("After: " + str(os.listdir(r"C:\Users\cx\Desktop\work\machine_learning\knn\fruit\carambola")))

Unify the image size

from PIL import Image
import os
import glob

# Resize an image file
# filename: path of the image file
# outdir: directory where the resized image is saved
def convertImgSize(filename, outdir, width=256, height=256):
    try:
        img = Image.open(filename)
        new = img.resize((width, height), Image.BILINEAR)
        print(os.path.basename(filename))
        new.save(os.path.join(outdir, os.path.basename(filename)))
    except Exception as e:
        print(e)

if __name__ == '__main__':
    # Find the image files under the given path and resize them
    for filename in glob.glob('C:/Users/cx/Desktop/work/machine_learning/knn/fruit/carambola/*.jpg'):
        convertImgSize(filename, 'C:/Users/cx/Desktop/work/machine_learning/knn/fruit/carambola')
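
The script above only covers the carambola folder; the apple and banana folders need the same treatment. A minimal sketch that applies convertImgSize to all three classes, assuming the same directory layout as above:

import glob
import os

# Base directory of the fruit folders (assumed layout, adjust as needed)
base = 'C:/Users/cx/Desktop/work/machine_learning/knn/fruit'

for fruit in ('apple', 'banana', 'carambola'):
    folder = base + '/' + fruit
    # Resize every jpg in the folder in place
    for filename in glob.glob(os.path.join(folder, '*.jpg')):
        convertImgSize(filename, folder)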

The final processed pictures: (image omitted)

3. Load the dataset

Add labels to the dataset: pictures of apples, bananas, and carambolas get the labels apple, banana, and carambola respectively, taken from the file-name prefix.
Normalize each picture and flatten it into a one-dimensional array.

import os
import cv2
import numpy as np

# Load the dataset
def load_data():
    data = []
    labels = []
    for img in os.listdir(r"./fruit"):
        # Label the picture with the prefix of its file name
        label = img.split("_")
        labels.append(label[0])

        # Normalize the picture to [0, 1]
        img = "./fruit/" + img
        img = cv2.imread(img, 1)
        img = (img - np.min(img)) / (np.max(img) - np.min(img))
        data.append(img.flatten())
    data = np.array(data)
    labels = np.array(labels)

    return data, labels

2. Separation of training set and validation set

Here I directly use scikit-learn's built-in train_test_split to divide the data loaded above into a 30% validation set and a 70% training set.

from sklearn.model_selection import train_test_split

data, labels = load_data()
# Randomly hold out 30% of the samples as the validation set; the remaining 70% form the training set
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size=0.30, random_state=20, shuffle=True)
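
The three fruit classes do not have equal sample counts (see the result analysis below), so as an optional refinement, a stratified split keeps each class's proportion identical in the training and validation sets. This is not used in this post; a minimal sketch:

# Optional stratified split (not used above): stratify=labels keeps
# the proportion of each fruit the same in both subsets
train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, test_size=0.30, random_state=20, shuffle=True, stratify=labels)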

3. Define the KNN model

Defining a KNN model follows a standard routine and can be implemented step by step. The steps are:

  • Calculate Euclidean distance
  • Sort by calculated distance
  • Get the top k sample labels
  • Return the most frequently occurring label

1. Calculate the Euclidean distance

After flattening each picture into a one-dimensional vector, the Euclidean distance between a test picture and every training picture can be computed: take the pixel-wise differences, square them, sum the squares, and finally take the square root.
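
In symbols, for two flattened image vectors $a$ and $b$ of length $n$:

$$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$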

    dis = (np.tile(test_img, (data.shape[0], 1)) - data) ** 2
    dis_sq = dis.sum(axis=1)
    dis_res = dis_sq ** 0.5
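
NumPy broadcasting computes the same distances without the np.tile call; an equivalent one-line sketch:

import numpy as np

# Broadcasting subtracts test_img from every row of data at once;
# the norm along axis 1 yields one Euclidean distance per training image
dis_res = np.linalg.norm(data - test_img, axis=1)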

2. Sort all distances

Use the argsort() function to sort all distances and return the corresponding indices:

    dis_sort = dis_res.argsort()
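
A quick example of what argsort() returns: the indices that would sort the array, smallest value first.

import numpy as np

d = np.array([3.2, 1.1, 2.5])
print(d.argsort())  # [1 2 0]: index 1 holds the smallest distance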

3. Obtain the labels of the first k samples

Build a vote counter: walk through the k indices with the smallest distances, look up the corresponding label for each, and tally every label in a dictionary.

    classcount = {}
    for i in range(k):
        # Take the k nearest samples and fetch their labels
        vote_label = labels[dis_sort[i]]
        classcount[vote_label] = classcount.get(vote_label, 0) + 1

4. Return the label with the most occurrences

Sort all labels in descending order of count; the first one is the label with the most occurrences.

    # Sort the collected labels by count in descending order
    sorted_classcount = sorted(classcount.items(), key=operator.itemgetter(1), reverse=True)
    # Return the label with the most occurrences
    return sorted_classcount[0][0]
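
As an aside, the standard library's collections.Counter can replace the manual dictionary and sort; a minimal equivalent sketch:

from collections import Counter

# most_common(1) returns [(label, count)] for the top-voted label
vote = Counter(labels[dis_sort[:k]])
top_label = vote.most_common(1)[0][0]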

5. KNN algorithm code

import operator
import numpy as np

# KNN implementation
def knn(test_img, data, labels, k):
    # Compute the Euclidean distances
    dis = (np.tile(test_img, (data.shape[0], 1)) - data) ** 2
    dis_sq = dis.sum(axis=1)
    dis_res = dis_sq ** 0.5

    # Sort by distance and return the indices
    dis_sort = dis_res.argsort()

    # Build the vote counter
    classcount = {}
    for i in range(k):
        # Take the k nearest samples and fetch their labels
        vote_label = labels[dis_sort[i]]
        classcount[vote_label] = classcount.get(vote_label, 0) + 1
    # Sort the collected labels by count in descending order
    sorted_classcount = sorted(classcount.items(), key=operator.itemgetter(1), reverse=True)
    # Return the label with the most occurrences
    return sorted_classcount[0][0]
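
For example, classifying one validation image with the split from section 2 (k = 5 here is an arbitrary choice):

# Predict the label of the first validation image and compare with the truth
pred = knn(test_data[0], train_data, train_labels, 5)
print(pred, test_labels[0])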

4. Test model

In the KNN model defined above, feeding in one test sample returns the most frequent label among its K nearest neighbors. Run the model on every sample in the validation set, compare the returned label with the true label, and compute the accuracy. K is traversed from 1 up to the number of samples, and the accuracy for each value of K is printed.

import matplotlib.pyplot as plt

# Compute the fraction of test samples whose predicted label matches
def test_all(train_data, train_labels, test_data, test_labels, k):
    right = 0
    for i in range(len(test_data)):
        if knn(test_data[i], train_data, train_labels, k) == test_labels[i]:
            right += 1
    return right / len(test_data)

# Training
def train():
    right = []
    data, labels = load_data()

    # Randomly hold out 20% of the samples as the validation set; the remaining 80% form the training set
    for k in range(0, len(labels) - 1):
        train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size=0.20,
                                                 random_state=20, shuffle=True)
        right.append(test_all(train_data, train_labels, test_data, test_labels, k + 1))
        print('K = {}, accuracy = {}'.format(k + 1, right[k]))

    # Plot accuracy against K (one point per K value)
    plt.plot(range(1, len(right) + 1), right)
    plt.show()

train()

1. K value and accuracy curve

(figure: the accuracy curve over values of K)

2. Result analysis

Analysis of several representative values of K:

  1. When K = 5, the accuracy is only 0.59. A likely reason is that the backgrounds of many pictures are too plain: there are large blank areas, the different fruits are similar in shape and size, and the gray values of those blank areas are nearly identical, so the Euclidean distance between pictures of different fruits comes out very small. With a small K, a few such misleading training pictures dominate the vote, which leads to poor results. For example, in the two pictures below, the fruit fills roughly the same portion of the frame and the large blank areas share the same grayscale, so the two different fruits end up at a relatively small distance:
    (image: two pictures of different fruits at a small Euclidean distance)

  2. When K = 40, the accuracy is 0.72. The sample counts of my three fruits are not equal: about 50 apples, 40 carambolas, and 60 bananas. When K is too large, labels with more samples are more likely to be counted among the K nearest neighbors, so the majority classes get favored (one common remedy is distance-weighted voting; see the sketch after this list). In addition, some pictures have blurred or very different backgrounds, so two pictures of the same fruit can end up far apart, which lowers the matching probability. For example, the distance between the following pictures of the same fruit is relatively large:
    (image: two pictures of the same fruit at a large Euclidean distance)

  3. When K = 30, the accuracy is 0.81. The three classes differ clearly enough overall, and this K avoids both failure modes above: the small-K case where different fruits sit at a small Euclidean distance and the large-K case where the same fruit sits at a large one. The final test result is therefore relatively good. The overall difference between different fruits is obvious:
    (image: the three fruit classes are clearly distinct)
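
A hedged sketch of the distance-weighted voting mentioned in point 2, where closer neighbors carry larger votes; this is not what the post itself uses, just one standard way to soften the large-K bias toward the bigger classes:

import numpy as np

def knn_weighted(test_img, data, labels, k, eps=1e-8):
    # Same Euclidean distances as before, via broadcasting
    dis_res = np.linalg.norm(data - test_img, axis=1)
    dis_sort = dis_res.argsort()
    votes = {}
    for i in range(k):
        label = labels[dis_sort[i]]
        # Each neighbor votes with weight 1 / distance instead of 1
        votes[label] = votes.get(label, 0.0) + 1.0 / (dis_res[dis_sort[i]] + eps)
    # Return the label with the largest total vote weight
    return max(votes, key=votes.get)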

Origin: blog.csdn.net/chenxingxingxing/article/details/127696580