Machine Learning 11 - Learning Vector Quantization (LVQ) for Prototype Clustering


Foreword

Zhou Zhihua's "Machine Learning" introduces Learning Vector Quantization (LVQ). Like k-means, LVQ is a prototype-based clustering algorithm; unlike k-means, it uses the samples' true class labels to assist clustering. Based on those labels, LVQ first randomly selects one sample from each class as the prototype of that cluster, forming a group of prototype vectors. It then repeatedly draws a random sample from the sample set, computes its distance to every vector in the prototype group, takes the cluster of the nearest prototype as the sample's division, and compares that prototype's label with the sample's true label to decide how to update the prototype.

The general flow of the LVQ algorithm is:

  1. Determine the classes present in the sample set: suppose there are q classes in total, with labels {t1, t2, ..., tq}. Randomly select one sample from each class as that cluster's prototype, giving the prototype vectors {p1, p2, ..., pq}. Initialize a learning rate a with a value in the range (0, 1).
  2. Randomly select a sample (x, y) from the sample set, compute the (Euclidean) distance between the sample and each of the q prototype vectors, and find the prototype vector p closest to the sample. Compare the sample's label y with that prototype's label t: if they are consistent, move the prototype towards the sample, p' = p + a*(x - p); otherwise move it away, p' = p - a*(x - p). A one-step numeric example follows this list.
  3. Repeat step 2 until a stopping condition is met (e.g. the maximum number of iterations is reached).
  4. Return the q prototype vectors.
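As a quick illustration of the update rule (a toy example, not from the original post): with learning rate a = 0.1, prototype p = (0.5, 0.5), and a same-class sample x = (0.7, 0.3), the prototype moves a fraction a of the way towards the sample:

a = 0.1
p = (0.5, 0.5)
x = (0.7, 0.3)
# same label: p' = p + a*(x - p), a small step towards x
p_new = tuple(pi + a * (xi - pi) for pi, xi in zip(p, x))
print(p_new)  # approximately (0.52, 0.48)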

1. A simple two-class LVQ implementation

  1. Data generation
    A relatively small dataset is used here: 13 samples in total, each with two features, density and sugar content, plus a label indicating whether the melon is good. The label takes the values Y and N.
import re
import math
import numpy as np
import pylab as pl  # pylab is mostly used for line and curve plots
# import matplotlib.pyplot as plt

data = \
"""1,0.697,0.46,Y,
2,0.774,0.376,Y,
3,0.634,0.264,Y,
4,0.608,0.318,Y,
5,0.556,0.215,Y,
6,0.403,0.237,Y,
7,0.481,0.149,Y,
8,0.437,0.211,Y,
9,0.666,0.091,N,
10,0.639,0.161,N,
11,0.657,0.198,N,
12,0.593,0.042,N,
13,0.719,0.103,N"""
# data
  2. Data preprocessing
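The snippet below builds each sample with a watermelon helper that the original post does not show; a minimal sketch consistent with the attribute names used later (density, sweet, good) might be:

# Hypothetical helper (not shown in the original post): one sample record
class watermelon:
    def __init__(self, fields):
        # fields = (id, density, sugar content, good/bad label)
        self.id = int(fields[0])          # int() tolerates the leading newline in the id token
        self.density = float(fields[1])
        self.sweet = float(fields[2])
        self.good = fields[3].strip()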
# Simple preprocessing: split the raw text into fields
a = re.split(',', data.strip(" "))  # split the data on commas
dataset = []  # the parsed sample set
for i in range(int(len(a)/4)):
    temp = tuple(a[i * 4: i * 4 + 4])  # (id, density, sugar content, label)
    dataset.append(watermelon(temp))
  3. Distance calculation
# Euclidean distance between two points a and b (given as tuples)
def dist(a, b):
    return math.sqrt(math.pow(a[0]-b[0], 2)+math.pow(a[1]-b[1], 2))
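For instance, dist((0, 0), (3, 4)) returns 5.0.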
  4. Algorithm model
# The algorithm model: dataset is the sample set, a the learning rate, max_iter the iteration budget
def LVQ(dataset, a, max_iter):
    # Collect the distinct class labels present in the sample set
    T = list(set(i.good for i in dataset))
    print('class labels:', T)
    # Initialize one prototype vector per class from a random sample of that class
    # (step 1 of the algorithm; picking 2 samples at random, as the original did,
    # can leave a class without a prototype)
    P = [(s.density, s.sweet, s.good)
         for t in T
         for s in np.random.choice([d for d in dataset if d.good == t], 1)]
    print('initial prototype vectors:', P)
    while max_iter > 0:
        # Randomly select a sample X from the dataset
        X = np.random.choice(dataset, 1)[0]
        # Find the prototype vector P[index] closest to X
        m = []
        for i in range(len(P)):
            m.append(dist((X.density, X.sweet), (P[i][0], P[i][1])))
        index = np.argmin(m)
        # Get the prototype's label t and compare it with the sample's label
        t = P[index][2]
        if t == X.good:
            # labels match: move the prototype towards X, i.e. p' = p + a*(x - p)
            P[index] = ((1 - a) * P[index][0] + a * X.density,
                        (1 - a) * P[index][1] + a * X.sweet, t)
        else:
            # labels differ: move the prototype away from X, i.e. p' = p - a*(x - p)
            P[index] = ((1 + a) * P[index][0] - a * X.density,
                        (1 + a) * P[index][1] - a * X.sweet, t)
        max_iter -= 1
    return P
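In practice the learning rate a is often decayed towards zero over the iterations so that the prototypes settle down; the fixed rate used here keeps the example simple.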


  5. Plotting with pylab
# Plot the clusters C and the prototype vectors P
def draw(C, P):
    colValue = ['r', 'y', 'g', 'b', 'c', 'k', 'm']
    for i in range(len(C)):
        coo_X = []    # x coordinates (density)
        coo_Y = []    # y coordinates (sugar content)
        for j in range(len(C[i])):
            coo_X.append(C[i][j].density)
            coo_Y.append(C[i][j].sweet)
        pl.scatter(coo_X, coo_Y, marker='x', color=colValue[i%len(colValue)], label=i)
    # Show the prototype vectors as circles
    P_x = []
    P_y = []
    for i in range(len(P)):
        P_x.append(P[i][0])
        P_y.append(P[i][1])
        pl.scatter(P[i][0], P[i][1], marker='o', color=colValue[i%len(colValue)],
                   label="vector" if i == 0 else None)
    pl.legend(loc='upper right')
    pl.show()


  6. Running LVQ and showing the prototype vectors
def train_show(dataset, P):
    # Group the samples by their true label ('N' -> index 0, 'Y' -> index 1);
    # this works here because there is one prototype per class
    C = [[] for i in P]
    for i in dataset:
        C[i.good == 'Y'].append(i)
    return C
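Note that train_show groups samples by their true label rather than by nearest prototype. A nearest-prototype assignment, which is the division rule LVQ actually uses, could look like this (a small sketch, not in the original post):

def assign_by_prototype(dataset, P):
    # Put each sample into the cluster of its nearest prototype vector
    C = [[] for _ in P]
    for s in dataset:
        d = [dist((s.density, s.sweet), (p[0], p[1])) for p in P]
        C[int(np.argmin(d))].append(s)
    return C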

P = LVQ(dataset, 0.1, 2000)
C = train_show(dataset, P)
draw(C, P)
print('P =', P)


  • In LVQ(dataset, a, max_iter): dataset is the sample set, a is the learning rate (here 0.1), and max_iter is the maximum number of iterations.

2. Learning vector quantization (LVQ) for three-class clustering

  1. Dataset generation
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np

X=datasets.make_blobs(n_samples=1000,centers=3) # 1000 sample points in 3 classes
# X
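Note that make_blobs returns a tuple (points, labels), which is why the code below receives it as sample and indexes sample[0] for the coordinates and sample[1] for the class labels.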


  2. Prototype vector initialization
    P=np.zeros((q,col)) # prototype vectors
    for i in range(q):   # initialize each prototype from a random sample of its class
        index=np.where(sample[1]==Label[i])[0]  # indices of all samples with label Label[i]
        choose=np.random.randint(0,len(index),1)
        P[i,:]=sample[0][index[choose],:]
  3. Training loop
    for i in range(1000):   # training
        choose=np.random.randint(0,row,1) # pick one sample at random
        dis=np.linalg.norm(sample[0][choose,:]-P,axis=1) # distances to all q prototype vectors
        y=dis.tolist().index(min(dis))  # index of the nearest prototype vector
        if Label[y]==sample[1][choose]: # labels match: move the prototype towards the sample
            P[y,:]=P[y,:]+eta*(sample[0][choose,:]-P[y,:])
        else:                           # labels differ: move it away
            P[y,:]=P[y,:]-eta*(sample[0][choose,:]-P[y,:])
  4. Classification assignment
    IDX=[]  # cluster assignment for each sample
    for i in sample[0]:  # each sample takes the label of its nearest prototype
        D=np.linalg.norm(i-P,axis=1)
        y=D.tolist().index(min(D))
        IDX.append(Label[y])
    plot(IDX,sample[0],max(Label)+1,P)
    return P
  5. Plotting
def plot(a,X,k,p):  # plotting helper: a = labels, X = points, k = number of classes, p = prototypes
    for j in range(k):
        index=[i for i,v in enumerate(a) if v==j]
        x=[]
        y=[]
        for idx in index:
            x.append(X[idx][0])
            y.append(X[idx][1])
        plt.scatter(x,y)
    plt.scatter(p[:,0],p[:,1],marker='x')
    plt.show()
  • Full code
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np

X=datasets.make_blobs(n_samples=1000,centers=3) # 1000 sample points in 3 classes

def lvq(sample,q,Label,eta):
    # sample: (points, labels) tuple from make_blobs; q: number of prototypes;
    # Label: class label assigned to each prototype; eta: learning rate
    if q!=len(Label):
        return 0
    row,col=np.shape(sample[0]) # shape of the sample matrix
    P=np.zeros((q,col)) # prototype vectors
    for i in range(q):   # initialize each prototype from a random sample of its class
        index=np.where(sample[1]==Label[i])[0]  # indices of all samples with label Label[i]
        choose=np.random.randint(0,len(index),1)
        P[i,:]=sample[0][index[choose],:]
    for i in range(1000):   # training
        choose=np.random.randint(0,row,1) # pick one sample at random
        dis=np.linalg.norm(sample[0][choose,:]-P,axis=1) # distances to all q prototype vectors
        y=dis.tolist().index(min(dis))  # index of the nearest prototype vector
        if Label[y]==sample[1][choose]: # labels match: move the prototype towards the sample
            P[y,:]=P[y,:]+eta*(sample[0][choose,:]-P[y,:])
        else:                           # labels differ: move it away
            P[y,:]=P[y,:]-eta*(sample[0][choose,:]-P[y,:])
    IDX=[]  # cluster assignment for each sample
    for i in sample[0]:  # each sample takes the label of its nearest prototype
        D=np.linalg.norm(i-P,axis=1)
        y=D.tolist().index(min(D))
        IDX.append(Label[y])
    plot(IDX,sample[0],max(Label)+1,P)
    return P
def plot(a,X,k,p):  # plotting helper: a = labels, X = points, k = number of classes, p = prototypes
    for j in range(k):
        index=[i for i,v in enumerate(a) if v==j]
        x=[]
        y=[]
        for idx in index:
            x.append(X[idx][0])
            y.append(X[idx][1])
        plt.scatter(x,y)
    plt.scatter(p[:,0],p[:,1],marker='x')
    plt.show()

Test call: lvq(X, 5, [0, 1, 0, 1, 2], 0.5) builds five prototype vectors, two assigned to class 0, two to class 1 and one to class 2, with learning rate 0.5.

The output is a scatter plot of the samples colored by their assigned cluster, with the five prototype vectors marked by 'x'.


Summary

One disadvantage of k-Nearest Neighbors is that you need to keep the entire training dataset. Learning Vector Quantization (LVQ for short), by contrast, is an artificial neural network algorithm that lets you choose how many training instances to hang on to and learns exactly what those instances should look like.

The representation of LVQ is a collection of codebook vectors. These are chosen randomly at the start and adapted over multiple iterations of the learning algorithm to best summarize the training dataset; because of the random initialization, different runs can give different results. After learning, the codebook vectors are used to make predictions much like k-Nearest Neighbors: the best matching codebook vector is found by computing the distance between each codebook vector and the new data instance, and the class value (or actual value in the case of regression) of that best matching unit is returned as the prediction. Best results are obtained if you rescale the data to have the same range, for example between 0 and 1.
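As an illustration of that last point, a minimal min-max rescaling of a feature matrix into [0, 1] with NumPy (a generic sketch, not part of the original post) could be:

import numpy as np

def minmax_rescale(X):
    # Scale each column of X into [0, 1]; constant columns are mapped to 0
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span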


Source: blog.csdn.net/ex_6450/article/details/126437602