Data mining learning - SOM network clustering algorithm + Python code implementation

Table of contents

1. Brief introduction to SOM

2. SOM training process

(1) Initialization

(2) Sampling (extracting sample points)

(3) Competition

(4) Cooperation and adaptation (updating weight values)

(5) Repeat

3. Python code implementation

(1) Initialization

(2) Compute the neighborhood radius and the learning rate

(3) Competition

(4) Update weights


1. Brief introduction to SOM

       SOM (Self-Organizing Map), also known as a competitive neural network, maps high-dimensional data onto a low-dimensional space with a simple structure while preserving the relationships between data points. This makes it useful for data visualization, clustering, classification and similar tasks.

       The SOM network differs from most other neural networks; in spirit it is closer to the K-means clustering algorithm.

       Its structure is shown in the figure below.

 As the figure shows, each node in the output layer is connected to the input through D weighted edges (that is, each output node is represented by a D-dimensional weight vector Wij), and the output-layer nodes are related to one another according to their distance on the grid.
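
 As a concrete illustration of the dimensions (the numbers here are only an example, not taken from the post): with D = 3 input features and a 4*4 output grid there are 16 weight vectors of 3 components each, which fit in a 3*16 matrix.

    import numpy as np

    D, n, m = 3, 4, 4               # example: 3 input features, a 4*4 output grid
    W = np.random.rand(D, n * m)    # one D-dimensional weight vector per output node
    print(W.shape)                  # (3, 16)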

2. SOM training process

(1) Initialization

  Choose random values for the initial weight vectors.

(2) Sampling (extracting sample points)

  Draw a sample point from the input data and use it as the training input vector.

(3) Competition

  Each neuron computes the value of a discriminant function for its weight vector, and these values provide the basis for the competition. The neuron with the smallest discriminant value is declared the winner. (The discriminant function can be defined as the squared Euclidean distance between the training input vector and the neuron's weight vector.)

  In plain terms: compute the distance between every neuron's weight vector and the sample point drawn in step 2; the neuron whose weight vector is closest is the winning node (the winner).
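
  A minimal sketch of this competition step (illustrative names only, not the post's implementation, which appears in the `train` method below): with the weight matrix stored as one column per output neuron, the winner is the column closest to the sample in squared Euclidean distance.

    import numpy as np

    def find_winner(x, W):
        """x: D-dimensional sample; W: D*(n*m) weight matrix, one column per neuron."""
        dists = np.sum((W - x[:, None]) ** 2, axis=0)   # squared distance to every column
        return int(np.argmin(dists))                    # index of the winning neuron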

(4) Cooperation and adaptation (updating weight values)

  Neurobiological studies have found lateral interaction within groups of excited neurons: when a neuron is activated, its nearest neighbours tend to be excited more strongly than neurons farther away, and this excitation defines a topological neighborhood that decays with distance.

  The winning neuron from the previous step determines how the weights are updated. Not only is the winning neuron's weight vector updated, its neighbours' weights are updated as well, although by a smaller amount.

  Intuitively, the weight vector closest to the sample moves a certain distance towards the sample point, and the neighbouring nodes also move towards it, by smaller amounts.
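
  As a compact sketch of this update (using generic names; the post's own implementation appears later in `updata_W`): every neuron within the current neighborhood of the winner is pulled towards the sample, with a step size that shrinks with its grid distance from the winner.

    import numpy as np

    def pull_towards(w, x, eta):
        """Move weight vector w a fraction eta of the way towards sample x."""
        return w + eta * (x - w)

    # e.g. the winner moves with a larger step than a neighbor farther out on the grid
    w = np.array([0.2, 0.8]); x = np.array([1.0, 0.0])
    print(pull_towards(w, x, 0.5))   # winner, eta = 0.5       -> [0.6  0.4 ]
    print(pull_towards(w, x, 0.1))   # distant neighbor, 0.1   -> [0.28 0.72]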

(5) Repeat

  Go back to step 2 and repeat until the map stops changing noticeably or the specified number of iterations is reached.

3. Python code implementation

(1) Initialization

    def __init__(self, X, output, iteration, batch_size):
        """
        :param X: input samples, an N*D array (N samples, each D-dimensional)
        :param output: a tuple (n, m); the output layer is an n*m two-dimensional grid
        :param iteration: number of training iterations
        :param batch_size: number of samples used in each iteration
        Initializes a weight matrix of shape D*(n*m), i.e. n*m weight vectors,
        filled with values from numpy's uniform random generator.
        """
        self.X = X
        self.output = output
        self.iteration = iteration
        self.batch_size = batch_size
        self.W = np.random.rand(X.shape[1], output[0] * output[1])
        print(self.W.shape)
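
Assuming the methods shown in this post sit inside a class (called `SOM` here only as a placeholder, since the class header is not part of the excerpt) and that `numpy` is imported as `np`, construction might look like this:

    import numpy as np

    X = np.random.rand(100, 5)                           # 100 toy samples with 5 features
    som = SOM(X, (4, 4), iteration=200, batch_size=32)   # prints (5, 16): the weight matrix shape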

(2) Compute the neighborhood radius and the learning rate

    def GetN(self, t):
        """
        :param t: time t, represented here by the iteration count
        :return: an integer, the neighborhood radius; the larger t is, the smaller the neighborhood
        """
        a = min(self.output)
        return int(a - float(a) * t / self.iteration)

    def Geteta(self, t, n):
        """
        :param t: time t, represented here by the iteration count
        :param n: topological distance from the winning neuron
        :return: the learning rate, which decays with both t and n
        """
        return np.power(np.e, -n) / (t + 2)
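
A quick check of how the neighborhood radius and learning rate decay over time (using the hypothetical `som` object from the earlier snippet, a 4*4 grid trained for 200 iterations):

    for t in (0, 100, 199):
        N = som.GetN(t)
        print(t, N, round(som.Geteta(t, 0), 4), round(som.Geteta(t, N), 4))
    # the radius shrinks from 4 towards 0, and the learning rate falls
    # both as the iteration t grows and as the topological distance n grows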

(3) Competition

    def train(self):
        """
        train_Y: dot products of the training samples with the weights, shape batch_size*(n*m)
        winner: a one-dimensional list holding the indices of the batch_size winning neurons
        :return: the adjusted weight matrix W
        """
        count = 0
        while self.iteration > count:
            train_X = self.X[np.random.choice(self.X.shape[0], self.batch_size)]
            normal_W(self.W)
            normal_X(train_X)
            train_Y = train_X.dot(self.W)
            # the neuron with the largest dot product is taken as the winner
            winner = np.argmax(train_Y, axis=1).tolist()
            self.updata_W(train_X, count, winner)
            count += 1
        return self.W

    def train_result(self):
        normal_X(self.X)
        train_Y = self.X.dot(self.W)
        winner = np.argmax(train_Y, axis=1).tolist()
        print(winner)
        return winner
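
`normal_W` and `normal_X` are not defined in the excerpt above. A reasonable reading (an assumption, not necessarily the author's exact code) is that they normalize each sample row and each weight column to unit length in place, so that the dot product in `train` becomes a cosine similarity and taking the argmax is equivalent to picking the neuron with the smallest Euclidean distance:

    def normal_X(X):
        """Normalize every sample (row) of X to unit length, in place."""
        for i in range(X.shape[0]):
            X[i] /= np.linalg.norm(X[i])
        return X

    def normal_W(W):
        """Normalize every weight vector (column) of W to unit length, in place."""
        for j in range(W.shape[1]):
            W[:, j] /= np.linalg.norm(W[:, j])
        return W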

(4) Update weights

    def updata_W(self, X, t, winner):
        N = self.GetN(t)
        for x, i in enumerate(winner):
            # winner is a list of integer indices, so i can be used directly
            to_update = self.getneighbor(i, N)
            for j in range(N + 1):
                e = self.Geteta(t, j)
                for w in to_update[j]:
                    self.W[:, w] = np.add(self.W[:, w], e * (X[x, :] - self.W[:, w]))

    def getneighbor(self, index, N):
        """
        :param index: index of the winning neuron
        :param N: neighborhood radius
        :return ans: a list of sets; ans[k] holds the indices of the neurons that lie
                     at grid (Chebyshev) distance k from the winner and need to be updated
        """
        a, b = self.output
        length = a * b

        def distance(index1, index2):
            # map a flat index onto (row, column) coordinates of the a*b grid
            i1_a, i1_b = index1 // b, index1 % b
            i2_a, i2_b = index2 // b, index2 % b
            return np.abs(i1_a - i2_a), np.abs(i1_b - i2_b)

        ans = [set() for i in range(N + 1)]
        for i in range(length):
            dist_a, dist_b = distance(i, index)
            if dist_a <= N and dist_b <= N:
                ans[max(dist_a, dist_b)].add(i)
        return ans
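
Putting the pieces together (again assuming a class named `SOM` containing the methods above and `import numpy as np` at the top of the file; the data here are random and only meant to show the call sequence):

    import numpy as np

    np.random.seed(0)
    data = np.random.rand(60, 4)                          # 60 toy samples with 4 features
    som = SOM(data, (3, 3), iteration=100, batch_size=16)
    som.train()                                           # adjusts the 4*(3*3) weight matrix
    labels = som.train_result()                           # winning neuron index for each sample
    print(len(labels))                                    # 60, one cluster index per sample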

Origin: blog.csdn.net/weixin_52135595/article/details/127156259