Machine Learning Notes: Restricted Boltzmann Machines

This article is adapted from a post on the Heart of the Machine blog:
https://baijiahao.baidu.com/s?id=1599798281463567369&wfr=spider&for=pc

0x01 Introduction

The restricted Boltzmann machine (RBM) was proposed by Geoff Hinton and colleagues at the University of Toronto. It can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling. Essentially, it is a probabilistic graphical model that can be interpreted as a stochastic neural network.

"Stochastic" means that the neurons in the network are random units with only two output states (inactive and active), and the specific state a neuron takes is determined probabilistically.

0x02 Preliminary Knowledge

  • sigmoid function
  • Bayes' theorem
  • bipartite graph
  • Monte Carlo method: approximate the population mean with the sample mean. But given a distribution $p(x)$, how do we draw samples from it? The Markov chain Monte Carlo (MCMC) method uses a Markov chain to generate samples from a specified distribution.
  • Markov chains
    Covered in any stochastic-processes textbook. An ergodic Markov chain has a unique stationary distribution $\pi$: to sample from a target distribution, we only need to simulate a Markov chain whose stationary distribution is that target, and the distribution of the generated samples gradually converges to it.
  • Canonical distribution
    If a physical system has a certain number of degrees of freedom (for example, the water molecules in a drop of water can be arranged arbitrarily in space), the positions of the molecules in the system are random. If the probability that the system is in state $i$ is $p_i$, then
    $\sum_i p_i = 1$
    Different states have different energies; denote the energy of state $i$ by $E_i$. According to statistical mechanics, when the system is in thermal equilibrium, the probability $p_i$ of being in state $i$ has the form
    $p_i = \frac{1}{Z_T} e^{-\frac{E_i}{T}}$
    where $Z_T$ is determined by the normalization $\sum_i p_i = 1$ and $T$ is the system temperature. This probability distribution is called the canonical distribution.
    Clearly, at a given temperature, states with lower energy have higher probability, and as $T \to \infty$ the distribution becomes uniform (see the sketch after this list).
    In machine learning, we can define a custom energy function and borrow these physical laws to drive training.
  • Metropolis-Hastings Sampling and Gibbs Sampling
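
As a small numeric illustration of the canonical distribution above, here is a minimal sketch (not from the original post; the state energies are made up) that computes $p_i = \frac{1}{Z_T} e^{-E_i/T}$ for a toy system at several temperatures:

```python
import numpy as np

def canonical_distribution(energies, T):
    """Return p_i = exp(-E_i / T) / Z_T for the given state energies."""
    weights = np.exp(-np.asarray(energies, dtype=float) / T)
    return weights / weights.sum()  # dividing by the sum plays the role of Z_T

energies = [0.0, 1.0, 2.0, 3.0]  # hypothetical state energies E_i
for T in (0.5, 1.0, 100.0):
    print(f"T = {T}: {canonical_distribution(energies, T)}")
```

At low temperature the low-energy states dominate, and as $T$ grows the probabilities approach a uniform distribution, matching the statement above.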

0x03 Definition and Structure

RBMs are two-layer neural networks, and these shallow networks are the building blocks of DBNs (Deep Belief Networks). The first layer of an RBM is called the visible layer or input layer, and the second layer is called the hidden layer.

[Figure: an RBM's two layers, the visible (input) layer and the hidden layer]

Each layer consists of several nodes. Adjacent layers are fully connected, but nodes within the same layer are not connected to each other. That is, there is no intra-layer communication; this is the "restriction" in a restricted Boltzmann machine.

Each node is a computational unit that processes input data, and each node makes a stochastic decision about whether to transmit that input. "Stochastic" means "randomly determined"; the coefficients that modify the inputs (and thus decide whether they are transmitted) are randomly initialized.

The input layer (i.e. the visible layer) takes the low-level features of the dataset samples as input. For example, for a dataset of grayscale images, each input node receives one pixel value from an image. Each image in the MNIST dataset has 784 pixels, so a neural network that processes them needs 784 input nodes.

After receiving the input, as shown in the figure below, at node 1 of the hidden layer, x is multiplied by a weight and a bias term is added. The result of these two operations is fed into a nonlinear activation function, which produces the node's output, i.e. the strength of the signal passing through it given the input x. This is the same process as in an ordinary neural network.

[Figure: a single hidden node computing an activation from the weighted input plus a bias]
The following figure shows the case where multiple inputs are combined:

[Figure: combining multiple weighted inputs at the hidden layer]
Since the inputs of all visible (or input) nodes are passed to all hidden nodes, an RBM can be defined as a symmetric bipartite graph, where "symmetric" means that every visible node is connected to every hidden node.
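
To make the forward pass concrete, here is a minimal sketch (not the article's code; the layer sizes, variable names, and random initialization are assumptions) of one forward pass from a visible vector to stochastic binary hidden states:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_visible, n_hidden = 784, 128                        # e.g. MNIST pixels -> 128 hidden nodes
W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))  # randomly initialized weights
b_hidden = np.zeros(n_hidden)                         # hidden-layer bias terms

x = rng.random(n_visible)                             # one input sample, pixel values in [0, 1]
p_hidden = sigmoid(x @ W + b_hidden)                  # activation probabilities given x and W
a = (rng.random(n_hidden) < p_hidden).astype(float)   # stochastic binary hidden states
```

Treating the sigmoid output as a probability and then sampling a binary state reflects the stochastic neurons described in the introduction.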

0x04 Reconstruction

Below, we focus on how RBMs reconstruct the data by themselves in an unsupervised manner, making several forward and backward passes between the visible layer and the first hidden layer without involving a deeper network.

It should be noted that, unlike the neural networks we usually work with, every node of the RBM, including the visible (input) nodes, has a bias term, because both the forward and the backward pass require one.

In the reconstruction phase, the activations of the first hidden layer become the input of the backward pass. They are multiplied by the same weights on each connecting edge, just as x was weighted in the forward pass. At each visible node, the sum of these products is added to the visible-layer bias, and the output of these operations is a reconstruction, i.e. an approximation of the original input. This can be illustrated by the following diagram:

[Figure: the reconstruction pass, from hidden activations back through the shared weights to the visible layer]
Note: the bias terms b in the figure above are not the same; the hidden layer and the visible layer each have their own biases.

Because the RBM's weights are randomly initialized, the difference between the reconstruction and the original input is usually large. You can think of the difference between r and the input as the reconstruction error, which is then propagated back along the RBM's weights in an iterative learning process until an error minimum is reached.
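
Continuing the forward-pass sketch above (same assumed names and shapes, still not the article's code), a backward pass reconstructs the input using the transposed weights and the visible-layer biases, and the squared difference serves as a simple reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_visible, n_hidden = 784, 128
W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))   # shared weights, used in both passes
b_hidden = np.zeros(n_hidden)                           # hidden-layer biases
b_visible = np.zeros(n_visible)                         # separate visible-layer biases

x = rng.random(n_visible)                               # original input
a = (rng.random(n_hidden) < sigmoid(x @ W + b_hidden)).astype(float)  # forward pass

r = sigmoid(a @ W.T + b_visible)                        # backward pass: reconstruction of x
reconstruction_error = float(np.mean((x - r) ** 2))
print(reconstruction_error)                             # large at first, since W is random
```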

In the forward pass, the RBM uses the input to estimate the activation of a node, i.e. the probability of the output given the weighted input: $P(a \mid x, w)$. In the backward pass, when the activations serve as input and the output is a reconstruction (a prediction of the original data), the RBM estimates the probability of the input x given the activation a, using the same weight parameters as in the forward pass: $P(x \mid a, w)$. Together, these two estimates lead to the joint probability distribution of the input x and the activation a, i.e. $P(x, a)$.
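
For reference, in a binary RBM these two conditionals take the standard sigmoid form (standard results, not spelled out in the original post; the bias symbols $b_j$ and $c_i$ are chosen here for illustration): $P(a_j = 1 \mid x, w) = \mathrm{sigmoid}\big(b_j + \sum_i w_{ij} x_i\big)$ for each hidden unit, and $P(x_i = 1 \mid a, w) = \mathrm{sigmoid}\big(c_i + \sum_j w_{ij} a_j\big)$ for each visible unit, where $b$ is the hidden-layer bias and $c$ is the visible-layer bias.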

Reconstruction here is different from regression, and it is also different from classification. Regression estimates a continuous value based on many inputs, classification predicts discrete labels to apply to a given input sample, and reconstruction predicts the probability distribution of the original input.

This kind of reconstruction is called generative learning, which must be distinguished from the discriminative learning performed by classifiers. Discriminative learning maps an input to a label: given an input x, it outputs $P(y \mid x)$. Imagine that the input data and the RBM's reconstruction are bell curves of different shapes that only partially overlap.

To measure the distance between the predicted probability distribution and the true distribution of the input data, the RBM uses KL divergence to compare the two distributions. KL divergence measures the non-overlapping (divergent) area under the two curves, and the RBM's optimization algorithm tries to minimize this area, so that the shared weights, when multiplied by the hidden-layer activations, yield a close approximation of the original input. The left side of the figure shows the probability distribution p of a set of inputs and the reconstructed distribution q; the right side shows the integral of their difference.
[Figure: left, the input distribution p and the reconstructed distribution q; right, the integral of their difference]
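
As a small numeric illustration of the KL divergence mentioned above, here is a minimal sketch (the two distributions are made up, not taken from the post):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p = np.array([0.1, 0.4, 0.5])   # hypothetical "true" input distribution
q = np.array([0.3, 0.3, 0.4])   # hypothetical reconstructed distribution
print(kl_divergence(p, q))      # 0 only when the two distributions coincide
```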

0x05 Probability Distribution

In a randomly generated grayscale image, each pixel value is uniformly distributed. In the grayscale images of the MNIST dataset, however, the distribution of pixel values is not uniform, and the same is true of black-and-white portrait photographs, whose pixel-value distribution is also uneven but in a different way. Therefore, using the pixel distribution of portraits to fit the distribution of the MNIST dataset would produce a large discrepancy.

To take another example: each language is a specific probability distribution over letters, because every language uses some letters more than others. In English the letters e, t, and a are the most common, whereas in Icelandic the most common letters are a, t, and n. So trying to reconstruct Icelandic with a set of weights learned from English would result in a large discrepancy.

Imagine an RBM whose input consists only of pictures of dogs and elephants: the input layer is x, and there are only two output nodes, a1 and a2, one for each animal. During the forward pass the RBM asks itself: given these pixels, which node should I send a stronger signal to, the elephant node or the dog node? During the backward pass the question becomes: given an elephant, what distribution of pixels should I expect?

This is the joint probability distribution: the probability of x given a, and the probability of a given x, can be determined from the shared weights between the two layers of the RBM.

In a sense, the process of learning to reconstruct is learning which pixels tend to appear together in a given set of images.

These reconstructions represent what the RBM's activations "think" the input data looks like; Geoff Hinton calls this the machine "dreaming." During training, such visualizations are a very useful heuristic for reassuring yourself that the RBM is actually learning; if it is not, its hyperparameters should be tuned.

One final note: you will notice that the RBM has two bias terms. This is one aspect that differentiates it from other autoencoders. The bias term of the hidden layer helps the RBM obtain non-zero activation values in the forward pass, while the bias of the visible layer helps the RBM learn the reconstruction in the backward pass.


Original post: blog.csdn.net/weixin_43466027/article/details/117194359