From Euclidean Space to Hyperbolic Space

Abstract: Representation learning is a topic the field of artificial intelligence has long been exploring, and it is also a hard problem that must be solved on the road to strong AI. Until now, most representation learning methods have stayed within Euclidean space. In fact, besides ordinary Euclidean space, there are many non-Euclidean spaces that can be used to build AI models, and over the last two years more and more researchers have begun to produce interesting work in this area. In this article, we explain the differences between ordinary Euclidean space and non-Euclidean spaces such as hyperbolic space, and the motivation behind modeling with this kind of space. Reading this article requires no prior knowledge of hyperbolic space.

 

 

1. What is a good representation?

 

Representation learning is a frequently discussed topic in today's machine learning community, especially since the rise of deep learning has drawn many researchers into the area. What is representation learning? A simple explanation: it is about using numbers (vectors, matrices, ...) to express real-world objects, in a way that makes subsequent classification and other decision-making tasks easier. Closely related to representation learning is feature engineering, but in industry, feature engineering means hand-designing features for a particular domain based on human experience; this is time-consuming, and features designed for one domain are often hard to reuse on similar problems. In contrast, in representation learning we want the algorithm to learn reasonable feature representations automatically from the given data.

 

For example, in the figure below, given an input, we want to learn the parameters of a representation layer such that the resulting representation provides good support for the upper-level applications A, B, and C.

 

Image from [1]

 

In the field of image recognition, CNNs have become the most mainstream algorithm: layer by layer, convolution and pooling layers apply nonlinear transformations to the raw image data (pixels). The whole process can in fact be seen as representation learning: through these layered transformations, we hope to learn the "best" representation of each image, which then makes subsequent classification and other tasks easier. Similarly, in natural language processing, word embeddings are arguably among the most important results of recent years; in plain terms, a word embedding uses a low-dimensional vector to represent each word, also known as a distributed representation [2]. To learn word representations, we often use algorithms such as Skip-Gram [4], CBOW [3], or GloVe [5] to find the most reasonable vector for each word. Finally, in speech recognition, end-to-end models are becoming mainstream, and whether representation learning on the raw speech signal can yield representations more effective than traditional hand-crafted speech features (such as MFCCs) is a topic many people are studying.

 

Having said that, given a problem we can try various algorithms to learn different representations. The key question then becomes: how do we judge which representations are good and which are bad? For example, if we design a CNN architecture, how do we judge whether the image representation it learns is any good? The simplest criterion, and the one most people currently use, is accuracy on some task, the "ultimate" metric. As a result, many people mistakenly equate high accuracy with a good representation and ignore everything other than accuracy. In academia, this lack of deeper reflection has in recent years contributed to a flood of low-quality papers and an "arms race" over accuracy, which in essence has not brought a deeper impact on the development of the science.

 

In fact, accuracy is just one aspect; there are many other factors to take into account. Bengio and colleagues [1] proposed a number of properties that a good representation should have. First of all, a good representation should learn disentangled factors, and these factors should generalize to other similar tasks. For example, given a picture of a person, we can imagine that each picture is composed of different factors such as "who is in the picture", "viewing angle", "facial expression", and "whether glasses are worn", and these factors can be regarded as mutually independent.

 

For example, in the picture below, some images vary mainly in viewpoint, while others vary mainly in hairstyle and facial expression. Here is a simple example: suppose we use a neural network to learn from these pictures, and finally represent each picture with a four-dimensional vector. Then we would like the first hidden unit to control the viewing angle, the second unit to control whether glasses are worn, and so on.

 


The result is that if we know which part of the representation corresponds to the identity of the person in the picture, we only need to apply a classification algorithm to that part of the representation to classify the person. In fact, this way of thinking resembles how humans process the world: humans are capable of extracting individual signals from complex information. For example, a food lover arriving in a new city may focus only on the local cuisine and safely ignore everything else.

 

In addition, a hierarchical representation is another important concern. In many real-world scenarios, objects have a hierarchy. For the word representations mentioned above, for example, the relationship among "apple", "grape", and "fruit" is not parallel: there is a "belongs to" relationship. Similarly, "Beijing" and "Shanghai" belong to "city". The question, then, is whether the word representations learned by an embedding algorithm can reproduce this hierarchy. We all know that algorithms like Skip-gram rely mostly on semantic similarity between words, not on the hierarchical relationships among them. Furthermore, many networks also have a hierarchical structure; for example, opinion leaders in a social network are usually located at the top of the hierarchy. For the nodes of such a network, what kind of representation can recover the hierarchical structure present in the original data?

 

Beyond these two properties, Bengio et al. [1] also mention other features worth considering; interested readers can consult the literature. So much for background; the point was simply to explain the motivation for sometimes needing to consider hyperbolic space.

 

2. Most representation learning methods rely on Euclidean space

 

Euclidean space is the mathematical setting we are most familiar with: from primary school onward, the concepts we learn basically all deal with Euclidean space. For example, we know very well that "given two points, there is exactly one line passing through them", and "given a line and a point not on it, there is exactly one line through the point parallel to the given line". We also know how to add and subtract two vectors, and even how to apply linear transformations to them. One of the easiest ways to understand Euclidean space is that it is "flat"; for instance, two-dimensional Euclidean space is a plane. Conversely, what if a space is "curved"? That is the focus of our next discussion.

 

In machine learning, most of the models we have discussed so far depend on operations in Euclidean space. From the familiar linear regression and logistic regression to deep learning's CNNs, RNNs, LSTMs, and Word2Vec, all of these rely on Euclidean space. In fact, besides "flat" Euclidean space, some problems confront us with "curved" spaces, such as the hyperbolic space discussed in this article. For example, suppose we are given the coordinates of two cities. We all know how to compute the distance between them on a 2D map: it is simply the Euclidean distance. But if we take into account that "the Earth is round", the distance between the two cities is no longer a straight-line distance but the length of a curve. For this kind of problem, the Euclidean distance formula is no longer directly applicable. Similarly, curved spaces in general require the tools of non-Euclidean geometry.
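To make the contrast concrete, here is a minimal Python (numpy) sketch comparing the flat-map Euclidean distance with the great-circle distance given by the haversine formula; the function names are our own, and the coordinates for Beijing and Shanghai are approximate and purely illustrative:

```python
import numpy as np

def euclidean_distance(p, q):
    # Straight-line distance between two points on a flat map.
    return np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))

def great_circle_distance(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # Haversine formula: shortest path along the curved surface of a sphere.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Beijing (~39.9N, 116.4E) to Shanghai (~31.2N, 121.5E):
print(great_circle_distance(39.9, 116.4, 31.2, 121.5))  # roughly 1070 km
```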

 

This article focuses on an important family of non-Euclidean spaces called hyperbolic space, and on its use in common machine learning modeling. Hyperbolic space can be seen as a Riemannian space of constant negative curvature (Euclidean space is also a Riemannian space of constant curvature, but with curvature 0). The article is organized around the 2017 NIPS Spotlight paper "Poincaré Embeddings for Learning Hierarchical Representations". To my understanding, this should be the first paper in the machine learning community to focus on application scenarios of hyperbolic space, and it is expected to be influential. Dmitri Krioukov of Northeastern University and other researchers had earlier tried to apply the theory of hyperbolic space to complex network analysis, but limited to two-dimensional space [6].

 

Before getting into the details, let's look at a demo to understand concretely what hyperbolic space can do. Here we use the WordNet dataset, which contains many concepts and the hierarchical relationships among them, as shown below.

 

 

   WordNet dataset [7]

 

The demo below shows how the embedding changes as we apply a hyperbolic-space model to this dataset. We can see that at the beginning the points are dispersed evenly throughout the space, but as learning progresses, the distribution of the points clearly takes on a tree-like structure, that is, a hierarchy.

 

  
Demo from [8]

 

After learning, a map similar to the following is eventually produced. This is a visualization of the result in two-dimensional space. We can observe an interesting phenomenon: concepts in WordNet that are close to the root, such as "entity", end up near the center of the hyperbolic space. Conversely, very specific concepts such as "tiger" end up near the edge of the space, while concepts with similar meanings end up in neighboring positions.

 

Image from [9]

 

So, we can see from the figure that the learned representation exhibits a certain hierarchical structure. This is the big difference from Euclidean space: representations learned in Euclidean space contain only similarity information, without hierarchy.

 

 

3. Properties of hyperbolic space

 

Hyperbolic space has several distinctive properties that Euclidean space does not share. The first is its ability to express hierarchy (as in the example we discussed above). If our data has an inherent hierarchy, it is possible to recover that hierarchy inside hyperbolic space (though this is of course difficult when the data is sparse). The second important property is that its capacity is very different from that of Euclidean space. In everyday life, a one-liter bucket can hold one liter of water; this, of course, is an example in Euclidean space. Suppose we had a bucket of the same size in hyperbolic space: what would be different?

 

The left figure is a visualization of hyperbolic space (the Poincaré disk), containing a number of tiles of identical shape whose sizes appear to differ (at least to Euclidean eyes). The interesting phenomenon is that inside hyperbolic space these tiles are actually all the same size: as a tile's location approaches the edge of the disk, it looks smaller and smaller to the naked eye, but its actual size in the space is unchanged. What does this phenomenon tell us? As we walk toward the edge of the disk, the amount of space grows exponentially, so it can accommodate exponentially more tiles. In other words, the capacity of hyperbolic space is much larger than that of Euclidean space. Returning to our opening example: a bucket of the same apparent size in hyperbolic space might hold tens of thousands or even tens of millions of liters of water. This example is not mathematically rigorous, but it helps intuition. The right figure illustrates another property: in hyperbolic space, through a given point there are infinitely many lines parallel to a given line. From a geometric point of view, this is the biggest difference from Euclidean space.
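The exponential growth can be checked with a few lines of numpy. In the hyperbolic plane of curvature -1, a circle of radius r has circumference 2 * pi * sinh(r), versus 2 * pi * r in the Euclidean plane (a standard formula, quoted here purely for illustration):

```python
import numpy as np

for r in (1.0, 2.0, 5.0, 10.0):
    euclid = 2 * np.pi * r           # grows linearly with r
    hyper = 2 * np.pi * np.sinh(r)   # grows roughly like e^r
    print(f"r = {r:4.1f}   Euclidean: {euclid:10.1f}   hyperbolic: {hyper:12.1f}")

# At r = 10 the hyperbolic circle is already about 1100 times longer than
# the Euclidean one: capacity grows exponentially toward the "edge".
```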


What impact does this have on representation learning? In deep learning we often discuss hidden layers, each with a certain number of hidden units, say 64 or 128. How much can a space of 64 hidden units represent? This question matters, because we need an advance understanding of a model's capacity. If each hidden unit takes the value 0 or 1 (as in an RBM), it is easy to compute that 64 hidden units can represent 2 to the power 64 combinations, which can also be understood as the capacity of the space. Now consider the same question in hyperbolic space: do we still need 64 hidden units to obtain a capacity of 2 to the power 64? There is no rigorous answer, but it may be far fewer than 64. Our hypothesis is therefore: in hyperbolic space, we can express the same capacity with fewer model parameters than in Euclidean space.

 

 

4. Models of hyperbolic space

 

In order to do computations in hyperbolic space, we need to work within some standard model of it. Several models are suitable for representing hyperbolic space, such as the Beltrami-Klein model, the hyperboloid model, the Poincaré half-plane model, and the Poincaré ball model. Among these, the Poincaré ball model is the best suited to large-data scenarios, because much existing optimization machinery needs only a simple modification to be used on the Poincaré model. For example, stochastic gradient descent (SGD) in Euclidean space can, with a simple modification, be used to solve optimization problems in hyperbolic space.
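For reference, here is a standard fact about the Poincaré ball model (stated here only to make "simple modification" concrete, not taken from the paper): the ball B^d = { x in R^d : ||x|| < 1 } carries the Riemannian metric

g_x = ( 2 / (1 - ||x||^2) )^2 * g_E

where g_E is the ordinary Euclidean metric. Because this metric is just a pointwise rescaling (a conformal factor) of the Euclidean one, Euclidean optimization methods carry over after rescaling the gradient, which is exactly the modification used in Section 5.4 below.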

 

 

5. Example: using hyperbolic space to learn a representation of a network

 

We use this case to illustrate the entire modeling process in detail; a similar process can be applied directly to other scenarios. For the detailed derivations, you can follow the "Greedy Technology" WeChat official account and reply with the keyword "representation learning" to obtain them.

 

5.1 Problem Definition: 

Given a network, we hope to learn a representation for each of its nodes. This is also known as network embedding. Suppose the network has N nodes and we know all of its edges. We use y = 1 to indicate that two nodes are connected, and y = 0 to indicate that they are not.

 

5.2 Construction of the objective function:

For each pair of nodes (whether connected or not), we can define a conditional probability:

P(y = 1 | h_i, h_j) = 1 / (1 + exp((d(h_i, h_j) - r) / t))

where d(h_i, h_j) is the distance between nodes i and j in hyperbolic space, so the smaller the distance, the greater the probability that the two nodes are connected; r and t are hyperparameters of the model; and h_i, h_j are the vector representations of nodes i and j in hyperbolic space, which are adjusted during training. Once we have this conditional probability for each pair, we can easily write down the overall objective function:

L = - sum over all pairs (i, j) of [ y_ij * log P(y = 1 | h_i, h_j) + (1 - y_ij) * log(1 - P(y = 1 | h_i, h_j)) ]

That is, all pairs (connected or not) are taken into account. This objective function is the cross-entropy loss, which also appears frequently in logistic regression.
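As a rough illustration (not the paper's exact code), the probability and loss above might be written in numpy as follows; the function names are our own, and the probability form with hyperparameters r and t follows the link-prediction probability used in the hyperbolic-embedding literature [6]:

```python
import numpy as np

def edge_probability(d_ij, r=1.0, t=1.0):
    # P(y = 1 | h_i, h_j) = 1 / (1 + exp((d_ij - r) / t)):
    # smaller hyperbolic distance -> higher probability of an edge.
    return 1.0 / (1.0 + np.exp((np.asarray(d_ij) - r) / t))

def cross_entropy_loss(distances, labels, r=1.0, t=1.0):
    # Binary cross-entropy over all pairs, connected (y = 1) or not (y = 0).
    p = edge_probability(distances, r, t)
    y = np.asarray(labels, dtype=float)
    eps = 1e-12  # numerical guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))
```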

 

Several operations need to be redefined: one is how each node is represented in hyperbolic space, and the other is the distance function d(., .) used above. Both must be redefined for hyperbolic space.

 

5.3 Defining the operators

 

First, in the Poincaré ball model each node's representation must lie inside the unit ball and cannot leave it. Mathematically, we define the space as:

B^d = { x in R^d : ||x|| < 1 }

where ||x|| < 1 means the norm of the vector x is less than 1. If during the update process a point is found to violate this condition, it can be projected back inside so that it does. Second, the distance between two nodes in the Poincaré ball model is defined as:

d(h_i, h_j) = arcosh( 1 + 2 ||h_i - h_j||^2 / ((1 - ||h_i||^2)(1 - ||h_j||^2)) )

where ||.|| denotes the Euclidean norm and arcosh is the inverse hyperbolic cosine. It is easy to see that as the norms of h_i and h_j tend to 1, that is, as the points approach the edge of the ball, the value inside arcosh keeps growing, and this growth is exponential. This explains, from another angle, why the capacity of hyperbolic space grows ever larger toward the edge.
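A minimal numpy sketch of these two operators might look as follows; the helper names are our own, and EPS keeps points strictly inside the unit ball:

```python
import numpy as np

EPS = 1e-5  # keep points strictly inside the unit ball

def project_to_ball(x):
    # If an update has pushed x out of the unit ball, rescale it back inside.
    norm = np.linalg.norm(x)
    return x / norm * (1.0 - EPS) if norm >= 1.0 else x

def poincare_distance(u, v):
    # d(h_i, h_j) = arcosh(1 + 2 ||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))
    sq_dist = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq_dist / denom)

# Distances blow up as a point approaches the boundary of the ball:
origin = np.array([0.0, 0.0])
print(poincare_distance(origin, np.array([0.5, 0.0])))   # ~1.1
print(poincare_distance(origin, np.array([0.99, 0.0])))  # ~5.3
```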

 

At this point we have defined the objective function and the operators we need; the next step is to use an optimization algorithm to find an optimal solution of the objective function (usually a local optimum).

 

5.4 Finding the optimal solution

 

An important benefit of using the Poincaré model is that stochastic gradient descent, as used in Euclidean space, can be applied with only a simple modification. The overall idea is to first compute the Euclidean gradient, then convert it by a simple transformation into the gradient required in hyperbolic space, and perform stochastic gradient descent there. How is this conversion done? Via the so-called metric tensor, it is easy to convert the computed Euclidean gradient into the Riemannian gradient needed in hyperbolic space. For example, the figure below shows one gradient-update step; the only difference from a conventional gradient-update algorithm is the extra factor in front, which is in fact the inverse of the metric tensor.
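Here is a minimal numpy sketch of one such update step, following the Riemannian SGD rule of the Poincaré embeddings paper; the function name and the learning-rate default are illustrative:

```python
import numpy as np

def rsgd_step(theta, euclidean_grad, lr=0.01, eps=1e-5):
    # Rescale the Euclidean gradient by the inverse of the metric tensor:
    #   grad_R = ((1 - ||theta||^2)^2 / 4) * grad_E
    scale = ((1.0 - np.sum(theta ** 2)) ** 2) / 4.0
    theta = theta - lr * scale * euclidean_grad
    # Project back into the unit ball if the step has left it.
    norm = np.linalg.norm(theta)
    if norm >= 1.0:
        theta = theta / norm * (1.0 - eps)
    return theta
```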

 

6. Summary

 

In this article we have given only a very basic introduction; the goal is to let the reader intuitively understand the advantages of hyperbolic space and when modeling with it is warranted. Research in this area is still at a relatively early stage, but I believe more work will follow.

 

Many people think of neural networks as simulating the brain. At best this describes a cartoon version of the brain; in fact, even neuroscientists may not yet have a very clear understanding of the brain's mechanisms. Most current neural networks operate in flat Euclidean space, but the brain itself occupies three-dimensional space, and there are geometric relationships between neurons. Might a neural network that incorporates non-Euclidean space be better suited to simulating the brain's internal mechanisms? Only further exploration can give the answer.

 

 

References:

[1] Bengio, Yoshua, Aaron Courville, and Pascal Vincent. "Representation learning: A review and new perspectives." IEEE transactions on pattern analysis and machine intelligence 35, no. 8 (2013): 1798-1828.

 

[2] Hinton, Geoffrey E., James L. McClelland, and David E. Rumelhart. Distributed representations. Pittsburgh, PA: Carnegie-Mellon University, 1984.

 

[3] Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).

 

[4] Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." In Advances in neural information processing systems, pp. 3111-3119. 2013.

 

[5] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "Glove: Global vectors for word representation." In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532-1543. 2014.

 

[6] Krioukov, Dmitri, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Marián Boguná. "Hyperbolic geometry of complex networks." Physical Review E 82, no. 3 (2010): 036106.

 

[7] Miller, George A., Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. "Introduction to WordNet: An on-line lexical database." International journal of lexicography 3, no. 4 (1990): 235-244.

 

[8] https://www.facebook.com/ceobillionairetv/posts/1599234793466757 

 

[9] https://github.com/facebookresearch/poincare-embeddings

Reposted from: https://www.itcodemonkey.com/article/8616.html (thanks for sharing).
