GNN Popular Notes: Application of Graph Neural Network in Recommendation/Advertising

Original video: July Online open class "Application of Graph Neural Networks in Recommendation/Advertising Scenarios"; the courseware can be downloaded from the video page.
Subtitle proofreading: Tianbao proofread the subtitles throughout; because it is an open class with a lot of interaction, many purely interactive sentences were removed.
Note writing: July, based on the original open class/courseware, with substantial interpretation, supplementation, explanation, deepening, and expansion.
Exchange and discussion: several colleagues and lecturers were involved, including but not limited to recommendation lecturer Teacher Wu (a PhD from BUPT), teaching assistant Teacher Du, etc.
 

 

Foreword

In the afternoon I saw an interview write-up on the community, "From traditional IT to risk control 10 years after undergraduate graduation: after self-study and a training camp, I received multiple offers of 400,000 to 500,000". The author mentioned in the article: "I have been following July's CSDN since around 2014 and benefited a lot from many of his blog posts." Another student once mentioned in his WeChat Moments: "July's posts on understanding SVM at three levels and on mastering the competition killer XGBoost are very good and well worth reading." This inspired me to update my blog again; after all, a blog with 20 million visits has built up enough influence over the past ten years.

In addition, graph neural networks have become very popular recently, and we have opened a dedicated course, "Graph Neural Networks in CV Practice". This article is therefore based on Teacher Wu's GNN open class. It is not only a set of notes and a memorandum; it will also keep expanding and deepening on top of the notes, so as to become a comprehensive introductory article on graph neural networks and their application in recommendation/advertising.

This article involves the following contents successively:

  • A popular introduction to GNN, including what a graph neural network is, what Embedding is, what Word2Vec is, graph neural network GE (GE's DeepWalk, GE's LINE, GE's GraRep, GE's Node2Vec, GE's Struc2Vec, GE's GraphSAGE, GE's GraphWave, GE summary), and graph neural network GCN;
  • The application of GNN in recommendation/advertising, including End2End modeling and two-stage modeling, End2End's GCMC, End2End's PinSage (graph construction in PinSage, PinSage's GNN model, PinSage's training, PinSage's serving), End2End's Taobao EGES, and a two-stage case;
  • An introduction to GNN architecture, including Ali's high-performance graph service Euler (Euler's distributed graph engine, Euler's graph operators, the GQL query interface of Euler 2.0, Euler's MP abstraction layer).

After that comes a Q&A, e.g., what abilities recommendation-algorithm work requires and how to improve engineering skills. Finally, if you have any questions or spot any problems, please leave a message or a correction.

 

1 General Introduction to GNN

The topic of today's open class is the application of graph neural networks in recommendation/advertising scenarios. The content is divided into three parts:

  1. The first is an introduction to graph neural networks;
  2. The second part covers applications and cases of graph neural networks in recommendation/advertising;
  3. The third part is about what components are needed when graph neural networks are deployed in industry — besides the GNN algorithms themselves we also need other engineering components, used in combination.

1.1 What is a graph neural network

Before understanding the graph neural network, we need to understand why there is a graph neural network structure.

Generally speaking, a technology is born to solve a certain problem or class of problems. So what problem do graph neural networks solve? Aren't the existing heavyweights, CNN and RNN, good enough? Good question!

As we all know, CNN is mainly used for image data processing (see this article for details: CNN Popular Notes ), and RNN is mainly used for time series data processing (for more information, please see this article: From RNN to LSTM ). In reality, however, in an increasing number of applications, data is generated from non-Euclidean domains and represented as graphs with complex relationships and interdependencies between objects. The complexity of graph data poses great challenges to existing machine learning algorithms.

For example, compare the image on the left of the following figure (Euclidean space) with the image on the right (non-Euclidean space):

Traditional neural network structures such as CNN and RNN all accept data in Euclidean space as input, and they cannot handle data structures in non-Euclidean space. For data such as graphs, graph neural networks are more suitable for processing, which has also led to increasing research enthusiasm for graph neural networks in recent years, as shown in the following figure:


In short, a graph is a very common data structure: typical examples include social networks, user-item graphs, protein structures, traffic road networks, and knowledge graphs. All of these can be described as a Graph, and even regular grid data is actually a special form of graph. Graphs are therefore a field well worth studying.

Second, let's talk about the general classification of Graph research:

  • The first category is classic graph algorithms, such as path search and bipartite graphs. Where are these used? Baidu Maps and Gaode Maps rely on route-planning (path search) algorithms, i.e., graph algorithms. Where are bipartite graphs used? A typical scenario is Didi: Didi's order-dispatch algorithm is a bipartite graph matching;
  • Another class of problems is probabilistic graphical models, such as the conditional random field. Everyone who studies NLP knows that CRF is used a lot (to understand CRF better, you can watch this CRF video);
  • The third topic is today's subject, the graph neural network, which mainly includes the two big pieces of Graph Embedding and GCN, plus some knowledge-graph content that we will not cover much here.

Now for graph neural networks themselves, which have been a very hot research direction since 2018. There are several reasons. The first is the success of deep learning in other fields such as CV and NLP; those are regular grid data, and there is a natural desire to extend deep learning to irregular graph data.
The second is the wide range of usage scenarios; as mentioned above, many scenarios can be formulated as graph problems. Looking at KDD 2020 topics, GNN actually ranks first, and it has been such a trend since 2018. Here is a panorama (of CV and related fields) for everyone:


This is something I summarized up to about 2018. There are two purposes in showing this picture:

  1. The first is about GNN itself: what was its starting point, and how did people come to this problem? Because of the success of CNN in vision and the success of deep learning in NLP, researchers wanted to transfer these ideas to graphs;
  2. The second is to remind everyone that the core of these algorithms is the same. Whether it is CV, NLP, recommendation, advertising, or anything else, they are closely related; at least the underlying ideas are consistent, and they develop in sequence. Generally speaking, CV is 1-2 years ahead of NLP, and NLP is 1-2 years ahead of recommendation, so if you work on or are learning recommendation, you can pay attention to recent developments in CV and NLP; they will give you a lot of input and insight.

To give the simplest example: the ReLU activation we use now. In recent work we all know Transformer and BERT in NLP; BERT uses GELU, and switching to GELU in certain scenarios brings a measurable improvement and real business benefit.

Of course there is also Attention. Let me ask you a question: everyone knows the attention mechanism, but where did it originate — CV, NLP, or recommendation? Does anyone know? Yes, it was actually CV; I see some students answered correctly. CV had the idea first, then it came to NLP — we all know RNN plus Attention — where it was carried forward, and then it fed back into CV.

Note: since this article focuses on the application of graph neural networks in recommendation/advertising, for some of the basic concepts of graph neural networks you can read this article, such as

  • Spatial convolution (there are two main types of graph convolutional networks: one based on the spatial domain, the other on the spectral/frequency domain. Roughly speaking, spatial-domain convolution is analogous to convolving directly on the pixels of an image — the GraphSAGE mentioned in Section 1.4.6 is an example — while frequency-domain convolution is analogous to taking the Fourier transform of the image first and then convolving);
  • Message passing networks;
  • Fourier transform on the graph and so on.

1.2 What is Embedding

Continuing with graph neural networks, we first introduce Graph Embedding; but before that, what is Embedding?

  1. Mathematically, an embedding is a function from X to Y, mapping one space to another; usually a high-dimensional abstract space is mapped to a low-dimensional concrete space;
  2. The second property is that an embedding is generally a dense, distributed representation, in contrast to one-hot encoding, which will be described in detail later.


Second, why do we need something like Embedding at all?

  • The first reason is that abstract things should have a low-dimensional representation. When you see a picture with a dog in it, such as the picture in the upper right of the figure above (a pixel map), you see a dog, but what is its low-level representation? A pile of pixels. That pile of pixels is the lowest-level representation of the dog picture in the visual domain. Similarly, when you see the word "dog", it should also have a comparable low-level representation — this is what Word2Vec provides. Moreover, a husky, a golden retriever, and a bichon each have their own personality — every dog does — and that individuality is expressed at the lowest level of representation.
  • The second reason is that computers are better at processing low-dimensional numeric information, which is also part of why CV developed faster than NLP: images are already the kind of data computers process well;
  • The third reason is to solve the problems of one-hot encoding. What is wrong with one-hot encoding? Two things:

The first problem is that its dimension grows with your corpus and vocabulary. For example, if you do Word2Vec and your corpus has 100,000 words, then your one-hot code has 100,000 dimensions; if there are 10 million words, then your one-hot encoding has 10 million dimensions, so the dimension is not fixed;

The second problem is that one-hot encoding has only a single 1 for each word or item, so there is no way to express which items are similar to which; embeddings can. Embedding techniques are now very mature, starting from Word2Vec and moving on to Item2Vec, Node2Vec, and the Graph Embedding we are talking about today — they all belong to the category of embeddings.
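A tiny sketch of the contrast (the 5-word vocabulary and the embedding dimension are illustrative, not from the lecture):

```python
import numpy as np

vocab = ["dog", "husky", "golden_retriever", "cat", "car"]   # grows with the corpus
word_to_id = {w: i for i, w in enumerate(vocab)}

# One-hot: dimension = vocabulary size, a single 1 per word, no notion of similarity.
one_hot = np.eye(len(vocab))
print(one_hot[word_to_id["dog"]] @ one_hot[word_to_id["husky"]])   # 0.0 -- all words look unrelated

# Embedding: fixed low dimension, dense; after training, similar words can have a high dot product.
emb = np.random.randn(len(vocab), 8)     # untrained here, just showing the data structure
print(emb[word_to_id["dog"]] @ emb[word_to_id["husky"]])
```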

1.3 What is Word2Vec


I am only showing this here and will not go through it in detail; just take a look. The formula on the far right is the most primitive derivation. Does it look familiar? Let me ask: what is the relationship between this formula and softmax (if you are not sure what softmax is, look it up first)?

Does anyone know? It looks like softmax regression. In fact, the formula Word2Vec ultimately optimizes is a form of softmax — but how does it differ from softmax regression?

  • Whether the input is deterministic. In softmax regression (or what we call LR), the input is fixed: if you do binary classification on a picture, your input is a bunch of pixels that will not change. But in Word2Vec, the thing we are solving for is the input vector itself — it is exactly what you want to learn — and there are also network parameters, so both the input and the parameters are unknown and both have to be optimized.
  • The optimization method is different. Precisely because the input in Word2Vec is uncertain, the optimization methods differ. Softmax regression is a convex optimization problem that you can solve directly; Word2Vec, because both the input and the parameters are unknown, needs something like SGD (or coordinate ascent) to solve.

Second, to sharpen everyone's sensitivity: do you know what the formula circled in the lower right corner of the figure above is called? Whether in softmax or Word2Vec, the denominator is usually called Z. This Z is the partition function — a summation over the whole vocabulary. Is it expensive to compute? Very. So what does Word2Vec do? Negative sampling. Negative sampling is essentially NCE, noise contrastive estimation (a reminder: whenever you encounter such a partition function in the future, the first thing to think of is noise contrastive estimation). This is what we commonly call negative sampling.
As for why optimizing the negative-sampling objective is ultimately equivalent to optimizing the original formula, interested students can read the proof in: Miscellaneous Talk on Noise Contrastive Estimation.

For a deeper understanding of Word2Vec, please read this article: How to understand word2vec in a popular way.
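As a pointer to what the negative-sampling objective looks like in code, here is a minimal skip-gram-with-negative-sampling loss sketch (vocabulary size, dimensions, and the uniformly sampled negatives are illustrative assumptions; real Word2Vec also uses a frequency-based noise distribution):

```python
import torch
import torch.nn.functional as F

vocab_size, dim = 1000, 64
in_emb  = torch.nn.Embedding(vocab_size, dim)   # "input" vectors (the embeddings we keep)
out_emb = torch.nn.Embedding(vocab_size, dim)   # "output"/context vectors

def sgns_loss(center, context, num_neg=5):
    """Skip-gram with negative sampling: pull (center, context) together,
    push (center, random negative) apart -- avoiding the full partition function Z."""
    v_c = in_emb(center)                                    # [batch, dim]
    u_o = out_emb(context)                                  # [batch, dim]
    pos = F.logsigmoid((v_c * u_o).sum(-1))                 # positive-pair term
    neg_ids = torch.randint(0, vocab_size, (center.size(0), num_neg))  # uniform noise (illustrative)
    u_n = out_emb(neg_ids)                                  # [batch, num_neg, dim]
    neg = F.logsigmoid(-(u_n @ v_c.unsqueeze(-1)).squeeze(-1)).sum(-1)
    return -(pos + neg).mean()

loss = sgns_loss(torch.tensor([3, 7]), torch.tensor([12, 99]))
loss.backward()
```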

1.4 GE of graph neural network

Now let's take a look at a panorama of Graph Embedding


This Graph Embedding family already had many branches by 2016, and the panorama does not even include knowledge-graph techniques such as TransE, TransR, and TransH. Today we only introduce the particularly important, must-know ones. Whether at work or in an interview, as long as you say you have learned or understood Graph Embedding, these are the things that will definitely be asked, so we will cover the most essential ones.

1.4.1 DeepWalk of GE

The first is DeepWalk, the ancestor of Graph Embedding. Its idea is very simple: it borrows directly from Word2Vec. Imagine I give you a graph and ask you to do Graph Embedding by referring to Word2Vec — what do you think is missing?

Before expanding, a note on notation: in the figure above, \pi _{vx} is the unnormalized transition probability from node v to node x. The simplest choice is \pi _{vx} = w_{vx}; for an unweighted graph, w_{vx} = 1. Z is a normalization constant.

Word2Vec first has several elements:

  • The first element is the vocabulary. On a graph, the vocabulary corresponds to the set of nodes: a node is a word. If your graph has 100 million nodes, your vocabulary size is 100 million;
  • The second is the concept of a sentence. Word2Vec works on sentences, over which we slide an n-gram window — say a 5-gram — and with skip-gram we use the central word to predict the two words before and the two words after. So we need to construct sentences, that is, figure out how to construct sentences on the graph;
  • The third element is the corpus, which is simply the collection of all the sentences.

DeepWalk supplies exactly these three elements: from the graph it builds a vocabulary, a batch of sentences, and a corpus, and then optimizes with the Word2Vec method. Its idea is simply the random walk. You can see that at the beginning of a field it is relatively easy to make a contribution with a simple idea — the citation count of this paper is already very high.

  1. How is it done? Take a simple example: a cross-shaped graph where the weight of the right edge is 1, the bottom edge is 2, the left edge is 3, and the remaining (top) edge is 4;
  2. For this cross graph, start from a node and do a random walk. How do we walk? The formula gives \pi _{vx}, the (unnormalized) conditional probability of walking from node v to the next node x; Z is the partition (normalization) function, and \pi _{vx} is just the edge weight;
  3. For example, call the center of the cross node 1 and the lower node node 2. The probability of walking from 1 to 2 is (weight of edge 1-2)/(1+2+3+4). Many such random walks can run in parallel, producing a batch of "sentences", which are then optimized with the Word2Vec method. A minimal sketch is given below.
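To make the DeepWalk recipe concrete, here is a minimal sketch in Python (assuming networkx and gensim are available; the cross-shaped graph and the walk parameters are illustrative, not from the original):

```python
import random
import networkx as nx
from gensim.models import Word2Vec

# Hypothetical cross-shaped graph from the lecture: center node 1, four neighbors with edge weights 1-4.
G = nx.Graph()
G.add_weighted_edges_from([(1, 2, 1.0), (1, 3, 2.0), (1, 4, 3.0), (1, 5, 4.0)])

def weighted_random_walk(graph, start, walk_length):
    """One DeepWalk-style walk; the next node is drawn proportionally to edge weight (pi_vx = w_vx)."""
    walk = [start]
    for _ in range(walk_length - 1):
        cur = walk[-1]
        nbrs = list(graph.neighbors(cur))
        if not nbrs:
            break
        weights = [graph[cur][n]["weight"] for n in nbrs]
        walk.append(random.choices(nbrs, weights=weights, k=1)[0])
    return [str(n) for n in walk]  # Word2Vec expects string "words"

# Build the "corpus": several walks starting from every node.
corpus = [weighted_random_walk(G, n, walk_length=10) for n in G.nodes() for _ in range(20)]

# Optimize with skip-gram + negative sampling, exactly as in Word2Vec.
model = Word2Vec(corpus, vector_size=16, window=2, sg=1, negative=5, min_count=0)
print(model.wv["1"][:4])  # embedding of node 1
```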

1.4.2 GE's LINE

Next is the LINE algorithm, which is widely used in industry. Although it is often grouped with DeepWalk, it is actually quite different — for example, it does not borrow from Word2Vec. Its idea is to define two kinds of similarity: first-order similarity and second-order similarity.

As Qianmeng said, LINE is also a method based on the assumption of neighborhood similarity, but unlike DeepWalk, which uses DFS to construct neighborhoods, LINE can be regarded as an algorithm that uses BFS to construct neighborhoods. In addition, LINE can also be applied to weighted graphs (DeepWalk can only be used for unweighted graphs).

First-order similarity describes the local similarity between pairs of vertices in the graph. Formally, if there is a direct edge between v_{i} and v_{j}, the edge weight w_{ij} indicates the similarity of the two vertices; if there is no direct edge, the first-order similarity is 0. As shown in the figure below, nodes 6 and 7 have a direct edge with a large weight, so they are considered similar and their first-order similarity is high; nodes 5 and 6 have no direct edge, so their first-order similarity is 0.

The specific calculation process of the first-order similarity is as follows (the first set of formulas on the right side of the figure above is the first-order similarity):

  • For two vertices v_{i} and v_{j} (analogous to nodes 6 and 7) with edge weight w_{ij}, define the empirical distribution

\hat{p_{1}}(v_{i},v_{j}) =\frac{w_{ij}}{W}

where

W = \sum_{​{(i,j)\epsilon E}}^{}w_{ij}

This expresses the proportion of the edge weight between v_{i} and v_{j} in the sum of all edge weights;

  • Then, after embedding the vertices, the joint probability between v_{i} and v_{j} is

p_{1}(v_{i},v_{j}) = \frac{1}{1 + exp(-\vec{u_{i}}^{T}\cdot \vec{u_{j}})}

where \vec{u}_{i} and \vec{u}_{j} are the low-dimensional vector representations of vertices v_{i} and v_{j} respectively; the whole formula can be regarded as an inner-product model that computes the similarity between two vertices;

  • Therefore the optimization objective is O_{1} = d(\hat{p}_{1}(\cdot ,\cdot ), p_{1}(\cdot ,\cdot )): we want to reduce the distance d(\cdot ,\cdot ) between the two distributions so that they are as similar as possible. In other words, we keep shortening the distance between the model distribution of the first-order similarity and the empirical distribution, so that the first-order loss ends up as small as possible;
  • The commonly used measure of the difference between two probability distributions is KL divergence, which in the discrete case is

KL(p||q) = \sum p(x)log\tfrac{p(x)}{q(x)}

Therefore the difference between the two distributions can be expressed by KL divergence. Using KL divergence and ignoring the constant terms, the objective function of the first-order similarity is finally

O_{1} = - \sum_{​{(i,j)\epsilon E}}^{}w_{ij}log\ p_{1}(v_{i},v_{j})

Students who like to get to the bottom of things may ask: how is the last step above calculated? Here is a detailed derivation (note: the subscript 1 on p and \hat{p} is only to distinguish them from the second-order similarity formulas later and has no other special meaning; for convenience the subscript is omitted in the derivation below):

considering

D_{KL}({\hat{p}(v_{i},v_{j}) || p(v_{i},v_{j})}) = \sum_{(i,j)\epsilon E}\hat{p}(v_{i},v_{j})log\frac{\hat{p}(v_{i},v_{j})}{p(v_{i},v_{j})}

And \hat{p}(v_{i},v_{j}) =\frac{w_{ij}}{W}, so there is

O_{1} = D_{KL}({\hat{p}(v_{i},v_{j}) || p(v_{i},v_{j})}) \\ = \sum_{​{(i,j)\epsilon E}}^{} \frac{w_{ij}}{W}log\frac{w_{ij}}{W\cdot p(v_{i},v_{j})}

and, using the identity Xlog_{a}\frac{M}{N\cdot D} = Xlog_{a}M - Xlog_{a}(N\cdot D), we have

O_{1} \Rightarrow \sum_{(i,j)\epsilon E} w_{ij} log\frac{w_{ij}}{W\cdot p(v_{i},v_{j})} \\ = \sum_{(i,j)\epsilon E} w_{ij} log w_{ij} - \sum_{(i,j)\epsilon E} w_{ij} log(W\cdot p(v_{i},v_{j})) \\ = \sum_{(i,j)\epsilon E} w_{ij} log w_{ij} - \sum_{(i,j)\epsilon E} w_{ij} log W - \sum_{(i,j)\epsilon E} w_{ij} log\ p(v_{i},v_{j})

In addition, because w_{ij} and W are constants, \sum_{(i,j)\epsilon E}w_{ij}log w_{ij} and \sum_{(i,j)\epsilon E}w_{ij}log W are also constants and can be ignored, so we finally get

O_{1} = - \sum_{​{(i,j)\epsilon E}}^{}w_{ij}log\ p(v_{i},v_{j})
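As a sanity check on the derivation, here is a minimal PyTorch-style sketch of the first-order objective O_{1} = -\sum w_{ij} log\ \sigma(\vec{u}_{i}\cdot \vec{u}_{j}) (the names and the toy edge list are illustrative; a real implementation would also add negative sampling over node pairs):

```python
import torch

num_nodes, dim = 8, 16
emb = torch.nn.Embedding(num_nodes, dim)          # \vec{u}_i for every vertex

# Toy weighted edge list (i, j) with weights w_ij -- illustrative only.
edges = torch.tensor([[6, 7], [0, 1], [0, 2]])
weights = torch.tensor([5.0, 1.0, 2.0])

def first_order_loss(edges, weights):
    u_i = emb(edges[:, 0])
    u_j = emb(edges[:, 1])
    # p_1(v_i, v_j) = sigmoid(u_i . u_j); O_1 = -sum_ij w_ij * log p_1
    p1 = torch.sigmoid((u_i * u_j).sum(dim=-1))
    return -(weights * torch.log(p1 + 1e-10)).sum()

loss = first_order_loss(edges, weights)
loss.backward()   # gradients flow into the embedding table
```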

Is only first-order similarity enough? Obviously not enough, as shown below:

  1. There is no direct edge between nodes 5 and 6, so judged by first-order similarity alone they are not similar. But is that conclusion sound? A real-world analogy: even if two people are not yet friends, if they share a large group of common friends, doesn't that suggest they are similar? After all, people with close friend preferences tend to be alike.
  2. Therefore, although nodes 5 and 6 are not directly connected, they share many neighbors (1, 2, 3, 4). If we define the overlap of neighborhoods as second-order similarity, we can conclude that 5 and 6 are in fact similar.
  3. The next question is how to define neighborhood overlap. For nodes 5 and 6: 5 has neighbors 1, 2, 3, 4, and 6 also has neighbors 1, 2, 3, 4. Suppose the edge weights from 5 to 1, 2, 3, 4 are 1, 2, 3, 4, and the edge weights from 6 to 1, 2, 3, 4 are also 1, 2, 3, 4 — their edge weights match exactly. Then, through the second-order similarity formulas (the second set of formulas on the right side of the big figure at the start of this section), we get that the similarity between 5 and 6 is 1: they are completely similar.

What is the calculation process of the second-order similarity?

Second-order similarity assumes that two vertices sharing many of the same neighbors are similar. Here each vertex is treated as a specific context, and vertices with the same or similar context distributions are considered similar. In other words, in this setting each vertex plays two roles: it is a vertex itself, and it is a context for other vertices.

  1. Based on this, the author proposes two vector representations for each vertex, \vec{u}_{i} and {\vec{u}_{i}}': \vec{u}_{i} is the representation of v_{i} when it acts as a vertex itself, and {\vec{u}_{i}}' is its representation when it acts as a context for other vertices.
  2. For each directed edge e_{ij}, the author defines the conditional probability of generating the context/neighbor vertex v_{j} given vertex v_{i} as p_{2}(v_{j}|v_{i}) = \frac{exp({\vec{c}_{j}}^{T}\cdot \vec{u}_{i})}{\sum_{k=1}^{|V|} exp({\vec{c}_{k}}^{T}\cdot \vec{u}_{i})}, where |V| is the number of context vertices; in effect this is, for each node, a softmax over its possible contexts (the denominator is a normalization factor). For example, suppose v_{1} has neighbors v_{2}, v_{3}, v_{4} and v_{5} has neighbors v_{2}, v_{3}, v_{6}; if the distributions p_{2}(\cdot|v_{1}) and p_{2}(\cdot|v_{5}) are similar (that is, the contexts of v_{1} and v_{5} are similar), then the two nodes are similar in terms of second-order proximity.
  3. Therefore, in order to preserve second-order proximity, we want the conditional distribution of each vertex over its contexts to fit its empirical distribution as well as possible, so the objective to minimize is O_{2} = \sum_{i\epsilon V} \lambda _{i}\, d(\hat p_{2}(\cdot |v_{i}), p_{2}(\cdot |v_{i})), where \lambda _{i} is a factor controlling the importance of vertex v_{i} in the network or graph, which can be estimated by, e.g., the degree of the vertex or PageRank, and d(\cdot ,\cdot ) is the distance between the two distributions, again measured with KL divergence.
  4. The empirical distribution \hat{p}_{2}(\cdot|v_{i}) in the above formula is \hat{p}_{2}(v_{j}|v_{i}) = \frac{w_{ij}}{d_{i}}, where w_{ij} is the weight of edge e_{ij} and d_{i} is the out-degree of vertex v_{i}; for a weighted graph, d_{i} = \sum_{k\epsilon N(i)}w_{ik}, where N(i) is the set of "out" neighbors of v_{i} (neighbors reached by edges starting from node i);
  5. Suppose \lambda_{i} = d_{i}; then, using KL divergence as the distance function d(\cdot ,\cdot ) and ignoring constant terms, we get: O_{2} = - \sum_{(i,j)\epsilon E}w_{ij}log\ p_{2}(v_{j}|v_{i})

Thus, by learning the two vector representations \vec{u}_{i} and {\vec{u}_{i}}' of each node v_{i} so as to minimize this objective, we finally obtain a d-dimensional vector representation for each node.

After I sent this note to the recommendation advanced-class group, some students had doubts about the calculation in step 5 above, and another student, Li, immediately derived it:

First, the objective we are solving is

O_{2} = \sum_{i\epsilon V}\sum_{(i,j)\epsilon E}\lambda _{i}\, d(\hat p_{2}(v_{j}|v_{i}), p_{2}(v_{j}|v_{i}))

According to D_{KL}(p||q) = \sum p(x)log\tfrac{p(x)}{q(x)}, we have

D_{KL}(\hat p_{2}(v_{j}|v_{i})\,||\,p_{2}(v_{j}|v_{i})) = \sum_{(i,j)\epsilon E} \hat p_{2}(v_{j}|v_{i})\, log\frac{\hat p_{2}(v_{j}|v_{i})}{p_{2}(v_{j}|v_{i})}

Using the KL distance in place of d(\cdot ,\cdot ), we have

O_{2} = \sum_{i\epsilon V} \sum_{(i,j)\epsilon E} \lambda_{i}\, \hat p_{2}(v_{j}|v_{i})\, log\frac{\hat p_{2}(v_{j}|v_{i})}{p_{2}(v_{j}|v_{i})}

Considering \hat{p}_{2}(v_{j}|v_{i}) = \frac{w_{ij}}{d_{i}}, and assuming \lambda_{i} = d_{i}, substituting both gives

O_{2} = \sum_{i\epsilon V}^{} \sum_{​{(i,j)\epsilon E}}^{} d_{i} \frac{w_{ij}}{d_{i}} log\frac{\hat p_{2}(v_{j}|v_{i})}{p_{2}(v_{j}|v_{i})}

Because d_{i} cancels with the denominator (d_{i}\cdot \frac{w_{ij}}{d_{i}} = w_{ij}), the above formula becomes

O_{2} = \sum_{i\epsilon V}^{} \sum_{​{(i,j)\epsilon E}}^{} {w_{ij}} log\frac{\hat p_{2}(v_{j}|v_{i})}{p_{2}(v_{j}|v_{i})} \\ = \sum_{​{(i,j)\epsilon E}}^{} {w_{ij}} log\frac{w_{ij}}{d_{i}} - \sum_{​{(i,j)\epsilon E}}^{} w_{ij}logp_{2}(v_{j}|v_{i})

Because w_{ij} is a constant and d_{i} = \sum_{k\epsilon N(i)}w_{ik} is also a constant, the first term \sum_{(i,j)\epsilon E} w_{ij} log\frac{w_{ij}}{d_{i}} is a constant and can be ignored, giving

O_{2} = - \sum_{​{(i,j)\epsilon E}}^{} w_{ij}logp_{2}(v_{j}|v_{i})

Second-order similarity is indeed more abstract than first-order similarity. For example:

  1. Take two nodes a1 and b1: a1 has three neighbors a2, a3, a4 with edge weights (1, 2, 3); b1 has three neighbors a2, b2, a3 with edge weights (3, 2, 2);
  2. The second-order similarity of a1 and b1 then comes down to how close their normalized neighbor-weight distributions are over the common neighbors (a2, a3): roughly (1/6, 2/6) for a1 versus (3/7, 2/7) for b1 — the smaller the gap, the more similar a1 and b1.

In other words, it measures how similar the first-order neighborhoods of two nodes are; the two nodes need not be directly connected. As an extreme example, if the neighbors of a1 and b1 are exactly the same and the edge weights are also exactly the same, then the second-order similarity between a1 and b1 is 1.
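A tiny numeric sketch of this intuition (the neighbor sets and weights are the toy numbers above; comparing empirical context distributions directly only illustrates the idea, it is not the actual LINE training procedure):

```python
import numpy as np

# Toy neighborhoods from the example above: neighbor -> edge weight
a1 = {"a2": 1.0, "a3": 2.0, "a4": 3.0}
b1 = {"a2": 3.0, "b2": 2.0, "a3": 2.0}

def context_distribution(neighbors, vocab):
    """Empirical p̂_2(.|node): normalized edge weights over a shared vocabulary of contexts."""
    w = np.array([neighbors.get(v, 0.0) for v in vocab])
    return w / w.sum()

vocab = sorted(set(a1) | set(b1))          # ['a2', 'a3', 'a4', 'b2']
p_a1 = context_distribution(a1, vocab)     # [1/6, 2/6, 3/6, 0]
p_b1 = context_distribution(b1, vocab)     # [3/7, 2/7, 0, 2/7]

# The closer these two context distributions, the higher the second-order similarity.
print(p_a1, p_b1, np.abs(p_a1 - p_b1).sum())
```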

Next, another question — I don't know whether you have practiced the LINE algorithm. For LINE, can we optimize the two similarities together, i.e., directly add the first-order loss and the second-order loss and optimize them jointly?

In fact we cannot. Review this example:

Nodes 5 and 6 have first-order similarity 0 — there is no edge between them — but under second-order similarity they are completely similar, i.e. similarity 1. That means one loss pushes the two infinitely close while the other pushes them infinitely far apart, so joint training cannot learn anything sensible.

Therefore, in practice we optimize them separately: learn one embedding from the first-order similarity and another from the second-order similarity, then combine the two — by concatenation, average pooling, or sum pooling — into the final Graph Embedding (if you say you know LINE in an interview, this is an easy question to get asked).

1.4.3 GE's GraRep

Here is another natural follow-up: we have considered first-order and second-order similarity, so can we consider third-order and fourth-order similarity? If you bring up this algorithm, you will definitely be asked this.

The answer is yes, and this is the idea behind GraRep. In the figure above, each case in the top row has greater similarity than the corresponding case in the bottom row: a is greater than e, b is greater than f, c is greater than g, and d is greater than h. In case a this is because of the larger edge weight; in case b it is because there are more paths — for example, 4 paths from A1 to A2 versus only one path below — and so on.

1.4.4 GE's Node2Vec

The following introduces another algorithm that is particularly commonly used in the industry, called Node2Vec

Before expanding, a note on notation: d_{tx} is the shortest-path distance from node t to node x. Its value can only be in {0, 1, 2}, because the second-order random walk only looks two hops out.

Many of you have probably heard of this algorithm, so here is a brief introduction. This Graph Embedding method still traces back to Word2Vec, but it has been improved; its most important contribution is improving on DeepWalk.

We know DeepWalk's approach can be understood as DFS-based. For example, in the picture below, going from u to s4 to s5 to s6 is a DFS walking forward: find a sequence through DFS and then run Word2Vec. This is clearly a DFS-style method.


The second notion is content similarity. Its assumption is: the larger my edge weight with someone, the more similar I am to them. LINE is a typical algorithm of this kind, because it optimizes the similarity between u and s1, s2, s3, s4, and in the first-order case it uses BFS to construct the neighborhood.

So what about Node2Vec? It defines a second-order random walk that balances DFS and BFS. Let's look at this picture:

In this graph, t is the node sampled in the previous step, v is the node currently being visited, and x_{1}, x_{2}, x_{3} are the candidate nodes for the next step;

The parameters p and q then control the sampling:

  • p controls the return (turn-back) rate. If p is infinite, you never turn back and keep going forward; if p is 0.1, the unnormalized weight of turning back is 1/p = 10, so the probability of returning is very high;
  • q controls the balance between BFS and DFS. If q = 1, the weights toward x_{1}, x_{2}, x_{3} are all 1 and there is no distinction between BFS and DFS; if q is very large, the walk leans toward BFS and tends to move to x_{1}, which stays close to t; if q is very small, the weight 1/q toward the farther nodes is large and the walk leans toward DFS. Once the sequences are sampled, the rest is the same as DeepWalk: just run the Word2Vec optimization to obtain the embeddings (see the sketch after this list).
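Below is a minimal sketch of the second-order bias described above (the small graph and the p, q values are illustrative assumptions):

```python
import networkx as nx
import random

def node2vec_step(G, prev, cur, p=0.25, q=4.0):
    """Pick the next node given the previous node t=prev and current node v=cur.
    Unnormalized weight: w_vx / p if d_tx == 0, w_vx if d_tx == 1, w_vx / q if d_tx == 2."""
    candidates, weights = [], []
    for x in G.neighbors(cur):
        w = G[cur][x].get("weight", 1.0)
        if x == prev:                      # d_tx == 0: return to t
            alpha = 1.0 / p
        elif G.has_edge(prev, x):          # d_tx == 1: stay close to t (BFS-like)
            alpha = 1.0
        else:                              # d_tx == 2: move away from t (DFS-like)
            alpha = 1.0 / q
        candidates.append(x)
        weights.append(w * alpha)
    return random.choices(candidates, weights=weights, k=1)[0]

# Toy usage on a small built-in graph (illustrative).
G = nx.karate_club_graph()
walk = [0, 1]
for _ in range(8):
    walk.append(node2vec_step(G, walk[-2], walk[-1]))
print(walk)
```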

There are two applications here, one from Tencent and one from Facebook, both in advertising. You may have heard of Lookalike; let me briefly recap. An advertiser gives you a batch of seed users — they only want to show the ad to users like these — but the advertiser's knowledge is limited to those seed users, while our platform has many more users, so we can expand the audience intelligently. At ByteDance this is called smart audience expansion; at other companies it may be called Lookalike or custom audiences. You give me a group of people, I automatically find another, similar group and serve the ad to them as well, and your revenue goes up. These are applications of Node2Vec in this area.

Some students asked: what tricks does Node2Vec need in practice, given that out of the box the real-world gains can be negligible?

For Node2Vec, we generally use the algorithm as-is; as with LINE, what matters in practice is weighting the samples — that is, the graph-construction method. The improvement does not come from changing the algorithm but from improving the graph construction.

For example, in advertising we can build a graph with click edges and conversion edges. Conversion edges and click edges should carry different weights: if a click edge has weight 1, a conversion edge may have weight 5, 10, or 20 — this needs tuning — and the weights then translate into sampling probabilities.
For instance, a user may click some ads a lot but rarely convert, while converting well on other ads. Ad optimization ultimately optimizes advertiser value, and conversions carry that value, so we let the random walk spend more time on the ads the user is willing to convert on.
So for Graph Embedding, the most important thing is actually not these algorithms but your graph construction — the underlying infrastructure.

1.4.5 GE's Struc2Vec

Then let's take a look at this Struc2Vec. First, review the difference between structural similarity (DFS) and content similarity (BFS).

We just introduced that the definition of content similarity means that adjacent nodes are more similar, similar to BFS.

So what about structural similarity, and where do the previous methods fall short? Let's look at this u and this v:

If the local environments around their positions in the graph are similar, then u and v should be considered similar — this is structural similarity. Can DFS capture it? To a certain extent, because a random walk can run for a long time; but if the graph is very large and u and v are far apart, a walk starting from u may never reach v, so DFS has real trouble capturing structural similarity.
So what does Struc2Vec do? It goes a step beyond Node2Vec and defines structural similarity through a layered (hierarchical) construction.

One of the biggest applications of Struc2Vec is in risk control, where it is considered standard equipment (for example, in Ant Financial's presentation at the Artificial Intelligence Conference, AUC jumped qualitatively from 70 to 90 compared with Node2Vec; we will explain why later). Compared with Node2Vec, Struc2Vec is simply better suited to risk-control scenarios.

Let's first introduce how a Struc2Vec is made. For example, after we have such a graph, how do we define structural similarity?


Struc2Vec gives a solution

  1. Compare u and v from another perspective: do they have similar neighbors? If their first-order neighbors are similar, compare whether their second-order neighbors are similar; if those are still similar, compare the third-order neighbors, and so on — eventually we find where they diverge. If u and v are still very similar at, say, the 10th layer, then they really are very similar;

  2. If u and another node x are similar at the first and second order but not very similar at the third order, then the similarity between u and x is not as great as that between u and v. That is the idea.

To elaborate further, denote the first-order neighbors of u by R_{1}(u) (the figure is from here).

Struc2Vec compares, order by order, the similarity of the k-hop neighborhoods R_k(u) and R_k(v) of u and v:

  • s(S) denotes the sequence obtained by sorting the degrees of the nodes in a node set S ⊆ V from small to large (each degree lies in [0, n-1]);
  • R_k(u) is the set of nodes at shortest-path distance k from u, and R_k(v) is the set of nodes at shortest-path distance k from v;
  • f_{k}(u,v) is the structural distance between u and v when neighborhoods up to k hops are considered: it equals f_{k-1}(u,v), the distance when considering up to (k-1)-hop neighborhoods, plus the distance contributed by the k-hop neighborhoods alone, with initial value f_{-1} = 0;
  • g(D1, D2) measures the distance between two sorted degree sequences D1 and D2; s(R_{k}(u)) and s(R_k(v)) are the sorted degree sequences of the nodes at distance k from u and from v respectively (the two sequences may have different lengths, and DTW — dynamic time warping — can be used to compute the distance between two ordered degree sequences).

Then we have:

f_{k}(u,v) = f_{k-1}(u,v) + g(s(R_{k}(u)), s(R_{k}(v)))

As k grows (k is what I earlier called the order of similarity), it is like the receptive field of a CNN growing: at first order the receptive field is 1, at second order it becomes 2, and so the similarity is compared bottom-up.
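To make g(D1, D2) concrete, here is a minimal DTW sketch for comparing two sorted degree sequences of different lengths (the degree sequences are illustrative; the max/min - 1 per-element cost is the choice commonly attributed to the struc2vec paper and should be treated here as an assumption):

```python
import numpy as np

def dtw_distance(seq_a, seq_b, cost=lambda a, b: max(a, b) / min(a, b) - 1.0):
    """Plain dynamic-time-warping distance between two ordered degree sequences.
    The per-element cost assumes degrees are >= 1."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(seq_a[i - 1], seq_b[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# g(s(R_k(u)), s(R_k(v))): distance between the sorted degree sequences of the k-hop rings.
deg_u = sorted([2, 3, 3, 5])   # illustrative s(R_k(u))
deg_v = sorted([2, 3, 4])      # illustrative s(R_k(v)) -- lengths may differ
g = dtw_distance(deg_u, deg_v)
# f_k(u, v) = f_{k-1}(u, v) + g
print(g)
```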

To be honest, when Teacher Wu talked about this I did not get it right away. Bluntly, it is a bit like comparing two upside-down pyramids; even more bluntly, it is like comparing the friends in your second, third, fourth, and fifth circles. If the friends in all of those circles are similar, then — birds of a feather flock together — how could the two people not be similar?

OK, we now have the structural distance of each node pair at each k. What next?

The first step is to build a layered (multi-layer) network. For further explanation, I quote Xiamo Shangshuyingliang's explanation:

  1. Within the same layer, the weight of the edge between two nodes is w_{k}(u,v) = e^{-f_{k}(u,v)},\ k = 0,...,k^{*}, where k^{*} denotes the diameter of the neighborhood considered. Since "the more similar u and v are, the smaller (closer to 0) their structural distance f_{k}(u,v)", a more similar pair (u, v) gets a larger weight in that layer's graph, reaching the maximum value 1 when f_{k}(u,v) = 0. That is, in each layer's graph the edges encode the structural similarity of node pairs, and greater similarity means greater weight.
  2. Layers are connected by directed edges. Specifically, for any node u_{k} in the k-th layer, there are directed edges (u_{k},u_{k-1}) and (u_{k},u_{k+1}), with weights w(u_{k},u_{k+1}) = log(\Gamma _{k}(u)+e),\ k = 0,\cdots ,k^{*}-1 and w(u_{k},u_{k-1}) = 1,\ k = 1,\cdots ,k^{*};
  3. where \Gamma _{k}(u) is the number of edges incident to u in layer k whose weight exceeds the average edge weight of that layer; concretely, \Gamma _{k}(u) = \sum_{v\epsilon V}1(w_{k}(u,v) > \bar{w}_{k}), where \bar{w}_{k} is the average of all edge weights in layer k.

To put it plainly, \Gamma _{k}(u) indicates how many nodes in layer k are similar to node u. If u is similar to many nodes, it probably means we are at a low layer that considers too little information, so \Gamma _{k}(u) will be large and hence w(u_{k},u_{k+1}) > w(u_{k},u_{k-1}); in that case the nodes in this layer are not suitable as context, and the walk should consider jumping to a higher layer to find a suitable context.

The second step: the multi-layer network M built in the previous step exists to find suitable contexts, and the contexts are found the same way as in DeepWalk — by sampling random walks.

To quote Moshang Shuyingliang again :

Specifically, suppose the random walk reaches node u_{k}; then its possible next nodes are u_{k+1}, u_{k-1}, v_{k}, w_{k} (see the left half of the figure above). The probability of staying in the current layer and continuing to walk is q, and the probability of jumping layers is 1-q.

In this layer, the probability of walking to other nodes is related to the weight of the edge, that is, the probability is obtained by normalizing the weight once:

p_{k}(u,v) = \frac{e^{-f_{k}(u,v)}}{Z_{k}(u)}

The denominator is the normalization factor:

Z_{k}(u) = \sum_{v\epsilon V,\, v\neq u} e^{-f_{k}(u,v)}

If a layer is skipped, it is also divided into two directions, and the probability is also related to the weight of the edge:

p_{k} (u_{k},u_{k+1}) = \frac{w(u_{k},u_{k+1})}{w(u_{k},u_{k+1}) + w(u_{k},u_{k-1})}

p_{k} (u_{k},u_{k-1}) = 1 - p_{k}(u_{k},u_{k+1})

Note that at higher layers, because the neighborhood considered is wider, the structural-similarity computation is stricter. Two nodes judged structurally similar at a low layer are therefore not necessarily similar at a high layer, and for many node pairs f_{k}(u,v) may not even be defined at a high layer. As a result, if the random walk jumps to a higher layer at some node u, then within the walk sequence the nodes to the left of u were drawn from layer k while the nodes to the right are drawn from layer k+1, so the candidate sets on the two sides differ.

In other words, a node may appear on the left but not on the right: although it is similar to u at layer k, f_{k+1}(u,v) may be undefined or too large at layer k+1, so the probability of the walk reaching it at layer k+1 is almost negligible.

In addition, the source code shows that when jumping layers at node u, the node is added to the walk only once (as u), rather than adding both u_{k} and u_{k+1}. With the random walks sampled in the previous step, you can directly apply SGNS (skip-gram with negative sampling) from word2vec to learn the model.

———————————————————
Let's discuss why Struc2Vec beats Node2Vec in risk-control scenarios — say you were at Ant Financial and came across Struc2Vec, would you know when to use it? The essential reason is this: Node2Vec obtains sequences by balancing DFS and BFS sampling and then does Graph Embedding, so it cannot capture long-range structural similarity; Struc2Vec can, and risk control is exactly the scenario that needs this kind of structural similarity.

  • For example, you have many relationships in Alipay and Taobao — you follow many merchants, big Vs, and big institutions, and they all have many edges. Their strong repayment ability does not mean your repayment ability is strong; but if one merchant is of the same scale as another merchant, then their repayment or borrowing capabilities really are more alike — and the same holds in the social domain.
  • Likewise on Weibo: you follow many big Vs, yet you may not be very similar to other followers, but between one big V and another, their influence on the internet may indeed be similar. Second, notice the Z_k again — we have seen this kind of Z many times and should be sensitive to it. In this paper the partition function is handled with hierarchical softmax. From Word2Vec we know there are two ways to deal with Z: hierarchical softmax and NCE, right?

Nowadays we basically use NCE and hierarchical softmax is rarely mentioned, so why can this paper use hierarchical softmax instead of NCE? First, could NCE be used? Certainly — whenever you see a partition function Z_k you can use NCE — but hierarchical softmax also works here. Why?

Think about how Word2Vec's hierarchical softmax is built: through a Huffman tree, right? What are the weights of the Huffman tree? Word frequencies. In NLP word frequencies are naturally very uneven — words like "I", "love", "it", "you" are extremely frequent, while words like "appreciate" and many long words are infrequent — so the resulting Huffman tree is far from balanced.

But for Graph Embedding, every node in the graph has the same "frequency", so the Huffman tree built from them is balanced, which is why hierarchical softmax works well here, whereas in Word2Vec the effect is less ideal. That is the essential reason.

1.4.6 GE's GraphSAGE

Next, introduce this GraphSAGE algorithm


I introduced Struc2Vec not because it is widely used in recommendation systems, but to convey the concepts of structural similarity versus content similarity. GraphSAGE, on the other hand, is a very sharp weapon for large-scale deployment in recommendation systems and advertising.

Here is another pair of concepts: inductive learning versus transductive learning. All the algorithms introduced so far are transductive.

  • Transductive learning means: given a graph, you can use any algorithm — Node2Vec, Struc2Vec, LINE, any graph representation method — to learn an embedding for each node and store it in a lookup table. Those embeddings are then frozen, and once the graph changes you have to relearn them;
  • GraphSAGE is the representative of inductive learning: while learning your embeddings it also learns a methodology for producing embeddings, so when the graph changes it can respond quickly — even for a node it has never seen before, it can aggregate that node's neighbors with the learned methodology and produce its embedding. This matters a lot in recommendation systems, because their graphs are, first, very large and, second, changing very fast: every user action affects the graph.

Next, let’s take a more specific look at GraphSAGE. Compared with the previous ones, it is very easy to understand. This algorithm is divided into two steps, the first is sampling, and the second is aggregation.

  1. Randomly sample the neighbors: decide how many first-order and how many second-order neighbors to pick. For example, set 10 first-order neighbors; in this picture only three exist, so we pad up to 10, while in the second-order neighborhood only 5 are taken;
  2. Generate the vectorized representation/embedding of the target node: first aggregate the characteristics of the second-order neighbors to generate the embedding of the first-order neighbors, and then aggregate the first-order embeddings to generate the embedding of the target node to obtain the second-order neighbor information;
  3. The embedding of the target node is used as the input of the fully connected layer to predict the label of the target node.

Let's look at this algorithm in detail

Here h^{0}_{v} = x_{v} is the original input embedding. In a recommendation scenario it is the original recommendation embedding; or you can first learn an embedding with another method and then let GraphSAGE learn the network structure on top of it. V is the set of all nodes, and K (capital) is the number of levels you want to sample;

  • We just sampled first-order and second-order neighbors. Starting from the first layer, for each node u you sample several of its neighbors; the sampled neighbors then pass through an aggregation function, producing the aggregated neighbor representation at layer k — so you first aggregate on the neighbor side;
  • After the aggregation, you still need to combine with the center node itself — in effect, all Graph Embedding methods implicitly assume a self-loop, i.e., the node is connected to itself. You concatenate node v's own representation with the aggregated neighbor representation, pass it through a neural-network layer, apply the activation, and finally normalize. So the algorithm is very simple, simple and effective (a minimal sketch follows).
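Here is a minimal sketch of one GraphSAGE layer with a mean aggregator, following the sample-then-aggregate description above (the dimensions and the fixed-size padded neighbor tensor are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

class SAGEMeanLayer(torch.nn.Module):
    """One GraphSAGE layer with a mean aggregator:
    h_v^{k} = norm( relu( W * concat(h_v^{k-1}, mean_{u in N(v)} h_u^{k-1}) ) )."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = torch.nn.Linear(2 * in_dim, out_dim)

    def forward(self, h_self, h_neighbors):
        # h_self: [num_nodes, in_dim]; h_neighbors: [num_nodes, num_sampled, in_dim] (padded samples)
        h_agg = h_neighbors.mean(dim=1)                    # aggregate the sampled neighbors
        h = torch.cat([h_self, h_agg], dim=-1)             # concat with the center node itself
        h = F.relu(self.linear(h))
        return F.normalize(h, p=2, dim=-1)                 # the final norm step

# Illustrative usage: 4 target nodes, 10 sampled (padded) neighbors each, 16-dim inputs.
layer = SAGEMeanLayer(16, 32)
h_self = torch.randn(4, 16)
h_neigh = torch.randn(4, 10, 16)
print(layer(h_self, h_neigh).shape)   # torch.Size([4, 32])
```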

Let's briefly talk about this sampling algorithm.


For sampling, we generally use random sampling: just sample a fixed number of nodes at each of the K orders. For aggregation there are several choices: mean pooling, max pooling, and the LSTM aggregator mentioned in the paper (to learn what an LSTM is, click this article).

For the LSTM aggregator, a reminder: you need to shuffle the neighbors each time. An LSTM is a sequence model, but sampled neighbors have no natural order — the first time you may get 1 2 3 4 and the second time the same nodes as 4 2 3 1 — so shuffling the neighbors each time removes the dependence on order.
We will return to GraphSAGE later: its incarnation in recommendation is the PinSage algorithm, which has been deployed in many recommendation scenarios.

1.4.7 GE's GraphWave

The last one, GraphWave, I will only show briefly. It is also built on the assumption of structural similarity, and it is completely unsupervised.


Recall Struc2Vec: what was the assumption there? The prior knowledge is that a node's structural role can be expressed by the degree sequences of its neighbors, organized hierarchically — a prior that you supply. GraphWave needs no such prior: give it a graph and it derives the embeddings directly by other means.
Take a look at the results: on the left is an algorithm not discussed today, in the middle is Struc2Vec, and on the far right is GraphWave. GraphWave separates the nodes very cleanly — each color represents a class and the classes are clearly distinguished — so the effect is quite good, and it is completely unsupervised. If you need structural similarity in certain scenarios, it is worth a try.

1.4.8 GE Summary


That wraps up the theory. In fact, the most important question with Graph Embedding is how to choose: as the panorama at the beginning showed, the family contains many algorithms, and the choice must be tied to your actual problem.

In short, which model you choose depends on your problem, and the most important thing is graph construction — remember this. In the graph neural network field, not every graph comes ready-made; even a social network, although a natural graph, still needs refinement, for example prior processing of the edge weights. The same holds in e-commerce. Constructing the graph well is the most critical step in solving your problem.

1.5 GCN of graph neural network


I will not introduce GCN in much depth here, because GCN involves spectral graph theory and spectral graph convolution, which require more background; I will only give you some analogies:

  • GE is analogous to Word2Vec in NLP, while GCN is analogous to CNN;
  • GE generally corresponds to two-stage modeling: first use an algorithm such as LINE to generate embeddings on a graph, then hand them to the upper-level recommendation task as features. GCN is generally End2End: for example, it can do a classification directly while generating the embeddings;
  • With GCN you can also do supervised learning — say link prediction or node prediction — and after training, the top-level embeddings produced by GCN can themselves be used as a kind of GE for upper-level tasks. In a broad sense GCN actually belongs to GE; for example, many people call the GraphSAGE we just discussed a GCN, though I prefer not to, because I like to treat GCN separately: GE methods are generally single-layer networks, such as Word2Vec and LINE, with relatively few parameters;
  • Node2Vec likewise uses the Word2Vec machinery, a single-layer network, whereas GCN can go very deep, stacked layer by layer just like CNN. Just remember that GCN can also be used to generate embeddings, and in that sense it is not much different from GE.

Some friends say GCN is too slow because it convolves over the whole graph, and that for now GCN lacks strong industrial application scenarios. If you are interested in the theory, you can read further.


There is a PPT on GitHub, as well as this document; if you want to understand the theory, take a look. Let me say a little more: GCN methods fall into two categories:

  1. One is spectral convolution, which performs the convolution transformation in the Fourier (spectral) domain; this requires deeper mathematical theory;
  2. The other is non-spectral (spatial) convolution, which convolves directly on the graph — GraphSAGE can be thought of as this kind. Spectral convolution has a unified theory but is constrained by the Laplacian operator: its theoretical core is the graph Laplacian L. Spatial convolution is more flexible, and its main difficulty lies in neighborhood selection — for example, is there a better neighbor-selection algorithm for GraphSAGE than random sampling, perhaps some weighted selection? These are tricks of industrial practice; there are many models in different styles and no unified theory, so it is more a matter of practice. A sketch of a basic GCN propagation rule follows.
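For concreteness, here is a minimal sketch of the widely used GCN propagation rule H^{(l+1)} = σ(D̂^{-1/2} Â D̂^{-1/2} H^{(l)} W^{(l)}) with Â = A + I. This specific rule is the Kipf & Welling formulation, included as a common reference point rather than something covered in the lecture:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = relu(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # D̂^{-1/2}
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)         # ReLU

# Illustrative 4-node graph, 8-dim input features, 16-dim output.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.random.randn(4, 8)
W = np.random.randn(8, 16)
print(gcn_layer(A, H, W).shape)   # (4, 16)
```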

 

2 Application of GNN in Recommendation/Advertising

Now let's enter today's practical part.


Let me walk you through some application cases of GNN in recommendation/advertising.

2.1 End2End modeling and two-stage modeling


Let me first explain the paradigms: there are two major paradigms for applying GNN in recommendation and advertising.

  • One is two-stage
  • The other is End2End

For End2End:

  1. First, you have a Graph. Taking product recommendation as an example, you can build a homogeneous graph of products (a graph between items), or a heterogeneous graph between Users and Items with edges for clicks, favorites, add-to-cart, follows, purchases and returns, similar to a knowledge graph. Then there are Other Features, the normal ones: for example, when you do recall, the usual User-side and Item-side features; when you build a two-tower model, the features of both sides plus the match features you want to match on;
  2. These two are used as input, and your model has two parts in the middle: a GNN that processes the Graph, and your normal model that processes the other features;
  3. The upper layer can then do classification, ranking or something else.

There are still two ways to arrange the middle part:

  • The first is that this Model is the main model and the GNN is an additional component. For example, you want to build a two-tower model for recall: as shown in the figure above, the left side is the User-side vector, the right side is the Item side, and a logit is produced at the top. Now there is also a graph, so at training time, every time a User comes in, you look that User up in the graph, find its neighbors, and likewise for the Item; all of this happens in real time during training. The lookup of user/item neighbors in the graph goes through the GNN, that is, the GNN gives you the Graph Embedding of the User and the Graph Embedding of the Item, and attaches them to the user and item. Then you go up and take the dot product to get the logit. This is how to embed a GNN in a two-tower model (a minimal sketch follows this list).
  • The other way is that the GNN is the main model; we will introduce a case later where GNN is the main model, and your other features are placed inside the graph. You directly generate a Graph Embedding, and that Graph Embedding needs no further operations: take the dot product directly and use it as the similarity.
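Here is a minimal sketch of the first arrangement, a two-tower recall model where a GNN-produced Graph Embedding is simply concatenated with the ordinary features on each side; all shapes, weights and the tiny MLP tower are hypothetical stand-ins, not the actual production model.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x) + eps)

def tower(features, weights):
    """A toy MLP tower; `weights` is a list of layer matrices."""
    h = features
    for w in weights:
        h = np.maximum(h @ w, 0.0)   # dense + ReLU
    return l2_normalize(h)

# hypothetical inputs: ordinary features concatenated with a GNN-produced
# Graph Embedding looked up (or computed on the fly) for the same user/item
user_feat = np.random.randn(32)    # normal user-side features
user_ge   = np.random.randn(16)    # user Graph Embedding from the GNN
item_feat = np.random.randn(32)
item_ge   = np.random.randn(16)

user_vec = tower(np.concatenate([user_feat, user_ge]),
                 [np.random.randn(48, 64), np.random.randn(64, 32)])
item_vec = tower(np.concatenate([item_feat, item_ge]),
                 [np.random.randn(48, 64), np.random.randn(64, 32)])

logit = user_vec @ item_vec        # dot product gives the matching score
```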

Then there is the two-stage paradigm. Two-stage does not yet have a particularly mature application in industry, and the applications that do exist are not very public, so it is not easy to discuss in detail.
For example, there is a Graph. I first pretrain a GNN on it to get Graph Embeddings and store them; you can understand this as storing them in a KV store or in a database.

Suppose we store them as KV: each user has an Embedding, each ad/Item has an Embedding, stored like that, together with the Other features. These stored embeddings are then fed into the main model as features; during fine-tuning, for instance, the main model can fine-tune on top of them. But the main model keeps updating, and the GNN is also being updated, so how do you resolve the conflict between these two update cycles? That is a key issue. On top of this you can then do Ranking and other things.
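Here is a minimal sketch of the two-stage flow described above, with a plain Python dict standing in for the KV store; the ids, embedding values and fallback behavior are all invented for illustration.

```python
# Stage 1: pretrain a GNN offline and dump embeddings into a KV store
# (a plain dict stands in for Redis / an embedding table here).
kv_store = {
    ("user", 101): [0.12, -0.48, 0.33],   # pretrained Graph Embeddings
    ("item", 501): [0.91,  0.05, -0.27],
}

# Stage 2: the main ranking model treats the stored embedding as just
# another input feature; it may stay frozen or be fine-tuned.
def build_features(user_id, item_id, other_features):
    user_ge = kv_store.get(("user", user_id), [0.0, 0.0, 0.0])  # fallback for cold nodes
    item_ge = kv_store.get(("item", item_id), [0.0, 0.0, 0.0])
    return other_features + user_ge + item_ge

x = build_features(101, 501, other_features=[1.0, 0.0, 3.5])
# x now feeds the main model; keeping the KV refresh cadence consistent with
# the main model's updates is exactly the consistency problem mentioned above.
```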

These are the paradigms. At present, all industrial applications stay within the scope of these two paradigms. Now let me introduce an article from 2018.

2.2 GCMC of End2End


The idea of this article is very simple. When we make recommendations, the core theory of recommendation is matrix completion.

All recommendation can be understood as a matrix-completion problem: some entries are your ground truth, and the zeros are what you want to fill in.

We build this into a bipartite graph between the User side and the Item side, with connections between them. Given this graph, we run a Graph Embedding or GCN on it; either way, you end up with an Embedding UE for each User and an Embedding IE for each Item. Then you make a prediction: a similarity between user u_i and item i_j.

For example, suppose u_1 and i_2 are connected with edge weight 1. Then you want the similarity computed from their Embeddings to also come out as 1. Of course, this "1" is not taken literally: if u_1 is connected only to i_2, the weight is 1; but if u_1 is also connected to i_1 with weight 3, then the value recovered from the Embeddings for the u_1-i_2 edge should be 1 out of the total of 4, i.e. 1/4. That is what is being done. I forgot to mention that the GAE in this article uses a multi-layer GCN, which is not practical.

Why is it not practical? As some students just said, textbook GCN is based on the Laplacian operator, and the Laplacian operator covers the entire Graph, so there is no way to do sampling; every computation is done on the whole graph and messages are passed over the whole graph. The time cost is too high and the engineering complexity too great, so there is no way to deploy it. This is more of a PR article; you can just use it to understand how GCN can be used for recall.
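Assuming GCMC's idea as described (predict the missing matrix entries from the user and item embeddings produced on the bipartite graph), a minimal sketch of the prediction and a squared-error objective over the observed edges might look like this; the embeddings and ratings are toy values.

```python
import numpy as np

def predicted_rating(user_emb, item_emb):
    """Fill one cell of the rating matrix as a similarity of the two embeddings."""
    return float(user_emb @ item_emb)

def squared_error_on_observed(user_embs, item_embs, observed):
    """observed: list of (u, i, rating) ground-truth edges of the bipartite graph."""
    loss = 0.0
    for u, i, r in observed:
        loss += (predicted_rating(user_embs[u], item_embs[i]) - r) ** 2
    return loss / len(observed)

user_embs = np.random.randn(3, 8)          # UE for 3 users
item_embs = np.random.randn(4, 8)          # IE for 4 items
observed = [(0, 1, 1.0), (0, 0, 3.0)]      # e.g. u1-i2 has weight 1, u1-i1 weight 3
print(squared_error_on_observed(user_embs, item_embs, observed))
```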

2.3 PinSage of End2End

2.3.1 Composition practice of PinSage

Next, let's focus on this article, PinSage, which is based on GraphSAGE. This one has actually been deployed: recommendation and advertising teams across the industry are using their method and landing it in their own business scenarios. The algorithm is joint work between Pinterest and Stanford, a deployment of GNN on a large-scale Graph, and it has been in production for a long time.

I highly recommend this paper; it contains a lot of engineering practice.

We are algorithm engineers; our job is not just to study the mathematics or the algorithms. The title is engineer, which means that in everyday work you must have strong engineering ability. In a large company today, a first-authored paper, even at a top conference, only counts as a bonus.

In the end, interviews still require strong engineering ability as well as algorithm ability. Whether it is ByteDance, Alibaba or Tencent, anyone who has interviewed there knows this. For new graduates the requirements may be lower, but for experienced hires your engineering ability must be very strong.

Now let's go through it in detail, starting with graph construction. How did they build the graph? In the figure below, the left side represents pictures (pins) and the right side represents boards (favorites).

The business scenario is picture recommendation. You can think of it like Douyin, except that what you scroll through is not videos but pictures one by one. When you see pictures you like, you can add them to your boards: the pins are individual pictures and the boards are the favorites collections. If a picture is in a board, they are connected by an edge; if multiple pictures are in the same board, they are likely to be similar. It is such a bipartite graph.

This raises a question worth thinking about, the most basic one: how do you construct the graph? The graph determines our final effect. For example, how would you construct a graph for an e-commerce scenario:

  1. One way is to build only an Item graph;
  2. Another way is to build only a User graph;
  3. You can also build a heterogeneous graph between users and items, with edges for clicks, add-to-cart, likes, favorites, purchases and returns.

And how would you construct the graph for news recommendation? Clicks on news, then dwell time, reposts, comments, right? These constructions are more complicated than PinSage's; our traditional recommendation graphs are actually quite complex.

2.3.2 GNN model of PinSage


What does its GNN model look like? Let's first look at the picture on the left. It is based on GraphSAGE, and we already know GraphSAGE is built on sampling and aggregation. Let me walk you through it visually.

  1. First, suppose we target node A and want to end up with the Embedding of node A. How is this done? First sample A's adjacent points (neighbors); for example we sample three (the picture shows full sampling, but in general we sample a fixed number), so we have sampled B, C and D;
  2. Then for B, C and D we sample again: B samples A and C, C samples A, E and F, F samples C and E, E samples C and F, and D samples A; that gives the picture shown;
  3. After A has sampled B, C and D, assuming K = 2, we only take a two-hop neighborhood. Then we first aggregate all the sampled nodes, as shown in the rightmost part of the figure above. This step is done for B, C, D and A: B is obtained by aggregating A and C; C is aggregated from its four sampled nodes; D is aggregated from node A; and node A is aggregated from B, C and D. These can be computed in parallel, and the rightmost part can be discarded once the computation is done;
  4. Finally, use the Embeddings of B, C and D just computed and aggregate once more to get A, which gives h_A(2).

Specifically, let's look at the algorithm in detail.

First, you have all the nodes, the set V, and a batch of M nodes to process. Then you have a parameter K, the sampling depth we just mentioned (it was 2 a moment ago). In the end you want the Embeddings of all the nodes in the batch, and the algorithm starts by sampling.

So the whole process is divided into two parts, sampling and aggregation. First, the sampling algorithm (a minimal sketch follows this list):

  • The sampling part looks a little different from GraphSAGE: it is written in reverse. For example, with K = 2, suppose the batch M consists of the two nodes A and B; then S2 = {A, B}, the starting neighborhood;
  • What about S1? It first assigns S2 to S1, then for each node in S2 ({A, B}) it fetches its neighbors. In other words, S1 starts as {A, B}, then adds A's neighbors B, C, D, then adds B's neighbors A, C, and after deduplication the final S1 = {A, B, C, D};
  • And S0? S1 is already {A, B, C, D}; now collect the neighbors of A, B, C and D. In short, S0 = S1 ∪ (first-order neighbors of S1), i.e. ABCD plus BCD, AC, AEF, A, which after deduplication is {A, B, C, D, E, F}.
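A minimal sketch of this reversed, outside-in sampling, using the toy graph from the walkthrough; the neighbor dictionary and sample size are illustrative, and a real system would sample from a distributed graph service rather than a Python dict.

```python
import random

def sample_neighborhood_sets(graph, batch_nodes, K, num_samples):
    """Build S_K, ..., S_0 as in the text: S_K is the batch, and each
    S_{k-1} is S_k plus sampled neighbors of every node already in S_k."""
    S = {K: set(batch_nodes)}
    for k in range(K, 0, -1):
        prev = set(S[k])
        for node in S[k]:
            neighbors = graph.get(node, [])
            sampled = random.sample(neighbors, min(num_samples, len(neighbors)))
            prev.update(sampled)
        S[k - 1] = prev
    return S

# toy graph from the walkthrough: A-B, A-C, A-D, B-C, C-E, C-F, E-F
graph = {"A": ["B", "C", "D"], "B": ["A", "C"], "C": ["A", "B", "E", "F"],
         "D": ["A"], "E": ["C", "F"], "F": ["C", "E"]}
S = sample_neighborhood_sets(graph, batch_nodes=["A", "B"], K=2, num_samples=3)
# S[2] = {A, B}; S[1] adds their neighbors; S[0] adds neighbors of S[1]
```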

Then comes the aggregation algorithm, which works from the outside in, because S0 is the largest set and contains the most boundary nodes: everything the inner sets need is in S0. So S0's nodes are aggregated first, and once the inner sets aggregate from them, those results no longer change. Let's take a look (a minimal sketch of the CONVOLVE step follows this list):

  1. First, for h0, the layer-0 Embedding h_u of every node is just its original Embedding, over S0, the largest set. Then for k = 1 to K: when k = 1, for each node in S1 (that is, A, B, C, D), say A, consider its neighbors (they are already in S0), aggregate all of A's neighbors once, and run the CONVOLVE algorithm on the result. You can understand CONVOLVE as one aggregation step: to aggregate A, you take A's neighbors and the current Embedding of A, and after aggregation you get the next-layer representation of A. Then move on to S2;
  2. After this step, the Embeddings of A, B, C, D have all been computed: A from A's neighbors, B from B's, C from C's, D from D's. When you then move to the batch itself, only A and B are left. A's neighbors are B, C, D, and B's neighbors are A, C, so at the second layer you aggregate B, C, D to get A and aggregate A, C to get B, and the loop finishes. Afterwards one more transformation is applied to the whole batch. The CONVOLVE algorithm below is also very simple: first aggregate all the neighbors;
  3. Then pass the concatenated representation through a NN and an activation function, and finally Normalize it. That is the final result.
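A minimal sketch of one CONVOLVE step as described: pass the neighbors through a small network, pool them, concatenate with the node's own embedding, apply a dense layer and ReLU, then L2-normalize; the weights and dimensions are toy values.

```python
import numpy as np

def convolve(self_emb, neighbor_embs, w_agg, w_self):
    """One CONVOLVE step: aggregate neighbors, combine with self, normalize."""
    n_agg = np.maximum(neighbor_embs @ w_agg, 0.0).mean(axis=0)  # aggregate (mean pooling)
    combined = np.concatenate([self_emb, n_agg])                 # concat self + neighborhood
    out = np.maximum(combined @ w_self, 0.0)                     # dense + ReLU
    return out / (np.linalg.norm(out) + 1e-8)                    # L2 normalize

d = 16
self_emb = np.random.randn(d)
neighbor_embs = np.random.randn(3, d)          # e.g. B, C, D for node A
w_agg  = np.random.randn(d, d)
w_self = np.random.randn(2 * d, d)
h_a_next = convolve(self_emb, neighbor_embs, w_agg, w_self)
```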

That is the description of its GNN model.

2.3.3 Training of PinSage

Now let's talk about training. PinSage here is trained in a supervised way; in fact, both PinSage and GraphSAGE can be trained supervised or unsupervised, and this article chooses supervised. What counts as a positive example? After the user interacts with item_q, if he clicks another item within a set time window, say item_i within a few seconds, then i is considered a good candidate for q, so (q, i) is a positive example; everything else is a negative example. The model ultimately optimizes a max-margin ranking loss.
Let me add that there are many such ranking losses; you could also use a RankNet-style loss. This loss pulls positive pairs closer and pushes negative pairs farther apart, with the negatives obtained by negative sampling.
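A minimal sketch of a max-margin ranking loss of the kind described, over a query embedding, one positive and several sampled negatives; the margin value and embedding sizes are illustrative.

```python
import numpy as np

def max_margin_loss(q_emb, pos_emb, neg_embs, margin=0.1):
    """Push sim(q, positive) above sim(q, negative) by at least `margin`."""
    pos_score = q_emb @ pos_emb
    neg_scores = neg_embs @ q_emb          # one score per negative sample
    return np.maximum(0.0, neg_scores - pos_score + margin).mean()

q   = np.random.randn(32)
pos = np.random.randn(32)                  # item_i engaged right after item_q
negs = np.random.randn(5, 32)              # negatives from negative sampling
print(max_margin_loss(q, pos, negs))
```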

2.3.4 Serving of PinSage

Having covered training, let's talk about Serving. Serving has an optimization: the Embeddings are generated with MapReduce and then fed into ANN retrieval. I won't go into ANN retrieval; it is a common recall technique, a nearest-neighbor similarity search. So why generate the Embeddings with MapReduce?

Why is serving different from training here? For a normal network, whether an ordinary DNN or anything else, the only difference between training and serving is that we skip the backward pass. Here there is one more difference: the naive approach involves a lot of double counting.

For example, if you computed all the final Embeddings the same way as in training, you would compute the green node in the leftmost small picture once at the first layer; at the second layer, in the second small picture, you would compute that same green node again; and it gets computed here yet again. There is a lot of redundant computation, and when the graph is very large you cannot serve effectively. So they built a fairly simple MapReduce job; let me explain it.

At the bottom layer of the figure above, compute the Embedding of every node once, deduplicate, do the aggregation, and then move up. That is, instead of computing per target node, each layer is computed only once for all nodes and saved (because it is the same computation), and then you compute upward layer by layer: the output of one layer is the input of the next. If you have 10 layers, each layer is computed this way, and because MapReduce keys by node, deduplication comes for free.
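A minimal sketch of the layer-wise idea: compute layer k for every node exactly once and feed it to layer k+1, instead of re-expanding each target node's sampled tree at serving time; this toy version uses full neighborhoods and in-memory dicts rather than an actual MapReduce job.

```python
import numpy as np

def layerwise_inference(graph, h0, layers):
    """Compute each layer for ALL nodes once, then feed it to the next layer,
    instead of re-expanding every target node's subtree at serving time."""
    h = dict(h0)                                   # node -> embedding at layer 0
    for w_agg, w_self in layers:                   # one pass per layer
        h_next = {}
        for node, neighbors in graph.items():
            n_agg = np.mean([np.maximum(h[n] @ w_agg, 0.0) for n in neighbors], axis=0)
            combined = np.concatenate([h[node], n_agg])
            out = np.maximum(combined @ w_self, 0.0)
            h_next[node] = out / (np.linalg.norm(out) + 1e-8)
        h = h_next                                 # previous layer's output is next layer's input
    return h

graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
h0 = {n: np.random.randn(8) for n in graph}
layers = [(np.random.randn(8, 8), np.random.randn(16, 8)) for _ in range(2)]
final_embeddings = layerwise_inference(graph, h0, layers)
```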

The reason I recommend everyone read this article is that it contains a lot of industrial optimization; it is a very sincere paper on industrial best practice. A few examples:

  • The first is an optimization of the neighbor sampling algorithm: it is not uniform random-walk sampling but importance-based sampling, meaning the more important a neighbor is, the larger its sampling weight;
  • The second is warm-up. This is something we encounter when training with very large batches in industry. When your batch is very large, say tens of thousands of item-user-item pairs to optimize in one batch, you need a warm-up process, otherwise training easily goes off the rails;
  • The third is distributed asynchronous training. Many mainstream companies now have infrastructure such as a Parameter Server and TensorFlow. Training runs on multiple GPUs asynchronously: different gradients are computed on each GPU, then aggregated and sent to the Parameter Server for an asynchronous update. That means the next optimization step may not use the very latest parameters, because the gradient updates themselves are asynchronous;
  • Another is hard negatives. I once wrote a popular-science article called Hard Sample Mining. If you only use easy samples, real positives plus randomly sampled negatives, the model cannot learn particularly well, because such negatives are so easy that any network can separate them;
  • And another is MapReduce.

How are the hard negatives chosen? The positive example is (q, i). For the negatives: assume q is a query; when you issue the query, every item gets a rank. You then pick items ranked, say, 1000~2000 as its hard samples, select some of them, and combine them with a few more randomly sampled negatives. By adding hard samples, q can be distinguished from i more precisely.

Here is a question for everyone, because hard negative sampling is also a very good optimization trick in industry. Think about why, in the paper, they do not use hard negatives from the start but increase them step by step: at the beginning you use purely random negative sampling, and then you slowly increase the proportion of hard negatives.

What goes wrong if you use only hard negatives? Why not take the most difficult negatives right away and let the model learn better? There is actually a problem: if you try it, you will find that using only hard negatives from the beginning keeps the network from learning well. At the start its Embeddings are freshly initialized and have no discriminative ability, so raising the difficulty before the model can discriminate at all is not advisable. Adding them gradually is the common practice; a minimal sketch follows.
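A minimal sketch of ramping hard negatives in gradually, assuming a per-query ranked candidate list where ranks roughly 1000~2000 serve as the hard pool; the ratio schedule and numbers are illustrative, not the paper's exact recipe.

```python
import random

def sample_negatives(ranked_items, epoch, n_total=6, max_hard_ratio=0.5):
    """Start with pure random negatives, then slowly mix in hard negatives
    drawn from ranks 1000~2000 of the query's ranked candidate list."""
    hard_ratio = min(max_hard_ratio, 0.1 * epoch)          # grows with training progress
    n_hard = int(n_total * hard_ratio)
    hard_pool = ranked_items[1000:2000]                     # hard: close to the positive, but not it
    easy_pool = ranked_items[2000:]                         # easy: sampled at random from the tail
    return random.sample(hard_pool, n_hard) + random.sample(easy_pool, n_total - n_hard)

ranked_items = list(range(100000))          # pretend this is the ranked candidate list for query q
negs_epoch0 = sample_negatives(ranked_items, epoch=0)   # all random at the start
negs_epoch5 = sample_negatives(ranked_items, epoch=5)   # half hard negatives later on
```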

Some students may ask: how do you evaluate the quality of an Embedding?

  1. Generally we don't have a particularly good offline method. To really evaluate your Embedding you have to run an online experiment. Before that, look offline at whether the downstream model gets better AUC, better NDCG, and so on. If the offline metrics are better or at least on par, you can go online and run an A/B test.
  2. Of course there are other offline analyses, such as clustering the embeddings and looking at a visualization, but generally they are not very useful, more of a psychological comfort; you will still want to test online afterwards. So we usually judge by the online effect.

Finally, I want to emphasize the PinSage algorithm again. What I covered here may not be particularly thorough; there is still a lot of content in the paper, and I recommend everyone read it after class.

2.4 Taobao EGES of End2End


Next I will introduce how Taobao does Graph Embedding. I just talked about how to construct a graph for e-commerce; here is one concrete solution. This article is built around sessions: you can understand a session as a short window of activity, say 30 seconds, or a series of operations; if the user stops for a while and then comes back, that counts as a new session.

  1. If a user operates on a sequence of commodities within one session, they form a directed graph; for example the sequence D, A, B gives a directed path D → A → B;
  2. Sequences such as B, E and D, E, F contribute edges in the same way; that is the graph-construction scheme;
  3. After the graph is constructed, you can run a RandomWalk on it;
  4. Sample some sequences from the walks, and then learn a Graph Embedding from them.

This is Taobao's original Embedding: before the paper was published in 2018, they were presumably already using this kind of Graph Embedding for recall around 2016-2017. We generally recall through multiple channels, and this was definitely one of them.

Then around 2017-2018 they upgraded Taobao's Graph Embedding. The earlier version is simple to reason about: it only uses entity information, with no additional information, so they later added Side Info. Side Info refers to other attributes of the product, such as price, brand and so on. If you remember the paradigms from before, this uses the second one: GNN as the main model, with the other features placed as content and attributes inside the Graph.

Concretely, SI_i refers to an Embedding of the i-th item, the Embedding we just computed with the most primitive method. On top of that its attributes are added: for example, if the product belongs to a certain brand, the brand is one-hot encoded and mapped to an Embedding; keywords and price are handled the same way (price is usually bucketed first, then one-hot, then embedded); and then everything is concatenated upward, or combined directly.


The rest is the softmax that Word2Vec already solves, which you already know how to do. There is one more optimization here: an Attention mechanism.

Because some of your attributes may not be that important while others are very important, they add an Attention operation over the fields. In short, this practice works very well, and many companies do something similar; a minimal sketch follows.
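A minimal sketch of the attention-weighted combination: the base item embedding and its side-info embeddings are averaged with learned per-field weights (a softmax over trainable logits); shapes and names are illustrative rather than the exact EGES formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def eges_aggregate(item_emb, side_embs, field_logits):
    """Weighted average of the item embedding and its side-info embeddings
    (brand, price bucket, ...), with one learned attention weight per field."""
    all_embs = np.stack([item_emb] + side_embs)     # (1 + num_fields, d)
    weights = softmax(field_logits)                 # learned per-field importance
    return weights @ all_embs                       # (d,) final item representation

d = 16
item_emb = np.random.randn(d)                           # the base "SI" item embedding
side_embs = [np.random.randn(d), np.random.randn(d)]    # e.g. brand, price bucket
field_logits = np.random.randn(3)                       # trainable attention logits
h_item = eges_aggregate(item_emb, side_embs, field_logits)
```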

2.5 Two-stage case

Now, two-stage modeling. You should know about it because many companies use it. Take Alibaba as an example: Alibaba has a lot of data. Your data in Taobao and Tmall are similar, but if you work on Taobao you train your model with Taobao data, on Tmall with Tmall data, and likewise for 1688 and Koubei; these datasets are not shared. However, you can use the data of all sites to build one Graph, train a GE with a GNN, and then that GE can power recommendation across the different products.

  1. The simplest example: a user clicked on something in Taobao, say he went looking for a razor, but that item (or even that interest) is not visible on Tmall, not even to the recommendation system. If you use the data of the whole site when you train the GE, the model will know that this user actually likes razors, and he may be shown razors on Tmall soon after. This is the idea of transfer;
  2. For Tencent it is likewise full-site data: news, the browser, and QQ can all go into the graph;
  3. The same is true for ByteDance: Toutiao, Xigua Video and Douyin generally use this approach. Two-stage modeling is generally for large companies.

If you are a small company with just one app, you don't need it. Any questions about this part? Let's take a look at your questions.

Some students asked: is Side Info concatenated in order to build samples?
It is not about building samples. Once your samples are determined, you bring the Side Info Embeddings into the optimization: when you learn the item's Graph Embedding, you also learn the Embedding of its brand and the Embedding of its price, all jointly. As for how to choose the Margin, in industry it is generally just a tuned hyperparameter.

 

3 Introduction to GNN Architecture

In this last part, we introduce the components GNN needs when it lands in industry.

Large companies tend to build this in-house; small and medium-sized companies usually start from open-source software and do secondary development on top of it. At ByteDance or Alibaba this stack is self-developed; at other companies you may use open source.
For GNN, besides the algorithm, there are still many problems to solve:

  1. The first is how to store graphs that are too large. In recommendation and advertising, graphs with hundreds of millions of nodes are routine;
  2. The second is how to achieve high-performance graph operations. You sample over hundreds of millions of nodes and tens of billions of edges, which a naive setup cannot handle. Recall the paradigm I introduced just now: in the recall training phase you can even query a graph service directly, and that service samples for you in real time, computes an Embedding for you, and attaches it to the two towers. Think about how such an operation is implemented;
  3. And how deeply GNN is integrated with DL.

3.1 Ali High Performance Graph Service - Euler

Because Alibaba's high-performance graph service Euler is well built and open source, I will introduce the architecture based on it. Many companies develop this stack themselves; this part leans toward engineering.

Let's look at this picture first:

  1. The bottom layer is distributed storage of the Graph. A large graph cannot fit on a single machine, so it must be stored distributedly, generally as points and edges;
  2. Above that is a GQL layer, similar to SQL: you write a statement, it is converted into a directed acyclic graph, and the operations are executed over that plan;
  3. Further up is MP: Message Passing, an abstraction layer, because all GNN operations can be understood as message passing;
  4. Inside it are some basic operators, such as convolution, aggregation and Combine; the upper layers differ only in which convolutions, aggregations and Combines they compose;
  5. The top layer then implements some commonly used algorithms for quick research use.

3.1.1 Euler's distributed graph engine

Let me first talk about the distributed graph engine.

Suppose A, B, C, D form a graph. For distributed storage there are two ways:

  • One is partition by node
  • The other is partition by edge

The first, shown above, is node-based partitioning. Each node is hashed and, based on the hash, stored on a certain shard; for example, A and B land on shard1. For the convenience of sampling, each node's first-order neighbors are stored alongside it.

For example, when node A is stored, its first-order neighbors B and C are stored with it; the same for B. Then nodes C and D are stored on shard2, and C's first-order neighbor D is saved together with it.

The second is partition by edge. This is very simple: you number the edges, hash the edge ids, and store each edge on the resulting shard.

For example, the first shard stores the edges A→B and A→C, and the second shard stores the edge C→D. The two partitioning schemes each have their advantages:

  • The first is Node Partition. What is its benefit? Parallel sampling is convenient. When you have a batch, say A and C, distributed sampling is easy: you only need to find node A on shard1 and do neighbor sampling there, and find node C on shard2 and do neighbor sampling there. This is the basis of high-performance sampling;
  • With Edge Partition this is not so convenient, because node A's edges may be spread across shard1 and shard2 (for example, the edge A→C may live on shard2), so parallel sampling is not particularly easy.

But Node Partition has a problem: hot spots. This is very common in recommendation. Some nodes are very hot: some users naturally love to click ads or are very active, while many users are cold. If a hot user is stored on a certain shard, that shard carries a huge number of attached first-order neighbors and may not be able to cope.

How does industry generally handle this? There is no great answer; we simply cap the first-order neighbors, for example at 10,000, and beyond that do some random cropping. Edge Partition does not have this skew problem and is not affected by hot spots. A minimal sketch of node partitioning with a neighbor cap follows.
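A minimal sketch of node partitioning with co-located adjacency and a neighbor cap; the shard count, cap and hash scheme are illustrative.

```python
import random

NUM_SHARDS = 2
MAX_NEIGHBORS = 10000          # cap for hot nodes, as mentioned above

def shard_of(node):
    return hash(node) % NUM_SHARDS

def build_shards(adjacency):
    """Node partition: each shard stores its own nodes plus their (capped)
    first-order neighbor lists, so neighbor sampling never crosses shards."""
    shards = [dict() for _ in range(NUM_SHARDS)]
    for node, neighbors in adjacency.items():
        if len(neighbors) > MAX_NEIGHBORS:                       # random cropping of hot spots
            neighbors = random.sample(neighbors, MAX_NEIGHBORS)
        shards[shard_of(node)][node] = list(neighbors)
    return shards

adjacency = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
shards = build_shards(adjacency)
# sampling neighbors of A only touches shards[shard_of("A")]
```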

3.1.2 Euler's graph operation operators


How do we do high-performance neighbor sampling, and how is it implemented in practice, for example in Euler? Generally the graph is heterogeneous; Euler and other companies' systems support this kind of heterogeneous graph. I drew this diagram for you.

Look at this user-item graph. On the user's edges, 0 represents one kind of edge and 1 represents another kind. How do you interpret 0? In an advertising scenario, 0 can be read as a click edge and 1 as a conversion edge, and each type-0 edge has its own weight: for example, the number of times you clicked a certain item can serve as the edge weight, such as 0.5, 1, 0.8. The conversion edges can carry weights too. How do you design the weight of a conversion edge? You could use the final revenue brought by the conversion, or the number of conversions, or something else; that is for you to design.
Again, graph construction is very important. The graph must be built to fit your business scenario. Once the graph is built well, 80% of your work is done, maybe more than 80%, and it directly determines whether you can make money in the end.

Once we have such a graph, how do we sample from it? First, look at what it stores.


The second thing stored is Edge_type.

  1. We now have two edge types, 0 and 1, plus a Node_offset per node. Take the user node: Node_offset = 3 means the neighbors over edge type 0 end at position 3, and the neighbors over edge type 1 end at position 5;
  2. Then there is the prefix sum of the weights: for edge type 0 it is 2.3, that is 0.5 + 1 + 0.8 = 2.3, and adding the type-1 weights the total comes to 5.6;
  3. When you sample, suppose first that you don't distinguish edge types and just pick. A prefix sum is built over the neighbors: the prefix sum at item1 is 0.5, at item2 it is item1 + item2, at item3 it is all three added up, and so on;
  4. With that prefix sum, sampling without restricting the edge type needs only one random draw: take a random number in [0, 1), multiply by the maximum sum 5.6, and you get a number between 0 and 5.6;
  5. Then do a binary search, which works because the prefix sums are ordered. This can also be parallelized: combined with the Node Partition just described, it is very fast. If you want to restrict to a certain edge type, say only the type-0 edges, it works the same way;
  6. You first read Node_offset = 3 from the data structure, so you only consider positions up to 3 and multiply by 2.3 instead of 5.6. If you only sample type-1 edges, you still scale by the total but shift past the first 2.3, so the final range is from 2.3 to 5.6, and then you binary-search as before. That is the sampling design; a minimal sketch follows this list.
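A minimal sketch of the weighted, type-aware neighbor sampling just described, reusing the example numbers (type-0 weights 0.5, 1, 0.8 summing to 2.3, grand total 5.6, Node_offset = 3); the type-1 weights are invented so the totals match.

```python
import bisect
import random

# neighbors of one user node, grouped by edge type, with edge weights
neighbors   = ["item1", "item2", "item3", "item4", "item5"]
weights     = [0.5, 1.0, 0.8, 1.5, 1.8]     # first three are type-0 edges, last two type-1
type_offset = 3                              # Node_offset: type-0 neighbors end at index 3

prefix = []
total = 0.0
for w in weights:                            # prefix sums: 0.5, 1.5, 2.3, 3.8, 5.6
    total += w
    prefix.append(total)

def sample_neighbor(edge_type=None):
    if edge_type == 0:                       # only type-0: draw in [0, 2.3)
        r = random.random() * prefix[type_offset - 1]
    elif edge_type == 1:                     # only type-1: draw in [2.3, 5.6)
        r = prefix[type_offset - 1] + random.random() * (prefix[-1] - prefix[type_offset - 1])
    else:                                    # no type restriction: draw in [0, 5.6)
        r = random.random() * prefix[-1]
    return neighbors[bisect.bisect_right(prefix, r)]   # binary search into the prefix sums

print(sample_neighbor(), sample_neighbor(0), sample_neighbor(1))
```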

3.1.3 Euler's GQL query interface (Euler 2.0)

Then there is the GQL language. I won't go into detail; roughly, it is a language designed to look like SQL, because with SQL, as we all know, once you have written the statement the underlying engine can do all kinds of optimizations.

The same is true for GQL. After you write the query it is compiled into a DAG, a directed acyclic graph, and the operations on the graph (sampling, fetching attributes, and so on) are executed over that plan. On that plan it can optimize further, for example deduplicating repeated sampled nodes or pruning unneeded attribute fetches; these optimizations are part of the GQL design.

3.1.4 Euler's MP abstraction layer

The third module is the MP abstraction layer. We said all GNNs can be viewed in one framework; let's set spectral GCN itself aside, since it has no real production scenario, and look at GraphSAGE.

GraphSAGE can be divided into three modules, and generally we don't need the third one.
The first module is subgraph sampling: you take a few nodes and then take a few neighbors of each node; that is how you walk outward. The second is the graph convolution layer, which does the real aggregation of GraphSAGE: once you have the neighbors, how do you aggregate them?
The third is an optional pooling module; generally we don't need it, and most industrial deployments currently don't use it.

Here is a list of these pieces and what they look like.

First, subgraph sampling. Subgraph sampling is what we just described, and it is composed from the layers beneath MP: GQL, the distributed graph storage, and the high-performance sampling module.

You sample the yellow nodes first (that is random sampling), and then, for each of them, their neighbors are sampled.

Coming back to this: it is like sampleN, which randomly samples N nodes, followed by sampleNB, which, given those N nodes, samples their neighbors; the B means neighbor, so it samples a count of neighbors for each node.

Correspondingly, on this side, you sample the yellow nodes first, then sample some neighbors for each yellow node, and you finally obtain such a subgraph.


Next you perform the convolution operation, and with that the subgraph-sampling part is complete.

Then the graph convolution part. The subgraph has already been sampled; say we have just one yellow node x1, which sampled three neighbor nodes. There are then three functions to implement (a minimal sketch follows this list):

  1. These are generally implemented as virtual (abstract) classes: if you want the framework to do something for you, you inherit the class and override the functions. The first is the message-passing function, which defines how messages are transmitted from neighbors to the target node, for example from x2 to x1 or from x3 to x1. In GraphSAGE, each neighbor's representation is passed through a NN and then aggregated with x1;
  2. The second is the message-aggregation function: after each neighbor has passed its message, how are they combined? GraphSAGE folds this into the first step, but you can also keep it separate: multiply x2 by a matrix, x3 by a matrix, x4 by a matrix, and then do a sum pooling or average pooling; that is aggregation;
  3. After aggregation, how is x1 updated? In GraphSAGE you concatenate the original x1 with its aggregated neighborhood, pass it through a W and then a ReLU. That is GraphSAGE's update; of course you can implement all kinds of variants.
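A minimal sketch of such an abstraction: a base class whose message, aggregate and update methods can be overridden, shown here with GraphSAGE-style defaults; class and parameter names are illustrative, not Euler's actual API.

```python
import numpy as np

class MessagePassingLayer:
    """A minimal message-passing abstraction: subclasses override the three
    functions (message, aggregate, update) that the text describes."""

    def message(self, h_neighbor, w_msg):
        return np.maximum(h_neighbor @ w_msg, 0.0)          # pass each neighbor through a NN

    def aggregate(self, messages):
        return np.mean(messages, axis=0)                     # sum / average pooling

    def update(self, h_self, agg, w_update):
        out = np.maximum(np.concatenate([h_self, agg]) @ w_update, 0.0)  # concat + W + ReLU
        return out / (np.linalg.norm(out) + 1e-8)

    def forward(self, h_self, neighbor_embs, w_msg, w_update):
        msgs = np.stack([self.message(h, w_msg) for h in neighbor_embs])
        return self.update(h_self, self.aggregate(msgs), w_update)

d = 8
layer = MessagePassingLayer()
x1 = np.random.randn(d)                       # the yellow target node
neighbors = np.random.randn(3, d)             # x2, x3, x4
h1_next = layer.forward(x1, neighbors, np.random.randn(d, d), np.random.randn(2 * d, d))
```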

The top algorithm layer implements many algorithms, including knowledge graphs, heterogeneous graphs, and common representation-learning methods; you can pick what you need.

 

Q&A

Q: Besides GNN, are there other unconventional recall methods that perform well in industry?

At present our recall is generally done with two-tower models; that is the mainstream at this stage.
Some companies also have approaches they don't make public. ByteDance and Alibaba, for example, have self-developed, efficient recall algorithms, because two-tower recall has limitations: the user side and the item side are computed separately, so there are no match features. That is one direction;
the second is self-developed structure learning, a term appearing in recent academia and industry: instead of learning final Embeddings, taking dot products and doing recall, you learn a path-like structure for each user and each item, and recall then operates on those structures.

Then there are some engineering optimizations:

  1. For recall, because the candidate set is very large, there are some common optimizations. One is ANN: everyone knows ANN for nearest-neighbor retrieval, with HNSW being the usual choice, so I won't say more;
  2. Another industry practice is quantization (quant). The Embeddings we end up with are floats for the final dot product, but we can convert them to int8. With int8 you may not even need ANN retrieval: for a 10-million-item ad library, or hundreds of millions of recommendation candidates, you can add a few more machines and score everything by brute force with higher throughput (see the sketch after this list);
  3. Then there is PQ (product quantization), which corresponds to the structure-learning idea above; if you are interested you can research it yourself, I won't go deeper here.
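A minimal sketch of the int8 idea: symmetric per-vector quantization followed by brute-force integer dot products over the whole candidate set; scales, sizes and the scoring loop are illustrative and ignore the SIMD/engineering details that make this fast in practice.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric quantization: map floats to int8 with one scale per vector."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    return np.round(x / scale).astype(np.int8), scale

def brute_force_topk(user_vec, item_matrix, k=10):
    """Quantize both sides, score every item with integer dot products,
    rescale, and take the top-k -- no ANN index needed."""
    q_user, s_user = quantize_int8(user_vec)
    q_items, s_items = zip(*(quantize_int8(v) for v in item_matrix))
    q_items = np.stack(q_items)
    scores = (q_items.astype(np.int32) @ q_user.astype(np.int32)) * np.array(s_items) * s_user
    return np.argsort(-scores)[:k]

user = np.random.randn(64).astype(np.float32)
items = np.random.randn(100000, 64).astype(np.float32)   # a (scaled-down) candidate set
topk = brute_force_topk(user, items)
```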

A few quick answers on ad recall, taking the questions from the top down. Whether the stand-alone version of Euler can handle 100,000 samples, I don't know; we run everything distributed, generally with hundreds of thousands of nodes per machine and small pods.
How long has GNN been used in our recommendation scenario? About half a year. GNN appears to be a fairly large effort inside Alibaba, because the Euler team is doing very well, so I won't say more about it.

Q: What is the iteration path for this kind of model at the beginning?

At the beginning, start with the simplest version: train one directly with LINE, or train one with GraphSAGE, depending on whether you want two-stage or End2End modeling. I introduced the paradigms just now; follow one of them and model accordingly.

Q: What is the difference between advertising recall and recommendation recall?

This student answered it well:

  • Advertising recall must consider commercial value; the whole ad ranking pipeline considers commercial value, which we call eCPM or eCPM-sorted ranking. Both coarse ranking and fine ranking take commercial value into account;
  • Recommendation generally targets other goals, such as dwell time and user retention; the objectives are different.

But the overall architecture is the same. Take ByteDance: they build this kind of middle platform, and advertising and recommendation run on the same stack.

Q: What abilities do you need for recommendation-algorithm work?

First, let me take a few minutes to introduce the advanced recommendation-system course at July Online. As far as I know it is now in its seventh session (by Q2 of 2021, the ninth), the employment rate of every session has ended up at 100%, and every teacher can refer students internally. Generally speaking our teachers are strong and their referrals do have an effect.
As for the content of the seventh session, I think it is quite rich: besides the common recall and fine-ranking methods of recommendation systems, it also covers cutting-edge techniques in industry, not just what is currently deployed.
Let me also share my interviewing experience. It depends on your experience. If you have little, say you are a campus hire or have only worked for a year, I don't care that much about your projects; unless you are genuinely doing well at a big company, your project is not particularly important in our eyes. What we care a lot about are your fundamentals:

  • The first is coding; everyone must practice problems;
  • Then algorithm fundamentals: do you have insight into recall and fine ranking, and if I pose a question, can you come up with an optimization idea?
  • Also, in real work there are many components that have already been built; you need to break problems down bit by bit and think about where the optimization points are;
  • Then I will give you a practical problem: for example, I hit this problem at work, thought of a better solution and put it into practice; I give you the same scenario and ask whether you can work it out. You don't have to think the way I did, but if you have a reasonable solution, I consider your modeling ability good;
  • To put it bluntly, an algorithm engineer is first of all an engineer, and second an engineer who solves problems. Besides classic, commonly used algorithms and some cutting-edge ones, there is feature engineering and then engineering itself: Flink, Scala, Spark are what you need at work. If you are a campus hire and can say you understand Spark, Scala, Flink and Kafka, I can be less demanding about your algorithm ability;
  • Of course, you must have at least one particularly strong point: either your algorithms are particularly good, or your engineering is good and your algorithms are also decent; either way is fine.

The recommendation course also includes several projects, where a hands-on instructor walks everyone through practice the way it is done in industry. We keep polishing it and we understand what industry actually does, so what we teach is the most practical; whether you are interviewing or working later, it will be very useful.

If you have been to graduate school, you know what graduate courses are like:

  1. First, they take a long time;
  2. Second, their courseware may be very old and rarely updated, whereas at July Online the principal has us update the courseware regularly;
  3. Third, with about 1/10 of the effort of a graduate program you can learn real practical skills, the skills you actually use at work, perhaps even more than in graduate school. To be honest, very little of what you learn in graduate school is used at work, while what industry teachers show you gives you twice the result with half the effort.

By the way, I will pick a few students later and give each of them a video course worth 1,000 yuan.
We are now in open discussion. For example, if you have not done recommendation before and want to know what path to take, or you want to ask about GNN, or advertising, or about July Online, feel free to ask; I think the principal is here too.
Some students asked whether Alibaba's experienced hiring offers P7 a yearly package of 1,000,000. It depends on your interview results; the range is large and may run from 600,000 to 1.5 million.

Q: What are the tricks of graph construction? Tell us more.

Because this is an open class, the specific tricks and hands-on practice belong in the formal course, where they are explained more thoroughly, since the goal there is to get you up and running.

Graph-construction skill requires a deep understanding of the business. For example, if you have worked on advertising for a long time, you will know the typical ratio between clicks and conversions: is it 1:3 or 1:10? That kind of thing is business experience from industry.

What dimensions should graph construction consider? In general practice there are two.
The first is to consider only the behavior itself, such as clicks and conversions in advertising;
the second is to also bring in something tied to the business goal, such as how much money a conversion brought or how much profit it generated; you can add that in.
The same applies to clicks. Generally there are two dimensions: one ignores the final goal and one takes the final goal into account.

Q: What is the difference between a bipartite graph and a homogeneous graph?

  1. For example, with the homogeneous item graph Alibaba builds, in the end you only get item embeddings, not user embeddings; that is the biggest difference. With a bipartite graph you can get both the user's Graph Embedding and the item's Embedding;
  2. The second difference is that a homogeneous graph cannot capture richer interactions; for example, it only models the click sequence within a session. With a bipartite graph you can include all the rich interactions between users and items and design around them.

As I said just now, when sampling you can sample by edge type. Suppose it is an e-commerce recommendation and the graph is built from clicks, add-to-cart, purchases and returns.
You can then sample per batch by edge type: for example, for each batch you sample three purchase edges, five click edges and two return edges, ten in total; you can sample by the different edge types, or in other ways.
Generally, when it is feasible, we tend to build heterogeneous graphs.

It has been almost two hours, and today also happens to be Halloween. If there are no more questions we will stop here; if you want to sign up for the July Online advanced class, there is a lot more content in it.
This open class is limited in time, and since it is an open class I don't know everyone's background, so a lot of the content could not be very targeted: if I cover basic theory, some students already know it and some don't.

Q: Besides courses, what channels are there to improve engineering ability?

If you are a student now, take part in Kaggle and Tianchi competitions; those are opportunities for practice. But to really improve engineering ability you still need to intern at a large company, because only there do you have the data volume; competitions cannot give you that. So for real engineering ability I suggest interning at a big company to see how the real industry does things.
In a competition the data has been prepared for you, whereas in the real industry you face problems like how to do streaming training, which has many pitfalls.

Take streaming training with a simple advertising example. Clicks are fine: after you click an ad, the event is sent back immediately and I know you clicked. But conversions, such as downloading a game, are reported back by the advertiser (the game company), usually in delayed batches. Then you hit a problem: your positive samples cannot come back in time. For each sample you would have to wait to know whether it is positive or negative, and that wait can be a day or two, or even a week. What do you do?

There are some optimizations, such as fast emit: as soon as a sample arrives, regardless of whether it will turn out positive or negative, treat it as a negative first; when the positive label is actually sent back later, process it again as a positive, and add some rules or derivations to argue that training remains approximately unbiased. A minimal sketch follows.
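A minimal sketch of the fast-emit idea: every impression is emitted immediately as a negative, and a corrective sample is emitted when the delayed positive arrives; the compensation weight shown is purely illustrative, not a production debiasing rule.

```python
def fast_emit(impression_id, emitted_labels):
    """Emit an impression as a negative right away, before the label is known."""
    emitted_labels[impression_id] = 0
    return {"id": impression_id, "label": 0, "weight": 1.0}

def on_delayed_conversion(impression_id, emitted_labels):
    """When the conversion is finally sent back, emit a corrective positive
    sample; an importance weight can offset the earlier fake negative."""
    if emitted_labels.get(impression_id) == 0:
        emitted_labels[impression_id] = 1
        return {"id": impression_id, "label": 1, "weight": 2.0}  # illustrative compensation
    return None

emitted = {}
stream = [fast_emit("imp_42", emitted)]                    # trained as a negative at time t
stream.append(on_delayed_conversion("imp_42", emitted))    # corrected when the postback arrives
```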
The work of a real algorithm engineer has two major parts:

  • The first is ideas. As an algorithm engineer, the biggest difference between you and a pure engineer is that you must have ideas: you must know which direction you are optimizing. That is the value of the role: with an idea you can improve the business. In recommendation, and even more so in advertising, if your idea improves the business, you increase the company's ability to make money, and then who gets the high performance rating if not you?
  • The second is engineering ability, which is indispensable. After you have the idea you build the model, and after modeling you must understand engineering, because it touches many places: in fine ranking and blended ranking the model matters a bit more, while on the recall side there is a great deal of engineering optimization, such as the ANN I just mentioned, which is engineering, and quant and PQ, which are both algorithm and engineering.

Optimizing your algorithms is one aspect, and proposing and delivering engineering improvements is also valuable, so develop in a balanced way. Being able to write papers is not enough; you may know that some Labs have not developed particularly well: low standing within the company, low status, low performance ratings. The high performers are the business departments, because of the value they bring to the company. A Lab cannot just write papers; it has to empower the business, and without enough ability it cannot do that.

Q: Should you consider which business unit of a company to join?

The last point is a job-hunting suggestion. Abandon the idea that business work is somehow lesser; doing business is very good. If you join a company, do business; if you don't want to do business, go to a research institute to do research.
If you insist on doing research inside a company, honestly, your development will probably not be great, unless you are a big name in the field: a Ph.D. with a strong pedigree, or a laboratory director; that is fine. Otherwise I suggest you go to a business department.

When I was looking for a job before, I also wanted to do pure research, but having done research, the feeling afterwards was not particularly good. The combination of business and algorithms is the most satisfying.

You can have an idea, launch it, and see the revenue. Your idea may earn the company an extra 1 million per day; even though your annual salary is perhaps one or two million, or a few million, you are making the company an extra 1 million every day, and with another improvement you can earn it several hundred thousand more per day. That is your value.

I have chatted with you a lot today, and we are approaching the two-hour mark, so let's stop here. Welcome everyone to join the July Online training camp / advanced class. That's all from me; thank you all for participating today.

 

postscript

After organizing this open class, I found that Mr. Wu really has a broad vision: he follows the top-conference papers every year and is well versed in the frontiers of industry. At the end he also gave a lot of advice to students working on recommendation, for example on algorithm engineering, on going deep into and taking root in the business, and on realizing the value of technology by improving the business; it has turned into quite a long article.

To make this article more readable, here is the revision record:

  1. First revision, 3.18: mainly straightened out the formula definitions of first-order and second-order similarity in LINE (section 1.4.2);
  2. Second revision, 3.20: mainly clarified the hierarchical comparison logic in Struc2Vec (1.4.5) and clarified the GraphSAGE algorithm;
  3. Third revision, 3.21: mainly clarified the algorithm flow of PinSage's GNN model (2.3.2), e.g. clarifying that S0 = S1 ∪ (first-order neighbors of S1);
  4. Fourth revision, 3.23: mainly refined the derivation of the second-order similarity formula in LINE (1.4.2);
  5. Fifth revision, 3.27: mainly smoothed the descriptions to make the sentences more fluent and understandable, focusing on the first part;
  6. Sixth revision, 4.6: discussed the first part (the popular introduction to GNN) with the Ph.D. from BUPT and improved the details of the descriptions and formulas.

Finally, a preview: the next note in the recommendation direction will cover another excellent teacher, Mr. Wang's open class "Development of Recommendation System Algorithm Technology (Evolution of Each Model)". Stay tuned.

Origin blog.csdn.net/v_JULY_v/article/details/114791484