【Paper Notes】KDD2019 | KGAT: Knowledge Graph Attention Network for Recommendation


Abstract

For better recommendation, we should model not only user-item interactions but also the relational information among items.

Traditional methods such as factorization machines treat each interaction as an independent instance and ignore the relations between items (e.g., the director of one movie is also an actor in another movie).

Higher-order relation: two items connected through one or more linked attributes.

KG+user-item graph+high order relations—>KGAT

KGAT recursively propagates the embeddings of neighboring nodes (which may be users, items, or attributes) to update a node's own embedding, and uses an attention mechanism to distinguish the importance of the neighbors.

Introduction

[Figure: running example of the collaborative knowledge graph, with users $u_1$–$u_5$, items $i_1$–$i_4$, and attribute $e_1$]

$u_1$ is the target user to whom the recommendation is to be provided. The yellow and gray circles denote important users and items that can be discovered through higher-order relations but are ignored by traditional methods.

For example, user $u_1$ has watched movie $i_1$. The CF method focuses on the history of similar users $u_4$ and $u_5$ who also watched $i_1$, while supervised learning focuses on movie $i_2$, which shares the attribute $e_1$ with $i_1$. These two kinds of information are clearly complementary for recommendation, but existing supervised learning fails to unify them. For example, here $i_1$ and $i_2$ both have the attribute $e_1$ under relation $r_2$, yet the model fails to reach $i_3$ and $i_4$ through $r_3$, because it treats instances as independent parts and cannot capture the higher-order relations in the data; for example, the users in the yellow circle watched other movies ($i_2$) by the same director $e_1$, and the movies in the gray circle are connected to $e_1$ through other relations. This is also important information for making recommendations.
$$
\begin{array}{l}
u_{1} \stackrel{r_{1}}{\longrightarrow} i_{1} \stackrel{-r_{2}}{\longrightarrow} e_{1} \stackrel{r_{2}}{\longrightarrow} i_{2} \stackrel{-r_{1}}{\longrightarrow}\left\{u_{2}, u_{3}\right\}, \\
u_{1} \stackrel{r_{1}}{\longrightarrow} i_{1} \stackrel{-r_{2}}{\longrightarrow} e_{1} \stackrel{r_{3}}{\longrightarrow}\left\{i_{3}, i_{4}\right\}
\end{array}
$$

Challenges

Harnessing such higher-order information is challenging:

1) The number of nodes that have higher-order relations with the target user grows dramatically as the order increases, which imposes a heavy computational burden on the model

2) Higher-order relations contribute unevenly to predictions.

To this end, the paper proposes the Knowledge Graph Attention Network (KGAT) model, which updates a node's embedding based on the embedding of its neighbors, and recursively performs this embedding propagation to capture high-order connections in linear time complexity. An attention mechanism is additionally employed to learn the weight of each neighbor during propagation.

GNN->KGAT

1. Recursive embedding propagation: use the embeddings of neighboring nodes to update the current node's embedding

2. Use the attention mechanism to learn the weight of each neighbor during propagation

Advantages:

1. Compared with path-based methods, it avoids manually designing or selecting paths

2. Compared with rule-based methods, it incorporates higher-order relations directly into the predictive model

3. Model framework


3.1 Problem Definition

Input: the collaborative knowledge graph $\mathcal G$, which is composed of the user-item interaction data $\mathcal G_1$ and the knowledge graph $\mathcal G_2$.

Output: the probability $\hat y_{ui}$ that user $u$ clicks item $i$.

Higher-order connectivity: exploiting higher-order connectivity is crucial to perform high-quality recommendations. We define the $L$-order connectivity as a multi-hop relation path:
$$e_0 \stackrel{r_1}{\longrightarrow} e_1 \stackrel{r_2}{\longrightarrow} \cdots \stackrel{r_L}{\longrightarrow} e_L$$

3.2 Embedding Layer

The paper uses the TransR model for knowledge graph embedding. Its main idea is that an entity has different meanings under different relations, so entities need to be projected into relation-specific spaces. If $h$ and $t$ are connected by relation $r$, then their representations in the relation space of $r$ should be close to each other; otherwise they should be far apart. Formally:
$$\mathbf e_h^r + \mathbf e_r \approx \mathbf e_t^r$$
where $\mathbf e_h, \mathbf e_t \in \mathbb R^d$ and $\mathbf e_r \in \mathbb R^k$ are the embeddings of $h$, $t$, and $r$, respectively.

The plausibility score of a triplet is:
$$g(h,r,t)=\|\mathbf W_r\mathbf e_h+\mathbf e_r-\mathbf W_r\mathbf e_t\|_2^2$$
where $\mathbf W_r \in \mathbb R^{k\times d}$ is the transformation matrix of relation $r$, which projects entities from the $d$-dimensional entity space into the $k$-dimensional relation space. The lower the value of $g(h,r,t)$, the higher the probability that the triplet is true.

Finally, a pairwise ranking loss is used for training:
$$\mathcal L_{KG} = \sum_{(h,r,t,t')\in \tau} -\ln\,\sigma\big(g(h,r,t')-g(h,r,t)\big)$$
This loss encourages the score of the negative sample minus the score of the positive sample to be as large as possible. Negative samples are constructed by randomly replacing the tail entity $t$ with another entity $t'$.
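As a concrete illustration, here is a minimal PyTorch sketch of the TransR score and the pairwise ranking loss above. The function names (`transr_score`, `kg_loss`) and the toy dimensions are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def transr_score(e_h, e_r, e_t, W_r):
    """g(h, r, t) = ||W_r e_h + e_r - W_r e_t||_2^2; lower means more plausible."""
    h_proj = W_r @ e_h  # project head entity into the k-dimensional relation space
    t_proj = W_r @ e_t  # project tail entity into the k-dimensional relation space
    return torch.sum((h_proj + e_r - t_proj) ** 2)

def kg_loss(e_h, e_r, e_t_pos, e_t_neg, W_r):
    """-ln sigma(g(h, r, t') - g(h, r, t)): pushes the corrupted triplet's score above the true one."""
    g_pos = transr_score(e_h, e_r, e_t_pos, W_r)
    g_neg = transr_score(e_h, e_r, e_t_neg, W_r)
    return -F.logsigmoid(g_neg - g_pos)

# toy example: d-dimensional entity space, k-dimensional relation space
d, k = 8, 4
e_h, e_t_pos, e_t_neg = torch.randn(d), torch.randn(d), torch.randn(d)
e_r, W_r = torch.randn(k), torch.randn(k, d)
print(kg_loss(e_h, e_r, e_t_pos, e_t_neg, W_r))
```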

3.3 Attentive Embedding Propagation Layers

Information propagation

Consider an entity $h$. We use $\mathcal N_h = \{(h,r,t)\mid(h,r,t) \in \mathcal G\}$ to denote the set of triplets in which $h$ is the head entity. The ego-network representation of $h$ is computed as:
$$\mathbf e_{\mathcal N_h} = \sum_{(h,r,t) \in \mathcal N_h} \pi(h,r,t)\,\mathbf e_t$$
where $\pi(h,r,t)$ denotes how much information is propagated from $t$ to $h$ along relation $r$.

Knowledge-aware attention

The weight $\pi(h,r,t)$ used in the propagation is computed with an attention mechanism:
$$\pi(h,r,t) = (\mathbf W_r \mathbf e_t)^\top \tanh(\mathbf W_r \mathbf e_h+\mathbf e_r)$$
Using $\tanh$ as the activation function makes neighbors whose $\mathbf e_t$ is closer to $\mathbf e_h$ in the relation space receive higher attention scores. The scores are then normalized with softmax:
$$\pi(h,r,t)=\frac{\exp\big(\pi(h,r,t)\big)}{\sum_{(h,r',t') \in \mathcal N_h} \exp\big(\pi(h,r',t')\big)}$$
Finally, $\pi(h,r,t)$ tells us which neighboring nodes deserve more attention.
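A minimal sketch of the attention-weighted ego-network embedding, assuming the neighbor triplets of $h$ are given as parallel lists of relation embeddings, tail embeddings, and relation matrices (the names and the toy setup are illustrative):

```python
import torch

def ego_network_embedding(e_h, e_r_list, e_t_list, W_r_list):
    """Compute e_{N_h} = sum_t pi(h, r, t) * e_t with softmax-normalized attention scores."""
    scores = []
    for e_r, e_t, W_r in zip(e_r_list, e_t_list, W_r_list):
        # pi(h, r, t) = (W_r e_t)^T tanh(W_r e_h + e_r)
        scores.append((W_r @ e_t) @ torch.tanh(W_r @ e_h + e_r))
    alpha = torch.softmax(torch.stack(scores), dim=0)  # normalize over the neighbors of h
    return alpha @ torch.stack(e_t_list)                # weighted sum of tail embeddings, shape (d,)

# toy example with two neighbor triplets
d, k = 8, 4
e_h = torch.randn(d)
e_r_list = [torch.randn(k), torch.randn(k)]
e_t_list = [torch.randn(d), torch.randn(d)]
W_r_list = [torch.randn(k, d), torch.randn(k, d)]
print(ego_network_embedding(e_h, e_r_list, e_t_list, W_r_list))
```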

Information aggregation

Finally, the representation of $h$ in the entity space, $\mathbf e_h$, is aggregated with its ego-network representation $\mathbf e_{\mathcal N_h}$ to obtain the new representation of $h$:
$$\mathbf e_h^{(1)} = f(\mathbf e_h,\mathbf e_{\mathcal N_h})$$
$f(\cdot)$ can be implemented in the following ways (a minimal sketch of all three follows the list):

  1. GCN Aggregator:
    $$f_{\text{GCN}}=\text{LeakyReLU}\big(\mathbf W(\mathbf e_h+\mathbf e_{\mathcal N_h})\big)$$
  2. GraphSage Aggregator:
    $$f_{\text{GraphSage}} = \text{LeakyReLU}\big(\mathbf W(\mathbf e_h \,\|\, \mathbf e_{\mathcal N_h})\big)$$
  3. Bi-Interaction Aggregator:
    $$f_{\text{Bi-Interaction}} = \text{LeakyReLU}\big(\mathbf W_1(\mathbf e_h+\mathbf e_{\mathcal N_h})\big)+\text{LeakyReLU}\big(\mathbf W_2(\mathbf e_h\odot\mathbf e_{\mathcal N_h})\big)$$
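The three aggregators could be sketched as follows; the dimension `d` and the shared linear layers are toy assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 8  # embedding dimension (toy value)
W = nn.Linear(d, d, bias=False)           # W   for the GCN aggregator
W_cat = nn.Linear(2 * d, d, bias=False)   # W   for the GraphSage aggregator (takes the concatenation)
W1 = nn.Linear(d, d, bias=False)          # W_1 for the Bi-Interaction aggregator
W2 = nn.Linear(d, d, bias=False)          # W_2 for the Bi-Interaction aggregator

def gcn_aggregator(e_h, e_Nh):
    return F.leaky_relu(W(e_h + e_Nh))

def graphsage_aggregator(e_h, e_Nh):
    return F.leaky_relu(W_cat(torch.cat([e_h, e_Nh], dim=-1)))

def bi_interaction_aggregator(e_h, e_Nh):
    return F.leaky_relu(W1(e_h + e_Nh)) + F.leaky_relu(W2(e_h * e_Nh))  # * is the Hadamard product

e_h, e_Nh = torch.randn(d), torch.randn(d)
print(bi_interaction_aggregator(e_h, e_Nh).shape)  # torch.Size([8])
```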

Higher order propagation:

We can further stack more propagation layers to explore higher-order connectivity information and collect the information propagated from higher-hop neighbors. In the $l$-th step:
$$\mathbf e_h^{(l)} = f\big(\mathbf e_h^{(l-1)},\mathbf e_{\mathcal N_h}^{(l-1)}\big)$$
where $\mathbf e_{\mathcal N_h}^{(l-1)} = \sum_{(h,r,t)\in\mathcal N_h}\pi(h,r,t)\,\mathbf e_t^{(l-1)}$, and $\mathbf e_t^{(l-1)}$ is itself obtained from $\mathbf e_t^{(0)}$ through the same propagation steps.
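To make the stacking concrete, here is a small self-contained sketch. It assumes a dense matrix `A` standing in for precomputed, normalized attention weights $\pi(h,r,t)$ and uses a GCN-style aggregator; both are simplifying assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, n_nodes, L = 8, 5, 3
W = nn.Linear(d, d, bias=False)                          # shared transform for this toy aggregator
e0 = torch.randn(n_nodes, d)                             # e^(0): embeddings from the embedding layer
A = torch.softmax(torch.randn(n_nodes, n_nodes), dim=1)  # stand-in for the normalized pi(h, r, t)

layers = [e0]
for _ in range(L):
    e_prev = layers[-1]
    e_Nh = A @ e_prev                                     # e_{N_h}^{(l-1)} for every node h
    layers.append(F.leaky_relu(W(e_prev + e_Nh)))         # e^(l) = f(e^(l-1), e_{N_h}^(l-1))
# `layers` now holds [e^(0), ..., e^(L)], which the prediction layer concatenates
```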

3.4 Prediction layer

After performing $L$ propagation layers, we finally obtain multiple representations of user $u$: $\{\mathbf e_u^{(1)},\dots,\mathbf e_u^{(L)}\}$, and of item $i$: $\{\mathbf e_i^{(1)},\dots,\mathbf e_i^{(L)}\}$.

The representations from all layers are concatenated into a single vector, i.e.:
$$\mathbf e_u^{*} = \mathbf e_u^{(0)} \,\|\, \cdots \,\|\, \mathbf e_u^{(L)}, \quad \mathbf e_i^{*} = \mathbf e_i^{(0)} \,\|\, \cdots \,\|\, \mathbf e_i^{(L)}$$
Finally, the correlation score is calculated by inner product:
$$\hat y(u,i) = {\mathbf e_u^*}^\top \mathbf e_i^*$$
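A minimal sketch of the prediction step, assuming the layer-wise embeddings of one user and one item are already computed (toy random tensors below):

```python
import torch

L, d = 3, 8
e_u_layers = [torch.randn(d) for _ in range(L + 1)]  # [e_u^(0), ..., e_u^(L)]
e_i_layers = [torch.randn(d) for _ in range(L + 1)]  # [e_i^(0), ..., e_i^(L)]

e_u_star = torch.cat(e_u_layers)  # e_u^* = e_u^(0) || ... || e_u^(L)
e_i_star = torch.cat(e_i_layers)  # e_i^* = e_i^(0) || ... || e_i^(L)

y_hat = torch.dot(e_u_star, e_i_star)  # y_hat(u, i) = e_u^{*T} e_i^*
print(y_hat.item())
```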

3.5 Loss function

BPR loss:
$$\mathcal L_{CF}=\sum_{(u,i,j) \in O} -\ln\,\sigma\big(\hat y(u,i)-\hat y(u,j)\big)$$
where $O = \{(u,i,j)\mid(u,i) \in \mathcal R^+, (u,j) \in \mathcal R^-\}$, with $\mathcal R^+$ denoting positive samples and $\mathcal R^-$ denoting negative samples.

Finally, the overall objective function is:
$$\mathcal L_{KGAT} = \mathcal L_{KG} + \mathcal L_{CF} + \lambda\|\Theta\|_2^2$$
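A minimal sketch of the BPR loss and the joint objective (function names and the regularization weight `lam` are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def bpr_loss(y_pos, y_neg):
    """L_CF term: -ln sigma(y_hat(u, i) - y_hat(u, j)) for an observed item i and an unobserved item j."""
    return -F.logsigmoid(y_pos - y_neg)

def kgat_loss(l_kg, l_cf, params, lam=1e-5):
    """L_KGAT = L_KG + L_CF + lambda * ||Theta||_2^2, where Theta collects all model parameters."""
    l2 = sum((p ** 2).sum() for p in params)
    return l_kg + l_cf + lam * l2

# toy example
y_pos, y_neg = torch.tensor(1.2), torch.tensor(0.3)
params = [torch.randn(4, 4), torch.randn(4)]
print(kgat_loss(torch.tensor(0.5), bpr_loss(y_pos, y_neg), params))
```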


Origin blog.csdn.net/weixin_45884316/article/details/131815144