2021-IJCAI-Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification

GNNs perform feature propagation through neural networks to make predictions, while LPA propagates labels across the graph adjacency matrix to obtain results. However, there is still no efficient way to directly combine these two algorithms.

Model

The authors use a Graph Transformer combined with label embedding to unify the feature propagation and label propagation described above, building the UniMP model. Furthermore, a masked label prediction strategy is introduced during training to prevent the label leakage problem.

Method
Given fixed node features $H^{(l)}=\left\{h_1^{(l)}, h_2^{(l)}, \ldots, h_n^{(l)}\right\}$, the multi-head attention for each edge from $j$ to $i$ is computed as:
$$
\begin{gathered}
q_{c,i}^{(l)}=W_{c,q}^{(l)} h_i^{(l)}+b_{c,q}^{(l)} \\
k_{c,j}^{(l)}=W_{c,k}^{(l)} h_j^{(l)}+b_{c,k}^{(l)} \\
e_{c,ij}=W_{c,e} e_{ij}+b_{c,e} \\
\alpha_{c,ij}^{(l)}=\frac{\left\langle q_{c,i}^{(l)},\, k_{c,j}^{(l)}+e_{c,ij}\right\rangle}{\sum_{u \in N(i)}\left\langle q_{c,i}^{(l)},\, k_{c,u}^{(l)}+e_{c,iu}\right\rangle}
\end{gathered}
$$
where $\langle q, k\rangle=\exp\left(\frac{q^T k}{\sqrt{d}}\right)$ is an exponential scaled dot-product function and $d$ is the hidden size of each head. For the $c$-th attention head, the source feature $h_i^{(l)}$ and distant feature $h_j^{(l)}$ are first transformed into a query vector $q_{c,i}^{(l)} \in \mathbb{R}^d$ and a key vector $k_{c,j}^{(l)} \in \mathbb{R}^d$ using separate trainable parameters $W_{c,q}^{(l)}, W_{c,k}^{(l)}, b_{c,q}^{(l)}, b_{c,k}^{(l)}$.
If edge features $e_{ij}$ are provided, they are encoded and added to the key vector as additional information at each layer. After obtaining the graph's multi-head attention, messages are aggregated from the distant nodes $j$ to the source node $i$:
$$
\begin{gathered}
v_{c,j}^{(l)}=W_{c,v}^{(l)} h_j^{(l)}+b_{c,v}^{(l)} \\
\hat{h}_i^{(l+1)}=\Big\|_{c=1}^C\left[\sum_{j \in N(i)} \alpha_{c,ij}^{(l)}\left(v_{c,j}^{(l)}+e_{c,ij}\right)\right]
\end{gathered}
$$
where $\|$ is the concatenation operation over the $C$ attention heads. The multi-head attention matrix replaces the original normalized adjacency matrix as the transition matrix for message passing.
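The per-edge attention above can be sketched in plain torch. This is a minimal illustration, not the paper's implementation: all weight shapes are hypothetical, and `edge_index` is assumed to hold target nodes $i$ in row 0 and source nodes $j$ in row 1.

```python
import torch

def multi_head_edge_attention(h, edge_index, e, W_q, W_k, W_e, d):
    """Compute alpha_{c,ij} for every edge j -> i and every head c.

    h: [n, f] node features; edge_index: [2, m] (row 0 = target i, row 1 = source j);
    e: [m, fe] edge features; W_q, W_k: [C, d, f]; W_e: [C, d, fe].
    Returns alpha: [C, m], normalized over each target's incoming edges.
    """
    i, j = edge_index
    q = torch.einsum('cdf,mf->cmd', W_q, h[i])   # query of target i, per head
    k = torch.einsum('cdf,mf->cmd', W_k, h[j])   # key of source j, per head
    ee = torch.einsum('cdf,mf->cmd', W_e, e)     # encoded edge feature
    # <q, k + e> = exp(q^T (k + e) / sqrt(d)), shape [C, m]
    logits = torch.exp((q * (k + ee)).sum(-1) / d ** 0.5)
    # normalize over all edges that share the same target node i
    denom = torch.zeros(logits.size(0), h.size(0)).index_add_(1, i, logits)
    return logits / denom[:, i]
```

For every target node, the attention weights of its incoming edges sum to one, so the attention matrix acts as a learned, normalized transition matrix.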

A gated residual connection is then applied to prevent the model from over-smoothing:
$$
\begin{gathered}
r_i^{(l)}=W_r^{(l)} h_i^{(l)}+b_r^{(l)} \\
\beta_i^{(l)}=\operatorname{sigmoid}\left(W_g^{(l)}\left[\hat{h}_i^{(l+1)} ; r_i^{(l)} ; \hat{h}_i^{(l+1)}-r_i^{(l)}\right]\right) \\
h_i^{(l+1)}=\operatorname{ReLU}\left(\operatorname{LayerNorm}\left(\left(1-\beta_i^{(l)}\right) \hat{h}_i^{(l+1)}+\beta_i^{(l)} r_i^{(l)}\right)\right)
\end{gathered}
$$
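A minimal sketch of this gated residual step, with hypothetical shapes (a scalar gate per node, `W_g` of shape `[1, 3d]`):

```python
import torch
import torch.nn.functional as F

def gated_residual(h_hat, r, W_g, b_g):
    """Gated residual connection: a per-node gate beta mixes the aggregated
    message h_hat with the residual r, then LayerNorm and ReLU are applied.
    h_hat, r: [n, d]; W_g: [1, 3d]; b_g: [1]."""
    z = torch.cat([h_hat, r, h_hat - r], dim=-1)   # [n, 3d]
    beta = torch.sigmoid(z @ W_g.t() + b_g)        # [n, 1], gate in (0, 1)
    mixed = (1 - beta) * h_hat + beta * r
    return F.relu(F.layer_norm(mixed, mixed.shape[-1:]))
```

The gate lets each node decide how much of the raw residual to keep versus the freshly aggregated message, which dampens over-smoothing in deep stacks.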

For the final output layer, the heads are averaged instead of concatenated, and the nonlinear transformation is removed:
$$
\begin{gathered}
\hat{h}_i^{(l+1)}=\frac{1}{C} \sum_{c=1}^{C}\left[\sum_{j \in N(i)} \alpha_{c,ij}^{(l)}\left(v_{c,j}^{(l)}+e_{c,ij}\right)\right] \\
h_i^{(l+1)}=\left(1-\beta_i^{(l)}\right) \hat{h}_i^{(l+1)}+\beta_i^{(l)} r_i^{(l)}
\end{gathered}
$$

Label Embedding and Propagation

$$
H^{(l)}=\left((1-\beta) A^*+\beta I\right)^l\left(X+\widehat{Y} W_d\right) W^{(1)} W^{(2)} \cdots W^{(l)}=\left((1-\beta) A^*+\beta I\right)^l X W+\left((1-\beta) A^*+\beta I\right)^l \widehat{Y} W_d W
$$
which shows that the model approximately decomposes into a feature propagation term $\left((1-\beta) A^*+\beta I\right)^l X W$ and a label propagation term $\left((1-\beta) A^*+\beta I\right)^l \widehat{Y} W_d W$.
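The label-embedding input $X+\widehat{Y} W_d$ can be sketched as follows. This is an illustrative helper, not the paper's code: labels of unobserved (or masked) nodes contribute a zero vector, so only visible labels are propagated.

```python
import torch
import torch.nn.functional as F

def label_augmented_input(X, y, observed_mask, W_d):
    """Build X + Y_hat @ W_d: one-hot labels of observed nodes are embedded
    by W_d and added to the node features; all other nodes contribute zeros.
    X: [n, f]; y: [n] integer labels; observed_mask: [n] bool; W_d: [num_classes, f]."""
    Y_hat = torch.zeros(X.size(0), W_d.size(0))
    Y_hat[observed_mask] = F.one_hot(y[observed_mask],
                                     num_classes=W_d.size(0)).float()
    return X + Y_hat @ W_d
```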

Masked label prediction

The model parameters $\theta$ are trained in a supervised fashion. Given $X$ and $A$, the objective is to maximize the likelihood of the observed labels $\widehat{Y}$ over the $\widehat{V}$ labeled nodes:
$$
\underset{\theta}{\arg\max}\ \log p_\theta(\widehat{Y} \mid X, A)=\sum_{i=1}^{\widehat{V}} \log p_\theta\left(\widehat{y_i} \mid X, A\right)
$$
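Because labels are also model inputs, training on all observed labels at once would leak the answer. A minimal sketch of the masking strategy, with a hypothetical `mask_rate` parameter: at each epoch, hide part of the training labels, feed the rest to the model, and supervise only on the hidden part.

```python
import torch

def masked_label_split(train_idx, mask_rate=0.3, generator=None):
    """Randomly split the labeled training nodes for one epoch: `observed`
    labels are fed to the model as input, `masked` labels are hidden and
    used as supervision targets, preventing label leakage."""
    perm = torch.randperm(train_idx.numel(), generator=generator)
    n_masked = int(mask_rate * train_idx.numel())
    masked = train_idx[perm[:n_masked]]     # predict these (labels hidden)
    observed = train_idx[perm[n_masked:]]   # labels visible to the model
    return observed, masked
```

Resampling the split every epoch forces the model to learn to infer missing labels from features and visible neighbor labels, rather than copying its input.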

Graph Transformer Code

import torch
import torch.nn.functional as F
from torch_geometric.nn import TransformerConv
class Transformer(torch.nn.Module):
    # Initialization
    def __init__(self, hidden_channels):
        super(Transformer, self).__init__()
        torch.manual_seed(12345)
        self.conv1 = TransformerConv(10, hidden_channels, dropout=0.5)  # GCNConv replaced with TransformerConv
        self.conv2 = TransformerConv(hidden_channels, 2, dropout=0.5)   # GCNConv replaced with TransformerConv
    # Forward pass
    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = x.relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        # Note: the output is per-node features with shape [num_nodes, num_classes]
        return x
    
x = torch.randn(4, 10)
edge_index = torch.tensor([[0,0,0,1,1,2],[1,2,3,2,3,3]])

model = Transformer(36)
res = model(x, edge_index)
print(res)

Origin blog.csdn.net/weixin_42486623/article/details/130002082