Graph convolution literature reading 1


Paper name: MGraphDTA: deep multiscale graph neural network for explainable drug–target binding affinity prediction
Paper download address: MGraphDTA
Paper code download address: MGraphDTA and Grad-AAM
Dataset download address: Davis and KIBA, filtered Davis, Metz, ToxCast, Human and C. elegans datasets

MGraphDTA: deep multiscale graph neural network for explainable drug–target binding affinity prediction

Graph neural networks are widely used for drug–target affinity (DTA) prediction; however, current shallow graph neural networks are insufficient to capture the global features of compounds.

1. Background

Models for predicting drug–target affinity can generally be divided into three categories:

  1. Structure-based models: these consider the 3D structures of small molecules and proteins.

  2. Feature-based models: the representative approach is proteochemometrics, which relies on explicit descriptors of proteins and ligands, i.e. binary feature vectors indicating whether a drug acts on a target. The resulting feature vectors can be used to train models such as feedforward neural networks (FNN), support vector machines (SVM), random forests (RF) and other kernel methods; among these, FNNs perform best.

  3. Deep-learning-based models: deep learning models are driven by the biomedical and compound-activity data collected through emerging technologies such as high-throughput screening and parallel synthesis. For example, DeepDTA uses two convolutional neural network (CNN) branches to learn structural representations of drugs and proteins, and then uses a multilayer perceptron to predict the binding relationship. WideDTA improves on this input by adding two additional text inputs, for a total of four inputs. In summary, CNN-based models have been proven effective, and CNN-based recognition of 2D data representing compound structures has been shown to be feasible.

    Later, to address the shortcoming that CNNs cannot express molecular structure information well, graph neural networks (GNNs) were introduced. GNN-based methods represent drugs as graphs and use GNNs (spectral-domain methods?) to perform DTA prediction. It has been verified that GNN-based models outperform CNN-based ones.
    For example, GraphDTA compares graph convolutional networks (GCN), graph attention networks (GAT), graph isomorphism networks (GIN) and a GAT-GCN hybrid for DTA prediction (a regression problem). The introduction of the attention mechanism increases the interpretability of the model.
    On the other hand, it has also been proven feasible to use structure-related protein features as input to improve DTA prediction. (Personally, this feels similar to injecting prior knowledge into a deep learning algorithm.) For example, DGraphDTA takes protein sequences as input, but because protein structure information is not fully known, it uses a predicted contact map as the structural input.
    In summary, current shallow GNNs have three problems:

    1. A shallow structure cannot capture global features well, so multiple graph convolution layers must be stacked: to capture the structure of K-hop neighbors, K graph convolution layers should be stacked. (I don't fully understand this; is this the node-embedding idea? It is somewhat similar to a deep encoder, to be checked later.)
    [Figure: deep encoder]
    2. GNNs should preserve local information.

GNNs should retain both global and local information: the shallow GNN in the left panel of the figure fails to capture the ring structure of the whole zearalenone molecule, while the figure also shows that local information can distinguish substituents of different importance.
3. The interpretability of graph-based DTA models relies on attention mechanisms (GAT networks?). The attention mechanism has a clean mathematical interpretation, but it only considers the neighborhood of each vertex.
[Figure: GAT]

2. Model overview

To solve the above problems, this article proposes a multiscale graph neural network (MGNN) and a visualization method: gradient-weighted affinity activation mapping (Grad-AAM).
[Figure: MGraphDTA]
Prediction part: an MGNN with 27 graph convolution layers and a multiscale convolutional neural network (MCNN) extract multi-scale features of the drugs and the targets respectively. After fusion, the joint representation is fed into a multilayer perceptron to predict the affinity.
Visualization part: Grad-AAM uses the gradients flowing into the last layer of the MGNN to generate a probability map that marks the atoms contributing most to the DTA. Grad-AAM is derived from gradient-weighted class activation mapping (Grad-CAM) in the image domain, with the CNN part replaced by a GNN. (Recently, many models replace CNN structures from the image domain with GNNs.)

3. Model principle

3.1 Input

The drug input is a SMILES (Simplified Molecular Input Line Entry System) string, which uses short ASCII codes to represent chemical structures. The target input is a protein sequence (a string) in which each character represents an amino acid (amino acids are the basic building blocks of proteins? not sure). RDKit is used to convert each SMILES string into a graph with node and edge features and an adjacency matrix.
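As a rough illustration of this preprocessing step (not the authors' exact featurization), the sketch below converts a SMILES string into toy node features and an adjacency matrix with RDKit; the feature choice (atomic number and degree) is an assumption for demonstration only.

```python
# Minimal sketch (not the paper's exact featurization): convert a SMILES
# string into node features and an adjacency matrix with RDKit.
import numpy as np
from rdkit import Chem

def smiles_to_graph(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    # Toy node features: atomic number and degree of each atom.
    nodes = np.array(
        [[a.GetAtomicNum(), a.GetDegree()] for a in mol.GetAtoms()],
        dtype=np.float32,
    )
    # Symmetric adjacency matrix built from the bond list.
    n = mol.GetNumAtoms()
    adj = np.zeros((n, n), dtype=np.float32)
    for b in mol.GetBonds():
        i, j = b.GetBeginAtomIdx(), b.GetEndAtomIdx()
        adj[i, j] = adj[j, i] = 1.0
    return nodes, adj

nodes, adj = smiles_to_graph("CCO")  # ethanol: 3 heavy atoms
print(nodes.shape, adj.shape)        # (3, 2) (3, 3)
```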
For protein sequences, this article first established a vocabulary to map each character into an integer (int), as follows:

Amino acid       Integer
Alanine          1
Cysteine         2
Glutamic acid    4

To save training cost, the protein sequence length is limited to 1200, which covers at least about 80% of the proteins; the integer sequence is then converted into 128-dimensional vectors through an embedding layer (embedding space). Although one-hot encoding could also encode the proteins, it cannot describe the semantic relatedness between different amino acids, e.g. any two different one-hot vectors have zero cosine similarity. (When I see one-hot I always think of cross-entropy loss; 1001 and 0110 have zero cosine similarity, and I don't know whether there is a better solution, to check later.)
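A minimal sketch of this encoding pipeline, assuming PyTorch: map amino-acid characters to integers with a fixed vocabulary, pad or truncate to length 1200, and embed into 128 dimensions. The vocabulary values here are illustrative, not the paper's exact table.

```python
import torch
import torch.nn as nn

VOCAB = {aa: i + 1 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}  # 0 = padding
MAX_LEN = 1200

def encode_protein(seq: str) -> torch.Tensor:
    # Map each residue to an integer, truncate to MAX_LEN, pad with zeros.
    ids = [VOCAB.get(c, 0) for c in seq[:MAX_LEN]]
    ids += [0] * (MAX_LEN - len(ids))
    return torch.tensor(ids, dtype=torch.long)

embedding = nn.Embedding(num_embeddings=len(VOCAB) + 1, embedding_dim=128, padding_idx=0)
ids = encode_protein("MKTAYIAKQR")      # toy sequence
emb = embedding(ids.unsqueeze(0))       # shape: (1, 1200, 128)
print(emb.shape)
```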

3.2 Graph neural network

A graph can be represented as $\mathcal{G}=(\mathcal{V}, \mathcal{E})$, where $v_{i} \in \mathcal{V}$ denotes the i-th atom and $e_{ij} \in \mathcal{E}$ denotes the edge connecting the i-th and j-th atoms. A GNN typically uses a message-passing phase and a readout phase to map $\mathcal{G}$ to a vector $y_{\mathcal{G}} \in \mathbb{R}^{d}$: message passing updates each vertex's representation according to its neighborhood, and readout computes a feature vector for the entire graph.

3.2.1 Message passing phase

The following graph convolution layer updates the embedding vector of the i-th vertex at time step t:

$$x_{i}^{(t+1)}=\sigma\left(W_{1} x_{i}^{(t)}+W_{2} \sum_{j \in \mathcal{N}(i)} x_{j}^{(t)}\right)$$

where $W_{1}$ and $W_{2}$ are learnable weight matrices, $\mathcal{N}(i)$ is the set of neighbors of vertex i (first-order neighbors?), and $\sigma$ denotes batch normalization followed by an activation function (ReLU in this paper). From the formula it can be seen that global information on the graph is gradually captured through repeated iterations. (This method is similar to GCN, shown below; I wonder whether a graph attention network (GAT) could be introduced to achieve better predictions by tuning parameters, or whether a newer heterogeneous-graph model could be used.)
[Figure: GCN]
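A minimal PyTorch sketch of this update rule, using a dense adjacency matrix for clarity; the class name, dimensions and toy inputs are illustrative and this is not the authors' implementation.

```python
# x_i' = sigma(W1 x_i + W2 * sum_{j in N(i)} x_j), with sigma = BatchNorm + ReLU.
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.w1 = nn.Linear(in_dim, out_dim, bias=False)   # self term W1
        self.w2 = nn.Linear(in_dim, out_dim, bias=False)   # neighbor term W2
        self.bn = nn.BatchNorm1d(out_dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, in_dim), adj: (num_nodes, num_nodes) with 0/1 entries
        neighbor_sum = adj @ x                              # sum over N(i)
        return self.act(self.bn(self.w1(x) + self.w2(neighbor_sum)))

x = torch.randn(5, 16)                 # 5 atoms, 16-dim features
adj = (torch.rand(5, 5) > 0.5).float()
adj = ((adj + adj.t()) > 0).float()    # symmetrize the toy adjacency
print(SimpleGraphConv(16, 32)(x, adj).shape)   # torch.Size([5, 32])
```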

3.2.2 Readout phase

The readout takes the mean of the vertex embedding vectors output by the last layer:

$$y_{\mathcal{G}}=\frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} x_{v}^{(L)}$$

where $|\mathcal{V}|$ is the number of vertices in the molecular graph and L is the last time step. The readout layer aggregates node embeddings into a graph embedding. (It is presumably also used for the visualization later; this may be one of the innovations of this article.)
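The readout is then just a mean over node embeddings; a toy sketch:

```python
# Readout: average the node embeddings from the last layer into one graph vector.
import torch

x_last = torch.randn(5, 32)          # embeddings of 5 vertices after layer L
y_graph = x_last.mean(dim=0)         # y_G in R^32
print(y_graph.shape)                 # torch.Size([32])
```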

3.3 Multiscale graph neural network (MGNN) for drug encoding

[Figure: chemical intuition]
MGNN consists of three multi-scale blocks connected through transition layers.

3.3.1 Multi-scale blocks

The multi-scale block is derived from DenseNet. This article migrates the dense connection to GNN.
[Figure: MGNN]
Dense connections link each layer to all subsequent layers through forward paths, so every layer directly receives the gradient of the loss function with respect to its weights; gradients therefore do not vanish as they pass between layers, and the network can be extended to a much deeper level (very nice, a bit like an enhanced version of ResNet). The multi-scale block can be expressed as follows:
$$x_{i}^{(1)}=\mathscr{H}\left(x_{i}^{(0)}, \Theta_{1}\right)$$
$$x_{i}^{(2)}=\mathscr{H}\left(x_{i}^{(0)} \| x_{i}^{(1)}, \Theta_{2}\right)$$
$$x_{i}^{(3)}=\mathscr{H}\left(x_{i}^{(0)} \| x_{i}^{(1)} \| x_{i}^{(2)}, \Theta_{3}\right)$$
$$x_{i}^{(N)}=\mathscr{H}\left(x_{i}^{(0)} \| x_{i}^{(1)} \| \cdots \| x_{i}^{(N-1)}, \Theta_{N}\right)$$
where $\mathscr{H}$ is the graph convolution layer described above, $\Theta_{n}$ denotes the parameters of the n-th layer ($W_1$, $W_2$), and $\|$ is the concatenation operation. The multi-scale block can therefore extract multi-scale information (here "multi-scale" refers to the n-hop neighborhoods of a node), describing the structural information of the molecule both locally and globally.
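A toy PyTorch sketch of this DenseNet-style block, in which layer n receives the concatenation of all previous node embeddings; GConv is a compact stand-in for the graph convolution above, and the growth rate and layer count are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class GConv(nn.Module):
    """sigma(W1 x + W2 * sum_neighbors x), with sigma = BatchNorm + ReLU."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.w1 = nn.Linear(in_dim, out_dim, bias=False)
        self.w2 = nn.Linear(in_dim, out_dim, bias=False)
        self.act = nn.Sequential(nn.BatchNorm1d(out_dim), nn.ReLU())

    def forward(self, x, adj):
        return self.act(self.w1(x) + self.w2(adj @ x))

class MultiScaleBlock(nn.Module):
    def __init__(self, in_dim: int, growth: int, num_layers: int):
        super().__init__()
        # Layer n sees in_dim + n * growth input channels (dense connectivity).
        self.layers = nn.ModuleList(
            [GConv(in_dim + n * growth, growth) for n in range(num_layers)]
        )

    def forward(self, x, adj):
        features = [x]                              # x^(0)
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=-1), adj))
        return torch.cat(features, dim=-1)          # all scales concatenated

block = MultiScaleBlock(in_dim=16, growth=8, num_layers=3)
x, adj = torch.randn(5, 16), torch.eye(5)
print(block(x, adj).shape)                          # torch.Size([5, 40])
```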

3.3.2 Transition layer

To increase the network depth of MGNN, every two multi-scale blocks are connected by a transition layer. The transition layer plays two roles: first, it integrates the multi-scale features produced by the preceding multi-scale block; second, it reduces the number of feature-map channels. For the multi-scale feature at the (N+1)-th time step, $x_{i}^{(0)} \| x_{i}^{(1)} \| \cdots \| x_{i}^{(N)} \in \mathbb{R}^{d+(N-1)h}$, the transition layer is expressed as follows:
$$x_{i}^{(N+1)}=\sigma\left(\Phi_{1}\left(x_{i}^{(0)} \| x_{i}^{(1)} \| \cdots \| x_{i}^{(N)}\right)+\Phi_{2} \sum_{j \in \mathcal{N}(i)}\left(x_{j}^{(0)} \| x_{j}^{(1)} \| \cdots \| x_{j}^{(N)}\right)\right)$$
where $\Phi_{1}, \Phi_{2} \in \mathbb{R}^{(M / 2) \times M}$ are learnable weight matrices and $M=d+(N-1)h$ (N is the time step, d is the input feature dimension, h is the feature dimension produced by each layer). From the dimensions of $\Phi_{1}$ and $\Phi_{2}$ it can be seen that the transition layer halves the number of channels, which saves computation. Finally, the readout layer converts the whole graph into a feature vector.
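A corresponding sketch of the transition layer, under the same toy assumptions: it takes the concatenated M-channel multi-scale features and halves them to M/2 channels, with the same self/neighbor structure as the graph convolution.

```python
import torch
import torch.nn as nn

class TransitionLayer(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        out_channels = in_channels // 2                 # M -> M/2
        self.phi1 = nn.Linear(in_channels, out_channels, bias=False)
        self.phi2 = nn.Linear(in_channels, out_channels, bias=False)
        self.act = nn.Sequential(nn.BatchNorm1d(out_channels), nn.ReLU())

    def forward(self, x_cat: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x_cat: concatenated multi-scale node features, shape (num_nodes, M)
        return self.act(self.phi1(x_cat) + self.phi2(adj @ x_cat))

x_cat, adj = torch.randn(5, 40), torch.eye(5)
print(TransitionLayer(40)(x_cat, adj).shape)            # torch.Size([5, 20])
```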

3.4 Multiscale convolutional neural network (MCNN) for target encoding

[Figure: MCNN]
The idea of MCNN is similar to that of MGNN: it uses convolutional branches with different receptive fields to capture information at different scales. The receptive field is enlarged by stacking 3×3 convolutional layers, and branches of 3, 5 and 7 convolutional layers are used to capture protein structure information. Since only part of the protein structure is useful for DTA prediction, there is no need to capture global features as MGNN does. For a protein sequence of input length 1200, the embedding layer converts each position into a 128-dimensional vector, giving an input matrix $\mathcal{S} \in \mathbb{R}^{1200 \times 128}$ (I don't quite understand this; does it mean each position's feature becomes 128-dimensional?). MCNN then converts the input matrix into a feature vector $y_{\mathcal{S}} \in \mathbb{R}^{d}$:
$$y_{\mathcal{S}}=W\left(m\left(\mathcal{F}_{1}(\mathcal{S})\right) \| m\left(\mathcal{F}_{2}(\mathcal{S})\right) \| m\left(\mathcal{F}_{3}(\mathcal{S})\right)\right)$$

where $\mathcal{F}_{i}$ is a stack of 3×3 convolutional layers, each followed by a ReLU activation, $m$ denotes the max-pooling operation that converts each feature map into an h-dimensional vector, and $W \in \mathbb{R}^{d \times 3h}$ is a learnable matrix.
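A rough sketch of such a multiscale CNN in PyTorch; the kernel sizes, channel counts and branch depths here are illustrative guesses, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def conv_branch(depth: int, in_ch: int = 128, hidden: int = 64) -> nn.Sequential:
    layers, ch = [], in_ch
    for _ in range(depth):                      # deeper branch -> larger receptive field
        layers += [nn.Conv1d(ch, hidden, kernel_size=3, padding=1), nn.ReLU()]
        ch = hidden
    layers.append(nn.AdaptiveMaxPool1d(1))      # m(.): global max pooling to h dims
    return nn.Sequential(*layers)

class MCNN(nn.Module):
    def __init__(self, out_dim: int = 96, hidden: int = 64):
        super().__init__()
        self.branches = nn.ModuleList([conv_branch(d, hidden=hidden) for d in (3, 5, 7)])
        self.proj = nn.Linear(3 * hidden, out_dim, bias=False)   # W in R^{d x 3h}

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: (batch, 1200, 128); Conv1d expects (batch, channels, length)
        s = s.transpose(1, 2)
        feats = [branch(s).squeeze(-1) for branch in self.branches]
        return self.proj(torch.cat(feats, dim=-1))               # y_S in R^d

y_s = MCNN()(torch.randn(2, 1200, 128))
print(y_s.shape)                                # torch.Size([2, 96])
```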

3.5 MGraphDTA network architecture

After the above operations, vectors representing the drug and the target are obtained; the affinity is predicted by concatenating the two vectors and feeding them into a multilayer perceptron (MLP). The MLP consists of three linear layers, each followed by an activation function and a dropout layer (dropout rate 0.1). Overall, MGraphDTA contains one MGNN for drug encoding and one MCNN for target (protein) encoding. The mean squared error is used as the loss function:
$$\mathrm{MSE}=\frac{1}{n} \sum_{i=1}^{n}\left(P_{i}-Y_{i}\right)^{2}$$

where $P_i$ is the predicted value for the i-th drug–target pair, $Y_i$ is the true value, and n is the number of samples. For binary classification problems (active/inactive), the MSE can simply be replaced by a cross-entropy loss.
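A sketch of the prediction head and loss under the description above; the hidden sizes are assumptions.

```python
import torch
import torch.nn as nn

class AffinityHead(nn.Module):
    def __init__(self, drug_dim: int = 96, target_dim: int = 96, hidden: int = 256):
        super().__init__()
        # Three linear layers, each followed by ReLU and dropout (rate 0.1).
        self.mlp = nn.Sequential(
            nn.Linear(drug_dim + target_dim, hidden), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden, hidden // 2), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden // 2, 1),
        )

    def forward(self, y_drug: torch.Tensor, y_target: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([y_drug, y_target], dim=-1)).squeeze(-1)

head = AffinityHead()
pred = head(torch.randn(4, 96), torch.randn(4, 96))
loss = nn.MSELoss()(pred, torch.randn(4))   # swap for BCEWithLogitsLoss in classification
print(loss.item())
```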

3.6 Gradient-weighted affinity activation mapping (Grad-AAM)

This part describes how to improve the interpretability of the DTA model. The gradients flowing into the last graph convolution layer are used to measure each neuron's contribution to the affinity. Since graph convolution layers retain spatial information that fully connected layers lose, this paper argues that the last graph convolutional layer offers the best compromise between high-level semantics and detailed spatial information. Denote the feature map of the last graph convolution layer as A; the chemical probability map $P_{\text{Grad-AAM}} \in \mathbb{R}^{V}$ has one entry per vertex, where V is the number of vertices in the given molecule. First compute the gradient of the affinity score with respect to the neuron $A_{v}^{k}$ at the k-th channel and the v-th vertex, then compute the channel importance weight $\alpha_{k}$:
$$\alpha_{k}=\frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} \frac{\partial P}{\partial A_{v}^{k}}$$
After weighted aggregation and a ReLU layer:

$$P_{\text{Grad-AAM}}=\sum_{k} \alpha_{k} A^{k}$$

The result is finally normalized to [0, 1]. This map can be regarded as a weighted aggregation of the important geometric structures captured by the GNN.
[Figure: probability map]
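A minimal sketch of the Grad-AAM computation described above, assuming the feature map of the last graph convolution layer has been stored with gradients enabled; the helper name grad_aam and the toy usage are hypothetical.

```python
import torch

def grad_aam(last_feat: torch.Tensor, affinity: torch.Tensor) -> torch.Tensor:
    # last_feat: (num_nodes, K) feature map A; affinity: scalar prediction P.
    grads = torch.autograd.grad(affinity, last_feat, retain_graph=True)[0]
    alpha = grads.mean(dim=0)                           # (K,) channel importance weights
    cam = torch.relu((last_feat * alpha).sum(dim=-1))   # weighted aggregation per vertex + ReLU
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    return cam                                          # one importance score per atom

# Toy usage: a fake "last layer" feature map and a fake affinity derived from it.
A = torch.randn(6, 32, requires_grad=True)
P = A.sum()                                             # stand-in for the model's prediction
print(grad_aam(A, P))                                   # 6 atom-importance scores in [0, 1]
```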

3.7 Dataset (to be supplemented)


4. Experimental results (to be supplemented)

4.1 Classification tasks


4.2 Regression tasks


4.3 Effect evaluation under more realistic conditions


4.4 Ablation study (model simplification test)

4.5 Visual interpretability


4.6 How MGNN overcomes the over-smoothing problem

A deep GNN can lead to over-smoothing of the vertices (over-fitting?), yet the model we pursue requires a deep network to enlarge the receptive field and increase expressiveness. The message-passing mechanism updates each vertex by combining its own features with the embeddings collected from its neighbors, so over-smoothing manifests as growing similarity between vertex embeddings: as the number of layers increases, each vertex's embedding aggregates information from more and more atoms, so the embeddings of two different vertices become more and more similar.
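To make the over-smoothing effect concrete, the small sketch below measures the mean pairwise cosine similarity of node embeddings while repeatedly applying a neighborhood-averaging step; this is an illustration of the phenomenon, not an experiment from the paper.

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(x: torch.Tensor) -> float:
    # x: (num_nodes, dim) node embeddings after some layer.
    x = F.normalize(x, dim=-1)
    sim = x @ x.t()                                   # all pairwise cosine similarities
    n = x.size(0)
    off_diag = sim[~torch.eye(n, dtype=torch.bool)]   # ignore self-similarity
    return off_diag.mean().item()

# Toy demonstration: repeated neighborhood averaging drives similarity towards 1.
adj = (torch.rand(8, 8) > 0.6).float()
adj = ((adj + adj.t() + torch.eye(8)) > 0).float()
adj = adj / adj.sum(dim=1, keepdim=True)              # row-normalized propagation
x = torch.randn(8, 16)
for layer in range(8):
    print(layer, round(mean_pairwise_cosine(x), 3))
    x = adj @ x                                       # one smoothing step
```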

4.7 Limitations

5. Conclusion

In this paper, a novel, chemical-intuition-inspired graph neural network framework, MGraphDTA, is proposed for DTA prediction. The model uses an MGNN with 27 graph convolution layers to capture multi-scale molecular structure, and uses Grad-AAM for visual interpretation.

Origin blog.csdn.net/qq_39917739/article/details/126200548