2022-ACS-Predicting Protein–Ligand Docking Structure with Graph Neural Network

Paper: https://pubs.acs.org/doi/10.1021/acs.jcim.2c00127

Code: https://github.com/j9650/MedusaGraph

Predicting Protein-Ligand Docking Structures Using Graph Neural Networks

This article was published in 2022 in the Journal of Chemical Information and Modeling (an ACS journal) by Huaipan Jiang's research group at Pennsylvania State University.
Existing computational software packages for docking-based drug discovery suffer from low accuracy and high latency. Recent work has made virtual screening more accurate by improving the assessment of protein-ligand binding affinity, but these methods still rely heavily on traditional docking software to sample docking poses, which adds excessive execution latency. The authors therefore propose and evaluate MedusaGraph, a novel graph neural network (GNN) based framework that includes a pose prediction (sampling) model and a pose selection (scoring) model.

Data

The dataset is a refined subset of PDBbind 2017:

  • Protein-ligand complexes with fewer than two rotatable bonds, and proteins with multiple ligands, were removed.
  • Proteins with missing or repeated residues were also excluded.
  • Proteins were clustered using CD-HIT (sequence identity cutoff of 0.9).

The final dataset contains 3738 protein-ligand complexes.

Model


The method consists of two GNNs. The first network predicts the optimal docking pose of a protein-ligand pair from an initial docking pose. The second network evaluates the output pose of the first network and predicts whether that pose is close to the native pose.

Input representation

The initial pose is converted to a graph representation in which each vertex represents an atom in the complex and each edge represents a connection between two atoms (e.g., a covalent bond or an interaction between nearby atoms). The input to the pose prediction model is an N x 21 tensor, where N is the number of atoms in the complex, so each vertex has a feature vector of length 21. The first 18 elements are categorical features indicating the atom type. The last three elements are the 3D coordinates (x, y, z) of the atom in the initial pose. A distance threshold of 6 Å was chosen for atomic interactions. Edge features include the distance between the two vertices and the connection type (protein-ligand, protein-protein, or ligand-ligand).
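The graph construction described above can be sketched in plain Python. This is only an illustration, not the authors' code: the function name `build_graph`, the one-hot encoding of the atom type, and the integer codes for the connection types are assumptions made for the example. Only the 18 + 3 feature layout and the 6 Å cutoff come from the text, and covalent-bond edges are not modeled separately here.

```python
import math

CUTOFF = 6.0         # interaction threshold in angstroms (from the text)
NUM_ATOM_TYPES = 18  # length of the atom-type part of each feature vector

# Hypothetical integer codes for the connection type of an edge.
LIGAND_LIGAND, PROTEIN_LIGAND, PROTEIN_PROTEIN = 0, 1, 2


def build_graph(atoms):
    """atoms: list of (type_index, is_ligand, (x, y, z)) tuples.

    Returns (features, edges): features is an N x 21 list of lists, and
    edges is a list of directed (i, j, distance, connection_type) tuples.
    """
    features = []
    for type_index, _, xyz in atoms:
        one_hot = [0.0] * NUM_ATOM_TYPES  # assumed one-hot atom type
        one_hot[type_index] = 1.0
        features.append(one_hot + list(xyz))  # 18 type entries + 3 coordinates

    edges = []
    for i, (_, lig_i, xyz_i) in enumerate(atoms):
        for j, (_, lig_j, xyz_j) in enumerate(atoms):
            if i == j:
                continue
            dist = math.dist(xyz_i, xyz_j)
            if dist >= CUTOFF:
                continue
            if lig_i and lig_j:
                kind = LIGAND_LIGAND
            elif lig_i or lig_j:
                kind = PROTEIN_LIGAND
            else:
                kind = PROTEIN_PROTEIN
            edges.append((i, j, dist, kind))
    return features, edges


# Toy complex: two ligand atoms and two protein atoms.
toy_atoms = [
    (0, True, (0.0, 0.0, 0.0)),
    (1, True, (1.5, 0.0, 0.0)),
    (2, False, (4.0, 0.0, 0.0)),
    (3, False, (20.0, 0.0, 0.0)),  # too far away to interact with anything
]
toy_features, toy_edges = build_graph(toy_atoms)
```

In the toy complex the fourth atom lies beyond the cutoff from every other atom, so it appears as a vertex but has no edges.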

Model structure

Graph Neural Network Model
The nodes in the graph are divided into rigid nodes and flexible nodes. The pose prediction GNN is a vertex regression model that computes the motion of each flexible node and outputs a motion vector (x, y, z) giving the displacement along each axis. The network is implemented using a TransformerConv layer. The transformer convolutional layer employs an attention mechanism to capture the importance of each pair of atoms, and it also takes edge features (e.g., edge type, distance) as input. The TransformerConv layer computes the output features of each node using the following equation:

$$x_i^{\prime} = W_1 x_i + \sum_{j \in N(i)} \alpha_{ij}\left(W_2 x_j + W_3 e_{ij}\right)$$

Here $x_i$ is the input feature vector of node $i$, $x_i^{\prime}$ is the output feature vector of node $i$, and $N(i)$ is the set of neighbor nodes of $i$. The attention coefficient $\alpha_{ij}$ is computed as

$$\alpha_{ij} = \operatorname{softmax}\left(\frac{\left(W_4 x_i\right)^{\mathrm{T}}\left(W_5 x_j + W_3 e_{ij}\right)}{\sqrt{d}}\right)$$

where $e_{ij}$ denotes the features of edge $\langle i, j \rangle$, $d$ denotes the hidden size of the node features, and every $W$ is a learnable weight matrix. The L1 loss is used as the loss function:
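To make the update rule above concrete, here is a small single-head sketch in plain Python. It follows the two equations as written in this post (the same $W_3$ projects the edge features in both the attention score and the message) and normalizes the softmax over the neighbors of each node. The helper names (`matvec`, `add`, `transformer_conv`) and the toy weights are assumptions for illustration; the authors' implementation uses a library TransformerConv layer rather than code like this.

```python
import math


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def transformer_conv(x, edges, W1, W2, W3, W4, W5):
    """Single-head version of the update rule above.

    x: list of node feature vectors.
    edges: dict mapping (i, j) -> edge feature vector e_ij, j a neighbor of i.
    Returns the list of updated node features x_i'.
    """
    d = len(W4)  # hidden size = output dimension of the query projection
    out = []
    for i, x_i in enumerate(x):
        neighbors = [j for (a, j) in edges if a == i]
        query = matvec(W4, x_i)
        scores, messages = [], []
        for j in neighbors:
            edge_term = matvec(W3, edges[(i, j)])
            key = add(matvec(W5, x[j]), edge_term)
            scores.append(sum(q * k for q, k in zip(query, key)) / math.sqrt(d))
            messages.append(add(matvec(W2, x[j]), edge_term))
        h = matvec(W1, x_i)
        if neighbors:
            top = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            for w, m in zip(weights, messages):
                h = add(h, [w / total * v for v in m])
        out.append(h)
    return out


# Toy example: identity projections, zero query weights (uniform attention).
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
edge_proj = [[0.0], [0.0]]  # maps a 1-d edge feature to the 2-d hidden space
nodes = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
toy_edges = {(0, 1): [1.0], (0, 2): [1.0]}
updated = transformer_conv(nodes, toy_edges, identity, identity, edge_proj, zeros, identity)
```

With zero query weights every neighbor receives the same attention, so node 0 becomes its own features plus the average of its two neighbors, while nodes without neighbors only pass through $W_1$.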

$$L = \sum_i \left( \left|x_c^i - x_1^i - x^i\right| + \left|y_c^i - y_1^i - y^i\right| + \left|z_c^i - z_1^i - z^i\right| \right)$$

Here $(x_1^i, y_1^i, z_1^i)$ are the initial coordinates of atom $i$, $(x_c^i, y_c^i, z_c^i)$ are the coordinates of atom $i$ in the X-ray crystal structure, and $(x^i, y^i, z^i)$ is the movement vector that MedusaGraph predicts for atom $i$. During training, only flexible nodes contribute to the loss function, since only the motion of flexible nodes needs to be predicted.
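The loss above can be written out directly. The following sketch assumes coordinates are given as (x, y, z) tuples and that a boolean list marks the flexible atoms; the function name `pose_l1_loss` and the toy numbers are illustrative only.

```python
def pose_l1_loss(initial, crystal, predicted_move, flexible):
    """Sum of |crystal - initial - move| over x, y, z, for flexible atoms only."""
    total = 0.0
    for start, target, move, is_flexible in zip(initial, crystal, predicted_move, flexible):
        if not is_flexible:
            continue  # rigid atoms do not contribute to the loss
        total += sum(abs(c - s - m) for s, c, m in zip(start, target, move))
    return total


initial = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
crystal = [(1.0, 2.0, 3.0), (5.0, 5.0, 5.0)]
predicted_move = [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)]
flexible = [True, False]

loss = pose_l1_loss(initial, crystal, predicted_move, flexible)  # 0 + 1 + 2 = 3.0
```

The second atom is far from its crystal position, but because it is rigid it is masked out and the loss comes only from the first atom.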

Multistep Pose Prediction
A multi-step pose prediction mechanism gradually computes the final position of each atom (as shown in Figure 1b). The idea is to divide the path from the initial position to the final position into steps and to train multiple models, each predicting the motion of the atoms in one step. The output of the i-th model is the input of the (i+1)-th model. The output of the last model (for all atoms) is taken as the final predicted pose.
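The chaining of step models can be sketched as a simple loop. In this illustration each "model" is any callable that maps the current coordinates to one movement vector per atom; `make_toy_model` is a hypothetical stand-in for a trained GNN, and for brevity the sketch moves every atom rather than distinguishing rigid from flexible ones.

```python
def predict_multistep(coords, step_models):
    """Apply each step model in turn; model i+1 sees the output of model i."""
    for model in step_models:
        moves = model(coords)
        coords = [
            tuple(c + m for c, m in zip(atom, move))
            for atom, move in zip(coords, moves)
        ]
    return coords


def make_toy_model(target, fraction):
    """Hypothetical stand-in for a trained step model.

    It moves each atom a fixed fraction of the remaining way to a known target.
    """
    def model(coords):
        return [
            tuple(fraction * (t - c) for c, t in zip(atom, goal))
            for atom, goal in zip(coords, target)
        ]
    return model


start = [(0.0, 0.0, 0.0)]
target = [(4.0, 0.0, 0.0)]
models = [make_toy_model(target, 0.5), make_toy_model(target, 0.5)]

final_pose = predict_multistep(start, models)  # 0 -> 2 -> 3 along the x axis
```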


Pose Selection
After the initial docking poses (graph structures) of each complex are obtained, the pose prediction GNN is applied to them to produce the final docking poses of each complex.

The second network, the pose selection GNN, then predicts whether each such pose is a good pose. It is essentially a binary graph classification model. As shown in Figure 1c, the model includes three TransformerConv layers that compute the features of each node from its neighbors. The features of the flexible nodes are then summed by an add-pooling layer.
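The readout step can be illustrated as follows. Only the sum over flexible-node features comes from the description above; the linear layer, the sigmoid, the 0.5 decision threshold, and all names and numbers in this sketch are assumptions chosen to turn the pooled vector into a binary decision.

```python
import math


def pose_score(node_features, flexible, weights, bias):
    """Sum-pool the features of flexible nodes, then score the pooled vector.

    The linear layer + sigmoid head is an assumption for illustration.
    """
    dim = len(node_features[0])
    pooled = [0.0] * dim
    for feats, is_flexible in zip(node_features, flexible):
        if is_flexible:
            pooled = [p + f for p, f in zip(pooled, feats)]
    logit = sum(w * p for w, p in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Node features as they might look after the three TransformerConv layers.
node_features = [[1.0, 2.0], [3.0, -1.0], [100.0, 100.0]]
flexible = [True, True, False]  # the third (rigid) node is ignored by pooling

score = pose_score(node_features, flexible, weights=[0.5, -1.0], bias=0.0)
is_good_pose = score > 0.5
```

Restricting the pooling to flexible nodes means the large features of the rigid node have no influence on the score.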

Results

Comparison of MedusaGraph with Existing Pose Prediction Schemes


Of the raw poses generated by MedusaDock, 5.9% have an RMSD below 2.5 Å. After applying the pose prediction model, 14.4% of the poses have an RMSD below 2.5 Å. With the pose selection model, 37.6% of the poses are close to native.
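The percentages above count poses whose RMSD falls under a 2.5 Å cutoff. A minimal sketch of that bookkeeping is shown below, assuming the predicted and reference poses list the same atoms in the same order and are already in the same coordinate frame (no alignment is performed). The function names and toy coordinates are illustrative.

```python
import math


def rmsd(pose, reference):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    squared = [
        sum((a - b) ** 2 for a, b in zip(atom, ref_atom))
        for atom, ref_atom in zip(pose, reference)
    ]
    return math.sqrt(sum(squared) / len(squared))


def fraction_near_native(poses, reference, cutoff=2.5):
    """Fraction of poses whose RMSD to the reference is below the cutoff."""
    hits = sum(1 for pose in poses if rmsd(pose, reference) < cutoff)
    return hits / len(poses)


reference = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
close_pose = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]  # RMSD = 1.0
far_pose = [(3.0, 0.0, 0.0), (3.0, 0.0, 0.0)]    # RMSD = 3.0

near_native = fraction_near_native([close_pose, far_pose], reference)  # 0.5
```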

The study of ligands with different properties

Good poses are easier to find for some protein-ligand complexes than for others, mainly because complexes differ in flexibility. In general, the more atoms or rotatable bonds a ligand has, the harder it is to find a good pose for the resulting complex.

Evaluation of pose selection models

The pose selection GNN can select good poses from all generated poses, which can improve the final pose. It performs better on poses generated by the pose prediction model than on poses generated by MedusaDock; that is, it is easier to select a good pose from the output of the pose prediction GNN than from the initial pose set.

External dataset evaluation: CASF

Table 3 shows that MedusaGraph predicts poses better than the other methods, which indicates that MedusaGraph generalizes to the docking power benchmarks widely used in the drug discovery community.

[1] Jiang H, Wang J, Cong W, et al. Predicting protein–ligand docking structure with graph neural network[J]. Journal of Chemical Information and Modeling, 2022, 62(12): 2923-2932.
