Graph Neural Network GNN GCN AlphaFold2 Virtual Drug Screening and New Drug Design

These notes come from Professor Manolis Kellis (Director of Computational Biology at MIT) and his course "Artificial Intelligence and Machine Learning", supplemented in the middle by two of Li Mu's paper-reading videos (on GNNs and AlphaFold2). This lesson mainly introduces geometric deep learning and graph neural networks.

The main contents include graph neural networks (GNN, GCN), symmetry, equivariance, message passing, protein structure prediction (AlphaFold2), and drug design (virtual drug screening and new drug design). Click an item in the outline to jump to the module you need. If you need the course videos, send me a private message for the YouTube links:
AI for Drug Design - Lecture 16 - Deep Learning in the Life Sciences (Spring 2021)
Deep Learning for Protein Folding - Lecture 17 - MIT Deep Learning in Life Sciences (Spring 2021)

Graph neural network

A quick three-part preview (about 1h):

Li Mu's paper-reading video: a from-scratch, diagram-heavy walkthrough of graph neural networks (GNN/GCN) [paper reading]

Blog post link: A Gentle Introduction to Graph Neural Networks

Companion blog post (convenient for review): [Summary of Li Mu's GNN paper reading] A Gentle Introduction to Graph Neural Networks

Advanced and more comprehensive (3h):

Tang Yudi AI: [Dr. Tang takes you to learn AI] 2022 latest graph neural network course

Of course, graph neural networks cannot be learned in a few hours; the material above is only an introduction. GNNs also have a certain barrier to entry: you first need a solid deep learning foundation.

Having just reviewed some graph theory, I came to graph neural networks and got to feel how graph theory and machine learning combine. It was fun.

This is the sixth week of the course. The main content of the lecture is as follows:

This talk covers geometric deep learning and graph neural networks. Geometric deep learning is a framework for learning from graphs. Graph neural networks are deep learning networks that learn from graph representations and can be applied to problems such as sequential data in large language models, protein folding, drug design, and computational drug development. The lecturer introduces the concepts of symmetry, equivariance, and message passing.

Manolis discusses using networks and graphs to represent complex interconnected systems. He explains that such systems can be represented as graphs, which capture relationships, for example between users and objects. He gives the example of a chemical molecule represented as a graph of chemical bonds, and of the brain represented as a network of neurons. Manolis then asks for a principled way to represent biological networks in terms of neighborhoods. A naive approach feeds the whole graph into a standard fully-connected neural network where each node is connected to everything, but this has too many parameters and cannot adapt to graphs of different sizes. Manolis suggests graph neural networks instead, where nodes are connected only to their neighbors, and features are learned for each node from its neighbors.

Manolis discusses how graph neural networks can be applied to computational drug discovery, virtual drug screening, and new drug design. He explains that chemical structures can be represented as graphs, which correspond directly to chemical diagrams. He shows how to compare these graphs to create chemical fingerprints.

Manolis walks through the Weisfeiler-Lehman test on a graph. The test asks what each node can compute about itself based on its neighbors. Manolis looks at the properties of each node and relabels each node according to its neighbors. In the example, this reveals two types of nodes in the graph: nodes with 3 neighbors and nodes with 2 neighbors.

Manolis wondered whether we could learn something about chemistry and biology by encoding proteins as graphs to predict their properties, such as their three-dimensional structure. A protein's structure is determined by its amino acid sequence, Manolis explained. There are many methods for protein structure prediction, including template-based modeling, which uses a known structure as a template for the protein being predicted. The most successful approach uses multiple sequence alignment, aligning the sequence to known pieces and modeling how they fit together.

Manolis gave a demonstration of computational drug discovery. There are three main methods: 1. Virtual screening: screen a large number of compounds and test whether they hit the target. 2. Simulation: simulate chemical space and predict how molecules fold in three dimensions. 3. New drug design: predicting bacterial resistance to antibiotics.

Manolis shows how to represent a molecular graph with a junction tree using hierarchical encoders and decoders. The goal is to generate meaningful molecules from a simplified representation of the molecular graph. Manolis showed how to encode and decode molecules with a generative network built on a graph neural network.

Manolis gave an overview of the research he and his team have done on deep learning and graph neural networks. They model the relationships between nodes in a graph by learning properties from each node's neighbors and passing them to the node itself. This lets each node accumulate more attributes from its neighbors, propagate these attributes through the layers of the network, and average the neighbor embeddings at each layer. This research has led to two types of applications: screening molecules for their properties and generating new drugs with desired properties.

  • Cover (the bottom half of the cover is wrong hh)

  • Outline of this lesson

1. Geometric Deep Learning

Representation learning

  • The core of traditional deep learning is a kind of representation learning, which generally consists of two parts
    • 'Modern' Deep learning: Modern deep learning models such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). It is a kind of hierarchical representation learning. In deep learning, the learning process from low-level features to high-level features is regarded as hierarchical representation learning.
    • 'Classical' Fully-connected Neural Networks: Traditional fully-connected neural networks for classification
    • In deep learning, feature extraction and classification tasks are interrelated (coupled): the classification task "drives" feature extraction
    • Deep learning is a very powerful and general paradigm. In the field of deep learning, one needs to be creative because the field is still in its infancy. New application domains (e.g., domains beyond images) may have structures that cannot be captured/utilized by current architectures. Genomics/biology/neuroscience can help drive the development of new architectures.

Data Types for Machine Learning: Sequences, Grids, Graphs

  • The CNNs and RNNs we have learned so far deal with regular, structured grid data.

  • But the world also contains non-grid data: graph data, such as social networks (commonly used in search, advertising, and recommendation), chemical molecules, neurons in the human brain, etc.
  • How do we represent and learn from such data? One suggestion is to use the adjacency matrix to convert the graph into 2-dimensional grid data. But if a social network has 6 billion nodes, its full adjacency matrix is 6 billion × 6 billion, which is impractical to store, and it is usually sparse, with most entries equal to 0.
  • So a graph is really its own kind of object, with much richer relationships, and learning on such objects is a complex task.

More graph data

CNN vs GNN:

A core point: the images (two-dimensional grid data) processed by CNNs have translational symmetry, with pixels as the grid points. The graphs processed by GNNs have permutation symmetry, meaning that no matter how strangely a graph is drawn, as long as it has the same nodes and edges it is the same graph.

In a sense, a grid (two-dimensional grid data) can be regarded as a special, simple kind of graph in which the nodes are the pixels (grid points) and edges connect neighboring pixels; it can be understood this way.

  • Convolutional neural network:
    • Base domain: Grid. In a convolutional neural network, input data (such as an image) is usually represented as a grid, where each grid point (ie pixel) contains some specific information (such as a color).
    • Symmetry: Translation. Convolutional neural networks have translational symmetry, which means that no matter where an object in an image appears (for example, near the left edge, right edge, or center of the image), the network detects it in the same way.
    • Convolution: equivariant to translation. The convolution operation is equivariant to translation: if the input is shifted, the output is shifted in the same way.
  • Graph neural network:
    • Base domain: graphs. In a graph neural network, the input data is represented as a graph, where each node carries some specific information and each edge represents the relationship between two nodes.
    • Symmetry: permutation. Graph neural networks have permutation symmetry, which means that no matter how the nodes in the graph are renumbered (i.e., no matter how the node ordering changes), the network produces the same output for the graph.
    • Message passing: permutation equivariance. Message-passing operations are equivariant under permutations: if the nodes of the input are reordered, the node-level outputs are reordered in exactly the same way (see the short sketch below).
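To make the two symmetries concrete, here is a minimal numpy sketch (toy graph and features made up for illustration): a sum-over-neighbors layer is permutation-equivariant, while a graph-level sum readout is permutation-invariant.

```python
import numpy as np

A = np.array([[0, 1, 1],      # toy adjacency matrix (3 nodes)
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
X = np.array([[1.0, 0.0],     # toy node features, one row per node
              [0.0, 1.0],
              [1.0, 1.0]])

def aggregate(A, X):
    """Sum each node's neighbor features -- a simple permutation-equivariant layer."""
    return A @ X

P = np.eye(3)[[2, 0, 1]]      # a permutation matrix that reorders the nodes

H = aggregate(A, X)
H_perm = aggregate(P @ A @ P.T, P @ X)

print(np.allclose(P @ H, H_perm))            # True: node outputs are permuted the same way
print(np.allclose(H.sum(0), H_perm.sum(0)))  # True: a sum readout is permutation-invariant
```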

Leading into GNNs

For a graph, each node also has attributes, just as in CNNs (grid data) each pixel has a pixel value. For example, if the graph is a molecular graph, the atoms (nodes) may be hydrophilic or hydrophobic; if the graph is a social network, people (nodes) have different hobbies and interests. These attributes can be collected into a vector that represents the node. One of our tasks is to learn from these node properties.

One of the challenges a GNN must handle is that, because the input is a graph, it must produce the same output for different input forms of the same graph.

For example, the following 24 matrices all represent the same graph. The input matrices (representations) differ, but the results we get from a GNN should be identical, because they describe the same graph.

Some mainstream neural network structures, we mainly learn GNN today

2. Graph Neural Networks

Machine Learning Lifecycle

The life cycle of machine learning is shown in the figure. Traditional algorithms require manual feature engineering and preprocessing of data into structured form. Modern algorithms (deep learning) can learn features by themselves and are fed raw data directly. This is the so-called representation learning.

In the encoder, the different elements are embedded (mapped) into an embedding space; this is where the representation lives. For example, the word-embedding technique (word2vec) mentioned earlier in NLP is exactly this kind of thing.

Learning on graphs is hard

Modern deep learning algorithms are designed for simple sequence/grid data, but data such as graphs cannot simply be represented as sequences or grids, so we need more broadly applicable deep neural networks.

But this is not easy, because graphs are complex and dynamic. Why:

  1. Arbitrary size and complex topology : Networks are of arbitrary size and shape, and their topology is often more complex than meshes. For example, in an image (i.e., a grid), each pixel has fixed neighbors, but in a network, the number and identity of a node's neighbors may vary from node to node. Furthermore, there is no spatial locality in the network as in the grid, which makes it impossible to apply many grid-specific techniques (such as convolutions) directly.
  2. No fixed node order or reference point : The nodes in the network have no fixed order and no fixed reference point. This means that a node's properties or behavior cannot be inferred based solely on its identity or location.
  3. Often dynamic and multimodal in nature : Many networks are dynamic, that is, their structure and properties change over time. For example, in a social network, new connections may be formed, old connections may disappear, and attributes of nodes (such as users' interests) may also change. In addition, the nodes and edges of the network may have multiple types of attributes, which may exist in multiple modalities (such as values, texts, images, etc.).

Feature Learning in Graphs

We still rely on embedding techniques. Just as the attention mechanism we learned earlier for LLMs captures contextual relationships between words, a node and its neighbors have a similar relationship: besides the node's own attributes, we also need to capture information about its relationships with surrounding nodes. This is graph feature learning. The input is a graph and the output is a low-dimensional space, for example a two-dimensional plane in which each node gets a coordinate (just an illustration, not necessarily exact).

Feature learning in graphs mainly refers to how to extract useful information from graph structures and convert them into feature representations or embeddings that can be used for subsequent machine learning tasks. This usually involves learning a mapping function f that is able to map nodes (or edges) in the graph into a low-dimensional space (also called embedding space) that usually has properties that are more tractable than the original graph structure .

In this process, the input is a graph structure and the output is a low-dimensional node embedding. For example, in the medical domain, the input graph might represent a network of diseases, where nodes represent different diseases and edges represent relationships between diseases (such as co-occurrence). Through graph feature learning, we can obtain low-dimensional embeddings for each disease, which capture the relationship between diseases and can be used for subsequent machine learning tasks, such as disease prediction or drug recommendation.

How to learn the mapping function f is a key problem. Common methods include Graph Convolutional Networks and Graph Attention Networks. These methods all attempt to learn embeddings that capture the relationships between nodes by exploiting the graph topology and node features.

Converting each node into a corresponding vector is the graph embedding technique.

The goal is to provide efficient, task-independent feature learning for machine learning tasks in networks. In other words, we hope that the learned features or embeddings are not only useful for the current task, but also easily transferable to other related tasks. For example, learned disease embeddings can be used not only to predict the incidence of a certain disease, but also to recommend treatments for that disease.

A small side note on the word "embed"/"embedding", which appears constantly in deep learning:

In machine learning, "embed" or "embedding" generally has two main meanings:

  1. **Represents conversion or mapping:** This meaning comes from the concept of word embedding (word embedding), such as Word2Vec, GloVe, etc. Here "embed" means to map words or other types of objects from the original representation space to a new (usually low-dimensional) representation space. For example, we might map a word (usually represented as a one-hot encoding) to a real vector space (e.g., a 300-dimensional vector). This new representation space is designed to capture certain relationships between words (such as semantic or grammatical similarity).
  2. **Embedded representation of data:** In a broader sense, "embed" or "embedding" can also refer to any process of transforming data from one representation space to another. This can cover other types of embeddings besides word embeddings, such as entity embeddings, graph embeddings, etc.

Ways to Analyze Networks

Here are some ways and tasks to analyze the network:

  1. Node Classification : In this type of task, the goal is to predict the type of a given node. For example, in a social network, we might want to predict a user's interests or occupation.
  2. Link Prediction : The goal of this task is to predict whether there is a link between two nodes. For example, in a social network, we might want to predict whether two users will become friends.
  3. Community Detection : The goal of community detection is to identify clusters of close links between nodes, which can be viewed as communities or groups in the network. For example, in a social network, we might want to identify groups of users with similar interests or backgrounds.
  4. Network Similarity : In this type of analysis, the goal is to measure the similarity of two nodes or two networks. This might include comparing node properties, structural locations, or comparing the topology of the entire network.

A Naive Approach: the most basic way of learning on graphs

Symbolic representation of graphs , suppose we have a graph G:

  1. V is the set of vertices
  2. A is an adjacency matrix
  3. X ∈ R^(m×|V|) is the node feature matrix, where m is the number of features per node. For example, if a person in a social network has four hobbies (singing, dancing, rap, basketball), then m = 4.
  4. v: a node in V; N(v): the set of neighbors of node v

Node characteristics :

  • In biological networks, node features may include gene expression profiles and gene function information.
  • When there are no node features in the graph dataset, we can use indicator vectors (one-hot encoding of nodes) as node features.

The most basic method follows the traditional machine learning idea: convert the data into structured form and feed it to a neural network. Here, the adjacency matrix and the feature matrix are concatenated and then fed into a deep neural network (see the sketch below).
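A rough sketch of this naive setup, assuming PyTorch and made-up toy sizes; the names `A`, `X`, and `mlp` are illustrative, not from the course:

```python
import torch
import torch.nn as nn

num_nodes, num_feats = 5, 4                                # assumed toy sizes
A = torch.randint(0, 2, (num_nodes, num_nodes)).float()   # adjacency matrix
X = torch.randn(num_nodes, num_feats)                      # node feature matrix

inp = torch.cat([A.flatten(), X.flatten()])                # one fixed-size input vector
mlp = nn.Sequential(
    nn.Linear(inp.numel(), 64),
    nn.ReLU(),
    nn.Linear(64, 2),                                      # e.g. a 2-class graph label
)
logits = mlp(inp)
```

Note how the input dimension is hard-wired to `num_nodes` and how the vector changes if the nodes are reordered, which already hints at the problems listed next.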

However, there are some problems with this idea:

  1. Too many parameters : Deep neural networks require a large number of parameters to learn, which can lead to very large computational requirements for large-scale graphs and can lead to overfitting.
  2. Not suitable for graphs of different sizes : If we have graphs of different sizes (i.e. different number of nodes), this approach may not work anymore, since the input and output sizes of deep neural networks are fixed.
  3. Sensitive to node order : since we feed the adjacency matrix and feature matrix directly as input, the model is sensitive to the order of nodes. In many graphs the order of nodes does not matter, so we do not want the model to depend on it. As mentioned earlier, different representations of the same graph (different node orderings) must always produce the same output!

Therefore, while this straightforward approach may work in some cases, in many practical applications we need more complex and flexible approaches to graph data, such as Graph Neural Networks.

A better way to deal with it: Deep Graph Encoders

I again recommend playing with the interactive demos at the beginning of this blog post: A Gentle Introduction to Graph Neural Networks

The better way is a convolution-like method applied on the graph: for a node, the first layer of convolution learns from its neighbors, the second layer from the neighbors of its neighbors, and the third layer from neighbors of neighbors of neighbors. In this way the graph is embedded and node information is extracted, instead of the earlier crude conversion into an adjacency matrix.

By gpt4:

Deep graph encoders use a special convolution operation, graph convolution, to extract node or graph features and generate high-quality node or graph embeddings.

The following is the architecture of a typical deep graph encoder:

  1. Graph Convolutions : This is an operation that aggregates information on each node in the graph. Each node's new feature is a function of its neighbors' features, which is usually a weighted sum or average.
  2. Activation Function : Increase the expressiveness of the model through a non-linear function, such as ReLU or tanh.
  3. Regularization : For example, dropout is a technique used to prevent overfitting by randomly discarding some nodes during training to increase the generalization ability of the model.
  4. Graph Convolutions : Perform another round of graph convolutions to further extract and integrate node features.

The output is node embeddings, which capture the characteristics of nodes and their positions in the graph. Furthermore, we can extend this method to generate embeddings of subgraphs or entire graphs.

This deep graph encoder is more efficient than traditional deep learning methods when dealing with graph data with complex structures and features.

  • Compare this with convolution in CNNs: from images to graphs. The same idea also appears in NLP. Simply put, a node integrates its own attributes with those of its neighbors.

Graph Convolutional Networks(GCN)

Graph Convolutional Networks (GCNs) are deep learning models designed specifically for processing graph data. The main idea of GCNs is that the neighborhood of a node defines a computational graph, through which we propagate and transform information between nodes to compute node features.

The operation steps of GCNs can be briefly described as follows:

  1. Determine the node calculation graph : each node and its neighbors form a calculation graph. For example, for a target node, its computational graph will include the node and its neighbors.
  2. Propagate and transform information : In a computational graph, information will be propagated from one node to another. At the same time, the information of each node will be updated through some transformation (for example, through a neural network)

Idea: Aggregate Neighbors

A key idea of GCNs is to aggregate neighbor information. Neighborhood aggregation is a key concept in graph neural networks. Different approaches differ in how to aggregate neighbor information. The basic intuition is that a node can aggregate information from its neighbors through a neural network. For example, a node might compute a weighted sum of the features of all its neighbors, and then take this weighted sum as input and process it through a neural network to generate new node features.

In the figure, layer 0 is A, layer 1 is B, C, D, and layer 2 contains A again. This is not a cycle; the information flows layer by layer: first A's information (at layer 2) is used to update B, C, and D, and then their information is passed on to A at layer 0.

An advantage of this approach is that it can work directly with graph data without converting the graph to another format. This enables GCNs to preserve graph structure and feature information when processing complex graph data.

The working principle of GCNs described above is based on the method in the paper "Semi-Supervised Classification with Graph Convolutional Networks" published by Kipf and Welling at the ICLR conference in 2017
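As a concrete reference, here is a minimal sketch of the propagation rule from that paper, written in plain PyTorch with illustrative names (not any particular library's API):

```python
import torch
import torch.nn as nn

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU( D^-1/2 (A + I) D^-1/2 H W )."""
    A_hat = A + torch.eye(A.size(0))          # add self-loops
    deg = A_hat.sum(dim=1)
    D_inv_sqrt = torch.diag(deg.pow(-0.5))    # symmetric degree normalization
    return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

A = torch.tensor([[0., 1., 1.],
                  [1., 0., 0.],
                  [1., 0., 0.]])
H = torch.randn(3, 8)                         # toy node features
W = nn.Parameter(torch.randn(8, 16))          # trainable weights
H1 = gcn_layer(A, H, W)                       # new node embeddings, shape (3, 16)
```

Stacking two or three such layers gives the "neighbors, then neighbors of neighbors" behavior described above.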

  • Approximate process:

Different methods have some differences in how to aggregate neighbor information, but the basic approach is to average information from neighbors and apply a neural network to process it. Specifically, this process can be divided into two steps:

  1. Averaging Information from Neighbors : First, the target node gathers information from its neighbors. This information can be node features, edge features, or information from more distant nodes. This information is collected and averages or other forms of aggregate statistics are calculated.
  2. Applying a neural network : Next, these summary statistics are passed as input to a neural network. This neural network can be any type of network, such as a fully connected network, a convolutional network, or a more complex network. The task of this network is to convert these summary statistics into new node features

Collecting information from neighbors should not be hard to understand. Applying a neural network sounds abstract at first, but it is not: to some extent it is like CNN's convolution plus an MLP. From the GCN perspective, it is simply the process of learning new node features from the original data. (My wording is limited; anyway, think of it by analogy with CNNs.)

The above process forms a general blueprint for learning graph data, which can be used for tasks such as node classification, graph classification, and link prediction:

  • Node classification : Apply a function f to each node, the input is the node feature learned through the graph neural network, and the output is the category of the node.
  • Graph classification : Apply a function f to the entire graph, the input is the features of all nodes in the graph (possibly summarized in some way), and the output is the category of the graph.
  • Link Prediction (Edge Prediction): Applies a function f to each pair of nodes, the input is the features of these two nodes (and possibly edge features), and the output is a prediction of whether there is a link between these two nodes.

The three "flavours" of GNN layers

The design of graph neural network (GNN) layers is currently a very active research area. The main goal of a GNN layer is to build permutation-equivariant functions F(X, A) on graphs, realized by applying a shared, locally permutation-invariant function φ(x_u, X_N(u)) at every node u. Here X represents the node features, A the adjacency matrix, x_u the features of node u, and X_N(u) the features of u's neighbors.

Here are some commonly used terms:

  • F is called "GNN layer".
  • φ is called "diffusion", "propagation", or "message passing".

So, how do we realize φ? This is a very active area of research, and fortunately almost every φ can be categorized into one of three "flavors" (convolution, attention, and message passing).

Graph Neural Networks (GNNs) have three "flavors" of layers:

  1. Convolutional : This type of graph neural network layer, inspired by convolutional neural networks, applies a transformation (usually a linear transformation) to the features of each node and its neighbors, and then aggregates the results (usually by sum or average) to update the state of each node. This mechanism relies on the topology of the graph to extract local patterns in a way that does not depend on a specific ordering of nodes. Graph Convolutional Networks ( GCNs ) are an example of this type.
  2. Attentional : In this type of graph neural network layer, the update of each node depends on the information of its neighbors, but the contribution of each neighbor is weighted by attention mechanism. This means that some neighbors may have more influence on the update of the current node, while other neighbors may have less influence. ( Judging the different influences of different neighbors through the attention mechanism ). This mechanism allows the model to learn which neighbors are important in a given context. Graph Attention Networks ( GATs ) are an example of this type.
  3. Message-passing : In this type of graph neural network layer, each node sends "messages" to its neighbors, and then updates its own state based on all messages received. These "messages" are generated from the characteristics of the sending node, and all received messages can be aggregated in various ways (e.g., weighted average, max pooling, etc.). This approach helps the model capture complex patterns and dependencies in the graph. Message Passing Neural Networks (MPNNs) are an example of this type.

All three types of GNN layers try to obtain information from a node's neighbors, but they employ different strategies and thus may behave differently in different tasks and graph structures.

I was a little confused about the difference between GCN and MPNN, so I asked GPT-4. I still don't fully get it, but I will have time to learn more later:

Graph Convolutional Networks (GCNs) and Message Passing Neural Networks (MPNNs) are somewhat similar in that they both attempt to pass information between a node and its neighbors. However, they differ in implementation details and concepts.

  1. Graph Convolutional Networks (GCNs) : Inspired by Convolutional Neural Networks (CNNs), GCNs try to capture local information from a node's neighbors. Unlike CNN, which performs local convolutions on regular grids, GCNs perform convolutions on irregular graph structures. In GCN, the new feature of each node is the weighted average of the features of its neighbors (the weighting term is usually the attribute related to the edge or the degree of the node), and then passed through a non-linear activation function. This operation can be viewed as a form of convolution, which defines a local neighborhood over the spatial structure of the graph.
  2. Message Passing Neural Network (MPNN) : The way MPNN works is that each node sends "messages" to its neighbors and then updates its own state based on all the messages received. These messages are generated from the features of the sending nodes through neural network functions. The node then aggregates all the messages received (say by summing or taking the maximum value) and updates its own state. This approach allows the model to capture more complex dependencies, as each node can decide how to update its state based on its own characteristics and the messages it receives.

In simple terms, convolution operations focus on capturing a fixed pattern (i.e., "averaging") between a node and its neighbors, while message passing allows more complex interaction patterns, because each node can autonomously send and receive messages and update its own state (see the sketch below).
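A hedged sketch of the message-passing flavor, with made-up module names, to contrast with the averaging above: each node builds messages from its neighbors with one learned function and updates its own state with another.

```python
import torch
import torch.nn as nn

class TinyMPNNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)    # message function M(h_v, h_u)
        self.upd = nn.GRUCell(dim, dim)       # update function U(h_v, m_v)

    def forward(self, A, H):
        n = H.size(0)
        new_H = []
        for v in range(n):
            neighbors = A[v].nonzero().flatten()
            # messages from each neighbor, built from both endpoint states
            msgs = self.msg(torch.cat([H[v].expand(len(neighbors), -1),
                                       H[neighbors]], dim=-1))
            m_v = msgs.sum(dim=0)             # aggregate the incoming messages
            new_H.append(self.upd(m_v.unsqueeze(0), H[v].unsqueeze(0)).squeeze(0))
        return torch.stack(new_H)
```

If `msg` is replaced by a fixed, degree-normalized average and `upd` by a single linear layer plus nonlinearity, this collapses back to the convolutional (GCN-like) flavor, which is the sense in which GCN is the more constrained special case.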

3. Deep Learning on Graphs

Deep Encoder


This mathematical model describes how a deep encoder (Deep Encoder) works in a graph neural network. The basic approach of this deep encoder is to take the information (i.e. messages) of neighbor nodes, average it, and apply a neural network.

The following is a detailed explanation of this model:

  1. Initial embedding: at layer 0 (the 0-th layer), the embedding (feature vector) of each node $v$ is just its own feature vector, $h_v^{0} = x_v$.
  2. Average the neighbors' embeddings: for each node $v$ with neighbor set $N(v)$, take the previous-layer embedding $h_u^{k-1}$ of every neighbor $u \in N(v)$ and compute the average of these embeddings.
  3. Apply the neural network: multiply this average by a weight matrix $W_k$, add a bias term $B_k$, and pass the result through a nonlinear activation function $\sigma$ (such as ReLU) to obtain node $v$'s embedding at layer $k$: $h_v^{k} = \sigma\!\left(W_k \cdot \tfrac{1}{|N(v)|}\sum_{u \in N(v)} h_u^{k-1} + B_k\right)$.

The above steps are carried out in k layers, and each layer adds more neighbor information to the original embedding, and finally generates a deep node representation.

Just like the previous process, there are a total of k layers:

  • Model parameters

The model parameters are weights and biases:

parameter settings:

  1. Trainable weight matrices : in each layer's computation there is a weight matrix W_k used to update the node embeddings. These matrices are among the parameters the model must learn, and the weight matrix of each layer can be different, i.e. W_k changes with the layer index k.
  2. Computing node embeddings : the embedding h_v^(k) of each node v is obtained by averaging the previous-layer embeddings of its neighbors, multiplying by the weight matrix W_k, adding a bias term B_k, and finally passing through a nonlinear activation function.
  3. Loss functions and optimization : the resulting node embeddings can be plugged into any appropriate loss function, such as a cross-entropy loss for node classification or a negative log-likelihood loss for link prediction. The weight matrices and bias terms are then adjusted by stochastic gradient descent (SGD) or another optimizer to minimize the loss and train the model (a small sketch follows below).
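A small PyTorch sketch of this encoder and its training signal, under the assumption (one common reading of the formula above) that B_k acts on the node's own previous-layer embedding; all names and sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanAggEncoder(nn.Module):
    def __init__(self, in_dim, hid_dim, num_layers=2):
        super().__init__()
        dims = [in_dim] + [hid_dim] * num_layers
        self.W = nn.ModuleList([nn.Linear(d_in, d_out, bias=False)
                                for d_in, d_out in zip(dims[:-1], dims[1:])])
        self.B = nn.ModuleList([nn.Linear(d_in, d_out, bias=False)
                                for d_in, d_out in zip(dims[:-1], dims[1:])])

    def forward(self, A, X):
        H = X                                           # layer-0 embeddings = raw features
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)
        for W_k, B_k in zip(self.W, self.B):
            neigh_mean = (A @ H) / deg                  # average of neighbor embeddings
            H = torch.relu(W_k(neigh_mean) + B_k(H))    # transform + self term
        return H

# toy usage: 4 nodes, 3-dim features, 2-class node labels
A = torch.tensor([[0., 1., 1., 0.],
                  [1., 0., 0., 1.],
                  [1., 0., 0., 1.],
                  [0., 1., 1., 0.]])
X = torch.randn(4, 3)
y = torch.tensor([0, 1, 1, 0])

enc = MeanAggEncoder(3, 16)
head = nn.Linear(16, 2)
loss = F.cross_entropy(head(enc(A, X)), y)   # train W_k, B_k (and the head) by SGD on this loss
```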

GraphSAGE

GraphSAGE (Graph Sampling and Aggregation) is a framework for representation learning on large graphs. Its purpose is to generate node embeddings that can generalize on nodes and graphs not seen during training, which is very valuable for many practical applications, because real graphs are often dynamically generated and extended.

So far, we have aggregated by taking the (weighted) average of neighbor messages, GraphSAGE proposes an improved way.

The key idea of ​​the GraphSAGE method is that for each node, it will first sample a fixed number of neighbor nodes, and then perform some form of aggregation of the characteristics of these neighbor nodes (for example, average, pool, or use LSTM, etc.) , and then combine the aggregated neighbor features with the node's own features (eg, concatenation or summation), and generate new node features through a nonlinear transformation (eg, ReLU). This process can be repeated many times, similar to doing multi-layer convolutions on a graph. Finally, the obtained node features can be used for various downstream tasks, such as node classification, link prediction, etc.

Three aggregation variants:

  1. Average: AGG in the formula denotes the aggregation function, here a (weighted) average over neighbors: the previous-layer embeddings h_u^(k-1) of all neighbors u ∈ N(v) are averaged, where N(v) is the neighbor set of node v.
  2. Pool (pooling): Transform the neighbor vectors and apply a symmetric vector function. For example, you can perform a linear transformation on each neighbor vector (multiply by the weight matrix Q), and then take the element-level average or maximum value on all neighbor vectors . One benefit of this approach is that because both the average and maximum operations are symmetric, the aggregated result will be the same regardless of the order of the neighbor nodes.
  3. LSTM: apply an LSTM to a reordering of the neighbors. For each node, the embeddings of its neighbor nodes are first reordered (e.g., randomly permuted) and then fed into an LSTM network as a sequence; the LSTM's final output is used as the node's new aggregated embedding. A disadvantage of this method is that an LSTM is not symmetric, so the same neighbors in a different order may produce different aggregation results (see the sketch below).
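A sketch of the three aggregator variants applied to a single node's sampled neighbors (illustrative code, not the official GraphSAGE implementation):

```python
import torch
import torch.nn as nn

dim = 8
h_neigh = torch.randn(5, dim)          # previous-layer embeddings of 5 sampled neighbors

# 1. Mean aggregator
agg_mean = h_neigh.mean(dim=0)

# 2. Pool aggregator: transform each neighbor, then take an element-wise max
Q = nn.Linear(dim, dim)
agg_pool = torch.relu(Q(h_neigh)).max(dim=0).values

# 3. LSTM aggregator: feed a random permutation of the neighbors as a sequence
lstm = nn.LSTM(dim, dim, batch_first=True)
perm = torch.randperm(h_neigh.size(0))
_, (h_last, _) = lstm(h_neigh[perm].unsqueeze(0))
agg_lstm = h_last.squeeze()            # order-dependent, unlike mean and pool
```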

Application to Drug Design/Evaluation

The model is applied to predicting the side effects of polypharmacy: given a set of prescribed drugs, the model predicts the possible side effects of taking them together. Each side effect is given a probability; the higher the probability, the more likely it is to occur.

Understanding and predicting drug side effects is crucial during drug research and development. If a drug's side effects are too severe, the drug may not be on the market. By using graph neural networks to predict drug side effects, researchers can better understand drug interactions and potentially help them design new drugs that are safer and more effective.

Heterogenous Networks

  • Heterogeneous networks: networks whose nodes and edges can have different types and attributes, which makes the problem more complicated

By gpt4:

How are heterogeneous networks embedded?

A heterogeneous network means that the nodes and edges in the network can have multiple types. For example, in a social network, nodes might represent individuals or organizations, and edges might represent friendships or business relationships. In this case, we need a way to embed such diverse networks.

One approach to embedding heterogeneous networks is to use graph neural networks (GNNs). These models can handle multiple types of nodes and edges, and generate node embeddings by taking the type information of nodes and edges into account. For example, neighbors connected by different edge types may be encoded with different parameters.

For each type of node or edge , we can have a different embedding model or set of parameters. Then, different types of neighbor information are combined by suitable aggregation strategies (e.g., weighted summation, averaging, max pooling, etc.). This is the so-called heterogeneous graph neural network.

The design and application of heterogeneous graph neural networks is a hotspot in the current graph deep learning research, with rich theoretical research and application practice.

  • setup

  • Aggregation of neighbors based on edge type:
    • Key idea: generate node embeddings from network neighborhoods while keeping edge types separate
    • Each edge type models the node's neighborhood separately, defining its own computational graph

  • Mathematical formula (a rough code sketch of one possible edge-type-aware layer follows below)
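A hedged sketch of what such an edge-type-aware layer could look like (in the spirit of relational GCNs; names, shapes, and the averaging choice are assumptions, not the slide's exact formula):

```python
import torch
import torch.nn as nn

num_nodes, dim, num_edge_types = 4, 8, 2
A_by_type = [torch.randint(0, 2, (num_nodes, num_nodes)).float()
             for _ in range(num_edge_types)]            # one adjacency matrix per edge type
H = torch.randn(num_nodes, dim)

W_r = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(num_edge_types)])
W_self = nn.Linear(dim, dim, bias=False)

out = W_self(H)                                         # self term
for A_r, W in zip(A_by_type, W_r):
    deg = A_r.sum(dim=1, keepdim=True).clamp(min=1)
    out = out + W((A_r @ H) / deg)                      # type-specific neighbor average
H_new = torch.relu(out)
```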

Training the Model

We need to define a loss function on the embeddings in order to train the model

4. GNNs for Protein folding

Chemical Structures as Graphs

  • Chemical Structure as a Graph: The structure of a chemical molecule can be viewed as a graph where atoms are considered as vertices and chemical bonds as edges. This perspective provides powerful tools for the computation and analysis of chemical molecules, especially in fields such as computational chemistry and drug design.
  • Structural Similarity of Molecules: Structural similarity of molecules is measured by comparing the chemical structures of two or more molecules. This similarity is usually determined by comparing properties of molecular graphs, such as atom types, chemical bond types, connection relationships between atoms, etc.
  • Graph theory and chemistry: Graph theory is a branch of mathematics that mainly studies the properties and structures of graphs. In chemistry, graph theory is widely used in the description and analysis of molecular structures. By treating molecules as graphs, the concepts and techniques of graph theory can be used to solve many chemical problems, such as judging whether two molecules are isomorphic, predicting the chemical reactivity of molecules, etc.
  • Weisfeiler-Lehman Test: The Weisfeiler-Lehman (WL) test is an algorithm for determining whether two graphs are isomorphic. Graphs that are isomorphic are structurally identical even if the vertices are labeled or ordered differently. The WL test relabels the vertices of the graph through an iterative process. If the labels obtained by the two graphs are the same after enough iterations, the two graphs are considered to be isomorphic . In chemistry, the WL test can be used to compare the structures of molecules, thereby helping chemists identify molecules with similar structures.

To explain briefly, let's look at the first row:

  1. The first picture is blue, all nodes are the same
  2. We mark the nodes with two neighbors in green and the nodes with three neighbors in yellow, and we get the second graph
  3. We then mark nodes with two yellow neighbors as purple, nodes with two green and one yellow neighbors as orange, and nodes with one green and one yellow neighbor as gray
  4. Finally, we count the nodes of this molecular graph: one purple, two gray, and two orange.
  5. Although the molecular graph below looks different from the one above, this simple computation also gives one purple, two gray, and two orange nodes, so we consider the two graphs isomorphic (see the sketch below).
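A small Python sketch of one Weisfeiler-Lehman relabeling pass on toy graphs, mirroring the coloring procedure above (graphs and labels are made up for illustration):

```python
from collections import Counter

def wl_histogram(adj, labels, iterations=2):
    """adj: dict node -> list of neighbors; labels: dict node -> initial label."""
    for _ in range(iterations):
        # each node's new label = (old label, sorted multiset of neighbor labels)
        labels = {v: (labels[v], tuple(sorted(labels[u] for u in adj[v])))
                  for v in adj}
    return Counter(labels.values())

# two toy 5-node "molecules", all nodes starting with the same label (as in step 1)
g1 = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
g2 = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2, 4], 4: [2, 3]}

same = wl_histogram(g1, {v: "blue" for v in g1}) == wl_histogram(g2, {v: "blue" for v in g2})
print(same)   # if the label histograms ever differ, the two graphs cannot be isomorphic
```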

This section outlines two main technologies for drug development using graph neural networks: virtual screening/molecular property prediction and new drug design.

Virtual Screening/Molecular Property Prediction: Virtual screening is a computational technique used to identify likely drug candidates in large compound libraries. In this process, graph neural networks are used to learn low-dimensional representations of molecules, which can be used to predict molecular properties such as solubility, toxicity, etc. This technique has been used in several studies, including those by Duvenaud et al. (2015), Kearnes et al. (2016) and Jin et al. (2018).

New drug design: New drug design, also known as "de novo" drug design, is a process that uses computational tools to design new potential drugs. In this process, graph neural networks are used to generate new molecular structures. This technique has been used in several studies, including those by Olivecrona et al. (2018), Gómez-Bombarelli et al. (2018), Jin et al. (2018), and Popova et al. (2018).

Protein Structure Prediction

The three-dimensional protein structure folded from the one-dimensional amino acid sequence has been studied for decades.

Protein structure prediction is a method used to determine the three-dimensional structure of a protein, usually from its amino acid sequence. Understanding the three-dimensional structure of proteins is very important because it can help us understand protein functions and find possible drug targets in drug design.

Protein folding usually involves the following main structural elements:

  • α-helices (Alpha-helices): α-helices are a common secondary structure of proteins. Due to the hydrogen bonding between amine groups and carboxyl groups, amino acid chains rotate in space to form a helix.
  • Beta -sheets: Beta-sheets are another common secondary structure of proteins, formed by two or more parallel or antiparallel beta strands joined together by hydrogen bonds.

Protein structure prediction is an important and challenging problem in computational biology. Many different techniques and approaches have been developed to solve this problem, including template matching, homology modeling, and more advanced methods such as deep learning. Deep learning methods, such as AlphaFold, have achieved remarkable results in the CASP (Critical Assessment of protein Structure Prediction) competition, showing their potential in predicting protein structures.

Methods for Protein Structure Prediction

Template-dependent modeling and template-free modeling are two commonly used methods for protein structure prediction.

  • Template-based modeling (TBM) : This approach relies on known protein structures from resolved protein structure databases (such as PDB) as templates. If a new protein sequence is highly similar (homologous) to a protein sequence of known structure, then this known structure can be used as a template to predict the structure of the new protein. However, this method may not work well for some proteins that do not have a similar template, such as certain membrane proteins.
  • Template-free modeling (Template-free modeling, also known as ab-initio or de novo modeling) : This method does not rely on known protein structure templates, but predicts protein structures through theoretical models and computational methods. For example, this might include simulations of the physical and chemical properties of proteins, such as mechanistic and electromagnetic models, or stochastic searches using Monte Carlo methods. However, the computational complexity of this approach is usually high, and the prediction accuracy may not be as good as template-dependent modeling.

Recently, deep learning methods (such as AlphaFold) have shown great potential in protein structure prediction; they can either use known protein structure information (when available) or predict structures from scratch, thereby combining the advantages of template-based and template-free modeling.

Old method: fragment assembly

In the past, supercomputers were used for simulation, which required enormous computing power and had a very low success rate.

Fragment assembly is a traditional protein structure prediction method, the specific process is as follows:

1. Target sequence : First, we have a protein target sequence whose structure needs to be predicted.

2. Fragment Library : Create a fragment library by extracting short sequence fragments from proteins of known structure. These fragments are typically 3 to 9 amino acids in length.

3. Profile-Profile Alignment : Then, compare the amino acid profile of the target sequence with the profile of each fragment in the library.

4. Monte Carlo Fragment Assembly : fragments are randomly selected from the fragment library using the Monte Carlo method and spliced together to construct a preliminary model of the protein. Repeating this process generates a large number of protein models.

5. Knowledge-based potentials : Use a statistically based scoring function (such as Rosetta's scoring function) to evaluate and rank these models, and select the model with the highest score as the best model.

6. Physics-based atomic refinement : finally, a physics-based energy-minimization process refines the selected model, including optimizing amino acid side-chain positions and making small adjustments to the protein backbone.

A major problem with this approach is that it is computationally inefficient and the accuracy is limited by the library of known structural fragments.

New Strategy

This new strategy includes the following steps:

  1. Multiple Sequence Alignment (MSA) : multiple sequence alignment is a sequence-alignment-based method used to determine the similarity among a set of protein or nucleic acid sequences. By aligning many sequences, we can identify shared, evolutionarily conserved regions that have remained unchanged during evolution. These conserved regions are often critical for protein function and structure.
  2. Co-evolution Analysis : a method for finding relationships between amino acids within a protein, i.e., a change at one amino acid position may be accompanied by a change at another position. This co-evolution information can be used to predict the three-dimensional structure of proteins, because co-evolving pairs of amino acids tend to be close together in space.
  3. Deep neural network prediction interaction matrix : The input data includes the above-mentioned multiple sequence alignment and co-evolution analysis results, as well as other possible protein sequence information. The goal of this deep neural network is to predict an interaction matrix where each element represents the interaction strength of two amino acids in a protein sequence in space.
  4. Deep neural network prediction of local structure : Similarly, deep neural networks can also be used to predict local structural information of proteins, such as secondary structure (alpha helix, beta fold, etc.) and the position of each amino acid.
  5. Molecular force-field minimization : after obtaining the predicted interaction matrix (two-dimensional) and local structure information (one-dimensional), molecular simulation methods such as force-field minimization or molecular dynamics can be used to generate the 3D structure of the protein. The goal of this step is to find the protein structure that best satisfies the predicted interactions and local structural information.

This strategy provides an effective method to predict protein structure, which makes more use of deep learning and sequence evolution information than traditional methods, and thus usually leads to more accurate prediction results.

Co-evolution Analysis

The structure and function of a protein depend heavily on the sequence and chemical properties of its amino acids. If an amino acid is mutated so that its side chains grow larger, this can disturb the protein's structure or affect its function, as it could alter its interactions with neighboring amino acids .

But if another nearby amino acid is mutated at the same time, making it smaller, this might balance out the size change of the side chains and keep the protein stable. In this case, the two amino acids exhibit the property of coevolution, that is, their changes are coordinated.

This co-evolution phenomenon can help us understand the three-dimensional structure and function of proteins, because co-evolving amino acid pairs are usually in close contact in the three-dimensional structure, and jointly participate in the formation of the active site or domain of the protein. Therefore, by analyzing the co-evolution patterns of multiple protein sequences, we can predict protein structures or functional sites.

The concept of co-evolution may be rather abstract, we understand it through a simple example.

Suppose we have an extremely simple organism with three amino acids. These three amino acids are arranged in a chain to form a protein. We named these three amino acids as A, B, and C, respectively.

In an ideal environment, the ideal forms of these three amino acids are large, medium, and small respectively. That is to say, the best form of the protein is big A, medium B, and small C. This combination allows the organism to adapt best to its environment.

However, organisms mutate during evolution, and amino acids may change form. For example, A might mutate into medium, or B mutate into large.

Now, if A mutates to medium, then B, which is adjacent to A, will be stressed because it is now the same size as A. In order to maintain the optimal shape of the protein, B may also be mutated to a smaller size. In this way, A and B have co-evolved by co-adapting to environmental stress.

This phenomenon of co-evolution may be difficult to observe if we only look at one organism. But if we look at many organisms with similar proteins, we can see that mutations at certain amino acid positions are correlated. For example, we might observe that whenever A becomes medium, B also tends to become small.

This is the basic concept of co-evolution. In real organisms and proteins, the situation is much more complicated, since a protein may consist of hundreds or thousands of amino acids, each of which may appear in 20 different forms. But even in this complex environment, we can still use statistical analysis to discover patterns of co-evolution to predict protein structure or function.

Here is an introduction to co-evolution analysis of multiple sequence alignments (MSA, Multiple Sequence Alignment). Co-evolution analysis is widely used in bioinformatics, especially in the fields of protein structure prediction and functional site prediction.

Here graphical models (Markov random fields, MRFs) are used to model the MSA. Each position in the amino acid sequence is treated as a random variable, and the relationship between two positions is described by a weight function.

For a given amino acid sequence $S$, its probability $P(S)$ under this graphical model is built from single-site terms $\varphi(x_i)$, giving the probability of the amino acid at each position, and pairwise terms $w(x_i, x_j)$, capturing the correlation or covariation of two positions in the sequence (roughly, $P(S) \propto \prod_i \varphi(x_i)\prod_{i<j} w(x_i, x_j)$).

The parameters $\varphi$ and $w$ of this model can be estimated by maximum likelihood or pseudo-likelihood methods. A special case of this approach is the Gaussian graphical model, which assumes the random variables follow a multivariate normal distribution.

In general, the goal of this method is to predict the structural or functional sites of proteins by analyzing the co-evolution patterns of protein sequences.

In protein structure prediction, it is usually necessary to collect and utilize some protein features and labels. These features and labels can help us understand the properties of proteins and predict their future behavior.

  1. Sequential Features: This refers to information in protein sequences, such as conservation profiles and predicted local structures. Conservation refers to the degree of variation of amino acids at a certain position in different species or different protein families. The predicted local structure refers to the type of secondary structure that each amino acid in the protein may form (eg, α-helix, β-sheet).
  2. Pairwise Features: This refers to features based on the relationship between two amino acids, including mutual information, direct co-evolution, contact potential, etc. Mutual information and coevolution are methods to quantify the dependence of amino acid changes at two sites, and contact potential is to predict the probability that physical contact between them may occur based on the amino acid type.
  3. Labels: This is the target we want to predict, such as the distance between amino acids. This distance is usually divided into several intervals, such as less than 8 angstroms, 8-15 angstroms, and greater than 15 angstroms. It is worth noting that this label distribution is often unbalanced, as amino acid pairs that are far apart (greater than 15 Å) and those that are closer together (less than 8 Å) are usually more numerous in proteins than intermediate distances of amino acid pairs.

After collecting these features and labels, we can use some machine learning or deep learning methods to train a model that can learn from these features and predict the structure or behavior of proteins.
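As a tiny illustration of one such pairwise feature, here is a sketch that computes the mutual information between two columns of a toy multiple sequence alignment (the data and function name are made up):

```python
import math
from collections import Counter

# each string is one aligned sequence; columns are alignment positions
msa = ["ARND", "ARNE", "AKND", "AKNE", "ARQD"]

def mutual_information(msa, i, j):
    n = len(msa)
    pi = Counter(s[i] for s in msa)                 # marginal counts at column i
    pj = Counter(s[j] for s in msa)                 # marginal counts at column j
    pij = Counter((s[i], s[j]) for s in msa)        # joint counts
    mi = 0.0
    for (a, b), c in pij.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

print(mutual_information(msa, 1, 3))   # larger values suggest the two positions co-vary
```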

Towards An End-to-End Workflow

The "Towards An End-to-End Workflow" workflow introduces an additional step, the "Templates" step.

The Templates step usually involves using known protein structures as templates that are somewhat similar to the sequence of the target protein. These templates can provide additional structural information that can help improve the accuracy of predictions. This approach is particularly useful for protein sequences for which good templates can be found.

In this workflow, templates are used not only to initialize predictions, but also to train deep neural networks. This means that the network not only learns information from multiple sequence alignments and co-evolution analyses, but also learns structural information from templates. This strategy of combining data-driven and knowledge-driven approaches has the potential to improve the accuracy and robustness of predictions.

It is important to note that template-guided predictions can be highly sensitive to template quality and similarity to the target sequence, and are not suitable for proteins for which no suitable template can be found. Furthermore, this approach may require more complex training steps and more computing resources.

AlphaFold2 architecture

AlphaFold2 is a revolutionary protein structure prediction method developed by DeepMind. It performed exceptionally well at the 2020 Critical Assessment of protein Structure Prediction (CASP14), marking a major breakthrough on the protein folding problem.

The method is end-to-end, meaning it takes as input a sequence of amino acids (i.e., a protein sequence) and outputs a predicted three-dimensional structure of the protein. This differs from many earlier methods, which often required a series of intermediate steps and manual adjustments.

In the AlphaFold2 framework, multiple sequence alignment is performed first and templates are searched. This information is then fed into a deep, Transformer-like network. The network includes multiple modules, such as the Evoformer module and the Structure module, which process the sequence and template information through many attention mechanisms and residual connections.

The output of the network is a 3D representation describing the expected position of each amino acid in space. This representation is then used to reconstruct the three-dimensional structure of the protein.

A major innovation of AlphaFold2 is the introduction of template and multiple sequence alignment information, which are processed through a powerful Transformer-like network. This allows the network to learn complex sequence-structure relationships and improve the accuracy of predictions.

However, while AlphaFold2 has achieved notable achievements, there are still some challenges. For example, the prediction of large proteins, as well as the prediction of protein dynamics and protein-protein interactions, remain difficult problems.

The key for the graph neural network is to learn the relationships between residues, compute an embedding for each residue that reflects its three-dimensional structural context, and predict the final three-dimensional structure.

In the case of protein structure prediction, GNN can be used to simulate the process of protein folding, where nodes represent amino acids and edges represent interactions between amino acids. GNNs can capture local and global structural features of proteins, as well as complex interactions between amino acids.

For example, GNNs can be used to predict the distance between amino acids, which is a key step in protein 3D structure prediction. GNNs can capture long-distance dependencies between amino acids by propagating information in protein graphs.
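
The following is a minimal, illustrative sketch (in PyTorch, not AlphaFold2's actual code) of the idea described above: one round of message passing over a residue graph followed by a pairwise head that predicts inter-residue distances. All module names and dimensions are assumptions for illustration.

```python
# Toy sketch: message passing over residues, then a pairwise distance head.
import torch
import torch.nn as nn

class ResidueGNN(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.msg = nn.Linear(dim, dim)           # turns neighbor features into messages
        self.upd = nn.GRUCell(dim, dim)          # updates each residue from aggregated messages
        self.dist_head = nn.Linear(2 * dim, 1)   # predicts one distance per residue pair

    def forward(self, h, adj):
        # h: (r, dim) residue embeddings; adj: (r, r) neighborhood matrix
        messages = adj @ self.msg(h)             # aggregate messages from neighbors
        h = self.upd(messages, h)                # update residue embeddings
        r = h.size(0)
        pair = torch.cat([h.unsqueeze(1).expand(r, r, -1),
                          h.unsqueeze(0).expand(r, r, -1)], dim=-1)
        return self.dist_head(pair).squeeze(-1)  # (r, r) predicted distances

model = ResidueGNN()
dists = model(torch.randn(10, 64), torch.eye(10))  # toy residue features and adjacency
print(dists.shape)                                  # torch.Size([10, 10])
```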

Supplement: Intensive reading of papers with Li Mu

Mr. Li Mu on Bilibili: AlphaFold 2 Paper Intensive Reading [Paper Intensive Reading]

The pipeline has three parts: the first extracts features, the second encodes, and the third decodes.

  • For the first part, there are the following inputs:
    • The target protein sequence itself, fed in directly.
    • MSA: similar protein sequences are searched in gene databases and assembled into an MSA (Multiple Sequence Alignment). The role of the MSA is to extract the co-evolution information of a protein sequence across multiple species.
    • Pairwise amino-acid features: a protein folds because of interactions between its amino acids, so one input stores pairwise relationships between the amino acids of the sequence (not the spatial distances between them, which are still unknown at this point, but other features).
    • Finally, there are additional template features: since the structures of some proteins are already known, a structural database is searched to obtain many templates, which provide information such as the spatial distances between amino acids.
    • Therefore, feature extraction yields two main kinds of features: features across different sequences and features between amino-acid pairs. These, together with a few other inputs, are fed into the encoder.
  • For the second part, the encoder takes two 3D tensors as input (see the shape sketch after this list):

    • MSA representation: size (s, r, c)
      • s means there are s protein sequences; the first is the target protein we want to predict, and the remaining s-1 are the proteins matched from the database.
      • r is the number of amino acids in the protein (multiple sequence alignment fills gaps with "_", so all rows end up the same length).
      • c means each amino acid is represented by a vector of length c (for an image this would be the number of channels per pixel; for a sentence, the embedding length of each word).
    • Pair (amino-acid pair) representation: size (r, r, c)
      • The two r dimensions index pairs of amino acids, and each pair is described by a feature vector of length c.
  • The two tensors are then fed into the Evoformer (which can be seen as a variant of the Transformer).

    • There are two differences from a standard Transformer: 1. the input is no longer a simple sequence (such as a sentence) but a two-dimensional relationship (different protein sequences vs. the same amino-acid positions); 2. the input consists of two different tensors, which have to be fused together.
    • Most of the rest is the same. Intuitively, a protein's 3D structure is driven by relationships between amino acids, and positions that are near or far apart in the sequence can both play an important role, which is exactly the flavor of attention.
  • The encoder output then contains feature representations of all the amino acids of the target protein, as well as the correlation information between amino acids. The decoder predicts the position of each amino acid from this information, giving the final output.
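
A minimal shape sketch of the two encoder inputs just described (sizes are assumed for illustration; the real feature construction is far more involved):

```python
import torch

s, r, c = 5, 100, 32                 # s aligned sequences, r residues, c channels (assumed)
msa_repr = torch.zeros(s, r, c)      # MSA representation: row 0 is the target sequence
pair_repr = torch.zeros(r, r, c)     # pair representation: one c-vector per residue pair

print(msa_repr.shape, pair_repr.shape)
# torch.Size([5, 100, 32]) torch.Size([100, 100, 32])
```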

There is also a recycling mechanism, which feeds the outputs of the encoder and decoder back as inputs to the encoder. It is a bit like an RNN: the recycled network can be viewed as roughly four times deeper, and it achieves better results.

Some differences from an RNN: the recycled copies share the same weights, and gradients are not back-propagated through the recycling loop.
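
A hedged sketch of the recycling idea: run the same network several times, feeding its own outputs back in, and only back-propagate through the final iteration. The `model` interface, the `recycled` argument, and the number of cycles are assumptions, not the paper's API.

```python
def predict_with_recycling(model, features, n_cycles=4):
    prev = None
    for i in range(n_cycles):
        out = model(features, recycled=prev)        # same weights every cycle
        if i < n_cycles - 1:
            # detach so gradients do not flow back through earlier cycles
            prev = {k: v.detach() for k, v in out.items()}
    return out
```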

Paused the video at 23:05; the internal structure of the encoder and decoder will be added later.

Excerpted from some small partners' comments on station b:

Question: First, an immature suggestion: when interpreting AI-for-science papers like this one, it would be better if Mr. Li could give a brief introduction to the scientific task itself, for example what unknown results are to be obtained from which known conditions, and what methods are currently used outside of AI to complete the task. This should help us better understand the authors' intentions behind certain choices in the paper. This is the information I have been looking for over the past few days; bioinformatics friends, please check whether there is any problem with the following:

The tertiary structure of a protein is determined by its primary structure: each protein has its own specific amino-acid sequence and therefore its own unique tertiary structure. Protein three-dimensional structure prediction means taking a protein sequence (the primary structure) as input and outputting the three-dimensional spatial coordinates of all atoms of the protein. Besides the cryo-EM mentioned in the article, current approaches to obtaining three-dimensional structure also include homology modeling based on homologous proteins. The specific steps are as follows: first select the best template 3D structure and perform sequence alignment; the first alignment is usually performed with a BLOSUM (BLOcks SUbstitution Matrix), and a second alignment (also called alignment correction) is used to construct the backbone 3D structure. Then loop modeling is performed on template-free regions or regions with relatively low similarity (the highest accuracy reaches about 12-13 residues). This is followed by side-chain reconstruction through conformational search in a backbone-dependent rotamer library. The structure is then refined and validated with various quality-assessment tools. My understanding is that the MSA input in this paper consists of the protein sequences corresponding to the templates, and the templates are the structural information corresponding to those template protein sequences.

Answer 1: MSA and templates are not the same thing. The MSA contains far more sequences than the templates do. MSA-based modeling comes from co-evolutionary analysis, i.e., in theory the 3D structure can be obtained from the MSA alone, without any template; template-based modeling is the homology-modeling idea you described. But the core of AlphaFold's accuracy is the MSA.

Answer 2: The MSA is built by searching a sequence database for sequences similar to the target in order to extract co-evolution information, i.e., contact relationships between residues at the sequence level, which guide the final structure prediction. Templates, found by searching for homologous structures based on the sequence, directly provide the relative positions of residues in three-dimensional space. MSA is more broadly applicable than templates, mainly because the protein sequence database is far larger than the database of solved protein structures.

5. Computational Drug Development

From millions of compounds, find the few worth carrying into experiments. The time and money cost of designing a drug is enormous.

The drug discovery process consists of several stages:

  1. Target identification and validation : In this phase, scientists identify molecules or pathways that play a key role in the disease process and validate their potential as drug targets.
  2. Hit identification and lead optimization : In this phase, a large library of chemicals is screened to identify "hit" compounds that interact with the target. These hits are then optimized into "lead" compounds - potential drugs that have the desired effect on the target, are safe, and can be produced efficiently.
  3. Preclinical testing : Lead compounds are tested in the laboratory and in animal models to assess their safety and efficacy prior to testing in humans.
  4. Clinical Trials : Clinical trials involve testing drugs in humans. They are usually divided into three stages. Phase 1 trials involve a small group of healthy volunteers to assess safety and dosage. Phase 2 trials involve larger patient groups and are designed to assess efficacy and side effects. Phase 3 trials involve large groups of patients to confirm effects, monitor side effects, and compare the drug with commonly used treatments.
  5. Regulatory Approval : If the results of a clinical trial are positive, drug companies can apply for regulatory approval to market the drug.

Computational drug discovery: three schemes

There are three main strategies for computer-aided drug discovery:

  1. Functional Space Simulation : Traditionally, this approach utilizes computer simulations, such as solving the Schrödinger equation, to understand and predict the physical and chemical properties of molecules, such as redox potential, solubility, and toxicity. Such theoretical approaches usually require high computing power.
  2. Virtual screening : Virtual screening is a computational strategy that screens a large library of compounds to identify candidate compounds with the highest potential to interact with a target protein or other biomacromolecule. High-throughput virtual screening generally has multiple filtering stages, including molecular docking, ADMET (pharmacokinetic) property prediction, etc., to identify the most likely drug candidates.
  3. Novel drug design : This is a relatively new field that uses optimization algorithms, evolutionary strategies, and generative models (such as variational autoencoders (VAE), generative adversarial networks (GAN), and reinforcement learning (RL)) to search chemical space and design new candidate drug molecules. The goal of novel drug design is to generate novel, potentially unexplored compounds in chemical space that may possess superior pharmacological activity and appropriate pharmacokinetic properties.

6. Virtual drug screening

Virtual screening

Virtual screening is the use of computational models to assess whether a compound is a good drug (Walters et al., 1998; McGregor et al., 2007).

Virtual Screening Model --> Prediction: OK! --> Compound --> Experiment

Virtual screening is much faster than experimental screening in the laboratory: it can test about 10^9 compounds in a day, whereas experimental screening would take years. Virtual screening is also far less expensive than experimental screening.

Virtual screening is an inherent trade-off

  • Virtual screening is limited to commercially available compounds (eg, ZINC libraries).
  • The advantage is that no compounds need to be synthesized (tests are faster).
  • Limitation 1: It sacrifices coverage; even in the best case we can only screen about 10^9 compounds.
  • Limitation 2: Traditional techniques are based on hand-designed features.

Part 1: Antibiotic discovery Antibiotics

The overuse of antibiotics has led to the growth of resistant bacteria. No new classes of antibiotics have been introduced since 1987. And because of drug resistance, we need to develop new antibiotics.

Virtual Screening for Antibiotic Discovery

  • In collaboration with the Broad Institute, we assembled a collection of 2,560 molecules and measured their growth inhibition of Escherichia coli (BW25113).

  • Why Choose Graph Neural Networks?

  • We have labels for known compounds such as Nitrocefin, Reserpine, Penicillin, and IQ-1S indicating whether or not they are antibacterial; we train on these labels and then predict the antibacterial activity of new compounds.

Graph neural networks can automatically learn feature representations from data.

Traditional approach: hand-crafted features

  • The Traditional Method: Handcrafted Features

    • Traditional approaches are based on fixed, hand-engineered molecular properties such as molecular weight and the number of heavy atoms. More complex features, like Morgan fingerprints (Rogers & Hahn 2010), enumerate all possible substructures up to a radius of 3; the result is a high-dimensional feature vector (2048 bits), with different substructures merged by hashing (see the RDKit sketch after this list).
  • Problems with traditional features

    • Traditional approaches are based on fixed, hand-engineered molecular properties. But the problem is that we don't know all the antimicrobial modes, so these hand-designed properties may miss some unknown modes.
    • Graph neural networks, by contrast, learn feature representations from data. Properties are learned automatically: given a compound, the network predicts its properties and thus whether it is a good candidate.
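
As a concrete illustration of a hand-crafted feature, here is a Morgan fingerprint computed with RDKit using the radius and bit length quoted above (the example molecule is arbitrary):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin, just as an example
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048)
print(fp.GetNumOnBits(), "substructure bits set out of", fp.GetNumBits())
```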

Graph neural network (GNN)

Use a graph to represent each compound molecule: an edge is a bond, a node is an atom, and a node's attributes describe not only the atom itself but also an embedding learned from its surroundings.

As in convolution, start from the atom type, then learn from the properties of its neighbors, and then from the properties of its neighbors' neighbors. Through propagation over the graph structure, the properties of the entire molecule are finally encoded.
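
A toy illustration (not the lecture's model) of reading a molecule as a graph with RDKit and propagating atom features to neighbors for a couple of rounds, the way a graph convolution would:

```python
import numpy as np
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")                      # ethanol, an arbitrary example
n = mol.GetNumAtoms()
adj = np.zeros((n, n))
for b in mol.GetBonds():
    i, j = b.GetBeginAtomIdx(), b.GetEndAtomIdx()
    adj[i, j] = adj[j, i] = 1.0                      # node = atom, edge = bond

h = np.array([[a.GetAtomicNum()] for a in mol.GetAtoms()], dtype=float)  # initial atom feature
for _ in range(2):                                   # two rounds of neighborhood propagation
    h = h + adj @ h                                  # each atom absorbs its neighbors' features
print(h.ravel())                                     # a sum over atoms could serve as a molecule readout
```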

Hand-crafted features and learned deep-learning features are combined to judge whether a compound has antibacterial activity.

Use GNN for virtual screening

We virtually screened more than 10^6 compounds from the Broad Drug Repurposing Center and experimentally tested the top 99 at the Broad Institute. 51 of these compounds indeed showed antibacterial activity, a hit rate of 51.5%.
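
The screening step itself amounts to scoring every library compound with the trained model and keeping the top candidates for experimental testing; a minimal sketch (the `predict_antibacterial` method and its inputs are placeholders, not the actual pipeline):

```python
def rank_library(model, smiles_list, top_k=99):
    scored = [(smi, model.predict_antibacterial(smi)) for smi in smiles_list]
    scored.sort(key=lambda pair: pair[1], reverse=True)   # highest predicted activity first
    return scored[:top_k]                                 # these go to the wet lab
```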

The compound SU3327 (renamed Halicin) turned out to be a novel and potent antibiotic. In vitro, Halicin showed strong growth inhibition of Escherichia coli, and its structure differs markedly from known antibiotics, with low similarity to existing ones.

Halicin's efficacy against drug-resistant bacteria was verified in mice: it showed strong in vivo inhibitory effects against both multidrug-resistant Acinetobacter baumannii (A. baumannii) and drug-resistant Clostridium difficile (C. difficile).

We also performed a large-scale virtual screen of 10^9 compounds from the ZINC library and successfully identified 8 compounds with inhibitory effects on Escherichia coli (E. coli) and drug-resistant Staphylococcus aureus (S. aureus).

7. New drug design

Going beyond virtual screening to create new drugs. We use generative models to try it out.

"De novo" drug design is the process of directly generating compounds with desired properties (Moon et al., 1991; Clark et al., 1995; Schneider & Fechner, 2005;). In this process, we need to solve the inverse problem: find a good drug in functional space and chemical space according to desired properties (e.g., redox potential, solubility, toxicity).

Inherent trade-offs of "de novo" drug design: whereas virtual screening is limited to commercially available compounds (e.g., the ZINC library), de novo design can in principle explore the entire chemical space. The limitation is that the designed compounds must then be synthesized, which can be difficult. Traditional "de novo" design explores the space with hand-crafted rules (e.g., genetic algorithms).

Motivation for designing new drugs : Deep learning can discover new antibiotics and COVID-19 drugs. A simple approach is to train a graph neural network to rank all compounds in a library, since this maximizes the speed of experimental validation. The problem is that the number of drug-like molecules is on the order of 10^60, and we cannot rank them all.

Graph generation for "de novo" drug design : We want to learn a distribution whose mass is concentrated around "good" molecules. We can train a generative model to directly generate "good" molecules. This approach can efficiently explore the entire chemical space (10^60 molecules). The next question is how to generate molecular graph?

Previous solution 1: sequence-based methods

  • The SMILES string is used here to represent the compound
  • Previous work used recurrent neural networks to generate molecular graphs by converting each molecule into a SMILES string (a domain-specific language).
  • But this string-based representation is very brittle: two graphs that are almost identical can have very different string representations (see the SMILES sketch after this list).
  • Previous solution: Nodes are generated one by one
  • A straightforward approach: generate graphs one node at a time (Liu et al., 2018). Molecules are usually sparse: N nodes, O(N) edges.
  • This suggests not reasoning about single nodes at all, but about whole subgraphs instead.
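
One way to see how brittle the string representation is: even a single, fixed molecular graph admits many different SMILES strings depending on where the traversal starts. A small RDKit demonstration (the molecule is arbitrary):

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # one fixed molecular graph
for _ in range(3):
    print(Chem.MolToSmiles(mol, doRandom=True))      # three different strings, same graph
print(Chem.MolToSmiles(mol))                         # the canonical form
```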

Junction tree variational autoencoder

Decompose molecules into an alternative representation using motifs: 638 motifs cover 250k graphs.

The design of the junction tree variational autoencoder is inspired by the junction tree algorithm from graphical models. This approach exploits structural patterns (motifs) commonly found in molecular graphs, breaking complex molecular structures down into smaller parts.

These small parts are called "motifs", and they stay small because the resulting tree has low treewidth. We can extract these motifs from a large number of graphs (for example, 250K graphs), forming a "motif vocabulary".

We then recombine these motifs into new molecular structures. This process covered 99.9 percent of the new graphs, showing that we can generate almost any type of molecular structure.

In this approach, molecular graphs are represented by their corresponding junction trees, which allows us to operate on graphs efficiently and to learn feature representations automatically from data.

Molecular graph represented by a junction tree

Details: hierarchical encoder & decoder

  • Layered encoder:
    • First, a graph convolution is used to operate on the junction tree of the molecular graph, generating a vector representation of each motif.
    • These motif-level vectors are then propagated down to the corresponding parts of the molecular graph, where graph convolutions run at this finer level.
    • The result is that we obtain a multi-level molecular representation that contains information from local to global.
  • Layered decoder:
    • A layered decoder is responsible for generating new molecular structures. First, it predicts the next motif to be generated.
    • It then predicts how to attach this motif to the current graph. This process proceeds step by step in a motif-by-motif manner until a complete molecular structure is generated.

This layered encoding-decoding process allows the model to capture and generate complex molecular structures while maintaining computational efficiency.
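
Conceptually, the hierarchical decoding loop looks like the sketch below (all function names are placeholders, not the authors' code): repeatedly predict the next motif and where to attach it until a stop signal is emitted.

```python
def decode_molecule(decoder, latent, max_motifs=50):
    graph = decoder.start_graph(latent)                   # begin from an empty/seed graph
    for _ in range(max_motifs):
        motif = decoder.predict_next_motif(latent, graph)
        if motif is None:                                 # stop token: molecule is complete
            break
        attachment = decoder.predict_attachment(latent, graph, motif)
        graph = graph.attach(motif, attachment)           # grow the graph one motif at a time
    return graph
```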

Motif-by-Motif vs. Node-by-Node

Training Objective: Minimize Reconstruction Loss

The motif-by-motif generation method reconstructs large molecules much better than node-by-node generation!

The reconstruction accuracy in the figure demonstrates this difference. The accuracy of motif-by-motif generation remains stable with increasing molecular size (number of atoms), while the accuracy of node-by-node generation decreases significantly as the molecular size increases.

Results: Molecular Optimization

Task: Learn to modify non-drug-like molecules into drug-like molecules. Drug-likeness was measured by the QED score (Bickerton et al., 2012).
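
For reference, the QED score is available in RDKit, so drug-likeness can be checked directly (the two example molecules are arbitrary):

```python
from rdkit import Chem
from rdkit.Chem import QED

for smi in ["CC(=O)Oc1ccccc1C(=O)O",   # aspirin: reasonably drug-like
            "CCCCCCCCCCCCCCCCCCCC"]:   # a long alkane: not drug-like
    print(smi, round(QED.qed(Chem.MolFromSmiles(smi)), 3))
```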

As shown, motif-by-motif generation improves the drug-likeness success rate more than both node-by-node and sequence-based methods. This demonstrates that a motif-by-motif approach can efficiently optimize molecules toward desired properties while preserving complex molecular structure.

8. Research Frontiers


Origin blog.csdn.net/weixin_57345774/article/details/130837390