DTI Overview (Updating)

Deep Learning for drug repurposing: methods, datasets, and applications

The review above feels a bit dated, so I wrote my own DTI overview, aimed at students who are new to DTI.

Datasets (open source)

| Dataset | Contents (including but not limited to) | Source | Field (including but not limited to) |
| --- | --- | --- | --- |
| BindingDB | Drug sequence, protein sequence, label (0/1) | DrugBAN-github | DTI |
| BioSNAP | Drug sequence, protein sequence, label (0/1) | DrugBAN-github | DTI |
| HUMAN | Drug sequence, protein sequence, label (0/1) | DrugBAN-github | DTI |
| C.elegans | | | DTI |
| DUD-E | | | DTI |

Representation Learning

Sequence-based


Drug representations (for molecular compounds).

(a) One-hot representation [67] of the SMILES string. The 1D representation is the text form of SMILES (Simplified Molecular Input Line Entry System), which encodes topological information based on chemical bonding rules.
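
To make (a) concrete, here is a minimal sketch of character-level one-hot encoding for a SMILES string; the vocabulary and maximum length below are illustrative assumptions, not values taken from any particular paper.

```python
# Minimal sketch: character-level one-hot encoding of a SMILES string.
# The vocabulary and maximum length are illustrative, not from a specific model.
import numpy as np

SMILES_VOCAB = ['C', 'N', 'O', 'S', 'c', 'n', 'o', 's', '(', ')', '=', '#',
                '1', '2', '3', '[', ']', 'H', '+', '-']
CHAR_TO_IDX = {ch: i for i, ch in enumerate(SMILES_VOCAB)}
MAX_LEN = 100  # sequences are padded/truncated to a fixed length

def one_hot_smiles(smiles: str) -> np.ndarray:
    """Return a (MAX_LEN, vocab_size) one-hot matrix for a SMILES string."""
    mat = np.zeros((MAX_LEN, len(SMILES_VOCAB)), dtype=np.float32)
    for pos, ch in enumerate(smiles[:MAX_LEN]):
        idx = CHAR_TO_IDX.get(ch)
        if idx is not None:      # unknown characters are left as all-zero rows
            mat[pos, idx] = 1.0
    return mat

print(one_hot_smiles('CC(=O)Oc1ccccc1C(=O)O').shape)  # aspirin -> (100, 20)
```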

(b) Two-dimensional (2D) representation of the molecular graph, where each substructure is associated with a predefined bit vector. A chemical fingerprint, such as a circular fingerprint, is a 2D representation of a molecule that iteratively enumerates the substructures around each atom and then uses a hash function to convert the molecule into a binary vector. However, the generated vectors are not only high-dimensional and sparse, but may also contain "bit collisions" due to the hash function.
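
As a rough illustration of the circular-fingerprint idea in (b), the sketch below computes a Morgan (ECFP-like) fingerprint with RDKit; radius=2 and 2048 bits are common defaults I chose, not values prescribed by the text, and the fixed-length hashing is exactly where the bit collisions come from.

```python
# Sketch: circular (Morgan/ECFP-like) fingerprint with RDKit.
# radius=2 and nBits=2048 are common choices, not mandated by the review.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fingerprint(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    # Substructures around each atom are hashed into a fixed-length bit vector,
    # which is why distinct substructures can collide on the same bit.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

print(morgan_fingerprint('CC(=O)Oc1ccccc1C(=O)O').sum())  # number of set bits for aspirin
```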

(c) A graph neural network (GNN) is adopted to transform the molecular graph into a vector, with atoms and bonds denoted by nodes and edges, respectively.

In addition, Mol2vec, often considered the most representative of these methods, treats molecular substructures as "words" and compounds as "sentences", and uses Word2Vec to generate embeddings of atom identifiers. Although these methods achieve good performance, a clear disadvantage of such 1D or 2D representations is the loss of information on bond lengths and three-dimensional conformations, which can be important for the binding details of drug targets. Therefore, 3D representations are likely to attract more attention in the future.


Target representations.

(a) One-hot representation of the amino acid sequence. Each amino acid can be encoded simply with one-hot encoding.

(b) A contact map is a two-dimensional (2D) representation of the protein, indicating which residue pairs lie within a given distance of each other.
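
A minimal sketch of how such a contact map can be built from C-alpha coordinates; the 8 Å cutoff is a commonly used but assumed threshold.

```python
# Sketch: building a binary contact map from C-alpha coordinates.
# The 8 A threshold is a common cutoff, an assumption rather than a fixed standard.
import numpy as np

def contact_map(ca_coords: np.ndarray, threshold: float = 8.0) -> np.ndarray:
    """ca_coords: (N, 3) array of C-alpha coordinates; returns an (N, N) 0/1 matrix."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)          # pairwise Euclidean distances
    return (dist < threshold).astype(np.int8)     # 1 where residues are "in contact"

coords = np.random.rand(50, 3) * 30               # toy coordinates for 50 residues
print(contact_map(coords).shape)                  # (50, 50)
```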

(c) A graph convolutional network was used to learn the representation of the three-dimensional (3D) protein graph, with nodes representing the various constituent non-hydrogen atoms.

Likewise, protein sequences usually consist of 20 standard amino acids. Inspired by NLP embedding techniques, ProtVec and doc2vec were developed to generate non-overlapping 3-gram subsequences from protein sequences and to pre-train distributed representations of them with word2vec's skip-gram model. However, these models usually learn context-independent representations. Unlike k-gram approaches, UniRep applies RNNs to learn statistical representations of proteins from unlabeled amino acid sequences that are semantically rich and have structural, evolutionary, and biophysical underpinnings.
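
A small sketch of the ProtVec-style pipeline described above: split each protein sequence into shifted, non-overlapping 3-grams and pre-train skip-gram embeddings. It assumes gensim's Word2Vec (4.x API); the sequences and vector size are toy choices.

```python
# Sketch of the ProtVec-style idea: protein sequences -> 3-gram "sentences" ->
# skip-gram embeddings via gensim's Word2Vec (gensim 4.x API assumed).
from gensim.models import Word2Vec

def to_3grams(seq):
    """Return three shifted 'sentences' of non-overlapping 3-grams for one sequence."""
    return [[seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)] for offset in range(3)]

sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDE"]
corpus = [sentence for seq in sequences for sentence in to_3grams(seq)]

# sg=1 selects the skip-gram model, as in ProtVec; vector_size is illustrative.
model = Word2Vec(sentences=corpus, vector_size=100, window=5, sg=1, min_count=1)
print(model.wv["MKT"].shape)  # (100,)
```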

Network/graph-based representation learning

RDKit can easily convert SMILES strings into molecular graphs. For molecules, we can represent atoms as vertices connected by edges corresponding to the bonds (drug figure, panel c).
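
A minimal sketch of that conversion, using RDKit to turn a SMILES string into atom features plus an edge list; the single atomic-number feature is a deliberate simplification of the richer feature sets real models use.

```python
# Sketch: converting a SMILES string to a simple molecular graph with RDKit
# (atoms as nodes, bonds as edges). The feature choice here is minimal on purpose.
import numpy as np
from rdkit import Chem

def smiles_to_graph(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    # Node features: atomic number of each atom (real models use richer features).
    node_feats = np.array([atom.GetAtomicNum() for atom in mol.GetAtoms()], dtype=np.int64)
    # Edge list: each bond contributes both directions for an undirected graph.
    edges = []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edges.extend([(i, j), (j, i)])
    edge_index = np.array(edges, dtype=np.int64).T  # shape (2, num_edges)
    return node_feats, edge_index

nodes, edge_index = smiles_to_graph('CC(=O)Oc1ccccc1C(=O)O')
print(nodes.shape, edge_index.shape)  # (13,) (2, 26) for aspirin
```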

For proteins, a more natural representation is a protein graph whose nodes are the constituent non-hydrogen atoms, combined with a rotation-invariant encoding. ProteinGCN effectively utilizes inter-atomic directions and distances and captures local structural information through graph convolutions (target figure, panel c).

Compared with GNNs that mainly preserve first- or second-order proximity, another promising technique, network embedding, is used to learn global features. Specifically, it maps nodes, edges and their features to vectors while preserving global attributes (such as structural information) as far as possible. [84] Once node representations are obtained, deep learning models can be applied to network-based tasks, including node classification, [85] node clustering [86] and link prediction. [87]

Another important graph-based deep learning approach, probabilistic graphical modeling, combines neural generative models, gradient-based optimization and neural inference techniques. Furthermore, variational autoencoders (VAEs) [88] trained on biological sequences have been shown to learn biologically meaningful representations that benefit a variety of downstream tasks. In short, a VAE is a variant of the autoencoder that provides a stochastic mapping between the input space and a latent space. This mapping is regularized during training so that the latent space can generate new data. An example of applying VAEs to protein modeling is learning a representation of bacterial luciferase; [89] the resulting continuous real-valued representation can be used to generate new functional variants of the luxA bacterial luciferase.
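
To make the VAE idea more concrete, here is a compact PyTorch sketch of a sequence VAE with the reparameterized stochastic mapping and KL regularization described above; the architecture and layer sizes are arbitrary assumptions, not the luciferase model of [89].

```python
# Minimal sketch of a sequence VAE in PyTorch, illustrating the stochastic mapping
# between input and latent space. Layer sizes and sequence length are arbitrary.
import torch
import torch.nn as nn

class SeqVAE(nn.Module):
    def __init__(self, seq_len=100, vocab=20, latent=32, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(seq_len * vocab, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.decoder = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                     nn.Linear(hidden, seq_len * vocab))
        self.seq_len, self.vocab = seq_len, vocab

    def forward(self, x):                        # x: (batch, seq_len, vocab) one-hot
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        logits = self.decoder(z).view(-1, self.seq_len, self.vocab)
        return logits, mu, logvar

def vae_loss(logits, x, mu, logvar):
    recon = nn.functional.cross_entropy(logits.transpose(1, 2), x.argmax(-1))
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # regularizes the latent space
    return recon + kl

x = nn.functional.one_hot(torch.randint(0, 20, (4, 100)), 20).float()  # toy one-hot sequences
logits, mu, logvar = SeqVAE()(x)
print(vae_loss(logits, x, mu, logvar).item())
```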

Drug

embedding:

Drug encoder

| Model | Input |
| --- | --- |
| GCN | Molecular graph |
| Graph Transformer | (no reference found yet) |
| Transformer encoder | Sequence (one-hot vector, MolTrans) |
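
For the GCN row above, a minimal drug-encoder sketch using PyTorch Geometric (an assumed dependency); the 78-dimensional atom features and two-layer depth are illustrative, not the exact published GraphDTA architecture.

```python
# Sketch of a GCN drug encoder over a molecular graph (PyTorch Geometric assumed).
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class DrugGCNEncoder(nn.Module):
    def __init__(self, in_dim=78, hidden=128, out_dim=128):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, out_dim)

    def forward(self, x, edge_index, batch):
        # Two rounds of message passing over atoms, then mean-pool to a molecule embedding.
        h = torch.relu(self.conv1(x, edge_index))
        h = torch.relu(self.conv2(h, edge_index))
        return global_mean_pool(h, batch)          # (num_molecules, out_dim)

x = torch.randn(5, 78)                             # 5 atoms with 78-dim toy features
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3, 3, 4],
                           [1, 0, 2, 1, 3, 2, 4, 3]])
batch = torch.zeros(5, dtype=torch.long)           # all atoms belong to molecule 0
print(DrugGCNEncoder()(x, edge_index, batch).shape)  # torch.Size([1, 128])
```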

Protein

embedding:
1. k-gram (k=3) + word2vec (TransformerCPI)

Protein encoder

(GNN-based encoders are covered above and not repeated here.)

| Model | Input |
| --- | --- |
| CNN | Protein sequence |
| Protein Bert | Protein sequence |
| ESM | Protein sequence |
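
For the CNN row, a small sketch of a 1-D convolutional protein-sequence encoder in the DeepDTA/GraphDTA spirit; the vocabulary size, kernel sizes and embedding dimension are my own illustrative choices.

```python
# Sketch of a 1-D CNN protein-sequence encoder; hyperparameters are illustrative.
import torch
import torch.nn as nn

class ProteinCNNEncoder(nn.Module):
    def __init__(self, vocab_size=26, embed_dim=128, out_dim=128, max_len=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.Sequential(
            nn.Conv1d(embed_dim, 32, kernel_size=8), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=8), nn.ReLU(),
            nn.Conv1d(64, out_dim, kernel_size=8), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),                    # global max pooling over the sequence
        )

    def forward(self, tokens):                          # tokens: (batch, max_len) integer codes
        h = self.embed(tokens).transpose(1, 2)          # (batch, embed_dim, max_len)
        return self.convs(h).squeeze(-1)                # (batch, out_dim)

enc = ProteinCNNEncoder()
print(enc(torch.randint(1, 26, (2, 1000))).shape)       # torch.Size([2, 128])
```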

Model

Drug repurposing tools often aim to predict unknown drug-target or drug-disease interactions and can be classified as "target-centric" or "disease-centric" approaches.

| Model | Drug input | Target (protein) input | Architecture | Task | Year |
| --- | --- | --- | --- | --- | --- |
| Gao et al. | Molecular graph | Amino acid sequence | GCN, LSTM, two-way attention mechanism | DTI | 2018 |
| DeepAffinity | SMILES | Protein SPS (structural property sequence) | RNN, CNN, attention mechanism | DTA | 2019 |
| GraphDTA | Molecular graph | Protein sequence | GCN, DNN | DTA | 2019 |
| DeepConv-DTI | Fingerprint | Protein sequence | CNN, DNN | DTI | 2019 |
| MCPINN | ECFP & Mol2Vec & SMILES | Amino acid sequence & ProtVec | DNN | CPI | 2019 |
| Tsubaki et al. | Molecular graph | Amino acid sequence | GCN, CNN, attention mechanism | CPI | 2019 |
| Trimodel | Biomedical knowledge graphs about drugs and targets | - | Knowledge graph embedding | DTI | 2019 |
| DrugVQA | SMILES | 2D distance map | Protein: ResNet + sequential attention; drug: Bi-LSTM + multi-head self-attention (not a Transformer); MLP | DTI | 2019 |
| Rifaioglu et al. | SMILES | Protein sequence (structural, evolutionary and physicochemical properties) | CNN | DTA | 2020 |
| MolTrans | SMILES -> substructure sequence | Protein sequence -> substructure sequence | Encoder: Transformer; fusion: CNN | DTI | 2020 |
| TransformerCPI | Molecular graph | Protein sequence | (Conv1D + GLU) Transformer encoder, Transformer decoder | CPI | 2020 |
| DeepDTI | | | | | |
| ImageMol | Molecular images | - | ResNet, five proxy tasks (may act as a constraint, but a larger encoder should do better; t-SNE scatter plot of features from ImageMol's GAP layer) | Drug discovery | 2022 |
| MultiDTI (general) | SMILES | Protein sequence (drug, target, disease and side effect are also encoded directly) | CNN, MLP | DTI | 2021 |
| MOVE | SMILES | Protein sequence (also drug, target, disease, side effect; the last two are encoded directly) | CNN, GCN, MLP, graph attention, contrastive learning | DTI | 2022 |
| CLOOME | SMILES -> Morgan fingerprints | Molecular image | Descriptor-based fully-connected networks, ResNet, continuous modern Hopfield networks, contrastive learning | Drug discovery | 2022 (ICLR workshop) |
| BridgeDPI | SMILES | Protein sequence | Protein: k-mer + sequence CNN; drug: fingerprint features + sequence CNN; GNN (super nodes), MLP | DPI | 2022 |
| AttentionSiteDTI | SMILES -> bidirectional graph | 3D PDB data -> binding site -> graph (each atom is a node) | TAGCN, Bi-LSTM, self-attention, MLP | DTI | 2022 |
| DrugBAN | SMILES -> molecular graph | Protein sequence | GCN, CNN, bilinear attention | DTI | 2023 (Nature MI) |

The inputs to the algorithms are quite different.

Interpretability

Biological and medical applications generally require highly interpretable algorithms.
1. Attention maps (e.g., AttentionSiteDTI); see the sketch below.
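
A quick sketch of how such an attention map can be visualized; the weights here are random placeholders standing in for what a trained AttentionSiteDTI-style model would output.

```python
# Sketch: visualizing a drug-target attention map as a heatmap.
# The attention weights are random placeholders, not real model output.
import numpy as np
import matplotlib.pyplot as plt

attn = np.random.rand(10, 25)            # e.g. 10 drug substructures x 25 protein residues
attn /= attn.sum(axis=1, keepdims=True)  # row-normalize as attention weights

plt.imshow(attn, aspect='auto', cmap='viridis')
plt.xlabel('Protein residue / binding-site node')
plt.ylabel('Drug substructure / atom')
plt.colorbar(label='Attention weight')
plt.title('Drug-target attention map')
plt.savefig('attention_map.png')
```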


Databases (including but not limited to those in the review)

These are all databases, so you have to build your own datasets from them; check whether the datasets used in other papers are open source.

| Database | Description |
| --- | --- |
| BindingDB | Detailed drug information and corresponding targets; v5.1.7 includes 13,791 drug entries (DTI) |
| KEGG (Kyoto Encyclopedia of Genes and Genomes) | Integrated database containing large-scale molecular data sets covering genes, proteins, biological pathways and human diseases |
| PubChem | Database of chemical molecules and their activities in biological assays, including 1.1 million compounds, 271 million substances, and 297 million bioactivities; provides a variety of molecular information, including chemical structure and physical properties, biological properties and activity, safety and toxicity information, patents, documents, citations, etc. |
| CCLE | Useful for anti-cancer drug studies |
| ChemDB | Provides chemical structures and molecular properties; 3D structures of molecules are also predicted |
| CTD (Comparative Toxicogenomics Database) | Provides manually curated information on chemical-gene/protein interactions and chemical-disease and gene-disease relationships |
| DGIdb | DTIs mined from 30 sources including DrugBank, PharmGKB, ChEMBL, Drug Target Commons and the Therapeutic Target Database |
| DrugBank | Combines drug data (chemical, pharmacological, pharmaceutical) with drug target information (sequence, structure, pathway) |
| DrugCentral | Provides active chemical entities and drug mode of action |
| DTC (Drug Target Commons) | Collates bioactivity data together with protein superfamily classification, clinical stages, adverse reactions and disease indications |
| DTP (Drug Target Profiler) | Contains drug-target bioactivity data and supports network visualization; also includes cell-based drug response maps and clinical-phase information for drugs |
| GCLIDA | Contains DTIs for G-protein-coupled receptors (GPCRs) |
| GtopDB | Contains quantitative bioactivity data for approved drugs and compounds under investigation |
| Pathway Commons | Includes biochemical reactions, complex assembly and physical interactions involving proteins, DNA, RNA, small molecules and complexes |
| PharmGKB | Contains comprehensive data on genetic variants underlying clinical and research drug responses |
| STITCH | Stores known and predicted interactions between chemicals and proteins, covering 9,643,763 proteins from 2,031 organisms |
| SuperTarget | For analysis of DTIs and drug side effects |
| BioSNAP | DTI |
| HUMAN | DTI |
| TTD (Therapeutic Target Database) | Provides information on known and under-explored therapeutic protein and nucleic acid targets, targeted diseases, pathways, and the corresponding drugs for each target |
| AOPEDF | Collects physical DTIs from DrugBank, TTD and PharmGKB, extracts DTIs from bioactivity data in ChEMBL and BindingDB, and takes the chemical structure of each drug in SMILES format from DrugBank |


Origin blog.csdn.net/qq_52038588/article/details/133905289