Deep Learning for drug repurposing: methods, datasets, and applications
That review feels a bit dated to me, so I wrote my own DTI review, aimed at students who are new to DTI.
Dataset (open source)
dataset | contents (including but not limited to) | source | fields (including but not limited to) |
---|---|---|---|
BindingDB | Drug sequence, protein sequence, label (0/1) | DrugBAN-github | DTI |
BioSNAP | Drug sequence, protein sequence, label (0/1) | DrugBAN-github | DTI |
HUMAN | Drug sequence, protein sequence, label (0/1) | DrugBAN-github | DTI |
C.elegans | | | DTI |
DUD-E | | | DTI |
Representation Learning
Sequence-based
Drug representations (for molecular compounds).
(a) One-hot representation [67] of the SMILES string. The 1D representation is the text of the SMILES (Simplified Molecular Input Line Entry System) notation, which encodes topological information based on chemical-bond rules.
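As a minimal sketch of (a), a SMILES string can be one-hot encoded character by character against a fixed vocabulary. The tiny vocabulary below is hypothetical; real models build it from the training corpus.

```python
# Minimal character-level one-hot encoding of a SMILES string.
# VOCAB is a hypothetical toy vocabulary; real pipelines derive it from data.
VOCAB = ["C", "N", "O", "c", "n", "(", ")", "=", "1", "2"]
CHAR_TO_IDX = {ch: i for i, ch in enumerate(VOCAB)}

def one_hot_smiles(smiles):
    """Return one one-hot vector (length len(VOCAB)) per SMILES character."""
    vectors = []
    for ch in smiles:
        vec = [0] * len(VOCAB)
        vec[CHAR_TO_IDX[ch]] = 1
        vectors.append(vec)
    return vectors

# Benzene in SMILES: eight characters, so an 8 x 10 binary matrix.
matrix = one_hot_smiles("c1ccccc1")
```

The result is the standard input shape for 1D-CNN or RNN drug encoders: one binary row per character.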
(b) Two-dimensional (2D) representation of the molecular graph, where each substructure is associated with a predefined bit vector. A chemical fingerprint, such as a circular fingerprint, is a 2D representation of a molecule that iteratively enumerates the partial structure around each atom and then uses a hash function to convert the molecule into a binary vector. However, the generated vectors are not only high-dimensional and sparse, but may also contain "bit collisions" caused by the hash function.
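A toy sketch of this hashing step (the atom-environment strings below are hypothetical stand-ins for the substructures a real circular fingerprint like ECFP would enumerate): each environment is hashed into a fixed-length bit vector, and with a small vector two different environments can land on the same bit, which is exactly the "bit collision" problem.

```python
# Toy hashed "circular-style" fingerprint. Each atom-environment string is
# hashed to one position in a fixed-length bit vector; collisions are possible.
import hashlib

def hash_to_bit(env, n_bits):
    """Deterministically map an environment string to a bit index."""
    digest = hashlib.md5(env.encode()).hexdigest()
    return int(digest, 16) % n_bits

def toy_fingerprint(environments, n_bits=16):
    """Set one bit per environment; distinct environments may collide."""
    bits = [0] * n_bits
    for env in environments:
        bits[hash_to_bit(env, n_bits)] = 1
    return bits

# Hypothetical atom-environment strings for a small molecule.
envs = ["C(C)(O)", "O(C)", "C(C)(C)", "N(C)"]
fp = toy_fingerprint(envs)
```

In practice, libraries such as RDKit produce these fingerprints (e.g. Morgan/circular fingerprints) with much larger bit vectors, which reduces but does not eliminate collisions.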
(c) Graph neural networks (GNNs) are adopted to map a molecular graph to a vector, where atoms and bonds are denoted by nodes and edges, respectively.
In addition, Mol2vec was proposed and considered the most representative method, treating molecular substructures as “words” and compounds as “sentences”, and using Word2Vec to generate embeddings of atom identifiers. Although these methods achieve good performance, a clear disadvantage of such one- or two-dimensional representations is the loss of information on bond lengths and three-dimensional conformations, which may be important for binding details of drug targets. Therefore, 3D representation will attract more attention in the future.
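The core GNN operation from (c) can be sketched as one round of neighborhood aggregation: each atom's feature vector is updated with the average of its own and its neighbors' vectors. The graph and the 2-dimensional features below are hypothetical; real models learn the features and use trainable weight matrices between rounds.

```python
# One round of mean neighborhood aggregation on a toy molecular graph
# (a three-atom chain, e.g. C-C-O). Adjacency and features are hypothetical.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}

def aggregate(adjacency, features):
    """Return updated features: mean of each node's own and neighbors' vectors."""
    updated = {}
    for node, neighbors in adjacency.items():
        group = [features[node]] + [features[n] for n in neighbors]
        updated[node] = [sum(dim) / len(group) for dim in zip(*group)]
    return updated

new_features = aggregate(adjacency, features)
```

Stacking several such rounds lets information flow across the molecule; a final pooling (e.g. summing all node vectors) yields the whole-molecule embedding.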
Target representations.
(a) One-hot representation of amino acid sequences. Each amino acid can be encoded simply by one-hot encoding.
(b) A contact map is a two-dimensional (2D) representation of a protein: a binary matrix marking which residue pairs lie within a given distance of each other.
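A minimal sketch of how a contact map is computed, assuming C-alpha coordinates and a distance threshold (~8 Å is a common choice); the three coordinates below are hypothetical.

```python
# Build a binary contact map from residue C-alpha coordinates:
# entry (i, j) is 1 when residues i and j are within the threshold.
import math

def contact_map(coords, threshold=8.0):
    """Return an N x N binary matrix of residue-residue contacts."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist = math.dist(coords[i], coords[j])
            cmap[i][j] = 1 if dist <= threshold else 0
    return cmap

# Three hypothetical C-alpha coordinates (in angstroms).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
cmap = contact_map(coords)
```

Models such as DrugVQA consume this kind of 2D distance/contact map instead of the raw 3D structure.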
(c) Graph convolutional networks are used to learn the representation of the three-dimensional (3D) protein graph, with nodes representing the various constituent non-hydrogen atoms.
Likewise, protein sequences usually consist of 20 standard amino acids. Inspired by NLP embedding technology, ProtVec and doc2vec were further developed to generate non-overlapping 3-gram subsequences from protein sequences and use word2vec technology to pre-train their distributed representations based on skip-gram models. However, these models usually focus on learning context-independent representations. Unlike k-grams, UniRep aims to apply RNNs to learn statistical representations of proteins from unlabeled amino acid sequences that are semantically rich and have structural, evolutionary, and biophysical underpinnings.
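The ProtVec-style preprocessing step can be sketched as follows: the sequence is split into non-overlapping 3-grams from three shifted reading frames, and each 3-gram is then treated as a "word" for word2vec-style pretraining (the short sequence below is just an example).

```python
# ProtVec-style 3-gram generation: three lists of non-overlapping 3-grams,
# one per reading frame (shifts 0, 1, 2). Each 3-gram becomes a "word".
def three_grams(sequence):
    """Return three lists of non-overlapping 3-grams (frames 0, 1, 2)."""
    frames = []
    for shift in range(3):
        shifted = sequence[shift:]
        frames.append([shifted[i:i + 3]
                       for i in range(0, len(shifted) - 2, 3)])
    return frames

frames = three_grams("MKTAYIAKQR")
```

The resulting "sentences" of 3-gram "words" are what the skip-gram model is trained on to produce the distributed amino-acid-trigram embeddings.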
Network/graph-based representation learning
RDKit can easily convert SMILES strings into molecular graphs. For molecules, atoms are represented as vertices and bonds as the edges connecting them (drug graph c).
For proteins, a more natural way to represent the molecule is to encode a protein graph whose nodes are the various constituent non-hydrogen atoms, constructing a rotation-invariant representation. ProteinGCN effectively utilizes inter-atom direction and distance, and captures local structural information through a graph convolution formula (target figure c).

Compared with GNNs that mainly preserve first-order or second-order proximity, another promising technique, called network embedding, is used to learn global features. Specifically, it maps nodes, edges, and their features to vectors that preserve global attributes (such as structural information) to the maximum extent. [84] Once node representations are obtained, deep learning models can be applied to network-based tasks, including node classification, [85] node clustering, [86] and link prediction. [87]

Another important graph-based deep learning method, called probabilistic graphs, combines various neural generative models, gradient-based optimization, and neural inference techniques. Furthermore, variational autoencoders (VAEs) [88] trained on biological sequences have been shown to learn biologically meaningful representations that benefit a variety of downstream tasks. In short, a VAE is a variant of the autoencoder that provides a stochastic mapping between the input space and a latent space; this mapping is regularized during training so that the latent space can generate new data. One example of applying VAEs to protein modeling is learning a representation of bacterial luciferase; [89] the resulting continuous real-valued representation can be used to generate new functional variants of the luxA bacterial luciferase.
Drug embedding:
Drug encoder
MODEL | INPUT |
---|---|
GCN | Molecular graph |
Graph Transformer | did not find |
Transformer encoder | sequence (one-hot vector, MolTrans) |
Protein embedding:
1. k-gram (k=3) + word2vec (TransformerCPI)
Protein encoder
(GNN-based encoders are not listed again.)
MODEL | INPUT |
---|---|
CNN | Protein sequence |
Protein Bert | Protein sequence |
ESM | Protein sequence |
Model
Drug repurposing tools often aim to predict unknown drug-target or drug-disease interactions and can be classified as "target-centric" or "disease-centric" approaches.
Model | drug | target(pr) | architecture | task | year |
---|---|---|---|---|---|
Gao et al | Molecular graph | Amino acid sequence | GCN,LSTM,two-way attention mechanism | DTI | 2018 |
DeepAffinity | SMILES | Protein SPS (structural property sequence) | RNN,CNN,attention mechanism | DTA | 2019 |
GraphDTA | Molecular graph | Protein sequence | GCN,DNN | DTA | 2019 |
DeepConv-DTI | Fingerprint | Protein sequence | CNN,DNN | DTI | 2019 |
MCPINN | ECFP&Mol2Vec&SMILES | Amino acid sequence & ProtVec | DNN | CPI | 2019 |
Tsubaki et al. | Molecular graph | Amino acid sequence | GCN,CNN,attention mechanism | CPI | 2019 |
TriModel | Biomedical knowledge graphs about drugs and targets | - | Knowledge Graph Embedding | DTI | 2019 |
DrugVQA | SMILES | 2D distance map | (pr: ResNet + seq attention), (drug: Bi-LSTM + multi-head self-attention (not a transformer)), MLP | DTI | 2019 |
Rifaioglu et al. | SMILES | Protein sequence structural,evolutionary and physicochemical properties | CNN | DTA | 2020 |
MolTrans | SMILES->Substructure sequence | Protein sequence ->Substructure sequence | encoder: transformer ,fusion: CNN | DTI | 2020 |
TransformerCPI | Molecular graph | Protein sequence | (CONV1D+ GLU)Transformer encoder,transformer decoder | CPI | 2020 |
DeepDTI | |||||
ImageMol | Molecular images | - | ResNet, five proxy tasks (may act as a constraint, but a larger encoder should be better); t-SNE scatter plot of features from ImageMol's GAP layer | Drug discovery | 2022 |
MultiDTI (general) | SMILES | Protein sequence (also directly encoded between drug, target, disease, and side effect) | CNN,MLP | DTI | 2021 |
MOVE | SMILES | Protein sequence (also drug, target, disease, side effect; the last two are directly encoded) | CNN,GCN,MLP,graph attention,contrastive learning | DTI | 2022 |
CLOOME | SMILES->Morgan fingerprints | Molecular image | descriptor-based fully-connected networks,resnet,continuous modern Hopfield networks,contrastive learning | Drug discovery | 2022 ICLR workshop |
BridgeDPI(war) | SMILES | Protein sequence | (pr: k-mer +seq CNN) (drug: fp features+seq CNN), GNN (super node), MLP | DPI | 2022 |
AttentionSiteDTI(war) | SMILES->bidirectional graph | 3D PDB data->binding site->graph (each atom is a node) | TAGCN,Bi-LSTM,self-attention,MLP | DTI | 2022 |
DrugBAN | SMILES->Molecular graph | Protein sequence | GCN,CNN,biattention | DTI | 2023 Nature MI |
The inputs to the algorithms are quite different.
Interpretability
Biological and medical algorithms generally require high interpretability.
1. attention map (AttentionSiteDTI)
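A sketch of how an attention map is read off for interpretability (the AttentionSiteDTI paper uses learned transformations; here the vectors are hypothetical raw features): the softmax of query-key dot products gives, for each drug substructure, a weight distribution over protein residues, and the largest weights mark the residues the model "attends" to.

```python
# Toy attention map: row i holds the attention weights of drug
# substructure i over protein residues. All vectors are hypothetical.
import math

def softmax(scores):
    """Numerically plain softmax over a list of scores."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(drug_vecs, protein_vecs):
    """Dot-product scores per (substructure, residue) pair, softmaxed per row."""
    rows = []
    for d in drug_vecs:
        scores = [sum(a * b for a, b in zip(d, p)) for p in protein_vecs]
        rows.append(softmax(scores))
    return rows

drug = [[1.0, 0.0], [0.0, 1.0]]      # two drug substructure vectors
protein = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]  # three residue vectors
amap = attention_map(drug, protein)
```

Visualizing such a matrix as a heatmap is what lets these models point at putative binding-site residues.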
Databases (including but not limited to those in the review)
These are all databases, so you have to build your own datasets from them; check whether the datasets used by other papers are open source.
DATABASE | DESCRIBE |
---|---|
BindingDB | There are detailed drug information and corresponding targets. V5.1.7 includes 13791 drug entries (DTI) |
KEGG (Kyoto Encyclopedia of Genes and Genomes) | Integrated database containing large-scale molecular datasets from genes, proteins, biological pathways, and human diseases. |
PubChem | A database of chemical molecules and their activities in biological assays, including 1.1 million compounds, 271 million substances, and 297 million bioactivities. It provides a variety of molecular information, including chemical structure and physical properties, biological properties, bioactivity, safety and toxicity information, patents, documents, citations, etc. |
CCLE | Useful for anti-cancer drugs |
ChemDB | The chemical structure and molecular properties are provided, and the 3D structure of the molecule is also predicted. |
CTD (Comparative Toxicogenomics Database) | CTD provides manually curated information on chemical-gene (or protein) interactions, chemical-disease relationships, and gene-disease relationships. |
DGIdb | DTI mined from 30 sources including DrugBank, PharmGKB, Chembl, DrugTarget Commons, Therapeutic Target Database |
DrugBank | Combine drug data information (chemical, pharmacological, pharmaceutical) and drug target information (sequence, structure, pathway) |
DrugCentral | Provides active chemical entities and drug mode of action |
DTC(Drug Target Commons) | DTC collates bioactivity data and protein classification into superfamilies, clinical stages and adverse reactions and disease indications |
DTP(Drug Target Profiler) | DTP contains drug target bioactivity data and realizes network visualization. It also contains the cell-based drug response map of the drug and its clinical phase information. |
GLIDA | Contains DTI for G-protein-coupled receptors (GPCRs) |
GtopDB | Contains quantitative bioactivity data for approved drugs and compounds under investigation |
Pathway Commons | Includes biochemical reactions, complex assembly, and physical interactions, involving proteins, DNA, RNA, small molecules, and complexes |
PharmGKB | Contains comprehensive data on genetic variants underlying clinical and researcher drug responses |
STITCH | Stores known and predicted interactions of chemicals and proteins, covering 9,643,763 proteins from 2,031 organisms |
Supertarget | For analysis of DTI and drug side effects |
BioSNAP | DTI |
HUMAN | DTI |
TTD(Therapeutic Target Database) | Provides information on known and under-explored therapeutic protein and nucleic acid targets, targeted diseases, pathways, and corresponding drug information for each target. |
AOPEDF | Collects physical DTI from DrugBank, TTD, and PharmGKB, uses bioactivity data to extract DTI from ChEMBL and BindingDB, and takes the chemical structure of each drug in SMILES format from DrugBank. |