IJCAI2023 | A Systematic Survey of Chemical Pre-trained Models (Review of Pre-trained Models of Chemical Small Molecules)

Survey resources (being updated; original materials provided): GitHub - junxia97/awesome-pretrain-on-molecules: [IJCAI 2023 survey track] A curated list of resources for chemical pre-trained models

Reference material: IJCAI 2023 | A survey of pre-trained models for chemical small molecules

The survey covers molecular descriptors, encoder architectures, pre-training strategies, and applications.

Problems with using DNNs directly: (1) Scarcity of labeled data: task-specific labels for molecules can be extremely scarce, because labeling molecular data often requires expensive wet-lab experiments; (2) Poor out-of-distribution generalization: in many real-world settings, models must handle molecules with different sizes or functional groups than those seen during training, which demands out-of-distribution generalization.

This has led to increased focus on chemical pre-trained models (CPMs), which learn universal molecular representations from large sets of unlabeled molecules and are then fine-tuned on specific downstream tasks.

1. Molecular Descriptors and Encoders

Fingerprints (FP). Describe the presence or absence of specific substructures in a molecule with a binary string.
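
For illustration, a minimal RDKit sketch of computing such a binary substructure fingerprint (a Morgan/ECFP-style fingerprint; the molecule and parameters are arbitrary examples, not tied to any specific paper in the survey):

```python
# Compute a binary Morgan fingerprint with RDKit; each bit flags the presence
# of a hashed circular substructure around some atom.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
bits = fp.ToBitString()  # e.g. "0010...", 1 = substructure present
print(bits.count("1"), "bits set out of", len(bits))
```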

Sequences. The most commonly used sequence format is SMILES, which writes a molecule as a linear string of characters.

2D graphs. Atoms as nodes and bonds as edges.
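
A minimal RDKit sketch of converting a SMILES string into such a graph (the node-feature/edge-list layout below is one common convention, not a prescribed format):

```python
# Build a simple 2D molecular graph: atoms as nodes, bonds as edges.
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")  # ethanol
node_features = [atom.GetAtomicNum() for atom in mol.GetAtoms()]  # [6, 6, 8]
edges = []
for bond in mol.GetBonds():
    i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
    edges += [(i, j), (j, i)]  # each undirected bond becomes two directed edges
print(node_features, edges)
```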

3D graphs. Represent the spatial arrangement of a molecule's atoms in 3D space, where each atom is associated with its type, its coordinates, and optional geometric properties (such as velocity). The advantage of using 3D geometry is that conformational information is crucial for many molecular properties, especially quantum properties. Furthermore, stereochemical information such as chirality can be exploited directly given the 3D geometry.
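
A minimal sketch of obtaining a 3D geometry (atom types plus coordinates) with RDKit's ETKDG conformer generator; this is only an illustration, since pre-training corpora often use conformations from quantum-chemistry calculations or experiments instead:

```python
# Generate one 3D conformer and read out atom types and coordinates.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))
AllChem.EmbedMolecule(mol, randomSeed=42)   # ETKDG conformer generation
coords = mol.GetConformer().GetPositions()  # (num_atoms, 3) numpy array
types = [atom.GetSymbol() for atom in mol.GetAtoms()]
print(list(zip(types, coords.round(3).tolist())))
```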

2. Pre-training Strategies

1. AutoEncoding (AE) (Figure 3a)

[SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery.]: Uses a transformer-based encoder-decoder network and learns representations by reconstructing molecules represented as SMILES strings.

[PanGu Drug Model: Learn a Molecule Like a Human.]: Pre-trains a graph-to-sequence asymmetric conditional variational autoencoder (molecular graph in, SMILES out) to learn molecular representations.

Although autoencoders can learn meaningful representations of molecules, they only focus on single molecules and cannot capture relationships between molecules, which limits their performance on some downstream tasks. (Not a mainstream method.)

2. Autoregressive Modeling (AM) (Figure 3b)

Decomposes the molecular input into sub-sequences (usually atom-level structures) and then predicts each sub-sequence one by one, conditioned on the previous sub-sequences in the sequence.
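
A minimal sketch of the autoregressive objective in a standard form (the notation is ours and may differ from the survey's): for a molecule decomposed into an ordered sequence of components $s_1, \dots, s_T$,

$$\mathcal{L}_{\mathrm{AM}} = -\sum_{t=1}^{T} \log p_{\theta}\left(s_t \mid s_1, \dots, s_{t-1}\right),$$

where $p_{\theta}$ is the model's predicted distribution over the next component given all previous ones.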

[MolGPT: Molecular Generation Using a Transformer-Decoder Model. 2021]: Uses a transformer decoder to predict the next token of a SMILES string in an autoregressive way.

[GPT-GNN: Generative Pre-Training of Graph Neural Networks. KDD, 2020.]: Reconstructs the molecular graph (Figure 3b): given a graph whose nodes and edges are randomly masked, it generates one masked node and its edges at a time and maximizes the likelihood of the nodes and edges generated in that iteration; nodes and edges are generated iteratively until all masked nodes are produced.

[Motif-based Graph Self-Supervised Learning for Molecular Property Prediction. NeurIPS, 2021.]: Autoregressively generates molecular graph fragments (motifs/subgraphs) instead of single atoms or bonds.

AM performs well at molecule generation. However, it is computationally more expensive and requires a predefined ordering of atoms or bonds, which may be unsuitable for molecules because they have no inherent ordering.

3. Masked Component Modeling (MCM) (Figure 3c)

MCM masks some components of the molecule (such as atoms, bonds, or fragments) and then trains the model to predict them given the remaining components.
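
A minimal sketch of the masked-component objective in a standard form (notation is ours): with $\mathcal{M}$ the set of masked components and $x_{\setminus \mathcal{M}}$ the molecule with those components masked out,

$$\mathcal{L}_{\mathrm{MCM}} = -\sum_{i \in \mathcal{M}} \log p_{\theta}\left(x_i \mid x_{\setminus \mathcal{M}}\right).$$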

Sequence-based models:

[ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. 2020], [SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction. BCB, 2019.], [Molformer: Large Scale Chemical Language Representations Capture Molecular Structure and Properties. Nat. Mach. Intell., 2022]: Mask random tokens in SMILES strings and then recover them from the transformer's outputs on the unmasked parts of the string.

Graph-based models:

[Strategies for Pre-training Graph Neural Networks. In ICLR, 2020.]: Randomly masks input atom/bond attributes and pre-trains a GNN to predict them (a minimal masking sketch is given at the end of this subsection).

[Self-Supervised Graph Transformer on Large-Scale Molecular Data. NeurIPS, 2020.]: Predicts masked subgraphs to capture contextual information in molecular graphs.

[Mole-BERT: Rethinking Pre-training Graph Neural Networks for Molecules. ICLR, 2023]: Since the vocabulary of atom types found in nature is small and highly imbalanced, directly masking and predicting atom types can be problematic. To alleviate this, a context-aware tokenizer is designed that encodes atoms into chemically meaningful discrete codes for masking.

Masked components are predicted from their surrounding context, whereas AM relies only on the previous components in a predefined sequence, so MCM can capture more complete chemical semantics. However, because MCM, following BERT, typically masks only a fixed fraction of each molecule during pre-training, it cannot train on all components of every molecule, which lowers sample efficiency.
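
As a concrete illustration of the attribute masking mentioned above, here is a simplified sketch (not the exact procedure of any cited paper; the 15% mask rate and the mask-token id are assumptions):

```python
import random

def mask_atom_types(atom_types, mask_rate=0.15, mask_token=119):
    """Randomly mask a fraction of atom types (given as atomic numbers).
    Returns the corrupted list plus the indices and labels a model must predict.
    mask_token=119 is a hypothetical [MASK] id one past the 118 real elements."""
    num_mask = max(1, int(len(atom_types) * mask_rate))
    mask_idx = random.sample(range(len(atom_types)), num_mask)
    labels = [atom_types[i] for i in mask_idx]
    corrupted = list(atom_types)
    for i in mask_idx:
        corrupted[i] = mask_token
    return corrupted, mask_idx, labels

# Example: aspirin's heavy atoms as atomic numbers (C = 6, O = 8).
atoms = [6, 6, 8, 8, 6, 6, 6, 6, 6, 6, 6, 6, 8]
corrupted, mask_idx, labels = mask_atom_types(atoms)
# A GNN would then be trained to predict `labels` at `mask_idx` from `corrupted`.
```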

4. Context Prediction (CP) (Figure 3d)

Capture the semantics of molecules/atoms in an unambiguous, context-aware manner.

[Strategies for Pre-training Graph Neural Networks. ICLR, 2020.]: Uses binary classification to determine whether a node's neighborhood subgraph and a surrounding context subgraph belong to the same node (sketched below).
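
Roughly, in the negative-sampling setup of that paper (the notation here is illustrative): with $h_v$ the embedding of node $v$'s neighborhood subgraph from the main GNN and $c_v$ the embedding of its context subgraph from an auxiliary GNN,

$$\mathcal{L}_{\mathrm{CP}} = -\log \sigma\!\left(h_v^{\top} c_v\right) - \log\left(1 - \sigma\!\left(h_v^{\top} c_{v'}\right)\right),$$

where $v'$ is a randomly sampled negative node and $\sigma$ is the sigmoid function.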

Although CP is simple and effective, it requires an auxiliary neural encoder to map the context into a fixed-length vector, which adds extra computational overhead to large-scale pre-training.

5. Contrastive Learning (CL) (Figure 3e)

Pretrain the model by maximizing the consistency between a pair of similar inputs, such as two different augmentations or descriptors of the same molecule.

Depending on the granularity of the contrast (e.g., molecule- or substructure-level), CL methods fall into two types: Cross-Scale Contrast (CSC) and Same-Scale Contrast (SSC).

1)Cross-Scale Contrast (CSC)

[InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. ICLR, 2020.]: Performs unsupervised and semi-supervised graph-level representation learning by maximizing the mutual information between graph-level and substructure-level representations.

[Contrastive Multi-View Representation Learning on Graphs. ICML, 2020.]: Performs graph diffusion to generate an augmented view of the molecular graph, then contrasts the atom (node) representations of one view with the molecule (graph) representation of the other view (and vice versa) to maximize the agreement between the original and augmented views.

2)Same-Scale Contrast (SSC).

Contrastive learning is performed at the molecule level by pulling an augmented view toward its anchor molecule (positive pair) and pushing it away from other molecules (negative pairs).
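
A minimal sketch of the typical contrastive (InfoNCE-style) objective (exact forms vary across the works listed below): with $z_i$ and $z_i'$ the representations of two views of molecule $i$, and the other molecules in a batch of size $N$ acting as negatives,

$$\mathcal{L}_{\mathrm{SSC}} = -\log \frac{\exp\!\left(\mathrm{sim}(z_i, z_i')/\tau\right)}{\sum_{j=1}^{N} \exp\!\left(\mathrm{sim}(z_i, z_j')/\tau\right)},$$

where $\mathrm{sim}$ is usually cosine similarity and $\tau$ is a temperature.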

Graph-level (molecule-level) contrastive pre-training with various augmentation strategies:

【Graph Contrastive Learning with Augmentations. In NeurIPS, 2020】、

【Graph Contrastive Learning Automated. In ICML, 2021.】

【MoCL: Contrastive Learning on Molecular Graphs with Multi-level Domain Knowledge. In KDD, 2021.】

【Adversarial Graph Augmentation to Improve Graph Contrastive Learning. In NeurIPS, 2021】

【InfoGCL: Information-Aware Graph Contrastive Learning. In NeurIPS, 2021.】

【Molecular Graph Contrastive Learning with Parameterized Explainable Augmentations. In BIBM, 2021.】

【Molecular Contrastive Learning with Chemical Element Knowledge Graph. In AAAI,2022.】

【Improving Molecular Contrastive Learning via Faulty Negative Mitigation and Decomposed Fragment Contrast. J. Chem. Inf. Model., 2022.】

【SimGRACE: A Simple Framework for Graph Contrastive Learning without Data Augmentation. In WWW, 2022.】

Other SSC methods maximize agreement between different descriptors of the same molecule while treating descriptors of different molecules as negatives:

【SMICLR: Contrastive Learning on Multiple Molecular Representations for Semisupervised and Unsupervised Representation Learning. J. Chem. Inf. Model., 2022.】: Jointly trains a graph encoder and a SMILES string encoder under a contrastive objective.

[Multilingual Molecular Representation Learning via Contrastive Pre-training. In ACL, 2022.]: MM-Deacon uses two independent transformers to encode SMILES and IUPAC names, with a contrastive objective that encourages the SMILES and IUPAC representations of the same molecule to be similar.

[3D Infomax Improves GNNs for Molecular Property Prediction. In ICML, 2022]: Maximizes the agreement between representations learned from the 3D geometry and the 2D graph of the same molecule.

[GeomGCL: Geometric Graph Contrastive Learning for Molecular Property Prediction. In AAAI, 2022.]: Uses a dual-view geometric message passing neural network (GeomMPNN) to encode the 2D and 3D graphs of a molecule, with a geometric contrastive objective.

Several key issues hinder wider adoption of CL. It is difficult to preserve chemical semantics during molecular augmentation: augmentations are chosen by manual trial and error [Graph Contrastive Learning with Augmentations. In NeurIPS, 2020.], tedious optimization [Graph Contrastive Learning Automated. In ICML, 2021.], or expensive domain-knowledge guidance [MoCL: Contrastive Learning on Molecular Graphs with Multi-level Domain Knowledge. In KDD, 2021.], but there is still no effective and principled method for designing chemically suitable augmentations for molecular pre-training.

The assumption behind CL that similar inputs should be mapped to nearby representations may not always hold for molecular representation learning (activity cliffs: structurally similar molecules can have completely different properties).

The CL objective treats all other molecules in a batch as negative samples regardless of their true semantics, which undesirably pushes apart molecules with similar properties and hurts performance due to false negatives [ProGCL: Rethinking Hard Negative Mining in Graph Contrastive Learning. In ICML, 2022.]

6. Replaced Components Detection (RCD) (Figure 3f)

Identifies randomly substituted components of the input molecule. For example, [MPG: An Effective Self-Supervised Framework for Learning Expressive Molecular Global Representations to Drug Discovery. Briefings Bioinform., 2021.] divides each molecule into two parts, creates modified structures by recombining parts from two molecules, and trains the encoder to detect whether the combined parts come from the same molecule.

While RCD can reveal intrinsic patterns in molecular structure, the encoder is pre-trained to always produce the same "non-replaced" label for all natural molecules and the same "replaced" label for random combinations of molecules. In downstream tasks, however, the inputs are all natural molecules, so the representations produced by RCD pre-training can be less discriminative.

7. DeNoising (DN) (Figure 3g)

[Pre-training via Denoising for Molecular Property Prediction. ICLR, 2023.]: Adds noise to the atomic coordinates of the 3D molecular geometry and pre-trains the encoder to predict the noise (see the sketch at the end of this subsection).

[Uni-Mol: A Universal 3D Molecular Representation Learning Framework. In ICLR, 2023.]: Also adds noise to the atomic coordinates; the motivation is that masked atom types can be easily inferred given the 3D atomic positions.

[Molecular Geometry Pretraining with SE(3)-Invariant Denoising Distance Matching. In ICLR, 2023.]: Distance denoising pretraining to model the dynamic characteristics of 3D molecules.
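
A minimal sketch of the coordinate-denoising objective in a common form (the three works above differ in whether they denoise coordinates, pairwise distances, or both): with $R$ the atomic coordinates and $\epsilon$ Gaussian noise,

$$\mathcal{L}_{\mathrm{DN}} = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}\left\| \hat{\epsilon}_{\theta}(R + \epsilon) - \epsilon \right\|^2,$$

where $\hat{\epsilon}_{\theta}$ is the noise predicted by the 3D encoder from the perturbed geometry.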

3. Extensions

1. Knowledge-enriched pre-training.

CPMs usually learn general molecular representations from large molecular databases. However, they often lack domain-specific knowledge.

[GraphCL: Graph Contrastive Learning with Augmentations. In NeurIPS, 2020.]: Points out that bond perturbation (adding or removing bonds as data augmentation) is conceptually incompatible with domain knowledge and empirically unhelpful for contrastive pre-training on compounds; they therefore avoid bond perturbation when augmenting molecular graphs.

[MoCL: Contrastive Learning on Molecular Graphs with Multi-level Domain Knowledge. In KDD, 2021.]: Proposes a domain-knowledge-driven augmentation operator called substructure substitution, in which a valid substructure of the molecule is replaced by a bioisostere, producing a new molecule with physical or chemical properties similar to the original.

[Molecular Contrastive Learning with Chemical Element Knowledge Graph. In AAAI, 2022.]: Constructs a chemical element knowledge graph (KG) to summarize microscopic associations between elements and proposes a knowledge-enhanced contrastive learning (KCL) framework for molecular representation learning.

[Motif-based Graph Self-Supervised Learning for Molecular Property Prediction. In NeurIPS, 2021.]: First uses an existing algorithm (On the Art of Compiling and Using 'Drug-Like' Chemical Fragment Spaces. ChemMedChem, 2008.) to extract semantically meaningful motifs, and then pre-trains the neural encoder to predict the motifs in an autoregressive manner.

[Geometry-Enhanced Molecular Representation Learning for Property Prediction. Nat. Mach. Intell., 2022.]: Proposes to use molecular geometry to enhance molecular graph pre-training. A geometry-based GNN architecture and several geometry-level self-supervised tasks (bond length prediction, bond angle prediction, and atomic distance matrix prediction) are designed to capture molecular geometry knowledge during pre-training.

Although knowledge-enriched (knowledge-infused) pre-training helps CPMs acquire chemical domain knowledge, the guidance is expensive (in computation and data), and prior knowledge can be incomplete, incorrect, or costly to obtain, which limits wider application.

2. Multimodal pre-training.

Uses additional modalities such as molecular images and biochemical text.

[KV-PLM: A Deep-Learning System Bridging Molecule Structure and Biomedical Text with Comprehension Comparable to Human Professionals. Nat. Commun., 2022]: Tokenizes SMILES strings together with biochemical text; some tokens are then randomly masked, and the neural encoder is pre-trained to recover them.

[MolT5: Translation between Molecules and Natural Language. In EMNLP, 2022.]: Masks spans in a large corpus of SMILES strings and biochemical textual descriptions of molecules, and pre-trains a transformer to predict the masked spans. The pre-trained model can then generate both SMILES strings and biochemical text, which is particularly effective for text-guided molecule generation and molecule captioning (generating descriptive text for a molecule).

[Featurizations Matter: A Multiview Contrastive Learning Approach to Molecular Pretraining. In AI4Science@ICML, 2022.]: Proposes a contrastive objective that maximizes the consistency between the embeddings of four molecular descriptors and their aggregated embedding, so that the different descriptors cooperate on molecular property prediction tasks.

[MICER: A Pre-Trained Encoder–Decoder Architecture for Molecular Image Captioning. Bioinform., 2022.]: A pre-training framework based on autoencoders is used for molecular image captioning. Molecular images are fed to a pre-trained encoder and the corresponding SMILES strings are decoded.

The above multimodal pre-training strategies facilitate translation between modalities. In addition, the modalities can complement each other to form a more complete knowledge base for a variety of downstream tasks.

4. Applications

1. Molecular Property Prediction (MPP)

Labels can be scarce because wet-lab experiments are often laborious and expensive. CPMs exploit large amounts of unlabeled molecules and serve as backbones for downstream molecular property prediction tasks.

Compared with models trained from scratch, CPMs generalize better to out-of-distribution molecules [Strategies for Pre-training Graph Neural Networks. In ICLR, 2020.]

2. Molecular Generation (MG)

1) Autoregressive generation over SMILES, as in MolGPT.

2) Text-guided generation, which converts descriptive text into molecular structures and further expands the possibilities for molecule generation.

3) Generation of three-dimensional molecular conformations, especially for predicting protein-ligand binding poses, although this remains subject to computational limitations. CPMs based on 3D geometry show clear advantages in conformation generation because the inherent relationships between 2D molecular graphs and 3D conformations can be captured during pre-training.

3. Drug-Target Interactions (DTI)

Prediction of interactions between drug molecules and their protein targets.

4. Drug-Drug Interactions (DDI)

Prediction of drug-drug interactions, which can cause adverse reactions that harm health or even lead to death.

5. Conclusions and Future Outlooks

1. Improving Encoder Architectures and Pre-training Objectives

The ideal capability and architecture of CPMs remain elusive, and there is still considerable room for improvement in pre-training objectives; effective masking strategies for sub-components in MCM are a prime example.

2. Building Reliable and Realistic Benchmarks

Although there have been numerous studies on CPMs, their experimental results may sometimes be unreliable due to inconsistent evaluation settings (e.g., random seeds and dataset splits). For example, on MoleculeNet, which contains several molecular property prediction datasets built from expensive measurements, the performance of the same model can vary significantly with different random seeds, possibly because these datasets are relatively small. It is also crucial to establish more reliable and realistic benchmarks for CPMs that account for out-of-distribution generalization. One solution is to evaluate CPMs with a scaffold split, which partitions molecules according to their core substructures (scaffolds). In practice, researchers often have to apply CPMs trained on known molecules to newly synthesized molecules, which may be very different in nature and belong to different structural domains.
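
A minimal sketch of a scaffold split using RDKit's Bemis-Murcko scaffolds (the greedy group-assignment step is a simplification of what benchmark libraries actually do):

```python
# Group molecules by Bemis-Murcko scaffold so that train and test sets share
# as few core substructures as possible.
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_frac=0.2):
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi, includeChirality=False)
        groups[scaffold].append(idx)
    train, test = [], []
    cutoff = (1.0 - test_frac) * len(smiles_list)
    # Fill the training set with the largest scaffold groups first.
    for group in sorted(groups.values(), key=len, reverse=True):
        (train if len(train) + len(group) <= cutoff else test).extend(group)
    return train, test

train_idx, test_idx = scaffold_split(["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O", "CCN", "c1ccncc1"])
print(train_idx, test_idx)
```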

3. Broadening the Impact of Chemical Pre-trained Models

The ultimate goal of CPM research is to develop versatile molecular encoders that can be applied to a wide range of molecule-related downstream tasks. Nevertheless, compared with the progress of PLMs (pre-trained language models) in the NLP community, there is still a large gap between the methodological progress of CPMs and their practical application. On the one hand, representations produced by CPMs have not yet been widely used to replace traditional molecular descriptors in chemistry, and pre-trained models have not yet become a standard tool in the community. On the other hand, there is currently limited exploration of how these models can benefit a wider range of downstream tasks beyond individual molecules, such as chemical reaction prediction, molecular similarity search in virtual screening, retrosynthesis, and chemical space exploration.

4. Establishing Theoretical Foundations

CPMs have shown impressive performance on a variety of downstream tasks, but rigorous theoretical understanding of these models remains limited. This lack of foundation hinders the scientific community and industry stakeholders seeking to maximize the potential of these models. The theoretical foundations of CPMs must be established to fully understand their mechanisms and how they improve performance in various applications. Further studies are needed to gain a deeper understanding of the effectiveness of different molecular pre-training objectives and to guide optimal method design.

Origin blog.csdn.net/justBeHerHero/article/details/130471851