Accurate prediction of molecular targets using a self-supervised image representation learning framework

https://assets.researchsquare.com/files/rs-1477870/v1_covered.pdf?c=1649357561

Background notes:

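Drug design:
Drug design is the process of finding and inventing new drugs based on existing knowledge of a biological target. A drug is typically a small organic molecule, and it is designed to fit the chemical structure, charge, and shape of the target biomolecule (such as a protein) so that it can produce the intended effect.

Computer-aided drug design:
Drug design carried out with computational molecular modeling techniques is called computer-aided drug design.

Design based on the chemical structure of the biological target is called structure-based drug design.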

Abstract

The clinical efficacy and safety of a drug are determined by its molecular targets in the human proteome. However, proteome-wide evaluation of all compounds in humans, or even in animal models, is challenging.
In this study, we present an unsupervised pre-training deep learning framework, termed ImageMol, pretrained on 8.5 million unlabeled drug-like molecules to predict molecular targets of candidate compounds.
The ImageMol framework is designed to pretrain chemical representations from unlabeled molecular images based on local- and global-structural characteristics of molecules from pixels. We demonstrate high performance of ImageMol in evaluation of molecular properties (i.e., a drug's metabolism, brain penetration and toxicity) and molecular target profiles (i.e., human immunodeficiency virus) across 10 benchmark datasets. ImageMol shows high accuracy in identifying anti-SARS-CoV-2 molecules across 13 high-throughput experimental datasets from the National Center for Advancing Translational Sciences (NCATS), and we re-prioritized candidate clinical 3CL inhibitors for potential treatment of COVID-19. In summary, ImageMol is an active self-supervised image processing-based strategy that offers a powerful toolbox for computational drug discovery in a variety of human diseases, including COVID-19.


Introduction

Despite recent advances in biomedical research and technologies, drug discovery and development remains a challenging multidimensional task requiring optimization of vital properties of candidate compounds, including pharmacokinetics, efficacy and safety [1, 2]. It was estimated that pharmaceutical companies spent $2.6 billion in 2015, up from $802 million in 2003, on drug approval by the U.S. Food and Drug Administration (FDA) [3]. The increasing cost of drug development results from lack of efficacy in randomized controlled trials and from the unknown pharmacokinetics and safety profiles of candidate compounds [4-6]. Traditional experimental approaches are infeasible for proteome-wide evaluation of the molecular targets of all candidate compounds in humans, or even in animal models. Computational approaches and technologies have been considered a promising solution [7, 8], as they can significantly reduce cost and time across the entire drug discovery and development pipeline. [Note: drug development remains a challenging task, and traditional experimental methods cannot feasibly evaluate, at proteome-wide scale, the molecular targets of all candidate compounds in humans or even in animal models.]

The rise of advanced Artificial Intelligence (AI) technologies [9, 10] has motivated their application to drug design [11-13] and target identification [14-16]. One of the fundamental challenges is how to learn molecular representations from chemical structures [17]. Previous molecular representations were based on hand-crafted features, such as fingerprint-based features [16, 18], physicochemical descriptors and pharmacophore-based features [19, 20]. However, traditional molecular representation methods, such as sequence-based [21, 22] and graph-based [23, 24] approaches, rely on a large amount of domain knowledge. Their accuracy in extracting informative vectors that describe the molecular identities and biological characteristics of molecules is limited. Recent advances in unsupervised learning in computer vision [25, 26] suggest that it is possible to apply unsupervised image-based pre-training models to computational drug discovery. [Note: how can a molecular representation be learned from chemical structures? Traditional representation methods rely on a large amount of domain knowledge, while recent advances in unsupervised learning in computer vision suggest that unsupervised image-based pre-training models can be applied to computer-aided drug discovery.]
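To make "hand-crafted features" concrete, here is a minimal sketch of a fixed, rule-based fingerprint. It is only a toy: the function names `toy_fingerprint` and `tanimoto` are hypothetical, and hashing SMILES substrings is an assumption made for illustration. It is not a real chemical fingerprint and not anything taken from the paper.

```python
import zlib


def toy_fingerprint(smiles, n_bits=64, max_len=3):
    """Hash every substring of length 1..max_len of a SMILES string into a bit vector.

    Toy stand-in for a hand-crafted fingerprint: the rule is fixed in advance
    by the designer, nothing is learned from data.
    """
    bits = [0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            # crc32 is deterministic across runs, unlike the built-in hash() on str
            index = zlib.crc32(fragment.encode("utf-8")) % n_bits
            bits[index] = 1
    return bits


def tanimoto(a, b):
    """Tanimoto similarity between two equal-length bit vectors."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0


ethanol = toy_fingerprint("CCO")
propanol = toy_fingerprint("CCCO")
print(tanimoto(ethanol, propanol))
```

Every design decision here (which fragments to enumerate, how many bits to use) is made by hand, which is the dependence on domain knowledge that the paragraph above describes.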

In this study, we presented an unsupervised molecular image pretraining framework (termed ImageMol) with chemical awareness for learning the molecular structures from large-scale molecular images. ImageMol combines an image processing framework with comprehensive molecular chemistry knowledge for extracting fine pixel-level molecular features in a visual computing way. Compared with state-of-the-art methods, ImageMol has two significant improvements:
(1) It utilizes molecular images as the feature representation of compounds with high accuracy and low computing cost;
(2) It exploits an unsupervised pre-trained learning framework to capture the structural information of molecular images from 8.5 million drug-like compounds with diverse biological activities across the human proteome (Figure 1).

We demonstrated the high accuracy of ImageMol in a variety of drug discovery tasks. Via ImageMol, we identified anti-SARS-CoV-2 molecules across 13 high-throughput experimental datasets from the National Center for Advancing Translational Sciences (NCATS). In summary, ImageMol provides a powerful pre-training deep learning framework for computational drug discovery.

[Note: the authors propose a chemistry-aware, unsupervised molecular image pre-training framework (ImageMol) that learns molecular structure from large-scale molecular images. (1) It uses molecular images as the feature representation of compounds, with high accuracy and low computing cost; (2) it uses unsupervised pre-training to capture structural information from the molecular images of 8.5 million drug-like compounds with diverse biological activities across the human proteome (Figure 1).]
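The idea of pre-training on unlabeled images can be illustrated with a generic pretext task in which the training target is cut out of the image itself. The sketch below is an assumption made for illustration: this excerpt does not describe ImageMol's actual pretext tasks, and `make_masked_pair` and the 4x4 toy grid are hypothetical stand-ins for a real molecular image pipeline.

```python
import random


def make_masked_pair(image, patch=2, seed=0):
    """Build one self-supervised training example from an unlabeled image.

    `image` is a list of rows of pixel intensities. A random patch is blanked
    out; the hidden pixels become the prediction target, so no human-provided
    label is needed.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    top = rng.randrange(height - patch + 1)
    left = rng.randrange(width - patch + 1)

    # Target: the original pixels that the model would have to reconstruct.
    target = [row[left:left + patch] for row in image[top:top + patch]]

    # Input: a copy of the image with the patch set to zero.
    corrupted = [row[:] for row in image]
    for r in range(top, top + patch):
        for c in range(left, left + patch):
            corrupted[r][c] = 0
    return corrupted, target, (top, left)


# Toy 4x4 "image" with non-zero pixels 1..16, standing in for a molecular image.
toy_image = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
corrupted, target, (top, left) = make_masked_pair(toy_image)
```

A model trained to recover `target` from `corrupted` has to pick up structure from the surrounding pixels, which is the general sense in which unlabeled images can supervise their own representation learning.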
