What does "Oracle" mean in drug optimization, and what are the evaluation metrics?

Oracle

An oracle is a property evaluator: a function whose input is a molecular structure and whose output is a property score indicating how good the molecule is. We consider the following oracles:

  • JNK3: biological activity against JNK3, ranging from 0 to 1.
  • GSK3B (also written GSK3β): biological activity against GSK3B, ranging from 0 to 1.
  • QED: Quantitative Estimate of Drug-likeness, ranging from 0 to 1.
  • SA: Synthetic Accessibility, normalized to (0, 1).
  • LogP: the octanol-water partition coefficient, used here as an indicator of a compound's solubility and synthetic accessibility. It is unbounded, ranging from negative infinity to positive infinity.

For all the property scores above, higher is more desirable.

Optimization Task

There are two kinds of optimization tasks: single-objective and multi-objective optimization.

(1) Single-objective optimization targets JNK3, GSK3β and LogP separately.

(2) Multi-objective optimization covers jnkgsk (JNK3 + GSK3B) and qedsajnkgsk (QED + SA + JNK3 + GSK3B).
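As a rough illustration of how an oracle and a multi-objective target fit together, the sketch below treats each oracle as a function from a SMILES string to a score and combines them by averaging. The lookup table `toy_scores` and the functions `jnk3` and `gsk3b` are hypothetical stand-ins with made-up numbers; real oracles would be trained activity predictors or cheminformatics scorers.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

# An oracle maps a molecule (here a SMILES string) to a property score.
Oracle = Callable[[str], float]

# Hypothetical scores for illustration only; these are not real predictions.
toy_scores: Dict[str, Dict[str, float]] = {
    "CCO": {"jnk3": 0.2, "gsk3b": 0.6},
    "c1ccccc1": {"jnk3": 0.5, "gsk3b": 0.7},
}


def jnk3(smiles: str) -> float:
    """Stand-in for a JNK3 activity oracle (score in [0, 1])."""
    return toy_scores[smiles]["jnk3"]


def gsk3b(smiles: str) -> float:
    """Stand-in for a GSK3B activity oracle (score in [0, 1])."""
    return toy_scores[smiles]["gsk3b"]


def composite_objective(smiles: str, oracles: Sequence[Oracle]) -> float:
    """Combine several oracle scores into one value by taking their mean."""
    return mean(oracle(smiles) for oracle in oracles)


# The "jnkgsk" task averages the JNK3 and GSK3B scores of a molecule.
jnkgsk_score = composite_objective("c1ccccc1", [jnk3, gsk3b])
print(round(jnkgsk_score, 3))
```

Averaging only makes sense when every score lives on the same scale, which is why SA is normalized before it is combined with the others.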

Generate Vocabulary (the most frequent substructures)

In this project, the basic unit is a substructure, which can be an atom or a single ring. The vocabulary is the set of frequent substructures.

Labelling

We use the oracles to evaluate each molecule's properties and obtain the labels for training the graph neural network.

For the implementation and further details, see:

Unpaired Generative Molecule-to-Molecule Translation for Lead Optimization: GitHub - guy-ba/UGMMT: Code for the paper "Unpaired Generative Molecule-to-Molecule Translation" (KDD 2021)

https://arxiv.org/abs/2109.10469

GitHub - futianfan/DST: (differentiable) gradient-based optimization on a chemical graph for de novo molecule design/optimization (ICLR 2022)

How they relate:

 de novo molecular generation (designing entirely new drug molecules)

 Single-objective de novo molecular generation.

Multi-objective de novo design


C.4 Evaluation metrics

We leverage the following evaluation metrics to measure the optimization performance:

Novelty is the fraction of the generated molecules that do not appear in the training set.
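A minimal sketch of the novelty computation, assuming molecules are compared as canonical SMILES strings (so that the same molecule always has the same string). The function name `novelty` and the example molecules are illustrative, not taken from the DST code.

```python
from typing import Iterable, Sequence


def novelty(generated: Sequence[str], training: Iterable[str]) -> float:
    """Fraction of generated molecules that are absent from the training set.

    Assumes both inputs hold canonical SMILES, so string equality
    corresponds to molecular identity.
    """
    if not generated:
        return 0.0  # convention chosen here for an empty batch
    training_set = set(training)
    unseen = sum(1 for smiles in generated if smiles not in training_set)
    return unseen / len(generated)


generated_batch = ["CCO", "CCN", "CCC", "CCCl"]
training_batch = ["CCO", "c1ccccc1"]
print(novelty(generated_batch, training_batch))
```

Only "CCO" occurs in the training list, so three of the four generated molecules count as novel.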

Diversity of the generated molecules is defined as the average pairwise Tanimoto distance between their Morgan fingerprints [49, 23, 47]:

$$\mathrm{diversity} = 1 - \frac{1}{|Z|(|Z|-1)} \sum_{Z_1, Z_2 \in Z,\ Z_1 \neq Z_2} \mathrm{sim}(Z_1, Z_2),$$

 where Z is the set of generated molecules and sim(Z1, Z2) is the Tanimoto similarity between molecules Z1 and Z2.
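The definition can be sketched in a few lines, assuming each fingerprint is represented as the set of its "on" bit indices. The helper names `tanimoto` and `diversity` and the toy fingerprints are hypothetical; in practice the fingerprints would come from a cheminformatics toolkit.

```python
from itertools import combinations
from typing import FrozenSet, Sequence

Fingerprint = FrozenSet[int]  # indices of the bits that are set


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: shared on-bits divided by on-bits in either."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0  # convention for two empty sets


def diversity(fingerprints: Sequence[Fingerprint]) -> float:
    """One minus the mean Tanimoto similarity over all distinct pairs.

    Averaging over unordered pairs gives the same value as averaging over
    ordered pairs with Z1 != Z2, because the similarity is symmetric.
    """
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0  # fewer than two molecules: no pairwise distance to average
    mean_similarity = sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity


toy_fingerprints = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({7, 8})]
print(round(diversity(toy_fingerprints), 4))
```

Here only the first two fingerprints overlap (similarity 0.5), so the mean similarity over the three pairs is 1/6 and the diversity is 5/6.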

(Tanimoto) Similarity measures the similarity between the input molecule and the generated molecules.

It is defined as

$$\mathrm{sim}(X, Y) = \frac{b_X^\top b_Y}{\|b_X\|_1 + \|b_Y\|_1 - b_X^\top b_Y},$$

 where b_X is the binary Morgan fingerprint vector of molecule X. In this paper, it is a 2048-bit binary vector.
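For binary vectors the Tanimoto similarity is the number of shared on-bits divided by the number of bits set in either vector. The sketch below works directly on 0/1 lists; it uses 8-bit toy vectors instead of 2048-bit fingerprints, and the function name `tanimoto_similarity` is an illustrative choice.

```python
from typing import Sequence


def tanimoto_similarity(b_x: Sequence[int], b_y: Sequence[int]) -> float:
    """Tanimoto similarity of two equal-length binary (0/1) vectors."""
    if len(b_x) != len(b_y):
        raise ValueError("fingerprints must have the same length")
    shared = sum(x * y for x, y in zip(b_x, b_y))  # dot product of the vectors
    either = sum(b_x) + sum(b_y) - shared          # |b_x|_1 + |b_y|_1 - dot
    return shared / either if either else 0.0      # convention for all-zero input


# Toy 8-bit fingerprints; real Morgan fingerprints here would have 2048 bits.
input_fp = [1, 1, 0, 1, 0, 0, 0, 0]
generated_fp = [1, 0, 0, 1, 1, 0, 0, 0]
print(tanimoto_similarity(input_fp, generated_fp))
```

The two vectors share 2 on-bits and have 4 bits set in total across both, giving a similarity of 0.5.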

SR (Success Rate) is the percentage of the generated molecules that satisfy the property constraint, measured by the objective f defined in Equation (1) of the paper.

In that equation, X is a molecule, Q denotes the set of valid molecules, and f is the composite objective combining all the oracle scores, e.g., the mean value of the P oracle scores.

1. For single-objective de novo molecular generation, the objective f is the property score itself. The constraints for JNK3, GSK3β and LogP are JNK3 ≥ 0.5, GSK3β ≥ 0.5 and LogP ≥ 5.0, respectively.
2. For multi-objective de novo molecular generation, the objective f is the average of all the normalized target property scores. Concretely, when optimizing "JNK3+GSK3β", both JNK3 and GSK3β range from 0 to 1, and f is the average of the JNK3 and GSK3β scores. When optimizing "QED+SA+JNK3+GSK3β", SA is first normalized to the range 0 to 1, and f is the average of the QED, normalized SA, JNK3 and GSK3β scores. The constraint is that the f score is greater than 0.4.
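The two kinds of constraint can be checked with one small function. This is a sketch under the thresholds quoted above; the function name `success_rate`, the `strict` flag and the example scores are illustrative assumptions rather than part of the DST code.

```python
from typing import Sequence

# Thresholds quoted in the text: ">=" for single objectives, ">" for multi.
SINGLE_OBJECTIVE_THRESHOLDS = {"jnk3": 0.5, "gsk3b": 0.5, "logp": 5.0}
MULTI_OBJECTIVE_THRESHOLD = 0.4


def success_rate(f_scores: Sequence[float], threshold: float,
                 strict: bool = False) -> float:
    """Fraction of molecules whose objective value f meets the constraint.

    strict=False checks f >= threshold (single-objective tasks);
    strict=True checks f > threshold (multi-objective tasks).
    """
    if not f_scores:
        return 0.0  # convention chosen here for an empty batch
    if strict:
        hits = sum(1 for f in f_scores if f > threshold)
    else:
        hits = sum(1 for f in f_scores if f >= threshold)
    return hits / len(f_scores)


# Made-up objective values for four generated molecules on the JNK3 task.
jnk3_scores = [0.7, 0.45, 0.5, 0.1]
print(success_rate(jnk3_scores, SINGLE_OBJECTIVE_THRESHOLDS["jnk3"]))

# Made-up averaged scores for a multi-objective task.
multi_scores = [0.41, 0.40, 0.39, 0.8]
print(success_rate(multi_scores, MULTI_OBJECTIVE_THRESHOLD, strict=True))
```

Multiply the result by 100 to report it as a percentage.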

Number of oracle calls during the generation process. DST needs to call the oracle both when labeling data for the GNN and during DST-based de novo generation, so we report the cost of both steps.
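One simple way to track this metric is to wrap the oracle in an object that counts invocations and, optionally, refuses to exceed a budget. The class `CountingOracle` and the toy scoring function below are hypothetical; this sketch counts every call, including repeated queries for the same molecule.

```python
from typing import Callable, Optional


class CountingOracle:
    """Wraps an oracle, counts its calls, and enforces an optional budget."""

    def __init__(self, oracle: Callable[[str], float],
                 budget: Optional[int] = None) -> None:
        self.oracle = oracle
        self.budget = budget
        self.calls = 0

    def __call__(self, smiles: str) -> float:
        if self.budget is not None and self.calls >= self.budget:
            raise RuntimeError("oracle budget exhausted")
        self.calls += 1
        return self.oracle(smiles)


def toy_oracle(smiles: str) -> float:
    """Placeholder score (string length / 10); not a real property."""
    return len(smiles) / 10


counted = CountingOracle(toy_oracle, budget=3)
labels = [counted(s) for s in ["CCO", "CCN", "CCCC"]]  # labeling step
print(counted.calls)
```

Using separate wrappers for the labeling step and the generation step would give the two per-step costs mentioned above.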

Chemical validity. As we only enumerate valid chemical structures during the recovery from scaffolding trees (Section C.5), the chemical validity of the molecules produced by DST is always 100%.

4.3 Oracle Efficiency

As mentioned above, oracle calls for realistic optimization tasks can be time-consuming and expensive. From Tables 1 and 2, we can see that the majority of de novo optimization methods require online oracle calls (instead of precomputation), including all of the RL and evolutionary-algorithm based baselines. DST takes fewer oracle calls than the baselines. DST can use precomputed oracle calls to label the molecules in an existing database (i.e., ZINC) for training the oracle GNN, which dramatically reduces the oracle calls needed during inference. In the three tasks in Table 2, two-thirds of the oracle calls (10K) can be precomputed or collected from other sources.

To further verify the oracle efficiency, we explore a special setting of molecule optimization where the budget of oracle calls is limited to a fixed number (2K, 5K, 10K, 20K, 50K) and compare the optimization performance. For GCPN, MolDQN, GA+D and MARS, the number of learning iterations depends on the oracle budget. RationaleRL [23] is not included because it requires intensive oracle calls to collect enough reference data, exceeding the oracle budget in this scenario. In DST, we use around 80% of the budget to label the dataset (i.e., to train the GNN) and the remaining budget to conduct de novo design. Specifically, for budgets of 2K, 5K, 10K, 20K and 50K, we use 1.5K, 4K, 8K, 16K and 40K oracle calls, respectively, to label the data for learning the GNN.

We show the average objective values of the top-100 molecules under different oracle budgets in Figure 3. Our method shows a significant advantage over all the baseline methods in all limited-budget settings. We attribute this to supervised learning being a well-studied and much easier task than generative modeling.
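The budget split described above is simple arithmetic, worked out below from the figures quoted in the text. The names `LABELING_CALLS` and `split_budget` are illustrative.

```python
from typing import Dict, Tuple

# Total oracle budget -> calls spent on labeling data for the GNN (from the text).
LABELING_CALLS: Dict[int, int] = {
    2_000: 1_500,
    5_000: 4_000,
    10_000: 8_000,
    20_000: 16_000,
    50_000: 40_000,
}


def split_budget(total: int) -> Tuple[int, int, float]:
    """Return (labeling calls, generation calls, labeling fraction)."""
    labeling = LABELING_CALLS[total]
    generation = total - labeling
    return labeling, generation, labeling / total


for total in LABELING_CALLS:
    labeling, generation, fraction = split_budget(total)
    print(f"{total}: label={labeling}, generate={generation}, share={fraction:.0%}")
```

The labeling share is exactly 80% for every budget except 2K, where it is 75%, which is consistent with the "around 80%" wording.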

Reposted from blog.csdn.net/weixin_43135178/article/details/126813466