Molecule Optimization Datasets

Paired datasets: see "What Is Molecule Optimization (Molecule Optimization), and Related Papers" on 马鹏森's blog (CSDN)


Unpaired datasets:

paper: Unpaired Generative Molecule-to-Molecule Translation for Lead Optimization

code: GitHub - guy-ba/UGMMT: Code for the paper "Unpaired Generative Molecule-to-Molecule Translation" (KDD 2021)

Following [6, 11], we focus on two common properties which are vital for evaluating the effectiveness of generated drugs:

(1) Dopamine Receptor D2 (DRD2): The DRD2 score measures a molecule’s biological activity against a biological target named the dopamine type 2 receptor.

(2) Drug likeness (QED): The QED score [2] measures, intuitively, how “druglike” a molecule is. We use the RDKit [16] library to compute it.

FULL DATASETS DETAILS

We provide experiments demonstrating our model’s capability for both molecule optimization and drug optimization, and therefore use two different datasets.

(1) Molecule Dataset: The current SOTA method in molecule optimization is CORE [6]. Therefore we use their datasets, which were adapted from [11] and are publicly available on their GitHub.

Train set: These datasets are paired and designed for supervised models, whereas UGMMT, CDN, JTVAE and Mol-CG are unsupervised. Hence we use the paired train set to construct an unpaired train set. The UGMMT method requires two input sets: a set for domain A, which contains low-property molecules, and a set for domain B, which contains high-property molecules. We construct these by unpairing the pairs, removing duplicates, and then calculating each molecule’s property score (DRD2 or QED) and assigning the molecule according to the domain thresholds: it is added to set A if its score is lower than domain A’s threshold, and to set B if its score is higher than domain B’s threshold.
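The unpairing step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_unpaired_domains`, the injected `score` callable and the toy score values are all hypothetical, and the sketch assumes that molecules scoring between the two thresholds are simply left out.

```python
from typing import Callable, Iterable, List, Tuple


def build_unpaired_domains(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
    low_threshold: float,
    high_threshold: float,
) -> Tuple[List[str], List[str]]:
    """Split paired molecules into a low-property set A and a high-property set B."""
    # Unpair and remove duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(m for pair in pairs for m in pair))
    domain_a: List[str] = []
    domain_b: List[str] = []
    for smiles in unique:
        value = score(smiles)
        if value < low_threshold:
            domain_a.append(smiles)   # low-property molecule
        elif value > high_threshold:
            domain_b.append(smiles)   # high-property molecule
        # Assumption: anything between the thresholds is discarded.
    return domain_a, domain_b


# Toy scores standing in for a real DRD2 or QED predictor (hypothetical values).
toy_scores = {"CCO": 0.01, "c1ccccc1": 0.90, "CCN": 0.50, "CC(=O)O": 0.95}
pairs = [("CCO", "c1ccccc1"), ("CCN", "CC(=O)O"), ("CCO", "CC(=O)O")]

# 0.02 and 0.85 are the DRD2 thresholds quoted in this section.
set_a, set_b = build_unpaired_domains(pairs, toy_scores.__getitem__, 0.02, 0.85)
print(set_a)
print(set_b)
```

Passing the scorer in as a callable keeps the sketch independent of any particular property predictor; a real run would supply a DRD2 or QED scoring function instead of the toy dictionary.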

To avoid unbalanced-domain issues, we randomly sample an equal number of molecules from each set and thus obtain the final train sets for A and B. Since JTVAE and CDN require only one training set, we merge A’s and B’s train sets into one set and use it to train them. We empirically set the following domain thresholds. DRD2: A 0.02 (experimented with 0.01-0.2), B 0.85 (experimented with 0.75-0.9). QED: A 0.78 (experimented with 0.75-0.8), B 0.91 (experimented with 0.88-0.93). Other thresholds yield similar results. All dataset files are located inside the dataset/DRD2 and dataset/QED folders on our GitHub. Details:

UGMMT and Mol-CG: The train sets contain 2,097 molecules for DRD2 and 8,968 molecules for QED (in each of the sets A and B). File names: A_train.txt and B_train.txt.
CDN and JTVAE: The train sets contain 4,194 molecules for DRD2 and 17,936 molecules for QED. File names: DRD2_mergedAB_specific_train.txt and QED_mergedAB_specific_train.txt.
G2G and CORE: Taken from their GitHub; the train sets contain 34,404 molecule pairs for DRD2 and 88,306 molecule pairs for QED. File names: DRD2_DATASET.txt and QED_DATASET.txt.
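The balancing and merging steps can be illustrated with a short sketch. The function name `balance_and_merge` and the seed are hypothetical, and the sketch assumes the common sample size is the size of the smaller set, which the text does not state. It is consistent with the numbers above, where each merged train set is twice the per-domain size (4,194 = 2 × 2,097 and 17,936 = 2 × 8,968).

```python
import random
from typing import List, Sequence, Tuple


def balance_and_merge(
    domain_a: Sequence[str], domain_b: Sequence[str], seed: int = 0
) -> Tuple[List[str], List[str], List[str]]:
    """Sample equally sized train sets for A and B, plus their merged union."""
    rng = random.Random(seed)
    # Assumption: draw as many molecules as the smaller domain contains.
    n = min(len(domain_a), len(domain_b))
    train_a = rng.sample(list(domain_a), n)
    train_b = rng.sample(list(domain_b), n)
    # Single-set models (CDN, JTVAE in the text) train on the merged set.
    merged = train_a + train_b
    return train_a, train_b, merged


# Placeholder identifiers instead of real SMILES strings.
domain_a = [f"A{i}" for i in range(10)]
domain_b = [f"B{i}" for i in range(4)]
train_a, train_b, merged = balance_and_merge(domain_a, domain_b)
print(len(train_a), len(train_b), len(merged))  # 4 4 8
```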

Validation set: For UGMMT, after sampling the final train set from set A, we randomly sample the validation set from the remaining molecules of set A. For the other models we use the original validation sets from Fu et al. [6].
UGMMT: The validation sets contain 800 molecules for DRD2 and 800 molecules for QED. File name: A_validation.txt.
Other models: The validation sets contain 500 molecules for DRD2 and 360 molecules for QED. File name: g2g_validation.txt.
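A minimal sketch of drawing the UGMMT validation set from the molecules left over after train sampling is given below. The helper `split_train_validation` is hypothetical, the pool size of 3,000 is a placeholder (the source does not report the size of set A before sampling), and the input is assumed to contain no duplicates.

```python
import random
from typing import List, Sequence, Tuple


def split_train_validation(
    domain_a: Sequence[str], n_train: int, n_valid: int, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Sample a train set, then a validation set from the remaining molecules."""
    if n_train + n_valid > len(domain_a):
        raise ValueError("not enough molecules for the requested split")
    rng = random.Random(seed)
    shuffled = list(domain_a)
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    # The validation slice starts where the train slice ends, so they never overlap.
    valid = shuffled[n_train:n_train + n_valid]
    return train, valid


# Placeholder pool; 2,097 and 800 are the DRD2 sizes quoted in this section.
pool = [f"mol{i}" for i in range(3000)]
train, valid = split_train_validation(pool, n_train=2097, n_valid=800)
print(len(train), len(valid))  # 2097 800
```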

Test set: The test set is taken as is from Fu et al. [6], so all methods are evaluated on exactly the same data. The test sets contain 1,000 molecules for DRD2 and 800 molecules for QED. File name: A_test.txt.

(2) Drug Dataset: We use the DrugBank dataset [30], which contains a list of FDA-approved drugs, to conduct retrospective experiments and confirm our model’s capability for potential drug optimization. We extract a set of 1,897 drugs and ensure that none of them appeared during training or validation. The drug dataset is located in the FDA_approved_canon_clean.csv file inside the dataset/FDA_approved_drugs_drugbank folder on our GitHub.
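The check that no drug appeared during training or validation can be sketched as a simple set filter. The helper `exclude_seen` is hypothetical, and the sketch assumes every molecule is already written as a canonical SMILES string, so that string equality means the same molecule; without canonicalization this comparison would miss matches.

```python
from typing import Iterable, List, Set


def exclude_seen(drugs: Iterable[str], *seen_sets: Iterable[str]) -> List[str]:
    """Keep only the drugs that occur in none of the given molecule sets."""
    seen: Set[str] = set()
    for molecules in seen_sets:
        seen.update(molecules)
    # Assumption: identical canonical SMILES strings denote identical molecules.
    return [d for d in drugs if d not in seen]


# Toy data standing in for the drug list and the train/validation sets.
drugs = ["CCO", "CC(=O)O", "c1ccccc1"]
train_molecules = ["CCO"]
validation_molecules = ["c1ccccc1"]
held_out = exclude_seen(drugs, train_molecules, validation_molecules)
print(held_out)  # ['CC(=O)O']
```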


Reposted from blog.csdn.net/weixin_43135178/article/details/126921869