J. Med. Chem. 2022 | TocoDecoy: A new method for constructing datasets without hidden bias for training and testing machine-learning scoring functions

Original title: TocoDecoy: A New Approach to Design Unbiased Datasets for Training and Benchmarking Machine-Learning Scoring Functions

Paper link:https://pubs.acs.org/doi/10.1021/acs.jmedchem.2c00460

Paper code: GitHub - 5AGE-zhang/TocoDecoy

Reference link: J. Med. Chem. | TocoDecoy: A new method for constructing datasets without hidden bias for training and testing of machine-learning scoring functions - Zhihu

MLSFs: machine-learning scoring functions, i.e., scoring functions for structure-based virtual screening against a given target that estimate ligand binding affinities and identify binding poses.

1. Problem

1. Traditionally designed scoring functions (SFs) suffer from hidden biases and insufficient data.

Molecular docking is mainly used for structure-based virtual screening (SBVS) and computational target fishing (TF).

The reliability of molecular docking depends on the accuracy of the scoring function.

Note:

Molecular docking places a small molecule (ligand) into the binding site of a macromolecular target (receptor) by computer simulation, predicts the binding affinity and binding mode (conformation) of the pair from calculated physicochemical parameters, and then searches for the lowest-energy conformation of the ligand bound in the receptor's active site.

Computational target fishing (TF) is a computational method that uses target structure information and data from biological databases to identify the biological targets of active compounds.

Decoy molecules: putatively inactive compounds with physicochemical properties similar to the actives but dissimilar topological structures.

2. High-quality datasets are essential for developing SFs and virtual screening methods.

PDBbind (2004): a unified repository of cocrystal structures of protein-ligand complexes with corresponding biological activities and/or binding affinities.

DUD (2006) and DUD-E (2012): collect actives against various targets and also provide artificially generated inactives (decoys) that have physicochemical properties similar to the actives but different topological structures.

DEKOIS (2011): an automated workflow to create tailor-made decoys for any given set of actives; DEKOIS 2.0 (2013) added 81 new structurally diverse decoy sets.

Maximum Unbiased Validation (MUV): an unbiased dataset for validating VS methods, built by collecting experimentally confirmed active and inactive compounds from PubChem BioAssay.

Many traditional datasets were designed around traditional SFs and are therefore unsuitable for benchmarking MLSFs.

There are four main types of bias: artificial enrichment, similarity bias, domain bias, and non-causal bias.

Artificial enrichment: bias caused by significant differences in the physicochemical properties of the actives and the decoys.

Similarity bias: the structures of the compounds in the dataset are too similar, so the test performance of the model is overly optimistic and its generalization ability is limited.

Likewise, if a dataset has limited chemical diversity, an MLSF/SF trained on it may suffer from domain bias (i.e., the MLSF/SF is only suitable for predicting ligands that share similar scaffolds with the ligands in the dataset).

Domain bias: the structural diversity of the dataset is too low, so the model is only suitable for predicting compounds with the specific scaffolds present in the training set.

Non-causal bias: arises from the non-linear fitting and uninterpretable black-box nature of ML. ML algorithms can easily separate actives from decoys by learning differences in topological features rather than physically meaningful protein-ligand interactions.

To account for non-causal bias, one can train the model on a training set whose decoys are constructed in one way and test it on a test set whose decoys are collected in another way (the bias is then not tied to the topology of a single dataset, although the model can still learn non-causal bias from the training set), or train the MLSF/SF using only interaction features rather than topological information (despite good SBVS performance, discarding ligand structure information may limit further improvement). Therefore, there is still an urgent need for unbiased and realistic datasets designed specifically for benchmarking and training.

The first relatively unbiased dataset was LIT-PCBA (2020). Several debiasing techniques are employed, such as keeping actives and inactives within a similar range of molecular properties to remove artificial enrichment.

Another important contribution uses the three-dimensional conformations of the active molecules: by introducing simple data augmentation for DL, it addresses the non-causal bias caused by the large topological differences between actives and decoys (thereby moderately reducing the bias of the dataset).

Decoys are generated by randomly rotating and translating the docked conformations of the active molecules in DUD-E (three decoys per active), or by re-docking each active into the corresponding protein pocket (the three highest-scoring poses with a root-mean-square distance of at least 5 Å from the active pose are labeled as decoys). Such an augmented dataset may force the ML algorithm to learn the physical interactions between proteins and ligands rather than ligand-only properties. However, the authors did not investigate whether MLSFs can consistently learn protein-ligand interactions from randomly generated binding poses.
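The RMSD criterion for labeling re-docked poses can be sketched in plain Python, assuming the two poses share the same atom ordering and no superposition step is needed; the function names are illustrative, not from the paper's code:

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square distance between two equal-length lists of (x, y, z) coordinates."""
    assert len(coords_a) == len(coords_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

def is_pose_decoy(pose, reference, threshold=5.0):
    """A re-docked pose counts as a decoy only if it deviates >= 5 Å from the reference pose."""
    return rmsd(pose, reference) >= threshold
```

A pose translated by 5 Å along one axis, for example, would just reach the decoy threshold.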

DeepCoy (a deep generative model): uses a supervised training procedure to learn from pairs of molecules. Taking an active molecule as input, it generates a decoy with matching properties but a different structure. However, datasets generated by DeepCoy may still suffer from non-causal bias, because they define decoys under the same assumptions as DUD-E. In addition, only a subset of 250,000 molecules was used for training rather than a larger dataset, which leaves the diversity of the generated decoys questionable.

2. MATERIALS AND METHODS

To address the four biases: a conditional molecule generation model was trained on a dataset of roughly one million molecules so that it can generate structurally diverse molecules (removing domain bias); the conditional generation model constrains the physicochemical properties of the generated molecules to be similar to those of the actives (removing artificial enrichment); the t-SNE algorithm maps the compounds into a two-dimensional chemical space, where grid filtering is applied (removing similarity bias); and two decoy construction strategies are introduced: molecules with low structural similarity to the actives are assumed to be negative samples, and incorrect binding conformations of the actives with the target are assumed to be negative samples (removing non-causal bias).

Active is the number of active ligands of the target; CD is the number of conformation decoys with poor docking scores generated by docking the active ligands into the protein pocket; TD is the number of topology decoys generated by the cRNN with physicochemical properties similar to the actives but different topologies. Top50 means the top 50 topology decoys are selected by topological similarity to the corresponding active ligand rather than by grid filtering. xW (9W/100W) means the decoys are grid-filtered with x × 10^4 grids, where x is a number. LIT-PCBA denotes the number of inactive ligands in the LIT-PCBA dataset. The training set is further split into a training set and a validation set at a ratio of 4:1; test denotes the dataset used for model testing. Size of TocoDecoy = Active + CD + TD_Top50; TocoDecoy_9W = Active + CD + TD_9W; size of LIT-PCBA = Active + LIT-PCBA inactives; DUD-E = Active + TD_Top50.
 

1. Data set

Dataset A: ChEMBL v25. Rules: (1) normalize the fragments, charges, isotopes, stereochemistry, and tautomeric states of the ligands in the ChEMBL dataset; (2) keep molecules that contain only H, C, N, O, F, S, Cl, and Br atoms and have fewer than 50 heavy atoms; (3) split the remaining ligands into a training set and a test set (1,347,173 and 149,679 ligands, respectively) at a 9:1 ratio for the cRNN model. Dataset A is used to train the cRNN.

Dataset B: LIT-PCBA. The ligands in this dataset have been experimentally verified, and the set is relatively unbiased for constructing and benchmarking MLSFs. A subset of 10 targets (ALDH1, ESR1_ant, FEN1, GBA, KAT2A, MAPK1, MTORC1, PKM2, TP53, and VDR) was selected as dataset B. The TocoDecoy dataset is generated from the active molecules and targets in LIT-PCBA.

Dataset C: the TocoDecoy dataset, generated from the active molecules in dataset B.

Dataset D: since the targets selected in dataset B have no corresponding DUD-E datasets, and the decoy generation strategy of the TD set (a subset of dataset C) is similar to that of DUD-E, the TD set is used as a substitute for the DUD-E dataset.

2. Data set generation

(1) The six physicochemical properties of each "seed" (active) ligand (MW, molecular weight; logP, octanol-water partition coefficient; RB, number of rotatable bonds; HBA, number of hydrogen-bond acceptors; HBD, number of hydrogen-bond donors; HAL, number of halogens) are fed into a conditional recurrent neural network (cRNN) model to generate property-matched decoys. The cRNN generates 200 valid, non-duplicate decoys for each active ligand.

(2) Although the generated decoys should have physicochemical properties similar to, and topological structures different from, their seed (active) ligands, there are always exceptions. Decoys are therefore discarded unless, relative to the active ligand, they satisfy all of the following: (i) MW ± 40 Da, (ii) logP ± 1.5, (iii) RB ± 1, (iv) HBA ± 1, (v) HBD ± 1, (vi) HAL ± 1, and (vii) Dice similarity (DS) < 0.4.
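The property-window and topology filter in step (2) can be sketched in plain Python; the property dictionaries and fingerprint bit sets are hypothetical stand-ins for RDKit descriptors and ECFP on-bits, and the names are illustrative:

```python
# Property windows from step (2): a decoy must stay within these bounds of its seed.
WINDOWS = {"MW": 40.0, "logP": 1.5, "RB": 1, "HBA": 1, "HBD": 1, "HAL": 1}

def dice_similarity(fp_a, fp_b):
    """Dice similarity between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    return 2 * len(fp_a & fp_b) / (len(fp_a) + len(fp_b))

def keep_decoy(active_props, decoy_props, active_fp, decoy_fp, ds_cutoff=0.4):
    """Retain a decoy only if every property is within its window AND the topology differs."""
    in_window = all(abs(active_props[k] - decoy_props[k]) <= w for k, w in WINDOWS.items())
    return in_window and dice_similarity(active_fp, decoy_fp) < ds_cutoff
```

A decoy with matched properties but identical fingerprint (DS = 1.0) is rejected, as is one whose MW drifts beyond 40 Da.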

(3) Compute the ECFP and t-SNE vectors of each molecule, then perform grid filtering to remove the similarity bias caused by near-duplicate structures; the retained decoys form the topology decoy set (TD), and their docking conformations are obtained by molecular docking of the structurally preprocessed proteins and ligands. Specifically, RDKit computes the ECFP fingerprint of each ligand, and the t-SNE algorithm in scikit-learn non-linearly maps the 2048-bit ECFP to a two-dimensional vector so that the distribution of the ligands in chemical space can be visualized. The minimum and maximum of each dimension are computed, and each dimension of the 2D vector is split into intervals of fixed step size. The resulting grid over the two-dimensional chemical space keeps only one ligand per cell, removing the other topologically similar compounds. The similarity bias before and after grid filtering is also examined by visualizing TocoDecoy's chemical space: the sparser the distribution of points in chemical space, the smaller the similarity bias of the dataset.
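The grid-filtering step above can be sketched in plain Python; the 2D coordinates stand in for t-SNE output, and n_bins=300 would correspond to the 300 × 300 (9W) setting described later. The function name and structure are illustrative, not the authors' code:

```python
from collections import OrderedDict

def grid_filter(points, n_bins=300):
    """Keep at most one point per cell of an n_bins x n_bins grid over 2D coordinates.

    points: list of (x, y) t-SNE coordinates; returns the indices of retained ligands.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    x_step = (x_max - x_min) / n_bins or 1.0  # guard against a degenerate axis
    y_step = (y_max - y_min) / n_bins or 1.0
    kept = OrderedDict()
    for i, (x, y) in enumerate(points):
        cell = (min(int((x - x_min) / x_step), n_bins - 1),
                min(int((y - y_min) / y_step), n_bins - 1))
        kept.setdefault(cell, i)  # the first ligand landing in each cell survives
    return sorted(kept.values())
```

Fewer bins mean coarser cells, so more near-duplicates collapse into the same cell and the filtered dataset shrinks, matching the behavior described for the 9W vs. 100W settings.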

(4) Filter the docking conformations of the active ligands according to the target-specific docking-score thresholds listed in Table S1. Conformations that fail the docking-score threshold (poorly scoring poses) are retained as decoy conformations, yielding the conformation decoy set (CD).

(5) Finally, the TD and CD sets are integrated into the final TocoDecoy data set.

3. Task

First, the TocoDecoy dataset is generated from the active molecules in LIT-PCBA; then the hidden biases in LIT-PCBA and TocoDecoy are systematically examined, including artificial enrichment, similarity bias, domain bias, and non-causal bias. Evaluation is performed with a descriptor-based XGBoost model and an end-to-end graph-based IGN model. The XGBoost algorithm directly outputs the predicted label of a sample, while the IGN model outputs the probability that a sample is active or inactive.

MLSFs evaluation indicators:

In the BED_ROC formula, n is the number of actives, N is the total number of ligands, ri is the rank of the i-th active ligand, Ra is the proportion of actives in the dataset, and α (80.5 in this study) is a parameter that controls the weight given to early recognition.

BED_ROC (α = 80.5) is used to evaluate the screening power of the MLSFs, since one of the main applications of SFs is screening: the ability of a scoring function to identify true binders of a given target protein from a pool of random molecules. Unlike binding-affinity prediction, screening power focuses on enriching potentially active molecules among the top-scoring molecules.
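A minimal sketch of the BED_ROC computation, following the standard Truchon-Bayly definition and the symbols defined above; ranks are the 1-based positions of the actives in the score-sorted list, and the function name is illustrative:

```python
import math

def bedroc(active_ranks, n_total, alpha=80.5):
    """BED_ROC from the 1-based ranks of the actives among n_total ranked ligands."""
    n = len(active_ranks)          # number of actives
    ra = n / n_total               # proportion of actives, Ra
    # RIE: exponentially weighted average over active ranks, normalized by its
    # expectation under a uniform random ranking.
    s = sum(math.exp(-alpha * r / n_total) for r in active_ranks)
    rie = (s / n) / ((1.0 / n_total) * (1.0 - math.exp(-alpha))
                     / (math.exp(alpha / n_total) - 1.0))
    # Map RIE onto the [0, 1] BED_ROC scale.
    return (rie * ra * math.sinh(alpha / 2.0)
            / (math.cosh(alpha / 2.0) - math.cosh(alpha / 2.0 - alpha * ra))
            + 1.0 / (1.0 - math.exp(alpha * (1.0 - ra))))
```

With α = 80.5 the weight falls off very quickly with rank, so only the very top of the ranked list contributes: actives ranked first give a value near 1, actives ranked last give a value near 0.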

Hidden Biases evaluation indicators:

The DOE score evaluates the artificial enrichment caused by imbalanced distributions of physicochemical properties between actives and decoys; the smaller the DOE score, the lower the degree of artificial enrichment.

Domain bias is evaluated with the diversity index I(D); the larger the I(D) value, the greater the diversity of the ligands in the dataset:

In Equation 6, xi,j,norm is the normalized value of physicochemical property j of compound i, and xj,95% and xj,5% are the 95th and 5th percentiles of property j, respectively. In Equation 9, a, m, i, and n denote active ligand a, the number of active ligands, decoy i, and the number of decoys, respectively. In Equation 10, D is the size of the dataset, x and y are ligands in the dataset, |mx ∩ my| is the number of molecular fingerprint features (ECFP) shared by ligands x and y, and |mx ∪ my| is the total number of their fingerprint features.

3. Experiment

1. Artificial enrichment

In TocoDecoy, the distributions of the number of rotatable bonds (RB), hydrogen-bond acceptors (HBA), hydrogen-bond donors (HBD), and halogens (HAL) of the actives are close to those of the decoys.

TocoDecoy has a lower DOE score and therefore outperforms LIT-PCBA in this respect.

2. Similarity bias and domain bias

For each active ligand, decoys are kept so as to give an active-to-decoy ratio of 1:100. In addition, the original TocoDecoy dataset is passed through grid filters with 90,000 (i.e., 300 × 300) and 1,000,000 (i.e., 1000 × 1000) grids, yielding the TocoDecoy_9W and TocoDecoy_100W datasets. The size of the dataset decreases as the number of grids decreases. Since deep learning requires large amounts of data, the 90,000-grid setting was used for strict filtering to eliminate similarity bias, while the 1,000,000-grid setting was used for loose filtering to keep enough ligands to train a reliable IGN model.

To explore the impact of the grid filters on the dataset, we visualized the chemical space of the dataset: a 2048-bit ECFP carrying topological information was generated for each molecule, and the t-SNE algorithm reduced the 2048-dimensional ECFP to a two-dimensional vector for plotting. Taking ALDH1 and MAPK1 as examples, the biased and debiased chemical spaces of TocoDecoy and LIT-PCBA are shown in the figure.

Before debiasing, the distribution of the ligands in chemical space is non-uniform, with many structurally similar ligands stacked together. Furthermore, TocoDecoy covers a larger chemical space than LIT-PCBA (a wider area in the plot), suggesting that TocoDecoy contains less domain bias than LIT-PCBA.

To further explore the impact of the grid filters on debiasing, we trained IGN models on these datasets and tested them on the LIT-PCBA test set; their performance is shown in Figure 4.

Models trained on the debiased datasets (TocoDecoy_9W and TocoDecoy_100W) outperformed the model trained on the biased TocoDecoy dataset for most targets. We therefore conclude that: (1) TocoDecoy contains less domain bias than LIT-PCBA; (2) the grid filter helps eliminate similarity bias; and (3) models trained on the grid-filtered, debiased datasets generalize better than the model trained on the similarity-biased dataset.

3. Non-causal bias

Performance of XGBoost models trained on the Glide SF energy terms of the CD set and TocoDecoy. TD, CD, TC, and LI denote the TD set, CD set, TocoDecoy set, and LIT-PCBA, respectively. The dataset before the @ is the training set and the dataset after the @ is the test set; for example, the F1 score in the TC@LI column is the performance of a model trained on TocoDecoy and tested on the LIT-PCBA test set. The CD and TD sets are extracted from the TocoDecoy_9W set.
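The F1 scores reported in these tables can be computed from predicted and true labels with the usual definition; a minimal stdlib sketch (1 = active, 0 = decoy/inactive), with illustrative names:

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = active, 0 = decoy/inactive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Because actives are heavily outnumbered by decoys (1:100 in TocoDecoy), F1 is more informative here than plain accuracy, which a trivial all-inactive predictor could inflate.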

The XGBoost model trained on the CD training set performs well on the CD test set but poorly on the other test sets, indicating that a model trained only on the CD set is biased toward docking-score-based classification and has limited generalization ability.

When trained on TocoDecoy, the model's performance on the CD test set drops, which means that adding the TD set helps minimize the non-causal bias introduced by the docking score. Similarly, XGBoost models were trained on the ECFPs of the ligands in TD and TocoDecoy, respectively, and tested on the test sets of LIT-PCBA, CD, TD, and TocoDecoy.

Performance of XGBoost models trained on the ECFPs of the TD set and TocoDecoy. TD, CD, TC, and LI denote the TD set, CD set, TocoDecoy set, and LIT-PCBA, respectively. The dataset before the @ is the training set and the dataset after the @ is the test set; for example, the F1 score in the TC@LI column is the performance of a model trained on TocoDecoy and tested on the LIT-PCBA test set. The CD and TD sets are extracted from the TocoDecoy_9W set.

The ECFP-based XGBoost model trained on the TD set performs much worse on the LIT-PCBA, CD, and TocoDecoy test sets than on the TD test set.

4. Performance of models trained on different data sets in simulated virtual screening

IGN models were trained on the various datasets and tested on the LIT-PCBA test set. To better gauge the generalization ability of models trained on TocoDecoy, the authors also trained IGN models on DUD-E and LIT-PCBA as controls. As shown in Figure 6A, the IGN models outperform Glide SP in F1 score and BED_ROC, indicating that MLSFs outperform traditional SFs in virtual screening. The model trained on LIT-PCBA achieves better BED_ROC and precision than the models trained on TocoDecoy and DUD-E, because the data distribution of the LIT-PCBA training set is more similar to that of the LIT-PCBA test set than those of TocoDecoy and DUD-E. The model trained on DUD-E is clearly biased toward separating actives from inactives through differences in molecular topology and cannot generalize to the LIT-PCBA test set. Likewise, models trained on TocoDecoy do not generalize perfectly in virtual screening. However, the TocoDecoy-trained model outperforms the DUD-E-trained model in F1 score, BED_ROC, and precision, indicating relatively better generalization.

As shown in Figure 6B, on nine of the ten targets (all except ESR1_ant), the model trained on TocoDecoy achieved higher F1 scores than the model trained on DUD-E. On five of the target datasets, the model trained on LIT-PCBA predicts worse than the model trained on TocoDecoy.

4. Summary

TocoDecoy was compared with the traditional DUD-E dataset and with LIT-PCBA, a hidden-bias-free dataset suitable for evaluating MLSFs. Across the four hidden biases, TocoDecoy showed equal or fewer hidden biases than the other two datasets. In the simulated virtual screening experiments, the prediction accuracy of models trained on the different datasets ranks as LIT-PCBA ≈ TocoDecoy > DUD-E. Although models trained on TocoDecoy only match those trained on LIT-PCBA, the TocoDecoy dataset is scalable, making it feasible to generate unbiased TocoDecoy datasets large enough for MLSF modeling and benchmarking.

Origin blog.csdn.net/justBeHerHero/article/details/132549335