DeepChem Tutorial 19: Screening ZINC for HIV Inhibitors

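In this tutorial we learn how to use DeepChem to efficiently screen a large compound library (ZINC). Screening a large compound library with machine learning is a CPU-bound, embarrassingly parallel problem. The code examples I use assume that the available resource is a single very large machine (such as an AWS c5.18xlarge), but other systems (such as a supercomputing cluster) can be swapped in. At a high level, we will: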

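Build a machine learning model from labeled data.
Transform ZINC into "work units".
Load the "work units" into a "work queue".
Consume the "work units" from the "work queue".
Gather the results.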

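This tutorial differs from the previous ones in that it is designed to run on AWS rather than on Google Colab, because we need access to a large machine with many cores to do the computation efficiently. We will walk through each step in detail.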

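1. Train a model on labeled data

Here we just build a simple model. For a real problem you would probably try several models and do some hyperparameter search.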

In [1]:

from deepchem.molnet.load_function import hiv_datasets

/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.

warnings.warn(msg, category=FutureWarning)

RDKit WARNING: [18:15:24] Enabling RDKit 2019.09.3 jupyter extensions

/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

_np_qint8 = np.dtype([("qint8", np.int8, 1)])

/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

_np_quint8 = np.dtype([("quint8", np.uint8, 1)])

/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

_np_qint16 = np.dtype([("qint16", np.int16, 1)])

/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

_np_quint16 = np.dtype([("quint16", np.uint16, 1)])

/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

_np_qint32 = np.dtype([("qint32", np.int32, 1)])

/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

np_resource = np.dtype([("resource", np.ubyte, 1)])

/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

_np_qint8 = np.dtype([("qint8", np.int8, 1)])

/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

_np_quint8 = np.dtype([("quint8", np.uint8, 1)])

/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

_np_qint16 = np.dtype([("qint16", np.int16, 1)])

/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

_np_quint16 = np.dtype([("quint16", np.uint16, 1)])

/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

_np_qint32 = np.dtype([("qint32", np.int32, 1)])

/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

np_resource = np.dtype([("resource", np.ubyte, 1)])

In [2]:

from deepchem.models import GraphConvModel

from deepchem.data import NumpyDataset

from sklearn.metrics import average_precision_score

import numpy as np

tasks, all_datasets, transformers = hiv_datasets.load_hiv(featurizer="GraphConv")

train, valid, test = [NumpyDataset.from_DiskDataset(x) for x in all_datasets]

model = GraphConvModel(1, mode="classification")

model.fit(train)

Loading raw samples now.

shard_size: 8192

About to start loading CSV from /var/folders/st/ds45jcqj2232lvhr0y9qt5sc0000gn/T/HIV.csv

Loading shard 1 of size 8192.

Featurizing sample 0

Featurizing sample 1000

Featurizing sample 2000

Featurizing sample 3000

Featurizing sample 4000

Featurizing sample 5000

Featurizing sample 6000

Featurizing sample 7000

Featurizing sample 8000

TIMING: featurizing shard 0 took 12.479 s

Loading shard 2 of size 8192.

Featurizing sample 0

Featurizing sample 1000

Featurizing sample 2000

Featurizing sample 3000

Featurizing sample 4000

Featurizing sample 5000

Featurizing sample 6000

Featurizing sample 7000

Featurizing sample 8000

TIMING: featurizing shard 1 took 13.668 s

Loading shard 3 of size 8192.

Featurizing sample 0

Featurizing sample 1000

Featurizing sample 2000

Featurizing sample 3000

Featurizing sample 4000

Featurizing sample 5000

Featurizing sample 6000

Featurizing sample 7000

Featurizing sample 8000

TIMING: featurizing shard 2 took 13.550 s

Loading shard 4 of size 8192.

Featurizing sample 0

Featurizing sample 1000

Featurizing sample 2000

Featurizing sample 3000

Featurizing sample 4000

Featurizing sample 5000

Featurizing sample 6000

Featurizing sample 7000

Featurizing sample 8000

TIMING: featurizing shard 3 took 13.173 s

Loading shard 5 of size 8192.

Featurizing sample 0

Featurizing sample 1000

Featurizing sample 2000

RDKit WARNING: [18:16:53] WARNING: not removing hydrogen atom without neighbors

Featurizing sample 3000

Featurizing sample 4000

Featurizing sample 5000

Featurizing sample 6000

Featurizing sample 7000

Featurizing sample 8000

TIMING: featurizing shard 4 took 13.362 s

Loading shard 6 of size 8192.

Featurizing sample 0

TIMING: featurizing shard 5 took 0.355 s

TIMING: dataset construction took 80.394 s

Loading dataset from disk.

TIMING: dataset construction took 16.676 s

Loading dataset from disk.

TIMING: dataset construction took 7.529 s

Loading dataset from disk.

TIMING: dataset construction took 7.796 s

Loading dataset from disk.

TIMING: dataset construction took 17.521 s

Loading dataset from disk.

TIMING: dataset construction took 7.770 s

Loading dataset from disk.

TIMING: dataset construction took 7.873 s

Loading dataset from disk.

TIMING: dataset construction took 15.495 s

Loading dataset from disk.

TIMING: dataset construction took 1.959 s

Loading dataset from disk.

TIMING: dataset construction took 1.949 s

Loading dataset from disk.

WARNING:tensorflow:From /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.

Instructions for updating:

Call initializer instance with the dtype argument instead of passing it to the constructor

WARNING:tensorflow:Entity <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a3e35c048>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a3e35c048>>: AttributeError: module 'gast' has no attribute 'Num'

WARNING: Entity <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a3e35c048>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a3e35c048>>: AttributeError: module 'gast' has no attribute 'Num'

WARNING:tensorflow:Entity <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a41856e80>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a41856e80>>: AttributeError: module 'gast' has no attribute 'Num'

WARNING: Entity <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a41856e80>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a41856e80>>: AttributeError: module 'gast' has no attribute 'Num'

WARNING:tensorflow:Entity <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a49f5aa90>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a49f5aa90>>: AttributeError: module 'gast' has no attribute 'Num'

WARNING: Entity <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a49f5aa90>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphConv.call of <deepchem.models.layers.GraphConv object at 0x1a49f5aa90>>: AttributeError: module 'gast' has no attribute 'Num'

WARNING:tensorflow:Entity <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a43f5d198>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a43f5d198>>: AttributeError: module 'gast' has no attribute 'Num'

WARNING: Entity <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a43f5d198>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphPool.call of <deepchem.models.layers.GraphPool object at 0x1a43f5d198>>: AttributeError: module 'gast' has no attribute 'Num'

WARNING:tensorflow:Entity <bound method GraphGather.call of <deepchem.models.layers.GraphGather object at 0x1a43f3a940>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphGather.call of <deepchem.models.layers.GraphGather object at 0x1a43f3a940>>: AttributeError: module 'gast' has no attribute 'Num'

WARNING: Entity <bound method GraphGather.call of <deepchem.models.layers.GraphGather object at 0x1a43f3a940>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method GraphGather.call of <deepchem.models.layers.GraphGather object at 0x1a43f3a940>>: AttributeError: module 'gast' has no attribute 'Num'

WARNING:tensorflow:From /Users/bharath/Code/deepchem/deepchem/models/layers.py:222: The name tf.unsorted_segment_sum is deprecated. Please use tf.math.unsorted_segment_sum instead.

WARNING:tensorflow:From /Users/bharath/Code/deepchem/deepchem/models/layers.py:224: The name tf.unsorted_segment_max is deprecated. Please use tf.math.unsorted_segment_max instead.

WARNING:tensorflow:Entity <bound method TrimGraphOutput.call of <deepchem.models.graph_models.TrimGraphOutput object at 0x1a41a9ecf8>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method TrimGraphOutput.call of <deepchem.models.graph_models.TrimGraphOutput object at 0x1a41a9ecf8>>: AttributeError: module 'gast' has no attribute 'Num'

WARNING: Entity <bound method TrimGraphOutput.call of <deepchem.models.graph_models.TrimGraphOutput object at 0x1a41a9ecf8>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method TrimGraphOutput.call of <deepchem.models.graph_models.TrimGraphOutput object at 0x1a41a9ecf8>>: AttributeError: module 'gast' has no attribute 'Num'

WARNING:tensorflow:From /Users/bharath/Code/deepchem/deepchem/models/keras_model.py:169: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From /Users/bharath/Code/deepchem/deepchem/models/optimizers.py:76: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From /Users/bharath/Code/deepchem/deepchem/models/keras_model.py:258: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /Users/bharath/Code/deepchem/deepchem/models/keras_model.py:260: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

WARNING:tensorflow:From /Users/bharath/Code/deepchem/deepchem/models/losses.py:108: The name tf.losses.softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.softmax_cross_entropy instead.

WARNING:tensorflow:From /Users/bharath/Code/deepchem/deepchem/models/losses.py:109: The name tf.losses.Reduction is deprecated. Please use tf.compat.v1.losses.Reduction instead.

WARNING:tensorflow:From /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/math_grad.py:318: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.

Instructions for updating:

Use tf.where in 2.0, which has the same broadcast rule as np.where

/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_util.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.

"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

Out[2]:

0.0

In [3]:

y_true = np.squeeze(valid.y)

y_pred = model.predict(valid)[:,0,1]

print("Average Precision Score:%s" % average_precision_score(y_true, y_pred))

sorted_results = sorted(zip(y_pred, y_true), reverse=True)

hit_rate_100 = sum(x[1] for x in sorted_results[:100]) / 100

print("Hit Rate Top 100: %s" % hit_rate_100)

Average Precision Score:0.19783388433313015

Hit Rate Top 100: 0.37
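The hit rate printed above is the fraction of true actives among the highest-scoring molecules. A minimal sketch of that calculation as a reusable function follows; the function name `hit_rate_at_k` and the toy labels and scores are made up for illustration and are not part of DeepChem.

```python
def hit_rate_at_k(y_true, y_score, k):
    """Fraction of true actives among the k highest-scoring molecules."""
    # Rank (score, label) pairs by score only, best first.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(label for _, label in top) / len(top)


# Toy data (hypothetical): six molecules, two of them active.
labels = [0, 1, 0, 1, 0, 0]
scores = [0.1, 0.9, 0.8, 0.7, 0.2, 0.3]

top3 = hit_rate_at_k(labels, scores, 3)
baseline = sum(labels) / len(labels)  # hit rate of a random pick
print("hit rate top 3: %.3f, baseline: %.3f" % (top3, baseline))
```

Comparing the top-k hit rate with the baseline fraction of actives gives a rough sense of how much the model enriches the top of the ranking.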

Retrain on the full dataset for screening

In [29]:

tasks, all_datasets, transformers = hiv_datasets.load_hiv(featurizer="GraphConv", split=None)

model = GraphConvModel(1, mode="classification", model_dir="/tmp/zinc/screen_model")

model.fit(all_datasets[0])

Loading raw samples now.

shard_size: 8192

About to start loading CSV from /tmp/HIV.csv

Loading shard 1 of size 8192.

Featurizing sample 0

Featurizing sample 1000

Featurizing sample 2000

Featurizing sample 3000

Featurizing sample 4000

Featurizing sample 5000

Featurizing sample 6000

Featurizing sample 7000

Featurizing sample 8000

TIMING: featurizing shard 0 took 15.701 s

Loading shard 2 of size 8192.

Featurizing sample 0

Featurizing sample 1000

Featurizing sample 2000

Featurizing sample 3000

Featurizing sample 4000

Featurizing sample 5000

Featurizing sample 6000

Featurizing sample 7000

Featurizing sample 8000

TIMING: featurizing shard 1 took 15.869 s

Loading shard 3 of size 8192.

Featurizing sample 0

Featurizing sample 1000

Featurizing sample 2000

Featurizing sample 3000

Featurizing sample 4000

Featurizing sample 5000

Featurizing sample 6000

Featurizing sample 7000

Featurizing sample 8000

TIMING: featurizing shard 2 took 19.106 s

Loading shard 4 of size 8192.

Featurizing sample 0

Featurizing sample 1000

Featurizing sample 2000

Featurizing sample 3000

Featurizing sample 4000

Featurizing sample 5000

Featurizing sample 6000

Featurizing sample 7000

Featurizing sample 8000

TIMING: featurizing shard 3 took 16.267 s

Loading shard 5 of size 8192.

Featurizing sample 0

Featurizing sample 1000

Featurizing sample 2000

Featurizing sample 3000

Featurizing sample 4000

Featurizing sample 5000

Featurizing sample 6000

Featurizing sample 7000

Featurizing sample 8000

TIMING: featurizing shard 4 took 16.754 s

Loading shard 6 of size 8192.

Featurizing sample 0

TIMING: featurizing shard 5 took 0.446 s

TIMING: dataset construction took 98.214 s

Loading dataset from disk.

TIMING: dataset construction took 21.127 s

Loading dataset from disk.

/home/leswing/miniconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.

"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

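2. Create work units

Download all of the ZINC15 dataset

Go to http://zinc15.docking.org/tranches/home and download all non-empty tranches in .smi format. I found it easiest to download the wget script and then run it. For the rest of this tutorial we assume ZINC was downloaded to /tmp/zinc.

The way ZINC splits its downloads is not well suited to inference. We want each work unit to take a reasonable amount of time on a single CPU (10 minutes to 1 hour). To accomplish this, we split the ZINC data into files of 500,000 lines each.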

mkdir -p /tmp/zinc/screen

find /tmp/zinc -name '*.smi' -exec cat {} \; | grep -iv "smiles" \
  | split -l 500000 - /tmp/zinc/screen/segment

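This bash command:

finds all .smi files,
writes their contents to stdout,
removes the header lines,
splits the data into multiple files in /tmp/zinc/screen, each containing 500,000 molecules.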
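For readers who prefer to stay in Python, the same steps can be sketched with the standard library. This is only an illustration under stated assumptions: the function name `split_lines`, the numeric file suffixes (the `split` utility uses alphabetic suffixes instead), and the demo lines with their ZINC ids are all hypothetical.

```python
import itertools
import os
import tempfile


def split_lines(lines, out_dir, chunk_size, prefix="segment"):
    """Write non-header lines to numbered files of at most chunk_size lines."""
    # Mirror `grep -iv smiles`: drop any line containing "smiles" in any case.
    kept = (ln for ln in lines if "smiles" not in ln.lower())
    paths = []
    for i in itertools.count():
        # Pull the next chunk lazily so the whole input never sits in memory.
        chunk = list(itertools.islice(kept, chunk_size))
        if not chunk:
            break
        path = os.path.join(out_dir, "%s%04d" % (prefix, i))
        with open(path, "w") as fout:
            fout.writelines(chunk)
        paths.append(path)
    return paths


# Made-up demo input: two concatenated files, each with its own header line.
demo = ["smiles zinc_id\n", "CCO ZINC1\n", "CCN ZINC2\n",
        "smiles zinc_id\n", "CCC ZINC3\n"]
out_dir = tempfile.mkdtemp()
paths = split_lines(demo, out_dir, chunk_size=2)
print([os.path.basename(p) for p in paths])
```

For the real data, `lines` would be an iterator chained over all downloaded .smi files and `chunk_size` would be 500000.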

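3. Create the inference script

Now that we have work units, we need a program that ingests a work unit and logs the result. It is important that the logging mechanism is thread safe! In this example we take the work unit as a file path and log the result to a file. An easy extension for distributing over multiple machines would be to get the work unit via a URL and log the results to a distributed queue.

It looks roughly like this.

inference.py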

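import sys

import deepchem as dc
import numpy as np
from rdkit import Chem


def create_dataset(fname, batch_size=50000):
    """Yield (dataset, original_lines) batches featurized from a .smi file."""
    featurizer = dc.feat.ConvMolFeaturizer()
    mols, orig_lines = [], []
    with open(fname) as fin:
        for line in fin:
            line = line.strip().split()
            try:
                mol = Chem.MolFromSmiles(line[0])
            except Exception:
                # Skip blank or malformed lines.
                continue
            if mol is None:
                # RDKit could not parse the SMILES string.
                continue
            mols.append(mol)
            orig_lines.append(line)
            if len(mols) == batch_size:
                features = featurizer.featurize(mols)
                # Placeholder labels; only the features are used for prediction.
                y = np.ones(shape=(len(mols), 1))
                ds = dc.data.NumpyDataset(features, y)
                yield ds, orig_lines
                mols, orig_lines = [], []
    # Emit the final, partially filled batch.
    if len(mols) > 0:
        features = featurizer.featurize(mols)
        y = np.ones(shape=(len(mols), 1))
        ds = dc.data.NumpyDataset(features, y)
        yield ds, orig_lines


def evaluate(fname):
    """Score every molecule in fname and append the results to <fname>_out.smi."""
    fout_name = "%s_out.smi" % fname
    # Relative path: run the script from the directory that contains the
    # saved model (it was trained with model_dir="/tmp/zinc/screen_model").
    model = dc.models.TensorGraph.load_from_dir('screen_model')
    for ds, lines in create_dataset(fname):
        y_pred = np.squeeze(model.predict(ds), axis=1)
        # Each work unit writes to its own output file, so processes never share a file.
        with open(fout_name, 'a') as fout:
            for index, line in enumerate(lines):
                # Append the predicted probability of the active class.
                line.append(y_pred[index][1])
                line = [str(x) for x in line]
                line = "\t".join(line)
                fout.write("%s\n" % line)


if __name__ == "__main__":
    evaluate(sys.argv[1])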
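The batching pattern used by create_dataset can be looked at in isolation, without DeepChem or RDKit. The sketch below is an illustration only: `batched_records` and the stand-in `parse` function are hypothetical names, and the demo lines are made up. In the real script the parse step is `Chem.MolFromSmiles`.

```python
def batched_records(lines, batch_size, parse):
    """Group successfully parsed lines into lists of at most batch_size."""
    batch = []
    for line in lines:
        fields = line.strip().split()
        if not fields:
            continue  # blank line
        record = parse(fields)
        if record is None:
            continue  # unparseable line, skipped like an invalid SMILES
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit whatever is left over as a final, smaller batch.
    if batch:
        yield batch


def parse(fields):
    # Stand-in parser: accept only lines with a SMILES and an id column.
    return fields if len(fields) >= 2 else None


demo_lines = ["CCO ZINC1", "", "bad", "CCN ZINC2", "CCC ZINC3"]
batches = list(batched_records(demo_lines, 2, parse))
print([len(b) for b in batches])
```

Keeping the batch size fixed bounds the memory used per work unit, since only one batch of featurized molecules is held at a time.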

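4. Load work units into a work queue

We will use a flat file as our distribution mechanism: a bash script that invokes our inference script once for each work unit. If you are at an academic institution, this step would instead mean queuing your jobs with pbs/qsub/slurm. An option for cloud computing would be RabbitMQ or Kafka.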

In [ ]:

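import os

work_dir = '/tmp/zinc/screen'
work_units = os.listdir(work_dir)
with open('/tmp/zinc/work_queue.sh', 'w') as fout:
    fout.write("#!/bin/bash\n")
    for work_unit in work_units:
        # The work units live in /tmp/zinc/screen, so build the path from there.
        full_path = os.path.join(work_dir, work_unit)
        # One command per line; process_pool.py reads the file line by line.
        fout.write("python inference.py %s\n" % full_path)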

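5. Consume work units from the "distribution mechanism"

We consume work units from the work queue with a very simple process pool. It takes lines from our "work queue" and runs them, running as many processes in parallel as we have CPUs. If you use a supercomputing cluster system such as pbs/qsub/slurm, it will handle this for you. The key is to use one CPU per work unit to get the highest throughput. We use the Linux utility "taskset" to do this.

On an AWS c5.18xlarge this completes overnight.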

process_pool.py

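import multiprocessing
import sys
from multiprocessing.pool import Pool

import delegator  # third-party package, installed as "delegator.py"


def run_command(args):
    """Run one shell command pinned to a free CPU core."""
    q, command = args
    # Check out a free CPU id; blocks until one is available.
    cpu_id = q.get()
    try:
        # taskset pins the process to a single core.
        command = "taskset -c %s %s" % (cpu_id, command)
        print("running %s" % command)
        c = delegator.run(command)
        print(c.err)
        print(c.out)
    except Exception as e:
        print(e)
    finally:
        # Always return the CPU id so another work unit can use it.
        q.put(cpu_id)


def main(n_processors, command_file):
    with open(command_file) as fin:
        commands = [x.strip() for x in fin]
    # Skip blank lines and comments such as the "#!/bin/bash" header.
    commands = [x for x in commands if x and not x.startswith("#")]
    # A Manager queue can be shared with the pool's worker processes.
    q = multiprocessing.Manager().Queue()
    for i in range(n_processors):
        q.put(i)
    argslist = [(q, x) for x in commands]
    pool = Pool(processes=n_processors)
    pool.map(run_command, argslist)


if __name__ == "__main__":
    processors = multiprocessing.cpu_count()
    main(processors, sys.argv[1])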

>> python process_pool.py /tmp/zinc/work_queue.sh

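6. Gather results

Since we logged our results to *_out.smi files, we now need to gather them all and sort them by predicted score. The resulting file can be larger than 40 GB. To analyze it further you could use dask, or load the data into a PostgreSQL database with the RDKit cartridge.

Here I show how to concatenate and sort the data to get the best results.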

find /tmp/zinc -name '*_out.smi' -exec cat {} \; > /tmp/zinc/screen/results.smi

sort -rg -k 3,3 /tmp/zinc/screen/results.smi > /tmp/zinc/screen/sorted_results.smi

# Put the top 100k scoring molecules in their own file

head -n 100000 /tmp/zinc/screen/sorted_results.smi > /tmp/zinc/screen/top_100k.smi

/tmp/zinc/screen/top_100k.smi is now small enough to investigate with standard tools such as pandas.
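If sorting the full results file is too slow or too disk-heavy, the top N rows can also be selected in a single streaming pass. The sketch below uses only the Python standard library and assumes that each result line is whitespace-separated with the score in the third column, as inference.py writes it. The function name `top_n_lines` and the demo rows are hypothetical.

```python
import heapq


def top_n_lines(lines, n, score_col=2):
    """Return the n lines with the highest score, best first."""
    def score(line):
        return float(line.split()[score_col])

    # nlargest keeps only n candidates in memory while streaming the input.
    non_blank = (ln for ln in lines if ln.strip())
    return heapq.nlargest(n, non_blank, key=score)


# Made-up rows in the same layout: SMILES, id, predicted score.
demo = ["CCO\tZINC1\t0.12\n", "CCN\tZINC2\t0.98\n", "CCC\tZINC3\t0.55\n"]
best = top_n_lines(demo, 2)
print([ln.split()[1] for ln in best])
```

An open file handle can be passed as `lines`, so the results file is read line by line and never loaded whole.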

In [9]:

from rdkit import Chem

from rdkit.Chem.Draw import IPythonConsole

from IPython.display import SVG

from rdkit.Chem.Draw import rdMolDraw2D

best_mols = [Chem.MolFromSmiles(x.strip().split()[0]) for x in open('/tmp/zinc/screen/top_100k.smi').readlines()[:100]]

best_scores = [x.strip().split()[2] for x in open('/tmp/zinc/screen/top_100k.smi').readlines()[:100]]

In [10]:

print(best_scores[0])

best_mols[0]

0.98874843

Out[10]:

In [11]:

print(best_scores[0])

best_mols[1]

0.98874843

Out[11]:

In [12]:

print(best_scores[0])

best_mols[2]

0.98874843

Out[12]:

In [13]:

print(best_scores[0])

best_mols[3]

0.98874843

Out[13]:

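The screen seems to favor molecules containing sulfur trioxide groups. The top-scoring molecules also have low diversity. When creating a "buy list" we want to optimize for more than just activity, for instance diversity and drug-like MPO.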
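One common way to add diversity to a ranked list is greedy MaxMin selection on fingerprint similarity. The sketch below is not part of the original workflow. It assumes each molecule is already represented as a set of "on" fingerprint bit indices (in practice these would come from a fingerprinting tool such as RDKit), and the names `tanimoto` and `pick_diverse` as well as the toy fingerprints are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def pick_diverse(fingerprints, n_pick):
    """Greedy MaxMin: start with item 0, then repeatedly add the candidate
    whose highest similarity to the already picked items is lowest."""
    if not fingerprints or n_pick <= 0:
        return []
    picked = [0]
    # nearest[i] = highest similarity of item i to anything picked so far.
    nearest = [tanimoto(fp, fingerprints[0]) for fp in fingerprints]
    while len(picked) < min(n_pick, len(fingerprints)):
        candidates = [i for i in range(len(fingerprints)) if i not in picked]
        nxt = min(candidates, key=lambda i: nearest[i])
        picked.append(nxt)
        for i, fp in enumerate(fingerprints):
            nearest[i] = max(nearest[i], tanimoto(fp, fingerprints[nxt]))
    return picked


# Toy fingerprints, ordered by predicted score (best first).
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}, {1, 2, 3, 4, 5}]
print(pick_diverse(fps, 3))
```

Because the input is ordered by score, the best-scoring molecule is always kept and later picks trade score for dissimilarity. This simple version is quadratic in the number of molecules, so it suits a shortlist such as the top 100k rather than the full screen.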
