Implementing ECFP with deepchem

1. Source structure:

class CircularFingerprint(Featurizer):
  def __init__(self, radius=2, size=2048, chiral=False, bonds=True,
             features=False, sparse=False, smiles=False):
    ...  # body omitted

  def _featurize(self, mol):
    ...  # body omitted; fp is the fingerprint computed for mol
    return fp

Parameters:

__init__
radius	int, optional (default 2); Fingerprint radius.
size	int, optional (default 2048); Length of generated bit vector, i.e. the number of features.
chiral	bool, optional (default False); Whether to consider chirality in fingerprint generation.
bonds	bool, optional (default True); Whether to consider bond order in fingerprint generation.
features	bool, optional (default False); Whether to use feature information instead of atom information; see RDKit docs for more info.
sparse	bool, optional (default False); Whether to return a dict for each molecule containing the sparse fingerprint.
smiles	bool, optional (default False); Whether to calculate SMILES strings for fragment IDs (only applicable when calculating sparse fingerprints).
_featurize
mol	RDKit Mol; Molecule.
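
To make `size` and `sparse` more concrete, here is a rough sketch of how a sparse fingerprint (a set of integer fragment IDs) can be folded into a fixed-length bit vector. It is an illustration only: `fold_fingerprint` is a hypothetical helper that is not part of deepchem or RDKit, the fragment IDs are made up, and modulo folding is assumed as the general idea rather than taken from the deepchem source.

```python
def fold_fingerprint(fragment_ids, size=2048):
    """Fold arbitrary integer fragment IDs into a bit vector of length `size`."""
    bits = [0] * size
    for fragment_id in fragment_ids:
        # Different fragments may land on the same position (a bit collision).
        bits[fragment_id % size] = 1
    return bits


# Made-up fragment identifiers, for illustration only.
sparse_ids = [98513984, 2245384272, 3217380708]
dense = fold_fingerprint(sparse_ids, size=2048)
on_bits = [i for i, bit in enumerate(dense) if bit]
```

A smaller `size` gives a shorter feature vector but makes collisions more likely, which is the trade-off behind the default of 2048.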

2. Implementation:

smiles_file (type: pandas.DataFrame)

compound_ID	smiles
D06396	N(C1C2CCCN(CC1)C2)C(=O)c1cc(c(cc1OC)N)Cl
D06056	c1(c2c([nH]c1)ccc(c2)OC)/C=N/NC(=N)NCCCCC

import pandas as pd
import numpy as np
from rdkit import Chem
from deepchem.feat import fingerprints as fp

size = 2048  # fingerprint length; reused below for the column names

# Build the featurizer once. Never pass mol here:
# TypeError: __init__() got an unexpected keyword argument 'mol'
engine = fp.CircularFingerprint(radius=2, size=size, chiral=False,
                                bonds=True, features=False, sparse=False, smiles=False)

ID_list = []
error_ID_list = []
feature_list = []
for index, smiles in zip(smiles_file['compound_ID'], smiles_file['smiles']):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None for SMILES it cannot parse
        error_ID_list.append(index)
        continue
    mol = [mol]  # without this line: TypeError: 'Mol' object is not iterable
    feature = engine(mol)  # result has the form [array([0, 0, 0, ..., 0, 0, 0])]
    ID_list.append(index)
    # With append the list becomes [[array([0, 0, 0, ..., 0, 0, 0])], ...] and
    # building the DataFrame fails with ValueError: Must pass 2-d input
    feature_list.extend(feature)

ID_feature_df = pd.DataFrame(feature_list, index=ID_list)
vec_name = ['feature_{0}'.format(i) for i in range(0, size)]
ID_feature_df.columns = vec_name
ID_feature_df.index.name = 'compound_ID'

ID_feature_df.to_csv('***')  # save the result to a file
print('------------------Done--------------------------------------')
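
The difference between `extend` and `append` in the loop is easy to get wrong, so here is a small sketch that uses plain Python lists in place of NumPy arrays. `fake_featurize` is a hypothetical stand-in for the featurizer call; it only mimics the fact that the result comes wrapped in an outer one-element container.

```python
def fake_featurize(mols):
    """Stand-in for the featurizer: one fingerprint row per input molecule."""
    return [[0, 1, 0, 0] for _ in mols]


appended, extended = [], []
for mol in ["mol_a", "mol_b"]:
    feature = fake_featurize([mol])  # one-element outer list, like [array([...])]
    appended.append(feature)         # keeps the outer list: three levels of nesting
    extended.extend(feature)         # unwraps it: one flat row per molecule

# appended is [[[0, 1, 0, 0]], [[0, 1, 0, 0]]]  (not 2-D)
# extended is [[0, 1, 0, 0], [0, 1, 0, 0]]      (2-D, one row per molecule)
```

Only the `extended` shape has one row per molecule, which is what the DataFrame constructor needs.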

Result:

compound_ID	feature_0	feature_1	feature_2	feature_3	feature_4	feature_5	feature_6	feature_7	feature_8	...	feature_2038	feature_2039	feature_2040	feature_2041	feature_2042	feature_2043	feature_2044	feature_2045	feature_2046	feature_2047
D06396	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
D06056	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

Saving the file may raise OSError: [Errno 28] No space left on device.

Solution: process the data in batches.
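
One possible way to batch the work, sketched under assumptions: featurize a fixed number of molecules at a time and write each batch to the CSV as soon as it is ready, so the full feature table is never held in memory. The names `batched`, `write_features_in_batches`, and the `featurize` callable are hypothetical helpers introduced for this example, not deepchem APIs.

```python
import csv
from itertools import islice


def batched(rows, batch_size):
    """Yield successive lists of at most `batch_size` items from `rows`."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def write_features_in_batches(records, featurize, out_path, size, batch_size=1000):
    """Featurize (compound_ID, smiles) records batch by batch and write them to a CSV.

    `featurize` is a hypothetical callable: list of SMILES -> list of rows,
    each row being a sequence of `size` values.
    """
    header = ["compound_ID"] + ["feature_{0}".format(i) for i in range(size)]
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for batch in batched(records, batch_size):
            ids = [compound_id for compound_id, _ in batch]
            rows = featurize([smiles for _, smiles in batch])
            for compound_id, row in zip(ids, rows):
                writer.writerow([compound_id] + list(row))
```

With the loop above, `records` could be `zip(smiles_file['compound_ID'], smiles_file['smiles'])` and `featurize` a small wrapper around `Chem.MolFromSmiles` and the `engine` call. Batching limits memory use and allows each batch to go to its own file; it does not reduce the total size of the output.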

