Implementing ECFP with deepchem

1. Source structure:

class CircularFingerprint(Featurizer):
  def __init__(self, radius=2, size=2048, chiral=False, bonds=True,
             features=False, sparse=False, smiles=False):

  def _featurize(self, mol):
    return fp
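
The outline above does not show how the base class connects the two methods. The following is a minimal stand-in, not deepchem's actual code: `ToyFeaturizer` and `ToyLengthFingerprint` are hypothetical names, and the sketch assumes that calling a featurizer loops over its input and applies `_featurize` to each item.

```python
class ToyFeaturizer:
    """Hypothetical stand-in for a Featurizer-style base class."""

    def featurize(self, mols):
        # Iterate over the input and featurize each item separately.
        return [self._featurize(mol) for mol in mols]

    def _featurize(self, mol):
        # Subclasses implement the per-molecule work here.
        raise NotImplementedError

    def __call__(self, mols):
        # Calling the object is shorthand for featurize().
        return self.featurize(mols)


class ToyLengthFingerprint(ToyFeaturizer):
    """Toy subclass: the 'fingerprint' is just the length of the input string."""

    def _featurize(self, mol):
        return len(mol)


engine = ToyLengthFingerprint()
wrapped = engine(["CCO"])   # one item in a list -> one result
unwrapped = engine("CCO")   # a bare string is iterated character by character
```

Under this assumption the input has to be an iterable of molecules. A bare string is silently split into characters here, whereas an object that is not iterable at all (such as an RDKit Mol) would raise a TypeError instead.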

Parameters:

__init__

radius int, optional (default 2); Fingerprint radius.
size int, optional (default 2048); Length of generated bit vector, i.e. the number of features.
chiral bool, optional (default False); Whether to consider chirality in fingerprint generation.
bonds bool, optional (default True); Whether to consider bond order in fingerprint generation.
features bool, optional (default False); Whether to use feature information instead of atom information; see RDKit docs for more info.
sparse bool, optional (default False); Whether to return a dict for each molecule containing the sparse fingerprint.
smiles bool, optional (default False); Whether to calculate SMILES strings for fragment IDs (only applicable when calculating sparse fingerprints).

_featurize

mol RDKit Mol; Molecule.
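
To make the relationship between `sparse` and `size` more concrete, here is a rough sketch of how a sparse fingerprint (a dict of fragment identifiers) could be folded into a bit vector of fixed length. The helper `fold_to_bits` is hypothetical, the identifiers are made up, and the modulo folding rule is an assumption for illustration rather than a statement about what deepchem or RDKit do internally.

```python
def fold_to_bits(sparse_fp, size):
    """Fold a sparse {identifier: count} fingerprint into a list of `size` bits."""
    bits = [0] * size
    for identifier in sparse_fp:
        # Assumed folding rule: the bit position is the identifier modulo size.
        bits[identifier % size] = 1
    return bits


# Made-up identifiers and counts, for illustration only.
sparse_fp = {98513984: 2, 2245384272: 1, 864662311: 1}
dense_fp = fold_to_bits(sparse_fp, size=2048)

# Positions of the bits that are switched on.
on_bits = [i for i, bit in enumerate(dense_fp) if bit]
```

With a rule like this, two different identifiers can land on the same bit, so a smaller `size` gives a more compact vector at the cost of more collisions.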

2. Implementation:

smiles_file (type: pandas.DataFrame)

compound_ID smiles
D06396 N(C1C2CCCN(CC1)C2)C(=O)c1cc(c(cc1OC)N)Cl
D06056 c1(c2c([nH]c1)ccc(c2)OC)/C=N/NC(=N)NCCCCC
import pandas as pd
from rdkit import Chem
from deepchem.feat import fingerprints as fp

size = 2048  # fingerprint length; also used for the column names below

# Build the featurizer once, outside the loop. Do not pass mol here:
# TypeError: __init__() got an unexpected keyword argument 'mol'
engine = fp.CircularFingerprint(radius=2, size=size, chiral=False,
                                bonds=True, features=False, sparse=False, smiles=False)

ID_list = []
error_ID_list = []  # IDs whose SMILES RDKit could not parse
feature_list = []
for index, smiles in zip(smiles_file['compound_ID'], smiles_file['smiles']):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # MolFromSmiles returns None for an invalid SMILES
        error_ID_list.append(index)
        continue
    # Wrap the molecule in a list; otherwise: TypeError: 'Mol' object is not iterable
    feature = engine([mol])  # result looks like [array([0, 0, 0, ..., 0, 0, 0])]
    ID_list.append(index)
    # Use extend, not append: append gives [[array([0, 0, 0, ..., 0, 0, 0])], ...],
    # and building the DataFrame then fails with ValueError: Must pass 2-d input
    feature_list.extend(feature)

ID_feature_df = pd.DataFrame(feature_list, index=ID_list)
ID_feature_df.columns = ['feature_{0}'.format(i) for i in range(size)]
ID_feature_df.index.name = 'compound_ID'

ID_feature_df.to_csv('***')  # save the result to a file
print('------------------Done--------------------------------------')
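
The difference between `append` and `extend` noted in the code comments can be reproduced without deepchem. In this small sketch `toy_engine` is a hypothetical stand-in that, like the featurizer call above, returns a list holding one vector per input molecule; plain lists replace the numpy arrays.

```python
def toy_engine(mols):
    """Hypothetical stand-in: return one fixed 4-bit vector per input item."""
    return [[0, 1, 0, 0] for _ in mols]


appended, extended = [], []
for smiles in ["CCO", "CCN"]:
    feature = toy_engine([smiles])   # a list wrapping a single vector
    appended.append(feature)         # keeps the wrapper: one extra nesting level
    extended.extend(feature)         # unpacks the wrapper: one row per molecule
```

`extended` is a flat list of rows, which is the 2-D shape a DataFrame constructor expects, while `appended` carries an extra level of nesting.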

Result:

compound_ID feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 ... feature_2038 feature_2039 feature_2040 feature_2041 feature_2042 feature_2043 feature_2044 feature_2045 feature_2046 feature_2047
D06396 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
D06056 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

Saving the file may raise OSError: [Errno 28] No space left on device.

Solution: process the data in batches.
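
The post does not say how the batching was done, so the following is only one possible sketch using the standard library. The helpers `batches` and `write_in_batches` and the `features_NNNN.csv` file naming are hypothetical; the idea is to write one small CSV file per batch so that each file can be moved or compressed before the next one is produced.

```python
import csv
import os


def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def write_in_batches(rows, header, out_dir, batch_size):
    """Write `rows` to numbered CSV files, one file per batch; return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for number, batch in enumerate(batches(rows, batch_size)):
        path = os.path.join(out_dir, "features_{0:04d}.csv".format(number))
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)   # repeat the header in every file
            writer.writerows(batch)
        paths.append(path)
    return paths
```

In the script above, the same slicing could be applied to the rows of `smiles_file`, so that only one batch of fingerprints is held in memory and written at a time.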

References:

MoleculeNet: Models and Featurizations

ECFP source code in deepchem


Reposted from blog.csdn.net/weixin_41171061/article/details/83746038