[bug record] file name problem in df.to_csv()

This afternoon I split a data set: molecules with the same parent nucleus go into one CSV file, and each new CSV file is named after its parent nucleus. But the totals in the split files never matched the original, unsplit data set. I was speechless. After a long, nauseating troubleshooting session, I finally traced the problem to the following line of code:

df_match.to_csv('drawimages/data/diffscaffold/{}.csv'.format(csv_name), index=False)

At first glance it looks fine, and it does store data, but there is a pit in it. If two parent nuclei differ only in the capitalization of some atoms, like the two below (they are not the same molecule), only one of them gets stored (big pit)! Reason: Windows treats file names that differ only in capitalization as the same file, so only one of them can exist and the later to_csv() call overwrites the earlier one. So some of the data is never stored, and the counts never match.

O=C(NCc1ccccc1)c1c(Cc2ccccc2)sc2c1CCOC2

O=C(NCC1CCCCC1)c1c(Cc2ccccc2)sc2c1CCOC2
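A quick way to check for this kind of clash before writing any files is to group the names by their case-folded form. The sketch below is only an illustration: the helper name `find_case_collisions` is made up for this post, and it assumes the names are plain strings such as the two SMILES above.

```python
from collections import defaultdict


def find_case_collisions(names):
    """Return groups of distinct names that become equal once case is ignored."""
    groups = defaultdict(list)
    # dict.fromkeys drops exact duplicates while keeping the original order
    for name in dict.fromkeys(names):
        groups[name.casefold()].append(name)
    return [group for group in groups.values() if len(group) > 1]


cores = [
    "O=C(NCc1ccccc1)c1c(Cc2ccccc2)sc2c1CCOC2",
    "O=C(NCC1CCCCC1)c1c(Cc2ccccc2)sc2c1CCOC2",
]
collisions = find_case_collisions(cores)
for group in collisions:
    print("These names would map to the same file on Windows:", group)
```

If the returned list is non-empty, at least two output files would overwrite each other on a case-insensitive file system.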

The fix is to name each file by its index instead of by the parent nucleus:

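from rdkit import Chem

# df, core_list and smiles_list are assumed to be defined earlier;
# smiles_list must be in the same row order as df.
for i, core in enumerate(core_list):
    # Parse the parent nucleus once per core, not once per molecule
    core_mol = Chem.MolFromSmiles(core)
    matches = []
    for j, smiles in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smiles)
        # MolFromSmiles returns None for SMILES it cannot parse; skip those
        if mol is not None and mol.HasSubstructMatch(core_mol):
            matches.append(j)

    # Select the rows that contain the parent nucleus and save them to a new CSV file
    df_match = df.iloc[matches]
    df_match.to_csv('drawimages/data/diffscaffold/{}.csv'.format(i+1), index=False)
    print("{}.csv is done! ".format(i+1))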
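One downside of numbered files is that the file name no longer tells you which parent nucleus it holds. A possible workaround is to write a small index file next to them. This is only a sketch: the helper `write_core_index` and the file name `index.csv` are made up for illustration, and it uses the standard `csv` module and a temporary directory so that it runs on its own.

```python
import csv
import os
import tempfile


def write_core_index(core_list, out_dir):
    """Write index.csv mapping each numbered file name to its parent nucleus SMILES."""
    index_path = os.path.join(out_dir, "index.csv")
    with open(index_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "core_smiles"])
        # Same numbering as the loop above: the first core goes to 1.csv
        for i, core in enumerate(core_list, start=1):
            writer.writerow(["{}.csv".format(i), core])
    return index_path


core_list = [
    "O=C(NCc1ccccc1)c1c(Cc2ccccc2)sc2c1CCOC2",
    "O=C(NCC1CCCCC1)c1c(Cc2ccccc2)sc2c1CCOC2",
]
out_dir = tempfile.mkdtemp()  # stand-in for the real output folder
index_path = write_core_index(core_list, out_dir)

# Read the index back to check what was written
with open(index_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)
```

Since the SMILES are stored inside the file and not in the file name, the capitalization is preserved on any file system.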

Origin blog.csdn.net/weixin_43135178/article/details/130508700