rdkit&nlp | Using regular expressions to tokenize SMILES

Tokenizing SMILES chemical strings

A Chinese corpus is a collection of short or long texts, such as sentences, article abstracts, paragraphs, or entire articles. Written Chinese has no spaces between words: the characters within sentences and paragraphs run together continuously, yet they group into meaningful words. For text mining and analysis we usually want the word to be the smallest unit of processing, so the text first has to be split into words by word segmentation.

Similarly, when processing chemical text such as SMILES strings, the smallest units we want to work with are atoms (elements) and bonds.
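To see why a plain character-by-character split is not enough, here is a minimal sketch. The molecule is a hypothetical example chosen for illustration, and the "desired" token list is written by hand rather than produced by any tokenizer.

```python
# Hypothetical example molecule (not from the original post).
smi = "ClCC[NH3+]"

# Naive character-level split: breaks "Cl" and the bracket atom into pieces.
char_tokens = list(smi)

# Desired split, written by hand: one token per atom.
atom_tokens = ["Cl", "C", "C", "[NH3+]"]

# Both splits concatenate back to the same string; only the granularity differs.
assert "".join(char_tokens) == "".join(atom_tokens) == smi

print(char_tokens)
print(atom_tokens)
```

The character-level split turns chlorine into a carbon followed by a meaningless "l", which is the kind of error a chemistry-aware tokenizer has to avoid.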
 

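import re

def smi_tokenizer(smi):
    """
    Tokenize a SMILES molecule or reaction
    """
    # Raw string avoids invalid-escape warnings in recent Python versions.
    # Bracket atoms such as [C@@H], and Br and Cl, are each kept as a single token.
    pattern = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
    regex = re.compile(pattern)
    tokens = regex.findall(smi)
    # Fails if some character of the input was not matched by the pattern
    assert smi == ''.join(tokens)
    return ' '.join(tokens)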
    
smi_tokenizer("CC(=O)OCC1=C(C(=O)O)N2C(=O)[C@@H](NC(=O)C(OC(C)=O)c3ccccc3)[C@H]2SC1>O>CC1=C(C(=O)O)N2C(=O)[C@@H](N)[C@H]2SC1")

'C C ( = O ) O C C 1 = C ( C ( = O ) O ) N 2 C ( = O ) [C@@H] ( N C ( = O ) C ( O C ( C ) = O ) c 3 c c c c c 3 ) [C@H] 2 S C 1 > O > C C 1 = C ( C ( = O ) O ) N 2 C ( = O ) [C@@H] ( N ) [C@H] 2 S C 1'
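The example above does not exercise every branch of the pattern. The following sketch repeats the pattern from the function above (as a raw string) and runs it on a few short inputs covering two-letter halogens, a charged bracket atom, a two-digit ring closure, and the `/` and `\` bond symbols. The helper name `tokenize` and the test strings are assumptions made for illustration, not part of the original post.

```python
import re

# Same pattern as in smi_tokenizer, written as a raw string.
SMI_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)

def tokenize(smi):
    """Return the SMILES tokens as a list (hypothetical helper)."""
    return SMI_REGEX.findall(smi)

# Hypothetical inputs, one per feature of the pattern.
examples = [
    "ClCCBr",        # two-letter elements Cl and Br
    "C[N+](C)(C)C",  # charged bracket atom
    "C%10CCCCC%10",  # two-digit ring closure %10
    "F/C=C\\F",      # directional bonds / and \
]

for smi in examples:
    tokens = tokenize(smi)
    # Round trip: joining the tokens must give back the input.
    assert "".join(tokens) == smi
    spaced = " ".join(tokens)
    # Removing the spaces from the spaced form also recovers the SMILES.
    assert spaced.replace(" ", "") == smi
    print(spaced)
```

Because the space-separated form contains no information beyond the token boundaries, the original SMILES can always be recovered by deleting the spaces.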


Origin blog.csdn.net/weixin_43236007/article/details/109671590