What is SELFIES?

SELFIES is a new molecular representation with an exciting set of properties and applications.

For a technical discussion of what SELFIES is, see our paper in Machine Learning: Science and Technology and our extensive GitHub repo.

Thanks to numerous excellent suggestions from the community, the SELFIES language has developed significantly since it was introduced in the first preprint in mid-2019. For example, it can now handle all features that exist in SMILES, most of them in a semantically valid way.

Molecular representations in a computer

One central problem in this endeavour is the representation of molecules for computers.

There are two main approaches to representing graphs: adjacency-matrix-based and string-based.
1. (Adjacency matrix, molecular graph) In the former, molecular graphs are represented by adjacency matrices, where every element of the matrix indicates whether two vertices share a connection. This representation is sometimes loosely called a "graph representation". To use it for molecules, the matrix needs to be augmented with a vector specifying the atom species of the individual vertices. The simplicity of adjacency matrix representations, however, comes with severe weaknesses. For instance, there is no natural way to represent the 3-dimensional structure of molecules and many other chemical properties -- which are, of course, essential for the molecule's functionality. (In short: a matrix representation cannot express 3D structure.)
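
As a rough illustration of the adjacency-matrix idea, the sketch below builds the matrix and the accompanying atom-species vector for ethanol. The atom ordering, the choice to leave hydrogens implicit, and the helper name adjacency_matrix are assumptions made for this example only.

```python
# Ethanol (CH3-CH2-OH), heavy atoms only; hydrogens are left implicit.
atoms = ["C", "C", "O"]      # atom-species vector, one entry per vertex
bonds = [(0, 1), (1, 2)]     # undirected edges as (vertex, vertex) pairs


def adjacency_matrix(n_atoms, bonds):
    """Return a symmetric 0/1 matrix with a 1 wherever two vertices are bonded."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


adjacency = adjacency_matrix(len(atoms), bonds)
for species, row in zip(atoms, adjacency):
    print(species, row)
```

Nothing in the matrix or the species vector records where the atoms sit in space, which is the weakness described above.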

2. (SMILES) Another way to represent molecules in the computer is as strings. Strings are a much more powerful representation than adjacency matrices, as -- in principle -- strings can encode arbitrary computer programs. The standard method, SMILES, was developed more than 30 years ago and has been a significant workhorse in chemoinformatics over the decades. SMILES can encode several spatial features of molecules and is sometimes called a 2.5-dimensional representation. However, now, in the era of machine learning in chemistry, SMILES faces significant issues.

The problems stem from the fact that SMILES strings have a complex grammar. When SMILES is used in machine learning models, most of the generated strings are entirely invalid: either they are not syntactically correct, or they violate fundamental constraints from physics and chemistry. Special-case solutions for specific generative models exist, but a universal solution needs to address the representation itself. (In short: strings produced with a string-based representation are often invalid.)

3. (DeepSMILES) The first significant effort to solve the issue of string-based representations for machine learning in chemistry was DeepSMILES by Noel O'Boyle and Andrew Dalke. They redefined SMILES to circumvent several common syntactical errors. While DeepSMILES showed significant improvements over SMILES, not all syntactic mistakes were solved, and semantic mistakes (violations of physical and chemical laws) were not addressed. (In short: DeepSMILES greatly improves on SMILES, but syntax errors cannot be fully avoided and semantic errors remain unsolved.)

4. (SELFIES) This is what we present: SELFIES, a 100% robust, simple, string-based molecular representation (GitHub). SELFIES is as powerful as SMILES (for instance, it can represent 3D features of molecules), is human-readable (we come back to this at the end), and is easier for computers to "understand" (we explain what we mean with several examples below).

The basic idea of SMILES is to represent a chain of atoms. However, molecules are much richer than atom chains. Therefore, SMILES introduces two additional features to indicate branches and rings. Branches are represented as chains of atoms in parentheses that emerge from the main chain. Rings are represented by two matching numbers that mark the two atoms sharing an additional edge.
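
To make the chain, branch and ring conventions concrete, here is a minimal sketch of a reader for a small SMILES subset. It is an illustration written for this text, not a full SMILES parser: the function name parse_simple_smiles is hypothetical, and it assumes single-letter atoms, single-digit ring labels and single bonds only.

```python
def parse_simple_smiles(smiles):
    """Parse a tiny SMILES subset into (atoms, bonds).

    Supported: single-letter atoms, '(' ')' branches, single-digit ring labels.
    Bond orders, charges and two-letter elements are deliberately left out.
    """
    atoms, bonds = [], []
    branch_stack = []   # atoms to return to when a branch closes
    open_rings = {}     # ring label -> index of the atom that opened it
    previous = None     # index of the atom the next atom attaches to

    for char in smiles:
        if char.isalpha():
            atoms.append(char)
            current = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, current))   # chain bond
            previous = current
        elif char == "(":
            branch_stack.append(previous)           # remember the branch point
        elif char == ")":
            previous = branch_stack.pop()           # resume the main chain
        elif char.isdigit():
            if char in open_rings:
                bonds.append((open_rings.pop(char), previous))  # close the ring
            else:
                open_rings[char] = previous         # open the ring
        else:
            raise ValueError(f"unsupported character: {char!r}")
    return atoms, bonds


print(parse_simple_smiles("CC(C)C"))   # isobutane: one branch
print(parse_simple_smiles("C1CC1"))    # cyclopropane: one ring closure
```

Note that the opening and closing markers of a branch or ring can be arbitrarily far apart in the string, so the reader has to keep a stack and a table of open rings while scanning.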

SELFIES is a formal Chomsky type-2 grammar (or, analogously, a finite-state automaton). This can be understood as a small computer program with minimal memory that achieves 100% robust derivation. It is designed with two ideas in mind. First, the non-local features in SMILES (rings and branches) are localized. Instead of indicating the beginning and end of a ring or branch in the string, SELFIES represents rings and branches by their length. After a ring or branch symbol, the subsequent symbol is interpreted as a number that stands for a length. This circumvents many syntactical issues with non-local features.
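
The length-based idea can be sketched with a toy token language. This is not the real SELFIES specification: the alphabet, the branch marker "B", the token-to-length table LENGTH and the function decode are all invented for this illustration. The point is only that when a branch is given by a marker plus a length, every token sequence decodes to a string with balanced parentheses.

```python
ALPHABET = ["C", "N", "O", "B"]   # "B" is the toy branch marker
# Toy rule: the token right after "B" is read as a number, not as an atom.
LENGTH = {symbol: index + 1 for index, symbol in enumerate(ALPHABET)}


def decode(tokens):
    """Turn a toy token list into a SMILES-like string with balanced parentheses."""
    out = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token != "B":
            out.append(token)          # ordinary atom: extend the chain
            i += 1
            continue
        if not out or i + 1 >= len(tokens):
            i += 1                     # nothing to branch from, or no length token
            continue
        n = LENGTH[tokens[i + 1]]
        # The next n tokens form the branch; slicing truncates safely at the end.
        branch = decode(tokens[i + 2 : i + 2 + n])
        if branch:
            out.append("(" + branch + ")")
        i += 2 + n
    return "".join(out)


print(decode(["C", "C", "B", "N", "O", "C", "C"]))   # "N" is read as length 2
print(decode(["C", "B", "O", "N"]))                  # branch cut short by the end
```

Because the closing parenthesis is produced by the decoder rather than written in the input, there is no way to forget it or to place it in the wrong spot.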

Second, physical constraints are encoded by different states of the deriving formal automaton/grammar. As an example, physically, a molecule of the form C=C=C is possible (three carbons connected via double bonds). However, F=O=F is not possible, because fluorine can only form one bond (not two) and oxygen can only form two bonds (not four as in this example). In SELFIES, after compiling a symbol into a part of the graph, the derivation state changes. This can be considered as a minimal memory that ensures the fulfilment of physical constraints.
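
The state-dependent derivation can be illustrated with a toy chain decoder. It is a simplified sketch and not the derivation table of the actual SELFIES library: the valence table MAX_BONDS, the token syntax and the function derive are assumptions made for this example. The only state carried from one symbol to the next is the number of bonds the previous atom can still form.

```python
import re

MAX_BONDS = {"C": 4, "N": 3, "O": 2, "F": 1}   # assumed maximum valences
BOND_ORDER = {"": 1, "=": 2, "#": 3}
BOND_CHAR = {0: "", 1: "", 2: "=", 3: "#"}


def derive(toy_selfies):
    """Derive a SMILES-like chain from tokens such as [C], [=O] or [#N]."""
    tokens = re.findall(r"\[([=#]?)([A-Z][a-z]?)\]", toy_selfies)
    out = []
    free = None   # derivation state: bonds the previous atom can still form
    for bond, atom in tokens:
        if atom not in MAX_BONDS:
            continue                  # symbol unknown to this toy: ignore it
        if free is None:
            order = 0                 # first atom: no bond to draw yet
        elif free == 0:
            break                     # previous atom is saturated: stop deriving
        else:
            # Lower the requested bond order until both atoms can afford it.
            order = min(BOND_ORDER[bond], free, MAX_BONDS[atom])
        out.append(BOND_CHAR[order] + atom)
        free = MAX_BONDS[atom] - order
    return "".join(out)


print(derive("[C][=C][=C]"))   # the double bonds fit and are kept
print(derive("[F][=O][=F]"))   # the double bonds are reduced to single bonds
```

In this toy, the request for F=O=F is not rejected; it is reinterpreted as a chain in which no atom exceeds its valence.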

The effects of these features can be clearly seen when we apply random mutations to the strings that describe MDMA (the example in the first image). We see that mutations are harmful to SMILES in most cases, while all mutated SELFIES correspond to valid molecular graphs. This indicates excellent advantages for generative models.
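
The SMILES half of this experiment can be imitated with a crude sketch. Everything here is an assumption for illustration: the MDMA string was written by hand for this example, the mutation alphabet is arbitrary, and is_balanced checks only parentheses and ring labels, so it undercounts invalid strings (it knows nothing about valence or aromaticity).

```python
import random

MDMA = "CNC(C)Cc1ccc2OCOc2c1"   # a SMILES string for MDMA, written for this sketch
ALPHABET = "CNOc()12="          # characters a mutation may insert


def is_balanced(smiles):
    """Necessary (not sufficient) syntax check: parentheses and ring labels pair up."""
    depth = 0
    open_rings = set()
    for char in smiles:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                return False        # closing a branch that was never opened
        elif char.isdigit():
            open_rings ^= {char}    # toggle: first use opens, second use closes
    return depth == 0 and not open_rings


def mutate(smiles, rng):
    """Replace one randomly chosen character with a random alphabet character."""
    position = rng.randrange(len(smiles))
    return smiles[:position] + rng.choice(ALPHABET) + smiles[position + 1:]


rng = random.Random(0)          # fixed seed so the run is repeatable
mutants = [mutate(MDMA, rng) for _ in range(1000)]
broken = sum(not is_balanced(mutant) for mutant in mutants)
print(f"{broken} of {len(mutants)} single-character mutants fail the syntax check")
```

Even this lenient check rejects a sizeable share of single-character mutants, because changing one parenthesis or ring label leaves its partner elsewhere in the string unmatched.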

SELFIES and de-novo design of molecules

Next, we show one main motivation of SELFIES: The application in de novo computational design of molecules. Here we show three conceptually different approaches that exploit the advantages of SELFIES.

  1. How SELFIES enables advanced combinatorial approaches

  2. How SELFIES can advance genetic algorithms

  3. How SELFIES is used in three different deep generative models

1. STONED: An efficient combinatorial approach

After the development of SELFIES, one question often had us wondering: how powerful is a purely unbiased, combinatorial generative model with SELFIES that exploits random and systematic modifications of molecular strings? To answer it, we recently introduced the STONED algorithm (Superfast Traversal, Optimization, Novelty, Exploration and Discovery), which efficiently explores and interpolates in chemical space. Surprisingly, we show that STONED can perfectly solve many commonly used cheminformatics benchmarks that were previously thought to be challenging problems. Thereby, we demonstrate that the generation of diverse structures, the rediscovery of molecules and other tasks are simple problems that can be solved purely combinatorially with SELFIES. We believe that STONED can be used for practical tasks where data is often scarce and efficiency requirements are high. Important examples include the design of functional materials in materials science and catalysis.

......

Continue reading at the link:

https://aspuru.substack.com/p/molecular-graph-representations-and

Reposted from blog.csdn.net/weixin_43135178/article/details/126977204