Recent Advances in Structured Organization: Quantitative Structural Biology

Author: Zen and the Art of Computer Programming

1. Introduction

Structured organization is an emerging research direction in modern biology. Its core concepts and methods are applied across fields such as microbiology, cell differentiation, and molecular biology. This article reviews the latest progress in the study of structured organizations and explores the topic in depth.

2. Overview of Quantitative Structural Biology

2.1 Concepts and definitions

Structured organization refers to a biological community composed of many distinct biological systems that interact according to specific rules and patterns, forming a complex organizational network. Structured organizations generally have the following characteristics:

  1. Diversity: the biological systems within a structured organization differ widely from one another, and this heterogeneity is a natural advantage.

  2. Dynamics: structured organizations can self-repair and adapt to environmental change, which makes them robust contributors to ecological balance.

  3. High degree of parallelism: each member of a structured organization evolves independently while competing, coexisting, and co-evolving with the others.

  4. Special functions: under particular circumstances, structured organizations can perform unique functions, such as exploiting food resources to establish communities or sharing resources with humans.

The key words of quantitative structural biology are "quantification" and "structure": for biological communities or organizational structures characterized through measurement or experiment, structured models can be used to predict some of their behavioral characteristics or lifespans.

2.2 Methodology

The study of structured organization has gradually become a hot topic in the biological community, and many theories, methodologies and tools have emerged. The methodology of quantitative structural biology mainly includes the following aspects:

  1. Model construction: establish a quantitative, descriptive model of a biological community or organizational structure through statistical analysis, computer simulation, and simulated experiments, so that the complex system can be analyzed and predicted.

  2. Data acquisition: structured organizations, scientific research centers, and laboratories collaborate to collect enough data to simulate and describe actual biological communities or organizational structures accurately.

  3. Model verification: the established quantitative model must be rigorously validated before it is used in actual research, to avoid erroneous predictions caused by incorrect assumptions.

  4. Model-based prediction: make fast and accurate predictions about unknown biological communities or organizational structures from the established models.

3. Explanation of basic concepts and terms

3.1 Quantitative structure

Quantitative structure refers to the positional distribution and arrangement of the components within a complex system; it is a quantitative description of the organizational characteristics of a biological system. It is usually represented symbolically, through graphics, tables, images, or text, expressing a specific organizational structure and the relationships among all components in the organization. Quantitative structure is therefore a description in symbolic form, distinct from sequence-based characterizations of a system's core or cellular structure.

3.2 Relationship networks

A relationship network refers to the complex network structure of a biological community or organization. It is a graph composed of interconnected nodes and edges, where each edge represents a relationship between two nodes, such as kinship, reproduction, signaling, or metabolism. A relationship network can capture enormous complexity, since it may contain millions of nodes and hundreds of millions of edges. Structured organizations can be modeled, and their complexity managed, through such networks.
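As a concrete illustration, the sketch below builds a toy relationship network with the NetworkX library; the node names and relation labels are invented for the example, not taken from real data.

```python
import networkx as nx

# Build a toy relationship network: nodes are biological entities,
# edges carry a "relation" label (kinship, signaling, metabolic, ...).
G = nx.Graph()
G.add_edge("cell_A", "cell_B", relation="signaling")
G.add_edge("cell_B", "cell_C", relation="metabolic")
G.add_edge("cell_A", "cell_C", relation="kinship")
G.add_edge("cell_C", "cell_D", relation="signaling")

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
# Degree reflects how connected each member of the community is.
for node, deg in G.degree():
    print(node, "degree:", deg)
```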

3.3 Molecular dynamics

Molecular Dynamics (MD) is a method that uses computers to simulate the motion of atoms and molecules. Its basic principle is to compute the forces acting on a set of atoms from an interaction potential, including contributions such as bonded interactions, electrostatics, and solvation, and then numerically integrate the equations of motion to obtain the atoms' trajectories. Molecular dynamics can be used to study complex, high-dimensional molecular systems such as solids, liquids, and gases.
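To make the core loop concrete, here is a minimal sketch of an MD-style integrator, using velocity Verlet on a single particle in a one-dimensional harmonic potential; the potential and all parameters are chosen purely for illustration.

```python
# Velocity-Verlet integration of one particle in a 1-D harmonic
# potential U(x) = 0.5 * k * x**2, so the force is F(x) = -k * x.
k, m, dt = 1.0, 1.0, 0.01   # spring constant, mass, time step (illustrative)
x, v = 1.0, 0.0             # initial position and velocity

def force(x):
    return -k * x

f = force(x)
for _ in range(1000):
    x += v * dt + 0.5 * (f / m) * dt ** 2   # position update
    f_new = force(x)
    v += 0.5 * (f + f_new) / m * dt         # velocity update
    f = f_new

# Sanity check: total energy should stay close to its initial value (0.5).
energy = 0.5 * m * v ** 2 + 0.5 * k * x ** 2
print(f"x={x:.3f}, v={v:.3f}, energy={energy:.4f}")
```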

3.4 Clustering analysis

Clustering analysis divides the samples in a data set into groups, or clusters, based on their similarity, so that the distance between samples in the same group is small and the distance between different groups is large. Cluster analysis has important applications in biology, sociology, finance, health care, artificial intelligence, and other fields.
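The within-group/between-group distance criterion can be checked numerically; the sketch below uses two invented groups of 2-D points.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two invented groups of 2-D samples around well-separated centers.
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
group_b = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(50, 2))

def mean_pairwise_distance(X, Y):
    # Average Euclidean distance over all pairs drawn from X and Y.
    diffs = X[:, None, :] - Y[None, :, :]
    return np.linalg.norm(diffs, axis=-1).mean()

within = mean_pairwise_distance(group_a, group_a)
between = mean_pairwise_distance(group_a, group_b)
print(f"within-group distance ~ {within:.2f}, between-group ~ {between:.2f}")
# A good clustering keeps the first number small and the second large.
```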

3.5 Random walk

Random Walk (RW) is a simulation method based on probability and statistics, and one of the most commonly used tools for studying probability distributions. The basic idea of a random walk is to start from an initial state and repeatedly take randomly chosen steps until a terminal state is reached. By running many random walks, we can approximate a probability distribution and obtain a series of results that reflect the evolution of the system.
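A one-dimensional walk makes the idea concrete; the sketch below simulates many independent walks and checks their empirical mean and variance (the step count and walk count are invented for illustration).

```python
import random

def random_walk(steps=100):
    # Start at the origin; each step moves +1 or -1 with equal probability.
    position = 0
    for _ in range(steps):
        position += random.choice([-1, 1])
    return position

# Many independent walks approximate the distribution of final states.
finals = [random_walk() for _ in range(10_000)]
mean = sum(finals) / len(finals)
var = sum((x - mean) ** 2 for x in finals) / len(finals)
print(f"mean ~ {mean:.2f}, variance ~ {var:.1f} (theory: 0 and 100)")
```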

3.6 Bayesian networks

A Bayesian Network is a probabilistic model that represents a joint probability distribution as a directed network of dependencies among a set of variables. Bayesian networks can be used for inference, learning, and identification. By constraining the network structure and the form of the conditional probability distributions, they can describe the interactions among variables and carry out efficient probabilistic inference.
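As a small illustration, the sketch below performs exact inference by enumeration on the classic rain/sprinkler toy network; the probability values are textbook-style placeholders, not figures from this article.

```python
from itertools import product

# Toy network: Rain -> GrassWet <- Sprinkler (probabilities are placeholders).
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: 0.1, False: 0.9}
# P(GrassWet=True | Sprinkler, Rain)
P_wet = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.80, (False, False): 0.00}

def joint(rain, sprinkler, wet):
    # Chain rule along the network structure.
    p_w = P_wet[(sprinkler, rain)] if wet else 1.0 - P_wet[(sprinkler, rain)]
    return P_rain[rain] * P_sprinkler[sprinkler] * p_w

# Infer P(Rain=True | GrassWet=True) by summing out the hidden variable.
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
print(f"P(Rain | GrassWet) = {num / den:.3f}")
```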

3.7 Markov Chain

A Markov chain is a stochastic process that predicts future state values based only on the current state value. It is a memoryless dynamic process determined by a state space S, an initial state s0 (or an initial distribution), and a state transition probability matrix P, whose entry P[i][j] gives the probability of moving from state i to state j. In a Markov chain, the next state depends only on the current state, not on the earlier history.
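A minimal two-state example shows the memoryless update in code; the transition probabilities are invented for illustration.

```python
import numpy as np

# Two-state chain: P[i][j] is the probability of moving from state i to j.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

rng = np.random.default_rng(1)
state, counts = 0, np.zeros(2)
for _ in range(100_000):
    # Memorylessness: the next state depends only on the current state.
    state = rng.choice(2, p=P[state])
    counts[state] += 1

print("empirical long-run distribution:", counts / counts.sum())
# This approaches the stationary distribution pi = pi @ P, here [5/6, 1/6].
```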

4. Explanation of core algorithm principles, specific operating steps and mathematical formulas

4.1 PLA algorithm

The PLA algorithm (Partition-Label Algorithm) is one of the commonly used algorithms in quantitative structural biology. The basic idea is to cluster a sample set: divide the samples into several groups so that data points within a group are as similar as possible and data points in different groups are as different as possible. Similarity and difference here can be measured statistically.

Given a sample set X = {x1, x2, ..., xN}, the samples are clustered according to the following rules (a code sketch of this loop follows the list):

  1. Initialize K empty clusters, each associated with a center vector;

  2. Assign each sample xi to the cluster C whose center is closest to it; if no suitable cluster exists, create a new one;

  3. Update the labels of the assigned samples, labeling sample xi with its cluster C;

  4. For each non-empty cluster C, recompute its center vector Ci as the mean of the samples currently assigned to it;

  5. Check the stopping condition: if all sample assignments are stable, stop; otherwise return to step 2;

  6. Return the final classification result.
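Since "PLA" is not a widely standardized algorithm name, the sketch below simply implements the assign/label/update loop from the steps above, assuming Euclidean distance and NumPy; treat it as an illustration rather than a reference implementation.

```python
import numpy as np

def pla_cluster(X, k, max_iter=100, seed=0):
    """Assign/label/update loop from the steps above (Euclidean distance)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()  # step 1
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Steps 2-3: label each sample with its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # step 5: assignments stable
            break
        labels = new_labels
        # Step 4: recompute each non-empty cluster's center vector.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers  # step 6

# Tiny demo on three invented blobs.
X = np.vstack([np.random.default_rng(1).normal(c, 0.3, size=(30, 2))
               for c in ([0, 0], [2, 2], [0, 2])])
labels, centers = pla_cluster(X, k=3)
print("cluster sizes:", np.bincount(labels))
```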

4.2 K-means algorithm

The K-means algorithm (K-Means Clustering Algorithm) is the simplest and most commonly used clustering algorithm in quantitative structural biology. The basic idea is to partition the sample set into K subsets whose internal elements are similar to one another. After enough iterations, the elements of each subset converge around that subset's centroid. K-means is generally more efficient in practice than the PLA algorithm described above.

The algorithm steps are as follows:

  1. Choose K and randomly select K samples as the initial cluster centers C(k);

  2. Repeat the following operations M times:

    a. For each sample x, calculate its distance dk from each center C(k);

    b. Attribute sample x to the center C(k) closest to it;

    c. Re-estimate each center C(k) as the mean of all samples currently assigned to it;

  3. Return the final classification result.

The running time of the algorithm is O(MKND), where N is the number of samples, K the number of cluster centers, D the feature dimension, and M the number of iterations. Like other local-search methods, K-means can converge to a local rather than a global optimum, particularly when the initial centers are chosen poorly.
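In practice one rarely implements K-means by hand; the sketch below uses scikit-learn's KMeans on synthetic data (assuming scikit-learn is installed; the data are invented).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic blobs standing in for measured samples.
X = np.vstack([rng.normal(c, 0.4, size=(100, 2))
               for c in ([0, 0], [4, 0], [2, 3])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("centers:\n", km.cluster_centers_.round(2))
print("inertia (total within-cluster squared distance):", round(km.inertia_, 1))
```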

4.3 Louvain algorithm

The Louvain algorithm (Louvain Community Detection Algorithm) is a community-detection method that is notably scalable and robust compared with many other clustering algorithms. It takes community discovery as its goal and uses a community-level metric, modularity, to measure the closeness between nodes: nodes within a community should be densely connected, while nodes in different communities should be only sparsely connected. The algorithm works by alternating local node moves with network aggregation.

The algorithm steps are as follows:

  1. Construct the network G=(V,E) from its adjacency matrix A;

  2. Place every node in its own community;

  3. Local-move phase: repeatedly visit each node i and move it to the neighboring community that yields the largest gain in modularity

     Q = (1/2m) * sum_{i,j} [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j),

     where m is the total edge weight, k_i is the degree of node i, c_i is the community of node i, and delta(c_i, c_j) equals 1 when nodes i and j share a community and 0 otherwise; stop this phase when no single move improves Q;

  4. Aggregation phase: collapse each community into a single node, summing edge weights, to obtain a smaller network;

  5. Repeat steps 3 and 4 on the aggregated network until modularity no longer improves, then return the final community assignment.

In practice the running time of the Louvain algorithm grows roughly linearly with network size, but because every pass revisits all nodes, the computational load can still be significant on very large networks. The basic algorithm also operates on a static network; dynamic networks, social networks, and structures that change over time require extensions of the method.
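For reference, NetworkX ships a Louvain implementation (version 2.8 or later); a minimal usage sketch on a built-in test graph:

```python
import networkx as nx

# Zachary's karate club is the standard toy graph for community detection.
G = nx.karate_club_graph()

# Louvain greedily maximizes modularity (requires NetworkX >= 2.8).
communities = nx.community.louvain_communities(G, seed=42)
print(f"found {len(communities)} communities")
print("modularity:", round(nx.community.modularity(G, communities), 3))
```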

4.4 GMM algorithm

The GMM algorithm (Gaussian Mixture Model Algorithm) is another commonly used clustering algorithm in quantitative structural biology. The GMM algorithm assumes that the data are generated by a mixture of several Gaussian distributions, each with its own mean vector and covariance matrix. Fitting the model divides the data set into several "mixture components" with corresponding probabilities (weights).

The algorithm steps are as follows:

  1. Choose the number of components K and initialize each Gaussian component's mixing weight, mean vector, and covariance matrix;

  2. Repeat the following until convergence:

    a. E-step: for each sample xi, compute the posterior probability (responsibility) that it belongs to each Gaussian component;

    b. M-step: update the mixing weights and the mean vector μ and covariance matrix Σ of each Gaussian component from the responsibility-weighted samples.

  3. Return the final classification result.

The running time of the algorithm is roughly O(MNKD^2), where N is the number of samples, D the feature dimension, K the number of components, and M the number of iterations; with full covariance matrices each iteration additionally pays O(KD^3) for matrix inversions. When the number of samples or the feature dimension grows quickly, the algorithm can therefore become slow.
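A minimal usage sketch with scikit-learn's GaussianMixture, on invented synthetic data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two invented Gaussian components with different means and spreads.
X = np.vstack([rng.normal([0, 0], 0.5, size=(200, 2)),
               rng.normal([3, 3], 1.0, size=(200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(X)
print("mixing weights:", gmm.weights_.round(2))
print("component means:\n", gmm.means_.round(2))
# predict_proba returns each sample's soft assignment to the components.
print("soft assignment of first sample:", gmm.predict_proba(X[:1]).round(3))
```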
