Network Analysis and Machine Learning

network analysis

"Artificial Intelligence and Machine Learning" from Professor Manolis Kellis (Director of MIT Computational Biology)
This lesson mainly introduces networks and graphs.
The main content includes the basics of networks and graphs (network types, related algorithms), network properties (motifs, centrality, etc.), eigenvectors, singular value decomposition (SVD), PCA, Sparse PCA, t-SNE, and more.
I think it is very good! In just an hour and a half, the lecture was far more interesting and inspiring than the linear algebra I learned in my freshman year. Although my linear algebra is weak, that did not stop me from learning and looking things up, and I could mostly follow it in the end.
If you need the video, message me privately and I can send the YouTube link.


  • The main content of this lesson (taken from zoom)

This transcript is a summary of a conversation between Manolis and Zijian on deep learning and networks. The goal of the dialogue was to discuss how networks work, how they are structured, and how to analyze them. Manolis describes how graphs and networks work and how they can be used for inference. He explains how social networks can be used to connect people and understand their activities, and shows how networks can be used to predict the function of genes or other regulatory mechanisms.

Manolis then explains how graphs work: a network is a set of interdependent variables, while a graph is a set of vertices (and edges) that can be used to represent the flow of information between them. He demonstrates weight transfer in networks, showing how multiplying a matrix with a vector yields the new weights of the nodes. He then shows how to search within graphs, how to create a breadth-first search tree, and how to traverse that tree. He discusses the concept of small worlds and how people can find connections in a network even if they do not know each other's locations.

Manolis goes on to explain how networks help us understand the world around us. He talks about preferential attachment and how it affects the way people connect to each other, and how a power-law degree distribution affects the strength of one person's connection to another. He covers the properties of networks: centrality, the number of neighbors, the flow of information, and clustering nodes based on how connected they are. He explains how matrix operations on networks work and how they can be used to understand network structure, and how eigenvectors combine with matrix operations. Finally, he explains how Principal Component Analysis (PCA) works and how it can be used to represent relationships between different websites, and how to partition a network using Laplacian matrices.

1. Introduction to networks and graph foundations and algorithms

Network types (social/bio/comp), Euler/Hamilton, Graphs (matrix/adj)

some biological networks

  1. Regulatory Network : This is a network that expresses the relationships between genes and other molecules that can regulate gene expression. For example, transcription factors (TFs) can bind to specific gene promoter regions to promote or inhibit gene expression.
  2. Metabolic Network : This is a network that describes various metabolic reactions in organisms . In a metabolic network, nodes represent metabolites (such as sugars, fatty acids, etc.), edges represent metabolic reactions, and enzymes (Enzymes) that catalyze these reactions are usually marked on the edges. This network can help us understand the flow of energy and matter in living organisms.
  3. Signaling Network : The signaling network describes the signal transmission process inside and between cells. For example, the receptor protein on the cell membrane can receive external signal molecules, and then transmit the signal to the inside of the cell through a series of enzymatic reactions, thus triggering a series of cellular reactions.
  4. Protein Functional Interaction Network : This network depicts the functional relationships between proteins. Nodes represent proteins, and edges represent functional relationships between two proteins; for example, one protein can activate the activity of another protein.
  5. Co-expression Network : In this network, nodes represent genes, and edges represent the correlation between the expression patterns of two genes. For example, two genes may function in the same biological process if they are expressed at similar levels under various conditions.

  1. Element Identification : In biological networks, the elements that need to be identified may include genes, transcription factors, metabolites, etc. For example, sequences such as "ATTAAT" and "CGCTT" may represent specific transcription factor binding sites (i.e., regulatory motifs) in the DNA sequence.
    • Regulators / Regulatory Motifs / Target genes : In a regulatory network, the elements to be identified include transcription factors (regulators), which regulate the expression of target genes by binding to regulatory motifs.
  2. Using networks to predict expression levels / predict cellular activity / predict Gene Ontology (GO) functional annotation terms : By analyzing biological networks, we can predict gene expression levels, cellular activity states, and gene functional annotations (for example, Gene Ontology, GO, a standard vocabulary for describing gene function).
  3. Inferring networks from functional data : By analyzing functional data such as gene expression data and protein interaction data, we can infer biological networks. For example, expressions such as X = f(A, B) and Y = g(B) may describe how interactions between genes or proteins are captured by mathematical models.
    • Activity patterns / Structure / Function : These refer to different properties of the network, for example, activity patterns in the network (e.g., which genes are activated or repressed at the same time), the structure of the network (e.g., the distribution of nodes and edges), and the function of the network (e.g., which biological process the network accomplishes).
  4. Network Structure Analysis : Analyzing the structure of a network can help us understand how biological processes are organized. For example, hubs (in the node degree distribution) are nodes connected to many other nodes, and they may play a key role in the network. Network motifs are specific patterns that recur in the network and may represent particular regulatory strategies. Functional modules are groups of nodes that work together in the network and may participate in a specific biological process.

Challenges in this field may include data complexity, incompleteness, and noise, computational and storage requirements, and how to derive biologically meaningful interpretations from complex network patterns, among others.

  • Compared with reality, a network is a more abstract concept. For example, an edge may not physically exist but only represent a relationship between things, and a node can, in a sense, be just a random variable.
  • Edges Represent Relationships Between Data Points : In more abstract network models, edges can represent relationships between data points. For example, in a social network, edges can represent friendships or interactions between people. In physics, edges can represent forces between particles. In machine learning, edges can represent the similarity between samples. Such networks are commonly used in fields such as social network analysis, systems biology, and recommender systems.
  • Nodes as Probabilistic Variables : In the most abstract case, the nodes themselves can be viewed as probabilistic variables. This representation is especially common in probabilistic graphical models, such as Bayesian networks or Markov random fields. In this model, each node represents a random variable, and edges represent the probabilistic relationship between these random variables. These networks have applications in a variety of complex prediction and inference tasks, including fields such as natural language processing, computer vision, and bioinformatics.

  1. Physical Networks : In this type of network, edges represent actual physical interactions between nodes . For example, in a physical regulatory network, an edge might represent a physical association between a transcription factor (a protein) and a segment of DNA that regulates the expression of one or more genes. Network models of this type are often based on experimental data such as protein-protein interaction data or transcription factor binding site data.
  2. Relevance Networks : In a Relevance Network, the weight of an edge represents the similarity between nodes. For example, in a functional regulatory network, if two genes have similar expression patterns under multiple environmental conditions, there may be an edge between them, and the weight of this edge indicates the similarity of their expression patterns. This type of network model is usually based on computational methods, such as correlation calculations or mutual information calculations, which can be applied to various types of data, such as gene expression data or phenotype data.

Each type of network has its advantages and disadvantages. Physical networks can provide clear information on biological mechanisms, but may be incomplete or biased due to experimental technical limitations. Correlation networks can provide large-scale, system-level insights, but can make it difficult to explain the specific mechanisms underlying observed patterns. Therefore, practical studies often combine these two networks to gain a more comprehensive and accurate understanding.

Probabilistic networks and graphical models are a way of using graphs to represent the probabilistic relationships between variables. This approach is useful in a variety of data analysis and machine learning tasks. Here is an introduction to the two types of probabilistic graphical models mentioned:

  1. Bayesian Networks : This is a directed graph model in which each node represents a random variable and each edge represents a probabilistic dependence between variables. Edges typically point from cause variables to effect variables, so such networks can represent "cause-effect" relationships. For example, if a disease causes certain symptoms, then in a Bayesian network there would be an edge from the "disease" node to the "symptom" node. Bayesian networks are commonly used for causal inference, prediction, classification, and other tasks.
  2. Markov Random Fields (MRFs) : This is an undirected graph model in which each node represents a random variable and each edge represents a probabilistic relationship between variables. Compared with Bayesian networks, Markov random fields do not emphasize cause-effect relationships but rather the correlation between variables. For example, if a person's friends in a social network all like a certain style of music, then that person may also like that style of music. Markov random fields are often used for structured prediction, such as image segmentation and image restoration.

The concept of a graph is more mathematical.

An edge of an undirected graph is represented as an unordered set {u, v}.

An edge of a directed graph is represented as an ordered tuple (u, v).

Representing networks with graphs and graphs with adjacency matrices

The teacher introduced multiplying the adjacency matrix by a weight vector, which corresponds to the diffusion of weights in the graph. This was an idea I had never heard before, so I asked GPT.

In graph theory and network analysis, matrix multiplication and diffusion of node weights are closely related. To visualize this relationship, let us consider a simple example.

Suppose we have a network of three nodes. We can represent this network by an adjacency matrix, whose elements represent the connections between nodes. For example, if there is a connection between node 1 and node 2, then there is a nonzero value in the first row, second column of the adjacency matrix. This value may be 1 (indicating a connection) or another value (indicating the strength or weight of the connection).

Assuming each node has a weight, we can represent these weights with a vector. For example, [w1, w2, w3].

When we multiply the adjacency matrix by the weight vector, we get a new vector that represents the diffusion or propagation of the weights through the network. Specifically, each element of the new vector is the dot product (i.e., the sum of element-wise products) of the corresponding row of the adjacency matrix with the original weight vector. This captures how each node's weight is diffused to the nodes connected to it.

For example, if the weight of node 1 is w1, then in the new weight vector the value at node 2 includes the product of w1 and the entry in row 2, column 1 of the adjacency matrix (which represents the connection between node 1 and node 2). This is how the weight of node 1 spreads to node 2.

In this way, matrix multiplication can be used to model the diffusion of weights in a network, which is very useful in many applications, such as simulating the propagation of information in social networks, or the propagation of signals in biological networks, etc.

Note that the explanation here assumes a simple linear diffusion model. In more complex cases, the diffusion model may need to consider other factors, such as time delays, capacity constraints of nodes, etc. In addition, if the connections of the network are directed, the direction of the connections also needs to be considered when calculating the new weight vector.
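Here is a minimal sketch of this idea (my own toy example, assuming numpy is installed): a 3-node adjacency matrix multiplied by a weight vector, so each node collects the weights of its neighbors.

```python
import numpy as np

# Toy 3-node undirected network: nodes 1-2 and 2-3 are connected.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
w = np.array([1.0, 2.0, 3.0])   # initial node weights [w1, w2, w3]

new_w = A @ w                   # each node sums the weights of its neighbors
print(new_w)                    # [2. 4. 2.]
# Using (I + A) @ w instead lets each node also keep its own weight.
```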

Adjacency lists can be used to represent sparse graphs (sparse matrices), for which they are a more suitable representation.

Breadth-first search (shortest paths), Depth-first search (conn. compnts)
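Below is a minimal plain-Python sketch of both algorithms (my own illustration, assuming the graph is given as an adjacency-list dict): BFS for unweighted shortest paths and DFS for connected components.

```python
from collections import deque

def bfs_shortest_paths(adj, source):
    """Unweighted shortest-path lengths from source via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def connected_components(adj):
    """Connected components via depth-first search."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.append(u)
            stack.extend(adj[u])
        components.append(comp)
    return components

adj = {0: [1], 1: [0, 2], 2: [1], 3: []}   # toy graph: path 0-1-2 plus isolated node 3
print(bfs_shortest_paths(adj, 0))          # {0: 0, 1: 1, 2: 2}
print(connected_components(adj))           # [[0, 1, 2], [3]]
```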

2. Emergent global / local network properties in the real world

Global: Small-world, scale-free, clustering, motifs, similarity, dynamics

These are several emergent properties of networks, determined by the network's structure and connectivity rather than by individual nodes or edges. Here's an explanation of each:

  1. Small-world property : This property describes that most nodes can be connected to any other node by a relatively short path. This attribute gets its name from the well-known "six degrees of separation theory," which states that any two people can find a connection through a chain of six steps or fewer. The importance of this property is that it accounts for the rapidity with which information can be propagated between nodes, even in large networks.
  2. 'Scale-free' or power-law degree distribution : This property describes that most nodes in the network have few connections, while a small number of nodes (called hubs) have many connections. It is like the 80/20 rule, where 80% of the connections involve 20% of the nodes. The importance of this property is that it explains both the robustness and the vulnerable points of the network, since hub nodes play an outsized role.
  3. Network growth and preferential attachment : This is a model of dynamic network growth, in which new nodes tend to link to nodes that already have more connections, which is the phenomenon of "the rich get richer". This model can explain why the network exhibits the property of scale-free or power-law degree distribution.
  4. Clustering coefficient : This metric measures whether the neighbors of a node are also connected to each other. In other words, it measures the density of closed "triangles" (three nodes all connected to each other) relative to open "triangles" (three nodes with only two edges) in the network. It reflects the degree of aggregation and the community structure of the network. (A small computational sketch of several of these properties follows this list.)
  5. Frequency of network motifs : This indicator measures the frequency of a certain pattern (such as a specific subgraph) in the network, such as the frequency of subgraphs with sizes 3, 4, and 5. This indicator can reflect the complexity and diversity of the network.
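Here is the sketch (assuming the networkx library is available): a preferential-attachment graph shows a hub-dominated degree distribution, and we can read off its clustering coefficient and average shortest path length.

```python
import networkx as nx

# Preferential-attachment (Barabasi-Albert) graph: power-law-like degree distribution.
G = nx.barabasi_albert_graph(n=1000, m=2, seed=42)

degrees = [d for _, d in G.degree()]
print("max degree (hub):", max(degrees), " mean degree:", sum(degrees) / len(degrees))
print("average clustering coefficient:", nx.average_clustering(G))
print("average shortest path length:", nx.average_shortest_path_length(G))  # short, small-world-like
```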

This is work from the teacher's group: analyzing the basic building blocks of networks, of four types: feed-forward loops, feedback loops, cross-regulation, and amplification. In gene networks, these describe how genes regulate one another.

Networks are often not random

A biological network is a representation of the various elements and interactions in a biological system. Understanding the properties of biological networks can help us understand the structure and function of biological systems. Here's an explanation of some of these properties:

  1. Hierarchical organization : This means that the nodes and links in the network are organized according to different levels. In biological networks, such as gene regulatory networks, "Master regulators" are usually at the top of the hierarchy, controlling the activities of a series of "Local regulators".
  2. Degree distribution : This is the distribution of the number of links per node in the network. In many biological networks, some nodes (called hubs) have a large number of connections, while most nodes have only a few. "In-hubs" are nodes with a high in-degree (many incoming edges), and "out-hubs" are nodes with a high out-degree (many outgoing edges).
  3. Diameter : This is the length of the longest shortest path between any two nodes in the network. In biological networks, diameter can help us understand the speed and efficiency with which information or signals travel through the network.
  4. Info transfer : This refers to the way information or signals flow in a network. For example, a signal might pass from one node to another, or cycle through a subset of nodes in the network.
  5. Modularity : This means that the network can be divided into multiple modules or communities, where the connection density between nodes inside each module is higher than the connection density between modules. In biological networks, modularity can reflect functional differentiation and specificity.
  6. Locality : This indicates that the connections between nodes are mainly concentrated in spatially similar locations. In biological networks, spatially close nodes usually have similar functions or properties.
  7. Clustering : This indicates the degree to which a node's neighbors in the network are also connected to each other. In biological networks, a high degree of clustering usually reflects the stability and robustness of the network.
  8. Subnetworks : A subnetwork is a subset of the nodes and connections in the network that forms a relatively independent system. In biological networks, subnetworks usually correspond to specific biological processes or functions.
  9. Flow direction : This indicates the direction in which information or signals flow in the network. In biological networks, the flow direction is usually from upstream nodes to downstream nodes, for example, from genes to proteins.
  • Node properties: centrality

  • between nodes: similarity

Some terms explained:

  1. Assortative mixing : This means that nodes with similar properties are more likely to be connected to each other. For example, in social networks, people often connect with people who have similar interests or backgrounds.
  2. Node similarity or node equivalence : There are two basic types of node similarity. Structurally equivalent nodes share many of the same neighbors, while regularly equivalent nodes are connected to neighbors that have similar properties.
  3. Property clustering : A group of nodes can form a specific structure, such as a clique, k-plex, or k-core, depending on how they are connected. A clique is a structure in which all nodes are connected to each other; a k-plex is a nearly complete structure in which each node may be non-adjacent to fewer than k of the other nodes; and a k-core is a structure in which each node is connected to at least k other nodes.
  4. Defining graph neighborhoods with components : A common way to define neighborhoods or communities of nodes is through components. A component is a set of nodes in which any two nodes are connected by at least one path, while a k-component is a set of nodes in which there are at least k independent paths between any two nodes.
  • The graph is divided into components

Some concepts about connected components and community detection in networks, explained one by one:

  1. Giant connected component : In many networks, most of the nodes are contained in a large connected component, i.e. there exists at least one path between any two nodes. However, there are still substructures within this large component, such as friend clusters in a social network.
  2. Graph partitioning algorithms : These algorithms try to partition the nodes of the graph into k clusters so that there are as many connections within the clusters as possible and as few connections between clusters as possible. The simplest form of the graph partition problem is the graph bipartition problem, which divides the nodes of the graph into two equal-sized clusters. This is an NP-complete problem that can only be solved approximately by heuristic algorithms.
  3. Kernighan-Lin algorithm : This is a classic graph partitioning algorithm, which first randomly divides the nodes into two clusters, and then repeatedly adjusts the members of the clusters to reduce the connection between the clusters.
  4. Spectral partitioning : This is a graph partitioning method based on the Laplacian matrix of the graph. It tries to divide the nodes into clusters so that the "diffusion" between clusters (the spread of information or signal across cluster boundaries) is as small as possible. (A small numerical sketch of spectral bipartitioning follows this list.)
  5. Community detection algorithms : These algorithms try to find some small closely connected clusters in the graph, that is, communities. Commonly used community detection methods include modularity maximization, spectral methods, betweenness-based methods, etc.
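Here is the sketch (my own toy example, assuming numpy): the sign pattern of the Laplacian's second eigenvector (the Fiedler vector) splits two triangles joined by a single edge.

```python
import numpy as np

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the single edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

L = np.diag(A.sum(axis=1)) - A          # graph Laplacian L = D - A
vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
fiedler = vecs[:, 1]                    # eigenvector of the 2nd-smallest eigenvalue
print(np.where(fiedler < 0)[0], np.where(fiedler >= 0)[0])  # the two clusters
```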

  1. Percolation and network resilience : Percolation theory is used to describe stochastic processes in networks, such as the effect of random removal of nodes or edges on network connectivity. This theory is very useful in understanding the resilience of networks, that is, how much a network can maintain functionality in the face of the loss of nodes or edges (such as router failures, network attacks, or vaccinations).
  2. Epidemics on networks : Modeling the spread of diseases on networks can help us understand how diseases spread through populations. In this kind of model, it is usually assumed that each node (e.g., a person) can be in one of several states (e.g., susceptible, infected, recovered), and over time a node's state changes according to certain rules, such as the disease transmission rate.
  3. Dynamical systems on networks : This involves modeling some time-dependent processes on networks, such as metabolic modeling or information flow. This kind of model can help us understand the dynamic behavior in the network, such as stability, synchronicity, etc.
  4. Network search : Network search involves finding specific information or nodes in a network. An important question is how to design efficient search strategies, for example in web search, distributed databases, or message-passing systems.

Local: centrality metrics (degree, betweenness, closeness, etc)

We talked about some global properties of the network before, and now we will talk about some local properties of the network

The green nodes in the middle of the figure have high centrality but low degree. If the network is to be split into two parts, these are the nodes one would attack.

Centrality metrics are used to measure the importance of a node in the network. Let's explain these concepts in detail:

  1. Degree Centrality : For an undirected graph, the degree centrality of a node refers to its number of connections (that is, how many neighbors it has). In directed graphs, we usually calculate in-degree centrality (how many nodes point to this node) and out-degree centrality (how many other nodes this node points to) separately. For weighted graphs, we can consider weight centrality, which is the sum of the weights of node connections.
  2. Betweenness Centrality : The betweenness centrality of a node refers to the proportion of all shortest paths passing through the node. If a node has a high betweenness centrality, then it means that this node has an important controlling role in the flow of information or any other kind of flow in the network. The betweenness centrality of an edge is similarly defined, that is, the proportion of all shortest paths that pass through this edge.
  3. Closeness Centrality : The closeness centrality of a node is the reciprocal of the sum of the shortest-path lengths from it to all other nodes in the network. If a node has high closeness centrality, it is on average closer to the other nodes, and thus may be more effective at disseminating information or influencing other nodes in the network.
  4. Eigenvector Centrality : This somewhat extends the concept of degree centrality. The eigenvector centrality of a node is proportional to the sum of the eigenvector centralities of all its neighbors. In other words, the importance of a node depends not only on how many neighbors it has, but also on how important those neighbors are. Mathematically, this definition leads to an equation involving the eigenvectors of the network's adjacency matrix (the square matrix in which each element represents a connection between one node and another); the solution of this equation is the eigenvector centrality. (A short computational example of these centrality metrics follows.)
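Here is the example (assuming networkx; the Zachary karate-club graph is just a convenient built-in test network):

```python
import networkx as nx

G = nx.karate_club_graph()             # classic small social network

deg = nx.degree_centrality(G)
btw = nx.betweenness_centrality(G)
clo = nx.closeness_centrality(G)
eig = nx.eigenvector_centrality(G)

# Nodes 0 and 33 are the two "hubs" of this network.
for name, c in [("degree", deg), ("betweenness", btw), ("closeness", clo), ("eigenvector", eig)]:
    top = max(c, key=c.get)
    print(f"{name:12s} top node: {top}  score: {c[top]:.3f}")
```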

3. Linear Algebra Review

Review some basics of linear algebra

Eigenvector and singular vector decomposition

Here we start with the operation of matrix multiplication, transition from the algebraic level to the geometric level, and lead to the SVD Singular Value Decomposition (Singular Value Decomposition), and then discuss the eigenvector (Eigenvector)

In many computational science, data science, and machine learning problems, we often need to calculate the multiplication of matrices and vectors. For example, given a matrix A and a vector x, we need to compute y = Ax. This is mathematically very intuitive, but if understood graphically, the process may yield some new discoveries.

First, we can think of matrix A as a graph where each node i corresponds to a row of the matrix and each edge represents an element in that row. The vector x can be regarded as the initial label of each node in the graph, and the vector y is the calculated new label.

In the graph interpretation, the new label y(i) of node i is obtained by multiplying each neighbor's x value (its original label) by the corresponding edge label (the matrix element for that neighbor) and summing the results. This is the process of multiplying a matrix by a vector; from a graph perspective, it is a process of weight diffusion or information propagation.

This observation reveals an interesting fact: multiplication of dense matrices and vectors is just a special case of multiplication of sparse matrices and vectors. In a sparse matrix, most elements are zero, so in the graph, most nodes have only a few neighbors. Whereas in a dense matrix, every node is connected to every other node, that is, every node receives information from all other nodes. So, from a graph perspective, the multiplication of sparse matrices and vectors is a more general model of computation, while the multiplication of dense matrices and vectors is just a special case of this model.

The matrix transforms this vector.

When a matrix is multiplied by a vector, we can think of this operation as a linear transformation. This linear transformation maps the input vector into a new space, changing its position or orientation. In this process, each component of the vector is weighted by the elements of the matrix to produce a new vector. This is what we called weight diffusion.

For example, suppose we have a 2D vector v = (v1, v2), and a 2x2 matrix A = [[a11, a12], [a21, a22]]. The result of multiplying matrix A by vector v is a new 2-dimensional vector w = (w1, w2), where w1 = a11*v1 + a12*v2, w2 = a21*v1 + a22*v2. It can be seen that each component of the new vector w is obtained by weighting each component of the original vector v through the elements of the matrix A.

Therefore, when we say that matrix multiplication by a vector is similar to the diffusion of weights, we actually mean that each component of the input vector is weighted by the elements in the matrix during the transformation process to form a new vector.

In addition, from the perspective of point transformation in high-dimensional space, this linear transformation is like rotation, scaling, tilting or other forms of geometric transformation in space . This transformation will change the position and direction of the vector in space, but will maintain some properties, such as the linear relationship of the vectors, that is, if a vector is a linear combination of the other two vectors, after the transformation, this relationship is still maintained. So the multiplication of matrices and vectors can be seen as a geometric transformation in a high-dimensional space.

  • SVD singular value decomposition

Matrix operations in linear algebra can be explained geometrically. The following is a geometric interpretation of the three matrix operations of Shearing, Rotation and Scaling:

  1. Shearing : In two-dimensional space, a shearing transformation changes the shape of an object but does not change its area. For a given vector, its coordinates in one direction (such as the x direction) will be increased by a multiple of its coordinates in another direction (such as the y direction). The shear matrix has an off-diagonal element, e.g.

    [1, a]
    [0, 1]
    

    This matrix converts movement in the y direction to movement in the x direction.

  2. Rotation : The rotation transformation will rotate the object around a certain point on the plane by a certain angle, but will not change its shape and size. In two-dimensional space, the rotation matrix can be represented by the angle θ:

    [cosθ, -sinθ]
    [sinθ, cosθ]
    

    This matrix rotates the vector by the angle θ.

  3. Scaling : Scaling transformations change the size of an object, but not its shape. In two dimensions, the scaling matrix is a diagonal matrix that can be represented by scaling factors a and b:

    [a, 0]
    [0, b]
    

    This matrix multiplies the length in the x-direction by a factor a and the length in the y-direction by a factor b.

This is very powerful. Why? Because Σ is a diagonal matrix, and each value on its diagonal represents the scaling of one dimension, while U and V are simply rotations (they do not change size).

Similarly, we can regard M·x as a function: first multiply x by V^T (a rotation), then by Σ (a scaling), and finally by U (a rotation). The combined effect is equivalent to multiplying by the matrix M.
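A quick numerical check of this "rotation, scaling, rotation" view (my own sketch, assuming numpy; the matrix M here is an arbitrary example):

```python
import numpy as np

# SVD factors M into U (rotation) * Sigma (scaling) * V^T (rotation).
M = np.array([[3.0, 1.0],
              [1.0, 2.0]])
U, s, Vt = np.linalg.svd(M)

x = np.array([1.0, 2.0])
step = U @ (np.diag(s) @ (Vt @ x))   # rotate by V^T, scale by Sigma, rotate by U
print(np.allclose(step, M @ x))      # True: same effect as multiplying by M directly
```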

Because my linear algebra foundation is not very good, and SVD cannot be explained clearly in just a few words, I need to study it on my own (I will come back and add more once I understand it better).


With the above operations (shearing, rotation, scaling) in hand, the teacher began to talk about what an eigenvector is.

Here is an example: M is the matrix $\begin{pmatrix} 6 & -2 \\ 4 & 0 \end{pmatrix}$ and x is $\begin{pmatrix} 1 \\ 2 \end{pmatrix}$. Multiplying the two is equivalent to performing a series of operations on x (rotation, scaling, and rotation), but the result is the same as a simple linear transformation of x (2·x).

This understanding is amazing: when M reshapes the space, the direction of this x is unchanged (it is only scaled).
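A quick numerical check of this example (assuming numpy): M·x really equals 2·x, and 2 is indeed one of M's eigenvalues.

```python
import numpy as np

M = np.array([[6.0, -2.0],
              [4.0,  0.0]])
x = np.array([1.0, 2.0])

print(M @ x)                 # [2. 4.] == 2 * x, so x is an eigenvector with eigenvalue 2
vals, vecs = np.linalg.eig(M)
print(vals)                  # eigenvalues 2 and 4 (order may vary)
```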

by gpt4 :

In a geometric sense, the eigenvectors of a matrix A are the vectors whose direction remains unchanged (possibly flipped) under the linear transformation A, while their length is scaled. The scaling factor is the corresponding eigenvalue. That is, if v is an eigenvector of matrix A and λ is the corresponding eigenvalue, then Av = λv: the linear transformation A acting on the eigenvector v is equivalent to stretching or compressing v along its own direction by a factor of λ.

As a simple example, consider a linear transformation in two-dimensional space represented by a diagonal 2x2 matrix; its two linearly independent eigenvectors lie along the x-axis and y-axis. The transformation may stretch or compress the eigenvector along the x-axis, but its direction stays the same, and the same holds for the eigenvector along the y-axis. The effect of such a transformation is to stretch or compress the space along these axes without rotating it.

For an m×m matrix, how many eigenvectors does it have? A discussion:

Think of the matrix as m column vectors. If these vectors are not parallel in space (i.e., they are linearly independent), then there will be m eigenvectors; if some are parallel (linearly dependent), then there are fewer than m.

by gpt4 :

The number of eigenvectors of an m×m matrix depends on the properties of the matrix. In theory, an m×m matrix can have at most m linearly independent eigenvectors, and these eigenvectors can form a basis for the space in which the matrix acts.

However, the actual number may be less than m. This is mainly because a single eigenvalue may correspond to an eigenspace spanned by fewer independent eigenvectors than its multiplicity. For example, if a matrix has two equal eigenvalues, it may have only one linearly independent eigenvector for them, because the eigenvectors corresponding to the repeated eigenvalue may be linearly dependent.

Also, if a matrix is not full rank (i.e., it is singular, with linearly dependent rows or columns), it may still have fewer than m linearly independent eigenvectors.

In general, the number of eigenvectors of an m×m matrix depends on the number of eigenvalues (counting multiplicity) and on the linear independence of the eigenvectors corresponding to each eigenvalue.

An example to show the relationship between matrix-vector multiplication and eigenvectors/eigenvalues; here is the explanation:

First, the given matrix S has three eigenvalues: 3, 2, 0, with corresponding eigenvectors v1, v2, v3. The action of S on each eigenvector is simply a scaling, but the magnification (i.e., the eigenvalue) is different for each eigenvector.

Then, a given vector x can be represented as a linear combination of these three eigenvectors: x = 2v1 + 4v2 + 6v3. When we apply the linear transformation S to x, because v1, v2, and v3 are eigenvectors of S, each v is only scaled by its corresponding eigenvalue and does not change its direction.

**This means that if we first apply the linear transformation S to x and then express the result as a linear combination of v1, v2, v3, the result is equivalent to first expressing x as a linear combination of v1, v2, v3 and then applying S to each v.** This is what matrices and eigenvectors/eigenvalues mean geometrically.

For complex matrix transformation problems, we usually hope to find such a basis (the eigenvectors) in which the transformation simplifies to a diagonal matrix: each basis vector is only scaled by its own eigenvalue, and the basis vectors do not affect each other. This is the value of eigenvectors and eigenvalues in solving practical problems.

The bold words above are the key point. In simple terms, S·x is a series of complex transformations, but x can be expressed as a linear combination of v1, v2, v3 (the eigenvectors of S), such as x = 2v1 + 4v2 + 6v3, so S·x = 2S·v1 + 4S·v2 + 6S·v3.

Because these are eigenvectors, matrix × eigenvector = eigenvalue × eigenvector, which is simply multiplication by a scalar, so S·x = 2·3·v1 + 4·2·v2 + 6·0·v3 = 6v1 + 8v2.

This example also reveals that the influence of "small" eigenvalues is limited. In the example above, because the third eigenvalue is 0, the contribution of the corresponding eigenvector v3 is wiped out entirely: S·v3 = 0·v3 = 0. This means that if an eigenvalue of a matrix is very small, the eigenvector corresponding to it has very limited influence under the matrix transformation; conversely, if the eigenvalue is large, the corresponding eigenvector has a relatively large influence. This is why in some problems, such as principal component analysis (PCA), we tend to focus only on the eigenvectors with larger eigenvalues: they have greater influence under the matrix transformation and contain the main information.
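Here is a small sketch verifying this worked example (my own construction, assuming numpy): a symmetric S is built with eigenvalues 3, 2, 0 and an arbitrary orthonormal set of eigenvectors, and S·x indeed equals 6v1 + 8v2.

```python
import numpy as np

# Build a symmetric S with eigenvalues 3, 2, 0 and orthonormal eigenvectors v1, v2, v3.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # columns are v1, v2, v3
v1, v2, v3 = Q[:, 0], Q[:, 1], Q[:, 2]
S = Q @ np.diag([3.0, 2.0, 0.0]) @ Q.T

x = 2 * v1 + 4 * v2 + 6 * v3                   # x as a combination of the eigenvectors
print(np.allclose(S @ x, 6 * v1 + 8 * v2))     # True: S x = 2*3*v1 + 4*2*v2 + 6*0*v3
```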

  • The singular value decomposition (the more general case) was mentioned earlier. When all the eigenvalues of S are real numbers and S has m linearly independent eigenvectors, we have the eigenvalue decomposition (a special case).

For a matrix S, if it has m linearly independent eigenvectors (i.e., the matrix is diagonalizable), then there is an eigenvalue decomposition (or diagonalization): S = UAU^-1.

In this decomposition:

  • U is an m×m matrix whose columns are the eigenvectors v1, v2, ..., vm of S. These eigenvectors are usually normalized so that their length (norm) is 1.
  • A is an m×m diagonal matrix whose diagonal elements are the eigenvalues λ1, λ2, ..., λm of S, corresponding to the eigenvectors in U.
  • U^-1 is the inverse of U.

This decomposition tells us that any such matrix S can be expressed as the product of a matrix U of its eigenvectors, a diagonal matrix A, and the inverse of U. It is very useful in many applications, such as solving linear differential equations, simplifying matrix computations, and data dimensionality reduction (e.g., PCA).

About the difference and connection between Eigen/diagonal Decomposition and Singular Value Decomposition (by gpt4):

Both eigenvalue decomposition and singular value decomposition (SVD) are methods of decomposing a matrix into a product of simpler matrices, but they apply to different situations and have different properties.

Eigenvalue Decomposition

Eigenvalue decomposition works for square matrices, that is, matrices with the same number of rows and columns. When a matrix has n linearly independent eigenvectors, it can be decomposed into:

A = QΛQ^−1

where Q is a matrix of eigenvectors and Λ is a diagonal matrix with the eigenvalues on the diagonal. It should be noted that not all square matrices can be eigendecomposed, such as those that do not have n linearly independent eigenvectors.

Singular Value Decomposition

Singular value decomposition is more general and applicable to any m×n matrix, not limited to square matrices. Any matrix A can be decomposed as:

A = UΣV^T

Where U and V are orthogonal matrices, Σ is a diagonal matrix with non-negative elements on the diagonal, and the elements on the diagonal are called singular values.

Differences and connections

  1. Eigenvalue decomposition works only for square matrices, while singular value decomposition works for any matrix.
  2. Eigenvalue decomposition requires the matrix to have n linearly independent eigenvectors, while singular value decomposition has no such requirement.
  3. The eigenvectors in an eigenvalue decomposition are not always orthogonal, while the column vectors of the U and V matrices in a singular value decomposition are always orthogonal.
  4. The singular values of a singular value decomposition are always real and non-negative, while the eigenvalues of an eigenvalue decomposition can be any complex numbers.
  5. Both decompositions provide important information about the matrix. For example, the eigenvalues and eigenvectors tell us the "stretch" and "rotation" directions of the linear map, while the singular values give the main "energy" directions of the matrix.
  6. If the matrix A is real, symmetric, and positive semi-definite, the eigenvalue decomposition and the singular value decomposition coincide. In this case the eigenvalues and singular values are the same real numbers, and the eigenvectors coincide with the left/right singular vectors. (For a general real symmetric matrix, the singular values are the absolute values of the eigenvalues.)
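A small numerical illustration of point 6 (my own sketch, assuming numpy): for a symmetric positive semi-definite matrix, the eigenvalues and singular values coincide.

```python
import numpy as np

B = np.random.default_rng(1).normal(size=(3, 3))
A = B @ B.T                                 # symmetric positive semi-definite

eigvals = np.sort(np.linalg.eigvalsh(A))[::-1]          # eigenvalues, descending
singvals = np.linalg.svd(A, compute_uv=False)           # singular values, descending
print(np.allclose(eigvals, singvals))                   # True
```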

4. Sparse Principal Component Analysis

PCA

The singular values of the matrix are actually the principal components of variation.

Singular value decomposition (SVD) and principal component analysis (PCA) are both linear dimensionality reduction methods, which can reduce the dimensionality of data while retaining the main information. The two are closely related. In fact, PCA can be calculated by SVD.

The goal of principal component analysis is to find a set of orthogonal basis vectors that maximize the variance of the data projected onto them. This set of basis vectors is called the principal components. PCA is usually applied to the covariance matrix, which is symmetric positive semi-definite and can be eigendecomposed; the eigenvectors with the largest eigenvalues are the principal components.

SVD, on the other hand, is a decomposition of an arbitrary matrix, which can be written in the form UΣV^T. For a centered data matrix, the right singular vectors V give the directions (principal axes) in feature space, the left singular vectors U combined with the singular values (UΣ) give the coordinates of the samples on these axes, and the singular values Σ measure the spread along each axis. This is very similar to PCA: the principal components of PCA correspond to the right singular vectors of the SVD, and the PCA scores (the coordinates of the data on the principal components) correspond to UΣ.

This is why the singular values are actually the principal components of variation: each singular value measures how much the data varies along its corresponding direction. A large singular value corresponds to a major direction of variation, and a small singular value to a minor one. Therefore, by keeping the large singular values and their corresponding singular vectors, we can reduce the dimensionality of the data while preserving its main directions of variation.

To give an example: in the principal-component space, images of the same digit end up close to each other.

The MNIST (Modified National Institute of Standards and Technology) handwritten digit dataset is a very popular image recognition dataset, which contains grayscale handwritten digit pictures of 10 numbers from 0 to 9. Each sample in this dataset is a 28x28 pixel image, which is 784-dimensional data.

When we analyze or model such high-dimensional data, we face problems such as heavy computation and data sparsity. Therefore, it is necessary to reduce the dimensionality of the data, and PCA (Principal Component Analysis) is a commonly used dimensionality-reduction method.

When performing PCA on the MNIST dataset, the covariance matrix of all samples in the 784-dimensional space is first computed, and then the eigenvalue decomposition of the covariance matrix is performed. Since the covariance matrix is symmetric, a set of orthogonal eigenvectors can be obtained; these are the principal components.

The eigenvalue represents the importance of the corresponding principal component in the original data, and the larger the eigenvalue, the more important the corresponding principal component is. By selecting the eigenvectors corresponding to the first k largest eigenvalues, we can get a k-dimensional new space, and then project the original 784-dimensional data into this k-dimensional space to complete the dimensionality reduction.

In the PCA-processed MNIST data, each image of a handwritten digit is represented as a k-dimensional vector. This k-dimensional vector is the coordinates of the original 784-dimensional image on the k principal components. In this way, the dimensionality of the data is greatly reduced, and at the same time, important information of the original data is preserved as much as possible.
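A minimal sketch of this pipeline (assuming scikit-learn is installed; it downloads the 'mnist_784' dataset from OpenML, and the subset size and number of components are arbitrary choices):

```python
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X[:5000] / 255.0                        # a subset, scaled to [0, 1]

pca = PCA(n_components=50)                  # keep the 50 largest principal components
X_reduced = pca.fit_transform(X)            # 784-dim images -> 50-dim scores
print(X_reduced.shape)                      # (5000, 50)
print(pca.explained_variance_ratio_[:5])    # share of variance captured by the top components
```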

Sparse PCA

PCA (Principal Component Analysis) and SPCA (Sparse Principal Component Analysis) are techniques used for dimensionality reduction in machine learning and statistical analysis. Their goal is to transform high-dimensional data into a low-dimensional space while retaining the data's main patterns of variation or structure.

PCA works by finding the principal components in the data, which are the directions in which the variance in the data is greatest. Specifically, the first principal component is the direction in which the variance of the data is greatest, the second principal component is the direction that is orthogonal to the first principal component and has the greatest variance, and so on. PCA reduces the dimensionality of data by projecting it into the space defined by these principal components.

PCA can reduce redundancy or dependence in the data because it transforms the data into a new coordinate system in which the principal components are linearly independent, that is, they are orthogonal . In this way, PCA can remove redundant information in the original data, so as to achieve the purpose of dimensionality reduction.

However, one disadvantage of PCA is that every principal component is a linear combination of all the original features, which often makes the results produced by PCA difficult to interpret. Sparse PCA was proposed to solve this problem. Unlike PCA, Sparse PCA tries to find principal component vectors that are zero in most dimensions and have nonzero values in only a few dimensions. This way, the principal components produced by Sparse PCA are easier to understand, since they involve only a small subset of the original features.

In general, both PCA and Sparse PCA are techniques for reducing dimensionality and reducing dependencies in the data. They achieve this by finding the principal components of the data and projecting the data into the space defined by them. PCA is suitable when we want to maximize variance and remove redundancy, while Sparse PCA is more suitable when we want easy-to-interpret principal components.

L1/L2 regularization

The implementation of Sparse PCA usually requires the use of a technique called lasso (Least Absolute Shrinkage and Selection Operator). Lasso is a regularization method for linear regression, which can compress some elements of the regression coefficient vector to zero, so as to achieve the purpose of feature selection. By introducing lasso regularization in the optimization problem of PCA, we can get Sparse PCA.

L1 and L2 regularization are techniques used to prevent machine learning models from overfitting. They do this by adding a regularization term (i.e., the L1 or L2 norm of the model parameters) to the model's objective function. L1 regularization can make some parameters go to zero, enabling feature selection. L2 regularization will make the model parameters tend to be small and prevent the model from being too complex.

PCA and regularization can be used together in some cases. For example, you can first reduce the dimensionality of the data using PCA, and then train a model using L1 or L2 regularization on the reduced dimensionality data. This approach can help you deal with high-dimensional data and prevent the model from overfitting.
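A small sketch contrasting the two (assuming scikit-learn; the digits dataset and the alpha value are arbitrary choices): Sparse PCA's L1 penalty drives many loadings to exactly zero, while ordinary PCA's loadings are dense.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, SparsePCA

X = load_digits().data                      # 8x8 digit images, 64 features

pca = PCA(n_components=5).fit(X)
spca = SparsePCA(n_components=5, alpha=1.0, random_state=0).fit(X)  # alpha = L1 strength

print("nonzero loadings per PCA component:   ", (pca.components_ != 0).sum(axis=1))
print("nonzero loadings per SparsePCA comp.: ", (spca.components_ != 0).sum(axis=1))
```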

Non-linear embeddings

I am too lazy to excerpt the slides here; the teacher went through them too fast. The PCA discussed above is a linear dimensionality-reduction technique that transforms the space. There is also a non-linear dimensionality-reduction technique based on the neighbors of points (the probabilistic similarity between data points): t-SNE. (I won't go into the principle in detail.)

Bilibili learning link: Python and artificial intelligence - data dimensionality reduction - t-SNE

t-SNE, whose full name is t-Distributed Stochastic Neighbor Embedding, is a machine learning algorithm mainly used for visualizing high-dimensional data. In particular, it is widely used to reduce high-dimensional data to two or three dimensions, allowing visualization in a plane or in three-dimensional space.

The working principle of t-SNE is summarized as follows:

  1. Computes the probabilistic similarity between data points in a high-dimensional space. In this step, t-SNE calculates the similarity of two points in a high-dimensional space using a Gaussian distribution.
  2. Computes the similarity between data points in a low-dimensional space. t-SNE uses a so-called t-distribution (actually a form of Student's t-distribution) to calculate the similarity in low-dimensional space.
  3. Optimization process. t-SNE uses a metric called KL divergence to minimize the difference between the similarities in the high-dimensional space and those in the low-dimensional space. This is a nonlinear optimization process.

An important advantage of t-SNE is that it is able to preserve the local structure in the data and, to some extent, the global structure as well. This means that similar data points will be closely clustered together in a low-dimensional space.

However, t-SNE also has its limitations. For example, its computational complexity is relatively high and may require long computation time for large-scale datasets. In addition, the results of t-SNE are also affected by some hyperparameters (such as perplexity and learning rate), and it may take multiple trials to find the appropriate parameter settings.
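A minimal usage sketch (assuming scikit-learn; learning_rate="auto" needs a recent version, otherwise pass a number such as 200.0, and perplexity is one of the hyperparameters mentioned above):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)         # 64-dimensional digit images

tsne = TSNE(n_components=2, perplexity=30, learning_rate="auto", init="pca", random_state=0)
X_2d = tsne.fit_transform(X)
print(X_2d.shape)                           # (1797, 2): each image is now a 2-D point
```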

Both PCA and t-SNE are techniques for data dimensionality reduction, but their methods and advantages are different.

PCA is a linear dimensionality reduction technique whose goal is to find the main directions of variation in the data, which are called principal components. PCA tries to preserve the maximum variance in the data and project the data into a new low-dimensional space defined by the principal components. Therefore, PCA generally better preserves the global structure in the data, but may not be as effective at preserving the local structure in the data.

In contrast, t-SNE is a non-linear dimensionality reduction technique that mainly focuses on preserving local structure in the data. t-SNE tries to preserve the similarity between points that are closer together and uses stochastic gradient descent to find the best low-dimensional representation. Therefore, t-SNE is usually good at preserving the local structure of the data, but may not be as good at preserving the global structure.

  • Running t-SNE on the same handwritten-digit dataset, the difference from the earlier PCA result is obvious: t-SNE pays more attention to local structure.

  • Applications in Bioinformatics
    • Visualization of single-cell RNA-sequencing data: t-SNE is widely used for dimensionality reduction and visualization of single-cell RNA-sequencing data. By using t-SNE, researchers can map the gene expression patterns of each cell into two-dimensional or three-dimensional space, and further identify distinct cell populations.
    • Protein expression data analysis: t-SNE can also be used to analyze protein expression data to help researchers identify protein expression patterns and functional groups.
    • Disease classification and drug response prediction: t-SNE can help researchers understand the biological basis of diseases. By mapping patients' gene expression patterns to a low-dimensional space, researchers can better distinguish different types of diseases or predict patients' responses to certain medications.

There was no time to cover the following sections; I will fill them in when I have time (mainly: the Laplacian matrix, spectral clustering, guilt by association, network diffusion kernels, random walks, etc.).

  • Network Communities and Modules
    • Guilt by association
    • Maximum cliques, density-based modules and spectral clustering
  • Network Diffusion Kernels and Deconvolution
    • Network diffusion kernels
    • Network deconvolution

Network Communities and Modules

Network communities and modules refer to groups of nodes in a network that are more closely connected to other nodes in the group than to the rest of the network. Identifying these structures is an important task of network analysis, as this can help to understand the underlying mechanisms of the systems represented by the network.

For example, in protein-protein interaction networks, densely connected communities may represent protein complexes or functional units. Likewise, in a social network, a community might represent a group of friends or people with similar interests.

Guilt by association

"Guilt association" is a commonly used concept in biology, which means that genes (or proteins) that are related or interacting with each other may participate in the same biological process. Therefore, if a gene of unknown function is frequently associated with a gene of known function, it can be inferred that the unknown gene may also be involved in that function.

Maximal cliques, density-based modules and spectral clustering

These are all methods used to identify communities or modules in a network:

  • A clique in a network is a subset of nodes, any two of which are connected to each other. A maximum clique is the largest such clique in the network.
  • Density-based modules refer to subnetworks in which the internal connection density (ratio of actual connections to potential connections) is high.
  • Spectral clustering is a technique that uses the eigenvectors of a network's Laplacian matrix to cluster nodes.
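A small sketch of the last two ideas (assuming networkx and scikit-learn; the karate-club graph is just a convenient example): enumerate maximal cliques, then run spectral clustering on the adjacency matrix.

```python
import networkx as nx
from sklearn.cluster import SpectralClustering

G = nx.karate_club_graph()

cliques = list(nx.find_cliques(G))                  # maximal cliques
print("largest maximal clique:", max(cliques, key=len))

A = nx.to_numpy_array(G)                            # adjacency matrix used as affinity
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
print("community labels:", labels)
```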

Network Diffusion Kernel and Deconvolution

The network diffusion kernel represents the probability of a random walk between two nodes in the network. They can capture the connectivity structure of a network and can be used for tasks such as node classification or link prediction.
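A minimal sketch of one common diffusion kernel, the heat kernel K = exp(-βL) (my own illustration, assuming numpy, scipy, and networkx; β is a hyperparameter controlling the diffusion length):

```python
import numpy as np
import networkx as nx
from scipy.linalg import expm

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian

beta = 0.1
K = expm(-beta * L)                     # K[i, j] ~ how much "heat" flows from node j to node i
print(K.shape, K[0, 33])                # kernel value between the two hub nodes
```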

Network deconvolution is a method for removing indirect associations (i.e., associations between two nodes due to the fact that they are both connected to a third node) in a network in order to highlight direct connections.
