VGAMF: miRNA-disease association prediction based on multi-view variational graph autoencoders and matrix factorization (IEEE Journal of Biomedical and Health Informatics)

Predicting miRNA-Disease Associations Based On Multi-View Variational Graph Auto-Encoder With Matrix Factorization

Paper (IEEE Xplore): https://ieeexplore.ieee.org/document/9451570

Availability and implementation: The code and datasets of VGAMF are available at: https://github.com/XYDBCS/VGAMF.

Abstract

        MicroRNAs (miRNAs) have been proven to play an important role in a variety of biological processes, including the development of human diseases. Exploring potential associations between miRNAs and diseases can help us better understand complex disease mechanisms. Because traditional biological experiments are expensive and time-consuming, computational modeling can serve as an effective means of discovering potential miRNA-disease associations. This study proposes a new computational model based on variational graph autoencoders with matrix factorization (VGAMF) for miRNA-disease association prediction. More specifically, VGAMF first integrates four different types of information about miRNAs into a comprehensive similarity network for miRNAs, and two types of information about diseases into a comprehensive similarity network for diseases. VGAMF then obtains nonlinear representations of miRNAs and diseases from these two comprehensive similarity networks with variational graph autoencoders. Simultaneously, nonnegative matrix factorization is performed on the miRNA-disease association matrix to obtain linear representations of miRNAs and diseases. Finally, a fully connected neural network combines the linear and nonlinear representations to obtain the final predicted association score for every miRNA-disease pair. In 10-fold cross-validation, VGAMF achieved an average AUC of 0.9280 on HMDD v2.0 and 0.9470 on HMDD v3.2, which is better than the competing methods. Furthermore, case studies of colon and esophageal cancer further demonstrate the effectiveness of VGAMF in predicting novel miRNA-disease associations.

Table of contents

1 Introduction

2. Materials and methods

2.A.Benchmark Dataset

2.B. Similarity Network

2.C.VGAE obtains nonlinear representation

2.D.NMF obtains linear representation 

2.E.VGAMF predicts miRNA-disease association

3. Results and discussion

4 Conclusion


1 Introduction

        Although existing methods have made great progress, they all have some limitations. Firstly, reasonably integrating multi-view similarities into a comprehensive similarity network is a challenge. Most models combine different similarities by using one similarity to fill in the missing parts of another, or by averaging the different types of similarity. However, different similarities are derived by different rules from different evidence. With an inappropriate fusion method, scale differences, collection bias, and noise in the individual similarities may lead to a poor-quality comprehensive similarity network. In this case, although multi-view evidence provides more information to the prediction model, noise inevitably affects the prediction accuracy. Secondly, some similarity network-based methods and matrix completion-based methods rely heavily on existing miRNA-disease association (MDA) information, so they cannot be applied to new diseases that have no known associated miRNAs. Thirdly, for some supervised learning methods, the final model is seriously affected by the quality of the training samples, and there are no verified negative samples in current databases. Randomly selecting unknown samples as negative samples may reduce the quality of model training. Finally, with all the information in the similarity networks and association matrices integrated, it is difficult to obtain suitable feature representations for miRNAs and diseases. Some methods predict MDAs by extracting linear features of miRNAs and diseases, while others extract deep nonlinear features. Linear and deep nonlinear features each have their advantages, but few methods use both at the same time.

        In this study, to address some of the limitations mentioned above, we propose a new computational model for MDA prediction based on variational graph autoencoders with matrix factorization (VGAMF). VGAMF first calculates two comprehensive similarity networks, one for miRNAs and one for diseases, by integrating different databases. Specifically, VGAMF constructs four miRNA similarity networks (miRNA sequence similarity, functional similarity, semantic similarity and Gaussian interaction profile (GIP) kernel similarity) and two disease similarity networks (disease semantic similarity and GIP kernel similarity). These different similarity networks are then integrated into comprehensive similarity networks for miRNAs and diseases, respectively, via a nonlinear fusion method. Next, two variational graph autoencoders (VGAE) are trained on the two comprehensive similarity networks to obtain nonlinear representations of miRNAs and diseases. Meanwhile, a nonnegative matrix factorization (NMF) model is applied to the MDA matrix to extract linear representations. Finally, a fully connected neural network combines the linear and nonlinear representations of miRNAs and diseases to obtain a final association prediction score for each miRNA-disease pair. Compared with previous MDA prediction methods, VGAMF uses a nonlinear similarity network fusion method to obtain shared and complementary information from different data sources [42]. The VGAE model can extract deep, complex features of miRNAs and diseases because a graph convolutional network (GCN) naturally combines node features with the graph structure, while a variational autoencoder (VAE) captures features from the perspective of the data distribution. In addition, the combination of linear and nonlinear representations provides information from different levels for the final prediction.


2. Materials and methods

        Section II-A describes the data sources involved in the model. Section II-B presents the methods for calculating miRNA and disease similarities from the various data sources, together with the method for fusing the different similarities into a comprehensive similarity network. Section II-C introduces VGAE, which is used to obtain nonlinear representations of miRNAs and diseases. Section II-D summarizes the use of NMF to obtain linear representations of miRNAs and diseases. Finally, Section II-E introduces the whole process of VGAMF.

2.A.Benchmark Dataset

        In this study, we evaluate our model on two miRNA-disease association datasets: the commonly used HMDD v2.0 [16] and the latest version, HMDD v3.2 [43]. HMDD v2.0 and its related similarity datasets were obtained from previous research [44], while HMDD v3.2 and its related similarity datasets were preprocessed following the same procedure as in [44]. Specifically, the original HMDD v2.0 database includes 6441 associations between 577 miRNAs and 366 diseases, manually collected from all miRNA-related publications in PubMed. To integrate more information into VGAMF, several additional databases are used for similarity calculation. First, the miRNA sequence information comes from miRBase [45], which contains annotations for 4796 human miRNAs. MiRNA-related gene information comes from mirTarBase [46], which includes 380693 interactions between 2599 miRNAs and 15064 genes. Disease semantic information was obtained from the latest MeSH descriptors of the National Library of Medicine (http://www.nlm.nih.gov/), which include 11,572 unique entries in category C (diseases). To keep the data from different sources consistent, the miRNA names and disease names in HMDD v2.0 were mapped to miRBase records, mirTarBase records and MeSH descriptors, after which 6088 associations between 550 miRNAs and 328 diseases were retained. From these, an association matrix $A \in \mathbb{R}^{550 \times 328}$ expressing all MDAs was constructed: if miRNA i is associated with disease j, then $A_{ij} = 1$; otherwise $A_{ij} = 0$.

        In addition, the original HMDD v3.2 [43] contains 18,733 associations between 1,208 miRNAs and 894 diseases after removing duplicate relationships. We removed the miRNAs in HMDD v3.2 that are not present in miRBase or mirTarBase, and the diseases in HMDD v3.2 that are not present in MeSH. This left 8968 associations between 788 miRNAs and 374 diseases, from which an association matrix $A \in \mathbb{R}^{788 \times 374}$ was constructed. For both datasets, similarity networks are calculated according to the following procedure.

2.B. Similarity Network

        Disease similarity measure. To obtain a comprehensive disease similarity matrix, we employ two different criteria to evaluate disease-disease similarity.

        1) Disease semantic similarity. MeSH descriptors are used to build directed acyclic graphs (DAGs) of diseases. Based on the assumption that two diseases sharing a larger part of their DAGs are more semantically similar, we use the method of a previous study [47] to calculate the semantic similarity between two diseases. We thus obtain a disease semantic similarity matrix, denoted $S_{D1} \in \mathbb{R}^{328 \times 328}$ for HMDD v2.0 or $S_{D1} \in \mathbb{R}^{374 \times 374}$ for HMDD v3.2.

        2) Disease Gaussian interaction profile (GIP) kernel similarity. Disease GIP kernel similarities were calculated from the MDA matrix A obtained from HMDD, assuming that diseases associated with the same miRNAs are more likely to be similar, and vice versa. The similarity was calculated by the previous method [48], and the resulting disease GIP kernel similarity matrix is denoted $S_{D2} \in \mathbb{R}^{328 \times 328}$ for HMDD v2.0 or $S_{D2} \in \mathbb{R}^{374 \times 374}$ for HMDD v3.2.
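        The text does not spell out the method of [47]. A minimal sketch, assuming it is the widely used DAG-based measure in which a disease contributes 1 to itself and each ancestor receives the largest child contribution multiplied by a decay factor (0.5 here, an assumption), could look as follows. The toy `parents` map and the function names are hypothetical.

```python
def semantic_contributions(disease, parents, decay=0.5):
    """Contribution of the disease itself and of each of its ancestors."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        term = frontier.pop()
        for parent in parents.get(term, ()):
            value = decay * contrib[term]
            # keep the largest contribution reaching this ancestor
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_similarity(d1, d2, parents, decay=0.5):
    """Shared contributions divided by the total semantic value of both DAGs."""
    c1 = semantic_contributions(d1, parents, decay)
    c2 = semantic_contributions(d2, parents, decay)
    shared = set(c1) & set(c2)
    numerator = sum(c1[t] + c2[t] for t in shared)
    return numerator / (sum(c1.values()) + sum(c2.values()))


# Toy child -> parents map (illustrative only, not taken from the paper).
parents = {
    "liver neoplasms": ["digestive system neoplasms", "liver diseases"],
    "stomach neoplasms": ["digestive system neoplasms"],
    "digestive system neoplasms": ["neoplasms"],
    "liver diseases": ["digestive system diseases"],
}
sim = semantic_similarity("liver neoplasms", "stomach neoplasms", parents)
```

        Two diseases score higher the more of their ancestor terms they share, which is the assumption stated in the paragraph above.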
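        As an illustration, the GIP kernel is commonly defined as exp(-γ ||a_i - a_j||²), where a_i is the interaction profile of entity i and γ is a bandwidth normalized by the mean squared profile norm. The sketch below assumes that definition with a normalization parameter of 1; whether [48] uses exactly these settings is an assumption, and the toy matrix is made up.

```python
import numpy as np


def gip_kernel(profiles, gamma_prime=1.0):
    """GIP kernel between the rows of `profiles` (one interaction profile per row)."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    mean_sq_norm = sq_norms.mean()
    if mean_sq_norm == 0:
        # no known association at all: only self-similarity is defined
        return np.eye(profiles.shape[0])
    gamma = gamma_prime / mean_sq_norm
    # squared Euclidean distance between every pair of profiles
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against rounding below zero
    return np.exp(-gamma * sq_dist)


# Toy association matrix: 4 miRNAs (rows) x 3 diseases (columns).
A = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 1]])
disease_gip = gip_kernel(A.T)  # disease profiles are the columns of A
mirna_gip = gip_kernel(A)      # miRNA profiles are the rows of A
```

        The same function serves both sides: diseases use the columns of A as profiles and miRNAs use the rows.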

        MiRNA similarity measurement. To comprehensively characterize the similarities between miRNAs, we employed four different criteria to evaluate miRNA-miRNA similarity.

        1) miRNA sequence similarity. The miRNA sequences come from the miRBase database; each mature miRNA sequence is approximately 22 nucleotides long, over the alphabet A, U, C and G. Based on the sequence information, we used the pairwise sequence alignment function "pairwiseAlignment" in the R package Biostrings to calculate the similarity score. In this function, the gap opening penalty is set to 5, the gap extension penalty to 2, the match score to 1, and the mismatch score to -1. After obtaining the sequence similarity scores, we normalize them to the range [0,1] using min-max normalization. Finally, we obtain the miRNA sequence similarity matrix $S_{M1} \in \mathbb{R}^{550 \times 550}$ for HMDD v2.0 or $S_{M1} \in \mathbb{R}^{788 \times 788}$ for HMDD v3.2.

        2) miRNA functional similarity. Based on the assumption that functionally similar miRNAs are more likely to be associated with the same diseases, the functional similarity of miRNAs was measured by the similarity of their associated disease DAGs according to the previous method [47]. The MDAs are taken from HMDD, and the disease DAGs are constructed from MeSH descriptors. Finally, we obtain the miRNA functional similarity matrix, denoted $S_{M2} \in \mathbb{R}^{550 \times 550}$ for HMDD v2.0 or $S_{M2} \in \mathbb{R}^{788 \times 788}$ for HMDD v3.2.

        3) miRNA semantic similarity. The semantic similarity of miRNAs is described through miRNA target genes and their Gene Ontology (GO) annotations. MiRNA target gene information was obtained from mirTarBase. For each pair of miRNAs, we obtain their target gene lists and then calculate the semantic similarity between the two corresponding gene sets by the method of a previous study [49]. We thus obtain a miRNA semantic similarity matrix, denoted $S_{M3} \in \mathbb{R}^{550 \times 550}$ (HMDD v2.0) or $S_{M3} \in \mathbb{R}^{788 \times 788}$ (HMDD v3.2).

        4) miRNA GIP kernel similarity. The miRNA GIP kernel similarity matrix is calculated in the same way as the disease GIP kernel similarity, following the previous method [48]. It is denoted $S_{M4} \in \mathbb{R}^{550 \times 550}$ for HMDD v2.0 or $S_{M4} \in \mathbb{R}^{788 \times 788}$ for HMDD v3.2. Note that in cross-validation, the positive associations in the test samples should be set to unknown in the miRNA-disease association matrix before the GIP kernel similarity is calculated for each fold.

        Merging different similarities into a comprehensive similarity network. Inspired by previous research [42], we use a nonlinear fusion method to integrate the various similarity measures into a single similarity network for miRNAs and for diseases, respectively. Compared with most simple linear similarity combination methods, this method captures shared and complementary information from different data sources well and is robust to noise and data heterogeneity. We use the construction of the comprehensive miRNA similarity as an example.
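        The note about cross-validation matters because the GIP kernel is computed from the association matrix itself, so leaving test positives in place would leak labels into the features. A minimal sketch of the masking step (the function name and the toy data are hypothetical) is:

```python
import numpy as np


def mask_test_positives(A, test_pairs):
    """Return a copy of A in which the held-out positive pairs are set to unknown (0)."""
    A_train = np.array(A, dtype=float, copy=True)
    for i, j in test_pairs:
        A_train[i, j] = 0.0
    return A_train


# Toy association matrix: 4 miRNAs x 3 diseases.
A = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 1]])
test_pairs = [(0, 0), (2, 1)]  # positives held out in the current fold
A_train = mask_test_positives(A, test_pairs)
```

        In each fold, the GIP kernel similarities would then be recomputed from `A_train` rather than from the full matrix.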

        First, we normalize each type of similarity network. Taking the miRNA sequence similarity matrix $S_{M1}$ as an example, the normalization is calculated as follows:
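        The equation itself is missing from this copy. Assuming the method follows the normalization used in similarity network fusion [42], and writing P for the normalized matrix (a symbol chosen here), it would read:

$$
P(i,j) =
\begin{cases}
\dfrac{S_{M1}(i,j)}{2\sum_{k \neq i} S_{M1}(i,k)}, & j \neq i \\[2ex]
\dfrac{1}{2}, & j = i
\end{cases}
\tag{1}
$$

        The off-diagonal entries of each row sum to 1/2 and the diagonal entry is 1/2, so each row sums to 1 regardless of the original self-similarity values.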

        This normalization is not affected by the self-similarity of the diagonal entries, and the sum of each row is still 1.

        Then, for a certain similarity network GM, such as sequence similarity network SM1, we use K nearest neighbors (KNN) to measure its local affinity S_kn as follows:
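        The equation is not preserved here. A form consistent with the description, assuming the standard KNN kernel of similarity network fusion [42], is:

$$
S_{kn}(i,j) =
\begin{cases}
\dfrac{G_M(i,j)}{\sum_{k \in N_i} G_M(i,k)}, & j \in N_i \\[2ex]
0, & \text{otherwise}
\end{cases}
$$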

        where $N_i$ is the set of the K nearest neighbors of $x_i$ in G, including $x_i$ itself. This step is based on the assumption that local similarities (high similarity values) are more reliable than remote ones, so the remote similarities are set to 0.

        After obtaining the local affinity kernel corresponding to each type of data, we iteratively update the similarity matrix of each type of data as follows:
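        The update rule is missing from this copy. Assuming the cross-diffusion step of similarity network fusion [42], with $P^{(v)}_t$ and $S^{(v)}$ as chosen symbols for the state matrix and the local affinity kernel, it would be:

$$
P^{(v)}_{t+1} = S^{(v)} \times \left( \frac{\sum_{k \neq v} P^{(k)}_{t}}{m-1} \right) \times \left(S^{(v)}\right)^{\mathsf T}, \qquad v = 1, \dots, m
$$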

where m is the total number of data types, v is the index of the current data type (ranging from 1 to m), $P^{(v)}_t$ is the state matrix of the v-th data type after t iterations, and $S^{(v)}$ is the local affinity kernel of the v-th data type.

        After each iteration, the state matrix of each data type is normalized as shown in Eq. (1). The iteration stops when the convergence criterion is met, that is, when the relative change is less than $10^{-6}$. Once the iteration has stopped for every data type (say after t iterations), the overall comprehensive similarity matrix is calculated as follows:

        According to the above rules, the similarity matrix in Eq. (4) is not symmetric, so we take the average of this matrix and its transpose as the comprehensive miRNA similarity matrix $S_M$. For diseases, we follow the same rules as for miRNAs and obtain the comprehensive disease similarity matrix $S_D$.
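        The equation is not preserved in this copy. Assuming the final step of similarity network fusion [42], the overall matrix P (a symbol chosen here) is the average of the m state matrices:

$$
P = \frac{1}{m} \sum_{v=1}^{m} P^{(v)}_{t} \tag{4}
$$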
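        The fusion procedure described in this section can be sketched in NumPy as follows. This is a sketch under stated assumptions, not the authors' implementation: it assumes the standard similarity network fusion steps of [42], measures the relative change with the Frobenius norm, and uses an arbitrary neighborhood size `k`. All function names are hypothetical.

```python
import numpy as np


def normalize(W):
    """Row normalization with 1/2 on the diagonal, so that each row sums to 1."""
    W = np.asarray(W, dtype=float)
    off = W - np.diag(np.diag(W))
    row_sums = off.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # isolated node: its off-diagonal row stays zero
    P = off / (2.0 * row_sums)
    np.fill_diagonal(P, 0.5)
    return P


def knn_affinity(W, k):
    """Keep the k largest similarities in each row and renormalize them."""
    W = np.asarray(W, dtype=float)
    S = np.zeros_like(W)
    for i in range(W.shape[0]):
        nbrs = np.argsort(-W[i])[:k]
        total = W[i, nbrs].sum()
        if total > 0:
            S[i, nbrs] = W[i, nbrs] / total
    return S


def fuse(similarities, k=3, tol=1e-6, max_iter=100):
    """Fuse at least two similarity matrices into one symmetric matrix."""
    P = [normalize(W) for W in similarities]
    S = [knn_affinity(W, k) for W in similarities]
    m = len(P)
    for _ in range(max_iter):
        new_P = []
        for v in range(m):
            others = sum(P[u] for u in range(m) if u != v) / (m - 1)
            new_P.append(normalize(S[v] @ others @ S[v].T))
        change = max(np.linalg.norm(n - o) / np.linalg.norm(o)
                     for n, o in zip(new_P, P))
        P = new_P
        if change < tol:
            break
    fused = sum(P) / m
    return (fused + fused.T) / 2.0  # symmetrize the result


# Toy example: four random symmetric similarity matrices over 6 nodes.
rng = np.random.default_rng(0)


def random_similarity(n):
    X = rng.random((n, n))
    W = (X + X.T) / 2.0
    np.fill_diagonal(W, 1.0)
    return W


fused = fuse([random_similarity(6) for _ in range(4)], k=3)
```

        Because each node has self-similarity 1 in the toy matrices, it is always among its own K nearest neighbors, which matches the description of $N_i$ above.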

2.C.VGAE obtains nonlinear representation

        VGAE is an unsupervised learning model that combines GCN and variational autoencoders (VAE). This model is typically applied to graph-structured data. It leverages the latent variables of VAE and the neighborhood information fusion capability of GCN to learn interpretable latent representations of undirected graphs. The nonlinear representation obtained by VGAE integrates both graph structure and data distribution. Next, we introduce GCN and VGAE in detail.

        GCN [50] was proposed for convolution operations on graph-structured, non-Euclidean data. In recent years, GCN has brought significant performance improvements to many network-based prediction tasks, such as lncRNA-disease association prediction [51] and miRNA-drug resistance prediction [52]. GCN can effectively learn the feature vector of each node in the graph by integrating neighboring node information and graph structure information. Depending on how the local convolution filter is defined, current GCN methods fall into two categories: spatial-based methods and spectral-based methods. As mentioned in Bruna's study [53], spectral-based methods are designed on the spectrum of the graph Laplacian [50] and generally perform better than spatial-based methods, which have many limitations. Therefore, this study adopts a spectral-based approach to extract miRNA and disease feature vectors from the miRNA similarity network and the disease similarity network, respectively.

        Let the similarity matrix $S_M$ be the adjacency matrix of the miRNA graph $G_M$. We set the initial feature of each miRNA to the corresponding row of the MDA matrix A, which gives the feature matrix X. Given the input data $S_M$ and X, GCN converts the graph signal $X^{(t-1)}$ at step t-1 into a new signal $X^{(t)}$ by the following formula:

where $\tilde{S}_M$ is the matrix obtained by setting all diagonal elements of $S_M$ to 1, representing the adjacency matrix of $G_M$ with self-loops; $\tilde{D}$ is the diagonal degree matrix with $\tilde{D}_{ii} = \sum_j \tilde{S}_{M,ij}$; $W^{(t-1)}$ are the parameters of layer t-1 of the GCN model; and ReLU(·) is a nonlinear activation function, which can also be replaced by sigmoid(·) or another activation function.
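        The formula is missing from this copy. Assuming the standard spectral GCN propagation rule of [50], it reads:

$$
X^{(t)} = \mathrm{ReLU}\left( \tilde{D}^{-\frac{1}{2}} \, \tilde{S}_M \, \tilde{D}^{-\frac{1}{2}} \, X^{(t-1)} \, W^{(t-1)} \right)
$$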

        Our model combines GCN and variational autoencoders (VAE) as VGAE to extract nonlinear representations of miRNAs and diseases. In VGAE, GCN incorporates the node features of the network, while VAE uses latent variables to learn interpretable latent representations from the perspective of the data distribution.

        VGAE consists of an encoder and a decoder. The encoder takes an adjacency matrix $S_M$ and a feature matrix X as input and produces a latent variable z as output through GCN, while the decoder reconstructs the adjacency matrix $S_M$ from the latent variable z. A loss function is used to obtain the optimal parameters.

        Encoder: The encoder consists of two layers of GCN. The first GCN layer generates a low-dimensional feature matrix. It is defined as follows:

The second GCN network layer produces the following data distribution:

Then, the latent variable z is calculated as follows:

where ε follows the standard normal distribution N (0,1). Encoders can also be represented as follows:
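        The encoder equations referred to above are not preserved in this copy. Assuming the standard VGAE formulation, with $\bar{S}_M = \tilde{D}^{-1/2} \tilde{S}_M \tilde{D}^{-1/2}$ as the normalized adjacency matrix and $W^{(0)}$, $W_{\mu}$, $W_{\sigma}$ as hypothetical names for the layer weights, they would be:

$$
\begin{aligned}
H &= \mathrm{ReLU}\left( \bar{S}_M X W^{(0)} \right) \\
\mu &= \bar{S}_M H W_{\mu}, \qquad \log \sigma = \bar{S}_M H W_{\sigma} \\
z &= \mu + \sigma \odot \varepsilon \\
q(Z \mid X, S_M) &= \prod_{i} q(z_i \mid X, S_M), \qquad q(z_i \mid X, S_M) = \mathcal{N}\left( z_i \mid \mu_i, \mathrm{diag}(\sigma_i^2) \right)
\end{aligned}
$$

        The two outputs of the second layer share the hidden features H, and the third line is the reparameterization that keeps sampling differentiable with respect to $\mu$ and $\sigma$.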

        Decoder: The decoder is defined by the inner product between latent variables z, and the output is a reconstructed adjacency matrix as follows:

where S is the sigmoid function.

        The decoder can also be represented as follows:
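        The decoder equations are likewise missing. Under the same standard VGAE assumption, with the sigmoid function written out by name, they would be:

$$
\hat{S}_M = \mathrm{sigmoid}\left( Z Z^{\mathsf T} \right), \qquad
p(S_M \mid Z) = \prod_{i} \prod_{j} p\left( S_{M,ij} \mid z_i, z_j \right), \qquad
p\left( S_{M,ij} = 1 \mid z_i, z_j \right) = \mathrm{sigmoid}\left( z_i^{\mathsf T} z_j \right)
$$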

        Loss function: The loss function includes two parts. The first part is the binary cross-entropy between the target $S_M$ and the reconstructed output $\hat{S}_M$, and the second part is the KL divergence between the encoder distribution $q(Z \mid X, S_M)$ and the prior $p(Z)$. The loss function is defined as follows:

        The entire process by which VGAE extracts the nonlinear representation of miRNAs from the miRNA-miRNA similarity network is shown in Figure 1. The latent variable matrix Z is taken as the nonlinear representation of the miRNAs. Similarly, we input the disease-disease similarity matrix $S_D$ as the adjacency matrix and each column of the MDA matrix A as the initial disease feature, and obtain the nonlinear disease representation in the same way.
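        The loss equation is not preserved in this copy. Assuming the standard VGAE objective with a standard normal prior, it can be written as:

$$
\mathcal{L} = -\,\mathbb{E}_{q(Z \mid X, S_M)}\left[ \log p(S_M \mid Z) \right] + \mathrm{KL}\left[ q(Z \mid X, S_M) \,\|\, p(Z) \right], \qquad p(Z) = \prod_{i} \mathcal{N}\left( z_i \mid 0, I \right)
$$

        The first term is the reconstruction (binary cross-entropy) part and the second term is the KL part described above. Any relative weighting of the two terms is not recoverable from the text.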
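        To make the data flow concrete, the following NumPy sketch runs one forward pass of a two-layer VGAE and evaluates its loss. It is a sketch under stated assumptions: it uses the standard VGAE formulation, random untrained weights, toy dimensions, and an unweighted sum of the two loss terms. Training (gradient updates) is omitted, and all names are hypothetical.

```python
import numpy as np


def normalize_adj(S):
    """Symmetric normalization D^{-1/2} S D^{-1/2} after setting the diagonal to 1."""
    S_tilde = np.array(S, dtype=float, copy=True)
    np.fill_diagonal(S_tilde, 1.0)
    d_inv_sqrt = 1.0 / np.sqrt(S_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * S_tilde * d_inv_sqrt[None, :]


def vgae_forward(S, X, W0, W_mu, W_sigma, rng):
    """Encoder (two GCN layers), reparameterized sampling, inner-product decoder."""
    S_bar = normalize_adj(S)
    H = np.maximum(S_bar @ X @ W0, 0.0)      # first GCN layer with ReLU
    mu = S_bar @ H @ W_mu                    # second layer: means
    log_sigma = S_bar @ H @ W_sigma          # second layer: log standard deviations
    eps = rng.standard_normal(mu.shape)
    Z = mu + np.exp(log_sigma) * eps         # reparameterization
    S_rec = 1.0 / (1.0 + np.exp(-(Z @ Z.T)))  # sigmoid of pairwise inner products
    return Z, mu, log_sigma, S_rec


def vgae_loss(S, S_rec, mu, log_sigma, eps=1e-10):
    """Binary cross-entropy plus KL divergence to a standard normal prior."""
    bce = -np.mean(S * np.log(S_rec + eps) + (1.0 - S) * np.log(1.0 - S_rec + eps))
    kl = -0.5 * np.mean(np.sum(1.0 + 2.0 * log_sigma - mu ** 2
                               - np.exp(2.0 * log_sigma), axis=1))
    return bce + kl


# Toy example: 6 miRNAs, 4 diseases, hidden size 5, latent size 2.
rng = np.random.default_rng(0)
M = rng.random((6, 6))
S = (M + M.T) / 2.0                               # toy fused similarity in [0, 1]
X = (rng.random((6, 4)) > 0.5).astype(float)      # toy rows of the association matrix
W0 = 0.1 * rng.standard_normal((4, 5))
W_mu = 0.1 * rng.standard_normal((5, 2))
W_sigma = 0.1 * rng.standard_normal((5, 2))

Z, mu, log_sigma, S_rec = vgae_forward(S, X, W0, W_mu, W_sigma, rng)
loss = vgae_loss(S, S_rec, mu, log_sigma)
```

        After training, the rows of `Z` (or of `mu`) would serve as the nonlinear representation of each miRNA; the disease side is obtained by passing the disease similarity matrix and the columns of A through the same functions.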

2.D.NMF obtains linear representation 

        The nonlinear representations mainly come from the comprehensive similarity networks. Although these networks integrate multi-view datasets to provide more information, they also contain some noise, since the similarities are calculated with different measures. In this part, linear representations of miRNAs and diseases are calculated by NMF from the MDA matrix.

        NMF projects the miRNA-disease relationships into a miRNA subspace and a disease subspace. It helps reveal the underlying features by decomposing the original MDA matrix into two low-rank matrices whose product approximates the original matrix as closely as possible [54]. We assume that the MDA matrix A is well approximated by the product of a low-rank miRNA feature matrix U and a low-rank disease feature matrix V [54], where m is the number of miRNAs, n is the number of diseases, and k is the dimension of the feature space. To make full use of the verified associations and reduce the adverse effects of unknown associations, an indicator weighting matrix W is introduced, whose entries are the same as those of A. In addition, we use regularization [55] to ensure the smoothness of U and V, and then define the objective function as follows:

where λ1 and λ2 are regularization coefficients, and ⊙ is the Hadamard product. U ≥ 0 (V ≥ 0) means that all entries of U (V) are non-negative.

        Introducing Lagrange multipliers for the constraints U ≥ 0 and V ≥ 0, the Lagrangian function J of the optimization problem in Eq. (15) can be constructed as follows:

The partial derivatives of J with respect to U and V are calculated as follows:

According to the Karush-Kuhn-Tucker (KKT) conditions [56], we can obtain the multiplicative update rules for U and V as follows:

According to the update rules in Eq. (19) and Eq. (20), U and V are obtained once the relative difference between successive iterations falls below the convergence criterion of $10^{-4}$.
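        The objective function is missing from this copy. A form consistent with the description, assuming $U \in \mathbb{R}^{m \times k}$, $V \in \mathbb{R}^{n \times k}$, squared Frobenius norms and no extra constant factors, is:

$$
\min_{U \ge 0,\; V \ge 0} \; \left\| W \odot \left( A - U V^{\mathsf T} \right) \right\|_F^2 + \lambda_1 \left\| U \right\|_F^2 + \lambda_2 \left\| V \right\|_F^2 \tag{15}
$$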
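        The Lagrangian, its derivatives and the update rules are not preserved in this copy. A derivation consistent with the description is sketched below, assuming the objective has the weighted squared-error form with Frobenius regularization, that W is binary (so $W \odot W = W$), and writing $\Phi$ and $\Psi$ (symbols chosen here) for the multipliers of $U \ge 0$ and $V \ge 0$:

$$
J = \left\| W \odot \left( A - U V^{\mathsf T} \right) \right\|_F^2 + \lambda_1 \left\| U \right\|_F^2 + \lambda_2 \left\| V \right\|_F^2 - \mathrm{tr}\left( \Phi U^{\mathsf T} \right) - \mathrm{tr}\left( \Psi V^{\mathsf T} \right)
$$

$$
\frac{\partial J}{\partial U} = -2 \left( W \odot A \right) V + 2 \left( W \odot \left( U V^{\mathsf T} \right) \right) V + 2 \lambda_1 U - \Phi, \qquad
\frac{\partial J}{\partial V} = -2 \left( W \odot A \right)^{\mathsf T} U + 2 \left( W \odot \left( U V^{\mathsf T} \right) \right)^{\mathsf T} U + 2 \lambda_2 V - \Psi
$$

        Setting both derivatives to zero and using the complementary slackness conditions $\Phi_{ik} U_{ik} = 0$ and $\Psi_{jk} V_{jk} = 0$ gives:

$$
U_{ik} \leftarrow U_{ik} \, \frac{\left[ \left( W \odot A \right) V \right]_{ik}}{\left[ \left( W \odot \left( U V^{\mathsf T} \right) \right) V + \lambda_1 U \right]_{ik}} \tag{19}
$$

$$
V_{jk} \leftarrow V_{jk} \, \frac{\left[ \left( W \odot A \right)^{\mathsf T} U \right]_{jk}}{\left[ \left( W \odot \left( U V^{\mathsf T} \right) \right)^{\mathsf T} U + \lambda_2 V \right]_{jk}} \tag{20}
$$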
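        A compact NumPy sketch of weighted NMF with multiplicative updates is given below. It assumes the weighted squared-error objective with Frobenius regularization, takes the "relative difference" to be the relative change of the objective value, and uses arbitrary values for k, λ1 and λ2. The function names and the toy matrix are hypothetical.

```python
import numpy as np


def nmf_objective(A, W, U, V, lam1, lam2):
    """Weighted reconstruction error plus Frobenius regularization."""
    R = W * (A - U @ V.T)
    return (R ** 2).sum() + lam1 * (U ** 2).sum() + lam2 * (V ** 2).sum()


def weighted_nmf(A, k, lam1=0.01, lam2=0.01, tol=1e-4, max_iter=500, seed=0):
    """Factorize A ~ U V^T with U, V >= 0, weighting only the known associations."""
    A = np.asarray(A, dtype=float)
    W = A.copy()                      # indicator weights: same values as A
    rng = np.random.default_rng(seed)
    m, n = A.shape
    U = rng.random((m, k))
    V = rng.random((n, k))
    eps = 1e-12                       # avoids division by zero
    prev = nmf_objective(A, W, U, V, lam1, lam2)
    for _ in range(max_iter):
        U *= ((W * A) @ V) / ((W * (U @ V.T)) @ V + lam1 * U + eps)
        V *= ((W * A).T @ U) / ((W * (U @ V.T)).T @ U + lam2 * V + eps)
        cur = nmf_objective(A, W, U, V, lam1, lam2)
        if abs(prev - cur) / max(prev, eps) < tol:
            break
        prev = cur
    return U, V


# Toy association matrix: 5 miRNAs x 4 diseases.
A = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 0, 0, 1]], dtype=float)
U, V = weighted_nmf(A, k=2)
```

        The rows of `U` and `V` are the linear representations of the miRNAs and diseases. Because the updates only multiply by non-negative ratios, non-negativity is preserved without any explicit projection.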

2.E.VGAMF predicts miRNA-disease association

        In this section, we introduce the entire process of VGAMF, which includes five steps, as shown in Figure 2.

        Step 1: From the multi-view databases, VGAMF calculates four different types of similarity networks for miRNAs (miRNA sequence similarity, miRNA functional similarity, miRNA semantic similarity and miRNA GIP kernel similarity) and two different types of similarity networks for diseases (disease semantic similarity and disease GIP kernel similarity).

        Step 2: VGAMF fuses these different similarities into a comprehensive miRNA similarity network SM and a comprehensive disease similarity network SD.

        Step 3: VGAMF takes the node feature matrix of the similarity network and the miRNA-disease adjacency matrix as input, and extracts the non-linear representation of the miRNA and disease from the comprehensive similarity network SM and SD respectively.

        Step 4: Based on the miRNA-disease adjacency matrix, NMF extracts linear representations of miRNA and disease. The linear representation is based only on the MDA adjacency matrix.

        Step 5: VGAMF combines nonlinear representation and linear representation to perform MDA prediction using a fully connected neural network.
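        Step 5 can be illustrated with a small forward pass. The text does not say how the representations are combined, so this sketch assumes plain concatenation of the nonlinear and linear vectors of a miRNA and a disease, followed by a fully connected network with ReLU hidden layers and a sigmoid output. The layer sizes, the random untrained weights and all names are hypothetical.

```python
import numpy as np


def pair_features(i, j, Z_mirna, U, Z_disease, V):
    """Concatenate nonlinear and linear representations of miRNA i and disease j."""
    return np.concatenate([Z_mirna[i], U[i], Z_disease[j], V[j]])


def mlp_score(x, weights, biases):
    """Fully connected network: ReLU hidden layers, sigmoid output in (0, 1)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ W + b, 0.0)
    logit = h @ weights[-1] + biases[-1]
    return 1.0 / (1.0 + np.exp(-logit))


# Toy representations: 4 miRNAs and 5 diseases.
rng = np.random.default_rng(0)
Z_mirna, U = rng.standard_normal((4, 3)), rng.random((4, 2))    # VGAE and NMF, miRNAs
Z_disease, V = rng.standard_normal((5, 3)), rng.random((5, 2))  # VGAE and NMF, diseases

x = pair_features(1, 2, Z_mirna, U, Z_disease, V)
weights = [0.1 * rng.standard_normal((10, 8)), 0.1 * rng.standard_normal((8, 1))]
biases = [np.zeros(8), np.zeros(1)]
score = mlp_score(x, weights, biases)
```

        In a trained model, the weights would be fitted on known associations as positives and sampled unknown pairs as negatives, and `score` would be the predicted association score for the pair.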


3. Results and discussion


4 Conclusion

        The discovery of potential MDAs can help us better understand the pathogenesis of diseases at the molecular level and improve the diagnosis, prognosis and treatment of diseases. However, revealing the associations between miRNAs and diseases through biological experiments is inefficient. In recent years, with the establishment of many databases related to miRNAs and diseases, various computational methods for predicting MDAs have been proposed. This study integrates multi-view databases and proposes VGAMF, an MDA prediction method based on variational graph autoencoders and matrix factorization. A nonlinear similarity fusion method is used to fuse different types of miRNA-related and disease-related information into a comprehensive miRNA similarity network and a comprehensive disease similarity network, respectively. VGAE is then used to extract deep nonlinear representations from the comprehensive similarity networks of miRNAs and diseases, while NMF is used to extract linear representations of miRNAs and diseases from the MDA adjacency matrix. Combining the linear and nonlinear representations yields the final predicted association score. Experimental results for 5-fold CV and 10-fold CV show that VGAMF has better prediction performance than the competing methods. Furthermore, case studies also show the effectiveness of VGAMF for predicting potential MDAs.

        The reliable prediction performance of VGAMF mainly depends on the following factors. First, nonlinear fusion methods are used to effectively integrate different types of databases into similarity networks. This fusion method can obtain shared information and complementary information from various data sources at the same time, and has a better fusion effect than the linear similarity integration method. Secondly, VGAMF can effectively extract information from the network by naturally combining node features in the graph structure and using latent variables to analyze the data from the perspective of data distribution. Third, linear representation and nonlinear representation complement each other. Nonlinear representations are mainly extracted based on similarity networks, which contain noisy multi-view information, while linear representations are simply based on reliably verified MDA matrices. Furthermore, one representation is extracted through a deep nonlinear process, while the other represents shallow linear relationships. Finally, VGAMF can mitigate the impact of randomly selected negative sample noise.

        However, VGAMF also has some limitations and requires further research. For example, the disease similarity only includes two types of information. In the future, we will integrate more disease-related evidence into the disease similarity network. Furthermore, the nonlinear representation in VGAMF relies heavily on the quality of the noisy similarity network. In the future, we will further investigate methods that can effectively utilize the multi-view information of miRNAs and diseases while minimizing noisy information. In addition, we will consider more biologically relevant analyses, such as survival analysis or drug sensitivity analysis, to illustrate the effectiveness of our prediction model.
