DAEMKL: Predicting miRNA-Disease Associations Through Deep Autoencoder With Multiple Kernel Learning (published in IEEE Transactions on Neural Networks and Learning Systems)

Paper page: https://ieeexplore.ieee.org/document/9635589


Table of contents

Abstract

1. Introduction

2. MATERIALS AND METHODS

A. Confirmed Human MDAs

B. MiRNA Space

C. Disease Space

D. Multiple Kernel Learning

E. Regression Model for Feature Representation 

F. Deep Autoencoder

G. DAEMKL

3. RESULTS AND DISCUSSION

A. Implementation Details

B. Performance Evaluation

C. Effect of Multiple Kernel Learning

D. Case Studies

4. Conclusion  


Abstract

        Identifying microRNA (miRNA)-disease associations (MDAs) is an important component in preventing, diagnosing, and treating complex diseases. However, wet-lab experiments to identify MDAs are inefficient and expensive, so it is of great significance to establish reliable and effective data-integration models to predict MDAs. This paper proposes a prediction method for MDAs based on a deep autoencoder with multiple kernel learning (DAEMKL). First, DAEMKL applies multiple kernel learning (MKL) in the miRNA space and the disease space to construct the miRNA similarity network and the disease similarity network, respectively. Then, for each disease or miRNA, a feature representation is learned from the miRNA similarity network and the disease similarity network through a regression model. Next, the integrated miRNA feature representation and disease feature representation are fed into the deep autoencoder (DAE), and new MDAs are predicted via the reconstruction error. Finally, the AUC results show that DAEMKL achieves excellent performance. In addition, case studies of three complex diseases further demonstrate that DAEMKL has strong predictive ability and can discover a large number of potential MDAs. Overall, our method DAEMKL is an effective method for MDA identification.

Index Terms - Deep autoencoder (DAE), feature representation, microRNA (miRNA)-disease associations (MDAs), multiple kernel learning (MKL).


1. Introduction

        MicroRNAs (miRNAs) are a group of non-protein-coding ribonucleic acids (RNAs) about 22 nucleotides in length and are key regulators of a variety of biological processes [1]-[3]. The biosynthesis and dysfunction of miRNAs and their target mRNAs may lead to various diseases [4]. Research has shown that miRNAs are closely related to the prevention, diagnosis, and treatment of various malignant diseases such as breast cancer, pancreatic cancer, and lymphoma [5], [6]. Therefore, studying the associations between miRNAs and diseases is of great significance in biomedicine. However, traditional biological experiments to identify MDAs are time-consuming, laborious, and equipment-intensive. Therefore, more and more researchers are turning to bioinformatics methods to analyze and predict new MDAs.

        Previous computational methods for predicting MDAs fall mainly into three categories [7]. The first category consists of methods based on similarity measures, built on the assumption that similar diseases are more likely to be associated with similar miRNAs and vice versa. Xuan et al. [8] proposed a weighted k-most-similar-neighbors method to predict MDAs, which combines disease term and disease phenotype similarities to predict MDAs effectively. Chen et al. [9] proposed a global network similarity measure based on random walk with restart for predicting new MDAs (RWRMDA). Soon after, Shi et al. [10] proposed another random walk method, which maps disease genes and miRNA target genes into a protein interaction network and uses random walks to predict MDAs. You et al. [11] proposed a path-based method for predicting MDAs (PBMDA), which constructs a heterogeneous network and then uses a depth-first search to predict new MDAs. Because it considers the network topology, PBMDA achieves good prediction results. Xiao et al. [12] proposed a new method that identifies new MDAs using adaptive multi-source multi-view latent features (M2LFL). M2LFL employs an adaptive joint graph regularization term to combine the miRNA and disease manifold structures, and also achieves good prediction results.

        The second category is based on machine learning methods. In recent years, machine learning has become popular in various fields, and various machine learning models for predicting MDAs have been proposed. In 2013, Jiang et al. [13] proposed a method based on the support vector machine (SVM): they extracted features from each MDA and used them to train SVM classifiers to mine potential information about unknown MDAs. Chen et al. [14] combined ensemble learning and link prediction to discover MDAs (ELLPMDA), fusing the prediction results of three algorithms through ensemble learning. In addition, Chen et al. [15] proposed another method to predict potential MDAs using inductive matrix completion (IMCMDA), whose motivation is to use known MDA information, miRNA similarity, and disease similarity to complete the missing associations. Some matrix factorization methods have also been employed to predict MDAs. Xiao et al. proposed a prediction framework for MDAs based on graph-regularized nonnegative matrix factorization (GRNMF). Recently, Gao et al. [16] developed another matrix factorization method for discovering new MDAs, called nearest profile-based collaborative matrix factorization (NPCMF). NPCMF considers the neighbor information of the similarity networks, applies the nearest profile to the miRNA similarity matrix and the disease similarity matrix, and achieves good prediction results.

        The third category consists of methods based on deep learning. With the successful application of deep learning in fields such as autonomous driving and face recognition [17]-[19], researchers have begun to develop deep learning methods for MDA prediction. For example, Chen et al. [20] proposed a prediction method for MDAs based on deep belief networks (DBNMDA). DBNMDA first extracts features from all miRNA-disease pairs to pretrain a restricted Boltzmann machine, and then selects the same number of negative samples as positive samples to fine-tune the deep belief network. In addition, Ji et al. [21] proposed a computational method for predicting new MDAs based on autoencoders (AEMDA). AEMDA trains three models with deep learning algorithms: a miRNA model, a disease model, and a deep autoencoder (DAE) model. It first trains the miRNA model and the disease model to obtain feature representations, and then feeds them into the autoencoder to complete the prediction of MDAs. AEMDA achieves good prediction performance.

        Although the above methods achieve excellent performance, they still face some limitations. On the one hand, the existing databases contain only a small number of known associations as positive samples, and the remaining samples are unknown associations, so there are no verified negative samples; this makes it difficult for supervised machine learning algorithms to train reliable prediction models. On the other hand, when constructing the miRNA similarity kernel and the disease similarity kernel, many methods simply combine the Gaussian kernel with the miRNA functional similarity kernel and the disease semantic similarity kernel [16].

        To overcome these shortcomings, this study develops a more complete deep learning framework for predicting MDAs, a deep autoencoder with multiple kernel learning (DAEMKL). In this work, the DAE is used to learn the characteristics of potential disease-related miRNAs without requiring negative samples. Most importantly, DAEMKL integrates multiple sources of information from the miRNA feature space and the disease feature space. Several kernels are constructed in the miRNA and disease spaces, respectively. Multiple kernel learning (MKL) is then applied in the miRNA space and the disease space to construct a miRNA similarity network and a disease similarity network. MKL based on multiple information sources incorporates more prior information, which helps capture deep interactions. For each disease or miRNA, its feature representation is learned from the miRNA similarity network or the disease similarity network through a regression model. The miRNA and disease feature representations are learned in a similar manner to ensure data compatibility during subsequent learning. Afterwards, the integrated miRNA feature representation and disease feature representation are fed into the DAE. Finally, DAEMKL reconstructs the input data and predicts new MDAs based on the reconstruction error. Five-fold cross-validation (5-CV) and global leave-one-out cross-validation (LOOCV) were used to evaluate the prediction performance. In addition, case studies of three diseases were conducted to verify whether the predicted new MDAs have been documented. DAEMKL makes full use of the compatibility of prior knowledge and feature information, achieves good prediction performance, and mines more unknown MDAs. The main contributions of this paper are summarized below.

        1) We develop a complete end-to-end prediction framework based on deep learning, which can effectively identify new MDAs.

        2) As the predictor, the DAE can effectively learn feature information from known MDAs without requiring negative samples.

        3) MKL is introduced into the prediction framework to learn multi-source information, which enriches the information about miRNAs and diseases.

        This paper is organized as follows. Section 2 presents the datasets and the DAEMKL model used to predict MDAs. Section 3 introduces the experimental results and evaluation methods, and compares our model with other state-of-the-art methods. Section 4 concludes the paper.


2. MATERIALS AND METHODS

A. Confirmed Human MDAs

        Known MDAs were obtained from the Human microRNA Disease Database (HMDD) V2.0 [22] and HMDD V3.2 [23]. HMDD is an MDA database manually curated by researchers; each recorded MDA contains detailed information, including not only the miRNA and disease names but also the PubMed ID of the supporting reference. The first dataset (D1) is considered the gold-standard dataset for predicting MDAs and includes 495 miRNAs, 383 diseases, and 5430 confirmed MDAs. The second dataset (D2) is also widely used in MDA prediction [24], [25] and contains 550 miRNAs, 328 diseases, and 6088 MDAs. HMDD V3.2 is used to check whether the new MDAs predicted by the model have been recorded. Considering the general applicability of the model, DAEMKL and other advanced methods are evaluated on both D1 and D2 to verify the performance of DAEMKL.
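        As a concrete illustration of how the confirmed MDAs can be organized for the later steps, here is a minimal NumPy sketch that builds the binary miRNA-disease association matrix; the index pairs, sizes, and matrix orientation (miRNAs as rows, diseases as columns) are my own assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def build_association_matrix(pairs, n_mirna, n_disease):
    """Build the binary miRNA-disease association matrix A (n_mirna x n_disease).

    `pairs` is an iterable of (mirna_index, disease_index) tuples for the
    confirmed MDAs; indices are assumed to be 0-based.
    """
    A = np.zeros((n_mirna, n_disease), dtype=np.float32)
    for i, j in pairs:
        A[i, j] = 1.0
    return A

# Hypothetical toy example with 3 miRNAs, 4 diseases, and 3 confirmed MDAs.
A = build_association_matrix([(0, 1), (1, 3), (2, 0)], n_mirna=3, n_disease=4)
print(A)
```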

B. MiRNA Space

        To characterize miRNAs from different information sources, three miRNA similarity kernels were constructed: miRNA functional similarity, Gaussian interaction profile (GIP) kernel similarity, and Jaccard similarity.

        The miRNA functional similarity was obtained from the MISIM database [26] (http://www.cuilab.cn/files/images/cuilab/misim.zip), developed by Wang et al. The underlying assumption is that functionally similar miRNAs are more likely to be associated with phenotypically similar diseases. Accordingly, a miRNA functional similarity matrix was constructed in which each entry (i, j) represents the functional similarity between miRNA i and miRNA j.

        To capture the network similarity of the miRNA space, the GIP kernel similarity of miRNAs was calculated following previous work [27]. The GIP kernel similarity between miRNA i and miRNA j is defined as

        KGIP_m(m(i), m(j)) = exp(−θm · ||IP(m(i)) − IP(m(j))||²)

where IP(m(i)) and IP(m(j)) are the interaction profiles of miRNA i and miRNA j, i.e., the binary vectors of their known associations in the association matrix. In addition, θm is an adjustable kernel bandwidth parameter, defined as

        θm = θ'm / ((1/nm) · Σ_{i=1..nm} ||IP(m(i))||²)

where nm is the number of miRNAs, and θ'm is usually set to 1 [27].

        Jaccard similarity has been widely used in tasks such as text mining and data clustering. Here, we adopt it to measure the similarity between miRNAs:

        KJ_m(m(i), m(j)) = |D(m(i)) ∩ D(m(j))| / |D(m(i)) ∪ D(m(j))|

where D(m(i)) denotes the set of diseases associated with miRNA i in the association matrix.
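        The following NumPy sketch shows one way the two association-derived kernels could be computed from the association matrix A of the previous snippet (miRNAs as rows). It assumes θ'm = 1, as stated above, and treats each row of A as a miRNA interaction profile; it is an illustrative reconstruction, not the authors' code.

```python
import numpy as np

def gip_kernel(profiles):
    """Gaussian interaction profile (GIP) kernel.

    `profiles` is an (n x p) matrix whose i-th row is the interaction
    profile of entity i (a binary vector of its known associations).
    The bandwidth is normalised by the mean squared profile norm,
    with theta' = 1 as in the cited work [27].
    """
    sq_norms = np.sum(profiles ** 2, axis=1)
    theta = 1.0 / np.mean(sq_norms)                        # theta' = 1
    # Pairwise squared Euclidean distances between profiles.
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-theta * np.clip(dists, 0.0, None))

def jaccard_kernel(profiles):
    """Jaccard similarity between binary interaction profiles."""
    inter = profiles @ profiles.T                           # |X ∩ Y|
    row_sums = profiles.sum(axis=1)
    union = row_sums[:, None] + row_sums[None, :] - inter   # |X ∪ Y|
    return np.where(union > 0, inter / np.maximum(union, 1e-12), 0.0)

# miRNA-side kernels from the association matrix A (miRNA x disease).
K_m_gip = gip_kernel(A)
K_m_jac = jaccard_kernel(A)
```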

C. Disease Space

        In the disease space, similarity kernels also include three categories: disease semantic similarity, GIP kernel similarity and Jaccard similarity.

        Semantic similarity is considered a reliable indicator of the relationship between diseases [28]. Following previous work [26], a directed acyclic graph (DAG) was built from the Medical Subject Headings (MeSH) database to calculate the semantic similarity of diseases. A DAG contains many nodes and directed edges, where each node is a disease and each directed edge represents the relationship between two diseases.

        Based on the same calculation rules, the GIP kernel similarity and the Jaccard similarity in the disease space are defined as

        KGIP_d(d(i), d(j)) = exp(−θd · ||IP(d(i)) − IP(d(j))||²)

        KJ_d(d(i), d(j)) = |M(d(i)) ∩ M(d(j))| / |M(d(i)) ∪ M(d(j))|

where IP(d(i)) and IP(d(j)) are the interaction profiles of disease i and disease j, i.e., the binary vectors of their known associations in the association matrix, θd is the corresponding kernel bandwidth parameter, and M(d(j)) denotes the set of miRNAs associated with disease j.
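        Because the disease space uses the same calculation rules, the disease-side GIP and Jaccard kernels can be obtained by reusing the helper functions from the miRNA sketch on the transposed association matrix, so that each row becomes a disease's interaction profile (the semantic similarity from the MeSH DAG is not covered by this snippet):

```python
# Disease-side kernels: transpose A so that each row is the interaction
# profile of one disease, then reuse the same kernel functions.
K_d_gip = gip_kernel(A.T)
K_d_jac = jaccard_kernel(A.T)
```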

D. Multiple Kernel Learning

        Considering that a single similarity kernel, or a simple combination of the miRNA functional similarity matrix and the GIP kernel similarity matrix, cannot fully represent the similarity information between miRNAs, multiple kernel learning (MKL) is used to obtain the optimal kernel [29]. Fig. 1 illustrates the process of MKL: MKL computes a weight for each kernel through an optimization algorithm to obtain an integrated miRNA kernel.

        The optimal kernel of the miRNA space is defined as the weighted combination

        K*_m = Σ_{i=1..N} ωi · K_i^m

where ωi is the optimal weight of the i-th miRNA kernel and N is the number of miRNA kernels. The learning process of MKL optimizes this integrated kernel toward an ideal kernel, minimizing the Frobenius-norm distance between the two together with a regularization term controlled by the parameter β; solving this optimization problem yields the kernel weights ωi [29].

        Similarly, the weights of the disease kernels can be calculated through the same MKL algorithm, and the final optimal kernel of the disease space K*_d is obtained in the same way.
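        Since the exact optimization procedure of the cited MKL algorithm [29] is not reproduced here, the sketch below only illustrates the general idea: base kernels are combined with non-negative weights chosen so that the combined kernel is close, in Frobenius norm, to an ideal kernel, with a β-weighted ridge term. Taking the ideal kernel as A·Aᵀ and using a closed-form ridge solution are my assumptions, not the paper's exact formulation.

```python
import numpy as np

def combine_kernels(kernels, K_ideal, beta=1.0):
    """Learn combination weights for a list of base kernels.

    Minimises ||sum_i w_i K_i - K_ideal||_F^2 + beta * ||w||^2 in closed
    form (ridge regression over vectorised kernels), then clips negative
    weights and renormalises. This is a simplified stand-in for the MKL
    algorithm cited in the paper, not a faithful reimplementation.
    """
    Phi = np.stack([K.ravel() for K in kernels], axis=1)    # (n*n, N)
    y = K_ideal.ravel()
    N = Phi.shape[1]
    w = np.linalg.solve(Phi.T @ Phi + beta * np.eye(N), Phi.T @ y)
    w = np.clip(w, 0.0, None)
    w = w / max(w.sum(), 1e-12)
    K_opt = sum(wi * Ki for wi, Ki in zip(w, kernels))
    return K_opt, w

# Hypothetical usage on the miRNA side: functional, GIP, and Jaccard kernels,
# with an ideal kernel built from the known associations (an assumption).
# K_m_fun would come from the MISIM functional similarity matrix.
# K_ideal_m = A @ A.T
# K_m_opt, w_m = combine_kernels([K_m_fun, K_m_gip, K_m_jac], K_ideal_m)
```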

E. Regression Model for Feature Representation 

        Inspired by the work of Ji et al. [21], the miRNA feature representation and the disease feature representation are learned from the final optimal similarity kernels through a regression model. In the miRNA space, each miRNA is represented by a high-dimensional vector, and the set of all such vectors constitutes the miRNA feature matrix

        M = [m(1); m(2); ...; m(nm)]

where m(i) denotes the feature vector of the i-th miRNA, kM is the size of m(i), and nm is the number of miRNAs. During the training of the regression model, the feature matrix is randomly initialized.

        Furthermore, cosine similarity is used as a reliable and effective way to measure the distance between feature vectors. The regression model uses the optimal similarity kernel K*_m as the ground-truth label. However, cosine similarity takes values in [−1, 1], whereas the similarity values in K*_m are non-negative. Therefore, the cosine similarity score is rescaled to the range [0, 1] as follows:

        C_m(i, j) = (1 + cos(m(i), m(j))) / 2

where m(i) and m(j) are the feature representation vectors of miRNA i and miRNA j, and C_m(i, j) is the distance metric between these two miRNAs. Then, a miRNA regression model is constructed to learn the feature representations by minimizing the difference between C_m and the optimal similarity kernel K*_m:

        L_m = (1/T) · Σ ( C_m(i, j) − K*_m(i, j) )²

where T is the number of training samples. During training, the squared loss is used as the criterion for backpropagation, and stochastic gradient descent (SGD) is used to update the feature matrix. After sufficient training, a reliable miRNA feature matrix is obtained.

        Similarly, reliable feature representations for diseases can be trained through the same regression procedure:

        L_d = (1/T) · Σ ( C_d(i, j) − K*_d(i, j) )²

where nd is the number of diseases and kD is the size of each disease vector. During training, we set kD = kM so that the two feature spaces are compatible. C_d is the cosine similarity measure in the disease space, and T is again the number of training samples.
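        A minimal PyTorch sketch of the regression step: a randomly initialized feature matrix is updated by gradient descent so that its rescaled pairwise cosine similarities approximate the optimal kernel. The dimensionality k_dim, the learning rate, and the number of epochs are illustrative placeholders rather than the paper's settings, and plain SGD is used here even though the implementation section also mentions Adam.

```python
import torch

def learn_feature_matrix(K_opt, k_dim=128, epochs=60, lr=0.1, seed=0):
    """Learn a feature matrix whose pairwise (rescaled) cosine similarities
    approximate the optimal similarity kernel K_opt."""
    torch.manual_seed(seed)
    K = torch.as_tensor(K_opt, dtype=torch.float32)
    n = K.shape[0]
    F = torch.nn.Parameter(torch.randn(n, k_dim) * 0.01)    # random init
    opt = torch.optim.SGD([F], lr=lr)
    for _ in range(epochs):
        Fn = torch.nn.functional.normalize(F, dim=1)
        cos = Fn @ Fn.T                                      # values in [-1, 1]
        C = (cos + 1.0) / 2.0                                # rescale to [0, 1]
        loss = ((C - K) ** 2).mean()                         # squared loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.detach()

# e.g. M = learn_feature_matrix(K_m_opt); D = learn_feature_matrix(K_d_opt)
```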

F. Deep Autoencoder

        A typical autoencoder is an unsupervised learning model whose purpose is to reconstruct its input features [30], [31]. In recent years, the DAE has become a commonly used deep learning method and has been widely applied in fields such as feature extraction and pattern recognition [32]-[34]. Because only a small proportion of the associations in the dataset are confirmed, we developed a semi-supervised learning method based on the DAE. Our model is divided into two parts: encoding and decoding. In the encoding part, the high-dimensional feature vector of a miRNA and the high-dimensional feature vector of a disease are concatenated as the input of the autoencoder, and the latent variables are obtained after encoding. Afterwards, in the decoding part, the DAE outputs a reconstruction based on the latent variables.

        During the training process, the known MDAs are treated as observable samples. Each training sample of the DAE is formed by concatenating the feature vector of a miRNA and the feature vector of a disease:

        x(i) = [ m(u) ; d(v) ]

where x(i) is the i-th training sample, built from a confirmed association between miRNA u and disease v. For each sample, the encoding process converts its high-dimensional features into latent variables:

        h^(l) = f( W^(l) · h^(l−1) + b^(l) ),   h^(0) = x(i),   l = 1, ..., L

where L is the number of hidden layers in the encoder, which is set to 2 in this work; h^(l) is the feature representation of the l-th hidden layer, W^(l) is the weight matrix, and b^(l) is the bias. The nonlinear activation function f(·) is a reliable way to obtain the latent representation.

        The output of the decoder is the feature representation reconstructed from the latent variables. The decoder is defined as

        ĥ^(l) = tanh( Ŵ^(l) · ĥ^(l−1) + b̂^(l) ),   ĥ^(0) = z

where z is the latent variable output by the encoder, Ŵ^(l) is the weight matrix and b̂^(l) is the bias of the l-th decoder layer, the final output x̂(i) is the reconstruction of the input, and tanh(·) is the nonlinear activation function.

        Finally, the loss function consists of two parts: the first part is the sum of the squared reconstruction errors over all training samples, and the second part is a regularization term. Therefore, the loss function of DAEMKL can be expressed as

        L = Σ_{i=1..n} || x(i) − x̂(i) ||² + λ · Ω

where n is the number of training samples, λ is a hyperparameter, and Ω is the Jacobian-based regularization term. At each training stage, DAEMKL updates all parameters of the model by minimizing this loss.
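        The sketch below shows a two-hidden-layer encoder with a mirrored decoder in PyTorch, matching the structure described above; the hidden sizes h1 and h2 are placeholders, and a plain L2 penalty on the encoder weights stands in for the Jacobian-based regularization term, which is not spelled out here.

```python
import torch
import torch.nn as nn

class DAE(nn.Module):
    """Deep autoencoder with a two-hidden-layer encoder and a mirrored
    decoder, both using tanh activations."""

    def __init__(self, in_dim, h1=512, h2=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, h1), nn.Tanh(),
            nn.Linear(h1, h2), nn.Tanh(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(h2, h1), nn.Tanh(),
            nn.Linear(h1, in_dim), nn.Tanh(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

def dae_loss(model, x, lam=1e-4):
    """Reconstruction squared error plus a regularisation term.

    The paper uses a Jacobian-based regulariser; a simple L2 penalty on
    the encoder weights is used here as a stand-in.
    """
    x_hat = model(x)
    rec = ((x_hat - x) ** 2).sum()
    reg = sum((p ** 2).sum() for p in model.encoder.parameters())
    return rec + lam * reg
```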

G. DAEMKL

        The entire flowchart of DAEMKL is shown in Figure 2. Our framework can be described in four steps. First, MKL is used to obtain the optimal kernels of the miRNA space and the disease space, respectively. Second, the feature representation of each miRNA and each disease is learned from the corresponding optimal kernel through a regression model. Third, the integrated miRNA feature representation and disease feature representation are fed into the DAE. Finally, new MDAs are predicted via the reconstruction error.
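        To make the last step concrete, here is a small sketch of how candidate pairs could be assembled from the learned feature matrices and scored with a trained DAE; I assume that a smaller reconstruction error indicates a more plausible association, since the text only states that new MDAs are predicted from the reconstruction error.

```python
import torch

def build_samples(M, D, pairs):
    """Concatenate the learned miRNA and disease feature vectors for a
    list of (miRNA index, disease index) pairs."""
    return torch.stack([torch.cat([M[i], D[j]]) for i, j in pairs])

def score_pairs(model, M, D, pairs):
    """Score candidate pairs by their reconstruction error under the
    trained autoencoder; smaller error = more plausible association
    (this ranking direction is an assumption)."""
    X = build_samples(M, D, pairs)
    with torch.no_grad():
        X_hat = model(X)
        return ((X_hat - X) ** 2).sum(dim=1)

# Unconfirmed pairs are then ranked by ascending reconstruction error and
# the top-ranked ones are reported as predicted new MDAs.
```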


3. RESULTS AND DISCUSSION

A. Implementation Details

        DAEMKL is trained with the following scheme. First, the mean squared loss is used as the error backpropagation criterion, and the regression model is trained end to end using stochastic gradient descent with the Adam optimizer. After 60 epochs, the two feature representations M and D are obtained. The DAE is then trained with M, D, and the known MDAs, and it usually converges after 50-100 epochs. We take the concatenation of a miRNA vector and a disease vector as the input, so the input layer has kM + kD neurons; the numbers of hidden nodes in the encoder layers and in the corresponding decoder layers are set accordingly.

        In addition, the initial learning rate of DAEMKL, the number of training epochs, and the vector size also affect the performance of DAEMKL. We train DAEMKL on dataset D1 using different combinations of these parameters: the learning rate is chosen from [1, 1e-1, 1e-2, 1e-3, 1e-4], the number of training epochs from [50, 100, 150, 200], and the vector size from [512, 1024, 2048, 4096], with the final values set empirically.
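        For reference, these reported search ranges can be written down as a small grid; the snippet below only enumerates the combinations with itertools, and the actual training and evaluation call is left as a comment.

```python
import itertools

# Hyperparameter ranges reported in the implementation details.
param_grid = {
    "learning_rate": [1, 1e-1, 1e-2, 1e-3, 1e-4],
    "epochs": [50, 100, 150, 200],
    "vector_size": [512, 1024, 2048, 4096],
}

for lr, epochs, k_dim in itertools.product(*param_grid.values()):
    # Train DAEMKL with this (lr, epochs, k_dim) combination on D1 and
    # record its cross-validated AUC (training call omitted here).
    pass
```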

B. Performance Evaluation

        In the experiments, the prediction performance of DAEMKL was evaluated using 5-CV and global LOOCV. For 5-CV, the known MDAs are randomly divided into five subsets; each subset is used in turn as the test set, and the remaining four subsets are used as the training set. The AUC is the area under the receiver operating characteristic (ROC) curve. It lies between 0 and 1, where 0.5 corresponds to random performance and values below 0.5 indicate that the predictions are meaningless. To reduce the effect of randomness, in each round DAEMKL is trained on the training samples, and the reconstruction errors of the held-out samples and the unconfirmed samples are then computed. The average over 50 runs is reported as the final result.
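        The 5-CV protocol described above can be sketched as follows; `train_and_score` is a hypothetical callback standing in for the full DAEMKL pipeline, which should return scores for the held-out positive pairs and for the unconfirmed pairs (e.g. negative reconstruction errors, so that higher scores mean more likely associations).

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

def five_fold_auc(known_pairs, train_and_score, seed=0):
    """Average AUC over a 5-fold split of the known MDAs.

    `train_and_score(train_pairs, test_pairs)` is a placeholder for the
    DAEMKL pipeline: it trains on `train_pairs` and returns scores for
    the held-out positive pairs and for the unconfirmed pairs.
    """
    known_pairs = np.asarray(known_pairs)
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, test_idx in kf.split(known_pairs):
        pos_scores, neg_scores = train_and_score(
            known_pairs[train_idx], known_pairs[test_idx]
        )
        y_true = np.concatenate([np.ones(len(pos_scores)), np.zeros(len(neg_scores))])
        y_score = np.concatenate([pos_scores, neg_scores])
        aucs.append(roc_auc_score(y_true, y_score))
    return float(np.mean(aucs))
```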

        In addition to applying 5-CV and global LOOCV to evaluate our method, we also compared DAEMKL with five advanced MDA prediction methods: IMCMDA [15], PBMDA [11], EGBMMDA [35], NPCMF [16], and AEMDA [21]. For the comparison, we implemented 5-CV on both datasets D1 and D2, and global LOOCV on dataset D1. Table 1 and Figure 3 give the detailed results of DAEMKL and the compared methods. The results show that DAEMKL achieved a 5-CV AUC of 0.9583 and a global LOOCV AUC of 0.9671 on D1, ranking first in both cases. Therefore, DAEMKL significantly outperforms the other five state-of-the-art methods.

C. Effect of Multiple Kernel Learning

        In this study, multiple kernel learning is introduced into our prediction framework, which helps to extract high-dimensional features in the miRNA space and the disease space. The contribution of MKL was studied by comparing DAEMKL with variants that do not use MKL (using a single kernel or the average kernel). The average kernel simply combines the GIP kernel with the miRNA functional similarity kernel and the disease semantic similarity kernel. Figures 4 and 5 show the performance comparison of the different kernels under 5-CV, where K1, K2, and K3 denote the models built with each single similarity kernel.

D. Case Studies

        To further demonstrate the predictive performance of our method and to confirm that the prediction results of DAEMKL are reliable and reasonable in real situations, we conducted case studies on lymphoma, colonic neoplasms, and pancreatic neoplasms. In these experiments, the D1 dataset was used to train our model and to prioritize candidate miRNAs for each disease. The top 30 potential miRNAs related to each disease were then selected, and the predicted miRNAs were verified against the latest versions of the MDA databases, dbDEMC V2.0 [36] and HMDD V3.2 [37].

        Colonic neoplasms are among the most common malignant tumors of the gastrointestinal tract. China's 2018 tumor statistics show that the incidence and mortality of colon cancer in China rank third and fifth, respectively, among all malignant tumors. To evaluate the predictive effect of DAEMKL, we selected colonic neoplasms as the first case study. As listed in Table 2, 96.67% of the top 30 miRNAs associated with colonic neoplasms have been verified by the two databases, and 60% of these 30 new associations were confirmed by both dbDEMC V2.0 and HMDD V3.2. Many studies have shown that the occurrence and development of colon tumors are closely related to many miRNAs; for example, Zhang et al. [38] found that miR-21, miR-17, and miR-19a promote the proliferation and invasion of colon cancer cells.

        Pancreatic cancer is a highly lethal malignant tumor with a very low survival rate. Most patients are diagnosed at an advanced stage, so the available treatments are less effective [40]. Therefore, identifying potential miRNAs involved in pancreatic neoplasms could improve patient prognosis. This study selected the top 30 predicted pancreatic cancer-related miRNAs for analysis. Among these 30 miRNAs, 96.67% of the associations can be found in the dbDEMC V2.0 or HMDD V3.2 database. In addition, 53.3% of the predicted pancreatic cancer-related miRNAs were confirmed by both databases, and only miR-219 has not been confirmed by either. However, Lahdaoui et al. [39] found that overexpression of miR-219 in pancreatic cancer cell lines reduced cell proliferation and migration. The details are provided in Table 3.

        Lymphoma is a malignant tumor originating from the lymphatic hematopoietic system and can involve any organ in the body [40]. In 2012, there were approximately 386,000 patients with non-Hodgkin lymphoma and 66,000 patients with Hodgkin lymphoma worldwide, accounting for 3.2% of all cancers [41]. Research shows that miRNA biomarkers play an important role in the early detection of lymphoma. For example, Zhang et al. [42] detected downregulation of miR-15a and miR-16 in mantle cell lymphoma, and Li et al. [43] found that the expression level of miR-21 can serve as a new biomarker for the prognosis of diffuse large B-cell lymphoma. In this case study, we checked whether the predicted MDAs are recorded in the two databases. After running DAEMKL, 28 of the top 30 new miRNAs found to be associated with lymphoma were verified by the dbDEMC V2.0 and HMDD V3.2 databases. Although only two of the novel miRNAs were confirmed by both databases simultaneously, lymphoma has relatively few records in HMDD V3.2; therefore, uncovering more potential lymphoma-related miRNAs is of great significance for the prevention and treatment of lymphoma. The specific results are listed in Table 4.

4. Conclusion  

        Research shows that miRNAs are related to many complex diseases, and identifying potential disease-related miRNAs is of great significance for understanding disease pathogenesis and for disease treatment. This paper proposes a DAE prediction framework based on multiple kernel learning for predicting new MDAs. The first step applies MKL to the miRNA space and the disease space to construct a miRNA similarity network and a disease similarity network. Then, for each disease or miRNA, its feature representation is learned from the miRNA similarity network or the disease similarity network through a regression model. Most notably, the integrated miRNA feature representation and disease feature representation are fed into the DAE, and novel MDAs are predicted via the reconstruction error. Finally, DAEMKL is compared with five state-of-the-art methods under 5-CV, and the results show that DAEMKL performs well in predicting MDAs. In addition, case studies on three complex diseases further verify the performance of DAEMKL. The results of multiple performance evaluation metrics show that DAEMKL achieves better prediction performance than other existing methods, and it is a reliable and effective deep model for identifying potential disease-related miRNAs.

        Nevertheless, DAEMKL has some limitations. First, a regression model is used to learn the feature representations of miRNAs and diseases separately, so when the database becomes very large, the model complexity becomes unfavorable. In addition, the limited number of currently validated MDAs may also affect the predictive ability of our model. In future work, we will develop more efficient feature representation methods, and we will try to integrate more diverse databases to further improve the performance of our prediction model.
