Alibaba Cloud launches an efficient virus gene sequence search function: here is the underlying logic

 

1. Background Introduction

At the end of 2019, a new type of coronavirus broke out in Wuhan, an emerging commercial center in China. In just over two months of the epidemic, more than 3,300 people died and more than 82,000 people were infected in China. The epidemic has since spread to 109 countries, causing more than 800,000 infections and more than 40,000 deaths. So far it has shut down more than 50 countries and caused hundreds of billions of dollars in economic losses worldwide. Alibaba Cloud provides efficient gene sequence search to support coronavirus sequence analysis for epidemic prevention.

For the current epidemic, gene sequence analysis technology is mainly used in the following aspects.

First, tracing and analyzing the origin of the new coronavirus can help people find the virus host and take effective preventive measures. Gene matching shows that the RNA sequences of coronaviruses found in bats and pangolins match that of the new coronavirus at 96% and 99.7% respectively, so pangolins and bats are likely hosts of the new coronavirus.

Second, gene sequence analysis divides the gene sequence into functional regions so that the function of each region can be understood. This helps in analyzing how the virus replicates and spreads, finding key nodes, and designing related drugs and vaccines.

Third, it is possible to retrieve viral gene sequences similar to that of the new coronavirus, such as those of SARS and MERS. This makes it possible to draw on the design of existing drug targets and to design test kits, vaccines, and therapeutic drugs more quickly and efficiently.

However, current gene matching algorithms are too slow, so an efficient matching algorithm is urgently needed for gene sequence analysis. The Alibaba Cloud AnalyticDB team converts gene sequence fragments into 1024-dimensional feature vectors. This turns the problem of matching two gene fragments into the problem of computing the distance between two vectors, which greatly reduces the computational overhead. The system can return related gene fragments within milliseconds, completing the initial screening of gene fragments.

Then the BLAST algorithm [6] for gene similarity calculation is used to complete the fine ranking of the candidates, so that gene sequence matching is computed efficiently. The complexity of the matching algorithm is reduced from the original O(M + N) to O(1). Alibaba Cloud AnalyticDB also provides powerful machine learning analysis tools. Through gene-to-vector technology, key target gene fragments that are local and disease-related are converted into feature vectors for the design of genetic drugs, which greatly accelerates the gene analysis process.

 

2. Application of gene search

 

2.1 Gene search function

The RNA of the new coronavirus is written as a nucleic acid sequence (also called a base sequence). There are four kinds of bases in total, denoted in sequence files by A, C, G, and T, representing adenine, cytosine, guanine, and thymine respectively (in RNA itself, uracil, U, takes the place of thymine, but sequence databases conventionally write T). Each letter represents one base, and the letters are written together without spaces. The RNA sequence differs from species to species and follows regular patterns. Given a viral gene fragment as input, the gene search system returns similar genes, which makes it usable for viral RNA analysis.

To demonstrate our gene fragment retrieval method, we downloaded a large number of viral RNA fragments from GenBank, and imported virus-related papers from GenBank and Google Scholar into the AnalyticDB gene retrieval database.

The demo interface of gene search is shown in Figure 1. The user uploads the sequence of the coronavirus (COVID-19) to the AnalyticDB gene search tool. The system retrieves similar gene fragments within a few milliseconds (the current system only returns gene fragments with a matching degree above 0.8). We can see that the pangolin-carried coronavirus (GD/P1L), the bat-carried coronavirus (RaTG13), and the SARS and MERS viruses were returned. Among them, GD/P1L has the highest matching degree, 0.974, which suggests that the coronavirus is likely to have been transmitted to humans through pangolins.

 

 

 

Figure 1. Gene search interface

As we all know, when two RNA fragments are very similar, the two RNAs may have similar protein expression and structure. Through the gene search tool, we can see that the matching degree of SARS and MERS with the new coronavirus is above 0.8. This shows that some SARS or MERS research results can be applied to the new coronavirus. The system crawled the papers on each virus and used a text classification algorithm to divide them into three categories: detection, vaccines, and drugs.

When we click on SARS (see Figure 2), we can see seven methods for SARS detection, four for vaccines, and ten for drugs. Fluorescent quantitative PCR detection, which is effective for SARS, is now being applied to the detection of the new coronavirus. For vaccines, work on genetic vaccines and on vaccines that induce an immune response in vivo is also in full swing. As for drugs, remdesivir and related interferons are also being used in the treatment of the new coronavirus.

 

 

 

Figure 2. Classification of related papers

Figure 3 shows the result of clicking the interferon link, which lists the relevant papers. The current system calls automatic translation software and uses keywords extracted from the Chinese version of each title as the file name, which makes the results easier for users to read.

 

 

 

Figure 3. Click on the interferon link

 

2.2 Overall design of application architecture

The overall architecture of the Alibaba Cloud gene retrieval system is shown in Figure 4. AnalyticDB is responsible for storing and querying all the structured data of the application (for example, the length of the gene sequence, the name of the paper containing the gene, and the type of gene, such as DNA or RNA; see the query result part of Figure 4) as well as the feature vectors generated from the gene sequences. At query time, we use the gene vector extraction model to convert the gene into a vector and perform a coarse search in AnalyticDB. Within the result set of the vector match, we use the classic BLAST [6] algorithm for fine ranking and return the most similar gene sequences.

The core of the system is the gene vector extraction module, which converts nucleotide sequences into vectors. We currently train on all the sequence samples of various viral RNAs, so we can easily calculate the similarity of viral RNAs. The current vector extraction model can also be extended to the genes of other species. The gene vector extraction model is introduced in detail in Section 3.
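The coarse-then-fine flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_vector` (normalized 2-mer counts) stands in for the trained gene vector extraction model, `difflib.SequenceMatcher` stands in for BLAST, the in-memory dictionary stands in for AnalyticDB, and the fragment names and sequences are hypothetical.

```python
import difflib
import itertools
import math

# All 16 possible 2-mers over the DNA alphabet, used as vector dimensions.
DIMERS = ["".join(p) for p in itertools.product("ACGT", repeat=2)]
INDEX = {d: i for i, d in enumerate(DIMERS)}


def toy_vector(sequence):
    """Stand-in for the gene vector model: unit-normalized 2-mer counts."""
    counts = [0.0] * len(DIMERS)
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in INDEX:
            counts[INDEX[pair]] += 1.0
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else counts


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(query, library, coarse_n=3, top_k=2):
    query_vec = toy_vector(query)
    # Stage 1: coarse search by vector distance over the whole library.
    coarse = sorted(
        library,
        key=lambda name: l2_distance(query_vec, toy_vector(library[name])),
    )[:coarse_n]

    # Stage 2: fine ranking, run only on the short candidate list.
    def fine_score(name):
        matcher = difflib.SequenceMatcher(None, query, library[name], autojunk=False)
        return matcher.ratio()

    return sorted(coarse, key=fine_score, reverse=True)[:top_k]


# Hypothetical fragments, for illustration only.
library = {
    "frag_a": "AACTGGATGACCTTGA",
    "frag_b": "AACTGGATGACCTTGC",
    "frag_c": "TTTTTTTTCCCCCCCC",
    "frag_d": "GGGGCCCCGGGGCCCC",
}
results = search("AACTGGATGACCTTGA", library)
print(results)
```

The point of the design is that the expensive comparison in stage 2 only ever sees `coarse_n` candidates, no matter how large the library is.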

 

 

Figure 4. Gene search framework

 

3. Introduction to key algorithms

 

3.1 Gene vector extraction algorithm

We first introduce the word vector algorithm, which is the technique most relevant to extracting gene vectors.

Word vectors [1] are a very mature technology, widely used in machine translation, reading comprehension, semantic analysis and other related fields, where they have achieved great success. Word vectorization uses distributional semantics to express the meaning of a word: the meaning of a word is given by the context in which it appears.

For example, in a high school English cloze test, a short passage has 10 blanks, and the appropriate word for each blank is chosen according to its context. In other words, the context is already enough to pin down the word accurately, and choosing the correct word shows that you understand the meaning of the missing word. Therefore, using the relationships between a word and its context words, the word vector algorithm can generate a vector for each word. The similarity of two words is then obtained by calculating the similarity of their vectors. For example, "spoon" and "bowl" are very similar because they always appear in eating scenes.

The same applies to genes: the arrangement of a gene sequence follows certain rules, and each part of the sequence expresses a different function and meaning. We can therefore divide a very long gene sequence into small unit fragments (that is, "words") for study. These words also have context, because they are connected and interact to complete the corresponding function, forming a meaningful expression. Therefore, bioscientists [8][10] use the word vector algorithm to vectorize gene sequence units. A high similarity between two gene units indicates that the two units tend to occur together and are expressed together to complete the corresponding function.

In summary, the specific method of vector extraction is mainly divided into three steps:

First, we must decide how to define the individual words in a nucleic acid sequence. In bioinformatics, k-mers [3] are used to analyze nucleic acid sequences. A k-mer is a substring of k bases; the k-mers of a sequence are obtained by iteratively selecting every run of K consecutive bases from the nucleic acid sequence. If the length of the nucleic acid sequence is L and the length of the k-mer is K, you get L - K + 1 k-mers. As shown in Figure 5, if the sequence length is 12 and the selected k-mer length is 8, then 12 - 8 + 1 = 5 8-mers are obtained. These k-mers are the "words" of the nucleic acid sequence.
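The k-mer split is easy to check in code. The following is a minimal Python sketch; the function name `kmers` and the 12-base example sequence are assumptions made for illustration, not part of the original system.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return all overlapping k-mers of `sequence`, in order."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length L yields L - k + 1 k-mers (none if L < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical 12-base sequence, chosen only for illustration.
example = "AACTGGATGACC"
words = kmers(example, 8)
print(len(words))  # 12 - 8 + 1 = 5
print(words)
```

Each k-mer overlaps its neighbor by k - 1 bases, which is why a sequence of length 12 yields five 8-mers rather than one.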

 

 

 

Figure 5. 8-mer nucleic acid sequence diagram

Second, another important issue for the word vector algorithm is how to define context. We choose a window of fixed length within the nucleic acid sequence, and the fragments inside this window are considered to be in the same context. For example, we select a window of length 10 (the nucleic acid sequence AACTGGATGA) and convert it into six 5-mers: {AACTG, ACTGG, CTGGA, TGGAT, GGATG, GATGA}. For the 5-mer {CTGGA}, the 5-mers associated with it are {AACTG, ACTGG, TGGAT, GGATG, GATGA}, and these five 5-mers are the context of the 5-mer {CTGGA}. We apply the word vector training model to the k-mers of the genes of existing organisms, which lets us convert a k-mer (a "word" in the gene sequence) into a 1024-dimensional vector.
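As a rough sketch of how such context sets could be generated before training, the Python snippet below pairs every 5-mer in a window with the other 5-mers in the same window. It uses the 10-base window AACTGGATGA, the sequence spelled out by the 5-mers in the example above; a strict sliding window over it produces six 5-mers, including TGGAT. The helper names are assumptions, and the article does not say which word vector model consumes these pairs.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return all overlapping k-mers of `sequence`, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def context_pairs(window: str, k: int) -> list[tuple[str, list[str]]]:
    """Pair each k-mer in the window with the other k-mers in that window."""
    words = kmers(window, k)
    return [(word, words[:i] + words[i + 1:]) for i, word in enumerate(words)]


window = "AACTGGATGA"  # length-10 window
pairs = dict(context_pairs(window, 5))
print(pairs["CTGGA"])
```

Pairs of this form (a center word plus its context words) are the kind of input a skip-gram style trainer expects.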

Third, like the word vector model, the k-mer vector model has useful arithmetic properties:

vec(ACGAT) - vec(GAT) ≈ vec(AC)    (1)

vec(AC) + vec(ATC) ≈ vec(ACATC)    (2)

Equation 1 shows that the vector of the ACGAT nucleotide sequence minus the vector of the GAT sequence is very close to the vector of the AC sequence. Equation 2 shows that the vector of the nucleotide sequence AC plus the vector of the ATC sequence is also very close to the vector of the ACATC sequence. Therefore, based on these properties, when we want to calculate the vector of a long nucleic acid sequence, we sum the vectors of all the k-mers in the sequence and then normalize the result to get the vector of the entire sequence. To further improve accuracy, we can also treat the gene fragment as a text and use doc2vec [4] to convert the entire sequence into a vector.
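The sum-and-normalize step can be sketched as follows. This is an illustration only: `toy_kmer_vector` is a hypothetical stand-in that returns a fixed pseudo-random vector per k-mer instead of a trained embedding, the dimension is 8 rather than 1024 to keep the example small, and the sequences are made up. With random vectors the distances carry no biological meaning; only the mechanics are shown.

```python
import math
import random

DIM = 8  # the article uses 1024 dimensions; 8 keeps the toy example small


def kmers(sequence, k):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_kmer_vector(kmer):
    """Stand-in for a trained k-mer embedding (assumption, not the real model)."""
    rng = random.Random(kmer)  # seeding with the k-mer makes the vector reproducible
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def sequence_vector(sequence, k):
    """Sum the vectors of all k-mers in the sequence, then scale to unit L2 norm."""
    total = [0.0] * DIM
    for word in kmers(sequence, k):
        for j, value in enumerate(toy_kmer_vector(word)):
            total[j] += value
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total] if norm > 0 else total


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = sequence_vector("AACTGGATGACC", 5)
same = sequence_vector("AACTGGATGACC", 5)
other = sequence_vector("TTTTCCCCGGGG", 5)
print(l2_distance(query, same))   # identical sequences give distance 0.0
print(l2_distance(query, other))
```

Because every sequence vector is scaled to unit length, the L2 distance between any two of them stays between 0 and 2, regardless of how long the sequences are.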

To further verify the performance of the algorithm, we compared two rankings of the same gene fragments: one by the similarity computed with the BLAST [6] algorithm commonly used in gene search, and one by the L2 distance between the gene vectors. The Spearman rank correlation coefficient [7] between the two rankings is 0.839. Therefore, converting DNA sequences into vectors is an effective and feasible way to perform the initial screening of similar gene fragments.
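For readers unfamiliar with the metric, here is a small pure-Python sketch of the Spearman rank correlation. The BLAST scores and L2 distances in it are hypothetical numbers chosen for illustration, not data from the article; the distances are negated because a higher BLAST score and a lower distance both mean "more similar".

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five candidate fragments (illustration only).
blast_scores = [980.0, 870.0, 640.0, 610.0, 300.0]
l2_distances = [0.12, 0.20, 0.41, 0.35, 0.90]
rho = spearman(blast_scores, [-d for d in l2_distances])
print(round(rho, 3))
```

A coefficient close to 1 means the vector distance orders the candidates almost the same way BLAST does, which is what makes it usable as a coarse filter.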

 

3.2 Features of AnalyticDB Vector Edition

AnalyticDB is a high-concurrency, low-latency, petabyte-scale real-time data warehouse on Alibaba Cloud. It can perform real-time multi-dimensional analysis and business exploration over trillions of records within milliseconds.

AnalyticDB for MySQL is fully compatible with the MySQL protocol and the SQL:2003 syntax standard. AnalyticDB for PostgreSQL supports standard SQL:2003 and is highly compatible with the Oracle syntax ecosystem. Both products currently include a vector retrieval function, which supports similarity queries for scenarios such as images, recommendation, voiceprints, and nucleotide sequence analysis. At present, AnalyticDB supports queries over vector data at the scale of one billion vectors with response times around 100 ms in real application scenarios. AnalyticDB has been deployed in large-scale security projects in many cities.

In a typical application system that includes vector retrieval, developers usually use a vector retrieval engine (such as Faiss) to store vector data and a relational database to store structured data. Queries then have to go back and forth between the two systems. This approach requires extra development work, and its performance is not optimal.

AnalyticDB supports retrieval over both structured data and unstructured data (vectors). Using only the SQL interface, you can quickly build features such as gene search or hybrid search over genes plus structured data. In hybrid retrieval scenarios, AnalyticDB's optimizer selects the optimal execution plan according to the data distribution and query conditions, delivering the best performance while maintaining recall.

RNA nucleic acid sequence search can be done with a single SQL statement:

 
 

-- Find gene sequences whose vectors are similar to the submitted RNA sequence vector.
select title,   -- article name
       length,  -- gene length
       type,    -- mRNA, DNA, etc.
       l2_distance(feature, array[-0.017, -0.032, ...]::real[]) as distance  -- vector distance
from demo.paper a, demo.dna_feature b
where a.id = b.id
order by distance;  -- sort by vector similarity

The table demo.paper stores the basic information of the uploaded articles, and demo.dna_feature stores the vector corresponding to the gene sequence of each species. The gene-to-vector model converts the gene to be retrieved into a vector [-0.017, -0.032, ...], which is then searched for in the Alibaba Cloud AnalyticDB database.

The current system also supports hybrid retrieval over structured information plus unstructured information (nucleotide sequences). For example, suppose we want to find similar gene fragments related to the coronavirus. With AnalyticDB, we only need to add the condition title like '%COVID-19%' to the where clause of the SQL.
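A client application might assemble such a hybrid query as in the Python sketch below. The function `build_hybrid_query` is a hypothetical helper, and the `%s` placeholder assumes a PostgreSQL DB-API driver such as psycopg2; the table and column names are the ones used in the query above. No database connection is made here, the function only returns the SQL text and its parameters.

```python
def build_hybrid_query(vector, title_pattern):
    """Build SQL text and parameters for a vector search filtered by title.

    Hypothetical helper: the title filter is passed as a driver parameter
    (DB-API "%s" style, as used by psycopg2) rather than pasted into the SQL.
    """
    # The vector components are formatted as plain floats into an array literal.
    array_literal = ", ".join(f"{float(v):.6f}" for v in vector)
    sql = (
        "select title, length, type,\n"
        f"       l2_distance(feature, array[{array_literal}]::real[]) as distance\n"
        "from demo.paper a, demo.dna_feature b\n"
        "where a.id = b.id\n"
        "  and title like %s\n"
        "order by distance;"
    )
    return sql, (title_pattern,)


# Two components shown for brevity; a real query vector has 1024 components.
sql, params = build_hybrid_query([-0.017, -0.032], "%COVID-19%")
print(sql)
print(params)
```

With a DB-API cursor, the pair would typically be passed as `cursor.execute(sql, params)`, which keeps the title pattern out of the SQL string itself.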


This article is original content of Alibaba Cloud and may not be reproduced without permission.


Origin blog.csdn.net/weixin_43970890/article/details/105490426