Design and practice of large-scale short text clustering


Introduction: The large-scale short text clustering system aims to accurately and efficiently summarize massive search queries and condense them into "needs" with cohesive, clear meanings. How to ensure the system's high accuracy and how to improve its operating efficiency have been the focus of our team. We have gradually improved the system's effectiveness and performance through multi-level splitting, precise semantic similarity matching, and error correction. Based on our practical experience, this article shares the design and practice of large-scale short text clustering.

1. Background and problem introduction

Search is a scenario where users clearly express their needs. Baidu Search handles a large number of requests every day, and queries are phrased in highly varied ways. The core purpose of the large-scale short text clustering system is to summarize short texts, represented by queries, and accurately and efficiently condense short texts that have the same meaning but different expressions into "requirements" with cohesive, clear meanings. This not only compresses the short texts but also captures user needs better and helps content providers produce better content. It currently supports content production for Baidu's UGC products.

In this section, we first introduce the short text clustering problem, then survey common clustering algorithms, and finally analyze the difficulties and challenges of clustering in search scenarios.

1.1 Short text clustering problem

Clustering is a common unsupervised algorithm that divides a data set into classes or clusters according to some distance measure, so that data objects within the same cluster are as similar as possible, while data objects in different clusters are as different as possible.

In general, search queries are dominated by short text. The short text clustering problem is to aggregate, from a large number of short texts, as many texts with the same meaning as possible into "demand clusters".

For example:

Among them, query="What should I do if my mobile phone screen is broken?" and query="Help, my mobile phone screen is broken" express the same need and can be aggregated into one cluster; query="What does February mean", query="What does dragon mean" and query="What is the allusion to dragon" express the same need and can be aggregated into one cluster.

It can be seen that short text clustering in the search query scenario differs clearly from the conventional clustering problem: **the connotation of each "cluster" is relatively concentrated**, that is, after aggregation the number of clusters is relatively large, and the data within the same cluster are very close to one another.

1.2 Common Algorithms and Frameworks

simhash

In text clustering, simhash is a commonly used locality-sensitive hashing algorithm, often used by search engines for web page deduplication. It maps similar documents to the same hash value with high probability, thereby achieving clustering (deduplication).

The simhash algorithm generally segments the document into words and assigns each word a weight, computes a hash value for each word, sums the word hashes by weight, and finally binarizes the result by sign to reduce it to the final document hash. Because word importance enters the weighting, important words have a greater influence on the final document hash while unimportant words have less, so similar documents are more likely to produce the same hash value.
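A minimal sketch of this procedure (word frequency stands in for real importance weights such as tf-idf, and the tokenizer, hash width, and Hamming threshold are illustrative assumptions):

```python
import hashlib
from collections import Counter

def simhash(text, hash_bits=64):
    words = text.split()                 # real systems use a proper tokenizer
    weights = Counter(words)             # toy importance weights (e.g. tf-idf in practice)
    v = [0.0] * hash_bits
    for word, weight in weights.items():
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        for i in range(hash_bits):
            # add the weight where the word's hash bit is 1, subtract where it is 0
            v[i] += weight if (h >> i) & 1 else -weight
    # binarize by sign to get the final document fingerprint
    return sum(1 << i for i in range(hash_bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

# fingerprints within a small Hamming distance (say <= 3) are treated as duplicates
print(hamming(simhash("the cat sat on the mat"),
              simhash("the cat sat on a mat")))
```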

In long document clustering (deduplication), simhash is an efficient algorithm. For short text clustering, however, the sharp reduction in text length greatly weakens simhash, and the accuracy of its output cannot be guaranteed.

Vectorized clustering

Another common text clustering approach is to first vectorize the text and then apply a conventional clustering method. There are many ways to vectorize text, from early tf-idf, to the convenient and fast word2vec, and even text vectors produced by pretrained models such as BERT and ERNIE. In general, these vectorization methods rely on the distributional hypothesis, and the resulting text vectors can handle synonyms to a certain extent.

There are also various clustering choices, such as the commonly used k-means, hierarchical clustering, and single-pass. Note that the distance metric of the clustering algorithm must be designed with the vectorization method in mind; different vectorization methods call for different distance metrics.
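A minimal sketch of this pipeline (tf-idf is just one choice of vectorizer; the example queries and cluster count are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

queries = [
    "what to do if my phone screen is broken",
    "help my phone screen cracked",
    "how to cook rice",
    "best way to cook rice",
]

# vectorize: tf-idf here; word2vec / BERT / ERNIE vectors could be swapped in
X = TfidfVectorizer().fit_transform(queries)

# L2-normalizing makes euclidean k-means behave like cosine-distance clustering,
# a metric choice that suits tf-idf vectors
X = normalize(X)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 1 1]: phone-screen queries vs. rice queries
```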

There are three problems with this kind of algorithm:

1. Clustering algorithms represented by k-means have a hyperparameter, the number of clusters, which can only be set by experience and auxiliary indicators;
2. For short text clustering, the number of clusters is often very large, which causes the clustering computation time to surge;
3. The accuracy of the clustering results is affected by both the vectorization and the clustering algorithm, and fine-grained differences in the data are hard to express.

Other Algorithms and Comparisons

Although short text clustering has many application scenarios in industry, it remains a relatively small research branch, and most work improves the vectorized representation of the text: for example, improving on direct weighted averaging of word vectors by removing the first principal component of the weighted word vector matrix (as in SIF), or using cluster-guided vector iteration to improve the representation, such as SIF+Aut. But whatever the improvement, it adds considerable computational overhead, which is impractical in industry.

1.3 Difficulties in large-scale short text clustering

The problem we face is large-scale short text clustering, which carries three implicit constraints:

1. The accuracy requirements on clustering results are very strict;
2. The number of short texts is very large;
3. The timeliness requirements for data output are high.

It is not difficult to see that conventional algorithms struggle to satisfy all three requirements simultaneously in both computational efficiency and accuracy; there is no mature industry framework that solves this problem, and academic algorithms are still some distance from industrial application. We needed a new way of thinking.

When designing an algorithm, the following issues need to be considered:

1. Large data volume and high system throughput: the magnitude of search queries speaks for itself, and efficient computation over billion-scale data tests both algorithm design and engineering architecture;

2. The clustering algorithm requires high accuracy: Clustering is an unsupervised algorithm; it uses a distance measure in vector space to assess how well the results cohere, and essentially has no built-in concept of accuracy. In the search query scenario, however, there is a clear measurement: with a unified text similarity evaluation standard, one can check a posteriori whether results within a cluster are similar and whether data in different clusters are dissimilar, so clustering accuracy can be measured. This puts a "tightening spell" on the short text clustering algorithm: not just any clustering algorithm applies, and we need one that is tightly integrated with text similarity.

"Tighter binding" is considered from the definition of distance in vector space, for example, the similarity measure given by similarity model is not necessarily "distance" in vector space. Specifically, a well-defined "distance" on a vector space needs to satisfy the triangle inequality: distance(A,B)+distance(B,C)>=distance(A,C), however, for similarity, similarity (A,B)+similarity(B,C) and similarity(A,C) do not necessarily have a stable quantitative relationship. Therefore, the clustering model cannot be used directly, and the similarity can only be used as an "off-site guidance" to realize the clustering algorithm under the guidance of the similarity. This results in that general clustering algorithms cannot be directly used for short text clustering.

3. Text similarity requires high precision and low latency: Text similarity is scenario-dependent; the same query pair may be judged completely differently in different scenarios. In the search scenario the accuracy requirement is very high, since a single word's difference often signals a completely different need, so the model must capture fine-grained differences in the text. At the same time, queries with the same meaning but different expressions must be aggregated into one cluster to reduce missed recall. In other words, short text clustering demands both high precision and high recall from text similarity. In addition, to support large-scale calls, the similarity service must respond quickly and scale out easily.

4. Text representation is complex: Text representation converts text into semantic vectors for subsequent clustering. There are many ways to generate text vectors: simhash expresses text information through weighted hash functions; word2vec word vectors can also produce text vectors; and short text categories, keywords, and other signals are important components of the representation. When vectorizing text, the key question is how similarity is reflected in the vector space. Text representation is a fundamental choice: different representations determine different similarity measures, which directly affects the choice of clustering model and indirectly affects the final result.

5. Error discovery and repair: From text representation to text similarity to the clustering algorithm, errors accumulate at every step and affect the final result. For large-scale text clustering this problem is especially pronounced, and it is an easily overlooked link.
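A toy numeric illustration of the point in item 2: treating 1 - cosine_similarity as a "distance" violates the triangle inequality, so similarity scores cannot simply be dropped into a metric-based clustering algorithm (the vectors here are made up for illustration):

```python
import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# B is close to both A and C, yet A and C are maximally far apart
A = np.array([1.0, 0.0])
B = np.array([1.0, 1.0])
C = np.array([0.0, 1.0])

d = lambda x, y: 1.0 - cos_sim(x, y)   # naive "distance" from a similarity score
print(d(A, B) + d(B, C))  # ~0.586
print(d(A, C))            # 1.0 > 0.586 -> triangle inequality violated
```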

1.4 The general idea of large-scale short text clustering

Considering the above difficulties, we adopted the general idea of multi-level splitting, breaking the problem down piece by piece.

[Figure 1: General idea of large-scale short text clustering]

1. We first split the large-scale short texts at multiple levels, essentially ensuring that queries with the same meaning land in the same bucket with high probability:

1.1 First-level split: the split dimensions must be at the semantic level and mutually exclusive in meaning, so that queries with the same meaning are guaranteed to enter the same bucket;

1.2 Second-level split: after the first-level split, the query volume in each bucket is still very large, so a second-level split is needed to keep the subsequent computation tractable;

2. Perform refined semantic aggregation of the queries within the same bucket, merging queries with the same meaning into one cluster;

3. For the semantic clusters within the same first-level bucket, run error checking and merge the clusters that should be merged.

Notes:

1. Why split? Suppose the number of queries is N. In the worst case, completing text clustering requires on the order of N² pairwise similarity computations. If instead we split the data into m buckets that are mutually exclusive with high probability, we only need on the order of m × (N/m)² = N²/m similarity computations, a magnitude far smaller than N²;

2. Why multi-level splitting? The finer the original data is split, the fewer similarity calls are needed; but the finer the split, the harder it is to guarantee semantic mutual exclusion, which causes a surge in the error checking required afterwards. With multi-level splitting, we ensure the accuracy of the top-level split, which reduces the amount of data that needs error checking;

3. How is semantic clustering performed? After the data is split, text clustering must be carried out within each bucket; this is the core of the whole solution. There are currently three schemes:

3.1 Retrieval clustering based on text literals: although the amount of data in the same bucket is greatly reduced, pairwise similarity computation is still too expensive. We found that conventional keyword retrieval recalls most similar queries, so similarity only needs to be computed on the recalled candidates (see the sketch after this list).

3.2 Clustering based on text representation: keyword retrieval covers most similar queries, but its recall is incomplete. To handle the synonymy and sentence reformulation that keyword recall misses, we add a recall method based on text representation: vectorized queries are clustered at fine granularity, and similarity is computed only between queries in the same class;

3.3 Retrieval clustering based on text representation: since the fine-grained clustering algorithm offers weak control, vector retrieval can also be introduced, solving the incomplete-recall problem via text vector recall.
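A minimal sketch of scheme 3.1 under stated assumptions: whitespace tokens stand in for extracted keywords, `is_similar` stands in for the fine-grained similarity model, and union-find turns the verified similar pairs into clusters:

```python
from collections import defaultdict

def keyword_recall_clusters(queries, is_similar):
    """Recall candidate pairs via a keyword inverted index, then run the
    expensive similarity check only on recalled pairs instead of all O(n^2)."""
    index = defaultdict(set)               # keyword -> ids of queries containing it
    parent = list(range(len(queries)))     # union-find over query ids

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for qid, query in enumerate(queries):
        candidates = set()
        for word in query.split():         # a real system extracts key terms instead
            candidates |= index[word]
            index[word].add(qid)
        for cid in candidates:
            if find(qid) != find(cid) and is_similar(queries[qid], queries[cid]):
                parent[find(cid)] = find(qid)   # merge the two clusters

    clusters = defaultdict(list)
    for qid in range(len(queries)):
        clusters[find(qid)].append(queries[qid])
    return list(clusters.values())

# usage: keyword_recall_clusters(queries, is_similar=lambda a, b: ...)
```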

Effectiveness Analysis

1. Handles the large data volume within a short time: the two-stage split divides the large-scale data into smaller blocks that can be assigned to different computing units. We estimated that on 100 million records, the traditional pairwise comparison would take 58 years, while the layered method takes only 4 days; a purely hierarchical vectorized clustering method (without the similarity model) could cut this to 2 days, but with lower precision and recall;

2. Optimizes similarity and improves the similarity service's computing performance: we customized the model for short text similarity and deployed multiple instances, improving the throughput of the similarity service by a factor of 100;

3. High clustering accuracy: with the error correction mechanism, the overall accuracy of short text clustering rose from 93% to 95%, and recall rose from 71% to 80%.

Based on the above analysis, we made a trade-off between computing performance and effectiveness, adopting hierarchical vectorized clustering plus the optimized similarity model as the final solution.

2. Evolution and thinking of large-scale short text clustering algorithms

The above is the solution we arrived at after a period of exploration. Along the way, we went through two major version iterations.

2.1 v1.0: A clear idea of splitting

In the early days of tackling this problem, we faced two candidate technical routes: one was to explore representations suited to short texts and turn the problem into a special vector clustering problem, as simhash does; the other was to split the data set to reduce the data scale, speed up similarity computation, and thereby solve the clustering problem. After analysis, we concluded that the first route offered no intermediate indicator to guide the choice of an appropriate text representation, making iterative optimization difficult, so we settled on the idea of splitting the data set.

First, we perform a first-level split of the queries according to query classification and other business requirements, ensuring that the split results are semantically mutually exclusive.

The second step is the second-level split. To ensure that the data volume in each bucket after the split is moderate, which is convenient for fine-grained semantic aggregation, we adopt hierarchical bucketing:

1. Compute the query's high-level semantic features and binarize them to produce a 256-dimensional vector of 0s and 1s;

2. First take the first 50 dimensions of the vector and perform a coarse hash-based grouping; if the number of queries in a bucket exceeds a threshold, extend to the first 100 dimensions and group again, and so on until either all 256 dimensions are used or the bucket size falls below the threshold.
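A minimal sketch of this hierarchical bucketing, assuming the binarized 256-dimensional features are already computed; the intermediate prefix lengths beyond 50 and 100 and the bucket-size threshold are illustrative assumptions:

```python
import hashlib
import numpy as np

def hierarchical_buckets(bits, max_bucket=1000, levels=(50, 100, 150, 200, 256)):
    """`bits` is an (n, 256) uint8 array of 0/1 values: the binarized
    high-level semantic features of n queries. Bucket by hashing the first
    50 dimensions; any bucket still larger than `max_bucket` is re-split
    with a longer prefix, up to the full 256 dimensions."""
    def split(ids, level):
        dims = levels[level]
        buckets = {}
        for i in ids:
            key = hashlib.md5(bits[i, :dims].tobytes()).hexdigest()
            buckets.setdefault(key, []).append(i)
        out = []
        for members in buckets.values():
            if len(members) > max_bucket and level + 1 < len(levels):
                out.extend(split(members, level + 1))   # refine oversized buckets
            else:
                out.append(members)
        return out

    return split(range(bits.shape[0]), 0)

# usage: bits = (embeddings > 0).astype(np.uint8)   # one possible binarization
#        buckets = hierarchical_buckets(bits)       # lists of query indices
```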

The third step is refined semantic aggregation of the queries in each bucket, merging queries with similar meanings into one cluster.

Analysis of the advantages and disadvantages of v1.0

Thanks to the data splitting idea, once the data is bucketed, the semantics between buckets are mutually exclusive, and refined semantic clustering only needs to run within each bucket to produce the final data. This makes the computation easy to parallelize, which greatly shortens computation time.

During the process, we also found some areas for improvement:

1. The accuracy of clustering results depends strongly on the similarity model, and specifically on a fine-grained one. A coarse-grained similarity model leads to false recalls, so a model that can distinguish fine-grained similarity is required.

2. Hierarchical bucketing for the second-level split does not reliably put semantically similar data in the same bucket. The query vector representation it relies on implicitly assumes that the vector's semantic expression becomes progressively more refined as dimensions are added; but nothing in the production of the query vectors enforces this, so the assumption cannot be guaranteed. In v2.0 we changed the second-level split method to overcome this defect;

3. Lack of error detection and correction mechanisms: however high the similarity accuracy and however accurate the short text classification, errors will occur; we need a mechanism to detect and correct them.

2.2 v2.0: Introducing fine-grained similarity

In response to the problems found in v1.0, we have made three changes:

Introduce fine-grained similarity

A typical usage scenario of short text clustering is merging search queries at the semantic level, combining differently phrased queries for the same need into one "requirement". The criteria for judging queries similar are therefore quite strict, and the existing similarity model was no longer adequate. After detailed analysis of queries, we found that features such as syntactic analysis and phrase collocation greatly help the precision and recall of the similarity model, so we developed a similarity model incorporating syntactic analysis, phrase collocation, and other features, which achieved good returns.

Introducing a clustering model to the secondary split

In v1.0, the accuracy of hierarchical bucketing could not be guaranteed, so a more accurate second-level split method was needed for data bucketing. The second-level split only has to ensure that similar short texts land in the same bucket with high probability; it does not require any two short texts in a bucket to be similar. This setting suits the vectorized clustering method well. Considering the performance cost of pretrained models, we adopted traditional weighted word vectors to build the short text vectors, then clustered the short texts within each first-level bucket using k-means, ensuring the accuracy of the second-level split.
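A minimal sketch of this second-level split, assuming pretrained word vectors (`word_vec`) and idf statistics (`idf`) are available; the target bucket size and the way k is chosen are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def sentence_vectors(queries, word_vec, idf, dim=100):
    """idf-weighted average of word vectors. `word_vec` (word -> np.ndarray)
    and `idf` (word -> float) would come from a pretrained word2vec model
    and corpus statistics respectively."""
    vecs = np.zeros((len(queries), dim))
    for i, q in enumerate(queries):
        words = [w for w in q.split() if w in word_vec]
        if words:
            weights = np.array([idf.get(w, 1.0) for w in words])
            vecs[i] = weights @ np.stack([word_vec[w] for w in words]) / weights.sum()
    return vecs

def secondary_split(queries, word_vec, idf, target_bucket_size=200):
    """Cluster the queries of one first-level bucket; k is chosen so the
    expected bucket stays small enough for pairwise similarity checks."""
    X = sentence_vectors(queries, word_vec, idf)
    k = max(1, len(queries) // target_bucket_size)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
```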

One might ask: why not use vector recall here? Vector recall is essentially vector clustering supplemented by efficient vector search. The second-level split has no need for vector lookup, and introducing it would add extra time overhead, so we did not use vector recall.

Error correction

In v1.0, errors accumulated layer by layer and were never corrected. To overcome this, we introduced an error correction operation as the final output step.

Errors take two main forms. One is incomplete clustering: data that should be aggregated together are not; this is mainly caused by the multi-level splitting and can be solved through cross-level verification. The other is inaccurate clustering, mainly caused by similarity computation errors. We focus on solving incomplete clustering.

To reduce the amount of data that needs checking, we restrict error checking to within each secondary bucket. After secondary bucketing, we first obtain clustering results via refined semantic aggregation; then, for cluster centers within the same secondary bucket, we verify their correlation, and clusters whose centers correlate highly are merged after further validation.
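A minimal sketch of this merge step, assuming clusters and their centers come from the refined aggregation; the cosine threshold and the representative spot-check via `is_similar` (a stand-in for the fine-grained similarity model) are illustrative assumptions:

```python
import numpy as np

def merge_similar_clusters(clusters, centers, is_similar, threshold=0.9):
    """Compare cluster centers pairwise within one secondary bucket; when two
    centers are close (cosine >= threshold), spot-check representatives with
    the similarity model and merge on success."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    merged = [list(c) for c in clusters]
    centers = [np.asarray(c, dtype=float) for c in centers]
    i = 0
    while i < len(merged):
        j = i + 1
        while j < len(merged):
            if cos(centers[i], centers[j]) >= threshold and \
               is_similar(merged[i][0], merged[j][0]):
                size_j = len(merged[j])
                merged[i].extend(merged[j])
                # recompute the center as the size-weighted mean of the two
                centers[i] = (centers[i] * (len(merged[i]) - size_j)
                              + centers[j] * size_j) / len(merged[i])
                del merged[j], centers[j]
            else:
                j += 1
        i += 1
    return merged
```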

Effects of v2.0

Since its launch, v2.0 completes refined clustering of large-scale short texts in a very short time, with a clustering accuracy of 95% and a recall of 80%, and already serves content production for Baidu's UGC business lines.

Continuous optimization

v2.0 basically realizes refined clustering of large-scale short texts, but much room for improvement remains: persisting clustering results, avoiding repeated computation, more efficient aggregation algorithms, and so on. We will continue to optimize in the future to provide more efficient and stable algorithm support.

3. Summary

Search is where users express their needs most clearly. Deeply understanding and generalizing search queries not only provides users with a better search experience but also reveals gaps in content satisfaction. Through multi-level splitting and refined aggregation, we have implemented large-scale short text clustering, assisting Baidu's UGC business lines in content production and improving production efficiency.

Read the original article https://mp.weixin.qq.com/s/-QCY1v6oCj3ibjTDHHpiOQ

Recommended reading

How does the Baidu search center with tens of billions of traffic build its observability?

How did the search front-end with billion-level traffic upgrade its architecture?

Exploration and application of elastic near-line computing in Baidu information flow and search business

----------  END  ----------

"Baidu Geek Talk" is newly launched

Interested students can also join the "Baidu Geek Said WeChat Communication WeChat Group" and continue to communicate within the group.

If you are interested in other positions in Baidu, please join the "Baidu Geek Official Recruitment Group" to learn about the latest job trends. There is always a JD for you.

Technical dry goods · Industry information · Online salon · Industry conference

Recruitment information · Internal push information · Technical books · Baidu peripherals
