Deduplicating a 100-million-question bank on a single machine


Author: haolujun

cnblogs.com/haolujun/p/8399275.html

Background

I recently ran into a problem at work: how do we deduplicate a very large question bank? After years of accumulation, our bank has grown to hundreds of millions of questions, but because the questions come from many different sources it contains a lot of duplicates. These duplicates add to the search engine's computational load without improving retrieval accuracy.

In addition, because there are so many questions, the search engine tends to use truncation strategies and only scores a subset of candidates, so some correct questions never even get scored, and photo-search accuracy (searching by a photo of the question) may fall instead of rise. In other words, although adding questions initially brings a substantial gain in search accuracy, once the bank grows past a certain size the accuracy stops keeping up because of the computational cost. Removing as many duplicate questions as possible is therefore very important.

Some attempted approaches

Comparing MD5 values

Compute the MD5 of each question as its signature; when a new question arrives, simply check whether the same MD5 already exists in the bank.

This scheme only handles questions that are exactly identical, but in practice duplicates are rarely that clean:

  • "A greater than B 10" and "B smaller than A 10"

  • "Little Red buy 10 books" and "Xiao Ming to buy 10 books."

  • "Today, the air temperature is 10 degrees" and "Today's air temperature of 10 degrees."

These pairs should be treated as duplicates, but their MD5 values differ, so MD5 alone cannot remove them.

Longest common subsequence and minimum edit distance

Compute the similarity of two questions with the longest common subsequence or the minimum edit distance algorithm; if the similarity exceeds a threshold, say 90%, consider the two questions duplicates.

This method works in theory, but the amount of computation is far too large. If the number of documents is N and the average document length is M, the total cost is roughly O(N² * M²).

Assuming N = 10 million and M = 200, the cost is about (10^7)² * 200² = 4 * 10^18 operations; the machines available to me in production simply do not have that much computing power. But if we could first roughly gather similar questions into small groups, and then only compare pairs of questions within each group, the workload would become feasible.

Jaccard similarity

For this I read two books in particular: section 19.6 of "Introduction to Information Retrieval" and sections 3.2 and 3.3 of "Mining of Massive Datasets". They describe how to compute the Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|. On its own, this formula is useless for deduplication, because the amount of computation is just as large.

That is why both books also describe an equivalent algorithm: map each set through random permutations and compute a probabilistic approximation of the Jaccard similarity (the MinHash technique). I will not go into the proof of the equivalence here; interested readers can consult the two books directly. But the most crucial step of this approximate Jaccard computation raises an interesting problem: for a very large N, how do we generate a random permutation of 0 ~ N-1? Here I give an approximation; readers who have studied elementary number theory should find the following theorem familiar.

Theorem: let y = (a * x + b) mod n. If a is coprime with n (i.e., the greatest common divisor of a and n is 1), then as x runs over 0 ~ n-1, y also runs over 0 ~ n-1.

Proof: suppose two numbers x1 and x2 give the same value, i.e. y1 = (a * x1 + b) mod n = y2 = (a * x2 + b) mod n. Then (a * x1 + b - a * x2 - b) mod n = 0, that is, a * (x1 - x2) mod n = 0. Since a and n are coprime (their greatest common divisor is 1), x1 - x2 must be a multiple of n: x1 = x2 + k * n. When x1 and x2 are both less than n, k can only be 0, i.e. x1 = x2. This shows that as x runs over 0 ~ n-1 the remainders never repeat; since the remainders also lie in 0 ~ n-1, the conclusion follows.

So if we know n, we only need to find 100 or 200 numbers coprime with n (or simply 100 or 200 primes smaller than n; look up the sieve of Eratosthenes), then randomly generate 100 to 200 values of b, and we can construct that many such functions.

For example, with a = 3, b = 4, n = 8:

x = 0 y = 4
x = 1 y = 7
x = 2 y = 2
x = 3 y = 5
x = 4 y = 0
x = 5 y = 3
x = 6 y = 6
x = 7 y = 1
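
To make this concrete, here is a minimal sketch (my own illustration, not the production code) of how such functions can serve as the random permutations behind MinHash: pick pairs (a, b) with a coprime to n, take the minimum permuted value of each set as one signature component, and estimate the Jaccard similarity as the fraction of signature positions that agree. The shingling of a question into a set of term IDs is assumed to happen elsewhere.

import random
from math import gcd

def make_perm_funcs(n, k, seed=42):
    """Build k pseudo-random permutations of 0..n-1 of the form y = (a*x + b) mod n,
    keeping only pairs with gcd(a, n) == 1 so each function really is a permutation."""
    rng = random.Random(seed)
    funcs = []
    while len(funcs) < k:
        a = rng.randrange(1, n)
        if gcd(a, n) != 1:
            continue
        funcs.append((a, rng.randrange(0, n)))
    return funcs

def minhash_signature(term_ids, funcs, n):
    """For each permutation, keep the smallest permuted value over the set."""
    return [min((a * x + b) % n for x in term_ids) for (a, b) in funcs]

def estimate_jaccard(sig1, sig2):
    """The agreement rate of two signatures approximates the Jaccard similarity."""
    return sum(1 for u, v in zip(sig1, sig2) if u == v) / len(sig1)

# toy usage: two sets of term IDs drawn from a vocabulary of size n
n = 1 << 20
funcs = make_perm_funcs(n, k=128)
set_a = {3, 7, 11, 19, 23, 42}
set_b = {3, 7, 11, 19, 23, 57}
print(estimate_jaccard(minhash_signature(set_a, funcs, n),
                       minhash_signature(set_b, funcs, n)))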

Although this algorithm reduces the amount of computation somewhat, it was still not acceptable to me. The key problem is to first find the small groups of similar questions and then apply a more refined check within each group. How do we find those small groups comprehensively and cheaply?

Mining the online photo-search logs

As the saying goes, analyze the specific situation specifically; we should not chase fancy technology while ignoring real-world constraints. For example, Baidu also has a deduplication strategy, but what it finally put into production was not Jaccard similarity: it extracts the several longest sentences from each document and judges whether two documents are duplicates by whether those sentences match, and the accuracy is surprisingly good. So let's analyze our specific problem on its own terms.

Look at the photo-search process: the retrieval log records, for every search, the IDs of the few documents with the highest match scores. I can treat those documents as a small cluster, so there is no need to cluster from scratch. Moreover, thanks to the many optimizations in the search pipeline, photo-search accuracy is extremely high, so this ready-made clustering is both easier and more effective than any clustering algorithm I could reinvent myself. With such good logs in hand, we should make full use of them. Next I will explain in detail how I implemented the deduplication strategy.

The log format is as follows:

[[1380777178,0.306],[1589879284,0.303],
[1590076048,0.303],[1590131395,0.303],
[1333790406,0.303],[1421645703,0.303],
[1677567837,0.303],[1323001959,0.303],
[1440815753,0.303],[1446379010,0.303]]

This is a JSON array; each element contains a question ID and its retrieval score.

Selecting logs

Select as candidates the log entries whose question IDs have relatively high scores. We do this because the online image recognition is not one hundred percent accurate; when the picture quality is poor, the questions retrieved from the recognized text can differ greatly from the real question and may not belong to the same cluster at all.

Clustering

Building the initial sets

For each log entry, take the first-ranked ID as the set ID and the remaining IDs as elements of that cluster.
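
A rough sketch of this step (the score threshold and the exact JSON layout are assumptions of mine, based on the sample log line above):

import json

SCORE_THRESHOLD = 0.3  # assumed cutoff for "relatively high" scores

def build_initial_clusters(log_lines):
    """Each log line is a JSON array of [question_id, score] pairs.
    The top-ranked ID becomes the cluster ID; the rest become its members."""
    clusters = {}
    for line in log_lines:
        results = json.loads(line)
        # drop low-score candidates: poor OCR often retrieves unrelated questions
        ids = [qid for qid, score in results if score >= SCORE_THRESHOLD]
        if len(ids) < 2:
            continue
        head, rest = ids[0], ids[1:]
        clusters.setdefault(head, set()).update(rest)
    return clusters

# toy usage with a shortened version of the sample log line
sample = '[[1380777178,0.306],[1589879284,0.303],[1590076048,0.303]]'
print(build_initial_clusters([sample]))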

Merging sets

Look at the following example:

A -> B,C,D
E -> C,D,F

Since the two sets share some IDs, we can infer that they actually belong to the same cluster. How do we merge the two sets? Use the union-find (disjoint-set) algorithm (look it up; readers who have done programming contests will be familiar with it).

I wrote some sample code (linked below); a union-find set can handle the merge operations we need. For example, the merge of the two log entries above can be completed with the union-find join operation:

https://github.com/haolujun/Algorithm/tree/master/union_find_set

union_find_set.join(A,B)
union_find_set.join(A,C)
union_find_set.join(A,D)

union_find_set.join(E,C)
union_find_set.join(E,D)
union_find_set.join(E,F)

After these calls, we will find that A, B, C, D, E and F all belong to the same set.
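
For reference, here is a minimal union-find sketch of my own (a simplified stand-in for the code in the repository above) with the join operation used in the example:

class UnionFindSet:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        # path compression: point every node on the path directly at the root
        self.parent.setdefault(x, x)
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def join(self, x, y):
        root_x, root_y = self.find(x), self.find(y)
        if root_x != root_y:
            self.parent[root_y] = root_x

union_find_set = UnionFindSet()
for a, b in [("A", "B"), ("A", "C"), ("A", "D"),
             ("E", "C"), ("E", "D"), ("E", "F")]:
    union_find_set.join(a, b)

# all six IDs now share a single root
print({x: union_find_set.find(x) for x in "ABCDEF"})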

Limiting the number of elements per set

In actual testing I found that some sets could grow to millions of questions. This happens because errors accumulate during clustering. For example: A is similar to B, and B is similar to C, so we put A, B and C into one cluster, yet A and C may in fact not be similar. This kind of drift is very easy to run into when clustering.

Oversized clusters greatly increase the cost of the fine-grained comparison that follows, which is a somewhat simpler problem than deduplicating the whole bank, but still very expensive. Considering that the bank does not contain that many duplicates of any single question, we can cap the number of elements in each set: if the combined size of two sets to be merged exceeds the cap, we simply do not merge them. This is also easy to do with union-find, as shown below.
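
A sketch of such a size-capped join (the cap of 1000 is just an illustrative assumption; a real setting would depend on how many duplicates of one question you expect):

MAX_CLUSTER_SIZE = 1000  # assumed upper bound on the size of a cluster

class CappedUnionFindSet:
    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        self.size.setdefault(x, 1)
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def join(self, x, y):
        root_x, root_y = self.find(x), self.find(y)
        if root_x == root_y:
            return True
        # refuse to merge if the combined cluster would exceed the cap
        if self.size[root_x] + self.size[root_y] > MAX_CLUSTER_SIZE:
            return False
        self.parent[root_y] = root_x
        self.size[root_x] += self.size[root_y]
        return True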

Fine-grained comparison

How to determine whether two questions are duplicates

What we have now are clusters obtained by aggregating photo-search results, but there is a catch: the text used for retrieval is produced by OCR, which inevitably contains recognition errors. The search engine has to tolerate those errors and therefore applies some fuzzy-matching strategies, so the questions inside a cluster are not necessarily all similar. A fine-grained comparison is still needed.

So how do we decide whether two questions are duplicates? In particular, what do we do with math problems, where digits and operators are mixed in with Chinese characters? After a long analysis I concluded that digits, letters and Chinese characters cannot be compared in the same way. If the digits or letters differ, the two questions are most likely different; if the digits and letters are identical, the Chinese-character part can be allowed some differences, but not too many.

This gives my final fine-grained deduplication strategy: extract the Chinese characters and the digits/letters/operators from each question separately; if the digit/letter/operator parts are exactly equal and the similarity of the Chinese-character parts (computed with minimum edit distance or longest common subsequence) is greater than 80%, the two questions are considered the same.

"A比B大10" vs "B比A小10"  -- the digit/letter strings differ, so not considered duplicates
"小红买10本书" vs "小明买10本书"  -- digits/letters identical, Chinese-character similarity above 80%, considered duplicates
"今天空气温度为10度" vs "今天的空气温度为10度"  -- digits/letters identical, Chinese-character similarity above 80%, considered duplicates

This strategy cannot remove one hundred percent of the duplicate questions, but it reliably removes a significant portion of them.
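
A minimal sketch of this check (the 80% threshold comes from the strategy above; the character classification regex and the use of difflib's ratio in place of edit distance or LCS are my own simplifications):

import re
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8

def split_question(text):
    """Separate digits/letters/operators from the remaining (Chinese) characters."""
    symbolic = "".join(re.findall(r"[0-9A-Za-z+\-*/=<>().]", text))
    other = "".join(re.findall(r"[^0-9A-Za-z+\-*/=<>().\s]", text))
    return symbolic, other

def is_duplicate(q1, q2):
    s1, o1 = split_question(q1)
    s2, o2 = split_question(q2)
    # digits, letters and operators must match exactly
    if s1 != s2:
        return False
    # the remaining text only needs to be similar enough
    return SequenceMatcher(None, o1, o2).ratio() >= SIMILARITY_THRESHOLD

print(is_duplicate("A比B大10", "B比A小10"))                     # False
print(is_duplicate("小红买10本书", "小明买10本书"))               # True
print(is_duplicate("今天空气温度为10度", "今天的空气温度为10度"))   # True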

Which question to keep and which to remove?

Considering that the search engine's inverted index stores postings sorted by question ID (storing the deltas between consecutive IDs), we keep the smaller ID and remove the larger one; this is not hard to implement.

Periodic iteration

Our deduplication algorithm is log-driven: deduplicate with the logs we already have, then collect more online logs over a period of time and deduplicate again, iterating continuously.

Is the computation still too large?

Based on the measured single-machine cost, one round of deduplication over a batch of photo-search logs can be completed on a single machine; no cluster, no distributed computing needed.

Epilogue

Sharp-eyed readers may notice that I took a shortcut. I did not brute-force the deduplication of the whole bank head-on; instead I started from the photo-search logs and deduplicated the bank incrementally, step by step. As long as we iterate enough times, all duplicate questions will eventually be removed, and each round has a visible effect, which makes it easier to adjust the details of the strategy. So: looking at a problem from a different angle can lead to a much simpler approach.


Origin: blog.csdn.net/jiahao1186/article/details/91374734