Common ideas and methods for massive data processing (repost)

Recently I have been researching personalized recommendation systems. My foundation is weak and I have no experience with massive data processing, so this article shares some notes collected along the way. To apply personalized recommendation technology on the Internet, you must face the problems of incremental computation and scalability, i.e. the system must be deployable on a server cluster in a distributed manner so as to meet the needs of real-time recommendation.
1. Bloom filter

Scope of application: can be used to implement a data dictionary, to check data for duplicates, or to compute set intersections.

Basic principles and key points:
The principle is simple: a bit array plus k independent hash functions. To insert an element, set the bits at the positions given by the k hash functions to 1. When querying, if the bits at all k positions are 1, the element is considered present. Obviously this process does not guarantee that the result is 100% correct, and deleting an inserted keyword is not supported, because clearing its bits may affect other keywords. A simple improvement is the counting Bloom filter, which replaces each bit with a counter and thus supports deletion.

Another important question is how to choose the size of the bit array m and the number of hash functions k from the number of input elements n. The false positive rate is minimized when k = (ln 2) * (m / n). For the error rate to be no greater than E, m must be at least n * log2(1/E) to represent any set of n elements. But m should be larger still, because at least half of the bit array should remain 0; then m should be >= n * log2(1/E) * log2(e), about 1.44 * n * log2(1/E) (log2 denotes the base-2 logarithm).

For example, with an error rate of 0.01, m should be roughly 10 times n (about 9.6 bits per element), and k is then about 7.

Note that m and n have different units: m is measured in bits, while n is measured in number of elements (more precisely, number of distinct elements). Since a single element is usually many bits long, using a Bloom filter usually saves memory.
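As a minimal Python sketch of the above (the sizing follows the formulas given; deriving the k hash functions from two base digests is a common double-hashing trick assumed here, not part of the original description):

```python
import math
import hashlib

class BloomFilter:
    def __init__(self, n, error_rate=0.01):
        # m >= n * log2(1/E) * log2(e) bits, k = (m/n) * ln2
        self.m = max(1, int(math.ceil(n * math.log2(1 / error_rate) * math.log2(math.e))))
        self.k = max(1, int(round((self.m / n) * math.log(2))))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two independent digests (double hashing).
        h1 = int.from_bytes(hashlib.md5(item.encode()).digest(), "big")
        h2 = int.from_bytes(hashlib.sha1(item.encode()).digest(), "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))

bf = BloomFilter(n=1000, error_rate=0.01)
bf.add("http://example.com/a")
print("http://example.com/a" in bf)  # True
print("http://example.com/b" in bf)  # almost certainly False
```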

Extension:
The Bloom filter maps the elements of a set into a bit array and uses k bits (k is the number of hash functions) to indicate whether an element is in the set. The counting Bloom filter (CBF) expands each bit of the array into a counter, thereby supporting element deletion. The Spectral Bloom Filter (SBF) associates the counters with the number of occurrences of each element, and takes the minimum value among an element's counters to approximate its frequency.

Problem example: given two files A and B, each storing 5 billion URLs, each URL occupying 64 bytes, with a memory limit of 4 GB, find the URLs common to files A and B. What if there are three, or even n, files?

Let's work out the memory budget for this problem. 4 GB = 2^32 bytes, which is about 34 billion bits. With n = 5 billion and an error rate of 0.01, roughly 48 billion bits are needed (about 9.6 bits per element). Only about 34 billion bits are available, which falls short, so the error rate will rise somewhat. Also, if these URLs correspond one-to-one with IPs, they can be converted to IPs, which makes things much simpler.

2. Hashing

Scope of application: a basic data structure for fast lookup and deletion; usually requires that the total amount of data fits in memory.

Basic principles and key points:
Choice of hash function: strings, integers, and arrays each have suitable hashing methods.
Collision handling: one approach is open hashing, also known as the chaining (zipper) method; the other is closed hashing, also known as open addressing.

Extension:
The d in d-left hashing means more than one. Let's simplify the problem and look at 2-left hashing first. 2-left hashing divides a hash table into two halves of equal length, called T1 and T2, and equips T1 and T2 with hash functions h1 and h2 respectively. When storing a new key, both hash functions are computed, yielding two addresses h1[key] and h2[key]. We then check position h1[key] in T1 and position h2[key] in T2 to see which one already holds more (collided) keys, and store the new key in the less loaded position. If both sides are equally loaded, for example both positions are empty or both hold one key, the new key is stored in the left subtable T1; this is where the "left" in 2-left comes from. When looking up a key, both hashes must be computed and both positions checked.
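A minimal sketch of 2-left hashing under the description above, assuming simple bucket lists and Python's built-in hash (the table size and hash seeding are illustrative):

```python
class TwoLeftHashTable:
    def __init__(self, size=1024):
        self.t1 = [[] for _ in range(size)]  # left subtable T1
        self.t2 = [[] for _ in range(size)]  # right subtable T2
        self.size = size

    def _h1(self, key):
        return hash(("h1", key)) % self.size

    def _h2(self, key):
        return hash(("h2", key)) % self.size

    def insert(self, key):
        b1, b2 = self.t1[self._h1(key)], self.t2[self._h2(key)]
        # Put the key into the less loaded bucket; ties go to the left table T1.
        (b1 if len(b1) <= len(b2) else b2).append(key)

    def lookup(self, key):
        # Both candidate positions must be checked.
        return key in self.t1[self._h1(key)] or key in self.t2[self._h2(key)]
```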

Example of the problem:
1). From the massive log data, extract the IP with the most visits to Baidu on a certain day.

The number of IPs is still limited, at most 2^32, so consider storing the IPs and their counts directly in an in-memory hash table, then taking the maximum.
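A sketch of that approach, assuming one IP per line in a hypothetical log file named access.log:

```python
from collections import Counter

counts = Counter()
with open("access.log") as f:          # hypothetical log file, one IP per line
    for line in f:
        ip = line.strip()
        if ip:
            counts[ip] += 1

top_ip, top_hits = counts.most_common(1)[0]
print(top_ip, top_hits)
```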

3. Bit-map

Scope of application: fast lookup, duplicate checking, and deletion of data; generally the data range is within about 10 times the range of an int.

Basic principles and key points: use a bit array to indicate whether certain elements exist, for example 8-digit phone numbers.

Extension: the Bloom filter can be seen as an extension of the bit-map.

Problem example:

1) A certain file is known to contain some phone numbers, each 8 digits long; count the number of distinct numbers.

8 digits go up to 99,999,999, which requires about 99M bits, or roughly 12 MB of memory.

2) Find the number of non-repeating integers among the 250 million integers, and the memory space is not enough to accommodate these 250 million integers.

Extend the bit-map by using 2 bits per number: 00 means not seen, 01 means seen once, and 10 means seen twice or more; the answer is the count of values in state 01. Alternatively, instead of 2 bits per number, we can simulate this 2-bit map with two separate bit-maps.
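A sketch of the 2-bit-per-number idea using a Python bytearray; max_value is kept tiny here only so the demo runs instantly (the full 2^32 range would need about 1 GB):

```python
def count_unique(numbers, max_value=2**32):
    # Two bits per value: 0 = never seen, 1 = seen once, 2 = seen twice or more.
    bitmap = bytearray((max_value * 2 + 7) // 8)

    def get(v):
        byte, off = (2 * v) // 8, (2 * v) % 8
        return (bitmap[byte] >> off) & 0b11

    def set_(v, state):
        byte, off = (2 * v) // 8, (2 * v) % 8
        bitmap[byte] = (bitmap[byte] & ~(0b11 << off)) | (state << off)

    for n in numbers:
        state = get(n)
        if state < 2:
            set_(n, state + 1)

    # Numbers whose state is exactly 1 appeared only once.
    return sum(1 for v in range(max_value) if get(v) == 1)

print(count_unique([3, 5, 3, 9, 7, 7, 7], max_value=16))  # 2 (only 5 and 9 appear once)
```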

4. Heap

Scope of application: finding the top n of massive data, where n is relatively small, so the heap fits in memory.

Basic principles and key points: a max-heap finds the smallest n, a min-heap finds the largest n. For example, to find the smallest n elements, compare each incoming element with the largest element in the max-heap; if it is smaller, replace that largest element. The n elements left in the heap at the end are the smallest n. This is suitable when the data volume is large and n is small: all top-n elements can be obtained in a single scan, which is very efficient.

Extension: Double heap, a max heap combined with a min heap, can be used to maintain the median.

Example of the problem:
1) Find the largest 100 numbers among 1,000,000 (100w) numbers.

Use a min-heap of 100 elements.
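A sketch with Python's heapq, keeping a min-heap of 100 elements as described (the random numbers stand in for the 100w input):

```python
import heapq
import random

numbers = (random.randint(0, 10**9) for _ in range(1_000_000))  # stand-in for the 100w numbers

heap = []  # min-heap holding the 100 largest numbers seen so far
for x in numbers:
    if len(heap) < 100:
        heapq.heappush(heap, x)
    elif x > heap[0]:
        heapq.heapreplace(heap, x)  # pop the smallest of the current top-100, push x

top100 = sorted(heap, reverse=True)
print(top100[:5])
```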

5. Double-layer bucket division - in essence this is the idea of "divide and conquer", with the emphasis on the skill of "dividing"!

Scope of application: the kth largest, median, non-repeating or repeating numbers

Basic principles and key points: because the range of the elements is too large for a direct addressing table, the range is narrowed down step by step through multiple rounds of division until it finally falls within an acceptable range. The shrinking can be done multiple times; the double layer is just one example.

Extension:

Problem example:
1). Find the number of non-repeating integers among 250 million integers, and the memory space is not enough to accommodate these 250 million integers.

This is a bit like the pigeonhole principle. There are 2^32 possible integers, so we can divide the 2^32 values into 2^8 regions (for example, one file per region), distribute the data into the different regions, and then solve each region directly with a bitmap. In other words, as long as there is enough disk space, this can be solved easily.

2). 500 million ints to find their median.

This example is more obvious than the one above. First divide the ints into 2^16 regions, then read the data and count how many numbers fall into each region. From these counts we can determine which region the median falls into and what rank it has within that region. In a second scan we count only the numbers that fall into that region.

In fact, even if it were int64 rather than int, we could reduce it to an acceptable level after about three such divisions: first split the int64 range into 2^24 regions and determine the rank of the target within its region, then split that region into 2^20 sub-regions and determine the rank within the sub-region; a sub-region contains only 2^20 values, so a direct addressing table can be used for the final count.
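A sketch of the two-pass region-counting idea for the median of 32-bit integers, assuming the data can be streamed twice (here it is simply re-read from a small in-memory list for illustration):

```python
def median_by_buckets(stream_factory, total_count):
    # Pass 1: count how many numbers fall into each of the 2**16 regions (high 16 bits).
    region_counts = [0] * (1 << 16)
    for x in stream_factory():
        region_counts[(x >> 16) & 0xFFFF] += 1

    target = (total_count - 1) // 2  # 0-based index of the (lower) median
    seen = 0
    for region, c in enumerate(region_counts):
        if seen + c > target:
            rank_in_region = target - seen  # the median is this-th smallest inside the region
            break
        seen += c

    # Pass 2: collect only the numbers in that region and select the required rank.
    in_region = sorted(x for x in stream_factory() if (x >> 16) & 0xFFFF == region)
    return in_region[rank_in_region]

data = [7, 1, 9, 70000, 3, 65536, 5]
print(median_by_buckets(lambda: iter(data), len(data)))  # 7
```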

6. Database index

Scope of application: insertion, deletion, update, and query over large amounts of data.

Basic principles and key points: use database design and implementation techniques (indexes) to handle insertion, deletion, update, and query over massive data.
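A minimal sqlite3 sketch of the point; the table, column names, and data are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (ip TEXT, url TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [("10.0.0.%d" % (i % 50), "/page/%d" % i, i) for i in range(10_000)],
)

# Without an index this query scans the whole table; with it, a B-tree lookup suffices.
conn.execute("CREATE INDEX idx_visits_ip ON visits(ip)")

row = conn.execute(
    "SELECT COUNT(*) FROM visits WHERE ip = ?", ("10.0.0.7",)
).fetchone()
print(row[0])
```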
Extension:
Question example:


7. Inverted index

Scope of application: search engines, keyword queries.

Basic principles and key points: why is it called an inverted index? It is an indexing method used in full-text search to store, for each word, a mapping to its locations in a document or a set of documents.

Taking English as an example, here is the text to be indexed:
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
We get the following inverted file index:
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
A query for "what", "is" and "it" corresponds to the intersection of these sets.
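A sketch that builds the same inverted index for T0, T1, T2 and answers the query by intersecting posting lists:

```python
docs = {
    0: "it is what it is",
    1: "what is it",
    2: "it is a banana",
}

# Build the inverted index: word -> set of document ids.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

print(index["is"])   # {0, 1, 2}

# A conjunctive query is the intersection of the posting lists.
query = ["what", "is", "it"]
result = set.intersection(*(index.get(w, set()) for w in query))
print(result)        # {0, 1}
```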

A forward index was developed to store the list of words for each document. Forward-index queries suit ordered, frequent full-text queries over each document and verification of each word in a checked document. In a forward index, the document occupies the central position: each document points to a sequence of the index entries (words) it contains. In an inverted index, it is the word that points to the documents containing it, so the inverse relationship is easy to see.

Extension:

Problem example: a document retrieval system; find the documents containing a given word, for example keyword search over academic papers.

8. External sorting

Scope of application: sorting and deduplicating big data.

Basic principles and key points: the merge method of external sorting, replacement selection with a loser tree, the optimal merge tree.

Extension:

Problem example:
1). There is a 1 GB file in which each line is a word; no word exceeds 16 bytes, and the memory limit is 1 MB. Return the 100 most frequent words.

This data has an obvious characteristic: each word is at most 16 bytes, but with only 1 MB of memory hashing all the words is not feasible, so sorting can be used instead, with the memory serving as an input buffer.
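A sketch of that route, assuming a hypothetical input file words.txt: split the file into sorted runs that fit in memory, k-way merge them with heapq.merge, then count frequencies on the merged stream, where equal words are adjacent:

```python
import heapq
from collections import Counter
from itertools import groupby

RUN_SIZE = 50_000  # lines per in-memory run, tuned to the memory limit

def split_into_sorted_runs(path):
    """Read the big file in chunks that fit in memory, sort each chunk, write it out."""
    run_files, buf = [], []
    with open(path) as f:
        for line in f:
            buf.append(line.strip())
            if len(buf) >= RUN_SIZE:
                run_files.append(write_run(buf, len(run_files)))
                buf = []
    if buf:
        run_files.append(write_run(buf, len(run_files)))
    return run_files

def write_run(buf, idx):
    buf.sort()
    name = "run_%d.txt" % idx
    with open(name, "w") as out:
        out.write("\n".join(buf) + "\n")
    return name

def top_words(path, k=100):
    runs = [open(name) for name in split_into_sorted_runs(path)]
    merged = heapq.merge(*((line.rstrip("\n") for line in r) for r in runs))
    # After the k-way merge equal words are adjacent, so counting is a streaming groupby.
    counts = Counter({word: sum(1 for _ in grp) for word, grp in groupby(merged)})
    return counts.most_common(k)

# print(top_words("words.txt"))  # "words.txt" is a hypothetical 1 GB input file
```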

9. Trie tree

Scope of application: large amounts of data with many repetitions, but the distinct items are few enough to fit in memory.

Basic principles and key points: implementation details; how to represent a node's children.

Extension: compressed implementation.

Example of the problem:
1). There are 10 files, each 1 GB; each line of each file stores a user query, and the queries in each file may be repeated. Sort the queries by frequency.

2). 10 million strings, some of which are duplicates; remove all duplicates and keep only the distinct strings. How would you design and implement this? (See the trie sketch after this list.)

3). Find popular queries: the query strings have a high repetition rate; although the total is 10 million, after removing duplicates there are no more than 3 million, each no longer than 255 bytes.
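A minimal trie sketch for problem 2): each node keeps its children in a dict, and insert reports whether the string was already present.

```python
class TrieNode:
    __slots__ = ("children", "is_end")
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_end = False

def insert(root, word):
    """Insert word; return True if it was already in the trie (a duplicate)."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    seen_before = node.is_end
    node.is_end = True
    return seen_before

root = TrieNode()
strings = ["abc", "abd", "abc", "xyz"]
unique = [s for s in strings if not insert(root, s)]
print(unique)  # ['abc', 'abd', 'xyz']
```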

10. Distributed processing (MapReduce)

Scope of application: the amount of data is large, but the data types are few enough to fit in memory.

Basic principles and key points: hand the data to different machines for processing, partition the data, and reduce the results.
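A single-machine sketch of the map / partition / reduce flow using Python's multiprocessing; a real deployment would run on a framework such as Hadoop, and the word-count task here is only an illustration:

```python
from collections import Counter
from multiprocessing import Pool

def map_phase(chunk_of_lines):
    # Map: each worker counts words in its own chunk of the data.
    c = Counter()
    for line in chunk_of_lines:
        c.update(line.split())
    return c

def reduce_phase(partial_counts):
    # Reduce: merge the per-worker partial results into a single count.
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

if __name__ == "__main__":
    lines = ["it is what it is", "what is it", "it is a banana"] * 1000
    chunks = [lines[i::4] for i in range(4)]        # partition the data across 4 workers
    with Pool(4) as pool:
        partials = pool.map(map_phase, chunks)
    print(reduce_phase(partials).most_common(3))
```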


