Space limitations

Disclaimer: This is an original article by the blogger, licensed under CC 4.0 BY-SA. If you reproduce it, please attach the original source link and this statement.
Original link: https://blog.csdn.net/hahaEverybody/article/details/93317392

Topic one: find the most frequent number among 2 billion integers using only 2GB of memory

There is a large file containing 2 billion 32-bit integers. Find the number that occurs most often.
Memory is limited to 2GB.

Analysis one:

  • To find the most frequent number among many numbers, the usual approach is to count frequencies with a hash table: the key is the integer and the value is the number of times that integer occurs.
  • For this problem, even if a single number appears 2 billion times, a 32-bit integer can hold the count without overflowing. So the key takes 4B and the value takes 4B, and one hash table record occupies 8B. A hash table with 200 million records needs at least 1.6GB of memory.
  • If the 2 billion numbers contain more than 200 million distinct values, memory runs out. In the most extreme case all 2 billion numbers are different, and the table would need 2 billion * 8B = 16GB.
  • The solution is to split the large file of 2 billion numbers into 16 small files with a hash function. By the nature of a hash function, the same number can never be hashed to different small files. Assuming the hash function is good enough, each small file will contain no more than 200 million distinct numbers.
  • Then build a hash table for each small file in turn and count the occurrences of each number, which gives the most frequent number in each of the 16 small files together with its count.
  • Finally, pick the one with the highest count among these 16 candidates.
  • The file is split into 16 small files because of the 2GB memory limit: even if all 2 billion numbers are different, a good hash function puts about 125 million distinct numbers in each file, roughly 1GB at 8B per record.
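
The steps above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the function name `most_frequent`, the text-file format for the small files, and the use of Python's built-in `hash` as the partitioning function are all assumptions made for the example.

```python
import os
import tempfile
from collections import Counter


def most_frequent(numbers, num_parts=16):
    """Return (number, count) for the most frequent value in `numbers`.

    `numbers` is any iterable of non-negative integers (a stand-in for
    reading the large file). Returns (None, 0) for empty input.
    """
    best_value, best_count = None, 0
    with tempfile.TemporaryDirectory() as workdir:
        paths = [os.path.join(workdir, f"part_{i}.txt") for i in range(num_parts)]
        files = [open(p, "w") for p in paths]
        try:
            # Pass 1: route every number to a small file chosen by its hash,
            # so all copies of the same number end up in the same file.
            for n in numbers:
                files[hash(n) % num_parts].write(f"{n}\n")
        finally:
            for f in files:
                f.close()

        # Pass 2: count one small file at a time, so only one
        # frequency table is held in memory at any moment.
        for p in paths:
            counts = Counter()
            with open(p) as f:
                for line in f:
                    counts[int(line)] += 1
            if counts:
                value, count = counts.most_common(1)[0]
                if count > best_count:
                    best_value, best_count = value, count
    return best_value, best_count
```

For example, `most_frequent([5, 1, 5, 3, 5, 1, 70000, 3])` returns `(5, 3)`. A Python `Counter` uses far more than 8B per record, so the memory figures in the analysis apply to a compact hash table rather than to this sketch.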

Using a hash function to distribute a large data set across multiple machines, or across multiple files, is a common technique in big data processing. How many machines or files to use is determined by the memory limits.
For example:

  • There is a large file containing 10 billion URLs, and each URL occupies 64B. Find all duplicate URLs.
  • Solution:
    • Use a hash function to split the large file into small files on one or several machines (the nature of the hash function guarantees that the same URL is never assigned to different machines or different files). The number of machines or files is chosen according to the resource limits.
    • Then traverse each small file with a hash table to find the duplicate URLs.
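
A small sketch of this split-then-scan idea is shown below. It rests on several assumptions: the function name `find_duplicate_urls` is hypothetical, `zlib.crc32` stands in for whatever hash function is actually used, and the "small files" are simulated with in-memory lists to keep the example short.

```python
import zlib
from collections import defaultdict


def find_duplicate_urls(urls, num_parts=8):
    """Return the set of URLs that occur more than once in `urls`."""
    # Split step: in a real run each bucket would be a file (or a machine);
    # in-memory lists are used here only as a stand-in.
    buckets = defaultdict(list)
    for url in urls:
        part = zlib.crc32(url.encode("utf-8")) % num_parts
        buckets[part].append(url)

    # Scan step: the same URL always lands in the same bucket,
    # so each bucket can be checked independently.
    duplicates = set()
    for bucket in buckets.values():
        seen = set()
        for url in bucket:
            if url in seen:
                duplicates.add(url)
            else:
                seen.add(url)
    return duplicates
```

Because each bucket is processed on its own, only one `seen` set needs to fit in memory at a time, which is the point of the split.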

Topic two: find the numbers that do not appear among 4 billion non-negative integers

The range of a 32-bit unsigned integer is 0 to 2^32 - 1 = 4,294,967,295. A file contains exactly 4 billion unsigned integers, so some numbers in the range are bound to be missing. Using at most 1GB of memory, how do you find all the numbers that do not appear?
Advanced: memory is limited to 10MB, but you only need to find one number that does not appear.

Analysis two:

If the 4 billion numbers are all different, storing them in a hash table requires 4 billion * 4B = 16 billion B = 16GB of space, which does not meet the requirement.
A better approach:

  • Allocate a bit array bitArr of length 2^32. Its size is 2^32 bits = 2^29 B = 512MB.
  • Traverse the 4 billion numbers and set the corresponding position to 1. For example, when the number 7000 is encountered, set bitArr[7000] to 1.
  • Then traverse bitArr. If a position is 0, the number at that index does not appear.
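
A minimal Python sketch of the bitmap approach follows. The function name `find_missing` and the `range_bits` parameter are assumptions added so the idea can be tried on a small range; with `range_bits=32` the `bytearray` takes the 512MB computed above.

```python
def find_missing(numbers, range_bits=32):
    """Yield every value in [0, 2**range_bits) that is absent from `numbers`."""
    size = 1 << range_bits
    # One bit per possible value, packed 8 to a byte.
    bit_arr = bytearray((size + 7) // 8)

    # Mark each number that appears.
    for num in numbers:
        bit_arr[num >> 3] |= 1 << (num & 7)

    # Any bit still 0 corresponds to a number that never appeared.
    for num in range(size):
        if not bit_arr[num >> 3] & (1 << (num & 7)):
            yield num
```

With a 3-bit range, `list(find_missing([0, 1, 2, 4, 5, 7, 7], range_bits=3))` gives `[3, 6]`.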

Advanced problem: divide the range 0 to 2^32 - 1 equally into 64 intervals of 2^26 numbers each. If fewer than 2^26 numbers fall into an interval, at least one number in that interval must be missing.

  • First pass over the 4 billion numbers: allocate an integer array count[0..63] of length 64, where count[i] is how many numbers fall into interval i. When num is visited, num / 2^26 is the interval it belongs to, so increment that counter: count[num / 2^26]++.
  • Then traverse the count array. If the value at position i is less than 2^26, interval i must contain a number that does not appear. The memory required is 64 * 4B.
  • Suppose the count of interval i is less than 2^26. Allocate a bit array of length 2^26 (8MB) and traverse the 4 billion numbers again, looking only at numbers that fall into interval i, that is, numbers with num / 2^26 equal to i. For each of them, set bit[num - 2^26 * i] to 1.
  • Traverse the bit array. For any index whose bit is not 1, the missing number is 2^26 * i + index.
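
The two-pass procedure can be sketched as below. Everything named here is an assumption for illustration: `find_one_missing` is a hypothetical function, and `read_numbers` is a callable that returns a fresh iterator each time, standing in for re-reading the file. The `range_bits` and `num_intervals` parameters default to the values in the text but can be shrunk for testing.

```python
def find_one_missing(read_numbers, range_bits=32, num_intervals=64):
    """Return one value in [0, 2**range_bits) missing from the data, or None."""
    interval_size = (1 << range_bits) // num_intervals

    # Pass 1: count how many numbers fall into each interval.
    counts = [0] * num_intervals
    for num in read_numbers():
        counts[num // interval_size] += 1

    # An interval holding fewer numbers than its size must miss a value.
    target = next((i for i, c in enumerate(counts) if c < interval_size), None)
    if target is None:
        return None  # no interval is provably incomplete

    # Pass 2: bitmap over the chosen interval only.
    base = target * interval_size
    bits = bytearray((interval_size + 7) // 8)
    for num in read_numbers():
        if num // interval_size == target:
            offset = num - base
            bits[offset >> 3] |= 1 << (offset & 7)

    # The first unset bit gives a missing number.
    for offset in range(interval_size):
        if not bits[offset >> 3] & (1 << (offset & 7)):
            return base + offset
    return None
```

With the default parameters the bitmap covers 2^26 bits, which is the 8MB figure from the analysis and fits within the 10MB limit.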

Topic three: Top-K search terms

A search company's users produce a massive number of search terms every day (on the order of ten billion). Design a feasible method for obtaining the day's top 100 hot terms.

Analysis three:

  • Use a hash function to distribute the file of ten billion terms across different machines.
  • If the amount of data assigned to a machine is still too large, use a hash function again to split the large file on that machine into smaller files.
  • Process each small file with a hash table that stores each term and its number of occurrences;
  • Traverse the hash table and build a min-heap to find the top 100 of each small file;
  • Sort the top 100 of each small file, then select the top 100 on each machine with an external-sort (merge) pass;
  • Merge the top 100 lists of the different machines in the same way, by external sorting or with a min-heap, to obtain the top 100 of the full ten billion terms.
  • For external sorting, see https://blog.csdn.net/hahaEverybody/article/details/91125723
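
The partition, count, and min-heap steps can be sketched on a single machine as follows. The names `top_k_of_partition` and `top_k` are hypothetical, `zlib.crc32` is an assumed stand-in for the hash function, the partitions are in-memory lists rather than files or machines, and the final merge uses `heapq.nlargest` in place of the external sort mentioned in the text.

```python
import heapq
import zlib
from collections import Counter, defaultdict


def top_k_of_partition(words, k):
    """Return up to k (count, word) pairs with the highest counts."""
    counts = Counter(words)
    heap = []  # min-heap: heap[0] is the weakest of the current top k
    for word, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, word))
        elif (count, word) > heap[0]:
            # The new entry beats the weakest one, so replace it.
            heapq.heapreplace(heap, (count, word))
    return heap


def top_k(words, k=100, num_parts=4):
    """Return the k most frequent words as (word, count), most frequent first."""
    # Hash partitioning: every copy of a word goes to the same partition,
    # so per-partition counts are already the final counts.
    parts = defaultdict(list)
    for word in words:
        parts[zlib.crc32(word.encode("utf-8")) % num_parts].append(word)

    candidates = []
    for part in parts.values():
        candidates.extend(top_k_of_partition(part, k))

    # Merge the per-partition winners into the overall top k.
    return [(word, count) for count, word in heapq.nlargest(k, candidates)]
```

The heap never grows beyond k entries, so finding the top 100 of a partition costs memory for the frequency table plus 100 heap slots.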

Topic four: find the numbers that appear exactly twice among 4 billion non-negative integers

The range of a 32-bit unsigned integer is 0 to 2^32 - 1 = 4,294,967,295. There are 4 billion unsigned integers, and you can use at most 1GB of memory. Find all the numbers that appear exactly twice.

Analysis four:

  • Allocate a bit array of length 2^32 * 2, which takes 1GB. Each pair of adjacent bits records how often one number has occurred.
  • Traverse the 4 billion unsigned numbers. On the first encounter of num, set bit[num * 2 + 1] and bit[num * 2] to 01; on the second encounter, set them to 10; on the third, set them to 11. If num is encountered again and the bits are already 11, change nothing.
  • After the traversal is complete, traverse the bit array. If bit[i * 2 + 1] and bit[i * 2] are 10, then i is a number that appears exactly twice.
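
Here is a minimal Python sketch of the 2-bit counter idea. The function name `find_appearing_twice` and the `range_bits` parameter are assumptions for illustration; four 2-bit counters are packed into each byte of a `bytearray`, which comes to 1GB when `range_bits=32`.

```python
def find_appearing_twice(numbers, range_bits=32):
    """Yield every value in [0, 2**range_bits) that occurs exactly twice."""
    size = 1 << range_bits
    # Two bits per value: 00 = unseen, 01 = once, 10 = twice, 11 = three or more.
    counters = bytearray((size * 2 + 7) // 8)

    for num in numbers:
        byte_index = num >> 2        # four counters per byte
        shift = (num & 3) * 2        # bit position of this counter in the byte
        state = (counters[byte_index] >> shift) & 0b11
        if state < 3:
            # Incrementing a counter below 3 cannot carry into its neighbour.
            counters[byte_index] += 1 << shift

    for num in range(size):
        if (counters[num >> 2] >> ((num & 3) * 2)) & 0b11 == 2:
            yield num
```

Once a counter reaches 11 it is left alone, so any number seen three or more times is never reported.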

Topic five: find the median of 4 billion integers

The range of a 32-bit unsigned integer is 0 to 2^32 - 1 = 4,294,967,295. There are 4 billion unsigned integers, and you can use at most 10MB of memory. How do you find the median of these 4 billion integers?

Analysis five:

  • An unsigned integer array of length 2M, i.e. 2^21, takes 8MB of space. Let each interval contain 2^21 numbers; the number of intervals is then 2^32 / 2^21 = 2^11 = 2048. Interval 0 is 0 to 2^21 - 1, interval 1 is 2^21 to 2^22 - 1, and interval i is 2M * i to 2M * (i + 1) - 1.
  • Allocate an integer array of length 2048 and record how many numbers fall into each interval.
  • By adding up the counts interval by interval, you can locate the interval that contains the median of the 4 billion numbers, i.e. the 2 billionth number. For example, if the first k-1 intervals contain 1,999,800,000 numbers in total and adding the count of interval k pushes the total past 2,000,000,000, then the median lies in interval k and is the 200,000th number in interval k.
  • Next, allocate an unsigned integer array count of length 2M (8MB) and traverse the 4 billion numbers again. This time look only at numbers num that fall into interval k, and do count[num - k * 2M]++.
  • After traversing the 4 billion numbers, you have the frequency statistics for interval k. Finally, find the 200,000th number in interval k from those counts.
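
The interval-counting procedure can be sketched as below. The function name `find_median`, the re-readable `read_numbers` callable, and the `range_bits` / `bucket_bits` parameters are assumptions for illustration. Following the text, the median is taken to be the element of rank n/2 (the lower median for an even count). Python lists are used for the counters, so the 8MB figure applies to a packed 4-byte array, not to this sketch.

```python
def find_median(read_numbers, range_bits=32, bucket_bits=21):
    """Return the lower median of the values produced by read_numbers()."""
    num_buckets = 1 << (range_bits - bucket_bits)

    # Pass 1: count how many numbers fall into each interval.
    bucket_counts = [0] * num_buckets
    for num in read_numbers():
        bucket_counts[num >> bucket_bits] += 1

    total = sum(bucket_counts)
    if total == 0:
        raise ValueError("no numbers to take a median of")
    target_rank = (total + 1) // 2  # 1-based rank of the median

    # Accumulate counts until the interval containing the median is reached.
    seen = 0
    for k, count in enumerate(bucket_counts):
        if seen + count >= target_rank:
            break
        seen += count
    rank_in_bucket = target_rank - seen

    # Pass 2: exact frequencies, but only for interval k.
    base = k << bucket_bits
    counts = [0] * (1 << bucket_bits)
    for num in read_numbers():
        if num >> bucket_bits == k:
            counts[num - base] += 1

    # Walk interval k in order until the wanted rank is reached.
    running = 0
    for offset, count in enumerate(counts):
        running += count
        if running >= rank_in_bucket:
            return base + offset
```

For instance, with `data = [3, 9, 1, 14, 7, 7, 12]`, calling `find_median(lambda: iter(data), range_bits=4, bucket_bits=2)` returns `7`.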
