Heap applications: how to quickly get the Top 10 most popular search keywords?

Consider a search engine's hot-search ranking. A search engine receives a huge number of user search requests every day. It records the keywords users enter and then analyzes them offline to produce the Top 10 search keywords.

Suppose we have a log file containing one billion search keywords. How do we quickly get the Top 10 most popular keywords from it?

Heap application one: priority queues

A priority queue is first of all a queue, and an ordinary queue's defining property is first-in, first-out (FIFO). In a priority queue, however, elements do not leave in arrival order: they leave in priority order, with the highest-priority element dequeued first.

A heap is the natural structure for implementing a priority queue; in fact, a heap and a priority queue are so similar that a heap can simply be viewed as a priority queue. Inserting an element into the priority queue corresponds to pushing an element onto the heap; removing the highest-priority element corresponds to popping the heap top.
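
Python's heapq module makes this correspondence easy to see. A tiny illustrative example (heapq is a small-top heap, so a smaller number means a higher priority; the task names are made up):

```python
import heapq

pq = []                                    # the heap is the priority queue
heapq.heappush(pq, (2, "write report"))    # insert = push onto the heap
heapq.heappush(pq, (1, "fix outage"))      # priority 1 = most urgent
heapq.heappush(pq, (3, "reply to email"))

print(heapq.heappop(pq))                   # (1, 'fix outage') leaves first
```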

Priority queues have many application scenarios. Here are two typical ones:

One: merging sorted small files

Suppose there are 100 small files, each 100 MB in size, and each file stores strings in sorted order. We want to merge these 100 small files into one large sorted file. A priority queue is the right tool.

The naive approach: take the first string from each of the 100 files and put them into an array, compare them, write the smallest string into the large merged file, and remove it from the array. Suppose that smallest string came from the small file 13.txt; we then take the next string from 13.txt, put it into the array, compare again, pick the smallest, append it to the large file, and delete it from the array. We repeat this until the data of every file has been written into the large file.

Storing the candidate strings in an array means that every time we extract the minimum we must scan the whole array, which is not very efficient. A priority queue works better: take one string from each file and put them all into a small-top heap. The heap top, which is the head of the priority queue, is the smallest string. Write it into the large file and delete it from the heap, then take the next string from the same small file and push it into the heap. Looping like this merges the data of the 100 small files into the large file in sorted order.
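
Here is a minimal Python sketch of this heap-based merge, assuming each small file stores one string per line in sorted order (the file paths and the line-per-string format are assumptions for illustration):

```python
import heapq

def merge_sorted_files(input_paths, output_path):
    files = [open(p) for p in input_paths]
    heap = []
    # Seed the small-top heap with the first string from every file; the
    # file index rides along so we know which file to refill from.
    for i, f in enumerate(files):
        line = f.readline()
        if line:
            heapq.heappush(heap, (line, i))
    with open(output_path, "w") as out:
        while heap:
            smallest, i = heapq.heappop(heap)  # heap top = smallest string
            out.write(smallest)
            nxt = files[i].readline()          # next string from that file
            if nxt:
                heapq.heappush(heap, (nxt, i))
    for f in files:
        f.close()
```

The standard library already ships this pattern as heapq.merge, which lazily merges any number of sorted inputs in the same way.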

Two: a high-performance timer

A timer maintains many scheduled tasks, each with a set time at which it should trigger. A naive timer wakes up every tiny unit of time, scans the whole task list, and checks whether any task has reached its set time; if so, it takes the task out and executes it.

A better method is a priority queue. Store the tasks in the queue ordered by their set execution time, so the head of the queue is the first task to execute. The timer simply looks at the head task, subtracts the current time from its execution time to get an interval T, and sets itself to wake up after T seconds. Between now and then it does nothing; none of the pointless scans during those T seconds are needed.

When the T seconds have elapsed, the timer takes the head task off the priority queue and executes it, then computes a new T as the difference between the new head task's execution time and the current time.
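
A toy sketch of such a timer, assuming tasks are (execution timestamp, callback) pairs; the sequence number exists only to break ties so callbacks never get compared:

```python
import heapq
import time

def run_timer(tasks):
    # tasks: iterable of (execute_at, callback), execute_at in epoch seconds
    heap = [(at, i, cb) for i, (at, cb) in enumerate(tasks)]
    heapq.heapify(heap)
    while heap:
        execute_at, _, _ = heap[0]        # head task of the priority queue
        t = execute_at - time.time()      # interval T until it is due
        if t > 0:
            time.sleep(t)                 # idle until then instead of polling
        _, _, callback = heapq.heappop(heap)
        callback()
```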

Heap application two: using a heap to find the Top K

Top K problems can be abstracted into two categories. One is over a static data set, i.e. the data set is determined in advance and never changes. The other is over a dynamic data set, where the data is not determined in advance and elements keep arriving.

For static data: given an array of n elements, how do we find the K largest? Maintain a small-top heap of size K and scan the array, comparing each element with the heap top. If the element is larger than the heap top, remove the heap top and insert the element into the heap; if it is smaller than the heap top, do nothing. After the whole array has been traversed, the heap holds the K largest elements.

One heap operation takes O(log K), and in the worst case all n elements pass through the heap, so the time complexity is O(n log K).
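
A sketch of the static version in Python (heapq provides the small-top heap):

```python
import heapq

def top_k(nums, k):
    heap = nums[:k]
    heapq.heapify(heap)            # small-top heap of the first K elements
    for x in nums[k:]:
        if x > heap[0]:            # larger than the heap top: replace it
            heapq.heapreplace(heap, x)
    return heap                    # the K largest elements, in heap order

print(top_k([3, 1, 8, 5, 9, 2], 3))   # e.g. [5, 9, 8]
```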

For dynamic data we want the Top K in real time. There are two kinds of operations: adding an element, and querying the current Top K.

Since the K largest can be asked for at any moment, we keep a small-top heap of size K maintained at all times: whenever an element is added, compare it with the heap top as above, and any query can then be answered immediately from the heap.
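
Wrapped as a class for the dynamic case (a minimal sketch; add() keeps the heap up to date and top_k() can be answered at any time):

```python
import heapq

class DynamicTopK:
    def __init__(self, k):
        self.k = k
        self.heap = []                     # small-top heap of size <= k

    def add(self, x):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, x)
        elif x > self.heap[0]:             # beats the current heap top
            heapq.heapreplace(self.heap, x)

    def top_k(self):
        return sorted(self.heap, reverse=True)
```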

Heap application three: using heaps to find the median

The median of a dynamic data set is the number in the middle position.

If the number of elements n is odd, sort the data in ascending order; the (n/2 + 1)-th element (counting from 1; equivalently, index n/2 counting from 0) is the median. If n is even, there are two middle positions, the (n/2)-th and the (n/2 + 1)-th element, and either one can be chosen as the median.

For a static data set the median is fixed: sort once, and the element at position n/2 is the median; every later median query just returns that fixed value, so the cost per query is very small. For a dynamic data set, however, the median changes constantly, and sorting before every query would be inefficient.

Instead, maintain two heaps: a big-top heap and a small-top heap. The big-top heap stores the first (smaller) half of the data and the small-top heap stores the second half, with every element in the small-top heap no smaller than the big-top heap's top. If there are n elements and n is even, then going from small to large, the first n/2 elements are stored in the big-top heap and the last n/2 in the small-top heap, so the big-top heap's top element is the median. If n is odd, the big-top heap stores n/2 + 1 elements and the small-top heap stores n/2.

The data is dynamic. When a new element is added, how do we adjust the two heaps so that the big-top heap's top continues to be the median?

If the new element is less than or equal to the big-top heap's top, insert it into the big-top heap; otherwise insert it into the small-top heap. At this point the element counts may no longer satisfy the split agreed above; when that happens, we keep moving the top element of one heap into the other until the counts are balanced again.

Using two heaps to support median queries over a dynamic data set, inserting an element takes O(log n), and finding the median just returns the big-top heap's top element, which takes O(1).
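
A Python sketch of the two-heap median structure. heapq only provides small-top heaps, so the big-top heap stores negated values:

```python
import heapq

class MedianFinder:
    def __init__(self):
        self.lower = []   # big-top heap (negated): smaller half of the data
        self.upper = []   # small-top heap: larger half of the data

    def add(self, x):
        if not self.lower or x <= -self.lower[0]:
            heapq.heappush(self.lower, -x)
        else:
            heapq.heappush(self.upper, x)
        # Restore the agreed split: len(lower) == len(upper) (n even)
        # or len(lower) == len(upper) + 1 (n odd).
        if len(self.lower) > len(self.upper) + 1:
            heapq.heappush(self.upper, -heapq.heappop(self.lower))
        elif len(self.upper) > len(self.lower):
            heapq.heappush(self.lower, -heapq.heappop(self.upper))

    def median(self):
        return -self.lower[0]              # O(1): big-top heap's top
```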

Two heaps can do more than find the median quickly; the same idea tracks any other percentile of the data. For example: how do we use two heaps to quickly answer an interface's 99% response time?

Sort a data set in ascending order; the 99th percentile is the value with 99% of the data in front of it. Applied to response times: if an interface received 100 requests, each with a different response time, sort the response times in ascending order; the one in position 99 is the 99% response time, also called the 99th-percentile response time. (In general, with n data points, the 99th percentile is the (n × 99%)-th one.)

How, then, do we maintain the 99% response time as new requests keep arriving?

Again maintain two heaps, one big-top and one small-top. If the current total number of data points is n, the big-top heap stores the smallest n × 99% of them and the small-top heap stores the largest n × 1%; the big-top heap's top element is then exactly the 99% response time we are looking for. Each time a data point is inserted, compare it with the tops of the two heaps to decide where it goes: if it is smaller than the big-top heap's top, insert it into the big-top heap; otherwise insert it into the small-top heap. To keep the big-top heap holding 99% of the data, recount both heaps after each insertion and move top elements between them until the counts satisfy the 99 : 1 split.
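
The same structure as the median finder, rebalanced to a 99 : 1 split instead of 50 : 50 (a minimal sketch, not production monitoring code; the max(1, ...) guard is my addition to keep the big-top heap non-empty for small n):

```python
import heapq

class Percentile99:
    def __init__(self):
        self.lower = []   # big-top heap (negated): smallest ~99% of samples
        self.upper = []   # small-top heap: largest ~1% of samples

    def add(self, x):
        if not self.lower or x <= -self.lower[0]:
            heapq.heappush(self.lower, -x)
        else:
            heapq.heappush(self.upper, x)
        n = len(self.lower) + len(self.upper)
        target = max(1, int(n * 0.99))     # big-top heap should hold n * 99%
        while len(self.lower) > target:
            heapq.heappush(self.upper, -heapq.heappop(self.lower))
        while len(self.lower) < target:
            heapq.heappush(self.lower, -heapq.heappop(self.upper))

    def p99(self):
        return -self.lower[0]              # the 99th-percentile value
```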

Back to the opening problem: given a log file containing one billion search keywords, how do we quickly get the Top 10 most popular search keywords?

MapReduce is one option, but suppose the scenario is restricted to a single machine with 1 GB of available memory. Since many of the keywords are repeated, the first step is to count how often each keyword appears; a hash table, a balanced binary search tree, or a similar structure can record each keyword together with its occurrence count.

Choosing a hash table: scan the one billion search keywords sequentially. For each keyword, look it up in the hash table: if it is already there, increment its count by 1; if not, insert it with count 1. When the scan finishes, the hash table stores the distinct search keywords and their occurrence counts. Then apply the heap-based Top K method: build a small-top heap of size 10, traverse the hash table, and compare each keyword's count with the count of the keyword at the heap top. If a keyword occurs more often than the heap-top keyword, remove the heap top and add this keyword to the heap.
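
A sketch of this count-then-heap pipeline, assuming the log holds one keyword per line (the format and file path are assumptions). Counter plays the hash table, and heapq.nlargest internally maintains exactly the size-10 small-top heap described above:

```python
import heapq
from collections import Counter

def top10_keywords(log_path):
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            counts[line.strip()] += 1      # present: +1; absent: insert as 1
    # O(n log 10): a size-10 small-top heap over (keyword, count) pairs.
    return heapq.nlargest(10, counts.items(), key=lambda kv: kv[1])
```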

But suppose the one billion keywords include 100 million distinct ones, and the average keyword length is 50 bytes. Storing 100 million keywords then needs about 5 GB of memory, while we only have 1 GB, so we cannot load all the keywords into memory at once. What now?

The same data always produces the same hash value, so we can shard the one billion keywords into 10 files with a hash function. Create 10 empty files numbered 00, 01, ..., 09. Iterate over the one billion keywords: compute each keyword's hash value, take it modulo 10, and the result is the number of the file this keyword is assigned to. Each file then holds roughly 100 million keywords; after removing duplicates, a file may contain only about 10 million distinct keywords, roughly 500 MB in total, which fits in memory. For each file, use the hash table and heap to obtain its Top 10; then merge the ten Top 10 lists, i.e. from those 100 keywords take the 10 with the highest occurrence counts.
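
A sketch of the sharding step, reusing the hypothetical top10_keywords() from above. Python's built-in hash() is stable within a single process run, which is all the sharding needs:

```python
import heapq

def sharded_top10(log_path, num_shards=10):
    # Pass 1: route every keyword to a shard file by hash % 10, so all
    # copies of the same keyword land in the same file.
    shards = [open(f"{i:02d}.txt", "w") for i in range(num_shards)]
    with open(log_path) as f:
        for line in f:
            keyword = line.strip()
            shards[hash(keyword) % num_shards].write(keyword + "\n")
    for s in shards:
        s.close()
    # Pass 2: per-shard counts are complete, so take each shard's Top 10,
    # then pick the overall Top 10 from the ~100 candidates.
    candidates = []
    for i in range(num_shards):
        candidates.extend(top10_keywords(f"{i:02d}.txt"))
    return heapq.nlargest(10, candidates, key=lambda kv: kv[1])
```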

A related exercise: a news site has very heavy traffic, and we want a Top 10 list of the most-read news summaries to scroll in a banner on the home page, refreshed every hour. How can this be implemented?

Compute a hashcode for each news summary and record the link between summary and hashcode, using an in-memory map with the hashcode as the key and the summary as the value. During each hour, append the hashcode of every clicked summary to a log file. When the hour ends, compute that hour's click Top 10: shard the logged hashcodes into multiple files, again taking the hashcode modulo the file count so that identical hashcodes land in the same file; within each file, count clicks with a map<hashcode, int> and take its Top 10 with a small-top heap; then merge the per-file Top 10 lists into the hour's overall Top 10. If the banner only shows the last hour, this is the result to display; to show a whole day, merge the hourly counts by hashcode and take the Top 10 again.
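
An hour-level sketch under the same assumptions (one logged hashcode per click per line; names like summary_key and key_to_summary are made up for illustration):

```python
import hashlib
import heapq
from collections import Counter

def summary_key(summary):
    # A stable hashcode for a summary, so clicks can be logged compactly.
    return hashlib.md5(summary.encode()).hexdigest()

def hourly_top10(click_log_path, key_to_summary):
    counts = Counter()
    with open(click_log_path) as f:       # this hour's click log
        for line in f:
            counts[line.strip()] += 1
    top = heapq.nlargest(10, counts.items(), key=lambda kv: kv[1])
    return [(key_to_summary[k], c) for k, c in top]
```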

Origin blog.csdn.net/ywangjiyl/article/details/104432130