Big Data analysis interview questions

   Recently I learned a bit about hash tables and found that they can be used to solve certain big data problems. The "big data analysis" discussed here is not distributed data mining or other advanced concepts, but rather finding data with certain characteristics in a huge file, or in a pile of data that does not fit in memory. In recent years this has been a frequent interview topic at major companies.

Interview question 1: There is a log file larger than 100 GB that stores IP addresses. Design an algorithm to find the IP address with the largest number of occurrences.

Analysis: 100 GB feels far too big; a typical computer has only about 4 GB of memory, so it is impossible to load that much information at once, and we cut the file into 100 pieces instead. An IP address is a string; we can hash it to an integer and take that value modulo 100, so the result falls in the range 0-99. IP addresses with the same modulo value are assigned to the same file, which guarantees that all occurrences of any given IP land in one piece. We can then use a hash table to count the most frequent IP in each piece, and finally compare the 100 winners to get the IP with the most occurrences overall.
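
A minimal C++ sketch of this split-then-count approach, assuming one IP string per line; the file names (`part_0.txt` ... `part_99.txt`) and function names are made up for illustration:

```cpp
#include <fstream>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Split a huge log of IP strings into 100 smaller files so that every
// occurrence of the same IP lands in the same piece.
void split_by_hash(const std::string& log_path) {
    std::ifstream in(log_path);
    std::vector<std::ofstream> parts(100);
    for (int i = 0; i < 100; ++i)
        parts[i].open("part_" + std::to_string(i) + ".txt");

    std::string ip;
    while (std::getline(in, ip)) {
        size_t idx = std::hash<std::string>{}(ip) % 100;  // same IP -> same piece
        parts[idx] << ip << '\n';
    }
}

// Count IPs inside one piece (a piece fits in memory) and return the winner.
std::pair<std::string, long long> max_ip(const std::string& part_path) {
    std::unordered_map<std::string, long long> cnt;
    std::ifstream in(part_path);
    std::string ip;
    while (std::getline(in, ip)) ++cnt[ip];

    std::pair<std::string, long long> best{"", 0};
    for (const auto& kv : cnt)
        if (kv.second > best.second) best = kv;
    return best;  // compare the 100 winners to get the global answer
}
```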

Interview question 2: Under the same conditions as the previous question, how do you find the top K IP addresses?

Analysis: Seeing "top K IPs", we immediately think of heaps. Note that we should build a min-heap of size K here: if we built a max-heap, only the element at the heap top would be guaranteed to be the largest, so we could obtain only the single most frequent IP. With a min-heap, any IP whose count beats the smallest candidate at the heap top replaces it, and after one pass over the counts the heap holds the top K.
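
A sketch of the min-heap selection over the per-piece counts from question 1 (the `Entry` type and `top_k` name are just illustrative):

```cpp
#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Keep the K most frequent IPs with a min-heap of size K: the heap top is
// the smallest of the current candidates, so a new IP enters the heap only
// if its count beats that minimum.
using Entry = std::pair<long long, std::string>;  // (count, ip)

std::vector<Entry> top_k(const std::unordered_map<std::string, long long>& cnt,
                         size_t k) {
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (const auto& kv : cnt) {
        Entry e{kv.second, kv.first};
        if (heap.size() < k) {
            heap.push(e);
        } else if (e > heap.top()) {   // beats the smallest candidate
            heap.pop();
            heap.push(e);
        }
    }
    std::vector<Entry> res;
    while (!heap.empty()) { res.push_back(heap.top()); heap.pop(); }
    return res;  // ascending by count
}
```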

Interview question 3: Given 10 billion integers, design an algorithm to find the integers that appear only once.

Analysis: Integers come in signed and unsigned flavors. Signed numbers range from -2147483648 to 2147483647 (about -2.1 billion to +2.1 billion) and unsigned numbers range from 0 to 4294967295 (about 4.2 billion), yet we are given 10 billion integers, so values must repeat. To find the integers that appear only once, a hash table comes to mind, but we had better not define an integer array: 4.2 billion * 4 B is about 16 GB, and splitting such a large array further would be too much trouble. Instead we can use a bitmap, but with two bits per number to record its state: 00 means the number is absent, 01 means it appears once, and another pattern (10) means it appears more than once. This shrinks the array by a factor of 16 (2 bits instead of 32). One problem remains: how to index the array. Positive numbers are easy to index, but what about negatives? We can XOR a negative number with -1 (thirty-two 1 bits) to map it to a nonnegative value in the same position, and define a two-dimensional array in which one row represents the positive numbers and the other row the negatives. At this point about 1 GB of space solves the problem (2^32 values * 2 bits = 1 GB).
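
A minimal sketch of the 2-bit bitmap in C++. For simplicity it indexes by the raw unsigned 32-bit pattern of each value, which handles negatives implicitly and takes the same 1 GB as the two-row layout described above; the class name `TwoBitMap` is made up:

```cpp
#include <cstdint>
#include <vector>

// Two bits per value: 00 = absent, 01 = seen once, 10 = seen 2+ times.
class TwoBitMap {
    std::vector<uint64_t> bits_;  // 2^32 values * 2 bits = 1 GB
public:
    TwoBitMap() : bits_((1ULL << 32) * 2 / 64) {}

    void add(uint32_t v) {
        uint64_t pos = uint64_t(v) * 2;
        uint64_t cur = (bits_[pos / 64] >> (pos % 64)) & 3;
        if (cur < 2) {  // saturate at "seen 2 or more times"
            bits_[pos / 64] &= ~(3ULL << (pos % 64));
            bits_[pos / 64] |= (cur + 1) << (pos % 64);
        }
    }
    bool seen_once(uint32_t v) const {
        uint64_t pos = uint64_t(v) * 2;
        return ((bits_[pos / 64] >> (pos % 64)) & 3) == 1;
    }
};
```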

Follow-up: What if the interviewer says we only have 500 MB of space, or even less?

Analysis: We use the same splitting idea, but here we can split directly by the value range of the numbers. With 500 MB of memory we only need to cut the range once, e.g. into a lower half and an upper half, and make one pass per half with a half-sized bitmap. If the numbers that occur only once happen to fall in the first half, we find them after processing only half the range, which may be more efficient.

Interview question 4: Given two files, each containing 10 billion integers, and only 1 GB of memory, how do you find the intersection of the two files?

Analysis: The idea is the same as above: mark every value from the first file in a bitmap (one bit per 32-bit value is 512 MB), then scan the second file and output the values whose bit is set.
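
A sketch under the assumption that the integers are stored one per line as text (the file names are illustrative):

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Mark every value of file A in a 512 MB bitmap (2^32 bits), then scan
// file B and report the values whose bit is set. Clearing the bit after
// a match keeps each intersection value from being reported twice.
void intersect(const std::string& a_path, const std::string& b_path) {
    std::vector<uint64_t> bits((1ULL << 32) / 64, 0);
    std::ifstream a(a_path);
    uint32_t v;
    while (a >> v) bits[v / 64] |= 1ULL << (v % 64);

    std::ifstream b(b_path);
    std::ofstream out("intersection.txt");
    while (b >> v) {
        if (bits[v / 64] & (1ULL << (v % 64))) {
            out << v << '\n';
            bits[v / 64] &= ~(1ULL << (v % 64));  // report each value once
        }
    }
}
```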

Interview question 5: A file contains 10 billion ints and we have 1 GB of memory. Design an algorithm to find all integers that appear no more than 2 times.

Analysis: The only difference from question 3 is that we now look for integers appearing at most twice, and the same method works: with two bits per value we can encode 00 = absent, 01 = once, 10 = twice, 11 = more than twice, and output the values in states 01 and 10.

Interview question 6: Given two files, each containing 10 billion queries (strings), and only 1 GB of memory, how do you find the intersection of the two files? Give both an exact algorithm and an approximate algorithm.

Analysis: To intersect two files we obviously have to compare their contents. If we simply split each file into 100 pieces and compared every piece of one file against all 100 pieces of the other, the time cost would be too high. Instead we can borrow the modulo trick from question 1: hash each query and take the result modulo 100, so equal queries from the two files always land in pieces with the same index, and we only need to compare piece i of one file with piece i of the other, looking for identical strings. That is the exact algorithm. For the approximate algorithm we can use a Bloom filter, which can determine whether an element is present in a set: build one from the queries of the first file and test every query of the second file against it; because a Bloom filter can return false positives, the result is approximate.
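
A minimal Bloom filter sketch for the approximate algorithm; deriving the hash functions by appending a salt character to the string is a crude illustrative trick, and the class name and sizes are assumptions:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Minimal Bloom filter: 3 salted hashes set/check 3 bits. It answers
// "definitely not present" or "probably present" (false positives are
// possible), which is why the intersection it yields is approximate.
class BloomFilter {
    std::vector<uint64_t> bits_;
    size_t nbits_;
public:
    explicit BloomFilter(size_t nbits) : bits_(nbits / 64 + 1), nbits_(nbits) {}

    void add(const std::string& s) {
        for (size_t seed = 0; seed < 3; ++seed) {
            size_t h = std::hash<std::string>{}(s + char('a' + seed)) % nbits_;
            bits_[h / 64] |= 1ULL << (h % 64);
        }
    }
    bool maybe_contains(const std::string& s) const {
        for (size_t seed = 0; seed < 3; ++seed) {
            size_t h = std::hash<std::string>{}(s + char('a' + seed)) % nbits_;
            if (!(bits_[h / 64] & (1ULL << (h % 64)))) return false;
        }
        return true;  // probably present
    }
};
```

Usage: insert all queries of the first file, then output each query of the second file for which `maybe_contains` returns true; the output is a superset of the true intersection.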

Interview question 7: How do you extend a Bloom filter so that it supports deleting elements?

Analysis: A Bloom filter does not support deleting elements, because hash collisions are likely (positions computed by different hash functions point at the same bit), so changing one bit could easily affect the membership answers for other elements. Here we can borrow the "reference counting" idea of the smart pointer shared_ptr: we add a counter per position. Whenever an element is represented at this position we do count++, and whenever we delete an element that involves this position we do count--; only when the count reaches 0 do we set the position to 0. This completes the delete operation.
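
A sketch of such a counting Bloom filter (the 8-bit counters and 3 salted hashes are assumptions; wider counters trade space for overflow safety):

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Counting Bloom filter: each position holds a counter instead of one bit,
// in the spirit of shared_ptr's reference counting. insert() does count++,
// erase() does count--; a position reads as "set" while its count is nonzero.
class CountingBloomFilter {
    std::vector<uint8_t> counts_;
public:
    explicit CountingBloomFilter(size_t n) : counts_(n) {}

    void insert(const std::string& s) {
        for (size_t seed = 0; seed < 3; ++seed)
            ++counts_[hash(s, seed)];
    }
    void erase(const std::string& s) {  // only for elements known to be present
        for (size_t seed = 0; seed < 3; ++seed)
            if (counts_[hash(s, seed)] > 0) --counts_[hash(s, seed)];
    }
    bool maybe_contains(const std::string& s) const {
        for (size_t seed = 0; seed < 3; ++seed)
            if (counts_[hash(s, seed)] == 0) return false;
        return true;
    }
private:
    size_t hash(const std::string& s, size_t seed) const {
        return std::hash<std::string>{}(s + char('a' + seed)) % counts_.size();
    }
};
```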

Interview question 8: How do you extend a Bloom filter so that it supports counting?

Analysis: The idea is the same as in the previous question: replace each bit with a counter, as in the counting Bloom filter sketch above, and the counter at a position then records how many inserted elements map there.

Interview question 9: Given thousands of files, each between 1 KB and 100 MB in size, and n words, design an algorithm that finds, for each word, all files containing it. You only have 100 KB of memory.

Analysis: We can use Bloom filters to decide whether a file contains the n words: generate n Bloom filters and keep them in external storage. We define in advance a file info that holds the information for these n words; whenever we find one of the words in a file, we write that file's information into the slot of info corresponding to the word. We only have 100 KB of memory, part of which holds a Bloom filter and part of which holds file data. Since a file can be far larger than our memory, we can try cutting it into 50 KB chunks, each chunk tagged with the large file it belongs to. Each time we read in one Bloom filter and one chunk; if the chunk contains the corresponding word, we record the owning file in info, otherwise we load the next Bloom filter. After all Bloom filters have been used, we read the next chunk and repeat the steps above until every file has been traversed.

Interview question 10: There is a dictionary containing N English words. Given an arbitrary string, design an algorithm to find all English words that contain this string.

Analysis: To decide whether a single word contains a string we can use the strstr function. For this problem, if the string has to match a prefix of the word, a trie (dictionary tree) could be used for the lookup; but since N may be very large, we can keep the N words in a file and read in only a fixed number of words at a time for checking.
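
A minimal strstr-based scan, streaming the dictionary so memory stays bounded (`dict.txt` and the query value are made up for illustration):

```cpp
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    const std::string query = "str";   // example query string
    std::ifstream dict("dict.txt");    // one word per line
    std::string word;
    while (std::getline(dict, word)) { // stream words; memory stays small
        if (std::strstr(word.c_str(), query.c_str()) != nullptr)
            std::cout << word << '\n';
    }
}
```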


Summary: For big data problems like these we usually use hash splitting, i.e. hash each item and take the result modulo the number of pieces, so that data is assigned to a reasonable bucket while a large file is cut into small ones. This makes the later steps especially convenient, for example comparing corresponding pieces of two files, as with the hash-split IP addresses, or operating on the elements within each piece. A Bloom filter can be used to determine whether an element is (probably) present in a set.

Origin www.cnblogs.com/dashujunihaoa/p/10954699.html