Baidu interview question: from massive log data, extract the IP that visited Baidu most often on a given day


Catching the hacker: extracting the IP that visited Baidu most often on a given day

Approach
Python code
Summary

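Approach:

Because the data set is massive, reading all of the log data into memory, sorting it, and then picking the most frequent IP is clearly not feasible.

(1) First assume there is enough memory. In that case a few lines of code are enough to get the final result.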

from collections import Counter


if __name__ == '__main__':
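    ip_list = ["192.168.1.2","192.168.1.3","192.168.1.3","192.168.1.4","192.168.1.3","192.168.1.2"]  # For simplicity, a list stands in for the log data.
    ip_counter = Counter(ip_list)  # Counter (from the standard library's collections module) tallies each IP
    print(ip_counter.most_common(1)[0][0])  # most_common(1) returns [(ip, count)] for the top entry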

192.168.1.3

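(2) What if memory is limited and cannot hold all of the log data?
Since the data does not fit in memory, we cannot count or sort it in a single in-memory pass, so we break the whole into parts:
Suppose the data is 100 GB in total and 1 GB of memory is available.
We can split the data into 1000 pieces (any number comfortably above 100 works), so that each piece is about 100 MB on average and can be loaded and processed on its own.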
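The arithmetic behind these numbers can be checked with a minimal sketch. It assumes decimal units (1 GB = 10^9 bytes) and perfectly even pieces; the helper name `min_partitions` is made up for illustration. Real hash buckets are not perfectly even, which is one reason to pick a count well above the minimum.

```python
import math

GB = 10 ** 9  # assumption: decimal units, to match the round numbers in the text
MB = 10 ** 6


def min_partitions(total_bytes, memory_bytes):
    """Smallest number of equal pieces such that one piece fits in memory."""
    return math.ceil(total_bytes / memory_bytes)


total = 100 * GB   # size of the log data
memory = 1 * GB    # available memory
pieces = 1000      # the piece count chosen in the text

minimum = min_partitions(total, memory)
avg_piece_mb = total // pieces // MB

print(minimum)       # fewest pieces that could work at all
print(avg_piece_mb)  # average piece size in MB when using 1000 pieces
```

With 1000 pieces, the average piece is roughly a tenth of the available memory, which leaves headroom for uneven buckets and for the counting structure itself.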

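The key question, though, is: how do we split the 100 GB into 1000 pieces?
This is where the hash function comes in.
Definition of a hash function: for an input string, it returns an integer of fixed length.
What makes a hash function useful here:
the same string always produces the same hash value (for Python's built-in hash() on strings, this holds within a single interpreter run);
different strings may produce the same value (usually with very low probability) or different values.
As a result, every occurrence of a given IP ends up in the same piece, so its full count can be computed from that one piece alone.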
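These two properties can be illustrated with a small sketch. It swaps in `zlib.crc32` as the hash function, purely so the result is the same on every run; the article's own code uses the built-in `hash()`. The helper `bucket_of` and the generated `10.0.x.y` addresses are made up for the demonstration.

```python
import zlib


def bucket_of(ip, num_buckets=1000):
    """Map an IP string to a bucket index in the range [0, num_buckets)."""
    return zlib.crc32(ip.encode("utf-8")) % num_buckets


# Property 1: the same string always maps to the same bucket.
same = bucket_of("192.168.1.3") == bucket_of("192.168.1.3")
print(same)

# Property 2: different strings may share a bucket. With 2000 distinct IPs
# and only 1000 buckets, some sharing is unavoidable.
ips = ["10.0.%d.%d" % (i // 256, i % 256) for i in range(2000)]
used_buckets = {bucket_of(ip) for ip in ips}
print(len(set(ips)), len(used_buckets))
```

Sharing a bucket is harmless for this problem: two different IPs in the same file are still counted separately, because the counting step keys on the IP string itself.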

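The solution steps are:

For every IP in the data, compute hash(ip)%1000 and write the IP to the corresponding one of 1000 files;
For each of the 1000 files, find the IP that occurs most often;
Compare the resulting candidate IPs (at most 1000) by count and take the one with the highest count. This list easily fits in memory, so no external sort is needed.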
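The steps above can be sketched end to end on a toy scale. This is a minimal illustration under stated assumptions, not the full program: it uses 4 bucket files in a temporary directory instead of 1000, `zlib.crc32` instead of the built-in `hash()`, and a hypothetical helper name `most_frequent_ip`. The final step simply takes the maximum over the per-bucket winners.

```python
import os
import tempfile
import zlib
from collections import Counter


def most_frequent_ip(ips, num_buckets=4):
    """Partition IPs into bucket files, count each bucket, return the overall winner."""
    with tempfile.TemporaryDirectory() as workdir:
        paths = [os.path.join(workdir, "%d.txt" % i) for i in range(num_buckets)]
        buckets = [open(p, "w") for p in paths]
        try:
            # Step 1: the same IP always lands in the same bucket file.
            for ip in ips:
                buckets[zlib.crc32(ip.encode("utf-8")) % num_buckets].write(ip + "\n")
        finally:
            for f in buckets:
                f.close()

        # Step 2: find the most frequent IP inside each bucket file.
        candidates = []
        for p in paths:
            with open(p) as f:
                counts = Counter(line.strip() for line in f)
            if counts:  # skip buckets that received no IPs
                candidates.append(counts.most_common(1)[0])

    # Step 3: the candidate list is tiny, so a plain max() is enough.
    return max(candidates, key=lambda pair: pair[1])[0]


sample = ["192.168.1.2", "192.168.1.3", "192.168.1.3",
          "192.168.1.4", "192.168.1.3", "192.168.1.2"]
result = most_frequent_ip(sample)
print(result)  # 192.168.1.3
```

Only one bucket is held in memory at a time during step 2, which is the whole point of the partitioning.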

Python code


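import os
from collections import Counter


source_file = r'C:\Users\13721\Documents\most_ip_temp\bigdata.txt'
# A raw string cannot end in a single backslash, so two are written here.
# The value really ends in two backslashes, which Windows accepts as a separator.
temp_files = r'C:\Users\13721\Documents\most_ip_temp\temp\\'
top_1000_ip = []  # one (IP, count) tuple per non-empty bucket file


def hash_file():
    """Split source_file into 1000 bucket files under temp_files."""
    os.makedirs(temp_files, exist_ok=True)

    # Note: this keeps 1000 files open at once, which must be within the OS limit.
    temp_path_list = [open(temp_files + str(i) + '.txt', mode='w') for i in range(1000)]
    try:
        with open(source_file) as f:
            # Key step: hash(ip) % 1000 sends every occurrence of an IP to the same file.
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                if not line.endswith('\n'):
                    line += '\n'  # the last line may lack a newline; normalise it so it hashes alike
                temp_path_list[hash(str(line))%1000].write(line)
    finally:
        for temp_file in temp_path_list:
            temp_file.close()  # closing flushes buffered writes to disk


def cal_query_frequency():
    """Record the most frequent IP of every bucket file in top_1000_ip."""
    for root, dirs, files in os.walk(temp_files):
        for file in files:
            real_path = os.path.join(root, file)
            with open(real_path) as f:
                # Count straight from the file instead of building a list first.
                ip_counter = Counter(line.rstrip('\n') for line in f)
            if ip_counter:  # an empty bucket file has no candidate
                top_1000_ip.append(ip_counter.most_common(1)[0])


def get_ip():
    """Return the IP with the highest count among the per-file winners."""
    return max(top_1000_ip, key=lambda a: a[1])[0]


if __name__ == '__main__':
    hash_file()
    cal_query_frequency()
    print(get_ip())

192.168.1.3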

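Summary

This solution applies the ideas of "divide and conquer" and "breaking the whole into parts".
The key code is the step that computes hash(ip)%1000 and writes each IP into one of 1000 files. Note that the operator must be "%" (modulo), not "/" (division): the remainder always falls in the range 0 to 999, so it can be used directly as a file index.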

temp_path_list[hash(str(line))%1000].write(line)
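To see why the modulo operator matters, here is a small sketch comparing the remainder and the quotient. The values in `sample_hashes` are made-up stand-ins for hash results; a negative value is included because Python's built-in `hash()` can return negative integers.

```python
NUM_FILES = 1000
sample_hashes = [123456789, 999, 1000, -7]  # made-up stand-ins for hash(ip)

# The remainder is always a valid index into a list of NUM_FILES files.
remainders = [h % NUM_FILES for h in sample_hashes]

# The quotient is not bounded by NUM_FILES, so it cannot serve as a file index.
quotients = [h // NUM_FILES for h in sample_hashes]

print(remainders)  # [789, 999, 0, 993]
print(quotients)   # [123456, 0, 1, -1]
```

In Python, the result of `%` takes the sign of the divisor, so even a negative hash value maps into the range 0 to 999.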

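An easy pitfall: the double backslash after the folder name temp. A raw string literal cannot end in a single backslash, because that backslash would escape the closing quote and cause a syntax error. Writing two backslashes works, but both of them end up in the string; Windows tolerates the doubled separator. Building the path with os.path.join avoids the issue entirely.

temp_files = r'C:\Users\13721\Documents\most_ip_temp\temp\\'  # a raw string cannot end in a single backslash, so two are written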
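The behaviour of the trailing backslashes can be checked with a short sketch. The directory `C:\demo\temp` is made up for illustration, and `ntpath` (the Windows flavour of `os.path`) is used so the example behaves the same on any operating system.

```python
import ntpath  # Windows path rules, importable on any OS

# 'C:\demo\temp' is a made-up directory used only for illustration.
raw_dir = r'C:\demo\temp\\'
two_backslashes = raw_dir.endswith('\\\\')  # '\\\\' is two literal backslashes


def compiles(source):
    """Return True if the given source text is valid Python."""
    try:
        compile(source, '<demo>', 'exec')
        return True
    except SyntaxError:
        return False


single_ok = compiles("p = r'C:\\demo\\temp\\'")    # source text: r'C:\demo\temp\'
double_ok = compiles("p = r'C:\\demo\\temp\\\\'")  # source text: r'C:\demo\temp\\'

# Alternatives that avoid a trailing separator in the literal altogether.
joined = ntpath.join(r'C:\demo\temp', '0.txt')
collapsed = ntpath.normpath(raw_dir + '0.txt')

print(two_backslashes, single_ok, double_ok)
print(joined)
print(collapsed)
```

On Windows, `os.path.join` follows the same rules as `ntpath.join`, so joining the directory and file name is a simple way to sidestep the trailing backslash.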


文龙问路
