1. Problem Description

Given a large record containing 50M URLs and a small record containing 500 URLs, how do you find the URLs that appear in both records?

2. Solution

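One approach that is not recommended: for each of the 500 URLs, scan all 50M URLs and compare as you go. That costs about 500 × 50M = 25 billion string comparisons, and because either record may contain duplicate URLs, the same match can be reported several times unless you also track what has already been printed.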
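A recommended approach:
- Put the 500 URLs into an unordered_set.
- Then iterate over the 50M URLs and check whether each one is in the unordered_set. If it is, print the URL and remove it from the unordered_set, so that each common URL is printed only once.
- When the scan finishes, the printed URLs are exactly the URLs the two records have in common.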
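Why unordered_set rather than set:
- unordered_set is implemented as a hash table. It was standardized in C++11 and supersedes hash_set, a non-standard extension that some older C++ library implementations shipped.
- If you are using another language such as Java, use its hash-based set (for example HashSet).
- A lookup in set is O(log n), while a lookup in unordered_set is O(1) on average (O(n) in the worst case), so lookups in unordered_set are generally faster.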
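To make the lookup-cost difference a little more concrete, here is a rough sketch that counts how many key comparisons a single std::set lookup performs, and how many entries sit in the one bucket an unordered_set lookup inspects. The helper names and the synthetic "www.siteN.com" URLs are made up for illustration, and the exact numbers depend on the standard library in use.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <unordered_set>

// Comparator that counts how often std::set compares two keys.
struct CountingLess
{
    std::size_t* comparisons;  // not owned; points at the caller's counter
    bool operator()(const std::string& a, const std::string& b) const
    {
        ++*comparisons;
        return a < b;
    }
};

// Hypothetical helper: builds a std::set of n synthetic URLs and returns the
// number of key comparisons that one successful find() performs.
std::size_t comparisons_for_set_lookup(std::size_t n)
{
    std::size_t counter = 0;
    std::set<std::string, CountingLess> tree(CountingLess{&counter});
    for (std::size_t i = 0; i < n; ++i)
    {
        tree.insert("www.site" + std::to_string(i) + ".com");
    }
    counter = 0;  // ignore the comparisons made while inserting
    tree.find("www.site0.com");
    return counter;
}

// Hypothetical helper: builds an unordered_set of n synthetic URLs and returns
// how many entries share the single bucket that a lookup of an existing key
// has to inspect.
std::size_t bucket_entries_for_hash_lookup(std::size_t n)
{
    std::unordered_set<std::string> table;
    for (std::size_t i = 0; i < n; ++i)
    {
        table.insert("www.site" + std::to_string(i) + ".com");
    }
    const std::string key = "www.site0.com";
    return table.bucket_size(table.bucket(key));
}
```

With the usual balanced-tree implementation of std::set, the comparison count grows roughly with log2(n), while the bucket a hash lookup inspects normally stays at a handful of entries however large n gets. Each comparison is a full string comparison, so the gap matters more for long URLs.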
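Why the record with 500 URLs is the one placed in the unordered_set: it is small, so the hash table stays tiny and collisions are few. A hash table built over the 50M-URL record would need far more memory, and the absolute number of collisions would be much higher.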
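A back-of-the-envelope estimate shows the scale of the difference. The figures below are assumptions chosen for illustration (64 bytes of text per URL and 32 bytes of per-entry bookkeeping), not measurements, and estimated_set_bytes is a hypothetical helper.

```cpp
#include <cstdint>

// Rough estimate of the memory a hash set of URLs would need.
// avg_url_bytes and node_overhead_bytes are assumed values, not measurements.
constexpr std::uint64_t estimated_set_bytes(std::uint64_t url_count,
                                            std::uint64_t avg_url_bytes,
                                            std::uint64_t node_overhead_bytes)
{
    return url_count * (avg_url_bytes + node_overhead_bytes);
}

// 500 URLs: 500 * (64 + 32) = 48,000 bytes, i.e. tens of kilobytes.
constexpr std::uint64_t kSmallSetBytes = estimated_set_bytes(500, 64, 32);

// 50M URLs: 50,000,000 * (64 + 32) = 4,800,000,000 bytes, i.e. several gigabytes.
constexpr std::uint64_t kLargeSetBytes = estimated_set_bytes(50000000, 64, 32);
```

Under these assumptions the small set fits comfortably in CPU cache, while a set over the large record would take several gigabytes before the scan even starts.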

3. Code Example

We won't actually build records with 50M and 500 URLs here; a few small records are enough to demonstrate the idea.

#include <iostream>
#include <set>
#include <string>
#include <unordered_set>

int main()
{
    // The two URL records (multiset, so duplicates are allowed)
    std::multiset<std::string> url1;
    url1.insert("www.baidu.com");
    url1.insert("www.1688.com");
    url1.insert("www.tencent.com");
    url1.insert("www.baidu.com");
    url1.insert("www.jd.com");
    url1.insert("www.bytedance.com");

    std::multiset<std::string> url2;
    url2.insert("www.jd.com");
    url2.insert("www.baidu.com");
    url2.insert("www.sohu.com");
    url2.insert("www.jd.com");

    // Print both URL records
    std::cout << "URL1: " << std::endl;
    for(const auto & elem : url1)
    {
        std::cout << "\t" << elem << std::endl;
    }
    std::cout << "URL2: " << std::endl;
    for(const auto & elem : url2)
    {
        std::cout << "\t" << elem << std::endl;
    }

    // Add url2 (the smaller record) to an unordered_set
    std::unordered_set<std::string> hs;
    for(const auto & elem : url2)
    {
        hs.insert(elem);
    }


    // Then walk url1. erase() returns 1 when the URL was in the set, so each
    // common URL is printed once and removed in a single lookup
    std::cout << "common URL: " << std::endl;
    for(const auto & elem : url1)
    {
        if(hs.erase(elem) > 0)
        {
            std::cout << "\t" << elem << std::endl;
        }
    }

    return 0;
}

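When run, the program lists both records and then, under "common URL:", prints www.baidu.com and www.jd.com once each.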
