Associative Containers (Chapter 3, Item 23)

Item 23: Consider replacing associative containers with sorted vectors

The standard associative containers are usually implemented as balanced binary search trees. That structure suits workloads in which insertions, erasures, and lookups are all frequent and interleaved in no particular order (hereinafter scenario one).

However, many applications use their data structures in a far less chaotic way. Their usage divides clearly into three phases (hereinafter scenario two):

1) Setup phase: create a new data structure and insert a large amount of data. Almost all operations in this phase are insertions and erasures; lookups are rare or absent.

2) Lookup phase: query the data structure for specific information. Almost all operations in this phase are lookups; insertions and erasures are rare or absent.

3) Reorganization phase: change the contents of the data structure, perhaps by erasing the current data and inserting new data. This phase behaves like the first one; when it ends, the application returns to the second phase.
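
The three phases can be sketched with a plain std::vector, as below. This is only an illustration and not code from the tests in this post; the class name PhasedIntSet and its member names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper illustrating the setup / lookup / reorganize cycle.
class PhasedIntSet {
public:
    // Setup or reorganization phase: append without maintaining order.
    void add(int value) {
        data_.push_back(value);
        sorted_ = false;
    }

    // Reorganization phase: remove every copy of a value (erase-remove idiom).
    // std::remove keeps the relative order, so a sorted vector stays sorted.
    void remove(int value) {
        data_.erase(std::remove(data_.begin(), data_.end(), value), data_.end());
    }

    // End of setup or reorganization: sort once and drop duplicates so the
    // vector behaves like a set.
    void finalize() {
        std::sort(data_.begin(), data_.end());
        data_.erase(std::unique(data_.begin(), data_.end()), data_.end());
        sorted_ = true;
    }

    // Lookup phase: only valid after finalize().
    bool contains(int value) const {
        assert(sorted_ && "call finalize() before lookups");
        return std::binary_search(data_.begin(), data_.end(), value);
    }

    std::size_t size() const { return data_.size(); }

private:
    std::vector<int> data_;
    bool sorted_ = true;  // an empty vector is trivially sorted
};
```

A caller would invoke add() repeatedly, call finalize() once, and then issue any number of contains() queries until the next reorganization.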

For the two scenarios described above, the conclusions are as follows:

Scenario one suits the associative containers.

Scenario two suits a sorted vector: store the data in a vector, sort it, and then look elements up with an algorithm such as binary_search. The vector uses less memory because it stores no per-node tree pointers, and lookups may be faster because the contiguous data spans fewer memory pages and so causes fewer page faults.
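
The memory claim can be checked with a small experiment. The sketch below is not part of the original tests: it routes the allocations of a set and a vector through a hypothetical CountingAllocator and totals the bytes requested. The helper names are made up for illustration, and the exact figures depend on the standard library implementation and the platform.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <set>
#include <vector>

// Running total of bytes requested through CountingAllocator (hypothetical helper).
static std::size_t g_bytes = 0;

template <typename T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() noexcept = default;
    template <typename U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        g_bytes += n * sizeof(T);  // record the request, then delegate
        return std::allocator<T>().allocate(n);
    }
    void deallocate(T* p, std::size_t n) { std::allocator<T>().deallocate(p, n); }
};

template <typename T, typename U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

// Bytes requested while storing n distinct ints in a std::set.
std::size_t bytesForSet(int n) {
    assert(n >= 0);
    g_bytes = 0;
    std::set<int, std::less<int>, CountingAllocator<int>> s;
    for (int i = 0; i < n; ++i) s.insert(i);
    return g_bytes;
}

// Bytes requested while storing n ints in a std::vector reserved up front.
std::size_t bytesForVector(int n) {
    assert(n >= 0);
    g_bytes = 0;
    std::vector<int, CountingAllocator<int>> v;
    v.reserve(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) v.push_back(i);
    return g_bytes;
}
```

On a typical 64-bit implementation each set<int> node carries three pointers and a color flag next to the int, so bytesForSet(n) is likely to come out several times larger than bytesForVector(n).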

I tested set, vector, and unordered_set with 2 million int values; the results are shown below. In scenario two, sorting a vector and then using binary search is indeed faster than the associative container set, and even faster than the hash container unordered_set. For set and unordered_set themselves, the two scenarios make little difference. Note in particular that scenario one is not suitable for the vector approach: that test had still not finished running when I wrote this post.

Container	Scenario	Time (ms)
set	Scenario one	6675
set	Scenario two	6845
unordered_set	Scenario one	7655
unordered_set	Scenario two	7613
vector	Scenario one	Not finished at the time of writing
vector	Scenario two	2867

My test code

    void test_23() {
        const int N = 2000000;
        // Scenario two: insert all the data up front, then search
        std::default_random_engine generator;
        std::uniform_int_distribution<int> distribution(0, N);
        Common::TimeClock tc;  // start timing
        set<int> iset1;
        for (int i = 0; i < N; ++i) {
            iset1.insert(distribution(generator));
        }
        int findCount = 0;
        for (int i = 0; i < N; ++i) {
            findCount += iset1.count(i);
        }
        std::cout << "findCount: " << findCount << std::endl;
        tc.end();

        // Scenario one: frequent, interleaved inserts and lookups
        tc.start();
        set<int> iset2;
        findCount = 0;
        for (int i = 0; i < N; ++i) {
            // keep inserting and looking up
            iset2.insert(distribution(generator));
            findCount += iset2.count(i);
        }
        std::cout << "findCount: " << findCount << std::endl;
        tc.end();
    }
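
Common::TimeClock is my own helper and its source is not shown in this post. For readers who want to run the tests, a stand-in could look like the sketch below. It assumes that construction and start() reset the clock and that end() reports the elapsed milliseconds; the real helper may differ.

```cpp
#include <chrono>
#include <iostream>

namespace Common {

// Assumed interface: only start() and end() are used by the test code.
class TimeClock {
public:
    TimeClock() { start(); }

    // Restart the measurement from now.
    void start() { begin_ = std::chrono::steady_clock::now(); }

    // Print and return the milliseconds elapsed since the last start().
    long long end() const {
        const auto now = std::chrono::steady_clock::now();
        const long long ms =
            std::chrono::duration_cast<std::chrono::milliseconds>(now - begin_).count();
        std::cout << "elapsed: " << ms << " ms" << std::endl;
        return ms;
    }

private:
    std::chrono::steady_clock::time_point begin_;
};

}  // namespace Common
```

steady_clock is used here because it is monotonic, which makes it a reasonable choice for measuring intervals.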

    void test_23_vector() {
        const int N = 2000000;

        // Scenario two: insert all the data up front, then search
        std::default_random_engine generator;
        std::uniform_int_distribution<int> distribution(0, N);
        Common::TimeClock tc;
        vector<int> ivec1;
        ivec1.reserve(N);
        for (int i = 0; i < N; ++i) {
            ivec1.push_back(distribution(generator));  // insert data
        }
        std::sort(ivec1.begin(), ivec1.end()); // sort once
        int findCount = 0;
        for (int i = 0; i < N; ++i) {
            findCount += std::binary_search(ivec1.cbegin(), ivec1.cend(), i); // binary search
        }
        std::cout << "findCount: " << findCount << std::endl;
        tc.end();

        // Scenario one: frequent, interleaved inserts and lookups
        tc.start();
        vector<int> ivec2;
        ivec2.reserve(N);
        findCount = 0;
        for (int i = 0; i < N; ++i) {
            ivec2.push_back(distribution(generator));
            findCount += std::find(ivec2.cbegin(), ivec2.cend(), i) != ivec2.cend();
        }
        std::cout << "findCount: " << findCount << std::endl;
        tc.end();
    }
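
The scenario-one vector test above searches an unsorted vector with std::find, which is linear, so the whole loop does on the order of N * N comparisons; that is most likely why it never finished. One alternative, sketched below, keeps the vector sorted at all times. It was not part of my measurements, and the function names are hypothetical. Lookups become logarithmic, but each insertion still shifts elements and is linear in the worst case, so a tree-based set remains the natural fit for scenario one.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Insert value at its sorted position; returns false if it is already present.
// Precondition: v is sorted (checked only in debug builds, compiled out with NDEBUG).
bool sortedInsertUnique(std::vector<int>& v, int value) {
    assert(std::is_sorted(v.begin(), v.end()));
    auto it = std::lower_bound(v.begin(), v.end(), value);  // O(log N) search
    if (it != v.end() && *it == value) {
        return false;
    }
    v.insert(it, value);  // O(N) shift of the tail
    return true;
}

// Logarithmic membership test on a sorted vector.
bool sortedContains(const std::vector<int>& v, int value) {
    return std::binary_search(v.begin(), v.end(), value);
}
```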

I also tested scenario two with map, vector, and unordered_map. Here the vector is only slightly faster than map, and the gain is less pronounced than in the set test, possibly because the elements are pairs. unordered_map is clearly faster than the other two.

Container	Scenario	Time (ms)
map	Scenario two	18459
unordered_map	Scenario two	11874
vector	Scenario two	15776

My test code

    using data = std::pair<string, int>;

    struct DataCompare {
    public:
        // compare two pairs by key (used for sorting)
        bool operator()(const data &lhs, const data &rhs) const {
            return lhs.first < rhs.first;
        }

        // operator<(pair, string)
        bool operator()(const data &lhs, const data::first_type &rhs) const {
            return keyLess(lhs.first, rhs);
        }

        // operator<(string, pair)
        bool operator()(const data::first_type &lhs, const data &rhs) const {
            return keyLess(lhs, rhs.first);
        }

    private:
        // operator<(string, string)
        bool keyLess(const data::first_type &k1, const data::first_type &k2) const {
            return k1 < k2;
        }
    };

    void test_23_1() {
        const int N = 2000000;
        // map: insert all the data up front, then search
        std::default_random_engine generator;
        std::uniform_int_distribution<int> distribution(0, N);
        Common::TimeClock tc;  // start timing
        map<string, int> imap1;
        for (int i = 0; i < N; ++i) {
            imap1.emplace(std::to_string(distribution(generator)), i);
        }
        int findCount = 0;
        for (int i = 0; i < N; ++i) {
            findCount += imap1.count(std::to_string(i));
        }
        std::cout << "findCount: " << findCount << std::endl;
        tc.end();

        // vector: insert all the data up front, then search
        tc.start();  // restart timing
        vector<std::pair<string, int>> vec;
        vec.reserve(N);
        for (int i = 0; i < N; ++i) {
            vec.emplace_back(std::to_string(distribution(generator)), i);
        }
        findCount = 0;
        std::sort(vec.begin(), vec.end(), DataCompare());
        for (int i = 0; i < N; ++i) {
            // binary_search
            if (std::binary_search(vec.cbegin(), vec.cend(), std::to_string(i), DataCompare())) {
                ++findCount;
            }

            // lower_bound
//            auto lit = std::lower_bound(vec.begin(), vec.end(), std::to_string(i), DataCompare());
//            if (lit != vec.end() && !DataCompare()(std::to_string(i), *lit)) {
//                ++findCount;
//            }

            // upper_bound (returns the first element greater than the key, so inspect the one before it)
//            auto uit = std::upper_bound(vec.begin(), vec.end(), std::to_string(i), DataCompare());
//            if (uit != vec.begin() && !DataCompare()(*(uit - 1), std::to_string(i))) {
//                ++findCount;
//            }

            // equal_range
//            auto p = std::equal_range(vec.begin(), vec.end(), std::to_string(i), DataCompare());
//            if (p.first != p.second) {
//                ++findCount;
//            }
        }
        std::cout << "findCount: " << findCount << std::endl;
        tc.end();

        // unordered_map: insert all the data up front, then search
        tc.start();
        std::unordered_map<string, int> uimap1;
        for (int i = 0; i < N; ++i) {
            uimap1.emplace(std::to_string(distribution(generator)), i);
        }
        findCount = 0;
        for (int i = 0; i < N; ++i) {
            findCount += uimap1.count(std::to_string(i));
        }
        std::cout << "findCount: " << findCount << std::endl;
        tc.end();
    }
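
The test above only counts hits. To use a sorted vector of pairs as a real map replacement, a lookup also has to hand back the mapped value. A minimal sketch using std::lower_bound follows; the names Entry, keyLess, and findValue are hypothetical and not part of my test code.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;

// Orders entries by key only, like the pair/pair overload of DataCompare.
inline bool keyLess(const Entry& a, const Entry& b) { return a.first < b.first; }

// Returns a pointer to the value stored under key, or nullptr if the key is absent.
// Precondition: sorted is ordered by key (checked only in debug builds).
const int* findValue(const std::vector<Entry>& sorted, const std::string& key) {
    assert(std::is_sorted(sorted.begin(), sorted.end(), keyLess));
    auto it = std::lower_bound(
        sorted.begin(), sorted.end(), key,
        [](const Entry& e, const std::string& k) { return e.first < k; });
    // lower_bound yields the first entry whose key is not less than key,
    // so an equality check decides whether the key is really there.
    if (it != sorted.end() && it->first == key) {
        return &it->second;
    }
    return nullptr;
}
```

Returning a pointer keeps the sketch compatible with older standards; with C++17 a std::optional<int> would be another option.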

Reference: "Effective STL Chinese Version"
