B+ Tree Performance Test (C++ Implementation)

This article mainly presents test results for a B+ tree that I implemented. The insert, delete, update, and query algorithms of the B+ tree are not covered in detail here; they may be added later.

Project address

github.com/SirLYC/BPTr…

B+ tree overview

Quoted from Wikipedia

A B+ tree is a tree data structure commonly used in databases and operating-system file systems. Its defining characteristic is that it keeps data stable and ordered, and its insertion and modification have a fairly stable logarithmic time complexity. Elements of a B+ tree are inserted from the bottom up, which is the opposite of a binary tree.


B+ tree structure

A B+ tree has an important parameter called the order (m), which determines how many keys each node of the tree stores.

Each node stores a group of keys in sorted order. For a non-root node, the number of keys s satisfies s >= (m + 1) / 2. A leaf node stores pointers to the values that correspond to its keys, plus a next pointer to the next sibling leaf node, so once the leftmost leaf node is found the whole list can be traversed in key order. A non-leaf node has s pointers to child nodes.

A B+ tree stays balanced by splitting nodes on insertion and by borrowing keys from a sibling or merging with a sibling on deletion, so all leaf nodes are at the same level. Query, insertion, and deletion all run in O(log N) time.
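The leaf chain and the minimum key count described above can be illustrated with a minimal sketch. `LeafSketch`, `collectKeysInOrder`, and `minKeys` are hypothetical names made up for this example and are not taken from the project; integer division in `(m + 1) / 2` is also an assumption.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified leaf node: sorted keys plus a link to the next leaf.
struct LeafSketch {
    std::vector<int> keys;
    LeafSketch *next = nullptr;
};

// Walk the leaf chain starting at the leftmost leaf and collect every key.
// Because each leaf is sorted and leaves are linked left to right,
// the result is sorted as a whole.
std::vector<int> collectKeysInOrder(const LeafSketch *leftmost) {
    std::vector<int> out;
    for (const LeafSketch *leaf = leftmost; leaf != nullptr; leaf = leaf->next) {
        out.insert(out.end(), leaf->keys.begin(), leaf->keys.end());
    }
    return out;
}

// Minimum number of keys in a non-root node for order m,
// assuming integer division in (m + 1) / 2.
int minKeys(int order) {
    assert(order > 0);
    return (order + 1) / 2;
}
```

With two linked leaves holding {1, 2, 3} and {4, 5}, `collectKeysInOrder` returns the keys 1 through 5 in order, without touching any index node.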

Public API of the implementation

template<typename K, typename V>
class BPTree {
private:
    ...

public:
    // constructor and destructors
    ...

    /**
     * deserialize from a file
     */
    static BPTree<K, V> deserialize(const std::string &path);

    static BPTree<K, V> deserialize(const std::string &path, comparator<K> comp);

    void put(const K &key, const V &value);

    void remove(K &key);

    /**
     * @return NULL if not exists else a pointer to the value
     */
    V *get(const K &key);

    bool containsKey(const K &key);

    int getOrder();

    int getSize();

    /**
     * iterate order by key
     * @param func call func(key, value) for each. func returns true means iteration ends
     */
    void foreach(biApply<K, V> func);

    void foreachReverse(biApply<K, V> func);

    void foreachIndex(biApplyIndex<K, V> func);

    void foreachIndexReverse(biApplyIndex<K, V> func);

    void serialize(std::string &path);

    /**
     * clear the tree
     * note that all values allocated will be freed
     */
    void clear();
};

Tip: To support custom types, you need to pass in a comparison function pointer, or implement the corresponding operators (>, ==, <, etc.) for the type.
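Both options from the tip can be sketched for a custom key type. `KeyPoint` and `compareKeyPoints` are hypothetical names, and the negative/zero/positive return convention is an assumption: the project's actual `comparator<K>` signature is not shown in this article, so check it before reusing this.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical custom key type, ordered by x first and then by y.
struct KeyPoint {
    int x;
    int y;
};

// Option 1: overload the comparison operators on the key type.
inline bool operator<(const KeyPoint &a, const KeyPoint &b) {
    return std::tie(a.x, a.y) < std::tie(b.x, b.y);
}
inline bool operator>(const KeyPoint &a, const KeyPoint &b) { return b < a; }
inline bool operator==(const KeyPoint &a, const KeyPoint &b) {
    return a.x == b.x && a.y == b.y;
}

// Option 2: a free comparison function whose address can be passed around.
// Assumed convention: negative if a < b, zero if equal, positive if a > b.
inline int compareKeyPoints(const KeyPoint &a, const KeyPoint &b) {
    if (a < b) return -1;
    if (b < a) return 1;
    return 0;
}
```

Overloading the operators keeps call sites short, while a function pointer allows several orderings for the same type.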

Important data structures

Node: a node of the B+ tree

The main data structure is as follows:

struct Node {
    // parent
    // if root, parentPtr == NULL
    Node *parentPtr = NULL;
    // flag
    bool leaf;
    List<K> keys;
    /*-------leaf--------*/
    Node *previous = NULL;
    Node *next = NULL;
    List<V> values;
    /*-------index-------*/
    List<Node *> childNodePtrs;
    // for init
    int initCap;
    // constructor
    ...
};

List<T>: a list implemented on top of a fixed-length array. It offers fewer features than std::vector<T>, but is simpler and more efficient; its memory shrinks after a certain number of elements have been removed.
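The idea of an array-backed list that gives memory back after removals can be sketched as follows. This is not the project's List<T>: `ListSketch`, the doubling growth policy, and the "shrink to half when less than a quarter is used" threshold are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical array-backed list; T must be default-constructible.
template <typename T>
class ListSketch {
    std::unique_ptr<T[]> data_;
    std::size_t size_ = 0;
    std::size_t cap_;

    // Move the live elements into a new array of the requested capacity.
    void reallocate(std::size_t newCap) {
        std::unique_ptr<T[]> fresh(new T[newCap]);
        for (std::size_t i = 0; i < size_; ++i) fresh[i] = std::move(data_[i]);
        data_ = std::move(fresh);
        cap_ = newCap;
    }

public:
    explicit ListSketch(std::size_t initCap = 8)
        : data_(new T[initCap > 0 ? initCap : 1]), cap_(initCap > 0 ? initCap : 1) {}

    std::size_t size() const { return size_; }
    std::size_t capacity() const { return cap_; }

    T &operator[](std::size_t i) {
        assert(i < size_);
        return data_[i];
    }

    // Append at the tail, doubling the capacity when the array is full.
    void add(const T &value) {
        if (size_ == cap_) reallocate(cap_ * 2);
        data_[size_++] = value;
    }

    // Remove the elements in [from, to) by shifting the tail left once.
    void removeRange(std::size_t from, std::size_t to) {
        assert(from <= to && to <= size_);
        const std::size_t count = to - from;
        for (std::size_t i = to; i < size_; ++i) data_[i - count] = std::move(data_[i]);
        size_ -= count;
        // Assumed shrink policy: halve the array when under a quarter is used.
        if (cap_ > 8 && size_ < cap_ / 4) reallocate(cap_ / 2);
    }
};
```

Removing a whole range shifts the tail only once, which is one plausible reason a rangeRemove operation can be much cheaper than removing elements one by one.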

Serialization

File suffix: bpt
Header format:

Offset (bytes)	Size (bytes)	Content
0	4	"LYC\0", the header magic
4	4	order, int, the order of the B+ tree
8	4	initCap, int, the capacity pre-allocated for each node
12	4	size, int, the number of elements
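The 16-byte header in the table can be written and read back roughly as follows. This is a sketch under stated assumptions: `BptHeaderSketch`, `writeHeader`, and `readHeader` are hypothetical names, `int` is taken to be 32 bits, and the integers are stored in the host byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <sstream>

// Hypothetical in-memory view of the header fields.
struct BptHeaderSketch {
    char magic[4];
    std::int32_t order;
    std::int32_t initCap;
    std::int32_t size;
};

// Write the four header fields at offsets 0, 4, 8, and 12.
void writeHeader(std::ostream &out, std::int32_t order,
                 std::int32_t initCap, std::int32_t size) {
    const char magic[4] = {'L', 'Y', 'C', '\0'};
    out.write(magic, 4);
    out.write(reinterpret_cast<const char *>(&order), 4);
    out.write(reinterpret_cast<const char *>(&initCap), 4);
    out.write(reinterpret_cast<const char *>(&size), 4);
}

// Read the header back; returns false on a short read or a wrong magic.
bool readHeader(std::istream &in, BptHeaderSketch &h) {
    in.read(h.magic, 4);
    in.read(reinterpret_cast<char *>(&h.order), 4);
    in.read(reinterpret_cast<char *>(&h.initCap), 4);
    in.read(reinterpret_cast<char *>(&h.size), 4);
    // The literal "LYC" is four bytes including its terminating '\0'.
    return !in.fail() && std::memcmp(h.magic, "LYC", 4) == 0;
}
```

Checking the magic first lets a reader reject files that are not in this format before trusting order, initCap, or size.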

If size is not 0, the root node follows the header. All nodes share the same format; the common leading part of every node is:

Offset (relative to the start of the node, bytes)	Size (bytes)	Content
0	4	leaf, int, indicates whether the node is a leaf node
4	4	sizeofK, int, the number of bytes occupied by the key type
8	4	kSize, int, the number of keys in the node
12	kSize * sizeofK	the keys, stored in order

For leaf nodes:

Offset (relative to the start of the node, bytes)	Size (bytes)	Content
12 + kSize * sizeofK	4	sizeofV, int, the number of bytes occupied by the value type
16 + kSize * sizeofK	kSize * sizeofV	the values, stored in order

For non-leaf nodes:

Offset (relative to the start of the node, bytes)	Size (bytes)	Content
12 + kSize * sizeofK	kSize * 8	long type, the file offsets of the child nodes, stored in order
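As a worked example of the layout, the on-disk size of a node can be derived from kSize, sizeofK, and sizeofV. The function names below are hypothetical; the sketch assumes the values of a leaf occupy kSize * sizeofV bytes and that each child offset is 8 bytes.

```cpp
#include <cassert>
#include <cstddef>

// Bytes shared by every node: leaf flag, sizeofK, kSize, then the keys.
std::size_t commonPrefixBytes(std::size_t kSize, std::size_t sizeofK) {
    return 12 + kSize * sizeofK;
}

// Leaf node: common prefix, 4 bytes for sizeofV, then kSize values.
std::size_t leafNodeBytes(std::size_t kSize, std::size_t sizeofK, std::size_t sizeofV) {
    return commonPrefixBytes(kSize, sizeofK) + 4 + kSize * sizeofV;
}

// Non-leaf node: common prefix, then one 8-byte file offset per child.
std::size_t indexNodeBytes(std::size_t kSize, std::size_t sizeofK) {
    return commonPrefixBytes(kSize, sizeofK) + kSize * 8;
}
```

For a leaf with 3 keys of 4 bytes and values of 8 bytes, this gives 12 + 12 + 4 + 24 = 52 bytes; a non-leaf node with the same keys takes 12 + 12 + 24 = 48 bytes.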

Implementation notes

The initial implementation used std::vector, and its measured performance was not particularly good;
Because the keys inside a node are ordered, binary search is used for lookups;
In memory, a node should store pointers to its child nodes. Splitting and merging require copying lists; if the child structures themselves were stored, each copy would trigger a recursive copy, which is inefficient and makes memory hard to control;
Because the root node has no minimum key count, the number of children of the root must be checked after a delete operation. If it is 1, root is pointed directly at that child; otherwise, once that child is removed later, root may be left without any key and the child node is lost;
Whenever the last key of a node is inserted or deleted, the parent nodes must be updated upward;
On split and merge operations, the next and previous pointers of the leaf nodes must be updated.
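Two of the notes above, binary search inside a node and collapsing a root that has a single child, can be sketched as follows. `NodeSketch` and the function names are hypothetical and use `std::vector` for brevity. The sketch also assumes that key i of an index node is the largest key in child i, which is an inference from the notes rather than something confirmed here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified node for illustration only.
struct NodeSketch {
    bool leaf = true;
    NodeSketch *parentPtr = nullptr;
    std::vector<int> keys;
    std::vector<NodeSketch *> childNodePtrs;
};

// Binary search in a leaf: index of target, or -1 if it is absent.
int findInLeaf(const std::vector<int> &keys, int target) {
    auto it = std::lower_bound(keys.begin(), keys.end(), target);
    if (it == keys.end() || *it != target) return -1;
    return static_cast<int>(it - keys.begin());
}

// Binary search in an index node: the first child whose (assumed) maximum
// key is >= target, clamped to the last child for larger targets.
std::size_t pickChild(const std::vector<int> &keys, int target) {
    assert(!keys.empty());
    auto it = std::lower_bound(keys.begin(), keys.end(), target);
    if (it == keys.end()) return keys.size() - 1;
    return static_cast<std::size_t>(it - keys.begin());
}

// After a removal, promote the only child of a non-leaf root to be the
// new root and free the old root. Returns the (possibly new) root.
NodeSketch *collapseRoot(NodeSketch *root) {
    if (root != nullptr && !root->leaf && root->childNodePtrs.size() == 1) {
        NodeSketch *child = root->childNodePtrs[0];
        child->parentPtr = nullptr;
        delete root;
        return child;
    }
    return root;
}
```

Storing `NodeSketch *` rather than nodes by value means that copying `childNodePtrs` during a split or merge only copies pointers, which is the concern raised in the third note.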

Tests

Test environment:

The file main.cpp defines the following macros; set one to 1 to enable the corresponding test:

// Test List performance (compared with vector)
#define TEST_LIST 0
// Test the functional correctness of the B+ tree
#define TEST_FUNC 0
// Test B+ tree speed (insert, delete, update, query)
#define TEST_SPEED 0
// Test the B+ tree for heap memory leaks (run a memory checking tool after building)
#define TEST_MEM 0
// Test B+ tree serialization and deserialization
#define TEST_SERIAL 0

List test

Data volume: 10^5
Add and delete data, asserting that the element at each position is as expected (functional test)
Tail insertion test (with and without pre-allocation)
Head insertion test
Head removal test
Tail removal test
rangeRemove test

Test run results:

Table:

	List (ms)	vector (ms)
Tail insert	1.506	4.724
Tail insert (pre-allocated space)	1.201	2.765
Head insert	834.804	906.981
Remove half of the elements (from the head)	619.493	805.379
Remove half of the elements (from the tail)	1.444	7.523
rangeRemove (half)	0.065	0.558

Histogram:

B+ tree functional test

Data volume: 10^5
After inserting the data, assert that the inserted data exists
Remove half of the data, then assert that the removed data no longer exists and that the remaining data still exists
Re-insert all the data and assert that all of it exists (checks whether deletion damaged the structure)
After clear, assert that the previous data no longer exists
After inserting all the data, test the traversal methods (check the key order)

Speed test

Data volume: 10^6, 10^7, 10^8
B+ tree order: log(TEST_SIZE)^2
Insert data in a loop
Access all the data in a loop
Remove the data in a loop
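The three timed loops can be sketched for the std::map side of the comparison. This is not the project's test code: `elapsedMs`, `MapTimings`, and `timeStdMap` are hypothetical names, and sequential int keys are an assumption, since the key distribution used in the real test is not stated here.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>

// Run a callable once and return the elapsed wall-clock time in milliseconds.
template <typename F>
double elapsedMs(F &&work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

// Results of one insert / access / remove pass.
struct MapTimings {
    double insertMs;
    double accessMs;
    double removeMs;
    long long checksum;     // sum of all values read, keeps the access loop observable
    std::size_t finalSize;  // expected to be 0 after all removals
};

// Time n insertions, n lookups, and n removals on std::map<int, int>.
MapTimings timeStdMap(int n) {
    std::map<int, int> m;
    MapTimings t{};
    t.insertMs = elapsedMs([&] { for (int i = 0; i < n; ++i) m[i] = i; });
    long long sum = 0;
    t.accessMs = elapsedMs([&] { for (int i = 0; i < n; ++i) sum += m.find(i)->second; });
    t.checksum = sum;
    t.removeMs = elapsedMs([&] { for (int i = 0; i < n; ++i) m.erase(i); });
    t.finalSize = m.size();
    return t;
}
```

Accumulating a checksum in the access loop gives the compiler a visible result, so the lookups are less likely to be optimized away in a release build.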

Test run results:

Table:

	B+ tree (ms)	std::map (ms)
Insert (10^8)	192808.064	325621.333
Access (10^8)	163102.022	280150.403
Remove (10^8)	213982.406	366576.836
Insert (10^7)	11825.821	22213.139
Access (10^7)	10190.870	18137.073
Remove (10^7)	15130.015	22133.154
Insert (10^6)	1057.291	1624.615
Access (10^6)	888.186	1155.504
Remove (10^6)	1099.584	1495.433

Histogram: