Data structures: understanding trees

Table of contents

One: The problem to be solved, the starting point

1. Evolution

Tree definition:

The depth (height) of the tree

Balanced Binary Tree (AVL Tree)

Red-black tree:

B-tree:


In the middle of the night, in a moment of inspiration, I suddenly felt that I finally had my own understanding of this data structure, and I could only marvel at the wisdom of the computing pioneers. Here I record my understanding of the tree data structure. To make it easier to grasp, I follow how things actually developed, starting from the need and evolving the idea step by step, instead of rigidly handing out definitions and rules to memorize by rote as textbooks do.

This is the technology of computer storage, or more precisely, of information storage. Broadly speaking, there are two goals: how to store the most information using the least data, and, once a large amount of information is stored, how to find the desired data as quickly as possible with the fewest accesses. The former problem, encoding, is already addressed by techniques such as Huffman coding. Its basic idea is to give the shortest codes to the elements that occur most frequently, so that the overall encoding uses the least data.

One: The problem to be solved, the starting point

For the latter question: how should a pile of data be stored so that searching and accessing it is as fast as possible? (Here this means data held in memory and used dynamically by the program at runtime, not data saved to disk, which is the job of a database.)

From the data's point of view, the fastest access is the one that traverses the least data. Let's keep this question in focus and work out, one step at a time, how things like red-black trees came about.

1. Evolution

First of all, a pile of data must have an access entry. From an implementation point of view, a simple design is best (the idea of simplification will come up again later), so the best choice is a single entry. The simplest design is then to connect the data in series along one line, which gives the simplest chain structure: the linked list.

But this is not efficient. There is only one path, so finding the last item means traversing all the items before it. The time complexity is O(n).
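To make the O(n) cost concrete, here is a minimal sketch of a singly linked list with one entry point. The names `ListNode`, `build_list`, and `find_with_visits` are made up for this illustration, and the sample values are arbitrary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ListNode:
    value: int
    next: Optional["ListNode"] = None


def build_list(values):
    """Chain the values together and return the single entry point (the head)."""
    head = None
    for v in reversed(values):
        head = ListNode(v, head)
    return head


def find_with_visits(head, target):
    """Walk the chain from the entry; return (found, nodes_visited)."""
    visits = 0
    node = head
    while node is not None:
        visits += 1
        if node.value == target:
            return True, visits
        node = node.next
    return False, visits


head = build_list([7, 3, 9, 1, 5])
print(find_with_visits(head, 7))  # (True, 1): the first item costs one visit
print(find_with_visits(head, 5))  # (True, 5): the last item costs all five
```

The number of visits grows in step with the position of the item, which is exactly the single-path weakness described above.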
How about designing multiple paths starting from the entrance:

A bunch of scattered data:

 If starting from a point in the middle, connect all the data:

 
But here we find a problem. If a data node has two incoming lines (arrows pointing at it), like nodes 1, 2, and 3 in the figure above, one of those lines inevitably leads back, here to node 3. The orange incoming line is in effect going in the reverse direction: the path from node A to node 3 heads back toward the root node. We want paths from the middle outward to be as short as possible, and we certainly do not want to go backward, since that can only cover more distance.

Tree definition:

To avoid such problems, each node can be restricted to a single incoming line. In textbook terms, every node has a unique predecessor but not necessarily a unique successor. (This is in fact the difference between a tree and a graph. A tree is one-to-many (one way in, several ways out), while a graph is many-to-many (unrestricted). With a tree we care about the length of the path from the root node to a given node; with a graph we care about the connectivity between any two nodes.)

Remove all orange lines like the ones above, so that every node has exactly one incoming line (a unique predecessor). Each node can then be reached from the root node by only one path, and the problem becomes simpler: we no longer need to consider how many access paths a node has, because there is only one. So once you know the depth (level) of a node, you know the length of the path taken to access it.
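A small sketch of that idea: when every node records its one predecessor, the level of a node can be found by walking back to the root. The tree below is a made-up example, and counting the root as level 1 is a convention chosen for this sketch.

```python
# Hypothetical tree stored as child -> parent; each node has exactly one incoming line.
parent = {"root": None, "A": "root", "B": "root", "C": "A", "D": "A", "E": "D"}


def level(node):
    """Count the nodes on the single path from the root to `node` (root = level 1)."""
    count = 1
    while parent[node] is not None:
        node = parent[node]
        count += 1
    return count


print(level("root"))  # 1
print(level("E"))     # 4: root -> A -> D -> E
```

Because there is only one path to each node, no search over alternative routes is needed; a single walk gives the answer.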

To make it look nicer, the branches are turned downward to form a hierarchy; spread out, it becomes a tree:

 a tree:

That is still not enough, so we simplify further. The entry node can have many branches, but just two branches is the simplest case. It is like the computer being binary: two states are enough to be combined into any amount of information. This is the same idea of simplifying the base.

So simplify once more. Since everything basically runs level by level from top to bottom, we can drop the arrowheads and draw plain lines that are understood to point downward. This is our binary tree.

The depth (height) of the tree

For a binary tree, start from the entry (the root node) and visit the farthest node (a leaf node). The number of nodes passed along the way, the path length, can be taken as the time complexity, and we want it to be as small as possible. This number is important enough to deserve its own word: depth. The depth (or height) of the tree is the path length to the node farthest from the root node. (It describes performance, so we record the boundary value that gives the worst performance, taking the longest path as the depth of the tree.)
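As an illustration, the depth can be computed by taking the longer of the two branches at every node. This is only a sketch: the `Node` class and the sample tree are assumptions made for the example, and depth is counted in nodes, so an empty tree has depth 0.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def depth(node):
    """Number of nodes on the longest root-to-leaf path (empty tree = 0)."""
    if node is None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))


# Hypothetical tree:    8
#                      / \
#                     3   10
#                    / \
#                   1   6
#                      /
#                     4
tree = Node(8, Node(3, Node(1), Node(6, Node(4))), Node(10))
print(depth(tree))  # 4, set by the longest path 8 -> 3 -> 6 -> 4
```

Node 10 is reached after only two nodes, but the depth records the worst case, not the typical one.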

Going back to the earlier question: how should the data be organized and stored so that the path to any item is as short as possible?

You can also think about it from the other direction: for a path of fixed length (fixed depth), how can the amount of data stored be maximized?

The data is now stored as a binary tree. To store the most data for a fixed path length, fill the tree up completely. This is a full binary tree. Layer by layer, a binary tree of depth k holds at most 1 + 2 + 4 + ... + 2^(k-1) items. By the high-school formula for the sum of a geometric sequence, a1(1-q^n)/(1-q) with a1 = 1, q = 2, and n = k, that is 2^k-1 items.
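The closed form is easy to check numerically. The helper `max_nodes` below is a name invented for this sketch; it simply adds up the capacity of each layer.

```python
def max_nodes(k):
    """Sum the layer capacities 1 + 2 + 4 + ... + 2**(k-1)."""
    return sum(2 ** layer for layer in range(k))


for k in range(1, 6):
    assert max_nodes(k) == 2 ** k - 1  # closed form of the geometric series
    print(k, max_nodes(k))
```

Each extra layer roughly doubles the capacity, which is why a small depth is enough for a large amount of data.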

Conversely, to find an item in a full binary tree holding n items, the access path is at best the root itself and at worst the depth of the tree, which is log2(n+1), the base-2 logarithm. The time complexity is usually written simply as O(log n) without stating the base. Strictly, the base is 2 for a binary tree and 3 for a ternary tree, but the tree used in practice is almost always binary. This is where the O(log n) cost of searching a binary tree comes from.
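Here is a rough sketch that counts node visits in a full tree of 1023 keys. It assumes the keys are arranged as a binary search tree (smaller keys to the left, larger to the right), which is what makes it possible to pick one branch at each step; `build_balanced` and `search` are names made up for this example.

```python
import math


def build_balanced(sorted_keys):
    """Build a binary search tree as nested tuples (key, left, right)."""
    if not sorted_keys:
        return None
    mid = len(sorted_keys) // 2
    return (
        sorted_keys[mid],
        build_balanced(sorted_keys[:mid]),
        build_balanced(sorted_keys[mid + 1:]),
    )


def search(tree, target):
    """Return the number of nodes visited while looking for target."""
    visits = 0
    while tree is not None:
        key, left, right = tree
        visits += 1
        if target == key:
            break
        tree = left if target < key else right
    return visits


n = 2 ** 10 - 1                       # 1023 keys exactly fill a tree of depth 10
tree = build_balanced(list(range(n)))
worst = max(search(tree, key) for key in range(n))
print(worst, math.log2(n + 1))        # 10 10.0
```

Even the least lucky lookup touches only 10 of the 1023 nodes, compared with up to 1023 visits in a linked list.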

Balanced Binary Tree (AVL Tree)

The AVL tree gets its name from its inventors, G. M. Adelson-Velsky and E. M. Landis.

Of course, the number of data items will not always be exactly 2^n-1, so the last layer cannot always be filled. A tree whose layers are all full except possibly the last, with the last layer filled from left to right, is called a "complete binary tree". The real problem, though, is that data is often not filled in order, layer by layer, as shown in the figure below:

In that figure, the length of the path to node 2 is 6, so the depth of the tree becomes 6. Looked at another way, a tree of depth 6 can hold up to 2^6-1 = 63 nodes, yet only 15 items are stored here, which wastes a lot of space. To avoid this, the binary tree should be filled from top to bottom, completing each layer before starting the next, rather than starting a new layer while the one above still has room. Otherwise the whole tree ends up with more data on one side and less on the other: it is "unbalanced". Following this logic, we arrive at the balanced binary tree: it is either an empty tree, or a tree in which the heights of the left and right subtrees differ by at most 1 and both subtrees are themselves balanced binary trees. Conversely, if the height difference exceeds 1, the left and right sides are unbalanced, and the nodes in the last layer of the taller subtree should be moved up to fill the upper layers.
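The recursive definition translates almost word for word into a check. The sketch below is one possible way to write it; `Node`, `check`, and `is_balanced` are names chosen for this example, and heights are counted in nodes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def check(node):
    """Return (is_balanced, height) for the subtree rooted at node."""
    if node is None:
        return True, 0                       # an empty tree is balanced
    left_ok, left_h = check(node.left)
    right_ok, right_h = check(node.right)
    ok = left_ok and right_ok and abs(left_h - right_h) <= 1
    return ok, 1 + max(left_h, right_h)


def is_balanced(node):
    return check(node)[0]


chain = Node(1, right=Node(2, right=Node(3)))    # degenerates into a linked list
filled = Node(2, Node(1), Node(3))               # same keys, filled layer by layer
print(is_balanced(chain), is_balanced(filled))   # False True
```

The two trees hold the same three keys; only the second keeps the depth at its minimum of 2.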

Optimal binary tree (also known as the Huffman tree) ...

Red-black tree:

Therefore, data should be stored as a balanced binary tree as far as possible. A strictly balanced binary tree, however, is an ideal that belongs to theory. In practical use, other factors have to be considered (compare Wang Miao's applied physics with Ding Yi's theoretical physics in Three Body). In actual use, insertions and deletions inevitably destroy the balance of a balanced binary tree, and restoring it requires extra modifications. These are the left-rotation and right-rotation operations, which restore balance by changing the incoming and outgoing links of a few nodes. These operations are costly. The result is that a strictly balanced binary tree gives optimal search performance, but keeping it balanced after every insertion and deletion adds overhead, which raises the total cost of all operations. So is there a way to balance the total cost across search, insertion, and deletion?
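For reference, here is a minimal sketch of the two rotations on a plain binary tree. It is not a full AVL implementation: there are no parent pointers or stored heights, and the names `Node`, `rotate_left`, `rotate_right`, and `height` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def rotate_left(x):
    """Lift x's right child y into x's place; x becomes y's left child."""
    y = x.right
    x.right = y.left      # y's old left subtree lies between x and y in key order
    y.left = x
    return y              # new root of this subtree


def rotate_right(y):
    """Mirror image: lift y's left child x into y's place."""
    x = y.left
    y.left = x.right
    x.right = y
    return x


def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


chain = Node(1, right=Node(2, right=Node(3)))   # right-leaning chain, height 3
root = rotate_left(chain)
print(root.key, root.left.key, root.right.key, height(root))  # 2 1 3 2
```

Each rotation rewires only a couple of links, but a real balanced tree also has to detect where rotations are needed after every insertion and deletion, and that bookkeeping is where the extra cost comes from.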

The red-black tree exists for exactly this purpose. It is semi-balanced rather than fully balanced, but it reduces the extra cost of insertions and deletions and so reaches a compromise in overall performance. Its search performance falls short of the ideal balanced binary tree, but its insertions and deletions are much cheaper, so the overall performance is more "balanced".

How much extra overhead does keeping a balanced binary tree balanced after insertions and deletions actually bring? And how does the red-black tree reduce it? This remains to be explored further.

Red-black tree properties:

Property 1: Each node is either black or red.

Property 2: The root node is black.

Property 3: Each leaf node (NIL) is black.

Property 4: Both children of every red node must be black, so two red nodes are never directly connected.

Property 5: Every path from a given node down to any of its descendant leaf nodes contains the same number of black nodes.
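The five properties can be checked mechanically. The sketch below is a validator only, not an insertion or deletion algorithm, and it does not check key ordering. `RBNode`, `black_height`, and `is_red_black` are names invented here, and `None` is used to stand for the black NIL leaves.

```python
from dataclasses import dataclass
from typing import Optional

RED, BLACK = "red", "black"


@dataclass
class RBNode:
    key: int
    color: str
    left: Optional["RBNode"] = None    # None stands for a black NIL leaf (property 3)
    right: Optional["RBNode"] = None


def black_height(node):
    """Return the black-node count on every path down to NIL, or -1 if a property fails."""
    if node is None:
        return 1                                    # a NIL leaf counts as black
    if node.color not in (RED, BLACK):              # property 1
        return -1
    if node.color == RED:
        for child in (node.left, node.right):
            if child is not None and child.color == RED:
                return -1                           # property 4: red node with a red child
    left, right = black_height(node.left), black_height(node.right)
    if left == -1 or right == -1 or left != right:
        return -1                                   # property 5: unequal black counts
    return left + (1 if node.color == BLACK else 0)


def is_red_black(root):
    if root is not None and root.color != BLACK:    # property 2
        return False
    return black_height(root) != -1


ok = RBNode(2, BLACK, RBNode(1, RED), RBNode(3, RED))
bad = RBNode(1, BLACK, right=RBNode(2, RED, right=RBNode(3, RED)))
print(is_red_black(ok), is_red_black(bad))  # True False
```

Properties 4 and 5 together are what keep the tree roughly balanced: a path can alternate red and black at most, so the longest path is never more than about twice the shortest.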

B-tree:

Similarly, in actual use the data may be too large to read into memory all at once, so the program may have to go to disk when accessing it. Disk reads and writes are slow, so the number of such operations should be minimized. The B-tree is a special balanced tree designed for exactly this situation, and it is generally used in databases. The idea is to store keys in a node that is held in memory. That node can have many child nodes, all of which live on disk. Usually a node is made as large as one whole disk page, so that it can be read or written in a single page-sized operation.
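A rough sketch of the lookup, with everything held in memory and each page visited standing in for one disk read. The `Page` class, the `search` function, and the tiny two-level tree are assumptions made for illustration; a real B-tree would hold hundreds of keys per page and would also handle insertion and splitting.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Page:
    """One B-tree node; in a real system this would be one disk page."""
    keys: list                                      # sorted keys stored in this page
    children: list = field(default_factory=list)    # empty for a leaf page


def search(page, target):
    """Return (found, pages_read); each page visited stands for one disk read."""
    pages_read = 0
    while True:
        pages_read += 1
        i = bisect.bisect_left(page.keys, target)   # search inside the page, in memory
        if i < len(page.keys) and page.keys[i] == target:
            return True, pages_read
        if not page.children:                       # leaf reached without a match
            return False, pages_read
        page = page.children[i]                     # follow the i-th child pointer


# Hypothetical two-level tree: 3 keys in the root, 4 leaf pages below it.
root = Page(
    [10, 20, 30],
    [Page([1, 5, 8]), Page([12, 15]), Page([22, 27]), Page([35, 40, 50])],
)
print(search(root, 20))   # (True, 1)
print(search(root, 27))   # (True, 2)
print(search(root, 13))   # (False, 2)
```

Because each page branches many ways instead of two, the tree stays very shallow, and the number of slow page reads per lookup stays small.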
(Figure: photo of Chapter 18, B-Trees, from Introduction to Algorithms.)

This is my personal understanding of how the binary tree, the balanced binary tree (AVL tree), and the red-black tree developed and evolved.

Origin blog.csdn.net/u012459903/article/details/129000718