C++ Study Notes (3): Trees

1. Tree

For large amounts of input data, the linear access time of a linked list is prohibitive. This section introduces a simple data structure for which the running time of most operations is O(log N) on average.

The data structure in question is the binary search tree. The binary search tree is the basis for the implementation of two library collection classes, set and map, which are used in many applications. Trees in general are a very useful abstraction in computer science.

1.1 Preliminary knowledge

Trees can be defined in several ways. A natural way to define a tree is recursively. A tree is a collection of nodes. The collection can be empty; otherwise, the tree consists of a distinguished node r, called the root, and zero or more non-empty (sub)trees T_{1}, T_{2}, \cdots, T_{k}, each of whose roots is connected by a directed edge from r.

The root of each subtree is called the child of root r, and r is the parent of the root of each subtree. The figure below shows a typical tree defined by recursion.

From the recursive definition, a tree is a collection of N nodes, one of which is the root, and N-1 edges. That there are N-1 edges follows from the fact that each edge connects some node to its parent, and every node except the root has exactly one parent (see the figure below).

In the tree in the figure above, node A is the root. Node F has A as its parent and K, L, and M as its children. Each node may have an arbitrary number of children, possibly zero. A node with no children is called a leaf. The leaves in the figure above are B, C, H, I, P, Q, K, L, M, and N. Nodes with the same parent are siblings; thus K, L, and M are all siblings. The grandparent and grandchild relations can be defined in a similar way.

A path from node n_{1} to node n_{k} is defined as a sequence of nodes n_{1}, n_{2}, \cdots, n_{k} such that n_{i} is the parent of n_{i+1} for 1 \leq i < k. The length of the path is the number of edges on the path, namely k-1. There is a path of length 0 from every node to itself. Note that in a tree there is exactly one path from the root to each node.

For any node n_{i}, the depth of n_{i} is the length of the unique path from the root to n_{i}. Thus the depth of the root is 0. The height of n_{i} is the length of the longest path from n_{i} to a leaf. Thus all leaves have height 0. The height of a tree is equal to the height of its root. For the tree in the figure above, E has depth 1 and height 2; F has depth 1 and height 1; the height of the tree is 3. The depth of a tree is equal to the depth of its deepest leaf, which is always equal to the height of the tree.

If there is a path from n_{1} to n_{2}, then n_{1} is an ancestor of n_{2} and n_{2} is a descendant of n_{1}. If n_{1} \neq n_{2}, then n_{1} is a proper ancestor of n_{2} and n_{2} is a proper descendant of n_{1}.

1.1.1 Implementation of the tree

One way to implement a tree would be to store in each node, besides its data, a link to each child of the node. However, since the number of children per node can vary greatly and is not known in advance, making the children direct links in the data structure is infeasible, because it would waste too much space. The solution is simple: keep the children of each node in a linked list of tree nodes. The following declaration is typical.

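struct TreeNode
{
    Object    element;       // data stored in this node
    TreeNode *firstChild;    // leftmost child, or null for a leaf
    TreeNode *nextSibling;   // next child of the same parent, or null
};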

The figure shows how a tree is represented with this declaration. Downward arrows are firstChild links, and arrows from left to right are nextSibling links. Null links are not drawn, because there are too many of them.

In the tree shown in the figure below, node E has both a link to a sibling (F) and a link to a child (I), while some nodes have neither.
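As a rough illustration of the first-child/next-sibling representation, the sketch below walks the sibling chain to count children and to compute a node's height. It is only a sketch under stated assumptions: std::string stands in for Object, and addChild, childCount, and height are hypothetical helper names that do not appear in the notes.

```cpp
#include <algorithm>
#include <string>

// First-child/next-sibling node; std::string replaces Object (an assumption).
struct TreeNode
{
    std::string element;
    TreeNode *firstChild = nullptr;
    TreeNode *nextSibling = nullptr;
};

// Hypothetical helper: append child to the end of parent's child list.
void addChild( TreeNode & parent, TreeNode & child )
{
    if( parent.firstChild == nullptr )
    {
        parent.firstChild = &child;
        return;
    }
    TreeNode *last = parent.firstChild;
    while( last->nextSibling != nullptr )
        last = last->nextSibling;
    last->nextSibling = &child;
}

// Number of children: follow firstChild once, then the nextSibling chain.
int childCount( const TreeNode & node )
{
    int count = 0;
    for( const TreeNode *c = node.firstChild; c != nullptr; c = c->nextSibling )
        ++count;
    return count;
}

// Height: a leaf has height 0; otherwise 1 plus the height of the tallest child.
int height( const TreeNode & node )
{
    int tallest = -1;
    for( const TreeNode *c = node.firstChild; c != nullptr; c = c->nextSibling )
        tallest = std::max( tallest, height( *c ) );
    return tallest + 1;
}
```

With nodes A, E, F, and I linked so that E and F are children of A and I is a child of E, E's nextSibling is F, E's firstChild is I, and height applied to A gives 2. Each node carries exactly two pointers no matter how many children it has, which is the point of this representation.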

1.1.2 Tree traversal and application

There are many applications for trees. One popular use is the directory structure in many common operating systems, including UNIX and DOS. The figure below shows a typical directory in the UNIX file system.

The root of this directory is /usr (the asterisk after the name indicates that /usr is itself a directory). /usr has three children, mark, alex, and bill, which are themselves directories. Thus /usr contains three directories and no regular files. The file name /usr/mark/book/ch1.r is obtained by following the leftmost child three times in succession. Each "/" after the first represents an edge; the result is the full pathname. This hierarchical file system is very popular because it lets users organize their data logically. Furthermore, two files in different directories can share the same name, because they must have different paths from the root and thus different pathnames. A directory in the UNIX file system is just a file that lists all of its children, so directories are structured almost exactly according to the type declaration above. Indeed, on some versions of UNIX, if the standard command to print a file is applied to a directory, the names of the files in the directory can be seen in the output (along with other non-ASCII information).

Suppose we want to list the names of all files in the directory. The output format is that a file at depth d_{i} has its name indented by d_{i} tabs. The algorithm is given in the following pseudocode:

void FileSystem::listAll( int depth = 0 ) const
{
    printName( depth );   // Print the name of the object
    if( isDirectory( ) )
        for each file c in this directory (for each child)
            c.listAll( depth + 1 );
}

The recursive function listAll needs to be started with a depth of 0, so that the root is printed with no indentation. The depth is an internal bookkeeping variable, hardly a parameter that a calling routine should be expected to know about, so a default value of 0 is provided for it.

The logic of the algorithm is simple to follow. The name of the file object is printed with the appropriate number of tabs. If the entry is a directory, each of its children is processed recursively, one by one. The children are one level deeper than the directory, so they are indented by one additional tab. The complete output is as follows:

/usr
   mark
      book
         ch1.r
         ch2.r
         ch3.r
      course
         cop3530
            fall05
               sy1.r
            spr06
               sy1.r
            sum06
               sy1.r
      junk
   alex
      junk
   bill
      work
      course
         cop3212
            fall05
               grades
               prog1.r
               prog2.r
            fall06
               prog2.r
               prog1.r
               grades
This traversal strategy is called a preorder traversal. In a preorder traversal, the work at a node is performed before (pre) its children are processed. When the routine runs, the first line of the body (the printName call) is clearly executed exactly once per node, because each name is output once. Since the first line is executed once per node, the second line (the isDirectory test) must also be executed once per node. Furthermore, the fourth line (the recursive call) is executed at most once for each child of each node, and the total number of children is exactly one less than the number of nodes. Finally, the for loop iterates once per execution of the fourth line, plus once each time the loop ends. Thus the total amount of work per node is constant. If there are N file names to be output, the running time is O(N).

Another common way to traverse a tree is the postorder traversal. In a postorder traversal, the work at a node is performed after (post) its children have been evaluated. As an example, the following figure shows the same directory structure as before, with the number of disk blocks taken up by each file given in parentheses.

Since directories are themselves files, they have sizes too. Suppose we want to calculate the total number of disk blocks used by all the files in the tree. The most natural way to do this is to find the number of blocks contained in the subdirectories /usr/mark (30), /usr/alex (9), and /usr/bill (32). The total number of blocks is then the total in the subdirectories (71) plus the one block used by /usr, for a total of 72. The following pseudocode method size implements this strategy.

int FileSystem::size( ) const
{
    int totalSize = sizeOfThisFile( );

    if( isDirectory( ) )
        for each file c in this directory (for each child)
            totalSize += c.size( );

    return totalSize;
}

If the current object is not a directory, size merely returns the number of blocks it uses. Otherwise, the number of blocks used by the directory itself is added to the number of blocks (recursively) found in all of its children. To show the difference between the postorder and preorder strategies, the following trace lists the files and directories in the order in which the algorithm finishes computing their sizes (the block counts themselves are not shown).

         ch1.r
         ch2.r
         ch3.r
      book
               sy1.r
            fall05
               sy1.r
            spr06
               sy1.r
            sum06
         cop3530
      course
      junk
   mark
      junk
   alex
      work
               grades
               prog1.r
               prog2.r
            fall05
               prog2.r
               prog1.r
               grades
            fall06
         cop3212
      course
   bill
/usr
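A runnable sketch of the postorder size computation follows. The FileNode type, its blocks field, and the finished parameter are all assumptions introduced for this example; the finished vector exists only to record the order in which totals become available, so that the postorder behaviour is visible.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a file or directory with a known block count.
struct FileNode
{
    std::string name;
    int blocks;                       // blocks used by this entry itself
    std::vector<FileNode> children;   // empty for regular files
};

// Postorder: a node's total is known only after all of its children are totalled.
int totalSize( const FileNode & node, std::vector<std::string> & finished )
{
    int total = node.blocks;
    for( const FileNode & c : node.children )
        total += totalSize( c, finished );
    finished.push_back( node.name );  // recorded after the children
    return total;
}
```

For a made-up tree in which root (1 block) contains directory a (2 blocks, holding files x and y of 3 and 4 blocks) and file b (5 blocks), the total is 15 and the names finish in the order x, y, a, b, root: every directory appears after everything it contains.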

1.2 set and map in the standard library

Chapter 3 discussed the STL containers vector and list, both of which are inefficient for searching. Consequently, the STL provides two additional containers, set and map, which guarantee logarithmic cost for basic operations such as insertion, deletion, and searching.

1.2.1  set

A set is an ordered container that does not allow duplicates. Many of the idioms used to access items in vector and list also work for a set. In particular, the iterator and const_iterator types are nested in set, which allows traversal of the set. Several methods of vector and list have exactly the same names in set, including begin, end, size, and empty.

The operations specific to set are efficient insertion, deletion, and a basic search.

The insertion routine is aptly named insert. However, because a set does not allow duplicates, an insert can fail. We would therefore want the return type to be able to indicate this with a Boolean value. In fact, insert returns a more complicated type than bool, because it also returns an iterator that gives the position of x when insert returns. This iterator refers either to the newly inserted item or to the existing item that caused the insert to fail. The iterator is useful because knowing the position of an item allows it to be removed quickly: the node containing the item is reached directly, without a search.

The STL defines a class template named pair, which is little more than a struct with members first and second for accessing the two items in the pair. Here are two different insert routines:

pair<iterator,bool> insert( const Object & x );
iterator insert( iterator hint, const Object & x );

The one-parameter insert behaves as described above and returns a pair. The two-parameter insert accepts a hint about where x is expected to go; in the standard library's set it returns only an iterator, not a pair. If the hint is accurate, the insertion is fast, often O(1). If not, the insertion falls back to the normal insertion algorithm and performs comparably to the one-parameter insert. For example, the following code is faster using the two-parameter insert than it would be with the one-parameter insert:

set<int> s;
for( int i = 0; i < 1000000; i++ )
    s.insert( s.end( ), i );   // each i belongs at the end, so the hint is always accurate
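The two forms of insert can be compared in a short sketch. The helper names insertOnce and buildAscending are hypothetical; the calls themselves are the ordinary std::set interface, in which the hinted overload returns a plain iterator.

```cpp
#include <set>
#include <utility>

// Returns true only if x was newly inserted; a duplicate leaves the set unchanged.
bool insertOnce( std::set<int> & s, int x )
{
    std::pair<std::set<int>::iterator, bool> result = s.insert( x );
    // result.first refers to x in the set either way; result.second reports success.
    return result.second;
}

// Fills a set with 0..n-1 in increasing order, passing end() as the hint each time.
std::set<int> buildAscending( int n )
{
    std::set<int> s;
    for( int i = 0; i < n; ++i )
        s.insert( s.end( ), i );   // the hinted insert returns an iterator, not a pair
    return s;
}
```

Calling insertOnce twice with the same value returns true and then false, and the set's size stays at 1. How much time the hint saves depends on the library implementation, so it is worth measuring rather than assuming.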

There are several versions of erase:

int erase( const Object & x );                   // size_type in the actual library
iterator erase( iterator itr );
iterator erase( iterator start, iterator end );

The first one-parameter erase removes x (if found) and returns the number of items actually removed, which is obviously either 0 or 1. The second one-parameter erase behaves the same as in vector and list: it removes the object at the position given by the iterator, returns an iterator to the element that followed itr immediately before the call, and invalidates itr, which is no longer usable. The two-parameter erase also behaves the same as in vector or list, removing all the items from start up to but not including end.
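The iterator-based forms of erase can be sketched as follows. The function names eraseRange and eraseEvens are hypothetical, and lower_bound (a std::set member not discussed in these notes) is used only to obtain iterators for the range. The sketch assumes C++11 or later, where erase on a set returns the following iterator, and assumes low is not greater than high.

```cpp
#include <cstddef>
#include <set>

// Removes every element in [low, high) and returns how many were removed.
// Assumes low <= high.
std::size_t eraseRange( std::set<int> & s, int low, int high )
{
    std::size_t before = s.size( );
    // lower_bound returns the first element that is not less than its argument.
    s.erase( s.lower_bound( low ), s.lower_bound( high ) );
    return before - s.size( );
}

// Removes all even values, continuing from the iterator that erase returns.
void eraseEvens( std::set<int> & s )
{
    for( std::set<int>::iterator itr = s.begin( ); itr != s.end( ); )
    {
        if( *itr % 2 == 0 )
            itr = s.erase( itr );   // the old itr is invalid; use the returned one
        else
            ++itr;
    }
}
```

The loop in eraseEvens shows why the returned iterator matters: incrementing itr after it has been erased would be an error, so the loop advances either through the return value of erase or through ++itr, never both.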

For searching, rather than a contains routine that returns a Boolean value, set provides a find routine that returns an iterator referring to the position of the item (or to the end marker if the search fails). This provides considerably more information at no cost in running time. The signature of find is:

iterator find( const Object & x ) const;
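A brief sketch of how find is typically used. The helper names contains and removeIfPresent are hypothetical; the second one reuses the iterator returned by find, so that erase does not have to search again.

```cpp
#include <set>
#include <string>

// A contains-style test built on find: x is present iff find does not return end().
bool contains( const std::set<std::string> & s, const std::string & x )
{
    return s.find( x ) != s.end( );
}

// Find, then erase by position: only one search of the tree is performed.
bool removeIfPresent( std::set<std::string> & s, const std::string & x )
{
    std::set<std::string>::iterator itr = s.find( x );
    if( itr == s.end( ) )
        return false;
    s.erase( itr );
    return true;
}
```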

By default, ordering uses the less<Object> function object, which itself is implemented by invoking operator< for the Object. An alternative ordering can be specified by instantiating the set template with a function object type. For example, we can create a set that stores string objects and ignores case distinctions by using the CaseInsensitiveCompare function object. In the code below, the set s has size 1.

set<string,CaseInsensitiveCompare> s;
s.insert( "hello" );
s.insert( "HeLLo" );
cout << "The size is: " << s.size( ) << endl;
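The notes do not show how CaseInsensitiveCompare is defined. One possible definition, offered purely as an assumption, compares the two strings character by character after converting each character to lower case. The demoSize function is a hypothetical wrapper that repeats the experiment above and returns the resulting size.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// One possible definition (an assumption; the notes do not give one).
struct CaseInsensitiveCompare
{
    bool operator()( const std::string & lhs, const std::string & rhs ) const
    {
        // Lexicographic "less than" on lower-cased characters.
        return std::lexicographical_compare(
            lhs.begin( ), lhs.end( ), rhs.begin( ), rhs.end( ),
            []( unsigned char a, unsigned char b )
            { return std::tolower( a ) < std::tolower( b ); } );
    }
};

// Inserts "hello" and "HeLLo" and reports how many items the set holds.
std::size_t demoSize( )
{
    std::set<std::string, CaseInsensitiveCompare> s;
    s.insert( "hello" );
    s.insert( "HeLLo" );
    return s.size( );
}
```

Under this ordering neither "hello" nor "HeLLo" is less than the other, so the set treats them as equivalent and the second insert fails, leaving a size of 1.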

1.2.2  map

A map is used to store a collection of ordered entries, each consisting of a key and a value. Keys must be unique, but several keys can map to the same value, so values need not be unique. The keys in the map are maintained in logically sorted order.

A map behaves like a set instantiated with a pair, whose comparison function refers only to the key. Thus map supports begin, end, size, and empty, but the underlying iterator refers to a key-value pair. In other words, for an iterator itr, *itr is of type pair<KeyType,ValueType> (in the standard library the key part is const). The map also supports insert, find, and erase. For insert, a pair<KeyType,ValueType> object must be provided. Although find requires only a key, the iterator it returns still refers to a pair. Using only these operations is often not worthwhile, because the syntactic baggage can be heavy.

Fortunately, map has an important extra operation that yields simple syntax. The array-indexing operator is overloaded for map as follows:

ValueType & operator[] ( const KeyType & key );

The semantics of operator[] are as follows. If key is present in the map, a reference to the corresponding value is returned. If key is not present, it is inserted with a default value, and a reference to the inserted default value is returned. The default value is obtained by applying a zero-parameter constructor, or is zero for the primitive types. These semantics do not allow an accessor (const) version of operator[], so operator[] cannot be used on a constant map. For example, if a map is passed to a routine by constant reference, operator[] is unusable inside that routine.
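Both consequences of these semantics can be seen in a small sketch. The function names salaryOr and sizeAfterRead are hypothetical. The first shows a lookup through a constant reference, where find must be used; the second takes the map by value, so the caller's map is untouched, and shows that even a read through operator[] can grow the map.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Lookup through a constant reference. operator[] would not compile here,
// so find is used; fallback is returned when the key is absent.
double salaryOr( const std::map<std::string, double> & salaries,
                 const std::string & name, double fallback )
{
    std::map<std::string, double>::const_iterator itr = salaries.find( name );
    return itr == salaries.end( ) ? fallback : itr->second;
}

// Works on a copy of the map and reports its size after reading salaries[ name ].
std::size_t sizeAfterRead( std::map<std::string, double> salaries,
                           const std::string & name )
{
    double ignored = salaries[ name ];   // inserts name with value 0.0 if absent
    (void) ignored;
    return salaries.size( );
}
```

For a map holding only "Pat", sizeAfterRead with "Jan" returns 2 while salaryOr with "Jan" leaves the map at size 1 and returns the fallback.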

The code snippet below illustrates two techniques for accessing items in a map. First, observe line 3: operator[] is called on the left-hand side, which inserts "Pat" and a double with value 0 into the map and returns a reference to that double. The assignment then changes that double inside the map to 75000. Line 4 outputs 75000. Unfortunately, line 5 inserts "Jan" with a salary of 0.0 into the map and then prints it. This may or may not be appropriate, depending on the application. If it is important to distinguish between items that are in the map and items that are not, or if it is important not to insert into the map (because it is immutable), then the alternative approach shown in lines 7-12 can be used. There we see a call to find. If the key is not found, the returned iterator is the end marker, which can be tested. If the key is found, we can access the second item of the pair referenced by the iterator, which is the value associated with the key. If itr is an iterator rather than a const_iterator, we could also assign to itr->second.

map<string,double> salaries;

salaries[ "Pat" ] = 75000.00;
cout << salaries[ "Pat" ] << endl;
cout << salaries[ "Jan" ] << endl;

map<string,double>::const_iterator itr;
itr = salaries.find( "Chris" );
if( itr == salaries.end( ) )
    cout << "Not an employee of this company!" << endl;
else
    cout << itr->second << endl;

 

Origin blog.csdn.net/weixin_38452841/article/details/109093176