Analysis of sort() in STL

1. Introduction
2. Preparations
3. Explore sort() in STL
4. Show Me the Code
5. What I Learned

1. Introduction

I read section 6.7.9 of the book 《STL源码解析》, which covers the implementation of the sort() function in C++, and also read through the article listed as reference (2) below, which helped me a lot in understanding the section.

I conclude that the implementation of the sort() function in STL teaches us two things:

Algorithms that can each finish a task independently can be MIXED TOGETHER to improve the performance of an implementation.
Many classic algorithms we learned from the textbook can be tuned to be faster and more efficient.

references:
(1)《STL源码解析》
(2) 知无涯之std::sort源码剖析

2. Preparations

Before focusing on the implementation, make sure you understand the following topics. You can skip this section if you already know them.

2.1. Inline functions

An inline function is a function whose calls the compiler may replace with the body (the code) of the function, instead of emitting an actual function call. The inline keyword is only a request: the compiler decides whether the substitution really happens. Concretely,

#include <iostream>
using namespace std;

inline int call()
{
  static int ctr = 1;
  return ctr++;
}

int main()
{
  for (int i = 0; i < 500; i++)
  {
    cout << call() << " " << endl;
  }
}

In the code above, the call() in

cout << call() << " " << endl;

may be implicitly replaced by the body of the call() function.

reference: C++ inline关键字

2.2. Templates

Templates let a function or a class work with more than one type of value for each parameter. Consider the following code, which returns the greater of two numerical values:

def max(a, b):
  if a > b:
    return a
  return b

The task can be done easily in Python. But in C++, that's not the case: without templates you would have to define many overloaded functions to accept different types of values. Consider another situation: when you want to write your own stack, you would have to define a separate class for each data type, which is complex and inconvenient. With templates, we can finish both tasks without writing the same code over and over. Let's look at an example

#include <iostream>
#include <string>
using namespace std;

template <typename T>
inline T const& Max (T const& a, T const& b)
{
    return a < b ? b:a;
}
int main ()
{

    int i = 39;
    int j = 20;
    cout << "Max(i, j): " << Max(i, j) << endl;

    double f1 = 13.5;
    double f2 = 20.7;
    cout << "Max(f1, f2): " << Max(f1, f2) << endl;

    string s1 = "Hello";
    string s2 = "World";
    cout << "Max(s1, s2): " << Max(s1, s2) << endl;

   return 0;
}

Notice that the single function Max() accepts values of more than one data type. When the code above is executed, the output is

Max(i, j): 39
Max(f1, f2): 20.7
Max(s1, s2): World

Note that the statement

template <typename T>

means that T can be any data type. If the arguments are of type int, T is deduced as int, and if the arguments are of type string, T is deduced as string.

template <class T>

works in exactly the same way: in a template parameter list, class and typename are interchangeable.

references:
(1) C++ 模板
(2) <转载> 模板声明中template <typename T>和template <class T>
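
The stack mentioned in this section can be sketched as a class template in the same spirit as Max(). This is only a minimal illustration and not STL's own std::stack; the name SimpleStack and its members are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// SimpleStack is a hypothetical name used only for this illustration.
// One class template serves every element type T.
template <typename T>
class SimpleStack {
public:
    void push(const T& value) { items_.push_back(value); }

    // Precondition: the stack is not empty.
    void pop() {
        assert(!items_.empty());
        items_.pop_back();
    }

    // Precondition: the stack is not empty.
    const T& top() const {
        assert(!items_.empty());
        return items_.back();
    }

    bool empty() const { return items_.empty(); }
    std::size_t size() const { return items_.size(); }

private:
    std::vector<T> items_;  // the newest element is at the back
};
```

With this, SimpleStack<int> and SimpleStack<std::string> are two different classes generated from the same source code.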

2.3. RandomAccessIterator

Iterators in STL are objects that work like pointers. They overload the * and -> operators. An object of a RandomAccessIterator type is like a pointer into an array. If p, p1, p2 are RandomAccessIterator objects and n is an int value, operations like p[n], p + n, p += n, p1 - p2, p1 <= p2 and so on are valid.

reference: STL源码学习系列四：迭代器(Iterator)
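
The operations listed in this section can be tried on the iterators of std::vector, which are random access iterators. The function name random_access_ops_demo is made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true if the random access operations behave like pointer
// arithmetic on a vector with at least 4 elements.
inline bool random_access_ops_demo(std::vector<int>& v) {
    assert(v.size() >= 4);
    std::vector<int>::iterator p = v.begin();
    std::vector<int>::iterator p1 = p + 1;  // p + n
    std::vector<int>::iterator p2 = v.end();

    bool ok = true;
    ok = ok && (p[2] == v[2]);      // p[n]
    ok = ok && (*(p + 3) == v[3]);  // p + n, then dereference
    p += 2;                         // p += n
    ok = ok && (*p == v[2]);
    // p2 - p1 is the number of elements between the two iterators
    ok = ok && (p2 - p1 == static_cast<std::ptrdiff_t>(v.size()) - 1);
    ok = ok && (p1 <= p2);          // ordering comparison
    return ok;
}
```

The sort() code below relies on exactly these operations, for example last - first and first + (last - first)/2.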

3. Explore sort() in STL

The sort() function in STL uses

Quick sort
Insertion sort
Heap sort

together to improve the performance of sorting. The entire code is listed in Section 4.

3.1. Introduction to sort()

Quick sort is one of the fastest general-purpose sorting algorithms in practice, with an average time complexity of \(O(N \log N)\). But with an unlucky choice of pivots (for example, always taking the first element of a list that is already almost in order), its time complexity degrades to \(O(N^2)\). And for short sublists, the overhead of recursive function calls makes quick sort less efficient. Fortunately, insertion sort has a time complexity close to \(O(N)\) when the elements being sorted are almost in order, and heap sort, which does not need recursion, has a worst-case time complexity of \(O(N \log N)\).
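
To make the claim about insertion sort concrete, here is a small sketch that counts how many element shifts a plain textbook insertion sort performs. The function name insertion_sort_shifts is made up for this example, and the code is not taken from STL.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sorts a copy of v with a textbook insertion sort and returns
// the number of element shifts that were needed.
inline long insertion_sort_shifts(std::vector<int> v) {
    long shifts = 0;
    for (std::size_t i = 1; i < v.size(); ++i) {
        int value = v[i];
        std::size_t j = i;
        // move larger elements one position to the right
        while (j > 0 && value < v[j - 1]) {
            v[j] = v[j - 1];
            --j;
            ++shifts;
        }
        v[j] = value;
    }
    assert(std::is_sorted(v.begin(), v.end()));
    return shifts;
}
```

For 100 elements, a list that is already in order needs 0 shifts and a list with a single adjacent pair swapped needs 1, while a reversed list needs 100 * 99 / 2 = 4950.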

Here's the implementation of the sort() function.

template <class RandomAccessIterator>
inline void sort(RandomAccessIterator first, RandomAccessIterator last) {
    // when the length of the list to be sorted is greater than 0
    if (first != last) {  
        // combination of the quick sort and the heap sort
        __introsort_loop(first, last, value_type(first), __lg(last - first) * 2);
        // the insertion sort
        __final_insertion_sort(first, last);
    }
}

If the list to be sorted is not empty, __introsort_loop() is called first. It is a combination of quick sort and heap sort, both of which are fast, so it shortens the sorting time when the amount of data is large. It stops partitioning a sublist once that sublist has 16 or fewer elements, so when it returns the whole list is almost in order: every element already sits within a short block around its final position. Then __final_insertion_sort() is called. It is an insertion sort, an algorithm that performs well when the list is almost in order.

Note that the __lg() function is used to calculate the maximum recursion depth: the limit passed to __introsort_loop() is __lg(last - first) * 2. When the recursion goes deeper than this limit, the pivots have been poor and quick sort is at risk of degrading towards \(O(N^2)\), so heap sort (partial_sort()) is used for that sublist instead, see below. The implementation of the __lg() function is

template <class Size>
inline Size __lg(Size n) {
  Size k;
  for (k = 0; n > 1; n >>= 1) ++k;
  return k;
}

It simply finds the maximum integer k such that \(2^k \leq n\), that is, \(k = \lfloor \log_2 n \rfloor\). In the sort() function, n is the length of the list to be sorted.
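
A few concrete values may help here. The sketch below copies the loop of __lg() under the made-up name floor_log2 (identifiers starting with two underscores are reserved for the library itself), and depth_limit_for is a hypothetical helper that mirrors the expression __lg(last - first) * 2 in sort().

```cpp
#include <cassert>

// floor_log2 is a hypothetical stand-in for __lg().
// Precondition: n > 0.
template <class Size>
inline Size floor_log2(Size n) {
    assert(n > 0);
    Size k;
    for (k = 0; n > 1; n >>= 1) ++k;  // count how often n can be halved
    return k;
}

// The depth limit that sort() would pass for a list of n elements.
template <class Size>
inline Size depth_limit_for(Size n) {
    return floor_log2(n) * 2;
}
```

For example, floor_log2(16) and floor_log2(17) are both 4, and a list of 1000 elements gets a depth limit of 2 * 9 = 18.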

3.2. Quick Sort

Here's the main body of the quick sort

template <class RandomAccessIterator, class T, class Size>
void __introsort_loop(RandomAccessIterator first,
                      RandomAccessIterator last, T*,
                      Size depth_limit) {
    while (last - first > __stl_threshold) {  // __stl_threshold: const int 16
        // when the condition is true,
        // the recursion depth is deep
        if (depth_limit == 0) {
            // use the heap sort
            partial_sort(first, last, last);
            return;
        }
        --depth_limit;

        // the quick sort
        // find the pivot
        RandomAccessIterator cut = __unguarded_partition
          (first, last, T(__median(*first, *(first + (last - first)/2),
                                   *(last - 1))));
        // recursively sort the part on the right of the pivot
        __introsort_loop(cut, last, value_type(first), depth_limit);

        // in the next loop the left part will be sorted
        last = cut;
    }
}

When the recursion depth is not too deep, quick sort is used so that the sorting is fast. The algorithm is quite similar to the one in our textbook:

find a pivot
recursively sort the list on the right of the pivot
sort the list on the left of the pivot

3.2.1. One Recursion Instead of Two

Notice the third step, which is different from the usual solution. The algorithm here does not make a second recursive call to sort the list on the left of the pivot, but handles it in the next iteration of the while loop. This improvement over the textbook quick sort saves one function call per partition. Notice the statement last = cut on the last line of the loop body. After it, the last argument of __unguarded_partition() in the next iteration will be cut, the position where the list was split.
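
The idea of replacing the second recursive call with a loop can be sketched on its own. In the sketch below the names quick_sort_loop and sort_with_one_recursion are made up, and the split uses two calls to std::partition for simplicity, which is an assumption of this example and not what __introsort_loop() does.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

typedef std::vector<int>::iterator Iter;

inline void quick_sort_loop(Iter first, Iter last) {
    while (last - first > 1) {
        const int pivot = *(first + (last - first) / 2);
        // [first, lower) < pivot, [lower, upper) == pivot, [upper, last) > pivot
        Iter lower = std::partition(first, last,
                                    [pivot](int x) { return x < pivot; });
        Iter upper = std::partition(lower, last,
                                    [pivot](int x) { return !(pivot < x); });
        // recurse only on the right part ...
        quick_sort_loop(upper, last);
        // ... and let the next loop iteration handle the left part
        last = lower;
    }
}

inline void sort_with_one_recursion(std::vector<int>& v) {
    quick_sort_loop(v.begin(), v.end());
    assert(std::is_sorted(v.begin(), v.end()));
}
```

Each pass through the loop makes one recursive call instead of two, just like the last = cut statement above.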

3.2.2. Median-of-Three

The function __median() selects the median of the first, the middle and the last element to be the pivot. This removes the risk that a list which is already (almost) in order makes the time complexity of quick sort degrade to about \(O(N^2)\), as it would if the first or the last element were always chosen. It's another improvement of the algorithm. And here is its implementation

template <class T>
inline const T& __median(const T& a, const T& b, const T& c) {
  if (a < b)
    if (b < c)  // a < b < c
      return b;
    else if (a < c)  // a < b, b >= c, a < c
      return c;
    else
      return a;
  else if (a < c)  // c > a >= b
    return a;
  else if (b < c)  // a >= b, a >= c, b < c
    return c;
  else
    return b;
}
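
To check the branches, here is a copy of the function under the made-up name median_of_three, together with a hypothetical helper pivot_for that picks the same three elements as the call in __introsort_loop(). Both names are assumptions of this sketch.

```cpp
#include <cassert>
#include <vector>

// median_of_three is a hypothetical copy of __median().
template <class T>
inline const T& median_of_three(const T& a, const T& b, const T& c) {
    if (a < b) {
        if (b < c) return b;       // a < b < c
        else if (a < c) return c;  // a < c <= b
        else return a;             // c <= a < b
    }
    else if (a < c) return a;      // b <= a < c
    else if (b < c) return c;      // b < c <= a
    else return b;                 // c <= b <= a
}

// Median of the first, the middle and the last element of a non-empty vector.
inline int pivot_for(const std::vector<int>& v) {
    assert(!v.empty());
    return median_of_three(v.front(), v[v.size() / 2], v.back());
}
```

Whatever the order of the three arguments is, the middle value comes back, so for a list in order or in reversed order the pivot is the middle element.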

3.2.3. A Faster Partition

__unguarded_partition() splits the list around the pivot value and returns the position of the split. It is similar to our familiar partition but a little different, and we will see one more improvement here. Let's see the code

template <class RandomAccessIterator, class T>
RandomAccessIterator __unguarded_partition(RandomAccessIterator first,
                                           RandomAccessIterator last,
                                           T pivot) {
    while (true) {
        // the first pointer moves right until the current element
        // it points at is not smaller than the pivot
        while (*first < pivot) ++first;

        // shift left by one
        --last;

        // the last pointer moves left until the current element
        // it points at is not greater than the pivot
        while (pivot < *last) --last;

        // if the two pointers have met or crossed,
        // return the first pointer
        if (!(first < last)) return first;

        // else swap the two elements the two pointer are pointing at
        iter_swap(first, last);

        // shift right by one
        ++first;
    }
}

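Note that the inner loops while (*first < pivot) and while (pivot < *last) do not check whether first or last has run past the ends of the list; only the relative order first < last is tested. This is a benefit brought by the median-of-three strategy (the __median() introduced above): inside the list there is always an element not smaller than the pivot and an element not greater than it, and these stop the scans. Thus, the time for boundary checking is saved on every step, which adds up when the amount of data is huge.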

Let's find out the reason why the boundary checking is not needed here. First consider what can go wrong when the pivot is not guaranteed to lie between elements of the list. Take a simple case, a list in reversed order

3 2 1

and suppose the pivot value is 0, which is smaller than every element. Let's run the __unguarded_partition() function on it. Remember that last starts one position past the final element.

// F: first
// L: last
// P: the pivot value

// original state, P = 0
3 2 1
F     L

// while (*first < pivot) ++first;
// 3 < 0 is false, so first does not move

// --last
3 2 1
F   L

// while (pivot < *last) --last;
// 0 < 1, 0 < 2 and 0 < 3 are all true, so last keeps moving left
? 3 2 1
L F     // notice that the last pointer here is beyond the boundary!

We can see that after while (pivot < *last) --last; is executed, last is beyond the boundary, because no element stopped the scan. Thus boundary checking is needed unless the pivot is known to be bounded by elements of the list.

Let's see what happens when we use the median-of-three strategy. In the sequence 3 2 1, the first, the middle and the last elements are 3, 2 and 1, so the median 2 is selected to be the pivot. The element 3 is not smaller than the pivot and the element 1 is not greater than it, so each scan is guaranteed to stop inside the list.

// the original state, P = 2
3 2 1
F     L

// while (*first < pivot) ++first;
// 3 < 2 is false, so first does not move

// --last;
3 2 1
F   L

// while (pivot < *last) --last;
// 2 < 1 is false, so last does not move

// if (!(first < last)) return first;
// first < last is true, so nothing is returned yet

// iter_swap(first, last);
1 2 3
F   L

// ++first
1 2 3
  F L
// notice that the sequence is in order now!

// the second loop

// while (*first < pivot) ++first;
// 2 < 2 is false, so first does not move

// --last;
1 2 3
  F
  L     // notice that L points at the same element as F now!

// while (pivot < *last) --last;
// 2 < 2 is false, so last does not move

// if (!(first < last)) return first;
// first < last is false, so first (pointing at 2 now) is returned

From the trace above, we can see that the movement of the two pointers stops before they go beyond the boundary.
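
The partition can also be run directly. The sketch below copies the loop under the made-up name unguarded_partition_sketch and fixes the element type to int to stay short; it assumes, as discussed in this section, that the pivot value is taken from the list itself.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

typedef std::vector<int>::iterator Iter;

// Precondition: [first, last) is not empty and pivot is the value of one
// of its elements, so both scans are stopped by an element of the list.
inline Iter unguarded_partition_sketch(Iter first, Iter last, int pivot) {
    assert(first != last);
    while (true) {
        while (*first < pivot) ++first;  // stops at an element >= pivot
        --last;
        while (pivot < *last) --last;    // stops at an element <= pivot
        if (!(first < last)) return first;
        std::iter_swap(first, last);
        ++first;
    }
}
```

For the list 3 2 1 with pivot 2, the function rearranges the list to 1 2 3 and returns an iterator to the 2. In general, every element before the returned position is not greater than the pivot and every element from it onwards is not smaller.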

3.3. Heap Sort

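When the condition if (depth_limit == 0) is true, heap sort is used instead. __introsort_loop() calls partial_sort(first, last, last), which sorts the whole sublist with a heap. The listing below shows the overload that takes a comparison function comp; the overload without comp works in the same way with operator<.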

template <class RandomAccessIterator, class T, class Compare>
void __partial_sort(RandomAccessIterator first, RandomAccessIterator middle,
                    RandomAccessIterator last, T*, Compare comp) {
    make_heap(first, middle, comp);
    for (RandomAccessIterator i = middle; i < last; ++i)
        if (comp(*i, *first))
            __pop_heap(first, middle, i, T(*i), comp, distance_type(first));
    sort_heap(first, middle, comp);
}

template <class RandomAccessIterator, class Compare>
inline void partial_sort(RandomAccessIterator first,
                         RandomAccessIterator middle,
                         RandomAccessIterator last, Compare comp) {
    __partial_sort(first, middle, last, value_type(first), comp);
}

The reason why the sort() function in STL does not use heap sort directly is that although the time complexity of heap sort is also \(O(N \log N)\), the sorting time it actually costs is reported to be 2~5 times longer than that of quick sort.
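
Heap sort can be tried with the public heap functions of the standard library. The two function names below are made up for this sketch; the calls std::make_heap, std::sort_heap and std::partial_sort are the standard ones from <algorithm>.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Heap sort built from the two standard heap operations.
inline void heap_sort_sketch(std::vector<int>& v) {
    std::make_heap(v.begin(), v.end());  // arrange the elements as a max-heap
    std::sort_heap(v.begin(), v.end());  // turn the heap into a sorted range
    assert(std::is_sorted(v.begin(), v.end()));
}

// The same call shape as partial_sort(first, last, last) above:
// with middle == last the whole range ends up sorted.
inline void heap_sort_via_partial_sort(std::vector<int>& v) {
    std::partial_sort(v.begin(), v.end(), v.end());
    assert(std::is_sorted(v.begin(), v.end()));
}
```

Both functions leave the vector in ascending order.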

3.4. Insertion Sort

When the condition while (last - first > __stl_threshold) is false, where __stl_threshold is const int 16 here, the sublist has at most 16 elements. Partitioning such a short sublist further with quick sort is less efficient than insertion sort, because the overhead of partitioning and of recursive calls dominates. Thus __introsort_loop() leaves the sublist as it is and returns. When all sublists have been handled this way, the whole list is almost in order, and the __final_insertion_sort() function, which is actually insertion sort with some improvements, is called to finish the sorting. Here is its implementation.

template <class RandomAccessIterator>
void __final_insertion_sort(RandomAccessIterator first,
                            RandomAccessIterator last) {
    // when there are more than 16 elements in the list
    if (last - first > __stl_threshold) {
        // take the first 16 elements to be sorted using the
        // __insertion_sort() function
        __insertion_sort(first, first + __stl_threshold);

       // use __unguarded_insertion_sort() to sort the remaining elements
        __unguarded_insertion_sort(first + __stl_threshold, last);
    }
    else
        // sort the elements with the __insertion_sort function
        __insertion_sort(first, last);
}

The __final_insertion_sort() function contains an if-else statement. Notice that there are two similar functions: __insertion_sort() and __unguarded_insertion_sort(). Why the code is written like that will be explained later. Let's first look at the implementation of __insertion_sort().

template <class RandomAccessIterator, class T>
void __unguarded_linear_insert(RandomAccessIterator last, T value) {
    RandomAccessIterator next = last;
    --next;

    // find a position for the value to insert to
    // and insert the value into that position
    while (value < *next) {
        *last = *next;
        last = next;
        --next;
    }
    *last = value;
}

template <class RandomAccessIterator, class T>
inline void __linear_insert(RandomAccessIterator first,
                            RandomAccessIterator last, T*) {
    T value = *last;

    // if the value is smaller than the first element in the sorted sublist
    if (value < *first) {
        // all elements in the sorted sublist are shifted right by one
        copy_backward(first, last, last + 1);

        // the (smallest) value is then in the first position
        *first = value;
    }
    else
        // implement the insertion sort without the need to check boundary
        __unguarded_linear_insert(last, value);
}

template <class RandomAccessIterator>
void __insertion_sort(RandomAccessIterator first, RandomAccessIterator last) {
    // if the list is empty, do nothing
    if (first == last) return;

    // implement the insertion sort
    for (RandomAccessIterator i = first + 1; i != last; ++i)
        __linear_insert(first, i, value_type(first));
}

If the list to be sorted is not empty, the list is sorted using the insertion sort algorithm. But there is an improvement here.

3.4.1. Is the Value the Smallest?

Notice the __linear_insert() function. The value to be inserted is first compared to the smallest element of the sorted sublist, that is, the first element. If the value is smaller than the first element of the sorted sublist, the value is inserted into the first position, with all elements of the original sorted sublist shifted right by one. Concretely, let's assume we have a sorted sublist

1 6 8

And the value to be inserted is 0. 0 is smaller than the first (and the smallest) element in the sorted sublist. Thus all the elements of the sorted sublist are shifted right by one, and the value 0 will be inserted into the first position.

1 6 8
↓
  1 6 8
↓
0 1 6 8

And if the value is not smaller than the first element (the smallest one) in the sorted sublist, the __unguarded_linear_insert() function is called. It is similar to the insertion step of our familiar insertion sort, but without the need to check the boundary. In other words, it does not check whether the search for the insertion position has run past the first position of the sorted sublist. The check is not needed because the condition value < *first is false (which is why the else branch, __unguarded_linear_insert(last, value);, is executed), so the value is not smaller than the first, and smallest, element of the sorted sublist. Thus the position where the value is inserted must be after the first element. Concretely, let's assume we have a sorted sublist

1 6 8

And the value we are to insert now is 2. Since \(2 < 1\) is false, the condition value < *first in __linear_insert() does not hold. Instead the code in the else branch, __unguarded_linear_insert(last, value);, is executed. It scans the sorted sublist from the right to find the position for the value 2.

1 6 8 (insert 2 here? no)

1 6 (insert 2 here? no) 8

1 (insert 2 here? yes) 6 8
↓
1 2 6 8

Since the value 2 will never be smaller than 1, it must be inserted after 1.

The advantage of the improvement is that, since the boundary checking is not necessary, the time spent on it is saved. And when the amount of data is huge, the saved time becomes considerable.
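
The two insertion paths can be put together into a small runnable sketch. The names unguarded_linear_insert_sketch and insertion_sort_sketch are made up, and the element type is fixed to int; otherwise the logic follows the listing above.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

typedef std::vector<int>::iterator Iter;

// Precondition: some element to the left of last is not greater than value,
// so the loop stops without any boundary check.
inline void unguarded_linear_insert_sketch(Iter last, int value) {
    Iter next = last;
    --next;
    while (value < *next) {
        *last = *next;  // shift the larger element right by one
        last = next;
        --next;
    }
    *last = value;
}

inline void insertion_sort_sketch(Iter first, Iter last) {
    if (first == last) return;
    for (Iter i = first + 1; i != last; ++i) {
        int value = *i;
        if (value < *first) {
            // new minimum: shift the whole sorted sublist right by one
            std::copy_backward(first, i, i + 1);
            *first = value;
        } else {
            // *first <= value, so *first stops the unguarded scan
            unguarded_linear_insert_sketch(i, value);
        }
    }
    assert(std::is_sorted(first, last));
}
```

Sorting 1 6 8 0 takes the first branch for the value 0, and sorting 1 6 8 2 takes the else branch for the value 2, matching the two examples above.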

Then let's focus on the __unguarded_insertion_sort() function. Here's its implementation.

template <class RandomAccessIterator, class T>
void __unguarded_insertion_sort_aux(RandomAccessIterator first,
                                    RandomAccessIterator last, T*) {
    for (RandomAccessIterator i = first; i != last; ++i)
        __unguarded_linear_insert(i, T(*i));
}

template <class RandomAccessIterator>
inline void __unguarded_insertion_sort(RandomAccessIterator first,
                                RandomAccessIterator last) {
    __unguarded_insertion_sort_aux(first, last, value_type(first));
}

What it actually does is call the __unguarded_linear_insert() function, which has been discussed above, for every element in the range.

3.4.2. Why an If-Else Statement?

Now let's see why the __final_insertion_sort() function is written that way. We know that the __unguarded_linear_insert() function can only be called when the smallest element is already in the sorted sublist.

If the amount of data is small, the normal insertion sort can be used to sort the list directly, since the time spent on boundary checking is then so small that it can be ignored. Thus the insertion sort __insertion_sort() is used directly.

But when the amount of data is huge, the time spent on boundary checking cannot be ignored, thus the __unguarded_insertion_sort() function is needed to save the time for boundary checking.

But how can we guarantee that the smallest element is in the sorted sublist? After __introsort_loop(), every element of the leftmost sublist is not greater than any element to its right, and that sublist has at most 16 elements, so the smallest element of the whole list always lies within the first 16 elements. We can therefore sort that area using __insertion_sort(), and the remaining part, which does not contain the smallest element, using the faster __unguarded_insertion_sort().
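
The claim about the first 16 elements can be checked with a simplified sketch. All names below (kThreshold, coarse_quick_sort, unguarded_insert, final_insertion_sort_sketch) are made up, the partition uses std::partition instead of __unguarded_partition(), and the heap sort fallback is left out, so this is an illustration of the idea under those assumptions and not the STL code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

typedef std::vector<int>::iterator Iter;
const int kThreshold = 16;  // plays the role of __stl_threshold

// Quick sort that leaves sublists of at most kThreshold elements unsorted.
inline void coarse_quick_sort(Iter first, Iter last) {
    while (last - first > kThreshold) {
        const int pivot = *(first + (last - first) / 2);
        Iter lower = std::partition(first, last,
                                    [pivot](int x) { return x < pivot; });
        Iter upper = std::partition(lower, last,
                                    [pivot](int x) { return !(pivot < x); });
        coarse_quick_sort(upper, last);
        last = lower;
    }
}

// Insertion step without a boundary check.
inline void unguarded_insert(Iter last, int value) {
    Iter next = last;
    --next;
    while (value < *next) { *last = *next; last = next; --next; }
    *last = value;
}

// Precondition: the range was prepared by coarse_quick_sort().
inline void final_insertion_sort_sketch(Iter first, Iter last) {
    Iter split = (last - first > kThreshold) ? first + kThreshold : last;
    // guarded insertion sort on the first block (checks j != first)
    for (Iter i = first; i != split; ++i)
        for (Iter j = i; j != first && *j < *(j - 1); --j)
            std::iter_swap(j, j - 1);
    // the minimum now sits at *first, so the rest needs no boundary check
    for (Iter i = split; i != last; ++i)
        unguarded_insert(i, *i);
    assert(std::is_sorted(first, last));
}
```

After coarse_quick_sort() the position of the smallest element, as returned by std::min_element, is always less than kThreshold, which is what makes the unguarded second loop safe.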

4. Show Me the Code

/* the heap sort */
template <class RandomAccessIterator, class T, class Compare>
void __partial_sort(RandomAccessIterator first, RandomAccessIterator middle,
                    RandomAccessIterator last, T*, Compare comp) {
    make_heap(first, middle, comp);
    for (RandomAccessIterator i = middle; i < last; ++i)
        if (comp(*i, *first))
            __pop_heap(first, middle, i, T(*i), comp, distance_type(first));
    sort_heap(first, middle, comp);
}

template <class RandomAccessIterator, class Compare>
inline void partial_sort(RandomAccessIterator first,
                         RandomAccessIterator middle,
                         RandomAccessIterator last, Compare comp) {
    __partial_sort(first, middle, last, value_type(first), comp);
}

/* the quick sort */
template <class RandomAccessIterator, class T>
RandomAccessIterator __unguarded_partition(RandomAccessIterator first,
                                           RandomAccessIterator last,
                                           T pivot) {
    while (true) {
        // the first pointer moves right until the current element
        // it points at is not smaller than the pivot
        while (*first < pivot) ++first;

        // shift left by one
        --last;

        // the last pointer moves left until the current element
        // it points at is not greater than the pivot
        while (pivot < *last) --last;

        // if the two pointers have met or crossed,
        // return the first pointer
        if (!(first < last)) return first;

        // else swap the two elements the two pointer are pointing at
        iter_swap(first, last);

        // shift right by one
        ++first;
    }
}

/* the quick sort + the heap sort in the sort() function */
// find the pivot
template <class T>
inline const T& __median(const T& a, const T& b, const T& c) {
  if (a < b)
    if (b < c)  // a < b < c
      return b;
    else if (a < c)  // a < b, b >= c, a < c
      return c;
    else
      return a;
  else if (a < c)  // c > a >= b
    return a;
  else if (b < c)  // a >= b, a >= c, b < c
    return c;
  else
    return b;
}

// the combination of the quick sort and the heap sort
template <class RandomAccessIterator, class T, class Size>
void __introsort_loop(RandomAccessIterator first,
                      RandomAccessIterator last, T*,
                      Size depth_limit) {
    while (last - first > __stl_threshold) {  // __stl_threshold: const int 16
        // when the condition is true,
        // the recursion depth is deep
        if (depth_limit == 0) {
            // use the heap sort
            partial_sort(first, last, last);
            return;
        }
        --depth_limit;

        // the quick sort
        // find the pivot
        RandomAccessIterator cut = __unguarded_partition
          (first, last, T(__median(*first, *(first + (last - first)/2),
                                   *(last - 1))));
        // recursively sort the part on the right of the pivot
        __introsort_loop(cut, last, value_type(first), depth_limit);

        // in the next loop the left part will be sorted
        last = cut;
    }
}

/* the insertion sort */
/* __insertion_sort */
// insert elements without the need to check the boundary
template <class RandomAccessIterator, class T>
void __unguarded_linear_insert(RandomAccessIterator last, T value) {
    RandomAccessIterator next = last;
    --next;

    // find a position for the value to insert to
    // and insert the value into that position
    while (value < *next) {
        *last = *next;
        last = next;
        --next;
    }
    *last = value;
}

// implement different code depending on
// whether the value is smaller than the minimum
// in the sorted sublist
template <class RandomAccessIterator, class T>
inline void __linear_insert(RandomAccessIterator first,
                            RandomAccessIterator last, T*) {
    T value = *last;

    // if the value is smaller than the first element in the sorted sublist
    if (value < *first) {
        // all elements in the sorted sublist are shifted right by one
        copy_backward(first, last, last + 1);

        // the (smallest) value is then in the first position
        *first = value;
    }
    else
        // implement the insertion sort without the need to check boundary
        __unguarded_linear_insert(last, value);
}

// the main body of the __insertion_sort
template <class RandomAccessIterator>
void __insertion_sort(RandomAccessIterator first, RandomAccessIterator last) {
    // if the list is empty, do nothing
    if (first == last) return;

    // implement the insertion sort
    for (RandomAccessIterator i = first + 1; i != last; ++i)
        __linear_insert(first, i, value_type(first));
}

/* __unguarded_insertion_sort */
// sort the whole list
template <class RandomAccessIterator, class T>
void __unguarded_insertion_sort_aux(RandomAccessIterator first,
                                    RandomAccessIterator last, T*) {
    for (RandomAccessIterator i = first; i != last; ++i)
        __unguarded_linear_insert(i, T(*i));
}

// the main body of the __unguarded_insertion_sort
template <class RandomAccessIterator>
inline void __unguarded_insertion_sort(RandomAccessIterator first,
                                RandomAccessIterator last) {
    __unguarded_insertion_sort_aux(first, last, value_type(first));
}

/* the insertion sort step in the sort() function */
template <class RandomAccessIterator>
void __final_insertion_sort(RandomAccessIterator first,
                            RandomAccessIterator last) {
    // when there are more than 16 elements in the list
    if (last - first > __stl_threshold) {
        // take the first 16 elements to be sorted using the
        // __insertion_sort() function
        __insertion_sort(first, first + __stl_threshold);

       // use __unguarded_insertion_sort() to sort the remaining elements
        __unguarded_insertion_sort(first + __stl_threshold, last);
    }
    else
        // sort the elements with the __insertion_sort function
        __insertion_sort(first, last);
}

/* the main body of the sort() function */
// calculate the maximum recursion depth
template <class Size>
inline Size __lg(Size n) {
  Size k;
  for (k = 0; n > 1; n >>= 1) ++k;
  return k;
}

// the sort() function
template <class RandomAccessIterator>
inline void sort(RandomAccessIterator first, RandomAccessIterator last) {
    // when the length of the list to be sorted is greater than 0
    if (first != last) {  
        // combination of the quick sort and the heap sort
        __introsort_loop(first, last, value_type(first), __lg(last - first) * 2);
        // the insertion sort
        __final_insertion_sort(first, last);
    }
}

5. What I Learned

In conclusion, we can see STL uses many methods to improve the performance of the sorting.

It uses quick sort, heap sort and insertion sort together to make the sorting more efficient. It's rather like ensemble learning in machine learning, which mixes many different models to get better performance. While the sublists are long, quick sort is used to sort the list quickly. When the recursion depth gets too deep, heap sort is used instead, which has the same \(O(N \log N)\) time complexity as the average case of quick sort (though it is actually slower on the same data), but keeps it in the worst case and does not need recursion. When the sublists are short and the whole list is almost in order, insertion sort is used.

It also uses many clever techniques to improve the algorithms. The three algorithms used in the sort() function are quite similar to the familiar ones we learned from the textbook. But STL makes many small changes to them, such as one recursion instead of two, the median-of-three pivot and the unguarded loops, to make them faster, especially for a large amount of data.