C# key technology (1) - List underlying source code analysis

1. ## Common component underlying code analysis

List underlying code analysis

List is the most common scalable array component in C#. We often use it to replace arrays, because it is scalable, so we don't need to manually allocate the size of the array when writing. Even sometimes we use it as a linked list. So how is its bottom layer written, how does it execute and operate internally every time it increases and decreases and assigns a value? We will explain in detail next.

Let's first take a look at the construction part of List. The source code is as follows:


public class List<T> : IList<T>, System.Collections.IList, IReadOnlyList<T>
{
    private const int _defaultCapacity = 4;

    private T[] _items;
    private int _size;
    private int _version;
    private Object _syncRoot;
    
    static readonly T[]  _emptyArray = new T[0];        
        
    // Constructs a List. The list is initially empty and has a capacity
    // of zero. Upon adding the first element to the list the capacity is
    // increased to 16, and then increased in multiples of two as required.
    public List() {
        _items = _emptyArray;
    }

    // Constructs a List with a given initial capacity. The list is
    // initially empty, but will have room for the given number of elements
    // before any reallocations are required.
    // 
    public List(int capacity) {
        if (capacity < 0) ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.capacity, ExceptionResource.ArgumentOutOfRange_NeedNonNegNum);
        Contract.EndContractBlock();

        if (capacity == 0)
            _items = _emptyArray;
        else
            _items = new T[capacity];
    }

    //...
    //其他内容
}

It can be known from the source code that List inherits from IList, IReadOnlyList, IList provides the main interface, and IReadOnlyList provides the iteration interface.

IList source code

IReadOnlyList source code

Looking at the construction part, we have made it clear that the List is internally implemented with an array, not a linked list, and when no specified capacity is given, the initial capacity is 0.

In other words, we can speculate with a high probability that the List component works in the way of "copying and generating a new array from the original array" when the Add and Remove functions are called.

Let's see if our guess is correct.

Add interface source code:


// Adds the given object to the end of this list. The size of the list is
// increased by one. If required, the capacity of the list is doubled
// before adding the new element.
//
public void Add(T item) {
    if (_size == _items.Length) EnsureCapacity(_size + 1);
    _items[_size++] = item;
    _version++;
}

// Ensures that the capacity of this list is at least the given minimum
// value. If the currect capacity of the list is less than min, the
// capacity is increased to twice the current capacity or to min,
// whichever is larger.
private void EnsureCapacity(int min) {
    if (_items.Length < min) {
        int newCapacity = _items.Length == 0? _defaultCapacity : _items.Length * 2;
        // Allow the list to grow to maximum possible capacity (~2G elements) before encountering overflow.
        // Note that this check works even when _items.Length overflowed thanks to the (uint) cast
        if ((uint)newCapacity > Array.MaxArrayLength) newCapacity = Array.MaxArrayLength;
        if (newCapacity < min) newCapacity = min;
        Capacity = newCapacity;
    }
}

For the Add function in the above List source code, every time an element is added, the Add interface will first check whether the capacity is enough, and if not, use EnsureCapacity to increase the capacity.

In EnsureCapacity, there is such a line of code:

    int newCapacity = _items.Length == 0? _defaultCapacity : _items.Length * 2;

Every time the capacity is not enough, the capacity of the entire array will be doubled, and _defaultCapacity is the default value of 4. Therefore, the entire extended route is 4, 8, 16, 32, 64, 128, 256, 512, 1024... and so on.

List uses the array form as the underlying data structure. The advantage is that using the index method to extract elements is very fast, but it will be very bad when expanding the capacity. Every new array will cause memory garbage, which brings a lot of burden to the garbage collection GC.

Here, the 2-exponential expansion method can reduce the burden on GC, but if the array is replaced continuously, it will still cause a lot of burden on GC, especially when Add is frequently used by List in the code. In addition, if the number is not appropriate, a lot of memory space will be wasted. For example, when the number of elements is 520, the List will be expanded to 1024 elements. If the remaining 504 space units are not used, most of the memory space will be wasted. . What exactly to do is the best strategy, which we will discuss in a later article.

Let's take a look at the source code of the Remove interface part:


// Removes the element at the given index. The size of the list is
// decreased by one.
// 
public bool Remove(T item) {
    int index = IndexOf(item);
    if (index >= 0) {
        RemoveAt(index);
        return true;
    }

    return false;
}

// Returns the index of the first occurrence of a given value in a range of
// this list. The list is searched forwards from beginning to end.
// The elements of the list are compared to the given value using the
// Object.Equals method.
// 
// This method uses the Array.IndexOf method to perform the
// search.
// 
public int IndexOf(T item) {
    Contract.Ensures(Contract.Result<int>() >= -1);
    Contract.Ensures(Contract.Result<int>() < Count);
    return Array.IndexOf(_items, item, 0, _size);
}

// Removes the element at the given index. The size of the list is
// decreased by one.
// 
public void RemoveAt(int index) {
    if ((uint)index >= (uint)_size) {
        ThrowHelper.ThrowArgumentOutOfRangeException();
    }
    Contract.EndContractBlock();
    _size--;
    if (index < _size) {
        Array.Copy(_items, index + 1, _items, index, _size - index);
    }
    _items[_size] = default(T);
    _version++;
}

The Remove interface includes IndexOf and RemoveAt, where the IndexOf function is used to set the index position of the found element, and RemoveAt can be used to delete the element at the specified position.

From the source code, we can see that the principle of element deletion is actually to use Array.Copy to overwrite the array. IndexOf uses the Array.IndexOf interface to find the index position of the element. The internal implementation of this interface is to compare each position from 0 to n in index order, and the complexity is O(n).

Let’s make a quick summary first, and then let’s look at the source code of the Insert interface.


// Inserts an element into this list at a given index. The size of the list
// is increased by one. If required, the capacity of the list is doubled
// before inserting the new element.
// 
public void Insert(int index, T item) {
    // Note that insertions at the end are legal.
    if ((uint) index > (uint)_size) {
        ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.index, ExceptionResource.ArgumentOutOfRange_ListInsert);
    }
    Contract.EndContractBlock();
    if (_size == _items.Length) EnsureCapacity(_size + 1);
    if (index < _size) {
        Array.Copy(_items, index, _items, index + 1, _size - index);
    }
    _items[index] = item;
    _size++;            
    _version++;
}

Like the Add interface, first check whether the capacity is sufficient, if not, expand the capacity. It is learned from the source code that when Insert inserts an element, it uses the form of copying the array to move the element behind the specified element in the array backward by one position.

Seeing this, we can understand that the Add, Insert, IndexOf, and Remove interfaces of List have not been optimized in any form, and they all use sequential iteration. If they are used too frequently, the efficiency will decrease, and the A lot of memory redundancy is caused, which puts more pressure on garbage collection (GC).

The principles of other related interfaces such as AddRange and RemoveRange are the same as Add and Remove, the only difference is that there are a few more elements, and a single element is operated in the form of a container. First check whether the capacity is appropriate, if not, expand the capacity, or get the index position first when removing, and then overwrite the following elements as a whole. The size of the container itself will not change, but the operation of overwriting is repeated.

Other interfaces are also based on arrays and use similar methods to operate on data. We can quickly see how the source code of other commonly used interfaces is implemented.

For example, the implementation of [],


// Sets or Gets the element at the given index.
// 
public T this[int index] {
    get {
        // Following trick can reduce the range check by one
        if ((uint) index >= (uint)_size) {
            ThrowHelper.ThrowArgumentOutOfRangeException();
        }
        Contract.EndContractBlock();
        return _items[index]; 
    }

    set {
        if ((uint) index >= (uint)_size) {
            ThrowHelper.ThrowArgumentOutOfRangeException();
        }
        Contract.EndContractBlock();
        _items[index] = value;
        _version++;
    }
}

[]的实现，直接使用了数组的索引方式获取元素。

再比如 Clear 清除接口


// Clears the contents of List.
public void Clear() {
    if (_size > 0)
    {
        Array.Clear(_items, 0, _size); // Don't need to doc this but we clear the elements so that the gc can reclaim the references.
        _size = 0;
    }
    _version++;
}

Clear接口在调用时并不会删除数组，而只是将数组中的元素清零，并设置 _size 为 0 而已，用于虚拟地表明当前容量为0。

再比如 Contains 接口，用于确实某元素是否存在于List中


// Contains returns true if the specified element is in the List.
// It does a linear, O(n) search.  Equality is determined by calling
// item.Equals().
//
public bool Contains(T item) {
    if ((Object) item == null) {
        for(int i=0; i<_size; i++)
            if ((Object) _items[i] == null)
                return true;
        return false;
    }
    else {
        EqualityComparer<T> c = EqualityComparer<T>.Default;
        for(int i=0; i<_size; i++) {
            if (c.Equals(_items[i], item)) return true;
        }
        return false;
    }
}

从源代码中我们可以看到，Contains 接口使用的是线性查找方式比较元素，对数组进行迭代，比较每个元素与参数的实例是否一致，如果一致则返回true，全部比较结束还没有找到，则认为查找失败。

再比如 ToArray 转化数组接口


// ToArray returns a new Object array containing the contents of the List.
// This requires copying the List, which is an O(n) operation.
public T[] ToArray() {
    Contract.Ensures(Contract.Result<T[]>() != null);
    Contract.Ensures(Contract.Result<T[]>().Length == Count);

    T[] array = new T[_size];
    Array.Copy(_items, 0, array, 0, _size);
    return array;
}

ToArray接口中，重新new了一个指定大小的数组，再将本身数组上的内容考别到新数组上，再返回出来。

再比如 Find 查找接口


public T Find(Predicate<T> match) {
    if( match == null) {
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.match);
    }
    Contract.EndContractBlock();

    for(int i = 0 ; i < _size; i++) {
        if(match(_items[i])) {
            return _items[i];
        }
    }
    return default(T);
}

Find接口使用的同样是线性查找，对每个元素都进行了比较，复杂度为O(n)。

再比如 Enumerator 枚举迭代部分的细节


// Returns an enumerator for this list with the given
// permission for removal of elements. If modifications made to the list 
// while an enumeration is in progress, the MoveNext and 
// GetObject methods of the enumerator will throw an exception.
//
public Enumerator GetEnumerator() {
    return new Enumerator(this);
}

/// <internalonly/>
IEnumerator<T> IEnumerable<T>.GetEnumerator() {
    return new Enumerator(this);
}

System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() {
    return new Enumerator(this);
}

[Serializable]
public struct Enumerator : IEnumerator<T>, System.Collections.IEnumerator
{
    private List<T> list;
    private int index;
    private int version;
    private T current;

    internal Enumerator(List<T> list) {
        this.list = list;
        index = 0;
        version = list._version;
        current = default(T);
    }

    public void Dispose() {
    }

    public bool MoveNext() {

        List<T> localList = list;

        if (version == localList._version && ((uint)index < (uint)localList._size)) 
        {                                                     
            current = localList._items[index];                    
            index++;
            return true;
        }
        return MoveNextRare();
    }

    private bool MoveNextRare()
    {                
        if (version != list._version) {
            ThrowHelper.ThrowInvalidOperationException(ExceptionResource.InvalidOperation_EnumFailedVersion);
        }

        index = list._size + 1;
        current = default(T);
        return false;                
    }

    public T Current {
        get {
            return current;
        }
    }

    Object System.Collections.IEnumerator.Current {
        get {
            if( index == 0 || index == list._size + 1) {
                 ThrowHelper.ThrowInvalidOperationException(ExceptionResource.InvalidOperation_EnumOpCantHappen);
            }
            return Current;
        }
    }

    void System.Collections.IEnumerator.Reset() {
        if (version != list._version) {
            ThrowHelper.ThrowInvalidOperationException(ExceptionResource.InvalidOperation_EnumFailedVersion);
        }
        
        index = 0;
        current = default(T);
    }

}

其中我们需要注意 Enumerator 这个结构，每次获取迭代器时，Enumerator 每次都是被new出来，如果大量使用迭代器的话，比如foreach就会造成大量的垃圾对象，这也是为什么我们常常告诫程序员们，尽量不要用foreach，因为 List 的 foreach 会增加有新的 Enumerator 实例，最后由GC垃圾回收掉。

最后我们来看看 Sort 排序接口


// Sorts the elements in a section of this list. The sort compares the
// elements to each other using the given IComparer interface. If
// comparer is null, the elements are compared to each other using
// the IComparable interface, which in that case must be implemented by all
// elements of the list.
// 
// This method uses the Array.Sort method to sort the elements.
// 
public void Sort(int index, int count, IComparer<T> comparer) {
    if (index < 0) {
        ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.index, ExceptionResource.ArgumentOutOfRange_NeedNonNegNum);
    }
    
    if (count < 0) {
        ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.count, ExceptionResource.ArgumentOutOfRange_NeedNonNegNum);
    }
        
    if (_size - index < count)
        ThrowHelper.ThrowArgumentException(ExceptionResource.Argument_InvalidOffLen);
    Contract.EndContractBlock();

    Array.Sort<T>(_items, index, count, comparer);
    _version++;
}

它使用了 Array.Sort接口进行排序，其中Array.Sort的源码我们也把它找出来。以下为 Array.Sort 的使用的算法源码：


internal static void DepthLimitedQuickSort(T[] keys, int left, int right, IComparer<T> comparer, int depthLimit)
{
    do
    {
        if (depthLimit == 0)
        {
            Heapsort(keys, left, right, comparer);
            return;
        }

        int i = left;
        int j = right;

        // pre-sort the low, middle (pivot), and high values in place.
        // this improves performance in the face of already sorted data, or 
        // data that is made up of multiple sorted runs appended together.
        int middle = i + ((j - i) >> 1);
        SwapIfGreater(keys, comparer, i, middle);  // swap the low with the mid point
        SwapIfGreater(keys, comparer, i, j);   // swap the low with the high
        SwapIfGreater(keys, comparer, middle, j); // swap the middle with the high

        T x = keys[middle];
        do
        {
            while (comparer.Compare(keys[i], x) < 0) i++;
            while (comparer.Compare(x, keys[j]) < 0) j--;
            Contract.Assert(i >= left && j <= right, "(i>=left && j<=right)  Sort failed - Is your IComparer bogus?");
            if (i > j) break;
            if (i < j)
            {
                T key = keys[i];
                keys[i] = keys[j];
                keys[j] = key;
            }
            i++;
            j--;
        } while (i <= j);

        // The next iteration of the while loop is to "recursively" sort the larger half of the array and the
        // following calls recrusively sort the smaller half.  So we subtrack one from depthLimit here so
        // both sorts see the new value.
        depthLimit--;

        if (j - left <= right - i)
        {
            if (left < j) DepthLimitedQuickSort(keys, left, j, comparer, depthLimit);
            left = i;
        }
        else
        {
            if (i < right) DepthLimitedQuickSort(keys, i, right, comparer, depthLimit);
            right = j;
        }
    } while (left < right);
}

Array.Sort 使用的是快速排序方式进行排序，从而我们明白了 List 的 Sort 排序的效率为O(nlogn)。

我们把大部分的接口都列了出来，差不多把所有的源码都分析了一遍，我们可以看到 List 的效率并不高，只是通用性强而已，大部分的算法都使用的是线性复杂度的算法，这种线性算法当遇到规模比较大的计算量级时就会导致CPU的大量损耗。

我们可以自己改进它，比如不再使用有线性算法的接口，自己重写一套，但凡要优化List 中的线性算法的地方都使用，我们自己制作的工具类。

List的内存分配方式也极为不合理，当List里的元素不断增加时，会多次重新new数组，导致原来的数组被抛弃，最后当GC被调用时造成回收的压力。

我们可以提前告知 List 对象最多会有多少元素在里面，这样的话 List 就不会因为空间不够而抛弃原有的数组，去重新申请数组了。

List源码