A long article to take you on a tour of the world of data structures

What is a data structure?

Program = Data Structure + Algorithm

Yes, the sentence above is a classic. Programs are composed of data structures and algorithms. Of course, data structures and algorithms complement each other and cannot be viewed completely independently; however, this article focuses on the commonly used data structures.

So, what is a data structure?

First of all, what is data? Data is a symbolic representation of objective things. In computer science, it refers to all symbols that can be input into a computer and processed by a computer program. Then why add the word "structure"?

Data elements are the basic units of data, and in any problem they do not exist in isolation: there are always relationships between them. This relationship between data elements is called structure.

Therefore, we have the following definitions:

A data structure is the way a computer stores and organizes data . A data structure is a collection of data elements that have one or more specific relationships to each other . Often, a well-chosen data structure can lead to higher operational or storage efficiency . Data structures are often associated with efficient retrieval algorithms and indexing techniques.

Simply put, a data structure is a way of organizing, managing, and storing data. Although in theory all data could be thrown together and stored without any organization, computers pursue efficiency. If we understand data structures, we can pick the one best suited to the current problem scenario to express the relationships between the data in storage, and the matching algorithm can then work more efficiently, so the running efficiency of the program will certainly improve.

The four commonly used data structures are:

  • Sets: only relationships that belong to the same set, no other relationships
  • Linear structure: There is a one-to-one relationship between the data elements in the structure
  • Tree structure: There is a one-to-many relationship between the data elements in the structure
  • Graph or network structure: there is a many-to-many relationship between the data elements in the structure

What is the logical structure and storage structure?

The logical relationship between data elements is called logical structure , that is, we define a mathematical description of the operation object. But we also have to know how to represent it in the computer. The representation of the data structure in the computer (also known as the image) is called the physical structure of the data, also known as the storage structure .

The relationships between data elements have two different representations in the computer: the sequential image and the non-sequential image, and from these we get two different storage structures: the sequential storage structure and the chained storage structure. For example, with a sequential storage structure, to represent the complex number z1 = 3.0 - 2.3i, the logical relationship between the data elements can be expressed directly by the relative positions of the elements in memory:

The chained structure uses pointers to represent the logical relationship between data elements. For the same z1 = 3.0 - 2.3i, the first node stores 100, which is an address, and following that address we find the next piece of data, -2.3i:

bit

The smallest unit of information in a computer is one binary digit, called a bit. That is the familiar data that looks like 01010101010. At the bottom, a computer is just transistors and circuits, so no matter what the data is — even pictures and sounds — it all comes down to 0 and 1. If there are eight circuits, each circuit has its own open or closed state, which gives 2 multiplied by itself 8 times, i.e. 2^8 = 256 different signals.

But generally we also need to represent negative numbers, so the highest bit is used as the sign bit: 0 represents a positive number and 1 represents a negative number. The maximum value of 8 bits is therefore 01111111, which is 127.

It is worth noting that in the computer world there are also the concepts of original code (sign-magnitude), inverse code (ones' complement), and complement code (two's complement); a short example follows the list below:

  • Original code: the first bit represents the sign, and the remaining bits represent the value
  • Inverse code: the inverse code of a positive number is itself; for a negative number, the sign bit stays unchanged and the remaining bits are inverted
  • Complement code: the complement of a positive number is itself; the complement of a negative number is its inverse code + 1
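
To make the three encodings concrete, here is a small Java illustration (the value 35 is just an example); Java ints are stored as 32-bit two's complement, so Integer.toBinaryString directly shows the complement form of a negative number:

public class ComplementDemo {
    public static void main(String[] args) {
        // 35 in binary; for a positive number, original code and complement are the same
        System.out.println(Integer.toBinaryString(35));        // 100011
        // -35: Java ints are stored in 32-bit two's complement
        System.out.println(Integer.toBinaryString(-35));       // 11111111111111111111111111011101
        // the low 8 bits, 11011101, match the 8-bit example in the text
        System.out.println(Integer.toBinaryString(-35 & 0xFF)); // 11011101
    }
}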

Why do we need the inverse code and complement code in addition to the original code?

We know that addition and subtraction are high-frequency operations. People can see the plus and minus signs intuitively and calculate at once, but if the computer had to distinguish different sign combinations, addition and subtraction would become complicated: positive + positive, positive - positive, positive - negative, negative + negative, and so on. So people wanted to use a single operation (addition) to handle all addition and subtraction, which removes many complex circuits and the overhead of sign handling, and makes calculation more efficient.

We can see that the result of the following negative numbers participating in the operation also conforms to the rules of complement:

        00100011        35
 +      11011101       -35
-------------------------
        00000000         0

        00100011        35
 +      11011011       -37
-------------------------
        11111110        -2

Of course, if the calculation result exceeds the range that the number of digits can represent, it is overflow, which means that more digits are needed to be correctly represented.

Generally, where bit operations can be used, they should be preferred, because they are more efficient. Common bit operations (a short Java example follows the list):

  • ~: bitwise negation
  • &: bitwise AND operation
  • |: bitwise OR operation
  • ^: bitwise exclusive OR
  • <<: signed left shift, for example 35 (00100011) shifted left by one bit is 70 (01000110), and -35 (11011101) shifted left by one bit is -70 (10111010)
  • >>: signed right shift, for example 35 (00100011) shifted right by one bit is 17 (00010001), and -35 (11011101) shifted right by one bit is -18 (11101110)
  • <<<: unsigned left shift, for example 35 (00100011) shifted left by one bit is 70 (01000110)
  • >>>: unsigned right shift, for example -35 (11011101) shifted right by one bit is 110 (01101110)
  • x ^= y; y ^= x; x ^= y;: swap two variables
  • s &= ~(1 << k): set the k-th bit of s to 0
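
A short Java example of the operations listed above (the variable values are just for illustration; expected outputs are in the comments):

public class BitOpsDemo {
    public static void main(String[] args) {
        int a = 35;                          // 00100011
        System.out.println(a << 1);          // 70, signed left shift
        System.out.println(a >> 1);          // 17, signed right shift
        System.out.println(-35 >> 1);        // -18, the sign bit is kept
        System.out.println(-35 >>> 28);      // 15, unsigned right shift fills with 0

        // swap two variables with XOR
        int x = 3, y = 5;
        x ^= y; y ^= x; x ^= y;
        System.out.println(x + " " + y);     // 5 3

        // clear the k-th bit (set it to 0)
        int s = 0b1111, k = 2;
        s &= ~(1 << k);
        System.out.println(Integer.toBinaryString(s)); // 1011
    }
}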

As for a classic place where bit operations are used, the Bloom filter is a good example; for details you can refer to: http://aphysia.cn/archives/cachebloomfilter

What is a Bloom filter?

The Bloom filter ( Bloom Filter) was proposed by Burton Howard Bloom in 1970. It consists of a long binary vector and a series of random hash mapping functions (to put it bluntly, it stores the features of data in a binary array). It can be used to determine whether an element exists in a collection. Its advantages are high query efficiency and small space usage; its disadvantages are a certain error rate, and that elements may affect each other when we want to remove one.

That is, when an element is added to the set, it is mapped by multiple hash functions to k points in the bit array, and those points are set to 1.

The point is that multiple hash functions map the data to different bits, and only when all of these bits are 1 can we judge that the data already exists.

Suppose there are three hash functions; then each element is hashed by the three functions to three positions.

Now suppose we look up Zhang San again: hashing him lands on the positions below, and since all of those positions are 1, we can say Zhang San is already in the set.

So is there a possibility of misjudgment? Yes. For example, suppose only Zhang San, Li Si, Wang Wu, and Cai Ba have been added, and their hash mapping values are as follows:

Chen Liu comes later, but unfortunately the bits produced by his three hash functions have all already been set to 1 by other elements, so he is judged to already exist, when in fact Chen Liu does not exist in the set.

The situation above is a misjudgment, and a Bloom filter inevitably produces misjudgments. But it gives one guarantee: an element judged to exist may not actually exist, but an element judged not to exist definitely does not exist, because a negative judgment means at least one of its hash bits is not 1.

It is also because multiple elements may hash to the same bits that removal is a problem: if one piece of data were kicked out of the set by setting its mapped bits to 0 (which would be equivalent to deleting it), other elements would be affected, because the positions those elements map to might also be set to 0. That is why Bloom filters cannot support removal. A minimal sketch of the idea follows:
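
Below is a minimal Java sketch of the Bloom filter idea, assuming a fixed bit-array size and three simple hash functions derived from different seeds; it only illustrates the add/contains logic and is not a production implementation:

import java.util.BitSet;

public class SimpleBloomFilter {
    private static final int SIZE = 1 << 20;          // bit array length (assumption for the sketch)
    private static final int[] SEEDS = {7, 31, 131};  // three hash functions (assumption)
    private final BitSet bits = new BitSet(SIZE);

    public void add(String value) {
        for (int seed : SEEDS) {
            bits.set(hash(value, seed), true);        // set k positions to 1
        }
    }

    // "might exist" when all k bits are 1; "definitely absent" if any bit is 0
    public boolean mightContain(String value) {
        for (int seed : SEEDS) {
            if (!bits.get(hash(value, seed))) {
                return false;
            }
        }
        return true;
    }

    private int hash(String value, int seed) {
        int h = 0;
        for (int i = 0; i < value.length(); i++) {
            h = seed * h + value.charAt(i);
        }
        return (SIZE - 1) & h;                        // keep the index inside the bit array
    }
}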

array

A linear list is the most commonly used and simplest data structure: a finite sequence of n data elements with the following characteristics:

  • There is a unique first data element
  • There is a unique last data element
  • Except for the first one, each data element in the collection has a predecessor
  • Except for the last one, each data element in the collection has a successor

Linear tables include the following:

  • Arrays: fast query/update, slow insert/delete
  • linked list
  • queue
  • stack

An array is a kind of linear list; a sequential linear list means the data elements are stored in a group of storage units with consecutive addresses:

In Java this can be written as:

int[] nums = new int[100];          // declare an array of length 100
int[] values = {1, 2, 3, 4, 5};     // declare and initialize

Object[] objects = new Object[100]; // arrays of object types work the same way

and in C++:

int nums[100];

The array is a linear structure; at the bottom it is generally a continuous block of memory storing data of the same type. Because of the compact continuous layout and natural index support, querying data is efficient:

Suppose the address of the first element of array a is 296 and each element occupies 2 units. Then to get the fifth element we compute 296 + (5 - 1) * 2 = 304, so it can be fetched with O(1) time complexity.

Updating is essentially also a find: first locate the element, then update it:

But to insert data you need to shift the following elements. For example, inserting the element 6 into the array below may, in the worst case, require moving all elements, so the time complexity is O(n).


Deleting an element requires moving the following data to the front, and the worst time complexity is also O(n):

Java code implementing add, delete, update, and find for an array:

package datastruction;

import java.util.Arrays;

public class MyArray {
    private int[] data;

    private int elementCount;

    private int length;

    public MyArray(int max) {
        length = max;
        data = new int[max];
        elementCount = 0;
    }

    public void add(int value) {
        if (elementCount == length) {
            length = 2 * length;
            data = Arrays.copyOf(data, length);
        }
        data[elementCount] = value;
        elementCount++;
    }

    public int find(int searchKey) {
        int i;
        for (i = 0; i < elementCount; i++) {
            if (data[i] == searchKey)
                break;
        }
        if (i == elementCount) {
            return -1;
        }
        return i;
    }

    public boolean delete(int value) {
        int i = find(value);
        if (i == -1) {
            return false;
        }
        for (int j = i; j < elementCount - 1; j++) {
            data[j] = data[j + 1];
        }
        elementCount--;
        return true;
    }

    public boolean update(int oldValue, int newValue) {
        int i = find(oldValue);
        if (i == -1) {
            return false;
        }
        data[i] = newValue;
        return true;
    }
}

// Test class
public class Test {
    public static void main(String[] args) {
        MyArray myArray = new MyArray(2);
        myArray.add(1);
        myArray.add(2);
        myArray.add(3);
        myArray.delete(2);
        System.out.println(myArray);
    }
}

linked list

In the example above we can see that an array needs contiguous space. If the capacity is only 2, then when the 3rd element is added the array has to be expanded, and the existing elements must be copied as well. Delete and insert operations also cause a lot of data movement.

A linked list is a chained data structure. Because it does not require logically adjacent data elements to be physically adjacent, it avoids the drawbacks of the sequential storage structure, but at the same time it loses the advantage of finding elements directly by index.

Important: The linked list is not continuous in the computer's storage, but the previous node stores the pointer (address) of the next node, and the latter node is found through the address.

The following is the structure of a singly linked list:

Generally we add a dummy node in front of the singly linked list, also called the head node, but this is not mandatory:

The general linked list structure is divided into the following types:

  • Singly linked list : Each node in the linked list has one and only one pointer to the next node, and the last node points to null.
  • Doubly linked list : each node has two pointers (for convenience, call them the front pointer and the back pointer), which point to the previous node and the next node respectively; the front pointer of the first node points to NULL, and the back pointer of the last node points to NULL
  • Circular linked list : The pointer of each node points to the next node, and the pointer of the last node points to the first node (although it is a circular linked list, it is necessary to identify the head node or tail node when necessary to avoid an infinite loop)
  • Complex linked list : Each linked list has a back pointer, pointing to the next node, and a random pointer, pointing to any node.

Time complexity of linked list operation:

  • Query: O(n), need to traverse the linked list
  • Insert: O(1), modify the pointer before and after
  • Delete: O(1), the same is the pointer before and after modification
  • Modification: O(1) if no lookup is needed, O(n) if a lookup is needed

How to represent the structure code of linked list?

The following shows only the singly linked list structure. In C++:

// Node
typedef struct LNode{
  // data
  ElemType data;
  // pointer to the next node
  struct LNode *next;
}*Link,*Position;

// Linked list
typedef struct{
  // head node and tail node
  Link head,tail;
  // length
  int len;
}LinkList;

In Java:

    public class ListNode {
        int val;
        ListNode next = null;

        ListNode(int val) {
            this.val = val;
        }
    }

Let's implement a simple linked list ourselves, with add, delete, update, and find:

class ListNode<T> {
    T val;
    ListNode next = null;

    ListNode(T val) {
        this.val = val;
    }
}

public class MyList<T> {
    private ListNode<T> head;
    private ListNode<T> tail;
    private int size;

    public MyList() {
        this.head = null;
        this.tail = null;
        this.size = 0;
    }

    public void add(T element) {
        add(size, element);
    }

    public void add(int index, T element) {
        if (index < 0 || index > size) {
            throw new IndexOutOfBoundsException("超出链表长度范围");
        }
        ListNode current = new ListNode(element);
        if (index == 0) {
            if (head == null) {
                head = current;
                tail = current;
            } else {
                current.next = head;
                head = current;
            }
        } else if (index == size) {
            tail.next = current;
            tail = current;
        } else {
            ListNode preNode = get(index - 1);
            current.next = preNode.next;
            preNode.next = current;
        }
        size++;
    }

    public ListNode get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("超出链表长度");
        }
        ListNode temp = head;
        for (int i = 0; i < index; i++) {
            temp = temp.next;
        }
        return temp;
    }

    public ListNode delete(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("超出链表节点范围");
        }
        ListNode node = null;
        if (index == 0) {
            node = head;
            head = head.next;
        } else if (index == size - 1) {
            ListNode preNode = get(index - 1);
            node = tail;
            preNode.next = null;
            tail = preNode;
        } else {
            ListNode pre = get(index - 1);
            node = pre.next;            // the node to be removed
            pre.next = pre.next.next;   // unlink it
        }
        size--;
        return node;
    }

    public void update(int index, T element) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("超出链表节点范围");
        }
        ListNode node = get(index);
        node.val = element;
    }

    public void display() {
        ListNode temp = head;
        while (temp != null) {
            System.out.print(temp.val + " -> ");
            temp = temp.next;
        }
        System.out.println("");
    }
}

The test code is as follows:

public class Test {
    public static void main(String[] args) {
        MyList myList = new MyList();
        myList.add(1);
        myList.add(2);
        // 1->2
        myList.display();

        // 1
        System.out.println(myList.get(0).val);

        myList.update(1,3);
        // 1->3
        myList.display();

        myList.add(4);
        // 1->3->4
        myList.display();

        myList.delete(1);
        // 1->4
        myList.display();
    }
}

Output result:

1 -> 2 -> 
1
1 -> 3 -> 
1 -> 3 -> 4 -> 
1 -> 4 ->

Search and update on a singly linked list are relatively simple. Let's look at the specific process of inserting a new node (only insertion at a middle position is shown here; head and tail insertion are simpler):

How to delete an intermediate node? The following is the specific process:


You might be curious: the a5 node now has nothing pointing to it, so where does it go?

In a Java program, the garbage collector will collect such unreferenced nodes and reclaim this part of the memory for us, but to speed up garbage collection we generally clear references to nodes that are no longer needed, for example node = null. In a C++ program you need to release the memory manually, otherwise it is easy to cause problems such as memory leaks.

Operations on more complex linked lists are only briefly mentioned here; later I will share linked-list data structures and common algorithms separately. This article mainly sketches the whole picture of data structures.

skip list

From the above we can see that searching a linked list is troublesome: if the node is at the end, we have to traverse all nodes to find it, so search efficiency is too low. Is there a better way?

There are always more solutions than problems, but there is no such thing as getting everything (more, faster, better, and cheaper) at once: something has to be given up. The computer world is full of this philosophical flavor. Since search efficiency is the problem, we might as well sort the linked list first. A sorted linked list still only tells us the head, the tail, and the range in between; to find a middle node we would still have to traverse in the old way. What if we also record the middle node? Then we know whether the target is in the first half or the second half. For example, to find 7 we can start from the middle node, while to find 4 we start from the beginning and, at worst, stop when we reach the middle node.

However, the problem is still not completely solved, because the linked list may be very long, and splitting it into only two halves does not help much. Let's go back to the principle: between space and time, if we choose time, we have to give up some space. So we add a pointer to each node; now there are 2 layers of pointers (note: there is only one copy of each node — the picture draws two copies just for clarity; they are the same node with two pointers, e.g. node 1 points to both 2 and 5):

With two layers of pointers the problem still exists, so keep adding layers, for example one extra layer for every two nodes:

This is the skip list. The definition of the skip list is as follows:

The skip list (SkipList) is a data structure used for fast lookup in an ordered sequence of elements. It is a randomized data structure, essentially an ordered linked list that can be binary-searched. The skip list adds multi-level indexes on top of the original ordered linked list and uses the indexes to achieve fast lookup. Skip lists improve not only search performance but also the performance of insert and delete operations. Their performance is comparable to red-black trees and AVL trees, but the principle of the skip list is very simple, and the implementation is much simpler than a red-black tree.

The main principle is to trade space for time, achieving roughly the efficiency of binary search. As for the space consumed, assuming an index level is added for every two nodes, the total number of nodes is about 1 + 2 + 4 + ... + n ≈ 2n - 1, so the space roughly doubles. Doesn't it look like a book's table of contents: a first-level directory, a second level, a third level...

If we keep inserting data into the skip list, some segment may accumulate too many nodes. At that point we need to update the index dynamically: besides inserting the data into the bottom list, we also insert it into the linked lists of the upper levels to keep query efficiency.

Redis uses a skip list to implement zset. Redis uses a random algorithm to decide how many index levels each node gets; although this cannot give absolute guarantees, efficiency is basically assured, and it is simpler than balanced trees or red-black trees while remaining efficient.
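
As a rough illustration (not Redis's actual implementation), here is a minimal skip-list sketch in Java with a small fixed level cap and coin-flip level generation; MAX_LEVEL and the node layout are assumptions made for the sketch:

import java.util.Random;

public class SimpleSkipList {
    private static final int MAX_LEVEL = 4;            // small fixed cap, just for the sketch
    private final Node head = new Node(Integer.MIN_VALUE, MAX_LEVEL);
    private final Random random = new Random();
    private int level = 1;                              // number of levels currently in use

    private static class Node {
        int val;
        Node[] next;                                    // next[i] = successor on level i
        Node(int val, int level) {
            this.val = val;
            this.next = new Node[level];
        }
    }

    public boolean contains(int target) {
        Node cur = head;
        for (int i = level - 1; i >= 0; i--) {          // start at the top level
            while (cur.next[i] != null && cur.next[i].val < target) {
                cur = cur.next[i];                      // move right while smaller, then drop a level
            }
        }
        Node candidate = cur.next[0];
        return candidate != null && candidate.val == target;
    }

    public void add(int val) {
        int nodeLevel = randomLevel();
        if (nodeLevel > level) {
            level = nodeLevel;
        }
        Node newNode = new Node(val, nodeLevel);
        Node cur = head;
        for (int i = level - 1; i >= 0; i--) {
            while (cur.next[i] != null && cur.next[i].val < val) {
                cur = cur.next[i];
            }
            if (i < nodeLevel) {                        // splice the new node into this level
                newNode.next[i] = cur.next[i];
                cur.next[i] = newNode;
            }
        }
    }

    // coin-flip level, similar in spirit to how Redis randomizes levels for zset
    private int randomLevel() {
        int lvl = 1;
        while (lvl < MAX_LEVEL && random.nextBoolean()) {
            lvl++;
        }
        return lvl;
    }
}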

stack

A stack is a data structure that in Java is embodied by the Stack class. Its essence is first in, last out: like a bucket, you can only keep putting things on top, and when taking them out you can only take from the top. If you want the data at the bottom, you can only get it after the data above has been taken out. Of course, if you really need that, you would generally use a double-ended queue instead.

The following is a demonstration of the characteristics of the stack:

What is the stack implemented with underneath? In fact either a linked list or an array can be used, but the JDK's Stack is implemented with an array; after encapsulation, the API only lets you manipulate the top element. The stack is often used to implement recursive functions. If you want to understand the implementation of Stack or other collections in Java, you can look at this series of articles: http://aphysia.cn/categories/collection

Putting an element onto the stack is called pushing, taking an element off is called popping, and the top element of the stack is the element that was pushed last.
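
Since the text above mentions java.util.Stack, here is a tiny usage example of the JDK class itself (expected outputs are in the comments):

import java.util.Stack;

public class JdkStackDemo {
    public static void main(String[] args) {
        Stack<Integer> stack = new Stack<>();
        stack.push(1);
        stack.push(2);
        stack.push(3);
        System.out.println(stack.peek()); // 3, look at the top without removing it
        System.out.println(stack.pop());  // 3, last in, first out
        System.out.println(stack.pop());  // 2
        System.out.println(stack);        // [1]
    }
}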

Use an array to implement a simple stack (note that it is only for reference and testing; in practice there would be thread-safety and other issues):

import java.util.Arrays;

public class MyStack<T> {
    private T[] data;
    private int length = 2;
    private int maxIndex;

    public MyStack() {
        data = (T[]) new Object[length];
        maxIndex = -1;
    }

    public void push(T element) {
        if (isFull()) {
            length = 2 * length;
            data = Arrays.copyOf(data, length);
        }
        data[maxIndex + 1] = element;
        maxIndex++;
    }

    public T pop() {
        if (isEmpty()) {
            throw new IndexOutOfBoundsException("栈内没有数据");
        } else {
            T[] newdata = (T[]) new Object[data.length - 1];
            for (int i = 0; i < data.length - 1; i++) {
                newdata[i] = data[i];
            }
            T element = data[maxIndex];
            maxIndex--;
            data = newdata;
            return element;
        }
    }

    private boolean isFull() {
        return data.length - 1 == maxIndex;
    }

    public boolean isEmpty() {
        return maxIndex == -1;
    }

    public void display() {
        for (int i = 0; i < data.length; i++) {
            System.out.print(data[i]+" ");
        }
        System.out.println("");
    }
}

Test code:

public class MyStackTest {
    public static void main(String[] args) {
        MyStack<Integer> myStack = new MyStack<>();
        myStack.push(1);
        myStack.push(2);
        myStack.push(3);
        myStack.push(4);
        myStack.display();

        System.out.println(myStack.pop());

        myStack.display();

    }
}

The output is as follows, as expected:

1 2 3 4 
4
1 2 3 

The characteristic of the stack is first in, last out; if you need to take out earlier data at random, efficiency is relatively low, because the data above it must be moved out of the way first.

queue

Since we have a first-in, last-out data structure, there must also be a first-in, first-out one. During the epidemic, probably everyone has queued for nucleic acid testing: the queue is long, the first in line gets tested first and the last gets tested last — everyone understands this.

A queue is a special kind of linear list. What is special is that it only allows deletion at the front of the list and insertion at the rear. Like the stack, the queue is a linear list with restricted operations. The end where insertion happens is called the tail of the queue, and the end where deletion happens is called the head of the queue.

Queues are characterized by first-in, first-out, the following are examples:

Generally speaking, whenever first in, first out (FIFO, First In First Out) is mentioned, you should think of a queue. But if you want a queue that can take elements from both the head and the tail, you need a special queue: the double-ended queue (deque), which is generally easiest to implement with a doubly linked list.
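
For the double-ended queue (deque) just mentioned, the JDK already provides java.util.ArrayDeque; a small example of entering and leaving the queue from both ends:

import java.util.ArrayDeque;
import java.util.Deque;

public class DequeDemo {
    public static void main(String[] args) {
        Deque<Integer> deque = new ArrayDeque<>();
        deque.addLast(1);                      // enqueue at the tail
        deque.addLast(2);
        deque.addFirst(0);                     // enqueue at the head
        System.out.println(deque);             // [0, 1, 2]
        System.out.println(deque.pollFirst()); // 0, dequeue from the head
        System.out.println(deque.pollLast());  // 2, dequeue from the tail
    }
}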

Below we implement a simple one-way queue in Java:

class Node<T> {
    public T data;
    public Node next;

    public Node(T data) {
        this.data = data;
    }
}

public class MyQueue<T> {
    private Node<T>  head;
    private Node<T>  rear;
    private int size;

    public MyQueue() {
        size = 0;
    }

    public void pushBack(T element) {
        Node newNode = new Node(element);
        if (isEmpty()) {
            head = newNode;
        } else {
            rear.next = newNode;
        }
        rear = newNode;
        size++;
    }

    public boolean isEmpty() {
        return head == null;
    }

    public T popFront() {
        if (isEmpty()) {
            throw new NullPointerException("队列没有数据");
        } else {
            Node<T> node = head;
            head = head.next;
            size--;
            return node.data;
        }
    }

    public void display() {
        Node temp = head;
        while (temp != null) {
            System.out.print(temp.data +" -> ");
            temp = temp.next;
        }
        System.out.println("");
    }
}

The test code is as follows:

public class MyQueueTest {
    public static void main(String[] args) {
        MyQueue<Integer> myQueue = new MyQueue<>();
        myQueue.pushBack(1);
        myQueue.pushBack(2);
        myQueue.pushBack(3);
        myQueue.display();

        System.out.println(myQueue.popFront());
        myQueue.display();

        System.out.println(myQueue.popFront());
        myQueue.display();
    }
}

operation result:

1 -> 2 -> 3 -> 
1
2 -> 3 -> 
2
3 -> 

Commonly used queue types are as follows:

  • One-way queue: that is, what we call a normal queue, first in, first out.

  • Two-way queue: can enter and exit the queue from different directions

  • Priority queue: the interior is automatically sorted, and the queue is queued in a certain order

  • Blocking queue: When an element is taken from the queue, the queue will block if there is no element. Similarly, if the queue is full, putting elements into the queue will also be blocked.

  • Circular queue: can be understood as a circular linked list, but generally the head and tail need to be identified to prevent an infinite loop; the tail node's next points to the head node.

Queues are generally used to hold data that must be processed in order, or to hold tasks. In level-order traversal of a tree, a queue can be used; generally, breadth-first search can be solved with a queue, as the sketch below shows.
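
As a small illustration of breadth-first (level) traversal with a queue, here is a hedged Java sketch; the TreeNode class is defined inside the example just for demonstration:

import java.util.ArrayDeque;
import java.util.Queue;

public class LevelOrderDemo {
    static class TreeNode {
        int val;
        TreeNode left, right;
        TreeNode(int val) { this.val = val; }
    }

    // visit nodes layer by layer using a FIFO queue
    static void levelOrder(TreeNode root) {
        if (root == null) return;
        Queue<TreeNode> queue = new ArrayDeque<>();
        queue.offer(root);
        while (!queue.isEmpty()) {
            TreeNode node = queue.poll();
            System.out.print(node.val + " ");
            if (node.left != null) queue.offer(node.left);
            if (node.right != null) queue.offer(node.right);
        }
    }

    public static void main(String[] args) {
        TreeNode root = new TreeNode(1);
        root.left = new TreeNode(2);
        root.right = new TreeNode(3);
        root.left.left = new TreeNode(4);
        levelOrder(root);                  // prints 1 2 3 4
    }
}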

hash table

With the previous data structures, searching generally relies on comparisons such as = and !=, and binary search or other range queries may use < and >. Ideally, we would like to locate a position (storage location) directly without any comparison. In an array, elements can be fetched by index; so if we map the data to be stored to array indexes in a one-to-one relationship, can't we locate an element's position quickly?

As long as there is a function f(k) that finds the position corresponding to k, that function f(k) is a hash function. It represents a mapping relationship, but different values may be mapped to the same value (the same hash address), that is f(k1) = f(k2); this phenomenon is called a conflict or collision.

The hash table is defined as follows:

Hash table (also called hash table) is a data structure that directly accesses the memory storage location according to the key. That is, it accesses records by computing a function on the key value that maps the data to be queried to a location in the table, which speeds up lookups. This mapping function is called a hash function, and the array of records is called a hash table.

Commonly used hash functions are:

  • Direct addressing method: take the keyword or a linear function of the keyword as the hash address, such as H(key) = key or H(key) = a * key + b
  • Numerical analysis method: from all the possible values, take several digits of the keyword to form the hash address
  • Middle-square method: take the middle digits of the square of the keyword as the hash address
  • Folding method: split the keyword into several parts with the same number of digits (the last part may have fewer), and take the sum of these parts (discarding any carry) as the hash address
  • Division-remainder method: take the remainder of the keyword divided by a number p not greater than the hash table length m as the hash address, i.e. hash(k) = k mod p, p <= m. The keyword can be taken modulo directly, or after operations such as folding or squaring. The choice of p matters: generally a prime number not greater than m is used; a poor choice of p easily causes collisions
  • Random number method: take a random function value of the keyword as its hash address

However, none of these methods can completely avoid hash collisions; they can only reduce them. So what are the ways to handle hash conflicts?

  • Open addressing: after hashing, if the location already holds data, probe the next address (+1), and keep looking until an empty location is found.
  • Rehashing: after a collision occurs, use another hash function to compute a new address until an empty hash address is found; multiple hash functions can be stacked if necessary.
  • Chaining (chain address method): all elements with the same hash value are linked into a linked list that hangs off the corresponding array slot.
  • Common overflow area: not common; it means that any element that collides with an element already in the table is placed in a separate table, also called the overflow table.

In Java (for example in HashMap), the chaining method is used:

However, if hash collisions are severe, the linked list becomes relatively long and queries must traverse it. So a later JDK version (JDK 1.8) optimized this: when the length of a list exceeds a threshold, it is converted into a red-black tree. The red-black tree uses certain rules to balance its subtrees and avoid degenerating into a linked list, which would hurt query efficiency.

But you will surely wonder: what if the array is too small and more and more data is added? The probability of collision would get higher and higher. In fact, at that point a resizing mechanism is triggered: the array is expanded to twice its size, and the existing data is rehashed and distributed into the new array.

The advantage of a hash table is fast lookup, but if rehashing (resizing) is triggered constantly, the response speed slows down. Also, if you want range queries, hash tables are not a good choice. A minimal sketch of the chaining idea follows:
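
The following is a minimal sketch of the chaining idea in Java, assuming a fixed number of buckets and ignoring resizing and the red-black-tree conversion; the class and field names are illustrative:

import java.util.LinkedList;

public class SimpleChainedHashMap<K, V> {
    private static final int BUCKETS = 16;                       // fixed size, just for the sketch
    private final LinkedList<Entry<K, V>>[] table = new LinkedList[BUCKETS];

    private static class Entry<K, V> {
        K key; V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private int index(K key) {
        return (key.hashCode() & 0x7fffffff) % BUCKETS;          // map the key to a bucket
    }

    public void put(K key, V value) {
        int i = index(key);
        if (table[i] == null) table[i] = new LinkedList<>();
        for (Entry<K, V> e : table[i]) {
            if (e.key.equals(key)) { e.value = value; return; }  // update an existing key
        }
        table[i].add(new Entry<>(key, value));                   // otherwise append to the chain
    }

    public V get(K key) {
        int i = index(key);
        if (table[i] == null) return null;
        for (Entry<K, V> e : table[i]) {                         // walk the chain in this bucket
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}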

Tree

Arrays and linked lists are both linear structures, while the tree introduced here is a non-linear structure. In reality a tree is a pyramid-like structure; in the data-structure tree, the node at the top is called the root node.

How do we define the tree structure?

A tree is a data structure , which consists of n (n≥1 ) finite nodes to form a set with hierarchical relationship . It's called a "tree" because it looks like an upside-down tree, which means it has the roots up and the leaves down. It has the following characteristics:

Each node has zero or more child nodes; a node without a parent node is called the root node; every non-root node has one and only one parent node; apart from the root node, each child node can be divided into multiple disjoint subtrees. (Baidu Encyclopedia)

The following are the basic terms for trees (from the Tsinghua University Data Structures, C language edition):

  • Degree of a node: The number of subtrees contained in a node is called the degree of the node
  • The degree of the tree: In a tree, the largest node degree is called the degree of the tree;
  • Leaf nodes or terminal nodes: nodes with degree zero;
  • Non-terminal nodes or branch nodes: nodes whose degree is not zero;
  • Parent node or parent node: If a node contains child nodes, the node is called the parent node of its child nodes;
  • Child node or child node: The root node of the subtree contained in a node is called the child node of the node;
  • Sibling nodes: Nodes with the same parent node are called sibling nodes;
  • Level of a node: defined starting from the root; the root is level 1, the root's children are level 2, and so on;
  • Depth: for any node n, the depth of n is the length of the unique path from the root to n; the depth of the root is 0;
  • Height: for any node n, the height of n is the length of the longest path from n to a leaf; the height of every leaf is 0;
  • Cousin nodes: nodes whose parents are on the same level are cousins of each other;
  • Ancestors of a node: all nodes on the branch from the root to the node;
  • Descendants: Any node in the subtree rooted at a node is called the descendant of the node.
  • Ordered tree: if the subtrees of the nodes in the tree are regarded as ordered from left to right (and cannot be interchanged), the tree is called an ordered tree; otherwise it is an unordered tree
  • First Child: The root of the leftmost subtree in an ordered tree is called the first child
  • Last Child: The root of the rightmost subtree in an ordered tree is called the last child
  • Forest: a collection of m (m >= 0) disjoint trees is called a forest;

Among trees, the one we use most often is the binary tree:

The characteristic of a binary tree is that each node has at most two subtrees, and the subtrees are divided into left and right, and the order of the left and right child nodes cannot be arbitrarily reversed.

A binary tree is represented in Java as:

public class TreeLinkNode {
    int val;
    TreeLinkNode left = null;
    TreeLinkNode right = null;
    TreeLinkNode next = null;

    TreeLinkNode(int val) {
        this.val = val;
    }
}

Full binary tree: a binary tree of depth k with 2^k - 1 nodes is called a full binary tree.

Complete binary tree: A binary tree of depth k with n nodes, if and only if each node corresponds to a node numbered from 1 to n in a full binary tree of depth k, it is called a complete binary tree.

There are several common ways to traverse a binary tree (a recursive sketch follows the list):

  • Preorder traversal: traverse the order root node --> left child node --> right child node
  • In-order traversal: traversal order left child node --> root node --> right child node
  • Post-order traversal: traverse order left child node --> right child node --> root node
  • Breadth/level traversal: traversal from top to bottom, layer by layer
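
A recursive sketch of the first three traversals, assuming the TreeLinkNode class shown above is accessible (level traversal with a queue was sketched in the queue section):

public class TreeTraversal {
    // preorder: root -> left -> right
    static void preOrder(TreeLinkNode node) {
        if (node == null) return;
        System.out.print(node.val + " ");
        preOrder(node.left);
        preOrder(node.right);
    }

    // inorder: left -> root -> right
    static void inOrder(TreeLinkNode node) {
        if (node == null) return;
        inOrder(node.left);
        System.out.print(node.val + " ");
        inOrder(node.right);
    }

    // postorder: left -> right -> root
    static void postOrder(TreeLinkNode node) {
        if (node == null) return;
        postOrder(node.left);
        postOrder(node.right);
        System.out.print(node.val + " ");
    }
}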

If the binary tree is unordered, searching it is inefficient, no different from a disordered linked list, so why bother with a more complicated structure?

In fact the binary tree can be used for sorting and searching, because a binary tree has strict left and right subtrees: we can define the size relationship between the root node, the left child, and the right child. Hence the binary search tree:

A binary search tree (also called a binary sort tree) is either an empty tree or a binary tree with the following properties: if its left subtree is not empty, then the values of all nodes in the left subtree are less than the value of its root node; if its right subtree is not empty, the values of all nodes in the right subtree are greater than the value of its root node; and its left and right subtrees are themselves binary sort trees. As a classic data structure, the binary search tree has both the fast insertion and deletion of a linked list and the fast search of an array, so it is widely used; for example, file systems and database systems generally use this kind of tree for efficient sorting and retrieval.

A sample binary search tree is as follows:

For example, to find a value in the tree above, we start from the root: if the target is smaller we go into the left subtree, if larger into the right subtree. Each comparison descends one level, so the number of comparisons is at most the number of layers; for a reasonably balanced tree of n nodes that is about log(n+1).
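
A minimal sketch of searching a binary search tree with the same TreeLinkNode structure; each comparison descends one level, which is where the logarithmic cost comes from:

public class BstSearch {
    // returns the node with the target value, or null if it is absent
    static TreeLinkNode search(TreeLinkNode root, int target) {
        TreeLinkNode cur = root;
        while (cur != null) {
            if (target == cur.val) {
                return cur;                 // found
            } else if (target < cur.val) {
                cur = cur.left;             // smaller values live in the left subtree
            } else {
                cur = cur.right;            // larger values live in the right subtree
            }
        }
        return null;
    }
}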

If the tree is well maintained, the query efficiency is high, but if the tree is not well maintained, it will easily degenerate into a linked list, and the query efficiency will also decrease, for example:

A query-friendly binary tree should be a balanced or nearly balanced binary tree. What is a balanced binary tree:

The heights of the left and right subtrees of any node in a balanced binary search tree differ by at most 1. A balanced binary tree is also known as an AVL tree.

In order to ensure that the binary tree is still a balanced binary tree after inserting or deleting data, etc., it is necessary to adjust the nodes. This is also called the balancing process, which involves various rotation adjustments, which will not be expanded here for the time being.

However, if a large number of insertions and deletions are involved, the balancing adjustments cost a lot of performance. To ease this problem, some experts proposed the red-black tree.

A red-black tree (Red-Black Tree) is a self-balancing binary search tree, a data structure used in computer science, typically used to implement associative arrays.

Red-black trees were invented by Rudolf Bayer in 1972 and were then called symmetric binary B-trees. In 1978, Leo J. Guibas and Robert Sedgewick modified them into the present "red-black tree".

A red-black tree, like the AVL tree (balanced binary tree), is a self-balancing binary search tree; it maintains the balance of the search tree through specific operations during insertion and deletion, so as to obtain high search performance.

A red-black tree has the following characteristics:

  • Properties 1. Nodes are red or black.

  • Property 2. The root node is black.

  • Property 3. All leaves are black. (Leaves are NIL nodes)

  • Property 4. Both children of each red node are black. (There cannot be two consecutive red nodes on all paths from each leaf to the root)

  • Property 5. All paths from any node to each of its leaves contain the same number of black nodes.

It is these characteristics that make the adjustment of the red-black tree not as difficult and frequent as the adjustment of the ordinary balanced binary tree. That is to say, rules are added to make it meet certain standards and reduce the confusion and frequency of the balancing process.

The Java hash table implementation mentioned above is exactly an application of the red-black tree: when there are many hash collisions, the linked list is converted into a red-black tree.

All of the above are binary trees, but we also have to talk about multi-way trees. Why? Although binary search trees and red-black trees are already quite good, when interacting with the disk — as most data storage does — we have to consider IO, because disk IO is much slower than memory. If the index tree has many levels, the number of disk reads becomes too large. B-trees are better suited to disk storage.

In 1970, R. Bayer and E. McCreight proposed a tree suitable for external search: a balanced multi-way tree called the B-tree (also written B_tree).

A B-tree of order m is a balanced m-way search tree. It is either an empty tree or a tree that satisfies the following properties:

1. The root node has at least two children;

2. The number j of keywords contained in each non-root node satisfies: m/2 - 1 <= j <= m - 1;

3. The degree of all nodes except the root node (excluding leaf nodes) is exactly the total number of keywords plus 1, so the number of internal subtrees k satisfies: m/2 <= k <= m ;

4. All leaf nodes are located in the same layer.

Each node holds a bit more data. When searching, operations in memory are much faster than on disk, and the B-tree reduces the number of disk IOs. A B-tree:

But the data in each node may be very large, which means very little data fits on each page read, and the number of IO queries naturally rises. So we might as well store data only in the leaf nodes:

The B+ tree is a variant of the B tree. The leaf nodes on the B+ tree store keywords and addresses of corresponding records, and the layers above the leaf nodes are used as indexes. A B+ tree of order m is defined as follows:

(1) Each node has at most m children;

(2) Except for the root node, each node has at least [m/2] children, and the root node has at least two children;

(3) A node with k children must have k keywords.

Generally the leaf nodes of the B+ tree are connected by a linked list, which makes traversal and range queries convenient.

This is the B+ tree. Compared with the B-tree, the B+ tree has the following advantages:

  1. The intermediate nodes of the B+ tree do not store data, so each IO can read more index entries, making the tree shorter and fatter.
  2. For range search, the B+ tree only needs to traverse the linked list of leaf nodes, while the B-tree has to walk from the root down to the leaf nodes.

Besides the trees above, there is also the Huffman tree: given N weights as N leaf nodes, construct a binary tree such that the weighted path length of the tree is minimal; such a binary tree is called an optimal binary tree, also known as a Huffman tree. The Huffman tree is the tree with the shortest weighted path length, and nodes with larger weights are closer to the root.

It is generally used for compression, because the frequencies of characters in data differ. The more frequent the character, the shorter the code we use to store it, which achieves compression. Where does this code come from?

Assuming the string is hello, the encoding might look like the following (just a rough prototype: high-frequency characters get shorter codes); each code is the 0/1 string of the path from the root node to that character:

By giving characters of different weights different encodings, the Huffman tree achieves effective compression.

heap

The heap is actually a type of binary tree. The heap must be a complete binary tree. A complete binary tree is: Except for the last layer, the number of nodes in other layers is full, and the nodes in the last layer are concentrated in the left continuous position.

The heap has another requirement: the value of each node in the heap must be greater than or equal to (or less than or equal to) the value of its left and right child nodes.

There are two main types of heaps:

  • Big top heap: each node is greater than or equal to its subtree nodes (heap top is the maximum value)
  • Small top heap: each node is less than or equal to its subtree nodes (heap top is the minimum value)

In general, we use arrays to represent heaps, such as the following small top heap:


The relationship between parent-child nodes and left and right nodes in the array is as follows:

  • the parent of node i: parent = floor((i - 1) / 2) (rounded down)
  • the left child of node i: 2 * i + 1
  • the right child of node i: 2 * i + 2

Since data is stored, operations such as insertion and deletion must be involved. Insertion and deletion in the heap will involve adjustment of the heap. After adjustment, its definition can be re-satisfied. This adjustment process is called heapization .

Taking the small top heap as an example, the adjustment is mainly to ensure:

  • it is still a complete binary tree
  • Each node in the heap is less than or equal to its left and right child nodes

For the small top heap, the adjustment is: small elements float up and large elements sink, a process of repeated swapping.
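
A hedged sketch of inserting into an array-based small top heap: the new element is appended at the end and floats up by swapping with its parent (parent index (i - 1) / 2) until the heap property holds again; the initial capacity of 16 is just an illustrative choice:

import java.util.Arrays;

public class MinHeap {
    private int[] data = new int[16];
    private int size = 0;

    public void insert(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2);   // grow when full
        }
        data[size] = value;                                // append at the end
        int i = size++;
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (data[i] >= data[parent]) break;            // heap property satisfied
            int tmp = data[i]; data[i] = data[parent]; data[parent] = tmp;
            i = parent;                                    // keep floating up
        }
    }

    public int peekMin() {
        if (size == 0) throw new IllegalStateException("heap is empty");
        return data[0];                                    // the top is the minimum
    }
}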

The heap is generally used to solve TOP K problems, or for the priority queue we mentioned earlier.

graph

We finally arrive at graphs. A graph can be pictured on a two-dimensional plane. I wrote about Minesweeper before: the whole Minesweeper grid can in fact be related to graphs. A graph is a non-linear data structure, mainly composed of edges and vertices.


At the same time, graphs are divided into directed graphs and undirected graphs. The one above is an undirected graph, because the edges do not indicate direction, only the relationship between the two vertices, while a directed graph looks like this:

If each vertex is a place and each edge is a path, then this is a map network, so graphs are often used to solve shortest distances. Let's take a look at the concepts related to the graph:

  • Vertex: The most basic unit of the graph, those nodes
  • Edge: the relationship between vertices
  • Adjacent Vertices: Vertices directly related by edges
  • Degree: the number of adjacent vertices that a vertex is directly connected to
  • weight: the weight of the edge

Generally, there are several ways to represent graphs:

  1. Adjacency matrix: represented by a two-dimensional array, where 1 means connected and 0 means not connected. Of course, if we want to represent path length, a number greater than 0 can store the weight and -1 can indicate no connection.

In the picture below, vertex 0 is connected to 1 and 2; we can see that columns 1 and 2 of row 0 are 1, indicating they are connected. Another point: the entry for a vertex with itself is 0, meaning not connected, but in some cases it can be regarded as connected.

  2. Adjacency list

The adjacency list's storage method is similar to the child-linked representation of a tree: it is a storage structure combining sequential allocation and chained allocation. If the vertex corresponding to a header node has adjacent vertices, the adjacent vertices are stored in turn in the singly linked list pointed to by the header node.

For an undirected graph, storing it with an adjacency list also causes some data redundancy: when the linked list of header node A contains a node pointing to C, the linked list of header node C will also contain a node pointing to A. A small sketch of both representations follows.
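
A small sketch of both representations for an undirected graph with three vertices where 0-1 and 0-2 are connected (the concrete vertices are illustrative):

import java.util.ArrayList;
import java.util.List;

public class GraphRepresentation {
    public static void main(String[] args) {
        // adjacency matrix: 1 means connected, 0 means not connected
        int[][] matrix = {
                {0, 1, 1},
                {1, 0, 0},
                {1, 0, 0}
        };
        System.out.println(matrix[0][1]);        // 1, vertex 0 and vertex 1 are connected

        // adjacency list: each vertex keeps a list of its neighbours
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < 3; i++) adj.add(new ArrayList<>());
        adj.get(0).add(1); adj.get(1).add(0);    // undirected edge 0-1, stored twice (the redundancy above)
        adj.get(0).add(2); adj.get(2).add(0);    // undirected edge 0-2
        System.out.println(adj.get(0));          // [1, 2]
    }
}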

Traversal of a graph is generally divided into breadth-first traversal and depth-first traversal. Breadth-first traversal visits the vertices directly related to the current vertex first, and is generally implemented with a queue. Depth-first traversal keeps going in one direction until it can go no further, not turning back until it hits the wall, and is generally implemented recursively.

In addition to calculating the minimum path, there is another concept: the minimum spanning tree.

A spanning tree of a connected graph with n nodes is a minimal connected subgraph of the original graph: it contains all n nodes of the original graph and has the fewest edges that keep the graph connected. The minimum spanning tree can be computed with Kruskal's algorithm or Prim's algorithm.

One way to picture it: a graph is a set of points on a plane. Pick up one of the points; the edges that pull all the other vertices up with the minimum total weight, with the redundant edges removed, form the minimum spanning tree.

Of course, the minimum spanning tree is not necessarily unique, and there may be multiple outcomes.

Qin Huai@Viewpoint

Knowing these basic data structures is the most useful when writing code or data modeling, and being able to choose a more appropriate one. Computers serve people, and so do codes. We cannot master all types of data structures all at once, but the basic things will not change much unless a new generation of revolutionary changes.

Programs are composed of data structures and algorithms. Data structures are like the cornerstone, ending with a sentence in the "Data Structure C Language" version:

In order to write a "good" program, it is necessary to analyze the characteristics of the objects to be processed and the relationships between the objects to be processed, which is the background of the discipline and development of "data structure".

【Author's brief introduction】 :
Qin Huai, author of the public account [ Qin Huai Grocery Store ], personal website: http://aphysia.cn, the road of technology is not at one time, the mountains are high and the rivers are long, even if it is slow, it is endless.

Sword Point Offer All Problem Solutions PDF

Open Source Programming Notes
