[Machine Learning] [Operating System] [Network] [Algorithm and Data Structure] Knowledge Summary

Machine learning

What to do when L1 is not differentiable

Operating system

Heap and stack differences

The difference between heap and stack:

1. The difference in space allocation:

1) Stack (operating system): automatically allocated and released by the operating system, storing function parameter values, local variable values, etc. Its operation mode is similar to the stack in the data structure;

2) Heap (operating system): Generally, it is allocated and released by the programmer. If the programmer does not release it, the OS may reclaim it when the program ends. The allocation method is similar to a linked list.

2. The difference in caching:

1) The stack uses the first-level cache; values are kept in working storage while they are in use and are released immediately after the call;

2) The heap is stored in the second-level cache, and its life cycle is determined by the virtual machine's garbage collection algorithm (an object is not necessarily reclaimed the moment it becomes an orphan). Accessing these objects is therefore relatively slow.

Heap: stores reference data types in memory, whose sizes cannot be determined in advance. The heap is effectively a linked-list-like structure built from scattered free space in memory; its size is determined directly by the sizes of the reference types it holds, and changes in those sizes directly change the heap.

Stack: stores value types in memory. Its size is limited (about 2 MB here); exceeding it reports an error, i.e. the memory overflows.

3. The difference as data structures:

Heap (data structure): Heap can be regarded as a tree, such as: heap sort;

Stack (data structure): a first-in-last-out data structure.

Features: first in, last out

What situations cause a stack overflow

1) The local array is too large. When the array inside the function is too large, it may cause stack overflow.

2) There are too many levels of recursive calls. A recursive function pushes a new frame onto the stack each time it is called; pushing too many times also causes a stack overflow.

3) A pointer or array goes out of bounds. This situation is the most common, for example when copying a string or processing user input.
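As a small illustration of cause 2), here is a hedged Python sketch (the function name and the limits are invented for the demo) that triggers Python's recursion-depth guard, the interpreter-level analogue of a call-stack overflow:

```python
import sys

def countdown(n):
    # Each recursive call pushes another frame onto the call stack.
    if n == 0:
        return 0
    return countdown(n - 1)

sys.setrecursionlimit(1000)   # keep the limit small for the demo

try:
    countdown(10_000)         # far deeper than the limit allows
except RecursionError as e:
    print("stack overflow analogue:", e)
```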

Algorithm and data structure

Sort

Counting sort / bucket sort

Time complexity O(N + K); it trades space for time. When O(K) > O(N log N), it is less efficient than comparison-based sorting.
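A minimal counting-sort sketch in Python (illustrative only; it assumes non-negative integer keys bounded by K):

```python
def counting_sort(arr, k):
    """Sort non-negative integers in [0, k] in O(N + K) time using O(K) extra space."""
    counts = [0] * (k + 1)
    for x in arr:                       # count occurrences of each key
        counts[x] += 1
    out = []
    for value, c in enumerate(counts):  # emit keys in order, each repeated by its count
        out.extend([value] * c)
    return out

print(counting_sort([3, 1, 4, 1, 5, 2], k=5))  # [1, 1, 2, 3, 4, 5]
```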

TopK: three solutions

1) Partial elimination: get TopK with the help of "bubble sort"

Ideas: (1) We can avoid sorting all the data and sort only part of it. (2) Each round of bubble sort yields one maximum value, so K rounds of sorting yield the TopK.

Time and space complexity: (1) Time complexity: one round of sorting is O(N), so the total for K rounds is O(KN). (2) Space complexity: O(K) to store the TopK obtained, or O(1) if we just read off the last K elements of the original array.
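A sketch of approach 1) in Python, assuming we want the K largest values; each of the K outer passes bubbles one maximum toward the end of the array:

```python
def topk_bubble(arr, k):
    """K rounds of bubble sort leave the K largest values in the last K slots. O(K*N) time."""
    a = list(arr)
    n = len(a)
    for r in range(k):
        for i in range(n - 1 - r):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a[n - k:]   # O(1) extra space besides the working copy

print(topk_bubble([7, 2, 9, 4, 8, 1], 3))  # [7, 8, 9]
```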

2) Partial elimination: get TopK with the help of the data structure "heap"

Ideas: (1) Heaps come in two kinds: the max-heap (big top heap, whose top element is larger than all other elements) and the min-heap (small top heap, whose top element is smaller than all other elements). (2) We use a min-heap here. (3) Take the first K elements, place them in another array, and build a heap from these K elements. (4) Then loop through the data starting at index K; whenever an element is greater than the heap top, assign it to the heap top and re-adjust the min-heap. (5) After the loop, the K-element heap array is the TopK we need.

Time and space complexity: (1) Time complexity: heapifying K elements costs O(K log K); adding the remaining N-K iterations gives a total of O((K + (N-K)) log K), i.e. O(N log K), where K is the number of TopK values to obtain and N is the total data volume. (2) Space complexity: O(K); we only need a K-sized array to store the TopK.
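A sketch of approach 2) using Python's heapq module, which provides exactly the min-heap (small top heap) described above:

```python
import heapq

def topk_heap(arr, k):
    """Keep a size-K min-heap; any element larger than the heap top replaces it. O(N log K)."""
    heap = list(arr[:k])
    heapq.heapify(heap)                 # build a min-heap from the first K elements
    for x in arr[k:]:
        if x > heap[0]:                 # larger than the smallest of the current top K
            heapq.heapreplace(heap, x)  # pop the top, push x, re-adjust the heap
    return heap                         # the K largest values, in heap order

print(sorted(topk_heap([5, 1, 9, 3, 7, 8, 2], 3)))  # [7, 8, 9]
```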

3) Divide and conquer: get TopK with the help of "quick sort"

Ideas: (1) For example, with 1 billion records and a Top 1000 to find, first split the 1 billion records into 1,000 shares of 1 million records each. (2) Find the Top 1000 within each share and merge them into one array; this leaves 1 million records and filters out 99.9% of the data. (3) Run one "round" of quick-sort partitioning on these 1 million records. Suppose the pivot value after the round is S; the array is split into two parts, one greater than S (denoted Si) and one less than S (denoted Sj). (4) If Si has more than 1000 elements, partition Si again, splitting it into a new Si and Sj; if Si has fewer than 1000 elements, we still need 1000 - count(Si) elements from Sj, i.e. we partition Sj. (5) Recurse in this way to obtain the TopK.

Time and space complexity: (1) Time complexity: obtaining the TopK within one share costs O((N/n) log K), so all shares together cost O(N log K); but the divide-and-conquer approach can exploit multi-core and multi-machine resources, so with S threads processing in parallel this becomes O((N/S) log K). The subsequent quick-sort partitioning costs O(N) per round; assuming the result is reached after M rounds, that part costs O(MN). The total is therefore roughly O(MN + (N/S) log K). (2) Space complexity: if each share requires its own array, the space complexity is O(N).
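A single-machine sketch of approach 3) in Python, using only the quick-sort partition step (quickselect) rather than a full sort; the helper names are invented for the example:

```python
import random

def topk_quickselect(arr, k):
    """Rearrange a copy of arr so its first K slots hold the K largest values. Average O(N)."""
    a = list(arr)

    def select(lo, hi, need):
        # Place the `need` largest elements of a[lo..hi] into a[lo..lo+need-1].
        if need <= 0 or lo >= hi:
            return
        p = random.randint(lo, hi)
        a[p], a[hi] = a[hi], a[p]
        pivot, store = a[hi], lo
        for i in range(lo, hi):          # partition in descending order: larger values go left
            if a[i] > pivot:
                a[i], a[store] = a[store], a[i]
                store += 1
        a[store], a[hi] = a[hi], a[store]
        left = store - lo + 1            # size of the "greater than or equal to pivot" half
        if left > need:
            select(lo, store - 1, need)
        elif left < need:
            select(store + 1, hi, need - left)

    select(0, len(a) - 1, k)
    return a[:k]

print(sorted(topk_quickselect([9, 1, 8, 2, 7, 3, 6], 3)))  # [7, 8, 9]
```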

Hash tables: methods of handling collisions

Apriori

Catalan number

[Algorithm] Shocked!!! The most detailed Catalan number article in history!!!

Bipartite graph matching

Bloom filter

Bitmap

Red-black tree/balanced tree

Searching and sorting very large amounts of data

1) Bitmap method

The bitmap method is a relatively novel technique I first saw in Programming Pearls; the idea is ingenious and efficient.
Example usage scenario: sorting roughly 2 GB of data is the basic requirement.

Data: 1. each value is no greater than 800 million; 2. the data type is int; 3. each value is repeated at most once.

Memory: at most 200 MB of memory may be used for the operation.

First, estimate the memory this would occupy. Each value is no greater than 800 million, so what does 800 million actually mean in memory terms?

1 byte = 8 bit

1024 byte = 8 * 1024 bit = 1 KB

1024 KB = 8 * 1024 * 1024 bit = 1 MB = 8,388,608 bit

That is, 1 MB = 8,388,608 bits.

The basic idea of the bitmap method is to use one bit to represent one number. For example, if bit 3 is 1, the value 3 appears in the data; if it is 0, the value 3 does not appear. So when the problem states that each value is repeated at most once, we can consider using the bitmap method to sort big data.

So how much memory does the bitmap method need for this problem? Since each value is no greater than 800 million, we need 800 million bits, which occupy 800,000,000 / 8,388,608 ≈ 95 MB. That satisfies the constraint of using at most 200 MB of memory, and is the foundation that makes the bitmap method applicable here.
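A minimal bitmap sketch in Python (illustrative; a bytearray stands in for the raw bit array, and the no-duplicates assumption comes from the scenario above):

```python
def bitmap_sort(numbers, max_value=800_000_000):
    """Sort distinct non-negative ints <= max_value using one bit per possible value (~95 MB here)."""
    bits = bytearray((max_value >> 3) + 1)   # max_value / 8 bytes, all zero
    for n in numbers:
        bits[n >> 3] |= 1 << (n & 7)         # set bit n
    for n in range(max_value + 1):           # scan the bits in order to emit a sorted sequence
        if bits[n >> 3] & (1 << (n & 7)):
            yield n

print(list(bitmap_sort([300, 5, 42, 7], max_value=1000)))  # [5, 7, 42, 300]
```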

2) Heap sort method

Heap sort is one of the four sorting methods with an average time complexity of O(n log n). Its strength is its excellent performance when finding the first n largest or smallest values among M numbers. Therefore, when we only need the first m largest or smallest values from massive data and do not care about the rest, the heap approach works well.

Usage scenario: Find the 100 largest numbers from 100 million integers

Steps:

(1) Read the first 100 numbers and build a min-heap (small top heap) from them, so that the smallest of the current top 100 sits at the heap top. (Using a heap keeps the space requirement very low: out of 100 million numbers we only ever hold 100 in memory, or whatever base size is chosen, so there is no need to read all the data at once, which reduces the memory requirement.)

(2) Read the remaining numbers in sequence, compare each one with the heap top, and maintain the heap (replace the top and re-adjust whenever a number is larger than it). Reading one disk page per pass and comparing that page's data against the heap in sequence saves IO time.

(3) Sort the heap to get 100 ordered maxima.

Heap sort is a common algorithm, but understanding its usage scenarios can help us understand it better.
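A hedged sketch of the steps above using Python's heapq; the chunks stand in for disk pages, and the data layout is invented for the example:

```python
import heapq

def top100_stream(chunks, k=100):
    """Maintain a size-k min-heap while streaming the data one 'disk page' (chunk) at a time."""
    heap = []
    for chunk in chunks:                 # step (2): read one page at a time, never all the data
        for x in chunk:
            if len(heap) < k:
                heapq.heappush(heap, x)
            elif x > heap[0]:            # larger than the smallest of the current top k
                heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)    # step (3): sort the heap to get k ordered maxima

pages = [range(i, i + 1000) for i in range(0, 100_000, 1000)]  # stand-in for pages on disk
print(top100_stream(pages, k=5))  # [99999, 99998, 99997, 99996, 99995]
```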

3) A more general divide-and-conquer strategy

The divide-and-conquer strategy offers a universal approach to common complex problems. Although in many cases its solution is not the optimal one, it is very versatile. The core of the divide-and-conquer method is to decompose a complex problem into several simpler ones.

Application scenario: sorting 10 GB of data on a single machine with 2 GB of memory

In my opinion, this scenario neither says whether the data contains duplicates, nor gives the range of the values, nor asks only for maximum values. Although divide and conquer may require many IO operations, it is still a feasible way to solve this problem.

Steps:

(1) Sample the big data and, based on the samples, divide the values to be sorted into multiple intervals containing roughly the same number of elements, for example: 1-100, 101-300, ...

(2) Split the big data file into multiple small data files. The number of IO operations and the available hardware resources should be considered here; for example, the size of each small data file can be set to about 1 GB (leaving memory free for the program itself during execution).

(3) Sort the data of each small file with a suitable optimal algorithm, and store the sorted results according to the intervals defined in step (1).

(4) Process the sorted result files within each data interval, finally producing one sorted result file per interval.

(5) Combine the sorted results of all the intervals: divide the big data into small data for processing, then merge.
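A small in-memory sketch of the range-partition idea (the interval boundaries are assumed to come from step (1)'s sampling, and plain lists stand in for the per-interval files that would normally live on disk):

```python
def sort_by_ranges(data, boundaries):
    """Steps (1)-(5) in miniature: partition by interval, sort each bucket, then concatenate."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for x in data:                       # step (2): split the data into per-interval "files"
        i = 0
        while i < len(boundaries) and x > boundaries[i]:
            i += 1
        buckets[i].append(x)
    result = []
    for b in buckets:                    # steps (3)-(5): sort each small "file", merge in order
        result.extend(sorted(b))
    return result

print(sort_by_ranges([250, 7, 180, 90, 301], boundaries=[100, 300]))
# [7, 90, 180, 250, 301]
```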

Origin blog.csdn.net/TQCAI666/article/details/114083464