Finding the top-1 IP address on the Linux kernel receive path

At work, I finally ran into a real-world version of an interview question:

  • Algorithm question: given an unordered linked list with a large number of nodes, some of which are known to repeat, find the element that repeats the most and give the time complexity. For example, in 20-1-2-3-5-7-3-20-12-3, the value 3 appears three times and 20 appears twice, so the answer is obviously 3.
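As a baseline for the interview question, here is a minimal sketch of the textbook counting answer, which runs in O(n) expected time. It takes an array rather than a linked list for brevity, and the names `most_frequent` and `struct slot` are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical open-addressing hash table slot: value -> repeat count. */
struct slot {
    int key;
    size_t count;
    int used;
};

/* Returns the most frequent value in vals[0..n-1] (first one to reach the
 * highest count wins ties). One pass, O(n) expected time, O(n) memory. */
static int most_frequent(const int *vals, size_t n)
{
    size_t cap = 16, best_count = 0;
    int best = 0;
    struct slot *tab;

    assert(n > 0);
    while (cap < 2 * n)          /* keep the load factor at or below 1/2 */
        cap <<= 1;
    tab = calloc(cap, sizeof(*tab));
    assert(tab != NULL);

    for (size_t i = 0; i < n; i++) {
        /* Multiplicative hash, then linear probing on collision. */
        size_t h = ((unsigned int)vals[i] * 2654435761u) & (cap - 1);

        while (tab[h].used && tab[h].key != vals[i])
            h = (h + 1) & (cap - 1);
        tab[h].used = 1;
        tab[h].key = vals[i];
        if (++tab[h].count > best_count) {
            best_count = tab[h].count;
            best = vals[i];
        }
    }
    free(tab);
    return best;
}
```

This needs memory proportional to the number of distinct values, which is exactly what becomes a problem when the keys are IP addresses on a kernel fast path.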

In traffic analysis, DDoS detection and protection, traffic cleaning and so on, a very common requirement is "find the top N", and there is no shortage of algorithms for it:

  • Sort
  • Maximum heap
  • LRU
  • Bitmap counter

There are plenty of algorithms in theory, but real deployments always run into engineering problems. For example, lock overhead for concurrent operations is unavoidable. So it is natural to think about designing a mechanism rather than designing an algorithm.

It is also worth noting that detecting abnormal traffic is a fuzzy operation. It does not necessarily need precise calculation or exact matching; it only needs to locate the anomaly. This suggests another approach: use percpu data structures and tolerate some loss of accuracy when reading the data, trading accuracy for efficiency.

Below are two methods I thought of on my way home from work. Both are well suited to finding the top 1, and with small changes they can find the top N. Both are simple and straightforward to implement.
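To make the accuracy-for-efficiency trade concrete, here is a user-space sketch of a per-CPU style counter. It is only an illustration: `NR_SLOTS`, `struct approx_counter`, `approx_add` and `approx_read` are hypothetical names, and a real kernel version would use the percpu infrastructure instead of a plain array.

```c
#include <assert.h>

#define NR_SLOTS 4   /* hypothetical number of CPUs */

/* One slot per CPU; each CPU only ever writes its own slot. */
struct approx_counter {
    long slot[NR_SLOTS];
};

/* Update path: touches only the caller's slot, so no lock is taken. */
static void approx_add(struct approx_counter *c, int cpu, long delta)
{
    assert(cpu >= 0 && cpu < NR_SLOTS);
    c->slot[cpu] += delta;
}

/* Read path: sums the slots without stopping the writers. With concurrent
 * writers the result may miss in-flight updates, which is the accepted
 * loss of accuracy. */
static long approx_read(const struct approx_counter *c)
{
    long sum = 0;

    for (int i = 0; i < NR_SLOTS; i++)
        sum += c->slot[i];
    return sum;
}
```

The reader pays for a walk over all slots, but reads (anomaly checks) are rare compared with per-packet updates, so the cost lands on the cold path.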

Method 1: Multi-hash linked lists (a Bloom-filter-like idea)

Use two (or more) hash tables with different hash algorithms. Each IP address structure is linked into its bucket in every table, and each bucket keeps a count.

The data structure is as follows:

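#include <linux/types.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/atomic.h>

/* One hash bucket: a counted, lock-protected list of IP items. */
struct bucket {
    int hash;
    atomic_t count;          /* number of items linked into hlist */
    spinlock_t lock;         /* protects hlist */
    struct list_head hlist;
};

struct IP_item {
    struct list_head list;   /* links the item into a bucket's hlist */
    u32 ipaddr;
};

#define HSIZE    8192

/* Two hash tables, one per hash algorithm. */
struct bucket hlist[2][HSIZE];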

Operations:

  • IP entry: compute hash1 and hash2 from the IP address (source or destination, depending on configuration), link the IP structure into the corresponding list in each table, and increment each bucket's count.

  • Timer expiry or conntrack destruction: remove the corresponding IP structure and decrement the bucket counts.
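The two operations above can be sketched in user space as follows. This is not the kernel code: it has no locking, uses a hand-rolled singly linked list, allocates one node per table, and the two multiplicative hash constants in `hash_ip` are arbitrary choices made for this illustration. `ip_enter` and `ip_leave` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define HSIZE 8192

struct ip_item {
    struct ip_item *next;
    uint32_t ipaddr;
};

struct bucket {
    int count;               /* number of items on the list */
    struct ip_item *head;
};

static struct bucket tables[2][HSIZE];

/* Two different hashes (assumed); the top 13 bits give 0..HSIZE-1. */
static unsigned int hash_ip(int which, uint32_t ip)
{
    static const uint32_t mult[2] = { 2654435761u, 2246822519u };

    return (uint32_t)(ip * mult[which]) >> 19;
}

/* IP entry: link a node into one bucket per table and bump its count. */
static void ip_enter(uint32_t ip)
{
    for (int t = 0; t < 2; t++) {
        struct bucket *b = &tables[t][hash_ip(t, ip)];
        struct ip_item *it = malloc(sizeof(*it));

        assert(it != NULL);
        it->ipaddr = ip;
        it->next = b->head;
        b->head = it;
        b->count++;
    }
}

/* Timer expiry or conntrack destruction: unlink one matching node per table. */
static void ip_leave(uint32_t ip)
{
    for (int t = 0; t < 2; t++) {
        struct bucket *b = &tables[t][hash_ip(t, ip)];

        for (struct ip_item **pp = &b->head; *pp; pp = &(*pp)->next) {
            if ((*pp)->ipaddr == ip) {
                struct ip_item *victim = *pp;

                *pp = victim->next;
                free(victim);
                b->count--;
                break;
            }
        }
    }
}
```

In the kernel version, each bucket's `spinlock_t lock` would be held around the list update, which is the cost the evaluation below worries about.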

Anomaly detection:

  • If, in both hash tables at the same time, some bucket's count exceeds $\beta$ times the average count, and the two largest bucket counters (one per table) differ by no more than $\alpha$, the traffic is regarded as abnormal:

    $L_{max} > \beta L_{mean}$

    The average length is calculated as follows:

    $L_{mean} = \dfrac{\sum_{n=0}^{N_{buckets}-1} L_n - L_{max}}{N_{buckets}-1}$

  • Traverse the first $\gamma$ elements of the lists of the two largest buckets ($\gamma$ is twice the average length) and intersect the two lists. The address that repeats most in the intersection is the abnormal IP address. See LeetCode 349: https://leetcode-cn.com/problems/intersection-of-two-arrays/
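The threshold test above can be sketched as follows, assuming the bucket counts have been copied into plain `int` arrays. `find_spike` and `tables_abnormal` are hypothetical names, and reading "the two largest bucket counters" as the largest bucket of each table is this sketch's interpretation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Returns the index of the largest counter if L_max > beta * L_mean,
 * where L_mean is the average of all counters except the largest.
 * Returns -1 if there is no such spike. */
static int find_spike(const int *counts, size_t n, double beta)
{
    size_t imax = 0;
    long sum = 0;
    double mean;

    assert(n > 1);
    for (size_t i = 0; i < n; i++) {
        sum += counts[i];
        if (counts[i] > counts[imax])
            imax = i;
    }
    mean = (double)(sum - counts[imax]) / (double)(n - 1);
    return counts[imax] > beta * mean ? (int)imax : -1;
}

/* Abnormal only if both tables have a spike and the two spikes are
 * within alpha of each other. */
static int tables_abnormal(const int *t1, const int *t2, size_t n,
                           double beta, int alpha)
{
    int i1 = find_spike(t1, n, beta);
    int i2 = find_spike(t2, n, beta);

    if (i1 < 0 || i2 < 0)
        return 0;
    return abs(t1[i1] - t2[i2]) <= alpha;
}
```

For example, with counts {2, 3, 100, 1, 2} the mean of the other buckets is 2, so with β = 4 the bucket holding 100 is reported as a spike.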

Evaluation:

  • IP entry requires a hash computation: O(1) time complexity.

  • Inserting the IP structure into the hash bucket's linked list: O(1) time complexity. (The spinlock may be expensive; it can be optimized with percpu data.)

  • Anomaly detection does one bubbling pass to find the maximum bucket counter of each hash table. The number of buckets is fixed, so this is O(1) time complexity.

  • Traversing $\gamma$ elements of the two linked lists: O(n) time complexity, but assuming the hash algorithm is uniform, $L_{mean}$ is very small.
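The intersection step can be sketched like this, assuming the first γ addresses of each of the two lists have already been copied into arrays. `top_common_ip` is a hypothetical helper, and scoring an address by the smaller of its two repeat counts is this sketch's choice. The nested loops cost O(γ²), which is acceptable only because γ is small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of times ip occurs in a[0..n-1]. */
static size_t occurrences(const uint32_t *a, size_t n, uint32_t ip)
{
    size_t c = 0;

    for (size_t i = 0; i < n; i++)
        if (a[i] == ip)
            c++;
    return c;
}

/* Finds the address present in both arrays that repeats the most
 * (scored by the smaller of its two repeat counts). Returns 1 and
 * stores it in *out, or returns 0 if the arrays share no address. */
static int top_common_ip(const uint32_t *a, size_t na,
                         const uint32_t *b, size_t nb, uint32_t *out)
{
    size_t best = 0;

    for (size_t i = 0; i < na; i++) {
        size_t ca = occurrences(a, na, a[i]);
        size_t cb = occurrences(b, nb, a[i]);
        size_t score = ca < cb ? ca : cb;

        if (score > best) {
            best = score;
            *out = a[i];
        }
    }
    return best > 0;
}
```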

Method 2: Hash counters (also a Bloom-filter-like idea)

Split the IP address into 8-bit octets, give each octet its own group of 256 counters, and maintain two hash counter tables alongside them.

The data structure is as follows:

#include <linux/atomic.h>

struct bucket {
    int hash;
    atomic_t count;
};

struct bcounter {
    atomic_t counter;
};

#define HSIZE    8192

struct bucket hlist[2][HSIZE];      /* two hash counter tables */
struct bcounter counter[4][256];    /* one group of 256 counters per octet */

Operations:

  • IP entry: compute hash1 and hash2 from the IP address (source or destination, depending on configuration) and increment the corresponding counter in each of the two hash tables. At the same time, in each per-octet counter group, increment the counter indexed by that octet's value.

  • Timer expiry or conntrack destruction: decrement the corresponding counters.
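A user-space sketch of these two operations follows. Plain `int` counters stand in for `atomic_t`, the hash constants in `hash_ip` are arbitrary choices for this illustration, and `ip_account` is a hypothetical name covering both entry (delta +1) and removal (delta -1).

```c
#include <assert.h>
#include <stdint.h>

#define HSIZE 8192

static int hcount[2][HSIZE];     /* two hash counter tables */
static int octet_count[4][256];  /* one group of 256 counters per octet */

/* Two different hashes (assumed); the top 13 bits give 0..HSIZE-1. */
static unsigned int hash_ip(int which, uint32_t ip)
{
    static const uint32_t mult[2] = { 2654435761u, 2246822519u };

    return (uint32_t)(ip * mult[which]) >> 19;
}

/* delta = +1 on IP entry, -1 on timer expiry or conntrack destruction. */
static void ip_account(uint32_t ip, int delta)
{
    assert(delta == 1 || delta == -1);
    for (int t = 0; t < 2; t++)
        hcount[t][hash_ip(t, ip)] += delta;
    /* Octet 0 is the lowest 8 bits of the address, octet 3 the highest. */
    for (int o = 0; o < 4; o++)
        octet_count[o][(ip >> (8 * o)) & 0xff] += delta;
}
```

Every update is a handful of independent counter increments with no list to protect, which is why this method needs no spinlock.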

Anomaly detection:

  • If, in both hash tables at the same time, some bucket's counter exceeds $\beta$ times the average counter, and the two largest bucket counters (one per table) differ by no more than $\alpha$, the traffic is regarded as abnormal:

    $L_{max} > \beta L_{mean}$

    The average length is calculated as follows:

    $L_{mean} = \dfrac{\sum_{n=0}^{N_{buckets}-1} L_n - L_{max}}{N_{buckets}-1}$

  • Take the largest counter in each per-octet counter group, turn its index into that octet's value, and splice the four octets into an IP address. Then verify it against the two hash tables: if it hashes into the largest bucket of both tables (that is, the two buckets found in the first step), the spliced address is the abnormal address. Otherwise, pick the next largest counter and try again.

The probability that splicing the maxima of the per-octet counter groups gives the right address is not low, assuming that IP addresses follow a uniform, long-tailed distribution.
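The splice-and-verify step can be sketched as below. It tries only the first candidate (the maximum of every octet group) and does not implement the fallback to the next largest counters. As before, the hash constants and the names `ip_account`, `argmax` and `top1_candidate` are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HSIZE 8192

static int hcount[2][HSIZE];     /* two hash counter tables */
static int octet_count[4][256];  /* one group of 256 counters per octet */

static unsigned int hash_ip(int which, uint32_t ip)
{
    static const uint32_t mult[2] = { 2654435761u, 2246822519u };

    return (uint32_t)(ip * mult[which]) >> 19;   /* 0..HSIZE-1 */
}

static void ip_account(uint32_t ip, int delta)
{
    for (int t = 0; t < 2; t++)
        hcount[t][hash_ip(t, ip)] += delta;
    for (int o = 0; o < 4; o++)
        octet_count[o][(ip >> (8 * o)) & 0xff] += delta;
}

/* Index of the largest element (the first one on ties). */
static size_t argmax(const int *a, size_t n)
{
    size_t best = 0;

    for (size_t i = 1; i < n; i++)
        if (a[i] > a[best])
            best = i;
    return best;
}

/* Splices the per-octet maxima into an address and accepts it only if
 * it hashes into the largest bucket of both tables. */
static int top1_candidate(uint32_t *out)
{
    uint32_t ip = 0;

    for (int o = 0; o < 4; o++)
        ip |= (uint32_t)argmax(octet_count[o], 256) << (8 * o);
    for (int t = 0; t < 2; t++)
        if (hash_ip(t, ip) != argmax(hcount[t], HSIZE))
            return 0;
    *out = ip;
    return 1;
}
```

If 10.0.0.1 is counted 50 times among a few one-off addresses, every octet group peaks at the matching octet of 10.0.0.1, and the spliced address passes both hash checks.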

Evaluation:

  • It avoids maintaining the hash linked lists, and in particular avoids the spinlock.

Optimized version

You can add two more overlapping per-octet counter groups to improve accuracy. Then, going from low to high, if for each octet the high 4 bits join up with the low 4 bits of the next octet, the result is with very high probability the top-1 address, and the same approach extends to the top-N addresses.

The idea behind this is that under normal conditions, with no spikes, IP addresses are spread out evenly enough. Under abnormal conditions, every sub-range of the 32-bit IP address shows a spike, and splicing all the spikes together gives a complete IP address.

However, if two or more IP addresses flood at the same level, detection becomes difficult, and the permutations and combinations have to be checked. One way to handle this is to compute the top N for each sub-range and then verify the combinations.

Summary

Everyone suggested that I just use a bitmap directly. For IPv4 addresses, that consumes 4G of address space. But I am doing this in the kernel, where 4G is really not workable (actually it would be fine, but it still feels inelegant). A bitmap is a good method, but how can it be made to take up less memory?

So I thought of a fuzzy solution, namely method 1. It may fail in extreme cases, for example when the hash algorithm is compromised and the hash distribution is badly skewed, but it works well most of the time. On review, though, the per-bucket spinlock still bothered me.

Method 2 is based directly on the bitmap idea, so it is easy to use and lock-free. In fact, if you look carefully at these split bit sub-ranges, there is more to dig out, and getting the top N is not a problem. I simply reached my goal and stopped there.

BTW, in future interviews, please stop asking people to write a search algorithm. Examine engineering issues instead, such as how to reduce lock overhead for concurrent operations. Writing a spinlock that is actually usable is far more useful than writing an algorithm! Most programmers never write quicksort in their careers except in interviews. Yet even I, an adult who has written only a few lines of code and cannot really program, ended up having to implement a top-1 IP address finder in the Linux kernel, didn't I?


The leather shoes in Wenzhou, Zhejiang are wet, so they won’t get fat in rain.


Origin blog.csdn.net/dog250/article/details/114433315