Practicing algorithms with Zuo Shen (Bloom filter and consistent hashing)

Understanding the Bloom filter

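//Bloom filter and consistent hashing
//The Bloom filter is a common structure in web-crawler projects and blacklist projects
//It answers one query: does a given item exist in a given set?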

Underlying structure:
a large array in which each position is a single bit
Why a Bloom filter produces errors:
Suppose there are 10 billion URLs but you allocate only 10 bits and use four hash functions, so each URL sets (blackens) four bits. All 10 bits are bound to end up set, and at lookup time every URL, including ones that were never added, is reported as present, so the answer is wrong.
If instead there are 10 URLs and you allocate 10 billion bits, the answers are almost certainly right.
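
A minimal Python sketch of the idea above: a bit array plus k hash functions. The class name `BloomFilter`, the method names, and the way positions are derived (SHA-256 salted with the function index) are assumptions made for illustration, not something the original specifies. The URL count is scaled down from 10 billion to 1,000 so the example runs instantly.

```python
import hashlib


class BloomFilter:
    """Bit array of m bits plus k hash functions (illustrative sketch)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # one bit per position

    def _positions(self, item):
        # Assumption: simulate k hash functions by salting SHA-256
        # with the function index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)  # blacken the bit

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Far too few bits: 10 bits, 4 hash functions, 1,000 URLs.
tiny = BloomFilter(m=10, k=4)
for n in range(1000):
    tiny.add(f"http://example.com/{n}")

# Plenty of bits: 10 URLs in 100,000 bits.
roomy = BloomFilter(m=100_000, k=4)
for n in range(10):
    roomy.add(f"http://example.com/{n}")
```

Once every bit in `tiny` is set, `tiny.might_contain(...)` answers True for any URL at all, which is exactly the error described above. `roomy` never misses a URL that was added, and a false positive there is extremely unlikely.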

So the key question is how large the array should be.
A Bloom filter exists to solve membership lookup on large data sets: it uses little space, at the cost of a small probability of error.

What determines a Bloom filter's parameters
The size of a Bloom filter has nothing to do with the size of each item you look up; it depends only on how many items there are.
How to work out the parameters
A Bloom filter needs only two inputs from you: the number of elements you want to store and the error rate you can tolerate. From these you can calculate how many bits the array needs and the number of hash functions K. Once the array size and the number of hash functions are fixed, everything about the filter is fixed.

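Mathematical formulas (the standard ones for a Bloom filter):

$$m = -\frac{n \ln p}{(\ln 2)^2}$$

$$k = \frac{m}{n} \ln 2$$

n: the number of elements to store
p: the error rate you can tolerate
m: how many bits the array needs
k: how many hash functions are needed
Whenever either of the first two formulas yields a fraction, round it up.
Rounding changes the error rate, however, so compute the real error rate from the rounded m and k:

$$p_{\text{real}} = \left(1 - e^{-\frac{n k}{m}}\right)^{k}$$

The real error rate should be no higher than the one you asked for. That holds when rounding m up adds meaningful slack or you allocate more bits than the formula requires; rounding k alone can leave it marginally above the target.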
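
The calculation can be sketched in Python, assuming the standard Bloom filter sizing formulas (written out in the comments). The function name `bloom_parameters` and the sample inputs (10 billion elements, a 0.01% error rate) are hypothetical choices for illustration.

```python
import math


def bloom_parameters(n, p):
    """Return (m, k, actual_p) for n elements and target error rate p."""
    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded up to a whole number of hash functions
    k = math.ceil((m / n) * math.log(2))
    # error rate actually obtained with the rounded m and k
    actual_p = (1 - math.exp(-k * n / m)) ** k
    return m, k, actual_p


m, k, actual_p = bloom_parameters(n=10_000_000_000, p=0.0001)
print(f"bits: {m}, about {m / 8 / 1e9:.1f} GB")
print(f"hash functions: {k}")
print(f"actual error rate: {actual_p:.6%}")
```

Note that the size of each URL never enters the calculation; only the count n and the tolerated error rate p do.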

So if someone asks you a blacklist-style question and the space you are given is far smaller than what storing the data exactly would need, ask one more question: is a certain error rate allowed? As soon as they hear that, they will know you understand the subject.

Consistent hashing

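One more point I forgot to mention: the front-end server keeps an array holding the hash values of the back-end storage servers, stored in sorted order. When a request arrives, a binary search over this array quickly finds the server that the request maps to.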
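
A minimal sketch of that lookup using Python's `bisect` module. The server addresses, the helper names `ring_hash` and `locate`, and the use of SHA-256 truncated to 32 bits as the hash function are all assumptions for illustration.

```python
import bisect
import hashlib


def ring_hash(key):
    # Assumption: SHA-256 truncated to 32 bits stands in for the hash function.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical addresses

# The sorted array kept on the front-end server: (hash value, server).
ring = sorted((ring_hash(s), s) for s in servers)
points = [h for h, _ in ring]


def locate(request_key):
    """Return the first server at or after the request's hash on the ring."""
    i = bisect.bisect_left(points, ring_hash(request_key))
    return ring[i % len(ring)][1]  # wrap around past the largest hash


print(locate("user:42"))
```

The modulo handles the wrap-around: a request whose hash is larger than every server hash goes to the server with the smallest hash.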

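Arranging the machines in a ring as above has two problems:

1. When there are only a few servers, the hash domain (that is, the ring) will not necessarily be split evenly among them.
2. Even if you solve the first problem and the servers split the hash domain evenly, the split is bound to become uneven again as soon as you add another server.

Both problems come from the nature of hash functions. A hash function scatters its outputs uniformly, but the uniformity only shows with large numbers; here that means a large number of servers. Only at scale do the properties of hashing become visible.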
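
The first problem can be checked with a short Python sketch that measures how much of the ring each of three servers owns. The addresses, the helper names, and the 32-bit ring built from SHA-256 are assumptions for illustration.

```python
import hashlib

RING_SIZE = 2 ** 32


def ring_hash(key):
    # Assumption: SHA-256 truncated to 32 bits stands in for the hash function.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


def ring_shares(servers):
    """Fraction of the ring owned by each server (the arc ending at its point)."""
    points = sorted((ring_hash(s), s) for s in servers)
    if len(points) == 1:
        return {points[0][1]: 1.0}
    shares = {}
    for i, (h, name) in enumerate(points):
        prev = points[i - 1][0]  # for i == 0 this wraps to the last point
        shares[name] = ((h - prev) % RING_SIZE) / RING_SIZE
    return shares


shares = ring_shares(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With only three points on the ring, nothing forces the three arcs to be close to one third each.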

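So how do we solve these problems?
One method: virtual nodes.

Suppose the ring is the same ring and the servers are the same three servers:
M1, M2, M3
Give M1 10,000 virtual nodes on the hash ring, and do the same for M2 and M3, so that 30,000 nodes are scattered around the ring. Any data that belongs to one of M1's 10,000 nodes is stored on M1. With this many nodes on the ring, the hash domain is naturally split close to evenly.
That solves the first problem. Adding a server works the same way: give the new server a large number of virtual nodes and the split stays even, which solves the second problem.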

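The virtual nodes generated for the servers work much like the routing table in a router: the table lets you find each node, and from any node you can tell which machine it belongs to.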
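One technique (virtual nodes) solves both problems.
So how do you make one server produce 10,000 nodes?
There are many ways; one of them is to hash the server IP plus a node number.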
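
A hedged Python sketch of virtual nodes built from "server IP + node number". The `#` separator, the addresses, the helper names, and SHA-256 truncated to 64 bits as the hash function are assumptions for illustration. It compares how 10,000 sample keys are spread over M1, M2 and M3 with one node per server versus 10,000 virtual nodes per server.

```python
import bisect
import hashlib
from collections import Counter


def ring_hash(key):
    # Assumption: SHA-256 truncated to 64 bits stands in for the hash function.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def build_ring(servers, vnodes):
    """Sorted list of (hash, server); each server appears vnodes times."""
    return sorted((ring_hash(f"{ip}#{i}"), ip)
                  for ip in servers for i in range(vnodes))


def locate(ring, points, key):
    # First virtual node at or after the key's hash, wrapping around the ring.
    i = bisect.bisect_left(points, ring_hash(key))
    return ring[i % len(ring)][1]


def key_shares(servers, vnodes, n_keys=10_000):
    """Fraction of n_keys sample keys that lands on each server."""
    ring = build_ring(servers, vnodes)
    points = [h for h, _ in ring]
    counts = Counter(locate(ring, points, f"key-{n}") for n in range(n_keys))
    return {s: counts[s] / n_keys for s in servers}


servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # M1, M2, M3 (hypothetical IPs)
few = key_shares(servers, vnodes=1)
many = key_shares(servers, vnodes=10_000)
print("1 node per server:      ", few)
print("10,000 nodes per server:", many)
```

Each ring entry stores the owning server next to the virtual node's hash, which plays the role of the routing table: a lookup finds the node, and the node says which server holds the data.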

Origin blog.csdn.net/weixin_43767691/article/details/103363572