HBase extension

1 HBase capabilities in commercial projects

Every day:

1) Message volume: more than 6 billion messages are sent and received

2) Nearly 100 billion data writes

3) A peak of about 1.5 million operations per second

4) Reads account for about 55% of all operations and writes for about 45%

5) More than 2 PB of data, or 6 PB in total once redundant copies are counted

6) Data grows by about 300 GB per month.

2 Bloom filter

In daily life, and in the design of computer software, we often need to judge whether an element is in a collection. For example, a word processor needs to check whether an English word is spelled correctly (that is, whether it is in a known dictionary); the FBI needs to check whether a suspect's name is already on a list of suspects; a web crawler needs to check whether a website has already been visited; and so on. The most direct approach is to store every element of the collection in the computer and, when a new element arrives, compare it directly against the stored elements. Such collections are generally stored in a hash table. Its advantage is that it is fast and accurate; its disadvantage is the storage space it costs. When the collection is small this problem is not significant, but when the collection is huge, the low storage efficiency of the hash table becomes apparent.

For example, public e-mail providers such as Yahoo, Hotmail and Gmail always need to filter out spam sent by spammers. One way is to record the e-mail addresses that send spam. Because the senders keep registering new addresses, there are at least billions of spam addresses worldwide, and storing them all takes a large number of network servers. With a hash table, every one hundred million e-mail addresses require 1.6 GB of memory. (The concrete way to implement this with a hash table is to map each e-mail address to an eight-byte information fingerprint, see googlechinablog.com/2006/08/blog-post.html, and then store the fingerprints in the hash table. Because the storage efficiency of a hash table is generally only 50%, one e-mail address occupies sixteen bytes, so one hundred million addresses take about 1.6 GB, that is, 1.6 billion bytes of memory.) Storing billions of e-mail addresses may therefore require hundreds of GB of memory. Apart from supercomputers, an ordinary server cannot hold that much.

A Bloom filter needs only 1/8 to 1/4 of the space of a hash table to solve the same problem.
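The memory figures above can be checked with a short back-of-the-envelope sketch. The numbers (eight-byte fingerprint, 50% storage efficiency, one hundred million addresses, the 1/8 to 1/4 ratio) are taken from the text; the variable names are only illustrative, and a GB is assumed to be 10^9 bytes, as the text's "1.6 billion bytes" suggests.

```python
# Back-of-the-envelope check of the memory figures quoted in the text.
FINGERPRINT_BYTES = 8          # one fingerprint per e-mail address
HASH_TABLE_EFFICIENCY = 0.5    # storage efficiency stated in the text
ADDRESSES = 100_000_000        # one hundred million addresses

bytes_per_address = FINGERPRINT_BYTES / HASH_TABLE_EFFICIENCY   # 16 bytes
hash_table_bytes = int(bytes_per_address * ADDRESSES)           # 1.6 billion bytes

# The text states that a Bloom filter needs 1/8 to 1/4 of that space.
bloom_low_bytes = hash_table_bytes // 8
bloom_high_bytes = hash_table_bytes // 4

GB = 10**9  # assumption: decimal gigabyte
print(f"hash table:   {hash_table_bytes / GB:.1f} GB")
print(f"Bloom filter: {bloom_low_bytes / GB:.1f} to {bloom_high_bytes / GB:.1f} GB")
```

Under these assumptions the Bloom filter needs roughly 16 to 32 bits per address instead of the 128 bits the hash table spends on each one.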

A Bloom Filter is a highly space-efficient randomized data structure. It uses a simple bit array to represent a set and can determine whether an element belongs to that set. This efficiency comes at a cost: when determining whether an element belongs to the set, it may mistake an element that is not in the set for one that is (a false positive). A Bloom Filter is therefore not suitable for "zero error" applications. In applications that can tolerate a low error rate, a Bloom Filter trades a very small number of errors for significant savings in storage space.

Let us look at how exactly a Bloom Filter represents a set with a bit array. In its initial state, a Bloom Filter is a bit array of m bits, with every bit set to 0.

To represent a set S = {x1, x2, ..., xn} of n elements, a Bloom Filter uses k independent hash functions, each of which maps every element of the set into the range {1, ..., m}. For any element x, the position hi(x) given by the i-th hash function is set to 1 (1 ≤ i ≤ k). Note that if a position is set to 1 several times, only the first time has any effect; later ones change nothing. In Figure 9-6, k = 3, and two of the hash functions select the same position (the fifth bit from the left).

To determine whether y belongs to the set, we apply the k hash functions to y. If every position hi(y) is 1 (1 ≤ i ≤ k), we consider y an element of the set; otherwise y is not an element of the set. In Figure 9-7, y1 is not an element of the set, while y2 either belongs to the set or is a false positive.

· To add an element, hash it with the k hash functions to obtain k bit positions in the Bloom filter, and set those k bits to 1.

· To query an element, that is, to determine whether it is in the set, hash it with the k hash functions to obtain k bit positions. If all k bits are 1, the element is considered to be in the set; if any one of them is not 1, the element is definitely not in the set (because if it were, those k bits would all have been set to 1 when it was added).

· Removing elements is not allowed, because that would reset the corresponding k bits to 0, and some of those bits are likely to belong to other elements as well. Removal would therefore introduce false negatives, which is absolutely not allowed.
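The add and query operations above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name `BloomFilter` is hypothetical, the k hash functions are simulated by salting SHA-256 with an index (an assumption, since the text does not name any hash function), and positions are 0-based rather than in {1, ..., m}.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # all bits start at 0

    def _positions(self, item: str):
        # Assumption: k hash functions are simulated by salting SHA-256
        # with the function index i; positions are 0-based here.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            # Setting a bit that is already 1 has no further effect.
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # The item is reported as present only if all k bits are 1.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


blacklist = BloomFilter(m=1024, k=3)
blacklist.add("spammer@example.com")
print("spammer@example.com" in blacklist)  # True: an added item is always found
print("friend@example.com" in blacklist)   # normally False; True would be a false positive
```

There is deliberately no `remove` method, for the reason given in the last bullet: clearing a bit could also erase the trace of another element.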

A Bloom filter never misses a suspicious address that is on the blacklist. It does have one shortcoming, however: there is a very small chance that an e-mail address that is not on the blacklist is judged to be on it, because a good e-mail address may happen to map to eight bits that have all been set to 1. Fortunately this is very unlikely; we call this probability the false positive rate.

The benefits of a Bloom filter are that it is fast and saves space, but it has a certain false positive rate. A common remedy is to keep a small whitelist that stores the e-mail addresses that might be misjudged.
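The whitelist remedy might look like the following sketch. Everything here is illustrative: the addresses are made up, the filter is deliberately tiny (64 bits, 2 hash functions) so that false positives actually occur, and the hash functions are again simulated with salted SHA-256.

```python
import hashlib

M, K = 64, 2  # deliberately tiny so that false positives are easy to provoke


def positions(item: str):
    # Assumption: K hash functions simulated by salting SHA-256 with an index.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big") % M
        for i in range(K)
    ]


bits = 0  # the bit array, kept in a single Python int
blacklist_source = [f"spam{n}@example.com" for n in range(20)]
for address in blacklist_source:
    for pos in positions(address):
        bits |= 1 << pos


def in_filter(address: str) -> bool:
    return all((bits >> pos) & 1 for pos in positions(address))


# Known-good addresses that the filter misjudges go into a small whitelist.
known_good = [f"friend{n}@example.com" for n in range(200)]
whitelist = {a for a in known_good if in_filter(a)}


def is_spam(address: str) -> bool:
    # An address counts as spam only if the filter matches it
    # and it has not been explicitly whitelisted.
    return in_filter(address) and address not in whitelist


print(len(whitelist), "of", len(known_good), "good addresses needed whitelisting")
```

The whitelist only has to hold the misjudged addresses, so it stays small as long as the false positive rate of the filter is low.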

For more advanced details of the Bloom filter algorithm, such as estimating the error rate, computing the optimal number of hash functions, and computing the size of the bit array, see http://blog.csdn.net/jiaomeng/article/details/1495500 .
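As a rough guide to those three calculations, the standard approximations can be sketched as below. The formulas are the usual textbook ones for a Bloom filter with m bits, n elements and k hash functions; the function names are hypothetical, and the sample numbers (16 bits per address, k = 8) are assumptions derived from the 1/8 ratio and the "eight bits" mentioned earlier.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false positive rate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m: int, n: int) -> float:
    """Number of hash functions that minimizes the rate: (m/n) * ln 2."""
    return (m / n) * math.log(2)


def bits_needed(n: int, p: float) -> int:
    """Bit array size for n elements and target rate p: -n * ln p / (ln 2)^2."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


n = 100_000_000   # one hundred million addresses
m = 16 * n        # assumption: 16 bits per address (1/8 of a 16-byte hash table slot)
k = 8             # assumption: the eight bits per address mentioned in the text

print(f"false positive rate: {false_positive_rate(m, n, k):.6f}")
print(f"optimal k for this m/n: {optimal_k(m, n):.2f}")
print(f"bits for 1% error on 1000 items: {bits_needed(1000, 0.01)}")
```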

3 HBase 2.0 new features

At around 2:00 on August 22, 2017, HBase released 2.0.0 alpha-2, with 500 patches applied since the previous version. Let us take a look at the new features of HBase 2.0.

Latest documentation:

http://hbase.apache.org/book.html#ttl

The official release page:

http://mail-archives.apache.org/mod_mbox/www-announce/201708.mbox/<CADcMMgFzmX0xYYso-UAYbU7V8z-Obk1J4pxzbGkRzbP5Hps+iA@mail.gmail.com

For example:

1) Regions are kept in multiple redundant copies

The primary region handles reads and writes. The secondary regions are hosted on other HRegionServers; they serve reads and synchronize data from the primary region. If synchronization is not timely, a client may read stale data from a secondary region (the primary region has not yet had time to flush the changes in its memstore).

2) More changes

https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12340859&styleName=&projectId=12310753&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED%7Ce6f233490acdf4785b697d4b457f7adb0a72b69f%7Clout
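The stale-read risk described in item 1) can be illustrated with a toy model. This is not the HBase API: the classes `PrimaryRegion` and `SecondaryRegion` and the explicit `sync()` call are hypothetical stand-ins that only mimic the behavior described above.

```python
class PrimaryRegion:
    """Toy stand-in for the primary copy of a region (not the HBase API)."""

    def __init__(self) -> None:
        self.data = {}

    def put(self, row: str, value: str) -> None:
        self.data[row] = value


class SecondaryRegion:
    """Toy read-only copy that only sees what has been synchronized."""

    def __init__(self, primary: PrimaryRegion) -> None:
        self.primary = primary
        self.snapshot = {}

    def sync(self) -> None:
        # Copy the primary's current state; until this runs, reads may be stale.
        self.snapshot = dict(self.primary.data)

    def get(self, row: str):
        return self.snapshot.get(row)


primary = PrimaryRegion()
secondary = SecondaryRegion(primary)

primary.put("row1", "v1")
secondary.sync()
primary.put("row1", "v2")        # written to the primary, not yet synchronized

stale = secondary.get("row1")    # still "v1": a stale read from the secondary
secondary.sync()
fresh = secondary.get("row1")    # "v2" once synchronization has caught up
print(stale, fresh)
```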

 


Origin www.cnblogs.com/tesla-turing/p/11668558.html