Elegant and fast statistics for tens of millions of UVs

Definitions

PV is short for page view, the number of times a page is viewed. It is usually the main indicator for measuring an online news channel or website, or even a single piece of online news, and is one of the most commonly used metrics for evaluating website traffic.

UV is short for unique visitor: a natural person who accesses and browses a web page over the Internet.

From these definitions it is clear that pv is easier to design for: every visit to the website increases pv, but it does not necessarily increase uv. uv essentially counts natural persons grouped by some standard, and that standard can be defined by ourselves. For example, visitors from the same IP can be counted as the same uv, which is one of the most common definitions; there are also definitions based on cookies, and so on. Whether pv or uv, a time period is needed to describe it. Usually the pv and uv numbers we quote refer to the data within 24 hours (one natural day).

Compared with uv, pv is technically easier, so today we will talk about uv statistics. Why is uv statistics relatively hard? Because uv involves deduplicating natural persons under the same standard. Especially for a website with tens of millions of uvs, designing a good uv statistics system may not be as easy as it looks.

So let us design a uv statistics system with a natural day as the time period. A natural person (uv) is defined as the same source IP (you can of course define another standard), and the data volume is assumed to be on the order of tens of millions of uvs per day.

Note: the focus of today's discussion is how to design the uv statistics system after the information defining a natural person has been obtained, not how to obtain that definition. The design is not as simple as it looks, because uv traffic can spike instantly with the website's marketing activities, for example when the site runs a flash-sale (seckill) event.

DB-based solution

There is a famous saying in server-side programming: there is nothing that one table cannot solve; if there is, use two tables, and then three. A uv statistics system can indeed be implemented on top of a database, and it is not complicated. The uv record table could look like the following (do not worry too much about whether this table design is reasonable):

Field          Type          Description
IP             varchar(30)   client source ip
DayID          int           shorthand for the date, e.g. 20190629
other fields   int           other field descriptions
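
For illustration, here is a minimal DDL sketch of the table above (the unique constraint is an assumption added here, since it is what makes the existence check fast; names are illustrative):

CREATE TABLE uv_record (
    IP    varchar(30) NOT NULL,  -- client source ip
    DayID int         NOT NULL,  -- e.g. 20190629
    -- other fields ...
    CONSTRAINT uq_uv_record UNIQUE (IP, DayID)  -- assumed: speeds up the dedup lookup
);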

When a request arrives at the server, the server queries the database for an access record with the current IP and the current day. If one exists, it is the same uv; if not, it is a new uv and a record is inserted. The two steps can also be written as a single SQL statement:

if exists (select 1 from uv_table where ip = @ip and dayid = @dayid)
begin
    return 0   -- same uv, nothing to insert
end
else
begin
    insert into uv_table .......
end

Almost all database-based solutions hit bottlenecks as the data volume grows. Facing tens of millions of uv records per day, this database-based solution may not be optimal.

Optimization

When designing any system, we should calm down and think about the concrete business. uv statistics has several characteristics:

1. Each request needs to determine whether the same uv record already exists

2. Persisting uv data must not affect normal business

3. The uv count can tolerate a certain degree of error

Hash table

In the database-based solution, one cause of the performance bottleneck under a large data volume is the check for whether the same record already exists, so optimizing the system must start with that step. From Cai Cai's previous articles, can you think of a data structure that solves this? Yes: a hash table. A hash table looks up a value by key in O(1) constant time, which removes the bottleneck of finding duplicate records.
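
As a minimal sketch of the idea (assuming the uv key is the source IP combined with the day; the class and method names are illustrative):

using System;
using System.Collections.Generic;

class HashTableUvCounter
{
    // One entry per (ip, day); a HashSet gives O(1) average add and lookup.
    readonly HashSet<string> seen = new HashSet<string>();
    int uvCount = 0;

    // Returns true if this request is a new uv for the given day.
    public bool Record(string ip, int dayId)
    {
        bool isNew = seen.Add($"{ip}_{dayId}"); // Add returns false if the key already exists
        if (isNew) uvCount++;
        return isNew;
    }

    public int UvCount => uvCount;
}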

When the uv volume is relatively small, a hash table may be a good choice. But facing tens of millions of uv records, hash collisions, resizing, and the memory a hash table occupies make it less attractive. Assuming each key/value pair occupies 10 bytes, 10 million uv records take roughly 100 MB. For a modern machine 100 MB is not huge, but is there a better solution?

Optimized hash table

The hash-table-based solution can only barely cope with tens of millions of records. What about 1 billion? Is there a better way to do uv statistics at the billion level? We set persistence aside here, because persistence can be handled with strategies such as database sharding (splitting tables and databases), which we will cover later. Is there a faster, cheaper way to determine whether a record exists among a billion uv records?

To minimize memory, we can pre-allocate an array of bits whose size is a multiple of the maximum expected data volume; the multiple is configurable. Suppose the system's maximum uv volume is 10 million: the system can pre-allocate a bit array of length 50 million. A bit is the smallest unit of memory, occupying only one bit.

For each piece of data, compute its hash value with a low-collision hash function and set the bit at the corresponding position of the array to 1. Since hash functions collide, different data can produce the same hash value and be misjudged as existing. But we can hash the same data with several different hash functions and set the positions of all of those hash values to 1; a key is judged as seen only if all of its positions are 1, which greatly reduces the false positive rate. Making the array a multiple of the maximum data volume also reduces collisions (the larger the capacity, the fewer the collisions). This is exactly the idea behind a Bloom filter.

At the 10 million uv level, a 50 million bit array occupies only about 6 MB of memory, far less than a hash table, and at the 1 billion level the gap in memory usage is even bigger.

The following is a code example:

using System.Collections; // BitArray

class BloomFilter
{
    readonly BitArray container;

    public BloomFilter(int length)
    {
        container = new BitArray(length);
    }

    // Mark a key as seen: set the bits at all four hash positions.
    public void Set(string key)
    {
        container[Hash1(key)] = true;
        container[Hash2(key)] = true;
        container[Hash3(key)] = true;
        container[Hash4(key)] = true;
    }

    // A key is judged as seen only if all four bits are set;
    // false means definitely not seen, true may be a false positive.
    public bool Get(string key)
    {
        return container[Hash1(key)] && container[Hash2(key)]
            && container[Hash3(key)] && container[Hash4(key)];
    }

    // Simulated hash function 1: DJB-style (hash * 33 + c).
    int Hash1(string key)
    {
        int hash = 5381;
        foreach (char c in key)
        {
            hash += (hash << 5) + c;
        }
        return (hash & 0x7FFFFFFF) % container.Length;
    }

    // Simulated hash function 2: BKDR-style, with a salt appended to the key.
    int Hash2(string key)
    {
        const int seed = 131; // 31 131 1313 13131 131313 etc.
        int hash = 0;
        foreach (char c in key + "key2")
        {
            hash = hash * seed + c;
        }
        return (hash & 0x7FFFFFFF) % container.Length;
    }

    // Simulated hash function 3: JS-style alternating shifts, with a salt.
    int Hash3(string key)
    {
        int hash = 0;
        char[] chars = (key + "keykey3").ToCharArray();
        for (int i = 0; i < chars.Length; i++)
        {
            if ((i & 1) == 0)
            {
                hash ^= (hash << 7) ^ chars[i] ^ (hash >> 3);
            }
            else
            {
                hash ^= ~((hash << 11) ^ chars[i] ^ (hash >> 5));
            }
        }
        return (hash & 0x7FFFFFFF) % container.Length;
    }

    // Simulated hash function 4: same scheme as Hash1, different salt.
    int Hash4(string key)
    {
        int hash = 5381;
        foreach (char c in key + "keykeyke4")
        {
            hash += (hash << 5) + c;
        }
        return (hash & 0x7FFFFFFF) % container.Length;
    }
}

The test procedure is:

BloomFilter bf = new BloomFilter(200000000); // 200 million bits, 20x the 10 million keys
int existNumber = 0;
int notExistNumber = 0;

for (int i = 0; i < 10000000; i++)
{
    string key = $"ip_{i}";
    if (bf.Get(key))
    {
        existNumber += 1;      // every generated key is unique, so any hit is a false positive
    }
    else
    {
        bf.Set(key);
        notExistNumber += 1;
    }
}
Console.WriteLine($"Judged as existing: {existNumber}");
Console.WriteLine($"Judged as not existing: {notExistNumber}");

Test results:

Judged as existing: 7017
Judged as not existing: 9992983

The test process occupies about 40 MB of memory, and the observed false positive rate is 7017 out of 10 million, below 1/1,000, which is acceptable in this business scenario. In a real system, you would not allocate such a large bit array right at startup; the capacity can be grown gradually as the data volume and collisions increase.
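
For reference (this estimate comes from standard Bloom filter theory, not from the original test): with k hash functions, n inserted keys and m bits, the expected false positive rate is approximately

p ≈ (1 - e^(-k*n/m))^k

With k = 4, n = 10,000,000 and m = 200,000,000, k*n/m = 0.2 and p ≈ (1 - e^(-0.2))^4 ≈ 0.1%, the same order of magnitude as the measured 7017 / 10,000,000 ≈ 0.07%.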

Asynchronous optimization

Once the problem of judging whether a record already exists is solved, the next step is persisting the data to the DB. If the data volume is large, or traffic bursts are large, consider writing records to an MQ, or to a NoSQL store with high read/write throughput, instead of inserting directly into a relational database.

With this change of thinking, the entire uv pipeline can in fact be made asynchronous, and that is also the recommended approach.
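
Here is a minimal sketch of such an asynchronous pipeline (the batch size and the PersistBatch method are assumptions for illustration; PersistBatch stands in for an MQ publish or a NoSQL bulk write):

using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

class UvPipeline
{
    readonly Channel<string> channel = Channel.CreateUnbounded<string>();

    // Request path: enqueue only, never touch the database inline.
    public void Enqueue(string uvKey) => channel.Writer.TryWrite(uvKey);

    // Background consumer: drain the queue and persist in batches.
    // A real system would also flush partial batches on a timer.
    public async Task ConsumeAsync()
    {
        var batch = new List<string>(1000);
        await foreach (var key in channel.Reader.ReadAllAsync())
        {
            batch.Add(key);
            if (batch.Count >= 1000)
            {
                await PersistBatch(batch); // hypothetical: MQ publish or NoSQL bulk write
                batch.Clear();
            }
        }
    }

    // Placeholder for the real persistence call.
    Task PersistBatch(List<string> batch) => Task.CompletedTask;
}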
