What are the benefits of short links? Why not just use long links?

Foreword

Today, let's talk about how to design a high-performance short-link system. Such a system looks simple, but every part of it opens onto a good deal of underlying knowledge, which also makes it a very suitable design question for interviews. Drawing on the high-performance short-link system we built, which has been running stably in production for two years, I will briefly walk through the main ideas involved in designing one. I hope it helps.

This article covers the following topics. Each one carries a lot of information, and I believe you will get something out of reading them:

  • What are the benefits of short links? Why not just use long links?

  • The basic principle of short-link redirection

  • Several methods of short-link generation

  • High-performance short-link architecture design

Note: This article touches on technologies such as Bloom filters and Snowflake. Because they are not the focus here, they are not covered in depth, otherwise the article would become very long. It is recommended that you study them in depth after reading.

What are the benefits of short links? Why not just use long links?

Take a look at the marketing text message below that Geek Time sent me, and tap the blue link in it (the short link).

The browser's address bar ends up showing a long link, as follows.

So why use a short link at all? Wouldn't the long link do? Using short links has the following benefits.

1. The link becomes shorter, so on platforms that limit content length, more room is left for editable text.

The most typical example is Weibo, which limits a post to 140 characters. If a long link is pasted in directly, little room is left for other content. With a short link, the link length is greatly reduced, so there is naturally much more room for text.

Another example is text messages, which generally have a length limit. With a long link, one message may be split into two or three, so what used to cost the fee of one message now costs two or three. Short links also make the content layout look cleaner.

2. We often need to turn a link into a QR code to share with others. With a long link, the QR code is dense and hard to scan. A short link does not have this problem, as shown in the figure.

3. A link that is too long may not be automatically recognized as a hyperlink on some platforms.

As shown in the figure, in DingTalk the long link below is not recognized in full; only part of it is recognized as a link. The short link does not have this problem.

The basic principle of short-link redirection

As shown above, short links have many benefits. So how do they work? Let's capture the traffic in the browser and take a look.

You can see that the request returns a response with status code 302 (redirect) whose Location header holds the long link. The browser then requests the long link to get the final response. The whole interaction is shown in the following flowchart.

The main step is that, after the short URL is accessed, the browser is redirected to the long URL (B). That raises a question: 301 and 302 are both redirects, so which one should be used? The difference between them matters here.

  • 301 stands for permanent redirect. After the first request obtains the long link, the browser may cache the redirect, so the next time it requests the short link it goes straight to the long link from its cache without contacting the short-URL server. The server then cannot count clicks on the short link. If the link belongs to a marketing campaign, the campaign's effect cannot be analyzed. So we generally do not use 301.

  • 302 stands for temporary redirect. Every request for the short link goes to the short-URL server (unless the response uses Cache-Control or Expires to tell the browser to cache it), which makes it easy for the server to count clicks. Although 302 adds a little load on the server, that cost is worth paying when the data matters this much, so 302 is recommended.
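The redirect behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory `LINKS` and `CLICKS` dictionaries, the `resolve` helper, the `ShortLinkHandler` class, and the example.com target are all hypothetical names, not part of the system described in this article.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory data; a real service would read the mapping from a store.
LINKS = {"3hcCxy": "https://example.com/a/very/long/landing/page?utm_source=sms"}
CLICKS = {}


def resolve(path):
    """Return (status, headers) for a GET request on a short-link path."""
    code = path.rstrip("/").rsplit("/", 1)[-1]
    target = LINKS.get(code)
    if target is None:
        return 404, {}
    # Every request reaches the server, so clicks can be counted here.
    CLICKS[code] = CLICKS.get(code, 0) + 1
    # 302 plus no-store asks the browser not to cache the redirect.
    return 302, {"Location": target, "Cache-Control": "no-store"}


class ShortLinkHandler(BaseHTTPRequestHandler):
    """Minimal handler that answers short-link requests with a redirect."""

    def do_GET(self):
        status, headers = resolve(self.path)
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.send_header("Content-Length", "0")
        self.end_headers()
```

To try it locally, start it with `HTTPServer(("127.0.0.1", 8080), ShortLinkHandler).serve_forever()` and request `/a/3hcCxy`. Returning 301 instead would let the browser skip the server on later visits, and `CLICKS` would stop growing.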

Several methods of short-link generation

1. Hash algorithm

How is a short link generated? Look closely at the short link in the example above: it clearly consists of a fixed short-link domain plus a string of characters mapped from the long link. So how can a long link be mapped to a string of characters? That is exactly what hash functions are for, which gives us the following design idea.

So how do we choose this hash function? Many people will probably suggest MD5, SHA, or similar algorithms. That is overkill: these are cryptographic hash functions, and the properties that make them hard to reverse cost performance. We do not actually care how hard the hash is to reverse; we care about hashing speed and collision probability.

Many hash algorithms meet these requirements. MurmurHash, created by Austin Appleby, is recommended here. MurmurHash is a non-cryptographic hash function suited to general hash-based lookup. Compared with other popular hash functions, it distributes keys with strong regularity more randomly. Being non-cryptographic, it is considerably faster than cryptographic functions such as MD5 and SHA (reportedly more than ten times faster than MD5). Thanks to these advantages, although it only appeared in 2008, it is already widely used in well-known software such as Redis, Memcached, Cassandra, HBase, and Lucene.

Aside: a small anecdote. After MurmurHash became well known, its author received an offer from Google. So contribute to more open source projects; once you make a name for yourself, an offer from Google may just turn up too ^_^

MurmurHash offers hash values of two lengths, 32-bit and 128-bit. To keep the URL as short as possible, we chose the 32-bit hash. A 32-bit value can represent nearly 4.3 billion distinct values, which is more than enough for small and medium-sized companies. Running MurmurHash on the Geek Time long link mentioned above gives the hash value 3002604296, so the short link we get is the fixed short-link domain + the hash value = http://gk.link/a/3002604296
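For readers who want to see what the 32-bit variant looks like, here is a minimal pure-Python sketch of MurmurHash3 (x86, 32-bit). It is meant for illustration; in production you would more likely use an existing library. The function names are my own, and because the original long link is not reproduced in this article, the sketch makes no attempt to reproduce the value 3002604296.

```python
def _rotl32(value, shift):
    """Rotate a 32-bit integer left by `shift` bits."""
    return ((value << shift) | (value >> (32 - shift))) & 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit of `data`, returned as an unsigned integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    length = len(data)
    rounded = length & ~3

    # Body: mix the input four bytes at a time (little-endian blocks).
    for i in range(0, rounded, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    # Tail: mix the remaining one to three bytes, if any.
    tail = data[rounded:]
    k = 0
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k

    # Finalization: force all bits of the hash to avalanche.
    h ^= length
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h
```

A long link would be hashed with something like `murmur3_32(long_url.encode("utf-8"))`, which always yields a value below 2^32 (about 4.3 billion).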

How can the short link be made even shorter?

Some people will say the link is still a bit long. There is a trick: the hash value 3002604296 is in decimal, so we can shorten it by converting it to base 62. The conversion from decimal to base 62 works as follows:

So we get 3002604296 (base 10) = 3hcCxy (base 62), shortened from 10 characters to 6 in one go! Our short link is now http://gk.link/a/3hcCxy

Aside: a 6-character base-62 number can represent about 56.8 billion values, which is more than enough for converting long links.
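The base-62 conversion can be sketched as repeated division by 62. The digit order below (0-9, then a-z, then A-Z) is an assumption on my part; it happens to reproduce the 3hcCxy example, but any fixed ordering of the 62 characters would work as long as encoding and decoding agree.

```python
# Assumed digit order: 0-9, a-z, A-Z.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def base62_encode(number: int) -> str:
    """Convert a non-negative integer to its base-62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    # Remainders come out least significant first, so reverse them.
    return "".join(reversed(digits))


def base62_decode(text: str) -> int:
    """Convert a base-62 string back to an integer."""
    value = 0
    for char in text:
        value = value * 62 + ALPHABET.index(char)
    return value
```

With this alphabet, `base62_encode(3002604296)` gives `"3hcCxy"`. Six base-62 characters cover 62^6 = 56,800,235,584 values, so every 32-bit hash fits in at most six characters.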

How to resolve hash collisions?

Since this is a hash function, hash collisions are unavoidable (although the probability is very low). How do we deal with them?

Because visiting a short link redirects to the long link, the mapping between the two must have been saved beforehand. Either Redis or MySQL can be used; here we choose MySQL. The table structure looks like this:

CREATE TABLE `short_url_map` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `lurl` varchar(160) DEFAULT NULL COMMENT 'long URL',
  `surl` varchar(10) DEFAULT NULL COMMENT 'short URL',
  `gmt_create` int(11) DEFAULT NULL COMMENT 'creation time',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

This leads to the following design:

  1. Run the long link (lurl) through MurmurHash to get the short link.

  2. Look the short link up in the short_url_map table. If no record exists, insert the long-to-short mapping into the database.

  3. If a record exists, there is a collision. Append a custom marker such as "DUPLICATE" to the long link and repeat step 1 on the resulting string "lurl + DUPLICATE". If that collides too, append the marker again. Later, when the long link is fetched by its short link, strip these markers to recover the original long link.

These steps clearly need optimizing. Inserting one record takes two SQL statements (looking up the record by short link, then inserting the long-to-short mapping). Under high concurrency, this will obviously become a bottleneck.

Aside: the database and the application service (which only computes and stores nothing) are usually deployed on separate servers. Executing two SQL statements takes two network round trips, and these two round trips plus the two SQL executions are the performance bottleneck of the whole short-link system!

So how do we optimize this?

  1. First, add a unique index on the short-link field surl.

  2. After MurmurHash turns the long link into a short link, insert the long-to-short mapping into the database directly. If the database has no record of this short link, the insert succeeds. If it already has one, the insert violates the unique index; in that case, append the custom marker "DUPLICATE" mentioned above to the long link, hash again, and insert again. This looks like extra work when the unique index is violated, but the probability of a MurmurHash collision is very low, so that path is rarely taken and the scheme is acceptable.
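The insert-first flow above can be sketched as follows. Several things here are assumptions made to keep the example self-contained: SQLite stands in for MySQL, `crc32_code` is only a placeholder for MurmurHash plus base-62 encoding, the helper names are hypothetical, and the check for "this long link is already stored under this short link" is my own addition.

```python
import sqlite3
import zlib

SUFFIX = "DUPLICATE"  # marker appended to the long link on a collision


def open_db():
    """Create an in-memory table with a unique index on surl (SQLite stand-in for MySQL)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE short_url_map ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "lurl TEXT NOT NULL, "
        "surl TEXT NOT NULL UNIQUE)"
    )
    return db


def crc32_code(text):
    """Placeholder hash; the article's design uses MurmurHash plus base 62."""
    return format(zlib.crc32(text.encode("utf-8")), "x")


def shorten(db, long_url, hash_fn=crc32_code, max_tries=5):
    """Insert directly and let the unique index detect collisions."""
    candidate = long_url
    for _ in range(max_tries):
        surl = hash_fn(candidate)
        try:
            with db:  # commits on success, rolls back on error
                db.execute(
                    "INSERT INTO short_url_map (lurl, surl) VALUES (?, ?)",
                    (candidate, surl),
                )
            return surl
        except sqlite3.IntegrityError:
            # Assumed extra step: the same long link may already be stored.
            row = db.execute(
                "SELECT lurl FROM short_url_map WHERE surl = ?", (surl,)
            ).fetchone()
            if row and row[0] == candidate:
                return surl
            candidate += SUFFIX  # real collision: re-hash with the marker
    raise RuntimeError("too many collisions")


def expand(db, surl):
    """Look up the long link and strip any collision markers."""
    row = db.execute(
        "SELECT lurl FROM short_url_map WHERE surl = ?", (surl,)
    ).fetchone()
    if row is None:
        return None
    lurl = row[0]
    # Caveat: this would also strip a marker that was part of the original URL.
    while lurl.endswith(SUFFIX):
        lurl = lurl[: -len(SUFFIX)]
    return lurl
```

In the common case there is exactly one SQL statement per new link. The second statement only runs after a unique-index violation, which should be rare.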

Of course, as the amount of data grows, the collision probability increases. At that point we can add a Bloom filter as an optimization.

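Build a Bloom filter from all the short links generated so far. When a new long link produces a short link, first look the short link up in the Bloom filter. If the filter says it is absent, the short link is definitely not in the database and can be inserted directly. If the filter says it may be present, treat it as a possible collision, because a Bloom filter can report false positives but never false negatives.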

Aside: a Bloom filter is a very memory-efficient data structure. A Bloom filter 1 billion bits long needs only about 125 MB of memory.
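To make the memory figure and the lookup behaviour concrete, here is a minimal Bloom filter sketch. The class name, the choice of three hash positions, and deriving positions from a SHA-256 digest are all assumptions for illustration; a production system would more likely use an existing implementation.

```python
import hashlib


class BloomFilter:
    """Bit-array Bloom filter with k positions derived by double hashing."""

    def __init__(self, num_bits, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot: 1 billion bits would take 125,000,000 bytes.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step odd
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        """False means definitely absent; True means possibly present."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A short link for which `might_contain` returns False can be inserted without any collision handling. A True result still has to be treated as a possible collision.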

In summary, the overall design with a hash function is as follows.

The short links generated by the hash algorithm already meet our business needs. Next, let's look at how to generate short links with an auto-increment sequence.

2. Auto-increment sequence algorithm

We can maintain an auto-increment ID generator that produces integer IDs such as 1, 2, 3, and so on. When a long-to-short request arrives, the generator assigns it an ID, which is converted to base 62 and appended to the short-link domain to produce the final short URL. So how should such an ID generator be designed? Issuing IDs is no problem during off-peak periods, but under high concurrency the generator can become the system bottleneck, so its design is especially important.

There are four main ways to obtain an ID:

1. UUID-like schemes

Put simply, this means an ID generated with UUID uuid = UUID.randomUUID();. A UUID (Universally Unique Identifier) is a number generated on one machine that is intended to be unique across all machines in the same space and time. However, IDs generated this way are long and unordered, so inserting them into the database may cause frequent page splits and hurt insert performance.

2. Redis

Redis is a good choice and performs well. A single instance can handle 100,000+ requests per second, which is enough for most business scenarios. If one instance cannot carry the load, you can set up several. For example, set up 10 instances, where each instance only issues IDs ending in one of the digits 0, 1, 2, ... 9 and adds 10 each time; then put an ID-generator proxy in front that randomly picks one of these generators for each request.

However, the Redis approach requires you to think about persistence (short-link IDs must never be issued twice) and disaster recovery, so its cost is somewhat high.
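The ten-generator layout can be illustrated without Redis at all. In the sketch below, plain Python objects play the role of the Redis instances, and all class and function names are hypothetical. With real Redis instances, each generator would roughly correspond to an INCRBY with an increment of 10 on a counter initialized to its own trailing digit.

```python
import random


class StrideIdGenerator:
    """Issues tail, tail + step, tail + 2 * step, ... (one per instance)."""

    def __init__(self, tail, step=10):
        self.step = step
        self.next_value = tail

    def next_id(self):
        value = self.next_value
        self.next_value += self.step
        return value


class IdProxy:
    """Picks one of the generators at random for every request."""

    def __init__(self, generators, rng=None):
        self.generators = generators
        self.rng = rng or random.Random()

    def next_id(self):
        return self.rng.choice(self.generators).next_id()


def make_proxy(count=10, rng=None):
    """Build `count` generators whose IDs differ in their value modulo `count`."""
    generators = [StrideIdGenerator(tail, count) for tail in range(count)]
    return IdProxy(generators, rng)
```

Because each generator owns a distinct residue modulo 10, the IDs never overlap no matter which generator the proxy picks. The sketch is not thread-safe and keeps its state in memory, which is exactly the persistence concern mentioned above.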

3. Snowflake

Snowflake is also a good choice, but it depends on a consistent system clock. If a machine's system clock is set back, ID conflicts or out-of-order IDs may result.
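The clock dependence is easier to see in code. The sketch below follows the commonly described Snowflake layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence); the custom epoch, the class name, and the injectable `clock` parameter are assumptions for illustration, and the class is not thread-safe.

```python
import time

EPOCH_MS = 1577836800000  # assumed custom epoch: 2020-01-01 00:00:00 UTC


class SnowflakeGenerator:
    """Minimal Snowflake-style ID generator (single-threaded sketch)."""

    def __init__(self, worker_id, clock=None):
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        self.clock = clock or (lambda: int(time.time() * 1000))
        self.last_ms = -1
        self.sequence = 0

    def next_id(self):
        now = self.clock()
        if now < self.last_ms:
            # A clock set backwards could repeat earlier IDs, so refuse.
            raise RuntimeError("clock moved backwards")
        if now == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF
            if self.sequence == 0:
                # 4096 IDs already issued this millisecond: wait for the next.
                while now <= self.last_ms:
                    now = self.clock()
        else:
            self.sequence = 0
        self.last_ms = now
        return ((now - EPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence
```

Because the timestamp occupies the high bits, IDs from one worker increase over time only as long as the clock does. That is why this sketch raises an error when the clock goes backwards rather than issuing a possibly duplicate ID.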

4. MySQL auto-increment primary key

This approach is simple to use and easy to scale, so we use MySQL's auto-increment primary key as the short-link ID. A brief summary is as follows:

That raises a question: if MySQL's auto-increment ID is used as the short-link ID, the database will be under heavy write pressure at high concurrency. What should we do then?

Think about it: does an ID have to be generated at the moment it is needed? Could these auto-increment IDs be generated in advance?

The plan is as follows:

Design a dedicated ID-issuing table. Each inserted row reserves the short-link ID range from (primary key id * 1000 - 999) to (primary key id * 1000), as follows.

ID-issuing table: url_sender_num

As shown in the figure, tmp_start_num is the first short-link ID of the range and tmp_end_num is the last.

When a long-to-short request reaches a machine, the machine first checks whether it has been assigned a short-link ID range. If not, it inserts a row into the ID-issuing table, and from then on it hands out short-link IDs from tmp_start_num to tmp_end_num, starting at tmp_start_num and counting up. Once the issued ID reaches tmp_end_num, the range is exhausted, and the machine inserts another row into the ID-issuing table to obtain a new range.

Aside: think about how to implement this auto-incrementing short-link ID on each machine. Redis would work, but a simpler option is AtomicLong, which performs well on a single machine and is safe under concurrency. If concurrency is very high and AtomicLong becomes a bottleneck, LongAdder is often suggested because it performs better under heavy contention; note, though, that LongAdder has no atomic increment-and-get operation, so check that it fits the way IDs are handed out before relying on it.
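The range-reservation scheme can be sketched as below. This is an illustration under assumptions: SQLite stands in for MySQL, the column layout of url_sender_num is inferred from the description (the original figure is not available), a `threading.Lock` plays the role that AtomicLong plays in Java, and the class and function names are hypothetical.

```python
import sqlite3
import threading

SEGMENT_SIZE = 1000  # each row of url_sender_num reserves 1000 IDs


def open_sender_db():
    """In-memory stand-in for the ID-issuing table."""
    db = sqlite3.connect(":memory:", check_same_thread=False)
    db.execute(
        "CREATE TABLE url_sender_num ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "tmp_start_num INTEGER, "
        "tmp_end_num INTEGER)"
    )
    return db


class SegmentAllocator:
    """One per machine: hands out IDs from a locally held range."""

    def __init__(self, db):
        self.db = db
        self.lock = threading.Lock()
        self.next_num = 1
        self.end_num = 0  # empty range, so the first call reserves one

    def _reserve_segment(self):
        with self.db:
            cursor = self.db.execute(
                "INSERT INTO url_sender_num (tmp_start_num, tmp_end_num) "
                "VALUES (0, 0)"
            )
            row_id = cursor.lastrowid
            start = row_id * SEGMENT_SIZE - (SEGMENT_SIZE - 1)
            end = row_id * SEGMENT_SIZE
            self.db.execute(
                "UPDATE url_sender_num SET tmp_start_num = ?, tmp_end_num = ? "
                "WHERE id = ?",
                (start, end, row_id),
            )
        self.next_num, self.end_num = start, end

    def next_id(self):
        with self.lock:
            if self.next_num > self.end_num:
                self._reserve_segment()  # one database write per 1000 IDs
            value = self.next_num
            self.next_num += 1
            return value
```

The database is touched only once per 1000 IDs, which is where the reduction in write pressure comes from. The trade-off is that a machine that restarts abandons the unused remainder of its range, so IDs may have gaps.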

The overall design is as follows

With the ID issuer solved, the rest is simple. The ID from the issuer is the short-link ID, so we can create a long-to-short mapping table with the short-link ID as the primary key. One thing to watch for: we may want to prevent the same long link from producing multiple short-link IDs, which means looking up the database by long link on every request to see whether a record already exists. The usual approach is to index the long link, but that index would take a lot of space. Instead, we can compress the long link, for example with MD5, and index the long link's md5 field, which makes the index much smaller. Then we only need to check whether the table already has a record with the same MD5. The table we designed is as follows:

CREATE TABLE `short_url_map` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'short-link id',
  `lurl` varchar(160) DEFAULT NULL COMMENT 'long URL',
  `md5` char(32) DEFAULT NULL COMMENT 'MD5 of the long URL',
  `gmt_create` int(11) DEFAULT NULL COMMENT 'creation time',
  PRIMARY KEY (`id`),
  KEY `idx_md5` (`md5`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
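The MD5-based duplicate check can be sketched as follows. Again SQLite stands in for MySQL, and `open_map_db`, `url_md5`, `get_or_create`, and the `new_id` callable (standing in for the ID issuer) are hypothetical names. Comparing the stored long link after the MD5 match is my own addition to guard against the unlikely case of two links sharing a digest.

```python
import hashlib
import sqlite3


def open_map_db():
    """In-memory stand-in for short_url_map with an index on md5."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE short_url_map ("
        "id INTEGER PRIMARY KEY, "
        "lurl TEXT NOT NULL, "
        "md5 TEXT NOT NULL)"
    )
    db.execute("CREATE INDEX idx_md5 ON short_url_map (md5)")
    return db


def url_md5(long_url):
    """32-character hex MD5 digest used as a compact lookup key."""
    return hashlib.md5(long_url.encode("utf-8")).hexdigest()


def get_or_create(db, long_url, new_id):
    """Return the existing short-link ID for long_url, or store a new one.

    new_id is a zero-argument callable that issues a fresh ID.
    """
    digest = url_md5(long_url)
    rows = db.execute(
        "SELECT id, lurl FROM short_url_map WHERE md5 = ?", (digest,)
    ).fetchall()
    for row_id, stored in rows:
        if stored == long_url:  # guard against an MD5 collision
            return row_id
    short_id = new_id()
    # Note: check-then-insert is not atomic; concurrent requests for the
    # same link could still create two rows without extra safeguards.
    with db:
        db.execute(
            "INSERT INTO short_url_map (id, lurl, md5) VALUES (?, ?, ?)",
            (short_id, long_url, digest),
        )
    return short_id
```

The index holds 32-character digests instead of full URLs, so it stays small however long the links are.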

Of course, once the data volume grows large, you will need to partition the table or shard it across databases and tables.

High-performance short-link architecture design

In e-commerce companies there are frequent promotions such as flash sales and red-envelope giveaways, and at certain moments QPS gets very high. With that in mind, we introduced OpenResty, a high-performance web platform based on Nginx and Lua. Thanks to Nginx's non-blocking I/O model, OpenResty can readily support 1,000,000+ concurrent connections. In general one instance is enough, but two are advisable to avoid a single point of failure. OpenResty also has its own caching mechanism, integrates modules for caches such as Redis, and can connect to MySQL directly. Because these middleware components no longer have to be reached through a business layer, performance is much higher.

As shown in the figure, using OpenResty removes the business-layer hop and goes straight to the cache layer and the database layer, which also improves performance considerably.

Summary

This article analyzed short-link design schemes in detail and aims to give you several different design ideas. It touches on many technologies, such as Bloom filters and OpenResty, that are not discussed in depth here; it is worth studying them in detail afterwards. Likewise, the MySQL page splits mentioned above require a solid understanding of the underlying B+ tree data structure, how the operating system reads pages, and related topics. I believe you will gain a lot once all of these points are thoroughly understood.

Origin blog.csdn.net/jankin6/article/details/105402131