High-performance short link design

Foreword

Today we look at how to design a high-performance short link system. Such a system looks simple, but every part of it opens onto a lot of background knowledge, which also makes it a good design question for interview candidates. This article draws on a high-performance short link system that has run stably in production for two years to walk through the design ideas involved. I hope it helps.

The article covers the following topics. Each one carries a lot of information, so I believe you will get something out of it.

  • What the benefits of short links are, why they are worth designing, and whether long links are not good enough
  • The basic principle of short link redirection
  • Several ways to generate short links
  • An architecture that supports short links under high concurrency

Note: the article touches on Bloom filters, Snowflake, and other techniques. Since they are not its focus, I recommend reading up on them yourself; covering them here would make the article far too long.

What are the benefits of short links? Aren't long links good enough?

Look at the marketing SMS that Geek Time sent me, and tap the blue link in it (the short link).

The browser's address bar ends up showing a long link.

So why use a short link at all? Wouldn't the long link do? Short links have the following benefits.

1. The link is shorter, so on publishing platforms that limit content length there is more room for text.

The most typical case is Twitter, which limited a post to 140 characters. Paste in a long link and there is little room left for anything else. With a short link, the link takes far fewer characters, so there is naturally much more room to write.

Another example: an SMS usually has a length limit. With a long link, one message may well be split into two or three, so a message that should have been billed once is billed two or three times. Why bother? A short link also makes the content layout look cleaner.

2. We often turn a link into a QR code to share with others. The QR code of a long link is dense and hard to scan; a short link has no such problem, as shown in the figure.

3. On some platforms a link that is too long is not automatically recognized as a hyperlink.

As the figure shows, DingTalk cannot recognize the whole long link and only recognizes part of it; a short link has no such problem.

The basic principle of short link redirection

The benefits of short links are clear from the above, so how do they work? Let's look at a packet capture in the browser.

As the capture shows, the request for the short link returns status code 302 (redirect), and the Location response header carries the long link. The browser then requests the long link to get the final response. The whole interaction is shown in the flow diagram.

The key step is that a request to the short URL is redirected to the long URL. That raises a question: both 301 and 302 are redirects, so which one should be used? To answer it we need to look at the difference between them.

  • 301 means a permanent redirect. After the first request returns the long link, the browser serves later requests for the short link straight from its cache instead of asking the short URL server. The server therefore cannot count clicks on the short URL, and if the link belongs to a campaign, the effect of that campaign cannot be analyzed. So we generally do not use 301.
  • 302 means a temporary redirect. Every request for the short link goes to the short URL server (unless the response uses Cache-Control or Expires to tell the browser to cache it), so the server can count clicks easily. 302 puts somewhat more load on the server, but now that data matters so much, the cost is worth it. 302 is the recommended choice!
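
The 302 flow described above can be sketched with Python's standard http.server. This is only a minimal illustration, not the production system: the SHORT_TO_LONG mapping, the example.com target, and the in-memory CLICK_COUNTS counter are all assumptions made for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory mapping; a real system would query a cache or database.
SHORT_TO_LONG = {"/a/3hcCxy": "https://example.com/some/very/long/path?utm_source=sms"}
# Every request reaches the server because 302 is not cached by default,
# so clicks can be counted here.
CLICK_COUNTS = {}


def resolve(path):
    """Return (status_code, headers) for a request to a short link path."""
    long_url = SHORT_TO_LONG.get(path)
    if long_url is None:
        return 404, {}
    CLICK_COUNTS[path] = CLICK_COUNTS.get(path, 0) + 1
    return 302, {"Location": long_url}


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, headers = resolve(self.path)
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()


def serve(port=8080):
    """Start the redirect server (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), RedirectHandler).serve_forever()
```

Calling serve() and requesting /a/3hcCxy returns a 302 whose Location header carries the long link. With 301 the browser would stop asking the server after the first visit, and the counter would stop growing.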

Several ways to generate short links

1. Hash algorithm

How do we generate a short link? Look closely at the short link in the example above: it is clearly a fixed short link domain plus a string of characters that the long link maps to. How do we map a long link to such a string? A hash function does exactly that, which gives us the following design idea.

Which hash function should we use? Many people will suggest MD5 or SHA, but that is overkill. They are cryptographic algorithms, which costs performance, and we do not care how hard the hash is to reverse. What we care about is computation speed and collision probability.

Many hash algorithms meet these needs. The one recommended here is MurmurHash, a non-cryptographic hash function suited to general hash-based lookup. Compared with other popular hash functions, MurmurHash distributes keys with strong regularity more randomly. Being non-cryptographic, it is also much faster than MD5 or SHA (by the author's figure, more than ten times faster than MD5). Thanks to these advantages, although it only appeared in 2008 it is already widely used in Redis, Memcached, Cassandra, HBase, Lucene, and many other well-known projects.

Side note: a small anecdote. After MurmurHash became famous, its author got an offer from Google. So do more open source work; once you are well known, a Google offer might arrive when you least expect it ^_^.

MurmurHash offers two hash lengths, 32 bit and 128 bit. To keep the URL as short as possible we choose the 32-bit value. 32 bits can represent nearly 4.3 billion values, which is more than enough for small and medium-sized companies. Running MurmurHash on the Geek Time long link mentioned above gives the hash value 3002604296, so the short link is now the fixed short link domain + the hash value = gk.link/a/300260429...
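
For readers who want to see what the 32-bit hash looks like, here is a pure-Python sketch of the 32-bit MurmurHash3 variant (x86_32). The seed of 0 and the UTF-8 encoding are assumptions; the article does not say which variant, seed, or encoding produced 3002604296, so this sketch will not necessarily reproduce that value. In production you would normally use an existing library rather than hand-written code.

```python
def _rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86_32: returns an unsigned 32-bit hash of data."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n_blocks = len(data) // 4

    # Body: mix one little-endian 4-byte block at a time.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k

    # Finalization: force every input bit to affect every output bit.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


print(murmur3_32(b"hello"))  # 613153351
print(murmur3_32("https://example.com/a/long/url".encode("utf-8")) < 2 ** 32)  # True
```

Whatever the input length, the result is a number below 2^32 (about 4.3 billion), which is what keeps the short link short.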

How to shorten the short link further?

Some people say the short link is still a bit long. There is a trick: the hash value 3002604296 is in decimal, so we can convert it to base 62 to shorten it. Decimal is converted to base 62 as follows:

So we have (3002604296)_10 = (3hcCxy)_62, and the length drops from 10 characters to 6! Now we get the short link gk.link/a/3hcCxy

Side note: 6 base-62 digits can represent about 56.8 billion values, more than enough for converting long links.
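
The base-62 conversion is plain repeated division by 62. A minimal sketch follows; the digit order 0-9, a-z, A-Z is an assumption, chosen because it reproduces the 3hcCxy result above.

```python
# Assumed digit order: 0-9, then a-z, then A-Z.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def base62_encode(number: int) -> str:
    """Convert a non-negative integer to its base-62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def base62_decode(text: str) -> int:
    """Convert a base-62 string back to an integer."""
    number = 0
    for char in text:
        number = number * 62 + ALPHABET.index(char)
    return number


print(base62_encode(3002604296))  # 3hcCxy
print(base62_decode("3hcCxy"))    # 3002604296
print(62 ** 6)                    # 56800235584, about 56.8 billion
```

Decoding matters as much as encoding: when a short link is requested, the server can decode the path back to the number and look the record up by it.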

How to handle hash collisions?

Since this is a hash function, collisions are inevitable (although the probability is low). How do we handle them?

Since requesting a short link must redirect to the long link, the mapping between the two has to be stored somewhere, for example in Redis or MySQL. Here we choose MySQL. The table structure is as follows:

CREATE TABLE `short_url_map` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `lurl` varchar(160) DEFAULT NULL COMMENT 'long URL',
  `surl` varchar(10) DEFAULT NULL COMMENT 'short URL',
  `gmt_create` int(11) DEFAULT NULL COMMENT 'creation time',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

So we have the following design idea.

  1. Run MurmurHash on the long link (lurl) to get the short link.
  2. Look the short link up in the short_url_map table. If there is no record, store the mapping between the long link and the short link in the database.
  3. If there is a record, the short link is already taken. Append a custom string such as "DUPLICATE" to the long link, then run step 1 again on the concatenated string "lurl + DUPLICATE". If it still collides, append the string once more. When reading the long link back for a short link, just strip these custom strings to recover the original long link.
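
The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions: a dict stands in for the short_url_map table, zlib.crc32 stands in for MurmurHash (any 32-bit hash works for the demonstration), and the check that returns an existing code when the very same long link is submitted again is an addition not spelled out in the steps.

```python
import zlib

SUFFIX = "DUPLICATE"  # the custom string from step 3


def crc32_hash(text):
    """Stand-in 32-bit hash; the article uses MurmurHash here."""
    return zlib.crc32(text.encode("utf-8"))


def shorten(long_url, table, hash_fn=crc32_hash):
    """Return the short code for long_url, retrying with a suffix on collision."""
    candidate = long_url
    while True:
        code = hash_fn(candidate)          # step 1: hash the (possibly suffixed) link
        stored = table.get(code)           # step 2: first query, look the code up
        if stored is None:
            table[code] = candidate        # second query, store the mapping
            return code
        if stored == candidate:
            return code                    # same link already stored (added assumption)
        candidate += SUFFIX                # step 3: collision, append and hash again


def original_url(stored):
    """Strip the suffixes added during collision handling."""
    while stored.endswith(SUFFIX):
        stored = stored[:-len(SUFFIX)]
    return stored
```

One thing the sketch makes visible: a long link that genuinely ends in the suffix string would be mangled by original_url, so the custom string should be something that cannot appear at the end of a real URL.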

These steps clearly leave room for optimization: inserting one record takes two SQL statements (a lookup by short link, then an insert of the mapping), which will obviously become a bottleneck under high concurrency.

Side note: the database and the application service (which only computes and does not store) are usually deployed on separate servers, so two SQL statements mean two network round trips. These two round trips plus the two SQL executions are where the performance bottleneck of the whole short link system lies!

So how do we optimize this?

  1. First, add a unique index on the short link column surl.
  2. After MurmurHash produces the short link for a long link, insert the mapping into the db directly. If the db has no record for this short link, the insert succeeds. If it does, the insert violates the unique index; in that case append the custom string "DUPLICATE" mentioned above to the long link, hash again, and insert again. A unique index violation does cost one extra step, but the collision probability of MurmurHash is very low and collisions rarely happen, so this scheme is acceptable.
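
The insert-first variant can be sketched like this. The sketch uses Python's built-in sqlite3 as a stand-in for MySQL and zlib.crc32 as a stand-in for MurmurHash; both substitutions, the index name uk_surl, and the helper names are assumptions for illustration only.

```python
import sqlite3
import zlib

SUFFIX = "DUPLICATE"


def open_db():
    """Create an in-memory table with a unique index on surl."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE short_url_map ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, lurl TEXT, surl TEXT)"
    )
    db.execute("CREATE UNIQUE INDEX uk_surl ON short_url_map (surl)")
    return db


def default_hash(text):
    """Stand-in 32-bit hash, returned as a decimal string."""
    return str(zlib.crc32(text.encode("utf-8")))


def shorten(db, long_url, hash_fn=default_hash):
    """Insert directly; on a unique index violation, append the suffix and retry."""
    candidate = long_url
    while True:
        surl = hash_fn(candidate)
        try:
            with db:  # commits on success, rolls back on error
                db.execute(
                    "INSERT INTO short_url_map (lurl, surl) VALUES (?, ?)",
                    (candidate, surl),
                )
            return surl
        except sqlite3.IntegrityError:
            candidate += SUFFIX  # short link already taken, rehash
```

In the common case this costs one statement and one round trip instead of two. Note that because nothing is read before the insert, submitting the same long link twice is treated as a collision and produces a second row.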

Of course, with a large amount of data the collision probability grows, and we can then add a Bloom filter as an optimization.

Build a Bloom filter over all generated short URLs. When a new long link produces a short link, look the short link up in the Bloom filter first. If it is not there, the short URL definitely does not exist in the db and can be inserted! If the filter reports a hit, the short link may already exist (a Bloom filter can give false positives but never false negatives), so treat it as a collision and rehash.

Side note: a Bloom filter is a very memory-efficient data structure. A Bloom filter 1 billion bits long needs only about 125 MB of memory.
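
To make the idea concrete, here is a minimal Bloom filter sketch. The sizes, the number of hash functions, and the way bit positions are derived from a SHA-256 digest are all illustrative assumptions; a production system would more likely use an existing implementation.

```python
import hashlib


class BloomFilter:
    """A bit array plus k hash positions per item."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # 8 bits per byte

    def _positions(self, item):
        # Derive k positions from two 64-bit halves of one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        """False means definitely absent; True means possibly present."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bloom = BloomFilter(num_bits=1_000_000, num_hashes=5)
bloom.add("3hcCxy")
print(bloom.might_contain("3hcCxy"))  # True
print(len(bloom.bits))                # 125000 bytes for 1,000,000 bits
```

The memory figure follows directly from the bit array: 1,000,000 bits take 125,000 bytes, so 1 billion bits take 125,000,000 bytes, or about 125 MB.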

In summary, with a hash function the overall design is as follows

Generating short links with a hash algorithm can already meet our business needs. Next, let's look at how to generate short links with an auto-increment sequence.

2. Auto-increment sequence algorithm

We can maintain an auto-incrementing ID generator that produces increasing integers such as 1, 2, 3. When a request to convert a long link into a short link arrives, the generator assigns it an ID, which is converted to base 62 and appended to the short link domain to form the final short URL. How should this ID generator be designed? Under high concurrency, an ID issuer that generates IDs slowly can become the system bottleneck, so its design matters a great deal.

There are four main ways to obtain the id

1. UUID

Simply put, a UUID is generated with UUID uuid = UUID.randomUUID();. A UUID (Universally Unique Identifier) is a number generated on one machine that is meant to be unique across all machines in the same space and time. But an id generated this way is long and unordered, and inserting such ids into the db can cause frequent page splits, which hurts insert performance.

2. Redis

Redis is a good choice. It performs well, and a single instance can handle 100,000+ requests per second, which covers most business scenarios. If one machine cannot cope, you can deploy several. For example, deploy 10 machines, have each one only generate IDs ending in one of the digits 0, 1, 2, ... 9, and add 10 on each increment. Then put an ID-generator proxy in front that randomly picks one of the issuers to produce each ID.

But a Redis-based scheme has to take care of persistence (short link IDs must not repeat) and disaster recovery, so the cost is somewhat high.
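
The "ten machines, step of 10" idea can be sketched without Redis at all. In the sketch below each StripedIssuer object stands in for one Redis instance (where a command such as INCRBY with an increment of 10 would play the role of next), and IssuerProxy stands in for the ID-generator proxy; the class and method names are hypothetical.

```python
import random


class StripedIssuer:
    """One issuer per machine: starts at its tail digit and steps by the machine count."""

    def __init__(self, tail, step):
        self.next_value = tail
        self.step = step

    def next(self):
        value = self.next_value
        self.next_value += self.step
        return value


class IssuerProxy:
    """Randomly dispatches each request to one of the issuers."""

    def __init__(self, machines=10):
        self.issuers = [StripedIssuer(tail, machines) for tail in range(machines)]

    def next_id(self):
        return random.choice(self.issuers).next()


issuer_3 = StripedIssuer(tail=3, step=10)
print([issuer_3.next() for _ in range(4)])  # [3, 13, 23, 33]
```

Because every issuer owns a different residue modulo 10, the IDs never overlap no matter which issuer the proxy picks. The trade-off is that IDs are unique but not strictly increasing across issuers.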

3. Snowflake

Snowflake is also a good choice, but it depends on the consistency of the system clock. If a machine's system clock is rolled back, IDs may collide or come out of order.

4. MySQL auto-increment primary key

This approach is simple to use and easy to extend, so we use the MySQL auto-increment primary key as the short link id. A brief summary is shown below:

The question then is: if a MySQL auto-increment id is used as the short link ID, the db comes under heavy write pressure at high concurrency. What can be done about that?

Think about it: does the id have to be generated at the moment it is needed? Could these auto-increment ids be generated in advance?

The scheme is as follows:

Design a dedicated number-issuing table. Each record inserted into it reserves a segment of short link ids, from (primary key id * 1000 - 999) to (primary key id * 1000), as follows

Number-issuing table: url_sender_num

As illustrated, tmp_start_num is the first short link id of the segment and tmp_end_num is the last.

When a long-to-short conversion request hits a machine, the machine first checks whether it has been assigned a segment of short link ids. If not, it inserts a record into the number-issuing table and then hands out short link ids in the range tmp_start_num to tmp_end_num. It allocates from tmp_start_num upward until it reaches tmp_end_num. Once tmp_end_num has been issued, the segment is used up, and the machine inserts another record into the number-issuing table to obtain a new id segment.

Side note: think about how to implement this incrementing short link id on each machine. You could use Redis, but a simpler solution is AtomicLong, which performs well on a single machine and is safe under concurrency. If concurrency is very high and AtomicLong becomes a bottleneck, consider LongAdder, which performs better under high contention. Note, however, that LongAdder.increment() does not return the new value, so handing out unique ids with it takes extra care.
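
The segment scheme can be sketched as follows. This is an in-memory illustration only: NumberTable stands in for the url_sender_num table (each insert returns the next auto-increment primary key), and a threading.Lock plays the role that AtomicLong plays in the side note above. All class names are hypothetical.

```python
import itertools
import threading

SEGMENT_SIZE = 1000


class NumberTable:
    """Stand-in for url_sender_num: every insert yields a new primary key."""

    def __init__(self):
        self._keys = itertools.count(1)
        self._lock = threading.Lock()

    def insert(self):
        with self._lock:
            return next(self._keys)


class SegmentIssuer:
    """Per-machine issuer that hands out ids from its reserved segment."""

    def __init__(self, table):
        self._table = table
        self._lock = threading.Lock()
        self._next = 1
        self._end = 0  # empty segment, forces a reservation on first use

    def next_id(self):
        with self._lock:
            if self._next > self._end:
                key = self._table.insert()                    # one db write per 1000 ids
                self._next = key * SEGMENT_SIZE - (SEGMENT_SIZE - 1)  # tmp_start_num
                self._end = key * SEGMENT_SIZE                         # tmp_end_num
            value = self._next
            self._next += 1
            return value


table = NumberTable()
machine_a = SegmentIssuer(table)
machine_b = SegmentIssuer(table)
print(machine_a.next_id(), machine_b.next_id(), machine_a.next_id())  # 1 1001 2
```

The database is touched once per 1000 ids instead of once per id, which is where the write pressure goes away. The cost is that ids left in a segment are lost if a machine restarts.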

The overall design is shown below

With the number issuer solved, the rest is simple. Take an id from the issuer; that id is the short link. Then create a table mapping short links to long links, with the short link id as its primary key. One thing to watch out for: we may need to prevent the same long link from generating several different short link ids. That requires checking the db for an existing record by long link first. The usual way is to index the long link column, but such an index takes a lot of space. We can instead compress the long link, for example with MD5, and index the MD5 column, which makes the index much smaller. Then we only need to look the table up by the MD5 of the long link to see whether the same record exists. So we design the table as follows

CREATE TABLE `short_url_map` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'short link id',
  `lurl` varchar(160) DEFAULT NULL COMMENT 'long link',
  `md5` char(32) DEFAULT NULL COMMENT 'MD5 of the long link',
  `gmt_create` int(11) DEFAULT NULL COMMENT 'creation time',
  PRIMARY KEY (`id`),
  -- index on the MD5 column, as described above; the index name is illustrative
  KEY `idx_md5` (`md5`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Of course, when the data volume is large, the table will later need to be partitioned, or split across databases and tables.
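
The MD5-based duplicate check can be sketched like this. The sketch uses Python's built-in sqlite3 as a stand-in for MySQL, and the next_id callback stands in for the number issuer; the function names and the index name are assumptions for illustration.

```python
import hashlib
import sqlite3


def open_db():
    """Create the mapping table with an index on the md5 column."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE short_url_map (id INTEGER PRIMARY KEY, lurl TEXT, md5 TEXT)"
    )
    db.execute("CREATE INDEX idx_md5 ON short_url_map (md5)")
    return db


def get_or_create_id(db, long_url, next_id):
    """Return the existing short link id for long_url, or store a new one."""
    digest = hashlib.md5(long_url.encode("utf-8")).hexdigest()
    rows = db.execute(
        "SELECT id, lurl FROM short_url_map WHERE md5 = ?", (digest,)
    ).fetchall()
    for row_id, stored_url in rows:
        if stored_url == long_url:  # compare the full link in case two MD5s ever match
            return row_id
    new_id = next_id()              # take an id from the number issuer
    with db:
        db.execute(
            "INSERT INTO short_url_map (id, lurl, md5) VALUES (?, ?, ?)",
            (new_id, long_url, digest),
        )
    return new_id
```

The index only has to cover a fixed 32-character column instead of the full long link, which is the space saving described above. In a multi-machine deployment the check and the insert are not atomic, so two machines could still store the same link at the same moment; whether that matters depends on the business requirement.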

High-concurrency architecture for short link requests

E-commerce companies often run many campaigns such as flash sales and red envelope grabs, and QPS is very high at certain moments. For this situation we introduced OpenResty, a high-performance web platform based on Nginx and Lua. Thanks to Nginx's non-blocking IO model, OpenResty can easily support 1,000,000+ concurrent connections, so under normal circumstances deploying a single instance is enough. OpenResty also comes with its own caching mechanism and integrates modules for caches such as Redis, and it can connect to MySQL directly. Since requests no longer go through a business layer to reach these middleware, performance is naturally much higher.

As illustrated, with OpenResty the business layer step is skipped and requests go straight to the cache layer and the database layer, which also improves performance considerably.
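
The lookup order implied by "cache layer, then database layer" can be sketched as below. In OpenResty this logic would be written in Lua inside Nginx; Python is used here only to show the flow. The two dicts stand in for Redis and MySQL, and filling the cache after a database hit is an assumption, since the article does not describe the caching policy.

```python
class ShortLinkResolver:
    """Resolve a short code by checking the cache first, then the database."""

    def __init__(self, cache, database):
        self.cache = cache        # dict standing in for Redis
        self.database = database  # dict standing in for MySQL
        self.db_reads = 0

    def resolve(self, code):
        long_url = self.cache.get(code)
        if long_url is not None:
            return long_url                  # served from the cache layer
        self.db_reads += 1
        long_url = self.database.get(code)   # fall back to the database layer
        if long_url is not None:
            self.cache[code] = long_url      # assumed policy: cache on first hit
        return long_url


resolver = ShortLinkResolver(cache={}, database={"3hcCxy": "https://example.com/long"})
print(resolver.resolve("3hcCxy"), resolver.resolve("3hcCxy"), resolver.db_reads)
# https://example.com/long https://example.com/long 1
```

During a campaign most requests hit a small set of hot short links, so after the first request for each one the database is no longer involved.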

Summary

This article walked through short link design in detail and offered several different design ideas. It touches on many techniques, such as Bloom filters and OpenResty, without expanding on them, and I suggest you study them more closely afterwards. Likewise, to understand the MySQL page splits mentioned above you need to know the B+ tree data structure used underneath, operating system paging, and so on. I believe digging into each of these points will teach you a lot.

Origin juejin.im/post/5e6ddef66fb9a07cb427ee13