The evolution of large-scale website architecture

In the grassroots period, the website is developed quickly and launched. Usually this is just to test the waters: the user base has not yet formed, and both funds and capacity for investment are very limited.

Then a certain amount of business and user scale builds up, you want the site to be faster, and so the cache makes its entrance.
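As a concrete illustration, here is a minimal sketch of the cache-aside pattern that usually appears at this stage: read from the cache first, fall back to the database on a miss, and populate the cache on the way back. The in-memory map stands in for a real cache such as Memcached or Redis, and `loadArticleFromDb` is a hypothetical placeholder for the actual query.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ArticleCache {
    private final Map<Long, String> cache = new ConcurrentHashMap<>();

    String getArticle(long id) {
        String cached = cache.get(id);
        if (cached != null) {
            return cached;                     // cache hit: no DB round trip
        }
        String fromDb = loadArticleFromDb(id); // cache miss: hit the database once
        cache.put(id, fromDb);                 // subsequent reads are served from cache
        return fromDb;
    }

    private String loadArticleFromDb(long id) {
        // Placeholder for the real query, e.g. SELECT body FROM article WHERE id = ?
        return "article-" + id;
    }
}
```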

The market response is decent, the number of users grows by the day, the database is reading and writing frantically, and it gradually becomes clear that one server can no longer carry everything. So the decision is made to separate the DB and the APP onto their own machines.

Soon a single database cannot hold up either, so the usual move is read-write separation, which pays off because most Internet workloads read far more than they write. The number of slave nodes depends on the read-write ratio assessed for the business.
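A minimal sketch of what read-write separation looks like from the application's point of view, assuming one master and two slaves; the hostnames are invented, and in practice this routing usually lives in a driver, middleware, or a routing DataSource rather than hand-written code.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class ReadWriteRouter {
    private final String master = "jdbc:mysql://master.internal:3306/app";
    private final List<String> slaves = List.of(
            "jdbc:mysql://slave1.internal:3306/app",
            "jdbc:mysql://slave2.internal:3306/app");

    /** Writes must see the authoritative copy, so they always hit the master. */
    String urlForWrite() {
        return master;
    }

    /** Reads tolerate slight replication lag and are balanced across the slaves. */
    String urlForRead() {
        return slaves.get(ThreadLocalRandom.current().nextInt(slaves.size()));
    }
}
```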


With the database tier relieved, the bottleneck shifts to the application tier. Traffic keeps climbing; meanwhile the code, written by early programmers of limited skill, is poor, and staff turnover is high, so it is hard to maintain and optimize. The most common remedy, therefore, is to "pile on machines".

Anyone can add machines; the key is whether adding them actually has an effect, and adding them may itself cause problems. Very common examples: page output caches and local caches going inconsistent across nodes, and the question of where to keep sessions...
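For the session problem specifically, the common fix is to move session state out of the individual web server into a store that every node shares. Below is a minimal sketch of that idea; the `ConcurrentHashMap` stands in for the shared store (Redis or Memcached in practice), and the session-id format is made up.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SharedSessionStore {
    // In production this is a network call to Redis/Memcached, not local memory.
    private static final Map<String, Map<String, String>> STORE = new ConcurrentHashMap<>();

    static void put(String sessionId, String key, String value) {
        STORE.computeIfAbsent(sessionId, id -> new ConcurrentHashMap<>()).put(key, value);
    }

    static String get(String sessionId, String key) {
        Map<String, String> session = STORE.get(sessionId);
        return session == null ? null : session.get(key);
    }

    public static void main(String[] args) {
        // Node A writes the session; node B (any other web server) can read it,
        // so requests no longer need to stick to the machine that logged the user in.
        put("sess-42", "userId", "10001");
        System.out.println(get("sess-42", "userId")); // 10001
    }
}
```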

At this point, horizontal scaling has basically been achieved at both the DB tier and the application tier, and you can start to look at other things, such as the accuracy of on-site search and its heavy dependence on the DB, which motivates introducing full-text indexing.

In the Java world, Lucene and Solr are widely used (Solr is a full-text search engine built on Lucene), while in the PHP world, Sphinx/Coreseek is more common.

Here it is worth roughly understanding the principle of full-text indexing: it builds an inverted index, a table that maps each term to the documents containing it, so a search becomes a lookup instead of a scan.
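A toy inverted index in Java makes the principle concrete; real engines such as Lucene add tokenization, ranking, and compressed on-disk structures on top of exactly this mapping.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class InvertedIndex {
    private final Map<String, Set<Integer>> termToDocs = new HashMap<>();

    void addDocument(int docId, String text) {
        // Naive tokenization: split on non-word characters and lowercase.
        for (String term : text.toLowerCase().split("\\W+")) {
            termToDocs.computeIfAbsent(term, t -> new HashSet<>()).add(docId);
        }
    }

    /** Lookup is a single map access, regardless of how many documents exist. */
    Set<Integer> search(String term) {
        return termToDocs.getOrDefault(term.toLowerCase(), Set.of());
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.addDocument(1, "large scale website architecture");
        index.addDocument(2, "website cache and database");
        System.out.println(index.search("website")); // [1, 2]
    }
}
```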

So far this sketches a medium-sized website architecture that can carry millions of visits per day on average. Of course, every step of the expansion hides plenty of implementation details; those deserve separate articles when time allows.

Once scaling has met the basic performance needs, attention gradually shifts to "availability" (that is, the SLA, the "number of nines" people brag about). Guaranteeing true high availability is a genuinely hard problem.
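To make the "nines" concrete, here is a quick calculation of how much downtime per year each level actually allows (99.9% sounds impressive until you see it is nearly nine hours):

```java
public class SlaDowntime {
    public static void main(String[] args) {
        double hoursPerYear = 365 * 24; // 8760 hours in a non-leap year
        double[] slas = {0.99, 0.999, 0.9999};
        for (double sla : slas) {
            double downtimeHours = (1 - sla) * hoursPerYear;
            System.out.printf("%.2f%% uptime -> %.2f hours of downtime per year%n",
                    sla * 100, downtimeHours);
        }
        // Prints roughly: 99% -> 87.6 h, 99.9% -> 8.76 h, 99.99% -> 0.88 h
    }
}
```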


Almost all mainstream large and medium-sized Internet companies use a similar architecture; only the number of nodes differs.


There is another widely used trick: static-dynamic separation. It may require developer cooperation (putting static resources under an independent site), or it may not (using a layer-7 reverse proxy and judging the resource type from the file suffix and other information); a sketch of that routing decision follows below. Once there is a separate static file server, its storage also becomes a problem and needs to scale: how do you keep the files on multiple servers consistent, and what if you cannot afford shared storage? This is where distributed file systems come in handy.
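Here is the routing decision from the layer-7 proxy case as a minimal sketch; in reality this is expressed as proxy configuration (nginx location rules, for instance) rather than application code, and the suffix list is illustrative only.

```java
import java.util.Set;

public class StaticDynamicRouter {
    private static final Set<String> STATIC_SUFFIXES =
            Set.of("css", "js", "png", "jpg", "gif", "ico", "woff2");

    /** Returns the upstream pool a request path should be sent to. */
    static String route(String path) {
        int dot = path.lastIndexOf('.');
        String suffix = dot >= 0 ? path.substring(dot + 1).toLowerCase() : "";
        return STATIC_SUFFIXES.contains(suffix) ? "static-file-servers"
                                                : "app-servers";
    }

    public static void main(String[] args) {
        System.out.println(route("/order/submit"));    // app-servers
        System.out.println(route("/assets/site.css")); // static-file-servers
    }
}
```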


Another very common technique, at home and abroad, is CDN acceleration. Competition in that market is fierce and it has become fairly cheap. China's north-south interconnection problem (traffic crossing between the major carriers' backbones is slow) is quite serious, and a CDN effectively works around it.


The basic principle of a CDN is not complicated. It can be understood as intelligent DNS plus Squid-style reverse-proxy caching, backed by many data-center nodes that serve users from nearby.
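A toy version of the "intelligent DNS" half, under the heavy simplification that the resolver only knows the client's region; the regions and IPs (from the documentation range) are invented, and real CDNs also weigh latency probes, node load, and cost.

```java
import java.util.Map;

public class SmartDns {
    private static final Map<String, String> EDGE_NODES = Map.of(
            "north", "203.0.113.10",   // edge cache in a northern data center (example IP)
            "south", "203.0.113.20",   // edge cache in a southern data center (example IP)
            "other", "203.0.113.30");  // fallback node

    /** Resolve a hostname to the edge node nearest the client's region. */
    static String resolve(String clientRegion) {
        return EDGE_NODES.getOrDefault(clientRegion, EDGE_NODES.get("other"));
    }

    public static void main(String[] args) {
        // A northern client is directed to the northern cache node, which serves
        // content from its cache or fetches it from the origin on a miss.
        System.out.println(resolve("north")); // 203.0.113.10
    }
}
```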


Up to this point the application architecture itself has barely changed; put plainly, there has been no need to modify code on a large scale.

What if every method above is used up and the site still cannot hold? Piling on machines forever is not a way out.

As the business grows more complex and the site accumulates features, the deployment layer may be clustered, but the application architecture is still "centralized". That breeds heavy coupling, is inconvenient for development and maintenance, and makes everything rise and fall together. So the site is usually split into different sub-sites, hosted separately. (Microservices!)

Once the applications are split apart, the single database remains a limit: its connections, QPS, TPS, and I/O capacity are all finite. So the DB tier can be split vertically as well, giving each business module its own database.
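A minimal sketch of what vertical splitting means in code: a fixed mapping from business module to its own database. The module names and JDBC URLs are invented; in practice the mapping lives in configuration, with a connection pool per database.

```java
import java.util.Map;

public class VerticalSplitRegistry {
    private static final Map<String, String> MODULE_DB = Map.of(
            "user",  "jdbc:mysql://db-user.internal:3306/user",
            "order", "jdbc:mysql://db-order.internal:3306/order",
            "item",  "jdbc:mysql://db-item.internal:3306/item");

    static String jdbcUrlFor(String module) {
        String url = MODULE_DB.get(module);
        if (url == null) throw new IllegalArgumentException("unknown module: " + module);
        return url;
    }

    public static void main(String[] args) {
        // Order-service code only ever sees the order database.
        System.out.println(jdbcUrlFor("order"));
    }
}
```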


Even with applications and DBs split, plenty of problems remain. Different sites may carry code with the same logic and functions. For basic shared functionality you can of course package a DLL or a jar and reference it everywhere, but such strong dependencies easily cause trouble (versioning and dependency management become very painful). This is where the value of the legendary SOA shows itself.
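A minimal sketch of the SOA idea: shared logic sits behind a service contract that each site calls remotely, instead of every site linking the same jar. The interface below is invented for illustration; a real stack would put an RPC framework or HTTP with service discovery behind it.

```java
// The contract: implemented once, inside the user service.
public interface UserService {
    /** Looks up a user's display name. */
    String displayNameOf(long userId);
}

// Call sites depend only on the contract; upgrading the implementation does
// not force every consumer to re-release, unlike a shared jar.
class OrderSite {
    private final UserService users;

    OrderSite(UserService users) { this.users = users; }

    String orderHeader(long userId, long orderId) {
        return "Order " + orderId + " for " + users.displayNameOf(userId);
    }
}
```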


There are still dependency problems between applications and services. At this point a high-throughput decoupling tool enters the picture: the message queue.
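An in-process toy that shows the decoupling a message queue buys: the producer returns as soon as the event is enqueued, and the consumer drains at its own pace. A real deployment would use a broker such as Kafka or RabbitMQ; the event format here is made up.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueDecouplingDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(1024);

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String event = queue.take();    // blocks until a message arrives
                    System.out.println("consumer handled: " + event);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // exit on shutdown
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        // The producer (e.g. the order site) fires an event and moves on;
        // it neither knows nor waits for whoever consumes it.
        queue.put("order-created:10001");
        queue.put("order-created:10002");
        Thread.sleep(200); // give the daemon consumer time to drain (demo only)
    }
}
```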

Finally comes a signature move of the big Internet companies: sub-database and sub-table (sharding). From personal experience: unless business growth and everything around it make it truly urgent, do not take this step lightly.

Anyone can split the databases and tables; the key is what to do after the split. At present there is no completely open-source, free solution on the market that lets you solve database sharding once and for all.
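For completeness, a minimal sketch of hash-based shard routing with invented counts (2 databases × 4 tables each). Note what it deliberately leaves out, because that is precisely the hard part: cross-shard queries, joins, transactions, and re-balancing when the counts change.

```java
public class ShardRouter {
    private static final int DB_COUNT = 2;
    private static final int TABLES_PER_DB = 4;

    /** Maps a user id to a physical "db.table" location. */
    static String locate(long userId) {
        long slot = Math.floorMod(userId, (long) DB_COUNT * TABLES_PER_DB);
        long db = slot / TABLES_PER_DB;
        long table = slot % TABLES_PER_DB;
        return "user_db_" + db + ".user_table_" + table;
    }

    public static void main(String[] args) {
        System.out.println(locate(10001L)); // user_db_0.user_table_1
    }
}
```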
