Summary of massive data and high concurrency solutions for large website applications

The main solutions for massive data and high concurrency

Solutions for massive data:

  1. Use cache;
  2. Page static technology;
  3. Database optimization;
  4. Separate active data in the database;
  5. Batch reads and delayed modifications;
  6. Read-write separation;
  7. Use technologies such as NoSQL and Hadoop;
  8. Distributed database deployment;
  9. Separation of application services and data services;
  10. Use a search engine to search the data in the database;
  11. Business splitting;

Solutions for high concurrency situations:

  1. Separate the application from static resource files;
  2. Page caching;
  3. Clustering and distributed deployment;
  4. Reverse proxy;
  5. CDN;

Solutions for massive data

(1) Use cache

Website access patterns generally follow the "80/20 rule": 80% of business visits are concentrated on 20% of the data.

For example, during a given period Baidu's searches may be concentrated on a small number of hot keywords, and on Sina Weibo during a given period people may likewise be paying attention to only a small number of topics.

In general, users only use a small part of the total data. When a website grows to a certain scale and database IO becomes the performance bottleneck, using a cache to keep this small portion of hot data in memory is a good choice: it not only reduces the pressure on the database but also improves the overall data access speed of the website.

There are two common ways to use a cache. One is to keep the data in memory directly in program code, for example in a Map or ConcurrentHashMap; the other is to use a caching framework such as Redis, Ehcache, or Memcached.
When using a caching framework, what we mainly need to care about is when to create the cache and what the cache invalidation strategy is.

The cache can be created in many ways, and you need to choose according to your own business. For example, the news on the homepage can be cached when it is read for the first time, and the content of articles with a relatively high click-through rate can also be cached.

Since memory is limited, deciding what to cache is a question worth thinking about. The cache invalidation mechanism also needs to be designed carefully: it can be based on an expiration time, and popular data can be given priorities, with different expiration times for different priorities.

Note that when we delete a piece of data, we also have to consider removing the corresponding cache entry, and to consider whether the cached data has already expired before removing it.
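As a minimal sketch of the idea (not tied to any particular framework), the following hand-rolled in-memory cache uses a ConcurrentHashMap, creates an entry on the first read, attaches a per-entry expiration time, and supports explicit eviction when the underlying data is deleted; the class and method names are purely illustrative:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal sketch of an in-memory cache with per-entry TTL (names are illustrative).
public class SimpleTtlCache<K, V> {

    private static class Entry<V> {
        final V value;
        final long expiresAt;                 // absolute expiry time in milliseconds
        Entry(V value, long ttlMillis) {
            this.value = value;
            this.expiresAt = System.currentTimeMillis() + ttlMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();

    // Create the cache entry on first access ("cache on first read"), reload when expired.
    public V get(K key, long ttlMillis, Function<K, V> loader) {
        Entry<V> e = map.get(key);
        if (e == null || e.expiresAt < System.currentTimeMillis()) {
            V value = loader.apply(key);      // e.g. read from the database
            map.put(key, new Entry<>(value, ttlMillis));
            return value;
        }
        return e.value;
    }

    // Invalidate explicitly, e.g. when the underlying row is deleted or updated.
    public void evict(K key) {
        map.remove(key);
    }
}
```

In a real system a framework such as Ehcache or Redis would normally take over this role, adding features such as size limits and eviction policies.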

When using a cache, we also need to consider how to handle faults when a cache server fails. One approach is to cache the same data on several servers, control where data is cached through distributed deployment, and automatically switch to another machine when one fails. Another is consistent hashing: when a cache server goes down, its keys are taken over by the remaining servers, and when it comes back it takes its share of keys again. Consistent hashing is also what locates data among the distributed cache servers in the first place. Consistent hashing for cached data is a relatively difficult topic and can only be touched on here; for a deeper understanding, this article is recommended: http://blog.csdn.net/liu765023051/article/details/49408099
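To illustrate consistent hashing, here is a minimal sketch of a hash ring with virtual nodes: each key is mapped to the first cache node found clockwise on the ring, so when one cache server is added or removed only the keys adjacent to it have to move. The node names and the number of virtual nodes are arbitrary assumptions for the example:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring: keys map to the first cache node clockwise on the ring.
public class ConsistentHashRing {

    private static final int VIRTUAL_NODES = 100;   // virtual nodes smooth the distribution
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public ConsistentHashRing(List<String> nodes) {
        for (String node : nodes) {
            for (int i = 0; i < VIRTUAL_NODES; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }
    }

    // Locate the cache server responsible for a given key.
    public String nodeFor(String key) {
        long h = hash(key);
        SortedMap<Long, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {        // use the first 8 bytes of the MD5 digest
                h = (h << 8) | (d[i] & 0xFF);
            }
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In production the consistent-hashing support built into the cache client (for example a Memcached or Redis client) is normally used instead of a hand-rolled ring.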

 

(2) Page static technology

With a traditional JSP interface, the front-end page is rendered by the back-end server and then returned to the browser for parsing and execution.

Of course, front-end/back-end separation is now advocated: the front end is basically static HTML, and routes provided by AngularJS or Node.js send requests to the back-end server to obtain data, which is then rendered in the browser. This greatly reduces the pressure on the back-end server.

Static HTML, CSS, JS, images and other resources can also be placed on a cache server or a CDN server; in practice, the static-resource capabilities of a CDN or an Nginx server are probably used the most.

In addition, the book "Advanced Guide to High-Performance Website Construction: Best Practices for Web Developer Performance Optimization" (translated by the Koubei front-end team) offers some valuable advice on the front-end side of website performance, summarized in the figure below:

(Figure: front-end performance best practices summarized from the book.)

Therefore, choosing the right way to handle these static resources is very helpful to overall website performance!

 

(3) Database optimization

Database optimization is the most basic part of optimizing the performance of an entire website, because database IO operations are the bottleneck of most websites. Even with caching in place, database optimization still needs to be taken seriously. Larger companies usually have their own DBA teams responsible for creating databases and building data models; those of us without that background often have to search for articles online and explore on our own, without a systematic approach to database optimization.

Database optimization is, in a sense, a way of trading technology for money. There are many approaches, and the common ones include: table structure optimization, SQL statement optimization, partitioning, splitting tables (sub-tables), index optimization, and using stored procedures instead of direct SQL operations.

 

1. Table structure optimization

Regarding database development conventions and usage tips, as well as design and optimization, I have summarized some articles before; I will simply put the links here. If you need them, take a look: 
a)  Summary of MySQL development specifications and usage skills: http://blog.csdn.net/xlgen157387/article/details/48086607 
b)  How to improve query efficiency when searching tens of millions of rows: http://blog.csdn.net/xlgen157387/article/details/44156679

In addition, it is not always necessary to create foreign keys when designing database tables. One advantage of foreign keys is that cascading deletes can be performed easily, but business operations on data nowadays usually rely on transactions to ensure consistency. Personally, compared with letting MySQL perform cascading deletes automatically through foreign keys, I feel more reassured performing the deletes explicitly inside a transaction. Of course, foreign keys may still have suitable scenarios; if you have good suggestions, please leave a message!
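As a small, hypothetical sketch of deleting with a transaction instead of a foreign-key cascade (the t_user and t_order tables and their columns are made up for illustration):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch: delete a user and the user's orders in one transaction instead of relying
// on FOREIGN KEY ... ON DELETE CASCADE. Table and column names are illustrative.
public class UserDao {

    public void deleteUserWithOrders(Connection conn, long userId) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement delOrders = conn.prepareStatement("DELETE FROM t_order WHERE user_id = ?");
             PreparedStatement delUser   = conn.prepareStatement("DELETE FROM t_user WHERE id = ?")) {
            delOrders.setLong(1, userId);
            delOrders.executeUpdate();
            delUser.setLong(1, userId);
            delUser.executeUpdate();
            conn.commit();                    // both deletes succeed or neither does
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```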

2. SQL optimization

SQL optimization mainly targets the processing logic of the SQL statements themselves, used together with indexes. In addition, SQL can be optimized for specific business operations: we can record the execution time of the database calls that carry out the business logic and then optimize the slow ones specifically, which works very well. For example, the following figure shows the execution time of a database operation call:

(Figure: execution time of a database operation call.)
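A very simple way to record execution time, assuming plain JDBC and an illustrative t_order table, is to wrap the call and log how long it took; in practice MySQL's slow query log or an application monitoring tool usually does this job:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch: measure how long an individual query takes so that slow statements
// can be found and optimized. The SQL and table names are illustrative.
public class TimedQuery {

    public static int countOrders(Connection conn, long userId) throws SQLException {
        String sql = "SELECT COUNT(*) FROM t_order WHERE user_id = ?";
        long start = System.nanoTime();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, userId);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getInt(1);
            }
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("SQL took " + elapsedMs + " ms: " + sql);   // or send to a log/monitoring system
        }
    }
}
```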

Some suggestions on SQL optimization have been sorted out before, please check them out:

a)  Analysis of 19 MySQL performance optimization points: http://blog.csdn.net/xlgen157387/article/details/50735269

b)  Various performance optimizations for MySQL batch SQL insertion: http://blog.csdn.net/xlgen157387/article/details/50949930

3. Splitting tables (sub-tables)

Splitting tables means decomposing one large table into multiple physical tables with independent storage space according to certain rules. In MySQL's MyISAM engine, each table corresponds to three files: a .MYD data file, a .MYI index file, and a .frm table structure file. The sub-tables can live on the same disk or on different machines. When reading or writing, the corresponding sub-table name is first obtained according to the pre-defined rule, and the operation is then performed on that table.

For example, the user table may contain users with many different roles, which can be enumerated into categories such as students, teachers, and enterprises. In this case we can split the table by category, so that each query only has to scan the smaller range belonging to that type of user.

After the table has been split, querying across the complete data set requires multi-table operations.
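To make the routing rule concrete, here is a small sketch of choosing the physical sub-table name before building the SQL; both the split-by-role rule and the split-by-id rule, as well as the table names, are illustrative assumptions:

```java
// Sketch: pick the physical sub-table name from a routing rule before building the SQL.
// The table names and the rules (split the user table by role, or by id) are illustrative.
public class UserTableRouter {

    public enum Role { STUDENT, TEACHER, ENTERPRISE }

    // Rule 1: split by category, e.g. t_user_student, t_user_teacher, t_user_enterprise.
    public static String tableFor(Role role) {
        return "t_user_" + role.name().toLowerCase();
    }

    // Rule 2: spread one huge table across N tables by id, e.g. t_user_0 .. t_user_15.
    public static String tableFor(long userId, int tableCount) {
        return "t_user_" + (userId % tableCount);
    }

    public static void main(String[] args) {
        String table = tableFor(Role.STUDENT);
        String sql = "SELECT id, name FROM " + table + " WHERE id = ?";
        System.out.println(sql);              // the DAO then runs this SQL as usual
    }
}
```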

4. Partitioning

Database partitioning is a physical database design technique that DBAs and database modelers are familiar with. Although partitioning can achieve many things, its main purpose is to reduce the total amount of data read and written by a specific SQL operation, thereby reducing response time.

Partitioning is similar to splitting tables in that both decompose a table according to rules. The difference is that splitting tables decomposes the large table into several independent physical tables, while partitioning divides the data into multiple segments stored in different locations, which can be on the same disk or on different machines. After partitioning, it still looks like a single table on the surface, but the data is spread across multiple locations. When reading and writing, the application still operates on the large table's name, and the DBMS automatically organizes the partitioned data.

When a table grows very large, reading and querying it become inefficient. Simply splitting the data into several tables solves the size problem but makes operations more troublesome: similar data ends up in different tables, searching requires querying several of them, CRUD operations first have to find all the tables involved, and anything that spans tables becomes cumbersome.

Partitioning solves this problem. Partitioning divides the data of one table into different areas for storage according to certain rules; when a query's data range falls within a single area, only that area needs to be scanned. The amount of data touched is smaller and operations are faster, and the mechanism is transparent to the program, which does not need to be modified.
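For illustration, the following sketch creates a MySQL range-partitioned table through JDBC; the t_order table, its columns, and the yearly partitioning rule are assumptions made up for the example:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: a MySQL range-partitioned table. To the application it is still one table
// named t_order; MySQL stores each year's rows in its own partition.
// Table and column names are illustrative.
public class CreatePartitionedTable {

    public static void create(Connection conn) throws SQLException {
        String ddl =
            "CREATE TABLE t_order (" +
            "  id BIGINT NOT NULL," +
            "  user_id BIGINT NOT NULL," +
            "  created DATE NOT NULL," +
            "  amount DECIMAL(10,2)," +
            "  PRIMARY KEY (id, created)" +               // MySQL requires the partition column in the PK
            ") PARTITION BY RANGE (YEAR(created)) (" +
            "  PARTITION p2016 VALUES LESS THAN (2017)," +
            "  PARTITION p2017 VALUES LESS THAN (2018)," +
            "  PARTITION pmax  VALUES LESS THAN MAXVALUE)";
        try (Statement st = conn.createStatement()) {
            st.execute(ddl);
        }
    }
}
```

The application keeps issuing SQL against t_order exactly as before; MySQL decides which partition each statement touches.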

5. Index optimization

The general principle of an index is to keep the values of the specified fields pre-sorted in a separate table-like structure that is maintained whenever the data changes. When a query filters on the indexed field, the matching entries can be found quickly in the index, and their pointers lead straight to the corresponding rows in the table, so lookups are very fast.

However, although query efficiency is greatly improved, every insert, delete, and update must also maintain the corresponding index because the data has changed, which costs additional resources.

How to use indexes needs to be discussed case by case; selecting appropriate indexes according to the specific business is an obvious way to improve performance!
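As a small illustration, the sketch below adds a composite index for an assumed query pattern and uses EXPLAIN to check which index MySQL picks; the table, column, and index names are invented for the example:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: add an index for a frequent query pattern and verify with EXPLAIN
// that MySQL actually uses it. Table, column and index names are illustrative.
public class IndexCheck {

    public static void run(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            // Speeds up "... WHERE user_id = ? ORDER BY created"; slows down writes slightly.
            st.execute("CREATE INDEX idx_order_user_created ON t_order (user_id, created)");

            try (ResultSet rs = st.executeQuery(
                    "EXPLAIN SELECT id FROM t_order WHERE user_id = 42 ORDER BY created")) {
                while (rs.next()) {
                    // the 'key' column of the EXPLAIN output shows which index was chosen
                    System.out.println("key = " + rs.getString("key"));
                }
            }
        }
    }
}
```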

Recommended articles to read:

a)  The role, advantages and disadvantages of database indexes and 11 usages of indexes: http://blog.csdn.net/xlgen157387/article/details/45030829

b)  Principle of database indexing: http://blog.csdn.net/kennyrose/article/details/7532032

6. Use stored procedures instead of direct operations

A stored procedure is a set of SQL statements kept in the database in order to accomplish a specific function. After being compiled once, it does not need to be recompiled again; the user executes it by specifying its name and supplying parameters (if it has any). Stored procedures are an important database object, and any well-designed database application should consider using them.

For business operations that are relatively complex and called frequently, the hand-written SQL statements can be replaced with a stored procedure. The stored procedure only needs to be compiled once, and fairly complex operations can be performed inside it.
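A brief sketch of calling a stored procedure from Java through JDBC's CallableStatement; the place_order procedure and its parameters are hypothetical:

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Types;

// Sketch: call a stored procedure from JDBC instead of issuing several separate
// SQL statements. The procedure name and its parameters are made up for illustration.
public class StoredProcedureCall {

    public static long placeOrder(Connection conn, long userId, long productId) throws SQLException {
        // assumed procedure: place_order(IN p_user_id, IN p_product_id, OUT p_order_id)
        try (CallableStatement cs = conn.prepareCall("{call place_order(?, ?, ?)}")) {
            cs.setLong(1, userId);
            cs.setLong(2, productId);
            cs.registerOutParameter(3, Types.BIGINT);
            cs.execute();
            return cs.getLong(3);             // the order id produced by the procedure
        }
    }
}
```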

 

(4) Separate active data in the database

As with the "80/20 rule" mentioned above, although a website holds a lot of data, the data that is frequently accessed is limited, so the relatively active data can be separated out and stored on its own to improve processing efficiency.

In fact, the caching idea discussed earlier is an obvious case of separating active data: the popular data is cached in memory.

Another scenario: a website may have tens of millions of registered users, but only a few million of them log in frequently, while the rest have not logged in for a long time. If these "zombie users" are not separated out, every query over the logged-in users also wastes work scanning them, so it pays to store them separately.
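A rough sketch of separating such inactive data, assuming hypothetical t_user and t_user_archive tables with a last_login column, might periodically move long-inactive users into the archive table inside one transaction:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Duration;
import java.time.Instant;

// Sketch: move "zombie" users (no login for N days) into a separate archive table
// so queries against the active user table scan far fewer rows. Names are illustrative.
public class ArchiveInactiveUsers {

    public static void archive(Connection conn, int inactiveDays) throws SQLException {
        Timestamp cutoff = Timestamp.from(Instant.now().minus(Duration.ofDays(inactiveDays)));
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement copy = conn.prepareStatement(
                 "INSERT INTO t_user_archive SELECT * FROM t_user WHERE last_login < ?");
             PreparedStatement del = conn.prepareStatement(
                 "DELETE FROM t_user WHERE last_login < ?")) {
            copy.setTimestamp(1, cutoff);
            copy.executeUpdate();
            del.setTimestamp(1, cutoff);
            del.executeUpdate();
            conn.commit();                    // copy and delete succeed or fail together
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```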

 

(5) Batch read and delayed modification

The principle of batch reads and delayed modifications is to improve efficiency by reducing the number of operations issued against the database.

Batch reading combines multiple queries into a single read, because every database request requires establishing and releasing a connection and ties up some resources; batch reads can also be performed asynchronously.

Delayed modification targets highly concurrent, frequently modified data: each modification is first saved to the cache, and the cached data is written back to the database periodically; in the meantime the program can read from both the database and the cache.
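As a minimal sketch of delayed modification (the article view counter and the flush interval are assumptions for the example), hot updates are accumulated in memory and written back in batches on a schedule:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Sketch of delayed modification: a hot counter (e.g. an article view count) is updated
// in memory on every request and written back to the database in batches every few
// seconds, instead of issuing one UPDATE per page view. Names are illustrative.
public class ViewCountBuffer {

    private final ConcurrentHashMap<Long, LongAdder> pending = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public ViewCountBuffer() {
        scheduler.scheduleAtFixedRate(this::flush, 5, 5, TimeUnit.SECONDS);
    }

    // Called on every page view; no database access here.
    public void increment(long articleId) {
        pending.computeIfAbsent(articleId, id -> new LongAdder()).increment();
    }

    // Periodically persist and clear the buffered increments.
    private void flush() {
        for (Map.Entry<Long, LongAdder> e : pending.entrySet()) {
            long delta = e.getValue().sumThenReset();
            if (delta > 0) {
                // in a real system: UPDATE t_article SET view_count = view_count + ? WHERE id = ?
                System.out.println("flush article " + e.getKey() + " += " + delta);
            }
        }
    }
}
```

The trade-off is that a crash can lose the increments that have not been flushed yet, which is usually acceptable for data such as view counts.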

For example: as I write this blog, I wrote some content at first and clicked publish, then went back to the Markdown editor to make changes. I have a habit of clicking the "Save" button at the top even before I have finished writing, and when I open the published blog in another page, my changes are already visible even though I am still editing!


I don't know if CSDN's technology puts these data in the cache before I click to publish.

 

(6) Read-write separation

The essence of read-write separation is to spread the application's database reads and writes across multiple database servers, reducing the access pressure on any single database.

Read-write separation is generally implemented by configuring master and slave databases: data is read from the slave, while inserts, updates, and deletes go to the master.
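A bare-bones sketch of routing in application code, assuming one master and one slave DataSource (real projects usually rely on middleware or framework-level routing rather than doing this by hand):

```java
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.SQLException;

// Sketch: hand-rolled read/write routing. Writes go to the master DataSource,
// reads go to a slave. Only meant to show the idea.
public class ReadWriteRouter {

    private final DataSource master;
    private final DataSource slave;

    public ReadWriteRouter(DataSource master, DataSource slave) {
        this.master = master;
        this.slave = slave;
    }

    // INSERT / UPDATE / DELETE -> master
    public Connection writeConnection() throws SQLException {
        return master.getConnection();
    }

    // SELECT -> slave (beware of replication lag for read-after-write cases)
    public Connection readConnection() throws SQLException {
        return slave.getConnection();
    }
}
```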


For related articles, please see: 
a)  MySQL 5.6 master-slave (Master/Slave) synchronization installation and configuration details: http://blog.csdn.net/xlgen157387/article/details/51331244 
b)  Common MySQL master-slave replication topologies, principle analysis, and how to improve replication efficiency: http://blog.csdn.net/xlgen157387/article/details/52451613

 

(7) Using technologies such as NoSQL and Hadoop

NoSQL databases are non-relational and handle unstructured data. Because of their flexibility, they break away from the constraints of relational databases and can be operated on more flexibly. In addition, because NoSQL stores data across multiple blocks, operations on large volumes of data are also quite fast.

 

(8) Distributed database deployment

No single server, however powerful, can keep up with the growing business demands of a large website. After read-write separation, one database server has been split into two or more, but that still may not meet continuously growing business needs. Distributed database deployment is the last resort for splitting a website's database, and it is only used when the volume of data in single tables is extremely large.

Distributed deployment of the database is the ideal situation: different tables are stored in different databases, and those databases are placed on different servers, so that when a request involves multiple tables, multiple servers can process it at the same time, improving efficiency.

A simple architecture diagram of a distributed database is as follows:

(Figure: simple architecture of a distributed database.)

(9) Separation of application services and data services

The purpose of separating the application server from the database server is to tune each machine's underlying configuration according to its role, so the characteristics of each server can be used to best effect. For example, the database server needs a certain amount of disk space while the application server does not, so separating them is beneficial; it also prevents a problem on one server from taking other services down with it.


(10) Use a search engine to search the data in the database

Search engines, as a non-database query technology, offer better support for the scalable, distributed nature of website applications.

Common search engines such as Solr maintain the mapping between keywords and documents with an inverted index, much like looking up a word in the "Xinhua Dictionary": we first look in the dictionary's index to find the word and then jump to its specific location.

By maintaining the mapping between keywords and documents, the search engine can quickly locate the data being searched for, which is far more efficient than a traditional database search.
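To show the idea behind the inverted index, here is a toy in-memory version; it is only a simplification of what Solr or Elasticsearch maintain, and the document ids and text are made up:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy inverted index: maps each keyword to the set of document ids containing it,
// which is the core structure search engines such as Solr maintain (greatly simplified here).
public class InvertedIndex {

    private final Map<String, Set<Long>> index = new HashMap<>();

    public void add(long docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, w -> new HashSet<>()).add(docId);
            }
        }
    }

    // Lookup is a map access instead of a LIKE '%keyword%' table scan.
    public Set<Long> search(String keyword) {
        return index.getOrDefault(keyword.toLowerCase(), Set.of());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(1, "MySQL performance tuning");
        idx.add(2, "Solr query performance comparison");
        System.out.println(idx.search("performance"));   // prints the ids of both documents
    }
}
```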

The currently popular ELK stack (Elasticsearch, Logstash, Kibana) is also well worth learning.

An article about the comparison of Solr and MySQL query performance: 
Solr and MySQL query performance comparison: http://www.cnblogs.com/luxiaoxun/p/4696477.html?utm_source=tuicool&utm_medium=referral

(11) Business splitting

Why split the business? In the final analysis, it is so that the data tables of different businesses can be deployed on different servers and each business can look up its own data to meet the website's needs. Each application obtains a different service by connecting to a specified URL.

For example, a large shopping website will split the homepage, shops, orders, buyers, sellers and so on into different sub-businesses. On the one hand, the business modules are handed to different teams for development; on the other hand, the database tables used by different businesses are deployed on separate servers, which is exactly the idea of splitting. When the database server used by one business module fails, it does not affect the normal use of the databases of the other modules; and when the traffic of one module surges, the databases used by that module can be scaled out dynamically to meet the business need.

 

Solutions for high concurrency situations

(1) Separation of application and static resource files

So-called static resources are the HTML, CSS, JS, images, video, GIFs and so on used by our website. Separating the application from static resource files is also a common front-end/back-end separation solution: the application only provides data services, static resources are deployed on dedicated servers (Nginx servers or a CDN), and the front end uses the routing provided by AngularJS or Node.js to call specific application services, obtain the data, and render it in the browser. This can greatly reduce the pressure on the back-end servers.

For example, the images used on the Baidu homepage are served from a separate domain.


(2) Page cache

Page caching stores pages generated by the application whose data rarely changes, so the page does not have to be regenerated for every request, saving a lot of CPU; if the cached pages are kept in memory, serving them is even faster.

You can use the caching function provided by Nginx, or you can use Squid, a specialized page caching server.

(3) Clustering and distributed deployment

(4) Reverse proxy

For a reference article, please see: 
http://blog.csdn.net/xlgen157387/article/details/49781487

(5) CDN

A CDN is essentially a cluster of page cache servers. Its purpose is to return the data a user needs as quickly as possible, which both speeds up access for the user and reduces the load on the back-end servers.

CDN stands for Content Delivery Network. The basic idea is to avoid, as far as possible, the bottlenecks and links on the Internet that can affect the speed and stability of data transmission, so that content is delivered faster and more reliably.

A CDN is a layer of intelligent virtual network built on top of the existing Internet by placing node servers throughout the network. Based on comprehensive real-time information such as node load and response time, it redirects the user's request to the service node closest to the user. Its purpose is to let users obtain the content they want from a nearby node, relieving Internet congestion and improving the response speed of website visits.

In other words, CDN servers are deployed in the machine rooms of network operators and provide data access at the layer closest to the user. When users request website services, they can fetch the data from the nearest network provider's machine room: China Telecom users are assigned Telecom nodes, and China Unicom users are assigned Unicom nodes.

The way a CDN dispatches requests is special: they are not distributed by ordinary load-balancing servers, but by dedicated CDN DNS servers during domain name resolution.

The CDN structure diagram is as follows:

(Figure: CDN structure.)

Summary

The points mentioned in this article are not covered in depth; if you need to study one of these methods, you can find resources and dig into it yourself. Of course, solutions for massive data and high concurrency in large website applications are not limited to these techniques; there are many mature solutions out there that you can learn about as needed.

Note: the pictures referenced in this article are from "Website Architecture and Its Evolution Process" by Han Lubiao. I would like to thank the original author for the fine illustrations; the structure of this article broadly follows that author's ideas, but the content is organized from my own experience and study.
