Summary of massive data storage and high concurrency solutions in the era of big data

1. Storage of structured data

        With the widespread popularity of Internet applications, the storage and access of massive data have become a bottleneck in system design. For a large-scale Internet application, billions of page views per day place an extremely high load on the database, creating serious problems for the stability and scalability of the system.

  • Horizontally sharding the database reduces the load on a single machine and minimizes the damage caused by downtime.
  • A load-balancing strategy effectively reduces the access load on a single machine, lowering the chance of downtime.
  • A cluster solution solves the problem of the database becoming a single point of failure when one server goes down.
  • A read/write separation strategy maximizes the speed and concurrency of read operations in the application.

1. What is data segmentation

        Data is distributed horizontally across different DBs or tables according to a set of segmentation rules, and the specific DB or table to query is located through the corresponding DB-routing or table-routing rules. The "sharding" discussed here usually refers to "horizontal sharding". What do such segmentation and routing rules look like? Consider a simple example: the article table of a blog application, with the following fields:

article_id(int),  title(varchar(128)),  content(varchar(1024)),  user_id(int)

We can put all article rows with user_id from 1 to 10000 into the article table in DB1, all rows with user_id from 10001 to 20000 into the article table in DB2, and so on up to DBn. Mapping a query back to its specific DB through these sharding rules is the process we call "DB routing".
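The number-segment routing just described can be sketched in a few lines of Python (a minimal illustration; the shard size, DB count, and DB names are assumptions, not a real implementation):

```python
def route_db(user_id, shard_size=10000, num_dbs=4):
    """Number-segment DB routing: user_id 1..10000 -> DB1,
    10001..20000 -> DB2, and so on (names are illustrative)."""
    if user_id < 1:
        raise ValueError("user_id must be positive")
    db_index = (user_id - 1) // shard_size + 1
    if db_index > num_dbs:
        raise ValueError("user_id outside the configured segments")
    return f"DB{db_index}"
```

In a real application such a function would sit in the data-access layer, choosing the connection before each query is issued.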

A DB design based on data segmentation will violate the usual rules and constraints. To segment the data, tables must carry a redundant field that serves as the distinguishing (sharding-key) field, such as user_id in the article example above (admittedly user_id does not illustrate redundancy very well, since the field would exist even without sharding; consider it a convenient coincidence). Redundant fields do not appear only in sharding scenarios: in many large-scale applications redundancy is also necessary, which is part of efficient DB design.

2. Why data segmentation is needed

        Suppose the article table contains 50 million rows and we insert a new row. After the insert completes, the database must maintain the index over those 50 million rows, an overhead that cannot be ignored. Conversely, if we split this table into 100 tables, article_001 through article_100, the 50 million rows are spread out so that each sub-table holds only about 500,000. The time to maintain the index after inserting into a 500,000-row table drops by an order of magnitude, greatly improving the DB's runtime efficiency and increasing its concurrency. And these are not the only benefits of sub-tables: lock contention on write operations is also noticeably reduced.
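The 100-way sub-table split described above needs a table-routing rule as well. A minimal sketch (assuming, purely for illustration, that rows are assigned by article_id modulo the table count):

```python
def route_table(article_id, num_tables=100):
    """Map an article_id to one of article_001 .. article_100.
    The modulo rule here is an assumption for illustration; any
    rule that spreads rows evenly would serve."""
    suffix = article_id % num_tables + 1
    return f"article_{suffix:03d}"
```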

Oracle's database is indeed very mature and stable, but not every company can afford its high license costs and the high-end hardware it requires. With free MySQL running as a cluster on cheap servers, or even PCs, we can achieve the effect of a minicomputer plus a large commercial DB while sharply reducing capital investment and operating costs. Why not do it? So we choose sharding.

3. How to split data

  1. Data segmentation can be physical: data is distributed to different DB servers through a series of segmentation rules, and the specific database is reached through routing rules. Each access then faces not one server but N servers, reducing the load pressure on any single machine.
  2. Data segmentation can also happen within one database: the data is distributed to different tables through a series of segmentation rules. For example, article is divided into sub-tables such as article_001 and article_002, and several sub-tables together horizontally form one logically complete article table. The purpose is simply to shrink the amount of data any single operation touches.

        To sum up: sub-databases reduce the load on a single machine, while sub-tables improve the efficiency of data operations, especially writes. So far we have not touched on how to split. Next, we describe and explain the segmentation rules in detail.

        To achieve horizontal segmentation, every table must carry a redundant field that serves as the basis for splitting and as the marker field. In typical applications we use user_id as the distinguishing field, and on this basis there are the following three sharding methods and rules:

(1) Divide by number segment:

Using user_id as the distinguishing field, 1~1000 maps to DB1, 1001~2000 to DB2, and so on.

Advantages: data can be migrated in parts.

Disadvantages: data may be distributed unevenly.

(2) Hash modulus:

Hash user_id (or, if user_id is numeric, use its value directly), then take the result modulo a specific number. For example, to split into 4 databases, we take user_id % 4, which has four possible results: 1 maps to DB1, 2 to DB2, 3 to DB3, and 0 to DB4. Data is thus distributed evenly across the 4 DBs.

Pros: Evenly distributed data

Disadvantages: data migration is troublesome, and data cannot be allocated according to each machine's capacity.
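The hash-modulo rule in (2) can be sketched as follows (a toy illustration; the DB names are assumed):

```python
def route_db_by_hash(user_id, num_dbs=4):
    """Hash-modulo routing: user_id % 4 selects the DB, with a
    remainder of 0 mapping to DB4 as described above (numeric ids
    are used directly; non-numeric keys would be hashed first)."""
    remainder = user_id % num_dbs
    return f"DB{remainder if remainder != 0 else num_dbs}"
```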

(3) Save the database configuration in the authentication library

Create a separate DB that stores only the mapping between user_id and the target DB. Every database access first queries this mapping to obtain the concrete DB, then performs the intended query there.

Advantages: strong flexibility; a one-to-one mapping.

Disadvantages: an extra query before every query, which noticeably reduces performance.
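The mapping-table approach in (3) amounts to a directory consulted before every query. A minimal in-memory sketch (in practice the mapping lives in its own database; all names here are illustrative):

```python
class ShardDirectory:
    """Routing via a user_id -> DB mapping table (the 'authentication
    library'); every query pays one extra lookup here first."""
    def __init__(self):
        self._mapping = {}

    def assign(self, user_id, db_name):
        # flexibility: any user can be placed on (or moved to) any DB
        self._mapping[user_id] = db_name

    def route(self, user_id):
        return self._mapping[user_id]
```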

        These are the three methods we choose between in everyday development; some complex projects combine all three. Through the description above we now have a basic understanding of sharding rules. There are certainly better and more complete sharding methods, which we must keep exploring and discovering.

The distributed data solution provides the following functions:

(1) Provide sharding rules and routing rules (RouteRule, abbreviated RR); the three segmentation rules described above are embedded directly into this system, and the concrete embedding method will be described and discussed in detail later;

(2) Introduce the concept of cluster (Group) to ensure high availability of data;

(3) Introduce load balancing policy (LoadBalancePolicy referred to as LB);

(4) Introduce a cluster-node availability detection mechanism to regularly check the availability of each machine, ensuring that the LB strategy is applied correctly and the system remains highly stable;

(5) Introduce read/write separation to improve data query speed;


2. Main solutions for massive data and high concurrency

2.1. Solutions for massive data:

  1. Use cache;
  2. Page static technology;
  3. Database optimization;
  4. Separate active data in the database;
  5. Batch reads and delayed modifications;
  6. Read/write separation;
  7. Use technologies such as NoSQL and Hadoop;
  8. Distributed database deployment;
  9. Separation of application services and data services;
  10. Use search engines to search data in databases;
  11. Separation of business;

2.2. Solutions for high concurrency:

  1. Separation of application and static resource files;
  2. Page caching;
  3. Cluster and distributed deployment;
  4. Reverse proxy;
  5. CDN;

3. Solutions for Massive Data

3.1. Using cache

        Website access patterns mostly follow the "80/20 rule": 80% of business accesses are concentrated on 20% of the data.

For example: in a certain period of time, Baidu's search hot words may focus on a small number of popular words; in a certain period of time, Sina Weibo may also focus on a small number of topics that everyone pays attention to.

        Generally speaking, users actively use only a small portion of the total data. When a website grows to a certain scale and database IO becomes the performance bottleneck, caching this small portion of hot data in memory is a very good choice: it both reduces pressure on the database and improves the overall data access speed of the website.

        Cache can be used in two ways: save data directly in memory through program code, for example with a Map or ConcurrentHashMap; or use a caching framework such as Redis, Ehcache, or Memcached.
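The "save data directly in memory" option can be sketched with a tiny TTL cache (pure illustration of the idea, not production code; a Java version would typically build on a ConcurrentHashMap):

```python
import time

class TTLCache:
    """Minimal in-process cache with per-entry expiry."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def put(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazy invalidation on read
            return None
        return value
```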


        When using a caching framework, what we need to care about is when to create the cache and what the invalidation strategy is.

There are many ways to populate a cache, and the choice depends on your business. For example, news on the news homepage should be cached the first time the data is read; for articles with a relatively high click-through rate, the article content can be cached.

With limited memory, choosing what to cache is a question worth thinking about. The cache invalidation mechanism also needs careful study: it can be a simple expiry time, or hot data can be assigned priorities, with different expiry times set for different priorities.

Note that when we delete a piece of data, we must also consider deleting its cache entry, and consider whether the data has already passed its cache expiry time before doing so, among other situations.

When using cache, we must also consider fault tolerance when a cache server fails. One approach is to cache the same data on N servers via distributed deployment and switch automatically to another machine when one fails; another is consistent hashing, which re-assigns keys among the remaining servers and restores them when the failed server recovers. Consistent hashing also serves to locate data among distributed cache servers, distributing the data across the different machines.
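The consistent-hashing idea mentioned above can be sketched like this (virtual nodes, replication, and failure detection are all omitted; server names are illustrative):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Keys map to the first server clockwise on a hash ring, so
    removing one server only remaps the keys that lived on it."""
    def __init__(self, servers):
        self._ring = sorted((self._hash(s), s) for s in servers)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def locate(self, key):
        points = [p for p, _ in self._ring]
        i = bisect.bisect(points, self._hash(key)) % len(self._ring)
        return self._ring[i][1]

    def remove(self, server):
        self._ring = [(p, s) for p, s in self._ring if s != server]
```

The key property: after `remove`, keys that were on surviving servers keep their placement, which is exactly why consistent hashing beats plain modulo when the server set changes.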

3.2. Page static technology

With a traditional JSP interface, the front-end page is rendered by the back-end server and then returned to the browser for parsing and execution, as shown below:


Of course, front-end/back-end separation is now advocated: the front end is essentially static HTML, the routing provided by AngularJS or NodeJS sends requests to the back-end server for data, and the data is then rendered in the browser, greatly reducing the pressure on the back-end server.

These static HTML, CSS, JS, and image resources can also be placed on a cache server or CDN server; the static-resource serving provided by CDN servers or Nginx is probably the most widely used.

  • Rule 1: Minimize HTTP requests.
  • Rule 2: Use a CDN.
  • Rule 3: Add an Expires header.
  • Rule 4: Gzip components.
  • Rule 5: Put stylesheets at the top.
  • Rule 6: Put scripts at the bottom.
  • Rule 7: Avoid CSS expressions.
  • Rule 8: Make JavaScript and CSS external.
  • Rule 9: Reduce DNS lookups.
  • Rule 10: Minify JavaScript.
  • Rule 11: Avoid redirects.
  • Rule 12: Remove duplicate scripts.
  • Rule 13: Configure ETags.
  • Rule 14: Make Ajax cacheable.

3.3. Database optimization

        Database optimization is the most basic part of optimizing a website's performance, because the bottleneck of most websites lies in database IO. Even with caching in place, database optimization still deserves serious attention. Larger companies have their own DBA teams responsible for database creation, data modeling, and so on; small teams like ours, without that expertise, can only read articles on database optimization and explore on our own, without forming a systematic set of optimization ideas.

Database optimization trades technical effort for hardware money. There are many approaches; the common ones include: table structure optimization, SQL statement optimization, partitioning, sub-tables, index optimization, and using stored procedures instead of direct operations.

3.3.1. Table structure optimization

        For database development specifications, usage skills, design, and optimization, I summarized some articles earlier; to save effort I will just give the addresses here:

a) Summary of MySQL development specifications and usage skills: http://blog.csdn.net/xlgen157387/article/details/48086607
b) How to improve query efficiency on a table with tens of millions of rows: http://blog.csdn.net/xlgen157387/article/details/44156679

        In addition, when designing database tables, do we need foreign keys? One benefit of foreign keys is convenient cascading deletes. However, since our business operations already use transactions to guarantee data consistency, I feel more at ease performing deletes ourselves inside a transaction than letting MySQL cascade them automatically through foreign keys.

3.3.2. SQL optimization

        SQL optimization mainly targets the processing logic of SQL statements and must be used together with indexes. We can also optimize specific business operations: by recording the database execution time of each business operation, we can optimize in a targeted way, and the effect is very good. For example, the figure below shows the execution time of one database call:

Some suggestions on SQL optimization have been sorted out before, please move to see:

a) Analysis of 19 MySQL performance optimization points: http://blog.csdn.net/xlgen157387/article/details/50735269

b) Various performance optimizations for MySQL batch SQL insertion: http://blog.csdn.net/xlgen157387/article/details/50949930

3.3.3. Sub-tables

        A sub-table scheme decomposes one large table, according to certain rules, into multiple entity tables with independent storage space. With MyISAM, each table corresponds to three files: a .MYD data file, a .MYI index file, and a .frm table-structure file. These sub-tables can sit on the same disk or on different machines. During reads and writes, the application derives the sub-table name from the predefined rules and operates on it.

For example: users in the user table have many roles, and an enumerated type can divide them into categories such as students, teachers, and enterprises. We can then split the table by category, so that each query only has to search within the smaller range selected by the user's type.

However, after splitting the tables, a query that needs the complete data set requires multi-table operations.

3.3.4. Partitioning

        Database partitioning is a physical database design technique familiar to DBAs and data modelers. Although partitioning can achieve many effects, its main purpose is to reduce the total amount of data read and written by specific SQL operations, shortening response time.

        Partitioning is similar to sub-tables in that both decompose a table according to rules. The difference is that sub-tables split one large table into several independent entity tables, while partitioning splits the data into segments stored in multiple locations, on the same disk or on different machines. After partitioning, the table still looks like one table on the surface, but its data is spread across multiple locations. Reads and writes still address the large table by name, and the DBMS organizes the partitioned data automatically.

        When a table becomes very large, reading and querying it becomes very slow. Splitting the data into separate tables is easy, but it makes subsequent operations troublesome: with the same kind of data spread across different tables, a search has to consult all of them, CRUD operations must first find every relevant table, and anything touching several tables requires cross-table operations, which is awkward.

        Partitioning solves this problem. It divides a table's data into different areas for storage according to certain rules, so that when the queried data range falls within one area, only that area needs to be scanned. The data volume per operation is smaller and the operation faster, and the method is transparent to the program: no code changes are needed.
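The placement rule behind range partitioning can be illustrated with a small sketch (the real DBMS does this transparently; this only mirrors the "values less than" idea, and the partition names are assumed):

```python
def partition_for(value, upper_bounds):
    """Return the partition holding `value` under range partitioning:
    each partition p0, p1, ... holds rows less than its upper bound,
    with a catch-all partition for everything beyond the last bound."""
    for i, upper in enumerate(upper_bounds):
        if value < upper:
            return f"p{i}"
    return "p_max"
```

A query constrained to one value range then touches only the matching partition rather than the whole table.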

3.3.5. Index optimization

        The general principle of an index is that as data changes, the values of the specified fields are kept sorted in a table-like structure, so that when a query filters on the indexed field, the matching record's pointer can be found quickly in the index and the row fetched from the table, which is very fast.

However, although query efficiency improves greatly, inserts, updates, and deletes must also update the corresponding indexes, which consumes resources.

        The use of indexes calls for different discussion in different situations. Choosing appropriate indexes according to the specific business is an obvious way to improve performance!

Recommended article reading:

a) The role, advantages and disadvantages of database indexes and 11 uses of indexes: http://blog.csdn.net/xlgen157387/article/details/45030829

b) The principle of database indexing: http://blog.csdn.net/kennyrose/article/details/7532032

3.3.6. Use stored procedures instead of direct operations

        A stored procedure is a set of SQL statements in a large database system that accomplishes a specific function. It is stored in the database and compiled on first use; subsequent calls need no recompilation, and the user executes it by giving its name and parameters (if it has any). Stored procedures are an important object in the database, and any well-designed database application should use them.

For business operations with relatively complex logic and a relatively high call frequency, the SQL statements can be replaced by a stored procedure. A stored procedure only needs to be compiled once, and complex operations can all be performed inside it.

3.4. Separate the active data in the database

        As the "80/20 rule" above suggests, although a website holds a lot of data, the data that is frequently accessed is limited. This relatively active data can therefore be separated out and stored on its own to improve processing efficiency.

In fact, the caching idea above is an obvious use case of separating active data in the database: hot data is cached in memory.

Another scenario: a website has tens of millions of registered users, but only about a million log in frequently; the rest have basically not logged in for a long time. If these "zombie users" are not separated out, every query over the logged-in users wastes effort scanning them as well.

3.5. Batch read and delayed modification

        The principle of batch reading and delayed modification is to improve efficiency by reducing the number of database operations.

Batch reading combines multiple queries into one read: each database request must establish and release a connection and occupies resources, so batching helps. Batch reads can also be performed asynchronously.

Delayed modification is for highly concurrent, frequently modified data. Each modification is first saved in the cache, and at regular intervals the cached data is persisted to the database. When the program reads the data, it may read from the cache as well as from the database.
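A minimal write-behind buffer shows the delayed-modification idea (flushing is triggered manually here; a real system would flush on a timer, and durability on crash is ignored):

```python
class DelayedWriter:
    """Buffer frequent modifications and persist them in one batch."""
    def __init__(self, save_batch):
        self._save_batch = save_batch  # e.g. one bulk UPDATE
        self._pending = {}

    def write(self, key, value):
        self._pending[key] = value  # later writes collapse earlier ones

    def read(self, key, load_from_db):
        # reads consult the buffer first, then fall back to the DB
        if key in self._pending:
            return self._pending[key]
        return load_from_db(key)

    def flush(self):
        batch, self._pending = self._pending, {}
        self._save_batch(batch)
```

Note how repeated writes to the same key collapse into one database write, which is where the efficiency gain comes from.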

3.6. Read and write separation

        The essence of read/write separation is to distribute the application's read and write operations across multiple database servers, reducing the access pressure on any single database.

Read/write separation is generally achieved by configuring master and slave databases: reads go to the slave, while inserts, updates, and deletes go to the master.
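The routing decision behind read/write separation can be sketched as a toy dispatcher (connection names are illustrative, and replication lag is ignored):

```python
import itertools

class ReadWriteRouter:
    """Send writes to the master; spread reads round-robin over slaves."""
    def __init__(self, master, slaves):
        self._master = master
        self._slaves = itertools.cycle(slaves)

    def connection_for(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in ("INSERT", "UPDATE", "DELETE", "REPLACE"):
            return self._master
        return next(self._slaves)  # SELECTs go to a slave
```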


Related articles:
a) MySQL 5.6 master/slave synchronization installation and configuration details: http://blog.csdn.net/xlgen157387/article/details/51331244
b) Common MySQL master-slave replication topologies, principle analysis, and how to improve replication efficiency: http://blog.csdn.net/xlgen157387/article/details/52451613

3.7. Using technologies such as NoSQL and Hadoop/HBase

        NoSQL databases are non-relational and unstructured. Their flexibility frees them from the constraints of relational databases, and because NoSQL stores data in multiple blocks, it can also operate on big data quite fast.

3.8. Distributed deployment database

        No single server, however powerful, can meet the continuously growing business needs of a large website. After read/write separation splits one database server into two or more, growth may still outpace them. Distributed database deployment is the last resort for splitting a website's database, used only when single-table data volumes are very large.

Distributed database deployment is the ideal case: tables are stored in different databases, which are then placed on different servers. When handling a request that involves several tables, multiple servers can work simultaneously, improving processing efficiency.

A simple architecture diagram of a distributed database is as follows: 


3.9. Separation of application services and data services

        The purpose of separating application servers from database servers is to optimize each according to its own characteristics: database servers need a certain amount of disk space, while application servers do not need much. Separation exploits the strengths of each kind of server, and it also prevents the failure of one server from making the other services unusable.

3.10. Use search engines to search data in the database

        Using a search engine, a non-database query technology, gives better support for the scalability and distribution of website applications.

Common search engines such as Solr maintain the mapping between keywords and documents through an inverted index, similar to how we look up a word in the "Xinhua Dictionary": first consult the index, then jump to the specific location.

By maintaining this keyword-to-document mapping, a search engine can quickly locate the data being searched for, with much higher efficiency than traditional database search.
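The keyword-to-document mapping can be illustrated with a minimal inverted index (whitespace tokenisation only; real engines like Solr add text analysis, ranking, and persistence on top):

```python
from collections import defaultdict

class InvertedIndex:
    """Map each term to the set of documents containing it."""
    def __init__(self):
        self._index = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self._index[term].add(doc_id)

    def search(self, term):
        # one dictionary lookup instead of scanning every document
        return sorted(self._index.get(term.lower(), set()))
```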

An article about the comparison of Solr and MySQL query performance:
Solr and MySQL query performance comparison: http://www.cnblogs.com/luxiaoxun/p/4696477.html?utm_source=tuicool&utm_medium=referral

3.11. Splitting up business

        Why split the business? In the end, it is so that each business's data tables can be deployed to different servers, with each application connecting through its designated URL to obtain a different service and find the data it needs.

For example, a large shopping website splits its homepage, shops, orders, buyers, sellers, and so on into different sub-businesses. On the one hand, the business modules can be developed by different teams; on the other, their database tables are deployed on different servers, reflecting the idea of splitting. When the database server of one business module fails, the databases of other modules keep working normally; and when one module's traffic surges, the databases it uses can be scaled out dynamically to meet demand.
 

4. Solutions for high concurrency

4.1. Separation of application and static resource files

        So-called static resources are the HTML, CSS, JS, image, video, GIF, and similar files a website uses. Separating the application from static resource files is also a common front-end/back-end separation approach: the application only provides data services, static resources are deployed on designated servers (Nginx or CDN servers), and the front-end interface uses the routing provided by AngularJS or NodeJS to call the application server's services, fetch the data, and render it in the browser. This greatly reduces pressure on the back-end server.

For example, the images used on the Baidu homepage are deployed on a separate domain-name server.

4.2. Page cache

        Page caching caches pages generated by the application whose data rarely changes, so the page need not be regenerated on every request, saving a lot of CPU resources; placing the cached pages in memory makes it faster still.

You can use the caching function provided by Nginx, or you can use the dedicated page caching server Squid.

4.3. Cluster and Distributed


4.4. Reverse proxy

Related article: http://blog.csdn.net/xlgen157387/article/details/49781487

4.5. CDN

        A CDN server is essentially a clustered page-cache server. Its purpose is to return the data users need as quickly as possible, both speeding up user access and reducing the load on back-end servers.

CDN stands for Content Delivery Network. The basic idea is to avoid, as much as possible, the bottlenecks and links on the Internet that can hurt the speed and stability of data transmission, making content delivery faster and more stable.

By placing node servers throughout the network, a CDN forms a layer of intelligent virtual network on top of the existing Internet. The CDN system redirects each user's request in real time to the service node nearest the user, based on comprehensive information such as network traffic, each node's connections and load, and the distance and response time to the user. The goal is to serve content from nearby, relieving Internet congestion and improving the response speed of website access.

In other words, CDN servers are deployed in network operators' data centers and serve data from the location closest to the user: a China Telecom user is assigned a Telecom node, and a China Unicom user a Unicom node.

CDN request allocation is special: it is not done by an ordinary load-balancing server but by dedicated CDN DNS servers at domain-name resolution time.

The CDN structure diagram is as follows:

5. Summary

Of course, massive-data and high-concurrency solutions for large websites are not limited to these techniques; there are many other mature solutions that you can study as needed.

Origin blog.csdn.net/qq_22473611/article/details/91039843