Talking about hot and cold data

http://jishu.zol.com.cn/11379.html


The most important core unit of a web product is undoubtedly its data, and the mainstream storage container is MySQL. As data grows rapidly, performance may degrade exponentially. The mainstream remedy is horizontal and vertical splitting: dividing the data at the database and table level according to its characteristics. In theory this is still just data partitioning, and one day you will find that a single table keeps growing anyway. You may say you will simply split again, but the cost of each split may be deploying yet more replica databases, and the storage footprint may surprise you. Does anyone really believe this approach works indefinitely? Many people say a cache will solve it, but a cache silently adds another layer: its design matters a great deal, and its update policy and granularity are critical to the stability of the system. Besides, have you considered how large that cache would have to be? So, is there any analysis we can do at the DB layer itself? MySQL is an excellent piece of software; we should neither question it without sound analysis nor rush to think about replacing it.
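For concreteness, here is a minimal sketch of the horizontal split described above: route a user's rows to one of N physical tables by hashing the user id. All table and function names are hypothetical:

```python
# Minimal sketch of horizontal table splitting: a user's articles live in
# one of NUM_SHARDS physical tables, chosen by user id. Hypothetical names.

NUM_SHARDS = 16  # physical tables article_00 .. article_15

def article_table_for(user_id: int) -> str:
    """Return the physical table holding this user's articles."""
    return f"article_{user_id % NUM_SHARDS:02d}"

def build_select(user_id: int) -> str:
    # The application rewrites queries against the logical "article" table
    # into queries against the physical shard.
    return (f"SELECT id, title, body FROM {article_table_for(user_id)} "
            "WHERE user_id = %s")

if __name__ == "__main__":
    print(article_table_for(123456))  # -> article_00
    print(build_select(123456))
```

The catch the paragraph points out: the modulus is fixed, so when a shard's table grows too big you must re-split, and re-splitting means re-hashing and moving data.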

The web is built on data, so our approach should be built on data too. Have you ever analyzed what your product's form is and what the nature of its data is? In our case, the heavy-traffic users were only a handful relative to the total registered users, yet they accounted for 64% of total PV, and their articles occupied only about 11G of storage. In other words: half of our web servers are serving a small group of people whose traffic can scare you to death, but whose articles amount to just 11G. I believe most people get the point: if I strip out these users' data and focus resources on serving them, performance will improve and capacity costs will drop further.
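A hedged sketch of the kind of analysis behind those numbers: rank users by PV and take the smallest set that covers a target share of traffic (function and variable names are illustrative):

```python
# Illustrative sketch: find the smallest set of users covering a target
# share of total PV. pv_by_user maps user_id -> page views in some window.

def hot_users(pv_by_user: dict[int, int], pv_share: float = 0.64) -> set[int]:
    """Smallest set of users that accounts for `pv_share` of all PV."""
    cutoff = sum(pv_by_user.values()) * pv_share
    hot, running = set(), 0
    for uid, pv in sorted(pv_by_user.items(), key=lambda kv: kv[1],
                          reverse=True):
        if running >= cutoff:
            break
        hot.add(uid)
        running += pv
    return hot

if __name__ == "__main__":
    sample = {1: 900, 2: 850, 3: 40, 4: 30, 5: 20}
    print(hot_users(sample))  # {1, 2}: two users cover ~95% of the PV
```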

This is the so-called separation of hot and cold data. The idea is obvious: strip out the hot data and use your core resources to guarantee its performance; the cold data, being accessed far less, can be served at a reduced service level. You can imagine how much capacity that saves, and how much performance would improve if the hot data were placed in memory.
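What the read path could look like after the split, as a hedged sketch (both store objects and their `query` method are assumptions, not a real API):

```python
# Hypothetical read path after hot/cold separation. `hot_db` might be an
# in-memory store or a dedicated MySQL instance, `cold_db` the bulk store.

class ArticleStore:
    def __init__(self, hot_db, cold_db, hot_user_ids: set[int]):
        self.hot_db = hot_db
        self.cold_db = cold_db
        self.hot_user_ids = hot_user_ids  # refreshed by the PV analysis job

    def fetch_articles(self, user_id: int):
        # Hot users hit the small, fast store; everyone else hits the
        # larger, cheaper cold store at a lower service level.
        db = self.hot_db if user_id in self.hot_user_ids else self.cold_db
        return db.query("SELECT id, title FROM article WHERE user_id = %s",
                        (user_id,))
```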

There are not many real difficulties or tricks in splitting hot and cold data, but careful analysis shows there is still plenty to do. After all, the point of doing anything is to make sure the result is reasonable and effective, and performance must be balanced against maintainability. Some questions to work through:

(1) Have you thought about who these high-PV users actually are, and how often they update their articles?

(2) Does the hot data itself need to be split into databases and tables, and if so, how? Will the split affect the query cache? For this active data, how many replica databases are needed to support tens of millions of accesses, and where is the bottleneck under heavy concurrency (disk I/O? CPU?)

(3) What is the migration strategy between hot and cold data, and how do we tell the two apart? The initial split may have to be done manually; should it be automated later? (A minimal migration sketch follows this list.)

(4) Since hot/cold separation distinguishes users, how should the serverid of the data be designed, and could it become a bottleneck? (See the directory-table sketch after this list.)

(5) Is it clear that the DB really is the core constraint on performance? After implementation, is there any improvement, and how much?

(6) We distinguish hot and cold data by PV, measured on top of the first-level page cache. Is that a reasonable basis for the classification? How many of these users' visits actually reach the backend?

(7) For the active data, do we need a second-level cache separate from the DB, or should we instead upgrade the hardware, for example by putting the active data on SSDs?

(8) A very practical question: the volume of cold data is still very large even though its traffic is low. How do we optimize access for these users? A second-level cache? Does it call for an archived-data design, and what problem would that solve?

(9) How can the DB's query cache be improved effectively?
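On point (3), a minimal sketch of what a periodic migration job might look like, reusing the hypothetical `hot_users()` classification from the earlier sketch; a real job would copy, verify, and only then delete, inside a transaction:

```python
# Hypothetical periodic job for point (3): re-classify users by PV and move
# their rows between stores. Reuses hot_users() from the earlier sketch.

def migrate(hot_db, cold_db, pv_by_user: dict[int, int],
            currently_hot: set[int]) -> set[int]:
    target_hot = hot_users(pv_by_user)       # who should be hot now

    for uid in target_hot - currently_hot:   # promote: cold -> hot
        rows = cold_db.query("SELECT * FROM article WHERE user_id = %s", (uid,))
        hot_db.insert_many("article", rows)
        cold_db.execute("DELETE FROM article WHERE user_id = %s", (uid,))

    for uid in currently_hot - target_hot:   # demote: hot -> cold
        rows = hot_db.query("SELECT * FROM article WHERE user_id = %s", (uid,))
        cold_db.insert_many("article", rows)
        hot_db.execute("DELETE FROM article WHERE user_id = %s", (uid,))

    return target_hot                        # becomes currently_hot next run
```

Whether this runs automatically or stays a manual operation is exactly the question the item raises; either way, routing must be updated atomically with the move, or reads will go to the wrong store.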
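On point (4), one hedged possibility is directory-based routing: a small lookup table maps each user to the serverid that holds their data, so a migration only has to update one row. SQLite stands in for the real metadata store here:

```python
# Hypothetical directory-based routing for point (4): user_id -> serverid.
import sqlite3  # stand-in for the real metadata store

def init_directory(conn: sqlite3.Connection) -> None:
    conn.execute("""CREATE TABLE IF NOT EXISTS user_location (
                        user_id  INTEGER PRIMARY KEY,
                        serverid INTEGER NOT NULL)""")

def serverid_for(conn: sqlite3.Connection, user_id: int) -> int:
    row = conn.execute("SELECT serverid FROM user_location WHERE user_id = ?",
                       (user_id,)).fetchone()
    return row[0] if row else 0  # default: the cold cluster

def move_user(conn: sqlite3.Connection, user_id: int, serverid: int) -> None:
    # Called by the migration job after a user's rows have been moved.
    conn.execute("""INSERT INTO user_location (user_id, serverid)
                    VALUES (?, ?)
                    ON CONFLICT(user_id) DO UPDATE
                    SET serverid = excluded.serverid""", (user_id, serverid))
    conn.commit()
```

The item's worry is real: every read now consults the directory first, so in practice this table would be cached aggressively, or replaced by a computable scheme (e.g. ranges of user ids), to keep it from becoming the bottleneck itself.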
