Race against time: why is the Micro One data recovery taking so long?

Author | Ru Bingsheng

Editor | Carol

Publisher | CSDN cloud (ID: CSDNcloud)

The Micro One "delete the database and run" incident happened several days ago. It is understood that the Micro One service itself has been fully restored, and new users can carry out all normal business activities. For existing users, however, the data has not yet been fully recovered. According to the official website, merchant account and equity data have been restored so far, and about seventy percent of the data is expected to be recovered by the evening of February 28.

B-side (business) users and ordinary onlookers alike are probably curious: with cloud computing, container deployment, elastic scaling and data backup technology all so advanced today, why does the whole recovery take so long? Today I will share my understanding from a technical perspective.

When you think something is very simple

It is probably because you do not understand it

Before getting to the technology, I want to mention "Friends of Time", this year's New Year's Eve speech by Luo Pang. His idea of "bending down and stepping into the game" struck a chord with me, as one of the many people who deal with IT all year round. When we stand on the sidelines, many things seem uncomplicated. Once you step in, you find that you were only seeing the tip of the iceberg, and that many things are far more complicated and difficult than you imagined.

Here is a vivid example. People like picking low-hanging fruit because the brain tells them it is easy to pick. But fruit that looks low may not really be low; most likely you are simply too far away from it. As you walk closer you find that it hangs higher than you first thought, and closer still you find that it is out of reach.

It is like a mountain: from far away it does not look high, and only when you stand at its foot do you realize that you cannot climb it right now. I have a photo taken at the base camp on the north slope of Mount Everest, at an altitude of about 5,300 m. Behind me is the legendary summit, the highest in the world at 8,848 m above sea level, yet in the photo it does not look that high, because I was still far from it. In other words, when you think something is very simple, it is often not really simple; it is probably because you do not understand it.

The Micro One incident is the same. Modern large Internet products, whether toB or toC, are very simple to use from the user's point of view, but the architecture behind them is the part of the iceberg below the waterline, and its complexity is far greater than you can imagine. As I often say, "cognition limits your imagination." So I believe that, at this moment, Micro One's front-line staff must be doing their best to restore the data as soon as possible.

All on the cloud, not on the cloud, and fake cloud

Now for the more technical part. Clearly, Micro One's main problem is database recovery. Because the company has not disclosed specific technical details, I could only find a very high-level architecture diagram on the Internet and could not obtain more information about the system infrastructure, especially the database architecture. So I can only make some guesses based on personal experience, with the aim of helping you understand how technically complex this is.

First, let us look at the environment the database runs in. Simplified, there are three options:

"Not on the cloud": the database is built in the company's own data center, and the company manages hardware, software and data entirely by itself. This was the mainstream practice before cloud platforms became popular. In this mode, database high availability, capacity expansion and data backup all have to be managed and maintained by the company's own professional teams (a DBA team and an operations team), which places relatively high technical demands on the enterprise.

"All on the cloud": the database is built entirely on a cloud environment, which can be a public cloud or a private cloud. The cloud vendor provides a full set of solutions for high availability, capacity expansion and data backup. With the rapid development of cloud computing and the spread of database as a service (DBaaS), more and more emerging companies choose this option.

"Fake cloud": this is the oddest option, a bit like carrying groceries in a Louis Vuitton bag, yet it is common in the industry and is best seen as the product of a transitional phase. Here the cloud is used merely as a set of virtual machines. It is very similar to "not on the cloud": none of the real advantages of the cloud are used, and the machines of the data center have simply been moved to the cloud. The disaster recovery, expansion and other capabilities that a cloud solution could provide are left crippled.

Of these three options, "not on the cloud" and "fake cloud" carry greater data risk than "all on the cloud". With the first two, operations staff working at the infrastructure level are more likely to have the opportunity to run extreme commands such as "rm -rf /*" or "fdisk". With "all on the cloud" it is much harder to run such commands at the operating-system level, so database data cannot simply be deleted with "rm -rf /".

If the deletion does not happen at the level of operating-system data files (backups usually exist as files too), then the database's own mechanisms can be used to recover accidentally deleted data, and recovery becomes far more efficient.

Likewise, for mistaken data operations (for example, a batch update that wrongly overwrites a field in a table), "all on the cloud" has a distinct advantage over "not on the cloud" and "fake cloud". This comes from personal experience. I once had a project with a self-built database where a DBA, by mistake, ran an update statement without a where condition against the production database. As a direct result, one field of every auction bid record was overwritten. We had to roll back to a full backup and then replay the binlog, and recovery took more than four hours. Later the same mistake happened on a cloud database, and the rollback took only a few minutes.
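The mistake described above, and one cheap defence against it, can be sketched roughly as follows. This is only an illustration using Python's built-in sqlite3 module. The table "bids", the helper "guarded_update" and the "max_rows" threshold are hypothetical names chosen for the example; they are not the project mentioned above and not a feature of any particular cloud database.

```python
import sqlite3


def guarded_update(conn, sql, params=(), max_rows=1):
    """Run an UPDATE inside a transaction and roll it back if it
    touches more rows than the caller said it expected."""
    cur = conn.cursor()
    cur.execute(sql, params)          # sqlite3 opens a transaction implicitly
    if cur.rowcount > max_rows:
        conn.rollback()               # nothing has been made permanent yet
        return False
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bids (id INTEGER PRIMARY KEY, price INTEGER)")
conn.executemany("INSERT INTO bids (price) VALUES (?)", [(100,), (200,), (300,)])
conn.commit()

# Missing WHERE: the statement would overwrite every row, so it is rolled back.
ok_bad = guarded_update(conn, "UPDATE bids SET price = 0")

# Intended statement: touches exactly one row and is committed.
ok_good = guarded_update(conn, "UPDATE bids SET price = 150 WHERE id = ?", (1,))

prices = [row[0] for row in conn.execute("SELECT price FROM bids ORDER BY id")]
print(ok_bad, ok_good, prices)
```

The guard only helps before the commit. Once a bad statement has been committed, recovery falls back to backups and logs, which is where the hours in the story above were spent.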

From Tencent Cloud's earlier public response, we can roughly infer that the data Micro One lost was not on Tencent Cloud. Combined with the current speed of recovery, we can be almost certain that Micro One very probably does not use an "all on the cloud" architecture, or has only part of its data on the cloud, and that something as extreme as "rm -rf /*" or "fdisk" probably happened.

In that case, the primary and replica data files, the full backup files, the incremental backup files and the binlogs would all have been lost together. The main technical challenge then becomes disk recovery, which is the specialty of traditional IT vendors rather than a core skill of any cloud vendor.

You can imagine how technically difficult it is to recover all the data in this situation. By my rough understanding, at least the following hurdles have to be cleared.

Obtain the full backup. If a cold standby or an off-site disaster recovery copy exists, that is the ideal case, but because a full backup is usually very large, transferring and verifying the files takes a long time. If no off-site full backup is available, we have to fall back on disk recovery tools, which take even longer and cannot guarantee 100% success. I will explain later why disk recovery is so time-consuming. Another problem is that the full backup may be too "old", which adds further time to the subsequent recovery steps.

Obtain the incremental backups. Incremental backups have often not yet been copied off-site, so they most likely have to be recovered from disk, which again takes a lot of time and again cannot guarantee 100% recovery.

Obtain the binlog. The binlog is the binary log that records every change to database table structure (for example CREATE and ALTER TABLE) and every modification of table data (INSERT, UPDATE, DELETE and so on). It usually exists on disk as an index file (suffix .index) and log files (suffix of the form .00000*). To record data changes accurately, the binlog is generally written in row format, so the individual files are not small and there are many of them.

With the files above as the basic input, the database-level import and recovery work can begin. This process itself takes a lot of time, and it assumes that the files above could be obtained 100% intact. If any of the backup files has data problems, the additional time cost becomes even greater.
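The "transfer and verify" step for backup files can be pictured with a small checksum sketch. It assumes that a SHA-256 digest was recorded when the backup was taken, which the article does not state. The function names and the temporary file are made up for the illustration.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so a very large backup
    never has to fit into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path, expected_hex):
    """True if the file on disk still matches the recorded digest."""
    return sha256_of(path) == expected_hex.lower()


# Demo with a tiny stand-in for a backup file (hypothetical content).
payload = b"pretend this is a multi-gigabyte dump"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    backup_path = tmp.name

expected = hashlib.sha256(payload).hexdigest()
intact = verify_backup(backup_path, expected)

# Simulate damage during transfer or recovery: one extra byte.
with open(backup_path, "ab") as handle:
    handle.write(b"!")
still_ok = verify_backup(backup_path, expected)

os.remove(backup_path)
print(intact, still_ok)
```

Every byte of the file has to be read to compute the digest, which is one reason why verifying a very large backup is slow even when the transfer itself succeeds.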
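The order in which these inputs are applied can be sketched with a toy model. It is a simplification under stated assumptions: a table is a plain dictionary, every backup and log event carries a made-up numeric log position, and "restore" is a hypothetical helper rather than a real database tool.

```python
def restore(full_backup, incrementals, binlog_events):
    """Rebuild state from a full backup, then incremental backups,
    then replay binlog events newer than the last backup.

    full_backup:   (position, {key: value})
    incrementals:  list of (position, {key: value}) with changed rows
    binlog_events: list of (position, op, key, value) row events
    """
    position, snapshot = full_backup
    state = dict(snapshot)

    # Incremental backups are applied oldest first.
    for inc_position, changes in sorted(incrementals, key=lambda item: item[0]):
        if inc_position <= position:
            continue                      # older than what we already have
        state.update(changes)
        position = inc_position

    # Only events after the last backup position need to be replayed.
    for ev_position, op, key, value in sorted(binlog_events, key=lambda e: e[0]):
        if ev_position <= position:
            continue                      # already covered by a backup
        if op == "DELETE":
            state.pop(key, None)
        else:                             # INSERT or UPDATE
            state[key] = value
        position = ev_position

    return state, position


full = (100, {"a": 1, "b": 2})
incs = [(200, {"b": 20, "c": 3})]
events = [
    (150, "UPDATE", "b", 20),   # already contained in the incremental backup
    (250, "INSERT", "d", 4),
    (300, "DELETE", "a", None),
    (350, "UPDATE", "c", 30),
]
state, position = restore(full, incs, events)
print(state, position)
```

In this model, the older the last usable backup is, the more events remain to be replayed one by one, which matches the point that an "old" full backup lengthens recovery. If any one input is missing or damaged, everything after it in the chain is in doubt.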

Recovering disk files

Finally, a word on recovering disk files. When we delete files from a disk or other storage medium, or even format it (low-level formatting excepted), the data does not actually disappear from the disk. The file is merely marked as deleted in the file allocation table, and the data itself in the data area is not erased immediately. As long as no new data has been written over that part of the data area, the deleted files can be recovered. This is the theoretical basis for restoring files after deletion.

However, database data files and backup files tend to be large, so if even a few of their data areas have been overwritten, the recovered file is incomplete. Human intervention is then needed to repair it, which is technically difficult and labor-intensive, and sometimes requires special equipment. In more complex cases, file carving has to be used. File carving is a recovery technique frequently used in digital forensics: it extracts files from a raw disk image, which on the surface is just an undifferentiated set of binary data, without relying on the disk's file system.

In addition, in a system as large as Micro One, each vertical business division may have its own database, and they may even use different schemes. This architectural heterogeneity makes recovery much more challenging. Moreover, even after part of the data has been recovered, it cannot go online immediately. It has to wait for the other related data to be recovered and then be cross-checked to make sure the data is sound, which takes a lot of time.

These are only the cases I can think of. I am standing far away and looking at the problem as a spectator, so I believe the real situation is more complex than I have described. We cannot infer the final recovery method or result; all we can do is wait.
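The principle can be shown with a toy model. "ToyDisk" is a made-up class, not a real file system: it has an allocation table mapping file names to block numbers and a data area of blocks. Real file systems differ in many details, including how they choose which free block to reuse.

```python
class ToyDisk:
    """Toy model of an allocation table plus a data area."""

    def __init__(self, n_blocks):
        self.blocks = [b""] * n_blocks     # data area
        self.table = {}                    # file name -> list of block numbers
        self.free = list(range(n_blocks))  # blocks available for new writes

    def write(self, name, chunks):
        used = []
        for chunk in chunks:
            index = self.free.pop(0)
            self.blocks[index] = chunk
            used.append(index)
        self.table[name] = used
        return used

    def delete(self, name):
        # Only the table entry is removed; the blocks are marked free
        # but their contents are left untouched.
        self.free = self.table.pop(name) + self.free

    def read_blocks(self, indexes):
        return b"".join(self.blocks[i] for i in indexes)


disk = ToyDisk(4)
location = disk.write("data.ibd", [b"row1;", b"row2;"])
disk.delete("data.ibd")

# Right after deletion the bytes are still in the data area.
recovered = disk.read_blocks(location)

# A later write reuses a freed block and destroys part of the old file.
disk.write("new.log", [b"XXXX;"])
damaged = disk.read_blocks(location)
print(recovered, damaged)
```

This is also why a disk holding deleted data should see as few new writes as possible before recovery is attempted.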
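A minimal sketch of signature-based file carving follows. It assumes the simplest case only: files that are stored contiguously and have a known header and footer. JPEG markers are used because they are well known, and the "disk image" is synthetic bytes built for the example. Database files have no such convenient footer and are often fragmented, which is part of why the manual effort described above is needed.

```python
def carve(image, header, footer):
    """Return every byte run in a raw image that starts with `header`
    and ends with the next `footer`, ignoring any file system."""
    found = []
    start = image.find(header)
    while start != -1:
        end = image.find(footer, start + len(header))
        if end == -1:
            break                          # header without a footer: give up
        found.append(image[start:end + len(footer)])
        start = image.find(header, end + len(footer))
    return found


JPEG_HEADER = b"\xff\xd8\xff"   # start-of-image marker plus next marker byte
JPEG_FOOTER = b"\xff\xd9"       # end-of-image marker

# Synthetic "disk image": two fake JPEG payloads surrounded by other bytes.
jpeg_1 = JPEG_HEADER + b"\xe0" + b"AAAA" + JPEG_FOOTER
jpeg_2 = JPEG_HEADER + b"\xe1" + b"BBBBBB" + JPEG_FOOTER
raw_image = b"\x00" * 16 + jpeg_1 + b"unrelated bytes" + jpeg_2 + b"\x00" * 8

carved = carve(raw_image, JPEG_HEADER, JPEG_FOOTER)
print(len(carved))
```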
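One kind of cross-check can be sketched as a search for records in one recovered database that refer to records missing from another. The table names, fields and the helper "find_orphans" are hypothetical, and a real check would cover far more than a single reference.

```python
def find_orphans(child_rows, parent_rows, foreign_key, parent_key="id"):
    """Return child rows whose foreign key has no matching parent row."""
    known = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in known]


# Recovered from one database (hypothetical merchant data).
merchants = [{"id": 1, "name": "shop-a"}, {"id": 2, "name": "shop-b"}]

# Recovered from another database (hypothetical order data).
orders = [
    {"id": 10, "merchant_id": 1},
    {"id": 11, "merchant_id": 2},
    {"id": 12, "merchant_id": 3},   # merchant 3 has not been recovered yet
]

orphans = find_orphans(orders, merchants, "merchant_id")
ready_to_go_online = not orphans
print([row["id"] for row in orphans], ready_to_go_online)
```

In this sketch the order data cannot go online until the merchant data it depends on has been recovered as well, which is the waiting described above.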

About the Author:

Ru Bingsheng is a hands-on, industry-leading expert in software quality engineering and R&D efficiency, a think-tank expert of the China Chamber of Commerce Internet Applied Technology Council, and the author of the best-selling book "Full-Stack Test Engineer Advanced Technology and Practice".

He is currently a senior architect at Dell EMC China R&D Group. He previously served as technical director of test infrastructure at the eBay China Development Center, senior software architect and performance testing expert at the HP China R&D Center, senior technical director at Alcatel-Lucent, and senior engineer at the Cisco China R&D Center. He has more than 16 years of experience in software development and technology management.


Origin blog.csdn.net/sch881226/article/details/104611441