Programmer's perspective: How did Lu Han's announcement of his relationship blow up Weibo?

Weibo exploded!

On the last day of the long holiday, I was lying in bed scrolling through Weibo, looking for funny jokes to laugh at, when the app froze. At first I thought it was my phone, so I gave up and went out to eat. I was surely not the only one that day wondering why Weibo was stuck. Of course, everyone knows what happened next. It turns out the most dangerous hacker in China works in the entertainment industry: "hacker Lu" took down Weibo with a single sentence (that is not strictly accurate). Everyone in IT circles has probably heard about this incident by now.

[Image: Weibo exploded]

Then Weibo customer service issued an announcement:
[Image: customer service reply]

Don't digress!

After writing a few paragraphs, I realized I had drifted off topic. A technical article should not be this gossipy, so I deleted them, hahahaha.

Lu Han's fans were crying and screaming, and there were claims that some had jumped off buildings or attempted suicide (I can't tell whether that is true). "This is my girlfriend" went viral on WeChat Moments, jokes of every kind kept appearing, the onlookers were thrilled, and the attention and influence of the event still have not faded. I will not repeat all that here. After all, we should not stray from the topic, and we should not treat this as mere entertainment. This article is mainly about how Weibo, as a platform, was affected by the incident.

As far as I remember, Weibo has gone down many times. Every time a celebrity announces some big news, it is a test for Weibo's servers. Thinking of the engineers heading back to the office to work overtime... I feel for them again.

Why did Weibo explode?

The following are some personal thoughts. My knowledge is limited, so I cannot go very deep. I can only write down what comes to mind, and there is not much technical content.

Read operations should not bring Weibo down

The first thing that came to mind was traffic: a flood of requests pushes server load too high, and the servers go down one after another. But before I saw the data Weibo published, I did not think Weibo's architecture would be overwhelmed by heavy traffic so easily. I had read technical talks and articles from the Weibo engineering team promoting their service degradation, automatic scaling, and operations automation. At Sina Weibo's scale, the system should not collapse that easily. Even if server load hit its limit, the automatic scaling mechanism should have kept things running for a while, and the outage should not have lasted so long.
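To make "service degradation" concrete, here is a minimal Python sketch of the general idea, not Weibo's actual implementation: as load approaches capacity, the service switches off non-essential features first. The feature names, thresholds, and the `features_for_load` function are all hypothetical.

```python
# Hypothetical degradation policy: each feature is mapped to the load ratio
# (current requests / capacity) above which it is switched off.
# Names and thresholds are illustrative assumptions, not Weibo's real config.
DEGRADATION_THRESHOLDS = {
    "recommendations": 0.70,   # dropped first
    "like_counts": 0.85,
    "comments": 0.95,
    "timeline": float("inf"),  # core feature, never dropped by this policy
}


def features_for_load(current_rps, capacity_rps):
    """Return the features that stay enabled at the given load."""
    load = current_rps / capacity_rps
    return [name for name, limit in DEGRADATION_THRESHOLDS.items() if load <= limit]


print(features_for_load(900, 1000))
```

A policy like this only buys headroom. If the incoming traffic is many times the total capacity, even the core feature will be overloaded.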

Those were my initial thoughts. After comparing them against the data Weibo released and sorting things out, I realized I had been a little too optimistic about Weibo's engineering team, a little too trusting, as if they were omnipotent. But no matter how good a team is, it will still run into surprises.

Problems with the database?

After rejecting the first idea, I turned to the data layer. Judging by the number of reposts, comments, replies to comments, and likes on comments for this post, plus the data of other posts related to it, it seemed very likely that there were too many database write requests at the time of the incident. Writes hit a peak, and most of them landed on the same post (Lu Han's). The comments are deeply nested, and some writes may trigger further writes, so the pressure on the database became too high and the database went down for a while. (I drew a sketch based on my thinking at the time. I expect the n in the picture to be in the hundreds of millions!)

[Image: database sketch]
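The worry that most writes land on the same post can be illustrated with a small Python sketch. It assumes a simple hash-by-post-ID sharding scheme, which is my own assumption and not something Weibo has confirmed. The post IDs, shard count, and workload numbers are made up.

```python
import zlib
from collections import Counter


def shard_for(post_id, num_shards):
    """Pick a shard by hashing the post ID (a common, simple scheme)."""
    return zlib.crc32(post_id.encode("utf-8")) % num_shards


def shard_load(writes, num_shards):
    """Count how many writes land on each shard."""
    load = Counter()
    for post_id in writes:
        load[shard_for(post_id, num_shards)] += 1
    return load


# Hypothetical workload: 900 writes to one hot post, 100 spread over other posts.
writes = ["hot_post"] * 900 + ["post_%d" % i for i in range(100)]
load = shard_load(writes, num_shards=8)
hot_shard = shard_for("hot_post", 8)

# Share of all writes handled by the shard that owns the hot post.
print(load[hot_shard] / len(writes))
```

Under this scheme, adding more shards spreads different posts apart, but every write for the single hot post still goes to one shard.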

However, Weibo was down for too long. If the database were the problem, the service would be unlikely to stay unavailable for that long. And once the database has been sharded and other architectural optimizations are in place, a problem should not spread very widely even if one occurs. So this guess probably does not hold up.

Is there something wrong with the cache or other middleware?

Thinking it over again, could the problem lie in the cache design? Weibo's cache design seems to have some small problems, possibly because certain functionality was deliberately given up for the sake of high availability and cost. I won't go into that here.

The database must have had some trouble, but database pressure should not be the main problem. After all, there is a cache layer, and comments and likes do not have to be real-time or perfectly accurate. Some data loss and error is acceptable. So it is possible that the database went down for a while. Beyond that, comments and likes could not be posted normally at the time, and the comments on the post could not be displayed normally. For example, paginated data could not be fetched. That points to a problem in the cache layer. Perhaps the cache was overwhelmed too....
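The point that likes do not need to be real-time or perfectly accurate is what makes write buffering possible. Below is a minimal Python sketch of that idea, with a plain dict standing in for the database. The `BufferedLikeCounter` class and its parameters are hypothetical, and this is not a description of how Weibo actually stores likes.

```python
class BufferedLikeCounter:
    """Accumulate likes in memory and write them to the store in batches."""

    def __init__(self, store, flush_every=1000):
        self.store = store              # stand-in for the database: post_id -> count
        self.flush_every = flush_every  # flush once this many likes are pending
        self.pending = {}               # post_id -> likes not yet written
        self.pending_total = 0
        self.db_writes = 0              # number of write operations issued

    def like(self, post_id):
        self.pending[post_id] = self.pending.get(post_id, 0) + 1
        self.pending_total += 1
        if self.pending_total >= self.flush_every:
            self.flush()

    def flush(self):
        # One write per post with pending likes, instead of one write per like.
        for post_id, delta in self.pending.items():
            self.store[post_id] = self.store.get(post_id, 0) + delta
            self.db_writes += 1
        self.pending.clear()
        self.pending_total = 0


store = {}
counter = BufferedLikeCounter(store, flush_every=1000)
for _ in range(10_000):
    counter.like("hot_post")
counter.flush()
print(store["hot_post"], counter.db_writes)  # prints "10000 10"
```

The trade-off matches the tolerance described above: the stored count lags behind the true count between flushes, and likes still sitting in memory are lost if the process crashes.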

Traffic shock is the culprit

After reading the data Weibo released, the answer is more or less clear. The incident happened during the holiday, while everyone was still enjoying the festive mood. The servers were obviously not well prepared, and the traffic really was too large. The incident involved more than 800 million users, and it is hard to survive network traffic of that magnitude.

My initial line of thinking was also a bit off. I had considered that servers went down, but I assumed that, given Weibo's technical level and architecture, this would be dealt with quickly. I also assumed that read operations would not bring down the server cluster, and that the crashes were more likely caused by write operations, so I focused on the data layer and suspected the database or other middleware. (There must have been some problems there, but they were not the main cause of the crash. Insufficient server resources were.) From the way Weibo staff handled the incident, it can be concluded that the main problem was server resources.
[Image: adding servers]
Before seeing the data in the picture below, nobody would have imagined this much attention and this large a traffic shock. Faced with a gap of that absolute magnitude, even doing our best, we cannot bring our capabilities fully to bear. We can only try to make up for it.

I remember a comment made some time ago about South Korea's deployment of THAAD: under an absolute saturation strike, THAAD cannot play any role. We are not discussing THAAD or military affairs here, of course. I am just borrowing that comment to sum up this Weibo outage. Although Weibo's architecture and engineering team are very strong, under a large enough burst of instantaneous traffic, the whole series of plans and countermeasures (disaster recovery, automatic scaling, caching, load balancing, throughput, architectural scalability) seemed to vanish without a trace in the face of the outage. The traffic was simply too large, and everyone should understand what more than 800 million users means.
[Image: Weibo data assistant]
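Why automatic scaling may not save a service from an instantaneous spike can be shown with a toy Python simulation. Everything here is an assumption made for illustration: the `simulate` function, the provisioning delay, the cap on servers added per tick, and all the numbers. None of it reflects Weibo's real capacity or scaling policy.

```python
def simulate(traffic, initial_servers, per_server_capacity,
             provision_delay, max_add_per_tick):
    """Return the number of requests dropped at each tick.

    Each tick, the scaler orders enough servers to cover the current shortfall
    (capped at max_add_per_tick). Ordered servers come online only after
    provision_delay ticks.
    """
    servers = initial_servers
    pending = []  # (ready_tick, count) for servers still being provisioned
    dropped = []
    for tick, demand in enumerate(traffic):
        # Bring online any servers whose provisioning has finished.
        servers += sum(n for ready, n in pending if ready <= tick)
        pending = [(ready, n) for ready, n in pending if ready > tick]

        capacity = servers * per_server_capacity
        dropped.append(max(0, demand - capacity))

        # Order more servers if capacity, counting those already ordered, falls short.
        ordered = sum(n for _, n in pending)
        shortfall = demand - (servers + ordered) * per_server_capacity
        if shortfall > 0:
            needed = -(-shortfall // per_server_capacity)  # ceiling division
            pending.append((tick + provision_delay, min(needed, max_add_per_tick)))
    return dropped


# Hypothetical spike: demand jumps from 100 to 1000 requests per tick.
spike = [100] + [1000] * 7
dropped = simulate(spike, initial_servers=10, per_server_capacity=10,
                   provision_delay=3, max_add_per_tick=5)
print(dropped)
```

With these made-up numbers, requests are still being dropped seven ticks after the spike begins: scaling helps, but it trails the demand instead of absorbing it.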

Summary

First published on my personal blog.

Although the incident was eventually resolved through follow-up measures, it still exposed some problems, and there is no particularly good way to deal with them. Who knows when some celebrity will suddenly announce a romance or be caught cheating.

To sum up, the main cause this time was the huge traffic brought by a "traffic star" + no warning + insufficient resources + high data density. The resulting shock exhausted the resources of one server after another and took them down one by one. There must have been other service failures too, but the specifics are unknown. The above is just my humble opinion. If anything is inappropriate, please forgive me.

