Notes on a production incident: 300,000 orders gone, just like that!

Follow the official account "Tong Ge Read Source Code" to unlock more articles on source code, fundamentals, and architecture!

Background

Hello, my name is Tong.

I was on the subway home from work last night when the boss suddenly called: the production environment of system B was responding slowly and dragging down system A, tens of thousands of delivery guys couldn't take orders, and roughly 300,000 orders were stuck. Go and help them locate the problem.

I got home around 8:30 and immediately joined the conference call.

Restart

By the time I joined the call, colleagues were already helping to locate the issue. As the saying goes, a restart solves 80% of problems, and if a restart doesn't solve it, you clearly haven't restarted enough times. Bah, no: if a restart doesn't solve it, you have to actually track the problem down.

Sure enough, a round of stress testing after the restart showed it was still no better: at 1,000 concurrent requests the average response time was 3 to 4 seconds, and that held across several consecutive rounds.

Upgrade configuration

The restart seemed to have no effect, so we entered stage two: upgrade the configuration. Two 4-core 8GB instances were upgraded to six 8-core 16GB instances, and the database configuration was doubled as well. For problems that can be solved by throwing money at them, we generally don't invest too much manpower ^^

As it turned out, the bigger configuration was useless too: at 1,000 concurrent requests, the average stress-test response time was still 3 to 4 seconds.

Interesting.

At this point, I, Tong, stepped in.

Check the monitoring

Once I was on the call, I checked the monitoring: the instances' CPU, memory, disk, network I/O, and JVM heap usage all looked fine. That's what made it such a headache.

Local stress test

We split into two groups: one prepared a local stress test while the other continued analyzing. The local stress test showed that locally, on a single machine at 1,000 concurrent requests, there was no problem whatsoever: the average response time held at a few hundred milliseconds.

So the service itself really did seem to be fine.

Code walkthrough

With no better options left, we pulled up the code: a bunch of grown men reading code together while the developer explained the business logic to us. Of course, by then he had already been chewed out by the bosses for the mess he'd written. In fact, before I got involved they had already changed the code once, and in one place the Redis scan command had been replaced with keys *. That planted a landmine, but it wasn't the main problem yet; we'll come back to it later.

Walking through the code, we found a lot of Redis operations, including a for loop that called Redis GET on every iteration. Everything else was ordinary database access, all of it indexed, so the preliminary conclusion was that the database was probably fine and the main problem was concentrated on the Redis side: it was simply being called too often.
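As an aside, that loop-of-GETs pattern is worth spelling out. Below is a minimal sketch with Jedis of what it looks like, next to a batched alternative; the class and method names are made up for illustration and are not the project's actual code.

```java
import java.util.List;

import redis.clients.jedis.Jedis;

public class BatchedGets {

    // Roughly the pattern found in the walkthrough: one GET per loop iteration,
    // i.e. one network round trip per key.
    static void getOneByOne(Jedis jedis, List<String> keys) {
        for (String key : keys) {
            String value = jedis.get(key); // N round trips for N keys
            // ... use value ...
        }
    }

    // A batched alternative: a single MGET collapses the N round trips into one,
    // which adds up quickly at 1,000 concurrent requests.
    static List<String> getBatched(Jedis jedis, List<String> keys) {
        return jedis.mget(keys.toArray(new String[0]));
    }
}
```

Whether batching is the right fix depends on the business logic, but every GET inside a loop is a full network round trip, and at this level of concurrency those round trips add up fast.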

Add logging

Walking through the code, apart from that scan changed to keys * (which I still didn't know about at this point), there was basically nothing wrong, so: add logs. We added timing logs segment by segment, restarted the service, and ran another round of stress tests.
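The timing logs were nothing fancy; roughly the following kind of wrapper (a sketch using SLF4J and Jedis, not the project's actual code):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import redis.clients.jedis.Jedis;

public class TimedRedis {

    private static final Logger log = LoggerFactory.getLogger(TimedRedis.class);

    // Wrap each Redis call with a start/end timestamp so that slow calls
    // stand out in the logs under load.
    static String timedGet(Jedis jedis, String key) {
        long start = System.currentTimeMillis();
        try {
            return jedis.get(key);
        } finally {
            log.info("redis GET {} took {} ms", key, System.currentTimeMillis() - start);
        }
    }
}
```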

The results didn't change, of course, so we analyzed the logs.

The logs showed that calls to Redis were sometimes fast and sometimes slow, which looked like a connection pool that was too small: one batch of requests got through while another batch waited for an idle Redis connection.

Increase the Redis connection count

We checked the Redis setup: single-node mode, 1GB of memory, and the default pool size of 8 connections. The client was the rather old Jedis, so we switched it to Spring Boot's default Lettuce without hesitation, raised the pool size to 50 for a start, restarted the service, and ran another round of stress tests.
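For reference, a minimal sketch of that kind of Lettuce pool setup with Spring Data Redis; the host, port, and pool numbers below are placeholders, and Lettuce pooling needs commons-pool2 on the classpath:

```java
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisStandaloneConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.data.redis.connection.lettuce.LettucePoolingClientConfiguration;

@Configuration
public class RedisPoolConfig {

    @Bean
    public LettuceConnectionFactory redisConnectionFactory() {
        // Single-node Redis, as in our setup; host and port are placeholders.
        RedisStandaloneConfiguration server = new RedisStandaloneConfiguration("redis-host", 6379);

        // Pool sizing: these numbers mirror the first attempt (50 connections).
        GenericObjectPoolConfig<?> pool = new GenericObjectPoolConfig<>();
        pool.setMaxTotal(50);
        pool.setMaxIdle(50);
        pool.setMinIdle(8);
        pool.setMaxWaitMillis(200); // wait briefly for a free connection instead of queueing forever

        LettucePoolingClientConfiguration clientConfig = LettucePoolingClientConfiguration.builder()
                .poolConfig(pool)
                .build();

        return new LettuceConnectionFactory(server, clientConfig);
    }
}
```

With Spring Boot the same thing can also be driven from the spring.redis.lettuce.pool.* properties instead of an explicit bean.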

The average response time dropped from 3-4 seconds to 2-3 seconds, which wasn't much of an improvement. We kept increasing the pool size, because with 1,000 concurrent requests and many Redis operations per request there was bound to be waiting. This time we cranked the pool straight up to 1,000 connections, restarted the service, and stress-tested again.

As it turned out, there was no obvious improvement.

Back to the logs

At this point we were out of good ideas, so we went back to the logs and looked at the timings of the Redis operations. 99% of the GET operations returned quickly, mostly within 0 to 5 milliseconds, but there were always a few that took 800 to 900 milliseconds to come back.

We figured the Redis side was more or less fine.

But several more rounds of stress testing later, the response times still wouldn't come down.

We were stuck. By then it was past 3 a.m., and the boss gave the word: get the Huawei Cloud people out of bed.

Huawei Cloud investigation

In the end we got the relevant Huawei Cloud staff up to investigate with us. They weren't exactly willing, of course, but hey, we're the ones paying ^^

The Huawei Cloud lead pulled in their Redis expert, who looked at the Redis metrics for us and finally found that the Redis instance's bandwidth was maxed out, which triggered its throttling mechanism.

They temporarily tripled the Redis bandwidth and had us run another round of stress tests.

Holy cow, the average response time dropped straight to 200-300 milliseconds!!!!

Seriously, holy cow. That's a bit of a trap: fine, throttle us if you must, but the bandwidth maxes out and there isn't even an alert?..

What a pain of a problem.

At that point we thought the problem was solved, and the leaders went off to bed~~

On to production

Since we'd found the cause, it was time to stress-test production~

We had the Huawei Cloud expert triple the bandwidth in production as well.

We pulled a hotfix branch from the production commit, disabled request signing, restarted the service, and ran a round of stress tests.

Disaster. Production was even worse, with average response times of 5 to 6 seconds.

In the test environment we had changed the connection pool configuration, but production was still on Jedis; we changed that too and ran another round.

It made no real difference: still 5 to 6 seconds.

What a pain of a problem.

查看监控

查看华为云中redis的监控,这次带宽、流控都是正常的。

这次不正常的变成了CPU,redis的CPU压测的时候直接飙到了100%,导到应用响应缓慢。

Waking the Huawei Cloud Redis expert again

It was past 4 a.m. and nobody had any ideas left. Huawei Cloud Redis expert, get back up!

We woke the Huawei Cloud Redis expert again to analyze the backend for us, and he found 140,000 scans within 10 minutes~~

The evil scan

We asked the developers where scan was being used (it was part of the change they had made earlier, which I didn't know about) and found that every request called scan to fetch the keys starting with a certain prefix, scanning 1,000 entries at a time. The total number of keys in Redis was about 110,000, so a single request needed roughly 100 scan calls; at 1,000 concurrent requests that's on the order of a hundred thousand scans. And as we know, scan and keys * in Redis walk the entire keyspace, which is very CPU-intensive, so 140,000 scan operations sent the CPU through the roof.
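To make the cost concrete, here is roughly what such a prefix lookup looks like with Jedis 3.x; the prefix, class, and method names are placeholders rather than the project's actual code:

```java
import java.util.ArrayList;
import java.util.List;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.ScanParams;
import redis.clients.jedis.ScanResult;

public class PrefixScan {

    // With ~110,000 keys and COUNT 1000, this loop runs ~110 times per request,
    // and the cursor still has to visit every key in the instance regardless of
    // the MATCH pattern (MATCH only filters what is returned, not what is scanned).
    static List<String> keysByPrefix(Jedis jedis, String prefix) {
        List<String> keys = new ArrayList<>();
        ScanParams params = new ScanParams().match(prefix + "*").count(1000);
        String cursor = ScanParams.SCAN_POINTER_START; // "0"
        do {
            ScanResult<String> page = jedis.scan(cursor, params);
            keys.addAll(page.getResult());
            cursor = page.getCursor();
        } while (!"0".equals(cursor));
        return keys;
    }
}
```

Run that on every request at 1,000 concurrency and the six-figure scan count from the Huawei Cloud backend is exactly what you'd expect.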

Why didn't the CPU blow up in the test environment?

We compared the total key counts of the test and production Redis instances: the test environment only had about 900 keys, so each request did a single scan (or a single keys *), which causes no trouble at all.

Why does production have so many keys?

We asked the developers why production had so many keys. Hadn't expiration times been set?

The developer said they had been, and that this part of the code was written by another colleague. We opened it up and it was truly a magical piece of code. I won't paste the specifics here, but it decided whether or not to set an expiration time based on certain conditions, and after some analysis it turned out that in most cases no expiration time was actually set.
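I can't reproduce their code here, but the safer pattern is simply to write the value and its TTL atomically rather than doing a SET and then deciding later whether to call EXPIRE. A minimal sketch with Jedis; the key and TTL are placeholders:

```java
import redis.clients.jedis.Jedis;

public class ExpiringWrite {

    // Write the value and its expiration in a single atomic SETEX,
    // so the key can never end up persisted without a TTL.
    static void saveWithTtl(Jedis jedis, String key, String value) {
        jedis.setex(key, 3600, value); // TTL of 1 hour is a placeholder
    }
}
```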

The interim fix

By now it was 4:30 a.m. Everyone was still wide awake, but the leadership decided not to touch anything for the moment: system A had already stopped calling system B, so system B's traffic was effectively zero, and we would fix the problem in two steps during the day.

Step one: clean out the production Redis data, keeping only the small portion that is actually needed.

Step two: change the "scan for keys with a certain prefix" logic to store that data in a hash instead, which shrinks the range that has to be scanned (a rough sketch follows).
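A minimal sketch of what step two could look like with Jedis: the entries that used to live as individually prefixed string keys become fields of a single hash, so reading them all back is one HGETALL (or an HSCAN over just that key) instead of a scan across the whole keyspace. The key and method names below are made up:

```java
import java.util.Map;

import redis.clients.jedis.Jedis;

public class HashStorage {

    // Hypothetical hash key that replaces the old "prefix:<id>" string keys.
    private static final String INDEX_KEY = "order:index";

    // Each entry becomes a field of the hash instead of its own top-level key.
    static void save(Jedis jedis, String id, String value) {
        jedis.hset(INDEX_KEY, id, value);
    }

    // Reading everything back touches exactly one key.
    static Map<String, String> loadAll(Jedis jedis) {
        return jedis.hgetAll(INDEX_KEY);
    }
}
```

One trade-off to keep in mind: a TTL set with EXPIRE applies to the whole hash key, not to individual fields, so the cleanup strategy for this hash needs to be handled separately.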

That's it for this production-incident investigation; I'll keep following up on it.

Summary

This production incident was a bit different from the ones I've run into before. To sum up:

  1. Past incidents were about the application's own CPU, memory, disk, or JVM; hitting Redis bandwidth limits and throttling was a first for me;

  2. Since moving to Huawei Cloud we haven't fully found our footing with a lot of things, including the monitoring metrics, and still need to feel our way through;

  3. The keys and scan commands really should be disabled in Redis, and most keys should have an expiration time set!

All right, that's about all for this incident. If there are new developments I'll keep following up, though ideally there won't be any ^^


