Soaring request volume: system and service optimization

Last month the company's user count grew to 1.6 million, with 500,000 daily active users, 300 million requests per day, and about 40,000 concurrent transactions.

Our framework is built on Spring Cloud and integrates Nacos, Zuul, XXL-JOB, sharding (split databases and split tables), Redis, and RocketMQ.

We ran into many problems:

1. When I bought the servers, I chose disks as small as 100 GB, which was not enough for the logs. If you can forecast your load, estimate the disk usage before buying.

a. nginx: a log-rotation shell script cuts the log on a schedule every day and deletes the old logs.

b. The Nacos logs are huge, tens of GB every day. Because Nacos was already online and I did not want to restart it, I just deleted the logs with a shell command.

c. The business logs are also fairly large. Requests have to be logged on entry and exit, and everyone's development habits differ, but the logs are still needed to locate problems. Early on we shortened the retention time. Later we restarted the services and expanded each machine's disk to 500 GB.
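The disk estimate recommended in item 1 is simple arithmetic. Here is a minimal sketch: the 300 million requests per day comes from the figures above, while the class name, the 500 bytes logged per request, the 7-day retention, and the 1.3x headroom are illustrative assumptions, not measured values.

```java
// Rough log-disk estimate. The class name, bytes-per-request, retention,
// and headroom figures are illustrative assumptions, not measured values.
class DiskSizing {
    /** Returns the disk space in decimal GB needed to hold the retained logs. */
    static double requiredGb(long requestsPerDay, long avgLogBytesPerRequest,
                             int retentionDays, double headroomFactor) {
        double bytesPerDay = (double) requestsPerDay * avgLogBytesPerRequest;
        double totalBytes = bytesPerDay * retentionDays * headroomFactor;
        return totalBytes / 1_000_000_000.0;
    }

    public static void main(String[] args) {
        // 300 million requests/day is the figure quoted in this post;
        // 500 bytes/request, 7 days, and 1.3x headroom are assumptions.
        double gb = requiredGb(300_000_000L, 500, 7, 1.3);
        System.out.printf("Estimated log disk: %.0f GB%n", gb);
    }
}
```

Under these assumptions the logs alone come to roughly 150 GB per day across the whole cluster, so the total has to be divided by the number of machines before comparing it with a per-machine disk size.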

2. In our self-built Redis cluster, one Redis process was killed by the machine, and I did not know how to recover it. The outage lasted half a day.

At the time I had followed the article https://www.cnblogs.com/ywrj/p/9531800.html to build the cluster. The master and replica were linked by commands. After I restarted one of the master services, it did not reconnect; I do not know whether that was a command problem or a configuration problem. The result was 2 separate Redis instances.

Later I simply purchased cloud Redis. Friends who are not very familiar with middleware should just buy it; it saves a lot of worry.

After that I still had to change one setting: the Redis key eviction policy. I use allkeys-lru. All the data in Redis is safe to delete, so I use the policy that evicts the least recently used keys across the whole keyspace.
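In Redis the policy is selected with the `maxmemory-policy allkeys-lru` directive. To show what "least recently used" means in practice, here is a minimal sketch of an exact LRU cache built on `LinkedHashMap`. It is an illustration only: Redis uses an approximated, sampled LRU, and the class name and capacity are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Exact LRU cache for illustration only. Redis itself uses an approximated
// (sampled) LRU, and the class name and capacity here are hypothetical.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: reads refresh recency
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");      // "a" is now more recent than "b"
        cache.put("c", "3"); // over capacity: "b" is evicted
        System.out.println(cache.keySet());
    }
}
```

The key point is that a read counts as a use: "a" survives because it was read after "b" was written, even though "a" was inserted first.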

3. The service froze while running, with CPU up to 100%.

Start with GC: find out whether young GC and full GC are too frequent or take too long, so that you can apply the right fix. The settings below use the G1 garbage collector, which I also recommend: -XX:+UseG1GC

Change the GC options to:

-XX:+UseG1GC -XX:G1HeapRegionSize=16m -XX:MaxGCPauseMillis=200
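To check whether young or full collections are too frequent or too slow, the JVM's standard `GarbageCollectorMXBean` API exposes cumulative counts and times (the `jstat -gcutil` command gives a similar view from outside the process). A minimal sketch follows; the class and method names are hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Reports cumulative GC counts and times for the current JVM.
// The class and method names are hypothetical; the MXBean API is standard.
class GcStats {
    static String summary() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // With G1 the bean names are typically "G1 Young Generation"
            // and "G1 Old Generation"; other collectors use other names.
            sb.append(gc.getName())
              .append(": count=").append(gc.getCollectionCount())
              .append(", timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(summary());
    }
}
```

Sampling these values twice and comparing the difference gives the collection rate and the time spent in GC over that interval.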

4. Why does a problem in one service cause all traffic to drop?

Damn, there was no circuit breaker in Zuul. I searched Baidu for an article and added one. The tutorial articles I learn from tend to leave out many components, so I keep missing them when I build a project. Adding the circuit breaker solved the problem of a single failing service making the whole system unavailable.
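The idea behind a circuit breaker can be sketched in a few lines: after a number of consecutive failures, stop calling the failing service for a while and return a fallback immediately. The following is a minimal count-based sketch for illustration only. It is not how Hystrix or Zuul implement it, and the class name, threshold, and open duration are hypothetical.

```java
import java.util.function.Supplier;

// Minimal count-based circuit breaker, for illustration only. This is not
// how Hystrix or Zuul implement it; names and thresholds are hypothetical.
class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized boolean isOpen() {
        return openedAt >= 0 && System.currentTimeMillis() - openedAt < openMillis;
    }

    /** Runs the call, or returns the fallback while the circuit is open. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get(); // fail fast without touching the sick service
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = System.currentTimeMillis();
        }
    }
}
```

Once the open period has elapsed, the next call is let through as a trial: a success closes the circuit again, and a failure reopens it. This is what keeps one slow service from tying up every gateway thread.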

5. Then a new problem arrived. The database kept showing high concurrency pressure with uncommitted transactions, and the database CPU was at 100%.

Two problems were found later:

a: The database CPU was at 100% and a huge number of rows were being scanned, several orders of magnitude more than at earlier times. Row scans that used to run at 100,000 to 1,000,000 rows/s were now at 10 million to 100 million rows/s for minutes at a time. This was a real trap: the indexes were not being used properly. The interfaces behind it were optimized with Redis so that, where possible, they do not query the database at all.

In the end we traced it to one interface whose index was not selective enough, so each query scanned tens of millions of rows. Changing the business logic and adding a Redis cache solved it.
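The Redis cache in front of the database follows the usual cache-aside pattern: read the cache first and only query the database on a miss. Here is a minimal sketch in which a `ConcurrentHashMap` stands in for Redis and a function stands in for the database query; all names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache-aside read path. A ConcurrentHashMap stands in for Redis and the
// loader stands in for the database query; all names are hypothetical.
class CacheAside<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> dbLoader;

    CacheAside(Function<K, V> dbLoader) {
        this.dbLoader = dbLoader;
    }

    /** Returns the cached value, querying the database only on a miss. */
    V get(K key) {
        // computeIfAbsent runs the loader at most once per missing key
        return cache.computeIfAbsent(key, dbLoader);
    }

    /** Call after a write so the next read reloads fresh data. */
    void invalidate(K key) {
        cache.remove(key);
    }
}
```

With a real Redis, entries would normally also carry an expiry time so that stale data ages out even when an invalidation is missed.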

b: The servlet container we use is Undertow. It has 2 kinds of threads: IO threads, which default to the number of CPU cores with a minimum of 2, and worker threads, which default to 8 times the number of IO threads. We increased the number of worker threads, which fixed the blocked service tasks and resolved the database's constant high concurrency pressure with uncommitted transactions.

https://blog.csdn.net/zhangjunli/article/details/89207038
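The default sizing rule described in item b can be written out directly. A minimal sketch with hypothetical class and method names follows. In Spring Boot the values are overridden with `server.undertow.threads.io` and `server.undertow.threads.worker` (or `server.undertow.io-threads` and `server.undertow.worker-threads` in older versions, so check the version in use).

```java
// Reproduces the default sizing rule described above: IO threads =
// max(cores, 2), worker threads = IO threads * 8. Names are hypothetical.
class UndertowThreadDefaults {
    static int ioThreads(int cpuCores) {
        return Math.max(cpuCores, 2);
    }

    static int workerThreads(int ioThreads) {
        return ioThreads * 8;
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int io = ioThreads(cores);
        System.out.println("io=" + io + ", worker=" + workerThreads(io));
    }
}
```

On a 4-core machine this gives 4 IO threads and 32 worker threads, which is easily exhausted when every request blocks on a slow database query.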

Continuous iteration

Origin blog.csdn.net/lin351550660/article/details/112509624