Early production incidents in my IoT project

Incident 1: MQTT connection count alarm

The project had been live for about a month. Roughly 200 rocking cars had been deployed, and the average number online per day held at around 100 (I heard some merchants were cautious and only plugged a car in when a child actually wanted to ride; the rest were usually left unplugged, some just sitting idle in a corner). The MQTT service purchased on Alibaba Cloud at the time was sized at a maximum of 2,000 connections (MQTT is billed by connection count), which was more than enough for the number of cars then in the field. Everything had run normally for a month (looking back, the promotion strategy was immature: either three or four cars were placed per day, or three or four over several days), so the problem simply had not been exposed yet, but the debt was bound to come due sooner or later.

One afternoon just as I was about to leave work, MQTT suddenly started alarming: my phone received an alert SMS every 5 seconds saying the MQTT connection count had exceeded the limit (we use the Alibaba Cloud product, and I have to say this alerting is quite timely). Because some rocking cars were also being used for testing and exercised MQTT frequently, I did not worry too much at first and just asked the testers to stop their tests (embarrassingly, the test environment and the production environment used by the cars' scan-to-start flow share the same MQTT instance; the reason is described in detail later). I assumed the connections would be released after a while, but the alert messages on my phone only came faster and faster.

My first thought was that the accident point should be on the MQTT business application layer side (a phone scan goes as an HTTP request to the MQTT application layer, which then publishes the message to the Alibaba Cloud MQTT broker). At the time there were two MQTT application-layer servers in a load-balanced cluster. I logged into both and checked them with top: one CPU was around 80%, the other around 60%. That surprised me; this small amount of rocking-car traffic should not push the application servers to their limit. The next thing I looked at was the current number of TCP connections (using the command: netstat -natp | awk '{print $7}' | sort | uniq -c | sort -rn), and the result was a shock: each of the two servers had close to 10,000 connections, and the number was still rising (the deployed cars were still being used).

At this point the problem was basically located: every time a phone scan sent an HTTP request to the MQTT application layer, the application layer opened a new connection to the Alibaba Cloud MQTT broker to forward the message, and those connections were not being released. Since the code logic was still fairly simple at that stage, I went straight to the developer who wrote it and we read the code together. Sure enough, after modifying the code and redeploying, everything returned to normal and the alert SMS on my phone stopped immediately.

    public void sendMsgMqtt(String productId, String deviceId, String scontent, String topic) {
        String subTopic = getSubTopic(productId, deviceId, topic);
        String clientId = getClientId();
        MemoryPersistence persistence = new MemoryPersistence();
        try {
            // A new client (and broker connection) is created for each message pushed to Alibaba Cloud MQTT
            final MqttClient sampleClient = new MqttClient(GlobalConstant.BROKER, clientId, persistence);
            final MqttConnectOptions connOpts = getConnOpts(clientId);
            log.info("Coin Connecting to broker: " + GlobalConstant.BROKER);
            sampleClient.setCallback(new MqttCallback() {
                public void connectionLost(Throwable throwable) {
                    log.info("mqtt connection lost");
                    throwable.printStackTrace();
                    // Retry until the connection is re-established
                    while (!sampleClient.isConnected()) {
                        try {
                            sampleClient.connect(connOpts);
                        } catch (MqttException e) {
                            e.printStackTrace();
                        }
                        try {
                            Thread.sleep(1000);
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                    }
                }

                public void messageArrived(String topic, MqttMessage mqttMessage) throws Exception {
                    log.info("coin messageArrived:" + topic + "------" + new String(mqttMessage.getPayload()));
                }

                public void deliveryComplete(IMqttDeliveryToken iMqttDeliveryToken) {
                    log.info("coin deliveryComplete:" + iMqttDeliveryToken.getMessageId());
                }
            });
            sampleClient.connect(connOpts);
            try {
                scontent = scontent.replace("[", "").replace("]", "");
                final MqttMessage message = new MqttMessage(scontent.getBytes());
                message.setQos(1);
                log.info("pushed at " + new Date() + " " + scontent);
                sampleClient.publish(subTopic, message);
                log.info("-------send end---------");
            } catch (Exception ex) {
                ex.printStackTrace();
            } finally {
                // Disconnect in finally so the connection is always released after the publish
                sampleClient.disconnect();
                log.info("-------client disConnect()---------");
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
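
Even with the disconnect in the finally block, this method still opens a new broker connection for every HTTP request. Below is a minimal sketch, not the project's actual code, of the alternative of keeping one long-lived client and reusing it for every publish; the class name, broker address, and client ID are illustrative assumptions.

    import org.eclipse.paho.client.mqttv3.MqttClient;
    import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
    import org.eclipse.paho.client.mqttv3.MqttException;
    import org.eclipse.paho.client.mqttv3.MqttMessage;
    import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

    public class SharedMqttSender {
        // Illustrative broker address and client id -- not the project's real values
        private static final String BROKER = "tcp://example-broker:1883";
        private static final String CLIENT_ID = "app-sender-1";
        private static volatile MqttClient client;

        // Lazily create and connect a single shared client, instead of one per request
        private static MqttClient getClient() throws MqttException {
            if (client == null || !client.isConnected()) {
                synchronized (SharedMqttSender.class) {
                    if (client == null || !client.isConnected()) {
                        MqttClient c = new MqttClient(BROKER, CLIENT_ID, new MemoryPersistence());
                        MqttConnectOptions opts = new MqttConnectOptions();
                        opts.setAutomaticReconnect(true); // let Paho re-establish dropped connections
                        c.connect(opts);
                        client = c;
                    }
                }
            }
            return client;
        }

        public static void publish(String topic, String payload) throws MqttException {
            MqttMessage msg = new MqttMessage(payload.getBytes());
            msg.setQos(1);
            getClient().publish(topic, msg); // the broker connection is reused, not opened per call
        }
    }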

Incident 2: Database CPU at 100%

First, some background. At the beginning of a startup project the business is very unclear; it is basically crossing the river by feeling for the stones, taking one step at a time, watching the market's reaction and then adjusting. This stage is mostly about validating the model and refining the business. Funds are also limited, and rapid development and iteration are required, so the database was not clustered at the beginning; there was a single instance running standalone. And precisely because it was a single machine, problems that would not break out so quickly in a clustered environment broke out very easily here.

Around 7 o'clock one evening (7 to 9 pm is the peak period for the rocking cars; order volume is low during the day and roughly doubles in the evening), I received feedback that the management platform was extremely slow to open. At the same time my phone kept receiving MQTT SMS warnings saying there was a large backlog of unconsumed orders. My first reaction was to log into the cloud database console, where the CPU indicator was solid red at 100% utilization. The database at that point was a general-purpose 4-core 8 GB instance, and the order table held nearly 400,000 rows. Opening the performance management page (if no such console is available, you can also use show processlist to see the SQL currently executing and explain to analyze its execution plan and find the slow SQL), I saw a large backlog of slow SQL; some statements averaged nearly a minute per execution, clearly doing full scans of the table. During this peak period the slow SQL piled up and CPU resources were steadily exhausted.

It was also the peak of business usage, and with customers complaining and pressure from above, I wanted to get the database back to normal quickly. There were a few options for a fast recovery: switch over to the standby database (the database was set up for high availability), or kill some of the slow SQL (use show processlist to find threads whose State column is "Sending data", then kill their ids). Both could affect the business currently in use, so they were not to be done immediately unless there was no other choice. I looked at the top few slow queries; some could be handled quickly, for example by optimizing the indexes. I re-optimized a few indexes (described in detail later), and after a few minutes the CPU usage slowly came down (before the index optimization one query took about 3 seconds per execution; afterwards thousands of executions completed within a second). The business returned to normal, and it left me plenty of time afterwards to think through database optimization. The biggest lesson from this incident: don't act with a hot head; calm down and keep the impact of the problem as small as possible.

This round of SQL optimization also yielded some lessons; the related optimizations are summarized below:

1. Add and optimize indexes, and be especially careful about implicit conversions that invalidate them

If a table only has its primary key and no other indexes are built, simple business logic that only queries by the primary key is fine; but as soon as queries touch other fields, data volume grows and peak traffic arrives, SQL that forces a full scan of the table will certainly pile up as slow queries and eventually push the CPU higher and higher. If conditions allow, it is worth stress testing with a large data volume beforehand to flush this out. When indexes do exist, also watch out for them being silently invalidated. For example: select * from order where phone=13772556391. Code written carelessly, not reviewed closely, and not covered by the stress test can cause trouble once concurrency and data volume grow a little: the phone field in the table is a string type, but this SQL has no quotes around the value, so an implicit conversion takes place and the index is not used (a small JDBC sketch of the same trap follows below).
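
As a concrete illustration of the implicit-conversion trap, here is a hedged sketch in application code (the t_order table and column names are assumptions): binding the phone number as a string matches the VARCHAR column and lets MySQL use the index, while binding it as a number is equivalent to writing the literal without quotes, so the index is ignored.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class OrderQuery {

        // Index on phone is used: the bound parameter type matches the VARCHAR column
        public static ResultSet findByPhone(Connection conn, String phone) throws SQLException {
            PreparedStatement ps = conn.prepareStatement("select * from t_order where phone = ?");
            ps.setString(1, phone); // e.g. "13772556391" as a string
            return ps.executeQuery();
        }

        // Index is NOT used: a numeric parameter forces MySQL to cast the phone column
        // row by row, the same effect as writing phone=13772556391 without quotes
        public static ResultSet findByPhoneWrong(Connection conn, long phone) throws SQLException {
            PreparedStatement ps = conn.prepareStatement("select * from t_order where phone = ?");
            ps.setLong(1, phone);
            return ps.executeQuery();
        }
    }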

2. Paging query optimization

The common limit M, N paging pattern, e.g. select * from order where oid=100 limit 100000, 5000, gets slower the deeper you page, because MySQL has to read the first M+N rows and discard the first M; the larger M is, the worse the performance. If this kind of SQL is only used for an occasional manual look at the records, a little slowness is tolerable, but in the business scenario described above it also contributes to the backlog of slow queries. The optimized form is: select t1.* from order t1, (select id from order where oid=100 limit 100000, 5000) t2 where t1.id=t2.id. The subquery pages over the id index first and only then fetches the full rows, which is far more efficient (a small JDBC sketch of this pattern follows).
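
For completeness, a small sketch of how this deferred-join style of paging might be issued from application code (the t_order table and column names are assumptions, as above); the subquery pages over the id index first, and only the matching rows are fetched in full.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class OrderPageQuery {

        // Deferred-join pagination: page over ids first, then join back for the full rows
        public static ResultSet page(Connection conn, long oid, long offset, int pageSize) throws SQLException {
            String sql = "select t1.* from t_order t1, "
                       + "(select id from t_order where oid = ? limit ?, ?) t2 "
                       + "where t1.id = t2.id";
            PreparedStatement ps = conn.prepareStatement(sql);
            ps.setLong(1, oid);
            ps.setLong(2, offset);   // M: rows to skip
            ps.setInt(3, pageSize);  // N: rows per page
            return ps.executeQuery();
        }
    }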

3. Splitting the table

Even with the optimizations above, execution was much faster than before, but the pressure on the single table remained; the order table had not yet been split. Although conditions at the time did not justify a sharding cluster, physically splitting the table was still needed to solve the pressing problem (as more and more rocking cars were deployed, orders had reached roughly 10,000 to 20,000). A minimal routing sketch follows.
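
One common way to split such a table physically, sketched below purely as an illustration (the per-month naming and the t_order prefix are assumptions, not necessarily the scheme we used), is to route orders to a physical table per month so that each table stays at a manageable size.

    import java.time.YearMonth;
    import java.time.format.DateTimeFormatter;

    public class OrderTableRouter {
        private static final DateTimeFormatter SUFFIX = DateTimeFormatter.ofPattern("yyyyMM");

        // Pick the physical table for a given month, e.g. t_order_202009
        public static String tableFor(YearMonth month) {
            return "t_order_" + month.format(SUFFIX);
        }

        public static void main(String[] args) {
            // SQL statements are then built against the routed table name
            System.out.println(tableFor(YearMonth.of(2020, 9))); // prints t_order_202009
        }
    }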

 

SQL optimization is a long-term process and is best done in combination with the specific business scenario. Later I will continue to list other SQL pitfalls and related optimizations that came up in this project.
