Technical deep dive | mPaaS client troubleshooting: the mystery of the 3-second wait

In an increasingly complex technology landscape, apps run into more and more problems during development, launch, and operations. These problems can originate anywhere along the request path, not just in the code.

For developers, troubleshooting is no longer limited to debugging while writing code. They often need to widen their toolkit and analyze and locate problems from several angles. This article walks through how a network performance problem in an mPaaS developer's mini program was tracked down.

Problem background

"Xiaolian Technology" reported that in its app built on mPaaS, the integrated mini program was slow to connect when calling the company's self-hosted Web API. The developer supplied a video that reproduces the problem.

In the reproduction, after the mini program opens there is a long wait before the page data loads.

After talking with the developer, we learned that part of the data needed to initialize the page comes from the developer's own Web API, so a slow response delays page loading. The developer also mentioned that the problem is regional and sporadic: for a period of time, some users in some regions are badly affected.

Problem analysis and troubleshooting

As mentioned above, the data comes from a Web API, so the natural first step is to check from the outside whether the Web API itself has a performance problem.

However, calling the Web API from a browser or Postman did not reproduce the problem: the back end responded within milliseconds every time. Still, because the developer described the problem as regional and sporadic, a server-side or network cause could not be ruled out on that basis alone.

Since we are not the app's developers, the usual approach for this kind of problem is to capture its HTTP traffic and observe how the app actually behaves. Fortunately, our iOS test phone could reproduce the problem. Capturing the app's traffic with Charles gave the following findings:

1. The address of the Web API is: https://api.xiaolianhb.com/ ;

2. When Charles has SSL Decryption turned on (man-in-the-middle mode that decrypts the HTTPS body), the problem cannot be reproduced.

3. When Charles has SSL Decryption turned off, the problem can be reproduced, and there is a clear 3s wait for data loading.

Findings 2 and 3 strongly suggest that the problem lies at the HTTPS/SSL protocol level: with SSL Decryption on, the HTTPS connection to the server is made by the Mac running Charles; with SSL Decryption off, it is made by the iPhone itself.
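A rough way to get a desktop baseline for this is to time the TCP connect and the TLS handshake separately. The following is a minimal Python sketch, not the tooling used in this investigation; the function names and the 1-second threshold are assumptions made for illustration. Python's ssl module does not perform OCSP lookups during the handshake, so the numbers it produces are only a reference point to compare against what the phone experiences.

```python
import socket
import ssl
import time


def time_tls_connection(host, port=443, timeout=10.0):
    """Return (connect_seconds, handshake_seconds) for one HTTPS connection.

    connect_seconds covers DNS resolution plus the TCP connect;
    handshake_seconds covers only the TLS handshake.
    """
    context = ssl.create_default_context()
    t0 = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        t1 = time.perf_counter()
        # wrap_socket() runs the full TLS handshake before it returns.
        with context.wrap_socket(raw, server_hostname=host):
            t2 = time.perf_counter()
    return t1 - t0, t2 - t1


def slow_phase(connect_seconds, handshake_seconds, threshold=1.0):
    """Name the phase that exceeds the (assumed) threshold, or None."""
    if handshake_seconds > threshold:
        return "tls_handshake"
    if connect_seconds > threshold:
        return "tcp_connect"
    return None
```

Calling time_tls_connection("api.xiaolianhb.com") and passing the two values to slow_phase() shows which phase, if any, dominates from the machine running the script.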

SSL problems need to be confirmed and analyzed at the TCP packet level, so we used Wireshark to capture packets at the network layer. The basic steps: the iPhone connects to the network as usual; the iPhone is connected to the Mac with a data cable, and a virtual interface mapped to the phone's network card is created on the Mac; Wireshark captures packets on that virtual interface (see the linked guide for detailed steps).

After reproducing the problem and capturing the relevant packets, we first confirmed the symptom, as shown in the following figure:

1.png

 

As the capture above shows, during the TLS handshake the client waited 3s before sending the Client Key Exchange message to the server. Normally there should be no such wait. The developer also confirmed this with front-end instrumentation in the Debug build, as shown in the following figure:

2.png
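The same reading of the capture can be done mechanically: given the handshake messages and their timestamps, find the largest gap between consecutive messages. This is a minimal Python sketch; the function name, the message list, and the timestamps are made up for illustration and are not taken from the capture in the figures.

```python
def largest_gap(events):
    """Find the longest pause between consecutive events.

    events: list of (timestamp_seconds, label) in capture order.
    Returns (gap_seconds, label_before, label_after).
    """
    best = (0.0, None, None)
    for (t_prev, prev_label), (t_next, next_label) in zip(events, events[1:]):
        gap = t_next - t_prev
        if gap > best[0]:
            best = (gap, prev_label, next_label)
    return best


# Illustrative timestamps only, not values from the actual capture.
handshake = [
    (0.000, "Client Hello"),
    (0.045, "Server Hello"),
    (0.047, "Certificate"),
    (0.048, "Server Hello Done"),
    (3.052, "Client Key Exchange"),
    (3.098, "Change Cipher Spec"),
]

gap, before, after = largest_gap(handshake)
print(f"{gap:.3f}s between {before!r} and {after!r}")
# prints: 3.004s between 'Server Hello Done' and 'Client Key Exchange'
```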

 

Next, we needed to find out why the client waited so long during the handshake, and what it was doing during those 3s. After removing the packet filter and reading the surrounding packets, we found more, as shown in the following figure:

3.png

 

The figure shows that the client kept trying to establish a TCP connection to a site at IP 243.185.187.39, and kept failing. This raises two questions: 1. What is this site? 2. Why does the client connect to it in the middle of the TLS handshake?

Using the DNS query records in the same capture to look up the domain name behind that IP, we found that the site's domain name is a771.dscq.akamai.net:

4.png

 

A public web search suggests that this domain is the OCSP (Online Certificate Status Protocol, used to check a certificate's revocation status) address for certificates from Let's Encrypt (the world's largest free certificate authority), but that needed further confirmation. The details of the Certificate frame returned by the server in the capture show that the certificate's OCSP address is http://ocsp.int-x3.letsencrypt.org :

5.png
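The same field can also be read programmatically. This is a minimal Python sketch using the standard-library ssl module: the dictionary returned by SSLSocket.getpeercert() for a validated certificate exposes the OCSP responder URLs under the "OCSP" key. The helper names are illustrative, and the sample dictionary is hand-written from the address quoted above rather than fetched from the server.

```python
import socket
import ssl


def ocsp_urls(cert):
    """Return the OCSP responder URLs from a getpeercert()-style dict."""
    return list(cert.get("OCSP", ()))


def fetch_peer_cert(host, port=443, timeout=10.0):
    """Connect to host:port over TLS and return the validated peer certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.getpeercert()


# Hand-written sample, reduced to the one field of interest.
sample_cert = {"OCSP": ("http://ocsp.int-x3.letsencrypt.org",)}
print(ocsp_urls(sample_cert))
# prints: ['http://ocsp.int-x3.letsencrypt.org']
```

Against a live site, ocsp_urls(fetch_peer_cert("api.xiaolianhb.com")) would return whatever OCSP address the currently deployed certificate carries.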

 

Because this OCSP address differs from the domain seen in the capture, we checked locally with nslookup: a771.dscq.akamai.net is the CNAME target of ocsp.int-x3.letsencrypt.org (a configuration commonly used for site acceleration):

6.png

 

Combining the above with public information, we can confirm that during the 3s wait the client is trying to connect to the OCSP address given in the certificate in order to check the certificate's revocation status. A closer look shows that the IP address resolved locally is not the same as the one seen in the capture on the phone. It is very likely that DNS resolution of the OCSP address for Let's Encrypt certificates is "contaminated".

7.png
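One detail supports this reading: 243.185.187.39 lies in 240.0.0.0/4, a block reserved for future use, so it is not an address a public OCSP responder would normally have. This is a minimal Python sketch of that check using the standard-library ipaddress module; the function name is illustrative, and 8.8.8.8 is included only as an arbitrary public address for contrast.

```python
import ipaddress


def reserved_addresses(addresses):
    """Return the addresses that fall in an IETF-reserved block."""
    return [a for a in addresses if ipaddress.ip_address(a).is_reserved]


# IP seen in the phone's capture, plus an arbitrary public address for contrast.
print(reserved_addresses(["243.185.187.39", "8.8.8.8"]))
# prints: ['243.185.187.39']
```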

 

To summarize the problem: the client needs to establish an HTTPS connection to https://api.xiaolianhb.com/ . During the TLS handshake, the client cannot connect to the OCSP address given in the site certificate, so it cannot confirm the certificate's revocation status. After 3s a timeout releases the check, the client and the site complete the HTTPS connection normally, and the request and response can proceed.

Since the "contamination" of the OCSP domain name for Let's Encrypt certificates is a given, the fastest fix is to replace the site certificate with one whose OCSP address is reachable, so that the TLS handshake can proceed without the delay.

Summary

This case shows that a problem is sometimes not directly related to the code, the SDK, or a system bug. The anomaly can come from an unexpected place.

Returning to the symptoms, further questions remain: why are desktop browsers and network tools less affected? Why are Android phones not affected? These differences in detail all come down to how different systems and tools implement the protocol.

It is hard for developers to foresee all of these subtle issues at the planning stage. So once a problem occurs, in-depth analysis and careful reading of logs are often the key to understanding the logic behind a program's behavior.

Origin blog.51cto.com/15052833/2590826