Summary of TCP programming issues

Let's first review the five-layer TCP/IP model. From top to bottom: the application layer, the transport layer, the network layer, the data link layer, and the physical layer. In everyday programming we mostly deal with the top three: application, transport, and network.

What do these three layers do? The following are my notes from the book "Computer Networking: A Top-Down Approach" (a good book that explains complex concepts in plain language, unlike the dry-as-dust textbooks from college).

Network layer

The network layer provides host-to-host communication, that is, data transfer from one IP address to another. It is divided into a data plane and a control plane.

The main function of the data plane is forwarding: moving a datagram from a router's input link to the appropriate output link.

The role of the control plane is to coordinate these per-router forwarding actions so that datagrams ultimately travel end-to-end along a path of routers between the source and destination hosts. Simply put, the control plane does routing. The algorithm that computes the forwarding paths is called a routing algorithm.

The authors illustrate the two planes with an example. Imagine someone driving from Pennsylvania to Florida. Forwarding is the process of getting through a single interchange: entering from one road and leaving by an exit onto another. Routing is the process of planning the trip before setting out: consulting a map and choosing one of many possible routes.

Transport layer

The transport layer provides logical communication between application processes running on different hosts, whereas the network layer provides logical communication between hosts. Once a message arrives at a host, the transport-layer protocol directs it to the right process.

The authors give a fun example. There are two families, A and B, each with 12 children. Every week each child writes a letter to each of the 12 children in the other family, so each family sends 144 letters a week. In family A, Ann collects everyone's letters and hands them to the mail carrier; when letters arrive, she distributes them to her brothers and sisters. Bill does the same job for family B. What Ann and Bill do is what the transport layer does. In this example:

House = host

Siblings = processes

Transport-layer protocol = Ann and Bill

Network-layer protocol = postal service (including the mail truck)

Letters in the envelopes = application-layer messages

For example, a computer runs several network applications at once, such as a browser, NetEase Cloud Music, and Thunder (Xunlei) downloads. How does data arriving at the computer get to the right application process? That is the transport layer's job.

What does the transport layer use to tell processes apart? The port number of the socket.

The two examples above are great, right?

The transport layer has two protocols, TCP and UDP. TCP is a connection-oriented protocol that guarantees reliable delivery (a conscientious letter distributor that, besides handing arriving data to the right process, provides many additional services). UDP is a connectionless, unreliable datagram protocol (it provides only the most basic distribution service: it sends data and hands arriving data to the right process, but does not care whether anything sent was actually received).

How does TCP make sure data arrives reliably? Simply put, it splits the data to be sent into segments, numbers them, and sends them in order; the continuous sequence of segments forms a byte stream. (Strictly speaking, the sequence numbers count bytes rather than segments.) When numbered data arrives, the receiver sends back an ACK. On receiving the ACK, the sender considers that data delivered and slides its send window forward. The real protocol is not this simple; there is a great deal more to it.
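
To make the "window slides forward" idea concrete, here is a minimal sketch in C. It is a toy model, not how any real TCP stack is written: the struct and function names are invented for illustration, and it ignores retransmission, timers, and congestion control entirely. It only shows that a cumulative ACK frees up room in the send window.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, highly simplified sender state. Sequence numbers count
 * bytes, and the window covers the range [base, base + size). */
typedef struct {
    uint32_t base;   /* oldest byte not yet acknowledged */
    uint32_t next;   /* next byte to be sent */
    uint32_t size;   /* window size in bytes */
} send_window;

/* How many more bytes may be sent right now. */
static uint32_t window_available(const send_window *w)
{
    assert(w->next - w->base <= w->size);
    return w->size - (w->next - w->base);
}

/* Record that up to `len` bytes were sent; returns the number accepted. */
static uint32_t window_send(send_window *w, uint32_t len)
{
    uint32_t room = window_available(w);
    if (len > room)
        len = room;
    w->next += len;
    return len;
}

/* Cumulative ACK: the peer has received every byte before `ack`.
 * Stale or out-of-range ACKs are ignored. */
static void window_on_ack(send_window *w, uint32_t ack)
{
    uint32_t acked = ack - w->base;           /* wrap-safe distance */
    if (acked == 0 || acked > w->next - w->base)
        return;
    w->base = ack;                            /* the window slides forward */
}
```

With a 300-byte window, sending 300 bytes leaves no room; an ACK covering the first 200 bytes moves `base` forward and makes 200 bytes of room available again.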

I won't go into the application layer here.

The following summarizes several problems I have run into in TCP communication:

1. Data arrives in an unexpected order

I ran into this problem with a washing machine that must be powered on before its wash mode can be set. During debugging, the APP clearly sent the power-on command first and the set-mode command second, yet the Wi-Fi board in the appliance always received set-mode first and power-on second.

Analysis showed that the APP sent the data to the server in order. But does the server forward the data to the Wi-Fi board one message at a time, in order? No. The server talks to thousands of peers at once, both mobile APP users and appliance Wi-Fi boards, so it is usually concurrent: it spreads messages across many threads for processing. Here, the two control messages from the APP were handed to two different threads. Both application-layer threads passed their data down to the transport layer at about the same time, addressed to the same IP and port (the appliance's Wi-Fi board). There is only one transport layer underneath, so the two threads' data has to queue, and whichever thread hands its data down first is sent first. As a result, the message the APP sent first does not necessarily arrive first.

How was this eventually solved? The APP now sends power on/off and set-mode together in a single message, and the Wi-Fi board processes it in a fixed order: it first checks for a power on/off field and executes it if present, then checks for a set-mode field.

2. A message arrives in several pieces

This is the most basic problem in TCP programming. Ideally, the APP sends one complete message and the Wi-Fi board receives one complete message. During testing, however, I tapped the air conditioner's on/off button in the APP over and over, about 20 times, and found that the APP ended up showing the air conditioner as on while it was actually off, or showing it as off while it was actually on. (A rather perverse test, isn't it?)

The log showed that the last control message had arrived in two pieces. TCP's transport layer carries a byte stream. The sender's transport layer does not care what the application layer hands down; it just cuts the data into segments and sends them. The receiver's transport layer does not care what the data contains either; it passes up to the application layer however much it has received.

Our TCP data handling at the time was incomplete and could not cope with a message arriving in two pieces. The first fragment did not match the application-layer protocol format and was discarded, and then the second fragment was discarded as well.

A complete TCP data handler should cope with all of the following:

(1) One complete message arrives: process it. This is the basic case;

(2) Several complete messages arrive together: process them one by one;

(3) Several complete messages plus a fragment of an incomplete one arrive: process the complete messages, store the fragment in a cache, and when the rest arrives in the next one or more reads, process the reassembled message;

(4) Only a fragment of an incomplete message arrives: store it in the cache and process it once the next one or more reads complete the message.

Later the program was improved to store incoming data in a circular queue for processing, and the bug was resolved.
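
Here is a rough sketch of a handler that covers cases (1) to (4). Two things in it are my assumptions for the sake of the example, not our actual code: the framing (a 2-byte big-endian length followed by the payload) and the use of a plain linear buffer instead of a circular queue. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_CAP 512

typedef void (*msg_handler)(const uint8_t *payload, size_t len, void *ctx);

typedef struct {
    uint8_t data[RX_CAP];
    size_t  used;               /* bytes currently buffered */
} rx_buffer;

/* Example handler: adds each payload length to a size_t behind ctx. */
static void sum_payload(const uint8_t *payload, size_t len, void *ctx)
{
    (void)payload;
    *(size_t *)ctx += len;
}

/* Feed bytes exactly as recv() returned them. Every complete message is
 * passed to `handle`; a trailing fragment stays buffered for the next call.
 * Returns the number of complete messages handled, or -1 if the buffer
 * would overflow (the caller should then reset the connection). */
static int rx_feed(rx_buffer *rx, const uint8_t *in, size_t n,
                   msg_handler handle, void *ctx)
{
    assert(rx->used <= RX_CAP);
    if (n > RX_CAP - rx->used)
        return -1;
    memcpy(rx->data + rx->used, in, n);
    rx->used += n;

    int handled = 0;
    size_t pos = 0;
    while (rx->used - pos >= 2) {
        size_t len = ((size_t)rx->data[pos] << 8) | rx->data[pos + 1];
        if (rx->used - pos - 2 < len)
            break;                            /* payload not complete yet */
        handle(rx->data + pos + 2, len, ctx);
        pos += 2 + len;
        handled++;
    }
    /* Move the unprocessed tail to the front of the buffer. */
    memmove(rx->data, rx->data + pos, rx->used - pos);
    rx->used -= pos;
    return handled;
}
```

The key design point is that nothing is ever discarded just because it is incomplete: a fragment simply waits in `rx->data` until later reads complete it.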

3. Execution slows down when tapping quickly in the APP

This is the same perverse test as above: tapping the on/off button in the APP over and over. The tester reported that the air conditioner responded quickly at first but became very slow after ten or more taps. The taps in the APP were long finished, yet the air conditioner kept switching on and off slowly for quite a while, as if it were switching by itself.

I checked the TCP thread. It calls select in the main while loop, and when select reports that data has arrived on the socket, the thread reads it. The select timeout was 1 s: with no data, select waits at most 1 s before the loop continues; with data, the read happens immediately. After changing 1 s to 500 ms, the response was noticeably faster!

But I still have doubts. In principle, select should not delay processing or sending and receiving data: if data is available it returns immediately and reports it, and you read the data right away. Only when there is no data does it wait for the timeout. So the length of the timeout should not matter, right?
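
The behaviour I expect from select can be checked with a small helper like the one below. This is a sketch assuming a POSIX-style sockets API; `wait_readable` is a name I made up, and the code says nothing about what our main loop actually did between calls.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Wait until `fd` is readable or `timeout_ms` elapses.
 * Returns 1 if readable, 0 on timeout, -1 on error
 * (including interruption by a signal). */
static int wait_readable(int fd, int timeout_ms)
{
    assert(fd >= 0 && fd < FD_SETSIZE);

    fd_set rfds;
    struct timeval tv;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec  = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int rc = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (rc < 0)
        return -1;
    return (rc > 0 && FD_ISSET(fd, &rfds)) ? 1 : 0;
}
```

When data is already pending, the call returns at once no matter how long the timeout is; the timeout is only spent when nothing arrives. So if shortening the timeout makes things faster, the extra delay is probably coming from somewhere else in the loop, which is exactly why I still have doubts.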

4. A message arrives in two pieces, some time apart

This bug appeared recently on a Wi-Fi + Zigbee gateway. One side of the gateway connects to the server over Wi-Fi; the other side is Zigbee, with many sub-devices behind it such as switches, water-leak sensors, and air-quality sensors. The symptom: occasionally a scene command, such as turning all lights on or all lights off, is not executed. The user taps a scene button in the APP and the message is sent to the gateway.

The log showed that the message arrived in two pieces, with 2 seconds between them. Strangely, by the time the second piece arrived, the first piece had been cleared rather than kept. Yet most of the time, messages that arrive in two pieces are handled correctly. Why not this time? What was different?

The difference was this: right after the first piece was received, the timer for the ping message fired and a ping was sent to the server, and only then, 2 seconds later, did the second piece arrive. (The ping is an application-layer heartbeat sent every 30 s to keep the link alive and detect drops; if a ping sent to the server gets no reply within 5 s, the connection is considered dropped.)

I searched every place where the cache is cleared and found that it was being cleared where the ping message is sent! That is how the first piece of the message was lost and the message was never processed. After removing that clear operation and testing many times, the problem did not recur.

This turned up another problem: this time the ping got no reply from the server, so the gateway decided the connection had dropped and stopped processing the control messages it received. How should the dropped-connection logic be designed?

Is it reasonable to declare the connection dropped just because one heartbeat reply was missed? At that very moment, the control messages we actually care about were still arriving normally!

Therefore the logic for detecting a dropped connection should be improved:

(1) If no ping reply is received but control messages are still arriving, the connection should not be judged dropped. As long as data can be received, the connection is up;

(2) Debounce: only treat the connection as dropped when ping replies are missed several times in a row (a ping is sent every 30 s).
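
The two rules above could be combined roughly as follows. This is only a sketch: the names are hypothetical, and the threshold of 3 consecutive misses is a number I picked for the example, not one we settled on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_MISSED_PINGS 3      /* assumed debounce threshold */

typedef struct {
    bool     rx_since_ping;     /* any data received since the last ping? */
    unsigned missed;            /* consecutive pings with no traffic at all */
} link_monitor;

/* Call when a ping is sent. */
static void link_on_ping_sent(link_monitor *m)
{
    assert(m != NULL);
    m->rx_since_ping = false;
}

/* Call for ANY received data: a ping reply or a control message. */
static void link_on_data(link_monitor *m)
{
    assert(m != NULL);
    m->rx_since_ping = true;
    m->missed = 0;
}

/* Call when the reply window for the last ping expires.
 * Returns true if the link should now be treated as dropped. */
static bool link_on_ping_timeout(link_monitor *m)
{
    assert(m != NULL);
    if (m->rx_since_ping) {     /* rule (1): traffic proves the link is up */
        m->missed = 0;
        return false;
    }
    m->missed++;                /* rule (2): count consecutive misses */
    return m->missed >= MAX_MISSED_PINGS;
}
```

A single lost ping reply no longer takes the link down, and a control message arriving at any point resets the miss counter.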

5. Data is accidentally erased

After the fix for problem 4, testing showed a scene message failing to execute yet again, once in about two hundred runs. I nearly broke down! My intuition said this was a new bug!

The log showed that this, too, was a message that arrived in two pieces and was not handled correctly. The first read contained n complete messages plus the first half of a control message. The log showed the code reaching the step that copies the incomplete fragment to the cache at the end, but when the second half arrived, the log showed no first half in the cache!

The copy uses the library function memcpy, and the length to copy is computed with strlen.

I replayed the problematic message through test code with extra logging to see how the code ran. The complete messages at the front were indeed processed. The problem was that the processing code was handed the very buffer pointer that the main loop uses to receive socket data. At one point it computes the MD5 digest of a message, compares it with the MD5 digest carried in the message, and in doing so sets the character at a certain position in the string to 0.

This is very dangerous! It directly caused strlen to return 0 later when the incomplete fragment was copied, because string functions such as strlen, strstr, and strcpy all stop at the first 0 byte.

Passing the receive buffer pointer straight into the processing code is poor practice.

The fix: the function that writes the 0 no longer receives the socket receive buffer pointer. Instead, a separate buffer is allocated, the data to be processed is copied into it, and a pointer to the new buffer is passed in.
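
The bug and the fix can be reproduced in a few lines. In this sketch, `check_digest_destructive` is a hypothetical stand-in for our MD5 check (it only does the harmful part, writing a 0 into the buffer), and the position it overwrites is chosen so the effect is easy to see. The fixed version also tracks the received length explicitly rather than trusting strlen.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the digest check: like the code described
 * above, it terminates the message in place by writing a 0 at `end`. */
static void check_digest_destructive(char *buf, size_t end)
{
    buf[end] = '\0';
    /* ... the digest of buf[0..end) would be computed and compared here ... */
}

/* Buggy pattern: parse inside the receive buffer, then use strlen() to
 * size the leftover fragment. Returns the number of bytes saved. */
static size_t save_tail_buggy(char *rx, size_t msg_len, char *cache)
{
    check_digest_destructive(rx, msg_len);    /* clobbers the fragment's first byte */
    size_t n = strlen(rx + msg_len);          /* now 0: the fragment is lost */
    memcpy(cache, rx + msg_len, n);
    return n;
}

/* Fixed pattern: do the destructive work on a scratch copy and size the
 * fragment from the received length. `scratch` needs msg_len + 1 bytes. */
static size_t save_tail_fixed(const char *rx, size_t rx_len, size_t msg_len,
                              char *scratch, char *cache)
{
    assert(msg_len <= rx_len);
    memcpy(scratch, rx, msg_len);
    check_digest_destructive(scratch, msg_len);
    size_t n = rx_len - msg_len;
    memcpy(cache, rx + msg_len, n);
    return n;
}
```

With a receive buffer holding one complete message followed by a three-byte fragment, the buggy version saves 0 bytes, while the fixed version saves all three and leaves the receive buffer untouched.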

6. Socket port number problem

I'm getting tired of writing, so to make a long story short, the process of connecting to the server is as follows:

(1) Call socket to create a TCP socket;

(2) Initialize your own address my_addr, of type sockaddr_in, containing the port number, address family, and IP address (as shown below);

(3) Call bind to bind the socket to my_addr;

(4) Initialize the address of the server to connect to, svr_addr;

(5) Call connect to connect to the server.
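
Steps (1) to (5) look roughly like this with a POSIX-style sockets API. This is a sketch, not our board's code: the helper names `make_addr` and `connect_from_port` are invented, it handles IPv4 only, and an embedded stack may differ in details.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill an IPv4 address. `ip` is dotted-quad text, or NULL for INADDR_ANY.
 * Returns 0 on success, -1 if `ip` cannot be parsed. */
static int make_addr(struct sockaddr_in *a, const char *ip, uint16_t port)
{
    assert(a != NULL);
    memset(a, 0, sizeof *a);
    a->sin_family = AF_INET;
    a->sin_port   = htons(port);
    if (ip == NULL) {
        a->sin_addr.s_addr = htonl(INADDR_ANY);
        return 0;
    }
    return inet_pton(AF_INET, ip, &a->sin_addr) == 1 ? 0 : -1;
}

/* Returns a connected TCP socket bound to `local_port`, or -1 on failure. */
static int connect_from_port(uint16_t local_port,
                             const char *svr_ip, uint16_t svr_port)
{
    struct sockaddr_in my_addr, svr_addr;

    if (make_addr(&my_addr, NULL, local_port) != 0 ||       /* step (2) */
        make_addr(&svr_addr, svr_ip, svr_port) != 0)        /* step (4) */
        return -1;

    int fd = socket(AF_INET, SOCK_STREAM, 0);               /* step (1) */
    if (fd < 0)
        return -1;

    if (bind(fd, (struct sockaddr *)&my_addr, sizeof my_addr) != 0 ||     /* step (3) */
        connect(fd, (struct sockaddr *)&svr_addr, sizeof svr_addr) != 0) { /* step (5) */
        close(fd);
        return -1;
    }
    return fd;
}
```

Passing 0 as `local_port` asks the stack to choose a port itself on systems that support it.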

Step (2) needs special care: the local port number must not repeat. If 2000 is used this time, the next time the Wi-Fi board connects (for example, after a power cycle) it must not use 2000 again; it can increment to 2001 or pick some other value.

Why? The server generally notices that the Wi-Fi board has gone offline later than the board itself does. If the board reconnects with its old port number before the server has detected the drop, the TCP connection for that port still exists on the server and its resources have not been released, so a new connection from the same port will generally fail. The other case is a power cycle: when the board loses power and then reconnects, the server has no idea the board restarted, so connecting from the old port still fails, unless the power stays off long enough for the server to detect that the board is offline.

So the right approach is to store the port number in flash: read it from flash each time it is needed, and update the stored value once it has been used.

On a microcontroller (MCU) platform, the stack does not generate non-repeating port numbers for you, so you have to take care of saving one yourself. On some Linux-based platforms, the lower layers generate a non-repeating port number automatically and you do not need to worry about it.
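
A sketch of the "store the port in flash" idea follows. Everything in it is an assumption made for the example: the flash access is faked with a static variable (a real board would call its own flash driver), and the port range 2000 to 2999 is arbitrary.

```c
#include <assert.h>
#include <stdint.h>

#define PORT_MIN 2000u          /* assumed range, chosen for the example */
#define PORT_MAX 2999u

/* Stand-in for flash storage. Erased flash typically reads as 0xFFFF. */
static uint16_t fake_flash_port = 0xFFFF;

static uint16_t flash_read_port(void)        { return fake_flash_port; }
static void     flash_write_port(uint16_t p) { fake_flash_port = p; }

/* Return the local port for this connection attempt and persist the
 * port to use next time, so a restart never reuses the current one. */
static uint16_t next_local_port(void)
{
    uint16_t port = flash_read_port();
    if (port < PORT_MIN || port > PORT_MAX)
        port = PORT_MIN;                      /* first boot or corrupt value */

    uint16_t following = (port == PORT_MAX) ? (uint16_t)PORT_MIN
                                            : (uint16_t)(port + 1);
    flash_write_port(following);

    assert(port >= PORT_MIN && port <= PORT_MAX);
    return port;
}
```

The next value is written before the current one is used, so even a power cut right after connecting cannot lead to the same port being used twice in a row. One thing to keep in mind on real hardware is that writing flash on every connection adds wear.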

7. Problems caused by the lack of a keep-alive mechanism

I'm worn out and can't recall the details of this one right now, so I'll come back and add it another day.

Origin blog.csdn.net/weixin_38293850/article/details/105632705