TCP retransmission: troubleshooting approach and practical problems (a bit dry!)

Disclaimer: This is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/wufaliang003/article/details/90664256

1, About TCP retransmission

TCP retransmission is a normal mechanism that guarantees reliable data delivery. In a LAN, where network quality is well controlled, the retransmission rate caused by network problems should be very low. On the Internet or a metropolitan area network, the links are complex (picture a city's underground conduits and utility poles), network quality is harder to guarantee, and retransmissions are more likely.

TCP retransmissions do not necessarily mean there is a problem at the network level. The receiver may not exist, the receiver's receive buffer may be full, the application may have failed and left the connection improperly closed, and so on.

2, TCP/IP basics

To troubleshoot network problems you need to understand how TCP/IP works: the truth is in the packets. Below are a few key parameters related to TCP retransmission.

2.1 Parameters for establishing a TCP connection

    # Number of SYN retransmissions before giving up; the retry interval doubles each time (1s, 2s, 4s, ...)
    net.ipv4.tcp_syn_retries

    # Number of SYN-ACK retransmissions before giving up
    net.ipv4.tcp_synack_retries

    # SYN queue length
    net.ipv4.tcp_max_syn_backlog
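To make the doubling interval concrete, here is a minimal sketch that adds up how long a connection attempt waits before giving up. It assumes a 1 s initial timeout and pure doubling, as in the 1s, 2s, 4s pattern above; `syn_give_up_seconds` is a hypothetical helper name, not a system tool.

```shell
# Total seconds spent waiting before the connection attempt is abandoned.
# Assumption: 1 s initial timeout, doubled after every retransmission.
# $1 = number of SYN retransmissions (the value of net.ipv4.tcp_syn_retries)
syn_give_up_seconds() {
  retries=$1
  total=0
  delay=1
  i=0
  # One wait after the initial SYN plus one wait after each retransmission.
  while [ "$i" -le "$retries" ]; do
    total=$((total + delay))
    delay=$((delay * 2))
    i=$((i + 1))
  done
  echo "$total"
}

syn_give_up_seconds 6   # prints 127
```

On a real host the retry count can be read with `sysctl -n net.ipv4.tcp_syn_retries` and passed to the function.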

2.2 TCP retransmission types

Retransmission timeout

A timer is started when a packet is sent. If no ACK has been received when the timer expires, the packet is retransmitted. This repeats until either the maximum number of retransmissions is reached or an ACK is received.

Fast retransmit

When the receiver gets a packet whose sequence number is not the one it expects, it repeats the ACK for the data it is still waiting for. If the sender then receives three consecutive duplicate ACKs carrying the same acknowledgment number, it starts fast retransmit and resends the segment that ACK refers to.
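The duplicate-ACK rule can be sketched as a small counter. This is only an illustration of the trigger condition, not how a kernel implements it: `detect_fast_retransmit` is a hypothetical helper that reads one acknowledgment number per line, and the sample numbers are made up.

```shell
# Reads acknowledgment numbers, one per line, in arrival order.
# Prints a message when the third duplicate of the same ACK arrives.
detect_fast_retransmit() {
  awk '
    NR > 1 && $1 == last {
      dup++                      # same ACK number again: one more duplicate
      if (dup == 3) print "fast retransmit triggered for ack " $1
      next
    }
    { last = $1; dup = 0 }       # a new ACK number resets the counter
  '
}

# Hypothetical ACK stream: the segment starting at 2000 is missing.
printf '%s\n' 1000 2000 2000 2000 2000 3000 | detect_fast_retransmit
```

The first ACK for 2000 is the normal one; the three that follow are the duplicates that trigger the retransmission.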

3, Common problems and countermeasures

3.1 TCP retransmission on a single machine or a single application

The peer server or port may be unreachable.

Troubleshooting ideas

 
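    # 1. Capture 1000 or more TCP packets
    # Two or more packets with the same seq mean a retransmission has occurred
    # SYN retransmission intervals grow exponentially
    # For established connections, the retransmission interval follows the RTO
    # Receiving many duplicate ACKs usually means packets arrived out of order:
    # a segment with a larger seq reached the destination first. After 3 SACKs the
    # sender immediately fast-retransmits the missing TCP segment. Fast retransmit
    # has little effect on RT, but the send window is halved immediately, which
    # has some impact on throughput
    # For virtual machines in a cloud environment, also analyze the host machine

    sudo ss -anti | grep -B 1 retrans   # retransmission statistics

    iface=bond0
    sudo tcpdump -w /tmp/tcp.pcap -i "$iface" -c 1000 -nn tcp 2>/dev/null
    sudo tcpdump -nn -r /tmp/tcp.pcap | awk '{print $3,$5,$8,$9}' | sort | uniq -c | sort -rn | sed 's/^ \{1,\}//g' | grep -Ev "^1 |Request"

    # 2. Connectivity check
    ping $ip
    nc -nvz $ip $port

    # 3. Investigate the receiving application; capture packets at both source and
    #    destination and use wireshark to find out which lost packet caused the retransmission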
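The capture pipeline above also counts pure ACKs with a repeated acknowledgment number. A narrower variant that only looks at segments carrying a `seq` field might look like the sketch below. It assumes the usual `tcpdump -nn` text layout, where field 8 is the literal word `seq`; `count_repeated_segments` is a hypothetical helper and the sample lines are made up.

```shell
# Reads `tcpdump -nn` text output on stdin and prints "count src dst seq"
# for every segment whose source, destination and seq appear more than once.
count_repeated_segments() {
  awk '$8 == "seq" {
         sub(/:$/, "", $5)       # drop the colon after the destination
         sub(/,$/, "", $9)       # drop the comma after the seq value
         print $3, $5, $9
       }' | sort | uniq -c | awk '$1 > 1 { print $1, $2, $3, $4 }'
}

# Hypothetical capture: the same data segment is sent twice, 200 ms apart.
sample='12:00:00.000001 IP 10.0.0.1.40000 > 10.0.0.2.80: Flags [P.], seq 1:101, ack 1, win 229, length 100
12:00:00.200001 IP 10.0.0.1.40000 > 10.0.0.2.80: Flags [P.], seq 1:101, ack 1, win 229, length 100
12:00:00.200500 IP 10.0.0.2.80 > 10.0.0.1.40000: Flags [.], ack 101, win 227, length 0'

printf '%s\n' "$sample" | count_repeated_segments
```

Against a real capture it could be fed with `sudo tcpdump -nn -r /tmp/tcp.pcap | count_repeated_segments`.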

3.2 TCP retransmission on multiple machines or applications at the same time

This may be network jitter.

Troubleshooting ideas

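  1. Check the monitoring points for the network zone and the alerts from network devices to see whether the zone is experiencing network jitter.

  2. If the zone's network is fine, use the method from 3.1 to narrow down the scope.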

3.3 Bandwidth saturated

Troubleshooting ideas

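  1. Check host monitoring to see whether the bandwidth is saturated.

  2. Check whether any network device on the path where retransmission occurs has saturated bandwidth.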
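When no host monitoring is available, utilization can be estimated from two readings of an interface byte counter. The sketch below only does the arithmetic; `link_utilization_pct` is a hypothetical helper and the numbers in the example are made up.

```shell
# Percentage of link capacity used between two byte-counter readings.
# $1 = bytes at start, $2 = bytes at end, $3 = seconds between readings,
# $4 = link speed in Mbit/s
link_utilization_pct() {
  bytes_before=$1
  bytes_after=$2
  seconds=$3
  speed_mbps=$4
  bits=$(( (bytes_after - bytes_before) * 8 ))
  echo $(( bits * 100 / (seconds * speed_mbps * 1000000) ))
}

# 1,125,000,000 bytes in 10 s on a 1000 Mbit/s link
link_utilization_pct 0 1125000000 10 1000   # prints 90
```

On Linux the counters can be read from `/sys/class/net/<iface>/statistics/rx_bytes` and `tx_bytes`, and the link speed in Mbit/s from `/sys/class/net/<iface>/speed`.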

3.4 Uncommon problems

  1. A faulty network device port or optical module causes packet checksum failures.

  2. Jitter during route convergence.

  3. Bugs in the host's network driver or in other network devices.

4, How to monitor

The retran field in the output of tsar --tcp -C can be used to monitor TCP retransmissions.

tsar --tcp -C | sed 's/:/_/g;s/=/ /g' | xargs -n 2

If you are interested, you can run the following script directly. It collects TCP-related monitoring data for open-falcon.

    #!/usr/bin/env bash
    # Collect tsar TCP metrics and print them as an open-falcon JSON array.
    endpoint=$(hostname)
    timestamp=$(date +%s)
    tagapp="app=tsar.collect"
    data_item=""
    # Replace ':' with '_' and '=' with a space, group the words in pairs,
    # skip the first pair, then join each remaining pair as "key|value".
    tsarcollectstring=$(/opt/tsar/bin/tsar --tcp -C | sed 's/:/_/g;s/=/ /g' | xargs -n 2 | tail -n +2 | sed 's/ /|/')
    for i in $tsarcollectstring
    do
        getkey=$(echo "$i" | awk -F "|" '{print $1}')
        getvalue=$(echo "$i" | awk -F "|" '{print $2}')
        tags="$tagapp"
        metric="tsar.collect.$getkey"
        metric_item="{\"endpoint\":\"${endpoint}\",\"tags\":\"${tags}\",
                     \"timestamp\":${timestamp},\"metric\":\"$metric\",
                     \"value\":${getvalue},\"counterType\":\"GAUGE\",
                     \"step\":60}"
        # Join the items with commas so the result is a valid JSON array body.
        if [ -z "${data_item}" ]; then
            data_item="$metric_item"
        else
            data_item="${data_item},${metric_item}"
        fi
    done
    echo "[$data_item]"

5, Case study

1. Capture packets on the machine that is experiencing packet loss and retransmission, and analyze the capture with wireshark. Note that retransmissions do not happen at predictable times, so the capture command has to keep running in order to catch the retransmitted packets. Open the tcpdump capture file in wireshark and enter tcp.analysis.retransmission in the filter box. The results are as follows:

Figure 1 shows that the server retransmitted three times.

2. Because there are many packets, we can use wireshark's Follow Stream function to get the TCP stream related to the retransmission.

Figure 2: Follow -> TCP Stream shows the packets related to the retransmission.

Figure 3 shows the requests and acknowledgments between the client and the server.

3. Analyzing the retransmission

Points to note in particular:

NO.67 and NO.68: for some reason the client did not receive the correct data packet and sent dup ACKs to the server; see fast retransmit in the basics above.

The time difference between NO.69 and NO.68 is 200 ms (look at the time column; the other gaps are under 1 ms): the server waited for the timeout and then retransmitted.

NO.73-74: the client sends a FIN packet and actively closes the connection.

This case occurred only once and could not be reproduced. Analyzing the packet capture did not lead to a clear conclusion.
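A gap like the 200 ms between NO.68 and NO.69 can also be spotted without reading the time column by eye. The sketch below flags packets that arrive noticeably later than their predecessor. It assumes the packet number and relative time have been exported as two columns; `flag_large_gaps` is a hypothetical helper, and the timestamps are invented for illustration, not taken from the capture in this case.

```shell
# stdin: "packet_no relative_time_seconds" per line, in capture order.
# $1: threshold in seconds. Prints "packet_no gap" for every gap >= threshold.
flag_large_gaps() {
  awk -v limit="$1" '
    NR > 1 && $2 - prev >= limit { printf "%s %.3f\n", $1, $2 - prev }
    { prev = $2 }                # remember this packet time for the next line
  '
}

# Hypothetical timestamps: packet 69 arrives about 200 ms after packet 68.
printf '%s\n' '67 1.0000' '68 1.0004' '69 1.2005' '70 1.2009' | flag_large_gaps 0.1
```

A gap close to a round timeout value is a hint that a timer, rather than fast retransmit, triggered the resend.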

6, Summary

This article summarizes how TCP retransmission problems encountered at work were resolved. It focuses on the general troubleshooting approach and concrete practice and is light on theory. Interested readers can consult related articles to understand how TCP works.
