Using tcpdump to solve problems encountered in production operations and maintenance

Source: DevOpSec Official Account
Author: DevOpSec

As a technician, tcpdump is still a tool worth understanding.

When you run into a network or protocol problem and are at a loss, using tcpdump to see what actually happened during the network communication can often help you locate the problem quickly.

This article only introduces problems encountered at work, for your reference, in the hope of providing inspiration for solving similar problems in your own work. For the details of how to use tcpdump, google it.

The following three cases are introduced:

Case 1: Flume reports an error when writing logs to Kafka

Case 2: After the LB (load balancer) adds a header to the request, nginx cannot log the header key client_ip

Case 3: MySQL QPS is very high but there are no slow queries; we want to find the top-K SQL statements

Finally: common scenarios for capturing the HTTP protocol

Case 1: Flume reports an error when writing logs to Kafka

Flume writes logs to Kafka and the error below appears, with no other errors: pushing data to Kafka throws a TimeoutException.

But from the Flume machine, telnet to port 9092 on the Kafka broker connects fine.
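
A minimal sketch of that connectivity check (the broker host name here is just an example; use whichever address Flume is configured with):

telnet kafka-001 9092
# or, if telnet is not installed:
nc -vz kafka-001 9092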

What is the reason for this?

Looking at the logs gives no ideas, so use tcpdump to take a look at the packets.

13 May 2023 16:01:28,367 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.kafka.KafkaSink.process:240)  - Failed to publish events
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Batch Expired
        at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:56)
        at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:43)
        at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:25)
        at org.apache.flume.sink.kafka.KafkaSink.process(KafkaSink.java:229)
        at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
        at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.TimeoutException: Batch Expired
13 May 2023 16:01:28,367 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.SinkRunner$PollingRunner.run:158)  - Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: Failed to publish events
        at org.apache.flume.sink.kafka.KafkaSink.process(KafkaSink.java:252)
        at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
        at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Batch Expired
        at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:56)
        at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:43)
        at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:25)
        at org.apache.flume.sink.kafka.KafkaSink.process(KafkaSink.java:229)
        ... 3 more

Capture packets on the Flume machine:

tcpdump -s 0 -A -e -vvv port 9092
16:46:31.324786 52:54:00:6f:bf:d2 (oui Unknown) > 98:f2:b3:2b:74:f0 (oui Unknown), ethertype IPv4 (0x0800), length 97: (tos 0x0, ttl 63, id 3722, offset 0,flags [DF], proto TCP (6), length 83)
    flume-001.28230 > 192-168-160-10.kafka.release.svc.cluster.local.XmlIpcRegSvc: Flags [P.], cksum 0xc1b4 (incorrect -> 0x3b21), seq 14182:14225, ack 6634, win 31200, length 43
E..S..@.?.k........
nF#...d.q#..P.y........'.........
producer-1......log_flume_topic

16:46:31.325436 98:f2:b3:2b:74:f0 (oui Unknown) > 52:54:00:6f:bf:d2 (oui Unknown), ethertype IPv4 (0x0800), length 704: (tos 0x0, ttl 64, id 39463, offset 0, flags [DF], proto TCP (6), length 690)
    192-168-160-10.kafka.release.svc.cluster.local.XmlIpcRegSvc > flume-001.28230: Flags [P.], cksum 0x4359 (correct), seq 6634:7284, ack 14225, win 50470, length 650
E....'@.@......
....#.nFq#....e.P..&CY....................kafka-002..#.......kafka-003..#.......kafka-001..#.........log_flume_topic.............................................................
...................................................     ..........................................................................................................................................................................................................................................................................................................................................................................................................................

From the packet capture above, the content of the return packet from the Kafka node 192-168-160-10.kafka.release.svc.cluster.local.XmlIpcRegSvc is kafka-002..#.......kafka-003..#.......kafka-001..#.........log_flume_topic

kafka-002, kafka-003 and kafka-001 are the Kafka host names. This is not very intuitive to read here, so save the packets to a file and analyze them with Wireshark.

Execute tcpdump -s 0 -w kafka_traffic.pcap port 9092, then open the file with Wireshark. You can see Kafka Metadata v0 Request and Kafka Metadata v0 Response packets in the protocol list.

Click on the Kafka Metadata v0 Request entry to see the detailed information.

Click on the Kafka Metadata v0 Response entry to see the detailed information.
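
If a GUI is not at hand, the same exchange can be inspected from the command line. A minimal sketch, assuming tshark (Wireshark's CLI) is installed and the capture file from above:

tshark -r kafka_traffic.pcap -Y kafka -O kafka   # show only Kafka packets, fully decoding the Kafka layer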

On the Flume machine, ping kafka-002:

ping kafka-002                                           
ping: cannot resolve kafka-002: Unknown host

Now the cause of the TimeoutException is clear: before writing, the Flume client obtains the broker addresses from Kafka's metadata, and the address Kafka returns for each broker is a host name plus port.

After Flume gets kafka-002, DNS resolution fails, so pushing events fails.

Solution:

After configuring hosts entries for kafka-002 (and the other brokers) on the Flume machine, the error disappears and the problem is solved.
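
A minimal sketch of those hosts entries (the IPs are placeholders; substitute the brokers' real addresses):

cat >> /etc/hosts <<'EOF'
192.168.160.10  kafka-001
192.168.160.11  kafka-002
192.168.160.12  kafka-003
EOF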

There is a pitfall in the Flume log here: it reports TimeoutException rather than something like kafka-002 name resolve failed, which makes the problem harder to pin down.

Another workaround:

Why does Kafka return the kafka-002 host name instead of an IP?

Looking at the Kafka configuration file, we find advertised.listeners=PLAINTEXT://kafka-002:9092
The advertised.listeners parameter controls the listener information the broker publishes to ZooKeeper, which is what clients use to connect.

So to make Flume get an IP from Kafka instead of a host name, you can change advertised.listeners in the Kafka configuration to use the IP and restart Kafka.
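
A sketch of that change, assuming a default Kafka layout and a placeholder IP:

# In config/server.properties on each broker, advertise the IP instead of the host name:
#   advertised.listeners=PLAINTEXT://192.168.160.10:9092
# Then restart the broker:
bin/kafka-server-stop.sh
bin/kafka-server-start.sh -daemon config/server.properties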

Case 2: After the LB (load balancer) adds a header to the request, nginx cannot log the header key client_ip

Let me describe the scenario first.

Behind a layer-7 (HTTP) load balancer, the remote_addr that nginx sees is the LB's IP, so the team responsible for the LB adds the client's IP to the request in a header named client_ip.

We added $http_client_ip to the nginx log format, but the value of this header was not printed.

What is the reason?

Is it that the colleague responsible for LB operations did not actually add the client_ip header?

Or is the header added but its value empty?

This requires tcpdump to capture packets to verify our guess.

Execute the following command on the nginx machine:

tcpdump -s 0 -A 'tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420'|grep client_ip

The result:

client_ip: 1.1.1.1

From the packet capture we can see the client IP is indeed set in the header, so the LB configuration is fine and the problem must be on the nginx side.

Why can't $http_client_ip get the header value?

Searching the nginx documentation for header-related configuration at http://nginx.org/en/docs/http/ngx_http_core_module.html, we find:

Syntax:   underscores_in_headers on | off;
Default:  underscores_in_headers off;
Context:  http, server

At this point the truth is revealed: by default, nginx ignores user-defined headers containing underscores, to avoid conflicts with its built-in header keys.

The problem is solved after setting underscores_in_headers on; in the nginx configuration.
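
A minimal sketch of applying the fix, assuming the default config path and that nginx is reloaded via the nginx binary:

# Add inside the http {} (or the relevant server {}) block of /etc/nginx/nginx.conf:
#   underscores_in_headers on;
# Then validate and reload:
nginx -t && nginx -s reload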

Case 3: MySQL QPS is very high but there are no slow queries; we want to find the top-K SQL statements

The MySQL load is high and QPS is extremely high, but there are no slow queries, or perhaps the slow-query threshold (long_query_time) is set unreasonably so slow queries are not exposed. Worried that this will hurt database performance over time, we want to know which statements are causing it.
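
To rule out a mis-set threshold first, the slow-query settings can be checked directly. A small sketch, assuming a local mysql client login works without extra credentials:

mysql -e "SHOW VARIABLES LIKE 'slow_query_log'; SHOW VARIABLES LIKE 'long_query_time';"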

MySQL auditing is not enabled here, and the callers do not log their queries, so it is not easy to troubleshoot.

How to deal with it? Is there a way to get the top-K SQL non-intrusively, without needing the developers to get involved?

This is when our tcpdump shines.

The general script to capture MySQL statements is as follows:

cat /tmp/mdump.sh
# Capture port-3306 traffic, extract printable text, and stitch statements back into single lines
tcpdump -i eth0 -s 0 -l -w - port 3306 | strings | perl -e '
while(<>) { chomp; next if /^[^ ]+[ ]*$/;
    if(/^(SELECT|UPDATE|DELETE|INSERT|SET|COMMIT|ROLLBACK|CREATE|DROP|ALTER|CALL)/i)
    {
        # a new statement begins: print the previous one
        if (defined $q) { print "$q\n"; }
        $q=$_;
    } else {
        # continuation line: trim leading whitespace and append
        $_ =~ s/^[ \t]+//; $q.=" $_";
    }
}'

Execute it on the MySQL machine with high QPS:

sh /tmp/mdump.sh > /tmp/m.sql
Press Ctrl+C after about 30 seconds.
Then run the following command to get the top 10 SQL statements:

grep -i ' from ' /tmp/m.sql |grep -i ' where ' |awk -F'where|WHERE' '{print $1}'|sort|uniq -c |sort -rnk1|head -n 10
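
A hedged variant (not in the original article): masking numeric and quoted-string literals first makes queries that differ only in parameter values group together, which often surfaces the real hot statement pattern:

# Mask literals, then count the most frequent statement shapes
sed -E "s/[0-9]+/N/g; s/'[^']*'/'S'/g" /tmp/m.sql | sort | uniq -c | sort -rn | head -n 10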

If you find high-frequency SQL, you can check with the developers whether new business features recently went online, and work out an optimization.

By the same token, problems with similar components can also be captured and located in this way.

Another recommended packet-capture tool for MySQL, Redis, MongoDB and HTTP:
https://github.com/40t/go-sniffer

Finally: common scenarios for capturing the HTTP protocol

Grab HTTP GET requests

tcpdump -i enp0s8 -s 0 -A 'tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420'

Explanation:

tcp[((tcp[12:1] & 0xf0) >> 2):4] picks out the 4 bytes right after the TCP header, i.e. the first 4 bytes of the payload, where the HTTP request line starts. tcp[12:1] is the byte holding the TCP data-offset field; & 0xf0 keeps the high nibble and >> 2 converts the header length from 32-bit words into bytes.

0x47455420 is the ASCII encoding of "GET ":

Character   ASCII (hex)
G           47
E           45
T           54
Space       20
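
If you need the constant for a different prefix, a quick sketch (assuming xxd is available):

printf 'GET ' | xxd -p    # 47455420
printf 'POST' | xxd -p    # 504f5354
printf 'HTTP' | xxd -p    # 48545450
printf '<!DO' | xxd -p    # 3c21444f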

Grab HTTP POST requests

tcpdump -i enp0s8 -s 0 -A 'tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x504F5354'

0x504F5354 is the ASCII encoding of "POST".

HTTP GET request with destination port 80

tcpdump -i enp0s8 -s 0 -A 'tcp dst port 80 and tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420'

HTTP GET and POST requests with destination port 80 or 443, involving host 10.10.10.10

tcpdump -i enp0s8 -s 0 -A '(tcp dst port 80 or tcp dst port 443) and (tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420 or tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x504F5354) and host 10.10.10.10'

Grab HTTP GET and POST request and response

tcpdump -i enp0s8 -s 0 -A 'tcp port 80 and (tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420 or tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x504F5354 or tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x48545450 or tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x3C21444F) and host 10.10.10.10'

This filters on port 80 and host 10.10.10.10, capturing HTTP GET/POST requests and their responses (responses come back from port 80, hence tcp port 80 rather than tcp dst port 80).

0x3C21444F is the ASCII encoding of '<!DO' (the start of <!DOCTYPE), used as an identifier for HTML documents.

0x48545450 is the ASCII encoding of 'HTTP', used to capture HTTP responses (the status line starts with HTTP/).

Monitor all HTTP request URLs (GET/POST)

tcpdump -i enp0s8 -s 0 -v -n -l | egrep -i "POST /|GET /|Host:"

Grab passwords in POST requests (plain HTTP only; encrypted HTTPS traffic cannot be read this way)

tcpdump -i enp0s8 -s 0 -A -n -l | egrep -i "POST /|pwd=|passwd=|password=|Host:"

Grab the cookies in Request and response

tcpdump -i enp0s8 -nn -A -s0 -l | egrep -i 'Set-Cookie|Host:|Cookie:'

Filter HTTP headers

# Filter the User-Agent out of the headers
tcpdump -vvAls0 | grep 'User-Agent:'


Origin: blog.csdn.net/linuxxin/article/details/130662469