Source: DevOpSec Official Account
Author: DevOpSec
As a technician, tcpdump is a tool you still need to understand.
When you run into a network protocol problem and are at a loss, tcpdump often lets you see what actually happened during the network exchange and helps you quickly locate the fault.
This article only covers problems encountered in real work, for your reference, and aims to provide inspiration for solving similar problems of your own. For the specifics of how to use tcpdump, Google is your friend.
The following three cases are introduced:
Case 1: flume reports an error when writing logs to kafka
Case 2: after the LB (load balancer) adds a header to the request, nginx cannot read the header key client_ip from its logs
Case 3: mysql QPS is very high but there are no slow queries, and we want to find the topK mysql statements
Finally: common packet-capture scenarios for the http protocol
Case 1: flume reports an error when writing logs to kafka
The flume log from writing to kafka is shown below, and there are no other errors. Looking at the error, pushing data to kafka throws a TimeoutException, yet from the flume machine, telnet to port 9092 on the kafka host connects fine.
What is the reason?
The logs alone offer no leads, so let's take a look at the packets with tcpdump.
13 May 2023 16:01:28,367 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.kafka.KafkaSink.process:240) - Failed to publish events
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Batch Expired
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:56)
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:43)
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:25)
at org.apache.flume.sink.kafka.KafkaSink.process(KafkaSink.java:229)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.TimeoutException: Batch Expired
13 May 2023 16:01:28,367 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.SinkRunner$PollingRunner.run:158) - Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: Failed to publish events
at org.apache.flume.sink.kafka.KafkaSink.process(KafkaSink.java:252)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Batch Expired
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:56)
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:43)
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:25)
at org.apache.flume.sink.kafka.KafkaSink.process(KafkaSink.java:229)
... 3 more
Capture packets on the flume machine:
# -s 0: capture full packets; -A: print payloads as ASCII;
# -e: print link-level headers; -vvv: maximum verbosity
tcpdump port 9092 -s 0 -A -e -vvv
16:46:31.324786 52:54:00:6f:bf:d2 (oui Unknown) > 98:f2:b3:2b:74:f0 (oui Unknown), ethertype IPv4 (0x0800), length 97: (tos 0x0, ttl 63, id 3722, offset 0,flags [DF], proto TCP (6), length 83)
flume-001.28230 > 192-168-160-10.kafka.release.svc.cluster.local.XmlIpcRegSvc: Flags [P.], cksum 0xc1b4 (incorrect -> 0x3b21), seq 14182:14225, ack 6634, win 31200, length 43
E..S..@.?.k........
nF#...d.q#..P.y........'.........
producer-1......log_flume_topic
16:46:31.325436 98:f2:b3:2b:74:f0 (oui Unknown) > 52:54:00:6f:bf:d2 (oui Unknown), ethertype IPv4 (0x0800), length 704: (tos 0x0, ttl 64, id 39463, offset 0, flags [DF], proto TCP (6), length 690)
192-168-160-10.kafka.release.svc.cluster.local.XmlIpcRegSvc > flume-001.28230: Flags [P.], cksum 0x4359 (correct), seq 6634:7284, ack 14225, win 50470, length 650
E....'@.@......
....#.nFq#....e.P..&CY....................kafka-002..#.......kafka-003..#.......kafka-001..#.........log_flume_topic.............................................................
................................................... ..........................................................................................................................................................................................................................................................................................................................................................................................................................
From the capture above, the return packet from the kafka node (192-168-160-10.kafka.release.svc.cluster.local.XmlIpcRegSvc) contains
kafka-002..#.......kafka-003..#.......kafka-001..#.........log_flume_topic
kafka-002, kafka-003 and kafka-001 are the kafka hostnames. The raw dump is not very intuitive to read, so save the packets to a file and analyze them with Wireshark.
Execute
tcpdump port 9092 -s 0 -w kafka_traffic.pcap
and then open the resulting file with Wireshark. You can see the protocol is dissected as kafka, with a Kafka Metadata v0 Request and a Kafka Metadata v0 Response.
Click on the Kafka Metadata v0 Request to see its details.
[image unavailable: Wireshark detail of the Kafka Metadata v0 Request]
Click on the Kafka Metadata v0 Response to see its details.
[image unavailable: Wireshark detail of the Kafka Metadata v0 Response]
On the flume machine, ping kafka-002:
ping kafka-002
ping: cannot resolve kafka-002: Unknown host
Now the TimeoutException is clearly explained. Before writing, the flume client fetches the broker metadata from kafka, and the broker addresses kafka returns are hostname plus port. Once flume gets kafka-002, DNS resolution fails, and pushing events fails with it.
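A quick way to double-check how the host resolves that name through the standard libc path (getent consults /etc/hosts and DNS in nsswitch order):
getent hosts kafka-002 || echo 'kafka-002 does not resolve'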
Solution:
After configuring kafka-002 in the hosts file on the flume machine, the error disappears and the problem is solved.
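A minimal sketch of the hosts entries; the IPs below are illustrative and must be replaced with the real broker addresses:
# /etc/hosts on the flume machine
192.168.160.10  kafka-001
192.168.160.11  kafka-002
192.168.160.12  kafka-003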
There is a pitfall in the flume log here: it reports TimeoutException rather than something like kafka-002 name resolve failed, which makes the problem harder to pin down.
Another workaround:
Why does kafka return the hostname kafka-002 instead of an ip?
Looking at the kafka configuration file, we find
advertised.listeners=PLAINTEXT://kafka-002:9092
The advertised.listeners parameter is the Listener information the Broker publishes to Zookeeper, and it is what clients then connect to.
So this is where flume gets the hostname from; changing advertised.listeners in the kafka configuration to an ip and restarting kafka also solves the problem.
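A minimal sketch of that change in the broker's server.properties (the ip below is illustrative):
# before: clients receive a hostname they must be able to resolve
#advertised.listeners=PLAINTEXT://kafka-002:9092
# after: clients receive the ip directly, so no DNS lookup is needed
advertised.listeners=PLAINTEXT://192.168.160.11:9092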
Case 2: after the LB (load balancer) adds a header to the request, nginx cannot read the header key client_ip from its logs
Let me describe the scene first.
After the LB does seven-layer load balancing, the remote_addr nginx sees is the LB ip, so the load balancer adds the client's real ip to the request header as client_ip (a sketch of what that might look like on the LB side follows).
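Purely for illustration, and an assumption rather than the actual LB configuration: if the LB layer were itself nginx, injecting such a header might look like this:
# on the load balancer: pass the downstream client address as a custom header
proxy_set_header client_ip $remote_addr;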
We added $http_client_ip to the nginx log format, but never got a value for this header.
What is the reason?
Did the colleagues responsible for the LB simply not add the client_ip header?
Or is the header added but its value empty?
This calls for tcpdump to capture packets and verify our guesses.
Execute the following command on the nginx machine (the filter expression is explained in the HTTP capture section at the end):
tcpdump -s 0 -A 'tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420'|grep client_ip
The result:
client_ip: 1.1.1.1
From the capture, the client ip is indeed set in the header, so the LB configuration is fine and the problem falls to the nginx side.
Why does $http_client_ip not pick up the header value?
Searching the header-related configuration on the nginx official site (http://nginx.org/en/docs/http/ngx_http_core_module.html), we find:
Syntax: underscores_in_headers on | off;
Default:
underscores_in_headers off;
Context: http, server
At this point the truth is revealed: by default, nginx ignores user-defined headers containing underscores, to avoid conflicts with nginx's built-in header keys.
The problem is solved after setting underscores_in_headers on; in the nginx configuration.
Case 3: mysql QPS is very high but there are no slow queries, and we want to find the topK mysql statements
The mysql load climbs and QPS is extremely high, yet there are no slow queries; it may also be that slow queries are simply not surfaced because the slow-query threshold (long_query_time) is set unreasonably. Worried that this will drag down database performance over time, we want to know which statements are responsible.
Auditing is not enabled on this mysql, and the callers do not log their queries, so it is not easy to troubleshoot.
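Before capturing, it may be worth confirming the slow-query settings; a quick sketch using the standard MySQL system variables:
# is the slow log on, and what is the threshold?
mysql -e "SHOW VARIABLES LIKE 'slow_query_log'; SHOW VARIABLES LIKE 'long_query_time';"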
How to deal with it? Is there a way to get the topK sql non-intrusively, without involving R&D?
This is when our tcpdump shines.
A general script to capture mysql statements is as follows:
cat /tmp/mdump.sh
# capture mysql traffic, pull out printable strings, and stitch
# multi-line SQL statements back together before printing them
tcpdump -i eth0 -s 0 -l -w - port 3306 | strings | perl -e '
  while (<>) {
    chomp;
    next if /^[^ ]+[ ]*$/;    # skip single-token lines (protocol noise)
    if (/^(SELECT|UPDATE|DELETE|INSERT|SET|COMMIT|ROLLBACK|CREATE|DROP|ALTER|CALL)/i) {
      print "$q\n" if defined $q;    # a new statement begins: flush the previous one
      $q = $_;
    } else {
      s/^[ \t]+//;    # continuation line: strip leading whitespace and append
      $q .= " $_";
    }
  }'
Execute on the mysql machine with the high QPS:
sh /tmp/mdump.sh > /tmp/m.sql
After about 30 seconds, stop it with Ctrl+C.
Then run the following command to get the top 10 SQL; splitting on WHERE groups statements that differ only in their conditions:
grep -i ' from ' /tmp/m.sql |grep -i ' where ' |awk -F'where|WHERE' '{print $1}'|sort|uniq -c |sort -rnk1|head -n 10
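Along the same lines, a rough sketch for ranking write statements too (an extension of the idea above, not part of the original script; splitting on WHERE/VALUES groups writes that differ only in their values):
# rank UPDATE/INSERT/DELETE statements by frequency
grep -iE '^(update|insert|delete)' /tmp/m.sql |
  awk -F'where|WHERE|values|VALUES' '{print $1}' |
  sort | uniq -c | sort -rnk1 | head -n 10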
If you find high-frequency SQL, you can take it to the developers: has a new business feature gone live, and can the statements be optimized?
By the same logic, problems with similar components can also be captured and located this way.
Also recommended here: a network packet-capture tool for MySQL, Redis, MongoDB and http:
https://github.com/40t/go-sniffer
Finally: common packet-capture scenarios for the http protocol
Grab HTTP GET requests
tcpdump -i enp0s8 -s 0 -A 'tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420'
Explanation:
tcp[((tcp[12:1] & 0xf0) >> 2):4]
selects the 4 bytes sitting right after the TCP header, i.e. the start of the http payload (a worked sketch follows the table below).
0x47455420
is the ASCII code for G E T plus a trailing space:
Character | ASCII value (hex)
---|---
G | 47
E | 45
T | 54
Space | 20
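A worked sketch of that offset expression, assuming a 20-byte TCP header with no options:
# tcp[12:1] is the byte holding the 4-bit data-offset field (header length in 32-bit words)
# with a 20-byte header this byte is 0x50
# (0x50 & 0xf0) >> 2  =  0x50 >> 2  =  0x14  =  20 bytes, the header length
# tcp[20:4] is then the first 4 payload bytes, compared against 0x47455420 ("GET ")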
Grab HTTP POST requests
tcpdump -i enp0s8 -s 0 -A 'tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x504F5354'
0x504F5354
is the ASCII code for P O S T.
HTTP GET request with destination port 80
tcpdump -i enp0s8 -s 0 -A 'tcp dst port 80 and tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420'
HTTP GET and POST requests from host 10.10.10.10 with destination port 80 or 443
tcpdump -i enp0s8 -s 0 -A '(tcp dst port 80 or tcp dst port 443) and (tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420 or tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x504F5354) and host 10.10.10.10'
(The parentheses matter: in pcap filters, and binds tighter than or.)
Grab HTTP GET and POST requests and responses
tcpdump -i enp0s8 -s 0 -A 'tcp port 80 and (tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420 or tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x504F5354 or tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x48545450 or tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x3C21444F) and host 10.10.10.10'
This filters for port 80 and host 10.10.10.10, matching http get/post requests and their responses. (Note tcp port 80 rather than tcp dst port 80: responses come back from port 80, so a dst-only filter would miss them.)
0x3C21444F
is the ASCII code for '<' '!' 'D' 'O', the start of <!DOCTYPE, used as an identifier for html responses.
0x48545450
is the ASCII code for 'H' 'T' 'T' 'P', used to grab HTTP responses.
Monitor all HTTP request URLs (GET/POST)
tcpdump -i enp0s8 -s 0 -v -n -l | egrep -i "POST /|GET /|Host:"
Grab passwords in POST requests
tcpdump -i enp0s8 -s 0 -A -n -l | egrep -i "POST /|pwd=|passwd=|password=|Host:"
Grab cookies in requests and responses
tcpdump -i enp0s8 -nn -A -s0 -l | egrep -i 'Set-Cookie|Host:|Cookie:'
Filter HTTP headers
# pick the User-Agent out of the headers
tcpdump -vvAls0 | grep 'User-Agent:'