[Reprint] The powerful strace command: usage explained in detail


Source: https://www.linuxidc.com/Linux/2018-01/150654.htm

I finally found some time to learn how to use strace.

 

What is strace?

As its official website describes it, strace is a Linux user-space tracer that is useful for diagnostics, debugging, and teaching. We use it to monitor the interactions between user-space processes and the kernel, such as system calls, signal delivery, and changes of process state.

strace builds its functionality on the kernel's ptrace facility.

In day-to-day operations work, troubleshooting and problem diagnosis are a major part of the job and an essential skill. As a dynamic tracing tool, strace helps operations engineers efficiently locate faults in processes and services. It works like a detective: by following the clues left in system calls, it tells you the truth behind the anomaly.

What can strace do?

Operations engineers are a hands-on bunch, so let's start with an example.

We copied a package called some_server from another machine; the developers said it could be started directly without any changes. But when we tried to start it, it failed with an error and would not come up at all!

Start command:

./some_server ../conf/some_server.conf

Output:

FATAL: InitLogFile failed iRet: -1!
Init error: -1655

Why won't it start? The log suggests that initializing the log file failed, but what is really going on? Let's look with strace:

strace -tt -f  ./some_server ../conf/some_server.conf

Output (not reproduced in this copy):

Note that in the output, just before the InitLogFile failed error, there is an open system call:

23:14:24.448034 open("/usr/local/apps/some_server/log//server_agent.log", O_RDWR|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = -1 ENOENT (No such file or directory)

It tries to open the file /usr/local/apps/some_server/log//server_agent.log for writing (creating it if it does not exist), but the call fails: the return value is -1 and errno is set to ENOENT. Let's check the manual page for the open system call:

man 2 open

and search for the explanation of errno ENOENT:

ENOENT O_CREAT  is not set and the named file does not exist.  Or, a directory component in pathname does not exist or is a dangling symbolic link.

That makes things clearer. Since our open call specifies the O_CREAT flag, ENOENT here means that some directory component of the log path does not exist or is a dangling symbolic link. Let's check which part of the path is missing:

ls -l /usr/local/apps/some_server/log
ls: cannot access /usr/local/apps/some_server/log: No such file or directory
ls -l /usr/local/apps/some_server
total 8
drwxr-xr-x 2 root users 4096 May 14 23:13 bin
drwxr-xr-x 2 root users 4096 May 14 22:48 conf

So the log subdirectory does not exist, although its parent directory does. After we create the log subdirectory manually, the service starts normally.

Looking back, what can strace do?

It opens up the black box of an application process: by tracing its system calls, it tells you what the process is doing.

How do you use strace?

Since strace traces the system calls and signals of user-space processes, before getting to the main topic let's first understand what system calls are.

About system calls:

According to Wikipedia, in computing a system call is the way a program running in user space requests a service from the operating system kernel that requires higher privileges to run.

System calls provide the interface between user programs and the operating system. The operating system divides the process address space into user space and kernel space:

  • The operating system kernel runs directly on the hardware and provides device management, memory management, task scheduling, and other functions.

  • The kernel exposes an API to user space, namely system calls; user space requests kernel services through this API.

On Linux, application code usually makes system calls indirectly, through wrapper functions in the glibc library.

The Linux kernel currently has more than 300 system calls; the full list can be viewed in the syscalls(2) man page. They fall into several categories:

File and device access: open / close / read / write / chmod, etc.
Process management: fork / clone / execve / exit / getpid, etc.
Signals: signal / sigaction / kill, etc.
Memory management: brk / mmap / mlock, etc.
Inter-process communication (IPC): shmget / semget, etc. (semaphores, shared memory, message queues)
Network communication: socket / connect / sendto / sendmsg, etc.
Other

Being familiar with Linux system programming and its system calls makes strace much easier to use. For fault location in operations work, though, knowing the tool and being able to look up system calls in the manual is almost enough.

For readers who want to go deeper, I recommend books such as "Linux System Programming" and "Advanced Programming in the UNIX Environment".

Back to strace. strace has two modes of operation.

In the first, strace starts the process to be traced. Usage is simple: just prefix the original command with strace. For example, to trace the execution of the command "ls -lh /var/log/messages":

strace ls -lh /var/log/messages

In the second mode, strace attaches to a process that is already running and traces it without interrupting its execution, to find out what it is doing. In this case, pass strace the -p pid option.

For example, given a running some_server service, first look up its pid:

pidof some_server                      
17553

With its pid, 17553, we can trace it with strace:

strace -p 17553

When you are done tracing, press Ctrl+C to stop strace; it detaches and the traced process keeps running.

strace has a number of options that adjust its behavior. Here we introduce a few of the most commonly used ones, then show their practical use through examples.

Common strace options:

Take this sample command:

strace -tt -T -v -f -e trace=file -o /data/log/strace.log -s 1024 -p 23489

-tt  prefix each output line with the time of day, including microseconds
-T   show the time spent in each system call
-v   print unabbreviated versions of certain structures, such as environment variables and file stat structures
-f   trace the target process and all child processes it creates
-e   qualify which events to trace or how to trace them, for example the names of the system calls to trace
-o   write strace output to the specified file
-s   set the maximum length printed for string arguments (the default is 32)
-p   attach to the process with the given pid; to trace several pids, repeat the -p option

Example: trace nginx to see which files it accesses at startup:

strace -tt -T -f -e trace=file -o /data/log/strace.log -s 1024 ./nginx

Part of the output (not reproduced in this copy):

The first column of the output shows the pid of the process, followed by the timestamp with microseconds; this is the effect of the -tt option.

The last field on each line shows the time spent in the call; this is the result of the -T option.

Only file-access calls appear in the output, because we specified the -e trace=file option.

Case studies: locating problems with strace

1. Locating the cause of an abnormal process exit

Problem: a resident script named run.sh on a machine dies after running for about a minute. We need to find out why.

Approach: while the process is still running, get its pid with the ps command; suppose the pid is 24298:

strace -o strace.log -tt -p 24298

Looking at strace.log, we find these as the last two lines:

22:47:42.803937 wait4(-1,  <unfinished ...>
22:47:43.228422 +++ killed by SIGKILL +++

This shows that the process was killed by another process with the KILL signal.

Further analysis showed that a service monitoring script on the machine also watches a process called run.sh: whenever it finds more than two run.sh processes, it kills them and restarts them. Our run.sh script was killed by mistake.

When a process is killed, strace prints killed by SIGX (where SIGX is the signal sent to the process). What does it print when a process exits on its own?

Here is a program named test_exit:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
       exit(1);
}

Let's see what strace shows when it exits:

strace -tt -e trace=process -f ./test_exit

Note: -e trace=process traces only the system calls related to process management.

Output:

23:07:24.672849 execve("./test_exit", ["./test_exit"], [/* 35 vars */]) = 0
23:07:24.674665 arch_prctl(ARCH_SET_FS, 0x7f1c0eca7740) = 0
23:07:24.675108 exit_group(1)           = ?
23:07:24.675259 +++ exited with 1 +++

As you can see, when a process exits on its own (by calling the exit function or returning from main), the last system call is exit_group, and strace prints exited with X (where X is the exit code).

You may wonder: the code clearly calls exit, so why does exit_group show up?

This is because exit is not a system call but a glibc library function. Calling exit eventually turns into the exit_group system call, which terminates all threads in the current process. There is also an _exit system call (note the leading underscore; see the _exit(2) man page, and strace prints it as exit), which terminates only the calling thread and is what a thread ultimately invokes when it exits.

2. Locating a shared memory error

A service reports an error on startup:

shmget 267264 30097568: Invalid argument
Can not get shm...exit!

The error log tells us that obtaining shared memory failed. Let's look with strace:

strace -tt -f -e trace=ipc ./a_mon_svr     ../conf/a_mon_svr.conf

Output:

22:46:36.351798 shmget(0x5feb, 12000, 0666) = 0
22:46:36.351939 shmat(0, 0, 0)          = ?
Process 21406 attached
22:46:36.355439 shmget(0x41400, 30097568, 0666) = -1 EINVAL (Invalid argument)
shmget 267264 30097568: Invalid argument
Can not get shm...exit!

Here the -e trace=ipc option makes strace trace only the system calls related to inter-process communication.

From the strace output, we see that the shmget system call failed with errno EINVAL. Again, we consult the shmget manual page and search for the EINVAL error code:

EINVAL A new segment was to be created and size < SHMMIN or size > SHMMAX, or no new segment was to be created, a segment with given key existed, but size is greater than the size of that segment

In other words, shmget fails with EINVAL for one of the following reasons:

  • The size of the segment to be created is smaller than SHMMIN (typically 1 byte)

  • The size of the segment to be created is larger than SHMMAX (the kernel parameter kernel.shmmax)

  • A segment with the given key already exists, and its size is smaller than the size passed to shmget

The strace output shows that we are requesting the shared memory with key 0x41400 and a size of 30097568 bytes, which clearly matches neither the first nor the second case. That leaves the third case. Let's use ipcs to check whether the size is the mismatch:

ipcs  -m | grep 41400
key        shmid      owner      perms      bytes      nattch     status    
0x00041400 1015822    root       666        30095516   1

Key 0x41400 already exists with a size of 30,095,516 bytes, which is smaller than the 30,097,568 bytes in our call, hence the error.

In our case, the inconsistent shared memory size came from a set of programs in which one was compiled as 32-bit and the other as 64-bit, and the code used the long int data type, whose size differs between the two builds.

Compiling both programs as 64-bit solved the problem.

A special note on strace's -e trace option.

To trace a specific system call, -e trace=xxx is enough. But sometimes we want to trace a whole class of system calls, such as all calls related to file names, or all calls related to memory allocation.

Typing every specific system call name by hand makes it easy to miss one, so strace provides several commonly used classes of system calls.

-e trace=file     calls related to file access (taking a file name as an argument)
-e trace=process  calls related to process management, such as fork / exec / exit_group
-e trace=network  calls related to network communication, such as socket / sendto / connect
-e trace=signal   calls related to signal delivery and handling, such as kill / sigaction
-e trace=desc     calls related to file descriptors, such as write / read / select / epoll
-e trace=ipc      calls related to inter-process communication, such as shmget

In the vast majority of cases, these classes are enough. When you really do need to trace a specific system call, you may need to watch out for differences between C library implementations.

For example, we know that processes are created with the fork system call, but in glibc, calling fork actually maps to the lower-level clone system call. With strace you have to specify -e trace=clone; -e trace=fork will not match anything.

3. Performance analysis

Suppose we need to count the lines of code (both assembly and C) in the Linux 4.5.4 kernel. Here are two shell scripts that do it:

poor_script.sh:

#!/bin/bash

total_line=0
# One wc and one awk process per file: this is what makes it slow.
while read -r filename; do
   line=$(wc -l "$filename" | awk '{print $1}')
   (( total_line += line ))
done < <( find linux-4.5.4 -type f \( -iname '*.c' -o -iname '*.h' -o -iname '*.S' \) )
echo "total line: $total_line"

good_script.sh:

#!/bin/bash

# A single wc reads the NUL-separated file list from stdin; the last line is the total.
find linux-4.5.4 -type f \( -iname '*.c' -o -iname '*.h' -o -iname '*.S' \) -print0 \
| wc -l --files0-from - | tail -n 1

Both scripts achieve the same goal. We ran each under strace -c, which collects statistics on the system calls made and the time spent in them, adding -f to include child processes. (The two summaries are not reproduced in this copy.)

The statistics show that good_script.sh takes only 2 seconds to get the result: 19613114 lines. Most of its call overhead goes to file operations (read / open / write / close, etc.), which is exactly the work that counting lines of code requires.

poor_script.sh, by contrast, takes 539 seconds to accomplish the same task, and most of its call overhead goes to process management and memory management (wait4 / mmap / getpid ...).

In fact, comparing the number of clone calls in the two summaries, good_script.sh needs to start only 3 processes, while poor_script.sh starts 126,335 processes to complete the task!

Creating and destroying processes is very expensive, so its poor performance is no surprise.

Summary

When a process or service misbehaves, we can trace its system calls with strace to see what it is doing and then find the cause of the problem. Familiarity with common system calls helps you understand and use strace better.

Of course, strace is not a cure-all. When the target process is stuck in user mode, strace produces no output.

In that case we need other tracing tools, such as gdb, perf, or SystemTap.

Remarks:

1. perf: supported natively by the kernel

2. ftrace: built into the kernel and programmable

3. SystemTap: powerful; on Red Hat systems it supports both user-mode and kernel-mode probes, so its probe logic can be applied over a wider range



Reposted from: www.cnblogs.com/jinanxiaolaohu/p/11418124.html