1.4. Technical Investigation

The first section of this chapter introduced some good investigation practices. Good investigation practices lay the groundwork for efficient problem investigation and resolution, but there is still a technical aspect to diagnosing problems. The second part of this chapter provides a technical overview of how to investigate common types of problems. It highlights the various types of problems and points to the more in-depth documentation that makes up the remainder of this book.

1.4.1. Symptom Versus Cause

Symptoms are the external indications that a problem occurred. The symptoms can be a hint to the underlying cause, but they can also be misleading. For example, a memory leak can manifest itself in many ways. If a process fails to allocate memory, the symptom could be an error message. If the program does not check for out-of-memory errors, the lack of memory could cause a trap (SIGSEGV). If there is not enough memory to log an error message, it could result in a trap because the kernel may be unable to grow the stack (that is, to call the error logging function). A memory leak could also be noticed as a growing memory footprint. In other words, a memory leak can have many symptoms, but regardless of the symptom, the cause is the same.

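To make the point concrete, here is a minimal, contrived C sketch of a leak whose visible symptom depends entirely on whether the allocation is checked; the program and its behavior are illustrative only, not taken from a real product.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        for (;;) {
            char *p = malloc(1024 * 1024);   /* leaked: never freed */

            if (p == NULL) {
                /* Checked allocation: the symptom is an error message. */
                fprintf(stderr, "could not allocate memory\n");
                return 1;
            }

            /* Code that skipped the NULL check would touch p anyway and
               eventually trap (SIGSEGV) once malloc() starts failing.   */
            memset(p, 0, 1024 * 1024);
        }
    }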

Problem investigations always start with a symptom. There are five categories of symptoms listed below, each of which has its own methods of investigation.

  1. Error
  2. Crash
  3. Hang (or very slow performance)
  4. Performance
  5. Unexpected behavior/output

1.4.1.1. Error

Errors (and/or warnings) are the most frequent symptoms. They come in many forms and occur for many reasons including configuration issues, operating system resource limitations, hardware, and unexpected situations. Software produces an error message when it can’t run as expected. Your job as a problem investigator is to find out why it can’t run as expected and solve the underlying problem.

Error messages can be printed to the terminal, returned to a Web browser or logged to an error log file. A program usually uses what is most convenient and useful to the end user. A command line program will print error messages to the terminal, and a background process (one that runs without a command line) usually uses a log file. Regardless of how and where an error is produced, Figure 1.3 shows some of the initial and most useful paths of investigation for errors.

Figure 1.3. Basic investigation for error symptoms.

 

Unfortunately, errors are often accompanied by error messages that are not clear and do not include associated actions. Application errors can occur in obscure code paths that are not exercised frequently and in code paths where the full impact (and reason) for the error condition is not known. For example, an error message may come from the failure to open a file, but the purpose of opening a file might have been to read the configuration for an application. An error message of “could not open file” may be reported at the point where the error occurred and may not include any context for the severity, purpose, or potential action to solve the problem. This is where the strace and ltrace tools can help out.

Many types of errors are related to the operating system, and there is no better tool than strace to diagnose these types of errors. Look for system calls (in the strace output) that have failed right before the error message is printed to the terminal or logged to a file. You might see the error message printed via the write() system call. This is the system call that printf, perror, and other print-like functions use to print to the terminal. Usually the failing system call is very close to where the error message is printed out. If you need more information than what strace provides, it might be worthwhile to use the ltrace tool (it is similar to the strace tool but includes function calls). For more information on strace, refer to Chapter 2.

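As a hypothetical illustration (the program and file name are made up), consider a program whose only symptom is a one-line message. Running it under strace typically shows the failing openat() call, with errno ENOENT, immediately before the write() that prints the message:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <errno.h>

    int main(void)
    {
        /* Hypothetical configuration file used only for illustration. */
        FILE *cfg = fopen("/etc/myapp.conf", "r");

        if (cfg == NULL) {
            /* In the strace output, the failed openat() (ENOENT) appears
               just before the write() that produces this message.        */
            fprintf(stderr, "could not open file: %s\n", strerror(errno));
            exit(1);
        }

        /* ... read the configuration ... */
        fclose(cfg);
        return 0;
    }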

If strace and ltrace utilities do not help identify the problem, try searching the Internet using the error message and possibly some key words. With so many Linux users on the Internet, there is a chance that someone has faced the problem before. If they have, they may have posted the error message and a solution. If you run into an error message that takes a considerable amount of time to resolve, it might be worthwhile (and polite) to post a note on USENET with the original error message, any other relevant information, and the resulting solution. That way, if someone hits the same problem in the future, they won’t have to spend as much time diagnosing the same problem as you did.

If you need to dig deeper (strace, ltrace, and the Internet can’t help), the investigation will become very specific to the application. If you have source code, you can pinpoint where the problem occurred by searching for the error message directly in the source code. Some applications use error codes and not raw error messages. In this case, simply look for the error message, identify the associated error code, and search for it in source code. If the same error code/message is used in multiple places, it may be worthwhile to add a printf() call to differentiate between them.

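A small sketch of that technique, using an invented error code ERR_NO_CONFIG that is reported from two call sites; a temporary fprintf() with __FILE__ and __LINE__ makes it obvious which site actually produced the message:

    #include <stdio.h>

    #define ERR_NO_CONFIG 1042   /* invented error code, used in two places */

    static void report_error(int code)
    {
        printf("ERROR %d\n", code);
    }

    static int load_main_config(void) { return -1; }   /* stand-in failure */
    static int load_user_config(void) { return -1; }

    int main(void)
    {
        if (load_main_config() != 0) {
            /* Temporary instrumentation to tell the call sites apart. */
            fprintf(stderr, "ERR_NO_CONFIG at %s:%d\n", __FILE__, __LINE__);
            report_error(ERR_NO_CONFIG);
        }
        if (load_user_config() != 0) {
            fprintf(stderr, "ERR_NO_CONFIG at %s:%d\n", __FILE__, __LINE__);
            report_error(ERR_NO_CONFIG);
        }
        return 0;
    }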

If the error message is unclear, strace and ltrace couldn’t help, the Internet didn’t have any useful information, and you don’t have the source code, you still might be able to make further progress with GDB. If you can capture the point in time in GDB when the application produces the error message, the functions on the stack may give you a hint about the cause of the problem. This won’t be easy to do. You might have to set a breakpoint on the write() system call and check whether the error message is being written out. For more information on how to use GDB, refer to Chapter 6, “The GNU Debugger (GDB).”

If all else fails, you’ll need to contact the support organization for the application and ask them to help with the investigation.

1.4.1.2. Crashes

Crashes occur because of severe conditions and fit into two main categories: traps and panics. A trap usually occurs when an application references memory incorrectly, when a bad instruction is executed, or when there is a bad “page-in” (the process of bringing a page from the swap area into memory). A panic occurs when the application itself shuts down abruptly because of a severe error condition. The main difference is that a trap is a crash that the hardware and OS initiate, and a panic is a crash that the application initiates. Panics are usually associated with an error message that is produced prior to the panic. Applications on Unix and Linux often panic by calling the abort() function (after the error message is logged or printed to the terminal).

Like errors, crashes (traps and panics) can occur for many reasons. Some of the more popular are included in Figure 1.4.

Figure 1.4. Common causes of crashes.

 


1.4.1.2.1. Traps

When the kernel experiences a major problem while running a process, it may send a signal (a Unix and Linux convention) to the process such as SIGSEGV, SIGBUS or SIGILL. Some of these signals are due to a hardware condition such as an attempt to write to a write-protected region of memory (the kernel gets the actual trap in this case). Other signals may be sent by the kernel because of non-hardware related issues. For example, a bad page-in can be caused by a failure to read from the file system.

The most important information to gather for a trap is:

  • The instruction that trapped. The instruction can tell you a lot about the type of trap. If the instruction is invalid, it will generate a SIGILL. If the instruction references memory and the trap is a SIGSEGV, the trap is likely due to referencing memory that is outside of a memory region (see Chapter 3 on the /proc file system for information on process memory maps).
  • The function name and offset of the instruction that trapped. This can be obtained through GDB or using the load address of the shared library and the instruction address itself. More information on this can be found in Chapter 9, “ELF: Executable Linking Format.”
  • The stack trace. The stack trace can help you understand why the trap occurred. The functions that are higher on the stack may have passed a bad pointer to the lower functions causing a trap. A stack trace can also be used to recognize known types of traps. For more information on stack trace backs refer to Chapter 5, “The Stack.”
  • The register dump. The register dump can help you understand the “context” under which the trap occurred. The values of the registers may be required to understand what led up to the trap.
  • A core file or memory dump. This can fill in the gaps for complex trap investigations. If some memory was corrupted, you might want to see how it was corrupted or look for pointers into that area of corruption. A core file or memory dump can be very useful, but it can also be very large. For example, a 64-bit application can easily use 20GB of memory or more. A full core file from such an application would be 20GB in size. That requires a lot of disk storage and may need to be transferred to you if the problem occurred on a remote and inaccessible system (for example, a customer system).

Some applications use a special function called a “signal handler” to generate information about a trap that occurred. Other applications simply trap and die immediately, in which case the best way to diagnose the problem is through a debugger such as GDB. Either way, the same information should be collected (in the latter case, you need to use GDB).

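The following is a rough sketch (not from any particular product) of such a signal handler: it uses sigaction() to catch the bad programming signals and prints the faulting address and a stack trace before exiting. A production handler would have to be far more careful, since calling printf-style functions after memory corruption is not strictly safe.

    #include <stdio.h>
    #include <stdlib.h>
    #include <signal.h>
    #include <execinfo.h>

    static void trap_handler(int sig, siginfo_t *info, void *ctx)
    {
        void *frames[32];
        int depth;

        (void)ctx;
        fprintf(stderr, "caught signal %d, faulting address %p\n",
                sig, info->si_addr);

        /* Stack trace leading up to the trap (glibc backtrace). */
        depth = backtrace(frames, 32);
        backtrace_symbols_fd(frames, depth, 2);

        _exit(1);
    }

    int main(void)
    {
        struct sigaction sa;

        sa.sa_sigaction = trap_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
        sigaction(SIGBUS, &sa, NULL);
        sigaction(SIGILL, &sa, NULL);

        /* Deliberate bad pointer reference to demonstrate the handler. */
        *(volatile int *)0 = 42;
        return 0;
    }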

A SIGSEGV is the most common of the three bad programming signals: SIGSEGV, SIGBUS and SIGILL. A bad programming signal is sent by the kernel and is usually caused by memory corruption (for example, an overrun), bad memory management (that is, a duplicate free), a bad pointer, or an uninitialized value. If you have the source code for the tool or application and some knowledge of C/C++, you can diagnose the problem on your own (with some work). If you don’t have the source code, you need to know assembly language to properly diagnose the problem. Without source code, it will be a real challenge to fix the problem once you’ve diagnosed it.

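Here is a contrived example of the overrun case (the structure is invented for the sketch): the copy tramples a pointer that sits next to the buffer, and the SIGSEGV only happens later, at the dereference, well away from the line that did the damage.

    #include <stdio.h>
    #include <string.h>

    struct account {
        char  name[8];
        int  *balance;        /* sits right after the small name buffer */
    };

    int main(void)
    {
        int value = 100;
        struct account a;

        a.balance = &value;

        /* Overrun: the string does not fit in name[8], so the copy
           silently overwrites the adjacent balance pointer.          */
        strcpy(a.name, "this string is far too long");

        /* The trap (SIGSEGV) happens here, far from the corruption.  */
        printf("balance = %d\n", *a.balance);
        return 0;
    }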

For memory corruption, you might be able to pinpoint the stack trace that is causing the corruption by using watch points through GDB. A watch point is a special feature in GDB that is supported by the underlying hardware. It allows you to stop the process any time a range of memory is changed. Once you know the address of the corruption, all you have to do is recreate the problem under the same conditions with a watch point on the address that gets corrupted. More on watch points in the GDB chapter.

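A minimal sketch of that workflow, with an invented data structure: once you know that data.magic is the field being corrupted, a GDB hardware watchpoint (the watch command) stops the process at the exact write, and bt then shows the corrupting stack trace. Build with -g so GDB can resolve the symbol.

    #include <stdio.h>

    /* Invented application data; 'magic' sits right after the array. */
    struct app_data {
        int counters[4];
        int magic;
    };

    static struct app_data data = { { 0, 0, 0, 0 }, 0x1234 };

    static void update_counters(void)
    {
        /* Off-by-one bug: i <= 4 writes counters[4] and corrupts magic. */
        for (int i = 0; i <= 4; i++)
            data.counters[i] = -1;
    }

    int main(void)
    {
        /* Typical session once the corrupted field is known:
             (gdb) watch data.magic
             (gdb) run
           GDB stops inside update_counters() at the corrupting write.  */
        update_counters();

        printf("magic = 0x%x\n", data.magic);
        return 0;
    }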

There are some things to check for that can help diagnose operating system or hardware related problems. If the memory corruption starts and/or ends on a page-sized boundary (4KB on IA-32), it could be the underlying physical memory or the memory management layer in the kernel itself. Hardware-based corruption (quite rare) often occurs at cache line boundaries. Keep both of them in mind when you look at the type of corruption that is causing the trap.

The most frequent cause of a SIGBUS is misaligned data. Misaligned accesses do not cause a SIGBUS on IA-32 platforms because the underlying hardware silently handles them. However, even on IA-32, a SIGBUS can still occur for a bad page fault (such as a bad page-in).

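A contrived example of a misaligned access is shown below; on strict-alignment architectures the load can raise a SIGBUS, while on IA-32 the hardware quietly performs the slower unaligned access (the cast is deliberately undefined behavior, used here only to illustrate the symptom):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        char buf[16] = { 0 };

        /* Deliberately misaligned pointer: one byte past an aligned base. */
        uint32_t *p = (uint32_t *)(buf + 1);

        /* On strict-alignment platforms this load can raise SIGBUS;
           IA-32 silently handles the misaligned access instead.          */
        printf("value = %u\n", *p);
        return 0;
    }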

Another type of hardware problem is when the instructions just don’t make sense. You’ve looked at the memory values, and you’ve looked at the registers, but there is no way that the instructions could have caused the values. For example, it may look like an increment instruction failed to execute or that a subtract instruction did not take place. These types of hardware problems are very rare but are also very difficult to diagnose from scratch. As a rule of thumb, if something looks impossible (according to the memory values, registers, or instructions), it might just be hardware related. For a more thorough diagnosis of a SIGSEGV or other traps, refer to Chapter 6.

1.4.1.2.2. Panics

A panic in an application occurs when the application itself shuts down abruptly. Linux even has a library function specially designed for this sort of thing: abort() (although there are many other ways for an application to “panic”). A panic is a similar symptom to a trap but is much more purposeful. Some products might panic to prevent further risk to the users’ data or simply because there is no way they can continue. Depending on the application, protecting the users’ data may be more important than trying to continue running. If an application’s main control block is corrupt, it might mean that the application has no choice but to panic and abruptly shut down. Panics are very product-specific and often require knowledge of the product (and source code) to understand. The line number of the source code is sometimes included with a panic. If you have the source code, you might be able to use the line of code to figure out what happened.

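A skeletal example of an application panic (the control-block layout and magic value are invented): the code logs whatever context it can and then calls abort(), which raises SIGABRT and normally leaves a core file behind for later analysis.

    #include <stdio.h>
    #include <stdlib.h>

    #define CTL_MAGIC 0xC0DEC0DEu   /* invented sanity value */

    struct control_block {
        unsigned int magic;
        /* ... application state ... */
    };

    static void panic(const char *msg, const char *file, int line)
    {
        /* Log as much context as possible, then shut down abruptly. */
        fprintf(stderr, "PANIC at %s:%d: %s\n", file, line, msg);
        abort();
    }

    int main(void)
    {
        struct control_block ctl = { CTL_MAGIC };

        ctl.magic = 0;   /* simulate corruption of the control block */

        if (ctl.magic != CTL_MAGIC)
            panic("control block corrupted", __FILE__, __LINE__);

        return 0;
    }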

Some panics include detailed messages for what happened and how to recover. This is similar to an error message except that the product (tool or application) aborted and shut down abruptly. The error message and other evidence of the panic usually have some good key words or sentences that can be searched for using the Internet. The panic message may even explain how to recover from the problem.

If the panic doesn’t have a clear error message and you don’t have the source code, you might have to ask the product vendor what happened and provide information as needed. Panics are somewhat rare, so hopefully you won’t encounter them often.

1.4.1.2.3. Kernel Crashes

A panic or trap in the kernel is similar to those in an application but obviously much more serious in that they often affect the entire system. Information for how to investigate system crashes and hangs is fairly complex and not covered here but is covered in detail in Chapter 7, “Linux System Crashes and Hangs.”

1.4.1.3. Hangs (or Very Slow Performance)

It is difficult to tell the difference between a hang and very slow performance. The symptoms are pretty much identical, as are the initial methods used to investigate them. When investigating a perceived hang, you need to find out whether the process is hung, looping, or performing very slowly. A true hang is when the process is not consuming any CPU and is stuck waiting on a system call. A process that is looping is consuming CPU and is usually, but not always, stuck in a tight code loop (that is, doing the same thing over and over). The quickest way to determine what type of hang you have is to collect a set of stack traces over a period of time and/or to use GDB and strace to see whether the process is making any progress at all. The basic investigation steps are included in Figure 1.5.

Figure 1.5. Basic investigation steps for a hang.

 

If the application seems to be hanging, use GDB to get a stack trace (use the bt command). The stack trace will tell you where in the application the hang may be occurring. You still won’t know whether the application is actually hung or whether it is looping. Use the cont command to let the process continue normally for a while and then stop it again with Control-C in GDB. Gather another stack trace. Do this a few times to ensure that you have a few stack traces over a period of time. If the stack traces are changing in any way, the process may be looping. However, there is still a chance that the process is making progress, albeit slowly. If the stack traces are identical, the process may still be looping, although it would have to be spending the majority of its time in a single state.

With the stack trace and the source code, you can get the line of code. From the line of code, you’ll know what the process is waiting on but maybe not why. If the process is stuck in a semop (a system call that deals with semaphores), it is probably waiting for another process to notify it. The source code should explain what the process is waiting for and potentially what would wake it up. See Chapter 4, “Compiling,” for information about turning a function name and function offset into a line of code.

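For example, the stripped-down program below blocks in semop() waiting on a semaphore that nothing ever posts, which is exactly how such a hang looks in a stack trace (the semaphore setup is invented for the sketch):

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    int main(void)
    {
        /* Private semaphore set with one semaphore, initial value 0. */
        int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        struct sembuf op = { 0, -1, 0 };   /* wait for the semaphore */

        if (semid < 0) {
            perror("semget");
            return 1;
        }

        printf("waiting on semaphore set %d ...\n", semid);

        /* Nothing ever increments the semaphore, so this call blocks
           forever; a GDB stack trace of the process ends in semop(). */
        if (semop(semid, &op, 1) < 0)
            perror("semop");

        return 0;
    }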

If the process is stuck in a read call, it may be waiting for NFS. Check for NFS errors in the system log and use the mount command to help check whether any mount points are having problems. NFS problems are usually not due to a bug on the local system but rather a network problem or a problem with the NFS server.

If you can’t attach a debugger to the hung process, the debugger hangs when you try, or you can’t kill the process, the process is probably in some strange state in the kernel. In this rare case, you’ll probably want to get a kernel stack for this process. A kernel stack is stack trace for a task (for example, a process) in the kernel. Every time a system call is invoked, the process or thread will run some code in the kernel, and this code creates a stack trace much like code run outside the kernel. A process that is stuck in a system call will have a stack trace in the kernel that may help to explain the problem in more detail. Refer to Chapter 8, “Kernel Debugging with KDB,” for more information on how to get and interpret kernel stacks.

The strace tool can also help you understand the cause of a hang. In particular, it will show you any interaction with the operating system. However, strace will not help if the process is spinning in user code and never calls a system call. For signal handling loops, the strace tool will show very obvious symptoms of a repeated signal being generated and caught. Refer to the hang investigation in the strace chapter for more information on how to use strace to diagnose a hang.

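A contrived example of such a signal-handling loop: the SIGSEGV handler returns without repairing the faulting access, so the same instruction traps again immediately. Under strace this shows up as an endless stream of SIGSEGV deliveries, and the process burns CPU without making progress.

    #include <stdio.h>
    #include <signal.h>

    static volatile long handled;

    static void handler(int sig)
    {
        (void)sig;
        /* The handler returns without fixing the bad access, so the
           faulting instruction simply executes and traps again.      */
        handled++;
    }

    int main(void)
    {
        signal(SIGSEGV, handler);

        /* Bad pointer reference; with the handler installed the process
           loops here forever, and strace shows repeated SIGSEGV lines.  */
        *(volatile int *)0 = 1;

        printf("never reached (handled %ld signals)\n", handled);
        return 0;
    }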

1.4.1.3.1. Multi-Process Applications

For multi-process applications, a hang can be very complex. One of the processes of the application could be causing the hang, and the rest might be hanging waiting for the hung process to finish. You’ll need to get a stack trace for all of the processes of the application to understand which are hung and which are causing the hang.

If one of the processes is hanging, there may be quite a few other processes that have the same (or similar) stack trace, all waiting for a resource or lock held by the original hung process. Look for a process that is stuck on something unique, one that has a unique stack trace. A unique stack trace will be different than all the rest. It will likely show that the process is stuck waiting for a reason of its own (such as waiting for information from over the network).

Another cause of an application hang is a dead lock/latch. In this case, the stack traces can help to figure out which locks/latches are being held by finding the source code and understanding what the source code is waiting for. Once you know which locks or latches the processes are waiting for, you can use the source code and the rest of the stack traces to understand where and how these locks or latches are acquired.

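The classic case is two threads (or processes) taking two locks in opposite order, sketched below with POSIX threads: each side holds one mutex and waits forever for the other, and the stack traces of both threads end in pthread_mutex_lock() on different locks. Build with -pthread.

    #include <stdio.h>
    #include <unistd.h>
    #include <pthread.h>

    static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock_b);   /* opposite order from main() */
        sleep(1);
        pthread_mutex_lock(&lock_a);   /* blocks forever */

        pthread_mutex_unlock(&lock_a);
        pthread_mutex_unlock(&lock_b);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;

        pthread_create(&tid, NULL, worker, NULL);

        pthread_mutex_lock(&lock_a);
        sleep(1);
        pthread_mutex_lock(&lock_b);   /* blocks forever: deadlock */

        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
        pthread_join(tid, NULL);

        printf("never reached\n");
        return 0;
    }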

Note: A latch usually refers to a very light weight locking mechanism. A lock is a more general term used to describe a method to ensure mutual exclusion over the access of a resource.

1.4.1.3.2. Very Busy Systems

Have you ever encountered a system that seems completely hung at first, but after a few seconds or minutes you get a bit of response from the command line? This usually occurs in a terminal window or on the console where your key strokes only take effect every few seconds or longer. This is the sign of a very busy system. It could be due to an overloaded CPU or in some cases a very busy disk drive. For busy disks, the prompt may be responsive until you type a command (which in turn uses the file system and the underlying busy disk).

The biggest challenge with a problem like this is that once it occurs, it can take minutes or longer to run any command and see the results. This makes it very difficult to diagnose the problem quickly. If you are managing a small number of systems, you might be able to leave a special telnet connection to the system for when the problem occurs again.

The first step is to log on to the system before the problem occurs. You’ll need a root account to renice (reprioritize) the shell to the highest priority, and you should change your current directory to a file system such as /proc that does not use any physical disks. Next, be sure to unset the LD_LIBRARY_PATH and PATH environment variables so that the shell does not search for libraries or executables. Also when the problem occurs, it may help to type your commands into a separate text editor (on another system) and paste the entire line into the remote (for example, telnet) session of the problematic system.

When you have a more responsive shell prompt, the normal set of commands (starting with top) will help you to diagnose the problem much faster than before.

1.4.1.4. Performance

Ah, performance ... one could write an entire book on performance investigations. The quest to improve performance comes with good reason. Businesses and individuals pay good money for their hardware and are always trying to make the most of it. A 15% improvement in performance can be worth 15% of your hardware investment.

Whatever the reason, the quest for better performance will continue to be important. Keep in mind, however, that it may be more cost effective to buy a new system than to get that last 10-20%. When you’re trying to get that last 10-20%, the human cost of improving performance can outweigh the cost of purchasing new hardware in a business environment.

1.4.1.5. Unexpected Behavior/Output

This is a special type of problem where the application is not aware of a problem (that is, the error code paths have not been triggered), and yet it is returning incorrect information or behaving incorrectly. A good example of unexpected output would be an application that returns “!$#%#@” for the current balance of a bank account without producing any error messages. The application may not execute any error paths at all, and yet the resulting output is complete nonsense. This type of problem can be difficult to diagnose given that the application will probably not log any diagnostic information (because it is not aware there is a problem!).

Note: An error path is a special piece of code that is specifically designed to react and handle an error.

The root cause for this type of problem can include hardware issues, memory corruptions, uninitialized memory, or a software bug causing a variable overflow. If you have the output from the unexpected behavior, try searching the Internet for some clues. Failing that, you’re probably in for a complex problem investigation.

Diagnosing this type of problem manually is a lot easier with source code and an understanding of how the code is supposed to work. If the problem is easily reproducible, you can use GDB to find out where the unexpected behavior occurs (by using break points, for example) and then backtracking through many iterations until you’ve found where the erroneous behavior starts. Another option if you have the source code is to use printf statements (or something similar) to backtrack through the run of the application in the hopes of finding out where the incorrect behavior started.

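A toy illustration of that backtracking approach (the account structure and functions are invented): the balance printed at the end is garbage because a field is never initialized, and a temporary fprintf() placed progressively earlier in the flow shows where the value first went bad.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct account {
        char   owner[32];
        double balance;
    };

    static struct account *load_account(const char *owner)
    {
        struct account *a = malloc(sizeof(*a));

        strncpy(a->owner, owner, sizeof(a->owner) - 1);
        a->owner[sizeof(a->owner) - 1] = '\0';
        /* Bug: a->balance is never initialized, so later output is garbage. */
        return a;
    }

    int main(void)
    {
        struct account *a = load_account("jdoe");

        /* Temporary tracing, moved progressively earlier in the run to
           find the point where the value first becomes nonsense.        */
        fprintf(stderr, "TRACE after load_account: balance=%f\n", a->balance);

        printf("current balance: %.2f\n", a->balance);
        free(a);
        return 0;
    }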

You can try your luck with strace or ltrace in the hopes that the application is misbehaving due to an error path (for example, a file not found). In that particular case, you might be able to address the reason for the error (that is, fix the permissions on a file) and avoid the error path altogether.

If all else fails, try to get a subject matter expert involved, someone who knows the application well and has access to source code. They will have a better understanding of how the application works internally and will have better luck understanding what is going wrong. For commercial software products, this usually means contacting the software vendor for support.
