Locate the problem in 60 seconds: the daily debugging routine of a 10x programmer

Author: Tao Jianhui

I wrote this as an internal blog post in May 2020, hoping that our R&D and technical support colleagues could help users locate bugs and solve problems quickly. I revised it in December 2020 and ran internal training based on it. Since then I have been following the questions in the WeChat group and found that many users find the logs of TDengine, or of any other distributed system, hard to analyze. So I am now making this post public. Using the analysis of TDengine logs as the example, it introduces a set of methods; once you master them, analyzing the logs of a distributed system becomes very simple.

TDengine is a cluster system, and any operation involves logical nodes such as APP, taosc, mnode and vnode, which communicate with each other over sockets. A test may also run multiple TDengine instances, which complicates the analysis. For a given operation, the key to analyzing a problem is how to match up the logs of the different logical nodes and then filter and process them efficiently.

This article summarizes a set of methods that allow you to quickly find where the bug is.

Turn on the relevant log switch

Each independent module of TDengine has its own debugFlag, including taosc, dnode, vnode, mnode, tsdb, wal, sync, query, rpc, timer, etc. Currently the log output of each module can be set to the following levels:

  • Fatal/Error: errors; ERROR is shown in the log
  • Warning: warnings; WARN is shown in the log
  • Info: important information
  • Debug: general information
  • Trace: very detailed, frequently repeated debug information
  • Dump: raw data

Log output can be directed to:

  • File
  • Screen

All of the above is controlled by a single byte, debugFlag. The bit map of this byte is as follows:

Therefore, to write error, warning, info, and debug logs to the log file, set the debug flag to 135; to also write trace-level logs, set it to 143; to write only error and warning logs, set it to 131. Under normal circumstances, 135 is recommended.
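The three values quoted here can be reproduced by OR-ing individual bits together. The following Python sketch assumes bit values inferred only from the numbers 131, 135 and 143 in the text (error = 1, warning = 2, info/debug = 4, trace = 8, file output = 128). The dump and screen bits are left out because the text does not pin them down, and all constant and function names are illustrative.

```python
# Bit values inferred from the three example settings in the text
# (131, 135, 143). They are assumptions, not copied from the TDengine source.
ERROR = 1      # fatal/error
WARN = 2       # warning
DEBUG = 4      # info and debug, per the 135 example
TRACE = 8      # trace
TO_FILE = 128  # write to the log file


def compose(*bits):
    """OR the given bits together into one debugFlag byte."""
    flag = 0
    for bit in bits:
        flag |= bit
    return flag


def describe(flag):
    """List the names of the modeled bits that are set in flag."""
    names = [("error", ERROR), ("warn", WARN), ("debug", DEBUG),
             ("trace", TRACE), ("file", TO_FILE)]
    return [name for name, bit in names if flag & bit]


print(compose(TO_FILE, ERROR, WARN))                # 131
print(compose(TO_FILE, ERROR, WARN, DEBUG))         # 135
print(compose(TO_FILE, ERROR, WARN, DEBUG, TRACE))  # 143
print(describe(135))
```

Reading a flag back with describe() is a quick way to check what a value found in an existing taos.cfg would enable under these assumed bit values.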

The settings of the debug flag of each module are all controlled by the configuration file taos.cfg. The specific parameters of each module and the module name displayed in the log are as follows:

If you think there are too many configuration parameters, the easiest way is to set the global parameter debugFlag. Once it is set, the debug flags of all modules except tmr are uniformly set to its value. The default value of debugFlag is 0; any non-zero value overrides the per-module log configuration parameters.
Unless there is a special need, setting the debug flags of util and timer to 135 is not recommended; 131 is appropriate.
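The override rule can be written down as a small function. This is a sketch of the rule as described in the paragraph above, not of TDengine's actual configuration code; the function name and the module keys ("tsc", "rpc", "tmr") are chosen for illustration.

```python
def effective_flags(module_flags, debug_flag=0):
    """Return the flag each module ends up with.

    module_flags: dict of module name -> its own configured flag.
    debug_flag:   the global debugFlag (0 means "not set").
    """
    if debug_flag == 0:
        # Global flag not set: every module keeps its own value.
        return dict(module_flags)
    # A non-zero global flag overrides every module except the timer (tmr).
    return {name: (flag if name == "tmr" else debug_flag)
            for name, flag in module_flags.items()}


configured = {"tsc": 131, "rpc": 131, "tmr": 131}
print(effective_flags(configured))       # unchanged
print(effective_flags(configured, 135))  # tsc and rpc become 135, tmr stays 131
```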

Log files

TDengine generates client and server logs and stores them in the log directory. The default log directory is /var/log/taos, but it can be changed with the configuration parameter logDir in taos.cfg.

  • The client log file is named taoslogY.X (multiple clients can run on one machine, so multiple log files can be generated)
  • The server log file is named taosdlog.X

The size of a log file is limited. Once it reaches a certain number of lines (set by the configuration parameter numOfLogLines in taos.cfg), a new log file is started. TDengine keeps only two log files, with names ending in 0 and 1, and alternates between them.
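The alternation between the two files can be pictured with a one-line calculation. The sketch below assumes that writing starts in the file ending in 0 and that the switch happens exactly every numOfLogLines lines; the text states neither detail, and the function name is illustrative.

```python
def log_file_suffix(line_number, num_of_log_lines):
    """Suffix (0 or 1) of the file that receives the given 0-based line.

    Assumes writing starts in file 0 and switches exactly every
    num_of_log_lines lines.
    """
    return (line_number // num_of_log_lines) % 2


# With a limit of 3 lines per file, lines 0..8 land in files 0,0,0,1,1,1,0,0,0.
print([log_file_suffix(n, 3) for n in range(9)])
```

The practical consequence is that the older of the two files is overwritten on each switch, so logs for a failed operation should be collected before they rotate out.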

Log format

Each line of the log file is divided, from left to right, into four parts:

  1. Timestamp, accurate to the microsecond
  2. Thread ID. Because the system is multi-threaded, this field is very important: only logs written by the same thread are guaranteed to be in order and to follow the designed flow
  3. Module name, three letters
  4. The log message output by the module
Several steps to analyze logs

When a tester or a customer reports a bug, it was triggered by a specific operation, whether performed manually or by an automated program. That operation generally executes a SQL statement. The problem may be caused by the client code or by the server code. The following uses create table as an example to explain how to analyze the logs. To keep the explanation focused, timestamps have been removed from the screenshots.

Look at the client first

Start with the client log. A sample screenshot is as follows:

  1. First find the SQL statement in question by searching the client log for "SQL: " (the second line of the screenshot). Then search for "SQL result:" (line 11 of the screenshot): on success the log shows "SQL result:success", otherwise it shows "SQL result: xxxx", where xxxx is the error message. To find the failed SQL quickly, you need to know the approximate time range and what the SQL statement is; the search is then very fast.
  2. In the taosc log, a very important parameter is pSql, the address allocated for the internal SQL object. taosc puts this address at the front of every taosc log line, immediately after TSC. This value is critical: it is the key that links the client and server logs. In the screenshot above, it is marked with a green background.
  3. pSql is passed to the RPC module as ahandle, and RPC prints it (green background) in all messages. Because many modules call the RPC module, RPC also prints who called it. In the screenshot it is called by TSC, so RPC TSC is printed.
  4. RPC sends the create-table message to the server and logs it (line 8 of the screenshot), including the End Point of the dnode it is sent to. The screenshot shows that it is sent to hostname 9be7010a917e, port 6030. If there is a problem, the next step is to check the server log on the machine where this End Point is located.
  5. You can see that the RPC module received the response from the server. To avoid the cost of the conversion, the log shows only the hexadecimal IP address (line 9 of the screenshot, 0x20012ac) and the port number. The RPC module's log is critical because it ties the logical nodes together.
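Step 1 of the list above can be automated. The sketch below scans client log lines for the "SQL result:" marker and reports every result that does not start with "success". It relies only on the two markers quoted in the text; the function name and the sample lines (including the second pSql value) are made up for illustration.

```python
def failed_results(lines):
    """Yield (line_number, error_text) for each non-success "SQL result:" line."""
    marker = "SQL result:"
    for number, line in enumerate(lines, start=1):
        pos = line.find(marker)
        if pos == -1:
            continue
        result = line[pos + len(marker):].strip()
        # Anything that does not start with "success" is treated as an error.
        if not result.startswith("success"):
            yield number, result


# Made-up sample lines, for illustration only.
sample = [
    "TSC 0x5572c4fab3a0 SQL: create table t1 (ts timestamp, v int)",
    "TSC 0x5572c4fab3a0 SQL result:success",
    "TSC 0x5572c4fab3b0 SQL result: xxxx",
]
print(list(failed_results(sample)))
```

The pSql value on each reported line is then the string to search for in the rest of the client log and in the server log.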

Then look at the server

After analyzing the client log, move on to the server log. The following again uses create-table as the example; see the screenshot:

  1. From the client log we found that pSql is 0x5572c4fab3a0, so search for 0x5572c4fab3a0 directly in taosdlog; the matching log lines have a green background in the screenshot. pSql is therefore the key parameter for stringing the client and server logs together.
  2. The create-table operation is processed by mnode. In the screenshot, because this is the first table, a vnode has to be created first and then the table, a series of operations that involves many modules, so I will not explain them in detail.
  3. Finally, after mnode creates the table, it sends the response back through the RPC module (line 52 of the screenshot, the last line), so it is clear that the server is working properly.

Note: After the dnode module receives a message, it dispatches it to the message queue of mnode or vnode according to the message type. A worker thread then consumes the message from the queue and processes it. For vnode, the sub-modules tsdb, wal, sync, and cq all run in a single thread, so after finding pSql (the second line of the screenshot), follow the log downward by thread ID to trace the whole process.

A few key points

  1. Find the failed SQL statement first
  2. Find the value of pSql and copy it, assuming it is xxxxx
  3. grep "xxxxx" taoslogx.x, find the client log related to this SQL, see if you can find the problem
  4. Open the taosdlog server log, search for the value xxxxx of pSql, check the timestamp to see if it is the failed operation
  5. Then analyze the server log
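Steps 3 and 4 amount to running the same search over the client and server logs. A minimal Python sketch of that search follows; it is an equivalent of the grep above that also records which log each hit came from. The function name, the source labels and the sample lines are illustrative.

```python
def trace(psql, logs):
    """Return (source, line_number, line) for every line that mentions psql.

    logs: dict of source label -> iterable of log lines.
    """
    hits = []
    for source, lines in logs.items():
        for number, line in enumerate(lines, start=1):
            if psql in line:
                hits.append((source, number, line.rstrip("\n")))
    return hits


# Made-up sample lines, for illustration only.
logs = {
    "client": ["TSC 0x5572c4fab3a0 SQL: create table t1 (ts timestamp, v int)\n",
               "TSC 0x5572c4fab3b0 SQL: show dnodes\n"],
    "server": ["RPC 0x5572c4fab3a0 message received\n"],
}
for hit in trace("0x5572c4fab3a0", logs):
    print(hit)
```

With real files, open file objects can be passed in place of the lists, since the function only iterates over lines.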

The messages of the RPC module are critical. For each RPC message, parse code: xx is printed. This is the result of protocol parsing: 0 means no problem, and any other value means protocol parsing failed. The message itself also carries code: 0xXX, the error code set by the sender, usually in a message from the server to the client. It is 0 if the operation succeeded; otherwise it is the error code.

Another log matching method

When a client sends a message through the RPC module, the log shows something like

sig:0x01000000:0x01000000:55893

These are the RPC source ID, dest ID, and transaction ID. Together, the three parameters uniquely identify a link from a client. The transaction ID is incremented by one for each new message, so it is unique within a period of time (the transaction ID is two bytes long and wraps around).
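The sig string is easy to take apart programmatically. The sketch below parses the example shown above and illustrates the two-byte wraparound. It assumes that the two IDs are hexadecimal, that the transaction ID is decimal (as the example suggests), and that the counter wraps by plain modulo 65536; the function names are illustrative.

```python
def parse_sig(text):
    """Parse "sig:<source>:<dest>:<tran>" into three integers."""
    prefix, source, dest, tran = text.strip().split(":")
    if prefix != "sig":
        raise ValueError("not a sig string: %r" % text)
    # IDs are read as hexadecimal, the transaction ID as decimal (assumed).
    return int(source, 16), int(dest, 16), int(tran)


def next_tran_id(tran_id):
    """Increment a two-byte transaction ID, wrapping modulo 65536 (assumed)."""
    return (tran_id + 1) & 0xFFFF


print(parse_sig("sig:0x01000000:0x01000000:55893"))  # (16777216, 16777216, 55893)
print(next_tran_id(55893))  # 55894
print(next_tran_id(65535))  # 0
```

The wraparound is why a sig match should always be confirmed against the timestamp: the same triple can reappear once 65536 messages have been sent on the link.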

In version 1.6, client and server logs can only be matched through the sig string, which requires reading a lot of context and is tedious and inefficient.

Version 2.0 passes pSql to the server side, which makes matching client and server logs much faster.

How to get familiar with logs

  1. First, understand the design of TDengine and understand the flow of each main operation.
  2. Turn on all log switches (set debugFlag to 135), execute all SQL statements once, and check the corresponding client and server logs against each SQL.

View the SQL statement executed by the client

The client generates a lot of logs. Being able to see the SQL statements it executed makes a problem easier to analyze and reproduce. There are several ways to find out which SQL statements the system executed:

  1. If the client log is turned on, run grep "SQL: " taoslog* to see all executed SQL statements in the log.
  2. If you use taos to execute SQL statements manually, check the hidden file .taos_history in the home directory, which contains all the commands taos has executed.
  3. Configure the client: in the configuration file, set the tscEnableRecordSql parameter to 1. The SQL statements issued by the client are then written to a separate file (tscnote-xxxx.0, where xxxx is the pid) in the same directory as the client log.
  4. For the RESTful interface, set the httpEnableRecordSql parameter to 1 in the taosd configuration file. The HTTP requests are then written to a separate file (httpnote.0) in the same directory as the server log.

Modifying log settings dynamically

Sometimes the server or client cannot be restarted but the log settings are wrong. In that case the settings have to be changed dynamically. Perform the following steps:

show dnodes; // find the ID of each dnode
alter dnode id debugFlag 143; // set the corresponding debugFlag

The id in the second statement is the one obtained in the first.

Sometimes it is necessary to redirect subsequent logs to a new file to make them easier to view and search. Execute:

alter dnode id resetlog;

Sometimes the shell cannot connect at all. In that case, send the SIGUSR1 signal to the taosd process on the machine where it is running, for example:

kill -SIGUSR1 pidxxx

The original text was first published at: https://mp.weixin.qq.com/s/LUz1niYajOR35UpOlfszIQ
