Fault Quick Locating Manual

During system operation and maintenance, when a clear fault alarm is received, use the methods below to locate the fault quickly.

  1. Common fault types and descriptions
    1. Capacity failure:

Storage capacity covers server disk space, NAS disk space, database tablespaces, and database data disk space. Server space alarms are mainly caused by logs and temporary business files filling the disk while the cleanup policy is missing or inadequate. Database tablespace alarms occur when a business table holds a large amount of data and contains large fields, so the tablespace grows too quickly. Pay special attention to the growth of large fields in flow tables during batch runs and periods of heavy volume.

    2. CPU usage fault:

When CPU utilization on an application or database server exceeds 90%, an alarm is raised and the fault management process starts. On application servers, high CPU usage is most often caused by high service concurrency; in that case usage returns to normal once the business peak is over. Abnormal log output, unreasonable loops in the program, and memory overflow can also drive application server CPU usage up. On database servers, high CPU usage is mainly caused by DML operations on large amounts of data during batch runs.

    3. Out of memory:

If a running application reads a large amount of data into memory and that memory is not reclaimed in time, a memory overflow occurs. The most common cause is a full scan of a large database table or a query that returns a large amount of data; file reads and writes can also cause memory overflow.

    4. Database lock table:

In concurrent database scenarios, deadlocks are caused by poor transaction control or unreasonable database operations. Deleting a large amount of data from a table is especially likely to cause serious deadlocks. Multiple threads operating on the same resource during concurrent batch runs can also cause a deadlock.

    5. Interface timeout failure:

When application CPU usage is too high or memory overflows, new requests cannot obtain server resources, so threads wait and the interface times out. Slow SQL also causes interface timeouts, as do response timeouts from external interfaces.

  2. Rapid fault location (specific location methods for various faults)
    1. Capacity failure:

For a specific capacity alarm, log in to the affected server to check its capacity.
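
As a minimal sketch of that check, the helper below filters `df -P` output for mount points at or above a usage threshold. The function name `over_threshold` and the 80% default are assumptions for illustration, not part of the original procedure, and the commented `du` line assumes GNU coreutils.

```shell
# Print mount points whose usage is at or above a threshold (percent).
# Reads `df -P` style output on stdin so it can be tried on sample data.
# Assumption: mount points contain no spaces.
over_threshold() {
  threshold="${1:-80}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= t) print $6, use "%"
  }'
}

# Typical use on a server:
#   df -P | over_threshold 80
# Then look for the largest directories under a suspect path (GNU du/sort):
#   du -xh --max-depth=1 /var/log 2>/dev/null | sort -rh | head -n 10
```

Once a full mount point is known, the `du` line narrows it down to the directories (often logs or temporary files) that need a cleanup policy.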

    2. CPU usage fault:

Possible causes include business logic problems (such as an infinite loop), frequent GC, and excessive thread context switching.

Troubleshooting steps

1. Use top to locate the PID of the process that uses the most CPU

top

2. List the threads of that process with ps and find the thread ID (tid) that uses the most CPU

ps -mp pid -o THREAD,tid,time | sort -rn

3. Convert that thread ID to hexadecimal

printf "%x\n" tid

4. Print the stack of that thread by searching the jstack output for the hexadecimal thread ID (it appears as nid=0x...), then read the stack to locate the problem

jstack pid | grep -A 30 hex_tid

5. Run "jmap -dump:format=b,file=filename pid" to write the heap of the process to a file. You can then use Eclipse MAT (Memory Analyzer Tool) to see which objects occupy the most memory.
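
Steps 3 and 4 can be wrapped in two small helpers, sketched below. The function names are hypothetical, and the sketch assumes jstack prints the native thread ID in hexadecimal as `nid=0x...` (as JDK 8 does); check the output format of your JDK before relying on it.

```shell
# Convert a decimal thread ID (from top -H or ps) to the hex form jstack prints.
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# Print the stack of one thread. Needs a JDK on PATH and permission to attach.
# -w stops 0x1a2b from also matching 0x1a2b3.
thread_stack() {
  pid="$1"
  tid="$2"
  jstack "$pid" | grep -w -A 30 "nid=$(tid_to_nid "$tid")"
}

# Typical use:
#   top -H -p 12345          # find the busiest thread ID of process 12345
#   thread_stack 12345 12399
```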

    3. Out of memory failure:
      1. Insufficient heap memory (java.lang.OutOfMemoryError: Java heap space)

Cause

1. The code may allocate very large objects

2. There may be a memory leak, so that even after multiple GCs there is not enough free memory to hold the current object

Solution

1. Check for large object allocations, most likely large arrays

2. Dump the heap with the jmap command, analyze it with MAT, and check for memory leaks

3. If no obvious memory leak is found, increase the heap size with -Xmx

4. One point that is easy to overlook: check whether there are many custom finalizable objects (classes that override finalize()); they may also come from a framework. Consider whether they are really necessary
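
Because a heap dump is most useful when taken at the moment of failure, one option is to start the JVM with the standard HotSpot flags `-XX:+HeapDumpOnOutOfMemoryError` and `-XX:HeapDumpPath`. The sketch below only assembles the option string; the function name, the `4g` size, the dump directory, and `app.jar` are placeholders, not values from the original text.

```shell
# Build JVM options that cap the heap and write a dump when the heap is exhausted.
# $1 = maximum heap size (for example 4g), $2 = directory or file for the dump.
heap_dump_opts() {
  max_heap="$1"
  dump_path="$2"
  printf '%s\n' "-Xmx${max_heap} -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${dump_path}"
}

# Typical use (app.jar is a placeholder):
#   java $(heap_dump_opts 4g /data/dumps) -jar app.jar
```

The resulting .hprof file can be opened in MAT in the same way as a dump produced by jmap.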

      2. Permanent generation/metaspace overflow

Error message:

java.lang.OutOfMemoryError: PermGen space

java.lang.OutOfMemoryError: Metaspace

Cause

The permanent generation is the HotSpot virtual machine's implementation of the method area. It stores class information loaded by the virtual machine, constants, static variables, and JIT-compiled code.

Since JDK 8, the metaspace has replaced the permanent generation. The metaspace uses native memory, and some details have also changed:

    The string constant pool lives in the heap instead of the permanent generation (this move already happened in JDK 7)

    JVM parameters for the permanent generation (such as -XX:MaxPermSize) no longer take effect

Possible causes of permanent generation or metaspace overflow:

1. On Java versions before 7, frequent misuse of the String.intern method

2. A large number of proxy classes are generated and cannot be unloaded, so they fill the method area

3. The application runs for a long time without a restart

Solution

The causes of permanent generation/metaspace overflow are relatively simple, and the solutions are as follows:

1. Check whether the permanent generation or metaspace size is set too small

2. Check whether the code performs a large number of reflection operations

3. After taking a dump, use MAT to check whether a large number of proxy classes have been generated through reflection

4. As a last resort, restart the JVM
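
For step 1, the option that sets the limit depends on the Java version: `-XX:MaxPermSize` up to Java 7 and `-XX:MaxMetaspaceSize` from Java 8 on. The helper below is a small sketch of that choice; the function name and the example sizes are assumptions for illustration.

```shell
# Print the size-limit option that matches the Java major version.
# Java 7 and earlier use the permanent generation; Java 8 and later use metaspace.
class_space_opt() {
  major="$1"
  size="$2"
  if [ "$major" -ge 8 ]; then
    printf '%s\n' "-XX:MaxMetaspaceSize=${size}"
  else
    printf '%s\n' "-XX:MaxPermSize=${size}"
  fi
}

# Typical use:
#   java $(class_space_opt 8 512m) -jar app.jar
# To watch usage on a running Java 8+ process (MC/MU columns, every 5 s):
#   jstat -gc 12345 5000
```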

      3. Method stack overflow

Error message:

java.lang.OutOfMemoryError: unable to create new native thread

Cause

This exception is almost always caused by creating too many threads. I encountered it once, and jstack showed more than 8,000 threads in total.

Solution

1. Reduce the size of each thread stack with -Xss

2. The total number of threads is also limited by the free memory of the system and by operating system settings. Check whether any of these limits applies:

    /proc/sys/kernel/pid_max

    /proc/sys/kernel/threads-max

    max user processes (ulimit -u)

    /proc/sys/vm/max_map_count
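
A quick way to see both sides of the problem is to count threads by state in a jstack dump and print the limits listed above. The two helpers below are a sketch with hypothetical names; the first assumes the usual `java.lang.Thread.State: <STATE>` lines in jstack output, and the second assumes a Linux host.

```shell
# Count threads per state from jstack output read on stdin, most frequent first.
thread_state_summary() {
  grep 'java.lang.Thread.State:' | awk '{ print $2 }' | sort | uniq -c | sort -rn
}

# Print the kernel and user limits, skipping any file that is not readable.
show_thread_limits() {
  for f in /proc/sys/kernel/pid_max /proc/sys/kernel/threads-max /proc/sys/vm/max_map_count; do
    [ -r "$f" ] && printf '%s = %s\n' "$f" "$(cat "$f")"
  done
  printf 'max user processes (ulimit -u) = %s\n' "$(ulimit -u)"
}

# Typical use:
#   jstack 12345 | thread_state_summary
#   show_thread_limits
```

Thousands of threads in the same WAITING or TIMED_WAITING state usually point at an unbounded thread pool or threads that are created per request and never released.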

    4. Database lock table:

-- Lock table query SQL
SELECT object_name, machine, s.sid, s.serial#
FROM gv$locked_object l, dba_objects o, gv$session s
WHERE l.object_id = o.object_id
AND l.session_id = s.sid
AND l.inst_id = s.inst_id;

Use this query to find the session that holds the table lock, trace it to the business code, and review the business logic. Ask the on-site DBA to kill the locking session so that business can continue.
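
For the kill step, the DBA typically runs Oracle's `ALTER SYSTEM KILL SESSION` with the sid and serial# returned by the lock query. The helper below only prints the statement for review and executes nothing; the function name is hypothetical and the example values are made up.

```shell
# Build the Oracle statement that ends one session.
# $1 = sid and $2 = serial#, taken from the s.sid and s.serial# columns of the lock query.
kill_session_sql() {
  sid="$1"
  serial="$2"
  printf "ALTER SYSTEM KILL SESSION '%s,%s' IMMEDIATE;\n" "$sid" "$serial"
}

# Typical use (review the output before handing it to the DBA):
#   kill_session_sql 123 4567
```

On a RAC database the session may live on another instance, in which case the statement also needs the instance ID in the form 'sid,serial#,@inst_id'.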

    5. Interface timeout failure:

Use log analysis to find where the interface timed out. Common causes of timeouts are:

  • SQL execution takes too long
  • Calls to an external interface take too long
  • The business data volume being processed is too large
  • Business logic runs too many loop iterations
  • Server pressure leads to slow program execution
  • Database pressure causes slow SQL execution

Solution

  • Optimize slow SQL: analyze the execution plan and improve execution efficiency by optimizing or binding indexes
  • If many timeouts occur on a single server node, manually take that node offline and switch to another node; bring it back online once the server has returned to normal
  • Reduce the amount of business handled by a single interface call
  • Optimize looping business code to reduce the work executed in nested loops

Example: Slow SQL troubleshooting and analysis process:

  1. Find the time-consuming SQL in the log (the execution time of the SQL will be printed in the business log)
  2. Search for the SQL in EM (Performance-SQL-Search SQL) using a fuzzy match on the SQL text; select both the cursor cache and AWR snapshots under the data source option
  3. Pick the matching statement from the search results by SQL text or execution time
  4. Open the SQL details and analyze the execution plan and statistics from that time to determine why the SQL took so long
  5. Decide on an optimization based on the execution plan, and improve SQL execution efficiency by optimizing or binding indexes
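
Step 1 of this process can be sketched as a log filter. The log format here is an assumption: the helper expects each SQL log line to carry its execution time as `cost=<n>ms`, which is a hypothetical convention, so adjust the pattern to whatever your business log actually prints.

```shell
# Print log lines whose SQL execution time is at or above a threshold (ms).
# Assumed (hypothetical) log format: each line contains cost=<n>ms.
slow_sql_lines() {
  threshold_ms="${1:-1000}"
  awk -v t="$threshold_ms" '
    match($0, /cost=[0-9]+ms/) {
      # Skip the 5 characters of "cost=" and drop the trailing "ms".
      ms = substr($0, RSTART + 5, RLENGTH - 7)
      if (ms + 0 >= t) print ms "ms\t" $0
    }'
}

# Typical use (app.log is a placeholder), slowest statements first:
#   slow_sql_lines 1000 < app.log | sort -rn | head -n 20
```
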
    6. Common troubleshooting methods:
  1. Log in to the server to check it (database, server)
  2. View database status in EM
  3. Log troubleshooting
  4. Production business
  5. Database lock table query
  6. Production dump package location

For faults with unclear causes, work through the following checklist item by item:

  1. View server running status: memory, CPU, storage
  2. Check application processes: Eureka, server processes
  3. View database EM status: waiting processes, slow SQL
  4. Lock table and session query: focus on troubleshooting TX locks (database health inspection manual)
  5. Abnormal log troubleshooting: ERROR and EXCEPTION logs at the end of the day
  6. Production dump package export: (dump fault analysis manual)
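
For the log item on this checklist, a frequency count of exception class names is often a faster starting point than reading the log top to bottom. The helper below is a sketch under the assumption that exceptions appear in the log with their Java class name (ending in Exception or Error); the function name and `app.log` are placeholders.

```shell
# List the most frequent exception/error class names found on stdin.
# $1 = how many distinct names to show (default 5).
top_exceptions() {
  grep -oE '[A-Za-z0-9_.]+(Exception|Error)' | sort | uniq -c | sort -rn | head -n "${1:-5}"
}

# Typical use:
#   top_exceptions 10 < app.log
```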

Origin blog.csdn.net/qq_34068440/article/details/126505474