Data Warehouse Practice丨Active Prevention-DWS Key Tool Installation Confirmation

Abstract: Confirm in advance that key tools such as gdb are installed. When a database instance triggers a core dump and the cluster status becomes abnormal repeatedly, these tools make it possible to analyze the root cause and work around the problem promptly.

This article is shared from the Huawei Cloud Community post "Active Prevention - DWS Key Tool Installation Confirmation", author: Shangguan Hanyu.

【Key tool confirmation】

1. Confirm whether gdb is installed (when a database instance triggers a core dump and the cluster status becomes abnormal repeatedly, gdb is used to analyze the root cause and work around the problem promptly)

Log in to any cluster node and run the following command (in HC/HCS/HCSO environments, run it outside the sandbox):

gdb --help

If gdb prints its help text, it is installed.

2. Confirm whether gstack is installed (gstack is a companion tool to gdb that is installed together with gdb by default and serves the same purpose)

Log in to any cluster node and run the following command (in HC/HCS/HCSO environments, run it outside the sandbox):

gstack

If gstack prints its usage message rather than a "command not found" error, it is installed.

For gdb and gstack installation, please refer to the following link:

https://bbs.huaweicloud.com/forumreview/thread-182292-1-1.html
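The two checks above can also be run as a single pass. The following is a minimal sketch, assuming a POSIX shell on the cluster node; the function name `check_tools` is hypothetical and not part of DWS. It only tests whether each command can be found on PATH, so it does not run gdb or gstack.

```shell
#!/bin/sh
# Report whether each named tool can be found on PATH.
# Prints one "<tool>: installed" or "<tool>: missing" line per argument
# and returns non-zero if any tool is missing.
# check_tools is a hypothetical helper name used only for this sketch.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: installed"
        else
            echo "$tool: missing"
            missing=1
        fi
    done
    return "$missing"
}

# Check the two debugging tools named in this section.
check_tools gdb gstack || echo "Install the missing tools before an incident occurs."
```

As with the manual commands, in HC/HCS/HCSO environments this would be run outside the sandbox.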

3. Confirm whether core dumps are configured (with this configuration, the stack information is captured when a database instance triggers a core dump, so that gdb can be used to extract the SQL statement that caused the failure from the captured information, work around the problem promptly, and locate the root cause)

When the cluster status is Normal, run the following command to confirm (this operation does not affect services while the cluster is normal):

kill -11 <process ID of a standby DN>

Then check whether a core file is generated in the corresponding data directory. If a core file is generated, core dumps are configured.

If core dumps are not configured, configure them by following the links below:

HC/HCS/HCSO core configuration: https://bbs.huaweicloud.com/forum/forum.php?mod=viewthread&tid=181948

Pure-software environment core configuration: https://bbs.huaweicloud.com/forum/forum.php?mod=viewthread&tid=182036
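Before sending a signal to a standby DN, it can help to look at the standard Linux settings that govern core files. The sketch below is an assumption-laden illustration, not part of the original procedure: `core_soft_limit` is a hypothetical helper, and `DN_PID` is a placeholder for the standby DN process ID (it defaults to the current shell only so that the sketch runs anywhere). A non-zero limit does not guarantee that a core file will be written, so this does not replace the kill -11 test.

```shell
#!/bin/sh
# Print the soft core-file size limit recorded in a /proc/<pid>/limits file.
# A value of 0 means the process cannot write a core file.
# core_soft_limit is a hypothetical helper name used only for this sketch.
core_soft_limit() {
    awk '/^Max core file size/ { print $5 }' "$1"
}

# DN_PID is a placeholder: set it to the standby DN process ID.
# It falls back to the current shell's PID so the sketch can run as-is.
pid=${DN_PID:-$$}

if [ -r "/proc/$pid/limits" ]; then
    echo "core soft limit: $(core_soft_limit "/proc/$pid/limits")"
fi

# The kernel-wide pattern decides where core files go and how they are named.
if [ -r /proc/sys/kernel/core_pattern ]; then
    echo "core pattern: $(cat /proc/sys/kernel/core_pattern)"
fi
```

If the limit is 0, or the pattern points somewhere unexpected, the configuration links above are the place to start.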

4. Confirm whether pg_xlogdump exists (when an abnormal workload generates a large volume of xlogs, causing slow services, rapidly growing disk usage, and similar symptoms, use this tool to analyze the workload)

If running pg_xlogdump produces the tool's own output rather than a "command not found" error, it is installed (in a pure-software environment, load the environment variables first; in HC/HCS/HCSO, log in to the sandbox and run it there)

5. Confirm whether pagehack exists (when data files are silently corrupted, use this tool to analyze the damaged files)

If running pagehack produces the tool's own output rather than a "command not found" error, it is installed (in a pure-software environment, load the environment variables first; in HC/HCS/HCSO, log in to the sandbox and run it there)

The pg_xlogdump and pagehack tools can be obtained from the following link:

https://bbs.huaweicloud.com/forum/forum.php?mod=viewthread&tid=142380

The upload steps are as follows:

Step 1: Log in to the first CN node as the omm user (the Ruby user on the cloud) and upload the pagehack and pg_xlogdump tools to $GAUSSHOME/bin/ on that node.

Step 2: Distribute the tools to the other nodes:

gs_ssh -c "scp $hostname:$GAUSSHOME/bin/pagehack $GAUSSHOME/bin/"

gs_ssh -c "scp $hostname:$GAUSSHOME/bin/pg_xlogdump $GAUSSHOME/bin/"

$hostname is the hostname of the first CN node.
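To review the distribution commands before running them, they can be generated as text first. This is a minimal dry-run sketch: `distribute_cmd` and the `FIRST_CN` variable are hypothetical names, and `cn-node-1` is a made-up hostname that must be replaced with the real first CN node. The script only prints the gs_ssh commands shown above; it does not execute them.

```shell
#!/bin/sh
# Print the gs_ssh command that copies one tool from $GAUSSHOME/bin on the
# first CN node to the same directory on the other nodes.
#   $1 = hostname of the first CN node
#   $2 = tool name
# distribute_cmd is a hypothetical helper name used only for this sketch.
distribute_cmd() {
    # Single quotes keep $GAUSSHOME literal in the printed command.
    printf 'gs_ssh -c "scp %s:$GAUSSHOME/bin/%s $GAUSSHOME/bin/"\n' "$1" "$2"
}

# cn-node-1 is a placeholder hostname; set FIRST_CN to the real one.
first_cn=${FIRST_CN:-cn-node-1}

for tool in pagehack pg_xlogdump; do
    distribute_cmd "$first_cn" "$tool"
done
```

Once the printed lines look right, they can be pasted into the shell on the first CN node.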

6. Upload the gs_detect toolkit (this toolkit is not developed by the operation and maintenance team; it includes tools for diagnosing abnormal cluster status, diagnosing high I/O, scanning for damaged data files, and others, so that problems can be located and recovered from promptly)

Step 1: Log in to the first CN node as the omm user (the Ruby user on the cloud), obtain the gs_detect tool from the attachment, rename it to gs_detect.tar.gz, and upload it to /home/omm on the first CN node (in HC/HCS/HCSO deployments, place it under /home/Ruby on the first CN node).

Step 2: Extract the archive with the following commands:

cd /home/omm

tar -zxvf gs_detect.tar.gz

Step 3: Distribute the gs_detect tool to other nodes

gs_ssh -c "scp -r $hostname:/home/omm/gs_detect /home/omm"

$hostname is the hostname of the first CN node.

  Note: The distribution command on the cloud needs to be executed in the sandbox
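Because the distribution step copies /home/omm/gs_detect, the uploaded archive needs to contain a top-level gs_detect directory. The sketch below checks that before extraction, assuming the archive is a gzip-compressed tar file as the tar -zxvf command implies. `archive_has_top_dir` and the `ARCHIVE` variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Return success if the gzip-compressed tar archive $1 contains a top-level
# entry named $2 (for example the gs_detect directory).
# archive_has_top_dir is a hypothetical helper name used only for this sketch.
archive_has_top_dir() {
    tar -tzf "$1" | awk -F/ -v want="$2" '
        # Entries may be listed as "name/..." or "./name/...".
        { top = ($1 == "." ? $2 : $1) }
        top == want { found = 1 }
        END { exit !found }
    '
}

# Default path follows the steps above; use /home/Ruby on HC/HCS/HCSO.
archive=${ARCHIVE:-/home/omm/gs_detect.tar.gz}

if [ -f "$archive" ]; then
    if archive_has_top_dir "$archive" gs_detect; then
        echo "archive contains gs_detect/; proceed with extraction"
    else
        echo "archive does not contain gs_detect/; check the upload"
    fi
fi
```

Listing the archive with tar -tzf does not write anything to disk, so the check is safe to repeat.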

【System Hardening】

1. Confirm the ARM hardening items (not applicable to x86 machines)

https://support.huawei.com/enterprise/zh/bulletins-product/ENEWS2000007743

2. The CentOS 7.6 ipmi module causes servers to restart repeatedly. For the fix, see the attachment "CentOS7.6 ipmi module patch integration guide.docx".

 


Origin my.oschina.net/u/4526289/blog/8694753