IT operations and maintenance: monitoring Dell servers with a data analysis platform

Overview

In day-to-day enterprise operations and maintenance, we manage a large number of servers. Common monitoring software can already raise automatic alarms when equipment fails. Beyond that, we also want to know how many hardware failures occurred this year, how many of them were memory, hard disk, motherboard, or CPU failures, and which server failed most often. These statistics give us a basis for replacing equipment and reducing failures.

We chose Honghu to collect the server logs. Its fast search and customizable charts address these needs.

Monitoring targets

Collect logs from the server devices in the infrastructure

Monitor the login security of server devices

Monitor the configuration security of server devices

Count server equipment failures

Install Vector

As a data collector, Vector receives the devices' syslog messages and forwards them to the Honghu platform. Configure Vector as follows:

Install Vector

picture

Check the Vector version after installation to confirm that the installation succeeded.

When you run vector directly after installation, the shell looks for the command in the directories on PATH, such as /usr/bin. If the vector binary is not in one of them, the command is not found. In that case, create a symbolic link to the binary in /usr/bin.
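
The link step can be sketched in Python as follows. This is only an illustration, not the commands from the Honghu knowledge base: the helper name ensure_on_path is hypothetical, and you must supply the real location of the installed binary yourself.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def ensure_on_path(binary: Path, link_dir: Path) -> Optional[Path]:
    """Link `binary` into `link_dir` unless the command already resolves.

    Returns the link path, or None when the command is already on PATH.
    """
    # shutil.which searches the directories listed in PATH, as the shell does.
    if shutil.which(binary.name) is not None:
        return None
    link = link_dir / binary.name
    if not link.exists():
        # Equivalent in spirit to: ln -s <binary> <link_dir>/<name>
        os.symlink(binary, link)
    return link
```

A call might look like ensure_on_path(Path("/opt/vector/bin/vector"), Path("/usr/bin")), where /opt/vector/bin/vector is a hypothetical install path. Writing to /usr/bin requires root privileges.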

picture

(For the specific commands, join the Honghu technical exchange group; see the Honghu knowledge base for details.)

Log in to the Honghu platform, go to Data Management > New Dataset, and create a dataset (named "switch" in this example)

picture

picture

Edit the data source name, select the "switch" dataset created above as the dataset scope, and enable the data source

picture

Create a syslog.toml configuration file and adjust the following fields:

address = "0.0.0.0:514": 0.0.0.0 accepts syslog messages from all hosts, and 514 is the receiving port (the syslog default)

._target_table = "switch": the name of the dataset you created above

mode = "udp": the protocol used to receive syslog (syslog defaults to UDP)

address = "172.20.6.111:9092": the IP address and port of the Honghu platform that Vector forwards to

picture

Start Vector with the modified syslog.toml and keep the process running.

picture

(For the specific configuration, join the Honghu technical exchange group; see the Honghu knowledge base for details.)

Log in to the switch to trigger syslog (logging in to the switch and entering commands automatically generates syslog messages). Then log in to the Honghu platform and check whether data has arrived in the switch dataset. As shown in the figure below, a non-zero event count indicates that the import succeeded.
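
If no device is at hand, a test event can be sent to the Vector listener instead. The sketch below rests on several assumptions: it builds a BSD-style (RFC 3164) syslog line and sends it over UDP, and the function names, the host name test-host, and the tag vector-check are all hypothetical.

```python
import socket
from datetime import datetime


def build_syslog_message(facility: int, severity: int, host: str, tag: str, text: str) -> bytes:
    """Build a BSD-style syslog line; the priority is facility * 8 + severity."""
    pri = facility * 8 + severity
    now = datetime.now()
    # RFC 3164 timestamps pad the day of month with a space, not a zero.
    timestamp = f"{now:%b} {now.day:2d} {now:%H:%M:%S}"
    return f"<{pri}>{timestamp} {host} {tag}: {text}".encode("utf-8")


def send_test_event(collector_host: str, collector_port: int = 514) -> bytes:
    """Send one informational test event over UDP and return the bytes sent."""
    # Facility 16 is local0 and severity 6 is informational, so PRI is 134.
    payload = build_syslog_message(16, 6, "test-host", "vector-check", "syslog pipeline test")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (collector_host, collector_port))
    return payload
```

Calling send_test_event with the address of the machine running Vector should produce one new event in the dataset if the listener and the forwarding both work. UDP gives no delivery confirmation, so the check still happens in Honghu.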

picture

Query the data that Vector imported into the switch dataset

picture

picture

Configure server syslog

On the server, configure the remote syslog server: enter the IP address of the logging system and keep the default port number

picture

Turn on alerts

picture

Select the types of alerts to send to the logging platform

picture

In Honghu, check whether the logs are received

picture

Field extraction

Why extract fields: when generating a chart, for example, I wanted to use severity but found that Honghu had not extracted this field, so it could not be used in the chart.

Because iDRAC has its own log format, Honghu does not extract all of its fields automatically, so the data must be extracted according to the iDRAC log format. Field extraction works by first creating a view with an SQL statement. Once the view exists, its fields can be used directly, while the actual logs remain stored in the original dataset.

First, analyze the iDRAC log format. The figure below shows our actual log format and the fields that need to be extracted.

picture

idrac_syslog is the name of the view to be created.

In switch._time and the other references from the sixth line of the script onward, switch is the original dataset from which the data is extracted. Replace it with the name of your own dataset.

where contains(switch._message, 'iDRAC'): 'iDRAC' is the keyword to look for in the log message. This condition restricts the view to log entries that contain iDRAC.

picture

How to write and test regular expressions

Visit https://regex101.com/ and enter or write the regular expression in the REGULAR EXPRESSION box

Paste a log entry into TEST STRING; you can copy a relevant entry directly from Honghu

If the regular expression is correct, the matched parts are highlighted in color, and the extracted field names and contents are shown under Match Information in the lower right corner
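
The same check can be made locally with named capture groups. The sketch below assumes the shape of the log line: the sample message and the field names severity, category, message_id, and message are hypothetical, so copy a real entry from your own dataset and adapt the pattern to it. Python writes named groups as (?P<name>...), which may differ from the syntax expected by the engine your view uses.

```python
import re
from typing import Dict, Optional

# Hypothetical iDRAC-style message; replace it with a line from your dataset.
SAMPLE = (
    "Severity: Informational, Category: Audit, MessageID: USR0030, "
    "Message: Successfully logged in using root, from 10.0.0.5 and GUI."
)

PATTERN = re.compile(
    r"Severity:\s*(?P<severity>[^,]+),\s*"
    r"Category:\s*(?P<category>[^,]+),\s*"
    r"MessageID:\s*(?P<message_id>[^,]+),\s*"
    r"Message:\s*(?P<message>.*)"
)


def extract_fields(line: str) -> Optional[Dict[str, str]]:
    """Return the named groups as a dict, or None when the line does not match."""
    match = PATTERN.search(line)
    return match.groupdict() if match else None


fields = extract_fields(SAMPLE)
print(fields["severity"], fields["message_id"])  # prints: Informational USR0030
```

A line that returns None here would also yield empty fields in the view, which is a quick way to find log variants the pattern does not cover.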

picture

Run the statement in the higher-order query; it succeeds

picture

Run a test search; the extracted fields appear in the area marked in red

picture

Chart display

In this section I walk through only one example for each chart type, but all search statements are listed at the end for reference.

Dashboards > New Dashboard

picture

The dashboard is created

picture

Log alarm level statistics

New Chart > Log Alarm Level Statistics

Select the chart type: pie chart

Query statement: verify the statement in the query page first to confirm that it returns the results you want

Time range: 30 days; adjust it to suit your situation

picture

picture

After the chart is generated, you can see the alarms of each level from the past 30 days.

Purpose: for example, if there are no error or warning alarms, we can easily conclude that the devices are running well. The chart also shows the current proportion of each alarm level at a glance.
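
The pie chart's calculation can be sketched outside Honghu as a count per severity turned into a percentage. This is an illustration, not the article's search statement: the function name severity_share and the sample records are hypothetical stand-ins for rows of the extracted view.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def severity_share(events: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Return the percentage of events per severity, rounded to one decimal."""
    counts = Counter(event["severity"] for event in events)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {severity: round(100 * n / total, 1) for severity, n in counts.items()}


# Hypothetical records standing in for rows of the extracted view.
sample = [
    {"severity": "Informational"},
    {"severity": "Informational"},
    {"severity": "Informational"},
    {"severity": "Warning"},
]
print(severity_share(sample))  # {'Informational': 75.0, 'Warning': 25.0}
```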

picture

Number of servers

New Chart > Number of Servers

Select the chart type: single value trend chart

Query statement: verify the statement in the query page first to confirm that it returns the results you want

Time range: 30 days; adjust it to suit your situation

picture

picture

After the chart is generated, it shows the number of servers that have sent logs to Honghu.

Purpose: it confirms the total number of servers currently monitored, which makes errors and omissions easy to spot. It is included mainly to round out the dashboard layout.

picture

Configuration change details

New Chart > Configuration Change Details

Select the chart type: table (a table is used because the details need to be shown)

Query statement: verify the statement in the query page first to confirm that it returns the results you want

Time range: 30 days; adjust it to suit your situation

picture

picture

After the chart is generated, you can see when each operation happened, which user performed it, on which device, and what was done.

Purpose: for example, if a device configuration is changed outside maintenance windows or working hours, you can check who logged in to the device during that period and what they configured, and so determine whether the behavior is normal and compliant.
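
The off-hours check described here can be sketched as a simple filter over the rows of the details table. The working hours (Monday to Friday, 09:00 to 18:00), the function names, and the record layout with a time key are all assumptions made for illustration.

```python
from datetime import datetime, time
from typing import Iterable, List, Mapping


def outside_working_hours(ts: datetime, start: time = time(9, 0), end: time = time(18, 0)) -> bool:
    """True on weekends, or on weekdays before `start` or at/after `end`."""
    is_weekend = ts.weekday() >= 5  # Monday is 0, so 5 and 6 are Sat/Sun
    return is_weekend or not (start <= ts.time() < end)


def off_hours_changes(changes: Iterable[Mapping]) -> List[Mapping]:
    """Keep only the change records whose timestamp is outside working hours."""
    return [change for change in changes if outside_working_hours(change["time"])]
```

Records returned by off_hours_changes are the ones worth reviewing first against the change log or maintenance schedule.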

picture

Configuration change chart

New Chart > Configuration Change Chart

Select the chart type: bar chart

Query statement: verify the statement in the query page first to confirm that it returns the results you want

Time range: 30 days; adjust it to suit your situation

picture

picture

After the chart is generated, you can see how many times each server's configuration was changed in the last month.

Purpose: for example, if there has been no recent failure or maintenance, a change in server configuration is an abnormal event. The configuration change details then show whether the behavior is normal and compliant.

picture

Configuration change trend chart

New Chart > Configuration Change Trend Chart

Select the chart type: line chart

Query statement: verify the statement in the query page first to confirm that it returns the results you want

Time range: 30 days; adjust it to suit your situation

picture

picture

After the chart is generated, you can see the trend of server configuration changes over the last month.

Purpose: for example, if there has been no recent failure or maintenance, a change in server configuration is an abnormal event. The configuration change details then show whether the behavior is normal and compliant.
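
The series behind a trend line like this is a count of change events per day, with zeros for days that have none. The sketch below shows that bucketing under stated assumptions: the function name daily_change_counts is hypothetical, and the timestamps would come from the extracted view.

```python
from collections import Counter
from datetime import date, datetime, timedelta
from typing import Iterable, List, Tuple


def daily_change_counts(timestamps: Iterable[datetime], start: date, days: int) -> List[Tuple[date, int]]:
    """Count events per calendar day for `days` days from `start`, zero-filled."""
    counts = Counter(ts.date() for ts in timestamps)
    series = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        series.append((day, counts.get(day, 0)))
    return series
```

Filling the empty days with zeros keeps the line continuous, so a day with an unusual number of changes stands out against a flat baseline.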

picture

Search statements

All search statements used in this article are listed below for reference

Tip: the command part of a search statement refers to the extracted fields. If a field has not been extracted, the statement returns an error

picture

Final rendering

After the charts are created, select "Grid Layout" to tidy up the dashboard layout. The final rendering is shown below

picture

Origin blog.csdn.net/Yhpdata888/article/details/132499554