About unified event management, there must be something you want to know (2)

Part of the content of this article comes from Dr. Bu, Senior Product Expert at Qingchuang Technology.

Hello, we meet again! In the previous issue, we talked about events and event management. To catch up, see the previous issue: About unified event management, there must be something you want to know (1)

This issue looks at how event management is applied in practice, covering the following two topics (*Reminder: this article is long and packed with practical content, so interested readers may want to bookmark it first so as not to lose it):

1. Application Scenarios of Event Management

2. How to conduct unified event management

1. Application Scenarios of Event Management

1. Intelligent operations and maintenance (AIOps)

Intelligent event management integrates the alarm information of IT monitoring tools, intelligently reduces alarm noise by 95%, automates the event management process, strengthens team collaboration, accelerates fault location and repair, and minimizes business impact.
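
To make "reducing alarm noise" more concrete, here is a minimal Python sketch of one common technique: collapsing repeated alarms into a single event. It is only an illustration under stated assumptions, not how any particular product works; the `Alarm` fields and the `(source, check)` fingerprint are hypothetical.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    source: str   # host or service that raised the alarm (hypothetical field)
    check: str    # name of the failing check (hypothetical field)
    message: str


def deduplicate(alarms):
    """Collapse alarms that share a (source, check) fingerprint into one event."""
    events = OrderedDict()
    for alarm in alarms:
        key = (alarm.source, alarm.check)
        if key not in events:
            events[key] = {"first": alarm, "count": 0}
        events[key]["count"] += 1
    return list(events.values())


def noise_reduction(alarms, events):
    """Fraction of raw alarms that no longer need separate attention."""
    return 1 - len(events) / len(alarms) if alarms else 0.0
```

With three identical CPU alarms from one host and one HTTP alarm from another, `deduplicate` yields two events and `noise_reduction` reports 0.5. Real platforms typically combine this with time windows and topology-based correlation.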

2. Security information and event management (SIEM)

Gather the enterprise's internal and external security events, gain real-time insight into security risks through the rule engine and the event stream processing engine, and use flexible event handling processes to help teams respond proactively to security incidents.
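
As a rough illustration of what a rule engine does with security events, the following Python sketch evaluates each event against a list of rules and returns the actions to trigger. The rule names, event fields and actions are all hypothetical; a real rule set would come from the security team.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]  # predicate over one security event
    action: str                      # hypothetical handling step to trigger


# Hypothetical rules, for illustration only.
RULES = [
    Rule("repeated-login-failure",
         lambda e: e.get("type") == "login_failed" and e.get("attempts", 0) >= 5,
         "lock_account"),
    Rule("external-admin-login",
         lambda e: e.get("type") == "login_ok" and e.get("role") == "admin"
         and not e.get("internal_ip", True),
         "notify_security_team"),
]


def evaluate(event, rules=RULES):
    """Return the actions of every rule that the event matches."""
    return [rule.action for rule in rules if rule.matches(event)]
```

Keeping rules as data rather than hard-coded branches is what makes the handling process "flexible": rules can be added or changed without touching the evaluation loop.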

3. Internet of Things

The event information of smart devices and sensors is aggregated and processed in real time at the edge nodes and core nodes of the Internet of Things. Through event stream processing, new data patterns are captured and discovered, and more high-value application scenarios are explored.
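
Event stream processing often boils down to computing something over a moving window of recent events. The Python sketch below keeps a sliding-window average of sensor readings, the kind of aggregation an edge node might perform. The class name and the window length are assumptions, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowAverage:
    """Average of the readings seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        if window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        self.window_seconds = window_seconds
        self.readings = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Add one reading and return the average over the current window."""
        self.readings.append((timestamp, value))
        self.total += value
        # Evict readings that have fallen out of the window.
        while self.readings[0][0] <= timestamp - self.window_seconds:
            _, old_value = self.readings.popleft()
            self.total -= old_value
        return self.total / len(self.readings)
```

For example, with a 10-second window, readings of 20.0 at t=0, 30.0 at t=5 and 40.0 at t=12 give averages of 20.0, 25.0 and 35.0, because the first reading has left the window by t=12.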

4. Business Analysis

Break through the data boundary between business operations and IT support, obtain more business data from the system in real time, and help the team respond quickly and correctly to events that affect the business. In times of crisis, master the chaos.

From the above scenarios, it is easy to see how widely unified event management applies. How does it work in everyday scenarios? I will illustrate with the following three cases of different scales.

Case 1: Single User Service Event

Manager Zhang, who runs the private banking center of a bank, tried to log in to the bank's private banking system to check whether any visits were scheduled for the customers assigned to him. However, authentication failed, and he still could not log in after resetting his password, so he contacted the IT help desk.

Xiao Wang, the IT service desk manager, took down Manager Zhang's details and verified that he was indeed the manager of the bank's private banking center. After the verification passed, Xiao Wang logged in to the administrator module of the private banking system and checked Manager Zhang's personal information and related configuration. It turned out that, because of a job transfer, some changes to his personal data had not been applied correctly, which caused the error.

Xiao Wang re-triggered these changes so that they were executed again. Manager Zhang then tried once more and logged in successfully. Xiao Wang closed the event record on the workbench, and the system sent a satisfaction survey to Manager Zhang. Manager Zhang was very satisfied and gave Xiao Wang a 5-star rating.

Xiao Wang went on to check the other changes related to the private banking system and found that other people's changes were running normally, so he confirmed that there was no need to create a work order.

Case 2: Multi-user service event

Manager Li at the IT service desk noticed that call volume had increased recently and that almost all of the calls reported the same incident: mobile transfers had not responded for a long time. At the same time, the on-duty manager of the alarm operation desk learned that a database error had occurred in one of the business systems and that the team was already dealing with it.

Manager Li judged this to be a significant service incident. He immediately logged in to the ITSM system to publish an announcement about the mobile transfer problem and created an incident ticket, asking the team to collect information related to the problem. The events from both channels (the IT service desk and the alarm workbench of the unified event management platform) were associated and managed centrally, so that no resources were wasted on handling them separately.

Ten minutes later, Manager Li heard from the IT manager that the system was back in operation. He asked several on-duty staff at the IT service desk to verify the mobile transfer service, confirmed that it had returned to normal, and closed the ticket.

Finally, he updated the announcement in the ITSM system.

Case 3: Major IT service incident

"It's not good!" Xiao Li, the NOC engineer on duty, exclaimed.

The alarm workbench of the unified event management platform showed an alarm storm, with new alarms continuing to appear on the screen. A large number of virtual machines were down, which pointed to either a core switch failure or a problem with the hypervisor.

Xiao Li logged the event in the ITSM system and classified it as a major event. He contacted the cloud administrator and the network administrator and set up a meeting.

Because the company is a public cloud service provider, the public relations manager also had to get involved: she needed to follow the situation, severity, and scope of the incident in real time and notify customers promptly, in order to handle the public pressure the incident might cause.

The cloud administrators quickly discovered that the outage was caused by a bug in the hypervisor and immediately called the hypervisor vendor. At the same time, they raised the priority of the event to the highest level.

As more and more virtual machines went down, calls flooded the call center, and the CEO stepped in personally, phoning the large customers affected. The supplier did not respond to the incident as quickly as hoped, but the CTO had triggered the emergency response, and the incident was resolved within 2 hours.

Afterwards, the CTO organized a review of the incident to find its root cause, and the supplier took part as well. An incident report was produced, and a series of R&D, testing, and change plans were initiated based on the report to keep such incidents from happening again.

2. How to conduct unified event management

The three examples of different scales show that, during incident or emergency response, the IT team carries out a series of activities to meet customers' service needs, following the best-practice process below:

1. Detect events

Event detection usually involves the following three aspects:

  • Reporting and verification: a user reports a problem, and the on-call person at the service desk verifies whether it is an incident.

  • Urgency: this depends on the SLA commitment made to the customer, that is, how quickly the service must be restored.

  • Priority: which events should be dealt with first, given their different business or customer impacts.

2. Log events

In general, events are recorded in systems that can manage, summarize and analyze historical events, including:

  • Call center system: External customers generally contact the call center by phone, and customer service staff record the customers' problems there.

  • IT workbench: Internal users usually go to the IT workbench when reporting problems.

  • Monitoring system: Services and their related components are monitored automatically so that potential problems and abnormalities in the system are discovered.

  • Unified event management platform: It collects the abnormalities generated by the different monitoring systems in one place. Faults reported by users and customers through the call center system and the IT workbench are also synchronized to the platform in a timely manner, so that everything can be monitored and managed in a unified way.

  • ITSM system: If the event is confirmed to be a major event and needs to be retained, an incident ticket must be created in the ITSM system afterwards for auditing.
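
The recording channels above can be pictured as feeding one common record format. The following Python sketch is a minimal in-memory stand-in for that idea; `EventRecord`, `EventLog` and their fields are hypothetical names chosen for illustration, not the schema of any real platform.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Source(Enum):
    CALL_CENTER = "call_center"
    IT_WORKBENCH = "it_workbench"
    MONITORING = "monitoring"


@dataclass
class EventRecord:
    source: Source
    summary: str
    reporter: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    needs_itsm_ticket: bool = False  # set once the event is confirmed as major


class EventLog:
    """In-memory stand-in for a unified event management platform."""

    def __init__(self):
        self.records = []

    def record(self, source, summary, reporter):
        event = EventRecord(source, summary, reporter)
        self.records.append(event)
        return event

    def mark_major(self, event):
        event.needs_itsm_ticket = True

    def pending_itsm_tickets(self):
        """Events that still need a ticket in the ITSM system for auditing."""
        return [e for e in self.records if e.needs_itsm_ticket]
```

Whatever the channel, every event ends up in the same list with the same fields, which is what makes later summarizing and analysis possible.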

3. Event classification

In the event classification phase, events are mainly classified according to the following:

  • Type: such as hardware failure, software failure, network failure or others.

  • Impact degree and scope: such as which businesses and customers have been affected.

  • Urgency: Depends on the SLA commitment made to the customer, that is, how quickly the service must be restored.

  • Priority: Which events should be handled first, given their different business or customer impacts.

Classification helps to:

Speed up the identification and handling of incidents; identify effectively who should be responsible for the incident; reduce the cost of handling incidents.
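
A common way to derive priority from impact and urgency is a simple matrix. The Python sketch below is one possible version; the three levels and the P1 to P4 thresholds are assumptions made for illustration, and in practice they would follow the organization's own SLA commitments.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, urgency: Level) -> str:
    """Map impact and urgency to a priority label, P1 being the most pressing."""
    score = impact + urgency  # ranges from 2 to 6
    if score >= 6:
        return "P1"
    if score == 5:
        return "P2"
    if score == 4:
        return "P3"
    return "P4"
```

Under these assumptions, a high-impact, high-urgency event such as the one in Case 3 maps to P1, while a single-user login problem with low impact and medium urgency maps to P4.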

4. Diagnose events

At its core, incident diagnosis is about determining what went wrong and finding the fastest way to restore normal service.

If the event has happened before and matches a known event model, frontline personnel can diagnose it directly. Incidents that are more complex or have not occurred before, however, require joint investigation by cross-functional teams or second-line experts.
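
The split between frontline and second-line diagnosis can be sketched as a lookup against known event models. In the Python example below, the models are just regular expressions paired with a suggested fix; both the patterns and the suggestions are hypothetical.

```python
import re

# Hypothetical event models: a pattern for the symptom and the known fix.
EVENT_MODELS = [
    (re.compile(r"disk .* full", re.IGNORECASE),
     "Clean up old logs on the affected host"),
    (re.compile(r"login .* denied", re.IGNORECASE),
     "Re-apply pending account changes"),
]


def route(description):
    """Return (team, suggestion): first line if a model matches, second line otherwise."""
    for pattern, suggestion in EVENT_MODELS:
        if pattern.search(description):
            return "first-line", suggestion
    return "second-line", None
```

An event described as "Disk /var is full on db01" matches the first model and stays with the first line, whereas "Hypervisor crash" matches nothing and is escalated.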

5. Resolve incidents

Incident resolution means applying a solution once the diagnosis is complete, which may be a temporary fix or a permanent fix. During emergency and incident handling, a permanent fix is generally not the goal; the aim is to resume production as quickly as possible through a series of operations. The main approaches are:

  • Automatic resolution: Based on known event models defined in advance, events are resolved and recovered automatically, with no manual diagnosis or handling.

  • Manual resolution by O&M engineers: Based on the event model or the results of system analysis, handling suggestions are given; the O&M engineers make the decisions and complete the recovery through manual operations. For complex scenarios, the support team or suppliers can also be asked to provide a solution, which the O&M engineers then carry out.
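
The two approaches can be combined in a single dispatch step: run an automatic recovery if one is defined for the event type, otherwise hand the event to an engineer together with a suggestion. The Python sketch below assumes hypothetical event types, and `restart_service` is a placeholder that only reports what it would do.

```python
def restart_service(event):
    # Placeholder for a real recovery step; it only reports what it would do.
    return f"restarted {event['service']}"


# Hypothetical mapping from known event types to automatic recovery steps.
RUNBOOKS = {"service_down": restart_service}

# Hypothetical suggestions shown to O&M engineers when a human decision is needed.
SUGGESTIONS = {"db_slow_query": "Review the slow query log and consider adding an index"}


def resolve(event):
    """Run the automatic runbook if one exists; otherwise hand over to an engineer."""
    runbook = RUNBOOKS.get(event["type"])
    if runbook is not None:
        return {"mode": "automatic", "result": runbook(event)}
    suggestion = SUGGESTIONS.get(event["type"], "Escalate to the support team or supplier")
    return {"mode": "manual", "suggestion": suggestion}
```

Event types without a runbook or a stored suggestion fall through to escalation, which mirrors the "ask the support team or suppliers" path described above.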

6. Close the event

Once the incident is resolved, an official closure of the incident is required. Closing requires the following actions:

  • Inform users, customers, management and other stakeholders that business services have returned to normal.

  • Update the configuration information in the CMDB as needed, for example if the database cluster was expanded to restore the business.

  • Update billing, for example to reflect the internal and external manpower invested and any new servers added.

7. Review after the fact

Post-event review is neglected by many organizations, but it is an essential step for consolidating knowledge and for optimizing monitoring, event handling, and existing event and application processes.

The event review is generally completed within 5 working days after the event occurs. For this step, a reviewer role must be designated to examine in detail the O&M engineer's summary report on how the event was handled. The main contents of the report include:

  • Report date

  • Reporter

  • Incident overview: In one or two short sentences, describe the incident along with the root cause, time of occurrence, and impact. For example: at 9:25 am on August 5, 2023, a database failure caused about 20% of the transactions during the failure period to have longer response times, which affected the user experience. The failure lasted about 15 minutes, and the severity level was "major".

  • Event details: ① What happened, described in detail. ② The root cause of the problem. ③ The temporary solution (the quick recovery measure used to restore business as soon as possible). ④ The permanent solution to the problem.

  • Impact: the impact on the business, customers, transactions and so on, together with the severity level.

  • Timeline: To safeguard the SLA, record in detail the discovery time, the time the person in charge was notified, the response time, the resolution time, the closing time and so on, with reference to the enterprise's internal assessment standards and its commitments to end users.

  • Participants (these differ between emergency and event scenarios): ① Incident commander. ② Recorder. ③ Liaison officer. ④ Other participants, such as experts in different fields, developers or testers.

  • How we responded to this incident: ① What we did well: for example, during this response we used processes, methods or technologies that we had never used in previous emergency and incident responses, and they greatly improved the timeliness of the response. ② What we did poorly: for example, during the response we found that an existing process or method hindered specific steps and needs to be improved.

  • Follow-up action plan

Complete any necessary fixes to prevent similar issues from recurring in the future. For example:

① If the monitoring of a specific indicator is too sensitive, adjust it at the monitoring source. If the incident was caused by a program bug, draw up a bug-fix plan with the R&D team and put it on the schedule.

② If a permanent fix is not possible, consider whether similar incidents can be repaired quickly by automated means when they recur. For example, rules and automatic repair scripts can be configured for specific alarms, so that a recurrence is repaired automatically without manual intervention.

③ Optimize the existing process to improve response efficiency.
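
The Timeline item of the report lends itself to a small calculation. The Python sketch below derives a few durations from the recorded timestamps and checks them against a resolution target. The key names and the 30-minute target are assumptions. In the usage example, only the 9:25 start and the roughly 15-minute duration come from the overview example above; the other timestamps are made up for illustration.

```python
from datetime import datetime


def timeline_metrics(timeline):
    """Compute durations in minutes between the key timestamps of an incident."""
    def minutes(start, end):
        return (timeline[end] - timeline[start]).total_seconds() / 60

    return {
        "time_to_notify": minutes("discovered", "notified"),
        "time_to_respond": minutes("notified", "responded"),
        "time_to_resolve": minutes("discovered", "resolved"),
    }


def sla_met(metrics, max_resolve_minutes):
    """Check the resolution time against a target (the target is an assumption)."""
    return metrics["time_to_resolve"] <= max_resolve_minutes


# Illustrative timestamps only.
example = {
    "discovered": datetime(2023, 8, 5, 9, 25),
    "notified": datetime(2023, 8, 5, 9, 27),
    "responded": datetime(2023, 8, 5, 9, 30),
    "resolved": datetime(2023, 8, 5, 9, 40),
}
metrics = timeline_metrics(example)
print(metrics, sla_met(metrics, max_resolve_minutes=30))
```

Tracking these figures across reviews shows whether the follow-up actions are actually shortening response and recovery times.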

Well, that is all for this article. If you have any questions about unified event management, please leave a message in the comment area to discuss.


Qingchuang Technology, a benchmark supplier in the AIOps field continuously recommended by Gartner. The company is committed to assisting enterprise customers to improve insight into operation and maintenance data, optimize operation and maintenance efficiency, and fully reflect the influence of technology operation and maintenance on business operations.

Origin blog.csdn.net/qq_37641528/article/details/132299174