Exploration of Network Fault Location under Inaccurate Resources

1 Introduction

The ability to locate network faults depends mainly on network topology relationships. By judging whether alarms originate from the same source, related alarms can be compressed, filtered, and correlated, so that the root alarm can be identified or the scope of the root cause narrowed down. This depends heavily on the accuracy of network resource data. If the resource data is inaccurate or incomplete, it is difficult to locate the cause of a fault through the topological relationships between network resources. The only option is then to dispatch network trouble tickets and rely on manual business and technical experience to analyze the cause, or even to send staff for on-site survey and location. The result is inefficient fault location, unpredictable service recovery time, and a higher risk of customer complaints.

To address these pain points, we explore solutions for fault location under inaccurate resources. By introducing analysis techniques such as text similarity and optical path ratio calculation, we provide solutions for common scenarios of inaccurate resources that improve the accuracy of alarm correlation, thereby improving fault location capability and reducing the number of invalid network trouble tickets dispatched.

2 Fault scenarios in focus

This article focuses on two fault scenarios: power outages reported by dynamic ring (power and environment) monitoring, and transmission optical cable interruptions. For each scenario, a different alarm correlation analysis method is introduced to mine the potential alarm correlations between devices when resource data is inaccurate. The mined correlations are then applied in subsequent real-time alarm analysis, where they help to locate the root cause of an alarm quickly, provide an initial calibration of the resource data in return, allow alarm work orders to be dispatched in merged form, and reduce the pressure on front-line personnel.

Dynamic ring (power and environment) equipment power outage scenario: mining the potential relationship between computer rooms and equipment based on the text similarity of alarm features

From historical alarm data, extract the alarms that could not be correlated because of inaccurate resources. Using each such alarm as an analysis node, extract the alarms within a certain time window (for example, 30 minutes), apply a text similarity algorithm to the alarm keyword text to identify related alarms, and identify the device to which each alarm belongs. Finally, the relationship between the computer room and the device is verified through repeated deduction over a large number of samples.

Transmission optical cable interruption scenario: mining the potential relationship between optical cables and transmission equipment based on optical path ratio calculation

Locate the optical path of each device and the historical alarm ports along the transmission route. Then, using signal flow traceability analysis within the scope of a multiplex section that shares the same route, analyze the specific alarms of one or more system segments, trace the segment in which the port of the fault source device is located, and locate the corresponding faulty system segment. Finally, the equipment, ports, optical paths, and optical cables identified in this way form a transmission same-source correlation library for subsequent real-time alarm correlation analysis.

Next, we explore detailed solutions for the above two fault scenarios.

3 Fault location based on text similarity analysis

Typical application scenario: fault location of out-of-service equipment under a dynamic ring (power and environment monitoring) power outage alarm.

Analysis of main pain points: when a computer room loses power, a power outage alarm is raised for the computer room together with a large number of out-of-service alarms for the affected equipment. However, because resource data is inaccurate or missing, the fault management system cannot automatically associate the equipment out-of-service alarms with the computer room power outage alarm, so the affected out-of-service alarms cannot be compressed or filtered. As a result, a large number of network trouble tickets are generated and dispatched to front-line personnel, which greatly increases their troubleshooting workload.

Key solution: introduce a text similarity algorithm to mine the potential relationship between the computer room and related equipment from historical alarm data, compensate for inaccurate or missing resources, and provide a reference for subsequent real-time alarm correlation analysis.

According to the configured correlation time window, correlate the out-of-service alarms of base stations and equipment (such as OLTs) with the related power outage. In addition, a text similarity algorithm is applied to the historical base station and equipment (such as OLT) out-of-service alarms that occurred within 30 minutes of the power outage alarm, in order to check the accuracy of the resource data and to identify network devices whose resource data may be inaccurate.

An example diagram of the location process is as follows:

[Figure: example of the fault location process]

Offline alarm modeling analysis:

Establish an alarm offline analysis module, construct alarm data groups according to time and area dimensions, and classify, model, and store historical alarms.

Step1: Offline alarm data extraction:

From a large volume of historical alarm data, extract the device out-of-service alarms that were not compressed, filtered, or correlated, and use them as analysis samples.

Step2: Aggregation of offline alarm data:

Take each extracted alarm as an analysis node and set a correlation time window (for example, 30 minutes as one time analysis dimension). Within this time range, gather the historical alarms from the region to which the node alarm belongs, such as base station out-of-service, OLT out-of-service, and computer room power outage alarms, to form an alarm data group.
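The aggregation step can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the `Alarm` record, its field names, and the choice of a symmetric window around the node alarm are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alarm:
    # Hypothetical alarm record; field names are illustrative only.
    alarm_id: str
    region: str
    title: str
    occurred_at: datetime


def build_alarm_group(node, history, window=timedelta(minutes=30)):
    """Collect historical alarms from the node's region that occurred
    within +/- `window` of the node alarm (symmetric window is an assumption)."""
    return [
        a for a in history
        if a.alarm_id != node.alarm_id
        and a.region == node.region
        and abs(a.occurred_at - node.occurred_at) <= window
    ]


t0 = datetime(2022, 10, 1, 8, 0)
history = [
    Alarm("a1", "Qingshui", "computer room power outage", t0 - timedelta(minutes=5)),
    Alarm("a2", "Qingshui", "base station out of service", t0 + timedelta(minutes=10)),
    Alarm("a3", "Qingshui", "OLT out of service", t0 + timedelta(minutes=45)),
    Alarm("a4", "Other", "computer room power outage", t0),
]
node = Alarm("n1", "Qingshui", "OLT out of service", t0)
group = build_alarm_group(node, history)
group_ids = [a.alarm_id for a in group]
print(group_ids)
```

Alarm `a3` falls outside the 30-minute window and `a4` belongs to another region, so neither joins the group.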

Alarm feature keyword extraction:

Equipment in a computer room is usually named according to certain rules. For example, an equipment name typically contains words from the name of the computer room, so it carries some features of the computer room name. On this basis, keywords are extracted from the equipment alarms in the aggregated offline alarm data group. The extracted information mainly includes the computer room name, network element name, device name, port, link, and alarm title.

After the keyword extraction of the alarm information is completed, a new alarm text data group is established through the extracted alarm keyword information.

Cosine similarity feature analysis:

Look up the weight of each feature in the key feature weight library, calculate the similarity of the key features with the cosine similarity algorithm, and mine the out-of-service equipment located in the same area as the computer room.

Calculation principle: the cosine of the angle between two vectors in a vector space is used as a measure of the difference between two individuals. The closer the cosine is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are; the closer the cosine is to 0, the closer the angle is to 90 degrees and the less similar the two vectors are.

Step1: Feature word segmentation preprocessing

Feature word segmentation preprocessing mainly performs word segmentation processing on selected alarm feature keywords according to word frequency, examples are as follows:

Name of wireless computer room: Wireless computer room on the 1st floor of Qingshui Township Site, Shangyun County

Computer room equipment name 1: Qingshui wireless computer room_BBU03

The word segmentation results are as follows (T1 for the equipment name, T2 for the computer room name):

T1={Qing, Shui, Wireless, Computer Room, _, BBU, 03}

T2={Shangyun County, Qing, Shui, Township, Site, 1, Floor, Wireless, Computer Room}

Step2: Feature word segmentation merge

Merge and deduplicate the tokens of the individual vectors to form a single combined vector. Continuing the above example, the result is as follows:

T={Shangyun County, Qing, Shui, Township, Site, 1, Floor, Wireless, Computer Room, _, BBU, 03}

Step3: Feature word frequency statistics

For each word in the merged vector T, check whether it occurs in each original token set Ti and record the result in the corresponding vector Si: 1 if the word appears in Ti, otherwise 0. The results are as follows:

S1={0,1,1,0,0,0,0,1,1,1,1,1}

S2={1,1,1,1,1,1,1,1,1,0,0,0}

Thus, comparing the two alarm keyword texts is reduced to calculating the similarity between the two vectors S1 and S2.
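Steps 2 and 3 can be sketched in a few lines of Python. The token lists below are the example's tokens with their spelling normalized so that the same word matches in both lists. The helper names `merge_tokens` and `presence_vector` are hypothetical, and the word segmentation itself (Step 1) is assumed to have been done already.

```python
def merge_tokens(*token_lists):
    """Step 2: union of all tokens, keeping first-seen order."""
    merged, seen = [], set()
    for tokens in token_lists:
        for tok in tokens:
            if tok not in seen:
                seen.add(tok)
                merged.append(tok)
    return merged


def presence_vector(vocab, tokens):
    """Step 3: 1 if the vocabulary word occurs in `tokens`, otherwise 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]


# T1: equipment name tokens, T2: computer room name tokens.
t1 = ["Qing", "Shui", "Wireless", "Computer Room", "_", "BBU", "03"]
t2 = ["Shangyun County", "Qing", "Shui", "Township", "Site", "1", "Floor",
      "Wireless", "Computer Room"]

# Merging T2 first reproduces the ordering of T used in the example.
vocab = merge_tokens(t2, t1)
s1 = presence_vector(vocab, t1)
s2 = presence_vector(vocab, t2)
print(s1)
print(s2)
```

The resulting `s1` and `s2` match the vectors S1 and S2 given above.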

Step4: Calculation of cosine similarity

Following the principle of cosine similarity, the two vectors S1 and S2 can be regarded as two line segments in space that start from the origin ([0, 0, ...]) and point in different directions, forming an angle between them. If the angle is 0 degrees, the directions are the same and the segments coincide, meaning that the texts represented by the two vectors are identical; if the angle is 90 degrees, the segments form a right angle and the texts are not similar at all; if the angle is 180 degrees, the directions are exactly opposite. (Because the components of S1 and S2 are never negative, the angle here always lies between 0 and 90 degrees.) The similarity of the vectors can therefore be judged from the size of the angle: the smaller the angle, the more similar they are.

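The cosine similarity of the vectors S1 and S2 is calculated with the following formula:

$$\cos\theta = \frac{S_1 \cdot S_2}{\lVert S_1 \rVert \, \lVert S_2 \rVert} = \frac{\sum_{i} S_{1i} S_{2i}}{\sqrt{\sum_{i} S_{1i}^{2}} \, \sqrt{\sum_{i} S_{2i}^{2}}}$$

For this example, the reported result is:

cos θ = 0.70846617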

From the calculation result, the cosine of the angle is about 0.71, which is fairly close to 1, so the text features of the two alarm keyword sets can be judged to be similar.

To make the decision threshold more accurate, it can be trained on pairs of computer room names and equipment names that are known to be related, taking into account factors such as region, length of the computer room name, and equipment type. This yields a relatively accurate threshold.
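As a cross-check, the cosine measure can be computed directly in Python. The sketch below uses the plain 0/1 vectors S1 and S2 from Step 3 without any feature weights. Under that assumption the result is 4/√63 ≈ 0.504 rather than the value quoted above; the quoted value presumably includes weights from the key feature weight library, which are not listed in this article, so it cannot be reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an all-zero vector (an assumption)
    return dot / (norm_u * norm_v)


# Unweighted presence vectors from Step 3.
s1 = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1]
s2 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

score = cosine_similarity(s1, s2)
print(round(score, 4))
```

Four words are shared, S1 has 7 non-zero components, and S2 has 9, which gives 4 / (√7 · √9).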

Merge duplicate device counts:

Analyze multiple occurrences of dynamic ring (power and environment monitoring) power outage alarms. After each computer room power outage, compare and merge the equipment that repeatedly appears in the same area, determine the equipment associated with the computer room, and build the corresponding computer room and equipment association library.
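A minimal sketch of this merging step is shown below. It assumes that each historical power outage event has already been reduced to a computer room name and the list of devices judged similar to it. The function name and the `min_occurrences` cut-off are hypothetical and would need tuning on real data.

```python
from collections import Counter


def build_room_device_library(events, min_occurrences=2):
    """events: iterable of (room, devices) pairs, one per historical outage.
    Keeps devices that co-occur with the same room at least `min_occurrences` times."""
    counts = {}
    for room, devices in events:
        # set() so a device is counted once per outage event
        counts.setdefault(room, Counter()).update(set(devices))
    return {
        room: sorted(dev for dev, n in counter.items() if n >= min_occurrences)
        for room, counter in counts.items()
    }


events = [
    ("Qingshui room", ["BBU03", "OLT01"]),
    ("Qingshui room", ["BBU03", "OLT07"]),
    ("Qingshui room", ["BBU03", "OLT01"]),
]
library = build_room_device_library(events)
print(library)
```

Here `OLT07` appears only once and is left out, while `BBU03` and `OLT01` recur and enter the association library.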

Real-time alarm correlation:

The potential relationships between computer rooms and equipment, mined from offline alarm data through text similarity analysis, are configured as alarm association rules. These rules are then used to correlate the out-of-service alarms of base stations and OLTs in real time and to filter and compress the related alarms.

Merged and appended ticket dispatch:

When real-time alarm correlation analysis shows that an equipment out-of-service alarm is caused by a computer room power outage, check whether a ticket has already been dispatched for the power outage alarm. If no ticket has been dispatched yet, the merge method is used: the equipment out-of-service alarm is merged into the computer room power outage alarm as an associated alarm before dispatch. If a ticket has already been dispatched, the append method is used: the equipment out-of-service alarm is appended to the power outage alarm as an associated alarm.
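The merge-or-append decision can be expressed as a small function. The `PowerOutageAlarm` class and its fields are hypothetical stand-ins for whatever the fault management system actually stores; the sketch shows only the branching logic described above.

```python
from dataclasses import dataclass, field


@dataclass
class PowerOutageAlarm:
    # Hypothetical root alarm record.
    alarm_id: str
    ticket_dispatched: bool = False
    associated: list = field(default_factory=list)


def attach_device_alarm(root, device_alarm_id):
    """Attach a device out-of-service alarm to its root power outage alarm.
    Returns "merge" if no ticket was dispatched yet, otherwise "append"."""
    action = "append" if root.ticket_dispatched else "merge"
    root.associated.append(device_alarm_id)
    return action


root = PowerOutageAlarm("room-outage-1")
first = attach_device_alarm(root, "bbu03-out-of-service")
root.ticket_dispatched = True  # the ticket goes out with the merged alarm
second = attach_device_alarm(root, "olt01-out-of-service")
print(first, second, root.associated)
```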

4 Fault location based on optical path ratio analysis

Typical application scenario: fault location for equipment alarms across multiple network domains when a transmission optical cable is interrupted.

Analysis of main pain points: the transmission domain includes inside-plant and outside-plant transmission lines. The outside plant involves physical resources such as optical cables, ducts, and manholes. Because these have long lacked effective management, some resource data is inaccurate or even missing (for example, mismatched optical paths). As a result, equipment alarms and transmission LOS alarms caused by the same optical cable interruption cannot be correlated effectively, and the related equipment faults cannot be located remotely. Staff must be sent on site to check the possible causes one by one, which delays the recovery of equipment faults.

Key solution: starting from the alarm ports of each device and transmission route, use the business logic relationships to locate the port to which the corresponding optical path belongs and trace the signal flow. Calculate how well the optical paths of each port match, analyze the specific alarms of one or more system segments (unidirectional and bidirectional) within the same multiplex section, and determine the proportion of alarming optical paths under each optical cable segment. This traces the segment in which the port of the fault source device is located and then locates the optical cable segment corresponding to the faulty system.

Alarm data collection:

Collect network alarm data by interfacing with the fault management system, or obtain the raw alarm data by interfacing with the collection platform and standardize it. Then extract the transmission optical cable interruption alarms from this data.

Alarm cluster analysis:

Taking each transmission optical cable interruption alarm as an analysis node, set an analysis time window to filter and delimit the offline alarm data to be analyzed. When an optical cable is interrupted, services may be blocked completely. The alarms that occur together within the time window can be clustered and analyzed, including backbone equipment off-network, MSE equipment off-network, BAS off-network, PTN/IPRAN equipment off-network, and base station out-of-service alarms. These characteristic alarms are then classified and aggregated by time and category.

Resource attribute population:

Based on the resource data or device resource data provided by the network management system, the corresponding resource information is filled in for the alarms of each network domain. The supplementary resource information includes device type, board model, port, A-end and Z-end information, and so on.

Key feature extraction:

According to the alarm characteristics of each network domain and equipment type, extract the key feature data. The key data should include occurrence time, area, port information, and so on.

  • Base station alarm: extract occurrence time, port, base station name, latitude and longitude, area, and other information.

  • OLT alarm: extract occurrence time, port, OLT name, NE_IP, area, and other information.

  • Transmission alarm: extract occurrence time, port, ONU name, OLT name, NE_IP, area, and other information.

  • BRAS alarm: extract occurrence time, port, BRAS name, NE_IP, area, and other information.

Optical path signal flow analysis:

Partition the data by region (district, county, city), match optical paths and circuits by port and equipment name, analyze the off-network status of PTN, OTN, and base station equipment, and determine the number of optical paths in each signal flow direction of each device.

Specifically, the following scenarios are included:

Optical cable affects PTN equipment off-network: analyze the optical path flow of the alarming PTN equipment in the alarm data set:

  • PTN equipment->A or Z-end equipment->relay circuit->optical path->optical path routing->fiber core->optical cable segment.

  • PTN equipment->A or Z-end equipment->relay circuit->transmission circuit->system segment->optical path->optical path routing->fiber core->optical cable segment.

Optical cable affects transmission (OTN/WDM/SDH) equipment off-network: analyze the optical path flow of the alarming OTN/WDM/SDH equipment in the alarm data set:

  • OTN/WDM/SDH equipment->A or Z-end equipment->system segment->optical path->optical path routing->fiber core->optical cable segment.

Optical cable affects OLT equipment off-network: following the signal flows below, determine the number of optical paths of the alarming OLT equipment:

  • OLT equipment->A or Z-end equipment->relay circuit->optical path->optical path routing->fiber core->optical cable segment.

  • OLT equipment->A or Z-end equipment->relay circuit->transmission circuit->system segment->optical path->optical path routing->fiber core->optical cable segment.

  • OLT equipment->A or Z-end equipment->switch/BRAS equipment->relay circuit->transmission circuit->system segment->optical path->optical path routing->fiber core->optical cable segment.

  • OLT equipment->A or Z-end equipment->switch/BRAS equipment->relay circuit->transmission circuit->system segment->transmission circuit->system segment->optical path->optical path routing->fiber core->optical cable segment.

Optical cable affects base station equipment off-network: analyze the optical path flow of the wireless base station equipment in the alarm data set:

  • 3G/4G wireless base station -> A or Z end PTN equipment -> relay circuit -> optical path -> optical path routing -> fiber core -> optical cable segment.

  • 3G/4G wireless base station -> A or Z end BBU equipment -> relay circuit -> optical path -> optical path routing -> fiber core -> optical cable segment.
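The signal flows listed above all have the same shape: start at an alarming device and follow resource relationships until an optical cable segment is reached. A hedged sketch of such a trace is given below. The `links` dictionary, the node names, and the `cable-seg-` naming rule are invented for illustration, and the A/Z-end step is folded into the first hop for brevity.

```python
def trace_cable_segments(start, links, is_cable_segment):
    """Depth-first walk from an alarming device through the resource
    relations in `links` to every optical cable segment it depends on."""
    found, stack, visited = [], [start], set()
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        if is_cable_segment(node):
            found.append(node)
            continue  # a cable segment is the end of a flow
        stack.extend(links.get(node, []))
    return sorted(found)


# Hypothetical resource relations for one PTN device:
# device -> relay circuit -> optical path -> routing -> fiber core -> cable segment
links = {
    "PTN-01": ["relay-circuit-7"],
    "relay-circuit-7": ["optical-path-12"],
    "optical-path-12": ["route-12"],
    "route-12": ["core-3", "core-4"],
    "core-3": ["cable-seg-A"],
    "core-4": ["cable-seg-B"],
}
segments = trace_cable_segments(
    "PTN-01", links, lambda n: n.startswith("cable-seg-")
)
print(segments)
```

Longer flows, such as those that pass through transmission circuits and system segments, only add more intermediate entries to `links`; the walk itself stays the same.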

Optical path ratio threshold judgment:

Based on the optical path flow analysis, calculate the number of matching optical paths for each device alarm port and total the number of optical paths on each optical cable segment. Compare the totals and proportions across the optical cable segments, sort the segments by proportion, and take the segment with the largest proportion as the suspected faulty cable segment.

Correlated alarm ticket dispatch:

Once the potential correlation between the equipment alarms and the optical cable interruption alarm has been established, the equipment alarms can be treated as sub-alarms of the optical cable interruption alarm and associated with it, so that they are correlated and dispatched together.
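The ranking step might look like the following sketch. It interprets the "optical path ratio" as the number of alarming optical paths on a cable segment divided by the total number of optical paths that segment carries. This reading is an assumption, as are the function name and the sample data.

```python
def rank_suspect_segments(alarming_paths, total_paths):
    """alarming_paths: {segment: set of alarming optical path ids}
    total_paths: {segment: total number of optical paths on that segment}
    Returns (segment, alarming_count, ratio) tuples, highest ratio first."""
    ranked = []
    for segment, paths in alarming_paths.items():
        total = total_paths.get(segment, 0)
        if total <= 0:
            continue  # no usable resource data for this segment
        ranked.append((segment, len(paths), len(paths) / total))
    # Sort by ratio, then by alarming count, then by name for stable output.
    ranked.sort(key=lambda r: (-r[2], -r[1], r[0]))
    return ranked


alarming = {
    "cable-seg-A": {"op1", "op2", "op3"},
    "cable-seg-B": {"op3"},
}
totals = {"cable-seg-A": 4, "cable-seg-B": 10}

ranked = rank_suspect_segments(alarming, totals)
suspect = ranked[0][0]
print(suspect, ranked)
```

With three of four optical paths alarming, `cable-seg-A` has a ratio of 0.75 and is listed as the suspected faulty segment. A minimum ratio threshold could be applied before accepting the top candidate.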

5 Future Prospects

With the advancement of cloud-network services, service networking is becoming more complex, service integration scenarios are multiplying, and customers' requirements on the network keep rising. The ability to locate and handle network faults quickly is one of the important safeguards that help operators maintain a good reputation in the market. Fault location under accurate resource data is already mature, but missing and inaccurate resource data remain serious problems for operators. The ability to locate faults under inaccurate resources will therefore be a powerful tool for network operation and maintenance.

This article mainly describes solutions for two scenarios with inaccurate resources. In the future, we will continue to explore other business scenarios and to improve fault location methods under inaccurate resources. For example, AI and big data analysis could be introduced to restore the topology of ODN optical path dumb (passive) resources; "letting dumb resources speak" in this way calibrates the resource data and improves fault location capability. In addition, if active optical probe devices are installed in the optical fiber, optical path data can be collected from the probes and analyzed through big data association and aggregation, which restores the optical link topology more accurately and further benefits fault location.


Origin blog.csdn.net/whalecloud/article/details/127511346