TextMining Day 1: A short text mining framework for power equipment operation and maintenance

image-20230703092411160

  • Preprocessing: As in general natural language processing tasks, the short texts in logs, tickets, and specifications are preprocessed first.

    • Word segmentation: A necessary basic step when preprocessing Chinese text. English text separates words with spaces, so this step is usually skipped for English.
    • POS tagging: Marks the part of speech (POS) of each word, which may benefit subsequent analysis.
    • Stop word removal: For most text mining tasks other than statistical work, stop words such as inspector names, locations, and substation names carry no useful meaning, so they are generally removed from the text.
  • Data cleaning: Because inspection engineers have limited knowledge and experience, logs and tickets may contain errors such as omitted or contradictory information; short texts in specifications are the exception. To ensure the credibility of short text mining, the text data in logs and tickets is therefore cleaned in two steps: error identification and quality improvement.

    • Error identification
    • Quality improvement
  • Representation: This module converts text data into a form that a computer can process.

    • Structured form: Traditionally, short texts have been represented in structured forms, usually vectors or matrices.
    • Semi-structured form: The paper proposes a semi-structured representation of short text based on knowledge graph technology, which converts short text into a graph structure.

    The structured or semi-structured text data is then analyzed together with other forms of data (such as numerical data), in light of the practical application of power equipment operation and maintenance.

  • Data analysis

    • Machine learning: Machine learning methods are mainly used when the mapping between data and results is complex and hidden.
    • Rule-based: For tasks where the mapping can be determined explicitly, rule-based methods are more appropriate because of their strong interpretability.

    Finally, the data analysis module will output results related to the judgment and decision-making of power equipment operation and maintenance.

  • Application

    • Judgment
      • Degree of defect
      • Health index
    • Decision making
      • Defect handling
      • Maintenance strategy
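
To make the module flow above concrete, here is a minimal sketch of how the four modules could be chained, ending in a rule-based defect degree judgment. Everything in it is an assumption for illustration: the function names, the stop word set, the keyword rules, and the degree labels are hypothetical and are not taken from the paper. Whitespace splitting stands in for real Chinese word segmentation.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical stop words (inspector names, locations); not from the paper.
STOP_WORDS = {"inspector", "zhang", "substation"}


def preprocess(text: str) -> List[str]:
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def clean(tokens: List[str]) -> Optional[List[str]]:
    """Toy error identification: reject records with no content left."""
    return tokens or None


def represent(tokens: List[str]) -> Dict[str, int]:
    """Structured form: a sparse bag-of-words vector."""
    return dict(Counter(tokens))


def analyze(vector: Dict[str, int]) -> str:
    """Rule-based judgment using hypothetical keyword rules."""
    if "fire" in vector or "explosion" in vector:
        return "critical"
    if "leakage" in vector:
        return "serious"
    return "general"


def run_pipeline(text: str) -> Optional[str]:
    """Preprocess, clean, represent, and analyze one short text."""
    tokens = clean(preprocess(text))
    if tokens is None:
        return None  # nothing usable survived cleaning
    return analyze(represent(tokens))


print(run_pipeline("Inspector Zhang found oil leakage at substation"))
```

Because each stage is a plain function, any single stage (for example the rule-based `analyze`) could be swapped for a learned model without touching the others.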

III. Specific Design of Short Text Mining Framework

A. Specific design of the preprocessing module

image-20230703104617346

As shown in Figure 2, the first stage builds a vocabulary of domain terms and idioms, together with a well-segmented and labeled power-domain corpus.

image-20230703104633926

The second stage, shown in Figure 3, is to segment and label the original short text sent to the preprocessing module.
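
The role of the vocabulary in the second stage can be illustrated with a very simple dictionary-based segmenter. The sketch below uses forward maximum matching, which is a swapped-in textbook technique and not necessarily the segmentation algorithm used in the paper; the vocabulary entries and the function name `forward_max_match` are assumptions for illustration.

```python
from typing import List, Set

# Hypothetical domain vocabulary (not the paper's):
# transformer, on-load, tap changer, on-load tap changer, oil seepage
VOCAB = {"变压器", "有载", "分接开关", "有载分接开关", "渗油"}


def forward_max_match(text: str, vocab: Set[str]) -> List[str]:
    """Greedy left-to-right segmentation that prefers the longest vocabulary match."""
    max_len = max(len(word) for word in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


print(forward_max_match("变压器有载分接开关渗油", VOCAB))
```

Characters that match no vocabulary entry fall back to single-character tokens, which is one reason a richer domain vocabulary matters for short texts full of equipment terms.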

B. The specific design of the data cleaning module

image-20230704092212538

Key parameters and algorithms in the quality improvement step:

image-20230704092853556
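
As a rough illustration of the error identification step only, the sketch below flags omitted and contradictory information in a defect record. The record schema, the field names, and the contradiction rule are all hypothetical; the actual parameters and algorithms of the data cleaning module are the ones given in the figures above and are not reproduced here.

```python
from typing import Dict, List, Optional

# Hypothetical schema: fields every defect record is assumed to need.
REQUIRED_FIELDS = ("equipment", "part", "phenomenon")


def identify_errors(record: Dict[str, Optional[str]]) -> List[str]:
    """Return a list of detected problems; an empty list means no error was found."""
    errors = []
    # Information omission: a required field is missing or empty.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            errors.append(f"omission:{field}")
    # Information contradiction (hypothetical rule): a record that reports
    # no abnormality should not also carry a defect degree.
    if record.get("phenomenon") == "no abnormality" and record.get("degree") not in (None, "none"):
        errors.append("contradiction:degree")
    return errors


print(identify_errors({"equipment": "transformer", "part": "", "phenomenon": "oil leakage"}))
```

Records with a non-empty error list would then be passed on to the quality improvement step, which this sketch does not attempt to model.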

C. Specific design of the representation module

image-20230704092224633

image-20230704092327683

image-20230704092335978

D. Specific design of the data analysis module

The key parameters of the CNN are shown in Table VI.

image-20230704095947413

IV. Case studies

A. Defect degree judgment based on text classification

Based on the short text mining framework, experimental group 1 (EG1) represented the text as a vector and applied SVM for data analysis, and experimental group 2 (EG2) represented the text as a matrix and analyzed the text data through CNN.
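
For readers unfamiliar with the vector representation used in EG1, the following sketch builds plain TF-IDF vectors from already segmented texts. It is a generic vector space model, not the specially designed VSM of the paper, and the `+ 1.0` smoothing in the IDF term as well as the function name `tfidf_vectors` are assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple


def tfidf_vectors(docs: List[List[str]]) -> Tuple[List[str], List[List[float]]]:
    """Turn tokenized documents into dense TF-IDF vectors over a shared vocabulary."""
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in docs for token in set(doc))
    idf = {token: math.log(n_docs / df[token]) + 1.0 for token in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[token] / len(doc) * idf[token] for token in vocab])
    return vocab, vectors


docs = [["transformer", "oil", "leakage"], ["breaker", "oil", "leakage"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(vectors[0])
```

Each row of `vectors` is the kind of fixed-length input a classifier such as an SVM expects; the matrix representation used in EG2 would instead keep one vector per word so that a CNN can convolve over the word sequence.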

In addition, for comparison with EG1, control group 1 (CG1) skipped the specially designed data cleaning module, and control group 2 (CG2) skipped the specially designed vector space model (VSM) in the representation module.

Likewise, for comparison with EG2, control group 3 (CG3) skipped the specially designed data cleaning module, and control group 4 (CG4) skipped the specially designed CNN in the data analysis module. During the experiments, the training time and testing time of each machine learning classifier were recorded; they reflect the offline and online computing efficiency of the data analysis module, respectively. The results are shown in Table VII.
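
The way training time, testing time, and accuracy can be recorded together is sketched below. The classifier is a toy keyword-voting model used only as a stand-in (it is neither the SVM nor the CNN from the experiments), and the tiny data set and labels are hypothetical.

```python
import time
from collections import Counter, defaultdict


class KeywordVoteClassifier:
    """Toy stand-in classifier: each token votes for the labels it was seen with."""

    def fit(self, docs, labels):
        self.votes = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            for token in tokens:
                self.votes[token][label] += 1
        # Fall back to the most frequent training label for unseen tokens.
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, docs):
        preds = []
        for tokens in docs:
            tally = Counter()
            for token in tokens:
                tally.update(self.votes.get(token, {}))
            preds.append(tally.most_common(1)[0][0] if tally else self.default)
        return preds


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


train_docs = [["oil", "leakage"], ["abnormal", "noise"], ["oil", "seepage"]]
train_labels = ["serious", "general", "serious"]
test_docs = [["oil", "stain"], ["noise"]]
test_labels = ["serious", "general"]

model, train_seconds = timed(KeywordVoteClassifier().fit, train_docs, train_labels)
preds, test_seconds = timed(model.predict, test_docs)
accuracy = sum(p == t for p, t in zip(preds, test_labels)) / len(test_labels)
print(f"accuracy={accuracy:.2%} train={train_seconds:.6f}s test={test_seconds:.6f}s")
```

Training time corresponds to the offline cost paid once, while testing time corresponds to the online cost paid for every new text, which is why the two are reported separately.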

image-20230704100351330

Comparing EG1 and EG2 shows that the deep learning model (CNN) is more accurate than the traditional machine learning model (SVM), but less efficient. Deep learning models have more parameters and can analyze features more effectively, but they require more time. Model choice therefore trades accuracy against efficiency, which makes it an important part of the specific design in practical applications.

EG2 reaches an accuracy of 97.98%. Although it takes the longest to train and test, it is still significantly more efficient than manual classification. The short text mining framework with specially designed modules can therefore effectively guide defect degree judgment and achieves satisfactory overall accuracy and efficiency.

B. Defect Handling Decision Based on Text Retrieval

For a new defect log, if an existing defect log with the same defect condition can be retrieved, the way that earlier defect was handled can serve as a reference for deciding how to handle the new one.

In practice, even if the defect conditions in two defect logs are the same, the two descriptions may differ considerably because the engineers who wrote them differ in knowledge and experience. Textual similarity therefore does not reflect consistency well, and a deeper understanding of the relationships contained in the text is required. To solve this problem, the representation module represents each defect log in a semi-structured form, and the relationships between defect logs are expressed explicitly as a knowledge graph.

The key parameters in the power knowledge graph construction (mainly the relation extraction step) are shown in Table VIII.

image-20230704103225379

The constructed knowledge graph contains 2386 nodes and 2769 edges, part of which is shown in Figure 8.

image-20230704103233166

Statistical results of defect log retrieval

image-20230704104034498

As shown in Table IX, the proposed knowledge graph-based semi-structured representation performs best on all three metrics, which indicates that the specific design of the representation module can effectively improve overall performance. The knowledge graph supports knowledge reasoning by representing relationships directly, which allows a deeper understanding of the text. To give a more intuitive explanation, two groups of defect logs from Table X are used for illustration.

image-20230704104053990

For each representation method, the consistency of the two defect logs in each group is judged, and the results are shown in Table XI.

image-20230704104115618

In Table X, A1 and A2 refer to the same defect, but their descriptions of the defective equipment and parts differ considerably. Compared with A2, A1 omits the defective equipment "transformer" and does not state whether the component "tap changer" is on-load or off-circuit. As a result, the three representation methods based on the structured form cannot identify that A1 and A2 are consistent. The knowledge graph model, however, can infer through the connections between nodes that the paths corresponding to the two defect logs are the same, as shown in Figure 9, where the gray nodes correspond to the entities marked in the defect logs and the paths corresponding to the logs are highlighted with bold edges (the same below).

image-20230704104215053

image-20230704104249171
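
The idea that two differently worded logs can resolve to the same path in the graph can be sketched with a toy hierarchy. The graph below, the defect phenomena, and the function names are hypothetical and much simpler than the 2386-node knowledge graph described above; the matching rule (a unique root-to-leaf path containing every mentioned node) is an assumption made for illustration.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

# Hypothetical toy graph: parent -> children (not the paper's knowledge graph).
GRAPH: Dict[str, List[str]] = {
    "transformer": ["tap changer", "bushing"],
    "tap changer": ["on-load tap changer", "off-circuit tap changer"],
    "on-load tap changer": ["motor mechanism fault"],
    "off-circuit tap changer": ["contact overheating"],
    "bushing": ["oil leakage"],
}
ROOT = "transformer"


def root_to_leaf_paths(node: str, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Enumerate every path from the given node down to a leaf."""
    path = prefix + (node,)
    children = GRAPH.get(node, [])
    if not children:
        yield path
    for child in children:
        yield from root_to_leaf_paths(child, path)


def resolve_path(mentions: Set[str]) -> Optional[Tuple[str, ...]]:
    """Return the unique path containing all mentioned nodes, or None if ambiguous."""
    candidates = [p for p in root_to_leaf_paths(ROOT) if mentions <= set(p)]
    return candidates[0] if len(candidates) == 1 else None


# A1 omits the equipment and the tap changer type; A2 states both.
a1 = {"tap changer", "motor mechanism fault"}
a2 = {"transformer", "on-load tap changer", "motor mechanism fault"}

same_defect = resolve_path(a1) is not None and resolve_path(a1) == resolve_path(a2)
print(resolve_path(a1), same_defect)
```

The two mention sets share only one token, so a bag-of-words comparison would rate them as dissimilar, yet both resolve to the same path once the graph fills in the missing equipment and component type.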

V. Conclusion

A text mining framework suitable for power equipment operation and maintenance is proposed. Our main innovation is to address the characteristics of short texts in power equipment operation and maintenance, and propose specific designs for each module of the framework, making the framework more suitable for text mining in the power industry. Through two case studies related to defect degree judgment and defect handling decision-making, the guiding role of short text mining framework for practical application is demonstrated. Meanwhile, the results of two case studies show that the specific design of each module is beneficial to improve the overall performance of short text mining in power equipment operation and maintenance.

Further research on short text mining for power equipment operation and maintenance can improve in two main directions. The first is to enhance the interpretability of the short text mining framework through techniques such as syntactic analysis, so that it understands text data in a way closer to human thinking. The second is to build a general data fusion model that considers all data forms, to further improve accuracy and broaden the range of applications. Both will be important directions for our future research.

Origin blog.csdn.net/qq_43537420/article/details/131530214