Abstract

The State Grid Corporation of China has accumulated a large number of maintenance records for power primary equipment. Most of these records are unstructured text, which makes them difficult to analyze and use. Natural language processing and deep learning offer a way to handle such unstructured text. This paper proposes a progressive multitype feature fusion model for Chinese named entity recognition in unstructured maintenance records of power primary equipment. First, the textual characteristics and word segmentation difficulties of maintenance records are analyzed; 7 main entity categories of power technical terms are then chosen, and 3452 maintenance records are labeled with these categories to form the EPE-MR training dataset. Second, standard test reports, standard maintenance reports, and fault analysis reports for three types of power primary equipment (main transformer, circuit breaker, and isolating switch) are used as the corpus for training character embeddings, so that the embeddings can represent the vocabulary of maintenance records. Next, a progressive multilevel radical feature extraction module is designed to obtain fine-grained semantic information in a hierarchical manner. The radical feature representation and the character embedding are then concatenated and fed to a BiLSTM module, which extracts contextual information to improve Chinese entity recognition. A CRF layer is added to model the dependencies among predicted labels and to output the optimal label sequence, from which structured data can readily be obtained. Finally, comparative experiments on the public MSRA dataset, the China People’s Daily corpus, and the EPE-MR dataset show the effectiveness of the proposed method.

1. Introduction

With the development of artificial intelligence, extracting high-level structured semantic information such as entities, attributes, and relations from natural language to support more advanced tasks in various industries has become a research hot spot [1, 2]. In the power sector, the smart grid places increasing demands on the correlation analysis and processing of the information contained in electric power big data, in particular the large number of operation and maintenance records generated during the maintenance of various types of equipment. At present, however, the management of these records is limited to electronic storage and querying of historical data. It is difficult to use the data for meaningful analysis in actual work, so the requirements of intelligent and fine-grained information management cannot be met. The main challenge is how to recognize Chinese named entities in maintenance records that exist as unstructured text. If the named entities in maintenance records of power equipment can be extracted into a uniform structured form, it becomes possible to analyze and evaluate the key information in those records (e.g., to analyze equipment lifetime). One of the most effective ways to do this is to construct knowledge graphs.

Named entity recognition (NER) is a crucial foundational step in knowledge graph construction and underlies many applications, including information extraction, question answering systems, and machine translation [3–5]. NER is also widely used in chemistry [6], medicine [7, 8], and military affairs [9]. It aims to identify entities of specific categories in a piece of unstructured text. Initially, named entity recognition mainly relied on rule-based and dictionary-based methods [10], in which linguists construct rule templates and dictionaries that are then matched against the text. These methods have two limitations. First, the rules depend on the specific language, domain, and text style, and constructing them is complicated and time-consuming. Second, different rules must be constructed for different fields, so portability is poor. Subsequently, scholars proposed statistical methods, including the hidden Markov model (HMM), maximum entropy model (MEM), support vector machine (SVM), and conditional random field (CRF) [11–14]. Nevertheless, these models depend strongly on the corpus and converge slowly.

Recently, deep learning-based approaches have enabled end-to-end named entity recognition via neural networks that no longer rely on manually defined features [15, 16]. The most commonly used method first computes character embeddings and then feeds them into a long short-term memory (LSTM) network with a conditional random field (CRF) layer [8, 17]. However, this method cannot represent the multiple meanings of a word. The emergence of BERT alleviates this problem, and BERT is widely used in Chinese medicine [18, 19], Chinese idiom recommendation [20], and Chinese sentiment classification [21]. Another drawback is that a unidirectional LSTM cannot encode information from back to front. In response, bidirectional long short-term memory (BiLSTM) [22] allows the network to capture semantic dependencies in both directions. Moreover, the radicals of Chinese characters carry meanings of their own that can provide useful information to the network, so this feature can be considered for inclusion in the network. An example is given in Figure 1. The meaning of character “” is to disappear. It is broken down into two parts, “” and “.” “” represents water or river, and “” means small. These two radicals are merged to indicate the water is small, which means that the water is gone or almost gone.

Based on the above analysis, we propose a Chinese named entity recognition method based on progressive multitype feature fusion (PMTFF) to recognize entities in unstructured maintenance records of power primary equipment. The main contributions are as follows:
(1) We choose 7 main entity categories of power technical terms from unstructured maintenance records and label 3452 unstructured maintenance records (EPE-MR dataset)
(2) We employ the standard test reports, standard maintenance, and fault analysis reports of power primary equipment to train a BERT model as the character embedding to obtain word representation ability
(3) We propose a progressive multilevel radical feature extraction module (PML-RFE) to extract valuable semantic information

2. Dataset of Power Primary Equipment Maintenance Records

The State Grid information database of power primary equipment stores basic equipment information, maintenance records, and other data. This information is of great significance to the management and maintenance of equipment, fault cause judgment, and fault differentiation analysis. However, the manual extraction method currently in use is costly, inefficient, and not conducive to statistical analysis of the data. Therefore, we propose to use a deep learning-based approach, for which a large number of training samples is indispensable. For this purpose, we construct EPE-MR, the first dataset for Chinese named entity recognition of maintenance records of power primary equipment.

2.1. Text Characteristics and Difficulties Analysis of Maintenance Records

Compared with the general Chinese text, the state grid database of power primary equipment contains many types of information, and new entities are constantly emerging, such as damaged location of equipment, manufacture factory, and user unit. Meanwhile, the complexity of faults and repair methods varies from device to device, resulting in wide variations in the text length of different maintenance records. Statistics from the existing data show that the shortest overhaul records are 11 words, and the longest can be up to 354 words.

From the collected data and experiments, we conclude that word segmentation of maintenance records is difficult for the following three reasons:
(i) The structure, completeness, and writing style of the records differ greatly because maintenance personnel have different writing habits. A common case is that key information is abbreviated. For example, “ (main voltage converter)” is abbreviated to “ (main transformer),” which creates a barrier to understanding maintenance records.
(ii) The numerical information in the text relies on contextual features for entity category judgment, such as the voltage level “110 kV,” the equipment maintenance time “2019-09-05 13:46:41,” and the equipment model “ (capacitor model) BAMH2.”
(iii) Across different maintenance records, the text describing the damaged parts and the maintenance of equipment is verbose and semantically complex, with unclear segmentation boundaries.

2.2. Data Collection and Description

We collect and label the power primary equipment maintenance records (EPE-MR) dataset from the enterprise resource planning (ERP) and power production management system (PMS) of Hubei State Grid Corporation. The labeling tool we used is Wizard Labeling Assistant [23], and six members of the group worked together to complete the annotation. In the task of named entity recognition, there are two commonly used identification systems: BIOES and BIO. We employ the BIO annotation approach to label the data. Concretely, B (Begin) represents the beginning of an entity, I (Intermediate) denotes the middle of an entity, and O (Other) represents the part that does not belong to any type of entity. In the constructed dataset, we define seven entities in power domain. Specifically, the first character is marked as “B (entity category),” the subsequent characters are marked as “I (entity category),” and other characters not related to the field are uniformly marked as O. We give an example in Table 1.
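The BIO scheme described above can be illustrated with a short sketch. It assumes entities are given as character offsets with an exclusive end, and it writes labels in the common `B-Category` / `I-Category` form; the helper name `spans_to_bio`, the sample string, and the offsets are hypothetical and are not taken from the EPE-MR annotation files.

```python
def spans_to_bio(text, spans):
    """Convert character-level entity spans to BIO labels.

    spans: list of (start, end, category) tuples, end exclusive (an assumption).
    """
    labels = ["O"] * len(text)
    for start, end, category in spans:
        labels[start] = f"B-{category}"          # first character of the entity
        for pos in range(start + 1, end):
            labels[pos] = f"I-{category}"        # remaining characters
    return labels


# Hypothetical record fragment; the offsets are illustrative only.
text = "110kV line fault"
spans = [(0, 5, "VoltageLevel")]
labels = spans_to_bio(text, spans)
```

Characters outside every span keep the label `O`, which matches the treatment of text unrelated to the seven entity categories.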

The entity classification of the existing maintenance records dataset is not complete enough, and the data volume is small, which brings challenges to the detection accuracy. To solve this issue, we construct the EPE-MR dataset, which contains seven entity categories of power terminology (i.e., VoltageLevel, EquipmentName, LineName, TransforSta, DamagePart, RepairCondition, and Time) and involves rich scenarios and a large amount of maintenance data. The entire dataset has a total of 3452 sentences, and we divide them into a training set and a test set at a ratio of 9 to 1. Figure 2 gives examples of sentences and entities in the EPE-MR dataset.

3. Proposed Method

The structure of the proposed model is shown in Figure 3. First, we get radical feature representation by the proposed progressive multilevel radical feature extraction module (PML-RFE). Then, the character embedding and radical feature representation are concatenated and sent to the BiLSTM module. Finally, the generated sequence is judged using a CRF module to obtain a globally optimal sequence.

3.1. PMTFF Module
3.1.1. Character Embedding

In natural language processing, character embedding is usually employed to map a word to a low-dimensional dense semantic space, which can effectively solve the problem of text feature sparseness in traditional machine learning methods, so that similar words in the semantic space have closer distance [24, 25].

Pretrained character embeddings help to improve the accuracy of the model. Nevertheless, as far as we know, no pretrained character embeddings have been published for Chinese power primary equipment maintenance records. Thus, we employ our own EPE-MR dataset to train character embeddings. Word2Vec and GPT are commonly used for this purpose, but both have clear limitations for Chinese named entity recognition in power equipment maintenance records. Specifically, Word2Vec produces static word embeddings, which cannot represent multiple meanings of a word, and GPT is a one-way language model, which cannot obtain the full context of a word. The BERT model alleviates these problems. It enhances the semantic representation of word vectors by considering character-level and sentence-level relational features. When this semantic knowledge is applied to named entity recognition in the power domain, it allows the model to better mine the feature information of power records and texts. Therefore, we use BERT to train character embeddings on Chinese power equipment maintenance records.

3.1.2. Progressive Radical Feature Representation

The development of Chinese characters consists of two main aspects. In terms of form, it gradually changes from graphics to strokes, pictographic to symbolic, and complexity to simplicity. In the principle of character creation, it has gone through a process from representational and ideographic to morpho-syntactic. Chinese characters are commonly construed as consisting of radicals and original basic parts. Radicals can reflect the intrinsic meaning of Chinese characters, which can provide valuable semantic knowledge for accurate model identification. Consequently, we propose to extract the characteristic representation of the radicals.

Chinese characters have different constructions and are square in shape, which makes them similar to images. We can exploit this similarity to obtain semantic knowledge using a convolutional neural network (CNN). Here, we propose a progressive multilevel radical feature extraction module (PML-RFE). Figure 4 shows the overall schematic diagram of radical feature extraction, and the specific structure of PML-RFE is given in Figure 5.

Features at different hierarchical levels contain different information. Low-level features carry rich spatial structure, while deep-level features contain valuable semantic clues. Making full use of both types of features can provide sufficient information for the network. However, if the two types of features are simply added or concatenated element by element, the result contains redundant information and ignores the complementarity between different levels of features. Therefore, we propose a progressive attention mechanism to gradually acquire the connections between different levels of features, as shown in Figure 5.

First, four feature maps of different levels are obtained by the CNN. Second, each pair of adjacent feature maps is processed in turn to obtain enhanced features. Finally, this information is merged gradually rather than stitched together all at once. More importantly, the enhanced features are obtained by the attention mechanism (AM) we designed, as shown in Figure 5(a). Specifically, the information obtained from the high-level features ($F_h^i$) through global pooling and the sigmoid activation function is used as a weight to guide the low-level features ($F_l^i$). Then, the weighted features ($F_w^i$) are merged with the high-level features and sent to the next stage. Equation (1) presents the expression of this process:

$$F_w^i = \sigma\left(\mathrm{GP}\left(F_h^i\right)\right) \otimes F_l^i, \qquad F_r^i = \mathrm{Conv}\left(F_w^i \oplus F_h^i\right), \tag{1}$$

where GP denotes global pooling, $\sigma$ is the sigmoid activation function, $i = 1, 2, 3$, Conv represents 3 × 3 convolution, $\otimes$ and $\oplus$ denote element-wise weighting and merging, respectively, and $F_r$ is the new feature map.
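The gating step described above can be sketched in plain Python on nested lists. This is an illustration under several assumptions that the text does not fix: all feature maps have already been resized to one `[C][H][W]` shape, global pooling is average pooling, the merge is element-wise addition, the levels are folded from deepest to shallowest, and the 3 × 3 convolution is omitted. The function names are hypothetical.

```python
import math


def global_avg_pool(fmap):
    """Reduce a [C][H][W] feature map to one mean value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_merge(high, low):
    """Weight low-level features by sigmoid(pool(high)), then add the high-level features."""
    weights = [sigmoid(v) for v in global_avg_pool(high)]
    merged = []
    for w, h_ch, l_ch in zip(weights, high, low):
        merged.append([[w * l + h for h, l in zip(h_row, l_row)]
                       for h_row, l_row in zip(h_ch, l_ch)])
    return merged


def progressive_fuse(levels):
    """Fold the levels one adjacent pair at a time, deepest first (assumed order)."""
    fused = levels[-1]
    for low in reversed(levels[:-1]):
        fused = attention_merge(fused, low)
    return fused


# Toy input: four levels, each with 2 channels of 2x2 values equal to the level index.
levels = [[[[float(k)] * 2] * 2 for _ in range(2)] for k in range(4)]
fused = progressive_fuse(levels)
```

Because each stage consumes the output of the previous one, the deep semantic signal reweights every shallower level in turn instead of all levels being concatenated at once.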

3.2. BiLSTM Module

In NER tasks, recurrent neural networks (RNNs) are usually used for sequence labeling. However, a traditional RNN cannot effectively handle the “long-range dependence” problem of sequence data [26, 27]. The LSTM neural network was introduced to address this problem. LSTM not only captures long-range information but also alleviates the vanishing gradient problem, so it is suitable for named entity recognition in the electric power field.

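Figure 6 presents the basic structure of LSTM, which includes a forget gate, an input gate, and an output gate. The long-term memory function is achieved by maintaining and updating the network state. The calculation procedure is shown in

$$\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right), \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right), \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}, x_t] + b_c\right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right), \\
h_t &= o_t \odot \tanh\left(c_t\right),
\end{aligned} \tag{2}$$

where $W$ is the weight matrix, $b$ is the bias vector, $\sigma$ is the activation function, $\tilde{c}_t$ is the content to be added, $c_t$ denotes the updated state at time $t$, $i_t$, $f_t$, and $o_t$ are the input gate, forget gate, and output gate at time $t$, respectively, $x_t$ is the input at time $t$, and $h_t$ is the output at time $t$.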
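As an illustration of how the three gates interact, the following sketch performs one step of the standard LSTM update in plain Python. It is not the authors' implementation: the parameter layout (one weight matrix and bias per gate, acting on the concatenation of the previous output and the current input) and the names `affine` and `lstm_step` are assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, v):
    """Compute W v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to its (W, b) pair."""
    z = h_prev + x_t  # list concatenation of previous output and current input
    f = [sigmoid(a) for a in affine(*params["f"], z)]        # forget gate
    i = [sigmoid(a) for a in affine(*params["i"], z)]        # input gate
    o = [sigmoid(a) for a in affine(*params["o"], z)]        # output gate
    c_hat = [math.tanh(a) for a in affine(*params["c"], z)]  # content to be added
    c_t = [fv * cp + iv * ch for fv, cp, iv, ch in zip(f, c_prev, i, c_hat)]
    h_t = [ov * math.tanh(cv) for ov, cv in zip(o, c_t)]
    return h_t, c_t


# Toy parameters: hidden size 1, input size 1, all weights and biases zero,
# so every gate evaluates to 0.5 and the candidate content to 0.
params = {gate: ([[0.0, 0.0]], [0.0]) for gate in "fioc"}
h_t, c_t = lstm_step([1.0], [0.0], [2.0], params)
```

With these toy parameters the forget gate halves the previous state, which makes the role of the state update easy to check by hand.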

The output prediction of LSTM is unidirectional and cannot process context information. However, the recognition of Chinese entities in power primary equipment maintenance records needs to be judged based on the status information before and after.

BiLSTM is a combination of forward LSTM and backward LSTM. Figure 7 gives the framework of BiLSTM. The execution steps are as follows: first, it calculates the input sequence in order and reverse order to obtain two different hidden layer representations. Then, the final hidden layer feature representation is obtained through vector splicing. By using BiLSTM, context relationships can be captured to improve the accuracy of named entity recognition.
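The two execution steps can be sketched independently of the cell itself. In the example below a running sum stands in for the LSTM cell so that the result can be verified by hand; the function names are hypothetical, and a real model would pass an LSTM step function instead.

```python
def run_direction(inputs, step, state):
    """Apply a recurrent step over the inputs, collecting one output per position."""
    outputs = []
    for x in inputs:
        state = step(x, state)
        outputs.append(state)
    return outputs


def bidirectional(inputs, step, init_state):
    """Run the sequence in order and in reverse, then concatenate per position."""
    forward = run_direction(inputs, step, init_state)
    # Reverse the backward outputs so that index t refers to the same position.
    backward = run_direction(inputs[::-1], step, init_state)[::-1]
    return [f + b for f, b in zip(forward, backward)]  # vector splicing


def running_sum(x, state):
    """Toy recurrence standing in for an LSTM cell: a 1-element running sum."""
    return [state[0] + x]


hidden = bidirectional([1.0, 2.0, 3.0], running_sum, [0.0])
```

Each position ends up with a vector twice the size of a single direction's hidden state: the first half summarizes everything up to that position and the second half everything from that position onward.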

3.3. CRF Module

Considering the correlation between adjacent labels, we add a conditional random field (CRF) [28] inference layer after the BiLSTM network to obtain an optimal prediction sequence through the relationship between adjacent labels. The CRF algorithm steps are as follows:

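Assuming that X = {x1, x2, …, xn} is the complete expansion of the input sequence vector, the score of the predicted sequence Y = {y1, y2, …, yn} is shown in

$$s(X, Y) = \sum_{i=1}^{n-1} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}, \tag{3}$$

where $A$ is the transfer score matrix, $A_{ij}$ represents the transfer score from label $i$ to label $j$, matrix $P$ is the output of the BiLSTM layer, $P_{ij}$ represents the output score of the $i$-th word under the $j$-th label, $n$ represents the sequence length, and $k$ represents the number of labels. The probability of predicting sequence $Y$ is calculated by

$$p(Y \mid X) = \frac{e^{s(X, Y)}}{\sum_{\tilde{Y} \in Y_X} e^{s(X, \tilde{Y})}}, \tag{4}$$

where $Y$ represents the real marker sequence, $\tilde{Y}$ ranges over candidate sequences, and $Y_X$ is the set of all possible marker sequences. During training, the log-likelihood of the correct label sequence is maximized, as shown in

$$\log p(Y \mid X) = s(X, Y) - \log \sum_{\tilde{Y} \in Y_X} e^{s(X, \tilde{Y})}. \tag{5}$$

The output sequence with the maximum score after decoding is shown in

$$Y^{*} = \arg\max_{\tilde{Y} \in Y_X} s(X, \tilde{Y}). \tag{6}$$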
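Decoding the maximum-score sequence is usually done with the Viterbi algorithm rather than by enumerating every label sequence. The sketch below assumes the sequence score is the sum of per-position output scores `P[i][y_i]` and transfer scores `A[y_i][y_{i+1}]` between adjacent labels, without separate start or end labels; the toy score values are invented for illustration.

```python
from itertools import product


def sequence_score(P, A, y):
    """Output scores of the chosen labels plus transfer scores between neighbours."""
    emit = sum(P[i][y[i]] for i in range(len(y)))
    trans = sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return emit + trans


def viterbi(P, A):
    """Return the highest-scoring label sequence by dynamic programming."""
    n, k = len(P), len(P[0])
    best = list(P[0])   # best[j]: best score of any path ending in label j
    back = []           # back[t][j]: best previous label when position t+1 has label j
    for t in range(1, n):
        prev_best, pointers, best_new = best, [], []
        for j in range(k):
            i_star = max(range(k), key=lambda i: prev_best[i] + A[i][j])
            pointers.append(i_star)
            best_new.append(prev_best[i_star] + A[i_star][j] + P[t][j])
        best = best_new
        back.append(pointers)
    last = max(range(k), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(back):  # follow back-pointers from the last position
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy scores: 3 positions, 2 labels; staying on the same label is rewarded.
P = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
A = [[0.5, -1.0], [-1.0, 0.5]]
path = viterbi(P, A)

# Brute-force check over all 2**3 label sequences.
brute = max(product(range(2), repeat=3), key=lambda y: sequence_score(P, A, y))
```

In this toy case the output scores alone would favour label 1 at the middle position, but the transfer scores make the constant sequence the best overall, which is the kind of label dependency the CRF layer is meant to capture.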

4. Experiments

4.1. Datasets and Evaluation Index

We perform experiments on our own constructed power primary equipment maintenance records (EPE-MR) dataset, the MSRA [29] dataset, and the China People’s Daily corpus.
(1) EPE-MR is a dataset we made for maintenance records of power primary equipment. There are 3452 sentences and 7 types of entities, which are VoltageLevel, EquipmentName, LineName, TransforSta, DamagePart, RepairCondition, and Time. As far as we know, this is the first Chinese entity recognition dataset for the subject.
(2) The MSRA dataset is a public Chinese entity recognition dataset. Specifically, there are 46364 and 4365 sentences in the training set and test set, respectively. The entity types included are person, organization, and location.
(3) The China People’s Daily corpus contains the same types of entities as MSRA. The dataset provides sufficient data information, with 17573 and 1718 sentences in the training and test sets, respectively.

The evaluation indexes used in the experiment are precision (P), recall (R), and F1. The calculation methods of each evaluation index are as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R}, \tag{7}$$

where TP is the number of entities that are correctly classified, FN is the number of entities that are actually related entities but not recognized by the model, and FP is the number of entities that are actually unrelated entities but the model judges them to be entities.
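A small sketch shows how the three indexes can be computed from gold and predicted entities. It assumes exact-match scoring, in which a predicted entity counts as correct only when its boundaries and category both match a gold entity; the text does not state the matching criterion, and the entity tuples below are invented for illustration.

```python
def prf1(gold, predicted):
    """Entity-level precision, recall and F1 from sets of (start, end, category)."""
    tp = len(gold & predicted)   # predicted entities that match a gold entity
    fp = len(predicted - gold)   # predicted entities with no gold match
    fn = len(gold - predicted)   # gold entities the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted entities for one record.
gold = {(0, 5, "VoltageLevel"), (6, 10, "EquipmentName"), (12, 15, "Time")}
pred = {(0, 5, "VoltageLevel"), (6, 10, "LineName")}
p, r, f1 = prf1(gold, pred)
```

Here one of two predictions is correct and one of three gold entities is found, giving P = 0.5, R = 1/3, and F1 = 0.4; the entity with the right boundaries but the wrong category counts as both a false positive and a false negative.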

4.2. Ablation Study

We implement the proposed model with PyTorch in Python 3.6. We use 90% of the data as the training set and the rest as the test set. In the training process, we use the Adam optimizer with a learning rate of 0.001. Meanwhile, LSTM_dim is set to 300, batch_size to 32, and max_seq_len to 300. To prevent overfitting, Dropout is set to 0.5.

We perform ablation analysis on the EPE-MR dataset. Table 2 gives the performance analysis of different modules. The variations of F1 values for different modules on the test set are presented in Figure 8. Specifically, the F1 of our approach is 88.3%. In terms of the comprehensive evaluation index F1, the proposed method is 14.2%, 9.9%, 12.1%, 8.8%, 7.6%, and 2.1% higher than the other combinations. There are two main reasons for the success of the proposed approach: (i) the proposed method enhances the semantic representation by fully considering the character-level, word-level, and radical-level relationship features; (ii) the two-way structure helps the network to obtain contextual information.

For a more visual observation, the entity recognition results are given in Figure 9. Specifically, the left side of the figure shows the entity label types, and the right side presents the corresponding classification results. It can be seen that the proposed method has a good word separation effect on professional vocabulary in the electric power domain.

4.3. Comparison with State-of-the-Art Approaches

To evaluate the proposed method more comprehensively, we compare it with Dong et al. [30], Lattice-LSTM-CRF [31], Cao et al. [32], WC-LSTM [33], and CNN-BiLSTM-CRF [34] on the MSRA dataset, as shown in Table 3. Specifically, Lattice-LSTM-CRF replaces the traditional LSTM unit with a lattice LSTM and explicitly uses word and word order information to avoid propagating word segmentation errors. WC-LSTM employs word information to enhance semantic information and reduce the impact of word segmentation errors. These models remain limited to extracting character and word features and cannot characterize both kinds of meaning simultaneously. In contrast, the proposed approach addresses this problem well. It learns not only the phrase-level information representation but also rich semantic features.

As can be seen from Table 3, the F1 value of the proposed method reaches 96.63%, which is 5.68%, 3.45%, 5.99%, 2.89%, and 5.54% higher than that of Dong et al. [30], Lattice-LSTM-CRF [31], Cao et al. [32], WC-LSTM [33], and CNN-BiLSTM-CRF [34], respectively. The bar graphs of each quantitative indicator for the different methods are given in Figure 10. On the whole, the proposed approach has certain advantages.

Table 4 gives a quantitative comparison of Collobert et al. [35], Chiu and Nichols [36], Shen et al. [37], Lample et al. [38], and the proposed method on China People’s Daily corpus. The F1 value of the proposed method reaches 96.54%. Figure 11 depicts the bar graph of quantitative analysis metrics for different models. From the comprehensive view of the three evaluation metrics, the proposed approach shows the best performance. Moreover, during the experiment, we found that the performance of the proposed model will decrease when it encounters typos.

5. Conclusion

In this paper, we propose a Chinese named entity recognition method based on progressive multitype feature fusion for maintenance records of power primary equipment. First, a dataset for Chinese named entity recognition of power primary equipment maintenance records is collected and labeled. Second, contextualized character vectors are obtained using the pretrained BERT model. Third, radical feature representations are extracted by the PML-RFE module, and these features are concatenated with the character embeddings and sent to the BiLSTM module. Finally, a CRF layer is added to improve the recognition effect. The experimental results on the EPE-MR, MSRA, and China People’s Daily datasets show that the proposed method achieves superior results compared with other existing methods.

Data Availability

Our work is still in progress, so we cannot disclose the data for the time being.

Conflicts of Interest

The authors declare that there are no conflicts of interest.