Abstract
Risk analysis, as an important prerequisite of risk management, is critical to reducing occupational injuries and other related losses. However, suffering greatly from incomplete hazard identification and inaccurate probability analysis, risk analysis is considered the weakest link in risk management, which seriously affects risk evaluation and control in complex workplaces. To improve the performance of hazard identification and analysis, a data-driven risk analysis approach is established, which consists of an improved equivalent class transformation (Eclat) algorithm, a sliding window model, and a change pattern mining algorithm. Through this approach, a large number of historical hazard records are transformed into association rules composed of object keywords and deviation keywords, and information such as potential keyword combinations, conditional probabilities of potential deviations, and the change pattern of potential hazards can be extracted. The function of the approach is threefold. Firstly, the data-driven risk analysis process is designed to identify the association rules between different hazard keywords. Secondly, Eclat algorithm is optimized to calculate the frequency and probability of potential hazards, which is conducive to improving the accuracy of probability estimation. Thirdly, the change pattern is developed to analyse the hazard change trend to support the cause analysis. A practical application in a Chinese hazardous chemical manufacturer is presented. Case studies have shown that the efficiency of the improved algorithm is increased by 13.68%, and 59.66% of potential hazards can be identified in advance, and relevant information can be extracted to support risk analysis.
1. Introduction
Preventing and mitigating accidents, protecting employees from occupational injuries, and protecting the environment from damage is a major goal of all industries and superintendents [1]. Risk analysis, as an important prerequisite of risk management, is critical to reducing injuries and other related losses [2]. However, suffering greatly from incomplete hazard identification and inaccurate probability analysis, risk analysis is considered the weakest link in the risk management process, which seriously affects risk evaluation and control in complex workplaces [3, 4]. Improving the ability of superintendents to identify and analyse hazards in complex workplaces will be necessary to address the issues raised by ongoing accident occurrences [5].
To achieve this goal, a large number of methods have been designed and used, which can be divided into administrative and technological methods [6]. Administrative methods mainly rely on the experience of superintendents and historical records, such as hazard and operability analysis [7], failure mode and effects analysis [8, 9], layer of protection analysis [10, 11], job hazard analysis [12], safety checklists [13], and fault tree analysis [14]. In recent years, many researchers have realized that the results of administrative methods are not accurate enough because the hazards are analysed using fuzzy and random information [15]. Meanwhile, as workplaces become more complex, it is increasingly difficult for superintendents to identify hazards completely. The workplace has reached saturation with respect to traditional safety strategies that were originally implemented to comply with regulations [12].
In recent years, technological methods represented by data mining have helped solve specific problems in various fields of safety engineering. The relevant articles can be divided into two types: those focusing on textual data and those focusing on numerical data [16]. Textual data mainly consist of accident and incident records. Tixier analysed textual data of construction accidents and identified combinations of accident causes by utilizing natural language processing (NLP) [17]. Brown and Yang used latent Dirichlet allocation (LDA) and text-based Bayesian network (TBN) to analyse railway safety text data to support typical accident risk analysis [18, 19]. Marucci-Wellman studied a machine learning method for classifying narrative text, which is used to classify a large number of occupational injury and illness events [20]. Mutlu and Altuntas integrated FMEA, FTA, and BIFPET methods to analyse occupational safety and health risks [21–23]. Robinson applied the latent semantic analysis method to analyse the main causal factors in safety narratives [24].
Due to the lack of standard terms and formats, there are few algorithms that can analyse textual data directly. For this reason, in some articles, textual data is transformed into keyword combinations or numerical data so that a wider range of numerical data mining algorithms can be applied. Lu extracted reciprocating compressor information containing failure symptoms and causes of failed components from technical journal articles and analysed the association rules between monitoring parameters and maintenance records [25]. Xu used the Apriori algorithm to analyse the main factors leading to road traffic accidents and their association relationships [26]. Song applied a Bayesian network to explicitly explore statistical associations between crash severity and significant variables [27]. In addition, grey comprehensive correlation degree, artificial neural network (ANN), random forests (RF), K-nearest neighbour (KNN), hidden Markov model (HMM), and other algorithms are also used to analyse spare parts [28], turn-back faults in urban rail [29], causes of traffic accidents [30], human unsafe factors [31], and so on.
The above-mentioned articles show that converting textual data into keywords and applying association rule algorithms for further analysis is an effective way to obtain risk information. However, these methods still have weaknesses and cannot be directly applied to workplace hazard identification. Firstly, the existing articles mainly address specific scenarios or specific types of accidents, such as reciprocating compressor failures and road traffic accidents, whose common feature is that the contributing factors are few and fixed. Complex workplaces contain many managed objects, including equipment, facilities, personnel, and processes, which leads to complex and diverse hazards. It would be time-consuming to analyse and identify all types of hazards one by one using the above-mentioned methods. Therefore, the research object should be extended to multiple types of hazards, and as much information on potential hazards as possible should be obtained through the association relationships. Secondly, the importance and frequency of hazards differ considerably, which is often ignored in existing studies. Traditional association rule algorithms tend to ignore hazards with high consequences and low frequency, so an improved algorithm needs to be developed. Thirdly, workplace hazards change dynamically, unlike the contributing factors of specific accidents in the above-mentioned articles. It is necessary to design an algorithm for analysing hazard trends.
The aim of this article is to apply data mining algorithms to improve the ability of superintendents to identify and analyse hazards in complex workplaces. In this article, hazard identification and analysis can be seen as building and analysing a complex network. Hazard records can be simplified as the combination of several keywords. For example, “the pipeline is corroded” can be decomposed into “pipeline” and “corroded.” If “pipeline” and “corroded” are represented by two nodes, “the pipeline is corroded” can be represented by two nodes and their connecting lines. Several nodes and their connecting lines are combined to form a complex network that can reflect the hazard information of the workplace. The purpose of hazard identification can be regarded as finding potential nodes through known nodes. Hazard analysis is to calculate the probability of potential nodes and predict the change trend of these nodes. Motivated by these ideas, a data-driven risk analysis approach is designed, which consists of an improved equivalent class transformation (Eclat) algorithm, sliding window model, and change pattern mining algorithm, to discover the nodes, connections, and their changes.
The function of this approach is threefold. Firstly, the data-driven risk analysis process is designed to identify the association rules between different hazard keywords and solve the problem of incomplete hazard identification caused by insufficient experience. Secondly, Eclat algorithm is optimized to calculate the frequency and probability of potential hazards, which is conducive to improving the accuracy of probability estimation. Thirdly, the change pattern is developed to analyse the hazard change trend to support the cause analysis.
The article is structured as follows. Section 2 presents the research background about Eclat algorithm and sliding window model. Section 3 presents the workflow and main links of the data-driven risk analysis approach. Section 4 is a case study about a Chinese hazardous chemical manufacturer, which is used to analyse the practicality of the approach. Conclusions and future work are given in Section 5.
2. Related Work
2.1. Equivalent Class Transformation Algorithm
Association rules are proposed to mine the relationships between elements in a large database and are usually expressed in the form $X \Rightarrow Y$ [32]. Association rule mining is divided into two stages. First, all frequent item sets meeting the minimum support threshold are found. Then, the rules meeting the minimum confidence threshold are identified from the frequent item sets. Support is the co-occurrence probability of elements, that is, the probability that element $X$ and element $Y$ appear together (equation (1)). Confidence is the conditional probability, that is, the probability that element $Y$ exists if element $X$ exists (equation (2)).
$$\mathrm{support}(X \Rightarrow Y) = P(X \cup Y) \tag{1}$$
$$\mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)} \tag{2}$$
In this article, the association rule mining algorithm is used to analyse the association relationships between different hazard keywords and to calculate the probability of keywords to support risk analysis.
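As a minimal illustration of support and confidence, the sketch below computes both measures over a handful of keyword sets. The records and the function names `support` and `confidence` are hypothetical and are not taken from the case study.

```python
def support(records, items):
    """Fraction of records that contain every keyword in `items`."""
    items = set(items)
    return sum(items <= set(record) for record in records) / len(records)


def confidence(records, antecedent, consequent):
    """Conditional probability of `consequent` given `antecedent`."""
    joint = support(records, set(antecedent) | set(consequent))
    return joint / support(records, antecedent)


# Hypothetical hazard records, each reduced to a set of keywords.
records = [
    {"pipeline", "corroded"},
    {"pipeline", "corroded"},
    {"pipeline", "leaking"},
    {"valve", "corroded"},
]

sup = support(records, {"pipeline", "corroded"})        # 2 of 4 records
conf = confidence(records, {"pipeline"}, {"corroded"})  # 2 of 3 pipeline records
print(sup, round(conf, 2))
```

Here "pipeline" and "corroded" co-occur in half of the records, while two of the three pipeline records carry the deviation "corroded", so the confidence of the rule is about 0.67.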
The Eclat algorithm is used to mine the association relationships in hazard records. Eclat is an algorithm that uses the vertical data format to search for frequent items [33]. Taking hazard identification as an example, seven hazards were identified, and the corresponding keywords were extracted. The data formats of Tables 1 and 2 are called the horizontal data format and the vertical data format, respectively.
Based on the vertical data format, the hazard sets are divided into subsets by using the equivalence relationship of keyword ID. The bottom-up search method is used in each subset to generate frequent items independently. Figure 1 shows the search path when the minimum support threshold is 3. Since there is no need to calculate and scan the candidate frequent items, Eclat is suitable for analysing hazard datasets, which include many types and quantities.
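The conversion to the vertical format and the bottom-up search can be sketched as follows. This is a simplified illustration with a single minimum support count; the five records are hypothetical and are not the data of Tables 1 and 2, and the function names are illustrative only.

```python
from collections import defaultdict


def to_vertical(horizontal):
    """Map each keyword to the set of hazard IDs that contain it."""
    vertical = defaultdict(set)
    for hazard_id, keywords in horizontal.items():
        for keyword in keywords:
            vertical[keyword].add(hazard_id)
    return dict(vertical)


def eclat(vertical, min_count):
    """Depth-first search that extends a prefix by intersecting ID sets."""
    frequent = {}

    def extend(prefix, prefix_ids, candidates):
        for i, (keyword, ids) in enumerate(candidates):
            new_ids = prefix_ids & ids if prefix else ids
            if len(new_ids) >= min_count:
                itemset = prefix + (keyword,)
                frequent[itemset] = len(new_ids)
                # Only later keywords are tried, so each set is visited once.
                extend(itemset, new_ids, candidates[i + 1:])

    extend((), set(), sorted(vertical.items()))
    return frequent


# Hypothetical horizontal data: hazard ID -> keywords.
horizontal = {
    1: {"A", "B"},
    2: {"A", "B", "C"},
    3: {"A", "C"},
    4: {"B", "C"},
    5: {"A", "B", "C"},
}
frequent = eclat(to_vertical(horizontal), min_count=3)
print(frequent)
```

The support of a longer keyword set is obtained purely by intersecting the ID sets of its parts, which is why the original records do not have to be scanned again for each candidate.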

Furthermore, in complex workplaces, the number and frequency of keywords are quite different. Some keywords, especially those related to hazards with high severity and low frequency, will be missed by the traditional algorithm. To overcome this defect, the concept of multiple minimum support thresholds needs to be introduced, which can improve the number of association rules by reducing the minimum support threshold of specific keywords.
2.2. Sliding Window Model
In safety engineering, the objects of data mining can be divided into static data and stream data. Hazard identification data can be classified as stream data. Stream data is a large, continuous, fast, and time-varying data sequence. Let $t_i$ be the time stamp and $d_i$ be the data collected at $t_i$; then the stream data can be expressed as $\{(t_1, d_1), (t_2, d_2), \ldots, (t_i, d_i), \ldots\}$.
Stream data is unbounded, and only a limited part of it can be processed, mined, and analysed at a time. Thus, the definition of the data stream window is introduced [34]. The sliding window model is a typical model for data stream processing. Its core idea is that when processing the data stream, users often care most about the recent data. Let the window size be $w$, which can be one second, one minute, or one day, and the time point be $t$; then the range of the window is $[t - w, t]$. When the time point changes in sequence, the window slides, as shown in Figure 2.

Based on the sliding window model, the historical hazard records can be analysed to obtain the hazard information of the current workplace. Through the comparison of hazard records in different stages, the change of hazards will also be pointed out.
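A minimal sketch of how time-stamped hazard records might be split into two adjacent windows is given below. The function name `split_stages`, the half-open window boundaries, and the sample records are assumptions made for illustration; the article does not specify these details.

```python
from datetime import date, timedelta


def split_stages(records, now, window):
    """Split time-stamped records into the two most recent windows.

    records: list of (timestamp, record) pairs.
    Stage 1 covers [now - window, now); stage 2 covers the window before it.
    """
    stage1 = [r for t, r in records if now - window <= t < now]
    stage2 = [r for t, r in records if now - 2 * window <= t < now - window]
    return stage1, stage2


# Hypothetical records; h0 is older than both windows and is dropped.
records = [
    (date(2018, 1, 1), "h0"),
    (date(2019, 8, 1), "h1"),
    (date(2020, 2, 1), "h2"),
    (date(2020, 5, 1), "h3"),
]
stage1, stage2 = split_stages(records, date(2020, 7, 1), timedelta(days=182))
print(stage1, stage2)
```

Moving `now` forward slides both windows together, so the most recent window always describes the current workplace and the older one serves as the reference for comparison.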
3. Methodology
The data-driven risk analysis approach can be divided into three parts: association rule mining, change pattern mining, and risk analysis and confirmation, as shown in Figure 3. Firstly, NLP is applied to transform textual hazard records into combinations of hazard keywords, and the improved Eclat algorithm is used to mine the association rules between these keywords. Then the association rules in different stages are divided into two groups, and the hazard change patterns, such as increase pattern, decrease pattern, and new pattern, are identified. Finally, the target object is mapped to the association rule set and the change pattern set; the potential hazards and their conditional probabilities are extracted; and the change trend of the potential hazards is also reported. In practice, the data-driven risk analysis approach can be carried out by an information system. As long as the object or scope is clear, the corresponding risk information is reported automatically.

3.1. Association Rule Mining
3.1.1. Data Transformation
Data transformation is the process of transforming textual hazard records into structured data. In this process, each hazard record will be transformed into several keywords, and each keyword will be assigned an index and frequency. In this way, the textual hazard records can be quantitatively analysed.
There are various types of hazards, and the ways of hazard description are also different. In China, the interim provisions on the investigation and control of safety accidents require enterprises to carry out hazard identification, focusing on the dangerous state of equipment and facilities, human unsafe behaviours, and management defects. Standardization of enterprise safety production (GB/T 33000-2016) provides enterprises with a standardized list to guide enterprises to carry out this work. Its core idea is to identify and eliminate the deviation between equipment, facilities, personnel behaviours, management, other objects, and the standard state. Figure 4 shows the description of the four types of hazards. By analysing the content of the text, it can be found that the keywords of different hazards can be divided into entities, attributes, and deviations. Entities include equipment, facilities, workplaces, and so on. Attributes include process parameters, hazard factors, and so on. These two kinds of keywords are the objects of hazard identification and risk analysis. Deviation refers to the deviation of the object from the safety standard, such as defect, failure, low, exceed, and so on.

Therefore, a hazard record can be transformed into one or more object keywords and one deviation keyword. If there are multiple deviation keywords in a hazard record, it can be divided into multiple combinations. To improve the accuracy of hazard analysis, these keywords need to be standardized. For example, “grounding wire” and “earth wire” are transformed to “ground lead”; “exceed the standard” and “higher than the threshold” are transformed to “more.” The expression is as follows:
$$h = \{o_1, o_2, \ldots, o_n, d\}$$
where $o_1, \ldots, o_n$ are the object keywords and $d$ is the deviation keyword of the hazard record $h$.
Several hazards are sorted according to the identified time, and the hazard stream data is formed. According to the sliding window model, the stream data are divided into two datasets for further analysis [35]. As is shown in Table 3, stage 1 is closer to the current period, and the operating conditions and hazards in this stage are also similar to the actual situation. Therefore, the hazards identified in stage 1 can be used to analyse the potential hazards that may exist in the current stage. By comparing the difference between the hazards identified in stages 1 and 2, the trend of hazard changes can be discovered.
THU Lexical Analyzer for Chinese (THULAC) is used to decompose hazard records, and a self-made Safety Engineering Terminology Dictionary (SETD) is used to extract and standardize keywords. All these works will be done automatically by Python.
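The standardization and splitting steps can be sketched as follows, assuming the record has already been segmented into raw keywords (the THULAC step is omitted). The two small dictionaries are a hypothetical excerpt standing in for the SETD, whose actual contents are not given in the article.

```python
# Hypothetical excerpt of a terminology dictionary: variant -> standard keyword.
SYNONYMS = {
    "grounding wire": "ground lead",
    "earth wire": "ground lead",
    "exceed the standard": "more",
    "higher than the threshold": "more",
}
# Hypothetical vocabulary of standardized deviation keywords.
DEVIATIONS = {"more", "corroded", "missing"}


def standardize(keywords):
    """Replace each known variant with its standard keyword."""
    return [SYNONYMS.get(keyword, keyword) for keyword in keywords]


def to_combinations(keywords):
    """Return one (object keywords, deviation keyword) pair per deviation."""
    standard = standardize(keywords)
    objects = tuple(k for k in standard if k not in DEVIATIONS)
    return [(objects, d) for d in standard if d in DEVIATIONS]


combos = to_combinations(["earth wire", "corroded", "missing"])
print(combos)
```

A record with two deviation keywords yields two combinations that share the same object keywords, which matches the splitting rule described above.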
3.1.2. Improved Eclat Algorithm
The Eclat algorithm is applied to mine the association relationships between the object keywords $O$ and the deviation keywords $D$, and association rules of the form $O \Rightarrow D$ are obtained with the following support and confidence:
$$\mathrm{support}(O \Rightarrow D) = P(O \cup D) \tag{5}$$
$$\mathrm{confidence}(O \Rightarrow D) = P(D \mid O) = \frac{P(O \cup D)}{P(O)} \tag{6}$$
Equation (5) reflects the probability of $O \cup D$ in the workplace. Through this equation, the frequency of occurrence of $O \cup D$ can be calculated, which can be used to assist the superintendent in determining the key objects. Equation (6) reflects the probability of $D$ among all hazards related to $O$. Different methods and tools are needed to identify different deviations. Equation (6) can be used to assist superintendents in formulating identification strategies in advance.
However, in the same workplace, the frequency and severity of different hazards are different. The higher the severity of the hazards, the lower the frequency, which leads to lower support for the corresponding keywords. In association rule mining, if the minimum support threshold is set too high, some important keywords with lower support will not be extracted. Conversely, if the minimum support threshold is set too low, the frequent keyword sets will increase dramatically, which reduces the density of hazard information and increases the difficulty of hazard analysis.
In response to this phenomenon, the concept of multiple minimum support thresholds is proposed to improve the Eclat algorithm. The definition is as follows.
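Definition 1. Let the keyword set of all hazard records be represented as $I = \{i_1, i_2, \ldots, i_n\}$, and use $\mathrm{minsup}(i_1), \ldots, \mathrm{minsup}(i_n)$ to represent their minimum support thresholds. Then for a keyword set $X$, when $X \subseteq I$ and $X \neq \emptyset$, the minimum support threshold of $X$ is equal to the minimum of the minimum support thresholds of all keywords in $X$. The expression is as follows:
$$\mathrm{minsup}(X) = \min_{i \in X} \mathrm{minsup}(i)$$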
Definition 2. Let $\mathrm{minconf}$ be the minimum confidence threshold. The conditions of the association rule $O \Rightarrow D$ can be expressed as follows:
$$\mathrm{support}(O \Rightarrow D) \geq \mathrm{minsup}(O \cup D), \qquad \mathrm{confidence}(O \Rightarrow D) \geq \mathrm{minconf}$$
where $\mathrm{minsup}(O \cup D)$ is the minimum support threshold of the keyword set as given by Definition 1.
After setting the multiple minimum support thresholds, the recursive property of the Eclat algorithm is destroyed, which means that the subsets of frequent items may not meet their own minimum support thresholds. Taking Tables 1 and 2 as an example, let the minimum support thresholds of A, B, C, and D be 2, 6, 5, and 6, respectively; then {A}, {B}, {D}, {A, B}, and {A, D} can be identified by the Eclat algorithm, as shown in Figure 5(a). Although {A, C}, {A, B, C}, {A, B, D}, {A, C, D}, and {A, B, C, D} also meet their minimum support thresholds, they are omitted by the Eclat algorithm, mainly because their subsets, such as {C}, {B, C}, {B, D}, {C, D}, and {B, C, D}, do not meet their minimum support thresholds.
For the above situation, the 1-items are sorted in ascending order of their minimum support thresholds. Then the n-items (n ≥ 2) with the same prefix have the same minimum support threshold, which is equal to the minimum support threshold of the corresponding prefix. As shown in Figure 5(b), for example, every keyword set with the prefix A has the minimum support threshold of A, which is 2. In each of these branches, the subsets of frequent n-items (n ≥ 2) are also frequent, so the recursive property is restored.
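The search process is as follows:
Step 1: traverse the hazard dataset and construct the vertical data format, as shown in Table 2.
Step 2: according to the multiple minimum support thresholds, sort the keywords in ascending order and construct the data space, as shown in Figure 5(b).
Step 3: search each of these branches to obtain the keyword sets that meet their minimum support thresholds.
The search process can be summarized as follows: for any keyword sets $X$ and $Y$ with $X \subset Y$, when $X$ and $Y$ have the same prefix, if $X$ does not meet its minimum support threshold, then $Y$ does not meet its minimum support threshold either. The proof is as follows:
Set $X = \{i_1, \ldots, i_k\}$ and $Y = \{i_1, \ldots, i_k, \ldots, i_m\}$ with $k < m$; then we can deduce $X \subset Y$.
Because $X$ and $Y$ have the same prefix and the 1-items are arranged in ascending order of their minimum support thresholds, $\mathrm{minsup}(X) = \mathrm{minsup}(Y)$, which means the keyword sets with the same prefix have the same minimum support threshold.
It can be inferred from $X \subset Y$ that $\mathrm{sup}(X) \geq \mathrm{sup}(Y)$, because every hazard record containing $Y$ also contains $X$. Then, when $X$ does not meet the minimum support threshold, $\mathrm{sup}(Y) \leq \mathrm{sup}(X) < \mathrm{minsup}(X) = \mathrm{minsup}(Y)$. In other words, $Y$ does not meet the minimum support threshold.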
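The three steps can be sketched in Python as follows. This is an illustrative reading of the procedure rather than the authors' implementation; the function name `eclat_multi_minsup` is hypothetical, and the seven-record vertical data below are invented (they are not the data of Tables 1 and 2) so that the thresholds 2, 6, 5, and 6 for A, B, C, and D reproduce the pattern of the example above.

```python
def eclat_multi_minsup(vertical, minsup):
    """Eclat with one minimum support count per keyword.

    vertical: keyword -> set of hazard IDs containing it.
    minsup: keyword -> minimum support count of that keyword.
    """
    # Step 2: ascending threshold order, so the first keyword of a prefix
    # always carries the smallest threshold of the whole keyword set.
    order = sorted(vertical, key=lambda kw: (minsup[kw], kw))
    frequent = {}

    def extend(prefix, ids, rest, threshold):
        for i, kw in enumerate(rest):
            new_ids = ids & vertical[kw]
            # One threshold per branch, so pruning inside a branch is safe.
            if len(new_ids) >= threshold:
                itemset = prefix + (kw,)
                frequent[frozenset(itemset)] = len(new_ids)
                extend(itemset, new_ids, rest[i + 1:], threshold)

    # Step 3: search each branch with the threshold of its first keyword.
    for i, kw in enumerate(order):
        ids = vertical[kw]
        if len(ids) >= minsup[kw]:
            frequent[frozenset([kw])] = len(ids)
            extend((kw,), ids, order[i + 1:], minsup[kw])
    return frequent


# Step 1 result (hypothetical): vertical data for seven hazards.
vertical = {
    "A": {2, 3},
    "B": {1, 2, 3, 4, 5, 6},
    "C": {2, 3, 4, 5},
    "D": {2, 3, 4, 5, 6, 7},
}
minsup = {"A": 2, "B": 6, "C": 5, "D": 6}
frequent = eclat_multi_minsup(vertical, minsup)
print(sorted("".join(sorted(s)) for s in frequent))
```

With these data, {C} and {B, D} fail their own thresholds, yet {A, C} and {A, B, C, D} are still found because they are reached through the branch of A, whose threshold is 2.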

3.2. Change Pattern Mining
Change pattern mining is a method to analyse the change trend of the research object by comparing the differences of the association rules in two stages [36]. This method has been successfully applied in supermarket sales to analyse changes in customer consumption patterns [37]. At present, the defined change patterns mainly include added patterns, subtractive patterns, emerging patterns, and unexpected patterns, which reflect different changes. However, the change patterns of hazard and consumption are not consistent. The change patterns need to be redefined.
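The hazard change patterns can be divided into growth pattern, decrease pattern, new deviation pattern, new object pattern, and new hazard pattern. Related terms and symbols are defined as follows:
$r^{1}_{i}$ is an association rule in stage 1; $o^{1}_{i}$ and $d^{1}_{i}$ are the object and deviation in $r^{1}_{i}$; and the support of $r^{1}_{i}$ is $\mathrm{sup}(r^{1}_{i})$
$r^{2}_{j}$ is an association rule in stage 2; $o^{2}_{j}$ and $d^{2}_{j}$ are the object and deviation in $r^{2}_{j}$; and the support of $r^{2}_{j}$ is $\mathrm{sup}(r^{2}_{j})$
$s_{ij}$ is the similarity of $r^{1}_{i}$ and $r^{2}_{j}$
$s^{o}_{ij}$ is the similarity of $o^{1}_{i}$ and $o^{2}_{j}$
$s^{d}_{ij}$ is the similarity of $d^{1}_{i}$ and $d^{2}_{j}$
$m_{i}$ is the maximum of $s_{ij}$, where $m_{i} = \max_{j} s_{ij}$
$m^{o}_{i}$ is the maximum of $s^{o}_{ij}$, where $m^{o}_{i} = \max_{j} s^{o}_{ij}$
$m^{d}_{i}$ is the maximum of $s^{d}_{ij}$, where $m^{d}_{i} = \max_{j} s^{d}_{ij}$
$\theta$ is the similarity threshold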
Let stage 1 be the stage closer to the current time point and stage 2 be the stage before it; the criteria for hazard change patterns are shown in Table 4.
Growth pattern: this pattern states that an association rule appears in both stage 1 and stage 2, and compared to stage 2, the support of this rule in stage 1 has increased. For the hazard change patterns, if the support of an association rule remains unchanged, it is also regarded as a growth pattern. This pattern usually reflects that the hazard has not been effectively controlled, resulting in the same hazard being continuously identified in stage 1. In this case, the main task of the superintendent is to formulate and implement a risk management plan.
Decrease pattern: this pattern is the opposite of the growth pattern. Compared to stage 2, the support of the rule is reduced in stage 1. Superintendents should continue to implement the risk management plan.
New deviation pattern: compared to stage 2, some new deviations are identified in the same object. For example, in stage 2, the main deviation of the pipeline was corrosion, but in stage 1, there was a leak in the same pipeline. The lack of safety management on specific objects is the main reason for the emergence of new deviations. In this case, it is urgent for superintendents to strengthen safety management for the object.
New object pattern: compared to stage 2, some new objects are identified with similar deviations. For example, “corroded pipeline” was identified in stage 2, and “corroded electrostatic jumper” was also identified in stage 1. Poor workplace environment or process problems are generally the cause of similar deviations in various objects. Environmental management and process quality control are the keys to eliminating this phenomenon.
New hazard pattern: for a hazard in stage 1, neither the object nor the deviation is identified in stage 2. The main reason is a change of equipment, facilities, personnel, or process. PHA, HAZOP, and other administrative methods can be used to analyse the risk to prevent the deterioration of these hazards.
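The five patterns can be illustrated with the simplified classifier below. It is only a sketch under strong assumptions: exact keyword matching stands in for the similarity measures and the similarity threshold, the object is checked before the deviation when both match different rules, and the criteria of Table 4 are not reproduced. The function name `classify` and the sample rules are hypothetical.

```python
def classify(rule, support, stage2):
    """Label one stage 1 rule by comparing it with the stage 2 rules.

    rule: (object, deviation) pair from stage 1.
    support: support of that rule in stage 1.
    stage2: dict mapping (object, deviation) -> support in stage 2.
    """
    obj, dev = rule
    if rule in stage2:
        # Unchanged support is also treated as growth, as stated above.
        return "growth" if support >= stage2[rule] else "decrease"
    if any(o == obj for o, _ in stage2):
        return "new deviation"   # known object, deviation not seen on it
    if any(d == dev for _, d in stage2):
        return "new object"      # known deviation on an object not seen before
    return "new hazard"          # neither object nor deviation seen before


# Hypothetical stage 2 rules with their supports.
stage2 = {("pipeline", "corroded"): 0.10, ("valve", "missing"): 0.05}

labels = [
    classify(("pipeline", "corroded"), 0.12, stage2),
    classify(("valve", "missing"), 0.02, stage2),
    classify(("pipeline", "leaking"), 0.03, stage2),
    classify(("electrostatic jumper", "corroded"), 0.03, stage2),
    classify(("pump", "noisy"), 0.01, stage2),
]
print(labels)
```

Replacing the equality checks with a similarity score compared against a threshold would bring the sketch closer to the procedure described in this section.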
3.3. Risk Analysis and Confirmation
Through the above two steps, hazard association rules and hazard change patterns are extracted. In the process of daily hazard identification, the superintendent can determine the objects, such as pipelines, storage tanks, or power supply systems, and map them into the association rule set and the change pattern set. Then, the association rules related to the object can be extracted, and the possible deviations and change trends are obtained to improve the efficiency of risk analysis.
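The mapping step amounts to a lookup in the two sets. The sketch below assumes that the rule set is stored as a dictionary from (object, deviation) pairs to confidence values and the pattern set as a dictionary from the same pairs to pattern labels; the storage format, the function name `risk_info`, and the sample values are all hypothetical.

```python
def risk_info(target, rules, patterns):
    """Return the deviations of `target`, most probable first, with trends."""
    hits = [
        (deviation, confidence, patterns.get((obj, deviation), "unclassified"))
        for (obj, deviation), confidence in rules.items()
        if obj == target
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical rule set and change pattern set.
rules = {
    ("pipeline", "leaking"): 0.3,
    ("pipeline", "corroded"): 0.6,
    ("valve", "missing"): 0.5,
}
patterns = {
    ("pipeline", "corroded"): "growth",
    ("pipeline", "leaking"): "new deviation",
}
info = risk_info("pipeline", rules, patterns)
print(info)
```

The result lists the possible deviations of the chosen object together with their conditional probabilities and change trends.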
4. Application and Analysis
4.1. Data
The hazard data comes from the workplace hazard identification records of a Chinese hazardous chemical manufacturer, as shown in Table 5. General Code for Safety Standardization of Hazardous Chemicals Manufacturers is used as the work guideline of hazard identification, which includes basic management, equipment and facilities, technological process, working environment, and occupational hazards. Basic management includes work permit, change management, contractor and supplier management, emergency management, and so on. Equipment and facilities mainly include production facilities, safety facilities, and related equipment. The technological process is different in different enterprises, and the main parameters include temperature, pressure, flow, liquid level, composition, and so on. The working environment includes workshop environment, factory environment, warehouse environment, and so on. Occupational health mainly includes occupational hazard management, occupational disease control, and occupational health monitoring.
From 2017 to 2020, 152 hazard identification activities were carried out, and a total of 12,353 hazards were identified and recorded in the form of Table 5. The types of hazards are shown in Figure 6. These 152 sets of hazard data will be used in this article.

4.2. Application
4.2.1. Hazard Association Rule Mining
Ten hazard datasets (numbered 143 to 152) from the third quarter of 2020 are selected as the observation group for the simulation application. Forty datasets (numbered 103 to 142) from the first half of 2020 and the second half of 2019 are divided into stage 1 and stage 2, respectively, which slide with the change of the observation time, as shown in Figure 7. In stage 1, association rules are mined to identify the potential hazards and calculate the conditional probabilities. In stage 2, the association rules are compared with those in stage 1 to analyse the hazard change patterns. Experimental data are used to calculate the accuracy of hazard prediction.

Python 3.7 is used to implement the data-driven risk analysis approach. For historical hazard association rule mining, the minimum support threshold of each keyword is set to 20% of its frequency, and the minimum confidence threshold is 0.2. In total, 684 historical hazard association rules in stage 1 and 784 historical hazard association rules in stage 2 are obtained.
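Setting each threshold to 20% of the keyword frequency can be sketched as follows. The function name `keyword_thresholds`, the absence of rounding, and the sample records are assumptions, since the article does not state how fractional thresholds are handled.

```python
from collections import Counter


def keyword_thresholds(records, ratio=0.2):
    """Minimum support count per keyword as a fixed share of its frequency."""
    counts = Counter(kw for record in records for kw in set(record))
    return {kw: ratio * count for kw, count in counts.items()}


# Hypothetical records: a frequent hazard and a rare one.
records = [("valve", "missing")] * 10 + [("reactor", "overpressure")] * 5
thresholds = keyword_thresholds(records)
print(thresholds)
```

Keywords that occur rarely receive proportionally lower thresholds, which is what allows hazards with low frequency to remain in the mined rules.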
The left items (i.e., objects) of the association rules are clustered, and the association rules with higher frequency and confidence are extracted, as shown in Table 6. The first item indicates that the deviation of the “warming apparatus” and the other two objects in stage 1 is “missing,” with a frequency of 10 and a confidence of 0.67. In stage 2, the main deviation types of the “warming apparatus” and the other two objects are “eroded,” “missing,” and “damaged.” From Table 6, it can be found that the frequency and confidence of the deviations corresponding to “warming apparatus,” “safety cages,” “rain sewage valve,” “stop valve,” “discharge outlet,” and “environmental monitor” are higher, and these objects should be checked first.
The visualization of the main association rules is shown in Figure 8. It can be seen from Figure 8 that the main deviations in stage 1 are “missing,” “more,” and “inefficient,” and the main deviations in stage 2 are “missing,” “damaged,” and “more.” Compared with Figure 8(b), the types of nodes and the number of lines in Figure 8(a) are different, which reflects that the hazards have changed in stage 1.


4.2.2. Hazard Change Pattern Mining
The association rules in stage 1 and stage 2 are compared in Figure 9. In Figure 9, the abscissa and the ordinate represent the object and the deviation, respectively. The intersection of an object and a deviation is regarded as a type of hazard, where the hazards in stage 1 are represented by the red circles, and the hazards in stage 2 are represented by the blue circles. The frequency of a hazard is reflected by the size of the circle, and the probability of occurrence of the deviation is reflected by its transparency. The higher the frequency and probability, the larger the circle and the darker its color. As shown in Figure 9, although the main types of deviations in the two stages are similar, the objects of the hazards are quite different.
To quantify the difference between hazards and identify the hazard change trend, hazard change pattern mining needs to be carried out. Let the similarity threshold be 0.7; the results of hazard change pattern mining are shown in Table 7 and Figure 10.

New object patterns and new deviation patterns accounted for 93%, which reflects the inadequate management of specific objects and the workplace environment. For example, due to the inadequate management of the drainage system, the safety signs of rain sewage valves and discharge outlets are lost, and the drainage efficiency is insufficient. This situation is also reflected in Figure 9, that is, the red circle corresponding to the rain sewage valve and the discharge outlet does not overlap the blue circle.
The small number of growth patterns and decrease patterns reflects that superintendents have a strong ability in hazard control. Once a hazard is identified, most similar hazards can be quickly eliminated. However, the number of hazards such as “warming apparatus, missing” and “sulfur dioxide, more” is on the rise. In Figure 9, they are represented as larger red circles that overlap the blue circles. It is necessary for superintendents to carry out cause analysis and optimize control measures.
The reason there is no new hazard pattern is that the hazardous chemical manufacturer has neither introduced new equipment nor changed the process in the past two years.
4.2.3. Simulate the Risk Analysis Process
In risk analysis, superintendents will obtain relevant risk information as long as the identified object is mapped into the association rule set and change pattern set. For example, in activity 143, the “hydraulic supply line” was mapped into the dataset, and the following results were output, as shown in Table 8 and Figure 11.

The number of “warming apparatus, missing” hazards is the highest, and the probability is 71%, which is high. It is recommended that the warming apparatus be checked first. Moreover, this hazard follows a growth pattern, which means that its frequency is increasing. Therefore, even if this type of hazard is not identified in this activity, it is necessary to take precautionary measures.
The frequency and probability of “flange, bolt, missing” are also high, and it is a new object pattern, indicating that the maintenance of flanges is insufficient. Therefore, it is recommended to carry out centralized maintenance of flanges and pay attention to the safety status of bolts.
“Bracket, unreliable” and “valve, eroded” are the new deviation patterns, which may be related to weather factors. In the first half of 2020, rainstorms and gales occurred many times in this area. And it is predicted that this weather will continue until October, which will lead to a further increase in the frequency of these hazards. Superintendents should pay attention to the impact of weather on equipment and facilities.
Finally, a total of 54 hazards were predicted in this activity.
4.3. Analysis
4.3.1. Efficiency Analysis
Apply multiple minimum support thresholds and a single minimum support threshold to mine frequent item sets. And then, calculate the number of keywords in frequent item sets, the number of frequent items, and the ratio between them. The ratio of keywords to frequent items is used to express the value density of frequent item sets. The higher the ratio, the more hazard information contained in frequent item sets, which is beneficial to improve the efficiency of association rule analysis. The hazard datasets in are used for analysis, where the multiple minimum support thresholds are set according to 20% of the frequency of each keyword, and the single minimum support threshold is set to 2, 5, and 10. The results are shown in Table 9 and Figure 12.
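The value density described above is a simple ratio, sketched below. It is assumed here that the number of keywords means the number of distinct keywords appearing in the frequent item sets; the function name `value_density` and the sample item sets are hypothetical.

```python
def value_density(frequent_itemsets):
    """Ratio of distinct keywords to the number of frequent item sets."""
    if not frequent_itemsets:
        return 0.0
    keywords = set().union(*frequent_itemsets)
    return len(keywords) / len(frequent_itemsets)


# Hypothetical mining result: 3 distinct keywords in 4 frequent item sets.
itemsets = [{"A"}, {"A", "B"}, {"B"}, {"C"}]
density = value_density(itemsets)
print(density)
```

A lower threshold usually adds many item sets but few new keywords, which lowers this ratio.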

In Figure 12, the number and ratio of the keywords are represented by the size and position of the circle. When the single minimum support threshold is equal to 5 or 10, the number of keywords decreases significantly and is not enough to support risk analysis. When the single minimum support threshold is equal to 2, the number of keywords is the same as in the first group, but the ratio is about 10% lower than that of the first group on average, and the total ratio is 13.68% lower. Therefore, setting multiple minimum support thresholds can be considered more effective in improving the efficiency of association rule mining.
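The comparison above can be sketched on a toy example. The code below is a simplified illustration and not the improved Eclat algorithm itself: it uses the vertical (keyword to record id) layout of Eclat but mines only 2-item sets, it assumes that an item set must meet the lowest threshold among its keywords, and it interprets the value density as the number of distinct keywords divided by the number of frequent item sets. The function names and the records are hypothetical.

```python
from itertools import combinations
from math import ceil


def tidsets(records):
    """Vertical layout used by Eclat: keyword -> set of record ids."""
    index = {}
    for tid, keywords in enumerate(records):
        for kw in keywords:
            index.setdefault(kw, set()).add(tid)
    return index


def frequent_pairs(records, min_support):
    """Return {pair: support} for 2-item sets.

    min_support(index, kw) gives the threshold of one keyword; a pair is
    kept if its support reaches the lower of its two thresholds (assumption).
    """
    index = tidsets(records)
    result = {}
    for a, b in combinations(sorted(index), 2):
        support = len(index[a] & index[b])  # tidset intersection
        threshold = min(min_support(index, a), min_support(index, b))
        if support >= threshold:
            result[(a, b)] = support
    return result


def value_density(itemsets):
    """Distinct keywords per frequent item set (assumed reading of the ratio)."""
    keywords = {kw for itemset in itemsets for kw in itemset}
    return len(keywords) / len(itemsets) if itemsets else 0.0


# Hypothetical hazard records, each a list of extracted keywords.
records = [
    ["flange", "bolt", "missing"],
    ["flange", "bolt", "missing"],
    ["flange", "bolt", "loose"],
    ["valve", "eroded"],
    ["bracket", "unreliable"],
]

# Multiple thresholds: 20% of each keyword's frequency, at least 1.
multi = frequent_pairs(records, lambda idx, kw: max(1, ceil(0.2 * len(idx[kw]))))
# Single threshold: the same minimum support of 2 for every keyword.
single = frequent_pairs(records, lambda idx, kw: 2)

print(len(multi), value_density(multi))
print(len(single), value_density(single))
```

On these toy records the single threshold discards every rarely recorded keyword, whereas the keyword-specific thresholds retain them, which is the effect the value density is meant to capture.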
4.3.2. Application Effect
The ten experimental datasets from 143 to 152 are analysed by counting the number of predicted hazards and calculating the ratio of the predicted number to the actual number, as shown in Figure 13. In the experimental analysis, an average of 38.6 hazards were predicted, accounting for 59.66% of the actual hazards. This shows that in each hazard identification activity, 59.66% of the potential hazards can be notified to the superintendent in advance, which not only improves the pertinence of hazard identification but also provides basic information for risk evaluation and control.
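The ratio of predicted to actual hazards can be aggregated over activities in two ways, and the text does not state which one underlies the 59.66% figure. The following sketch shows both under that caveat; the function names and the counts are hypothetical and are not the values of Figure 13.

```python
def pooled_ratio(predicted, actual):
    """Total predicted hazards divided by total actual hazards."""
    return sum(predicted) / sum(actual)


def mean_ratio(predicted, actual):
    """Average of the per-activity ratios of predicted to actual hazards."""
    ratios = [p / a for p, a in zip(predicted, actual)]
    return sum(ratios) / len(ratios)


# Hypothetical counts for two activities.
predicted = [6, 2]
actual = [10, 4]

pooled = pooled_ratio(predicted, actual)   # weights activities by their size
averaged = mean_ratio(predicted, actual)   # weights every activity equally
print(f"{pooled:.2%} {averaged:.2%}")
```

The two aggregates coincide only when every activity has the same number of actual hazards, so the choice matters when activities differ greatly in size.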

The simulation application shows that the data-driven risk analysis method can predict more than half of the potential hazards and thereby improve the efficiency of hazard identification in the workplace. However, there are still some limitations in practical application. Firstly, the method is highly dependent on historical data: insufficient historical data and inaccurate data records will reduce the prediction accuracy. Secondly, the method can only predict and analyse hazards that appear in historical records, and it offers no means of analysing unrecorded hazards.
5. Conclusions
In this article, a data-driven risk analysis approach is proposed to help superintendents identify and analyse hazards. Through NLP and the improved Eclat algorithm, the historical hazard records are transformed into structured association rules. With the aid of change pattern mining, the hazard change trend can be identified. In actual work, once the superintendent specifies the object, the relevant risk information such as potential hazards, conditional probabilities, and change trends can be output automatically. Based on this information, the pertinence of hazard identification and the efficiency of hazard analysis can be improved. Case studies have shown that the efficiency of the improved algorithm is increased by 13.68%, and 59.66% of potential hazards can be identified in advance.
It should be pointed out that the multiple minimum support thresholds, the minimum confidence threshold, and the sliding window size are selected based on experiments. In actual work, superintendents can adjust them according to their needs. Moreover, this approach requires a large number of historical records to support data mining. The more historical records there are, the richer the risk information that will be output. Therefore, it is essential to establish a systematic historical hazard database and to record the hazard data continuously.
Using data mining techniques to support risk analysis in complex workplaces is of practical value. Keyword extraction, synonym merging, redundant keyword deletion, and unidentified hazard prediction need further research to improve the effectiveness of hazard text analysis and the accuracy of hazard prediction. The construction of an information system for the hazard analysis approach should also become a focus of future research.
Data Availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Conflicts of Interest
The authors declare that there are no conflicts of interest.
Acknowledgments
This research was funded by the National Key Research and Development Program of China, grant no. 2019YFC0810701.