Abstract

Syslog is a critical data source for analyzing system problems. Converting unstructured log entries into structured log data is necessary for effective log analysis. However, existing log parsing methods demonstrate promising accuracy only on limited datasets, and their generalizability and precision remain uncertain when applied to diverse log data, so enhancements in these areas are necessary. This paper proposes an online log parsing method called DLLog, which is based on deep learning and the longest common subsequence. DLLog utilizes the GRU neural network to mine template words and applies the longest common subsequence to parse log entries in real time. In the offline stage, DLLog combines multiple log features to accurately extract the template words, creating a log template set to assist online log parsing. In the online stage, DLLog parses log entries by calculating the matching degree between the real-time log entry and the log templates in the log template set. This method also supports incremental updates of the log template set to handle new log entries generated by systems. We summarized previous work and validated DLLog using real log data collected from 16 systems. The results demonstrate that DLLog achieves high parsing accuracy, universality, and adaptability.

1. Introduction

Log data serves as a valuable and reliable source for operations staff to monitor systems, detect abnormalities, and locate faults [1]. Log data, easily obtainable from systems, contains a wealth of information, including system status, performance, and resource usage. However, log data is inherently unstructured, while most system analysis tasks require structured data as input [2–4]. Therefore, parsing unstructured log data into structured data becomes essential [5, 6]. This paper aims to develop a log parsing method characterized by high accuracy, universality, and adaptability. The goal is to enable the accurate extraction of log templates from log data without manual intervention.

Traditional log parsing methods require considerable human resources and time. Moreover, as system scale and complexity increase, data volume expands rapidly. Importantly, developers have not established a unified standard for log formats, making traditional manual log parsing methods impractical. Static code-based parsing methods exhibit significant limitations [7–9] because obtaining system source code during the parsing process is challenging. While frequent pattern mining-based log parsing methods demonstrate competitive parsing efficiency, they struggle to match rare, low-frequency logs to any log template, resulting in suboptimal parsing results [10–12]. Clustering-based log parsing methods often suffer from low parsing accuracy due to their simplistic parsing patterns (e.g., dividing log groups based on word frequency or word types) [2, 13–15]. In comparison to static code-based or frequent pattern mining-based methods, clustering methods also have slower parsing speeds and require numerous iterations.

Current log parsing methods often exhibit limitations in terms of parsing accuracy and universality. While a specific log parsing method may demonstrate high parsing accuracy on a particular dataset, it frequently struggles to maintain comparable accuracy when applied to a broader range of datasets. It is also imperative for log parsing methods to incorporate incremental update capabilities, because systems undergo sporadic updates or optimizations post-deployment, resulting in the generation of new log data that needs to be matched with new log templates. Log parsing methods lacking incremental update functionality require substantial computational resources to build a new parsing model. Undoubtedly, a log parsing method that can update its model during the parsing process is of paramount importance.

To address these challenges, we propose an online log parsing method called DLLog, based on GRU neural networks and the longest common subsequence. Our method outperforms existing approaches by accurately mining log template words using multiple log features, thus achieving high universality. Prior to template matching, DLLog pre-classifies log templates to reduce incorrect template matching time. Moreover, our method supports log template set updates to accommodate new log data generated by the system.

DLLog parses logs by utilizing the structural, frequency, and association features of log entries. It combines offline log template word mining and online log parsing to enhance the universality and parsing accuracy of DLLog on large-scale log datasets. In the offline mining stage, DLLog initially employs common regular expressions to clean the logs and remove obvious parameter words. Then, it transforms each log entry into a sequence of word frequencies based on the log word frequencies and the log structural features. Subsequently, DLLog employs a GRU neural network to identify potential relationships between log words and extracts log template words based on these relationships. Because the log sequences incorporate the log structural features, log template words from rare logs are mined more easily and accurately by DLLog. Finally, log entries with the same log template words are categorized into the same log group, and each log group corresponds to a log template. The different log templates form a pre-classified log template set based on the log structural features. This process does not require manual intervention. Besides, this grouping pattern can effectively avoid the problem that rare logs cannot match any log template.

During the online parsing stage, DLLog processes logs by calculating the matching degree, defined as the length of the longest common subsequence between real-time log entries and the existing log template set. Based on the matching results, DLLog determines whether to update the log template set. By adopting incremental updates to the log template set, DLLog eliminates the need for retraining models, ensuring the efficient operation of log parsing methods and enhancing the method’s universality when applied to large-scale log datasets. This paper evaluates DLLog on several extensive log datasets, demonstrating its success in achieving high parsing accuracy, universality, and adaptability.

The primary contributions of this paper are summarized as follows:
(i) This paper introduces an offline log template word mining approach that utilizes a GRU neural network to extract log template words and partition log data into distinct log groups.
(ii) This paper proposes an online log parsing method that leverages the longest common subsequence, enabling updates to the log template set to accommodate newly generated log data from the system.
(iii) We conducted comprehensive experiments and evaluations on various large-scale log datasets, demonstrating the superior performance of DLLog in terms of accuracy, universality, and adaptability.

The rest of this paper is organized as follows: Section 2 presents the related work of log parsing. Section 3 presents the basic structure of log data. Section 4 presents the detailed design of DLLog. Section 5 evaluates the performance of DLLog through experiments. Finally, Section 6 presents the final remarks.

2. Related Work

System logs are invaluable data resources extensively utilized in system operation and maintenance, fault analysis and detection, and various practical applications [16–19]. Since log messages typically consist of semi-structured text strings, log parsing is essential for converting unstructured logs into structured data [20]. Log parsing preserves the essence of log entries, removes parameter words, and minimizes log entry dimensions, making it easier to map diverse unstructured logs to standard log templates. We have categorized and summarized recent research in log parsing into the following categories.

2.1. Static Code-Based Log Parsing Methods

Liang et al. [8] introduced MTS-DCGAN, a log parsing method based on source code analysis. This approach involves querying class names, call relationships, and object names associated with system behaviors; log templates are then constructed by traversing the syntax tree. Kabinna et al. [21] proposed the Cox models, which follow principles similar to those of MTS-DCGAN, identifying format strings in the code to create log parsing templates. While these methods accurately generate log templates, they depend on access to system source code, limiting their applicability to closed-source systems.

2.2. Heuristic-Based Log Parsing Methods

He et al. [22] developed Drain, a log parsing method that utilizes parsing trees. Drain constructs parsing trees and then compares variances between log entries and log event groups within the parsing tree for log parsing purposes. While Drain provides high parsing accuracy, its versatility is limited, and it requires domain-specific knowledge.

Zhang et al. [23] presented the FT-tree method for log parsing, which creates a log template tree by analyzing log words and their combinations. The process involves pruning the log template tree by removing branches that do not satisfy certain constraints. Consequently, all log words along the path from the root node to any leaf node in the pruned log template tree constitute a log template. However, a drawback of this method is its tendency to overlook infrequent log templates, potentially leading to reduced parsing accuracy.

2.3. Clustering-Based Log Parsing Methods

Sedki et al. [10] proposed the unified log parsing tool, which identifies frequent phrases in log data to form frequent candidate itemsets. These itemsets are then clustered to generate class clusters along with their corresponding templates. Fu et al. [13] proposed the LKE method, a log parsing technique based on K-means clustering. LKE extracts log templates from initial log groups obtained by segmenting clusters using cluster midpoints and parameter distances. However, due to log data imbalance, clustering-based methods might misclassify low-frequency log template words as parameter words, leading to lower parsing accuracy.

2.4. Other Log Parsing Methods

Makanju et al. [24] proposed the iterative partition log mining (IPLoM) method, which employs iterative partitioning to categorize log entries into distinct groups. IPLoM further refines partitions based on log identifiers and location information to extract log templates for each log group. AEL [25] employs a clone detection method for log parsing, assuming significant text similarity among log entries within the same log event; it uses an “Adjust” step to consolidate similar log execution events and resolve all log templates. Du and Li [26] presented Spell, an online log parsing method based on the longest common subsequence, which updates and maintains the longest common subsequence library (LCSMap) of log event sequences.

3. Log Structure Overview

System logs are unstructured data stored as free text, recording various events, states, errors, or interaction behaviors generated by systems or components. Typically, there is no unified standard for defining log entry formats and syntax structures across different systems. Each log entry consists of a constant part and a variable part. The constant part, also referred to as the log template, comprises fixed plain text information generated by the printout code, containing semantic information in the form of log template words. The variable part, including dynamic parameter information such as IP addresses, port numbers, and file names, changes with log events. The words that make up the variable part are referred to as log parameter words and generally lack valuable semantic information. Although the formats of log data vary greatly among different systems, log data typically include the following important components: timestamp, log level, component, and log event.
(1) Timestamp: the time when the system generated the log entry.
(2) Log level: also known as log type, it indicates the severity of log events (such as info, error, and warn).
(3) Component: the name of the component (software module or server) that generates log events.
(4) Log event: describes the system interaction event information under a specific time and environment. Generally, a log entry contains only one log event.

In log data, the log event serves as the core of each log entry. Log parsing extracts the constant part (common field) of log events to create a log template representing each log entry. Table 1 displays log samples from eight different types of original log data, including distributed systems, supercomputer systems, operating systems, and mobile systems.

We take the HDFS log entry (081109 203521 146 INFO dfs.DataNode$PacketResponder: Received block blk_7503483334202473044 of size 233217 from 10.251.71.16) as an example. The log event part is generated by the system printout code “LOG.info("Received block " + block + " of size " + block.getNumBytes() + " from " + inAddr).” The fixed parts are “Received block,” “of size,” and “from,” which remain unchanged regardless of the event object. These words also constitute the log template for the log event. Table 2 displays the classification results of the aforementioned original log entry based on Timestamp, Log level, Component name, and Log event. This table also presents the Log template words, Parameter words, and Log template.
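To make this structure concrete, the following minimal Python sketch splits a raw HDFS log line into the components listed above. The regular expression and field names are illustrative assumptions for the HDFS header layout shown in this example, not DLLog's actual preprocessing rules.

import re

# Illustrative pattern for the HDFS header layout shown above:
# "<date> <time> <pid> <level> <component>: <event>"
HDFS_PATTERN = re.compile(
    r"^(?P<date>\d{6}) (?P<time>\d{6}) (?P<pid>\d+) "
    r"(?P<level>\w+) (?P<component>[\w.$]+): (?P<event>.*)$"
)

def split_log_entry(line):
    """Split a raw HDFS log line into timestamp fields, level, component, and event."""
    match = HDFS_PATTERN.match(line)
    return match.groupdict() if match else None

raw = ("081109 203521 146 INFO dfs.DataNode$PacketResponder: "
       "Received block blk_7503483334202473044 of size 233217 from 10.251.71.16")
print(split_log_entry(raw)["event"])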

In the HDFS log data format shown in Table 2, a placeholder symbol denotes the position of a parameter word in the log template. In fact, a log template can be used to represent multiple log entries. Figure 1 provides twelve examples of HDFS raw log data.

In these examples, the log template “Received block of size from” can also represent the second log entry (081109 205412 832 INFO dfs.DataNode$PacketResponder: Received block blk_-5704899712662113150 of size 67108864 from 10.251.91.229). Each log entry in the log data can be characterized by only one log template, but one log template can represent multiple log entries. Table 3 displays the corresponding log templates for all log data examples in Figure 1.

As shown in Table 3, we can convert 12 different types of unstructured log entries into 5 types of structured data by transforming log data into log templates. Indeed, a log template is a standardized format for representing a group of original log entries. Log entries with the same log template represent the same type of log events. In essence, the core of log parsing lies in converting each log entry into a specific log template. During log parsing, a parser must explicitly distinguish between the constant and variable parts of the log event, extract the constant log part (log template words) to compose the log template, and then use the log template to represent the log entry, thereby completing the log data parsing task.

4. DLLog Architecture and Overview

This section provides a detailed overview of the proposed online log parsing method, DLLog, which is based on GRU deep learning and the longest common subsequence. The fundamental concept behind DLLog is that log templates typically consist of the longest combinations of frequently occurring words. DLLog comprises three main modules: log data vectorization, offline log template word mining, and online log parsing. Figure 2 illustrates the framework of DLLog. Table 4 lists the notations used in DLLog together with their explanations.

4.1. Offline Log Template Word Mining

The solid arrow in Figure 2 illustrates DLLog’s offline log template word mining process. This module initially scans and cleans the entire log dataset. It counts the frequency of each word that makes up the log level, component name, and log event. Using this frequency information, we construct a log word frequency table. DLLog vectorizes each log entry based on the word frequency ID in the word frequency table, converting the log entry into a vector to create the log word frequency sequence. During the training stage, the GRU neural networks are employed to learn the relationships between log words, enabling DLLog to extract log template words from log sequences based on the learned associations. Finally, DLLog categorizes log entries into different log groups depending on whether the log template words are identical. Each log group’s log entries share the same log template, and the log templates from different log groups constitute the set of log templates.

4.2. Online Log Parsing

The dashed arrow in Figure 2 illustrates the online parsing process of DLLog. Unlike offline log template word mining, the log sequence in online log parsing does not require sorting based on word frequency. Each real-time log entry only needs to undergo a cleaning process before being input into the log vectorization module to generate a sequence of log word frequencies. Then, DLLog calculates the matching degree between the current log sequence and the log templates within the existing log template set. By comparing the matching degree with the predefined threshold, DLLog determines whether the parsing is successful or whether the log template set needs updating.

4.3. Log Vectorization

We define the log dataset as L = {l1, l2, …, ln}, where li represents a log entry, and we define the log event set as E = {e1, e2, …, en}, where ei represents the log event of li. Let W = {w1, w2, …, wm} be the set of words constituting the log events; these words are also referred to as tokens. If a word wj appears frequently (that is, it has a high frequency), then wj has a high probability of being a log template word. We define the set of log templates as T = {t1, t2, …, tk}, where ti represents a log template composed of multiple log template words arranged in a specific manner. It should be noted that each log template in the log template set corresponds to multiple log entries, whereas a log entry can only be represented by one log template.

Log vectorization is the first step of DLLog. Its objective is to convert unstructured log entries into vectorized sequences, which are then used for offline log template word mining and online log parsing. The process of vectorizing log entries consists of three steps (a code sketch of these steps is given after the list):
(1) The first step is to scan the entire log dataset, break each log entry down into words, and employ regular expressions to filter obvious log parameter words with a fixed format (such as IP addresses and file paths). This log vectorization process uses the log data filtering rules provided by FT-tree [23], Spell [26], and Drain [22], which are widely adopted in the log parsing domain.
(2) The second step is to count the frequency of the log words. In this step, the module fully considers the structural and frequency features of the log: the frequency statistics cover the log level words, the log component words, and the log event words. Next, we categorize the frequency information by word type, sort it in descending order of frequency, and store it in the word frequency table F. Each word is assigned a unique word frequency ID. Within the word frequency table F, the entries corresponding to the log level words are positioned at the beginning of the table, followed by the component words in the middle, and the event words at the end. Figure 3 provides a structure sample of the word frequency table F. We use the HDFS original log dataset in Figure 1 as an example to further illustrate the process of creating the word frequency table F: its two log levels and four log components are counted and sorted separately, and the log event words are processed following the same procedure. In the final word frequency table F, each row is represented as a triple, where the first unit is the word frequency ID, the second unit is the word itself, and the third unit is the frequency (the number of times the word appears in the dataset). By categorizing log words into level words, component words, and event words, this method helps prevent the incorrect categorization of low-frequency log template words as log parameter words, mitigating issues arising from unbalanced log data features.
(3) The third step is to replace each log word with its word frequency ID, constructing the log word (token) frequency sequence in ascending order.
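The sketch below illustrates steps (1)–(3) under simplifying assumptions: the parameter-filtering regular expressions and the function names (build_word_frequency_table, vectorize) are illustrative rather than DLLog's exact rules, and each input entry is assumed to be already split into its log level, component, and event text.

import re
from collections import Counter

# Step (1): filter obvious parameter words (illustrative rules, not DLLog's exact set).
PARAM_PATTERNS = [r"blk_-?\d+", r"\d+\.\d+\.\d+\.\d+(:\d+)?", r"\d+"]

def clean(tokens):
    return [t for t in tokens if not any(re.fullmatch(p, t) for p in PARAM_PATTERNS)]

def build_word_frequency_table(entries):
    """Step (2): count word frequencies per category (level, component, event words)
    and assign word frequency IDs: level words first, then components, then event words."""
    counters = {"level": Counter(), "component": Counter(), "event": Counter()}
    for level, component, event in entries:
        counters["level"][level] += 1
        counters["component"][component] += 1
        counters["event"].update(clean(event.split()))
    table, next_id = {}, 1
    for category in ("level", "component", "event"):
        for word, freq in counters[category].most_common():
            table[word] = (next_id, freq)   # each row is (word frequency ID, word, frequency)
            next_id += 1
    return table

def vectorize(level, component, event, table):
    """Step (3): replace each word with its word frequency ID, sorted in ascending order."""
    words = [level, component] + clean(event.split())
    return sorted(table[w][0] for w in words if w in table)

entries = [("INFO", "dfs.DataNode$PacketResponder",
            "Received block blk_7503483334202473044 of size 233217 from 10.251.71.16")]
F = build_word_frequency_table(entries)
print(vectorize(*entries[0], F))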

It is important to note that the online log entry vectorization process only requires the first and third steps. Since the word frequency table has already been constructed, we simply need to follow step (1) to clean the online log entry. If a log word appears in the real-time log but is not present in the word frequency table , we incrementally update the word frequency table with the newly encountered log word. Then, according to the word frequency table , we construct the cleaned log data into a log word sequence. Figure 4 illustrates the example of log vectorization.

4.4. Offline Log Template Word Mining

Offline log template word mining aims to create an accurate log template set. In the log vectorization module, DLLog converts each log entry into a sequence of log word frequencies based on the log structural features and log frequency features. In the offline log template word mining module, DLLog learns the relationships between log words through a GRU neural network. It then determines whether each word is a log parameter word or a log template word, enabling the accurate extraction of log templates.

The core of offline log template word mining is the GRU neural network [27], a well-known variant of the recurrent neural network (RNN) introduced by Cho et al. [27]. It has found wide application in various fields, including text classification [28, 29], machine translation [30], and sentiment analysis [31].

Like the LSTM neural network, the GRU neural network has forgetting and updating mechanisms, both of which excel at tracking long-term dependencies. These mechanisms address the gradient vanishing or exploding problem that often occurs in recurrent neural networks during multiple propagations. Unlike the LSTM neural network, however, the GRU neural network simplifies the internal network structure, resulting in more efficient state information updates. The internal structure of a GRU unit is depicted in Figure 5.

In Figure 5, x_t represents the input at time t (the current time), r_t is the reset gate, z_t is the update gate, h_t is the hidden state of the current GRU unit, h_{t−1} is the hidden state from the previous GRU unit, and h̃_t is the candidate hidden state. σ and tanh are activation functions, ⊕ represents addition, and ⊙ represents pointwise multiplication.

Each GRU block in a GRU neural network consists of an update gate and a reset gate. The reset gate determines which part of the information in the hidden state is “forgotten,” while the update gate decides how much of the current input information is incorporated and temporarily stored in the hidden state h_t. The formulas for the update gate, reset gate, candidate hidden state, and hidden state are as follows:

z_t = σ(W_z · [h_{t−1}, x_t])  (1)
r_t = σ(W_r · [h_{t−1}, x_t])  (2)
h̃_t = tanh(W_h · [r_t ⊙ h_{t−1}, x_t])  (3)
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t  (4)

where W_z, W_r, and W_h represent the weight matrices. When z_t approaches 1, the state from the past time step is retained in the current state; when z_t approaches 0, the past state information is forgotten.
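The following minimal numerical sketch evaluates equations (1)–(4) for a single GRU step. The concatenation-based weight layout and the omission of bias terms are simplifying assumptions; the sketch illustrates the gating mechanism rather than DLLog's exact parameterization.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step following equations (1)-(4): update gate, reset gate,
    candidate hidden state, and new hidden state (bias terms omitted)."""
    xh = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ xh)                                       # (1) update gate
    r_t = sigmoid(W_r @ xh)                                       # (2) reset gate
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))   # (3) candidate state
    h_t = z_t * h_prev + (1.0 - z_t) * h_cand                     # (4) new hidden state
    return h_t

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
x_t, h_prev = rng.normal(size=d_in), np.zeros(d_h)
W_z, W_r, W_h = (rng.normal(size=(d_h, d_h + d_in)) for _ in range(3))
print(gru_cell(x_t, h_prev, W_z, W_r, W_h).shape)  # (8,)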

4.4.1. Training Stage of DLLog

The DLLog log template word mining model employs a two-layer GRU neural network. Compared with a single-layer GRU neural network, the two-layer GRU neural network exhibits superior learning and generalization capabilities, making it better at preserving long-term dependencies within sequences. The log template word mining model based on the GRU neural network consists of four layers: the word embedding layer, the GRU neural network layer, the fully connected layer, and softmax layer. Figure 6 illustrates the network structure of the DLLog log template word mining model.

We input a log token subsequence whose length equals the size of the sliding window into the model. First, the log token subsequence is passed through the word vectorization layer, which maps each token to a computationally recognizable vector. These word vectors then serve as input to the first layer of the GRU neural network. Both the first and second layers of the GRU neural network contain as many GRU units as the input subsequence has tokens, matching the length of the input data.

In each GRU cell, the input consists of the hidden state h_{t−1} from the previous time step and the external input data x_t at the current time step. The current embedded word vector x_t and the hidden state h_{t−1} are both weighted in the update gate using their respective weights. The result of this weighted sum, obtained using formula (1), is then passed through a sigmoid activation function to calculate the final value of the update gate. The input of the reset gate is identical to that of the update gate, with both inputs being multiplied by their corresponding weights; formula (2) is applied to calculate the value of the reset gate. The reset gate determines how much information from the previous hidden state is carried into the current candidate hidden state h̃_t, while the update gate decides how much information from the previous hidden state is retained in the current hidden state h_t. The candidate hidden state h̃_t and the hidden state h_t are computed using formulas (3) and (4), respectively. Subsequently, the retained information (hidden state h_t) is passed to the next GRU unit.

For the double-layer GRU neural network, each GRU unit in the second layer corresponds to a GRU unit in the first layer. The hidden state produced by each GRU unit in the first layer serves as the input for the connected GRU unit in the second layer. Finally, the fully connected layer and softmax function are employed to transform the final hidden state of the second-layer GRU neural network into a probability distribution for predicting the next log word. During the training phase, the model utilizes cross-entropy as the loss function and employs stochastic gradient descent (SGD) to iteratively update the weight parameters. The cross-entropy loss function is given by

Loss = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_{i,c} log(p_{i,c})  (5)

where y_{i,c} represents the actual label, p_{i,c} is the predicted probability value, C is the number of categories (the number of words in the word frequency table F), and N is the total number of samples.
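A minimal PyTorch sketch of the described architecture is shown below. The class name TemplateWordMiner, the embedding dimension, and the toy batch are assumptions for illustration; the hidden dimension (64), number of layers (2), learning rate (0.001), cross-entropy loss, and SGD optimizer follow the settings reported in Section 5.1.3.

import torch
import torch.nn as nn

class TemplateWordMiner(nn.Module):
    """Embedding -> two-layer GRU -> fully connected -> softmax over the word
    frequency table, as described above (embedding size is illustrative)."""
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_window):             # token_window: (batch, window_size)
        embedded = self.embedding(token_window)
        _, h_n = self.gru(embedded)               # h_n: (num_layers, batch, hidden_dim)
        return self.fc(h_n[-1])                   # logits; softmax is applied in the loss

vocab_size, window = 500, 3
model = TemplateWordMiner(vocab_size)
criterion = nn.CrossEntropyLoss()                 # cross-entropy, as in equation (5)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

inputs = torch.randint(0, vocab_size, (16, window))   # toy batch of token windows
targets = torch.randint(0, vocab_size, (16,))         # next-word IDs
optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()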

4.4.2. Log Template Word Mining Stage

In the log template word mining stage, the input method for log data remains the same as in the training phase: the input is a log token subsequence whose length equals the size of the sliding window. The output of the model is a probability distribution over all words in the word frequency table F. The probability assigned to a target log word indirectly indicates the association between that word and the input sequence. If the target word exhibits a strong association with the input sequence, it is determined to be a log template word; otherwise, it is a log parameter word. Figure 7 illustrates the sample log template mining process.

In fact, the final output of the model can be considered a binary classification problem. Based on prior experience, the target log word following an input sequence is not unique. Therefore, it is essential to manually set an appropriate probability threshold when mining log template words. If the probability value of the target word exceeds the threshold, the target word is considered to have a strong correlation with the input sequence, and it is identified as a log template word. Conversely, if the probability of the target word is below the threshold, it is categorized as a log parameter word. To prevent mistakenly identifying log parameter words as log template words, the extraction of template words for a sequence is halted when any target word in the sequence is identified as a parameter word (the first occurrence of a log parameter word within the sequence). Subsequently, processing continues with the next log word sequence until the entire log dataset has been processed.
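The following sketch, reusing the TemplateWordMiner model from the previous sketch, mines the template-word IDs of a single log word frequency sequence by thresholding the predicted probability of each target word. The treatment of the initial seed window and the default argument values are illustrative assumptions.

import torch
import torch.nn.functional as nnf

def mine_template_word_ids(model, sequence, window_size=3, threshold=0.63):
    """Slide a window over one log word frequency sequence; a target word whose
    predicted probability reaches the threshold is kept as a template word, and
    mining stops at the first parameter word (probability below the threshold)."""
    model.eval()
    template_ids = list(sequence[:window_size])      # assumption: the seed window is kept
    with torch.no_grad():
        for i in range(len(sequence) - window_size):
            window = torch.tensor([sequence[i:i + window_size]])
            target = sequence[i + window_size]
            probs = nnf.softmax(model(window), dim=-1)
            if probs[0, target].item() >= threshold:
                template_ids.append(target)          # strong association: template word
            else:
                break                                # first parameter word: stop mining
    return template_ids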

After extracting the log template words corresponding to each log entry, the log entries are divided into different log groups based on the log level, component name, and log template words. Log entries within each log group share the same log template words. For each log group, a data structure is created to store the corresponding log template of that log group, and the data structure that stores the final log template set is initialized as empty. Figure 8 illustrates the sample structure of the log template set.
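A minimal sketch of this grouping step is given below; the tuple layout of parsed_entries and the dictionary-based representation of the pre-classified template set are assumptions for illustration.

from collections import defaultdict

def build_template_set(parsed_entries):
    """Group entries by (log level, component, template words); each group yields
    one log template (the parsed_entries layout is illustrative)."""
    groups = defaultdict(list)
    for level, component, template_words, raw_line in parsed_entries:
        groups[(level, component, tuple(template_words))].append(raw_line)
    # The pre-classified template set: one template per group, keyed by level/component.
    template_set = defaultdict(list)
    for (level, component, template_words), _lines in groups.items():
        template_set[(level, component)].append(list(template_words))
    return template_set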

4.5. Online Log Parsing Module

In the online log parsing stage, when a new log entry arrives, DLLog first cleans and divides the original log entry and vectorizes it to construct the log word frequency sequence. This process has been described in Section 4.1. Then, DLLog compares the current log word frequency sequence with the log templates in the log template set to determine whether it matches any existing log template or whether a new log template should be created to extend the log template set. This section utilizes the longest common subsequence (LCS) to calculate the matching degree. Algorithm 1 shows the pseudo-code for online log parsing.

Input: log word frequency sequence s and log template set T
Output: log template
(1) Initialize the optimal matching degree best = 0, the optimal template length len_best = 0, and the optimal template index j = 0
(2) Initialize the temporary log set C
(3) Go through all log templates in T with the same log level and component name as s to form the candidate log template set T′
(4) for each log template t_i in T′ do
(5)  m = LCS(s, t_i)
(6)  if m > best or (m == best and |t_i| < len_best) then
(7)   best = m
(8)   len_best = |t_i|
(9)   j = i
(10) end if
(11) end for
(12) if (best < |s|/2) or (best < len_best/2) then
(13)  t_new, T = Update(s, T, C) // Call the log template set update function (Algorithm 2)
(14)  if t_new == null then
   return null
(15)  else
   return t_new
(16)  end if
(17) else
  return t_j
(18) end if

For a given current log word frequency sequence s, the first step is to search the existing log template set for log templates with the same log type and component name as s. These matched log templates form a new candidate set. Then, we calculate the matching degree between each log template in the candidate set and the current log word frequency sequence s using the longest common subsequence (LCS) method [32–34]. The matching degree is determined by the length of the longest common subsequence. For instance, suppose there are three log templates (t1, t2, and t3) in the candidate set that share the same log type and component name as the current log word sequence; the matching degrees between the current log word sequence and these log templates are denoted as m1, m2, and m3, each calculated as the length of the corresponding LCS.

The second step is to find the log template with the highest matching degree for the current log word frequency sequence s. If several log templates share the same matching degree with the current sequence, the system selects the log template with the shortest length as the corresponding log template. It is important to note that the matching degree between the selected log template and the current log word sequence should be greater than or equal to half the length of the current log word sequence and half the length of the selected log template. If the log template set cannot produce a match for the current log word sequence, a new log template must be generated and added to the log template set. If it is impossible to generate a new template based on the existing data, the current log word sequence is stored in the temporary log set (a sketch of this matching procedure is given after the examples below).

In each case, the examples are as follows:
(i) If the matching degrees are ordered as m1 > m2 > m3, the system chooses t1 for further processing.
(ii) If m1 ≥ |s|/2 and m1 ≥ |t1|/2, where |s| is the length of the log word sequence s and |t1| is the length of the log template t1, then the log template t1 is the final log template corresponding to the log word sequence s.
(iii) If m1 < |s|/2 or m1 < |t1|/2, then a new template must be created for the current log word sequence s.
(iv) If m1 = m2 = m3 and the condition in (ii) is satisfied, DLLog selects the log template with the minimum length as the final log template corresponding to the log word sequence s by comparing the lengths of the log templates t1, t2, and t3.
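A minimal Python sketch of this matching procedure is given below: lcs_length computes the matching degree, and match_template applies the tie-breaking and half-length rules described above; returning None signals that the log template set must be updated. The function names are illustrative.

def lcs_length(seq, template):
    """Length of the longest common subsequence (the matching degree)."""
    m, n = len(seq), len(template)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if seq[i - 1] == template[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def match_template(seq, candidate_templates):
    """Pick the template with the highest matching degree (shortest template on ties);
    return None if the degree is below half of either length, so an update is needed."""
    best, best_template = 0, None
    for template in candidate_templates:
        degree = lcs_length(seq, template)
        if degree > best or (degree == best and best_template is not None
                             and len(template) < len(best_template)):
            best, best_template = degree, template
    if best_template is None or best < len(seq) / 2 or best < len(best_template) / 2:
        return None
    return best_template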

The third step is to update the log template set. According to reference [23], when the system begins generating new types of system log entries, it often generates a substantial amount of log data of these new types within a single day. These log data typically contain numerous different parameter words. Consequently, new templates can be directly extracted by computing the longest common subsequence of these new types of logs. The pseudo-code of the log template set update algorithm is presented in Algorithm 2.

Input: log word frequency sequence s, log template set T, and temporary log set C
Output: new log template t_new and updated log template set T
(1) Initialize the optimal matching degree best = 0, the optimal matching length len_best = 0, and the optimal entry index j = 0
(2) for each log sequence c_i in C do
(3)  if (the log level of c_i == the log level of s) and (the component name of c_i == the component name of s) then
(4)   m = LCS(s, c_i)
(5)   if m > best or (m == best and |c_i| > len_best) then
(6)    best = m
(7)    len_best = |c_i|
(8)    j = i
(9)   end if
(10) end if
(11) end for
(12) if (best < |s|/2) or (best < len_best/2) then
(13)  Add s to C // Add the log word sequence to the temporary log set
  return null, T
(14) else
(15)  t_new = LCS(s, c_j)
(16)  Add t_new to T // Add a new log template to the log template set
  return t_new, T
(17) end if

For a current log word sequence that fails to match any log template within the log template set, it becomes necessary to calculate the longest common subsequence between this sequence and each log entry in the temporary log set. Subsequently, the optimal longest common subsequence is selected as the new log template. Similarly, this new template must be longer than or equal to half the length of both the current log word sequence and the selected log entry from the temporary log set. Once this condition is met, the new log template is added to the log template set, thereby updating the set. The next time a log entry of the same new type appears, the first two steps can be employed to match the log template.
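The sketch below outlines this update step; it reuses lcs_length from the previous sketch, and the helper lcs_sequence (which recovers the common subsequence itself) and the key-based indexing of the template set are illustrative assumptions.

def lcs_sequence(a, b):
    """Return one longest common subsequence of a and b (dynamic programming)."""
    dp = [[[] for _ in range(len(b) + 1)] for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + [x]
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], key=len)
    return dp[-1][-1]

def update_template_set(seq, template_set, temp_logs, key):
    """If seq matched no existing template, find the most similar buffered log in the
    temporary log set and promote their LCS to a new template; otherwise buffer seq."""
    best, best_entry = 0, None
    for entry in temp_logs:
        degree = lcs_length(seq, entry)            # lcs_length from the previous sketch
        if degree > best:
            best, best_entry = degree, entry
    if best_entry is None or best < len(seq) / 2 or best < len(best_entry) / 2:
        temp_logs.append(seq)                      # cannot form a new template yet
        return None
    new_template = lcs_sequence(seq, best_entry)   # the common subsequence is the template
    template_set[key].append(new_template)         # key: (log level, component name)
    return new_template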

5. Evaluation

This section first introduces the hardware and software environment, the experimental log datasets, and the evaluation metric. Then, specific experimental results are presented to demonstrate the superiority of DLLog.

5.1. Experimental Setting
5.1.1. Experimental Dataset

The log datasets used in this section consist of 16 real-world log datasets published by the LogPai team (https://github.com/logpai). In the LogHub data repository, these log data come from different systems, including distributed systems (HDFS, Hadoop, Spark, ZooKeeper, and OpenStack), supercomputers (BGL, HPC, and Thunderbird), operating systems (Windows, Linux, and Mac), mobile systems (Android and HealthApp), server applications (Apache and OpenSSH), and standalone software (Proxifier). The LogHub log datasets can be used not only to measure the accuracy of log parsing methods but also to test their robustness and efficiency. These datasets have been widely employed in similar research endeavors [15, 22, 26, 35]. Table 5 provides detailed information about these log datasets.

For each log dataset, Zhu et al. [11] sampled it and manually labeled the log template of each log entry. In all experiments in this section, these labels were used as the ground truth for evaluation.

5.1.2. Evaluation Index

In the field of log parsing, parsing methods are typically evaluated using the Parsing Accuracy (PA) metric, as defined in reference [11]. PA is calculated as the ratio of correctly parsed log messages to the total number of log messages. Each log message corresponds to a specific log template, and log messages sharing the same log template are grouped into the same cluster, representing a particular type of log message. A log message is considered correctly parsed only when it is assigned to the correct log template cluster. In comparison to the evaluation metrics used in prior studies [35–37], PA is considered a more rigorous measure.
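As a reference, the following sketch computes group-based PA in the spirit of this definition: a message counts as correctly parsed only if the set of messages sharing its predicted template is exactly the set sharing its ground-truth template.

from collections import defaultdict

def parsing_accuracy(predicted_templates, true_templates):
    """Group-based Parsing Accuracy: a log message is correct only if its predicted
    template groups together exactly the same messages as its ground-truth template."""
    pred_groups, true_groups = defaultdict(set), defaultdict(set)
    for idx, (pred, true) in enumerate(zip(predicted_templates, true_templates)):
        pred_groups[pred].add(idx)
        true_groups[true].add(idx)
    correct = sum(len(group) for group in pred_groups.values()
                  if group in true_groups.values())
    return correct / len(true_templates)

predicted = ["T1", "T1", "T2", "T3"]
ground_truth = ["A", "A", "B", "B"]
print(parsing_accuracy(predicted, ground_truth))  # 0.5: only the two "A" messages are grouped correctly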

5.1.3. Environment and Implementation

We implemented the proposed methods using the open-source Python machine learning library PyTorch. All experiments were conducted in a consistent experimental environment using Python 3.8 with PyTorch 1.7.0. The hardware platform featured an AMD Ryzen 5 3600 6-core processor running at 3.6 GHz, an NVIDIA GTX 1660 GPU, 128 GB of memory, and the Windows 10 64-bit operating system. We constructed our model in this environment. Specifically, the offline training process and the log template word mining process run on the GPU to accelerate model training, while the DLLog online parsing phase runs on the CPU to allow for a fair comparison with other log parsing methods.

The number of training epochs is set to 300, the hidden dimension of the GRU model is 64, and the number of layers is 2. In the log template word mining stage, the sliding window size is set to 3 and the probability threshold is set to 0.63. The learning rate is set to 0.001.

5.2. Accuracy Evaluation

In our experiments, we aimed to select state-of-the-art log parsing methods as comparison baselines. However, the source code of some methods [38, 39], such as Uniparser [38], is unavailable; we attempted to reproduce Uniparser for further experiments, but the reproduced model did not yield satisfactory parsing results on certain datasets. Consequently, to assess the accuracy of DLLog, we compared it with six baseline log parsing methods: Drain [22], Spell [26], Nulog [40], IPLoM [24], Logram [41], and Brain [42]. Drain is a tree-based log parsing method, Spell is a log parsing method based on the longest common subsequence, Nulog is a log parsing method based on a deep self-supervised learning model, IPLoM is a log parsing method based on iterative partitioning, Logram is a log parsing method based on the N-gram statistical language model, and Brain is a rule-based log parsing method that utilizes the longest common pattern. These six log parsing methods have been introduced in detail in Section 2. The comparison results of parsing accuracy are shown in Table 6, with the best result for each dataset shown in bold.

As depicted in Table 6, DLLog achieves the best parsing accuracy in 7 out of the 16 log datasets, with an impressive average parsing accuracy of 0.891. Compared to state-of-the-art log parsing methods, the highest average parsing accuracy demonstrates the superiority of DLLog. DLLog also achieved high parsing accuracy scores on datasets where the optimal parsing accuracy was not attained. DLLog’s average parsing accuracy is approximately 4% higher than Brain and Drain. In comparison to the relatively lower accuracy of the Logram method, DLLog’s average parsing accuracy is 11% higher. However, we also observed that due to different rules in generating logs for various log systems, no log parsing method can achieve optimal parsing accuracy on all datasets.

Nearly every parsing method can achieve satisfactory parsing results on log datasets with simpler structures, such as the HDFS and Apache log datasets; some methods even achieve the optimal parsing accuracy of 1. For log datasets with more complex structures, such as the HealthApp and HPC log datasets, the accuracy of each parsing method decreases to varying degrees. However, DLLog still attains the highest parsing accuracy on both datasets. The Spell method, which relies solely on the longest common subsequence for log parsing, achieves accuracies of 0.654 on the HPC dataset and 0.787 on the BGL dataset, whereas DLLog, based on deep learning and the longest common subsequence, achieves accuracies of 0.996 on HPC and 0.988 on BGL. This suggests that fully utilizing the structural, frequency, and associative features of logs effectively aids the model in parsing logs and enhances log parsing accuracy.

5.3. Versatility Evaluation

Experiment 2 evaluated the versatility of each log parsing method. The purpose is to verify whether the proposed method can widely support different log data types. Detailed statistics are given in Table 7, including the median (Median), minimum (Min.), standard deviation (STD), and interquartile range (IQR). Figure 9 shows the boxplot of the accuracy distribution for each log parsing method. For each box in Figure 9, the lines from bottom to top represent the minimum observed value (lower bound), the lower quartile (Q1), the median (Q2), the upper quartile (Q3), and the maximum observed value (upper bound). The length of each box represents the interquartile range of the corresponding log parsing method.

From Table 7 and Figure 9, it is clear that DLLog has the smallest interquartile range of 0.186 and the smallest standard deviation of 0.143, which is 2.0%, 11.1%, 17.8%, 7.7%, 21.8%, and 14.8% lower than Drain, Spell, Nulog, IPLoM, Logram, and Brain, respectively. This indicates that DLLog has the highest versatility and stability compared with the other log parsing methods. The average parsing accuracy of Drain, Nulog, and Brain is basically the same, but Drain is better than Nulog and Brain in terms of stability. In contrast, the versatility of the Logram log parsing method needs to be improved. Overall, DLLog is superior to the other comparison methods in both accuracy and versatility.

5.4. Efficiency Evaluation

Because the system produces a large amount of log data in real time, the online parsing efficiency of log parsing methods must also be considered. Experiment 3 measures the running time each method spends parsing all HDFS and BGL log entries (2.16 GB in total). The experimental results are shown in Figure 10, where the ordinate represents the parsing time and the abscissa represents the size of the log data volume.

As illustrated in Figure 10, DLLog exhibits a linear growth trend with the increase of log data in both log datasets. Parsing the BGL log dataset takes more time for each log parsing method than parsing the HDFS log dataset, as the HDFS log dataset has 30 templates while the BGL log dataset has 619 templates (about 20 times more than the HDFS log dataset). Logram demonstrates the fastest parsing speed when the dataset size is less than 100 MB. This is because Logram, based on n-grams, calculates frequencies simply by counting, which saves a significant amount of time compared to other parsing methods based on complex rules. We found that Nulog has the slowest parsing speed because, throughout the entire parsing process, Nulog continuously needs to retrain its deep learning model.

Although DLLog also requires training a deep learning model, the model is only used during the initial offline stage (with a data size of 0.3 MB). In the online log parsing stage, DLLog pre-classifies log templates before template matching, so a newly arrived log only needs to be compared against the log templates of the matching category. This strategy significantly reduces template matching time. While DLLog does not achieve the fastest parsing speed among the compared parsing methods, its parsing speed remains within an acceptable range, and it attains the highest parsing accuracy and the best versatility.

5.5. Incremental Update Evaluation

The maintenance and upgrade of systems result in the generation of new log data. Therefore, it is crucial to consider the performance of log parsing methods when dealing with newly emerging log types. Experiment 4 evaluated the update capabilities of the 7 parsing methods on the HDFS and Android datasets, chosen due to their volumes and the availability of ground truths for such evaluations. We used an initial data volume of 2 K log entries (approximately 0.3 MB) from each dataset for model training. Subsequently, we applied the trained models to data volumes of 1 MB, 10 MB, 100 MB, 500 MB, and 1000 MB for each dataset. With the increase in the number of logs, new log types may emerge. For instance, the 2 K HDFS logs are generated from 14 log templates, while the 1000 MB HDFS dataset contains 29 log templates. An excellent log parser should exhibit stable accuracy when new logs are introduced, accurately parsing the new log types. The experimental results are shown in Figure 11, where the ordinate represents the parsing accuracy and the abscissa represents the size of the log data volume.

We can observe that, with the increase in volume and the introduction of new types of log data, DLLog demonstrates optimal stability, indicating its effective handling of new log types. However, it’s worth noting that the parsing performance of all log parsing methods on the HDFS dataset surpasses that on the Android dataset. This difference can be attributed to the significantly larger number of log templates in the Android dataset, making parsing more complex. DLLog’s exceptional incremental update capability, derived from the combination of deep neural networks and the longest common subsequence, enables it to effectively process new log types. Consequently, even with an increase in log data volume, the decline in parsing accuracy is not significant.

5.6. Ablation Experiment

In the process of constructing the log word frequency sequence, we conducted ablation experiments using two sequences, one sorted according to our method and the other unsorted, to verify the effectiveness of our approach. We also conducted a comparative experiment between DLLog based on LSTM and DLLog based on GRU. The experimental results are illustrated in Table 8.

As depicted in Table 8, DLLog is significantly influenced by whether the log word frequency sequence is sorted. DLLog based on the sorted log word frequency sequence exhibits an average parsing accuracy that is 39% higher than the unsorted version. Moreover, substantial improvements are observed on individual datasets, such as Thunderbird and Linux. This is attributed to the presence of numerous templates and variable parameter words in these datasets, making the sorting of the log word frequency sequence more impactful on DLLog's training. We arrange the word frequency sequence, transformed based on the frequency table F, in ascending order, leveraging the fact that high-frequency template words appear more often together with the construction method of the frequency table. This arrangement ensures that high-frequency template words form a “fixed” combination, with dynamic parameters following, allowing the GRU neural network to accurately learn the log pattern combinations.

On most datasets, the parsing accuracy of DLLog based on GRU is similar to that based on LSTM, but there is a significant difference in parsing accuracy on OpenSSH and Linux. We believe that in the majority of cases the classifications of the two are consistent; only for a few log sequences do the classifications by LSTM and GRU differ, leading to some inaccuracies in log grouping. Moreover, these log groups constitute a significant portion of the data, resulting in a substantial difference in parsing accuracy between the two according to the formula for grouping accuracy calculation. We attribute the difference in parsing accuracy between GRU and LSTM to the fact that GRU, as a simplified variant of LSTM, predicts more accurately on short-sequence datasets, and logs are a type of short-sequence data. Conversely, for longer log sequence data, LSTM performs slightly better than GRU.

6. Conclusions

In this paper, we proposed DLLog, an online log parsing method for accurately and incrementally parsing templates without the need for domain-specific knowledge. DLLog leverages the GRU neural network for offline template word mining and the longest common subsequence for parsing log entries in real time. By utilizing multiple log entry features, DLLog can autonomously extract template words, eliminating the requirement for manual intervention and enhancing its versatility in parsing unstructured log data. Additionally, DLLog supports incremental updates of the log template set, making it adaptable to newly generated log entries in evolving systems. We conducted a comprehensive evaluation of DLLog on multiple extensive log datasets, and the experimental results demonstrated its remarkable accuracy, universality, and adaptability when parsing large-scale log data. In future research, we intend to incorporate location information and character features of log words to help the log parsing method distinguish between log parameter words and log template words, further enhancing the precision and effectiveness of DLLog.

Data Availability

Data deposited in a repository (https://www.usenix.org/cfdr/).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The work was supported by National Natural Science Foundation of China under grant no. 61672392 and no. 62072342.