Abstract

Due to the complexity of social network server systems, various system abnormalities may occur, which in turn lead to subsequent system failures and information losses. Thus, monitoring the system state and detecting system abnormalities are of great importance. As the system log contains valuable information and records the system's operating status and users' behaviors, using log data for system abnormality detection and diagnosis can help ensure system availability and reliability. This paper presents a log analysis method based on deep learning for an intrusion detection system, which includes the following steps: preprocess the acquired logs of different types in the target system; parse the preprocessed logs using a clustering-based method; encode the parsed log events into digital feature vectors; apply an LSTM-based neural network and log-clustering methods to the encoded logs to generate warning information; lastly, trace the warning information back to the corresponding component to determine the point of intrusion. Finally, the paper implements the proposed intrusion detection method in a server system, thereby improving the system's security.

1. Introduction

The development and rise of social networks have changed our way of life, realized the interconnection between people, accelerated the speed of information dissemination, and changed social communication. While enjoying the convenience that social networks bring to our lives, we also need to protect our information: if a social network security breach occurs, our personal information can be stolen. The data stored in social network servers and the services they provide are themselves potential targets for various attacks. Due to their diversity and particularity, these attacks may have disastrous consequences. In this context, social network server security has become a major challenge, and research in this area is increasing. The development of various tools and mechanisms ensures that the safety level is improved and meets the requirements of modern life. These tools include the intrusion detection system (IDS). The IDS is a tool used to detect network attack attempts and to identify abnormal activities and behaviors designed to interfere with the normal operation of the system [1].

Various logs are generated during the operation of a host system. The logs record the status of the computer during operation and the various operations performed by the system. They are an excellent source of information for online data monitoring and anomaly detection. Therefore, auditing system logs can serve as an essential means of host anomaly detection. There are already various security audit systems on the market, such as log audit systems and IDSs. These systems implement log collection, auditing, and abnormal-behavior mining functions [2]. However, in actual use, due to the differences among logs and the simplistic, outdated log-auditing methods, these systems are often only suitable for specific types of hosts, and the abnormal behaviors they can detect are neither comprehensive nor accurate.

Data mining capabilities have been further improved with the development of AI and machine learning, and research results on machine-learning-based log auditing continue to appear. However, different host systems and processes generate inconsistent types of logs. Traditional machine learning detection methods need different feature extraction methods for different types of logs, and users must have professional background knowledge to extract the logs' characteristic information well. In reality, the types and grammar of logs are constantly updated, a given method cannot be directly applied to multiple systems, and it takes much manual effort to update and adapt it. A more advanced approach is to use deep learning for anomaly detection. The ever-increasing amount of system log data is sufficient for a deep learning model to learn from. With appropriate parameters, there is almost no need to extract features manually, and the deep learning model can perform log detection well.

For this purpose, in this paper, we propose a log analysis method based on deep learning and apply it in an IDS to ensure the server's security. Figure 1 shows the scenario of attackers and defenders using the IDS. Attackers on the Internet use various ways to intrude into social network servers and then leak users' private information. The IDS monitors the operating conditions of the social network servers in accordance with certain security policies and discovers, as far as possible, various attack attempts, attack behaviors, or attack results, so as to ensure the confidentiality, integrity, and availability of network system resources [3]. The proposed method will thus help protect users' privacy on the social network. This paper's main work is to build a log parser, construct the feature extraction neural network, and finally implement the IDS.

The rest of this paper is organized as follows. Section 2 summarizes the related work and emphasizes log analysis and AI methods. Section 3 proposes the deep learning methods for the IDS. Section 4 shows the experiments and results. Finally, Section 5 concludes the paper.

2. Related Work

2.1. Social Networks' Security

The social network is a platform for human beings to transmit information and conduct social conversations on the Internet. Many researchers have shown that, with the rapid development of social networks, the way Internet users exchange information has undergone tremendous changes, and social networks play an increasingly important role in people's everyday life [4]. People can follow the latest developments of friends and celebrities on Sina Weibo or get other people's attention by sharing information on WeChat Moments. Social networks provide users with the following main functions: (1) creating and sharing user information that is public or semipublic within a certain scope; (2) providing a list of users who can be contacted and providing users with a communication platform; (3) online chatting, making friends, video sharing, blogs, online communities, music sharing, adding comments, etc.; (4) providing open interfaces for application plug-in development.

Simultaneously, social networks face increasingly serious security threats and have become key targets of attacks on the Internet. To achieve information exchange and public communication, people are willing and able to make their information public on various social network websites [5]. Without privacy protection and the corresponding technical awareness, the safe use of social network websites cannot be guaranteed. Coupled with the lack of relevant safety technology, laws, and regulations, safety hazards have gradually become prominent. We must attach great importance to the security of social networks [6].

Statistics and research by relevant national and international agencies on social-networking security issues show that, as applications running on computer networks, social networks share many security problems with traditional computer networks [7]. However, due to the openness of social networks, in addition to various traditional network information security threats such as viruses, vulnerabilities, and Trojan horses, social networks also face many attacks targeting their own characteristics [8]. These threats can be divided into three categories: user privacy and data security risks triggered by information leaks; security risks caused by the proliferation of spam and by cyberattacks; and network security threats caused by Internet rumors and public opinion, along with other security risks [9].

The security solutions of social networking sites can be implemented through appropriate technical means and management measures, although the effect of any single technology or management measure on security protection is limited. The technical methods are as follows [10]: 1. identity authentication schemes; 2. social network identity encryption schemes; 3. data security management; 4. grouped data access control. The construction of social network security management measures includes the following aspects: 1. strengthen the construction of website security; 2. strengthen users' awareness of data protection; 3. strengthen supervision of third parties; 4. strengthen legal supervision.

2.2. Deep Learning Algorithm for Security

Developments in information technology have required newer and better methods to analyze how information systems work. Various machine learning methods are used to study the operating principles of these systems. Deep learning is regarded as an advanced technology and is widely deployed in multiple fields, including pattern recognition, natural language processing, and network security. Due to the corresponding increase in data production, the ordinary machine learning methods deployed in the field of cybersecurity are gradually becoming unable to perform intrusion detection in network systems [11]. Therefore, using deep learning methods for big data analysis is an innovation that attempts to study the patterns of network connections to detect unauthorized access to computer networks.

Primarily, such a system operates according to known principles and includes two machine learning approaches: probabilistic and deterministic [12]. The deterministic machine learning method employs a small sample dataset and analyzes it to find any deviation from regular patterns. IT experts then evaluate this information and develop models for classifying the data and processing the results. Usually, the information in the model is compared with a baseline; therefore, any abnormal data beyond the average level is deemed to be intrusive [13].

The probabilistic method of machine learning represents another big step forward because it can evaluate patterns that deterministic analysis may fail to capture [14]. The whole system depends on clustering to detect any strange characteristics in the data. The system operates unsupervised: it works independently, generates the map, and ultimately analyzes any abnormal behavior on the same machine, so this method is more effective because the evaluation is conclusive. To be precise, by a conservative estimate, 90% of such problems can be solved [15].

Deep learning has also recently gained attention because of its advantages. This method is dynamic because the system is predictive and can adapt to newly generated data. It is worth noting that this method uses the output of the top-down method as the input of the bottom-up method [16]. The model also uses linear models to extract features, and these linear models serve as the essential functions of the layers. These layers depend on each other to form a deeper system architecture [17].

2.3. Security Application for the Intrusion Detection System

With the constant development of web technology, the web is used more and more widely. The frequency with which malicious hackers attack a website is usually proportional to the website's value. Even if a website's value is comparatively insignificant, it will face malicious probing attacks by "script kiddies" or scans by various large-scale vulnerability scanners, as a saying in the security industry goes: "there are only two kinds of people in the world: those who know they have been hacked and those who do not know they have been hacked" [18].

At this point, log analysis of the website is critical. If the website's managers, operation staff, and maintenance technicians are not aware of the real-time security status of the server, they fall into the "do not know whether they were hacked" category and will suffer losses. Of course, there is another situation in which hacking has already caused economic losses [19]. In that case, we also conduct log analysis and other emergency measures to try to recover the loss. In short, the two most direct and apparent purposes of implementing security log analysis are to understand the security events that occur on the server and to support emergency analysis and evidence collection [20].

The development of deep learning technology has promoted progress in intrusion detection research. Deep learning can use a hierarchical structure to achieve unsupervised feature learning and data pattern classification. The feature extractors and classifiers can be integrated into one framework without the need for security experts to extract features manually [21]. Deep learning methods can efficaciously deal with large-scale network traffic data. Compared with shallow traditional machine learning methods, deep learning has higher efficiency and detection rates, but its training process is more complex and the model's interpretability is poor.

At present, IDSs based on deep learning address the following problems: (1) the class-imbalance problem: abnormal traffic data in the network is far rarer than normal traffic data [22]; (2) the network data volume is large and the feature dimension is high, so shallow machine learning techniques struggle to detect massive high-dimensional data and to extract nonlinear features from it [23]; (3) the algorithms themselves need improvement.

For the above problems, current work can be summarized as follows. (1) If an imbalanced dataset is used to build an intrusion detection model, it will seriously impact detection accuracy. Two methods are usually used to solve the imbalance problem in the data: a solution at the algorithm level (cost function methods) and a solution at the data level (undersampling and oversampling methods) [24]. Some studies have used these two methods to deal with the imbalance of data categories in IDSs, but most research works have not considered data imbalance. With the emergence in recent years of GANs with strong generation capabilities, new ideas have been provided to solve data category imbalance: a GAN can address this problem by generating new data. Therefore, the problem of data category imbalance is still studied as one of the hot issues [25]. (2) Most deep learning algorithms, such as AE, DBM, DBN, LSTM, and CNN [26], have been used to solve this problem. In the deep learning comparison experiments cited in the literature, the detection accuracy of deep learning methods usually exceeds that of traditional machine learning techniques, such as naive Bayes, decision tree, random forest, and support vector machine [27], reflecting the effectiveness of deep learning detection methods [28]. However, some evaluation indicators, such as the detection accuracy of binary and multiclass classification in deep-learning-based IDS research, still need to be improved [29]. For example, the accuracy of the multiclass problem on the KDDTest+ test set in the NSL-KDD dataset is roughly in the range of 79%–85%, while the accuracy on the KDDTest-21 test set is roughly 60%–69% [30]; there is still room for improvement. (3) Some studies have improved deep learning algorithms to improve model detection ability, such as using an SVM to replace the softmax function to improve the detection ability of a GRU model [31] or using a genetic algorithm to optimize the network structure of a DBN [32].

3. System Description and Methodology

3.1. System Architecture

Each log line is generated by an output statement in the source code. For example, if the log print statement of a process is printf("accepted password for %s from %s port %d ssh", user, host, port), then, while the program runs, it may generate a log line such as: "17 Nov 2020 19:12:48 combo sshd(pam)[1241]: accepted password for root from 10.21.234.33 port 8888 ssh." Logs printed by the same source code statement are of the same type, and the corresponding template is called a log key.
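
To make this concrete, the sketch below (reusing the sshd example above) shows how a regex derived from the print statement recovers the parameters, while every matching line shares the same log key:

```python
import re

# Regex derived from printf("accepted password for %s from %s port %d ssh", ...):
# the constant text identifies the log key; the groups capture the parameters.
TEMPLATE = re.compile(
    r"accepted password for (?P<user>\S+) from (?P<host>\S+) port (?P<port>\d+) ssh"
)

line = ("17 Nov 2020 19:12:48 combo sshd(pam)[1241]: "
        "accepted password for root from 10.21.234.33 port 8888 ssh")

match = TEMPLATE.search(line)
if match:
    # All lines matching this template have the same log key;
    # only the parameter values (user, host, port) differ.
    print(match.group("user"), match.group("host"), match.group("port"))
```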

Adjacent or similar logs have a high degree of relevance, and a log often depends on the previous logs. When a log breaks this logic, it means that an abnormal log execution path has occurred. Thus, we can regard the log execution sequence as a multiclassification problem [33]. The total number of log keys is fixed; we denote it by K. In the training phase, we input typical log execution sequences to generate a multiclassifier model. In the testing phase, we input the history of the most recent log keys, and the output is a probability distribution over the next log key. When the error between the sequence prediction result and the actual result is large, we can consider the log to be abnormal.
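
As an illustration of this formulation (a minimal sketch; the window size h and the toy key sequence are assumptions, not values from the experiments), the key sequence can be sliced into (history, next key) training pairs:

```python
from typing import List, Tuple

def make_training_pairs(keys: List[int], h: int) -> List[Tuple[List[int], int]]:
    """Slice a log key sequence into (history window, next key) pairs.
    Each pair is one multiclass sample: given the h most recent log
    keys, predict which of the K possible keys comes next."""
    return [(keys[i:i + h], keys[i + h]) for i in range(len(keys) - h)]

# Toy sequence over K = 4 log key types, window size h = 3.
for window, nxt in make_training_pairs([1, 2, 3, 1, 2, 4, 1], h=3):
    print(window, "->", nxt)   # e.g., [1, 2, 3] -> 1
```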

Figure 2 describes the procedure of the IDS. The input is a sequence of log data, and the output is the probability value. The log parser will parse the log data into the appropriate form and feed it to the neural network.

3.2. Principle of Log Parser

Logs can be divided into log keys and log parameters. We must first separate the two and parse the logs into a structured form. The complete log parsing process is as follows (Algorithm 1).

Input: raw log input
Output: structured log key sequence output
(1) Initialize the template list, the log group list, and the output sequence
(2) Store parsed results to the output sequence
(3) Read one log line from the STREAM
(4) Traverse the template list to find the template sharing the largest common subsequence with the current line
(5) if the match is below the similarity threshold then
(6)  GOTO (10)
(7) else
(8)  GOTO (11)
(9) end if
(10) Initialize the line of log into the template list as a new template
(11) Update the line of log in its log group
(12) Update the matched template
(13) GOTO (3)
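
A minimal Python sketch of such a longest-common-subsequence parser is given below. It illustrates the flow of Algorithm 1 rather than the exact implementation; the similarity threshold, the token-level matching via difflib, and the wildcard "*" are our assumptions:

```python
from difflib import SequenceMatcher

templates = []  # one token list per discovered log key

def parse_line(line: str, threshold: float = 0.5) -> int:
    """Steps (3)-(13): match a log line against known templates by
    common-subsequence similarity; create a new template when no
    existing one matches well enough. Returns the log key index."""
    tokens = line.split()
    best_idx, best_ratio = -1, 0.0
    for idx, tpl in enumerate(templates):          # step (4)
        ratio = SequenceMatcher(None, tokens, tpl).ratio()
        if ratio > best_ratio:
            best_idx, best_ratio = idx, ratio
    if best_ratio < threshold:                     # steps (5)-(6), (10)
        templates.append(tokens)
        return len(templates) - 1
    tpl = templates[best_idx]                      # steps (8), (11)-(12):
    templates[best_idx] = [a if a == b else "*"    # keep constant tokens,
                           for a, b in zip(tpl, tokens)]  # wildcard the rest
    return best_idx
```
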
3.3. Handling Feature Extraction

After log parsing is completed, we have obtained the structured logs of the system. However, at this point the log key is only a string, and the parameter-list elements are also strings, which cannot be directly used as input to the deep learning model, so we need to encode them as numeric feature vectors. The feature extraction process converts the strings into quantifiable numbers, thereby constructing a feature vector matrix. We use two different encoding methods for log keys and parameters because of their different formation methods and meanings. Figure 3 shows the flow chart of the feature extraction.

3.3.1. Log Key Encoding

Since logs are output by the program's code or processes and the code is fixed, the set of output log types is also fixed, and its size is often not very large. Therefore, log keys are directly coded using sequential numbers. For example, for log keys $k_1, k_2, \ldots, k_K$, we directly characterize them as $1, 2, \ldots, K$.
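
For instance (a trivial sketch with made-up key strings):

```python
# Map each distinct log key to a sequential integer code.
log_keys = ["accepted password", "session opened", "connection closed"]
key_code = {key: i + 1 for i, key in enumerate(log_keys)}
# {'accepted password': 1, 'session opened': 2, 'connection closed': 3}
```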

3.3.2. Log Parameter Encoding

Unlike the log type, a parameter value is not generated by the template but dynamically generated according to the actual situation during system operation, so it often has great uncertainty. The string values of the parameters are many and various; in many cases, direct use of simple integer permutation codes would make the encoding length too large.

Our approach is to extract all parameter lists, perform parameter preprocessing, and remove all punctuation marks and special characters, because these characters are not used as criteria for parameter abnormality and may affect accuracy. Then, all the parameter strings are recomposed to form a token list; the text.Tokenizer module under the deep learning library Keras is used to process the strings, and its fit_on_texts method is used to learn the text dictionary, that is, the mapping between words and numbers, along with statistics such as the word frequency of the parameter values. Then, the texts_to_sequences function of the text.Tokenizer module converts the parameter text into numbers, and 0 is used to pad sequences of different lengths to the same length.
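
A minimal sketch of this step with the Keras text utilities (the sample parameter strings are made up; Tokenizer's default filters already strip most punctuation):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Made-up parameter lists extracted from parsed logs.
params = ["root 10 21 234 33 8888",
          "guest 10 21 234 57",
          "root 10 21 234 33 8888 retry"]

tokenizer = Tokenizer()                      # keras text.Tokenizer
tokenizer.fit_on_texts(params)               # learn the word -> number dictionary
seqs = tokenizer.texts_to_sequences(params)  # parameter text -> numbers
matrix = pad_sequences(seqs, padding="post") # pad with 0 to equal length
```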

3.4. Anomaly Detection Scheme

Given a set of normal log key sequences, we input them into the LSTM model for training and obtain a detection model. When the system is running, the log sequence generated by the system is collected, a log key sequence is obtained after log parsing and feature extraction, and the LSTM model is then used to predict the probability distribution of the next log key following the sequence. If the actual next log key has a sufficiently high probability in the predicted distribution, it is considered a regular log; otherwise, it is judged to be abnormal.

LSTM is a special type of RNN deliberately designed to avoid the long-term dependency problem; storing information for a long time is its default behavior. All recurrent neural networks have chains of repeating neural network modules. Unlike the single neural network layer of a plain RNN, an LSTM has four network layers that interact in a very special way [34]. The LSTM core unit process is shown in Figure 4, and the whole process is described as follows (Algorithm 2).

Input: input sequence
Output: prediction
(1) while BatchNotFinished do
(2)  InitializeParameters (the weights and biases of the forget, input, candidate, and output layers)
(3)  Concatenate the previous hidden state $h_{t-1}$ with the current input $x_t$
(4)  Pass the concatenation into the forget layer and DELETE irrelevant data
(5)  Create the candidate values $\tilde{C}_t$ for the cell state
(6)  Use the input layer to decide which candidate data to keep
(7)  Calculate the new cell state $C_t$ using the forget, candidate, and input layers
(8)  Calculate the current output $o_t$
(9)  Update the new hidden state $h_t$
(10) end while
(11) Output the prediction

The first step of the LSTM is to select what information to discard from the cell state. This decision is made by a sigmoid (S-shaped) network layer called the "forget gate layer." It receives $h_{t-1}$ and $x_t$ and outputs a value between 0 and 1 for each number in the cell state $C_{t-1}$; 1 means "accept this completely," and 0 means "ignore this completely":

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$

The next step is to determine which new information to store in the cell state. This has two parts. In the first part, a sigmoid network layer called the "input gate layer" determines which values to update. In the second part, a tanh network layer creates a new candidate value vector $\tilde{C}_t$ that can be added to the cell state. In the next step, we combine these two parts to generate an update to the state:

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad \tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$$

Now, we update the old cell state $C_{t-1}$ to $C_t$. The previous steps have already decided what to do; we just need to do it. We multiply the old state by $f_t$, forgetting what we decided to forget. Then, we add $i_t \ast \tilde{C}_t$, the new candidate values scaled by how much we decided to update each state value:

$$C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t$$

Finally, we need to calculate the output value. The output depends on our cell state, but it will be a "filtered" version. First, we run a sigmoid network layer to determine which parts of the cell state to output. Then, we pass the cell state through tanh (squashing the values between $-1$ and $1$) and multiply it by the output of the sigmoid layer, so that we output only the parts we want:

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t \ast \tanh(C_t)$$
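
Putting the four equations together, a bare-bones NumPy sketch of one LSTM step looks as follows (weight shapes and initialization are omitted; W and b are assumed to hold one matrix/vector per gate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step implementing the equations above."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate
    c_hat = np.tanh(W["C"] @ z + b["C"])    # candidate values
    c_t = f_t * c_prev + i_t * c_hat        # new cell state
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate
    h_t = o_t * np.tanh(c_t)                # new hidden state
    return h_t, c_t
```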

In the training phase, the model needs to find an appropriate weight distribution so that the final output of the LSTM sequence produces the labels paired with the inputs in the training dataset. During training, each input/output pair uses the gradient descent method to minimize the loss and update the weights. The input is a log sequence, and the output is the log key value immediately following this log sequence. In training, the log key uses the categorical cross-entropy loss, and the parameters use the mean square loss to measure the error.

In the detection stage, we use a layer of LSTM blocks to predict the output for an input sequence, feeding each log key of the sequence to the corresponding LSTM block in this layer. We use multiple hidden layers, with the hidden state of one layer used as the input of the core blocks in the next layer, turning the model into a deep neural network; the input layer covers all $K$ log key types, with the logs encoded as one-hot vectors. The output layer uses a standard multinomial logistic function (softmax) to convert the output into a probability distribution. When the model's predicted log key differs greatly from the actual log key, exceeding the threshold we defined, the log is determined to have an abnormal execution path.
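
In Keras terms, a stacked-LSTM detector of this shape could be sketched as follows (the values of K, h, and the layer sizes are our assumptions, echoing the defaults discussed in Section 4):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

K = 29        # number of distinct log key types (dataset-dependent)
h = 10        # input window size (default used later in the experiments)
alpha = 64    # units per hidden layer (default)

model = Sequential([
    LSTM(alpha, return_sequences=True, input_shape=(h, K)),  # hidden layer 1
    LSTM(alpha),                                             # hidden layer 2
    Dense(K, activation="softmax"),  # probability distribution over next key
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
# model.fit(X, y)  # X: one-hot windows (N, h, K); y: one-hot next keys (N, K)
```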

The architecture of the LSTM-based anomaly detection model is shown in Figure 5. The top of the figure shows an LSTM module, and repeated LSTM blocks make up the entire architecture. Each LSTM module records its state as a fixed-dimensional vector. The state of the LSTM module from the previous time step and the external input are used as the next LSTM module's input to calculate the new state and output. This mechanism ensures that the information in the log sequence is passed from one LSTM block to the next [35].

Figure 4 shows a sequence of the core blocks that forms an unrolled version of the recurrent pattern. Each block keeps a hidden state $h_t$ and a cell state vector $C_t$, both of which are transferred to the next block to initialize its state. For each log key in the input sequence, we use one LSTM core block; therefore, a layer is composed of h unrolled LSTM core blocks. In an LSTM core block, the input $x_t$ and the output of the previous block determine: (1) the degree to which the previous cell state is retained in the current state; (2) how the current input and the previous output affect the state; and (3) the construction of the output.

These decisions are made using a combination of gate functions. The gates determine the state by controlling the information flow from the state retained by the previous LSTM block, the previous block's output, and the current block's input. A set of weights is learned for each gate.

4. Results and Analysis

To solve the security problem of social network servers, we propose an IDS based on log analysis. The proposed method in this paper is composed of log analysis, feature extraction, and classification detection. The feature extraction part uses the LSTM neural network so that the IDS can better extract the feature information hidden in the log and achieve better detection results. This section will analyze the dataset, evaluation method, and experimental results.

The HDFS audit log reflects the user operations on HDFS. Its detailed information includes the operation's success or failure, user name, client address, operation command, and operation directory. For each user operation, the NameNode organizes the information as key-value pairs into a log of fixed format and then records it in the audit.log file. Through the audit log, we can view the various operating conditions of HDFS in real time, track misoperations, and perform indicator monitoring.

The dataset uses the public HDFS log dataset. The HDFS data contains 11,175,629 log messages, which are Hadoop logs collected from the Amazon EC2 platform, and are marked as normal or abnormal by experts in the relevant field [36]. The parsed dataset is shown in Table 1.

4.1. Numerical Standardization

First, calculate the average value and the average absolute error of each attribute:

$$m_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad s_j = \frac{1}{n}\sum_{i=1}^{n} \left|x_{ij} - m_j\right|$$

Here, $m_j$ stands for the average value of attribute $j$, $s_j$ stands for the average absolute error of attribute $j$, and $x_{ij}$ stands for attribute $j$ of record $i$. Then, standardize each data record:

$$x'_{ij} = \frac{x_{ij} - m_j}{s_j}$$

Here, $x'_{ij}$ stands for attribute $j$ of record $i$ after standardization.

4.2. Numerical Normalization

Normalize each standardized value to the interval $[0, 1]$:

$$x''_{ij} = \frac{x'_{ij} - x'_{\min}}{x'_{\max} - x'_{\min}}$$

Here, $x'_{\max}$ stands for the maximum value of the sample data, and $x'_{\min}$ stands for the minimum value [37].
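
A direct NumPy rendering of these two steps (columns are attributes, rows are records; attributes with zero spread would need guarding in practice):

```python
import numpy as np

def standardize(X: np.ndarray) -> np.ndarray:
    """Standardize each attribute by its mean and average absolute error."""
    m = X.mean(axis=0)                   # m_j
    s = np.abs(X - m).mean(axis=0)       # s_j
    return (X - m) / s

def normalize(X: np.ndarray) -> np.ndarray:
    """Scale each attribute of the standardized data to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

X = np.array([[1.0, 200.0], [2.0, 240.0], [3.0, 280.0]])
print(normalize(standardize(X)))
```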

4.3. TF-IDF Term Weighting

In a large text corpus, some words appear very frequently yet carry little information. We cannot use these words' raw frequencies directly in the classifier, as they would drown out the rarer terms we are actually interested in. We need to re-weight the feature frequencies into floating-point numbers to facilitate the classifier's use. This step is completed by the TF-IDF transformation [38].

The more common a word is, the larger the denominator, and the smaller the inverse document frequency, approaching 0. The reason for adding 1 to the denominator is to avoid a zero denominator (that is, the case where no document contains the word). The $\log$ operator means taking the logarithm of the obtained value:

$$\mathrm{idf}(t) = \log \frac{N}{1 + \mathrm{df}(t)}$$

where $N$ is the total number of documents and $\mathrm{df}(t)$ is the number of documents containing term $t$.

A high word frequency within a particular document and a low document frequency across the entire collection produce a high TF-IDF weight. Therefore, TF-IDF tends to filter out universal words and keep meaningful ones. That is, $\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$.
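
The following toy computation follows the formulas above literally (note that production libraries such as scikit-learn use slightly different smoothing):

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF with idf(t) = log(N / (1 + df(t))), as defined above."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({t: (c / len(doc)) * math.log(n / (1 + df[t]))
                    for t, c in tf.items()})
    return out

docs = [["open", "file", "error"], ["open", "file"], ["write", "error"]]
print(tf_idf(docs)[0])   # frequent terms get weights near (or below) zero
```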

4.4. Word2Vec

The training process of word2vec trains a shallow neural network to map each word in the training set into a vector space of a specified dimension [39].

The basic unit of word2vec vectorization is the word. Each word is mapped to a vector of the specified dimension, and all the words in a word sequence (sentence) form a vector matrix (number of words × the specified word2vec embedding dimension). However, the input required by the machine learning algorithm is a one-dimensional tensor. Therefore, we also need further feature processing, using the word vector table to encode the original corpus together with the TF-IDF method.
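
A minimal gensim sketch of this mapping (the token lists are made up; gensim ≥ 4 uses vector_size, older versions call the same parameter size):

```python
from gensim.models import Word2Vec

# Token lists produced by the parameter preprocessing step (made up here).
sentences = [["root", "port", "8888"],
             ["guest", "port", "22"],
             ["root", "session", "opened"]]

model = Word2Vec(sentences, vector_size=32, window=3, min_count=1)
vec = model.wv["root"]   # 32-dimensional embedding for one token
```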

Precision and recall rate are used in this paper as indicators:
TP (True Positive): the number of malicious samples correctly classified as malicious
TN (True Negative): the number of benign samples correctly classified as benign
FP (False Positive): the number of benign samples mistakenly classified as malicious
FN (False Negative): the number of malicious samples mistakenly classified as benign

The F1-score is defined as the weighted harmonic mean of the test's precision and recall [40]; its calculation takes both into account. Precision, also known as positive predictive value, is the proportion of positive results that are truly positive, $\mathrm{precision} = \mathrm{TP}/(\mathrm{TP} + \mathrm{FP})$. Recall (also called sensitivity) is the ability of the test to correctly identify positive results, $\mathrm{recall} = \mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$. The F1-score,

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$

reaches its best value of 1 when precision and recall are both 1, and its worst value of 0 when either is 0.

The test environment is shown in Table 2, and the parameters’ description is shown in Table 3.

The parameters g and h represent the sensitivity of the IDS, and the adjustment of these two parameters is the core task of the experiment. The parameters L and α affect the performance of the IDS, and their adjustment must consider both effectiveness and overhead. The following default values are used to train the deep learning model: g = 10, h = 10, L = 2, and α = 64. Here, g represents the range of normal predictions in the output (the number of top-ranked log keys treated as normal), h is the sequence window size, and L and α represent the number of hidden layers and the size of each hidden layer in the LSTM model. Each of the four parameters is adjusted in these experiments; while the target parameter is adjusted, the other three remain at their default values. To evaluate the model, both the F1-score and the all-around performance should be considered. Increasing the value of L or α increases the model's complexity, lengthening the training cycle and raising the system overhead. The parameters are varied in turn, and the results are as follows.
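
A minimal sketch of the resulting normality check (assuming a trained model whose output is a probability distribution over the K log keys):

```python
import numpy as np

def is_anomalous(probs: np.ndarray, actual_key: int, g: int = 10) -> bool:
    """Flag the actual next log key as anomalous when it is not among
    the g most probable candidates predicted by the model."""
    top_g = np.argsort(probs)[::-1][:g]
    return actual_key not in top_g

# probs = model.predict(window)[0]   # one distribution per input window
```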

In Figure 6, the compared parameter is g, the number of top-ranked candidates in the prediction output considered normal. The range of this parameter is chosen from 6 to 12 because performance is deficient when the value is under 6 and the overhead is high when the value is above 12. It can be concluded from the figure that, when the parameter is 11, the F1-score reaches its highest point and decreases slightly when the parameter exceeds 11. In Figure 7, the compared parameter is the window size h, the length of the input sequence. The range of this parameter is chosen from 6 to 13 for reasons of efficiency and cost. In the experiment, the F1-score is at a high point both when the parameter is 9 and when it is 13. However, the precision and recall are closer when the parameter is 13; therefore, the value should be 13.

In Figure 8, the compared parameter is the number of hidden layers L, which relates to the resolution and complexity of the neural network. The range of this parameter is chosen from 1 to 5 to balance benefit and cost. It can be concluded from the figure that the F1-score and the performance of the model are most balanced when the parameter is 2. In Figure 9, the compared parameter is the size of each hidden layer (α), which represents the number of node units in a hidden layer. The value of this parameter depends on the convergence, the scale of the input and output layers, and the training samples. Here, we tested from 16 to 512 nodes per hidden layer, and the figure indicates that the F1-score has already reached a high point when the value of this parameter is 64. When the parameter increases beyond 64, the indicator does not vary much, but the system overhead increases obviously. Thus, we conclude that the parameter should be 64 in this experiment.

In conclusion, the model achieves the best performance when g = 11, h = 13, L = 2, and α = 64. A further comparative study showed that the proposed method performs better than other deep learning algorithms and traditional machine learning algorithms. This paper selects MLP [41], RBM [42], SVM [43], and naive Bayes [44] for comparative analysis, uses the comprehensive F1-score indicator for comparison, and uses training samples of different magnitudes. As shown in Figure 10, the algorithm proposed in this paper achieves a more accurate recognition rate than traditional machine learning algorithms and better results than ordinary deep learning algorithms.

5. Conclusion and Future Works

Social networking sites constantly face server security risks; the exposure of social network servers to the ubiquitous Internet environment subjects them to attacks from everywhere. By applying deep-learning-based log analysis methods, IDSs can detect potential cyberattacks and keep the social network server secure. This paper first analyzes the features of social network servers and summarizes their existing security issues. It then discusses solutions to these issues, selects social network website server security protection as the entry point, and summarizes the existing protection solutions. After comparing and analyzing their advantages and disadvantages, a log detection scheme based on deep learning is proposed to protect social network servers.

In this paper, a security log analysis method for the IDS is proposed using deep learning algorithms. The proposed IDS has three parts: log parser, feature extraction, and anomaly detection. The log analysis method combines deep learning algorithms with traditional black and white lists, rule matching, and statistical strategies so that they complement each other's advantages. Traditional analysis methods have good identification capabilities for known and common attack behaviors, while deep learning algorithms have the adaptive ability to detect unknown and new attack behaviors and to distinguish different types of attacks; the performance of deep learning is better, and the overhead is lower. In the experiments, the proposed method is shown to significantly improve detection capability and to outperform the other comparative methods. Nevertheless, this paper mainly focuses on classical datasets and algorithms and has not made further comparisons with newer datasets and algorithms. More datasets and algorithms will be employed in future work to conduct experiments, and the proposed scheme will be optimized.

Data Availability

The HDFS data used to support the findings of this study have been deposited in the Wei Xu repository (https://github.com/logpai/loghub/tree/master/HDFS).

Conflicts of Interest

There are no conflicts of interest to declare.