Abstract

The explosive growth of web-based technology has led to an increase in sophisticated and complex attacks targeting web applications. To protect against this growing threat, a reliable web attack detection methodology is essential. This research proposes a character-level multichannel multilayer dilated convolutional neural network (MC-MLDCNN) to identify web attacks accurately. The model receives the full text of HTTP requests as input, embedded at the character level. Feature extraction is therefore carried out automatically by the model with no additional effort, which significantly simplifies the preprocessing phase. The methodology consists of multichannel dilated convolutional neural network blocks with various kernel sizes, where each channel comprises several layers with exponentially increasing dilation sizes. By integrating multichannel and multilayer dilated convolutional neural networks, the model efficiently captures the temporal relations and dependencies of HTTP requests at character granularity, across different scales and levels. This structure enables the model to capture dependencies over long sequences of HTTP requests and consequently identify attacks accurately. Experiments carried out on the CSIC 2010 dataset show that the proposed model outperforms several state-of-the-art deep learning-based models in the literature as well as some traditional deep learning models, identifying web attacks with a precision score of 99.65%, a recall score of 98.80%, an F1 score of 99.22%, and an accuracy score of 99.36%. A useful web attack detection system must balance accurate attack identification with minimizing false positives (identifying normal requests as attacks). The success of the model in recognizing normal requests is therefore further evaluated to guarantee increased security without sacrificing web applications’ usability and availability.

1. Introduction

1.1. Background

Web applications are the gateway to a great deal of sensitive data. The convenience of the Internet enables a large number of attackers to interact with web applications. Attackers have been able to conduct massive attacks more swiftly due to sophisticated and well-planned attack strategies carried out through a variety of network technology tools. Web attacks continue to grow in both frequency and severity daily. Internet users are particularly at risk of threats that could result in monetary loss, identity theft, data fraud, and a loss of trust in conducting business. It is estimated that by 2025, the financial loss will amount to $10.5 trillion [1]. Researchers are developing new theoretical advances in web security to reduce web attacks [2].

1.2. Limits of Prior Art

For more than a decade, numerous studies have provided solutions that make use of machine learning (ML) approaches to address the challenges of web attack detection. However, such solutions are believed to have achieved only limited adoption in practice, since they demand a significant amount of expert effort to develop and maintain [3]. Many solutions in this discipline require feature engineering to attain the desired performance. Current approaches focus on bag-of-words feature engineering techniques, which capture only the presence of individual words within the dataset. As a result, these techniques fail to precisely represent the sequential structure of the data. Moreover, ML feature representations must be updated to keep up with the most recent web attacks. Given that feature engineering is frequently considered the most time-consuming component of developing an ML system, deep learning has drawn considerable attention for the detection of web attacks.

Deep learning algorithms are a branch of ML that employs artificial neural networks. They improve on conventional ML techniques because they can automatically learn the expressive feature representation, the lexical pattern, and the sequencing pattern of a given input while generalizing the expressive data representation. They have demonstrated outstanding performance in many research fields such as image processing [4], natural language processing [5], speech recognition [6], computer vision [7], human activity recognition [8–12], and cyber security [13, 14]. In recent years, 1D convolutional neural networks (1D CNNs) in particular have shown impressive performance for text classification [15, 16]. A considerable amount of research has also been conducted on web security based on deep learning approaches [17, 18]. In a number of deep learning-based approaches in the literature, the preprocessing stage of HTTP requests is greatly simplified by operating at the character level [2, 19]. These methods treat HTTP requests as sequences of characters and attempt to capture character-granularity dependencies to extract the pattern of attacks.

Despite the state-of-the-art performance of deep learning applications for web attack detection, they still have limitations when it comes to capturing long-term relations. For instance, in order to cover longer sequences, convolutional neural networks (CNNs) require larger receptive fields. Wider receptive fields necessitate more layers, and additional layers lead to additional parameters and a more challenging training procedure [20]. Another effective technique for handling sequential data is the use of long short-term memory networks (LSTMs), in which the receptive field may span the entire input. Yet, owing to the issue of vanishing/exploding gradients, LSTMs still have difficulty learning very long-term relations [21].

1.3. Research Motivation

Dilated convolutional neural networks (DCNNs) can be considered a middle point between CNNs and LSTMs [22]. A dilated CNN is a type of CNN in which the filter has a defined spacing or dilation rate, enabling the network to increase its receptive field without adding more parameters. Dilated CNNs can thereby significantly grow receptive fields without affecting resolution or coverage [23]. Dilated CNNs have lately achieved remarkable success in image segmentation [24], text classification, and text-to-speech [25, 26]. This study’s main goal is to accurately identify web attacks using the full text of HTTP requests while keeping the preprocessing phase as simple as possible. Hence, the HTTP requests are regarded as sequences of characters. Dilated CNNs’ effectiveness in capturing long-term relations and dependencies, in contrast to the shortcomings of LSTMs and CNNs, served as our motivation. In this study, a model based on dilated CNNs is developed as a multichannel multilayer dilated CNN (MC-MLDCNN). The model takes the full text of HTTP requests as the input, preprocesses them at the character level, and passes them as vectors to the model for the detection task.

The proposed methodology is developed specifically to detect the complicated and challenging patterns of web attacks that traditional security techniques could fail to recognize. The proposed model is expected to perform better than several existing models in terms of accuracy, false positive rate (FPR), precision, recall, and F1 score, demonstrating its efficiency in identifying web attacks and hence improving the security posture of online systems and services.

1.4. Main Contribution

This study proposes a character-level multichannel multilayer dilated convolutional neural network (MC-MLDCNN). The model’s architecture has the following benefits:
(i) Since the full text of HTTP requests is analyzed at the character level to automatically extract the relevant and important features, the preprocessing and feature extraction process is greatly simplified. The character-level strategy makes the model adaptable and simple to apply since it does not rely on external efforts to derive key features of attacks.
(ii) The dilated CNNs in the model’s structure address the shortcomings of LSTMs and CNNs in capturing long-term dependencies.
(iii) The model benefits from multiple channels with different kernel sizes in order to extract a variety of temporal relationships from the requests.
(iv) Each channel consists of stacked dilated CNN layers with exponentially increasing dilation sizes, so receptive fields expand exponentially. This strategy enables the model to extract various temporal relations at different levels, capture high-order feature interactions, and consequently discover complex and long-term dependencies in the inputs.

To the best of our knowledge, this is the first character-level MC-MLDCNN specifically tailored to detect web attacks. The study’s main contribution is the methodology’s specifically designed structure, which is based on dilated CNNs. The effectiveness of the proposed methodology for identifying web attacks is investigated in the section Results and Discussion. According to the results of experiments conducted on the CSIC 2010 dataset [27], the proposed methodology outperforms several competitive state-of-the-art models in the literature in terms of accuracy, false positive rate, recall, precision, and F1 score. In the field of cyber security, our developed model for detecting web attacks represents a significant advancement that contributes to scientific knowledge in several ways.

1.4.1. Enhanced Attack Detection Accuracy

The proposed methodology closes a significant gap in current web attack detection solutions by concentrating on both short-term and long-term relationships of characters. Consequently, it can extract the complicated pattern of web attacks efficiently. In comparison to conventional methods, it achieves higher accuracy in identifying complex attacks by utilizing multichannel multilayer dilated CNNs. This increase in accuracy makes the defense against web-based attacks more reliable.

1.4.2. Reduced False Positives

Any successful cyber security system must focus on reducing false positive alerts (identifying normal requests as attacks). By reducing false alarms, our model lightens the load on security personnel and lowers the possibility that actual attacks are missed. This enhancement helps create a more effective security infrastructure and improved resource allocation.

1.4.3. Protection against Evolving Attacks

Intricate patterns and hidden malicious activity are frequent components of difficult-to-recognize attacks over the Internet, which can elude detection by traditional security technologies. The flexibility and capability of the proposed methodology to identify evolving attacks assist in the ongoing scientific study of risks associated with cyber security.

1.4.4. Data-Driven Insights

The proposed methodology produces useful information and insights about new attack trends and patterns as part of its operation. To develop proactive security measures and advance the field of attack intelligence more broadly, such information can be studied to better comprehend cyberattacks.

1.4.5. Framework for Further Research

Future studies in the area of web attack detection can build on the basis provided by the proposed methodology. Researchers can expand on its architecture by adding new features and optimizing algorithms, promoting ongoing developments in the cyber security industry.

In a nutshell, the proposed methodology contributes to scientific knowledge by pushing the limits of what is possible in cyber security in addition to providing practical solutions to the critical challenge of complicated web attacks. It is a valuable advancement to the ongoing efforts for protecting modern systems and infrastructure against novel attacks due to its accuracy, decreased false positives, adaptability, and possibility for further research.

The proposed methodology could be deployed at several points in a security architecture, as shown in Figure 1. Figure 1(a) shows MC-MLDCNN placed between a firewall and a web server. Figure 1(b) shows MC-MLDCNN placed in parallel to the web server to alert security operators. Figure 1(c) shows MC-MLDCNN used in place of a web application firewall (WAF), or better yet, together with a WAF to enhance its efficacy.

2. Related Work

Mehta et al. carried out a comparative assessment of machine learning methods for detecting SQL injection, including logistic regression, random forest, SVM, naive Bayes, decision trees, gradient boost, K-means clustering, and KNN [28]. The findings of the experiment indicate that logistic regression performs best. Louk and Tama integrated bagging with gradient boosting decision tree (GBDT) techniques such as gradient boosting machine (GBM), LightGBM, CatBoost, and XGBoost to identify anomalies in an intrusion detection system [29]. According to the results, a combination of bagging and a gradient boosting machine (GBM) achieves the highest performance.

Deep learning methodologies have been progressively adopted in recent years and have shown promising results when compared to traditional ML approaches. The effectiveness of deep learning approaches over conventional ML techniques for the intrusion detection task is demonstrated in an experiment conducted by Althubiti et al. using the CSIC 2010 dataset [30]. Althubiti et al. extracted five important features to train an LSTM. The results demonstrate that this deep learning-based strategy outperforms a study [31] that used the same data with 9 extracted features and traditional ML techniques. Recurrent neural networks (RNNs) are also employed in another study [32] to perform the same detection task on the NSL-KDD dataset. The RNN is trained after a feature engineering process, and its performance is compared with benchmark ML algorithms; the RNN performs better according to the experimental results. Fang et al. integrated LSTM with a bidirectional recurrent neural network (BRNN) to estimate cyberattack rates [33]. Xing et al. conducted the experiments using self-collected data and compared the findings to hybrid models that included ML algorithms as well as statistical prediction models like ARIMA.

In a study on the identification of phishing attacks, Kasim makes use of both machine learning and deep learning techniques [34]. In the proposed approach’s feature engineering phase, the features are encoded using a sparse autoencoder and principal component analysis. With the aid of the light gradient boosted machine model (LightGBM), the encoded features are selected and categorized. The ISCX-URL dataset, which contains 77 distinct features, is used for evaluation. These features include measures related to the URL, host, domain, directory, file, and more. Out of the 154 features produced by the principal component analysis and autoencoder outputs, the top 20 features are chosen for the study. In order to distinguish between regular and attack HTTP requests, Dawadi et al. conducted a comparison analysis between the two types of requests to extract attack-indicating characteristics and features using the ISCX IDS 2012, CIC DDoS 2019, and CSIC 2010 datasets [35]. After the feature engineering and preprocessing phase, the relevant features are fed into a layered LSTM model for the attack detection task. For the evaluation stage, self-collected data are utilized. Although the discussed models have achieved good performance, they are all highly reliant on feature engineering.

Hao et al. suggested a model based on stacked Bi-LSTM to detect web attacks [36]. Only the URL and the body of the POST requests of the CSIC 2010 dataset are used. The word embedding technique is utilized to feed the input to the model: each word is converted into a word vector using the Word2vec technique [37]. Similarly, Alaoui and Nfaoui used Word2vec to feed the CSIC 2010 dataset’s HTTP method, HTTP request, and payload to the model [38], employing an ensemble of LSTMs to identify web attacks. Zhang et al. suggested a word-level CNN utilizing the full text of the HTTP CSIC 2010 dataset [39]. Kernels of different sizes are used in the convolution layer of the model. A max-pooling layer is then applied to the outputs, and the results are passed to a fully connected layer for classification purposes.

Through the use of deep learning models, Tian et al. proposed a distributed system for web attack detection [40]. The methodology can be used in an Edge-of-Things (EoT) environment. FastText [41] and M-ResNet, a particular variation of ResNet [42], are incorporated in the proposed method. The URLs of the requests are converted to vectors using Word2vec and TF-IDF; the vectors are then concatenated and fed to the model. Similarly, Luo et al. developed an ensemble-based methodology to identify web attacks in a distributed environment [43]. The detection of web attacks is carried out individually by three deep learning models: M-ResNet, LSTM, and CNN. Using the results gathered from the models, an ensemble classifier then makes the final prediction. Although these techniques successfully identify attacks, they are limited by their word-level strategies. For instance, word-level methodologies cannot extract any valuable information from newly discovered words that appear in the test phase but are absent from the training set. Memory problems also arise as the number of distinct words grows.

Rong et al. applied a character-level embedding technique in their proposed CNN model [44]. They used only the query part of HTTP requests to detect injection attacks and conducted the experiments on data crawled independently by the authors themselves. In addition to the query parameters, Odumuyiwa and Chibueze used the body parameter of POST requests in their character-level CNN model to identify HTTP injection attacks [45]. A character-level CNN was also used by Saxe and Berlin to identify malicious URLs, file paths, and registry keys [3]; their model applies several convolutional layers with various kernel sizes followed by a sum-pooling layer, and the evaluation is conducted on Saxe and Berlin’s custom data.

Jemal et al. used both CNN and LSTM to identify web attacks [2, 19]. In the suggested models, an LSTM is included after the CNN layer: the CNN component discards irrelevant data and captures the input’s important properties, while the LSTM component captures the data’s sequential relationships. While Gong et al. applied character embedding to the URLs of the CSIC 2010 dataset, Jemal et al. employed ASCII embedding (code-level information) of the full text (whole content) of HTTP requests.

Vinayakumar et al. investigated several deep learning algorithms based on CNN, RNN, LSTM, and CNN-LSTM architectures to categorize malicious/benign URLs using the character-level embedding technique [46]. According to the results reported by Vinayakumar et al., the LSTM and CNN-LSTM models are the most effective for the attack detection task.

Hung et al. leveraged both character embedding and word embedding techniques to enhance the performance of the proposed CNN model [47]. The performance of the suggested approach is assessed using the URLs of Hung et al.’s self-collected data.

Kasim uses SVM to detect DDoS attacks and makes use of an autoencoder for feature learning along with dimensionality reduction [48]. Yi et al. evaluated and analyzed the application of deep learning-based approaches for network attack detection [49]. Their research covers technologies for feature extraction, traffic representation, model training, and model robustness enhancement, as well as several difficulties and issues that may arise during the development stage, such as unbalanced data and distribution shift.

To represent features and detect anomaly-based web attacks, Pillai and Sharma used deep learning methodologies [50]. The outputs of a stacked autoencoder (SAE) and a denoising autoencoder (DAE) are combined and fed into a generative adversarial network (GAN) to enhance the feature representation. For the classification phase, a deep Boltzmann machine with Bi-LSTM is proposed: the deep Boltzmann machine is utilized as a binary classifier to detect attacks, and the Bi-LSTM is additionally applied as a multiclass classifier to categorize various types of attacks.

Thajeel et al. carried out a thorough literature review of ML and deep learning techniques used for detecting XSS attacks [51]. CNN is found to be the most frequently used deep learning algorithm. Different preprocessing methods, including feature engineering and data cleansing, are also examined, and the widely employed performance metrics are assessed. According to Thajeel et al.’s study, accuracy, precision, recall, and F1 scores are the most commonly employed metrics for the XSS attack detection task.

Dilated CNNs were recently used by Rizvi et al. for an intrusion detection system [52]. The proposed model applies numerous successive dilated CNN layers without any max-pooling layer, along with some feature engineering. The performance is assessed using the CSE-CIC-IDS2018 and CIC-IDS2017 datasets.

Table 1 provides a summary of the studies that have been discussed in this section.

3. Methodology

This section gives details about the dataset in the subsection Dataset and the structure of the proposed methodology as MC-MLDCNN in the subsection Character-Level MC-MLDCNN.

3.1. Dataset

The CSIC 2010 HTTP dataset is one of the most well-known and frequently used datasets in the area of web security. This dataset is the focus of numerous comparative experiments in the literature [2, 36, 39, 53, 54]. It was created in 2010 by the Information Security Institute of the Spanish National Research Council (CSIC). It contains the most serious attacks targeting web servers, including static and dynamic attacks such as SQL injection, CRLF injection, cross-site scripting (XSS), buffer overflow, information gathering, file disclosure, server-side include, parameter tampering, and unintentional illegal requests.

It has 61,065 requests: 36,000 normal requests and approximately 25,000 anomalous requests, so nearly 59% of the dataset is normal and 41% is anomalous. The full text of the HTTP requests is used for the experiments in this study. An example of a request is demonstrated in Figure 2.

3.2. Character-Level MC-MLDCNN

The character-level embedding technique and the MC-MLDCNN structure are described in the following subsections.

3.2.1. Character-Level Embedding

A character-level embedding is used to represent and embed the full text of HTTP requests into the model. Character embedding not only lets the model learn the structural patterns of the input string but also gives it the ability to obtain embeddings for new, unseen inputs. Many word-level embedding approaches suffer from the inability to extract patterns for unseen words; moreover, with word-level embedding, the model size grows as the data size increases. Character-level embedding has the additional benefit of keeping the model size stable, since the number of characters is constant. Consequently, the memory problem associated with word embedding is alleviated. In the embedding phase, a vocabulary consisting of alphabetic and numeric characters is formed. In addition, some other characters that appear frequently in HTTP requests are added to the vocabulary, as illustrated in Table 2.

The UNK token is added to the vocabulary for characters outside the alphanumeric and special characters defined in Table 2, and the PAD token is defined for padding purposes. Every character has its own embedding vector, and this information is kept in an embedding matrix E. After obtaining the vocabulary indices according to Table 2, each HTTP request can be represented as a sequence of indices, where each index points to the corresponding character’s row in the embedding matrix E. The sequence length is set to a threshold L: any HTTP request shorter than L is padded, and longer ones are truncated. Each character is embedded into a d-dimensional vector. The embedding is randomly initialized and subsequently trained. Each row of the character-level embedding matrix holds one character’s vector representation. Hence, the character-level embedding of an input results in a matrix with L rows and d columns (see Figure 3).
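To make this pipeline concrete, the following minimal Python/Keras sketch encodes HTTP requests as character indices, pads/truncates them to the threshold L, and defines a trainable embedding matrix. The vocabulary shown is a stand-in for Table 2, the helper name encode is ours, and L = 900 and d = 71 anticipate the values given later in Model Configuration.

```python
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Stand-in vocabulary: alphanumerics plus frequent HTTP special characters
# (the actual set is defined in Table 2). Index 0 = PAD, index 1 = UNK.
vocab = list("abcdefghijklmnopqrstuvwxyz0123456789") + list(":/?&=%.-_+;()<>@#!,'\"* ")
char2idx = {ch: i + 2 for i, ch in enumerate(vocab)}
PAD, UNK = 0, 1

L = 900  # sequence-length threshold: shorter requests are padded, longer truncated
d = 71   # embedding dimension

def encode(request: str) -> list:
    """Map an HTTP request string to a sequence of character indices."""
    return [char2idx.get(ch, UNK) for ch in request.lower()]

requests = ["GET /index.php?id=1 HTTP/1.1", "POST /login HTTP/1.1"]
X = pad_sequences([encode(r) for r in requests], maxlen=L,
                  padding="post", truncating="post", value=PAD)

# Randomly initialized, trainable embedding: one d-dimensional row per symbol,
# so each request becomes an L x d matrix, as in Figure 3.
embedding = Embedding(input_dim=len(char2idx) + 2, output_dim=d)
print(X.shape, embedding(X).shape)  # (2, 900) (2, 900, 71)
```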

3.2.2. MC-MLDCNN Framework

CNN is a deep network structure made up of several layers such as the input layer, convolutional layer, pooling layer, fully connected layer, and output layer [55]. The alternating convolutional and pooling layers form the most distinctive part of this structure; in a multilayer CNN, a convolutional layer plus a pooling layer can extract important features at various levels. In CNNs, three architectural principles are integrated to ensure some level of shift invariance: local receptive fields, spatial/temporal subsampling, and shared weights [55]. The receptive field size grows linearly with the number of layers and the kernel size: to cover a longer sequence, a bigger receptive field is needed, and a bigger receptive field necessitates more layers, which complicates the learning process. Dilated convolutions provide receptive fields that expand exponentially while maintaining resolution and coverage.

Dilated convolutions [23] are convolutions in which the filter is applied over a region longer than its own length by skipping input values at a certain step. To put it simply, dilated convolutions are convolutions applied to the input with specified gaps. The goal of dilated convolutions is to increase the convolution kernel’s receptive field while keeping the number of kernel parameters untouched. This is achieved by inserting zeros between the original kernel parameters. A conventional CNN is equivalent to a dilated CNN with a dilation size of 1, that is, without any gap between the parameters of the kernel. An example of a transformed kernel in dilated CNNs is shown in Figure 4.
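A small numeric sketch may clarify this zero-insertion view: dilating a kernel of size k with rate r spreads its k parameters over an effective footprint of k + (k − 1)(r − 1) positions. The helper name dilate_kernel is ours; deep learning frameworks actually compute cross-correlation rather than flipped convolution, but the footprint argument is identical.

```python
import numpy as np

def dilate_kernel(w: np.ndarray, rate: int) -> np.ndarray:
    """Insert (rate - 1) zeros between the taps of a 1D kernel."""
    out = np.zeros(len(w) + (len(w) - 1) * (rate - 1))
    out[::rate] = w
    return out

w = np.array([1.0, 2.0, 3.0])   # kernel size 3: three parameters
print(dilate_kernel(w, 2))      # [1. 0. 2. 0. 3.] -> effective size 5
print(dilate_kernel(w, 4))      # effective size 3 + 2*3 = 9, still 3 parameters

# A dilated convolution equals an ordinary convolution with the zero-stuffed kernel:
x = np.arange(10, dtype=float)
print(np.convolve(x, dilate_kernel(w, 2), mode="valid"))
```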

The receptive field of dilated convolutions can be enlarged exponentially by successively applying multiple convolutional layers with increasing dilation values. As a result, information from a larger context can be integrated with less computing effort. Figure 5 demonstrates a three-layer convolution structure for both traditional and dilated convolutional neural networks. The number of parameters in the CNN and the dilated CNN is the same, yet under identical circumstances, the dilated CNN can gather information from a wider area of the input than the traditional CNN.
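As a sanity check on this exponential growth, the receptive field of a stack of stride-1 1D convolutions is 1 + Σ (kᵢ − 1)·dᵢ over the layers; pooling between layers, used later in the model, enlarges it further and is ignored in this minimal sketch.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of stacked stride-1 1D convolutions:
    rf = 1 + sum((k - 1) * d) over the layers."""
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))

# Three layers with kernel size 3, as in Figure 5:
print(receptive_field([3, 3, 3], [1, 1, 1]))  # traditional CNN -> 7
print(receptive_field([3, 3, 3], [1, 2, 4]))  # dilated CNN     -> 15
```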

In this study, a multichannel multilayer dilated CNN is proposed as MC-MLDCNN. Each channel is made up of an input layer and dilated convolutional layers, each followed by a pooling layer; the channels feed a shared fully connected layer and an output layer. All the convolutional operations are based on 1D convolutions [56]. A multichannel multilayer dilated CNN includes multiple channels C_i (where 1 ≤ i ≤ c). Each channel C_i has a fixed kernel size k_i and exponentially increasing dilation sizes d ∈ {2^0, 2^1, ..., 2^(l−1)} across its l layers. The kernel size k_i varies among the channels. Each dilated CNN layer is followed by a max-pooling layer to extract the influential features. The kernel size of the max-pooling layers is fixed and set to 3 in all layers.

Each channel C_i can exploit temporal relationships of length k_i in its input and expand this range exponentially across its layers. This strategy enables the model to capture temporal and local dependencies of various scales at different levels. The outputs of the channels are then concatenated, flattened, and passed to a fully connected layer. Consequently, the model is better able to extract aggregated contextual information at different levels and can learn complex and long-term dependencies.

Given an HTTP request as input, Figure 6 briefly describes the feature mapping phase of a multilayer dilated CNN. Figure 7 illustrates the general structure of MC-MLDCNN.

3.2.3. Model Configuration

All models are implemented with Keras [57] and Python [58]. After converting the HTTP requests to lower case, a tokenizer is initialized and fitted to the data in the character embedding phase according to Table 2. Each character is embedded into a 71-dimensional vector (d = 71). The embedding is randomly initialized and learned during the training phase. The obtained representations are stored in an embedding matrix in which each row is the vector representation of a character. Since more than 99 percent of the requests are shorter than 900 characters, the threshold L is set to 900. Hence, the character-level embedding of each HTTP request results in a matrix with 900 rows and 71 columns. The model consists of two channels, each with 256 kernels of fixed sizes 5 and 6, respectively. Each channel includes 3 convolutional layers: the first has no dilation (equivalent to a dilation size of 1), whereas the next layers have successive dilation sizes of 2 and 4. Each convolutional layer is followed by a max-pooling layer with a kernel size of 3. The outputs of the channels are concatenated, flattened, and passed through a fully connected layer with 512 nodes, regularized by the dropout technique at a rate of 0.2. Binary cross entropy is used as the loss function, and the Adam optimizer with a learning rate of 0.001 is applied as the optimization strategy. The ReLU activation function is used in all layers, and a sigmoid layer is applied for classification purposes. A simplified diagram of the model is shown in Figure 8.
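A minimal Keras sketch of this configuration is given below, assuming the embedding setup from the previous subsection. The helper name build_channel, the vocabulary size, and the "same" padding mode are our assumptions; the text does not specify them.

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D,
                                     Concatenate, Flatten, Dense, Dropout)
from tensorflow.keras.optimizers import Adam

L, d, VOCAB = 900, 71, 72  # sequence length, embedding dim, vocabulary size (assumed)

def build_channel(x, kernel_size):
    """One channel: three dilated Conv1D layers (dilation sizes 1, 2, 4),
    each followed by max-pooling with a kernel size of 3."""
    for dilation in (1, 2, 4):
        x = Conv1D(256, kernel_size, dilation_rate=dilation,
                   padding="same", activation="relu")(x)
        x = MaxPooling1D(pool_size=3)(x)
    return x

inp = Input(shape=(L,))
emb = Embedding(input_dim=VOCAB, output_dim=d)(inp)  # randomly initialized, trainable

# Two channels with kernel sizes 5 and 6; outputs concatenated, then flattened.
merged = Concatenate()([build_channel(emb, 5), build_channel(emb, 6)])
flat = Flatten()(merged)
hidden = Dropout(0.2)(Dense(512, activation="relu")(flat))
out = Dense(1, activation="sigmoid")(hidden)

model = Model(inp, out)
model.compile(optimizer=Adam(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```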

4. Results and Discussion

The dataset is randomly divided into training and test sets, ensuring that the class distribution in each set is similar to that of the original dataset. The hold-out strategy [30, 59] is applied since the dataset is sufficiently large: 67% of the dataset is used for training and 33% for testing. As a result, a significant portion of the data is allocated to testing while still leaving an adequate amount for the training phase. The experiments are conducted in two steps: hyperparameter tuning, followed by evaluation with the obtained parameters.
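A stratified hold-out split matching this setup can be produced with scikit-learn; the random_state value is our choice for reproducibility, and X, y denote the encoded requests and binary labels (1 = attack) from the preprocessing step.

```python
from sklearn.model_selection import train_test_split

# 67% training / 33% test, preserving the normal/anomalous class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42)
```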

4.1. Hyperparameter Tuning

The key hyperparameters of the proposed methodology are the number of channels, the number of kernels, the sizes of the kernels, the number of nodes in the fully connected layer, and the sizes of the dilations. 10% of the training set is used as the validation set for these assessments. First, a multichannel CNN (MC-CNN) is built as a baseline model to determine the number of channels, with each channel having one block of CNN. The number of kernels and the number of nodes are initialized to 256 and 512, respectively. The findings for various channel counts and their corresponding kernel sizes are displayed in Table 3. Two channels with kernel sizes of 5 and 6 produce the best results; that is, the best performance is achieved when 256 filters are applied to 5 characters at a time in the first channel and 6 characters at a time in the second. As a result, the model uses two channels with kernel sizes of 5 and 6.
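This search can be sketched as a simple loop over candidate configurations of the baseline MC-CNN, scored on a 10% validation split. The builder build_mc_cnn is our hypothetical reconstruction of the baseline, the candidate grid stands in for Table 3, and the epoch and batch-size values are assumptions; L, d, VOCAB, X_train, and y_train are reused from the earlier sketches.

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D,
                                     Concatenate, Flatten, Dense, Dropout)

def build_mc_cnn(kernel_sizes, filters=256, dense_nodes=512):
    """Baseline MC-CNN: one non-dilated Conv1D block per channel."""
    inp = Input(shape=(L,))
    emb = Embedding(VOCAB, d)(inp)
    chans = [MaxPooling1D(3)(Conv1D(filters, k, padding="same",
                                    activation="relu")(emb))
             for k in kernel_sizes]
    x = chans[0] if len(chans) == 1 else Concatenate()(chans)
    x = Dropout(0.2)(Dense(dense_nodes, activation="relu")(Flatten()(x)))
    model = Model(inp, Dense(1, activation="sigmoid")(x))
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

for kernels in [(5,), (6,), (5, 6), (4, 5, 6)]:  # assumed candidate grid
    hist = build_mc_cnn(kernels).fit(X_train, y_train, epochs=10,
                                     batch_size=128, validation_split=0.1,
                                     verbose=0)
    print(kernels, max(hist.history["val_accuracy"]))
```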

Further experiments are conducted to obtain the best number of filters (kernels). The outcomes are displayed in Table 4. 256 kernels produce the best accuracy result, so the number-of-kernels parameter is set to 256.

Table 5 shows the findings for different numbers of nodes. The best results are achieved with 512 nodes; therefore, the number-of-nodes parameter is set to 512.

The dilated CNN layers are then added to each channel and optimized. Table 6 presents the results for various dilation configurations. The highest accuracy is achieved with successive dilation sizes of 1, 2, and 4.

Consequently, the proposed model (MC-MLDCNN) includes two channels with kernel sizes of 5 and 6, respectively, each with 256 fixed filters, and a fully connected layer with 512 nodes.

4.2. Evaluation Results

Two models are built to examine the effectiveness of dilated CNNs: a multichannel CNN (MC-CNN) and a multichannel multilayer CNN (MC-MLCNN). The multichannel CNN model has only one block of CNN in each channel without any dilation. The multichannel multilayer CNN model includes multiple successive CNN blocks in each channel and does not incorporate any dilation either; its structure is identical to MC-MLDCNN with a dilation size of 1 for all CNN blocks. In both models, max-pooling is integrated after each CNN layer. All models have the same kernel sizes, number of channels, and other hyperparameters as the proposed MC-MLDCNN model. The performance of the models is shown in Figure 9.

It is clear that layered CNNs, whether dilated or not, outperform single-layer CNNs, confirming that utilizing several layers of CNNs in each channel improves on MC-CNN. The dilated CNNs improve the accuracy and precision metrics as a result of their exponentially increasing dilation sizes. The higher recall score of the nondilated multilayer CNN implies that attacks are better detected but more normal requests are misclassified, which means more interruption and less availability for rightful users. The F1 score provides a more accurate picture of the models’ performance, as both recall and precision are taken into account. The proposed MC-MLDCNN achieves the best F1 value and outperforms both candidates in terms of accuracy and F1 score, indicating that it suitably distinguishes attacks from normal requests.
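For reference, all of the reported metrics follow from the confusion matrix with "attack" as the positive class. A minimal sketch, assuming binary labels with 1 = attack, a 0.5 decision threshold, and the model and test split from the earlier sketches:

```python
from sklearn.metrics import confusion_matrix

def report(y_true, y_pred):
    """Precision, recall, F1, accuracy, and FPR with attack as positive class."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # attack detection rate
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    fpr = fp / (fp + tn)                 # normal requests flagged as attacks
    return precision, recall, f1, accuracy, fpr

y_prob = model.predict(X_test).ravel()
print(report(y_test, (y_prob >= 0.5).astype(int)))
```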

The proposed model is compared with the best-performing related work by Rizvi et al. [52]. Rizvi et al.’s model uses a single channel with many layers of dilated CNNs, whose dilation sizes increase exponentially as in the proposed model; however, it does not include any max-pooling layer. Table 7 shows the comparison of MC-MLDCNN and Rizvi et al.’s model. Another remarkable observation is that MC-MLDCNN converges rapidly in contrast to Rizvi et al.’s model. The accuracy and loss curves for the training and validation sets are shown in Figure 10.

As can be concluded from Table 7 and Figure 10, Rizvi et al.’s state-of-the-art method performs well. Still, MC-MLDCNN improves performance by 2.51%, 2.03%, 4.16%, and 3.11% in terms of accuracy, precision, recall, and F1 score, respectively, while requiring 4.29% fewer epochs.

Table 8 compares the proposed model’s performance with state-of-the-art works: a word-level Bi-LSTM-based model proposed by Hao et al. [36], an ASCII-level CNN-LSTM-based method by Jemal et al. [2], a character-level CNN-LSTM-based approach by Gong et al. [19], and a character-level multichannel CNN-based model suggested by Odumuyiwa and Chibueze [45].

Table 8 shows that MC-MLDCNN outperforms the models of Hao et al., Gong et al., and Odumuyiwa and Chibueze in all the assessment metrics. The recall score of Jemal et al.’s model is the highest, while its precision score is the lowest; as discussed previously, this implies misclassification of valid requests. Considering the F1 scores, MC-MLDCNN demonstrates superiority over Jemal et al.’s model.

In addition to the proposed model with the optimal hyperparameters, 6 other MC-MLDCNN models with various parameter values are trained and evaluated to demonstrate the effectiveness of the proposed methodology in Table 9.
(i) Model 1 is trained using 2 channels with kernel sizes of … and …. Each channel consists of 5 layers with dilation sizes of 1, 2, 4, 8, and 16, respectively. The number of nodes of the fully connected layer is set to 512.
(ii) Model 2 is trained using 3 channels with kernel sizes of …, …, and …. Each channel consists of 3 layers with dilation sizes of 1, 2, and 4, respectively. The number of nodes of the fully connected layer is set to 256.
(iii) Model 3 is trained using 2 channels with kernel sizes of … and …. Each channel consists of 3 layers with dilation sizes of 1, 2, and 4, respectively. The number of nodes of the fully connected layer is set to 256.
(iv) Model 4 is trained using 2 channels with kernel sizes of … and …. Each channel consists of 4 layers with dilation sizes of 1, 2, 4, and 8, respectively. The number of nodes of the fully connected layer is set to 256.
(v) Model 5 is trained using 3 channels with kernel sizes of …, …, and …. Each channel consists of 3 layers with dilation sizes of 1, 2, and 4, respectively. The number of nodes of the fully connected layer is set to 256.
(vi) Model 6 is trained using 3 channels with kernel sizes of …, …, and …. Each channel consists of 3 layers with dilation sizes of 1, 2, and 4, respectively. The number of nodes of the fully connected layer is set to 512.

In Table 9, the MC-MLDCNN models are contrasted not only with the competitive deep learning models in the literature but also with the character-level traditional deep learning models like CNN, LSTM, and Bi-LSTM.

All of the MC-MLDCNN models outperform the benchmark deep learning models in terms of precision. Jemal et al.’s model, however, achieves the best recall score, with all of the MC-MLDCNN models sharing the second-best recall values. When F1 scores are taken into account, all of the MC-MLDCNN models outperform the benchmark deep learning models. This is the desired outcome, given that the goal of this research is to detect web attacks accurately.

Except for Jemal et al.’s model, all the MC-MLDCNN models outperform the benchmark deep learning models when accuracy results are compared. Although the proposed model and models 2, 3, and 5 surpass Jemal et al.’s accuracy and produce the best results, the accuracy scores of the other MC-MLDCNN models lag behind Jemal et al.’s. This implies that Jemal et al.’s model performs slightly better at detecting normal requests than models 1, 4, and 6. It should be highlighted that the benchmark models are compared at their best hyperparameters, so it is expected that some of the MC-MLDCNN models occasionally perform slightly worse; this underscores the importance of selecting appropriate hyperparameters. Even so, models 1, 4, and 6 surpass all benchmark models in the attack detection task, demonstrating the effectiveness of the methodology.

For an even more comprehensive assessment, the FPR scores (the rate of classifying normal requests as attacks) are also added to Table 9. The FPR values are unfortunately not provided in the studies by Jemal et al. and Gong et al. Excluding these two studies, the MC-MLDCNN models have the lowest FPR values, indicating that they are also effective at recognizing normal requests. Properly identifying attacks takes priority over properly detecting normal requests, because mistaking an attack for a normal request has far more destructive consequences than the reverse. Since the accuracy and F1 scores of the MC-MLDCNN models are the highest, the proposed methodology retains its advantage over the models of Gong et al. and Jemal et al. even though their FPR values are not supplied.

According to the experimental results, the character-level MC-MLDCNN model outperforms a number of cutting-edge models in the literature as well as traditional deep learning models; the proposed methodology successfully detects web attacks. Although the aim of this research is to reliably identify web attacks, it has also been demonstrated that, with the right hyperparameters, the model performs excellently in identifying normal requests as well.

5. Conclusion

Web attacks have severe effects such as data breaches, financial losses, reputational harm, and other consequences for both customers and organizations. Web attack detectors are essential for guaranteeing security, in other words, the confidentiality, integrity, and availability of systems. The goal is to maintain system availability while protecting data confidentiality and system integrity; therefore, precise web attack detectors yield the most reliable systems.

Even though various deep learning approaches are suggested in the literature with acceptable detection performance, many of them struggle to effectively capture complicated and lengthy sequence relationships of HTTP requests. The MC-MLDCNN methodology, which is proposed in this study, is capable of efficiently learning complex and lengthy character relationships of the requests. The method is based on the integration of multichannel and multilayer dilated CNNs. The MC-MLDCNN learns the dependencies among the characters of HTTP requests at various levels and across a broad range. Therefore, it successfully captures the long-term dependency of characters in HTTP requests. Consequently, the model is accurate in recognizing the intricate pattern of attacks.

Several MC-MLDCNN models with different parameter settings are trained and evaluated. The experimental results are contrasted with various effective deep learning-based approaches previously proposed in the literature. The outcomes show that the proposed methodology outperforms the benchmark deep learning models and can reliably identify attacks. In addition, the method’s accuracy surpasses all benchmark models when the right hyperparameters are used, indicating that the proposed methodology outperforms the benchmark models in overall performance for detecting both attacks and normal requests. Although developing an accurate web attack detection system is the major objective of this research, a system with a high false positive rate is ineffective: a false positive, in other words, identifying a normal request as an attack, interrupts business continuity, meaning availability is lost for that moment. Accurately recognizing web attacks and limiting false positives are two factors that must be balanced for a web attack detection system to be successful. The results of the experiments show conclusively that the character-level MC-MLDCNN methodology meets these criteria and is effective for use in web application security systems.

The proposed methodology has the potential to dramatically improve the capability of identifying complex web attacks. The future goal is to accumulate data on tricky and sophisticated attacks that practically deployed security mechanisms miss. By evaluating the proposed methodology on these samples, future research will concentrate primarily on enhancing the performance of the current model and on developing and exploring cutting-edge ways to offer more robust and dependable protection against web attacks.

Data Availability

The benchmark CSIC 2010 dataset is open source and publicly available on the Internet.

Ethical Approval

No ethical issues are disclosed by the authors. No personally identifiable data have been used in this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.