Abstract

While promoting the development of the Internet of Things, cloud-fog hybrid computing faces severe information security risks. An intrusion detection system deployed in a fog node offers lower latency but must be more lightweight. To address this problem, this paper proposes a lightweight intrusion detection model based on ConvNeXt-Sf. First, the two-dimensional operations of the recent computer vision model ConvNeXt are replaced with one-dimensional operations. Then, the design criteria of the lightweight computer vision model ShuffleNet V2 are used to make ConvNeXt more lightweight. Finally, max-min normalization and a label encoder are combined into a data preprocessing model that converts network traffic into a form conducive to ConvNeXt learning. The proposed model is evaluated on the TON-IoT and BoT-IoT datasets. ConvNeXt-Sf has only 1.25% as many parameters as ConvNeXt. Compared with ConvNeXt, ConvNeXt-Sf shortens the training time and prediction time by 82.63% and 56.48%, respectively, without reducing the learning capability or detection capability. Compared with traditional models, the accuracy of the proposed model is increased by 6.18%, and the FAR is decreased by 4.49%. Compared with other lightweight models, ShuffleNet V2 is better at making ConvNeXt lightweight.

1. Introduction

The Internet of Things (IoT) with hybrid cloud-fog computing is a large-scale computing infrastructure that supports numerous data- and computing-intensive tasks [1]. The IoT is being used ever more widely in industry, medical treatment, and transportation and has become an important part of the “Internet of Everything” era. Cloud computing provides a large amount of resource support for IoT applications, and its characteristics, such as on-demand self-service, broad network access, measured service, and rapid elasticity, also promote the development of the IoT [2]. However, the remote connection over the Internet between cloud nodes and end devices at the network edge creates problems of performance, security, latency, and stability [3, 4]. These problems have promoted the development of fog computing. Fog computing extends cloud computing to the edge of the network: fog nodes are located between cloud nodes and end devices, forming the IoT with hybrid cloud-fog computing, as shown in Figure 1. Fog nodes deploy resources close to the edge of the network, enabling IoT applications to obtain resources safely and stably with low latency. This makes fog computing the optimal solution for providing efficient and secure IoT services [4]. Resource-intensive tasks are still uploaded to the cloud node so that the overall performance of the network can be optimized. The IoT with hybrid cloud-fog computing has been widely used in smart cars, smart buildings, smart grids, smart cities, smart health, smart agriculture, smart industry, and other fields [5].

IoT devices are vulnerable to attacks because of their own shortcomings and the constant evolution of network attacks [6]. Therefore, using deep learning technology to build a new generation of intrusion detection systems (IDSs) has become a security requirement of the IoT with hybrid cloud-fog computing. The IDS was proposed in 1980 and was adopted as a security measure to protect networks in 1990 [7, 8]. The IDS detects attacks by analyzing the collected data in real time and alerts network security personnel. According to the data source, IDSs can be divided into host-based and network-based systems. A host-based IDS is deployed in a host to monitor attack behavior that occurs within that host. A network-based IDS is deployed in key network nodes to monitor attack behavior in the network traffic. According to the detection method, IDSs can be divided into signature-based and anomaly-based systems. A signature-based IDS detects attacks by comparing the collected data with the data stored in an attack database. It detects known attacks well, but it has difficulty detecting unknown attacks. An anomaly-based IDS detects attack behavior by analyzing the difference between the collected data and normal data. It can effectively detect unknown attacks but often has a high false alarm rate (FAR).

Deep learning is an important branch of machine learning. The deep learning model can learn effective information from large-scale high-dimensional data without feature engineering or expert rules. Computer vision (CV) and natural language processing (NLP) have achieved remarkable results as hot fields of deep learning, and many high-performance deep learning models have emerged. Therefore, many researchers have applied the ideas and models in the fields of CV and NLP to the intrusion detection of the IoT and achieved good results.

To protect the IoT with hybrid cloud-fog computing effectively, we build an IDS based on the recent ConvNeXt model and deploy it in fog nodes. Because fog nodes have limited resources, we use the design criteria of ShuffleNet V2, a lightweight CV model, to make ConvNeXt more lightweight. The main contributions of this paper are as follows:
(i) The proposed model is the first to apply ConvNeXt to the intrusion detection of the IoT with hybrid cloud-fog computing. ConvNeXt is a state-of-the-art CV model that improves on the best results of deep learning models in object detection, image classification, and other tasks.
(ii) The proposed model is the first to use the design criteria of ShuffleNet V2 to make ConvNeXt more lightweight. ShuffleNet V2 is a lightweight CV model, and its design criteria are instructive for the lightweight construction of deep neural networks.
(iii) The proposed model is better suited to deployment in resource-limited fog nodes and can therefore better protect the IoT with hybrid cloud-fog computing by taking advantage of the low latency of fog nodes.

The rest of this paper is organized as follows: Section 2 summarizes previous research work on building IoT intrusion detection systems using deep learning models in the fields of CV and NLP. Section 3 describes the intrusion detection problem of IoT with hybrid cloud-fog computing. Section 4 introduces the intrusion detection model proposed in this paper. Section 5 analyzes the experimental results of the proposed model and the comparison model. Section 6 is the conclusion of this paper and the prospect of the future research.

2. Related Work

Deep learning models have strong learning capabilities for complex data, making them widely used in CV and NLP, and other fields, and they have achieved excellent results. Therefore, many network security researchers use deep learning models to improve the performance of the IDS.

Since the objects processed in CV and NLP differ in format from network traffic, one approach is to convert the format of the network traffic. Jo et al. [9] transformed each feature in the NSL-KDD dataset into a pixel, turning each record into a 2D image. On this basis, they proposed three methods for converting the data in the NSL-KDD dataset into pixel images and used a CNN to classify the converted images. In a comparison with deep learning models such as the autoencoder (AE) and recurrent neural network (RNN), they found that the CNN can process data with multiple protocols at the same time. Hussain et al. [10] first performed data cleaning on the CICDDoS2019 dataset to reduce the number of features to 60. The normal traffic and attack traffic data were then each transformed into a number of images, where each channel contained 60 samples with 60 features. Finally, each image was resized to 224 × 224 × 3 and input into ResNet18 to detect DoS and DDoS attacks against IoT devices. Zhong et al. [11] used Word2Vec to convert network traffic into word vectors and used the NLP sequence models Text-CNN and gated recurrent unit (GRU) to classify the word vectors. Kozik et al. [12] used 37 calculated values to describe the characteristics of packets sent by the same IP within a time window and stored them in a probabilistic data structure. Intrusion detection was then performed using a classifier consisting of the encoder part of the transformer and a feed-forward neural network. This model can efficiently capture short-term malicious behavior in the network. Converting the data format does not change the performance of the model itself, but the change in data structure may alter the information carried by the network traffic and thus affect what the model learns. Moreover, when there is a lot of network traffic, the additional resources that the conversion consumes in the data preprocessing stage cannot be ignored for resource-limited fog node devices.

Another option is to change the structure of the model so that it can process network traffic. Hassan et al. [13] proposed an IDS based on the CNN and weight-dropped LSTM (WDLSTM). The one-dimensional CNN, with its deep structure and weight sharing mechanism, was able to quickly learn useful features from massive network traffic. The combination of the LSTM and the probabilistic dropout of neurons effectively prevented overfitting. Derhab et al. [14] combined the one-dimensional CNN with causal convolution to construct an IoT IDS based on the temporal CNN (TCNN). Experiments on the BoT-IoT dataset showed that the TCNN had better detection capability while its training time was very close to that of the CNN. This approach consumes no additional resources, but it must be ensured that modifying the model structure either does not reduce the performance of the model or reduces it only within an acceptable range.

Models that can process network traffic can be used directly. Almiani et al. [15] constructed a fog computing-based IDS using a multilayer RNN trained by an improved back-propagation algorithm. The two layers of the RNN with different structures were used in series to target DoS and other attack types that are difficult to detect. Xu et al. [16] replaced neurons in the AE with neurons in the LSTM. The constructed IoT IDS can capture time series features through the LSTM and perform intrusion detection through the feature learning capability of AE. This situation is optimal and does not have the negative effects of the above two methods. However, this situation is not common when introducing models from other fields into intrusion detection, especially in the CV, which does not consider the input of the model to be one-dimensional data.

For the IoT with hybrid cloud-fog computing, Kumar et al. [17] deployed the proposed TP2SF on cloud nodes and fog nodes to take advantage of their respective advantages. Transactions were first processed at fog nodes. The transaction is uploaded to the cloud node for processing if the fog node cannot process it. The TP2SF consisted of a trustworthiness module, a two-level privacy module, and an intrusion detection module. The intrusion detection module was composed of the XGBoost to perform anomaly detection. Kumar et al. [18] proposed the Sp2f consisting of a two-level privacy engine and a deep learning-based anomaly detection engine. The Sp2f was deployed in fog nodes, while cloud nodes were only used to assist in storing blockchain information in the two-level privacy engine to relieve resource pressure on fog nodes. In the anomaly detection engine, the stacked LSTM was used to perform intrusion detection on the output data of the secondary privacy engine. The above two articles consider the security of both the fog layer and cloud layer and deploy the same IDS in the nodes of both layers. Although excellent results were obtained, the characteristics of the fog layer and the cloud layer are very different. It is difficult for the same IDS to exert optimal performance in both layers simultaneously. Therefore, the best practice is to design the most suitable IDS according to the characteristics of each layer. When the IDS of one layer can operate effectively, the security threats faced by the other layer will be reduced.

3. Problem Description

In order to effectively detect the attack behavior in the IoT with hybrid cloud-fog computing, a lightweight intrusion detection model based on ConvNeXt-Sf is proposed in this paper. The essence of the intrusion detection of the IoT with hybrid cloud-fog computing is the classification of network traffic. After the network traffic from cloud nodes and end devices enters the fog node, it will be detected by the intrusion detection model deployed in the fog node. The intrusion detection model is divided into two stages, namely, the data preprocessing stage and the classification stage, as shown in Figure 2.
(1) In the data preprocessing stage, the raw network traffic is processed by the data preprocessing model into the format required by the classification model in the next stage. In this paper, the label encoder and max-min normalization are used to construct the data preprocessing model LE-MMN to quantify and normalize the raw network traffic.
(2) In the classification stage, the network traffic, after data preprocessing, is classified into different types by the classification model. For the multiclassification model, the classification results are normal and various attack types. The intrusion detection model deployed in the fog node needs to have excellent detection capability and be lightweight simultaneously. In this paper, based on the one-dimensional ConvNeXt, the design criteria of ShuffleNet V2 are used to make lightweight improvements to obtain the classification model ConvNeXt-Sf.

Assume that the network traffic entering the fog node consists of a set of samples and that each sample consists of a set of features. Important notations used in this paper are given in Table 1, and the intrusion detection problem of the IoT with hybrid cloud-fog computing is briefly introduced as follows.

First, the LE-MMN performs data preprocessing on the network traffic set entering the fog node. It quantifies and normalizes the feature set of each sample so that every feature conforms to the input format required by the ConvNeXt-Sf, which yields the preprocessed network traffic set. Then, the ConvNeXt-Sf classifies each preprocessed sample to detect the attack behavior in the network traffic set.

4. IoT Intrusion Detection Model Based on ConvNeXt-Sf

The IoT intrusion detection model based on the ConvNeXt-Sf proposed in this paper is composed of a data preprocessing model LE-MMN and a classification model ConvNeXt-Sf, as shown in Figure 3. After the network traffic from normal users or attackers enters the fog node, it is first digitalized and normalized by the LE-MMN. Then, the preprocessed network traffic is classified by the ConvNeXt-Sf to obtain intrusion detection results.

The next two sections will introduce the structure of the proposed model in detail. The first section introduces the data preprocessing model LE-MMN. The second section introduces the classification model ConvNeXt-Sf.

4.1. Data Preprocessing Model

In the IoT with hybrid cloud-fog computing, network traffic has two problems: discrete features and features with inconsistent scales. Both the TON-IoT dataset, which was collected in the IoT with hybrid cloud-fog computing, and the botnet dataset BoT-IoT used in this paper have these two problems. Discrete features cannot be used as input data for the proposed model. Therefore, it is necessary to use the label encoder to map discrete features to numerical features [19]. The label encoder assigns an integer index to each distinct value of a discrete feature. For example, after the label encoder processes the TON-IoT dataset, the values of its discrete features are mapped to integer indices.

Features with different scales, when used as the input of a deep learning model, can easily lead to vanishing gradients and slow convergence. Therefore, max-min normalization is used in this paper to unify the range of every feature to $[0, 1]$, as shown in the following formula:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$

where $x'$ is the normalized value, $x$ is the feature value being normalized, $x_{\min}$ is the minimum value of this feature, and $x_{\max}$ is the maximum value of this feature.

In this paper, the data preprocessing model LE-MMN is obtained by combining the label encoder and the max-min normalization, as shown in Algorithm 1.

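Algorithm 1: Data preprocessing model LE-MMN.
Input: raw network traffic set, feature set
Output: preprocessed network traffic set
for each feature in the feature set do
  if the feature is discrete then
    map the values of the feature to integer indices with the label encoder
  end if
end for
for each feature in the feature set do
  for each sample in the network traffic set do
    replace the value of the feature in the sample with its max-min normalized value
  end for
end for
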
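
The two passes of the LE-MMN can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`label_encode`, `max_min_normalize`, `le_mmn`), the rule that a column containing strings is treated as discrete, the sorted order of the integer indices, and the mapping of constant columns to 0.0 are all assumptions made for illustration.

```python
from typing import List, Sequence, Union

Value = Union[str, int, float]


def label_encode(column: Sequence[str]) -> List[int]:
    """Assign an integer index to each distinct value (sorted order is an assumption)."""
    mapping = {value: index for index, value in enumerate(sorted(set(column)))}
    return [mapping[value] for value in column]


def max_min_normalize(column: Sequence[float]) -> List[float]:
    """Scale one feature column to [0, 1]; a constant column maps to 0.0 (assumption)."""
    low, high = min(column), max(column)
    if high == low:
        return [0.0 for _ in column]
    return [(x - low) / (high - low) for x in column]


def le_mmn(samples: List[List[Value]]) -> List[List[float]]:
    """Label-encode discrete features, then max-min normalize every feature."""
    columns = [list(col) for col in zip(*samples)]  # one list per feature
    processed = []
    for col in columns:
        # Hypothetical rule: a feature is discrete if any of its values is a string.
        if any(isinstance(v, str) for v in col):
            col = label_encode([str(v) for v in col])
        processed.append(max_min_normalize([float(v) for v in col]))
    return [list(row) for row in zip(*processed)]  # back to one list per sample


# Toy traffic records: (protocol, destination port, duration); values are made up.
traffic = [
    ["tcp", 80, 0.5],
    ["udp", 53, 1.5],
    ["icmp", 0, 2.5],
]
preprocessed = le_mmn(traffic)
for row in preprocessed:
    print(row)
```

In practice, the minimum and maximum of each feature would be computed on the training set and reused for the test set; the sketch normalizes a single set for brevity.
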
4.2. Classification Model

The performance of the classification model will directly affect the detection capability of the intrusion detection model. Therefore, based on the cross-domain idea in vision transformer [20], ConvNeXt is used as the classification model in the intrusion detection model. ConvNeXt is a novel high-performance CV model proposed by Liu et al. in 2022 [21]. Liu et al. found that many transformer-based CV models borrowed the mechanism of CNN to obtain stronger performance than the CNN, such as the Swin transformer [22], and they also found that models constructed purely using CNN modules were easier to train than transformer-based models. Therefore, they borrowed the mechanism of transformer-based CV models such as the Swin transformer to modernize the ResNet to build a CV model constructed purely using CNN modules and outperformed transformer-based models. Compared with the ResNet, the improvement of ConvNeXt in model structure is divided into macroimprovements and microimprovements. Macroimprovements include the ratio of the number of blocks in each stage and the hyperparameters of convolutional layers. Microimprovements include the type and number of activation functions and normalization layers. The structure of the block in ConvNeXt is shown in Figure 4.

There are five versions of the ConvNeXt as shown in Table 2. Each version of the ConvNeXt achieves faster classification speed and higher accuracy in comparison experiments with each version of Swin transformer with the same amount of parameters and FLOPs on the ImageNet dataset.

The ConvNeXt processes two-dimensional image data in the field of CV, and the network traffic in the field of intrusion detection is one-dimensional data. It is necessary to first carry out the one-dimensional modification of ConvNeXt. All two-dimensional operations in ConvNeXt are replaced with one-dimensional operations. For example, the two-dimensional convolution in ConvNeXt’s block is replaced by the one-dimensional convolution.
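
To make the one-dimensional modification concrete, the following sketch shows a depthwise convolution applied along a single axis, which is the kind of operation that replaces the two-dimensional depthwise convolution of the block. It is an illustration in plain Python, not the paper's code; the function name `depthwise_conv1d`, the zero "same" padding, and the kernel size of 3 in the example are assumptions.

```python
from typing import List

Channels = List[List[float]]  # one inner list of values per channel


def depthwise_conv1d(x: Channels, kernels: Channels) -> Channels:
    """Apply one odd-length kernel per channel along the feature axis.

    Each channel is convolved only with its own kernel (depthwise), and
    zero padding keeps the output length equal to the input length.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        pad = len(kernel) // 2
        padded = [0.0] * pad + list(channel) + [0.0] * pad
        out.append([
            sum(padded[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(len(channel))
        ])
    return out


# Two channels, each a sequence of four feature values.
x = [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
kernels = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
y = depthwise_conv1d(x, kernels)
print(y)
```

A two-dimensional depthwise convolution would slide each kernel over both height and width; here each kernel slides only along the sequence of features of a traffic record.
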

This paper uses the design criteria of ShuffleNet V2 to make lightweight improvements to the ConvNeXt so that ConvNeXt can be suitable for deployment in fog nodes. In 2018, Ma et al. [23] proposed four design criteria to guide the design of efficient CNN architecture. Based on these design criteria, the ShuffleNet V1 was lightweight improved to obtain ShuffleNet V2. The design criteria of ShuffleNet V2 focus on reducing the memory access cost (MAC) and improving the degree of network parallelism. These measures can make the model consume less time with the same FLOPs. The ShuffleNet V2 has lower error rates in comparison experiments with models with similar FLOPs, such as MobileNet, and is faster on GPU and ARM CPU.

The first design criterion is “minimum MAC when the number of input and output channels is the same.” The block in the ConvNeXt adopts the inverted bottleneck structure shown in Figure 5, in which the numbers of input and output channels of the convolutional layers differ by a factor of 4, thus violating this criterion. However, the inverted bottleneck structure is an important part of ConvNeXt, and its effectiveness has been proven and it is widely used [24–26]. In order to keep the original structure of ConvNeXt as much as possible, the inverted bottleneck structure is not modified in this paper.

The second design criterion is “group convolution with a too large number of groups will increase the MAC,” as shown in the following formula:

$$\mathrm{MAC} = hw(c_1 + c_2) + \frac{c_1 c_2}{g} = hwc_1 + \frac{Bg}{c_1} + \frac{B}{hw},$$

where $h$ and $w$ are the height and width of the feature map, respectively, $c_1$ and $c_2$ are the numbers of input and output channels of the convolutional layer, respectively, $g$ is the number of groups of the group convolution, and $B = hwc_1c_2/g$ is the FLOPs. When $hw$, $c_1$, and $B$ are constant, the larger $g$ is, the larger the MAC is. Therefore, the structure of the block in the ConvNeXt is modified by referring to the structure of the block in the ShuffleNet V2. First, the channel split is used to divide the input data equally by channel into a left branch and a right branch. Then, convolutional layers and layer norm modules are used only in the right branch, while the data in the left branch remain unchanged. Thus, the number of groups $g$ in the right branch is reduced by 50%, and dim is reduced from (96, 192, 384, 768) to (32, 64, 128, 256), thereby reducing the MAC.
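
The effect of the number of groups on the MAC can be checked numerically. The sketch below assumes the relation given by Ma et al. for a 1 × 1 group convolution, MAC = hw(c1 + c2) + c1c2/g with FLOPs B = hwc1c2/g; the function names and the toy sizes are illustrative assumptions, not values from this paper.

```python
def mac_group_conv(h: int, w: int, c1: int, c2: int, g: int) -> float:
    """MAC = hw(c1 + c2) + c1*c2/g for a 1x1 group convolution."""
    return h * w * (c1 + c2) + c1 * c2 / g


def mac_fixed_flops(h: int, w: int, c1: int, flops: int, g: int) -> float:
    """Same MAC rewritten with the FLOPs B held fixed: hw*c1 + B*g/c1 + B/(hw)."""
    return h * w * c1 + flops * g / c1 + flops / (h * w)


h, w, c1 = 4, 4, 64            # toy feature-map size and input channels
flops = h * w * c1 * 64        # B for g = 1 and c2 = 64
macs = {}
for g in (1, 2, 4):
    # Keeping B constant while g grows means c2 must grow proportionally.
    c2 = flops * g // (h * w * c1)
    macs[g] = mac_group_conv(h, w, c1, c2, g)
    print(g, c2, macs[g], mac_fixed_flops(h, w, c1, flops, g))
```

With B, hw, and c1 held constant, the MAC rises as g increases, which is the behavior the second criterion describes.
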

The third design criterion is “fragmented network reduces the degree of parallelism.” The fragmented network includes serial fragmentation and parallel fragmentation, as shown in Figure 6. It can be seen from Figure 4 that the blocks in the ConvNeXt only contain serial fragmentation, and multiple blocks are serially stacked. Therefore, the ConvNeXt-T with the lowest degree of fragmentation in ConvNeXt is selected as the benchmark model. Then, the number of serially stacked blocks in each stage in the ConvNeXt-T is reduced from (3, 3, 9, 3) to (1, 1, 3, 1) on the premise of maintaining the original ratio to reduce the degree of network fragmentation further.

The fourth design criterion is “element-level operations add nonnegligible memory and time consumption.” Ma et al. found through experiments that short-cut accounts for up to 20.82% and 11.24% of the time consumed by the model in GPU and ARM CPU, respectively. Therefore, the short-cut in the ConvNeXt is replaced by channel split and concatenate with reference to the design of ShuffleNet V2. The data of the left and right branches divided by channel split are concatenated by channel instead of the element addition operation used in the short-cut, so the MAC and time consumption of the model are reduced, and the effect of the short-cut is still retained. In addition, the layer scale and drop path operations in ConvNeXt’s block are also removed to reduce FLOPs further.
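
The difference between the short-cut and the channel split with concatenation can be sketched as follows. This is a simplified illustration in plain Python: the function names are hypothetical, `right_branch` stands in for the convolutional layers and layer norm of the real block, and the doubling used in the example is only a placeholder transformation.

```python
from typing import Callable, List

Channels = List[List[float]]  # one inner list of values per channel


def residual_block(x: Channels, branch: Callable[[Channels], Channels]) -> Channels:
    """Short-cut: transform all channels, then add the input element by element."""
    fx = branch(x)
    return [[a + b for a, b in zip(cx, cf)] for cx, cf in zip(x, fx)]


def split_concat_block(x: Channels, right_branch: Callable[[Channels], Channels]) -> Channels:
    """Channel split: pass the left half through unchanged, transform the right
    half, and concatenate the two halves by channel (no element-wise addition)."""
    half = len(x) // 2
    left, right = x[:half], x[half:]
    return left + right_branch(right)


def double(channels: Channels) -> Channels:
    """Placeholder for the convolution and layer norm stack of the right branch."""
    return [[2.0 * v for v in channel] for channel in channels]


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(residual_block(x, double))
print(split_concat_block(x, double))
```

In the split variant, the untouched left half carries the input forward in the way the short-cut does, while the branch processes only half of the channels and no element-wise addition is needed.
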

Finally, the ConvNeXt-Sf model proposed in this paper is obtained, as shown in Figure 7.

5. Simulation Experiment and Analysis

In this paper, the TON-IoT and BoT-IoT datasets are used to evaluate the performance of the proposed model in the IoT with hybrid cloud-fog computing. The first reason for choosing the TON-IoT and BoT-IoT datasets is that they are the latest IoT datasets, so that they can represent the latest IoT with hybrid cloud-fog computing. The second reason is that the TON-IoT dataset and the BoT-IoT dataset represent different types of IoT activities, making it possible to evaluate the performance of the model in the face of multiple IoT activities more extensively in the experiments. The hardware and software environment of the experiment is shown in Table 3. The hyperparameters of the proposed model in the experiments are shown in Table 4.

This part is divided into three sections. The first section introduces the dataset used in the experiment. The second section introduces the performance metrics used in the experiment. The third section analyzes the experimental results.

5.1. Datasets

The data for the TON-IoT dataset were derived from a real large-scale network designed by the Cyber Range and IoT Labs, the School of Engineering and Information technology, UNSW Canberra at the Australian Defence Force Academy [19, 27]. The network realized the structure of the IoT with hybrid cloud-fog computing, so this dataset can truly evaluate the detection capability of the IDS deployed in the fog node. This paper used the network traffic data in TON-IoT and the training test set provided by TON-IoT. The training test set had a total of 461,043 samples. In this paper, 40,000 samples were randomly selected as the test set, and the remaining samples were used as the training set. Each sample had 45 features, including 2 label features. Table 5 shows the sample type, number of samples, and the proportion of each type in the TON-IoT training set and test set.

DoS and DDoS attacks can easily make resource-limited fog nodes unable to provide services, compromising the entire IoT with hybrid cloud-fog computing. Therefore, the BoT-IoT dataset proposed by the Cyber Range Lab of UNSW Canberra was selected in this paper [28, 29]. They designed a real botnet environment for DoS and DDoS attacks and extracted 5% of the original data as training test sets, with a total of 3,668,522 samples. In this paper, 400,000 samples were randomly selected as the test set, and the remaining samples were used as the training set. Each sample had 45 features, including 3 label features. Table 5 shows the sample type, number of samples, and the proportion of each type in the BoT-IoT training set and test set.

5.2. Performance Metrics

The performance metrics commonly used to evaluate the detection capability of IDS include accuracy, precision, recall, F1-score, FAR, area under curve (AUC), training time, and prediction time [30]. Except for AUC, training time, and prediction time, the calculation of the performance metrics uses the following four parameters:
(i) True positive (TP): the number of positive samples that were correctly classified as positive
(ii) True negative (TN): the number of negative samples that were correctly classified as negative
(iii) False positive (FP): the number of negative samples that were falsely classified as positive
(iv) False negative (FN): the number of positive samples that were falsely classified as negative

The performance metrics used in this paper are calculated as follows.
(a) Accuracy: the proportion of the number of samples correctly classified by the model to all test samples.
(b) Precision: the proportion of actual positive samples among all samples classified as positive by the model.
(c) Recall: the proportion of all actual positive samples correctly classified as positive by the model.
(d) F1-score: the harmonic mean of precision and recall, which is suitable for scenarios with class imbalance.
(e) FAR: the proportion of all actual negative samples misclassified as positive by the model.
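
The definitions above correspond to the standard confusion-matrix formulas, which can be sketched as follows. The function name `ids_metrics` and the example counts are illustrative assumptions; how the per-class values are averaged in the multiclass experiments is not specified here, so the sketch covers a single positive class.

```python
from typing import Dict


def ids_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall, F1-score, and FAR from the four counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "far": fp / (fp + tn),  # share of negatives raised as false alarms
    }


# Made-up counts: 100 attack samples and 900 normal samples.
metrics = ids_metrics(tp=90, tn=880, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```
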

5.3. Experimental Results

This section will present and analyze the multiclass experimental results of the proposed model and the comparison model on the TON-IoT and BoT-IoT datasets. The ConvNeXt compared in the experiment is the one-dimensional ConvNeXt-T.

An intrusion detection system deployed in a fog node must not only have excellent detection capability but also be sufficiently lightweight. The shorter the training time and prediction time of the model, and the smaller its number of parameters, the more lightweight the model is. The structure of a model largely determines its number of parameters. A model with a more complex structure usually has more parameters and requires more storage and computing resources from the device.

The ConvNeXt-DenseNet and ConvNeXt-GhostNet are ConvNeXt modified by lightweight models DenseNet [31] and GhostNet [32], respectively. In the ConvNeXt-DenseNet, the connection between feature maps only happens inside each stage. In a stage, the output feature map of each block is concatenated with the output feature map of the previous block by channel. The dim of ConvNeXt-DenseNet is reduced from (96, 192, 384, 768) to (16, 32, 64, 128). Each block in ConvNeXt-GhostNet first outputs feature maps with half of the dim channels. Then, the ghost module generates a feature map with half of the dim channels. Finally, the two groups of feature maps are concatenated by channel. The proposed model is compared with ConvNeXt-DenseNet and ConvNeXt-GhostNet to verify the advantages of ShuffleNet V2 in making ConvNeXt lightweight.

The numbers of parameters of ConvNeXt, ConvNeXt-DenseNet, ConvNeXt-GhostNet, and the proposed model were 26,771,242, 969,642, 8,679,318, and 335,354, respectively, as shown in Figure 8. DenseNet, GhostNet, and ShuffleNet V2 can all reduce the number of parameters of ConvNeXt. The number of parameters of the ConvNeXt-DenseNet was 3.62% of that of ConvNeXt, that of the ConvNeXt-GhostNet was 32.42%, and that of the proposed model was 1.25%. The proposed model therefore had the fewest parameters and the largest reduction in the number of parameters.
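
The percentages above follow directly from the reported parameter counts, as the short check below shows. The dictionary and function names are illustrative; the counts are the ones reported in this section, with the proposed model labeled ConvNeXt-Sf.

```python
from typing import Dict

PARAMS = {
    "ConvNeXt": 26_771_242,
    "ConvNeXt-DenseNet": 969_642,
    "ConvNeXt-GhostNet": 8_679_318,
    "ConvNeXt-Sf": 335_354,
}


def share_of_baseline(params: Dict[str, int], baseline: str = "ConvNeXt") -> Dict[str, float]:
    """Express each parameter count as a percentage of the baseline count."""
    base = params[baseline]
    return {name: round(100 * count / base, 2) for name, count in params.items()}


shares = share_of_baseline(PARAMS)
for name, pct in shares.items():
    print(f"{name}: {pct:.2f}% of the ConvNeXt parameter count")
```
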

There were four steps to reduce the number of parameters. In the first step, dim was reduced from (96, 192, 384, 768) to (32, 64, 128, 256). Because this step reduced the number of convolution kernels and the number of groups of group convolution, the number of parameters was reduced to 3,000,938. The number of parameters at this time was reduced by 88.80% compared to the initial time. Because the ConvNeXt consists almost entirely of convolutional layers and dim affects the number of convolution kernels and the number of groups of group convolutions, reducing dim brings the greatest reduction in the number of parameters. In the second step, the layer scale and drop path in the block were removed. Because this step simplified the structure of the block, the number of parameters was reduced to 2,998,730. The number of parameters at this time was reduced by 0.07% compared to the previous step. In the third step, the block of the ConvNeXt was replaced by the block of the ShuffleNet V2. Because this step reduced dim and element-level operations, the number of parameters was reduced to 825,626. The number of parameters at this time was reduced by 72.46% compared to the previous step. In the fourth step, the number of serially stacked blocks in each stage was reduced from (3, 3, 9, 3) to (1, 1, 3, 1). Because this step reduced the fragmentation of the model, the number of parameters was reduced to 335,354. The number of parameters at this time was reduced by 59.38% compared to the previous step.

The training time and prediction time of the ConvNeXt, the ConvNeXt-DenseNet, the ConvNeXt-GhostNet, and the proposed model on the TON-IoT and BoT-IoT datasets are shown in Figures 9(a) and 9(b), respectively.

On the TON-IoT dataset, the training time of the ConvNeXt, ConvNeXt-DenseNet, ConvNeXt-GhostNet, and the proposed model was 148,833.812 seconds, 62,879.730 seconds, 148,939.281 seconds, and 25,857.369 seconds, respectively. The training time of the ConvNeXt-DenseNet and the proposed model was shorter than that of ConvNeXt, while the training time of the ConvNeXt-GhostNet was longer than that of ConvNeXt. The training time of the ConvNeXt-DenseNet was 42.25% of that of ConvNeXt. The training time of the proposed model was 17.37% of that of ConvNeXt. The proposed model had the largest reduction in training time. The prediction time of the ConvNeXt, ConvNeXt-DenseNet, ConvNeXt-GhostNet, and the proposed model was 18.493 seconds, 18.637 seconds, 33.125 seconds, and 10.671 seconds, respectively. The prediction time of the proposed model was shorter than that of ConvNeXt, while the prediction time of the ConvNeXt-DenseNet and the ConvNeXt-GhostNet was longer than that of ConvNeXt. The prediction time of the proposed model was 57.70% of that of ConvNeXt. Only the proposed model shortened the prediction time.

In the BoT-IoT dataset, the training time of the ConvNeXt, ConvNeXt-DenseNet, ConvNeXt-GhostNet, and the proposed model was 104,818.627 seconds, 47,121.162 seconds, 112,625.791 seconds, and 19,517.676 seconds, respectively. The training time of the ConvNeXt-DenseNet and the proposed model was shorter than that of ConvNeXt, while the training time of the ConvNeXt-GhostNet was longer than that of ConvNeXt. The training time of the ConvNeXt-DenseNet was 44.95% of that of ConvNeXt. The training time of the proposed model was 18.62% of that of ConvNeXt. The proposed model had the largest reduction in training time. The prediction time of the ConvNeXt, ConvNeXt-DenseNet, ConvNeXt-GhostNet, and the proposed model was 190.732 seconds, 162.010 seconds, 291.182 seconds, and 83.011 seconds, respectively. The prediction time of the ConvNeXt-DenseNet and the proposed model was shorter than that of ConvNeXt, while the prediction time of the ConvNeXt-GhostNet was longer than that of ConvNeXt. The prediction time of the ConvNeXt-DenseNet was 84.94% of that of ConvNeXt. The prediction time of the proposed model was 43.52% of that of ConvNeXt. The proposed model had the largest reduction in prediction time.

The experimental results of the ConvNeXt, ConvNeXt-DenseNet, ConvNeXt-GhostNet, and the proposed model on the TON-IoT and BoT-IoT datasets are shown in Table 6. The bold values in the table mean that the corresponding model is optimal on this performance metric. On the TON-IoT dataset, the accuracy, precision, recall, F1-score, FAR, and AUC of the proposed model were 0.9998, 0.9997, 0.9987, 0.9992, 0.0000, and 0.9989, respectively. The ConvNeXt, ConvNeXt-DenseNet, ConvNeXt-GhostNet, and the proposed model differed only slightly in detection performance. On the BoT-IoT dataset, the accuracy, precision, recall, F1-score, FAR, and AUC of the proposed model were 1.0000, 0.9962, 1.0000, 0.9981, 0.0000, and 1.0000, respectively. The ConvNeXt, ConvNeXt-DenseNet, and the proposed model differed only slightly in detection performance, whereas the F1-score of the proposed model was 0.0118 higher than that of ConvNeXt-GhostNet.

In summary, the proposed model outperformed the ConvNeXt, ConvNeXt-DenseNet, and ConvNeXt-GhostNet in terms of the number of parameters, training time, and prediction time while retaining good detection performance. Therefore, the ShuffleNet V2 has more advantages than DenseNet and GhostNet in making ConvNeXt more lightweight.

The training set curve of the model can reflect the learning capability of the model in the training stage. The training set curve of the model on the TON-IoT dataset is shown in Figure 10(a). On the TON-IoT dataset, the initial accuracy and loss of the ConvNeXt were slightly worse than those of the proposed model, and its stability in the early stage of training was also slightly worse than that of the proposed model. The training set curve of the model on the BoT-IoT dataset is shown in Figure 10(b). On the BoT-IoT dataset, the initial accuracy and loss of the ConvNeXt were slightly worse than those of the proposed model. Because models with more parameters are generally harder to fit in the early stages of training, ConvNeXt’s initial accuracy, initial loss, and stability in the early stages of training were all slightly worse than those of the proposed model. However, after the early stages of training on both datasets, the ConvNeXt and the proposed model quickly achieved very similar training results and stability.

The validation set curve of the model can reflect the generalization capability of the model in the training phase. The validation set curve of the model on TON-IoT is shown in Figure 11(a). There were fluctuations in the whole training process of the proposed model and the ConvNeXt on the TON-IoT dataset. The fluctuations of the proposed model were more severe and frequent in the early training stage. The generalization capability of the proposed model was slightly reduced in the early training stage because the structure of the proposed model was more simplified than that of the ConvNeXt. However, the simpler structure of the proposed model makes it less prone to overfitting in the middle and later stages of training. The situation in Figure 11(b) was the same as that in Figure 11(a), which further supports the above view.

The training of the ConvNeXt on the TON-IoT dataset reached the optimum at the 30th epoch of the tenth-fold cross-validation, which took 136,193.988 seconds. Its accuracy and loss on the training set were 0.9995 and 0.0013, respectively, and its accuracy and loss on the validation set were 0.9998 and 0.0006, respectively. The training of the proposed model on the TON-IoT dataset reached the optimum at the third epoch of the ninth-fold cross-validation, which took 20,705.409 seconds. Its accuracy and loss on the training set were 0.9996 and 0.0010, respectively, and its accuracy and loss on the validation set were 0.9998 and 0.0008, respectively. Compared with the ConvNeXt, the time taken by the proposed model to train to the optimal model was shortened by 84.80%.

The training of the ConvNeXt on the BoT-IoT dataset reached the optimum at the 49th epoch, which took 25,690.379 seconds. Its accuracy and loss on the training set were 0.9998 and 0.0006, respectively, and its accuracy and loss on the validation set were 1.0000 and 0.0001, respectively. The training of the proposed model on the BoT-IoT dataset reached the optimum at the 92nd epoch, which took 8,960.602 seconds. Its accuracy and loss on the training set were 0.9999 and 0.0004, respectively, and its accuracy and loss on the validation set were 1.0000 and 0.0000, respectively. Although the proposed model needed more epochs than the ConvNeXt to reach the optimal model, the time it consumed was 65.12% shorter than that of the ConvNeXt because each epoch took less time.

The experimental results on the TON-IoT and BoT-IoT datasets are shown in Table 7. The accuracy of the proposed model on the TON-IoT dataset was 0.9998, the precision was 0.9997, the recall was 0.9987, the F1-score was 0.9992, the FAR was 0.0000, and the AUC was 0.9989. Compared with the comparison models, the proposed model achieved the best results on all performance metrics except the AUC. On the AUC, the proposed model was slightly worse than the ConvNeXt, with a difference of 0.0009. The accuracy of the proposed model on the BoT-IoT dataset was 1.0000, the precision was 0.9962, the recall was 1.0000, the F1-score was 0.9981, the FAR was 0.0000, and the AUC was 1.0000. Compared with the comparison models, the proposed model achieved the best results in accuracy, recall, FAR, and AUC. Compared with the ConvNeXt, the proposed model had a slight increase of 0.0002 and 0.0040 in accuracy and F1-score, respectively.

In summary, the proposed model achieved the best results on both the TON-IoT and the BoT-IoT datasets. In particular, the FAR of the proposed model was 0.0000 on both datasets. The performance of the proposed model on the TON-IoT dataset was better than that on the BoT-IoT dataset. The main problem of the proposed model on the BoT-IoT dataset was the low precision. Although the precision of the proposed model was 0.0002 higher than that of ConvNeXt, it was 0.0035 and 0.0032 lower than that of the comparison models TP2SF and Sp2f, respectively. As a result, the F1-score of the proposed model on the BoT-IoT dataset was 0.0006 lower than that of the Sp2f. After enough epochs in the training stage, models with more parameters are generally more prone to overfitting, resulting in reduced generalization capability. Therefore, in the classification results of the test set, the proposed model had a stronger comprehensive performance than ConvNeXt, which confirmed that the proposed model was better than ConvNeXt in generalization capability.

By analyzing the results of the models on each type, the performance of the models can be evaluated more comprehensively. The experimental results of each type on the TON-IoT dataset are shown in Table 8. The performance metrics of the proposed model only in the types of Injection(3), Normal(5), Password(6), and Scanning(8) were not all optimal. The detection results of the proposed model on Injection(3) and Scanning(8) were reduced by 0.0001 to 0.0006 compared to ConvNeXt. The results of the proposed model on Normal(5) were slightly worse than those of the Sp2f, and the accuracy, recall, and F1-score had decreased by 0.0001. The proposed model outperformed all comparison models on Password(6) and other types. Compared with the ConvNeXt, the proposed model was simpler in structure and had stronger detection capability. It is worth mentioning that the TON-IoT dataset is a dataset with imbalanced samples, which is the same as the real network situation. In this dataset, the number of samples of Normal(5) is 286 times that of MITM(4) and 15 times that of other types. Therefore, the proposed model had strong detection capability for samples with a large difference in number and could effectively detect uncommon network attacks.

The experimental results of each type on the BoT-IoT dataset are shown in Table 9. The proposed model achieved the best results on all five types, and the results of the proposed model in Recon(2) and Normal(3) increased by 0.0001 to 0.037 compared to ConvNeXt. Although the precision of the proposed model reached 1.0000 in other types, the precision in Recon(2) was only 0.9811. Although this precision was the same as that of the ConvNeXt, it was 0.0189 lower than that of the TP2SF and the Sp2f. This resulted in a 0.0093 decrease in the F1-score of the proposed model compared to that of the Sp2f and ultimately led to the lower overall precision of the proposed model on the BoT-IoT dataset. Like the TON-IoT dataset, the BoT-IoT dataset is also a dataset with data imbalance. DoS(0) and DDoS(1) in the BoT-IoT dataset account for 97.502% of all samples, while the smallest type, Theft(4), only accounts for 0.002% of all samples. The precision, recall, and F1-score of TP2SF in Theft(4) were all 0.0000. Although the precision of Sp2f in Theft(4) reached 1.0000, the recall was only 0.7083, and the F1-score was only 0.8292. This indicates that the TP2SF and the Sp2f lacked sufficient detection capability for very small samples. However, both the ConvNeXt and the proposed model detected Theft(4) correctly. This shows that the proposed model inherited the excellent detection capability of the ConvNeXt for extremely unbalanced samples while being more lightweight.
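The F1-score reported for Sp2f on Theft(4) can be checked from its precision and recall, assuming the standard definition of the F1-score as the harmonic mean of the two. The sketch below is illustrative only; the function name `f1_score` and the convention of returning 0.0 when both inputs are zero are assumptions made here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Returns 0.0 when both inputs are zero (an assumed convention that
    avoids division by zero).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Sp2f on Theft(4): precision 1.0000 and recall 0.7083, as stated in the text.
print(round(f1_score(1.0000, 0.7083), 4))  # 0.8292

# TP2SF on Theft(4): precision and recall are both 0.0000.
print(f1_score(0.0, 0.0))  # 0.0
```

Because the harmonic mean is pulled toward the smaller of the two values, a perfect precision cannot compensate for a recall of 0.7083 on a type with very few samples.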

Compared with other attack types, the DoS and DDoS are simple but effective, so they have become a very common attack method for network attackers and therefore have received more attention from network security researchers. The botnet targeted by the BoT-IoT dataset is one of the preconditions for attackers to launch DoS and DDoS attacks. DoS and DDoS samples are included in both the TON-IoT and BoT-IoT datasets. DDoS(1) and DoS(2) are in the TON-IoT dataset, and DoS(0) and DDoS(1) are in the BoT-IoT dataset. For DoS, the proposed model achieved the optimal 1.0000 or 0.0000 on all performance metrics in both datasets. For DDoS, the proposed model also achieved the optimal 1.0000 or 0.0000 on all performance metrics in the BoT-IoT dataset. However, in the TON-IoT dataset, the recall and F1-score of the proposed model were 0.9994 and 0.9997, respectively, while the other performance metrics reached the optimal 1.0000 or 0.0000. In summary, the proposed model achieved better results than the comparison models for detecting DoS and DDoS in both datasets.

6. Conclusion

In order to enable the IDS deployed in fog nodes to protect IoT devices effectively, this paper applies a new high-performance CV model ConvNeXt to IoT intrusion detection and makes lightweight improvements. First, ConvNeXt is one-dimensionalized to enable it to process network traffic. Then, we use the design criteria of the lightweight CV model ShuffleNet V2 to make lightweight improvements to ConvNeXt to obtain the classification model ConvNeXt-Sf. Finally, the data preprocessing model LE-MMN and the classification model ConvNeXt-Sf are combined to construct the IoT intrusion detection model. Experiments on the TON-IoT and BoT-IoT datasets show that ConvNeXt-Sf is more lightweight than ConvNeXt and slightly enhances detection accuracy. Compared with ConvNeXt-DenseNet and ConvNeXt-GhostNet, the proposed model not only has fewer parameters and shorter training time and prediction time but also has good results in performance metrics such as accuracy and FAR.
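The two operations that make up the data preprocessing model LE-MMN, label encoding and max-min normalization, can be illustrated with a minimal pure-Python sketch. The function names, the sorted category order, the handling of constant columns, and the toy values are assumptions made for illustration and do not reproduce the authors' implementation.

```python
def label_encode(values):
    """Map each distinct category to an integer index.

    Categories are indexed in sorted order (an assumption of this sketch).
    Returns the encoded list and the category-to-index mapping.
    """
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def min_max_normalize(values):
    """Scale numbers linearly to [0, 1]: (x - min) / (max - min).

    A constant column is mapped to 0.0 (an assumed convention that avoids
    division by zero).
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical toy flows, not taken from TON-IoT or BoT-IoT.
protocols = ["tcp", "udp", "tcp", "icmp"]
byte_counts = [120, 4000, 60, 980]

encoded, mapping = label_encode(protocols)
features = list(zip(min_max_normalize(encoded), min_max_normalize(byte_counts)))

print(mapping)
for row in features:
    print(row)
```

Encoding categorical fields first and then scaling every column to the same range yields a purely numeric sequence with comparable magnitudes, which is the kind of input a one-dimensional convolutional model expects.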

In future research work, we will try to combine unsupervised learning or semisupervised learning with the proposed model. In a realistic IoT with hybrid cloud-fog computing, network traffic is mostly unlabeled, so using supervised learning requires a lot of resources to label the data. Using unsupervised or semisupervised learning can significantly reduce resource consumption. Moreover, unsupervised learning has been widely used in the field of NLP, so if it can be introduced into IoT intrusion detection, it will further promote the unification of multiple fields.

Data Availability

The TON-IoT dataset can be downloaded from https://research.unsw.edu.au/projects/toniot-datasets and the BoT-IoT dataset can be downloaded from https://research.unsw.edu.au/projects/bot-iot-dataset.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Guosheng Zhao and Yang Wang contributed equally to this work.

Acknowledgments

This research work was supported by the National Natural Science Foundation of China (Nos. 61202458 and 61403109) and the Natural Science Foundation of Heilongjiang Province of China (No. LH2020F034).