Abstract

As social media use has grown, so have the number of users and the volume of data carried over the network, making it increasingly important to protect users’ data and privacy from threats. Because users remain largely unaware of hackers, social media’s security flaws and new forms of attack will persist. Intrusion detection systems (IDSs) are therefore vital for identifying intrusion risks. This paper examines a variety of intrusion detection techniques used to detect cyberattacks on social media networks. It first summarizes the prevalent attacks on social media networks, such as phishing, fake profiles, account compromise, and cyberbullying. It then addresses the most prevalent techniques for classifying network traffic, including statistical and artificial intelligence (AI) techniques. The literature demonstrates that because AI can manage vast, scalable networks, AI-based IDSs are more effective at classifying network traffic and detecting intrusions in complex social media networks. However, AI-based IDSs exhibit high computational and space complexities; despite their remarkable performance, they are therefore better suited to systems with high computing power. Hybrid IDSs that combine statistical feature selection with shallow neural networks may offer a compromise between computational requirements and efficiency. This investigation shows that the accuracies of statistical techniques range from 90% to 97.5%, while AI and machine learning (ML) detection accuracies range from 78% to 99.95%. Similarly, swarm and evolutionary techniques achieved detection rates from 84% to 99.95%, and deep learning-based techniques achieved from 45% to more than 99%. Convolutional neural network-based deep learning systems outperformed the other methods owing to their ability to automatically craft the features that classify network traffic with high accuracy.

1. Introduction

Technological advancements have increased people’s dependency on networks and applications in their daily activities. Business, education, healthcare, banking, and e-government are a few instances of large-scale reliance on the internet [18]. In addition, the rapid growth of social media is giving rise to a new type of threat that spills over from the cyber world to real life [9–11]. As a consequence of this reliance on network infrastructure and the expected increase of attacks in quantity and quality, conventional security mechanisms are deemed inadequate to provide the required levels of security. Conventional intrusion prevention methods, including firewalls, access controls, and encryption, have failed to provide complete defense against more sophisticated malware attacks and infiltration attempts on networks and systems. This is because intrusion detection struggles with issues including extremely high network traffic volumes, highly imbalanced attack class distributions, the inability to distinguish between normal and abnormal behaviors, and the need for ongoing adaptation to a continuously changing environment [12–14]. Intrusion detection systems (IDSs) are now an essential part of the security architecture utilized to identify these threats before they cause significant harm. Several issues, including data gathering, data preprocessing, intrusion recognition, reporting, and reaction, must be taken into account when developing an IDS. Most recent research has concentrated on how to create detection models efficiently and accurately by applying advanced techniques; machine learning and deep learning methods, in particular, have been used to develop intrusion detection systems in the field of cybersecurity.
Most of the recent studies focus on applying artificial neural networks (ANN), random forest (RF), support vector machines (SVM), naïve Bayes (NB), or a combination of these techniques to increase the effectiveness of IDSs.

This paper discusses the evolution of social media and provides an overview of the security mechanisms available in the industry. The paper starts by presenting the significance of social media in modern life and describing the complexity of social media networks. Then, the most common attacks on social media are highlighted and reported. After that, the paper continues with common techniques used to detect cyberattacks in social media networks. Common intrusion detection techniques are discussed and categorized into statistical, artificial intelligence, machine learning, swarm intelligence, evolutionary computation, and deep learning techniques for detecting attacks in social media networks. In addition, the paper summarizes how attacks can be detected within social media network traffic with the help of an intrusion detection system. The main contribution of this paper is addressing social media-specific attacks and demonstrating the different categories of intrusion detection systems used in social media cloud systems. Though social media is affected by all types of attacks present in other cloud systems, it also exhibits specific types of attacks such as cyberbullying. The second contribution is that intrusion detection systems are categorized based on the methodology and complexity of the algorithm. Moreover, important datasets used in developing and evaluating the performance of intrusion detection techniques are highlighted and listed. Finally, the performance of intrusion detection techniques is recorded based on the relevant references and datasets used.

The next section of this paper discusses the importance of social media in modern life and its impact on individuals and businesses. The third section is dedicated to discussing different types of attacks on social media, followed by intrusion detection technique classification and discussion in the fourth section. Future directions and opportunities are presented in the fifth section, followed by concluding remarks in the sixth section.

1.1. Literature Collection and Screening Scheme

The total number of papers collected in this study and the screening scheme are shown in the PRISMA diagram in Figure 1. The papers were mainly collected from Google Scholar and the Saudi Digital Library, a service that provides access to international journals and databases. The search keywords used in collecting the papers relate to the different techniques used in traffic-based intrusion detection systems, particularly in social media. The emphasis on traffic-based techniques is to exclude system modeling-based and behavior-based intrusion detection techniques and shed light on the techniques implemented to analyze and classify network traffic features. We collected 398 articles related to social media intrusion detection techniques in general. The collected literature went through a series of investigations to select the most relevant, peer-reviewed literature. In the first step, before screening, 21 duplicate records were removed. Moreover, records that were not peer reviewed were excluded, including white papers, archived papers, case studies, records written in languages other than English, and papers published before 2013. The remaining 306 records went through a more comprehensive screening according to relevancy to social media and intrusion detection techniques. The screening process resulted in the exclusion of 249 records, and the remaining 60 records are used in this review paper. In addition to these records, we collected 8 reports from a leading organization in assessing social media evolution. Of the 8 reports, we used 4, as 2 could not be retrieved and the remaining 2 were not closely relevant to the social media industry.

2. Importance of Social Media in Modern Life

Today, online social networks have become indispensable. They are considered the most effective way to communicate with family and friends and keep people in contact, even if they are geographically apart. Day by day, huge amounts of data, such as photos, videos, and simple messages, are shared through popular social networks, such as Twitter, Facebook, Instagram, and YouTube. Social networks are becoming more important because of the vast and diverse data they communicate. They are therefore considered among the most desirable targets for hackers who intend to obtain personal information without the owner’s consent.

2.1. Volume of User’s Statistics

Social media is used by billions of people, and it has become an important part of people’s lives all over the world. A decade ago, social media was much simpler and smaller than it is today. Statistics show that 4.48 billion people used social media globally in 2021, more than double the 2.07 billion in 2015, with an average annual growth rate of 12.5% since 2015. Statistics also show that people spend an average of 2 hours and 24 minutes daily on social media, meaning that someone who starts using social media at 16 and lives to 70 will have spent 5.7 years of their life using it [1]. In 2020, with the coronavirus pandemic (COVID-19), hundreds of millions of people worldwide spent more time at home, isolating and socially distancing themselves while attempting to keep in touch despite the lockdowns. At that time, most people used social media for communication, sharing news, and keeping in contact with their families and friends. Therefore, in 2021, the major social media platforms thrived and grew significantly compared with 2019 [2].

Reports from the last few years show how social media has exploded in popularity and grown rapidly, becoming an essential, effective, and indispensable tool for communication between people worldwide [3].

2.2. Complexity of User Network

However, as social media became increasingly pervasive in our lives, the complexity of social media networks increased in parallel. Millions of users of different ages share billions of conversations, photos, and videos every minute. The complexity, large scalability, dynamic nature, and public nature of social media networks make their challenges unpredictable [4]. In terms of big data, social media contains a large volume of unstructured data from a wide variety of sources, such as the web, mobile devices, and cloud storage, shared worldwide [5]. This globally distributed data has reached terabytes and petabytes in size. Such complexity makes traditional systems inefficient and ill-suited for dealing with social media networks.

2.3. Social Media Importance in Modern Life and Business Statistics

Despite the complexity of social media networks, their advantages are undeniable. Social media has become a very useful platform for both individuals and businesses. For individuals, social media networks are a powerful tool to keep people in contact even if they are geographically apart. As long as you have an internet connection, you can text anyone on the planet at no cost. In addition, social media keeps people updated and raises awareness of what is happening in other regions. For example, with the first signs of COVID-19, many countries imposed lockdowns and banned regional social gatherings. At that point, social media became the most effective way to keep people in touch, provide up-to-date and reliable information about COVID-19 to the public, and raise people’s awareness of issues such as how to maintain social distance, wear masks, and isolate.

Furthermore, social media played a critical role in entertainment during the lockdown period, when most people relied on it to pass the time [6]. In terms of business, on the other hand, social media has evolved into an essential and extremely useful platform, serving an important role in marketing and advertising; some businesses have even abandoned their websites in favor of social media. It helps businesses reach consumers from around the world, enhancing sales and increasing the likelihood of meeting more ambitious sales targets [7]. For example, a business owner with a store can market products by creating Instagram or Snapchat accounts and posting the products publicly on those platforms, and can use a WhatsApp account to receive purchase requests and contact customers directly.

In addition, celebrities can reach many people, and companies take advantage of celebrities’ influence to promote their goods and services through social media networks. Therefore, social media has recently become a significant factor in driving product sales. For example, a Canadian statistical report from 2021 shows that 71 percent of companies use social media for marketing, with 52 percent of those posting at least once a day, and around 90% of users follow brands and business accounts to stay updated with the latest products [8].

2.4. Future Perspective of Social Media

Social media becomes a more ubiquitous part of people’s lives every year. Its use is changing so quickly that it is difficult to predict what shape it will take in the near future. However, it is expected that social media networks will become more popular for shopping than they are now, and that people will use social media even more significantly than they do today. In addition, the network infrastructure will become much more complex and more challenging to handle as the amount of conversation and media shared between users increases [9].

3. Emerging Cybersecurity Attacks on Social Media

As social media networks and access to the internet continue to grow quickly, most businesses are becoming exposed to a wider range of threats and attacks [5]. Traditional security measures, such as firewalls, access control, and encryption, have not been able to protect networks and systems from increasingly sophisticated malware and attacks.

This section examines the security of social media networks and is divided into four subsections. The first subsection demonstrates the volume of attacks on social media. The second subsection discusses attack types, whereas the third and fourth subsections discuss user and business vulnerability perspectives, respectively.

3.1. Statistical Perspective of Attacks in Social Media

The number of cyberattacks has been steadily increasing over the last ten years, and they are becoming more hazardous than ever because threat actors and their techniques are continuously evolving. Every day, around 300,000 new pieces of malware are produced to attack individuals and organizations [10]. According to a social media cybersecurity report, 2021 was one of the busiest years for cyberattacks: compared with 2020, there was a 50% rise in overall weekly attacks on business networks, indicating that attacks on social media rose sharply throughout 2021 [11]. The findings report that 30,000 websites are hacked every day, a company is hit by a cyberattack every 39 seconds, and some type of malware has hit more than 60% of businesses worldwide. According to IBM’s Cost of a Data Breach Report 2021, the average total cost of a data breach climbed from $3.86 million to $4.24 million in 2021 [12]. According to the report’s results, remote working increased the average total costs due to longer reaction times [10].

3.2. Type of Attacks

According to the Pew Research Center, in 2021, Facebook, Snapchat, Twitter, WhatsApp, and Instagram were expected to be the most visited social media sites [13]. The main purpose of these platforms is to bring people and organizations together, and they have also created numerous business opportunities for enterprises. With these platforms, the way people communicate has changed dramatically. However, as these platforms become more popular and grow rapidly, they are becoming more vulnerable to cyberattacks: rapid growth increases the attack surface and exposes social media networks to different cyberattacks. Furthermore, because social media platforms are accessible from the public internet and contain a large amount of up-to-date data, they are regarded as a desirable source for hackers who use social media to conduct various types of cyberattacks, such as cyberbullying, social engineering, and phishing. In terms of big data, the rapid growth of public and private data accessible by anyone, from anywhere, and from different sources, such as computers, mobile devices, and web applications, makes social media users victims of various attacks [5]. In addition, social media platforms host millions of sensitive pieces of information, such as usernames, passwords, and financial information. Moreover, cyberattacks have become cheap and simple and can be easily conducted by attackers who intend to impersonate people and steal their personal information [14]. The most common attacks that target social media are shown in Table 1.

3.2.1. Phishing Attack

Phishing is a social engineering attack in which attackers convince users to disclose their credentials through fake emails, messages, phone calls, etc. [15]. Many social media platforms used for online business, advertisement, and marketing, such as WhatsApp, Facebook, and Instagram, are easily exploited by phishing attacks to steal users’ personal information, such as usernames, passwords, and financial information. This attack is commonly spread by sending false sign-in links via social media posts or fake emails that redirect users to a fake sign-in page. The URL of this page is similar to the original one, but with small changes to letters or numbers that mostly go unnoticed by users. The victims therefore enter their credentials into the fake sign-in form, and the attacker steals those credentials to launch other attacks. One example of phishing attacks that target social media is the Facebook phishing attack. As Facebook scaled in 2019 and owned the most widely used apps, including Facebook, Instagram, Messenger, and WhatsApp, with around 3 billion active users, it became a top target for hackers intending to steal users’ credentials. In the first quarter of 2020, thousands of unique Facebook-phishing URLs were created, along with a large number of phishing emails impersonating Facebook. These impersonation emails include password-reset requests, security alerts, and account confirmations, all aimed at stealing user login credentials; some fake emails also threaten users that their accounts will be blocked if they do not update their passwords. In these attacks, the attacker can launch additional attacks after gaining access to the compromised accounts. It is therefore a dangerous attack with a large victim space [19].
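The look-alike URL trick described above can also be detected programmatically. The following minimal sketch (an illustration, not taken from any of the surveyed systems) flags a domain that sits within a small edit distance of a known legitimate domain; the `max_distance` cutoff of 2 is an assumed tuning parameter:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def looks_like(domain, legitimate, max_distance=2):
    """Flag a domain suspiciously close to, but not equal to, a known one."""
    return domain != legitimate and edit_distance(domain, legitimate) <= max_distance

print(looks_like("faceb00k.com", "facebook.com"))  # two letters swapped for digits -> True
print(looks_like("example.org", "facebook.com"))   # unrelated domain -> False
```

Real phishing filters combine many more signals (homoglyphs, registration age, hosting reputation), but the edit-distance check captures the "small changes to letters or numbers" pattern directly.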

3.2.2. Fake Profile Attack

False identities are used in sophisticated threats and to launch other malicious operations. A fake identity involves creating a profile with an arbitrary name, picture, and other information that cannot be linked to anyone; some fake profiles impersonate a real person in order to gain the trust of victims [16]. Fake accounts can be found in different places on the internet, such as shopping websites, social media platforms, and banking systems [20]. However, because social media platforms have become increasingly popular for sharing news, chatting, marketing, advertising, etc., social media has become a top target for fake profile attacks. In addition, creating a profile on social media is straightforward, with no extensive verification process, so creating a fake profile is simple. On social media platforms, for example, fake accounts frequently offer discounts and gifts that appear to be from the brands themselves, although both the accounts and the offers are fake. Fake accounts usually post images from the spoofed page, adopt a similar name, and reach out to the original page’s followers. Statistics show that, in 2021, nearly 1.3 billion fake profiles were removed from Facebook [21].

3.2.3. Account Compromise Attack

Hacking into users’ accounts has become a profitable business for attackers. By attacking personal or commercial accounts, cybercriminals can spread destructive messages or fake information to a large audience by taking control of the victim’s account; this attack can result in catastrophic damage, including tarnished reputations and significant financial losses [17]. On social media, users develop trusting relationships with the accounts they follow, and this trust can grow for a variety of reasons: the user may know the account’s owner in person, or the account may be run by a reputable entity such as an official news agency. Unfortunately, attackers exploit this trust by compromising accounts and spreading malicious information from accounts that victims trust. Once an attacker has gained access to a social media account, they might use it for malicious activities such as spamming, phishing, or linking to malware. Moreover, an account compromise attack can have significant consequences beyond damage to an individual’s or company’s reputation.

For instance, Twitter was targeted by account compromise attacks in July 2020. The company reported that outside parties had hacked 130 high-profile Twitter accounts to promote a bitcoin fraud. According to the Twitter report, the attackers obtained access to Twitter’s administration tools, allowing them to modify the accounts themselves and post messages directly; they appeared to have gained access to these tools through Twitter employees via social engineering. The attackers’ tweets prompted people to send bitcoins to a certain cryptocurrency wallet. Within minutes of the initial tweets, more than 320 transactions had already occurred on one of the wallet’s addresses, and bitcoins worth more than $110,000 were deposited in one account before Twitter removed the fraudulent messages. This incident is considered one of social media’s worst hacks [22].

3.2.4. Cyberbullying Attack

Cyberbullying is the use of the internet, phones, or any other technology to send texts or upload photos that embarrass or hurt a person. Several studies report that between 10% and 40% of internet users become victims of cyberbullying attacks, and its consequences range from acute uneasiness to suicide [18]. It occurs on social media, messaging platforms, gaming platforms, and mobile phones; however, cyberbullying has become a particularly serious issue because of the rise of social media platforms in the last decade. On social media platforms, cyberbullying is repeated behavior intended to frighten, enrage, or shame the targeted individuals, for example, fabricating lies about someone or sharing embarrassing images or videos of them; impersonating someone and delivering derogatory words on their behalf or through fictitious accounts; and using messaging systems to send hateful, abusive, or threatening texts, photos, or videos. According to cyberbullying statistics, cyberbullying attacks increased in January 2020 because people, including children, were spending 20 percent more time on social media during the COVID-19 pandemic lockdowns. In addition, statistics show that 44% of internet users in the United States have been harassed online [23].

3.3. User Vulnerability Perspective

Because there are many social media platforms on the internet, an attacker can use several threat strategies to conduct social engineering and phishing. For an attacker, there is no “one-size-fits-all” social media threat: any publicly available data on personal and professional social media accounts could be exploited in future attacks. However, in this age of increased cybercrime risk, people are unaware of the threats they confront daily. Even though people frequently know the correct answers to cybersecurity awareness questions, they do not apply that knowledge in real life; in other words, people are aware of cyber risks but do not act on that awareness in their daily lives. In terms of training and security awareness, more is required than simply informing people about what they should and should not do; people must also be able to respond when cybercrime occurs [24].

3.4. Business Vulnerability Perspective

In terms of business, many businesses are now shifting to social media for online shopping, advertisement, and marketing. Using WhatsApp, Telegram, Instagram, and other channels, businesses can receive requests, advertise, market their products, and talk to their customers, while people discover, learn about, follow, and shop from brands on social media platforms. Social media thus enables and simplifies many business activities, and it has become an important part of business-marketing strategies and the growth of businesses of all sizes, as well as a significant driver of revenue through influence. On the other hand, using social media for business brings a higher risk of attack and fraud, growing threats, and the spread of malicious programs; because social media can pose the greatest threat to a company’s reputation, such attacks can directly impact companies financially.

Furthermore, hackers develop new attacking techniques, such as stealing brand names, building fake company accounts, interacting with customers, and posting fake advertisements from impersonating business accounts. The harm and destructive impacts of such attacks are very difficult to stop [25]. Therefore, using social media for business is considered a critical cybersecurity issue.

4. Intrusion Detection Techniques in Social Media

Over the years, with the rapid growth of social media, it has become easy to exploit social media for malicious activities, which pose serious security risks to users and organizations. Today, detecting malicious activities is a trending topic in numerous cybersecurity application domains. One of the essential steps to enhancing security is detecting and identifying abnormal activities using an intrusion detection system. Several researchers have proposed tools and methods to enhance security using statistical, artificial intelligence, machine learning, and deep learning techniques to detect intrusions in large networks, as shown in Table 2.

4.1. Statistical Methods

Many machine learning approaches are derived from statistics, which is the cornerstone of machine learning. The main difference between the two is that statistical techniques deal with smaller amounts of data and require more human effort, whereas machine learning has high predictive power and requires minimal human intervention because the computer does most of the processing. The following are some statistical techniques used to develop intrusion detection systems for classifying network traffic, as listed in Table 3.

One of the earliest studies [26] demonstrated that the world of wired and wireless computer networking had changed dramatically, resulting in numerous threats and risks from malicious nodes that launch security attacks. Numerous intrusion prevention measures, such as access control and encryption, protect only normal nodes but do nothing against malicious nodes. The study therefore attempts to detect attacks so that countermeasures can be taken to prevent an attack from being completed or from causing further damage after a successful attack has occurred. In addition, existing intrusion detection systems have several limitations: many IDSs fail to generalize to unknown data, fail to detect attacks coming from inside the system, or use inappropriately determined detection thresholds. Thus, the study provides an efficient framework that is adaptive, scalable, and capable of predicting security attacks at the node or system level to overcome these limitations. The study presents an intrusion detection framework for computer networks based on determining thresholds for network parameters: the proposed model determines threshold levels that distinguish between normal, uncertain, and abnormal values for essential network variables and performs an efficient vulnerability assessment based on these values. The proposed model consists of several frameworks. First, the basic framework is fed with normal and abnormal network data. Second, the parameter identification framework determines the critical network parameters that are impacted by security attacks. Third, the threshold parameter framework uses the Six Sigma technique to determine the key network parameter threshold values that distinguish between normal, uncertain, and abnormal network traffic.
Fourth, the intrusion detection framework detects attacks using an audit trail based on significant parameters and their threshold levels; if abnormal behavior is detected, an alarm is sent to the reaction and protection framework. Fifth, the countermeasure framework prevents the intended attack from being completed, or prevents further damage when a successful attack is detected. For performance evaluation, the proposed model was tested on the benchmark DARPA dataset. The experimental results show that the proposed Six Sigma-based intrusion detection system outperforms related models by at least 25% and 20% in false-positive rate and prediction rate, respectively, and that the model can effectively evaluate vulnerability and detect intrusions from raw network data.
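The thresholding step at the heart of this approach can be illustrated with a small sketch. The paper's exact parameters are not reproduced here; the 3-sigma and 6-sigma band widths, the baseline values, and the packets-per-second parameter below are all illustrative assumptions:

```python
import statistics

def six_sigma_thresholds(normal_values, k_normal=3.0, k_abnormal=6.0):
    """Derive threshold bands for one network parameter from normal traffic."""
    mu = statistics.mean(normal_values)
    sigma = statistics.stdev(normal_values)
    return mu, k_normal * sigma, k_abnormal * sigma

def classify(value, mu, band_normal, band_abnormal):
    """Label an observation as normal, uncertain, or abnormal by its deviation."""
    deviation = abs(value - mu)
    if deviation <= band_normal:
        return "normal"
    if deviation <= band_abnormal:
        return "uncertain"
    return "abnormal"

# Hypothetical packets-per-second samples observed under normal load
baseline = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99]
mu, b_n, b_a = six_sigma_thresholds(baseline)
print(classify(100, mu, b_n, b_a))  # within 3 sigma -> normal
print(classify(107, mu, b_n, b_a))  # between 3 and 6 sigma -> uncertain
print(classify(500, mu, b_n, b_a))  # far outside -> abnormal
```

The real system applies such bands to many parameters at once and feeds the "abnormal" verdicts to the reaction framework; this sketch shows only the single-parameter decision rule.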

Waskita et al. demonstrated that the number of novel attacks is increasing and that existing attacks are becoming difficult for intrusion detection systems to identify [27]. As a result, anomaly detection technologies may raise alarms for regular activity (false positives) or fail to raise alarms during attacks (false negatives). Outlier detection is one of the most critical aspects of data analysis: the word “outlier” describes data that do not fit within the normal range, i.e., critical values that do not belong to any class and lie outside the data distribution. Cleaning data and detecting outliers is therefore important to avoid false detections. The paper examines three techniques for host-based intrusion detection systems, namely PCA, the chi-square distribution, and the Gaussian mixture distribution, and compares their performance. The results show a detection rate of 97.5 percent for both the PCA and Gaussian mixture approaches, versus 90 percent for the chi-square distribution. The study also demonstrated that anomaly detection systems generally have a high false-alarm rate and struggle to categorize borderline data, meaning that normal data can be misclassified as malicious and abnormal data as normal. The researchers therefore employed a confusion matrix, which provides an accurate detection rate and a low false-alarm rate. These strategies can be applied to any outlier detection task and to large datasets.
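The PCA-based detection idea can be sketched roughly as follows: fit principal components on normal traffic, then flag records whose reconstruction error (distance from the learned subspace) exceeds anything seen during training. The synthetic two-feature data and max-error threshold below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def fit_pca(X_normal, n_components=1):
    """Fit PCA on normal records (rows = records, columns = features)."""
    mean = X_normal.mean(axis=0)
    # Principal directions come from the SVD of the centered data
    _, _, Vt = np.linalg.svd(X_normal - mean, full_matrices=False)
    return mean, Vt[:n_components]

def reconstruction_error(X, mean, components):
    """Distance of each record from the principal subspace."""
    Xc = X - mean
    projected = Xc @ components.T @ components
    return np.linalg.norm(Xc - projected, axis=1)

rng = np.random.default_rng(0)
# Synthetic "normal" traffic: two strongly correlated features
t = rng.normal(size=200)
X_normal = np.column_stack([t, 2 * t + rng.normal(scale=0.05, size=200)])
mean, comps = fit_pca(X_normal)

# Flag anything worse than the worst error seen on normal data
threshold = reconstruction_error(X_normal, mean, comps).max()
outlier = np.array([[0.0, 5.0]])  # breaks the learned feature correlation
print(reconstruction_error(outlier, mean, comps)[0] > threshold)  # True
```

In practice the threshold would be set statistically (e.g., a percentile of training errors) rather than as the raw maximum, trading detection rate against false alarms exactly as the confusion-matrix analysis in the study describes.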

To increase detection accuracy, Kabir et al. proposed a novel approach for statistically analyzing network traffic [61]. The proposed model examines the network traffic collected by an intrusion detection system to determine normal and abnormal behaviors. The system continuously monitors incoming network packets on the active ports of each host in the network and generates average behaviors at various time scales. At a given time, the average behavior is used as a baseline for normal traffic, allowing the system to detect incoming threats to the supervised network by deploying an intensive search-based decision algorithm for traffic classification. To enhance performance, after traffic is continuously captured for a shorter period, the captured data must be preprocessed or normalized to enable proper comparison across different ports. The researchers argue that averaging the collected data over a long period produces less accurate results, so they proposed using a short collection period to determine the normal region. A step-by-step comparison of the authorized zones formed over long and short periods is therefore proposed to increase accuracy and reduce cost. The study’s results demonstrated that basic statistical techniques can still produce good results, and the researchers intend to evaluate further statistical models on well-known datasets in the future.
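The short-window, per-port baseline idea can be sketched as follows. The data structure, window length, and tolerance factor are illustrative assumptions rather than the authors' exact decision algorithm:

```python
from collections import defaultdict, deque

class PortBaseline:
    """Per-port traffic baseline built from a short window of recent samples.

    Hypothetical sketch: the surveyed paper's normalization and
    search-based decision algorithm are not reproduced here.
    """
    def __init__(self, window=5, tolerance=2.0):
        self.window = window        # number of recent samples kept per port
        self.tolerance = tolerance  # allowed multiple of the baseline mean
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, port, packets):
        """Record one sampling interval; report whether it is anomalous."""
        samples = self.history[port]
        if len(samples) == self.window:
            baseline = sum(samples) / len(samples)
            anomalous = packets > self.tolerance * baseline
        else:
            anomalous = False       # not enough history to judge yet
        samples.append(packets)     # the window slides forward
        return anomalous

monitor = PortBaseline()
for pkts in [100, 110, 95, 105, 100]:  # build the baseline for port 443
    monitor.observe(443, pkts)
print(monitor.observe(443, 104))  # within tolerance -> False
print(monitor.observe(443, 900))  # sudden surge -> True
```

Because the deque discards the oldest sample on each observation, the baseline reflects only recent behavior, which is precisely the short-period averaging the authors argue improves accuracy over long-period averages.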

Moreover, the significance of accurate detection systems was the concern of Azad and Jha in [28]. They showed that the main goal of an intrusion detection system is to find as many attacks as possible with as few false alarms as possible; that is, the system must be accurate, precise, and effective in attack detection. However, most accurate intrusion detection systems are slow and cannot handle huge volumes of network traffic. Hence, there is an urgent need to develop a detection system fast enough to process large amounts of data with a minimal number of false alarms. Much past research has tried to develop intrusion detection systems using different techniques, including popular statistical classification methods such as SVM; the main drawback of these methods is that they cannot handle huge data sizes. Therefore, the paper proposed a modified version of the SVM based on the concept of the least squares SVM (LS-SVM) to detect intrusions in large datasets. For detecting intrusions in a large dataset, it is essential to reduce the dimension of the dataset before feeding it to the classifier. Therefore, the study first determines the sample size required to characterize the features of the entire dataset. Then, the study uses the proposed optimum allocation (OA) technique to divide the entire dataset into preset subgroups and select samples from these clusters. After that, the study uses these samples as an LS-SVM input set to detect various intrusions in the IDS. The study proposed two model architectures depending on the data selection process. The first architecture, OA-LS-SVM 1, combines the training and testing data, determines the training and testing sizes using OA, and then selects the samples from both sets. The second architecture is OA-LS-SVM 2.
In this architecture, once the sizes of training and testing are chosen using the OA technique, the training and testing datasets are separated into various specified subgroups. Then, the OA technique is used to decide each subgroup’s training and testing sizes so that their sum equals the overall sizes determined earlier. Finally, the needed samples are selected from each training and testing subgroup. For simulation, the LS-SVM classification system was implemented in MATLAB using the LS-SVM toolbox. The experiment was conducted on a desktop computer with 8 GB RAM and an Intel Core i7 CPU at 3.40 GHz. The proposed system was validated using the KDD 99 dataset, a common dataset for testing intrusion detection systems. For result calculation, the study computes precision, recall, and F-value from the true positives, false negatives, and false positives. The results show that, compared with well-known systems from previous work, the proposed model is effective for detecting intrusions in static and scalable datasets. The main reason is that the OA-LS-SVM uses the best characteristics of the whole dataset.
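The evaluation metrics mentioned, precision, recall, and F-value, follow directly from the true-positive, false-positive, and false-negative counts; a small sketch with made-up counts:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-value from true positives, false
    positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value

p, r, f = prf(tp=90, fp=10, fn=30)
print(p, r, f)  # 0.9, 0.75, ~0.818
```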

Similarly, authors in [29] demonstrated the importance of maintaining a high level of security to allow businesses to share information in a safe and reliable way. Data exchanges over the internet are always vulnerable to intrusions and misuse. As a result, intrusion detection systems have become an essential component of computer and network security. The study proposes an IDS that uses a genetic algorithm (GA) to identify different forms of network intrusions quickly. The precalculation and detection phases are the two key phases of the proposed system. During the precalculation phase, training data is used to construct a collection of features. These features will be used for comparison and prediction in the detection phase. The study used the standard dataset from the KDD CUP 99 to construct the algorithm and analyze the system’s performance. To simplify the implementation, only the continuous and discrete numerical features have been considered among the extracted features of the datasets. The experimental result shows that the approach performs well for most classes in general and has a higher detection rate of 99.4 percent for denial of service.
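A GA like the one used here evolves candidate solutions with standard operators; a minimal, deterministic sketch of single-point crossover and mutation (the binary chromosome encoding is illustrative, not the paper's exact representation):

```python
def crossover(a, b, point):
    """Single-point crossover: the two children swap gene tails at `point`."""
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chrom, flips):
    """Flip the genes of a binary chromosome wherever `flips` is True
    (a deterministic stand-in for the usual random mutation coin)."""
    return [1 - g if f else g for g, f in zip(chrom, flips)]

c1, c2 = crossover([1, 1, 1, 1], [0, 0, 0, 0], point=2)
print(c1, c2)  # [1, 1, 0, 0] [0, 0, 1, 1]
```

In the detection phase, chromosomes that match observed traffic features would score higher fitness and survive selection.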

Furthermore, Li et al. [30] emphasized that the rapid growth of the internet and data processing increases vulnerabilities and threats. Therefore, building an effective intrusion detection system is essential for protecting users and web-based applications. The study proposed an intrusion detection system with a hybrid classifier to detect intrusions that target web-based applications. The hybrid classifier (DTNB) is a combination of the decision table (DT) and naïve Bayes (NB) algorithms. When creating predictions, the DT saves the input data in a condensed form based on selected features and uses it as a lookup table. The combined-model (DTNB) learning algorithm, like standalone DTs, uses forward selection to select attributes. However, the DT and NB class probability estimates must be merged to create overall class probability estimates. The architecture of the proposed system includes preprocessing the dataset using normalization and filtering, feature selection, learning the model using the DTNB hybrid classifier, evaluating the model’s performance, and comparing the results. The effectiveness of the DTNB-based model is evaluated using the new version of KDD, the NSL-KDD dataset. The study results show that the DTNB hybrid classifier has a high detection rate for intrusion detection compared to naïve Bayes and the decision stump. The proposed hybrid classifier delivers higher performance in terms of the following parameters: correctly classified instances, incorrectly classified instances, RMSE, MAE, time taken, and kappa statistic.
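One simple way to merge the DT and NB class probability estimates, as the hybrid requires, is to multiply them per class and renormalize. This is a hedged sketch only; the actual DTNB combination also divides out the class prior once.

```python
def merge_estimates(p_dt, p_nb):
    """Merge per-class probability estimates from a decision table and
    naive Bayes by multiplying them class-wise and renormalizing
    (simplified: the real DTNB formula also divides by the prior)."""
    merged = {c: p_dt[c] * p_nb[c] for c in p_dt}
    total = sum(merged.values())
    return {c: v / total for c, v in merged.items()}

print(merge_estimates({"normal": 0.6, "attack": 0.4},
                      {"normal": 0.9, "attack": 0.1}))
```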

4.2. AI and Machine Learning Methods

Because they can handle and process large amounts of data, artificial intelligence and machine learning techniques are better at classifying network traffic. There are several research proposals for artificially intelligent techniques to detect attacks on network traffic. However, it is important to highlight the notion of ML and differentiate between ML techniques and statistical techniques. Machine learning techniques are a family of parametric techniques that include parametric learnable statistical methods, according to [31]. Parametric learnable statistical methods are a category of algorithms that evolve their parameters based on the available data; hence, they are learnable. Such methods include Bayesian networks and inference, SVM, decision trees, NNs, and GAs. The main distinction between conventional statistical methods and machine learning statistical methods is the context of application and the evolution of hyperparameters in the algorithm. Therefore, this definition of ML, available in [31], is adopted in this work. Table 4 lists the latest literature that developed AI and ML methods, together with the datasets used for validation and development.

In terms of using a naïve Bayesian classifier and a decision tree, Rai et al. [32] proposed a new approach for adaptive network intrusion detection. Mainly, the study addresses various data mining challenges, such as dealing with continuous attributes, handling missing values in the dataset, and reducing noise in the training set. In addition, the study aims to address many IDS limitations, such as low accuracy, slow-speed networks, long training times, and a high false-positive rate. The study performed several processes, including noise handling, managing continuous attributes, and selecting the input attributes from the dataset. For noise handling, missing attribute values are replaced with the most frequent value, redundant values are removed, and contradictory examples are tagged correctly before training. For managing continuous attributes, the study divides continuous values into a series of intervals. Finally, for selecting the input attributes from the dataset, the study reduces system complexity by selecting the most effective attributes before feeding the data into the model. To evaluate the proposed model’s efficiency, the KDD 99 dataset was used. The KDD 99 dataset includes five classes: normal, DoS, R2L, U2R, and probing. Furthermore, the experiment was carried out on a computer with 1 GB of RAM and an Intel Core 2 Duo processor at 2.0 GHz. The study results show that the proposed model decreased false positives and increased accuracy. In addition, the proposed model performed well in detecting attacks on the KDD 99 dataset, with a 99% detection rate.
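The noise-handling step, replacing a missing attribute with the column's most frequent value, can be sketched as follows (the protocol/service values are illustrative):

```python
from collections import Counter

def fill_missing(rows, missing="?"):
    """Replace each missing attribute value with the most frequent
    value observed in that column, as in the noise-handling step above."""
    columns = list(zip(*rows))
    modes = [Counter(v for v in col if v != missing).most_common(1)[0][0]
             for col in columns]
    return [[modes[i] if v == missing else v for i, v in enumerate(row)]
            for row in rows]

rows = [["tcp", "http"], ["?", "http"], ["tcp", "?"], ["udp", "dns"]]
print(fill_missing(rows))
```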

To address the effects of noise and inconsistency on detection efficiency, Aung and Min [33] highlighted that most commercial signature-based IDSs cannot detect new types of attacks; they can only detect intrusions that are stored in the database. However, intruders are smart and always update their attacking techniques and change their intrusion patterns. Therefore, developing an intelligent IDS with a high detection rate is important. An intelligent IDS must be trained with a large dataset to make future predictions, and due to the large size of the data, intrusion detection databases contain noisy and inconsistent records. Therefore, effective data mining techniques are needed to clean the data. To classify a big intrusion detection dataset while minimizing false positives (FP) and enhancing the detection rate (DR), the study proposed an intrusion detection system based on boosting and a naïve Bayesian classifier. It uses a naïve Bayesian classifier to construct the probability set for each round and updates each example’s weight based on the error rate produced on the training examples in that round. Boosting adjusts the distribution of training examples over time so that the base classifiers can focus on the most challenging examples. At the end of each boosting round, the training examples’ weights are changed: the weights of misclassified examples go up, while the weights of correctly classified examples go down. In each round after the first, the classifier therefore focuses on samples that are hard to classify. The naïve Bayesian classifier, in turn, is a simple approach to classification that assumes the attributes are independent when computing the class-conditional probability. It can handle missing attributes by computing the class likelihoods while omitting the missing attribute’s probability term.
The detection rate (DR) and the number of false positives (FP) are used to measure how well the proposed model works. The tests were performed on a computer with 1 GB of RAM and an Intel Core 2 processor with a clock speed of 2.0 GHz. The KDD 99 dataset was used to compare the intrusion detection performance of the proposed learning algorithm with a k-nearest-neighbor (KNN) classifier, a decision tree classifier, SVM, a neural network (NN), and a GA. The study results show that the accuracy for detecting normal traffic was 100% with a false-positive rate of 0.03%, the accuracy for detecting probing was 99.95% with a false-positive rate of 0.36%, the accuracy for detecting DoS was 100% with a false-positive rate of 0.03%, the accuracy for detecting U2R was 99.67% with a false-positive rate of 0.10%, and the accuracy for detecting R2L was 99.58% with a false-positive rate of 6.71%. Overall, the proposed algorithm produced high detection rates for many forms of network intrusions and reduced the false-positive rate.
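The boosting reweighting described above can be sketched with an AdaBoost-style update; the exact weight formula used in the paper may differ, so treat this as an illustrative variant.

```python
import math

def update_weights(weights, correct, error_rate):
    """AdaBoost-style reweighting after one boosting round:
    misclassified examples are scaled up, correctly classified ones
    scaled down, and the weights are renormalized."""
    alpha = 0.5 * math.log((1 - error_rate) / error_rate)
    scaled = [w * math.exp(-alpha if ok else alpha)
              for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return [w / total for w in scaled]

# Four equally weighted examples; the last was misclassified
# (error rate 0.25), so its weight grows to half the total mass:
print(update_weights([0.25] * 4, [True, True, True, False], 0.25))
```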

In the same way, the authors in [34] showed that as the number and severity of network attacks have grown in recent years, IDSs have become an even more important part of network security. Optimizing IDS performance is difficult because there is a huge amount of security audit data, and intrusion behaviors are complicated and change over time. One of the strongest machine learning techniques for classifying network behavior is the SVM, and support vector machines are used in many intrusion detection systems. They are, nonetheless, computationally intensive; therefore, dimension reduction strategies are used to extract relevant features from a dataset to overcome this difficulty. This paper uses the information gain ratio (IGR) and the k-means technique with SVM to detect network intrusions. The study proposes a comprehensive approach for selecting the optimal set of NSL-KDD dataset attributes for characterizing normal traffic and distinguishing it from malicious traffic using SVM. The study employs a hybrid technique for feature selection that incorporates filter and wrapper models. This method ranks features based on an independent metric, the information gain ratio, and the prediction accuracy of the k-means classifier is used to find an optimal set of features that maximizes the SVM classifier’s detection accuracy. The proposed model’s framework consists of several phases. First is preprocessing of the NSL-KDD dataset: encoding symbolic features as numbers, keeping the data in the same range through scaling, and replacing each attack name with a number. Second, the IGR is used to rank features by their values. The IGR is a quantitative measure of feature importance in a dataset; it calculates the proportional entropy decrease due to each feature. Third, the best feature subset is generated using the k-means classifier.
The detection rate for each collection of features is calculated using the k-means classifier. Initially, only the top-ranked feature is included in the feature set. After each iteration, features are added to the selected list, in the order ranked by the IGR measure, as long as the accuracy of the selected subset does not decrease; the procedure stops when accuracy drops due to model overfitting. Finally, an intrusion detection system is built using an SVM. With the help of the optimal feature set selection approach, the SVM model is trained and built using the reduced NSL-KDD dataset. The study results show that the accuracy with 23 features of the NSL-KDD was 99.32%, whereas the accuracy with 30 features was 99.37%. Therefore, the study concluded that the SVM model’s efficiency and capability remained essentially unchanged even with a smaller feature set. Furthermore, reducing the number of features reduces training and testing time.
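The IGR ranking measure can be sketched as information gain divided by split information; a minimal version for discrete features:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature, labels):
    """Information gain of a discrete feature divided by its split
    information: the IGR measure used to rank features above."""
    n = len(labels)
    conditional = sum(
        (cnt / n) * entropy([l for f, l in zip(feature, labels) if f == v])
        for v, cnt in Counter(feature).items())
    split_info = entropy(feature)
    gain = entropy(labels) - conditional
    return gain / split_info if split_info else 0.0

print(gain_ratio(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 1.0: fully informative
print(gain_ratio(["a", "b", "a", "b"], [0, 0, 1, 1]))  # 0.0: uninformative
```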

The increasing complexity of modern network platforms has made conventional firewalls, encryption techniques, and virus scanners insufficient against new attacks, as demonstrated in [35]. The authors further highlighted that an intrusion detection system is also a type of computer and network security management that can find attacks and reduce the likelihood of their recurrence. However, misdetection and a lack of real-time response to attacks are significant issues in intrusion detection.

The study proposed a model that uses data mining techniques, namely the k-means clustering algorithm and the RBF kernel function of the SVM, to find network attacks. The proposed model reduces the number of attributes associated with each data point, making the model more accurate and easier to understand. In particular, the study used feature selection to narrow down the number of features by choosing a subset of the original attributes; this procedure is required because using all available features would be computationally infeasible. The study used the KDD CUP 99 dataset for the training and testing datasets. The k-means clustering algorithm is applied to the training dataset in the first stage; as a result, the anomalous data form clusters for DoS, Probe, R2L, and U2R attacks. In the second stage, the RBF kernel function of the SVM is employed as a classifier to determine whether an intrusion has occurred. The KDD CUP 99 dataset was used to evaluate the efficiency of the proposed model: first, small training and testing datasets are built for each attack type, and these datasets are then used for testing. The results are reported as the detection rate (DR) and accuracy (ACC). The study results show that, with all attributes used, for a DoS attack the proposed model (KMSVM) has an accuracy of 92.86 percent, whereas KM has an accuracy of 86.67 percent and SVM has an accuracy of 40 percent; the KMSVM thus outperforms the k-means and SVM classifiers. With the reduced attribute sets, for a DoS attack the proposed KMSVM has an accuracy of 100 percent, KM also reaches 100 percent, and SVM reaches 57.89 percent. Similarly, for the detection rate, the KMSVM outperforms KM and SVM. Thus, the study concluded that the accuracy and detection rate for DoS, Probe, U2R, and R2L increase when the proposed model is used.
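The first-stage k-means step can be sketched in a few lines (1-D points and a naive first-k initialization, purely for illustration):

```python
def kmeans(points, k, iters=20):
    """Plain k-means on 1-D points: assign each point to its nearest
    centre, then recompute each centre as the mean of its cluster."""
    centres = points[:k]  # naive initialization for the sketch
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

print(kmeans([1.0, 1.2, 0.8, 10.0, 10.4, 9.6], k=2))  # centres near 1.0 and 10.0
```

In the study's pipeline, each resulting cluster would correspond to one traffic category before the SVM stage refines the decision.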

In addition, Farid et al. [36] noted that the KDD CUP 99 dataset has long been used to evaluate network intrusion detection systems. However, a significant flaw of the KDD CUP 99 dataset is that it does not reflect the current network environment or recent attack trends. The study therefore employs a newer dataset, Kyoto 2006+, based on three years of real traffic data collected from various types of honeypots. Every instance in the Kyoto 2006+ dataset is labeled as normal, attack, or unknown attack, and 134,665 network instances were used for training and testing. To categorize network packets for NIDS, the study used the decision tree (J48) algorithm. A decision tree is made up of decision nodes and leaf nodes, with each decision node specifying a test over one of the attributes and each leaf reflecting a class value. Different decision tree algorithms exist, including J48, ID3, FT, and BF tree; the study used the J48 algorithm because of its higher accuracy. The confusion matrix, which consists of four evaluation factors (true positives, true negatives, false positives, and false negatives), was used to measure the proposed model’s performance and to calculate its accuracy. The test was run on an Intel Core i5 machine with 4 GB of RAM and Ubuntu as the operating system. Furthermore, the study used 10-fold cross-validation for testing and training. Preprocessing was applied to discretize the data, and WEKA 3.6.10 was used to visualize the output of the decision tree classifier. The study results show 97.23% correctly classified instances versus 2.67% incorrectly classified instances, with a high true-positive rate of 99% for normal and attack packets. In addition, the simulation results show that the model can also detect unknown attacks.
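The 10-fold cross-validation used for testing and training amounts to rotating index splits; a small sketch (shown with 5 folds on 10 samples for brevity):

```python
def kfold_indices(n, k=10):
    """Index splits for k-fold cross-validation: each fold of sample
    indices serves once as the test set while the rest form the
    training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [(sorted(set(range(n)) - set(fold)), fold) for fold in folds]

splits = kfold_indices(10, k=5)
print(splits[0])  # ([1, 2, 3, 4, 6, 7, 8, 9], [0, 5])
```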

Furthermore, a study by [37] showed that the internet is open to a wide range of attacks and security problems because of the fast growth of its applications and services. Intrusion detection systems have long been used to protect networks. However, traditional detection methods may raise false alarms or may not be intelligent enough to find new threats, making it hard to meet security requirements. Thus, artificial intelligence (AI) has been considered for intelligent detection to overcome the limitations of traditional intrusion detection. AI-based techniques can extract deep knowledge or patterns from historical data and make intelligent decisions to predict network intrusions. Therefore, the study proposed using AI algorithms to develop an advanced intrusion detection system. The study applies a combination of improved AI algorithms to accomplish feature selection and flow classification. For feature selection, an enhanced Bat algorithm is used: it separates the entire swarm into subgroups using the k-means method, allowing each subgroup to learn more effectively within and among distinct populations. The study then optimizes a random forest for flow classification by updating the weight of each sample while repeatedly building each tree and making the final choice through a weighted voting mechanism. The Bat algorithm is a popular global optimization technique that can generate excellent results but is prone to becoming locked in local minima. To address this limitation, the study improved the Bat algorithm in two ways: swarm division and binary differential mutation. Swarm division splits the whole swarm into subgroups using the k-means algorithm. In binary differential mutation, after each iteration’s update of a bat, the differential evolution mutation mechanism is applied to the BA algorithm to increase population diversity and the bats’ ability to escape local optima.
Similarly, in a random forest, an unbalanced dataset decreases performance on minority data classes: because samples are chosen randomly with replacement when trees are built, the minority class with fewer examples is less likely to be selected and learned from. To address this limitation, the study improves the random forest algorithm in three ways: weight initialization, weight updating, and weighted voting. For evaluation, the KDD CUP 99 dataset was used for training and testing after a downsampling process. For performance analysis, the study assesses the proposed methods in three ways. First, the suggested BA method is evaluated for optimality and convergence in feature selection. Second, the upgraded RF algorithm’s detection capability is evaluated for flow categorization on the entire dataset as well as on distinct classes of flows. Finally, the two algorithms are combined in the proposed two-stage intrusion detection system and compared with existing methods. The study results show that the accuracy of the proposed model was 96.42% with a false-positive rate of 0.98%, whereas the method of [36] had an accuracy of 95.21% and a false-positive rate of 1.57%. The results show that the proposed model improves accuracy and reduces overhead; compared to previous methods, the system improves detection ability without consuming much time.
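The weighted voting mechanism that replaces the forest's plain majority vote can be sketched as:

```python
from collections import defaultdict

def weighted_vote(predictions, tree_weights):
    """Weighted-voting decision of the improved random forest: each
    tree's vote counts in proportion to its weight instead of one
    vote per tree."""
    scores = defaultdict(float)
    for label, weight in zip(predictions, tree_weights):
        scores[label] += weight
    return max(scores, key=scores.get)

# A single well-weighted tree can outvote two weaker ones:
print(weighted_vote(["attack", "normal", "normal"], [0.9, 0.3, 0.3]))  # attack
```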

Emphasizing the rapid evolution of new attack mechanisms, the authors in [38] showed that hackers try many different ways to get into a network and damage an organization’s data. Their study proposed a modified C4.5 approach to constructing the decision tree. When building a decision tree, feature selection and split value are crucial, and the algorithm in this research is intended to address these two issues by proposing novel approaches to selecting the split value and selecting features. Information gain is used to choose the most relevant features, and the split value is chosen so that the classifier is unbiased towards the most frequent values. The standard C4.5 algorithm sorts all the values of an attribute before selecting the split value: the gain ratio of all values is computed, taking the lower value as the threshold, and the split value for a node is the sorted value that delivers the highest gain ratio. In the proposed approach, by contrast, there is no need to sort the attribute values; the split value is calculated by averaging the values in the domain of a specific attribute at each node. This gives the same weight to all values in the domain, so the classifier does not favor the most common ones. This avoids the decision tree’s bias, simplifies the calculation, and makes the model easier to understand. The DTS method was run on a 64-bit machine under Windows 8.1 with a 2.20 GHz CPU and 8 GB of RAM, using the WEKA and MATLAB tools. The proposed model is compared to current algorithms, including C4.5, AD tree, and CART. The experiments used the NSL-KDD dataset to test the efficiency of the proposed model. The NSL-KDD dataset is a newer version of the KDD 99 dataset; the study used it because the KDD 99 dataset contains many redundant records in its training and testing sets.
The proposed algorithm’s performance is compared to that of several other approaches. For performance analysis, the study used several parameters, such as accuracy, false-positive rate, true-positive rate, and the time the classifier takes to build the model. The study results show that the decision tree-based intrusion detection system achieved 75.7% accuracy, a 0.6 false-positive rate, and a 0.95 true-positive rate, with a model-building time of 4.42 seconds. According to the findings, the proposed model is more efficient at discovering attacks in the network with fewer features and requires less time to develop the model.
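The average-based split rule described above can be sketched as follows (toy rows, with the numeric attribute in column 0):

```python
def split_node(rows, attr_index):
    """Split-value rule proposed in [38]: the threshold is the mean of
    the attribute's values at the node, so no sorting or per-candidate
    gain-ratio scan is needed, and frequent values get no extra weight."""
    values = [row[attr_index] for row in rows]
    threshold = sum(values) / len(values)
    left = [row for row in rows if row[attr_index] <= threshold]
    right = [row for row in rows if row[attr_index] > threshold]
    return threshold, left, right

t, left, right = split_node([[1.0, "n"], [2.0, "n"], [3.0, "a"], [4.0, "a"]], 0)
print(t, len(left), len(right))  # 2.5 2 2
```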

Developing a robust, low-computational-complexity IDS is the challenge tackled in [39]. The study proposed a hybrid data mining method combining k-means and k-nearest neighbors to reduce the system’s time complexity and increase accuracy. k-means clustering is one of the simplest strategies for solving clustering problems. The fundamental purpose of using k-means clustering here is to divide and organize the data into normal and attack instances. It works on a dataset D of n objects and divides them into k clusters; the procedure begins by choosing k objects from D, and each cluster center is calculated as the mean value of the objects in that cluster. The KNN technique, in contrast, is a nonparametric classification and regression approach whose classification output is a class membership. To appropriately distinguish normal traffic from intrusions in the data collection, the study used the ten percent subset of the KDD CUP 99 dataset, which contains four types of intrusions (DoS, Probe, U2R, and R2L) as well as normal samples, for a total of 494,021 samples. In a 10-fold cross-validation investigation, the k-means and k-nearest neighbor-based strategy correctly classified 493,809 records and misclassified 212 records, with a training time of 0.18 seconds. Furthermore, in a 66-34 percent validation split, the approach correctly classified 167,885 records and misclassified 82 records, with a training time of 0.2 seconds. The experiment was conducted on a PC with a 64-bit Windows 7 operating system, an Intel Core i3-4010U CPU running at 1.70 GHz, and 4 GB of RAM. The study results demonstrated that the proposed model has good accuracy and an acceptable training time.
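The KNN half of the hybrid can be sketched in a few lines (squared Euclidean distance and toy 2-D samples, purely illustrative):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority label among its k nearest training
    points, using squared Euclidean distance on numeric feature vectors."""
    def sq_dist(point):
        return sum((a - b) ** 2 for a, b in zip(point, query))
    nearest = sorted(train, key=lambda item: sq_dist(item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0.0, 0.0), "normal"), ((0.1, 0.2), "normal"),
         ((5.0, 5.0), "attack"), ((5.1, 4.9), "attack"), ((4.8, 5.2), "attack")]
print(knn_predict(train, (5.0, 5.1)))  # attack
```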

Conversely, the authors in [40] focused on detection accuracy rather than computational complexity. In their study, they developed an effective IDS that can distinguish between normal traffic and four types of attacks. Because the training dataset used is enormous, processing it takes a long time and consumes a lot of memory. Therefore, the study proposed a feature selection technique that minimizes the number of characteristics. The feature selection method is a GA approach with a proposed fitness function. Furthermore, the study proposed a best feature set selection (BFSS) algorithm that performs a fitness test on all feature sets obtained through multiple GA runs to determine which one is best, further enhancing the feature subset selection technique. The result is a set of features used to create a classifier based on logistic regression. Using the proposed BFSS method, the study selected the best feature set from GA’s 100 feature sets. It chooses a specific set of features from the GA results; then, for each distinct observation in that feature set, it finds the class with the highest frequency, and the score of that observation is the frequency obtained in the preceding step. The score of the feature set is the sum over all distinct observations, and the feature set with the highest score is considered the best. Using this strategy, the study compares all the feature sets generated by the GA runs and chooses the best one. For the experiment, the study used NSL-KDD, which was generated from the original KDD dataset. The NSL-KDD dataset contains four attack types: DoS, R2L, U2R, and Probe. Before feeding the NSL-KDD dataset into the model, the study applies data preprocessing, data normalization, and feature selection; it then uses logistic regression to classify network traffic into normal and attack behaviors.
The study results show that, in the cross-validation comparison between the GA-BFSS feature set and the three GA feature sets, GA-BFSS reached 93.26%, whereas GA1, GA2, and GA3 reached 92.12%, 92.64%, and 93.00%, respectively. In the classification comparison, GA-BFSS reached 74.94%, whereas GA1, GA2, and GA3 reached 71.54%, 71.57%, and 72.77%, respectively. Therefore, the study concluded that the proposed model yields a better feature set than the traditional GA method. In addition, the logistic regression algorithm can effectively classify multiclass problems.
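The BFSS scoring step described above can be sketched as follows (toy binary rows; higher totals indicate a feature set that separates the classes better):

```python
from collections import Counter, defaultdict

def bfss_score(rows, labels, feature_set):
    """BFSS scoring: project every sample onto the candidate feature
    set, and for each distinct observation add the frequency of its
    most common class; the feature set with the highest total wins."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[tuple(row[i] for i in feature_set)].append(label)
    return sum(Counter(ls).most_common(1)[0][1] for ls in groups.values())

rows, labels = [[1, 0], [1, 1], [0, 0], [0, 1]], ["a", "a", "b", "b"]
print(bfss_score(rows, labels, (0,)))  # 4: feature 0 separates the classes
print(bfss_score(rows, labels, (1,)))  # 2: feature 1 does not
```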

Random forest and SVM-based algorithms are investigated in [41] in terms of their intrusion detection accuracy. Selecting and reducing the most important features from the dataset is essential to enhancing the accuracy and speed of the system; thus, the study used recursive feature elimination (RFE) to improve the system’s performance. The proposed system consists of data collection, preprocessing, feature selection, and classification. For data collection, the NSL-KDD dataset was used for training and testing. Data preprocessing converts the data into a more consistent format by normalizing it, finding the missing features, splitting the dataset into a set of attacks, and identifying the feature categories. Feature selection methods are employed to identify the most influential factors and remove redundant data that does not affect the predictive model’s accuracy. Feature selection includes the filter, wrapper, and embedded methods: filter methods calculate a score for each feature, wrapper methods prepare features in different combinations, and the embedded method evaluates each feature during model training to increase the model’s accuracy. Once the important features were selected, the study used the random forest and support vector machine classification algorithms. The random forest algorithm is a supervised method for classification and regression applications: it creates several decision trees from sets of randomly chosen attributes, then tallies the votes from these trees, with the most popular class taken as the final prediction. The SVM, conversely, is a classifier that separates objects using a separating hyperplane, dividing the training data into categories by a clear hyperplane gap.
It finds the optimal hyperplane in high dimensions with the greatest distance from the nearest points, dividing the training data into categories. In addition, the study compares the performance of both models before and after feature selection using a confusion matrix. The study results show that the accuracy of the SVM classifier using all features was 90% for DoS, 89% for Probe, 79% for R2L, and 100% for U2R, whereas the accuracy of the RF classifier using all features was 85% for DoS, 88% for Probe, 78% for R2L, and 100% for U2R. The study concluded that the accuracy differs for each model (random forest and SVM) and each attack type (DoS, Probe, R2L, and U2R), and that SVM outperforms random forest in terms of performance and accuracy.
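The RFE loop used for feature selection can be sketched as follows, assuming per-feature importance scores are already available (in the study they would come from the fitted classifier at each step; the feature names are illustrative NSL-KDD-style names):

```python
def rfe(features, importance, n_keep):
    """Recursive feature elimination sketch: repeatedly drop the
    least important feature until `n_keep` remain. `importance` maps
    a feature to a precomputed score (assumed here for illustration)."""
    selected = list(features)
    while len(selected) > n_keep:
        selected.remove(min(selected, key=importance))
    return selected

scores = {"duration": 0.9, "flag": 0.1, "src_bytes": 0.5, "count": 0.3}
print(rfe(scores, scores.get, n_keep=2))  # ['duration', 'src_bytes']
```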

4.3. Swarm Intelligence and Evolutionary Computation (SI/EC) Methods

Swarm intelligence and evolutionary algorithms are, in general, cost-minimizing (or profit-maximizing) optimization algorithms that iteratively evolve candidate solutions toward optimality. A large body of IDS research has employed evolutionary algorithms either as classifiers or as optimal feature selection algorithms. The most common methods explored in the literature are listed in Table 5, along with the dataset used in each respective study.

An optimal feature selection evolutionary algorithm is proposed by the authors in [41]. They highlighted that choosing the right features for intrusion detection is crucial. Using the KDD CUP 99 dataset, the study evaluated the performance of three feature selection algorithms: genetic algorithms (GAs), particle swarm optimization (PSO), and differential evolution (DE), each used to determine the optimal number of features for improving overall accuracy. The experiment consists of two primary phases: in the first, the study applies the proposed evolutionary algorithms to select the optimal feature subset for the IDS; in the second, it validates the findings using neural network (NN) and support vector machine (SVM) classifiers. Using the GA, PSO, and DE feature subsets, the study ran three validation experiments, each carried out five times per dataset with randomly chosen training and testing data. The findings demonstrate that DE regularly outperforms GAs and PSO on feature selection problems, both in classification accuracy and in feature count.
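A thresholded (binary) differential evolution for feature selection, in the spirit of the DE variant above, can be sketched as follows. The fitness function, penalty weight, and synthetic data are illustrative assumptions rather than the authors' setup: real vectors are evolved and thresholded into feature masks, and fitness trades off fit quality against feature count.

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(mask, X, y):
    """Toy fitness: reward fit quality, penalize feature count.
    Stands in for the classifier accuracy used in the paper."""
    if not mask.any():
        return -np.inf
    Xa = X[:, mask]
    w, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    sse = np.sum((Xa @ w - y) ** 2)
    return -sse - 0.5 * mask.sum()

def de_select(X, y, pop=20, gens=40, F=0.8, CR=0.9):
    """Differential evolution over real vectors; feature mask = (v > 0.5)."""
    d = X.shape[1]
    P = rng.random((pop, d))
    scores = [fitness(P[i] > 0.5, X, y) for i in range(pop)]
    for _ in range(gens):
        for i in range(pop):
            a, b, c = P[rng.choice(pop, 3, replace=False)]
            # DE mutation + crossover: mix the target with a + F * (b - c).
            trial = np.where(rng.random(d) < CR, a + F * (b - c), P[i])
            s = fitness(trial > 0.5, X, y)
            if s >= scores[i]:          # greedy selection keeps the better one
                P[i], scores[i] = trial, s
    best = P[int(np.argmax(scores))]
    return np.flatnonzero(best > 0.5)

X = rng.normal(size=(150, 8))
y = 2.0 * X[:, 1] + 1.5 * X[:, 5] + 0.1 * rng.normal(size=150)
selected = de_select(X, y)
print(selected)   # the two informative features should be retained
```

The greedy replacement step is what gives DE its steady convergence: a candidate subset only survives if it scores at least as well as the one it replaces.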

Using a genetic algorithm and particle swarm optimization, [42] also proposed a new automated method to extract features from the KDD CUP 99 dataset. The goal of their research is to detect DoS attacks with high accuracy and few false positives. In the proposed IDS model, DoS attacks are detected by training PSO classifiers with genetic-PSO wrapper-based feature selection, which outperforms conventional intrusion feature selection. The architecture comprises two stages: training and testing. The KDD CUP 99 dataset was used for the training phase, in which DoS patterns were discovered through data preprocessing, feature selection with a genetic algorithm, and classification with PSO. In the testing stage, the gathered traffic was examined as in training, and a decision was made once a pattern was found. New patterns were discovered by examining traffic behavior; if they were inconsistent with genuine traffic, the pattern was recorded and the database updated. The proposed work was implemented in MATLAB. The proposed model was compared to various well-known techniques, including fuzzy clustering, showing excellent reliability and good interpretability. The results demonstrate that the suggested approach can classify intrusion events with high accuracy, adequate interpretability of the derived rules, and fewer false alarms. In addition, the study demonstrated that PSO outperforms the fuzzy clustering technique.
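The canonical PSO update driving the classifier above can be sketched on a toy objective. This is an illustrative reconstruction, not the authors' code: the sphere function stands in for the classifier's training error, and all hyperparameters (inertia `w`, cognitive `c1`, social `c2`) are assumed values.

```python
import numpy as np

rng = np.random.default_rng(2)

def pso_minimize(f, dim, pop=30, iters=100, w=0.7, c1=1.5, c2=1.5):
    """Canonical PSO: each particle is pulled toward its personal best
    and the swarm's global best via the velocity update rule."""
    x = rng.uniform(-5, 5, (pop, dim))
    v = np.zeros((pop, dim))
    pbest, pbest_val = x.copy(), np.array([f(p) for p in x])
    g = pbest[int(np.argmin(pbest_val))].copy()
    for _ in range(iters):
        r1, r2 = rng.random((pop, dim)), rng.random((pop, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        vals = np.array([f(p) for p in x])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = x[better], vals[better]
        g = pbest[int(np.argmin(pbest_val))].copy()
    return g, f(g)

# Toy objective standing in for a classifier's training error.
sphere = lambda p: float(np.sum(p ** 2))
best, val = pso_minimize(sphere, dim=4)
print(best, val)   # converges near the origin
```

In the wrapper setting described above, `f` would instead train and score the PSO classifier on the GA-selected feature subset.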

In addition, the ensemble of feature selection (EFS) algorithm and teaching learning-based optimization (TLBO) methodologies are combined in [43] to create a novel hybrid technique, EFS-TLBO. The EFS technique is used in the proposed EFS-TLBO method to rank the features when selecting the best subset of relevant information, and TLBO is used to isolate the most crucial characteristics from the generated datasets. The TLBO algorithm uses an extreme learning machine (ELM) to select the most useful attributes and improve classification accuracy. The performance of the proposed model is evaluated on a benchmark dataset. The experimental results show that the proposed model achieves a higher predictive accuracy and detection rate and a lower false-positive rate than other well-known strategies from the literature while requiring fewer relevant features. The practical outcomes demonstrate that the recommended technique outperformed the other techniques with an accuracy of 99.95% on the NSL-KDD dataset.

Furthermore, Dubey et al. demonstrated that swarm intelligence techniques can improve detection outcomes through better optimization, using common performance evaluation parameters to measure alarm signals and count their frequency for attack alertness [44]. The study employs feature-reduction approaches to enhance the outcomes. They combined PSO, GA, and radial basis function (RBF) neural network approaches and compared detection rates on the reduced dataset, using experimental results on the KDD CUP dataset to test the performance of the various methods against existing algorithms. A key subset of feature values is chosen from the dataset to improve classification accuracy and reduce computational complexity. The results showed precision, recall, and accuracy of 84.11%, 84.21%, and 84.43% for RBF; 85.68%, 85.32%, and 87.47% for GA; and 86.13%, 85.78%, and 88.13% for PSO, respectively.

The computational efficiency of bioinspired swarm algorithms is discussed in [45]. The authors demonstrated the features of different swarm algorithms both as stand-alone IDSs and as feature selection and parameter optimization for ML- and DL-based IDSs. They showed that using swarm algorithms to preprocess datasets and select the optimal set of features as inputs to ML and DL IDSs resulted in significant improvements in computational and space complexity. Furthermore, they showed that the detection rates and convergence of ML and DL IDSs improve significantly when swarm algorithms are integrated into feature preprocessing. Interpretability and parallel computation are the main factors that make swarm and evolutionary algorithms competitive with their ML and DL counterparts. These features are further emphasized in [46], where the authors surveyed the types of evolutionary algorithms explored in the literature and highlighted their advantages and disadvantages. Though SWEVO systems are deemed effective in improving IDS performance along different dimensions, how accurately datasets reflect real network complexity remains a genuine challenge for SWEVO IDSs. Furthermore, the continuous increase in modern network complexity is another serious challenge that must be continuously addressed when exploring the efficiency of different IDS techniques. The challenges of ensuring the comprehensiveness of datasets are further discussed in [47], where the three main datasets commonly used in the literature are evaluated and their features highlighted. The authors showed that KDD and UNM lack important features emerging in modern networks, which hinders these datasets' ability to represent the challenges posed by emerging network technologies.
Moreover, they showed that the ADFA-LD dataset covers the shortcomings of the KDD and UNM datasets by providing features that reflect modern network complexity. The proposed dataset yielded more accurate detection rates and the ability to detect new threats posed by new networks.

4.4. Deep Learning Methods

Deep learning models are a category of machine learning systems that utilize multiple layers of neural networks to learn complicated patterns. The main distinction between deep learning and machine learning is the ability of deep learning systems to craft the features suitable for classifying the data with remarkably high accuracy. Deep learning systems consist of multiple layers, hence the name "deep," that craft the features, followed by conventional fully connected layers for decision-making. This organization of layers is the main driving power of deep learning systems. In conventional machine learning systems, by contrast, feature extraction is usually accomplished through preprocessing of the data and requires complicated feature engineering to be feasible. However, this representational power of deep learning systems comes at the expense of computational complexity. The multilayer structure of deep learning and the richness of layers that can be cascaded have resulted in a variety of topologies developed to accomplish different complicated tasks, as listed in Table 6 for intrusion detection.

Sistla et al. [49] showed that making an effective intrusion detection system that can detect hard-to-predict attacks is not an easy task. Their study showed that deep learning techniques could be used to build an intrusion detection system that is both flexible and effective. Self-taught learning (STL) has been applied to the NSL-KDD dataset for detecting network intrusions. STL is a deep learning technique that divides the classification process into two steps: first, a feature representation is learned from a large collection of unlabeled data; then, the learned features are applied to labeled data and used for classification. The study employs sparse autoencoder-based feature learning in the proposed model since it is reasonably simple to construct and performs well. A sparse autoencoder is a neural network with three layers, input, hidden, and output, that optimizes its weights using the backpropagation algorithm. Moreover, the NSL-KDD dataset has been used since it overcomes the KDD's limitations, such as eliminating redundancy. The dataset was preprocessed before applying self-taught learning: using 1-to-N (one-hot) encoding, nominal properties are turned into discrete attributes. Furthermore, the dataset has one attribute whose value is always 0 for all records in the training and test sets; this attribute was removed. After that, max–min normalization was applied to keep the data in the same range as the sigmoid function, between 0 and 1. After preprocessing, the study uses the unlabeled NSL-KDD training data for feature learning with the sparse autoencoder. In the second step, softmax regression is applied to the newly learned feature representation of the training data in order to classify it. For performance evaluation, the study used the accuracy, precision, recall, and F-measure metrics.
The study compared the proposed STL model with softmax regression (SMR). The results show that the F-measure of STL is 98.84%, whereas that of SMR is 96.79%. In addition, STL had an accuracy of 88.39% for the 2-class classification versus 78.06% for SMR, and 79.10% versus 75.23% for the 5-class classification. STL and SMR achieved precision values of 85.44% and 96.56%, respectively; on the other hand, STL outperformed SMR in terms of recall, with values of 95.95% and 63.73%, respectively. Therefore, the study concluded that the proposed model outperformed SMR and many of the previous work results.
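The sparse autoencoder at the heart of the STL stage can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: it computes one forward pass and the loss that training would minimize (reconstruction error plus a KL-divergence sparsity penalty), and the layer sizes, `rho`, and `beta` are assumed values.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def sparse_ae_loss(X, W1, b1, W2, b2, rho=0.05, beta=3.0):
    """One forward pass of a 3-layer sparse autoencoder:
    reconstruction error plus a KL-divergence sparsity penalty
    that pushes the average hidden activation toward rho."""
    H = sigmoid(X @ W1 + b1)          # hidden features (what STL reuses)
    Xhat = sigmoid(H @ W2 + b2)       # reconstruction of the input
    recon = 0.5 * np.mean(np.sum((Xhat - X) ** 2, axis=1))
    rho_hat = H.mean(axis=0)          # average activation per hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return recon + beta * kl, H

n_in, n_hid = 20, 8
X = rng.random((64, n_in))
W1 = 0.1 * rng.normal(size=(n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = 0.1 * rng.normal(size=(n_hid, n_in)); b2 = np.zeros(n_in)
loss, features = sparse_ae_loss(X, W1, b1, W2, b2)
print(loss, features.shape)   # the (64, 8) features would feed the softmax stage
```

In the full STL pipeline, the hidden activations `H` learned from unlabeled traffic become the inputs to the softmax regression classifier.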

Moreover, Kim et al. [50] emphasized that rapid technological change has made many information devices very hard to secure. As a result, attackers have ample opportunity to expose sensitive information, since users share so many conversations through these technologies. In response to ever-evolving network threats, an intrusion detection system using deep neural networks (DNN) was proposed and evaluated using the KDD CUP 99 dataset. The data were first preprocessed for input to the DNN model through data normalization and transformation. To develop a learning model, the DNN method was applied to the preprocessed data, and the whole KDD CUP 99 dataset was utilized for verification. The topology of the proposed model consists of four hidden layers with 100 hidden units each. The activation function of those hidden layers is the ReLU function; this nonlinear activation function can improve model performance by expressing a complex classification better than a linear activation function. The adaptive moment (Adam) optimizer, a stochastic optimization method for DNN learning, was also used in the study. The PC utilized for training and testing has an Intel i5 3.2 GHz processor, 16 GB of RAM, and an NVIDIA GTX 1070 graphics card. For evaluation, the detection rate, accuracy, and false alarm rate of the DNN model were assessed. The study results demonstrate that the accuracy and detection rate are extremely high, averaging 99%. Furthermore, the false alarm rate was 0.08 percent, implying that the chances of incorrectly identifying normal data as intrusions are quite small.
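The topology described above (four ReLU hidden layers of 100 units and a softmax output) can be sketched as a forward pass. This is an illustrative reconstruction on random inputs, not the authors' code; the Adam training loop is omitted, and the 41-feature input with a 5-class output follows the usual KDD CUP 99 setup.

```python
import numpy as np

rng = np.random.default_rng(4)
relu = lambda z: np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=1, keepdims=True)

def dnn_forward(x, params):
    """Forward pass matching the described topology: four ReLU hidden
    layers of 100 units each, then a softmax output layer."""
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)
    W, b = params[-1]
    return softmax(h @ W + b)

n_in, n_hid, n_out = 41, 100, 5      # 41 KDD features, 5 traffic classes
sizes = [n_in] + [n_hid] * 4 + [n_out]
params = [(0.05 * rng.normal(size=(a, b)), np.zeros(b))
          for a, b in zip(sizes[:-1], sizes[1:])]

probs = dnn_forward(rng.random((10, n_in)), params)
print(probs.shape)        # (10, 5): class probabilities per sample
print(probs.sum(axis=1))  # each row sums to 1
```

Training would repeatedly run this forward pass, compute cross-entropy against the labels, and let Adam update `params` from the gradients.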

Moreover, an innovative topology developed by Nguyen et al. [51] uses a restricted Boltzmann machine (RBM) and a deep belief network to build a deep learning system for anomaly detection. To perform unsupervised feature reduction, a one-hidden-layer RBM is used, and the RBM's weights are transmitted to another RBM, resulting in a deep belief network. The pretrained weights are then fed through a fine-tuning layer that includes a logistic regression (LR) classifier with multiclass softmax. The deep belief network (DBN) is created by stacking multiple RBMs. The RBM is the pretraining layer that receives full-featured input and performs parallel sampling into the hidden units. Contrastive divergence is used to train the first RBM of the first layer of the deep belief network. After that, the RBM modifies the weights between iterations to allow the network to learn an important representation of the data. The output of the first-layer RBM is then fed as input to the second layer, a directed RBM. The network performs classification using a variation of the wake-sleep algorithm. To identify more than one type of intrusion in the training and testing data, the RBM-based deep belief network feeds the produced data to a layer comprising logistic regression with a softmax classifier. The study used the DARPA KDD CUP 99 datasets to evaluate the deep learning architecture, developed in C++ in Microsoft Visual Studio 2013. The findings show that the logistic regression classifier outperforms backpropagation. In addition, employing the deep learning feature reduction property can achieve a high accuracy rate that outperforms earlier deep learning approaches.
The experimental results demonstrated that a single RBM classified attacks with 92% accuracy, whereas tuning the deep belief network's number of hidden layers and training rate through the user interface showed that two hidden layers of 80 units each produced the highest accuracy of 95 percent, a low false-negative rate of 2.48 percent, and a high true-positive rate of 97.5 percent.
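A single contrastive-divergence update for a binary RBM, the building block of the DBN described above, can be sketched as follows. The layer sizes, learning rate, and toy binary data are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

rng = np.random.default_rng(5)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, bv, bh, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM:
    sample hidden units from the data, reconstruct the visible units,
    and nudge the weights toward the data statistics."""
    ph0 = sigmoid(v0 @ W + bh)                     # P(h = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + bv)                   # reconstruction
    ph1 = sigmoid(pv1 @ W + bh)
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n       # positive - negative phase
    bv += lr * (v0 - pv1).mean(axis=0)
    bh += lr * (ph0 - ph1).mean(axis=0)
    return float(np.mean((v0 - pv1) ** 2))         # reconstruction error

n_vis, n_hid = 12, 6
W = 0.01 * rng.normal(size=(n_vis, n_hid))
bv, bh = np.zeros(n_vis), np.zeros(n_hid)
V = (rng.random((100, n_vis)) < 0.3).astype(float)   # toy binary "traffic" data
errs = [cd1_step(V, W, bv, bh) for _ in range(200)]
print(errs[0], errs[-1])   # error should drop as the RBM learns
```

Stacking comes afterward: the hidden activations of this trained RBM become the visible data for the next RBM in the DBN.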

Focusing on the DoS attack and its complexity, Wang et al. demonstrated that detecting attacks is extremely difficult due to the vast number and variety of malicious traffic, and some detection systems also face challenges regarding detection accuracy and execution time [52]. In addition, DoS is a significant attack that violates the availability of services, systems, and connections. To identify DoS attacks, they proposed IDS-CNN, an IDS platform based on convolutional neural networks (CNNs). Data collection, preprocessing, a DoS detection model using CNN, and decision-making are the four main layers of the proposed system architecture. The data collection layer takes real-time network traffic from existing data such as KDD CUP 99 or from the data collector. The preprocessing layer takes the data from the first layer and applies normalization to keep them in the same range. After normalizing the data, a matrix conversion module turns it into a matrix that serves as the input to the CNN model. The next layer is the detection model, which employs a trained CNN to classify the input data into five categories: normal traffic and four types of attacks. The decision-making layer is the final layer of the proposed model; following the classification result, a decision is made on whether to block or allow the traffic. The topology of the proposed model consists of two convolutional layers and three fully connected layers. Because a downsizing operation is unnecessary for the very small samples in the dataset used, pooling layers are not employed in the proposed CNN architecture. The first convolutional layer (Conv1) uses 64 filters, and the second convolutional layer (Conv2) uses 128 filters. Both convolutional layers employ the same parameters: zero-padding, the same stride, and the ReLU activation function.
There are three fully connected layers. Conv2's feature maps are utilized as inputs for the first fully connected layer, and the second fully connected layer uses the same parameters as the first: hidden-unit count, bias, and the ReLU activation function. The final fully connected layer computes the class scores. For performance evaluation, the KDD CUP 99 dataset has been used; it consists of four attack categories: DoS, R2L, U2R, and Probe. The TensorFlow library and the Python programming language were used to implement the CNN model in this experiment. The CNN model was tested in two different scenarios. In the first experiment, the number of convolutional layers was increased from one to three. The number of layers that produced the best results was then preserved, and the number of iterations was varied over 1000, 10,000, and 15,000. The accuracy of the CNN was then compared to other machine learning approaches such as SVM, naïve Bayes, and KNN using the configuration that produced the best results. After several model variations, the study found that the optimum configuration, two convolutional layers and 15,000 training iterations, offered the best performance with an accuracy of 99.89%. The comparison shows that CNN has the maximum detection accuracy of 99.87 percent, while naïve Bayes achieves the lowest detection accuracy of 54.42 percent. However, naïve Bayes has the quickest execution time, taking only 98 seconds to complete, while CNN takes 600 seconds and KNN takes 225,600 seconds. Therefore, the study concluded that the proposed IDS-CNN addresses machine learning techniques' limitations of long detection times and low detection rates.
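A single convolutional layer of the kind used in IDS-CNN can be sketched as follows. The 3×3 kernel size and 8×8 input "image" are assumptions for illustration (the paper's exact sizes are not reproduced here), while the 64 filters, zero-padding, and ReLU activation follow the description above.

```python
import numpy as np

rng = np.random.default_rng(6)

def conv2d(x, kernels, stride=1, pad=1):
    """Naive 2D convolution with zero-padding and ReLU,
    mirroring a single convolutional layer of an IDS-CNN."""
    x = np.pad(x, ((pad, pad), (pad, pad)))
    kh, kw = kernels.shape[1:]
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((kernels.shape[0], oh, ow))
    for f, k in enumerate(kernels):
        for i in range(oh):
            for j in range(ow):
                patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
                out[f, i, j] = np.sum(patch * k)   # dot product of patch and kernel
    return np.maximum(out, 0.0)                    # ReLU, as in both conv layers

# A traffic record reshaped into a small "image" (size is illustrative).
record = rng.random((8, 8))
conv1 = conv2d(record, 0.1 * rng.normal(size=(64, 3, 3)))   # 64 filters
print(conv1.shape)   # (64, 8, 8): zero-padding preserves spatial size
```

Stacking a second such layer with 128 filters over these 64 feature maps, then flattening into the fully connected layers, reproduces the described topology in outline.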

Alrawashdeh and Purdy [53] highlighted that security breaches frequently happen unexpectedly. To avoid high false alarm rates and poor detection accuracy against unidentified threats, there is a need to create a flexible and effective neural IDS (NIDS). To develop a powerful and adaptable NIDS, the study proposed using a deep learning approach to take advantage of feature learning and classification. The proposed model consists of three fully connected layers, two pooling layers, and two convolutional layers. The convolutional layers use two kernel sizes, and both pooling layers share the same pooling size. Additionally, the three fully connected layers use 50, 20, and 2 neurons, respectively. Except for the final layer, which utilizes the softmax activation function, all layers use the rectified linear unit (ReLU) activation function. Furthermore, a dropout of 0.2 is used to prevent overfitting during optimization. The study also used the adaptive moment estimation (Adam) approach, with the number of epochs and the batch size set to 100 and 500, respectively. To evaluate the performance of the proposed CNN-NIDS, the study used the NSL-KDD dataset, which consists of network traffic attributes. The nominal properties in the dataset are transformed into discrete attributes during the preprocessing stage using 1-to-N (one-hot) encoding. Since one of the dataset's attribute columns is always zero and has no bearing on training or testing, it is excluded from the list of attributes. Then, using max–min normalization, all the attributes are normalized to the range [0, 1]. For simulation, the model was developed using Python on a computer running Windows 10 with 16 GB of RAM and an Intel Core i7 processor. The study results show that 99.79 percent detection accuracy was achieved in the evaluation of test data from the NSL-KDD dataset, whereas a comparison with a DNN-NIDS showed a detection accuracy of 98.90% for the DNN.
Therefore, the study concluded that the proposed CNN-NIDS method is practical for identifying and detecting probable network intrusions.

Because of the huge internet traffic in real life, deep learning can perform better at extracting data features, as demonstrated by [54]. The study suggests using the whole NSL-KDD dataset to train an IDS model based on CNN and compares the performance of the proposed CNN model with traditional machine learning methods such as SVM and RF, as well as deep learning approaches such as long short-term memory (LSTM) and DBN. The study reshapes the one-dimensional input data into an image-like format with a height of one, converting the NSL-KDD dataset into a one-dimensional convolutional architecture input format. Because the training and testing inputs to the CNN must be numeric matrices, the data must be transformed into numeric attributes: a one-hot encoder is utilized to translate the categorical features in the dataset, and data normalization keeps the data in the same range. Multistage (MS) features are given to a two-layer classifier in a three-stage architecture. The convolutional layer and the max-pooling layer extract features in the first stage, which are then combined with features from the second and third stages. A ReLU activation layer and a max-pooling layer with stride 1 follow all convolutional layers to keep the shape of the previous convolutional layer. A flatten layer reduces the output from three dimensions to two. After the preceding layer, a fully connected layer is introduced, followed by a dropout layer. Before the softmax layer yields 5-dimensional features, a dense layer reduces the feature size to 5. Following that, combining the 3-stage deep features with the 5-dimensional features, the final fully connected layer reduces the feature size to 5. The experiment is run in PyCharm on an Intel Core i7-6700HQ CPU at 2.60 GHz × 8 with 8 GB of RAM and a GeForce GTX 960M graphics card.
The results show that the overall accuracy of the proposed CNN for the five-class categorization is 80.13 percent on KDDTest+ and 62.32 percent on KDDTest-21. The proposed CNN model therefore outperforms the compared methods, which achieve 74.19% (RF), 71.30% (SVM), 71.91% (DBN), and 73.18% (LSTM) on KDDTest+ and 51.02%, 45.54%, 46.73%, and 49.37%, respectively, on KDDTest-21.

Blanco et al. [55] addressed the limitations of deep learning IDSs in classifying new types of attacks that were not included in the training set. They demonstrated that most organizations' infrastructure is vulnerable to several threats, including security breaches and system exploitation. Pattern matching makes it easy for an intrusion detection system to find known types of attacks, but new attacks are harder to detect. Their study attempted to develop an intrusion detection system using a deep learning approach that not only learns but also adjusts to patterns that were not previously learned. The proposed system uses the NSL-KDD dataset to train a deep neural network that consists of a sparse autoencoder and logistic regression. Stacking the autoencoders builds a deep network, and a logistic regression network classifies the learned features, with the outcome being the identification of two distinct traffic groups. An autoencoder is an unsupervised artificial neural network that can adapt to learn new features from a set of input data. The first level of the sparse autoencoder decreases the feature set from 115 to 50, and the second level decreases it from 50 to 10. The logistic classifier then uses the new features learned by the level-2 autoencoder to determine whether traffic is normal or under attack. The activation function of the logistic regression is a sigmoid, which gives a probability measure of the output in the range [0, 1]. The final stack thus creates a fully connected network with one input layer, one output layer, and two hidden layers: the original dataset's 115 inputs are compressed into 50 nodes in the second layer and 10 in the third, and the final output layer determines whether traffic is normal. The dataset is preprocessed before being applied to the network.
The data are normalized using the max–min operation, and nonnumeric data are replaced with numeric values. The proposed system's performance was evaluated, and the results show a precision of 84.6 percent, a recall of 92.8 percent, a specificity of 80.7 percent, a negative predictive value of 90.7 percent, and an overall accuracy of 87.2 percent.
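The 115 → 50 → 10 stack topped by a sigmoid classifier can be sketched as a forward pass. The weights here are random placeholders for illustration (the real system learns them via autoencoder pretraining), so the outputs are not meaningful predictions, only a demonstration of the dimensionality reduction chain.

```python
import numpy as np

rng = np.random.default_rng(7)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def layer(n_in, n_out):
    return 0.05 * rng.normal(size=(n_in, n_out)), np.zeros(n_out)

# Encoder stack matching the described dimensions: 115 -> 50 -> 10,
# followed by a single-output logistic (sigmoid) classifier.
W1, b1 = layer(115, 50)
W2, b2 = layer(50, 10)
Wc, bc = layer(10, 1)

def classify(x):
    h1 = sigmoid(x @ W1 + b1)          # first autoencoder's features
    h2 = sigmoid(h1 @ W2 + b2)         # second autoencoder's features
    p = sigmoid(h2 @ Wc + bc)          # P(attack) in [0, 1]
    return p.ravel()

p = classify(rng.random((4, 115)))
print(p)                               # probabilities; threshold at 0.5
print(["attack" if q > 0.5 else "normal" for q in p])
```

Pretraining would first fit each autoencoder level to reconstruct its input, then fine-tune the whole stack with the labeled NSL-KDD data.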

Ding and Zhai [56] developed a novel deep learning-based intrusion detection system using the CSE-CIC-IDS 2018 dataset. The CSE-CIC-IDS 2018 numerical data is first converted into images. Convolutional and max-pooling layers are then organized to create a CNN-based intrusion detection model. Specifically, the proposed system consists of two convolutional layers and two max-pooling layers placed one behind the other. The first convolutional layer has 16 kernels and is followed by a max-pooling layer; the second convolutional layer uses 32 kernels and is likewise followed by a max-pooling layer. The study employs the ReLU activation function for every convolutional layer, and dropout is implemented after each max-pooling layer to avoid overfitting. A fully connected layer is then deployed, and the effectiveness of the proposed CNN model is compared with an RNN model on the CSE-CIC-IDS 2018 dataset. Furthermore, a preprocessed subdataset has been used to enhance the detection rate. The results show that, for the CNN, the detection rates of benign traffic, DoS, brute force, SQL injection, and infiltration attacks were 1, 0.97, 0.86, 0.57, and 0.33, respectively, whereas the detection rate was lower for the RNN model. Therefore, the study concluded that, on CSE-CIC-IDS 2018, the CNN model outperformed the RNN model in terms of accuracy for multiclass classification.

Moreover, the authors in [57] proposed a deep learning-based intrusion detection model that consists of many stacked fully connected layers. The first layer is the input layer, which passes 44 features into the neural network. After the input layer, there are eight hidden layers consisting of 140, 120, 100, 80, 60, 40, 20, and 120 nodes, respectively. Then, the output layer, a softmax layer, generates probabilities for the 13 classes to make the prediction. The LeCun uniform initialization was utilized for weight initialization on the fully connected layers, whereas the Glorot uniform initialization was employed for the output layer. The ReLU activation function was employed for all of the fully connected layers, and the Adam optimizer was utilized for training the neural network. To avoid overfitting, regularization techniques were utilized. The study used the CICIDS2017 dataset, a public intrusion detection dataset that covers real-world cyberattack scenarios. It is separated into eight files and contains benign traffic and different types of attacks collected over five consecutive days. Data cleaning was implemented to eliminate infinity, missing, and other nonsensical values in the database. In addition, data transformation was implemented to merge some attack classes into one common class. The proposed model was evaluated in terms of recall, precision, and F1 score for each of the 13 different classes under 10-fold cross-validation, with the metrics computed from the confusion matrix produced by each of the ten cross-validation splits. The results show that the proposed model has a 99.95 percent overall accuracy, a precision of 94.31 percent, a recall (detection rate) of 95.62 percent, and an F1 score of 94.1 percent. In addition, the false-positive rate (FPR) is 0.0005, averaged over all classes and splits.

Furthermore, Toupas et al. [58] proposed a multiphase intrusion detection system based on the LeNet-5 convolutional neural network model. The proposed IDS model has four phases: data acquisition, preprocessing, CNNs, and attack detection. LeNet-5 is the classic convolutional neural network, and it reduces the input dimension so that training speed and convergence rate can be increased. The topology of the proposed model consists of three convolutional layers, two pooling layers, and one fully connected layer, followed by a softmax layer to produce the output. For model deployment, the benchmark KDD CUP 99 dataset has been used. Data cleaning and normalization were employed to avoid irrational calculations over the dataset. An encoding method was used to process the feature matrix, making it easier for the model to recognize the traffic packets' properties and improving accuracy, stability, and convergence rate. Data normalization keeps the data in the same range without destroying the linear relationships between them; the min–max normalization method was used to eliminate the large differences in data values. The training processes were run on an Ubuntu 16.04 LTS-4.40 system with 126 GB of RAM and a 2.2 GHz Intel® Xeon® processor, using TensorFlow 1.11.0 and Anaconda 4.5.11. After fifty epochs of training, the model's accuracy is greater than 99 percent, its detection rate is greater than 99 percent, and its false alarm rate is less than 0.1.
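The min–max normalization step used here and in several of the preceding systems can be sketched as follows; the guard for constant-valued columns is an added assumption to keep the computation well defined, and the sample values are synthetic.

```python
import numpy as np

def min_max_normalize(X):
    """Min-max scaling of each column to [0, 1], preserving the linear
    relationships between values, as used to preprocess KDD features."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    return (X - lo) / span

X = np.array([[0.0, 100.0, 5.0],
              [10.0, 300.0, 5.0],
              [5.0, 200.0, 5.0]])
Xn = min_max_normalize(X)
print(Xn)   # each column rescaled to [0, 1]; the constant column maps to 0
```

Because the transform is affine per column, feature orderings and linear correlations survive the rescaling, which is exactly the property the preprocessing step relies on.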

Many organizations and businesses have encouraged their staff to work from home in light of the recent COVID-19 pandemic. As Liu [59] shows, this situation posed a new threat to the privacy and security of business information. In this study, deep learning algorithms were built into NIDS prediction models to automatically detect network attacks. The network intrusion detection system was constructed with SVM and a deep convolutional neural network (DCNN), and its performance was evaluated with various kernels and activation functions. The topology of the proposed DCNN-based intrusion detection system consists of convolutional layers used to extract features from the data and pooling layers that reduce the feature map size. Once features are extracted by the convolutional and pooling layers, a flattening layer converts the feature map into a vector. After that, the hidden layers optimize the parameters and hyperparameters to make correct predictions. The output layer consists of two neurons because the task is binary classification, detecting malicious versus normal traffic only. The study used different SVM kernels and several DCNN activation functions to evaluate the performance of the proposed model, assessed using metrics such as accuracy, precision, recall, and F1 score on the NSL-KDD dataset. The NSL-KDD dataset addresses the limitations of the KDD 99 dataset; its key advantage is that there are no redundant records in the training set, so the classifier no longer gives biased results. Furthermore, only 12,440 samples were utilized to evaluate the proposed method's performance: 9950 for training and the remaining 2490 for testing.
With the linear kernel of the SVM, only 1210 of the 2490 test samples are correctly predicted, and 370 samples fall into the false-negative group, whereas the RBF kernel of the SVM reaches a prediction accuracy of 85%. Regarding the DCNN activation functions, the experiments show that the sigmoid is consistently the best activation function for the output layer: with ReLU in the input and hidden layers it achieves 96% accuracy in 100 epochs, and with sigmoid, softmax, or Tanh in the input and hidden layers it achieves 97% accuracy in 100 epochs. Finally, the study concluded that the DCNN outperformed the SVM in terms of accuracy.
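For reference, the activation functions compared in this study behave as sketched below in plain NumPy; the `predict_malicious` helper is a hypothetical illustration of the sigmoid output neuron making the binary normal/malicious decision, not code from [59]:

```python
import numpy as np

# The activation functions compared in the study.
def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def softmax(x):
    e = np.exp(x - np.max(x))  # shift by the max for numerical stability
    return e / e.sum()

# A sigmoid output neuron maps a logit to P(malicious); thresholding
# that probability yields the binary traffic classification.
def predict_malicious(logit, threshold=0.5):
    return bool(sigmoid(logit) >= threshold)
```

The sigmoid's [0, 1] output range is what makes it a natural fit for the two-class output layer, regardless of which activation is used in the hidden layers.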

A study conducted by Chen et al. [60] demonstrated that traditional network security systems cannot meet the demand for robust and efficient intrusion detection, so developing effective IDSs remains a challenge. The study proposes a unique intrusion detection approach based on a CNN, from which an efficient, real-time, automated intrusion detection system, IDS-CNN, was constructed using deep learning technology. Several open-source tools, such as tcpdump for packet capture, Bro for traffic analysis, and TensorFlow, are used to build the system. The topology of the proposed IDS-CNN consists of an input layer, two convolution layers, two activation function layers, a pooling layer, and two fully connected layers, as well as a loss layer and an output layer, with TensorFlow used to construct the CNN framework. The first convolution applies 32 convolution kernels to the input image, and the second applies 64, each producing a smaller feature map. The IDS-CNN system's entire procedure involves capturing a data packet, analyzing it to retrieve the feature values, and classifying the feature values to complete the intrusion response. The NSL-KDD dataset was used to assess the proposed model's performance. This new version of the KDD 99 dataset includes four types of attacks: DoS, R2L, U2R, and Probe. The main advantage of the NSL-KDD dataset over KDD 99 is that there is no redundant or duplicate data in the training set, so the classifier will not produce biased results. The experimental results show that the accuracy of the proposed model is 97.7%. The study concluded that the proposed model can monitor the network in real time, identify unusual network behavior, and defend against anomalous behavior.
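The convolution and pooling operations at the core of IDS-CNN (and of the other CNN-based systems surveyed here) can be illustrated in plain NumPy. This is a generic single-channel sketch of the two operations, not code from [60]:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Single-channel 2D convolution with no padding ('valid' mode),
    the operation performed by each convolution layer: the kernel
    slides over the image and produces one weighted sum per position."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: shrink the feature map while
    keeping only the strongest activation in each window."""
    H, W = fmap.shape
    H2, W2 = H // size, W // size
    return fmap[:H2*size, :W2*size].reshape(H2, size, W2, size).max(axis=(1, 3))
```

A real layer applies many such kernels in parallel (32 in the first IDS-CNN convolution, 64 in the second), stacking the resulting feature maps along a depth axis.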

A combination of evolutionary and CNN algorithms is proposed by Mohammadpour et al. in [62]. In their work, they proposed a multiclass network traffic classifier using a CNN, with a GA used to produce a high-quality solution by altering the layout of the input features and reducing the number of individual features. The study used two public datasets with varying attack ratios: UNSW (10 classes) and NSL-KDD (4 classes). There are 25 features in the NSL-KDD, whereas there are only 23 in the UNSW. The study organized the feature input vector in a matrix because a CNN requires a two-dimensional matrix as input. The proposed CNN starts with a convolution filter of depth 4, which produces an output with four values per pixel; a second convolution filter of depth 8 creates a new matrix from that output. The CNN then contains a max-pooling layer that retains the depth of each matrix while reducing its dimensionality by keeping the maximum value within each pooling range. After max pooling, there is a dropout layer, a regularization strategy that makes the network more robust. The dropout output is then flattened to generate a vector that serves as the MLP's input. One hidden layer of 64 neurons was employed for the MLP, followed by a new dropout layer and a softmax layer to obtain the predicted class. The rectified linear unit (ReLU) activation function was used in all neurons and filters since it consumes fewer resources and produces better outcomes than sigmoid functions. The simulations were conducted on a single computer with an Intel® Core™ i7-6700K CPU running at 4.00 GHz, 16 GB of RAM, and Ubuntu 16.04 LTS installed. The best result detects six attack types plus normal traffic with an accuracy of 98.14 percent on the UNSW dataset.
With cross-validation on the NSL-KDD training dataset, the classifier demonstrates an accuracy of 95.45%; on the test dataset, which contains unknown and new attacks, the accuracy drops to 94.47 percent.
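The GA-driven feature reduction described above can be sketched with a toy genetic algorithm over feature-subset bitmasks. This is a hypothetical illustration, not the authors' implementation; in practice the `fitness` function would be the validation accuracy of a CNN trained on the selected features:

```python
import random

def genetic_feature_selection(n_features, fitness, generations=30,
                              pop_size=20, mutation_rate=0.05, seed=0):
    """Toy GA: each individual is a bitmask selecting a feature subset.
    Truncation selection keeps the best half (elitism), one-point
    crossover mixes parents, and bit-flip mutation adds diversity."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[:pop_size // 2]          # elitist truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)    # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]              # bit-flip mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```

Because the best individuals always survive into the next generation, the best fitness found never decreases across generations.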

Similarly, a hybrid convolutional neural network model is proposed by Kim et al. in [63]. They presented an intrusion detection system for IoT networks to identify various attacks. The proposed model comprises four stages: data collection, preprocessing, network training, and attack detection. The system log is selected as the data source and is preprocessed to remove undesired noise. The structure of the convolutional neural network consists of an input layer, an output layer, and several hidden layers. In addition, the proposed model employs an LSTM to learn content throughout the network; this procedure extracts important information from nodes and identifies malicious nodes and their attacks. The study therefore combines an LSTM and a convolutional neural network to increase performance and reduce the complexity of the proposed intrusion detection system. Bidirectional LSTM models were developed, using two hidden layers to process the input sequence in both forward and backward directions; these features capture data transmission across the network and retrieval of the required data. The proposed system was tested in the lab and compared to an RNN-based intrusion detection system, with the UNSW-NB15 dataset split 70% for training and 30% for testing. The proposed system extracts data features and distinguishes between attack and normal settings. An Intel i5 processor clocked at 2.4 GHz with 8 GB of RAM was used to test the proposed model. The evaluation metrics used to validate the proposed system are the true positives, false positives, accuracy, precision, recall, F1-score, and error function. The results show that the proposed model outperforms the RNN model in terms of detection accuracy, precision, and recall, and its ratio of true positives to false positives is also better than that of the RNN model.
The study concluded that the proposed hybrid convolutional neural network model has a 98 percent efficiency, which is 3 percent higher than the traditional recurrent neural network model.
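The evaluation metrics named above all derive from the four confusion-matrix counts; a minimal sketch of the standard formulas (not the authors' code):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the standard detection metrics from the confusion
    counts: true/false positives and true/false negatives."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0   # flagged traffic that was truly malicious
    recall = tp / (tp + fn) if tp + fn else 0.0      # malicious traffic that was caught
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Reporting precision and recall alongside accuracy matters for IDS evaluation because attack classes are typically rare, so a classifier can reach high accuracy while missing most attacks.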

In addition, Hu et al. presented a CNN-based intrusion detection approach optimized via the fruit fly optimization algorithm (FOA) [64]. In the first step of the proposed method, a multiclass intrusion detection model based on a CNN is created. The proposed CNN model consists of an input layer, where grayscale images created from the original training data serve as input, followed by three convolutional layers. Each convolutional layer employs a wide convolution with no padding and the ReLU function as the activation function. Two maximum pooling layers fill the spaces between the three convolutional layers. After that, there are two fully connected layers: a random dropout is added to the first to avoid overfitting, and the softmax function serves as the classification activation in the second. The FOA is used in the pretraining phase to address the issue of class imbalance; during training, each batch is obtained by resampling according to the resampling weights. For implementation, a Lenovo ThinkStation P520 workstation with a 2080 Ti graphics card was used with the NSL-KDD dataset. In the preprocessing phase, all discrete features are transformed into binary vectors using the one-hot encoding technique, and all continuous features are normalized to keep them between 0 and 1. The initial training data are thereby transformed into binary vectors with 464 dimensions and then into grayscale vectors with 58 dimensions; eight zero paddings are added to each grayscale vector, which is then converted into a grayscale image. The results show that the precision scores for Norm, DoS, Probe, R2L, and U2R using the FOA optimizer were 95.5%, 98.79%, 81%, 84%, and 2.4%, respectively.
The study concluded that a pretraining procedure could be carried out using the fruit fly optimization algorithm to create the optimal training data resampling weights.
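The one-hot encoding step used in the preprocessing phase above can be sketched as follows (a minimal NumPy illustration with made-up protocol values, not the authors' code):

```python
import numpy as np

def one_hot_encode(values, categories=None):
    """Map each discrete value to a binary indicator vector, so that
    categorical features (e.g., protocol type) become numeric inputs
    without implying any ordering between categories."""
    if categories is None:
        categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)), dtype=int)
    for row, v in enumerate(values):
        out[row, index[v]] = 1
    return out
```

Concatenating these indicator vectors with the normalized continuous features is what expands each NSL-KDD record into the long binary vector described above.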

5. Current Challenges and Future Directions in Intrusion Detection Techniques

Intrusion detection systems are one of the most active fields in network security. Nowadays, networks are becoming very complicated in terms of structure, platform, services, data types, and volume. Moreover, social media in particular encroaches on the privacy of users and businesses and has its own characteristic attacks, such as cyberbullying.

This shows that network data are expanding tremendously in volume and diversity in modern networks. Accompanying this expansion is the emergence of new types of attacks and their observed increase in volume and sophistication. Attackers have the tools, technologies, and intelligence that enable them to produce new types of threats and exploit the vulnerabilities of emerging technologies. Therefore, the battle between cybersecurity researchers and attackers never ends. Cybersecurity researchers have explored different approaches to developing effective and robust IDSs capable of efficiently identifying different attacks. In general, the explored methodologies in IDS development can be categorized into four major categories, namely, statistical methods, AI and machine learning methods, swarm intelligence and evolutionary methods, and deep learning-based methods. Aside from these major categories, hybrid systems, which combine two or more methods, are also explored in the literature.

Statistical methods are mainly based on probability features of network traffic data. This approach assumes that different attacks exhibit specific statistical patterns that can be detected by modeling the probability features of the data. To extract the features, researchers develop sophisticated statistical methods that explore the statistical distribution of big datasets to classify the traffic correctly. The computational complexity of a statistical method is a function of the traffic data's dimensionality and volume, but statistical methods generally stay within acceptable limits of computational complexity. Alongside this acceptable complexity, statistical methods achieve accuracies ranging from 90% to 97.5%. However, their main disadvantage is the inability to detect new attacks whose statistical features are not included in the training data. This disadvantage can be mitigated only by recrafting the features to include the new attacks' features, which is not an easy task.
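As a concrete (and deliberately simplified) example of this category, a statistical anomaly detector can model normal traffic with per-feature means and standard deviations and flag records whose z-score exceeds a threshold. This is a hypothetical illustration of the general idea, not a method from any of the surveyed papers:

```python
import numpy as np

def zscore_anomaly_detector(normal_traffic, threshold=3.0):
    """Fit per-feature mean/std on normal traffic and return a
    predicate that flags a record as anomalous when any feature
    deviates by more than `threshold` standard deviations."""
    normal_traffic = np.asarray(normal_traffic, dtype=float)
    mu = normal_traffic.mean(axis=0)
    sigma = normal_traffic.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features

    def is_anomalous(record):
        z = np.abs((np.asarray(record, dtype=float) - mu) / sigma)
        return bool(z.max() > threshold)

    return is_anomalous
```

The limitation discussed above is visible directly in this sketch: an attack whose feature values stay within the modeled distribution of normal traffic produces small z-scores and goes undetected.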

Artificial intelligence and machine learning methods are a family of learnable systems that train on the data to craft the optimal features that enhance detection. In general, AI-ML methods comprise two main stages, namely, feature extraction and decision-making. Feature extraction is responsible for crafting or selecting the optimal features from the dataset to be input to the decision-making system. Crafting the data may include mapping and diffusion processes that transform the data into other domains yielding enhanced classification, whereas feature selection is more abstract, with the algorithms selecting the most effective features from the available datasets. The optimal features form the input vectors to the decision-making system, which is usually a cascaded multilayer neural network. The resiliency of the neural network layers is the added advantage of AI-ML methods over statistical methods, as these layers can adapt to new attacks and adjust their hyperparameters to train on new attack patterns. However, the complexity of feature engineering in the first stage remains the main disadvantage of such methods. The detection accuracies achieved using AI-ML methods are comparable to those of statistical methods and range from 78% to 99.95%.

Swarm intelligence and evolutionary algorithms, on the other hand, are mainly cost minimizing/maximizing algorithms. Candidate solutions to the problem are proposed initially, and then iterative methods fuse the solutions based on their cost function evaluation and the penalties used to exclude undesirable solutions. Swarm intelligence and evolutionary algorithms are deemed very effective in many classification problems, including IDS, as their accuracies are comparable to those of AI-ML methods. Moreover, they exhibit resiliency against new attack patterns. However, their complexity is comparable to that of AI-ML methods, and forming cost functions that accurately evaluate the proposed solutions is not an easy task when dealing with complex data.
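The iterative cost-minimization loop common to this category can be illustrated with a bare-bones particle swarm optimizer. This is a toy sketch of the generic technique on an arbitrary cost function, not an IDS-specific algorithm from the surveyed work:

```python
import random

def pso_minimize(cost, dim, iters=100, swarm=15, seed=1):
    """Minimal particle swarm optimization: each particle moves under
    inertia plus pulls toward its personal best and the global best
    position found so far, iteratively lowering the cost function."""
    rng = random.Random(seed)
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia, cognitive pull, social pull
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(swarm)]
    vel = [[0.0] * dim for _ in range(swarm)]
    pbest = [p[:] for p in pos]            # each particle's best position
    gbest = min(pbest, key=cost)           # swarm-wide best position
    for _ in range(iters):
        for i in range(swarm):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if cost(pos[i]) < cost(pbest[i]):
                pbest[i] = pos[i][:]
        gbest = min(pbest, key=cost)
    return gbest
```

In an IDS setting, the position vector would encode, for example, classifier hyperparameters or resampling weights, and `cost` would be a validation error; as the text notes, designing that cost function well is the hard part.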

Finally, deep learning methods are emerging as robust and resilient classification methods based on neural network topologies. The structure of deep learning can be viewed as two stages: the first comprises multiple layers (such as convolutional layers) that learn and extract the features, followed by decision-making layers that perform the classification. This structure eliminates the need to craft the features manually and gives deep learning methods their power over the other methods. Moreover, with the availability of different layers that perform different operations, plenty of deep learning methods have been explored with different levels of accuracy and complexity. Deep learning methods outperformed other methods in terms of detection accuracy and, in addition, provided higher levels of resiliency, enabling them to adapt to new attacks in complex networks.

Despite the detection accuracies achieved by the developed algorithms, intensive research is still needed to resolve the limitations of the existing methods. Computational complexity, space complexity, and resiliency are among the challenges that need to be addressed. Moreover, neural-based algorithms (AI-ML and deep learning) suffer from the interpretability problem: in the case of a false detection, it is difficult to interpret the reasons for the malfunction. Added to all that are the fast-emerging network technologies, the growing data complexity, and the attackers' evolving methods, which put pressure on researchers to explore more technologies and develop more robust algorithms in the field of IDS. Furthermore, social media networks are becoming integral parts of our lives; therefore, social networks need attention in exploring their vulnerabilities and the efficiency of different IDSs in detecting and classifying attacks. Many researchers have explored different IDS methodologies for different network platforms, such as IoT and cloud systems. This paper addresses this gap by presenting different methodologies, ranging from conventional statistical methods to DL methods, and by demonstrating and comparing their performance, which provides a benchmark for researchers in the field to select the proper method for classification tasks. The importance of social media and its specific role in our lives has been demonstrated, along with the need for further studies to explore social media-specific vulnerabilities and attacks.

6. Conclusion

The review of the relevant literature reveals a variety of distinct methods for identifying cyberattacks performed against a network. Each strategy has its own advantages and disadvantages, and the precision, detection rate, and number of false positives produced by each approach vary. According to the studies, statistical and artificial intelligence- (AI-) based methodologies are the most often utilized techniques for classifying network traffic and finding network breaches. Statistical methods deal with fewer data points and require more feature engineering, whereas artificial intelligence is highly effective at making predictions and requires relatively little human input because the machine handles most of the processing. Furthermore, statistical methods are incapable of meeting the need for a powerful and effective intrusion detection system in modern and future networks, or of managing the volume of network traffic. AI approaches, on the other hand, are superior at classifying network traffic because of their ability to manage big, scalable networks that are intrinsically complicated. The literature review also demonstrates that deep learning is one of the most powerful and effective methods for identifying intrusions in large, complex, and scalable networks. Still, further research is needed to develop methods that cope with the ever-increasing complexity of attacks and network platforms.

Conflicts of Interest

All authors declare no conflict of interest.

Acknowledgments

The authors acknowledge the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia (Project No. GRANT3186). The authors extend their appreciation for the financial support that has made this study possible.