Predicting the Robustness of Large Real-World Social Networks Using a Machine Learning Model

Nguyen, Ngoc-Kim-Khanh; Nguyen, Quang; Pham, Hai-Ha; Le, Thi-Trang; Nguyen, Tuan-Minh; Cassi, Davide; Scotognella, Francesco; Alfieri, Roberto; Bellingeri, Michele

doi:https://doi.org/10.1155/2022/3616163

Complexity

On this page

Abstract Introduction Results Discussion and Conclusion Appendix Abbreviations Data Availability Conflicts of Interest Authors’ Contributions Acknowledgments References Copyright Related Articles

Special Issue

Frontiers in Data-Driven Methods for Understanding, Prediction, and Control of Complex Systems 2022

View this Special Issue

Research Article | Open Access

Volume 2022 | Article ID 3616163 | https://doi.org/10.1155/2022/3616163

Predicting the Robustness of Large Real-World Social Networks Using a Machine Learning Model

Ngoc-Kim-Khanh Nguyen,¹Quang Nguyen,^2,3,4Hai-Ha Pham,⁵Thi-Trang Le,⁴Tuan-Minh Nguyen,⁴Davide Cassi,^6,7Francesco Scotognella,^8,9Roberto Alfieri,^6,7and Michele Bellingeri^6,7,8

Academic Editor: Andrea Murari

Received30 Jun 2022

Revised24 Sept 2022

Accepted03 Oct 2022

Published09 Nov 2022

Abstract

Computing the robustness of a network, i.e., the capacity of a network holding its main functionality when a proportion of its nodes/edges are damaged, is useful in many real applications. The Monte Carlo numerical simulation is the commonly used method to compute network robustness. However, it has a very high computational cost, especially for large networks. Here, we propose a methodology such that the robustness of large real-world social networks can be predicted using machine learning models, which are pretrained using existing datasets. We demonstrate this approach by simulating two effective node attack strategies, i.e., the recalculated degree (RD) and initial betweenness (IB) node attack strategies, and predicting network robustness by using two machine learning models, multiple linear regression (MLR) and the random forest (RF) algorithm. We use the classic network robustness metric R as a model response and 8 network structural indicators (NSI) as predictor variables and trained over a large dataset of 48 real-world social networks, whose maximum number of nodes is 265,000. We found that the RF model can predict network robustness with a mean squared error (RMSE) of 0.03 and is 30% better than the MLR model. Among the results, we found that the RD strategy has more efficacy than IB for attacking real-world social networks. Furthermore, MLR indicates that the most important factors to predict network robustness are the scale-free exponent α and the average node degree <k>. On the contrary, the RF indicates that degree assortativity a, the global closeness, and the average node degree <k> are the most important factors. This study shows that machine learning models can be a promising way to infer social network robustness.

1. Introduction

The study of the social network from a complexity science perspective has attracted much interest recently [1]. Especially, the study of dynamic processes that take place in these complex networks can have various applications. For example, the study of network robustness, i.e., “network robustness” is the capacity of a network to hold its functionality when a proportion of nodes/edges are removed, can help attack a network efficiently, or inversely design a more robust network structure in practice [2–7]. On the other hand, the study of epidemic processes that take place in the network can be used to spread the news [8–12], optimize vaccination strategy [13–15], or define a better social-distancing rule [16–19].

Besides a few simple model networks where analytical models can be developed [20–24], most of the studies rely on computer simulations. For example, for the study of the network’s robustness, node/edge removal Monte-Carlo simulations are usually employed. In such a process, nodes/edges are sequentially removed from the network using computer simulations. A “robustness” metric is then recorded at each step of the removal process. The most commonly used robustness metric is the largest connected component (LCC) of the remaining network [25].

The way nodes/edges are selected to be removed is called the removal strategy or attack strategy. One can classify attack strategies into two types, initial and recalculated attack strategies. For an initial attack strategy, nodes/edges are removed according to a node/edge ranking that is computed ahead of the removal simulation. In contrast for a recalculated attack strategy, the ranking is updated after each node/edge removal [4].

For node removal attack strategies, the node ranking is usually computed using node centrality measures such as degree [26, 27], closeness [4], and betweenness [7, 30]. It was found that for social networks, the recalculated betweenness node attack strategy (RB) is, on average, the most effective node attack strategy to dismantle the network [2, 7, 28, 29]. Other effective strategies are the recalculated degree (RD) and the initial betweenness (IB) [7, 28, 30].

Because of the sequential nature of the removal process, the node removal simulation is computationally costly, especially for recalculated strategies. For example, a simulation using an RD attack strategy has a time complexity of O (N × E), where N is the number of nodes and E is the number of edges of the network. The reason is that the node removal process has an N step, and at each step, a degree ranking is computed taking a time that scales with E. However, for RB, the computation of the whole network’s betweenness is known to be very computationally costly, due to the definition of the network’s node betweenness [31, 32]. The most efficient known algorithm for calculating network betweenness is the Brandes algorithm [33], which has a time complexity of O (N × E). In consequence, the whole node removal process using IB and RB attack strategies can have a time complexity of O (N × E) and O (N² × E), respectively. Although the IB attack strategy has the same time complexity as the RD attack strategy, the RB’s time complexity is much higher. For illustration, in Figure 1, we present the total simulation time tIB and tRD for the corresponding attack strategies IB and RD, respectively, for all our studying social networks (48 networks see Section 2). In addition, we present the total simulation time tRB for the attack strategy RB for 4 networks (insert graph) as an example, as a function of the product N × E. We found a good linear relationship between t_IB and t_RD and N × E for all networks as expected, and tRB is about two orders of magnitude higher than t_IB and t_RD, for networks of equal N × E.

Figure 1

Computation time of a complete Monte Carlo network node attack simulation for all studied real-world social networks (using initial betweenness (IB) and recalculated degree (RD) attack strategy) and for 4 networks (using recalculated betweenness strategy (RB)) as a function of the product N × E (node number (N) edge number (E). We found that tIB and tRD scale approximately linearly with respect to the product N × E, while tRB scales linearly with respect to the product N² × E (insert graph). From this result, we can estimate that the RB simulation time for the largest networks in our dataset will take more than 50 days using the same hardware.

The simulation time can become an issue for the cases of social networks because their size can be extremely large. In fact, to our knowledge, most studies of dynamic processes on social networks that use an RB attack strategy only consider small-size real-world social networks of less than 100,000 nodes [7, 28, 30]. For very large social networks, the RB node attack strategy can take an unrealistic amount of time. Therefore, RB is not suitable for large social networks for an average computer station. One possibility is to use the alternative betweenness-based attack strategy with only one betweenness calculation, namely, the initial betweenness attack strategy IB, together with other recalculated strategies that use another node centrality metric that is less computationally costly. In consequence, in this work, we consider two candidate attack strategies for breaking large real-world social networks, IB and RD attack strategies. Besides the comparative study between different network node attack strategies, other works focused on the relationship between network robustness and network structural indicators (NSIs). Iyer et al. [4] studied network robustness as a function of the node clustering coefficient (or node transitivity). The research on model networks with tunable clustering coefficients demonstrates that networks with higher clustering coefficients are more robust, with the most important effect for the node degree and node betweenness attack [4]. Nguyen and Trang [34] studied Facebook social networks and found that those networks with higher modularity Q have lower robustness to node removal. The modularity indicator Q introduced by Newman and Girvan [35] measures how well a network breaks into communities, (i.e., a community or module in a network is a well-connected group of nodes that have sparser connections with nodes outside the group). In [29], the authors empirically analyzed how the modularity of scale-free models and real-world social networks affects their robustness and the relative efficacy of different node attack strategies. The abovementioned studies analyzed the relationship between network robustness and a single NSI.

On the other hand, machine learning (ML) is a technique that has seen a huge breakthrough in the last decade, beating state-of-the-art results in many prediction applications [36]. It initially solved technical problems in computer vision and natural language processing [37–39] and then expanded into many other fields such as health care, finance, manufacturing, energy, and environment. The key characteristic of an ML model is the ability to intelligently learn nonlinear relationships between the input and output without explicitly knowing them.

In this work, given such a complex relationship between network robustness and NSIs, we adopted a method from machine learning in order to learn such a complexity. Our main contribution is the application of the ML model to predict real-world social network robustness with acceptable errors. We develop ML models to predict network robustness under two main attack strategies, the IB and RD attack strategies, independently. We also implemented three popular ML models, single-variable linear regression, multiple-variable linear regression, and random forest models. Our results demonstrate that a data-driven method such as ML can be an efficient way to study the network’s complexity.

Our work comprises three steps: (1) collect a real-world network dataset and compute NSIs; (2) run Monte Carlo node attack simulations to estimate network robustness; (3) build and evaluate a model that predicts network robustness from their NSIs. The paper is organized as follows: in Section 2, we describe our dataset of 48 real-world social networks. In Section 3, we describe the network robustness Monte Carlo simulation method and three ML models for predicting the network robustness, i.e., simple and multiple linear regression (SLR and MLR, respectively) and random forest (RF) model. Section 4 presents the main results, and finally, we discuss and conclude in Section 5.

Real-world social networks are downloaded from two sources: the Stanford Large Network Dataset Collection (https://snap.stanford.edu/data/) and the Network Repository social networks (https://networkrepository.com/soc.php). We select 48 social networks with a node number (N) ranging over five orders of magnitude. The smallest network is the “Twitch user-user network of gamers who stream in Portugal” having N = 1,914, and the largest network is the “e-mail network from an EU research institution” with N = 265,216. However, the network with the largest number of edges (E) is the “BlogCatalog social blog” with E = 4,186,390. The social networks used in this study are unweighted (i.e., we do not take into account edge weights) and undirected (we do not consider edge directionality).

Table 1 summarizes 48 real-world social networks and their NSIs. Besides N and E, we also compute the following NSIs:(i)Network density <k> is the average node degree, i.e., the average number of edges per node.(ii)Fitted scaled-free exponent (α): we assume that all social network degree distributions follow a power law of P(k) ∼ k−α where k is the node degree. The power exponent value α is fitted using the ordinary least squared method. From this fitting, we also extract the fitting variance of α, denoted by α².(iii)Assortativity (a): the assortativity coefficient is a Pearson correlation coefficient of the degree between pairs of linked nodes [40], which varies between −1 and 1. A positive value of a indicates a preferential connection between nodes of a similar degree, while negative values indicate that nodes of different degree have more change to connect.(iv)Modularity (Q) : The modularity indicator Q calculates how a network can be partitioned into subnetworks (modules or communities): where E is the number of edges, a_ij is the element of the adjacency matrix A in the row i and column j, k_i is the degree of i, k_j is the degree of j, c_i is the module (or community) of i, c_j that of j, the sum goes over all i and j pairs of nodes, and δ(x, y) is 1 if x = y and 0 otherwise [13].(v)Global clustering coefficient (C): the global clustering coefficient (C) is based on triplets of nodes. A triplet is three nodes that are connected by either two (open triplet) or three (closed triplet) undirected edges. The global clustering coefficient is the number of closed triplets (or 3x triangles because a triangle comprises 3 overlapping triplets, each centered at one of the three nodes) over the total number of triplets (both open and closed). The formula is as follows: where λ_closed is the number of closed triplets and λtotal is the total number of triplets in the network. The global clustering coefficient represents the overall probability for the network to have adjacent nodes interconnected, thus making more tightly connected modules [41].(vi)Average closeness (Cl) is the average of all network nodes’ closeness, where the closeness (or closeness centrality) of a node is calculated as the reciprocal of the sum of the length of the shortest paths between the node and all other nodes in the graph [42, 43]: where N is the number of nodes and d(i, j) is the length of the shortest path between nodes i and j.

2.1. Network Robustness Monte Carlo Simulation

For each network, we run two node removal processes using Monte-Carlo simulations. Nodes are removed consecutively following the ranking of initial betweenness (IB) and the ranking of the recalculated degree (RD). In the case of ties, e.g., nodes with an equal betweenness or degree score, we removed one of them at random. After each node removal, we compute the network robustness measure and the relative size of the largest connected component LCC, together with the accumulated proportion of nodes removed q. Finally, we obtain two curves LCC (q) corresponding to two node removal processes, IB and RD. The whole simulation is repeated 10 times, and the final curves LCC (q) are the average results.

In addition, we compute a single value defined as the network robustness (R), as performed by Bellingeri et al. [44], and the area below the normalized LCC curve during the removal process, R = . R therefore can be between two theoretical extremes, (absolute fragile network) and (absolute robust network). We denote RRD and RIB as the network robustness against RD and IB node attack strategies, respectively.

In summary, we collect 48 real-world social networks, and then, we compute 9 NSIs for each network as inputs. In parallel, we run Monte Carlo simulations and obtain the robustness represented by two metrics, RRD and RIB. The higher they are, the more robust the network is. Those two metrics are the output of each network and will be predicted using ML models.

3. Machine Learning Approach

This section presents the details of SLR, MLR, and RF models.

3.1. Simple Linear Regression Model (SLR)

Linear regression is the simplest model for prediction. The SLR model between the network robustness R and an NSI x is expressed by the linear equation:where a₀ is the intercept and a₁ is the slope. In (4), an ordinary least square (OLS) is applied for estimating coefficients by minimizing an appropriate loss function [45, 46]. Once the OLS process, which is also called the fitting process, is performed, we can use (1) to predict the robustness R of a new network for a given indicator x. In addition, we derive a statistics t-test from the OLS process with the null hypothesis H0: a₁ = 0. A rejection of H0 means that there is a significant linear relationship between R and the NSI x.

We run the SLR model fit for all NSIs listed in Table 1 excluding E because it can be expressed in terms of two other NSIs : E = N < k>/2.

3.2. Multiple Linear Regression Model

Multiple linear regression (MLR) is an extension of SLR for multidimension variables x = (x₁, x₂, …, x_n), where x₁, x₂, …, x_n are NSIs. The linear equation between network robustness R and NSIs is as follows:where a_i are coefficients obtained from the OLS method.

3.3. Random Forest Model

The random forest (RF) belongs to the ensemble class of ML models, indicating that it aggregates the prediction from an ensemble of ML base models, here, decision tree regression (DTR) models. We briefly describe the DTR in the following section.

A DTR starts with the root of the tree containing all samples (48 networks in our case). It then splits into two different nodes by selecting samples whose value of a certain variable is higher or lower than a certain threshold value. Figure 2(a) represents a basic decision tree diagram for our dataset. The root node containing 48 networks splits into two other nodes by considering whether the variable (NSI in our case) scale-free exponent α is higher or lower than 2.5.

(a)

(b)

Figure 2

(a) An example of a decision tree: the root node containing 48 networks splits into two other nodes, Node 11 and Node 12, with 13 and 35 networks, respectively, according to the value of the scale-free exponent α. Then, Node 12 splits into two other nodes, Node 21 and Node 22, with 15 and 20 networks, respectively, according to the value of the network density <k>. We assume that at Node 11, Node 21, and Node 22, no split is possible because of a certain stopping rule, and thus, they become final leaves. In general, any NSI can be used to divide networks at any split, and the decision tree can be arbitrarily complex depending on the stopping rule. (b) An illustration of the same decision tree in the 2-dimension (α and <k>) space with final leaves.

The DTR selects the variable, and its splitting value is based on information theory, in concrete considering the entropy concept. Entropy is a metric of uncertainty of a node. The DTR splits a node by maximizing the information gain, which is the weighted difference between the total entropy of two resulting nodes and the entropy of the initial node. The DTR successively splits until a stopping condition is reached, for example if the size of the current node is smaller than 20. The final node is also called a leaf node. In Figure 2(a), after the first split of the root, the left child node becomes a leaf node, while the right child node continues to split into two leaf nodes.

Once the final DTR is obtained, it can be used to predict the value of a new sample as follows. The new sample will be classified into one of the leaves, and its prediction value will be the average value of all the samples that are classified into the same leaf.

Finally, the RF model creates multiple decision trees randomly drawn from the data, usually several hundred, and averaging the results from all trees to output a new result often leads to strong predictions [47, 48].

The decision tree can fit nonlinear datasets because it can split the same NSI multiple times. However, decision tree is easy to be overfitting, i.e., it is too sensitive to the training data while failing to predict new coming (testing) data. In order to address this problem, a random forest (RF) model is obtained by creating multiple randomly drawn decision trees from data, usually several hundred. The final regression prediction will be the average prediction of all the decision trees [47–49] (in this work, we implement an RF with 300 DTRs). Using an RF, “feature importance” measurement can be derived to rank the NSI [50].

3.4. Data Preparation, Validation, and Performance Evaluation

All NSIs can be computed from the network’s data, and thus, our dataset did not contain missing values. We also exclude E because of redundancy as mentioned above. The other 8 NSIs are normalized to avoid large differences in the indicators’ range:where is the value of the NSI i for observation (network) j and and are the mean and the standard deviation of the NSI i, respectively.

In the first step, we use the whole dataset to build ML models and compare the results between models and two target variables. However, due to overfitting problems in many ML models, the model’s performance for new data is not always coherent as that in the training step, and we need to validate models in the second step. We choose the leave-one-out validation [51]. In this way, we train each of the above models 48 times: each time the whole dataset excluding one observation is used to train the model, and then, the model is used to predict the target value of the remaining (hold-out) observations and repeats for each of 48 hold-out observations. The overall evaluation result is the average across all 48 regressions.

It is noted that for the SLR model, we only consider regression coefficients in order to analyze the dependence of robustness metrics with respect to each NSI. However, for MLR and RF models, we analyze the prediction of robustness metrics using four common evaluation metrics for regression problems, the root mean square error (RMSE) and the coefficient of determination (also named the explained variance ratio, R²) as analytical metrics and the frequency distribution and the Q-Q plot of residual errors as graphical metrics.

RMSE is the square root of the summation of the squared difference between observed and predicted data points. The RMSE has the same unit as the target feature and is generally considered the model error. A lower RMSE value represents superior prediction results. The formula of the RMSE is provided bywhere n is the number of observations, R_j denotes the empirical (simulated) network robustness, and R_{predicted, j} is the predicted value of robustness for the observation j.

R ² is used to represent the general prediction performance of regression models. R² is one minus the ratio of the remaining variance and the original variance. The formula of R² is provided bywhere n is the number of observations, R_j is the simulation robustness, R_{predicted, j} denotes the predicted value for observation j, and is the average of all the simulation robustness. R² varies between 0 (the model has no prediction ability) and 1 (the model correctly predicts all values).

The residual error, , is simply an error between the empirical (simulated) network robustness and the predicted value of robustness. The distribution histogram of is expected to be close to the origin. Furthermore, the most important assumption of a linear regression model is that residual errors are independent, and consequently, these errors are expected to be normally distributed.

The network is analyzed using the “graph-tool” library in Python. All data preparation, model building, and evaluation are written using the Python code. The hardware for numerical simulations is a PC with an i9-10850 Intel processor and 32 GB RAM.

4. Results

4.1. Network Robustness as a Function of the NSIs and SLR T-Test

The simulation robustness of each network RIB and RRD is represented in Table 2. Overall, we found that RRD is slightly smaller than RIB for most networks (43 out of 48 networks), with an average of 0.148 vs. 0.173, respectively. It suggests that the RD strategy has more efficacy than IB for attacking real-world social networks. The largest and sparsest network, Email-EuAll (N = 265,216 and <k ≥ 1.58), has the smallest robustness with an equal RIB and RRD of 0.001. In contrast, the gemsec_deezer_HR network, with N = 54,575 and <k ≥ 9.12, has the strongest robustness with an RIB and RRD of 0.375 and 0.338, respectively.

In Figure 3, we plot RRD and RIB as a function of 8 independent NSIs, and we found that RRD and RIB behave similarly in all cases. The SLR unveils some significant relationships between R and NSIs (Figure 3 and Table 3). For example, in Figure 3(a), we can see that both RRD and RIB slightly decrease with the network size N. This linear dependence between robustness RRD and RIB and N is tested by using the SLR model, and we found that it is statistically significant, with a confidence level of 95% ( value <0.05, Table 3).

(a)

(b)

(c)

(d)

(e)

(f)

(g)

(h)

Interestingly, RRD and RIB do not statistically linearly depend on the network density <k> as found previously in [4, 52] (Figure 3(b) and Table 3). This contrasting observation would suggest that network robustness also depends on other NSIs and that the network density alone cannot predict the whole network’s robustness as previously seen.

Besides N, the only other NSI that shows a significant linear relationship is the modularity Q (Figure 3(f)) in the case of RRD.

However, in Figure 3, we still observe some nonlinear dependencies. For example, in Figure 3(e), we show that network robustness decreases with the assortativity coefficient a when a > 0. However, it decreases faster when a is close to 0 and increases with a when a < 0.

Similarly, in Figure 3(g), we found that the relationship between RRD and RIB and the global clustering coefficient C follows an inverted u-shaped pattern. We ran a two-line statistical test [53] and found that two-line (or broken line) regression is significantly better than a single-line test. The breakpoint was found to be C = 0.115. Both RRD and RIB linearly increase with C (with a significance level of 95%) up to the breakpoint and linearly decrease with C (with a significance level of 95%). One possible explanation is that if the network is sparse, more triplets help increase the network’s connectivity and thus increase its robustness. However, above a certain value (when C = 0.115), more triplets may denote the presence of hubs or central nodes, which are likely to be the target of intentional node removal strategies such as RD and IB, consequently lowering network robustness.

4.2. Machine Learning Prediction of Network Robustness

The results of the previous section suggest that the social network’s robustness depends on multiple NSIs in a highly complex, multidimensional, and nonlinear manner. To improve the model prediction, in this section, we use two multiple variable ML models, MLR and RF, to predict network robustness.

The results of multiple linear regression MLR are shown in Table 4. We found that both R_IB and R_RD have a positive overall linear regression coefficient with respect to α, Q, Cl, and <k> and a negative overall linear regression coefficient with respect to α², a, C, and N. Moreover, the MLR result indicate that α, α², and <k> are the most significant coefficients. A positive linear regression coefficient for the average node degree <k> suggests that networks are more robust when k is higher, while all other NSIs are fixed. This result agrees with previous outcomes demonstrating that denser networks may be more resistant to the attack [4, 52]. However, the different results between the MLR and SLR would suggest that there is a strong correlation between <k> and other NSIs. In addition, the MLR model predicts R_IB better than R_RD, with an R² coefficient of 58.04% compared to 51.76%. Nevertheless, the RMSE was smaller for R_RD, with a value of 0.0657, compared to 0.0709 for RIB (this is because the standard deviation of RIB is higher than that of R_RD, as shown in Table 2 (bottom row)).

Because of the nonlinearity found in the previous section, we expect that the regression result using the RF model will be improved. Table 5 represents the regression result of the RF model. We found that R² increases to 92.24% and 91.88% for R_IB and R_RD regressions, respectively. Interestingly, the RF model predicts R_RD roughly as well as R_RD, while MLR predicts R_IB better than R_RD, suggesting that R_RD may follow a stronger nonlinear relationship with NSIs than RIB. Additionally, the RMSE improved both for R_IB and R_RD, with a value of 0.0272 and 0.0241, respectively. Interestingly, the feature importance ranking in Table 5 shows that with an RF model, the assortativity a, the global closeness C, and the node number N are the most important NSIs. This result agrees with the exploratory observations shown in Figure 3 as discussed above.

In Figure 4, we compare network robustness R_IB and R_RD with the prediction value given by MLR and RF using a scatter plot. The scatter plots indicate that RF fit data significantly better than MLR, where the predicted actual data points are closer to the diagonal line y = x. Meanwhile, for MLR regression, we still found nonlinear dependency between the actual and predicted values. As a matter of fact, the MLR model was not able to capture the inherent nonlinearity dependency in the actual data. We also analyzed the residual errors of the above regression using the frequency histogram and QQ-plot and found that they follow a normal distribution relatively well (Figures 5–8).

(a)

(b)

(a)

(b)

(a)

(b)

(a)

(b)

(a)

(b)

Finally, we run leave-one-out regression for both models MLR and RF in order to avoid overfitting. The result is summarized in Table 6, and the scatter plots are shown in Figure 9. We found that the prediction result is less accurate than the above “in-sample” training with lower RMSEs in both MLR and RF models. We obtained an RMSE of 0.0812 and 0.0760 for R_IB and R_RD predictions using MLR, respectively, and an RMSE of 0.0733 and 0.0636 for R_IB and R_RD predictions using RF, respectively. Even though the regression results are less effective because we predict the single sample which is independent of the remaining samples used for training (building the ML model), residual errors still fit well to a normal distribution as shown in the histogram and QQ-plots (Figures 10–13).

(a)

(b)

(a)

(b)

(a)

(b)

(a)

(b)

(a)

(b)

5. Discussion and Conclusion

In this work, we have analyzed the robustness of 48 real-world social networks with the node number ranging over five orders of magnitude, from 1,914 to 265,216. Using Monte Carlo simulations, we have run two commonly used node attack strategies, IB and RD strategies, whose computation time is within our hardware capability. We found that their corresponding simulation time, t_IB and t_RD, scales linearly with the product of the network’s node number and edge number, i.e., N × E. We also found that the two attack strategies IB and RD present similar efficacy when evaluated by the unique robustness metric R, with RD slightly better than IB (average R_RD is slightly smaller than average R_IB). It suggests that for the social networks used in this study, the RD strategy is the most efficient strategy to dismantle (breakdown) networks, both in terms of computational cost and breakdown efficiency.

To understand how the structure of a social network determines its robustness, we investigate the relationship between the metric R and a set of network structural indicators (NSIs) from the literature. The simple linear regression (SLR) between R and NSIs shows low goodness of fitting, and it is overall not able to produce significant prediction models. The low goodness of SLR would indicate that network robustness depends on NSIs in a nonlinear manner.

To improve fitting, we have developed two machine learning models to predict two robustness metrics R_RD and R_RD from the combination of 8 NSIs, multiple linear regression (MLR), and random forest (RF) model. The latter one is chosen as it can handle nonlinear data well and is built on a collection of base models, decision tree classifiers. We found clearly that the random forest model can predict network robustness better than the multiple linear regression model. In concrete, the RF model predicts network robustness with an RMSE of 0.0272 and 0.0241 for R_IB and R_RD, respectively. This result is encouraging to predict real-world social network robustness, although the error is about 16% (for R_IB, the RMSE is 0.0272 compared to an average R_IB of 0.173, and for R_RD, the RMSE is of 0.0241 compared to an average R_RD of 0.148). Meanwhile, when the leave-one-out evaluation is applied, the RMSE increases to 0.0733 and 0.0636 for R_IB and R_RD, respectively, which is about one-third of the average value.

Finally, MLR indicates that the most important factors to predict RIB are the exponent α and the average node degree <k>, for both R_IB and R_RD. In particular, a higher value of α is correlated with higher R_IB and R_RD. Higher absolute values of the exponent α denote a network with fewer hub nodes (highly connected nodes) [35]. In consequence, the RD and IB attack strategies cannot find large hub nodes whose removal may disintegrate the network faster, resulting in higher values of R_RD and R_IB. Additionally, MLR indicates that <k> is positively related to lower R_IB and R_IB. This last outcome agrees with previous results, demonstrating that networks with higher edge density may be more resistant to the attack [4, 52]. On the other hand, it confirms that SLR, which focuses on a single NSI, may not be able to predict the robustness of real-world social networks.

Our work demonstrates that the ML model can be used to predict network robustness with acceptable results. Therefore, it alleviates the need to run a full Monte Carlo simulation on a network when only approximate robustness is needed. Meanwhile, more network datasets are expected to improve the accuracy of ML models. This work also contributes to the understanding of the relationship between real-world social network robustness and its structural indicators. Finally, we have proved that using a data-driven approach to predict the outcome of the nonlinear and complex dynamic process, such as network robustness, is an appropriate approach [54–60].

Appendix

The histogram and the QQ-plot of residual errors of all regressions are given in Figures 5–8 and Figures 10–13.

Abbreviations

RD:	Recalculated degree node attack strategy
IB:	Initial betweenness node attack strategy
RB:	Recalculated betweenness node attack strategy
t_IB:	Total simulation time for the attack strategy IB
t_RD:	Total simulation time for the attack strategy RD
t_RB:	Total simulation time for the attack strategy RB
SLR:	Simple linear regression model
MLR:	Multiple linear regression model
RF:	The random forest model
DTR:	Decision tree regression model
NSI:	Network structural indicator
RMSE:	Mean squared error
R²:	Coefficient of determination (also named the explained variance ratio)
a₀:	Intercept coefficient of SLR
a₁:	Slope coefficient of SLR
OLS:	Ordinary least square method
:	Error between the empirical (simulated) network robustness and the predicted value of robustness
α:	Fitted scale-free exponent
k:	Node degree
<k>:	Average node degree
a:	Degree assortativity
Cl:	Global closeness
C:	Global clustering coefficient
LCC:	Largest connected component
N:	Number of nodes
E:	Number of edges
Q:	Modularity indicator
α²:	Fitting variance of α
q:	Accumulated proportion of nodes removed
R:	Network robustness
R_RD:	Network robustness against RD node attack strategies
R_IB:	Network robustness against IB node attack strategies.

Data Availability

All the 48 real-world social networks are downloaded from the Stanford Large Network Dataset Collection (https://snap.stanford.edu/data/) and the Network Repository social networks (https://networkrepository.com/soc.php).

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

QN conceived analyses. NKKN, HHP, TTL, and TMN performed simulations. QN, NKKN, FS, RA, MB, and DC wrote the paper.

Acknowledgments

This work was supported by Vietnam’s Ministry of Science and Technology (MOST) under the Vietnam-Italy Scientific and Technological Cooperation Program for the period of 2021–2023 and by the Vietnam National University Ho Chi Minh City (VNU-HCM), Ho Chi Minh city, Vietnam under grant nos. B2017-42-01. This research was funded by a grant from the Italian Ministry of Foreign Affairs and International Cooperation. This project received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 Research and Innovation Programme (grant agreement No. (816313)). The authors are greatly thankful to Van Lang University, Vietnam, for providing the budget for this study.

References

S. Lehmann and Y.-Y. Ahn, “Complex spreading phenomena in social Systems,” Computational Social Sciences), Springer International Publishing, Berlin, Germany, 2018.
View at: Google Scholar
M. Bellingeri, D. Cassi, and S. Vincenzi, “Efficiency of attack strategies on complex model and real-world networks,” Physica A: Statistical Mechanics and its Applications, vol. 414, pp. 174–180, 2014.
View at: Publisher Site | Google Scholar
M. Bellingeri, D. Bevacqua, F. Scotognella et al., “Link and node removal in real social networks: a Review,” Frontiers in Physiology, vol. 8, p. 228, 2020a.
View at: Publisher Site | Google Scholar
S. Iyer, T. Killingback, B. Sundaram, and Z. Wang, “Attack robustness and centrality of complex networks,” PLoS One, vol. 8, no. 4, Article ID e59613, 2013.
View at: Publisher Site | Google Scholar
F. Morone and H. A. Makse, “Influence maximization in complex networks through optimal percolation,” Nature, vol. 524, no. 7563, pp. 65–68, 2015.
View at: Publisher Site | Google Scholar
K. Nguyen and Q. Nguyen, “Resilience of stock cross-correlation network to random breakdown and intentional attack,” Studies in Computational Intelligence, vol. 760, pp. 553–561, 2018.
View at: Google Scholar
S. Wandelt, X. Sun, D. Feng, M. Zanin, and S. Havlin, “A comparative analysis of approaches to network-dismantling,” Scientific Reports, vol. 8, no. 1, Article ID 13513, 2018.
View at: Publisher Site | Google Scholar
R. Pastor-Satorras, C. Castellano, P. Van Mieghem, and A. Vespignani, “Epidemic processes in complex networks,” Reviews of Modern Physics, vol. 87, no. 3, pp. 925–979, 2015.
View at: Google Scholar
C. Stegehuis, R. van der Hofstad, and J. S. H. van Leeuwaarden, “Epidemic spreading on complex networks with community structures,” Scientific Reports, vol. 6, no. 1, Article ID 29748, 2016.
View at: Publisher Site | Google Scholar
C. Li, L. Wang, S. Sun, and C. Xia, “Identification of influential spreaders based on classified neighbors in real-world complex networks,” Applied Mathematics and Computation, C, vol. 320, pp. 512–523, 2018.
View at: Publisher Site | Google Scholar
J. Wang, C. Li, and C. Xia, “Improved centrality indicators to characterize the nodal spreading capability in complex networks,” Applied Mathematics and Computation, vol. 334, pp. 388–400, 2018.
View at: Publisher Site | Google Scholar
L. Weng, F. Menczer, and Y.-Y. Ahn, “Virality prediction and community structure in social networks,” Scientific Reports, vol. 3, no. 1, p. 2522, 2013.
View at: Publisher Site | Google Scholar
P. Holme, “Efficient Local strategies for vaccination and network attack,” Europhysics Letters, vol. 68, no. 6, pp. 908–914, 2004.
View at: Publisher Site | Google Scholar
L. K. Gallos, F. Liljeros, P. Argyrakis, A. Bunde, and S. Havlin, “Improving immunization strategies,” Physical Review E, vol. 75, no. 6, pp. 1–4, 2007.
View at: Publisher Site | Google Scholar
J. Hadidjojo and S. A. Cheong, “Equal graph partitioning on estimated infection network as an effective epidemic mitigation measure,” PLoS One, vol. 6, no. 7, Article ID e22124, 2011.
View at: Publisher Site | Google Scholar
M. A. Amaral, M. Md Oliveira, and M. A. Javarone, “An spidemiological model with Voluntary Quarantine strategies Governed by evolutionary vame dynamics,” Chaos, Solitons & Fractals, vol. 143, Article ID 110616, 2021.
View at: Google Scholar
H. Amini and A. Minca, “Epidemic spreading and qquilibrium social distancing in geterogeneous networks,” Dyn Games Appl, vol. 12, no. 1, pp. 258–287, 2022.
View at: Publisher Site | Google Scholar
M. Bellingeri, D. Bevacqua, F. Scotognella, R. Alfieri, and D. Cassi, “A comparative analysis of link removal strategies in real complex weighted networks,” Scientific Reports, vol. 10, pp. 3911–3915, 2020b.
View at: Publisher Site | Google Scholar
M. Bellingeri, M. Turchetto, D. Bevacqua et al., “Modeling the consequences of social distancing over epidemics spreading in complex social networks: from link removal analysis to SARS-CoV- 2 Prevention,” Frontiers in Physiology, vol. 9, Article ID 681343, 2021.
View at: Publisher Site | Google Scholar
D. Achlioptas, R. M. D’souza, and J. Spencer, “Explosive percolation in random networks,” Science, vol. 323, no. 5920, pp. 1453–1455, 2009.
View at: Google Scholar
G. Dong, J. Fan, L. M. Shekhtman et al., “Resilience of networks with community structure behaves as if under an external field,” Proceedings of the National Academy of Sciences, vol. 115, no. 27, pp. 6911–6915, 2018.
View at: Publisher Site | Google Scholar
O. Riordan and L. Warnke, “Explosive percolation is continuous,” Science, vol. 333, no. 6040, pp. 322–324, 2011.
View at: Publisher Site | Google Scholar
Y. Sun, C. Liu, C.-X. Zhang, and Z.-K. Zhang, “Epidemic spreading on weighted complex networks,” Physics Letters A, vol. 378, no. 7-8, pp. 635–640, 2014.
View at: Publisher Site | Google Scholar
A. Majdandzic, B. Podobnik, S. V. Buldyrev, D. Y. Kenett, S. Havlin, and H. Eugene Stanley, “Spontaneous recovery in dynamical networks,” Nature Physics, vol. 10, no. 1, pp. 34–38, 2014.
View at: Publisher Site | Google Scholar
S. Wandelt, X. Shi, X. Sun, and M. Zanin, “Community detection boosts network dismantling on real-world networks,” IEEE Access, vol. 8, pp. 111954–111965, 2020.
View at: Publisher Site | Google Scholar
R. Albert, H. Jeong, and A.-L. Barabasi, “Diameter of the world-wide web,” Nature, vol. 401, pp. 130-131, 1999.
View at: Google Scholar
R. Cohen, K. Erez, D. ben Avraham, and S. Havlin, “Resilience of the internet to random breakdowns,” Physical Review Letters, vol. 85, no. 21, pp. 4626–4628, 2000.
View at: Publisher Site | Google Scholar
Q. Nguyen, T. V. Vu, H. D. Dinh et al., “Modularity affects the robustness of scale-free model and real-world social networks under betweenness and degree-based node attack,” Appl Netw Sci, vol. 6, no. 1, p. 82, 2021b.
View at: Publisher Site | Google Scholar
X. Sun, V. Gollnick, and S. Wandelt, “Robustness analysis metrics for worldwide airport network: a comprehensive study,” Chinese Journal of Aeronautics, vol. 30, no. 2, pp. 500–512, 2017.
View at: Publisher Site | Google Scholar
Q. Nguyen, N. K. K. Nguyen, D. Cassi, and M. Bellingeri, “New betweenness centrality node attack strategies for real-world complex weighted networks,” Complexity, vol. 2021, Article ID 1677445, pp. 1–17, 2021a.
View at: Publisher Site | Google Scholar
L. C. Freeman, “A set of measures of centrality based on betweenness,” Sociometry, vol. 40, no. 1, p. 35, 1977.
View at: Publisher Site | Google Scholar
S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications, Cambridge University Press, Cambridge UK, 1994.
U. Brandes, “A faster algorithm for betweenness centrality,” Journal of Mathematical Sociology, vol. 25, no. 2, pp. 163–177, 2001.
View at: Publisher Site | Google Scholar
Q. Nguyen and T. Trang Le, “Structure and robustness of Facebook’s pages networks,” in Proceedings of the 2019 the 10th Conference on Network Modeling and Analysis (Marami 2019), Dijon, France, November 2019.
View at: Google Scholar
M. E. J. Newman, “Power laws, Pareto distributions and Zipf’s law,” Contemporary Physics, vol. 46, no. 5, pp. 323–351, 2005.
View at: Publisher Site | Google Scholar
S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, Cambridge UK, 2014.
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, NV, USA, June 2016.
View at: Google Scholar
K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Computer Vision – ECCV 2016, J. M. ., N. S. Leibe and M. Welling, Eds., vol. 9908B, pp. 630–645, Springer International Publishing, Cham, 2016.
View at: Google Scholar
Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
View at: Publisher Site | Google Scholar
M. E. J. Newman, “Spread of epidemic Disease on networks,” Physical Review E - Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics, vol. 66, no. 1, Article ID 016128, 2002.
View at: Publisher Site | Google Scholar
M. E. J. Newman, “The structure and function of complex networks,” SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.
View at: Publisher Site | Google Scholar
D. S. Callaway, M. E. J. Newman, S. H. Strogatz, and D. J. Watts, “Network robustness and Fragility: percolation on random graphs,” Physical Review Letters, vol. 85, no. 25, pp. 5468–5471, 2000.
View at: Publisher Site | Google Scholar
S. Wasserman and K. Faust, Social Network Analysis: A Handbook, Sage, CA, USA, 2000.
M. Bellingeri, D. Bevacqua, F. Scotognella, and D. Cassi, “The heterogeneity in link weights may decrease the robustness of real-world complex weighted networks,” Scientific Reports, vol. 9, no. 1, Article ID 10692, 2019.
View at: Publisher Site | Google Scholar
A. S. Goldberger, “Classical linear regression,” Econometric Theory, John Wiley & Sons, NY, USA, 1964.
View at: Google Scholar
F. Hayashi, Econometrics, Princeton University Press, NJ, USA, 2000.
L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Florida, 1984.
J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan kaufmann, Burlington, Massachusetts, USA, 2006.
L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001a.
View at: Publisher Site | Google Scholar
R. Genuer, J. M. Poggi, and C. Tuleau-Malot, “Variable selection using random forests,” Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.
View at: Publisher Site | Google Scholar
G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications in R, Springer, NY, USA, 2013.
R. Albert and A.-L. Baraba ́si, “Statistical mechanics of complex networks,” Reviews of Modern Physics, vol. 74, no. 1, pp. 47–97, 2002.
View at: Publisher Site | Google Scholar
R. H. Jones and B. A. Molitoris, “A statistical method for determining the breakpoint of two lines,” Analytical Biochemistry, vol. 141, no. 1, pp. 287–290, 1984 Aug 15.
View at: Publisher Site | Google Scholar
R. Albert, H. Jeong, and A.-L. Baraba ́si, “Error and attack tolerance of complex networks,” Nature, vol. 406, no. 6794, pp. 378–382, 2000.
View at: Publisher Site | Google Scholar
A. Bavelas, “Communication gatterns in task‐oriented groups,” Journal of the Acoustical Society of America, vol. 22, no. 6, pp. 725–730, 1950.
View at: Publisher Site | Google Scholar
M. Bellingeri and D. Cassi, “Robustness of weighted networks,” Physica A: Statistical Mechanics and Its Applications, vol. 489, pp. 47–55, 2018.
View at: Publisher Site | Google Scholar
P. Holme, B. J. Kim, C. N. Yoon, and S. K. Han, “Attack vulnerability of complex networks,” Physical Review A, vol. 65, no. 5, Article ID 056109, 2002.
View at: Publisher Site | Google Scholar
M. E. J. Newman, “Assortative mixing in networks,” Physical Review Letters, vol. 89, no. 20, Article ID 208701, 28 October 2002.
View at: Publisher Site | Google Scholar
G. Sabidussi, “The centrality index of a graph,” Psychometrika, vol. 31, no. 4, pp. 581–603, 1966.
View at: Publisher Site | Google Scholar
Y. Sun, C. Liu, C.-X. Zhang, and Z.-K. Zhang, “Epidemic spreading on weighted complex networks,” Physics Letters A, vol. 378, no. 7-8, pp. 635–640, 2014.
View at: Google Scholar

Copyright

Copyright © 2022 Ngoc-Kim-Khanh Nguyen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

PDF Download Citation

Download other formats

Order printed copies

Views

692

Downloads

559

Citations

Complexity

Frontiers in Data-Driven Methods for Understanding, Prediction, and Control of Complex Systems 2022

Predicting the Robustness of Large Real-World Social Networks Using a Machine Learning Model

Abstract

1. Introduction

2. Real-World Social Network Datasets and Robustness Estimation

2.1. Network Robustness Monte Carlo Simulation

3. Machine Learning Approach

3.1. Simple Linear Regression Model (SLR)

3.2. Multiple Linear Regression Model

3.3. Random Forest Model

3.4. Data Preparation, Validation, and Performance Evaluation

4. Results

4.1. Network Robustness as a Function of the NSIs and SLR T-Test

4.2. Machine Learning Prediction of Network Robustness

5. Discussion and Conclusion

Appendix

Abbreviations

Data Availability

Conflicts of Interest

Authors’ Contributions

Acknowledgments

References

Copyright