Abstract

Telecommunication network fraud crimes frequently occur in China. Predicting the number and trend of telecommunication network fraud will be of great significance to combating crimes and protecting the legal property of citizens. This paper proposes a combined model of predicting telecommunication network fraud crimes based on the Regression-LSTM model. First, we find that there is a strong correlation between privacy data illegally sold on the dark web and telecommunication network fraud data. Hence, this paper constructs a Linear Regression model using the privacy data illegally sold on the dark web to predict the number of telecommunication network fraud crimes. Second, an LSTM prediction model is constructed using the data of telecommunication network fraud cases on China Judgments Online based on the time-series feature of telecommunication network fraud crimes. Third, this paper uses the error reciprocal method to combine the two models for prediction. In addition, this paper selects the monthly data set of telecommunication network fraud occurring in 2021 for experimental evaluation. The experimental results show that the accuracy of the Regression-LSTM model constructed in this paper is 86.80%, and the RMSE is 0.149. Compared with the ARIMA, Linear Regression, LSTM, Additive-ARIMA-LSTM, and Multiplicative-ARIMA-LSTM models, the Regression-LSTM model proposed has the highest prediction accuracy.

1. Introduction

Telecommunication network fraud refers to the crime of setting traps, defrauding victims remotely, and inducing them to transfer money through text messages, telephone calls, and the Internet. Telecom fraud crimes have been increasing steadily in China in recent years, causing substantial property losses. It is therefore necessary to take measures, such as predicting the number of telecom fraud crimes, to curb their high incidence. Such prediction is of great significance in helping the public security departments implement early warning and prevention. However, to our knowledge, no one has yet studied the prediction of the number of telecom fraud crimes.

This paper is aimed at building a model of predicting telecommunication fraud by using big data, e.g., the data of personal privacy traded on the dark web and the data of telecommunication network fraud cases on China Judgments Online. The data of personal privacy traded on the dark web is essential to building the predicting model. For criminals, obtaining accurate personal information about the target group is a prerequisite before they commit telecommunication network fraud. Nowadays, the illegal trading of such personal information is mainly made in trading forums on the dark web [1, 2]. Therefore, we should consider the data of personal privacy traded on the dark web as an essential source for building the predicting model of telecom fraud crimes. Furthermore, the occurrence of telecom fraud crimes has its own timing characteristics, so we also use the data of telecommunication network fraud cases on China Judgments Online to analyze the quantitative characteristics of crimes over time.

Hence, this paper builds a combined model (i.e., Regression-LSTM) by using the Linear Regression and the LSTM model to predict the number of telecom fraud crimes. The Linear Regression submodel is used for finding the causality feature and correlation between the trading data of the dark web and the crimes and then predicting the number of telecom fraud crimes. In addition, the LSTM submodel is used for mining the time-series feature of the crimes and predicting the quantitative characteristics with the monthly data set of telecommunication fraud cases on China Judgments Online.

Different from general prediction models only based on the time-series feature, the Regression-LSTM model considers both the causality feature and the time-series feature of telecom fraud crimes. This paper uses the error reciprocal method to make a weighted combination of the Linear Regression submodel and the LSTM submodel. The Regression-LSTM model is compared with the ARIMA model, Linear Regression model, LSTM model, Additive-ARIMA-LSTM model, and Multiplicative-ARIMA-LSTM model, and it obtains the best prediction effect.

At present, academic research on the crime of telecommunication network fraud mainly focuses on its causes, prevention and control measures [3–5], and governance schemes [6, 7]. Few scholars study the prediction of the number and trend of telecommunication network fraud. Currently, crime prediction focuses mostly on crimes other than telecom fraud. For example, Yu et al. [8] predict the trend of juvenile delinquency based on the grey prediction method. Yu et al. [9] use a fuzzy BP neural network to predict the number of crimes in Dalian over the years 2000-2009. Feng et al. [10] use the Prophet model and the LSTM model to predict the number of crimes in three major cities in the United States. Chen et al. [11] use the Gray-Markov model to analyze the weekly crime rate of property intrusion in Langfang City from January to June 2013. Tu and Chen’s study [12] is based on the ARIMA-LSSVM mixed model for the combined prediction of small sample crime time series. Jha et al. [13] propose an improved ARIMA model and an improved ANN model to predict the number of crimes in India in the next five years. Zhai et al. [14] use the Prophet model to predict the number of crimes in a long-term time series. Shen et al. [15] use LSTM to construct an optimal binary alarm data model and frequency statistical data model to predict the probability and number of crimes in each region of a city in China. Hossain et al. [16] predict the number of crimes in San Francisco from 2003 to 2015 based on the Adaboost improved random forest algorithm. Yan and Hou [17] predict the changing trend of the number of daily theft crimes based on the LSTM model. Liu [18] predicts the number of crimes in large cities based on an improved LSTNet model.

The above prediction methods for various crimes are mainly based on the time-series data to predict the risk of crimes in a future period. The influence of factors besides time on criminal activity is not considered.

Compared with the above studies only based on time-series features of crimes, this paper combines the causality feature and the time-series feature of telecom fraud crimes. This paper finds a strong correlation between personal privacy data traded on the dark web and telecommunication network fraud crimes. On this basis, a Linear Regression prediction model is built for telecommunication network fraud crimes. In addition, we use the real telecom fraud cases from January 2014 to December 2021 on China Judgments Online to build an LSTM network model that can effectively learn the time-series feature of telecom fraud crimes. We further build a Regression-LSTM model based on these two models with the error reciprocal method. Regression-LSTM can fully combine the respective advantages of the Linear Regression prediction model and the LSTM prediction model to achieve better prediction results.

3. Introduction to Experimental Data

The data used in this paper mainly includes personal privacy data traded on the dark web and judgment document data of telecom fraud published on China Judgments Online. Personal privacy data traded on the dark web is a critical data source. It is an important channel for public security departments to investigate and combat crimes. The judgment document data of telecom fraud published on China Judgments Online is real telecommunication fraud case data solved by the police departments.

3.1. Personal Privacy Data on the Dark Web

First, we enter the dark web with specific technical means and choose the most popular Chinese trading forum containing multiple trading sections on the dark web as our data source. Next, we find out that the personal privacy data sold in the section “Data Resources” is closely related to the crime of telecommunication network fraud. So we use crawlers to obtain the whole data for this section. From June 2018 to June 2021, there are 4,340 posts related to personal privacy in this section. The statistical figure of the number of new posts on the dark web per month is shown in Figure 1. As can be seen from the figure, the privacy data leaked on the dark web every month has an obvious cyclical fluctuation in quantity.

3.2. Telecommunication Network Fraud Data on China Judgments Online

China Judgments Online (https://wenshu.court.gov.cn/) publishes information on various solved cases, which allows us to query cases related to telecommunication network fraud. According to China Judgments Online, there were 9,190 judgment documents related to telecommunication network fraud from 2010 to 2021. Among these judgment documents, there are 332 about fake prosecutor fraud, 288 about feelings cheating fraud, 227 about fake customer service fraud, 181 about abnormal online shopping fraud, 54 about fake boss fraud, 67 about fake teacher fraud, 2,420 about loan fraud, 449 about fake credit fraud, 43 about fake ticket fraud, 213 about fake acquaintance fraud, 62 about game fraud, 19 about naked chat blackmail, and 4,835 about various other forms of telecommunication network fraud.

The number distribution of telecom network fraud cases is shown in Figure 2. It can be seen from the figure that from 2014 to 2017, the number of telecommunication network fraud crimes in China was relatively small. After 2018, telecommunication fraud crimes surged and showed periodic fluctuations. This paper selects the data of telecommunication fraud crimes on China Judgments Online from January 2014 to December 2021 for time series prediction.

4. Prediction of Telecom Fraud Crime Based on Regression-LSTM

Due to the strong correlation between the personal privacy data sold on the dark web and telecommunication network fraud crimes, we establish a Linear Regression model to predict telecommunication network fraud crimes. Since the criminal case data is time-series data, it is possible to conduct time-series modeling and predict the number and trend of future crimes. Therefore, we also establish an LSTM model to predict telecommunication network fraud crimes.

For the above two models, although the Linear Regression model can predict the number of telecom fraud crimes based on the causality feature, the model does not consider the time-series feature of the crimes themselves. Meanwhile, although the LSTM model can predict the number of telecom fraud crimes based on the time-series feature, it ignores the causality feature of the crimes. Therefore, this paper further uses the error reciprocal method to perform the weighted summation of the two models, constructing a telecom fraud prediction model based on the combined Regression-LSTM model. The architecture of this combined model is shown in Figure 3.

Based on the dark web privacy data and the telecom fraud data obtained by crawlers, this paper first constructs a correlation matrix between the two. It then uses Spearman correlation analysis to verify that the two data sets are strongly correlated. Because of this strong correlation, this paper normalizes the data and trains a Linear Regression model to predict telecommunication fraud. Then, based on the time-series features of the telecom fraud data, we standardize the telecom fraud data and calculate the loss function. An LSTM telecom fraud prediction model is constructed with the Adam optimization algorithm and the minimum mean square error (MMSE) principle. Finally, we combine these two models with the error reciprocal method to construct a Regression-LSTM model to predict the monthly number of telecom fraud cases in 2021.

4.1. Prediction of Telecom Fraud Based on Linear Regression
4.1.1. Correlation Matrix

According to the announcement of the Beijing Anti-Telecom Network Fraud Center, there are currently twelve main types of telecom fraud crimes: fake prosecutor fraud, feelings cheating fraud, fake customer service fraud, abnormal online shopping fraud, fake boss fraud, fake teacher fraud, loan fraud, fake credit fraud, fake ticket fraud, fake acquaintance fraud, game fraud, and naked chat blackmail. Each kind of transaction data on the dark web may be used for telecom fraud. For example, loan data can be used for loan fraud; student data can be used for fake teacher fraud; shopping data can be used for abnormal online shopping fraud. We study the criminal process of these twelve types of telecom fraud and manually analyze the posts of privacy data transactions on the dark web. When a certain transaction post on the dark web contains content that can be used for some fraud type, the corresponding entry is marked as 1. Otherwise, it is marked as 0. Based on this, a correlation matrix $A$ with $m$ rows and $n$ columns is constructed:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$

where $A$ is a 0/1 matrix and $a_{ij}$ indicates whether the $i$th post is related to the $j$th fraud type: the value is 1 if they are related and 0 otherwise. We sample 1000 posts to study the relationship between the posts and the twelve case types. At last, we construct a matrix of 1000 rows and 12 columns ($m = 1000$, $n = 12$), as shown in Figure 4.

After further analysis of the data sold by each post in the forum, we find that the number of data records sold by different posts is often different. For example, a single post titled “83W personal information_rich_manager, etc.” sells 830,000 privacy records. Therefore, the number of privacy records involved in each post should also be considered when constructing the correlation matrix. On this basis, we refine the 0/1 correlation matrix $A$ into a weighted correlation matrix $B$ with the same $m$ rows and $n$ columns:

$$B = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\ b_{21} & b_{22} & \cdots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{m1} & b_{m2} & \cdots & b_{mn} \end{pmatrix}$$

In this matrix, a nonzero $b_{ij}$ indicates that the $i$th post is related to the $j$th type of case, and the number of privacy records sold by this post is $b_{ij}$. $B$ replaces each 1 in $A$ with the number of data records sold by the post related to that type of case and keeps each 0. For example, $b_{ij} = 90{,}000$ means that the $i$th post is related to the $j$th type of case and that the post published on the dark web contains 90,000 privacy records. Part of the content of matrix $B$ is shown in Figure 5.

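Finally, we sum each column of the weighted correlation matrix separately to construct a matrix $C$ with 1 row and $n$ columns:

$$C = (c_1, c_2, \ldots, c_n)$$

In this matrix, $c_j$ is as follows:

$$c_j = \sum_{i=1}^{m} b_{ij}$$

where $b_{ij}$ is the entry in row $i$ and column $j$ of the weighted correlation matrix and $m$ is the number of posts. $C$ records the total amount of privacy data for each type of telecom fraud crime. For example, $c_1 = 1.2756833$ billion is the number of data records that can be used for fake prosecutor fraud, and $c_2$ is the corresponding number of data records that can be used for feelings cheating fraud.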
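The construction described in this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three posts, their record counts, the reduced list of fraud types, and the function name `build_matrices` are hypothetical and are not taken from the data set used in this paper.

```python
# Toy illustration of the 0/1 correlation matrix, its record-weighted
# version, and the per-type totals. All values are made up.
FRAUD_TYPES = ["fake prosecutor", "loan", "online shopping"]  # hypothetical subset

# Each post: (number of records sold, fraud types the data could support)
posts = [
    (830_000, {"fake prosecutor", "loan"}),
    (90_000, {"online shopping"}),
    (5_000, {"loan"}),
]


def build_matrices(posts, fraud_types):
    """Return the 0/1 matrix, the weighted matrix, and the column sums."""
    # 1 if the post can be used for the fraud type, else 0
    binary = [[1 if t in tags else 0 for t in fraud_types] for _, tags in posts]
    # replace each 1 with the number of records sold by that post
    weighted = [
        [flag * records for flag in row]
        for row, (records, _) in zip(binary, posts)
    ]
    # total records per fraud type (sum of each column)
    totals = [sum(column) for column in zip(*weighted)]
    return binary, weighted, totals


binary, weighted, totals = build_matrices(posts, FRAUD_TYPES)
print(totals)
```

A post that can serve several fraud types contributes its full record count to each of the corresponding columns, which is how the weighted matrix is described above.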

4.1.2. Correlation Analysis between Privacy Data and Telecom Fraud Cases

(1) Comparison between the Number of Judgment Documents for Telecom Fraud and the Amount of Privacy Data Sold on the Dark Web. How is the number of cases published on China Judgments Online distributed and whether there is some similarity or correlation with the distribution of the number of personal privacy data records sold on the dark web? To solve this question, we sum up each column of the weighted correlation matrix separately to calculate the total amount of data that could be used to commit each type of telecommunication network fraud. Furthermore, combined with the number of various types of judgment documents for telecom fraud on China Judgments Online, a numerical comparison study was conducted, as shown in Table 1.

As can be seen from Table 1, the amount of privacy data sold on the dark web is very large. Among all types of privacy data, the amount of data usable for loan fraud is the largest (2.539149668 billion), followed by the data usable for naked chat blackmail (1.878473666 billion) and for fake prosecutor fraud (1.2756833 billion). Among the judgment documents, loan fraud cases are the most numerous (2,420), followed by fake credit fraud (449), fake prosecutor fraud (332), and feelings cheating fraud (288). The amount of privacy data usable for loan fraud is thus the largest, which is consistent with loan fraud being the most common type in the judgment documents. This may help explain why loan fraud crimes are so common in real life.

In order to further analyze correlations behind these two sets of data, according to Table 1, we have drawn the distribution curve of the number of judgment documents for telecom fraud published on China Judgments Online and the distribution curve of the number of personal privacy data records sold on the dark web, as shown in Figure 6.

Among them, the horizontal axis represents 12 types of telecom frauds, the left vertical axis represents the number of personal privacy data records sold on the dark web, and the right vertical axis represents the number of judgment documents for telecom fraud.

As Figure 6 shows, it can be concluded that the distribution of the quantity of data traded on the dark web and the distribution of the number of judgment documents on China Judgments Online are generally consistent. The reason for this may be as follows: Before the occurrence of telecom network fraud, some related data has been sold on the dark web, so criminals are likely to buy this data for precise fraud. Since the number of different types of data on the dark web is different, the probability of criminals committing corresponding types of telecom fraud is also different. Additionally, criminals will keep buying such privacy data if they find that data-backed frauds are easier to pull off. At the same time, more and more such privacy data will be transported to the dark web through the underground black industry chain. In the end, a mutual causality cycle is formed, making illegal privacy data transactions and telecom fraud on the dark web expand viciously.

It is worth noting that the distribution on the right side of the two curves has opposite trends. That is to say, the number of cases of game fraud and naked chat blackmail published on China Judgments Online is relatively small, while the number of corresponding types of data transactions on the dark web is relatively large.

For game fraud, there is a lot of such privacy data sold on the dark web. Accordingly, there are many potential defrauded people. However, the number of judgment documents about such cases is relatively small. The reasons may be as follows: First, from the perspective of victims, the majority of people participating in online games are minors. This group lacks legal awareness. Once they encounter game fraud, most of them will not choose to call the police. Secondly, in terms of family supervision, many teenagers will be supervised by their parents in the financial investment of online games, thus reducing the risk of being defrauded. Thirdly, game fraud crimes are inherently difficult to detect. Many games in the market are not strictly regulated, and there are few effective ways to find criminals, resulting in a low crime-solved rate. Fourthly, the force of police departments is generally insufficient at present, and the work intensity of police is very high.

For naked chat blackmail, there is also a lot of privacy data, such as “single dating” and “hotel accommodation,” sold on the dark web, which may be used for naked chat blackmail. However, the number of judgment documents in such cases is relatively small. The main reason may be that many victims constrained by reputation are always unwilling to call the police.

(2) Correlation Analysis Based on Spearman. The preliminary statistical comparison and manual qualitative analysis above suggest that a correlation exists between the posts and telecom fraud crimes. Next, we use Spearman correlation analysis to calculate the correlation coefficient and obtain a quantitative measure of the degree of correlation.

Correlation analysis refers to the analysis of the degree of correlation and closeness between two or more variables. The correlation coefficient is a measure of correlation analysis. The value of the correlation coefficient is between [-1, +1], which represents the degree of association between two random variables [19]. When the correlation coefficient is less than 0, it is called negative correlation; when it is greater than 0, it is called positive correlation; when it is equal to 0, it is called zero correlation. The larger the absolute value of the correlation coefficient, the stronger the correlation between the two variables. The closer the correlation coefficient value is to -1 or +1, the closer the relationship is; when it is close to 0, the relationship is distant. Generally speaking, when the absolute value of the correlation coefficient is less than 0.2, it indicates a very weak correlation or no correlation. When it is between 0.2 and 0.4, it indicates a weak correlation. When it is between 0.4 and 0.6, it indicates a moderate degree of correlation, and when it is between 0.6 and 0.8, it indicates a strong correlation, and when it is greater than 0.8, it indicates a very strong correlation.
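The interpretation bands above can be written as a small helper. This is a sketch only: the function name `correlation_strength` is hypothetical, and the handling of the exact boundary values (0.2, 0.4, 0.6, 0.8) is an assumption, since the text does not say to which band a boundary value belongs.

```python
def correlation_strength(coefficient):
    """Map a correlation coefficient to the qualitative bands in the text.

    Boundary handling is an assumption: each band includes its lower bound,
    and exactly 0.8 is still counted as "strong".
    """
    magnitude = abs(coefficient)  # only the absolute value matters
    if magnitude < 0.2:
        return "very weak or none"
    if magnitude < 0.4:
        return "weak"
    if magnitude < 0.6:
        return "moderate"
    if magnitude <= 0.8:
        return "strong"
    return "very strong"


print(correlation_strength(0.782))
```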

The commonly used correlation analysis methods mainly include Spearman and Pearson. Spearman is a nonparametric method that does not depend on the data distribution, whereas Pearson correlation analysis requires the sample data to follow a normal distribution [20]. To choose a suitable correlation analysis method, we first test the normality of the sample data.

In this paper, with SPSS 25.0, we use Kolmogorov-Smirnov single-sample normality test method to test the normality of the sample data. The results are shown in Table 2.

As can be seen from Table 2, the significance level of the sample data is less than 0.05, meaning that the sample data does not obey the normal distribution. Therefore, Spearman is chosen to verify the correlation between data on the posts on the dark web and telecom fraud cases.

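The principle of Spearman correlation analysis is as follows [21]:

$$\rho = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \sum_{i}(y_i - \bar{y})^2}}$$

where $x_i$ and $y_i$ are the ranks of the two variables, $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$, respectively, and $\rho$ is the result of the calculation after ranking the variables separately.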
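As an illustration, the rank-based coefficient can be computed as below. This is a minimal sketch rather than the SPSS procedure used in this paper; the function names `average_ranks` and `spearman` are hypothetical, and tied values are given the mean of their ranks, which is a common convention assumed here.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values equal to values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))
```

Because only the ranks enter the calculation, any monotonic rescaling of either variable leaves the coefficient unchanged, which is why the method does not require normally distributed data.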

Before calculating the correlation between the number of various telecom fraud crimes and the number of corresponding posts on the dark web, according to the above discussion, game fraud and naked chat blackmail are excluded. The results of the Spearman correlation analysis are shown in Table 3.

As can be seen from Table 3, the correlation coefficient between the amount of privacy data sold on the dark web and the number of different types of fraud cases is 0.782, which is at a strong correlation level, indicating that there is a strong correlation between the posts on the dark web and telecom fraud cases.

4.1.3. Prediction of the Number and the Trend of Telecommunication Fraud

As concluded above, because of the strong correlation between the privacy data on the dark web and telecommunication fraud data, we can predict the number of telecommunication fraud cases in the future based on the number of privacy data sold on the dark web.

(1) The Interval between Posting Time and Case Closing Time. It usually takes a while for criminals to commit telecom fraud after obtaining the data, and it also takes a while for the public security department to solve the crime after it is reported. Therefore, the online publication of a judgment document after a case is solved necessarily lags behind the corresponding privacy data transaction on the dark web. To predict the number of telecom fraud crimes, we need to estimate this typical interval.

From August 2018 to December 2021 (41 months), the monthly number of posts on the dark web and the monthly number of judgment documents for telecommunication fraud on China Judgments Online are counted. Part of the data is shown in Table 4. We define two vectors, $X$ and $Y_k$, representing the set of posts on the dark web and the set of judgment documents, respectively.

$X = (x_1, x_2, \ldots, x_{12})$, where $x_i$ indicates the number of posts posted on the dark web in the $i$th month of 2019. The dimension of $X$ is 12, which indicates that $X$ is a data set of posts for 12 consecutive months on the dark web.

$Y_k = (y_{1+k}, y_{2+k}, \ldots, y_{12+k})$, where $y_{i+k}$ indicates the number of telecom fraud documents on China Judgments Online in the $(i+k)$th month. The dimension of $Y_k$ is 12, which indicates that $Y_k$ is a data set of judgment documents for telecommunication fraud on China Judgments Online for 12 consecutive months. The value range of $k$ is the integer interval [0, 17]. When $k$ takes values from small to large in [0, 17], a set of vectors $Y_0, Y_1, \ldots, Y_{17}$ is obtained. $Y_0$ is the data set of judgment documents in the same year and months as $X$. $Y_k$ is the data set of judgment documents shifted by $k$ months from the months covered by $X$.

We sequentially calculate the correlation between $X$ and each vector $Y_k$, obtaining 18 Spearman correlation coefficient values, as shown in Table 5.

The largest Spearman correlation coefficient is 0.783, and the corresponding $k$ is 13, which means that the Spearman correlation coefficient between $X$ and $Y_k$ is largest when $Y_k$ is 13 months later than $X$. As a result, we infer that the interval from the time when posts appear on the dark web to the time when the telecom fraud case is solved is about 13 months.
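The lag search described above can be sketched as follows. The monthly counts are synthetic and constructed so that the best lag is 3 months; the function names and the shortened lag range are hypothetical, and the rank helper assumes there are no tied values (a simplification that the real data may not satisfy).

```python
def ranks(values):
    """Ranks 1..n; assumes all values are distinct (no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman_no_ties(x, y):
    """Spearman coefficient via the rank-difference formula (no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def best_lag(posts, documents, window=12, max_lag=17):
    """Correlate the post window with each shifted document window."""
    scores = {}
    for k in range(max_lag + 1):
        shifted = documents[k:k + window]
        if len(shifted) < window:  # not enough months left for this lag
            break
        scores[k] = spearman_no_ties(posts[:window], shifted)
    return max(scores, key=scores.get), scores


# Synthetic example: documents repeat the post pattern 3 months later.
posts = [5, 9, 2, 7, 11, 4, 8, 1, 10, 3, 12, 6]
documents = [33, 77, 15] + [10 * p for p in posts] + [5, 25, 35]
lag, scores = best_lag(posts, documents, max_lag=5)
print(lag, scores[lag])
```

With the real data the same loop would run over lags 0 to 17 and return the lag with the largest coefficient, which the paper reports as 13 months.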

(2) Linear Regression Prediction Model. This paper uses a Linear Regression model to predict the number of telecom fraud cases. First, we construct a training data set with the number of posts on the dark web from August 2018 to November 2019 and the number of telecom fraud cases 13 months later (from September 2019). Then, we use the least-squares method to perform linear fitting on the training data set and complete the model training after obtaining the parameter values in the Linear Regression model. The specific process is as follows:

We define a vector $X = (x_1, x_2, \ldots, x_{16})$, where $X$ represents a data set consisting of the number of posts on the dark web for 16 consecutive months, and $x_i$ indicates the number of posts in the $i$th consecutive month since August 2018. The value of $X$ is as follows:

We define a vector $Y_k$, where $k = 13$; thus, $Y_{13} = (y_1, y_2, \ldots, y_{16})$, where $Y_{13}$ represents the data set of judgment documents for telecommunication fraud for 16 consecutive months from September 2019. The value of $Y_{13}$ is as follows:

Based on the data sets $X$ and $Y_{13}$, we draw a quantitative comparison chart, as shown in Figure 7.

It can be seen from Figure 7 that the number of posts on the dark web has a similar distribution to the number of telecom fraud judgment documents. The trends of the two curves are roughly the same. In addition, the Spearman correlation coefficient of $X$ and $Y_{13}$ is 0.776, indicating a strong correlation between posts on the dark web and telecom fraud judgment documents. This supports the inference that the interval between the posting time and the case closing time is about 13 months.

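Before training the linear model, we normalize the vector $X$ of post counts:

The least squares Linear Regression model is as follows:

$$\hat{y} = wx + b$$

In the above formula, $w = \dfrac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2}$, $b = \bar{y} - w\bar{x}$.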
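A minimal sketch of the fitting step is given below, using the standard closed-form least-squares solution for a single predictor. The min-max scaling to [0, 1] is an assumption (the exact normalization used for the post counts is not recoverable from the text), and the data values and function names are hypothetical.

```python
def min_max_normalize(values):
    """Scale values to [0, 1]; assumed normalization, not confirmed by the text."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def fit_least_squares(x, y):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    n = len(actual)
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5


# Hypothetical monthly post counts and case counts 13 months later.
post_counts = [10, 20, 30, 40]
case_counts = [25, 45, 65, 85]

x = min_max_normalize(post_counts)
slope, intercept = fit_least_squares(x, case_counts)
fitted = [slope * v + intercept for v in x]
print(slope, intercept, rmse(case_counts, fitted))
```

The toy data lie exactly on a line, so the training error is zero here; with real counts the RMSE and the goodness of fit would be evaluated as in the following paragraphs.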

The training set is constructed from the normalized post counts $X$ and the judgment document counts $Y_{13}$, and the training result is shown in Figure 8.

The RMSE (Root Mean Square Error) of the training data is 32, and the $R^2$ is 0.63, indicating that the regression model fits well. The significance test of the regression also indicates that the results of the model are reliable.

The slope $w$ and the intercept $b$ are obtained by training, so the Linear Regression model is as follows:

We use the trained Linear Regression model to make predictions. This paper predicts the number of telecommunication fraud cases to be closed each month in 2021 based on the number of monthly posts on the dark web from December 2019 to November 2020.

We define a data set $X'$ that represents the number of posts on the dark web from December 2019 to November 2020, where $x'_i$ represents the number of posts on the dark web for the $i$th consecutive month since December 2019. The value of $X'$ is as follows:

The number of telecom fraud cases in all months in 2021 predicted by this paper based on the Linear Regression model is represented by the vector $\hat{Y}_{LR}$. We calculate $\hat{Y}_{LR}$ based on $X'$ as follows:

4.2. Prediction of Telecom Network Fraud Crime Based on LSTM Model

Time series prediction of crimes is one of the main directions of criminology. The time-series model can be used to analyze the statistical laws of time-series data and further predict the number of telecom fraud crimes. This paper uses the Long Short-Term Memory (LSTM) model to predict the time series of telecom fraud crimes.

4.2.1. Introduction to LSTM Model

Long Short-Term Memory (LSTM) is a Recurrent Neural Network (RNN) variant. Theoretically, RNN can handle any long-distance dependency problem [22, 23]. However, due to its gradient disappearance, gradient explosion, and other problems, RNN has only short-term memory and cannot achieve long-term preservation of information. Therefore, LSTM solves the long-term dependence of information by adding internal gating mechanisms and memory cells to maintain the long-term preservation of information. The LSTM cell structure is shown in Figure 9.

The LSTM unit includes an input gate $i_t$, a forget gate $f_t$, and an output gate $o_t$. The input and output vectors of the LSTM hidden layer are $x_t$ and $h_t$, and the memory unit is $c_t$. The input gate controls the amount of current input data $x_t$ flowing into $c_t$, that is, the amount of input information saved to $c_t$. The forget gate controls the retention and forgetting of information, that is, the influence of the information $c_{t-1}$ from the previous moment on the current $c_t$, and in this way avoids the gradient disappearance problem caused by the back-propagation of the gradient over time. The output gate controls the influence of memory unit $c_t$ on the current output value $h_t$, that is, the output information of the memory unit at time $t$. The output $h_t$ is affected not only by $x_t$ and $h_{t-1}$ but also by $c_t$. Memory unit $c_t$ is affected not only by $x_t$ and $h_{t-1}$ but also by $c_{t-1}$.

At time $t$, the mathematical expression of each gate state is as follows:

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $i_t$ is the intermediate calculation result of the input information and past information passing through the input gate, and $f_t$ is the intermediate calculation result of the input information and past information passing through the forget gate. $o_t$ is the intermediate calculation result of the input information and past information passing through the output gate. $\tilde{c}_t$ is the intermediate result of the input information and past information passing through the tanh activation function, $\sigma$ is the sigmoid activation function, $\odot$ represents the element-wise (point-by-point) product, $W_i$, $W_f$, $W_o$, and $W_c$ are the weight matrices of the neural network, and $b_i$, $b_f$, $b_o$, and $b_c$ are the bias vectors of the neural network.
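To make the gate computations concrete, the sketch below performs one step of a standard LSTM cell with a scalar input and a scalar hidden state. It is an illustration of the textbook LSTM update, not the network trained in this paper; the parameter layout (`params` with one `(w_x, w_h, b)` triple per gate) and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic activation used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step for a scalar input and hidden size 1.

    params maps each of "input", "forget", "output", "candidate" to a
    (w_x, w_h, b) triple: input weight, recurrent weight, and bias.
    """
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x_t + w_h * h_prev + b)

    i_t = gate("input", sigmoid)          # how much new information to store
    f_t = gate("forget", sigmoid)         # how much old memory to keep
    o_t = gate("output", sigmoid)         # how much memory to expose
    c_tilde = gate("candidate", math.tanh)  # candidate memory content

    c_t = f_t * c_prev + i_t * c_tilde    # updated memory cell
    h_t = o_t * math.tanh(c_t)            # updated hidden state / output
    return h_t, c_t


# With all-zero parameters every gate equals 0.5 and the candidate is 0.
zero_params = {name: (0.0, 0.0, 0.0)
               for name in ("input", "forget", "output", "candidate")}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, params=zero_params)
print(h_t, c_t)
```

In the all-zero example the forget gate keeps half of the previous memory, which shows directly how the memory cell carries information from one time step to the next.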

4.2.2. Test of LSTM Model

This model uses the data of 96 months of telecommunication fraud judgment documents from January 2014 to December 2021. The data on telecommunication fraud occurring during the first 84 months is used as the training set, and the data occurring during the last 12 months is used as the test set.

The time step of the LSTM input sequence is 24, and the label length is 12. This model uses the min/max normalization method, and the normalized data range is [-1, 1]. After the data is fed into the LSTM model, the loss function is calculated. In the model training process, the Adam algorithm is used for optimization, and the parameters are solved by minimizing the Root Mean Square Error.
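The data preparation described above can be sketched as follows. The reading adopted here, in which each training sample uses 24 consecutive months as input and the following 12 months as labels, is an assumption about the stated values 24 and 12; the function names are hypothetical and the series is synthetic.

```python
def scale_to_unit_range(series):
    """Min/max normalization to the interval [-1, 1]."""
    lo, hi = min(series), max(series)
    return [2 * (v - lo) / (hi - lo) - 1 for v in series]


def make_windows(series, input_len=24, label_len=12):
    """Slide over the series, pairing input_len months with the next label_len."""
    samples = []
    last_start = len(series) - input_len - label_len
    for start in range(last_start + 1):
        inputs = series[start:start + input_len]
        labels = series[start + input_len:start + input_len + label_len]
        samples.append((inputs, labels))
    return samples


# Synthetic stand-in for the 84 training months.
training_series = [float(month) for month in range(84)]
scaled = scale_to_unit_range(training_series)
samples = make_windows(scaled)
print(len(samples), len(samples[0][0]), len(samples[0][1]))
```

In practice the minimum and maximum would be taken from the training months only and reused to scale the test months, so that no information from 2021 leaks into training.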

The loss function curves of the training set and the validation set of the model training process are shown in Figure 10, and the accuracy curve of the validation set is shown in Figure 11.

As shown in Figures 10 and 11, the accuracy rate increases with the number of training iterations and then gradually becomes stable. The loss function of the model begins to converge after 20 cycles; the model reaches its optimal performance at this point, and its internal parameters are saved.

This paper repeats the experiment 20 times to predict the number of telecom fraud crimes for all months in 2021 based on LSTM and performs Kolmogorov-Smirnov tests on the obtained results, as shown in Table 6.

As can be seen from Table 6, the significance level of each of the 20 sets of predicted results is greater than 0.05, which means that each set conforms to a normal distribution. Therefore, we use the mean of the 20 sets of results as the prediction result of the LSTM model. It is represented by the vector $\hat{Y}_{LSTM}$:

4.3. Hybrid Prediction Models

The Linear Regression model predicts the number of telecommunication fraud crimes based on the correlation between privacy data and telecommunication fraud. At the same time, the LSTM model studies the time-series feature of telecom fraud crimes to make the prediction. This paper proposes a Regression-LSTM combination model, combining the advantages of the two models to predict the number of telecom fraud crimes.

First, the errors of the Linear Regression model and the LSTM model are calculated separately. Then, the prediction results of the two models are weighted and combined by the error reciprocal method to give the final prediction result $\hat{y}$. The principle of the error reciprocal method is as follows:

$$\hat{y} = \omega_1 \hat{y}_1 + \omega_2 \hat{y}_2, \tag{15}$$

$$\omega_1 = \frac{\varepsilon_2}{\varepsilon_1 + \varepsilon_2}, \tag{16}$$

$$\omega_2 = \frac{\varepsilon_1}{\varepsilon_1 + \varepsilon_2}, \tag{17}$$

where $\omega_i$ represents the weight coefficient of model $i$, $\hat{y}_1$ is the predicted value of the Linear Regression model, $\hat{y}_2$ is the predicted value of the LSTM model, $\varepsilon_1$ is the error of the Linear Regression model, and $\varepsilon_2$ is the error of the LSTM model. According to (15)–(17), the model with the smaller error receives the larger weight, so combining the two models by the error reciprocal method can reduce the overall error. Thus, the Regression-LSTM model can improve the prediction accuracy.
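A minimal sketch of the error reciprocal weighting follows. It assumes the common form of the method, in which each model's weight is proportional to the other model's error so that the weights sum to one. The source does not state which error measure feeds the weights, so `err_reg` and `err_lstm` are treated here as any non-negative error scores; all function names and the sample numbers are hypothetical.

```python
def error_reciprocal_weights(err_reg, err_lstm):
    """Return (w_reg, w_lstm); the model with the smaller error gets the larger weight."""
    total = err_reg + err_lstm
    if total == 0:
        # Both models are error-free, so any split works; use equal weights.
        return 0.5, 0.5
    return err_lstm / total, err_reg / total


def combine_predictions(pred_reg, pred_lstm, err_reg, err_lstm):
    """Weighted sum of the two models' predictions, month by month."""
    w_reg, w_lstm = error_reciprocal_weights(err_reg, err_lstm)
    return [w_reg * r + w_lstm * l for r, l in zip(pred_reg, pred_lstm)]


# Illustrative numbers only: the regression error is twice the LSTM error,
# so the LSTM prediction receives two thirds of the weight.
weights = error_reciprocal_weights(0.2, 0.1)
combined = combine_predictions([1.0, 2.0], [4.0, 5.0], 0.2, 0.1)
```

Because the weights are non-negative and sum to one, each combined value lies between the two individual predictions.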

In addition, the convergence of the Regression-LSTM model follows from that of the Linear Regression model and the LSTM model. Since both component models converge, the Regression-LSTM model, which is a linear combination of the two, also converges.

In order to compare the Regression-LSTM model with existing hybrid models, this paper constructs the Additive-ARIMA-LSTM hybrid model and Multiplicative-ARIMA-LSTM hybrid model for time series prediction of telecom network fraud crimes [24].

In the Additive-ARIMA-LSTM method shown in Figure 12, the time series $y_t$ is considered as the sum of a linear component ($L_t$) and a nonlinear component ($N_t$), as in Equation (18): $y_t = L_t + N_t$. First, a linear model is applied to the time series to obtain the forecasts of the linear component ($\hat{L}_t$). Then, the residual series ($e_t$) is computed by subtracting the linear-component forecasts ($\hat{L}_t$) from the original time series, as in Equation (19): $e_t = y_t - \hat{L}_t$. The residual series is used by a nonlinear model to obtain the forecasts of the nonlinear component ($\hat{N}_t$). The final forecasts are then obtained by adding the linear-component forecasts to the nonlinear-component forecasts, as in Equation (20): $\hat{y}_t = \hat{L}_t + \hat{N}_t$. In this paper, ARIMA is used as the linear model and LSTM as the nonlinear model, which yields the Additive-ARIMA-LSTM model used for forecasting.
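The additive decomposition can be sketched in a few lines. To keep the example self-contained, two deliberately trivial stand-ins replace the real models: `mean_model` stands in for ARIMA and `persistence_model` stands in for the LSTM. Both stand-ins, and the convention that a model returns in-sample fitted values plus out-of-sample forecasts, are assumptions made for illustration only.

```python
def mean_model(series, horizon):
    """Stand-in for the linear model (ARIMA in the paper): predicts the series mean."""
    mean = sum(series) / len(series)
    return [mean] * len(series), [mean] * horizon


def persistence_model(series, horizon):
    """Stand-in for the nonlinear model (LSTM in the paper): repeats the last value."""
    fitted = [series[0]] + series[:-1]
    return fitted, [series[-1]] * horizon


def additive_hybrid(series, fit_linear, fit_nonlinear, horizon):
    """Forecast = linear forecast + forecast of the residuals left by the linear model."""
    linear_fitted, linear_forecast = fit_linear(series, horizon)
    # Residual series: original values minus the linear model's fitted values.
    residuals = [y - l for y, l in zip(series, linear_fitted)]
    _, residual_forecast = fit_nonlinear(residuals, horizon)
    return [l + n for l, n in zip(linear_forecast, residual_forecast)]


# Toy series: mean 13, residuals [-3, -1, 1, 3], so each forecast is 13 + 3.
example = additive_hybrid([10.0, 12.0, 14.0, 16.0], mean_model, persistence_model, 2)
```

Passing the two models in as functions keeps the decomposition logic separate from the choice of linear and nonlinear forecasters.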

In the Multiplicative-ARIMA-LSTM method shown in Figure 13, the time series $y_t$ is considered as the product of a linear component ($L_t$) and a nonlinear component ($N_t$), as in Equation (21): $y_t = L_t \times N_t$. First, a linear model is applied to the time series to obtain the forecasts of the linear component ($\hat{L}_t$). Then, the residual series ($e_t$) is computed by dividing the original time series by the linear-component forecasts ($\hat{L}_t$), as in Equation (22): $e_t = y_t / \hat{L}_t$. The residual series is used by a nonlinear model to obtain the forecasts of the nonlinear component ($\hat{N}_t$). The final forecasts are then computed by multiplying the linear-component forecasts by the nonlinear-component forecasts, as in Equation (23): $\hat{y}_t = \hat{L}_t \times \hat{N}_t$. In this paper, ARIMA is used as the linear model and LSTM as the nonlinear model, which yields the Multiplicative-ARIMA-LSTM model used for forecasting. The multiplicative hybrid method suffers from division by zero when the forecast of the linear component ($\hat{L}_t$) is zero. To avoid this problem, $\hat{L}_t$ is set to 0.1 whenever its value is 0.
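A matching sketch of the multiplicative variant, including the zero guard, is given below. As before, `mean_model` and `persistence_model` are trivial stand-ins for ARIMA and LSTM. Applying the 0.1 replacement to both the fitted values and the forecasts of the linear model is one reading of the description, and is an assumption.

```python
def mean_model(series, horizon):
    """Stand-in for the linear model (ARIMA in the paper): predicts the series mean."""
    mean = sum(series) / len(series)
    return [mean] * len(series), [mean] * horizon


def persistence_model(series, horizon):
    """Stand-in for the nonlinear model (LSTM in the paper): repeats the last value."""
    fitted = [series[0]] + series[:-1]
    return fitted, [series[-1]] * horizon


def guard_zero(values, replacement=0.1):
    """Replace exact zeros so that the division below is always defined."""
    return [replacement if v == 0 else v for v in values]


def multiplicative_hybrid(series, fit_linear, fit_nonlinear, horizon):
    """Forecast = linear forecast * forecast of the ratio series y / linear fit."""
    linear_fitted, linear_forecast = fit_linear(series, horizon)
    linear_fitted = guard_zero(linear_fitted)
    linear_forecast = guard_zero(linear_forecast)
    # Ratio series: original values divided by the linear model's fitted values.
    ratios = [y / l for y, l in zip(series, linear_fitted)]
    _, ratio_forecast = fit_nonlinear(ratios, horizon)
    return [l * r for l, r in zip(linear_forecast, ratio_forecast)]


# Toy series: mean 3, last ratio 6 / 3 = 2, so each forecast is 3 * 2.
example = multiplicative_hybrid([2.0, 4.0, 0.0, 6.0], mean_model, persistence_model, 2)
```

The guard only affects exact zeros. Linear forecasts that are very small but nonzero still produce large ratios, which the nonlinear model must then learn.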

5. Evaluation of the Models

This paper uses Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Root Mean Squared Error (RMSE), accuracy (ACC), and CPU time to compare the performance of the different models. MAE is the average of the absolute errors and is a primary measurement of error. MAPE builds on MAE by expressing each error as a ratio to the actual value. RMSE is the square root of the mean of the squared deviations between the model's predicted values and the real values over the $n$ predictions. It is sensitive to unusually large or small errors in the prediction data, so it reflects the precision of the prediction results well. ACC reflects the accuracy of the prediction results and is an essential measurement for evaluating the model. CPU time describes the time consumption of the algorithm.

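$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,$$

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\%,$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},$$

where $y_i$ is the real value, $\hat{y}_i$ is the predicted value of the model, and $n$ is the number of predictions.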
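The three error metrics can be computed as in the sketch below, which assumes their standard definitions. ACC is left out because its exact definition is not recoverable from the text. The function names and the sample values are illustrative only.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent; undefined if any actual value is 0."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Illustrative values, not taken from the paper's data.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]

mae_value = mae(actual, predicted)    # (10 + 20) / 2
mape_value = mape(actual, predicted)  # (10% + 10%) / 2
rmse_value = rmse(actual, predicted)  # sqrt((100 + 400) / 2)
```

In this example the RMSE (about 15.8) exceeds the MAE (15) because squaring gives the larger of the two errors more influence.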

where $n$ is the total number of predictions and $\varepsilon_i$ is the relative error of the $i$-th prediction, defined as follows:

This paper compares the Linear Regression model, the LSTM model, the ARIMA model, the Additive-ARIMA-LSTM model, and the Multiplicative-ARIMA-LSTM model with the Regression-LSTM model, as shown in Figure 14. The MAE, MAPE, RMSE, ACC values, and CPU time of each model are shown in Table 7. The experimental environment is as follows: AMD Ryzen 5 3600 6-core CPU at 3.60 GHz, 16 GB of memory, Windows 10, Python 3.9, and PyTorch 1.10.0.

It can be seen from Figure 14 that most models predict the number of telecommunication network fraud crimes with high accuracy. The trend and the number of fraud crimes predicted by the Regression-LSTM model are close to the actual number of crimes. The Regression-LSTM model fits the data better and predicts more accurately than the other models. Moreover, according to Table 7, the Regression-LSTM combined model proposed in this paper has the best performance. The MAE and RMSE values of this model are 0.122 and 0.149, smaller than those of the single models. The ACC value reaches 86.80%, the highest prediction accuracy among the compared models. In addition, the CPU time of the Regression-LSTM model is the shortest among the hybrid models.

In Figure 14, the number of telecom fraud crimes approaches 200 almost every month. The number is high in March and October, which suggests that the public security department should pay more attention to crimes in these two months. Although the curve of the real values and the curve of the Regression-LSTM model do not overlap, their shapes and trends are roughly the same, which shows that it is feasible and practical to predict the number and trend of telecom fraud crimes with the Regression-LSTM model.

The Regression-LSTM model has many applications in public security. First, the model can predict how the number of telecom fraud crimes is distributed over time, so that the police department can focus antifraud publicity on high-frequency periods and strengthen early warning and preventive measures in a targeted way. Second, as an intelligent analysis tool for crime prediction, the model can be applied to a city to evaluate its rate of fraud crimes and improve its public security level. Last but not least, the model can improve the level of cyberspace security management. If the number of telecom fraud crimes surges in a certain period of time, this may indicate that a large amount of personal privacy data has been stolen and sold on the dark web. Such a surge reminds internet regulators to track specific virtual identities and crack down on underground illegal trading on the dark web.

6. Conclusion

In this paper, the Regression-LSTM model is proposed to realize the prediction of telecommunication network fraud crimes. Monthly telecom network fraud data of 2021 on China Judgments Online is selected for model testing. Compared with the other models, the proposed combined model can make up for the insufficiency of a single model and better utilize both the causality features of crimes and the time-series features, reducing the error of a single prediction model. The experimental results show that the combined model Regression-LSTM has the highest prediction accuracy, and the RMSE value is lower than other single prediction models, indicating the best prediction effect.

The Regression-LSTM model combines the advantages of the Regression model and LSTM model. It considers both the causality features of telecom fraud crimes and the time-series features. The Linear Regression submodel is used for finding the causality features and correlation between the trading data of the dark web and the crimes, and the LSTM submodel is used for mining the time-series features of the crimes. This paper confirms that the privacy data sold on the dark web strongly correlates with telecom fraud crimes, providing a new direction for subsequent researchers to research telecom fraud crimes by using the data on the dark web.

The Regression-LSTM model provides a useful analytical tool for telecom fraud prediction. It is of great significance to the protection of public safety. The prediction results of the model can not only remind citizens to strengthen awareness of fraud prevention but also urge enterprises to fulfill their responsibilities of privacy data protection. Under the precise guidance of the prediction model, the police departments can effectively reduce the occurrence of telecom fraud.

In addition, there are still some shortcomings in this paper. In the follow-up research, we will increase the scope and quantity of open-source data to improve the prediction accuracy of telecommunication network fraud crimes.

Data Availability

The dataset used in this study can be provided by the first author and corresponding author upon request. The data are not publicly available due to restrictions, i.e., privacy or ethics.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This research was funded by the Special Project of the Ministry of Public Security’s Basic Work of Strengthening Police by Science and Technology “Research on risk group analysis model based on dynamic spatial relationship network” (No. 2020GABJC02) and “Research on the Early Warning Model and Prevention Mechanism of Telecommunication Network Fraud Crime,” Fundamental Research Project of PPSUC (No. 2021JKF420).