Abstract

With the rapid development of information technology, mining news content efficiently is a difficult problem faced by governments and enterprises. Neural networks are widely used in news mining for classification, clustering, and prediction. This paper proposes a text feature model that combines keywords with neural-network-based word vectors and compares it with other neural network models. The study examines the vectorization of text for similarity recommendation and introduces two neural-network-based models: the word vector model and the doc2vec text vector model. With word vectors alone as the feature vector, the recommendation accuracy is about 75.35%; with doc2vec, about 44.5%; and with keywords alone as the text feature, about 88.61%. When keywords and word vectors are combined, the recommendation accuracy reaches about 91%, a considerable improvement. The more keywords are used, the better the model's recommendation effect; when the number reaches about 20, the improvement levels off, and further increasing the number of keywords only lengthens the model's running time. The proposed model, which uses keywords together with neural-network-based word vectors as the text vector, can mine news data more accurately, obtain news information quickly, and support prediction and early warning for industries such as real estate.

1. Introduction

With the development of the information age, we can quickly and accurately follow the development of all walks of life through network news, and updating and improving news mining technology has become an important research direction. At present, common news mining techniques mainly extract keywords and vectorize text. Among them, keyword extraction based on statistics uses term frequency-inverse document frequency (TF-IDF): word weights in the text are obtained by calculating TF-IDF values, and the resulting keyword extraction is accurate, simple, and easy to implement (GNH et al.) [1]. Keyword extraction based on the word-graph model uses the TextRank algorithm, which adds the concept of edge weights so that each word receives a corresponding score; extracting keywords by score can even produce a text summary (Fakhrezi et al.) [2]. Keyword extraction based on the topic model uses probabilistic latent semantic analysis (PLSA) to generate closely related, high-frequency words for each topic involved in the news (Khotimah et al.) [3]. Latent Dirichlet allocation extends the PLSA model with a Dirichlet prior, and the posterior distributions are estimated with the Gibbs sampling algorithm; compared with simple statistics, this approach takes the semantic information of the text into account (Yamaguchi and Templin) [4]. However, keyword extraction lacks semantic analysis and involves complex calculation, which leads to poor news recommendation. On the basis of keyword extraction, this paper combines the characteristics of neural networks and word vectors and constructs a text vector model based on neural networks.

For network news mining, this study proposes using keywords together with neural-network-based word vectors as the text vector model. A neural network essentially imitates the way the brain operates: artificial neurons operate in parallel, and if data are supplied continuously, the network spontaneously adjusts its weights and thresholds to reach an accurate output value, giving it strong learning and generalization ability. Even if some neurons in the network fail, the overall structure is not noticeably disturbed, so the network also has strong fault tolerance. Based on these characteristics, the news mining model is constructed with a neural network.

The innovation of this research is to combine the high accuracy and easy implementation of text keyword extraction with the ability of the neural-network-based word vector model to capture the idiomatic meaning of context, thereby making up for the fact that keyword extraction cannot analyze context. Applying the neural network to the model reduces the complex calculation of keyword extraction and raises the accuracy of news recommendation above 90%; applied to the real estate market, the model plays an early warning role and supports a more comprehensive understanding of the industry's development.

The research is divided into four parts. The first part introduces the research status of neural networks and news mining. The second part establishes a news mining model based on a neural network. The third part compares the constructed model with the individual models. The last part summarizes and evaluates the results and points out the shortcomings of the research.

2. Related Work

With the continuous development of the market, many researchers use the characteristics of neural networks and mining algorithms to obtain relevant news data and give early warning of market developments so as to stabilize the normal development of the social economy. Fuhl et al. proposed a fully convolutional neural network that learns hierarchical features automatically and simulates the bottom-up visual mechanism through saliency prediction; evaluated on a difficult saliency data set, the model achieves the best effect compared with others [5]. Gongfa et al. use the characteristics of convolutional neural networks in gesture recognition to avoid the feature extraction process and reduce training parameters, achieving unsupervised learning; the convolutional neural network combined with the error back propagation algorithm greatly reduces the model error and makes gesture recognition more accurate and fast (Gongfa et al. 2019) [6]. Xie and Kumar use the ability of convolutional neural networks to learn biometrics, combined with discrete hashing trained by a triplet-based loss function, to achieve accurate performance; this method is superior to other finger vein image matching methods and provides a significantly reduced template size (Xie and Kumar) [7]. Pradhan et al. combine a deep neural network with groundwater influencing factors to map groundwater potential, obtain parameters, randomly generate missing spring points, and use the jackknife test for sensitivity analysis; the results show that the deep neural network captures the groundwater potential zone in mountainous areas well and helps to select appropriate groundwater resources (Pradhan et al.) [8]. Wang et al. designed a multi-parameter IoT online measurement system based on a BP neural network for simulation test experiments; in the algorithm simulation experiment, the absolute error between the predicted and measured final moisture content is within 0.3, which speeds up data collection (Wang et al.) [9]. Buckingham et al. determine the potential of new data sources by analyzing the data, applications, and methods of GDELT to understand changing events in the news media, help understand social changes, and detect large-scale environmental changes; news mining plays an increasingly important role (Buckingham et al. 2020) [10]. Reddy et al. detect false news using only the features of the news text; by combining style features and word vector representations through ensemble methods, the accuracy of predicting false news reaches 95.49%, providing a new research method for news screening (Reddy et al.) [11].

Arhab et al. collected evidence about parking topics by mining Finnish news articles; the research emphasized word co-occurrence analysis, sentiment scoring, and entity monitoring in order to identify potentially unforeseen situations that affect users' decisions and to contribute to urban planning based on the political situation (Arhab et al. 2021) [12]. Li et al. examine virtual assistant services through text exploration and analysis: they collect consumer comment data, perform topic modeling, analyze news keyword experiments, confirm consumers' expectations or concerns about AI through text mining, and propose specific AI service directions according to consumers' needs (Li et al.) [13]. Englmeier developed the prototype of the contexter system to check false news and misinformation; the model can present a blueprint of facts, represent facts abstractly, and look for similar facts, so as to avoid the threat that misinformation poses to society and the economy (Englmeier) [14]. Wang et al. built an urban real estate early warning model based on a multi-class support vector machine and analyzed the real estate market from all aspects by selecting early warning index data; the results successfully predicted the market operation of the following year, providing an accurate and reliable method for market early warning (Wang et al.) [15]. To prevent stock market collapse, Habibi uses an autoregressive model to establish an early warning system and applies the recursive formula of least squares estimation of the regression parameters to obtain the EWS index probability; the results show that the EWS index probability can accurately predict stock market crises (Habibi) [16]. Zhang et al. put forward a behavioral finance model for stock market crisis early warning, predicting the possibility of a crisis with a logit model; the experimental results show that investor sentiment has a significant positive impact on stock market crises, so a logit model that considers investor sentiment can achieve an early warning effect on the stock market (Zhang et al.) [17]. Chen et al. proposed a lane marking detection method based on machine vision and image processing, which realizes the early warning function of lane departure through traditional image processing and semantic segmentation, as an important guarantee of traffic safety (Chen et al.) [18].

The analysis of the above research results shows that early warning systems play an important role in maintaining social stability, and mining news can provide a scientific basis for such systems. Neural networks have achieved good results in solving practical problems such as classification, clustering, and prediction, but they have been studied less in news mining. Therefore, this research incorporates the neural network structure into the news mining algorithm to improve the accuracy of news mining and provide a new research direction for accurately obtaining news information.

3. Construction of News Mining Model Based on Neural Network

3.1. Construction and Training of Neural Network Model

The network structure is the most important design element of a neural network model. It includes the choice of the number of input neurons, hidden-layer neurons, and output neurons, as well as the choice of activation function. The main process of the BP neural network is shown in Figure 1.

BP neural networks have good generalization ability: by establishing the relationship between input and output, they can predict new states according to the laws of existing data. During the application of the model, over-fitting and under-fitting of the training samples, data parameters, and other factors affect the generalization ability. In this paper, both name knowledge and context knowledge are used in training the model, so a simple and effective entity rebalancing algorithm is proposed. Its main idea is to distribute the labeled entities evenly within the same category. There are two main reasons why the proposed entity rebalancing algorithm is effective. First, an even distribution encourages the model to use both name knowledge and context knowledge, because there are no simple statistical clues from an uneven distribution to exploit. Second, in most cases, different entities in the same category should be semantically interchangeable, which avoids discrepancies between training and testing and improves the generalization ability of the model. Under-fitting can be avoided by changing the network structure or the learning algorithm, but the over-fitting state is difficult to control. In the first step of forward propagation, the BP neural network computes the net input $s_j(t)$ of each hidden-layer unit $j$ at time $t$ from the input values $x_i(t)$, $i = 1, 2, \ldots, n$, as shown in the following formula:

$$s_j(t) = \sum_{i=1}^{n} w_{ij} x_i(t) - \theta_j(t), \quad j = 1, 2, \ldots, m \quad (1)$$

In formula (1), $m$ represents the total number of neurons in the hidden layer, $\theta_j(t)$ represents the threshold of hidden-layer neuron $j$ at time $t$, and $w_{ij}$ represents the weight between input-layer neuron $i$ and hidden-layer neuron $j$. In the second step, the activation function $f(\cdot)$ is applied to calculate the output of each hidden-layer unit at time $t$ as follows:

$$h_j(t) = f\big(s_j(t)\big), \quad j = 1, 2, \ldots, m \quad (2)$$

According to the calculation method of (1), the net input $s_k(t)$ of each output-layer unit $k$ at time $t$ is calculated as shown in (3), where $q$ represents the total number of output-layer units, $\gamma_k(t)$ represents the threshold of output-layer neuron $k$ at time $t$, and $v_{jk}$ represents the weight between hidden-layer unit $j$ and output-layer neuron $k$:

$$s_k(t) = \sum_{j=1}^{m} v_{jk} h_j(t) - \gamma_k(t), \quad k = 1, 2, \ldots, q \quad (3)$$

According to the calculation method of (2), the output value $y_k(t)$ of each output-layer unit at time $t$ is calculated as follows:

$$y_k(t) = f\big(s_k(t)\big), \quad k = 1, 2, \ldots, q \quad (4)$$

The output values obtained through forward propagation are then used for error back propagation. The error $E(t)$, the squared deviation between the target output $d_k(t)$ and the actual output $y_k(t)$ at time $t$, is calculated as follows:

$$E(t) = \frac{1}{2} \sum_{k=1}^{q} \big(d_k(t) - y_k(t)\big)^2 \quad (5)$$

The calculated actual error is compared with the preset allowable error value. If the actual error is greater than the allowable error, the BP neural network adjusts the weights and thresholds to continuously reduce the actual error until it is no longer greater than the allowable error. Formulas (1) to (5) constitute the structure of the BP neural network model. The neural network algorithm determines the weights and thresholds of the network structure from input training samples, calculates the output value of each layer of neurons, and then calculates the error between the output value and the expected output. The weights and thresholds are adjusted according to the error, and these steps are repeated until the error falls within the set range, at which point training ends. Learning methods are divided into supervised learning and unsupervised learning; typical learning rules include the error correction algorithm, the competitive algorithm (Gerist et al.) [19], and the Hebb algorithm (Manurung et al.) [20]. However, traditional learning algorithms have problems such as falling into local extrema, slow learning speed, and slow convergence. The gradient descent method, the quasi-Newton method, and the LM algorithm are commonly used to solve these problems. The LM algorithm combines the gradient descent method and the quasi-Newton method and is obtained by deriving the second derivative. The correction process of the algorithm is shown in the following formulas:

$$x_{k+1} = x_k + \Delta x \quad (6)$$

$$\Delta x = -\big(J^{T} J + \mu I\big)^{-1} J^{T} e \quad (7)$$

$$J = \frac{\partial e}{\partial x} \quad (8)$$

In formula (6), $x_k$ represents the vector of weights and thresholds after the $k$-th iteration; in formula (7), $I$ represents the identity matrix and $\mu$ represents the adaptive adjustment coefficient, whose value is not less than 0; and in formula (8), $e$ represents the error of the network nodes and $J$ represents the Jacobian matrix. When the value of $\mu$ approaches 0, formula (7) reduces to the Gauss-Newton method; when the value of $\mu$ approaches positive infinity, formula (7) approaches the gradient descent method. The weights and thresholds are adjusted in this way until the output error meets the expected error range or the algorithm reaches the maximum number of iterations, at which point training stops. BP neural networks have achieved good results in classification, clustering, and prediction. In the field of information processing in particular, they perform excellently in information collection, reception, transmission, and processing. Therefore, applying this algorithm to network news mining can collect the required information efficiently and accurately.
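To make the training procedure concrete, the following is a minimal sketch of formulas (1)-(5) in Python with NumPy. It assumes a single hidden layer, a sigmoid activation function, and, for simplicity, a plain gradient-descent weight update in place of the full LM correction of formulas (6)-(8); the layer sizes, learning rate, and toy data are illustrative, not the paper's actual configuration.

```python
import numpy as np

# Minimal BP network sketch for formulas (1)-(5): one hidden layer,
# sigmoid activation, plain gradient-descent update. All sizes are
# illustrative assumptions.
rng = np.random.default_rng(0)
n, m, q = 4, 8, 3                        # input, hidden, output sizes
W = rng.normal(scale=0.1, size=(n, m))   # input->hidden weights w_ij
theta = np.zeros(m)                      # hidden thresholds theta_j
V = rng.normal(scale=0.1, size=(m, q))   # hidden->output weights v_jk
gamma = np.zeros(q)                      # output thresholds gamma_k

def f(s):
    return 1.0 / (1.0 + np.exp(-s))      # sigmoid activation

def forward(x):
    s_h = x @ W - theta                  # formula (1): hidden net input
    h = f(s_h)                           # formula (2): hidden output
    s_o = h @ V - gamma                  # formula (3): output net input
    y = f(s_o)                           # formula (4): network output
    return h, y

def train_step(x, d, lr=0.5):
    global W, theta, V, gamma
    h, y = forward(x)
    E = 0.5 * np.sum((d - y) ** 2)       # formula (5): squared error
    delta_o = (y - d) * y * (1 - y)      # error back propagation
    delta_h = (delta_o @ V.T) * h * (1 - h)
    V -= lr * np.outer(h, delta_o)
    gamma += lr * delta_o                # threshold enters with a minus sign
    W -= lr * np.outer(x, delta_h)
    theta += lr * delta_h
    return E

x = np.array([0.1, 0.5, 0.2, 0.9])
d = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    E = train_step(x, d)
print("final error:", E)
```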

3.2. Construction of the Word Vector and Text Vector Models Based on Neural Networks

By extracting news keywords, relevant news can be obtained for a given purpose. For example, early warning for the real estate market can be determined by the ratio of commercial housing prices to sales area, and the development trend of real estate can be judged from policy news; relevant news data can then be mined by extracting keywords such as price, sales area, and real estate policy, so as to track the development of the real estate market (Dianov) [21]. When text keywords are used to recommend news with similar text features, the text can be represented by a text vector model or a text statistical model. The weight of each word in the text determines the keywords of the text, and a fixed number of keywords is taken from each news item as its feature combination. The procedure is to preprocess the text data, delete unimportant feature attributes that have no impact on the data, calculate the weight of every word in the text, take the top-weighted words as the keyword set, and recommend the news texts whose keyword sets have the largest intersection with that of the query text. The model based on text keywords is simple and direct to implement, but it lacks word semantics and word order information, which can lead to two news items with different meanings being judged similar because they share common keywords. For example, one news item mainly describes real estate sales and another mainly describes real estate policies; both yield keywords such as sales, policy, and real estate, so the keyword model cannot distinguish them and treats them as the same kind of real estate news. In view of this problem, a text vector based on a neural network is proposed as the text feature. The neural-network-based text vector represents the text as a vector of fixed dimension. Taking the word as the basic unit of the text, word vectors can be composed into a text vector; taking the sentence as the text unit, the theme of the text can be extracted and combined with word vectors to form the text vector. Besides composing text vectors from word vectors, text vectors can also be obtained by matrix decomposition or by the doc2vec algorithm.
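The keyword-based procedure above can be sketched in a few lines of Python. The snippet below uses scikit-learn's TfidfVectorizer to take the top-k TF-IDF words of each text as its keyword set and recommends the text whose keyword set overlaps most with the query's; the three-document toy corpus and k = 3 are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: two real estate texts and one sports text.
corpus = [
    "housing price rises as housing sales grow",
    "housing price policy limits housing purchases",
    "the football team won the match yesterday",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
vocab = np.array(vectorizer.get_feature_names_out())

def keyword_set(doc_idx, k=3):
    """Top-k words of a text by TF-IDF weight."""
    row = tfidf[doc_idx].toarray().ravel()
    return set(vocab[row.argsort()[::-1][:k]])

def recommend(query_idx, k=3):
    """Recommend the text with the largest keyword-set intersection."""
    query = keyword_set(query_idx, k)
    scores = {i: len(query & keyword_set(i, k))
              for i in range(len(corpus)) if i != query_idx}
    return max(scores, key=scores.get)

print(recommend(0))   # 1: the other real estate text
```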

The statistical language model calculates the probability of a word appearing in a text or sentence and is often modeled with a maximum likelihood function. The model based on a neural network is shown in the following formula:

$$p\big(w \mid \mathrm{context}(w)\big) = g\big(w, \mathrm{context}(w); \theta\big) \quad (9)$$

In formula (9), $g$ represents the neural network, the parameters in the network structure are represented by $\theta$, and the context of word $w$ is represented by $\mathrm{context}(w)$. Compared with the statistical language model, the improved model expresses the conditional probability more effectively and is simpler to calculate.

After training, the neural network language model (NNLM) yields word vectors, which represent the distributed representation of words in the model (Hentschel et al.) [22]. The NNLM models the conditional probability $p(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$, and the network structure is shown in Figure 2. In Figure 2, the words $w_{t-n+1}, \ldots, w_{t-1}$ represent the words before $w_t$; in the model, the previous words are used to predict the next word. $C(w)$ represents the word vector. The word vectors are stored in a parameter matrix located at the input layer; the matrix size is $|V| \times d$, where $|V|$ represents the vocabulary size and the word vector dimension $d$ is a hyperparameter.

The input layer splices the word vectors of the $n-1$ context words into a vector $x$ of length $(n-1)d$ as follows:

$$x = \big[C(w_{t-n+1}); C(w_{t-n+2}); \ldots; C(w_{t-1})\big] \quad (10)$$

The hidden layer transforms the input-layer vector $x$ to obtain the hidden-layer output $h$. As shown in formula (11), $H$ represents the parameter matrix from the input layer to the hidden layer, and $b_1$ represents the bias from the input layer to the hidden layer:

$$h = \tanh(Hx + b_1) \quad (11)$$

Here tanh is the hyperbolic tangent function, whose range is [−1, 1]. The calculation of the output layer is shown in formula (12), where $U$ is the parameter matrix from the hidden layer to the output layer and $b_2$ is the bias from the hidden layer to the output layer:

$$y = Uh + b_2 \quad (12)$$

Finally, the softmax function is used to normalize $y$ into a probability distribution over the vocabulary. After optimizing the model parameters $\theta$, the word vectors are read off as the parameter matrix $C$ from the words to the input layer. The formula is as follows:

$$p\big(w_t \mid w_{t-n+1}, \ldots, w_{t-1}\big) = \frac{\exp(y_{w_t})}{\sum_{i=1}^{|V|} \exp(y_i)} \quad (13)$$
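The following is a minimal NumPy sketch of one NNLM forward pass through formulas (10)-(13). The vocabulary size, word vector dimension, context length, and hidden size are illustrative assumptions, and training (backpropagation through these parameters) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, context, hidden = 1000, 50, 3, 64  # illustrative sizes

C = rng.normal(scale=0.1, size=(V, d))   # word-vector matrix C
H = rng.normal(scale=0.1, size=(hidden, context * d))
b1 = np.zeros(hidden)
U = rng.normal(scale=0.1, size=(V, hidden))
b2 = np.zeros(V)

def nnlm_forward(prev_words):
    x = np.concatenate([C[w] for w in prev_words])  # (10): splice vectors
    h = np.tanh(H @ x + b1)                         # (11): hidden layer
    y = U @ h + b2                                  # (12): output scores
    y -= y.max()                                    # numeric stability
    return np.exp(y) / np.exp(y).sum()              # (13): softmax

p = nnlm_forward([12, 7, 305])   # indices of the 3 previous words
print(p.argmax(), p.sum())       # most probable next word; sums to 1
```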

The neuron is the basis of the neural network; it processes information through interactions between neurons and stores information in weights and thresholds. Each neuron has multiple input ports $x_1, x_2, \ldots, x_n$, and the weights of the different input ports are $w_1, w_2, \ldots, w_n$. The input signals are integrated by the neuron to obtain the total input value $\sum_i w_i x_i$, which is combined linearly with the threshold $\theta$; the result is passed through the function $f$, and the neuron output is $y = f\big(\sum_i w_i x_i - \theta\big)$, as shown in Figure 3.

The word vector model based on a neural network can learn semantics from the text context. After filtering the meaningless stop words in the news text, the vector representation of every word is obtained through model training. The text vector can be expressed by adding or averaging the word vectors, combined with the keyword extraction method (Xiong et al.) [23]: each word in the text has a corresponding score, the keywords with the largest scores are selected as the sample set, and after normalization the values form a text vector. Sentence vectors can also form a text vector, so in addition to keywords, key sentences can be extracted from the text. When quantifying sentences with the help of word2vec, the weight of a sentence vector is calculated with cosine similarity as shown in the following formula:

$$\mathrm{sim}(s_1, s_2) = \frac{v(s_1) \cdot v(s_2)}{\lVert v(s_1) \rVert \, \lVert v(s_2) \rVert} \quad (14)$$

In formula (14), $s_1$ and $s_2$ represent different sentences, $v(\cdot)$ represents the vector of a sentence, and each sentence vector is built from the word vectors of the words in that sentence.
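As a sketch of this step, the snippet below trains word vectors with gensim's Word2Vec (assuming the gensim 4.x API), averages them into sentence vectors, and compares sentences with the cosine similarity of formula (14); the tiny corpus is an illustrative assumption.

```python
import numpy as np
from gensim.models import Word2Vec

sentences = [
    ["housing", "price", "rises", "quickly"],
    ["real", "estate", "policy", "limits", "purchases"],
    ["housing", "sales", "area", "grows"],
]
model = Word2Vec(sentences, vector_size=100, min_count=1, seed=0)

def sentence_vector(words):
    """Average of the word vectors: one simple sentence vector."""
    return np.mean([model.wv[w] for w in words], axis=0)

def cosine(u, v):
    """Formula (14): cosine similarity of two sentence vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(sentence_vector(sentences[0]), sentence_vector(sentences[2])))
```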

Doc2vec is an unsupervised learning algorithm. Each column of one matrix represents the unique vector of a text, and each column of another matrix represents the unique vector of a word. The text vector and the word vectors are concatenated to predict the next word. This model is the distributed memory model of paragraph vectors (PV-DM), as shown in Figure 4.
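A PV-DM text vector can be trained with gensim's Doc2Vec, whose dm=1 setting corresponds to the distributed memory model described here (assuming the gensim 4.x API); the corpus and parameters below are illustrative.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [
    ["housing", "price", "rises", "as", "sales", "grow"],
    ["new", "policy", "limits", "commercial", "housing", "purchases"],
]
documents = [TaggedDocument(words, [i]) for i, words in enumerate(texts)]

# dm=1 selects the PV-DM variant: the text vector is trained jointly
# with the word vectors to predict the next word.
model = Doc2Vec(documents, vector_size=100, dm=1, min_count=1, epochs=40)

# The vector of a new text is inferred by gradient updates with the
# word matrix fixed, as described in the next paragraph.
vec = model.infer_vector(["housing", "policy", "news"])
print(vec.shape)   # (100,)
```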

The word vectors in the model are shared across texts, but the text vector of a new text must be recalculated: with the word vector matrix and the parameters of the next-word classification layer fixed, stochastic gradient descent is used to update the new text vector until convergence. The similarity recommendation steps of the above neural-network-based text vector model are to preprocess the text data set, delete unimportant feature attributes, train the word vectors or text vectors through the neural network model, and recommend news by distance measurement or cosine similarity on the resulting vectors. Moreover, the model can learn semantics from context. However, whether adding or weighting word vectors can accurately express the semantics still needs to be verified: for news with a large vocabulary, texts with different semantics may appear similar in the model. For example, the price of real estate may be linked with the price of goods in other industries; if the word vectors of two words in one text are (1, 0) and (0, 1) and those of another text are (1, 1) and (0, 0), both texts have the same weighted-average representation (0.5, 0.5). Combining the advantages of the keyword-based statistical information model and the word-vector-based semantic information model, both are taken as features of the text. The recommendation of similar news texts is shown in the following formula:

$$\mathrm{sim}(A, B) = (1 - \lambda)\,\mathrm{sim}_{K}\big(K_A, K_B\big) + \lambda \cos\big(v_A, v_B\big) \quad (15)$$

In (15), $\lambda \in [0, 1]$ represents the trade-off hyperparameter, $A$ and $B$ represent two news items, $K_A$ and $K_B$ represent the keyword sets of news $A$ and news $B$, $\mathrm{sim}_K$ measures the overlap of the two keyword sets, and $v_A$ and $v_B$ represent the vectors of news $A$ and news $B$, respectively.
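A minimal implementation of formula (15) might look as follows; here the keyword term simK is taken to be the Jaccard overlap of the two keyword sets, which is one plausible reading of the normalized intersection rather than the paper's stated choice.

```python
import numpy as np

def combined_similarity(keys_a, keys_b, vec_a, vec_b, lam=0.7):
    """Formula (15): trade-off of keyword overlap and vector cosine.

    lam = 0 uses keywords only; lam = 1 uses vectors only.
    The Jaccard overlap for the keyword term is an assumption.
    """
    keyword_sim = len(keys_a & keys_b) / len(keys_a | keys_b)
    vector_sim = float(vec_a @ vec_b /
                       (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
    return (1 - lam) * keyword_sim + lam * vector_sim

keys_a = {"housing", "price", "sales"}
keys_b = {"housing", "policy", "purchases"}
vec_a, vec_b = np.random.default_rng(0).normal(size=(2, 100))
print(combined_similarity(keys_a, keys_b, vec_a, vec_b))
```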

4. Experimental Design and Analysis

The news data set used in the study comes from the National Bureau of Statistics, and each text contains a sufficient vocabulary [24]. In the experiment, the TF-IDF algorithm is used to extract keywords. The text quantization method adopts weighted word vector addition, with the keyword TF-IDF value as the weight and a word vector dimension of 100. Once the text keywords and feature vectors are obtained, a news item is selected at random, the 10 news items with the highest similarity are selected through formula (15), and the probability that these 10 items belong to the same category as the selected item is calculated. In this experiment, 1000 samples are selected at random and the average probability is taken; the higher the probability, the more accurate the algorithm's recommendation. The experiment varies the number of keywords and the trade-off parameter: the number of keywords is a positive integer, and the trade-off value lies in [0, 1]. When the value is 0, only text keywords are used for recommendation; when the value is 1, only text vectors are used. The results are shown in Figure 5.
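The evaluation protocol described above can be sketched as follows, assuming a similarity function such as formula (15), a list of category labels, and a corpus defined elsewhere in the experiment; the function returns the average probability that the top-10 recommendations share the query's category.

```python
import numpy as np

def evaluate(similarity, labels, n_samples=1000, top_k=10, seed=0):
    """Average same-category probability of the top-k recommendations.

    similarity(i, j) -> float and labels[i] are assumed to be provided
    by the surrounding experiment.
    """
    rng = np.random.default_rng(seed)
    n, probs = len(labels), []
    for _ in range(n_samples):
        q = int(rng.integers(n))
        ranked = sorted(((similarity(q, j), j) for j in range(n) if j != q),
                        reverse=True)
        top = [j for _, j in ranked[:top_k]]
        probs.append(np.mean([labels[j] == labels[q] for j in top]))
    return float(np.mean(probs))   # higher = more accurate recommendation
```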

As Figure 5 shows, the number of keywords has a certain impact on the recommendation effect; once the number reaches a certain value, the effect no longer changes significantly. The recommendation effect at the extreme trade-off values is not as good as that of the combination of the two features, which verifies the effectiveness of formula (15): regardless of the number of keywords, the combined model gives the best recommendation. When the trade-off value is between 0.6 and 0.9, more keywords help, but there is no obvious difference in probability beyond about 20 keywords. The 100-dimensional word vectors are reduced by principal component analysis (PCA) (Lu et al.) [25]; after visualizing some words in a two-dimensional plane, it can be seen that words with similar semantics also lie in similar positions in the model, as shown in Figure 6.
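The visualization step can be reproduced with scikit-learn's PCA, reducing the 100-dimensional vectors to two dimensions and plotting a few words; `model` refers to the Word2Vec sketch above, and the word list is illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = ["housing", "price", "sales", "policy", "purchases"]
vectors = [model.wv[w] for w in words]           # 100-dim word vectors
points = PCA(n_components=2).fit_transform(vectors)

# Semantically similar words should land near each other in the plane.
for (x, y), w in zip(points, words):
    plt.scatter(x, y)
    plt.annotate(w, (x, y))
plt.title("Word vectors after PCA reduction to 2D")
plt.show()
```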

To further verify the accuracy of the model, the doc2vec technique introduced above is compared with it. Doc2vec's text vector dimension is also 100. After quantifying the news texts, they are processed with the model based on keywords and word vectors as joint features. The results are shown in Figure 7, where the numbers 0, 1, 2, 3, and 4 represent real estate business, political policy, film and television, sports, and science and technology, respectively.

News items in the same field are also closer together in the model. After obtaining the text vector features of each news item through doc2vec, doc2vec is tested with the same method: a news item about real estate risk is selected, the 10 most similar news items are found using cosine similarity, and the probability of their belonging to the same category is calculated [26]. The experiment is repeated 500 and 1000 times and the averages are taken. The experiment varies the vector dimension hyperparameter from 50 to 300; the recommendation results are shown in Figure 8.

According to the results shown in Figure 8, the effect is best when the text vector dimension is 50, which is still far from the proposed method in which keywords and word vectors are joint features [27]. It may be that too little data leads to insufficient training of the doc2vec model and affects its accuracy; at the same time, it indicates that the method based on joint keyword and word vector features remains effective even with little data. Comparing the four experimental results of keywords alone, weighted word vectors, the combination of keywords and weighted word vectors, and doc2vec yields Figure 9 [28]. Clearly, the combination of keywords and weighted word vectors has the best recommendation effect.

5. Conclusion

The experiment obtains relevant news data by mining news keywords and achieves an early warning effect for the development of the real estate market. A neural-network-based model in which keywords and word vectors serve as text features is proposed, and its performance is compared with other neural network models. The results show that the recommendation effects of the model based only on keywords, the model based only on word vectors, and the model combining keywords and word vectors are all affected by their hyperparameters; once the hyperparameters reach certain values, the recommendation effect gradually stabilizes. When the number of keywords is about 20, the recommendation effect of all three models stabilizes. When the trade-off value is 0, that is, when only keywords are used as the feature, the recommendation accuracy of the model is 88.61%; when only the word vector is used as the text vector, that is, when the trade-off value is 1, the recommendation accuracy is 75.35%. When keywords and word vectors are combined and the ratio reaches a certain value, the recommendation accuracy is as high as 91%, which is much better than the other models. However, the lack of samples makes the comparative experiment less precise, so the research still has significant shortcomings. In follow-up work, we need to further optimize the network structure and improve the accuracy of news recommendation.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest.