Abstract
To address the difficulty of locating useful content in information-overloaded web pages, an automatic extraction method for web page text information based on network topology coincidence degree is proposed. Search engines, web crawlers, and hypertext tags are used to classify web page text information, which is then reduced in dimensionality. After this processing, the similarity between different features of the web page text information is calculated and ranked, and similar text information is extracted according to correlation based on segment estimation. The experimental results show that the designed method simplifies the complexity of the associated information in the data set and improves both the volume of collected data and the success rate of information collection.
1. Introduction
After decades of transformation and development [1, 2], computer technology and information technology have brought earth-shaking changes to human society, moving human beings from the industrial age into the information age and involving people in a wave of information extraction, collection, storage, and analysis. In particular, the Internet, as the carrier of information media, has become a clear symbol of this era [3]. With the rapid growth of the Internet, the web has developed into a huge information service network containing a wide variety of information resources and sites all over the world [4, 5]. Network topology refers to the layout of devices interconnected by transmission media: the members of a network are arranged in a specific physical (real) or logical (virtual) pattern, and two networks with the same connection structure have the same network topology. However, the severe overload of text information on web pages has brought great difficulties to information extraction [6]. Because this information is unstructured and disordered, people generally can only use full-text search to find what they need, so a web page containing the required information is filled with advertisements and irrelevant links, useful and useless information are mixed together, and correctly locating information becomes harder [7]. To deal with these problems, an automatic extraction technology is urgently needed to help people quickly find the information they really need from this mass of information. The automatic extraction of web page text information is a good way to solve this problem.
Relevant scholars have put forward numerous studies. Reference [8] proposed an adaptive parameter optimization model for 3D information extraction of infrared small targets based on a particle swarm algorithm. A multiobjective particle swarm optimization algorithm was used to optimize the parameters of the 3D information extraction method, making the detection method adaptive to different detection scenarios. Within the optimization algorithm, an adaptive environment selection strategy was proposed to enhance evolutionary ability and obtain high-quality solution sets, and an inflection point selection strategy was designed to obtain the best parameters of the small target detection method. Experimental results show that, compared with the baseline method, this method can detect small targets in different scenes accurately and stably. Reference [9] proposed accelerated training of a deep information extraction system for cancer pathology reports based on bootstrap aggregation. The machine learning data consisted of free text from electronic cancer pathology reports, and partitioned training was carried out with a multitask convolutional neural network and a multitask hierarchical convolutional attention network classifier. Up to 40,000 models were generated by dividing a large problem into 20 subproblems, resampling training cases 2,000 times, and training a deep learning model for each bootstrap sample and each subproblem; many models were trained simultaneously in a high-performance computing environment. Compared with a single-model approach, model aggregation improved task performance. Although these studies made progress, they are not applicable to complex web page text information. This paper therefore proposes an automatic web page text information extraction method based on network topology coincidence degree: the number of common neighbors between web page text information nodes is used to quantify the topology coincidence degree of the information nodes, and as the common neighbors between two nodes change, the importance of the text on the corresponding web pages changes as well.
2. Web Page Text Information Preprocessing Based on Network Topology Coincidence Degree
2.1. Automatic Extraction and Classification of Web Page Text Information
We classify the automatic extraction technologies for web page text information as follows:
(1) Search engine: a service-oriented website into which the desired information conditions can be entered through a search interface and the matching results extracted [10, 11]. Its principle is to find content consistent with the information entered by the user, process and analyze web page text information with certain algorithms, and store the integrated result in a database [12]. The original data are characterized by strong relevance and uniform rules, so when users search for information, results matching the search conditions can be fed back directly.
(2) Web crawler: a program hidden inside a search engine that can search relevant web pages or download pages [13, 14]. Like a spider, it can move among many web pages arbitrarily, so it is also called a web robot.
(3) Hypertext tagging: a technology for integrating information. According to the user's requirements, the useful information found can be spliced together for processing [15], so users can use hypertext links to find the required text directly. Hypertext can contain many kinds of content, such as Chinese characters, pictures, and videos.
Under the condition of network topology coincidence, the automatic extraction of web page text information requires the help of a network control system. The structure of the control system is shown in Figure 1.

2.2. Dimensionality Reduction of Web Page Text Information
In the process of automatically extracting web page text information under the network topology coincidence degree, the relevant features of the web page text information must be extracted to ensure accuracy. These features usually consist of multiple dimensions, which makes the extraction process overly complex and thus reduces extraction accuracy [16, 17]. Because the dimensions of web page text information contain a large amount of redundant data, it is necessary to reduce the dimensionality, retain the main feature information, and eliminate the impact of the redundant data [18]. The specific method is as follows.
Step 1: the web page text information collected under the network topology coincidence degree forms an $n \times m$ web page text information matrix, where $n$ is the number of web page text items and $m$ is the number of feature dimensions. The matrix can therefore be described as a collection of $m$-dimensional vectors $x_1, x_2, \ldots, x_n$ whose mean is

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{1} \]

In formula (1), $\bar{x}$ represents the mean vector of the web page text information, and $n$ represents the specific quantity of web page text information items.
Step 2: map the web page text information data from the high-dimensional space to a low-dimensional space to reduce its characteristic dimension. The formula is as follows:

\[ y_i = W^{\mathsf{T}} x_i \tag{2} \]

In formula (2), $W$ represents the characteristic dimension coefficient matrix of the web page text information.
Step 3: with $\bar{x}$ set as the feature mean vector of the web page text information, the feature covariance matrix is

\[ C = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\mathsf{T}} \tag{3} \]

In formula (3), $(x_i - \bar{x})^{\mathsf{T}}$ represents the transposition of the mean-centered feature vector of the web page text information, and $\bar{x}$ represents the average value over the feature dimensions of the web page text information.
Step 4: delete the eigenvectors of $C$ with smaller eigenvalues from the eigenvector matrix of the web page text information [19, 20]. Keeping the $k$ eigenvectors $w_1, \ldots, w_k$ with the largest eigenvalues, each web page text information vector can be approximated as

\[ x_i \approx \bar{x} + \sum_{j=1}^{k} a_{ij} w_j \tag{4} \]

In formula (4), $a_{ij}$ represents the feature mean vector coefficient of the web page text information.
Through the above methods, the feature dimension of web page text information can be reduced, redundant data can be deleted, and the main web page text information features can be retained, which provides an accurate basis for the extraction of web page text information.
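To make Steps 1 to 4 concrete, the following is a minimal sketch of this kind of dimensionality reduction in Python. It assumes a standard PCA-style projection; the matrix sizes, the random data, and the function name reduce_dimensions are illustrative assumptions rather than the implementation used in this paper.

```python
import numpy as np

def reduce_dimensions(X, k):
    """Reduce an n x m web page text feature matrix to n x k.

    Sketch of Steps 1-4: center the data around the mean (formula (1)),
    build the covariance matrix (formula (3)), keep the k eigenvectors
    with the largest eigenvalues, and project (formulas (2) and (4)).
    """
    mean = X.mean(axis=0)                   # feature mean vector
    centered = X - mean                     # remove the mean
    cov = centered.T @ centered / len(X)    # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric matrix
    order = np.argsort(eigvals)[::-1][:k]   # indices of k largest eigenvalues
    W = eigvecs[:, order]                   # projection matrix
    return centered @ W, mean, W            # low-dimensional features

# Hypothetical data: 100 pages, 50 raw features, reduced to 5
X = np.random.rand(100, 50)
Y, mean, W = reduce_dimensions(X, k=5)
print(Y.shape)  # (100, 5)
```

The columns of W dropped here correspond to the deleted eigenvectors of formula (4); removing them is what discards the redundant dimensions.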
3. Automatic Extraction of Web Page Text Information Based on Network Topology Coincidence Degree
3.1. Similarity Search of Web Page Text Information Based on PageRank
At present, the most widely used search engine is Google, whose success is mainly due to its high-quality search results. Google describes its goal as clearly understanding the user's meaning and meeting the user's needs, and it developed the breakthrough PageRank, an excellent web page relevance ranking algorithm [21]. The PageRank algorithm measures the value of a website mainly through the number and quality of its internal and external links; it models the web as a directed graph of hyperlink topology and searches the relevance of web pages [22]. Similarly, relevance among web page text information can be computed through citing and cited relationships: a link is regarded as a citation, and similar web page text information is searched for through the correlation between items. The PageRank algorithm has the following advantages for relevance search over web page text information:
(1) Two articles in the web page text information cannot cite each other, so the PageRank algorithm avoids cyclic calculation and requires only simple nested calculation [23, 24].
(2) Most research aims to solve practical problems in life, such as scattered topics and speech, which must be carefully screened by experts in the corresponding field, promoting high quality and authority in the web page text information.
(3) The number of citations of web page text information shows the importance of the corresponding subject in its research field, whereas a web page hyperlink only plays a navigational role, guiding users to browse pages in the order intended by the designer [25].
The PageRank algorithm is mainly used to analyze the link structure of web pages and mine the information of the web graph itself, so it is also called "hyperlink analysis." The basic idea of the whole algorithm is as follows: if web page $u$ contains a link to web page $v$, the owner of $u$ considers $v$ important, so part of the importance score of $u$ is given to $v$. The importance score can be expressed as

\[ PR(v) = \sum_{u \in B(v)} \frac{PR(u)}{N(u)} \tag{5} \]

In formula (5), $PR(v)$ represents the PageRank value of page $v$, $B(v)$ is the set of pages linking to $v$, and $N(u)$ is the number of outgoing links of page $u$; the PageRank value of $v$ is the accumulation of a series of page importance scores of the form $PR(u)/N(u)$.
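For illustration, the following is a minimal power-iteration sketch of PageRank in Python. It adds the commonly published damping factor (0.85); with damping set to 1, each update reduces to the summation of formula (5). The four-page link graph is a made-up example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Minimal power-iteration PageRank over a dict {page: [outgoing links]}."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_pr = {p: (1.0 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if not outs:  # dangling page: spread its score evenly
                for q in pages:
                    new_pr[q] += damping * pr[p] / len(pages)
            else:         # each outgoing link passes PR(p)/N(p), as in (5)
                for q in outs:
                    new_pr[q] += damping * pr[p] / len(outs)
        pr = new_pr
    return pr

# Hypothetical 4-page link graph
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(links))  # page C accumulates the highest score
```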
The data model applied to information filtering and indexing is the vector space model, which plays a very important role in the text similarity calculation of search engines [26, 27]. The main similarity comparison method in the vector space model is to compute the cosine between vectors. The model rests on a key assumption: the order in which terms appear in an article is not important, and each term contributes independently to the article's theme. A piece of web page text information can therefore be regarded as a collection of unordered terms [28]. In the vector space model, each index word is regarded as one component, forming a multielement vector space of index words; a phrase in which an index word appears at least once in a file is called a keyword. During search, the extracted input words can likewise be converted, after word segmentation and other operations, into the multielement vector space of the file [29]. The correlation between documents and search words is then obtained by comparing the angular deviation between the document vectors and the extracted word vector.
The vector space model takes feature items as the coordinates of a document and represents web page text information as points in a multidimensional space in vector form, with each component of a vector representing one feature. Calculating the cosine value between vectors judges the similarity between pieces of web page text information. The vector space model is shown in Figure 2.

In the vector space model, $D$ is set to represent a text set containing $n$ pieces of web page text information, that is,

\[ D = \{d_1, d_2, \ldots, d_n\} \tag{6} \]

Each $d_i$ in the set can be expressed as a vector, as shown in

\[ d_i = (w_{i1}, w_{i2}, \ldots, w_{im}) \tag{7} \]

In formula (7), $w_{ij}$ represents the weight of the $j$th feature item in the web page text information $d_i$. The vector space model represents both the web page text information and the query as vectors constructed with words as elements, each word weighted by term frequency and inverse document frequency. The similarity between the web page text information and the query is then obtained by calculating the cosine of the angle between the vectors.
In a search engine, the vector space model is mainly used to calculate the similarity between different pieces of web page text information. In actual extraction, the search engine classifies the search content, mainly taking the PageRank value as the criterion for a preliminary classification of web page text information. It then searches the inverted file table in the system and selects the web page text information that contains the keywords. Finally, the similarity between each piece of web page text information and the extracted content is calculated with the vector space model [30].
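The following sketch illustrates this weighting and comparison step with term frequency-inverse document frequency weights (formula (7)) and cosine similarity, assuming the pages have already been tokenized; the toy documents and query are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Weight each term by term frequency x inverse document frequency."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency of each term
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = [["web", "text", "extraction"],     # hypothetical tokenized pages
        ["web", "page", "ranking"],
        ["topology", "coincidence", "degree"]]
vecs = tf_idf_vectors(docs)
query = {"web": 1.0, "text": 1.0}          # tokenized query as a vector
ranked = sorted(range(len(vecs)),
                key=lambda i: cosine(query, vecs[i]), reverse=True)
print(ranked)  # page indices ranked by similarity to the query
```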
Based on the above analysis, the architecture and workflow of the search engine are as follows: before users submit extracted content to the query module, all web pages are analyzed and scored by the search engine. When the vector space model based on the PageRank value is used for preliminary classification, the web page text information must be taken into account, and the similarity between the different search results and the search content is calculated and sorted [31, 32]. The specific operation flow of the web page text information similarity search is shown in Figure 3.

In Figure 3, the search content refers to the extracted information input by the user, and the preliminary search refers to the preliminary extraction of web page text information [33]. When querying a large amount of web page text information, some components of the search engine must be adjusted and extended. The similarities of the web page text information are sorted, and the similarity-based search of web page text information is realized according to the correlation between items.
3.2. Feature Point Extraction of Web Page Text Information Based on Segment Estimation
On the basis of the above analysis and according to the information retrieval results, we extract the feature points of web page text information based on segment estimation. A series of observation values of web page text information is set as

\[ X = \{x_1, x_2, \ldots, x_t\} \tag{8} \]

Formula (8) is called a web page text information sequence, where $x_i$ represents the $i$th observation value at any time point and $x_t$ represents the observation value at the last stage of the observation time. There are many definitions of segmented estimation, such as local extreme points and edge points, but these are extracted from one-dimensional web page text information; research on multidimensional web page text information is still scarce and, at this stage, in its infancy, requiring further exploration [34]. The web page text information contains local extremum feature points and nonlocal extremum feature points, as shown in Figure 4:

(a) Local extremum characteristic point

(b) Nonlocal extremum characteristic point
As shown in Figure 4, each region around a local extremum feature point presents a symmetrical state, while nonlocal extremum feature points present an asymmetric state. Locally important extreme points, points with large fluctuations within a short time, and the starting and end points of the web page text information sequence are collectively referred to as feature points [35].
Given a web page text information sequence, the following definitions can be obtained from the perspective of extreme points:
(1) Extremum feature point. Given a segmentation method, the neighborhood of a point $x_i$ is $N(x_i)$, whose left radius is taken as the average of the adjacent segment lengths before $x_i$ (the number of rows of the segment matrix) and whose right radius as the average of the adjacent segment lengths after $x_i$ (the number of columns of the segment matrix). Assuming $x_i$ is the minimum point in the neighborhood $N(x_i)$, it is called a local minimum point; local maximum points are defined symmetrically.
(2) Nonextremum feature point. In the neighborhood $N(x_i)$, define

\[ r = \frac{\left| x_i - x_{i-1} \right|}{\left| x_{i+1} - x_i \right|} \tag{9} \]

When $r > \theta$ or $r < 1/\theta$, assuming a threshold $\theta > 1$, point $x_i$ fluctuates greatly within a short time.
After selecting the local extremum feature points and nonlocal extremum feature points, the specific steps of web page text information feature extraction based on segment estimation are as follows (a code sketch follows this list):
(1) Based on segmentation estimation, given the web page text information sequence $X = \{x_1, x_2, \ldots, x_t\}$, divide it into segments of equal length $L$, where a segment starting at $x_j$ has center point $x_{j + \lfloor L/2 \rfloor}$.
(2) Add the points given by the segmentation method to the feature point set.
(3) Select the locally important extremum points within the different segments through the definitions above and add them to the feature point set.
(4) Select the points with large fluctuation within a short time in the different segments and add them to the feature point set as well.
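Below is a minimal sketch of the four steps above, assuming fixed-length segments. The large-fluctuation test used here is a simplified analogue of formula (9) (a jump larger than theta times the segment's mean jump); the series and parameter values are made up.

```python
def feature_points(series, seg_len=10, theta=2.0):
    """Pick local-extremum points and short-term large-fluctuation points.

    The sequence is cut into segments of length seg_len; in each segment
    the minimum and maximum (local extremum candidates) and any point whose
    jump exceeds theta times the segment's mean jump are kept. The start
    and end of the sequence are always feature points.
    """
    points = {0, len(series) - 1}                    # start and end points
    for s in range(0, len(series), seg_len):
        seg = series[s:s + seg_len]
        if len(seg) < 3:
            continue
        points.add(s + seg.index(min(seg)))          # local minimum
        points.add(s + seg.index(max(seg)))          # local maximum
        jumps = [abs(seg[i + 1] - seg[i]) for i in range(len(seg) - 1)]
        mean_jump = sum(jumps) / len(jumps)
        for i, j in enumerate(jumps):
            if mean_jump and j > theta * mean_jump:  # large fluctuation
                points.add(s + i + 1)
    return sorted(points)

series = [1, 2, 1, 9, 2, 2, 3, 2, 8, 2, 1, 2]        # hypothetical data
print(feature_points(series, seg_len=6))
```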
3.3. Web Page Text Information Extraction Method under Network Topology Coincidence Degree
After the feature point extraction based on segment estimation is completed, web page text information extraction is carried out under the network topology coincidence degree. Traditional methods can only extract web page text information with the same or similar character expression forms and cannot adapt to changes in web page text information characteristics under the network topology coincidence degree, which reduces extraction accuracy. Therefore, the web page text information is extracted under the network topology coincidence degree.
In the process of web page text information extraction, the distance between feature vectors in the feature space must be calculated to measure the similarity between web page text information features [36]; the Euclidean distance method can be used to extract web page text information. Set the first web page text information as $T_1$ and the second as $T_2$, with feature vectors $v_1$ and $v_2$, respectively. The text extraction formula is then

\[ d(T_1, T_2) = \sqrt{\sum_{k=1}^{m} (v_{1k} - v_{2k})^2} \tag{10} \]

In formula (10), $d(T_1, T_2)$ is the Euclidean distance between the two web page text information features, and each component $v_{ik}$ is the result of normalization.
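A short sketch of formula (10) follows, assuming min-max normalization of each feature column over the whole feature matrix (the paper does not specify the normalization; this is one common choice). The feature matrix is hypothetical.

```python
import numpy as np

def normalize(features):
    """Min-max normalize each feature column to [0, 1] (inputs to (10))."""
    features = np.asarray(features, float)
    lo, hi = features.min(axis=0), features.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return (features - lo) / span

def text_distance(v1, v2):
    """Euclidean distance between two normalized feature vectors."""
    return float(np.sqrt(np.sum((np.asarray(v1) - np.asarray(v2)) ** 2)))

# Hypothetical feature matrix: rows are web page texts
F = normalize([[3.0, 5.0, 1.0],
               [2.0, 8.0, 1.0],
               [9.0, 1.0, 4.0]])
print(text_distance(F[0], F[1]))  # smaller value = more similar texts
```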
3.3.1. Web Text Information Mining
The basic idea of web page text information extraction is to cluster the features of web page text information according to similarity, with each cluster center representing the main features of one piece of web page text information, and to match the features by the cross-line method, thereby realizing extraction of web page text information under the network topology coincidence degree.
The implementation method for massive web page text information mining is to extract features according to the text features and use feature fusion to carry out text classification and recognition on massive network data text. The mature approach to text feature extraction assigns different weights to different words to represent the importance of each word in the document. The simplest weighting method is Boolean weighting: when a word appears in the document, its weight is 1; otherwise, its weight is 0. The definition is expressed by the following formula:

\[ w_{ij} = \begin{cases} 1, & tf_{ij} > 0 \\ 0, & tf_{ij} = 0 \end{cases} \tag{11} \]

In formula (11), $tf_{ij}$ represents the frequency of word $j$ in document $i$, and $w_{ij}$ represents the weighted result of the word. In the recognition stage of text data mining, digital normalization must be applied on the computer; through normalization, the keywords in a document can be well classified and measured and the important properties of each keyword characterized, realizing keyword recognition. The actual mining is realized from the following aspects:
(1) The training subject editing function displays the web page text in the form of parameters according to the requirements of the training subject, including web page text feature parameters, quantity parameters, property parameters, category parameters, and capacity parameters. The training plan refers to the technical measures, expressed in a specific form, that adopt optimization techniques for the massive-data mining environment according to the characteristics of the web page text. The key to the simulation training system is that the training plan can be designed freely to artificially increase the intensity and difficulty of training and thus achieve a better training effect.
(2) Combined with the performance of web page text mining and taking the massive data as the simulation object, deep association mining technology is used to mine web page text data in real time to meet the planning requirements of the training subjects.
(3) According to the training plan, web page text mining is simulated and optimization training is carried out by optimizing and adjusting technical measures; these measures should be combined with the expected mining performance to fully reflect the authenticity of the optimized simulated mining.
(4) Assessment and evaluation: according to the effect information of the mined data, the operation steps are recorded, the rate of change and other characteristic parameters in the mining results are calculated, and the whole operation process is evaluated qualitatively or quantitatively. The key is to establish a reasonable evaluation system and evaluate the results of optimization training scientifically, making the training more targeted.
The process of extracting web page text information under the network topology coincidence degree is as follows (a minimal sketch of these steps appears below):
(1) Set the type of web page text information to be extracted, that is, the number of clustering centers $c$ and the weight coefficient $m$ of the web page text information features, and determine the weight matrix $U$ of the feature attributes and the number of iterations.
(2) Calculate the objective function according to the weights of the web page text information eigenvalues.
(3) Set a threshold for the expansion and change of the web page text information features.
(4) Update the clustering centers of the web page text information features and update the membership function.
(5) Update the weights of the web page text information features.
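The following is a minimal fuzzy c-means style sketch of steps (1) to (5). The paper does not name its clustering algorithm, so this is one standard way to realize weighted feature clustering with membership updates; the cluster count, fuzzifier, and random feature matrix are illustrative assumptions.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iterations=100, eps=1e-5):
    """c cluster centers, fuzzifier m, membership matrix U; iterate center
    and membership updates until the objective change falls below eps."""
    n = len(X)
    rng = np.random.default_rng(0)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                            # memberships sum to 1
    prev_obj = np.inf
    for _ in range(iterations):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)  # step (4)
        dist = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2)
        dist = np.fmax(dist, 1e-12)               # avoid division by zero
        obj = float((Um * dist ** 2).sum())       # objective, step (2)
        if abs(prev_obj - obj) < eps:             # threshold, step (3)
            break
        prev_obj = obj
        U = 1.0 / (dist ** (2 / (m - 1)))         # membership update
        U /= U.sum(axis=0)                        # weight update, step (5)
    return centers, U

X = np.random.rand(60, 5)     # hypothetical reduced text features
centers, U = fuzzy_c_means(X, c=3)
print(U.argmax(axis=0)[:10])  # cluster assignment of the first 10 texts
```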
According to the mining process of web page text information, the association framework for multitask learning is constructed, as shown in Figure 5:

According to the association framework structure constructed in Figure 5, we sort out the text features of the association structure output by the final framework. The relevance extraction process of web page text information is simulated according to the actual output relevance structure text features.
According to the method described above, the characteristics of the web page text information are expressed in vector form, features are selected by an evaluation function, dimensionality reduction is performed, and the redundant data in the web page text information are deleted; the features are then clustered according to feature similarity, the objective function of web page text information extraction is determined, and constraints are applied. The key to web page text information extraction under the network topology coincidence degree is to minimize the value of the objective function.
3.3.2. Feature Extraction of Web Page Text Information
With the rapid expansion of human knowledge, mankind has entered the civilized stage of information explosion, and how to accurately extract web page text information under the network topology coincidence degree has become an urgent problem. To realize accurate extraction under the network topology coincidence degree, the characteristics of the web page text information must themselves be extracted accurately. These characteristics can be described in vector form as follows:

\[ d = \{(t_1, w_1), (t_2, w_2), \ldots, (t_n, w_n)\} \tag{12} \]

In formula (12), $t_k$ represents a feature value of the web page text information, $w_k$ represents the corresponding feature weight, and $n$ represents the number of features.
The content of web page text information can be described by a spatial vector model. If the web page text is long, the number of features will be large and the extraction process extremely complex; therefore, the main features must be selected to represent the web page text information and reduce its feature dimension. In the extraction process, an evaluation function is usually used to select the features. The commonly used feature evaluation functions include information gain, mutual information, and the $\chi^2$ statistic. Among them, the $\chi^2$ statistic can represent both positive and negative correlation between web page text information features and feature categories. Its expression is as follows:

\[ \chi^2(t, c) = \frac{N (AD - BC)^2}{(A + B)(C + D)(A + C)(B + D)} \tag{13} \]

In formula (13), $N$ represents the total amount of web page text information, and $A$, $B$, $C$, and $D$ represent the co-occurrence counts of feature $t$ and category $c$: documents of category $c$ containing $t$, documents of other categories containing $t$, documents of category $c$ not containing $t$, and documents of other categories not containing $t$, respectively. Through the $\chi^2$ statistic, appropriate web page text information features can be selected and the feature dimension reduced, providing a basis for the extraction of web page text information.
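The following sketch computes the $\chi^2$ statistic of formula (13) for one feature and one category from document counts; the toy documents and labels are hypothetical.

```python
def chi_square(docs, labels, term, category):
    """Chi-square statistic of formula (13) for one term and one category.

    A, B, C, D are the document counts of the four term/category
    combinations; N is the total number of documents.
    """
    N = len(docs)
    A = sum(1 for d, l in zip(docs, labels) if term in d and l == category)
    B = sum(1 for d, l in zip(docs, labels) if term in d and l != category)
    C = sum(1 for d, l in zip(docs, labels) if term not in d and l == category)
    D = N - A - B - C
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0

# Hypothetical tokenized pages and their categories
docs = [{"web", "text"}, {"web", "rank"}, {"sport", "news"}, {"sport", "text"}]
labels = ["tech", "tech", "sport", "sport"]
print(chi_square(docs, labels, "web", "tech"))  # high score: useful feature
```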
In the process of extraction, the clustering center of web page text information features and the weight of features are adaptively adjusted, and finally, the accurate extraction of web page text information is realized.
4. Experimental Analysis
To verify the overall effectiveness of the automatic extraction method of web page text information based on network topology coincidence degree, experiments were carried out in a hardware environment with an Intel Celeron (Tualatin) 1 GHz CPU and 385 MB of SDRAM and a MATLAB 6.0 software environment. The designed method is compared with the methods of reference [8] and reference [9] to verify the effect of automatic extraction of web page text information.
The automatic extraction tasks are divided into new web page text information extraction, complex extraction, and simple extraction, mainly to avoid overly uniform extraction content. The experimental test verifies the effect of the extraction method; increasing the complexity of the extracted content database during the experiment better reflects the completeness of the experimental data. The specific extracted content database is shown in Table 1.
The experimental environment consists of two computers connected to the Internet through network equipment: one serves as a web server providing data, and the other extracts user information. The connection mode is set to routing mode. The computer CPU is an Intel F4600, the hard disk is a 500 GB SATA drive, the memory is 2 GB, the main frequency is 5 GHz, and the download speed of the access network is 700 KB/s. The experimental data set consists of real mobile phone 2G/3G/4G/5G network traffic data of a province, lasting 30 days from November 1 to November 30, 2021; each network traffic record includes the user's mobile phone number. Nine relevant fields of the data set are selected for the experiment, with the field format shown in Table 2:
The Internet businesses concerned in the experiment are Baidu, Sina Weibo, Taobao, and QQ. The extraction parameter settings are shown in Table 3.
The 30-day user information in the data set is divided into three consecutive subsets by time period, each 10 days long, and a real user is identified by mobile phone number. The three groups of experiments each deduplicate the data set and screen out users' repeated digital identities. We count the user deduplication numbers of the three methods and take the average value. The experimental results are shown in Figure 6:

It can be seen from Figure 6 that the number of repeated QQ digital identities in this province is the largest: the method in this paper, the method of reference [8], and the method of reference [9] find 48%, 45%, and 42%, respectively. Sina Weibo follows, with 35%, 30%, and 27% for the three methods, respectively; Baidu and Taobao are the lowest, and about half of users have at least two digital identities. For the different Internet services, the deduplication count of this method is much larger than those of reference [8] and reference [9]; it reduces the number of digital identities of the same user, simplifies the complexity of the associated information in the data set, and greatly reduces the difficulty of information extraction.
After the user digital identities are deduplicated, the user information of the data set is extracted, and the information collection effects of the three groups of experiments are compared through the collection success rate, defined as

\[ \eta = \frac{N_s}{N_a - N_i} \times 100\% \tag{14} \]

In formula (14), $N_a$ represents the number of link addresses accessed, $N_i$ represents the number of invalid link addresses, and $N_s$ represents the number of link addresses from which information was successfully extracted. When a complete piece of data information is extracted, the information data are stored uniformly. We count the numbers of accessed addresses, invalid links, and successful collections for the three methods, calculate the collection success rate of the different services in the three periods of the data set, and take the average value. The comparison results are shown in Figure 7:
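As a small worked example of formula (14), with hypothetical counts from one collection run:

```python
def collection_success_rate(accessed, invalid, succeeded):
    """Acquisition success rate of formula (14): successful extractions
    as a share of the valid link addresses actually accessed."""
    valid = accessed - invalid
    return succeeded / valid if valid else 0.0

# Hypothetical counts: 1000 addresses accessed, 100 invalid, 810 successful
print(f"{collection_success_rate(1000, 100, 810):.1%}")  # 90.0%
```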

As can be seen from Figure 7, because of the high digital identity repetition rate of the QQ service, the collection success rates of the methods of reference [8] and reference [9] decline to varying degrees, while the collection effect of the method in this paper remains stable. Moreover, the information collection effect of this method is significantly better than those of reference [8] and reference [9], with an average success rate of 94.6%, against averages of 77.9% and 73.8% for reference [8] and reference [9], respectively. Compared with the two traditional methods, the success rate of this method is thus higher by 16.7 and 20.8 percentage points, respectively, and the automatically extracted web page text information is also more abundant.
To sum up, the designed method performs well: it reduces the difficulty of information extraction and improves both the amount of data collected and the success rate of information collection.
5. Conclusion and Prospect
5.1. Conclusion
(1) The automatic extraction method of web page text information based on network topology coincidence degree simplifies the complexity of the associated information in the data set.
(2) The method reduces the number of digital identities of the same user and greatly reduces the difficulty of information extraction.
(3) The designed method improves the amount of data collected and the success rate of information collection and enriches the automatically extracted web page text information.
5.2. Prospect
As there are still many deficiencies in the study of time relationships, in-depth research and analysis around the following aspects is needed in future work:
(1) Further improve the web page preprocessing method applied before web page text information extraction, especially by designing a good denoising algorithm, so as to reduce the interference caused by web page noise and improve extraction accuracy.
(2) Use rules to build the set to be predicted, deeply mine web page text information and browsing preferences, extract information more intelligently, and improve the recall rate of digital identities.
(3) Improve the text extraction algorithm for multimedia resources. The structure of web pages is complex and changeable, and extracting the text surrounding a multimedia resource is difficult; a better algorithm is needed to judge the relationship between the surrounding web page text and multimedia resources and improve the accuracy of web page text information extraction.
Data Availability
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Conflicts of Interest
The authors declare that they have no conflicts of interest regarding this work.