Abstract

In order to improve the classification accuracy and reduce the classification time of economic text big data visualization classification, a Python-based algorithm is proposed. The economic text big data are preprocessed by filtering out useless symbols, performing word segmentation, and removing stop words. Based on the preprocessing results, the features most relevant to the economic text big data classification process are selected using measures such as the Gini index, information gain, and mutual information, and the TF-IDF weighting algorithm is used to weight the economic text data features. On top of the weighted features, a Naive Bayes classifier is constructed by combining the class probability distribution with the text probability distribution, the optimal classification result is obtained from the input vector, and the visual classification of economic text big data is completed through Python programming. The simulation results show that the classification accuracy of the algorithm for the visual classification of economic text big data can reach 100%, and the classification time is less than 5 seconds; the algorithm thus has high accuracy and fast efficiency.

1. Introduction

With the rapid development and widespread application of information technology, the Internet has gradually become an important channel for people to obtain information. The amount of information data has exploded, and people have gradually moved from an era of information scarcity to an era of information overload. Hundreds of millions of web text documents are generated every second worldwide, and Internet users are often unable to quickly and effectively obtain the valuable content they need from such massive text big data. In order to quickly obtain valuable text big data, these texts need to be classified; therefore, the research and implementation of text classification systems are of great significance [1, 2]. Economic text is an important part of text big data and an important way for people to obtain information. Text classification was originally done manually, but in the face of exponentially increasing text information on the Internet, manual information extraction can no longer meet people's needs. This requires the computer's ability to process information automatically, quickly, and at scale to provide users with quick and accurate access to valuable information [3, 4]. With the continued growth of the Internet, economic text big data are no longer limited by the layout and length of paper media but present a huge amount of information with a wide variety of characteristics. Using computers to effectively filter and classify these economic data can quickly and efficiently surface valuable information content, reduce human resource investment, and improve information use efficiency. Text classification technology can filter economic data in advance, classify the collected economic data, and generate a categorized economic database, which facilitates users' browsing of the category information of interest and improves the user experience.
In short, text classification technology can help people filter and screen big data, which gives it important research significance in real life [5].

Therefore, many experts and scholars have conducted extensive research in the field of economic text processing, forming a relatively mature technical system for economic text classification. Economic text classification technology has developed into a branch of natural language processing and is widely used in research fields such as information retrieval, information screening, and automatic abstract generation. Using these research results, users can quickly obtain the information they need from a large amount of complex economic text data and, at the same time, filter out irrelevant information. In addition, in this fast-paced era, users need to quickly grasp the main content of information, and economic text classification technology can distill this information into a summary. Literature [6] proposed a Chinese multilabel text classification method based on big data technology. First, the semantic information fusion method and fuzzy block fusion method were used to extract the semantic similarity features of Chinese multilabel text big data, and a big data analysis model of Chinese multilabel text was established. Second, feature detection of Chinese multilabel text was carried out using a statistical feature extraction method, and a semantic feature extraction model of Chinese multilabel text big data was built. The fuzzy C-means clustering method was used to establish a fuzzy clustering model of Chinese multilabel text big data, and the multilevel feature information reorganization method was used to reconstruct the fuzzy features of Chinese multilabel text big data. Based on the semantic attribute characteristics of Chinese multilabel text, feature reconstruction realizes Chinese multilabel text classification and big data fusion processing. However, this method has low accuracy in the visual classification of economic text big data, resulting in poor classification results.
Literature [7] proposed a big data text classification method based on distributed NBC under the Spark framework. This method explores the Naive Bayes classifier (NBC) by studying the adaptability of the MapReduce and Apache Spark frameworks. In the big data computing framework, the training sample data set is divided into m categories based on the Naive Bayes text classification model. In the training phase, the output of the previous MapReduce job is used as the input of the next, and four MapReduce jobs are used; the design process makes full use of the parallel advantages of MapReduce [8-10]. Finally, the class label with the maximum value is selected during the classifier test, and the experiment is performed on the Newsgroups data set. The proposed method has higher classification accuracy than the comparison algorithm. However, the economic text big data visualization classification time of this method is relatively long, resulting in low classification efficiency [11, 12].

Aiming at the problems of the above methods, this study proposes a Python-based economic text big data visualization classification algorithm and verifies its effectiveness through simulation experiments. The innovations of this study are twofold: the TF-IDF weighting algorithm is used to determine the feature weights of economic text data, which accurately expresses the text information and addresses the low classification accuracy of existing methods; and the Naive Bayes algorithm is introduced to classify economic text big data, which improves the classification efficiency.

2. Python-Based Visual Classification of Economic Text Big Data

2.1. Python Software

Since all the steps of the visual classification of economic text big data in this article are completed through Python programming, a brief introduction to Python is provided below.

The word Python originally denotes a python snake. The language was proposed by Guido van Rossum in 1989 and is an object-oriented, interpreted computer programming language. Since its public release, it has been popular among programming enthusiasts. Python is characterized by concise and clear syntax, is easy to read, has good scalability, and is open-source software. Programmers do not need a deep programming foundation; the language is easy to understand and is very suitable for learning introductory programming [13-15]. At present, many institutions at home and abroad, including schools, have gradually developed training and courses related to Python programming. Python provides rich APIs and various scientific computing tools, so that even programmers accustomed to C, C++, and Java can easily get started and integrate packages written in other languages into the Python platform. In addition, the Python interpreter itself can be embedded into programs written in other languages, hence the name "glue language" [16-20]. The scientific computing extension libraries for Python have also received much praise, especially three very useful libraries: NumPy, SciPy, and matplotlib. NumPy is generally used to process arrays, and its processing speed is very fast; SciPy conveniently performs numerical operations; matplotlib provides rich plotting functions [21-23]. More and more researchers use Python and its extension libraries to process experimental data, draw charts, and even develop applications. In order to improve the visual classification efficiency of economic text big data, this study applies Python, which processes data arrays and performs numerical calculations very quickly [24].

2.2. Economic Text Big Data Preprocessing

The obtained economic text needs to be preprocessed so that the initial economic text is expressed as a word sequence containing only classification-relevant terms. Different economic texts undergo different steps during preprocessing. The steps and methods of text preprocessing are as follows [25]:

2.2.1. Filter Out Useless Symbols

Usually, a complete economic text consists of basic content, symbols, numbers, letters, etc. According to the characteristics of the original economic text data, regular expressions are used to extract the text content, and only information useful for economic text classification is retained.
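As an illustrative sketch, this filtering step can be implemented with Python's re module; the character classes kept and the sample string below are assumptions for illustration, not the exact pattern used in this study:

```python
import re

def clean_text(text):
    """Keep only CJK characters, ASCII letters, and digits; drop other symbols.

    The retained character classes are an illustrative assumption; the actual
    pattern should match the structure of the source corpus.
    """
    # Remove everything that is not a CJK character, ASCII letter, or digit.
    return re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9]", "", text)

print(clean_text("GDP增长6.5%!(2023)"))  # -> GDP增长652023
```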

2.2.2. Word Segmentation

After removing useless symbols from the original economic text, the text needs to be processed by word segmentation. Chinese word segmentation refers to dividing a sentence into independent words or phrases [26]; that is, word segmentation is the process of recombining consecutive character sequences into word sequences according to certain rules. Text segmentation is one of the important steps in the text classification process. It has attracted a large number of researchers to study word segmentation algorithms, and some good results have been achieved in this field. The commonly used word segmentation algorithms are as follows:

Segmentation Based on String Matching. A segmentation dictionary is the basic requirement for string-matching segmentation: the entries in the dictionary are compared with the strings in the text. If a comparison succeeds, the string can be separated from the surrounding content and used as a word or phrase of the text; if it fails, the string cannot be used as a word or phrase of the economic text and cannot be split off. This kind of segmentation has low time complexity and is easy to understand, but if a word or phrase does not appear in the segmentation dictionary, it cannot be segmented [27, 28].
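A minimal sketch of string-matching segmentation, using forward maximum matching (one common variant of this family); the toy dictionary and sentence are illustrative assumptions:

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Forward maximum matching: greedily take the longest dictionary entry
    starting at the current position, falling back to a single character.

    The dictionary passed in plays the role of the segmentation dictionary
    described in the text."""
    words, i = [], 0
    while i < len(sentence):
        matched = None
        # Try the longest candidate window first, shrinking it on failure.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if piece in dictionary or size == 1:
                matched = piece
                break
        words.append(matched)
        i += len(matched)
    return words

vocab = {"经济", "文本", "分类"}
print(forward_max_match("经济文本分类", vocab))  # -> ['经济', '文本', '分类']
```

Note that a string absent from the dictionary degrades into single characters, which illustrates the limitation mentioned above.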

Statistics-Based Word Segmentation. This kind of segmentation builds a statistical model of Chinese, which requires manually annotated part-of-speech and statistical features as its basic conditions. Once the model is built, the segmentation operation is performed: the probability of each candidate word is calculated by the model, and the word with the highest probability value is taken as the final segmentation result. The advantage of this approach is that it overcomes the shortcomings of string-matching segmentation, whereas the disadvantage is that building the model requires manual labeling, which limits its use, and its time complexity is relatively high.

Word Segmentation Based on Comprehension. The principle is to analyze the grammar and semantics of words simultaneously during segmentation; syntactic and semantic information is indispensable for eliminating word ambiguity. The whole segmentation system consists of three interrelated subsystems. To determine whether a segmentation is ambiguous, the system relies on the syntactic and semantic information of words and sentences obtained by the segmentation subsystem under the coordination of the general control part. Because the generality and complexity of language knowledge have not yet been fully addressed, comprehension-based word segmentation technology is currently at an immature stage [29].

The word segmentation tool in this article is the ICTCLAS word segmentation system developed by the Chinese Academy of Sciences, through which the text segmentation operation is performed. The ICTCLAS system can automatically discover new feature words from the text content based on information cross-entropy and then adaptively fit the language probability distribution model of the corpus to realize adaptive word segmentation. The system has high segmentation accuracy and high applicability. Even if the segmentation dictionary does not cover a sentence, it can segment words according to the characteristics of the words and sentences, and its time complexity is low and easy to understand. Therefore, it overcomes the limitations of the above word segmentation methods.

2.2.3. Removing Stop Words

The content of an economic text is generally composed of verbs, nouns, adjectives, and other words, but it also contains many function words, also called stop words, which carry no textual information. Stop words can be divided into two categories: (1) words that are widely distributed and appear in all types of text, so that they cannot distinguish text categories; for example, the word "we" exists in many types of texts, so it cannot serve as an information word representing a particular category; (2) function words such as prepositions, adverbs, conjunctions, and modal particles, such as "ah" and "being." Filtering out these words with no actual meaning improves the accuracy of the later text classification. In order to improve accuracy, the feature words participating in economic text classification must not contain stop words, so the stop words need to be deleted. The words in the stop-word list are compared with the words in the text vocabulary set after segmentation: if a word does not exist in the stop-word list, it is kept as text content; otherwise, it is deleted [30].
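The stop-word deletion step can be sketched as follows; the stop-word list and tokens below are illustrative assumptions, not the list used in this study:

```python
def remove_stop_words(tokens, stop_words):
    """Keep a token only if it does not appear in the stop-word list,
    mirroring the comparison procedure described above."""
    return [t for t in tokens if t not in stop_words]

# Toy stop-word list (particles and pronouns) and a toy segmented text.
stop_words = {"我们", "的", "了", "啊"}
tokens = ["我们", "关注", "经济", "的", "增长"]
print(remove_stop_words(tokens, stop_words))  # -> ['关注', '经济', '增长']
```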

2.3. Text Feature Processing
2.3.1. Text Feature Selection

Feature selection of economic text data refers to the process of constructing the minimal and best feature set that can represent the original text through specific methods without changing the original spatial attributes. In feature selection, we need to identify the features that are most relevant to the classification process, because some words are more likely to be related to the category distribution than others. Therefore, it is necessary to determine the most important features for classification and to define category contribution functions for the feature words, including the Gini index, information gain, mutual information, etc. [31].

(1) Gini Index. One of the most commonly used methods to quantify the discriminative level of a feature is the measure called the Gini index. Let p_i(w), i = 1, ..., k, represent the fraction of class labels of class i among the documents containing word w. In other words, p_i(w) is the conditional probability that a document belongs to the i-th category, provided that the document contains the word w. Therefore, there is

Σ_{i=1}^{k} p_i(w) = 1.

Then, the Gini index G(w) of word w is defined as follows:

G(w) = Σ_{i=1}^{k} p_i(w)^2.

The value of the Gini index always lies in the range [1/k, 1]. The greater the value of G(w), the greater the discriminative power of w. For example, when all documents containing the word w belong to a particular class, the value of G(w) is 1. On the other hand, when documents containing word w are evenly distributed among the k different classes, the value of G(w) is 1/k. One criticism of this method is that the global class distribution may be skewed from the beginning, so the above measure may sometimes not accurately reflect the discriminative ability of the underlying attribute. Therefore, in order to reflect the discriminative ability of an attribute more accurately, a normalized Gini index can be constructed. Let P_i represent the global fraction of documents belonging to class i; the normalized probability value p'_i(w) is then determined as follows:

p'_i(w) = (p_i(w) / P_i) / Σ_{j=1}^{k} (p_j(w) / P_j).

These normalized probability values are used to calculate the Gini index as follows:

G(w) = Σ_{i=1}^{k} p'_i(w)^2.

The use of the global probabilities P_i ensures that the Gini index reflects the differences between the classes more accurately when the class distribution over the entire text collection is skewed. For a text corpus containing n texts, d words, and k classes, the Gini index computation takes on the order of O(n + d·k) time, because the class-conditional statistics are collected in a single pass over the corpus, after which G(w) must be evaluated for all d different words and k classes.
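The plain and normalized Gini index above can be sketched in Python; the probability vectors used in the example are hypothetical:

```python
def gini_index(p, global_p=None):
    """Gini index G(w) = sum_i p_i(w)^2, where p[i] is the conditional
    probability of class i given that the document contains word w.

    If the global class fractions P_i are supplied, the probabilities are
    first normalized by the global distribution, as described above."""
    if global_p is not None:
        ratios = [pi / Pi for pi, Pi in zip(p, global_p)]
        total = sum(ratios)
        p = [r / total for r in ratios]
    return sum(pi ** 2 for pi in p)

# A word concentrated in one class has G(w) = 1 ...
assert gini_index([1.0, 0.0, 0.0]) == 1.0
# ... and an evenly spread word has G(w) = 1/k.
print(round(gini_index([1/3, 1/3, 1/3]), 4))  # -> 0.3333
```

After normalization, a word whose class-conditional distribution merely mirrors the global skew receives the minimal value 1/k, which is the correction the normalized index is meant to provide.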

(2) Information Gain. Another commonly used text feature selection method is information gain, or entropy. Let P_i be the global probability of class i, and let p_i(w) be the probability of class i given that the document contains the word w. Let F(w) be the fraction of the texts containing the word w. The information gain is then defined as follows:

I(w) = −Σ_{i=1}^{k} P_i · log P_i + F(w) · Σ_{i=1}^{k} p_i(w) · log p_i(w) + (1 − F(w)) · Σ_{i=1}^{k} p_i(w̄) · log p_i(w̄).

Among them, I(w) represents the information gain (IG) value, k represents the total number of categories, i indexes the category, P_i represents the probability that a sample in the sample set belongs to category i, F(w) represents the probability that a text contains the word w, p_i(w) represents the probability of category i given that the word w appears, and p_i(w̄) represents the conditional probability that a sample belongs to category i under the premise that the word w does not appear. The larger the IG value, the more important the feature item is, the more it reflects the category information, and the more conducive it is to classification. A threshold is set: when the IG value of a feature item is greater than the preset threshold, the feature word is retained; otherwise, it is eliminated. That is to say, when a new text arrives, the feature words and weights of the new text are determined, so as to reduce the dimension of the text vector.
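The information gain formula above can be sketched as follows; the probabilities passed in the example are toy numbers assumed for illustration:

```python
from math import log

def information_gain(P, p_w, p_not_w, F):
    """Information gain of word w:
      I(w) = -sum_i P_i log P_i
             + F(w) * sum_i p_i(w) log p_i(w)
             + (1 - F(w)) * sum_i p_i(~w) log p_i(~w)
    P       : global class probabilities P_i
    p_w     : class probabilities given that w is present
    p_not_w : class probabilities given that w is absent
    F       : fraction of documents containing w"""
    def plogp(ps):
        # sum of p * log p over the nonzero entries (0 log 0 is taken as 0)
        return sum(p * log(p) for p in ps if p > 0)
    return -plogp(P) + F * plogp(p_w) + (1 - F) * plogp(p_not_w)

# A word whose presence fully determines the class attains the maximal gain,
# equal to the class entropy (log 2 in nats for two balanced classes).
ig = information_gain(P=[0.5, 0.5], p_w=[1.0, 0.0], p_not_w=[0.0, 1.0], F=0.5)
print(round(ig, 4))  # -> 0.6931
```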

(3) Mutual Information. The mutual information measure is derived from information theory and provides a formal method for modeling the mutual information between features and classes. The pointwise mutual information M_i(w) between word w and class i is defined according to the degree of co-occurrence of class i and word w. The expected co-occurrence of category i and word w on the basis of mutual independence is given by F(w) · P_i, and the true co-occurrence is given by F(w) · p_i(w). In practice, the value of F(w) · p_i(w) may be much larger or much smaller than F(w) · P_i, depending on the degree of correlation between class i and word w. The correlation of a word can be determined in two ways. One is to determine the correlation by the ratio between the word's class-conditional probability and the global class probability; its calculation formula is as follows:

M_i(w) = log( (F(w) · p_i(w)) / (F(w) · P_i) ) = log( p_i(w) / P_i ).

The second is to consider the sign: word w is positively correlated with class i when M_i(w) > 0, and negatively correlated when M_i(w) < 0. Note that M_i(w) is specific to a particular class i. The overall mutual information needs to be calculated as a function of the mutual information values M_i(w) of the word with the different classes. These are defined using the average and maximum values of M_i(w) over the different classes:

M_avg(w) = Σ_{i=1}^{k} P_i · M_i(w),
M_max(w) = max_i M_i(w).

To determine the relevance of word w, either of these two measures can be used. However, the maximum-based measure M_max(w) is more effective in determining the positive correlation between a word and a category, because it highlights the single class with which the word is most strongly associated rather than diluting the signal by averaging over all classes.
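The mutual information measures above can be sketched as follows; the class-conditional and global probabilities are toy numbers assumed for illustration:

```python
from math import log

def pointwise_mi(p_i_w, P_i):
    """Pointwise mutual information M_i(w) = log(p_i(w) / P_i)."""
    return log(p_i_w / P_i)

def mi_avg(p_w, P):
    """Average over classes: M_avg(w) = sum_i P_i * M_i(w)."""
    return sum(Pi * pointwise_mi(pi, Pi) for pi, Pi in zip(p_w, P) if pi > 0)

def mi_max(p_w, P):
    """Maximum over classes: M_max(w) = max_i M_i(w)."""
    return max(pointwise_mi(pi, Pi) for pi, Pi in zip(p_w, P) if pi > 0)

# A word over-represented in class 0 relative to a uniform global
# distribution (toy numbers): M_max picks out the positive correlation.
p_w, P = [0.8, 0.2], [0.5, 0.5]
print(round(mi_max(p_w, P), 4))  # -> 0.47
```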

2.3.2. Text Feature Weighting

This study uses the TF-IDF weighting algorithm to weight the features of the economic text data. Term frequency-inverse document frequency (TF-IDF) is one of the most commonly used weighting algorithms; it is usually used to measure the importance to a text of the words or phrases it contains. The main idea of TF-IDF is as follows: if a word or phrase appears more frequently than other words or phrases in one text but rarely appears in other texts, then the word or phrase is believed to have a good ability to distinguish text categories. Its calculation formula is as follows:

w_ij = tf_ij × idf_i.

Among them, idf_i reflects how rarely the feature item appears across the entire text set, and tf_ij represents the number of times the word appears in the text, which is used to measure the ability of the word to express the text content. For a word in a text, its term frequency is expressed as follows:

tf_ij = n_ij / Σ_k n_kj.

Among them, n_ij represents the number of occurrences of the specified feature item t_i in text d_j, and the denominator Σ_k n_kj represents the sum of the occurrences of all words or phrases in the text. The inverse document frequency idf_i is inversely related to the frequency of the feature item across the text set and is used to measure the ability of the word to distinguish documents. The inverse document frequency of a specified feature item depends on the total number of texts and the number of documents containing the feature item. Based on the feature values and mutual information measures obtained in the previous section, the IDF calculation formula for feature item t_i is as follows:

idf_i = log( |D| / |{ j : t_i ∈ d_j }| ).

Among them, |D| represents the total number of texts in the corpus D, |{ j : t_i ∈ d_j }| represents the number of texts containing the word t_i, t_i represents the i-th feature item, and d_j represents the j-th text.
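A minimal sketch of the TF-IDF weighting defined above, computed over a tiny hypothetical tokenized corpus (the documents and tokens are illustrative assumptions):

```python
from math import log

def tf_idf(docs):
    """Compute w_ij = tf_ij * idf_i for a list of tokenized documents, where
    tf_ij = n_ij / sum_k n_kj (term frequency within document j) and
    idf_i = log(|D| / df_i)   (inverse document frequency)."""
    N = len(docs)
    df = {}  # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        counts = {}
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
        total = len(doc)
        weights.append({t: (n / total) * log(N / df[t]) for t, n in counts.items()})
    return weights

docs = [["经济", "增长"], ["经济", "政策"], ["文化", "政策"]]
w = tf_idf(docs)
# "增长" appears in only one of three documents, so it gets a high weight.
print(round(w[0]["增长"], 4))  # -> 0.5493
```

A term that occurs in every document gets idf = log(1) = 0 and is therefore weighted out, which matches the intuition stated above.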

2.4. Visual Classification Based on Naive Bayes Algorithm

This study uses the Naive Bayes algorithm to visually classify the economic text big data, because the main advantages of the Naive Bayes classifier are its short training time and high accuracy, which improve the effectiveness of the visual classification of economic text big data. The Naive Bayes algorithm is a probabilistic classifier built on known prior probabilities and class-conditional probabilities. Its basic idea is to calculate the probability P(c|d) that document d belongs to class c. In Bayesian classifiers (also called generative classifiers), a probabilistic model of the word features of the different classes is established; the text is then classified according to the posterior probability that a text with the observed words belongs to each category. Naive Bayes is easy to implement and compute. Its Bayes theorem formula is as follows:

P(c|d) = P(d|c) · P(c) / P(d).

Among them, P(c|d) is the posterior probability of class c given document d, P(d|c) is the class-conditional probability of observing document d under class c, P(c) is the prior probability of class c, and P(d) is the prior probability of document d. The posterior probability contains more information than the prior probability, because it is updated with the evidence carried by the document itself.

Naive Bayes is a statistical method in probabilistic form, estimating a set of probability parameters with the purpose of combining the class probability distribution and the text probability distribution. It is based on Bayes' rule and relies on a simple document representation. The best visual classification is selected from the input vector, and a Naive Bayes classifier is constructed to select the most likely category. It calculates the conditional probability P(c | w_1, ..., w_n) of each category c, where each w_i is a feature in the context. The Naive Bayes classifier finds the most suitable class c* in the context by choosing the class that maximizes the following formula:

c* = argmax_c P(c) · Π_i P(w_i | c).
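A minimal multinomial Naive Bayes sketch following the argmax rule above; Laplace smoothing and the tiny training corpus are illustrative assumptions, and the classifier used in this study may differ in detail:

```python
from math import log
from collections import defaultdict

class NaiveBayesText:
    """Multinomial Naive Bayes with Laplace smoothing, choosing
    c* = argmax_c [ log P(c) + sum_i log P(w_i | c) ]."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.word_counts = {c: defaultdict(int) for c in self.classes}
        self.totals = {c: 0 for c in self.classes}
        self.vocab = set()
        for doc, c in zip(docs, labels):
            for w in doc:
                self.word_counts[c][w] += 1
                self.totals[c] += 1
                self.vocab.add(w)

    def predict(self, doc):
        best, best_score = None, float("-inf")
        V = len(self.vocab)
        for c in self.classes:
            score = log(self.priors[c])
            for w in doc:
                # Laplace smoothing avoids zero probabilities for unseen words.
                score += log((self.word_counts[c][w] + 1) / (self.totals[c] + V))
            if score > best_score:
                best, best_score = c, score
        return best

clf = NaiveBayesText()
clf.fit([["股票", "上涨"], ["股票", "下跌"], ["电影", "上映"]],
        ["finance", "finance", "entertainment"])
print(clf.predict(["股票", "上涨"]))  # -> finance
```

Working in log space avoids numerical underflow when the product runs over many word features, which matters for long economic texts.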

3. Simulation Experiment Analysis

In order to verify the performance of the Python-based economic text big data visualization classification algorithm proposed in this study in practical applications, the simulation analysis is carried out through Python programming on the Windows 10 operating system, in a Python 2.7 development environment, on a machine with a 2.7 GHz Intel Core i5 CPU and 8 GB of 1867 MHz DDR3 memory.

This study selects several types of economic data that people encounter daily as the corpus. The content is the Sohu economic data training set provided by Sogou Lab. Seven types of data are selected to form the corpus for analysis, with 2000 documents per category. The basic information of the corpus is listed in Table 1.

The classification accuracy is calculated from the test results, and its expression is as follows:

Accuracy = TP / (TP + FP) × 100%.

In the formula, TP represents the number of samples belonging to the positive class that are correctly classified into the positive class, and FP represents the number of samples belonging to the negative class that are incorrectly classified into the positive class.
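With hypothetical counts, this metric can be computed as follows:

```python
def classification_accuracy(tp, fp):
    """Accuracy as defined above: TP / (TP + FP), as a percentage.
    The counts passed below are hypothetical, not experimental results."""
    return tp / (tp + fp) * 100

# 95 true positives and 5 false positives give 95% accuracy.
print(classification_accuracy(95, 5))  # -> 95.0
```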

Using classification accuracy as the experimental indicator, the Python-based economic text big data visualization classification algorithm proposed in this study, the Chinese multilabel text classification algorithm based on big data technology proposed in reference [6], and the distributed NBC-based big data text classification algorithm proposed in reference [7] are used to classify the above document categories, and the classification accuracy of the three algorithms is verified. The comparison results are shown in Figure 1.

According to Figure 1, the document classification accuracy of the Python-based economic text big data visualization classification algorithm proposed in this study reaches up to 100%, while the accuracies of the Chinese multilabel text classification algorithm based on big data technology proposed in reference [6] and the distributed NBC-based big data text classification algorithm proposed in reference [7] are only 80% and 70%, respectively. This shows that the document classification effect of the proposed Python-based algorithm is better than that of both comparison algorithms.

In order to further verify the effectiveness of the algorithm in this study, the document category classification times of the proposed Python-based economic text big data visualization classification algorithm, the Chinese multilabel text classification algorithm based on big data technology proposed in reference [6], and the distributed NBC-based big data text classification algorithm proposed in reference [7] are compared and analyzed. The comparison result is shown in Figure 2.

According to Figure 2, the document category classification time of the Python-based economic text big data visualization classification algorithm proposed in this study is within 5 s, while the classification times of the algorithm proposed in reference [6] and the distributed NBC-based big data text classification algorithm proposed in reference [7] are within 9 s. The proposed algorithm therefore classifies document categories in a shorter time than both comparison algorithms.

4. Discussion

This study significantly improves the visual classification accuracy of economic text big data. This is because the economic text big data are first preprocessed, eliminating useless symbols, performing word segmentation, and removing stop words, which lays a concise and accurate data foundation for the later text classification. The TF-IDF weighting algorithm is then used to weight the features of the economic text data, and the Naive Bayes algorithm is used to complete the visual classification of economic text big data, which improves the overall classification accuracy. This study also significantly improves the classification efficiency of economic text big data. The preprocessing eliminates useless symbols, performs word segmentation, and removes stop words, which facilitates the subsequent classification of the text and saves processing time. At the same time, based on the preprocessing results, the features most relevant to the economic text big data classification process are selected for classification, which improves the classification efficiency.

5. Conclusion

With the rapid spread of the Internet and the rapid development of digital media technology, a large amount of data information has appeared on the Internet. As more and more information is stored in electronic form, the amount of text is growing rapidly. This information is mainly stored in text databases and repositories, which contain a large amount of text from various sources, such as e-mail messages, business, news articles, entertainment, government, art, literature, military, economics, and digital libraries. The text stored in these databases has an inherently unstructured or semi-structured format. The increasing size and complexity of this unstructured text data make its effective management a crucial issue, and this has led to an increase in text classification research. In order to improve the classification of economic text big data, this study proposes a Python-based visual classification algorithm. It uses the TF-IDF weighting algorithm to determine the feature weights of economic text data, accurately expresses the text information, and introduces the Naive Bayes algorithm to classify economic text big data. Simulation experiments demonstrate that the proposed method has high classification accuracy and efficiency. However, due to limited conditions, the classification results of this study are still insufficient, and future work needs to further improve the clarity and completeness of the classification results.

Data Availability

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.