Abstract
The boom in teaching Chinese as a foreign language (TCFL) and in personalized learning has sharply increased the demand for Chinese reading materials. Numerous Chinese reading materials are available for foreign students and learners to read and evaluate. High-quality, well-arranged TCFL reading materials make it easier for learners with different levels of reading comprehension and interpretation ability to master the language more quickly. Therefore, this study carries out an automatic readability assessment of books in Chinese as a foreign language. Building on existing readability assessment research, this paper comprehensively considers the factors that affect the difficulty of reading materials from the perspective of Chinese ontology. Natural language processing and a database management system are used to extract the features of books in Chinese as a foreign language, and text readability is evaluated with a statistical machine learning algorithm. The model is optimized with ranking (sorting) feature selection, and packaging (wrapper) feature selection is introduced to improve the algorithm's performance. The feature sets and each independent feature in the three dimensions of word meaning, part of speech, and discourse are optimized with the machine learning regression model according to selected evaluation indexes. Moreover, the results indicate that the regression model is effective at identifying and recommending simpler textbooks for learners who struggle with difficult foreign language materials. For highly proficient learners, this approach significantly improves the performance and efficiency of measuring the readability of books.
1. Introduction
Students are encouraged to read a variety of books in addition to their textbooks, as extensive reading is important for learning a foreign language. Reading materials for learning a second language must be carefully selected. On the one hand, a book that is too challenging obstructs comprehension and leaves learners feeling frustrated and discouraged. On the other hand, a book that is too simple will not push the learner's language skills [1]. Valuable materials should be both challenging and readable, with a good balance of new, challenging, and familiar words. Graded readers, which are designed with specific language skill levels in mind, help to solve this problem to some extent, but they cover only a small amount of TCFL material and their grade levels are still coarse. Language teachers are usually tasked with identifying acceptable materials for each particular student, yet they are often unable to do so [2, 3].
Therefore, in foreign language learning, it is important to identify reading materials that match the reading ability of learners or audiences. Book readability assessment, also known as book reading difficulty grading, is a natural language processing (NLP) task that classifies reading resources according to their reading complexity [4]. Since the nineteenth century, systematic techniques have been developed for understanding the objective and subjective aspects of book readability, for helping readers understand more difficult books, and for properly determining book reading difficulty [5]. Based on this research, book readability has been characterized as the totality of all variables that affect readers' understanding of TCFL materials, their reading speed, and their interest in the book content. These variables may include the difficulty of the syntax and the readers' conceptual familiarity with specific ideas in the book. They also include the complexity of the reasoning or inference needed to connect the ideas in the book, the presence of graphics or images that clarify the text, and many other components [6]. In addition to these textual traits, readers' attributes, such as interest, education, social background, professional competence, and other variables, can substantially influence book accessibility [7, 8].
Readability is the degree to which a text can be read and understood. One purpose of readability research is to develop readability formulas that measure the difficulty of a text in a given language [9]. A readability formula combines the quantifiable factors that affect reading difficulty, especially textual factors, into a single expression that evaluates the difficulty of a text. Intelligent auxiliary systems are a popular research topic in the field of education. Readability evaluation provides a research framework, methodology, and content for the design and development of an intelligent expert evaluation system, of which it is an important part [10]. In view of the massive demand for books in Chinese as a foreign language, the problems of difficulty control in textbook compilation, and the role of readability assessment in the intelligent expert assessment system, it is important to carry out readability assessment research on books in Chinese as a foreign language [11]. The feature-based machine learning approach builds on advances in machine learning and NLP technology and combines complex features with novel ways of assessing text difficulty [12]. Regression methods in machine learning include support vector machines [13]. Supervised feature selection is classified into three types according to how it is combined with the learning algorithm: embedded feature selection, packaging (wrapper) feature selection, and ranking (filter) feature selection [14, 15]. This approach can add the prediction results of the formula method, the cognitive theory method, and the language model method to the regression model as feature indexes to improve evaluation performance, which makes it superior to the other evaluation methods.
This study proposes a personalized machine learning system that automatically assists foreign language learners in selecting the most appropriate reading material based on vocabulary complexity. The personalized machine learning algorithm selects a small sample of words with which foreign students are initially assessed. The technique learns to recognize complex words in books for teaching Chinese as a foreign language (TCFL). The most prevalent approaches classify a word as simple if the user is familiar with it, and numerous books are available to help foreign students improve their vocabulary. As a basic step, this work aims to find not only familiar and unfamiliar vocabulary but also challenging words for international students, which is our first contribution. Furthermore, this study applies an experiment-based teaching of Chinese as a foreign language (ETCFL) approach to books and examines its influence on the vocabulary difficulty of the retrieved books from the perspective of a language student. In particular, the algorithm selects material for the learner according to its number of non-complex words and its proportion of difficult terms. This work shows that the method is efficient in obtaining easier books for lower-proficiency learners and more challenging books for higher-proficiency learners, which is our second contribution.
The main contributions of the paper are the following:
(i) This work presents an automated and effective assessment of book readability in Chinese as a foreign language, which can help specific groups, such as text processors, students, and teachers, by identifying and categorizing the reading difficulty of related books.
(ii) A comprehensive evaluation of the factors that affect reading materials from the perspective of Chinese ontology is provided, based on current book readability evaluations and using natural language processing and database management technologies.
(iii) A statistical machine learning method is used to evaluate the readability of a book. For this purpose, feature selection technology is utilized to optimize the model, and the sorting feature selection approach is used to extract the features of books in Chinese as a foreign language.
(iv) Using a statistical machine learning approach, determining the readability of books in Chinese as a foreign language is turned into a regression problem for data mining. An analogous process, the cross-industry standard process for data mining (CRISP-DM), is discussed as well.
The rest of this paper is organized as follows: Section 2 reviews related work, Section 3 describes the machine learning algorithm principles, Section 4 presents feature extraction for books in Chinese as a foreign language, Section 5 describes the evaluation process, and Section 6 presents the experimental analysis and evaluation results. Finally, Section 7 concludes the paper.
2. Related Work
The readability of a book is typically used to determine how challenging it is for individuals to comprehend the book's material. Either a specified readability level or a readability score can be used to assess the readability of a book [16]; this study uses the readability level. The assessment of book readability may be regarded as a classification problem: a statistical model is built from a sample of books with known readability levels and is then applied to texts whose readability levels are unknown [17]. The study of text readability assessment has at least a decade of history; yet it is far from a solved topic, and automated text readability measurement remains a difficult research subject [18]. Recent readability research has relied heavily on the vocabulary aspects of a text, representing relevant word attributes with proxy characteristics. When making judgments, the diversity, complexity, and breadth of implementation need to be considered. The knowledge of experienced judges, together with correlation analysis, is used to establish whether the estimate of a text's difficulty level is improved [19]. These studies show that research on text readability began to pay attention to the whole process of selecting features. The readability research system was established in the 1940s and lasted into the 1990s. During this time, academics began to experiment with different readability formulas and to incorporate substitution variables representing vocabulary and syntactic material into the calculations. They developed direct groupings in the hope of obtaining an appropriate standard for assessing reading difficulty and properly evaluating text readability [20]. The purpose of the complex word identification task is to classify a word as complex if the user is unfamiliar with it and as non-complex otherwise.
Complex word identification (CWI) is a technique commonly used in vocabulary simplification tasks to determine which words should be replaced to improve a user's reading comprehension while avoiding unnecessary simplification [21]. In a similar task on CWI for English, word frequency was found to be the most reliable predictor of word difficulty. By merging numerous machine learning, threshold-based, and lexicon-based voting sub-systems, the winning team achieved a recall of 0.769 and an accuracy of 0.147. Because the complete test set was graded by a single student, this strategy has also been evaluated in language classes of different ability levels [22].
3. Machine Learning Algorithm Principles
Even when the number of training samples is relatively small, the machine learning algorithm can achieve good generalization ability in regression. In the linearly inseparable case, the algorithm maps the data into a higher-dimensional space through a kernel function and constructs a linear decision function in that space, which solves the dimension problem [23]. The kernel function determines the complexity of the set of regression functions, and the performance of the algorithm is controlled by a learning strategy that embodies the principle of structural risk minimization. Finally, the global optimal solution is obtained by solving a convex quadratic programming problem.
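The decision surface is constructed as
\[ w^{T}x + \gamma = 0, \]
where $w$ is perpendicular to the direction of the dividing line and is called the normal vector, and the scalar $\gamma$ controls the linear intercept.
To solve for the distance from a sample point $x$ to the decision surface, the formula is rearranged as
\[ d = \frac{\lvert w^{T}x + \gamma \rvert}{\lVert w \rVert}, \]
where $\lVert w \rVert$ is the 2-norm of $w$. With the closest sample points scaled so that $\lvert w^{T}x + \gamma \rvert = 1$, the distance between the two classes is $2/\lVert w \rVert$, which is the "interval."
The optimization problem of the machine learning algorithm is the maximization problem
\[ \max_{w,\gamma} \frac{2}{\lVert w \rVert} \quad \text{s.t.} \quad y_i\,(w^{T}x_i + \gamma) \ge 1, \quad i = 1, \dots, n, \]
where $y_i \in \{-1, +1\}$ is the label of sample $x_i$ and the total number of sample points is given by $n$.
In its standard dual form, with kernel function $K$, this optimization problem is represented as
\[ \max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \]
and the sequential minimal optimization (SMO) algorithm is usually used to obtain the multipliers $\alpha$.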
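The geometric quantities in this section can be checked numerically. The following is a minimal sketch, assuming a linear decision surface of the form w · x + γ = 0 and a Gaussian (RBF) kernel as one common kernel choice; the function names and the kernel are illustrative assumptions and are not taken from the study.

```python
import math


def decision_value(w, gamma, x):
    """Value of the linear decision function w . x + gamma."""
    return sum(wi * xi for wi, xi in zip(w, x)) + gamma


def distance_to_surface(w, gamma, x):
    """Geometric distance from point x to the surface w . x + gamma = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, gamma, x)) / norm


def margin(w):
    """Width 2 / ||w|| of the 'interval' between the two classes."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel (an assumed choice) used to work in a higher-dimensional space."""
    squared = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-squared / (2.0 * sigma ** 2))


if __name__ == "__main__":
    w, gamma = [3.0, 4.0], -5.0
    print(distance_to_surface(w, gamma, [3.0, 4.0]))
    print(margin(w))
```

A smaller norm of w gives a wider interval, which is why maximizing 2/‖w‖ is usually rewritten as minimizing ‖w‖²/2.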
4. Feature Extraction of Books in Chinese as a Foreign Language
4.1. The Hierarchical Features of Words
Text processing uses natural language processing technology, including Chinese word segmentation and word frequency statistics, together with database technology, to determine the number of words at each level. Figure 1 shows the flow chart of the entire procedure.

4.1.1. First Step
The first step is text word segmentation. Because existing word segmentation technology is relatively mature, the NLPIR Chinese word segmentation system of the Chinese Academy of Sciences is used. The more information a text contains, the more granular it becomes [24].
4.1.2. Second Step
The second step is word frequency statistics, which count the number of occurrences of each word in the article. Punctuation is deleted so that only words remain, and the words are then counted. Finally, the word counts are output as a file.
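The counting step can be sketched as follows. This is a minimal illustration that assumes the text has already been segmented into a list of tokens (for example by a segmenter such as NLPIR); the function names and the tab-separated output format are assumptions made for the example.

```python
import unicodedata
from collections import Counter


def is_punctuation(token):
    """True if every character is punctuation, a symbol, a separator or a control code."""
    return all(unicodedata.category(ch)[0] in ("P", "S", "Z", "C") for ch in token)


def word_frequencies(tokens):
    """Count occurrences of each word after dropping empty and punctuation-only tokens."""
    words = [t for t in tokens if t and not is_punctuation(t)]
    return Counter(words)


def write_frequencies(counts, path):
    """Write one 'word<TAB>count' line per word, most frequent first."""
    with open(path, "w", encoding="utf-8") as out:
        for word, n in counts.most_common():
            out.write(f"{word}\t{n}\n")


if __name__ == "__main__":
    tokens = ["我", "看", "书", "，", "他", "也", "看", "书", "。"]
    counts = word_frequencies(tokens)
    print(counts["看"], sum(counts.values()))
```

Filtering by Unicode category removes both ASCII and full-width Chinese punctuation without maintaining a hand-written punctuation list.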
4.1.3. Third Step
In the third step, the word frequency statistics are saved as an .xlsx file and stored in a database, from which two databases are built [25].
4.1.4. Fourth Step
The fourth step removes duplicated statistics for the same word or character. For example, the word "看" appears three times in the full text, but it is listed twice in the results, which is obviously a repeated count. The multi-level word list in the HSK vocabulary level standard is shown in Table 1.
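The deduplication and level counting can be sketched as below. The sketch assumes that the frequency table is a list of (word, count) rows and that the graded word list is available as a dictionary from word to level; the tiny `level_of_word` table in the demo is a hypothetical stand-in for the HSK list in Table 1, not its actual content.

```python
from collections import Counter


def merge_duplicate_rows(rows):
    """Combine repeated (word, count) rows so that each word is listed once."""
    merged = Counter()
    for word, count in rows:
        merged[word.strip()] += count
    return merged


def count_by_level(frequencies, level_of_word, unknown_level="out-of-list"):
    """Sum word occurrences per vocabulary level; unlisted words share one bucket."""
    per_level = Counter()
    for word, count in frequencies.items():
        per_level[level_of_word.get(word, unknown_level)] += count
    return per_level


if __name__ == "__main__":
    rows = [("看", 2), ("书", 1), ("看", 1), ("可读性", 1)]
    level_of_word = {"看": 1, "书": 1}  # hypothetical excerpt of a graded word list
    merged = merge_duplicate_rows(rows)
    print(merged["看"])
    print(dict(count_by_level(merged, level_of_word)))
```

The per-level totals produced this way are the kind of hierarchical word features that can later be passed to the regression model.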
4.2. Word Part-of-Speech Features
The pseudocode for obtaining part-of-speech features is as follows:
(1) Enter the keyword to be matched.
(2) Enter the output file name and determine whether the file exists. If the file already exists, report an error; otherwise, create a new file.
(3) Traverse the files in the current directory and its subfolders, and first determine whether each file name ends with .exe; if so, skip the file directly.
(4) If the file is not an .exe file, open it, output its file name, and loop through its contents.
(5) Match the text in the file against the keyword, increase the keyword count for every match to obtain the keyword frequency, and output the number of matched keywords.
(6) Save the output information to the new file, ordered from left to right as file name and keyword statistics.
(7) When all files have been searched, end the word frequency statistics and close the file.
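The steps above can be sketched in Python as follows. This is an illustrative sketch rather than the program used in the study: the function name, the tab-separated report layout, and the assumption that the files are UTF-8 text (for example part-of-speech tagged output in which a tag such as "/v" can serve as the keyword) are all hypothetical.

```python
import os


def count_keyword_in_tree(root, keyword, report_path):
    """Count keyword occurrences in every non-.exe file under root and save a report."""
    # Step 2: refuse to overwrite an existing report file.
    if os.path.exists(report_path):
        raise FileExistsError(f"report file already exists: {report_path}")

    results = []
    # Step 3: traverse the folder and all of its subfolders.
    for folder, _subfolders, names in os.walk(root):
        for name in sorted(names):
            if name.lower().endswith(".exe"):
                continue  # executables are skipped
            path = os.path.join(folder, name)
            try:
                # Step 4: open the file and read its text.
                with open(path, encoding="utf-8") as handle:
                    text = handle.read()
            except (UnicodeDecodeError, OSError):
                continue  # not readable as UTF-8 text
            # Step 5: count non-overlapping matches of the keyword.
            results.append((path, text.count(keyword)))

    # Steps 6 and 7: write file name, keyword and count, then close the report.
    with open(report_path, "w", encoding="utf-8") as out:
        for path, n in results:
            out.write(f"{path}\t{keyword}\t{n}\n")
    return results
```

Calling the function once per part-of-speech tag gives one frequency per tag and per article, which can then be used as part-of-speech features.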
4.3. Textual Features of Words
Features are extracted from the length of the articles in books published in Chinese as a foreign language, including the total number of characters, paragraphs, sentences, and other features that reflect the difficulty of the articles [26]. Two feature values, the total number of characters and the number of paragraphs, are obtained from Microsoft Word. The formula "total frequency of words/total number of sentences" is used to obtain the average number of words per sentence, where sentences are counted in two ways, with or without commas as sentence boundaries. A comprehensive, multi-level approach is used to assess the difficulty of each article and to recover the inner law of the textbooks compiled by experts. This experiment selects the characteristics of 200 articles from six textbook sets in terms of words, semantics, discourse, and other dimensions.
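These textual features can also be computed directly from the raw text. The sketch below is an assumption-laden illustration: it treats 。！？ as sentence boundaries, adds ，； for the "with commas" variant, counts every non-whitespace character (including punctuation), and takes the total word count from a prior segmentation step. None of these choices is specified in the study.

```python
import re

SENTENCE_PATTERN = "[。！？]"        # assumed sentence-final punctuation
CLAUSE_PATTERN = "[。！？，；]"       # the same set plus commas and semicolons


def split_units(text, pattern):
    """Split text on the given punctuation pattern and drop empty pieces."""
    return [part for part in re.split(pattern, text) if part.strip()]


def text_features(text, word_count):
    """Return length features of one article; word_count comes from segmentation."""
    paragraphs = [line for line in text.splitlines() if line.strip()]
    characters = sum(1 for ch in text if not ch.isspace())
    sentences = split_units(text, SENTENCE_PATTERN)
    clauses = split_units(text, CLAUSE_PATTERN)
    return {
        "characters": characters,
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
        "sentences_with_commas": len(clauses),
        "words_per_sentence": word_count / len(sentences) if sentences else 0.0,
        "words_per_clause": word_count / len(clauses) if clauses else 0.0,
    }


if __name__ == "__main__":
    article = "我看书，他也看书。\n你看什么？"
    print(text_features(article, word_count=10))
```

Character counts obtained this way may differ slightly from those reported by a word processor, depending on how spaces and punctuation are treated.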
5. Evaluation Process
5.1. Experiment Design
Using a statistical machine learning technique, the challenge of evaluating the readability of books in Chinese as a foreign language is transformed into a regression problem in data mining [27]. The process is equivalent to CRISP-DM, the cross-industry standard process for data mining. The CRISP-DM standard model is shown in Figure 2.

The regression model is constructed with the SVM supervised learning algorithm in order to measure, objectively, accurately, and in a standardized way, the internal laws of the textbooks compiled by experts [28].
The structure of the three steps is shown in Figure 3.

5.2. Regression Model
The regression model is constructed in RapidMiner, a data mining tool. The difficulty value in the range [0, 1], which is one of the characteristic values, is assigned the label role in the Set Role operator [29]. The root-mean-square error is selected as the evaluation index.
This makes the label accurate to the difficulty value of each individual article: the label represents the difficulty value of an article within its level, where levels 1 and 2 are intermediate and levels 3 and 4 are advanced.
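As a hedged illustration of how difficulty labels in [0, 1] might be assigned by uniform segmentation, the sketch below assumes four levels, gives each level an equal sub-interval of [0, 1], and spaces the articles of a level evenly inside that sub-interval according to their order in the textbook. This scheme and the function name are assumptions made for illustration; the exact labeling formula of the study may differ.

```python
def uniform_label(level, position, articles_in_level, n_levels=4):
    """Difficulty label in (0, 1] for the article at `position` (1-based) of `level`.

    Assumed scheme: level k covers ((k - 1) / n_levels, k / n_levels], and the
    articles of a level are spaced evenly inside that sub-interval.
    """
    if not 1 <= level <= n_levels:
        raise ValueError("level out of range")
    if not 1 <= position <= articles_in_level:
        raise ValueError("position out of range")
    return ((level - 1) + position / articles_in_level) / n_levels


if __name__ == "__main__":
    # Labels for a level-2 textbook with four articles.
    print([uniform_label(2, k, 4) for k in range(1, 5)])
```

Under this scheme, later articles of a textbook always receive larger labels than earlier ones, and every article of a higher level receives a larger label than any article of a lower level.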
5.3. Model Optimization
The number and selection of features in a machine learning approach are critical. Improper selection of features, or too few or too many features, leads to under-fitting and over-fitting problems. Feature selection refers to selecting a subset of all the features according to a specific evaluation function so that the model generalizes better and is more effective. The root-mean-square (RMS) error index is used in regression analysis to estimate how important and significant a feature is [30, 31]. The RMS error can also be used to assess the accuracy of the regression model, since it reflects measurement precision.
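A packaging (wrapper) feature selection loop driven by the RMS error can be sketched as follows. To keep the example self-contained, the SVM regressor of the study is replaced by a leave-one-out one-nearest-neighbour regressor; that substitution, the greedy forward search, and all function names are assumptions made for illustration.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def loo_nearest_neighbor_rmse(rows, labels, features):
    """Leave-one-out RMSE of a 1-nearest-neighbour regressor on the chosen features."""
    predictions = []
    for i, row in enumerate(rows):
        best_j, best_distance = None, float("inf")
        for j, other in enumerate(rows):
            if j == i:
                continue  # the held-out article cannot be its own neighbour
            distance = sum((row[f] - other[f]) ** 2 for f in features)
            if distance < best_distance:
                best_j, best_distance = j, distance
        predictions.append(labels[best_j])
    return rmse(labels, predictions)


def forward_selection(rows, labels, n_features, score=loo_nearest_neighbor_rmse):
    """Greedily add the feature that lowers the score most; stop when none helps."""
    selected, best_score = [], float("inf")
    remaining = list(range(n_features))
    while remaining:
        trials = [(score(rows, labels, selected + [f]), f) for f in remaining]
        trial_score, trial_feature = min(trials)
        if trial_score >= best_score:
            break
        selected.append(trial_feature)
        remaining.remove(trial_feature)
        best_score = trial_score
    return selected, best_score


if __name__ == "__main__":
    # Toy data: feature 0 tracks the difficulty label, feature 1 is unrelated noise.
    rows = [(0.0, 0.9), (0.1, 0.1), (0.5, 0.7), (0.6, 0.3), (0.9, 0.95), (1.0, 0.05)]
    labels = [0.0, 0.1, 0.5, 0.6, 0.9, 1.0]
    print(forward_selection(rows, labels, n_features=2))
```

On the toy data, the search keeps only feature 0, because adding the noisy feature 1 raises the leave-one-out RMS error. Any regressor can be plugged in through the `score` argument.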
5.4. Evaluation Index
Regression modeling focuses on label differences. The interval [0, 1] in the regression model represents the difficulty degree of each article, and the output is the observed value of the difficulty degree of each article. The quality of a common evaluation model is determined by its accuracy [32, 33].
5.4.1. Accuracy Index
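Accuracy is the proportion of all samples that are predicted correctly. With $TP$, $TN$, $FP$, and $FN$ denoting the numbers of true positives, true negatives, false positives, and false negatives,
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}. \]
5.4.2. Recall Index
The recall rate measures how many of the actual positive samples are correctly predicted as positive:
\[ \text{Recall} = \frac{TP}{TP + FN}. \]
5.4.3. Precision Index
Precision measures the proportion of samples predicted as positive that are actually positive:
\[ \text{Precision} = \frac{TP}{TP + FP}. \]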
5.4.4. AUC Indicator
The AUC, the area under the receiver operating characteristic (ROC) curve, is an indicator used to judge the overall performance of the prediction model.
5.4.5. The Root of the Mean-Square Error
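The root of the mean-square error is the square root of the ratio of the sum of squared deviations between the observed values $y_i$ and the predicted values $\hat{y}_i$ to the number of observations $n$, which better reflects the measurement precision:
\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^{2}}. \]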
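The five indexes of this section can be computed with a few lines of code. The following is a minimal sketch using only the standard definitions; the function names are illustrative, binary labels are assumed to be coded as 1 (positive) and 0 (negative), and the AUC is computed in its pairwise form (the probability that a positive sample is scored above a negative one, with ties counted as one half).

```python
import math


def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for binary labels coded as 1 and 0."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(actual, predicted):
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    return (tp + tn) / (tp + tn + fp + fn)


def recall(actual, predicted):
    tp, _tn, _fp, fn = confusion_counts(actual, predicted)
    return tp / (tp + fn) if tp + fn else 0.0


def precision(actual, predicted):
    tp, _tn, fp, _fn = confusion_counts(actual, predicted)
    return tp / (tp + fp) if tp + fp else 0.0


def auc(actual, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def rmse(observed, predicted):
    """Root of the mean-square error between observed and predicted values."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


if __name__ == "__main__":
    actual = [1, 1, 0, 0, 1]
    predicted = [1, 0, 0, 1, 1]
    print(accuracy(actual, predicted), recall(actual, predicted),
          precision(actual, predicted))
    print(auc(actual, [0.9, 0.4, 0.3, 0.6, 0.8]))
    print(rmse([0.2, 0.5, 1.0], [0.2, 0.7, 0.8]))
```

The first four indexes require class labels or scores, whereas the root of the mean-square error applies directly to the continuous difficulty values produced by the regression model.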
6. Experimental Analysis and Evaluation Results
6.1. Word Grade Results
The results of sorting the word-grade dimension features of books in Chinese as a foreign language using a machine learning algorithm are shown in Table 2.
6.2. Result of the Part of Speech
Table 3 shows the results of sorting the semantic dimension features of books in Chinese as a foreign language using a machine learning algorithm.
6.3. Result of Word Text
Table 4 shows the results of sorting text dimension features in books in Chinese as a foreign language using a machine learning algorithm.
6.4. Results Analysis
Heuristic search feature selection was applied to the features of each dimension and to all features together, and in each case produced the optimal regression graph. Table 5 and Figure 4 show the accuracy for the various types of experiments with features of different dimensions using the machine learning algorithm.

7. Conclusion
This study introduces the automatic readability evaluation of books in Chinese as a foreign language and transforms the readability evaluation problem into a regression problem in data mining. Firstly, the readability assessment of books in Chinese as a foreign language is a carefully integrated study of Chinese ontology and information technology. It is an important part of the construction of an intelligent expert evaluation system, and it is an interdisciplinary research problem with both theoretical significance and practical application value. Secondly, the application of modern educational technology to books in Chinese as a foreign language provides technical support for the reform and development of Chinese teaching and textbook compilation. Learning Chinese has become a trend, and the teaching and dissemination of Chinese as a foreign language is widely needed. Evaluating books in Chinese as a foreign language reduces the time and energy that readers spend searching for suitable material, which has practical significance. In the regression model, the problem of setting readability value labels is solved through uniform segmentation. Compared with expert evaluation, the cost is lower, and overfitting of the model to the local characteristics of questionnaire samples can be effectively avoided. In the regression method, evenly distributed difficulty values set the readability of articles at a finer granularity and with higher precision. The method will remain feasible for readability evaluation as textbooks continue to be adapted and developed in the future. However, it has some limitations for large-scale applications, such as the required data preparation, which is complicated and time-consuming. The evaluation performance can be improved further by adding more feature categories.
Data Availability
The datasets used during the present study are available from the corresponding author upon reasonable request.
Conflicts of Interest
The author declares that there are no conflicts of interest regarding the publication of this paper.