Abstract
Most traditional methods handle text matching at the word level, which is unreliable because the semantic features of the text are ignored. This leads to low recall and high space utilization in text matching, and the comprehensiveness of the matching results is poor; such methods also cannot process long and short texts simultaneously. The current study proposes a text matching algorithm for a Korean Peninsula language knowledge base based on density clustering. Using a deep multiview semantic document representation model, the semantic vector of the text to be matched is captured together with its semantic dependency, which is used to extract the text's semantic features. Based on the feature extraction results, text similarity is calculated by a subtree matching method, and a semantic classification model based on SWEM and a pseudo-twin network is designed for semantic text classification. Finally, text matching over the Korean Peninsula language knowledge base is carried out by applying a density clustering algorithm. Experimental results show that the proposed method achieves a high matching recall rate with low space requirements and can effectively match long and short texts concurrently.
1. Introduction
With the rapid development of the digital society, user needs in artificial intelligence fields such as information retrieval, automatic question answering, and dialogue systems have grown, and intelligent matching algorithms are needed to meet them [1]. Natural language processing technology emerged to meet these requirements and can provide users with efficient information retrieval services [2]. Text matching is a core research area in natural language processing. The curse of dimensionality and data sparsity in traditional text matching have hindered the development of natural language processing. Moreover, most traditional text matching models ignore the relationships between words and cannot recognize the semantic similarity between them [3]. To address these problems, researchers have proposed multiple text matching methods using modern technologies.
Chen et al. proposed an improved multipattern matching algorithm based on the Aho–Corasick algorithm, combining it with the idea of transforming a trie tree into double-array form [4]. The method is based on a string searching mechanism that locates elements of a finite set of strings within the input string. A finite state machine resembling a trie is constructed, with additional links between internal nodes that allow automatic transitions between string matches without backtracking. Comparative experiments showed that the algorithm not only successfully matched all the pattern strings to be found in the text but also had a fast processing speed. Wu proposed a text matching method combining a pretraining model with a language knowledge base [5]. Building on a large-scale pretraining model, this method introduced external linguistic knowledge through a synonym-antonym vocabulary learning task and a phrase collocation learning task, using WordNet jointly trained with a Multi-Task Deep Neural Network (MT-DNN) to further improve model performance; text matching annotation data were then used for fine-tuning. Experimental results on two public datasets, the Microsoft Research Paraphrase Corpus (MRPC) and Quora Question Pairs (QQP), show that introducing external language knowledge for joint training on top of a large-scale pretraining and fine-tuning framework effectively improves text matching performance. Zen and Chen proposed a text matching model based on word embedding and dependency [6], constructing a semantic representation that integrates word meaning and the dependencies between words. The approach builds a matrix describing the semantic matching degree of each part of the two texts through cosine mean convolution and K-MAX pooling, and then uses a long short-term memory network to learn the mapping between the matching degree matrix and the true matching degree. The experimental results show that the text matching accuracy of the model is reasonably high. Xu et al. [7] proposed a self-learning deep model that uses a nearest neighbor graph framework for short text matching. The nearest neighbor graph uses word embeddings to convert texts into vector form; nearest neighbor relationships over these vectors represent the text samples through a text similarity relationship matrix. A twin convolutional neural network is then used to learn a better nearest neighbor graph and complete the text matching task. Experimental results reveal that the model can effectively improve the accuracy of text recognition and matching.
Although the discussed methods solve the text matching problem to a certain extent, they mostly operate at the word level and do not consider the semantic level of the text, resulting in unsatisfactory matching effects and low recall rates. They also suffer from high space utilization, their matching results are not comprehensive, and they cannot process long and short texts at the same time. As an alternative, this paper proposes a text matching algorithm based on density clustering for a Korean Peninsula language knowledge base to address the problems of traditional methods and improve the overall text matching effect.
2. Text Matching Algorithm of Korean Peninsula Language Knowledge Base
The proposed method is divided into four steps, explained in the following sections.
2.1. Text Semantic Feature Extraction
As the first step, the deep multiview semantic document representation model is used to capture the semantic dependency of the semantic vector of the text to be matched. Equation (1) represents this process for a text sequence S, where v represents the semantic vector of the text, s⁺ represents a semantically similar text segment, and s⁻ represents a similar text with interference information added.
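As a toy illustration of this dependency-capturing idea (not the paper's exact equation (1)), the following Python sketch scores a semantically similar segment above a similar text with added interference using cosine similarity; the vector dimensions and noise scales are placeholder assumptions.

```python
# Minimal sketch: given a text vector v, a semantically similar segment s_pos,
# and a noise-injected similar text s_neg, the dependency-capture step should
# score s_pos above s_neg. Not the paper's exact formulation.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two dense semantic vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
v = rng.normal(size=300)                  # semantic vector of the text
s_pos = v + 0.1 * rng.normal(size=300)    # semantically similar segment
s_neg = v + 2.0 * rng.normal(size=300)    # similar text with interference added

print(cosine(v, s_pos), cosine(v, s_neg))
assert cosine(v, s_pos) > cosine(v, s_neg)
```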
Once the semantic dependency is captured, it is necessary to perform feature extraction and feature selection of high-dimensional abstract semantic information for word granularity and sentence granularity [8]. A deep neural network is used to construct a multigranularity semantic feature extraction model as shown in Figure 1.

As described in Figure 1, the convolution layer of the deep neural network uses a one-dimensional convolution kernel to extract high-dimensional semantic information at the word granularity level, as presented in equation (2), where c represents the output semantic vector of the convolution layer, X represents the full text, b represents the bias parameter of the convolution kernel, ⊛ represents the convolution operation, and W represents the shared weights of the convolution kernel, whose size is given by the kernel length h and the kernel width w.
In the succeeding step, the deep multiview semantic document representation model adopts a global maximum strategy in the pooling layer to perform feature selection on the high-dimensional semantics at the sentence granularity level and extract the most important semantic features, as shown in equation (3), where p represents the semantic features output by the pooling layer, C represents the candidate text vector set, and d represents the dependency between semantics.
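The following is a minimal Python sketch of these two granularities under assumed shapes: a one-dimensional convolution over a word embedding matrix (word granularity) followed by global max pooling (sentence granularity). The kernel size, embedding width, and ReLU activation are illustrative choices, not taken from the paper.

```python
# Sketch: slide a kernel W (h x d) over a text matrix X (n words x d dims),
# producing one feature per window, then take the global maximum as the
# sentence-granularity feature selection.
import numpy as np

def conv1d_relu(X: np.ndarray, W: np.ndarray, b: float) -> np.ndarray:
    """One-dimensional convolution with ReLU over word embeddings."""
    n, _ = X.shape
    h = W.shape[0]
    return np.array([max(0.0, np.sum(X[i:i + h] * W) + b)
                     for i in range(n - h + 1)])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 300))        # 20 words, 300-dim embeddings
W = rng.normal(size=(3, 300)) * 0.01  # one convolution kernel, h = 3
c = conv1d_relu(X, W, b=0.0)          # word-granularity feature map
p = c.max()                           # global max pooling: strongest feature
print(round(float(p), 4))
```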
After feature extraction and feature selection of semantic information at the levels of word granularity and sentence granularity by convolution neural network, the text to be matched becomes a new high-dimensional abstract semantic vector. In the subsequent steps, the resultant dense vector to be matched is further processed by using the interaction function to obtain the text semantic features.
2.2. Text Similarity Calculation Based on Subtree Matching
An effective way to measure the similarity between related fragments of text is tree matching. Although multiple solutions are available in the literature, most derive from two basic techniques: tree-to-tree similarity matching and subtree similarity matching. The former matches the subject trees and returns a measure of their distance; the cost of computing this measure varies, ranging from linear-time algorithms to algorithms for NP-hard formulations. For the long text sequences addressed in the current study, tree-to-tree similarity is not well suited. Subtree similarity matching [9], on the other hand, finds all the subtrees of the subjects that are most similar to each other. This option scales well and can be used for both short and long text sequences, as required by the current study. Therefore, based on the semantic features of the text obtained above, the subtree matching method is used to calculate text similarity. Since the matching subtree of each text is often different, a text similarity algorithm for two texts A and B must consider that the two may share the same matching subtree or have different ones. Both cases are discussed separately below.
2.2.1. When Text A and Text B Have the Same Matching Subtree
Ideally, when text A and text B have the same matching subtree T, the matching subtree is used as an intermediary, and the text metadata feature vectors of the two texts will have the highest degree of semantic overlap. Since the text metadata feature vector characterizes the text, the similarity between the two texts will be correspondingly high. The similarity relationship between texts based on the same matching subtree is shown in Figure 2.

In Figure 2, the matching subtree T acts as an intermediary bridge for calculating the similarity between the two text samples. Similarity 1 and similarity 2 represent the similarities between text A and the matching subtree and between text B and the matching subtree, respectively, and can be calculated by equation (4). Similarity 3 is the similarity between text A and text B calculated using the matching subtree as the intermediary, which can be calculated by equation (5), where sim(A, B) represents the similarity between text A and text B, sim(A, T) and sim(B, T) represent the average similarity between each text and the matching subtree, and d_A and d_B represent the differences between the respective texts and the matching subtree.
When determining whether two texts are similar, it is necessary to judge how the difference between the text-to-subtree similarities affects the overall text similarity. If the absolute value of the difference is large, the final text similarity value decreases [10]. Equation (6) gives the formula for the similarity difference between a text and the matching subtree, where r_A and r_B represent the local relevance of the texts, f represents the joint feature vector, and I represents the interactive information between the texts. A larger difference value results in a lower similarity between texts.
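A minimal sketch of the same-subtree case follows, assuming the two text-to-subtree similarities are already known; the linear penalty on their difference is an illustrative stand-in for equations (5) and (6), not the paper's exact formula.

```python
# Same-subtree case: the similarity between texts A and B is derived from their
# average similarity to the shared subtree T, penalized when the two
# text-to-subtree similarities diverge. The penalty weight is an assumption.
def similarity_same_subtree(sim_a_t: float, sim_b_t: float,
                            penalty: float = 1.0) -> float:
    mean_sim = (sim_a_t + sim_b_t) / 2          # intermediary-based similarity
    diff = abs(sim_a_t - sim_b_t)               # difference w.r.t. the subtree
    return max(0.0, mean_sim - penalty * diff)  # larger diff -> lower similarity

print(similarity_same_subtree(0.9, 0.85))  # both texts close to the subtree
print(similarity_same_subtree(0.9, 0.40))  # large divergence lowers the score
```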
2.2.2. When Text A and Text B Have Different Matching Subtrees
Two texts having the same matching subtree is usually a special case; more often, text A and text B have different matching subtrees. The similarity relationship between text A and text B in this case is shown in Figure 3.

In Figure 3, subtree T_A is the matching subtree of text A and subtree T_B is the matching subtree of text B. The matching subtrees T_A and T_B act as an intermediary bridge when calculating the similarity of the two texts: similarity 1 represents the similarity between text A and subtree T_A, similarity 2 represents the similarity between subtree T_A and subtree T_B, and similarity 3 represents the similarity between text B and subtree T_B. When all three similarities are known, the similarity between text A and text B can be computed by equation (7), where n represents the number of valid words in the database and L represents the length of the original text.
As in the first case, it is also necessary to determine the impact of the difference between the three similarities on text A and text B, as expressed in equation (8), where h_A and h_B represent the high-dimensional abstract semantic vectors of the two texts and M represents the matching function.
In a nutshell, the similarity calculation for texts A and B with different matching subtrees consists of the following three steps (a sketch of the procedure follows the list):
Step 1: calculate the similarity between text A and its matching subtree T_A, and between text B and its matching subtree T_B.
Step 2: calculate the similarity between the matching subtrees T_A and T_B.
Step 3: use the matching subtrees T_A and T_B as an intermediary to calculate the similarity between text A and text B.
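The following sketch walks through the three steps under stated assumptions: the three pairwise similarities are given, and chaining them by multiplication with a spread penalty stands in for equations (7) and (8), which are not reproduced exactly.

```python
# Different-subtree case: both subtrees T_A and T_B act as the intermediary
# bridge. Multiplicative chaining and the spread penalty are illustrative
# choices, not the paper's exact formulas.
def similarity_diff_subtrees(sim_a_ta: float,   # Step 1: A vs its subtree T_A
                             sim_b_tb: float,   # Step 1: B vs its subtree T_B
                             sim_ta_tb: float,  # Step 2: T_A vs T_B
                             penalty: float = 0.5) -> float:
    # Step 3: chain the three similarities through the intermediary bridge.
    chained = sim_a_ta * sim_ta_tb * sim_b_tb
    spread = (max(sim_a_ta, sim_ta_tb, sim_b_tb)
              - min(sim_a_ta, sim_ta_tb, sim_b_tb))
    return max(0.0, chained - penalty * spread)

print(similarity_diff_subtrees(0.9, 0.8, 0.85))
```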
2.3. Semantic Classification Model Based on SWEM and Pseudo-Twin Network
Word embedding is a language modeling technique that represents words or phrases as real-valued vectors, such that words with similar meanings receive similar representations. To construct the representation, word embedding learns the relationships between words. Several methods, including probabilistic modeling, co-occurrence matrices, and neural-network-based methods, can be used to compute word embeddings.
The Simple Word Embedding-based Model (SWEM) [11] is a word-vector model that relies on pooling technology. The module itself has no parameters, so a twin network built on it has few parameters and can be trained quickly. The semantic classification model based on SWEM mainly includes an input layer, a SWEM layer, an aggregation layer, and an output layer. The input layer is a word vector, which can be fine-tuned or kept fixed during training. The SWEM layer processes the text with pooling, the aggregation layer uses a common distance measurement algorithm to obtain the distance between the text vectors produced by SWEM, and the output layer is a simple binary classifier that determines whether the texts are similar.
For the subsequent explanation of SWEM, suppose a text pair a with length l_a and b with length l_b, and consider the set of labels {0, 1}, where 0 means the texts are not similar and 1 means they are similar.
2.3.1. Input Layer
The text is segmented directly into single words, and Word2Vec is used to train word vectors for all texts. During training, it was found that keeping the word embedding layer trainable may cause the model to overfit, so the trainable parameter is set to "No," which reduces the difference in word vectors between the training set and the test set [12]. Through these operations the two texts are represented as matrices of word vectors, where d is the size of the word vector. To ensure that the word vectors can be integrated, the word vector size is set to 300 in the pretraining stage.
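A hedged sketch of this input layer follows, using gensim's Word2Vec with vector_size=300 and a frozen lookup, mirroring the trainable parameter set to "No". The two-sentence corpus and English tokens are placeholders for the actual Korean training data.

```python
# Train 300-dim word vectors, then use them as a frozen embedding lookup:
# each text becomes a (length x 300) matrix of word vectors.
import numpy as np
from gensim.models import Word2Vec

corpus = [["text", "matching", "with", "semantics"],
          ["semantic", "text", "matching", "model"]]
w2v = Word2Vec(sentences=corpus, vector_size=300, window=5, min_count=1, seed=0)

def embed(tokens):
    """Frozen lookup: no fine-tuning of the word vectors."""
    return np.stack([w2v.wv[t] for t in tokens])

A = embed(corpus[0])   # text a -> matrix of shape (l_a, 300)
B = embed(corpus[1])   # text b -> matrix of shape (l_b, 300)
print(A.shape, B.shape)
```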
2.3.2. SWEM Layer
This layer uses three variants of SWEM, comprising two pooling technologies and a fusion method: SWEM-aver, SWEM-max, and SWEM-concat, explained below (a pooling sketch follows the list):
(1) SWEM-aver: averages the word vectors element-wise, which is equivalent to average pooling and fuses the information of every word. Its advantage is that each sequence element contributes to the result through the addition operation, as expressed in equation (9), where X and Y respectively represent the input text vector sequence and the candidate text vector sequence, and P represents the matching probability likelihood function.
(2) SWEM-max: equivalent to maximum pooling. SWEM-max brings good interpretability to the model, since text vectors trained with SWEM-max are highly sparse; for a specific task, the words that highlight the theme of each text can be easily identified.
(3) SWEM-concat: combines average pooling and maximum pooling by splicing the results of the two complementary pooling methods.
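The three variants reduce to a few array operations; the sketch below shows them on an assumed word-vector matrix of shape (words × 300).

```python
# The three SWEM variants on a word-vector matrix X (words x dims):
# average pooling, max pooling, and their concatenation. Parameter-free.
import numpy as np

def swem_aver(X):
    return X.mean(axis=0)                                   # SWEM-aver

def swem_max(X):
    return X.max(axis=0)                                    # SWEM-max

def swem_concat(X):
    return np.concatenate([swem_aver(X), swem_max(X)])      # SWEM-concat

X = np.random.default_rng(0).normal(size=(7, 300))  # 7 words, 300-dim vectors
print(swem_aver(X).shape, swem_max(X).shape, swem_concat(X).shape)
# -> (300,) (300,) (600,)
```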
For all three SWEM variants, no internal components need to be learned explicitly, as the models use only the inherent word embedding information for text classification, which minimizes semantic classification time and improves classification efficiency.
2.3.3. Aggregation Layer
The main function of the aggregation layer is to aggregate the two obtained text representations into a fixed-length matching vector, using distance measurement formulas or fixed splicing methods. The distance measurement formula adopts the most common splicing scheme in twin networks, whose specialty is to compare the representations of two data points (here, two text vectors) rather than score each one independently. To this end, the two text vectors are element-wise multiplied and subtracted, and the results are spliced with the original vectors of the two texts. Finally, a long vector is obtained as the aggregation vector, as presented in equation (10), where s represents the matching score function and m represents the multigranularity matching information.
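A minimal sketch of this splice follows, assuming the common twin-network aggregation of the two text vectors with their element-wise product and absolute difference; the exact composition in equation (10) may differ.

```python
# Splice the two SWEM text vectors u and v with their element-wise product
# and absolute difference into one fixed-length matching vector.
import numpy as np

def aggregate(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    return np.concatenate([u, v, u * v, np.abs(u - v)])

rng = np.random.default_rng(0)
u, v = rng.normal(size=300), rng.normal(size=300)
print(aggregate(u, v).shape)  # (1200,) for 300-dim inputs
```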
2.3.4. Output Layer
Since the text labels are 0 and 1, the output is a binary classification. The result obtained by the distance measurement layer is passed through a fully connected layer, and the final text classification result is obtained through a sigmoid function [13], as per equation (11).
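The sketch below illustrates this output layer with random placeholder weights; in practice the fully connected layer is learned jointly with the rest of the model.

```python
# One fully connected layer followed by a sigmoid, thresholded at 0.5 for the
# binary similar / not-similar label. Weights here are random placeholders.
import numpy as np

def predict(agg: np.ndarray, W: np.ndarray, b: float) -> int:
    p = 1.0 / (1.0 + np.exp(-(W @ agg + b)))  # sigmoid probability of "similar"
    return int(p >= 0.5)

rng = np.random.default_rng(0)
agg = rng.normal(size=1200)              # aggregation vector from previous layer
W, b = rng.normal(size=1200) * 0.01, 0.0
print(predict(agg, W, b))
```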
2.4. Implementation of Text Matching in Korean Peninsula Language Knowledge Base Based on Density Clustering
Based on the results of text feature extraction, text similarity calculation, and text classification, the density clustering algorithm is used to match the Korean Peninsula language knowledge base text [14]. Density clustering is an unsupervised clustering approach based on high-density connected regions: in the sample space, each target cluster is a group of dense sample points separated by low-density areas. The purpose of the algorithm is to filter out low-density areas and find dense sample points.
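The paper's clustering procedure is custom, but as a point of reference the sketch below runs DBSCAN from scikit-learn, the standard density clustering algorithm matching this description: dense regions become clusters and low-density points are filtered out as noise (label -1).

```python
# Reference example of density clustering: dense sample points form a cluster,
# sparse background points are rejected as noise. Data here is synthetic.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.2, size=(50, 2))  # a dense group of points
noise = rng.uniform(low=-3, high=3, size=(10, 2))     # sparse background points
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(np.vstack([dense, noise]))
print(set(labels))  # cluster ids plus -1 for filtered low-density points
```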
For the current study, a sample space is set, and text carrying actual semantics is distributed along the diagonal direction of the cluster interval. Therefore, the slope density ρ of the defined cluster interval is calculated by the following formula, where S_x and S_y represent the sums of the projections of the matching points in the cluster on the corresponding coordinate axes, while I_A and I_B represent the cluster intervals of text A and text B, respectively.
Moreover, if a pair of cluster intervals satisfies the corresponding containment conditions, the cluster interval I_A is said to contain I_B; this is referred to as "cluster containment" and is denoted I_B ⊆ I_A. Similarly, if the clusters meet the corresponding intersection conditions, the cluster intervals I_A and I_B are said to intersect; this is referred to as "cluster intersection" and is denoted I_A ∩ I_B.
According to the above analysis, each matching point is initialized as a cluster before text matching, and the set containing all clusters is denoted C. Following the definitions of density clustering, the matching rules are set as follows:
Rule 1: if the densities of two clusters are reachable, merge the two clusters, recalculate the cluster interval of the new cluster, and delete the original clusters from C.
Rule 2: if two clusters meet the cluster containment condition, delete the contained cluster.
Rule 3 (exit condition): traverse C and stop when no clusters can be merged.
According to Rule 3, if clusters are merged, the new clusters must be compared with the existing clusters again until there are no clusters that can be merged.
Based on the above analysis, the problem under discussion can be implemented in two steps (a sketch of the merge loop follows the list):
(1) Order matching: according to the semantic characteristics of the text, adjacent clusters are the most likely to merge; therefore, first sort all initial cluster intervals in order, and then match according to Rule 1.
(2) Iterative merging: on the basis of order matching, merge iteratively according to Rules 1 and 2 until no new merges are generated. The algorithm is described in Figure 4.
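The sketch below illustrates the two steps on one-dimensional cluster intervals; approximating density reachability by a maximum allowed gap is an assumption, and the paper's exact reachability and containment criteria differ.

```python
# Sort the initial cluster intervals (Step 1), then repeatedly merge
# density-reachable neighbors until a full pass produces no change (Step 2,
# exiting per Rule 3). "Reachable" is approximated by a maximum gap here.
def merge_intervals(clusters, max_gap=1.0):
    clusters = sorted(clusters)                  # Step 1: order matching
    changed = True
    while changed:                               # Step 2: iterative merging
        changed, merged = False, []
        for lo, hi in clusters:
            if merged and lo - merged[-1][1] <= max_gap:  # Rule 1: merge
                merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
                changed = True    # Rule 2 is subsumed: contained intervals
            else:                 # collapse into the enclosing merged interval
                merged.append((lo, hi))
        clusters = merged
    return clusters

print(merge_intervals([(0, 1), (1.5, 2), (8, 9), (8.2, 8.5)]))
# -> [(0, 2), (8, 9)]
```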
According to Figure 4, postprocessing filters the matching results according to rejection conditions, discarding clusters that do not fit the semantics. The user can process either the merged clusters or the restored text fragments, although in general it is better to process the restored text fragments [15]. According to the semantic characteristics of similar texts, the conditions for discarding clusters are defined as follows:
Condition 1: the cluster density exceeds the range set by the density deviation.
Condition 2: the density of the cluster is less than a preset threshold.

The above conditions must be considered together: if the cluster density exceeds the density deviation but is still higher than a certain threshold, the cluster can still be considered a candidate in the similar text matching problem.
3. Results
To verify the comprehensiveness and effectiveness of the proposed study, various simulation experiments were carried out. The improved multipattern matching algorithm based on the Aho–Corasick algorithm [4] and the text matching method combining a pretraining model with a language knowledge base [5] were selected for comparison. The matching recall rate, space utilization, and comprehensiveness of matching results are used as comparison indexes.
3.1. Dataset and Evaluation Criteria
The experiments select the SQL dataset as the basic dataset and carry out the analysis in the MATLAB simulation software. The average number of words per article in the dataset is 734, and the maximum is 21,791. Because various factors in the experimental process can introduce errors into the data, repeated measurements were carried out and averaged to mitigate their impact on the results. Figure 5 shows the output interface of the experimental results produced by MATLAB.

3.2. Analysis of Specific Results
3.2.1. Recall Rate
Considering the matching recall rate as the primary experimental index, the different methods are compared, and the results are presented in Figure 6.

The higher the matching recall rate, the more comprehensive the matching results. As Figure 6 shows, the recall of the proposed algorithm improves as the number of iterations increases, exhibiting a clear upward trend, and is significantly higher than that of the traditional methods. This proves that when processing similar texts, the proposed algorithm matches more text than its counterparts. The reason is that, before matching, the algorithm uses the deep multiview semantic document representation model to capture the semantic dependency of the semantic vectors of the text samples, extracting the semantic features of the text so that matching can be performed more pertinently on the basis of the extraction results.
3.2.2. Space Utilization
The second experimental index is the space utilization rate. Figure 7 presents the space occupied by the three methods under discussion.

Lower space utilization means the text matching method occupies less space, making it a more attractive choice. As Figure 7 shows, the space occupied by each method increases with the number of iterations, but the proposed algorithm occupies the least space compared with the traditional methods. This is because the algorithm calculates text similarity with the subtree matching method based on the feature extraction results, and the semantic classification model based on SWEM and the pseudo-twin network classifies the text semantics. In this step, texts of different types are separated, and texts that do not belong to the same type are eliminated, reducing the space occupied by invalid text.
3.2.3. Comprehensiveness of Matching Results
To further test the performance of the proposed algorithm, short and long text matching experiments were carried out to compare the matching accuracy of the different methods. The results are shown in Tables 1 and 2.
Analyzing the data in Tables 1 and 2 shows that the matching accuracy of the proposed algorithm is higher than that of the traditional methods for both short and long texts. The highest matching accuracy of the proposed algorithm is 95.7% for short text and 91.5% for long text. These results indicate that the matching results of the proposed algorithm are more general and comprehensive and can yield more useful text information.
4. Conclusion
Traditional text matching methods usually rely on word matching algorithms while ignoring the important information embedded in the semantics of the sentence. To overcome poor text matching effects and enhance the matching results, a text matching algorithm based on density clustering for a Korean Peninsula language knowledge base is proposed, with the following salient features:
(1) The proposed system uses subtree matching to calculate text similarity. Similarity calculated among vectors in the semantic space can accurately describe the semantic relationship between different texts.
(2) Human natural language exhibits hierarchical structure, serialized structure, and combination operations; these align well with the hierarchical and serialized characteristics of the deep text matching model itself.
(3) The density clustering algorithm makes full use of large-scale data and the computing power of high-performance computers, starting from the laws of human natural language and improving the accuracy and effectiveness of text matching.
Overall, the results computed on the above characteristics of the proposed model testify to its superiority over existing approaches. Hence, the proposed method may be adopted for practical text matching applications over the Korean Peninsula language knowledge base and could be further applied in other similar domains.
5. Future Direction
The context of a word, with respect to its position in a phrase or sentence, provides an important clue for its correct translation, as demonstrated in the current study. Ambiguity arises when a word can be translated into two or more options and the system must choose among them; in such situations, context can help resolve the ambiguity. As a future direction, the current study may be tested on ambiguous phrases, where it could plausibly yield promising results.
The current study is built on density clustering; however, given the range of advanced artificial intelligence tools built with machine learning, neural networks, and deep learning, the study may be improved by incorporating more modern AI methods.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this study.