Abstract

The main objective of this paper is to present a new clustering algorithm for metadata trees based on the K-prototypes algorithm, the GSO (glowworm swarm optimization) algorithm, and maximal frequent paths (MFPs). Metadata tree clustering involves computing the feature vector of each metadata tree and then clustering those feature vectors; therefore, traditional data clustering methods are not directly suitable for metadata trees. As the main method for calculating the feature vectors, the MFP method also faces the difficulties of high computational complexity and loss of key information. In general, the K-prototypes algorithm is suitable for clustering mixed-attribute data such as feature vectors, but it is sensitive to the initial cluster centers. Compared with other swarm intelligence algorithms, the GSO algorithm offers more efficient global search, which makes it suitable for solving multimodal problems and useful for optimizing the K-prototypes algorithm. To address the clustering accuracy and high data dimensionality of metadata tree clustering, this paper combines the GSO algorithm, the K-prototypes algorithm, and MFP to design a new metadata structure clustering method. Firstly, MFP is used to describe metadata tree features, and key parameters of categorical data are introduced into the MFP feature vector to improve the accuracy with which the feature vector describes the metadata tree; secondly, GSO is combined with K-prototypes to design GSOKP for clustering the feature vectors, which contain both numeric and categorical data, so as to improve the clustering accuracy; finally, tests are conducted with a set of metadata trees. The experimental results show that the designed metadata tree clustering method GSOKP-FP has certain advantages in respect of clustering accuracy and time complexity.

1. Introduction

As an important tool for data management, metadata play a very important role in data integration, sharing, retrieval, and the construction of data warehouses. Research on and application of metadata are extensive: metadata-based data integration and data clustering have been widely and deeply studied, whereas clustering of the metadata itself has rarely been studied [1]. FIHC (Frequent Itemset-based Hierarchical Clustering) is an agglomerative hierarchical clustering algorithm first proposed by Benjamin [2]. First, frequent word sets are mined; then, texts sharing the same frequent word sets are used as initial clusters from which the texts are classified. Building on this idea, Feng and Chen designed a metadata clustering method based on MFP [3]. This method measures the similarity between metadata trees through the characteristics of MFPs. However, the MFP method reduces computational complexity at the cost of losing part of the metadata information, which has a negative impact on the subsequent feature vector clustering. In addition, testing shows that the above method is not suitable for large-scale metadata tree clustering.

In order to solve the clustering problem of mixed-attribute data more effectively, the K-prototypes algorithm [4], the EKP algorithm, and the SBAC algorithm have been proposed [5]. Compared with other algorithms, the K-prototypes algorithm has the advantages of simplicity and efficiency, but it is sensitive to the initial cluster centers, which negatively affects clustering accuracy. Since metadata usually appear in forms such as JSON and XML in web engineering, metadata clustering research is mainly embodied in the clustering of XML documents [6]. Typical methods for describing the structure of a document include both trees and vectors; therefore, document clustering research is usually translated methodologically into cluster analysis over trees and vectors. Feng et al. put forward a new clustering method based on K-medoids clustering and the genetic algorithm to enhance the accuracy of XML document clustering in 2015 [7]. Wang et al. proposed a document clustering method based on the CFP algorithm (Clustering with Feature Order Preference) in 2016 [8]. Based on the tree structure, Costa and Ortale projected XML documents onto the paths from root nodes to leaf nodes and proposed a new clustering method combining XML features with mixing lengths [9]. Since then, with the increasing application of swarm intelligence algorithms, swarm intelligence optimization has been introduced into clustering algorithms [10]. A new clustering algorithm called K-MWO was proposed by Kang et al. in 2016; the K-MWO algorithm makes full use of the global optimization ability of MWO and the local search ability of K-means [11]. A novel clusterability assessment method called the Density-based Clusterability Measure (DBCM) was proposed by Jokinen in 2019 [12].
The above methods design corresponding clustering methods for the clustering problem of mixed-attribute data and further study the clustering of XML documents with structural features. However, due to the structural characteristics of the metadata tree, these clustering algorithms, which are mainly suitable for metadata records, cannot be directly applied to the clustering of metadata trees. In addition, the clustering accuracy of these algorithms, such as the typical K-prototypes algorithm, is not satisfactory.

The main objective of this paper is to present a new clustering algorithm for metadata trees. At the same time, improving the accuracy of the K-prototypes clustering algorithm and reducing the loss of critical information in the MFP method are also important goals of this research. The GSO algorithm, proposed by Krishnanand and Ghose, is a swarm intelligence optimization algorithm [13]. Compared with other swarm intelligence algorithms, GSO offers more efficient global search and a simpler algorithm flow, which makes it suitable for solving multimodal problems [14]. The clustering accuracy of the K-prototypes algorithm can be improved by the GSO algorithm [15]. In this paper, the author combines the improved GSO algorithm and the K-prototypes algorithm with the idea of MFP [16] to design a new metadata tree clustering scheme that realizes the clustering of metadata tree sets. This paper mainly contains four aspects: first, mine the MFPs in the metadata tree set and calculate the feature vectors of these MFPs using the term frequency/inverse document frequency (TF/IDF) method [17]; second, combine GSO with K-prototypes to design a metadata tree-oriented clustering method, GSOKP; third, employ the MFP feature vectors to describe the features of the metadata trees and cluster the feature vectors via GSOKP to achieve the clustering of the metadata tree set; finally, establish metadata tree sets as experimental data to test the validity of the proposed algorithms. To sum up, the contributions of this paper include an improved K-prototypes algorithm based on the GSO algorithm, an optimization strategy for the feature vector calculation of metadata trees, and a new clustering method for metadata tree sets.

2. Clustering of Metadata Trees

2.1. Description of the Metadata Tree

Metadata are special data that describe data. Metadata clustering includes judging the similarity of metadata structures and of metadata records. Metadata structures are often described using metadata trees, so clustering by metadata structure similarity can be described as clustering of metadata trees.

Definition 1. Metadata tree.
A metadata tree represents metadata composed of n metadata elements, wherein each metadata element is composed of subattributes. Each metadata element and subattribute corresponds to a node of the metadata tree, and the tree structure shown in Figure 1 is an example of a metadata tree MD.

Definition 2. Metadata path.
In a metadata tree, a sequence of metadata elements from the root node to a leaf node in which no node recurs is called a metadata path; Figure 1 contains an example of such a metadata path.

Definition 3. Similarity between metadata trees.
The similarity of metadata trees is primarily measured by the similarity of path sets. Assume that there is a metadata example as shown in Table 1.
The metadata example is described as metadata tree Mete, as shown in Figure 2.

Definition 4. Frequent item.
A frequent item is a metadata element that occurs frequently in a set of metadata trees. Its frequency is measured by the occurrence rate of the element across the metadata trees of the set, i.e., the proportion of metadata trees in the set in which the metadata element appears. Given a frequency threshold θ, a metadata element whose occurrence rate is at least θ is a frequent item.
Based on the structure of the data example given in Table 1, the metadata tree SURF shown in Figure 3 can be constructed.
In the metadata set of Mete and SURF, if , the elements contained in both Mete and SURF are frequent items. Assume that the set of frequent items is S. Then,

Definition 5. Frequent path.
A metadata path consisting of frequent items is called a frequent path [3]. In the metadata set of Mete and SURF, the set of frequent paths is assumed to be R. Then,

Definition 6. Maximal frequent path (MFP).
Frequent paths that are not contained by other frequent paths are called maximal frequent paths, and the set of maximal frequent paths is denoted as MFP. Then,
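To make Definitions 4–6 concrete, the following Python sketch mines maximal frequent paths from a metadata tree set. The representation (each tree as a set of root-to-leaf paths, each path a tuple of node names) and the prefix-based notion of path containment are assumptions of this sketch, not details fixed by the paper.

```python
from itertools import chain

def maximal_frequent_paths(trees, theta):
    """Mine maximal frequent paths from a metadata tree set.

    trees: list of sets; each set holds the root-to-leaf paths of one
           metadata tree, each path a tuple of node names.
    theta: frequency threshold in [0, 1].
    """
    n = len(trees)
    # Definition 4: an element is frequent if it occurs in at least
    # theta * n of the metadata trees.
    items = set(chain.from_iterable(chain.from_iterable(trees)))
    freq_items = {e for e in items
                  if sum(any(e in p for p in t) for t in trees) / n >= theta}
    # Definition 5: a frequent path consists only of frequent items.
    freq_paths = {p for t in trees for p in t
                  if all(e in freq_items for e in p)}
    # Definition 6: keep only paths not contained in a longer frequent
    # path (containment taken here as "is a proper prefix of").
    def is_prefix(p, q):
        return p != q and q[:len(p)] == p
    return {p for p in freq_paths
            if not any(is_prefix(p, q) for q in freq_paths)}
```

For two trees, raising θ to 100% keeps only elements present in both trees, matching the Mete/SURF example above.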

2.2. Feature Vectors of Metadata Trees

The similarity between metadata trees can be measured using the feature vector of the metadata tree to improve computational efficiency. Therefore, the TF/IDF method can be used to calculate the weight of each MFP and the feature vector of the metadata tree [17].

If denotes the weight of the MFP in metadata tree , and denote the importance of the MFP in metadata tree and in the whole metadata tree set, respectively. Then,

From the perspective of metadata tree similarity, tree nodes at different levels have different weights in similarity calculation. The root nodes have significantly higher weights than the leaf nodes. So, frequent path node level parameter is introduced in the calculation of . represents the level of the MFP in metadata tree , i.e., the highest level of the node, and the level of the root node is 1. means the number of occurrences of the MFP in metadata tree .

The importance of the MFP in the metadata tree set is primarily identified by the frequency at which the path appears in the set, where denotes the total number of trees in the metadata tree set, whereas refers to the total number of metadata trees that contain the MFP.

Take the Mete and SURF metadata set for example. When , set , and then , , , and . The result is as follows:

If there are MFPs in set , metadata tree set contains a total of metadata trees. Then, is an dimensional matrix. The feature vector of metadata tree is denoted as .
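As an illustration of the TF/IDF-style weighting above, the following Python sketch computes a weight matrix whose rows are the feature vectors of the metadata trees. The exact combination of occurrence count, node level, and inverse document frequency is an assumption of this sketch: occurrences are damped by the node level (so paths nearer the root weigh more, matching the level parameter of Section 2.2) and multiplied by log(N/n_j).

```python
import math

def feature_matrix(occurs, levels, n_trees):
    """Compute TF/IDF-style weights for maximal frequent paths.

    occurs[i][j]: occurrence count of MFP j in metadata tree i.
    levels[i][j]: level of MFP j in tree i (root level = 1);
                  must be > 0 wherever the path occurs.
    Returns an n_trees x n_paths matrix; row i is the feature
    vector of metadata tree i.
    """
    n_paths = len(occurs[0])
    # idf: log of (total trees / number of trees containing the path)
    idf = []
    for j in range(n_paths):
        n_j = sum(1 for i in range(n_trees) if occurs[i][j] > 0)
        idf.append(math.log(n_trees / n_j) if n_j else 0.0)
    W = []
    for i in range(n_trees):
        row = []
        for j in range(n_paths):
            # tf: occurrence count damped by node level (assumption),
            # so paths closer to the root contribute more.
            tf = occurs[i][j] / levels[i][j] if occurs[i][j] else 0.0
            row.append(tf * idf[j])
        W.append(row)
    return W
```

A path that occurs in every tree gets idf = log(1) = 0, i.e., it carries no discriminative weight, which is the usual TF/IDF behavior.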

2.3. Similarity Calculation of Metadata Trees
2.3.1. Introduction of Key Feature Parameters

Using only the importance of MFPs in the metadata to measure the key features of the metadata tree excludes key features such as the root node and the metadata tree depth from the MFP and its feature vector. Introducing the identification information of the root node and the depth of the metadata tree therefore allows the features of the metadata tree to be identified more accurately. That is,

2.3.2. Similarity Calculation for Heterogeneous Metadata Trees

The similarity between and is computed to identify the similarity between and in the metadata tree. has numerical characteristics in and , while and have categorical features. Thus, the similarity (or integrated distance) of feature vectors and is mainly a combination of and .

For the calculation of numerical values, Euclidean distance is adopted:

For categorical values, the calculation is performed through dissimilarity:

Due to the different computing methods, the two parts also differ significantly in value range. Since the distance is predominantly numerical, a weighting proportion is hence introduced to calculate the integrated distance:
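A minimal sketch of the integrated distance: Euclidean distance for the numeric part, simple mismatch counting for the categorical part, and a weight mu balancing the two. The function name, signature, and the symbol mu are illustrative, not taken from the paper.

```python
def mixed_distance(x, y, numeric_idx, categorical_idx, mu):
    """Integrated distance between two mixed-attribute feature vectors.

    Numeric attributes use Euclidean distance; categorical attributes
    use dissimilarity (0 if equal, 1 otherwise). mu weights the
    categorical part against the numeric part, since the two parts
    have very different value ranges.
    """
    d_num = sum((x[i] - y[i]) ** 2 for i in numeric_idx) ** 0.5
    d_cat = sum(1 for i in categorical_idx if x[i] != y[i])
    return d_num + mu * d_cat
```

For example, two vectors whose numeric parts are (3, 4) and (0, 0) and whose root-node labels differ give 5 + mu for the integrated distance.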

3. Design of Metadata Tree Clustering Algorithms

3.1. GSO Algorithm

The GSO algorithm realizes optimization by simulating the luminescence and aggregation behavior of glowworms [14]; the attraction and aggregation between individual glowworms are driven primarily by their own brightness (fluorescein value). The steps of the algorithm are described as follows:
(i) Step 1: initialize the glowworm positions x_i, and assign values to the number of glowworms n, the moving step length s, the initial fluorescein value l_0, and other related parameters.
(ii) Step 2: calculate the fitness of glowworm i from the objective function, i.e., the fluorescein value l_i(t):

l_i(t) = (1 − ρ) l_i(t − 1) + γ J(x_i(t)),

where J(x_i(t)) denotes the conversion of positions to fluorescein values, ρ means the volatility of fluorescein, and γ stands for the enhancement ratio of fluorescein.
(iii) Step 3: calculate the probability p_ij(t) that glowworm i moves toward the direction of glowworm j in its field:

p_ij(t) = (l_j(t) − l_i(t)) / Σ_{k∈N_i(t)} (l_k(t) − l_i(t)),

where N_i(t) represents the set of glowworms in the field of glowworm i.
(iv) Step 4: update the position of each glowworm i:

x_i(t + 1) = x_i(t) + s (x_j(t) − x_i(t)) / ‖x_j(t) − x_i(t)‖.

(v) Step 5: update the radius of the glowworm's decision domain to control the number of glowworms in the field, avoid excessive aggregation of glowworms, and improve the global optimization effect:

r_d^i(t + 1) = min{r_s, max{0, r_d^i(t) + β (n_t − |N_i(t)|)}},

where r_d^i(t) denotes the perceptual (decision) radius of glowworm i at time t, n_t means the threshold of the number of glowworms in the field, and r_s refers to the radius of the perceptual realm.
(vi) Step 6: determine whether the termination condition is met and decide whether to proceed to the next round of iterations until the end of the algorithm.
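The steps above can be sketched as follows. This is a minimal, illustrative GSO implementation (maximization) using the standard parameter names; the defaults are assumptions, not the paper's tuning.

```python
import math
import random

def dist(a, b):
    return math.dist(a, b)

def gso(objective, dim, n=30, iters=100, rho=0.4, gamma=0.6,
        s=0.03, beta=0.08, nt=5, rs=3.0, l0=5.0, seed=0):
    """Glowworm swarm optimization sketch (maximizes `objective`).

    rho: fluorescein volatility, gamma: enhancement ratio,
    s: step length, beta: decision-radius update rate,
    nt: neighbour-count threshold, rs: sensor radius, l0: initial value.
    """
    rng = random.Random(seed)
    X = [[rng.uniform(-rs, rs) for _ in range(dim)] for _ in range(n)]
    L = [l0] * n          # fluorescein values
    R = [rs] * n          # decision-domain radii
    for _ in range(iters):
        # Step 2: l_i(t) = (1 - rho) l_i(t-1) + gamma J(x_i(t))
        L = [(1 - rho) * L[i] + gamma * objective(X[i]) for i in range(n)]
        for i in range(n):
            # neighbours: within the decision radius and brighter
            nbrs = [j for j in range(n) if j != i
                    and L[j] > L[i] and dist(X[i], X[j]) < R[i]]
            if nbrs:
                # Step 3: pick neighbour j with probability
                # proportional to l_j - l_i
                j = rng.choices(nbrs, weights=[L[j] - L[i] for j in nbrs])[0]
                d = dist(X[i], X[j])
                if d > 0:
                    # Step 4: move step s along the unit vector to x_j
                    X[i] = [X[i][k] + s * (X[j][k] - X[i][k]) / d
                            for k in range(dim)]
            # Step 5: decision-radius update
            R[i] = min(rs, max(0.0, R[i] + beta * (nt - len(nbrs))))
    # Step 6 simplified: fixed iteration budget; return brightest position
    return max(X, key=objective)
```

Because glowworms only move toward brighter neighbours within their decision radius, several subswarms can converge to different local maxima, which is why GSO suits the multimodal search in Step 3 of GSOKP-FP.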

3.2. Flow of the K-Prototypes Algorithm

K-means is a classic partition-based clustering algorithm, mainly suitable for clustering numerical datasets. Based on the K-means algorithm, Huang proposed the K-modes algorithm for categorical datasets. These algorithms use different methods to calculate the distance between data objects. To realize the clustering of mixed datasets, the K-prototypes algorithm combines the basic methods of the K-means and K-modes algorithms [18]. The flows of K-means, K-modes, and K-prototypes are very similar, but they are suitable for different types of datasets; the K-prototypes clustering algorithm is directed at data objects with mixed attributes, i.e., both numeric and categorical data. The steps of the K-prototypes algorithm can be described as follows:
(i) Step 1: Initialization. Input the number of clusters k and randomly select k data points in the dataset as the initial cluster centers.
(ii) Step 2: Clustering. Traverse the whole dataset, calculate the distance between each data point and the cluster centers, take the minimum distance, and assign the point to the corresponding cluster center. Thus, the dataset is divided into k nonintersecting clusters.
(iii) Step 3: Updating the Cluster Centers. For each cluster in the cluster set, calculate the new cluster center.
(iv) Step 4: Determining the Algorithm End Condition. If the cluster centers no longer change after updating, or the maximum number of iterations is reached, the algorithm terminates; otherwise, go to Step 2 and enter the next round of iteration.
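The steps above can be sketched as a compact K-prototypes implementation. The distance helper mirrors the integrated distance of Section 2.3 (Euclidean for numeric attributes, mismatch counting weighted by mu for categorical ones); all names and defaults are illustrative.

```python
import random
from collections import Counter

def distance(x, y, num_idx, cat_idx, mu):
    d_num = sum((x[i] - y[i]) ** 2 for i in num_idx) ** 0.5
    d_cat = sum(1 for i in cat_idx if x[i] != y[i])
    return d_num + mu * d_cat

def k_prototypes(data, num_idx, cat_idx, k, mu=1.0, iters=100, seed=0):
    """K-prototypes sketch for mixed numeric/categorical records.

    data: list of tuples; num_idx / cat_idx index the numeric and
    categorical attributes. Returns (centers, labels).
    """
    rng = random.Random(seed)
    centers = rng.sample(data, k)          # Step 1: random initial centers
    labels = [0] * len(data)
    for _ in range(iters):
        # Step 2: assign each point to its nearest prototype
        new_labels = [min(range(k),
                          key=lambda c: distance(x, centers[c],
                                                 num_idx, cat_idx, mu))
                      for x in data]
        if new_labels == labels:
            break                          # Step 4: converged
        labels = new_labels
        # Step 3: update prototypes (mean for numeric, mode for categorical)
        for c in range(k):
            members = [x for x, l in zip(data, labels) if l == c]
            if not members:
                continue
            proto = list(centers[c])
            for i in num_idx:
                proto[i] = sum(x[i] for x in members) / len(members)
            for i in cat_idx:
                proto[i] = Counter(x[i] for x in members).most_common(1)[0][0]
            centers[c] = tuple(proto)
    return centers, labels
```

Because Step 1 draws the initial centers at random, two runs can converge to different partitions; this sensitivity is exactly what the GSO-based initialization in GSOKP is meant to remove.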

3.3. GSOKP-FP Design

GSO, K-prototypes, and MFP are combined to design the GSOKP-FP algorithm to realize the clustering of metadata trees. The framework of this algorithm is shown in Figure 4.

The GSOKP-FP flow is described as follows:
(i) Step 1: identify the MFPs in the metadata tree set. According to Section 2.1, filter out the MFPs in the metadata tree set to form an MFP set.
(ii) Step 2: calculate the feature vectors of the metadata trees. According to Section 2.2, calculate the feature vectors of the metadata trees and form a feature vector set.
(iii) Step 3: optimize the initial cluster centers. Perform the GSO algorithm on the feature vector set to solve for multiple extreme points in the point set. In the meantime, adopt the GSO algorithm based on good-point-set optimization to improve the global optimization effect [15].
(iv) Step 4: K-prototypes clustering. Select k extreme points as the initial cluster centers of K-prototypes and carry out the K-prototypes algorithm.
(v) Step 5: output the metadata tree clustering results. According to the K-prototypes algorithm, output k clusters, mark the category of each metadata tree, and output the clustering results.

In addition, the part of GSOKP-FP flow for dataset clustering can be named the GSOKP algorithm.

4. Analysis of Experimental Data and Results

4.1. Selection of Datasets

The experiments in this paper are based on the metadata structure shown in Table 1 to construct a metadata set consisting of metadata trees. The metadata tree nodes are presented in Table 2.

A breadth traversal of all the nodes of the metadata tree with as the root node is performed to form a sequence of nodes listed as follows:

The nodes in a nonempty metadata tree are ordered, and the corresponding nodes are identified via the binary system, i.e., 1 means the node is included and 0 means it is not. Examples of metadata trees are given as follows:

The metadata tree shown in Figure 5 can be denoted as .

The metadata tree shown in Figure 6 can be denoted as .

Since the set of leaf nodes can uniquely identify a metadata tree, it is possible to uniquely identify a metadata tree by numbering the leaf nodes. For example, a metadata tree with Root1 can have up to 9 leaf nodes; in other words, it can yield 511 metadata trees. Among the 511 metadata trees with Root1 as the root node, 511 binary numbers can be represented, from 000000001 to 111111111, according to the positions of the leaf nodes. When the number of leaf nodes (excluding the root node) is d and d = 1, there are 9 corresponding binary numbers among the 511: 000000001, 000000010, 000000100, 000001000, 000010000, 000100000, 001000000, 010000000, and 100000000. That is, C(9, 1) = 9. Similarly, the number of binary numbers (metadata trees) under different d values can be calculated as C(9, d). The number of metadata trees with Root1 (or Root2 and Root3) as the root node is shown in Table 3; the number of metadata trees with Root2 or Root3 as the root node is likewise 511 each. The metadata trees for the experiments are selected from a total of 1,533 metadata trees.
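The leaf-node counting above can be checked directly: with 9 leaf positions per root, the number of trees with exactly d leaves is the binomial coefficient C(9, d), and summing over d gives 511 trees per root, hence 1,533 for the three roots. The function name below is illustrative.

```python
from math import comb

# Each metadata tree with a fixed root is identified by which of its
# 9 leaf positions are present, encoded as a 9-bit number between
# 000000001 and 111111111 (0b111111111 == 511).
def trees_with_d_leaves(d, n_leaves=9):
    """Number of metadata trees with exactly d leaf nodes: C(9, d)."""
    return comb(n_leaves, d)

per_root = sum(trees_with_d_leaves(d) for d in range(1, 10))  # 511
total = 3 * per_root  # three root nodes: Root1, Root2, Root3
```

In particular, C(9, 7) = 36 trees per root have 7 leaves, which for three roots gives the 108 trees of Tset_1.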

Because of the mutual inclusion of metadata trees with different numbers of leaf nodes, two types of datasets are experimentally designed: one is the set of metadata trees with the same number of leaf nodes (no inclusion) and the other is the set of random metadata trees (inclusion).

When d = 7, 108 metadata trees with Root1, Root2, and Root3 as root nodes are recorded as Tset_1, which is taken as the experimental object in the first experiment.

When the number of leaf nodes is random, 1,000 metadata trees with Root1, Root2, and Root3 as the root node are randomly selected and recorded as Tset_2, which is taken as the experimental object of the second experiment.

4.2. Experiment Description

Experiment 1. Take metadata tree set Tset_1 as the experimental object.
The value of the frequency threshold θ has a significant impact on the performance and efficiency of the algorithm. The larger the value of θ, the fewer the frequent paths and the shorter the execution time, but the more information of the metadata tree is lost; the smaller the value of θ, the greater the number of frequent paths and the less information is lost, but the longer the execution time. Therefore, θ is set to 50%, a median value, which also conforms to the experience of reference [3] on the value of the parameter θ.
If θ = 50%, then the MFP set is obtained. The individual paths in the MFP set are numbered p1, p2, p3, ..., p12.
In Tset_1, the feature vectors of the metadata tree are shown in Table 4.

Experiment 2. Take metadata tree set Tset_2 as the experimental object.
If θ = 50%, then the MFP set is obtained. The individual paths in the MFP set are numbered p1, p2, p3, p4, p5, p6, and p7. Similarly, in Tset_2, the corresponding feature vectors of the metadata trees are shown in Table 5.

4.3. Analysis of Experimental Results

The root node information and depths of metadata trees are added to the feature vectors, and the feature vector sets are clustered. Next, the GSOKP-FP algorithm is run to obtain the corresponding experimental results, as displayed in Tables 6 and 7.

Based on the experimental results, we compare the algorithm proposed in this paper with other clustering algorithms, such as K-means, K-modes, K-prototypes, KL-FCM-GM, SBAC, EKP, DC-MDACC, and the metadata tree clustering algorithm of the literature [3], in terms of clustering accuracy and time complexity. Clustering accuracy refers to the proportion of accurately classified samples among all samples.

First, from the perspective of classification accuracy, the comparison of the algorithm designed in this paper with other intelligent algorithms on several UCI datasets, and with the K-means, K-modes, and K-prototypes algorithms on metadata tree sets, is shown in Tables 8 and 9. The clustering accuracy of the other algorithms is taken from the literature [19]. The test given in reference [3] clusters four metadata trees by pairwise comparison of their similarity. Therefore, the clustering algorithm given in reference [3], recorded as MCM-FP, is not suitable for large-scale metadata tree set clustering, such as the metadata tree sets Tset_1 and Tset_2 in this paper.

Compared with other typical clustering algorithms, the test results show that the GSOKP algorithm has higher clustering accuracy and is more suitable for clustering mixed-attribute data. The advantages of GSOKP in clustering accuracy are mainly due to the following reasons: firstly, the GSO algorithm provides better initial cluster centers for the K-prototypes algorithm, which improves the clustering effect of the K-prototypes algorithm significantly; secondly, a new method of calculating the distance between mixed-attribute data objects is designed in the GSOKP algorithm, which better describes the distance (or similarity) between mixed-attribute data. In addition, the test results show that the DC-MDACC algorithm also performs well; its results are better than those of the GSOKP algorithm on the Acute and Statlog Heart datasets. The main reason is that a swarm intelligence optimization algorithm is also used in the DC-MDACC algorithm and more preprocessing is performed on the test data. Accordingly, the time complexity of the DC-MDACC algorithm is higher.

The GSOKP-FP algorithm achieves better classification accuracy than the conventional K-means and K-modes methods whether the metadata trees forming the metadata tree sets are selected conditionally or randomly. The algorithm designed in this paper is more suitable for clustering larger metadata tree sets, while the method described in the literature [3] is more suitable for clustering smaller metadata tree sets.

Next, in terms of time complexity, the algorithm presented in this chapter mainly consists of three parts: the swarm intelligence algorithm for selecting initial cluster centers, the K-prototypes algorithm for solving the clustering results, and the maximal frequent path mining for constructing the metadata tree dataset. Given the number of glowworms, the number of metadata trees, and the number of iterations, the time complexity of the GSO algorithm for solving the initial cluster centers can be denoted accordingly. Likewise, given the number of iterations, the number of metadata trees, and the number of clusters, the time complexity of the K-prototypes algorithm for solving the clusters follows, whereby the combined time complexity is obtained. The comparison of time complexity is shown in Table 10; the time complexity of the other algorithms is taken from the literature [3, 19].

Compared to K-means, K-modes, and K-prototypes, the algorithm covered in this chapter has higher time complexity, but its time complexity is lower than that of the agglomerative hierarchical clustering algorithm stated in the literature [3]. The time complexity of that method mainly comes from two iterative procedures, creating metadata tree similarity matrices and scanning them, which together determine its combined time complexity.

5. Conclusions

The GSOKP-FP algorithm designed in this paper introduces the GSO algorithm and the K-prototypes algorithm into the solution of metadata tree clustering, which enables the clustering of metadata tree sets by clustering the feature vectors of the metadata trees. MFP can describe the key features of a metadata tree well and effectively reduce the dimension of the numerical computation and the time complexity. However, since MFP extracts the information common among metadata trees, some key information is lost while the numerical dimension is reduced. In this paper, information such as the root node and the tree depth is added to the feature vector described by MFP to improve the computing accuracy of metadata tree similarity and the clustering precision. The experiments show that the GSOKP-FP algorithm designed in this paper achieves a better metadata tree clustering effect.

Data Availability

The experimental object (metadata tree sets) used to support the findings of this study has been constructed in this paper. The datasets (Iris, Soybean, Acute, and Statlog Heart) used to support the findings of this study have been deposited in the UCI repository (http://archive.ics.uci.edu/ml/datasets.html).

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this manuscript.

Acknowledgments

The work was supported by the research project of the National Social Science Foundation of China (no. 20BTQ046).