Abstract

In the multilabel learning framework, each instance is no longer associated with a single semantic concept but instead exhibits concept ambiguity. Specifically, the ambiguity of an instance in the input space means that there are multiple corresponding labels in the output space. In most existing multilabel classification methods, a binary annotation vector is used to denote the multiple semantic concepts: +1 denotes that a label is relevant to the instance, while −1 means the opposite. However, this label representation contains too little semantic information to truly express the differences among multiple labels. Therefore, we propose a new approach to transform binary labels into real-valued labels. We adopt low-rank decomposition to obtain latent label information and then incorporate this information with the original features to generate new features. Then, using sparse representation to reconstruct the new instances, the reconstruction error can also be applied in the label space. In this way, we finally achieve the purpose of label conversion. Extensive experiments validate that the proposed method achieves results comparable to or even better than those of other state-of-the-art algorithms.

1. Introduction

Classification is a frequently used term in machine learning. In common usage, classification generally refers to single-label classification, that is, assigning a single category to an object. In multilabel learning, classification means multilabel classification. Specifically, an instance is associated with more than one class label simultaneously. Multilabel learning has many application fields, such as web mining [1–3], text categorization [4–6], multimedia content annotation [7–11], and bioinformatics [12–14].

In recent years, the field of multilabel learning has gradually attracted significant attention. A variety of algorithms have been proposed, which can be basically divided into two categories [15]: algorithm adaptation and problem transformation. The core idea of the former is to adapt an existing supervised learning algorithm so that it can solve multilabel learning problems, such as ML-kNN [16], while the latter converts the multilabel learning problem into other known problems, such as BR [17]. Some multilabel algorithms solve the multilabel learning problem without using the correlation among different labels, such as LIFT [18]. The main idea of LIFT is to obtain the discriminative characteristics of each label and build a new feature space. It first obtains the positive and negative examples corresponding to each label, then performs cluster analysis on the corresponding sets of examples to obtain the cluster centers, and finally uses the cluster centers to construct label-specific features. In the process of solving the multilabel learning problem, LIFT does not consider label correlations; hence, it can be regarded as a feature conversion method. Other algorithms consider label correlations [19–25] when solving the multilabel learning problem. For example, the basic idea in [20] is to model the correlation among labels with a Bayesian network and to achieve efficient learning by using an approximate strategy. Indeed, the rational use of the correlation among labels can effectively boost the performance of multilabel classification. For example, if an image has the labels "football" and "rainforest," it is likely to also be labeled "Brazil." Similarly, a document annotated with "desert" has a low probability of being labeled "river." Therefore, how to effectively explore and make full use of label correlations is a crucial problem in multilabel learning.

In fact, for an object with multiple labels, the importance of the related labels still differs. Although the importance of each label is not given directly, we can judge it through external observation. Generally speaking, the larger the proportion a concept occupies in the original object, the more important the corresponding label. Accordingly, how to accurately express the importance of each label is also a challenge.

The method in [26] decomposes the original output space in order to obtain latent label semantic information, which can effectively improve the subsequent feature selection. Motivated by the decomposition of the label space in [26], in this paper we propose a method named label low-rank decomposition (LLRD) for multilabel classification. The LLRD algorithm first performs low-rank decomposition on the label matrix; second, it combines the decomposed result with the original features to form new features and mines the structural information of the features through sparse reconstruction; third, it transforms the binary labels into real-valued ones, finally converting the classification problem into a regression problem.

The contributions of this paper are as follows:
(1) We utilize low-rank decomposition to reveal global label correlations and achieve good classification results.
(2) We combine the low-rank decomposition results with the original features, reducing the information loss in the subsequent label transformation process.
(3) We carry out extensive experiments on datasets from different fields to verify the effectiveness of the different algorithms.

2. Materials and Methods

2.1. Datasets

In this experiment, a total of 13 datasets were used, covering four fields: audio, text, image, and biology. All these datasets can be collected from Mulan (http://mulan.sourceforge.net/datasets.html) and Meka (http://meka.sourceforge.net/#datasetsru). Table 1 gives the specific details of the datasets. The number of instances, the size of the label space, and the dimension of the features are denoted by |S|, L(S), and D(S), respectively. LDen(S) is the label density, i.e., the label cardinality LCard(S) normalized by the number of labels.

2.2. Notations

Formally, suppose $\mathcal{X} = \mathbb{R}^d$ is the d-dimensional input space and $\mathcal{Y} = \{l_1, l_2, \ldots, l_q\}$ denotes the output domain of q class labels. Let $D = \{(x_i, y_i) \mid 1 \le i \le p\}$ be the multilabel training dataset with p examples, where $x_i \in \mathcal{X}$ is a d-dimensional instance vector and $y_i \in \{-1, +1\}^q$ is the label vector corresponding to $x_i$. Let $X = [x_1, \ldots, x_p]^T \in \mathbb{R}^{p \times d}$ represent the input data matrix, and let $X_{-i}$ denote the matrix obtained by removing $x_i$ from $X$. Let $Y = [y_1, \ldots, y_p]^T \in \{-1, +1\}^{p \times q}$ be the matrix composed of the label vectors.

2.3. The Process of LLRD

First, LLRD decomposes the label matrix with a low-rank method. In the framework of multilabel learning, the label matrix is often considered to be low rank [27, 28] due to the existence of label correlations. The low-rank structure is also a way to explore the global relationship between labels. Therefore, we can perform low-rank decomposition on the label matrix. Assuming that the rank of $Y$ is $r < q$, $Y$ can be written as follows:

$$Y \approx CB,$$

where $B \in \mathbb{R}^{r \times q}$ represents the dependency of the latent factors on the original label space and $C \in \mathbb{R}^{p \times r}$ is a low-dimensional mapping of the original labels that also contains label correlation information.
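As an illustration, the rank-r factorization can be sketched with a truncated SVD (a minimal NumPy sketch; the factor names C and B follow the notation above, and SVD is only one convenient way to obtain the factors, not necessarily the one used in the original method):

```python
import numpy as np

def low_rank_labels(Y, r):
    """Factor a p x q binary label matrix Y (+1/-1) as Y ~ C @ B,
    where C (p x r) is a latent label representation and
    B (r x q) maps the latent space back to the original labels."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :r] * s[:r]          # latent labels, one row per instance
    B = Vt[:r, :]                 # dependency on the original label space
    return C, B

# toy example: 4 instances, 3 labels, rank-2 latent space
Y = np.array([[ 1,  1, -1],
              [ 1,  1, -1],
              [-1,  1,  1],
              [-1, -1,  1]], dtype=float)
C, B = low_rank_labels(Y, r=2)
print(C.shape, B.shape)           # (4, 2) (2, 3)
```

The rows of C then serve as the latent label features that are appended to the original features in the next step.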

Second, we combine $C$ with $X$ to form a new feature space $Z = [X, C] \in \mathbb{R}^{p \times (d + r)}$. In order to reveal the inner structure of the feature space, we use the sparse reconstruction method [29] to model the relationship between the training instances. Specifically, we use $W \in \mathbb{R}^{p \times p}$ to represent the training object relationship matrix, where $w_{ij}$ is a measure of the relationship between $z_i$ and $z_j$. Let $w_i$ denote the sparse reconstruction coefficient vector related to $z_i$. According to sparse representation theory, $w_i$ can be calculated as follows:

$$\min_{w_i} \frac{1}{2}\left\|z_i - Z_{-i}^{T} w_i\right\|_2^2 + \lambda \left\|w_i\right\|_1,$$

where $Z_{-i}$ represents the combination of all training instances except $z_i$. We can solve the above problem using the alternating direction method of multipliers (ADMM) [30].
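A minimal sketch of the per-instance sparse reconstruction is given below. For brevity it uses ISTA (proximal gradient with soft-thresholding) rather than the ADMM solver of the original method; `lam` and the iteration count are illustrative choices:

```python
import numpy as np

def sparse_weights(Z, i, lam=0.1, n_iter=500):
    """Sparse reconstruction coefficients of instance i:
       min_w 0.5 * ||z_i - Z_{-i}^T w||^2 + lam * ||w||_1
    The paper solves this with ADMM; ISTA is used here as a simpler stand-in."""
    z = Z[i]
    D = np.delete(Z, i, axis=0).T                 # columns = other instances
    w = np.zeros(D.shape[1])
    step = 1.0 / np.linalg.norm(D, 2) ** 2        # 1/L, L = sigma_max(D)^2
    for _ in range(n_iter):
        w -= step * (D.T @ (D @ w - z))           # gradient step on the l2 term
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold
    return w

rng = np.random.default_rng(0)
Z = rng.standard_normal((20, 5))                  # 20 instances, 5 features
w = sparse_weights(Z, 0)
print(w.shape)                                    # (19,)
```

Stacking the vectors returned for each i (with a zero inserted at the diagonal position) yields the relationship matrix W used in the next step.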

Third, we transform the original binary label vector $y_i$ associated with each $x_i$ in the training set into a real-valued label vector $\mu_i = (\mu_{i1}, \ldots, \mu_{iq})$, where each $\mu_{ij}$ keeps the sign of the corresponding binary label $y_{ij}$. Because the real values contain more information, we can also infer the importance of each label from the magnitude of the value. Since the input space and the label space are often interrelated, it is assumed that the relationship between $z_i$ and the other instances in the input space also holds between $\mu_i$ and the other label vectors in the label space. Accordingly, the representation errors of the different elements in the label space can be written as follows:

$$\min_{\mu_1, \ldots, \mu_p} \sum_{i=1}^{p} \left\|\mu_i - \sum_{j \ne i} w_{ij} \mu_j\right\|_2^2,$$

where the $w_{ij}$ are the sparse reconstruction coefficients obtained above. This quadratic programming problem can be solved by mature tools related to quadratic programming. The original multilabel classification problem is thereby transferred into a multioutput regression problem, for which many solutions exist [31]. The learning of the LLRD method thus contains three phases: low-rank decomposition, sparse reconstruction, and multioutput regression, and its total time complexity is the sum of the complexities of these three phases (with multioutput support vector regression chosen to realize the final classification).
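The transfer of the reconstruction relation into the label space can be sketched as follows. This is a hypothetical stand-in for the original quadratic program: we assume, for illustration, the constraint that each real-valued label keeps the sign of its binary counterpart with a minimum magnitude `eps`, and solve by projected gradient descent instead of a dedicated QP solver:

```python
import numpy as np

def real_valued_labels(Y, W, eps=0.1, lr=0.05, n_iter=300):
    """Transfer the sparse-reconstruction relation into the label space:
       min_M ||M - W M||_F^2
    subject to an ASSUMED constraint that each entry keeps the sign of the
    binary label with magnitude at least eps (a stand-in for the QP of the
    original method), solved here by projected gradient descent."""
    M = Y.astype(float).copy()
    for _ in range(n_iter):
        R = M - W @ M                  # reconstruction residual in label space
        M -= lr * 2.0 * (R - W.T @ R)  # gradient of ||(I - W) M||_F^2
        # projection: stay on the side of zero dictated by the binary label
        M = np.where(Y > 0, np.maximum(M, eps), np.minimum(M, -eps))
    return M

rng = np.random.default_rng(1)
Y = np.where(rng.random((6, 4)) > 0.5, 1, -1)      # toy binary labels
W = rng.random((6, 6))
np.fill_diagonal(W, 0.0)                           # no self-reconstruction
W /= W.sum(axis=1, keepdims=True)                  # row-normalized weights
M = real_valued_labels(Y, W)
print(M.shape)                                     # (6, 4)
```

The resulting real-valued matrix M then serves as the regression target for the final multioutput regression phase.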

3. Results and Discussion

3.1. Experiment Setup

In this subsection, we compare our LLRD with six other multilabel learning methods on six multilabel evaluation criteria, which fall into two categories: example-based and label-based metrics [32]. An example-based metric first measures the performance of the learning system on each test example and then returns the average over the entire test set. In contrast, a label-based metric first evaluates the performance of the system on each label and then computes the macro-/microaveraged F1 value over all labels.

In this paper, one-error, coverage, ranking loss, and average precision are employed for example-based performance evaluation, and macroaveraging F1 and microaveraging F1 are the label-based metrics. For the example-based metrics except average precision, smaller values indicate better performance; for the remaining metrics, larger values indicate better performance.

Let $T = \{(x_i, y_i) \mid 1 \le i \le n\}$ be the multilabel test set, and let $f(x, l)$ be the confidence of $l$ being a label associated with $x$. In addition, $f(\cdot, \cdot)$ can be converted into a ranking function $\mathrm{rank}_f(x, l)$. If $f(x, l_1) > f(x, l_2)$ holds, then the corresponding ranking function has $\mathrm{rank}_f(x, l_1) < \mathrm{rank}_f(x, l_2)$.

The six evaluation criteria used in the paper are defined as follows, with $Y_i$ the set of relevant labels of $x_i$, $\bar{Y}_i$ its complement, and $[\![\pi]\!]$ equal to 1 if predicate $\pi$ holds and 0 otherwise:
(1) One-error: $\frac{1}{n}\sum_{i=1}^{n} [\![\arg\max_{l \in \mathcal{Y}} f(x_i, l) \notin Y_i]\!]$
(2) Coverage: $\frac{1}{n}\sum_{i=1}^{n} \max_{l \in Y_i} \mathrm{rank}_f(x_i, l) - 1$
(3) Ranking loss: $\frac{1}{n}\sum_{i=1}^{n} \frac{1}{|Y_i||\bar{Y}_i|}\left|\{(l', l'') \mid f(x_i, l') \le f(x_i, l''), (l', l'') \in Y_i \times \bar{Y}_i\}\right|$
(4) Average precision: $\frac{1}{n}\sum_{i=1}^{n} \frac{1}{|Y_i|}\sum_{l \in Y_i} \frac{|\{l' \in Y_i \mid \mathrm{rank}_f(x_i, l') \le \mathrm{rank}_f(x_i, l)\}|}{\mathrm{rank}_f(x_i, l)}$
(5) Macroaveraging F1: $\frac{1}{q}\sum_{j=1}^{q} \frac{2\,TP_j}{2\,TP_j + FP_j + FN_j}$
(6) Microaveraging F1: $\frac{2\sum_{j=1}^{q} TP_j}{\sum_{j=1}^{q}\left(2\,TP_j + FP_j + FN_j\right)}$
where $FN_j$, $TN_j$, $FP_j$, and $TP_j$ indicate the number of false-negative, true-negative, false-positive, and true-positive instances with regard to the $j$-th label.
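Assuming scores are given as a matrix F (n instances by q labels) and relevant labels are marked with 1, the four example-based metrics above can be sketched in NumPy as follows (ties in scores are broken arbitrarily in this sketch):

```python
import numpy as np

def _ranks(F):
    """Rank of each label per instance; rank 1 = highest score."""
    return (-F).argsort(axis=1).argsort(axis=1) + 1

def one_error(F, Y):
    """Fraction of examples whose top-ranked label is not relevant."""
    top = F.argmax(axis=1)
    return float(np.mean(Y[np.arange(len(Y)), top] != 1))

def coverage(F, Y):
    """Average ranking depth needed to cover all relevant labels, minus 1."""
    r = _ranks(F)
    return float(np.mean([r[i][Y[i] == 1].max() for i in range(len(Y))])) - 1

def ranking_loss(F, Y):
    """Average fraction of (relevant, irrelevant) pairs ordered wrongly."""
    losses = []
    for f, y in zip(F, Y):
        pos, neg = f[y == 1], f[y != 1]
        if len(pos) and len(neg):
            losses.append(np.mean(pos[:, None] <= neg[None, :]))
    return float(np.mean(losses))

def average_precision(F, Y):
    """For each relevant label, precision among labels ranked above it."""
    r = _ranks(F)
    ap = []
    for ri, y in zip(r, Y):
        rel = ri[y == 1]
        ap.append(np.mean([np.sum(rel <= rk) / rk for rk in rel]))
    return float(np.mean(ap))

F = np.array([[0.9, 0.2, 0.6]])   # scores: label 0 ranked first, then 2, then 1
Y = np.array([[1, 0, 1]])         # labels 0 and 2 are relevant
print(one_error(F, Y), coverage(F, Y))   # 0.0 1.0
```

The two label-based F1 metrics follow directly from the per-label confusion counts and are omitted here for brevity.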

In order to test the effectiveness of LLRD, we chose six multilabel learning algorithms, MLFE [33], RAKEL [34], ML2 [35], CLR [36], LIFT [18], and RELIAB [37], for performance comparison. MLFE makes full use of the intrinsic information in the feature space, making the semantics of the label space more abundant; its three tradeoff parameters are searched from {1, 2, …, 10}, {1, 10, 15}, and {1, 10}, respectively. RAKEL is a high-order approach whose basic idea is to transform the multilabel learning problem into an ensemble of multiclass classification problems; we use the default subset size and ensemble size recommended by the RAKEL algorithm. ML2 is the first multilabel learning algorithm that attempts to explore manifolds at the label level; its parameters are selected from {1, 2, …, 10}. CLR is a second-order problem transformation method that solves multilabel classification via label ranking, in which the ranking among labels is obtained by pairwise comparison. LIFT uses different feature sets to distinguish different labels by clustering positive and negative examples; the value of the ratio parameter r is 0.1, as suggested in [18]. RELIAB utilizes the implicit relative importance information of labels to achieve the multilabel learning task; its two parameters take values from {0.1, 0.15, …, 0.5} and {0.001, 0.01, …, 10}, respectively. For LLRD, the rank r can be selected from {1, 2, …, q − 1}. In a word, the parameter settings of the comparison algorithms follow the recommendations in the related papers.

3.2. Experimental Results

For each dataset in our experiment, we adopt the tenfold cross-validation strategy. Our experimental results are mainly reported in Tables 2 and 3, where we record the performance of the different algorithms on the multilabel datasets. Specifically, the mean and standard deviation of each evaluation criterion are recorded in the tables. For each evaluation metric, "↓" indicates "the smaller the better" and "↑" indicates "the larger the better". The best results are shown in bold.

We use the Friedman test [38] based on the average ranks to verify whether the differences between algorithms are statistically significant. If the hypothesis that "all algorithms have equal performance" is rejected, the performance of the algorithms differs significantly. As can be seen from the data presented in Table 4, the hypothesis that there is no significant difference among the algorithms does not hold at the 0.05 significance level. Therefore, we conduct a post hoc test to further distinguish the various algorithms. Usually, there are two options for the post hoc test: the Nemenyi test [38] and the Bonferroni–Dunn test [39]. For k algorithms, the former needs k(k − 1)/2 pairwise comparisons, while the latter only needs k − 1. Thus, we choose the latter. The Bonferroni–Dunn test is used to test whether LLRD is more competitive than the comparison algorithms, in which LLRD plays the role of the control algorithm. When the difference in average rank between two algorithms exceeds one critical difference (CD), the performance of the two algorithms is significantly different. The CD value mentioned here can be calculated from $CD = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}}$, where k = 7 and N = 13; at the 0.05 significance level, the corresponding $q_{0.05} = 2.638$, giving CD ≈ 2.24.
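The CD computation can be reproduced in a few lines (q_alpha = 2.638 is the tabulated two-tailed Bonferroni–Dunn value for seven algorithms at the 0.05 level, from Demšar's tables):

```python
import math

def bonferroni_dunn_cd(k, N, q_alpha=2.638):
    """Critical difference CD = q_alpha * sqrt(k * (k + 1) / (6 * N)).
    q_alpha = 2.638 is the tabulated two-tailed value for k = 7 algorithms
    at the 0.05 significance level."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * N))

cd = bonferroni_dunn_cd(k=7, N=13)   # 7 algorithms, 13 datasets
print(round(cd, 3))                  # 2.235
```

Any pair of algorithms whose average ranks differ by more than this value is judged significantly different in the CD diagrams.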

The CD diagram associated with LLRD and its comparison algorithm is shown in Figure 1. The numbers on the horizontal axis of the coordinate indicate the average rank value of each algorithm under different evaluation criteria. There is no significant difference in performance among the various algorithms connected by solid lines.

Through the analysis of the above experimental results, we can draw the following conclusions:
(1) In terms of the four evaluation criteria of one-error, coverage, ranking loss, and average precision, LLRD is clearly superior to RELIAB, RAKEL, and CLR.
(2) The smaller the average rank value, the better the performance of the corresponding algorithm. For LLRD, the average rank value is optimal in five of the six CD subdiagrams, which shows that LLRD outperforms the other algorithms.
(3) For regular-size datasets, LLRD ranks first in 69% of the cases under the different evaluation criteria, while for large-scale datasets, it ranks first in 36.1% of the cases.

4. Conclusions

In this work, we propose a novel multilabel classification algorithm named LLRD, which adopts low-rank decomposition to extract the latent information of the labels and further reduces the information loss of the label transformation via the new feature space. Experimental results show that the performance of the proposed LLRD is better than that of many state-of-the-art multilabel classification techniques. In the future, we will explore alternative models that combine low-rank decomposition and classification into a joint optimization problem in order to consider more complex label correlations.

Data Availability

The datasets used in our manuscript are all public datasets, which can be downloaded from “http://mulan.sourceforge.net/datasets.html” and “http://meka.sourceforge.net/#datasetsru”.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key R&D Program of China (2019YFC1521400), the National Natural Science Foundation of China (61806159, 61806160, and 61972312), and the China Postdoctoral Science Foundation (2018M631192).