Abstract

By leveraging neural networks, deep distance metric learning has yielded impressive results in computer vision applications. However, the existing approaches mostly focus on a single deep distance metric based on pairs or triplets of samples. It is difficult for them to handle heterogeneous data and avoid overfitting. This study proposes a boosting-based learning method of multiple deep distance metrics, which generates the final distance metric through iterative training of multiple weak distance metrics. Firstly, the sample pairs were mapped by a convolutional neural network (CNN), and the distance between the mapped pairs was evaluated by a piecewise linear function. Secondly, the evaluation function was added as a weak learner to the boosting algorithm to generate a strong learner. Each weak learner targets difficult samples different from the samples of previous learners. Next, an alternating optimization method was employed to train the network and loss function. Finally, the effectiveness of our method was demonstrated against state-of-the-art methods on image retrieval with the CUB-200-2011, Cars-196, and Stanford Online Products (SOP) datasets.

1. Introduction

In the past decades, distance metric learning has been applied effectively in image retrieval, face recognition, person re-identification, clustering, etc., and is now a hot topic in the field of computer vision. Thanks to the recent success of convolutional neural networks (CNNs), deep distance metric learning methods have attracted considerable attention [1].

Each deep distance metric aims to map training samples to features via CNNs. The mapping should narrow the distance between similar sample pairs and widen that between dissimilar sample pairs. To learn deep distance metrics, many approaches have been developed based on sample pairs [2, 3], triplets [2, 4], or quadruplets [5]. This study attempts to learn simple similarity functions of sample pairs. The distance metric was defined as the Euclidean distance between sample pairs, which can be computed more rapidly than other metrics.

Most of the existing methods of deep distance metric learning try to improve a single loss function based on a single distance metric. However, a single distance metric is insufficient to handle all the samples from a given data distribution. In fact, feature data are generally not distributed uniformly: the density varies from region to region in the data distribution [6, 7]. To solve the problem, some scholars resorted to ensemble techniques and employed several learners to map each sample to multiple subspaces [8–10]. Nevertheless, these strategies do not support end-to-end training of the network and loss function of each weak learner. The lack of such joint training suppresses the discriminative ability of the metric and increases its susceptibility to noise. The accuracy of deep distance metric learning could be further improved through joint training of the network and loss function.

This study aims to improve the adaptability of the conventional deep distance metric between pairs of samples. The main idea is to divide the last fully connected layer of the CNN into multiple nonoverlapping groups (Figure 1), each of which serves as a separate feature mapping of the network. The distance metric of sample pairs mapped by one of the groups was evaluated by a piecewise linear function. Each group has a corresponding evaluation function, which is added as a weak learner to the boosting algorithm to generate a strong learner. This finally forms a multidistance metric ensemble. In addition, the same underlying pretrained feature representation was shared by the fully connected layers of all groups. In this way, the high computing cost of CNN training in the boosting framework was significantly reduced. In that framework, each learner reweights the training samples for successive learners, according to the gradient of the loss function. As a result, the successive learners focus on difficult sample features, producing more suitable feature representations. The final ensemble output is a linear composition of multiple weak learners. Furthermore, the performance of the conventional distance metric was improved by introducing a piecewise linear function, which evaluates the similarity of sample pairs in distance metric learning and facilitates the joint training of the network and loss function. In the image retrieval task, the Recall@1 of the proposed method is 4.2, 2.8, and 0.4 points higher than the previous best scores on the CUB-200-2011, Cars-196, and SOP datasets, respectively. Experimental results show that the proposed method outperforms the comparison methods, while avoiding overfitting to a certain extent.

The contributions of our work are as follows:
(1) The last fully connected layer of the CNN was used to form multiple groups of features, which were designed to form a distance metric ensemble and formulated as a boosting problem. Then, an alternating optimization method was adopted to jointly train the network and loss function.
(2) A piecewise linear function was employed as the evaluation function of the distance metric of sample pairs mapped by the CNN and added as a weak learner to the boosting algorithm to generate a strong learner.

2. Literature Review

This section reviews the most closely related works out of the numerous publications on the hot topic of distance metric learning.

2.1. Deep Distance Metric Learning

Many methods employ a discriminative distance metric loss function to increase interclass distance and reduce intraclass distance [2, 4, 11–14]. For instance, contrastive loss is a popular tool of deep distance metric learning that minimizes the distance between the eigenvectors of positive sample pairs and widens the distance between negative sample pairs [2, 3]. Building on contrastive loss, triplet loss creates a 3-tuple with a positive sample pair and a negative sample pair, in light of the relative relationship between intraclass and interclass distances [2, 4], and ensures that the positive sample pairs are closer in the mapped feature space than the negative sample pairs. Many other loss functions have been extended from the above two losses, including histogram loss [15], quadruplet loss [5], N-pair loss [11], angular loss [12], and hierarchical triplet loss [16].

Taking a tuple of samples as training samples yields a huge amount of training data, and deep distance metric learning would be greatly enhanced by acquiring more effective samples. Recently, several scholars have designed sampling strategies to tackle hard and semihard negative mining [16–18]. For example, Xuan et al. [7] observed that easy positive samples help to preserve the intraclass difference and thus improve the generalization ability of triplet loss. However, the use of easy positive samples constantly underchallenges the metric, making the embedding space less discriminative.

2.2. Ensemble Learning

The methods above all strive to improve the loss function based on a single distance metric. However, it is difficult for them to adapt to all available data. Recently, ensemble learning, which iteratively trains an ensemble from several weak learners for the final prediction, has been incorporated to boost the generalization performance of deep metric learning.

Negrel et al. [8] explained how to use their boosting-based metric learning algorithm to compute hierarchical organizations of face databases. Kim et al. [9] introduced multiple attention-based learners for an ensemble. Xuan et al. [10] grouped labels randomly to create a large family of related embedding models, which can serve as an ensemble. Sanakoyeu et al. [6] employed a divide-and-conquer strategy to divide the embedding space into several clusters and used each cluster to train a single learner.

2.3. Other Related Metric Learning

In addition, other types of distance metric learning approaches have emerged recently, such as sample selection, local metrics, and hierarchical metrics. Wu et al. [19] proposed a distance-weighted sampling procedure, which selects more informative and stable examples than traditional approaches, achieving excellent results. Wang et al. [13] generalized tuple-based losses and reformulated them as different weighting strategies of positive and negative pairs within a minibatch. Roth et al. [20] proposed to learn the distribution for sampling negative examples instead of using a predefined one. Local metric learning methods [21, 22] learn a collection of Mahalanobis distance metrics, each operating on a different subset of the data obtained by K-means or Gaussian mixture clustering. The methods in [23, 24] learn a two-level category hierarchy by using coarse and fine classifiers. Ge et al. [16] proposed a hierarchical version of triplet loss that learns the sampling together with the feature embedding.

Different from the above approaches, our approach realizes the end-to-end training of the network and loss function of each weak learner, thereby enhancing the accuracy of deep distance metric and reducing the probability of overfitting.

3. Methodology

3.1. Boosting-Based Deep Distance Metric Model

Let $\{(x_n^{(1)}, x_n^{(2)}), y_n\}_{n=1}^{N}$ be $N$ training sample pairs, each of which carries one of two class labels $y_n \in \{+1, -1\}$. If the two samples belong to the same class, the pair is labeled yn = +1 and called a positive sample pair; if the two samples belong to different classes, the pair is labeled yn = -1 and called a negative sample pair.

We divide the last fully connected layer of the CNN into multiple nonoverlapping groups. The training sample pair $(x_n^{(1)}, x_n^{(2)})$ is fed into the CNN to generate an eigenvector pair $(f_m(x_n^{(1)}), f_m(x_n^{(2)}))$, which is extracted from the mth group of the last fully connected layer. Thus, the training sample pair can be mapped to multiple groups of features, which were designed to form a distance metric ensemble and formulated as a boosting problem.
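
To illustrate this grouping, the following sketch (TensorFlow/Keras; the InceptionV3 backbone stands in for the GoogLeNet extractor used in the experiments, and M and GROUP_DIM are hypothetical values) builds a shared backbone whose last fully connected layer is split into M nonoverlapping feature groups:

import tensorflow as tf

M = 8           # number of groups / weak learners (illustrative value)
GROUP_DIM = 32  # per-group embedding size, so the full embedding is M * GROUP_DIM

# Shared, pretrained underlying feature representation (InceptionV3 as a stand-in).
backbone = tf.keras.applications.InceptionV3(include_top=False, pooling="avg",
                                              weights="imagenet")

inputs = tf.keras.Input(shape=(224, 224, 3))
features = backbone(inputs)                                   # shared backbone features
embedding = tf.keras.layers.Dense(M * GROUP_DIM)(features)    # last fully connected layer
# Split the last fully connected layer into M nonoverlapping groups, one per weak learner.
groups = tf.keras.layers.Lambda(lambda e: tf.split(e, M, axis=-1))(embedding)
model = tf.keras.Model(inputs, groups)
# Both samples of a pair are passed through the same network; the mth output of
# each forward pass gives the eigenvector pair (f_m(x1), f_m(x2)).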

Drawing on the idea of the boosting algorithm, multiple weak learners are adopted to produce a strong learner of distance metrics between the mapping values of training sample pairs. The weak learners are trained on reweighted samples, according to the gradient of the loss function. In general, we want to learn a set of weak learners and their corresponding boosting model:

$$D_M\big(x^{(1)}, x^{(2)}\big) = \sum_{m=1}^{M} \varphi_m\big(f_m(x^{(1)}), f_m(x^{(2)})\big), \quad (1)$$

where M is the number of weak learners and φm is the distance metric evaluation function between the eigenvectors of the training sample pair mapped by the mth group of the fully connected layer.

In the above formula, φm was used to quantify the similarity between two training samples, and its value reflects whether the two samples should be considered to represent the same class. Therefore, a threshold was defined on the distance metric between two training samples, and a piecewise linear function was adopted as the evaluation function. This function reduces the distance between similar training samples and increases that between dissimilar ones in the mapped space. The evaluation function can be defined as

$$\varphi_m\big(f_m(x^{(1)}), f_m(x^{(2)})\big) = \begin{cases} \alpha_m, & d\big(f_m(x^{(1)}), f_m(x^{(2)})\big) < t_m, \\ \beta_m, & \text{otherwise}, \end{cases} \quad (2)$$

where $d(\cdot, \cdot)$ is a generic distance metric (here, the simple Euclidean distance), αm and βm are the evaluated similarity and dissimilarity between the two samples, respectively, and tm is a distance metric threshold. If the Euclidean distance between two mapped training samples is smaller than the threshold tm, the evaluation value is αm; otherwise, it is βm (Figure 2).
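
As a concrete reading of formulas (1) and (2), a minimal NumPy sketch is given below; evaluate_pair and ensemble_score are illustrative helper names rather than parts of the original implementation:

import numpy as np

def evaluate_pair(f1, f2, alpha_m, beta_m, t_m):
    # Weak-learner evaluation: alpha_m if the Euclidean distance between the
    # mapped samples is below the threshold t_m, beta_m otherwise.
    d = np.linalg.norm(f1 - f2)
    return alpha_m if d < t_m else beta_m

def ensemble_score(group_pairs, params):
    # Strong learner D_M: sum of the M weak evaluations, one per feature group.
    # group_pairs: list of (f_m(x1), f_m(x2)); params: list of (alpha_m, beta_m, t_m).
    return sum(evaluate_pair(f1, f2, a, b, t)
               for (f1, f2), (a, b, t) in zip(group_pairs, params))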

In each round of boosting, a new weak learner is trained on the reweighted training sample pairs in the minibatch, according to the gradient of the loss function, and then added to form a strong learner. As demonstrated by Friedman [25], the training of a single learner can be written as a loss function minimization problem:

$$\min_{\varphi_m} \sum_{n=1}^{N} \ell\Big(y_n,\, D_{m-1}\big(x_n^{(1)}, x_n^{(2)}\big) + \varphi_m\big(f_m(x_n^{(1)}), f_m(x_n^{(2)})\big)\Big), \quad (3)$$

where $\ell(\cdot, \cdot)$ is a loss function. Here, the exponential loss function is utilized. Inspired by Schapire and Singer [26], formula (3) can be rewritten as

$$\min_{\varphi_m} \sum_{n=1}^{N} w_n^{(m)} \exp\Big(-y_n\, \varphi_m\big(f_m(x_n^{(1)}), f_m(x_n^{(2)})\big)\Big), \quad (4)$$

where $w_n^{(m)} = \exp\big(-y_n D_{m-1}(x_n^{(1)}, x_n^{(2)})\big)$ is the weight of the nth training sample pair in iteration m. The weak learner is selected to minimize the loss function in each iteration to update the strong learner. The parameters αm, βm, and tm of the distance evaluation function and the parameters of the mth group of the fully connected layer both need to be optimized.

The proposed approach can easily be integrated with some deep metric learning losses, such as triplet loss, N-pair loss, and hierarchical triplet loss. However, it is not directly applicable to certain other loss functions, such as histogram loss and angular loss, and would need to be adapted for them.

3.2. Joint Training

As to formula (3), we attempt to jointly learn both the network and the loss function. We note that the objective is nonconvex and difficult to solve in general. Referring to Zhang et al. [27], this study applies an alternating optimization method to jointly train the network and loss function.

Since a learner needs to be optimized in each round of boosting, optimization problem (4) was investigated first, while fixing the parameters of the mth group of the fully connected layer. Denoting the distance of the nth mapped pair by $d_n = d\big(f_m(x_n^{(1)}), f_m(x_n^{(2)})\big)$, formula (4) can be decomposed into

$$\sum_{n:\, d_n < t_m} w_n^{(m)} e^{-y_n \alpha_m} + \sum_{n:\, d_n \geq t_m} w_n^{(m)} e^{-y_n \beta_m}. \quad (5)$$

Taking the partial derivatives of formula (5) with respect to αm and βm and setting both to zero to optimize each parameter,

$$\alpha_m = \frac{1}{2} \ln \frac{\sum_{n:\, d_n < t_m,\, y_n = +1} w_n^{(m)}}{\sum_{n:\, d_n < t_m,\, y_n = -1} w_n^{(m)}}, \quad (6)$$

$$\beta_m = \frac{1}{2} \ln \frac{\sum_{n:\, d_n \geq t_m,\, y_n = +1} w_n^{(m)}}{\sum_{n:\, d_n \geq t_m,\, y_n = -1} w_n^{(m)}}. \quad (7)$$

After each iteration, the weights of the training sample pairs are updated using the exponential loss function:

$$w_n^{(m+1)} = w_n^{(m)} \exp\Big(-y_n\, \varphi_m\big(f_m(x_n^{(1)}), f_m(x_n^{(2)})\big)\Big). \quad (8)$$

Then, the weights of all training sample pairs are normalized. As shown in formulas (6) and (7), the parameters of the evaluation function depend only on tm, whose optimal value is obtained by traversal over candidate thresholds. If a training sample pair is classified correctly, its weight for successive learners tends to be small; otherwise, its weight tends to be large. Hence, successive learners focus on different training sample pairs than previous learners, increasing the diversity among learners (Figure 3).
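
The per-group optimization described above can be sketched as follows (plain NumPy; fit_weak_learner and the candidate-threshold grid are hypothetical, and the closed-form updates follow the reconstructed formulas (5)–(8)):

import numpy as np

def fit_weak_learner(d, y, w, candidate_thresholds, eps=1e-12):
    # One boosting round for a fixed feature group: d[n] is the Euclidean distance
    # of the nth mapped pair, y[n] in {+1, -1}, w[n] the current (normalized) weight.
    best = None
    for t in candidate_thresholds:                     # traversal over the threshold t_m
        near = d < t
        a = 0.5 * np.log((w[near & (y == 1)].sum() + eps) / (w[near & (y == -1)].sum() + eps))
        b = 0.5 * np.log((w[~near & (y == 1)].sum() + eps) / (w[~near & (y == -1)].sum() + eps))
        phi = np.where(near, a, b)                     # piecewise evaluation of every pair
        loss = np.sum(w * np.exp(-y * phi))            # weighted exponential loss, formula (4)
        if best is None or loss < best[0]:
            best = (loss, t, a, b, phi)
    _, t_m, alpha_m, beta_m, phi = best
    w_new = w * np.exp(-y * phi)                       # reweight pairs for the next learner
    return t_m, alpha_m, beta_m, w_new / w_new.sum()   # normalize the updated weights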

The next step is to update parameters of the mth group of the fully connected layer, while fixing αm, βm, and tm of the evaluation function. These parameters were trained with the contrastive loss function, using the standard backpropagation algorithm. In the forward process, the similarity distance metric is computed for each input training sample pair. In the backward process, the gradient of the loss function is iteratively propagated for each group (Figure 4).

For the contrastive loss function, the distance metric threshold tm obtained through weak-learner training serves as the distance margin for a negative training sample pair. Then, the contrastive loss function can be established as

$$L_m = \frac{1}{N} \sum_{n=1}^{N} \Big[\, \mathbb{1}(y_n = +1)\, d_n^2 + \mathbb{1}(y_n = -1)\, \max\big(0,\, t_m - d_n\big)^2 \Big], \quad (9)$$

where $\mathbb{1}(\cdot)$ is the indicator function and $d_n$ is the Euclidean distance between the mapped samples of the nth pair.
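
A minimal TensorFlow sketch of this loss, assuming the standard contrastive form with margin tm as reconstructed in formula (9) (the function name contrastive_loss is illustrative):

import tensorflow as tf

def contrastive_loss(f1, f2, y, t_m):
    # Positive pairs (y = +1) are pulled together; negative pairs (y = -1) are
    # pushed apart beyond the margin t_m taken from the weak-learner threshold.
    d = tf.norm(f1 - f2, axis=-1)
    margin = tf.cast(t_m, d.dtype)
    pos = tf.cast(tf.equal(y, 1), d.dtype)
    per_pair = pos * tf.square(d) + (1.0 - pos) * tf.square(tf.maximum(0.0, margin - d))
    return tf.reduce_mean(per_pair)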

The training procedure is illustrated in Algorithm 1.

Initialization: randomly initialize the distance metric thresholds tm, m = 1, ..., M, the weights of the training sample pairs $w_n^{(m)}$, m = 1, ..., M and n = 1, ..., N, and the parameters of the mth group of the fully connected layer, m = 1, ..., M;
 for m = 1 to M
  repeat
   for n = 1 to N
    Select the optimal value tm that minimizes formula (4), and update the weight of the nth training sample pair;
    Compute the derivatives of formula (9), and update the parameters of the mth group of the fully connected layer via backpropagation;
   end
  until the terminal condition is met
 end
 Output the strong learner DM
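
For reference, the following Python sketch mirrors Algorithm 1 under the assumptions above; it reuses the model, fit_weak_learner, and contrastive_loss helpers sketched earlier, treats pairs_x1, pairs_x2, pairs_y, M, and num_iterations as hypothetical placeholders, and iterates over the full set of training pairs rather than minibatches for clarity:

import numpy as np
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(1e-4)
weak_params = []                                       # stores (t_m, alpha_m, beta_m) per learner

for m in range(M):                                     # one weak learner per feature group
    w = np.full(len(pairs_y), 1.0 / len(pairs_y))      # initialize the pair weights (uniform here)
    for it in range(num_iterations):                   # "repeat ... until terminal condition"
        g1 = model(pairs_x1)[m].numpy(); g2 = model(pairs_x2)[m].numpy()
        d = np.linalg.norm(g1 - g2, axis=-1)           # pair distances in the mth group
        # Fix the network: choose t_m by traversal, alpha_m and beta_m in closed
        # form, and reweight the training sample pairs.
        t_m, alpha_m, beta_m, w = fit_weak_learner(d, pairs_y, w, np.linspace(0.0, 2.0, 50))
        # Fix the evaluation function: update the mth group with the contrastive loss.
        with tf.GradientTape() as tape:
            e1 = model(pairs_x1)[m]; e2 = model(pairs_x2)[m]
            loss = contrastive_loss(e1, e2, tf.constant(pairs_y), t_m)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(
            (g, v) for g, v in zip(grads, model.trainable_variables) if g is not None)
    weak_params.append((t_m, alpha_m, beta_m))         # parameters of the mth weak learner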

4. Experiments and Results Analysis

To verify its effectiveness, our method for deep distance metric learning was tested on three standard datasets: CUB-200-2011, Cars-196, and Stanford Online Products (SOP). Following the standard protocol proposed by Oh Song et al. [2], each dataset was split into a training set and a test set. For the CUB-200-2011 dataset, 5,864 images in the first 100 classes were allocated to the training set and 5,924 images in the last 100 classes were allocated to the test set. For the Cars-196 dataset, 8,054 images in the first 98 classes were allocated to the training set and 8,131 images in the remaining 98 classes were allocated to the test set. For the SOP dataset, 59,551 images of 11,318 classes were allocated to the training set and 60,502 images of 11,316 classes were allocated to the test set.

The performance of our method in retrieving images from the above datasets was evaluated by Recall@K. For each retrieval task, we computed the percentage of the testing images whose top-K retrieved images contain at least one image with the same class label. The K value was set to K∈{1, 2, 4, 8, 16, 32} for CUB-200-2011 and Cars-196, and K∈{1, 10, 100, 1000} for SOP. Our method was implemented under the framework of TensorFlow. Following Oh Song's approach [2], GoogLeNet was adopted as the feature extractor. The batch size was fixed at 128 in all experiments.
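
Recall@K as described above can be computed along the lines of the following NumPy sketch (recall_at_k is an illustrative name; it builds the full pairwise distance matrix for simplicity):

import numpy as np

def recall_at_k(embeddings, labels, ks=(1, 2, 4, 8, 16, 32)):
    # Fraction of query images whose K nearest neighbours (Euclidean distance,
    # query itself excluded) contain at least one image of the same class.
    d = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                     # never retrieve the query itself
    order = np.argsort(d, axis=1)                   # neighbours sorted by distance
    return {k: (labels[order[:, :k]] == labels[:, None]).any(axis=1).mean() for k in ks}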

Since the deep distance metric could be affected by the number of weak learners, the influence of that number on our method was observed through experiments on each of the three datasets. As shown in Figure 5, with the growing number of weak learners, the Recall@1 score first increased and then declined. The highest Recall@1 score was achieved at 8, 6, and 7 weak learners for CUB-200-2011, Cars-196, and SOP, respectively. A possible reason is that the images in Cars-196 have relatively small variations, those in SOP face large viewpoint changes, and those in CUB-200-2011 feature large pose variations and strong background clutter. In the following experiments, the number of weak learners was set to 8, 6, and 7 for CUB-200-2011, Cars-196, and SOP, respectively.

The eigenvector size also exerts a major effect on the deep distance metric. Hence, experiments were carried out on Cars-196 with 6 weak learners and different eigenvector sizes. Drawing on the work of Wang et al. [13], the eigenvector size was increased from 64 to 1,024. Figure 6 compares the retrieval performance of our method with that of the multisimilarity (MS) method [13]. As shown in Figure 6, the retrieval performance of both methods gradually increased with the eigenvector size. Our method performed stably when the size was equal to or greater than 256 and consistently outperformed MS. Hence, the eigenvector size was fixed at 256 in subsequent experiments.

Next, the training results and testing results were contrasted on Cars-196. As shown in Figure 7, the training R@1 had only a small gap from the testing R@1. On Cars-196, the R@1 score of the training set was around 93%, only 7% higher than that of the test set. This indicates that our method largely avoids overfitting.

Figure 8 shows the convergence curves of our method and several state-of-the-art methods on Cars-196. Within the first 40 epochs, our method reached the performance of the state-of-the-art methods and converged faster than most of the other methods. Judging from the trend of the curves over epochs 0 to 50 in Figure 8, the convergence rate of our model was not the highest, but on the whole it was satisfactory. Except for MS, the comparison methods took hundreds of epochs to converge. Thus, the training time of our method was compared with that of MS. On a single NVIDIA Tesla V100 GPU, the mean running time of our method was 24.36 s per epoch on CUB-200-2011 and 40.29 s per epoch on Cars-196, while that of MS was 28.45 s and 43.58 s, respectively.

Finally, the image retrieval performance of our method was compared with that of the state-of-the-art methods on CUB-200-2011 and Cars-196, respectively. The comparison results (Tables 1 and 2) show that our method outperformed these methods, including higher-order tuple losses such as LiftedStruct and N-pair loss, angular loss, and ensemble methods such as attention-based ensemble (ABE) and deep randomized ensembles for metric learning (DREML). In particular, on the challenging CUB-200-2011 dataset, our method led the best-performing state-of-the-art method by a large margin of 4.2% in R@1. On SOP, our method also attained the best performance (Table 3). On all the datasets, our method achieved better performance with a lower feature dimension than the existing methods with higher feature dimensions.

5. Conclusions

This study presents a boosting-based deep distance metric ensemble method, which generates the final distance metric through iterative training of multiple weak distance metrics. Specifically, the last fully connected layer of the CNN was used to form multiple groups of features. The sample pairs were mapped by the CNN, and the distance between the mapped sample pairs was evaluated by a piecewise linear function. The function was added as a weak learner to the boosting algorithm to generate a strong learner. Then, an alternating optimization method was utilized to optimize the parameters of the network and the loss function. The effectiveness of our method was demonstrated on three datasets widely used in image retrieval tasks. Future research will further improve our method by cascading more models and combining it with other loss functions.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by Scientific Research Project of Xuzhou University of Technology (Grant no. XKY2019107), Science and Technology Project of Construction System, Jiangsu Province, China (Grant no. 2018ZD077), Natural Science Research Project of Jiangsu Province Universities, China (Grant no. 20KJB170023), and Xuzhou Science and Technology Planning Project (Grant no. KC21303).