Abstract
To obtain a discriminative, compact appearance model for effective object tracking, this paper proposes a structural tracking strategy that combines a multicue inverse sparse appearance model with optimal metric evaluation between robust online templates and a limited number of particle samples in each tracking loop. The multicue inverse sparse appearance model selects informative particle samples at the global level, which avoids the cumbersome coding and decoding cost of trivial random particle samples; only the most promising candidates are involved in each tracking loop. This avoids an arbitrary, coarse reduction in the number of particle samples while preserving the unbiasedness and stochastic dynamics of the sampling process. Meanwhile, low-rank self-representatives of positive and negative samples are used to build a suitable code book, which arranges the useful sparse coefficients into feature bags and supports optimal metric evaluation during online training. This also alleviates the accuracy degradation caused by occlusion and improves the robustness of the tracker. Both components preserve the discriminative compactness of the target representation, which speeds up particle filtering localization and helps separate the target from distractors. Moreover, the proposed method exploits online appearance representations to learn shared compact information, which avoids heavy computation on massive visual data.
1. Introduction
As an effective way to locate a target of interest, object tracking is widely deployed in surveillance services. Distributed surveillance in turn calls for better appearance modeling methods that yield a compact model with low computation and transmission cost. Although many solutions have been proposed in this field, appearance modeling still faces challenging problems such as occlusion, illumination change, and pose variation [1]. Because high-accuracy surveillance requires expensive and complex deployment under limited computational resources in real environments, it is crucial to balance compactness and robustness in appearance modeling. Among machine learning methods aimed at robustness, incremental subspace learning was used to handle dramatic template changes [2–4] and to reduce training on contaminated templates. These methods are, to some extent, effective at exploiting the intrinsic subspace structure, but they cannot avoid storing large amounts of high-dimensional data for nuclear norm minimization. To address this drawback, algorithms based on sparse representation [5, 6] were presented within a multilinear framework that minimizes the reconstruction error. However, the learned sparse appearance model could not provide enough spatial context information because the sparse representation coefficients were usually computed for target samples in each tracking loop individually, so global constraints on the related subspace structures across the whole video sequence were easily ignored. Meanwhile, repeated sparse decompositions over many loops are time-consuming, which results in a low frame rate and high energy consumption.
The cooperative sparse appearance model, which combines a sparse generative model (SGM) and a sparse discriminative classifier (SDC), used global templates to update the appearance model and measured similarity through trivial discriminative blocks [7]. Methods [8–10] took partial and spatial information into account to obtain more robust templates, at the cost of heavy atom dictionary construction and pooling computation. Wavelet transform-based features were also extracted to improve the reliability of appearance modeling, but the joint dictionaries for sparse coding still required demanding storage [11]. Intuitively, discriminative methods provide an adaptive, complementary option for handling various appearance changes [1]. Good discriminative representations usually need large amounts of supervised labels to fit the real data distribution [12–14], and different classifiers have been used to obtain more discriminative appearance models. The method in [15] relied on background information and selected the most discriminating metrics to keep tracking stable. In [16], a random forest-based online multiview semisupervised learning algorithm was proposed that updates subtrees with individual labels for the unlabeled data. Hough forest-based backprojection was used in [17] to generate structural patches. Spatial regularization was adopted in [18] to penalize the learned classifier. Bootstrapped sequential states between frames were used in [19] to prevent random samples from contaminating the labeled examples. Recently, convolutional neural network-based methods have captured rich information for describing appearance models through multilayer nonlinear transformations [20–23]. However, abstract convolutional features obtained through pooling may ignore the original complex feature attributes of the appearance model within the image structure.
Furthermore, although transfer learning can exploit large sets of pretraining data, collecting visual data, annotating labels, and training on ground-truth data still carry a high cost. In our view, it is challenging to learn deep appearance models from limited samples. In addition, a transfer learning model trained on a large-scale dataset may diverge to some degree across domains [22, 24].
To obtain a robust, discriminative, and compact appearance model and to reduce the heavy computation in the tracking loop, this paper proposes an optimal structural tracking strategy that consists of global sparse representatives and local optimal metric evaluation based on sparse coding feature bags. In summary, the main contributions are as follows:
(1) The multicue inverse sparse appearance model avoids the heavy computation of redundant particle sampling caused by trivial random procedures. It selects the most informative particle candidates for structural appearance modeling.
(2) Positive and negative samples are replaced at the local level by sparse coefficient-based feature bags, which yield the optimal metric composition. This is potentially better suited to matching evaluation because it uses the limited but informative content produced in the sparse coding phase. It also preserves the subtle discriminative power of the target's compact model.
2. Background Information
2.1. Inverse Sparse Appearance Model
Given normalized candidates generated by particle filtering in the -th frame, the previous target region, used as the template in the -th frame, can be coded by the dictionary [25, 26]: . The sparse decomposition of the template is then expressed as a nonnegative combination of sparse coefficients , such that the template reconstruction error is minimized under a penalty term, as shown in the following equation:
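As an illustration of the kind of problem described above, the following Python sketch codes a template as a nonnegative sparse combination of candidate columns. It assumes a squared reconstruction error with an l1 penalty and solves it by projected proximal gradient descent. The function name `nonneg_sparse_code`, the penalty weight `lam`, the iteration count, and the synthetic data are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def nonneg_sparse_code(D, t, lam=0.01, n_iter=500):
    """Approximately minimise 0.5*||t - D c||^2 + lam*sum(c) subject to c >= 0.

    D holds one candidate per column (assumed dictionary layout), t is the template.
    """
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ c - t)               # gradient of the reconstruction term
        # Gradient step, l1 shrinkage, and projection onto c >= 0 in one operation.
        c = np.maximum(c - step * (grad + lam), 0.0)
    return c

# Synthetic example: 20 centred, unit-norm candidate vectors of dimension 64.
rng = np.random.default_rng(0)
candidates = rng.standard_normal((64, 20))
candidates /= np.linalg.norm(candidates, axis=0)
# The template is a slightly noisy copy of candidate 7.
template = candidates[:, 7] + 0.01 * rng.standard_normal(64)

coeffs = nonneg_sparse_code(candidates, template)
best = int(np.argmax(coeffs))                  # candidate with the largest coefficient
```

In this toy setting, candidates with large coefficients are the ones worth keeping for the later matching stage, while candidates with zero coefficients can be discarded.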
2.2. Low-Rank Self-Representatives
Low-rank exemplar representatives of a high-dimensional data structure can be selected efficiently from groups of related data [27]. The self-representation property of highly related samples is exploited to obtain the most crucial ones. Given the data samples of a dataset as the columns of a data matrix , the optimization problem is as follows:
Here, is the coefficient matrix and counts the nonzero rows of . This compact learning process can be treated as a self-representation procedure that yields a structurally analogous representation of the original data.
2.3. Metric Evaluation
Metric evaluation measures the difference between two feature vectors and can be defined by a positive semidefinite matrix. The Mahalanobis distance [28] between is a well-known example, as shown in the following equation:
Here, represents vectors of the training sets. Afterwards, the positive semidefinite matrix is factorized as , so equation (3) can be rewritten in the following form:
The matrix , learned online, maps the samples and into a new low-dimensional subspace, so that a new feasible distance metric replaces the original one.
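The equivalence between the two forms of the distance can be checked numerically. The sketch below assumes the standard definitions: a squared Mahalanobis distance d(x, y) = (x - y)^T M (x - y) and a factorization M = L^T L. The symbol names `M` and `L`, the dimensions, and the random data are assumptions made for this example.

```python
import numpy as np

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y) for a PSD matrix M."""
    d = x - y
    return float(d @ M @ d)

rng = np.random.default_rng(1)
L = rng.standard_normal((3, 8))    # maps 8-D features into a 3-D subspace
M = L.T @ L                        # positive semidefinite by construction

x = rng.standard_normal(8)
y = rng.standard_normal(8)

d_metric = mahalanobis_sq(x, y, M)
# Same quantity computed as a squared Euclidean distance after projection by L.
d_projected = float(np.sum((L @ x - L @ y) ** 2))
```

Because L has fewer rows than columns here, the distance is effectively evaluated in a lower-dimensional subspace, which is the property the text relies on.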
3. Proposed Algorithm
3.1. Multicue Inverse Sparse Appearance Model-Based Particle Sampling (MISAMPS)
Random sampling preserves the stochastic properties needed to handle nonlinearity and uncertainty, but for large amounts of random samples the computation cost remains a serious problem under limited resources. To reduce the redundancy caused by random particle sampling, this paper applies a global-level multicue inverse sparse appearance model to select the most informative particle exemplars in the tracking loop. After the initial extraction of the ROI (region of interest) in the -th frame, the normalized random particle sampling states in the -th frame are first segmented into local patches that can be coded for the atom dictionary. In addition, the segmented patches are assigned different weights, which exploit the potential connectivity among patches to separate the foreground object from the background discriminatively when the ROI is partially occluded. In this paper, multiple cues () of each patch are each used to build an inverse sparse appearance model for robust representation. As shown in Figure 1, the compact selection of informative particle samples by inverse sparse representation addresses the redundancy of the original sampling; it is therefore more valuable to consider only the limited set of samples obtained after multicue inverse sparse modeling. To alleviate tracking drift, the same corresponding patches are processed separately for each cue, which yields a more accurate local weight distribution.

Commonly, the local-level patch weight distribution can be gradually optimized by an adaptive AdaBoost process [25] or by quadratic programming [29]. For the adaptive weight distribution, the size of each normalized patch is set as (pixel). Coordinate exists inside the -th patch . Following the distribution in the previous -th frame, the weight distribution under partial occlusion is shown in Figure 2. This presentation ensures that feature confidence evolves consistently between the previous frame and the current frame. The multicue-based structural weight distributions reflect the dynamic arrangement of confident patches: nonoccluded patches (warm-color patches) receive high weights, and occluded patches receive low weights. Considering the potential structural diversity among different features in the target ROI, multiple cues provide a wider variety of optimal particle proposals with less demanding procedures. If normalized candidates are sampled in the -th frame, the stored target ROIs, kept as templates until the -th frame, can be coded by the respective dictionary: . According to equation (1), the template is sparsely decomposed with nonnegative sparse coefficients , with the help of distribution , until the template reconstruction error reaches its minimum under the penalty term, as shown in the following equation:

3.2. Feature Bag-Based Optimal Metric Evaluation (FBOME)
To select the most suitable result among the provided set of particle exemplars, feature bags are computed in advance; they are more discriminative than the original color feature-based representation. Instead of coding with multiple patch-based -means clustering for the convolutional filter bank in the tracking loop, the principle atom-based code book (PACB) employs low-rank self-representatives to represent the essential atoms, as shown in Figure 3, which then provide feasible sparse labels for the whole ROI area. Given templates in the -th frame, column vectorization sets can be decomposed under the minimal reconstruction error with constraints. equates . According to equation (2), representatives are selected as principle atoms to replace the original cases. in equation (2) can be shown as where is the -th row of and is the indicator that shows the number of nonzero rows of . Its corresponding -th columns are the less powerful representatives for the whole structure [27]. Because the problem under the -norm constraint is NP-hard [30], the -norm is usually applied instead as a relaxed constraint on the elements of . Equation (2) can then be solved through the following equation:

Here, is a nonnegative parameter, and ensures that the optimization is convex. sums the -norms of the rows of . The solution identifies the crucial set of representatives through the corresponding rows in the data structure. Rows of with more nonzero entries carry higher weights in the data self-representation. We can obtain . In this way, the redundant information of is compressed efficiently into the low-rank PACB sets.
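A convex row-sparse self-representation of this kind can be sketched in a few lines of Python. The example below assumes the penalized form 0.5*||X - XC||_F^2 + lam * (sum of the l2 norms of the rows of C) and solves it by proximal gradient descent with row-wise shrinkage. The choice of the l2 row norm, the solver, the function name `select_representatives`, and all parameter values are assumptions made for illustration; the paper's own solver may differ.

```python
import numpy as np

def select_representatives(X, lam=0.5, n_iter=300):
    """Row-sparse self-representation X ~ X C via proximal gradient descent.

    Columns of X are samples. Rows of C with a large norm mark the samples
    that act as representatives for the rest of the data.
    """
    n = X.shape[1]
    G = X.T @ X                               # Gram matrix of the samples
    step = 1.0 / np.linalg.norm(G, 2)         # 1 / Lipschitz constant of the gradient
    C = np.zeros((n, n))
    for _ in range(n_iter):
        Z = C - step * (G @ C - G)            # gradient step on 0.5*||X - XC||_F^2
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        # Row-wise soft thresholding: shrinks whole rows towards zero.
        C = Z * np.maximum(1.0 - step * lam / np.maximum(norms, 1e-12), 0.0)
    return C, np.linalg.norm(C, axis=1)

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 30))
X /= np.linalg.norm(X, axis=0)                # unit-norm samples

C, row_norms = select_representatives(X, lam=0.5)
top_atoms = np.argsort(row_norms)[::-1][:5]   # indices of the five strongest rows
```

Increasing `lam` drives more rows of `C` to zero, so fewer samples are retained as atoms of the code book.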
Afterwards, the informative candidates with nonzero coefficients from the multicue inverse sparse model are coded by the obtained dictionary to preserve more of the vital structural information, which describes the spatial appearance variations appropriately during tracking. If column factorization of -th case is named as , are the bag of words (BOW) [31] which sparsely code the mapping process in equation (8). Here, the least absolute shrinkage and selection operator (LASSO) solution [32] is implemented with a greedy loop: the algorithm updates atoms until the residual falls below the initialized threshold or until enough atoms are obtained.
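The stopping rule described above (residual below a threshold, or enough atoms selected) can be illustrated with a greedy pursuit loop. The sketch below uses orthogonal matching pursuit as a stand-in for the greedy LASSO-style procedure; it is not the SPAMS solver used in the experiments. The function name `greedy_sparse_code`, the limits `max_atoms` and `tol`, and the orthonormal toy dictionary are assumptions made for this example.

```python
import numpy as np

def greedy_sparse_code(D, y, max_atoms=5, tol=1e-6):
    """Greedy atom selection (orthogonal matching pursuit).

    Adds one atom per iteration until the residual norm drops below tol
    or max_atoms atoms have been selected.
    """
    residual = y.copy()
    support = []
    coef = np.zeros(D.shape[1])
    while len(support) < max_atoms and np.linalg.norm(residual) > tol:
        k = int(np.argmax(np.abs(D.T @ residual)))   # atom most correlated with residual
        if k in support:                             # no new atom can be added
            break
        support.append(k)
        # Refit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        coef[:] = 0.0
        coef[support] = sol
        residual = y - D @ coef
    return coef

rng = np.random.default_rng(4)
# Toy dictionary with orthonormal atoms, so recovery is exact in this example.
Q, _ = np.linalg.qr(rng.standard_normal((30, 30)))
signal = 2.0 * Q[:, 3] - 1.5 * Q[:, 17]

code = greedy_sparse_code(Q, signal)
```

With a learned, overcomplete dictionary the recovered code is generally only approximate, and the two stopping conditions trade accuracy against the number of atoms used.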
As shown in Figure 4, the structural pyramid-pooling procedure samples the sparse coefficient-based feature bags (BOW) over the whole ROI with different sampling sizes of , and average pooling of the sparse coefficients in each subset is then computed at the -th level. Finally, the pooling results of all levels are concatenated linearly. This preserves the sparse statistics of spatial intensity across the scales of the pyramid structure. Meanwhile, the online training templates include both positive samples and consecutive recognition results, which keeps the target ROI consistent and allows the spatial information of temporal templates to be encoded suitably. Therefore, the optimal metric evaluation is driven by the sparse coefficient-based feature bags from the PACB dictionary, which supports online dynamic updating of the templates throughout tracking. Given the target template of the -th frame and the -th sampling candidate in the frame, the Mahalanobis distance between them can be written as equation (9) according to equations (3) and (4):
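The pyramid-pooling step described in this paragraph can be sketched as follows. The example assumes that the sparse codes of the ROI patches are arranged on a regular grid and that the pyramid levels split the grid into 1 × 1, 2 × 2, and 4 × 4 cells. The grid size, the number of atoms, the levels, and the function name `pyramid_average_pool` are hypothetical values chosen for illustration.

```python
import numpy as np

def pyramid_average_pool(codes, levels=(1, 2, 4)):
    """Average-pool sparse codes over a spatial pyramid and concatenate.

    codes has shape (rows, cols, n_atoms): one sparse code per patch position.
    At level n the grid is split into n x n cells and the codes in each cell
    are averaged; all pooled vectors are concatenated in order.
    """
    h, w, k = codes.shape
    pooled = []
    for n in levels:
        row_groups = np.array_split(np.arange(h), n)
        col_groups = np.array_split(np.arange(w), n)
        for r in row_groups:
            for c in col_groups:
                cell = codes[np.ix_(r, c)]           # shape (len(r), len(c), k)
                pooled.append(cell.reshape(-1, k).mean(axis=0))
    return np.concatenate(pooled)

rng = np.random.default_rng(5)
codes = rng.random((8, 8, 16))       # 8 x 8 patch grid, 16 dictionary atoms
feature = pyramid_average_pool(codes)
```

With these settings the feature has (1 + 4 + 16) × 16 = 336 entries, and the first 16 entries are the average code over the whole ROI.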

Here, is the symmetric positive semidefinite matrix that can be iteratively determined by the online metric learning method [33] during the tracking process. However, with more random sampling candidates in each tracking loop, the computation cost is considerably higher. Moreover, variations between redundant candidates may affect the adjustment of the metric for subsequent pairs of samples. To make metric evaluation more robust, this paper uses the sparse coefficient-based feature bags instead of the original features.
Here, and are the pyramid-pooling structure-based sparse coefficient sets for the -th candidate and template , respectively. Meanwhile, (or ) is obtained from temporal multi-instance metric learning, with positive sets and negative cases selected by automatic shift sampling around the previous target ROI in the -th frame. Furthermore, part-based low-rank self-representation decomposition is employed again to extract more informative positive and negative selections for more robust training. The constraints are therefore as follows: the difference in the -th frame between positive cases and the template is more than or equal to a small value .
The distance between consecutive elements of template should be less than a small value .
The difference in the -th frame between positive cases and negative cases should be a large margin.
is solved by temporal metric learning with the pair label and the LogDet optimization [33] under constraint parameter . Thus, it can be written as the following equation:
Similarly, and are also defined as the pyramid-pooling structure-based sparse codes for and , respectively. With the above optimal metric representation, the object tracking process can be treated as selecting the most similar candidate from the limited sampling set provided by the multicue inverse sparse appearance model. Under the Bayesian inference framework [34], the likelihood can be defined as follows:
3.3. Model Update Mechanism
Within the iterative process of updating the online template library, the proposed algorithm generates both positive and negative samples within certain radii; these serve as auxiliary samples to jointly train the model against distraction from trivial background clutter. All positive samples are located inside a radius near the positive label instance; likewise, negative samples lie in an interval at a certain distance from the positive label instances. Meanwhile, the proposed algorithm not only updates the template through the pyramid-pooling procedures (11)∼(14) but also adaptively refreshes the templates based on the reconstruction error with the threshold and an adaptive tuning parameter :
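The sampling of auxiliary positive and negative examples described above can be sketched as drawing offsets around the current target center: positives within a small radius, negatives within a ring farther away. The radii (4, 8, and 30 pixels), the sample counts, the target center, and the function name `sample_offsets` are hypothetical values for illustration; the paper's actual interval settings are not reproduced here.

```python
import numpy as np

def sample_offsets(rng, n, r_min, r_max):
    """Draw n 2-D offsets whose distance from the origin lies in [r_min, r_max)."""
    radius = rng.uniform(r_min, r_max, size=n)
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)

rng = np.random.default_rng(3)
center = np.array([120.0, 80.0])     # assumed current target center (x, y)

# Positive samples: close to the labeled target position.
positives = center + sample_offsets(rng, 20, 0.0, 4.0)
# Negative samples: in a ring that starts some distance away from the target.
negatives = center + sample_offsets(rng, 20, 8.0, 30.0)

pos_dist = np.linalg.norm(positives - center, axis=1)
neg_dist = np.linalg.norm(negatives - center, axis=1)
```

Keeping a gap between the positive radius and the inner negative radius avoids labeling nearly identical windows with opposite classes.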
4. Experiment and Analysis
The experiments are carried out on a PC with an Intel i7 2.60 GHz CPU and 8 GB storage, using a MATLAB implementation. All parameter settings in the experiments are normalized for fair comparison. The minimization optimization is solved by the SPAMS package [35] with the regularization constant . For a good tradeoff between effectiveness and time cost, two hundred particles are randomly sampled to provide enough candidates in each tracking loop. All target ROI areas are initialized manually and modelled as described in the previous sections. Local patches are normalized to 32 × 32 pixels for the affine transformation. The relation of intervals is commonly . The model update rate and error threshold in equation (16) are 0.95 and 0.065, respectively. To demonstrate the robustness of the proposed algorithm, we first give several basic experimental comparisons involving different metric evaluation methods related to the proposed algorithm. We then use the classic OTB database [1] to compare our tracker with others.
4.1. Basic Metric Evaluation
The proposed algorithm is compared with four related trackers. OPF [34]: original particle filtering without sparse coding modeling; CBIS [36]: convolutional block feature-based metric evaluation with the inverse sparse appearance model; ASLA [8]: original block feature-based metric evaluation with the sparse appearance model; OBIS [25]: original block feature-based metric evaluation with the inverse sparse appearance model. The test sequences cover several real-life tracking challenges, such as low contrast, sharp lighting changes, shape change, scale variation, background clutter, and fast pose movement.
Figure 5(a) presents the tracking results under the low-resolution condition. The human target is tracked successfully at the early stage by almost all the algorithms. However, the OPF, ASLA, and OBIS trackers fail when other pedestrians walk across the target's path. This indicates that metric comparison on raw features alone is less stable than comparison on hierarchical feature structures, such as convolutional features or our sparse coefficient-based pyramid-pooling structure. Meanwhile, even the OBIS tracker, which relies on sparse modeling, may lose the target, as shown in Figures 6(a) and 7(a): final sample selection by linear combination accumulates tracking discrepancies over successive loops, which degrades tracking accuracy. Figure 5(b) shows that varying illumination combined with blurring in the ROI causes more trouble for tracking. As shown in Figures 6(b) and 7(b), the CBIS tracker and the proposed algorithm perform better before the 42nd frame, where the biker ROI faces severe lighting; the CBIS tracker then drifts away from the correct target area. For the background clutter situation shown in Figure 5(c), it is difficult to track sharp activity in this cluttered environment. Although the ASLA and CBIS trackers capture part of the target, their effective overlapping area is still smaller than that of the proposed algorithm. For the shape variation case shown in Figure 5(d), the waving T-shirt does not provide a stable appearance to follow. Although every tracker uses the same adaptive scale parameters, the other trackers are not robust enough to describe this appearance. The scale variations shown in Figure 5(e) are more pronounced because the camera view of the singer's face moves from near to far through dark and bright stages. Figures 6(e) and 7(e) show the center error and overlapping rate comparisons with the other trackers. CBIS, OBIS, and ASLA perform steadily before the 80th frame under the dramatic lighting condition.

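Figure 5: Tracking results on the six test sequences, panels (a)–(f).
Figure 6: Center errors of the compared trackers, panels (a)–(f).
Figure 7: Overlapping rates of the compared trackers, panels (a)–(f).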
However, they all drift to different degrees when the singer's face changes direction. As the target scale declines, the proposed algorithm, with its double dynamic update schemes and optimal selection, avoids transient loss of the target. For the mottled face sequence shown in Figure 5(f), sunlight casts mottled patches on the girl's face, which also undergoes a certain degree of plane rotation. Although no tracker obtains a very high overlapping rate over the whole sequence, the proposed tracker performs well in most stages. The differences in Figures 6(f) and 7(f) illustrate the accuracy trend of all the trackers. To analyze quantitative stability, the average center errors and the average overlapping rates of the five trackers in these experiments are calculated and shown in Tables 1 and 2. They show that the proposed algorithm tracks well in the vast majority of situations.
According to the comparison results, the proposed tracker ranks in the top two among all trackers in terms of optimal metric-based particle sampling selection. The CBIS tracker achieves the second lowest average errors. This confirms the benefit of combining the inverse sparse appearance model with discriminative particle sample selection in the frequency domain through the K-means clustering-based dual-layer convolutional networks. However, this method needs an initial cluster number to construct the filter banks, which may result in unclear target localization in complex environments. It also does not consider the computation cost of redundant samples in the dynamic update procedures, which limits its tracking performance to some extent. The ASLA tracker must calculate multiple sparse decompositions for each patch in each candidate, and each independent patch group must then be evaluated through the max-pooling scheme. Thus, much computation and memory are spent coding inefficient particle samples, which hinders its real-time application. Although the OBIS and OPF trackers have much simpler structures and faster tracking speed than the proposed method, they lack efficient and robust appearance models for the test video sequences. In particular, OBIS adopts a direct linear combination of particle samples rather than optimal selection, so it easily drifts from the target because accumulated small discrepancies lead to degenerate states over consecutive frames.
4.2. OTB Dataset Comparison
We also compare the proposed algorithm with other state-of-the-art algorithms, including SRDCF [37], SAMF [38], HDT [39], DSST [40], KCF [41], SCM [7], L1APG [42], MIL [19], and CT [43], on the widely used OTB dataset. The benchmark dataset contains 50 different sequences with ground-truth annotation, categorized by 11 challenging attributes. Precision is based on the center location error, that is, the distance between the center locations of the tracked targets and the ground truths. The average center location error over all the frames of one sequence summarizes the overall performance. Here, the precision score of each tracker is reported at a threshold of 20 pixels. The overlap ratio defines the overlapping relation between the predicted target area and the ground-truth area : ; the final performance of a tracker on a sequence depends on the number of successful frames in which is more than a given threshold. Based on the success rates over the thresholds, trackers are ranked by the area under the curve (AUC) of each success plot. The comparison adopts the one-pass evaluation (OPE) over the whole sequence, with the tracker initialized by the ground-truth position in the first frame. All the precision and success plots of the comparison are shown in Figures 8(a) and 8(b). The performance of all trackers under the various environmental attributes is illustrated in Figures 9 and 10. The proposed tracker (0.820/0.705) is in the top 2 of the precision plots and in the top 3 of the success plots, which is a clear improvement over the sparse-related trackers SCM and L1APG and over MIL. It is also better than CT and some correlation filter-based trackers, such as KCF, DSST, and HDT. To compare the basic power of each tracker's structure fairly, we adopt only hand-crafted features in all the trackers.
From Figure 8(a), we can see that our method achieves at least the second-highest precision rate in all challenging attributes except deformation and occlusion. In terms of deformation, the precision rate of the proposed tracker (73.0%) is lower only than those of SRDCF (84%), KCF (80.4%), and HDT (79.0%). As for occlusion, the precision rate achieved by our method (81.5%) is close to the best scores, achieved by SRDCF (83.3%) and SAMF (82.8%); SRDCF has a more powerful discriminative kernel strategy and therefore ranks above our proposed algorithm. However, as Figure 10 shows, our method achieves the highest precision rate in several challenging attributes, such as low resolution and out of view. In terms of background clutter, motion blur, and fast motion, our method also achieves the second-best success rate. In conclusion, the proposed method is largely stable and robust against different visual tracking challenges. Table 3 reports the average evaluation of all the ranked trackers, where the proposed algorithm obtains real-time value on a large number of short-term sequences with the same visual properties as the given dataset.
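The two evaluation measures can be made concrete with a short Python sketch. It assumes boxes given as (x, y, width, height), the standard intersection-over-union definition of the overlap ratio used by the OTB benchmark, and 21 evenly spaced overlap thresholds for the success plot. The function names and the three toy boxes are hypothetical and serve only to illustrate the computation.

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between box centers; boxes are rows of (x, y, w, h)."""
    pred_c = pred[:, :2] + pred[:, 2:] / 2.0
    gt_c = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_c - gt_c, axis=1)

def overlap_ratio(pred, gt):
    """Intersection over union of predicted and ground-truth boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0.0, None) * np.clip(y2 - y1, 0.0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def precision_at(pred, gt, pixels=20.0):
    """Fraction of frames whose center error is within the pixel threshold."""
    return float(np.mean(center_error(pred, gt) <= pixels))

def success_auc(pred, gt, n_thresholds=21):
    """Mean success rate over evenly spaced overlap thresholds in [0, 1]."""
    iou = overlap_ratio(pred, gt)
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    return float(np.mean([(iou > t).mean() for t in thresholds]))

# Three toy frames: perfect match, half-shifted box, complete miss.
gt = np.array([[0, 0, 10, 10], [0, 0, 10, 10], [0, 0, 10, 10]], dtype=float)
pred = np.array([[0, 0, 10, 10], [5, 0, 10, 10], [100, 100, 10, 10]], dtype=float)

iou = overlap_ratio(pred, gt)
precision = precision_at(pred, gt)
auc = success_auc(pred, gt)
```

Averaging the success rate over the thresholds approximates the area under the success plot, which is why a tracker can rank differently on precision and on AUC.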

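Figure 8: (a) Precision plot and (b) success plot of OPE on the OTB dataset.
Figure 9: Tracker comparison on the 11 challenging attributes, panels (a)–(k).
Figure 10: Tracker comparison on the 11 challenging attributes, panels (a)–(k).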
5. Conclusion
This paper proposes a tracking algorithm built on a multicue inverse sparse appearance model with optimal metric evaluation. As discussed above, the method has several key advantages. First, with the help of the global-level multicue inverse sparse appearance model, it reduces the redundant particle samples in each tracking loop. It not only explores the potential structural relationships among the ROI patches but also preserves the unbiasedness and stochastic dynamics between consecutive frames. To depict the compact appearance model dynamically, crucial particle samples are extracted at the global level, which effectively reduces the massive one-to-one matching computation of the original particle filter. This also yields a more precise representation of the target ROI, because limiting the effective number of particle samples regularizes the particle filtering process. Second, to select the best particle sample among these crucial candidates, patch-based low-rank self-representation provides more robust and important training samples (atoms) for constructing an effective sparse code book (dictionary) at the local level. In particular, it avoids opaque clustering of dictionary atoms, which depends to some degree on manual initialization. Moreover, the structural pyramid-pooling process facilitates sparse coefficient-based optimal metric evaluation. In addition, iterative template updates and online metric training are used to update the appearance model dynamically. Extensive evaluations on the test video sequences have demonstrated the effectiveness of the proposed method. We are currently working on a new algorithm that merges multiple tensors or deep features, which is expected to reduce the computation cost and improve tracking robustness.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by Leading Talents of Shandong University of Science and Technology, 863 Project: Physical Model-Based Dynamic Evolution Technology of Complex Scene (2015AA016404), Shandong Province Higher Educational Science and Technology Program (J17KA075), and the National Natural Science Foundation of China (61801270).